
Co-clustering  Liang Zheng.


1 Co-clustering  Liang Zheng

2 Clustering
Clustering along "one dimension": grouping together of "similar" objects.
Hard clustering -- each object belongs to a single cluster.
Soft clustering -- each object is probabilistically assigned to clusters.
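A minimal sketch of the hard vs. soft distinction, using made-up one-dimensional points and two fixed cluster centers (all numbers below are illustrative):

```python
# Sketch: hard vs. soft assignment of objects to clusters (toy 1-D data).
import math

points = [0.0, 0.2, 0.9, 1.1]        # objects along "one dimension"
centroids = [0.1, 1.0]               # two fixed cluster centers

# Hard clustering: each object belongs to a single (nearest) cluster.
hard = [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
        for p in points]

# Soft clustering: each object is probabilistically assigned to clusters.
def soft_assign(p):
    weights = [math.exp(-(p - c) ** 2) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

soft = [soft_assign(p) for p in points]
print(hard)        # [0, 0, 1, 1]
print(soft[0])     # probabilities over both clusters, summing to 1
```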

3 Doc-Term Co-occurrence Matrix
Co-clustering: given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions.
Characteristics of co-occurrence matrices: data sparseness, high dimensionality, noise.

4 Related Methods
Information-Theoretic Co-Clustering (ITCC)
Graph-partitioning based Co-Clustering (GPCC)
Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering. Proceedings of the ninth ACM SIGKDD international conference on Knowledge Discovery and Data Mining. ACM, 2003.
Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge Discovery and Data Mining. ACM, 2001.
Inderjit S. Dhillon, Center for Big Data Analytics, Department of Computer Science, University of Texas at Austin.

5 Information-Theoretic Co-Clustering (ITCC)
View the (scaled) co-occurrence matrix as a joint probability distribution between row and column random variables.
Seek a hard clustering of both dimensions such that the loss in mutual information is minimized, given a fixed number of row and column clusters.
Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering. Proceedings of the ninth ACM SIGKDD international conference on Knowledge Discovery and Data Mining. ACM, 2003.
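The objective can be illustrated numerically: treat a small matrix as a joint distribution, aggregate it over row and column clusters, and measure how much mutual information is lost. The matrix and cluster assignments below are toy assumptions for illustration, not data from the slides:

```python
# Sketch: mutual-information loss under hard co-clustering (toy joint distribution).
import math

# Scaled co-occurrence matrix viewed as joint distribution p(row, col).
p = [[0.25, 0.25, 0.0, 0.0],
     [0.0, 0.0, 0.25, 0.25]]

def mutual_info(joint):
    rows = [sum(r) for r in joint]
    cols = [sum(joint[i][j] for i in range(len(joint))) for j in range(len(joint[0]))]
    mi = 0.0
    for i, r in enumerate(joint):
        for j, v in enumerate(r):
            if v > 0:
                mi += v * math.log(v / (rows[i] * cols[j]))
    return mi

# Hard co-clustering: map rows and columns to cluster indices,
# then aggregate the joint distribution over (row-cluster, col-cluster).
row_cluster = [0, 1]
col_cluster = [0, 0, 1, 1]

def clustered(joint, rc, cc, k, l):
    q = [[0.0] * l for _ in range(k)]
    for i, r in enumerate(joint):
        for j, v in enumerate(r):
            q[rc[i]][cc[j]] += v
    return q

loss = mutual_info(p) - mutual_info(clustered(p, row_cluster, col_cluster, 2, 2))
print(loss)  # >= 0; ITCC searches for the clustering minimizing this loss
```

Here the clustering matches the block structure of p exactly, so the loss is zero; a clustering that mixed the two blocks would lose information.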

6 The Core of the ITCC Algorithm
First, choose the statistics of the original data matrix that must still be preserved after reconstruction, i.e., define the invariant statistics.
Second, choose an appropriate distance measure to quantify the information loss between the original matrix and the compressed matrix produced by co-clustering, i.e., define the objective function.
ITCC uses the row and column marginal distributions of the matrix as the invariant statistics, and relative entropy (KL-divergence) as the measure of the difference between the matrices before and after clustering.
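Both choices can be sketched on toy numbers: ITCC's reconstruction q keeps the row and column marginals of p (the invariant statistics), and KL(p || q) plays the role of the objective. The matrix and cluster assignments below are illustrative assumptions:

```python
# Sketch: ITCC's approximation q preserves the row/column marginals of p,
# and KL(p || q) measures the information loss (toy numbers, not from the slides).
import math

p = [[0.3, 0.1, 0.0],
     [0.0, 0.1, 0.5]]
rc = [0, 1]          # row-cluster assignment
cc = [0, 0, 1]       # column-cluster assignment

rows = [sum(r) for r in p]
cols = [sum(p[i][j] for i in range(2)) for j in range(3)]
prc = [sum(rows[i] for i in range(2) if rc[i] == c) for c in range(2)]
pcc = [sum(cols[j] for j in range(3) if cc[j] == c) for c in range(2)]
pjoint = [[sum(p[i][j] for i in range(2) for j in range(3)
               if rc[i] == a and cc[j] == b) for b in range(2)] for a in range(2)]

# ITCC reconstruction: q(x, y) = p(xhat, yhat) * p(x)/p(xhat) * p(y)/p(yhat)
q = [[pjoint[rc[i]][cc[j]] * (rows[i] / prc[rc[i]]) * (cols[j] / pcc[cc[j]])
      for j in range(3)] for i in range(2)]

kl = sum(p[i][j] * math.log(p[i][j] / q[i][j])
         for i in range(2) for j in range(3) if p[i][j] > 0)

# Invariant statistics: q has the same row marginals as p (columns likewise).
assert all(abs(sum(q[i]) - rows[i]) < 1e-12 for i in range(2))
print(kl)
```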

7

8 Analysis of the ITCC Algorithm
The problem is NP-hard; the final solution ITCC converges to is a local optimum.
The time complexity is O(t · n · (k + l)), where n is the number of non-zero entries in the matrix, k is the number of row clusters, l is the number of column clusters, and t is the number of iterations.

9 Graph-partitioning based Co-Clustering

10 Given disjoint document clusters D1, …, Dk
Given disjoint document clusters D1, …, Dk, the corresponding word clusters W1, …, Wk may be determined as follows. A given word wi belongs to the word cluster Wm if its association with the document cluster Dm is greater than its association with any other document cluster. A natural measure of the association of a word with a document cluster is the sum of the edge weights to all documents in the cluster. Clearly, the "best" word and document clustering corresponds to a partitioning of the graph such that the crossing edges between partitions have minimum weight.
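The assignment rule above can be sketched directly; the bipartite adjacency matrix and document clusters below are hypothetical:

```python
# Sketch: assign each word to the document cluster with which it has the
# largest total edge weight (hypothetical word-document adjacency matrix A).
A = [  # A[i][j] = weight of the edge between word w_i and document d_j
    [3, 1, 0, 0],
    [0, 0, 2, 4],
    [1, 0, 3, 0],
]
doc_clusters = [[0, 1], [2, 3]]   # disjoint document clusters D1, D2

def word_cluster(i):
    # Association of word w_i with cluster D_m = sum of edge weights into D_m.
    scores = [sum(A[i][j] for j in D) for D in doc_clusters]
    return scores.index(max(scores))

assignment = [word_cluster(i) for i in range(len(A))]
print(assignment)  # [0, 1, 1]: word 0 goes to D1, words 1 and 2 to D2
```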

11 The GPCC Algorithm
Adjacency matrix A → Laplacian matrix L = D − A, where D is the degree matrix.
This problem can be tackled with an SVD-related algorithm.
Time complexity: O(|E| + h log h), where h = m + n.
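A rough sketch of the SVD step on an assumed toy word-document matrix: normalize by row and column degrees, take the second singular vectors, and split rows and columns by sign (real implementations typically run k-means on the resulting embedding instead of a plain sign split):

```python
# Sketch of the bipartite spectral step (toy, nearly block-diagonal matrix).
import numpy as np

A = np.array([[2.0, 2.0, 0.0, 0.0],
              [2.0, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 2.0],
              [0.0, 0.0, 2.0, 2.0]])

d1 = A.sum(axis=1)                   # word degrees
d2 = A.sum(axis=0)                   # document degrees
An = A / np.sqrt(np.outer(d1, d2))   # D1^{-1/2} A D2^{-1/2}

U, s, Vt = np.linalg.svd(An)
# The second left/right singular vectors give the partitioning embedding.
row_side = U[:, 1] >= 0
col_side = Vt[1, :] >= 0
print(row_side, col_side)  # rows/cols split into two co-clusters
```

The split recovers the two weakly connected blocks: words 0-1 with documents 0-1, words 2-3 with documents 2-3 (which side is labeled True is arbitrary, since singular vectors are defined up to sign).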

12 Our Task
[Diagram: a Link set (l1 … lm), an Entity set (e1 … eq), and a Type set (t1 … tn), connected by a link-entity matrix LE and a type-entity matrix TE; from LE and TE a link-type matrix LT is derived.]

13 The problem can then be described as Link-Entity-Type co-clustering:
Given a Link set L, an Entity set E, a Type set T, an integer k, and a relevance function Relevance, co-clustering yields a set C with Cx = (Lx, Tx) (1 ≤ x ≤ k), such that Σ li∈Lx, tj∈Ty Relevance(li, tj) is minimized for x, y = 1, …, k and x ≠ y.
Definitions:
li: {tx}, the types that li is associated with; tj: {ly}, the links that tj is associated with.
Relevance(li, tj) = α · Σ x=1,…,|li| Relevance(tx, tj) + (1 − α) · Σ y=1,…,|tj| Relevance(li, ly)
Relevance(tx, tj) = 2 · depth(LCS) / (depth(tx) + depth(tj))
Relevance(li, ly) = 2 · depth(LCS) / (depth(li) + depth(ly))
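The depth-of-LCS measure can be sketched over a hypothetical type hierarchy; the hierarchy, the type names, and the depth convention (root has depth 1) below are all assumptions for illustration:

```python
# Sketch of the type-type relevance, assuming a hypothetical type hierarchy
# given as child -> parent links; depth counts nodes from a type up to the root.
parent = {"Person": "Thing", "Artist": "Person", "Actor": "Artist",
          "Musician": "Artist", "Place": "Thing"}

def path_to_root(t):
    path = [t]
    while t in parent:
        t = parent[t]
        path.append(t)
    return path          # e.g. Actor -> Artist -> Person -> Thing

def depth(t):
    return len(path_to_root(t))

def lcs_depth(a, b):
    # Walk up from b until we hit an ancestor of a: the least common subsumer.
    ancestors = set(path_to_root(a))
    for t in path_to_root(b):
        if t in ancestors:
            return depth(t)
    return 0

def relevance(a, b):
    # Relevance(tx, tj) = 2 * depth(LCS) / (depth(tx) + depth(tj))
    return 2 * lcs_depth(a, b) / (depth(a) + depth(b))

print(relevance("Actor", "Musician"))  # 0.75: they share the Artist ancestor
```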

14 Q & A
