C0-clustering Liang Zheng
Clustering Clustering along “One-Dimension” Grouping together of “similar” objects Hard Clustering -- Each object belongs to a single cluster Soft Clustering -- Each object is probabilistically assigned to clusters
Doc-Term Co-occurrence Matrix C0-clustering Given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions. Co-occurrence Matrices Characteristics Data sparseness High dimension Noise Doc-Term Co-occurrence Matrix
Related Methods Information-Theoretic Co-Clustering (ITCC) Graph-partitioning based Co-Clustering (GPCC) Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C]//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98. Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning[C]//Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001: 269-274. Inderjit S. Dhillon Center for Big Data Analytics Department of Computer Science University of Texas at Austin
Information-Theoretic Co-Clustering (ITCC) View (scaled) co-occurrence matrix as a joint probability distribution between row & column random variables We seek a hard-clustering of both dimensions such that loss in “Mutual Information” is minimized given a fixed no. of row & col. clusters Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C]//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98.
ITCC算法的核心 第一,选择矩阵重构后仍必须保证的原始数据矩阵中的统计量,即定义不变统计量 第二,选择恰当的距离度量方法,用来度量原始矩阵与协同聚类后压缩矩阵之间信息损失,即定义目标函数。 ITCC选择矩阵行列的边缘分布作为不变统计量, 相关熵(KL-Divergence)作为聚类前后矩阵差异性的度量准则
ITCC算法相关分析 NP难问题, ITCC算法得到的最终解是局部最优解 时间复杂度是O ( t* n*(k+l)),其中n 表示矩阵非零值的个数,k 表示行簇个数,l 表示列簇的个数,t 表示迭代的次数。
Graph-partitioning based Co-Clustering
Given disjoint document clusters D1, Given disjoint document clusters D1, . . . ,Dk, the corresponding word clusters W1, . . . ,Wk may be determined as follows. A given word wi belongs to the word cluster Wm if its association with the document cluster Dm is greater than its association with any other document cluster. A natural measure of the association of a word with a document cluster is the sum of the edge-weights to all documents in the cluster. Clearly the “best” word and document clustering would correspond to a partitioning of the graph such that the crossing edges between partitions have minimum weight.
GPCC算法 Adjacency Matrix Laplacian matrix (L=D-A) D: degree matrix This problem can be tackled with an SVD-related algorithm. O(|E| + h logh) h=m+n
Our Task Link set Entity set Entity set Type set . . . . LE TE types lm eq eq tn LE TE types LET = LE TE links LT =
问题就可以描述为Link—Entity—Type协同聚类的问题: 给定Link set L、Entity set E、Type set T、整数k 和关联强度函数 Relevance, 由协同聚类得到一个集合C 且Cx = (Lx, Tx) (1≤x≤k) ,使得 li Lx ,tjTy Relevance (li , tj)最小. (x, y= 1,...,k且x y) Definition li: tx ; li关联到types tj: ly; tj关联到links Relevance(li , tj)= x=1,…,|li| Relevance(tx , tj) + ( 1- ) y=1,…,|tj| Relevance(li , ly) Relevance(tx , tj) =2 depth(LCS)/(depth(tx)+depth(tj)) Relevance(li , ly) =2 depth(LCS)/(depth(li)+depth(ly))
Q & A