Co-clustering Liang Zheng
Clustering
Clustering along one dimension: grouping together "similar" objects.
Hard clustering -- each object belongs to a single cluster.
Soft clustering -- each object is probabilistically assigned to clusters.
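The hard/soft distinction can be illustrated with a small sketch. This is a toy illustration, not any particular library's clustering API: hard assignment picks the single nearest centroid, while the soft version spreads probability mass over centroids via a softmax of negative squared distances (a stand-in for, e.g., Gaussian-mixture posteriors).

```python
import numpy as np

def hard_assign(points, centroids):
    """Hard clustering: each point goes to its single nearest centroid."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def soft_assign(points, centroids, temperature=1.0):
    """Soft clustering: a probability over centroids via a softmax of
    negative squared distances (toy stand-in for mixture posteriors)."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -d / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

points = np.array([[0.0, 0.0], [0.9, 1.1], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(hard_assign(points, centroids))              # [0 0 1]
print(soft_assign(points, centroids).sum(axis=1))  # each row sums to 1
```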
Doc-Term Co-occurrence Matrix
Co-clustering: given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions.
Characteristics of co-occurrence matrices: data sparseness, high dimensionality, noise.
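A minimal sketch of building such a doc-term matrix from toy documents (the documents and vocabulary here are invented for illustration); note how most entries are already zero even at this tiny scale, previewing the sparseness problem.

```python
from collections import Counter

docs = [
    "apple banana apple",
    "banana cherry",
    "cherry apple cherry",
]
# Vocabulary = all distinct terms, in sorted order for stable columns.
vocab = sorted({w for d in docs for w in d.split()})
# Rows = documents, columns = terms; entries = raw term counts.
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)   # ['apple', 'banana', 'cherry']
print(matrix)  # [[2, 1, 0], [0, 1, 1], [1, 0, 2]]
```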
Related Methods
Information-Theoretic Co-Clustering (ITCC)
Graph-partitioning based Co-Clustering (GPCC)
Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.
Inderjit S. Dhillon, Center for Big Data Analytics, Department of Computer Science, University of Texas at Austin
Information-Theoretic Co-Clustering (ITCC)
View the (scaled) co-occurrence matrix as a joint probability distribution between row and column random variables. We seek a hard clustering of both dimensions such that the loss in mutual information is minimized, given a fixed number of row and column clusters. (Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.)
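The quantity being preserved can be computed directly. A minimal sketch (toy counts, standard definition of mutual information in bits) of scaling a co-occurrence matrix to a joint distribution and measuring I(X;Y):

```python
import math

def mutual_information(counts):
    """Mutual information I(X;Y) in bits of the joint distribution
    obtained by scaling a co-occurrence matrix to sum to 1."""
    total = sum(sum(row) for row in counts)
    p = [[c / total for c in row] for row in counts]
    px = [sum(row) for row in p]                  # row marginals
    py = [sum(col) for col in zip(*p)]            # column marginals
    mi = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij > 0:
                mi += pij * math.log2(pij / (px[i] * py[j]))
    return mi

print(mutual_information([[5, 0], [0, 5]]))  # 1.0 (perfect association)
print(mutual_information([[1, 1], [1, 1]]))  # 0.0 (independence)
```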
The Core of ITCC
First, choose the statistics of the original data matrix that must still hold after the matrix is reconstructed, i.e., define the invariant statistics.
Second, choose an appropriate distance measure to quantify the information loss between the original matrix and the compressed matrix produced by co-clustering, i.e., define the objective function.
ITCC takes the row and column marginal distributions of the matrix as the invariant statistics, and relative entropy (KL-divergence) as the measure of the difference between the matrices before and after clustering.
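The objective's building block, relative entropy, is a one-liner; this sketch uses the standard definition D(p || q) in bits (toy distributions, not ITCC's full objective over the reconstructed joint):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits.
    Assumes q[i] > 0 wherever p[i] > 0; zero p-terms contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # > 0: information lost approximating p by q
print(kl_divergence(p, p))  # 0.0: no loss when the distributions match
```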
Analysis of ITCC
The underlying problem is NP-hard; the final solution found by the ITCC algorithm is a local optimum.
The time complexity is O(t * n * (k + l)), where n is the number of nonzero entries in the matrix, k the number of row clusters, l the number of column clusters, and t the number of iterations.
Graph-partitioning based Co-Clustering
Given disjoint document clusters D1, ..., Dk, the corresponding word clusters W1, ..., Wk may be determined as follows. A given word wi belongs to the word cluster Wm if its association with the document cluster Dm is greater than its association with any other document cluster. A natural measure of the association of a word with a document cluster is the sum of the edge weights to all documents in the cluster. Clearly, the "best" word and document clustering would correspond to a partitioning of the graph such that the crossing edges between partitions have minimum weight.
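The assignment rule above can be sketched directly (the words, documents, and edge weights here are invented toy data): each word goes to the document cluster with the largest summed edge weight.

```python
# weights[w][d] = edge weight between word w and document d (e.g. a count).
weights = {
    "ball": {"d1": 3, "d2": 0, "d3": 1},
    "vote": {"d1": 0, "d2": 2, "d3": 2},
}
doc_clusters = {"D1": ["d1"], "D2": ["d2", "d3"]}

def assign_word(word):
    """Place the word in the cluster with the largest summed edge weight."""
    return max(doc_clusters,
               key=lambda c: sum(weights[word][d] for d in doc_clusters[c]))

print({w: assign_word(w) for w in weights})  # {'ball': 'D1', 'vote': 'D2'}
```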
The GPCC Algorithm
Adjacency matrix A; Laplacian matrix L = D - A, where D is the degree matrix.
This problem can be tackled with an SVD-related algorithm with running time O(|E| + h log h), where h = m + n.
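A minimal spectral sketch of the idea, under simplifying assumptions: it builds the bipartite adjacency matrix from a toy word-document weight matrix, forms the unnormalized Laplacian L = D - A (the published algorithm of Dhillon 2001 actually works with a normalized matrix via the SVD), and splits vertices by the sign of the Fiedler vector.

```python
import numpy as np

# Toy bipartite document-word graph: 2 documents (rows) x 3 words (columns).
# W[i, j] = edge weight between document i and word j (assumed data).
W = np.array([[2.0, 0.1, 0.0],
              [0.1, 1.0, 2.0]])
m, n = W.shape

# Symmetric adjacency matrix of the bipartite graph:
# vertices 0..m-1 are documents, vertices m..m+n-1 are words.
A = np.zeros((m + n, m + n))
A[:m, m:] = W
A[m:, :m] = W.T
D = np.diag(A.sum(axis=1))
L = D - A  # graph Laplacian

# Fiedler vector: eigenvector of the second-smallest eigenvalue of L.
_vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]
partition = (fiedler > 0).astype(int)  # sign split approximates the min cut
print(partition)  # groups {doc0, word0} against {doc1, word1, word2}
```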
Our Task
[Figure: a Link set (l1, ..., lm), an Entity set (e1, ..., eq), and a Type set (t1, ..., tn), connected by a link-entity matrix LE and an entity-type matrix TE; the link-type matrix LT is derived by combining LE and TE.]
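The diagram suggests the link-type matrix is derived from the two co-occurrence matrices. A hedged sketch, assuming LE is links x entities, ET is entities x types, and LT = LE @ ET (the matrix names and shapes here are assumptions read off the figure residue):

```python
import numpy as np

# Hypothetical toy matrices:
# LE[i, j] = 1 if link i mentions entity j;
# ET[j, k] = 1 if entity j has type k.
LE = np.array([[1, 1, 0],
               [0, 1, 1]])
ET = np.array([[1, 0],
               [0, 1],
               [0, 1]])
LT = LE @ ET  # LT[i, k] = number of entities connecting link i to type k
print(LT)     # [[1 1]
              #  [0 2]]
```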
The problem can then be formulated as a Link-Entity-Type co-clustering problem:
Given a link set L, an entity set E, a type set T, an integer k, and an association-strength function Relevance, co-clustering yields a set C with Cx = (Lx, Tx) (1 <= x <= k), such that the sum of Relevance(li, tj) over li in Lx and tj in Ty is minimized (x, y = 1, ..., k and x != y).
Definitions: each link li is associated with types tx; each type tj is associated with links ly.
Relevance(li, tj) = a * sum over x = 1, ..., |li| of Relevance(tx, tj) + (1 - a) * sum over y = 1, ..., |tj| of Relevance(li, ly), where a is a weighting parameter (the original symbol was lost in extraction).
Relevance(tx, tj) = 2 * depth(LCS) / (depth(tx) + depth(tj))
Relevance(li, ly) = 2 * depth(LCS) / (depth(li) + depth(ly))
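The depth-based similarity above is Wu-Palmer-style. A minimal sketch on a toy type hierarchy (the hierarchy, and the convention that the root has depth 1, are assumptions for illustration):

```python
# Toy type hierarchy as parent pointers; depth(root) = 1 by assumption.
parent = {"thing": None, "person": "thing", "place": "thing",
          "athlete": "person", "politician": "person"}

def ancestors(t):
    """Chain from t up to the root, inclusive."""
    chain = []
    while t is not None:
        chain.append(t)
        t = parent[t]
    return chain

def depth(t):
    return len(ancestors(t))

def lcs_depth(a, b):
    """Depth of the least common subsumer (LCS) of types a and b."""
    up_a = set(ancestors(a))
    return next(depth(t) for t in ancestors(b) if t in up_a)

def relevance(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), as in the formulas above."""
    return 2 * lcs_depth(a, b) / (depth(a) + depth(b))

print(relevance("athlete", "politician"))  # LCS = person: 2*2/(3+3)
print(relevance("athlete", "place"))       # LCS = thing:  2*1/(3+2)
```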
Q & A