1 Intro. to Data Mining Chapter 3.2 clustering

2 What is Cluster Analysis?
A cluster is a collection of data objects that are similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in other groups.
Cluster analysis (also called clustering or data segmentation) finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters.
Unsupervised learning: no predefined classes (learning by observation, as opposed to supervised learning by examples).
Typical applications:
- As a stand-alone tool to get insight into the data distribution
- As a preprocessing step for other algorithms

3 Clustering analysis What is a natural grouping among characters?
Segmenting characters into groups is subjective: the same characters could be grouped as villains vs. heroes, or as males vs. females.

4 Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with high intra-class similarity (cohesive within clusters) and low inter-class similarity (distinctive between clusters). The quality of a clustering method depends on the similarity measure used by the method and on its ability to discover some or all of the hidden patterns.

5 Quality: What Is Good Clustering?
Exercise: the red samples form one cluster and the black samples form another; compute the within-cluster variation and the between-cluster variation. (Table: the sample data, with columns ID, Attribute 1, and Attribute 2. Within-cluster variation: ___ Between-cluster variation: ___)

6 Applications: the roles of cluster analysis in the data mining process are data reduction, hypothesis generation, hypothesis verification, and membership prediction.
Data reduction: partition a large body of data into several clusters and select representative records from each cluster for subsequent processing.
Hypothesis generation: infer characteristics or phenomena that may exist in the data of interest, e.g., "young people usually have lower annual incomes" or "middle-aged people usually have higher annual incomes".
Hypothesis verification: validate an inferred hypothesis. For example, to verify the hypothesis "young people usually have lower incomes; does that also imply lower spending power?", one can cluster the records described by "age", "annual income", and "spending amount".
Membership prediction: apply the clustering result to records of unknown class to predict which cluster each record belongs to.

7 Applications: application domains of cluster analysis.
Transaction behavior analysis: understand the behavior patterns of different types of users.
Spatial data analysis: help users automatically analyze the images produced by image databases and understand the characteristics and phenomena of interest.
Document management: organize documents into categories to support the management and use of document data.

8 Workflow: the five sequential tasks of cluster analysis.
Data representation: find representative data dimensions to describe the data points.
Similarity computation and measurement: compute the degree of similarity between data points.
Choice of clustering method: select an appropriate clustering algorithm.
Evaluation of the clustering result: assess the outcome of the cluster analysis.
Cluster interpretation: domain experts give a further interpretation of the clustering result.

9 Measure the Quality of Clustering
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j). The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables. Weights should be associated with different variables based on the application and the data semantics.

10 Major Clustering Approaches (I)
Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS.
Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON.
Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue.
Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE.

11 Major Clustering Approaches (II)
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data. Typical methods: EM, SOM, COBWEB.
Frequent pattern-based: based on the analysis of frequent patterns. Typical methods: p-Cluster.
Link-based clustering: objects are often linked together in various ways, and these massive links can be used to cluster objects. Typical methods: PageRank, spectral clustering.

12 Hierarchical clustering - distance

13 Hierarchical clustering - correlation

14 Hierarchical clustering - category

15 Agglomerative and Divisive
Agglomerative hierarchical clustering (AGNES): bottom-up strategy; each cluster starts with only one object, and clusters are merged into larger and larger clusters.
Divisive hierarchical clustering (DIANA): top-down strategy; start with all objects in one cluster, and subdivide clusters into smaller and smaller clusters.
(Figure: steps 0-4 merging objects a, b, c, d, e under AGNES, and the same steps in reverse under DIANA.)

16 Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.

17 Strengths of Hierarchical Clustering
No assumptions on the number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
Hierarchical clusterings may correspond to meaningful taxonomies, e.g., in the biological sciences (phylogeny reconstruction) or on the web (product catalogs).

18 Input/ Initial setting
Start with clusters of individual points and a distance/proximity matrix. (Figure: points p1-p5 and the corresponding distance/proximity matrix.)

19 Intermediate State After some merging steps, we have some clusters
(Figure: the current clusters C1-C5 and the updated distance/proximity matrix.)

20 Intermediate State: Merge the two closest clusters (C2 and C5) and update the distance matrix. (Figure: clusters C1-C5 and the distance/proximity matrix before the merge.)

21 After Merging: How do we update the distance matrix? (Figure: the matrix now has a combined row/column for C2 ∪ C5, whose distances to C1, C3, and C4 still have to be filled in.)

22 Distance Between Clusters
First measure: minimum distance, where |p - p'| is the distance between two objects p and p'. Use cases:
- An algorithm that uses the minimum distance to measure the distance between clusters is sometimes called a nearest-neighbor clustering algorithm.
- If the clustering process terminates when the minimum distance between nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm.
- An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.

23 Distance Between Clusters
Second measure: maximum distance, where |p - p'| is the distance between two objects p and p'. Use cases:
- An algorithm that uses the maximum distance to measure the distance between clusters is sometimes called a farthest-neighbor clustering algorithm.
- If the clustering process terminates when the maximum distance between nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.

24 Distance Between Clusters
Minimum and maximum distances are extreme measures, which makes them overly sensitive to outliers and noisy data.
Third measure: mean distance, where mi and mj are the means of clusters Ci and Cj, respectively.
Fourth measure: average distance, where |p - p'| is the distance between two objects p and p', and ni and nj are the numbers of objects in clusters Ci and Cj, respectively (the four measures are written out below).
The mean is difficult to compute for categorical data.
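For reference, the four inter-cluster distance measures named on these slides are conventionally written as follows (standard definitions; the slides' own formula images are not part of the transcript):

```latex
d_{\min}(C_i, C_j)          = \min_{p \in C_i,\, p' \in C_j} |p - p'|
d_{\max}(C_i, C_j)          = \max_{p \in C_i,\, p' \in C_j} |p - p'|
d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|
d_{\mathrm{avg}}(C_i, C_j)  = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|
```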

25 Distance between two clusters
Ward's distance (the minimum variance method) between clusters Ci and Cj is the difference between the total within-cluster sum of squares of the two clusters taken separately and the within-cluster sum of squares that results from merging them into a single cluster Cij, where ri is the centroid of Ci, rj is the centroid of Cj, and rij is the centroid of Cij.
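A compact way to state Ward's criterion under the standard definition (the slide's own formula is not reproduced in the transcript); the closed form on the right follows from expanding the sums of squares:

```latex
d_{\mathrm{Ward}}(C_i, C_j)
  = \sum_{p \in C_{ij}} \lVert p - r_{ij} \rVert^2
    - \Bigl( \sum_{p \in C_i} \lVert p - r_i \rVert^2
           + \sum_{p \in C_j} \lVert p - r_j \rVert^2 \Bigr)
  = \frac{n_i n_j}{n_i + n_j}\, \lVert r_i - r_j \rVert^2
```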

26 AGNES: Use the AGNES algorithm to cluster the data set below, given as the pairwise distance matrix of the sample points A-E:
      A    B    C    D    E
A     0   0.4  2.0  2.5  3.0
B    0.4   0   1.6  2.1  1.9
C    2.0  1.6   0   0.6  0.8
D    2.5  2.1  0.6   0   1.0
E    3.0  1.9  0.8  1.0   0
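A minimal SciPy sketch of single-linkage AGNES on this distance matrix; the matrix values are taken from the table above, which is reconstructed from the flattened numbers in the transcript, and the merge heights it reports (0.4, 0.6, 0.8, 1.6) correspond to the steps worked out on the next slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Rows/columns correspond to the points A, B, C, D, E
D = np.array([[0.0, 0.4, 2.0, 2.5, 3.0],
              [0.4, 0.0, 1.6, 2.1, 1.9],
              [2.0, 1.6, 0.0, 0.6, 0.8],
              [2.5, 2.1, 0.6, 0.0, 1.0],
              [3.0, 1.9, 0.8, 1.0, 0.0]])

# Single-linkage AGNES on the precomputed distances: expected merges are
# {A,B} at 0.4, {C,D} at 0.6, {C,D}+{E} at 0.8, and everything at 1.6.
Z = linkage(squareform(D), method="single")
print(Z)   # each row: (cluster_i, cluster_j, merge_distance, new_cluster_size)
```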

27 AGNES: The closest pair of points is A and B (distance 0.4), so they are merged first. (Figure: dendrogram levels l = 0 through l = 4 over the points A-E; the distance matrix is as on the previous slide.)

28 AGNES: After merging A and B into cluster AB, the single-linkage distances are d(AB,C) = 1.6, d(AB,D) = 2.1, d(AB,E) = 1.9, d(C,D) = 0.6, d(C,E) = 0.8, d(D,E) = 1.0; the closest pair is now C and D.

29 AGNES: After merging C and D into cluster CD, the remaining distances are d(AB,CD) = 1.6, d(AB,E) = 1.9, d(CD,E) = 0.8; the closest pair is now CD and E.

30 AGNES: After merging CD and E into cluster CDE, the only remaining distance is d(AB,CDE) = 1.6, so the final merge joins AB and CDE.

31 DIANA: Use the DIANA algorithm to cluster the same data set (sample points A-E with the distance matrix shown above).

32 DIANA: Select, from the largest cluster, the point with the greatest average dissimilarity to the other points. The splinter group becomes {A} and the old party group is {B, C, D, E}.

33 DIANA: With splinter group {A} and old party group {B, C, D, E}: for each point in the old party, compute D1, its average distance to the points in the splinter group, and D2, its average distance to the other points in the old party, and then compute D2 - D1. The point with the largest positive D2 - D1 (here B) is moved to the splinter group.

34 DIANA: The splinter group is now {A, B} and the old party group is {C, D, E}. Again compute, for each point in the old party, D1 (the average distance to the splinter group) and D2 (the average distance to the rest of the old party), and evaluate D2 - D1. Since D2 - D1 is negative for every remaining point, no further old-party points are assigned to the splinter group, and this split is finished.

35 DIANA: The first split therefore divides ABCDE into AB and CDE. For the next split, again select the point with the greatest average dissimilarity within the chosen cluster.

36 DIANA: Select the cluster with the largest diameter, CDE, and split it into CD and E.

37 DIANA: Select the cluster with the largest diameter, CD, and split it into C and D, completing the hierarchy ABCDE → {AB, CDE} → {AB, CD, E} → {AB, C, D, E}.

38 DIANA vs. AGNES: Compare the results of the DIANA and AGNES algorithms on this data set; both produce the same hierarchy over A, B, C, D, E (ABCDE splits into AB and CDE, and CDE into CD and E).

39 Hierarchical Agglomerative Clustering
Example: given 5 data objects A, B, C, D, E and their distance matrix (figure).

40 Hierarchical Agglomerative Clustering
Update the distance matrix using the single-linkage function. (Figure: the objects A-E and the updated matrix.)

41 Hierarchical Agglomerative Clustering
Update the distance matrix using the single-linkage function. (Figure: the objects A-E and the updated matrix.)

42 Hierarchical Agglomerative Clustering
How do we decide the number of clusters? Cut the tree: the dendrogram over A-E has merge heights 0.71, 1, 2.06, and 3.2, so cutting just below 3.2 gives K = 2 clusters, just below 2.06 gives K = 3, and just below 1 gives K = 4 (see the sketch below).
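A small sketch of cutting the tree with SciPy; the points here are made up for illustration (they are not the lecture's data), and fcluster performs the cut either at a given distance or for a requested number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points standing in for the five objects A-E
points = np.array([[0.0, 0.0], [0.5, 0.4], [2.0, 2.0], [2.4, 2.6], [3.6, 3.5]])
Z = linkage(points, method="single")

labels_k3 = fcluster(Z, t=3,   criterion="maxclust")  # ask directly for K = 3 clusters
labels_h  = fcluster(Z, t=1.0, criterion="distance")  # cut the dendrogram at height 1.0
print(labels_k3, labels_h)
```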

43 Strength of MIN in hierarchical clustering

44 Limitation of MIN in hierarchical clustering

45 Strength of MAX in hierarchical clustering

46 Limitation of MAX in hierarchical clustering

47 Ward's method: Ward's method is also known as the minimum variance method. It begins by treating each individual as its own cluster and then merges clusters step by step; the order of merging is determined entirely by the total within-cluster variance after the merge. The pair of clusters whose merge yields the smallest increase in total within-cluster variance is merged first, so the earlier two items are merged, the more similar they are.

48 Ward's method: I. Initially, each individual forms its own cluster. II. At each stage, compute for every cluster the within-cluster error sum of squares, i.e., the sum of squared Euclidean distances from each member to the cluster centroid; for the ith cluster this is its within-cluster variance ESSi.

49 Ward's method: The total within-cluster variance is the sum of ESSi over all clusters. III. Merge the two clusters whose merge increases the total ESS the least.

50 Ward's method

51 Ward's method

52 Ward's method

53 Ward's method

54 Ward's method: Step 4

55 Ward's method: worked example of the ESS computation (intermediate values such as 6.5, 14, and 73.4)

56 Hierarchical clustering example

57 BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
Based on a user-specified cluster coverage, such as the cluster radius, BIRCH first assigns the data points incrementally to many small, highly similar sub-clusters.
It then uses these sub-clusters as the basic units and repeatedly merges them into larger clusters, in a manner similar to agglomerative hierarchical clustering.
Each sub-cluster is represented by a clustering feature (CF) rather than by its raw data points, which makes BIRCH very memory-efficient.
To speed up the assignment of data points to their sub-clusters, BIRCH dynamically builds a structure similar to a B+ tree (the CF-tree).
Weakness: handles only numeric data, and is sensitive to the order of the data records.

58 Clustering Feature Vector in BIRCH
Clustering feature: CF = (N, LS, SS), where N is the number of data points, LS is the linear sum of the N points, and SS is the per-dimension square sum of the N points. Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).

59 Clustering Feature Vector in BIRCH
Clustering feature: CF = (N, LS, SS). CF vectors are additive, so two sub-clusters can be merged without revisiting their points. For cluster 1 with points (2,5), (3,2), (4,3): CF1 = (3, (2+3+4, 5+2+3), (4+9+16, 25+4+9)) = (3, (9,10), (29,38)). For cluster 2: CF2 = (3, (35,36), (417,440)). Merging them gives CF3 = CF1 + CF2 = (3+3, (9+35, 10+36), (29+417, 38+440)) = (6, (44,46), (446,478)).
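A short Python check of the CF arithmetic above, using the slide's own points for cluster 1 and the given CF2; the additivity of CF vectors is what lets BIRCH merge sub-clusters without touching the raw data.

```python
# Cluster 1 points from the slide
pts1 = [(2, 5), (3, 2), (4, 3)]
N1  = len(pts1)
LS1 = (sum(x for x, _ in pts1), sum(y for _, y in pts1))          # linear sum -> (9, 10)
SS1 = (sum(x * x for x, _ in pts1), sum(y * y for _, y in pts1))  # square sum -> (29, 38)
print("CF1 =", (N1, LS1, SS1))

# CF vectors are additive: CF3 = CF1 + CF2
CF2 = (3, (35, 36), (417, 440))
CF3 = (N1 + CF2[0],
       (LS1[0] + CF2[1][0], LS1[1] + CF2[1][1]),
       (SS1[0] + CF2[2][0], SS1[1] + CF2[2][1]))
print("CF3 =", CF3)   # (6, (44, 46), (446, 478))
```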

60 The CF-tree

61 BIRCH: The Idea by example
Insert the data objects one by one to build the tree. Object 1 starts cluster 1 in a leaf entry; if cluster 1 becomes too large by adding object 2, then split the cluster. (Figure: data objects 1-6 and the growing CF-tree.)

62 BIRCH: The Idea by example
Object 2 therefore starts cluster 2, giving a leaf node with two entries (entry 1 for cluster 1 and entry 2 for cluster 2).

63 BIRCH: The Idea by example
Entry 1 is the closest to object 3; if cluster 1 becomes too large by adding object 3, then split the cluster.

64 BIRCH: The Idea by example
Object 3 therefore starts cluster 3, giving a leaf node with three entries.

65 BIRCH: The Idea by example
The closest entry to object 4 is the one for cluster 2; cluster 2 remains compact when object 4 is added, so object 4 is added to cluster 2.

66 BIRCH: The Idea by example
The closest entry to object 5 is the one for cluster 3, but cluster 3 would become too large by adding object 5. Should cluster 3 be split? There is also a limit on the number of entries a node can have, so the node itself is split.

67 BIRCH: The Idea by example
After the node split, a non-leaf node with entry 1 and entry 2 points to two leaf nodes; their entries (1.1, 1.2, 2.1, 2.2) correspond to cluster 1, cluster 3, cluster 4 (which now holds object 5), and cluster 2.

68 BIRCH: The Idea by example
Entry 1.2 is the closest to object 6; cluster 3 remains compact when object 6 is added, so object 6 is added to cluster 3.

69 BIRCH: The Idea by example
With branching factor B = 3 and leaf capacity L = 3: after a new entry is added, one node's number of children exceeds the balancing factor B, so the node must be split; the two entries that are farthest apart are chosen for the split. (Figure: root node over leaf nodes LN1-LN3 holding sub-clusters sc1-sc8.)

70 BIRCH: The Idea by example
Choose the two farthest entries as seeds for the split, redistribute the remaining entries to the nearest seed, and update the CF information of every non-leaf node. (Figure: the tree after LN1 is split into LN1 and LN1'.)

71 Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances from each object to its cluster representative ci (the centroid or medoid of cluster Ci) is minimized, i.e., minimize E = Σi Σp∈Ci d(p, ci)². Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when the assignment no longer changes

72 The K-Means Clustering Method
K-Means (MacQueen '67, Lloyd '57/'82): each cluster is represented by the center of the cluster. Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows (a sketch in code follows below):
1. Select K points as initial centroids
2. Repeat: form K clusters by assigning each point to its closest centroid, then re-compute the centroid (i.e., mean point) of each cluster
3. Until the convergence criterion is satisfied
Different distance measures can be used: Manhattan distance (L1 norm), Euclidean distance (L2 norm), or cosine similarity.
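A minimal NumPy sketch of the algorithm just outlined (Lloyd iterations with Euclidean distance); the toy points are made up, and a library implementation such as scikit-learn's KMeans would normally be used instead.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd-style K-Means on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # 1. pick K seeds
    for _ in range(max_iter):
        # 2. assign every point to its closest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 9], [9, 8.5]])
print(kmeans(X, k=2))
```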

73 Example: K-Means Clustering
Execution of the K-Means clustering algorithm: randomly select K = 2 centroids among the original data points, assign points to clusters, recompute the cluster centers, and redo the point assignment, repeating until the convergence criterion is satisfied. (Figure: the successive iterations.)

74 Example: Poor Initialization May Lead to Poor Clustering
Rerun of K-Means on the same data points using another random selection of K seeds: assign points to clusters and recompute the cluster centers as before. This run of K-Means converges to a poor-quality clustering.

75 Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)

76 Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)

77 Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)

78 Overcoming K-means Limitations
Original Points K-means Clusters. One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.

79 Overcoming K-means Limitations
Original Points K-means Clusters. One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.

80 Overcoming K-means Limitations
Original Points K-means Clusters. One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.

81 k-means: The k-means method is conceptually and practically simple, and it is scalable and efficient on large data sets, but it also has several weaknesses:
- It cannot handle categorical attributes
- Its cluster centers are easily affected by noise and outliers
- The result depends on the choice of the initial cluster centers
- The number of clusters is difficult to decide
Because k-means uses the center of mass of a cluster as the cluster center, no such center can be computed for data sets described by categorical attributes.

82 k-Means variations: A drawback of the k-Means algorithm is that the number of clusters k must be given in advance, but in practice this value of k is very hard to estimate.
The basic idea of k-Means++ for choosing the initial seeds: the initial cluster centers should be as far away from each other as possible.
k-Means is also sensitive to outliers; to address this, the k-medoids algorithm is introduced.

83 K-means example

84 K-means example

85 K-Medians: Handling Outliers by Computing Medians
Medians are less sensitive to outliers than means: think of the median salary vs. the mean salary of a large firm when a few top executives are added.
K-Medians: instead of taking the mean value of the objects in a cluster as a reference point, medians are used (with the L1 norm as the distance measure); the criterion function sums the L1 distances of the points to their cluster medians.
The K-Medians clustering algorithm:
1. Select K points as the initial representative objects (the initial K medians)
2. Repeat: assign every point to its nearest median, then re-compute each median using the median of each individual feature
3. Until the convergence criterion is satisfied

86 K-Modes: Clustering Categorical Data
K-Means cannot handle non-numerical (categorical) data: mapping categorical values to 1/0 does not generate quality clusters for high-dimensional data.
K-Modes: an extension of K-Means that replaces the means of clusters with modes; its dissimilarity measure (distance function) is frequency-based.

87 PAM: PAM is one of the earliest k-medoids algorithms. It selects the most centrally located object of each group as the representative object and tries to produce k partitions of the n objects.
The representative objects are called medoids; the other objects are called non-representative objects.
Initially, k objects are chosen at random as medoids. The algorithm then repeatedly tries to replace a medoid with a non-representative object, looking for better medoids that improve the quality of the clustering.
The cost function of PAM can be understood as follows: compute each point's distance to its nearest medoid, compute each point's distance to its nearest medoid after the proposed replacement, and compare the total distance after the replacement with the total distance before it.

88 Handling Outliers: From K-Means to K-Medoids
The K-Means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid is used, i.e., the most centrally located object in the cluster.
The K-Medoids clustering algorithm (a sketch in code follows below):
1. Select K points as the initial representative objects (the initial K medoids)
2. Repeat: assign each point to the cluster with the closest medoid; randomly select a non-representative object oi; compute the total cost S of swapping a medoid m with oi; if S < 0, swap m with oi to form the new set of medoids
3. Until the convergence criterion is satisfied
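A compact, greedy PAM-style sketch of the swap step described above, using Manhattan (L1) distance; the ten toy coordinates are illustrative only, since the slide's full data table is not reproduced in the transcript.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Sum of L1 distances from every point to its nearest medoid
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        best = total_cost(X, medoids)
        for mi in range(k):                      # try swapping each medoid ...
            for o in range(len(X)):              # ... with each non-medoid object
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o
                cost = total_cost(X, candidate)
                if cost < best:                  # swap cost S = cost - best < 0
                    medoids, best, improved = candidate, cost, True
    return sorted(medoids)

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])
print(pam(X, k=2))   # indices of the two medoids found
```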

89 PAM: A Typical K-Medoids Algorithm
With K = 2: arbitrarily choose K objects as the initial medoids, assign each remaining object to its nearest medoid, randomly select a non-medoid object Orandom, compute the total cost of swapping a medoid with Orandom, and perform the swap if it improves the quality; repeat the object re-assignment and medoid swapping until the convergence criterion is satisfied. (Figure: the assignment and swap steps on a 10-point example.)

90 Total swapping Cost (TCih)
The cost of the replacement is TCih = Σj Cjih, where Cjih is the cost contribution of swapping medoid i with h for the non-medoid object j; Cjih varies depending on the case (which medoid j is currently assigned to and where it would be reassigned).

91 PAM (Partitioning Around Medoids)
Use real objects to represent the clusters (called medoids):
1. Select k representative objects arbitrarily
2. For each pair of a selected object i and a non-selected object h, calculate the total swapping cost TCih
3. For each pair of i and h, if TCih < 0, replace i by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change in the medoids or in TCih

92 PAM or K-Medoids: Example
Data objects O1-O10 with attributes A1 and A2. Goal: create two clusters. Randomly choose two medoids, O8 = (7,4) and O2 = (3,4). (Figure: the ten objects plotted in the A1-A2 plane.)

93 PAM or K-Medoids: Example
Assign each object to the closest representative object. Using the L1 (Manhattan) metric, we form the clusters Cluster1 = {O1, O2, O3, O4} and Cluster2 = {O5, O6, O7, O8, O9, O10}. (Figure: the two clusters in the A1-A2 plane.)

94 PAM or K-Medoids: Example
Compute the absolute error criterion for the set of medoids (O2, O8): $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - O_i| = |O_1 - O_2| + |O_3 - O_2| + |O_4 - O_2| + |O_5 - O_8| + |O_6 - O_8| + |O_7 - O_8| + |O_9 - O_8| + |O_{10} - O_8|$.

95 PAM or K-Medoids: Example
The absolute error criterion for the set of medoids (O2, O8) is E = (3+4+4) + (3+1+1+2+2) = 20.

96 PAM or K-Medoids: Example
Choose a random non-medoid object, O7, and consider swapping O8 with O7. The absolute error criterion for the set of medoids (O2, O7) is E = (3+4+4) + (2+2+1+3+3) = 22.

97 PAM or K-Medoids: Example
Compute the cost function S = absolute error for (O2, O7) - absolute error for (O2, O8) = 22 - 20 = 2. Since S > 0, it is a bad idea to replace O8 by O7.

98 PAM or K-Medoids: Example
C687 = d(O7,O6) - d(O8,O6) = 2 - 1 = 1
C587 = d(O7,O5) - d(O8,O5) = 2 - 3 = -1
C987 = d(O7,O9) - d(O8,O9) = 3 - 2 = 1
C10,87 = d(O7,O10) - d(O8,O10) = 3 - 2 = 1
C187 = 0, C387 = 0, C487 = 0
TC87 = Σj Cj87 = 1 - 1 + 1 + 1 = 2 = S

99 CLARA (Clustering Large Applications)
CLARA (Kaufmann and Rousseeuw, 1990) draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weaknesses: efficiency depends on the sample size, and a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.
(Figure: PAM is applied to sample1 through samplem, and the best of the resulting clusterings is chosen.)

100 Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features: discovers clusters of arbitrary shape, handles noise, requires only one scan, and needs density parameters as a termination condition.
Several interesting studies: DBSCAN (Ester et al., KDD'96), OPTICS (Ankerst et al., SIGMOD'99), DENCLUE (Hinneburg & Keim, KDD'98), and CLIQUE (Agrawal et al., SIGMOD'98; more grid-based).

101 Density-Based Clustering: Basic Concepts
Two parameters: Eps, the maximum radius of the neighbourhood, and MinPts, the minimum number of points in an Eps-neighbourhood of a point.
NEps(p) = {q in D | dist(p, q) ≤ Eps}.
Directly density-reachable: a point p is directly density-reachable from a point q if p belongs to NEps(q) and q satisfies the core-point condition |NEps(q)| ≥ MinPts (example: MinPts = 5, Eps = 1 cm).
In other words, if p lies within the Eps-neighbourhood of some core object q, then p is said to be directly density-reachable from q.

102 Density-Reachable and Density-Connected
Density-reachable: a point p is density-reachable from a point q if there is a chain of points p1, ..., pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi. For example, if q is a core object, p1 is directly density-reachable from q, and p is directly density-reachable from p1, then p is density-reachable from q.
Density-connected: a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o.

103 DBSCAN: The Algorithm
Arbitrarily select a point p. Retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is formed; if p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. Continue the process until all of the points have been processed (see the sketch below).
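A small scikit-learn example of the procedure, with made-up points: eps and min_samples correspond to Eps and MinPts, and points labelled -1 are treated as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.0], [5.0, 4.8],
              [9.0, 1.0]])                      # two dense groups plus one stray point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)                               # cluster ids; -1 marks noise
```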

104 DBSCAN: Apply the DBSCAN algorithm to the given data set of n = 12 points, each described by attribute 1 and attribute 2, with user-supplied parameters ε = 1 and MinPts = 4; the following slides show the steps. (Table: the 12 sample points, numbered 1-12.)

105 DBSCAN: For every point, count the number of points in its ε-neighbourhood and decide whether it is a core point (table columns: ID, attribute 1, attribute 2, size of the ε-neighbourhood, core point?, set of points reachable from the core point). The core points found are point 4 with reachable set {1,3,4,5,10}, point 7 with reachable set {2,6,7,8,11}, and point 10 with reachable set {4,9,10,12}.

106 DBSCAN: The core points are point 4 (reachable set {1,3,4,5,10}), point 7 (reachable set {2,6,7,8,11}), and point 10 (reachable set {4,9,10,12}). Sample 4 can reach sample 10, and sample 10 can reach {4,9,10,12}, so the reachable sets of samples 4 and 10 are density-connected and can be merged into one cluster. The final clustering result is {1,3,4,5,9,10,12} as one cluster and {2,6,7,8,11} as another.

107 DBSCAN Algorithm: Example
Parameters: Eps = 2 cm, MinPts = 3.

108 When DBSCAN Does NOT Work Well
DBSCAN cannot handle varying densities and is sensitive to its parameters. (Figure: the original points and two DBSCAN results, one with MinPts = 4 and Eps = 9.92 and one with MinPts = 4 and Eps = 9.75.)

109 DBSCAN: Sensitive to Parameters

110 Density Based Clustering: Discussion
Advantages: clusters can have arbitrary shape and size, the number of clusters is determined automatically, clusters can be separated from the surrounding noise, and the method can be supported by spatial index structures.
Disadvantages: the input parameters may be difficult to determine, and in some situations the method is very sensitive to the input parameter setting.

111 DBSCAN: Sensitive to Parameters
Motivation: very different local densities may be needed to reveal clusters in different regions. The clusters A, B, C1, C2, and C3 cannot all be detected using one global density parameter: a single global density parameter can detect either A, B, C or C1, C2, C3, so which density parameter should be used? Solution: use OPTICS. (Figure: clusters A and B next to a cluster C containing sub-clusters C1, C2, C3.)

112 Density-Based Hierarchical Clustering
Observation: dense clusters are completely contained in less dense clusters. Idea: process the objects in the 'right' order and keep track of the point density in their neighborhood.

113 When OPTICS Works Well Cluster-order of the objects

114 Gaussian Distribution
Bean machine: balls dropped through rows of pins accumulate in a Gaussian (normal) distribution. (Figures: a 1-d Gaussian and a 2-d Gaussian.)

115 Expectation-Maximization (EM): EM is a clustering algorithm based on a set of density distribution functions. Its main idea is as follows:
Assume the samples follow a Gaussian mixture model. The goal of the algorithm is to determine the parameters of each Gaussian component so that the mixture fits the given data well, which yields a fuzzy clustering: each sample belongs to each Gaussian component with some probability computed from those parameters. Each component is a Gaussian distribution with its own mean and covariance, and the mixing parameter of the ith component, which can be viewed as its prior probability, gives the weight of that component; the mixing parameters sum to 1.

116 Probability

117 Probability

118 Probability

119 EM algorithm: Suppose there are six people whose heights are listed in the table (1.6, 1.9, 1.86, 1.55, 1.62, and one further value).
First make a rough guess of the parameters: for the male height distribution, mean 1.8 m and variance 0.1; for the female height distribution, mean 1.6 m and variance 0.1. With these parameters we obtain p1(female) > p1(male), p2(female) < p2(male), p3(female) < p3(male), p4(female) > p4(male), p5(female) > p5(male), p6(female) > p6(male), so persons 1, 4, 5, and 6 are assigned to the female group and persons 2 and 3 to the male group.
Based on this assignment, the actual heights of persons 2 and 3 are used to update the parameters of the male height distribution, and the heights of persons 1, 4, 5, and 6 to update the female distribution. The two steps are repeated until the clustering result no longer changes, at which point the algorithm has converged.
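A brief scikit-learn sketch of the same idea with a Gaussian mixture; the heights are the five values listed above plus one invented value to complete the six, so the numbers are illustrative rather than the slide's exact data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

heights = np.array([1.60, 1.90, 1.86, 1.55, 1.62, 1.58]).reshape(-1, 1)  # last value invented

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print(gmm.means_.ravel())          # fitted component means, refined by EM iterations
print(gmm.predict_proba(heights))  # soft membership of each person in each component
```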

120 Partitional Clustering
How do we partition a space to make the best clusters? Proximity to a cluster centroid.

121 Difficult Clusterings
But some clusterings don't lend themselves to a centroid-based definition of a cluster. Graph-based (spectral) clustering allows us to address these sorts of clusters: they are defined by points that are close to some member of the cluster, rather than to the average member of the cluster.

122 Graph Representation We can represent the relationships between data points in a graph. Weight the edges by the similarity between points

123 Graphs Nodes and Edges Edges can be directed or undirected
Edges can have weights associated with them Here the weights correspond to pairwise affinity

124 Graphs: the degree of a node is the sum of the weights of the edges incident to it; the volume of a set of nodes is the sum of the degrees of its members.

125 Graph Cuts: the cut between two subgraphs A and B is calculated as the total weight of the edges that cross between them, cut(A, B) = Σ_{i in A, j in B} w_ij.

126 Graph cuts: A cut on a graph is defined as a partitioning of the nodes of the graph into two sets; the size of a cut is the number (or total weight) of the edges spanning the cut, and the minimum cut of a graph identifies an optimal partitioning of the data. Spectral clustering, in its recursive min-cut form: recursively partition the data set by identifying the minimum cut and removing its edges, and repeat until k clusters are identified. (Figure: a graph with a cut of size 2.)

127 Graph Cuts Minimum (bipartitional) cut

128 Spectral Clustering Example
Minimum cut. (Figure: five points A-E plotted by height and weight, with pairwise affinities between .08 and .45; the minimum cut separates the weakest links.)

129 Spectral Clustering Example
Normalized minimum cut. (Figure: the same five points A-E with their pairwise affinities.)

130 Spectral Clustering Example
Normalized minimum cut. (Figure: the same five points A-E with their pairwise affinities.)

131 Unnormalized Graph Laplacian example
For the example graph on four nodes, build the adjacency matrix A and the degree matrix D (the diagonal matrix of node degrees); the unnormalized Laplacian is L = D - A (a small worked computation follows below).
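A tiny NumPy illustration of L = D - A on a hypothetical 4-node undirected graph (the slide's exact adjacency values are not recoverable from the transcript):

```python
import numpy as np

# Hypothetical graph: edges 1-2, 1-3, 2-3, 3-4
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # unnormalized graph Laplacian
print(L)
# The smallest eigenvalue of L is 0; its multiplicity equals the number of connected components.
```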

132 Spectral Clustering: Construct an affinity matrix from the pairwise similarities. (Figure: four points A-D connected by edges with affinities between .1 and .6, and the corresponding matrix.)

133 Laplacian Matrix Matrix representation of a graph
D is a normalization factor for the affinity matrix A, and different Laplacians are available. The most important application of the Laplacian is spectral clustering, which corresponds to a computationally tractable solution of the graph partitioning problem.

134 A few linear algebra concepts
Let A be an n x n matrix, let A' denote the transpose of A, and let R^n denote the space of n x 1 real-valued vectors.
Eigenvector: v in R^n is an eigenvector of A with eigenvalue λ (a scalar) if Av = λv.
Positive semi-definite: A is said to be positive semi-definite if x'Ax ≥ 0 for every x in R^n.

135 Laplacian Matrix: For a good clustering, we expect the Laplacian matrix to be approximately block diagonal.

136 Spectral clustering key steps
Pipeline: from the N x N adjacency matrix A, form the N x N Laplacian L, compute its first k eigenvectors to obtain an N x k embedding, and cluster the N rows of that embedding into k clusters (a sketch follows below).
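A minimal end-to-end sketch of these steps on a hypothetical affinity matrix (unnormalized Laplacian, then K-Means on the first k eigenvectors); a production implementation would typically use a normalized Laplacian or scikit-learn's SpectralClustering.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # unnormalized Laplacian
    _, vecs = eigh(L)                           # eigenvectors, eigenvalues in ascending order
    U = vecs[:, :k]                             # first k eigenvectors -> N x k embedding
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Hypothetical affinities for 6 points forming two groups
A = np.array([[0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
              [0.9, 0.0, 0.7, 0.0, 0.1, 0.0],
              [0.8, 0.7, 0.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.0, 0.9, 0.8],
              [0.0, 0.1, 0.0, 0.9, 0.0, 0.7],
              [0.0, 0.0, 0.1, 0.8, 0.7, 0.0]])
print(spectral_clustering(A, k=2))
```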

137 Spectral clustering

138 Spectral clustering: given the similarity matrix of 6 points.

139 Spectral clustering: compute the degree matrix D.

140 Spectral clustering: D is obtained by summing each row of the similarity matrix.

141 Spectral clustering: compute L = D - W.

142 Spectral clustering: compute the eigenvalues and eigenvectors of L.

143 Spectral clustering: the resulting eigenvalues and eigenvectors.

144 Spectral clustering

145 Spectral clustering

146 Spectral clustering

147 Spectral clustering

148 Spectral clustering: Recompute the mean of each cluster (the centers of clusters 1, 2, and 3). B is closest to A, so it joins A's cluster; D is closest to A, so it joins A's cluster; E is closest to C, so it joins C's cluster. Result: cluster 1 = {A, B, D}, cluster 2 = {C, E}, cluster 3 = {F}.

149 k-means vs. Spectral Clustering

150 Overall Framework of CHAMELEON
Pipeline: data set → construct a sparse K-NN graph → partition the graph → merge the partitions → final clusters.
K-NN graph: points p and q are connected if q is among the top-k closest neighbors of p.
Relative interconnectivity: the connectivity of c1 and c2 relative to their internal connectivity.
Relative closeness: the closeness of c1 and c2 relative to their internal closeness.

151 KNN Graphs and Interconnectivity
K-nearest neighbor (KNN) graphs built from the original data in 2D (figure).

152 CHAMELEON: Clustering Complex Objects
CHAMELEON is capable of generating quality clusters when clustering complex objects.

153 Using minimum cut for semi-supervised classification?
Construct a graph representation of the unseen data, insert imaginary nodes s and t connected to the labeled points with infinite similarity, and treat the min cut as a maximum-flow problem from s to t. (Figure: the graph with source s and sink t.)

154 Kernel K-Means Clustering
Kernel K-Means can be used to detect non-convex clusters, whereas plain K-Means can only detect clusters that are linearly separable.
Idea: project the data onto a high-dimensional kernel space and then perform K-Means clustering there: map the data points in the input space onto a high-dimensional feature space using the kernel function, then perform K-Means on the mapped feature space.
The computational complexity is higher than that of K-Means, since the n x n kernel matrix generated from the kernel function on the original data must be computed and stored.
The widely studied spectral clustering can be considered a variant of Kernel K-Means clustering.

155 Kernel Functions and Kernel K-Means Clustering
Kernel matrix: an inner-product matrix K(i, j) (a similarity matrix), not a distance matrix.
Typical kernel functions:
- Polynomial kernel of degree h: K(Xi, Xj) = (Xi·Xj + c)^h
- Gaussian radial basis function (RBF) kernel: K(Xi, Xj) = exp(-||Xi - Xj||² / (2σ²))
- Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi·Xj - δ)
The kernel matrix entries for points xi, xj in cluster Ck, the SSE criterion of kernel K-Means, SSE = Σk Σ_{xi in Ck} ||φ(xi) - μk||², and the cluster centroid in feature space, μk = (1/|Ck|) Σ_{xi in Ck} φ(xi), are all expressed in terms of the mapped feature vectors φ(xi).

156 Example: Kernel Functions and Kernel K-Means Clustering
Gaussian radial basis function (RBF) kernel: K(Xi, Xj) = exp(-||Xi - Xj||² / (2σ²)). Suppose there are 5 original 2-dimensional points: x1(0,0), x2(4,4), x3(-4,4), x4(-4,-4), x5(4,-4). If we set σ to 4, we obtain the corresponding points in the kernel space. E.g., ||x1 - x2||² = (0-4)² + (0-4)² = 32, therefore K(x1, x2) = e^(-32/(2·4²)) = e^(-1). Depending on the pair of points, the kernel values are 1, e^(-1), e^(-2), or e^(-4) (see the computation below).
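A quick NumPy computation of the kernel matrix for these five points with σ = 4, confirming the e^(-1), e^(-2), e^(-4) pattern:

```python
import numpy as np

X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
sigma = 4.0

# Gaussian RBF kernel: K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma ** 2))
print(K[0, 1])        # exp(-32/32) = e^-1, as computed on the slide
print(K.round(4))
```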

157 Example: Kernel K-Means Clustering
The original data set contains non-convex clusters, so it cannot be clustered well by plain K-Means. The Gaussian RBF kernel transformation maps the data to a kernel matrix K(Xi, Xj) = exp(-||Xi - Xj||² / (2σ²)); K-Means clustering conducted on the mapped data then generates quality clusters. (Figures: the original data set, the result of plain K-Means clustering, and the result of Gaussian kernel K-Means clustering.)

158 Matching-Based Measures (I): Purity vs. Maximum Matching
Purity quantifies the extent to which cluster Ci contains points from only one (ground-truth) partition; the clustering is perfect when purity = 1 and the number of clusters obtained is the same as in the ground truth.
Example (green contingency table, rows C1-C3 vs. ground-truth partitions T1-T3): C1 = (20, 30, 0) with sum 50, C2 = (20, 5, 0) with sum 25, C3 = (0, 0, 25) with sum 25, and column sums mj = (40, 35, 25) with total 100. Then purity1 = 30/50, purity2 = 20/25, purity3 = 25/25, and purity = (30+20+25)/100 = 0.75.
Maximum matching: only one cluster can be matched to each partition. With pairwise matching weights w(eij) = nij, the maximum-weight matching of the green table equals its purity, 0.75, while for the orange table the maximum matching value is above 0.6.

159 Matching-Based Measures (II): F-Measure
Precision: the fraction of points in Ci that come from the majority partition Tji (i.e., the same as purity), where ji is the partition that contains the maximum number of points from Ci. For the green table: prec1 = 30/50, prec2 = 20/25, prec3 = 25/25.
Recall: the fraction of points of partition Tji shared in common with cluster Ci: recall1 = 30/35, recall2 = 20/40, recall3 = 25/25.
F-measure for Ci: the harmonic mean of preci and recalli. The F-measure of the clustering C is the average over all clusters. Here F1 = 60/85, F2 = 40/65, F3 = 1, and F ≈ 0.774 (a small computation is sketched below).
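A short NumPy computation of purity, precision, recall, and F on the green contingency table as reconstructed above (the table itself is inferred from the quoted ratios, so treat it as an assumption):

```python
import numpy as np

# Rows = clusters C1..C3, columns = ground-truth partitions T1..T3
n = np.array([[20, 30,  0],
              [20,  5,  0],
              [ 0,  0, 25]])
cluster_sizes   = n.sum(axis=1)     # n_i
partition_sizes = n.sum(axis=0)     # m_j

purity = n.max(axis=1).sum() / n.sum()                        # 0.75
prec   = n.max(axis=1) / cluster_sizes                        # 30/50, 20/25, 25/25
recall = n.max(axis=1) / partition_sizes[n.argmax(axis=1)]    # 30/35, 20/40, 25/25
F      = 2 * prec * recall / (prec + recall)                  # 60/85, 40/65, 1
print(purity, F, F.mean())                                    # F.mean() ~= 0.774
```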

160 Cluster Stability: Clusterings obtained from several data sets sampled from the same underlying distribution as D should be similar, or 'stable'. Typical approach: find good parameter values for a given clustering algorithm, for example a good value of k, the correct number of clusters. A bootstrapping approach to find the best value of k (judged on stability):
1. Generate t samples of size n by sampling from D with replacement
2. For each sample Di, run the same clustering algorithm with k values from 2 to kmax
3. Compare the distance between all pairs of clusterings Ck(Di) and Ck(Dj) via some distance function
4. Compute the expected pairwise distance for each value of k

161 Other Methods for Finding K, the Number of Clusters
Empirical method: set the number of clusters to roughly sqrt(n/2) for a dataset of n points (e.g., n = 200 gives k = 10).
Elbow method: use the turning point in the curve of the sum of within-cluster variance plotted against the number of clusters (a sketch follows below).
Cross-validation method: divide a given data set into m parts, use m - 1 parts to obtain a clustering model, and use the remaining part to test the quality of the clustering (for example, for each point in the test set, find the closest centroid). For each k > 0, repeat this m times, compare the overall quality measure for the different k's, and choose the number of clusters that fits the data best.
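A small sketch of the elbow method with scikit-learn on made-up data: fit K-Means for a range of k and inspect where the within-cluster SSE (inertia) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 4])])

for k in range(1, 9):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))   # look for the 'elbow' where the decrease levels off
```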

