1
Intro. to Data Mining, Chapter 3.2: Clustering
2
What is Cluster Analysis?
Cluster: a collection of data objects that are similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in other groups. Cluster analysis (or clustering, data segmentation, …): finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters. Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised). Typical applications: as a stand-alone tool to get insight into data distribution, or as a preprocessing step for other algorithms.
3
Clustering analysis What is a natural grouping among characters?
Segmenting characters into groups is subjective: e.g., villains vs. heroes, or males vs. females.
4
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters: high intra-class similarity (cohesive within clusters) and low inter-class similarity (distinctive between clusters). The quality of a clustering method depends on the similarity measure used by the method and on its ability to discover some or all of the hidden patterns.
5
Quality: What Is Good Clustering?
The red samples represent one cluster and the black samples another; compute the intra-cluster variation and the inter-cluster variation. Sample data: ID, Attribute 1, Attribute 2 (values shown in the slide table). Intra-cluster variation: ___  Inter-cluster variation: ___
6
Applications: roles of cluster analysis in the data mining process: data reduction, hypothesis generation, hypothesis verification, and membership prediction.
Data reduction: group a large dataset into several clusters and pick representative records from each cluster for subsequent processing. Hypothesis generation: infer characteristics or phenomena that may exist in the data of interest, e.g., "young people usually have low annual income" or "middle-aged people usually have high annual income". Hypothesis verification: validate such hypotheses; e.g., to test whether "young people usually have low annual income, and therefore low spending power", cluster the records described by age, annual income, and spending amount. Membership prediction: apply the clustering result to records of unknown class to predict which cluster they belong to.
7
Applications: application domains of cluster analysis: transaction behavior analysis, to understand the behavior patterns of different types of users; spatial data analysis,
to help users automatically analyze the image data produced by image databases and understand the characteristics and phenomena of interest; and document management, to organize documents into categories and support their management and use.
8
Workflow: the five sequential tasks of cluster analysis. Data representation: find representative data dimensions to describe the data points. Similarity computation and measurement: compute the degree of similarity between data points.
Choice of clustering method: select an appropriate clustering algorithm. Evaluation of the clustering result: assess the outcome of the cluster analysis. Cluster interpretation: domain experts further interpret the clustering result.
9
Measure the Quality of Clustering
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j). The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables. Weights should be associated with different variables based on applications and data semantics.
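As a small illustration of how different distance functions behave (not part of the original slides), the sketch below computes Euclidean, Manhattan, and cosine distances between two hypothetical numeric vectors using SciPy.

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical data objects described by three numeric attributes
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print("Euclidean:", distance.euclidean(x, y))   # L2 norm of the difference
print("Manhattan:", distance.cityblock(x, y))   # L1 norm of the difference
print("Cosine   :", distance.cosine(x, y))      # 1 - cosine similarity
```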
10
Major Clustering Approaches (I)
Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS. Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON. Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue. Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE.
11
Major Clustering Approaches (II)
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model. Typical methods: EM, SOM, COBWEB. Frequent pattern-based: based on the analysis of frequent patterns. Typical methods: p-Cluster. Link-based clustering: objects are often linked together in various ways, and massive links can be used to cluster objects: PageRank, spectral clustering.
12
Hierarchical clustering - distance
13
Hierarchical clustering - correlation
14
Hierarchical clustering - category
15
Agglomerative and Divisive
Agglomerative Hierarchical Clustering: bottom-up strategy. Each cluster starts with only one object; clusters are merged into larger and larger clusters (AGNES). Divisive Hierarchical Clustering: top-down strategy. Start with all objects in one cluster; clusters are subdivided into smaller and smaller clusters (DIANA). (Figure: objects a, b, c, d, e are merged step by step by AGNES and split step by step by DIANA.)
16
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges or splits
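To make the dendrogram idea concrete, here is a minimal sketch (my own illustration, with made-up 2-D points) that builds a hierarchical clustering with SciPy and records the merge sequence; the resulting tree can be drawn with matplotlib if available.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2-D data points
X = np.array([[1, 1], [1.5, 1], [5, 5], [5, 5.5], [9, 1]])

# Agglomerative clustering; each row of Z records one merge:
# (cluster id 1, cluster id 2, merge distance, size of new cluster)
Z = linkage(X, method="single", metric="euclidean")
print(Z)

# dendrogram(Z) returns the plotting data (and draws the tree when a
# matplotlib axes is active), visualizing the nested merge sequence.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf order of the original points
```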
17
Strengths of Hierarchical Clustering
No assumptions on the number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. Hierarchical clusterings may correspond to meaningful taxonomies: examples in the biological sciences (e.g., phylogeny reconstruction), the web (e.g., product catalogs), etc.
18
Input/ Initial setting
Start with clusters of individual points (p1, p2, p3, p4, p5, …) and a distance/proximity matrix.
19
Intermediate State: after some merging steps, we have some clusters (C1–C5) and an updated distance/proximity matrix.
20
Intermediate State: merge the two closest clusters (C2 and C5) and update the distance matrix.
21
After Merging: "How do we update the distance matrix?" The rows and columns for C2 and C5 are replaced by a single row and column for the merged cluster C2 ∪ C5, and its distances to C1, C3, and C4 must be recomputed.
22
Distance Between Clusters
First measure: minimum distance, dist_min(Ci, Cj) = min over p in Ci, p' in Cj of |p − p'|, where |p − p'| is the distance between two objects p and p'. Use cases: an algorithm that uses the minimum distance to measure the distance between clusters is sometimes called a nearest-neighbor clustering algorithm. If the clustering process terminates when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.
23
Distance Between Clusters
Second measure: maximum distance, dist_max(Ci, Cj) = max over p in Ci, p' in Cj of |p − p'|, where |p − p'| is the distance between two objects p and p'. Use cases: an algorithm that uses the maximum distance to measure the distance between clusters is sometimes called a farthest-neighbor clustering algorithm. If the clustering process terminates when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
24
Distance Between Clusters
Minimum and maximum distances are extreme measures, which makes them overly sensitive to outliers or noisy data. Third measure: mean distance, dist_mean(Ci, Cj) = |mi − mj|, where mi and mj are the means of clusters Ci and Cj respectively. Fourth measure: average distance, dist_avg(Ci, Cj) = (1 / (ni·nj)) Σ over p in Ci, p' in Cj of |p − p'|, where ni and nj are the numbers of objects in clusters Ci and Cj respectively. The mean is difficult to compute for categorical data.
25
Distance between two clusters
Ward’s distance最小變異數法(minimum variance method) between clusters Ci and Cj is the difference between the total within cluster sum of squares for the two clusters separately, and the within cluster sum of squares resulting from merging the two clusters in cluster Cij ri: centroid of Ci rj: centroid of Cj rij: centroid of Cij
26
AGNES: use the AGNES algorithm to cluster the dataset below, given the pairwise distance matrix of sample points A–E (values read from the slide):
     A    B    C    D    E
A   0.0  0.4  2.0  2.5  3.0
B   0.4  0.0  1.6  2.1  1.9
C   2.0  1.6  0.0  0.6  0.8
D   2.5  2.1  0.6  0.0  1.0
E   3.0  1.9  0.8  1.0  0.0
27
AGNES (dendrogram levels l = 0 to l = 4): the closest pair in the distance matrix is A and B (distance 0.4), so at level l = 1 they are merged into cluster AB.
28
AGNES: the updated single-linkage distance matrix over {AB, C, D, E} is AB–C = 1.6, AB–D = 2.1, AB–E = 1.9, C–D = 0.6, C–E = 0.8, D–E = 1.0. The closest pair is C and D (0.6), so at level l = 2 they are merged into CD.
29
AGNES: the updated distance matrix over {AB, CD, E} is AB–CD = 1.6, AB–E = 1.9, CD–E = 0.8. The closest pair is CD and E (0.8), so at level l = 3 they are merged into CDE.
30
AGNES: the only remaining distance is AB–CDE = 1.6, so at level l = 4 the final merge produces the single cluster ABCDE.
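The merge sequence above (A,B at 0.4; C,D at 0.6; CD,E at 0.8; AB,CDE at 1.6) can be reproduced with SciPy by feeding the pairwise distances as a condensed distance matrix and using single linkage; the distances below are the ones read from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise distance matrix for A, B, C, D, E taken from the slides
D = np.array([
    [0.0, 0.4, 2.0, 2.5, 3.0],
    [0.4, 0.0, 1.6, 2.1, 1.9],
    [2.0, 1.6, 0.0, 0.6, 0.8],
    [2.5, 2.1, 0.6, 0.0, 1.0],
    [3.0, 1.9, 0.8, 1.0, 0.0],
])

# squareform turns the symmetric matrix into the condensed form linkage expects
Z = linkage(squareform(D), method="single")
for left, right, dist, size in Z:
    print(f"merge clusters {int(left)} and {int(right)} at distance {dist} (size {int(size)})")
```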
31
DIANA: use the DIANA algorithm to cluster the same dataset, i.e., sample points A–E with the pairwise distance matrix given in the AGNES example above.
32
DIANA: within the largest cluster, select the object with the greatest average dissimilarity to the other objects. A has the largest average dissimilarity, so the splinter group becomes {A} and the old party is {B, C, D, E}.
33
DIANA: with splinter group {A} and old party {B, C, D, E}, for each object in the old party compute D1, its average distance to the splinter group, and D2, its average distance to the rest of the old party, and then D2 − D1. For B, D1 = 0.4 and D2 ≈ 1.87, so D2 − D1 ≈ 1.47 > 0 and B moves to the splinter group.
34
DIANA: with splinter group {A, B} and old party {C, D, E}, recompute D1 (average distance to the splinter group), D2 (average distance to the rest of the old party), and D2 − D1 for each old-party object. Since D2 − D1 is negative for all of them, no further old-party objects are assigned to the splinter group, and this split is finished.
35
DIANA: the first split therefore divides ABCDE into AB and CDE. To continue, select the cluster with the largest diameter and repeat the splinter procedure, again choosing the object with the greatest average dissimilarity within it.
36
DIANA: the cluster with the largest diameter is CDE, so it is split next, producing CD and E.
37
DIANA: the cluster with the largest diameter is now CD, so it is split into C and D. The resulting hierarchy is ABCDE → {AB, CDE}, CDE → {CD, E}, CD → {C, D}.
38
DIANA vs. AGNES: comparison of the hierarchies produced by the two algorithms on sample points A–E (both yield the nesting ABCDE → {AB, CDE} → {AB, CD, E}).
39
Hierarchical Agglomerative Clustering
Example: given 5 data objects A, B, C, D, E and their distance matrix (shown in the slide).
40
Hierarchical Agglomerative Clustering
Update the distance matrix using the single-linkage function.
41
Hierarchical Agglomerative Clustering
Update the distance matrix using the single-linkage function.
42
Hierarchical Agglomerative Clustering
How do we decide the number of clusters? Cut the tree: for objects A–E, cutting the dendrogram just below the highest merge (distance 3.2) gives K = 2, below 2.06 gives K = 3, and below 1 gives K = 4; the lowest merge occurred at 0.71.
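Cutting the dendrogram at a chosen height is exactly what SciPy's `fcluster` does. As a sketch, the snippet below reuses the A–E distance matrix from the earlier AGNES example (not this slide's matrix) and shows how different cut heights yield different K.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Single-linkage tree for the A..E distance matrix used earlier
D = np.array([
    [0.0, 0.4, 2.0, 2.5, 3.0],
    [0.4, 0.0, 1.6, 2.1, 1.9],
    [2.0, 1.6, 0.0, 0.6, 0.8],
    [2.5, 2.1, 0.6, 0.0, 1.0],
    [3.0, 1.9, 0.8, 1.0, 0.0],
])
Z = linkage(squareform(D), method="single")

# Cutting at a height keeps only the merges below that distance
for height in [1.0, 0.7, 0.5, 0.3]:
    labels = fcluster(Z, t=height, criterion="distance")
    print(f"cut at {height}: {labels} -> K = {len(set(labels))}")
```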
43
Strength of MIN in hierarchical clustering
44
Limitation of MIN in hierarchical clustering
45
Strength of MAX in hierarchical clustering
46
Limitation of MAX in hierarchical clustering
47
Ward’s method華德法 華德法又稱最小變異數法(minimum variance method)。華德法的分群 方式是先將每一個個體視為一個集群,然後將各集群依序合併,合併 之順序完全視合併後集群之組內總變異數之大小而定。凡使群內總變 異數產生最小增量的個體即予以優先合併,愈早合併之個體表示其間 的相似性愈高。
48
Ward’s method華德法 第i組組內變異數 I. 剛開始一個人一組 II.在每一階段, 計算集群內到中心點的歐氏距離平方和
For the ith cluster : 第i組組內變異數
49
Ward’s method華德法 組內總變異數 III. 合併ESS增加最少的兩個集群
50
Ward’s method華德法
51
Ward’s method華德法
52
Ward’s method華德法
53
Ward’s method華德法
54
Ward’s method華德法 Step4:
55
Ward’s method華德法 =73.4 6.5+14=20.4 6.5
56
Hierarchical clustering example
57
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
Based on a user-specified cluster scope, e.g., the cluster radius, BIRCH first assigns the data points incrementally to many small, highly similar sub-clusters. Treating these sub-clusters as basic units, it then repeatedly merges them into larger clusters in a way similar to agglomerative hierarchical clustering. Each sub-cluster is represented by a Clustering Feature (CF) rather than by its raw data points, which makes memory usage very efficient. To speed up the assignment of data points to their sub-clusters, BIRCH dynamically builds a structure similar to a B+ tree (the CF-tree). Weakness: handles only numeric data, and is sensitive to the order of the data records.
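scikit-learn ships an implementation of BIRCH; the sketch below (with made-up blob data) shows the two knobs that correspond to the description above: `threshold` (the sub-cluster radius) and `branching_factor` (the CF-tree fan-out).

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Hypothetical numeric data: three Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# threshold: maximum radius of a sub-cluster (CF entry)
# branching_factor: maximum number of CF entries per node of the CF-tree
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print("number of sub-clusters (leaf CF entries):", len(model.subcluster_centers_))
print("first ten labels:", labels[:10])
```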
58
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS), where N is the number of data points, LS is the linear sum of the N points, and SS is the square sum of the N points. Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
59
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS). CF vectors are additive: CF3 = CF1 + CF2. Cluster 1 contains (2,5), (3,2), (4,3), so CF1 = (3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)) = (3, (9,10), (29,38)). Cluster 2 has CF2 = (3, (35,36), (417,440)). Merging them gives CF3 = CF1 + CF2 = (3+3, (9+35, 10+36), (29+417, 38+440)) = (6, (44,46), (446,478)).
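A minimal sketch of the CF vector and its additivity, using the numbers from this slide: CF1 is computed from the points (2,5), (3,2), (4,3), CF2 = (3, (35,36), (417,440)) is taken as given, and CF3 = CF1 + CF2.

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS): count, linear sum, and per-dimension square sum."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def add_cf(cf_a, cf_b):
    """CF additivity: merging two sub-clusters just adds their CF vectors."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])            # (3, [9, 10], [29, 38])
cf2 = (3, np.array([35.0, 36.0]), np.array([417.0, 440.0]))   # as given on the slide
cf3 = add_cf(cf1, cf2)                                        # (6, [44, 46], [446, 478])
print(cf1, cf2, cf3, sep="\n")
```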
60
The CF-tree
61
BIRCH: The Idea by example
Data objects 1–6 are inserted one by one to build the CF-tree. Object 1 starts cluster 1 in a leaf node. When object 2 arrives: if cluster 1 would become too large by adding object 2, then split the cluster.
62
BIRCH: The Idea by example
Object 2 therefore starts cluster 2, and the leaf node now holds two entries (entry 1 → cluster 1, entry 2 → cluster 2).
63
BIRCH: The Idea by example
When object 3 arrives, entry 1 is the closest to it; if cluster 1 would become too large by adding object 3, then split the cluster.
64
BIRCH: The Idea by example
Object 3 therefore starts cluster 3, and the leaf node now holds three entries (clusters 1, 3, and 2).
65
BIRCH: The Idea by example
When object 4 arrives, entry 3 (cluster 2) is the closest to it; cluster 2 remains compact when object 4 is added, so object 4 is added to cluster 2.
66
BIRCH: The Idea by example
When object 5 arrives, entry 2 (cluster 3) is the closest to it; cluster 3 would become too large by adding object 5, so we would like to split it into a new entry, BUT there is a limit to the number of entries a leaf node can have. Thus, split the node.
67
BIRCH: The Idea by example
After the split, a non-leaf node with entries 1 and 2 points to two leaf nodes: one holding entries 1.1 and 1.2 (clusters 1 and 3) and one holding entries 2.1 and 2.2 (clusters 4 and 2), where object 5 forms the new cluster 4.
68
BIRCH: The Idea by example
When object 6 arrives, entry 1.2 (cluster 3) is the closest to it; cluster 3 remains compact when object 6 is added, so object 6 is added to cluster 3.
69
BIRCH: The Idea by example
Example with branching factor B = 3 and leaf capacity L = 3: the root points to leaf nodes LN1, LN2, LN3, which hold sub-clusters sc1–sc8. After a new entry is inserted, one node's number of children exceeds the balancing factor B, so the node must be split; the two farthest entries are selected for the split.
70
BIRCH: The Idea by example
The two farthest entries are chosen as seeds for the split (LN1 is split into LN1 and LN1'), the remaining entries are redistributed to the nearest seed, and the CF information of every non-leaf node is updated.
71
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances E = Σ from i = 1 to k of Σ over p in Ci of d(p, ci)² is minimized, where ci is the centroid or medoid of cluster Ci. Given k, the k-means algorithm is implemented in four steps: partition the objects into k nonempty subsets; compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster); assign each object to the cluster with the nearest seed point; go back to Step 2, and stop when the assignment does not change.
72
The K-Means Clustering Method
K-Means (MacQueen’67, Lloyd’57/’82) Each cluster is represented by the center of the cluster Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows Select K points as initial centroids Repeat Form K clusters by assigning each point to its closest centroid Re-compute the centroids (i.e., mean point) of each cluster Until convergence criterion is satisfied Different kinds of measures can be used Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
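A minimal NumPy sketch of the Lloyd iteration described above (select initial centroids, assign, re-compute, repeat); this is an illustration only and assumes Euclidean distance.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select K points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute the centroid (mean point) of each cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids (and hence the assignment) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")
```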
73
Example: K-Means Clustering
Execution of the K-Means clustering algorithm on an example dataset: start from the original data points and randomly select K = 2 centroids, assign points to clusters, recompute the cluster centers, and redo the point assignment, repeating until the convergence criterion is satisfied.
74
Example: Poor Initialization May Lead to Poor Clustering
Rerun of K-Means on the same data points using another random selection of K initial centroids (assign points to clusters, recompute cluster centers, and so on): this run of K-Means generates a poor-quality clustering.
75
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
76
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
77
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
78
Overcoming K-means Limitations
Original Points vs. K-means Clusters. One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put back together.
79
Overcoming K-means Limitations
Original Points vs. K-means Clusters. One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put back together.
80
Overcoming K-means Limitations
Original Points vs. K-means Clusters. One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put back together.
81
k-means: the k-means method is conceptually and practically quite simple, and it is scalable and efficient when processing large amounts of data, but it also has some weaknesses:
it cannot handle categorical attributes; its cluster centers are easily affected by noise and outliers; it is sensitive to the choice of the initial cluster centers; and it is difficult to decide the number of clusters. Because k-means uses the centroid (mean) of a cluster as the cluster center, no centroid can be computed for datasets described by categorical attributes.
82
k-Means variations: a drawback of the k-Means algorithm is that the number of cluster centers k must be given in advance, but in practice this value of k is very hard to estimate.
The basic idea of k-Means++ for choosing the initial seeds is that the initial cluster centers should be as far away from one another as possible. The k-Means algorithm is also sensitive to outliers; to address this problem, the k-medoids algorithm is introduced.
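A sketch of the k-means++ seeding idea described above (each new center is drawn with probability proportional to the squared distance to the nearest center already chosen); scikit-learn's KMeans uses this scheme by default via `init='k-means++'`.

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Next center: drawn with probability proportional to d2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0]])
print(kmeans_pp_seeds(X, k=3))
```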
83
K-means example
84
K-means example
85
K-Medians: Handling Outliers by Computing Medians
Medians are less sensitive to outliers than means: think of the median salary vs. the mean salary of a large firm when adding a few top executives! K-Medians: instead of taking the mean value of the objects in a cluster as a reference point, medians are used (with the L1 norm as the distance measure). The criterion function for the K-Medians algorithm is S = Σ over clusters k of Σ over points x in Ck of Σ over dimensions j of |x_j − med_kj|, where med_kj is the median of the j-th feature in cluster Ck. The K-Medians clustering algorithm: select K points as the initial representative objects (i.e., as the initial K medians); repeat: assign every point to its nearest median, and re-compute the median of each individual feature; until the convergence criterion is satisfied.
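A sketch of the K-Medians loop described above: assign by L1 distance, then re-compute each cluster representative feature-by-feature with the median (illustrative only; the outlier in the sample data barely moves the medians).

```python
import numpy as np

def kmedians(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    medians = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest median under the L1 (Manhattan) norm
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the median of each individual feature per cluster
        new_medians = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else medians[j]
            for j in range(k)
        ])
        if np.allclose(new_medians, medians):
            break
        medians = new_medians
    return labels, medians

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.5, 7.5], [100.0, 1.0]])  # note the outlier
print(kmedians(X, k=2))
```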
86
K-Modes: Clustering Categorical Data
K-Means cannot handle non-numerical (categorical) data Mapping categorical value to 1/0 cannot generate quality clusters for high- dimensional data K-Modes: An extension to K-Means by replacing means of clusters with modes This dissimilarity measure (distance function) is frequency-based
87
PAM: PAM is one of the earliest k-medoids algorithms. It uses the most centrally located object of each cluster as the representative object and tries to produce k partitions of the n objects.
The representative objects are also called medoids; the other objects are called non-representative objects. Initially, k objects are chosen at random as medoids; the algorithm then repeatedly replaces a medoid with a non-representative object, trying to find better medoids and improve the quality of the clustering. The cost function of PAM can be understood as follows: compute each point's distance to its nearest medoid; compute each point's distance to its nearest medoid after the replacement; and compare the total distance after the replacement with the total distance before it.
88
Handling Outliers: From K-Means to K-Medoids
The K-Means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; a medoid is the most centrally located object in a cluster. The K-Medoids clustering algorithm: select K points as the initial representative objects (i.e., as the initial K medoids); repeat: assign each point to the cluster with the closest medoid, randomly select a non-representative object oi, compute the total cost S of swapping the medoid m with oi, and if S < 0, swap m with oi to form the new set of medoids; until the convergence criterion is satisfied.
89
PAM: A Typical K-Medoids Algorithm
With K = 2: arbitrarily choose K objects as the initial medoids; assign each remaining object to its nearest medoid; randomly select a non-medoid object Orandom; compute the total cost of swapping a medoid O with Orandom; and swap O and Orandom if the quality is improved. In summary: select the initial K medoids randomly; repeat: re-assign the objects, and swap medoid m with oi if it improves the clustering quality; until the convergence criterion is satisfied.
90
Total swapping Cost (TCih)
Compute the cost of the replacement: TCih = Σj Cjih, where Cjih is the cost of swapping i with h for non-medoid object j. Cjih will vary depending on the case.
91
PAM (Partitioning Around Medoids)
Use real objects to represent the clusters (called medoids). 1. Select k representative objects arbitrarily. 2. For each pair of a selected object i and a non-selected object h, calculate the total swapping cost TCih. 3. For each pair of i and h, if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object. 4. Repeat steps 2–3 until there is no change in the medoids or in TCih.
92
PAM or K-Medoids: Example
Data objects O1–O10 with attributes A1 and A2 (coordinates shown in the slide table). Goal: create two clusters. Randomly choose two medoids: O8 = (7,4) and O2 = (3,4).
93
PAM or K-Medoids: Example
Assign each object to the closest representative object. Using the L1 metric (Manhattan distance), we form the following clusters: Cluster1 = {O1, O2, O3, O4} and Cluster2 = {O5, O6, O7, O8, O9, O10}.
94
PAM or K-Medoids: Example
Compute the absolute-error criterion for the set of medoids (O2, O8): E = Σ from i = 1 to k of Σ over p in Ci of |p − Oi| = |O1 − O2| + |O3 − O2| + |O4 − O2| + |O5 − O8| + |O6 − O8| + |O7 − O8| + |O9 − O8| + |O10 − O8|.
95
PAM or K-Medoids: Example
The absolute-error criterion for the set of medoids (O2, O8): E = (3 + 4 + 4) + (the distances of O5, O6, O7, O9, O10 to O8) = 11 + 9 = 20.
96
PAM or K-Medoids: Example
Choose a random object, O7. Swap O8 and O7 and compute the absolute-error criterion for the set of medoids (O2, O7): E = (3 + 4 + 4) + (the distances of O5, O6, O8, O9, O10 to O7) = 11 + 11 = 22.
97
PAM or K-Medoids: Example
Compute the cost function: S = absolute error for (O2, O7) − absolute error for (O2, O8) = 22 − 20 = 2. Since S > 0, it is a bad idea to replace O8 by O7.
98
PAM or K-Medoids: Example
Per-object swap costs for replacing medoid O8 with O7: C6,8,7 = d(O7,O6) − d(O8,O6) = 2 − 1 = 1; C5,8,7 = d(O7,O5) − d(O8,O5) = 2 − 3 = −1; C9,8,7 = d(O7,O9) − d(O8,O9) = 3 − 2 = 1; C10,8,7 = d(O7,O10) − d(O8,O10) = 3 − 2 = 1; C1,8,7 = 0, C3,8,7 = 0, C4,8,7 = 0. Therefore TC87 = Σj Cj,8,7 = 1 − 1 + 1 + 1 = 2 = S.
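The values on these slides (E = 20 for medoids (O2, O8), E = 22 for (O2, O7), swap cost S = 2) can be checked numerically. The coordinates below are assumptions: the slide's own data table did not survive extraction, but these coordinates are the ones commonly used with this example and they reproduce every distance quoted above.

```python
# Assumed coordinates (A1, A2) for O1..O10; they reproduce the slide's distances
points = {
    "O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
    "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6),
}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def absolute_error(medoids):
    # Each non-medoid contributes its L1 distance to the nearest medoid
    total = 0
    for name, p in points.items():
        if name not in medoids:
            total += min(manhattan(p, points[m]) for m in medoids)
    return total

e_28 = absolute_error({"O2", "O8"})   # expected 20
e_27 = absolute_error({"O2", "O7"})   # expected 22
print(e_28, e_27, "swap cost S =", e_27 - e_28)   # S = 2 > 0: keep O8
```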
99
CLARA (Clustering Large Applications)
CLARA (Kaufmann and Rousseeuw, 1990): it draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness: efficiency depends on the sample size, and a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased. (Figure: samples sample1, sample2, …, samplem are each clustered with PAM, and the best clustering is chosen.)
100
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points Major features: Discover clusters of arbitrary shape Handle noise One scan Need density parameters as termination condition Several interesting studies: DBSCAN: Ester, et al. (KDD’96) OPTICS: Ankerst, et al (SIGMOD’99). DENCLUE: Hinneburg & D. Keim (KDD’98) CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
101
Density-Based Clustering: Basic Concepts
Two parameters: Eps, the maximum radius of the neighbourhood, and MinPts, the minimum number of points in an Eps-neighbourhood of a point. NEps(p) = {q in D | dist(p, q) ≤ Eps}. Directly density-reachable: a point p is directly density-reachable from a point q if p belongs to NEps(q) and q satisfies the core-point condition |NEps(q)| ≥ MinPts (example: MinPts = 5, Eps = 1 cm). In other words, if data point p lies within the Eps-neighbourhood of some core object q, then p is said to be directly density-reachable from q.
102
Density-Reachable and Density-Connected
Density-reachable: a point p is density-reachable from a point q if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi. That is, if q is a core object, p1 is directly density-reachable from q, and p is directly density-reachable from p1, then p is density-reachable from q. Density-connected: a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o.
103
DBSCAN: The Algorithm. Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts If p is a core point, a cluster is formed If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database Continue the process until all of the points have been processed
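For reference, the same procedure is available as scikit-learn's DBSCAN; the sketch below (made-up two-moons data) shows the two parameters, `eps` and `min_samples` (Eps and MinPts in the slides' notation), and how noise points are labeled −1.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Hypothetical non-globular data: two interleaving half-circles plus noise
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

model = DBSCAN(eps=0.2, min_samples=5)
labels = model.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points  :", int((labels == -1).sum()))
```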
104
DBSCAN worked example: apply the DBSCAN algorithm to the given data of n = 12 points, each described by attribute 1 and attribute 2 (table shown in the slide), with user inputs ε = 1 and MinPts = 4. The algorithm's steps are as follows.
105
DBSCAN: for each point, count the number of points in its ε-neighbourhood and mark whether it is a core point; for each core point, record the set of points reachable from it (full table in the slide).
The core points and their reachable sets are: point 4 → {1, 3, 4, 5, 10}, point 7 → {2, 6, 7, 8, 11}, and point 10 → {4, 9, 10, 12}.
106
DBSCAN: sample 4 can reach sample 10, and sample 10 can reach {4, 9, 10, 12}, so the reachable set of sample 4 and the reachable set of sample 10 are density-connected and can be merged into one cluster. The final clustering result is that {1, 3, 4, 5, 9, 10, 12} forms one cluster and {2, 6, 7, 8, 11} forms another.
107
DBSCAN Algorithm: Example
Parameters: Eps = 2 cm, MinPts = 3.
108
When DBSCAN Does NOT Work Well
Original points vs. DBSCAN results with (MinPts = 4, Eps = 9.92) and (MinPts = 4, Eps = 9.75): DBSCAN cannot handle varying densities and is sensitive to its parameters.
109
DBSCAN: Sensitive to Parameters
110
Density Based Clustering: Discussion
Advantages Clusters can have arbitrary shape and size Number of clusters is determined automatically Can separate clusters from surrounding noise Can be supported by spatial index structures Disadvantages Input parameters may be difficult to determine In some situations very sensitive to input parameter setting
111
DBSCAN: Sensitive to Parameters
Motivation: very different local densities may be needed to reveal clusters in different regions. Clusters A, B, C1, C2, and C3 cannot all be detected using one global density parameter: a global density parameter can detect either A, B, C or C1, C2, C3. Which density parameter should be used? Solution: use OPTICS.
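scikit-learn's OPTICS implements this idea: instead of a single global Eps it computes a reachability ordering from which clusters of varying density can be extracted. A small sketch with made-up data of two very different densities follows.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Hypothetical data: one dense blob and one much sparser blob
dense, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3, random_state=0)
sparse, _ = make_blobs(n_samples=100, centers=[[6, 6]], cluster_std=1.5, random_state=1)
X = np.vstack([dense, sparse])

# min_samples plays the role of MinPts; xi controls how clusters are extracted
model = OPTICS(min_samples=10, xi=0.05)
labels = model.fit_predict(X)
print("clusters:", sorted(set(labels) - {-1}), "noise points:", int((labels == -1).sum()))

# The reachability plot (reachability_ in cluster order) is what the next slides visualize
print(model.reachability_[model.ordering_][:10])
```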
112
Density-Based Hierarchical Clustering
Observation: dense clusters are completely contained by less dense clusters. Idea: process objects in the "right" order and keep track of the point density in their neighborhood.
113
When OPTICS Works Well Cluster-order of the objects
114
Gaussian Distribution
Bean machine: balls dropped through rows of pins accumulate into a bell shape, illustrating the 1-d Gaussian; a 2-d Gaussian is also shown.
115
Expectation-Maximization (EM): EM is a clustering algorithm based on a set of density distribution functions. Its main idea is the following:
assume the samples follow a Gaussian mixture model; the goal of the algorithm is to determine the parameters of each Gaussian component so as to fit the given data well, and to obtain a fuzzy clustering in which each sample belongs to each Gaussian component with some probability computed from those parameters. Each component is a Gaussian distribution with its own mean and covariance, and the mixture weights, which can be viewed as the prior probabilities of the components, are non-negative and sum to 1.
116
Probability
117
Probability
118
Probability
119
EM algorithm example: suppose there are six people whose heights are listed in the table (person 1: 1.6 m, person 2: 1.9 m, person 3: 1.86 m, person 4: 1.55 m, person 5: 1.62 m, person 6: value shown in the slide).
First make a rough guess of the parameters: male heights follow a normal distribution with mean 1.8 m and variance 0.1, and female heights a normal distribution with mean 1.6 m and variance 0.1. We can then compute, for each person, the probability of being female or male: p1(F) > p1(M), p2(F) < p2(M), p3(F) < p3(M), p4(F) > p4(M), p5(F) > p5(M), p6(F) > p6(M). So persons 1, 4, 5, 6 are grouped as female and persons 2, 3 as male. Based on this grouping, the true heights of 2 and 3 are used to update the parameters of the male height distribution, and the true heights of 1, 4, 5, 6 to update the female height distribution. The procedure is repeated until the clustering no longer changes, i.e., until convergence.
120
Partitional Clustering
How do we partition a space to make the best clusters? Proximity to a cluster centroid.
121
Difficult Clusterings
But some clusterings don't lend themselves to a "centroid"-based definition of a cluster. Spectral (graph-based) clustering allows us to address these sorts of clusters. These kinds of clusters are defined by points that are close to some member of the cluster, rather than to the average member of the cluster.
122
Graph Representation We can represent the relationships between data points in a graph. Weight the edges by the similarity between points
123
Graphs Nodes and Edges Edges can be directed or undirected
Edges can have weights associated with them Here the weights correspond to pairwise affinity
124
Graphs: the degree of node i is di = Σj wij, the sum of the weights of its incident edges; the volume of a set A of nodes is vol(A) = Σ over i in A of di.
125
Graph Cuts: the cut between two subgraphs A and B is calculated as cut(A, B) = Σ over i in A, j in B of wij.
126
Graph cuts: a cut on a graph is defined as a partitioning of the nodes of the graph into two sets. The size of a cut is the number of edges spanning the cut; the minimum cut of a graph identifies an optimal partitioning of the data (e.g., cut size = 2 in the figure). Spectral clustering by recursive partitioning: repeatedly identify the minimum cut, remove its edges, and repeat until k clusters are identified.
127
Graph Cuts Minimum (bipartitional) cut
128
Spectral Clustering Example
Minimum cut on a graph of five points A–E (plotted by height and weight) with pairwise affinity weights on the edges.
129
Spectral Clustering Example
Normalized minimum cut on the same five-point graph (A–E with pairwise affinity weights).
130
Spectral Clustering Example
Normalized minimum cut on the same five-point graph (A–E with pairwise affinity weights).
131
Unnormalized Graph Laplacian example
For a 4-node example graph, form the adjacency matrix A and the diagonal degree matrix D, and compute the unnormalized Laplacian L = D − A (the matrices are shown in the slide figure).
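The slide's own 4-node matrices did not survive extraction, so the sketch below uses a hypothetical 4-node graph to show the same construction: adjacency matrix A, degree matrix D, and unnormalized Laplacian L = D − A.

```python
import numpy as np

# Hypothetical undirected 4-node graph: edges (1,2), (2,3), (3,4), (2,4)
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
])

D = np.diag(A.sum(axis=1))   # degree matrix: node degrees on the diagonal
L = D - A                    # unnormalized graph Laplacian

print("degrees:", np.diag(D))
print(L)
# L is symmetric, its rows sum to zero, and it is positive semi-definite
print("row sums:", L.sum(axis=1))
print("eigenvalues:", np.round(np.linalg.eigvalsh(L), 3))
```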
132
Spectral Clustering: construct an affinity matrix from the pairwise similarities between points A, B, C, D (edge weights such as .1–.6 are shown in the slide figure).
133
Laplacian Matrix Matrix representation of a graph
D is a normalization factor for affinity matrix A Different Laplacians are available The most important application of the Laplacian is spectral clustering that corresponds to a computationally tractable solution to the graph partitioning problem
134
A few linear algebra concepts
Let A be an n×n matrix and let A' denote the transpose of A. Let R^n denote the space of n×1-dimensional real-valued vectors. Eigenvector of A: v in R^n is an eigenvector (n×1) of the matrix A with eigenvalue λ (a scalar) if Av = λv. Positive semi-definite: A is said to be positive semi-definite if, for any x in R^n, x'Ax ≥ 0.
135
Laplacian Matrix: for a good clustering, we expect the Laplacian matrix to be (approximately) block diagonal.
136
Spectral clustering key steps
Key steps: build the N×N adjacency (affinity) matrix A, form the N×N Laplacian L, compute the first k eigenvectors (an N×k matrix), and cluster its N rows into k clusters.
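A sketch of these key steps on a hypothetical affinity matrix: form L, take the eigenvectors belonging to the smallest eigenvalues, and run K-Means on their rows (scikit-learn's SpectralClustering wraps essentially the same pipeline).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical symmetric affinity matrix W for 6 points (two obvious groups)
W = np.array([
    [0.0, 0.8, 0.9, 0.0, 0.0, 0.1],
    [0.8, 0.0, 0.7, 0.0, 0.1, 0.0],
    [0.9, 0.7, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.7],
    [0.1, 0.0, 0.0, 0.8, 0.7, 0.0],
])
k = 2

D = np.diag(W.sum(axis=1))
L = D - W                                   # unnormalized Laplacian

# Eigenvectors of L for the k smallest eigenvalues (eigh returns them sorted)
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, :k]                          # N x k spectral embedding

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(labels)   # points 0-2 and 3-5 should end up in different clusters
```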
137
Spectral clustering
138
Spectral clustering: given the similarity matrix of six points (shown in the slide).
139
Spectral clustering: compute the degree matrix D.
140
Spectral clustering: compute D by summing each row of the similarity matrix.
141
Spectral clustering: compute the Laplacian L = D − W.
142
Spectral clustering: compute the eigenvalues and eigenvectors of L.
143
Spectral clustering: the resulting eigenvalues and eigenvectors (shown in the slide).
144
Spectral clustering
145
Spectral clustering
146
Spectral clustering
147
Spectral clustering
148
Spectral clustering: recompute the mean of each cluster (the centers of clusters 1, 2, and 3 in the embedded space). B is closest to A, so it joins A's cluster;
D is closest to A, so it also joins A's cluster; E is closest to C, so it joins C's cluster. Result: cluster 1 = {A, B, D}, cluster 2 = {C, E}, cluster 3 = {F}.
149
k-means vs. Spectral Clustering
150
Overall Framework of CHAMELEON
Framework: data set → construct a sparse K-NN graph → partition the graph → merge the partitions → final clusters. K-NN graph: points p and q are connected if q is among the top-k closest neighbors of p. Relative interconnectivity: the connectivity of c1 and c2 relative to their internal connectivity. Relative closeness: the closeness of c1 and c2 relative to their internal closeness.
151
KNN Graphs and Interconnectivity
K-nearest neighbor (KNN) graphs built from original data in 2D:
152
CHAMELEON: Clustering Complex Objects
CHAMELEON is capable of generating quality clusters when clustering complex objects.
153
Using minimum cut for semi-supervised classification?
Construct a graph representation of the unseen data. Insert imaginary nodes s and t connected to the labeled points with infinite similarity. Treat the min cut as a maximum-flow problem from s to t.
154
Kernel K-Means Clustering
Kernel K-Means can be used to detect non-convex clusters K-Means can only detect clusters that are linearly separable Idea: Project data onto the high-dimensional kernel space, and then perform K-Means clustering Map data points in the input space onto a high- dimensional feature space using the kernel function Perform K-Means on the mapped feature space Computational complexity is higher than K-Means Need to compute and store n x n kernel matrix generated from the kernel function on the original data The widely studied spectral clustering can be considered as a variant of Kernel K-Means clustering
155
Kernel Functions and Kernel K-Means Clustering
Kernel matrix: the inner-product matrix K(i, j) (not a similarity-distance matrix). Typical kernel functions: polynomial kernel of degree h, K(Xi, Xj) = (Xi·Xj + c)^h; Gaussian radial basis function (RBF) kernel, K(Xi, Xj) = exp(−||Xi − Xj||² / (2σ²)); sigmoid kernel, K(Xi, Xj) = tanh(κXi·Xj − δ). For any two points xi, xj in cluster Ck the kernel matrix entry is K(xi, xj) = φ(xi)·φ(xj), where φ is the feature map; the SSE criterion of kernel K-Means is Σ over clusters k of Σ over xi in Ck of ||φ(xi) − ck||², and the cluster centroid in the kernel space is ck = (1/|Ck|) Σ over xi in Ck of φ(xi).
156
Example: Kernel Functions and Kernel K-Means Clustering
Gaussian radial basis function (RBF) kernel: K(Xi, Xj) = exp(−||Xi − Xj||² / (2σ²)). Suppose there are 5 original 2-dimensional points: x1(0, 0), x2(4, 4), x3(−4, 4), x4(−4, −4), x5(4, −4). If we set σ to 4, we obtain their images in the kernel space. For example, ||x1 − x2||² = (0 − 4)² + (0 − 4)² = 32, therefore K(x1, x2) = e^(−32 / (2·4²)) = e^(−1). In the resulting kernel matrix, K(xi, xi) = 1, K(x1, xj) = e^(−1) for j = 2, …, 5, K = e^(−2) between points that differ in one coordinate (e.g., x2 and x3), and K = e^(−4) between diagonally opposite points (e.g., x2 and x4).
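The kernel-space values in this example can be checked directly; the sketch below computes the full 5×5 RBF kernel matrix for the points above with σ = 4 and reproduces K(x1, x2) = e^(−1).

```python
import numpy as np

X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
sigma = 4.0

# Squared Euclidean distances between all pairs of points
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Gaussian RBF kernel: K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.round(K, 4))
print("K(x1, x2) =", K[0, 1], "= e^-1 =", np.exp(-1.0))
```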
157
Example: Kernel K-Means Clustering
The original data set contains non-convex clusters, so K-Means alone cannot generate quality clusters on it. The Gaussian RBF kernel transformation, K(Xi, Xj) = exp(−||Xi − Xj||² / (2σ²)), maps the data to a kernel matrix; K-Means clustering is then conducted on the mapped data, generating quality clusters (compare the plain K-Means result with the Gaussian kernel K-Means result shown in the slide figures).
158
Matching-Based Measures (I): Purity vs. Maximum Matching
Purity: quantifies the extent to which cluster Ci contains points from only one ground-truth partition: purityi = (1/ni) maxj nij, and purity = Σi (ni/n) purityi. Perfect clustering has purity = 1 (and the number of clusters obtained is the same as in the ground truth). Example (green contingency table of clusters C1–C3 vs. ground-truth partitions T1–T3): C1 = (20, 30, 0), C2 = (20, 5, 0), C3 = (0, 0, 25); column sums 40, 35, 25; n = 100. Then purity1 = 30/50, purity2 = 20/25, purity3 = 25/25, and purity = (30 + 20 + 25)/100 = 0.75. Maximum matching: only one cluster can be matched to one partition (pairwise matching with weights w(eij) = nij; take the maximum-weight matching). For the green table, match = purity = 0.75. For the orange table, where two clusters have their largest overlap with the same partition, the maximum matching is smaller than the purity: it is greater than 0.6 but below 0.75.
159
Matching-Based Measures (II): F-Measure
Precision: the fraction of points in Ci that come from its majority partition Tji (i.e., the same as purity): preci = niji / ni, where ji is the partition that contains the maximum number of points from Ci. For the green table: prec1 = 30/50, prec2 = 20/25, prec3 = 25/25. Recall: the fraction of points of partition Tji shared in common with cluster Ci: recalli = niji / mji, where mji = |Tji|; here recall1 = 30/35, recall2 = 20/40, recall3 = 25/25. F-measure for Ci: the harmonic mean of preci and recalli, Fi = 2·preci·recalli / (preci + recalli) = 2·niji / (ni + mji). F-measure for the clustering C: the average over all clusters, F = (1/k) Σi Fi. Here F1 = 60/85, F2 = 40/65, F3 = 1, and F = 0.774.
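The purity and F-measure numbers above can be recomputed from the contingency table (rows = clusters C1–C3, columns = ground-truth partitions T1–T3, as reconstructed from the slide's counts).

```python
import numpy as np

# Contingency table n_ij: clusters C1..C3 (rows) vs. partitions T1..T3 (columns)
N = np.array([
    [20, 30,  0],   # C1
    [20,  5,  0],   # C2
    [ 0,  0, 25],   # C3
])
n = N.sum()
cluster_sizes = N.sum(axis=1)     # |C_i|
partition_sizes = N.sum(axis=0)   # |T_j|

purity = N.max(axis=1).sum() / n                       # 0.75
precision = N.max(axis=1) / cluster_sizes              # 30/50, 20/25, 25/25
recall = N.max(axis=1) / partition_sizes[N.argmax(axis=1)]
f_per_cluster = 2 * precision * recall / (precision + recall)

print("purity   :", purity)
print("precision:", precision)
print("recall   :", recall)
print("F_i      :", f_per_cluster)          # 60/85, 40/65, 1
print("F        :", f_per_cluster.mean())   # ~0.774
```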
160
Cluster Stability Clusterings obtained from several datasets sampled from the same underlying distribution as D should be similar or “stable” Typical approach: Find good parameter values for a given clustering algorithm Example: Find a good value of k, the correct number of clusters A bootstrapping approach to find the best value of k (judged on stability) Generate t samples of size n by sampling from D with replacement For each sample Di, run the same clustering algorithm with k values from 2 to kmax Compare the distance between all pairs of clusterings Ck(Di) and Ck(Dj) via some distance function Compute the expected pairwise distance for each value of k
161
Other Methods for Finding K, the Number of Clusters
Empirical method: set the number of clusters to about k ≈ √(n/2) for a dataset of n points (e.g., n = 200 gives k = 10). Elbow method: use the turning point in the curve of the sum of within-cluster variance with respect to the number of clusters. Cross-validation method: divide a given data set into m parts; use m − 1 parts to obtain a clustering model; use the remaining part to test the quality of the clustering (for example, for each point in the test set, find the closest centroid); for any k > 0, repeat this m times, compare the overall quality measure w.r.t. different k's, and find the number of clusters that fits the data best.
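A sketch of the elbow method mentioned above: run K-Means for a range of k and look for the turning point in the within-cluster sum of squares, which scikit-learn exposes as `inertia_` (the data here are made up).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 4 true clusters
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=0)

sse = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_          # sum of squared distances to the closest centroid

for k, v in sse.items():
    print(f"k = {k}: SSE = {v:10.1f}")
# The curve typically flattens sharply (the "elbow") near the true number of clusters.
```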