Intro. to Data Mining, Chapter 3.2: Clustering

What is Cluster Analysis? Cluster: A collection of data objects similar (or related) to one another within the same group dissimilar (or unrelated) to the objects in other groups Cluster analysis (or clustering, data segmentation, …) Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms

Clustering analysis: What is a natural grouping among characters? Segmenting characters into groups is subjective: the same characters could equally be grouped into villains vs. heroes, or into males vs. females.

Quality: What Is Good Clustering? A good clustering method will produce high quality clusters high intra-class similarity: cohesive within clusters low inter-class similarity: distinctive between clusters The quality of a clustering method depends on the similarity measure used by the method Its ability to discover some or all of the hidden patterns

Quality: What Is Good Clustering? Exercise: the red samples form one group and the black samples form another; compute the within-group variation and the between-group variation. Sample data (index: attribute 1, attribute 2): 1: (1, 1), 2: (2, 1), 3: (1, 2), 4: (2, 2), 5: (4, 3), 6: (5, 3), 7: (4, 4), 8: (5, 4). Within-group variation: ___ Between-group variation: ___
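To make the exercise concrete, here is a minimal Python sketch (not part of the original slides). It assumes the two groups are samples 1-4 and samples 5-8, that within-group variation is the average Euclidean distance from each point to its group centroid, and that between-group variation is the distance between the two centroids; other definitions (e.g. sums of squares) would work the same way.

```python
import numpy as np

# Sample points from the exercise (attribute 1, attribute 2); assumed split:
# group 1 = samples 1-4, group 2 = samples 5-8
g1 = np.array([[1, 1], [2, 1], [1, 2], [2, 2]], dtype=float)
g2 = np.array([[4, 3], [5, 3], [4, 4], [5, 4]], dtype=float)

def within_variation(group):
    """Average Euclidean distance from each point to the group centroid."""
    centroid = group.mean(axis=0)
    return np.linalg.norm(group - centroid, axis=1).mean()

print(within_variation(g1))                               # ~0.707
print(within_variation(g2))                               # ~0.707
print(np.linalg.norm(g1.mean(axis=0) - g2.mean(axis=0)))  # centroid distance ~3.606
```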

Applications: the roles cluster analysis plays in the data mining process: data reduction, hypothesis generation, hypothesis verification, and membership prediction. Data reduction: group a large data set into several clusters and pick representative records from each cluster for subsequent processing. Hypothesis generation: infer characteristics or phenomena that may exist in the data of interest, e.g. "young people usually have lower annual incomes" or "middle-aged people usually have higher annual incomes". Hypothesis verification: validate such inferred hypotheses; for example, to test whether "young people usually have lower annual incomes, and therefore lower spending power", one can cluster the records described by "age", "annual income", and "spending amount". Membership prediction: apply the clustering result to records whose class is unknown and predict the cluster each record belongs to.

Applications: domains where cluster analysis is used: transaction and behaviour analysis (understand the behaviour patterns of different types of users), spatial data analysis (help users automatically analyse the images produced by image databases and understand the features and phenomena of interest), and document management (organize documents into categories to support document management and retrieval).

Workflow: the five sequential tasks of cluster analysis: data representation (find representative dimensions/features to describe the data points), similarity computation and measurement (compute how similar the data points are), choice of clustering method (select an appropriate clustering algorithm), evaluation of the clustering result, and interpretation of the clusters (domain experts further interpret the clustering result).

Measure the Quality of Clustering: Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j). The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables. Weights should be associated with different variables based on the application and data semantics.

Major Clustering Approaches (I): Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors; typical methods: k-means, k-medoids, CLARANS. Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion; typical methods: DIANA, AGNES, BIRCH, CHAMELEON. Density-based approach: based on connectivity and density functions; typical methods: DBSCAN, OPTICS, DenClue. Grid-based approach: based on a multiple-level granularity structure; typical methods: STING, WaveCluster, CLIQUE.

Major Clustering Approaches (II): Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model; typical methods: EM, SOM, COBWEB. Frequent-pattern-based: based on the analysis of frequent patterns; typical method: p-Cluster. Link-based clustering: objects are often linked together in various ways, and massive links can be used to cluster objects: PageRank, spectral clustering.

Hierarchical clustering - distance

Hierarchical clustering - correlation

Hierarchical clustering - category

Agglomerative and Divisive: Agglomerative hierarchical clustering (AGNES) follows a bottom-up strategy: each cluster starts with only one object, and clusters are merged into larger and larger clusters. Divisive hierarchical clustering (DIANA) follows a top-down strategy: start with all objects in one cluster, and clusters are subdivided into smaller and smaller clusters. (Figure: the same five objects a-e processed agglomeratively in one direction and divisively in the other over steps 0-4.)

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering: No assumptions on the number of clusters; any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. Hierarchical clusterings may correspond to meaningful taxonomies; examples occur in the biological sciences (e.g., phylogeny reconstruction), on the web (e.g., product catalogs), etc.

Input / Initial setting: Start with clusters of individual points (p1, p2, p3, ...) and a distance/proximity matrix.

Intermediate State: After some merging steps, we have some clusters (C1, ..., C5) and an updated distance/proximity matrix.

Intermediate State: Merge the two closest clusters (C2 and C5) and update the distance matrix.

After Merging: How do we update the distance matrix? The rows and columns for C2 and C5 are replaced by a single row/column for the merged cluster C2 U C5, and its distances to C1, C3, and C4 must be recomputed.

Distance Between Clusters. First measure: minimum distance, where |p - p'| is the distance between two objects p and p'. Use cases: an algorithm that uses the minimum distance to measure the distance between clusters is sometimes called a nearest-neighbor clustering algorithm. If the clustering process terminates when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.

Distance Between Clusters. Second measure: maximum distance, where |p - p'| is the distance between two objects p and p'. Use cases: an algorithm that uses the maximum distance to measure the distance between clusters is sometimes called a farthest-neighbor clustering algorithm. If the clustering process terminates when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.

Distance Between Clusters. Minimum and maximum distances are extreme measures, implying that they are overly sensitive to outliers or noisy data. Third measure: mean distance, where mi and mj are the means of clusters Ci and Cj respectively. Fourth measure: average distance, where |p - p'| is the distance between two objects p and p', and ni and nj are the numbers of objects in clusters Ci and Cj respectively. The mean is difficult to compute for categorical data.
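For reference, the four inter-cluster distance measures described in the last three slides can be written out explicitly (standard definitions matching the prose above; m_i and m_j denote the cluster means, n_i and n_j the cluster sizes):

```latex
\begin{aligned}
d_{\min}(C_i, C_j)          &= \min_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert \\
d_{\max}(C_i, C_j)          &= \max_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert \\
d_{\mathrm{mean}}(C_i, C_j) &= \lVert m_i - m_j \rVert \\
d_{\mathrm{avg}}(C_i, C_j)  &= \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} \lVert p - p' \rVert
\end{aligned}
```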

Distance between two clusters: Ward's distance (the minimum variance method) between clusters Ci and Cj is the difference between the within-cluster sum of squares of the merged cluster Cij and the total within-cluster sum of squares of the two clusters taken separately: d(Ci, Cj) = sum over p in Cij of |p - rij|^2 minus sum over p in Ci of |p - ri|^2 minus sum over p in Cj of |p - rj|^2, where ri is the centroid of Ci, rj the centroid of Cj, and rij the centroid of Cij.

AGNES: cluster the following data set with the AGNES algorithm. Pairwise distance matrix for the sample points A-E:
Sample   A     B     C     D     E
A        0    0.4   2.0   2.5   3.0
B       0.4    0    1.6   2.1   1.9
C       2.0   1.6    0    0.6   0.8
D       2.5   2.1   0.6    0    1.0
E       3.0   1.9   0.8   1.0    0

AGNES: cluster the data set with the AGNES algorithm, using the minimum (single-link) distance between clusters; the dendrogram is built level by level (l = 0 ... 4) over the points A, B, C, D, E. The smallest pairwise distance is d(A, B) = 0.4, so A and B are merged first.

AGNES: after merging A and B, the reduced distance matrix over {AB, C, D, E} is d(AB, C) = 1.6, d(AB, D) = 2.1, d(AB, E) = 1.9, d(C, D) = 0.6, d(C, E) = 0.8, d(D, E) = 1.0; the smallest value is 0.6, so C and D are merged next.

AGNES: after merging C and D, the reduced distance matrix over {AB, CD, E} is d(AB, CD) = 1.6, d(AB, E) = 1.9, d(CD, E) = 0.8; the smallest value is 0.8, so CD and E are merged next.

AGNES: after merging CD and E, only two clusters remain, AB and CDE, with d(AB, CDE) = 1.6; the final merge joins them at the top level, completing the dendrogram over A, B, C, D, E.
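The whole AGNES run above can be reproduced with SciPy's hierarchical clustering routines. The sketch below (not from the slides) uses the distance matrix as reconstructed above and single linkage:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

points = ["A", "B", "C", "D", "E"]
# Pairwise distance matrix recovered from the AGNES example above
D = np.array([
    [0.0, 0.4, 2.0, 2.5, 3.0],
    [0.4, 0.0, 1.6, 2.1, 1.9],
    [2.0, 1.6, 0.0, 0.6, 0.8],
    [2.5, 2.1, 0.6, 0.0, 1.0],
    [3.0, 1.9, 0.8, 1.0, 0.0],
])

# Single-linkage agglomeration on the condensed distance matrix
Z = linkage(squareform(D), method="single")
print(Z)  # rows give the merged cluster ids, the merge distances 0.4, 0.6, 0.8, 1.6, and sizes

# Cutting the dendrogram into two clusters recovers {A, B} and {C, D, E}
print(dict(zip(points, fcluster(Z, t=2, criterion="maxclust"))))
```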

DIANA: cluster the same data set with the DIANA algorithm (sample points A-E with the pairwise distance matrix given in the AGNES example above).

DIANA: select, within the largest cluster, the point with the largest average dissimilarity to the remaining points; it starts the splinter group. Here the splinter group is {A} and the old party group is {B, C, D, E}.

DIANA: with splinter group {A} and old party group {B, C, D, E}, for each point in the old party compute D1, its average distance to the points in the splinter group, and D2, its average distance to the other points in the old party, and then the difference D2 - D1. For B, D2 is about 1.8 and D2 - D1 is about 1.4, the only positive difference, so B is moved to the splinter group.

DIANA: splinter group {A, B}, old party group {C, D, E}. For each point in the old party, compute D1 (the average distance to the splinter group) and D2 (the average distance to the rest of the old party) and the difference D2 - D1. Since every D2 - D1 is negative, no further old-party point is assigned to the splinter group, and this split is finished.

DIANA: the cluster ABCDE has been split into AB and CDE. To continue, select within the largest remaining cluster the point with the largest average dissimilarity.

DIANA: select the cluster with the largest diameter, CDE, and split it into CD and E (hierarchy so far: ABCDE into AB and CDE; CDE into CD and E).

DIANA: select the cluster with the largest diameter, CD, and split it into C and D; every object is now a singleton cluster (ABCDE into AB and CDE; CDE into CD and E; CD into C and D).

DIANA vs. AGNES: comparison of the results of the DIANA and AGNES algorithms on the points A-E; in this example both algorithms produce the same hierarchy, with {A, B} and {C, D, E} at the two-cluster level.

Hierarchical Agglomerative Clustering. Example: given 5 data objects A, B, C, D, E and their pairwise distance matrix.

Hierarchical Agglomerative Clustering: update the distance matrix using the single-linkage function.

Hierarchical Agglomerative Clustering: update the distance matrix using the single-linkage function.

Hierarchical Agglomerative Clustering: How do we decide the number of clusters? Cut the dendrogram: cutting just below the merge at distance 3.2 gives K = 2, below 2.06 gives K = 3, below 1 gives K = 4, and below 0.71 each of A, B, C, D, E is its own cluster.

Strength of MIN in hierarchical clustering

Limitation of MIN in hierarchical clustering

Strength of MAX in hierarchical clustering

Limitation of MAX in hierarchical clustering

Ward's method: Ward's method is also called the minimum variance method. It starts by treating every individual as its own cluster and then merges clusters step by step; the order of merging is determined entirely by the increase in the total within-cluster variance that the merge would cause. The pair of clusters whose merge produces the smallest increase in total within-cluster variance is merged first, and objects that are merged earlier are the most similar to each other.

Ward's method: within-cluster variance of the i-th cluster. I. Initially, each individual is its own cluster. II. At each stage, compute the within-cluster error sum of squares, i.e. the sum of squared Euclidean distances from each member to the cluster centroid: for the i-th cluster, ESS_i = sum over x in C_i of ||x - mean(C_i)||^2.

Ward's method: the total within-cluster variance is ESS = sum over i of ESS_i. III. Merge the two clusters whose merge increases ESS the least.

Ward's method: worked example.

Ward's method: worked example, Step 4.

Ward's method: the total ESS after the successive merges in the example: 6.5, then 6.5 + 14 (about 20.4), then 20.4 + 53 = 73.4.

Hierarchical clustering example

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies): According to a user-specified coverage threshold for a cluster (e.g. the cluster radius), the data points are first assigned incrementally to many small, highly similar sub-clusters. These sub-clusters are then used as the basic units and are repeatedly merged into larger clusters in a manner similar to agglomerative hierarchical clustering. Each sub-cluster is summarized by a clustering feature (CF), so the algorithm never has to process all of the raw data points directly, which makes it very memory-efficient. To speed up the assignment of points to their sub-clusters, BIRCH dynamically builds a structure similar to a B+ tree. Weakness: handles only numeric data, and is sensitive to the order of the data records.

Clustering Feature Vector in BIRCH: Clustering Feature (CF): CF = (N, LS, SS), where N is the number of data points, LS is the linear (component-wise) sum of the N points, and SS is the square sum of the N points. Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).

Clustering Feature Vector in BIRCH: CF vectors are additive. Cluster 1 contains the points (2,5), (3,2), (4,3), so CF1 = (3, (2+3+4, 5+2+3), (2^2+3^2+4^2, 5^2+2^2+3^2)) = (3, (9,10), (29,38)). Cluster 2 has CF2 = (3, (35,36), (417,440)). Merging them into Cluster 3 simply adds the features: CF3 = CF1 + CF2 = (3+3, (9+35, 10+36), (29+417, 38+440)) = (6, (44,46), (446,478)).
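A tiny Python sketch (illustrative only) of how CF triples are computed and how their additivity lets BIRCH merge sub-clusters without revisiting the raw points:

```python
import numpy as np

def clustering_feature(points):
    """BIRCH clustering feature CF = (N, LS, SS) of a set of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf_a, cf_b):
    """CF vectors are additive: merging two sub-clusters just adds the components."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])           # (3, [9, 10], [29, 38])
cf2 = (3, np.array([35.0, 36.0]), np.array([417.0, 440.0]))  # CF of cluster 2, as given
print(merge_cf(cf1, cf2))                                    # (6, [44, 46], [446, 478])
```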

The CF-tree

BIRCH: the idea by example. Data objects 1-6 are inserted one at a time while a CF-tree is built. Object 1 starts cluster 1 in a leaf node; if cluster 1 becomes too large by adding object 2, the cluster is split.

BIRCH: the idea by example. The leaf node now holds two entries (entry 1 for cluster 1, entry 2 for cluster 2).

BIRCH: the idea by example. Entry 1 is the closest to object 3; if cluster 1 becomes too large by adding object 3, split the cluster.

BIRCH: the idea by example. The leaf node now holds three entries (clusters 1, 3, and 2).

BIRCH: the idea by example. Entry 3 is the closest to object 4; cluster 2 remains compact when adding object 4, so object 4 is added to cluster 2.

BIRCH: the idea by example. Entry 2 is the closest to object 5; cluster 3 becomes too large by adding object 5, so it should be split. BUT there is a limit to the number of entries a node can have, so the leaf node itself is split.

BIRCH: the idea by example. After the split there is a non-leaf node with entries 1 and 2, each pointing to a leaf node (entries 1.1, 1.2 and 2.1, 2.2) holding clusters 1, 3, 4, and 2.

BIRCH: the idea by example. Entry 1.2 is the closest to object 6; cluster 3 remains compact when adding object 6, so object 6 is added to cluster 3.

BIRCH: the idea by example (branching factor B = 3, leaf capacity L = 3; sub-clusters sc1-sc8 stored under leaf nodes LN1, LN2, LN3 of the root). After inserting a new entry, a node ends up with more children than the branching factor B allows, so the node must be split: the two farthest entries are chosen for the split.

BIRCH: the idea by example (B = 3, L = 3). Choose the two farthest entries as seeds for the split, redistribute the remaining entries to the nearest seed, and update the CF information in every non-leaf node on the path.

Partitioning Algorithms: Basic Concept. Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances E = sum over i = 1..k of the sum over p in C_i of d(p, c_i)^2 is minimized, where c_i is the centroid or medoid of cluster C_i. Given k, the k-means algorithm is implemented in four steps: 1. Partition the objects into k nonempty subsets. 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster). 3. Assign each object to the cluster with the nearest seed point. 4. Go back to Step 2; stop when the assignment does not change.

The K-Means Clustering Method K-Means (MacQueen’67, Lloyd’57/’82) Each cluster is represented by the center of the cluster Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows Select K points as initial centroids Repeat Form K clusters by assigning each point to its closest centroid Re-compute the centroids (i.e., mean point) of each cluster Until convergence criterion is satisfied Different kinds of measures can be used Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
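The outline above translates almost line for line into code. The following is a minimal NumPy sketch of Lloyd-style K-Means (Euclidean distance, convergence when the centroids stop moving); it is an illustration under these assumptions, not a production implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # K points as initial centroids
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean / L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs, K = 2
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
centers, assignment = kmeans(X, k=2)
```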

Example: K-Means Clustering. Execution of the K-Means clustering algorithm: start from the original data points and randomly select K = 2 centroids, assign points to clusters, recompute the cluster centers, redo the point assignment, and so on. (Select K points as initial centroids; repeat: form K clusters by assigning each point to its closest centroid, then re-compute the centroid, i.e. mean point, of each cluster; until the convergence criterion is satisfied.)

Example: Poor Initialization May Lead to Poor Clustering. Rerun K-Means on the same data points with another random selection of K centroids (assign points to clusters, recompute cluster centers, and so on); this run of K-Means generates a poor-quality clustering.

Limitations of K-means: Differing Sizes Original Points K-means (3 Clusters)

Limitations of K-means: Differing Density Original Points K-means (3 Clusters)

Limitations of K-means: Non-globular Shapes Original Points K-means (2 Clusters)

Overcoming K-means Limitations (original points vs. K-means clusters): one solution is to use many clusters; K-means then finds parts of the natural clusters, which need to be put back together.

Overcoming K-means Limitations (original points vs. K-means clusters): one solution is to use many clusters; K-means then finds parts of the natural clusters, which need to be put back together.

Overcoming K-means Limitations (original points vs. K-means clusters): one solution is to use many clusters; K-means then finds parts of the natural clusters, which need to be put back together.

k-means: the k-means method is conceptually and practically simple, and it is scalable and efficient on large data sets, but it also has several weaknesses: it cannot handle categorical attributes; its cluster centers are easily affected by noise and outliers; the result depends on the choice of the initial cluster centers; and the number of clusters is difficult to decide. Because k-means uses the mean (center of mass) of a cluster as the cluster center, no meaningful center of mass can be computed for data described by categorical attributes.

k-Means variations. Drawbacks of k-Means: the number of cluster centers k must be given in advance, and in practice this value is very hard to estimate. The basic idea of the k-Means++ seeding strategy is that the initial cluster centers should be as far away from each other as possible. k-Means is also sensitive to isolated points (outliers); to address this problem, the k-medoids algorithm is introduced.
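A short sketch of the k-Means++ seeding idea mentioned above (initial centers spread far apart); the function name and the use of squared Euclidean distance are illustrative choices, not part of the slides:

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-Means++ style seeding: spread the initial centers far apart by sampling
    each new seed with probability proportional to its squared distance to the
    closest seed chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]  # first seed: chosen uniformly at random
    while len(seeds) < k:
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)

# The returned seeds would then be used as the initial centroids for K-Means.
```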

K-means example

K-means example

K-Medians: Handling Outliers by Computing Medians. Medians are less sensitive to outliers than means; think of the median salary vs. the mean salary of a large firm when a few top executives are added! K-Medians: instead of taking the mean value of the objects in a cluster as a reference point, medians are used (with the L1 norm as the distance measure). The criterion function for the K-Medians algorithm is S = sum over clusters C_i, points x in C_i, and features j of |x_j - med_ij|, where med_i is the vector of per-feature medians of cluster C_i. The K-Medians clustering algorithm: select K points as the initial representative objects (i.e., as the initial K medians); repeat: assign every point to its nearest median, then re-compute each median using the median of each individual feature; until the convergence criterion is satisfied.

K-Modes: Clustering Categorical Data K-Means cannot handle non-numerical (categorical) data Mapping categorical value to 1/0 cannot generate quality clusters for high- dimensional data K-Modes: An extension to K-Means by replacing means of clusters with modes This dissimilarity measure (distance function) is frequency-based

PAM: PAM is one of the earliest k-medoids algorithms. It chooses the most centrally located object of each group as the representative object and tries to partition the n objects into k groups. The representative objects are also called medoids, and the other objects are called non-representative objects. Initially, k objects are chosen at random as medoids; the algorithm then repeatedly tries to replace a medoid with a non-representative object whenever doing so yields better medoids, improving the quality of the clustering. The cost function of PAM can be understood as follows: compute each point's distance to its closest medoid, compute each point's distance to its closest medoid after the proposed swap, and compare the total distance after the swap with the total distance before it.

Handling Outliers: From K-Means to K-Medoids. The K-Means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; a medoid is the most centrally located object in a cluster. The K-Medoids clustering algorithm: select K points as the initial representative objects (i.e., as the initial K medoids); repeat: assign each point to the cluster with the closest medoid, randomly select a non-representative object oi, compute the total cost S of swapping a medoid m with oi, and if S < 0, swap m with oi to form the new set of medoids; until the convergence criterion is satisfied.

PAM: A Typical K-Medoids Algorithm (K = 2): arbitrarily choose K objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object Orandom; compute the total cost of swapping a medoid with Orandom and swap them if the quality is improved. In general: select the initial K medoids randomly; repeat object re-assignment and swap a medoid m with oi whenever it improves the clustering quality; until the convergence criterion is satisfied.

Total swapping Cost (TCih): the cost of the swap is TCih = sum over j of Cjih, where Cjih is the cost contributed by non-medoid object j when medoid i is swapped with h; Cjih varies depending on the case j falls into.

PAM (Partitioning Around Medoids): use real objects (called medoids) to represent the clusters. 1. Select k representative objects arbitrarily. 2. For each pair of a selected object i and a non-selected object h, calculate the total swapping cost TCih. 3. For each pair of i and h, if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object. 4. Repeat steps 2-3 until there is no change in the medoids or in TCih.

PAM or K-Medoids: Example. Data objects (A1, A2): O1 (2, 6), O2 (3, 4), O3 (3, 8), O4 (4, 7), O5 (6, 2), O6 (6, 4), O7 (7, 3), O8 (7, 4), O9 (8, 5), O10 (7, 6). Goal: create two clusters. Randomly choose two medoids: O8 = (7, 4) and O2 = (3, 4).

PAM or K-Medoids: Example. Assign each object to the closest representative object. Using the L1 metric (Manhattan distance), we form the following clusters: Cluster1 = {O1, O2, O3, O4} and Cluster2 = {O5, O6, O7, O8, O9, O10}.

PAM or K-Medoids: Example. Compute the absolute error criterion for the set of medoids (O2, O8): E = sum over i = 1..k of the sum over p in C_i of |p - O_i| = |O1 - O2| + |O3 - O2| + |O4 - O2| + |O5 - O8| + |O6 - O8| + |O7 - O8| + |O9 - O8| + |O10 - O8|.

PAM or K-Medoids: Example. The absolute error criterion for the set of medoids (O2, O8) is E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20.

PAM or K-Medoids: Example. Choose a random non-medoid object, O7, and consider swapping O8 and O7. The absolute error criterion for the set of medoids (O2, O7) is E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.

PAM or K-Medoids: Example. Compute the cost function S = absolute error[O2, O7] - absolute error[O2, O8] = 22 - 20 = 2. Since S > 0, it is a bad idea to replace O8 by O7.

PAM or K-Medoids: Example. Per-object costs of swapping O8 with O7: C6,87 = d(O7, O6) - d(O8, O6) = 2 - 1 = 1; C5,87 = d(O7, O5) - d(O8, O5) = 2 - 3 = -1; C9,87 = d(O7, O9) - d(O8, O9) = 3 - 2 = 1; C10,87 = d(O7, O10) - d(O8, O10) = 3 - 2 = 1; C1,87 = C3,87 = C4,87 = 0. TCih = sum over j of Cjih = 1 - 1 + 1 + 1 = 2 = S.
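The swap evaluation above can be checked numerically. This sketch (not from the slides) uses the object coordinates as reconstructed from the example's distance computations and reproduces E = 20 for medoids {O2, O8} and E = 22 for {O2, O7}:

```python
import numpy as np

# Objects O1..O10 as (A1, A2), reconstructed from the example's distance computations
O = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]], dtype=float)

def absolute_error(points, medoid_ids):
    """Sum of Manhattan (L1) distances from every object to its closest medoid."""
    medoids = points[medoid_ids]
    d = np.abs(points[:, None, :] - medoids[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

print(absolute_error(O, [1, 7]))  # medoids O2, O8 -> 20.0
print(absolute_error(O, [1, 6]))  # medoids O2, O7 -> 22.0, so S = +2 and the swap is rejected
```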

CLARA (Clustering Large Applications) (Kaufmann and Rousseeuw, 1990): it draws multiple samples of the data set (sample1, sample2, ..., samplem), applies PAM on each sample, and returns the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness: efficiency depends on the sample size, and a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

Density-Based Clustering Methods Clustering based on density (local cluster criterion), such as density-connected points Major features: Discover clusters of arbitrary shape Handle noise One scan Need density parameters as termination condition Several interesting studies: DBSCAN: Ester, et al. (KDD’96) OPTICS: Ankerst, et al (SIGMOD’99). DENCLUE: Hinneburg & D. Keim (KDD’98) CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

Density-Based Clustering: Basic Concepts. Two parameters: Eps, the maximum radius of the neighbourhood, and MinPts, the minimum number of points in an Eps-neighbourhood of a point. NEps(p) = {q in D | dist(p, q) <= Eps}. Directly density-reachable: a point p is directly density-reachable from a point q if p belongs to NEps(q) and q satisfies the core-point condition |NEps(q)| >= MinPts (e.g. MinPts = 5, Eps = 1 cm). In other words, if a data point p lies within the Eps-neighbourhood of a core object q, then p is said to be directly density-reachable from q.

Density-Reachable and Density-Connected. Density-reachable: a point p is density-reachable from a point q if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each p(i+1) is directly density-reachable from p(i). Density-connected: a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o. In other words, if q is a core object, p1 is directly density-reachable from q, and p is directly density-reachable from p1, then p is density-reachable from q; and if there exists an object o such that both p and q are density-reachable from o, then p and q are density-connected.

DBSCAN: The Algorithm Arbitrary select a point p Retrieve all points density-reachable from p w.r.t. Eps and MinPts If p is a core point, a cluster is formed If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database Continue the process until all of the points have been processed
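In practice one rarely codes DBSCAN from scratch; a quick sketch using scikit-learn's DBSCAN (whose eps and min_samples parameters play the roles of Eps and MinPts) on a non-convex toy data set might look like this:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that K-Means cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of Eps (neighbourhood radius), min_samples the role of MinPts
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids found; the label -1 marks noise points
```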

DBSCAN: apply the DBSCAN algorithm to a data set of 12 points (indices 1-12, each described by attribute 1 and attribute 2), with n = 12 and the user-supplied parameters ε = 1 and MinPts = 4; the following steps show how the algorithm proceeds.

DBSCAN: for every point, count the number of points in its ε-neighbourhood and mark it as a core point if the count is at least MinPts. In this example, samples 4, 7, and 10 are core points, and the sets of points reachable from them are {1, 3, 4, 5, 10}, {2, 6, 7, 8, 11}, and {4, 9, 10, 12}, respectively.

DBSCAN: the core points are sample 4 (reachable set {1, 3, 4, 5, 10}), sample 7 (reachable set {2, 6, 7, 8, 11}), and sample 10 (reachable set {4, 9, 10, 12}). Sample 4 can reach sample 10 and sample 10 can reach {4, 9, 10, 12}, so the reachable set of sample 4 and the reachable set of sample 10 are density-connected and the two sets merge into a single cluster. The final clustering result is {1, 3, 4, 5, 9, 10, 12} as one cluster and {2, 6, 7, 8, 11} as another.

DBSCAN Algorithm: Example. Parameters: Eps = 2 cm, MinPts = 3.

When DBSCAN Does NOT Work Well: on the original points shown, DBSCAN cannot handle varying densities and is sensitive to its parameters; with MinPts = 4, the results for Eps = 9.92 and Eps = 9.75 are very different.

DBSCAN: Sensitive to Parameters

Density Based Clustering: Discussion Advantages Clusters can have arbitrary shape and size Number of clusters is determined automatically Can separate clusters from surrounding noise Can be supported by spatial index structures Disadvantages Input parameters may be difficult to determine In some situations very sensitive to input parameter setting

DBSCAN: Sensitive to Parameters. Motivation: very different local densities may be needed to reveal clusters in different regions; clusters A, B, C1, C2, and C3 cannot all be detected using one global density parameter. A global density parameter can detect either A, B, C or C1, C2, C3, but not both. Which density parameter should be used? Solution: use OPTICS.

Density-Based Hierarchical Clustering. Observation: dense clusters are completely contained by less dense clusters. Idea: process the objects in the "right" order and keep track of the point density in their neighborhood.

When OPTICS Works Well Cluster-order of the objects

Gaussian Distribution: a bean machine (balls dropped through rows of pins) produces an approximately Gaussian pile; figures show the 1-d and 2-d Gaussian densities.

Expectation-Maximization (EM): EM is a clustering algorithm based on a set of density distribution functions. The main idea: assume the samples follow a Gaussian mixture model; the goal of the algorithm is to determine the parameters of each Gaussian component so that the model fits the given data well, yielding a fuzzy (soft) clustering in which each sample belongs to each Gaussian component with some probability computed from those parameters. The mixture density is p(x) = sum over i of pi_i N(x | mu_i, Sigma_i), where N(x | mu_i, Sigma_i) is a Gaussian distribution with mean mu_i and covariance Sigma_i, and pi_i is the mixing weight of the i-th component, interpreted as its prior probability, with pi_i >= 0 and sum over i of pi_i = 1.

Probability

Probability

Probability

EM algorithm example: suppose there are 6 people with the following heights (in metres): 1: 1.6, 2: 1.9, 3: 1.86, 4: 1.55, 5: 1.62, and person 6 (height not listed). First guess parameters for the male height distribution, say a normal distribution with mean 1.8 m and variance 0.1, and for the female height distribution, mean 1.6 m and variance 0.1. For each person, compare the probability under each distribution; this gives p1(female) > p1(male), p2(female) < p2(male), p3(female) < p3(male), p4(female) > p4(male), p5(female) > p5(male), p6(female) > p6(male). So persons 1, 4, 5, 6 are assigned to the female group and persons 2, 3 to the male group. Based on this assignment, the parameters of the male distribution are re-estimated from the true heights of 2 and 3, and the parameters of the female distribution from the true heights of 1, 4, 5, 6. Repeat this step until the cluster assignment no longer changes, i.e. until convergence.

Partitional Clustering How do we partition a space to make the best clusters? Proximity to a cluster centroid.

Difficult Clusterings: But some clusterings don't lend themselves to a "centroid"-based definition of a cluster. Graph-based (spectral) clustering allows us to address these sorts of clusters. These kinds of clusters are defined by points that are close to any member in the cluster, rather than to the average member of the cluster.

Graph Representation We can represent the relationships between data points in a graph. Weight the edges by the similarity between points

Graphs Nodes and Edges Edges can be directed or undirected Edges can have weights associated with them Here the weights correspond to pairwise affinity

Graphs: the degree of a node is the total weight of the edges incident to it, d_i = sum over j of w_ij; the volume of a set of nodes S is the sum of the degrees of its members, vol(S) = sum over i in S of d_i.

Graph Cuts: the cut between two subgraphs A and B is the total weight of the edges crossing between them, cut(A, B) = sum over i in A, j in B of w_ij.

Graph cuts: A cut on a graph is defined as a partitioning of the nodes of the graph into two sets. The size of a cut is the number of edges spanning the cut. The minimum cut of a graph identifies an optimal partitioning of the data. Spectral clustering: recursively partition the data set: identify the minimum cut, remove its edges, and repeat until k clusters are identified. (In the example, the cut size is 2.)

Graph Cuts Minimum (bipartitional) cut

Spectral Clustering Example: Minimum Cut (figure: five points A-E plotted by height and weight, with pairwise affinity weights on the edges).

Spectral Clustering Example: Normalized Minimum Cut (the same five points A-E with pairwise affinity weights).

Unnormalized Graph Laplacian example: for a small example graph on 4 nodes, form the adjacency matrix A and the diagonal degree matrix D (node degrees on the diagonal); the unnormalized Laplacian is L = D - A, so each diagonal entry of L is a node's degree and each off-diagonal entry is -1 where an edge exists (0 otherwise).

Spectral Clustering: construct an affinity matrix for the data points (example: four points A, B, C, D with pairwise affinity weights on the edges).

Laplacian Matrix: a matrix representation of a graph. D is a normalization factor for the affinity matrix A, and different Laplacians are available. The most important application of the Laplacian is spectral clustering, which corresponds to a computationally tractable solution to the graph partitioning problem.

A few linear algebra concepts: Let A be an n x n matrix, let A' denote the transpose of A, and let R^n denote the space of n x 1 real-valued vectors. Eigenvector of A: v in R^n is an eigenvector of the matrix A with eigenvalue λ (a scalar) if Av = λv. Positive semi-definite: A is said to be positive semi-definite if, for any x in R^n, x'Ax >= 0.

Laplacian Matrix: For a good clustering, we expect to have a block-diagonal Laplacian matrix. http://charlesmartin14.wordpress.com/2012/10/09/spectral-clustering/

Spectral clustering key steps: start from the N x N adjacency/affinity matrix A, form the graph Laplacian L, compute the first k eigenvectors of L to obtain an N x k embedding, and cluster its rows into k clusters.
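A compact sketch of this pipeline (the unnormalized-Laplacian variant; normalized Laplacians are also common), assuming an affinity matrix W is already available:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Unnormalized spectral clustering: affinity matrix W -> Laplacian L = D - W
    -> first k eigenvectors -> K-Means on the embedded rows."""
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # unnormalized graph Laplacian
    _, vecs = eigh(L)            # eigenvectors, sorted by ascending eigenvalue
    U = vecs[:, :k]              # N x k spectral embedding (one row per data point)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```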

Spectral clustering

Spectral clustering: given the similarity matrix W of 6 points.

Spectral clustering: compute the degree matrix D.

Spectral clustering: compute D by summing the rows of W; each diagonal entry of D is the corresponding row sum.

Spectral clustering: compute the Laplacian L = D - W.

Spectral clustering: compute the eigenvalues and eigenvectors of L.

Spectral clustering: the resulting eigenvalues and eigenvectors.

Spectral clustering

Spectral clustering

Spectral clustering

Spectral clustering

Spectral clustering: recompute the mean of each cluster (the centers of clusters 1, 2, and 3). B is closest to A, so it joins A's cluster; D is closest to A, so it joins A's cluster; E is closest to C, so it joins C's cluster. Cluster 1: A, B, D; cluster 2: C, E; cluster 3: F.

k-means vs. Spectral Clustering

Overall Framework of CHAMELEON Construct (K-NN) Sparse Graph Partition the Graph Merge Partition Final Clusters Data Set K-NN Graph: Points p and q are connected if q is among the top-k closest neighbors of p Relative interconnectivity: connectivity of c1 and c2 over internal connectivity Relative closeness: closeness of c1 and c2 over internal closeness

KNN Graphs and Interconnectivity K-nearest neighbor (KNN) graphs from an original data in 2D:

CHAMELEON: Clustering Complex Objects. CHAMELEON is capable of generating quality clusters when clustering complex objects.

Using minimum cut for semi-supervised classification? Construct a graph representation of the unseen data. Insert imaginary nodes s and t connected to the labeled points with infinite similarity. Treat the min cut as a maximum-flow problem from s to t.

Kernel K-Means Clustering Kernel K-Means can be used to detect non-convex clusters K-Means can only detect clusters that are linearly separable Idea: Project data onto the high-dimensional kernel space, and then perform K-Means clustering Map data points in the input space onto a high- dimensional feature space using the kernel function Perform K-Means on the mapped feature space Computational complexity is higher than K-Means Need to compute and store n x n kernel matrix generated from the kernel function on the original data The widely studied spectral clustering can be considered as a variant of Kernel K-Means clustering

Kernel Functions and Kernel K-Means Clustering. Kernel matrix: the inner-product matrix K(i, j) = φ(Xi)·φ(Xj), i.e. a similarity matrix (not a distance matrix). Typical kernel functions: polynomial kernel of degree h: K(Xi, Xj) = (Xi·Xj + c)^h; Gaussian radial basis function (RBF) kernel: K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2σ^2)); sigmoid kernel: K(Xi, Xj) = tanh(κXi·Xj - δ). The SSE criterion of kernel K-Means is the sum over clusters Ck and points xi in Ck of ||φ(xi) - μk||^2, where the cluster centroid in the feature space is μk = (1/|Ck|) times the sum over xi in Ck of φ(xi).

Example: Kernel Functions and Kernel K-Means Clustering. Gaussian RBF kernel: K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2σ^2)). Suppose there are 5 original 2-dimensional points: x1(0, 0), x2(4, 4), x3(-4, 4), x4(-4, -4), x5(4, -4). If we set σ to 4, then, e.g., ||x1 - x2||^2 = (0-4)^2 + (0-4)^2 = 32, therefore K(x1, x2) = e^(-32 / (2·4^2)) = e^(-1). In the kernel space, K(xi, xi) = 1, K between x1 and each corner point is e^(-1), K between adjacent corner points (e.g. x2 and x3) is e^(-2), and K between opposite corner points (e.g. x2 and x4) is e^(-4).
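The kernel-matrix entries quoted above can be verified with a few lines of NumPy (the σ = 4 value and the point coordinates are those of the example):

```python
import numpy as np

# The five 2-D points from the example
X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
sigma = 4.0

# Gaussian RBF kernel matrix: K[i, j] = exp(-||xi - xj||^2 / (2 * sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(K[0, 1], np.exp(-1))  # ||x1 - x2||^2 = 32  -> K(x1, x2) = e^-1
print(K[1, 2], np.exp(-2))  # ||x2 - x3||^2 = 64  -> e^-2 (adjacent corners)
print(K[1, 3], np.exp(-4))  # ||x2 - x4||^2 = 128 -> e^-4 (opposite corners)
```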

Example: Kernel K-Means Clustering. The original data set cannot be clustered well by plain K-Means, since it contains non-convex clusters. The Gaussian RBF kernel transformation maps the data to a kernel matrix K, with K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2σ^2)) for any two points xi, xj; K-Means clustering is then conducted on the mapped data, generating quality clusters. The figures compare the result of plain K-Means clustering with the result of Gaussian kernel K-Means clustering.

Matching-Based Measures (I): Purity vs. Maximum Matching. Purity quantifies the extent to which cluster Ci contains points from only one (ground-truth) partition: purity_i = (max over j of n_ij) / n_i, and the overall purity is the size-weighted average. Perfect clustering if purity = 1 (and the number of clusters obtained is the same as that in the ground truth). Ex. 1 (green or orange table): purity1 = 30/50; purity2 = 20/25; purity3 = 25/25; purity = (30 + 20 + 25)/100 = 0.75. Maximum matching: only one cluster can be matched to one partition; match is the maximum-weight pairwise matching with edge weights w(e_ij) = n_ij. Ex. 2: for the green table, match = purity = 0.75; for the orange table, the best one-to-one matching gives match = 0.65 > 0.6 (the value of the matching that uses each cluster's majority partition where possible), even though the purity is still 0.75.
Green contingency table (clusters C1-C3 vs. ground-truth partitions T1-T3):
C\T   T1   T2   T3   Sum
C1    20   30    0    50
C2    20    5    0    25
C3     0    0   25    25
mj    40   35   25   100
Orange contingency table:
C\T   T1   T2   T3   Sum
C1     0   30   20    50
C2     0   20    5    25
C3    25    0    0    25
mj    25   50   25   100

Matching-Based Measures (II): F-Measure. Precision: the fraction of points in Ci that come from its majority partition (i.e., the same as purity): prec_i = n_{i,ji} / n_i, where ji is the partition that contains the maximum number of points from Ci. Ex. for the green table: prec1 = 30/50; prec2 = 20/25; prec3 = 25/25. Recall: the fraction of points of partition T_{ji} shared in common with cluster Ci: recall_i = n_{i,ji} / m_{ji}, where m_{ji} = |T_{ji}|; recall1 = 30/35; recall2 = 20/40; recall3 = 25/25. F-measure for Ci: the harmonic mean of prec_i and recall_i, F_i = 2·prec_i·recall_i / (prec_i + recall_i) = 2·n_{i,ji} / (n_i + m_{ji}). F-measure for the clustering C: the average over all clusters. F1 = 60/85; F2 = 40/65; F3 = 1; F = 0.774.
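Purity, precision, recall, and the F-measure can all be read off the contingency table. A small Python sketch (illustrative, not from the slides) that reproduces the numbers for the green table above:

```python
import numpy as np

# Contingency table n[i, j]: cluster Ci (rows) vs. ground-truth partition Tj (columns),
# values from the "green" table above
n = np.array([[20, 30,  0],
              [20,  5,  0],
              [ 0,  0, 25]])
cluster_sizes   = n.sum(axis=1)   # |Ci| = 50, 25, 25
partition_sizes = n.sum(axis=0)   # |Tj| = 40, 35, 25

purity    = n.max(axis=1).sum() / n.sum()                      # 0.75
precision = n.max(axis=1) / cluster_sizes                      # 30/50, 20/25, 25/25
recall    = n.max(axis=1) / partition_sizes[n.argmax(axis=1)]  # 30/35, 20/40, 25/25
F         = 2 * precision * recall / (precision + recall)      # 60/85, 40/65, 1.0
print(purity, F.mean())                                        # 0.75  ~0.774
```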

Cluster Stability Clusterings obtained from several datasets sampled from the same underlying distribution as D should be similar or “stable” Typical approach: Find good parameter values for a given clustering algorithm Example: Find a good value of k, the correct number of clusters A bootstrapping approach to find the best value of k (judged on stability) Generate t samples of size n by sampling from D with replacement For each sample Di, run the same clustering algorithm with k values from 2 to kmax Compare the distance between all pairs of clusterings Ck(Di) and Ck(Dj) via some distance function Compute the expected pairwise distance for each value of k

Other Methods for Finding K, the Number of Clusters. Empirical method: set the number of clusters to roughly the square root of n/2 for a dataset of n points (e.g., n = 200 gives k = 10). Elbow method: use the turning point in the curve of the sum of within-cluster variance with respect to the number of clusters. Cross-validation method: divide a given data set into m parts, use m - 1 parts to obtain a clustering model, and use the remaining part to test the quality of the clustering (for example, for each point in the test set, find the closest centroid); for any k > 0, repeat this m times, compare the overall quality measure for different values of k, and choose the number of clusters that fits the data best.
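A sketch of the elbow method mentioned above, using scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squared errors); the toy data set is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10):
    """Within-cluster sum of squared errors (inertia) for k = 1..k_max;
    plot it and look for the 'turning point' (elbow) when choosing k."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]

# Toy data with three blobs: the SSE drops sharply up to k = 3, then flattens
X = np.vstack([np.random.randn(60, 2) + c for c in ([0, 0], [6, 0], [3, 5])])
print(elbow_curve(X, k_max=8))
```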