Lecture 4 : Clustering. 楊立偉教授 (wyang@ntu.edu.tw). These slides are adapted from the slides for Introduction to Information Retrieval, Ch 16 & 17.

Clustering : Introduction

Clustering: Definition (Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data.

Data set with a clear cluster structure. Exercise: propose an algorithm for finding the cluster structure in this example.

Classification vs. Clustering. Classification: supervised learning; classes are human-defined and part of the input to the learning algorithm. Clustering: unsupervised learning; clusters are inferred from the data without human input.

Why cluster documents? Whole corpus analysis/navigation: a better user interface (provides analysis and navigation of the whole document collection). For improving recall in search applications: better search results. For better navigation of search results: effective "user recall" will be higher. For speeding up vector space retrieval: faster search.

For visualizing a document collection. Wise et al., "Visualizing the non-visual". PNNL ThemeScapes, Cartia. [Mountain height = cluster size]

For improving search recall. Cluster hypothesis: "closely associated documents tend to be relevant to the same requests". Therefore, to improve search recall: cluster the docs in the corpus first; when a query matches a doc D, also return the other docs in the cluster containing D (i.e., recommend the whole matching cluster). Hope if we do this: the query "car" will also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile. Why might this happen? The documents share similar features.

For better navigation of search results: for grouping search results thematically, e.g., clusty.com / Vivisimo (Enterprise Search, Velocity).

Issues for clustering (1). General goal: put related docs in the same cluster, put unrelated docs in different clusters. Representation for clustering: document representation (how to represent a document), and a notion of similarity/distance (how to measure similarity).

Issues for clustering (2). How to decide the number of clusters: fixed a priori (assume the number of clusters K is given), or data driven (semiautomatic methods for determining K). Avoid very small and very large clusters. Define clusters that are easy to explain to the user.

Clustering Algorithms. Flat (partitional) algorithms (non-hierarchical clustering): usually start with a random (partial) partitioning and refine it iteratively, e.g., K-means clustering. Hierarchical algorithms (hierarchical clustering): create a hierarchy, either bottom-up (agglomerative) or top-down (divisive).

Flat (Partitioning) Algorithms. Partitioning method: construct a partition of n documents into a set of K clusters. Given: a set of documents and the number K. Find: a partition into K clusters that optimizes the chosen partitioning criterion. Globally optimal: exhaustively enumerate all partitions (finding the best partition this way is usually far too slow). Effective heuristic methods: K-means and K-medoids algorithms (an approximate solution found by heuristics is usually good enough).

Hard vs. Soft clustering. Hard clustering: each document belongs to exactly one cluster; more common and easier to do. Soft clustering: a document can belong to more than one cluster; useful for applications like creating browsable hierarchies. Ex. put sneakers in two clusters, sports apparel and shoes; you can only do that with a soft clustering approach. *Only hard clustering is discussed in this class.

K-means algorithm

K-means. Perhaps the best known clustering algorithm. Simple, works well in many cases. Use it as the default / baseline for clustering documents.

K-means. In the vector space model, K-means assumes documents are real-valued vectors. Clusters are built around centroids (aka the center of gravity, or mean) of the points in a cluster ω: μ(ω) = (1/|ω|) Σ_{x∈ω} x. Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-means algorithm. 1. Select K random docs {s1, s2, ..., sK} as seeds (pick the seeds first). 2. Until clustering converges or another stopping criterion holds, repeat the following steps: 2.1 For each doc di: assign di to the cluster cj such that dist(xi, sj) is minimal (put each document into the nearest cluster). 2.2 For each cluster cj: sj = μ(cj), i.e., update the seeds to the centroid of each cluster, then iterate again.
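Below is a minimal sketch of this procedure in Python/NumPy. It is illustrative only and not the lecture's reference code: the function name, the Euclidean distance, and the toy data are assumptions.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch. X: (n_docs, n_dims) array of document vectors."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random docs as seeds.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2.1: assign each doc to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2.2: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# Tiny usage example with two obvious clusters
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, centroids = kmeans(X, K=2)
print(labels, centroids)
```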

K-means algorithm

K-means example (K=2). Pick seeds, reassign clusters, compute centroids, reassign, recompute, ... converged! In practice 3 to 4 iterations are usually enough for the result to roughly stabilize (but this still depends on the data and the number of clusters).

Termination conditions. Several possibilities, e.g.: a fixed number of iterations; the doc partition is unchanged; the centroid positions don't change.

Convergence of K-Means. Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change (convergence)? K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge, although the number of iterations could be large. (In theory it is guaranteed to converge; the only question is how many iterations it takes. It is a successive-approximation method, and the early iterations make the largest progress.)

Convergence of K-Means: proof. Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid: Gk = Σi (di - ck)² (sum over all di in cluster k), and G = Σk Gk. That is, for each cluster compute the squared distances between its documents and its centroid, then sum over all clusters. Reassignment monotonically decreases G, since each vector is assigned to the closest centroid, and recomputing each centroid as the cluster mean also decreases G; every iteration can only make G smaller.
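As a sanity check, the objective G from this slide can be computed directly. A small NumPy sketch (names are illustrative); re-evaluating it after each reassignment/recentering step should show G decreasing monotonically.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """G = sum over clusters k of the squared distances of its docs to centroid c_k."""
    G = 0.0
    for k, c in enumerate(centroids):
        members = X[labels == k]
        G += ((members - c) ** 2).sum()
    return G
```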

Time Complexity. Computing the distance between two docs is O(m), where m is the dimensionality of the vectors. Reassigning clusters: O(Kn) distance computations, i.e., O(Knm). Computing centroids: each doc gets added once to some centroid, O(nm). Assume these two steps are each done once per iteration, for I iterations: O(IKnm). With I iterations, K clusters, n documents, and m terms this is slow and not scalable; common remedies are approximation, sampling, and careful seed selection to speed it up.

Issue (1) Seed Choice. Results can vary based on the random seed selection. Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. Select good seeds using a heuristic (e.g., a doc least similar to any existing mean), or try out multiple starting points. Example showing sensitivity to seeds: in the illustrated example, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.

Issue (2) How Many Clusters? So far the number of clusters K is given: partition n docs into a predetermined number of clusters. But finding the "right" number of clusters can itself be part of the problem: suppose even the number of clusters is unknown, and given the docs we must partition them into an "appropriate" number of subsets. E.g., for query results the ideal value of K is not known up front, though the UI may impose limits (when clustering search results we usually do not know in advance how many clusters there should be).

If K is not specified in advance. Suggest K automatically, using heuristics based on N, or using a K vs. cluster-size diagram. There is a tradeoff between having fewer clusters (better focus within each cluster) and having too many clusters; how to strike this balance is the question.

Method: look at the percentage of between-group variance relative to the total variance (i.e., an F-test style criterion), and choose the point just before the marginal variance gained by adding one more cluster starts to drop. Ref: "Determining the number of clusters in a data set", Wikipedia.
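One common way to operationalize this is an "elbow" curve of within-cluster variance against K. A hedged sketch with scikit-learn (the matrix X and the K range are placeholders, not from the lecture):

```python
from sklearn.cluster import KMeans

def inertia_curve(X, k_range=range(2, 11)):
    """inertia_ is the within-cluster sum of squared distances for each K.
    Plot it against K and pick the K just before the improvement flattens out."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_range}
```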

The Calinski-Harabasz index. The larger the between-group sum of squares (BGSS) and the smaller the within-group sum of squares (WGSS), the better, so a higher score is better. Ref: "Clustering Indices", clusterCrit package, R project.
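scikit-learn exposes this index directly. A small usage sketch (X, the K range, and the helper name are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def best_k_by_ch(X, k_range=range(2, 11)):
    """Cluster with several K and keep the K with the highest Calinski-Harabasz score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = calinski_harabasz_score(X, labels)  # higher is better
    return max(scores, key=scores.get), scores
```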

K-means variations. Recomputing the centroid after every single assignment (rather than after all points are re-assigned) can improve the speed of convergence of K-means.

Evaluation of Clustering

What Is A Good Clustering? Internal criterion: a good clustering will produce high quality clusters in which the intra-class (that is, intra-cluster) similarity is high (the more homogeneous each cluster, the better) and the inter-class similarity is low (large differences between clusters). The measured quality of a clustering depends on both the document representation and the similarity measure used.

External criteria for clustering quality. Based on a gold standard data set (ground truth), e.g., the Reuters collection we also used for the evaluation of classification. Goal: the clustering should reproduce the classes in the gold standard. Quality is measured by its ability to discover some or all of the hidden patterns, i.e., by how many members end up in clusters where they do not belong.

External criterion: Purity. Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes. For each cluster ωk: find the class cj with the most members nkj in ωk. Sum all nkj and divide by the total number of points: purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|. In other words, purity is the proportion of each cluster taken up by its most frequent class.

Example for computing purity. To compute purity: 5 = maxj |ω1 ∩ cj| (class x, cluster 1); 4 = maxj |ω2 ∩ cj| (class o, cluster 2); 3 = maxj |ω3 ∩ cj| (class ⋄, cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
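A short purity computation matching this definition. The function and the label encoding are illustrative assumptions; with a clustering whose per-cluster majority counts are 5, 4, and 3 out of 17 points it returns ≈ 0.71 as on the slide.

```python
from collections import Counter

def purity(clusters, classes):
    """clusters, classes: parallel lists giving cluster id and gold class per point."""
    by_cluster = {}
    for w, c in zip(clusters, classes):
        by_cluster.setdefault(w, []).append(c)
    # For each cluster, count its most frequent gold class, then sum and normalize.
    majority_sum = sum(Counter(members).most_common(1)[0][1]
                       for members in by_cluster.values())
    return majority_sum / len(classes)
```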

Rand Index. Count pairs of points according to whether the clustering and the ground truth agree:
Number of point pairs | Same cluster in clustering | Different clusters in clustering
Same class in ground truth | A | C
Different classes in ground truth | B | D

Rand index (symmetric version): RI = (A + D) / (A + B + C + D). Compare with standard Precision = A / (A + B) and Recall = A / (A + C).

Rand Index example: RI ≈ 0.68.
Number of point pairs | Same cluster in clustering | Different clusters in clustering
Same class in ground truth | 20 | 24
Different classes in ground truth | 20 | 72
RI = (20 + 72) / (20 + 20 + 24 + 72) = 92 / 136 ≈ 0.68.
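The pair-counting Rand index can be computed as below. This is a brute-force sketch for clarity (scikit-learn also provides rand_score and adjusted_rand_score); with counts A=20, B=20, C=24, D=72 it reproduces the slide's 0.68.

```python
from itertools import combinations

def rand_index(clusters, classes):
    """clusters, classes: parallel lists giving cluster id and gold class per point."""
    A = B = C = D = 0
    for (w1, c1), (w2, c2) in combinations(zip(clusters, classes), 2):
        same_cluster, same_class = (w1 == w2), (c1 == c2)
        if same_cluster and same_class:
            A += 1  # true positive pair
        elif same_cluster:
            B += 1  # false positive pair
        elif same_class:
            C += 1  # false negative pair
        else:
            D += 1  # true negative pair
    return (A + D) / (A + B + C + D)
```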

DBSCAN algorithm

DBSCAN. Density-based clustering: clusters are dense regions in the data space, separated by regions of lower object density. A cluster is defined as a maximal set of density-connected points. DBSCAN may discover clusters of arbitrary shape (cf. K-means, which tends to find spherical clusters). Ref: Density-Based Spatial Clustering of Applications with Noise.

DBSCAN definitions. Eps-neighborhood of a point p: the points within radius Eps of p. "High density": the Eps-neighborhood of a point contains at least MinPts points. For radius ɛ and MinPts = 4: the density of p is "high", the density of q is "low".

DBSCAN: Core points, Border points, and Noise points. A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points in the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point.
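scikit-learn's DBSCAN implements this Eps/MinPts scheme. A hedged usage sketch; the toy data and parameter values are assumptions, not from the lecture.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.95],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0], [5.0, 4.9],
              [9.0, 0.0]])                       # an isolated point: ends up as noise

db = DBSCAN(eps=0.5, min_samples=3).fit(X)       # eps ~ Eps, min_samples ~ MinPts
print(db.labels_)                                # cluster ids; -1 marks noise points

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True        # True = core point, False = border/noise
print(core_mask)
```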

Example See http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf

DBSCAN: Directly density-reachable. A point q is directly density-reachable from a point p if p is a core point and q is in the Eps-neighborhood of p. With MinPts = 4 in the figure: q is directly density-reachable from p, but p is not directly density-reachable from q; density-reachability is asymmetric.

Visualize the algorithm http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

After clustering.

Hierarchical Clustering

Hierarchical Clustering. Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents. One approach: recursive application of a partitional clustering algorithm, i.e., the hierarchy can be built by running a clustering algorithm repeatedly at each level. (Example taxonomy: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).)

Dendrogram: Hierarchical Clustering. A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster. Another way to think of it: make a horizontal cut across the hierarchy tree; whatever remains connected below the cut forms one cluster.

Hierarchical Clustering algorithms. Agglomerative (bottom-up): start with each document being a single cluster; eventually all documents belong to the same cluster. Divisive (top-down): start with all documents in the same cluster; eventually each node forms a cluster on its own. These methods do not require the number of clusters k in advance, but they do need a termination condition.

Hierarchical Agglomerative Clustering (HAC) Algorithm. Start with each doc in a separate cluster (every document begins as its own cluster), then repeatedly join the closest pair of clusters until there is only one cluster. The history of merging forms a binary tree or hierarchy.

Dendrogram: Document Example. As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts. (In the example dendrogram, d1 and d2 merge into {d1,d2}; d4 and d5 merge into {d4,d5}, which then merges with d3 into {d3,d4,d5}.)

Closest pair of clusters: how do we decide which two clusters are closest? There are many variants of defining the closest pair of clusters. Single-link: similarity of the most cosine-similar pair (represent each cluster by its nearest point). Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair (represent each cluster by its farthest point). Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar. Average-link: average cosine between pairs of elements (compute the distances between all pairs of points and take the average).
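SciPy's hierarchical clustering supports all of these linkage criteria. A minimal sketch; the toy vectors, the Euclidean metric, and the cut into 3 clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.2, 5.1], [9.0, 9.0]])

# method can be 'single', 'complete', 'average', or 'centroid'
Z = linkage(X, method='complete', metric='euclidean')  # merge history (the binary tree)
labels = fcluster(Z, t=3, criterion='maxclust')        # cut the dendrogram into 3 clusters
print(labels)
```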

Single Link Agglomerative Clustering. Use the maximum similarity of pairs: sim(ci, cj) = max over x in ci, y in cj of sim(x, y). Can result in "straggly" (long and thin, loosely connected) clusters due to a chaining effect. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck)).

Single Link Example

Complete Link Agglomerative Clustering. Use the minimum similarity of pairs: sim(ci, cj) = min over x in ci, y in cj of sim(x, y). Makes "tighter", more spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck)).

Complete Link Example

Computational Complexity. In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²) (pairwise document similarities). In each of the subsequent n-2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters. Including the merging process, the overall cost is O(n² log n).

Key notion: cluster representative. We want a notion of a representative point in a cluster (how to represent the cluster: by its centroid or by some other point). The representative should be some sort of "typical" or central point in the cluster, e.g., the centroid, or the point closest to the centroid.

Example: n=6, k=3, merging the closest pair of centroids. (The figure shows documents d1, d2, ..., the centroid after the first step, and the centroid after the second step.)

Outliers in centroid computation. Outliers can be ignored when computing the centroid. What is an outlier? There are many statistical definitions, e.g., the distance from a point to the centroid is greater than M times some cluster moment, say M = 10. Simply put, a point that lies far too distant (e.g., 10 times farther than typical) is simply ignored.

Using a Medoid As the Cluster Representative. The centroid does not have to be a document. Medoid: a cluster representative that is one of the documents (a particular document chosen to represent the cluster), e.g., the document closest to the centroid. Why use a medoid? Consider the representative of a large cluster (>1000 documents): the centroid of this cluster will be a dense vector, whereas the medoid of this cluster will be a sparse vector.
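A medoid in the sense used here can be picked as the document closest to the centroid. A minimal dense-vector sketch (the function name is an assumption; with sparse document vectors the same idea applies, only the mean becomes dense while the chosen row stays sparse).

```python
import numpy as np

def medoid(X):
    """Return (index, vector) of the row of X closest to the cluster centroid."""
    centroid = X.mean(axis=0)                       # dense mean of the cluster
    idx = np.linalg.norm(X - centroid, axis=1).argmin()
    return idx, X[idx]                              # an actual document vector
```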

Clustering : discussion

Feature selection: choose good terms first, then cluster. Which terms should be used as axes for the vector space? IDF is a form of feature selection; prefer the most discriminating terms (terms with high discriminative power). Ex. use only nouns/noun phrases.

Labeling: attach labels to the resulting clusters. After the clustering algorithm finds clusters, how can they be made useful to the end user? We need a pithy, concise label for each cluster. In search results, say "Animal" or "Car" in the jaguar example. In topic trees (Yahoo), we need navigational cues. Often done by hand, a posteriori (edited manually after the fact).

How to Label Clusters. Show the titles of typical documents (label with the titles of a few representative documents). Show words/phrases prominent in the cluster (label with a few representative terms); this is more likely to fully represent the cluster. Use distinguishing words/phrases, in combination with automatic keyword-extraction techniques.

Labeling. Common heuristic: list the 5-10 most frequent terms in the centroid vector (usually 5-10 terms are used to represent a cluster). Differential labeling by frequent terms: within a collection about "Computers", every cluster will have the word computer as a frequent term, so apply discriminant analysis to the centroids and pick terms with high discriminative power.
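A hedged sketch of the "top centroid terms" heuristic using scikit-learn TF-IDF vectors and K-means. The tiny corpus and the parameter values are placeholders, not from the lecture; get_feature_names_out assumes a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat", "dogs and cats are pets",
    "stock markets fell sharply today", "investors worry about the market",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.asarray(vec.get_feature_names_out())
for k, centroid in enumerate(km.cluster_centers_):
    top = terms[np.argsort(centroid)[::-1][:10]]   # the most heavily weighted terms
    print(f"cluster {k}: {', '.join(top)}")
```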

Topic Model: applying clustering to topic modeling. Automatically discover structure (topics) in a very large collection of documents, and represent each topic as a set of keywords (note: i.e., a bag-of-words model). We can then also estimate what proportion of each topic each document contains, and use that to judge how related two documents are. A clustering algorithm can group the keywords into such topics.

Topic Modelling (with LSA, pLSA, or LDA): clustering words into topics. (Example vocabulary from the figure: brain, computer, data, dna, evolve, gene, genetic, life, organism, nerve, neuron, number, ...)
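A hedged sketch of LDA topic modelling with scikit-learn. The four made-up documents (echoing the vocabulary above), the number of topics, and the printing code are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna genetic evolution organism",
    "brain nerve neuron cognition",
    "computer data number algorithm",
    "dna gene life organism evolve",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)                  # LDA works on raw term counts (bag of words)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]   # top keywords per topic
    print(f"topic {k}: {top}")

print(lda.transform(X))                      # per-document topic proportions
```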