Applied Human Computer Interaction, Lecture 10: Clustering. Yan Ke
Basic Clustering Methods: K-means, Expectation Maximization (EM)
Some Data This could easily be modeled by a Gaussian Mixture (with 5 components) But let’s look at a satisfying, friendly and infinitely popular alternative… Copyright © 2001, 2004, Andrew W. Moore
Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You’re only allowed to send two bits per point. It’ll have to be a “lossy transmission”. Loss = Sum Squared Error between decoded coords and original coords. What encoder/decoder will lose the least information? Lossy Compression Copyright © 2001, 2004, Andrew W. Moore
Idea One: Break into a grid; decode each bit-pair as the middle of its grid-cell (cells labelled 00, 01, 10, 11). Any better ideas? Copyright © 2001, 2004, Andrew W. Moore
Idea Two: Break into a grid; decode each bit-pair as the centroid of all data in that grid-cell (cells labelled 00, 01, 11, 10). Any further ideas? Copyright © 2001, 2004, Andrew W. Moore
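The two ideas can be compared directly in code. The sketch below is not from the slides: it uses a synthetic stand-in dataset, an assumed 2x2 grid over the bounding box, and made-up variable names, and simply measures which decoder loses less sum-squared error.

```python
# Minimal sketch of the two 2-bit "lossy transmission" ideas:
# Idea One decodes each bit-pair as the middle of its grid-cell,
# Idea Two decodes it as the centroid of the data that fell in that cell.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # stand-in for the slide's dataset

# Split the bounding box into a 2x2 grid -> four codes 00, 01, 10, 11.
lo, hi = X.min(axis=0), X.max(axis=0)
mid = (lo + hi) / 2
codes = (X[:, 0] > mid[0]).astype(int) * 2 + (X[:, 1] > mid[1]).astype(int)

# Idea One: decode a code as the geometric middle of its grid-cell.
cell_centres = np.array([[(lo[0] + mid[0]) / 2, (lo[1] + mid[1]) / 2],
                         [(lo[0] + mid[0]) / 2, (mid[1] + hi[1]) / 2],
                         [(mid[0] + hi[0]) / 2, (lo[1] + mid[1]) / 2],
                         [(mid[0] + hi[0]) / 2, (mid[1] + hi[1]) / 2]])
loss_one = ((X - cell_centres[codes]) ** 2).sum()

# Idea Two: decode a code as the centroid of the points that landed in the cell.
centroids = np.array([X[codes == c].mean(axis=0) for c in range(4)])
loss_two = ((X - centroids[codes]) ** 2).sum()

print(f"Idea One loss: {loss_one:.1f}   Idea Two loss: {loss_two:.1f}")
```

Since the centroid minimizes sum-squared error within each cell, Idea Two's loss can never exceed Idea One's, which motivates the next step: letting the cells themselves move.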
K-means Ask user how many clusters they’d like. (e.g. k=5) Copyright © 2001, 2004, Andrew W. Moore
K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Copyright © 2001, 2004, Andrew W. Moore
K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints) Copyright © 2001, 2004, Andrew W. Moore
K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. Each Center finds the centroid of the points it owns Copyright © 2001, 2004, Andrew W. Moore
K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. Each Center finds the centroid of the points it owns… …and jumps there …Repeat until terminated! Copyright © 2001, 2004, Andrew W. Moore
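Putting the five steps together, here is a minimal from-scratch K-means sketch in Python/NumPy. The function name, the random initialization, the convergence test, and the distortion bookkeeping are illustrative assumptions rather than anything prescribed on the slides.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """A bare-bones K-means loop following the slide's steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # random initial centers
    for _ in range(n_iter):
        # Each datapoint finds out which center it is closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        owner = dists.argmin(axis=1)
        # Each center finds the centroid of the points it owns... and jumps there.
        new_centers = np.array([X[owner == j].mean(axis=0) if (owner == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                  # terminated
            break
        centers = new_centers
    # Distortion = sum of squared distances from each point to its owning center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    owner = dists.argmin(axis=1)
    distortion = dists[np.arange(len(X)), owner].sum()
    return centers, owner, distortion
```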
K-means Start Advance apologies: in Black and White this example will deteriorate Example generated by Dan Pelleg’s super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999, (KDD99) (available on www.autonlab.org/pap.html) Copyright © 2001, 2004, Andrew W. Moore
K-means continues… (several more iterations of the same update; figures not shown) Copyright © 2001, 2004, Andrew W. Moore
K-means terminates Copyright © 2001, 2004, Andrew W. Moore
K-means Questions What is it trying to optimize? Are we sure it will terminate? Are we sure it will find an optimal clustering? How should we start it? How could we automatically choose the number of centers? ….we’ll deal with these questions over the next few slides Copyright © 2001, 2004, Andrew W. Moore
Will we find the optimal configuration? Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…) Copyright © 2001, 2004, Andrew W. Moore
Trying to find good optima Idea 1: Be careful about where you start Idea 2: Do many runs of k-means, each from a different random start configuration Many other ideas floating around. Copyright © 2001, 2004, Andrew W. Moore
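Idea 2 is easy to sketch with the k_means function from the earlier slide: run it from several random starts and keep the run with the lowest distortion. The number of restarts and the variable X (assumed to be an (n, d) NumPy array of datapoints) are placeholders.

```python
# Multiple random restarts; keep the lowest-distortion run.
best = None
for seed in range(10):                                    # 10 restarts, arbitrary choice
    result = k_means(X, k=5, seed=seed)
    if best is None or result[2] < best[2]:
        best = result
centers, owner, distortion = best
```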
Common uses of K-means Often used as an exploratory data analysis tool In one dimension, a good way to quantize real-valued variables into k non-uniform buckets Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization) Also used for choosing color palettes on old-fashioned graphical display devices! Copyright © 2001, 2004, Andrew W. Moore
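As a small illustration of the one-dimensional use, the sketch below quantizes a synthetic skewed variable into k = 4 non-uniform buckets by reusing the k_means sketch above; the data and the choice of k are made up.

```python
import numpy as np

# Vector quantization of a 1-D variable into k non-uniform buckets.
values = np.random.default_rng(1).exponential(size=1000)   # synthetic skewed data
centers, owner, _ = k_means(values.reshape(-1, 1), k=4)
quantized = centers[owner, 0]      # each value replaced by its bucket's centroid
```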
Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Copyright © 2001, 2004, Andrew W. Moore
Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Copyright © 2001, 2004, Andrew W. Moore
Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Copyright © 2001, 2004, Andrew W. Moore
Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Repeat Copyright © 2001, 2004, Andrew W. Moore
Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Repeat Copyright © 2001, 2004, Andrew W. Moore
Single Linkage Hierarchical Clustering How do we define similarity between clusters? Minimum distance between points in clusters (in which case we’re simply doing Euclidean Minimum Spanning Trees) Maximum distance between points in clusters Average distance between points in clusters Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Repeat…until you’ve merged the whole dataset into one cluster You’re left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here) Copyright © 2001, 2004, Andrew W. Moore
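For illustration, here is a short sketch using SciPy's hierarchical-clustering routines (the library choice is an assumption, not the tool used in the lecture); the three linkage methods correspond to the three similarity definitions listed above, and dendrogram draws the hierarchy the slide mentions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(30, 2))   # synthetic stand-in data

# method='single'   -> minimum distance between points in the clusters
# method='complete' -> maximum distance
# method='average'  -> average distance
Z = linkage(X, method='single')

dendrogram(Z)       # the dendrogram / taxonomy / hierarchy of datapoints
plt.show()
```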
Single Linkage Comments Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym) It’s nice that you get a hierarchy instead of an amorphous collection of groups If you want k groups, just cut the (k-1) longest links There’s no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy. Copyright © 2001, 2004, Andrew W. Moore
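Continuing the SciPy sketch above (again an assumption about tooling), asking for a fixed number of flat clusters has roughly the same effect as cutting the (k-1) longest links:

```python
from scipy.cluster.hierarchy import fcluster

# With the linkage matrix Z from the previous sketch, request k = 3 groups.
labels = fcluster(Z, t=3, criterion='maxclust')
```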
What you should know All the details of K-means The theory behind K-means as an optimization algorithm How K-means can get stuck The outline of Hierarchical clustering Be able to judge which problems would be relatively well or poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering Copyright © 2001, 2004, Andrew W. Moore
Exercise 1: Group these cards into three groups using the K-means method
Hierarchical Clustering Methods (Note: the following slides were originally in Chinese.)
Overview of Hierarchical Clustering Methods Hierarchical clustering decomposes a given dataset level by level until some termination condition is satisfied. There are two specific forms: Agglomerative hierarchical clustering: a bottom-up strategy that first treats every object as its own cluster, then merges these atomic clusters into larger and larger clusters until a termination condition is met. Divisive hierarchical clustering: a top-down strategy that first places all objects in a single cluster, then splits it into smaller and smaller clusters until a termination condition is met. AGNES is the representative agglomerative algorithm; DIANA is the representative divisive algorithm.
The AGNES Algorithm AGNES (AGglomerative NESting) initially treats each object as a cluster, then merges these clusters step by step according to some criterion. The similarity between two clusters is determined by the similarity of the closest pair of data points drawn from the two clusters. Merging repeats until all objects satisfy the requested number of clusters.
Algorithm 5-3 AGNES (bottom-up agglomerative algorithm)
Input: a database of n objects; the termination condition, a target number of clusters k.
Output: k clusters, satisfying the termination condition.
(1) Treat each object as an initial cluster;
(2) REPEAT
(3) Find the two closest clusters, according to the closest data points in the two clusters;
(4) Merge the two clusters to form a new set of clusters;
(5) UNTIL the defined number of clusters is reached.
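A from-scratch sketch of Algorithm 5-3 follows; the helper names and the use of Euclidean distance between the closest pair of points are illustrative assumptions consistent with the slide.

```python
import math

def single_link_dist(c1, c2, points):
    # Similarity between clusters = distance between their closest pair of points.
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agnes(points, k):
    clusters = [[i] for i in range(len(points))]       # (1) every object is a cluster
    while len(clusters) > k:                           # (5) until k clusters remain
        # (3) find the two clusters whose closest points are nearest
        a, b = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link_dist(clusters[ab[0]],
                                                   clusters[ab[1]], points))
        # (4) merge them to form the new cluster set
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```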
AGNES Example
Data (8 points, 2 attributes):
Point  Attr 1  Attr 2
1      1       1
2      1       2
3      2       1
4      2       2
5      3       4
6      3       5
7      4       4
8      4       5
Step 1: Compute the distance between every pair of initial clusters and merge (arbitrarily, if tied) the closest pair; the minimum distance is 1, so points 1 and 2 are merged into one cluster.
Step 2: Recompute the inter-cluster distances for the clusters left after the previous merge and merge the closest pair; points 3 and 4 become one cluster.
Step 3: Repeat step 2; points 5 and 6 become one cluster.
Step 4: Repeat step 2; points 7 and 8 become one cluster.
Step 5: Merge {1,2} and {3,4} into one cluster of four points.
Step 6: Merge {5,6} and {7,8}. The number of clusters now equals the user-specified termination condition, so the algorithm stops.
Step  Min cluster distance  Closest clusters  New cluster set
1     1                     {1},{2}           {1,2},{3},{4},{5},{6},{7},{8}
2     1                     {3},{4}           {1,2},{3,4},{5},{6},{7},{8}
3     1                     {5},{6}           {1,2},{3,4},{5,6},{7},{8}
4     1                     {7},{8}           {1,2},{3,4},{5,6},{7,8}
5     1                     {1,2},{3,4}       {1,2,3,4},{5,6},{7,8}
6     1                     {5,6},{7,8}       {1,2,3,4},{5,6,7,8} (done)
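Feeding the slide's 8-point dataset to the agnes sketch above reproduces the final result; the intermediate merge order may differ because distance ties are broken differently, but the two clusters at termination match the table.

```python
points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 4), (3, 5), (4, 4), (4, 5)]
# Final two clusters are {1,2,3,4} and {5,6,7,8} (0-based: [0,1,2,3] and [4,5,6,7]).
print(agnes(points, k=2))
```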
AGNES Performance Analysis AGNES is fairly simple, but it often runs into difficulty choosing the merge point. Once a group of objects has been merged, the next step operates on the newly formed clusters: earlier merges cannot be undone, and objects cannot be exchanged between clusters. A poor merge decision at some step can therefore lead to low-quality clustering results. The method also does not scale well, because each merge decision requires examining and evaluating many objects or clusters. Suppose we start with n clusters and end with 1: the main loop runs n iterations, and in iteration i we must find the two closest clusters among n - i + 1 clusters. In addition, the algorithm must compute the distance between every pair of objects, so its complexity is O(n²), which makes it unsuitable when n is large.
The DIANA Algorithm DIANA (DIvisive ANAlysis) is the typical divisive clustering method. The user can specify the desired number of clusters as the termination condition. It uses two measures: Cluster diameter: the largest distance between any two data points in a cluster. Average dissimilarity (average distance): the average distance from a point to the other points.
Algorithm 5-4 DIANA (top-down divisive algorithm)
Input: a database of n objects; the termination condition, a target number of clusters k.
Output: k clusters, satisfying the termination condition.
(1) Treat all objects together as a single initial cluster;
(2) FOR (i = 1; i ≠ k; i++) DO BEGIN
(3) Among all clusters, pick the cluster C with the largest diameter;
(4) Find the point p in C with the largest average dissimilarity to the other points, put p into the splinter group, and leave the rest in the old party;
(5) REPEAT
(6) In the old party, find a point whose distance to the nearest point in the splinter group is not greater than its distance to the nearest other point in the old party, and move that point to the splinter group;
(7) UNTIL no point in the old party is reassigned to the splinter group;
(8) The splinter group and the old party are the two clusters into which the selected cluster is split; together with the other clusters they form the new cluster set;
(9) END.
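A from-scratch sketch of Algorithm 5-4, using the cluster-diameter and average-dissimilarity measures defined above; the names and the Euclidean metric are illustrative assumptions.

```python
import math

def diameter(cluster, points):
    # Cluster diameter = largest distance between any two of its points.
    return max((math.dist(points[i], points[j])
                for i in cluster for j in cluster if i != j), default=0.0)

def avg_dissimilarity(i, others, points):
    return sum(math.dist(points[i], points[j]) for j in others) / len(others)

def diana(points, k):
    clusters = [list(range(len(points)))]              # (1) one initial cluster
    while len(clusters) < k:                           # (2) until k clusters exist
        # (3) pick the cluster with the largest diameter
        c = max(clusters, key=lambda cl: diameter(cl, points))
        clusters.remove(c)
        # (4) the most dissimilar point starts the splinter group
        p = max(c, key=lambda i: avg_dissimilarity(i, [j for j in c if j != i], points))
        splinter, old_party = [p], [j for j in c if j != p]
        moved = True
        while moved:                                   # (5)-(7) grow the splinter group
            moved = False
            for i in list(old_party):
                d_splinter = min(math.dist(points[i], points[j]) for j in splinter)
                rest = [j for j in old_party if j != i]
                d_old = min((math.dist(points[i], points[j]) for j in rest),
                            default=math.inf)
                if d_splinter <= d_old:
                    old_party.remove(i); splinter.append(i); moved = True
        clusters += [splinter, old_party]              # (8) the split yields two clusters
    return clusters
```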
DIANA Example
Data (the same 8 points):
Point  Attr 1  Attr 2
1      1       1
2      1       2
3      2       1
4      2       2
5      3       4
6      3       5
7      4       4
8      4       5
Step 1: Find the cluster with the largest diameter and compute each point's average dissimilarity to the rest (Euclidean distance assumed).
Average distance of point 1: (1 + 1 + 1.414 + 3.6 + 4.24 + 4.47 + 5) / 7 = 2.96.
Similarly, point 2: 2.526; point 3: 2.68; point 4: 2.18; point 5: 2.18; point 6: 2.68; point 7: 2.526; point 8: 2.96.
Point 1, with the largest average dissimilarity, is put into the splinter group; the remaining points form the old party.
Step 2: In the old party, find a point whose distance to the nearest splinter-group point is not greater than its distance to the nearest old-party point, and move it to the splinter group; that point is 2.
Step 3: Repeat step 2; point 3 joins the splinter group.
Step 4: Repeat step 2; point 4 joins the splinter group.
Step 5: No further old-party points can be moved to the splinter group, and the termination condition (k = 2) has been reached, so the algorithm stops. If the termination condition had not been reached, the cluster with the largest diameter among the resulting clusters would be chosen and split further.
Step  Cluster with largest diameter  Splinter group  Old party
1     {1,2,3,4,5,6,7,8}              {1}             {2,3,4,5,6,7,8}
2     {1,2,3,4,5,6,7,8}              {1,2}           {3,4,5,6,7,8}
3     {1,2,3,4,5,6,7,8}              {1,2,3}         {4,5,6,7,8}
4     {1,2,3,4,5,6,7,8}              {1,2,3,4}       {5,6,7,8}
5     {1,2,3,4,5,6,7,8}              {1,2,3,4}       {5,6,7,8}   (terminate)
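Running the diana sketch above on the slide's dataset reproduces the split (indices are 0-based in the code, 1-based on the slide):

```python
points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 4), (3, 5), (4, 4), (4, 5)]
# Produces the two clusters {1,2,3,4} and {5,6,7,8} (0-based: [0,1,2,3] and [4,5,6,7]).
print(diana(points, k=2))
```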
Exercise 2: Group the credit card applicants into two groups (using AGNES or DIANA)
Applied Human Computer Interaction, Lecture 11: Stock Index Forecasting. Yan Ke
Stock indices
Stock Index Data: one row per trading day, one column per minute of the session (1st min, 2nd min, …), plus a class label for each day.
Day    Minute-by-minute values    Class
day1   1st min, 2nd min, …        Class1
day2   1st min, 2nd min, …        Class2
day3   1st min, 2nd min, …        Class3
Averaged curves