第三章 关联规则挖掘 Association Rule Mining
背景简介(Motivation) 超市购物:商场经理可能想了解顾客的购物习惯。例如: “顾客多半会在一次购物时买哪些商品?”。分析的结果 可用于市场规划、广告策划和分类设计。 文本分类:个性化新闻推荐系统希望对新闻进行分类,推 进用户感兴趣类别的新闻内容给用户。系统可以通过挖掘 哪些关键词与某个类别经常联系在一起,找出文档的分类 标准。 信息推荐:电子商务网站推荐用户所需的信息。如:下载 某种类型音乐的用户通常具有什么样的特点 解决这些问题的一种有效途径就是“Association Rule Mining” (关联规则挖掘)
Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules {Diaper} {Beer}, {Milk, Bread} {Eggs,Coke}, {Beer, Bread} {Milk}, Implication means co-occurrence, not causality!
Definition: Frequent Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k items Support count () Frequency of occurrence of an itemset E.g. ({Milk, Bread,Diaper}) = 2 Support Fraction of transactions that contain an itemset E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule An implication expression of the form X Y, where X and Y are itemsets Example: {Milk, Diaper} {Beer} Rule Evaluation Metrics Support (s) Fraction of transactions that contain both X and Y Confidence (c) Measures how often items in Y appear in transactions that contain X Example:
Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold confidence ≥ minconf threshold Brute-force approach: List all possible association rules Compute the support and confidence for each rule Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!
Mining Association Rules Example of Rules: {Milk,Diaper} {Beer} (s=0.4, c=0.67) {Milk,Beer} {Diaper} (s=0.4, c=1.0) {Diaper,Beer} {Milk} (s=0.4, c=0.67) {Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence Thus, we may decouple the support and confidence requirements
Mining Association Rules Two-step approach: Frequent Itemset Generation Generate all itemsets whose support minsup Rule Generation Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive
Frequent Itemset Generation Given d items, there are 2d possible candidate itemsets
Frequent Itemset Generation Brute-force approach: Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database Match each transaction against every candidate Complexity ~ O(NMw) => Expensive since M = 2d !!!
Computational Complexity Given d unique items: Total number of itemsets = 2d Total number of possible association rules: If d=6, R = 602 rules
Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2d Use pruning techniques to reduce M Reduce the number of transactions (N) Reduce size of N as the size of itemset increases Used by DHP and vertical-based mining algorithms Reduce the number of comparisons (NM) Use efficient data structures to store the candidates or transactions No need to match every candidate against every transaction
Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Illustrating Apriori Principle Found to be Infrequent Pruned supersets
Illustrating Apriori Principle Items (1-itemsets) Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs) Minimum Support = 3 Triplets (3-itemsets) If every subset is considered, 6C1 + 6C2 + 6C3 = 41 With support-based pruning, 6 + 6 + 1 = 13
Apriori Algorithm Method: Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent
Apriori Algorithm L1 = {frequent 1-itemset}; for (k=2; Lk-1 ; k++) do begin Ck = apriori-gen(Lk-1); // 生成大小为k的候选项集合 Ck为大小k的候选项集构成的集合 Lk-1 for all transactions t D do begin Ct = subset(Ck,t); // 事务数据t中包含的候选项集 for all candidates c Ct do c.count ++; end Lk = {c Ck |c.count minsup} Answer =
生成候选项集合 Apriori算法是根据有关频繁项集性质的先验知识Apriori Principal命名的。该算法使用一种逐层搜索的迭代方法 ,利用k-项集产生(k+1)-项集。 具体做法:首先找出频繁1-项集的集合,记为L1 ;再用L1 找频繁2-项集的集合L2 ;再用L2找L3 …如此下去,直到 不能找到频繁k-项集为止。找每个Lk需要一次数据库扫描 。
生成候选项集合 连接步产生候选项集 剪枝步确定频繁项集 整个过程由连接和剪枝两步组成,即: (1)连接步 为找Lk,可通过Lk-1与自己连接,产生一个候 选k-项集的集合,该候选项集的集合记作Ck 。 剪枝步确定频繁项集 连接步产生候选项集
生成候选项集合 设l1和l2是Lk-1中的项集,记号li[j]表示li 的第j 项。为方便计,假定事务或项集中的项按字典次序 排序。 执行连接 , 其中Lk-1的元素是可 连接的,如果它们前(k-2)个项相同。 Lk-1
生成候选项集合 即,Lk-1的元素l1和l2是可连接的,若: ( l1[1] = l2[1] ∧ l1 [2] = l2 [2] ∧ … ∧ l1[k-2] = l2[k- 2] ∧ l1[k-1] < l2[k-1] ) 而条件(l1[k-1] < l2[k-1])可确保不产生重复的项 集。
生成候选项集合 (2)剪枝步 Ck是Lk的超集,即它的成员不一定都是频繁项 集,但所有的频繁k-项集都包含在Ck中。 扫描数据库,确定Ck中每个候选项集的计数, 从而确定Lk。然而, Ck可能很大,这样所涉及的计 算量就很大。
生成候选项集合 为了压缩Ck,可利用Apriori性质:任何非频繁 的(k-1)-项集都不可能是频繁k-项集的子集。因此, 若一个候选k-项集的(k-1)-项子集不在Lk-1中,则该 候选也不可能是频繁的,从而可以从Ck中删除。
生成候选项集合 TID 项ID的列表 T100 I1,I2,I5 T200 I2,I4 T300 I2,I3 T400 I1,I2,I4 I1,I2,I3,I5 T900 I1,I2,I3 【例 】一个Apriori的具 体例子,该例基于右图 某商店的事务DB。DB中 有9个事务, Apriori 假定事务中的项按字典 次序存放。
(1)在算法的第一次迭代,每个项都是候选1-项 集的集合C1的成员。算法简单地扫描所有的事 务,对每个项的出现次数计数。 项集 支持度计数 {I1} 6 {I2} 7 {I3} {I4} 2 {I5} 扫描D,对每个候选计数
(2)设最小支持计数为2,可以确定频繁 1-项集的集合L1 。它由具有最小支持 度的候选1-项集组成。 支持度计数 {I1} 6 {I2} 7 {I3} {I4} 2 {I5} 比较候选支持度计数与最小支持度计数
(3)为发现频繁2-项集的集合L2 ,算法 使用 产生候选2-项集集合C2 。 {I1,I2} {I1,I3} {I1,I4} {I1,I5} {I2,I3} {I2,I4} {I2,I5} {I3,I4} {I3,I5} {I4,I5} C2 由L1产生候选C2
(4)扫描D中事务,计算C2中每个候选项 集的支持计数。 项集 支持度计数 {I1,I2} 4 {I1,I3} {I1,I4} 1 {I1,I5} 2 {I2,I3} {I2,I4} {I2,I5} {I3,I4} {I3,I5} {I4,I5} 扫描D,对每个候选计数
(5)确定频繁2-项集的集合L2 ,它由具 有最小支持度的C2中的候选2-项集组成。 支持度计数 {I1,I2} 4 {I1,I3} {I1,I5} 2 {I2,I3} {I2,I4} {I2,I5} 比较候选支持度计数与最小支持度计数
L2 {{I1, I2},{I1, I3},{I1, I5},{I2, I3},{I2, I4},{I2,I5}} (6)候选3-项集的集合C3 的产生如下: ① 连接: C3 = = {{I1, I2},{I1, I3},{I1, I5},{I2, I3},{I2, I4},{I2, I5}} {{I1, I2},{I1, I3},{I1, I5},{I2, I3},{I2, I4},{I2,I5}} = {{I1, I2, I3},{I1, I2, I5},{I1, I3, I5},{I2, I3, I4},{I2, I3, I5},{I2, I4, I5}} L2
② 利用Apriori性质剪枝:频繁项集的所有子集必须是频繁的。存在候选项集,判断其子集是否频繁。 {I1,I2,I3}的2-项子集是{I1,I2},{I1,I3}和{I2,I3}, 它们都是L2的元素。因此保留{I1,I2,I3}在C3中。 {I1,I2,I5}的2-项子集是{I1,I2},{I1,I5}和{I2,I5}, 它们都是L2的元素。因此保留{I1,I2,I5}在C3中。 {I1,I3,I5}的2-项子集是{I1,I3},{I1,I5}和{I3,I5}, {I3,I5}不是L2的元素,因而不是频繁的,由C3中删除{I1,I3,I5}。
{I2,I3,I4}的2-项子集是{I2,I3},{I2,I4}和{I3,I4}, 其中{I3,I4}不是L2的元素,因而不是频繁的,由C3中删除 {I2,I3,I4}。 {I2,I3,I5}的2-项子集是{I2,I3},{I2,I5}和{I3,I5}, 其中{I3,I5}不是L2的元素,因而不是频繁的, 由C3中删除 {I2,I3,I5}。 {I2,I4,I5}的2-项子集是{I2,I4},{I2,I5}和{I4,I5}, 其中{I4,I5}不是L2的元素,因而不是频繁的, 由C3中删除{I2,I4,I5} 。
C3 由L2产生候选C3 扫描D,对每个候选计数 C3 项集 {I1,I2,I3} {I1,I2,I5} 项集 支持度计数 ③ 这样,剪枝后C3 = {{I1,I2,I3},{I1,I2,I5}}。 (7)扫描D中事务,以确定L3 ,它由C3中具有 最小支持度的的候选3-项集组成。 项集 {I1,I2,I3} {I1,I2,I5} C3 由L2产生候选C3 扫描D,对每个候选计数 项集 支持度计数 {I1,I2,I3} 2 {I1,I2,I5} C3
L3 (8)算法使用 产生候选4-项集的集合C4 。尽 管连接产生结果 {{I1,I2,I3,I5}},这个项集被剪去,因为 它的子集{I2,I3,I5}不是频繁的。则 C4 = ,因此算法 终止,找出了所有的频繁项集。 ψ L3 项集 支持度计数 {I1,I2,I3} 2 {I1,I2,I5} 比较候选支持度计数与最小支持度计数
候选项目集的生成函数apriori-gen 以Lk-1作为输入,输出全部频繁k-项目集的一个超集。该函 数包含两个操作,连接(join)与修剪(prune)。连接操作将Lk- 1中的频繁项目集按如下方式进行拼接: insert into Ck select p.item1, p.item2, …, p.itemk-1, q. itemk-1 from Lk-1 p, Lk-1 q where p. item1=q. item1,…, p. itemk-2=q. itemk-2, p. itemk-1 < q. itemk-1;
修剪操作 对Ck中任一候选项集c,若c的某个大小为k-1的子 集不属于Lk-1,则将其从Ck中删除。 forall itemsets c Ck do forall (k-1)-subsets s of c do if (sLk-1) then delete c from Ck ;
例子 假设集合L3={{1 2 3},{1 2 4},{1 3 4},{1 3 5},{2 3 4}}, 通过连接得到C4 ={{1 2 3 4},{1 3 4 5}}。 由于项集{1 4 5}不在L3中,所以修剪操作将{1 3 4 5}从C4 中删除。
例子(设min_sup = 50%)
Reducing Number of Comparisons Candidate counting: Scan the database of transactions to determine the support of each candidate itemset To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
Generate Hash Tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: Hash function (MOD 3) Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node) 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1,4,7 2,5,8 3,6,9 Hash function
Association Rule Discovery: Hash tree Hash Function Candidate Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 3,6,9 2,5,8 Hash on 1, 4 or 7
Association Rule Discovery: Hash tree Hash Function Candidate Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 3,6,9 2,5,8 Hash on 2, 5 or 8
Association Rule Discovery: Hash tree Hash Function Candidate Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 3,6,9 2,5,8 Hash on 3, 6 or 9
Subset Operation Given a transaction t, what are the possible subsets of size 3?
Subset Operation Using Hash Tree 1,4,7 2,5,8 3,6,9 Hash Function 1 2 3 5 6 transaction 1 + 2 3 5 6 3 5 6 2 + 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 5 6 3 +
Subset Operation Using Hash Tree 1,4,7 2,5,8 3,6,9 Hash Function 1 2 3 5 6 transaction 1 + 2 3 5 6 3 5 6 2 + 3 5 6 1 2 + 5 6 3 + 5 6 1 3 + 2 3 4 6 1 5 + 5 6 7 1 4 5 1 3 6 3 4 5 3 5 6 3 5 7 3 6 7 3 6 8 6 8 9 1 2 4 1 2 5 1 5 9 4 5 7 4 5 8
Subset Operation Using Hash Tree 1,4,7 2,5,8 3,6,9 Hash Function 1 2 3 5 6 transaction 1 + 2 3 5 6 3 5 6 2 + 3 5 6 1 2 + 5 6 3 + 5 6 1 3 + 2 3 4 6 1 5 + 5 6 7 1 4 5 1 3 6 3 4 5 3 5 6 3 5 7 3 6 7 3 6 8 6 8 9 1 2 4 1 2 5 1 5 9 4 5 7 4 5 8 Match transaction against 11 out of 15 candidates
Factors Affecting Complexity Choice of minimum support threshold lowering support threshold results in more frequent itemsets this may increase number of candidates and max length of frequent itemsets Dimensionality (number of items) of the data set more space is needed to store support count of each item if number of frequent items also increases, both computation and I/O costs may also increase Size of database since Apriori makes multiple passes, run time of algorithm may increase with number of transactions Average transaction width transaction width increases with denser data sets This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
AprioriTid算法
AprioriTid算法 wm
Compact Representation of Frequent Itemsets Some itemsets are redundant because they have identical support as their supersets Number of frequent itemsets Need a compact representation
Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent Maximal Itemsets Infrequent Itemsets Border
Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset Immediate紧接的
Maximal vs Closed Itemsets Transaction Ids Not supported by any transactions
Maximal vs Closed Frequent Itemsets Closed but not maximal Minimum support = 2 Closed and maximal # Closed = 9 # Maximal = 4
Maximal vs Closed Itemsets
Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice General-to-specific vs Specific-to-general
Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice Equivalent Classes
Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice Breadth-first vs Depth-first(通常用于发现极大频繁项集)
Alternative Methods for Frequent Itemset Generation Representation of Database horizontal vs vertical data layout
Tree Projection Set enumeration tree: Possible Extension: E(A) = {B,C,D,E} Possible Extension: E(ABC) = {D,E}
Tree Projection Items are listed in lexicographic order Each node P stores the following information: Itemset for node P List of possible lexicographic extensions of P: E(P) Pointer to projected database of its ancestor node Bitvector containing information about which transactions in the projected database contain the itemset (1,0,1,1,0,0,…)
Projected Database Projected Database for node A: Original Database: For each transaction T, projected transaction at node A is T E(A)
ECLAT For each item, store a list of transaction ids (tids) TID-list
ECLAT Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets. 3 traversal approaches: top-down, bottom-up and hybrid Advantage: very fast support counting Disadvantage: intermediate tid-lists may become too large for memory
FP-growth Algorithm Use a compressed representation of the database using an FP-tree Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
FP-tree construction null After reading TID=1: A:1 B:1
FP-Tree Construction null B:3 A:7 B:5 C:3 C:1 D:1 D:1 C:3 E:1 D:1 E:1 Transaction Database null B:3 A:7 B:5 C:3 C:1 D:1 Header table D:1 C:3 E:1 D:1 E:1 D:1 E:1 D:1 Pointers are used to assist frequent itemset generation
FP-growth Conditional Pattern base for D: P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)} Recursively apply FP-growth on P Frequent Itemsets found (with sup > 1): AD, BD, CD, ACD, BCD null A:7 B:1 B:5 C:1 C:1 D:1 D:1 C:3 D:1 D:1 D:1
基于频繁模式树(FP-Tree)的频繁模式挖掘算法 步骤: 从交易数据库中构造FP-tree 利用FP-tree挖掘频繁模式
从交易数据库中构造FP-tree TID Items bought (ordered) frequent items 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} 200 {a, b, c, f, l, m, o} {f, c, a, b, m} 300 {b, f, h, j, o} {f, b} 400 {b, c, k, s, p} {c, b, p} 500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} min_support = 0.5 {} f:4 c:1 b:1 p:1 c:3 a:3 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 步骤: 对DB进行一次扫描, 找出单个频繁项模式(frequent 1-itemset ); 根据频度的降序对频繁项进行排序; 对DB作第2次扫描, 构造FP-tree。
利用FP-tree挖掘频繁模式 基本思想 (divide-and-conquer) 方法 利用FP-tree递归地挖掘频繁模式 对每个项, 先构造其条件模式基(conditional pattern-base), 接着再构造出它的条件FP-tree ( conditional FP-tree)。 重复地对每个新创建的条件FP-tree挖掘出频繁模式 直至该FP-tree为空, 或者仅包含单条路径 (在单条路径的 情况下,生成所有可能的子路径组合, 每个子路径都是一个频繁 模式)
挖掘FP-tree的主要步骤: 为FP-tree中的每个节点构造条件模式基; 根据每个条件模式基,构造出条件FP-tree;
Step 1: 由FP-tree到条件模式基 (Conditional Pattern Base) 聚集相应频繁项的前缀路径,形成一个条件模式基 {} f:4 c:1 b:1 p:1 c:3 a:3 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 Conditional pattern bases item cond. pattern base c f:3 a fc:3 b fca:1, f:1, c:1 m fca:2, fcab:1 p fcam:2, cb:1
Step 2: 构造条件FP-tree For each pattern-base {} Item frequency head 累加模式基中的每个项的计数 (Accumulate the count for each item in the base) 为模式基的频繁项构造FP-tree(Construct the FP-tree for the frequent items of the pattern base) {} m-conditional pattern base: fca:2, fcab:1 项目头指针表 Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 f:4 c:1 All frequent patterns concerning m m, fm, cm, am, fcm, fam, cam, fcam {} f:3 c:3 a:3 m-conditional FP-tree c:3 b:1 b:1 a:3 p:1 m:2 b:1 p:2 m:1
通过创建条件模式基挖掘频繁模式 Item 条件模式基 条件FP-tree p {(fcam:2), (cb:1)} {(c:3)}|p m {(fca:2), (fcab:1)} {(f:3, c:3, a:3)}|m b {(fca:1), (f:1), (c:1)} 空 a {(fc:3)} {(f:3, c:3)}|a c {(f:3)} {(f:3)}|c f
Step 3: 递归挖掘条件FP-tree {} f:3 {} c:3 f:3 c:3 {} a:3 f:3 {} f:3 am-conditional FP-tree {} f:3 c:3 a:3 m-conditional FP-tree Cond. pattern base of “am”: (fc:3) {} Cond. pattern base of “cm”: (f:3) f:3 cm-conditional FP-tree {} Cond. pattern base of “cam”: (f:3) f:3 cam-conditional FP-tree
单条FP-tree路径的频繁模式生成 假定FP-tree T只有单条路径P 通过枚举路径P的所有子路径,生成T的全部频繁模式 {} 与m有关的所有频繁模式 m, fm, cm, am, fcm, fam, cam, fcam f:3 c:3 a:3 m-conditional FP-tree
Rule Generation Given a frequent itemset L, find all non-empty subsets f L such that f L – f satisfies the minimum confidence requirement If {A,B,C,D} is a frequent itemset, candidate rules: ABC D, ABD C, ACD B, BCD A, A BCD, B ACD, C ABD, D ABC AB CD, AC BD, AD BC, BC AD, BD AC, CD AB, If |L| = k, then there are 2k – 2 candidate association rules (ignoring L and L)
Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti- monotone property c(ABC D) can be larger or smaller than c(AB D) But confidence of rules generated from the same itemset has an anti-monotone property e.g., L = {A,B,C,D}: c(ABC D) c(AB CD) c(A BCD) Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
Rule Generation for Apriori Algorithm Lattice of rules Pruned Rules Low Confidence Rule
Rule Generation for Apriori Algorithm Candidate rule is generated by merging two rules that share the same prefix in the rule consequent join(CD=>AB,BD=>AC) would produce the candidate rule D => ABC Prune rule D=>ABC if its subset AD=>BC does not have high confidence
Effect of Support Distribution Many real data sets have skewed support distribution Support distribution of a retail data set
Effect of Support Distribution How to set the appropriate minsup threshold? If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products) If minsup is set too low, it is computationally expensive and the number of itemsets is very large Using a single minimum support threshold may not be effective
Multiple Minimum Support How to apply multiple minimum supports? MS(i): minimum support for item i e.g.: MS(Milk)=5%, MS(Coke) = 3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5% MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) = 0.1% Challenge: Support is no longer anti-monotone Suppose: Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5% {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent
Multiple Minimum Support
Multiple Minimum Support
Multiple Minimum Support (Liu 1999) Order the items according to their minimum support (in ascending order) e.g.: MS(Milk)=5%, MS(Coke) = 3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5% Ordering: Broccoli, Salmon, Coke, Milk Need to modify Apriori such that: L1 : set of frequent items F1 : set of items whose support is MS(1) where MS(1) is mini( MS(i) ) C2 : candidate itemsets of size 2 is generated from F1 instead of L1
Multiple Minimum Support (Liu 1999) Modifications to Apriori: In traditional Apriori, A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k The candidate is pruned if it contains any infrequent subsets of size k Pruning step has to be modified: Prune only if subset contains the first item e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to minimum support) {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent Candidate is not pruned because {Coke,Milk} does not contain the first item, i.e., Broccoli.
Pattern Evaluation Association rule algorithms tend to produce too many rules many of them are uninteresting or redundant Redundant if {A,B,C} {D} and {A,B} {D} have same support & confidence Interestingness measures can be used to prune/rank the derived patterns In the original formulation of association rules, support & confidence are the only measures used
Application of Interestingness Measure Interestingness Measures
Computing Interestingness Measure Given a rule X Y, information needed to compute rule interestingness can be obtained from a contingency table Contingency table for X Y Y X f11 f10 f1+ f01 f00 fo+ f+1 f+0 |T| f11: support of X and Y f10: support of X and Y f01: support of X and Y f00: support of X and Y Contingency table 列联表 Used to define various measures support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence Coffee Tea 15 5 20 75 80 90 10 100 Association Rule: Tea Coffee Confidence= P(Coffee|Tea) = 0.75 but P(Coffee) = 0.9 Although confidence is high, rule is misleading P(Coffee|Tea) = 0.9375
Statistical Independence Population of 1000 students 600 students know how to swim (S) 700 students know how to bike (B) 420 students know how to swim and bike (S,B) P(SB) = 420/1000 = 0.42 P(S) P(B) = 0.6 0.7 = 0.42 P(SB) = P(S) P(B) => Statistical independence P(SB) > P(S) P(B) => Positively correlated P(SB) < P(S) P(B) => Negatively correlated
Statistical-based Measures Measures that take into account statistical dependence
Example: Lift/Interest Coffee Tea 15 5 20 75 80 90 10 100 Association Rule: Tea Coffee Confidence= P(Coffee|Tea) = 0.75 but P(Coffee) = 0.9 Lift = 0.75/0.9= 0.8333 (< 1, therefore is negatively associated)
There are lots of measures proposed in the literature Some measures are good for certain applications, but not for others What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support based pruning? How does it affect these measures?
Properties of A Good Measure Piatetsky-Shapiro: 3 properties a good measure M must satisfy: M(A,B) = 0 if A and B are statistically independent M(A,B) increase monotonically with P(A,B) when P(A) and P(B) remain unchanged M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Comparing Different Measures 10 examples of contingency tables: Rankings of contingency tables using various measures:
Property under Variable Permutation Does M(A,B) = M(B,A)? Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc Asymmetric measures: confidence, conviction, Laplace, J-measure, etc
Property under Row/Column Scaling Grade-Gender Example (Mosteller, 1968): Male Female High 2 3 5 Low 1 4 7 10 Male Female High 4 30 34 Low 2 40 42 6 70 76 2x 10x Mosteller: Underlying association should be independent of the relative number of male and female students in the samples
Property under Inversion Operation Transaction 1 . Transaction N
Example: -Coefficient -coefficient is analogous to correlation coefficient for continuous variables Y X 60 10 70 20 30 100 Y X 20 10 30 60 70 100 -Coefficient is the same for both tables
Property under Null Addition Invariant measures: support, cosine, Jaccard, etc Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc
Different Measures have Different Properties
Support-based Pruning Most of the association rule mining algorithms use support measure to prune rules and itemsets Study effect of support pruning on correlation of itemsets Generate 10000 random contingency tables Compute support and pairwise correlation for each table Apply support-based pruning and examine the tables that are removed
Effect of Support-based Pruning
Effect of Support-based Pruning Support-based pruning eliminates mostly negatively correlated itemsets
Effect of Support-based Pruning Investigate how support-based pruning affects other measures Steps: Generate 10000 contingency tables Rank each table according to the different measures Compute the pair-wise correlation between the measures
Effect of Support-based Pruning Without Support Pruning (All Pairs) Scatter Plot between Correlation & Jaccard Measure Red cells indicate correlation between the pair of measures > 0.85 40.14% pairs have correlation > 0.85
Effect of Support-based Pruning Scatter Plot between Correlation & Jaccard Measure: 61.45% pairs have correlation > 0.85
Effect of Support-based Pruning Scatter Plot between Correlation & Jaccard Measure 76.42% pairs have correlation > 0.85
Subjective Interestingness Measure Objective measure: Rank patterns based on statistics computed from data e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc). Subjective measure: Rank patterns according to user’s interpretation A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin) A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin)
Interestingness via Unexpectedness Need to model expectation of users (domain knowledge) Need to combine expectation of users with evidence from data (i.e., extracted patterns) + Pattern expected to be frequent - Pattern expected to be infrequent Pattern found to be frequent Pattern found to be infrequent + - Expected Patterns - + Unexpected Patterns
Interestingness via Unexpectedness Web Data (Cooley et al 2001) Domain knowledge in the form of site structure Given an itemset F = {X1, X2, …, Xk} (Xi : Web pages) L: number of links connecting the pages lfactor = L / (k k-1) cfactor = 1 (if graph is connected), 0 (disconnected graph) Structure evidence = cfactor lfactor Usage evidence Use Dempster-Shafer theory to combine domain knowledge and evidence from data
P250 2、3、4、9、11