Maintaining Frequent Itemsets over High-Speed Data Streams

Slides:



Advertisements
Similar presentations
Course 1 演算法: 效率、分析與量級 Algorithms: Efficiency, Analysis, and Order
Advertisements

CHAPTER 9 虛擬記憶體管理 9.2 分頁需求 9.3 寫入時複製 9.4 分頁替換 9.5 欄的配置法則 9.6 輾轉現象
第十九课 旅行.
课程:跨境电商 资料源:阿里巴巴教学资源库
CH3 關聯規則 授課老師:簡禎富 講座教授 簡禎富、許嘉裕©2014 著作權所有.
寻找适合您的工业4.0 Dell/曾峰.
第三章 隨機變數.
Chapter 8 Liner Regression and Correlation 第八章 直线回归和相关
Figure Interpreting. Introduction In recording an English figure, its three digits make one subsection, while in Chinese, its four digits make one subsection.
Understanding Interest Rates
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
Author: Shigeki Takeuchi,Hiroyuki Koga, Katsuyoshi Iida,
Euler’s method of construction of the Exponential function
IEEE TRANSACTIONS ON MAGNETICS, VOL. 49, NO. 3, MARCH 2013
Unit 4 Astronomy the science of the stars.
Feng Lin, Chen Song, Yan Zhuang, Wenyao Xu, Changzhi Li, Kui Ren
Some Effective Techniques for Naive Bayes Text Classification
Rate and Distortion Optimization for Reversible Data Hiding Using Multiple Histogram Shifting Source: IEEE Transactions On Cybernetics, Vol. 47, No. 2,February.
Population proportion and sample proportion
毕业论文报告 孙悦明
微積分網路教學課程 應用統計學系 周 章.
樹狀結構 陳怡芬 2018/11/16 北一女中資訊專題研究.
Source: IEEE Access, vol. 5, pp , October 2017
On Some Fuzzy Optimization Problems
Knowledge Engineering & Artificial Intelligence Lab (知識工程與人工智慧)
Ch2 Infinite-horizon and Overlapping- generations Models (无限期与跨期模型)
Sampling Theory and Some Important Sampling Distributions
CCF ADL66大数据管理系统和技术 刘达欣 2018/11/28.
第五組 : 廖震昌 / 謝坤吉 / 黃麗珍 陳曉伶 / 陳思因 / 林慧佳
聲轉電信號.
Interval Estimation區間估計
2019/1/1 哈萨克铬业公司介绍 2008年10月10日.
塑膠材料的種類 塑膠在模具內的流動模式 流動性質的影響 溫度性質的影響
消費者偏好與效用概念.
ZEEV ZEITIN Delft University of Technology, Netherlands
Advisor : Prof. Frank Y.S. Lin Presented by Yen-Yi, Hsu
A high payload data hiding scheme based on modified AMBTC technique
重阳节 The Double Ninth Festival


研究技巧與論文撰寫方法 中央大學資管系 陳彥良.
高性能计算与天文技术联合实验室 智能与计算学部 天津大学
Mechanics Exercise Class Ⅰ
Guide to a successful PowerPoint design – simple is best
相關統計觀念復習 Review II.
中国科学技术大学计算机系 陈香兰 2013Fall 第七讲 存储器管理 中国科学技术大学计算机系 陈香兰 2013Fall.
虚 拟 仪 器 virtual instrument
成才之路 · 英语 人教版 · 必修1 路漫漫其修远兮 吾将上下而求索.
A Data Mining Algorithm for Generalized Web Prefetching
第九章 摘要.
資料庫系統實驗室 指導教授:張玉盈.
An Efficient MSB Prediction-based Method for High-capacity Reversible Data Hiding in Encrypted Images 基于有效MSB预测的加密图像大容量可逆数据隐藏方法。 本文目的: 做到既有较高的藏量(1bpp),
美國亞利桑納州Eurofresh農場的晨曦
第10章 存储器接口 罗文坚 中国科大 计算机学院
An organizational learning approach to information systems development
BiCuts: A fast packet classification algorithm using bit-level cutting
Introduction to Probability Theory ‧1‧
Nucleon EM form factors in a quark-gluon core model
Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨
磁共振原理的临床应用.
 隐式欧拉法 /* implicit Euler method */
Speaker : YI-CHENG HUNG
动词不定式(6).
Chapter 9 Validation Prof. Dehan Luo
MGT 213 System Management Server的昨天,今天和明天
簡單迴歸分析與相關分析 莊文忠 副教授 世新大學行政管理學系 計量分析一(莊文忠副教授) 2019/8/3.
Experimental Analysis of Distributed Graph Systems
Gaussian Process Ruohua Shi Meeting
Hybrid fractal zerotree wavelet image coding
Presentation transcript:

Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, Wilfred Ng (PAKDD 2006) Advisor:Dr. Koh Jia-Ling Speaker:Tu Yi-Lang Date:2007.10.30

Introduction Frequent itemset (FI) mining is fundamental to many important data mining tasks. Recently, the increasing prominence of data streams has led to the study of online mining of Due to the constraints on both memory consumption and processing efficiency. seek to approximate FIs over streams. 為什麼有onlin的趨勢:因為stream的量太大了,不可能將所有的steam都存在記憶體中,所以需要這種real-time的步驟(程序)。而不像off-line的方法,在事前或是事後進行分析,所花費的記憶體量會較大。 FIs.

Introduction Existing approximation techniques for mining FIs are mainly false-positive. Using an error parameter,ε, to control the accuracy ( quality ) of the mining result. A smallε:more accurate mining result but enormously larger number of itemsets to be maintained. A bigε:degrade the mining accuracys. Maintain大量的itemset能夠讓探勘結果正確性提高,但也使得需要透過更具複雜性的技術。而要花費許多的memory並且會降低程式效能。

Introduction This paper proposes a false-negative approach to mine FIs over high-speed data streams. The method here places greater importance on recent data by adopting a sliding window model. We consider as a relaxed minimum support threshold and propose to progressively increase the value of for an itemset as it is kept longer in a window. 為何著重在recent data:由於之前提出許多false-negative的方法,都是專注在stream的整個歷程上,而不會去區別目前新進與舊有的itemset。 progressively :日益增加地 目的在於漸漸增加ε的值。希望在window中找到比較長的itemset。

Introduction Compare:False-positive vs. False-negative False-positive Mined FIs FIs Mined

Preliminaries I = {x1, x2, . . . , xm} be a set of items. A transaction, X, is an itemset. A transaction data stream is a continuous sequence of transactions. A window( a time interval ) in the stream is a set of successive time units. Successive:連續的 共有j – i + 1個time units

Preliminaries A sliding window in the stream is a window that slides forward for every time unit. tτ denotes the current time unit. w is called the size of the window. current window is W = < tτ−w+1, . . . , tτ > . trans(T) as the set of transactions that arrive on the stream in a time interval T. 共w個時間單位

Preliminaries |trans(T)| as the number of transactions in The support of an itemset X over T, sup(X, T). Minimum Support Threshold (MST), σ (0 ≤ σ ≤ 1). X is a frequent itemset (FI) over T if sup(X, T) ≥ σ|trans(T)|. trans(T). sup(X, T), is the number of transactions in trans(T ) that support X.

Preliminaries w = 2 , (w:the size of the window) If the minimum support required is 3 (σ = 3/5 in both windows, σ * |tans(T)| = 3/5 * 5 = 3 ) The set of FIs over W1 and W2 are {b, c, bc} and {b, c, d, bd}. W1 W2

A Progressively Increasing MST Function Considering = rσ as a relaxed MST, where r (0 ≤ r ≤ 1) is the relaxation rate, to mine the set of FIs over each time unit t in the sliding window. Since all itemsets whose support is less than rσ|trans(t)| are discarded. 多乘了一個r值後, rσ|trans(t)| 變得更小了。也就是support需要大於等於的門檻值也跟著變小了。

A Progressively Increasing MST Function a time unit a time interval

A Progressively Increasing MST Function τ-(τ-w+1)+1=w k:在size為w的時間間隔中取某k段時間;也就是與現在時間tτ相距k時間單位 的時間間隔。ex:k=2, T2=<tτ-1, tτ>。 回到第九頁,有提到support大於等於rσ|trans(t)|,其中的 rσ|trans(t)|就是這裡讀mk。有特別強調現在時間的r與過去時間的r。(1-r)/w就是針對window中過去時間所使用的relaxation rate。當k漸漸增加的時候,rk的值也會跟著越來越大。progressively increases the relaxed MST rσ。τ:掏

Mining FIs over a Sliding Window Using a prefix tree to keep the semi-FIs. A node in the prefix tree represents an itemset, X, and has three fields: item which is the last item of X uid(X):ID of the time unit, tuid(X):X在哪一個時間單位被插入prefix tree中 sup(X) which is the computed support of X since tuid(X)

Algorithm 1 (MineSW)

Algorithm 1 (MineSW)

Algorithm 1 (MineSW)

Experimental Evaluation On Sun Ultra-SPARC III with 900MHz CPU and 4GB RAM, and two types of data streams: t10i4 and t15i6 Compare MineSW algorithm with Lossy Counting algorithm(ε = rσ in LCSW). When r increases from 0.1 to 1, the precision of LCSW drops from 98% to around 10%, while the recall of MineSW only drops from 99% to around 90%. LCSW演算法也是有應用到sliding window的方法。因此在這篇論文中被拿來與MineSW做比較。

Experimental Evaluation (ε = rσ in LCSW)左圖ε= 0.1σ, 前面有提到ε越小越accuracy, ε00.5=0.1*0.05 < ε0.1=0.1*0.1,

(ε = rσ in LCSW)左圖σ=0.1% 前面有提到ε越小越accuracy, ε0.1=0.1*σ < ε0.2=0.2*σ, So Preccision0.1 > Preccision0.2

Experimental Evaluation LCSW演算法也是有應用到sliding window的方法。因此在這篇論文中被拿來與MineSW做比較。

Conclusions Proposing a progressively increasing minimum support function, which allows us to increaseεat the expense of only slightly degraded accuracy. The MineSW algorithm is significantly faster and consumes less memory than existing algorithms, while attains the same level of accuracy.