Maintaining Frequent Itemsets over High-Speed Data Streams


1 Maintaining Frequent Itemsets over High-Speed Data Streams
James Cheng, Yiping Ke, Wilfred Ng (PAKDD 2006)
Advisor: Dr. Koh Jia-Ling  Speaker: Tu Yi-Lang  Date:

2 Introduction
Frequent itemset (FI) mining is fundamental to many important data mining tasks. Recently, the increasing prominence of data streams has led to the study of online mining of FIs. Due to the constraints on both memory consumption and processing efficiency, existing approaches seek to approximate FIs over streams. Why the trend toward online mining: the volume of a stream is far too large to store entirely in memory, so a real-time procedure is needed, unlike offline methods, which analyze the data beforehand or afterwards and consume far more memory.

3 Introduction
Existing approximation techniques for mining FIs are mainly false-positive, using an error parameter, ε, to control the accuracy (quality) of the mining result. A small ε gives a more accurate mining result but an enormously larger number of itemsets to be maintained; a big ε degrades the mining accuracy. Maintaining a large number of itemsets improves the accuracy of the result, but it calls for more complex techniques, consumes a great deal of memory, and degrades performance.

4 Introduction
This paper proposes a false-negative approach to mine FIs over high-speed data streams. The method places greater importance on recent data by adopting a sliding window model. A relaxed minimum support threshold, rσ, is used, and its value is progressively increased for an itemset as the itemset is kept longer in a window. Why emphasize recent data: earlier false-negative methods focus on the entire history of the stream and do not distinguish newly arrived itemsets from old ones. The aim of progressively ("gradually") increasing the relaxed threshold is to retain the itemsets that have been kept longer in the window.

5 Introduction
Compare: false-positive vs. false-negative.
[Venn diagrams: in the false-positive approach the mined result is a superset of the true FIs; in the false-negative approach the mined result is a subset of the true FIs.]

6 Preliminaries
Let I = {x1, x2, ..., xm} be a set of items. A transaction, X, is an itemset. A transaction data stream is a continuous sequence of transactions. A window (a time interval) in the stream is a set of successive time units; a window ⟨ti, ..., tj⟩ contains j − i + 1 time units in total.

7 Preliminaries
A sliding window in the stream is a window that slides forward for every time unit. tτ denotes the current time unit and w is called the size of the window, so the current window is W = ⟨tτ−w+1, ..., tτ⟩, i.e., w time units in total. trans(T) denotes the set of transactions that arrive on the stream in a time interval T.

8 Preliminaries
|trans(T)| is the number of transactions in trans(T). The support of an itemset X over T, sup(X, T), is the number of transactions in trans(T) that support X. The Minimum Support Threshold (MST) is σ (0 ≤ σ ≤ 1). X is a frequent itemset (FI) over T if sup(X, T) ≥ σ|trans(T)|.

9 Preliminaries w = 2 , (w:the size of the window)
If the minimum support required is 3 (σ = 3/5 in both windows, σ * |tans(T)| = 3/5 * 5 = 3 ) The set of FIs over W1 and W2 are {b, c, bc} and {b, c, d, bd}. W1 W2
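The definitions above can be checked with a short brute-force miner. The transactions below are hypothetical (the slide's actual data is only in a figure), chosen so that window W1 yields the stated FIs {b, c, bc} at σ = 3/5:

```python
from itertools import combinations

def frequent_itemsets(transactions, sigma):
    """Enumerate every itemset whose support reaches sigma * |transactions|."""
    minsup = sigma * len(transactions)
    items = sorted({x for t in transactions for x in t})
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            # sup(X, T): number of transactions that contain X
            sup = sum(1 for t in transactions if set(itemset) <= t)
            if sup >= minsup:
                result[itemset] = sup
    return result

# Hypothetical window W1 of 5 transactions (not the slide's actual data)
w1 = [{'b', 'c'}, {'b', 'c'}, {'b', 'c'}, {'a', 'b'}, {'a', 'c'}]
print(frequent_itemsets(w1, 3/5))
# → {('b',): 4, ('c',): 4, ('b', 'c'): 3}
```

Brute-force enumeration is exponential in |I|, which is exactly why streaming algorithms keep only a bounded, approximate collection of itemsets.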

10 A Progressively Increasing MST Function
Consider rσ as a relaxed MST, where r (0 ≤ r ≤ 1) is the relaxation rate, for mining the set of FIs over each time unit t in the sliding window. All itemsets whose support is less than rσ|trans(t)| are discarded. Multiplying by r makes the threshold rσ|trans(t)| smaller, i.e., it lowers the support an itemset must reach in order to be kept.

11 A Progressively Increasing MST Function
[Formulas shown as an image: the relaxed MST for a single time unit and for a time interval.]

12 A Progressively Increasing MST Function
Note that τ − (τ − w + 1) + 1 = w. Here k denotes a time interval of k time units taken from the window of size w, i.e., the interval ending at the current time tτ; e.g., for k = 2, T2 = ⟨tτ−1, tτ⟩. The threshold rσ|trans(t)| from slide 10 corresponds to mk here. Different relaxation rates apply to the current and the past time units: (1 − r)/w is the relaxation rate used for the past time units in the window. As k grows, rk grows with it, progressively increasing the relaxed MST rσ.
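The slides show mk only as an image, so the following is an assumed reconstruction from the notes above: the rate starts at r for the newest time unit, climbs with slope (1 − r)/w per unit of age, and is capped at 1 so that long-lived itemsets face (close to) the full MST σ. The function name and exact linear form are assumptions, not the paper's definition.

```python
def relaxed_minsup(k, w, r, sigma, n_trans):
    """Assumed form of the progressively increasing relaxed MST m_k.

    k: age of the itemset in time units (1 = newest), w: window size,
    r: relaxation rate for the newest unit, sigma: the MST,
    n_trans: |trans(T_k)|, number of transactions in the interval.
    """
    # rate starts at r and increases by (1 - r)/w per unit of age,
    # capped at 1 so the threshold never exceeds sigma itself
    r_k = min(1.0, r + (k - 1) * (1.0 - r) / w)
    return r_k * sigma * n_trans
```

The monotone increase is the key property: the longer an itemset stays in the window without gathering support, the closer its pruning threshold moves toward σ, bounding the number of low-support itemsets that must be maintained.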

13 Mining FIs over a Sliding Window
Use a prefix tree to keep the semi-FIs. A node in the prefix tree represents an itemset, X, and has three fields: item, the last item of X; uid(X), the ID of the time unit t_uid(X) at which X was inserted into the prefix tree; and sup(X), the computed support of X since t_uid(X).
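A minimal sketch of such a node in Python (class, method, and field names are assumptions; the paper's tree also supports the update and pruning operations of Algorithm 1, omitted here):

```python
class PrefixTreeNode:
    """One node per itemset X; the path from the root spells out X."""

    def __init__(self, item, uid):
        self.item = item      # last item of the itemset X
        self.uid = uid        # ID of the time unit when X was inserted
        self.sup = 0          # support of X computed since time unit uid
        self.children = {}    # item -> child node (one-item extensions of X)

    def insert(self, itemset, uid):
        """Walk (and extend) the tree along the sorted itemset; return its node."""
        node = self
        for x in sorted(itemset):
            if x not in node.children:
                node.children[x] = PrefixTreeNode(x, uid)
            node = node.children[x]
        return node

root = PrefixTreeNode(item=None, uid=0)   # empty-itemset root
node = root.insert({'b', 'c'}, uid=1)     # path b -> c represents {b, c}
node.sup += 1
```

Sharing prefixes keeps storage compact: {b}, {b, c}, and {b, c, d} occupy a single root-to-leaf path of three nodes.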

14 Algorithm 1 (MineSW)

15 Algorithm 1 (MineSW)

16 Algorithm 1 (MineSW)
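Slides 14–16 show Algorithm 1 only as images. Purely as a speculative illustration of the data flow described on the surrounding slides — not the paper's actual MineSW, which maintains a single prefix tree and applies the progressively increasing threshold mk — a toy sliding-window miner might look like this:

```python
from collections import Counter, deque
from itertools import combinations

def unit_itemsets(transactions, minsup):
    """Support counts of all itemsets in one time unit, pruned at minsup."""
    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            counts.update(combinations(sorted(t), size))
    return Counter({s: c for s, c in counts.items() if c >= minsup})

def mine_sw(stream_units, w, r, sigma):
    """Yield, after each arriving time unit, the FIs over the current window."""
    window = deque(maxlen=w)  # the expired oldest unit falls off automatically
    for unit in stream_units:
        # mine the new unit with the relaxed MST r * sigma (slide 10)
        window.append((len(unit), unit_itemsets(unit, r * sigma * len(unit))))
        n = sum(size for size, _ in window)
        merged = Counter()
        for _, fis in window:
            merged.update(fis)
        yield {s for s, c in merged.items() if c >= sigma * n}

# toy stream of two time units
units = [[{'b', 'c'}, {'b', 'c'}, {'b'}], [{'b'}, {'c'}]]
print(list(mine_sw(units, 2, 1.0, 0.5)))
```

Discarding per-unit counts below rσ|trans(t)| is what makes the approach false-negative: an itemset that is frequent over the whole window but infrequent in every individual unit can be missed, while no infrequent itemset is ever reported.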

17 Experimental Evaluation
Experiments ran on a Sun Ultra-SPARC III with a 900MHz CPU and 4GB RAM, on two types of data streams: t10i4 and t15i6. The MineSW algorithm is compared with the Lossy Counting algorithm (ε = rσ in LCSW). When r increases from 0.1 to 1, the precision of LCSW drops from 98% to around 10%, while the recall of MineSW only drops from 99% to around 90%. LCSW also applies the sliding-window model, which is why it is chosen as the comparison baseline for MineSW in this paper.

18 Experimental Evaluation
(ε = rσ in LCSW.) In the left figure, ε = 0.1σ. As noted earlier, a smaller ε means higher accuracy: ε at σ = 0.05 (0.1 × 0.05) is smaller than ε at σ = 0.1 (0.1 × 0.1).

19 Experimental Evaluation
(ε = rσ in LCSW.) In the left figure, σ = 0.1%. As noted earlier, a smaller ε means higher accuracy: εr=0.1 = 0.1σ < εr=0.2 = 0.2σ, so precision at r = 0.1 is higher than precision at r = 0.2.

20 Experimental Evaluation

21 Conclusions
The paper proposes a progressively increasing minimum support function, which allows ε to be increased at the expense of only slightly degraded accuracy. The MineSW algorithm is significantly faster and consumes less memory than existing algorithms, while attaining the same level of accuracy.

