A Data Mining Algorithm for Generalized Web Prefetching author: Alexandros Nanopoulos、 Dimitrios Katsaros、 Yannis Manolopoulos source: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 5 year: OCTOBER. 2003 presented by C. W. Hsu
Outline Introduction BACKGROUND GENERALIZED ALGORITHM Experiment Result Conclusion
Introduction The Web has become the primary means for information dissemination. It is being used for commercial, entertainment, or educational purposes, and, thus, its popularity resulted in heavy traffic in the Internet
Web Data Traffic
The cache of web document
Cache problems
Transparent Informed Prefetching
Proposed architecture of a prediction-enabled Web server
Motivation Predictive Web prefetching Server - Cache Client - Cache 一般伺服器會將文件快取到主記憶中,但是目前也有很多人將快取資料直接傳送到客戶端。
Method Transparent Informed Prefetching DG:Dependency Graph PPM:Prediction by Partial Match WM:Web log Mining WMo:Web log Mining - Ordering
Dependency Graph First Markov Model ABCACBD and CCABCBCA
Prediction by Partial Match All-m-Order Markov model(Second Markov Model) ABCACBD and CCABCBCA
Pruning Criteria support-pruning confidence-pruning error-pruning
WMo The use of Web log mining methods for the discovery of association rules among user accesses. Association rule discovery algorithms represent candidates as sets of documents, which do not consider this ordering.
WMo C1 = {A, B} T = {B,C,A,D} S1 = {A}, S2={B} WMo={A,B}、{B、A} S1 = {A,B,C} S2 = {A,B,D} WMo = {A,B,C,D}、{A,B,D,C} 當C1 = {A,B}時,T={B,C,A,D}將無法有效的找到
Candidate generation procedure
PERFORMANCE RESULTS Usefulness Accuracy Network traffic
The Parameters for the Generator Mean value of the noise = 噪音的平均值 Variance of the noise = 噪音的變化 corProb = 最後一個節點時的移動到另一個節點的可能性 Noise = 分成兩種一種是開始移動的節點與隨機插入一個節點至交易中。
Experimental Evaluation Mean value of the noise = 噪音的平均值 Variance of the noise = 噪音的變化
Experimental Evaluation Mean value of the noise = 噪音的平均值 Variance of the noise = 噪音的變化
Experimental Evaluation Mean value of the noise = 噪音的平均值 Variance of the noise = 噪音的變化
Cache hits as a function of the cache size Mean value of the noise = 噪音的平均值 Variance of the noise = 噪音的變化
PERFORMANCE RESULTS http://ita.ee.lbl.gov/html/traces.html T = Traffic A = Accuracy U = Usefulness W = 5 => 統計的交易最高長度(代表一次鎖定的長度)
PERFORMANCE RESULTS WMo/wp = 沒有使用任何Pruning的技術
Discussion WMo prefetches more documents correctly than PPM and DG. Algorithm WM presents the worst performance because it does not consider
Discussion It is respect to the two factor: Order -> DG Noise -> PPM 實驗上指出PPM容易被Noise影響,反之DG和WM不容易造成影響。 PPM會因為Noise增加而影響正確性的部份。 DG容易因為order的因素而造成影嚮,但是PPM和WMo則不會。
END