資料庫系統實驗室 指導教授:張玉盈
Relational Database SNOOPYFAMILY 利用SQL做查詢: Select NAME Male Female Primary Key Domains 利用SQL做查詢: ID NAME SEX 1 SNOOPY Male 2 CHARLIE BROWN 3 SALLY BROWN Female 4 LUCY VAN PELT 5 LINUS VAN PELT 6 PEPPERMINT PATTY 7 MARCIE 8 SCHROEDER 9 WOODSTOCK - Select NAME From SNOOPYFAMILY Where SEX = ‘Male’; Cardinality 結果: Tuples ID NAME SEX 1 SNOOPY Male 2 CHARLIE BROWN 5 LINUS VAN PELT 8 SCHROEDER Attributes Degree
Introduction Data mining is widely used to mine or extract previously unknown, hidden, and potentially useful information from the large database. Frequent pattern mining is a basic research topic in pattern mining. It generates all frequent patterns with no smaller supports (or frequency) than a given minimum support threshold. Data mining 被廣泛用來從 database 中 mine 有用的資訊 Frequent pattern mining 在這領域是一個基本的研究主題 他產生所有的frequent 的patterns
Frequent Pattern Mining This technique has two limitations. First, it only considers that each item exists or does not exist in the binary form. Second, all items have same value with the same importance.
Data Mining Data mining is the process of finding hidden and useful knowledge form the large databases. However, items have different importance in the real world. For example, the iPhone (cellphone) is expensive and the telephone is cheap. Therefore, we have to consider the importance and the count of the items at the same time.
User wants to know which pattern can make money and the most items. Mining Weight Maximal Frequent Patterns User wants to know which pattern can make money and the most items.
Example TID Transaction 1 A, C 2 B, C 3 A, B, C 4 A, B 5 Item Weight A 0.6 B 0.8 C 0.4
Frequent Itemsets We can use the Apriori algorithm to find frequent patterns. 𝐿 1 ={A,B,C} 𝐿 1 => 𝐶 2 𝐶 2 ={AB,AC,BC} 𝐿 2 ={AB,AC,BC} 𝐿 2 => 𝐶 3 𝐶 3 ={ABC} 𝐿 3 = ø {A,B,C}:1 {A,C}:2 {B,C}:3 {A,B}:2 {A}:3 {B}:4 {C}:4 {item set}:count Min_Sup:2
Weighted Frequent pattern {A,B,C}:0.6 {A,C}:1.0 {A,B}:1.4 {A}:1.8 {B}:3.2 {C}:1.6 {B,C}:1.8 Item Weight A 0.6 B 0.8 C 0.4 WSup(PS) = sup(PS)* 𝑖=1 𝑙𝑒𝑛𝑔𝑡ℎ(𝑃𝑆) ( 𝑃 𝑖 ) 𝑙𝑒𝑛𝑔𝑡ℎ(𝑃𝑆) {item set}:WSup Min_Sup:1.8
In this case, the weighted frequent patterns are {A}, {B}, {B,C}. The weighted maximal frequent pattern is a pattern which does not have any weighted frequent super pattern. So, the weighted maximal frequent patterns in this case are {A}, {B,C}.
In the real world, each item has different profit and the number of items purchased by consumers may be not only one. In utility mining, each item has internal utility value that represents the quantity of the item in each transaction, and external utility value such as profit or price.
Mining High Utility Patterns Which itemset can contribute the most profit value of all the transactions?
In recent years, many applications have generated stream data such as transactions of retail markets. These data are continuous, unbounded, and usually coming with high speed.
Traditional vs. Stream Data Traditional Databases Data stored in finite, persistent data sets. Stream Data (Big data in cloud) Data as ordered, continuous, rapid, huge amount, time-varying data streams. (In-Memory Databases) 傳統的資料庫與資料串流的差異 傳統資料庫資料儲存於有限且永久保存的資料集合中,屬於靜態資料並不會隨著時間不斷的變化。 資料串流的資料是有序、連續、快速、大量且隨著時間變化的資料型態,隨著時間資料不斷的湧入資料庫中,無法去預測資料量且必須在有效的時間內將資料處理完畢。 15
Sliding Window Model t0 t1 t2 ti tj tj+1 tj+2 W1 W2 W3 time … t2 ti tj tj+1 tj+2 W1 W2 W3 time Figure 2. Sliding Window Sliding Window Model較異於前兩者,其探測方式永遠考慮最近的資料,設定一個固定的window size隨著時間移動window也跟著移動,其動作除了資料的新增還包含刪除的動作,將超過window size的舊資料移除,不將舊的資料加入探勘的動作中。 16
In the sliding window model, only recent data in a fixed size window are employed to discover meaningful patterns over data streams. This model is widely used for stream mining because of its ability to emphasize recent data and requires bounded memory resources.
Mining high utility patterns Problem Statement Given a data stream and a user-specified minimum utility threshold, mining high utility patterns in a window over the data stream is equivalent to discover a set of patterns having no smaller utilities than the minimum utility threshold from this window. 在數據流中的窗口中挖掘高效用模式 等價於從該窗口發現 一組不超過最小效用閾值的實用程序的模式 給定一個data stream 和 minimum utility threshold 從data stream的window中 mining high utility patterns 也等同於在這window中找到哪些pattern所貢獻的價值是不小於utility thrshold.
Simple Example TID Transaction TU T1 (A, 2) (B, 3) (C, 1) 1550 T2 (A, 1) (B, 2) 300 T3 (A, 2) 400 Item Profit A 200 B 50 C 1000 uT(AB, T1) = uT (A, T1) + uT (B, T1) = 200 × 2 + 50 × 3 = 550 u(AB) = uT (AB, T1) + uT (AB, T2) = 550 + 200 × 1 + 50 × 2 = 850 TU(T1) = 200 × 2 + 50 × 3 + 1000 × 1 = 1550 TWU(AB) = TU(T1) + TU(T2) = 1550 + 300 = 1850
Periodicity Mining in Time Series Databases Three types of periodic patterns: Symbol periodicity T = abd acb aba abc Symbol a , p = 3, stPos = 0 Sequence periodicity (partial periodic patterns) T = bbaa abbd abca abbc abcd Sequence ab, p = 4, stPos = 4 Segment periodicity (full-cycle periodicity) T = abcab abcab abcab Segement abcab, p = 5, stPos = 0
Mining Frequent Periodic Patterns User wants to know whether the pattern periodic or not in the time- series database. How to earn money? Find frequent periodic patterns and predict the future tend of the time-series database. Use computer to analyze time-series database.
Customers buy something, storage item and time-interval. Mining Time-Interval Sequential Patterns Customers buy something, storage item and time-interval. Find Time-interval patterns not only reveals the order of items but also the time intervals between successive items. Use computer to analyze database.
知識的表達 處理 效率分析 圖例. 資料庫系統的研究領域 資料庫模型、資料結構、資料整體的維護 查詢語言、使用方便性 查詢處理、簡單性、回應 時間、空間需求 查詢語言、使用方便性 圖例. 資料庫系統的研究領域