Data Pre-Processing … What about your data?
Why Data Preprocessing?
Real-world data is "dirty":
- incomplete: missing values, missing attributes of interest, or only aggregate data available
- noisy: containing errors or outliers
- inconsistent: discrepancies in codes or names
No quality data, no quality mining results! Quality decisions must be based on quality data, and a data warehouse requires consistent integration of quality data.
Multi-Dimensional Measures of Data Quality
Accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility; more broadly, integrity and compactness.
Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, of particular importance for numerical data
Forms of data preprocessing
Major Tasks in Data Cleaning
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
Missing Data
Data is not always available, e.g., customer income in sales data. Causes:
- equipment malfunction
- deleted because of inconsistency with other recorded data
- not entered because of misunderstanding
- not entered because it seemed unimportant at entry time
Missing data may need to be inferred.
How to Handle Missing Data
- Ignore the tuple: usual when the class label is missing (assuming a classification task); not effective when the percentage of missing values varies considerably across attributes
- Fill in manually: tedious and often infeasible
- Fill in with a global constant, e.g., "unknown", infinity, or a new class?!
- Fill in with the attribute mean (same class, same mean: use the mean over samples of the same class)
- Fill in with the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
Which of these methods will bias the data?
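The substitution strategies above can be sketched in a few lines of Python (the records, class labels, and income values below are invented for illustration):

```python
from statistics import mean

# Hypothetical records: (class_label, income); None marks a missing value.
records = [("A", 30), ("A", None), ("A", 50), ("B", 80), ("B", None)]

# Strategy 1: ignore tuples with a missing value.
complete = [r for r in records if r[1] is not None]

# Strategy 2: fill with the overall attribute mean.
overall = mean(v for _, v in records if v is not None)
filled_global = [(c, v if v is not None else overall) for c, v in records]

# Strategy 3: "same class, same mean" -- fill with the mean of the same class.
by_class = {}
for c, v in records:
    if v is not None:
        by_class.setdefault(c, []).append(v)
class_means = {c: mean(vs) for c, vs in by_class.items()}
filled_class = [(c, v if v is not None else class_means[c]) for c, v in records]
```

Note how strategy 3 answers the slide's closing question less badly than strategy 2: the class-conditional mean biases each class's distribution less than one global constant does.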
Noisy Data
Noise: random error or variance in a measured variable. Causes:
- faulty data-collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistent naming conventions
Other data problems that require cleaning: duplicate records, incomplete data, inconsistent data.
How to Handle Noisy Data
- Binning: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, or bin boundaries (see next slide)
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values automatically, then verify them by hand
- Regression: smooth by fitting the data to regression functions
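For concreteness, here is a small Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries (the twelve values are a made-up price list; function names are mine):

```python
def equi_depth_bins(values, depth):
    """Sort the data, then partition it into bins of equal depth."""
    vals = sorted(values)
    return [vals[i:i + depth] for i in range(0, len(vals), depth)]

def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of its bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(data, 4)
```

For example, the first bin [4, 8, 9, 15] smooths to [9, 9, 9, 9] by means, and to [4, 4, 4, 15] by boundaries.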
Integration: Handling Redundant Data
Redundancy often arises when integrating multiple databases:
- the same attribute may have different names in different databases
- one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundancy may be detected by correlation analysis:
r_{A,B} = sum((a - mean_A)(b - mean_B)) / ((n - 1) * sigma_A * sigma_B)
Careful integration reduces/avoids redundancies and inconsistencies and improves mining speed and quality. Data integration also involves schema integration and resolving data conflicts, e.g., temperatures stored in degrees C vs. degrees F.
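The correlation coefficient above is easy to compute directly. In this sketch the two attributes are a hypothetical monthly revenue and the annual revenue derived from it, so the coefficient comes out as 1 and flags the redundancy:

```python
from statistics import mean, stdev

def corr(a, b):
    """Sample Pearson correlation, matching the slide's formula:
    r_{A,B} = sum((a - mean_A)(b - mean_B)) / ((n-1) * sigma_A * sigma_B)."""
    n = len(a)
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / ((n - 1) * stdev(a) * stdev(b))

monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]   # derived attribute: annual = 12 * monthly
r = corr(monthly, annual)       # close to 1.0 -> strongly redundant
```

A |r| near 1 suggests one of the two attributes can be dropped before mining.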
Data Transformation
Transform the data into forms appropriate for mining, including:
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaling values to fall within a small, specified range (min-max, z-score, decimal scaling)
- Attribute/feature construction: new attributes constructed from the given ones, e.g., area = height x width
Data Transformation: Normalization
- min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score (zero-mean) normalization: v' = (v - mean_A) / sigma_A, where sigma_A = [sum((v - mean_A)^2) / (n - 1)]^(1/2)
- normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
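All three normalizations translate directly into code. A minimal sketch (the three income values are a made-up sample, chosen so the min-max result is easy to check by hand):

```python
from statistics import mean, stdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Zero-mean, unit-variance normalization."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that maps every |v| below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]
mm = min_max(incomes)          # 73600 maps to about 0.716
zs = z_score(incomes)          # mean of the result is 0
ds = decimal_scaling(incomes)  # j = 5, so 12000 -> 0.12
```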
Data Reduction Strategies
A warehouse may store terabytes of data, so mining the complete data set can take too long. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Strategies:
- Data cube aggregation
- Dimensionality reduction
- Data compression
- Numerosity reduction
- Discretization and concept hierarchy generation
Data Cube Aggregation
The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-call data warehouse. The higher levels of aggregation in the cube further reduce the size of the data to deal with: reference the appropriate level, and use the smallest representation sufficient to solve the task.
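A tiny Python sketch of climbing one aggregation level: rolling hypothetical quarterly sales up into yearly totals (all figures invented):

```python
from collections import defaultdict

# Hypothetical base-level records: (year, quarter, sales_amount).
sales = [(2022, 1, 200), (2022, 2, 350), (2022, 3, 300), (2022, 4, 400),
         (2023, 1, 250), (2023, 2, 300), (2023, 3, 450), (2023, 4, 500)]

# Roll quarters up into yearly totals -- one level up the cube.
yearly = defaultdict(int)
for year, _, amount in sales:
    yearly[year] += amount
```

If the task only needs yearly trends, the eight base records reduce to two aggregates, which is exactly the "smallest sufficient representation" the slide recommends.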
Dimensionality Reduction
Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the classes given those features is as close as possible to the distribution given all features. Fewer attributes then appear in the discovered patterns, making the patterns easier to understand. Since the number of candidate subsets is exponential, heuristic methods are used:
- step-wise forward selection
- step-wise backward elimination
- combined forward selection and backward elimination
- decision-tree induction
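Step-wise forward selection can be sketched as a greedy loop. The merit scores and the additive scoring function below are toy assumptions purely for illustration; in practice the score would measure how well the subset preserves the class distribution:

```python
def forward_selection(features, score, k):
    """Greedy step-wise forward selection: repeatedly add the single
    feature that most improves score(selected); stop at k features or
    when no addition improves the score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining feature improves the subset
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy assumption: each feature has an individual merit and merits add up.
merit = {"income": 0.6, "age": 0.3, "zip": 0.05, "id": 0.0}
score = lambda subset: sum(merit[f] for f in subset)
chosen = forward_selection(merit, score, k=2)
```

Backward elimination is the mirror image: start from the full set and greedily drop the feature whose removal hurts the score least.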
Heuristic methods
Data Compression
- String compression: a large body of theory and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression: typically lossy, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio: they are typically short and change slowly over time
Numerosity Reduction
- Parametric methods: assume the data fits some model, then (1) estimate the model parameters, (2) store them, and (3) discard the data (except possible outliers). Examples: linear regression, multiple regression, and log-linear models (which obtain the value at a point in m-D space as a product over appropriate marginal subspaces)
- Non-parametric methods: assume no model; the major techniques are histograms, clustering, and sampling
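As a sketch of the parametric idea: fit a least-squares line to a handful of invented (x, y) points, keep only the two parameters, and discard the points themselves:

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares simple linear regression. Storing only (slope,
    intercept) instead of the raw points is a parametric numerosity
    reduction: the two numbers stand in for the whole data set."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly y = 2x, with small noise
slope, intercept = fit_line(xs, ys)
```

Any later query for an approximate y at some x is answered from the stored parameters as slope * x + intercept, without touching the original points.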
Discretization and Concept Hierarchy Generation
[price concept-hierarchy figure: numeric ranges such as 0..100 and 0..200 arranged in levels]
(1) Discretizing numeric data: binning, histogram analysis (3.15), clustering, natural partitioning (the 3-4-5 rule)
(2) Generating concept hierarchies: from a partial ordering of attributes, a portion of a hierarchy, or a set of attributes
Example of the 3-4-5 Rule
Profit data (in thousands): Min = -$351, Low (5%-tile) = -$159, High (95%-tile) = $1,838, Max = $4,700. Reminder: 3, 6, 9 distinct msd values give 3 intervals; 2, 4, 8 give 4; 1, 5, 10 give 5.
Step 1: take Low and High from the 5th and 95th percentiles rather than from Min and Max.
Step 2: msd = 1,000; rounding gives Low' = -$1,000 and High' = $2,000. The range covers 3 distinct values at the msd, so cut it into three intervals: (-$1,000 .. $0], ($0 .. $1,000], ($1,000 .. $2,000].
Step 3: adjust the boundary intervals to cover Min and Max: the first interval shrinks to (-$400 .. $0] (rounding Min = -$351 at its own msd), and ($2,000 .. $5,000] is added to cover Max = $4,700.
Step 4: recursively apply the rule within each interval:
- (-$400 .. $0]: (-$400 .. -$300], (-$300 .. -$200], (-$200 .. -$100], (-$100 .. $0]
- ($0 .. $1,000]: ($0 .. $200], ($200 .. $400], ($400 .. $600], ($600 .. $800], ($800 .. $1,000]
- ($1,000 .. $2,000]: ($1,000 .. $1,200], ($1,200 .. $1,400], ($1,400 .. $1,600], ($1,600 .. $1,800], ($1,800 .. $2,000]
- ($2,000 .. $5,000]: ($2,000 .. $3,000], ($3,000 .. $4,000], ($4,000 .. $5,000]
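The core of steps 2 and 3 can be sketched in code. This is a simplified, equal-width-only version (the full rule also handles 7 distinct values by splitting into unequal 2-3-2 groups, which is omitted here; the function names are mine):

```python
def interval_count(distinct_msd_values):
    """Simplified 3-4-5 rule: pick the number of equal-width intervals
    from the count of distinct values at the most significant digit.
    3, 6, 9 -> 3 intervals; 2, 4, 8 -> 4; 1, 5, 10 -> 5.
    (The full rule also splits 7 into three unequal 2-3-2 groups.)"""
    if distinct_msd_values in (3, 6, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    return 5  # 1, 5, 10

def partition(low, high, msd):
    """Cut the msd-rounded range [low, high] into equal-width intervals."""
    n = interval_count(round((high - low) / msd))
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

# Step 2 of the slide: Low' = -$1,000, High' = $2,000, msd = 1,000
bins = partition(-1000, 2000, 1000)   # three $1,000-wide intervals
```

Step 4 is then just this same `partition` applied recursively inside each resulting interval.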
Generating Concept Hierarchies for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping, e.g., grouping cities into an intermediate level such as "the five central-Taiwan counties and cities"
- Specification of a set of attributes without their partial ordering: the system tries to generate the ordering
- Specification of only a partial set of attributes
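The "set of attributes without an ordering" case is commonly handled by a simple heuristic: sort attributes by distinct-value count and place the attribute with the most distinct values at the lowest (most specific) level. A sketch with made-up location columns:

```python
def infer_hierarchy(columns):
    """Heuristic ordering for a set of attributes with no given partial
    order: more distinct values -> lower (more specific) hierarchy level.
    Returns attribute names from most specific to most general."""
    return sorted(columns, key=lambda name: len(set(columns[name])),
                  reverse=True)

# Invented sample data for the street < city < state < country example.
columns = {
    "country": ["TW", "TW", "TW", "TW"],
    "city":    ["Taipei", "Taichung", "Tainan", "Taipei"],
    "street":  ["S1", "S2", "S3", "S4"],
    "state":   ["North", "North", "South", "North"],
}
order = infer_hierarchy(columns)   # street, city, state, country
```

The heuristic is not foolproof: an attribute with few distinct values (say, day-of-week with 7) can still belong below one with many (say, 20 distinct years), so the generated ordering may need expert review.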
Summary
Data preparation is a big issue for both warehousing and mining. It includes data cleaning and data integration, data reduction and feature selection, and discretization. Many methods have been developed, but data preparation remains an active area of research. Exercise: compare Web search with the KDD process.