Data Pre-Processing … What about your data?

Why Data Preprocessing?
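Real-world data is "dirty":
Incomplete: missing values, missing attributes of interest, or containing only aggregate data.
Noisy: containing errors or outliers.
Inconsistent: discrepancies in codes or names.
No quality data, no quality mining results!
Quality decisions must be based on quality data.
A data warehouse needs consistent integration of quality data.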

Multidimensional measures of data quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Also: integrity, compactness

Major tasks in data preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: integration of multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
Data discretization: part of data reduction, but of particular importance, especially for numerical data.

Forms of data preprocessing

Data Cleaning: major tasks
Fill in missing values.
Identify outliers and smooth out noisy data.
Correct inconsistent data.

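Missing Data
Symptom: data is not always available, e.g., customer income in sales data.
Causes:
  equipment malfunction
  deleted because it was inconsistent with other fields
  not entered because of a misunderstanding
  considered unimportant at the time of entry, so not entered
Missing data may need to be inferred.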

How to handle Missing Data
Ignore the tuple: usually done when the class label is missing (assuming the task is classification) or when several attributes are missing; not effective when the percentage of missing values varies considerably from attribute to attribute.
Fill in the value manually: tedious? infeasible?
Use a global constant: e.g., "unknown", infinity, a new class?!
Use the attribute mean; or, for samples of the same class, use that class's mean.
Use the most probable value: inference-based, such as a Bayesian formula or a decision tree.
Which one will bias the data?
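
The two mean-based strategies above can be sketched in a few lines of Python. This is only an illustration: the function names, the `income` values, and the `risk` class labels are all made up, and `None` is assumed to mark a missing value.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income (in thousands) with two missing entries.
income = [30, None, 50, 100, None, 120]
risk = ["low", "low", "low", "high", "high", "high"]

print(fill_with_mean(income))
print(fill_with_class_mean(income, risk))
```

With this toy data the global mean fills both gaps with 75, while the class means fill them with 40 and 110. The global mean pulls both classes toward the middle, which is one way a filling strategy can bias the data.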

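Noisy Data
Noise: random error or variance in a measured variable.
Sources:
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitations
  inconsistent naming conventions
Other data problems that require data cleaning:
  duplicate records
  incomplete data
  inconsistent data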

How to handle Noisy Data
Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, smooth by bin median, or smooth by bin boundaries (see next slide).
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values automatically, then have a human check them.
Regression: smooth by fitting the data to regression functions.
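
The binning method can be sketched as follows. The price list is illustrative data, not taken from the slides, and the helper names are hypothetical. The sketch assumes the number of values divides evenly into the number of bins.

```python
from statistics import median

def equi_depth_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of equal size."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) is divisible by n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

In the boundary version, a value equidistant from both boundaries is assigned to the lower one; that tie-breaking choice is an assumption of this sketch.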

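Integration: handling Redundant Data
Redundant data often appears when multiple databases are integrated:
  the same attribute may have different names in different databases
  an attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundancy may be detected by correlation analysis:
  $r_{a,b} = \frac{\sum (a - \bar{a})(b - \bar{b})}{(n-1)\,\sigma_a \sigma_b}$
  where $\bar{a}$ and $\bar{b}$ are the means and $\sigma_a$ and $\sigma_b$ the standard deviations of a and b.
Careful integration can reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
Related issues: data integration and schema integration; data value conflicts, e.g., °C vs. °F.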
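
As a rough illustration of correlation analysis, the sketch below computes the correlation coefficient with the (n-1) form used on this slide. The `monthly` and `annual` columns are invented; `annual` is deliberately derived from `monthly` so that it shows up as redundant.

```python
from math import sqrt

def correlation(a, b):
    """Correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations, dividing by (n - 1).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)

# Hypothetical columns from two sources: annual revenue is derived from monthly.
monthly = [10, 20, 30, 40]
annual = [12 * m for m in monthly]

r = correlation(monthly, annual)
print(round(r, 6))
```

A coefficient close to +1 or -1 suggests that one of the two attributes may be redundant. The function divides by zero if either column is constant, so a real implementation would need to guard against that case.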

Data Transformation
Convert the data into a form suitable for mining, including:
Smoothing: remove noise from the data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale values to fall within a small, specified range.
  min-max normalization
  z-score normalization
  normalization by decimal scaling
Attribute/feature construction: new attributes constructed from the given ones, e.g., area = h x w.

Data Transformation: Normalization
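min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
z-score normalization (0-mean): $v' = \frac{v - \bar{A}}{\sigma_A}$, with standard deviation $\sigma_A = \left[ \sum (A - \bar{A})^2 / (n-1) \right]^{1/2}$
normalization by decimal scaling: $v' = v / 10^j$, where j is the smallest integer such that $\max(|v'|) < 1$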
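
The three normalization methods can be sketched in plain Python as below. The function names and sample values are illustrative assumptions. The standard deviation uses the (n-1) form given on this slide.

```python
from math import sqrt

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Shift to zero mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([200, 300, 400, 600, 1000]))
print(z_score([200, 300, 400, 600, 1000]))
print(decimal_scaling([-986, 917]))
```

Min-max normalization preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value stretches the range. Z-score normalization is the usual alternative when the minimum and maximum are unknown or unreliable.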

Data Reduction Strategies
A warehouse may hold several terabytes of data, so mining the complete data set takes too long.
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies:
  Data cube aggregation
  Dimensionality reduction
  Data compression
  Numerosity reduction (turning big numbers into small ones)
  Discretization and concept hierarchy generation

Data Cube Aggregation
The lowest level of a data cube: the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.
Multiple levels of aggregation in data cubes: further reduce the size of the data to deal with.
Reference appropriate levels: use the smallest representation that is sufficient to solve the task.
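
A minimal sketch of moving between aggregation levels, using invented call records for the phone-calling example. The record layout (customer, month, minutes) and the `roll_up` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical call records: (customer, month, minutes).
calls = [
    ("alice", "2024-01", 12), ("alice", "2024-01", 30), ("alice", "2024-02", 5),
    ("bob", "2024-01", 7), ("bob", "2024-02", 20),
]

def roll_up(records, key):
    """Sum minutes over all records that share the same key."""
    totals = defaultdict(int)
    for rec in records:
        totals[key(rec)] += rec[2]
    return dict(totals)

# Lower level: one cell per (customer, month).
per_customer_month = roll_up(calls, key=lambda r: (r[0], r[1]))
# Higher level: one cell per customer, which is smaller still.
per_customer = roll_up(calls, key=lambda r: r[0])

print(per_customer_month)
print(per_customer)
```

If the task only needs total usage per customer, the per-customer level is the smallest representation that still answers the question.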

Dimensionality Reduction
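Feature selection (i.e., attribute subset selection):
  Select a minimum set of features such that the probability distribution of the classes is as close as possible to the original distribution.
  This reduces the number of attributes appearing in the discovered patterns, making them easier to understand.
Heuristic methods (due to the exponential number of choices):
  step-wise forward selection
  step-wise backward elimination
  combining forward selection and backward elimination
  decision-tree induction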
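
Step-wise forward selection can be sketched as a greedy loop. The scoring function is left abstract because the slides do not specify one. Here `toy_score` and its weights are purely hypothetical stand-ins for a real measure such as classifier accuracy or information gain.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedily add the feature that improves the score most; stop when none helps."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score - best < min_gain:
            break  # no remaining feature gives a meaningful improvement
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected

# Hypothetical usefulness of each attribute; a real score would come from data.
weights = {"age": 0.5, "income": 0.3, "zip": 0.0, "id": 0.0}

def toy_score(subset):
    return sum(weights[f] for f in subset)

print(forward_selection(list(weights), toy_score))
```

Each round evaluates only the remaining features, so the search costs at most on the order of d squared score evaluations for d features, rather than examining all 2^d subsets. Backward elimination follows the same pattern, starting from the full set and removing the least useful feature at each step.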

Heuristic methods

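Data Compression
String compression:
  there are extensive theories and algorithms
  typically lossless
  only limited manipulation is possible without decompressing
Audio/video compression:
  typically lossy, with progressive refinement
  sometimes part of the data can be reconstructed without reconstructing the whole
A time sequence is not audio: it is typically short and varies slowly with time.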
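
As a small illustration of lossless string compression, the sketch below uses Python's standard `zlib` module. The sample text is made up and deliberately repetitive so that it compresses well.

```python
import zlib

# Hypothetical, highly repetitive input.
text = ("data cleaning, data integration, data reduction; " * 20).encode("utf-8")

packed = zlib.compress(text)
restored = zlib.decompress(packed)

print(len(text), len(packed))
print(restored == text)  # lossless: the original bytes come back exactly
```

The compressed bytes cannot be searched or edited as text; they must be decompressed first, which is the limitation on manipulation noted above.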

Numerosity Reduction
Parametric methods:
  Assume the data fits some model: 1) estimate the model parameters, 2) store them, 3) discard the data (except possible outliers).
  Linear regression, multiple regression.
  Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces.
Non-parametric methods:
  Do not assume a model.
  Major methods: histograms, clustering, sampling.
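
The parametric recipe (estimate, store, discard) can be sketched with simple linear regression. The data points, the tolerance `tol`, and the function names are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def reduce_to_model(xs, ys, tol):
    """Keep only the two parameters plus any points the line misses by more than tol."""
    slope, intercept = fit_line(xs, ys)
    outliers = [(x, y) for x, y in zip(xs, ys)
                if abs(y - (slope * x + intercept)) > tol]
    return slope, intercept, outliers

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept, outliers = reduce_to_model(xs, ys, tol=0.5)
print(slope, intercept, outliers)
```

Here five points are replaced by two numbers. On real data the fit is only approximate, so the reduction trades some accuracy for space, and the stored outliers are the points the model represents worst.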

Discretization and Concept Hierarchy Generation
[Figure: concept hierarchy for price, with intervals such as 0..100 and 0..200]
(1) Discretization of numerical data: binning, histogram (3.15), clustering, natural partitioning (3-4-5 rule)
(2) Concept hierarchy generation: partial ordering, portion of a hierarchy, set of attributes

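Example of 3-4-5 rule
Step 1: profit data with Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,…, Max = $4,700.
Step 2: msd = 1,000, Low = -$1,000, High = $2,000.
Step 3: the range (-$1,000 - $2,000) is split into 3 intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000).
Step 4: the boundaries are adjusted to cover Min and Max, giving the range (-$400 - $5,000): the first interval becomes (-$400 - 0) and a new interval ($2,000 - $5,000) is added. Each interval is then partitioned again (1, 5, or 10 distinct values give 5 sub-intervals; 2, 4, or 8 give 4; ...):
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)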
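
The partitioning step of the 3-4-5 rule can be sketched as below. The rule splits a range into 3, 4, or 5 intervals depending on how many distinct values it covers at the most significant digit (msd): 3, 6, 7, or 9 give 3 intervals (grouped 2-3-2 for 7), 2, 4, or 8 give 4, and 1, 5, or 10 give 5. The function names are hypothetical, and the `high=1800` argument is a placeholder chosen only because it rounds up to $2,000; it is not the value from the slide.

```python
import math

def split_345(lo, hi, msd):
    """Split [lo, hi] by the 3-4-5 rule, given the unit of the most significant digit."""
    distinct = round((hi - lo) / msd)
    if distinct == 7:
        # Special case: three intervals grouped 2-3-2.
        cuts = [lo, lo + 2 * msd, lo + 5 * msd, hi]
    else:
        if distinct in (3, 6, 9):
            k = 3
        elif distinct in (2, 4, 8):
            k = 4
        elif distinct in (1, 5, 10):
            k = 5
        else:
            raise ValueError(f"unsupported number of distinct values: {distinct}")
        width = (hi - lo) / k
        cuts = [lo + i * width for i in range(k + 1)]
    return list(zip(cuts, cuts[1:]))

def top_level_345(low, high):
    """Round low down and high up at the msd, then split the rounded range."""
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))  # assumes a nonzero bound
    lo = math.floor(low / msd) * msd
    hi = math.ceil(high / msd) * msd
    return split_345(lo, hi, msd)

print(top_level_345(low=-159, high=1800))   # placeholder High value
print(split_345(-400, 0, 100))
print(split_345(2000, 5000, 1000))
```

The sketch covers only the rounding and splitting. The boundary adjustment that makes the intervals cover Min and Max is left out and would need to be applied between the two levels.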

Concept hierarchy generation for categorical data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country.
Specification of a portion of a hierarchy by explicit data grouping, e.g., an intermediate level such as "five central counties and cities".
Specification of a set of attributes, but not of their partial ordering: the system tries to generate the ordering.
Specification of only a partial set of attributes.
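
When only a set of attributes is given, one common heuristic for generating the ordering is to count distinct values: the attribute with the most distinct values goes at the lowest level of the hierarchy. The slide does not name the heuristic, so treat this as an assumption; the address records and function name below are also invented.

```python
def order_by_distinct_values(records, attributes):
    """Order attributes from lowest hierarchy level (most distinct values) to highest."""
    counts = {a: len({r[a] for r in records}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a], reverse=True)

# Hypothetical address records.
addresses = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "3 Pine Rd", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "4 Elm St", "city": "Madison", "state": "WI", "country": "USA"},
    {"street": "5 Lake Dr", "city": "Madison", "state": "WI", "country": "USA"},
]

hierarchy = order_by_distinct_values(addresses, ["country", "city", "street", "state"])
print(" < ".join(hierarchy))
```

The heuristic can fail, for example with time attributes, where a higher level such as year can have more distinct values than a lower level such as day of the week, so a generated ordering should be reviewed by a user or expert.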

Summary
Data preparation is a big issue for both warehousing and mining.
Data preparation includes:
  data cleaning and data integration
  data reduction and feature selection
  discretization
A lot of methods have been developed, but this is still an active area of research.
Compare Web search with the KDD process.
