Data Pre-Processing … What about your data?

Presentation transcript:


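Why Data Preprocessing?
- Real-world data is "dirty":
  - Incomplete: missing values, missing attributes of interest, or only aggregate data
  - Noisy: contains errors or outliers
  - Inconsistent: discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data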

Multi-Dimensional Measures of Data Quality
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
- Also listed: integrity, compactness

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results
- Data discretization: part of data reduction but of particular importance, especially for numerical data

Forms of data preprocessing

Data Cleaning: Major Tasks
- Fill in missing values
- Identify outliers and smooth noisy data
- Correct inconsistent data

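Missing Data
- Observation: data is not always available, e.g., customer income in sales data
- Causes:
  - Equipment malfunction
  - Deleted because it was inconsistent with other fields
  - Not entered due to misunderstanding
  - Considered unimportant at the time of entry
- Missing data may need to be inferred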

How to Handle Missing Data
- Ignore the tuple:
  - when the class label is missing (assuming the task is classification)
  - when several attributes are missing
  - not effective when the percentage of missing values varies widely across attributes
- Fill in the value manually: tedious? infeasible?
- Use a global constant: e.g., "unknown", infinity; a new class?!
- Use the attribute mean (same class, same mean)
- Use the most probable value: inference-based, such as a Bayesian formula or a decision tree
- Which one will bias the data?
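
As a rough illustration of the mean-based options above, the sketch below fills missing values with either the overall attribute mean or the mean of rows in the same class. The row layout, the field names (`income`, `cls`), and the use of `None` for a missing value are assumptions made for this example, not part of the slides.

```python
from statistics import mean


def fill_with_mean(rows, attr, class_attr=None):
    """Return copies of rows with missing (None) values of attr filled in.

    Without class_attr the overall attribute mean is used; with class_attr
    the mean of the known values in the same class is used, falling back
    to the overall mean when that class has no known values.
    Assumes at least one row has a known value for attr.
    """
    known = [r for r in rows if r[attr] is not None]
    overall = mean(r[attr] for r in known)
    by_class = {}
    if class_attr is not None:
        for r in known:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r[attr] is None:
            same_class = by_class.get(r[class_attr]) if class_attr else None
            r[attr] = mean(same_class) if same_class else overall
        filled.append(r)
    return filled


# Hypothetical sales rows with missing customer income.
rows = [
    {"income": 30, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 90, "cls": "high"},
    {"income": None, "cls": "high"},
]
global_fill = fill_with_mean(rows, "income")
class_fill = fill_with_mean(rows, "income", class_attr="cls")
```

Comparing `global_fill` with `class_fill` is one way to explore the slide's question of which strategy biases the data: the global mean pulls both classes toward the same value, while the class mean preserves the difference between them.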

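Noisy Data
- Noise: random error or variance in a measured variable
- Sources:
  - Faulty data collection instruments
  - Data entry problems
  - Data transmission problems
  - Technology limitations
  - Inconsistent naming conventions
- Other data problems that require data cleaning:
  - Duplicate records
  - Incomplete data
  - Inconsistent data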

How to Handle Noisy Data
- Binning method: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, by bin medians, or by bin boundaries
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values automatically, then have a human check them
- Regression: smooth by fitting the data to regression functions
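
The binning method can be sketched as follows: sort the values, split them into equi-depth bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the sample prices are illustrative assumptions; the sketch also assumes there are at least as many values as bins.

```python
def equi_depth_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by medians follows the same pattern with `statistics.median` in place of the mean.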

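Data Integration: Handling Redundant Data
- Redundant data occurs often when integrating different databases:
  - The same attribute may have different names in different databases
  - An attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundancy may be detected by correlation analysis: $r_{a,b} = \frac{\sum (a - \bar{a})(b - \bar{b})}{(n-1)\,\sigma_a \sigma_b}$
- Careful integration can reduce or avoid redundancies and inconsistencies and improve mining speed and quality
- Related issues: data integration, schema integration; data value conflicts, e.g., °C vs. °F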
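
A minimal sketch of the correlation check, using the sample (n-1) form of the coefficient. The attribute names and values are made up for illustration; the function assumes two equal-length lists with at least two values each and non-zero variance.

```python
from math import sqrt


def correlation(a, b):
    """Sample correlation coefficient between two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return covariance / (sd_a * sd_b)


# Hypothetical attributes from two source tables: annual revenue is
# derived from monthly revenue, so the two are perfectly correlated.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]
r = correlation(monthly_revenue, annual_revenue)
```

A coefficient close to 1 or -1 suggests that one of the two attributes is a candidate for removal; values near 0 suggest the attributes carry different information.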

Data Transformation
Transforms the data into forms suitable for mining, including:
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a smaller, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones, e.g., area = h x w

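Data Transformation: Normalization
- min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
- z-score normalization (zero mean): $v' = \frac{v - \bar{A}}{\sigma_A}$, with standard deviation $\sigma_A = \sqrt{\sum (A - \bar{A})^2 / (n-1)}$
- normalization by decimal scaling: $v' = v / 10^j$, where $j$ is the smallest integer such that $\max(|v'|) < 1$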
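
The three normalization methods can be sketched in a few lines each. The function names and sample values are illustrative; the z-score uses the sample (n-1) standard deviation as on the slide, and the decimal-scaling loop assumes a non-negative exponent j.

```python
from statistics import mean, stdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]; needs max > min."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the sample (n-1) standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical attribute values.
scaled = min_max([200, 300, 400, 600, 1000])
standardized = z_score([2, 4, 6])
shifted = decimal_scaling([-986, 917])
```

Min-max preserves the relative spacing of the original values but is sensitive to outliers at either end; z-score is the usual choice when the true minimum and maximum are unknown.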

Data Reduction Strategies
- A warehouse may hold several terabytes of data; mining the complete data takes too long
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies:
  - Data cube aggregation
  - Dimensionality reduction
  - Data compression
  - Numerosity reduction (turning large numbers into small ones!)
  - Discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube: the aggregated data for an individual entity of interest, e.g., a customer in a phone calling data warehouse
- Multiple levels of aggregation in a data cube: each higher level further reduces the size of the data to deal with
- Reference the appropriate level: use the smallest representation that is sufficient to solve the task
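
As a small illustration of moving between aggregation levels, the sketch below rolls call records up to one total per customer and then to one total per region. The record fields (`region`, `customer`, `minutes`) and the `roll_up` helper are hypothetical names chosen for this example.

```python
from collections import defaultdict


def roll_up(records, keys, measure):
    """Sum `measure` over all records that share the same values for `keys`."""
    totals = defaultdict(float)
    for record in records:
        totals[tuple(record[k] for k in keys)] += record[measure]
    return dict(totals)


# Hypothetical call records from a phone calling data warehouse.
calls = [
    {"region": "north", "customer": "c1", "minutes": 5.0},
    {"region": "north", "customer": "c1", "minutes": 7.0},
    {"region": "north", "customer": "c2", "minutes": 3.0},
    {"region": "south", "customer": "c3", "minutes": 10.0},
]

# Lowest cube level: one aggregated value per customer.
per_customer = roll_up(calls, ["region", "customer"], "minutes")
# One level up: fewer cells, enough for region-level questions.
per_region = roll_up(calls, ["region"], "minutes")
```

Four call records shrink to three customer cells and then to two region cells, which is the sense in which a higher level is a smaller representation of the same data.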

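Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - Select the minimum set of features such that the probability distribution of the classes is as close as possible to the original distribution
  - Reduces the number of attributes appearing in the discovered patterns, making them easier to understand
- Heuristic methods (due to the exponential number of choices):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction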
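
Step-wise forward selection can be sketched as a greedy loop: start from the empty set and repeatedly add the single feature that most improves a score, stopping when no feature helps. The `score` callback is an assumption of this sketch; in practice it might be a cross-validated accuracy or an information-based measure, and the toy score below is invented purely to make the example runnable.

```python
def forward_selection(features, score):
    """Greedily add the feature that improves score(subset) the most.

    `score` takes a list of feature names and returns a number
    (higher is better). Stops when no remaining feature improves it.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Toy score (hypothetical): each feature has a fixed usefulness,
# and every selected feature costs a small complexity penalty.
usefulness = {"a": 0.5, "b": 0.3, "c": 0.01}


def toy_score(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


chosen = forward_selection(["a", "b", "c"], toy_score)
```

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal improves the score the most.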

Heuristic methods

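Data Compression
- String compression:
  - A large body of theory and algorithms exists
  - Typically lossless
  - Usually only limited manipulation is possible without decompressing
- Audio/video compression:
  - Typically lossy, with progressive refinement
  - Sometimes fragments of the data can be reconstructed without reconstructing the whole
- A time sequence is not audio: it is typically short and varies slowly with time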

Numerosity Reduction
- Parametric methods:
  - Assume the data fits some model: 1) estimate the model parameters, 2) store them, 3) discard the data (except possible outliers)
  - Linear regression, multiple regression
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
- Non-parametric methods:
  - Do not assume a model
  - Major methods: histograms, clustering, sampling
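
The parametric idea (estimate, store, discard) can be sketched with a simple least-squares line: only the slope, the intercept, and any points that deviate strongly from the line are kept. The function names, the `tolerance` parameter, and the sample data are assumptions of this example; the fit requires at least two distinct x values.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def reduce_to_model(xs, ys, tolerance):
    """Keep only the model parameters plus points far from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    outliers = [
        (x, y) for x, y in zip(xs, ys)
        if abs(y - (slope * x + intercept)) > tolerance
    ]
    return {"slope": slope, "intercept": intercept, "outliers": outliers}


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
model = reduce_to_model(xs, ys, tolerance=0.5)
```

Five data points are replaced by two numbers here; the saving grows with the size of the data set, at the cost of assuming that a line describes the data well.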

Discretization and Concept Hierarchy Generation
- Concept hierarchy example for price: 0..1000 → 0..200, 200..400, …, 800..1000 → 0..100, 100..200
- (1) Discretization of numeric data: binning, histogram (3.15), clustering, natural partitioning (3-4-5 rule)
- (2) Concept hierarchy generation: partial ordering, portion of a hierarchy, set of attributes
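
The price hierarchy above can be sketched with equal-width intervals: the same value maps to a coarser or finer interval depending on the width chosen for each level. The `interval` helper and the convention that intervals are half-open (with the top value placed in the last interval) are assumptions of this sketch.

```python
def interval(value, low, high, width):
    """Return the (start, end) equal-width interval of [low, high] holding value.

    Intervals are treated as half-open [start, end), except that
    value == high is assigned to the last interval.
    """
    if not low <= value <= high:
        raise ValueError("value outside the attribute range")
    last_index = int((high - low) // width) - 1
    index = min(int((value - low) // width), last_index)
    start = low + index * width
    return (start, start + width)


# Walk a price of 350 down the hierarchy: 0..1000 -> width 200 -> width 100.
price = 350
path = [
    interval(price, 0, 1000, 1000),
    interval(price, 0, 1000, 200),
    interval(price, 0, 1000, 100),
]
```

Replacing a raw price by its interval at a chosen level is what "climbing" the concept hierarchy amounts to for a numeric attribute.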

Example of 3-4-5 rule (figure: distribution of profit, count vs. profit)
- Figure annotations: 1, 5, 10; 2, 4, 8; ...
- Step 1: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
- Step 2: msd = 1,000, Low = -$1,000, High = $2,000
- Step 3: (-$1,000 to $2,000) is partitioned into (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)
- Step 4: the range is adjusted to (-$400 to $5,000) and each interval is partitioned further:
  - (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
  - (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
  - ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
  - ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)

Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts: street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping: an intermediate level, e.g., the five central counties and cities
- Specification of a set of attributes, but not of their partial ordering: the system tries to generate the ordering
- Specification of only a partial set of attributes

Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research
- Exercise: compare Web search with the KDD process