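Data Mining Primitives, Languages, and Architectures
Outline: data mining primitives; data mining languages; data mining system architectures; summary.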

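Categories of data mining primitives
A data mining task is specified through five kinds of primitives:
  task-relevant data
  the kind of knowledge to be mined
  background knowledge
  pattern interestingness measures
  presentation and visualization of the results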

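Task-relevant data
Database (or data warehouse) name, e.g. AllElectronics_db.
Database tables (or data warehouse cubes), e.g. the tables item, customer, purchase, items_sold.
Data selection conditions, e.g. select the data on items purchased in Canada during the current year. A condition may be stated at a higher conceptual level than the data stored in the DB/DW: for example, type = "home entertainment" in the condition, versus {tv, cd player, vcr} in the DB/DW.
Relevant attributes (or dimensions), e.g. the name and price attributes of the item table, and the income and age attributes of the customer table. The system should have a mechanism for selecting relevant attributes automatically, for example by evaluating how relevant each attribute is to the specified operation.
Data grouping criteria, e.g. group by date.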

Kinds of knowledge to be mined
Characterization, discrimination, association, classification/prediction, clustering.

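Example: a user who wants to discover customers' buying habits in the AllElectronics database might specify the following association rule template:
    P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
X is the key of the customer table, P and Q are predicate variables (defined over the task-relevant data), and W, Y, Z are object variables. Possible mining results, where the bracketed values are [support, confidence]:
    age(X, "30...39") ^ income(X, "40k...49k") => buys(X, "VCR") [2.2%, 60%]
    occupation(X, "student") ^ age(X, "20...29") => buys(X, "computer") [1.4%, 70%]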

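Background knowledge: concept hierarchies
Background knowledge takes two forms: concept hierarchies, and user beliefs about relationships in the data.
Schema hierarchy, e.g. street < city < province_or_state < country
Set-grouping hierarchy, e.g. {young, middle_aged, senior} < all(age), with {20-39} = young and {40-59} = middle_aged
Operation-derived hierarchy: built by operations such as information decoding, information extraction from complex data objects, data clustering, and data distribution analysis algorithms, e.g. address: login-name < department < university < country
Rule-based hierarchy, e.g. low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
User beliefs about relationships in the data can be used to evaluate the interestingness of mined patterns.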
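The schema and set-grouping hierarchies above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the senior range (60 to 89) is borrowed from the DMQL age_hierarchy definition later in this transcript.

```python
# Set-grouping hierarchy for age: level-1 concepts and the raw values they cover.
AGE_GROUPS = {
    "young": range(20, 40),        # {20-39}
    "middle_aged": range(40, 60),  # {40-59}
    "senior": range(60, 90),       # {60-89}, taken from the later DMQL definition
}

# Schema hierarchy, ordered from the most specific level to the most general.
LOCATION_LEVELS = ["street", "city", "province_or_state", "country"]


def generalize_age(age):
    """Map a raw age to its level-1 concept, or None if no group covers it."""
    for concept, ages in AGE_GROUPS.items():
        if age in ages:
            return concept
    return None


def parent_level(level):
    """Return the next more general schema level, or None at the top."""
    i = LOCATION_LEVELS.index(level)
    return LOCATION_LEVELS[i + 1] if i + 1 < len(LOCATION_LEVELS) else None


print(generalize_age(25))    # young
print(parent_level("city"))  # province_or_state
```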

Pattern interestingness measures
Simplicity, e.g. (association) rule length, (decision) tree size.
Certainty, e.g. confidence, confidence(A => B) = P(B|A) = n(A and B) / n(A); classification reliability or accuracy (also known as rule reliability, rule strength, rule quality, certainty factor, discriminating weight), etc.
Utility, e.g. support (association), s(A => B) = n(A and B) / n(all); noise threshold (description).
Novelty, e.g. not previously known, surprising (used to remove redundant rules, e.g. Canada vs. Vancouver rule implication support ratio).
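As a small worked example of the utility and certainty measures, the sketch below computes support and confidence for a rule A => B. It uses the usual definitions: support is the fraction of all transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B. The transactions are made-up toy data, not taken from the slides.

```python
def support(transactions, a, b):
    """s(A => B) = n(A and B) / n(all)."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)


def confidence(transactions, a, b):
    """confidence(A => B) = n(A and B) / n(A); 0.0 if A never occurs."""
    n_a = sum(1 for t in transactions if a <= t)
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / n_a if n_a else 0.0


# Hypothetical toy transactions, each a set of purchased items.
transactions = [
    {"computer", "printer"},
    {"computer"},
    {"computer", "printer", "scanner"},
    {"printer"},
]
a, b = {"computer"}, {"printer"}

s = support(transactions, a, b)     # 2 of 4 transactions contain both
c = confidence(transactions, a, b)  # 2 of the 3 transactions with a computer
print(s, c)
```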

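Presentation and visualization of discovered patterns
The mining system should be able to display discovered patterns in multiple forms, e.g. rules, tables, reports, charts, graphs, decision trees, and cubes.
The mining system should support multiple operations on the mining results, e.g. drill-down, roll-up, slicing, dicing, rotation, ...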

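DMQL: a data mining query language
Motivation
  To provide interactive data mining capability.
  To offer an SQL-like language, in the hope that it becomes a standard language for mining as SQL did for relational databases.
  To serve as a foundation for system development and evolution.
  To facilitate information exchange, technology transfer, and commercialization, and to gain wide acceptance.
Design
  DMQL is designed on top of the data mining primitives introduced above.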

Syntax for task-relevant data
    use database <database_name>, or use data warehouse <data_warehouse_name>
    from <relation(s)/cube(s)> [where <condition>]
    in relevance to <att_or_dim_list>
    order by <order_list>
    group by <grouping_list>
    having <condition>

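Syntax for task-relevant data (continued)
Example: to mine associations among the items frequently purchased by AllElectronics customers in Canada, with respect to customer income and age, with the data grouped by purchase date, the task-relevant data can be specified as:
    use database AllElectronics_db
    in relevance to I.name, I.price, C.income, C.age
    from customer C, item I, purchase P, items_sold S
    where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID and P.cust_ID = C.cust_ID
    group by P.date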
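To make the task-relevant data specification concrete, here is a rough SQL analogue run through Python's sqlite3 module. It is a sketch under assumptions: the column layout of each table and the sample rows are invented for illustration, "in relevance to" is approximated by the SELECT list, and "group by P.date" is approximated by collecting the joined rows per date in Python.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema and rows; only the column names used in the
# DMQL example come from the slides.
conn.executescript("""
    CREATE TABLE customer (cust_ID INTEGER, income INTEGER, age INTEGER);
    CREATE TABLE item (item_ID INTEGER, name TEXT, price REAL);
    CREATE TABLE purchase (trans_ID INTEGER, cust_ID INTEGER, date TEXT);
    CREATE TABLE items_sold (trans_ID INTEGER, item_ID INTEGER);
    INSERT INTO customer VALUES (1, 45000, 34), (2, 30000, 22);
    INSERT INTO item VALUES (10, 'VCR', 150.0), (11, 'computer', 900.0);
    INSERT INTO purchase VALUES (100, 1, '2000-03-01'), (101, 2, '2000-03-02');
    INSERT INTO items_sold VALUES (100, 10), (101, 11), (101, 10);
""")

# Same join conditions as the DMQL where clause.
rows = conn.execute("""
    SELECT P.date, I.name, I.price, C.income, C.age
    FROM customer C, item I, purchase P, items_sold S
    WHERE I.item_ID = S.item_ID
      AND S.trans_ID = P.trans_ID
      AND P.cust_ID = C.cust_ID
""").fetchall()

# Group the relevant attributes by purchase date.
by_date = defaultdict(list)
for date, name, price, income, age in rows:
    by_date[date].append((name, price, income, age))

for date in sorted(by_date):
    print(date, sorted(by_date[date]))
```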

Syntax for the kind of knowledge to be mined
    <Mine_Knowledge_Specification> ::= <Mine_Char> | <Mine_Discri> | <Mine_Assoc> | <Mine_Class> | <Mine_Pred>
Characterization:
    <Mine_Char> ::= mine characteristics [as <pattern_name>] analyze <measure(s)>
    Example: mine characteristics as customerPurchasing analyze count%
Discrimination:
    <Mine_Discri> ::= mine comparison [as <pattern_name>] for <target_class> where <target_condition> {versus <contrast_class_i> where <contrast_condition_i>} analyze <measure(s)>
    Example: mine comparison as purchaseGroups for bigSpenders where avg(I.price) >= $100 versus budgetSpenders where avg(I.price) < $100 analyze count

Syntax for the kind of knowledge to be mined (continued)
Association:
    <Mine_Assoc> ::= mine associations [as <pattern_name>] [matching <metapattern>]
    Example: mine associations as buyingHabits matching P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Classification:
    <Mine_Class> ::= mine classification [as <pattern_name>] analyze <classifying_attribute_or_dimension>
    Example: mine classification as classifyingCustomerCreditRating analyze credit_info
Prediction:
    <Mine_Pred> ::= mine prediction [as <pattern_name>] analyze <prediction_attribute_or_dimension> {set {<attribute_or_dimension_i> = <value_i>}}
    Example: mine prediction as predictItemPrice analyze price set category = "TV" and brand = "SONY"

Syntax for concept hierarchies
    use hierarchy <hierarchy> for <attribute_or_dimension>
Different kinds of concept hierarchy are defined in different ways.
Schema hierarchy:
    define hierarchy time_hierarchy on date as [date, month, quarter, year]
Set-grouping hierarchy:
    define hierarchy age_hierarchy for age on customer as
        level1: {young, middle_aged, senior} < level0: all
        level2: {20, ..., 39} < level1: young
        level2: {40, ..., 59} < level1: middle_aged
        level2: {60, ..., 89} < level1: senior
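One way to read the time_hierarchy schema definition is as a function that generalizes a date to any of its four levels. The sketch below assumes calendar quarters and uses made-up label formats such as "2000-Q1"; neither is specified in the slides.

```python
from datetime import date

TIME_LEVELS = ["date", "month", "quarter", "year"]  # most specific to most general


def generalize_date(d, level):
    """Return the label of date d at the requested level of time_hierarchy."""
    if level == "date":
        return d.isoformat()
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    if level == "quarter":
        # Assumption: calendar quarters, so January to March is Q1.
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")


d = date(2000, 3, 15)
for level in TIME_LEVELS:
    print(level, generalize_date(d, level))
```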

Syntax for concept hierarchies (continued)
Operation-derived hierarchy:
    define hierarchy age_hierarchy for age on customer as
        {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
Rule-based hierarchy:
    define hierarchy profit_margin_hierarchy on item as
        level_1: low_profit_margin < level_0: all if (price - cost) < $50
        level_1: medium_profit_margin < level_0: all if ((price - cost) > $50) and ((price - cost) <= $250)
        level_1: high_profit_margin < level_0: all if (price - cost) > $250
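The rule-based profit_margin_hierarchy translates almost directly into conditional logic. The following is a minimal sketch with a hypothetical function name. It follows the three rules literally, so a margin of exactly $50 matches none of them and is reported as unclassified.

```python
def profit_margin_level(price, cost):
    """Classify an item into a level_1 concept of profit_margin_hierarchy."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if 50 < margin <= 250:
        return "medium_profit_margin"
    if margin > 250:
        return "high_profit_margin"
    return None  # margin == 50 is not covered by the rules as written


print(profit_margin_level(100, 70))    # low_profit_margin
print(profit_margin_level(300, 100))   # medium_profit_margin
print(profit_margin_level(1000, 100))  # high_profit_margin
```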

Syntax for interestingness measures
    with <interest_measure_name> threshold = <threshold_value>
Example:
    with support threshold = 0.05
    with confidence threshold = 0.7
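Thresholds like these act as filters on the discovered rules. The sketch below applies them to the two example rules given earlier for AllElectronics. It assumes a rule passes when its value is greater than or equal to the threshold, which the slides do not state, and the relaxed support threshold of 0.01 is invented for contrast.

```python
def filter_rules(rules, min_support, min_confidence):
    """Keep the rules whose support and confidence both meet the thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]


# The two example rules from the slides, with [support, confidence] as fractions.
rules = [
    {"rule": 'age(X, "30...39") ^ income(X, "40k...49k") => buys(X, "VCR")',
     "support": 0.022, "confidence": 0.60},
    {"rule": 'occupation(X, "student") ^ age(X, "20...29") => buys(X, "computer")',
     "support": 0.014, "confidence": 0.70},
]

strict = filter_rules(rules, min_support=0.05, min_confidence=0.7)
relaxed = filter_rules(rules, min_support=0.01, min_confidence=0.7)  # hypothetical threshold
print(len(strict), len(relaxed))
```

With the thresholds from the slide, neither example rule survives, since both have support below 5%.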

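Syntax for presenting the mined knowledge
    display as <result_form>
The user specifies the form in which the results are displayed. To view the results at different concept levels:
    <Multilevel_Manipulation> ::= roll up on <attribute_or_dimension>
                                | drill down on <attribute_or_dimension>
                                | add <attribute_or_dimension>
                                | drop <attribute_or_dimension>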
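Roll-up and drill-down can be pictured as moving counts up and down a concept hierarchy. The sketch below uses a made-up city-to-country mapping and made-up counts. It is meant only to show the idea, not how a DMQL system implements these operations.

```python
from collections import Counter

# Hypothetical location hierarchy (city < country) and per-city counts.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}
city_counts = {"Vancouver": 3, "Toronto": 2, "Chicago": 4}


def roll_up(counts, mapping):
    """Aggregate counts from the finer level to the coarser level."""
    rolled = Counter()
    for key, n in counts.items():
        rolled[mapping[key]] += n
    return dict(rolled)


def drill_down(counts, mapping, parent):
    """Show the finer-level counts that make up one coarser-level value."""
    return {key: n for key, n in counts.items() if mapping[key] == parent}


country_counts = roll_up(city_counts, CITY_TO_COUNTRY)
canada_detail = drill_down(city_counts, CITY_TO_COUNTRY, "Canada")
print(country_counts)
print(canada_detail)
```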

A complete DMQL statement
    use database AllElectronics_db
    use hierarchy location_hierarchy for B.address
    mine characteristics as customerPurchasing
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, purchases P, items_sold S, works_at W, branch B
    where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
        and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
        and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
        and B.address = "Canada" and I.price >= 100
    with noise threshold = 0.05
    display as table

Other data mining languages
Association rule languages:
  MSQL (Imielinski & Virmani '99)
  MineRule (Meo, Psaila and Ceri '96)
  Query flocks, based on Datalog syntax (Tsur et al. '98)
OLEDB for DM (Microsoft, 2000)
  Together with OLE DB and OLE DB for OLAP, it aims at standardizing DB, DW, and DM.
  As of March 2000 it covered predictive modeling (classification and prediction) and clustering, but not yet characterization, discrimination, or association modeling.
CRISP-DM (CRoss-Industry Standard Process for Data Mining)
  An international project involving database companies, data warehouse companies, and user companies.
  It aims to provide a platform and process structure for effective data mining.
  It emphasizes applying data mining technology to solve business problems.

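Data mining system architectures
Degree of coupling between a data mining system and a DB/DW system:
No coupling: files serve as the data source and hold the results. Not recommended.
Loose coupling: the DB/DW is the data source, and results are written to files or back to the DB/DW, but the data structures and query optimization methods provided by the DB/DW are not used.
Semi-tight coupling: improves mining performance. Some mining primitives are implemented in the DB/DW, such as sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of statistical functions such as count, sum, max, min, and standard deviation.
Tight coupling: a uniform information processing environment. DM is integrated into the DB/DW system as a component of the information system, and mining queries are optimized using the DB/DW's data structures, indexing schemes, and query processing methods.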
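The difference between loose and semi-tight coupling can be illustrated with sqlite3 standing in for the DB/DW. In the loose style the mining code only fetches rows and does all the computation itself. In the semi-tight style the aggregation primitives (count, sum, max, min) are pushed into the database. The table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [("VCR", 150.0), ("TV", 400.0), ("CD player", 120.0)],  # made-up rows
)

# Loose coupling: the database is only a data source; statistics are
# computed on the mining side after fetching every row.
prices = [p for (p,) in conn.execute("SELECT price FROM item")]
loose = (len(prices), sum(prices), max(prices), min(prices))

# Semi-tight coupling: the same statistics are computed inside the
# database, so only one summary row crosses the boundary.
semi_tight = conn.execute(
    "SELECT COUNT(*), SUM(price), MAX(price), MIN(price) FROM item"
).fetchone()

print(loose)
print(semi_tight)
```

Both approaches give the same numbers here. The point of semi-tight coupling is that the database can use its own indexing and aggregation machinery instead of shipping all rows to the mining system.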

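Summary
Data mining query primitives: task-relevant data, kind of knowledge to be mined, background knowledge, interestingness measures, knowledge presentation and visualization.
Data mining query languages: DMQL, MS/OLEDB for DM, etc.
Data mining system architectures: no coupling, loose coupling, semi-tight coupling, tight coupling.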

Thank you!
Presenter: Li Yan (李炎)
Contact: michaelli@eastday.com
