Date: 2012/05/14 Source: Bo Zhao et. al (CIKM’11)

Slides:



Advertisements
Similar presentations
第七組古文閱讀報告 組長:秀惠 組員:孟筑、雅曼、雅文、盈蓁. 《朱買臣苦學有成》之原文翻譯 朱買臣,字翁子,吳人也。 朱買臣,字翁子,吳國人。 家貧,好讀書,不治產業,常刈(一ˋ)薪 樵,賣以給 (ㄐㄧ ˇ ) 食。 家裡雖然很窮困,但是他還是很喜歡讀書,因 不懂得如何治理產業,只能靠著上山砍材去城.
Advertisements

資料採礦與商業智慧 第十六章 線上分析處理.
信息检索中效率问题的研究 报告人:赵江华 北京大学计算机科学与技术系 网络与分布式系统实验室 2002年4月21日.
数据库研究方法和论文写作 陆嘉恒 中国人民大学.
VS 兒童及少年身心發展 幼保三甲 幼兒期 青少年期 4A1I0014 陳佳瑩 4A1I0023 尤秀惠
鞘翅目 生科四乙 蘇俊融.
人工智能 Artificial Intelligence 第十一章
华东师范大学软件学院 王科强 (第一作者), 王晓玲
Homework 2 : VSM and Summary
第4讲 企业财务管理.
核心价值观记心中 主题班会
第三章 隨機變數.
Chapter 5 Relational Algebra
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
A TIME-FREQUENCY ADAPTIVE SIGNAL MODEL-BASED APPROACH FOR PARAMETRIC ECG COMPRESSION 14th European Signal Processing Conference (EUSIPCO 2006), Florence,
A Question Answering Approach to Emotion Cause Extraction
Minimum Spanning Trees
What water is more suitable for nurturing the goldfish
Some Effective Techniques for Naive Bayes Text Classification
Platypus — Indoor Localization and Identification through Sensing Electric Potential Changes in Human Bodies.
指導教授:許子衡 教授 報告學生:翁偉傑 Qiangyuan Yu , Geert Heijenk
Population proportion and sample proportion
關聯式資料庫.
模式识别 Pattern Recognition
SPC introduction.
實證醫學 嘉義基督教醫院 外科部 黃國倉醫師
Understanding Report Cards 读懂成绩单 Mr Alex Ward Director of Studies 教学总监
計算方法設計與分析 Design and Analysis of Algorithms 唐傳義
Source: IEEE Access, vol. 5, pp , October 2017
Knowledge Engineering & Artificial Intelligence Lab (知識工程與人工智慧)
Creating Animated Apps (I) 靜宜大學資管系 楊子青
第4章(2) 空间数据库 —关系数据库 北京建筑工程学院 王文宇.
创建型设计模式.
第5章 資料倉儲的資料建置.
Word-Entity Duet Representations for Document Ranking
生物芯片技术 刘超 李世燕 谢宏林
The role of leverage in cross-border mergers and acquisitions
Interval Estimation區間估計
SpringerLink 新平台介绍.
数据库内容及检索功能 – 如何利用这些资源帮助科技论文的写作与发表 钟似璇 (Sixuan Zhong s.
基于类关联规则的分类 Classification Based on Class-Association Rules
PubMed整合显示图书馆电子资源 医科院图书馆电子资源培训讲座.
Advisor : Prof. Frank Y.S. Lin Presented by Yen-Yi, Hsu
感謝同學們在加分題建議. 我會好好研讀+反省~
A high payload data hiding scheme based on modified AMBTC technique
数据摘要现状调研报告 上下文摘要初步思考 徐丹云.
VIDEO COMPRESSION & MPEG
研究技巧與論文撰寫方法 中央大學資管系 陳彥良.
校園地震預警系統的建置與應用 林沛暘.
高性能计算与天文技术联合实验室 智能与计算学部 天津大学
2019/4/8 A Load Balancing Mechanism for multiple SDN Controllers based on Load Informing Strategy Miultiple controller 的 load balancing 機制,使用一個叫 Load informing.
Maintaining Frequent Itemsets over High-Speed Data Streams
Total Review of Data Structures
前向人工神经网络敏感性研究 曾晓勤 河海大学计算机及信息工程学院 2003年10月.
Representation Learning of Knowledge Graphs with Hierarchical Types
Google Local Search API Research and Implementation
A Data Mining Algorithm for Generalized Web Prefetching
SpringerLink 新平台介绍.
指導老師:邱登裕老師 組員:B 張萬鈞 B 鄭瑞傑 B 蔡譯陞 B 胡瑜真
An Efficient MSB Prediction-based Method for High-capacity Reversible Data Hiding in Encrypted Images 基于有效MSB预测的加密图像大容量可逆数据隐藏方法。 本文目的: 做到既有较高的藏量(1bpp),
Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨
Speaker : YI-CHENG HUNG
國立東華大學課程設計與潛能開發學系張德勝
Reversible Data Hiding in Color Image with Grayscale Invariance
Speaker : YI-CHENG HUNG
Principle and application of optical information technology
Homework 2 : VSM and Summary
Gaussian Process Ruohua Shi Meeting
Hybrid fractal zerotree wavelet image coding
Presentation transcript:

TEXplorer: Keyword-based Object Search and Exploration in Multidimensional Text Databases Date: 2012/05/14 Source: Bo Zhao et. al (CIKM’11) Speaker: Chiang,guang-ting Advisor: Dr. Koh. Jia-ling 2019/4/18

Index Introduction Tasks of TEXplorer Significance of A Dimension Ranking Algorithms Experiment Conclusions 2019/4/18

Introduction Multi-dimensional text database 由於網路快速發展,現於網路上存在大量的資料,而有許多資料是具有豐富的結構化屬性,像產品評論… Multi-dimensional text database 2019/4/18

Introduction attribute Not helpful TEXplorer 對於這種類型的資料,現在較流行的搜尋方法是讓使用者藉由選擇某”維度”來進行相關性的排名,並回傳結果給使用者 但有時並不是太有幫助 2019/4/18

Top-1 Dimension: Person Tasks of TEXplorer TEXplorer Integrating keyword-based ranking and OLAP exploration. Healthcare Reform TEXplorer System Top-1 Dimension: Person Top-2 Dimension: Org Top-3 Dimension: Time 2010 2008 2004 藉由選擇某維度的屬性直,分析落在此維度值的文件,對其他屬性計算與查詢字相關的前k高的屬性值 2019/4/18

系統實例 2019/4/18

Tasks of TEXplorer Ranking dimensions for drilling down Ranking values in the same dimension 系統主要工作 2019/4/18

Significance of A Dimension doc (q,cell’s doc) Healthcare Reform TEXplorer System Top-1 Dimension: Person Top-2 Dimension: Org Top-3 Dimension: Time 2010 2008 2004 (q,cell) 考慮兩個feature 在同一維度底下 1.屬性質跟查詢字的關係 2.此屬性質所屬的文件與查詢字的關係 2019/4/18

Significance of A Dimension Cell variance how much the relevance of each of C’s Ai- children deviates from the relevance of C 𝒄𝒉𝒅 𝑨 𝒊 (C):denote the set of children in 𝐴 𝑖 of C. 𝑪 𝒅 :document of cell C. IDF:inverted document frequency factor of term w∈q. TFW:term frequency factor of w in document d. QTW:the query term frequency factor of w in q. 2019/4/18

Rel(q,C)= 1 5 ∗ 𝑟𝑒𝑙 𝑞,𝑑1 +𝑟𝑒𝑙 𝑞,𝑑2 +… q=“game”, “playing” C= “16’0 LED” SCREEN 16’0 LED d1,d2,d3,d4,d5 BREND DELL d1,d2 HP d3 acer d4,d5 Rel(q,C)= 1 5 ∗ 𝑟𝑒𝑙 𝑞,𝑑1 +𝑟𝑒𝑙 𝑞,𝑑2 +… rel(q,d1)= 𝐼𝐷𝐹 𝑔𝑎𝑚𝑒 ∗ 𝑇𝐹𝑊 𝑔𝑎𝑚𝑒 ∗ 𝑄𝑇𝑊 𝑔𝑎𝑚𝑒 + 𝐼𝐷𝐹 𝑝𝑙𝑎𝑦𝑖𝑛𝑔 ∗ 𝑇𝐹𝑊 𝑝𝑙𝑎𝑦𝑖𝑛𝑔 ∗ 𝑄𝑇𝑊 𝑝𝑙𝑎𝑦𝑖𝑛𝑔 = log 5 1 ∗ 3 100 ∗ 1 2 + log 5 2 ∗ 10 100 ∗ 1 2 =0.02985 𝐶𝑉 𝐵𝑅𝐸𝑁𝐷 (𝑞,𝐶)= 2∗0.3+1∗0.2+2∗0.1 3−1 OS XP7 d1 MacOs d2,d3 XP d4,d5 2019/4/18

Significance of A Dimension Document variance how much the relevance of each document in C deviates from the relevance of the Ai-child of C which contains that document. 2019/4/18

a= (rel(q,d1)−Rel(q,Dell)) 2 + (rel(q,d2)−Rel(q,Dell)) 2 q=“game playing” C= “16’0 LED” SCREEN 16’0 LED d1,d2,d3,d4,d5 BREND DELL d1,d2 HP d3 acer d4,d5 𝐼𝐷𝑉 𝐵𝑅𝐸𝑁𝐷 (𝑞,𝐶)= 5−3 𝑎+𝑏+𝑐 a= (rel(q,d1)−Rel(q,Dell)) 2 + (rel(q,d2)−Rel(q,Dell)) 2 b= (rel(q,d3)−Rel(q,HP)) 2 c= (rel(q,d4)−Rel(q,acer)) 2 + (rel(q,d5)−Rel(q,acer)) 2 OS XP7 d1 MacOs d2,d3 XP d4,d5 2019/4/18

Significance of A Dimension 1 doc Healthcare Reform TEXplorer System Top-1 Dimension: Person Top-2 Dimension: Org Top-3 Dimension: Time 2010 2008 2004 2019/4/18

針對具有豐富結構化屬性的文件資料庫進行關鍵字的查詢,跟一般查詢不同的是他回傳的不是依相關度排序的文件,而是提供一瀏覽介面提供與查詢字相關的維度資訊及資料細節來幫助使用者搜尋有興趣的資料。 屬性的排名是依照查詢字與各屬性的相關度決定,屬性相關度衡量的基準有二: 一是考慮查詢字與此屬性的屬性值的相關度,另一則是分布在此屬性值底下的文件與查詢字的相關程度 2019/4/18

Ranking Algorithms MultiAccess Algorithm q=“game” , “playing” SCREEN 15’0 LED d1,d2,d3 14’0 LED d4 14’0 TFT d5 1.Fetch the doc related with q 2. 𝑆𝑖𝑔 𝑆𝐶𝑅𝐸𝐸𝑁 , 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING 3.Return Top-K 1.Fetch the doc related with q and C 2. 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING 3.Return Top-K BREND DELL d1,d2 HP d3 acer d4,d5 OS XP7 d1 MacOs d2 XP d3 BREND DELL d1,d2 HP d3 OS XP7 d1 MacOs d2,d4 XP d3,d5 2019/4/18

Ranking Algorithms MultiAccess Algorithm 2019/4/18

Ranking Algorithms R𝑒𝑙(𝑞,𝐶′) OneAccess Algorithm 2019/4/18

Ranking Algorithms OneAccess Algorithm q=“game” , “playing” SCREEN 15’0 LED d1,d2,d3 14’0 LED d4 14’0 TFT d5 1.Precompute each doc’s TFW, IDF then could precompute rel ,Rel…… 2.Fetch the doc related with q 3. 𝑆𝑖𝑔 𝑆𝐶𝑅𝐸𝐸𝑁 , 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING 4.Return top-k 1.Fetch the doc related with q and C 2. 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING BREND DELL d1,d2 HP d3 acer d4,d5 BREND DELL d1,d2 HP d3 TFW IDF d1 0.3 0.1 d2 0.5 0.2 d3 d4 d5 0.4 OS XP7 d1 MacOs d2,d3 XP d4,d5 OS XP7 d1 MacOs d2 XP d3 2019/4/18

Ranking Algorithms OneAccess Algorithm 2019/4/18

Ranking Algorithms OneAccess+ Algorithm progressively fetch documents in the descending order of their relevance scores 𝑑𝑓 𝑤,𝑐 : number of documents in C that contain w 2019/4/18

≥ For each ordered documents(by relevance)(d4>d5>d1>d2>d3) Upper bound Sig Upper bound List Lower bound Sig Lower bound List 𝐿 𝑈𝐵 𝐿 𝐿𝐵 𝐿 𝐿𝐵 [k] 𝐿 𝑈𝐵 [k+1] ≥ 2019/4/18

rel: d4>d5>d1>d2>d3 q=“game” , “playing” Screen: rel: d4>d5>d1>d2>d3 Top-2 attribute SCREEN 15’0 LED 14’0 LED 14’0 TFT 𝐿 𝑈𝐵 𝐿 𝐿𝐵 SCREEN BREND OS 𝐿 𝑈𝐵 𝐿 𝐿𝐵 SCREEN BREND OS d4 d5 BREND DELL HP acer 𝐿 𝐿𝐵 2 .score≥ 𝐿 𝑈𝐵 3 .𝑠𝑐𝑜𝑟𝑒 d4 d5 OS XP7 MacOs XP d4 d5 2019/4/18

Ranking Algorithms OneAccess+ Algorithm 2019/4/18

Ranking Algorithms OneAccess++ Algorithm 2019/4/18

Ranking Algorithms OneAccess++ Algorithm 2019/4/18

Ranking Algorithms Algorithm difference MultiAccess Algorithm Baseline approach. OneAccess Algorithm The relevance of cells can be computed without scanning any document. OneAccess+ Algorithm Exploits upper and lower bounds to enable early termination of the algorithm when top-k is guaranteed. OneAccess++ Algorithm The stop condition is different. 2019/4/18

Experiment Analyse the performance of our proposed algorithms.(Effectiveness) Verify the effectiveness of our ranking mechanisms.(Efficiency) 2019/4/18

Experiment Dataset Customer reviews for laptops from Google Products. The dataset has 26,418 reviews, 920 laptops. 11 dimensions: Audio Card, Battery Run Time, Brand, Screen Type, Color, Weight, Operating System, CPU, Hard Drive, Main Memory and Video Card. 2019/4/18

Experiment Effectiveness Compare our significance measure with the dimension ranking functions in two recent faceted search systems. Indg(indistinguishable) targets on building decision trees with minimum height on the tuples, so that users’ efforts to reach each individual tuple can be minimized. 2019/4/18

Experiment Intr(interestingness) estimates the probability of the query results using a hypergeometric distribution. A smaller p-value indicates it is less likely to get the results by chance, and therefore the cell C is more interesting. 2019/4/18

Experiment 2019/4/18

Experiment 2019/4/18

Experiment Efficiency Efficiency vs. Number of Documents issue 10 queries of length 3 2019/4/18

Experiment Efficiency Efficiency vs. Top-k 2019/4/18

Conclusions Propose a novel system TEXplorer that allows users to perform keyword search and OLAP-style aggregation and exploration of the objects in a text cube built on a multidimensional text database. A novel significance measure is proposed to rank dimensions. Top relevant objects/cells in each dimension subspace are also ranked for users to select. We develop top-k algorithms and materialization strategies to accelerate the ranking. Extensive experimental studies verify the scalability of our methods and the effectiveness of our proposed significance measure. 2019/4/18

Thanks for your listening 2019/4/18

Supplement 2019/4/18

Precision 舉個例子:假設現在資料庫中有10000筆資料, 和美食有關的文章有500篇。 使用者在輸入美食的關鍵字後,回傳的文章有 4000篇,其中有400篇是和美食有關的。 Precision = 400 / 4000 = 10% Recall = 400 / 500 = 80% 2019/4/18