Date: 2012/05/14 Source: Bo Zhao et. al (CIKM’11)

Slides:

Advertisements

Similar presentations

第七組古文閱讀報告組長：秀惠組員：孟筑、雅曼、雅文、盈蓁. 《朱買臣苦學有成》之原文翻譯朱買臣，字翁子，吳人也。朱買臣，字翁子，吳國人。家貧，好讀書，不治產業，常刈（一ˋ）薪樵，賣以給（ㄐㄧ ˇ ）食。家裡雖然很窮困，但是他還是很喜歡讀書，因不懂得如何治理產業，只能靠著上山砍材去城.

Advertisements

資料採礦與商業智慧第十六章線上分析處理.

信息检索中效率问题的研究报告人：赵江华北京大学计算机科学与技术系网络与分布式系统实验室 2002年4月21日.

数据库研究方法和论文写作陆嘉恒中国人民大学.

VS 兒童及少年身心發展幼保三甲幼兒期青少年期 4A1I0014 陳佳瑩 4A1I0023 尤秀惠

鞘翅目生科四乙蘇俊融.

人工智能 Artificial Intelligence 第十一章

华东师范大学软件学院王科强 (第一作者), 王晓玲

Homework 2 : VSM and Summary

第4讲企业财务管理.

核心价值观记心中主题班会

第三章隨機變數.

Chapter 5 Relational Algebra

Leftmost Longest Regular Expression Matching in Reconfigurable Logic

A TIME-FREQUENCY ADAPTIVE SIGNAL MODEL-BASED APPROACH FOR PARAMETRIC ECG COMPRESSION 14th European Signal Processing Conference (EUSIPCO 2006), Florence,

A Question Answering Approach to Emotion Cause Extraction

Minimum Spanning Trees

What water is more suitable for nurturing the goldfish

Some Effective Techniques for Naive Bayes Text Classification

Platypus — Indoor Localization and Identification through Sensing Electric Potential Changes in Human Bodies.

指導教授：許子衡教授報告學生：翁偉傑 Qiangyuan Yu , Geert Heijenk

Population proportion and sample proportion

關聯式資料庫.

模式识别 Pattern Recognition

SPC introduction.

實證醫學嘉義基督教醫院外科部　黃國倉醫師

Understanding Report Cards 读懂成绩单 Mr Alex Ward Director of Studies 教学总监

計算方法設計與分析 Design and Analysis of Algorithms 唐傳義

Source: IEEE Access, vol. 5, pp , October 2017

Knowledge Engineering & Artificial Intelligence Lab (知識工程與人工智慧)

Creating Animated Apps (I) 靜宜大學資管系楊子青

第4章(2) 空间数据库 —关系数据库北京建筑工程学院王文宇.

创建型设计模式.

第5章資料倉儲的資料建置.

Word-Entity Duet Representations for Document Ranking

生物芯片技术刘超李世燕谢宏林

The role of leverage in cross-border mergers and acquisitions

Interval Estimation區間估計

SpringerLink 新平台介绍.

数据库内容及检索功能 – 如何利用这些资源帮助科技论文的写作与发表钟似璇 (Sixuan Zhong s.

基于类关联规则的分类 Classification Based on Class-Association Rules

PubMed整合显示图书馆电子资源医科院图书馆电子资源培训讲座.

Advisor : Prof. Frank Y.S. Lin Presented by Yen-Yi, Hsu

感謝同學們在加分題建議. 我會好好研讀+反省~

A high payload data hiding scheme based on modified AMBTC technique

数据摘要现状调研报告上下文摘要初步思考徐丹云.

VIDEO COMPRESSION & MPEG

研究技巧與論文撰寫方法中央大學資管系陳彥良.

校園地震預警系統的建置與應用林沛暘.

高性能计算与天文技术联合实验室智能与计算学部天津大学

2019/4/8 A Load Balancing Mechanism for multiple SDN Controllers based on Load Informing Strategy Miultiple controller 的 load balancing 機制，使用一個叫 Load informing.

Maintaining Frequent Itemsets over High-Speed Data Streams

Total Review of Data Structures

前向人工神经网络敏感性研究曾晓勤河海大学计算机及信息工程学院 2003年10月.

Representation Learning of Knowledge Graphs with Hierarchical Types

Google Local Search API Research and Implementation

A Data Mining Algorithm for Generalized Web Prefetching

SpringerLink 新平台介绍.

指導老師：邱登裕老師組員：B 張萬鈞 B 鄭瑞傑 B 蔡譯陞 B 胡瑜真

An Efficient MSB Prediction-based Method for High-capacity Reversible Data Hiding in Encrypted Images 基于有效MSB预测的加密图像大容量可逆数据隐藏方法。本文目的：做到既有较高的藏量（1bpp),

Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨

Speaker : YI-CHENG HUNG

國立東華大學課程設計與潛能開發學系張德勝

Reversible Data Hiding in Color Image with Grayscale Invariance

Speaker : YI-CHENG HUNG

Principle and application of optical information technology

Homework 2 : VSM and Summary

Gaussian Process Ruohua Shi Meeting

Hybrid fractal zerotree wavelet image coding

Presentation transcript:

TEXplorer: Keyword-based Object Search and Exploration in Multidimensional Text Databases Date: 2012/05/14 Source: Bo Zhao et. al (CIKM’11) Speaker: Chiang,guang-ting Advisor: Dr. Koh. Jia-ling 2019/4/18

Index Introduction Tasks of TEXplorer Significance of A Dimension Ranking Algorithms Experiment Conclusions 2019/4/18

Introduction Multi-dimensional text database 由於網路快速發展，現於網路上存在大量的資料，而有許多資料是具有豐富的結構化屬性，像產品評論… Multi-dimensional text database 2019/4/18

Introduction attribute Not helpful TEXplorer 對於這種類型的資料，現在較流行的搜尋方法是讓使用者藉由選擇某”維度”來進行相關性的排名，並回傳結果給使用者但有時並不是太有幫助 2019/4/18

Top-1 Dimension: Person Tasks of TEXplorer TEXplorer Integrating keyword-based ranking and OLAP exploration. Healthcare Reform TEXplorer System Top-1 Dimension: Person Top-2 Dimension: Org Top-3 Dimension: Time 2010 2008 2004 藉由選擇某維度的屬性直，分析落在此維度值的文件，對其他屬性計算與查詢字相關的前k高的屬性值 2019/4/18

系統實例 2019/4/18

Tasks of TEXplorer Ranking dimensions for drilling down Ranking values in the same dimension 系統主要工作 2019/4/18

Significance of A Dimension doc (q,cell’s doc) Healthcare Reform TEXplorer System Top-1 Dimension: Person Top-2 Dimension: Org Top-3 Dimension: Time 2010 2008 2004 (q,cell) 考慮兩個feature 在同一維度底下 1.屬性質跟查詢字的關係 2.此屬性質所屬的文件與查詢字的關係 2019/4/18

Significance of A Dimension Cell variance how much the relevance of each of C’s Ai- children deviates from the relevance of C 𝒄𝒉𝒅 𝑨 𝒊 (C):denote the set of children in 𝐴 𝑖 of C. 𝑪 𝒅 :document of cell C. IDF:inverted document frequency factor of term w∈q. TFW:term frequency factor of w in document d. QTW:the query term frequency factor of w in q. 2019/4/18

Rel(q,C)= 1 5 ∗ 𝑟𝑒𝑙 𝑞,𝑑1 +𝑟𝑒𝑙 𝑞,𝑑2 +… q=“game”, “playing” C= “16’0 LED” SCREEN 16’0 LED d1,d2,d3,d4,d5 BREND DELL d1,d2 HP d3 acer d4,d5 Rel(q,C)= 1 5 ∗ 𝑟𝑒𝑙 𝑞,𝑑1 +𝑟𝑒𝑙 𝑞,𝑑2 +… rel(q,d1)= 𝐼𝐷𝐹 𝑔𝑎𝑚𝑒 ∗ 𝑇𝐹𝑊 𝑔𝑎𝑚𝑒 ∗ 𝑄𝑇𝑊 𝑔𝑎𝑚𝑒 + 𝐼𝐷𝐹 𝑝𝑙𝑎𝑦𝑖𝑛𝑔 ∗ 𝑇𝐹𝑊 𝑝𝑙𝑎𝑦𝑖𝑛𝑔 ∗ 𝑄𝑇𝑊 𝑝𝑙𝑎𝑦𝑖𝑛𝑔 = log 5 1 ∗ 3 100 ∗ 1 2 + log 5 2 ∗ 10 100 ∗ 1 2 =0.02985 𝐶𝑉 𝐵𝑅𝐸𝑁𝐷 (𝑞,𝐶)= 2∗0.3+1∗0.2+2∗0.1 3−1 OS XP7 d1 MacOs d2,d3 XP d4,d5 2019/4/18

Significance of A Dimension Document variance how much the relevance of each document in C deviates from the relevance of the Ai-child of C which contains that document. 2019/4/18

a= (rel(q,d1)−Rel(q,Dell)) 2 + (rel(q,d2)−Rel(q,Dell)) 2 q=“game playing” C= “16’0 LED” SCREEN 16’0 LED d1,d2,d3,d4,d5 BREND DELL d1,d2 HP d3 acer d4,d5 𝐼𝐷𝑉 𝐵𝑅𝐸𝑁𝐷 (𝑞,𝐶)= 5−3 𝑎+𝑏+𝑐 a= (rel(q,d1)−Rel(q,Dell)) 2 + (rel(q,d2)−Rel(q,Dell)) 2 b= (rel(q,d3)−Rel(q,HP)) 2 c= (rel(q,d4)−Rel(q,acer)) 2 + (rel(q,d5)−Rel(q,acer)) 2 OS XP7 d1 MacOs d2,d3 XP d4,d5 2019/4/18

Significance of A Dimension 1 doc Healthcare Reform TEXplorer System Top-1 Dimension: Person Top-2 Dimension: Org Top-3 Dimension: Time 2010 2008 2004 2019/4/18

針對具有豐富結構化屬性的文件資料庫進行關鍵字的查詢，跟一般查詢不同的是他回傳的不是依相關度排序的文件，而是提供一瀏覽介面提供與查詢字相關的維度資訊及資料細節來幫助使用者搜尋有興趣的資料。屬性的排名是依照查詢字與各屬性的相關度決定，屬性相關度衡量的基準有二: 一是考慮查詢字與此屬性的屬性值的相關度，另一則是分布在此屬性值底下的文件與查詢字的相關程度 2019/4/18

Ranking Algorithms MultiAccess Algorithm q=“game” , “playing” SCREEN 15’0 LED d1,d2,d3 14’0 LED d4 14’0 TFT d5 1.Fetch the doc related with q 2. 𝑆𝑖𝑔 𝑆𝐶𝑅𝐸𝐸𝑁 , 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING 3.Return Top-K 1.Fetch the doc related with q and C 2. 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING 3.Return Top-K BREND DELL d1,d2 HP d3 acer d4,d5 OS XP7 d1 MacOs d2 XP d3 BREND DELL d1,d2 HP d3 OS XP7 d1 MacOs d2,d4 XP d3,d5 2019/4/18

Ranking Algorithms MultiAccess Algorithm 2019/4/18

Ranking Algorithms R𝑒𝑙(𝑞,𝐶′) OneAccess Algorithm 2019/4/18

Ranking Algorithms OneAccess Algorithm q=“game” , “playing” SCREEN 15’0 LED d1,d2,d3 14’0 LED d4 14’0 TFT d5 1.Precompute each doc’s TFW, IDF then could precompute rel ,Rel…… 2.Fetch the doc related with q 3. 𝑆𝑖𝑔 𝑆𝐶𝑅𝐸𝐸𝑁 , 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING 4.Return top-k 1.Fetch the doc related with q and C 2. 𝑆𝑖𝑔 𝐵𝑅𝐸𝑁𝐷 , 𝑆𝑖𝑔 𝑂𝑆 get RANKING BREND DELL d1,d2 HP d3 acer d4,d5 BREND DELL d1,d2 HP d3 TFW IDF d1 0.3 0.1 d2 0.5 0.2 d3 d4 d5 0.4 OS XP7 d1 MacOs d2,d3 XP d4,d5 OS XP7 d1 MacOs d2 XP d3 2019/4/18

Ranking Algorithms OneAccess Algorithm 2019/4/18

Ranking Algorithms OneAccess+ Algorithm progressively fetch documents in the descending order of their relevance scores 𝑑𝑓 𝑤,𝑐 : number of documents in C that contain w 2019/4/18

≥ For each ordered documents(by relevance)(d4>d5>d1>d2>d3) Upper bound Sig Upper bound List Lower bound Sig Lower bound List 𝐿 𝑈𝐵 𝐿 𝐿𝐵 𝐿 𝐿𝐵 [k] 𝐿 𝑈𝐵 [k+1] ≥ 2019/4/18

rel: d4>d5>d1>d2>d3 q=“game” , “playing” Screen: rel: d4>d5>d1>d2>d3 Top-2 attribute SCREEN 15’0 LED 14’0 LED 14’0 TFT 𝐿 𝑈𝐵 𝐿 𝐿𝐵 SCREEN BREND OS 𝐿 𝑈𝐵 𝐿 𝐿𝐵 SCREEN BREND OS d4 d5 BREND DELL HP acer 𝐿 𝐿𝐵 2 .score≥ 𝐿 𝑈𝐵 3 .𝑠𝑐𝑜𝑟𝑒 d4 d5 OS XP7 MacOs XP d4 d5 2019/4/18

Ranking Algorithms OneAccess+ Algorithm 2019/4/18

Ranking Algorithms OneAccess++ Algorithm 2019/4/18

Ranking Algorithms OneAccess++ Algorithm 2019/4/18

Ranking Algorithms Algorithm difference MultiAccess Algorithm Baseline approach. OneAccess Algorithm The relevance of cells can be computed without scanning any document. OneAccess+ Algorithm Exploits upper and lower bounds to enable early termination of the algorithm when top-k is guaranteed. OneAccess++ Algorithm The stop condition is different. 2019/4/18

Experiment Analyse the performance of our proposed algorithms.(Effectiveness) Verify the effectiveness of our ranking mechanisms.(Efﬁciency) 2019/4/18

Experiment Dataset Customer reviews for laptops from Google Products. The dataset has 26,418 reviews, 920 laptops. 11 dimensions: Audio Card, Battery Run Time, Brand, Screen Type, Color, Weight, Operating System, CPU, Hard Drive, Main Memory and Video Card. 2019/4/18

Experiment Effectiveness Compare our signiﬁcance measure with the dimension ranking functions in two recent faceted search systems. Indg(indistinguishable) targets on building decision trees with minimum height on the tuples, so that users’ efforts to reach each individual tuple can be minimized. 2019/4/18

Experiment Intr(interestingness) estimates the probability of the query results using a hypergeometric distribution. A smaller p-value indicates it is less likely to get the results by chance, and therefore the cell C is more interesting. 2019/4/18

Experiment 2019/4/18

Experiment 2019/4/18

Experiment Efﬁciency Efﬁciency vs. Number of Documents issue 10 queries of length 3 2019/4/18

Experiment Efﬁciency Efﬁciency vs. Top-k 2019/4/18

Conclusions Propose a novel system TEXplorer that allows users to perform keyword search and OLAP-style aggregation and exploration of the objects in a text cube built on a multidimensional text database. A novel signiﬁcance measure is proposed to rank dimensions. Top relevant objects/cells in each dimension subspace are also ranked for users to select. We develop top-k algorithms and materialization strategies to accelerate the ranking. Extensive experimental studies verify the scalability of our methods and the effectiveness of our proposed signiﬁcance measure. 2019/4/18

Thanks for your listening 2019/4/18

Supplement 2019/4/18

Precision 舉個例子：假設現在資料庫中有10000筆資料，和美食有關的文章有500篇。使用者在輸入美食的關鍵字後，回傳的文章有 4000篇，其中有400篇是和美食有關的。 Precision = 400 / 4000 = 10% Recall = 400 / 500 = 80% 2019/4/18