1
TEXplorer: Keyword-based Object Search and Exploration in Multidimensional Text Databases
Date: 2012/05/14 Source: Bo Zhao et al. (CIKM'11) Speaker: Chiang, Guang-Ting Advisor: Dr. Koh, Jia-Ling
2
Index
Introduction
Tasks of TEXplorer
Significance of a Dimension
Ranking Algorithms
Experiment
Conclusions
3
Introduction: Multi-dimensional text database
With the rapid growth of the Web, a huge amount of data is now available online, and much of it carries rich structured attributes, such as product reviews. Such a collection forms a multi-dimensional text database.
4
Introduction: attribute selection alone is not helpful enough (motivation for TEXplorer)
For this kind of data, the popular search approach today lets the user pick a "dimension" to filter on, ranks the results by relevance, and returns them, but this is sometimes not very helpful.
5
Tasks of TEXplorer
TEXplorer integrates keyword-based ranking with OLAP exploration. The query "Healthcare Reform" is sent to the TEXplorer System, which returns Top-1 Dimension: Person, Top-2 Dimension: Org, Top-3 Dimension: Time (2010, 2008, 2004).
After the user selects an attribute value of some dimension, the system analyzes the documents that fall under that value and, for each of the other dimensions, computes the top-k attribute values most relevant to the query.
6
System example
7
Tasks of TEXplorer
The system's two main tasks are: ranking dimensions for drilling down, and ranking values within the same dimension.
8
Significance of A Dimension
(The same "Healthcare Reform" figure as above, annotated with the relevance of the query to each cell, (q, cell), and to each cell's documents, (q, cell's doc).)
Two features are considered within the same dimension: (1) the relevance between each attribute value and the query, and (2) the relevance between the documents belonging to that attribute value and the query.
9
Significance of A Dimension
Cell variance: how much the relevance of each of C's A_i-children deviates from the relevance of C.
chd_{A_i}(C): the set of A_i-children of C.
C_d: the set of documents in cell C.
IDF_w: the inverse document frequency factor of term w ∈ q.
TFW_w: the term frequency factor of w in document d.
QTW_w: the query term frequency factor of w in q.
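As an illustration only (the slide does not reproduce the paper's exact IDF/TFW/QTW definitions), here is a minimal Python sketch of the quantities above, with stand-in weighting functions; rel, Rel, and the cell variance follow the structure of the worked example on the next slide.

```python
import math

# Stand-in weighting factors; the paper defines its own IDF/TFW/QTW,
# so these helpers are illustrative assumptions only.
def idf(w, all_docs):
    containing = sum(1 for d in all_docs if w in d)
    return math.log(len(all_docs) / max(1, containing))

def tfw(w, doc):
    return doc.count(w) / max(1, len(doc))          # doc is a list of tokens

def qtw(w, query):
    return query.count(w)

def rel(query, doc, all_docs):
    """rel(q, d) = sum over w in q of IDF_w * TFW_w * QTW_w."""
    return sum(idf(w, all_docs) * tfw(w, doc) * qtw(w, query) for w in set(query))

def Rel(query, cell_docs, all_docs):
    """Rel(q, C): average relevance of the documents in cell C."""
    return sum(rel(query, d, all_docs) for d in cell_docs) / len(cell_docs)

def cell_variance(query, cell_docs, children, all_docs):
    """CV_Ai(q, C): deviation of each A_i-child's relevance from Rel(q, C),
    weighted by child size and divided by (#children - 1), as in the worked
    example below."""
    parent = Rel(query, cell_docs, all_docs)
    dev = sum(len(ch) * (Rel(query, ch, all_docs) - parent) ** 2 for ch in children)
    return dev / (len(children) - 1)
```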
10
Example (q = {"game", "playing"}, C = the 16'0 LED screen cell):
SCREEN: 16'0 LED: d1, d2, d3, d4, d5
BRAND: DELL: d1, d2; HP: d3; acer: d4, d5
OS: XP7: d1; MacOs: d2, d3; XP: d4, d5
Rel(q, C) = (1/5) * (rel(q, d1) + rel(q, d2) + ...)
rel(q, d1) = IDF_game * TFW_game * QTW_game + IDF_playing * TFW_playing * QTW_playing
           = log(5/1) * ... * ... + log(5/2) * ... * ... (the TFW and QTW values and the result are omitted on the slide)
CV_BRAND(q, C) = (2*0.3 + 1*0.2 + 2*0.1) / (3 - 1)
11
Significance of A Dimension
Document variance: how much the relevance of each document in C deviates from the relevance of the A_i-child of C which contains that document.
12
Example (q = {"game", "playing"}, C = the 16'0 LED screen cell):
SCREEN: 16'0 LED: d1, d2, d3, d4, d5
BRAND: DELL: d1, d2; HP: d3; acer: d4, d5
OS: XP7: d1; MacOs: d2, d3; XP: d4, d5
IDV_BRAND(q, C) = (a + b + c) / (5 - 3), where
a = (rel(q, d1) - Rel(q, Dell))^2 + (rel(q, d2) - Rel(q, Dell))^2
b = (rel(q, d3) - Rel(q, HP))^2
c = (rel(q, d4) - Rel(q, acer))^2 + (rel(q, d5) - Rel(q, acer))^2
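Continuing the sketch above (same assumed helpers), the document variance accumulates each document's squared deviation from the relevance of its own A_i-child and divides by (#documents in C minus #children), matching the (a + b + c) / (5 - 3) form of this example.

```python
def document_variance(query, cell_docs, children, all_docs):
    """IDV_Ai(q, C): deviation of each document's relevance from the relevance
    of the A_i-child of C that contains it, e.g. (a + b + c) / (5 - 3) above."""
    dev = 0.0
    for child in children:
        child_rel = Rel(query, child, all_docs)   # Rel/rel from the sketch above
        dev += sum((rel(query, d, all_docs) - child_rel) ** 2 for d in child)
    return dev / (len(cell_docs) - len(children))
```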
13
Significance of A Dimension
(Figure: the "Healthcare Reform" example again, annotated at the document level, showing the dimensions ranked by significance: Top-1 Person, Top-2 Org, Top-3 Time (2010, 2008, 2004).)
14
TEXplorer performs keyword search over a document database with rich structured attributes. Unlike ordinary keyword search, it does not return a list of documents ranked by relevance; instead, it provides a browsing interface that shows the dimensions related to the query together with the underlying data details, helping users find the data they are interested in.
Dimensions are ranked by the relevance between the query and each dimension. This relevance is measured by two criteria: (1) the relevance between the query and the attribute values of the dimension, and (2) the relevance between the query and the documents distributed under those attribute values.
15
Ranking Algorithms: MultiAccess Algorithm (q = {"game", "playing"})
Initial query:
1. Fetch the documents relevant to q.
2. Compute Sig_SCREEN, Sig_BRAND, Sig_OS to get the ranking.
3. Return the top-k.
SCREEN: 15'0 LED: d1, d2, d3; 14'0 LED: d4; 14'0 TFT: d5
BRAND: DELL: d1, d2; HP: d3; acer: d4, d5
OS: XP7: d1; MacOs: d2, d4; XP: d3, d5
After drilling down into a cell C (the cell containing d1, d2, d3):
1. Fetch the documents relevant to q that are in C.
2. Compute Sig_BRAND, Sig_OS to get the ranking.
3. Return the top-k.
BRAND: DELL: d1, d2; HP: d3
OS: XP7: d1; MacOs: d2; XP: d3
16
Ranking Algorithms: MultiAccess Algorithm
17
Ranking Algorithms: OneAccess Algorithm
18
Ranking Algorithms: OneAccess Algorithm (q = {"game", "playing"})
Initial query:
1. Precompute each document's TFW and IDF factors; rel, Rel, ... can then be precomputed as well.
2. Fetch the documents relevant to q.
3. Compute Sig_SCREEN, Sig_BRAND, Sig_OS to get the ranking.
4. Return the top-k.
Precomputed factors (partial): d1: TFW 0.3, IDF 0.1; d2: TFW 0.5, IDF 0.2; d5: 0.4 (the remaining entries are omitted on the slide).
SCREEN: 15'0 LED: d1, d2, d3; 14'0 LED: d4; 14'0 TFT: d5
BRAND: DELL: d1, d2; HP: d3; acer: d4, d5
OS: XP7: d1; MacOs: d2, d3; XP: d4, d5
After drilling down into a cell C (the cell containing d1, d2, d3):
1. Fetch the documents relevant to q that are in C.
2. Compute Sig_BRAND, Sig_OS to get the ranking.
BRAND: DELL: d1, d2; HP: d3
OS: XP7: d1; MacOs: d2; XP: d3
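A hedged sketch of step 1 (the precomputation idea): per-document term factors are stored offline so that rel, and therefore Rel and the significance scores, can be assembled at query time without re-reading document text. The dictionary layout and function names are assumptions; a real system would use an inverted index and per-cell aggregates.

```python
# Offline: precompute the term factors once per document.
# stats[doc_id][term] = (TFW, IDF)
def precompute_stats(docs, vocabulary):
    all_docs = list(docs.values())
    return {doc_id: {w: (tfw(w, tokens), idf(w, all_docs))
                     for w in vocabulary if w in tokens}
            for doc_id, tokens in docs.items()}

# Online: compute rel(q, d) from the precomputed factors, no document scan.
def rel_from_stats(query, doc_id, stats):
    total = 0.0
    for w in set(query):
        tf, idf_w = stats.get(doc_id, {}).get(w, (0.0, 0.0))
        total += idf_w * tf * qtw(w, query)
    return total
```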
19
Ranking Algorithms: OneAccess Algorithm
20
Ranking Algorithms: OneAccess+ Algorithm
Progressively fetch documents in descending order of their relevance scores.
df_{w,C}: the number of documents in C that contain w.
21
For each document, processed in descending order of relevance (d4 > d5 > d1 > d2 > d3):
maintain an upper-bound significance list L_UB and a lower-bound significance list L_LB over the candidate dimensions.
Stop as soon as L_LB[k] ≥ L_UB[k+1], i.e. the k-th largest lower bound is at least the (k+1)-th largest upper bound, which guarantees the top-k result.
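A simplified sketch of the early-termination loop described above: documents are consumed in descending relevance order, the per-dimension lower and upper significance bounds are tightened after each document, and the loop stops once the k-th largest lower bound is at least the (k+1)-th largest upper bound. The update_bounds callback is a placeholder; the paper derives its bounds from statistics such as df_{w,C} and the remaining relevance mass.

```python
def topk_dimensions(dims, docs_by_relevance, k, update_bounds):
    """dims: candidate dimensions.
    docs_by_relevance: documents in descending order of rel(q, d).
    update_bounds(dim, seen_docs) -> (lower, upper): placeholder bound update."""
    seen, bounds = [], {dim: (float("-inf"), float("inf")) for dim in dims}
    for doc in docs_by_relevance:
        seen.append(doc)
        bounds = {dim: update_bounds(dim, seen) for dim in dims}
        lb = sorted((b[0] for b in bounds.values()), reverse=True)  # L_LB scores
        ub = sorted((b[1] for b in bounds.values()), reverse=True)  # L_UB scores
        if len(lb) > k and lb[k - 1] >= ub[k]:   # L_LB[k] >= L_UB[k+1]
            break                                # top-k is guaranteed
    return sorted(dims, key=lambda d: bounds[d][0], reverse=True)[:k]
```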
22
Example (q = {"game", "playing"}, top-2 attributes):
Documents ordered by relevance: d4 > d5 > d1 > d2 > d3.
Dimensions: SCREEN (15'0 LED, 14'0 LED, 14'0 TFT), BRAND (DELL, HP, acer), OS (XP7, MacOs, XP).
The bound lists L_UB and L_LB (each holding SCREEN, BRAND, OS) are updated after fetching d4 and then d5; the algorithm can stop once L_LB[2].score ≥ L_UB[3].score.
23
Ranking Algorithms: OneAccess+ Algorithm
24
Ranking Algorithms: OneAccess++ Algorithm
25
Ranking Algorithms: OneAccess++ Algorithm
26
Ranking Algorithms: algorithm differences
MultiAccess Algorithm: the baseline approach.
OneAccess Algorithm: the relevance of cells can be computed without scanning any document.
OneAccess+ Algorithm: exploits upper and lower bounds to terminate early once the top-k is guaranteed.
OneAccess++ Algorithm: the same idea with a different stop condition.
27
Experiment
Verify the effectiveness of our ranking mechanisms (effectiveness).
Analyze the performance of our proposed algorithms (efficiency).
28
Experiment: Dataset
Customer reviews for laptops from Google Products.
The dataset has 26,418 reviews covering 920 laptops.
11 dimensions: Audio Card, Battery Run Time, Brand, Screen Type, Color, Weight, Operating System, CPU, Hard Drive, Main Memory, and Video Card.
29
Experiment: Effectiveness
We compare our significance measure with the dimension ranking functions of two recent faceted search systems.
Indg (indistinguishability) aims to build decision trees of minimum height over the tuples, so that the user's effort to reach each individual tuple is minimized.
30
Experiment: Intr (interestingness)
Intr estimates the probability of the query results using a hypergeometric distribution. A smaller p-value indicates the results are less likely to have occurred by chance, and therefore the cell C is more interesting.
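As an illustration of the Intr baseline's idea only (not TEXplorer's own measure), the p-value of seeing at least k query-relevant documents inside a cell can be computed from a hypergeometric distribution; the counts below are hypothetical.

```python
from scipy.stats import hypergeom

def interestingness_pvalue(n_total, n_relevant, cell_size, relevant_in_cell):
    """P(X >= relevant_in_cell) for X ~ Hypergeometric(population n_total,
    n_relevant 'successes', cell_size draws). A smaller p-value means the
    overlap is less likely to be chance, so the cell is more 'interesting'."""
    return hypergeom.sf(relevant_in_cell - 1, n_total, n_relevant, cell_size)

# Hypothetical counts: 26,418 reviews overall, 1,000 relevant to the query,
# a cell containing 300 reviews of which 40 are relevant.
p_value = interestingness_pvalue(26418, 1000, 300, 40)
```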
31
Experiment
32
Experiment
33
Experiment: Efficiency vs. Number of Documents
We issue 10 queries of length 3.
34
Experiment: Efficiency vs. Top-k
35
Conclusions
We propose a novel system, TEXplorer, that allows users to perform keyword search together with OLAP-style aggregation and exploration of the objects in a text cube built on a multidimensional text database.
A novel significance measure is proposed to rank dimensions; the top relevant objects/cells in each dimension subspace are also ranked for users to select.
We develop top-k algorithms and materialization strategies to accelerate the ranking.
Extensive experimental studies verify the scalability of our methods and the effectiveness of our proposed significance measure.
36
Thank you for listening.
37
Supplement
38
Precision and Recall
Example: suppose the database contains 10,000 documents, 500 of which are about food. After the user submits the keyword "food", 4,000 documents are returned, and 400 of them are actually about food.
Precision = 400 / 4000 = 10%
Recall = 400 / 500 = 80%
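The same numbers as a quick Python check:

```python
relevant_in_db = 500       # food-related documents in the database
returned = 4000            # documents returned for the query
relevant_returned = 400    # returned documents that are actually food-related

precision = relevant_returned / returned         # 400 / 4000 = 0.10
recall = relevant_returned / relevant_in_db      # 400 / 500  = 0.80
print(f"Precision = {precision:.0%}, Recall = {recall:.0%}")
```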