Top-k search的实现技术邵菲 MF1633034 2017.9.25 2018/12/25.

Slides:

Advertisements

Similar presentations

Course 1 演算法: 效率、分析與量級 Algorithms: Efficiency, Analysis, and Order

Advertisements

数据库研究方法和论文写作陆嘉恒中国人民大学.

Performance Evaluation

1. 理想的路由算法有关路由选择协议的几个基本概念算法必须是正确的和完整的。算法在计算上应简单。

資料結構與C++程式設計進階資料結構概論講師：林業峻 CSIE, NTU 6/ 7, 2010.

XI. Hilbert Huang Transform (HHT)

Leftmost Longest Regular Expression Matching in Reconfigurable Logic

Minimum Spanning Trees

Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo +

An Adaptive Cross-Layer Multi-Path Routing Protocol for Urban VANET

Large-Scale Malware Indexing Using Function-Call Graphs

Tree（樹）什麼是「樹」？「樹」的範例「樹」的定義「樹」的表示法.

關聯式資料庫.

模式识别 Pattern Recognition

Digital Terrain Modeling

計算方法設計與分析 Design and Analysis of Algorithms 唐傳義

樹狀結構陳怡芬 2018/11/16 北一女中資訊專題研究.

非線性規劃 Nonlinear Programming

SAT and max-sat Qi-Zhi Cai.

Seam Carving for Content-Aware Image Resizing

Chapter 6 Graph Chang Chi-Chung

The Greedy Method.

演算法方式總覽 The Divide-and-Conquer Strategy (個各擊破)(binary Searching、Quick Sort…. ) The Greedy Method(貪婪演算法) (Prim MST、Kruskal MST、Djikstra's algorithm) Dynamic.

Logistics 物流昭安國際物流園區總經理曾玉勤.

圖論 (Graph Theory) B 電機四大鳥 B 電機四酋長 B 電機四炫大

第4章网络互联与广域网 4.1 网络互联概述 4.2 网络互联设备 4.3 广域网 4.4 ISDN 4.5 DDN

Graph 2 Michael Tsai 2012/5/1 連載: 學生上課睡覺姿勢大全

第二十九單元方向導數與梯度.

Course 9 NP Theory序論 An Introduction to the Theory of NP

Simulated Annealing 報告者:李怡緯 OPLAB in NTUIM.

Journal of High Speed Networks 15（2006）

製程能力分析何正斌教授國立屏東科技大學工業管理學系.

The expression and applications of topology on spatial data

Interval Estimation區間估計

子博弈完美Nash均衡我们知道，一个博弈可以有多于一个的Nash均衡。在某些情况下，我们可以按照“子博弈完美”的要求，把不符合这个要求的均衡去掉。扩展型博弈G的一部分g叫做一个子博弈，如果g包含某个节点和它所有的后继点，并且一个G的信息集或者和g不相交，或者整个含于g。一个Nash均衡称为子博弈完美的，如果它在每.

消費者偏好與效用概念.

数据库内容及检索功能 – 如何利用这些资源帮助科技论文的写作与发表钟似璇 (Sixuan Zhong s.

客户服务询盘惯例.

神经信息学自组织网络 ——自组织映射史忠植中科院计算所 2019/2/2.

樹 2 Michael Tsai 2013/3/26.

Ch07 圖形與網路淡江大學周清江

感謝同學們在加分題建議. 我會好好研讀+反省~

Chapter 5 Recursion.

Version Control System Based DSNs

高性能计算与天文技术联合实验室智能与计算学部天津大学

Mechanics Exercise Class Ⅰ

Total Review of Data Structures

前向人工神经网络敏感性研究曾晓勤河海大学计算机及信息工程学院 2003年10月.

VRP工具or-tools调研王羚宇

線性規劃模式 Linear Programming Models

從 ER 到 Logical Schema ──兼談Schema Integration

计算机问题求解 – 论题算法方法 2016年11月28日.

The viewpoint (culture) [观点(文化)]

Course 8 回溯、分枝與限制 Backtracking, Branch-and-Bound

鄧姚文資料結構第八章：圖（Graph）鄧姚文

Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨

最短通路问题.

唐常杰四川大学计算机学院计算机科学技术系

赵才荣同济大学，电子与信息工程学院，智信馆410室

计算机问题求解 – 论题图的连通度 2018年11月13日.

钱炘祺一种面向实体浏览中属性融合的人机交互的设计与实现 Designing Human-Computer Interaction of Property Consolidation for Entity Browsing 钱炘祺

怎樣把同一評估給與在不同班級的學生 How to administer the Same assessment to students from Different classes and groups.

以碎形正交基底和時間情境圖為基礎進行之視訊檢索 Video retrieval based on fractal orthogonal bases and temporal graph 阿凡達研究生：張敏倫指導教授：蔣依吾博士國立中山大學資訊工程學系.

Graph 1 Michael Tsai 2012/4/24 連載: 學生上課睡覺姿勢大全

Experimental Analysis of Distributed Graph Systems

JAVA 程式設計與資料結構第二十一章 Graph.

Presentation transcript:

top-k search的实现技术邵菲 MF1633034 2017.9.25 2018/12/25

1.top-k search strategy(BANKS[1]) Formal Database Model Nodes/vertices tuples in database Node weights depends upon the prestige of the node(indegree) PageRank Edges foreign key Similarity between relations s(R1;R2) Edge weights (u; v) exists, s(R(u); R(v)) (v; u) exists, INv (u) s(R(v); R(u)) (u; v) (v;u)exists, min{(R(u); R(v)); INv (u) s(R(v); R(u))} 2018/12/25

1.top-k search strategy(BANKS[1]) Query and Answer Model search terms t1; t2; : : : ; tn to locate nodes matching search terms S = (S1; S2; S3; : : : ; Sn) answer to a query a rooted directed tree containing at least one node from each Si answer tree relevance score(nodes and edges) depress the scale using logarithms(IDF) 2018/12/25

1.top-k search strategy(BANKS[1]) Searching for the best answers Backward expanding algorithm Run S copies of Dijkstra’s single source shortest path algorithm. The idea is to find a common vertex from which a forward path exists to at least one node in each set Si. Algorithm are run concurrently by creating an iterator interface to the shortest path algorithm, and creating an instance of the iterator for each keyword node. Ranking Sort in decreasing relevance order. Delete duplicate trees. 每个keyword node都生成了一个迭代器的实例，每次迭代中，算法挑选出一个迭代器，这个迭代器中下一个输出的顶点距离源点（就是keyword node）是最短的。每个被访问的顶点维护一个列表，记录已经经过该点的keyword node，用于判断是否能够以这个顶点作为根节点生成一个answer答案。结果按照tree的wight的升序排列，用heap存储生成的结果控制结果数量。 2018/12/25

1.top-k search strategy(BLINKS[2]) a novel approach (Bi-Level Indexing for Keyword Search) Better search strategy BLINKS is based on cost-balanced expansion, a new policy for the backward search strategy (which explores the graph starting from nodes containing query keywords). Combining indexing with search BLINKS augments search with an index, which selectively precomputes and materializes some shortest-path information. Partitioning-based indexing A naive realization of an index that keeps all shortest-path information would be too large to store and too expensive to maintain for large graphs. BLINKS partitions a data graph into multiple subgraphs, or blocks. 2018/12/25

1.top-k search strategy(BLINKS[2]) A Single-Level Index Keyword-node list LKN (w) denotes the list of nodes that can reach keyword w, and these nodes are ordered by their distances to w Node-keyword map MNK (u, w) returns the shortest distance from u to w, or ∞ if u cannot reach any node that contains w. 2018/12/25

1.top-k search strategy(BLINKS[2]) Searching with a Single-Level Index 2018/12/25

1.top-k search strategy(BLINKS[2]) Bi-Level Indexing in BLINKS Node-based partitioning portal node in-portal or out-portal A top-level block index Keyword-block lists the list of blocks containing keyword w. Portal-block lists the list of blocks with p as an out-portal. An intra-block index for each block Intra-block keyword-node lists Intra-block node-keyword map Intra-block portal-node lists Intra-block node-portal distance map In-portal: it has at least one incoming edge from another block and at least one outgoing edge in this block. Out-portal: it has at least one outgoing edge to another block and at least one incoming edge from this block V5是b1的out portal，可以从别的block通过v5进入b1 2018/12/25

1.top-k search strategy(BLINKS[2]) Graph Partitioning BFS or METIS-based partitioning aims to minimize the total weight of edge separators convert edge separators into node-separators portals 2018/12/25

1.top-k search strategy(BLINKS[2]) Searching with the Bi-Level Index Backward Expansion with Queues of Cursors 【line 4】为每个keyword wi维护一个cursors的队列Qi，用keyword-block列表找到那些包含wi的blocks。【line 5】每个cursors被用来扫描每个intra-block keyword-node list（for wi）。【line 12】每当我们到达当前block的一个in-portal u，我们需要继续在所有以u作为out-portal的block中继续进行backward expansion 【line 13】For each such block b, we continue expansion from u using a new cursor, this time to go over the portal-node list in block b for out-portal u. 【line 11 & 14】因为同一个点u可能被包不同的block（含同一个keyword）访问到，所以用到一个bitmap（crossed）记录u是否已经被某个keyword经过了。 Implementing Optimal Backward Search Strategy 【line 7】cost-balanced expansion strategy across clusters, function pickKeyword，which selects the keyword with the least number of explored nodes. optimal equi-distance expansion strategy within clusters: The cursor with the highest priority in the queue is the one whose next distance is the smallest. Expanding Forward 【line 26 & 27】intra-block中node-keyword的距离并不是全局最优，但是我们可以判断其有效性，使得从点u直接前向扩展到wi。查询块内node-portal distance map，若(u, out-portal)>(u, wi)，我们可以判断说最短距离(u,wi)是落在这个块中的。 Pruning and Stopping 计算lower bound之和 sumLBDist(u)作为u到每个keyword之间的距离的lower bound之和。（1）bound for search：每次选取距离最近的，所以u到wj的距离至少是这个块中下一个被扩展的点到wj的距离；（2）bound from index ： min{块b中u到wj的距离，块b中u到最近的一个out-portal的距离} 【line 28 & 29】如果一个点的组合距离已经大于阈值，那这个点不可能是topk结果中的。【line 16】结束条件：如果为每一个未访问点的组合距离都大于阈值，那么每一个非答案节点都可以被剪枝了。 2018/12/25

2. diversifying top-k results Problem definition Diversity top-k results Diversity Graph Top-k search frameworks Incremental top-k incremental-next() with non-increasing order of their scores stops after k results are generated Bounding top-k bounding-next() not necessarily with non-increasing order of their scores a bound unseen is defined to be the upper bound of the scores for the unseen results stops when the k-th largest score of all generated results is no smaller than the upper bound for the unseen results unseen 每个搜索结果看作是一个点，对应有相应的score打分，还有任意两个点之间计算sim值，当这个值大于阈值说明这两个结果是相似的。 Top-k 结果D(S)定义：个数小于等于k；不存在两个相似的结果；score之和是最大化。 Diversity Graph：每个结果看做一个点，结果之间相似的有边，G(S)中的点按照score大小非递增的顺序进行排序（降序）则Top-k结果又可以定义为：个数小于等于k；是G（S）中的点独立集；score最大化。这两种top-k搜索框架没有考虑相似结果的处理，第一种每次都按照降序取最大score的那个结果，第二种每次不一定按照降序取结果，设置了一个阈值，当第k大的结果score大于阈值unseen（upper bound），就停止。 2018/12/25

2. diversifying top-k results Diversified top-k search Sufficient Stop Condition Necessary Stop Condition 考虑多样性的新的框架，需要考虑三种方法，充分停止条件，必要停止条件，在当前结果集合中挑选diversity top-k的算法，其中最后一个算法涉及到了三种算法。更新S集合的算法就是算法1的line3-6和算法2的line4-8。充分停止条件：Di(S)是S中最好的包含i个结果的集合，当三个条件被满足的时候，说明div top-k结果已经得到了，停止算法。必要停止条件：当必要条件没有被满足的时候，div-search-current没有必要进行。上一次生成的结果为空，或者， S一撇是上一次div-search-current产生的结果集合，max{}是其最大点独立集，当其数量小于k时，说明还可以最少还可以加入（k-max{}）这么多个点使得score更大。 2018/12/25

2. diversifying top-k results Diversified top-k search Searching Current Set Div-astar searches the whole space S using the A∗ based heuristics by designing an upper bound function astar-bound() can hardly handle problems with large diversity graph G Div-dp decompose G into connected components searched independently using div-astar combine the components based on dynamic programming Div-cut decompose each connected component into subgraphs, where subgraphs are connected through a set of cut points 找到最大点独立集是个NP-hard的问题。 2018/12/25

2. diversifying top-k results Diversified top-k search Searching Current Set Div-astar 找到最大点独立集是个NP-hard的问题。一个堆heap来存储A*搜索中的entry，e=(solution,pos,score,bound)，solution就是搜索出来的集合，按照bound大小进行排序，bound是solution估计的upper bound，pos是最后一次搜索的顶点在solution中的位置，score是solution的score。【line 18-26】添加与solution中点不相邻的点，为partial solution估计upper bound。【line9-17】最大估计upper bound的solution被pop出来，计算D.solution(i)的heap中solutions可以重新被用来计算D.solution(i-1). Consider the diversity graph shown in Fig. 1. Suppose k = 3. The process for the div-astar search is shown in Fig. 4. First, entry (∅, 0, ∞) is popped from the heap H, and 6 new entries are pushed. The result {v1} is with upper bound 19 because after selecting v1, nodes v3, v5 and v4 that are adjacent to v1 should be excluded when calculating the upper bound of {v1}. The only nodes left are v2 and v6 with scores 8 and 1 respectively. Thus the upper bound for {v1} is score({v1}) + 8 + 1 = 19. The one with the highest score is ({v3}, 7, 20), and thus it is popped from H in the next iteration and generates two new entries. Accordingly, ({v3, v4}, 14, 20) and ({v3, v4, v5}, 20, 20) are popped from H in order. At this moment, D.score1 = 10, D.score2 = 14, and D.score3 = 20. The stop condition is satisfied, and D.solution3 is the optimal diversified top-3 results. Consider now we compute D.solution2 since it is not the optimal solution currently. We do not need to reconstruct H from scratch. We update all upper bounds for entries that exists on the current H. In this example, entry ({v1}, 10, 19) is updated to be ({v1}, 10, 18) as shown in Fig. 5. Continue the iteration, ({v1}, 10, 18) and ({v1, v2}, 18, 18) are popped from H in order. At this moment, D.score2 is updated to be 18 and the stop condition is satisfied. Thus 18 is the best score for k = 2 2018/12/25

2. diversifying top-k results Diversified top-k search Searching Current Set Div-dp Div-astar不适用于大型的diversity graph，所以把graph分解为若干连通分值，然后连通分支单独进行astar，然后将结果进行整合。动态规划的思想，加法：D.solution(i)是由两个结果集合中i个构成最大score的点集合；乘法：D.solution(i)是取两个结果集合中最好的那个集合中i个点。 2018/12/25

2. diversifying top-k results Diversified top-k search Searching Current Set Div-cut 当某个连通分支很大的时候，搜索整个连通分值非常耗时，同时可能很多包含松弛连通的子图，所以考虑割点。（1）Cp-compress图压缩：删去一些点，不影响solution的产生。W1可以被删去，因为w1和w2相连，并且与w2相连的点也与w1相连。（2）构建cptree：figure10，计算割点，构建子图。（3）然后搜索cptree，分为包含割点或者不包含割点两种情况来考虑，例如计算G.exclude(O24) = G34 ⊕ G12 (line 19 in the for loop from line 7). 1. o12 is excluded. It is the case for j = 0 (line 9). In such a situation, we can compute G′ 12 = G1.excludeo12 ⊕ G2 (line 15). 2. o12 is included. It is the case for j = 1 (line 9). In such a situation, we can compute G′′ 12 = G1.includeo12 ⊕ (G2 - o12.adj(G)) (line 15), where G2 - o12.adj(G) is to remove the adjacent nodes of o12 from G2 (line 13). After computing 1 and 2, G12 can be computed as G12 = G′ 12 ⊗ G′′ 12 (line 16) 2018/12/25

2. diversifying top-k results Diversified top-k search Searching Current Set Cp-search Calculate G.exclude(O24) = G34 ⊕ G12 1. o12 is excluded. It is the case for j = 0 . In such a situation, we can compute G′ 12 = G1.excludeo12 ⊕ G2 . 2. o12 is included. It is the case for j = 1 . In such a situation, we can compute G′′ 12 = G1.includeo12 ⊕ (G2 - o12.adj(G)) , where G2 - o12.adj(G) is to remove the adjacent nodes of o12 from G2 . After computing 1 and 2, G12 can be computed as G12 = G′ 12 ⊗ G′′ 12. 当某个连通分支很大的时候，搜索整个连通分值非常耗时，同时可能很多包含松弛连通的子图，所以考虑割点。（1）Cp-compress图压缩：删去一些点，不影响solution的产生。W1可以被删去，因为w1和w2相连，并且与w2相连的点也与w1相连。（2）构建cptree：figure10，计算割点，构建子图。（3）然后搜索cptree，分为包含割点或者不包含割点两种情况来考虑，例如计算G.exclude(O24) = G34 ⊕ G12 (line 19 in the for loop from line 7). 1. o12 is excluded. It is the case for j = 0 (line 9). In such a situation, we can compute G′ 12 = G1.excludeo12 ⊕ G2 (line 15). 2. o12 is included. It is the case for j = 1 (line 9). In such a situation, we can compute G′′ 12 = G1.includeo12 ⊕ (G2 - o12.adj(G)) (line 15), where G2 - o12.adj(G) is to remove the adjacent nodes of o12 from G2 (line 13). After computing 1 and 2, G12 can be computed as G12 = G′ 12 ⊗ G′′ 12 (line 16) 2018/12/25

References [1] Hulgeri A, Nakhe C. Keyword Searching and Browsing in Databases using BANKS[C]// International Conference on Data Engineering, 2002. Proceedings. IEEE, 2002:431-440. [2] He H, Wang H, Yang J, et al. BLINKS: ranked keyword searches on graphs[C]// ACM SIGMOD International Conference on Management of Data. ACM, 2007:305-316. [3] Lu Qin, Jeffrey Xu Yu, Lijun Chang. Diversifying Top-K Results[J]. Proceedings of the Vldb Endowment, 2012, 5(11):1124-1135. 2018/12/25