top-k search的实现技术 邵菲 MF1633034 2017.9.25 2018/12/25
1.top-k search strategy(BANKS[1]) Formal Database Model Nodes/vertices tuples in database Node weights depends upon the prestige of the node(indegree) PageRank Edges foreign key Similarity between relations s(R1;R2) Edge weights (u; v) exists, s(R(u); R(v)) (v; u) exists, INv (u) s(R(v); R(u)) (u; v) (v;u)exists, min{(R(u); R(v)); INv (u) s(R(v); R(u))} 2018/12/25
1.top-k search strategy(BANKS[1]) Query and Answer Model search terms t1; t2; : : : ; tn to locate nodes matching search terms S = (S1; S2; S3; : : : ; Sn) answer to a query a rooted directed tree containing at least one node from each Si answer tree relevance score(nodes and edges) depress the scale using logarithms(IDF) 2018/12/25
1.top-k search strategy(BANKS[1]) Searching for the best answers Backward expanding algorithm Run S copies of Dijkstra’s single source shortest path algorithm. The idea is to find a common vertex from which a forward path exists to at least one node in each set Si. Algorithm are run concurrently by creating an iterator interface to the shortest path algorithm, and creating an instance of the iterator for each keyword node. Ranking Sort in decreasing relevance order. Delete duplicate trees. 每个keyword node都生成了一个迭代器的实例,每次迭代中,算法挑选出一个迭代器,这个迭代器中下一个输出的顶点距离源点(就是keyword node)是最短的。 每个被访问的顶点维护一个列表,记录已经经过该点的keyword node,用于判断是否能够以这个顶点作为根节点生成一个answer答案。 结果按照tree的wight的升序排列,用heap存储生成的结果控制结果数量。 2018/12/25
1.top-k search strategy(BLINKS[2]) a novel approach (Bi-Level Indexing for Keyword Search) Better search strategy BLINKS is based on cost-balanced expansion, a new policy for the backward search strategy (which explores the graph starting from nodes containing query keywords). Combining indexing with search BLINKS augments search with an index, which selectively precomputes and materializes some shortest-path information. Partitioning-based indexing A naive realization of an index that keeps all shortest-path information would be too large to store and too expensive to maintain for large graphs. BLINKS partitions a data graph into multiple subgraphs, or blocks. 2018/12/25
1.top-k search strategy(BLINKS[2]) A Single-Level Index Keyword-node list LKN (w) denotes the list of nodes that can reach keyword w, and these nodes are ordered by their distances to w Node-keyword map MNK (u, w) returns the shortest distance from u to w, or ∞ if u cannot reach any node that contains w. 2018/12/25
1.top-k search strategy(BLINKS[2]) Searching with a Single-Level Index 2018/12/25
1.top-k search strategy(BLINKS[2]) Bi-Level Indexing in BLINKS Node-based partitioning portal node in-portal or out-portal A top-level block index Keyword-block lists the list of blocks containing keyword w. Portal-block lists the list of blocks with p as an out-portal. An intra-block index for each block Intra-block keyword-node lists Intra-block node-keyword map Intra-block portal-node lists Intra-block node-portal distance map In-portal: it has at least one incoming edge from another block and at least one outgoing edge in this block. Out-portal: it has at least one outgoing edge to another block and at least one incoming edge from this block V5是b1的out portal,可以从别的block通过v5进入b1 2018/12/25
1.top-k search strategy(BLINKS[2]) Graph Partitioning BFS or METIS-based partitioning aims to minimize the total weight of edge separators convert edge separators into node-separators portals 2018/12/25
1.top-k search strategy(BLINKS[2]) Searching with the Bi-Level Index Backward Expansion with Queues of Cursors 【line 4】为每个keyword wi维护一个cursors的队列Qi,用keyword-block列表找到那些包含wi的blocks。 【line 5】每个cursors被用来扫描每个intra-block keyword-node list(for wi)。 【line 12】每当我们到达当前block的一个in-portal u,我们需要继续在所有以u作为out-portal的block中继续进行backward expansion 【line 13】For each such block b, we continue expansion from u using a new cursor, this time to go over the portal-node list in block b for out-portal u. 【line 11 & 14】因为同一个点u可能被包不同的block(含同一个keyword)访问到,所以用到一个bitmap(crossed)记录u是否已经被某个keyword经过了。 Implementing Optimal Backward Search Strategy 【line 7】cost-balanced expansion strategy across clusters, function pickKeyword,which selects the keyword with the least number of explored nodes. optimal equi-distance expansion strategy within clusters: The cursor with the highest priority in the queue is the one whose next distance is the smallest. Expanding Forward 【line 26 & 27】intra-block中node-keyword的距离并不是全局最优,但是我们可以判断其有效性,使得从点u直接前向扩展到wi。 查询块内node-portal distance map,若(u, out-portal)>(u, wi),我们可以判断说最短距离(u,wi)是落在这个块中的。 Pruning and Stopping 计算lower bound之和 sumLBDist(u)作为u到每个keyword之间的距离的lower bound之和。 (1)bound for search:每次选取距离最近的,所以u到wj的距离至少是这个块中下一个被扩展的点到wj的距离; (2)bound from index : min{块b中u到wj的距离,块b中u到最近的一个out-portal的距离} 【line 28 & 29】如果一个点的组合距离已经大于阈值,那这个点不可能是topk结果中的。 【line 16】结束条件:如果为每一个未访问点的组合距离都大于阈值,那么每一个非答案节点都可以被剪枝了。 2018/12/25
2. diversifying top-k results Problem definition Diversity top-k results Diversity Graph Top-k search frameworks Incremental top-k incremental-next() with non-increasing order of their scores stops after k results are generated Bounding top-k bounding-next() not necessarily with non-increasing order of their scores a bound unseen is defined to be the upper bound of the scores for the unseen results stops when the k-th largest score of all generated results is no smaller than the upper bound for the unseen results unseen 每个搜索结果看作是一个点,对应有相应的score打分,还有任意两个点之间计算sim值,当这个值大于阈值说明这两个结果是相似的。 Top-k 结果D(S)定义:个数小于等于k;不存在两个相似的结果;score之和是最大化。 Diversity Graph:每个结果看做一个点,结果之间相似的有边,G(S)中的点按照score大小非递增的顺序进行排序(降序) 则Top-k结果又可以定义为:个数小于等于k;是G(S)中的点独立集;score最大化。 这两种top-k搜索框架没有考虑相似结果的处理,第一种每次都按照降序取最大score的那个结果,第二种每次不一定按照降序取结果,设置了一个阈值,当第k大的结果score大于阈值unseen(upper bound),就停止。 2018/12/25
2. diversifying top-k results Diversified top-k search Sufficient Stop Condition Necessary Stop Condition 考虑多样性的新的框架,需要考虑三种方法,充分停止条件,必要停止条件,在当前结果集合中挑选diversity top-k的算法,其中最后一个算法涉及到了三种算法。 更新S集合的算法就是算法1的line3-6和算法2的line4-8。 充分停止条件:Di(S)是S中最好的包含i个结果的集合,当三个条件被满足的时候,说明div top-k结果已经得到了,停止算法。 必要停止条件:当必要条件没有被满足的时候,div-search-current没有必要进行。上一次生成的结果为空,或者, S一撇是上一次div-search-current产生的结果集合,max{}是其最大点独立集,当其数量小于k时,说明还可以最少还可以加入(k-max{})这么多个点使得score更大。 2018/12/25
2. diversifying top-k results Diversified top-k search Searching Current Set Div-astar searches the whole space S using the A∗ based heuristics by designing an upper bound function astar-bound() can hardly handle problems with large diversity graph G Div-dp decompose G into connected components searched independently using div-astar combine the components based on dynamic programming Div-cut decompose each connected component into subgraphs, where subgraphs are connected through a set of cut points 找到最大点独立集是个NP-hard的问题。 2018/12/25
2. diversifying top-k results Diversified top-k search Searching Current Set Div-astar 找到最大点独立集是个NP-hard的问题。一个堆heap来存储A*搜索中的entry,e=(solution,pos,score,bound),solution就是搜索出来的集合,按照bound大小进行排序,bound是solution估计的upper bound,pos是最后一次搜索的顶点在solution中的位置,score是solution的score。 【line 18-26】添加与solution中点不相邻的点,为partial solution估计upper bound。【line9-17】最大估计upper bound的solution被pop出来, 计算D.solution(i)的heap中solutions可以重新被用来计算D.solution(i-1). Consider the diversity graph shown in Fig. 1. Suppose k = 3. The process for the div-astar search is shown in Fig. 4. First, entry (∅, 0, ∞) is popped from the heap H, and 6 new entries are pushed. The result {v1} is with upper bound 19 because after selecting v1, nodes v3, v5 and v4 that are adjacent to v1 should be excluded when calculating the upper bound of {v1}. The only nodes left are v2 and v6 with scores 8 and 1 respectively. Thus the upper bound for {v1} is score({v1}) + 8 + 1 = 19. The one with the highest score is ({v3}, 7, 20), and thus it is popped from H in the next iteration and generates two new entries. Accordingly, ({v3, v4}, 14, 20) and ({v3, v4, v5}, 20, 20) are popped from H in order. At this moment, D.score1 = 10, D.score2 = 14, and D.score3 = 20. The stop condition is satisfied, and D.solution3 is the optimal diversified top-3 results. Consider now we compute D.solution2 since it is not the optimal solution currently. We do not need to reconstruct H from scratch. We update all upper bounds for entries that exists on the current H. In this example, entry ({v1}, 10, 19) is updated to be ({v1}, 10, 18) as shown in Fig. 5. Continue the iteration, ({v1}, 10, 18) and ({v1, v2}, 18, 18) are popped from H in order. At this moment, D.score2 is updated to be 18 and the stop condition is satisfied. Thus 18 is the best score for k = 2 2018/12/25
2. diversifying top-k results Diversified top-k search Searching Current Set Div-dp Div-astar不适用于大型的diversity graph,所以把graph分解为若干连通分值,然后连通分支单独进行astar,然后将结果进行整合。 动态规划的思想,加法:D.solution(i)是由两个结果集合中i个构成最大score的点集合;乘法:D.solution(i)是取两个结果集合中最好的那个集合中i个点。 2018/12/25
2. diversifying top-k results Diversified top-k search Searching Current Set Div-cut 当某个连通分支很大的时候,搜索整个连通分值非常耗时,同时可能很多包含松弛连通的子图,所以考虑割点。 (1)Cp-compress图压缩:删去一些点,不影响solution的产生。W1可以被删去,因为w1和w2相连,并且与w2相连的点也与w1相连。(2)构建cptree:figure10,计算割点,构建子图。(3)然后搜索cptree,分为包含割点或者不包含割点两种情况来考虑,例如计算G.exclude(O24) = G34 ⊕ G12 (line 19 in the for loop from line 7). 1. o12 is excluded. It is the case for j = 0 (line 9). In such a situation, we can compute G′ 12 = G1.excludeo12 ⊕ G2 (line 15). 2. o12 is included. It is the case for j = 1 (line 9). In such a situation, we can compute G′′ 12 = G1.includeo12 ⊕ (G2 - o12.adj(G)) (line 15), where G2 - o12.adj(G) is to remove the adjacent nodes of o12 from G2 (line 13). After computing 1 and 2, G12 can be computed as G12 = G′ 12 ⊗ G′′ 12 (line 16) 2018/12/25
2. diversifying top-k results Diversified top-k search Searching Current Set Cp-search Calculate G.exclude(O24) = G34 ⊕ G12 1. o12 is excluded. It is the case for j = 0 . In such a situation, we can compute G′ 12 = G1.excludeo12 ⊕ G2 . 2. o12 is included. It is the case for j = 1 . In such a situation, we can compute G′′ 12 = G1.includeo12 ⊕ (G2 - o12.adj(G)) , where G2 - o12.adj(G) is to remove the adjacent nodes of o12 from G2 . After computing 1 and 2, G12 can be computed as G12 = G′ 12 ⊗ G′′ 12. 当某个连通分支很大的时候,搜索整个连通分值非常耗时,同时可能很多包含松弛连通的子图,所以考虑割点。 (1)Cp-compress图压缩:删去一些点,不影响solution的产生。W1可以被删去,因为w1和w2相连,并且与w2相连的点也与w1相连。(2)构建cptree:figure10,计算割点,构建子图。(3)然后搜索cptree,分为包含割点或者不包含割点两种情况来考虑,例如计算G.exclude(O24) = G34 ⊕ G12 (line 19 in the for loop from line 7). 1. o12 is excluded. It is the case for j = 0 (line 9). In such a situation, we can compute G′ 12 = G1.excludeo12 ⊕ G2 (line 15). 2. o12 is included. It is the case for j = 1 (line 9). In such a situation, we can compute G′′ 12 = G1.includeo12 ⊕ (G2 - o12.adj(G)) (line 15), where G2 - o12.adj(G) is to remove the adjacent nodes of o12 from G2 (line 13). After computing 1 and 2, G12 can be computed as G12 = G′ 12 ⊗ G′′ 12 (line 16) 2018/12/25
References [1] Hulgeri A, Nakhe C. Keyword Searching and Browsing in Databases using BANKS[C]// International Conference on Data Engineering, 2002. Proceedings. IEEE, 2002:431-440. [2] He H, Wang H, Yang J, et al. BLINKS: ranked keyword searches on graphs[C]// ACM SIGMOD International Conference on Management of Data. ACM, 2007:305-316. [3] Lu Qin, Jeffrey Xu Yu, Lijun Chang. Diversifying Top-K Results[J]. Proceedings of the Vldb Endowment, 2012, 5(11):1124-1135. 2018/12/25