Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨 2018.09.21
Outline Motivation Algorithms Experiments Conclusion Review
Motivation The need to search on a large graph for semantic association between a group of entities Empty results -> disappoint user Query relax find maximum successful sub-query to minimize the loss. 在许多领域需要在很大的实体关联图上搜索一组查询实体间的复杂关联,搜索结果成为语义关联(semantic association),包含所有查询实体并且有大小限制。 现有算法可能由于查询实体间距离较远而返回空的结果,给用户造成了不好的体验。 于是,我们提出了查询松弛,为了减少损失,需要保留尽可能多的有关联的实体。
Motivation If diameter=3,the query result will be empty. Dist(Alice, Bob) = 4 > 3
Motivation Solution Result: {Alice, Dan} or {Bob, Dan} Increase the diameter constraint -> result in performance issues Remove some query entities (our solution) Result: {Alice, Dan} or {Bob, Dan}
Algorithms Baseline CertQR CertQR+ Heuristics
Algorithms —— Baseline Consider all possible sub-queries in non-increasing order of the number of query entities. For each sub-query, run existing algorithm for SA search and check whether it returns non-empty results. Low efficiency. Single run of SA search is exponential in the diameter constraint, and the number of sub-queries to be considered is exponential in the number of query entities. 考虑查询实体从多到少的所有组合,分别调用已有的搜关联的算法,直到搜到关联返回对应的子查询,否则返回空集。 效率低下。单次关联搜索的时间随着直径指数级增长,子查询的数量随着查询实体的数量指数级增长。 实验中先搜完路径,然后按照从多到少的组合判断是否有关联。
Algorithms —— CertQR Search for a certificate which can prove the existence of a SA rather than actually find such a SA. Certificate (c) : the distance between it and every query entity is not more than ceil(D/2) D=4, c=ICDE2019 dist(ICDE2019, Alice/Bob/Dan)<=ceil(D/2) 可以不用搜出实际的关联,而是找到一个certificate能够证明有关联存在。 提出一个条件:该certificate到每个查询实体的距离不超过ceil(D/2)。 D为偶数时,该条件已经充分。
Algorithms —— CertQR D=3, the condition is still satisfied but it violate diameter constraint. critical query entity exactly ceil(D/2) distance Another condition c has a neighbor c’, the distance between c’ and each critical query entity is ceil(D/2)-1 Q={Dan, Erin, Frank} c = ICDE2019 c’ = Paper02 D为奇数时,只满足刚才的条件可能会违背直径约束,需要再加一个条件。 可能造成违背直径约束的是那些到c刚好ceil(D/2)距离的实体,称为关键查询实体 c有一个邻居c’,它和那些到c有ceil(D/2)步的查询实体距离为ceil(D/2)-1。即保证了有一条边是公用的,也就不会违背直径约束了。 例如查询Q={Dan, Erin, Frank}
Algorithms —— CertQR If certificate exists, it is not more than ceil(D/2) hops away from each entity in result query. Search and check all the entities within the scope in a breadth-first manner, return the maximum successful sub-query. 如果certificate存在,那么它到结果中的每个实体的距离一定不超过ceil(D/2)。 穷举该范围内的所有点,返回最大的成功的子查询。
Algorithms —— CertQR+ Based on the idea of CertQR and improved. Best-first search Consider fewer entities, ensure that unvisited entities cannot be a certificate of any larger successful sub-query. Compute priority of entity, which estimates the number of query entities that is likely to obey the distance constrain through the certificate, is the upper bound of the result size. Use priority queue, if |Qmax|>=priority, return. 基于CertQR提出了改进的算法。 Best-first search,考虑更少的实体,保证没被访问过的实体不可能是更大的成功子查询的certificate。 计算搜索到的实体的优先级,优先级估计了可能能够遵循距离约束的查询实体数量,是该实体关联的查询实体数量的一个上界。 使用优先队列,如果当前实体的优先级<=当前最大的成功子查询的实体数量,就可直接返回。
Algorithms —— Heuristics dg: Degree-enhanced ds: Distance-enhanced dgs: Combined 启发式(为priority增加小数部分) 基于度数:大度的点在扩展邻居和计算priority时较耗时,因此优先考虑小度点。 基于距离:离可能的certificate近的点应该优先访问。 两者结合。
Experiments Datasets Queries (D=3, 4, 5, 6) DBpedia LinkedMDB Mondial Sparse query For each dataset, each number of query entities n in the range of 2-6, we construct 100 queries. Dense query DBpedia: n 2-6, 92 queries. LinkedMDB: n 2-6, 39 queries. Mondial: n 2-6, 57 queries.
Experiments 原查询找不到关联的比例,有必要进行松弛。
Experiments Baseline CertQR CertQR+求不同直径的平均时间,可以看出我们的方法有效。 三个启发式对于时间的提升。Combined效果总体最好
Conclusion Necessity of studying query relaxation for SA search Transform the problem into finding a certificate Depend on fast distance calculation, use distance oracle to achieve a trade-off between time and space. 证明了研究查询松弛的必要性。 将实际找到关联转化为找certificate的问题,并据此提出了有效的best-first算法,以及两个启发式。 但是算法依赖于快速的距离计算,使用distance oracle在时间和空间取得平衡。
Review Unfair baseline algorithm is set. The analysis of the experimental results is insufficient. Querying by entities directly is not convincing. Usually, users give keywords and matched entities need to be found. It is not user-friend to discard entities, better to relax diameter threshold. 1. Baseline设置不公平,BSL算法不仅返回了最大子查询,还计算了关联,CertQR和CertQR+没有考虑计算关联的时间。 2. 直接使用实体查询不令人信服,通常是给定关键词。 3. 实验结果分析不充分。 4. 移除实体对用户不友好,更好的方式是增加直径。
Q&A