Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo + Querying Web-Scale Information Networks Through Bounding Matching Scores [WWW’15] Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo + +School of Computer Science and Engineering, Southeast University, China *Department of Electrical and Computer Engineering, University of Massachusets Amherst, USA 马萨诸塞 Qingxia Liu
Background Quering web-scale information networks containing billions of entities subgraph matching problem matching score evaluated by comparing the structural signatures incomplete and noizy identify both exact and similar answers (top-k answers) large scale not feasible to build graph indicies for traditional graph matching (index-free) prune by bounding the matching scores during the computation distributed implementation Contributions an algorithm can identify both exact and similarity matches an index-free algorithm to scale to billions of nodes implementation in a distributed environment evaluation using real-word datasets Graph Matching using Bound (GraB); 给定一个query graph,如何从一个较大规模的graph中找出其可能匹配的subgraph; 数据不完整,以及信息杂乱; 严格匹配,近似匹配;
Preliminaries nodes in query VQ data nodes VG specific node (known entity) qs∈VQS Embedding an injective function f: VQ->VG, where q and f(q) have same type, if q∈VQS, q and f(q) have same name, query node (unknown entity) q∈VQU 找出与电影阿凡达的导演有过合作,并且在电影“Body of Lies”中出演过的演员; A typed graph (V;E; T) is an undirected graph, node names are unique (can be extend);边上没有label; known entity: both the type and the name of a specic node are known. query node: only its type is known map every query node to a data node; exact match: iff for each edge (qi,qj) ∈EQ,there is an edge (f(qi),f(qj))∈EG. injective:单射(即不同的qnode映射到不同的datanode上); (A typed graph) Find an actor who collaborated with the director of the movie ‘Avatar’ and also performed in the movie ‘Body of Lies’
Similarity Measure closeness vector closeness vector of a match structural signature of a node’s neighborhood how close to each node in the graph closeness score function based on the shortest path distance more, shorter the path-> higher score closeness vector of a match the closeness vector of f(qi) in the matched subgraph node match cost difference of closeness vectors, R(qi) - R(qi, f) similarity measure (match cost) 为了度量一个embedding的质量,根据graph的结构特征来定义度量函数; 1、用一个向量来表示一个节点与图中各节点的邻近关系; closeness vector: 由与各点的closeness值构成的向量; 1.1、l_u,v : length of the shortest paths between u and v; n_u,v : num of ...; α: [0,1] constant; N<1/α,num限制; 当n>N, 邻近打分(closeness score)bounded to Nα^l; 2、相当于qi在data中映射到的点f(qi)与映射到的子图中各点之间的closeness构成的向量; positive-difference function:避免惩罚比query更亲近的情况;避免负分抵消; 3、qi在query中的closeness vector与f(qi)在matched subgraph中的closeness vector之间各项只差; 4、k embedding with smallest match cost Min f
Algorithm Framwork Phrase I Phrase II Main Ideal speed up by reducing the search space (of embeddings) while ensuring keeping the high quality embeddings top-k* matches -> top-k embeddings Phrase I build top-k* candidate matches for query nodes by selecting the data nodes with the lowest known match cost known match cost (of matching query node q to data node v) 𝐶 𝐾 𝑣,𝑞 = 𝑞 𝑠 ∈ 𝑉 𝑄 𝑠 𝛿 𝑞 𝑠 ,𝑞 (𝑓 𝑞 𝑠 ,𝑣) Phrase II identify the top-k embeddings with the lowest match costs 任务:找使得cost最小的embedding; matches: assignment of a node, e.g. matching q to v; embeddings: assignment of all nodes, i.e. the f(); k*∈[k,2k] Phrase I:对每个未知节点去part I最优的matching作为候选; Phrase II:将这些候选matching进行组合,从组合中选出cost最低的embedding; reduce:即第一步只选k*个,以减少第二步的search space;
The Algorithm -- GraB Graph Matching using Bound (GraB) index-free computes the upper/lower bounds of the closensss scores to derive top-k embeddings bound refinement by BFSes Phrase I: each anchor node as a source node Phrase II: each candidate match as a source node 计算data graph中任意两点之间的closeness(避免了对整个图所有节点进行计算?); anchor node: 已知点(specified node)在data graph上对应的节点; candidate match:第一步中选出的与未知点(query node)对应的data graph上的节点; N(v)为v的neighbor; 每个已知点的matching上都以其为起点进行一次BFS,更新其相邻节点的bound; 因此对于一个match,要进行多次BFS(从不同起点); (closeness score 是越大表示两点越相关) (起点之间有有限关系,根据其upper bound与lower bound间的gap,
The Algorithm -- GraB Top-k embedding identification k embeddings with the smallest cost upper bounds 𝑓 1 ,…, 𝑓 𝑘 ( 𝐶 + 𝑓 1 <…< 𝐶 + 𝑓 𝑘 ) check (compare to the lower bounds of unselected embeddings) 𝐶 + 𝑓 𝑘 < 𝐶 − 𝑓 , (𝑓∉{ 𝑓 1 ,…, 𝑓 𝑘 }) cost是越小越好(可以由前面的closeness score得到); check:确保这top-k是实际上的topk个,但是如果对每一个matching都计算会很慢,所以采用了branch-and-bound method(本文未提);
Evaluation 有效性; Q1:找出两位作者,其中一个发表过CIDR 2011,另一个发表过ISMB 2008; Q3:找出两部电影,其导演是“美丽心灵”的导演,而演员出演过“变相怪杰”;(schema不一致) Q1,Q2:DBLP; Q3:Freebase;
添加index之后(GraB-Index)执行速度最快,但相比之下无index的算法(GraB)也不算太慢;
Evaluation |Vs, Vu|, a=0.01, k=10 还有一些关于高效性的实验分析,这里就不细讲了
Approximate matching
Approximate Matching Tale [ICDE’08] [TODS’06] matching of important query nodes as anchor points NH-Index label, degree, nbConnection, nbArray [TODS’06] substructure similarity akin to string matching (n-gram) Tale 5 查询图的近似匹配:即,在missing容忍度上,对(query graph中)不同重要程度的节点有不同要求; 做法:从query graph中找到重要节点,进行匹配并将此作为锚点,扩展其余点的匹配; 节点重要性度量:可配置,此处采用节点度数; 其他内容:graph index的构建(便于matching时根据邻接节点的missing情况进行pruning); nbConnection: neighbor之间的连边个数; nbArray: 用位数组(bit array)存储neighbor 节点的label; [TODS’06]: query graph 中fa,fb,fc出现的次数依次为1,2,4; (从偏图论的角度)解决substructure similarity search问题:给定一个query graph,从一系列graph中找出与query graph结构最相似的graph; 方法:抽取query graph中一些子图结构作为feature,选取满足(embedding)这些结构的次数与query graph满足的情况最接近者;
match each component seperately, then combine Approximate Matching [PODS’02] tree and graph searching steps\methods pathfix (for tree) GraphGrep (for graph) 1.dataset building root-to-leaf path index all paths with length in [1, lp]; graph fingerprint: num of existance of each path 2. filtering parent-children pairs according to fingerprint 3. find matchings at most k paths missing compare label-paths and id-paths, then concatenate overlappings 4. wildcard match each component seperately, then combine Approximate Containment (AC) queries: query中可能有冗余节点(其删除后不影响解答) GraphGrep: path长度:[1,lp]; 以label path计算哈希值,记录该label path在各graph中出现的次数; 2.filtering:exact matching的情况下,则过滤掉所有存在任一path出现次数少于query中出现次数的graph; 3. decompose branches of a DFS tree of the query into sequences of overlapping label-paths with length in [1, lp], overlap:
Main problems Index performance distance measures for fast filtering (reduce the search space) performance distance measures to support inexact matching (for approximate query processing) queries with wildcards ? : FLDC (fixed length don’t care) * : VLDC (variable length don’t care) performance:高效的matching算法 wildcards:某些情况下相当于变量,而对于
References Jiahui Jin, Samamon Khemmarat, Lixin Gao, Junzhou Luo: Querying Web-Scale Information Networks Through Bounding Matching Scores. WWW 2015:527-537 Yuanyuan Tian, Jignesh M. Patel: TALE: A Tool for Approximate Large Graph Matching. ICDE 2008:963-972 Xifeng Yan, Feida Zhu, Philip S. Yu, Jiawei Han: Feature-based similarity search in graph structures. ACM Trans. Database Syst. (TODS) 31(4):1418-1453 (2006) Dennis Shasha, Jason Tsong-Li Wang, Rosalba Giugno: Algorithmics and Applications of Tree and Graph Searching. PODS 2002:39-52
Thank you ~