Jiahui Jing +, Samamon Khemmarat , Lixin Gao *, Junzhou Luo +

Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo +
Querying Web-Scale Information Networks Through Bounding Matching Scores [WWW’15] Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo + +School of Computer Science and Engineering, Southeast University, China *Department of Electrical and Computer Engineering, University of Massachusets Amherst, USA 马萨诸塞 Qingxia Liu

Background Quering web-scale information networks containing billions of entities subgraph matching problem matching score evaluated by comparing the structural signatures incomplete and noizy identify both exact and similar answers (top-k answers) large scale not feasible to build graph indicies for traditional graph matching (index-free) prune by bounding the matching scores during the computation distributed implementation Contributions an algorithm can identify both exact and similarity matches an index-free algorithm to scale to billions of nodes implementation in a distributed environment evaluation using real-word datasets Graph Matching using Bound (GraB)；给定一个query graph，如何从一个较大规模的graph中找出其可能匹配的subgraph；数据不完整，以及信息杂乱；严格匹配，近似匹配；

Preliminaries nodes in query VQ data nodes VG specific node
(known entity) qs∈VQS Embedding an injective function f: VQ->VG, where q and f(q) have same type, if q∈VQS, q and f(q) have same name, query node (unknown entity) q∈VQU 找出与电影阿凡达的导演有过合作，并且在电影“Body of Lies”中出演过的演员； A typed graph (V;E; T) is an undirected graph, node names are unique (can be extend);边上没有label； known entity: both the type and the name of a specic node are known. query node: only its type is known map every query node to a data node; exact match: iff for each edge (qi,qj) ∈EQ,there is an edge (f(qi),f(qj))∈EG. injective：单射（即不同的qnode映射到不同的datanode上）； (A typed graph) Find an actor who collaborated with the director of the movie ‘Avatar’ and also performed in the movie ‘Body of Lies’

Similarity Measure closeness vector closeness vector of a match
structural signature of a node’s neighborhood how close to each node in the graph closeness score function based on the shortest path distance more, shorter the path-> higher score closeness vector of a match the closeness vector of f(qi) in the matched subgraph node match cost difference of closeness vectors, R(qi) - R(qi, f) similarity measure (match cost) 为了度量一个embedding的质量，根据graph的结构特征来定义度量函数； 1、用一个向量来表示一个节点与图中各节点的邻近关系； closeness vector: 由与各点的closeness值构成的向量； 1.1、l_u,v : length of the shortest paths between u and v; n_u,v : num of ...; α: [0,1] constant; N<1/α,num限制; 当n>N, 邻近打分（closeness score)bounded to Nα^l; 2、相当于qi在data中映射到的点f(qi)与映射到的子图中各点之间的closeness构成的向量； positive-difference function:避免惩罚比query更亲近的情况；避免负分抵消； 3、qi在query中的closeness vector与f(qi)在matched subgraph中的closeness vector之间各项只差； 4、k embedding with smallest match cost Min f

Algorithm Framwork Phrase I Phrase II Main Ideal
speed up by reducing the search space (of embeddings) while ensuring keeping the high quality embeddings top-k* matches -> top-k embeddings Phrase I build top-k* candidate matches for query nodes by selecting the data nodes with the lowest known match cost known match cost (of matching query node q to data node v) 𝐶 𝐾 𝑣,𝑞 = 𝑞 𝑠 ∈ 𝑉 𝑄 𝑠 𝛿 𝑞 𝑠 ,𝑞 (𝑓 𝑞 𝑠 ,𝑣) Phrase II identify the top-k embeddings with the lowest match costs 任务：找使得cost最小的embedding； matches: assignment of a node, e.g. matching q to v; embeddings: assignment of all nodes, i.e. the f(); k*∈[k,2k] Phrase I：对每个未知节点去part I最优的matching作为候选； Phrase II：将这些候选matching进行组合，从组合中选出cost最低的embedding； reduce：即第一步只选k*个，以减少第二步的search space；

The Algorithm -- GraB Graph Matching using Bound (GraB) index-free
computes the upper/lower bounds of the closensss scores to derive top-k embeddings bound refinement by BFSes Phrase I: each anchor node as a source node Phrase II: each candidate match as a source node 计算data graph中任意两点之间的closeness（避免了对整个图所有节点进行计算？）； anchor node: 已知点（specified node）在data graph上对应的节点； candidate match：第一步中选出的与未知点（query node）对应的data graph上的节点； N(v)为v的neighbor；每个已知点的matching上都以其为起点进行一次BFS，更新其相邻节点的bound；因此对于一个match，要进行多次BFS（从不同起点）； (closeness score 是越大表示两点越相关）（起点之间有有限关系，根据其upper bound与lower bound间的gap，

The Algorithm -- GraB Top-k embedding identification
k embeddings with the smallest cost upper bounds 𝑓 1 ,…, 𝑓 𝑘 ( 𝐶 + 𝑓 1 <…< 𝐶 + 𝑓 𝑘 ) check (compare to the lower bounds of unselected embeddings) 𝐶 + 𝑓 𝑘 < 𝐶 − 𝑓 , (𝑓∉{ 𝑓 1 ,…, 𝑓 𝑘 }) cost是越小越好（可以由前面的closeness score得到）； check：确保这top-k是实际上的topk个，但是如果对每一个matching都计算会很慢，所以采用了branch-and-bound method（本文未提）；

Evaluation 有效性； Q1：找出两位作者，其中一个发表过CIDR 2011,另一个发表过ISMB 2008；
Q3：找出两部电影，其导演是“美丽心灵”的导演，而演员出演过“变相怪杰”；（schema不一致） Q1，Q2：DBLP； Q3：Freebase；

添加index之后（GraB-Index）执行速度最快，但相比之下无index的算法（GraB）也不算太慢；

Evaluation |Vs, Vu|, a=0.01, k=10 还有一些关于高效性的实验分析，这里就不细讲了

Approximate matching

Approximate Matching Tale [ICDE’08] [TODS’06]
matching of important query nodes as anchor points NH-Index label, degree, nbConnection, nbArray [TODS’06] substructure similarity akin to string matching (n-gram) Tale 5 查询图的近似匹配：即，在missing容忍度上，对（query graph中）不同重要程度的节点有不同要求；做法：从query graph中找到重要节点，进行匹配并将此作为锚点，扩展其余点的匹配；节点重要性度量：可配置，此处采用节点度数；其他内容：graph index的构建（便于matching时根据邻接节点的missing情况进行pruning）； nbConnection： neighbor之间的连边个数； nbArray: 用位数组（bit array）存储neighbor 节点的label； [TODS’06]: query graph 中fa,fb,fc出现的次数依次为1,2,4; (从偏图论的角度)解决substructure similarity search问题：给定一个query graph，从一系列graph中找出与query graph结构最相似的graph；方法：抽取query graph中一些子图结构作为feature，选取满足（embedding）这些结构的次数与query graph满足的情况最接近者；

match each component seperately, then combine
Approximate Matching [PODS’02] tree and graph searching steps\methods pathfix (for tree) GraphGrep (for graph) 1.dataset building root-to-leaf path index all paths with length in [1, lp]; graph fingerprint: num of existance of each path 2. filtering parent-children pairs according to fingerprint 3. find matchings at most k paths missing compare label-paths and id-paths, then concatenate overlappings 4. wildcard match each component seperately, then combine Approximate Containment (AC) queries: query中可能有冗余节点（其删除后不影响解答） GraphGrep: path长度：[1,lp]；以label path计算哈希值，记录该label path在各graph中出现的次数； 2.filtering：exact matching的情况下，则过滤掉所有存在任一path出现次数少于query中出现次数的graph； 3. decompose branches of a DFS tree of the query into sequences of overlapping label-paths with length in [1, lp], overlap:

Main problems Index performance distance measures
for fast filtering (reduce the search space) performance distance measures to support inexact matching (for approximate query processing) queries with wildcards ? : FLDC (fixed length don’t care) * : VLDC (variable length don’t care) performance：高效的matching算法 wildcards：某些情况下相当于变量，而对于

References Jiahui Jin, Samamon Khemmarat, Lixin Gao, Junzhou Luo: Querying Web-Scale Information Networks Through Bounding Matching Scores. WWW 2015: Yuanyuan Tian, Jignesh M. Patel: TALE: A Tool for Approximate Large Graph Matching. ICDE 2008: Xifeng Yan, Feida Zhu, Philip S. Yu, Jiawei Han: Feature-based similarity search in graph structures. ACM Trans. Database Syst. (TODS) 31(4): (2006) Dennis Shasha, Jason Tsong-Li Wang, Rosalba Giugno: Algorithmics and Applications of Tree and Graph Searching. PODS 2002:39-52

Thank you ~

Jiahui Jing +, Samamon Khemmarat , Lixin Gao *, Junzhou Luo +

Similar presentations

Presentation on theme: "Jiahui Jing +, Samamon Khemmarat , Lixin Gao *, Junzhou Luo +"— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo +

Similar presentations

Presentation on theme: "Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo +"— Presentation transcript:

Similar presentations

About project

反馈

Jiahui Jing +, Samamon Khemmarat , Lixin Gao *, Junzhou Luo +

Presentation on theme: "Jiahui Jing +, Samamon Khemmarat , Lixin Gao *, Junzhou Luo +"— Presentation transcript: