Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo +

Slides:



Advertisements
Similar presentations
数据库研究方法和论文写作 陆嘉恒 中国人民大学.
Advertisements

Performance Evaluation
Chapter 12 圖形結構 12.1 圖形的一些專有名詞 12.2 圖形資料結構表示法 12.3 圖形追蹤 12.4 擴展樹
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
Minimum Spanning Trees
An Adaptive Cross-Layer Multi-Path Routing Protocol for Urban VANET
Rate and Distortion Optimization for Reversible Data Hiding Using Multiple Histogram Shifting Source: IEEE Transactions On Cybernetics, Vol. 47, No. 2,February.
Large-Scale Malware Indexing Using Function-Call Graphs
Image Retrieval Based on Fractal Signature
NLP Group, Dept. of CS&T, Tsinghua University
Manifold Learning Kai Yang
計算方法設計與分析 Design and Analysis of Algorithms 唐傳義
Chap5 Graph.
词汇语义资源在中文关系抽取中的应用 报告人:钱龙华 刘丹丹 胡亚楠 钱龙华 周国栋
Chapter 6 Graph Chang Chi-Chung
The Greedy Method.
圖論 (Graph Theory) B 電機四 大鳥 B 電機四 酋長 B 電機四 炫大
第4章 网络互联与广域网 4.1 网络互联概述 4.2 网络互联设备 4.3 广域网 4.4 ISDN 4.5 DDN
Graph 2 Michael Tsai 2012/5/1 連載: 學生上課睡覺姿勢大全
CCF-ADL 58 大媒体与大数据分析 北京·清华大学
Journal of High Speed Networks 15(2006)
Word-Entity Duet Representations for Document Ranking
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi
Chapter 7 圖形與網路 資料結構導論 - C語言實作.
Top-k search的实现技术 邵菲 MF /12/25.
Symbolic Execution During Test Data Generation and Augmentation Top Paper Review Zhiyi Zhang.
基于类关联规则的分类 Classification Based on Class-Association Rules
樹 2 Michael Tsai 2013/3/26.
Ch07 圖形與網路 淡江大學 周清江
Advisor : Prof. Frank Y.S. Lin Presented by Yen-Yi, Hsu
感謝同學們在加分題建議. 我會好好研讀+反省~
数据摘要现状调研报告 上下文摘要初步思考 徐丹云.
資料結構 Data Structures Fall 2006, 95學年第一學期 Instructor : 陳宗正.
Version Control System Based DSNs
沙勇忠 Sha Yongzhong 兰州大学图书馆 Library of Lanzhou University
2019/4/8 A Load Balancing Mechanism for multiple SDN Controllers based on Load Informing Strategy Miultiple controller 的 load balancing 機制,使用一個叫 Load informing.
Total Review of Data Structures
Unit 05 雲端分散式Hadoop實驗 -I M. S. Jian
中国科学技术大学计算机系 陈香兰 2013Fall 第七讲 存储器管理 中国科学技术大学计算机系 陈香兰 2013Fall.
虚 拟 仪 器 virtual instrument
复杂网络简介 LiuChang.
关联发现 工作内容阶段性总结 刘达欣
Speaker: Wang,Song-Ferng Advisor: Dr. Ho-Ting Wu 2015/7/6
Representation Learning of Knowledge Graphs with Hierarchical Types
從 ER 到 Logical Schema ──兼談Schema Integration
计算机问题求解 – 论题 算法方法 2016年11月28日.
系统科学与复杂网络初探 刘建国 上海理工大学管理学院
汪卫 王轶彤 老逸夫楼602-3 数据库新技术 汪卫 王轶彤 老逸夫楼602-3.
Disjoint Sets Michael Tsai 2013/05/14.
Course 8 回溯、分枝與限制 Backtracking, Branch-and-Bound
BiCuts: A fast packet classification algorithm using bit-level cutting
计算机问题求解 – 论题1-5 - 数据与数据结构 2018年10月16日.
鄧姚文 資料結構 第八章:圖(Graph) 鄧姚文
Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨
赵才荣 同济大学,电子与信息工程学院,智信馆410室
More About Auto-encoder
钱炘祺 一种面向实体浏览中属性融合的人机交互的设计与实现 Designing Human-Computer Interaction of Property Consolidation for Entity Browsing 钱炘祺
以碎形正交基底和時間情境圖為基礎進行之視訊檢索 Video retrieval based on fractal orthogonal bases and temporal graph 阿凡達 研究生:張敏倫 指導教授:蔣依吾博士 國立中山大學資訊工程學系.
Advanced Competitive Programming
Advanced Competitive Programming
Advanced Competitive Programming
Graph 1 Michael Tsai 2012/4/24 連載: 學生上課睡覺姿勢大全
Center for Deliberative Democracy, Stanford University
之前都是分类的蒸馏很简单。然后从分类到分割也是一样,下一篇是检测的蒸馏
A Trie-based Approach to Fast Flow Recognition for OpenFlow
JAVA 程式設計與資料結構 第二十一章 Graph.
Some discussions on Entity Identification
POWER-EFFICIENT RANGE-MATCH-BASED PACKET CLASSIFICATION ON FPGA
Presentation transcript:

Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo + Querying Web-Scale Information Networks Through Bounding Matching Scores [WWW’15] Jiahui Jing +*, Samamon Khemmarat *, Lixin Gao *, Junzhou Luo + +School of Computer Science and Engineering, Southeast University, China *Department of Electrical and Computer Engineering, University of Massachusets Amherst, USA 马萨诸塞 Qingxia Liu

Background Quering web-scale information networks containing billions of entities subgraph matching problem matching score evaluated by comparing the structural signatures incomplete and noizy identify both exact and similar answers (top-k answers) large scale not feasible to build graph indicies for traditional graph matching (index-free) prune by bounding the matching scores during the computation distributed implementation Contributions an algorithm can identify both exact and similarity matches an index-free algorithm to scale to billions of nodes implementation in a distributed environment evaluation using real-word datasets Graph Matching using Bound (GraB); 给定一个query graph,如何从一个较大规模的graph中找出其可能匹配的subgraph; 数据不完整,以及信息杂乱; 严格匹配,近似匹配;

Preliminaries nodes in query VQ data nodes VG specific node (known entity) qs∈VQS Embedding an injective function f: VQ->VG, where q and f(q) have same type, if q∈VQS, q and f(q) have same name, query node (unknown entity) q∈VQU 找出与电影阿凡达的导演有过合作,并且在电影“Body of Lies”中出演过的演员; A typed graph (V;E; T) is an undirected graph, node names are unique (can be extend);边上没有label; known entity: both the type and the name of a specic node are known. query node: only its type is known map every query node to a data node; exact match: iff for each edge (qi,qj) ∈EQ,there is an edge (f(qi),f(qj))∈EG. injective:单射(即不同的qnode映射到不同的datanode上); (A typed graph) Find an actor who collaborated with the director of the movie ‘Avatar’ and also performed in the movie ‘Body of Lies’

Similarity Measure closeness vector closeness vector of a match structural signature of a node’s neighborhood how close to each node in the graph closeness score function based on the shortest path distance more, shorter the path-> higher score closeness vector of a match the closeness vector of f(qi) in the matched subgraph node match cost difference of closeness vectors, R(qi) - R(qi, f) similarity measure (match cost) 为了度量一个embedding的质量,根据graph的结构特征来定义度量函数; 1、用一个向量来表示一个节点与图中各节点的邻近关系; closeness vector: 由与各点的closeness值构成的向量; 1.1、l_u,v : length of the shortest paths between u and v; n_u,v : num of ...; α: [0,1] constant; N<1/α,num限制; 当n>N, 邻近打分(closeness score)bounded to Nα^l; 2、相当于qi在data中映射到的点f(qi)与映射到的子图中各点之间的closeness构成的向量; positive-difference function:避免惩罚比query更亲近的情况;避免负分抵消; 3、qi在query中的closeness vector与f(qi)在matched subgraph中的closeness vector之间各项只差; 4、k embedding with smallest match cost Min f

Algorithm Framwork Phrase I Phrase II Main Ideal speed up by reducing the search space (of embeddings) while ensuring keeping the high quality embeddings top-k* matches -> top-k embeddings Phrase I build top-k* candidate matches for query nodes by selecting the data nodes with the lowest known match cost known match cost (of matching query node q to data node v) 𝐶 𝐾 𝑣,𝑞 = 𝑞 𝑠 ∈ 𝑉 𝑄 𝑠 𝛿 𝑞 𝑠 ,𝑞 (𝑓 𝑞 𝑠 ,𝑣) Phrase II identify the top-k embeddings with the lowest match costs 任务:找使得cost最小的embedding; matches: assignment of a node, e.g. matching q to v; embeddings: assignment of all nodes, i.e. the f(); k*∈[k,2k] Phrase I:对每个未知节点去part I最优的matching作为候选; Phrase II:将这些候选matching进行组合,从组合中选出cost最低的embedding; reduce:即第一步只选k*个,以减少第二步的search space;

The Algorithm -- GraB Graph Matching using Bound (GraB) index-free computes the upper/lower bounds of the closensss scores to derive top-k embeddings bound refinement by BFSes Phrase I: each anchor node as a source node Phrase II: each candidate match as a source node 计算data graph中任意两点之间的closeness(避免了对整个图所有节点进行计算?); anchor node: 已知点(specified node)在data graph上对应的节点; candidate match:第一步中选出的与未知点(query node)对应的data graph上的节点; N(v)为v的neighbor; 每个已知点的matching上都以其为起点进行一次BFS,更新其相邻节点的bound; 因此对于一个match,要进行多次BFS(从不同起点); (closeness score 是越大表示两点越相关) (起点之间有有限关系,根据其upper bound与lower bound间的gap,

The Algorithm -- GraB Top-k embedding identification k embeddings with the smallest cost upper bounds 𝑓 1 ,…, 𝑓 𝑘 ( 𝐶 + 𝑓 1 <…< 𝐶 + 𝑓 𝑘 ) check (compare to the lower bounds of unselected embeddings) 𝐶 + 𝑓 𝑘 < 𝐶 − 𝑓 , (𝑓∉{ 𝑓 1 ,…, 𝑓 𝑘 }) cost是越小越好(可以由前面的closeness score得到); check:确保这top-k是实际上的topk个,但是如果对每一个matching都计算会很慢,所以采用了branch-and-bound method(本文未提);

Evaluation 有效性; Q1:找出两位作者,其中一个发表过CIDR 2011,另一个发表过ISMB 2008; Q3:找出两部电影,其导演是“美丽心灵”的导演,而演员出演过“变相怪杰”;(schema不一致) Q1,Q2:DBLP; Q3:Freebase;

添加index之后(GraB-Index)执行速度最快,但相比之下无index的算法(GraB)也不算太慢;

Evaluation |Vs, Vu|, a=0.01, k=10 还有一些关于高效性的实验分析,这里就不细讲了

Approximate matching

Approximate Matching Tale [ICDE’08] [TODS’06] matching of important query nodes as anchor points NH-Index label, degree, nbConnection, nbArray [TODS’06] substructure similarity akin to string matching (n-gram) Tale 5 查询图的近似匹配:即,在missing容忍度上,对(query graph中)不同重要程度的节点有不同要求; 做法:从query graph中找到重要节点,进行匹配并将此作为锚点,扩展其余点的匹配; 节点重要性度量:可配置,此处采用节点度数; 其他内容:graph index的构建(便于matching时根据邻接节点的missing情况进行pruning); nbConnection: neighbor之间的连边个数; nbArray: 用位数组(bit array)存储neighbor 节点的label; [TODS’06]: query graph 中fa,fb,fc出现的次数依次为1,2,4; (从偏图论的角度)解决substructure similarity search问题:给定一个query graph,从一系列graph中找出与query graph结构最相似的graph; 方法:抽取query graph中一些子图结构作为feature,选取满足(embedding)这些结构的次数与query graph满足的情况最接近者;

match each component seperately, then combine Approximate Matching [PODS’02] tree and graph searching steps\methods pathfix (for tree) GraphGrep (for graph) 1.dataset building root-to-leaf path index all paths with length in [1, lp]; graph fingerprint: num of existance of each path 2. filtering parent-children pairs according to fingerprint 3. find matchings at most k paths missing compare label-paths and id-paths, then concatenate overlappings 4. wildcard match each component seperately, then combine Approximate Containment (AC) queries: query中可能有冗余节点(其删除后不影响解答) GraphGrep: path长度:[1,lp]; 以label path计算哈希值,记录该label path在各graph中出现的次数; 2.filtering:exact matching的情况下,则过滤掉所有存在任一path出现次数少于query中出现次数的graph; 3. decompose branches of a DFS tree of the query into sequences of overlapping label-paths with length in [1, lp], overlap:

Main problems Index performance distance measures for fast filtering (reduce the search space) performance distance measures to support inexact matching (for approximate query processing) queries with wildcards ? : FLDC (fixed length don’t care) * : VLDC (variable length don’t care) performance:高效的matching算法 wildcards:某些情况下相当于变量,而对于

References Jiahui Jin, Samamon Khemmarat, Lixin Gao, Junzhou Luo: Querying Web-Scale Information Networks Through Bounding Matching Scores. WWW 2015:527-537 Yuanyuan Tian, Jignesh M. Patel: TALE: A Tool for Approximate Large Graph Matching. ICDE 2008:963-972 Xifeng Yan, Feida Zhu, Philip S. Yu, Jiawei Han: Feature-based similarity search in graph structures. ACM Trans. Database Syst. (TODS) 31(4):1418-1453 (2006) Dennis Shasha, Jason Tsong-Li Wang, Rosalba Giugno: Algorithmics and Applications of Tree and Graph Searching. PODS 2002:39-52

Thank you ~