Web Graph & Link Analysis

Slides:



Advertisements
Similar presentations
Chapter 2 Combinatorial Analysis 主講人 : 虞台文. Content Basic Procedure for Probability Calculation Counting – Ordered Samples with Replacement – Ordered.
Advertisements

《互联网运营管理》系列课程 觉浅网 荣誉出品
2014 年上学期 湖南长郡卫星远程学校 制作 13 Getting news from the Internet.
复杂网络节点重要性评估及其应用研究 答 辩 人: 张翼 指导老师: 刘玉华 教授.
宏 观 经 济 学 N.Gregory Mankiw 上海杉达学院.
应如何将神的话语大声读出来会众才能真正的听见!
统计物理学与复杂系统 陈晓松 中国科学院理论物理研究所 兰州大学,2013年8月.
自衛消防編組任務職責 講 義 This template can be used as a starter file for presenting training materials in a group setting. Sections Right-click on a slide to add.
人工智能 Artificial Intelligence 第十一章
資料庫設計 Database Design.
分析抗焦慮劑/安眠劑之使用的影響因子在重度憂鬱症及廣泛性焦慮症病人和一般大眾的處方形態
复杂网络数学建模概述 南京航空航天大学应用物理系 朱陈平.
Chaoping Li, Zhejiang University
SHARE with YOU Why am I here? (堅持……) What did I do?
59 中 张丽娟 学习目标: 1. 识记并理解运用 6 个单词和 5 个短语。 (source, accessible, network, access, via, create come up with, from the moment on, consist of, go down , at the.
Figure Interpreting. Introduction In recording an English figure, its three digits make one subsection, while in Chinese, its four digits make one subsection.
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
Euler’s method of construction of the Exponential function
Reading Do you remember what you were doing? 学习目标 1、了解几个重要历史事件。
Thinking of Instrumentation Survivability Under Severe Accident
Population proportion and sample proportion
International Conference ITIE2010: Inspiration from Best Practices
模式识别 Pattern Recognition
Good Afternoon. 威世智中国员工及 组织化赛事介绍 WotC China Staff and Organized Play Instruction Shane Xu
Manifold Learning Kai Yang
Chapter 6 Graph Chang Chi-Chung
Journal Citation Reports® 期刊引文分析報告的使用和檢索
WEB挖掘算法介绍.
HOW TO ACE -- THE IELTS SPEAKING TEST
顏色yán sè COLORS 紅色 藍色 綠色 黃色 紫色 白色 黑色 咖啡色 bái sè hēi sè hóng sè lǜ sè
Sampling Theory and Some Important Sampling Distributions
第4章 网络互联与广域网 4.1 网络互联概述 4.2 网络互联设备 4.3 广域网 4.4 ISDN 4.5 DDN
Department of Computer Science & Information Engineering
Journal of High Speed Networks 15(2006)
YBN搜索用户数据 Microsoft Confidential.
Interval Estimation區間估計
药物和疾病啥关系 ? 李智恒.
Towards Emotional Awareness in Software Development Teams
如何增加对欧贸易出口 中国制造展销中心(英国)有限公司 首席执行官 理查德·赛斯
Chp.4 The Discount Factor
IBM SWG Overall Introduction
Version Control System Based DSNs
沙勇忠 Sha Yongzhong 兰州大学图书馆 Library of Lanzhou University
Mechanics Exercise Class Ⅰ
Guide to a successful PowerPoint design – simple is best
Chp.4 The Discount Factor
BORROWING SUBTRACTION WITHIN 20
虚 拟 仪 器 virtual instrument
复杂网络简介 LiuChang.
A Data Mining Algorithm for Generalized Web Prefetching
浅谈高中英语阅读教学中的问题设计 浙江省临安中学 方利春.
The Bernoulli Distribution
Chp.4 The Discount Factor
系统科学与复杂网络初探 刘建国 上海理工大学管理学院
Distance Vector vs Link State
BiCuts: A fast packet classification algorithm using bit-level cutting
Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨
计算机问题求解 – 论题 图的连通度 2018年11月13日.
(二)盲信号分离.
高效洁净机械制造实验室是 2009 年教育部批准立项建设的重点实验室。实验室秉承“突出特色、创新发展“的宗旨,以求真务实的态度认真做好各项工作。 实验室主任为黄传真教授,实验室副主任为刘战强教授和李方义教授。学术委员会主任为中国工程院院士卢秉恒教授。实验室固定人员中,有中国工程院院士艾兴教授,教育部.
动词不定式(6).
Distance Vector vs Link State Routing Protocols
Arguments to the main Function and Final Project
Class imbalance in Classification
簡單迴歸分析與相關分析 莊文忠 副教授 世新大學行政管理學系 計量分析一(莊文忠副教授) 2019/8/3.
Principle and application of optical information technology
Experimental Analysis of Distributed Graph Systems
Gaussian Process Ruohua Shi Meeting
Section 1 Basic concepts of web page
Presentation transcript:

Web Graph & Link Analysis http://net.pku.edu.cn/~wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 9/27/2010

上一讲主要内容 Crawler面临的难题 实现高效率的基本技术 有趣的技术 Scalable, fast, polite, robust, continuous 实现高效率的基本技术 Cache Prefetch Concurrency 多进程/多线程 异步I/O 有趣的技术 Bloom filter Consistent Hashing One Web , One Dream

Internet 电信网络 Complex Network Example: WWW ----- (K. C. Claffy) 有向网络, 结点:web页面,边:超链 http://dig.csail.mit.edu/breadcrumbs/node/215 In a nutshell, Berners-Lee breaks down the evolution of computer networks into three stages: 1. The Net: connects computers 2. The Web: connects documents 3. The Graph: connects people and documents (thus forming relationships)

Web Graph http://www.touchgraph.com/TGGoogleBrowser.html The TouchGraph Google Browser reveals the network of connectivity between websites, as reported by Google's database of related sites. http://www.touchgraph.com/TGGoogleBrowser.html

Giant Global Graph 社会网络(social network) 任何一种用于建立个体之间联系的自然现象、社会活动或技术机制都可能形成一张网 “朋友关系”(对称,无向图) “知晓关系”(不对称,有向图) “文献引用关系”(不对称,有向图) co-author关系(对称,无向图,成块“clique”) 通电话,通信 病毒传染(生物、计算机) 网页链接关系(不对称,有向图) 还可以考虑不同的“尺度”:网站之间,城市之间,省份之间,国家之间,…

图论、线性代数若干概念回顾 图,有向图,邻接矩阵,节点的度(degree),两节点间的距离(d),图的直径(r),图的连通性,有向图的强连通分支,连通分支 d(u,v):从u到v的最短路径的长度 r(G):最大的距离 矩阵(A),矩阵的转置(AT),行列式(|A|),线性相关性,矩阵的秩,特征值,特征向量,特征子空间

Homework 求以下矩阵的特征值和特征向量 2 3 -1 -4 0 -1 -2 1 A = 0 1 2 -2 0 1 1 2

本次课大纲 Web 有多大? Web的连通性如何? Web上节点的分布如何? Web上节点距离有多远? Web上节点重要度如何度量?

Web 有多大?

Web的大小—网页总数 图大小不可知,也无法定义 估计Web图节点数的下界 搜索引擎索引的网页数(crawled pages) 例如CNNIC中国互联网网页调查报告 能更逼近真实值吗? http://www.cnnic.net.cn/index/0E/00/11/index.htm

Capture-Recapture Model Unknown number of fish in a lake Catch a sample and mark them Let them loose Recapture a sample and look for marks Estimate population size     n1 = number in first sample 15     n2 = number in second sample 10     n12 = number in both samples 5     N = total population size assume that     n1/N = n12/n2    therefore    15/N = 5/10      N = (10 x 15) / 5 = 30 back-of-the-envelope指的是一个很粗略的计算,不精确,但可以被用作对某个观点的支持或论据。简单的说,是介于猜测和铁证之间的一个概念。常用在数学,物理,及某些工程领域中 www.ana.gov.ro/eng/doc6.htm

Web的大小 Estimate indexed web low-bound by analysis overlap of Search Engine (Steve Lawrence and C. Lee Giles,1998*) 选择6个流行的 search engine, 假设它们索引页面之间的 independency Sampling: 通过575个查询对这些SE采样,分析它们之间的overlap 用overlap来估计各个SE所覆盖的 indexable Web的大小 利用已知某个SE的页面数,来估计整个Web的大小 Searching the World Wide Web We manually checked that all results were retrieved from each engine and were parsed correctly 查询需要满足的条件: A fixed maximum of 600 documents per query was used (from all engines combined after the removal of duplicates); queries returning more documents were not included (12).

Overlap analysis P(a) =na/N = n0/nb N = na*nb/n0 During the test time, HotBot report indexed 110 miliion page lower bound on the size of the indexable Web of 320 million pages. (1998)

Web的连通性如何?

Graph structure in the web A large scale study (Altavista crawls) reveals interesting properties of web (Andrei Broder ,1999) Study of 200 million nodes & 1.5 billion links Some parts unreachable, Others have long paths found Bow-tie Structure Graph structure in the web www9 paper

Bow-tie Components Strongly Connected Component (SCC) Upstream (IN) Core Upstream (IN) Core can’t reach IN Downstream (OUT) OUT can’t reach core Disconnected Tendrils & Tubes tendril [5tendril] n. [植]卷须, 蔓, 卷须状之物 Tendrils not connected to SCC But reachable from IN and can reach OUT Tubes: directed paths IN->Tendrils->OUT

Component Properties Each component is roughly same size 问题: ~50 million nodes Probability of a path between any 2 nodes ~1 quarter (0.24) Diameter, maximal minimal path length(?) Maximal and average diameter is infinite 28 for SCC, 500 for entire graph Average length 16 (directed path exists), 7 (undirected) Shortest directed path between 2 nodes in SCC: 16-20 links on average 问题: 在这样一个巨大的图(200M nodes, 1.5G edges) 上,Diameter怎么计算出来的? 直接根据定义,用双向遍历取交集的方法求强连通分量,时间复杂度为O(N^2+M)。更好的方法是Kosaraju算法或Tarjan算法,两者的时间复杂度都是O(N+M) http://blog.sina.com.cn/s/blog_5e2497580100fr3z.html###

Web上节点的分布如何?

站点入度分布 会是下面哪一种情况?

80/20 原则 Pareto principle : for many phenomena, 80% of the consequences stem from 20% of the causes. (Vilfredo Pareto) scale free scale free 80% of that top 80% of effects coming from 20% of that top 20% of causes, and so on (80% of 80% is 64%; 20% of 20% is 4%, so this implies a "64-4 law" 80-20 rule 80% of your sales comes from 20% of your clients the richest 20% of the world's population controlling 82.7% of the world's income one might guess approximately that we wear our 20% most favoured clothes about 80% of the time, perhaps we spend 80% of the time with 20% of our acquaintances, Microsoft also noted that by fixing the top 20% of the most reported bugs, 80% of the users would not encounter any bugs The Long tail describe the niche strategy of businesses, such as Amazon.com or Netflix, that sell a large number of unique items, each in relatively small quantities. a significant portion of Amazon.com's sales come from obscure books that are not available in brick-and-mortar stores. The Long Tail is a potential market and, as the examples illustrate, the distribution and sales channel opportunities created by the Internet often enable businesses to tap that market successfully. Long tail

Power law http://www.biomedcentral.com/1471-2148/7/S1/S16/figure/F1

Power law P(x=k)=CK-λ A line appears on a log-log plot Zipf’s law is related to power law, Refer to the appendix P(x=k)=CK-λ A line appears on a log-log plot rare events are not so rare! Long tail

Power Law Size and Connectivity 站点大小Site Sizes(以页面数量计算)服从 power law 分布 跨越不同的规模 g 在1.6-1.9之间 节点的度connections per node 服从power law 分布 Study at Notre Dame University reported g = 2.45 for outdegree distribution g = 2.1 for indegree distribution Study at Notre Dame University reported g = 2.45 for outdegree distribution g = 2.1 for indegree distribution

Power Law Distribution -Examples From Graph structure in the web, (by altavista crawl,1999)

Random Graph .vs. Power Law Graph Random graphs have Poisson distribution if p is small. Random uniform graph with random independent edges of fixed probability p P(x=k)= e-λ * λk/k! Decays exponentially fast to 0 as k increases towards its maximum value n-1 Power law graphs Decays polynomially for large values Power law graph  emerging order in a large graph created by many agents

Examples with Power Law Networks Examples of networks with Power Law Distribution Internet at the router and interdomain level Citation network Collaboration network of actors Networks associated with metabolic pathways Networks formed by interacting genes and proteins Network of nervous system connection in C. elegans

Web上节点距离有多远?

Huge graph with small distance What does this mean? Size: 200M nodes, 1.5G edges Average length: 16 (directed path exists), 7 (undirected) Huge graph with small distance It’s a small world

小世界网络 It is a ‘small world’ Mathematically Millions of people. Yet, separated by “six degrees” of acquaintance relationships Popularized by Milgram’s famous experiment Mathematically Diameter of graph is small (log N) as compared to overall size Property seems interesting given ‘sparse’ nature of graph but … This property is ‘natural’ in ‘pure’ random graphs 20世纪60年代,耶鲁大学的社会心理学家米尔格兰姆(Stanley Milgram)就设计了一个连锁信件实验。他将一套连锁信件随机发送给居住在内布拉斯加州奥马哈的160个人,信中放了一个波士顿股票经纪人的名字,信中要求每个收信人将这套信寄给自己认为是比较接近那个股票经纪人的朋友。朋友收信后照此办理。最终,大部分信在经过五、六个步骤后都抵达了该股票经纪人。六度空间的概念由此而来。 然而在这个实验中,实际上只有三分之一的信送到了收信人哪里,因此实验的完成率很低。 http://current.com/items/89164026_proof_just_six_degrees_of_separation_between_us

The small world of WWW Empirical study of Web-graph reveals small-world property Graph generated using power-law model Diameter properties inferred from sampling Calculation of max. diameter computationally demanding for large values of n Average distance (d) in simulated web: d = 0.35 + 2.06 log (n) e.g. n = 109, d ~= 19

Implications for Web Logarithmic scaling of diameter makes future growth of web manageable 10-fold increase of web pages results in only 2 more additional ‘clicks’, but … Users may not take shortest path, may use bookmarks or just get distracted on the way,Therefore search engines play a crucial role

Robustness and vulnerability How diameter or connectivity affected by deleting nodes randomly? Scale-free graph are more robust than random uniform graph Specific nodes are targeted? Diameter doubling when 5% important nodes removed Topology changes under attack? Fragment and break down Phrase change for deletion ratio : 0.28 in exponential graph & 0.18 in a scale-free network

Web上节点重要度如何度量? MIW Ch.5.1-5.4

对网页重要性的评价 PageRank算法,HITS(Hyperlink Induced Topic Search)算法 都是为了利用HTML网页的链接特点,改善查询的效果 当Spam页面淹没了search engine的搜索结果页面时,除了页面内容与查询的相关性以外,页面本身的质量/重要性的作用就显现出来 Larry Page & Sergey Brin 都是1997年左右完成的研究工作,PageRank促成了Google,HITS依然有学术上的意义。 Jon Kleinberg

重要度的度量 一阶指标(“入度”) “高阶指标” Paul Erdös 刘翔 知晓关系:社会知名度 引用关系:认可程度 和一个著名人物“共同发表”论文的“距离”:越短似乎显得越“有荣誉”(例如,Erdos number,) 仅仅是“结构”就可以带来丰富的“语义” 例如省份之间的链接数差别可能有有意义的解释 文献计量学(bibliometry) 研究文献的贡献程度 哪些文章是“有影响的”文章? 研究文献的聚类,从而可能得到一个领域发展的状况 co-citation分析,如果a引用了b和c,称b和c有co-citation关系 流行传染病学,侦察、谍报学 发现那些关键节点,删除它们使得其他节点之间的距离显著扩大 模型、指标体系的“合适性”取决于应用目标

PageRank Why and how it works?

谁重要一些? 认识甲的人可能和认识乙的人一样多,但认识乙的人都是些“重要人物”,于是通常会认为乙比甲重要 如何用一个模型来刻画这种感觉,使算出来的“重要性”反映这种感觉? 例子:按照入度, 节点1,3同样重要; 2,4同样重要。但 我们似乎感到3比1 重要些,2比4重要些。 认识甲的人可能和认识乙的人一样多,但认识乙的人都是些“重要人物”,于是通常会认为乙比甲重要 不仅是人,论文也是一样,被重要的文章引用的文章可能就比较重要些

声望模型Reputation Model 给定一个群体S,及其在上面的一个“知晓”关系R,于是定义了一个有向“关系图”G。用邻接矩阵E表示,E(i,j)=1,当且仅当i “听说过” j(注意这里没有程度之分)。我们希望确定p(i):所有个体i∈S的“声望” 模型一:p(i) = ∑E[k,i],k=1,…,n,即i在G上的“入度”,亦即E的第i列的1的个数 清楚、好计算;但是“不够好” 模型二:p(i) = ∑E[k,i]p(k),k=1,…,n,即i的声望等于知晓他的人的声望之和 清楚、显得要更“精确些”;但是,好计算吗?

p = ETp 声望模型二 一般来讲:这个方程的非0解是不存在的! 对于所有i,p(i) = ∑E[k,i]p(k),k=1,…,n 也就是,记p = (p(1), p(2), …, p(n))T, p = ETp 问题是: 这个方程存在解吗? 如果存在,如何得到? 如果不存在,该怎么办? 一般来讲:这个方程的非0解是不存在的!

一般来讲,p = ETp,意味着要求ET有特征值1,这是很难得的。 S = {1,2,3}, R = {<1,2>,<1,3>,<2,3>} E = ((0,1,1),(0,0,1),(0,0,0)) ET = ((0,0,0),(1,0,0),(1,1,0)) 不难看到,方程的成立p(1)=0p(2)=0p(3)=0 2 3 一般来讲,p = ETp,意味着要求ET有特征值1,这是很难得的。

先前那4个点的例子也无解 p = ETp  (I - ET)p = 0 线性代数讲,此方程组有非0解,仅当行列式|I - ET| = 0

即使有解,还有可能不唯一! S = {1,2,3}, R = {<1,2>,<2,3>,<3,1>} 不难看出任何 p(1) = p(2) = p(3) 都是解 怎么办?

PageRank 对每一篇网页,得到一个独立于查询词的相对“重要性”指标,将这个指标和查询匹配情况结合起来(以及其他因素),形成网页的排序。 基本思想是前面讨论的“声望”模型

“Random Walker”模型 设想有一个永不休止、在网上浏览网页的人,随机选择一个链出的链接继续访问。我们问,在稳态情况下(足够长时间后),他会正在看哪一篇网页呢? 等价于:稳态情况下,每个网页v会有一个被访问的概率,p(v),它可以作为网页的重要程度的度量。 我们可以合理地设想:此时到达v的概率,依赖于上一个时刻到达“链向”v的网页的概率,以及那些网页中超链的个数。 随机   select uniformly

Random walker model p(v) = ∑E[u,v]*p(u)/du, over u ∑p(u) = 1 u3 u1 V 稳定时: p(v) = ∑E[u,v]*p(u)/du, over u 这里,du是网页u的“出度”,∑E[u,v] over u。 ∑p(u) = 1

Random Walker Model (continue) 改写一下,成 形式上和“声望”模型一样,只是矩阵L有行向量元素和为1的性质。 有用吗? Dangling Node(出度为0的节点) 对于这些节点,矩阵L对应着元素全0的行,元素和不为1 修正:L[u,v] = 1/N if du=0

Stochastic matrix 于是1也就是LT最大的特征值! 矩阵M,元素非负,每个行向量元素之和分别都等于1(亦称马尔科夫转移矩阵) 显然,随机矩阵的最大特征值为1,对应有一个全1元素的特征向量 转置矩阵的行列式和原矩阵的行列式相等 Refer to MIW A.4 Markov Chains 于是1也就是LT最大的特征值!

还有一点问题 上述“随机浏览”模型有稳态解的条件是:由网页形成的有向图允许通过链接关系访问到每一个网页 但有两个情况是破坏这条件的 图中形成“圈”(rank bounce) 有入度或者出度为0的点(rank sink) 因此该模型的表述通常要求所形成的图是irreducible(强连通)和aperiodic(不能有进去后出不来的圈)。

让这浏览者每次以一定的概率(1-β)沿着超链走,以概率(β)重新随机选择一个新的起始节点 继续修改模型 让这浏览者每次以一定的概率(1-β)沿着超链走,以概率(β)重新随机选择一个新的起始节点 这在物理意义上即总是有可能跳进入度为0的点,跳出那些“圈”。在模型表达上即为 The Anatomy of a Large-Scale Hypertextual Web Search Engine (www7) Larry Page & Sergey Brin β选在0.1和0.2之间,被称作damping factor(Page & Brin 1997) G=(1-β)LT+ β/N(1N) 被称为Google Matrix

Google Matrix特征向量求解 Power Iteration方法: 给定Google Matrix G,记|λ1| ≥|λ2| ≥…,q1是属于λ1的特征向量 初始化向量p0,使得||p0||1=1 对于k = 1, 2, …,执行如下步骤 x = Gpk-1, 基本迭代 pk = x/||x||1, 规格化步骤 可以证明(收敛速度) |pk – q1| = O(|λ2/λ1|k) Power iteration 用来找到模最大的特征值。 (我们注意到头两个特征值的差别直接影响收敛速度,越大越快)

问题 GoogleMatrix的Power Iteration求解特征向量算法 一定会经过有限步迭代终止吗? 一定会得到唯一的解吗? 不管初始值P0如何,都会收敛到相同解吗? 是否需要很多次迭代才能收敛呢? Amy Langville, Carl Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

例子(power iteration)

小规模数据求解 β取0.15 G= 0.85*LT+0.15/11(1N) P0=(1/11,1/11,….)T P1=GP0 ... 。。。。。。。 Power Iteration求解得(迭代50次) P=(0.033,0.384,0.343,0.039,0.081, 0.039,0.016……)T You can try this in MatLab

State-of-art PageRank Technology

Topic-Sensitive PageRank[1] ODP-Biasing Step 1 – ODP-Biasing Generate set of biased PageRank vectors using a set of basis topics Create 16 different biased PageRank vectors Performed offline, during preprocessing of crawled data URLs in various categories in Open Directory Project (ODP)

Results Precision @ 10 results for test queries Ranking preferred by majority of users

Pagerank for product image search[2]

Results Similarity graph generated from the top1000 search results of “Mona-Lisa.” The largest two images contain the highest rank.

Results

References [1] H. H. Taher, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 784-796, 2003. [2] Y. Jing and S. Baluja, "Pagerank for product image search," in Proceeding of the 17th international conference on World Wide Web, Beijing, China, 2008, pp. 307-316.

本次课小结 Web Graph的性质 Graph Link Analysis Capture-recapture Bow-tie structure Power law Small world Graph Link Analysis 声望模型 Random Walker模型 PageRank算法 HITS算法

第一次编程作业 时间:10月25日截止 内容:crawler和graph link analysis 1。heritrix系统使用 要求:按Week2的web crawler系统结构,寻找Heritrix系统里面的crawler的两个主要模块: isUrlVisited,politeness 考察它们的主要实现技术,写一个简短的报告文档。

3。搜集web数据的graph link analysis 要求:回答以下问题,并给出方法的说明 这个网站有多少网页? 入度、出度分布情况如何? top 10的最重要页面是哪些? 以小组为单位,大家需要合作完成, 小组blog里,最少每周写一次进展报告 最后提交的报告里,写上每个同学的分工

Thank You! Q&A

阅读材料

Steve Lawrence Dr. Lawrence received a B.Sc. (summa cum laude) and B.Eng. (summa cum laude) from the Queensland University of Technology, Australia, and a Ph.D. from the University of Queensland, Australia. Steve Lawrence is a Senior Staff Research Scientist at Google. Before joining Google, he was a Senior Research Scientist at NEC Research Institute in Princeton citeseer http://citeseer.ist.psu.edu/ the creator of Google Desktop Search

Andrei Broder Andrei Broder has a B.Sc. from Technion, Israel Institute of Technology and a M.Sc. and Ph.D. in Computer Science from Stanford University.  He is a member of the research staff at Digital Equipment Corporation's Systems Research Center in Palo Alto, California.  His main interests are the design, analysis, and  implementation of probabilistic algorithms and supporting data structures, in particular in the context of Web-scale applications. 

Larry Page & Sergey Brin Google founders Larry Page and Sergey Brin The Anatomy of a Large-Scale Hypertextual Web Search Engine (www7) PageRank, an Eigenvector based Ranking Approach for Hypertext (sigir98)

Zipf’s law Sites rank ordered by their popularity

Co-citation分析 给定一个文献的集合,希望表达这些文献两两被同时(同一篇文章)引用的情况 coc[i,j]越大,表示这两篇文章的相关性越强 形成文章之间的邻接矩阵E,使得E[i,j]=1,当且仅当文章i引用了j;否则E[i,j]=0。 这意味着,E的第i列反映文章i被引用的情况; 同时引用文章i和文章j的文章数量等于E[*,i]和E[*,j]在相同的行出现1的个数。考虑到E元素的{0,1}特性,即coc[i,j]=∑E[k,i]E[k,j], k=1,2,…,n 或者coc = ETE

文章的引用矩阵E coc[i,j]=∑E[k,i]E[k,j] j 1 j i 1 i coc[i,j]= ETE i 1 1 1 1 j

修改模型 p = c*ETp 模型三:让i的声望等于知晓他的人的声望之和乘以一个常数(对所有i相同) 与模型二的关系 还是要问: p(i) = c×∑E[k,i]p(k),k=1,…,n 与模型二的关系 效果上感觉应该差不多,因为是“共同的常数”,而对我们有意义的只是“相对声望” 但并不完全等价! 还是要问: 非0解存在吗?如果存在,如何计算? p = c*ETp

p = c*ETp 解的存在性 这就是特征值、特征向量的定义方程 但这问题就来了! 还有,特征值、特征向量不是实数怎么办? 注意到c只需要在一个系统中保持常量,不同的系统是可以不一样的,1/c就是ET的特征值,可以随p同时求出 但这问题就来了! ET最多可能有n个不同的特征值 如果是有多个不同的特征值,取那一个为好? 不同的特征值对应有不同的特征向量,我们没有理由认为这不同的特征向量反映出来的节点声望序是一致的 即使是同一个特征值,对应的特征子空间中也可能有多个向量(我们也没理由认为它们反映出来的节点声望序是一致的),应该取哪一个? 还有,特征值、特征向量不是实数怎么办?

The Perron-Frobenius Theorem 如果有向图G是强连通的,则它的邻接矩阵A有一个唯一的元素全为正实数的特征向量v,且该特征向量属于模最大的特征值λ。 注: 这个特征向量的唯一性成立在忽略常数因子前提下 由于A是非负的,v是非负的,因此λ也一定是正实数 注意“强连通”的条件需要满足 Stephane Gaubert, et al, “The Perron-Frobenius Theorem for Homogeneous, Monotone Functions,” HP Lab, Bristol, 2001.

Random Walker Model (continue) 改写一下,成 形式上和“声望”模型一样,只是矩阵L有行向量元素和为1的性质。 有用吗? Dangling Node(出度为0的节点) 对于这些节点,矩阵L对应着元素全0的行,元素和不为1 修正:L[u,v] = 1/N if du=0

让这浏览者每次以一定的概率(1-β)沿着超链走,以概率(β)重新随机选择一个新的起始节点 但我们还得解决实际问题:修改模型 让这浏览者每次以一定的概率(1-β)沿着超链走,以概率(β)重新随机选择一个新的起始节点 这在物理意义上即总是有可能跳进入度为0的点,跳出那些“圈”。在模型表达上即为 The Anatomy of a Large-Scale Hypertextual Web Search Engine (www7) Larry Page & Sergey Brin 剩下来的迭代就可用 power iteration, β选在0.1和0.2之间 迭代最后得到的稳定P就是PageRank(Page & Brin 1997)

HITS 从设计思想来看,HITS和PageRank的一个基本区别是HITS针对具体查询、应用在查询时间,而PageRank是独立于查询的 Root set, R(q): 和查询q相关的网页集合 Base set, V(q): 除了R(q)外,还包括指向R(q)元素和被R(q)元素指向的网页 Expanded set = V - R 两个概念(直觉上有意义) AUTHORITY(权威型网页):内容权威,质量高的网页 HUB(目录型网页):指向许多authority网页的网页

Authority and Hub scores 针对u∈V(q),在每个网页u上定义有两个参数:a[u]和h[u],分别表示其权威性和目录性。 交叉定义 一个网页u的a值依赖于指向它的网页v的h值 一个网页u的h值依赖于它所指的网页v的a值

HITS (contd.) 声望高的(入度大) 权威性高 认识许多声望高的(出度大)目录性强 如何计算? Power Iteration on:

HITS算法过程(Topic Distillation) Send query to a text-based IR system and obtain the root-set. Expand the root-set by radius one to obtain an expanded graph. Run power iterations on the hub and authority scores together. Report top-ranking authorities and hubs.

网页的相互链接特性,使得我们可以应用社会网络分析的方法来从网页集合的结构中提炼有用的信息 PageRank & HITS 网页的相互链接特性,使得我们可以应用社会网络分析的方法来从网页集合的结构中提炼有用的信息 提炼什么信息?取决于我们对应用目标的认识,还取决于有关技术模型的精确定义、有效计算,以及对可能产生误差的认识和评估 PageRank和HITS是两个经典的例子,大目标一致,切入点不同。 它们不应该意味着不可能有更好的(或者更有特色的)新角度 它们的一个本质缺陷是:将网页之间的链接关系“太当真”。

Overlap analysis measure the relative sizes of search engine indices (Andrei Broder,1998) sample uniformly from any set and check membership in any set Pr(A & B|A) = Size(A & B)/Size(A) Pr(A & B|B) = Size(A & B)/Size(B), Size(A)/Size(B) = Pr(A & B|B) / Pr(A & B|A) A technique for measuring the relative size and overlap of public Web search engines Query based sampling Given a random query in one of the two schemes, a URL is obtained from a designated search engine by selecting a random result page from the top 100 result pages returned by the search engine for the query. Query based checking The strong query is constructed as follows. The page is fetched from the Web and we analyse the contents to compute a conjunctive query composed of the k (say 8) most significant terms in the page. Significance is taken to be inversely proportional to frequency in the lexicon. Words not found in the lexicon are ignored since they may be typographical errors (which tend to be transient) or words that might not have any discriminative value (e.g., common words in a foreign language.) The strong query is sent to each of the search engines in turn and the results are examined. If one of the result URLs matches the query URL then the query URL is noted as being present in the engine's database. size of the static public Web as of November 1997 was at least 200 million documents.

Web图大小——边的数目 Web图十分稀疏(sparse) |E| = O(n) or at least o(n2) Average number of hyperlinks per page roughly a constant