1
WebGather Design and Implementation
Hongfei Yan, Network Group, CST, PKU, Dec. 15, 2000
2
Outline: Introduction to search engines; WebGather; Conclusion
3
Introduction: http://www.yahoo.com/
4
Introduction: http://sohu.com/
5
Introduction: http://sina.com.cn/
6
Introduction: http://www.google.com/
7
Introduction: http://e.pku.edu.cn/
8
Introduction: Search Engine Sizes (Search Engine Watch, Nov. 8, 2000)
GG = Google, WT = WebTop.com, AV = AltaVista, FAST = FAST, NL = Northern Light, EX = Excite, INK = Inktomi, Go = Go (Infoseek)
9
Introduction: a new study (Inktomi and the NEC Research Institute, Inc., Feb. 2000)
Number of indexable pages on the web: over 1 billion
Number of servers discovered: 6,409,521
Number of mirrors in servers discovered: 1,457,946
Number of sites (total servers minus mirrors): 4,951,247
Number of good sites (reachable over a 10-day period): 4,217,324
Number of bad sites (unreachable): 733,923
Web pages per site: 1,000,000,000 / 4,217,324 = 237.1
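The arithmetic behind these figures can be rechecked with a few lines of Python. This is only a sanity check on the numbers as quoted on the slide; the variable names are illustrative. Note that servers minus mirrors comes out 328 higher than the quoted site total, while good sites plus bad sites matches it exactly, so the quoted total is used as-is.

```python
# Figures as quoted on the slide (Inktomi / NEC Research Institute, Feb. 2000).
indexable_pages = 1_000_000_000
servers = 6_409_521
mirrors = 1_457_946
sites_quoted = 4_951_247
good_sites = 4_217_324
bad_sites = 733_923

# Good and bad sites together account for the quoted site total.
sites_from_reachability = good_sites + bad_sites

# Servers minus mirrors is slightly larger than the quoted total.
gap = (servers - mirrors) - sites_quoted

# Average pages per reachable site, as computed on the slide.
pages_per_good_site = indexable_pages / good_sites

print(f"good + bad sites:      {sites_from_reachability:,}")
print(f"gap vs. servers-mirrors: {gap}")
print(f"pages per good site:   {pages_per_good_site:.1f}")
```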
10
Introduction: Inktomi Search Engine cluster
In the picture: 9 × 8 × 2 = 144
11
WebGather: Introduction
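The "Tianwang" (天网) Chinese-English search engine system, developed by the Network and Distributed Systems Laboratory of the Department of Computer Science at Peking University, is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Coding and Distributed Chinese-English Information Discovery". It began officially providing web information navigation services to Internet users on CERNET on October 29, 1997. While in public service, the Tianwang system has widely adopted users' comments and suggestions and has continuously improved its service quality; to date it has received more than 8 million visits. The new Tianwang search engine project group, established in early 2000 and funded by the National 973 Key Basic Research Development Program, carries on the fine tradition of the earlier development team and is committed to exploring and researching the key technologies of Chinese-English search engine systems, in order to give users faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. We welcome better comments and suggestions from all users. "Though I lack the paired wings of the colorful phoenix, our hearts are linked as one, like the magic rhinoceros horn."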
12
WebGather as of Dec. 1, 2000 (2.5-million-page scale)
Indexes 2.5 million web pages
More than 200,000 web pages every day
Ten days to update all data
Three PCs
13
WebGather: Design goals for a distributed web-crawling system
Collect all the web pages in China
Keep pace with the rapid growth of Chinese web information
238 × 40,000 = 9,520,000
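As a rough sketch of what this target implies, the product above can be combined with the gathering rate reported for Dec. 1, 2000. The reading of 238 as the rounded pages-per-site average and of 40,000 as a count of web sites is an assumption; the slide does not label either factor.

```python
import math

# Assumption: 238 is the rounded pages-per-site average (237.1) from the study.
pages_per_site = 238
# Assumption: 40,000 is read as a number of web sites; the slide does not say.
assumed_sites = 40_000
# Lower bound on the daily gathering rate reported for Dec. 1, 2000.
pages_per_day = 200_000

target_pages = pages_per_site * assumed_sites

# Days for one full pass over the target at the lower-bound rate.
days_for_full_pass = math.ceil(target_pages / pages_per_day)

print(f"target pages: {target_pages:,}")
print(f"days for one full pass: {days_for_full_pass}")
```

Under these assumptions a single pass would take far longer than the current ten-day update cycle, which is one way to read the case for a distributed gatherer.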
14
WebGather 2.0: architecture
[Architecture diagram. Labels: Client log database; User behavior; Gather Database; Indexer; Retrieve Database; Client; Retriever; Gatherer; WWW]
15
WebGather 1.2: architecture of gather subsystem 1/4
[Diagram. Labels: Main Control; Gather1, …, GatherN]
16
WebGather 2.0: architecture of gather subsystem 1/4
17
WebGather : technologies in gather subsystem 1/4
Distributed system architecture
High availability
……
Load balancing
Low bandwidth
Scalability
Re-configurability
Word segmentation (cut words)
Position relativity
Anchor text, link popularity
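One common way to get load balancing and re-configurability together in a distributed crawler is to assign each host to a gatherer by hashing. The sketch below uses rendezvous (highest-random-weight) hashing; this is a swapped-in illustration, not a description of how WebGather assigns work, and the function names and gatherer labels are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def _score(gatherer: str, host: str) -> int:
    """Deterministic pseudo-random weight for a (gatherer, host) pair."""
    digest = hashlib.sha256(f"{gatherer}|{host}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_gatherer(url: str, gatherers: list[str]) -> str:
    """Pick the gatherer with the highest weight for the URL's host.

    All URLs on one host go to the same gatherer, and removing a gatherer
    only moves the hosts that were assigned to it.
    """
    host = urlsplit(url).hostname or ""
    return max(gatherers, key=lambda g: _score(g, host))


if __name__ == "__main__":
    nodes = ["Gather1", "Gather2", "Gather3"]  # hypothetical node names
    for url in ["http://e.pku.edu.cn/", "http://sina.com.cn/news"]:
        print(url, "->", assign_gatherer(url, nodes))
```

Keeping a whole host on one gatherer also makes it easier to rate-limit requests per site, which fits the low-bandwidth goal.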
18
WebGather : architecture of indexer subsystem 2/4
[Diagram. (A) Web pages webpage1 … webpageN, each mapped to its features, e.g. webpageK → feature1, feature2, feature3. (B) Features feature1 … featureN, each mapped to web pages, e.g. featureK → webpage1, webpage2, webpage3.]
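The step from structure A (page to features) to structure B (feature to pages) is an index inversion. A minimal sketch, assuming features are plain strings and the index fits in memory; the function name and sample data are illustrative only.

```python
from collections import defaultdict


def invert(forward_index: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn a page -> features mapping into a feature -> pages mapping."""
    inverted: dict[str, list[str]] = defaultdict(list)
    for page, features in forward_index.items():
        # dict.fromkeys drops repeated features while keeping their order.
        for feature in dict.fromkeys(features):
            inverted[feature].append(page)
    return dict(inverted)


# Structure A: each web page lists its features.
forward = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature2", "feature3"],
    "webpage3": ["feature1"],
}

# Structure B: each feature lists the pages it appears in.
inverted = invert(forward)
for feature, pages in inverted.items():
    print(feature, "->", pages)
```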
19
WebGather : technologies in retriever subsystem 3/4
Traditional IR (VSM)
Query cache, hot click
Word segmentation (cut words)
Anchor text, link popularity
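Two of these items, the vector space model and the query cache, can be sketched together. This is a toy illustration under stated assumptions: the documents are made up, `cut_words` is a whitespace split standing in for real word segmentation, the weighting is a simple TF-IDF variant, and the cache is Python's `functools.lru_cache` rather than anything WebGather is known to use.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy collection.
DOCS = {
    "d1": "peking university search engine",
    "d2": "chinese web search",
    "d3": "university network group",
}


def cut_words(text: str) -> list[str]:
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


TERM_COUNTS = {doc_id: Counter(cut_words(text)) for doc_id, text in DOCS.items()}
DOC_FREQ = Counter(term for counts in TERM_COUNTS.values() for term in counts)


def weight_vector(counts: Counter) -> dict[str, float]:
    """TF-IDF style weights; terms unseen in the collection are dropped."""
    n = len(DOCS)
    return {t: tf * math.log(1 + n / DOC_FREQ[t])
            for t, tf in counts.items() if t in DOC_FREQ}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


DOC_VECTORS = {doc_id: weight_vector(c) for doc_id, c in TERM_COUNTS.items()}


@lru_cache(maxsize=1024)
def search(query: str) -> tuple[str, ...]:
    """Rank documents by cosine similarity; repeated queries hit the cache."""
    q = weight_vector(Counter(cut_words(query)))
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in DOC_VECTORS.items()]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return tuple(doc_id for score, doc_id in ranked if score > 0)
```

Calling `search("search engine")` twice computes the ranking once; the second call is served from the cache, which is the effect a query cache aims for on popular queries.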
20
WebGather : technologies in user behavior subsystem 4/4
Link popularity
Replica popularity
User popularity
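Link popularity is often approximated by counting how many distinct pages link to a page. The sketch below shows that simplest form only; the slides do not say how WebGather computes it, and the page names and function name are hypothetical.

```python
from collections import defaultdict


def link_popularity(links: list[tuple[str, str]]) -> dict[str, int]:
    """Count distinct inbound links per page from (source, target) pairs."""
    inbound: dict[str, set[str]] = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore self-links
            inbound[target].add(source)
    return {page: len(sources) for page, sources in inbound.items()}


# Hypothetical link graph.
links = [
    ("A", "B"),
    ("C", "B"),
    ("A", "B"),  # duplicate link, counted once
    ("B", "C"),
    ("C", "C"),  # self-link, ignored
]
popularity = link_popularity(links)
print(popularity)
```

Replica popularity and user popularity could be counted the same way from mirror lists and click logs, but those inputs are not described in the slides.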
21
Conclusion: Search engines are becoming more and more important. The web is a good experimental subject; we can do a lot of R&D on it.