1
Wrapper Generation and HTML Reduction
Yu Li
2
Outline
The web page extraction problem
SGWrap System
Problems with HTML
HTML reduction
  Basic idea
  Problem definition and goals
  Page model
  Algorithm design
Future work
3
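The page extraction problem
A large amount of data exists on the Web in the form of semi-structured HTML pages.
Web data integration needs to convert this semi-structured data into structured data.
A program that performs the page extraction task is usually called a wrapper.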
4
The page extraction problem (diagram labels: mapping, wrapper)
5
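The page extraction problem
Page extraction can be carried out in three ways:
Writing the wrapper by hand: a conventional programming language is used, and the mapping is hard-coded into the wrapper program.
Generating the wrapper with a tool: the wrapper program is generated with computer assistance. Concerns: extraction rules, interaction style, maintenance.
Fully automatic extraction. Concerns: partitioning the page structure, annotation.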
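To make the first option concrete, here is a minimal sketch of a hand-written wrapper in Python. The sample page, the field names, and the class name `RobotWrapper` are all made up for illustration; the point is only that the mapping from page positions to fields lives inside the program, so any change in page layout means editing code.

```python
from html.parser import HTMLParser

# Hypothetical input: a small result page with one record per table row.
PAGE = """
<table>
  <tr><td><a href="/r/1">AlphaBot</a></td><td>Linux</td></tr>
  <tr><td><a href="/r/2">BetaCrawler</a></td><td>Windows</td></tr>
</table>
"""


class RobotWrapper(HTMLParser):
    """Hand-coded wrapper: the mapping (1st cell -> name, 2nd cell -> platform)
    is fixed in the program itself."""

    FIELDS = ("name", "platform")

    def __init__(self):
        super().__init__()
        self.records = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.records.append(dict(zip(self.FIELDS, self._row)))
            self._row = None


def extract(page):
    """Run the wrapper over one page and return its structured records."""
    wrapper = RobotWrapper()
    wrapper.feed(page)
    wrapper.close()
    return wrapper.records
```

Calling `extract(PAGE)` yields one dictionary per table row. A rule-based system such as the one described in the following slides moves this mapping out of the code and into a declarative rule.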
6
SGWrap System
SGWrap = Schema Guided Wrapper Generation
(diagram labels: SGWrap System, interact, generate, Wrapper Program, run, HTML page, data)
7
SGWrap System
SGWrap consists of three main parts:
SGWrap Runtime (Runtime, for short) provides services for accessing our web page content extraction algorithms. It is the underlying functional layer of the whole system, so to reuse or integrate a wrapper you also need to reuse or integrate the Runtime itself.
SGWrap Compiler (Compiler, for short) compiles SGWrap rules into a wrapper, in both source code form and bytecode form. It works like a translator: the generated source code is human-readable and can be modified to meet special needs. The bytecode is produced with the help of Java's compiler, javac.exe.
Visual SGWrap is a visual tool for generating rules. The user interacts with it through simple select and click operations, and it then computes the appropriate rules.
8
SGWrap System – basic usage
9
SGWrap System – basic usage
3 Steps
1. Design the Rule using Visual SGWrap
2. Compile the Rule into a program using SGWrapC
3. Test and apply the wrapper using SGWrap (Runtime)
There is a tutorial at (also in the documentation of each installation)
10
Welcome to http://idke.ruc.edu.cn/sgwrap
11
SGWrap Rule Language (diagram labels: mapping, wrapper)
How can the mapping be described formally?
12
SGWrap Rule Language
A formal language that describes the user's intent is important for web data extraction systems. It should be:
Exact. This is the basic constraint. Because a wrapper program must produce exact results for automatic extraction, the language describing the wrapper's intent must also be exact.
Expressive. The language should be able to describe the typical intentions and considerations of a user. In our case, it should be able to express DOM tree navigation and the construction of structured results.
Compact. The language should be simple and powerful. It should describe the problem in a short script, and it should offer facilities for common operations, such as string operations.
Understandable. A rule is written not only for the computer but also for people, so the language should be human-readable, because people may need to customize and adjust rules.
13
SGWrap Rule Language
SGWrap's Rule is designed to be that type of language. It is exact because it uses XPath as the basic method for describing the DOM tree. It is expressive because it introduces XQuery's FLWR expression for result construction. It is also compact and understandable.
A Rule consists of three parts: (a) an assign clause, (b) a variable name for the returned result, and (c) a return clause, which can be a variable name, a function clause, or a Rule array containing other Rules.
14
SGWrap Rule Language - example
    {
    LET $Web_robots:=document($d)
    // document($d) is an expression reserved by the SGWrap Rule language; it
    // represents the "root" of a document.
    RETURN <Web_robots>
        FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
        // What follows is an array of Rules, which means that the result
        // consists of a series of child nodes.
        RETURN <robot>
            LET $name:=$robot/TD[0]/A
            RETURN <name>$name</name>
    }
            LET $Platform:=$robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
            RETURN <Platform>$Platform</Platform>
        </robot>
    </Web_robots>
Refer to for specification.
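Read as ordinary code, the rule iterates over the top-level table rows and emits a `name` and a `Platform` child for each one. The following Python paraphrase is only a sketch of that reading, not the code the SGWrap Compiler generates. The sample page and the function name `extract_robots` are invented, the input is assumed to be well-formed markup, and the indices are taken to be zero-based, as `TD[0]` suggests.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page shaped like the paths in the rule.
PAGE = """
<HTML><BODY><TABLE><TBODY>
  <TR>
    <TD><A>AlphaBot</A></TD>
    <TD><TABLE><TBODY>
      <TR><TH>Owner:</TH><TD>Example Org</TD></TR>
      <TR><TH>Platform:</TH><TD>Linux</TD></TR>
    </TBODY></TABLE></TD>
  </TR>
</TBODY></TABLE></BODY></HTML>
"""


def extract_robots(page):
    html = ET.fromstring(page.strip())  # plays the role of document($d)
    result = ET.Element("Web_robots")
    # FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
    for robot in html.findall("BODY/TABLE/TBODY/TR"):
        cells = robot.findall("TD")
        out = ET.SubElement(result, "robot")
        # LET $name := $robot/TD[0]/A
        ET.SubElement(out, "name").text = cells[0].findtext("A")
        # LET $Platform := $robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
        for row in cells[1].findall("TABLE/TBODY/TR"):
            if "Platform:" in (row.findtext("TH") or ""):
                ET.SubElement(out, "Platform").text = row.findtext("TD")
    return result
```

The `contains(...)` predicate is evaluated in plain Python here because ElementTree supports only a subset of XPath.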
15
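SGWrap Rule Language
Applying the SGWrap Rule Language to HTML page extraction exposed some problems:
Rules have no conditional branch statement, so they cannot make conditional selections.
Rules are built on the W3C DOM model, but the W3C DOM standard does not match the de facto standard (the IE DOM).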
16
What is HTML?
“To publish information for global distribution, one needs a universally understood language, a kind of publishing mother tongue that all computers may potentially understand. The publishing language used by the World Wide Web is HTML (from HyperText Markup Language).”
“HTML gives authors the means to:
Publish online documents with headings, text, tables, lists, photos, etc.
Retrieve online information via hypertext links, at the click of a button.
Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.
Include spread-sheets, video clips, sound clips, and other applications directly in their documents.”
17
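Problems with HTML
Different tag sequences can produce similar layout effects.
Tags that represent page elements can be combined to give layout semantics similar to those of tags that divide document structure.
Structures can be nested arbitrarily, and meaningless nesting is allowed.
Structure division and decoration semantics are mixed together, so decorating text creates unnecessary structure.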
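A small sketch can illustrate the first and last points. It reduces a fragment to the sequence of text blocks a reader would see, ignoring decoration tags and breaking only at structural ones. The set `BREAKING_TAGS`, the class name, and the two sample fragments are assumptions chosen for the example, not a classification taken from the slides.

```python
from html.parser import HTMLParser

# Assumed set of tags that start a new visual block; everything else
# (b, font, i, span, a, ...) is treated as decoration and ignored.
BREAKING_TAGS = {"p", "div", "br", "li", "ul", "ol", "table", "tr", "td", "th"}


class Partition(HTMLParser):
    """Collects the visible text of a fragment as a list of blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAKING_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BREAKING_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def text_partition(markup):
    parser = Partition()
    parser.feed(markup)
    parser.close()
    return parser.blocks


# Two different tag sequences with the same visible partition of the text.
FRAGMENT_A = "<p>First <b>bold</b> line</p><p>Second line</p>"
FRAGMENT_B = "<table><tr><td><font>First bold line</font><br>Second line</td></tr></table>"
```

Both fragments reduce to the same two blocks, even though one uses paragraphs and the other a one-cell table with a line break.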
18
A1 B1
19
A2 B2
20
C1 D1
21
C2 D2
22
Amazon1 Amazon2 Google
23
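Categories of HTML tags (number of tags per category)
Dividing document structure   19
Decorating text               23
Links                          4
Page elements                  8
Semantic description          18
Plug-ins, images, document metadata, Web forms, tables, special purposes   3, 10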
24
Statistic on HTML tags
Data set
  Taken from
  Contains thousands of result HTML pages obtained by querying different DBSE
Statistic 1
  How often do HTML pages use the various tags?
  Sum of the number of occurrences of each tag over all pages
Statistic 2
  How often do the various tags appear in HTML pages?
  Number of HTML pages that contain a given tag
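The two statistics differ only in what is counted: total occurrences of a tag versus the number of pages in which it occurs at least once. A minimal sketch of both, assuming the pages are available as strings (the function and class names are illustrative, and this is not the script used to produce the slides' numbers):

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags (including self-closing ones) in a single page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag.upper()] += 1


def tag_statistics(pages):
    occurrences = Counter()  # Statistic 1: total occurrences of each tag
    page_counts = Counter()  # Statistic 2: number of pages containing each tag
    for page in pages:
        parser = TagCounter()
        parser.feed(page)
        parser.close()
        occurrences.update(parser.counts)
        page_counts.update(parser.counts.keys())
    return occurrences, page_counts
```

Sorting either counter with `most_common()` gives a ranking of the kind reported on the result slides.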
25
Statistic 1 result (all 1798 pages)
Top tags
  TD, A, TR, BR, FONT, IMG, B, SPAN, TABLE, INPUT, OPTION, P, I, DIV
Top tags for defining structure
  SPAN (32314), TABLE (27591), P (13769), DIV, LI, BODY, HTML, DD, UL
26
Statistic 2 result (all 1798 pages)
Top tags
  A, HEAD, BR, BODY, HTML, TITLE, IMG, TABLE, TR, TD, FORM, INPUT, B, FONT, META, P, LINK, DIV, SCRIPT
Top tags for defining structure
  BODY (1765), HTML (1754), TABLE (1672), P (1269), DIV (1069), SPAN
27
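Statistic conclusion
HTML defines a large number of tags, but only a small portion of them are used frequently.
Fewer than half of the tags are used frequently.
Tags used for dividing structure account for only about 1/4 of all HTML tags.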
28
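Summary
An HTML document contains:
  Simple divisions of document structure
  Various page elements
The goal of HTML is to display pages; its structure information is implicit in tags and combinations of tags.
HTML documents contain unnecessary structure and redundancy, and text is split into discontinuous structures because of decoration.
When using HTML tags, people tend to combine a few simple tags to compose various semantics.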
29
W3C HTML: document built with structure and page elements
Extraction requirements: document with content structure specified
30
HTML reduction
If structure information can be discovered in the element sequence, an HTML document can be converted into a document that is suitable for extraction and contains only structure information.
(diagram: Document built with structure and page elements -> Program captures structure information in the element sequence -> Document with content structure specified)
31
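Problem definition
Given an HTML document H, a program processes it to produce a corresponding document S that carries structure information, such that S has the same structure information as H; that is, the partitioning of the text content is unchanged.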
32
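Problems to solve
A suitable document model for describing structured information
  It has enough descriptive power to describe most commonly used document structure information properly
  It preserves the continuity of text
  It is not itself redundant
Explore the kinds of structured information present in current HTML pages
  Structured information already defined in HTML and expressed by specific tags
  Structured information from traditional document typesetting theory that authors simulate with tag combinations when writing pages
Design a set of algorithms that can formally compute the structured information of an HTML document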
33
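Page Model
What kind of Page Model needs to be designed?
It describes only structure information
It is not redundant
  No nested structures with the same semantics
  No unnecessary structures
Page
    <!ELEMENT page ((text|figure|table)+)>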
34
“Page Model for HTML Reduction”
line
    <!ELEMENT line (#PCDATA, regions?)>
    <!ATTLIST line id CDATA #REQUIRED>
    <!ELEMENT regions (region+)>
    <!ELEMENT region #PCDATA>
    <!ATTLIST region begin CDATA #REQUIRED end CDATA #REQUIRED>
    (diagram labels: Line(id), Region)
figure
    <!ELEMENT figure #PCDATA>
    <!ATTLIST figure id CDATA #REQUIRED>
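One way to read the `line` and `region` declarations is that a line keeps its text continuous, while decoration is recorded on the side as regions. The sketch below follows that reading. It assumes that `begin` and `end` are character offsets into the line text, which the slides do not state, and the set `DECORATION_TAGS` and the function name `to_line` are likewise illustrative.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumed set of tags that only decorate text and should not split a line.
DECORATION_TAGS = {"b", "i", "font", "span", "a"}


class LineBuilder(HTMLParser):
    """Accumulates continuous text and records decorated spans as offsets."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.regions = []  # (begin, end) offsets into self.text
        self._open = []    # stack of begin offsets of open decoration tags

    def handle_starttag(self, tag, attrs):
        if tag in DECORATION_TAGS:
            self._open.append(len(self.text))

    def handle_endtag(self, tag):
        if tag in DECORATION_TAGS and self._open:
            self.regions.append((self._open.pop(), len(self.text)))

    def handle_data(self, data):
        self.text += data


def to_line(fragment, line_id):
    """Build a <line> element from an inline HTML fragment."""
    builder = LineBuilder()
    builder.feed(fragment)
    builder.close()
    line = ET.Element("line", id=line_id)
    line.text = builder.text
    if builder.regions:
        regions = ET.SubElement(line, "regions")
        for begin, end in builder.regions:
            region = ET.SubElement(regions, "region", begin=str(begin), end=str(end))
            region.text = builder.text[begin:end]
    return line
```

For example, `to_line("Hello <b>world</b>!", "l1")` keeps the text `Hello world!` in one piece and records a single region covering `world`.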
35
Page Model
item
    <!ELEMENT item (prefix?, content, line_list)>
    <!ATTLIST item id CDATA #REQUIRED>
    <!ELEMENT prefix #PCDATA>
    <!ELEMENT content (line|figure)+>
    <!ELEMENT line_list #PCDATA>
    (diagram labels: item = prefix, content (line|figure)+, line_list)
list
    <!ELEMENT list (item+, line_list)>
    <!ATTLIST list id CDATA #REQUIRED>
36
Page Model
text
    <!ELEMENT text ((line|list)+)>
    <!ATTLIST text id CDATA #REQUIRED>
table
    <!ELEMENT table (row+)>
    <!ATTLIST table id CDATA #REQUIRED>
    <!ELEMENT row (col+, line_list)>
    <!ELEMENT col ((text|figure)+, line_list)>
    (diagram labels: row = col, col, col, line_list; col = (text|figure)+, line_list)
form -- ignored in this version
frame -- ignored in this version
head -- ignored in this version
script & plugin -- ignored in this version
37
Page Model - Misc A <page> <A> <B> <C>
38
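Algorithm design
Two possible approaches
Method 1: Start from HTML. Analyze the patterns that combinations of HTML tags may form, record these patterns, and then perform the conversion during one or more parsing passes over the HTML file.
Method 2: First convert the HTML into a document I in our Page Model, allowing redundant structure in I. Then simplify I further, removing unnecessary structure and redundancy to obtain the result document.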
39
Method 1: starting from HTML
HTML fragment: table, caption, col, colgroup, tfoot, thead, tbody, tr, th, td
This is a pattern:
If the prefix is "tfoot", we get a "foot line"
If the prefix is "thead", we get "head information"
If the prefix is "tbody", each time we get a "line"
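The table pattern can be sketched as a single parsing pass in which the enclosing section tag (the prefix) decides the role of each row. The role names, the default of treating rows outside any section as body rows, and the function name `parse_table` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Role assigned to a row, depending on the section (prefix) that encloses it.
ROLE_BY_PREFIX = {"thead": "head", "tfoot": "foot", "tbody": "line"}


class TablePattern(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []          # list of (role, [cell texts])
        self._prefix = "tbody"  # assumed default for rows outside any section
        self._cells = None      # cells of the current row, or None
        self._cell = None       # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag in ROLE_BY_PREFIX:
            self._prefix = tag
        elif tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cells.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._cells is not None:
            self.rows.append((ROLE_BY_PREFIX[self._prefix], self._cells))
            self._cells = None
        elif tag in ROLE_BY_PREFIX:
            self._prefix = "tbody"


def parse_table(fragment):
    parser = TablePattern()
    parser.feed(fragment)
    parser.close()
    return parser.rows
```

This covers one hand-picked pattern only; each further tag combination would need its own handler.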
40
(diagram labels: p, %inline;, %heading;, %list;, %block;, %preformatted;, DL, DIV, CENTER, BLOCKQUOTE)
The problem with this method is that there are so many possible tag combinations that we cannot find all the patterns by hand; this must be done by a program.
41
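Method 2: simplifying on the Page Model
If only the structure information of the tags defined by HTML is considered, an HTML document can easily be converted into a document I in the Page Model. However, document I will contain:
  Unnecessary structure, such as a text paragraph nested in a table with one row and one column
  Redundant structure, such as multiply nested tables
For document I:
  Eliminate unnecessary structure, or convert it into another equivalent structure through semantic analysis
  Eliminate redundant structure
  Obtain the final result document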
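As an illustration of one such simplification, the sketch below unwraps every table that has exactly one row and one column, working bottom-up so that nested one-cell tables collapse as well. It operates on an intermediate document that uses the Page Model element names; the function names are hypothetical, and real reduction rules would need more cases than this one.

```python
import xml.etree.ElementTree as ET


def is_single_cell_table(node):
    """True for a <table> with exactly one <row> holding exactly one <col>."""
    if node.tag != "table":
        return False
    rows = node.findall("row")
    return len(rows) == 1 and len(rows[0].findall("col")) == 1


def simplify(node):
    """Unwrap one-cell tables in place, processing children before parents."""
    new_children = []
    for child in list(node):
        simplify(child)
        if is_single_cell_table(child):
            cell = child.find("row/col")
            # Keep the cell's content, drop the table scaffolding around it.
            new_children.extend(c for c in cell if c.tag != "line_list")
        else:
            new_children.append(child)
    node[:] = new_children
    return node
```

Tables with more than one cell are left untouched, since their rows and columns carry real structure.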
42
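Future work
Refine the Page Model
  Improve its descriptive power
  Write a detailed specification
Design and implement the algorithms
Design and carry out the experiments
  Design: how to verify the effect of the reduction, and how to select the experimental data
Apply the approach in the SGWrap System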
43
Q&A Thank You!