Wrapper Generation and HTML Reduction


Wrapper Generation and HTML Reduction Yu Li

Outline: the web page extraction problem; SGWrap System; problems with HTML; HTML reduction (basic idea, problem definition and goals, page model, algorithm design); future work.

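The page extraction problem. A large amount of data exists on the Web in the form of semi-structured HTML pages. Web data integration needs this semi-structured data to be converted into structured data. A program that carries out the page extraction task is usually called a wrapper.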

The page extraction problem [diagram labels: mapping, wrapper]

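The page extraction problem. Page extraction can be done in three ways. (1) Writing the wrapper by hand: a conventional programming language is used and the mapping is hard-coded in the wrapper program. (2) Generating the wrapper with tools: the wrapper program is generated with computer assistance; the issues are extraction rules, interaction style, and maintenance. (3) Fully automatic extraction: the issues are the segmentation of page structure and annotation.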
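As an illustration of the first option, here is a minimal sketch of a hand-written wrapper in Python. It assumes a hypothetical page in which every table row is one record with a name cell and a price cell; the class name, the field names, and the sample page are all invented for the example. The point is only that the mapping lives in the code itself.

```python
from html.parser import HTMLParser


class HandWrittenWrapper(HTMLParser):
    """Hard-codes the mapping: each <tr> is a record, its <td> cells are fields."""

    FIELDS = ("name", "price")  # hypothetical target schema

    def __init__(self):
        super().__init__()
        self.records = []
        self._cells = None      # cells of the row being read, None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Only rows that match the hard-coded schema become records.
            if len(self._cells) == len(self.FIELDS):
                values = (cell.strip() for cell in self._cells)
                self.records.append(dict(zip(self.FIELDS, values)))
            self._cells = None
            self._in_cell = False


def extract(html):
    wrapper = HandWrittenWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


SAMPLE = (
    "<table><tr><td>Book A</td><td>10</td></tr>"
    "<tr><td>Book B</td><td>12</td></tr></table>"
)
print(extract(SAMPLE))
```

Any change in the page layout means editing this code, which is the maintenance cost that tool-assisted wrapper generation tries to reduce.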

SGWrap System. SGWrap = Schema Guided Wrapper Generation. [Diagram: interact with the SGWrap System; it generates a Wrapper Program; running the Wrapper Program on an HTML page yields data.]

SGWrap System. SGWrap mainly consists of three parts. SGWrap Runtime (Runtime, for short) provides services for accessing our algorithms for web page content extraction. It acts as the underlying functional layer of the whole system, so if you want to reuse or integrate your wrapper you also need to reuse or integrate the Runtime itself. SGWrap Compiler (Compiler, for short) compiles SGWrap rules into a wrapper in both source code form and bytecode form. It works like a translator: the generated source code is human-readable and can be modified to fulfill your special needs. The bytecode is simply compiled with the help of Java's compiler, javac.exe. Visual SGWrap is a visual tool for generating rules. You only need to interact with it through simple selecting and clicking operations, and it then works out the proper rules.

SGWrap System – basic usage

SGWrap System – basic usage. Three steps: (1) design the rule using Visual SGWrap; (2) compile the rule into a program using SGWrapC; (3) test and apply the wrapper using SGWrap (Runtime). There is a tutorial at http://idke.ruc.edu.cn/sgwrap/doc/A-10-Minutes-Tutorial.html (also in the documentation of each installation).

Welcome to http://idke.ruc.edu.cn/sgwrap

SGWrap Rule Language [diagram labels: mapping, wrapper] How can it be described formally?

SGWrap Rule Language. A formal language describing the intent of the user is important for web data extraction systems. It should be: Exact. This is the basic constraint. As a wrapper program must give exact results for automatic extraction, the language describing the wrapper's intention must also be exact. Expressive. The language should be able to describe the typical intentions and considerations of the user. In our case, it should be able to express DOM tree navigation and structured result construction. Compact. The language should be simple and powerful. It should describe the problem in a short script, and it should have facilities that help the user perform general operations, such as string operations. Understandable. A rule is not only for the computer but also for humans, so the language should be human-understandable, because a human may customize and adjust it.

SGWrap Rule Language. SGWrap's Rule is designed to be that type of language. It is exact because it uses XPath as the basic DOM tree description method. It is expressive because it introduces XQuery's FLWR expression for result construction. It is also compact and understandable. A Rule consists of three parts: (a) an assignment clause, (b) a variable name for the returned result, and (c) a return clause, which can be a variable name, a function clause, or a Rule array containing other Rules.

SGWrap Rule Language - example
{
LET $Web_robots:=document($d)
// document($d) is an expression reserved by SGWrap Rule; it is used to
// represent the concept ``root'' of a document.
RETURN <Web_robots>
  FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
  // What follows is an array of Rules, which means that the result
  // consists of a series of child nodes.
  RETURN <robot>
    LET $name:=$robot/TD[0]/A
    RETURN <name>$name</name>
    }
    LET $Platform:=$robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
    RETURN <Platform>$Platform</Platform>
  </robot>
</Web_robots>
Refer to http://idke.ruc.edu.cn/sgwrap/doc/Rule-Specification.html#Rule-Specification for the specification.
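To make the intent of the example rule concrete, the following is a rough Python approximation using `xml.etree.ElementTree`. It is not how SGWrap executes rules. The sample page and the function name `apply_rule` are invented, the input is assumed to be well-formed markup, and `TD[0]` and `TD[1]` are read as zero-based indices, which is an interpretation the slide does not confirm.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page shaped like the paths used in the rule.
SAMPLE_PAGE = """
<HTML><BODY><TABLE><TBODY>
  <TR>
    <TD><A>ExampleBot</A></TD>
    <TD><TABLE><TBODY>
      <TR><TH>Platform:</TH><TD>unix</TD></TR>
      <TR><TH>Language:</TH><TD>perl</TD></TR>
    </TBODY></TABLE></TD>
  </TR>
</TBODY></TABLE></BODY></HTML>
"""


def apply_rule(page_source):
    root = ET.fromstring(page_source.strip())   # root is the HTML element
    result = ET.Element("Web_robots")
    # FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
    for robot in root.findall("./BODY/TABLE/TBODY/TR"):
        cells = robot.findall("./TD")
        if len(cells) < 2:
            continue
        out = ET.SubElement(result, "robot")
        # LET $name := $robot/TD[0]/A  (index assumed to be zero-based)
        link = cells[0].find("./A")
        name = (link.text or "") if link is not None else ""
        ET.SubElement(out, "name").text = name
        # LET $Platform := $robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
        platform = ""
        for row in cells[1].findall("./TABLE/TBODY/TR"):
            header = row.find("./TH")
            if header is not None and "Platform:" in (header.text or ""):
                cell = row.find("./TD")
                platform = (cell.text or "") if cell is not None else ""
                break
        ET.SubElement(out, "Platform").text = platform
    return result


print(ET.tostring(apply_rule(SAMPLE_PAGE), encoding="unicode"))
```

The rule states the same navigation and result construction in a few declarative lines, which is the compactness argument made for the rule language.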

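SGWrap Rule Language. Applying the SGWrap Rule Language to the extraction of HTML pages revealed some problems. Rules have no conditional branch statement, so they cannot express conditional selection. Rules are built on the W3C DOM model, but the W3C DOM standard does not match the de facto standard (the IE DOM).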

What is HTML? “To publish information for global distribution, one needs a universally understood language, a kind of publishing mother tongue that all computers may potentially understand. The publishing language used by the World Wide Web is HTML (from HyperText Markup Language). ” “HTML gives authors the means to: Publish online documents with headings, text, tables, lists, photos, etc. Retrieve online information via hypertext links, at the click of a button. Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc. Include spread-sheets, video clips, sound clips, and other applications directly in their documents. ”

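Problems with HTML. Different tag sequences produce similar layout effects. Tags that represent page elements can be combined to give layout semantics similar to those of tags that divide document structure. Structures can be nested arbitrarily, and meaningless nesting is allowed. Structure division and decoration semantics are mixed together, so decorating text creates unnecessary structure.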

[Figures: example layouts with cells A1 B1, A2 B2, C1 D1, C2 D2; example pages Amazon1, Amazon2, Google]

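Categories of HTML tags (number of tags per category). Document structure: 19; text decoration: 23; links: 4; page elements: 8; semantic description: 18. Further categories: plug-ins, images, document metadata, Web forms, tables, special purpose (surviving counts for this row: 3, 10).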

Statistics on HTML tags. Data set: taken from http://www.data.binghamton.edu/vsewrapper.html; it contains thousands of result HTML pages obtained by querying different DBSE. Statistic 1: how often do HTML pages use the various tags? The sum of the number of appearances of each tag in each page. Statistic 2: how often do the various tags appear across HTML pages? The number of HTML pages that contain a given tag.
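The two statistics can be sketched in Python as follows. This is an illustration under assumptions, not the script used for the talk: the two tiny pages stand in for the data set, and `tag_statistics` is a hypothetical helper name.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the start tags that occur in a single page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag.upper()] += 1


def tag_statistics(pages):
    total_occurrences = Counter()   # Statistic 1: appearances summed over pages
    pages_with_tag = Counter()      # Statistic 2: number of pages containing the tag
    for html in pages:
        counter = TagCounter()
        counter.feed(html)
        counter.close()
        total_occurrences.update(counter.counts)
        pages_with_tag.update(counter.counts.keys())
    return total_occurrences, pages_with_tag


# Two stand-in pages; the real data set is the one named on the slide.
PAGES = [
    "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>",
    "<html><body><p>x</p><p>y</p><br></body></html>",
]
stat1, stat2 = tag_statistics(PAGES)
print("Statistic 1:", dict(stat1))
print("Statistic 2:", dict(stat2))
```

A tag such as TD can rank high in Statistic 1 because it repeats many times within one page, while Statistic 2 counts that page only once.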

Statistic 1 result (all 1798 pages). Top tags: TD, A, TR, BR, FONT, IMG, B, SPAN, TABLE, INPUT, OPTION, P, I, DIV. Top tags for defining structure: SPAN (32314), TABLE (27591), P (13769), DIV, LI, BODY, HTML, DD, UL.

Statistic 2 result (all 1798 pages). Top tags: A, HEAD, BR, BODY, HTML, TITLE, IMG, TABLE, TR, TD, FORM, INPUT, B, FONT, META, P, LINK, DIV, SCRIPT. Top tags for defining structure: BODY (1765), HTML (1754), TABLE (1672), P (1269), DIV (1069), SPAN.

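Statistic conclusion. HTML defines a large number of tags, but only a small part of them are used frequently. Fewer than half of the tags are in frequent use. Tags used for structure division account for only about 1/4 of all HTML tags.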

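Summary. An HTML document contains simple document structure divisions and various page elements. The goal of HTML is to display pages; its structure information is implicit in tags and combinations of tags. HTML documents contain unnecessary structure and redundancy, and text is split into discontinuous structures because of decoration. When using HTML tags, people tend to combine a few simple tags to express various semantics.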

W3C HTML and extraction requirements. W3C HTML: a document built with structure and page elements. Extraction requirements: a document with its content structure specified.

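HTML reduction. If structure information can be discovered in the element sequence, the HTML document can be converted into a document that is suitable for extraction and contains only structure information. [Diagram: a program captures the structure information in the element sequence, turning a document built with structure and page elements into a document with its content structure specified.]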

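Problem definition. Given an HTML document H, process it with a program to obtain a corresponding document S with structure information, such that S has the same structure information as H, that is, the partitioning of the text content is unchanged.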
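One way to read the requirement is as a testable property: the sequence of text segments obtained from S must equal the one obtained from H. The sketch below checks this in Python under a simplifying assumption that is not from the slides: a fixed set of decoration tags never splits text, and every other tag ends the current segment. The documents H and S and the element names used in S are illustrative.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags only decorate text and never split it.
INLINE_TAGS = {"b", "i", "u", "em", "strong", "font", "span", "a"}


class SegmentCollector(HTMLParser):
    """Collects text segments; any non-inline tag ends the current segment."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._current = []

    def _flush(self):
        text = " ".join("".join(self._current).split())
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def text_partition(markup):
    collector = SegmentCollector()
    collector.feed(markup)
    collector.close()
    return collector.segments


H = ("<table><tr><td><font size='2'>Hello <b>world</b></font></td></tr></table>"
     "<p>Bye</p>")
S = "<page><text><line>Hello world</line><line>Bye</line></text></page>"
print(text_partition(H))
print(text_partition(H) == text_partition(S))
```

A check of this kind could also serve as one criterion when evaluating a reduction algorithm, although choosing the set of non-splitting tags is itself part of the problem.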

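Problems to solve. (1) A suitable document model for describing structure information: it should be expressive enough to describe most commonly used document structures, preserve the continuity of text, and contain no redundancy itself. (2) Exploring the kinds of structure information present in current HTML pages: structure information already defined in HTML and expressed by specific tags, and structure information from traditional document typesetting that authors simulate with tag combinations when writing pages. (3) Designing a set of algorithms that can formally compute the structure information of an HTML document.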

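Page Model. What kind of Page Model do we need to design? It describes only structure information. It has no redundancy: no nested structures with the same semantics, and no unnecessary structures. Page: <!ELEMENT page ((text|figure|table)+)>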

Page Model (line, figure)
line:
  <!ELEMENT line (#PCDATA, regions?)>
  <!ATTLIST line id CDATA #REQUIRED>
  <!ELEMENT regions (region+)>
  <!ELEMENT region #PCDATA>
  <!ATTLIST region begin CDATA #REQUIRED end CDATA #REQUIRED>
figure:
  <!ELEMENT figure #PCDATA>
  <!ATTLIST figure id CDATA #REQUIRED>
[Diagram: the sample text “Page Model for HTML Reduction” shown as a Line(id) containing a Region]

Page Model (item, list)
item:
  <!ELEMENT item (prefix?, content, line_list)>
  <!ATTLIST item id CDATA #REQUIRED>
  <!ELEMENT prefix #PCDATA>
  <!ELEMENT content (line|figure)+>
  <!ELEMENT line_list #PCDATA>
list:
  <!ELEMENT list (item+, line_list)>
  <!ATTLIST list id CDATA #REQUIRED>
[Diagram: item = prefix, content (line|figure)+, line_list]

Page Model (text, table)
text:
  <!ELEMENT text ((line|list)+)>
  <!ATTLIST text id CDATA #REQUIRED>
table:
  <!ELEMENT table (row+)>
  <!ATTLIST table id CDATA #REQUIRED>
  <!ELEMENT row (col+, line_list)>
  <!ELEMENT col ((text|figure)+, line_list)>
[Diagram: row = col, col, col, line_list; col = (text|figure)+, line_list]
Ignored in this version: form, frame, head, script and plugin.
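The element declarations of the Page Model can be mirrored as plain data types. The Python sketch below is one possible reading, not part of the original design: `ListBlock` stands for the `list` element (renamed to avoid Python's built-in), `begin` and `end` are assumed to be character offsets, and `line_list` is kept as an opaque string because the slides do not explain its content.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Region:
    begin: int            # assumed: start offset inside the line
    end: int              # assumed: end offset inside the line
    text: str = ""

@dataclass
class Line:
    id: str
    text: str
    regions: List[Region] = field(default_factory=list)

@dataclass
class Figure:
    id: str
    text: str = ""

@dataclass
class Item:
    id: str
    content: List[Union[Line, Figure]]
    prefix: Optional[str] = None
    line_list: str = ""

@dataclass
class ListBlock:          # the "list" element of the DTD
    id: str
    items: List[Item]
    line_list: str = ""

@dataclass
class Text:
    id: str
    children: List[Union[Line, ListBlock]]

@dataclass
class Col:
    children: List[Union[Text, Figure]]
    line_list: str = ""

@dataclass
class Row:
    cols: List[Col]
    line_list: str = ""

@dataclass
class Table:
    id: str
    rows: List[Row]

@dataclass
class Page:
    children: List[Union[Text, Figure, Table]]


def all_lines(node):
    """Yield every Line reachable from a Page Model node, in document order."""
    if isinstance(node, Line):
        yield node
    elif isinstance(node, (Page, Text, Col)):
        for child in node.children:
            yield from all_lines(child)
    elif isinstance(node, Table):
        for row in node.rows:
            for col in row.cols:
                yield from all_lines(col)
    elif isinstance(node, ListBlock):
        for item in node.items:
            yield from all_lines(item)
    elif isinstance(node, Item):
        for child in node.content:
            yield from all_lines(child)


page = Page([
    Text("t1", [Line("l1", "Page Model for HTML Reduction",
                     [Region(0, 10, "Page Model")])]),
    Table("tb1", [Row([Col([Text("t2", [Line("l2", "A1")])]),
                       Col([Text("t3", [Line("l3", "B1")])])])]),
])
print([line.text for line in all_lines(page)])
```

Because `Col` accepts only `Text` and `Figure`, a table nested directly inside a table cell cannot be expressed, which matches the stated goal of having no nested structures with the same semantics.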

Page Model - Misc [diagram labels: A, <page>, <A>, <B>, <C>]

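Algorithm design: two possible approaches. Method 1: start from HTML, analyze the patterns that combinations of HTML tags can form, record these patterns, and then perform the conversion during one or more parsing passes over the HTML file. Method 2: first convert the HTML into a document I in the Page Model we designed, allowing redundant structure in I, then simplify I further, removing unnecessary structure and redundancy to obtain the result document.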

Method 1: starting from HTML. HTML fragment: table, with caption, col, colgroup, tfoot, thead, tbody, tr, th, td. This is a pattern: if the prefix is “tfoot”, we get a “foot line”; if the prefix is “thead”, we get “head information”; if the prefix is “tbody”, each time we get a “line”.
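The table pattern can be sketched as a one-pass recognizer in Python. This is an illustration of the idea rather than the author's algorithm: the class name and the sample fragment are invented, and rows that appear outside any thead, tfoot, or tbody are treated like tbody rows, which is an assumption of the sketch.

```python
from html.parser import HTMLParser

# The enclosing section (the "prefix") decides what a row turns into.
SECTION_KIND = {"thead": "head information", "tfoot": "foot line", "tbody": "line"}


class TablePattern(HTMLParser):
    """Recognizes the table pattern in a single pass over the HTML."""

    def __init__(self):
        super().__init__()
        self.output = []        # list of (kind, [cell texts])
        self._section = None
        self._cells = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_KIND:
            self._section = tag
        elif tag == "tr":
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Assumption: a row outside any explicit section counts as tbody.
            kind = SECTION_KIND[self._section or "tbody"]
            self.output.append((kind, [cell.strip() for cell in self._cells]))
            self._cells = None
            self._in_cell = False
        elif tag in SECTION_KIND:
            self._section = None


def recognize(html):
    parser = TablePattern()
    parser.feed(html)
    parser.close()
    return parser.output


FRAGMENT = (
    "<table><thead><tr><th>Name</th><th>Price</th></tr></thead>"
    "<tfoot><tr><td>Total</td><td>22</td></tr></tfoot>"
    "<tbody><tr><td>A</td><td>10</td></tr><tr><td>B</td><td>12</td></tr></tbody>"
    "</table>"
)
for kind, cells in recognize(FRAGMENT):
    print(kind, cells)
```

Each further pattern would need its own hand-written recognizer of this kind, so the effort grows with the number of tag combinations found in real pages.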

[Diagram labels: p, %inline;, %heading;, %list;, %block;, %preformatted;, DL, DIV, CENTER, BLOCKQUOTE] The problem with this method is that there are so many possible tag combinations that we cannot find all the patterns by hand; this must be done by programs.

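Method 2: simplifying on the Page Model. If only the structure information of the tags defined by HTML is considered, an HTML document can easily be converted into a document I in the Page Model. However, I will contain unnecessary structures, such as a text paragraph nested in a table with one row and one column, and redundant structures, such as multiply nested tables. On document I: eliminate unnecessary structures, or convert them into another, equivalent structure through semantic analysis; eliminate redundant structures; obtain the final result document.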
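A single simplification rule of this kind can be sketched in Python on an XML encoding of the intermediate document I. The sketch covers only one case, replacing a table of one row and one column by the content of its only cell, and it ignores `line_list`; the sample document and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_single_cell(table):
    """True if the table has exactly one row and that row has exactly one col."""
    rows = table.findall("row")
    return len(rows) == 1 and len(rows[0].findall("col")) == 1


def simplify(node):
    """Replace every 1x1 table below node by the content of its only cell."""
    new_children = []
    for child in list(node):
        simplify(child)                      # work bottom-up
        if child.tag == "table" and is_single_cell(child):
            cell = child.find("row/col")
            new_children.extend(list(cell))  # keep the cell content, drop the table
        else:
            new_children.append(child)
    node[:] = new_children
    return node


# Intermediate document I: a paragraph wrapped in two nested 1x1 tables,
# followed by a genuine table with two columns.
DOCUMENT_I = (
    "<page>"
    "<table><row><col>"
    "<table><row><col><text><line>Hello</line></text></col></row></table>"
    "</col></row></table>"
    "<table><row>"
    "<col><text><line>A1</line></text></col>"
    "<col><text><line>B1</line></text></col>"
    "</row></table>"
    "</page>"
)
reduced = simplify(ET.fromstring(DOCUMENT_I))
print(ET.tostring(reduced, encoding="unicode"))
```

Working bottom-up lets the same rule remove both the nested table and the outer wrapper in one traversal, while the table that carries real two-column structure is left untouched.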

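Future work. Refine the Page Model: improve its descriptive power and write a detailed specification. Design and implement the algorithm. Design and carry out experiments: how to test the effect of the reduction, and how to choose the experimental data. Apply the result in the SGWrap System.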

Q&A Thank You!