Segmentation of Chinese Long Sentences Using Commas Mei xun Jin, Mi-Young Kim, Dongil Kim, and Jong-Hyeok Lee Pohang University of Science and Technology,


Segmentation of Chinese Long Sentences Using Commas
Mei xun Jin, Mi-Young Kim, Dongil Kim, and Jong-Hyeok Lee
Pohang University of Science and Technology, Advanced Information Technology Research Center
Div. of Computer, Electronics and Telecommunications, Yanbian University of Science and Technology
ACL SIGHAN Workshop 2004

My research topic
The sentence is a fundamental unit for NLP.
Resolving the boundaries of Chinese sentences (or topic chains).
– Commas and full stops are often confused in Chinese.
– A full stop can sometimes be replaced with a comma.
– A comma sometimes should be replaced with a full stop.
Sentence segmentation is inherently ambiguous.

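Samples
正因為沒有經過仔細全面的設計規畫,我們發展的步驟錯亂、標準參差,由於向歐、美、日本取法的模範不一,不但各方面無法配合,甚至會有衝突,而設定的辦法不能取得大眾的共識與認同,與整個社會格格不入,導致化橘為枳,貌似神非自然不在話下。
("Precisely because there was no careful, comprehensive design and planning, the steps of our development are disordered and the standards uneven; because the models borrowed from Europe, America, and Japan differ, the various parts not only fail to fit together but may even conflict; and the measures adopted fail to win public consensus and acceptance and are out of step with society as a whole, so it goes without saying that the result resembles the original in form but not in spirit.")
這是有點霸道,但也有道理,因為他們是上市公司,每一季要向美國證管會報告總公司、附屬公司及子公司的營運及財務狀況,帳都是照一套會計原則來做,所以很多時候他們的要求,是出自一種單純的需要,而並不是故意要來欺負我們。
("This is somewhat overbearing, but it is also reasonable: because they are a listed company, every quarter they must report to the US securities regulator on the operations and finances of the head office, affiliates, and subsidiaries, and the books are all kept under one set of accounting principles, so their demands often stem from a simple need and are not a deliberate attempt to bully us.")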

Outline
Segmentation of Chinese long sentences using commas
Types of commas
Features
Experiments
Conclusion

Motivation
From the perspective of statistical parsing, Chinese has a rather different set of salient ambiguities.
In Chinese, subordinate or coordinate clauses are sometimes joined within a sentence without any conjunction.
Clause segmentation therefore also differs considerably from that in Western languages.
Goal: segment long Chinese sentences using commas.

Segmentation
Syntactic analysis of a sentence:
1. Segment the sentence at a comma.
2. Do the dependency analysis for each segment.
3. Set the dependency relations between segment pairs.
In Chinese dependency parsing, not all commas are suitable segmentation points.

Segmentation: Case 1
Only one dependency line crosses the comma.
– one_dep_line_cross comma

Segmentation: Case 2
Some of the words fail to find their heads.
– mul_dep_lines_cross comma

Segmentation: Case 3
Some words find the wrong head.
– mul_dep_lines_cross comma
Segmentation at a one_dep_line_cross comma helps reduce parsing complexity and can contribute to accurate parsing results.
Segmentation at a mul_dep_lines_cross comma should be avoided.
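
The distinction between the two kinds of comma can be illustrated by counting the dependency lines that cross a comma in a gold dependency tree. The following is a minimal sketch, not the authors' implementation. It assumes a hypothetical encoding in which `heads[i]` is the index of the head of token `i` and `-1` marks the root; the function names and the `no_dep_line_cross` label are introduced here for illustration only.

```python
from typing import List

def count_crossing_arcs(heads: List[int], comma_index: int) -> int:
    """Count dependency arcs linking a word left of the comma to a word right of it.

    heads[i] is the head index of token i, or -1 for the root (an assumed
    encoding). Arcs that start or end at the comma itself are ignored.
    """
    crossing = 0
    for dep, head in enumerate(heads):
        if dep == comma_index or head == -1 or head == comma_index:
            continue
        # The arc crosses the comma when its two ends lie on different sides.
        if (dep < comma_index) != (head < comma_index):
            crossing += 1
    return crossing

def comma_kind(heads: List[int], comma_index: int) -> str:
    """Map the crossing count to the labels used on the slides."""
    n = count_crossing_arcs(heads, comma_index)
    if n == 1:
        return "one_dep_line_cross"
    if n > 1:
        return "mul_dep_lines_cross"
    return "no_dep_line_cross"  # hypothetical label, not from the slides

# Abstract five-token example "a b , c d" with the comma at index 2:
# a -> c and b -> d both cross the comma; c is the root; d -> c does not cross.
heads_multi = [3, 4, 3, -1, 3]
# Same tokens, but only a -> c crosses the comma.
heads_single = [3, 0, 3, -1, 3]
print(comma_kind(heads_multi, 2), comma_kind(heads_single, 2))
```

Under this reading, a parser that segments at the comma in `heads_multi` would have to recover two cross-segment links afterwards, which is the situation Cases 2 and 3 warn about.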

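Inter-clause comma and intra-clause comma
Intra-clause comma
– Occurs within a clause.
– 北海在數年前,是一個默默無聞的小漁村。 ("A few years ago, Beihai was an obscure little fishing village.")
Inter-clause comma
– Occurs at the end of a clause.
– 小明在寫作業,媽媽在打毛衣。 ("Xiao Ming is doing his homework, and Mom is knitting a sweater.")
Segment the long sentence at inter-clause commas.
– This requires comma classification.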
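
Once each comma has been labeled, splitting a sentence only at inter-clause commas is straightforward. The sketch below assumes the sentence is already tokenized and that a per-token boolean list (a hypothetical input format) says whether a comma is inter-clause; the word segmentation of the two example sentences is also an assumption made for illustration.

```python
from typing import List

COMMA = ","  # full-width Chinese comma

def segment_at_inter_clause_commas(tokens: List[str],
                                   is_inter_clause: List[bool]) -> List[List[str]]:
    """Split tokens after every comma flagged as inter-clause.

    is_inter_clause is aligned with tokens; its value only matters at commas.
    Intra-clause commas stay inside their segment.
    """
    segments, current = [], []
    for token, inter in zip(tokens, is_inter_clause):
        current.append(token)
        if token == COMMA and inter:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

# Inter-clause comma: the sentence is split into two segments.
tokens_a = ["小明", "在", "寫", "作業", ",", "媽媽", "在", "打", "毛衣", "。"]
flags_a = [False] * len(tokens_a)
flags_a[4] = True

# Intra-clause comma: the sentence stays in one piece.
tokens_b = ["北海", "在", "數年", "前", ",", "是", "一個", "默默無聞", "的", "小", "漁村", "。"]
flags_b = [False] * len(tokens_b)

print(len(segment_at_inter_clause_commas(tokens_a, flags_a)))  # 2
print(len(segment_at_inter_clause_commas(tokens_b, flags_b)))  # 1
```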

Two segments adjoining a comma
Goal: identify whether a comma is an inter-clause comma or an intra-clause comma.
Assign a value to each comma
– (left_seg, right_seg)
– Each of left_seg and right_seg can be a phrase (p) or a clause (c).
– (p, p), (p, c), (c, p), (c, c)

Syntactic relation between two adjoining segments
Relation
– Whether any word of the left segment has a dependency relation with a word of the right segment.
Direction
– How many directions the dependency relations between the two segments have.
Head
– Which segment contains the head of any word of the other segment.
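
The three properties can be read off a dependency tree mechanically. The sketch below is one possible interpretation of the slide, not the authors' code: it assumes `heads[i]` gives the head index of token `i` (`-1` for a root), and the function name, the dictionary keys, and the `"both"` value are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Union

def segment_relation(heads: List[int],
                     left: Iterable[int],
                     right: Iterable[int]) -> Dict[str, Union[int, Optional[str]]]:
    """Describe the dependency links between two segments adjoining a comma.

    left and right are the token indices of the two segments.
    relation:  1 if any word in one segment has its head in the other, else 0.
    direction: number of directions (0, 1, or 2) in which such links run.
    head:      the side that holds the head ("left", "right", "both", or None).
    """
    left_set, right_set = set(left), set(right)
    # A left word whose head is in the right segment: the head side is "right".
    head_in_right = any(heads[d] in right_set for d in left_set)
    # A right word whose head is in the left segment: the head side is "left".
    head_in_left = any(heads[d] in left_set for d in right_set)

    if head_in_left and head_in_right:
        head_side = "both"
    elif head_in_right:
        head_side = "right"
    elif head_in_left:
        head_side = "left"
    else:
        head_side = None
    return {
        "relation": int(head_in_left or head_in_right),
        "direction": int(head_in_left) + int(head_in_right),
        "head": head_side,
    }

# 學生們 來到 了 操場 , 高高興興地 (assumed analysis: every word depends on 來到).
heads = [1, -1, 1, 1, 1, 1]
print(segment_relation(heads, left=range(0, 4), right=[5]))
```

For this example the adverbial on the right depends on the verb on the left, so the sketch reports a relation in one direction with the head on the left.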

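Comma Classification
Comma values | Syntactic relation | Sample | Type
(c, p) | Relation = 0 | 在單位裡,他是個好領導,在家裡,他是好爸爸。 ("At work, he is a good leader; at home, he is a good father.") | (c, p)-I
 | Relation = 1, Head = right = p | 科研成果快速轉化為生產力,是這個開發區的特點。 ("The rapid conversion of research results into productivity is a hallmark of this development zone.") |
 | Relation = 1, Head = left = c | 學生們來到了操場,高高興興地。 ("The students came to the playground, happily.") | (c, p)-II
(p, c) | Relation = 0 | 韓國對大連投資已連續三年增長,在大連,韓國投資企業受到各種優惠。 ("South Korean investment in Dalian has grown for three consecutive years; in Dalian, Korean-invested enterprises receive various preferential treatment.") | (p, c)-I
 | Relation = 1, Head = left = p | 統計資料表明,大連對韓出口達一億多美元。 ("Statistics show that Dalian's exports to South Korea exceeded 100 million US dollars.") |
 | Relation = 1, Head = right = c | 一九九四年,通用在中國購買了四千多萬美元的東西。 ("In 1994, the company 通用 purchased goods worth more than 40 million US dollars in China.") | (p, c)-II
(p, p) | | 中國銀行在去年十月,聘請日某公司做顧問。 ("In October last year, the Bank of China hired a Japanese company as a consultant.") | (p, p)
(c, c) | | 一號產品佔據不到二成,二號產品比重達七成以上。 ("Product No. 1 accounts for less than 20%, while Product No. 2 accounts for more than 70%.") | (c, c)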

Estimate the type of comma
To identify whether a comma is inter-clause or intra-clause, we need to estimate the types of the segments to the left and right of the comma.
– Classify a comma into one of (c, c), (c, p), (p, c), and (p, p).
Classification using SVM
– With a number of kernels.

Features
Directly relevant feature category:
– Predicate
– Complements
Indirectly relevant feature category:
– Auxiliary words
– Adverbials
– Prepositions
– Clausal conjunctions
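
The slides do not preserve how these categories were encoded for the classifier. As a purely illustrative sketch, one common way to expose such cues is to collect the part-of-speech tags in a fixed window on each side of the comma. The function name, the feature keys, the `<PAD>` symbol, and the tagging of the example are all assumptions; the tags follow the Penn Chinese Treebank inventory (NT temporal noun, PU punctuation, NR proper noun, P preposition, VV verb, AS aspect marker).

```python
from typing import Dict, List

def window_pos_features(pos_tags: List[str], comma_index: int,
                        window: int = 2) -> Dict[str, str]:
    """Return the POS tags within `window` tokens on each side of a comma.

    Keys such as "L-1" and "R+2" encode side and distance from the comma.
    Positions outside the sentence are filled with "<PAD>".
    """
    features = {}
    for offset in range(1, window + 1):
        left_i = comma_index - offset
        right_i = comma_index + offset
        features[f"L-{offset}"] = pos_tags[left_i] if left_i >= 0 else "<PAD>"
        features[f"R+{offset}"] = (
            pos_tags[right_i] if right_i < len(pos_tags) else "<PAD>"
        )
    return features

# 一九九四年 , 通用 在 中國 購買 了 (assumed tagging, comma at index 1)
tags = ["NT", "PU", "NR", "P", "NR", "VV", "AS"]
print(window_pos_features(tags, comma_index=1, window=2))
```

A dictionary like this could then be one-hot encoded and passed to any classifier; whether the original system used exactly this representation cannot be recovered from the slides.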

Directly relevant features

Indirectly relevant features

Experiments
Dataset
– Chinese Penn Treebank 2.0
– 10-fold cross-validation
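
For readers unfamiliar with the protocol, 10-fold cross-validation partitions the data into ten parts and uses each part once for testing while training on the other nine. The sketch below shows one simple way to generate such splits. How the authors actually partitioned the treebank is not stated on the slide, so the round-robin assignment and the function name are assumptions.

```python
from typing import Iterator, List, Tuple

def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Items are assigned to folds round-robin, so fold sizes differ by at most one.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [idx for f, fold in enumerate(folds) if f != test_fold for idx in fold]
        yield train, test

# Every item is used for testing exactly once across the ten folds.
splits = list(k_fold_indices(25, k=10))
tested = sorted(idx for _, test in splits for idx in test)
print(len(splits), tested == list(range(25)))
```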

Results: Different kernel

Results: Window size & POS

Results: Parsing accuracy
Parsing procedure
1. POS tagging.
2. Long sentence segmentation by comma.
3. Parsing based on segmentation.

Conclusion
Chinese sentence segmentation by classification of commas.
Dependency parsing accuracy improved by 9.6%.
The accuracy of the segmentation is not yet satisfactory.
– Inter-F score = 83.72%
– Accuracy = 85.43%
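
The two reported figures can be related with a small sketch of how such metrics are usually computed. It assumes that "Inter-F score" means the F-score with the inter-clause class treated as positive, which the slide does not state explicitly; the label strings and the toy data are hypothetical and do not reproduce the reported results.

```python
from typing import Dict, List

def evaluate(gold: List[str], pred: List[str], positive: str = "inter") -> Dict[str, float]:
    """Compute accuracy over all commas and precision/recall/F1 for one class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for five commas.
gold = ["inter", "inter", "intra", "intra", "inter"]
pred = ["inter", "intra", "intra", "inter", "inter"]
print(evaluate(gold, pred))
```

Accuracy counts every comma equally, while the F-score focuses on how well the inter-clause commas, the ones used as segmentation points, are found.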