Peng Qian, Xipeng Qiu, Xuanjing Huang


A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation. Peng Qian, Xipeng Qiu, Xuanjing Huang. School of Computer Science, Fudan University.

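An Appetizer Case. Different segmentation results can achieve the same precision, recall and F1-score. Example sentence: 白藜芦醇是一种酚类物质 ("Resveratrol is a phenolic substance"), shown with two candidate segmentations, P1 and P2; their word boundaries are not preserved in this transcript.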
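To make the tie concrete, here is a minimal sketch of the standard span-based precision, recall and F1 for word segmentation. Because the boundaries of the resveratrol example are lost, it borrows the auction sentence that appears later in this talk. Treating the first segmentation shown there as the reference is an assumption, and `cand_b` is a hypothetical candidate made up for illustration.

```python
def spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(reference, candidate):
    """Standard word-level precision, recall and F1 based on exact span matches."""
    ref, cand = spans(reference), spans(candidate)
    correct = len(ref & cand)
    precision = correct / len(cand)
    recall = correct / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Assumed reference segmentation (first segmentation shown for this sentence).
reference = "周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。".split()
# A candidate taken from the segmentations shown later in the talk.
cand_a = "周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。".split()
# Hypothetical candidate: splits 开拍 instead of 拍卖会.
cand_b = "周日 开 拍 的 这 场 拍卖会 起 拍价 2.5万 美元 。".split()

print(prf(reference, cand_a))
print(prf(reference, cand_b))
```

Both candidates match 8 of the 10 reference words with 12 predicted words each, so the standard scores cannot tell them apart even though they make different mistakes.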

Motivation. After successive improvements, the standard metric can hardly distinguish state-of-the-art word segmentation systems any more. The high scores arise because the distribution of word difficulty is unbalanced: most words are easy. Human judgment depends on the difficulty of each segmentation decision. A segmenter should earn more credit for correctly segmenting a difficult word than an easy one. Conversely, it should take a larger penalty for wrongly segmenting an easy word than a difficult one.

Metric Design From Psychometrics to NLP System Evaluation

Item Analysis in Psychological Tests. Each item in the exam is given a credit, and a difficult item earns more credit than an easy one. According to psychometric theory, a reasonable difficulty can be computed as the proportion of students who fail to answer the item correctly.
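The difficulty described here is simply a failure rate per item. A minimal sketch, using a made-up response table (the data and the name `item_difficulty` are illustrative, not from the talk):

```python
def item_difficulty(responses):
    """responses[s][i] is True if student s answered item i correctly.

    Returns, for each item, the fraction of students who got it wrong.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(1 for student in responses if not student[i]) / n_students
        for i in range(n_items)
    ]


# Hypothetical exam: 4 students, 3 items.
responses = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
difficulty = item_difficulty(responses)
print(difficulty)
```

An item that every student answers correctly gets difficulty 0, and an item that nobody answers correctly gets difficulty 1.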

Transfer the Idea in Psychometrics to NLP System Evaluation. Standardized test: many subjects take the test, and the difficulty of an item is defined according to their collective performance. Word segmentation (as an example): diverse segmenters process the test set, and the difficulty of a test case is defined according to the collective performance of the NLP systems.

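[Figure: a committee of K segmenters processes one sentence, and each word receives a difficulty d (d_i for word i)]
Sentence: 周日开拍的这场拍卖会起拍价 2.5万美元。 ("The auction opening on Sunday has a starting price of 25,000 US dollars.")
Difficulty labels: d d d d d d d_i d d d
Segmentations shown in the figure:
周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。
周日 开 拍 的 这 场 拍卖 会 起 拍 价 2.5万 美元 。
周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。
周日开 拍 的 这场 拍卖 会起 拍价 2.5万 美元 。
周日 开 拍 的 这 场 拍卖会 起 拍价 2.5 万 美元 。
周日 开拍 的 这 场 拍卖会 起 拍价 2.5 万 美元 。
周日 开拍 的这 场 拍卖 会 起 拍价 2.5万 美元 。
······ (K segmenters in total)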
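One way to read this figure is that a word's difficulty is the fraction of committee members that fail to produce it, mirroring the failure rate used in item analysis. The sketch below follows that reading. It assumes the first segmentation shown is the reference and the remaining six are committee outputs; both the formula and this split are assumptions, not something the slides state.

```python
def spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def word_difficulty(reference, committee):
    """Return (word, difficulty) pairs; difficulty is the share of
    committee segmenters that do not output the reference word's span."""
    committee_spans = [spans(c) for c in committee]
    result, pos = [], 0
    for w in reference:
        span = (pos, pos + len(w))
        misses = sum(1 for s in committee_spans if span not in s)
        result.append((w, misses / len(committee_spans)))
        pos += len(w)
    return result


# Assumed reference: the first segmentation shown for the sentence.
reference = "周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。".split()
# Assumed committee outputs: the remaining six segmentations shown.
committee = [
    "周日 开 拍 的 这 场 拍卖 会 起 拍 价 2.5万 美元 。".split(),
    "周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。".split(),
    "周日开 拍 的 这场 拍卖 会起 拍价 2.5万 美元 。".split(),
    "周日 开 拍 的 这 场 拍卖会 起 拍价 2.5 万 美元 。".split(),
    "周日 开拍 的 这 场 拍卖会 起 拍价 2.5 万 美元 。".split(),
    "周日 开拍 的这 场 拍卖 会 起 拍价 2.5万 美元 。".split(),
]

for word, d in word_difficulty(reference, committee):
    print(word, round(d, 2))
```

Under these assumptions 起拍价 is the hardest word, since no committee member segments it correctly, while 美元 and the final period are trivially easy.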

Building a Committee of Base Segmenters: The Diversity of the Committee. Each base segmenter is constructed with a random combination of candidate feature templates and a sampling ratio of the training dataset.
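The random construction can be sketched as drawing one configuration per committee member. The template names and the list of sampling ratios below are hypothetical placeholders; the slides do not list the actual candidates, and training the segmenters themselves is out of scope here.

```python
import random

# Hypothetical candidate feature templates (names are placeholders).
CANDIDATE_TEMPLATES = [
    "char_unigram",
    "char_bigram",
    "char_trigram",
    "char_type",
    "reduplication",
]
# Hypothetical sampling ratios of the training dataset.
SAMPLING_RATIOS = [0.3, 0.5, 0.7, 0.9]


def sample_committee_configs(k, seed=0):
    """Draw k random (feature templates, sampling ratio) configurations."""
    rng = random.Random(seed)
    configs = []
    for _ in range(k):
        n_templates = rng.randint(1, len(CANDIDATE_TEMPLATES))
        templates = sorted(rng.sample(CANDIDATE_TEMPLATES, n_templates))
        ratio = rng.choice(SAMPLING_RATIOS)
        configs.append({"templates": templates, "sampling_ratio": ratio})
    return configs


configs = sample_committee_configs(k=5)
for cfg in configs:
    print(cfg)
```

Each configuration would then be used to train one base segmenter on the corresponding random subset of the training data.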

Building a Committee of Base Segmenters: The Size of the Committee. We analyze how the judgment of a word's difficulty changes as the size of the committee increases.
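A simple way to look at the effect of committee size is to track the difficulty estimate of one word as members are added one by one. The sketch assumes difficulty is the fraction of members that miss the word, and the miss pattern below is hypothetical.

```python
def running_difficulty(missed):
    """missed[k] is True if the (k+1)-th committee member missed the word.

    Returns the difficulty estimate after each additional member.
    """
    estimates, misses = [], 0
    for size, was_missed in enumerate(missed, start=1):
        misses += int(was_missed)
        estimates.append(misses / size)
    return estimates


# Hypothetical outcomes for one word across six committee members.
missed = [True, False, True, True, False, False]
estimates = running_difficulty(missed)
print([round(e, 2) for e in estimates])
```

With few members the estimate swings widely; it settles as the committee grows, which is the behavior such an analysis would look for.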

[Figure: the example sentence 周日开拍的这场拍卖会起拍价 2.5万美元。 with two segmentations, annotated with per-word weights d (d_i for word i) and d' (d_i' for word i)]
周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。
周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。
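The exact formula behind the weighted metric is not recoverable from this transcript. As a rough illustration of the reward side of the idea only, the sketch below computes a difficulty-weighted recall: each correctly segmented reference word contributes its difficulty rather than a flat count of 1. The weights, the function name `weighted_recall` and the candidate `cand_b` are all hypothetical, and this is not the authors' definition.

```python
def spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def weighted_recall(reference, weights, candidate):
    """Share of total reference difficulty recovered by the candidate."""
    cand_spans = spans(candidate)
    gained, pos = 0.0, 0
    for word, weight in zip(reference, weights):
        if (pos, pos + len(word)) in cand_spans:
            gained += weight
        pos += len(word)
    return gained / sum(weights)


reference = "周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。".split()
# Hypothetical per-word difficulties, aligned with the reference words.
weights = [0.2, 0.5, 0.2, 0.3, 0.2, 0.7, 1.0, 0.3, 0.0, 0.0]

cand_a = "周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。".split()
# Hypothetical candidate that gets 拍卖会 right but splits 开拍.
cand_b = "周日 开 拍 的 这 场 拍卖会 起 拍价 2.5万 美元 。".split()

print(weighted_recall(reference, weights, cand_a))
print(weighted_recall(reference, weights, cand_b))
```

Both candidates recover 8 of the 10 reference words, so unweighted recall ties them, but the weighted score favors the one that gets the harder word right.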

Interpreting Difficulty

Validity and Reliability. Validity: correlation with human intuition. Reliability: correlation in parallel tests.

Validity: Evaluation of NLPCC 2015. We demonstrate the effectiveness of the proposed method in a real evaluation by re-analyzing the results submitted to the NLPCC 2015 Shared Task. We select the submissions of all 7 participants in the closed track and of all 5 participants in the open track, and compare the standard precision, recall and F-score with our new metric.

Collecting Human Judgment. To tell whether the standard metric or the proposed metric is more reasonable, we asked three experts to evaluate the quality of the participants' submissions. We randomly selected 50 test sentences from the WB dataset. For each test sentence, we presented all the submitted candidate segmentations to the human judges in random order. The judges were then asked to choose the best candidate(s), i.e. those with the highest segmentation quality, as well as the second-best candidate(s) among all the submissions. The judges had no access to the source of the sentences.
[Figure: example sentence 武汉一高校淑女班别样“军训”：穿高跟鞋学剪纸 ("A 'ladies' class' at a Wuhan university holds an unusual 'military training': wearing high heels and learning paper-cutting"), with the candidate segmentations from participants P1, P6, P4, P7, P5, P2 and P3 apparently shown as options 1 to 7; the segmentation boundaries are not preserved in this transcript. Picks recorded under the columns Best and Second Best: Human I: 4, 3, 6; Human II: 4, 5.]

Validity: Comparison of f1, fb, and Human Judgment on NLPCC2015 Shared Task

Reliability: Parallel Test. We randomly split the test dataset into two halves and evaluate the different models on the first half and then on the second half. Under our proposed evaluation metric, the models' performances on the two parallel tests are significantly correlated. We also include the SIGHAN datasets PKU, MSR, NCC and SXU.
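The parallel-test procedure can be sketched as follows: split the test sentences at random, score every model on each half, and correlate the two score lists. The per-sentence scores below are synthetic, and averaging per-sentence scores is a simplification; a real evaluation would compute the corpus-level metric on each half.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def parallel_test(per_sentence_scores, seed=0):
    """per_sentence_scores maps a model name to its score on each sentence.

    Randomly splits the sentence indices into two halves and returns the
    correlation between the models' mean scores on the two halves.
    """
    n = len(next(iter(per_sentence_scores.values())))
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half_a, half_b = idx[: n // 2], idx[n // 2:]
    mean_a = [sum(s[i] for i in half_a) / len(half_a)
              for s in per_sentence_scores.values()]
    mean_b = [sum(s[i] for i in half_b) / len(half_b)
              for s in per_sentence_scores.values()]
    return pearson(mean_a, mean_b)


# Synthetic scores: five hypothetical models of different quality.
rng = random.Random(1)
scores = {
    f"model_{m}": [quality + rng.uniform(-0.1, 0.1) for _ in range(200)]
    for m, quality in enumerate([0.70, 0.75, 0.80, 0.85, 0.90])
}
r = parallel_test(scores)
print(round(r, 3))
```

A correlation close to 1 means the metric ranks the models consistently on both halves, which is what reliability requires.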

Reliability: Correlation between fb of parallel test sets

Visualization

Conclusion. We propose a new psychometric-inspired method for evaluating Chinese word segmentation that weights every word in the test dataset, following the methodology applied to psychological tests and standardized exams. The weighted evaluation metrics give more reasonable and more distinguishable scores and correlate well with human judgment. The proposed evaluation metric can easily be extended to word segmentation for other languages (e.g. Japanese) and to other NLP tasks based on sequence labelling.

Thanks for listening! All comments are welcome.