Peng Qian, Xipeng Qiu, Xuanjing Huang

A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation
Peng Qian, Xipeng Qiu, Xuanjing Huang
School of Computer Science, Fudan University

An Appetizer Case
Different segmentation results can achieve the same precision, recall and F1-score.
白藜芦醇是一种酚类物质 ("Resveratrol is a phenolic substance")
P1: 白藜芦醇是一种酚类物质
P2: 白藜芦醇是一种酚类物质

Motivation
With successive improvements, the standard metric can hardly distinguish state-of-the-art word segmentation systems any more. Scores are high because the distribution of word difficulty is unbalanced. Human judgment depends on how difficult each segmentation decision is. A segmenter should earn more credit for correctly segmenting a difficult word than an easy one. Conversely, it should take a heavier penalty for wrongly segmenting an easy word than a difficult one.

Metric Design From Psychometrics to NLP System Evaluation

Item Analysis in Psychological Test
Each item in an exam is given a credit. A difficult item earns more credit than an easy one. According to psychometric theory, a reasonable difficulty can be computed as the ratio of students who fail to answer the item correctly.

Transfer the Idea in Psychometrics to NLP System Evaluation
Standardized test: many subjects. Define the difficulty of an item in the test according to the subjects' collective performance.
NLP system evaluation (e.g. word segmentation): diverse segmenters. Define the difficulty of a test case according to the collective performance of the NLP systems.
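The transfer can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the difficulty of a reference word is the fraction of committee segmenters that fail to produce exactly that word, mirroring the failure ratio used in item analysis. The names `to_spans` and `word_difficulty` and the three-member toy committee are hypothetical; the segmentations are taken from the example sentence used in this talk.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one word


def to_spans(words: List[str]) -> List[Span]:
    """Convert a segmented sentence into character-offset spans."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans


def word_difficulty(gold: List[str], committee: List[List[str]]) -> List[float]:
    """Assumed definition: share of committee members that miss each gold word."""
    member_spans: List[Set[Span]] = [set(to_spans(seg)) for seg in committee]
    k = len(member_spans)
    return [
        sum(span not in spans for spans in member_spans) / k
        for span in to_spans(gold)
    ]


# Reference segmentation (assumed to be the first one shown in the talk).
gold = "周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。".split()
# Toy committee of three base segmenters.
committee = [
    "周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。".split(),
    "周日 开 拍 的 这 场 拍卖 会 起 拍 价 2.5万 美元 。".split(),
    "周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。".split(),
]

difficulty = word_difficulty(gold, committee)
for word, d in zip(gold, difficulty):
    print(word, round(d, 2))
```

In this toy run, words that every committee member gets right (such as 周日) receive difficulty 0, while 拍卖会 and 起拍价, which two of the three members split, receive a higher value.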

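Example: a committee of K base segmenters segments the sentence 周日开拍的这场拍卖会起拍价 2.5万美元。 ("The auction that opens on Sunday has a starting price of 25,000 US dollars."), and each word i is assigned a difficulty d_i.
周日　开拍　的　这　场　拍卖会　起拍价　2.5万　美元　。
周日　开　拍　的　这　场　拍卖　会　起　拍　价　2.5万　美元　。
周日　开拍　的　这　场　拍卖　会　起　拍价　2.5万　美元　。
周日开　拍　的　这场　拍卖　会起　拍价　2.5万　美元　。
周日　开　拍　的　这　场　拍卖会　起　拍价　2.5　万　美元　。
周日　开拍　的　这　场　拍卖会　起　拍价　2.5　万　美元　。
周日　开拍　的这　场　拍卖　会　起　拍价　2.5万　美元　。
······ (K outputs in total)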

Building a Committee of Base Segmenters
The Diversity of the Committee
Each base segmenter is constructed with a random combination of candidate feature templates and a sampling ratio of the training dataset.
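As a rough sketch of how such a diverse committee might be configured, the snippet below draws a random subset of feature templates and a random sampling ratio for each base segmenter. Everything here is an assumption made for illustration: the template names, the list of ratios, and the function names `sample_committee_configs` and `subsample` do not come from the talk, and training the segmenters themselves is left out.

```python
import random
from typing import Dict, List

# Hypothetical candidate feature templates (illustrative names only).
CANDIDATE_TEMPLATES = ["c-1", "c0", "c+1", "c-1c0", "c0c+1", "c-1c+1"]
# Hypothetical sampling ratios for the training dataset.
SAMPLING_RATIOS = [0.2, 0.4, 0.6, 0.8, 1.0]


def sample_committee_configs(k: int, seed: int = 0) -> List[Dict]:
    """Draw k random (feature templates, sampling ratio) configurations."""
    rng = random.Random(seed)
    configs = []
    for _ in range(k):
        n_templates = rng.randint(1, len(CANDIDATE_TEMPLATES))
        templates = sorted(rng.sample(CANDIDATE_TEMPLATES, n_templates))
        configs.append({"templates": templates, "ratio": rng.choice(SAMPLING_RATIOS)})
    return configs


def subsample(train_sentences: List[str], ratio: float, rng: random.Random) -> List[str]:
    """Keep a random fraction of the training sentences (at least one)."""
    n = max(1, int(len(train_sentences) * ratio))
    return rng.sample(train_sentences, n)


configs = sample_committee_configs(k=5, seed=42)
for cfg in configs:
    print(cfg["ratio"], cfg["templates"])
```

Each configuration would then be used to train one base segmenter on its own subsample, so that the committee members make different kinds of mistakes.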

Building a Committee of Base Segmenters
The Size of the Committee
We analyze how the judgment of a word's difficulty changes as the size of the committee increases.

Figure: two segmentations of 周日开拍的这场拍卖会起拍价 2.5万美元。, annotated with per-word difficulties d_i and d'_i respectively:
周日　开拍　的　这　场　拍卖会　起拍价　2.5万　美元　。
周日　开拍　的　这　场　拍卖　会　起　拍价　2.5万　美元　。

Interpreting Difficulty

Validity and Reliability
Validity: correlation with human intuition
Reliability: correlation in parallel tests

Validity: Evaluation of NLPCC2015
We demonstrate the effectiveness of the proposed method in a real evaluation by re-analyzing the submissions to the NLPCC 2015 Shared Task. We select the submissions of all 7 participants in the closed track and of all 5 participants in the open track, and compare the standard precision, recall and F-score with our new metric.
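For reference, the standard word-level metric can be sketched as follows. This is a minimal illustration rather than the shared task's scoring script: the function names are hypothetical, and the first segmentation of the example sentence from this talk is assumed to be the reference. The two candidates make different errors (one splits 拍卖会, the other splits 2.5万) yet receive identical standard scores.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one word


def to_spans(words: List[str]) -> Set[Span]:
    """Convert a segmented sentence into a set of character-offset spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Standard unweighted precision, recall and F1 over word spans."""
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = "周日 开拍 的 这 场 拍卖会 起拍价 2.5万 美元 。".split()
cand_a = "周日 开拍 的 这 场 拍卖 会 起 拍价 2.5万 美元 。".split()
cand_b = "周日 开拍 的 这 场 拍卖会 起 拍价 2.5 万 美元 。".split()

score_a = prf(gold, cand_a)
score_b = prf(gold, cand_b)
print(score_a)
print(score_b)
```

Both candidates contain 12 words of which 8 match the reference, so the unweighted metric cannot separate them; a difficulty-weighted metric could.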

Collecting Human Judgment
Example sentence: 武汉一高校淑女班别样 “ 军训 ” ：穿高跟鞋学剪纸 (roughly: a "ladies' class" at a Wuhan university holds an unusual "military training": wearing high heels and learning paper-cutting)
To tell whether the standard metric or the proposed metric is more reasonable, we asked three experts to evaluate the quality of the participants' submissions. We randomly selected 50 test sentences from the WB dataset. For each test sentence, we presented all the submitted candidate segmentations to the human judges in random order. The judges were then asked to choose the best candidate(s), i.e. those with the highest segmentation quality, as well as the second-best candidate(s) among all the submissions. The judges had no access to the source of each candidate.
Figure: sample annotation sheet for the example sentence. Candidates 1 to 7 are listed against participants P1, P6, P4, P7, P5, P2, P3; under the Best and Second Best columns, Human I marked 4, 3, 6 and Human II marked 4, 5.

Validity: Comparison of f1, fb, and Human Judgment on NLPCC2015 Shared Task

Reliability: Parallel Test
We randomly split the test dataset into two halves and evaluate the different models on the first half and then on the second half. Under our proposed evaluation metric, the models' performances on the two parallel tests are significantly correlated. We also include the SIGHAN datasets: PKU, MSR, NCC, SXU.
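The parallel-test procedure can be sketched as below. This is a simplified illustration under stated assumptions: each model's score on a half is taken as the mean of hypothetical per-sentence scores (a real evaluation would compute the metric over the whole half), the correlation is plain Pearson correlation without a significance test, and the function names and toy numbers are invented for the example.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def split_halves(n_items: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split item indices 0..n_items-1 into two halves."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    mid = n_items // 2
    return idx[:mid], idx[mid:]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


def parallel_test_correlation(scores: Dict[str, List[float]], seed: int = 0) -> float:
    """Correlate model scores on the first half with those on the second half."""
    n_items = len(next(iter(scores.values())))
    first, second = split_halves(n_items, seed)
    half1 = [sum(s[i] for i in first) / len(first) for s in scores.values()]
    half2 = [sum(s[i] for i in second) / len(second) for s in scores.values()]
    return pearson(half1, half2)


# Hypothetical per-sentence scores of three models on six test sentences.
toy_scores = {
    "model_a": [0.95, 0.90, 0.92, 0.97, 0.91, 0.94],
    "model_b": [0.85, 0.88, 0.83, 0.86, 0.84, 0.87],
    "model_c": [0.70, 0.74, 0.72, 0.69, 0.73, 0.71],
}
r = parallel_test_correlation(toy_scores, seed=0)
print(round(r, 3))
```

A correlation close to 1 indicates that the metric ranks the models consistently across the two halves, which is the sense of reliability used here.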

Reliability: Correlation between fb of parallel test sets

Visualization

Conclusion
We propose a new psychometric-inspired evaluation method for Chinese word segmentation that weights all the words in the test dataset, following the methodology applied to psychological tests and standardized exams. The weighted evaluation metric gives more reasonable and more distinguishable scores and correlates well with human judgment. The proposed metric can easily be extended to word segmentation for other languages (e.g. Japanese) and to other NLP tasks based on sequence labelling.

Thanks for Listening! All comments are welcome
