
1 Fake News on Weibo
Deep Learning Research & Application Center
19 December 2017
Claire Li

2 Content
Preprocessing the Weibo data into variable-length intervals
Chinese text segmentation tools

3 Preprocessing the Weibo data by interval length
Total: 4,664 events (original posts)
Maximum number of posts: 59,318, from :07:05 to :02:35
Minimum number of posts: 10, from Wed, 11 Nov :50:31 to 11 Nov :26:44
Minimum time span: 23 posts, from :00:03 to :07:20 (7 mins 17 seconds)
It is impractical to set the RNN sequence length to the number of posts, with each post as one input instance
Instead, batch the posts within time intervals to form an RNN time series, given an interval length N (the RNN sequence length) that is tuned from experiments (see the sketch below)
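A minimal sketch of this bucketing step, not the authors' code: the (timestamp, text) input format and the helper name batch_posts_into_intervals are assumptions made here for illustration. Each event's posts are split into N equal-length time buckets, and each bucket becomes one RNN time step.

def batch_posts_into_intervals(posts, n_intervals):
    """Split one event's posts into n_intervals equal time buckets.

    posts: list of (timestamp, text) pairs, timestamp in POSIX seconds
    (an assumed format). Empty buckets are kept, so every event yields
    a sequence of exactly n_intervals time steps.
    """
    times = [t for t, _ in posts]
    start, end = min(times), max(times)
    span = max(end - start, 1e-9)  # guard against a zero-length event
    buckets = [[] for _ in range(n_intervals)]
    for t, text in posts:
        idx = min(int((t - start) / span * n_intervals), n_intervals - 1)
        buckets[idx].append(text)
    return buckets

# Toy usage: 4 posts over 100 seconds, N = 2 intervals
posts = [(0.0, "p1"), (10.0, "p2"), (60.0, "p3"), (100.0, "p4")]
print(batch_posts_into_intervals(posts, 2))  # [['p1', 'p2'], ['p3', 'p4']]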

4 [2] Ex1: N=96, 56,155 posts, from :03:15 to :31:50, interval-length = ts, json
Ex2: N=103,135 posts, from :08:25 to :15:14, interval-length = 93 ts, json

5 Popular Chinese segmentation tools
THULAC, LTP, ICTCLAS (2015 edition), jieba (C++ version), and other representative Chinese word segmentation tools developed in China
Test data from Microsoft Research and from the PKU test set
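Of these, jieba is the easiest to try directly from Python (pip install jieba); jieba.lcut is its list-returning segmentation call, though the exact split depends on the dictionary and version:

import jieba

# Segment a sentence into a list of tokens; typical output is
# ['我', '来到', '北京', '清华大学'], but it varies with the dictionary.
print(jieba.lcut("我来到北京清华大学"))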

6

7 THULAC [1] has the following features:
Strong capability: trained on what is currently the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters; access requires filling in the resource request form 资源申请表.doc), giving the model strong tagging capability.
High accuracy: on the standard Chinese Treebank (CTB5) dataset it reaches an F1 of 97.3% for segmentation and 92.9% for POS tagging, on par with the best reported methods on that dataset.
Fast: joint segmentation and POS tagging runs at 300KB/s, about 150,000 characters per second; segmentation alone reaches 1.3MB/s.
Java, C++, and Python versions available
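A small usage sketch of the THULAC Python package (pip install thulac); thulac.thulac and its cut method are the package's documented entry points, but treat the exact output as version- and model-dependent:

import thulac

seg = thulac.thulac(seg_only=True)           # segmentation only
print(seg.cut("我爱北京天安门", text=True))  # space-separated words, e.g. "我 爱 北京 天安门"

tagger = thulac.thulac()                     # segmentation + POS tagging
print(tagger.cut("我爱北京天安门"))          # list of [word, pos] pairs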

8 Many engineering optimizations were also made, e.g. storing the training features in a double-array trie (DAT) to compress the training model, and adding punctuation features to improve segmentation accuracy
Uses a character-based structured perceptron (SP) segmentation model
The SP models a score function under the Maximum Entropy criterion; the segmentation result is the tag sequence that maximizes this score function
Uses a word-lattice based re-ranking algorithm
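To illustrate what "character-based" means here: each character receives a boundary tag (commonly B/M/E/S), and the predicted tag sequence maps back to words. The toy decoder below only illustrates that labeling scheme under those assumed tags; it is not THULAC's actual decoder.

def tags_to_words(chars, tags):
    """Recover words from character-level BMES tags.

    B = begins a multi-character word, M = middle, E = ends it,
    S = a single-character word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary was predicted here
            words.append(current)
            current = ""
    if current:                # tolerate a malformed trailing tag
        words.append(current)
    return words

print(tags_to_words("我爱北京", ["S", "S", "B", "E"]))  # ['我', '爱', '北京']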

9 Related works
[1] 孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC：一个高效的中文词法分析工具包 (THULAC: An Efficient Chinese Lexical Analysis Toolkit).
[2] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, Meeyoung Cha. Detecting Rumors from Microblogs with Recurrent Neural Networks. IJCAI-16.

