
1 Fake News on Weibo
Deep Learning Research & Application Center
19 December 2017
Claire Li

2 Content
Preprocessing the Weibo data into variable-length intervals
Chinese text segmentation tools

3 Preprocessing the Weibo data by interval length
Total: 4,664 events (original posts)
Maximum number of posts: 59,318, from :07:05 to :02:35
Minimum number of posts: 10, from Wed, 11 Nov :50:31 to 11 Nov :26:44
Minimum time span: 23 posts, from :00:03 to :07:20 (7 mins 17 seconds)
It is impractical to set the RNN sequence length to the number of posts, with each post as one input instance
Instead, batch the posts within time intervals to form an RNN time series, given an interval length N (the RNN sequence length) that is tuned from experiments (see the sketch below)
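A minimal sketch of this bucketing step, not the authors' code: the (timestamp, text) input format and the helper name batch_posts_into_intervals are assumptions made here for illustration. Each event's posts are split into N equal-length time buckets, and each bucket becomes one RNN time step.

def batch_posts_into_intervals(posts, n_intervals):
    """Split one event's posts into n_intervals equal time buckets.

    posts: list of (timestamp, text) pairs, timestamp in POSIX seconds
    (an assumed format). Empty buckets are kept, so every event yields
    a sequence of exactly n_intervals time steps.
    """
    times = [t for t, _ in posts]
    start, end = min(times), max(times)
    span = max(end - start, 1e-9)  # guard against a zero-length event
    buckets = [[] for _ in range(n_intervals)]
    for t, text in posts:
        idx = min(int((t - start) / span * n_intervals), n_intervals - 1)
        buckets[idx].append(text)
    return buckets

# Toy usage: 4 posts over 100 seconds, N = 2 intervals
posts = [(0.0, "p1"), (10.0, "p2"), (60.0, "p3"), (100.0, "p4")]
print(batch_posts_into_intervals(posts, 2))  # [['p1', 'p2'], ['p3', 'p4']]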

4 [2] Ex1: N=96, 56,155 posts, from :03:15 to :31:50, interval-length = ts, json
Ex2: N=103,135 posts, from :08:25 to :15:14, interval-length = 93 ts, json

5 Popular Chinese segmentation tools
THULAC, LTP, ICTCLAS (2015 edition), jieba (C++ version), and other representative Chinese word segmentation tools developed in China
Test data from Microsoft Research and from the PKU test set
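Of these, jieba is the easiest to try directly from Python (pip install jieba); jieba.lcut is its list-returning segmentation call, though the exact split depends on the dictionary and version:

import jieba

# Segment a sentence into a list of tokens; typical output is
# ['我', '来到', '北京', '清华大学'], but it varies with the dictionary.
print(jieba.lcut("我来到北京清华大学"))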

6

7 THULAC [1] has the following features:
Strong capability: trained on what is currently the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters; access requires filling in the resource request form 资源申请表.doc), giving the model strong tagging capability.
High accuracy: on the standard Chinese Treebank (CTB5) dataset it reaches an F1 of 97.3% for segmentation and 92.9% for POS tagging, on par with the best reported methods on that dataset.
Fast: joint segmentation and POS tagging runs at 300KB/s, about 150,000 characters per second; segmentation alone reaches 1.3MB/s.
Java, C++, and Python versions available
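A small usage sketch of the THULAC Python package (pip install thulac); thulac.thulac and its cut method are the package's documented entry points, but treat the exact output as version- and model-dependent:

import thulac

seg = thulac.thulac(seg_only=True)           # segmentation only
print(seg.cut("我爱北京天安门", text=True))  # space-separated words, e.g. "我 爱 北京 天安门"

tagger = thulac.thulac()                     # segmentation + POS tagging
print(tagger.cut("我爱北京天安门"))          # list of [word, pos] pairs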

8 Many engineering optimizations were also made, e.g. storing the training features in a double-array trie (DAT) to compress the training model, and adding punctuation features to improve segmentation accuracy
Uses a character-based structured perceptron (SP) segmentation model
The SP models a score function under the Maximum Entropy criterion; the segmentation result is the tag sequence that maximizes this score function
Uses a word-lattice based re-ranking algorithm
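To illustrate what "character-based" means here: each character receives a boundary tag (commonly B/M/E/S), and the predicted tag sequence maps back to words. The toy decoder below only illustrates that labeling scheme under those assumed tags; it is not THULAC's actual decoder.

def tags_to_words(chars, tags):
    """Recover words from character-level BMES tags.

    B = begins a multi-character word, M = middle, E = ends it,
    S = a single-character word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary was predicted here
            words.append(current)
            current = ""
    if current:                # tolerate a malformed trailing tag
        words.append(current)
    return words

print(tags_to_words("我爱北京", ["S", "S", "B", "E"]))  # ['我', '爱', '北京']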

9 Related works
[1] 孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC：一个高效的中文词法分析工具包 (THULAC: An Efficient Chinese Lexical Analysis Toolkit).
[2] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, Meeyoung Cha. Detecting Rumors from Microblogs with Recurrent Neural Networks. IJCAI-16.

