Fake News on Weibo
Deep Learning Research & Application Center
19 December 2017
Claire Li
Contents
- Preprocessing the Weibo data with variable-length interval sizes
- Chinese text segmentation tools
Preprocessing the Weibo data by interval length
- Total: 4,664 events (original posts)
- Maximum number of posts per event: 59,318, from 2012-10-16 11:07:05 to 2014-11-04 19:02:35
- Minimum number of posts per event: 10, from 2015-11-11 09:50:31 to 2015-11-11 18:26:44
- Minimum time span: 23 posts from 2015-11-21 15:00:03 to 2015-11-21 15:07:20 (7 min 17 s)
- Setting the RNN sequence length to the number of posts, with each post as one input instance, is impractical
- Instead, batch the posts within time intervals to form an RNN time series, given a sequence length N (the number of intervals) tuned by experiments; see the sketch after the examples below
[2] Ex1: N = 96, 56,155 posts, from 2012-08-06 00:03:15 to 2014-12-31 12:31:50, interval-length = 248400 ts, JSON
Ex2: N = 103, 135 posts, from 2010-12-23 10:08:25 to 2010-12-30 23:15:14, interval-length = 93 ts, JSON
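The sketch below illustrates the idea of turning an event's posts into an RNN input sequence of length N. It uses simple equal-width time buckets; the variable-length interval selection described in the slides (and in [2]) is not reproduced here, and the function name and post format (list of (timestamp, text) tuples) are assumptions for illustration.

```python
# Minimal sketch: split one event's posts into N time intervals and
# concatenate the posts in each interval, so the event becomes an RNN
# input sequence of length N (N is tuned by experiments).

def batch_into_intervals(posts, n_intervals):
    posts = sorted(posts, key=lambda p: p[0])      # sort by unix timestamp
    start, end = posts[0][0], posts[-1][0]
    span = max(end - start, 1)                     # avoid a zero-length span
    buckets = [[] for _ in range(n_intervals)]
    for ts, text in posts:
        idx = min(int((ts - start) * n_intervals / span), n_intervals - 1)
        buckets[idx].append(text)
    # each time step holds the concatenated text of its interval (may be empty)
    return [" ".join(bucket) for bucket in buckets]

# e.g. with N = 96, a 56,155-post event collapses to a sequence of 96 steps
```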
Popular Chinese segmentation tools
- THULAC, LTP-3.2.0, ICTCLAS (2015 edition), jieba (C++ version), and other representative Chinese word segmentation tools developed in China
- Test data: from Microsoft Research and from the PKU test set
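For a quick sense of what these tools produce, a segmentation call with the jieba Python package looks like the following (the sample sentence is illustrative only):

```python
import jieba

# jieba.lcut returns the segmented words of a sentence as a list
words = jieba.lcut("北京大学生前来应聘")
print("/".join(words))
```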
THULAC [1] has the following features:
- Strong capability: trained on the largest manually segmented and POS-tagged Chinese corpus integrated to date (about 58 million characters; access requires filling in the resource application form "资源申请表.doc"), giving the model strong tagging ability.
- High accuracy: on the standard Chinese Treebank (CTB5) dataset, segmentation F1 reaches 97.3% and POS-tagging F1 reaches 92.9%, on par with the best reported results on this dataset.
- Fast: joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second; segmentation alone reaches 1.3 MB/s.
- Java, C++, and Python versions are available.
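A minimal usage sketch of the THULAC Python package (assuming `pip install thulac`); the calls follow the thulac-python README and may differ slightly across versions:

```python
import thulac

seg = thulac.thulac(seg_only=True)             # segmentation only, no POS tags
print(seg.cut("我爱北京天安门", text=True))     # returns a space-separated string of words
```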
Many engineering optimizations are applied, e.g. training features are stored in a double-array trie (DAT) to compress the trained model, and punctuation features are added to improve segmentation accuracy.
- Uses a character-based structured perceptron (SP) segmentation model.
- The SP models the score function under the Maximum Entropy criterion; the segmentation result is the tag sequence that maximizes the score (see the sketch below).
- Uses a word-lattice based re-ranking algorithm.
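To make the character-based tagging idea concrete, here is a small sketch of segmentation as B/M/E/S sequence tagging with a linear score decoded by Viterbi. The feature templates and weight dictionary are illustrative stand-ins, not THULAC's actual model; THULAC additionally compresses features with a DAT and applies word-lattice re-ranking on top of this kind of decoder.

```python
TAGS = ["B", "M", "E", "S"]   # begin / middle / end of a word, single-char word

def features(chars, i, tag, prev_tag):
    """Hypothetical feature templates: neighbouring characters plus the tag bigram."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"c0={chars[i]}|t={tag}", f"c-1={prev_c}|t={tag}",
            f"c+1={next_c}|t={tag}", f"t-1={prev_tag}|t={tag}"]

def score(weights, chars, i, tag, prev_tag):
    # linear score: sum of weights of the fired features
    return sum(weights.get(f, 0.0) for f in features(chars, i, tag, prev_tag))

def viterbi(weights, sentence):
    """Return the tag sequence maximizing the summed linear score."""
    chars = list(sentence)
    col = {t: (score(weights, chars, 0, t, "<s>"), [t]) for t in TAGS}
    for i in range(1, len(chars)):
        nxt = {}
        for t in TAGS:
            nxt[t] = max((s + score(weights, chars, i, t, pt), path + [t])
                         for pt, (s, path) in col.items())
        col = nxt
    _, best_path = max(col.values())
    return best_path

def tags_to_words(sentence, tags):
    """Convert B/M/E/S tags back into segmented words."""
    words, cur = [], ""
    for ch, t in zip(sentence, tags):
        cur += ch
        if t in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words
```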
Related works
[1] Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016. http://thulac.thunlp.org/
[2] Detecting Rumors from Microblogs with Recurrent Neural Networks. IJCAI-16.