How Are Drugs and Diseases Related? Li Zhiheng (李智恒)
Task: BioCreative V Chemical-induced diseases relation extraction (CID)
Papers covered:
  1. UTH-CCB@BioCreative V CDR Task: Identifying Chemical-induced Disease Relations in Biomedical Text
  2. RELigator: Chemical-disease relation extraction using prior knowledge and textual information
Task introduction: BioCreative V Chemical-induced diseases relation extraction (CID)
UTH-CCB@BioCreative V CDR Task
  Sentence-level classifier (Cs): CID pairs located in the same sentence
  Abstract-level classifier (CD): all candidate CID pairs
  Cs classifier features: context words with position; knowledgebase features; others
Cs features 1: Context words with position
  Example: "C_D010634-induced D_D004409 in a D_D009422 child."
  Target entities: C_D010634, D_D009422
  Unigram and bigram words before, between and after the target entities
  Other entities between the targets are replaced by their entity type: "C_D010634-induced D_disease in a D_D009422 child."
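The context-word features above can be sketched as follows. This is a minimal illustration, not the UTH-CCB implementation: the tokenizer, the feature-name format, and the `C_chemical` placeholder are assumptions (the slide only shows `D_disease`).

```python
import re

# Assumed tokenizer: entity tags, words, and single punctuation marks.
TOKEN = re.compile(r"[CD]_D\d+|\w+|[^\w\s]")
ENTITY = re.compile(r"[CD]_D\d+")
# Placeholder per entity type; "C_chemical" is an assumption.
TYPE_NAME = {"C": "C_chemical", "D": "D_disease"}


def mask_other_entities(tokens, targets):
    """Replace every non-target entity tag with its entity type."""
    return [
        TYPE_NAME[tok[0]] if ENTITY.fullmatch(tok) and tok not in targets else tok
        for tok in tokens
    ]


def context_features(sentence, chem, dis):
    """Unigrams and bigrams before, between and after the two targets."""
    tokens = mask_other_entities(TOKEN.findall(sentence), {chem, dis})
    i, j = sorted((tokens.index(chem), tokens.index(dis)))
    regions = {
        "before": tokens[:i],
        "between": tokens[i + 1:j],
        "after": tokens[j + 1:],
    }
    feats = []
    for name, words in regions.items():
        feats += [f"{name}_uni={w}" for w in words]
        feats += [f"{name}_bi={a} {b}" for a, b in zip(words, words[1:])]
    return feats


feats = context_features(
    "C_D010634-induced D_D004409 in a D_D009422 child.",
    chem="C_D010634",
    dis="D_D009422",
)
```

Prefixing each n-gram with its region keeps the position information, so "induced" between the targets is a different feature from "induced" after them.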
Cs features 2: Knowledgebase features
  All relations of the chemical and disease pair in CTD, MEDI, and SIDER
  MeSH® tree structures of the entities
CTD: Comparative Toxicogenomics Database (http://ctdbase.org/)
  Studies the effects of environmental chemicals on human health
CTD
  Entities studied: chemicals/drugs; genes/proteins; diseases; taxa; phenotypes (the observable traits of an organism that result from the interaction of genotype and environment)
  Manually curated: chemical–gene/protein interactions; chemical–disease relationships; gene–disease relationships; chemical–phenotype relationships
CTD data categories
  Chemicals, Diseases, Genes
  Chemical–Gene/Protein Interactions
  Gene–Disease Associations
  Chemical–Disease Associations
  Gene–Gene Interactions
  References
  Organisms
  Gene Ontology
  Pathways
  Exposures
CTD Chemical–Disease Associations
  Download file: CTD_chemicals_diseases.xml.gz
  Each association is labeled as therapeutic, as marker/mechanism, or left empty
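One way to turn these association labels into knowledgebase features for a chemical-disease pair is sketched below. The rows and identifiers are made up for illustration, the feature names are hypothetical, and the sketch assumes the download has already been parsed into (chemical, disease, evidence) tuples.

```python
from collections import defaultdict

# Made-up rows standing in for parsed CTD records:
# (chemical ID, disease ID, evidence label; "" when the label is empty).
ROWS = [
    ("CHEM_1", "DIS_1", "marker/mechanism"),
    ("CHEM_1", "DIS_2", "therapeutic"),
    ("CHEM_2", "DIS_1", ""),
]


def build_index(rows):
    """Map each (chemical, disease) pair to the set of its evidence labels."""
    index = defaultdict(set)
    for chem, dis, evidence in rows:
        index[(chem, dis)].add(evidence or "unlabeled")
    return index


def ctd_features(index, chem, dis):
    """Binary features describing what the knowledgebase says about the pair."""
    labels = index.get((chem, dis), set())
    return {
        "ctd_marker": "marker/mechanism" in labels,
        "ctd_therapeutic": "therapeutic" in labels,
        "ctd_unlabeled": "unlabeled" in labels,
        "ctd_any": bool(labels),
    }


index = build_index(ROWS)
pair_features = ctd_features(index, "CHEM_1", "DIS_1")
```

A marker/mechanism label points toward a chemical-induced disease, while a therapeutic label points away from one, so keeping them as separate features lets the classifier weigh them differently.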
MEDI: an Ensemble MEDication Indication Resource (https://medschool.vanderbilt.edu/cpm/center-precision-medicine-blog/medi-ensemble-medication-indication-resource)
  A medication-indication resource intended for use with electronic medical records
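SIDER: Side Effect Resource (http://sideeffects.embl.de/)
  Adverse drug reactions recorded for marketed medicines
  Information extracted from public documents and package inserts
  Available information: side effect frequency, drug and side effect classifications, and links to further information (e.g. drug-target relations)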
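MeSH® tree structures
  The tree structure makes it possible to find headings that are narrower or broader than a given heading
  Extremities
    Amputation Stumps
    Lower Extremity
      Buttocks
      Foot
        Ankle
        Forefoot, Human
          Metatarsus
          Toes
            Hallux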
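The broader/narrower lookup can be sketched with prefix matching on dotted tree numbers. The numbers below are illustrative placeholders chosen to mirror the hierarchy on the slide; they are not the real MeSH tree numbers.

```python
# Illustrative tree numbers (placeholders, NOT real MeSH tree numbers).
TREE = {
    "Extremities": "X01",
    "Amputation Stumps": "X01.100",
    "Lower Extremity": "X01.200",
    "Buttocks": "X01.200.100",
    "Foot": "X01.200.200",
    "Ankle": "X01.200.200.100",
    "Forefoot, Human": "X01.200.200.200",
    "Metatarsus": "X01.200.200.200.100",
    "Toes": "X01.200.200.200.200",
    "Hallux": "X01.200.200.200.200.100",
}


def is_broader(a, b):
    """True if heading a is a proper ancestor of heading b."""
    return TREE[b].startswith(TREE[a] + ".")


def broader(heading):
    """All ancestors of a heading, from most general to most specific."""
    ancestors = [h for h in TREE if is_broader(h, heading)]
    return sorted(ancestors, key=lambda h: len(TREE[h]))


def narrower(heading):
    """All descendants of a heading, in tree-number order."""
    return sorted((h for h in TREE if is_broader(heading, h)), key=TREE.get)


hallux_ancestors = broader("Hallux")
foot_descendants = narrower("Foot")
```

Appending the dot before the prefix test avoids false matches between sibling numbers that merely share leading digits.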
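Cs features 3: Others
  Mentions and normalized values of entities
  Core chemicals: the chemical with the highest frequency, or one that occurs in the title
Sentence labels:
  CID-SA
    +1: all sentences that contain a CID relation pair
    -1: sentences that do not contain a CID relation pair
  CID-SM
    +1: sentences manually annotated as actually expressing the relation
    -1: sentences manually annotated as containing a CID pair but not expressing the relation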
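A minimal sketch of core-chemical selection follows. It reads the "or" on the slide as a union (every title chemical plus the most frequent chemical), which is an assumption; the function name and inputs are hypothetical.

```python
from collections import Counter


def core_chemicals(title_mentions, abstract_mentions):
    """Chemicals that occur in the title or have the highest mention count."""
    counts = Counter(title_mentions) + Counter(abstract_mentions)
    if not counts:
        return set()
    top = max(counts.values())
    most_frequent = {chem for chem, n in counts.items() if n == top}
    return most_frequent | set(title_mentions)


core = core_chemicals(
    title_mentions=["CHEM_1"],
    abstract_mentions=["CHEM_2", "CHEM_2", "CHEM_2", "CHEM_3"],
)
```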
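CD classifier
  Feature groups (2) and (3) of Cs (knowledgebase features and core chemical)
  Number of sentences between the entities
  Trigger words
Final output
  For all CID pairs: the combined predictions of Cs and CD
  If nothing is extracted, the CID pairs that involve the core chemical are added to the final result set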
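The combination step with its fallback can be sketched as below. Combining the two classifiers by set union is an assumption drawn from the "Cs+CD" wording on the slide, and all names are hypothetical.

```python
def final_relations(cs_pred, cd_pred, candidates, core):
    """Merge sentence-level and abstract-level predictions for one document.

    cs_pred, cd_pred: sets of (chemical, disease) pairs predicted positive.
    candidates: all candidate (chemical, disease) pairs in the document.
    core: set of core chemicals of the document.
    """
    result = set(cs_pred) | set(cd_pred)
    if not result:
        # Fallback: nothing was extracted, so keep pairs with a core chemical.
        result = {(chem, dis) for chem, dis in candidates if chem in core}
    return result


candidates = [("CHEM_1", "DIS_1"), ("CHEM_2", "DIS_1")]
merged = final_relations({("CHEM_2", "DIS_1")}, set(), candidates, {"CHEM_1"})
fallback = final_relations(set(), set(), candidates, {"CHEM_1"})
```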
Results
  The final models were trained on the training set plus the development set
  Automatic labeling gave better results than manual labeling
References
  CTD: The Comparative Toxicogenomics Database's 10th year anniversary: update 2015.
  MEDI: Development and evaluation of an ensemble resource linking medications to their indications. (2013)
  SIDER: A side effect resource to capture phenotypic effects of drugs. (2010)
RELigator: Chemical-disease relation extraction using prior knowledge and textual information
  Relation extraction: all co-occurring chemical-disease pairs are candidates, including pairs that cross the title-abstract border
  Features: knowledge-based features; statistical features; NLP features
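Knowledge-based features
  BRAIN: a graph database
    Holds relations among almost all UMLS entities (drawn from structured databases and Medline articles)
    Entity1 - connection - Entity2
      Each connection is labeled with its source, and different sources carry different weights
      Each connection is associated with a set of relations or predicates
    BRAIN provides a programming interface for querying the relation paths between two given entities
    Relation path: direct or indirect; each path has a confidence score that measures how closely the two entities are connected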
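Statistical features
  Frequency in the document of the chemical, the disease, and the chemical-disease pair
  Between the chemical and the disease: 1. the minimum sentence distance 2. the minimum word distance
  Whether the chemical or the disease appears in the title, or whether both do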
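These counts and distances can be computed as in the sketch below. It assumes the document is a list of tokenized sentences with the title first and entity mentions already replaced by identifiers; counting the pair frequency as the number of sentences containing both entities is also an assumption.

```python
from itertools import product


def statistical_features(sentences, chem, dis):
    """Frequency, distance and title features for one chemical-disease pair."""
    chem_pos, dis_pos = [], []  # (sentence index, document-level word index)
    offset = 0
    for s_idx, sent in enumerate(sentences):
        for w_idx, tok in enumerate(sent):
            if tok == chem:
                chem_pos.append((s_idx, offset + w_idx))
            elif tok == dis:
                dis_pos.append((s_idx, offset + w_idx))
        offset += len(sent)
    pairs = list(product(chem_pos, dis_pos))
    title = sentences[0]  # assumption: the first sentence is the title
    return {
        "chem_freq": len(chem_pos),
        "dis_freq": len(dis_pos),
        "pair_freq": sum(1 for s in sentences if chem in s and dis in s),
        "min_sent_dist": min((abs(c[0] - d[0]) for c, d in pairs), default=None),
        "min_word_dist": min((abs(c[1] - d[1]) for c, d in pairs), default=None),
        "chem_in_title": chem in title,
        "dis_in_title": dis in title,
        "both_in_title": chem in title and dis in title,
    }


doc = [
    ["CHEM", "toxicity"],
    ["We", "studied", "CHEM", "."],
    ["Patients", "developed", "DIS", "."],
]
stats = statistical_features(doc, "CHEM", "DIS")
```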
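NLP features
  The Stanford CoreNLP parser produces a dependency tree for each sentence
  Governing verb: the first verb encountered when moving from a node up to the root of the parse tree
  Semantic role: the semantic role of an entity is reflected by its governing verb in the parse tree
  For the closest chemical and disease:
    Relating word (glossed on the slide as 担任, "to serve as")
    Governing verb of the relating word (glossed on the slide as 宣布, "to declare")
    Whether the chemical precedes the disease
    Whether a chemical-disease pair occurs in a lower-level subtree of the parse tree
    All governing verbs and their frequencies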
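The governing-verb lookup can be sketched on a hand-built dependency tree. The parse below is written by hand for illustration rather than produced by Stanford CoreNLP, and starting the search at the parent of the node (not at the node itself) is an assumption.

```python
ROOT = -1  # head index used for the root of the tree


def governing_verb(index, heads, tags, words):
    """Walk from a token toward the root and return the first verb found."""
    node = heads[index]
    while node != ROOT:
        if tags[node].startswith("VB"):  # Penn Treebank verb tags
            return words[node]
        node = heads[node]
    return None


# Hand-built parse of "The drug caused seizures in children".
words = ["The", "drug", "caused", "seizures", "in", "children"]
tags = ["DT", "NN", "VBD", "NNS", "IN", "NNS"]
heads = [1, 2, ROOT, 2, 5, 3]  # heads[i] is the index of token i's parent

verb_for_drug = governing_verb(1, heads, tags, words)
verb_for_children = governing_verb(5, heads, tags, words)
verb_for_root = governing_verb(2, heads, tags, words)
```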
Machine learning
  SVM classification
  Radial basis function (RBF) kernel
  Grid search for parameter tuning
  Ten-fold cross-validation
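This setup can be sketched with scikit-learn, which is an assumption since the slides do not name a toolkit. The data are synthetic and the parameter grid is arbitrary; neither comes from the RELigator paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the chemical-disease feature vectors.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Arbitrary grid over the two main RBF-SVM parameters.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# cv=10 runs ten-fold (stratified) cross-validation for every grid point.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`C` controls the penalty for misclassified training points and `gamma` the width of the RBF kernel; the two interact, which is why they are tuned jointly rather than one at a time.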
References
  BRAIN: Bio-IT World. Big BRAIN: Finding Connections in the Literature Flood with Euretos BRAIN [Internet]. Available from: http://www.bio-itworld.com/2014/7/1/big-brain-finding-gems-literature-flood-euretos-brain.html
  Euretos [Internet]. Available from: http://www.euretos.com.
Sequence Modeling: Recurrent and Recursive Nets. Zhang Jianhai (张建海)