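Bioinformatics. Ai Duiyuan (艾对元): 13893660097, aidy@gsau.edu.cn, Gansu Agricultural University. QQ: 156797555. http://blog.sciencenet.cn/u/eddy7777
Chapter 3: Bioinformatics Web Resources. Introduction to NCBI (special topic)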
National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov
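The Entrez system: NCBI's integrated databases. The National Center for Biotechnology Information (NCBI) was established in 1988. In 1991, NCBI developed the Entrez database query system for searching GenBank and other molecular biology databases, as well as biomedical literature abstracts (Medline) (Schuler et al., 1996).
How to use the Entrez system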
www.ncbi.nlm.nih.gov
All Databases is a search and retrieval system that integrates NCBI databases. It integrates: the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes
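NCBI molecular data sub-databases
1. Single nucleotide polymorphism database: dbSNP
2. Genome database: Genome
3. Human genome databases: Ensembl, UCSC
4. Expressed sequence tag database: dbEST
5. Sequence-tagged site database: dbSTS
6. Gene-oriented cluster database: UniGene
7. Genome survey sequences: dbGSS (tag sequences near restriction sites, generated during sequencing)
8. Protein structure classification databases: SCOP, Pfam
9. Protein secondary structure database: DSSP
10. Homologous protein sequence alignment databases: HSSP, HomoloGene
11. OMIM (Online Mendelian Inheritance in Man): catalog of human genes and genetic disorders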
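GenBank division codes
Code  Name
PRI   Primate sequences
ROD   Rodent sequences
MAM   Other mammalian sequences
VRT   Other vertebrate sequences
INV   Invertebrate sequences
PLN   Plant, fungal, and algal sequences
BCT   Bacterial sequences
VRL   Viral sequences
PHG   Bacteriophage sequences
SYN   Synthetic sequences
UNA   Unannotated sequences
EST   Expressed sequence tags
PAT   Patent sequences
STS   Sequence-tagged sites
GSS   Genome survey sequences
HTG   High-throughput genomic sequences
HTC   Unfinished high-throughput cDNA sequences
      High-throughput cDNA sequences (no code given)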
Accessing information on molecular sequences
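Database query vs. database search
1. Database query. Example identifiers for interleukin 18: PDB = 3F62; EBI/UniProt = Q14116 (IL18_HUMAN); RefSeq = NP_001230140.1, NM_001243211.1, NP_001553.1, NM_001562.3; UniGene = Hs.83077; Gene ID: 3606.
2. Database search: using a specific sequence-similarity alignment algorithm to find sequences in a database that are similar, to some degree, to the query sequence.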
Accession numbers are labels for sequences. NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
rs7079946   dbSNP (single nucleotide polymorphism)
N91759.1    An expressed sequence tag (1 of 170)
NM_006744   RefSeq DNA sequence (from a transcript)
NP_006735   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record
Groups on the slide: DNA, RNA, protein
Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI)
Four ways to access protein and DNA sequences: [1] Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) is NCBI's reference sequence database: a curated, non-redundant collection that includes genomic DNA contigs, mRNAs of known genes, and proteins.
Format of RefSeq accession numbers:
Complete genome      NC_######
Complete chromosome  NC_######
Genomic contig       NT_######
mRNA (DNA format)    NM_######  e.g. NM_006744
Protein              NP_######  e.g. NP_006735
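The prefix scheme above can be checked mechanically. The following is a minimal sketch, assuming only the four prefixes listed on this slide (NC, NT, NM, NP) and an optional ".version" suffix such as the one in NP_001553.1; the function name `classify_refseq` is hypothetical, and real RefSeq has further prefixes that this sketch does not cover.

```python
import re

# Only the prefixes named on the slide; other RefSeq prefixes are not handled.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Two-letter prefix, underscore, digits, optional ".version".
_REFSEQ_PATTERN = re.compile(r"^(N[CTMP])_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return (record type, version or None), or None if the pattern does not match."""
    match = _REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    prefix, _digits, version = match.groups()
    version_number = int(version) if version is not None else None
    return (REFSEQ_PREFIXES[prefix], version_number)


if __name__ == "__main__":
    for acc in ("NM_006744", "NP_001553.1", "X02775"):
        print(acc, "->", classify_refseq(acc))
```

An accession such as X02775 (a GenBank record, not RefSeq) returns None, which is a quick way to tell the two kinds of identifier apart.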
From the NCBI home page, type “rbp4” and hit “Go”
By applying limits, there are now just two entries
Sequence information in a GenBank-format record: accession code, source organism, references
Expert comments, features
FASTA format
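A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below is a minimal reader written for illustration, not an NCBI tool; the function name `parse_fasta` and the two example records are made up.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical records, not real database entries.
    example = ">seq1 hypothetical nucleotide record\nATGGCC\nTAA\n>seq2 hypothetical protein record\nMKWV\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```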
Entrez Gene (top of page) Note that links to many other RBP4 database entries are available
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Searching for HIV-1 pol: following the “genome” link yields a manageable four results
Example of how to access sequence data: HIV-1 pol For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can easily be reduced in two easy steps: --specify the organism, e.g. hiv-1[organism] --limit the output to RefSeq!
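The two narrowing steps can also be written as a single Entrez query string. The sketch below only builds the query and a URL; it does not contact NCBI. It assumes the E-utilities ESearch endpoint, the `nuccore` database name, and the `srcdb_refseq[prop]` filter for limiting results to RefSeq; none of these appear on the slide, and the function name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, organism, keyword=None, refseq_only=False):
    """Return (query term, full URL) for an organism-restricted search."""
    parts = [f"{organism}[organism]"]          # step 1: specify the organism
    if keyword:
        parts.append(keyword)
    if refseq_only:
        parts.append("srcdb_refseq[prop]")     # step 2: limit output to RefSeq (assumed filter)
    term = " AND ".join(parts)
    return term, ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    term, url = build_esearch_url("nuccore", "hiv-1", keyword="pol", refseq_only=True)
    print(term)
    print(url)
```

The same `term` string can be pasted into the search box on the NCBI web page, which is the route the slides follow.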
Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq entry
Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI)
[Figure: DNA, RNA, protein; complementary DNA (cDNA); UniGene]
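UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
UniGene contains ESTs and full-length mRNA sequences organized into clusters, each representing a specific known or putative gene, with mapping and expression information and cross-references to other resources. Each record mainly holds the gene's related sequences (cDNA, ESTs, etc.), its chromosomal location, and its expression profile. The component ESTs come from complete cDNA libraries. The UniGene database automatically partitions GenBank sequences into clusters; each record represents one cluster, and each cluster represents one unique gene.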
Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1
Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10
Cluster sizes in UniGene (human), UniGene build 172, 8/04
Cluster size     Number of clusters
1                8,100
2                38,200
3-4              23,300
5-8              12,000
9-16             5,600
17-32            3,700
500-1000         1,050
2000-4000        100
8000-16,000      12
16,000-30,000    2
UniGene: unique genes via ESTs. Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver).
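Exercise: Use Entrez to find the nucleotide and protein RefSeq sequences of the human CCL18 gene, save them in FASTA format, and record the RefSeq accession numbers.
NCBI www.ncbi.nlm.nih.gov National Center for Biotechnology Information (NCBI). NCBI was founded in 1988. Its main work is to develop databases, of which GenBank is the best known, to conduct research in computational biology, to develop software tools for analyzing genome data, and to disseminate biomedical information. Entrez is NCBI's well-known tool for retrieving sequence information. It combines the scientific literature, DNA and protein sequence databases, 3D protein structure data, population study data sets, and complete genome assemblies into one highly integrated system. Like SRS at EBI, it is a system for querying, retrieving, and displaying records.
NCBI: The original version (1991) of Entrez had just 3 nodes; it has now grown to nearly 20 nodes.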
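Databases: http://www.ebi.ac.uk/ http://www.ncbi.nlm.nih.gov/ http://www.nig.ac.jp/english/
1. Become familiar with the NCBI-GenBank Entrez retrieval system.
2. Become familiar with the SRS (EBI-EMBL) retrieval system: UniProtKB, Ensembl, ArrayExpress, PDBe, BLAST+, PMC-E.
3. Become familiar with the DBGET (NIG-DDBJ) retrieval system.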
Thank you (end). April 18, 2014
Access to Biomedical Literature
PubMed is…
• National Library of Medicine's search service
• 12 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on the sidebar)
PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has 12 million records dating back to 1966.
PubMed search strategies
• Try the tutorial (“Education” on the left sidebar)
• Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
• Try using “limits”
• Try “Links” to find Entrez information and external resources
• Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
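The boolean-query advice can be illustrated with a small helper that joins search terms with a capitalized operator. This is a sketch for building the text typed into the PubMed search box; the function name `combine_terms` is hypothetical, and quoting multi-word terms as phrases is an assumption about how the query should be read.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def combine_terms(terms, operator="AND"):
    """Join search terms with a capitalized boolean operator."""
    op = operator.upper()  # PubMed expects AND, OR, NOT in capitals
    if op not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    cleaned = [t.strip() for t in terms if t.strip()]
    # Quote multi-word terms so they are treated as a phrase (assumption).
    quoted = [f'"{t}"' if " " in t else t for t in cleaned]
    return f" {op} ".join(quoted)


if __name__ == "__main__":
    print(combine_terms(["lipocalin", "disease"]))
    print(combine_terms(["retinol binding protein", "human"], "or"))
```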
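· Journal Database: browse journals.
· MeSH Database: browse the Medical Subject Headings hierarchically.
· Single Citation Matcher: enter journal information to find a single article or the contents of a whole journal.
· Batch Citation Matcher: enter journal information in a specific format to search for many articles at once.
· Clinical Queries: intended for clinicians; filters restrict the retrieved literature to four areas: therapy, diagnosis, etiology, and prognosis.
Related Resources
· Order Documents: lets users obtain the full text of articles locally, although some carry a fee.
· NLM Mobile: a link to another NLM web-based query system.
Exercise: Search PubMed for reports on the human CCL18 gene (published after 2000), list the articles retrieved, and try to obtain the full text of one or two of them.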
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day
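Table 7. Query sequence and database types for the BLAST programs
Program   Query        Method
blastp    protein      Searches a protein sequence database with a protein query.
blastn    nucleotide   Searches a nucleotide sequence database with a nucleotide query.
blastx    nucleotide   Translates the nucleotide query in all six reading frames, then searches a protein sequence database.
tblastn   protein      Searches, with a protein query, a nucleotide sequence database translated in all six reading frames.
tblastx   nucleotide   Translates the nucleotide query in all six reading frames, then searches a nucleotide sequence database translated in all six reading frames.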
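The six-frame translation that blastx, tblastn, and tblastx rely on can be sketched in a few lines. This is an illustration of the idea only, not BLAST's own implementation: it assumes the standard genetic code, marks stop codons with "*" and unrecognized codons with "X", and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; trailing bases are ignored."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict of the six translations, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Three frames come from the forward strand and three from the reverse complement, which is why the table speaks of six translations per nucleotide sequence.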
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU
Books is… searchable resource of on-line books
TaxBrowser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• Vector Alignment Search Tool (VAST)
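Homework
1. Use Entrez to find the nucleotide and protein RefSeq sequences of the human CCL18 and human CXCL1 genes, save them in FASTA format, and record the sequence information obtained from GenBank.
2. Search PubMed for reports on the human CCL18 gene (published after 2000), list the articles retrieved, and try to obtain the full text of one or two of them.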