Protein sequence analysis
I. Introduction  II. Protein databases  III. Protein sequence analysis
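Why protein analysis? The Human Genome Project raises two kinds of questions. First, how to derive meaningful biological information and knowledge from protein and DNA sequences (bioinformatics): locating genes and determining their functions, observing interactions between proteins, and understanding the functional structure of conserved proteins. Second, questions closely tied to the analysis of large biological data sets: storing and querying large gene and protein databases.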
Function unknown for 40% of human proteins
Importance of sequence analysis: Millions of sequences are available in public dbs, & millions more in proprietary dbs; these numbers will snowball with the completion of more genomes. So what? Locked up in sequences is a huge amount of structural, functional & evolutionary info, so they're a highly valuable resource. By contrast, the number of unique protein structures is ~2000: a huge information deficit.
The legacy of the genome projects: the sequence-structure deficit. [Figure: non-redundant growth of sequences during 1988-2002 (black line) & the corresponding growth in the number of structures (pink dots).]
Challenges for bioinformatics: Spurred on by the seq/structure deficit, the challenges are to rationalise the mass of sequence data, derive more efficient means of data storage, and design more incisive & reliable analysis tools. The imperative: to convert sequence information into biochemical & biophysical knowledge, and to decipher the structural, functional & evolutionary clues encoded in the language of biological sequences.
The Holy Grail of bioinformatics ...to be able to understand the words in a sequence sentence that form a particular protein structure
The reality of sequence analysis ...isn't so glamorous... but it means we can recognise words that form characteristic patterns, even if we don't know the precise syntax to build complete protein sentences
Pattern recognition & prediction: In investigating the meaning of sequences, two distinct analytical approaches have emerged. Pattern recognition is used to detect similarity between sequences or structures & hence to infer related functions. Ab initio prediction is used to deduce structure, & to infer function, directly from sequence. These methods are quite different! Pattern recognition methods demand that some characteristic has been seen before & housed in a db; prediction methods remove the need for template dbs, because deductions are made directly from sequence.
Science fact & fiction: Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition, which is ~50% reliable even in expert hands. Prediction is still not possible & is unlikely to be so for decades to come (if ever). Structural genomics will yield representative structures for many (but not all) proteins; in future, structures of new sequences will be determined by modelling, & prediction will become an academic exercise. But, to debunk a popular myth, knowing structure alone does not inherently tell us function.
A reality check: What is the function of this structure? What is the function of this sequence? What is the function of this motif? The fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions. Knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level.
“A test case for structural genomics: structure-based assignment of the biochemical function of hypothetical protein MJ0577” (Zarembinski et al., PNAS 95, 1998). Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown.
The Twilight Zone: Prediction methods don’t work because we don’t fully understand the Folding Problem: we can’t read the language sequences use to create their folds. But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs whose structures & functions we hope have been elucidated. This is straightforward at high levels of identity, but below 50% it is difficult to establish relationships reliably. Analyses can be pursued with decreasing certainty towards the Twilight Zone (~20% identity), where results may look plausible to the eye, but are no longer statistically significant.
I. Introduction  II. Protein databases  III. Protein sequence analysis
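Classification of protein databases: Protein sequence databases are centred on protein sequences with accompanying annotation, e.g. PIR, SWISS-PROT, NCBI. Protein motif and domain databases collect the signature sequences of conserved structural and functional domains, e.g. PROSITE, Pfam.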
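Classification of protein databases (continued): Protein structure databases hold mainly measured protein structure data, e.g. PDB. Protein classification databases include sequence classification databases based on sequence comparison and structure classification databases based on structure comparison, e.g. SCOP, CATH, FSSP.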
Functions of protein databases: annotation of the data; search of the data; bioinformatics analysis of the data.
Protein sequence databases: PIR (Protein Information Resource) http://pir.georgetown.edu/ PIR-PSD is the world's first database of classified and functionally annotated protein sequences. The sequence data come from the GenBank/EMBL/DDBJ databases, published data and direct user submissions.
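PIR-PSD is a comprehensive, non-redundant, expertly annotated and fully classified protein sequence database. Its sequences come from translations of the coding sequences in the three major databases GenBank/EMBL/DDBJ, from sequences in the published literature, and from direct user submissions. The iProClass database is an integrated resource describing the relationships between protein families and their structural/functional features; it holds more than 300,000 protein sequences from SWISS-PROT and PIR, covering superfamilies, protein families, functional domains, structural motifs and post-translational modification sites.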
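Protein sequence databases: SWISS-PROT/TrEMBL database (www.expasy.org/swissprot). SWISS-PROT is an annotated protein database made up of protein sequence entries. Each entry contains the protein sequence, literature references, taxonomic information and annotation. The annotation covers the protein's function, post-translational modification sites, special sites and regions, secondary structure, quaternary structure, similarity to other sequences, and so on.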
Swiss-Prot: Endeavours to provide high-level annotation, e.g. descriptions of the function of the protein, the organisation of its domains, PTMs, family & disease relationships, variants, etc. Contains entries from >10,000 species, the bulk of these from just a handful of model organisms: H.sapiens, E.coli, M.musculus, D.melanogaster, S.cerevisiae, etc. The quality of its annotations sets it apart from other dbs. Consequently, it cannot keep pace with the rate of data acquisition from the sequencing centres.
Protein structure databases: PDB (Protein Data Bank) http://www.rcsb.org/pdb/ PDB is the single worldwide repository for the processing and distribution of 3D structure data of large molecules of proteins and nucleic acids.
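A PDB structure entry consists of the following information: sequence information; atomic coordinates; crystallisation conditions; approximate 3D structures computed by various methods; derived geometric data; structure factors; 3D images of the structure; links to other data resources.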
Protein family and domain databases: PROSITE (Database of protein families and domains) http://www.expasy.org/prosite PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
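The PROSITE database is built on conserved regions obtained from multiple sequence alignments of homologous sequences within protein families. These regions are usually related to biological function, for example enzyme active sites and ligand or metal binding sites.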
I. Introduction  II. Protein databases  III. Protein sequence analysis
III. Protein sequence analysis: (1) Protein sequence collection (2) Protein sequence analysis
3 methods for collecting protein sequence data: Direct sequencing, e.g. with a mass spectrometer. Translating a coding DNA sequence, e.g. using the “ORF Finder” program to find the open reading frames of the DNA. Searching a database.
(1) Protein sequence collection. Method 1: Direct sequencing, e.g. protein sequencing and identification by mass spectrometry.
Masses of Amino Acid Residues
Protein backbone: H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH, written from the N-terminus (left) to the C-terminus (right); the three units shown are AA residue i-1, AA residue i and AA residue i+1.
Breaking Protein into Peptides and Peptides into Fragment Ions (general scheme for sequencing): Proteases, e.g. trypsin, break the protein into peptides. A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.
Breaking Protein into Peptides and Peptides into Fragment Ions (continued): The mass spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones. The mass spectrometer measures the mass/charge ratio of an ion.
Peptide Fragmentation (Collision Induced Dissociation): [Figure: a protonated (H+) peptide backbone H...-HN-CH(R_{i-1})-CO . . . NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH, broken at one backbone bond into a prefix fragment and a suffix fragment.] Peptides tend to fragment along the backbone. Fragments can also lose neutral chemical groups like NH3 and H2O.
N- and C-terminal Peptides of GPFNA: N-terminal peptides G, GP, GPF, GPFN; C-terminal peptides A, NA, FNA, PFNA.
Terminal peptides and ion types: Peptide GPFN, mass (Da) 57 + 97 + 147 + 114 = 415. Peptide GPFN without H2O, mass (Da) 57 + 97 + 147 + 114 - 18 = 397.
N- and C-terminal Peptides with masses (Da), full peptide GPFNA = 486: N-terminal peptides G 57, GP 154, GPF 301, GPFN 415; C-terminal peptides A 71, NA 185, FNA 332, PFNA 429.
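The prefix and suffix masses on this slide can be reproduced in a few lines of Python. This is a minimal sketch that assumes the rounded integer residue masses used in the example (G 57, P 97, F 147, N 114, A 71); the function name `terminal_peptide_masses` is hypothetical.

```python
# Rounded residue masses (Da) for the five residues in the slide example.
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114, "A": 71}

def terminal_peptide_masses(peptide):
    """Return (N-terminal prefix masses, C-terminal suffix masses, total mass)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    prefixes, running = [], 0
    for m in masses[:-1]:          # proper prefixes only, the full peptide is excluded
        running += m
        prefixes.append(running)
    # Each suffix is the complement of a prefix: suffix mass = total - prefix mass.
    suffixes = [total - p for p in reversed(prefixes)]
    return prefixes, suffixes, total

prefixes, suffixes, total = terminal_peptide_masses("GPFNA")
print(prefixes, suffixes, total)
```

Every prefix mass pairs with a suffix mass so that the two sum to the mass of the whole peptide, which is why the two ladders in a spectrum carry the same sequence information.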
Peptide Fragmentation: [Figure: backbone of a four-residue peptide (side chains R1 to R4, N-terminal NH3+, C-terminal COOH) with the cleavage sites labelled by ion type: a2, b2, a3, b3 for N-terminal fragments; y1, y2, y3 for C-terminal fragments; neutral-loss variants b2-H2O, b3-NH3, y2-NH3, y3-H2O.]
Mass Spectra: The peaks in the mass spectrum are prefix and suffix fragments, fragments with neutral losses (-H2O, -NH3), and noise; some peaks are missing. [Figure: spectrum along the mass axis annotated with the residues G, V, D, L, K; the spacing between adjacent peaks identifies a residue, e.g. 57 Da = ‘G’, 99 Da = ‘V’.]
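Reading residues off the spacing between peaks can be sketched as follows. This assumes an idealised ladder of prefix masses with no noise, rounded integer residue masses, and a starting peak at 0; the names `MASS_TO_RESIDUE` and `read_ladder` are hypothetical. Residues that share a rounded mass (I/L, K/Q) are reported as ambiguous.

```python
# Rounded residue masses (Da); I/L and K/Q cannot be told apart at this resolution.
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "[IL]", 114: "N", 115: "D", 128: "[KQ]", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def read_ladder(prefix_masses):
    """Translate the gaps between consecutive prefix peaks into residues.

    A gap that matches no single residue (e.g. because a peak is missing)
    is reported as '?'.
    """
    peaks = sorted(prefix_masses)
    residues = []
    for lighter, heavier in zip(peaks, peaks[1:]):
        residues.append(MASS_TO_RESIDUE.get(heavier - lighter, "?"))
    return "".join(residues)

print(read_ladder([0, 57, 156, 271, 384, 512]))
```

Real spectra mix prefix and suffix ladders, neutral losses and noise, so practical tools score many candidate peptides rather than reading a single clean ladder.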
Protein Identification with MS/MS: peptide identification. [Figure: MS/MS spectrum (intensity against mass) read out as the peptide G V D L K.]
Tandem Mass-Spectrometry
Breaking Proteins into Peptides: the protein MPSERGTDIMRPAKID...... is broken into peptides (MPSER, GTDIMR, PAKID, ……), which are separated by HPLC and passed to MS/MS.
Mass Spectrometry: Matrix-Assisted Laser Desorption/Ionization (MALDI)
Tandem Mass Spectrometry: [Figure: LC feeds the ion source, followed by MS-1, the collision cell and MS-2. Scan 1707 is the MS scan; scan 1708 is the MS/MS scan.]
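Peptide fragment fingerprinting (PFF). Steps: digest the protein with a sequence-specific protease; after separation, the resulting peptides are selected and fragmented in the mass spectrometer to give MS/MS spectra, which are compared with spectra in a database for identification. Representative methods: LC-ESI-MS/MS, 2D-LC-MS/MS (shotgun).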
(1) Protein sequence collection. Method 2: Translating a coding DNA sequence, e.g. using the “ORF Finder” program to find the open reading frames of the DNA. URL: ncbi.nlm.nih.gov/gorf/gorf.html
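The idea behind an ORF search can be illustrated with a short sketch. This is not the ORF Finder program itself: it assumes the standard genetic code, ATG as the only start codon, and ORFs that must end in a stop codon; `find_orfs` and `revcomp` are hypothetical names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def revcomp(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_len=1):
    """Return (strand, frame, protein) for every ATG...stop ORF in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", revcomp(dna))):
        for frame in range(3):
            protein = None                      # None means "not inside an ORF"
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")   # X for ambiguous codons
                if protein is None:
                    if aa == "M":               # start codon opens an ORF
                        protein = "M"
                elif aa == "*":                 # stop codon closes it
                    if len(protein) >= min_len:
                        orfs.append((strand, frame, protein))
                    protein = None
                else:
                    protein += aa
    return orfs

print(find_orfs("CCATGGCCTAAGG"))
```

Raising `min_len` filters out the many short ORFs that occur by chance in any long DNA sequence.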
(1) Protein sequence collection. Method 3: Searching a database, e.g. the PIR-PSD database (pir.georgetown.edu/pirwww) or the SWISS-PROT/TrEMBL database (www.expasy.org/swissprot).
III. Protein sequence analysis: (1) Protein sequence collection (2) Protein sequence analysis
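Protein sequence analysis: When an experiment yields a protein sequence or a set of sequences, we want as much information about the protein as possible, and then to predict its function from whatever information is available, so protein sequence analysis is indispensable. Sequence analysis gives us the basic properties of the sequence, and from there we can go on to predict the protein's secondary, tertiary and even quaternary structure.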
LEVEL OF FUNCTION INFORMATION IN PROTEIN SEQUENCES: SUPERFAMILY, FAMILY, DOMAIN, SECONDARY STRUCTURE, MOTIF, SITE, 3D STRUCTURE, RESIDUE
The 20 Amino Acids
A  Ala  Alanine
C  Cys  Cysteine
D  Asp  Aspartic acid (Aspartate)
E  Glu  Glutamic acid (Glutamate)
F  Phe  Phenylalanine
G  Gly  Glycine
H  His  Histidine
I  Ile  Isoleucine
K  Lys  Lysine
L  Leu  Leucine
M  Met  Methionine
N  Asn  Asparagine
P  Pro  Proline
Q  Gln  Glutamine
R  Arg  Arginine
S  Ser  Serine
T  Thr  Threonine
V  Val  Valine
W  Trp  Tryptophan
Y  Tyr  Tyrosine
AMINO ACID PROPERTIES
Small: Ala, Gly
Small hydroxyl: Ser, Thr
Basic: His, Lys, Arg
Aromatic: Phe, Tyr, Trp
Small hydrophobic: Val, Leu, Ile
Medium hydrophobic: Val, Leu, Ile, Met
Acidic/amide: Asp, Glu, Asn, Gln
Small/polar: Ala, Gly, Ser, Thr, Pro
Protein functions from specific residues
C: disulphide-rich proteins, metallothionein, zinc fingers
DE: acidic proteins (unknown)
G: collagens
H: histidine-rich glycoprotein
KR: nuclear proteins, nuclear localisation
P: collagen, filaments
SR: RNA binding motifs
ST: mucins
Polar (C,D,E,H,K,N,Q,R,S,T): active sites
Aromatic (F,H,W,Y): protein ligand-binding sites
Zn2+-coord (C,D,E,H,N,Q): active site, zinc finger
Ca2+-coord (D,E,N,Q): ligand-binding site
Mg/Mn-coord (D,E,N,S,R,T): Mg2+ or Mn2+ catalysis, ligand binding
Ph-bind (H,K,R,S,T): phosphate and sulphate binding
Catalytic sites are mostly polar. Large aromatic aa are often found in protein-ligand interactions. Zn ions are coordinated by many different aa. Ca ions are often bound by acidic residues and amides. Mn and Mg are often bound by 2 acidic residues separated by a hydrophobic aa. Phosphate and sulphate are often bound at the amino end of alpha helices.
Protein functions from regions: Active sites: short, highly conserved regions. Loops: charged residues and variable sequence. Interior of protein: conservation of charged amino acids.
Additional analysis of protein sequences: transmembrane regions, signal sequences, localisation signals, targeting sequences, GPI anchors, glycosylation sites, hydrophobicity, amino acid composition, molecular weight, solvent accessibility, antigenicity.
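Two of the simplest properties in this list, amino acid composition and overall hydrophobicity, can be computed directly from the sequence. The sketch below assumes the Kyte-Doolittle hydropathy scale (not named on the slides) and averages it over the whole sequence; `composition` and `gravy` are hypothetical helper names.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition(seq):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}

def gravy(seq):
    """Mean hydropathy over the whole sequence (grand average of hydropathy)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

print(composition("GPFNA"))
print(gravy("ILV"), gravy("DEK"))
```

Averaging the same scale over a sliding window instead of the whole sequence gives a hydropathy profile, which is one common starting point for spotting candidate transmembrane regions.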
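Methods of protein sequence analysis: 1. Similarity search (homology search): compare a new sequence with the sequences in a database to find homologous or similar sequences. 2. Motif search and domain location: find the structural or functional domains in a protein. 3. Multiple sequence alignment: provides information on domains, on the residues of functional sites, and on the residues of the hydrophilic surface and hydrophobic core, and yields more templates for homology modelling or secondary structure prediction (manual correction is needed). 4. Homology modelling: (1) with significant homology (>50%), build an accurate structural model of the protein using the known structure of the hit as the template; (2) when the hits are isolated domains, find a suitable fold by secondary structure prediction and fold recognition, and model on that fold.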
1. Similarity search (homology search): (1) A new sequence is compared with the sequences in a sequence database to find homologous or similar sequences. (2) The program commonly used is BLASTp.
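The database search tool BLAST: BLAST, short for Basic Local Alignment Search Tool, is a database search program in common use today. The basic idea of the BLAST algorithm is first to find the most similar segments between the query sequence and a target sequence, and then to use them as a core and extend in both directions to find the longest possible similar segment.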
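The seed-and-extend idea can be shown with a toy sketch. This is not the BLAST implementation: it assumes exact word matches as seeds, a flat match/mismatch score instead of a substitution matrix, ungapped extension, and a simple drop-off rule for stopping; `extend` and `seed_and_extend` are hypothetical names.

```python
def extend(query, target, qi, ti, step, match, mismatch, x_drop):
    """Walk in one direction from (qi, ti); return (best score gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += match if query[qi] == target[ti] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:    # score fell too far below the best: stop
            break
        qi += step
        ti += step
    return best, best_len

def seed_and_extend(query, target, word=3, match=1, mismatch=-1, x_drop=2):
    """Return (score, query start, target start, length) of the best ungapped hit."""
    index = {}                          # word -> positions in the target
    for t in range(len(target) - word + 1):
        index.setdefault(target[t:t + word], []).append(t)
    best_hit = None
    for q in range(len(query) - word + 1):
        for t in index.get(query[q:q + word], []):
            right, r_len = extend(query, target, q + word, t + word, 1,
                                  match, mismatch, x_drop)
            left, l_len = extend(query, target, q - 1, t - 1, -1,
                                 match, mismatch, x_drop)
            hit = (word * match + left + right, q - l_len, t - l_len,
                   l_len + word + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit

print(seed_and_extend("MKVLAAGIW", "TTKVLAAGPP"))
```

Indexing the words of the target first is what makes the search fast: only positions that share a word with the query are ever extended.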
BLAST Input
BLAST results
BLAST results (2)
2. Motif search and domain location: A motif is a highly conserved element detected by multiple sequence alignment of a protein family; it often corresponds to a functional or structural domain. Motif searching is another way of searching databases; what it looks for are key conserved amino acids in the sequence. PROSITE database (www.expasy.ch/prosite).
Motif search and domain location: The PROSITE database (www.expasy.ch/prosite) is a database of protein family motifs. It contains important sites, sequence patterns and sequence profiles, and can be used to classify a new protein sequence accurately.
Sequence patterns: This is another sequence search method, whose aim is to find structural or functional domains in a protein. It directly describes a few key conserved residues in the sequence, called a “signature”, i.e. the sequence pattern. The PROSITE database is commonly used.
PATTERNS: Small, highly conserved regions, shown as regular expressions. Example: [AG]-x-V-x(2)-x-{YW}, where [ ] lists the amino acids allowed at a position, x is any amino acid, x(2) is any amino acid at each of the next 2 positions, and { } lists the amino acids not allowed at a position. BUT: limited to near-exact matches in a small region.
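A PROSITE-style pattern maps almost one-to-one onto an ordinary regular expression. The sketch below covers only the syntax shown on this slide (brackets, braces, x, and repeat counts); anchors and other PROSITE features are deliberately left out, and `prosite_to_regex` is a hypothetical helper.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern such as [AG]-x-V-x(2)-x-{YW} to a regex."""
    element_re = re.compile(
        r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core.upper() == "X":
            regex = "."                         # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"     # any amino acid except these
        else:
            regex = core                        # literal residue or [..] set
        if lo:                                  # repeat count, e.g. x(2) or x(2,4)
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)
print(re.search(regex, "MAKVLLQA"))
```

Because the match is all-or-nothing, a single substitution at a conserved position makes the pattern miss a true family member, which is the limitation noted on the slide.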
FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES: Pattern: short, simplest, but limited. Motif: conserved element of a sequence alignment, usually predictive of a structural or functional region. To get more information across the whole alignment: matrix, profile, HMM.
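Motif search and domain location: domain location. Searching the databases with the sequence yields some information about it; the next step is to locate its domains, which gives a clearer picture for later structure prediction. If the protein is longer than 500 amino acids, it can be split into several separate regions according to the search results (e.g. by level of similarity or by number of domains), and each segment is best analysed separately.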
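Motif search and domain location: identifying protein domains usually involves the following analyses. (1) Test whether the segment is homologous to other full-length sequences. If it is, this is good evidence that the segment is a domain; then search the structure databases, or well-annotated databases, to obtain descriptions of the domain. (2) Analyse low-complexity regions. In multidomain proteins these low-complexity sequences often separate domains; long repeats, especially of Pro, Glu, Ser and Thr, are often linker sequences and good places to cut between domains.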
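Motif search and domain location: identifying protein domains (continued). (3) Transmembrane regions. Transmembrane structure is very typical, strongly continuous, and easy to predict with fairly high accuracy, so it also marks a dividing region and makes it easy to tell extracellular from intracellular regions. (4) Coiled-coil structure. This may also be a spacer between protein domains; coiled-coils can be predicted on the COIL web site.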
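Motif search and domain location: identifying protein domains (continued). (5) Secondary structure prediction. This is often used to predict the different folds contained in a structure. For example, one part of a sequence may be predicted as all α-helix and another part as all β-sheet, which may indicate the presence of separate domains. (6) If the sequence has been successfully broken down into well-formed domains, it is important to repeat the database searches and align each domain independently.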
Motif search and domain location
Why create pattern databases? We often need to make more specific diagnoses than are possible simply by searching the primary (1°) dbs. Pattern dbs build on the principle that sequences may be gathered into alignments, within which are regions with little variation; these ‘motifs’ usually reflect some vital biological role in terms of structure or function. Motifs are exploited in different ways to build diagnostic patterns for protein families; new sequences can be searched against dbs of such patterns to see if they can be assigned to known families, hence they offer a fast track to the inference of function.
Methods for family analysis: Single motif methods: fuzzy regex (eMOTIF), exact regex (PROSITE). Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam). Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks).
The challenge of family analysis: A highly divergent family with a single function? Or a superfamily with many diverse functional families? We must distinguish the two if function analysis is done in silico: a tough challenge!
Know your family
The problem with domains
PROSITE: The first pattern db, based on the idea that a family can be characterised by a pattern of conserved residues in a single motif. Sequence information in motifs is reduced to regular expressions, & the seed regex is used to search Swiss-Prot (SP); results are inspected manually to achieve optimal results. Some families can’t be characterised by single motifs; here, additional regexs are created until an optimal set is achieved that captures most or all of the family. Results are then manually annotated for inclusion in the db.
MATRIX: 210 possible aa pairs (190 pairs of different aa, 20 pairs of identical aa). Start with a sequence alignment and build up a table of probabilities of finding each aa at each position of the sequence. Can be scored in several different ways.
Matrix scores can be based on: Genetic code: base changes required to convert between the codons for 2 amino acids. Chemical similarity: polarity, size, shape, charge. Observed substitutions: based on analysing frequencies seen in alignments - inter-reliable. Dayhoff mutation data matrix: likelihood of mutation from one aa to another; but different positions are not equally mutable, and the matrix is only useful for closely related sequences because the underlying alignments are of very closely related proteins.
Matrix scoring continued: BLOSUM (blocks substitution matrix): a matrix built from ungapped alignments of distantly related sequences; sequences that are similar at a threshold value of % identity are clustered; substitution frequencies for all pairs of aa are calculated and used to calculate a log-odds score. The threshold value can be varied. 3D structure matrix: derived from tertiary structure alignments; good, but only usable if the structure is known. The best matrices are derived from observed substitution data; it is important to select a scoring matrix appropriate for the evolutionary distance of interest.
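The log-odds score mentioned here has a compact general form. As a sketch, with q_{ij} the frequency with which residues i and j are observed aligned, p_i and p_j their background frequencies, and \lambda a scaling constant (these symbols are introduced here for illustration and do not appear on the slides):

```latex
s_{ij} \;=\; \frac{1}{\lambda}\,\log\frac{q_{ij}}{p_i\,p_j}
```

A pair seen together more often than chance predicts (q_{ij} > p_i p_j) gets a positive score, and a pair seen less often gets a negative one. The count of 210 distinct pairs follows from 20 \cdot 21 / 2 = 210 for unordered pairs of 20 amino acids.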
PROFILES: A table or matrix containing comparison information for aligned sequences. Used to find sequences similar to an alignment rather than to one sequence. Contains the same number of rows as positions in the sequences. Each row contains the score for alignment of that position with each residue.
Example of a Profile: Match values are higher for conserved residues. Note: in the third column, alanine has a lower score than methionine, because methionine is physically more similar to L, I, V and F.
Building a Profile: To get a good profile you need a good, hand-curated alignment. Use the alignment to build up a position-specific scoring matrix. Use the matrix (profile) to do PSI-BLAST with several iterations.
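Building a position-specific scoring matrix from an alignment can be sketched as follows. This is a simplified illustration, not the PSI-BLAST procedure: it assumes an ungapped alignment, a uniform background frequency of 1/20, and a flat pseudocount; `build_profile` and `score` are hypothetical names.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """One row per alignment position; each row maps residue -> log-odds score."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)            # assumed uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in ALPHABET}
        profile.append(row)
    return profile

def score(profile, seq):
    """Sum of per-position scores for a sequence of the same length as the profile."""
    return sum(row[aa] for row, aa in zip(profile, seq))

profile = build_profile(["ACD", "ACE", "ACD"])
print(score(profile, "ACD"), score(profile, "WWW"))
```

The pseudocount keeps residues that never appear in a column from scoring minus infinity, which matters when the alignment contains only a few sequences.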
SCORES: The E-value is the number of hits expected by chance from random sequences. An E-value of 1.0 is not significant, 0.1 is possibly significant, and < 0.01 is most likely significant. All depends on database size.
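The dependence on database size can be made explicit with the Karlin-Altschul form of the E-value used for local alignment scores. The symbols are introduced here for illustration: S is the raw alignment score, m the query length, n the total database length, and K and \lambda are constants that depend on the scoring system.

```latex
E \;=\; K\,m\,n\,e^{-\lambda S}
```

Under this expression the same score S gives an E-value that grows in proportion to n, so a hit that is significant against a small database may not be significant against a large one.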
HIDDEN MARKOV MODELS (HMM): An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, and built around probabilities. Package used: HMMER (http://hmmer.wusd.edu/). Start with one sequence or an alignment: build the model with HMMbuild, calibrate it with HMMcalibrate, then search the database with the HMM. E-value: number of false matches expected with a certain score. Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the noise curve.
REPEATS Structural and evolutionary entities found in 2 or more copies Often assemble into elongated “rods”, “superhelices” or “barrel” structures Specialised cases when building profiles
PITFALLS OF METHODS: BLAST: only picks up homologues, not distant, divergent family members. PSI-BLAST: fine for superfamilies, not very good for small, very conserved motifs. Patterns: small, localised, and need highly conserved regions. HMMER: slow process for searching a database. Profiles: if a false positive is picked up, it pulls in its companions; in large families, members can be missed. Alignment methods: automatic, less biological significance.
Big problem in protein sequence analysis: multidomain proteins. The most conserved domain will score highest in sequence similarity searches, so lower scoring domains may be overlooked. Iterative searching of multi-domain proteins could pick up unrelated proteins. [Figure: proteins A, B and C drawn with Domain 1 and Domain 2.] One case: A=B, B=C, but A≠C. Other case: A, B & C share a common domain.
SUMMARY OF PATTERN METHODS: Single motif method: exact regular expression (PROSITE). Multiple motif methods: frequency matrix (PRINTS) or PSS matrix (BLOCKS). Full domain alignment methods (ProDom, DOMO). Full domain profile or HMM (Pfam, SMART).
COMMON PROTEIN PATTERN DATABASES: Prosite patterns, Prosite profiles, Pfam, SMART, Prints, ProDom, DOMO, BLOCKS
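3. Multiple sequence alignment: Multiple sequence alignment aligns more than two sequences that may be phylogenetically related. Most current multiple alignment algorithms are based on progressive alignment, i.e. the multiple alignment is built up and refined step by step from pairwise alignments.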
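Any alignment of two or more nucleotide or amino acid sequences is, strictly speaking, an explicit hypothesis about the evolutionary history of those sequences. Sequence alignment provides important information for: determining the function of newly discovered genes; determining evolutionary relationships between genes, between proteins and even between species; predicting protein structure and function, i.e. it provides information on domains, on the residues of functional sites and on hydrophilic and hydrophobic residues, and the alignment results yield more templates for homology modelling or secondary structure prediction.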
Alignment: The strategy for sequence alignment without use of structural information is the same for DNA & protein. It must allow for point mutations, insertions & deletions.
Protein sequence alignment: Pairwise alignment, e.g. a b a c d aligned with a b _ c d. Multiple sequence alignment usually provides more information, e.g. adding a third sequence x b a c e. Multiple alignment is difficult to do for distantly related proteins.
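The pairwise example on this slide (abacd against abcd, with one gap) can be reproduced with a small global alignment routine. This is a minimal sketch of Needleman-Wunsch dynamic programming under assumed scores (match +1, mismatch -1, gap -1) rather than a substitution matrix; `global_align` is a hypothetical name and "-" marks a gap.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (aligned a, aligned b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap                   # a aligned against leading gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]

print(global_align("abacd", "abcd"))
```

Progressive multiple alignment methods repeat this kind of pairwise step, aligning sequences or groups of already aligned sequences in the order suggested by a guide tree.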
Questions Are two sequences homologous? What is the best alignment between them? What is the function of my protein? Is my SNP functionally important?
Alignment Strategy: How we look depends on what we’re looking for. We want to equate residues / bases (sites) that share a common ancestor. For DNA this can only really be done at the sequence level. For proteins, structure is important (but in general we don’t know the structure). For RNA (miRNA, tRNA etc.), structure (base pairing) is important.
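Multiple Sequence Alignment: A similarity search returns many related, similar sequences; more than two of these are then aligned together as a whole. Uses: building profiles of sequence patterns; clustering sequences to construct molecular phylogenetic trees; and so on. Tool: ClustalW_mp, at www.ebi.ac.uk/clustalw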
Sample Multiple Alignment
AF380737 ------------------------------------------------------------
AF380734 GGGGTTGGTGTAAAATAGGGGTGGGGCTCCCCGGGCTATTTCGGCCCCTCCGGCTAGACC 60
......
AF380737 CAAAG--CCTTGGAG---TTAGC---ATAGGACGTTGGAACGATAGTGATAACGGATATG 466
AF380735 TATTGA-CCTTATAGATTTTATA---ATCGGGGAATAGCGTGACGTTAGTGGAGTCTAGG 477
AF380736 GTTGGCAGCTTGGCTATAGCGCT---ATAGGAGCTTAG-GTAACGTAGGGATCTATGTTG 465
AF380733 TAGGGATACTTAGGG----TTGC---ATAGG---CTATAGTTTCGATAGGTAACTTTAGG 455
AF380734 TATAATATAAAGGAGAAGTTATATTGGTGGGGTTCCAAGGCTATTTAGGCTAAGGGTTGG 706
* **
AF380737 AAATTCCCAAACTTTTTAGGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 804
AF380735 AAAATCCCAAACTTTTTACGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 797
AF380736 AAATTCCCAAACTTTTTAGGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 804
AF380733 AAAATCCCAAACTTTTTAGGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 801
AF380734 AAAAATCCCAACTTTTTAGGTCCCTTAGGTAGGGGCGTTCTCCCGAAAACCGAAAAAATC 1053
*** ** **************** ***************** ************* *
AF380737 -ATGCAGAAACCCC-GTTCAAAAAT-CGGCCAAAATCGCCATTTTTACGATTTTCGTGTG 861
AF380735 -ATGCAGAAACCCC-GTTCAAAAAT-CGGCCAAAATCGCCATTTTTTCAATTTTCGTGTG 854
AF380736 -ATGCAGAAACCCC-GTTCAAAAAA-TGGCCAAAATCGCGATTTTTACGATTTTCGTGTG 861
AF380733 -ATGCAGAAACCCC-GTTCAAAAAA-TGCCCAAAATCGCGATTTTTACGATTTTCGTGTG 858
AF380734 GATGCAGAAACCCCCGTTCAAAAAAATGCCCAAAATCACGATTTTTACGATTTTCGTGTG 1113
************* ********* * ******** * ****** * ***********
AF380737 AAACTA 867
AF380735 AAACTA 860
AF380736 AAACTA 867
AF380733 AAACTA 864
AF380734 AAACTA 1119
******
Multiple sequence alignment of 5 sequences
4. Homology modelling (see the next chapter)
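Review questions: 1. How can protein sequence data be obtained? 2. Be familiar with the names and features of at least one protein sequence database and one protein structure database; briefly describe the PROSITE database and its applications. 3. Briefly describe the methods of protein sequence analysis. 4. What is the basic principle of multiple sequence alignment? Chinese-to-English vocabulary: sequence annotation; motif and domain; multiple sequence alignment.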