Protein sequence analysis


Outline: I. Introduction; II. Protein databases; III. Protein sequence analysis

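Why protein analysis? The Human Genome Project raised the question of how to derive meaningful biological information and knowledge from protein and DNA sequences (bioinformatics): determining the location and function of genes, observing interactions between proteins, and understanding the functional structure of proteins as it is conserved. These questions are closely tied to the analysis of large biological data sets, and to storing and querying large gene and protein databases.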

Function unknown for 40% of human proteins

Importance of sequence analysis: Millions of sequences are available in public databases, and millions more in proprietary databases; these numbers will snowball with the completion of more genomes. So what? Locked up in sequences is a huge amount of structural, functional and evolutionary information, so they are a highly valuable resource. By contrast, the number of unique protein structures is ~2000: a huge information deficit.

The legacy of the genome projects: the sequence-structure deficit. [Figure: non-redundant growth of sequences during 1988-2002 (black line) and the corresponding growth in the number of structures (pink dots).]

Challenges for bioinformatics: Spurred on by the sequence/structure deficit, the challenges are to rationalise the mass of sequence data, to derive more efficient means of data storage, and to design more incisive and reliable analysis tools. The imperative is to convert sequence information into biochemical and biophysical knowledge, and to decipher the structural, functional and evolutionary clues encoded in the language of biological sequences.

The Holy Grail of bioinformatics ...to be able to understand the words in a sequence sentence that form a particular protein structure

The reality of sequence analysis ...isn't so glamorous, but it means we can recognise words that form characteristic patterns, even if we don't know the precise syntax to build complete protein sentences.

Pattern recognition & prediction: In investigating the meaning of sequences, two distinct analytical approaches have emerged. Pattern recognition is used to detect similarity between sequences or structures, and hence to infer related functions. Ab initio prediction is used to deduce structure, and to infer function, directly from sequence. These methods are quite different! Pattern recognition methods demand that some characteristic has been seen before and housed in a database; prediction methods remove the need for template databases, because deductions are made directly from sequence.

Science fact & fiction: Sequence pattern recognition is easier to achieve, and much more reliable, than fold recognition, which is ~50% reliable even in expert hands. Prediction is still not possible, and is unlikely to be so for decades to come (if ever). Structural genomics will yield representative structures for many (but not all) proteins; in future, structures of new sequences will be determined by modelling, and prediction will become an academic exercise. But, to debunk a popular myth, knowing structure alone does not inherently tell us function.

A reality check: What is the function of this structure? What is the function of this sequence? What is the function of this motif? The fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions; knowing the fold and function allows us to rationalise how the structure effects its function at the molecular level.

“A test case for structural genomics Structure-based assignment of the biochemical function of hypothetical protein mj0577” (Zarembinski et al., PNAS 95 1998) Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown

The Twilight Zone: Prediction methods don't work because we don't fully understand the folding problem: we can't read the language sequences use to create their folds. But, with sequence analysis techniques, we can try to find similarities between new sequences and those in databases whose structures and functions we hope have been elucidated. This is straightforward at high levels of identity, but below 50% it is difficult to establish relationships reliably. Analyses can be pursued with decreasing certainty towards the Twilight Zone (~20% identity), where results may look plausible to the eye but are no longer statistically significant.


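Classification of protein databases:
Protein sequence databases: centred on protein sequences with accompanying annotation, e.g. PIR, SWISS-PROT, NCBI.
Protein motif and domain databases: collect the signature sequences of conserved structural and functional domains, e.g. PROSITE, Pfam.
Protein structure databases: centred on measured protein structure data, e.g. PDB.
Protein classification databases: include sequence classification databases based on sequence comparison and structure classification databases based on structure comparison, e.g. SCOP, CATH, FSSP.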

Functions of protein databases: annotation of the data; search of the data; bioinformatics analysis of the data.

Protein sequence database: PIR (Protein Information Resource), http://pir.georgetown.edu/. PIR-PSD is the world's first database of classified and functionally annotated protein sequences. The sequence data come from the GenBank/EMBL/DDBJ databases, published data and direct user submissions.

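PIR-PSD is a comprehensive, non-redundant, expertly annotated and fully classified protein sequence database. Its sequences come from translations of the coding sequences in the three major databases GenBank/EMBL/DDBJ, from sequences in the published literature, and from direct user submissions. The iProClass database is an integrated resource describing relationships between protein families and their structural/functional features; it holds more than 300,000 protein sequences from SWISS-PROT and PIR, with information on superfamilies, protein families, functional domains, structural motifs and post-translational modification sites.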


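Protein sequence database: SWISS-PROT/TrEMBL (www.expasy.org/swissprot). SWISS-PROT is an annotated protein database made up of protein sequence entries. Each entry contains the protein sequence, literature references, taxonomic information and annotation. The annotation covers the protein's function, post-translational modification sites, special sites and regions, secondary structure, quaternary structure, similarity to other sequences, and so on.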

Swiss-Prot: Endeavours to provide high-level annotation, e.g. descriptions of the function of the protein, the organisation of its domains, PTMs, family and disease relationships, variants, etc. Contains entries from >10,000 species, the bulk of these from just a handful of model organisms (H. sapiens, E. coli, M. musculus, D. melanogaster, S. cerevisiae, etc.). The quality of its annotations sets it apart from other databases. Consequently, it cannot keep pace with the rate of data acquisition from the sequencing centres.


Protein structure database: PDB (Protein Data Bank), http://www.rcsb.org/pdb/. PDB is the single worldwide repository for the processing and distribution of 3D structure data of large molecules of proteins and nucleic acids.

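A PDB structure entry consists of the following information: sequence information; atomic coordinates; crystallisation conditions; three-dimensional structure approximations computed by several methods; derived geometric data; structure factors; three-dimensional images of the structure; and links to other data resources.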


Protein family and domain database: PROSITE (Database of protein families and domains), http://www.expasy.org/prosite. PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

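PROSITE is built from the conserved regions found in multiple sequence alignments of homologous sequences within protein families. These regions are usually related to biological function, for example enzyme active sites and ligand- or metal-binding sites.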



III. Protein sequence analysis: (I) Protein sequence collection; (II) Protein sequence analysis

Three methods for collecting protein sequence data: (1) direct sequencing, e.g. with a mass spectrometer; (2) translating the coding DNA sequence, e.g. using the "ORF Finder" program to find the open reading frames of the DNA; (3) searching databases.

(I) Protein sequence collection. Method 1: direct sequencing, e.g. protein sequencing and identification by mass spectrometry.

Masses of Amino Acid Residues

Protein backbone: H-...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...-OH, running from the N-terminus to the C-terminus. Each -NH-CH(R)-CO- unit is one amino acid residue (residues i-1, i and i+1 are shown).

Breaking Protein into Peptides and Peptides into Fragment Ions (general for sequencing): Proteases, e.g. trypsin, break the protein into peptides. A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.

Breaking Protein into Peptides and Peptides into Fragment Ions (continued): The mass spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones. The mass spectrometer measures the mass/charge ratio of each ion.

Peptide Fragmentation by Collision-Induced Dissociation: the protonated (H+) peptide H-...-HN-CH(R_{i-1})-CO ... NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...-OH breaks into a prefix fragment and a suffix fragment. Peptides tend to fragment along the backbone. Fragments can also lose neutral chemical groups such as NH3 and H2O.

N- and C-terminal Peptides of GPFNA. N-terminal peptides: G, GP, GPF, GPFN. C-terminal peptides: A, NA, FNA, PFNA.

Terminal peptides and ion types. Peptide GPFN: mass (Da) 57 + 97 + 147 + 114 = 415. Peptide GPFN without H2O: mass (Da) 57 + 97 + 147 + 114 - 18 = 397.

N- and C-terminal Peptides of GPFNA (total mass 486 Da); each row is one backbone break, masses in Da:
N-terminal peptide   Mass   C-terminal peptide   Mass
G                    57     PFNA                 429
GP                   154    FNA                  332
GPF                  301    NA                   185
GPFN                 415    A                    71
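The ladder of N- and C-terminal peptide masses for GPFNA can be reproduced with a few lines of Python. This is a minimal sketch that uses only the five nominal (integer) residue masses quoted on the slides; the function name `ladder_masses` is hypothetical, and real spectra would need accurate masses plus ion-type offsets.

```python
# Nominal residue masses in Da, as quoted in the GPFNA example (not a full table).
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114, "A": 71}

def ladder_masses(peptide):
    """Return (prefixes, suffixes): mass of every proper N- and C-terminal peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    prefixes, suffixes = {}, {}
    for i in range(1, len(peptide)):
        prefixes[peptide[:i]] = sum(masses[:i])   # N-terminal peptide
        suffixes[peptide[i:]] = sum(masses[i:])   # complementary C-terminal peptide
    return prefixes, suffixes

prefixes, suffixes = ladder_masses("GPFNA")
total = sum(RESIDUE_MASS[aa] for aa in "GPFNA")
for n_pep, c_pep in zip(prefixes, suffixes):
    # Each complementary pair adds up to the mass of the whole peptide.
    print(n_pep, prefixes[n_pep], c_pep, suffixes[c_pep], prefixes[n_pep] + suffixes[c_pep] == total)
```

Differences between consecutive prefix masses (for example 154 - 57 = 97 for P) are what allow residues to be read off a spectrum.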

Peptide Fragmentation. [Diagram: a four-residue peptide backbone (side chains R1 to R4) showing the N-terminal fragment ions a2, b2, a3 and b3, the C-terminal fragment ions y1, y2 and y3, and the neutral-loss ions b2-H2O, b3-NH3, y2-NH3 and y3-H2O.]

Mass Spectra. The peaks in the mass spectrum: prefix and suffix fragments; fragments with neutral losses (-H2O, -NH3); noise and missing peaks. [Figure: annotated spectrum of an example peptide; the mass differences between peaks identify residues, e.g. 57 Da = 'G', 99 Da = 'V'.]

Protein Identification with MS/MS. Peptide identification: [Figure: MS/MS spectrum (intensity against mass) matched to the peptide GVDLK.]

Tandem Mass-Spectrometry

Breaking Proteins into Peptides. [Figure: the protein MPSERGTDIMRPAKID... is broken into peptides (MPSER, GTDIMR, PAKID, ...), which are separated by HPLC and passed to MS/MS.]

Mass Spectrometry: Matrix-Assisted Laser Desorption/Ionization (MALDI)

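Tandem Mass Spectrometry. [Figure: LC feeds the ion source; MS-1 records the MS scan (scan 1707); selected ions pass through the collision cell to MS-2, which records the MS/MS scan (scan 1708).]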

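Peptide fragment fingerprinting (PFF). Steps: digest the protein with a sequence-specific protease; after separation, the resulting peptides are selected and fragmented in the mass spectrometer to give MS/MS spectra, which are compared with spectra in a database for identification. Representative methods: LC-ESI-MS/MS; 2D-LC-MS/MS (shotgun).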

(I) Protein sequence collection. Method 2: translating the coding DNA sequence, e.g. using the "ORF Finder" program to find the open reading frames of the DNA. Website: ncbi.nlm.nih.gov/gorf/gorf.html
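To make the idea of finding open reading frames concrete, here is a minimal Python sketch. It is not the NCBI ORF Finder program: it assumes the standard genetic code, scans only the three forward frames, expects only A/C/G/T characters, and the names `find_orfs` and `translate_from` are hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_from(dna, start):
    """Translate codons from `start` up to a stop codon; None if no stop is found."""
    protein = []
    for j in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[j:j + 3]]
        if aa == "*":
            return "".join(protein)
        protein.append(aa)
    return None

def find_orfs(dna, min_length=2):
    """Return (start_index, protein) for each ATG...stop ORF in the forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] == "ATG":
                protein = translate_from(dna, pos)
                if protein is not None and len(protein) >= min_length:
                    orfs.append((pos, protein))
                    pos += 3 * (len(protein) + 1)  # jump past the stop codon
                    continue
            pos += 3
    return orfs

print(find_orfs("CCATGGCTTTTTAAGG"))
```

A fuller tool would also scan the reverse complement and support alternative genetic codes.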

(I) Protein sequence collection. Method 3: searching databases, e.g. the PIR-PSD database (pir.georgetown.edu/pirwww) and the SWISS-PROT/TrEMBL database (www.expasy.org/swissprot).


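Protein sequence analysis: once a protein sequence or a set of sequences has been obtained experimentally, we need to gather as much information about the protein as possible and, on that basis, predict its function, so sequence analysis is indispensable. Sequence analysis yields the basic properties of a protein sequence and can further be used to predict its secondary, tertiary and even quaternary structure.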

LEVEL OF FUNCTION INFORMATION IN PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN SECONDARY STRUCTURE MOTIF SITE 3D STRUCTURE RESIDUE

The 20 Amino Acids
A  Ala  Alanine
C  Cys  Cysteine
D  Asp  Aspartic acid (Aspartate)
E  Glu  Glutamic acid (Glutamate)
F  Phe  Phenylalanine
G  Gly  Glycine
H  His  Histidine
I  Ile  Isoleucine
K  Lys  Lysine
L  Leu  Leucine
M  Met  Methionine
N  Asn  Asparagine
P  Pro  Proline
Q  Gln  Glutamine
R  Arg  Arginine
S  Ser  Serine
T  Thr  Threonine
V  Val  Valine
W  Trp  Tryptophan
Y  Tyr  Tyrosine

AMINO ACID PROPERTIES
Small: Ala, Gly
Small hydroxyl: Ser, Thr
Basic: His, Lys, Arg
Aromatic: Phe, Tyr, Trp
Small hydrophobic: Val, Leu, Ile
Medium hydrophobic: Val, Leu, Ile, Met
Acidic/amide: Asp, Glu, Asn, Gln
Small/polar: Ala, Gly, Ser, Thr, Pro

Protein functions from specific residues
C: disulphide-rich proteins, metallothionein, zinc fingers
DE: acidic proteins (unknown)
G: collagens
H: histidine-rich glycoprotein
KR: nuclear proteins, nuclear localisation
P: collagen, filaments
SR: RNA-binding motifs
ST: mucins
Polar (C,D,E,H,K,N,Q,R,S,T): active sites
Aromatic (F,H,W,Y): protein ligand-binding sites
Zn2+-coord (C,D,E,H,N,Q): active site, zinc finger
Ca2+-coord (D,E,N,Q): ligand-binding site
Mg/Mn-coord (D,E,N,S,R,T): Mg2+ or Mn2+ catalysis, ligand binding
Ph-bind (H,K,R,S,T): phosphate and sulphate binding
Catalytic sites are mostly polar. Large aromatic amino acids are often found in protein-ligand interactions. Zn ions are coordinated by many different amino acids. Ca ions are often bound by acidic residues and amides. Mn and Mg are often bound by 2 acidic residues separated by hydrophobic amino acids. Phosphate and sulphate are often bound to the amino end of alpha helices.

Protein functions from regions Active sites- short, highly conserved regions Loops- charged residues and variable sequence Interior of protein- conservation of charged amino acids

Additional analysis of protein sequences: transmembrane regions, signal sequences, localisation signals, targeting sequences, GPI anchors, glycosylation sites, hydrophobicity, amino acid composition, molecular weight, solvent accessibility, antigenicity.
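Two of the simpler analyses in this list, amino acid composition and hydrophobicity, can be sketched in Python as follows. The slides do not name a hydrophobicity scale; the Kyte-Doolittle hydropathy values are used here as an assumption, and the function names and the example sequence are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy index (an assumed choice of scale).
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
      "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
      "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5}

def composition(seq):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

def hydropathy_profile(seq, window=5):
    """Mean hydropathy of every window of `window` consecutive residues."""
    return [sum(KD[aa] for aa in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

example = "MKTLLVVAILGLLAWSDEEKR"   # made-up sequence for illustration only
print(composition(example))
print(hydropathy_profile(example, window=7))
```

Long runs of strongly positive window averages are the kind of signal that transmembrane predictors look for; dedicated tools use longer windows and calibrated thresholds.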

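Methods of protein sequence analysis:
1. Similarity search (homology search): compare the new sequence with sequences in databases to find homologous or similar sequences.
2. Motif search and domain location: find the structural or functional domains in the protein.
3. Multiple sequence alignment: provides information on domains, the residues of functional sites, and the residues of the hydrophilic surface and hydrophobic core, and yields more templates for homology modelling or secondary structure prediction (manual correction is needed).
4. Homology modelling: (1) with significant homology (>50%), build an accurate structural model using the known structure of the hit sequence as the template; (2) when the hit sequences are isolated domains, find a suitable fold by secondary structure prediction and fold recognition, and model with that fold as the template.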

1. Similarity search (homology search): (1) a new sequence is aligned against the sequences in a sequence database to find homologous or similar sequences; (2) the commonly used program is BLASTp.

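Database search tool BLAST: BLAST, short for Basic Local Alignment Search Tool, is a commonly used database search program. The basic idea of the algorithm is first to find the most similar segments shared by the query sequence and the target sequence, and then to extend these seeds in both directions to obtain similar segments that are as long as possible.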
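The seed-and-extend idea can be illustrated with a toy Python sketch. This is not the BLAST implementation: it assumes exact word matches as seeds, a simple match/mismatch score instead of a substitution matrix, ungapped extension, and a hypothetical `xdrop` cut-off that stops extending once the score falls too far below the best seen.

```python
def seed_and_extend(query, target, word=3, match=1, mismatch=-1, xdrop=2):
    """Return the best ungapped hit as (score, query_start, target_start, length)."""
    # Index every word of the target by its start positions.
    index = {}
    for t in range(len(target) - word + 1):
        index.setdefault(target[t:t + word], []).append(t)

    best = None
    for q in range(len(query) - word + 1):
        for t in index.get(query[q:q + word], []):
            best_score = score = word * match
            # Extend to the right of the seed.
            best_right, k = 0, 0
            while q + word + k < len(query) and t + word + k < len(target):
                score += match if query[q + word + k] == target[t + word + k] else mismatch
                k += 1
                if score > best_score:
                    best_score, best_right = score, k
                elif best_score - score > xdrop:
                    break
            # Extend to the left of the seed, starting from the best score so far.
            score, best_left, k = best_score, 0, 0
            while q - k - 1 >= 0 and t - k - 1 >= 0:
                score += match if query[q - k - 1] == target[t - k - 1] else mismatch
                k += 1
                if score > best_score:
                    best_score, best_left = score, k
                elif best_score - score > xdrop:
                    break
            hit = (best_score, q - best_left, t - best_left, word + best_left + best_right)
            if best is None or hit[0] > best[0]:
                best = hit
    return best

hit = seed_and_extend("MKVLAAGIT", "PPKVLAAGPP")
print(hit)
```

The returned start positions and length delimit the matching segment in both sequences; a real search would repeat this against every database sequence and attach statistics to each hit.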

BLAST Input

BLAST results

BLAST results (2)

2. Motif search and domain location: a motif is a highly conserved element detected by multiple sequence alignment of a protein family; it often corresponds to a functional or structural domain. Motif searching is another kind of database search, whose targets are key conserved amino acids in the sequence. PROSITE database: www.expasy.ch/prosite

Motif search and domain location: the PROSITE database (www.expasy.ch/prosite) is a database of protein family motifs. It contains important sites, sequence patterns and sequence profiles, and can be used to assign a new protein sequence to a family accurately.

Pattern: this is another sequence search method, whose aim is to find the structural or functional domains of a protein. It directly describes the few key conserved residues in a sequence, called a "signature", i.e. the so-called pattern. The PROSITE database is commonly used.

PATTERNS: Small, highly conserved regions, shown as regular expressions. Example: [AG]-x-V-x(2)-x-{YW}. [] shows either amino acid; x is any amino acid; x(2) is any amino acid in the next 2 positions; {} shows any amino acid except these. BUT: limited to a near-exact match in a small region.
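Because such patterns are regular expressions in a different notation, they can be translated mechanically. The following Python sketch converts the subset of the syntax described on this slide ([], x, x(n), {}) into a standard regex; the function name `prosite_to_regex` is hypothetical, anchors such as < and > are not handled, and the test sequences are made up.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (subset of the syntax) into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core in ("x", "X"):
            part = "."                       # any amino acid
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"   # any amino acid except these
        else:
            part = core                      # a literal residue or an [..] choice
        if count:
            part += "{" + count + "}"
        parts.append(part)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)
print(re.search(regex, "MAKVLLGD"))
```

The second print shows a match object for the made-up sequence, whose last matched residue (D) is not Y or W; replacing that D with W makes the search fail.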

FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES: Pattern: short, simplest, but limited. Motif: a conserved element of a sequence alignment, usually predictive of a structural or functional region. To get more information across the whole alignment: matrix, profile, HMM.

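Domain location: searching the databases with the sequence yields some information about it; the next step is to locate its domains, which gives a clearer picture for subsequent structure prediction. If the protein is longer than 500 amino acids, it can be divided into several non-contiguous regions according to the search results (for example by level of similarity or number of domains), and it is best to analyse each of these segments separately.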

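Identifying the domains of a protein generally involves the following analyses:
(1) Test whether the segment is homologous to other complete sequences. If it is, this is good evidence that the segment is a domain; then search the structure databases, or annotated databases, to obtain information about the domain.
(2) Analyse low-complexity regions. In multidomain proteins these low-complexity sequences often separate domains; long repeats, especially of Pro, Glu, Ser and Thr, are often linker sequences and are good positions at which to split domains.
(3) Transmembrane regions. The transmembrane structure is very typical, strongly contiguous, and easy to predict with fairly high accuracy, so it also marks a boundary and makes it easy to separate extracellular from intracellular regions.
(4) Coiled-coil structures. These are sometimes spacer regions between protein domains; coiled-coil structure can be predicted on the COIL website.
(5) Secondary structure prediction. This is often used to predict the different folds contained in a structure. For example, one part of a sequence may be predicted to contain only α-helix and another part only β-sheet, which may indicate the presence of domains.
(6) Once the sequence has been successfully divided into well-formed domains, it is important to repeat the database searches and carry out independent alignments.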
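Low-complexity regions, as in step (2), can be flagged with a simple sliding-window Shannon entropy. This Python sketch is an illustrative assumption rather than the method used by any particular tool; the window size, the function name `window_entropy` and the example sequence are all hypothetical.

```python
import math
from collections import Counter

def window_entropy(seq, window=12):
    """Shannon entropy (bits) of the residue composition in each sliding window."""
    entropies = []
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i:i + window])
        h = -sum((n / window) * math.log2(n / window) for n in counts.values())
        entropies.append(h)
    return entropies

# Made-up sequence: a Pro/Ser-rich linker between two more varied segments.
example = "MKTAYIAKQRQISFVKSHFSRQPSPSPSPSPSPSPSLEERLGLIEVQAPILSRVGDGTQDNLSG"
profile = window_entropy(example)
lowest = min(range(len(profile)), key=profile.__getitem__)
print(lowest, example[lowest:lowest + 12], round(profile[lowest], 2))
```

Windows with entropy well below that of the rest of the sequence are candidate linkers, and hence candidate positions at which to split the protein before searching each piece separately.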

Motif search and domain location

Why create pattern databases? We often need to make more specific diagnoses than are possible simply by searching the primary databases. Pattern databases build on the principle that sequences may be gathered into alignments, within which are regions with little variation; these 'motifs' usually reflect some vital biological role in terms of structure or function. Motifs are exploited in different ways to build diagnostic patterns for protein families; new sequences can be searched against databases of such patterns to see if they can be assigned to known families, hence they offer a fast track to the inference of function.

Methods for family analysis
Single motif methods: fuzzy regex (eMOTIF), exact regex (PROSITE)
Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam)
Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks)

The challenge of family analysis: a highly divergent family with a single function, or a superfamily with many diverse functional families? We must distinguish between them if function analysis is done in silico: a tough challenge!

Know your family

The problem with domains

PROSITE: The first pattern database, based on the idea that a family can be characterised by a pattern of conserved residues in a single motif. Sequence information in motifs is reduced to regular expressions, and the seed regex is used to search Swiss-Prot (SP); results are inspected manually to achieve optimal results. Some families can't be characterised by single motifs; here, additional regexes are created until an optimal set is achieved that captures most or all of the family. Results are then manually annotated for inclusion in the database.

MATRIX: 210 possible amino acid pairs (190 pairs of different amino acids, 20 pairs of identical amino acids). Start with a sequence alignment and build up a table of the probabilities of finding each amino acid at each position of the sequence. It can be scored in several different ways.

Matrix scores can be based on:
Genetic code: base changes required to convert between the codons for 2 amino acids.
Chemical similarity: polarity, size, shape, charge.
Observed substitutions: based on analysing frequencies seen in alignments (inter-reliable).
Dayhoff mutation data matrix: likelihood of mutation from one amino acid to another, but different positions are not equally mutable, and it is only useful for close relationships because the underlying sequence alignments are of very closely related proteins.

Matrix scoring continued. BLOSUM (blocks substitution matrix): a matrix from ungapped alignments of distantly related sequences. Sequences similar at a threshold value of % identity are clustered; substitution frequencies for all pairs of amino acids are calculated and used to calculate log-odds scores. The threshold value can be varied. 3D structure matrix: derived from tertiary structure alignment; good, but only usable if the structure is known. The best matrices are derived from observed substitution data; it is important to select a scoring scheme appropriate for the evolutionary distance of interest.
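The log-odds step can be sketched in Python on a toy set of ungapped alignment columns. This is a simplified illustration that skips the clustering step at a % identity threshold; scores are in half-bit units (2 × log2 of observed over expected pair frequency), and the function name `log_odds_scores` and the three toy columns are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def log_odds_scores(columns):
    """Half-bit log-odds scores from pair counts in ungapped alignment columns."""
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}   # observed pair frequencies

    # Background frequency of each residue implied by the pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

scores = log_odds_scores(["AAAA", "SSSS", "AAAS"])
print(scores)
```

Pairs seen more often than chance predicts get positive scores and rarer pairs get negative ones; a real matrix needs far more data so that all 210 pairs are observed.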

PROFILES: A table or matrix containing comparison information for aligned sequences, used to find sequences similar to an alignment rather than to one sequence. It contains the same number of rows as positions in the sequences; each row contains the score for aligning that position with each residue.

Example of a Profile: Match values are higher for conserved residues. Note: in the third column, alanine has a lower score than methionine, because methionine is physically more similar to L, I, V and F.

Building a Profile: To get a good profile you need a good, hand-curated alignment. Use the alignment to build up a position-specific scoring matrix. Use the matrix (profile) to run PSI-BLAST with several iterations.
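A position-specific scoring matrix of this kind can be sketched in Python as follows. The sketch assumes an ungapped alignment, a uniform background frequency and a simple pseudocount, none of which is specified on the slides; `build_profile` and `score` are hypothetical names, and this is not how PSI-BLAST weights its sequences.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """One row per alignment position; each row maps residue -> log-odds score."""
    n_seqs = len(alignment)
    background = 1.0 / len(ALPHABET)          # assumed uniform background
    profile = []
    for column in zip(*alignment):
        row = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + pseudocount * background) / (n_seqs + pseudocount)
            row[aa] = math.log2(freq / background)
        profile.append(row)
    return profile

def score(profile, seq):
    """Sum of the per-position scores for an ungapped sequence of equal length."""
    return sum(row[aa] for row, aa in zip(profile, seq))

profile = build_profile(["ACD", "ACE", "ACD"])
for candidate in ("ACD", "ACE", "WWW"):
    print(candidate, round(score(profile, candidate), 2))
```

The profile has as many rows as alignment positions, and conserved residues receive the highest scores, which matches the description of profiles given on the earlier slides.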

SCORES: The E-value is the chance of a random sequence hitting. An E-value of 1.0 is not significant, 0.1 is possibly significant, and < 0.01 is most likely to be significant. It all depends on database size.
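The dependence on database size can be made explicit with the relation used for BLAST bit scores, E = m · n · 2^(-S'), where m is the query length, n the total database length and S' the bit score. The slides do not state this formula; it is introduced here as an assumption, edge-effect length corrections are ignored, and the function name is hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score` (no length correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)

# The same bit score becomes less significant as the database grows.
for db_len in (10**6, 10**8):
    print(db_len, expect_value(bit_score=40, query_len=300, db_len=db_len))
```

Each additional bit of score halves the E-value, while a database ten times larger multiplies it by ten.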

HIDDEN MARKOV MODELS (HMM): An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, built around probabilities. Package used: HMMER (http://hmmer.wustl.edu/). Start with one sequence or alignment and build the model with HMMbuild, then calibrate it with HMMcalibrate and search the database with the HMM. E-value: the number of false matches expected with a certain score. Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the curve of noise.

REPEATS Structural and evolutionary entities found in 2 or more copies Often assemble into elongated “rods”, “superhelices” or “barrel” structures Specialised cases when building profiles

PITFALLS OF METHODS
BLAST: only picks up homologues, not distant, divergent family members.
PSI-BLAST: fine for superfamilies, not very good for small, very conserved motifs.
Patterns: small, localised, and need highly conserved regions.
HMMER: slow process for searching a database.
Profiles: if a false positive is picked up, it pulls in its companions; in large families, members can be missed.
Alignment methods: automatic, less biological significance.

Big problem in protein sequence analysis: multidomain proteins. The most conserved domain will score highest in sequence similarity searches, so lower-scoring domains may be overlooked. Iterative searching of multidomain proteins could pick up unrelated proteins. [Diagram: proteins A, B and C with Domain 1 and Domain 2; A=B, B=C, A≠C; A, B and C share a common domain.]

SUMMARY OF PATTERN METHODS
Single motif method: exact regular expression (PROSITE)
Full domain alignment methods (ProDom, DOMO)
Full domain profile or HMM (Pfam, SMART)
Multiple motif methods: frequency matrix (PRINTS) or PSS matrix (BLOCKS)

COMMON PROTEIN PATTERN DATABASES: Prosite patterns, Prosite profiles, Pfam, SMART, Prints, ProDom, DOMO, BLOCKS.

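3. Multiple sequence alignment: multiple sequence alignment aligns more than two sequences that may be phylogenetically (evolutionarily) related. Most current algorithms are based on progressive alignment, i.e. the multiple alignment is built up and refined step by step from pairwise alignments.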

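Any alignment of two or more nucleotide or amino acid sequences represents, in a real sense, an explicit hypothesis about the evolutionary history of those sequences. Sequence alignment provides important information for: determining the function of newly discovered genes; determining evolutionary relationships between genes, between proteins, and even between species; and predicting protein structure and function, i.e. it can provide information on domains, the residues of functional sites, and hydrophilic and hydrophobic residues, and yields more templates for homology modelling or secondary structure prediction.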

Alignment: The strategy for sequence alignment without use of structural information is the same for DNA and protein. It must allow for point mutations, insertions and deletions.

Protein sequence alignment. Pairwise alignment:
a b a c d
a b _ c d
Multiple sequence alignment usually provides more information (a third sequence is added):
x b a c e
Multiple alignment is difficult to do for distantly related proteins.
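The pairwise example can be reproduced with a standard global alignment by dynamic programming (Needleman-Wunsch). The slides do not name an algorithm or scoring scheme, so the match, mismatch and gap values below are assumptions, `global_align` is a hypothetical name, and "-" is used for the gap where the slide uses "_".

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, top, bottom = global_align("abacd", "abcd")
print(best)
print(top)
print(bottom)
```

With these scores the single gap falls opposite the second "a", as in the slide's example; progressive multiple alignment methods repeat this kind of pairwise step along a guide tree.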

Questions Are two sequences homologous? What is the best alignment between them? What is the function of my protein? Is my SNP functionally important?

Alignment Strategy: How we look depends on what we're looking for. We want to equate residues/bases (sites) that share a common ancestor. For DNA this can only really be done at the sequence level. For proteins, structure is important (but in general we don't know the structure). For RNA (miRNA, tRNA etc.), structure (base pairing) is important.

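Multiple sequence alignment: a similarity search yields many related, similar sequences; two or more of these are then aligned together as a whole. Uses: building distribution maps of sequence patterns; clustering sequences to build molecular phylogenetic trees; and so on. Tool: ClustalW_mp, available at www.ebi.ac.uk/clustalw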

Sample Multiple Alignment

Multiple sequence alignment of 5 sequences:
AF380737 ------------------------------------------------------------
AF380734 GGGGTTGGTGTAAAATAGGGGTGGGGCTCCCCGGGCTATTTCGGCCCCTCCGGCTAGACC 60
......
AF380737 CAAAG--CCTTGGAG---TTAGC---ATAGGACGTTGGAACGATAGTGATAACGGATATG 466
AF380735 TATTGA-CCTTATAGATTTTATA---ATCGGGGAATAGCGTGACGTTAGTGGAGTCTAGG 477
AF380736 GTTGGCAGCTTGGCTATAGCGCT---ATAGGAGCTTAG-GTAACGTAGGGATCTATGTTG 465
AF380733 TAGGGATACTTAGGG----TTGC---ATAGG---CTATAGTTTCGATAGGTAACTTTAGG 455
AF380734 TATAATATAAAGGAGAAGTTATATTGGTGGGGTTCCAAGGCTATTTAGGCTAAGGGTTGG 706
* **
AF380737 AAATTCCCAAACTTTTTAGGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 804
AF380735 AAAATCCCAAACTTTTTACGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 797
AF380736 AAATTCCCAAACTTTTTAGGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 804
AF380733 AAAATCCCAAACTTTTTAGGTCCCTCAGGTAGGGGCGTTCTCC-GAAAACCGAAAAATGC 801
AF380734 AAAAATCCCAACTTTTTAGGTCCCTTAGGTAGGGGCGTTCTCCCGAAAACCGAAAAAATC 1053
*** ** **************** ***************** ************* *
AF380737 -ATGCAGAAACCCC-GTTCAAAAAT-CGGCCAAAATCGCCATTTTTACGATTTTCGTGTG 861
AF380735 -ATGCAGAAACCCC-GTTCAAAAAT-CGGCCAAAATCGCCATTTTTTCAATTTTCGTGTG 854
AF380736 -ATGCAGAAACCCC-GTTCAAAAAA-TGGCCAAAATCGCGATTTTTACGATTTTCGTGTG 861
AF380733 -ATGCAGAAACCCC-GTTCAAAAAA-TGCCCAAAATCGCGATTTTTACGATTTTCGTGTG 858
AF380734 GATGCAGAAACCCCCGTTCAAAAAAATGCCCAAAATCACGATTTTTACGATTTTCGTGTG 1113
************* ********* * ******** * ****** * ***********
AF380737 AAACTA 867
AF380735 AAACTA 860
AF380736 AAACTA 867
AF380733 AAACTA 864
AF380734 AAACTA 1119
******
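The rows of "*" in a CLUSTAL-style alignment mark the columns in which every sequence carries the same character. A minimal Python sketch of how such a line can be derived from aligned rows follows; the function name `conservation_line` is hypothetical, and columns containing a gap are never marked.

```python
def conservation_line(aligned):
    """Return '*' under each column where all sequences have the same residue."""
    marks = []
    for column in zip(*aligned):
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)

# Last block of the sample alignment: all five sequences end in AAACTA.
block = ["AAACTA"] * 5
print(conservation_line(block))

# A shorter made-up block with one variable column and one gapped column.
print(repr(conservation_line(["AC-GT", "AC-GT", "ATAGT"])))
```

The fraction of starred columns gives a quick, if crude, measure of how conserved a region of the alignment is.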

4. Homology modelling (see the next chapter)

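Review questions: 1. How can protein sequence data be obtained? 2. Be familiar with the names and features of more than one protein sequence database and structure database; briefly describe the PROSITE database and its applications. 3. Briefly describe the methods of protein sequence analysis. 4. What is the basic principle of multiple sequence alignment? Chinese-to-English vocabulary: sequence annotation (序列注释); motif and domain (模体和结构域); multiple sequence alignment (多序列比对).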