Molecular Phylogeny


Molecular Phylogeny: Bacteria, Archaea, Eukarya

Three great discoveries; survival of the fittest

Introduction Natural Selection “Natural selection is daily, hourly, scrutinising the slightest variations, rejecting those that are bad, preserving and adding up all those that are good”- The Origin of Species Charles Darwin (1809 - 1882)

Darwin’s Travels Lamarck - adaptations Wallace – natural selection

Galapagos Finches

The Galapagos Finches The beaks of the finches are adapted to different jobs in the same way as tools.

Artificial Selection

Natural Selection Overproduction Individual Variation Unequal Reproductive Success

Charles Darwin's 1859 book (On the Origin of Species By Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) introduced the theory of evolution. The struggle for existence induces natural selection. Three great discoveries; survival of the fittest.

Tree of Life

Five kingdom system (Haeckel, 1879). [Figure: tree labeled with mammals, vertebrates, animals, invertebrates, plants, fungi, protists, protozoa, monera.] Page 396

Introduction. At the molecular level, evolution is a process of mutation with selection. Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life. Phylogeny is the inference of evolutionary relationships: traditionally by comparison of morphological features, today by comparison of molecular sequence data.

Introduction. In the 1920s and 1930s, a synthesis occurred between Darwinism and Mendel's principles of inheritance. The basic processes of evolution are [1] mutation and [2] genetic recombination, as two sources of variability; [3] chromosomal organization (and its variation); [4] natural selection; and [5] reproductive isolation, which constrains the effects of selection on populations.

Levels of Selection: species, population, individual, gene. Selection at the species level may lead to extinction, generally following a large environmental change. Interspecific competition and predation can lead to population decline, unless the population can exploit new niches or find novel ways of avoiding predation. Intraspecific competition for shared resources acts on the survival of progeny. Individuals that can exploit their environment better than others survive to pass their genes to their descendants. The importance of the phenotypic characters expressed by the genes decides how selection acts on them.

Examples of clades Lindblad-Toh et al., Nature 438: 803, 8 Dec. 2005, fig. 10

Orthologs and paralogs. [Figure: an ancestral hemoglobin gene undergoes gene duplication, giving the α chain and β chain lineages. Within each chain, the frog, chick, and mouse genes are orthologs; the α chain and β chain genes are paralogs of each other.]

Gene duplication and loss. [Figure: genes 1, 2, 3 in lineages A, B, C, illustrating gene duplication, gene loss, gene merge, and pseudogene formation.]

CONCEPT and DEFINITION. Orthologs: homologs produced by speciation. They represent genes derived from a common ancestor that diverged due to divergence of the organisms they are associated with. They tend to have similar functions. Paralogs: homologs produced by gene duplication. They represent genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged. They tend to have different functions.

CONCEPT and DEFINITION. Xenologs: homologs resulting from horizontal gene transfer between two organisms. Determining whether a gene of interest was recently transferred into its current host by horizontal gene transfer is often difficult. The function of xenologs can vary, depending on how much the context changed for the horizontally transferred gene; in general, the function tends to be similar.

CONCEPT and DEFINITION. Ohnologs: paralogous genes that originated by whole-genome duplication (WGD). The name was given in honour of Susumu Ohno by Ken Wolfe. Ohnologs are interesting for evolutionary analysis because they have all been diverging for the same length of time since their common origin.

How to find orthologs and paralogs In eukaryotic genomes, most genes are members of gene families. When comparing genes from two species, therefore, most genes in one species will be homologous to multiple genes in the second. This often makes it difficult to distinguish orthologs (separated through speciation) from paralogs (separated by other types of gene duplication). Combining phylogenetic relationships, gene function and genomic position in both genomes helps to distinguish between these scenarios. There are many publications on this topic, such as: Steven B Cannon and Nevin D Young, OrthoParaMap: Distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies, BMC Bioinformatics 2003, 4:35

Bidirectional best hits (BBH). The best hit of a particular gene to a target genome is the gene in that genome that represents the best match. The match is bidirectional if the two genes are best hits of each other. A bidirectional best hit represents a very strong similarity between two genes, and is considered evidence that the genes may be orthologs arising from a common ancestor. Formally, the paper "The use of gene clusters to infer functional coupling" defines a bidirectional best hit (BBH) as follows: given two genes Xa and Xb from two genomes Ga and Gb, Xa and Xb are called a "bidirectional best hit (BBH)" if and only if recognizable similarity exists between them (in our case, we required Similarity Scores lower than 1.0 × 10^-5), there is no gene Zb in Gb that is more similar than Xb is to Xa, and there is no gene Za in Ga that is more similar than Xa is to Xb.
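The BBH definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names and scores are made up, the scores are treated as "higher is more similar" (an E-value-style cutoff such as the 1.0 × 10^-5 above would flip the comparison), and the function names `best_hits` and `bidirectional_best_hits` are hypothetical, not taken from the cited paper.

```python
from typing import Dict, List, Tuple

# Hypothetical similarity scores between genes of genome Ga (a1..a3) and
# genome Gb (b1, b2); higher means more similar. Real values would come from
# an all-against-all sequence search.
example_scores: Dict[Tuple[str, str], float] = {
    ("a1", "b1"): 90.0,
    ("a1", "b2"): 40.0,
    ("a2", "b1"): 60.0,
    ("a2", "b2"): 55.0,
    ("a3", "b2"): 70.0,
}


def best_hits(scores: Dict[Tuple[str, str], float], query_is_first: bool) -> Dict[str, str]:
    """Map each query gene to its highest-scoring gene in the other genome."""
    best: Dict[str, Tuple[str, float]] = {}
    for (gene_a, gene_b), score in scores.items():
        query, target = (gene_a, gene_b) if query_is_first else (gene_b, gene_a)
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(
    scores: Dict[Tuple[str, str], float], min_score: float = 0.0
) -> List[Tuple[str, str]]:
    """Return pairs (Xa, Xb) that are each other's best hit above min_score."""
    significant = {pair: s for pair, s in scores.items() if s >= min_score}
    a_to_b = best_hits(significant, query_is_first=True)
    b_to_a = best_hits(significant, query_is_first=False)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


print(bidirectional_best_hits(example_scores))
```

In this toy data a2's best hit is b1, but b1's best hit is a1, so a2 has no bidirectional partner.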

Use the bidirectional best hits (BBH) criterion to define orthologs when two genomes are compared by the Smith-Waterman algorithm at the amino acid sequence level with a threshold similarity score of 70. To characterize the genes of an organism, its gene set S(G1) is first mapped to the nodes of the graph G2 that encodes functional orthologs in another organism. G2 is then compared with an additional graph G3 of the original organism, instead of comparing G1 and G3 directly. Gene mapping: Genome Informatics 12: 44–53 (2001)

Gene mapping. Gene-gene relationships on a specific attribute can be denoted in a general manner by a set of binary relationships. For example, let the binary operator ' ∼ ' denote a binary relationship between two genes, and let g1, g2, g3, and g4 be a series of genes arranged in this order in a genome sequence; their positional relationships are then broken down into the set of binary relationships {g1 ∼ g2, g2 ∼ g3, g3 ∼ g4}. A set of binary relationships among genes forms a graph structure as a whole. The figure shows three graphs G1 (genome), G2 (pathway), and G3 (similarity), where each graph node corresponds to a gene or a gene product. In a graph, two nodes are connected by an edge (drawn as a solid line) when they are related by a binary relationship. If all or most of the genes in a set preserve their mutual relationships in multiple graphs, like the light gray nodes and the dark gray nodes, the biological relevance among those genes is considered to be well supported. We call such a set of genes a correlated gene cluster (or simply, correlated cluster), by which we can characterize, classify, and predict the activities of genes.

A. Mouse B. Human

Overview of the defensin gene cluster region in mouse (top) and human (bottom). A clone tiling path is shown for the corresponding regions in mouse (top) and human (bottom). Clones are displayed in yellow but regions overlapping with adjacent clones are shown in black. Genes are indicated by arrows. Genes in shadowed boxes are duplicated and the color indicates the pairs; A -- highlights all potential Defcr5 genes (see color legend for more details). The mouse assembly is based on NCBIM37, in which three gaps currently exist; two gaps are indicated by grey bars and the biggest gap between the two clusters is joined by a 'V'. Annotation of mouse defensin genes: Amid et al. BMC Genomics 2009 10:606 doi:10.1186/1471-2164-10-606.

Concept of a phylogenetic tree. Phylogenetic Trees: In each panel, the phylogenetic group is depicted by a green shaded circle. A) Monophyletic group. All species in the group (C and D) share a common ancestor (E) not shared by any other species. B) Paraphyletic group. All species in the group share a common ancestor (F), but some species (D) have been excluded from the group. C) Polyphyletic group. A grouping of lineages each more closely related to other species not in the group than they are to each other. --From Barton et al., (2007) Evolution, p. 111.

Rooted and unrooted trees

Scaled trees

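Concept of a phylogenetic tree. In general, a phylogenetic tree is a two-dimensional diagram showing the evolutionary relationships among species; it can also reflect the evolutionary relationships among molecules (genes) from different species. The length of the branches reflects the number of sequence changes. Often a uniform rate of mutation is assumed (molecular clock hypothesis). [Figure: (1) a rooted tree and (2) an unrooted tree of sequences A, B, C, and D, with nodes and branches labeled.]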

Molecular phylogeny: nomenclature of trees There are two main kinds of information inherent to any tree: topology and branch lengths. We will now describe the parts of a tree. Page 366

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data. [Figure: tree of taxa A-I with a time axis and branch lengths labeled.]

Tree nomenclature. Node: the intersection or terminating point of two or more branches. Branch: an edge. [Figure: tree of taxa A-I with a time axis and branch lengths labeled.]

Tree nomenclature: taxon

Tree nomenclature. Operational taxonomic unit (OTU), such as a protein sequence. Hypothetical taxonomic unit (HTU). [Figure: tree of taxa A-I with a time axis and branch lengths labeled.]

Tree nomenclature. Branches are scaled: branch lengths are proportional to the number of amino acid changes. Branches are unscaled: OTUs are neatly aligned, and nodes reflect time. Fig. 11.4 Page 366

Tree nomenclature: bifurcating and multifurcating internal nodes. Fig. 11.5 Page 367

Tree nomenclature: clades. Clade ABF (monophyletic group). [Figure: tree of taxa A-I with branch lengths and a time axis.] Fig. 11.4 Page 366

Tree nomenclature. Clade CDH. [Figure: tree of taxa A-I with branch lengths and a time axis.] Fig. 11.4 Page 366

Tree nomenclature. Clade ABF/CDH/G. [Figure: tree of taxa A-I with branch lengths and a time axis.] Fig. 11.4 Page 366

Monophyletic, paraphyletic, and polyphyletic groups

Ingroup, outgroup, and sister group

Species trees versus gene/protein trees. Molecular evolutionary studies can be complicated by the fact that both species and genes evolve. Speciation usually occurs when a species becomes reproductively isolated. In a species tree, each internal node represents a speciation event. Genes (and proteins) may duplicate or otherwise evolve before or after any given speciation event. The topology of a gene (or protein) based tree may differ from the topology of a species tree. Page 370

Molecular clock hypothesis In the 1960s, sequence data were accumulated for small, abundant proteins such as globins, cytochromes c, and fibrinopeptides. Some proteins appeared to evolve slowly, while others evolved rapidly. Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock: For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages

Molecular clock hypothesis. As an example, Richard Dickerson (1971) plotted data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides. The x-axis shows the divergence times of the species, estimated from paleontological data. The y-axis shows m, the corrected number of amino acid changes per 100 residues. n is the observed number of amino acid changes per 100 residues, and it is corrected to m to account for changes that occur but are not observed: $\frac{n}{100} = 1 - e^{-m/100}$
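As a small worked sketch of this correction, solving n/100 = 1 - e^(-m/100) for m gives m = -100 ln(1 - n/100). The function names below are illustrative only.

```python
import math


def corrected_changes(n_observed: float) -> float:
    """Corrected changes m per 100 residues, from observed changes n per 100."""
    if not 0.0 <= n_observed < 100.0:
        raise ValueError("n must satisfy 0 <= n < 100")
    return -100.0 * math.log(1.0 - n_observed / 100.0)


def observed_changes(m_corrected: float) -> float:
    """Inverse relation: n = 100 * (1 - exp(-m / 100))."""
    return 100.0 * (1.0 - math.exp(-m_corrected / 100.0))


for n in (10, 50, 80):
    print(n, round(corrected_changes(n), 1))
```

The correction is small when few changes are observed and grows quickly as n approaches 100, which is why m rather than n is plotted against divergence time.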

[Figure: Dickerson (1971). Corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence.]

Molecular clock hypothesis: conclusions Dickerson drew the following conclusions: For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein. The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides). The observed variations in rate of change reflect functional constraints imposed by natural selection.

Molecular clock hypothesis: λ and PAM. The rate of amino acid substitution is measured by λ, the number of substitutions per amino acid site per year. Consider serum albumin: $\lambda = 1.9 \times 10^{-9}$, so $\lambda \times 10^{9} = 1.9$. Dayhoff et al. reported the rate of mutation acceptance for serum albumin as 19 PAMs per 100 million years, where one PAM is one accepted point mutation per 100 amino acid residues. (19 subst./100 aa/$10^{8}$ years = 1.9 subst./1 aa/$10^{9}$ years)
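The unit conversion can be written out step by step. This is a worked check using the serum albumin value λ = 1.9 × 10^-9 substitutions per site per year and taking one PAM to mean one accepted point mutation per 100 residues.

```latex
\begin{align*}
\lambda &= 1.9 \times 10^{-9}\ \text{substitutions per site per year} \\
\lambda \times 10^{9}\ \text{years} &= 1.9\ \text{substitutions per site per } 10^{9}\ \text{years} \\
\lambda \times 100\ \text{sites} \times 10^{8}\ \text{years} &= 19\ \text{substitutions per 100 sites per } 10^{8}\ \text{years}
  = 19\ \text{PAMs per } 10^{8}\ \text{years}
\end{align*}
```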

Molecular clock for proteins: rate of substitutions per aa site per 10^9 years

Protein          Rate
Fibrinopeptides  9.0
Kappa casein     3.3
Lactalbumin      2.7
Serum albumin    1.9
Lysozyme         0.98
Trypsin          0.59
Insulin          0.44
Cytochrome c     0.22
Histone H2B      0.09
Ubiquitin        0.010
Histone H4       0.010

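Steps in phylogenetic data analysis. The four main steps in a phylogenetic analysis of DNA or protein sequences are: multiple sequence alignment, choosing a substitution model, building the tree, and evaluating the tree.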

Partial alignment of histones from PFAM (λ = 0.05)

H2A1_HUMAN/4-119   R.KGNYAERV GAGAPVYLAA VLEYLTAEIL ELAGNAARDN KKTRIIPR
H2A1_YEAST/3-120   R.RGNYAQRI GSGAPVYLTA VLEYLAAEIL ELAGNAARDN KKTRIIPR
H2A3_VOLCA/5-119   K.KGKYAERI GAGAPVYLAA VLEYLTAEVL ELAGNAARDN KKNRIVPR
H2A_PLAFA/5-120    K.KGKYAKRV GAGAPVYLAA VLEYLCAEIL ELAGNAARDN KKSRITPR
H2A1_PEA/11-128    K.KGRYAQRV GTGAPVYLAA VLEYLAAEVL ELAGNAARDN KKNRISPR
H2A1_TETPY/7-123   K.HGRYSERI GTGAPVYLAA VLEYLAAEVL ELAGNAAKDN KKTRIVPR
H2AM_RAT/4-116     K.KGHPKYRI GVGAPVYMAA VLEYLTAEIL ELAGNAARDN KKGRVTPR
H2A_EUGGR/18-134   R.AGRYAKRV GKGAPVYLAA VLEYLSAELL ELAGNASRDN KKKRITPR
H2A2_XENLA/4-119   R.KGNYAERV GAGAPVYLAA VLEYLTAEIL ELAWERLPEI TKRPVLSP
H2AV_CHICK/6-121   KTRTTSHGRV GATAAVYSAA ILEYLTAEVL ELAGNASKDL KVKRITPR
H2AV_TETTH/6-131   KGRVSAKNRV GATAAVYAAA ILEYLTAEVL ELAGNASKDF KVRRITPR

Partial alignment of casein from PFAM (λ = 3.3)

CASK_BOVIN/2-190   VLSRYPSYGL NYYQQKPVAL .INNQFLPYP YYAKPAAVRS PAQILQWQVL
CASK_CERNI/2-190   ALSRYPSYGL NYYQHRPVAL .INNQFLPYP YYVKPGAVRS PAQILQWQVL
CASK_CAMDR/1-182   VQSRYPSYGI NYYQHRLAVP .INNQFIPYP NYAKPVAIRL HAQIPQCQAL
CASK_PIG/2-188     MLNRFPSYGF .FYQHRSAVS .PNRQFIPYP YYARPVVAGP HAQKPQWQDQ
CASK_HUMAN/1-182   VPNSYPYYGT NLYQRRPAIA .INNPYVPRT YYANPAVVRP HAQIPQRQYL
CASK_RABIT/2-179   VMNRYPQYEP SYYLRRQAVP .TLNPFMLNP YYVKPIVFKP NVQVPHWQIL
CASK_CAVPO/2-181   VLNNYLRTAP SYYQNRASVP .INNPYLCHL YYVPSFVLWA QGQIPKGPVS
CASK_MOUSE/2-181   VLN.FNQYEP NYYHYRPSLP ATASPYMYYP LVVRLLLLRS PAPISKWQSM
CASK_RAT/2-178     VLN.RNHYEP IYYHYRTSVP ..VSPYAYFP VGLKLLLLRS PAQILKWQPM

Most conserved proteins in worm, human, and yeast (percent identity)

Protein        worm/human  worm/yeast  yeast/human
H4 histone     99          91          92
H3.3 histone   99          89          90
Actin B        98          88          89
Ubiquitin      98          95          96
Calmodulin     96          59          58
Tubulin        94          75          76

See Copley et al. (1999)

Sanger and colleagues sequenced insulin (1950s)

Human       CGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLEN
chimpanzee  CGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLEN
rabbit      CGERGFFYTPKSRREVEELQVGQAELGGGPGAGGLQPSALELALQKRGIVEQCCTSICSLYQLEN
dog         CGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLEN
horse       CGERGFFYTPKAXXEAEDPQVGEVELGGGPGLGGLQPLALAGPQQXXGIVEQCCTGICSLYQLEN
mouse       CGERGFFYTPMSRREVEDPQVAQLELGGGPGAGDLQTLALEVAQQKRGIVDQCCTSICSLYQLEN
rat         CGERGFFYTPMSRREVEDPQVAQLELGGGPGAGDLQTLALEVARQKRGIVDQCCTSICSLYQLEN
pig         CGERGFFYTPKARREAENPQAGAVELGG--GLGGLQALALEGPPQKRGIVEQCCTSICSLYQLEN
chicken     CGERGFFYSPKARRDVEQPLVSSPLRG---EAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLEN
sheep       CGERGFFYTPKARREVEGPQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCAGVCSLYQLEN
bovine      CGERGFFYTPKARREVEGPQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCASVCSLYQLEN
whale       CGERGFFYTPKA-----------------------------------GIVEQCCTSICSLYQLEN
elephant    CGERGFFYTPKT-----------------------------------GIVEQCCTGVCSLYQLEN

We can make a multiple sequence alignment of insulins from various species, and see conserved regions…

Mature insulin consists of an A chain and B chain heterodimer connected by disulphide bridges. The signal peptide and C peptide are cleaved, and their sequences display fewer functional constraints.

Note the sequence divergence in the disulfide loop region of the A chain

[Figure: number of nucleotide substitutions/site/year for three regions of insulin, labeled 0.1 × 10^-9, 1 × 10^-9, and 0.1 × 10^-9.]

http://evolution.genetics.washington.edu/phylip/software.html This site lists 200 phylogeny packages. Perhaps the best-known programs are PAUP (David Swofford and colleagues) and PHYLIP (Joe Felsenstein).

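In phylogenetic analysis, the alignment step introduces a guide tree. The guide tree produced by an alignment program such as CLUSTAL can be converted to the PHYLIP tree file format and then loaded into a tree-drawing program. Commonly used tree-drawing programs include TreeTool (X Windows), PHYLIP, TREEVIEW, PAUP, and MEGA.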

The three main tree-building methods are: 1. Distance Matrix 2. Maximum Parsimony (MP) 3. Maximum Likelihood (ML)

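Distance trees examine the pairwise alignments of all sequences in the data set and determine the topology and branch lengths of the tree from the pairwise differences between sequences. Maximum parsimony examines the multiple sequence alignment of the data set and optimizes the tree. Maximum likelihood examines the multiple sequence alignment of the data set and optimizes a tree with a particular topology and branch lengths, namely the tree that has the highest probability of producing the observed multiple alignment.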

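Distance matrix methods: the neighbor-joining (NJ) method and UPGMA. Both methods first require a symmetric distance matrix (a square matrix of order m), $D = \{d_{ij}\}_{m \times m}$, where m is the number of OTUs (taxa). There are many formulas for the distance coefficient. For example, Nei's (1972) genetic distance is suited to restriction enzyme and isozyme data, whereas the Jukes-Cantor one-parameter distance and the Kimura two-parameter distance are widely used for all kinds of sequence data.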
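The two sequence-based distances named above can be sketched as follows, using their standard textbook forms: Jukes-Cantor d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and Kimura two-parameter d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function names and the toy sequences are illustrative assumptions.

```python
import math
from typing import Tuple

PURINES = {"A", "G"}


def transition_transversion(seq1: str, seq2: str) -> Tuple[float, float]:
    """Return (P, Q): proportions of transitions and transversions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns containing gaps or ambiguous bases.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    transitions = sum(1 for a, b in pairs
                      if a != b and (a in PURINES) == (b in PURINES))
    transversions = sum(1 for a, b in pairs
                        if a != b and (a in PURINES) != (b in PURINES))
    return transitions / len(pairs), transversions / len(pairs)


def jukes_cantor(seq1: str, seq2: str) -> float:
    """Jukes-Cantor one-parameter distance."""
    p_ts, q_tv = transition_transversion(seq1, seq2)
    p = p_ts + q_tv
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance."""
    p_ts, q_tv = transition_transversion(seq1, seq2)
    return (-0.5 * math.log(1.0 - 2.0 * p_ts - q_tv)
            - 0.25 * math.log(1.0 - 2.0 * q_tv))


print(jukes_cantor("AAAAAAAAAA", "AAAAAAAAAG"))
print(kimura_2p("AAAAAAAAAA", "AAAAAAAAAG"))
```

Both corrections exceed the raw proportion of differences, because they account for multiple substitutions at the same site.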

Tree-building methods We will discuss two tree-building methods: distance-based and character-based. Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining.

Tree-building methods We can introduce distance-based and character-based tree-building methods by referring to a tree of 13 orthologous retinol-binding proteins, and the multiple sequence alignment from which the tree was generated.

Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs. [Figure: tree of RBP orthologs from common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, horse, rat, pig, cow, and rabbit; scale bar: 10 changes.]

[Figure: the same tree, with the fish RBP orthologs (common carp, zebrafish, rainbow trout, teleost) marked separately from the other vertebrate RBP orthologs (African clawed frog, chicken, human, mouse, horse, rat, pig, cow, rabbit); scale bar: 10 changes.]

Distance-based tree Calculate the pairwise alignments; if two sequences are related, put them next to each other on the tree

Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors

Stage 3: Tree-building methods: distance Many software packages are available for making phylogenetic trees. We will describe two programs. [1] MEGA (Molecular Evolutionary Genetics Analysis) by Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei. Download it from http://www.megasoftware.net/ [2] Phylogeny Analysis Using Parsimony (PAUP), written by David Swofford. See http://paup.csit.fsu.edu/. We will next use MEGA and PAUP to generate trees by the distance-based method UPGMA.

How to use MEGA to make a tree. [1] Enter a multiple sequence alignment (.meg) file. [2] Under the phylogeny menu, select one of these methods: Maximum Likelihood (ML), Neighbor-Joining (NJ), Minimum Evolution (ME), UPGMA, Maximum Parsimony (MP).

Use of MEGA for a distance-based tree: UPGMA Click green boxes to obtain options Click compute to obtain tree

Use of MEGA for a distance-based tree: UPGMA

Use of MEGA for a distance-based tree: UPGMA A variety of styles are available for tree display

Use of MEGA for a distance-based tree: UPGMA Flipping branches around a node creates an equivalent topology

How to use PAUP to make a tree. Step 1: Obtain MSF. Step 2: Convert. Step 3: Import to PAUP and execute. Step 4: Perform analyses (generate trees). Step 5: More analyses (evaluate trees). Step 6: View, export, print trees.

How to use PAUP to make a tree. Step 1: Get a multiple sequence alignment (e.g. from PFAM). Step 2: Convert it with ReadSeq (use a Google search to identify a site offering ReadSeq, such as the Baylor College of Medicine). Step 3: Import as a new file into PAUP.

Fig. 11.15 Page 380

PAUP allows input of multiple sequence alignments, data editing, creation and analysis of phylogenetic trees Fig. 11.15 Page 380

Making trees using UPGMA In PAUP, you can set the tree-making criterion to “distance” then choose UPGMA (unweighted pair group method with arithmetic mean) Page 379

PAUP performs UPGMA (distance-based tree) Fig. 11.16 Page 381

Tree-building methods: UPGMA. UPGMA is the unweighted pair group method using arithmetic mean. [Figure: five proteins, numbered 1-5.]

Tree-building methods: UPGMA. Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.

Tree-building methods: UPGMA. Step 2: Find the two proteins with the smallest pairwise distance. Cluster them. [Figure: proteins 1 and 2 joined at node 6.]

Tree-building methods: UPGMA. Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them. [Figure: proteins 4 and 5 joined at node 7.]

Tree-building methods: UPGMA. Step 4: Keep going. Cluster. [Figure: node 8 added; leaf order 1, 2, 4, 5, 3.]

Tree-building methods: UPGMA. Step 5: Last cluster! This is your tree. [Figure: node 9 added; leaf order 1, 2, 4, 5, 3.]
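The clustering steps above can be sketched in Python. This is a minimal illustration, not the procedure used by MEGA or PAUP: the pairwise distances are made up so that the merge order matches the slides (1 with 2, then 4 with 5, then those two clusters, then 3), the tree is returned as nested tuples without branch lengths, and the names `upgma` and `example_distances` are hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise distances between five proteins.
example_distances = {
    ("1", "2"): 2.0, ("4", "5"): 3.0,
    ("1", "4"): 6.0, ("1", "5"): 6.0, ("2", "4"): 6.0, ("2", "5"): 6.0,
    ("1", "3"): 10.0, ("2", "3"): 10.0, ("3", "4"): 10.0, ("3", "5"): 10.0,
}


def upgma(names, distances):
    """Cluster OTUs by UPGMA and return the topology as nested tuples."""
    size = {name: 1 for name in names}  # cluster -> number of OTUs in it
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(size) > 1:
        # Find the two clusters with the smallest pairwise distance.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        merged_size = size[a] + size[b]
        # Distance from the new cluster to every other cluster is the
        # size-weighted arithmetic mean of the two old distances.
        for c in size:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / merged_size
        del size[a], size[b]
        size[merged] = merged_size
    (tree,) = size
    return tree


print(upgma(["1", "2", "3", "4", "5"], example_distances))
```

Weighting by cluster size makes each original OTU contribute equally to the averaged distance, which is what "unweighted" refers to in the method's name.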

Making trees using neighbor-joining. The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa. Begin by placing all the taxa in a star-like structure.

Tree-building methods: Neighbor joining Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

Tree-building methods: Neighbor joining. Define the distance from X to Y by $d_{XY} = \frac{1}{2}(d_{1Y} + d_{2Y} - d_{12})$
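One round of neighbor joining can be sketched as below. The distance update is the formula above. The pair-selection criterion, Q(i, j) = (n - 2) d(i, j) - Σ d(i, k) - Σ d(j, k), is the commonly used form of the algorithm and is an assumption here, since the slides only say to minimize the sum of branch lengths. The distances and the name `nj_step` are illustrative.

```python
from itertools import combinations

# Illustrative distances among five taxa.
example_nj = {
    frozenset(pair): value for pair, value in {
        ("a", "b"): 5.0, ("a", "c"): 9.0, ("a", "d"): 9.0, ("a", "e"): 8.0,
        ("b", "c"): 10.0, ("b", "d"): 10.0, ("b", "e"): 9.0,
        ("c", "d"): 8.0, ("c", "e"): 7.0, ("d", "e"): 3.0,
    }.items()
}


def nj_step(taxa, d):
    """Join one pair of neighbors; return (remaining taxa, new distances, new node)."""
    n = len(taxa)
    total = {i: sum(d[frozenset((i, k))] for k in taxa if k != i) for i in taxa}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - total[i] - total[j]

    i, j = min(combinations(taxa, 2), key=q)
    new_node = (i, j)
    # Keep distances that do not involve the joined pair.
    new_d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
    # d_XY = 1/2 (d_1Y + d_2Y - d_12) for every remaining taxon Y.
    for k in taxa:
        if k not in (i, j):
            new_d[frozenset((new_node, k))] = 0.5 * (
                d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
            )
    remaining = [k for k in taxa if k not in (i, j)] + [new_node]
    return remaining, new_d, new_node


remaining, new_d, node = nj_step(list("abcde"), example_nj)
print(node, new_d[frozenset((node, "c"))])
```

Repeating the step on the returned taxa and distances until two nodes remain would give the full tree. Note that the pair with the smallest raw distance (d and e here) is not the pair joined first.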

Example of a neighbor-joining tree: phylogenetic analysis of 13 RBPs

Tree-building methods: character based Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters). Tree-building methods based on characters include maximum parsimony and maximum likelihood.

As an example of tree-building using maximum parsimony, consider these four taxa: AAG AAA GGA AGA How might they have evolved from a common ancestor such as AAA?

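Tree-building methods: Maximum parsimony. [Figure: three candidate trees for the taxa AAG, AAA, GGA, and AGA, with inferred ancestral sequences and the number of changes on each branch. Grouping (AAG, AAA) with (GGA, AGA) has cost 3; grouping (AAG, AGA) with (AAA, GGA) has cost 4; grouping (AAG, GGA) with (AAA, AGA) has cost 4.] In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).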
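The costs of the three candidate trees can be checked with a short sketch. It uses Fitch's small-parsimony algorithm, which is a standard way to count the minimum number of changes on a fixed tree and is swapped in here for the by-hand counting on the slide. The taxon labels t1 to t4 and the function names are hypothetical.

```python
# The four taxa from the example, with hypothetical labels.
example_seqs = {"t1": "AAG", "t2": "AAA", "t3": "GGA", "t4": "AGA"}


def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):  # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, seqs):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


candidate_trees = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t4"), ("t2", "t3")),
    (("t1", "t3"), ("t2", "t4")),
]
for tree in candidate_trees:
    print(tree, parsimony_cost(tree, example_seqs))
```

Each column is scored independently and the costs are summed, so the tree that groups AAG with AAA comes out as the most parsimonious.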

Phylogram (values are proportional to branch lengths)

Rectangular phylogram (values are proportional to branch lengths)

Cladogram (values are not proportional to branch lengths)

Rectangular cladogram (values are not proportional to branch lengths) These four trees display the same data in different formats.

Making trees using maximum likelihood. Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process. ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP.

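Character-based tree-building methods. The character-based methods are maximum parsimony and maximum likelihood. Maximum parsimony (MP) is an optimality criterion: the tree should explain the observed differences among the taxa under study with the fewest changes. It makes the fewest ad hoc assumptions and gives the simplest explanation; in practice, the MP tree is the shortest tree, the one with the fewest changes.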

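Maximum Parsimony Method. Step 1: Input: a multiple sequence alignment. Step 2: For each aligned position, identify the trees that require the smallest number of evolutionary changes to produce the observed sequence changes. Step 3: Continue this analysis for every position in the sequence alignment. Step 4: With the sequence variation at each aligned position placed at the tips of the tree, identify the tree that requires the smallest number of changes over all sequence positions. Best suited to data with many informative sites.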

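Maximum likelihood (ML). ML searches the phylogenetic problem exhaustively. It seeks the evolutionary model (the search includes the tree itself) under which the data the model would generate most closely resemble the observed data. For each site, ML computes the probability of the observed pattern of change under a specified substitution process; the site likelihood is the sum of the probabilities of every possible reconstruction of substitutions under that process. Multiplying the likelihoods of all sites gives the likelihood of the whole tree.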

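Maximum likelihood. The method uses probability calculations to find the tree that best accounts for the sequence variation. Each column of the multiple sequence alignment is analyzed, and all trees are considered. An evolutionary model of sequence change provides estimates of the rate at which one base changes into another:

$$
\begin{array}{c|cccc}
\text{Base} & A & C & G & T \\
\hline
A & -u(a\pi_C + b\pi_G + c\pi_T) & u a \pi_C & u b \pi_G & u c \pi_T \\
C & u g \pi_A & -u(g\pi_A + d\pi_G + e\pi_T) & u d \pi_G & u e \pi_T \\
G & u h \pi_A & u j \pi_C & -u(h\pi_A + j\pi_C + f\pi_T) & u f \pi_T \\
T & u i \pi_A & u k \pi_C & u l \pi_G & -u(i\pi_A + k\pi_C + l\pi_G)
\end{array}
$$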

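Steps of the maximum likelihood method. Step 1: Align the set of sequences. Step 2: Examine the substitutions in each column for their fit to a set of trees describing the possible phylogenetic relationships among the sequences. Each tree has a likelihood given the data set. Advantages: can evaluate trees with rate variation and can be used for more divergent sequences. Disadvantage: computationally intensive.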
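The core idea, computing a likelihood per column and combining columns, can be illustrated on the simplest possible "tree": two sequences joined by one branch. The sketch below assumes the Jukes-Cantor model, under which a site is unchanged with probability 1/4 + (3/4)e^(-4d/3) and changed to one particular other base with probability 1/4 - (1/4)e^(-4d/3), where d is the branch length in substitutions per site. The grid search and the function names are illustrative; real ML programs optimize over topologies and many branch lengths at once.

```python
import math


def jc69_log_likelihood(seq1: str, seq2: str, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by branch length d."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        # Site likelihood: base frequency 1/4 times the transition probability.
        # Site likelihoods multiply, so their logarithms add.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


def ml_branch_length(seq1: str, seq2: str, step: float = 0.001, max_d: float = 3.0) -> float:
    """Grid-search the branch length that maximizes the likelihood."""
    candidates = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(seq1, seq2, d))


print(ml_branch_length("AAAAAAAAAA", "AAAAAAAAAG"))
```

For this two-sequence case the maximum-likelihood branch length should agree, up to the grid resolution, with the Jukes-Cantor distance formula, which is a useful sanity check.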

NEXUS format ((IM21:100.0,((((((Pa10:100.0,((((NI1k:100.0,NIM3:100.0):84.0,MU4k:100.0):79.0, (((LZ11:100.0,PT18:100.0):71.0,LR20:100.0):19.0,FL19:100.0):13.0):5.0,(AC15:100.0, (MC16:100.0,FU14:100.0):99.0):89.0):13.0):6.0,((PI7k:100.0,TU6k:100.0):45.0,TE80:100.0):33.0):11.0, LG12:100.0):15.0,(XI22:100.0,(CH17:100.0,GR13:100.0):104.0):89.0):19.0,PU5k:100.0):34.0, LI90:100.0):43.0):61.0,out:100.0);

Guinea-pig

Rodent polyphyly? [Figure: three alternative trees (Tree-1, Tree-2, Tree-3) relating human, mouse, and guinea-pig, including the traditional view in which mouse and guinea-pig group together as rodents.] D'Erchia et al. (1996) Nature 381, 597; Graur, Hide and Li (1991) Nature 351, 649

Rodent polyphyly? [Figure: ProtML and ME trees.] Reyes et al. (2000) Mol. Biol. Evol. 17:979--983

Where did HIV come from? Freeman & Herron, 2001. Evolutionary Analysis. Prentice Hall

2003/6/13 Science

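Two viruses from different monkey species recombined in African chimpanzees to form the SIV strain that gave rise to human AIDS. SIVcpz arose in chimpanzees through successive transmission and recombination of SIVs from red-capped mangabeys and greater spot-nosed monkeys. Chimpanzees prey on both of these monkeys, and the ranges of these monkeys and of chimpanzees overlap in west-central Africa. Humans are not the only species to have acquired two different SIV strains through natural cross-species transmission, which most likely results from predation. Does chimpanzee predation on small monkeys lead them to acquire other SIV infections? How likely are co-infection with SIVcpz and recombination with SIVcpz? Are these chimpanzee-adapted SIVs ultimately more likely to infect humans?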

Hasegawa, 1998

TreeBASE at Harvard University

TreeFam: Tree families database http://treefam.genomics.org.cn/

Pfam: Protein families database http://pfam.xfam.org/

Stage 4: Evaluating trees. The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness. Evaluation of accuracy can refer to an approach (e.g. UPGMA) or to a particular tree.

Strict consensus tree

Majority-rule consensus tree


Stage 4: Evaluating trees: bootstrapping. Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set? To bootstrap, make an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. Support above 70% is commonly considered significant.
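The resampling step can be sketched as follows. The alignment is a made-up toy example, and `clades_of_tree` is a hypothetical callback standing in for whatever tree-building method is being evaluated; it is assumed to return the set of clades (as frozensets of taxon names) found in the tree built from a replicate.

```python
import random

# Toy alignment: taxon name -> aligned sequence (all the same length).
example_alignment = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}


def bootstrap_alignment(alignment, rng):
    """Sample alignment columns with replacement, keeping the original size."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def clade_support(alignment, clades_of_tree, clade, replicates=100, seed=0):
    """Percent of bootstrap replicates whose tree contains the given clade."""
    rng = random.Random(seed)
    hits = sum(
        clade in clades_of_tree(bootstrap_alignment(alignment, rng))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


replicate = bootstrap_alignment(example_alignment, random.Random(1))
print(replicate)
```

Because whole columns are resampled, every replicate keeps the same taxa and the same length, but some columns appear more than once and others not at all.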

Bootstrap. [Figure: Bootstrap.bmp]

In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.

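Approach to single-gene phylogenetic analysis. Select a set of related sequences and build a multiple sequence alignment. Is the similarity very high? Yes: use the MP method. No: is there clearly recognizable sequence similarity? Yes: use a distance method. No: use the ML method. Finally, assess how well the data support the hypothesis (bootstrap).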

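Components of a phylogenetic model. Every tree-building method presupposes a model of evolution. For example, all widely used methods assume that evolutionary divergence is strictly branching, so that the data can be described by a tree-like topology. In a given data set this assumption may well be violated, because of hybridization between species and the transfer of genetic material between species. If the observed sequences were not inherited strictly vertically, most phylogenetic methods will give wrong results.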

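Drawbacks of computational phylogenetic analysis: it is easy to obtain wrong results, and the risk of error is almost unavoidable. Other disciplines generally have an experimental basis, whereas phylogenetic analysis is unlikely to have one; at most there are simulation experiments or experiments with viruses. In fact, the processes of phylogeny are completed history: they can only be inferred or evaluated, not replayed.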

More and more cases of lateral gene transfer (LGT) are being discovered and reported. Some estimate that 1.5%~14.5% of the genes in a genome are related to LGT, and that even rRNA molecules are involved in LGT (Garcia-Vallvé S, Romeu A, Palau J., Genome Res, 2000, 11, 1719~1725; Yap W H, Zhang Z, Wang Y., J. Bacteriol. 1999, 181: 5201~5209). Some argue that it is impossible to reconstruct a universal tree of life (Pennisi E., Science, 1999, 284: 1305~1307; Doolittle R F., Nature, 1998, 392: 339~342). As more whole-genome sequences and related data become available, it becomes possible to reconsider the phylogeny and clustering of species using broader measures, even at the level of whole genomes.

Related web sites

Compilation of available phylogeny programs: http://evolution.genetics.washington.edu/phylip/software.html
BLAST2 & Orthologue Search: http://www.Bork.EMBL-Heidelberg.DE/Blast2e/
CLUSTAL W: http://www-igbmc.u-strasbg.fr/BioInfo/
PHYLIP: http://evolution.genetics.washington.edu/phylip.html
PhyloBLAST: http://www.pathogenomics.bc.ca/phyloBLAST/
Phylogenetic Resources: http://www.ucmp.berkeley.edu/subway/phylogen.html
PUZZLEBOOT: http://www.tree-puzzle.de
TreeView: http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
WebPHYLIP: http://sdmc.krdl.org.sg:8080/lxzhang/phylip/

Internet addresses of software for evolutionary analysis

Sequence analysis and multiple sequence comparison
BLAST Web site: http://www.ncbi.nlm.nih.gov/BLAST/
FASTA at EBI: http://www2.ebi.ac.uk/fasta3/
CLUSTALW software: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW
HMMER software: http://hmmer.wustl.edu/
SAM profile software: http://www.cse.ucsc.edu/research/compbio/sam.html
BCM Search Launcher: http://kiwi.imgen.bcm.tmc.edu:8088/searchlauncher/launcher.html

Phylogenetic tree construction and robustness analysis
PHYLIP: http://evolution.genetics.washington.edu/phylip.html
Hennig86: http://www.vims.edu/~mes/hennig/software.html
MEGA/METREE: http://www.bio.psu.edu/faculty/nei/imeg
GAMBIT: http://www.lifesci.ucla.edu/mcdbio/Faculty/Lake/Research/Programs/
MacClade: http://phylogeny.arizona.edu/macclade/macclade.html
PAUP: http://onyx.si.edu/PAUP/
GCG software package: http://www.gcg.com/

Neutral theory of evolution An often-held view of evolution is that just as organisms propagate through natural selection, so also DNA and protein molecules are selected for. According to Motoo Kimura’s 1968 neutral theory of molecular evolution, the vast majority of DNA changes are not selected for in a Darwinian sense. The main cause of evolutionary change is random drift of mutant alleles that are selectively neutral (or nearly neutral). Positive Darwinian selection does occur, but it has a limited role. As an example, the divergent C peptide of insulin changes according to the neutral mutation rate.

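Molecular Evolution and Phylogenetics (分子进化与系统发育). Masatoshi Nei and Sudhir Kumar (USA); translated by Lü Baozhong (吕宝忠), Zhong Yang (钟扬), Gao Liping (高莉萍), et al.; reviewed by Zhao Shouyuan (赵寿元), Zhang Jianzhi (张建之), et al. Higher Education Press, Beijing, June 2002.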