Molecular Phylogeny: Molecular Phylogenetic Analysis

Bacteria, Archaea, Eukarya

Three great discoveries; survival of the fittest

Introduction Natural Selection
“Natural selection is daily, hourly, scrutinising the slightest variations, rejecting those that are bad, preserving and adding up all those that are good.” (The Origin of Species, Charles Darwin)

Darwin’s travels. Lamarck: adaptations. Wallace: natural selection.

Galapagos Finches

The Galapagos Finches
The beaks of the finches are adapted to different jobs, in the same way as tools.

Artificial Selection

Natural Selection
Overproduction; individual variation; unequal reproductive success

The struggle for existence induces natural selection.
Charles Darwin’s 1859 book (On the Origin of Species By Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) introduced the theory of evolution. Three great discoveries; survival of the fittest.

Tree of Life

Five kingdom system (Haeckel, 1879)
[Figure labels: mammals, vertebrates, animals, invertebrates, plants, fungi, protists, protozoa, monera] (page 396)

Introduction
At the molecular level, evolution is a process of mutation with selection. Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life. Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features; today, it relies on the comparison of molecular sequence data.

Introduction
In the 1920s and 1930s, a synthesis occurred between Darwinism and Mendel’s principles of inheritance. The basic processes of evolution are:
[1] mutation, and
[2] genetic recombination, as the two sources of variability;
[3] chromosomal organization (and its variation);
[4] natural selection;
[5] reproductive isolation, which constrains the effects of selection on populations.

Levels of Selection: species, population, individual, gene
Selection at the species level may lead to extinction, generally after a large environmental change. Interspecific competition and predation can lead to population decline, unless the population can exploit new niches or find novel ways of avoiding predation. Intraspecific competition for shared resources acts on the survival of progeny. Individuals that exploit their environment better than others survive to pass their genes to their descendants. The importance of the phenotypic characters expressed by the genes decides how selection acts on them.

Examples of clades Lindblad-Toh et al., Nature 438: 803, 8 Dec. 2005, fig. 10

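Orthologs and paralogs
[Figure: globin gene tree. An ancestral hemoglobin gene underwent gene duplication, giving the α chain and the β chain. The α chains of frog, chick, and mouse are orthologs of one another, as are the β chains; the α and β chains are paralogs.]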

Gene duplication and loss
[Figure: genes A, B, C across lineages 1, 2, 3, illustrating gene duplication, gene loss, gene merge, and pseudogene formation]

CONCEPT and DEFINITION
Orthologs: homologs produced by speciation. They represent genes derived from a common ancestor that diverged due to divergence of the organisms they are associated with. They tend to have similar functions.
Paralogs: homologs produced by gene duplication. They represent genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged. They tend to have different functions.

Xenologs: homologs resulting from horizontal gene transfer between two organisms. It is often difficult to determine whether a gene of interest was recently transferred into the current host by horizontal gene transfer. The function of xenologs can vary depending on how significant the change in context was for the horizontally moving gene; in general, the function tends to be similar.

Ohnologs: paralogous genes that originated by whole-genome duplication (WGD). The name was given in honour of Susumu Ohno by Ken Wolfe. Ohnologs are interesting for evolutionary analysis because they have all been diverging for the same length of time since their common origin.

How to find orthologs and paralogs
In eukaryotic genomes, most genes are members of gene families. When comparing genes from two species, therefore, most genes in one species will be homologous to multiple genes in the second. This often makes it difficult to distinguish orthologs (separated through speciation) from paralogs (separated by other types of gene duplication). Combining phylogenetic relationships, gene function and genomic position in both genomes helps to distinguish between these scenarios. There are many publications on this topic, such as: Steven B Cannon and Nevin D Young, OrthoParaMap: Distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies, BMC Bioinformatics 2003, 4:35

Bidirectional best hits (BBH)
The best hit of a particular gene to a target genome is the gene in that genome that represents the best match. The match is bidirectional if the two genes are best hits of each other. A bidirectional best hit represents a very strong similarity between two genes, and is considered evidence that the genes may be orthologs arising from a common ancestor. Formally, the paper "The use of gene clusters to infer functional coupling" defines a bidirectional best hit (BBH) as follows: given two genes Xa and Xb from two genomes Ga and Gb, Xa and Xb are called a “bidirectional best hit (BBH)” if and only if recognizable similarity exists between them (in our case, we required similarity scores lower than 1.0 × 10^-5), there is no gene Zb in Gb that is more similar than Xb is to Xa, and there is no gene Za in Ga that is more similar than Xa is to Xb.
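The BBH definition can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of the cited paper: the score tables, the gene names, and the `min_score` parameter are hypothetical, and scores are assumed to be similarity scores where higher means more similar (for E-values the comparison would be reversed).

```python
def best_hits(scores, min_score=0.0):
    """Map each query gene to its single best-scoring subject gene.

    scores: dict {(query, subject): similarity score}, higher is better.
    Hits scoring below min_score are ignored.
    """
    best = {}
    for (query, subject), score in scores.items():
        if score < min_score:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, min_score=0.0):
    """Return sorted (Xa, Xb) pairs that are each other's best hit."""
    a_to_b = best_hits(a_vs_b, min_score)
    b_to_a = best_hits(b_vs_a, min_score)
    return sorted((xa, xb) for xa, xb in a_to_b.items() if b_to_a.get(xb) == xa)


# Hypothetical scores between genome Ga (a1..a3) and genome Gb (b1, b2).
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 120, ("a3", "b2"): 80}
b_vs_a = {("b1", "a1"): 245, ("b2", "a2"): 118, ("b2", "a3"): 75}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Here a3's best hit is b2, but b2's best hit is a2, so a3 has no BBH partner; such one-directional hits are typical of paralogs within a gene family.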

Use the bidirectional best hits (BBH) criterion to define orthologs when two genomes are compared by the Smith-Waterman algorithm at the amino acid sequence level with a threshold similarity score of 70. To characterize the genes of an organism, its genes S(G1) are first mapped to the nodes of the graph G2 that encodes functional orthologs in another organism. G2 is then compared with an additional graph G3 of the original organism, instead of comparing G1 and G3 directly. (Gene mapping, Genome Informatics 12: 44–53 (2001))

Gene mapping
Gene-gene relationships on a specific attribute can be denoted in a general manner by a set of binary relationships. For example, let a binary operator '∼' denote a binary relationship between two genes, and let g1, g2, g3, and g4 be a series of genes arranged in this order in a genome sequence; their positional relationships are then broken down into the set of binary relationships {g1 ∼ g2, g2 ∼ g3, g3 ∼ g4}. A set of binary relationships among genes forms a graph structure as a whole. Fig. shows three graphs G1 (genome), G2 (pathway), and G3 (similarity), where each graph node corresponds to a gene or a gene product. In a graph, two nodes are connected by an edge (expressed by a solid line) when they are related by a binary relationship. In a set of genes, if all or most of the genes preserve their mutual relationships in multiple graphs, like the light gray nodes and the dark gray nodes, the biological relevance among those genes is considered to be well supported. We call such a set of genes a correlated gene cluster (or simply, correlated cluster), by which we can characterize, classify, and predict the activities of genes.

A. Mouse B. Human

Overview of the defensin gene cluster region in mouse (top) and human (bottom). A clone tiling path is shown for the corresponding regions in mouse (top) and human (bottom). Clones are displayed in yellow but regions overlapping with adjacent clones are shown in black. Genes are indicated by arrows. Genes in shadowed boxes are duplicated and the color indicates the pairs; A -- highlights all potential Defcr5 genes (see color legend for more details). The mouse assembly is based on NCBIM37, in which three gaps currently exist; two gaps are indicated by grey bars and the biggest gap between the two clusters is joined by a 'V'. Annotation of the mouse defensin genes: Amid et al. BMC Genomics :606 doi: /

The concept of a phylogenetic tree
Phylogenetic trees: in each panel, the phylogenetic group is depicted by a green shaded circle. A) Monophyletic group. All species (C and D) share a common ancestor (E) not shared by any other species. B) Paraphyletic group. All species in the group share a common ancestor (F), but some species (D) have been excluded from the group. C) Polyphyletic group. A grouping of lineages each more closely related to other species not in the group than they are to each other. From Barton et al. (2007) Evolution, p. 111.

Rooted and unrooted trees

Scaled trees

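The concept of a phylogenetic tree
In general, a phylogenetic tree is a two-dimensional diagram showing the evolutionary relationships among species; it can also reflect the evolutionary relationships among molecules (genes) from different species.
The length of the branches reflects the number of sequence changes. Often a uniform rate of mutation is assumed (molecular clock hypothesis).
[Figure: (1) a rooted tree and (2) an unrooted tree of sequences A, B, C, and D, with nodes and branches labelled]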

Molecular phylogeny: nomenclature of trees
There are two main kinds of information inherent to any tree: topology and branch lengths. We will now describe the parts of a tree. Page 366

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
[Figure: example tree with nodes A to I, a time axis, and branch lengths 6, 2, 1]

Tree nomenclature
Node: the intersection or terminating point of two or more branches.
Branch (edge): a connection between nodes.
Taxon: a group represented at a node of the tree.
Operational taxonomic unit (OTU): an observed unit, such as a protein sequence.
Hypothetical taxonomic unit (HTU): an inferred unit at an internal node.
[Figure: example tree with nodes A to I, a time axis, and branch lengths 6, 2, 1]

Tree nomenclature
Branches are scaled: branch lengths are proportional to the number of amino acid changes.
Branches are unscaled: OTUs are neatly aligned, and nodes reflect time.
(Fig. 11.4, page 366)

Tree nomenclature
Internal nodes may be bifurcating or multifurcating. (Fig. 11.5, page 367)

Tree nomenclature: clades
[Figure: the example tree of Fig. 11.4 (page 366), shown three times with different clades highlighted: clade ABF (a monophyletic group), clade CDH, and clade ABF/CDH/G]

Monophyletic, paraphyletic, and polyphyletic groups

Ingroup, outgroup, and sister group

Species trees versus gene/protein trees
Molecular evolutionary studies can be complicated by the fact that both species and genes evolve. Speciation usually occurs when a species becomes reproductively isolated. In a species tree, each internal node represents a speciation event. Genes (and proteins) may duplicate or otherwise evolve before or after any given speciation event. The topology of a gene (or protein) based tree may therefore differ from the topology of a species tree. (Page 370)

Molecular clock hypothesis
In the 1960s, sequence data were accumulated for small, abundant proteins such as globins, cytochromes c, and fibrinopeptides. Some proteins appeared to evolve slowly, while others evolved rapidly. Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock: For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages

Molecular clock hypothesis
As an example, Richard Dickerson (1971) plotted data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides. The x-axis shows the divergence times of the species, estimated from paleontological data. The y-axis shows m, the corrected number of amino acid changes per 100 residues. n is the observed number of amino acid changes per 100 residues, and it is corrected to m to account for changes that occur but are not observed: $\frac{n}{100} = 1 - e^{-m/100}$
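Solving the correction formula for m gives m = -100 ln(1 - n/100). The short Python sketch below applies that rearrangement; the function name is an illustrative choice, not something from the slides.

```python
import math


def corrected_changes(n):
    """Convert observed changes per 100 residues (n) to corrected changes (m).

    Rearranges n/100 = 1 - exp(-m/100) into m = -100 * ln(1 - n/100).
    """
    if not 0 <= n < 100:
        raise ValueError("n must satisfy 0 <= n < 100")
    return -100.0 * math.log(1.0 - n / 100.0)


# The correction is small for similar sequences and grows with divergence.
for n in (5, 20, 50, 80):
    print(n, round(corrected_changes(n), 1))
```

For small n the corrected value is close to n, because few sites have changed more than once; as n grows, multiple hits at the same site become common and m rises well above n.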

[Figure: Dickerson (1971). y-axis: corrected amino acid changes per 100 residues (m); x-axis: millions of years since divergence]

Molecular clock hypothesis: conclusions
Dickerson drew the following conclusions: For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein. The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides). The observed variations in rate of change reflect functional constraints imposed by natural selection.

Molecular clock hypothesis: λ and PAM
The rate of amino acid substitution is measured by λ, the number of substitutions per amino acid site per year. Consider serum albumin: λ = 1.9 × 10^-9, so λ × 10^9 = 1.9. Dayhoff et al. reported the rate of mutation acceptance for serum albumin as 19 PAMs (accepted point mutations per 100 residues) per 100 million years. (19 subst./100 aa/10^8 years = 1.9 subst./1 aa/10^9 years)

Molecular clock for proteins: rate of substitutions per amino acid site per 10^9 years
Fibrinopeptides    9.0
Kappa casein       3.3
Lactalbumin        2.7
Serum albumin      1.9
Lysozyme
Trypsin
Insulin
Cytochrome c       0.22
Histone H2B
Ubiquitin
Histone H

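Steps of phylogenetic data analysis
Phylogenetic analysis of DNA or protein sequences has four main steps: multiple sequence alignment, choosing a substitution model, building the tree, and evaluating the tree.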

Partial alignment of histones from PFAM (λ = 0.05)
H2A1_HUMAN/  R.KGNYAERV GAGAPVYLAA VLEYLTAEIL ELAGNAARDN KKTRIIPR
H2A1_YEAST/  R.RGNYAQRI GSGAPVYLTA VLEYLAAEIL ELAGNAARDN KKTRIIPR
H2A3_VOLCA/  K.KGKYAERI GAGAPVYLAA VLEYLTAEVL ELAGNAARDN KKNRIVPR
H2A_PLAFA/   K.KGKYAKRV GAGAPVYLAA VLEYLCAEIL ELAGNAARDN KKSRITPR
H2A1_PEA/    K.KGRYAQRV GTGAPVYLAA VLEYLAAEVL ELAGNAARDN KKNRISPR
H2A1_TETPY/  K.HGRYSERI GTGAPVYLAA VLEYLAAEVL ELAGNAAKDN KKTRIVPR
H2AM_RAT/    K.KGHPKYRI GVGAPVYMAA VLEYLTAEIL ELAGNAARDN KKGRVTPR
H2A_EUGGR/   R.AGRYAKRV GKGAPVYLAA VLEYLSAELL ELAGNASRDN KKKRITPR
H2A2_XENLA/  R.KGNYAERV GAGAPVYLAA VLEYLTAEIL ELAWERLPEI TKRPVLSP
H2AV_CHICK/  KTRTTSHGRV GATAAVYSAA ILEYLTAEVL ELAGNASKDL KVKRITPR
H2AV_TETTH/  KGRVSAKNRV GATAAVYAAA ILEYLTAEVL ELAGNASKDF KVRRITPR

Partial alignment of casein from PFAM (λ = 3.3)
CASK_BOVIN/  VLSRYPSYGL NYYQQKPVAL .INNQFLPYP YYAKPAAVRS PAQILQWQVL
CASK_CERNI/  ALSRYPSYGL NYYQHRPVAL .INNQFLPYP YYVKPGAVRS PAQILQWQVL
CASK_CAMDR/  VQSRYPSYGI NYYQHRLAVP .INNQFIPYP NYAKPVAIRL HAQIPQCQAL
CASK_PIG/    MLNRFPSYGF .FYQHRSAVS .PNRQFIPYP YYARPVVAGP HAQKPQWQDQ
CASK_HUMAN/  VPNSYPYYGT NLYQRRPAIA .INNPYVPRT YYANPAVVRP HAQIPQRQYL
CASK_RABIT/  VMNRYPQYEP SYYLRRQAVP .TLNPFMLNP YYVKPIVFKP NVQVPHWQIL
CASK_CAVPO/  VLNNYLRTAP SYYQNRASVP .INNPYLCHL YYVPSFVLWA QGQIPKGPVS
CASK_MOUSE/  VLN.FNQYEP NYYHYRPSLP ATASPYMYYP LVVRLLLLRS PAPISKWQSM
CASK_RAT/    VLN.RNHYEP IYYHYRTSVP ..VSPYAYFP VGLKLLLLRS PAQILKWQPM

Most conserved proteins in worm, human, and yeast
Protein        worm/human   worm/yeast   yeast/human
H4 histone     99% id       91% id       92% id
H3.3 histone
Actin B
Ubiquitin
Calmodulin
Tubulin
See Copley et al. (1999)

Sanger and colleagues sequenced insulin (1950s)
Human       CGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLEN
chimpanzee  CGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLEN
rabbit      CGERGFFYTPKSRREVEELQVGQAELGGGPGAGGLQPSALELALQKRGIVEQCCTSICSLYQLEN
dog         CGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLEN
horse       CGERGFFYTPKAXXEAEDPQVGEVELGGGPGLGGLQPLALAGPQQXXGIVEQCCTGICSLYQLEN
mouse       CGERGFFYTPMSRREVEDPQVAQLELGGGPGAGDLQTLALEVAQQKRGIVDQCCTSICSLYQLEN
rat         CGERGFFYTPMSRREVEDPQVAQLELGGGPGAGDLQTLALEVARQKRGIVDQCCTSICSLYQLEN
pig         CGERGFFYTPKARREAENPQAGAVELGG--GLGGLQALALEGPPQKRGIVEQCCTSICSLYQLEN
chicken     CGERGFFYSPKARRDVEQPLVSSPLRG---EAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLEN
sheep       CGERGFFYTPKARREVEGPQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCAGVCSLYQLEN
bovine      CGERGFFYTPKARREVEGPQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCASVCSLYQLEN
whale       CGERGFFYTPKA                                   GIVEQCCTSICSLYQLEN
elephant    CGERGFFYTPKT                                   GIVEQCCTGVCSLYQLEN
We can make a multiple sequence alignment of insulins from various species, and see conserved regions…

Mature insulin consists of an A chain and B chain heterodimer connected by disulphide bridges. The signal peptide and C peptide are cleaved, and their sequences display fewer functional constraints.

Note the sequence divergence in the disulfide loop region of the A chain.

[Figure: number of nucleotide substitutions/site/year across regions of the insulin sequence: 0.1 × 10^-9, 1 × 10^-9, 0.1 × 10^-9]

This site lists 200 phylogeny packages. Perhaps the best-known programs are PAUP (David Swofford and colleagues) and PHYLIP (Joe Felsenstein).

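In phylogenetic analysis, a guide tree is introduced during alignment. The guide tree produced by CLUSTAL or a similar aligner is converted to the PHYLIP tree file format and then passed to a tree-drawing program.
Commonly used tree-drawing programs include TreeTool (X Windows), PHYLIP, TREEVIEW, PAUP, and MEGA.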

The three main tree-building methods are:
1. Distance matrix methods
2. Maximum parsimony (MP)
3. Maximum likelihood (ML)

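Distance methods examine the pairwise alignments of all sequences in the data set, and derive the topology and branch lengths of the tree from the pairwise differences between sequences.
Maximum parsimony examines the multiple alignment of the sequences in the data set and optimizes a tree from it. Maximum likelihood also examines the multiple alignment, and optimizes a tree with a particular topology and branch lengths: the tree that yields the observed multiple alignment with the highest probability.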

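Distance matrix methods: neighbor-joining (NJ) and UPGMA
Both methods first require a symmetric distance matrix (an m × m square matrix) D = {d_ij}, where m is the number of OTUs (taxa). Many formulas exist for the distance coefficient. For example, Nei's (1972) genetic distance suits restriction enzyme and isozyme data, while the Jukes-Cantor one-parameter distance and the Kimura two-parameter distance are widely used for sequence data.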
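The two sequence distances named above can be sketched for a pair of aligned DNA sequences. This is a minimal illustration using the standard published formulas, d = -(3/4) ln(1 - 4p/3) for Jukes-Cantor and d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) for Kimura two-parameter, where p is the proportion of differing sites, P the proportion of transitions, and Q the proportion of transversions. The function names and the choice to skip columns containing gaps or ambiguous bases are assumptions of this sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) over unambiguous columns."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor one-parameter distance (substitutions per site)."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq1, seq2):
    """Kimura two-parameter distance (substitutions per site)."""
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))
```

Filling an m × m matrix with either function over all sequence pairs gives the symmetric matrix D that NJ and UPGMA take as input.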

Tree-building methods
We will discuss two tree-building methods: distance-based and character-based. Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining.

Tree-building methods
We can introduce distance-based and character-based tree-building methods by referring to a tree of 13 orthologous retinol-binding proteins, and the multiple sequence alignment from which the tree was generated.

Orthologs: members of a gene (protein) family in various organisms.
This tree shows RBP orthologs.
[Figure: tree of RBP orthologs with a scale bar of 10 changes. Fish RBP orthologs: common carp, zebrafish, rainbow trout, teleost. Other vertebrate RBP orthologs: African clawed frog, chicken, human, mouse, horse, rat, pig, cow, rabbit.]

Distance-based tree: calculate the pairwise alignments; if two sequences are related, put them next to each other on the tree.

Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors

Stage 3: Tree-building methods: distance
Many software packages are available for making phylogenetic trees. We will describe two programs.
[1] MEGA (Molecular Evolutionary Genetics Analysis) by Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei.
[2] Phylogeny Analysis Using Parsimony (PAUP), written by David Swofford.
We will next use MEGA and PAUP to generate trees by the distance-based method UPGMA.

How to use MEGA to make a tree
[1] Enter a multiple sequence alignment (.meg) file.
[2] Under the phylogeny menu, select one of these methods:
Maximum Likelihood (ML)
Neighbor-Joining (NJ)
Minimum Evolution (ME)
UPGMA
Maximum Parsimony (MP)

Use of MEGA for a distance-based tree: UPGMA
Click the green boxes to obtain options. Click compute to obtain the tree.

A variety of styles are available for tree display

Flipping branches around a node creates an equivalent topology

How to use PAUP to make a tree
Step 1: Obtain MSF
Step 2: Convert
Step 3: Import to PAUP and execute
Step 4: Perform analyses (generate trees)
Step 5: More analyses (evaluate trees)
Step 6: View, export, print trees

How to use PAUP to make a tree
Step 1: Get a multiple sequence alignment (e.g. from PFAM).
Step 2: Convert it with ReadSeq (use a Google search to identify a site offering ReadSeq, such as the Baylor College of Medicine).
Step 3: Import it as a new file into PAUP.

PAUP allows input of multiple sequence alignments, data editing, and creation and analysis of phylogenetic trees. (Fig., page 380)

Making trees using UPGMA
In PAUP, you can set the tree-making criterion to “distance” then choose UPGMA (unweighted pair group method with arithmetic mean) Page 379

PAUP performs UPGMA (distance-based tree)
Fig Page 381

Tree-building methods: UPGMA
UPGMA is the unweighted pair group method using arithmetic mean. The steps are illustrated with five proteins, numbered 1 to 5.

Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.

Step 2: Find the two proteins with the smallest pairwise distance. Cluster them (here 1 and 2, joined at node 6).

Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them (here 4 and 5, joined at node 7).

Step 4: Keep going. Cluster (here 3 joins the 4-5 cluster at node 8).

Step 5: Last cluster! This is your tree (node 9 joins the two remaining clusters).
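The clustering steps can be written as a short loop. The sketch below is a minimal UPGMA under stated assumptions: the distance values are invented so that the merges follow the order described in the steps (1 with 2, then 4 with 5, then 3, then the final join), and the `frozenset`-keyed dictionary is just one convenient way to hold a symmetric matrix.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA.

    dist: dict {frozenset({a, b}): distance} for every pair of leaf names.
    Returns (tree, merges): tree is a nested tuple of leaf names and
    merges lists (cluster label, node height) in the order of joining.
    """
    d = dict(dist)
    clusters = {name: (name, 1) for name in names}  # label -> (subtree, size)
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        label = f"({a},{b})"
        # Distance to each remaining cluster is the size-weighted mean,
        # which equals the plain average over all leaf pairs.
        for c in clusters:
            d[frozenset((label, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        merges.append((label, d[frozenset((a, b))] / 2))
        clusters[label] = ((tree_a, tree_b), size_a + size_b)
    tree, _ = next(iter(clusters.values()))
    return tree, merges


# Invented distances for five proteins named "1" to "5".
names = ["1", "2", "3", "4", "5"]
dist = {frozenset(p): 10.0 for p in combinations(names, 2)}
dist[frozenset(("1", "2"))] = 2.0
dist[frozenset(("4", "5"))] = 4.0
dist[frozenset(("3", "4"))] = 6.0
dist[frozenset(("3", "5"))] = 6.0

tree, merges = upgma(names, dist)
```

Because each node is placed at half the distance between the clusters it joins, UPGMA implicitly assumes a molecular clock: all leaves end up equidistant from the root.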

Making trees using neighbor-joining
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa. Begin by placing all the taxa in a star-like structure.

Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

Tree-building methods: Neighbor joining
Define the distance from X to Y by $d_{XY} = \frac{1}{2}(d_{1Y} + d_{2Y} - d_{12})$
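One join of the neighbor-joining procedure can be sketched as follows. The distance update is the d_XY formula from the slide. The rule used to pick the neighbors is an assumption of this sketch: it uses the commonly published Q criterion, Q(i,j) = (n - 2) d_ij - sum_k d_ik - sum_k d_jk, which selects the pair whose joining minimizes the total branch length. The five-taxon distances are illustrative values, not data from the slides.

```python
from itertools import combinations


def nj_join_once(names, d):
    """Perform one neighbor-joining step.

    d: dict {frozenset({x, y}): distance} for every pair in names.
    Returns (joined pair, new list of names, new distance dict).
    """
    n = len(names)
    total = {x: sum(d[frozenset((x, y))] for y in names if y != x) for x in names}

    def q(pair):
        a, b = pair
        return (n - 2) * d[frozenset(pair)] - total[a] - total[b]

    a, b = min(combinations(names, 2), key=q)  # the neighbors to join
    new = f"({a},{b})"
    rest = [x for x in names if x not in (a, b)]
    new_d = {frozenset(p): d[frozenset(p)] for p in combinations(rest, 2)}
    for y in rest:
        # d_XY = 1/2 (d_1Y + d_2Y - d_12), with X the new internal node.
        new_d[frozenset((new, y))] = 0.5 * (
            d[frozenset((a, y))] + d[frozenset((b, y))] - d[frozenset((a, b))]
        )
    return (a, b), [new] + rest, new_d


# Illustrative distances among five taxa.
names = ["a", "b", "c", "d", "e"]
rows = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(k): float(v) for k, v in rows.items()}
pair, new_names, new_d = nj_join_once(names, d)
```

Note that the pair chosen here is (a, b), even though d and e have the smallest raw distance; unlike UPGMA, neighbor-joining corrects for how far each taxon is from all the others. Repeating the step until only two nodes remain yields the full tree.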

Example of a neighbor-joining tree: phylogenetic analysis of 13 RBPs

Tree-building methods: character based
Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters). Tree-building methods based on characters include maximum parsimony and maximum likelihood.

As an example of tree-building using maximum parsimony, consider these four taxa: AAG, AAA, GGA, AGA. How might they have evolved from a common ancestor such as AAA?

Tree-building methods: Maximum parsimony
[Figure: three candidate trees descending from the ancestor AAA, with the number of changes marked on each branch. Leaf order AAG, AAA, GGA, AGA: cost = 3. Leaf order AAG, AGA, AAA, GGA: cost = 4. Leaf order AAG, GGA, AAA, AGA: cost = 4.]
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
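The cost of each candidate tree can be counted automatically. The sketch below uses Fitch's small-parsimony algorithm, which is a swapped-in technique: the slide counts changes from a fixed ancestor AAA, whereas Fitch's method finds the minimum number of changes over all possible ancestral states. For this example the two approaches agree. The taxon labels t1 to t4 and the pairings of leaves into subtrees are assumptions made for illustration.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, minimum changes) for one column.

    tree: a leaf name, or a 2-tuple of subtrees.
    site: dict mapping leaf name -> character at this column.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one change is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, seqs):
    """Total minimum number of changes over all aligned columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


seqs = {"t1": "AAG", "t2": "AAA", "t3": "GGA", "t4": "AGA"}
candidates = [
    (("t1", "t2"), ("t3", "t4")),  # AAG with AAA, GGA with AGA
    (("t1", "t4"), ("t2", "t3")),  # AAG with AGA, AAA with GGA
    (("t1", "t3"), ("t2", "t4")),  # AAG with GGA, AAA with AGA
]
costs = [parsimony_cost(tree, seqs) for tree in candidates]
best = min(candidates, key=lambda tree: parsimony_cost(tree, seqs))
```

Only the second column (A, A, G, G) distinguishes the trees: it needs one change when GGA and AGA are grouped together and two changes otherwise, which is what makes it a parsimony-informative site.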

Phylogram (values are proportional to branch lengths)

Rectangular phylogram (values are proportional to branch lengths)

Cladogram (values are not proportional to branch lengths)

Rectangular cladogram (values are not proportional to branch lengths) These four trees display the same data in different formats.

Making trees using maximum likelihood
Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process. ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP.

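Character-based tree-building methods
The character-based tree-building methods are maximum parsimony and maximum likelihood.
Maximum parsimony (MP) is an optimality criterion. A tree is built on the principle that the observed differences among the taxa under study should be explained by the smallest number of changes. It favors the fewest assumptions and the simplest explanation; in practice, the MP tree is the shortest tree, the one with the fewest changes.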

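Maximum parsimony method
Step 1: Input: a multiple sequence alignment.
Step 2: For each aligned position, identify the trees that require the smallest number of evolutionary changes to produce the observed sequence changes.
Step 3: Continue the analysis for every position in the sequence alignment.
Step 4: With the sequence variation at each aligned position placed at the tips of the tree, identify the tree that requires the smallest number of changes over all sequence positions.
The method is best suited to cases with many informative sites.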

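Maximum likelihood (ML)
ML searches the phylogenetic problem exhaustively. It seeks the evolutionary model (the search includes the tree itself) under which the data the model would generate most closely match the observed data. ML computes the probability of the observed pattern of variation at a site given a particular substitution process; the likelihood of a site is the sum of the probabilities of every possible reconstruction of substitutions under that process. Multiplying the likelihoods of all sites gives the likelihood of the whole tree.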

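Maximum likelihood
Maximum likelihood uses probability calculations to find the tree that best accounts for the sequence variation. Each column of the multiple sequence alignment is analyzed, and all possible trees are considered. An evolutionary model of sequence change provides estimates of the rate at which one base changes to another:

$$
\begin{array}{c|cccc}
\text{Base} & A & C & G & T \\ \hline
A & -u(a\pi_C + b\pi_G + c\pi_T) & ua\pi_C & ub\pi_G & uc\pi_T \\
C & ug\pi_A & -u(g\pi_A + d\pi_G + e\pi_T) & ud\pi_G & ue\pi_T \\
G & uh\pi_A & uj\pi_C & -u(h\pi_A + j\pi_C + f\pi_T) & uf\pi_T \\
T & ui\pi_A & uk\pi_C & ul\pi_G & -u(i\pi_A + k\pi_C + l\pi_G)
\end{array}
$$

where u is the overall substitution rate, a through l are relative rate parameters, and $\pi_X$ is the frequency of base X. Each diagonal entry is the negative sum of the other entries in its row.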

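Maximum likelihood steps
Step 1: Align the set of sequences.
Step 2: Examine the substitutions in each column for their fit to a set of trees describing the phylogenetic relationships among the sequences.
Given the data set, each tree has a likelihood. Advantages: the method can be used to evaluate trees with rate variation, and can be applied to more divergent sequences. Disadvantage: it is computationally intensive.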
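The idea of choosing the parameter value that makes the observed data most probable can be shown on the smallest possible case: one branch length between two aligned sequences. This sketch assumes the Jukes-Cantor model (all substitutions equally likely, equal base frequencies), which is simpler than the general rate matrix, and it does not search over tree topologies as a full ML program does. The function names, grid step, and site counts are illustrative.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d.

    d: expected substitutions per site (must be > 0).
    n_sites: number of aligned columns; n_diff: columns that differ.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay    # base is unchanged after distance d
    p_change = 0.25 - 0.25 * decay  # base became one specific other base
    # Each column contributes (frequency of first base) * (transition prob.).
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(
        0.25 * p_change
    )


def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the distance with the highest likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# 100 aligned sites, 10 of which differ.
estimate = ml_distance(100, 10)
```

The column likelihoods are multiplied (here, their logarithms are summed), as described above. For this two-sequence case the maximum coincides with the analytic Jukes-Cantor distance, -(3/4) ln(1 - 4p/3) with p = 0.1; with more sequences the same principle applies, but every branch length and topology must be optimized, which is why ML is computationally intensive.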

NEXUS format
((IM21:100.0,((((((Pa10:100.0,((((NI1k:100.0,NIM3:100.0):84.0,MU4k:100.0):79.0, (((LZ11:100.0,PT18:100.0):71.0,LR20:100.0):19.0,FL19:100.0):13.0):5.0,(AC15:100.0, (MC16:100.0,FU14:100.0):99.0):89.0):13.0):6.0,((PI7k:100.0,TU6k:100.0):45.0,TE80:100.0):33.0):11.0, LG12:100.0):15.0,(XI22:100.0,(CH17:100.0,GR13:100.0):104.0):89.0):19.0,PU5k:100.0):34.0, LI90:100.0):43.0):61.0,out:100.0);
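The parenthesized string is written in the nested notation usually called Newick, which is the form in which trees are stored inside NEXUS files: each pair of parentheses is an internal node, commas separate children, and the number after a colon is a branch length (or, in some programs, a support value). A minimal recursive parser is sketched below; it handles only plain names and numeric lengths, not quoted labels or comments, and the short example tree is invented.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split())  # drop spaces and line breaks
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while pos < len(text) and text[pos] == ",":
                pos += 1
                children.append(node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def leaf_names(tree):
    """List the leaf (OTU) names from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


example = parse_newick("((A:1.0, B:2.0):0.5, C:3.0);")
```

Tree-drawing programs such as those listed earlier read this notation directly, so a parser like this is mainly useful for quick checks, for example listing the OTUs in a tree file.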

Rodent polyphyly?
[Figure: alternative trees relating human, mouse, and guinea-pig. Tree-1 is the traditional view, in which mouse and guinea-pig group together as rodents; Tree-2 and Tree-3 are the alternatives. References: D'Erchia et al. (1996) Nature 381, 597; Graur, Hide and Li (1991) Nature 351, 649.]

Rodent polyphyly? ProtML and ME trees from Reyes et al. (2000) Mol. Biol. Evol. 17:

Where did HIV come from? (Freeman & Herron, 2001. Evolutionary Analysis. Prentice Hall)

Science, 13 June 2003

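Two viruses from different monkey species recombined in African chimpanzees to form the SIV strain that gave rise to human AIDS.
SIVcpz arose in chimpanzees through successive transmission and recombination of SIVs from red-capped mangabeys and greater spot-nosed monkeys. Chimpanzees prey on both of these monkeys, whose ranges overlap with those of chimpanzees in west-central Africa. Humans are not the only species to have acquired two different SIV strains through natural cross-species transmission, which most likely results from predation. Has chimpanzee predation on small monkeys led them to acquire other SIV infections? How likely is it that these SIVs co-infect with SIVcpz or recombine with it? Are these chimpanzee-adapted SIVs ultimately more likely to infect humans?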

Hasegawa, 1998

TreeBASE at Harvard University

TreeFam: Tree families database

Pfam: Protein families database

Stage 4: Evaluating trees
The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness. Evaluation of accuracy can refer to an approach (e.g. UPGMA) or to a particular tree.

Strict consensus tree

Majority-rule consensus tree


Stage 4: Evaluating trees: bootstrapping
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set? To bootstrap, make an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. Support above 70% is commonly considered significant.
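The resampling procedure can be sketched as below. The alignment is held as a dictionary of equal-length strings. The tree-building step is passed in as a callable that returns the set of clades found in a tree; `closest_pair_clades` is only a toy stand-in for a real method such as UPGMA or neighbor-joining, and all names and sequences here are invented for illustration.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_clades, clade, replicates=100, seed=0):
    """Percent of bootstrap replicates whose tree contains the given clade.

    build_clades: callable taking an alignment and returning a set of
    clades, each clade being a frozenset of sequence names.
    """
    rng = random.Random(seed)
    hits = sum(
        clade in build_clades(bootstrap_alignment(alignment, rng))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


def closest_pair_clades(alignment):
    """Toy tree builder: the only clade is the pair with fewest differences."""
    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))
    return {frozenset(min(combinations(sorted(alignment), 2), key=differences))}


alignment = {"a": "ACGTACGT", "b": "ACGTACGT", "c": "TTTTGGGG"}
support = bootstrap_support(alignment, closest_pair_clades, frozenset({"a", "b"}))
```

Sampling whole columns keeps each site's pattern across species intact while varying which sites are represented, so the support value measures how much of the alignment agrees on a clade rather than how accurate the tree-building method is.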

Bootstrap

In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.

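Methods for single-gene phylogenetic analysis
1. Choose a set of related sequences and build a multiple sequence alignment.
2. Is the similarity very high? If yes, use the MP method.
3. If no, is there fairly clear sequence similarity? If yes, use distance methods.
4. If no, use the ML method.
5. Analyze how well the data support the hypothesis (bootstrap).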

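Components of a phylogenetic model
Every tree-building method presupposes an evolutionary model. For example, all widely used methods assume that evolutionary divergence is strictly branching, so that the data can be described by a tree-like topology. In a given data set, this assumption may well be violated by hybridization between species or by the transfer of genetic material between species. If the observed sequences were not inherited strictly vertically, most phylogenetic methods will therefore give wrong results.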

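Drawbacks of computational phylogenetic analysis: wrong results are easy to obtain, and the risk of error is almost unavoidable. Other disciplines generally have an experimental basis, whereas phylogenetic analysis is unlikely to have one, apart from some simulation studies or virus experiments. Phylogenetic history has already happened; it can only be inferred or evaluated, not replayed.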

More and more cases of lateral gene transfer (LGT) have been discovered and reported
Some estimates suggest that 1.5% to 14.5% of the genes in a genome are related to LGT, and that even rRNA molecules are involved in LGT (Garcia-Vallvé S, Romeu A, Palau J., Genome Res, 2000, 11, 1719~1725; Yap W H, Zhang Z, Wang Y., J. Bacteriol. 1999, 181: 5201~5209). Some argue that it is impossible to reconstruct a universal tree of life (Pennisi E., Science, 1999, 284: 1305~1307; Doolittle R F., Nature, 1998, 392: 339~342). As more whole-genome sequences and related data become available, it becomes possible to reconsider the phylogeny and clustering properties of species using broader measures, even at the level of the whole genome.

Related websites: Compilation of available phylogeny programs; BLAST2 & Orthologue Search; CLUSTAL W; PHYLIP; PhyloBLAST; Phylogenetic Resources; PUZZLEBOOT; TreeView; WebPHYLIP

Internet addresses of software for evolutionary analysis
Sequence analysis and multiple sequence comparison: BLAST Web site; FASTA at EBI; CLUSTALW software (ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW); HMMER software; SAM profile software; BCM Search Launcher
Phylogenetic tree construction and stability analysis: PHYLIP; Hennig; MEGA/METREE; GAMBIT; MacClade; PAUP; GCG software package

Neutral theory of evolution
An often-held view of evolution is that just as organisms propagate through natural selection, so also DNA and protein molecules are selected for. According to Motoo Kimura’s 1968 neutral theory of molecular evolution, the vast majority of DNA changes are not selected for in a Darwinian sense. The main cause of evolutionary change is random drift of mutant alleles that are selectively neutral (or nearly neutral). Positive Darwinian selection does occur, but it has a limited role. As an example, the divergent C peptide of insulin changes according to the neutral mutation rate.

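Molecular Evolution and Phylogenetics, by Masatoshi Nei and Sudhir Kumar (USA); translated by Lü Baozhong, Zhong Yang, Gao Liping, et al.; proofread by Zhao Shouyuan, Zhang Jianzhi, et al. Higher Education Press, Beijing, June 2002.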
