Biochemical Journal

Review article

Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes

Romain A. Studer, Benoit H. Dessailly, Christine A. Orengo


The present review focuses on the evolution of proteins and the impact of amino acid mutations on function from a structural perspective. Proteins evolve under the law of natural selection and undergo alternating periods of conservative evolution and of relatively rapid change. The likelihood of mutations being fixed in the genome depends on various factors, such as the fitness of the phenotype or the position of the residues in the three-dimensional structure. For example, co-evolution of residues located close together in three-dimensional space can occur to preserve global stability. Whereas point mutations can fine-tune the protein function, residue insertions and deletions (‘decorations’ at the structural level) can sometimes modify functional sites and protein interactions more dramatically. We discuss recent developments and tools to identify such episodic mutations, and examine their applications in medical research. Such tools have been tested on simulated data and applied to real data such as viruses or animal sequences. Traditionally, there has been little if any cross-talk between the fields of protein biophysics, protein structure–function and molecular evolution. However, the last several years have seen some exciting developments in combining these approaches to obtain an in-depth understanding of how proteins evolve. For example, a better understanding of how structural constraints affect protein evolution will greatly help us to optimize our models of sequence evolution. The present review explores this new synthesis of perspectives.

  • amino acid mutation
  • deletion
  • insertion
  • protein evolution
  • protein function
  • structure


Innovations and adaptations in organisms occur at different levels, ranging from the gene product or protein, through to changes in expression levels and protein interaction networks [1]. Proteins are essential components of biological organisms and, owing to the error-prone nature of the DNA replication process, can mutate between generations. Mutations can change the phenotype and have a beneficial, deleterious or neutral effect on the fitness of the individual organism. Phenotypes and the corresponding mutations are then acted on by selection (natural or artificial) at the population level. The fixation of mutations is a stochastic process that depends on the effective population size and on evolutionary forces such as natural selection. Generally, most deleterious mutations are removed and some beneficial mutations are fixed under positive selection, whereas the fate of neutral mutations depends on genetic drift.

Changes in proteins and DNA comprise single-point mutations (SNPs; single nucleotide polymorphisms), ‘indels’ (insertions–deletions), domain shuffling and copy number variations. These changes can affect diverse protein properties, such as stability, catalytic activity or the ability to interact with other molecules. Mutations can be the source of various diseases. In coding genes, mutations can cause a defect in the protein. A well-known case is sickle-cell anaemia, where the haemoglobin molecule is affected. Disease can also be caused by pathogenic organisms. In this context, the emergence of new viruses, bacteria or unicellular parasites can be facilitated by mutations that help these organisms to escape the host defence. Finally, protein mutations in some cells can lead to deregulation and the emergence of cancer cells. Understanding how mutations change the functions of proteins is essential for gaining a more complete understanding of diverse medical issues, ranging from microbiology to oncology and nutrition [2,3].

Many reviews have been published on the general evolution of proteins [4–9]. In the present paper, we focus on the adaptation of proteins, and on approaches to detect genetic variations that happen inside a protein domain (SNPs and indels, Figure 1). Mutations can be beneficial, adding new functionality and increasing the fitness of the organism, or pathogenic, damaging the protein or reducing its functional activity and thereby contributing to disease phenotypes. In the present paper, we provide examples that illustrate both types of changes and review other evolutionary processes that are also relevant to medicine, such as the adaptation mechanisms of viruses or cancer cells.

Figure 1 Possible effects of mutations on proteins

Various mutational processes can affect proteins. The first case is when a synonymous mutation occurs, keeping the protein sequence unchanged. In the majority of cases, these types of mutations will not affect the protein. However, the structure and function could be affected by a defect in the translation process, owing to different codon usage and the rarity of some tRNAs [195]. The second case is when the protein sequence is changed by a non-synonymous mutation. The mutation could have no impact (i.e. amino acids with similar biochemical properties), could have a potential effect, weak or strong (i.e. effect on folding, binding and enzyme catalysis) or damage the protein structure by affecting the stability. The final case is when a block of sites is inserted or deleted (indels). Again, this could have an effect or not, depending on the context. The probability of fixation of these mutations in the genome depends on different factors, such as the population size or the beneficial effect of the mutation on the organismal fitness.


Amino acid substitutions

Mutations in catalytic sites and binding pockets

Proteins perform a wide array of different functions that are mediated by the various residues that make up the functional sites. Mutations in different types of functional sites will typically have different consequences. Some of the most dramatic effects occur in the active sites of enzymes and the binding pockets of receptors. Enzymes catalyse biochemical reactions, generally via catalytic sites consisting of a handful of residues that are important for the reaction (catalytic residues), with a larger number of surrounding residues that are important for ensuring proper attachment of the substrates and cofactors to the active-site cavity (binding-site residues). Owing to their small number and the fact that they perform very specific functions, catalytic residues are often highly conserved in evolution, suggesting that mutations of such residues are very detrimental [10–12]. Evolutionary constraints on binding-site residues may not always be as strict, and at least some of these residues may be amenable to mutations without fatal consequences. More generally, a mutation that decreases the fitness, even to a small extent, is not likely to be fixed.

Mutations in protein–protein interfaces

As for all types of functional sites, residues in protein–protein interfaces are generally expected to be somewhat conserved in evolution. This has led to the development of a number of approaches to predict the location of protein–protein interfaces from sequences or structures, using evolutionary principles [13,14].

One area that has been the focus of significant efforts in recent years is the particular case of interactions involving SLiMs (short linear motifs) [15,16]. The small size of these linear and flexible motifs makes them more amenable to convergent evolution, as very few residue mutations are sufficient to allow for a novel interaction between a domain and such a motif [15].

An important aspect of the evolution of residues in protein–protein interfaces is co-evolution across the interface or correlated mutations. The idea, illustrated in Figure 2 for the MHC of birds [17], is that when two proteins interact, their interacting residues should generally undergo some level of co-evolution, i.e. if one residue mutates, its interacting residue in the interacting partner will be under pressure to carry out a compensatory mutation for the interaction to be maintained [18]. Different measures of co-evolution have been exploited as the basis for a number of methods to predict interaction between query proteins [18]. However, there has been considerable debate as to whether such co-evolution arises mostly from physical interactions between proteins, or from other functional constraints. For instance, one commonly used measure of co-evolution is evolutionary rate co-variation, which checks whether the evolutionary rates of two proteins co-vary along various branches of a phylogenetic tree [19]. It has been suggested, however, that co-expressed proteins, or proteins with similar functions that do not physically interact, are also subject to similar evolutionary rates [18–20].
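
The idea behind evolutionary rate co-variation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the branch-specific rates are made-up numbers, the function name `pearson` is hypothetical, and a plain Pearson correlation stands in for whatever statistic a given published method actually uses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant rates")
    return cov / sqrt(var_x * var_y)

# Hypothetical evolutionary rates of two proteins on the same four branches
# of a species tree (illustrative values, not real data).
rates_protein_a = [0.10, 0.40, 0.20, 0.80]
rates_protein_b = [0.20, 0.80, 0.40, 1.60]

print(round(pearson(rates_protein_a, rates_protein_b), 3))
```

As the debate summarized above makes clear, a high correlation of this kind does not by itself demonstrate a physical interaction; co-expression or shared function may produce the same signal.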

Figure 2 Evolution of dimerization binding interface of MHC in birds

Evolutionary history of MHC II protein in birds. (A) The clustering of sites in three-dimensional space allows for the visualization of the electrostatic patches and their differences between the two branches of the trees (blue and green). These differences could help to increase the varieties of MHC. The blue colour of the protein domain indicates basicity and the red indicates acidity. (B) The structure of the chicken MHC II αβ-heterodimer. The divergent protein domain is indicated in colour. (C) Charge–charge interactions between the chicken MHC II α- and β-chains in the divergent protein domain. Residues implicated in hydrogen bonds are depicted in stick representation. Reproduced with permission from Burri, R., Salamin, N., Studer, R.A., Roulin, A. and Fumagalli, L. (2010) Adaptive divergence of ancient gene duplicates in the avian MHC class II β. Mol. Biol. Evol. 27, 2360–2374.

The availability of increasing amounts of protein–protein interaction data has allowed the calculation of large interaction networks consisting of thousands of interactions in a number of species. In turn, these networks have revealed novel evolutionary principles regarding proteins and their interactions. For instance, one observation from protein–protein interaction networks is the existence of network-central hubs that interact with many other proteins. These multiple interactions can be mediated via different interfaces on the hub, sometimes resulting in a majority of residues in the hub being involved in interactions [21]. Highly connected proteins with several interfaces would be expected to evolve more slowly since a larger proportion of their residues are directly involved in function (i.e. in protein–protein interfaces) compared with other proteins. A number of reports have suggested that this is the case [22–24].

Finally, accommodating beneficial new interactions is one thing. Making sure unwanted interactions do not occur is quite another. A recent analysis has suggested that avoidance of unwanted interactions poses high constraints on residues from highly expressed proteins, causing them to evolve slowly [25].

Compensatory mechanisms and the link with stability

Other mutations outside the catalytic core can also have a functional effect on enzymatic activity, for example by affecting movement and stability in the structure, and thereby inducing allostery [26].

Globular proteins are generally found in two states: a free unfolded state and a folded state, with the latter being functional. Each of them is associated with a Gibbs free energy value (ΔG), the unfolded state being more energetic than the folded state (ΔGunfold > ΔGfold). To move from one state to the other, there needs to be a transfer of energy (ΔΔG, expressed in kcal·mol−1; 1 kcal=4.184 kJ). In their stable state, proteins are marginally stable, between −3 and −10 kcal·mol−1 [27], and can tolerate some destabilization within a narrow range [28,29]. This equilibrium is quite sensitive and any mutation that adds energy (i.e. more than +2 kcal·mol−1) to the folded state is likely to destabilize the structure and make the protein more likely to aggregate in its unfolded form, and this could be a factor in some diseases [30]. By contrast, a mutation that removes energy from the folded state (i.e. less than +2 kcal·mol−1) is likely to stabilize the protein structure and make it too rigid and so non-functional in the case of enzymes.
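
The arithmetic behind this marginal-stability window can be made explicit with a short Python sketch. It assumes, for illustration only, that the effect of a mutation is simply added to the wild-type folding free energy; the function names and the example values of ΔG and ΔΔG are hypothetical, whereas the window of −10 to −3 kcal·mol−1 and the conversion factor are taken from the text.

```python
KJ_PER_KCAL = 4.184  # 1 kcal = 4.184 kJ, as quoted in the text

def kcal_to_kj(value_kcal):
    """Convert an energy from kcal/mol to kJ/mol."""
    return value_kcal * KJ_PER_KCAL

def stability_after_mutation(dg_wild_type, ddg):
    """Folding free energy after a mutation, assuming simple additivity."""
    return dg_wild_type + ddg

def in_marginal_window(dg, low=-10.0, high=-3.0):
    """True if dg (kcal/mol) lies within the marginal-stability window."""
    return low <= dg <= high

dg_wild_type = -5.0  # hypothetical wild-type stability, kcal/mol
for ddg in (2.5, -1.0, -6.0):  # hypothetical mutation effects, kcal/mol
    dg_mutant = stability_after_mutation(dg_wild_type, ddg)
    print(ddg, dg_mutant, in_marginal_window(dg_mutant))
```

In this toy example, only the mutation with ΔΔG = −1.0 kcal·mol−1 leaves the protein inside the window; the first pushes it towards unfolding and the last towards excessive rigidity.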

A mutation that increases the catalytic activity is likely to decrease the global stability, in a ‘stability–activity trade-off’ [31–33]. Tokuriki et al. [32] studied different enzymes and compared mutations known to change the function (i.e. new substrate specificity or catalytic activity) with other mutations. The authors found that the former were more likely to be destabilizing. By contrast, mutations that increase the stability of a protein will decrease the catalytic activity [34]. To stay in this optimal stability zone, other mutations are required to compensate for this loss of stability [35–37]. After this adaptation event, both types of mutations, functional and compensatory, are more likely to be fixed in the genome than neutral mutations. This concept of molecular evolution is known as cryptic epistasis, and can also compensate for destabilizing mutations in protein-binding interfaces [38].

More generally, it means that several mutations may need to occur to achieve a particular effect (e.g. positive effect in the case of functional adaptation or negative in the case of disease). A substitution, either neutral or advantageous, can be accompanied by another substitution, providing a synergistic effect. These substitutions will tend to group together in the three-dimensional structure and co-evolve [39]. Similarly, residues that compensate deleterious mutations also tend to co-evolve [40,41]. Furthermore, residues more than 20 Å (1 Å=0.1 nm) apart can also have a compensatory effect [42,43]. A correlation has been observed between co-evolved residues and mobile regions, associated with networks of functional residues in the protein structure [44].

Experimental work on IMDH (isopropylmalate dehydrogenase) in bacteria characterized compensatory mutations [42]. Mutations were introduced into IMDH in Escherichia coli to match those observed in Pseudomonas aeruginosa. Many were deleterious, including three found in the least active variants (F73L, A94D and A284C). However, one site mutation, I179V, compensates and restores activity of the F73L mutant. Similarly, Ackerman and Gatti [45] mutated in silico all positions of KDO8P synthase to all amino acids and evaluated the impact on stability. They identified co-evolved pairs of residues and estimated that a quarter of them follow a stabilization–destabilization process. Epistatic co-evolution can also confer resistance to a virus, as in the case of the influenza A virus [46].

Studies in yeast have shown that amino acids at the periphery of the protein tend to evolve faster than amino acids inside the core of the protein. However, core residues do tend to evolve if their adjacent residues at the surface evolve too [47].

Compensatory mutations also occur in bacteria to facilitate drug resistance. A stability and activity trade-off model has been identified in TEM-1 β-lactamase [48]. The original function of this protein was to confer resistance against penicillin antibiotics. However, its function can shift to degrade cephalosporin, another class of antibiotics. This function switch can be driven by different mutations, but it is always accompanied by a decrease in the stability of the protein. However, this stability is recovered by a subsequent mutation. Patient samples show the frequent occurrence of the M182T mutation, which, although far from the active site, rescues the stability of the protein by increasing it by 2.67 kcal·mol−1 compared with the wild-type. Four mutations (A42G, E104K, M182T and G238S) have been identified in a resistant form of β-lactamase.

Therefore, although dozens of evolutionary trajectories are theoretically feasible [49], generally only a few are available in practice and most are inaccessible owing to the negative impact of the mutation. As already discussed, a mutation probably affects different characteristics of the protein, such as activity, binding partner, stability or aggregation; to compensate for these potentially negative effects, other mutations will be required in a pleiotropic manner [35]. Added to this complexity, reverse mutations can occur and erase the evolutionary signal. The four mutations mentioned above for β-lactamase are reverse mutated during different evolutionary trajectories leading to resistance [50].

Insertions and deletions

Indels: a sequence perspective

Indels are blocks of nucleotides inserted or deleted within a genome. Indels can appear in non-coding genomic regions or in coding regions and they are mainly generated by slippage mechanisms of the DNA polymerase [51]. However, insertions are mainly created by recombination, whereas deletions are created by replication mechanisms [52,53].

Although amino acid substitutions can only change the side chains, preserving the number of amino acids in the structure, indels can cause more dramatic changes to the protein and are thus rarer than SNPs [54]. The ratio of indels to SNPs in the whole human genome has been estimated to be in the range of 1:5 to 1:7 [55–57], whereas this ratio is decreased to 1:10 in primates [54] and further to 1:20 in bacteria [54].

These diverse ratios in different species are partly explained by the fact that the ratio varies between coding and non-coding regions, and is much lower in protein-coding genes, as purifying selection acts against indels, especially those causing frameshifts (94% eliminated) but also in-frame indels (48% eliminated) [55]. Although a ratio of one indel to seven SNPs has been reported for the whole human genome, there is only one indel for 43 SNPs in coding regions [55]. This low rate may be related to the fact that indels should be multiples of three nucleotides to avoid frameshifts, so that proteins are not disrupted. Globally, indels tend to occur more at N- and C-termini, where frameshifts or mutation effects are less damaging [55].
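
The reason why indel length matters can be shown with a toy Python example. The sequence, positions and function names below are hypothetical and chosen only to illustrate the point: a deletion of three nucleotides removes one codon and leaves the downstream codons intact, whereas a deletion of a single nucleotide shifts every downstream codon.

```python
def is_in_frame(indel_length):
    """True if an indel of this length (in nucleotides) keeps the reading frame."""
    if indel_length <= 0:
        raise ValueError("indel length must be a positive integer")
    return indel_length % 3 == 0

def codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def delete(seq, start, length):
    """Return seq with `length` nucleotides removed from 0-based index `start`."""
    return seq[:start] + seq[start + length:]

coding_seq = "ATGGCCAAGTTTGGA"  # hypothetical 15-nucleotide coding sequence

print(codons(coding_seq))                 # original reading frame
print(codons(delete(coding_seq, 3, 3)))   # in-frame deletion of one codon
print(codons(delete(coding_seq, 3, 1)))   # frameshift: downstream codons change
```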

Indels: a structural perspective

Proteins are made up of domains, i.e. regions that adopt compact three-dimensional structures, and that can combine in different ways to form proteins with different multi-domain architectures [58]. These structural domains can be classified into families, which comprise close homologues with similar function, and superfamilies, which comprise all homologues, including distant ones. In the CATH (class, architecture, topology, homology) [59] and SCOP (structural classification of proteins) [60] databases, structural data are taken into account when classifying homologous domains into superfamilies and functional families within these superfamilies [61].

Homologous protein domains usually share a common structural core, i.e. a set of secondary structure elements shared by all members of the superfamily. Insertions relative to this superfamily core can, however, be quite extensive in some highly diverse superfamilies and account for large structural additions in individual domains [62]. Many of these diverse superfamilies, although constituting less than 5% of all structural superfamilies in CATH, are highly populated, accounting for more than one half of all domain sequences in CATH. Such insertions have been coined ‘structural embellishments’. An analysis of embellishments in the superfamily of HUP domains showed that insertions located at different points in the sequence often co-locate in the structure, thus forming even larger sets of secondary structures that distinguish different members of the same superfamily [63]. Such structural embellishments can often be linked to changes in function between the different superfamily domains [62,63]. It has also been shown that structural differences between domains from the same superfamily can be exploited to automatically detect differences and similarities in function between these domains [64].

Indels rarely affect the structural scaffold of a protein. However, because they can alter peripheral elements, they can affect the enzymatic specificity or protein–protein interaction surfaces [65]. As they tend to cause more drastic structural and functional changes to proteins, large indels and the embellishments they create are rarer than single mutations, as discussed above. Using reduced alphabets for amino acids and structural elements, Illergard et al. [66] estimated that the structure of a protein is 3–10-fold more conserved than its sequence, confirming the slower evolution of protein structures [67]. A study found that 85% of mutants with indels exhibit an RMSD (root mean square deviation) of less than 2 Å, suggesting that proteins can incorporate small internal indels without drastic modification to their structure [68].
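
For readers less familiar with the measure, the RMSD between two sets of n equivalent atoms is the square root of the mean squared distance between paired atoms. The Python sketch below assumes that the two coordinate sets are already superimposed and paired one-to-one; the superposition step that real structure comparisons require is deliberately omitted, and the function name and coordinates are hypothetical.

```python
from math import sqrt

def rmsd(coords_a, coords_b):
    """RMSD (same unit as the input, e.g. angstroms) between paired 3D points.

    Assumes the two structures are already superimposed and that the
    i-th point of coords_a is equivalent to the i-th point of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))

# Hypothetical coordinates: the second set is the first shifted by 1 unit along z.
wild_type = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
mutant = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

print(rmsd(wild_type, mutant))
```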

Zhang et al. [69] studied the combined effects of amino acid substitutions and indels on the structures of homologues. The authors used pairwise structural superimposition to obtain high-quality alignments to identify indels. They then analysed both the divergence in amino acid substitutions relative to the structural distance (RMSD) between relatives and the indels relative to the same RMSD. Increasing the number of residue mutations or increasing the number of residue indels both increase the RMSD, but with different correlations, with indels causing a larger increase in RMSD.

Relationship of indels and nucleotide substitutions

Indels tend to occur in hotspot regions, which are prone to higher substitution rates of amino acids [70]. This phenomenon has been observed in both eukaryota [71] and bacteria [72]. Analysis of SCOP families revealed structural shifts in the flanking region of indels [73]. This correlation of indels with hotspots can be partly explained by the fact that both elevated rates of amino acid substitutions and indels occur in regions containing amino acid repeats, and these could act as mutagenic drivers [74] especially in the case of repeated hydrophilic residues [75].

Influence of indels on function and phenotype

It has been estimated that 13% of coding sequences have indels between human and chimpanzee, its closest relative, and approximately 5% generate stop codons in chimpanzee sequences [76]. This may contribute to the phenotypic difference between human and chimpanzee. Considering changes across a larger evolutionary scale, ~5% of enzymatic shifts detected within domain superfamilies are associated with indels of loops [77]. As with SNPs, the impact of indels on biological function depends on the context. For example, ligand-binding proteins possess more indels than enzymes, and these tend to co-occur in less constrained parts of the protein such as loops, reflecting different regimes of constraint [78]. As indels tend to occur in the protein–protein binding interface [79], they can either affect the oligomerization or affect the set of interaction partners that a protein has within a protein–protein interaction network [80]. It has also been observed that indels tend to occur more between paralogues after a duplication event than between orthologues after a speciation event, and deletions are more likely to occur than insertions [81].

Other aspects of protein evolution

Although novel protein features can arise through single-point mutations and residue indels, proteins can also evolve by adding and shuffling their domains to increase complexity. Many domains originated in the early stages of evolution and were shuffled in the genome to explore functional space and create new proteins, increasing the diversity of components within an organism [82–84]. Further complexity can be achieved by varying the protein assemblies, i.e. at the quaternary level, to change the catalytic activity or explore new pathways. For example, the addition of the β-subunit to the α-subunit Na+/K+-ATPase pump enables the bidirectional transport of ions [85]. Many studies have reported the effects of changes in multidomain architecture. However, this domain of research is out of the scope of the present review and has already been explored in previous reviews [6,9,26,86].

In the following section, we review various methods that can be used to identify mutations and to determine their impact on function. These methods use different concepts such as the degree of conservation between amino acids or the evolutionary rate of a gene family.


Adaptive substitutions

It is important to distinguish substitutions from other types of mutations. Although SNPs are mutations that occur sporadically inside the genome of an individual, substitutions are mutations that are fixed in the genetic pool of a population. Whereas many substitutions are fixed under neutral drift, some substitutions reflect functional changes in response to the environment, for example changes in the external environment or the internal body. When adaptation to environmental changes occurs, multiple substitutions are generally fixed within a short time, according to an episodic model of evolution [87,88].

Substitutions and other specific characteristics of a protein can be revealed by comparison with its homologues, which can either be orthologues, that diverged after a speciation event, or paralogues, that diverged after a duplication event [89].

The comparison is made via multiple sequence alignment. Different tools exist for aligning sequences, with various degrees of accuracy [90]. Generally, two different regions can be identified in a multiple alignment: well-aligned blocks (mostly secondary structural elements) and less well-aligned regions (e.g. loops and indels). A phylogenetic tree can be derived from the multiple sequence alignment, which summarizes the history of the protein family, including speciations, duplications and losses. In a tree, tips (or leaves) are the modern sequences, whereas nodes represent ancestral sequences. The length of a branch is proportional to the number of residue substitutions along that branch.

Different columns of well-conserved blocks in multiple alignments may reflect different patterns of evolution for the constituent amino acids, as illustrated in Figure 3. Strictly conserved columns are likely to be crucial for function, for example, because they are important for stability, catalytic activity or because they are part of the oligomerization interface. Alternatively, there can be some slight variation but overall conservation of the physicochemical properties of the residue, such as its hydrophobicity (alanine, valine, isoleucine or leucine) or its basic character (arginine and lysine). Such positions may also be important for the function but can tolerate some deviation. At the opposite extreme, there can be highly variable columns with a great variety of amino acids, reflecting a relaxation of constraints and suggesting that these positions are less important. All these patterns are described by a homogeneous model of amino acid substitution, which assumes that the rate of substitution is constant over time within columns, but can vary between columns.
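
These three regimes can be illustrated with a deliberately simplified Python sketch. The property groups below contain only the residues named in the text plus an assumed acidic group (aspartate and glutamate); the toy alignment, the labels and the function name are hypothetical, gaps are not handled, and real tools use far richer scoring schemes.

```python
# Simplified physicochemical groups (an assumption for illustration only).
PROPERTY_GROUPS = {
    "hydrophobic": set("AVIL"),
    "basic": set("RK"),
    "acidic": set("DE"),
}

def classify_column(column):
    """Label one alignment column by its pattern of conservation."""
    residues = set(column)
    if len(residues) == 1:
        return "strictly conserved"
    for name, members in PROPERTY_GROUPS.items():
        if residues <= members:  # every residue belongs to the same group
            return "conserved " + name
    return "variable"

# Toy alignment of four sequences and four columns (not real data).
alignment = ["GAKD", "GVRS", "GLKT", "GIRE"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]

for position, column in enumerate(columns, start=1):
    print(position, column, classify_column(column))
```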

Figure 3 Possible evolutionary regimes of sites

Theoretical alignment of 20 sequences displayed with Jalview [196]. Subgroup 1 (top panel) consists of sequences 1–10 and subgroup 2 (bottom panel) consists of sequences 11–20. Residues are coloured according to their level of conservation and their biochemical properties (hydrophobic residues in blue, basic in red, acidic in purple, polar in green, histidine residues in turquoise, proline residues in yellow, glycine residues in orange and cysteine-only residues in pink). Residues in white are not conserved in the subgroup. Residues under constant evolutionary rate (G0 and G1): positions 1, 2, 3, 6 or 19 present strictly conserved residues that are likely to be important for the protein function or stability. Positions 20, 21 and 25 are relaxed conserved residues, for which the biochemical properties have to be preserved (hydrophobic for position 20, large aromatic residue such as phenylalanine or tryptophan for position 21, and polar for position 25). Positions 15 and 16 are fully relaxed, and are likely to occur at the periphery of the protein. Residues shifting in evolutionary rates or shifting in physicochemical property (G2 and G3): positions 5, 24 and 27 present a shift in evolutionary rate, whereby a residue is strictly conserved in a group of sequences and relaxed in the other group. Positions 9, 11, 12, 17 or 31 represent a shift in physicochemical property of the residue that could eventually result in a shift in biochemical specificity. Some residues are strictly conserved whereas others are relaxed, such as the acidic residues in position 9. Positions 9 and 12 present a co-evolution example between acidic and basic residues. Positions 11 and 27 present the recruitment of cysteine residues in sequences 1–10, which are likely to form a new disulfide bridge.

Using the phylogenetic tree as a guide, sequences can be clustered in groups of orthologues (e.g. sequences in rodents and sequences in primates) or in groups of paralogues (e.g. the different globins, myoglobin or haemoglobin, within the same organism). Residue mutations allow proteins to explore new areas of functional space. Although positions that are important for the function are generally conserved among groups, a shift in the type of favoured amino acid can occur between groups. For example, a residue may be conserved for a specific property within group α and conserved for another property within group β. This mode of evolution is called, depending on the authors, constant-but-different [91], conservation shifting sites [92] or type II functional divergence [93]. The evolutionary rate at these sites is, although slow, constant over time (homogeneous model of substitution).

Another possible mode of evolution is when a residue is conserved within group α (low evolutionary rate), but is relaxed within group β (high evolutionary rate). This non-homogeneous model of substitution, where the evolutionary rate varies with position and over time, is known as heterotachy [94], covarion-like [95], rate-shifting sites [92] or type I functional divergence [96]. Sites that have correlated variations in rate along the phylogenetic tree are termed ‘concomitantly variable codons’ (covarions). The different modes of evolution are illustrated in Figure 3.
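
The distinction between the two modes can be made concrete with a naive Python sketch that compares one alignment column between two groups of sequences. It uses strict identity as the only notion of conservation, which is an assumption for illustration; the published methods cited above rely on statistical models rather than this rule, and the function name and example columns are hypothetical.

```python
def classify_site(group_a, group_b):
    """Naively classify one alignment position compared between two groups.

    group_a and group_b are strings holding the residues observed at that
    position in each group of sequences.
    """
    residues_a, residues_b = set(group_a), set(group_b)
    a_conserved = len(residues_a) == 1
    b_conserved = len(residues_b) == 1
    if a_conserved and b_conserved:
        if residues_a == residues_b:
            return "conserved"
        return "type II (constant-but-different)"
    if a_conserved != b_conserved:
        return "type I (rate shift)"
    return "variable"

print(classify_site("DDDD", "KKKK"))  # conserved in both groups, different residue
print(classify_site("CCCC", "ASTC"))  # conserved in one group, relaxed in the other
print(classify_site("GGGG", "GGGG"))  # same residue everywhere
print(classify_site("ASTV", "LKDE"))  # relaxed in both groups
```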

Methods for detecting adaptive substitutions

Applied at the amino acid level

A plethora of different algorithms have been developed to identify sites under functional divergence (see Table 1). The sites predicted can vary greatly between these tools, depending on the definitions used for conservation and similarity [97,98]. A number of resources also provide information about sites that are conserved within functional families. These include FunShift [99], the SDR (specificity-determining residue) database [100], CATH and its sister site Gene3D [59,101] and Cube-DB [102].

Table 1 List of tools, databases and webservers

CAPS, Coevolution Analysis using Protein Sequences; Procov, PROtein COVarion analysis.

The ET (evolutionary trace) algorithm identifies sites that vary between subgroups identified on a dendrogram built from sequence comparisons, and enhances performance by mapping detected sites on to a protein structure from the family [103]. The ET algorithm has identified important sites associated with diverse ligand specificities for serotonin and dopamine receptors [104] and can help to annotate proteins [105].

Another automated approach combines three different methods, entropy from information theory (S method), comparison of corresponding distance matrices between sites (MB method) and PCA (principal component analysis; SS method) [106,107], and has recently been extended to exploit MCA (multiple correspondence analyses) in order to simultaneously divide a family into subfamilies and identify specificity-determining positions [108]. The authors have used it to identify both catalytic sites and residues involved in binding interfaces.
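As an illustration of the entropy idea mentioned above, the sketch below computes the Shannon entropy of each column of a small alignment. It is not the S method of [106,107] itself; the function names are hypothetical and gaps are treated as ordinary symbols for simplicity.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    total = len(column)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(column).values()
    )


def alignment_entropies(sequences):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*sequences)]


alignment = ["MKDA", "MKEA", "MRDA", "MREA"]
print(alignment_entropies(alignment))
```

A fully conserved column has an entropy of zero, and the value grows as more residue types share the position more evenly.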

Adding phylogenetic information and explicit evolutionary models to sequence analyses can improve the accuracy of predictions by discriminating between functionally important substitutions and random substitutions fixed under genetic drift. A site-class model of evolution has been developed and applied in neurotransmitter transporters [109]. As in the homogeneous model, this model assumes that positions evolve at different rates, owing to different functional constraints. The homogeneous model of evolution is applied within the TDG09 (Tamuri, dos Reis and Goldstein 2009) algorithm [110]. Each position is tested for the likelihood that it has evolved under the homogeneous or the non-homogeneous model. Any position with a likelihood value significantly higher under the non-homogeneous model is considered to be important for adaptation at a specific time.

ConSurf, although it does not detect sites between groups, can detect sites that are important for the structure and/or function, on the basis of the phylogenetic tree and the evolutionary rate [111]. The same laboratory also developed RASER (RAte Shift EstimatoR), which searches evolutionary trees for shifts in the evolutionary rates [112]. BADASP (Burst After Duplication with Ancestral Sequence Predictions) [113], CheckCov [95] or Procov [114] can be used to detect covarion sites in a protein family, reflecting structural or functional constraints. Finally, FunDi can simultaneously detect type I and type II functional divergence [115] (see Table 1).

Algorithms detecting sites that experience changes in selective constraints can be used to detect functional divergence between groups of homologous proteins and help to answer some fundamental questions in molecular evolution. For example, the study of the fate of a gene after a duplication event is an important field in molecular evolution. After duplication, a gene is present twice in the genome and different scenarios can explain the removal or retention of these genes. The main scenario is non-functionalization, where the gene undergoes pseudogenization and is then removed from the genome. Another fate is neo-functionalization, where the new gene undergoes an increase in evolutionary rate and gains a new function, mainly under positive selection. There is also sub-functionalization, where both genes have lost part of the ancestral function. These models of the fate of genes after duplication are reviewed in detail in [116–118]. One open question is whether speciation and duplication processes have similar impacts on the fate of gene families [116,119]. Using the above-mentioned algorithms to identify traces of functional divergence in different gene families, very few differences between orthologues and paralogues have been detected [120,121].

Applied at the nucleotide level

The genetic code is degenerate, in that one amino acid can be coded by different triplets of nucleotides. This property can be exploited to estimate selective pressures on protein-coding genes [122,123]. The rate of nucleotide substitution at synonymous sites (dS) is assumed to be constant, as the coded amino acid is not affected. However, the rate at non-synonymous sites (dN) is variable and depends on the nature of the amino acid substitutions. Deleterious mutations, which have a negative impact on organismal fitness, will be removed by strong purifying selection, and so dN will be lower than dS. By contrast, if a new amino acid improves fitness, positive selection will promote the fixation of the mutation and so dN will be higher than dS. The dN/dS ratio can therefore be used to estimate which forces are in action. A dN/dS ratio lower than 1 indicates negative (or purifying) selection, whereas a dN/dS ratio higher than 1 indicates positive selection. A dN/dS ratio of 1 indicates that amino acids are under neutral evolution, and are randomly fixed according to different factors, such as the size of the population.
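The interpretation of the ratio can be written down as a small helper. This is a minimal sketch: the function name and the tolerance band around 1 are assumptions made for illustration, and in practice the distinction is made with statistical tests rather than a fixed cut-off.

```python
def selection_regime(dn, ds, tolerance=0.05):
    """Return (dN/dS, label) for given non-synonymous and synonymous rates.

    `tolerance` is a hypothetical band around 1 treated as neutral.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the dN/dS ratio")
    omega = dn / ds
    if omega < 1 - tolerance:
        return omega, "purifying"
    if omega > 1 + tolerance:
        return omega, "positive"
    return omega, "neutral"


for dn, ds in [(0.02, 0.20), (0.10, 0.10), (0.30, 0.10)]:
    print(dn, ds, selection_regime(dn, ds))
```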

The dN/dS ratio therefore provides another means of exploring evolutionary trends, in addition to the amino acid sequence-based methods described above. However, it can be subject to some bias. Saturation of dS can occur when several repeated substitutions have taken place at the same position. Another problem is that dS is assumed to be strictly constant and neutral, which is probably not the case [124].

It is possible to use different codon-substitution models to analyse a multiple sequence alignment. A codon-substitution model uses a statistical framework (maximum likelihood) to estimate the different rates of substitution among the 61 sense codons (4×4×4, minus the three stop codons). On the basis of a phylogenetic tree, it allows the computation of different values, such as the transition/transversion rate ratio (transitions exchange one purine for another or one pyrimidine for another, whereas transversions exchange a purine for a pyrimidine or vice versa) and the dN/dS value. The most basic approach is to estimate a single dN/dS for all branches and sites in the family. This model, owing to its simplicity, is sometimes used as a null model. More sophisticated ‘site models’ discriminate between sites subject to constraints (negative or positive selection) and sites under neutral evolution, by allowing dN/dS to vary among sites. Alternatively, ‘branch models’ can estimate variation in dN/dS values between branches, and reveal whether proteins had global adaptive selection in the past, for example after speciation or duplication. Finally, the ‘branch-site models’ combine site models and branch models, to allow dN/dS to vary between branches and between sites [125]. These models identify the fraction of sites, if any, that have been under positive selection in one or more branches. They have more power and accuracy than the branch models [126] and are implemented in packages such as PAML (Phylogenetic Analysis by Maximum Likelihood) [127], TestNH (Test for Non-Homogenous process of sequence evolution) [128] or MEME (mixed effects model of evolution) [129] (see Table 1).
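The state space of such a model can be enumerated directly. The sketch below lists the 61 sense codons of the standard genetic code and labels a single-nucleotide change as a transition or a transversion; the names are illustrative and no particular package is being reproduced.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
PURINES = {"A", "G"}

# All 4 x 4 x 4 triplets minus the three stop codons.
SENSE_CODONS = [
    "".join(triplet)
    for triplet in product(BASES, repeat=3)
    if "".join(triplet) not in STOP_CODONS
]


def substitution_type(base_from, base_to):
    """Classify a nucleotide change as a transition or a transversion."""
    if base_from == base_to:
        raise ValueError("the two bases are identical")
    same_class = (base_from in PURINES) == (base_to in PURINES)
    return "transition" if same_class else "transversion"


print(len(SENSE_CODONS))
print(substitution_type("A", "G"), substitution_type("A", "C"))
```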

The identification of positive selection using these codon-substitution models is significant when more residues are detected than expected under a null hypothesis. Codon-substitution models have been successful in predicting adaptation that has later been confirmed by experimental results. For example, the branch-site model identified functional changes under positive selection in fungal enzymes [130], butterfly rhodopsins [131] and the Rubisco (ribulose-1,5-bisphosphate carboxylase/oxygenase) enzyme found in flowering plants [132]. Figure 2 shows sites detected in the MHC using these types of approaches [17]. Generally, when positive selection acts on proteins, it concerns only a small fraction of sites, in the range of 1–5% [133].
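The text does not spell out how significance is assessed. One common choice for nested maximum-likelihood models is a likelihood ratio test, sketched below under the assumption that the alternative model has one extra free parameter and that the statistic is compared with a chi-squared distribution with one degree of freedom. The function name and the log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, p_value) for two nested models, assuming 1 degree of freedom.

    For one degree of freedom the chi-squared survival function
    reduces to erfc(sqrt(x / 2)), so no external library is needed.
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods from a null and an alternative model.
stat, p = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-996.5)
print(round(stat, 3), p)
```

Which null distribution is appropriate depends on the pair of models being compared, so the one-degree-of-freedom assumption should be checked against the documentation of the package used.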

Databases have been developed to present sites under positive selection. These include the Human PAML Browser, which focuses on humans and some mammals [134], TAED (The Adaptive Evolution Database) [135] or the Selectome database that provides data on vertebrate genomes from ENSEMBL [136].

Ancestral sequence reconstruction

Ancestral sequence reconstruction can infer the evolutionary history of genes and define more precisely which amino acids are involved in functional divergence by identifying the ancestral function of a protein, such as an enzyme [137–139]. Methods to estimate ancestral states are implemented in various packages, such as PAML [127], FastML [140], GASP [141] or HyPhy [142] (see Table 1). With ancestral sequence reconstruction, there is no discrimination between substitutions that are under neutral evolution or under positive selection, as both types of substitutions are taken into account.
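To give a feel for what estimating an ancestral state involves, the sketch below applies the bottom-up pass of Fitch parsimony to one alignment position. This is a deliberately simpler technique than those in the packages cited above, and it is not taken from any of them. The tree encoding (nested pairs of leaf names), the species names and the residues are all hypothetical.

```python
def fitch(tree, leaf_states):
    """Return (candidate_states, min_changes) at the root of a binary tree.

    `tree` is either a leaf name (str) or a pair (left_subtree, right_subtree).
    `leaf_states` maps each leaf name to its observed residue.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: keep both options and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


tree = (("human", "mouse"), ("chicken", "frog"))
states = {"human": "D", "mouse": "D", "chicken": "D", "frog": "E"}
print(fitch(tree, states))
```

Only the set of equally parsimonious root states is returned; a full reconstruction would add a top-down pass to assign a state to every internal node.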

A recent example of ancestral sequence reconstruction is that of yeast ADHs (alcohol dehydrogenases). In yeast, the cycle of ethanol is controlled by two homologous ADHs, Adh1 and Adh2. Adh1 produces alcohol whereas Adh2 consumes it. The two enzymes differ by 24 amino acids. The ancestral sequence of these ADHs has been predicted and resurrected in vitro [143]. Comparison of Adh1, Adh2 and the ancestral protein suggests that the ancestral function was to produce alcohol, probably as a mechanism of defence against bacteria.

The combination of dN/dS methods and ancestral sequence reconstruction can help trace evolutionary forces back through time. In the MHC in birds, duplication occurred in the β-chain, with residues under positive selection after duplication (Figure 2). Reconstruction of the structural dimer shows that three residues in the β-chain were directly involved in the interaction with the α-chain. Positive selection favoured a shift from basic to acidic residues in the β-chain. In the α-chain, the opposite occurred in a co-evolutionary process, in order to preserve the global stability of the interface [17].

Similarly, the ancestral state of a class of steroid receptors in vertebrates has been predicted and synthesized to test its binding affinity to different hormones (11-deoxycorticosterone, aldosterone and cortisol) [144]. It appears that the ancestral receptor, which originated at the dawn of vertebrates, was able to bind aldosterone, even though this hormone only emerged later, in tetrapods. By solving the structure of the ancestral receptor, the authors deduced that two substitutions, one that destabilizes the structure (S106P) and another that restores the stability (L111Q), were needed to enable the change of binding from the ancestral 11-deoxycorticosterone to cortisol [145].

Ancestral sequence reconstruction has also been successful in clarifying the evolution of digestive RNases in monkeys [146], vertebrate rhodopsin [147], GFP (green fluorescent protein)-like proteins in corals [148], the nuclear receptor EcR–USP dimer in insects (Mecoptera) [149] and the 3-isopropylmalate dehydrogenase in bacteria [150].

Methods to analyse indels

Indels are difficult to study and aligning regions prone to indels is generally awkward. Although good substitution models can predict the replacement of amino acids, indels are more difficult to predict and need more investigation to establish the parameters guiding the indel rates [151]. In previous studies, tools have been developed to take sequences with indels into account when examining phylogeny [152] or to simulate sequence alignments containing indels [153]. Tools such as SIFT Indel [154] or Polyscan can be used to detect indels, as well as SNPs [155], and databases such as IndelFR [156] have been developed to capture indels and their flanking regions. Jiang and Blouin [157] observed that insertions tend to occur in previously inserted regions in a nested manner and used this feature to build structure-based distance metrics and thereby derive phylogenetic trees.

Methods predicting the impacts of single point mutations

SNPs (single nucleotide polymorphisms) are mutations of single nucleotides that differ between individuals of the same species. Owing to the redundancy in the genetic code in protein-coding genes, these mutations can change the amino acid [nsSNPs (non-synonymous SNPs)], leave the amino acid unchanged [sSNPs (synonymous SNPs)] or insert a stop codon. Several methods that predict the effect of nsSNPs and their pathogenicity have been developed (see Table 1). These methods are mainly based on comparative genomics and/or physical models, and include SIFT [158], Polyphen [159] and Condel [160]. Other tools are reviewed in the paper by Ng and Henikoff [161]. There are now many databases listing human SNPs and their effects, such as dbSNP (Single Nucleotide Polymorphism database) [162], SAAPdb [163] or SNPdbe [164]. Some of these resources report structural impacts, e.g. cavities in the protein, breakage of hydrogen bonds or changes in surface properties. Finally, the stability effect of a single amino acid mutation can be analysed at the structural level. Different computational methods have been developed to compare wild-type with mutant proteins, and to predict the ΔΔG value of the mutation, using first-principle models or empirical models. Such methods include FoldX [165], PoPMuSiC [166], CUPSAT [167], Rosetta [168] or I-Mutant [169].
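The three outcomes of a coding SNP listed above can be reproduced with the standard genetic code. The sketch below is illustrative only: the function name and labels are assumptions, and it does not predict pathogenicity in the way that the tools cited above do.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-nucleotide change at `position` (0, 1 or 2) of `codon`."""
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    if mutant == codon:
        raise ValueError("the new base is identical to the reference base")
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gained"
    # Includes the rarer case of a stop codon mutating to a sense codon.
    return "non-synonymous"


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate to valine
print(classify_snp("TGG", 2, "A"))  # TGG -> TGA, tryptophan to stop
```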


Genetic disorders

Diseases caused by changes at the genetic level are generally referred to as ‘genetic disorders’ and include Mendelian diseases, which follow classical Mendelian inheritance patterns and have been catalogued in the OMIM database [170]. A Mendelian disease is controlled by one locus and transmitted from parents to their children according to Mendelian laws. Examples include sickle-cell disease and cystic fibrosis. Sequencing can help to detect genetic disorders. Progress in sequencing technologies and easier access to genomic data are enabling advances in a wide array of research fields, from comparative genomics to personal genomics and personalized medicine. Recent exome sequencing projects have already provided new insights into rare Mendelian diseases, such as autism or obesity, by identifying new variants that are linked to these disorders [171]. Similar projects have also improved our understanding of cancer evolution [172].

Most of the tools to detect SNPs have been developed to analyse disease mutations. A survey of nsSNPs in humans suggests that approximately 25% of them affect function, but that these deleterious mutations tend to be rare in the population as a whole [161]. In protein kinases, which are often involved in cancer, it has been observed that non-pathogenic nsSNPs occur randomly in the protein structure, but tend to avoid the functional sites [173,174]. However, pathogenic somatic and germline mutations (driver mutations) in kinases tend to occur in functional sites, and therefore affect function directly [173,174]. The effect of a mutation depends on its context. Some pathogenic mutations in humans can be the wild-type in other species, and this difference has been proposed to be caused by a compensatory mechanism during evolution [175]. Finally, the association between disease-associated nsSNPs and an impact on function is strong [176].

As for SNPs, the changes induced by indels can have medical implications [70,177]. A number of studies have analysed the impact of indels in proteins associated with disease. A survey of 79 human genomes revealed over two million indels [57], of which nearly one million are present in genes, with small indels frequently found in exons and larger indels more often found in introns. As discussed previously, exonic indels are less frequent than SNPs owing to their tendency to have an impact on the function of the protein. Nearly 1% of genetic diseases are caused by long insertions (of more than 20 bp), which are produced by different mechanisms such as duplication (tandem or partial tandem), nonpolyglutamine repeat expansion or LINE-1 retrotransposition [178].

Infectious diseases

Disease-causing micro-organisms, such as viruses, bacteria or unicellular eukaryotes, evolve under strong selective pressure from the immune system and drugs. Both parasites and hosts find themselves drawn into an evolutionary arms-race. Parasites find ways to invade the host and develop mechanisms for influencing it, or for requisitioning the host cellular machinery to support their own interests. In response, hosts develop resistance to defend themselves and fight the infection. A striking example of such a resistance mechanism is the adaptive immune system of vertebrates, which is able to produce antibodies under selective pressure (‘clonal selection’) and to retain only those able to recognize antigens, in a manner analogous to natural selection at the species level. In turn, parasites then explore new ways to escape these host defences. Some well-known examples are the influenza virus, HIV and MRSA (methicillin-resistant Staphylococcus aureus) [179], which rapidly develop resistance to drugs. Tracing the deep evolution of viruses and other parasites can help in predicting the emergence of pathogenic mutations [180]. These pathogens can spontaneously develop resistance, but when they are exposed to drugs they undergo selective pressures and tend to evolve fast to develop resistance. Studies of resistance can be carried out either in the laboratory under special conditions or by following the progression of the disease in patients. Finally, the acquisition of resistance to antibiotics in bacteria can follow parallel paths by exploiting different combinations of mutations. This has been experimentally demonstrated in E. coli [181].

Residues under adaptive selection can depend on the host context. When influenza viruses were compared with respect to their different hosts (avian or human) using the TDG09 algorithm [110], 172 sites (from among all influenza genes) were found to evolve under the non-homogeneous model, suggesting that these sites are important for the spread of the virus from avian to human hosts. The identification of new mutations in deadly viruses, and the ability to determine when a virus will be able to cross the species barrier, are both very important aspects of the fight against such viruses. The RASER algorithm was developed to detect shifts between different subtypes of HIV [112]. Similarly, ancestral sequence reconstruction by the BADASP tool [113] identified 39 residues in HIV responsible for specificity differences [182].

An important biomedical application that exploits dN/dS values is the identification of residues that are highly conserved and critical for function. The high conservation of some positions can be caused either by different constraints (i.e. functional and structural, see the Effects of mutations on proteins and their functions section) or by a low evolutionary rate at the nucleotide level. The dN/dS value can be used to discriminate between these two situations, as positions under strong evolutionary constraints will present a very low dN/dS value, whereas positions with a low evolutionary rate will have a dN/dS value close to 1. The sites under constraints are unlikely to mutate in the future and are therefore good drug-target candidates. Drug resistance is a problem when targeting Plasmodium falciparum, the pathogen responsible for malaria, making it important to find such residues to target in that organism. Durand et al. [183] estimated dN/dS values in different genes so as to identify positions that are under very strong evolutionary constraints (dN/dS<0.1) and are therefore likely to be critical for protein function. Similarly, in HIV-1, Woo et al. [184] used both entropy and dN/dS values measured by the HyPhy package [142]. As expected, most sites with lower dN/dS are found in protein cores and at interfaces, suggesting they are mostly affected by structural constraints. Highly conserved sites at the interface are likely to be good candidates to target with drugs.
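The reasoning in this paragraph, that conservation alone does not reveal its cause whereas dN/dS does, can be sketched as a two-step filter. The dN/dS cut-off of 0.1 follows the value quoted from Durand et al. [183]; the entropy cut-off, the function names and the per-site values are hypothetical and are not taken from either study.

```python
def explain_conservation(entropy, omega, max_entropy=0.2, max_omega=0.1):
    """Label one site from its column entropy (bits) and its dN/dS value (omega).

    `max_entropy` is an assumed cut-off for calling a site conserved.
    `max_omega` mirrors the dN/dS<0.1 threshold mentioned in the text.
    """
    if entropy > max_entropy:
        return "not conserved"
    if omega < max_omega:
        return "conserved under strong constraint"
    return "conserved without strong constraint"


def candidate_target_sites(site_values, max_entropy=0.2, max_omega=0.1):
    """Return the sorted positions that are conserved under strong constraint."""
    return [
        pos
        for pos, (entropy, omega) in sorted(site_values.items())
        if explain_conservation(entropy, omega, max_entropy, max_omega)
        == "conserved under strong constraint"
    ]


# Hypothetical per-site values: position -> (entropy, dN/dS).
sites = {10: (0.0, 0.03), 11: (0.1, 0.95), 12: (1.6, 0.40), 13: (0.05, 0.08)}
print(candidate_target_sites(sites))
```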

Mutations in cancer cells

Cancer also evolves under natural selection [172]. Mutations occur during cell replication, but are normally kept under control by DNA repair. If the DNA repair process is unable to correct the errors, the abnormal cell is directed to undergo apoptosis. However, some functional mutations can result in the mutant tumoral cell remaining in the organism and propagating throughout. Distinguishing ‘driver mutations’, which are the cause of cancer, from ‘passenger mutations’, which occur in cancer genomes but have no effect on fitness, is an important research field [185,186]. As in bacteria, tumoral cells can develop resistance to anti-tumoral drugs. Mutational events in cancer (e.g. single point mutations and rearrangements) can be traced in a similar manner as for the phylogenetic studies of species [187]. A recent study compared genomic DNA from healthy cells with acute myeloid leukaemia tumour cells before and after treatment [188]. The authors showed that populations of cancer cells are different from one another and that those that resist the anti-tumour drugs harbour mutations that protect them. This trend has also been observed in other studies (reviewed by Caldas [189]).

Mutation Assessor [190], which was specifically designed to detect constant-but-different mutations in huge alignments, has been tested on proteins associated with cancer. The authors classify the type of mutation as loss-of-function, gain-of-function or switch-of-function. The aim is to detect non-synonymous mutations that are likely to cause a switch-of-function, thereby promoting cancer. The method was applied to proteins in the COSMIC cancer database [191] and revealed that ~5% of mutations are associated with a switch-of-function.

Sometimes functional changes can occur without any compensatory mutations. For example, a study in breast and colorectal cancers found an important destabilizing effect in most driver mutations, especially in tumour suppressors [192]. The authors used SNPs3D [193] to identify mutations that disrupt the structure. The observed impacts included steric clashes, disruption of an electrostatic interaction, disruption of a protein–protein interaction or disruption of ligand-binding. Some of these effects can lead to a gain-of-function, as in tumour enhancers that lose the ability to bind their repressors (tumour suppressors).


The possibilities of adaptation of proteins to environmental changes are multiple. When successful, these adaptations are written in the genome of organisms. The increase in the number of available genomes and the decreasing price of whole-genome sequencing will help in producing a more accurate view of both large- and small-scale gene and protein evolution. Models of molecular evolution are improving all the time, better describing the history of mutations, whereas structural genomics and molecular dynamics are developing biophysical models of higher resolution. For example, recent experimental methods using high-throughput sequencing were developed to assess the constraints on protein evolution and the tolerance to mutation at various positions [194]. The authors performed 600000 random mutations of the human WW domain. They next submitted these variants to six rounds of selection that assessed the ability to bind a given ligand. The number of viable variants, i.e. those that were able to bind this ligand, significantly decreases with each selection round. The final result is a map of sequence–function relationships.

Evolutionary concepts can also help to understand diseases. The identification of evolutionary patterns of amino acids can help to identify mutations that are likely to change function, thus playing the role of drivers in cancer [190]. The use of dN/dS to estimate selective forces can provide some clues as to which residues should be targeted by drugs, as illustrated in Plasmodium [183]. The use of phylogenetic trees can identify residues important for adaptation, for example, in influenza [110]. This information could be used to define when a new strain of virus is close to having a critical set of mutations that allow it to pass from one organism (e.g. a bird) to another (e.g. a human). Future developments will bring better ways of combining these approaches, creating better integrative models of protein evolution [9,139].


R.A.S. received funding from the Fondation du 450ème anniversaire de l’Université de Lausanne and Swiss National Science Foundation [grant numbers 132476 and 136477]. B.H.D. is a long-term postdoctoral fellow of the Japan Society for the Promotion of Science (JSPS). This work was supported, in part, by the JSPS KAKENHI programme [grant number 23-01210].


We thank our reviewers for constructive remarks.

Abbreviations: ADH, alcohol dehydrogenase; BADASP, Burst After Duplication with Ancestral Sequence Predictions; CATH, class, architecture, topology, homology; dN, non-synonymous sites; dS, synonymous sites; ET, evolutionary trace; IMDH, isopropylmalate dehydrogenase; indel, insertion–deletion; MEME, mixed effects model of evolution; nsSNP, non-synonymous SNP; PAML, Phylogenetic Analysis by Maximum Likelihood; RASER, RAte Shift EstimatoR; RMSD, root mean square deviation; SCOP, structural classification of proteins; SDR, specificity-determining residue; SNP, single nucleotide polymorphism; TDG09, Tamuri, dos Reis and Goldstein 2009

