Review article

The origin, evolution and structure of the protein world

Gustavo Caetano-Anollés, Minglei Wang, Derek Caetano-Anollés, Jay E. Mittenthal


Contemporary protein architectures can be regarded as molecular fossils, historical imprints that mark important milestones in the history of life. Whereas sequences change at a considerable pace, higher-order structures are constrained by the energetic landscape of protein folding, the exploration of sequence and structure space, and complex interactions mediated by the proteostasis and proteolytic machineries of the cell. The survey of architectures in the living world that was fuelled by recent structural genomic initiatives has been summarized in protein classification schemes, and the overall structure of fold space explored with novel bioinformatic approaches. However, metrics of general structural comparison have not yet unified architectural complexity using the ‘shared and derived’ tenet of evolutionary analysis. In contrast, a shift of focus from molecules to proteomes and a census of protein structure in fully sequenced genomes were able to uncover global evolutionary patterns in the structure of proteins. Timelines of discovery of architectures and functions unfolded episodes of specialization, reductive evolutionary tendencies of architectural repertoires in proteomes and the rise of modularity in the protein world. They revealed a biologically complex ancestral proteome and the early origin of the archaeal lineage. Studies also identified an origin of the protein world in enzymes of nucleotide metabolism harbouring the P-loop-containing triphosphate hydrolase fold and the explosive discovery of metabolic functions that recapitulated well-defined prebiotic shells and involved the recruitment of structures and functions. These observations have important implications for origins of modern biochemistry and diversification of life.

  • evolution
  • fold superfamily
  • organismal diversification
  • protein domain
  • proteome
  • tripartite world


Protein molecules are vital components of life. Together with functional RNA, they are primarily responsible for the many biological activities of the cell. Proteins define the enzymatic chemistries and transport processes characteristic of metabolic pathways, regulate gene expression and many other molecular functions, are involved in signal transduction, and make up the actual molecular and cellular machinery that fuels life. They are highly diverse and embed hierarchically many layers of molecular organization. Their evolution is complex and constrained by aspects of molecular structure, thermodynamics and function. In the present review, we examine the structure of the modern protein world and discuss how evolutionary genomics and structural bioinformatics have helped to dissect the origin and history of modern proteins. We also discuss how the discovery of structure and function in the contemporary protein world has affected the distribution of molecules in proteomes.


Polypeptide chains fold into highly ordered architectures that embed protein function. These folds minimize the conformational energy of individual amino acid residues in the chain, maximize hydrogen-bonding of polar groups and form compact and well-packed 3D (three-dimensional) atomic structures that bury hydrophobic residues away from the aqueous environment. Physically, they represent spatial arrangements of more or less wound helices that locally distort the bond geometry of the polypeptide backbone (3₁₀-helix, α-helix, π-helix and polyproline II-helix) and extended chain segments called β-strands that can establish long-range interactions and form β-sheets in parallel and antiparallel arrangements (often curved into open and closed barrel structures). These conformational elements (‘helical’ and ‘sheet’), originally proposed by Pauling and Corey [1] as building blocks of proteins, are defined fundamentally by hydrogen-bonding interactions of closely or distantly related regions of the polypeptide chain and make up approx. two-thirds of protein structure. They are separated by loop regions (turns and coils), which are more or less rigid stretches of the backbone that delimit their direction in space. An example is the β-hairpin, a reverse turn that links two adjacent strands and forms an antiparallel β-sheet.

Linderstrøm-Lang and Schellman in the 1950s [2] realized that protein structure was hierarchical and proposed four levels of structural organization (complexity): (i) primary structure, the sequence of amino acids linked by peptide bonds; (ii) secondary structure, the hydrogen-bonding patterns that give rise to helix and sheet elements in the fold; (iii) tertiary structure, the actual fold of the molecule stabilized mainly by side-chain interactions of elements of secondary structure; and (iv) quaternary structure, the aggregation of separate polypeptide chains into a supramolecular biological unit. Function fully manifests when these four levels of complexity are achieved. The recognition that aspects in the structure of proteins are redundant and modular (see below) led to the addition of new levels: (i) supersecondary structures, recurrent motifs of secondary structure, such as α-α-hairpins, β-β-hairpins and β-α-β-structures, sometimes repeated in tandem (e.g. in leucine-rich repeat proteins) or producing structures with or without internal symmetry (e.g. β-α-β-structures in TIM barrels) [3]; (ii) protein domains, compact units within the fold that act as structural modules and appear singly or in combination with other domains in multidomain proteins [4]; and (iii) multiprotein complexes, heteromultimeric assemblies of functionally related proteins that act as high-order functional units (e.g. molecular machines such as the ribosome, the proteasome and the dynein complex) [5]. Figure 1 describes how hierarchical complexity correlates with degree of molecular detail and time needed to develop each level of organization in evolution. Note, however, that levels of organizational complexity can be blurred by how structural elements are defined and new levels may arise with increased knowledge of protein structure and drivers of protein evolution. Similarly, the rate of change associated with structure is notoriously difficult to estimate.
However, and within orders of magnitude, more complex structures arise through accumulation of many changes at lower levels and therefore take considerably more time to arise in evolution.

Figure 1 Levels of molecular organization in the protein world

The hierarchy and complexity of proteins are illustrated with the ATP synthase complex, a highly abundant protein ensemble responsible for ATP synthesis in the cell. The 600 kDa complex can be separated into two subunits, F1 and Fo, which can be studied individually. Transmembrane proton gradients drive rotation of the C-subunit ring of Fo, which then propels rotation of the central stalk and the F1 head of the complex. This rotation causes conformational changes in F1 active sites that result in ATP synthesis/hydrolysis. The α-subunit of the complex has three domains, the central of which has a P-loop hydrolase fold that is highlighted in the Figure. Different levels of structure occur at different levels of resolution (scale in Å, where 1 Å=0.1 nm) and at different rates in evolution. For example, the discovery of ∼2×10²⁴ total sequences (considering homogenization of sequence diversity at population level; see the text) suggests that new genotypes are produced on Earth at intervals of fractions of a microsecond. A similar argument can be used with secondary structure. The average length of a helical segment is 10±2 residues and of a sheet element 5±1 residues, and their average number is 6±2 and 7±3 per protein chain respectively (see Supplementary Table S1). Then the maximum number of possible permutations of elements of different length (∼7 residues) in groups of five to ten is 2.8×10⁸ (i.e. 7¹⁰). If all these permutations are accessible, the rate of maximum discovery approximates 0.1 structural arrangement per year. Rates of discovery at higher structural levels use frequentist arguments that relate the number of known structural arrangements to time. For example, the discovery of the ∼4×10⁴ distinct domains indexed in SCOP [82,83] spread uniformly over a ∼4 billion year timeline renders a frequency of discovery of one domain every 10⁵ years. PDB codes used: 1BNF and 1QO1.
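
The back-of-envelope rates quoted in this legend can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: the variable names are ours, the length of a year in seconds is an assumption, and discoveries are assumed to be spread uniformly over the timeline, as in the legend.

```python
# Rough reproduction of the discovery-rate arithmetic in the Figure 1 legend.
# All names are illustrative; inputs are the approximate values quoted above.

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year
TIMELINE_YEARS = 4e9                    # ~4-billion-year history of life

# Sequence level: ~2x10^24 sequences spread uniformly over the timeline.
total_sequences = 2e24
seconds_per_sequence = TIMELINE_YEARS * SECONDS_PER_YEAR / total_sequences

# Secondary-structure level: 7^10 arrangements taken as the upper bound.
max_arrangements = 7 ** 10
arrangements_per_year = max_arrangements / TIMELINE_YEARS

# Domain level: ~4x10^4 SCOP domains over the same timeline.
scop_domains = 4e4
years_per_domain = TIMELINE_YEARS / scop_domains

print(f"one new sequence every {seconds_per_sequence * 1e6:.3f} microseconds")
print(f"{max_arrangements:.1e} arrangements, {arrangements_per_year:.2f} per year")
print(f"one new domain every {years_per_domain:.0e} years")
```

The secondary-structure rate comes out slightly below the rounded figure of 0.1 per year given in the legend, which is consistent with the order-of-magnitude nature of these estimates.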

Linderstrøm-Lang's laboratory studies of protein denaturation and proteolysis and the accessibility of protein bonds that are buried also helped shape the idea that protein structure was highly dynamic [6]. This ultimately materialized in Anfinsen's thermodynamic hypothesis of folding that postulates the native structure of a protein results from spontaneous refolding of denatured states [7] into the thermodynamically stable structure [8], linking primary to tertiary structure in proteins. It is now apparent that proteins indeed achieve native structure quickly and efficiently through a myriad of conformational changes that are influenced by the solvent [9–11]. This involves a complex interplay of simple pairwise and co-operative interactions that tend to stabilize protein structure towards its native state, a state in which frustration (conflicting interactions) is minimal and a ground energetic state and funnel topography dominates the local folding landscape (energy surface; Figure 2A). In reality, folding appears to materialize through a progressive organization of an ensemble of partially folded structures that resembles a rugged funnel, with trajectories defined by a series of steps of local optimization that minimize the free energy accessible to the polypeptide chain and conflicting energy contributions from the relative position of individual residues and solvent [9–11]. This complex interplay sometimes occurs in the presence of kinetic traps that complicate the landscape, characteristic of rugged funnels (Figure 2A). Folding should be regarded as a transition of disorder to order in a global optimization process. The ‘zipping and assembly’ hypothesis captures for example the essence of this process, describing microscopic routes of folding that start from a polypeptide sequence and materialize in a time series of smaller and smaller conformation ensembles [12].
These routes involve local metastable structures in the protein chain that progressively assemble into more global ones. The energy landscape has inspired physics-based modelling with semi-empirical atomic force fields that fold molecules in a computer and provide ab initio understanding of the forces and dynamics that govern the folding process. Great progress has been made, for example with helical and β-hairpin peptides and small proteins, using modern force fields, Boltzmann sampling and/or other considerations [12–17]. Pathways of folding and unfolding derived from Molecular Dynamics simulations are now supported by experimentation with analysis of transition states, determination of intermediates with NMR spectroscopy and denatured states. One example is the folding and unfolding of the three-helix bundle protein from the Engrailed homeodomain of Drosophila melanogaster at atomic resolution [18,19].

Figure 2 Current paradigms on folding and evolution of proteins

(A) Folding funnel-shaped representation of the energy landscape that describes the transition of protein conformations from disorder to order. The free energy of a protein is displayed as a function of the number of conformations at each energy level (density of states) that are derived from the partition function and describe the topological arrangements of the polypeptide chain in space that are possible. Proteins fold co-operatively by channelling protein folding intermediates (non-native states) downhill into the funnel and achieving the native state at its base, after avoiding the kinetic traps of the rugged landscape. Note, however, that proteins do not remain folded. Native proteins are slightly more stable than their denatured states so that they fold and unfold every few minutes, setting the pace for change in the funnel. (B) Mapping of sequence (genotype) space into structure (phenotype) space and then into fitness. The first map is many-to-one and unfolds by single mutational steps in sequence space. The second map assigns real numbers to structures given some function that distils the phenotype. (C) Evolutionary dynamic representation of protein evolution. The mapping of mutating protein sequences (genotypes) into structures (phenotypes) defines neutral sets in sequence space, ensembles of sequences that fold into a given native structure, and neutral networks, subsets of sequences that are tightly linked by series of single point mutations. Neutral nets corresponding to four different protein folds are coloured differently in a planar representation of the multidimensional space of sequences. Mutation causes sequences to drift along the neutral nets. However, the search for thermodynamic and kinetic folding optimality (described by a third dimension) tailors evolutionary trajectories, keeping them within space attractors for individual folds (illustrated as funnels).
When mutational trajectories (paths in the graph) reach new neutral networks, new attractors and new folds are discovered. An animated version of this Figure is available online.

This multidimensional energy landscape provides a statistical view of the energetics of protein conformations, but also manifests at evolutionary levels, as single-molecule structure variants (and associated conformational ensembles) generated by mutation are culled by natural selection and other evolutionary constraints (e.g. self-organization [20]). This landscape becomes evolutionary when fitness values are assigned to phenotypes (Figure 2B). Here, fitness embodies the advantageous contribution of individual molecules to the reproductive success of organismal lineages. When biopolymers mutate, they embark on an exploration of the space of possible sequences (defined by a Hamming metric of elemental mutational moves) and associated structures and functions. This exploration takes the form of adaptive walks in sequence space that optimize thermodynamic, kinetic and mutational features in molecules. The mapping of sequence (genotype) into structure (phenotype) has been shown to be tractable in RNA [21] and also in proteins [22] and has three fundamental properties: (i) there are many more sequences than structures (i.e. the sequence-to-structure map is highly degenerate); (ii) few common, but many rare, structures materialize in structure space; and (iii) extensive neutral networks that percolate sequence space define common structures and structural neighbourhoods [23,24]. Within these neutral networks (anticipated by Maynard Smith [25] and in response to Salisbury [26]), structure is impervious to mutational change at the sequence level and, because the distribution of sequences that fold into the same structure (shape) is approximately random, the mapping has ‘shape space covering’ properties. This means that all structures can materialize (are accessible) within relatively few mutational changes in sequence space. 
This property is especially true for RNA and has been confirmed experimentally using functional molecular switches that have been engineered by in vitro evolution [27]. The existence of neutral networks and shape space covering has also been predicted for polypeptides [28], paraphrasing conclusions from lattice models with simplified alphabets [29–31]. In these studies, independent adaptive walks in sequence space can produce a given structure despite lacking significant sequence similarity, matching the recurrent observation that sometimes seemingly unrelated sequences can harbour a given fold [32]. At the same time, and because of shape space covering, sequences that fold into completely different structures may differ by a few critical amino acid residues. Consequently, extensive neutral networks enable the efficient exploration of sequence space, whereas shape space covering ensures a constant rate of structural discovery. These properties match, for example, some recent in vitro evolution experiments [33] that show extensive regions in natural proteins exhibit functions refractory to mutational change [34] and demonstrate that discovery of function in random peptide libraries is facile (e.g. [35,36]). However, the sequence-to-structure mapping of proteins is much more complex and its landscape is ‘holey’ when compared with RNA, with proteins folding into native states missing in vast segments of sequence space. Although the neutrality of protein sequence space is much higher than that of RNA (>90% of single amino acid substitutions are neutral [37]), protein structures appear to concentrate in dense clusters [38,39], whereas RNA structures spread through sparsely connected networks [40,41].
Under a ‘superfunnel’ paradigm supported by experimentation [42,43], protein sequences drift along neutral networks and are sometimes trapped into funnels (Figure 2C), defined by sequences that are mutationally more stable (they tolerate the largest number of mutations) and, at the same time, are thermodynamically more stable [37,39]. These ‘attractors’ in neutral space are sometimes replaced by more fit ones through smooth transitions mediated by excited states that tend to occur between similar structural phenotypes and genotypes [44] (Figure 2C). Smooth adaptive walks of this kind [25] explain enzyme promiscuity [45] and reconcile recent experiments that show that proteins optimized for novel function arise before the original function is lost [46,47]. They can also explain gene duplication and divergence and the effect of epistatic thresholds of stability that buffer the effects of deleterious mutations on fitness [48]. The superfunnel paradigm therefore links the energetic landscape of folding with the evolutionary dynamics of molecules in percolating neutral networks.
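
The notions of a Hamming metric, a degenerate sequence-to-structure map and a neutral network can be made concrete with a deliberately small toy example. The Python sketch below is not a folding model: the two-letter alphabet, the `toy_fold` function (which simply collapses runs of identical letters) and all other names are hypothetical stand-ins chosen so that the whole sequence space can be enumerated.

```python
from collections import deque
from itertools import product

ALPHABET = "HP"  # assumption: two-letter (hydrophobic/polar) toy alphabet


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def one_mutant_neighbours(seq):
    """Yield every sequence at Hamming distance 1 from seq."""
    for i, old in enumerate(seq):
        for new in ALPHABET:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


def toy_fold(seq):
    """Hypothetical many-to-one 'structure': collapse runs of identical letters."""
    return "".join(c for i, c in enumerate(seq) if i == 0 or c != seq[i - 1])


def neutral_network(start):
    """Sequences reachable from start by point mutations that keep toy_fold unchanged."""
    target, seen, queue = toy_fold(start), {start}, deque([start])
    while queue:
        for neighbour in one_mutant_neighbours(queue.popleft()):
            if neighbour not in seen and toy_fold(neighbour) == target:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


length = 6
sequences = ["".join(p) for p in product(ALPHABET, repeat=length)]
structures = {toy_fold(s) for s in sequences}
network = neutral_network("HHPPHH")

print(len(sequences), "sequences map onto", len(structures), "toy structures")
print(len(network), "sequences share the toy structure", toy_fold("HHPPHH"))
```

Even in this toy setting there are more sequences than structures, and sequences with the same structure are linked by chains of single mutations. Real sequence-to-structure maps are, as noted above, far more complex.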

Selection for compact and stable fold architectures is also maintained at higher levels of organization by more complex cellular infrastructure, which adds further evolutionary constraints on protein architecture. For example, the HSR (heat-shock response) is a fundamental cytoplasmic mechanism common to the three domains of life (Archaea, Bacteria and Eukarya) [49]. When subjected to temperature increases, five groups of Hsps (heat-shock proteins) are induced (Hsp100s, Hsp90s, Hsp70s, Hsp60s and small Hsps). These include chaperones, proteases, ATPases and DNA-repair proteins that mend damage and mediate non-covalent folding, unfolding, assembly and disaggregation of proteins. Hsp synthesis is of crucial importance for thermotolerance of organisms such as hyperthermophilic archaea, which appear to exhibit minimal, but highly tailored, protein-folding systems [50]. Within these groups of molecules, chaperonins are megadalton ring assemblies that mediate ATP-dependent protein folding to the native state (e.g. the bacterial chaperonin GroEL and its co-chaperonin GroES) through complex allosteric mechanisms [51]. Prefoldins deliver nascent unfolded proteins to these cytosolic chaperones as they exit the ribosome, establishing specific interactions with actin and tubulin in eukaryotes [52]. Since proteins exhibit a generic tendency to aggregate in the high macromolecular concentrations of intracellular compartments (molecular crowding) [53], proteins that unfold or remain unfolded are tagged and degraded by the ubiquitin–proteasome proteolytic pathway [54]. However, in eukaryotes, more complex systems guarantee the correct folding of a protein. These proteostasis control systems regulate protein concentration, the conformation of folds and complexes, and cell localization [55–57].
They involve interactions between the folding polypeptide and cellular components such as chaperones, co-chaperones, folding enzymes and components of small-molecule metabolism that stabilize the folded state and stress sensors, including the HSR and the UPR (unfolded protein response) [57]. This adds an additional layer of complexity to the already difficult folding problem and additional evolutionary constraints to the discovery of protein architecture.


Proteins are covalently bonded linear heteropolymers made up of 20+ amino acid monomers with a specific primary sequence of side chains spaced at regular intervals. From this perspective, the roughly 10³–10⁵ protein sequences per genome that are encoded in the genomes of the ∼10⁷–10⁸ species that exist on Earth [58], most of which are microbial [59], cover necessarily only a minute fraction (∼10¹⁰–10¹³ variants) of the enormous permutational space defined by amino acid sequence (∼10³²¹–10⁴⁶⁹ possible arrangements in sequence space), given recent estimates of average protein length in genomes [60,61]. In these calculations we assume there is no intraspecies variation, even though it is unlikely that members of a given reproducing population will be identical. In fact, if we consider that the (4–6)×10³⁰ prokaryotic microbial cells on our planet (which account for ∼70% of life in certain habitats) have turnover rates of ∼8×10²⁹ cells per year [59] and that mutations in proteins occur in clock-like fashion at rates of ∼4×10⁻⁷ per microbial cell and per generation [62], then we would expect an upper boundary of ∼2×10³² total mutational amino acid changes in microbial proteins in the ∼4-billion-year-long history of life, which is still a minute fraction of sequence space (even if these concentrate in sequences that fold successfully).
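
The orders of magnitude in this comparison follow from simple counting, which the Python sketch below retraces. It assumes a 20-letter alphabet and works in log10 units to avoid floating-point underflow. The function names are ours, and the chain lengths are back-calculated from the quoted bounds of sequence space rather than taken from the cited length estimates.

```python
import math

AMINO_ACIDS = 20  # assumption: standard alphabet (the text notes "20+" monomers)


def log10_sequence_space(length, alphabet=AMINO_ACIDS):
    """log10 of the number of distinct sequences of a given chain length."""
    return length * math.log10(alphabet)


def implied_length(log10_size, alphabet=AMINO_ACIDS):
    """Chain length whose sequence space has the given log10 size."""
    return log10_size / math.log10(alphabet)


# Chain lengths implied by the quoted bounds of 10^321 and 10^469.
short_chain = implied_length(321)
long_chain = implied_length(469)

# Sequences sampled by extant genomes: (sequences per genome) x (species).
log10_sampled_low = 3 + 7    # 10^3 x 10^7  = 10^10
log10_sampled_high = 5 + 8   # 10^5 x 10^8  = 10^13

# Most generous ratio of sampled variants to the smaller sequence space.
log10_fraction = log10_sampled_high - 321

print(f"implied chain lengths: {short_chain:.0f} to {long_chain:.0f} residues")
print(f"sampled variants: 10^{log10_sampled_low} to 10^{log10_sampled_high}")
print(f"largest sampled fraction of sequence space: 10^{log10_fraction}")
```

Under these assumptions the quoted bounds correspond to chains of roughly 250 to 360 residues, and even the most generous estimate leaves the sampled fraction hundreds of orders of magnitude below unity.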

This limited molecular exploration of sequence space has nevertheless encountered considerable diversity at higher levels of structural organization, mostly because of the neutral net and shape space covering properties we discussed above. Whereas the rate of discovery of new sequence genotypes on Earth appears to occur at incredible pace and generate considerable sequence diversity, rates at higher levels of protein organization decrease progressively and in a substantial manner (Figure 1). Sequence variants develop within fractions of microseconds. However, secondary structure variants take considerably longer to be discovered, whereas complex 3D structures arise once in hundreds of millions of years. This is an expected outcome. Higher structural levels are linked directly to function and are therefore the subject of natural selection and strong evolutionary constraint [63,64]. Sequence genotypes have a limited alphabet and change constantly by mutation, making them poor repositories of molecular history. In fact, the repeated accumulation of substitutions in nucleotide sites (site saturation) can erase evolutionary history at intermediate and deep evolutionary timescales [65–67]. In contrast, structural phenotypes have more complex alphabets that define function directly or through interactions of substructural, molecular and supramolecular components that are collectively responsible for function (e.g. in molecular ensembles), all of which are often carefully culled by natural selection. The effects of selection are consequently stronger at this level than at the genotype level and structural phenotypes are generally left unchanged over short, intermediate and long timescales. However, proteins evolve at vastly different rates, and recent studies suggest that this is due to differences in expression levels, functional roles and intra- and inter-molecular interactions [43,68–73].
For example, increases in the density of contacts (fraction of buried sites) in domains or entire proteins tend to increase evolutionary rates [73]. In contrast, increases in the number of binding interfaces of multi-interacting proteins tend to decrease rates [71]. Interestingly, positively selected amino acid sites were found preferentially located on the exposed surface of proteins [72]. Within individual proteins, different regions of the molecules are differentially constrained and were found to be quantitatively stable over billions of years of divergence [74]. Most notably, active sites and residues important for structural maintenance tend to evolve slowly and are refractory to mutation. However, the relationship between protein conservation and function is complex, especially when molecular redundancy, strength of natural selection and genome structure are taken into consideration [75].

Advances in comparative and structural genomics offer unprecedented opportunities to understand proteomic complexity and provide insights into the diversity and structure of the protein world [76]. The number of protein sequences and structures has expanded significantly in the last few years and its organization is clearly hierarchical (Figure 3). There are currently (as of November 2008) 875 completely sequenced genomes contributing to ∼6 million sequences. However, only a fraction of sequences are well annotated, and the number of unique entries at lower levels of structural organization continues to increase exponentially; the protein world remains uncharted at these levels [77]. In contrast, the number of new folds that are encountered every year is decreasing considerably, supporting the idea that the repertoire of architectural designs is finite (perhaps ∼1500 folds). A recent attempt to recreate all possible protein folds by ab initio folding from short homopolymeric sequences revealed all constructs matched folds in solved structures, and vice versa; all natural single-domain structures had analogues in the model set [78]. This suggests that our knowledge of single-domain folds is probably complete. In order to make sense of increasing information, a number of bioinformatic strategies of sequence and structural comparison have led to the creation of a wide range of protein classification schemes, all of which aim to group evolutionarily related proteins [79]. These catalogues organize sequences and proteins of known structure (currently described by ∼54000 Protein Data Bank entries) into taxonomies in an attempt to provide a global evolutionary view of the protein world. The first taxonomies described were originally based on the concept of the protein domain [80] and most modern classifications are still organized around this structural level [79]. 
This is predicated on the premise that domains are compact and more-or-less independent globular folding elements, establish more abundant intradomain than interdomain residue contacts, and recur in different structural contexts (i.e. they act as modules, appearing singly or in combination with other domains). The recurrence concept is supported by a comparative framework on the basis of homology relationships and is fundamental. It defines the domain as an evolutionary unit of classification. Approx. 30 popular domain classifications based on sequence and/or structure are currently available; they use patterns and/or profiles in sequence to build libraries of domain families or establish distant relationships using structure comparisons. The Pfam database of multiple sequence alignments and HMMs (hidden Markov models), for example, is a comprehensive resource for the identification of domain families, repeats and motifs [81]. It provides two levels of curation, one based on automated domain sequence alignments (Pfam-B) and the other extended by HMM-based profile searches and literature analyses (Pfam-A), which serve as seeds for the iterative construction of HMMs. In contrast, SCOP (Structural Classification of Proteins) is a high-quality taxonomical resource that assigns domain boundaries manually at the structural level and applies the recurrence concept rigorously [82,83]. SCOP domains that are closely related at the sequence level (generally expressing >30% pairwise amino acid residue identities) are pooled into fold families (FFs), FFs sharing functional and structural features suggestive of a common evolutionary origin are unified further into fold superfamilies (FSFs), and FSFs that share similarly arranged and topologically connected secondary structures are grouped further into protein folds (Fs). 
Fs are then grouped into protein classes according to organization of secondary structure in the fold, defining the major α/β, α+β, all-α, all-β, small and multidomain groups. This architectural hierarchy (Figure 3A) somewhat mimics the relative numbers of sequences and structures that have been discovered (Figure 3B). Unlike SCOP, the CATH (Class, Architecture, Topology, and Homologous superfamily) classification of proteins uses expert systems that automate most steps and classify domains that may or may not have been observed in other structural contexts [84,85]. CATH adds an additional hierarchical level (‘architecture’) over the fold classification (‘topology’) that describes the 3D arrangement of secondary structure but not its connectivity. A final example of structural classification is the DALI Dictionary, a fully automated non-hierarchical structural alignment system that uses domain recurrence to identify domains and provide lists of structural neighbours [86]. Interestingly, a comparative analysis of the SCOP, CATH and DALI taxonomies revealed remarkable agreement of protein assignments at fold (75%) and superfamily (80%) levels, with discrepancies attributed to different thresholds or manual curation [87]. In recent years, the different domain classifications have been consolidated by cross-listing and integration. For example, the InterPro consortium integrates protein classifications (including Pfam, SCOP and CATH) and maps protein families, domains, repeats and identifiable features of known proteins on to sequences in TrEMBL and Swiss-Prot [88].
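
The nested SCOP hierarchy (class, F, FSF, FF) described above lends itself to a small data structure. The Python sketch below is a hypothetical helper, not part of SCOP or any library. It assumes dot-separated labels made of a class letter followed by fold, superfamily and family numbers, and the class lettering and the example labels (such as 'c.37.1' for the P-loop hydrolase FSF) should be read as assumptions to check against the database release in use.

```python
from dataclasses import dataclass
from typing import Optional

# Class letters for the groups named in the text (assumed SCOP lettering).
SCOP_CLASSES = {
    "a": "all-α", "b": "all-β", "c": "α/β",
    "d": "α+β", "e": "multidomain", "g": "small",
}


@dataclass(frozen=True)
class ScopLabel:
    """Position in the class > F > FSF > FF hierarchy (hypothetical helper)."""
    protein_class: str
    fold: int
    superfamily: Optional[int] = None
    family: Optional[int] = None

    @classmethod
    def parse(cls, label: str) -> "ScopLabel":
        """Parse labels such as 'c.37', 'c.37.1' or 'c.37.1.8'."""
        parts = label.split(".")
        if not 2 <= len(parts) <= 4 or parts[0] not in SCOP_CLASSES:
            raise ValueError(f"unrecognized label: {label!r}")
        return cls(parts[0], *(int(p) for p in parts[1:]))

    @property
    def level(self) -> str:
        """Deepest level named by the label: 'F', 'FSF' or 'FF'."""
        if self.family is not None:
            return "FF"
        return "FSF" if self.superfamily is not None else "F"

    def parent(self) -> Optional["ScopLabel"]:
        """Next level up: FF -> FSF -> F; a fold has no parent label here."""
        if self.family is not None:
            return ScopLabel(self.protein_class, self.fold, self.superfamily)
        if self.superfamily is not None:
            return ScopLabel(self.protein_class, self.fold)
        return None


def same_superfamily(a: ScopLabel, b: ScopLabel) -> bool:
    """True when both labels fall in the same FSF (and hence the same F)."""
    return (a.superfamily is not None and
            (a.protein_class, a.fold, a.superfamily) ==
            (b.protein_class, b.fold, b.superfamily))


p_loop = ScopLabel.parse("c.37.1")
print(p_loop.level, SCOP_CLASSES[p_loop.protein_class], p_loop.parent())
```

Because each level is a prefix of the one below it, grouping domains at F, FSF or FF level for a census reduces to truncating the label.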

Figure 3 Progress in the experimental discovery of sequences and structures

(A) Protein architectures can be defined at different levels of protein hierarchy, using, for example, taxonomical classifications such as SCOP [82,83], with categories described with alphanumeric labels and identifiers. Currently, sets of approx. 1000 Fs, 1800 FSFs and 3500 FFs describe the world of proteins. (B) The continuous increase of the available numbers of sequences from the highly curated UniProtKB, protein structures from the PDB, and F, FSF and FF architectures in SCOP. The numbers of completely sequenced genomes that have been published (indexed in the Genomes Online Database [198]) have increased continuously from 1997 to 2007. Only the latest data were used if some databases had more than one release available in one year. Note that UniProtKB entries represent only a fraction of the ∼6 million sequences in UniProtKB/TrEMBL. (C) Proteins have physical structures that were designed and constructed by Nature (architecture) defined by the folding of the sequence at F, FSF or other levels of the structural hierarchy (domain structure) and by how domains combine with others (domain organization).

Although taxonomies provide the framework needed to understand protein diversity, the definition of structure and associated functions remains challenging [89]. Protein architecture, the ‘fundamental build’ [the αρχι- (archi-) τέκτων (tekton)] of a protein, is modular (Figure 3C). Domains with different 3D structures (domain structures) combine with others in complex arrangements (defined here as domain organization). Domain structures associate with functions that are sometimes carried into the multidomain arrangement to increase enzyme specificity, provide links between other domains or regulate functional activity [90]. However, domain boundaries are difficult to establish and common topological elements that make up the folding core sometimes account for less than half of domain sequence [91]. Moreover, some CATH folds exhibit 3-fold variation in the number of secondary structures, and certain superfamilies show that secondary-structure embellishments often associate with change of function [91]. Peripheral regions of secondary structure can differ in size and conformation, ‘decorating’ the central folds of domains distinctively. Similarly, accretion of substructures around the core can result in functional diversity, as illustrated with the biochemistries that are linked to enzymes harbouring the thioredoxin-like fold [92]. To complicate matters, a measurement of how often fold substructures are shared by fold architectures (e.g. ‘gregariousness’) suggests some fold categories should be regarded as ‘neighbourhoods’ defined by how much structural overlap exists between them [93,94]. Some regions of the protein fold space therefore represent a continuum for certain architectural arrangements (sometimes linked by supersecondary motifs), whereas, in other regions, clearly distinct non-overlapping (discrete) topologies are observed. These regions can be best represented as a continuous and multidimensional environment [95]. Interestingly, detecting similarities between ligand-binding sites with a new structure–function comparison method tested the notion of a continuous fold space and revealed new evolutionary relationships across an existing discrete representation [96].
Finally, proteins can adopt multiple structures and functions, exhibiting conformational diversity and functional promiscuity. They can display ligand-independent conformational diversity, use structures to ‘moonlight’ different functions without involving their active sites or become promiscuous [45]. Chameleon sequences can adopt a distinct folded conformation under native conditions, and large-scale fold variations can alter topology in proteins [97]. This is complicated by the fluid nature of genome structure, which facilitates the rearrangement of domains [4,98]. These rearrangements are responsible for domains being both functional and structural subgenic modules that are highly plastic.

Remarkably, protein structures are unevenly distributed in the world of proteins [99]. Genome surveys have shown that families and folds in genomes follow power-law distributions and exhibit scale-free properties [100–102]. This behaviour results in a few folds that are highly popular (‘superfolds’ with many families; e.g. TIM barrel folds are widely distributed in metabolism) and many that appear infrequently (‘mesofolds’ and ‘unifolds’) [103]. It also implies a preference for duplication of genes encoding folds that are already common, as summarized in models that account for duplication, acquisition and loss of genes [102] or describe birth–death–innovation processes [104–106]. Interestingly, fold frequency plots for the microbial superkingdoms Archaea and Bacteria have steeper decay slopes than those for Eukarya, showing there is a larger level of architectural redundancy in the proteomes of complex organisms [99,107]. However, folds shared by all superkingdoms and folds shared by Eukarya and Bacteria (generally the most ancient; see below) fitted Gaussian-like distributions characteristic of random graphs, suggesting the spread of folds across superkingdoms is complex [107]. In order to explain the uneven proteomic distribution of structures, a number of phenomenological and physics-based models have been proposed that focus on functional constraints, convergence of sequences into structures (‘designability’), or evolutionary dynamic considerations, some of which invoke evolutionary processes of convergence or divergence and the paradigms described in Figure 2. They have been reviewed recently [108] and will not be discussed here. We note, however, that statistical mechanics approaches to the evolution of simple lattice model proteins provide interesting insights into the workings of real proteins. Most notable is a recent microscopic ab initio model that considers not only the fate of genes, but also the survival of organisms [109]. 
The model is based on the central assumption that the death rate of an organism is determined by the stability of the least stable of its lattice model proteins. Simulations reveal exponential population growth once favourable sequence–structure combinations are discovered and collapse of these precursors into selected fold architectures, which remain stable and abundant at timescales greater than organismal lifetime. The rise of protein families and superfamilies and power-law distributions that match distributions for real proteins arise as emergent properties of the physical model, which suggests new folds result from dominant folds by satisfying energetically favoured native conformations. This is provocative and clearly in line with emerging views of protein folding and evolution.
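The preference for duplicating genes that encode already common folds can be illustrated with a toy simulation. The sketch below is not any of the published models [102,104–106]: it assumes a bare duplication-plus-innovation process with no gene loss, and the function name, the parameter `p_new` and the step count are hypothetical choices made only for illustration.

```python
import random
from collections import Counter

def simulate_fold_families(steps, p_new, seed=0):
    """Grow a toy genome by gene duplication and fold innovation.

    At each step, with probability p_new a gene with a brand-new fold is
    added (innovation); otherwise an existing gene is picked uniformly at
    random and duplicated, so folds that are already common are copied
    more often. Gene loss is deliberately ignored (an assumption).
    Returns the number of genes per fold, largest first.
    """
    rng = random.Random(seed)
    genes = [0]        # each entry is the fold label of one gene
    next_fold = 1
    for _ in range(steps):
        if rng.random() < p_new:
            genes.append(next_fold)          # innovation: new fold
            next_fold += 1
        else:
            genes.append(rng.choice(genes))  # duplication of an existing gene
    return sorted(Counter(genes).values(), reverse=True)

sizes = simulate_fold_families(steps=5000, p_new=0.05, seed=1)
print("number of folds:", len(sizes))
print("largest folds:", sizes[:5])
print("folds with a single gene:", sum(1 for s in sizes if s == 1))
```

Plotting the resulting fold sizes against their rank on log-log axes is one simple way to inspect whether a few 'superfolds' and many rare folds emerge under these assumptions.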


Almost 150 years ago, in his seminal book, Charles Darwin established evolution by common descent as the dominant scientific explanation of biological diversity and change [110]. The divergence of species was illustrated with branching histories of inheritance (phylogenies) that allowed inference of ancestral links and tested evolutionary hypotheses. Phylogenetic thinking remains fundamental in evolutionary bioinformatics today and diversity and change are still illustrated with phylogenetic trees, graphical and mathematical representations (with branches and reticulations) that portray how contemporary is common ancestry. These trees have been particularly useful in the comparative analysis of nucleic acid and protein sequences and have had an impact on each and every discipline of biology, including genome science and informatics. They seed a holistic future [111].

The evolutionary classification of protein domains has been based on sequence and structural homologies that make use of phylogenetic tools and advanced bioinformatic methods [79]. Protein families group together sequences that share a common ancestry, but generally do so with a low hierarchical granularity; the reliability of comparative methods breaks down when reaching the so-called ‘twilight zone’ of <30% sequence identity. However, change in protein structure is linked directly to change in biological function. This has been recognized by structural genomic initiatives that seek to characterize exhaustively the major building blocks of proteins, and both structure and function have aided phylogenetic analyses when sequences fail to unite distant family relatives. Evolutionary relationships have been inferred directly from the structure of protein molecules, generally using formal methods of phylogenetic reconstruction [112–117]. These methods have been limited to analysis of closely related architectures with backbones and secondary structures that can be more or less superimposed. However, global views of the protein world that establish evolutionary relationships at superfamily or fold level require more involved and systematic approaches of classification. One strategy is to compare all proteins with each other and plot relationships on existing protein fold space, with structural similarities visualized at low dimensional level [118]. For example, Gauss integrals that describe protein backbones as space curves were used to construct a 30-dimensional vector that was then projected on a plane, producing 2D (two-dimensional) maps with fold distributions matching CATH classification [119]. These maps divide structures belonging to α, β and αβ classes of CATH into distinct groups. 
Similarly, matrices of DALI alignment scores in pairwise backbone comparisons of SCOP folds produced 3D representations that clustered folds belonging to the α/β, α+β, all-α and all-β protein classes [120] and allowed construction of a structural map [121]. Note, however, that a simple plot of overall length of helical segments against strand segments was able to dissect these classes without resorting to complicated algorithms (Figure 4A; see also Supplementary Figure S1, Supplementary Table S1 and the animated version of Figure 4A). Typically, α/β folds have interspersed helix and strand secondary structures, α+β folds segregate these elements within the molecule, and all-α and all-β proteins are mostly composed of helical or strand elements respectively. These simple plots reveal that helical segments were generally longer in α/β folds than in their α+β counterparts and shorter in all-β proteins, with strand segments being shorter in all-α proteins, the implication of which will be discussed below. Unfortunately, global views place structures in a continuum space and obscure fundamental architectural differences and heterogeneities that discrete views can capture. Other strategies that lack these shortcomings are therefore useful, including the generation of fold family trees based on rules of structural transformation [122,123], taxonomies based on similarity of secondary structural arrangements [124] and a PDUG (protein domain universe graph) representation of domains based on scores of structural similarity [125,126]. Some of these have captured salient natural features. For example, trees of secondary structures are in agreement with aspects of protein classification and suggest a simple mechanism of evolution that is in accord with a theory of folding based on the energetics of backbone hydrogen bonds [127]. 
The PDUG network representation of fold space is a graph that connects nodes (domains) with edges (structural similarities) in threshold-delimited clusters, and, similarly, captures the scale-free network topology that is typical of the protein world. However, problems associated with the systematic classification of structure at a topological level make it difficult, if not impossible, to find a general metric of pairwise comparison that could be used for global analysis and would portray all complexities of structural organization [128]. One solution to this drawback is a ‘periodic table’-like construct that merges the use of rules with the comparative framework [129]. In this approach, proteins are compared and assigned to idealized fold representations, which describe molecules as layered systems of helical and sheet structures (with curl and stagger). The approach shifts the problem to finding appropriate definitions for the idealized constructs and understanding their evolutionary meaning through models of structural transformation.
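The idea of a graph whose nodes are domains and whose edges are structural similarities retained above a threshold can be sketched in a few lines. The example below is only a minimal illustration of threshold-delimited clustering, not the published PDUG procedure [125,126]: the domain names, the similarity scores and the threshold values are all invented, and connected components are found with a simple union-find.

```python
from itertools import combinations

def threshold_clusters(scores, threshold):
    """Link domains whose similarity score is >= threshold.

    scores maps frozenset({a, b}) of two domain names to a similarity
    score (hypothetical values). Returns (degree, clusters), where degree
    counts the edges of each domain and clusters lists the connected
    components of the resulting graph.
    """
    domains = sorted({d for pair in scores for d in pair})
    parent = {d: d for d in domains}

    def find(d):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    degree = {d: 0 for d in domains}
    for a, b in combinations(domains, 2):
        if scores.get(frozenset((a, b)), 0.0) >= threshold:
            degree[a] += 1
            degree[b] += 1
            parent[find(a)] = find(b)   # merge the two clusters

    groups = {}
    for d in domains:
        groups.setdefault(find(d), []).append(d)
    return degree, sorted(groups.values())

# Invented pairwise scores for five hypothetical domains A-E.
toy_scores = {
    frozenset(("A", "B")): 9.5,
    frozenset(("A", "C")): 8.1,
    frozenset(("B", "C")): 3.0,
    frozenset(("C", "D")): 2.4,
    frozenset(("D", "E")): 7.2,
}
degree, clusters = threshold_clusters(toy_scores, threshold=7.0)
print(clusters)   # clusters at a stringent threshold
print(degree)     # number of neighbours per domain
```

Lowering the threshold merges clusters, which is why the choice of threshold shapes the apparent organization of fold space; the degree counts are the quantity whose distribution would be examined for scale-free behaviour.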

Figure 4 Phylogenomic analysis and evolution of major structural classes of globular proteins

(A) Grouping of proteins in the all-α, all-β, α/β and α+β classes according to features of secondary structure. The average total length of segments of secondary structure in a peptide chain was calculated using DSSP [199] secondary structure assignments in proteins (61175 peptide chains) from all PDB entries in SCOP version 1.69. These features were calculated from chains belonging to the same SCOP fold for all folds. Plots compared each feature of secondary structure with each other. The Figure shows only comparison of average total length of α-helical and β-strand segments. Averages are described in Supplementary Table S1. An animated version of this Figure is available online. (B) Universal phylogenomic tree of architectures reconstructed from a genomic census of protein domain structure and organization. A tree of architectures describing the evolution of domains and domain combinations at F level was reconstructed from a protein census in 266 genomes [200]. The census involved identifying domains using advanced HMMs of structural recognition and SCOP as reference. The three evolutionary epochs of the protein world are overlaid on the tree and are labelled with different shades (architectural diversification, light green; superkingdom specification, salmon; organismal diversification, yellow) and follow previous definitions [149]. Terminal leaves are not labelled since they would not be legible. Branches in red delimit the birth of architectures after the appearance of the first architecture unique to a superkingdom (broken line). The Venn diagrams show occurrence of architectures in the three superkingdoms of life. Pie charts show superkingdom distribution of architectures belonging to the four major categories of domain organization. The onset of the big bang of domain combinations is indicated in the tree. 
(C) Cumulative frequency distribution plots describing the appearance of all-α, all-β, α/β and α+β protein classes with only one domain as well as all the domain combinations with two domains or more than two domains along the branches of the tree described in (B). The cumulative number was given as a function of distance in nodes from the hypothetical ancestor (nd). The inset shows details of the accumulation of ancient domains and domain combinations. Information on trees of proteomes and architectures, data matrices and tracing exercises can be found in the MANET (Molecular Ancestry Networks) database [193].


One fundamental limitation of most global approaches that try to unify fold space is that they do not embrace the ‘shared and derived’ tenet of evolutionary analysis. They are not truly phylogenetic. At present, there are no reliable procedures that can generate phylogenetic relationships at higher hierarchical levels of protein classification directly from the structure of proteins. Methods cannot yet reconstruct history because knowledge of how the ‘origami’ of protein folding evolves is still lacking. One solution to the conundrum of structure is to shift the focus of study from molecules to proteomes, the repertoire of all proteins of an organism. After all, proteins are encoded in the genomes of the many organisms that populate Earth. The rationale is simple. Proteins with structures that are fit will thrive in evolution. They will propagate in lineages through vertical descent, recruitment and convergent evolutionary processes, and their architectural designs will be used repeatedly in different contexts. Their history should be left imprinted in the actual fold constitution of the proteome, and a simple structural census of this historical repository should unlock the ‘tempo and mode’ of structural discovery. Here, we summarize the exciting findings that this novel approach has revealed.

Structures corresponding to validated crystallographic 3D models, catalogued, for example, in SCOP and CATH, have been assigned effectively to sequences present in proteomes using knowledge from domain classification and sequence and structure comparison tools such as profile-based sequence PSI-BLAST alignments, linear HMMs of structural recognition, and threading techniques [79]. Fold architectures were initially surveyed in a number of genomes [130–134] and this genomic demographic census was then indexed in several popular databases (e.g. PEDANT [135], SUPERFAMILY [136,137], and Gene3D [138,139]). The census is restricted to proteins for which a known structure can be inferred (currently, ∼60% of the proteome), but it is powerful. It allows, for example, identification of SCOP FSF architectures corresponding to individual domains in enzymes of metabolism [140,141] and exploration of arrangements of domains in biological units [142,143]. Studies reveal patterns in both domain structure and domain organization and suggest, for example, pervasive recruitment of structures and functions in biological networks and an extended combinatorial interplay of domain modules in proteomic repertoires. The census also provides indications of how organisms in different superkingdoms make use of architectures, revealing that fold abundance and distribution of folds among genomes are unlinked [144]. Curiously, abundant protein domains occurred in proportion to proteome size in a survey of five eukaryotic genomes, suggesting functional constraints between interacting domains kept domains at specific ratios in evolution [145].

Since protein structure is highly conserved (Figure 1), every instance of discovery or adoption of an F or FSF architecture by a proteome represents a rare event in the history of the organismal lineage, and globally a rare event in the history of the protein world. These ‘molecular fossils’ are therefore excellent features (characters) for phylogenetic analysis. Gerstein [132] recognized this a decade ago and used fold occurrence in genomes and distance-based methods to build trees of proteomes (see Supplementary Figure S2). Since then, a number of trees of life of this kind have been reconstructed from the occurrence and abundance of domain structures in proteomes [107,132,134,146–149] and from surveys of domain organization [150,151], matching patterns obtained from other sources of genomic information [152]. In all cases, the three superkingdoms appeared as distinct groups, confirming the tripartite nature of cellular life heralded by the school of Carl Woese [153]. Phylogenomic trees showed patterns that were in agreement with traditional classification, and also tested contentious hypotheses. For example, they backed the controversial grouping of chordates with arthropods (the Coelomata hypothesis), an observation supported by whole-genome trees (e.g. [154]) and recently confirmed by an analysis of the complete collection of phylogenies of gene sequences in the human genome (phylome) [155]. Moreover, some of these phylogenetic methods identified a root for the universal tree [107,149,150] (see Supplementary Figure S2) and suggested that diversified life originated in a proto-eukaryotic organism, a proposal for which there is now an emerging consensus [156] and which is also supported by phylogenetic analysis of the structure of rRNA [157]. It is noteworthy that parsimony considerations based simply on the survey of protein repertoires suggest the ancestor to the three superkingdoms was endowed with a virtual proteome akin to Eukarya [61]. 
A simple Venn diagram shows that two or three superkingdoms share the majority of F or FSF architectures and supports this view (see Supplementary Figure S3). Most importantly, the fact that phylogenomic trees were able to reconstruct the evolution of life satisfactorily supported the existence of strong phylogenetic signal in the occurrence, abundance and organization of domains in proteomes. Trees of proteomes, however, could not reveal patterns of diversification of protein architecture directly, unless characters were traced along branches of the trees. For example, when domain sequences and architectures from 62 genomes were traced along a universal consensus phylogeny derived for whole-genome trees, convergent evolutionary processes that could not be explained by architectural loss were found to be rare events (∼2%) [158]. A recent study of Pfam domains in 96 genomes confirmed this important observation, although the number of convergent events in protein structural evolution was found to be larger (∼12%) [159]. This suggests that protein structures at high levels of organization diversify mostly by vertical descent, empowering the phylogenetic reconstruction exercise. Tracing domain occurrence patterns in trees of proteomes derived from fold occurrence and abundance [160] or universal trees reconstructed from the small subunit of rRNA [161] also allowed estimation of the relative age of individual folds and the antiquity of protein classes. The latter study assumes, however, that the history of a single (albeit central) RNA molecule and of proteomes is concordant, that there is a single origin of organismal superkingdoms, and that the bacterial outgroup chosen to root the universal tree is appropriate. As we will see below, all of these assumptions can be contentious.
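How a structural census feeds both a Venn diagram of superkingdom sharing and a distance-based tree of proteomes can be made concrete with a small sketch. Everything in it is a stated assumption: the proteome names and architecture labels are invented, and the Jaccard distance is used only as one plausible occurrence-based distance, not as the measure employed in the cited studies.

```python
def venn_counts(occurrence, superkingdom):
    """Count architectures by the set of superkingdoms in which they occur.

    occurrence maps a proteome name to the set of architectures found in it;
    superkingdom maps a proteome name to 'A', 'B' or 'E'.
    Returns a dict such as {'ABE': 2, 'BE': 2, ...}.
    """
    where = {}
    for proteome, architectures in occurrence.items():
        for arch in architectures:
            where.setdefault(arch, set()).add(superkingdom[proteome])
    counts = {}
    for kingdoms in where.values():
        key = "".join(sorted(kingdoms))
        counts[key] = counts.get(key, 0) + 1
    return counts

def jaccard_distance(a, b):
    """Distance between two proteomes: 1 - |shared| / |total| architectures."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Invented census: four hypothetical proteomes, six hypothetical architectures.
census = {
    "archaeon_1":  {"f1", "f2", "f3"},
    "bacterium_1": {"f1", "f2", "f4"},
    "bacterium_2": {"f1", "f2", "f4", "f5"},
    "eukaryote_1": {"f1", "f2", "f4", "f5", "f6"},
}
kingdom_of = {"archaeon_1": "A", "bacterium_1": "B",
              "bacterium_2": "B", "eukaryote_1": "E"}

print(venn_counts(census, kingdom_of))
names = sorted(census)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        print(p, q, round(jaccard_distance(census[p], census[q]), 2))
```

The pairwise distances printed at the end form the kind of matrix that a distance-based method (for example neighbour joining) would take as input to build a tree of proteomes.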

In search of a direct approach and using a strategy that polarizes characters and builds rooted phylogenetic trees [157], we introduced a new phylogenetic method that generates timelines of architectural discovery and a global phylogenetic view of the protein world [107]. Data matrices that were used to build trees of proteomes were transposed, normalized and used to reconstruct trees of architectures that were intrinsically rooted [107,149,150,162,163]. Evolution's arrow was established directly by the evolutionary model, the rationale and assumptions of which have been reviewed recently within a framework of evolution of repertoires of components in living systems [164] and will not be revisited here. Supplementary Figure S3 shows the first tree of F architectures that was reported and examples of trees of F and FSF architectures reconstructed more recently using updated releases of SCOP and information in numerous proteomes. The leaves of the trees (taxa) are, in this case, domains (see Supplementary Figure S3) or domains and domain combinations (Figure 4B) visualized at F or FSF levels of classification. The rooted trees establish by definition evolutionary timelines of architectural discovery, with time measured by a relative distance in nodes from a hypothetical ancestor at their bases (node distance, nd). A timeline showing the rise of protein classes in evolution is described in Figure 4(C). These timelines reveal remarkable historical patterns in the structure of proteins and proteomes, and, as we describe below, define an origin for the modern protein world and illustrate how biological functions were discovered in time. We caution, however, that statements relate only to modern biochemistry, as modern molecules were used to reconstruct the past. Any claims of origin and evolution relate necessarily to the design and complexities of extant molecules, and not to those of predecessors that were perhaps lost in the evolutionary process.
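The two operations described above, transposing and normalizing the census matrix so that architectures become taxa, and reading a relative age (nd) off a rooted tree, can be sketched as follows. This is an illustration under explicit assumptions rather than the published protocol: the log-rescaling to integer character states, the parameter `n_states`, the nested-tuple tree encoding and the rescaling of nd to the 0–1 range are choices made here for clarity, and all counts and labels are invented.

```python
import math

def transpose_and_normalize(abundance, n_states=20):
    """Turn a proteome-by-architecture table into architecture-by-proteome states.

    abundance maps proteome -> {architecture: count}. Each count is
    log-scaled against the largest count in its proteome and rounded to an
    integer state in 0..n_states (an assumed normalization).
    """
    architectures = sorted({a for row in abundance.values() for a in row})
    matrix = {a: {} for a in architectures}
    for proteome, row in abundance.items():
        g_max = max(row.values(), default=0)
        for a in architectures:
            g = row.get(a, 0)
            if g_max == 0:
                state = 0
            else:
                state = round(math.log(g + 1) / math.log(g_max + 1) * n_states)
            matrix[a][proteome] = state
    return matrix

def node_distances(tree):
    """Relative node distance (nd) of every leaf of a rooted tree.

    The tree is a nested tuple; leaves are strings. The number of internal
    nodes crossed from the root to each leaf is rescaled so that the most
    basal leaf has nd = 0 and the most derived leaf has nd = 1.
    """
    depths = {}

    def walk(node, depth):
        if isinstance(node, tuple):
            for child in node:
                walk(child, depth + 1)
        else:
            depths[node] = depth

    walk(tree, 0)
    lo, hi = min(depths.values()), max(depths.values())
    span = (hi - lo) or 1
    return {leaf: (d - lo) / span for leaf, d in depths.items()}

# Invented abundance counts for three hypothetical architectures.
toy_abundance = {
    "proteome_1": {"arch_1": 99, "arch_2": 9},
    "proteome_2": {"arch_1": 50, "arch_3": 50},
}
states = transpose_and_normalize(toy_abundance)
print(states["arch_1"])

# Invented rooted tree: arch_1 branches off first, arch_4 and arch_5 last.
toy_tree = ("arch_1", ("arch_2", ("arch_3", ("arch_4", "arch_5"))))
nd = node_distances(toy_tree)
print(sorted(nd.items(), key=lambda item: item[1]))
```

Sorting architectures by nd and counting how many have appeared at or below each value gives a cumulative timeline of the kind shown in Figure 4(C).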


The most notable feature of every tree of architectures that has been generated so far is that F or FSF domains widely distributed in Nature appear at their base and are consequently ancient. They are only found to be lacking in parasitic organisms with highly reduced genomes (e.g. Nanoarchaeum, Mycoplasma and Encephalitozoon), organisms known to have discarded enzymatic and cellular machinery in exchange for resources from their hosts [149]. The first nine F architectures to emerge in evolution are nevertheless common to every genome analysed and include architectures widespread in metabolism [165]; the evolution of the five most basal and their structures are illustrated in Figure 5. One likely interpretation of early evolution of ancient architectures using the neutral net paradigm described above (Figure 2C) is given in Figure 5(B). As protein sequences harbouring the primordial fold drift by mutation in sequence space, new neutral nets are discovered that fold sequences into new fold structures, while variants within ancient and new folds continue to be discovered in the original neutral nets in an ongoing exploration of more stable and fit variants. The comparison of trees of F and FSF architectures supports this view, revealing a collection of proteins undergoing divergent, but concomitant, evolutionary processes that translate into patterns of recent (close relationship) or ancient origin (distant relationship) [163]. This is a consequence of the hierarchical nature of protein structure and the limited exploration of sequence and structure space. We expect, as corollary, a correlation between abundance and age of individual architectures and time-lapsed discovery of fold variants. 
Indeed, the distribution of branch lengths (longer towards the base) and the unbalanced shape of phylogenomic trees (Figure 4B and see Supplementary Figure S3) strongly suggest that architectural discovery involved semipunctuated evolutionary processes similar to those recently suggested for substitutional change in nucleic acids [166].

Figure 5 Evolution of the five most ancient folds

(A) Phylogenetic relationships at the base of a phylogenomic tree of domain structures at the F level of structural organization [165] together with examples of the different domain architectures. All ancient architectures share a common design of α-helices and β-strands that form barrel or highly symmetrical structures. The structural models illustrate the 3D arrangement of helices (cyan) and strands (mauve) separated by turns and coils (brown). Structures included are: c.37, P-loop NTP hydrolase fold of adenylate kinase from Methanococcus thermolithotrophicus (PDB code 1KI9), depicting a putative enzymatic origin of metabolism; a.4, DNA/RNA-binding three-helical bundle from the Trp repressor mutant V58I protein (PDB code 1JHG); c.1, TIM β/α barrel fold of inosine 5′-monophosphate dehydrogenase from Borrelia burgdorferi (PDB code 1EEP); c.2, NAD(P)-binding Rossmann fold of glyceraldehyde-3-phosphate dehydrogenase from Escherichia coli (PDB code 1GAD); d.58, ferredoxin-like fold of 7-Fe ferredoxin from Azotobacter vinelandii (PDB code 7FD1). (B) One of many possible interpretations of early evolution of the five most ancient architectures using the neutral net paradigm (Figure 2C). Circles of different colours illustrate neutral nets corresponding to each fold and embedding mutational walks in sequence space responsible for extant structural diversity at F level of hierarchical organization. F neutral nets should map FSF neutral nets and two nets corresponding to two lower levels of hierarchy of protein structure.

When the representation of architectures in organisms in Archaea, Bacteria and Eukarya was traced along the evolutionary timeline, patterns of origin and evolution of our contemporary tripartite world were clearly revealed ([149] and M. Wang, unpublished work). Ancient architectures were multifunctional and were shared by many organisms (free-living or parasitic) in the three superkingdoms [107,149,162,163]. These common architectures defined the so-called ‘architectural diversification’ epoch in protein evolution in which members of an ancestral community of organisms diversified their protein repertoires through differential loss (light-green-shaded area overlapping the tree of domain structure and organization described in Figure 4B). Remarkably, architectural loss occurred preferentially in organisms belonging to the lineages of Archaea, establishing the first organismal divide [149]. This reductive evolutionary strategy was protracted and perhaps induced by adaptation to the extreme physical conditions of early Earth. The early rise of Archaea matches recent evolutionary studies of the structure of tRNA [167] and universal trees of proteomes reconstructed from architectures identified to be ancient in the tree of architectures [149]. These trees of proteomes showed a rooting that was internal (paraphyletic) to the Archaea and was located between the Crenarchaeota and the Euryarchaeota, close to methanogenic archaeal species. This paraphyletic rooting is consistent with a mutational comparative analysis of tRNA paralogues that identified molecular species in the Archaea as slow-evolving and ancient [168] and the existence of ancestral genome characters such as split genes and operon organization [169]. It also has an impact on the interpretation of protein evolutionary tracing exercises that consider superkingdoms as evolutionarily unified groups, as these will identify not a single origin for proteins, but many [161]. 
A proposed multiple convergent (polyphyletic) origin of genes occurring after lineage diversification involving the modular reorganization of sequence [170] would, in fact, have the same effect. Nevertheless, these and many other lines of evidence suggest that Archaea is the most ancient lineage of the modern living world, an emerging view that is gaining consensus [156].

It is important to note that reductive tendencies in Archaea started at a time when superkingdom-specific architectures and present-day organismal lineages had not developed and life was probably communal [149]. In fact, a substantial portion of the protein world developed during this time and resulted in complex proteomes that were rich in functions and architectures (Figure 6). These results are, for example, in line with profiles from phylogenetic tracing of enzymes linked to bioenergetic processes [171], an architectural census [172] and recent ancestral state reconstruction of the gene content of the universal ancestor [173] that revealed a bioenergetically and functionally complex genome with a gene complement similar in number to that of extant free-living microbes (reviewed in [156]).

Figure 6 The architectural and functional complement of the communal ancestor

The complement defines 78 FSFs that appeared before the first architecture that was completely lost in a superkingdom (Archaea) in the tree of FSF architectures (see Supplementary Figure S3B). FSFs were grouped according to coarse-grained functional SUPERFAMILY [201] categories and subcategories (peripheral pie) and according to major classes of globular proteins in SCOP (central pie).

Following the architectural diversification epoch, superkingdom-specific and lineage-specific architectures appeared in evolution as the world of organisms expanded [149]. In this new ‘superkingdom specification’ epoch, new reductive tendencies were expressed in Bacteria, and the superkingdoms were specified in what we believe was a protracted process (salmon-shaded areas in Figure 4B). Moreover, architectural representation decreased considerably with time until it approached zero, a point at which a large number of new architectures were clustered, each specific to a small number of organisms. Later on, an opposite trend took place, in which architectures that were more specialized and were specific to relatively small sets of organisms increased their representation in proteomes explosively. This architectural ‘big bang’ (paraphrasing that of the universe) involved the multiple combination and rearrangement of domains (Figures 4B and 4C) and the distribution of resulting multidomain proteins among emergent organismal lineages. We will not discuss the evolutionary patterns and processes that underlie these events since they have been discussed recently [98]. They involve, however, preferential additions and deletions of terminal domains and fusion and fission processes that engage (with bias) different domain modules in a combinatorial interplay. The rise of modularity in the protein world defines the ongoing ‘organismal diversification’ epoch (light yellow areas). During this last period, architectural novelties linked to multicellularity appeared massively and quite late both immediately after microbe diversification events (mostly folds common to organismal domains) and during eukaryotic diversification (mostly Eukarya-specific) [149,162]. This included multidomain architectures known to be associated with programmed cell death, adhesion and recognition of cells [162].

Proteome distribution patterns along the timeline have had an impact on the constitution of present-day genomes, with the architectural repertoire being the largest and most diverse in Eukarya, and the smallest and most homogeneous in Archaea, with Bacteria taking an intermediate position (see the pie charts of Figure 4B). Remarkably, the diverse repertoire of the Bacteria superkingdom was by necessity compartmentalized into the small proteomes of individual organisms (L. S. Yafremava and G. Caetano-Anollés, unpublished work).


There were many remarkable patterns linked to structure in the trees and resulting evolutionary timelines. The most ancestral folds harboured barrels [e.g. the TIM β/α-barrel fold (c.1)] or interleaved β-sheets and α-helical architectures that packed helices to one face [e.g. the ferredoxin-like fold (d.58)] or two faces [e.g. the P-loop-containing NTP hydrolases (c.37) and the NAD(P)-binding Rossmann fold (c.2)] of the central β-sheet arrangement [107,163]. These and other ancient architectures were multifunctional and interacted with organic cofactors [174], especially nucleotide-containing ligands such as ATP, ADP, GDP, NAD and FAD, all of which appear to have originated early in evolution according to a power-law distribution of ligand–protein mapping [175]. Architectures appearing later in the timelines were functionally more specialized and simple, with structures that were increasingly smaller and more compact (e.g. increases in the tilt of strands or the frequency of open barrel structures in the popular β-barrels; [107]). At the same time, structures became more refined, as illustrated with barrel structures harbouring increasingly more complex strand topologies. Many important structural designs were derived in the tree (including polyhedral folds in the all-α class and β-sandwiches, β-propellers and β-prisms in the all-β class) and protein transformation pathways describing likely scenarios of structural evolution [176,177] and other patterns could be traced in the trees [107].

Interestingly, all classes of globular protein architecture appeared very early in evolution and in defined order, the α/β class being the first, followed by the α+β, the all-α and the all-β classes, and by small and multidomain proteins [107,163]. Patterns of origin and accumulation of protein classes were consistently revealed in all trees analysed, including those derived from a tree of domains and domain combinations (Figure 4). A similar conclusion was reached when tracing fold occurrence along branches of proteome [160] and rRNA trees [161], and when studying the evolution of aminoacyl-tRNA synthetases [178]. We proposed architectural designs with interspersed α-helical and β-sheet elements were segregated in the course of evolution, first within their structure (α+β class) and then confined to separate molecules (all-α and all-β classes) [107]. This is in accord with the random origin hypothesis of proteins [179]. Several interesting features distinguish the ancient α/β protein class from the rest. For example, topological accessibility measurements describing how easy it is to fold a structure from any point in the polypeptide chain showed a marked asymmetry toward the N-terminus in α/β folds, a property that was mostly confined to selected protein families [180]. Measurements of closeness to the molecular centroid and residue contact distribution also revealed the bias, which was more notable in ancient than in more recent α/β folds [181]. These observations were interpreted as evidence of ancient α/β folds predating chaperone-assisted folding and preserving the bias as a relic [180] or of unmasked co-translational folding in extant proteins [181]. Co-translational folding is the ability of proteins to fold as they exit the ribosome, but the process remains contentious. 
Interestingly, our evolutionary timelines revealed that FSF architectures linked to chaperone and proteostasis systems in the cell appeared early with the ATPase domain of the Hsp90 chaperone (d.122.1) and throughout the timeline (ndFSF=0.06–0.86). However, the dominant families that contributed to the N-terminal bias the most appeared earlier (e.g. c.37 and c.2), supporting the idea that the origin of this asymmetry lies in the past. We found α-helical segments were generally longer in α/β folds (Figure 4A), a trend that was especially notable with the early folds (see the animated version of Figure 4A). This could indicate these helical segments were overrepresented in the ancient interspersed α/β structures. It is well known that single extended β-sheets are quite effective at burying non-polar surfaces when compared with α-helices [182]. Moreover, a surprising in vitro model study of co-translational protein folding suggests an initial tendency to form misfolded sheets in an all-α protein (apomyoglobin), a tendency that decreases with protein length and underscores the importance of co-translationally active chaperones [183]. Perhaps the length-dependent misfolding tendencies in non-native proteins left behind relics in the ancient α/β proteins that had to fold unassisted, which tried to increase the length of helices to balance the secondary structure repertoire. Interestingly, a survey of hundreds of genomes reveals domains are longer in very ancient proteins (M. Wang, unpublished work) and not shorter as claimed in a recent phylogenetic tracing study [161]. We therefore propose that longer helical segments provided an advantage in early protein evolution and were then slowly replaced by strands and a reduction in protein length once chaperone systems were in place. This scenario would explain α-to-β tendencies uncovered in the tree of architectures [107].

Tracing biological function along the timeline revealed patterns of origin of fundamental cellular processes (Figure 7), confirming the very early and explosive onset of metabolism [149] and small-molecule-binding chemistries [175]. It appears that proteins were first associated with organic cofactors, but later involved transition metals as ligands, perhaps mediated by the increasing energy demands of the ancient world. Timelines revealed a relatively early rise of metallomes (with the zinc-metallome appearing first) (C. L. Dupont, G. Caetano-Anollés and P. E. Bourne, unpublished work), and the late appearance of oxygenic photosynthesis, which was preceded and followed by the discovery of functions typical of Eukarya (cell adhesion, receptors and chromatin structure, and functions linked to multicellularity). Some of these results are consistent with a proteomic analysis suggesting that shifts in trace metal geochemistry related to the redox state of ancient oceans are imprinted in protein architecture and that prokaryotes evolved in anoxic marine environments, whereas eukaryotes did so in oxic counterparts [184]. The late evolutionary appearance of oxygenic photosynthesis confirms results from a phylogenomic analysis of metabolic networks [149] and is consistent with molecular and geological records that suggest that oxygen entered our atmosphere after major microbial divergences in the tree of life [185].

Figure 7 Evolution of biological function in the protein world

The evolutionary timeline shows the discovery of protein FSF architectures associated with different coarse-grained functional SUPERFAMILY categories in each superkingdom, with time measured by a relative distance in nodes from a hypothetical ancestral architecture at the base of the tree of architectures (see Supplementary Figure S3B). Pie charts below bins describe the distribution of architectures that are unique or shared between superkingdoms, and their areas are proportional to the total number of architectures in that bin. Arrowheads indicate the first appearance of architectures associated with functional subcategories that are listed. Details of their individual accumulation can be found in Supplementary Figure S4. The three evolutionary epochs and corresponding phases of the protein world are labelled with different shades and follow previous definitions [149].
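
The relative node distance used as the time axis in Figure 7 can be illustrated with a minimal sketch. It assumes the tree of architectures is stored as a child-to-parent map, that distance is counted as the number of nodes between a leaf and the hypothetical ancestor at the base, and that counts are rescaled so the most basal leaf is 0 and the most derived is 1. The function name, the rescaling and the toy tree (leaves A to D) are illustrative assumptions and not the published procedure.

```python
from typing import Dict, Optional


def relative_node_distance(parent: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Return a 0-1 node distance for every leaf of a rooted tree.

    `parent` maps each node to its parent; the root maps to None.
    The rescaling to 0-1 is an assumption made for illustration.
    """
    # Any node that appears as a parent is internal; the rest are leaves.
    internal = {p for p in parent.values() if p is not None}
    leaves = [n for n in parent if n not in internal]

    def nodes_to_root(leaf: str) -> int:
        # Count the internal nodes crossed on the way to the base of the tree.
        steps = 0
        node = parent[leaf]
        while node is not None:
            steps += 1
            node = parent[node]
        return steps

    raw = {leaf: nodes_to_root(leaf) for leaf in leaves}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Degenerate tree: all leaves are equally distant from the base.
        return {leaf: 0.0 for leaf in raw}
    return {leaf: (count - lo) / (hi - lo) for leaf, count in raw.items()}


# Hypothetical comb-like tree: A branches off first, C and D last.
toy_tree = {
    "root": None,
    "A": "root", "n1": "root",
    "B": "n1", "n2": "n1",
    "C": "n2", "D": "n2",
}

if __name__ == "__main__":
    for leaf, nd in sorted(relative_node_distance(toy_tree).items()):
        print(leaf, nd)
```

In this toy tree the earliest-branching leaf receives a value of 0 and the two most derived leaves receive 1, mirroring how architectures near the base of the tree fall at the start of the timeline.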

All functional categories and most subcategories appeared for the first time during the architectural diversification epoch, lending credence to the complex nature of the ‘communal ancestor’ to diversified life [149]. In fact, the functional and structural diversity of its architectural complement (Figure 6) suggests that biological functions were geared fundamentally to metabolic activities, proteostasis and protein degradation, and, as expected, were embodied mostly in α/β protein architectures. Major subcategories pooled transferases, nucleotide metabolism and small-molecule binding enzymes, matching recent metabolic network investigations [165]. Coenzyme, carbohydrate and energy metabolisms also featured prominently. These cells also had architectures involved in an incipient translation apparatus. Nucleic acid processing (DNA replication/repair) was embodied in Nudix (d.113.1) and DNA breaking–rejoining enzyme (d.163.1) FSFs linked to pyrophosphorylase/pyrophosphatase and RNA-decapping activities and integrases and topoisomerases respectively. Only five functional subcategories originated later on in the organismal specification epoch and were clearly related to the cellular make-up of organisms; they involved lipid/membrane binding and structural proteins, proteins associated with cell envelope biogenesis and the outer membrane, viral proteins and proteins related to oxygenic photosynthesis. Only one subcategory had its origin in the organismal diversification epoch (blood clotting). These proteins are therefore important markers in the architectural chronology. Similarly, α-solenoids, β-propellers, coiled coils and other architectures linked to the nuclear pore complex [186], a marker for the nuclear envelope in Eukarya and some bacterial lineages (e.g. Planctomycete and Verrucomicrobia [187]), appeared (together with karyopherins that interact transiently with the complex) very late in evolution (ndFSF=0.82–1.00). 
Nuclear pores therefore represent very modern protein complexes that were horizontally transferred or evolved convergently in Eukarya and Bacteria. Of all the main categories, extracellular processes appeared the latest, close to the boundary of the superkingdom specification epoch. These categories involve immune responses, toxin and defence enzymes, and cell adhesion, functions related to definition of self and intercellular interactions (competition and multicellularity). It is logical that these functions would appear at the end of a communal world of organisms.

The appearance of information-related processes and cellular motility has important consequences for origins of modern biochemistry and diversified life. Translation originated quite early and preceded the DNA repair/replication, transcription, RNA processing and chromatin structure subcategories, which developed in the timelines in that order (Figure 7). The early origin of translation was confirmed by tracing architectures of aminoacyl-tRNA synthetases, elongation factors and ribosomal proteins derived from crystallographic models and HMM searches in the trees (D. Caetano-Anollés, unpublished work). Models of amino acid evolution also supported the antiquity of aminoacyl-tRNA synthetases [178]. The observation that modern protein synthesis developed only after metabolic proteinaceous enzymes were in place suggests strongly that the translation apparatus underwent a fundamental revision during evolution of modern proteins. The inception of cell motility also has important consequences. The microbes of the communal world were probably auxotrophic or heterotrophic organisms seeking to improve their survival in the changing environments of early Earth. Cellular motility provided better tools to seek and ingest food and, in some lineages, to prey on other members of the community. The development of phagotrophy (a hallmark of Eukarya) and mechanisms of cell motility could have ignited the rise of the tripartite world [188]. Indeed, fundamental FSF architectures associated with a number of important molecules linked to cell movement (e.g. tubulin, moesin, profilin and actin) originated at the end of the architectural diversification epoch (e.g. tubulin nucleotide-binding domain-like and C-terminal domain-like, ndFSF=0.31) and continued to accumulate, together with toxin and defence architectures, which could have brought other means of warfare (Figure 7). Whereas many important proteins related to motility developed later in the timeline [e.g. phase 1 flagellin domain (ndFSF=0.58), profilin (ndFSF=0.73), moesin tail domain (ndFSF=0.76), actin-cross-linking and depolymerizing domains (ndFSF=0.85)], others that were multifunctional and ancient were probably recruited for the task (e.g. actin-like ATPase domain architecture, ndFSF=0.04). It is therefore quite likely that the world of organisms underwent a transition from communal to competitive during superkingdom specification and that this triggered diversification of life.


It is generally assumed that life originated as an emergent dissipative system with a series of autocatalytic processes that produced primordial metabolites [189–192]. Among these chemicals are the nucleotides and amino acids that are prerequisite for an ancient RNA world and an emergent protein world. As the latter developed, the first reactions available for RNA and protein molecules must have been metabolic reactions. Timelines already suggest that modern metabolism appeared very early on in evolution (Figure 7). However, a detailed phylogenomic tracing analysis of protein architecture in metabolic networks [193] revealed that the nine most ancient architectures were responsible for the explosive appearance of most modern enzymatic functions [165]. In fact, a careful dissection of recruitment patterns indicated that modern metabolism originated in enzymes of nucleotide metabolism harbouring the P-loop-containing NTP hydrolase fold, probably in pathways linked to the purine metabolic subnetwork. This study was complemented recently with a battery of other evolutionary bioinformatic approaches, which revealed a succession of recruitment gateways, each mediated by the discovery of a new primordial fold [194]. These gateways produced a layered system reminiscent of Morowitz's prebiotic shells [195] describing early evolutionary progressions and take-overs of ancient prebiotic chemistries. The first gateway originated in nucleotide metabolism, involved mostly transferases and was then extended to metabolism of cofactors. It was immediately followed by an ‘energy amphiphile’ lipid–carbohydrate core that provided enzymes for energy and hydrocarbon precursors established primordially in the self-replicating prebiotic entity. The TIM β/α barrel-mediated gateway later introduced amination reactions that converted keto acids into amino acids, mediating the incorporation of nitrogen into a multitude of metabolic processes. 
These opened new recruitment possibilities and generated explosively the chemical diversity we currently encounter in modern metabolism. Phylogenomics therefore provides for the first time a link between the prebiotic and modern worlds, showing metabolism as a palimpsest that recapitulates prebiotic and perhaps ribozymic chemistries. We note that many of the very ancient architectures were involved in functions associated with ancient genes that were recently identified by physical clustering in bacterial genomes [196]. In this study, the evolutionarily conserved gene core divided into three layers, the first highly connected centred around informational processes (fundamentally the ribosome and translation), a second featuring tRNA synthetases and other processes (e.g. proteolysis), and an outer loosely connected layer (assumed to be more ancestral) linked to metabolism and highlighting metabolism of nucleotides, coenzymes and fatty acid molecules. The overall picture of these studies points clearly to an origin of modern proteins in the synthesis of nucleotides for a world in which RNA was the only encoded catalyst, but also to the coexistence of RNA, proteins and prebiotic chemistries, a concept that is in line with recent prebiotic experiments [192]. The centrality of RNA in the primordial make-up of the early protein-encoding organisms is revealed.


Ever since the first crystallographic structure was reported 50 years ago for sperm whale myoglobin (PDB code 1MBN) [197], advances in comparative and structural genomics have continued to provide an increasing number of sequences and crystal structures that are available for the study of the modern protein world. Recent advances in our understanding of protein structure and folding and the construction of powerful classification schemes provide a more thorough description of the hierarchical structure of this world. The linking of molecular evolution and structural biology now provides evolutionary views that are unprecedented. They prompt us to answer important questions. How discrete or continuous is protein space? What are the fundamental processes that drive the evolution of protein structure? What is the tempo and mode of architectural discovery? At what structural resolution do proteomes differ and how does it affect our definition of species? What are the principles that drive the evolutionary mechanics of domain combination in Nature? When and how did individual biological functions originate and evolve?

We have reviewed the remarkable patterns related to the origin, evolution and structure of the protein world and the diversification of life inferred from comparative and phylogenomic analysis of protein structure. History reconstruction exercises unfold timelines of the discovery of architectures and functions and an emergent picture of primordial biochemistries. They uncover episodes of specialization, exemplified by the explosive rise of functionally specialized multidomain proteins. They also reveal patterns of simplification, such as reductive tendencies of protein repertoires in the proteomes of microbial organisms. More importantly, results test long-standing and controversial hypotheses of how life originated and evolved. The gates to the mysteries of how the living world emerged have been opened, and we are expecting a flood of new exciting discoveries.


Supported by the National Science Foundation [grant numbers MCB-0343126 and MCB-0749836], the C-FAR Sentinel Program, the United States Department of Agriculture and the Critical Research Initiative of the University of Illinois.


We thank Professor Steven Huber and Professor Alex Toker for the invitation to write this review, and members of the GCA research group for constructive discussions.

Abbreviations: CATH, Class, Architecture, Topology, and Homologous superfamily; F, fold; FF, fold family; FSF, fold superfamily; HMM, hidden Markov model; Hsp, heat-shock protein; HSR, heat-shock response; nd, node distance; PDUG, protein domain universe graph; SCOP, Structural Classification of Proteins; 3D, three-dimensional

