Biochemical Journal

Review article

Describing sequence–ensemble relationships for intrinsically disordered proteins

Albert H. Mao, Nicholas Lyle, Rohit V. Pappu


Intrinsically disordered proteins participate in important protein–protein and protein–nucleic acid interactions and control cellular phenotypes through their prominence as dynamic organizers of transcriptional, post-transcriptional and signalling networks. These proteins challenge the tenets of the structure–function paradigm and their functional mechanisms remain a mystery given that they fail to fold autonomously into specific structures. Solving this mystery requires a first principles understanding of the quantitative relationships between information encoded in the sequences of disordered proteins and the ensemble of conformations they sample. Advances in quantifying sequence–ensemble relationships have been facilitated through a four-way synergy between bioinformatics, biophysical experiments, computer simulations and polymer physics theories. In the present review we evaluate these advances and the resultant insights that allow us to develop a concise quantitative framework for describing the sequence–ensemble relationships of intrinsically disordered proteins.

  • intrinsically disordered protein
  • polymer physics
  • sequence–ensemble relationship


Emil Fischer [1,2] proposed that enzyme specificity could be explained by shape complementarity. He used the metaphor of a lock and key to illustrate how the three-dimensional arrangements of atoms comprising an enzyme and its substrate could enable them to fit together and prevent non-specific catalysis. Interpreted literally, this metaphor suggests that proteins possess rigid structures that determine their functions. Fischer, however, felt that popular interpretations of his lock-and-key metaphor exceeded its scope and experimental justification [3]. Indeed, protein rigidity has proven to be unsatisfactory at explaining non-competitive inhibition and cannot account for enzymes where binding of one reactive group increases the exposure of another [4]. The concepts of allosteric linkage [5] and induced fit require the invocation of protein conformational changes in response to the binding of an interaction partner [6]. The structure–function paradigm, nuanced to accommodate proteins that switch between discrete conformations with different shape complementarities for the execution of specific functions, provides visual clarity and mathematical simplicity.

Going beyond structure–function relationships

Advancements on the scientific and technological fronts have demonstrated unequivocally that proteins can exhibit significant conformational heterogeneity. IDPs (intrinsically disordered proteins) are at the extreme end of the heterogeneity spectrum [7–9]. They adopt ensembles of conformations in aqueous solutions for which no single structure or self-similar collection of structures provides an adequate description. By all accounts the conformational heterogeneity exhibited by IDPs is relevant for biological function [8–14]. The term IDP is used to imply that the amino acid sequences for this class of proteins encode a preference for heterogeneous ensembles of conformations as the thermodynamic ground state under standard physiological conditions (aqueous solutions, 150 mM monovalent salt, low concentrations of divalent ions, pH 7.0 and a temperature in the 25–37°C range) [9,15].

For many IDPs, folding can be coupled to binding and they can adopt ordered structures in specific bound complexes [16–20]. The intrinsic heterogeneity in their unbound forms is reflected in their ability to adopt different folds in the context of different complexes [10]. Transcription factors represent striking examples of molecules that undergo disorder-to-order transitions in complex with their cognate DNA partners [21–24]. When complexes with DNA are highly stable, transcription factor dissociation can become ‘unreasonably slow’ when compared with the turnover time of downstream regulatory processes. Disorder in the unbound forms is proposed to be important for lowering the overall affinity, which in turn increases the off-rates of protein–DNA complexes [25].

There is a growing number of reports of ‘fuzzy complexes’ whereby conformational heterogeneity prevails in binary and multimolecular complexes [26–28]. IDPs can also self-assemble to form ordered supramolecular assemblies, although the degree of order within these assemblies is variable. The intermediates that seem to be obligatory for self-assembly are characterized by significant conformational heterogeneity, which can be modulated to alter the mechanisms of self-assembly and the stabilities of supramolecular structures [29–38].

Connecting sequence to conformational heterogeneity

Sequence–structure relationships are well documented for proteins whose individual amino acid sequences fold autonomously into specific three-dimensional structures [39,40]. Specificity for a well-defined fold is the result of information encoded in the amino acid sequence [41,42]. In direct analogy, information encoded at the sequence level keeps IDPs from autonomously folding into singular well-defined three-dimensional structures [15,43]. The information content of IDP sequences is such that the acquisition of a folded conformation (if this happens) is deferred by coupling the folding process to either binding or self-assembly providing the heterotypic or homotypic interactions in trans can stabilize the IDP in a specific fold. From a thermodynamic standpoint, the stabilities of complexes and mechanisms of binding/assembly are linked to the conformational properties of IDPs in their unbound forms. Hence, sequence–ensemble relationships are central to understanding how disorder is used in IDP function.

Quantitative descriptions regarding sequence–ensemble relationships require biophysical characterization, although IDPs present significant challenges for such studies. The signals are often highly averaged and by definition these systems resist crystallization unless they can be forced into specific folded structures. IDPs also present biochemical challenges because they can be difficult to isolate from tissue or cell systems as the process of homogenization exposes them to proteases that rapidly and preferentially degrade disordered proteins [44–46]. Efforts to characterize and quantify conformational heterogeneity and understand its role in protein function have required a systematic integration of biophysical, biochemical and bioinformatics methods. In the present review, we focus on the advances made in describing sequence–ensemble relationships in terms of quantitative connections between IDP sequence characteristics and their coarse grain conformational descriptors such as average shapes, sizes and amplitudes of conformational fluctuations.

The major biophysical methodologies include NMR spectroscopy [47–54], steady-state and time-resolved fluorescence (single molecule and ensemble) spectroscopies [55–58], EPR (electron paramagnetic resonance) spectroscopy [59], SAXS (small-angle X-ray scattering) [60–62] and molecular simulations that are used either de novo [63–71] or in synergy with data collected from spectroscopic investigations of IDPs [54,62,72–80]. These methodologies are maturing and evolving to enable comparative assessments of sequence–ensemble relationships for IDPs.

The challenge of describing conformational heterogeneity

A single set of position co-ordinates (and uncertainties in these co-ordinates) helps relate sequence-to-structure for a protein that folds autonomously into a distinct three-dimensional structure. Such co-ordinate sets are generated as models that fit either the electron density data from X-ray diffraction through ordered protein crystals or NMR data that report on the chemical environments of backbone and side-chain protons and nuclei in solution. The PDB [81] provides a comprehensive archive of co-ordinate sets for a range of crystallizable proteins and proteins that are amenable for studies by NMR. This rich data set has led to systematic classification of folds and fold families [40,41,82,83] thus yielding an improved understanding of sequence–structure relationships and insights regarding the evolution of protein folds.

IDPs are not amenable to descriptions by a single, or even small number of, distinct co-ordinate set(s). Instead statistical descriptors are required to provide a concise classification of the conformational ensembles and this is the language of polymer physics [84]. We have focussed on these concepts [84] to provide a unifying framework for quantitative analysis and descriptions of sequence–ensemble relationships of IDPs. This framework is useful for analysing and interpreting the data obtained either from experiments or from molecular simulations.

Limiting models as descriptors of conformational heterogeneity

The two most popular statistical descriptions based on polymer physics are the Flory random coil [85] and worm-like chain [86] models. In the rotational isomeric approximation [85] to the Flory random coil model, the conformational partition function for the polypeptide is written as a product of partition functions of independent interaction units. All interactions between non-nearest neighbour units or so-called Kuhn segments are explicitly ignored, although the intrinsic conformational preferences of individual units are captured in terms of the weights for each of the possible rotational isomers. The unit either spans the degrees of freedom of an individual residue or can take local effects into account to expand the unit to span multiple residues. In either situation, each conformation for unit i is annotated by an intrinsic energy value that is calculated using an empirical potential function of one's choosing. The conformations are binned into rotational isomeric states on the basis of the similarities of their backbone and sidechain dihedral angles (see Figure 1). Residue/unit i might have m rotational isomers, whereas residue j might have n rotational isomers. For a given residue, each rotational isomer is assigned a weight that is calculated using the Boltzmann weights of energies of individual conformations that make up the rotational isomer.

Figure 1 Illustration of how the rotational isomeric approximation of the Flory random coil model is constructed

(A) This process begins with a detailed calculation of the free energy landscape (with free energies increasing from red to blue) for an individual amino acid (or Kuhn segment), shown here for alanine. The tiles represent a coarse graining of conformational space into discrete rotational isomers, and each isomer has a label and a statistical weight that is calculated using the energies associated with conformations that make up a rotational isomer. The assumption of independence/additivity allows the statistical weights for each combination of rotational isomers to be written as a product of individual weights. Panel (B) shows this procedure, whereby there are M conformations for a polypeptide of N residues and the statistical weight for each conformation z is a product of the weights for individual residues. The result is a weighted ensemble of all conformational possibilities where each ‘conformation’ is denoted using a combination of the coarse grain rotational isomers. Panel (C) shows a schematic conformation for one of the conformations z.

Given an amino acid sequence of N residues, one can calculate, a priori, the probabilities associated with all combinations of rotational isomers. For the sequence of interest, the number of rotational isomers per residue, their statistical weights and the sequence composition dictate the total number of conformational possibilities and the likelihoods associated with each conformation. These likelihoods make up the predicted conformational distribution function and can be used to calculate a variety of conformational properties, including the average end-to-end distance, the average radius of gyration, the average hydrodynamic size, the average distance between residues i and j, and any observables that can be cast as a function of a moment of the conformational distribution function.
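The product construction described above can be sketched in a few lines. This is a minimal illustration rather than a published implementation: the isomer labels, the energies and the helper names `isomer_weights` and `chain_ensemble` are hypothetical, and each rotational isomer is represented by a single energy instead of a sum over the conformations binned into it.

```python
import itertools
import math

GAS_CONSTANT = 1.987e-3  # kcal/(mol K)

def isomer_weights(energies, temperature=298.0):
    """Normalized Boltzmann weights for the rotational isomers of one unit.

    Simplifying assumption: one energy (kcal/mol) per isomer, rather than
    a sum over all conformations that were binned into that isomer.
    """
    rt = GAS_CONSTANT * temperature
    boltzmann = {name: math.exp(-e / rt) for name, e in energies.items()}
    total = sum(boltzmann.values())
    return {name: w / total for name, w in boltzmann.items()}

def chain_ensemble(per_unit_weights):
    """Weight of every isomer combination as a product of per-unit weights."""
    ensemble = {}
    for combo in itertools.product(*per_unit_weights):
        weight = 1.0
        for unit, isomer in zip(per_unit_weights, combo):
            weight *= unit[isomer]
        ensemble[combo] = weight
    return ensemble

# Hypothetical isomer energies for a three-unit chain (illustrative only).
hypothetical_energies = [
    {"alpha": 0.0, "beta": 0.3, "ppII": 0.1},
    {"alpha": 0.5, "beta": 0.0},
    {"alpha": 0.0, "beta": 0.3, "ppII": 0.1},
]
units = [isomer_weights(e) for e in hypothetical_energies]
ensemble = chain_ensemble(units)
most_likely = max(ensemble, key=ensemble.get)
```

Because interactions between units are ignored, the number of enumerated combinations is simply the product of the per-unit isomer counts, and the weights sum to one without any further normalization.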

An alternative approach is to analyse experimental data, specifically data from fluorescence [87–89] or force spectroscopy [90–92] that are functions of end-to-end distances using variants of the worm-like chain model. The persistence length lp is the length scale over which the chain behaves like a continuously deformable entity. For a rod-like chain lp equals the contour length and for a freely jointed chain lp equals the bond length, so this model ostensibly allows interpolation between two extremes. Fluctuations are highly correlated for spatial separations that are smaller than lp. For spatial separations longer than lp, the worm-like segments become uncorrelated and the model reverts to the Flory random coil limit. Therefore if lp is found to be small, the worm-like chain model does not yield any insights that go beyond the Flory random coil.
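The interpolation between the two extremes can be made concrete with the standard Kratky–Porod expression for the mean-square end-to-end distance of a worm-like chain, <Ree²> = 2·lp·L − 2·lp²·[1 − exp(−L/lp)], where L is the contour length. The sketch below is illustrative only: the function name, the contour length per residue (about 0.38 nm) and the trial lp values are assumptions, not values taken from the studies cited here.

```python
import math

def wlc_mean_square_ree(contour_length, persistence_length):
    """Kratky-Porod result: <Ree^2> = 2*lp*L - 2*lp^2 * (1 - exp(-L/lp)).

    expm1(-x) equals exp(-x) - 1, which keeps the stiff-chain limit
    numerically accurate.
    """
    L, lp = contour_length, persistence_length
    return 2.0 * lp * L + 2.0 * lp * lp * math.expm1(-L / lp)

# Assumed geometry: 50 residues at ~0.38 nm of contour per residue.
contour = 50 * 0.38  # nm
for lp in (0.4, 2.0, 19.0):  # hypothetical persistence lengths in nm
    rms = math.sqrt(wlc_mean_square_ree(contour, lp))
    print(f"lp = {lp:5.1f} nm  ->  rms Ree = {rms:6.2f} nm")
```

When lp is much smaller than L the expression reduces to 2·lp·L, which is random-coil scaling with a Kuhn length of 2·lp; when lp is much larger than L it approaches L², the rod limit. A small fitted lp therefore carries no information beyond the random coil description.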

Estimates of lp values for different sequences, studied under similar solution conditions, yield comparative assessments of sequence–ensemble relationships through comparative measures of ‘chain stiffness’, although Yamakawa [93] has highlighted the limitations of lp as a measure of stiffness. The assumption of continuous deformation for spatial separations less than lp is questionable because this assumption breaks down if the chain can form heterogeneous ensembles of compact globules. Despite its inherent weaknesses, the worm-like chain model retains an appeal for its ease of use in interpreting experimental data for IDPs and denatured proteins [87–89].

The value of limiting models

The preceding discussion focuses on limiting models, which are analogous to limiting models/laws in other branches of physics that include the Debye–Hückel equation for calculating activity coefficients of electrolytes [94], and the Hildebrand/Flory–Huggins expressions for the free energies of ideal mixtures [95–97]. Limiting laws or models provide a route for interpreting experimental data as deviations from ideal behaviour. As a limiting model, the Flory random coil model is often used to calibrate measured observables such as NMR chemical shifts [98,99] and NMR paramagnetic relaxation enhancement effects [49], i.e. observables can be calibrated as deviations from the Flory random coil. This helps assess the contributions of spatial interactions between residues that are distal in the linear sequence. Such approaches are decidedly one-sided because deviations from a limiting model tell us what an ensemble is not and this is inadequate for developing a complete understanding of sequence–ensemble relationships.


The sequences of IDPs are deficient in hydrophobic residues and enriched in polar and charged amino acids [15]. Electrostatic and polar interactions are typically quite large, even when screened by the surrounding solvent, and this raises serious questions regarding the applicability of limiting models for describing IDP conformations. Furthermore, with a few exceptions such as proline- or glycine-rich sequences, the intrinsic flexibilities of all polypeptides are approximately equivalent. This implies that the intrinsic value of lp is essentially fixed for a wide range of sequences, even if it is erroneously treated as a free parameter when fitting experimental data. The inadequacy of limiting models for describing experimental data has led to proposals that polymer physics concepts are inadequate for describing sequence–ensemble relationships for all IDPs [100]. This proposal ignores the rich possibilities afforded by explicit consideration of the effects of realistic interactions.

Conformational statistics are dictated by the interplay between chain–solvent and intrachain (intrabackbone, backbone–side chain and side chain–side chain) interactions [84]. As a result, polymers undergo continuous transitions between distinct conformational classes. These transitions are modulated by changes in solvent-mediated interactions either through the addition of ternary components or by changing the temperature and/or pressure. Importantly, the conformational classes are akin to distinct phases because they have distinct density profiles and the variation of spatial separation as a function of sequence separation follows distinct patterns. Transitions between conformational classes are hence akin to phase transitions [101].

Measures for assessing the phase behaviour of polymers

Quantities such as the average radius of gyration (<Rg>), the average hydrodynamic radius (<Rh>) and the average end-to-end distance (<Ree>) are different measures of chain size that can be used to quantify the average density, intrinsic viscosity and concentration of one end of the chain around the other. In addition to measures of chain size, one can also calculate the average shapes of polymers. The average asphericity (δ*) quantifies the extent of deviation from a perfect sphere (δ*=0). For ellipsoids δ*≈0.4, and this quantity attains its maximum value of 1 for a perfect rod [102]. The average asphericity is calculated from the ensemble-averaged eigenvalues of the gyration tensor.
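As an illustration of how these size and shape measures follow from co-ordinates, the sketch below computes the ensemble-averaged Rg and an asphericity from the gyration tensor. It assumes one commonly used definition, δ* = 1 − 3<λ1λ2 + λ2λ3 + λ1λ3>/<(λ1 + λ2 + λ3)²>, where the λ values are the eigenvalues of the gyration tensor; whether this matches the exact convention of [102] is an assumption, and the function names are hypothetical. The eigenvalue sums are obtained from tensor invariants, so no diagonalization is needed.

```python
import math

def gyration_tensor(coords):
    """3x3 gyration tensor of one conformation (equal weight per site)."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for p in coords:
        d = [p[k] - center[k] for k in range(3)]
        for a in range(3):
            for b in range(3):
                tensor[a][b] += d[a] * d[b] / n
    return tensor

def eigenvalue_invariants(tensor):
    """Return (l1+l2+l3, l1*l2+l2*l3+l1*l3) without diagonalizing."""
    trace = sum(tensor[a][a] for a in range(3))
    trace_of_square = sum(tensor[a][b] * tensor[b][a]
                          for a in range(3) for b in range(3))
    return trace, 0.5 * (trace * trace - trace_of_square)

def ensemble_size_and_shape(conformations):
    """Ensemble-averaged Rg and asphericity over a list of conformations."""
    rg_sum = trace_sq_sum = pair_sum = 0.0
    for coords in conformations:
        trace, pairs = eigenvalue_invariants(gyration_tensor(coords))
        rg_sum += math.sqrt(trace)  # Rg^2 is the trace of the tensor
        trace_sq_sum += trace * trace
        pair_sum += pairs
    m = len(conformations)
    asphericity = 1.0 - 3.0 * (pair_sum / m) / (trace_sq_sum / m)
    return rg_sum / m, asphericity

# Two toy "conformations": a straight rod and a symmetric octahedron.
rod = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
rod_rg, rod_delta = ensemble_size_and_shape([rod])
oct_rg, oct_delta = ensemble_size_and_shape([octahedron])
```

The two toy inputs reproduce the limits quoted in the text: the rod gives an asphericity of 1 and the symmetric arrangement gives 0.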

One can also calculate the average distances between residues i and j. The quantity <Rij> represents the ensemble-average of spatial separations calculated as averages over all pairs i and j that yield a sequence separation |j–i| [103]. Multiple pairs of residues i and j will have similar sequence separations of |j–i|. The profile of <Rij> plotted against sequence separation |j–i| quantifies the local concentration of chain segments around each other and provides the most detailed information regarding the so-called link density [104], which is a formal order parameter in formalisms of polymer theories such as the Lifshitz approach [103,105–108]. In addition to ensemble averages, one can also calculate the one- and two-parameter distribution functions such as P(Rg), P(Ree), P(Rh), P(δ), P(Rij| |j–i|) and P(Rg, δ). The latter quantifies the joint distribution of sizes and shapes.
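A minimal sketch of how the <Rij> profile might be computed from a set of conformations is given below. It assumes each conformation is a list of one (x, y, z) position per residue, for example Cα atoms; that choice and the function name `internal_distance_profile` are illustrative assumptions rather than the procedure used in the cited work.

```python
import math
from collections import defaultdict

def internal_distance_profile(conformations):
    """Average spatial separation <Rij> for each sequence separation |j-i|.

    Averages run over all residue pairs with the same separation and over
    all conformations in the ensemble.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for coords in conformations:
        n = len(coords)
        for i in range(n):
            for j in range(i + 1, n):
                sums[j - i] += math.dist(coords[i], coords[j])
                counts[j - i] += 1
    return {sep: sums[sep] / counts[sep] for sep in sorted(sums)}

# Toy check: a straight chain with 3.8 (arbitrary units) between neighbours.
straight = [(3.8 * k, 0.0, 0.0) for k in range(6)]
profile = internal_distance_profile([straight])
```

For the straight toy chain the profile grows linearly with separation; for simulated ensembles the same profile would be inspected for power-law growth or for a plateau.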

Importantly, all of the quantities listed above are accessible to the appropriate combination of experiments and can be calculated using co-ordinates for simulated ensembles. This enables quantitative comparisons between simulation results and experiments thus facilitating direct approaches to either test predictions from simulations or routes to incorporate experimental data as restraints in simulations for generating ensembles that best describe the experimental data. Both approaches are equally important and have enabled the development of quantitative sequence–ensemble relationships for IDPs.

Assessing solvent quality

The balance between chain–solvent interactions and intrachain interactions is quantified using a parameter vex. This quantity has units of volume and is proportional to the integral of the Mayer-f function, vex=−∫f(r)d³r, where f(r)=exp[−βW(r)]−1, W(r) is the potential of mean force for the thermally averaged inter-residue interaction and β=1/(RT), where R is the gas constant (with W(r) expressed in molar energy units) and T is temperature. If the effective inter-residue interactions are repulsive, then the Mayer-f function is negative, which leads to positive values for vex and the converse is true for inter-residue interactions that are attractive on average. The parameter vex is hence a measure of the volume excluded, per residue, for favourable interactions with the surrounding solvent that results from the competition between chain–chain and chain–solvent interactions. It provides a measure of the strengths of pairwise inter-residue interactions, on average, and can be related to the second virial coefficient that is accessible using light-scattering measurements [84].
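The sign behaviour described above can be checked numerically. The sketch below integrates the Mayer-f function for an isotropic model potential, using the sign convention implied by the text (repulsion gives positive vex), so that vex = −4π∫f(r)r²dr. The square-well potential, its parameters and the function names are hypothetical and chosen only because the hard-core limit has a known answer, 4πσ³/3 for a core of diameter σ.

```python
import math

def excluded_volume(potential, rt, r_max=5.0, steps=50000):
    """v_ex = -4*pi * integral of f(r) r^2 dr, f(r) = exp(-W(r)/RT) - 1.

    Midpoint-rule quadrature; assumes W(r) is isotropic and negligible
    beyond r_max. The leading minus sign makes repulsion give v_ex > 0.
    """
    dr = r_max / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * dr
        mayer_f = math.exp(-potential(r) / rt) - 1.0
        total += mayer_f * r * r * dr
    return -4.0 * math.pi * total

def square_well(depth, core=1.0, reach=1.5):
    """Hypothetical potential: hard core plus an attractive well of given depth."""
    def w(r):
        if r < core:
            return math.inf
        return -depth if r < reach else 0.0
    return w

RT_298 = 1.987e-3 * 298.0  # kcal/mol
v_repulsive = excluded_volume(square_well(0.0), RT_298)   # hard core only
v_attractive = excluded_volume(square_well(0.6), RT_298)  # well of 0.6 kcal/mol
```

With no well, the result recovers the hard-core volume and is positive; deepening the well drives vex through zero to negative values, which mirrors the progression from good to θ to poor solvent.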

Classifying ensemble types based on vex

In a good solvent vex>0 and the chain expands to maximize the polymer/solvent interface. Expanded unfolded states are sampled in vitro in high concentrations of chemical denaturants such as urea and guanidinium chloride. Aqueous solutions with high concentrations (8 M) of urea are presumed to be reasonable mimics of good solvents for generic polypeptides because urea, a carbonyl diamide, is chemically equivalent to polypeptide backbone amides. As a result, quantities such as <Rg> [109], <Rh> [110] and the average end-to-end distance <Ree> scale as N^0.59 with chain length N. In a good solvent the distances <Rij> scale as |j–i|^0.59 as a function of sequence separation |j–i|.

Since the inter-residue interactions are on average repulsive, vex>0 in good solvents. Indeed the sizes of self-avoiding random walks also scale as N^0.59 and conformational ensembles for polymers in good solvents and self-avoiding random walks are said to belong to the same ‘universality class’ [111]. Accordingly, ensembles generated in atomistic detail for proteins in the EV (excluded volume) limit are useful reference states for the expanded unfolded states [112–118]. In the EV limit, ensembles are generated using atomistic descriptions of proteins and all non-bonded interactions excepting steric repulsions are ignored.

The low overall hydrophobicity of IDP sequences implicitly suggests that these systems come under the same rubric as chemically denatured proteins. Hence, a popular approach is to generate EV limit ensembles and filter out those conformations that cause deviations from the observables that are measured experimentally [89,119–125]. Although this seems like a reasonable approach, it imposes the assumption that aqueous solutions are mimics of good solvents for IDP sequences and ignores the possibility that these sequences can sample compact phases.

Implications of decreasing values of vex

The parameter vex can change continuously going from positive values in a good solvent, through zero in a θ solvent to negative values in a poor solvent [84]. If the effects of chain–solvent and intrachain interactions exactly counterbalance, then vex=0, and the chain is said to be in a θ solvent. Under such conditions, the chain statistics are consistent with those of a Flory random coil model. It is important to note that this behaviour comes about due to counterbalancing of the interactions rather than explicitly ignoring non-local interactions. In a θ solvent, <Rg>, <Rh> and <Ree> scale as N^0.5 and <Rij> scales as |j–i|^0.5.

In a poor solvent vex<0 and the chain prefers compact globular conformations that minimize the polymer/solvent interface and <Rg> and <Rh> scale as N^(1/3) with chain length N. The sizes of folded proteins follow N^(1/3) scaling [126]. The poorer the solvent, the more negative the value of vex. Statistics of inter-residue distances change fundamentally in a poor solvent. The distances <Rij> do not increase with sequence separation |j–i| according to a power law. Instead, for all values of |j–i| larger than a so-called blob length (approximately five to seven residues), the value of <Rij> is fixed by the average density of the globule.
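The three reference exponents (0.59 for a good solvent, 0.5 for a θ solvent and 1/3 for a poor solvent) suggest a simple way to interrogate size-versus-length data: fit a power law on a log-log scale and compare the fitted exponent with the reference values. The sketch below is a crude heuristic under that assumption; the function names, the synthetic data and the nearest-exponent assignment are illustrative, and real data would need uncertainty estimates before any classification is drawn.

```python
import math

REFERENCE_EXPONENTS = {
    "good solvent": 0.59,
    "theta solvent": 0.5,
    "poor solvent": 1.0 / 3.0,
}

def fit_scaling_exponent(chain_lengths, sizes):
    """Least-squares fit of log(size) = log(prefactor) + nu * log(N)."""
    xs = [math.log(n) for n in chain_lengths]
    ys = [math.log(r) for r in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    nu = sxy / sxx
    prefactor = math.exp(mean_y - nu * mean_x)
    return nu, prefactor

def nearest_regime(nu):
    """Label of the reference exponent closest to the fitted value."""
    return min(REFERENCE_EXPONENTS, key=lambda k: abs(REFERENCE_EXPONENTS[k] - nu))

# Synthetic, noise-free data generated with a globule-like exponent.
lengths = [15, 30, 45, 60, 90]
synthetic_rg = [2.0 * n ** (1.0 / 3.0) for n in lengths]
nu, prefactor = fit_scaling_exponent(lengths, synthetic_rg)
regime = nearest_regime(nu)
```

This kind of fit applies to chain sizes as a function of N. As noted in the text, the <Rij> profile of a globule plateaus rather than following a power law, so it should be examined directly instead of being fitted.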

Clearly, the consideration of the details of the balance between chain–chain and chain–solvent interactions affords a richer description of conformational statistics. The question is which of these models applies to IDPs. In order to answer this question, we need a systematic approach that assesses whether typical physiological milieus are good, θ or poor solvents for polypeptide backbones. This allows us to understand how side chains in IDPs modulate the intrinsic backbone preferences. Recent studies of archetypal IDPs using a combination of spectroscopic experiments and molecular simulations have yielded clear insights regarding sequence–ensemble relationships. The following sections summarize these findings and the implications for predicting sequence–ensemble relationships for IDPs.


The free energy of hydration for NMA (N-methylacetamide) at 298 K is −10 kcal/mol [127], indicating that the transfer of NMA from the gas phase into water is highly favourable. Naïve extrapolation from the transfer model suggests that polyglycine, a poly-secondary-amide, should prefer structures that maximize the interface with the aqueous solvent. However, results based on molecular dynamics simulations show that polyglycine forms a heterogeneous ensemble of compact globules [128]. Similar simulations showed that polyglycine samples expanded coil-like structures in 8 M urea and there is minimal overlap between ensembles sampled in the two milieus. These results and analysis of a series of order parameters drawn from polymer physics predict that water is a poor solvent for polyglycine, which is a mimic of polypeptide backbones. Recent fluorescence correlation spectroscopy experiments and solubility measurements of polyglycine peptides have confirmed these predictions [129].

Why do polypeptide backbones form compact globules in water?

In a polymer, each residue has a reduced translational entropy when compared with the free analogues diffusing in solvent. Mean-field theories show that the entropy of mixing between solute and solvent molecules is reduced by a factor of N if we compare the entropy of mixing for N freely diffusing solute molecules to the same N molecules concatenated to form a polymer [97,130]. Because of this diminution in entropy, polymers, unlike small molecules, can undergo intramolecular phase separation to form globules or prefer the coil phase that is well mixed with solvent [105,106]. Within globules, the effective concentration of residues around each other is independent of N, whereas in coils it decreases as N^−0.77. As a result, each residue can make collective amide–amide contacts within globules whereas the coil state is characterized by a combination of negligible intrachain interactions and additive interactions of individual residues with the solvent. Even for polyamides, where individual amides can be favourably solvated, it is the competition between collective self-interactions within a globule and additive interactions of individual residues with the solvent in the coil state that determines the stable phase. Model compounds do not account for the diminution in translational entropy or the competition between collective intrapolymer and additive polymer–solvent interactions. It is this that partially explains the preference of polypeptide backbones for compact globules.
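The N-dependence of the effective concentration quoted above can be recovered from the size-scaling exponents. The following is a short sketch, assuming that the effective concentration is the number of residues divided by the volume pervaded by the chain and writing ν for the scaling exponent of chain size with N (the symbols ν and c_eff are introduced here for convenience):

```latex
c_{\mathrm{eff}} \sim \frac{N}{\langle R_g \rangle^{3}}
                \sim \frac{N}{N^{3\nu}} = N^{1-3\nu},
\qquad
\begin{cases}
\nu = 1/3 \ (\text{globule}): & c_{\mathrm{eff}} \sim N^{0} \\
\nu = 0.59 \ (\text{coil, good solvent}): & c_{\mathrm{eff}} \sim N^{-0.77}
\end{cases}
```

The globule exponent gives a concentration that does not depend on chain length, whereas the good-solvent exponent gives the decay with N stated for coils.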

Questions remain regarding the balance between chain/solvation entropy and enthalpy, the interplay between backbone hydration and self-solvation of amides, and the comparative roles of hydrogen bonding against van der Waals interactions in giving rise to the observed preferences of polypeptide backbones in dilute and concentrated aqueous solutions. To answer these questions, we need a systematic investigation of the temperature and co-solute dependencies of the preference for collapsed states in dilute solutions and the solubility boundary in concentrated solutions. In addition, comparative studies of constructs with substituted amides, such as amide to ester substitutions (to probe the effect of weakened hydrogen bond donors and stronger acceptors) and secondary amide to primary/tertiary amide substitutions (to probe the effect of hydrogen bond donors), will be necessary for quantifying the role of hydrogen bonding in driving polypeptide backbone collapse.

Archetype 1: polar tracts form compact globules in aqueous solutions

Among the polar amino acids, IDPs are enriched in histidine, glutamine, serine and threonine and are relatively deficient in asparagine although glutamine/asparagine-rich regions are the hallmark of prion-forming domains in yeast [36]. Glutamine-rich linkers were among the first disordered segments identified from sequence analysis [131]. They are abundant in transactivation domains of transcription factors and in RNA-binding proteins that play important roles in post-transcriptional regulation [13,132].

Fluorescence correlation spectroscopy measurements revealed that <Rh> scales as N^(1/3) for monomeric polyglutamine molecules [133]. These results confirmed predictions from molecular dynamics simulations [69,134] and have been reproduced using steady state FRET (Förster resonance energy transfer) [135] measurements. The low sequence complexity of polyglutamine implies a lack of specificity for a single compact conformation and instead heterogeneous ensembles of compact conformations are energetically equivalent. Furthermore, the internal friction is uncharacteristically high for these molecules suggesting glassy dynamics for the conversions between distinct compact conformations [69]. Simulation results obtained using the ABSINTH implicit solvation model show evidence for continuous globule-to-coil transitions for polyglutamine [136–138]. In accordance with polymer physics theories, the stabilities of globular conformations and the sharpness of the globule-to-coil transitions increase with increasing chain length (Figure 2).

Figure 2 A partial inventory of the results and analysis afforded by combining polymer physics concepts with simulation results to describe and classify sequence–ensemble relationships

The analyses are based on results from simulations of polyglutamine chains of different lengths. These Monte Carlo simulations utilize the ABSINTH implicit solvation model and force field based on the OPLS-AA/L molecular mechanics parameter set. (A) The globule-to-coil transitions plotted as the temperature-dependence of the normalized mean-square radius of gyration. The sharpness of the transition increases with chain length. The increased number of favourable intrachain interactions offsets the unfavourable surface free energies with the surrounding poor solvent. The curves coincide at the θ temperature, designated as Tθ, and compact globules are preferred for all temperatures T<TC. (B) The globule-to-coil transition for different polyglutamine chains, shown by plotting the ensemble-averaged asphericity values, δ*, as a function of temperature. For T<TC, spherical globules are favoured by longer chains, as evidenced by the low values for δ*. This preference is reversed as chains expand to sample ellipsoidal coil-like structures (δ*>0.4). (C) The <Rij> profiles for a 45-residue polyglutamine chain for different temperatures. For T<TC, the profiles show little variation and have the plateauing behaviour that is consistent with compact globules being preferred below TC. As T increases beyond TC towards Tθ=390 K, the profiles transition from the characteristic plateauing to the power law form characteristic of improving solvent quality, i.e. weakening intrachain interactions. (D–F) The joint distributions P(Rg,δ) for a 45-residue polyglutamine chain under three different simulation conditions, TC, Tθ and the EV limit, which corresponds to the situation where all intrachain interactions excepting steric repulsions are ignored (see the main text). Rg is in Å and δ is unitless. These distributions underscore two points. First, the amplitudes of fluctuations, i.e. the range of conformations sampled, increase as the chain expands and becomes more aspherical. Secondly, the overlap between ensembles at TC and expanded ensembles sampled at higher temperatures is minimal and decreases with increasing temperature, i.e. the EV limit ensembles bear no resemblance to the ensemble of globules sampled for T<TC and hence one should be cautious in choosing conformations from the EV limit as proxies for IDPs. (G–I) The distributions P(Rij| |j–i|) for different sequence separations at T=TC, Tθ and the EV limit respectively. For T=TC, the distributions overlap considerably, irrespective of sequence separation. For higher temperatures and the EV limit, the distributions shift toward larger Rij values as |j–i| increases, a feature that is consistent with the power law behaviour expected for θ temperatures and beyond.

Single-molecule atomic-force spectroscopy studies showed that polyglutamine molecules form compact globules that are mechanically resistant to forces as large as 800 pN [139]. Interestingly, the introduction of proline residues within polyglutamine tracts increases their mechanical compliance. The preference for heterogeneous ensembles of compact globular conformations has also been observed using single-molecule FRET measurements for glutamine/asparagine-rich tracts [55] and for glycine–serine block co-polypeptides using a combination of time-resolved FRET measurements [119] and molecular dynamics simulations [128]. Taken together, it is now clear that sequences enriched in polar residues form heterogeneous ensembles of compact globules as measured by a range of properties that quantify sizes, shapes, the scaling of intersegmental distances and the responses of these sequences to applied force. The preference for collapsed states can be traced, at least partially, to the intrinsic preferences of polypeptide backbones whose conformational properties in aqueous solutions are consistent with water being a poor solvent for polyamides. These results might seem surprising since the preference for collapsed states is realized despite the absence or deficiency of canonical hydrophobic residues in these polar tracts. The results are, however, consistent with the poor solubility profiles of IDP sequences that are enriched in polar residues and highlight the weaknesses of extrapolations based on additivity assumptions, which suggest that conformational properties and solubility profiles of polymers can be inferred exclusively from the properties of their building blocks. Indeed, the assumption of additivity has been questioned in the protein literature [140–142] and has proven to be invalid for synthetic polymers.

Archetype 2: globules to coils as a function of increased net charge per residue

Uversky et al. [15] showed that, in addition to low overall hydrophobicity, many IDP sequences also have high net charge per residue. IDP sequences populate a specific region of a two-dimensional space defined by mean hydrophobicity, <H>, and mean net charge, <q>. In this plane, a single line, <q>=2.785<H>−1.151, separates the IDP sequences, which lie below the line, from the sequences with well-defined folds. Given that archetypal polar IDPs form heterogeneous ensembles of collapsed structures in aqueous solutions and that the driving force for collapse originates in the intrinsic preference of polypeptide backbones for collapsed structures in water, the question is whether this behaviour is a generic attribute of all IDP sequences. Polymer physics theories suggest that even in poor solvents, polyelectrolytes, i.e. sequences enriched in charged residues of one kind, can reverse the preference for collapsed structures [143–145]. Instead, the preference for charged residues to be solvated, combined with intrachain electrostatic repulsions, leads to chain expansion. Therefore, depending on the charge content, chain sizes can go beyond values expected for self-avoiding random walks in so-called good solvents [144,145]. This behaviour does not result from alterations to the solvent properties, but instead is the consequence of the interplay between chain–chain and chain–solvent interactions whereby the former essentially override the latter due to the large electrostatic energies involved.

Two studies on complementary systems, namely an IDP enriched in acidic residues [58] and a series of sequences enriched in basic residues (arginine) [146], showed that IDP sequences partition into globules versus coils on the basis of their net charge per residue (Figure 3). This quantity, which is different from just the net charge, is calculated as |f+−f−|, where f+ and f− refer to the fractions of positively and negatively charged residues in the sequence respectively. If |f+−f−|<0.2 then the sequences are most likely to be globule formers, and for values of |f+−f−| larger than 0.2 one observes a continuous transition into non-globular expanded coils, where the precise power law followed by quantities such as <Rg> as a function of N or <Rij> as a function of sequence separation |j–i| depends on the value of |f+−f−|. These findings were converged upon by different groups through a combination of single-molecule FRET measurements [58], synergy between molecular simulations and fluorescence spectroscopies [146], NMR measurements of hydrodynamic sizes [147] and SAXS [123].
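The net charge per residue defined above is simple to compute from a sequence, and the following minimal sketch illustrates the calculation together with the approximate 0.2 threshold. The function names and the residue classes are assumptions made for illustration only: lysine and arginine are counted as positively charged, aspartate and glutamate as negatively charged, and histidine is treated as neutral. The behaviour exactly at the threshold is not specified by the studies cited here, so the sketch arbitrarily assigns it to the expanded class.

```python
# Illustrative sketch: net charge per residue, |f+ - f-|, from a one-letter sequence.
# Assumed residue classes (not taken from the cited studies):
# K/R positive, D/E negative, H neutral.
POSITIVE = frozenset("KR")
NEGATIVE = frozenset("DE")


def charge_fractions(sequence):
    """Return (f_plus, f_minus), the fractions of positive and negative residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    f_plus = sum(res in POSITIVE for res in seq) / n
    f_minus = sum(res in NEGATIVE for res in seq) / n
    return f_plus, f_minus


def net_charge_per_residue(sequence):
    """Return |f+ - f-| for the sequence."""
    f_plus, f_minus = charge_fractions(sequence)
    return abs(f_plus - f_minus)


def predicted_class(sequence, threshold=0.2):
    """Coarse label: 'globule' below the threshold, 'coil' otherwise.

    The hard cut is a simplification; the text describes a continuous
    transition for values above the threshold.
    """
    if net_charge_per_residue(sequence) < threshold:
        return "globule"
    return "coil"


if __name__ == "__main__":
    for name, seq in [("Q45", "Q" * 45), ("(RG)10", "RG" * 10), ("(KD)10", "KD" * 10)]:
        print(name, net_charge_per_residue(seq), predicted_class(seq))
```

Note that a polyampholyte such as the hypothetical (KD)10 sequence has a net charge per residue of zero despite its high charge content, which is why net charge per residue and total charge content are distinct quantities.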

Polymer physics theories provide a generalization of vex to include the contribution of increased charge contents for polyelectrolytes in poor solvents [143–145,148]. For chains such as polypeptides, where the interplay between linear charge density and chain thickness is important, the value of an effective EV parameter, Xeff, determines the conformational properties of polyelectrolytes in a poor solvent. If the net charge per residue is small, and therefore Xeff<0, the sign and magnitude of vex dominate and the chain collapses to a uniform distribution of compact globules. Within a narrow range of values for the net charge per residue Xeff approaches zero. The value of the net charge per residue that yields Xeff=0 is the one for which conformational properties are akin to those of a chain in a θ solvent, although the solvent itself is not a θ solvent. Values of Xeff>0 are realized as the charge content increases, thus increasing the net charge per residue. The chain expands and accesses coil states, the statistics of which are congruent with those of chains in good solvents, although the solvent quality remains unchanged and the behaviour results from an increased charge content in the sequence. As Xeff becomes increasingly positive with increasing net charge per residue, theory predicts a continuous increase in chain dimensions, a feature that has been recapitulated for archetypal IDPs of increasing net charge per residue [146]. Therefore the intrinsic solvation preferences of charged sidechains and their interactions with each other can override the intrinsic preferences of polypeptide backbones in aqueous solutions because the sidechains are akin to an effective solvent for IDP sequences that are enriched in charges.

A predictive phase diagram for sequence–ensemble relationships

Synthesis of the results obtained thus far for archetypal IDPs has led to a speculative generalization to enable the prediction of conformational classes for IDPs on the basis of their sequence characteristics. Mao et al. [146] introduced a schematic phase diagram where the three axes denote mean hydrophobicity, f+ and f− respectively and each of these parameters varies between 0 and 1. The dividing line discovered by Uversky et al. [15] is a boundary that separates folded proteins from IDPs. Below this boundary IDPs prefer to be either globules or coils, and the net charge per residue is the determinant of this preference. The predicted phase diagram presumes that amino acid composition is a sufficient diagnostic of IDP sequence–ensemble relationships. On-going investigations are testing the accuracy of inferences drawn from the proposed phase diagram that is designed to quantify sequence–ensemble relationships entirely on the basis of the composition of amino acids for the sequence of an IDP or disordered region. Although preliminary results suggest that this assumption is reasonably robust, it is also clear that the local conformational preferences, such as secondary structure contents, can vary considerably between sequences of fixed or similar compositions [146,149,150]. Indeed recent results show that the net charge per residue is not particularly useful as a diagnostic of local conformational preferences [150] and instead sequence context acts as a rheostat to modulate these preferences [151].

Figure 3 IDPs form globules or coils depending on the net charge per residue (|f+−f−| increases), T=298 K

(A) The joint distribution P(Rg,δ) and a representative conformation for polyglutamine with 45 residues. The distribution highlights the small amplitudes of conformational fluctuations and the overall compactness of the conformations that are sampled by this sequence. Panels (B–D) show how arginine-rich sequences transition from globules (B) to θ-like conformations (C) and swollen ellipsoidal coils (D) as the net charge per residue increases. The Figures were generated using the simulation results of Mao et al. [146] (B–D) and Vitalis et al. [137] (A). The representative conformations were drawn using the VMD package [181] and the colour-coding used depicts polar residues in green, hydrophobic residues in yellow, positively charged residues in red and negatively charged residues in blue.


Going beyond polyelectrolytes

The studies that uncovered the effects of net charge per residue focused mainly on polyelectrolytes. For these systems it appears that coarse-grained sequence–ensemble relationships are robustly classifiable based on net charge per residue and hence amino acid composition alone. It is conceivable that overall charge content as well as the linear sequence patterning of oppositely charged residues alter the predictions of the sequence–ensemble relationships obtained on the basis of net charge per residue [152], and a systematic study of this issue is currently underway. This is especially important since a majority of IDPs are polyampholytic [153], i.e. they are sequences of high charge content that have roughly equivalent numbers of oppositely charged residues.

Ways to modulate the net charge per residue

The pKa values of ionizable groups can shift depending on the amino acid composition, sequence context and the degree of conformational heterogeneity [154]. Changes to the protonation states of ionizable residues will alter the net charge per residue and hence the predicted sequence–ensemble relationships. There is the possibility that some IDP sequences use pKa shifts to switch between conformational classes. This is relevant because IDP functions are often mediated through interactions involving linear peptide motifs that can be as short as three to five amino acids [155–166]. These motifs interact in trans with ordered protein domains or with motifs from other disordered regions. Shifts in pKa values of ionizable residues within disordered regions can lead to an increase in net charge per residue past the threshold values causing a globule-to-coil transition to expose hidden linear motifs that mediate binding with other protein domains. Reversing these transitions from coils to globules is also feasible through shifts in pKa values that decrease the net charge per residue thus leading to sequestration of linear motifs to facilitate the dissociation of complexes.
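The dependence of the net charge per residue on pKa values can be illustrated with a minimal sketch that assigns each ionizable sidechain a fractional charge from the Henderson–Hasselbalch relation. Everything specific in the sketch is an assumption made for illustration: the model pKa values are rough textbook-style numbers rather than measured values for any IDP, chain termini are ignored, each group is treated as titrating independently, and the function names are hypothetical.

```python
# Illustrative sketch: pH- and pKa-dependent net charge per residue.
# MODEL_PKA holds assumed, approximate sidechain pKa values; real values
# shift with sequence context, which is the effect discussed in the text.
MODEL_PKA = {"D": 3.9, "E": 4.3, "H": 6.5, "K": 10.5, "R": 12.5}
ACIDIC = frozenset("DE")


def residue_charge(residue, ph, pka):
    """Mean charge of one ionizable sidechain from Henderson-Hasselbalch."""
    if residue in ACIDIC:
        # Fraction deprotonated, carrying a charge of -1.
        return -1.0 / (1.0 + 10.0 ** (pka - ph))
    # Basic group: fraction protonated, carrying a charge of +1.
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def ncpr_at_ph(sequence, ph, pka_shifts=None):
    """Absolute mean net charge per residue at a given pH.

    pka_shifts optionally maps residue letters to replacement pKa values,
    mimicking a context-dependent shift. Termini are ignored (assumption).
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    pka = dict(MODEL_PKA)
    pka.update(pka_shifts or {})
    total = sum(residue_charge(res, ph, pka[res]) for res in seq if res in pka)
    return abs(total) / len(seq)


if __name__ == "__main__":
    seq = "EG" * 10  # hypothetical acidic tract
    print(ncpr_at_ph(seq, ph=7.0))
    print(ncpr_at_ph(seq, ph=7.0, pka_shifts={"E": 8.0}))
```

In this toy case, raising the assumed glutamate pKa from 4.3 to 8.0 at pH 7 moves the hypothetical tract from above the approximate 0.2 threshold to below it, which is the kind of switch between conformational classes contemplated above.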

Recent studies have also shown that disordered regions are under post-transcriptional control through tissue-specific alternative splicing [157]. These modifications might cause pKa shifts in ionizable groups by altering the sequence contexts of disordered regions in order to achieve tissue-specific control on the preferred conformational class. It is also well known that residues within disordered regions are the target of numerous post-translational modifications [167–180]. These chemical transformations, such as serine/threonine phosphorylation, are reversible through the synergistic action of kinases and phosphatases. Such transformations occur often in disordered regions of hub proteins in interaction networks and afford the prospect of rewiring networks by controlling the collective exposure of linear motifs within disordered regions. Clearly a systematic study of the impact of pH-dependent conformational changes and the influence of post-translational modifications on sequence–ensemble relationships will be extremely important for understanding the expanded repertoires of functions afforded by IDPs and disordered regions. Finally, most biophysical studies of IDPs treat these systems as isolated proteins. In fact a majority of these sequences function in cis with structured protein domains. Despite the obvious biological relevance, the influence of these naturally occurring contexts and the possible coupling between disordered regions and structured domains remain essentially unexplored, a shortcoming that needs to be addressed if we are to understand the functional implications of disordered regions within proteins.


A synergistic combination of computational and experimental biophysical investigations of archetypal systems is yielding important information regarding the sequence–ensemble relationships of IDPs. The language of polymer physics provides a coherent framework for quantifying and hence classifying these sequence–ensemble relationships. The collection of initial results gathered from different groups has rather serendipitously been quite systematic and hence distinct archetypes within the IDP sequence space have been interrogated. The main highlights of these initial investigations are two-fold. The hydrophobic effect can, at least in spirit, be generalized to include sequences rich in polar residues and deficient in canonical hydrophobic groups. Charge content and net charge per residue appear to be the main determinants of non-globular expanded coils. Unlike their ordered counterparts that form folded globules, their unique sequence/compositional preferences seem to afford IDPs the advantage of realizing a continuum of conformational transitions that might serve as the biophysical basis for their diverse functions. Continued investigations of the sequence–ensemble relationships and their connections to function will be of utmost importance for understanding how the functional repertoires of proteins have been expanded through the use of disordered proteins and regions.


The authors’ work is supported by the National Science Foundation and National Institutes of Health.


We thank our colleagues, S.L. Crick, R.K. Das and A. Vitalis for insightful discussions.

Abbreviations: EV, excluded volume; FRET, Förster resonance energy transfer; IDP, intrinsically disordered protein; NMA, N-methylacetamide; SAXS, small-angle X-ray scattering

