## Abstract

We introduce various, recently developed, generalized ensemble methods, which are useful to sample various molecular configurations emerging in the process of protein–protein or protein–ligand binding. The methods introduced here are those that have been or will be applied to biomolecular binding, where the biomolecules are treated as flexible molecules expressed by an all-atom model in an explicit solvent. Sampling produces an ensemble of conformations (snapshots) that are thermodynamically probable at room temperature. Then, projection of those conformations to an abstract low-dimensional space generates a free-energy landscape. As an example, we show a landscape of homo-dimer formation of an endothelin-1-like molecule computed using a generalized ensemble method. The lowest free-energy cluster at room temperature coincided precisely with the experimentally determined complex structure. Two minor clusters were also found in the landscape, which were largely different from the native complex form. Although those clusters were isolated at room temperature, with rising temperature a pathway emerged linking the lowest and second-lowest free-energy clusters, and a further temperature increment connected all the clusters. This exemplifies that the generalized ensemble method is a powerful tool for computing the free-energy landscape, by which one can discuss the thermodynamic stability of clusters and the temperature dependence of the cluster networks.

- all-atom model
- free energy
- generalized ensemble
- molecular binding
- molecular dynamics
- molecular interaction

## INTRODUCTION

A protein–protein or protein–ligand complex is stabilized or destabilized by a variety of factors, and detailed atomic information on the intermolecular interactions provides a crucial key to understanding the complex formation from a microscopic point of view; this information could support drug discovery. Biomolecular complex formation is a process in which the biomolecules associate, starting from the dissociated state, and finally a biologically meaningful complex is formed. The experimentally determined complex structure may be the lowest free-energy state, which is equivalent to the thermodynamically most stable complex. However, recent studies have also shown that various complex forms, other than the most stable complex one, are generated in the binding process, such as encounter complexes [1], metastable complexes or fuzzy complexes [2], and these multiple complex forms may play a biologically/biophysically meaningful role. These complexes may be formed transiently with weak interactions, so we pose a question. Is there any method that can provide atomic information for the multiple complexes? In other words, is there any method that can assign free energies (i.e. stabilities) to the multiple complexes?

It is generally difficult to determine temporal complex structures in which the biomolecules are weakly interacting. Then an all-atom computer simulation such as molecular dynamics (MD) is a useful technique for quantifying those complex structures because the all-atom simulations can trace biomolecular motions at an atomic resolution at each moment of the complex formation. Figure 1 presents the processes of complex formation schematically, where semi-stable structural clusters are illustrated together with the most stable one. Imagine a simulation during which association and dissociation of molecules take place. The number of simulation snapshots assigned to a cluster relates to its stability (i.e. free energy): the more snapshots there are in a cluster, the lower the free energy assigned to it. The frequency of transitions from one cluster to another relates to the rate constant for the conformational change. Therefore, the simulation trajectory yields a diagram such as Figure 1.

A quantitatively evaluated diagram is called a free-energy landscape; it provides key information for discussing the complex formation in detail. As it is difficult to obtain an ensemble of snapshots sufficient to generate the free-energy landscape using conventional sampling methods, several powerful computational methods have been developed. One way to approach the free-energy landscape is to use a powerful computer such as ANTON [3,4] or MDGRAPE [5] to perform simulation over a long period. The second way is to integrate many simulation trajectories, where the trajectories generate a wide conformational distribution [6] or rate constant among conformational clusters [7,8], although each trajectory may cover only a small fraction of the whole conformational space. The third way is to use a generalized ensemble method [9,10]. In the present review, we focus on various generalized ensemble methods that have been applied or are applicable to biomolecular binding with an atomic resolution in an explicit solvent to obtain the free-energy landscape. Table 1 lists the generalized ensemble methods that are introduced in the present review. We note that these methods are also applicable to large molecular conformational changes, such as protein folding or intramolecular conformational transitions, by changing the computation object.

## SAMPLING ENHANCEMENT BY GENERALIZED ENSEMBLE METHODS

In computational biophysics, development of an effective conformational sampling method has been one of the central subjects. When force field parameters are assigned to constituent atoms of the system [protein(s), ligand(s) and solvents], the potential energy is computable to the protein system. In the present review, potential energy is simply called ‘energy’. Different conformations of the system have different energy. Therefore, an energy surface (Figure 2a) can in theory be constructed for the system. The problems are: the energy surface of the protein system has a large number of energy basins surrounded by energy barriers, and some of the conformational transitions among the basins are very slow processes. These difficulties arise as a result of the following factors: (i) the original conformational space is 3*n*-D for a system consisting of *n* atoms, and *n* is usually large; (ii) various types of interactions act among the constituent atoms; (iii) usually there is no structural symmetry in the system; and (iv) motions of an atom are influenced strongly by the surrounding atoms because they are densely packed.

When performing a molecular simulation at room temperature (denoted as *T*_{room}), the conformation of the protein is usually trapped in energy basins near the initial conformation of the simulation (Figure 2a). Note that most proteins exert their biological functions at room temperature, and many biological experiments have been performed at this temperature. To sample a wide conformational area, overcoming energy barriers and reaching the native complex structure (i.e. experimentally determined complex structure at *T*_{room}), a long simulation is then needed when the simulation starts from a dissociated conformation. A simple method of sampling various conformations without being trapped is a high-temperature simulation (Figure 2b). This method, however, generates conformations accessible only at the high temperature. Inversely, when the high-temperature simulation starts from the native complex structure, this complex dissociates eventually, and the transition probability that the dissociated molecules rebind again to the native complex is negligibly small. A requirement imposed on the sampling method is that the resultant ensemble should consist of conformations that are probable at *T*_{room} in equilibrium (conformations in the checked energy range in Figure 2b), even when the simulation starts from a dissociated conformation. We denote this ensemble as *Q*(*T*_{room})*.* The free-energy landscape is derived from *Q*(*T*_{room}).

The high-dimensional space needed to express the protein conformation is beyond our comprehension. Then, contraction of the high-dimensional space into a low-dimensional space is essential. Some parameters, such as relative molecular orientation of one molecule to another, separation distance between two molecules or root mean square deviation from a reference complex structure, are useful for constructing the low-dimensional space for viewing the conformational distribution. Suppose that two parameters, denoted as *s _{1}* and

*s*, are selected for the coordinate axes of the contracted space (here it is a 2D space). Then a set of parameters [

_{2}*s*

_{1},

*s*

_{2}] is calculated for all the conformations stored in

*Q*(

*T*

_{room}), and a probability distribution function

*P*

_{cano}(

*s*

_{1}

*,s*

_{2}

*,T*

_{room}) is computed. A ‘potential of mean force’ (PMF),

*F*(

*s*

_{1}

*,s*

_{2}

*,T*

_{room}), is formally defined as:

1

where *R* is the gas constant. Eqn 1 shows that a low PMF is assigned to a high-probability position. Note that high-probability regions in the conformational space are more stable thermodynamically than low-probability regions. In other words, the low-PMF regions are thermodynamically stable. Therefore, PMF is regarded as a free-energy landscape. The low-PMF regions are called ‘free-energy basins’, and the parameters *s*_{1} and *s*_{2} are called ‘reaction coordinates’. Figure 3 illustrates the relationship between *P*_{cano} and *F* in a 1D case. In a general case *P*_{cano} is converted as:

2

where both *P*_{cano} and *F* are expressed by *n* parameters.

As an enhanced sampling method, we first introduce umbrella sampling [11,12]. In this method, an appropriate reaction coordinate *p* is set in advance so that the variation of *p* reflects the change of complex form well. Figure 4a shows the *p* axis schematically, which connects the unbound (*P _{u}*) and the native complex structures (

*P*). The umbrella sampling method introduces bias functions known as ‘umbrella potential functions’ along the

_{n}*p*axis. A bias potential set at a bias centre (one of the filled or open circles in Figure 4a) forces the conformation to be confined within a narrow region around the bias centre during a simulation. Then, performing individual simulations at different bias centres at temperature

*T*

_{room}, one can obtain a number of biased distribution functions (Figure 4b). The purpose of the umbrella sampling is to obtain an entire distribution function

*P*

_{cano}(

*p, T*

_{room}) without the effects of the biased potentials in the full range [

*P*,

_{u}*P*] of the

_{n}*p*axis. Then a reweighting technique is applied to each of the biased distributions, and non-biased fragments of the entire distribution (Figure 4c) are obtained. Finally, smooth connection of the fragments produces

*P*

_{cano}(

*p, T*

_{room}) (Figure 4d). This connection technique is called a ‘weighted histogram analysis’ [13]. After the simulations, sampled snapshots are also reweighted and integrated into the conformational ensemble

*Q*(

*T*

_{room})

*.*

‘Adaptive umbrella sampling’ (AUS) [14,15] also introduces a bias function. However, this bias function is not for restricting the conformation in a narrow range of the reaction coordinate *p*. On the contrary, the bias assists the conformation to fluctuate smoothly in the range [*P _{u}, P_{n}*] (Figure 5a). In short, AUS is a method for enhancing the conformational fluctuations along the reaction coordinate. The bias potential

*E*

_{AUS}is given as:

3

Note that the second term of the right-hand side of this equation is PMF (see eqns 1 and 2). An MD simulation at *T*_{room}, where *E*_{AUS} is used for evaluation of interatomic forces (force *=−∇E*_{AUS})*,* produces a flat distribution function *P*_{cano}(*p, T*_{room}) along the *p* axis (Figure 5b) when the function *P*_{cano}(*p, T*_{room}) is accurate enough and the simulation long enough: *P*_{AUS}(*p, T*_{room}) ≈ constant.

Note that the distribution function *P*_{cano}(*p, T*_{room}) is unknown in advance, and this function is then refined iteratively through simulations [10]. After a simulation has been finished, *P*_{cano}(*p, T*_{room}) is updated from the simulation trajectory and the next simulation is started using the updated function, etc. Through the iterations, *P*_{AUS}(*p, T*_{room}) is flattened more and more, and when *P*_{AUS}(*p, T*_{room}) becomes sufficiently flat, we judge that *P*_{cano}(*p, T*_{room}) is sufficiently accurate. Finally, we perform a long simulation using the refined *P*_{cano}(*p, T*_{room}) and store snapshots. Then *Q (T _{room})* is generated, where a thermal weight at

*T*

_{room}is assigned to each of the stored snapshots (reweighting).

A method known as ‘metadynamics’ refines *P*_{cano}(*p, T*_{room}) within a simulation [16,17]. Thus, metadynamics is an iteration-free method and, therefore, suitable for automation of the simulation procedure [10]. Wang–Landau sampling [18], which enhances the conformational fluctuations along the energy axis as explained later, also refines a bias potential within a simulation.

Multi-canonical sampling was originally proposed for the study of the statistical properties of a physical model, the Potts model [19]. In this work, the Metropolis Monte Carlo algorithm was carried out to explore the conformational space. This method was then applied to biological systems [20–23], and extended to an MD scheme in which newtonian equations were solved in the cartesian coordinate space [24]. We denote this MD-based multi-canonical method as ‘McMD’. The adoption of the cartesian coordinates makes the sampling readily applicable to a multi-polypeptide system in an explicit solvent [25].

As with the AUS, multi-canonical sampling introduces an energy bias function (multi-canonical potential energy) as:

4

where *P*_{cano}(*E,T*_{room}) is the canonical energy distribution function at *T*_{room}. An MD simulation using atomic forces of force=-grad*E*_{MC} at *T*_{room} produces a flat distribution function *P*_{MC}(*E, T*_{room})≈*const* along the energy axis when the function *P*_{cano} (*E*, *T*_{room}) is accurate enough and the simulation long enough. Therefore, the multi-canonical sampling enhances the fluctuations along the energy axis: when the system is in a high-energy range (the shaded range in Figure 2c), the conformation can overcome energy barriers, and when it is in a low-energy range (the chequered range in Figure 2c), which corresponds to the room temperature range, the sampled conformations are accumulated into the ensemble *Q* (*T*_{room}).

Recently, trajectory parallelization has been combined with McMD [6,26], and applied to a system consisting of a fragment taken from an intrinsically disordered protein (IDP) and its partner protein in an explicit solvent [27]. Furthermore, a virtual system, which has arbitrary physical properties defined by a researcher, has been introduced to enhance sampling and couple with the biomolecular system [28]. This procedure was named ‘virtual-system coupled multi-canonical molecular dynamics’ (V-McMD). The V-McMD method was combined with the trajectory parallelization and applied to the p53 interdomain linker, which is an intrinsically disordered region of p53 for regulating p53–DNA interactions [29].

As well as the AUS, the distribution function *P*_{cano} (*E*, *T*_{room}) is unknown in advance. This function is then determined iteratively: when the *i*th simulation run has been finished, *P*_{cano}(*E*, *T*_{room}) is updated using a recurrent equation (see Higo et al. [10]), *E _{MC}* is refined using eqn 4, and the (

*i*+1)th run is performed. We call this procedure an ‘every-run’ update method. As mentioned above, the Wang–Landau sampling method [18] updates

*P*

_{cano}(

*E, T*

_{room}) at every step of the simulation. We call this updating method an ‘every-step’ update method. A force-biased McMD [30] is a method that is in between the every-run and every-step update methods: a long simulation can be regarded as a succession of simulation intervals (blocks). After a block has been finished,

*P*

_{cano}(

*E, T*

_{room}) is updated using the data in this block, and the simulation for the next block is started using the updated

*P*

_{cano}(

*E,T*

_{room}). We call this method an ‘every-interval’ update method, and it is also suitable for automating the simulation procedure.

Simulated tempering [31,32] is a method in which the temperature of the system changes as *T*_{1}→*T*_{2}→*T*_{3}→… during a simulation, satisfying the detailed balancing condition at temperature switching, and similar methods with simulated tempering have been proposed [33–35]. In this method, the temperature fluctuates, covering a range from *T*_{room} to high temperatures. When the temperature is elevated, the system overcomes energy barriers.

A method known as ‘temperature replica exchange method’ (tREM) [36,37] or ‘parallel tempering’ [38] introduces multiple systems (replicas), the chemical compositions of which are exactly the same, although the temperatures differ. These replicas evolve, according to the newtonian equations of motion (or Monte Carlo method) for a while, and occasionally exchange their temperatures, imposing the detailed balancing condition at exchange of temperature. Therefore, each replica experiences various temperatures during the time evolution. Importantly, replicas overcome energy barriers when their temperatures are high. One, at least, of the temperatures is set to *T*_{room} and, then, snapshots sampled at *T*_{room} are assembled in *Q*(*T*_{room}). Lyman et al. [39] expanded the replica exchange method in which replicas are expressed by different resolution models, such as all-atom and coarse-grained models, and the resolutions are exchanged among the replicas in a simulation. In fact, the quantity to be exchanged is arbitrary [40]; some exchange methods have been proposed: exchange of van der Waals’ radius [41], coulomb interactions [42] and hamiltonian (i.e. Hamiltonian exchange) [43].

A method called ‘λ dynamics’ [44,45] also enhances the conformational fluctuations along an arbitrary structural parameter (the reaction coordinate called λ), as with AUS. In λ dynamics, however, the reaction coordinate is treated as a dynamic quantity, i.e. λ, and its momentum, p_{λ}, is involved as a dynamic variable in equations of motion. Ikebe et al. [46] recently proposed an extended form of λ dynamics, adaptive λ square dynamics (ALSD), in which different weights are assigned to each energy term and the weights fluctuate as dynamic variables. ALSD is effective for sampling a biomolecular system in which very strong and weak interactions are mixed [47].

In most of conformational sampling methods based on the Monte Carlo scheme, the conformational changes are controlled by the detailed balancing condition. Recently, an algorithm, called the Suwa–Todo algorithm [48], has been proposed based on a balancing condition (non-detailed balancing condition). Probability fluxes may then occur in time evolution of the system's probability distribution function. Importantly, an equilibrated distribution is obtained finally in spite of the fact that the detailed balance condition is broken. Based on the Suwa–Todo algorithm, biomolecular sampling methods have been proposed [49,50], in which, although the method has a similar fashion to the replica exchange method, the exchange (or permutation) rule among the replicas obeys the Suwa–Todo condition. These methods may generate a new trend in conformational sampling.

Double density dynamics (DDD) is a sampling method in which an arbitrary parameter and its momentum are treated as dynamic variables in the equations of motion [51]. As a result, the parameter fluctuates in a given range. One may think that DDD is similar to the λ dynamics or ALSD mentioned above. In DDD, however, the statistics that microscopic states obey can be designed arbitrarily. Thus, one can optimize the sampling efficiency by modulating the statistics in theory.

## SEMI-STABLE CONFORMATIONAL CLUSTERS IN AN IDP SYSTEM

We mentioned in the Introduction that the generalized ensemble method elucidates not only the most thermodynamically stable state (the largest conformational cluster), but also semi-stable states, which may appear temporally in the molecular binding process. In the present review we introduce our recent computational study on an IDP interacting with its partner protein. IDP is a challenging system for examining the efficiency of the generalized ensemble methods because IDPs have larger conformational fluctuations than ordered proteins (regular proteins). Therefore, the free-energy landscape of an IDP consists of both the most stable and the semi-stable states.

An ordered protein has its own tertiary structure (native structure) determined by its amino acid sequence, which does not vary largely before and after complex formation. A unique tertiary structure is not assigned to the IDP when the IDP is in the unbound (isolated) state, and the unique structure is formed when it binds to its partner molecules [52–54]. This mechanism is known as ‘coupled folding and binding’ [54]. Therefore, the free-energy landscape of an IDP in the unbound state has no dominant cluster prevailing against the other clusters (Figure 6a). In the bound state, on the contrary, the landscape has the dominant cluster (the most stable complex) (Figure 6b).

We have computed the free-energy landscape of two systems: NRSF–mSin3 [27] and pKID–KIX [55] systems, in which NRSF and pKID are IDPs, and mSin2 and KIX their partner proteins. In the present review, we focus on the NRSF–mSin3 system. NRSF (the N-terminal repressor domain of neural restrictive silencer factor) is an IDP known to be an essential transcriptional repressor for neuron-specific genes in non-neuronal cells and neuronal progenitors, and mSin3 is its partner protein. The complex structure was solved using NMR [56]: a 15-residue segment of a NRSF fragment folded into a helix when bound to the cleft on the surface of the paired amphipathic helix (PAH) domain of mSin3. The regions of NRSF other than the 15-residue segment are disordered even in the complex structure.

In the McMD simulation, the 15-residue fragment and the PAH domain of mSin3 were treated. We denote the PAH domain of mSin3 simply as mSin3. In the initial conformation of the simulation, these two molecules were distant from each other in an explicit solvent. Furthermore, the conformation of the NRSF segment was disordered in advance (Figure 1c in Higo et al. [27]). After refinement of the multi-canonical energy *E*_{MC} (eqn 4) via iterative McMD simulations, the production runs were performed yielding the ensemble *Q*(300K). McMD of the single NRSF fragment (i.e. unbound state) was also performed with a similar simulation procedure. The initial conformation of the NRSF fragment was randomized in advance and put in an explicit solvent (Figure 1b in Higo et al. [27]).

Conformational clustering applied to *Q*(300K) has shown that the largest cluster (i.e. the most stable cluster) is the native-like complex cluster (Figure 7 in Higo et al. [27]). Thus, the McMD simulations provided reliable data, in the sense that the NMR complex was predicted correctly. In a high-energy range, the NRSF fragment distributed widely in space without a dominant structure (Figure 4A in Higo et al. [27]). On the other hand, at 300K, the NRSF segment was trapped into the NRSF-binding cleft of mSin3 (Figure 4B in Higo et al. [27]). Figure 7 demonstrates the conformational distribution for *Q*(300K) projected in an abstract conformational space. Note that a region with crowded dots (conformations) corresponds to a low free-energy (PMF) region (see eqn 2). Figure 7 shows that three domains of crowded dots exist and that the domains were spaced by broken lines labelled *b*_{1} and *b*_{2}, along which the dots distribute sparsely. These sparsely dotted regions correspond to free-energy barriers because the probability of conformation around the regions is low. The domains were discriminated well by the molecular orientation of the NRSF fragment in the NRSF-binding cleft of mSin3 (see Figure 7). Therefore the rearrangement of the molecular orientation of the fragment requires jumping over the free-energy barriers.

Figure 8 represents the conformational distribution of *Q*(300K) for the single NRSF fragment. A variety of conformations, such as helix, hairpin or bent, are seen in the conformational space, which means that the conformation fluctuates among these structures in solution at 300K: no dominant structure exists. Therefore, the NRSF fragment is disordered in the unbound state. It is interesting that most of the conformations in *Q*(300K) of the single NRSF system are found in the NRSF–mSin3 system, whereas the probabilities assigned to the conformations are different between the two systems (Figure 10 in Higo et al. [27]). The main feature for the complex state was a helix, although the single NRSF fragment adopts both the helix and the hairpin (Figure 3 in Higo et al. [27]).

From these results, we proposed a mechanism for the coupled folding and binding of NRSF (Figure 9): in the unbound regimen, NRSF fluctuates among various conformations. NRSF binds with the cleft of mSin3 when using these conformations, and non-native complexes are formed. The NRSF conformation moves in the bound regime. Otherwise the complex dissociates. Depending on the first formed non-native complex, the complex may overcome one or two free-energy barriers, and finally the native complex is formed. The McMD simulation has shown that the complex formation with the all-atom model is considerably complicated where the complex experiences various intermediates overcoming free-energy barriers.

## VIRTUAL-SYSTEM COUPLING

Above we introduced various enhanced conformational sampling methods. Now we introduce a ‘virtual system’, which couples with the biomolecular system (a ‘real system’) [57]. The entire system is the sum of the real and the virtual systems, which are specified by the coordinates ** x** for the real system (i.e. coordinates of the constituent atoms of the molecular system) and a state parameter for the virtual system. In a simulation, both the real and virtual systems move. Advantageously, one can set the virtual system arbitrarily. We trace the time development of the entire system rather than pursuing just the motions of the real system. Recently, the virtual system was integrated with McMD and AUS, abbreviated as ‘V-McMD’ [28] and ‘V-AUS’ [58], respectively. Although it may be difficult to understand the coupling between the real and virtual systems intuitively, the computational technique is simple as explained below (see Higo et al. [28] for details).

Imagine a virtual system, for which the state is specified by a discrete ordinal number *i* (‘virtual-state index’). We assume that when the virtual-state index is *i*, the energy of the real system *E*(** x**) is confined in an energy zone

*z*

_{i}as:

*z*=[

_{i}*E*

^{min }

_{i}] ⩽

*E*⩽

*E*

^{max }

_{i}]. Figure 10a represents a space constructed by the energy axis and virtual-state axis. Zones

*z*

_{i}_{−1}and

*z*(e.g.

_{i}*z*

_{1}and

*z*

_{2}) overlap each other as well as

*z*and

_{i}*z*

_{i+1}(e.g.

*z*

_{2}and

*z*

_{3}) do, but two zones,

*z*

_{i−1}and

*z*

_{i+1}, do not overlap because they are set as:

*E*

^{min }

_{i+1}−

*E*

_{i − 1}

^{max }> ε, where ε is a positive but infinitely small number. In time development of the entire system, the conformation of the real system varies according to the equations of motion, and the virtual-state index jumps

*i*to

*i*+1 or

*i*−1, which causes the variable energy range for the real system to be reset. If inter-virtual state transitions are exhibited in the simulation, the real system fluctuates within

*z*

_{i}, which means that only a narrow region of the conformational space is sampled (i.e. sampling efficiency is low).

We define the inter-virtual state transitions as follows: suppose that *E* is at the filled-circle position in Figure 10b. Then, the virtual state *i* may jump to the virtual state *i*+1 without changing ** x** of the real system. On the other hand, if

*E*is at the open-circle position, the virtual state may move to

*i*−1. At this point, we introduce a rule: during a time interval of [

*t, t+τ*], the virtual state number is fixed to

*i,*and

**moves according to the ordinary equations of motion confining**

*x**E*in

*z*. At time

_{i}*t+τ*the transition to

*z*

_{i−1}or

*z*

_{i+1}is achieved with transition probability

*ρ*(0≤ρ

_{t}_{t}≤1), at which

**does not move. Due to the arbitrary property of the virtual system, one can set the transition probability**

*x**ρ*and the interval τ, arbitrarily, which may increase the sampling efficiency [28,57]. Consequently, by travelling the virtual states, the real system fluctuates across the wide conformational space overcoming energy barriers. The detailed balance for this time development is theoretically well satisfied, as described in the Appendix. Introduction of the virtual system to AUS is explained in Higo et al. [58].

_{t}Recently, Moritsugu et al. [59,60] introduced ‘multiscale essential sampling’ (MSES), in which a protein system was expressed by an all-atom model, coupled with a coarse-grained model(s) to enhance conformational sampling. MSES was applied to protein–protein binding of a barnase–barstar system [61]. The free-energy landscape for association/dissociation demonstrated the existence of the non-native complex forms as well as the native complex. Although the methodological fashion of MSES is considerably different from the virtual-system coupling method, the two methods have a similarity in introducing non-realistic systems for coupling with the real system.

## FREE-ENERGY LANDSCAPE FOR DIMER FORMATION BY V-McMD

In the present review, as an example of a generalized ensemble method applied to a biomolecular complex formation, we show the free-energy landscape of homo-dimer formation of an endothelin-1 (ET1) derivative computed by V-McMD simulations. ET1 is a biomolecule of 21 amino acids, known as a strong vasoconstrictor of a vessel's smooth muscles [62–64]. This molecule is a potent drug target because it is related to many human diseases [65–69]. The tertiary structure was solved by NMR spectroscopy [70–72] and X-ray crystallography [73]. In either study, the N-terminal region adopts a strand and the middle region forms an α-helix. As two disulfide bonds link the strand and the helix, the tertiary structure is compact and stable regardless of its short polypeptide length.

ET1 aggregates at a concentration of 1–4 mM. To increase the solubility, the N-terminus was then extended by two charged amino acid residues, lysine and arginine [74], which exist in its precursor protein. This extended ET1 is denoted as KR-ET1. Unexpectedly and of interest, KR-ET1 has less activity than ET1 in spite of an increase in solubility. Then, X-ray crystallography [75] showed that KR-ET1 forms a homo-dimer (PDBID; 1t7h) (Figure 11a), in which the orientations of the two molecules are anti-parallel to each other, although the molecular tertiary structure is similar to that of the single ET1 structure. In this study, five amino acid residues at the C-terminus of KR-ET1 were removed because those residues are presumably disordered and exposed in solution. This truncated peptide of 18 amino acids (sequence: KRCSCSSLMDKECVYFCH) is denoted as KR-CSH-ET1. Figure 11a indicates that this complex is stabilized by three factors: intermolecular β-sheet, intermolecular hydrophobic stacking of phenylalanine side-chain rings and two intermolecular salt bridges. Therefore, the experimental study of Hoh et al. [75] reported that this homo-dimer structure has considerable stability.

The dimer formation of KR-CSH-ET1 is an appropriate target for assessing V-McMD because the complex form is discriminated well by two quantities: the mutual molecular orientation, , and the intermolecular separation distance, *r*_{12}, the exact definition of which is given later. Previously we performed V-McMD simulations for KR-CSH-ET1 dimer formation [28], where the two molecules were confined in a spherical droplet of an explicit solvent, and the 2D free-energy landscape was computed. The free-energy landscape at room temperature was predominantly composed of the crystallographic native complex structure. In the present review, we performed V-McMD of the KR-CSH-ET1 dimer formation with a periodic boundary condition, and computed the free-energy landscape.

The simulation system was generated as follows: one KR-CSH-ET1 was immersed at the centre of a periodic box (box size: 45^{3} Å^{3}; 1 Å=0.1 nm) filled by an explicit solvent; the other KR-CSH-ET1 was put at a position apart from the first molecule. The system consisted of 8706 atoms (580 atoms for the KR-CSH-ET1 molecules, 9 Na^{+}, 11 Cl^{−} ions and 2702 water molecules). The number of ions was determined to set the solution at a physiological salt concentration. Then a constant-pressure MD simulation was performed at room temperature and pressure of 1.0 atm, by which the initial conformation for V-McMD was prepared (resultant box size: 44.003683 Å^{3}) (Figure 11b). The V-McMD simulation procedure was similar to that for the previous study [28] as follows: first refinement of *E _{MC}* was done via iterative V-McMD simulations, and then a V-McMD simulation was performed to produce an entire conformational ensemble. The conformational ensemble

*Q*(

*T*

_{room}) was constructed by reweighting the snapshots in the entire ensemble.

In V-McMD the energy moves in a wide range as mentioned previously. KR-CSH-ET1 may then unfold when the system is elevated to a high-energy region. The purpose of the present review is to show the free-energy landscape for the molecular binding, and refolding of KR-CSH-ET1 is outside its scope. Thus, we restrained the tertiary structure of each KR-CSH-ET1 weakly by intramolecular restraint functions (see Higo et al. [28] for technical details). Thus, the translational and rotational motions of the KR-CSH-ET1 molecules were free to maintain their tertiary structures.

In the present review, we set *T*_{room} as 310K and obtained a canonical ensemble *Q*(310K).
Figure 12(a) demonstrates a free-energy landscape at 310K, presented in 2D by , and *r*_{12}, the exact definition of which is given in Figure 12(a). From Figure 12(a), the largest cluster (i.e. the lowest free-energy cluster) is native like, whereas the two KR-CSH-ET1 molecules are arranged anti-parallel and the three factors, mentioned above, stabilized the complex structure. We call this cluster the native-like cluster. The landscape provided two other clusters: in the second largest cluster (the second lowest free-energy cluster), the orientations of the two KR-CSH-ET1 molecules are approximately perpendicular to each other: ≈ 0 The third largest cluster (the third lowest free-energy cluster) was characterized by ≈ 1, which means that the relative orientations of the two molecules are parallel. In the solvent–droplet boundary condition [28], the free-energy landscape had only the native-like cluster. It is likely that the solvent–droplet boundary condition would result in a stronger pressure than 1 atm in the droplet centre due to the surface tension of the spherical droplet. This strong pressure could stabilize the well-packed conformation (i.e. native complex), and then probabilities assigned to the non-native complexes were diminished. The periodic boundary condition does not induce such an artificial excess pressure because there is no free surface in this treatment.

Figures 12(b) and 12(c) demonstrate free-energy landscapes at 350K and 370K, respectively. The largest and second largest clusters are connected at 350K, although the third largest cluster is still isolated. At 370K, the third largest cluster is connected to the second largest cluster. We presume that, once a non-native complex is formed in the cluster of ≈ 0, this complex can transition to the native-like cluster relatively readily. On the other hand, once a non-native complex has been formed in the cluster of ≈ 1, this complex should overcome a high-energy barrier to reach the native-like cluster via the cluster of ≈ 0. Otherwise, this complex dissociates, and the free KR-CSH-ET1 molecules may reassociate after that. Another interesting result from Figure 12 is that the second largest cluster at 300K and 350K is the third largest cluster at 370K.

We exemplify, in this section, that the generalized ensemble method is a powerful tool to compute the free-energy landscape, through which we can discuss the thermodynamic stability of the clusters, the cluster networks and the temperature dependence of the networks.

## CONCLUSION

In the present review, we have introduced various generalized ensemble methods, focusing in particular on methods applicable to biomolecular association/dissociation with an atomic resolution in explicit solvent to obtain the free-energy landscape. To make the methods useful for drug discovery, the methods should not only explain basic mechanisms for association/dissociation, but also predict the complex forms and their stabilities. From this point of view, the sampling methods are useful if they generate a free-energy landscape for complex formation in explicit solvent at atomic resolution. In the Introduction, we listed three approaches: fast computations, multiple simulation runs and generalized ensemble methods. Note that these approaches can be combined, e.g. trajectory parallelization has been used for multi-canonical sampling and AUS. The free-energy landscape obtained from the generalized ensemble method does not involve information on rate constants. However, from knowledge about the locations of many locally stable states, it is possible to look for the shortest path between a pair of conformational clusters, e.g. using the string method [76], and the activation energy overcoming the energy barrier can be computed. In addition, combination of the generalized ensemble methods with the rate-constant estimation method, using the Markov state model [7,8] among conformational clusters, may be useful.

In living matter, a number of biological molecules (proteins, metabolites and DNAs) crowd together in solution (water and ions). Recent studies reveal that crowding itself provides and/or enhances the biomolecules’ activities [77,78]. These large-scale studies, however, view the biomolecules, neglecting their atomic details. Therefore, detailed intermolecular information from the generalized ensemble methods is useful to complete the molecular picture for living matter.

Most MD simulations in current use are based on classical mechanics (newtonian dynamics). On the other hand, many important processes taking place in living matter, such as catalytic reactions, are quantum chemical. Thus, development of quantum-mechanical MD [79–83] is an important step for elucidating vividly biochemical reactions in living matter, although there is a way to go before being applied to large biomolecular systems. One of the goals of the generalized ensemble methods is to be coupled with the quantum-mechanical technique in the molecular crowding environment.

## FUNDING

J. Higo was supported by a Grant-in-Aid for Scientific Research on Innovative Areas [21113006] received from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan and by the development of core technologies for innovative drug development based on IT from the Japan Agency for Medical Research and Development (AMED); S. Iida and H. Nakamura were supported by the Japan Society for the Promotion of Science KAKENHI, Grant-in-Aid for Scientific Research on Innovative Areas [grant number 24118001].

## Acknowledgments

We thank Drs Ikuo Fukuda, Akira R. Kinjo and Bhaskar Dasgupta from Osaka University for helpful discussions.

## APPENDIX

As explained in the main text, the virtual-system coupling method allows two types of variations for the entire system (molecular system plus virtual system): the biomolecular conformational variations in real space and the virtual-state index variations. In this appendix, we show that these variations satisfy the detailed balance. We simplify the system as illustrated in Figure 13, which does not negate the generality of the discussion. In Figure 13, there are only two virtual states (index *v _{m}; m*=1,2) and the real space is divided into small bins to specify the site positions. Each site is denoted by

*i*(

*v*), where

_{m}*i*is the site ordinal number in a virtual state

*v*. The biomolecular conformational variations are intra-virtual state transitions shown by horizontal arrows in Figure 13, and the virtual-state index variations are inter-virtual state transitions shown by the vertical arrows. The probability assigned at

_{m}*i*(

*v*) is denoted as ρ

_{m}_{i(vm)}(

*t*), where

*t*is time.

Time development of ρ_{i(vm)}(*t*) by the intra-virtual state transi-tions is expressed formally as:

5

where *a*_{k(vm)→i(vm)} is the transition probability from *k*(*v _{m}*) to

*i*(

*v*) during Δτ. Then the term

_{m}*a*

_{k(vm) → i(vm)}ρ

_{k(vm)}(

*t*) is the probability flux from

*k*(

*v*) to

_{m}*i*(

*v*) during Δτ. Summation of the stay probability in

_{m}*i*(

*v*) and the outflux probabilities from

_{m}*i*(

*v*) is unity:

_{m}6

Then, eqn 5 is rewritten using eqn 6 as:

7

where:

8

Eqn 7 is applied to all the sites in each virtual state, and further time development is achieved by repeating this procedure *M* times: ρ_{i(vm)}(*t*)→ρ_{i(vm)}(*t*′), where *t*′=*M*Δτ.

Next, we consider the inter-virtual state transitions. Suppose that two sites, *i*(*v*_{1}) and *j*(*v*_{2}) belong to different virtual states and are connected by vertical arrows as in Figure 13. Then, variation of ρ_{i(v1)}(*t*′)ρ_{j(v2)}(*t*′) by the inter-virtual state transition is:

9

where *b*_{i(v1) → j(v2)} and *b*_{j(v2) → i(v1)} are the inter-virtual state transition probabilities from *i*(*v*_{1}) to *j*(*v*_{2}) and from *j*(*v*_{2}) to *i*(*v*_{1}) respectively. Two quantities, *b*_{i(v1) → i(v1)} and *b*_{j(v2) → j(v2)}, are the stay probabilities in *i*(*v*_{1}) to *j*(*v*_{2}), respectively. The probability invariants are given as:

10

Using the invariants, eqn 9 is rewritten as:

11

where

12

The asterisks in eqns 9, 11 and 12 are introduced to indicate clearly that time *t′* does not evolve by the inter-virtual state transitions because the real system does not move with these transitions. Eqn 11 is applied to all sites that are connected by vertical arrows in Figure 13.

In the virtual-system coupling method, the variation ρ_{i(vm)}(*t*) → ρ*_{i(vm)}(*t*+*M*Δτ) is achieved by repeating eqn 7 *M* times followed by eqn 11. Repetition of this procedure yields a stationary state as:

Below we determine the stationary state. First, remember that V-McMD or V-AUS is designed so that probabilities converge to a constant for each virtual state:

13

where the two constants *c _{1}* and

*c*are not necessarily the same because eqn 7 does not involve the inter-virtual state transitions. Eqn 13 is equivalent to the following equation:

_{2}14

where (*m*=1, 2). Next, remember that the inter-virtual state transition probabilities in V-McMD or V-AUS are set as:

15

where *m*=*m*′.

Then, we define the following probabilities:

16

From eqns 14 and 15, the set of probabilities defined by eqn 16 satisfies the following equations:

17

Therefore, eqn 16 defines the stationary state and eqn 17 shows that the detailed balance is satisfied (i.e. the stationary state is the equilibrium state).

In the above discussion, *M* was constant. However, *M* can vary arbitrarily in the sampling to reach the same equilibrium state (eqn 16). Besides, the discussion above does not alter for other systems with many more virtual states and an infinitely small bin size.

**Abbreviations:** ALSD, adaptive λ square dynamics; AUS, adaptive umbrella sampling; DDD, double density dynamics; ET1, endothelin-1; IDP, intrinsically disordered protein; McMD, MD-based multi-canonical method; MD, molecular dynamics; MSES, multiscale essential sampling; NRSF, N-terminal repressor domain of neural restrictive silencer factor; PAH, paired amphipathic helix; PMF, potential of mean force

- © 2016 The Author(s)

This is an open access article published by Portland Press Limited on behalf of the Biochemical Society and distributed under the Creative Commons Attribution Licence 4.0 (CC BY-NC-ND).