Carbohydrate-active enzymes must cope with a huge diversity of substrates in a highly selective manner while using only a limited number of available folds. They have therefore been subjected to multiple divergent and convergent evolutionary events. This, together with their frequent modularity, renders their functional annotation in genomes difficult in a number of cases. In the present paper, a classification of polysaccharide lyases (the enzymes that cleave polysaccharides using an elimination instead of a hydrolytic mechanism) is presented in detail for the first time. Based on the analysis of a large panel of experimentally characterized polysaccharide lyases, we examined the correlation of various enzyme properties with the three levels of the classification: fold, family and subfamily. The resulting hierarchical classification, which should help annotate relevant genes in genomic efforts, is available and constantly updated at the Carbohydrate-Active Enzymes Database (http://www.cazy.org).
- catalytic mechanism
- enzyme family
- functional annotation
- modular structure
- polysaccharide lyase
PLs (polysaccharide lyases) are a group of enzymes (EC 4.2.2.-) that cleave uronic acid-containing polysaccharides via a β-elimination mechanism to generate an unsaturated hexenuronic acid residue and a new reducing end at the point of cleavage (Figure 1). PLs are ubiquitous in nature, having been identified in organisms ranging from bacteriophages and Archaea, to Eubacteria and higher eukaryotes, such as fungi, algae, plants and mammals. For all of these organisms, PLs represent a complementary mechanistic strategy to the GHs (glycoside hydrolases; EC 3.2.1.-) for the breakdown of C-6 carboxylated polysaccharides (Figure 1), with the contrasting feature that PL-catalysed cleavage occurs without intervention of a water molecule. PLs are implicated in diverse biochemical processes, including biomass degradation, tissue matrix recycling and pathogenesis [2,4–9]. Moreover, the widespread use of polyuronic acids in the food and medical sectors makes PLs attractive as specific catalysts for the modification of substrates such as pectins, alginates and heparins in biotechnological applications [2,10–12].
The catalytic mechanism employed by PLs (Figure 2) can be broadly described as consisting of three events: (i) abstraction of the C-5 proton on the sugar ring of a uronic acid or ester by a basic amino acid side chain, (ii) stabilization of the resulting anion by charge delocalization into the C-6 carbonyl group, and (iii) lytic cleavage of the O-4:C-4 bond, facilitated by proton donation from a catalytic acid, to yield a hexenuronic acid (or ester) moiety at the newly formed non-reducing chain end [1,13]. Depending on the monosaccharide composition of the substrate and its conformation in the PL active site, the proton removed from C-5 and the departing oxygen on C-4 may lie either syn or anti to each other. This, in turn, imposes certain requirements on the position of active site groups and the possibilities for a stepwise or concerted elimination reaction (Figures 2A and 2B). Polysaccharide recognition in PLs is often dependent on the interaction of tightly held bivalent cations (often Ca2+), or positively charged amino acid side chains, with uronic acid groups in the substrate. Such cations may play an additional role in stabilizing the transient anion in the reaction pathway. The extent to which these molecular events are concerted, as well as the nature and individual contributions of the catalytic groups, in the mechanisms of specific enzymes have not been fully clarified, although significant advances have been made in a few cases (see [14,15] and references therein). Detailed structural information on the catalytic modules of PLs has been previously published.
In common with GHs, PLs frequently have multi-modular structures, in which the catalytic module can be appended to a variable number of ancillary modules such as CBMs (carbohydrate-binding modules) [17,18], other catalytic modules or modules with other functions (see below). Interestingly, many non-catalytic modules borne by PLs may also be appended to GHs. Following a full dissection of their modular organization, we have grouped the PLs into amino acid sequence-based families to provide a framework for structural and mechanistic studies. In the present paper we describe a hierarchical classification of PLs including subfamilies, families and clans/superfamilies, and we discuss the value of these levels for genome mining and functional prediction. This classification is implemented in the CAZy (Carbohydrate-Active Enzymes) Database (http://www.cazy.org).
Included and excluded enzyme classes
For the purpose of this family classification, the scope of the term PL is restricted to those enzymes that operate according to the general mechanisms described in Figures 1 and 2, to produce a terminal hexenuronic acid moiety by β-elimination. This is a clear distinction from the broader NC-IUBMB classification of carbon-oxygen lyases acting on polysaccharides into EC 4.2.2.- (http://www.chem.qmul.ac.uk/iubmb/enzyme/). In particular, the following enzymes are not included in the PL classification described in CAZy, as they are structurally and mechanistically more similar to the GHs:
(i) exo-α-(1,4)-D-glucan lyases (EC 4.2.2.13) cleave malto-oligosaccharides to produce 1,5-anhydro-D-fructose without the intervention of a water molecule. These enzymes are structurally similar to GH31 α-glucoside hydrolases, with which they are currently classified. Analogous to other GH31 enzymes, the first step in the catalytic mechanism involves the formation of a covalent glycosyl-enzyme intermediate. However, in α-glucan lyases this intermediate decomposes through a syn-elimination mechanism, rather than hydrolysis [13,20].
(ii) LTs (lytic transglycosylases) cleave the β-(1,4)-glycosidic bond between the N-acetylmuramic acid and the N-acetylglucosamine residues of peptidoglycan via a substrate participation mechanism, with no intervention of water, to yield a 1,6-anhydro sugar derivative. LTs are structurally and mechanistically closely related to lysozymes and are currently classified in GH families GH23, GH102, GH103 and GH104.
(iii) Levan and inulin fructotransferases (EC 4.2.2.16, EC 4.2.2.17 and EC 4.2.2.18) cleave fructo-oligosaccharides by intramolecular attack to yield various anhydro-fructodisaccharides. These enzymes are presently classified into GH91, along with a sequence-similar enzyme that hydrolyses the DFA III (α-D-fructofuranose β-D-fructofuranose 1,2′:2,3′-dianhydride) product of the EC 4.2.2.18 inulin fructotransferase [23,24]. As such, mechanistic commonality with GHs (and loosely with LTs) is predicted.
Family and subfamily groupings
The PL families were first built by searching for sequence homologues of experimentally characterized enzymes. To avoid the creation of a large number of families, distant homologues were assigned to existing families. These families have been presented online in the CAZy database since its launch in 1998, with the occasional creation of novel families subsequent to the experimental characterization of PLs with no or insufficient similarity to known families. Within families, subfamilies have been defined by procedures similar to those described for the large GH family GH13, which comprises a diversity of starch-active enzymes of similar structure.
Briefly, in each family the sequences were edited to isolate the catalytic domains to avoid interference from the presence or absence of additional modules. The catalytic domains were then subjected to a multiple sequence alignment using MUSCLE and a distance matrix was created using the BLOSUM62 substitution model. The distance matrix was then used as the input for an automatic analysis based on the SECATOR algorithm, which proposes the breakdown of the family into a number of subfamilies, based on a reconstructed phylogenetic tree. The robustness of the subfamilies was tested by a resampling approach whereby a proportion of the sequences were randomly removed from the sample. The clustering procedure was iterated typically 10000 times with random variations of the parameters of the automatic partitioning algorithm. The percentage of sequences removed from the sample was also picked randomly from 5 to 30% at each iteration. Sequences found in the same cluster over 80% of the time were assigned to the same subfamily. Finally, only subfamilies containing more than five members were retained in order to define significant subfamilies. Unassigned sequences will be subjected to a new round of analysis as more sequences become available.
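The resampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple single-linkage clusterer stands in for SECATOR (which is not reimplemented here), the distance cutoff range of 0.3 to 0.5 is a hypothetical stand-in for the "random variations of the parameters", and the function names are ours. What is taken from the text is the removal of 5 to 30% of the sequences per iteration, the 80% co-clustering criterion and the requirement for more than five members.

```python
import random
from itertools import combinations


def single_linkage(ids, dist, cutoff):
    """Stand-in clusterer (not SECATOR): merge sequences closer than `cutoff`."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if dist[frozenset((a, b))] < cutoff:
            parent[find(a)] = find(b)
    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def consensus_subfamilies(ids, dist, iterations=1000, support=0.8,
                          min_members=6, seed=0):
    """Group sequences that co-cluster in more than `support` of the resamplings."""
    rng = random.Random(seed)
    pairs = [frozenset(p) for p in combinations(ids, 2)]
    sampled = dict.fromkeys(pairs, 0)    # times both sequences were kept
    together = dict.fromkeys(pairs, 0)   # times they shared a cluster
    for _ in range(iterations):
        drop = rng.uniform(0.05, 0.30)   # fraction removed in this iteration
        dropped = set(rng.sample(ids, round(drop * len(ids))))
        keep = [i for i in ids if i not in dropped]
        cutoff = rng.uniform(0.3, 0.5)   # hypothetical parameter variation
        for a, b in combinations(keep, 2):
            sampled[frozenset((a, b))] += 1
        for cluster in single_linkage(keep, dist, cutoff):
            for a, b in combinations(sorted(cluster), 2):
                together[frozenset((a, b))] += 1
    # Convert co-clustering frequency to a distance; link pairs above `support`.
    freq_dist = {p: 1.0 - (together[p] / sampled[p] if sampled[p] else 0.0)
                 for p in pairs}
    groups = single_linkage(ids, freq_dist, 1.0 - support)
    # "More than five members" corresponds to min_members=6.
    return [g for g in groups if len(g) >= min_members]
```

Here `dist` is assumed to be a dictionary keyed by `frozenset` pairs of sequence identifiers, for example derived from an alignment-based distance matrix. Groups that pass the co-clustering criterion but have five or fewer members are simply dropped, mirroring the unassigned sequences mentioned above.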
RESULTS AND DISCUSSION
Modular structure of PLs
Carbohydrate-active enzymes frequently have a modular structure, in which a catalytic module carries one or more ancillary modules. PLs are no exception and there is a large variety of multi-modular PLs (Figure 3). Perhaps the most common situation is the occurrence of one or more CBMs in tandem with the catalytic PL module. However, other arrangements have been observed, such as the addition of domains that promote binding to other macromolecules, including SLH (S-layer homology) domains for cell attachment or dockerin modules for cellulosome assembly. Some PLs may even be arranged with an additional PL module or a complementary CE (carbohydrate esterase) module, as well as domains whose function is presently unknown (termed ‘X’ modules; Figure 3). The number of possible combinations of domains is in principle unlimited, and their presence poses a specific challenge for sequence-based family grouping and annotation. Whole genome annotations are particularly prone to false identification (and subsequent misleading functional annotation) due to spurious hits on ancillary modules common to two distinct proteins. Consequently, a systematic excision of the ancillary modules was performed prior to all sequence alignments of PLs, and indeed this approach is the principal modus operandi of the CAZy classification [19,32].
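As a minimal sketch of what such an excision amounts to in practice, the following Python fragment keeps only the catalytic PL module(s) of a protein once the module boundaries are known. The `Module` record, the 1-based inclusive coordinates and the toy sequence are assumptions made for illustration; they do not describe the actual CAZy tooling.

```python
from typing import List, NamedTuple, Tuple


class Module(NamedTuple):
    name: str   # e.g. "PL1", "CBM", "SLH", "X" (labels are illustrative)
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def excise_catalytic(sequence: str, modules: List[Module],
                     prefix: str = "PL") -> List[Tuple[str, str]]:
    """Return (module name, subsequence) for every module whose name starts with `prefix`."""
    return [(m.name, sequence[m.start - 1:m.end])
            for m in modules
            if m.name.startswith(prefix)]


# Toy protein: a CBM, then a PL1 catalytic module, then a module of unknown function.
protein = "AAAACCCCCCBBBB"
layout = [Module("CBM", 1, 4), Module("PL1", 5, 10), Module("X", 11, 14)]
catalytic_only = excise_catalytic(protein, layout)
```

Only the excised catalytic subsequences would then be passed on to alignment, so that a shared CBM or dockerin cannot produce a spurious hit between two otherwise unrelated proteins.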
Families and folds
In April 1999 there were approx. 100 PL sequences arranged in nine families. Since then, the number of PL sequences has increased approx. 20-fold, essentially owing to whole genome-sequencing projects. Thanks to the biochemical characterization of many novel PLs, the number of PL families has progressively grown over the years to reach 21 in 2010. The corresponding 11 years of structural biology have vastly expanded knowledge of the three-dimensional structures of PLs (a thorough review of the three-dimensional structure–function relationships of PLs has been published previously). Whereas only one of the initial nine families of PLs had a structural representative in 1999, the folds of only two (PL12 and PL17) of the 21 current PL families remain to be determined (Figure 4).
PLs show a large variety of fold types, ranging from β-helices to α/α barrels (Figure 4). The abundance of PL folds indicates that PLs have been invented more than once during evolution, from totally different scaffolds. The most extreme example of the convergent evolution of PLs is perhaps with PL1 and PL10 pectate lyases, in which the different folds carry an identically poised catalytic machinery that performs the same reaction on the same substrate. The plasticity of the active site of PLs to accommodate a variety of substrates is reminiscent of that of GHs. Interestingly, most of the PL folds are also found in GH families, an indication of possible common evolutionary origins between the two enzyme classes.
In addition to being well characterized at the three-dimensional level, PLs are well characterized biochemically: examination of the CAZy database (http://www.cazy.org) shows that more than 10% of the PLs in the database have been biochemically (kinetically) characterized, which is the highest proportion among all of the classes of carbohydrate-active enzymes described in CAZy [GHs, GTs (glycosyltransferases), PLs and CEs]. This wealth of biochemical data indicates that most PL families group enzymes with diverse substrate specificities (Table 1). This situation has been previously observed for other CAZyme classes, especially the GHs and GTs. One probable explanation for this phenomenon is that the number of available protein folds is considerably smaller than the number of carbohydrate structures and hence nature has adventitiously tuned existing scaffolds for exquisite substrate specificity.
Less immediately apparent, however, are the structural similarities among the various substrates processed by individual family members. As an example, family PL8 can be considered polyspecific, as it groups together three different enzyme activities: hyaluronate lyase (EC 4.2.2.1), xanthan lyase (EC 4.2.2.12) and chondroitin AC lyase (EC 4.2.2.5). Here, the common names of these polysaccharides belie the fact that these three types of enzymes act at the same position on the same sugar, i.e. they cleave the C-O bond at position 4 of unsubstituted glucuronic acid in the backbone (Figure 5). What differentiates the three substrates is the substituent attached to the 4-oxygen of the glucuronic acid, a situation similar to, for instance, GHs that exhibit aglycone specificity. The three enzymes therefore can, and in the case of PL8 do, utilize an identical catalytic machinery to cleave their respective substrates.
The functional prediction (i.e. substrate specificity) of thousands of putative carbohydrate-active enzymes derived from genome data is highly desirable, but requires a direct unequivocal relationship between sequence groupings and substrate specificity. Because the sequence-based families of PLs generally do not correlate with the fine substrate specificity, as described above, we have examined the definition of subfamilies to assess whether functional grouping and prediction could be improved. A similar approach was previously applied to the large polyspecific GH13 family of α-amylase-related enzymes, in which most of the sequence-derived subfamilies were indeed found to correspond to a single enzyme activity.
With the sequence data available to date, we were able to break down the 21 PL families into a total of 41 subfamilies covering 72% of all sequences analysed (Table 1). The sequences that could not be assigned to subfamilies will most likely generate new subfamilies as more sequences become available in the future. The subfamilies are identified by an Arabic numeral following the family identifier; for instance, PL5_1 designates subfamily 1 within family PL5. As shown in Table 1, the vast majority of subfamilies have at least one representative that has been characterized with respect to substrate specificity; only seven subfamilies are lacking an experimentally characterized member. Depending on the subfamily, the cumulated biochemical characterization data vary from low (e.g. subfamilies PL3_1 and PL4_2) to high (e.g. PL1_5, PL1_6 and PL5_1). These variations can have a profound effect on any subsequent functional predictions based on subfamily membership, since reliability obviously depends (i) on the number of characterized enzymes per subfamily and (ii) on how thoroughly and reliably each characterization was performed.
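For readers who handle these identifiers programmatically, the naming convention can be parsed as follows. This is a small sketch; the function name and the strictness of the pattern (the letters PL, digits, then an optional underscore and digits) are our assumptions rather than a published grammar.

```python
import re
from typing import Optional, Tuple

# Family identifier "PL<number>", optionally followed by "_<subfamily number>".
_PL_ID = re.compile(r"^(PL\d+)(?:_(\d+))?$")


def parse_pl_id(identifier: str) -> Tuple[str, Optional[int]]:
    """Split e.g. 'PL5_1' into ('PL5', 1); a bare family gives (family, None)."""
    match = _PL_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a PL family or subfamily identifier: {identifier!r}")
    family, subfamily = match.groups()
    return family, (int(subfamily) if subfamily is not None else None)
```

A sequence that could not be placed in any subfamily would carry only the family identifier, which this sketch represents with `None` in the second position.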
We observe that of the 41 subfamilies identified here, 37 (90%) appear to be monospecific, thus indicating that the subfamilies correlate with substrate specificity significantly better than at the family level. Only three subfamilies remained apparently polyspecific (i.e. grouping enzymes with different EC numbers): PL1_5, PL9_1 and PL14_3. These three subfamilies were further inspected to identify the origin of their polyspecificity. In the case of subfamilies PL1_5 and PL9_1, the apparent polyspecificity is due to the presence of both endo-acting (EC 4.2.2.2) and exo-acting (EC 4.2.2.9) polygalacturonate lyases. These two types of enzymes have exactly the same substrate specificity and differ only in the degree of polymerization of the released products. As with other types of carbohydrate-cleaving enzymes, the basis of endo- compared with exo-activity within a family is typically due to subtle details in the three-dimensional structure of the enzymes, and rigidly distinguishing the two activities can be tricky. In the case of subfamily PL14_3, the apparent polyspecificity is associated with the presence of both poly- and oligo-alginate lyases (EC 4.2.2.3 and 4.2.2.- respectively). Here again, the difference is subtle: the bond cleaved is identical, and the difference in the definitions of the activities pertains to the degree of polymerization of the substrate. It may well be that such a difference is not biologically significant or, if it is, sequence data alone will never be able to sort one from the other.
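The monospecific/polyspecific distinction used here reduces to counting distinct EC numbers among the characterized members of each subfamily. The sketch below illustrates that bookkeeping; the input format and the toy records are assumptions for illustration and are not taken from Table 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def classify_subfamilies(characterized: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Label each subfamily by the number of distinct EC numbers it contains.

    `characterized` holds (subfamily, EC number) pairs for experimentally
    characterized enzymes only; subfamilies without such members do not appear.
    """
    ec_numbers = defaultdict(set)
    for subfamily, ec in characterized:
        ec_numbers[subfamily].add(ec)
    return {subfamily: ("monospecific" if len(ecs) == 1 else "polyspecific")
            for subfamily, ecs in ec_numbers.items()}


# Toy records (hypothetical subfamily labels): one subfamily mixes endo- and
# exo-acting pectate lyases, the other contains a single activity.
records = [
    ("PLa_1", "4.2.2.2"),
    ("PLa_1", "4.2.2.9"),
    ("PLb_1", "4.2.2.2"),
    ("PLb_1", "4.2.2.2"),
]
labels = classify_subfamilies(records)
```

As the discussion of PL1_5, PL9_1 and PL14_3 shows, a "polyspecific" label obtained this way is only a flag for manual inspection: differing EC numbers do not necessarily imply a different bond cleaved.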
Occurrence of PLs in genomes
We entered the genomic era approx. 15 years ago and the current pace of genome release is of the order of 1–2 per day. Next-generation sequencing will boost this flow of sequence data even further. Our analyses of more than 1300 genomes from diverse organisms, ranging from Archaea to higher plants and animals, show that the number of PLs is usually low and consistently lower than that of GHs (representing 3–5% of the number of GHs). The most likely explanation for this observation is that the substrates of PLs, polysaccharides containing uronic acids, represent just a small proportion of all carbohydrate polymers. The organisms that have the largest number of PLs share a common focus: the plant cell wall. The genomes of both plants and micro-organisms that feed on living or dead plant tissue (phytopathogens or saprophytes respectively) typically encode large numbers of PLs (Table 2). The abundance of PLs in plants is due to the emergence of large multigene CAZy families and the biological importance of pectin in plant development. The pectic network contributes to the structural integrity of the plant cell wall and, as such, it is an obvious target for phytopathogens and symbionts (including bacteria, fungi, oomycetes and nematodes) to gain access via an arsenal of pectinolytic enzymes. Moreover, because it is far more digestible than cellulose and lignin, pectin is also a delicacy for most saprophytic organisms, which draw nutrients from decaying plant material.
Recommendations for large-scale sequence annotation
Next-generation sequencing machines will deliver ever more sequences, whose utility largely depends on our ability to correlate them with molecular functions. The hierarchical classification that we advocate here, based on fold, family and subfamily, provides a convenient way to produce the best possible functional assignments that take into account the distance from experimentally characterized enzymes.
At the most general end of the spectrum, very distant similarity [such as that resulting from PSI-BLAST analyses or the use of degenerate HMMs (hidden Markov models)] should be used only to assign a protein to a folding class and not to a function. For example, Stam et al. have shown that a PSI-BLAST search starting with a β-helical polygalacturonase (EC 3.2.1.15) of family GH28 retrieved β-helical pectate lyases of family PL1 and dextranases of family GH49 after only two iterations, despite the fact that these enzymes employ distinctly different catalytic mechanisms. Although this may reflect ancient evolutionary events, the detection of such distant similarities is of little use when it comes to anticipating a molecular function.
The next level is the assignment to a family. This is typically reflected by significant BLAST scores over the entire length of the catalytic module (not the entire protein, which may contain multiple ancillary modules, see above). Here, similarity is sufficient to predict a global PL function, especially if the catalytic residues are conserved in the sequence under consideration. Even though the PL families are often polyfunctional, commonalities between the various substrates known to be cleaved by family members can guide experimental design to determine the actual specificity of novel enzymes.
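One way to make the "entire length of the catalytic module" criterion concrete is to measure how much of the module a pairwise alignment actually spans. The following Python sketch does this from alignment and module coordinates on the characterized reference sequence; the function names, the coordinate convention (1-based, inclusive) and in particular the 90% default threshold are hypothetical choices for illustration, not values used by CAZy.

```python
def module_coverage(hit_start: int, hit_end: int,
                    module_start: int, module_end: int) -> float:
    """Fraction of the catalytic module spanned by the aligned region."""
    overlap = min(hit_end, module_end) - max(hit_start, module_start) + 1
    return max(0, overlap) / (module_end - module_start + 1)


def spans_catalytic_module(hit_start: int, hit_end: int,
                           module_start: int, module_end: int,
                           min_coverage: float = 0.9) -> bool:
    """True if the hit covers at least `min_coverage` of the module (hypothetical cutoff)."""
    return module_coverage(hit_start, hit_end,
                           module_start, module_end) >= min_coverage
```

For a reference protein whose catalytic module occupies residues 50 to 300, an alignment spanning residues 40 to 300 passes, whereas a hit restricted to residues 1 to 60 (for example, on an N-terminal ancillary module) does not, however significant its score.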
Finally, the most fine-grained annotation is reached at the other end of the spectrum when a sequence can be assigned to one of the defined PL subfamilies. Two cases will arise. (i) The subfamily to which the sequence can be assigned contains one or several experimentally characterized members (the more the better). Here the function of the query protein can reasonably be assigned, for instance ‘putative hyaluronate lyase’. (ii) The query protein belongs to a non-characterized subfamily, or does not belong to any defined subfamily. Here the precise substrate cannot be predicted with confidence, and the best possible annotation is simply ‘putative polysaccharide lyase’.
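The three levels of the hierarchy can be summarized as a simple decision rule. The sketch below is one possible encoding, written in Python under stated assumptions: the function name, the dictionary of characterized activities per subfamily and the toy subfamily label are ours, and the fallback to the generic label when a subfamily contains more than one characterized activity is a conservative choice of this sketch rather than a rule stated in the text.

```python
from typing import Dict, Optional, Set


def annotate(family: Optional[str],
             subfamily: Optional[str],
             characterized: Dict[str, Set[str]]) -> Optional[str]:
    """Return the most specific annotation the evidence supports.

    `characterized` maps a subfamily to the activity names established
    experimentally for its members (empty or absent if none).
    """
    if family is None:
        # Fold-level similarity only: assign a folding class, not a function.
        return None
    activities = characterized.get(subfamily, set()) if subfamily else set()
    if len(activities) == 1:
        # Case (i): characterized subfamily with a single known activity.
        return f"putative {next(iter(activities))}"
    # Case (ii): uncharacterized or undefined subfamily (or, by assumption
    # in this sketch, a subfamily with several distinct activities).
    return "putative polysaccharide lyase"


# Toy knowledge base; the subfamily label and its activity are illustrative only.
known = {"PLx_1": {"hyaluronate lyase"}}
```

With this toy knowledge base, `annotate("PLx", "PLx_1", known)` yields 'putative hyaluronate lyase', whereas a family member outside any defined subfamily, `annotate("PLx", None, known)`, yields only 'putative polysaccharide lyase'.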
One consequence of the above hierarchy is that functional predictions should be dynamic, varying as biochemical data accumulates in the various subfamilies. Additionally, we suggest that an EC number should only be assigned to the query protein and included in public databases when, and only when, the precise substrate specificity has been established experimentally, to avoid unchecked propagation of erroneous assignments. In general, we advocate a conservative approach to functional assignment based on sequence analysis, guided by the mantra that no annotation is better than a misleading annotation.
Vincent Lombard performed the analysis of PL sequences; Thomas Bernard, Corinne Rancurel and Pedro Coutinho constructed computer tools to analyse the PL sequences; Harry Brumer, Pedro Coutinho and Bernard Henrissat reviewed data; Bernard Henrissat and Pedro Coutinho designed research; Vincent Lombard, Harry Brumer and Bernard Henrissat wrote the paper.
This work was supported by Novozymes; the Swedish Research Council Formas, the Swedish Research Council, Vetenskapsrådet and the Wallenberg Wood Science Center (to H.B.).
Abbreviations: CBM, carbohydrate-binding module; CE, carbohydrate esterase; GH, glycoside hydrolase; GT, glycosyltransferase; LT, lytic transglycosylase; PL, polysaccharide lyase
- © The Authors Journal compilation © 2010 Biochemical Society