Combining Inference from Evolution and Geometric Probability in Protein Structure Evaluation

https://doi.org/10.1016/S0022-2836(03)00663-6

Abstract

Starting from the hypothesis that evolutionarily important residues form a spatially limited cluster in a protein's native fold, we discuss the possibility of detecting a non-native structure based on the absence of such clustering. The relevant residues are determined using the Evolutionary Trace method. We propose a quantity to measure clustering of the selected residues on the structure and show that the exact values for its average and variance over several ensembles of interest can be found. This enables us to study the behavior of the associated z-scores. Since our approach rests on an analytic result, it proves to be general, customizable, and computationally fast. We find that clustering is indeed detectable in a large representative protein set. Furthermore, we show that non-native structures tend to achieve lower residue-clustering z-scores than those attained by the native folds. The most important conclusion that we draw from this work is that consistency between structural and evolutionary information, manifested in clustering of key residues, imposes powerful constraints on the conformational space of a protein.

Introduction

A growing body of evidence indicates that in most proteins it is possible to identify from sequence analysis a limited set of evolutionarily privileged residues in close mutual contact on the native structure.1., 2., 3., 4., 5. To further support the claim that this sequence–structure connection is non-trivial, it is pertinent to show that a non-native protein conformation does not support an extensive network of critical residues. We would like to recast this goal as the problem of detecting the correct protein structure among a number of plausible candidates, using the requirement that the evolutionarily important residues form a cluster. A convenient testing ground can then be found in the field of protein structure prediction, where one approach is to computationally construct a large set of physically reasonable structures and then restrict that set through various filtering methods.6., 7. The restricted set may still contain thousands of structures, some of which are close to the conformation occurring under physiological conditions. If the clustering of key residues in viable proteins is a general and significant phenomenon, its extent should help us expand the set of constraints on the native structure.

Various solutions with excellent discriminating power address the problem of native structure selection. The common underlying idea is to find a function that assigns a numerical value to a structure and attains its optimum in the native fold.8., 9. The functions used so far tend to fall into two broad categories: those that estimate the physical energy of the structure,8., 10., 11. and those that assess the compatibility of the structure with known protein properties, such as the formation of a hydrophobic core in a polar solvent, or the interdependent evolution of residues in close contact on the folded chain.6., 12. Both categories can be further divided into methods that rely on statistical inference from accumulated protein data, and methods that strive to be self-contained and dependent only on a small number of parameters. Thus the energy methods range from ab initio calculations of microscopic interactions between residues and/or their constituent atoms, to “statistical potential” approaches, wherein the properties of the interactions are inferred by studying the large number of known protein conformations.13., 14. Similarly, compatibility methods can rely on the properties of the protein under scrutiny only, or on a function with multiple knowledge-based (database-dependent) parameters.15., 16. For example, Simons et al.15 have tackled the decoy recognition problem by applying a battery of such expectation-compliance tests, ranging from hydrophobic effect to local structure preferences.

A successful application of a database-independent scoring function is that of Huang et al.17., 18. The group used an efficient and straightforward method to detect the hydrophobic core and score the folds generated by molecular dynamics simulations, with the resolution approaching 1 Å root-mean-square distance (RMSD) from the native backbone conformation. We shall return to the method later, and compare it with our own.

Also relevant to this work is the observation by Bonneau et al.19 in their 2002 work that a modest improvement in the structure selection can be achieved by focusing on the proteins with high contact order,20 a simple quantity measuring the average distance between all the residues in contact.

Several suggestions stem from a complementary perspective of evolutionary constraints on the protein structure.21., 22., 23., 24., 25., 26., 27., 28. One is to use the possibility that the residues in contact on the protein must either be conserved,21., 22., 23. or they must evolve in a correlated way, compensating for each other's mutation and thus preserving the protein shape and/or function.28 That way the native folds may be singled out by optimally placing conserved or co-evolving residues in contact.29 This approach adds multiple sequence alignment information (from which the evolutionary information is usually extracted) to the physical information about the protein.

Our approach falls in the last broad category. It is based on the recurrent finding that evolutionarily important residues singled out by Evolutionary Trace (ET),26 termed trace residues, form a small number of large clusters in proteins.5 Until now trace clusters have been studied for their functional properties, namely, for their tendency to overlap functional sites in structures and map out binding interfaces and catalytic sites. This has led to accurate modeling of protein interactions and to efficient mutational studies of the molecular basis of function.30., 31., 32. The structural implications of trace clusters are striking in their own right, however, since the trace residues are defined from sequence analysis alone, without structural input. Their spatial clustering is therefore a demonstration of the linkage between sequence and structure through evolution, and it suggests that evolutionary determinant residues act in concert, the physical means of their cooperativity being direct physical contact. Moreover, ET studies show that the evolutionary connection among sequence, structure, and function is detectable through the study of trace residues, that it is quantifiable through measures of spatial clustering, and that it is commonly observed.4

If it is indeed the case that evolutionarily important residues cluster non-randomly, it should be possible to match the sequence with the appropriate structure by measuring the non-randomness of the clusters. Specifically, given a multiple sequence alignment of a protein and a set of its homologs, we want to estimate whether the alignment is consistent with a proposed structure, and discard the structures that do not satisfy our consistency criteria. To fulfill this agenda, we need to attach a quantitative meaning to consistency, and learn how to compare two estimates derived from different sequence sets and mapped onto different structures. Therefore we define a quantity, selection clustering weight (SCW), which attaches a numerical value to the (residue selection, clustering) pair. In defining the SCW we conjecture that the clusters are non-trivial because the involved residues are well separated on the unfolded chain. We then derive analytic expressions for the average and variance of the SCW in several supersets of such pairs. To illustrate the method, we evaluate the SCW for the ribonucleotide reductase case, and show that the resulting SCW exceeds the average SCW in the most intuitive of the supersets by more than ten standard deviations. Next we demonstrate that this is not an isolated case by studying the SCW behavior on the FSSP33 set of representative native protein structures. As a final test of our hypothesis, we use SCW as a filtering device for distinguishing putative (but incorrect) from actual protein structures. Misfolded structures indeed tend to score lower than the native ones. The method turns out to be comparable with, and complementary to, that of Huang et al.,17., 18. and the two could in principle be combined, thus improving their individual resolutions.
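The idea of scoring a (residue selection, structure) pair and comparing it with a random-selection ensemble can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the definition used in the article: the weight below simply sums the sequence separation |j - i| over contacting pairs of selected residues (one plausible way to reward contacts between residues that are far apart on the unfolded chain), and the average and standard deviation are estimated by random sampling rather than by the analytic expressions derived in the paper. The names `clustering_weight` and `z_score`, the contact list, and the toy chain are all hypothetical.

```python
import random
from statistics import mean, pstdev


def clustering_weight(selected, contacts):
    """Toy clustering weight (an assumption, not the article's exact SCW):
    sum of sequence separations |j - i| over contacting pairs (i, j)
    in which both residues belong to the selection."""
    chosen = set(selected)
    return sum(abs(j - i) for i, j in contacts if i in chosen and j in chosen)


def z_score(selected, contacts, n_residues, n_samples=2000, seed=0):
    """Z-score of the observed weight against random selections of the
    same size. The ensemble average and spread are estimated by sampling;
    the article obtains them analytically instead."""
    rng = random.Random(seed)
    observed = clustering_weight(selected, contacts)
    k = len(selected)
    samples = [
        clustering_weight(rng.sample(range(n_residues), k), contacts)
        for _ in range(n_samples)
    ]
    sigma = pstdev(samples)
    if sigma == 0:
        return 0.0  # degenerate ensemble: no spread to compare against
    return (observed - mean(samples)) / sigma


# Hypothetical 30-residue chain: chain neighbours are in contact, and
# residues 2, 15 and 28 are additionally brought together by the fold.
n_residues = 30
contacts = [(i, i + 1) for i in range(n_residues - 1)]
contacts += [(2, 15), (15, 28), (2, 28)]

clustered = [2, 15, 28]   # selection that clusters on the fold
dispersed = [0, 10, 20]   # selection with no mutual contacts

print(clustering_weight(clustered, contacts))
print(z_score(clustered, contacts, n_residues))
print(z_score(dispersed, contacts, n_residues))
```

In this toy setting the clustered selection scores far above the random-selection average, while the dispersed one does not, which is the kind of separation a z-score filter relies on.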

A practical outcome of our considerations is a straightforward, customizable, and fast filtering method, independent of a training set. But rather than providing a definitive algorithm, it is the aim of this article to point out that clustering is by itself sufficiently reliable to exclude from consideration a large fraction of decoy structures. Beyond the performance indicators of the method, this fact is a statement about a quantifiable geometric property of a critical subset of protein residues that reflects the evolutionary consistency between sequence and structure.

Section snippets

Theory

The assumption we are making here is that residues that are important for the protein (for whichever reason) form a non-random cluster on its structure. Among all possible pairs, consisting of residue selection and structure, it is possible to single out the pairs actually occurring in nature. Therefore, we need (i) a method to select important residues, and (ii) a scoring function to attach a numerical value to the pair (selection, structure). The methods we use are (i) the ET to detect the

Clustering example: ribonucleotide reductase beta chain

We begin our discussion by studying the behavior of the zS quantity on a single known protein structure. The purpose of this example is twofold: to illustrate the performance of the method under favorable circumstances (long, richly folded chain, many related sequences available from the database), as well as to further motivate the clustering as a phenomenon related to the active residues in the protein.

As our study case we choose the ribonucleotide reductase beta chain (Protein DataBank35

Conclusion

Here we investigated the possibility of sequence–structure matching through evolutionary consistency. To this end we devised a new residue clustering measure and calculated its average and variance in several relevant model ensembles. It enabled us to efficiently detect the clustering in most of the proteins in the FSSP database, thereby showing that this method can be used on large protein sets. By inverse reasoning, assuming that the clustering must be present in the native structure, we have

Sequence selection

To make the results reproducible and to avoid possible bias, we follow a well-defined protocol. Its rules are heuristic and emulate the manual sequence selection, which results in significant clustering and overlap of the residues selected by the ET with the epitope.4., 5. (i) To obtain a set of sequences homologous to the protein of interest, we use two rounds of psi-blast37 against the NCBI Entrez non-redundant protein sequence database, and accept all the sequences with the E-score<0.05.
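Step (i) amounts to a simple threshold filter on the search results. The sketch below assumes the hits have already been parsed into (identifier, E-score) records; the `Hit` type, the `accept_hits` helper, and the example identifiers are hypothetical and are not part of psi-blast or of the authors' pipeline.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One database hit; a hypothetical record for illustration."""
    sequence_id: str
    e_score: float


def accept_hits(hits: List[Hit], threshold: float = 0.05) -> List[Hit]:
    """Keep every hit whose E-score is strictly below the threshold,
    mirroring the 'E-score<0.05' acceptance rule in the text."""
    return [hit for hit in hits if hit.e_score < threshold]


# Made-up example hits (identifiers and scores are not real data).
hits = [
    Hit("seq_A", 1e-30),
    Hit("seq_B", 0.01),
    Hit("seq_C", 0.05),   # not strictly below the threshold: rejected
    Hit("seq_D", 2.0),
]
accepted = accept_hits(hits)
print([hit.sequence_id for hit in accepted])
```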

Acknowledgements

The authors thank Srinivasan Madabushi for the insightful discussions at the outset of the work, and Alessandra Carbone and Daniel Morgan for carefully reading the manuscript. This work was supported in part by NSF (DBI-0114796), AHA (American Heart Association), MOD (March of Dimes), and NIGMS (HG02051).
