Regular article
FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties1

https://doi.org/10.1006/jmbi.2001.4762Get rights and content

Abstract

FUGUE, a program for recognizing distant homologues by sequence-structure comparison (http://www-cryst.bioc.cam.ac.uk/fugue/), has three key features. (1) Improved environment-specific substitution tables. Substitutions of an amino acid in a protein structure are constrained by its local structural environment, which can be defined in terms of secondary structure, solvent accessibility, and hydrogen bonding status. The environment-specific substitution tables have been derived from structural alignments in the HOMSTRAD database (http://www-cryst.bioc.cam.ac.uk/homstrad/). (2) Automatic selection of alignment algorithm with detailed structure-dependent gap penalties. FUGUE uses the global-local algorithm to align a sequence-structure pair when they greatly differ in length and uses the global algorithm in other cases. The gap penalty at each position of the structure is determined according to its solvent accessibility, its position relative to the secondary structure elements (SSEs) and the conservation of the SSEs. (3) Combined information from both multiple sequences and multiple structures. FUGUE is designed to align multiple sequences against multiple structures to enrich the conservation/variation information. We demonstrate that the combination of these three key features implemented in FUGUE improves both homology recognition performance and alignment accuracy.

Introduction

The recognition of homology between protein sequences and known structures offers invaluable information in understanding the biological behaviour and biochemical function of uncharacterised sequences, and allows prediction of three-dimensional (3D) structures by means of comparative modelling1. The demand of automatic sequence-structure homology recognition methods is highlighted by the gap between rapidly accumulating sequence information, demonstrated by the recent completion of some bacterial and eukaryotic genomes and the draft of the human genome sequence, and experimentally determined structure information growing at relatively slow rate.

The objective of sequence-structure homology recognition is to identify homologous (i.e. divergently evolved) sequence-structure pairs in order to establish both structural and functional relationships. According to the SCOP database2, 3, where proteins are classified hierarchically to reflect both structural and evolutionary relationships, proteins in the same family have “clear evolutionary relationship”; proteins that share only the same superfamily have “probable common evolutionary origin”; those classified as sharing only the same fold are of “major structure similarity” and mostly “do not have evolutionary relationships”4. Thus, sequence-structure homology recognition is different from fold recognition: the former aims at recognising the similarity at both family and superfamily levels while the latter covers all three levels, with particular emphasis on the fold level.

Four major groups of automatic methods, have been used for sequence-structure homology recognition, although they may not be explicitly designed for this purpose. (1) Conventional pairwise sequence comparison methods 5, 6, 7, 8 are capable of recognising close homologues. (2) Conventional profile and hidden Markov model (HMM) methods that use only sequence information 9, 10, 11, 12, 13 are more sensitive to detecting distant homologues by using multiple sequence information to construct profiles/HMMs. (3) Conventional profile and HMM methods that use both sequence and structure information14, 15, 16, 17, 18, 19 generally help to improve the performance of homology recognition by taking into account the extra structural information. Our current work, implemented in the program FUGUE, falls into this category. (4) Threading methods20, 21 fit a probe sequence onto the backbone of a known structure and evaluate the compatibility by using knowledge-based potentials22. These methods concentrate on only the structural similarity and are capable of recognising sequence-structure pairs with fold-level similarity, but may not work as well as conventional profile/HMM methods in recognising homologous pairs. For instance, the performance of THREADER21 in homology recognition has been shown to be worse than that of many profile/HMM-based methods23. An attempt to combine (2) and (4) has been reported recently24, where the addition of multiple sequence information was shown to improve the performance of threading.

The key features of our current work include (1) improved environment-specific substitution tables, (2) automatic selection of alignment algorithm with detailed structure-dependent gap penalties and (3) combined information from both multiple sequences and multiple structures.

Each residue in a protein tertiary structure stays in a particular structural environment, which can be described by the structural features such as main-chain conformation, solvent accessibility and hydrogen bonding status. It has been demonstrated that the amino acid substitution is constrained by structural environments and each environment has a distinct substitution pattern25. Thus, environment-specific substitution tables offer a more precise and discriminating measure of substitution probabilities, compared to traditional substitution tables26, 27, 28, 29 that do not use any structural information. Environment-specific (or simply secondary structure-specific) substitution tables have been used to improve secondary structure prediction30, 31 and fold recognition15, 17, 32.

Following the studies of Johnson et al. and Overington et al.15, 25, we have introduced, in our current work, a new and improved method for deriving environment-specific substitution tables and systematically optimised various parameters. New features include a clustering scheme analogous to that used in the BLOSUM matrices29 and a smoothing procedure to correct data sparsity. Definitions of various environments were examined. To reduce the bias caused by non-structural constraints, certain positions of the alignments were “masked” and excluded from the calculation of substitution counts. For instance, the conservation of residues in catalytic sites or substrate/ligand binding sites is usually caused by functional, rather than structural, constraints25. Substitution masks were also applied to treat domain-domain interactions and protein-heteroatom interactions.

The dynamic programming method has been adopted by most sequence-structure comparison programs. There are three popular variations of the method: the global alignment algorithm33, the local alignment algorithm 8 and the global-local alignment algorithm that is based on the global algorithm but does not penalise the terminal hangout of either the sequence or the structure. A flexible alignment algorithm, similar to the idea used in CLUSTALW34, has been implemented in FUGUE, which selects the most suitable method according to the length difference between the sequence and the structure to be aligned.

In general, residues in secondary structures are more conserved than those in coil regions (i.e. irregular structure regions)35, 36, and solvent inaccessible residues are more conserved than accessible ones25, 37, 38, 39, 40. Thus, insertions/deletions are less likely to occur in secondary structure regions and solvent inaccessible regions. To account for this observation, we developed a structure-dependent gap penalty selection algorithm, which calculates a gap penalty for each residue according to its position relative to secondary structure elements (SSEs) and its solvent accessibility. If a residue is located within an SSE, the conservation of the whole SSE41 is also considered. Gap penalties for insertions and deletions are also distinguished. Similar methods have been implemented in some alignment algorithms such as CLUSTALW and a variation of ALIGN42. We define gap penalty variations more precisely, since CLUSTALW calculates penalties according to estimated residue accessibility, which is less accurate, and ALIGN considers only secondary structure information and ignores accessibilities.

It is well known that, multiple sequence information, which is widely used nowadays, enhances both homology recognition 9, 13 and sequence alignment34. However, multiple structure information has not yet been implemented in most sequence-structure comparison algorithms. We derive structural profiles from homologous structural alignments in the HOMSTRAD database43. The query sequence is pre-processed by PSI-BLAST 9 and the alignment between the query and its sequence homologues is used by FUGUE to search against the structural profiles. Such a procedure enables us to take advantage of both multiple sequences and multiple structures. A recent fold recognition algorithm, 3D-PSSM18, also makes use of both multiple sequences and structures. While 3D-PSSM uses completely automatic alignments of proteins sharing superfamily-level similarity, FUGUE uses individually validated structural alignments of proteins sharing family-level similarity.

A critical issue in sequence-structure homology recognition is the alignment accuracy. Alignments of distant homologues generated by automatic procedures, especially by threading methods, can be very inaccurate44, which is a key problem in comparative modelling45, 46. Here we demonstrate the usefulness of the three key features implemented in FUGUE in terms of improvements in both the recognition power and the alignment accuracy.

Section snippets

Overview

Our sequence-structure homology recognition algorithm consists of three stages, as outlined in Figure 1. The first step (broken arrows) is to construct environment-specific amino acid substitution tables by using homologous structure alignments that are based on high-resolution 3D coordinates. The second step (continuous arrows) is to generate a database of structural profiles, or a profile library, from individual structures or structural alignments using the environment-specific amino acid

Gap penalty determination

More than 100,000 combinations of the penalties were generated and the resulting alignment accuracy was assessed using 72 structural pairs. The optimal combination, which offered the highest overall accuracy, was taken as the default by FUGUE (see Methods for definitions of symbols): Initation penalties:Hi=28,Li=22,VLi=20,VVLi=8. Extension penaltiesHe=4,Le=4,VLe=2,VVLe=2. Note the corresponding substitution tables were calculated in one-third bits.

In order to achieve “good performance”, defined

Conclusions

The three key features in FUGUE, (1) improved environment-specific substitution tables, (2) automatic selection of alignment algorithm with detailed structure-dependent gap penalties and (3) combined information from both multiple sequences and multiple structures, resulted in the improvements of both homology recognition performance and alignment quality.

FUGUE recognised 8 % more homologues at the 95 % specificity level in the SCOP474 jack-knife benchmark than PSI-BLAST, and ranked 20 % more

Acknowledgements

J.S. is supported by Overseas Research Student awards and Cambridge Overseas Trust.

References (64)

  • E. Lindahl et al.

    Identification of related proteins on family, superfamily and fold level

    J. Mol. Biol.

    (2000)
  • A.R. Panchenko et al.

    Combination of threading potentials and sequence profiles improves fold recognition

    J. Mol. Biol.

    (2000)
  • H. Wako et al.

    Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. II. Secondary structures

    J. Mol. Biol.

    (1994)
  • S.B. Needleman et al.

    A general method applicable to the search for similarities in the amino acid sequence of two proteins

    J. Mol. Biol.

    (1970)
  • S. Pascarella et al.

    Analysis of insertions/deletions in protein structures

    J. Mol. Biol.

    (1992)
  • F.C. Bernstein et al.

    The Protein Data Banka computer-based archival file for macromolecular structures

    J. Mol. Biol.

    (1977)
  • A. Sali et al.

    Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming

    J. Mol. Biol.

    (1990)
  • A. Sali et al.

    Comparative protein modelling by satisfaction of spatial restraints

    J. Mol. Biol.

    (1993)
  • C.M. Topham et al.

    Fragment ranking in modelling of protein structure. Conformationally constrained environmental amino acid substitution tables

    J. Mol. Biol.

    (1993)
  • O. Gotoh

    An improved algorithm for matching biological sequences

    J. Mol. Biol.

    (1982)
  • T.L. Blundell et al.

    Knowledge-based prediction of protein structures and the design of novel molecules

    Nature

    (1987)
  • T.J. Hubbard et al.

    SCOPa structural classification of proteins database

    Nucl. Acids Res.

    (1999)
  • W.R. Pearson et al.

    Improved tools for biological sequence comparison

    Proc. Natl Acad. Sci. USA

    (1988)
  • S.F. Altschul et al.

    Gapped BLAST and PSI-BLASTa new generation of protein database search programs

    Nucl. Acids Res.

    (1997)
  • M. Gribskov et al.

    Profile analysisdetection of distantly related proteins

    Proc. Natl Acad. Sci. USA

    (1987)
  • S.R. Eddy

    Profile hidden Markov models

    Bioinformatics

    (1998)
  • K. Karplus et al.

    Hidden Markov models for detecting remote protein homologies

    Bioinformatics

    (1998)
  • L. Rychlewski et al.

    Comparison of sequence profiles. Strategies for structural predictions using sequence information

    Protein Sci.

    (2000)
  • J. Hargbo et al.

    Hidden Markov models that use predicted secondary structures for fold recognition

    Proteins: Struct. Funct. Genet.

    (1999)
  • L.A. Kelley et al.

    Enhanced genome annotation using structural profiles in the program 3D-PSSM

    J. Mol. Biol.

    (2000)
  • D. Fischer et al.

    Protein fold recognition using sequence-derived predictions

    Protein Sci.

    (1996)
  • D.T. Jones et al.

    A new approach to protein fold recognition

    Nature

    (1992)
  • Cited by (1094)

    • Computational strategies and tools for protein tertiary structure prediction

      2023, Basic Biotechniques for Bioprocess and Bioentrepreneurship
    • Protein domain identification methods and online resources

      2021, Computational and Structural Biotechnology Journal
    View all citing articles on Scopus
    1

    Edited by B. Honig

    View full text