Journal of Molecular Biology
Improved Prediction of Signal Peptides: SignalP 3.0
Introduction
Numerous methods for predicting the subcellular location of proteins using machine learning techniques have been developed.1, 2, 3, 4, 5, 6, 7, 8, 9 Computational methods for prediction of N-terminal signal peptides were first published around 20 years ago, initially using a weight matrix approach.1, 2 Development of prediction methods shifted to machine learning algorithms in the mid-1990s,10, 11 with a significant increase in performance.12 SignalP, currently one of the most widely used methods, predicts the presence of signal peptidase I cleavage sites. For signal peptidase II cleavage sites found in lipoproteins, the LipoP predictor has been constructed.13 SignalP produces both classification and cleavage site assignment, while most of the other methods only classify proteins as secretory or non-secretory.
A consistent assessment of the predictive performance requires a reliable benchmark dataset. This is particularly important in this area, where the predictive performance is approaching the performance calculated from interpretation of experimental data, which is not always perfect. Incorrect annotation of signal peptide cleavage sites in the databases stems from trivial database errors, and from peptide sequencing, where it may be hard to control the level of post-processing of the protein by other peptidases after the signal peptidase I has made its initial cleavage. Such post-processing typically leads to cleavage site assignments shifted downstream relative to the true signal peptidase I cleavage site.
In the process of training the new version of SignalP we have generated a new, thoroughly curated dataset based on the extraction and redundancy reduction method published earlier.14 Additional methods were used to clean the new dataset, and we found a surprisingly high error rate in Swiss-Prot, where, for example, of the order of 7% of the Gram-positive entries had a wrong cleavage site position, a wrong annotation of the experimental evidence, or both. We also found many errors in a previously used benchmark set (stemming from automatic extraction from Swiss-Prot),12 and it appears that some programs in fact perform better than reported (their predictions are correct, while the feature annotation is incorrect). For comparison, we made use of this independent benchmark dataset, which was used initially for evaluation of five different signal peptide predictors.12
In the new version of SignalP we have introduced novel amino acid composition units as well as sequence position units in the neural network input layer in order to obtain better performance. Moreover, we have changed the window sizes slightly compared to the previous version. We have used fivefold cross-validation tests for direct comparison to the previous version of SignalP.10 In the previous version of SignalP a combination score, Y, was created from the cleavage site score, C, and the signal peptide score, S, and used to obtain a better prediction of the position of the cleavage site. In the new version, we also use the C-score to obtain a better discrimination between secreted and non-secreted sequences, and have constructed a new D-score for this classification task. The architecture of the hidden Markov model (HMM) SignalP has not changed, but the models have been retrained on the new data set, and have increased their performance significantly.
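As an illustration of how composition and position units might sit alongside a sliding sequence window in a network input layer, the following is a minimal sketch and not the SignalP implementation. The function names, the sparse (one-hot) residue encoding, the zero padding beyond the sequence ends, and the scaling of the position unit are all assumptions made for the example; the text above does not specify them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Sparse encoding of one residue; all zeros for padding or unknown symbols."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def composition(sequence):
    """Fraction of each of the 20 amino acids over the whole (non-empty) sequence."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def encode_position(sequence, center, left, right):
    """Input vector for a window spanning `left` residues before and
    `right` residues after the 0-based position `center`."""
    features = []
    for i in range(center - left, center + right + 1):
        # Positions outside the sequence are padded (assumption).
        residue = sequence[i] if 0 <= i < len(sequence) else "-"
        features.extend(one_hot(residue))
    features.extend(composition(sequence))    # composition units
    features.append(center / len(sequence))   # position unit (hypothetical scaling)
    return features
```

Under this encoding, a window with 2 residues before and 3 after the central position yields 6 × 20 window inputs, 20 composition inputs and 1 position input, 141 in total. An asymmetric window is obtained simply by choosing different values for `left` and `right`.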
Generation of data sets
As the predictive performance of the earlier SignalP method was quite high, assessment of potential improvements is critically dependent on the quality of the data annotation. We generated a new positive signal peptide data set from Swiss-Prot15 release 40.0, retaining the negative data set extracted in the previous work. The method for redundancy reduction was the same as in the previous work,14 and was based on the reduction principle developed by Hobohm et al.16 Our final positive signal
Conclusion
We present new versions of SignalP, based on an expanded, highly curated dataset. The architecture of the HMM-based version was unchanged, while the neural network scheme was improved by including information about the amino acid composition of the precursor protein as well as the position of the sliding window. Furthermore, we optimized the window sizes by testing all possible combinations of asymmetric and symmetric input windows up to a total input of 51 amino acid residues. These were
Data set extraction
All sequence data were extracted from Swiss-Prot15 release 40.0. A total of 12,975 entries with the keyword SIGNAL were found. The dataset was split into three species-specific groups: eukaryotes, Gram-negative prokaryotes and Gram-positive prokaryotes. We excluded all archaeal sequences. Non-experimentally verified signal peptides that had POTENTIAL or HYPOTHETICAL stated in the keyword line were removed. Furthermore, any phage, viral or eukaryote organelle-encoded proteins were excluded.
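The filtering steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the extraction pipeline used for the dataset: the record layout (a dict with hypothetical `id`, `signal_line` and `origin` fields) and the origin labels are invented for the example, and a real Swiss-Prot flat file would first need to be parsed into such records.

```python
# Hypothetical origin labels; only these three groups are retained.
TARGET_GROUPS = ("eukaryote", "gram_negative", "gram_positive")
# Annotations marking signal peptides that are not experimentally verified.
UNVERIFIED_FLAGS = ("POTENTIAL", "HYPOTHETICAL")


def keep_entry(entry):
    """Return True if the entry passes all of the filters described in the text."""
    line = entry["signal_line"]
    if "SIGNAL" not in line:
        return False
    if any(flag in line for flag in UNVERIFIED_FLAGS):
        return False
    # Archaeal, phage, viral and organelle-encoded entries carry other
    # origin labels in this sketch and are therefore dropped here.
    return entry["origin"] in TARGET_GROUPS


def split_by_group(entries):
    """Partition the retained entry identifiers into the three groups."""
    groups = {name: [] for name in TARGET_GROUPS}
    for entry in entries:
        if keep_entry(entry):
            groups[entry["origin"]].append(entry["id"])
    return groups
```

Keeping the filter as a single predicate makes it easy to count how many entries each rule removes, which is useful when auditing a curated dataset.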
Acknowledgements
We thank Anders Krogh for use of his HMM software. This work was supported by grants from the Danish National Research Foundation, the Danish Natural Science Research Council, the Danish Center for Scientific Computing, and by a grant from Novozymes A/S (to J.D.B.).
References (44)
On the predictive recognition of signal peptide sequences. Virus Res. (1985)
Processing of Escherichia coli alkaline phosphatase: role of the primary structure of the signal peptide cleavage region. J. Mol. Biol. (1998)
Cloning of a Locusta cDNA encoding neuroparsin A. Insect Biochem. Mol. Biol. (1992)
The primary structure of glycoprotein III from bovine adrenal medullary chromaffin granules. Sequence similarity with human serum protein-40,40 and rat Sertoli cell glycoprotein. J. Biol. Chem. (1990)
Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. (1997)
Maximum shields: the assembly and function of the bacterial spore coat. Trends. Microbiol. (2002)
Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. (2001)
A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. (2004)
A new method for predicting signal sequence cleavage sites. Nucl. Acids Res. (1986)
Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Struct. Funct. Genet. (1991)
Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res.
Support vector machine approach for protein subcellular localization prediction. Bioinformatics
Proceedings of the Pacific Symposium on Biocomputing
SPEPlip: the detection of signal peptide and lipoprotein cleavage sites. Bioinformatics
PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucl. Acids Res.
A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics
Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng.
Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Cong. Intell. Syst. Mol. Biol.
A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics
Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci.
Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins: Struct. Funct. Genet.
The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res.