Improved Prediction of Signal Peptides: SignalP 3.0

doi:10.1016/j.jmb.2004.05.028

Journal of Molecular Biology

Volume 340, Issue 4, 16 July 2004, Pages 783-795

Abstract

We describe improvements to SignalP, currently the most popular method for prediction of classically secreted proteins. SignalP consists of two different predictors, based on neural network and hidden Markov model algorithms, and both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, combined with a thorough error-correction of a new data set, has improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups: eukaryotes, Gram-negative bacteria and Gram-positive bacteria. The accuracy of cleavage site prediction has increased in the range 6–17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false-positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has been benchmarked against other available methods. Predictions can be made at the publicly available web server http://www.cbs.dtu.dk/services/SignalP/.

Introduction

Numerous attempts have been made to predict the correct subcellular location of proteins using machine learning techniques.¹⁻⁹ Computational methods for prediction of N-terminal signal peptides were published around 20 years ago, initially using a weight matrix approach.¹,² Development of prediction methods shifted to machine learning algorithms in the mid-1990s,¹⁰,¹¹ with a significant increase in performance.¹² SignalP, currently one of the most widely used methods, predicts the presence of signal peptidase I cleavage sites. For signal peptidase II cleavage sites found in lipoproteins, the LipoP predictor has been constructed.¹³ SignalP produces both a classification and a cleavage site assignment, while most of the other methods only classify proteins as secretory or non-secretory.

A consistent assessment of the predictive performance requires a reliable benchmark dataset. This is particularly important in this area, where the predictive performance is approaching the performance calculated from interpretation of experimental data, which is not always perfect. Incorrect annotation of signal peptide cleavage sites in the databases stems from trivial database errors, and from peptide sequencing, where it may be hard to control the level of post-processing of the protein by other peptidases after the signal peptidase I has made its initial cleavage. Such post-processing typically leads to cleavage site assignments shifted downstream relative to the true signal peptidase I cleavage site.

In the process of training the new version of SignalP we have generated a new, thoroughly curated dataset based on the extraction and redundancy reduction method published earlier.¹⁴ Other methods were used for cleaning the new dataset, and we found a surprisingly high error rate in Swiss-Prot, where, for example, of the order of 7% of the Gram-positive entries had a wrong cleavage site position, a wrong annotation of the experimental evidence, or both. We also found many errors in a previously used benchmark set (stemming from automatic extraction from Swiss-Prot),¹² and it appears that some programs in fact perform better than reported (their predictions are correct, while the feature annotation is incorrect). For comparison, we made use of this independent benchmark dataset, which was used initially for evaluation of five different signal peptide predictors.¹²

In the new version of SignalP we have introduced novel amino acid composition units as well as sequence position units in the neural network input layer in order to obtain better performance. Moreover, we have changed the window sizes slightly compared to the previous version. We have used fivefold cross-validation tests for direct comparison to the previous version of SignalP.¹⁰ In the previous version of SignalP a combination score, Y, was created from the cleavage site score, C, and the signal peptide score, S, and used to obtain a better prediction of the position of the cleavage site. In the new version, we also use the C-score to obtain a better discrimination between secreted and non-secreted sequences, and have constructed a new D-score for this classification task. The architecture of the hidden Markov model (HMM) SignalP has not changed, but the models have been retrained on the new data set, and have increased their performance significantly.
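The relation between the C-, S-, Y- and D-scores can be illustrated with a short Python sketch. The text above does not state the formulas, so the forms used here are assumptions: Y is taken as the geometric mean of the C-score and the local drop in the S-score, and D as the average of the maximal Y-score and the mean S-score over the predicted signal peptide. The function names, the smoothing width `d` and the toy score vectors are hypothetical and serve only to show how a cleavage site score can feed into a discrimination score.

```python
import math


def y_scores(c, s, d=3):
    """Combine per-position C- and S-scores into Y-scores.

    Assumed form: Y_i = sqrt(C_i * dS_i), where dS_i is the mean S-score of
    the d positions before i minus the mean S-score of the d positions
    starting at i. Negative products are clipped to zero.
    """
    y = []
    for i in range(len(c)):
        before = s[max(0, i - d):i]
        after = s[i:i + d]
        if not before or not after:
            y.append(0.0)
            continue
        drop = sum(before) / len(before) - sum(after) / len(after)
        y.append(math.sqrt(max(0.0, c[i] * drop)))
    return y


def d_score(c, s, d=3):
    """Return (predicted site index, D-score).

    Assumed form: D = (S_mean + Y_max) / 2, where S_mean is the mean S-score
    over the positions preceding the Y_max position. The site index is the
    0-based position of the first residue after the predicted cleavage.
    """
    y = y_scores(c, s, d)
    site = max(range(len(y)), key=y.__getitem__)
    s_mean = sum(s[:site]) / site if site > 0 else 0.0
    return site, (s_mean + y[site]) / 2.0


# Toy scores: high S over the first five residues, one sharp C-score peak.
s_toy = [0.9] * 5 + [0.1] * 5
c_toy = [0.0] * 5 + [0.8] + [0.0] * 4
site, score = d_score(c_toy, s_toy)
```

In this toy case the C-score peak coincides with the drop in the S-score, so the Y-score peaks at the same position and the D-score reflects both signals. A threshold on the D-score would then separate secreted from non-secreted sequences; the threshold value itself is not given in this excerpt.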

Section snippets

Generation of data sets

As the predictive performance of the earlier SignalP method was quite high, assessment of potential improvements is critically dependent on the quality of the data annotation. We generated a new positive signal peptide data set from Swiss-Prot¹⁵ release 40.0, retaining the negative dataset extracted from the previous work. The method for redundancy reduction was the same as in the previous work¹⁴, and was based on the reduction principle developed by Hobohm et al.¹⁶ Our final positive signal

Conclusion

We present new versions of SignalP, based on an expanded, highly curated dataset. The architecture of the HMM-based version was unchanged, while the neural network scheme was improved by including information about the amino acid composition of the precursor protein as well as the position of the sliding window. Furthermore, we optimized the window sizes by testing all possible combinations of asymmetric and symmetric input windows up to a total input of 51 amino acid residues. These were

Data set extraction

All sequence data were extracted from Swiss-Prot¹⁵ release 40.0. A total of 12,975 entries with the keyword SIGNAL were found. The dataset was split into three species-specific groups: eukaryotes, Gram-negative prokaryotes and Gram-positive prokaryotes. We excluded all archaeal sequences. Non-experimentally verified signal peptides that had POTENTIAL or HYPOTHETICAL stated in the keyword line were removed. Furthermore, any phage, viral or eukaryote organelle-encoded proteins were excluded.
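The filtering steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the record layout (a dict with `id`, `signal` and `origin` keys), the origin labels and the function name are all hypothetical, and a real implementation would first parse these fields from Swiss-Prot flat-file entries.

```python
from collections import defaultdict

# Qualifiers that mark a signal peptide as not experimentally verified.
NON_EXPERIMENTAL = ("POTENTIAL", "HYPOTHETICAL")

# The three organism groups that are kept (hypothetical labels).
GROUPS = {"eukaryote", "gram_negative", "gram_positive"}


def select_signal_entries(entries):
    """Group entry ids by organism group after applying the exclusions.

    Each entry is assumed to be a dict with:
      "id":     entry identifier
      "signal": text of the SIGNAL annotation, or None if absent
      "origin": "eukaryote", "gram_negative", "gram_positive", "archaea",
                "phage", "virus" or "organelle"
    """
    selected = defaultdict(list)
    for entry in entries:
        note = entry.get("signal")
        if note is None:
            continue  # no SIGNAL annotation
        if any(tag in note.upper() for tag in NON_EXPERIMENTAL):
            continue  # not experimentally verified
        if entry["origin"] not in GROUPS:
            continue  # archaeal, phage, viral or organelle-encoded
        selected[entry["origin"]].append(entry["id"])
    return dict(selected)


example = [
    {"id": "A", "signal": "SIGNAL 1 22", "origin": "eukaryote"},
    {"id": "B", "signal": "SIGNAL 1 19 POTENTIAL", "origin": "eukaryote"},
    {"id": "C", "signal": None, "origin": "gram_negative"},
    {"id": "D", "signal": "SIGNAL 1 25", "origin": "archaea"},
    {"id": "E", "signal": "SIGNAL 1 30", "origin": "gram_positive"},
]
groups = select_signal_entries(example)
```

Only entries A and E survive in this example: B is marked POTENTIAL, C has no SIGNAL annotation, and D is archaeal.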

Acknowledgements

We thank Anders Krogh for use of his HMM software. This work was supported by grants from the Danish National Research Foundation, the Danish Natural Science Research Council, the Danish Center for Scientific Computing, and by a grant from Novozymes A/S (to J.D.B.).

References (44)

D.J. McGeoch. On the predictive recognition of signal peptide sequences. Virus Res. (1985)
A.L. Karamyshev et al. Processing of Escherichia coli alkaline phosphatase: role of the primary structure of the signal peptide cleavage region. J. Mol. Biol. (1998)
M. Lagueux et al. Cloning of a Locusta cDNA encoding neuroparsin A. Insect Biochem. Mol. Biol. (1992)
D.J. Palmer et al. The primary structure of glycoprotein III from bovine adrenal medullary chromaffin granules. Sequence similarity with human serum protein-40,40 and rat Sertoli cell glycoprotein. J. Biol. Chem. (1990)
J. Cedano et al. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. (1997)
A. Driks. Maximum shields: the assembly and function of the bacterial spore coat. Trends Microbiol. (2002)
A. Krogh et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. (2001)
L. Käll et al. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. (2004)
G. von Heijne. A new method for predicting signal sequence cleavage sites. Nucl. Acids Res. (1986)
K. Nakai et al. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Struct. Funct. Genet. (1991)
A. Reinhardt et al. Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res. (1998)
S. Hua et al. Support vector machine approach for protein subcellular localization prediction. Bioinformatics (2001)
J.P. Vert. Proceedings of the Pacific Symposium on Biocomputing (2002)
P. Fariselli et al. SPEPlip: the detection of signal peptide and lipoprotein cleavage sites. Bioinformatics (2003)
J.L. Gardy et al. PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucl. Acids Res. (2003)
Z. Zhang et al. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics (2003)
H. Nielsen et al. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. (1997)
H. Nielsen et al. Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Cong. Intell. Syst. Mol. Biol. (1998)
K.M. Menne et al. A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics (2000)
A.S. Juncker et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. (2003)
H. Nielsen et al. Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins: Struct. Funct. Genet. (1996)
A. Bairoch et al. The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res. (2000)
