Concept: Mass spectrometry software
SELDI-TOF mass spectrometer’s compact size and automated, high throughput design have been attractive to clinical researchers, and the platform has seen steady-use in biomarker studies. Despite new algorithms and preprocessing pipelines that have been developed to address reproducibility issues, visual inspection of the results of SELDI spectra preprocessing by the best algorithms still shows miscalled peaks and systematic sources of error. This suggests that there continues to be problems with SELDI preprocessing. In this work, we study the preprocessing of SELDI in detail and introduce improvements. While many algorithms, including the vendor supplied software, can identify peak clusters of specific mass (or m/z) in groups of spectra with high specificity and low false discover rate (FDR), the algorithms tend to underperform estimating the exact prevalence and intensity of peaks in those clusters. Thus group differences that at first appear very strong are shown, after careful and laborious hand inspection of the spectra, to be less than significant. Here we introduce a wavelet/neural network based algorithm which mimics what a team of expert, human users would call for peaks in each of several hundred spectra in a typical SELDI clinical study. The wavelet denoising part of the algorithm optimally smoothes the signal in each spectrum according to an improved suite of signal processing algorithms previously reported (the LibSELDI toolbox under development). The neural network part of the algorithm combines those results with the raw signal and a training dataset of expertly called peaks, to call peaks in a test set of spectra with approximately 95% accuracy. The new method was applied to data collected from a study of cervical mucus for the early detection of cervical cancer in HPV infected women. The method shows promise in addressing the ongoing SELDI reproducibility issues.
The acquisition of high-resolution tandem mass spectra (MS/MS) is becoming more prevalent in proteomics, but most researchers employ peptide identification algorithms that were designed prior to this development. Here we demonstrate new software, Morpheus, designed specifically for high-mass accuracy data, based on a simple score that is little more than the number of matching products. For a diverse collection of datasets from a variety of organisms (E. coli, yeast, human) acquired on a variety of instruments (quadrupole-time of flight, ion trap-orbitrap, quadrupole-orbitrap) in different laboratories, Morpheus gives more spectrum, peptide, and protein identifications at a 1% false discovery rate (FDR) than Mascot, Open Mass Spectrometry Search Algorithm (OMSSA), and Sequest. Additionally, Morpheus is 1.5 to 4.6 times faster-depending on the dataset-than the next fastest algorithm, OMSSA. Morpheus was developed in C# .NET and is available free and open source under a permissive license.
Mass spectrometry has become one of the most important technologies in proteomic analysis. Tandem mass spectrometry (LC-MS/MS) is a major tool for the analysis of peptide mixtures from protein samples. The key step of MS data processing is the identification of peptides from experimental spectra by searching public sequence databases. Although a number of algorithms to identify peptides from MS/MS data have been already proposed, e.g. Sequest, OMSSA, X!Tandem, Mascot, MassWiz, etc., they are mainly based on statistical models considering only peak-matches between experimental and theoretical spectra, but not peak intensity information. Moreover, different algorithms gave different results from the same MS data, implying their probable incompleteness and questionable reproducibility. We developed a novel peptide identification algorithm ProVerB based on a binomial probability distribution model of protein tandem mass spectrometry combined with a new scoring function, making full use of peak intensity information, and thus enhancing the ability of identification. Compared with Mascot, Sequest and SQID, ProVerB identified significantly more peptides from LC-MS/MS datasets than the current algorithms at 1% False Discovery Rate (FDR) and provided more confident peptide identifications. ProVerB is also compatible with various platforms and experimental datasets, showing its robustness and versatility. The open-source program ProVerB is available at http://bioinformatics.jnu.edu.cn/software/proverb/.
Peptide-based proteomic data sets are ever increasing in size and complexity. These data sets provide computational challenges when attempting to quickly analyze spectra and obtain correct protein identifications. Database search and de novo algorithms must consider high-resolution MS/MS spectra and alternative fragmentation methods. Protein inference is a tricky problem when analyzing large data sets of degenerate peptide identifications. Combining multiple algorithms for improved peptide identification puts significant strain on computational systems when investigating large data sets. This review highlights some of the recent developments in peptide and protein identification algorithms for analyzing shotgun mass spectrometry data when encountering the aforementioned hurdles. Also explored are the roles that analytical pipelines, public spectral libraries, and cloud computing play in the evolution of peptide-based proteomics.
The in silico fragmenter MetFrag, launched in 2010, was one of the first approaches combining compound database searching and fragmentation prediction for small molecule identification from tandem mass spectrometry data. Since then many new approaches have evolved, as has MetFrag itself. This article details the latest developments to MetFrag and its use in small molecule identification since the original publication.
To address the growing need for a centralized, community resource of published results processed with Skyline, and to provide reviewers and readers immediate visual access to the data behind published conclusions, we present Panorama Public (https://panoramaweb.org/public.url), a repository of Skyline documents supporting published results. Panorama Public is built on Panorama, an open source data management system for mass spectrometry data processed with the Skyline targeted mass spectrometry environment. The Panorama web application facilitates viewing, sharing, and disseminating results contained in Skyline documents via a web-browser. Skyline users can easily upload their documents to a Panorama server and allow other researchers to explore uploaded results in the Panorama web-interface through a variety of familiar summary graphs as well as annotated views of the chromatographic peaks processed with Skyline. This makes Panorama ideal for sharing targeted, quantitative results contained in Skyline documents with collaborators, reviewers, and the larger proteomics community. The Panorama Public repository employs the full data visualization capabilities of Panorama which facilitates sharing results with reviewers during manuscript review.
- Journal of the American Society for Mass Spectrometry
- Published about 5 years ago
De novo sequencing software has been widely used in proteomics to sequence new peptides from tandem mass spectrometry data. This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today’s peptide de novo sequencing analyses. To improve the accuracy, Novor’s scoring functions are based on two large decision trees built from a peptide spectral library with more than 300,000 spectra with machine learning. Important knowledge about peptide fragmentation is extracted automatically from the library and incorporated into the scoring functions. The decision tree model also enables efficient score calculation and contributes to the speed improvement. To further improve the speed, a two-stage algorithmic approach, namely dynamic programming and refinement, is used. The software program was also carefully optimized. On the testing datasets, Novor sequenced 7%-37% more correct residues than the state-of-the-art de novo sequencing tool, PEAKS, while being an order of magnitude faster. Novor can de novo sequence more than 300 MS/MS spectra per second on a laptop computer. The speed surpasses the acquisition speed of today’s mass spectrometer and, therefore, opens a new possibility to de novo sequence in real time while the spectrometer is acquiring the spectral data. Graphical Abstract ᅟ.
The potential of the diverse chemistries present in natural products (NP) for biotechnology and medicine remains untapped because NP databases are not searchable with raw data and the NP community has no way to share data other than in published papers. Although mass spectrometry (MS) techniques are well-suited to high-throughput characterization of NP, there is a pressing need for an infrastructure to enable sharing and curation of data. We present Global Natural Products Social Molecular Networking (GNPS; http://gnps.ucsd.edu), an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass (MS/MS) spectrometry data. In GNPS, crowdsourced curation of freely available community-wide reference MS libraries will underpin improved annotations. Data-driven social-networking should facilitate identification of spectra and foster collaborations. We also introduce the concept of ‘living data’ through continuous reanalysis of deposited data.
Preeclampsia (PE) is a pregnancy complication characterized by high blood pressure and proteinuria. The disorder usually occurs after the 20th week of pregnancy and gets worse over time. PE increases the risk of poor outcomes for both the mother and the baby. In the study we applied LC-MS/MS method for the analysis of the urine peptidome of women with PE. Samples were prepared using size-exclusion chromatography method which gives more than twice peptides identities if compared with solid phase extraction. Thirty urine samples from women with mild and severe preeclampsia and the control group were analyzed. In total 1786 peptides were identified using complementary search engines (Mascot, MaxQuant and PEAKS). A high level of agreement in peptide identification was observed with previously published data. Label-free data comparison resulted in 35 peptides which reliably distinguished a particular PE group (severe or mild) from controls. Our results revealed unique identifications (correlate to alpha-1-antitrypsin, collagen alpha-1(I) chain, collagen alpha-1 (III) chain, and uromodulin, for instance) that can potentially serve as early indicators of PE.
Quantitative mass spectrometry (MS) is a key technique in many research areas (1), including proteomics, metabolomics, glycomics, and lipidomics. Because all of the corresponding molecules can be described by chemical formulas, universal quantification tools are highly desirable. Here we present pyQms, an open-source software for accurate quantification of all types of molecules measurable by MS. pyQms uses isotope pattern matching which offers an accurate quality assessment of all quantifications and the ability to directly incorporate mass spectrometer accuracy. pyQms is, due to its universal design, applicable to every research field, labeling strategy, and acquisition technique. This opens ultimate flexibility for researchers to design experiments employing innovative and hitherto unexplored labeling strategies. Importantly, pyQms performs very well to accurately quantify partially labeled proteomes in large-scale and high-throughput, the most challenging task for a quantification algorithm.