SciCombinator

Journal: Journal of bioinformatics and computational biology

28

This paper is a self-contained introductory tutorial on the proteomics problem known as peptide sequencing using tandem mass spectrometry. This tutorial deals specifically with de novo sequencing methods (as opposed to database search methods). We first give an introduction to peptide sequencing, its importance and history, and some background on proteins. Next we show the relationship between a peptide and the final spectrum produced by a tandem mass spectrometer, together with a description of the various sources of complications that arise while the mass spectrum is generated. From there we model the computational problem of de novo peptide sequencing, which is essentially the inverse problem: identifying the peptide that produced the spectrum. We then present several major approaches to solving it (including a review of some of the current algorithms in each approach), and also discuss related problems and post-processing approaches.

Concepts: Mass spectrometry, Tandem mass spectrometry, Fourier transform ion cyclotron resonance, Top-down proteomics, Collision-induced dissociation, Blackbody infrared radiative dissociation
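
The relationship between a peptide and its spectrum described in the abstract above can be illustrated with a minimal Python sketch. It is not taken from the tutorial: the function names are hypothetical, the residue table is a subset of the standard monoisotopic masses (isoleucine is omitted because it has the same mass as leucine), and the ladder is assumed to be complete and noise-free, which real spectra are not.

```python
# Monoisotopic residue masses in daltons (standard values, subset only).
# Isoleucine is left out on purpose: it is isobaric with leucine.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b-ion m/z values: N-terminal prefixes plus a proton."""
    masses, total = [], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def y_ions(peptide):
    """Singly charged y-ion m/z values: C-terminal suffixes plus water and a proton."""
    masses, total = [], WATER + PROTON
    for aa in reversed(peptide):
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def read_b_ladder(peaks, tol=0.01):
    """Recover a sequence from an idealized, complete b-ion ladder.

    Each gap between consecutive peaks is matched to a residue mass.
    """
    sequence, previous = [], PROTON
    for peak in sorted(peaks):
        gap = peak - previous
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        if len(matches) != 1:
            raise ValueError(f"gap {gap:.4f} does not match exactly one residue")
        sequence.append(matches[0])
        previous = peak
    return "".join(sequence)


if __name__ == "__main__":
    ladder = b_ions("GAVLK")
    print(read_b_ladder(ladder))
```

Reading the gaps between consecutive peaks is the forward model run in reverse; missing peaks, noise and mixed ion types are what make the actual de novo problem hard.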

28

Phylogenetic networks are useful for visualizing evolutionary relationships between species with reticulate events such as hybridizations and horizontal gene transfers. In this paper, we consider the problem of constructing undirected phylogenetic networks that (1) are planar graphs and (2) admit embeddings in the plane where the vertices labeling all taxa are on the boundary of the network. We develop a new algorithm for constructing phylogenetic networks satisfying these constraints. First, we show that only approximate networks can be constructed for some distance matrices with at least five taxa. Then we prove that any five-point metric can be represented approximately by a planar boundary-labeled network with guaranteed fit value of 94.79. We extend the networks constructed in the proof to design an algorithm for computing planar boundary-labeled networks for any number of taxa.

Concepts: Evolution, Species, Horizontal gene transfer, Phylogenetic tree, Graph theory, Graph, Planar graph, Connectivity

0

Alternative polyadenylation (APA) is a pervasive mechanism that contributes to gene regulation. The growing number of sequenced poly(A) sites is placing new demands on the development of computational methods to investigate APA regulation. Cluster analysis is important for identifying groups of co-expressed genes. However, clustering of poly(A) sites has not been extensively studied, and most APA studies have failed to consider the distribution, abundance, and variation of APA sites in each gene. Here we constructed a two-layer model based on canonical correlation analysis (CCA) to explore the underlying biological mechanisms in APA regulation. The first layer quantifies the general correlation of APA sites across various conditions between genes, and the second layer identifies genes with statistically significant correlation in their APA patterns to infer APA-specific gene clusters. Using hierarchical clustering, we comprehensively compared our method with four other widely used distance measures based on three performance indexes. Results showed that our method significantly enhanced the clustering performance for both synthetic and real poly(A) site data and could generate clusters with more biological meaning. We have implemented the CCA-based method as a publicly available R package called PAcluster, which provides an efficient solution to the clustering of large APA-specific biological datasets.

Concepts: DNA, Genetics, Gene expression, Statistics, Biology, RNA, Multivariate statistics, Canonical correlation

0

In this paper, we propose a high performance computing toolbox implementing efficient statistical methods for the study of phylogenies. This toolbox, which implements logit models and LASSO-type penalties, gives a way to better understand, measure, and compare the impact of each gene on a global phylogeny. As an application, we study the Echinococcus phylogeny, which is often considered a particularly difficult example. Mitochondrial and nuclear genomes (19 coding sequences) of nine Echinococcus species are considered in order to investigate the molecular phylogeny of this genus. First, we check that the 19 gene trees lead to 19 totally different unsupported topologies (a topology is the sister relationship when both branch lengths and supports are ignored in a phylogenetic tree), while using the 19 genes as a whole is not sufficient for estimating the phylogeny. In order to circumvent this issue and understand the impact of the genes, we computed 43,796 trees using combinations ranging from 13 to 19 genes. By doing so, 15 topologies are obtained. Four particular topologies, appearing more robust and frequent, are then selected for more precise investigation. Further refining our statistical analysis, a particularly robust topology is extracted. We also carefully demonstrate the influence of nuclear genes on the likelihood of the phylogeny.

Concepts: Evolution, Mathematics, Biology, Organism, Species, Horizontal gene transfer, Phylogenetics, Cladistics

0

Microarray technology is widely used to identify differentially expressed genes because of its high-throughput capability. The number of replicated microarray chips in each group is usually small. Borrowing information across different genes is an efficient way to improve parameter estimation, which suffers from the limited sample size. In this paper, we use a hierarchical model to describe the dispersion of gene expression profiles and model the variance as a function of the gene expression level via a link function. A heuristic algorithm is proposed to estimate the hyper-parameters and the link function. The differentially expressed genes are identified using a multiple testing procedure. Compared to SAM and LIMMA, our proposed method shows significantly greater detection power while the false discovery rate is controlled.

Concepts: Gene, Gene expression, Estimation theory, Estimator, DNA microarray, Bayes estimator

0

Structural controllability is the generalization of traditional controllability for dynamical systems. During the last decade, interesting biological discoveries have been made by applying structural controllability analysis to biological networks. However, false positive/negative information (i.e. nodes and edges) is widespread in the biological networks documented in public data sources, which can hinder accurate analysis of structural controllability. In this study, we propose WDNfinder, a comprehensive analysis package that provides structural controllability analysis with consideration of node connection strength in biological networks. When applied to the human cancer signaling network and the p53-mediated DNA damage response network, WDNfinder shows high accuracy in predicting essential nodes in these networks. Compared to existing methods, WDNfinder can significantly narrow down the set of minimum driver node sets (MDSs) under the restriction of domain knowledge. Using the p53-mediated DNA damage response network as an illustration, we find more meaningful MDSs with WDNfinder. The source code is implemented in Python and publicly available together with relevant data on GitHub: https://github.com/dustincys/WDNfinder .

Concepts: DNA, Scientific method, Gene, Evolution, Mathematics, Java, Logic, Source code
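
For readers unfamiliar with minimum driver node sets, the following Python sketch shows the classical unweighted formulation, in which the nodes left unmatched by a maximum matching of the directed network form one MDS. This is an assumption-laden illustration, not WDNfinder's method or API: it ignores connection strength entirely, and the function names and toy networks are hypothetical.

```python
def maximum_matching(nodes, edges):
    """Maximum matching of a directed graph viewed as a bipartite graph.

    Each directed edge (u, v) links the "out" copy of u to the "in" copy
    of v. Returns a dict mapping each matched target node to its source.
    """
    successors = {u: [] for u in nodes}
    for u, v in edges:
        successors[u].append(v)
    matched_from = {}

    def try_augment(u, seen):
        # Kuhn's augmenting-path search starting from source node u.
        for v in successors[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_from or try_augment(matched_from[v], seen):
                matched_from[v] = u
                return True
        return False

    for u in nodes:
        try_augment(u, set())
    return matched_from


def driver_nodes(nodes, edges):
    """One minimum driver node set: the nodes with no matched incoming edge.

    If the matching is perfect, a single (arbitrary) node is enough.
    """
    matched_from = maximum_matching(nodes, edges)
    unmatched = [v for v in nodes if v not in matched_from]
    return unmatched or [nodes[0]]


if __name__ == "__main__":
    # Hypothetical toy networks, not data from the paper.
    print(driver_nodes(["a", "b", "c"], [("a", "b"), ("b", "c")]))
    print(driver_nodes(["a", "b", "c"], [("a", "b"), ("a", "c")]))
```

A network usually admits many maximum matchings and therefore many MDSs of the same size, which is why narrowing down the candidate sets, as the abstract describes, is useful.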

0

Due to the importance of post-translational modifications (PTMs) in human health and diseases, PTMs are regularly reported in the biomedical literature. However, the continuing and rapid expansion of this literature poses a huge challenge for researchers and database curators. Therefore, there is a pressing need to help them identify relevant PTM information more efficiently by using a text mining system. So far, only a few web servers are available for mining information on a very limited number of PTMs, and they are based on simple pattern matching or pre-defined rules. In our work, in order to help researchers and database curators easily find and retrieve PTM information from available text, we have developed a text mining tool called MPTM, which extracts and organizes valuable knowledge about 11 common PTMs from abstracts in PubMed by using relations extracted from dependency parse trees and a heuristic algorithm. It is the first web server that provides a literature mining service for hydroxylation, myristoylation and GPI-anchor. The tool is also used to find new publications on PTMs from PubMed and uncovers potential PTM information by large-scale text analysis. MPTM analyzes text sentences to identify protein names, including substrates and protein-interacting enzymes, and automatically associates them with the UniProtKB protein entry. To facilitate further investigation, it also retrieves PTM-related information, such as human diseases, Gene Ontology terms and organisms, from the input text and related databases. In addition, an online database (MPTMDB) with extracted PTM information and a local MPTM Lite package are provided on the MPTM website. MPTM is freely available online at http://bioinformatics.ustc.edu.cn/mptm/ and the source code is hosted on GitHub: https://github.com/USTC-HILAB/MPTM .

Concepts: Amino acid, Posttranslational modification

0

Emerging bioimaging technologies enable us to capture various dynamic cellular activities in vivo. Because large amounts of data are now obtained and manually processing massive numbers of images is becoming unrealistic, automatic analysis methods are required. One of the issues for automatic image segmentation is that image-taking conditions are variable. Thus, many manual inputs are commonly required for each image. In this paper, we propose a bone marrow cavity (BMC) segmentation method for bone images, as the BMC is considered to be related to the mechanisms of bone remodeling, osteoporosis, and so on. To reduce the manual inputs needed to segment the BMC, we classified the texture pattern using wavelet transformation and a support vector machine. We also integrated the result of texture pattern classification into the graph-cuts-based image segmentation method because texture analysis does not consider spatial continuity. Our method is applicable to a particular frame in an image sequence in which the condition of the fluorescent material is variable. In the experiment, we evaluated our method with nine types of mother wavelets and several sets of scale parameters. The proposed method with graph-cuts and texture pattern classification performs well without manual inputs by a user.

Concepts: Osteoporosis, Bone, Bone marrow, Metaphysics, Medullary cavity, Wavelet, Manual transmission, Bone marrow examination

0

Some interesting combinatorial problems have been motivated by genome rearrangements, which are mutations that affect large portions of a genome. When we represent genomes as permutations, the goal is to transform a given permutation into the identity permutation with the minimum number of rearrangements. When they affect segments from the beginning (respectively end) of the permutation, they are called prefix (respectively suffix) rearrangements. This paper presents results for rearrangement problems that involve prefix and suffix versions of reversals and transpositions considering unsigned and signed permutations. We give 2-approximation and ([Formula: see text])-approximation algorithms for these problems, where [Formula: see text] is a constant divided by the number of breakpoints (pairs of consecutive elements that should not be consecutive in the identity permutation) in the input permutation. We also give bounds for the diameters concerning these problems and provide ways of improving the practical results of our algorithms.

Concepts: Mathematics, Group, Group theory, Binomial coefficient, Inflection, Sorting algorithm, Suffix, Factorial number system
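
The notions of breakpoint and prefix reversal in the abstract above can be made concrete with a short Python sketch for unsigned permutations. It is only an illustration under stated assumptions: the permutation is extended with n + 1 at the end before counting breakpoints (a common convention for prefix rearrangements, assumed here), the sorting routine is the naive "bring the largest element to the front, then flip it into place" strategy rather than the paper's approximation algorithms, and all function names are hypothetical.

```python
def breakpoints(perm):
    """Count adjacent pairs that are not consecutive values.

    The permutation of 1..n is extended with n + 1 at the end, so a
    misplaced last element also counts as a breakpoint.
    """
    extended = list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)


def prefix_reversal(perm, k):
    """Reverse the first k elements of perm and return a new list."""
    return perm[:k][::-1] + perm[k:]


def sort_by_prefix_reversals(perm):
    """Naive sorting by prefix reversals; returns the prefix lengths used.

    Not an approximation algorithm from the paper: it simply places the
    largest unsorted element with at most two reversals per element.
    """
    perm = list(perm)
    flips = []
    for size in range(len(perm), 1, -1):
        pos = perm.index(size)
        if pos == size - 1:
            continue  # already in its final position
        if pos != 0:
            perm = prefix_reversal(perm, pos + 1)  # bring it to the front
            flips.append(pos + 1)
        perm = prefix_reversal(perm, size)  # flip it into place
        flips.append(size)
    return flips


if __name__ == "__main__":
    start = [3, 1, 4, 2]
    print(breakpoints(start), sort_by_prefix_reversals(start))
```

Comparing the number of reversals used with the number of breakpoints in the input gives a feel for how far a naive strategy can be from a breakpoint-based bound.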

0

Alternative splicing (AS), by which individual genes can produce multiple mRNAs, is associated with genomic complexity, disease, and development. Histone modifications play important roles in both transcription initiation and mRNA splicing. Here, we sought the link between AS and histone modifications in flanking regions by analyzing publicly available data from two human cell lines, GM12878 and K562. According to exon inclusion levels, exons were classified into three types: included skipped exons, excluded skipped exons, and expressed constitutive exons. We found that the inclusion levels of skipped exons (SEs) were negatively correlated with the enrichment of active histone marks in SEs, indicating a role of histone modifications in AS. We also found that active histone modifications were enriched in the upstream exons of SEs, especially around 5′ splicing sites. We inferred that the histone modifications around the 5′ splicing sites in the upstream exon of the SEs could help the RNA Polymerase II complex recruit effector proteins and facilitate AS. Nucleosome occupancy appeared to have little influence on the inclusion levels of SEs. Finally, we proposed an integrated model that describes how histone modifications affect pre-mRNA splicing.

Concepts: DNA, Gene expression, RNA, Messenger RNA, Intron, Spliceosome, RNA splicing, Exon