SciCombinator

Discover the most talked about and latest scientific content & concepts.

Journal: Journal of bioinformatics and computational biology

28

This paper is a self-contained introductory tutorial on the problem in proteomics known as peptide sequencing using tandem mass spectrometry. This tutorial deals specifically with de novo sequencing methods (as opposed to database search methods). We first give an introduction to peptide sequencing, its importance and history and some background on proteins. Next we show the relationship between a peptide and the final spectrum produced from a tandem mass spectrometer, together with a description of the various sources of complications that arise during the process of generating the mass spectrum. From there we model the computational problem of de novo peptide sequencing, which is basically the reverse problem of identifying the peptide which produced the spectrum. We then present several major approaches to solve it (including reviewing some of the current algorithms in each approach), and also discuss related problems and post-processing approaches.

Concepts: Mass spectrometry, Tandem mass spectrometry, Fourier transform ion cyclotron resonance, Top-down proteomics, Collision-induced dissociation, Blackbody infrared radiative dissociation
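
The relationship between a peptide and its tandem mass spectrum, which the tutorial above describes, can be illustrated with a minimal sketch. It assumes an idealized, noise-free spectrum containing only singly charged b- and y-ions. The function name `theoretical_ions` and the reduced residue table are illustrative choices, not taken from the paper. De novo sequencing is the reverse task: recovering the peptide from observed peaks like these.

```python
# Monoisotopic residue masses in daltons for a handful of amino acids
# (standard values; the table is deliberately incomplete to keep the sketch short).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton (charge carrier)


def theoretical_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments.

    Hypothetical helper: b_i covers the first i residues (N-terminal
    fragment), and the y-ion at the same index is its complementary
    C-terminal fragment.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:          # one cleavage site between each residue pair
        prefix += m
        b_ions.append(prefix + PROTON)
        y_ions.append(total - prefix + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = theoretical_ions("GAK")
    print("b-ions:", [round(x, 4) for x in b])
    print("y-ions:", [round(x, 4) for x in y])
```

Every complementary b/y pair sums to the peptide mass plus two protons. Real spectra add missing peaks, noise and other ion types, which are the complications the tutorial discusses.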

28

Phylogenetic networks are useful for visualizing evolutionary relationships between species with reticulate events such as hybridizations and horizontal gene transfers. In this paper, we consider the problem of constructing undirected phylogenetic networks that (1) are planar graphs and (2) admit embeddings in the plane where the vertices labeling all taxa are on the boundary of the network. We develop a new algorithm for constructing phylogenetic networks satisfying these constraints. First, we show that only approximate networks can be constructed for some distance matrices with at least five taxa. Then we prove that any five-point metric can be represented approximately by a planar boundary-labeled network with guaranteed fit value of 94.79. We extend the networks constructed in the proof to design an algorithm for computing planar boundary-labeled networks for any number of taxa.

Concepts: Evolution, Species, Horizontal gene transfer, Phylogenetic tree, Graph theory, Graph, Planar graph, Connectivity

0

MicroRNAs (miRNAs) play a key role in gene expression and regulation in various organisms. They control a wide range of biological processes and are involved in several types of cancers by causing mRNA degradation or translational inhibition. However, the functions of most miRNAs and their precise regulatory mechanisms remain elusive. With the accumulation of expression data for miRNAs and mRNAs, many computational methods have been proposed to predict miRNA-mRNA regulatory relationships. However, most existing methods require the number of modules to be predefined, which may be difficult to determine beforehand. Here, we propose a novel computational method to discover miRNA-mRNA regulatory modules by combining Phase-only correlation and improved rough-Fuzzy Clustering (MIMPFC). The proposed method is evaluated on three heterogeneous datasets, and the obtained results are further validated through the relevant literature, biological significance and functional enrichment analysis. The analysis results show that the identified modules are highly correlated with the biological conditions. A large part of the regulatory relationships found by MIMPFC has been confirmed in databases of experimentally verified interactions. This demonstrates that the modules found by MIMPFC are biologically significant.

Concepts: DNA, Gene, Genetics, Gene expression, Evolution, Biology, Organism, Messenger RNA

0

The recently proposed relative addressing-based ([Formula: see text]) RNA secondary structure representation has important features that allow an RNA structure database to be stored in a suffix array. A fast substructure search algorithm has been proposed based on binary search on this suffix array. Using this substructure search algorithm, we present a fast algorithm that finds the largest common substructure of multiple given RNA structures in [Formula: see text] format. The multiple RNA structure comparison problem is NP-hard in its general formulation. We introduce a new problem for comparing multiple RNA structures. This problem has a stricter similarity definition and objective, and we propose an algorithm that solves it efficiently. We also develop another comparison algorithm that iteratively calls this algorithm to locate nonoverlapping large common substructures in the compared RNAs. With the resulting new tools, we improved the RNASSAC website (linked from http://faculty.tamuc.edu/aarslan ). This website now also includes two drawing tools: one specialized for preparing RNA substructures that can be used as input by the search tool, and another for automatically drawing the entire RNA structure from a given structure sequence.

Concepts: Algorithm, RNA, Secondary structure, Non-coding RNA, Transfer RNA, Array data structure, Substructure, Sorting algorithm
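
The suffix-array search mentioned in the abstract above can be sketched generically. The sketch assumes that structures are plain strings and uses ordinary dot-bracket notation as a stand-in, since the paper's relative addressing format is not reproduced here. The function names are hypothetical, and the naive construction is chosen for brevity, not speed.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction (sorts full suffixes); fine for a small illustration.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions where `pattern` occurs in `text`.

    Uses two binary searches over the suffix array `sa`: one for the first
    suffix whose length-m prefix is >= pattern, one for the first whose
    prefix is > pattern. All suffixes in between start with the pattern.
    """
    m = len(pattern)

    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    structure = "((.))((.))"            # toy dot-bracket string, not the paper's format
    sa = build_suffix_array(structure)
    print(find_occurrences(structure, sa, "(."))
```

Each query takes a logarithmic number of string comparisons in the length of the indexed text. This is why a structure encoding that can be searched as a string is attractive for database lookup.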

0

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analyzing the diversity, abundance and functions of the different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTUs) and observe a significant speedup in run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.

Concepts: Algorithm, Bioinformatics, Microbiology, Sequence, Computer program, Programming language, Source code, Environmental microbiology
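
The idea of partitioning reads into canopies by hashing, described in the abstract above, can be illustrated with a toy sketch. The abstract does not specify the three hashing schemes, so the key used here (the lexicographically smallest k-mer of each read) is an assumption made purely for illustration. The function names are likewise hypothetical.

```python
from collections import defaultdict


def canopy_key(read, k=4):
    """Hypothetical hashing scheme: the lexicographically smallest k-mer of the read.

    Reads shorter than k are used as their own key.
    """
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def build_canopies(reads, k=4):
    """Group reads into coarse canopies by their key.

    Each canopy could then be handed to a slower, more accurate clustering
    method, so the expensive step runs on small groups instead of all reads.
    """
    canopies = defaultdict(list)
    for read in reads:
        canopies[canopy_key(read, k)].append(read)
    return dict(canopies)


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTA", "TTTTGGGG"]
    for key, members in build_canopies(reads).items():
        print(key, members)
```

In this toy input the first two reads overlap heavily and share the same smallest 4-mer, so they land in one canopy. The third read ends up alone.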

0

Chromatin conformation capture with high-throughput sequencing (Hi-C) is a powerful technique to detect genome-wide chromatin interactions. In this paper, we introduce two novel approaches to detect differentially interacting genomic regions between two Hi-C experiments using a network model. To make input data from multiple experiments comparable, we propose a normalization strategy guided by network topological properties. We then devise two measurements, using local and global connectivity information from the chromatin interaction networks, respectively, to assess the interaction differences between two experiments. When multiple replicates are present in the experiments, our approaches give users the flexibility either to pool all replicates together to increase the network coverage, or to use the replicates in parallel to increase the signal-to-noise ratio. We show that while the local method works better in detecting changes from simulated networks, the global method performs better on real Hi-C data. The local and global methods, regardless of pooling, are always superior to two existing methods. Furthermore, our methods work well on both unweighted and weighted networks, and our normalization strategy significantly improves the performance compared with raw networks without normalization. Therefore, we believe our methods will be useful for identifying differentially interacting genomic regions.

Concepts: Interaction

0

In complex disorders, the collaborative role of several genes accounts for the multitude of symptoms, and the discovery of molecular mechanisms requires a proper understanding of the pertinent genes. The majority of recent techniques either use a single information source or consolidate independent views from multiple knowledge sources to assist the discovery of candidate genes. However, given that various kinds of heterogeneous sources are potentially significant for quality gene prioritization, each source bearing information not conveyed by another, we assert that an ideal strategy should provide ways to distinguish among them in a genuinely integrative style that captures the contribution of each, instead of using a simple combination of sources. We propose a flexible approach that enables multi-source information integration for quality gene prioritization and exploits the complementary nature of the various knowledge sources so as to make maximum use of the aggregated data. To illustrate the proposed approach, we took Autism Spectrum Disorder (ASD) as a case study and validated the framework on benchmark studies. We observed that the combined ranking based on integrated knowledge reduces false positive observations and boosts performance when compared with individual rankings. The clinical phenotype validation for ASD shows that there is a significant linkage between top-ranked genes and endophenotypes of ASD. Categorization of genes based on endophenotype associations by this method will be useful for further hypothesis generation leading to clinical and translational analysis. This approach may also be useful in other complex neurological and psychiatric disorders with a strong genetic component.

Concepts: Scientific method, Gene, Genetics, Observation, Autism, Hypothesis, Mental disorder, Autism spectrum

0

Finding an effective measure to predict a more accurate RNA secondary structure is a challenging problem. In the last decade, an experimental method known as selective [Formula: see text]-hydroxyl acylation analyzed by primer extension (SHAPE) was proposed to measure the tendency of forming a base pair for almost all nucleotides in an RNA sequence. These SHAPE reactivities are then utilized to improve the accuracy of RNA structure prediction. Because SHAPE reactivity has a significant impact on prediction, and in order to reduce experimental costs, we propose a new model called HL-k-mer. This model simulates the SHAPE reactivity for each nucleotide in an RNA sequence. This is done by fetching the SHAPE reactivities for all sub-sequences of length k (k-mers) appearing in helix and loop regions. To evaluate the quality of the simulated SHAPE data, the ESD-Fold method is applied to the SHAPE data simulated by the HL-k-mer model ([Formula: see text]). Three different methods are also employed for further evaluation of the simulated SHAPE data. We also extend this model to simulate SHAPE data for pseudoknotted RNA structures. The results indicate that the average prediction accuracies obtained using the SHAPE data simulated by our models (for [Formula: see text]) are higher than those obtained using the experimental SHAPE data.

Concepts: DNA, Scientific method, RNA, Base pair, Evaluation, Science, Nucleotide, Accuracy and precision

0

From their definitions, it appears that the phenotypic robustness and the evolvability of an organism are inversely related to each other. However, a number of studies exploring this question have found conflicting evidence. This motivated the current work, in which we explore the relationship between robustness and evolvability. As a model system, we pick Feed Forward Loops (FFLs) and develop a framework to characterize their performance in terms of their ability to resist changes to steady-state expression (robustness) and their ability to evolve towards novel phenotypes (evolvability). We demonstrate that robustness and evolvability are positively correlated in some FFL topologies. We compare this against other small regulatory topologies and show that the same trend does not hold among them. We postulate that the ability to positively link robustness and evolvability could be an additional reason for the over-representation of FFLs in living organisms, as compared to other regulatory topologies.

Concepts: DNA, Gene, Cell, Bacteria, Evolution, Biology, Organism, Homeostasis

0

The architecture of eukaryotic coding genes allows a single gene to produce several different protein isoforms. Current gene phylogeny reconstruction methods make use of a single protein product per gene, ignoring information on alternative protein isoforms. These methods often lead to inaccurate gene tree reconstructions that need to be corrected before phylogenetic analyses. Here, we propose a new approach for the reconstruction of gene trees and protein trees accounting for alternative protein isoforms. We extend the concept of reconciliation to protein trees, and we define a new reconciliation problem called MinDRGT that consists in finding a gene tree that minimizes a double reconciliation cost with a given protein tree and a given species tree. We define a second problem called MinDRPGT that consists in finding a protein supertree and a gene tree minimizing a double reconciliation cost, given a species tree and a set of protein subtrees. We propose a shift from the traditional view of protein ortholog groups as hard clusters to soft clusters, and we study the MinDRPGT problem under this assumption. We provide exact and heuristic algorithmic solutions for versions of the problems, and we present the results of applications on protein and gene trees from the Ensembl database. The implementations of the methods are available at https://github.com/UdeS-CoBIUS/Protein2GeneTree and https://github.com/UdeS-CoBIUS/SuperProteinTree .

Concepts: DNA, Protein, Gene, Molecular biology, Biology, Organism, Species, Phylogenetic tree