SciCombinator

Discover the most talked about and latest scientific content & concepts.

Journal: Journal of bioinformatics and computational biology

28

This paper is a self-contained introductory tutorial on the problem in proteomics known as peptide sequencing using tandem mass spectrometry. This tutorial deals specifically with de novo sequencing methods (as opposed to database search methods). We first give an introduction to peptide sequencing, its importance and history and some background on proteins. Next we show the relationship between a peptide and the final spectrum produced from a tandem mass spectrometer, together with a description of the various sources of complications that arise during the process of generating the mass spectrum. From there we model the computational problem of de novo peptide sequencing, which is basically the reverse problem of identifying the peptide which produced the spectrum. We then present several major approaches to solve it (including reviewing some of the current algorithms in each approach), and also discuss related problems and post-processing approaches.

Concepts: Mass spectrometry, Tandem mass spectrometry, Fourier transform ion cyclotron resonance, Top-down proteomics, Collision-induced dissociation, Blackbody infrared radiative dissociation

28

Phylogenetic networks are useful for visualizing evolutionary relationships between species with reticulate events such as hybridizations and horizontal gene transfers. In this paper, we consider the problem of constructing undirected phylogenetic networks that (1) are planar graphs and (2) admit embeddings in the plane where the vertices labeling all taxa are on the boundary of the network. We develop a new algorithm for constructing phylogenetic networks satisfying these constraints. First, we show that only approximate networks can be constructed for some distance matrices with at least five taxa. Then we prove that any five-point metric can be represented approximately by a planar boundary-labeled network with guaranteed fit value of 94.79. We extend the networks constructed in the proof to design an algorithm for computing planar boundary-labeled networks for any number of taxa.

Concepts: Evolution, Species, Horizontal gene transfer, Phylogenetic tree, Graph theory, Graph, Planar graph, Connectivity

0

Predicting the promoter activity of a DNA fragment is an important task in computational biology. Approaches that use physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of the DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only the sequence and static physical properties of a DNA fragment, but also dynamic properties of DNA open states. As a result, the best-performing models reached accuracy values of up to 90% for all types of sequences. Furthermore, we have demonstrated that the classifier can serve as a reliable tool for distinguishing promoter DNA fragments from promoter islands despite the similarity of their nucleotide sequences.

Concepts: DNA, Gene, Genetics, Promoter, Future, Transcription factor, Sequence, Set

0

We present MethyMer, a Python-based tool for selecting primers to amplify complete CpG islands. Selecting appropriate primers for these regions is difficult because of their low complexity and high GC content. Moreover, bisulfite treatment reduces the 4-letter alphabet (ATGC) to a 3-letter one (ATG, except for methylated cytosines), which further reduces region complexity and increases mispriming potential. MethyMer has a flexible scoring system that balances various characteristics such as nucleotide composition, thermodynamic features (melting temperature, dimer ΔG, etc.), the presence of CpG sites and polyN tracts, and primer specificity, which is assessed by aligning primers to the bisulfite-treated genome using bowtie (up to three mismatches are allowed). Users can customize the desired and limit ranges of various parameters as well as the penalties for undesired values. Moreover, MethyMer can pick the optimal combination of PCR primer pairs to amplify a large genomic locus, e.g. a CpG island or another hard-to-study region, with minimal overlap between the individual amplicons. MethyMer incorporates ENCODE genome annotation records (promoter/enhancer/insulator), The Cancer Genome Atlas (TCGA) CpG methylation data derived with Illumina Infinium 450K microarrays, and records on correlations between TCGA RNA-Seq and CpG methylation data for 20 cancer types. These databases are included in the MethyMer release. Our tool is available at https://sourceforge.net/projects/methymer/ .

Concepts: DNA, Molecular biology, DNA sequencing, GC-content, Molecular genetics, Bisulfite sequencing, CpG site, CpG island
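
The alphabet reduction described in the MethyMer abstract can be illustrated with a minimal sketch of in-silico bisulfite conversion. This is not MethyMer code: the function name `bisulfite_convert` and the `methylated_cpg` flag are hypothetical, and the sketch assumes that only the given strand is converted and that methylation, when present, affects every cytosine in a CpG context.

```python
import re


def bisulfite_convert(seq: str, methylated_cpg: bool = False) -> str:
    """Simulate bisulfite conversion of one DNA strand (illustrative only).

    Unmethylated cytosines read as T after conversion and PCR.
    If methylated_cpg is True, every C followed by G is assumed to be
    methylated and is therefore left unchanged.
    """
    seq = seq.upper()
    if methylated_cpg:
        # Convert only cytosines that are NOT directly followed by G.
        return re.sub(r"C(?!G)", "T", seq)
    # Fully unmethylated template: every C becomes T.
    return seq.replace("C", "T")


if __name__ == "__main__":
    template = "ACGTCCGA"
    print(bisulfite_convert(template))
    print(bisulfite_convert(template, methylated_cpg=True))
```

With no methylation the converted sequence contains only A, T and G, which is why primers designed against it are more prone to mispriming.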

0

The discovery of thousands of long noncoding RNAs (lncRNAs) in mammals raises a question about their functionality. It has been shown that some of them are involved in post-transcriptional regulation of other RNAs and form inter-molecular duplexes with their targets. Sequence alignment tools have been used for transcriptome-wide prediction of RNA-RNA interactions. However, such approaches have poor prediction accuracy since they ignore RNA’s secondary structure. Application of the thermodynamics-based algorithms to long transcripts is not computationally feasible on a large scale. Here, we describe a new computational pipeline ASSA that combines sequence alignment and thermodynamics-based tools for efficient prediction of RNA-RNA interactions between long transcripts. To measure the hybridization strength, the sum energy of all the putative duplexes is computed. The main novelty implemented in ASSA is the ability to quickly estimate the statistical significance of the observed interaction energies. Most of the functional hybridizations between long RNAs were classified as statistically significant. ASSA outperformed 11 other tools in terms of the Area Under the Curve on two out of four test sets. Additionally, our results emphasized a unique property of the [Formula: see text] repeats with respect to the RNA-RNA interactions in the human transcriptome. ASSA is available at https://sourceforge.net/projects/assa/.

Concepts: DNA, Statistics, RNA, Statistical significance, Ronald Fisher, Statistical hypothesis testing, P-value, Long noncoding RNA

0

Epilepsy is the fourth most common neurological disease after migraine, stroke, and Alzheimer’s disease. Approximately one-third of all epilepsy cases are refractory to the existing anticonvulsants. Thus, there is an unmet need for newer antiepileptic drugs (AEDs) to manage refractory epilepsy (RE). The discovery of novel AEDs for the treatment of RE is further slowed by a lack of potential pharmacological targets, which remain unavailable because the etiology of the disease is unclear. In this regard, network pharmacology as an area of bioinformatics is gaining popularity. It combines the methods of network biology and polypharmacology, which makes it a promising approach for finding new molecular targets. This work is aimed at discovering new pharmacological targets for the treatment of RE using network pharmacology methods. In our study, the genes associated with the development of RE were selected based on an analysis of available data. The methods of network pharmacology were used to select 83 potential pharmacological targets linked to the selected genes. Then, the 10 most promising targets were chosen based on an analysis of published data. All selected target proteins participate in biological processes that are considered to play a key role in the development of RE. For 9 of the 10 selected targets, potential associations with different kinds of epilepsy have recently been mentioned in the published literature, which gives additional evidence that the applied approach is promising.

Concepts: Pharmacology, Medicine, The Canon of Medicine, Drug, Neurology, Epilepsy, Anticonvulsant, Lamotrigine

0

MicroRNAs (miRNAs) play a key role in gene expression and regulation in various organisms. They control a wide range of biological processes and are involved in several types of cancers by causing mRNA degradation or translational inhibition. However, the functions of most miRNAs and their precise regulatory mechanisms remain elusive. With the accumulation of expression data for miRNAs and mRNAs, many computational methods have been proposed to predict miRNA-mRNA regulatory relationships. However, most existing methods require the number of modules to be predefined, which may be difficult to determine beforehand. Here, we propose a novel computational method to discover miRNA-mRNA regulatory modules by combining Phase-only correlation and improved rough-Fuzzy Clustering (MIMPFC). The proposed method is evaluated on three heterogeneous datasets, and the obtained results are further validated through the relevant literature, biological significance and functional enrichment analysis. The analysis results show that the identified modules are highly correlated with the biological conditions. A large part of the regulatory relationships found by MIMPFC have been confirmed in experimentally verified databases. This demonstrates that the modules found by MIMPFC are biologically significant.

Concepts: DNA, Gene, Genetics, Gene expression, Evolution, Biology, Organism, Messenger RNA

0

The recently proposed relative addressing-based ([Formula: see text]) RNA secondary structure representation has important features that allow an RNA structure database to be stored in a suffix array. A fast substructure search algorithm has been proposed based on binary search on this suffix array. Using this substructure search algorithm, we present a fast algorithm that finds the largest common substructure of given multiple RNA structures in [Formula: see text] format. The multiple RNA structure comparison problem is NP-hard in its general formulation. We introduce a new problem for comparing multiple RNA structures. This problem has a stricter similarity definition and objective, and we propose an algorithm that solves it efficiently. We also develop another comparison algorithm that iteratively calls this algorithm to locate nonoverlapping large common substructures in the compared RNAs. With the resulting new tools, we improved the RNASSAC website (linked from http://faculty.tamuc.edu/aarslan ). This website now also includes two drawing tools: one specialized for preparing RNA substructures that can be used as input by the search tool, and another for automatically drawing the entire RNA structure from a given structure sequence.

Concepts: Algorithm, RNA, Secondary structure, Non-coding RNA, Transfer RNA, Array data structure, Substructure, Sorting algorithm
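
The idea of substructure search by binary search on a suffix array, mentioned in the abstract above, can be sketched generically as follows. This is not the RNASSAC implementation: the structure representation used in the paper is not reproduced on this page, so the sketch assumes plain dot-bracket strings as a stand-in, builds the suffix array naively, and uses hypothetical function names.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction (sorting full suffixes); fine for a small example,
    too slow for a real structure database.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Find all start positions of pattern via two binary searches on sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    structure = "((..))(..)"  # assumed dot-bracket stand-in
    sa = build_suffix_array(structure)
    print(find_occurrences(structure, sa, "(..)"))
```

Each query costs a logarithmic number of string comparisons in the length of the indexed text, which is what makes repeated substructure lookups cheap once the array is built.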

0

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Because community sequencing yields a large volume of random short sequences (reads), analyzing the diversity, abundance and functions of the different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup in run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.

Concepts: Algorithm, Bioinformatics, Microbiology, Sequence, Computer program, Programming language, Source code, Environmental microbiology
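
The canopy step described in the metagenomics abstract (partitioning reads into coarse groups by hashing before running an expensive clustering method) can be sketched as below. The abstract does not say which three hashing schemes the authors use, so this sketch assumes a simple one-value MinHash over k-mers; the function names and the choice of `k` are hypothetical.

```python
import hashlib
from collections import defaultdict


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (built-in hash() is salted per run)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def canopy_key(read: str, k: int = 4) -> int:
    """Smallest k-mer hash in the read: a one-value MinHash signature."""
    if len(read) < k:
        # Reads shorter than k are hashed whole.
        return kmer_hash(read)
    return min(kmer_hash(read[i:i + k]) for i in range(len(read) - k + 1))


def build_canopies(reads: list[str], k: int = 4) -> list[list[str]]:
    """Group reads that share the same MinHash key into one canopy."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[canopy_key(read, k)].append(read)
    return list(canopies.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTA", "TTTTGGGG"]
    for canopy in build_canopies(reads):
        print(canopy)
```

Reads with the same k-mer set always share a canopy, and reads with overlapping k-mer sets do so with probability equal to their Jaccard similarity; a downstream clustering method would then be run inside each canopy only.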

0

Chromatin conformation capture with high-throughput sequencing (Hi-C) is a powerful technique to detect genome-wide chromatin interactions. In this paper, we introduce two novel approaches to detect differentially interacting genomic regions between two Hi-C experiments using a network model. To make input data from multiple experiments comparable, we propose a normalization strategy guided by network topological properties. We then devise two measurements, using local and global connectivity information from the chromatin interaction networks, respectively, to assess the interaction differences between two experiments. When multiple replicates are present in experiments, our approaches give users the flexibility either to pool all replicates together to increase the network coverage, or to use the replicates in parallel to increase the signal-to-noise ratio. We show that while the local method works better in detecting changes from simulated networks, the global method performs better on real Hi-C data. The local and global methods, regardless of pooling, are always superior to two existing methods. Furthermore, our methods work well on both unweighted and weighted networks, and our normalization strategy significantly improves performance compared with raw networks without normalization. Therefore, we believe our methods will be useful for identifying differentially interacting genomic regions.

Concepts: Interaction