Discover the most talked about and latest scientific content & concepts.

Journal: Journal of computational biology : a journal of computational molecular cell biology


Abstract One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.

Concepts: Better, Improve, Graph theory, De Bruijn graph, De Bruijn sequence, Nicolaas Govert de Bruijn


Abstract In metabolomics and other fields dealing with small compounds, mass spectrometry is applied as a sensitive high-throughput technique. Recently, fragmentation trees have been proposed to automatically analyze the fragmentation mass spectra recorded by such instruments. Computationally, this leads to the problem of finding a maximum weight subtree in an edge-weighted and vertex-colored graph, such that every color appears, at most once in the solution. We introduce new heuristics and an exact algorithm for this Maximum Colorful Subtree problem and evaluate them against existing algorithms on real-world and artificial datasets. Our tree completion heuristic consistently scores better than other heuristics, while the integer programming-based algorithm produces optimal trees with modest running times. Our fast and accurate heuristic can help determine molecular formulas based on fragmentation trees. On the other hand, optimal trees from the integer linear program are useful if structure is relevant, for example for tree alignments.

Concepts: Algorithm, Mass spectrometry, Tree, Computer program, Graph theory, Programming language, Linear programming, Heuristic


This article is about the assessment of several tools for k-mer counting, with the purpose to create a reference framework for bioinformatics researchers to identify computational requirements, parallelizing, advantages, disadvantages, and bottlenecks of each of the algorithms proposed in the tools. The k-mer counters evaluated in this article were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. Measured parameters were the following: RAM occupied space, processing time, parallelization, and read and write disk access. A dataset consisting of 36,504,800 reads was used corresponding to the 14th human chromosome. The assessment was performed for two k-mer lengths: 31 and 55. Obtained results were the following: pure Bloom filter-based tools and disk-partitioning techniques showed a lesser RAM use. The tools that took less execution time were the ones that used disk-partitioning techniques. The techniques that made the major parallelization were the ones that used disk partitioning, hash tables with lock-free approach, or multiple hash tables.

Concepts: Chromosome, Computer program, C, Hash table, Hash function, Linked list, Bloom filter, Cuckoo hashing


The classification of pathogens in emerging and re-emerging viruses represents major interests in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without exploiting the full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences to classify accurately unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) efficient way of pattern extraction and evaluation maximizing classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen of various virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of MEME patterns extraction tool.


Abstract Background: Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and, 2) how many biological replicates are necessary to observe a significant change in expression? Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Conclusions: Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimally sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate for their own experiments.

Concepts: DNA, Gene, Genetics, Gene expression, Transcription, Molecular biology, Organism, RNA


Earlier analysis of the Protein Data Bank derived the distribution of rotations from the plane of a protein hydrogen bond donor peptide group to the plane of its acceptor peptide group. The quasi Boltzmann formalism of Pohl-Finkelstein is employed to estimate free energies of protein elements with these hydrogen bonds, pinpointing residues with a high propensity for conformational change. This is applied to viral glycoproteins as well as capsids, where the 90th+ percentiles of free energies determine residues that correlate well with viral fusion peptides and other functional domains in known cases and thus provide a novel method for predicting these sites of importance as antiviral drug or vaccine targets in general. The method is implemented at from an uploaded Protein Data Bank file.


RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment.

Concepts: DNA, Scientific method, Gene, Genetics, Fragmentation, Molecular biology, RNA, Counting


Abstract A reference genome is a high quality individual genome that is used as a coordinate system for the genomes of a population, or genomes of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks we formalize the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the underlying genomes' ordering and orientation, creating a pan-genome reference ordering. We show this problem is NP-hard, but also demonstrate, empirically and within simulations, the performance of heuristic algorithms based upon a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software to create a pan-genome reference for whole genome alignments, and demonstrate how it can be used to create novel genome browser visualizations using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.

Concepts: DNA, Genetics, Algorithm, Human genome, Genome, Geometry, Coordinate system, Heuristic


Abstract Context dependence is central to the description of complexity. Keying on the pairwise definition of “set complexity,” we use an information theory approach to formulate general measures of systems complexity. We examine the properties of multivariable dependency starting with the concept of interaction information. We then present a new measure for unbiased detection of multivariable dependency, “differential interaction information.” This quantity for two variables reduces to the pairwise “set complexity” previously proposed as a context-dependent measure of information in biological systems. We generalize it here to an arbitrary number of variables. Critical limiting properties of the “differential interaction information” are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of “differential interaction information” also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multivariable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system.

Concepts: Mathematics, Measurement, Data, Emergence, Complexity, Systems theory, Complex system, Generalization


Both epithelium and stroma are involved in breast cancer invasion and metastasis. This study aimed at identifying the roles of the stroma in breast cancer tumorigenesis and metastasis. Gene expression profiling GSE10797 was downloaded from the Gene Expression Omnibus database, and it included 28-paired stroma and epithelium breast tissue samples from invasive breast cancer patients and 10 paired normal breast tissue samples. Differentially expressed genes (DEGs) between breast cancer and normal breast tissue samples were identified by using the limma package followed by Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses to seek the potential functions of DEGs. Moreover, a protein-protein interaction network was constructed based on the String database, and modules were selected through the BioNet tool. Further, functional annotations of DEGs were carried out by using tumor suppressor gene and tumor associated gene databases. Ultimately, KEGG pathway enrichment analysis for DEGs in modules was performed. A total of 38 and 156 DEGs were identified from normal invasive stromal cells and epithelial cells, respectively. DEGs in stromal and epithelial cells were significantly enriched in extracellular matrix (ECM)- and cell proliferation-related functions. COL1A2, a hub node in the stromal module, was mainly enriched in ECM-receptor interaction and focal adhesion pathways. JUN, a hub node in the epithelium module, was significantly enriched in cancer and ErbB signaling pathways. COL1A2, COL1A1, COL3A1, JUN, and FN1 might be vital for tumorigenesis and metastasis of invasive breast cancer. These genes might be potential therapeutic targets for breast cancer treatment.