
Journal: Journal of computational biology : a journal of computational molecular cell biology


Abstract One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.

Concepts: Better, Improve, Graph theory, De Bruijn graph, De Bruijn sequence, Nicolaas Govert de Bruijn
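
As background for the abstract above, the following is a minimal sketch of an ordinary (unpaired) de Bruijn graph built from reads. It is not the Paired de Bruijn Graph or the pathset graph described in the abstract. The function name `build_de_bruijn` and the toy reads are assumptions made for illustration only.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a plain de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Hypothetical helper for illustration; mate-pair information is ignored.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy reads (made up) that overlap on "GTC".
example_graph = build_de_bruijn(["ACGTC", "GTCA"], k=3)
```

In this toy case the two reads share the k-mer GTC, so they merge into a single path AC, CG, GT, TC, CA. Paired approaches additionally attach distance information from mate-pairs to such a graph.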


Abstract In metabolomics and other fields dealing with small compounds, mass spectrometry is applied as a sensitive high-throughput technique. Recently, fragmentation trees have been proposed to automatically analyze the fragmentation mass spectra recorded by such instruments. Computationally, this leads to the problem of finding a maximum weight subtree in an edge-weighted and vertex-colored graph, such that every color appears at most once in the solution. We introduce new heuristics and an exact algorithm for this Maximum Colorful Subtree problem and evaluate them against existing algorithms on real-world and artificial datasets. Our tree completion heuristic consistently scores better than other heuristics, while the integer programming-based algorithm produces optimal trees with modest running times. Our fast and accurate heuristic can help determine molecular formulas based on fragmentation trees. On the other hand, optimal trees from the integer linear program are useful if structure is relevant, for example for tree alignments.

Concepts: Algorithm, Mass spectrometry, Tree, Computer program, Graph theory, Programming language, Linear programming, Heuristic
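
To make the Maximum Colorful Subtree problem stated above concrete, here is a minimal sketch of a simple greedy heuristic on a rooted, edge-weighted, vertex-colored graph. This is a generic greedy strategy chosen for illustration, not the tree completion heuristic or the integer programming algorithm from the abstract. The function name, the input layout, and the toy graph are all assumptions.

```python
def greedy_colorful_subtree(root, edges, color):
    """Grow a tree from `root`, always taking the heaviest admissible edge.

    edges: list of (parent, child, weight) tuples (hypothetical layout).
    color: dict mapping each vertex to its color.
    An edge is admissible if it leaves the current tree, has positive
    weight, and leads to a vertex whose color is not yet used.
    """
    in_tree = {root}
    used_colors = {color[root]}
    chosen = []
    while True:
        best = None
        for u, v, w in edges:
            admissible = (
                u in in_tree
                and v not in in_tree
                and color[v] not in used_colors
                and w > 0
            )
            if admissible and (best is None or w > best[2]):
                best = (u, v, w)
        if best is None:
            break
        chosen.append(best)
        in_tree.add(best[1])
        used_colors.add(color[best[1]])
    return chosen, sum(w for _, _, w in chosen)


# Toy instance (made up): vertices a and b share color 1.
toy_color = {"r": 0, "a": 1, "b": 1, "c": 2}
toy_edges = [("r", "a", 5), ("r", "b", 4), ("a", "c", 1), ("b", "c", 6)]
toy_tree, toy_weight = greedy_colorful_subtree("r", toy_edges, toy_color)
```

On this toy instance the greedy choice of the edge r to a (weight 5) blocks vertex b, so the result has total weight 6, whereas the tree r, b, c has weight 10. Gaps of this kind are the reason exact methods are attractive when the tree structure itself matters.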


This article assesses several tools for k-mer counting, with the aim of creating a reference framework that helps bioinformatics researchers identify the computational requirements, parallelization, advantages, disadvantages, and bottlenecks of the algorithms implemented in each tool. The k-mer counters evaluated were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. The measured parameters were RAM usage, processing time, parallelization, and disk read and write access. The dataset consisted of 36,504,800 reads corresponding to human chromosome 14. The assessment was performed for two k-mer lengths: 31 and 55. The results were as follows: pure Bloom filter-based tools and tools using disk-partitioning techniques used less RAM; the tools with the shortest execution times were those that used disk-partitioning techniques; and the greatest degree of parallelization was achieved by tools that used disk partitioning, hash tables with a lock-free approach, or multiple hash tables.

Concepts: Chromosome, Computer program, C, Hash table, Hash function, Linked list, Bloom filter, Cuckoo hashing
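
For readers unfamiliar with the task these tools solve, the sketch below shows the most naive form of k-mer counting: a single in-memory hash table. It does not reproduce the strategy of any tool named above (no Bloom filter, disk partitioning, or lock-free tables), and it counts k-mers as written rather than in canonical form. The function name and the toy reads are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads.

    Windows containing characters other than A, C, G, T are skipped.
    Hypothetical baseline for illustration only.
    """
    valid = set("ACGT")
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts


# Toy reads (made up); the second one ends in an ambiguous base.
example_counts = count_kmers(["ACGTACGT", "CGTAN"], k=3)
```

The whole table lives in RAM here, which is the cost that Bloom filter and disk-partitioning designs try to reduce for large datasets.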


The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatment. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges of such classification can be associated with several virus properties, including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without relying on full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, that detects discriminating subsequences within known pathogen sequences in order to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) efficient pattern extraction and evaluation that maximizes classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through jackknife data partitioning on a dozen varied virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of the MEME pattern extraction tool.
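
Step (1) above, vectorization based on k-mers, can be sketched as follows. This is a generic count-based encoding under stated assumptions, not the CASTOR-KRFE implementation: the function names, the fixed A/C/G/T alphabet, and the lexicographic feature order are all choices made for this example.

```python
from itertools import product


def kmer_vocabulary(k, alphabet="ACGT"):
    """List all possible k-mers over the alphabet in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_vector(sequence, k, vocabulary):
    """Turn a sequence into a vector of k-mer counts, one entry per feature.

    k-mers that are not in the vocabulary (for example, those containing
    ambiguous bases) are ignored.
    """
    index = {kmer: j for j, kmer in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for i in range(len(sequence) - k + 1):
        j = index.get(sequence[i:i + k])
        if j is not None:
            vector[j] += 1
    return vector


# Toy sequence (made up), encoded over the 16 possible 2-mers.
vocab = kmer_vocabulary(2)
example_vector = kmer_vector("ACGA", 2, vocab)
```

Each known sequence becomes one such vector, and a feature selection step like steps (2) and (3) would then look for a small subset of positions that separates the classes.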


Abstract Background: Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and, 2) how many biological replicates are necessary to observe a significant change in expression? Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Conclusions: Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to perform these calculations for their own experiments.

Concepts: DNA, Gene, Genetics, Gene expression, Transcription, Molecular biology, Organism, RNA


RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment.

Concepts: DNA, Scientific method, Gene, Genetics, Fragmentation, Molecular biology, RNA, Counting


Abstract A reference genome is a high quality individual genome that is used as a coordinate system for the genomes of a population, or genomes of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks we formalize the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the underlying genomes' ordering and orientation, creating a pan-genome reference ordering. We show this problem is NP-hard, but also demonstrate, empirically and within simulations, the performance of heuristic algorithms based upon a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software to create a pan-genome reference for whole genome alignments, and demonstrate how it can be used to create novel genome browser visualizations using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.

Concepts: DNA, Genetics, Algorithm, Human genome, Genome, Geometry, Coordinate system, Heuristic


Experimental designs such as matched-pair or longitudinal studies yield mRNA sequencing (mRNA-Seq) counts that are correlated across samples. Most approaches for the analysis of correlated mRNA-Seq data are restricted to a specific design and/or to balanced data only (with the same number of samples in each group). We propose a model that is applicable to the analysis of correlated mRNA-Seq data of different types: paired, clustered, longitudinal, or others. Any combination of explanatory variables, as well as unbalanced data, can be processed within the proposed modeling framework. The model assumes that exon counts of a particular gene of an individual sample jointly follow a multivariate negative-binomial distribution. Additional correlation between exon counts obtained, for example, for individual samples within the same pair or cluster is taken into account by including in the model a cluster-level normally distributed random effect. An interesting feature of the model is that it provides an explicit expression for the marginal correlation between exon counts at different levels. The performance of the model is evaluated by using a simulation study and an analysis of two real-life data sets: a paired mRNA-Seq experiment for 24 patients with clear-cell renal-cell carcinoma and a longitudinal mRNA-Seq experiment for 29 patients with Lyme disease.


The purpose of this study was to explore distinct molecular mechanisms of three lung cancer subtypes. GSE6044 microarray data downloaded from the Gene Expression Omnibus (GEO) database were used to identify the differentially expressed genes (DEGs). A genetic global network was constructed to analyze the network annotation. The DEGs in the genetic global network related to small-cell lung carcinoma (SCLC), lung squamous cell carcinoma (SCC), and lung adenocarcinoma (AC) were screened. Protein-protein interaction networks of DEGs were constructed. Pathway enrichment analyses of DEGs in the three subtypes were performed, followed by construction of an interaction network among pathways. More DEGs were screened in SCLC than in AC and SCC. The genetic global network with 341 genes and 1569 interaction edges was constructed. After annotating these DEGs into a protein interaction network, a total of 695 protein interactions related to these 36 DEGs were obtained. HSP90AA1 was the hub node with the highest degree of 81 in the annotation network. DEGs in SCLC and SCC were mainly enriched in pathways including cell cycle, DNA replication, and histidine metabolism, whereas DEGs in AC were enriched in complement and coagulation cascades and extracellular matrix (ECM)-receptor interaction. A pathway interaction network was constructed, with neuroactive ligand-receptor interaction as its hub node. The identified DEGs such as retinoid X receptor alpha (RXRA), cyclin-dependent kinase 2 (CDK2), histone deacetylase 2 (HDAC2), and KIT might be target genes in lung cancer by participating in different pathways such as ECM-receptor interaction. Complement and coagulation cascades and ECM-receptor interaction might be the specific pathways for AC; smoking might have a closer relationship with SCC.


Locus control regions (LCRs), cis-acting, noncoding regulatory elements with strong transcription-enhancing activity, are conserved in sequence and organization, and exhibit strict gene-specific expression. LCRs have been reported and studied in several mammalian gene systems, signifying that they play an important role in eukaryotic gene expression control. Their highly regulated, stable, and precise levels of expression have made them strong candidates for use in gene therapy vectors. In this study, we attempted to determine the unique signatures of human LCRs by analyzing a data set of LCR sequences for the presence of motifs through a systematic bioinformatics approach. Motif-based analysis was performed using the web-based regulatory sequence analysis tools (RSAT). Detected significant motifs were analyzed further for their identity using the Tomtom tool. RSAT analysis revealed that significant motifs exist within the LCRs. Identity analysis using Tomtom showed that the detected significant motifs were comparable with known transcription factor (TF) binding sites and that the top-scoring motifs belong to zinc finger-containing proteins, an important group of proteins involved in a variety of cellular activities. Correspondence to segments of known motifs indicates the biological relevance of the detected motifs. Motif-based analysis is valuable for analyzing the various characteristics of sequences, notably TF binding models in this study. Owing to their unique expression control abilities, LCRs form an important component of integrating vectors; therefore, identification of unique signatures present within LCR sequences will be instrumental in the design of a new generation of regulatory elements containing LCR sequences.