Journal: Journal of computational biology : a journal of computational molecular cell biology


Abstract One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.

Concepts: Better, Improve, Graph theory, De Bruijn graph, De Bruijn sequence, Nicolaas Govert de Bruijn


Abstract In metabolomics and other fields dealing with small compounds, mass spectrometry is applied as a sensitive high-throughput technique. Recently, fragmentation trees have been proposed to automatically analyze the fragmentation mass spectra recorded by such instruments. Computationally, this leads to the problem of finding a maximum weight subtree in an edge-weighted and vertex-colored graph, such that every color appears at most once in the solution. We introduce new heuristics and an exact algorithm for this Maximum Colorful Subtree problem and evaluate them against existing algorithms on real-world and artificial datasets. Our tree completion heuristic consistently scores better than other heuristics, while the integer programming-based algorithm produces optimal trees with modest running times. Our fast and accurate heuristic can help determine molecular formulas based on fragmentation trees. On the other hand, optimal trees from the integer linear program are useful if structure is relevant, for example for tree alignments.

Concepts: Algorithm, Mass spectrometry, Tree, Computer program, Graph theory, Programming language, Linear programming, Heuristic
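
To make the Maximum Colorful Subtree problem concrete, the following is a minimal sketch of a naive greedy heuristic. It is not the tree completion heuristic or the integer linear program from the abstract; the function name `greedy_colorful_subtree` and the input layout (a dictionary of directed, weighted edges plus a vertex-to-color dictionary) are assumptions made for illustration only.

```python
def greedy_colorful_subtree(root, color, edges):
    """Naive greedy heuristic (illustrative, not the paper's method).

    root  : the root vertex of the subtree
    color : dict mapping vertex -> color
    edges : dict mapping (parent, child) -> weight
    Returns (list of chosen edges, total weight).
    """
    in_tree = {root}
    used_colors = {color[root]}
    chosen = []
    total = 0.0
    while True:
        best = None
        for (u, v), w in edges.items():
            # An edge is admissible if it leaves the current tree, reaches a
            # vertex whose color is still unused, and has positive weight.
            if u in in_tree and v not in in_tree and color[v] not in used_colors and w > 0:
                if best is None or w > best[2]:
                    best = (u, v, w)
        if best is None:
            break
        u, v, w = best
        in_tree.add(v)
        used_colors.add(color[v])
        chosen.append((u, v))
        total += w
    return chosen, total


# Toy instance: vertices "a" and "b" share a color, so only one may be used.
color = {"r": 0, "a": 1, "b": 1, "c": 2}
edges = {("r", "a"): 5.0, ("r", "b"): 4.0, ("b", "c"): 10.0, ("a", "c"): 1.0}
print(greedy_colorful_subtree("r", color, edges))
# -> ([('r', 'a'), ('a', 'c')], 6.0)
```

On this toy instance the greedy choice scores 6, while the subtree r→b→c scores 14. A locally best edge can block a better tree through the color constraint, which is why the quality gap between heuristics and exact methods matters for this problem.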


This article assesses several k-mer counting tools in order to give bioinformatics researchers a reference framework for identifying the computational requirements, parallelization, advantages, disadvantages, and bottlenecks of the algorithm each tool implements. The k-mer counters evaluated were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. The measured parameters were RAM usage, processing time, parallelization, and disk read and write access. The dataset consisted of 36,504,800 reads from human chromosome 14. The assessment was performed for two k-mer lengths: 31 and 55. The results were as follows: tools based purely on Bloom filters and tools using disk-partitioning techniques used less RAM. The tools with the shortest execution times were those that used disk-partitioning techniques. The tools that achieved the greatest parallelization were those that used disk partitioning, lock-free hash tables, or multiple hash tables.

Concepts: Chromosome, Computer program, C, Hash table, Hash function, Linked list, Bloom filter, Cuckoo hashing
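
For readers unfamiliar with the task these tools solve, here is a minimal in-memory sketch of k-mer counting with a single hash table. It is not how any of the evaluated tools is implemented, since they use Bloom filters, disk partitioning, or lock-free tables to scale. Counting canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is a common convention assumed here, and the helper names are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers across all reads using one in-memory Counter."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


print(dict(count_kmers(["ACGTA", "TACGT"], 3)))
# -> {'ACG': 4, 'GTA': 2}
```

A table like this holds every distinct k-mer in memory at once, which is the RAM cost that the Bloom filter and disk-partitioning approaches in the assessment are designed to reduce.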


The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without relying on full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences in order to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) efficient pattern extraction and evaluation that maximizes classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen varied virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of MEME patterns extraction tool.
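
Step (1) above, k-mer vectorization, can be sketched as follows. This is an illustrative assumption about what such a vectorization might look like, not CASTOR-KRFE's actual code; the function name `kmer_vectorize` is hypothetical, and the feature selection of steps (2) and (3) is not shown.

```python
from collections import Counter


def kmer_vectorize(sequences, k):
    """Turn sequences into count vectors over the k-mers observed in the set.

    Returns (vocabulary, matrix), where matrix[i][j] is the number of times
    vocabulary[j] occurs in sequences[i].
    """
    per_sequence = [
        Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for seq in sequences
    ]
    # The feature space is every k-mer seen in at least one sequence.
    vocabulary = sorted(set().union(*per_sequence))
    matrix = [[counts[kmer] for kmer in vocabulary] for counts in per_sequence]
    return vocabulary, matrix


vocabulary, matrix = kmer_vectorize(["AAAC", "AACG"], 2)
print(vocabulary)  # -> ['AA', 'AC', 'CG']
print(matrix)      # -> [[2, 1, 0], [1, 1, 1]]
```

Each column is a candidate feature. A method of the kind described would then search for a small subset of columns that still separates the known classes.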


Abstract Background: Given its high technical reproducibility and orders-of-magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and 2) how many biological replicates are necessary to observe a significant change in expression? Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Conclusions: Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these values for their own experiments.

Concepts: DNA, Gene, Genetics, Gene expression, Transcription, Molecular biology, Organism, RNA


RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment.

Concepts: DNA, Scientific method, Gene, Genetics, Fragmentation, Molecular biology, RNA, Counting
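
The enumeration idea in this abstract can be illustrated with a small sketch. The abstract does not give the exact model, so the following rests on loudly stated assumptions: a "maximal placement" is taken to mean a set of non-overlapping fragments of length F on a transcript of length T such that no further fragment fits in any gap, and every such configuration is weighted equally. The function names are hypothetical.

```python
from fractions import Fraction


def maximal_placements(T, F, prev_end=0):
    """Yield tuples of fragment start positions (0-based).

    Assumption: fragments do not overlap, and every gap (including the
    leading and trailing gap) is shorter than F, so no fragment can be added.
    """
    if prev_end + F > T:
        # The remaining tail is too short for another fragment.
        yield ()
        return
    # The next start must leave a gap shorter than F and still fit in T.
    for start in range(prev_end, min(prev_end + F, T - F + 1)):
        for rest in maximal_placements(T, F, start + F):
            yield (start,) + rest


def expected_coverage(T, F):
    """Average per-position coverage over all maximal placements,
    assuming each configuration is equally likely."""
    covered = [0] * T
    n_configs = 0
    for starts in maximal_placements(T, F):
        n_configs += 1
        for start in starts:
            for pos in range(start, start + F):
                covered[pos] += 1
    return [Fraction(c, n_configs) for c in covered]


print(list(maximal_placements(5, 2)))
# -> [(0, 2), (0, 3), (1, 3)]
print([str(x) for x in expected_coverage(5, 2)])
# -> ['2/3', '1', '2/3', '1', '2/3']
```

Even in this toy case (T = 5, F = 2) the expected profile is not flat, which is consistent with the abstract's statement that the expected profiles are not uniform.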


Abstract A reference genome is a high quality individual genome that is used as a coordinate system for the genomes of a population, or genomes of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks we formalize the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the underlying genomes' ordering and orientation, creating a pan-genome reference ordering. We show this problem is NP-hard, but also demonstrate, empirically and within simulations, the performance of heuristic algorithms based upon a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software to create a pan-genome reference for whole genome alignments, and demonstrate how it can be used to create novel genome browser visualizations using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.

Concepts: DNA, Genetics, Algorithm, Human genome, Genome, Geometry, Coordinate system, Heuristic


This study aimed to investigate the role of prostate cancer associated transcript 1 (PCAT1) in the molecular mechanisms of prostate cancer. Using the GSE29886 data set downloaded from the Gene Expression Omnibus database, we screened the differentially expressed genes (DEGs) in PCAT1-siRNA interfering (PCAT1-siRNA) LNCaP cells compared with control-siRNA cells. Transcription factor (TF) and tumor-associated gene databases were used to obtain oncogenes and tumor suppressor genes. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were used to investigate the functions and pathways of the DEGs. The subnetwork was further analyzed using BioNet. A total of 93 DEGs were identified. KEGG analysis showed that downregulated TF genes (ID1 and ID3) were enriched in the transforming growth factor-β pathway, whereas upregulated genes were involved in pathways associated with the immune system, environmental sensing, and metabolism. GO analysis showed that downregulated genes were primarily enriched in cell cycle-related biological functions and upregulated DEGs were related to immune response, exogenous genetic material response, and viral response. Centromere protein F (CENPF) was identified as the central node of the regulatory subnetwork. In the PCAT1 knockdown LNCaP cells, CENPF, ID1, and ID3 were clearly decreased according to quantitative real-time reverse transcription PCR (RT-PCR) analysis. PCAT1 may be involved in the cell cycle and proliferation of prostate cancers by mediating the expression of CENPF, ID1, and ID3.


The synthesis on the laboratory bench of the natural product known as alizarin was achieved in 1868. The subsequent elucidation of its structure was a milestone in the development of chemical theory based on Kekulé’s benzene ring and in the growth of the synthetic dyestuff industry. Dye and dyeing properties and theories were exploited for biological studies by the medical researcher Paul Ehrlich. Particular attention was paid to the side chains (functional or attached groups of atoms) of molecules. They became important in visualizing a mechanism for immunity, and then in the early 1900s for enabling a description of chemotherapeutic action. These side chains were transformed into the receptors that played a vital role in the development of theories well suited to the design of drugs during the second half of the twentieth century.


Novelty is a topic of broad interest, with two distinct approaches within evolutionary biology. The dominant approach since Darwin has been transformationist, with novelty arising through gradual changes in morphology. The Modern Synthesis emphasized the importance of ecological opportunity rather than the source of variation, and this view has many adherents today. Yet, since well before Darwin, an alternative view has held that novelties could arise by rapid changes and may not necessarily be connected to ecological opportunity. The rise of comparative evolutionary developmental biology since 1990 has led to a resurgence of these arguments. Many case studies have documented novelties and there have been rigorous efforts to define the attributes of novelty, but there have been few attempts at a more general model. In contrast, studies of technological innovation have been replete with qualitative models since the 1930s. In this article I consider several possibilities for constructing a general model of novelty and innovation: (1) A general formal theory. (2) Commonalities between different levels, such as genes and morphology, but with sufficient differences between domains that any formal theory would be level specific. (3) Commonalities across levels but for various reasons developing a formal theory even within domains is improbable. A final alternative is that novelty and innovation may be so deeply historical that any general framework is impossible. I conclude that a common conceptual framework can be developed and serve as the foundation for simulation studies, but the importance of feedbacks and potentiating factors renders a formal model implausible.