Journal: Briefings in bioinformatics
Next-generation sequencing (NGS) is increasingly being adopted as the backbone of biomedical research. With the commercialization of various affordable desktop sequencers, NGS will be reached by increasing numbers of cellular and molecular biologists, necessitating community consensus on bioinformatics protocols to tackle the exponential increase in quantity of sequence data. The current resources for NGS informatics are extremely fragmented. Finding a centralized synthesis is difficult. A multitude of tools exist for NGS data analysis; however, none of these satisfies all possible uses and needs. This gap in functionality could be filled by integrating different methods in customized pipelines, an approach helped by the open-source nature of many NGS programmes. Drawing from community spirit and with the use of the Wikipedia framework, we have initiated a collaborative NGS resource: The NGS WikiBook. We have collected a sufficient amount of text to incentivize a broader community to contribute to it. Users can search, browse, edit and create new content, so as to facilitate self-learning and feedback to the community. The overall structure and style for this dynamic material is designed for the bench biologists and non-bioinformaticians. The flexibility of online material allows the readers to ignore details in a first read, yet have immediate access to the information they need. Each chapter comes with practical exercises so readers may familiarize themselves with each step. The NGS WikiBook aims to create a collective laboratory book and protocol that explains the key concepts and describes best practices in this fast-evolving field.
Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.
The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool to address these challenges, where many methods have been developed to reduce the storage size of the genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they could be meaningfully analyzed, leaving a large scope for developing novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, time for compression and decompression. We further present the remaining challenges and potential directions for future research.
Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as the future works on parallel computing in bioinformatics.
In recent years, more 3D protein structures have become available, which has made the analysis of large molecular structures much easier. There is a strong demand for geometric models for the study of protein-related interactions. Alpha shape and Delaunay triangulation are powerful tools to represent protein structures and have advantages in characterizing the surface curvature and atom contacts. This review presents state-of-the-art applications of alpha shape and Delaunay triangulation in the studies on protein-DNA, protein-protein, protein-ligand interactions and protein structure analysis.
Flux balance analysis (FBA) is a widely used computational method for characterizing and engineering intrinsic cellular metabolism. The increasing number of its successful applications and growing popularity are possibly attributable to the availability of specific software tools for FBA. Each tool has its unique features and limitations with respect to operational environment, user-interface and supported analysis algorithms. Presented herein is an in-depth evaluation of currently available FBA applications, focusing mainly on usability, functionality, graphical representation and inter-operability. Overall, most of the applications are able to perform basic features of model creation and FBA simulation. COBRA toolbox, OptFlux and FASIMU are versatile to support advanced in silico algorithms to identify environmental and genetic targets for strain design. SurreyFBA, WEbcoli, Acorn, FAME, GEMSiRV and MetaFluxNet are the distinct tools which provide the user friendly interfaces in model handling. In terms of software architecture, FBA-SimVis and OptFlux have the flexible environments as they enable the plug-in/add-on feature to aid prospective functional extensions. Notably, an increasing trend towards the implementation of more tailored e-services such as central model repository and assistance to collaborative efforts was observed among the web-based applications with the help of advanced web-technologies. Furthermore, most recent applications such as the Model SEED, FAME, MetaFlux and MicrobesFlux have even included several routines to facilitate the reconstruction of genome-scale metabolic models. Finally, a brief discussion on the future directions of FBA applications was made for the benefit of potential tool developers.
Several thousand metagenomes have already been sequenced, and this number is set to grow rapidly in the forthcoming years as the uptake of high-throughput sequencing technologies continues. Hand-in-hand with this data bonanza comes the computationally overwhelming task of analysis. Herein, we describe some of the bioinformatic approaches currently used by metagenomics researchers to analyze their data, the issues they face and the steps that could be taken to help overcome these challenges.
The combination of DNA bisulfite treatment with high-throughput sequencing technologies has enabled investigation of genome-wide DNA methylation beyond CpG sites and CpG islands. These technologies have opened new avenues to understand the interplay between epigenetic events, chromatin plasticity and gene regulation. However, the processing, managing and mining of this huge volume of data require specialized computational tools and statistical methods that are yet to be standardized. Here, we describe a complete bisulfite sequencing analysis workflow, including recently developed programs, highlighting each of the crucial analysis steps required, i.e. sequencing quality control, reads alignment, methylation scoring, methylation heterogeneity assessment, genomic features annotation, data visualization and determination of differentially methylated cytosines. Moreover, we discuss the limitations of these technologies and considerations to perform suitable analyses.
Investing in documenting your bioinformatics software well can increase its impact and save your time. To maximize the effectiveness of your documentation, we suggest following a few guidelines we propose here. We recommend providing multiple avenues for users to use your research software, including a navigable HTML interface with a quick start, useful help messages with detailed explanation and thorough examples for each feature of your software. By following these guidelines, you can assure that your hard work maximally benefits yourself and others.
A wide variety of large-scale data have been produced in bioinformatics. In response, the need for efficient handling of biomedical big data has been partly met by parallel computing. However, the time demand of many bioinformatics programs still remains high for large-scale practical uses because of factors that hinder acceleration by parallelization. Recently, new generations of storage devices have emerged, such as NAND flash-based solid-state drives (SSDs), and with the renewed interest in near-data processing, they are increasingly becoming acceleration methods that can accompany parallel processing. In certain cases, a simple drop-in replacement of hard disk drives by SSDs results in dramatic speedup. Despite the various advantages and continuous cost reduction of SSDs, there has been little review of SSD-based profiling and performance exploration of important but time-consuming bioinformatics programs. For an informative review, we perform in-depth profiling and analysis of 23 key bioinformatics programs using multiple types of devices. Based on the insight we obtain from this research, we further discuss issues related to design and optimize bioinformatics algorithms and pipelines to fully exploit SSDs. The programs we profile cover traditional and emerging areas of importance, such as alignment, assembly, mapping, expression analysis, variant calling and metagenomics. We explain how acceleration by parallelization can be combined with SSDs for improved performance and also how using SSDs can expedite important bioinformatics pipelines, such as variant calling by the Genome Analysis Toolkit and transcriptome analysis using RNA sequencing. We hope that this review can provide useful directions and tips to accompany future bioinformatics algorithm design procedures that properly consider new generations of powerful storage devices.