SciCombinator

Discover the most talked about and latest scientific content & concepts.

Journal: Briefings in bioinformatics

183

Next-generation sequencing (NGS) is increasingly being adopted as the backbone of biomedical research. With the commercialization of various affordable desktop sequencers, NGS will come within reach of increasing numbers of cellular and molecular biologists, necessitating community consensus on bioinformatics protocols to tackle the exponential increase in the quantity of sequence data. Current resources for NGS informatics are extremely fragmented, and finding a centralized synthesis is difficult. A multitude of tools exist for NGS data analysis; however, none of them satisfies all possible uses and needs. This gap in functionality could be filled by integrating different methods in customized pipelines, an approach aided by the open-source nature of many NGS programmes. Drawing on community spirit and using the Wikipedia framework, we have initiated a collaborative NGS resource: the NGS WikiBook. We have collected a sufficient amount of text to incentivize a broader community to contribute to it. Users can search, browse, edit and create new content, so as to facilitate self-learning and feedback to the community. The overall structure and style of this dynamic material are designed for bench biologists and non-bioinformaticians. The flexibility of online material allows readers to skip details on a first read, yet have immediate access to the information they need. Each chapter comes with practical exercises so readers may familiarize themselves with each step. The NGS WikiBook aims to create a collective laboratory book and protocol that explains the key concepts and describes best practices in this fast-evolving field.

Concepts: Molecular biology, Biology, Sequence, Medical research, Sustainability, Bench, Exponential growth, Wikipedia

137

Knowledge of the relationships between different biological modalities (RNA, chromatin, etc.) can further our understanding of the processes through which biological components interact. The ready availability of multi-omics datasets has led to the development of numerous methods for identifying sources of common variation across biological modalities. However, evaluating the performance of these methods in terms of consistency has been difficult because most of them are unsupervised. We present a comparison of sparse multiple canonical correlation analysis (Sparse mCCA), angle-based joint and individual variation explained (AJIVE) and multi-omics factor analysis (MOFA), using a cross-validation approach to assess overfitting and consistency. Both large- and small-sample datasets were used to evaluate performance, and a permuted null dataset was used to identify overfitting when applying our framework. In the large-sample setting, we found that all methods demonstrated consistency and a lack of overfitting; in the small-sample setting, however, AJIVE provided the most stable results. We provide an R package so that our framework can be applied to evaluate other methods and datasets.
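
As a hedged illustration of the cross-validation idea described above (not the authors' R package), the sketch below uses ordinary CCA from scikit-learn as a stand-in for Sparse mCCA/AJIVE/MOFA on synthetic two-modality data; the variable names, data sizes and permutation-null comparison are illustrative assumptions.

    # Sketch: assess consistency/overfitting of a two-block latent factor method
    # via cross-validation, in the spirit of the evaluation described above.
    # Plain CCA stands in for Sparse mCCA, AJIVE or MOFA.
    import numpy as np
    from sklearn.cross_decomposition import CCA
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    n, p, q = 200, 30, 25
    shared = rng.normal(size=(n, 2))                 # common variation across modalities
    X = shared @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
    Y = shared @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

    def held_out_correlation(X, Y, n_splits=5):
        """Mean canonical correlation on held-out folds (drops when the model overfits)."""
        cors = []
        for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            cca = CCA(n_components=2).fit(X[train], Y[train])
            U, V = cca.transform(X[test], Y[test])
            cors.append(np.mean([np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(2)]))
        return float(np.mean(cors))

    real = held_out_correlation(X, Y)
    null = held_out_correlation(X, Y[rng.permutation(n)])   # permuted null: shared signal destroyed
    print(f"held-out canonical correlation: real={real:.2f}, permuted null={null:.2f}")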

132

The vast amount of experimental data generated by recent advances in high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used to elucidate the interplay of intracellular molecules, enabling advances ranging from a basic understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth in the number of available network resources, each with unique attributes that can be exploited to answer different biological questions. Unfortunately, the sheer volume of network resources makes it difficult for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources, as well as methods of integration, allowing a customized approach to resource selection. Additionally, this report provides a primer for researchers venturing into the field of network integration.
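
As a hedged, toy-scale illustration of the kind of integration discussed above (not any specific resource's scoring scheme), the sketch below merges two hypothetical functional association networks by taking the union of their edges and averaging per-edge confidence scores with the networkx library; the gene names, edge weights and averaging rule are assumptions for illustration only.

    # Sketch: naive integration of two functional association networks by
    # taking the union of edges and combining per-edge confidence scores.
    import networkx as nx

    coexpression = nx.Graph()
    coexpression.add_weighted_edges_from([("TP53", "MDM2", 0.9), ("TP53", "CDKN1A", 0.7)])

    physical = nx.Graph()
    physical.add_weighted_edges_from([("TP53", "MDM2", 0.8), ("MDM2", "MDM4", 0.6)])

    integrated = nx.Graph()
    for g in (coexpression, physical):
        for u, v, w in g.edges(data="weight"):
            if integrated.has_edge(u, v):
                # simple average of evidence; real resources use calibrated scoring
                integrated[u][v]["weight"] = (integrated[u][v]["weight"] + w) / 2
                integrated[u][v]["sources"] += 1
            else:
                integrated.add_edge(u, v, weight=w, sources=1)

    for u, v, d in integrated.edges(data=True):
        print(u, v, d)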

33

Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and has led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.

Concepts: Gene, Genetics, Cancer, Mutation, Human genome, Genomics, Data, Data set
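
For orientation, a minimal sketch of the alignment-to-variant-calling part of the workflow surveyed above, driving widely used command-line tools (bwa, samtools, bcftools) from Python. The file names and tool choices are illustrative assumptions rather than the paper's recommended pipeline, and a production workflow would add quality assessment, duplicate marking, annotation and visualization.

    # Sketch: skeleton of an exome alignment and germline variant-calling workflow.
    # Tool choice and file names are assumptions; real pipelines add QC, duplicate
    # marking, recalibration, annotation and visualization steps.
    import subprocess

    ref = "reference.fa"          # indexed with `bwa index` and `samtools faidx`
    r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

    # 1. Alignment, coordinate sorting and indexing
    subprocess.run(f"bwa mem {ref} {r1} {r2} | samtools sort -o sample.bam -",
                   shell=True, check=True)
    subprocess.run(["samtools", "index", "sample.bam"], check=True)

    # 2. Variant identification (pileup-based germline calling)
    subprocess.run(
        f"bcftools mpileup -f {ref} sample.bam | bcftools call -mv -Oz -o sample.vcf.gz",
        shell=True, check=True)

    # 3. Index the calls for downstream annotation and visualization
    subprocess.run(["bcftools", "index", "sample.vcf.gz"], check=True)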

28

The exponential growth of high-throughput DNA sequence data has posed great challenges for genomic data storage, retrieval and transmission. Compression is a critical tool for addressing these challenges, and many methods have been developed to reduce the storage size of genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they can be meaningfully analyzed, leaving a large scope for developing novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, and compression and decompression time. We further present the remaining challenges and potential directions for future research.

Concepts: DNA, Gene, Genome, Computer storage, Data compression, Media technology, Computer data storage, Image compression
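
As a small, hedged illustration of the general-purpose end of the methods reviewed above (not the specialized genomic compressors themselves), the sketch below compares compression ratio and timing for gzip, bzip2 and LZMA on a synthetic DNA string; specialized tools exploit repeat structure, reference alignment and the four-letter alphabet to do substantially better.

    # Sketch: compare general-purpose compressors on a synthetic DNA sequence.
    import bz2, gzip, lzma, random, time

    random.seed(0)
    dna = "".join(random.choice("ACGT") for _ in range(1_000_000)).encode("ascii")

    for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress), ("lzma", lzma.compress)):
        start = time.perf_counter()
        compressed = compress(dna)
        elapsed = time.perf_counter() - start
        print(f"{name:6s} ratio={len(dna) / len(compressed):5.2f}  time={elapsed:.2f}s")

    # A trivial domain-specific trick: 2 bits per base instead of 8
    two_bit_size = len(dna) / 4
    print(f"2-bit  ratio={len(dna) / two_bit_size:5.2f} (before any entropy coding)")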

28

Bioinformatics is challenged by the fact that traditional analysis tools have difficulty processing the large-scale data produced by high-throughput sequencing. The open-source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce framework-based applications that can be employed in next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as future work on parallel computing in bioinformatics.

Concepts: Future, Hadoop, Cloud computing, Grid computing, Linux, Google, File system, MapReduce
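
A hedged sketch of the MapReduce pattern itself, written in the Hadoop Streaming style where the mapper and reducer each read lines and emit tab-separated key/value pairs; here the hypothetical job counts k-mers in sequencing reads, and the reads, k value and local simulation of the shuffle/sort phase are illustrative assumptions.

    # Sketch: k-mer counting in the MapReduce style used by Hadoop Streaming.
    # On a cluster the mapper and reducer would read stdin; the local run below
    # stands in for the distributed shuffle/sort between the two phases.
    from itertools import groupby

    K = 8  # illustrative k-mer length

    def mapper(lines):
        """Map phase: emit (k-mer, 1) for every k-mer in every read."""
        for read in lines:
            read = read.strip()
            for i in range(len(read) - K + 1):
                yield f"{read[i:i + K]}\t1"

    def reducer(sorted_lines):
        """Reduce phase: sum counts for each k-mer (input must be sorted by key)."""
        for kmer, group in groupby(sorted_lines, key=lambda line: line.split("\t")[0]):
            total = sum(int(line.split("\t")[1]) for line in group)
            yield f"{kmer}\t{total}"

    if __name__ == "__main__":
        reads = ["ACGTACGTACGT", "TTTTACGTACGT"]    # stand-ins for reads streamed on a cluster
        shuffled = sorted(mapper(reads))            # Hadoop's shuffle/sort phase
        for line in reducer(shuffled):
            print(line)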

28

In recent years, more 3D protein structures have become available, which has made the analysis of large molecular structures much easier. There is a strong demand for geometric models for the study of protein-related interactions. Alpha shapes and Delaunay triangulation are powerful tools for representing protein structures and have advantages in characterizing surface curvature and atom contacts. This review presents state-of-the-art applications of alpha shapes and Delaunay triangulation in studies of protein-DNA, protein-protein and protein-ligand interactions, as well as protein structure analysis.

Concepts: Protein structure, Structure, Molecule, Sociology, Differential geometry, Computational geometry, Delaunay triangulation
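
A hedged sketch of the geometric representation described above: scipy's Delaunay triangulation applied to made-up 3D atom coordinates, with the edges of the resulting tetrahedra read out as candidate atom contacts. Real applications would load coordinates from a PDB file and use an alpha-shape style distance cutoff; the coordinates and cutoff below are assumptions for illustration.

    # Sketch: Delaunay triangulation of (made-up) 3D atom coordinates and
    # extraction of triangulation edges as candidate atom-atom contacts.
    import numpy as np
    from scipy.spatial import Delaunay

    rng = np.random.default_rng(1)
    coords = rng.uniform(0, 20, size=(50, 3))     # stand-in for atom coordinates (angstroms)

    tri = Delaunay(coords)

    # Each 3D simplex is a tetrahedron of 4 vertex indices; collect its 6 edges.
    edges = set()
    for simplex in tri.simplices:
        for i in range(4):
            for j in range(i + 1, 4):
                a, b = sorted((simplex[i], simplex[j]))
                edges.add((a, b))

    cutoff = 5.0                                   # illustrative alpha-style cutoff
    contacts = [(a, b) for a, b in edges
                if np.linalg.norm(coords[a] - coords[b]) <= cutoff]
    print(f"{len(edges)} Delaunay edges, {len(contacts)} within {cutoff} angstroms")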

28

Flux balance analysis (FBA) is a widely used computational method for characterizing and engineering intrinsic cellular metabolism. Its growing popularity and increasing number of successful applications are possibly attributable to the availability of dedicated software tools for FBA. Each tool has its unique features and limitations with respect to operational environment, user interface and supported analysis algorithms. Presented herein is an in-depth evaluation of currently available FBA applications, focusing mainly on usability, functionality, graphical representation and interoperability. Overall, most of the applications are able to perform the basic tasks of model creation and FBA simulation. The COBRA toolbox, OptFlux and FASIMU are versatile enough to support advanced in silico algorithms for identifying environmental and genetic targets for strain design. SurreyFBA, WEbcoli, Acorn, FAME, GEMSiRV and MetaFluxNet stand out for providing user-friendly interfaces for model handling. In terms of software architecture, FBA-SimVis and OptFlux offer flexible environments, as their plug-in/add-on mechanisms allow prospective functional extensions. Notably, the web-based applications show an increasing trend towards more tailored e-services, such as central model repositories and support for collaborative efforts, built with the help of advanced web technologies. Furthermore, the most recent applications, such as Model SEED, FAME, MetaFlux and MicrobesFlux, even include routines to facilitate the reconstruction of genome-scale metabolic models. Finally, future directions for FBA applications are briefly discussed for the benefit of potential tool developers.

Concepts: Metabolism, Computer program, Software engineering, Computer software, Application software, Usability, Software architecture, Software system
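
For readers new to FBA itself, below is a hedged, toy-scale sketch of the underlying linear programme (maximize an objective flux subject to the steady-state mass balance S v = 0 and per-reaction flux bounds), solved with scipy rather than any of the dedicated tools evaluated above; the three-reaction network and its bounds are invented purely for illustration.

    # Sketch: flux balance analysis as a linear programme on a toy network.
    # Maximize the flux through a "biomass" reaction subject to S @ v = 0 and
    # flux bounds. Dedicated tools work on genome-scale models instead.
    import numpy as np
    from scipy.optimize import linprog

    # Metabolites: A, B.  Reactions: uptake (-> A), conversion (A -> B), biomass (B ->)
    S = np.array([
        [ 1, -1,  0],   # A
        [ 0,  1, -1],   # B
    ])
    bounds = [(0, 10), (0, 1000), (0, 1000)]     # uptake capped at 10 flux units

    # linprog minimizes, so negate the biomass coefficient to maximize it
    objective = np.array([0, 0, -1])
    result = linprog(objective, A_eq=S, b_eq=np.zeros(S.shape[0]),
                     bounds=bounds, method="highs")

    print("optimal fluxes (uptake, conversion, biomass):", result.x)
    print("maximal biomass flux:", -result.fun)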

28

Several thousand metagenomes have already been sequenced, and this number is set to grow rapidly in the forthcoming years as the uptake of high-throughput sequencing technologies continues. Hand-in-hand with this data bonanza comes the computationally overwhelming task of analysis. Herein, we describe some of the bioinformatic approaches currently used by metagenomics researchers to analyze their data, the issues they face and the steps that could be taken to help overcome these challenges.

Concepts: Bioinformatics, Molecular biology, Genomics, Biotechnology, Metagenomics, Sequence, Set theory

26

The combination of DNA bisulfite treatment with high-throughput sequencing technologies has enabled investigation of genome-wide DNA methylation beyond CpG sites and CpG islands. These technologies have opened new avenues to understanding the interplay between epigenetic events, chromatin plasticity and gene regulation. However, processing, managing and mining this huge volume of data require specialized computational tools and statistical methods that have yet to be standardized. Here, we describe a complete bisulfite sequencing analysis workflow, including recently developed programs, highlighting each of the crucial analysis steps required, i.e. sequencing quality control, read alignment, methylation scoring, methylation heterogeneity assessment, genomic feature annotation, data visualization and determination of differentially methylated cytosines. Moreover, we discuss the limitations of these technologies and considerations for performing suitable analyses.

Concepts: DNA, Gene expression, Histone, Epigenetics, DNA methylation, Methylation, Molecular genetics, Bisulfite sequencing
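
As a hedged illustration of two of the steps listed above (methylation scoring and calling differentially methylated cytosines), the sketch below computes per-cytosine methylation levels from made-up methylated/unmethylated read counts and applies Fisher's exact test between two conditions; the positions and counts are assumptions, and real analyses rely on the dedicated programs covered in the workflow plus coverage filtering and multiple-testing correction.

    # Sketch: methylation scoring and a simple differential-methylation test at
    # single cytosines, using made-up read counts.
    from scipy.stats import fisher_exact

    # position -> (methylated reads, unmethylated reads) per condition (illustrative)
    condition_a = {100: (18, 2), 250: (5, 15), 400: (9, 11)}
    condition_b = {100: (16, 4), 250: (19, 1), 400: (10, 10)}

    for pos in sorted(condition_a):
        ma, ua = condition_a[pos]
        mb, ub = condition_b[pos]
        level_a = ma / (ma + ua)          # methylation level = methylated / total reads
        level_b = mb / (mb + ub)
        _, p = fisher_exact([[ma, ua], [mb, ub]])
        print(f"pos {pos}: {level_a:.2f} vs {level_b:.2f}  Fisher p={p:.3g}")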