We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2-4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined together, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai.
Molecular descriptors are widely employed to present molecular characteristics in cheminformatics. Various molecular-descriptor-calculation software programs have been developed. However, users of those programs must contend with several issues, including software bugs, insufficient update frequencies, and software licensing constraints. To address these issues, we propose Mordred, a developed descriptor-calculation software application that can calculate more than 1800 two- and three-dimensional descriptors. It is freely available via GitHub. Mordred can be easily installed and used in the command line interface, as a web application, or as a high-flexibility Python package on all major platforms (Windows, Linux, and macOS). Performance benchmark results show that Mordred is at least twice as fast as the well-known PaDEL-Descriptor and it can calculate descriptors for large molecules, which cannot be accomplished by other software. Owing to its good performance, convenience, number of descriptors, and a lax licensing constraint, Mordred is a promising choice of molecular descriptor calculation software that can be utilized for cheminformatics studies, such as those on quantitative structure-property relationships.
Motivation: Storing, transmitting, and archiving data produced by next generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3% - 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X. Contact: email@example.com.
RNA-Sequencing has provided unprecedented resolution of alternative splicing and splicing-quantitative trait loci (sQTL). However, there are few tools available for visualizing the genotype-dependent effects of splicing at a population level. SplicePlot is a simple command line utility that produces intuitive visualization of sQTLs and their effects. SplicePlot takes mapped RNA-seq reads in BAM format and genotype data in VCF format as input and outputs publication quality sashimi plots, hive plots, and structure plots enabling better investigation and understanding of the role of genetics on alternative splicing and transcript structure.Availability and Implementation: Source code and detailed documentation are available at http://montgomerylab.stanford.edu/spliceplot/index.html under Resources and at Github. SplicePlot is implemented in Python and is supported on Linux and Mac OS. A VirtualBox virtual machine running Ubuntu with SplicePlot already installed is also available.
We present an update to our Galaxy-based web server for processing and visualizing deeply sequenced data. Its core tool set, deepTools, allows users to perform complete bioinformatic workflows ranging from quality controls and normalizations of aligned reads to integrative analyses, including clustering and visualization approaches. Since we first described our deepTools Galaxy server in 2014, we have implemented new solutions for many requests from the community and our users. Here, we introduce significant enhancements and new tools to further improve data visualization and interpretation. deepTools continue to be open to all users and freely available as a web service atdeeptools.ie-freiburg.mpg.de The new deepTools2 suite can be easily deployed within any Galaxy framework via the toolshed repository, and we also provide source code for command line usage under Linux and Mac OS X. A public and documented API for access to deepTools functionality is also available.
We present the latest version of the Molecular Evolutionary Genetics Analysis (MEGA) software, which contains many sophisticated methods and tools for phylogenomics and phylomedicine. In this major upgrade, MEGA has been optimized for use on 64-bit computing systems for analyzing bigger datasets. Researchers can now explore and analyze tens of thousands of sequences in MEGA. The new version also provides an advanced wizard for building timetrees and includes a new functionality to automatically predict gene duplication events in gene family trees. The 64-bit MEGA is made available in two interfaces: graphical and command line. The graphical user interface (GUI) is a native Microsoft Windows application that can also be used on Mac OSX. The command line MEGA is available as native applications for Windows, Linux, and Mac OSX. They are intended for use in high-throughput and scripted analysis. Both versions are available fromwww.megasoftware.netfree of charge.
The world’s natural history collections constitute an enormous evidence base for scientific research on the natural world. To facilitate these studies and improve access to collections, many organisations are embarking on major programmes of digitization. This requires automated approaches to mass-digitization that support rapid imaging of specimens and associated data capture, in order to process the tens of millions of specimens common to most natural history collections. In this paper we present Inselect-a modular, easy-to-use, cross-platform suite of open-source software tools that supports the semi-automated processing of specimen images generated by natural history digitization programmes. The software is made up of a Windows, Mac OS X, and Linux desktop application, together with command-line tools that are designed for unattended operation on batches of images. Blending image visualisation algorithms that automatically recognise specimens together with workflows to support post-processing tasks such as barcode reading, label transcription and metadata capture, Inselect fills a critical gap to increase the rate of specimen digitization.
High-throughput single-cell technologies provide an unprecedented view into cellular heterogeneity, yet they pose new challenges in data analysis and interpretation. In this protocol, we describe the use of Spanning-tree Progression Analysis of Density-normalized Events (SPADE), a density-based algorithm for visualizing single-cell data and enabling cellular hierarchy inference among subpopulations of similar cells. It was initially developed for flow and mass cytometry single-cell data. We describe SPADE’s implementation and application using an open-source R package that runs on Mac OS X, Linux and Windows systems. A typical SPADE analysis on a 2.27-GHz processor laptop takes ∼5 min. We demonstrate the applicability of SPADE to single-cell RNA-seq data. We compare SPADE with recently developed single-cell visualization approaches based on the t-distribution stochastic neighborhood embedding (t-SNE) algorithm. We contrast the implementation and outputs of these methods for normal and malignant hematopoietic cells analyzed by mass cytometry and provide recommendations for appropriate use. Finally, we provide an integrative strategy that combines the strengths of t-SNE and SPADE to infer cellular hierarchy from high-dimensional single-cell data.
AliView is an alignment viewer and editor designed to meet the requirements of next generation sequencing era phylogenetic datasets. AliView handles alignments of unlimited size in the formats most commonly used, i.e. Fasta, Phylip, Nexus, Clustal and MSF. The intuitive graphical interface makes it easy to inspect, sort, delete, merge and realign sequences as part of the manual filtering process of large data sets. AliView also works as an easy to use alignment editor for small as well as large data sets. Availability: AliView is released as open-source software under the GNU General Public License, version 3.0 (GPLv3), and is available at GitHub (www.github.com/AliView). The program is cross-platform and extensively tested on Linux, Mac OS X and Windows systems. Downloads and help are available at http://ormbunkar.se/aliview CONTACT: firstname.lastname@example.org Supplementary information:
The ultimate goal of genomic research is to effectively predict phenotypes from genotypes so that medical management can improve human health and molecular breeding can increase agricultural production. Genomic prediction or selection (GS) plays a complementary role to genome-wide association studies (GWAS), which is the primary method to identify genes underlying phenotypes. Unfortunately, most computing tools cannot perform data analyses for both GWAS and GS. Furthermore, the majority of these tools are executed through a command-line interface (CLI), which requires programming skills. Non-programmers struggle to use them efficiently because of the steep learning curves and zero tolerance for data formats and mistakes when inputting keywords and parameters. To address these problems, this study developed a software package, named the Intelligent Prediction and Association Tool (iPat), with a user-friendly graphical user interface (GUI). With iPat, GWAS or GS can be performed using a pointing device to simply drag and/or click on graphical elements to specify input data files, choose input parameters, and select analytical models. Models available to users include those implemented in third party CLI packages such as GAPIT, PLINK, FarmCPU, BLINK, rrBLUP, and BGLR. Users can choose any data format and conduct analyses with any of these packages. File conversions are automatically conducted for specified input data and selected packages. A GWAS-assisted genomic prediction method was implemented to perform genomic prediction using any GWAS method such as FarmCPU. iPat was written in Java for adaptation to multiple operating systems including Windows, Mac, and Linux.