Journal: Bioinformatics (Oxford, England)
MOTIVATION: A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods. RESULTS: We evaluated the ability of multiple genome assembly programs to assemble bacterial genomes from a single, deep-coverage library. For our comparison, we chose bacterial species spanning a wide range of GC content, and measured the contiguity and accuracy of the resulting assemblies. We compared the assemblies produced by this very-high-coverage, one-library strategy to the best assemblies created by two-library sequencing, and found that remarkably good bacterial assemblies are possible with just one library. We also measured the effect of read length and depth of coverage on assembly quality and determined the values that provide the best results with current algorithms.
Metagenomes are often characterized by high levels of unknown sequences. Reads derived from known microorganisms can easily be identified and analyzed using fast homology search algorithms and a suitable reference database, but the unknown sequences are often ignored in further analyses, biasing conclusions. Nevertheless, it is possible to use more data in a comparative metagenomic analysis by creating a cross-assembly of all reads, i.e. a single assembly of reads from different samples. Comparative metagenomics studies the interrelationships between metagenomes from different samples. Using an assembly algorithm is a fast and intuitive way to link (partially) homologous reads without requiring a database of reference sequences.
SUMMARY: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast, customisable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search, and a library of “widgets” performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages. AVAILABILITY: Freely available from http://www.intermine.org under the LGPL license. CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assessing functional associations between an experimentally derived gene or protein set of interest and a database of known gene/protein sets is a common task in the analysis of large-scale functional genomics data. For this purpose, a frequently used approach is to apply an over-representation-based enrichment analysis. However, this approach has four drawbacks: (i) it can only score functional associations of overlapping gene/proteins sets; (ii) it disregards genes with missing annotations; (iii) it does not take into account the network structure of physical interactions between the gene/protein sets of interest and (iv) tissue-specific gene/protein set associations cannot be recognized.
The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications nor embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new Bioinformatics tools with which to access, analyse and visualize Ensembl data.
With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short-reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.
We have developed Cake, a bioinformatics software pipeline that integrates four publicly available somatic variant-calling algorithms to identify single nucleotide variants with higher sensitivity and accuracy than any one algorithm alone. Cake can be run on a high-performance computer cluster or used as a standalone application.
MOTIVATION: BLAST remains one of the most widely used tools in computational biology. The rate at which new sequence data is available continues to grow exponentially, driving the emergence of new fields of biological research. At the same time multicore systems and conventional clusters are more accessible. ScalaBLAST has been designed to run on conventional multiprocessor systems with an eye to extreme parallelism, enabling parallel BLAST calculations using over 16,000 processing cores with a portable, robust, fault-resilient design that introduces little to no overhead with respect to serial BLAST. ScalaBLAST 2.0 source code can be freely downloaded from http://omics.pnl.gov/software/ScalaBLAST.php.
SUMMARY: Two methods to add unaligned sequences into an existing multiple sequence alignment have been implemented as the “–add” and “–addfragments” options in the MAFFT package. The former option is a basic one and applicable only to full-length sequences, while the latter option is applicable even when the unaligned sequences are short and fragmentary. These methods internally infer the phylogenetic relationship among the sequences in the existing alignment, as well as the phylogenetic positions of unaligned sequences. Benchmarks based on two independent simulations consistently suggest that the “–addfragments” option outperforms recent methods, PaPaRa and PAGAN, in accuracy for difficult problems and that these three methods appropriately handle easy problems. AVAILABILITY: http://mafft.cbrc.jp/alignment/software/ CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Available at Bioinformatics online.