Concept: Assembly language
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 5 years ago
The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches, digital normalization and partitioning, to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil.
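The abstract names digital normalization without describing it. As a rough illustration only, the general idea can be sketched as follows: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. The function names, the tiny `k` and `cutoff` values, and the exact `Counter` are assumptions made for readability. Practical tools typically use a memory-efficient probabilistic counter and much larger values, and this sketch ignores reverse complements.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is below the cutoff.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far. Illustrative sketch, not a published tool.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k; nothing to estimate
        estimated_coverage = median([counts[km] for km in read_kmers])
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
    print(digital_normalization(reads))
```

In this toy input, redundant copies of the high-coverage read are dropped once its k-mers are well represented, while the rare read is retained. That is how the approach reduces data volume before assembly.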
MOTIVATION: A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods. RESULTS: We evaluated the ability of multiple genome assembly programs to assemble bacterial genomes from a single, deep-coverage library. For our comparison, we chose bacterial species spanning a wide range of GC content, and measured the contiguity and accuracy of the resulting assemblies. We compared the assemblies produced by this very-high-coverage, one-library strategy to the best assemblies created by two-library sequencing, and found that remarkably good bacterial assemblies are possible with just one library. We also measured the effect of read length and depth of coverage on assembly quality and determined the values that provide the best results with current algorithms.
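The abstract does not say which contiguity measure was used. One widely used summary is N50: the length of the contig at which the cumulative length, taken from longest to shortest, first reaches half of the total assembly length. The sketch below computes it, with the function name `n50` and the example lengths chosen purely for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


if __name__ == "__main__":
    # Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
    print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 indicates that more of the assembly sits in long contigs, but it says nothing about correctness, which is why contiguity and accuracy are reported separately.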
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron’s Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.
BACKGROUND: In order to replicate within their cellular host, many viruses have developed self-assembly strategies for their capsids which are sufficiently robust as to be reconstituted in vitro. Mathematical models for virus self-assembly usually assume that the bonds leading to cluster formation have constant reactivity over the time course of assembly (direct assembly). In some cases, however, binding sites between the capsomers have been reported to be activated during the self-assembly process (hierarchical assembly). RESULTS: In order to study possible advantages of such hierarchical schemes for icosahedral virus capsid assembly, we use Brownian dynamics simulations of a patchy particle model that allows us to switch binding sites on and off during assembly. For T1 viruses, we implement a hierarchical assembly scheme where inter-capsomer bonds become active only if a complete pentamer has been assembled. We find direct assembly to be favorable for reversible bonds allowing for repeated structural reorganizations, while hierarchical assembly is favorable for strong bonds with small dissociation rate, as this situation is less prone to kinetic trapping. However, at the same time it is more vulnerable to monomer starvation during the final phase. Increasing the number of initial monomers has only a weak effect on these general features. The differences between the two assembly schemes become more pronounced for more complex virus geometries, as shown here for T3 viruses, which assemble through homogeneous pentamers and heterogeneous hexamers in the hierarchical scheme. In order to complement the simulations for this more complicated case, we introduce a master equation approach that agrees well with the simulation results. CONCLUSIONS: Our analysis shows for which molecular parameters hierarchical assembly schemes can outperform direct ones. Hierarchical assembly is superior as it avoids kinetic trapping, but suffers more strongly from monomer starvation. These insights increase our physical understanding of an essential biological process, with many interesting potential applications in medicine and materials science.
Transcriptome sequencing and assembly represent a great resource for the study of non-model species, and many metrics have been used to evaluate and compare these assemblies. Unfortunately, it is still unclear which of these metrics accurately reflect assembly quality.
The realization of reconfigurable modular microrobots could aid drug delivery and microsurgery by allowing a single system to navigate diverse environments and perform multiple tasks. So far, microrobotic systems are limited by insufficient versatility; for instance, helical shapes commonly used for magnetic swimmers cannot effectively assemble and disassemble into different sizes and shapes. Here, by using microswimmers with simple geometries constructed of spherical particles, we show how magnetohydrodynamics can be used to assemble and disassemble modular microrobots with different physical characteristics. We develop a mechanistic physical model that we use to improve assembly strategies. Furthermore, we experimentally demonstrate the feasibility of dynamically changing the physical properties of microswimmers through assembly and disassembly in a controlled fluidic environment. Finally, we show that different configurations have different swimming properties by examining swimming speed dependence on configuration size.
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 3 years ago
Electron cryomicroscopy (cryo-EM) has been used to determine the atomic coordinates (models) from density maps of biological assemblies. These models can be assessed by their overall fit to the experimental data and stereochemical information. However, these models do not annotate the actual density values of the atoms nor their positional uncertainty. Here, we introduce a computational procedure to derive an atomic model from a cryo-EM map with annotated metadata. The accuracy of such a model is validated by a faithful replication of the experimental cryo-EM map computed using the coordinates and associated metadata. The functional interpretation of any structural features in the model and its utilization for future studies can be made in the context of its measure of uncertainty. We applied this protocol to the 3.3-Å map of the mature P22 bacteriophage capsid, a large and complex macromolecular assembly. With this protocol, we identify and annotate previously undescribed molecular interactions between capsid subunits that are crucial to maintain stability in the absence of cementing proteins or cross-linking, as occur in other bacteriophages.
Van der Waals heterostructures are comprised of stacked atomically thin two-dimensional crystals and serve as novel materials providing unprecedented properties. However, the random positions and shapes of exfoliated two-dimensional crystals have required repetitive manual tasks of optical microscopy-based searching and mechanical transferring, thereby severely limiting the complexity of heterostructures. To solve this problem, here we develop a robotic system that searches for exfoliated two-dimensional crystals and assembles them into superlattices inside a glovebox. The system can autonomously detect 400 monolayer graphene flakes per hour with a small error rate (<7%) and stack four cycles of the designated two-dimensional crystals per hour with a few minutes of human intervention for each stack cycle. The system enabled fabrication of a superlattice consisting of 29 alternating layers of graphene and hexagonal boron nitride. This capacity provides a scalable approach for prototyping a variety of van der Waals superlattices.