Concept: DNA sequencing
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 5 years ago
In the last decade there has been an exponential increase in knowledge about the genetic basis of complex human traits, including neuropsychiatric disorders. It is not clear, however, to what extent this knowledge can be used as a starting point for drug identification, one of the central hopes of the human genome project. The aim of the present study was to identify memory-modulating compounds through the use of human genetic information. We performed a multinational collaborative study, which included assessment of aversive memory-a trait central to posttraumatic stress disorder-and a gene-set analysis in healthy individuals. We identified 20 potential drug target genes in two genomewide-corrected gene sets: the neuroactive ligand-receptor interaction and the long-term depression gene set. In a subsequent double-blind, placebo-controlled study in healthy volunteers, we aimed at providing a proof of concept for the genome-guided identification of memory modulating compounds. Pharmacological intervention at the neuroactive ligand-receptor interaction gene set led to significant reduction of aversive memory. The findings demonstrate that genome information, along with appropriate data mining methodology, can be used as a starting point for the identification of memory-modulating compounds.
Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known and identified one novel genome rearrangements. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron’s Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
As reports on possible associations between microbes and the host increase in number, more meaningful interpretations of this information require an ability to compare data sets across studies. This is dependent upon standardization of workflows to ensure comparability both within and between studies. Here we propose the standard use of an alternate collection and stabilization method that would facilitate such comparisons. The DNA Genotek OMNIgene∙Gut Stool Microbiome Kit was compared to the currently accepted community standard of freezing to store human stool samples prior to whole genome sequencing (WGS) for microbiome studies. This stabilization and collection device allows for ambient temperature storage, automation, and ease of shipping/transfer of samples. The device permitted the same data reproducibility as with frozen samples, and yielded higher recovery of nucleic acids. Collection and stabilization of stool microbiome samples with the DNA Genotek collection device, combined with our extraction and WGS, provides a robust, reproducible workflow that enables standardized global collection, storage, and analysis of stool for microbiome studies.
BACKGROUND: PCR amplification and high-throughput sequencing theoretically enable the characterization of the finest-scale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy. RESULTS: We introduce a new denoising algorithm that we call DADA (Divisive Amplicon Denoising Algorithm). Without training data, DADA infers both the sample genotypes and error parameters that produced a metagenome data set. We demonstrate performance on control data sequenced on Roche’s 454 platform, and compare the results to the most accurate denoising software currently available, AmpliconNoise. CONCLUSIONS: DADA is more accurate and over an order of magnitude faster than AmpliconNoise. It eliminates the need for training data to establish error parameters, fully utilizes sequence-abundance information, and enables inclusion of context-dependent PCR error rates. It should be readily extensible to other sequencing platforms such as Illumina.
Analysis of microbial communities by high-throughput pyrosequencing of SSU rRNA gene PCR amplicons has transformed microbial ecology research and led to the observation that many communities contain a diverse assortment of rare taxa-a phenomenon termed the Rare Biosphere. Multiple studies have investigated the effect of pyrosequencing read quality on operational taxonomic unit (OTU) richness for contrived communities, yet there is limited information on the fidelity of community structure estimates obtained through this approach. Given that PCR biases are widely recognized, and further unknown biases may arise from the sequencing process itself, a priori assumptions about the neutrality of the data generation process are at best unvalidated. Furthermore, post-sequencing quality control algorithms have not been explicitly evaluated for the accuracy of recovered representative sequences and its impact on downstream analyses, reducing useful discussion on pyrosequencing reads to their diversity and abundances. Here we report on community structures and sequences recovered for in vitro-simulated communities consisting of twenty 16S rRNA gene clones tiered at known proportions. PCR amplicon libraries of the V3-V4 and V6 hypervariable regions from the in vitro-simulated communities were sequenced using the Roche 454 GS FLX Titanium platform. Commonly used quality control protocols resulted in the formation of OTUs with >1% abundance composed entirely of erroneous sequences, while over-aggressive clustering approaches obfuscated real, expected OTUs. The pyrosequencing process itself did not appear to impose significant biases on overall community structure estimates, although the detection limit for rare taxa may be affected by PCR amplicon size and quality control approach employed. Meanwhile, PCR biases associated with the initial amplicon generation may impose greater distortions in the observed community structure.
BACKGROUND: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform. RESULTS: We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores. CONCLUSIONS: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
BACKGROUND: Faba bean (Vicia faba L.) is an important food legume crop, grown for human consumption globally including in China, Turkey, Egypt and Ethiopia. Although genetic gain has been made through conventional selection and breeding efforts, this could be substantially improved through the application of molecular methods. For this, a set of reliable molecular markers representative of the entire genome is required. RESULTS: A library with 125,559 putative SSR sequences was constructed and characterized for repeat type and length from a mixed genome of 247 spring and winter sown faba bean genotypes using 454 sequencing. A suit of 28,503 primer pair sequences were designed and 150 were randomly selected for validation. Of these, 94 produced reproducible amplicons that were polymorphic among 32 faba bean genotypes selected from diverse geographical locations. The number of alleles per locus ranged from 2 to 8, the expected heterozygocities ranged from 0.0000 to 1.0000, and the observed heterozygosities ranged from 0.0908 to 0.8410. The validation by UPGMA cluster analysis of 32 genotypes based on Nei’s genetic distance, showed high quality and effectiveness of those novel SSR markers developed via next generation sequencing technology. CONCLUSIONS: Large scale SSR marker development was successfully achieved using next generation sequencing of the V. faba genome. These novel markers are valuable for constructing genetic linkage maps, future QTL mapping, and marker-assisted trait selection in faba bean breeding efforts.
Next generation sequencing and advances in genomic enrichment technologies have enabled the discovery of the full spectrum of variants from common to rare alleles in the human population. The application of such technologies can be limited by the amount of DNA available. Whole genome amplification (WGA) can overcome such limitations. Here we investigate applicability of using WGA by comparing SNP and INDEL variant calls from a single genomic/WGA sample pair from two capture separate experiments: a 50 Mbp whole exome capture and a custom capture array of 4 Mbp region on chr12.
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 3 years ago
Denisovans, a sister group of Neandertals, have been described on the basis of a nuclear genome sequence from a finger phalanx (Denisova 3) found in Denisova Cave in the Altai Mountains. The only other Denisovan specimen described to date is a molar (Denisova 4) found at the same site. This tooth carries a mtDNA sequence similar to that of Denisova 3. Here we present nuclear DNA sequences from Denisova 4 and a morphological description, as well as mitochondrial and nuclear DNA sequence data, from another molar (Denisova 8) found in Denisova Cave in 2010. This new molar is similar to Denisova 4 in being very large and lacking traits typical of Neandertals and modern humans. Nuclear DNA sequences from the two molars form a clade with Denisova 3. The mtDNA of Denisova 8 is more diverged and has accumulated fewer substitutions than the mtDNAs of the other two specimens, suggesting Denisovans were present in the region over an extended period. The nuclear DNA sequence diversity among the three Denisovans is comparable to that among six Neandertals, but lower than that among present-day humans.