- Proceedings of the National Academy of Sciences of the United States of America
- Published over 4 years ago
We report on the sequencing of 10,545 human genomes at 30×-40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.
Motor skill memory is first encoded online in a fragile form during practice and then converted into a stable form by offline consolidation, which is the behavioral stage critical for successful learning. Praise, a social reward, is thought to boost motor skill learning by increasing motivation, which leads to increased practice. However, the effect of praise on consolidation is unknown. Here, we tested the hypothesis that praise following motor training directly facilitates skill consolidation. Forty-eight healthy participants were trained on a sequential finger-tapping task. Immediately after training, participants were divided into three groups according to whether they received praise for their own training performance, praise for another participant’s performance, or no praise. Participants who received praise for their own performance showed a significantly higher rate of offline improvement relative to other participants when performing a surprise recall test of the learned sequence. On the other hand, the average performance of the novel sequence and randomly-ordered tapping did not differ between the three experimental groups. These results are the first to indicate that praise-related improvements in motor skill memory are not due to a feedback-incentive mechanism, but instead involve direct effects on the offline consolidation process.
To investigate cognitive operations underlying sequential problem solving, we confronted ten Goffin’s cockatoos with a baited box locked by five different inter-locking devices. Subjects were either naïve or had watched a conspecific demonstration, and either faced all devices at once or incrementally. One naïve subject solved the problem without demonstration and with all locks present within the first five sessions (each consisting of one trial of up to 20 minutes), while five others did so after social demonstrations or incremental experience. Performance was aided by species-specific traits including neophilia, a haptic modality and persistence. Most birds showed a ratchet-like progress, rarely failing to solve a stage once they had done it once. In most transfer tests subjects reacted flexibly and sensitively to alterations of the locks' sequencing and functionality, as expected from the presence of predictive inferences about mechanical interactions between the locks.
Background: For more than a decade, risk stratification for hypertrophic cardiomyopathy has been enhanced by targeted genetic testing. Using sequencing results, clinicians routinely assess the risk of hypertrophic cardiomyopathy in a patient’s relatives and diagnose the condition in patients who have ambiguous clinical presentations. However, the benefits of genetic testing come with the risk that variants may be misclassified. Methods: Using publicly accessible exome data, we identified variants that have previously been considered causal in hypertrophic cardiomyopathy and that are overrepresented in the general population. We studied these variants in diverse populations and reevaluated their initial ascertainments in the medical literature. We reviewed patient records at a leading genetic-testing laboratory for occurrences of these variants during the near-decade-long history of the laboratory. Results: Multiple patients, all of whom were of African or unspecified ancestry, received positive reports, with variants misclassified as pathogenic on the basis of the understanding at the time of testing. Subsequently, all reported variants were recategorized as benign. The mutations that were most common in the general population were significantly more common among black Americans than among white Americans (P<0.001). Simulations showed that the inclusion of even small numbers of black Americans in control cohorts probably would have prevented these misclassifications. We identified methodologic shortcomings that contributed to these errors in the medical literature. Conclusions: The misclassification of benign variants as pathogenic that we found in our study shows the need for sequencing the genomes of diverse populations, both in asymptomatic controls and the tested patient population. These results expand on current guidelines, which recommend the use of ancestry-matched controls to interpret variants. As additional populations of different ancestry backgrounds are sequenced, we expect variant reclassifications to increase, particularly for ancestry groups that have historically been less well studied. (Funded by the National Institutes of Health.)
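The abstract does not describe how the simulations were run. As a rough illustration of why even a small ancestry-matched control group can matter, the sketch below uses a simple closed-form calculation rather than a simulation: the chance that a variant with a given allele frequency shows up at least once among the chromosomes of `n_controls` individuals. The function name and the example frequency are assumptions made for illustration, not values from the study.

```python
def prob_variant_seen(allele_freq: float, n_controls: int) -> float:
    """Probability that a variant is observed at least once in a control cohort.

    Assumes each of the 2 * n_controls chromosomes independently carries the
    variant with probability allele_freq (a simplification that ignores
    relatedness and genotyping error).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if n_controls < 0:
        raise ValueError("n_controls must be non-negative")
    return 1.0 - (1.0 - allele_freq) ** (2 * n_controls)


if __name__ == "__main__":
    # Hypothetical allele frequency of 1% in the matched ancestry group.
    for n in (10, 50, 200):
        print(n, round(prob_variant_seen(0.01, n), 3))
```

Under these assumptions the detection probability rises quickly with cohort size, which is consistent with the qualitative point that a variant common in one ancestry group is unlikely to stay hidden once that group is represented among controls.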
Understanding natural selection is crucial to unveiling the evolution of modern humans. Here, we report natural selection signatures in the Japanese population using 2234 high-depth (25.9×) whole-genome sequence (WGS) datasets. Using rare singletons, we identify signals of very recent selection over the past 2000-3000 years in multiple loci (ADH cluster, MHC region, BRAP-ALDH2, SERHL2). In a large-scale genome-wide association study (GWAS) dataset (n = 171,176), variants with selection signatures show enrichment in heterogeneity of derived allele frequency spectra among the geographic regions of Japan, highlighted by two major regional clusters (Hondo and Ryukyu). While the selection signatures do not show enrichment in archaic hominin-derived genome sequences, they overlap with SNPs associated with modern human traits. The strongest overlaps are observed for traits related to alcohol or nutrition metabolism. Our study illustrates the value of high-depth WGS for understanding evolution and its relationship with disease risk.
Next-generation sequencing (NGS) is increasingly being adopted as the backbone of biomedical research. With the commercialization of various affordable desktop sequencers, NGS will come within reach of increasing numbers of cellular and molecular biologists, necessitating community consensus on bioinformatics protocols to tackle the exponential increase in quantity of sequence data. The current resources for NGS informatics are extremely fragmented, and finding a centralized synthesis is difficult. A multitude of tools exist for NGS data analysis; however, none of these satisfies all possible uses and needs. This gap in functionality could be filled by integrating different methods in customized pipelines, an approach helped by the open-source nature of many NGS programmes. Drawing from community spirit and with the use of the Wikipedia framework, we have initiated a collaborative NGS resource: The NGS WikiBook. We have collected a sufficient amount of text to incentivize a broader community to contribute to it. Users can search, browse, edit and create new content, so as to facilitate self-learning and feedback to the community. The overall structure and style of this dynamic material are designed for bench biologists and non-bioinformaticians. The flexibility of online material allows readers to ignore details in a first read, yet have immediate access to the information they need. Each chapter comes with practical exercises so readers may familiarize themselves with each step. The NGS WikiBook aims to create a collective laboratory book and protocol that explains the key concepts and describes best practices in this fast-evolving field.
Massively parallel high-throughput sequencing technologies allow us to interrogate the microbial composition of biological samples at unprecedented resolution. The typical approach is to perform high-throughput sequencing of 16S rRNA genes, which are then taxonomically classified based on similarity to known sequences in existing databases. Current technologies pose a dilemma, however: although they enable deep coverage of samples, they are limited in the length of sequence they can produce. As a result, high-throughput studies of microbial communities often do not sequence the entire 16S rRNA gene. The challenge is to obtain reliable representation of bacterial communities through taxonomic classification of short 16S rRNA gene sequences. In this study we explored properties of different study designs and developed specific recommendations for effective use of short-read sequencing technologies for the purpose of interrogating bacterial communities, with a focus on classification using naïve Bayesian classifiers. To assess precision and coverage of each design, we used a collection of ∼8,500 manually curated 16S rRNA gene sequences from cultured bacteria and a set of over one million bacterial 16S rRNA gene sequences retrieved from environmental samples, respectively. We also tested different configurations of taxonomic classification approaches using short-read sequencing data, and provide recommendations for optimal choice of the relevant parameters. We conclude that with a judicious selection of the sequenced region and the corresponding choice of a suitable training set for taxonomic classification, it is possible to explore bacterial communities at great depth using current technologies, with only a minimal loss of taxonomic resolution.
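To make the classification step concrete, here is a minimal sketch of a naïve Bayesian classifier over k-mers, the general family of method the abstract refers to. It is not the classifier or configuration used in the study: the class name `KmerNaiveBayes`, the word size, the uniform class priors, the add-one smoothing, and the toy training sequences with placeholder genus labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Toy multinomial naive Bayes over k-mer counts with add-one smoothing."""

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self.counts: dict[str, Counter] = defaultdict(Counter)  # label -> k-mer counts
        self.totals: Counter = Counter()                        # label -> total k-mers
        self.vocab: set[str] = set()

    def train(self, labeled_seqs: list[tuple[str, str]]) -> None:
        for label, seq in labeled_seqs:
            for word in kmers(seq, self.k):
                self.counts[label][word] += 1
                self.totals[label] += 1
                self.vocab.add(word)

    def classify(self, seq: str) -> str:
        """Return the label with the highest log-likelihood (uniform priors)."""
        vocab_size = len(self.vocab)
        best_label, best_score = "", -math.inf
        for label, word_counts in self.counts.items():
            score = sum(
                math.log((word_counts[word] + 1) / (self.totals[label] + vocab_size))
                for word in kmers(seq, self.k)
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Placeholder sequences and labels, not real 16S rRNA data.
    model = KmerNaiveBayes(k=4)
    model.train([
        ("GenusA", "ACGTACGTACGTACGT"),
        ("GenusB", "TTGGCCAATTGGCCAA"),
    ])
    print(model.classify("ACGTACGT"))
    print(model.classify("GGCCAATT"))
```

In a sketch like this, a shorter query contributes fewer k-mers to the score, which is one intuition for why the choice of sequenced region and training set matters when only part of the gene is read.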
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as “noise” or “error”) within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
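The general idea of estimating error from duplicate reads can be sketched as follows: reads that share an identical prefix are treated as copies of the same template, a consensus base is taken at each later position, and deviations from the consensus are counted as errors. This is a simplified illustration, not the DRISEE implementation; the function name, the prefix length, and the minimum bin size are assumptions chosen so the toy example is easy to follow.

```python
from collections import Counter, defaultdict


def duplicate_read_error_profile(
    reads: list[str], prefix_len: int = 5, min_bin_size: int = 3
) -> dict[int, float]:
    """Estimate a per-position error rate from bins of prefix-identical reads.

    Reads are binned by their first prefix_len bases. Within each bin of at
    least min_bin_size reads, the most common base at each position past the
    prefix is taken as the consensus, and mismatches to it are counted.
    """
    bins: dict[str, list[str]] = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            bins[read[:prefix_len]].append(read)

    mismatches: Counter = Counter()
    observed: Counter = Counter()
    for group in bins.values():
        if len(group) < min_bin_size:
            continue  # too few duplicates to call a consensus
        shared_len = min(len(read) for read in group)
        for pos in range(prefix_len, shared_len):
            column = [read[pos] for read in group]
            consensus, _ = Counter(column).most_common(1)[0]
            observed[pos] += len(column)
            mismatches[pos] += sum(1 for base in column if base != consensus)

    return {pos: mismatches[pos] / observed[pos] for pos in sorted(observed)}


if __name__ == "__main__":
    toy_reads = [
        "ACGTAGGCT",
        "ACGTAGGCT",
        "ACGTAGGCA",  # differs from the consensus at the last position
        "ACGTAGGCT",
        "TTTTTCCCC",  # singleton bin, ignored
    ]
    print(duplicate_read_error_profile(toy_reads))
```

A per-position profile of this kind is the sort of output that could inform read trimming, while averaging it over a sample gives a single whole-sample estimate for comparing samples.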
Despite recent advances spearheaded by molecular approaches and novel technologies, species description and DNA sequence information are significantly lagging for fungi compared to many other groups of organisms. Large-scale sequencing of vouchered herbarium material can aid in closing this gap. Here, we describe an effort to obtain broad ITS sequence coverage of the macrofungal herbarium of the Museum of Natural History in Venice, Italy, which holds approximately 6000 species. Our goals were to investigate issues related to large sequencing projects, develop heuristic methods for assessing the overall performance of such a project, and evaluate the prospects of such efforts to reduce the current gap in fungal biodiversity knowledge. The effort generated 1107 sequences submitted to GenBank, including 416 previously unrepresented taxa and 398 sequences exhibiting a best BLAST match to an unidentified environmental sequence. Specimen age and taxon affected sequencing success, and subsequent work on failed specimens showed that an ITS1 mini-barcode greatly increased sequencing success without greatly reducing the discriminating power of the barcode. Similarity comparisons and nonmetric multidimensional scaling ordinations based on pairwise distance matrices proved to be useful heuristic tools for validating the overall accuracy of specimen identifications, flagging potential misidentifications, and identifying taxa in need of additional species-level revision. Comparison of within- and among-species nucleotide variation showed a strong increase in species-discriminating power at 1-2% dissimilarity, and identified potential barcoding issues (the same sequence for different species and vice versa). All sequences are linked to a vouchered specimen, and results from this study have already prompted revisions of species-sequence assignments in several taxa.
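The within- versus among-species comparison can be illustrated with a small sketch that computes pairwise dissimilarity (the proportion of differing sites) and sorts each pair by whether the two specimens carry the same species label. It assumes the sequences are already aligned to equal length, skips gap positions, and uses made-up sequences and placeholder species names; the study's own alignment and distance methods are not specified in the abstract.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def split_distances(
    records: list[tuple[str, str]]
) -> tuple[list[float], list[float]]:
    """Return (within-species, among-species) pairwise distances.

    records is a list of (species_label, aligned_sequence) tuples.
    """
    within: list[float] = []
    among: list[float] = []
    for (sp1, seq1), (sp2, seq2) in combinations(records, 2):
        target = within if sp1 == sp2 else among
        target.append(p_distance(seq1, seq2))
    return within, among


if __name__ == "__main__":
    # Placeholder labels and toy sequences, not real ITS data.
    toy = [
        ("species_a", "ACGTACGTAC"),
        ("species_a", "ACGTACGTAT"),
        ("species_b", "ACGAACCTAC"),
    ]
    within, among = split_distances(toy)
    print("within:", within)
    print("among:", among)
```

Plotting the two distributions against a candidate threshold is one way to see where discriminating power changes, and pairs that fall on the wrong side of the threshold are natural candidates for the misidentification checks the abstract describes.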
Highly parallel SNP genotyping platforms have been developed for some important crop species, but these platforms typically carry a high cost per sample for first-time or small-scale users. In contrast, recently developed genotyping-by-sequencing (GBS) approaches offer a highly cost-effective alternative for simultaneous SNP discovery and genotyping. In the present investigation, we have explored the use of GBS in soybean. In addition to developing a novel analysis pipeline to call SNPs and indels from the resulting sequence reads, we have devised a modified library preparation protocol to alter the degree of complexity reduction. We used a set of eight diverse soybean genotypes to conduct a pilot-scale test of the protocol and pipeline. Using ApeKI for GBS library preparation and sequencing on an Illumina GAIIx machine, we obtained 5.5 M reads, which were processed using our pipeline. A total of 10,120 high-quality SNPs were obtained, and the distribution of these SNPs closely mirrored the distribution of gene-rich regions in the soybean genome. A total of 39.5% of the SNPs were present in genic regions and 52.5% of these were located in the coding sequence. Validation of over 400 genotypes at a set of randomly selected SNPs using Sanger sequencing showed a 98% success rate. We then explored the use of selective primers to achieve a greater complexity reduction during GBS library preparation. The number of SNP calls could be increased by almost 40% and their depth of coverage was more than doubled, thus opening the door to an increase in throughput and a significant decrease in the per-sample cost. The approach developed here to obtain high-quality SNPs will be helpful for marker-assisted genomics as well as for assessment of available genetic resources for effective utilisation in a wide range of species.
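The nested percentages above are easier to read as counts. The short sketch below derives approximate SNP numbers from the figures reported in the abstract (10,120 SNPs, 39.5% genic, 52.5% of genic SNPs in coding sequence); the function name is an assumption, and the rounded counts are derived here rather than reported by the authors.

```python
def snp_breakdown(
    total_snps: int, genic_fraction: float, coding_fraction_of_genic: float
) -> dict[str, float]:
    """Convert nested fractions into approximate SNP counts."""
    genic = total_snps * genic_fraction
    coding = genic * coding_fraction_of_genic
    return {
        "genic": round(genic),
        "coding": round(coding),
        "coding_share_of_all": coding / total_snps,
    }


if __name__ == "__main__":
    result = snp_breakdown(10_120, 0.395, 0.525)
    print(result["genic"])   # about 3,997 SNPs in genic regions
    print(result["coding"])  # about 2,099 SNPs in coding sequence
    print(f"{result['coding_share_of_all']:.1%}")  # roughly 20.7% of all SNPs
```

Read this way, roughly one in five of the SNPs called falls in coding sequence.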