Concept: Human genome
A recent slew of ENCODE Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is under 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 - 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.
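Point (5) above, favoring sensitivity over specificity, can be illustrated with a small calculation. The sketch below is not taken from the article: the function name `flagged_fraction` and all numbers in the usage note are hypothetical, chosen only to show how a test with low specificity can flag far more of a genome than is truly functional.

```python
def flagged_fraction(true_fraction, sensitivity, specificity):
    """Fraction of positions a test labels 'functional'.

    true_fraction: share of positions that really are functional (assumed).
    sensitivity:   P(test says functional | truly functional).
    specificity:   P(test says non-functional | truly non-functional).
    """
    true_positives = true_fraction * sensitivity
    false_positives = (1 - true_fraction) * (1 - specificity)
    return true_positives + false_positives
```

Under the purely illustrative assumption that 10% of positions are functional, a test with sensitivity 0.95 and specificity 0.30 would flag roughly 72.5% of positions, most of them false positives.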
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 1 year ago
We report on the sequencing of 10,545 human genomes at 30×-40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.
Genomic instability is a hallmark of cancer often associated with poor patient outcome and resistance to targeted therapy. Assessment of genomic instability in bulk tumor or biopsy can be complicated due to sample availability, surrounding tissue contamination, or tumor heterogeneity. The Epic Sciences circulating tumor cell (CTC) platform utilizes a non-enrichment based approach for the detection and characterization of rare tumor cells in clinical blood samples. Genomic profiling of individual CTCs could provide a portrait of cancer heterogeneity, identify clonal and sub-clonal drivers, and monitor disease progression. To that end, we developed a single cell Copy Number Variation (CNV) Assay to evaluate genomic instability and CNVs in patient CTCs. For proof of concept, prostate cancer cell lines, LNCaP, PC3 and VCaP, were spiked into healthy donor blood to create mock patient-like samples for downstream single cell genomic analysis. In addition, samples from seven metastatic castration resistant prostate cancer (mCRPC) patients were included to evaluate clinical feasibility. CTCs were enumerated and characterized using the Epic Sciences CTC Platform. Identified single CTCs were recovered, whole genome amplified, and sequenced using an Illumina NextSeq 500. CTCs were then analyzed for genome-wide copy number variations, followed by genomic instability analyses. Large-scale state transitions (LSTs) were measured as surrogates of genomic instability. Genomic instability scores were determined reproducibly for LNCaP, PC3, and VCaP, and were higher than white blood cell (WBC) controls from healthy donors. A wide range of LST scores were observed within and among the seven mCRPC patient samples. On the gene level, loss of the PTEN tumor suppressor was observed in PC3 and 5/7 (71%) patients. Amplification of the androgen receptor (AR) gene was observed in VCaP cells and 5/7 (71%) mCRPC patients. 
Using an in silico down-sampling approach, we determined that DNA copy number and genomic instability can be detected with as few as 350K sequencing reads. The data shown here demonstrate the feasibility of detecting genomic instabilities at the single cell level using the Epic Sciences CTC Platform. Understanding CTC heterogeneity has great potential for patient stratification prior to treatment with targeted therapies and for monitoring disease evolution during treatment.
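The in silico down-sampling idea can be sketched roughly as follows: draw a random subset of reads, count reads per genomic bin, and check whether a copy number gain is still visible. This is a minimal illustration under stated assumptions, not the assay's actual pipeline. All function names, the bin size, the median normalization, and the simulated data are hypothetical.

```python
import random
from statistics import median

def downsample(reads, target, seed=0):
    """Keep `target` reads chosen uniformly at random, without replacement."""
    if target >= len(reads):
        return list(reads)
    return random.Random(seed).sample(reads, target)

def bin_counts(reads, bin_size, n_bins):
    """Count reads per fixed-width bin; each read is a start position."""
    counts = [0] * n_bins
    for pos in reads:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def copy_ratios(counts):
    """Divide each bin by the median bin count, so a typical bin is near 1.0."""
    mid = median(counts)
    if mid == 0:
        raise ValueError("too few reads to normalize")
    return [c / mid for c in counts]

def simulate_reads(n_bins, bin_size, per_bin, amplified, seed=1):
    """Toy data (assumption): amplified bins receive twice the reads."""
    rng = random.Random(seed)
    reads = []
    for b in range(n_bins):
        n = per_bin * 2 if b in amplified else per_bin
        reads.extend(b * bin_size + rng.randrange(bin_size) for _ in range(n))
    return reads
```

For example, simulating 100 bins with bins 40 to 59 amplified, keeping one read in ten with `downsample`, and passing the result through `bin_counts` and `copy_ratios` should still give ratios near 2 in the amplified bins. Repeating this at smaller targets shows where the signal degrades.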
In order to explore the diversity and selective signatures of duplication and deletion human copy number variants (CNVs), we sequenced 236 individuals from 125 distinct human populations. We observed that duplications exhibit fundamentally different population genetic and selective signatures than deletions and are more likely to be stratified between human populations. Through reconstruction of the ancestral human genome, we identify megabases of DNA lost in different human lineages and pinpoint large duplications that introgressed from the extinct Denisova lineage now found at high frequency exclusively in Oceanic populations. We find that the proportion of CNV base pairs to single nucleotide variant base pairs is greater among non-Africans than it is among African populations, but we conclude that this difference is likely due to unique aspects of non-African population history as opposed to differences in CNV load.
Sole-source business models for genetic testing can create private databases containing information vital to interpreting the clinical significance of human genetic variations. But incomplete access to those databases threatens to impede the clinical interpretation of genomic medicine. National health systems and insurers, regulators, researchers, providers and patients all have a strong interest in ensuring broad access to information about the clinical significance of variants discovered through genetic testing. They can create incentives for sharing data and interpretive algorithms in several ways, including: promoting voluntary sharing; requiring laboratories to share as a condition of payment for or regulatory approval of laboratory services; establishing, and compelling participation in, resources that capture the information needed to interpret the data independent of company policies; and paying for sharing and interpretation in addition to paying for the test itself. US policies have failed to address the data-sharing issue. The entry of new and established firms into the European genetic testing market presents an opportunity to correct this failure.
European Journal of Human Genetics advance online publication, 14 November 2012; doi:10.1038/ejhg.2012.217.
This year marks 60 years since James Watson and Francis Crick described the structure of DNA and 10 years since the complete sequencing of the human genome. Fittingly, today the Food and Drug Administration (FDA) has granted marketing authorization for the first high-throughput (next-generation) genomic sequencer, Illumina’s MiSeqDx, which will allow the development and use of innumerable new genome-based tests. When a global team of researchers sequenced that first human genome, it took more than a decade and cost hundreds of millions of dollars. Today, because of federal and private investment, sequencing technologies have advanced dramatically, and a human genome . . .
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 4 years ago
In the last decade there has been an exponential increase in knowledge about the genetic basis of complex human traits, including neuropsychiatric disorders. It is not clear, however, to what extent this knowledge can be used as a starting point for drug identification, one of the central hopes of the human genome project. The aim of the present study was to identify memory-modulating compounds through the use of human genetic information. We performed a multinational collaborative study, which included assessment of aversive memory (a trait central to posttraumatic stress disorder) and a gene-set analysis in healthy individuals. We identified 20 potential drug target genes in two genomewide-corrected gene sets: the neuroactive ligand-receptor interaction and the long-term depression gene set. In a subsequent double-blind, placebo-controlled study in healthy volunteers, we aimed at providing a proof of concept for the genome-guided identification of memory-modulating compounds. Pharmacological intervention at the neuroactive ligand-receptor interaction gene set led to significant reduction of aversive memory. The findings demonstrate that genome information, along with appropriate data mining methodology, can be used as a starting point for the identification of memory-modulating compounds.
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron’s Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
Left-sided congenital heart disease (CHD) encompasses a spectrum of malformations that range from bicuspid aortic valve to hypoplastic left heart syndrome. It contributes significantly to infant mortality and has serious implications in adult cardiology. Although left-sided CHD is known to be highly heritable, the underlying genetic determinants are largely unidentified. In this study, we sought to determine the impact of structural genomic variation on left-sided CHD and compared multiplex families (464 individuals with 174 affecteds (37.5%) in 59 multiplex families and 8 trios) to 1,582 well-phenotyped controls. 73 unique inherited or de novo CNVs in 54 individuals were identified in the left-sided CHD cohort. After stringent filtering, our gene inventory reveals 25 new candidates for LS-CHD pathogenesis, such as SMC1A, MFAP4, and CTHRC1, and overlaps with several known syndromic loci. Conservative estimation examining the overlap of the prioritized gene content with CNVs present only in affected individuals in our cohort implies a strong effect for unique CNVs in at least 10% of left-sided CHD cases. Enrichment testing of gene content in all identified CNVs showed a significant association with angiogenesis. In this first family-based CNV study of left-sided CHD, we found that both co-segregating and de novo events associate with disease in a complex fashion at structural genomic level. Often viewed as an anatomically circumscript disease, a subset of left-sided CHD may in fact reflect more general genetic perturbations of angiogenesis and/or vascular biology.
Emerging sequencing technologies allow common and rare variants to be systematically assayed across the human genome in many individuals. In order to improve variant detection and genotype calling, raw sequence data are typically examined across many individuals. Here, we describe a method for genotype calling in settings where sequence data are available for unrelated individuals and parent-offspring trios and show that modeling trio information can greatly increase the accuracy of inferred genotypes and haplotypes, especially on low to modest depth sequencing data. Our method considers both linkage disequilibrium (LD) patterns and the constraints imposed by family structure when assigning individual genotypes and haplotypes. Using simulations, we show that trios provide higher genotype calling accuracy across the frequency spectrum, both overall and at hard-to-call heterozygous sites. In addition, trios provide greatly improved phasing accuracy, which in turn improves the accuracy of downstream analyses (such as genotype imputation) that rely on phased haplotypes. To further evaluate our approach, we analyzed data on the first 508 individuals sequenced by the SardiNIA sequencing project. Our results show that our method reduces the genotyping error rate by 50% compared with analysis using existing methods that ignore family structure. We anticipate our method will facilitate genotype calling and haplotype inference for many ongoing sequencing projects.
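The family-structure constraint mentioned above can be illustrated with a toy joint genotype caller for a single biallelic site. This is a simplified sketch, not the published method: it ignores linkage disequilibrium and phasing entirely, and the function names, the Hardy-Weinberg prior, and the default allele frequency are assumptions made for illustration. Genotypes are coded as the number of alternate alleles (0, 1, or 2).

```python
from itertools import product

def transmit_prob(parent_gt, allele):
    """P(parent with `parent_gt` alt alleles passes on `allele`; 1 = alt, 0 = ref)."""
    p_alt = parent_gt / 2
    return p_alt if allele == 1 else 1 - p_alt

def child_prob(father, mother, child):
    """Mendelian probability of the child's genotype given both parents."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        if a + b == child:
            total += transmit_prob(father, a) * transmit_prob(mother, b)
    return total

def hwe_prior(alt_freq):
    """Hardy-Weinberg genotype prior for the parents (an assumption)."""
    q = alt_freq
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)

def call_trio(lik_f, lik_m, lik_c, alt_freq=0.5):
    """Pick the (father, mother, child) genotypes with the highest joint score.

    Each lik_* holds P(reads | genotype) for genotypes 0, 1, and 2.
    """
    prior = hwe_prior(alt_freq)
    best, best_score = None, -1.0
    for f, m, c in product(range(3), repeat=3):
        score = (prior[f] * lik_f[f] * prior[m] * lik_m[m]
                 * child_prob(f, m, c) * lik_c[c])
        if score > best_score:
            best, best_score = (f, m, c), score
    return best
```

In this toy model, a child whose reads slightly favor a heterozygous call is still called homozygous reference when both parents are confidently homozygous reference, because a heterozygous child would require a Mendelian violation.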