Concept: Human Genome Project
A recent slew of ENCODE Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is under 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 - 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 2 years ago
We report on the sequencing of 10,545 human genomes at 30×-40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.
Blowflies and houseflies are mechanical vectors inhabiting synanthropic environments around the world. They feed and breed in fecal and decaying organic matter, but the microbiome they harbour and transport is largely uncharacterized. We sampled 116 individual houseflies and blowflies from varying habitats on three continents and subjected them to high-coverage, whole-genome shotgun sequencing. This allowed for genomic and metagenomic analyses of the host-associated microbiome at the species level. Both fly host species segregate based on principal coordinate analysis of their microbial communities, but they also show an overlapping core microbiome. Legs and wings displayed the largest microbial diversity and were shown to be an important route for microbial dispersion. The environmental sequencing approach presented here detected a stochastic distribution of human pathogens, such as Helicobacter pylori, thereby demonstrating the potential of flies as proxies for environmental and public health surveillance.
This year marks 60 years since James Watson and Francis Crick described the structure of DNA and 10 years since the complete sequencing of the human genome. Fittingly, today the Food and Drug Administration (FDA) has granted marketing authorization for the first high-throughput (next-generation) genomic sequencer, Illumina’s MiSeqDx, which will allow the development and use of innumerable new genome-based tests. When a global team of researchers sequenced that first human genome, it took more than a decade and cost hundreds of millions of dollars. Today, because of federal and private investment, sequencing technologies have advanced dramatically, and a human genome . . .
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 5 years ago
In the last decade there has been an exponential increase in knowledge about the genetic basis of complex human traits, including neuropsychiatric disorders. It is not clear, however, to what extent this knowledge can be used as a starting point for drug identification, one of the central hopes of the human genome project. The aim of the present study was to identify memory-modulating compounds through the use of human genetic information. We performed a multinational collaborative study, which included assessment of aversive memory-a trait central to posttraumatic stress disorder-and a gene-set analysis in healthy individuals. We identified 20 potential drug target genes in two genomewide-corrected gene sets: the neuroactive ligand-receptor interaction and the long-term depression gene set. In a subsequent double-blind, placebo-controlled study in healthy volunteers, we aimed at providing a proof of concept for the genome-guided identification of memory modulating compounds. Pharmacological intervention at the neuroactive ligand-receptor interaction gene set led to significant reduction of aversive memory. The findings demonstrate that genome information, along with appropriate data mining methodology, can be used as a starting point for the identification of memory-modulating compounds.
Data from the 1000 genomes project (1KGP) and Complete Genomics (CG) have dramatically increased the numbers of known genetic variants and challenge several assumptions about the reference genome and its uses in both clinical and research settings. Specifically, 34% of published array-based GWAS studies for a variety of diseases utilize probes that overlap unanticipated single nucleotide polymorphisms (SNPs), indels, or structural variants. Linkage disequilibrium (LD) block length depends on the numbers of markers used, and the mean LD block size decreases from 16 kb to 7 kb,when HapMap-based calculations are compared to blocks computed from1KGP data. Additionally, when 1KGP and CG variants are compared, 19% of the single nucleotide variants (SNVs) reported from common genomes are unique to one dataset; likely a result of differences in data collection methodology, alignment of reads to the reference genome, and variant-calling algorithms. Together these observations indicate that current research resources and informatics methods do not adequately account for the high level of variation that already exists in the human population and significant efforts are needed to create resources that can accurately assess personal genomics for health, disease, and predict treatment outcomes.
Evolutionarily conserved RNA secondary structures are a robust indicator of purifying selection and, consequently, molecular function. Evaluating their genome-wide occurrence through comparative genomics has consistently been plagued by high false-positive rates and divergent predictions. We present a novel benchmarking pipeline aimed at calibrating the precision of genome-wide scans for consensus RNA structure prediction. The benchmarking data obtained from two refined structure prediction algorithms, RNAz and SISSIz, were then analyzed to fine-tune the parameters of an optimized workflow for genomic sliding window screens. When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5-22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional. As an example, our findings identify both known and novel conserved RNA structure motifs in the long noncoding RNA MALAT1. This study provides an extensive set of functional transcriptomic annotations that will assist researchers in uncovering the precise mechanisms underlying the developmental ontologies of higher eukaryotes.
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron’s Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
Epigenome mapping consortia are generating resources of tremendous value for studying epigenetic regulation. To maximize their utility and impact, new tools are needed that facilitate interactive analysis of epigenome datasets. Here we describe EpiExplorer, a web tool for exploring genome and epigenome data on a genomic scale. We demonstrate EpiExplorer’s utility by describing a hypothesis-generating analysis of DNA hydroxymethylation in relation to public reference maps of the human epigenome. All EpiExplorer analyses are performed dynamically within seconds, using an efficient and versatile text indexing scheme that we introduce to bioinformatics. EpiExplorer is available at http://epiexplorer.mpi-inf.mpg.de.