Concept: Junk DNA
A recent slew of ENCODE Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is under 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 - 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree: many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 5 years ago
Do data from the Encyclopedia Of DNA Elements (ENCODE) project render the notion of junk DNA obsolete? Here, I review older arguments for junk grounded in the C-value paradox and propose a thought experiment to challenge ENCODE’s ontology. Specifically, what would we expect for the number of functional elements (as ENCODE defines them) in genomes much larger than our own genome? If the number were to stay more or less constant, it would seem sensible to consider the rest of the DNA of larger genomes to be junk or, at least, assign it a different sort of role (structural rather than informational). If, however, the number of functional elements were to rise significantly with C-value then, (i) organisms with genomes larger than our genome are more complex phenotypically than we are, (ii) ENCODE’s definition of functional element identifies many sites that would not be considered functional or phenotype-determining by standard uses in biology, or (iii) the same phenotypic functions are often determined in a more diffuse fashion in larger-genomed organisms. Good cases can be made for propositions ii and iii. A larger theoretical framework, embracing informational and structural roles for DNA, neutral as well as adaptive causes of complexity, and selection as a multilevel phenomenon, is needed.
Programmed DNA rearrangements in the single-celled eukaryote Oxytricha trifallax completely rewire its germline into a somatic nucleus during development. This elaborate, RNA-mediated pathway eliminates noncoding DNA sequences that interrupt gene loci and reorganizes the remaining fragments by inversions and permutations to produce functional genes. Here, we report the Oxytricha germline genome and compare it to the somatic genome to present a global view of its massive scale of genome rearrangements. The remarkably encrypted genome architecture contains >3,500 scrambled genes, as well as >800 predicted germline-limited genes expressed, and some posttranslationally modified, during genome rearrangements. Gene segments for different somatic loci often interweave with each other. Single gene segments can contribute to multiple, distinct somatic loci. Terminal precursor segments from neighboring somatic loci map extremely close to each other, often overlapping. This genome assembly provides a draft of a scrambled genome and a powerful model for studies of genome rearrangement.
A single CpG site within F2RL3 was recently found to be hypomethylated in peripheral blood genomic DNA from smokers compared to former and non-smokers. We performed two epigenome-wide association studies (EWAS) nested in a prospective healthy cohort using the Illumina 450K Methylation Beadchip. The two populations consisted of matched pairs of healthy individuals (n=374), of which half went on to develop breast or colon cancer. The association was analysed between methylation and smoking status, as well as cancer risk. In addition to the same locus in F2RL3, we report several loci that are hypomethylated in smokers compared to former and non-smokers, including an intragenic region of the aryl hydrocarbon receptor repressor gene (AHRR; cg05575921, p=2.31×10⁻¹⁵; effect size = 14%-17%), an intergenic CpG island on 2q37.1 (cg21566642, p=3.73×10⁻¹³; effect size = 12%), and a further intergenic region at 6p21.33 (cg06126421, p=4.96×10⁻¹¹; effect size = 7%-8%). Bisulphite pyrosequencing validated six loci in a further independent population of healthy individuals (n=180). Methylation levels in AHRR were also significantly decreased (p<0.001) and expression increased (p=0.0047) in the lung tissue of current smokers compared to non-smokers. This was further validated in a mouse model of smoke exposure. We observed an association with breast cancer risk for the 2q37.1 locus (p=0.003, adjusted for smoking status), but not for the other loci associated with smoking. These data show that smoking has a direct effect on the epigenome in lung tissue, which is also detectable in peripheral blood DNA and may contribute to cancer risk.
Within the ENCODE Consortium, GENCODE aimed to accurately annotate all protein-coding genes, pseudogenes, and noncoding transcribed loci in the human genome through manual curation and computational methods. Annotated transcript structures were assessed, and less well-supported loci were systematically, experimentally validated. Predicted exon-exon junctions were evaluated by RT-PCR amplification followed by highly multiplexed sequencing readout, a method we called RT-PCR-seq. Seventy-nine percent of all assessed junctions are confirmed by this evaluation procedure, demonstrating the high quality of the GENCODE gene set. RT-PCR-seq was also efficient for screening gene models predicted using the Human Body Map (HBM) RNA-seq data. We validated 73% of these predictions, thus confirming 1168 novel genes, mostly noncoding, which will further complement the GENCODE annotation. Our novel experimental validation pipeline is extremely sensitive, far more than unbiased transcriptome profiling through RNA sequencing, which is becoming the norm. For example, exon-exon junctions unique to GENCODE annotated transcripts are five times more likely to be corroborated with our targeted approach than with extensive large human transcriptome profiling. Data sets such as the HBM and ENCODE RNA-seq data fail to sample low-expressed transcripts. Our RT-PCR-seq targeted approach also has the advantage of identifying novel exons of known genes, as we discovered unannotated exons in ~11% of assessed introns. We thus estimate that at least 18% of known loci have yet-unannotated exons. Our work demonstrates that the cataloging of all of the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq.
The function of the non-coding portion of the human genome remains one of the most important questions of our time. Its vast complexity is exemplified by the recent identification of an unusual and notable component of the transcriptome - very long intergenic non-coding RNAs, termed vlincRNAs.
The extent to which non-coding mutations contribute to Mendelian disease is a major unknown in human genetics. Relatedly, the vast majority of candidate regulatory elements have yet to be functionally validated. Here, we describe a CRISPR-based system that uses pairs of guide RNAs (gRNAs) to program thousands of kilobase-scale deletions that deeply scan across a targeted region in a tiling fashion (“ScanDel”). We applied ScanDel to HPRT1, the housekeeping gene underlying Lesch-Nyhan syndrome, an X-linked recessive disorder. Altogether, we programmed 4,342 overlapping 1 and 2 kb deletions that tiled 206 kb centered on HPRT1 (including 87 kb upstream and 79 kb downstream) with median 27-fold redundancy per base. We functionally assayed programmed deletions in parallel by selecting for loss of HPRT function with 6-thioguanine. As expected, sequencing gRNA pairs before and after selection confirmed that all HPRT1 exons are needed. However, HPRT1 function was robust to deletion of any intergenic or deeply intronic non-coding region, indicating that proximal regulatory sequences are sufficient for HPRT1 expression. Although our screen did identify the disruption of exon-proximal non-coding sequences (e.g., the promoter) as functionally consequential, long-read sequencing revealed that this signal was driven by rare, imprecise deletions that extended into exons. Our results suggest that no singular distal regulatory element is required for HPRT1 expression and that distal mutations are unlikely to contribute substantially to Lesch-Nyhan syndrome burden. Further application of ScanDel could shed light on the role of regulatory mutations in disease at other loci while also facilitating a deeper understanding of endogenous gene regulation.
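The per-base redundancy quoted in the abstract above can be made concrete with a toy calculation. The sketch below is not the ScanDel design pipeline: it assumes evenly spaced deletion start sites (real gRNA-pair placement depends on where usable target sites occur), and the function names, region length, and step size are hypothetical values chosen only for illustration.

```python
from statistics import median


def tile_deletions(region_len, sizes, step):
    """Return half-open (start, end) intervals for deletions of each size,
    with one deletion starting every `step` bp (an idealized layout)."""
    deletions = []
    for size in sizes:
        for start in range(0, region_len - size + 1, step):
            deletions.append((start, start + size))
    return deletions


def per_base_coverage(region_len, deletions):
    """Count how many deletions remove each base, using a difference array."""
    diff = [0] * (region_len + 1)
    for start, end in deletions:
        diff[start] += 1   # coverage rises at the first deleted base
        diff[end] -= 1     # and falls just past the last deleted base
    coverage = []
    running = 0
    for delta in diff[:region_len]:
        running += delta
        coverage.append(running)
    return coverage


# Toy region: 20 kb tiled with 1 kb and 2 kb deletions every 100 bp.
region_len = 20_000
deletions = tile_deletions(region_len, sizes=(1_000, 2_000), step=100)
coverage = per_base_coverage(region_len, deletions)

print("programmed deletions:", len(deletions))
print("median deletions per base:", median(coverage))
```

In this idealized layout the redundancy away from the region edges is simply the sum of deletion size divided by step over the deletion sizes used, so a denser tiling or larger deletions raise the number of independent deletions that test each base.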
Conversion to psychosis is a longitudinal process during which several epigenetic changes have been described. We tested the hypothesis that epigenetic variability in the methylomes of ultra-high risk (UHR) individuals may contribute to the risk of conversion. We studied a longitudinal cohort of UHR individuals (n = 39) and compared two groups (converters, n = 14 vs. non-converters, n = 25). A longitudinal methylomic study was conducted using Infinium HumanMethylation450 BeadChip covering half a million cytosine-phosphate-guanine (CpG) sites across the human genome from whole-blood samples. We used two statistical methods to investigate the variability of methylation probes. (i) The search for longitudinal variable methylation probes (VMPs) based on median comparisons identified two VMPs in converters only. The first CpG was located in the MACROD2 gene and the second CpG was in an intergenic region at 8q24.21. (ii) The detection of outliers using variance analysis related to private epimutations identified a dozen CpGs in converters only and highlighted two genes (RAC1 and SPHK1) from the sphingolipid signaling pathway. Our study is the first to support increased methylome variability during conversion to psychosis. We speculate that stochastic factors could increase DNA methylation variability and have a role in the complex pathophysiology of conversion to psychosis as well as in other psychiatric diseases.
- Proceedings of the National Academy of Sciences of the United States of America
- Published 7 months ago
Expansions of simple sequence repeats, or microsatellites, have been linked to ∼30 neurological-neuromuscular diseases. While these expansions occur in coding and noncoding regions, microsatellite sequence and repeat length diversity is more prominent in introns with eight different trinucleotide to hexanucleotide repeats, causing hereditary diseases such as myotonic dystrophy type 2 (DM2), Fuchs endothelial corneal dystrophy (FECD), and C9orf72 amyotrophic lateral sclerosis and frontotemporal dementia (C9-ALS/FTD). Here, we test the hypothesis that these GC-rich intronic microsatellite expansions selectively trigger host intron retention (IR). Using DM2, FECD, and C9-ALS/FTD as examples, we demonstrate that retention is readily detectable in affected tissues and peripheral blood lymphocytes and conclude that IR screening constitutes a rapid and inexpensive biomarker for intronic repeat expansion disease.
Chloroplast genomes have undergone tremendous alterations through the evolutionary history of the green algae (Chloroplastida). This study focuses on the evolution of chloroplast genomes in the siphonous green algae (order Bryopsidales). We present five new chloroplast genomes, which along with existing sequences, yield a data set representing all but one family of the order. Using comparative phylogenetic methods, we investigated the evolutionary dynamics of genomic features in the order. Our results show extensive variation in chloroplast genome architecture and intron content. Variation in genome size is accounted for by the amount of intergenic space and freestanding open reading frames that do not show significant homology to standard plastid genes. We show the diversity of these nonstandard genes based on their conserved protein domains, which are often associated with mobile functions (reverse transcriptase/intron maturase, integrases, phage- or plasmid-DNA primases, transposases, ligases). Investigation of the introns showed proliferation of group II introns in the early evolution of the order and their subsequent loss in the core Halimedineae, possibly through RT-mediated intron loss.