Concept: Junk DNA
A recent slew of ENCODE Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is under 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 - 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom-used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree: many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.
- Proceedings of the National Academy of Sciences of the United States of America
- Published about 5 years ago
Do data from the Encyclopedia of DNA Elements (ENCODE) project render the notion of junk DNA obsolete? Here, I review older arguments for junk grounded in the C-value paradox and propose a thought experiment to challenge ENCODE’s ontology. Specifically, what would we expect for the number of functional elements (as ENCODE defines them) in genomes much larger than our own genome? If the number were to stay more or less constant, it would seem sensible to consider the rest of the DNA of larger genomes to be junk or, at least, assign it a different sort of role (structural rather than informational). If, however, the number of functional elements were to rise significantly with C-value, then (i) organisms with genomes larger than our genome are more complex phenotypically than we are, (ii) ENCODE’s definition of functional element identifies many sites that would not be considered functional or phenotype-determining by standard uses in biology, or (iii) the same phenotypic functions are often determined in a more diffuse fashion in larger-genomed organisms. Good cases can be made for propositions ii and iii. A larger theoretical framework, embracing informational and structural roles for DNA, neutral as well as adaptive causes of complexity, and selection as a multilevel phenomenon, is needed.
A single CpG site within F2RL3 was recently found to be hypomethylated in peripheral blood genomic DNA from smokers compared to former and non-smokers. We performed two epigenome-wide association studies (EWAS) nested in a prospective healthy cohort using the Illumina 450K Methylation Beadchip. The two populations consisted of matched pairs of healthy individuals (n=374), of which half went on to develop breast or colon cancer. The association was analysed between methylation and smoking status, as well as cancer risk. In addition to the same locus in F2RL3, we report several loci that are hypomethylated in smokers compared to former and non-smokers, including an intragenic region of the aryl hydrocarbon receptor repressor gene (AHRR; cg05575921, p=2.31×10^-15; effect size = 14%-17%), an intergenic CpG island on 2q37.1 (cg21566642, p=3.73×10^-13; effect size = 12%), and a further intergenic region at 6p21.33 (cg06126421, p=4.96×10^-11; effect size = 7%-8%). Bisulphite pyrosequencing validated six loci in a further independent population of healthy individuals (n=180). Methylation levels in AHRR were also significantly decreased (p<0.001) and expression increased (p=0.0047) in the lung tissue of current smokers compared to non-smokers. This was further validated in a mouse model of smoke exposure. We observed an association with breast cancer risk for the 2q37.1 locus (p=0.003, adjusted for smoking status), but not for the other loci associated with smoking. These data show that smoking has a direct effect on the epigenome in lung tissue, which is also detectable in peripheral blood DNA and may contribute to cancer risk.
Programmed DNA rearrangements in the single-celled eukaryote Oxytricha trifallax completely rewire its germline into a somatic nucleus during development. This elaborate, RNA-mediated pathway eliminates noncoding DNA sequences that interrupt gene loci and reorganizes the remaining fragments by inversions and permutations to produce functional genes. Here, we report the Oxytricha germline genome and compare it to the somatic genome to present a global view of its massive scale of genome rearrangements. The remarkably encrypted genome architecture contains >3,500 scrambled genes, as well as >800 predicted germline-limited genes expressed, and some posttranslationally modified, during genome rearrangements. Gene segments for different somatic loci often interweave with each other. Single gene segments can contribute to multiple, distinct somatic loci. Terminal precursor segments from neighboring somatic loci map extremely close to each other, often overlapping. This genome assembly provides a draft of a scrambled genome and a powerful model for studies of genome rearrangement.
Within the ENCODE Consortium, GENCODE aimed to accurately annotate all protein-coding genes, pseudogenes, and noncoding transcribed loci in the human genome through manual curation and computational methods. Annotated transcript structures were assessed, and less well-supported loci were systematically, experimentally validated. Predicted exon-exon junctions were evaluated by RT-PCR amplification followed by highly multiplexed sequencing readout, a method we called RT-PCR-seq. Seventy-nine percent of all assessed junctions are confirmed by this evaluation procedure, demonstrating the high quality of the GENCODE gene set. RT-PCR-seq was also efficient for screening gene models predicted using the Human Body Map (HBM) RNA-seq data. We validated 73% of these predictions, thus confirming 1168 novel genes, mostly noncoding, which will further complement the GENCODE annotation. Our novel experimental validation pipeline is extremely sensitive, far more than unbiased transcriptome profiling through RNA sequencing, which is becoming the norm. For example, exon-exon junctions unique to GENCODE annotated transcripts are five times more likely to be corroborated with our targeted approach than with extensive large human transcriptome profiling. Data sets such as the HBM and ENCODE RNA-seq data fail to sample low-expressed transcripts. Our RT-PCR-seq targeted approach also has the advantage of identifying novel exons of known genes, as we discovered unannotated exons in ~11% of assessed introns. We thus estimate that at least 18% of known loci have yet-unannotated exons. Our work demonstrates that the cataloging of all of the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq.
The function of the non-coding portion of the human genome remains one of the most important questions of our time. Its vast complexity is exemplified by the recent identification of an unusual and notable component of the transcriptome - very long intergenic non-coding RNAs, termed vlincRNAs.
The extent to which non-coding mutations contribute to Mendelian disease is a major unknown in human genetics. Relatedly, the vast majority of candidate regulatory elements have yet to be functionally validated. Here, we describe a CRISPR-based system that uses pairs of guide RNAs (gRNAs) to program thousands of kilobase-scale deletions that deeply scan across a targeted region in a tiling fashion (“ScanDel”). We applied ScanDel to HPRT1, the housekeeping gene underlying Lesch-Nyhan syndrome, an X-linked recessive disorder. Altogether, we programmed 4,342 overlapping 1 and 2 kb deletions that tiled 206 kb centered on HPRT1 (including 87 kb upstream and 79 kb downstream) with median 27-fold redundancy per base. We functionally assayed programmed deletions in parallel by selecting for loss of HPRT function with 6-thioguanine. As expected, sequencing gRNA pairs before and after selection confirmed that all HPRT1 exons are needed. However, HPRT1 function was robust to deletion of any intergenic or deeply intronic non-coding region, indicating that proximal regulatory sequences are sufficient for HPRT1 expression. Although our screen did identify the disruption of exon-proximal non-coding sequences (e.g., the promoter) as functionally consequential, long-read sequencing revealed that this signal was driven by rare, imprecise deletions that extended into exons. Our results suggest that no singular distal regulatory element is required for HPRT1 expression and that distal mutations are unlikely to contribute substantially to Lesch-Nyhan syndrome burden. Further application of ScanDel could shed light on the role of regulatory mutations in disease at other loci while also facilitating a deeper understanding of endogenous gene regulation.
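The relationship between tiled deletions and per-base redundancy described in the ScanDel abstract can be illustrated with a small sketch. It assumes a simplified, hypothetical design in which 1 kb and 2 kb deletions start at a fixed 100 bp step across a 206 kb region. The real screen placed deletions using pairs of gRNAs, so the function names, the step size, and the resulting counts here are illustrative assumptions and are not expected to reproduce the reported 4,342 deletions or 27-fold median redundancy.

```python
from statistics import median


def tile_deletions(region_len, sizes, step):
    """Return half-open (start, end) intervals for every deletion size,
    placed at a fixed step across the region (hypothetical design)."""
    deletions = []
    for size in sizes:
        for start in range(0, region_len - size + 1, step):
            deletions.append((start, start + size))
    return deletions


def per_base_coverage(region_len, deletions):
    """Count how many deletions cover each base, using a difference array."""
    diff = [0] * (region_len + 1)
    for start, end in deletions:
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    running = 0
    for delta in diff[:region_len]:
        running += delta
        coverage.append(running)
    return coverage


# Assumed parameters: 206 kb region, 1 kb and 2 kb deletions, 100 bp step.
REGION_LEN = 206_000
deletions = tile_deletions(REGION_LEN, sizes=(1_000, 2_000), step=100)
coverage = per_base_coverage(REGION_LEN, deletions)

print("number of deletions:", len(deletions))
print("median redundancy per base:", median(coverage))
```

Away from the edges of the region, redundancy in this toy layout is simply the sum of deletion size divided by step over the two size classes, which shows why overlapping tiles of modest length can interrogate each base many times.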
Chloroplast genomes have undergone tremendous alterations through the evolutionary history of the green algae (Chloroplastida). This study focuses on the evolution of chloroplast genomes in the siphonous green algae (order Bryopsidales). We present five new chloroplast genomes, which, along with existing sequences, yield a data set representing all but one family of the order. Using comparative phylogenetic methods, we investigated the evolutionary dynamics of genomic features in the order. Our results show extensive variation in chloroplast genome architecture and intron content. Variation in genome size is accounted for by the amount of intergenic space and freestanding open reading frames that do not show significant homology to standard plastid genes. We show the diversity of these nonstandard genes based on their conserved protein domains, which are often associated with mobile functions (reverse transcriptase/intron maturase, integrases, phage- or plasmid-DNA primases, transposases, ligases). Investigation of the introns showed proliferation of group II introns in the early evolution of the order and their subsequent loss in the core Halimedineae, possibly through RT-mediated intron loss.
Epilepsy is a common neurological disorder, and mutations in genes encoding ion channels or neurotransmitter receptors are frequent causes of monogenic forms of epilepsy. Here we show that abnormal expansions of TTTCA and TTTTA repeats in intron 4 of SAMD12 cause benign adult familial myoclonic epilepsy (BAFME). Single-molecule, real-time sequencing of BAC clones and nanopore sequencing of genomic DNA identified two repeat configurations in SAMD12. Intriguingly, in two families with a clinical diagnosis of BAFME in which no repeat expansions in SAMD12 were observed, we identified similar expansions of TTTCA and TTTTA repeats in introns of TNRC6A and RAPGEF2, indicating that expansions of the same repeat motifs are involved in the pathogenesis of BAFME regardless of the genes in which the expanded repeats are located. This discovery that expansions of noncoding repeats lead to neuronal dysfunction responsible for myoclonic tremor and epilepsy extends the understanding of diseases with such repeat expansion.
Acoels are small, ubiquitous - but understudied - marine worms with a very simple body plan. Their internal phylogeny is still not fully resolved, and the position of their proposed phylum Xenacoelomorpha remains debated. Here we describe mitochondrial genome sequences from the acoels Paratomella rubra and Isodiametra pulchra, and the complete mitochondrial genome of the acoel Archaphanostoma ylvae. The P. rubra and A. ylvae sequences are typical for metazoans in size and gene content. The larger I. pulchra mitochondrial genome contains both ribosomal genes, 21 tRNAs, but only 11 protein-coding genes. We find evidence suggesting a duplicated sequence in the I. pulchra mitochondrial genome. The P. rubra, I. pulchra and A. ylvae mitochondria have a unique genome organisation in comparison to other metazoan mitochondrial genomes. We found a large degree of protein-coding gene and tRNA overlap with little non-coding sequence in the compact P. rubra genome. Conversely, the A. ylvae and I. pulchra genomes have many long non-coding sequences between genes, likely driving genome size expansion in the latter. Phylogenetic trees inferred from mitochondrial genes retrieve Xenacoelomorpha as an early branching taxon in the deuterostomes. Sequence divergence analysis between P. rubra sampled in England and Spain indicates cryptic diversity.