OPEN Communications biology | 10 Jul 2019
A Arneson and J Ernst
Comparative genomics sequence data is an important source of information for interpreting genomes. Genome-wide annotations based on this data have largely focused on univariate scores or binary elements of evolutionary constraint. Here we present a complementary whole genome annotation approach, ConsHMM, which applies a multivariate hidden Markov model to learn de novo ‘conservation states’ based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multiple species DNA sequence alignment. We applied ConsHMM to a 100-way vertebrate sequence alignment to annotate the human genome at single nucleotide resolution into 100 conservation states. These states have distinct enrichments for other genomic information including gene annotations, chromatin states, repeat families, and bases prioritized by various variant prioritization scores. Constrained elements have distinct heritability partitioning enrichments depending on their conservation state assignment. ConsHMM conservation states are a resource for analyzing genomes and genetic variants.
* Data courtesy of Altmetric.com