Concept: Limit of a sequence
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a new sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an online Bayesian phylogenetic method that can update an existing posterior with new sequences. Here we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding from below the effective sample size (ESS) achievable with a given number of particles. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensionality of the phylogenetic model grows with each newly added sequence.
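The ESS that these bounds control is the standard SMC degeneracy diagnostic, computed directly from the particle weights. A minimal sketch in plain NumPy (the function name and the log-weight input convention are illustrative, not from the paper):

```python
import numpy as np

def effective_sample_size(log_weights):
    """Standard SMC diagnostic: ESS = (sum w)^2 / sum(w^2),
    computed stably from unnormalized log-weights."""
    lw = np.asarray(log_weights, dtype=float)
    lw = lw - lw.max()          # shift so exponentiation cannot overflow
    w = np.exp(lw)
    return w.sum() ** 2 / np.sum(w ** 2)

# Uniform weights give ESS equal to the particle count N.
print(effective_sample_size(np.zeros(1000)))
```

Uniform weights yield an ESS equal to the particle count N, while a single dominant weight collapses it towards 1; the result stated above guarantees that, even after a new sequence is added, this quantity still grows linearly in N.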
Increased emphasis on the reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While these data can, in theory, be used to reproduce the results in papers, they are difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive that together have allowed us to build an easily extendable resource for analysis of the data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair .
High-throughput metagenomic sequencing has revolutionized our view of the structure and metabolic potential of microbial communities. However, analysis of metagenomic composition is often complicated by the high complexity of the community and the lack of related reference genomic sequences. As a starting point for comparative metagenomic analysis, researchers require efficient means of assessing the pairwise similarity of metagenomes (beta-diversity). A number of approaches are used to address this task, but most of them have inherent disadvantages that limit their scope of applicability. For instance, reference-based methods perform poorly on metagenomes from previously unstudied niches, while composition-based methods appear too abstract for straightforward interpretation and do not allow differentially abundant features to be identified.
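To illustrate the composition-based approach mentioned above, one can summarize each metagenome by its k-mer frequency profile and compare profiles with a standard ecological dissimilarity such as Bray-Curtis. A minimal sketch (the function names and the choice of k = 4 are illustrative):

```python
from collections import Counter

def kmer_profile(seq, k=4):
    # Relative k-mer frequencies of one sample (e.g. concatenated reads).
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def bray_curtis(p, q):
    # Composition-based beta-diversity between two k-mer profiles:
    # 0.0 for identical compositions, 1.0 for fully disjoint ones.
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den

a = kmer_profile("ACGTACGTACGTACGT")
b = kmer_profile("ACGTACGTACGTACGT")
print(bray_curtis(a, b))   # identical samples give 0.0
```

This also makes the interpretability problem noted above concrete: a k-mer profile distance indicates how different two samples are, but not which taxa or genes drive the difference.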
EBI metagenomics ( http://www.ebi.ac.uk/metagenomics ) provides a free-to-use platform for the analysis and archiving of sequence data derived from the microbial populations found in a particular environment. Over the past two years, EBI metagenomics has increased the number of datasets analysed 10-fold. In addition to increased throughput, the underlying analysis pipeline has been overhauled to include both new and updated tools and reference databases. Of particular note is a new workflow for taxonomic assignments that has been extended to include assignments based on both the large and small subunit rRNA marker genes and to encompass all cellular micro-organisms. We also describe the addition of metagenomic assembly as a new analysis service. Our pilot studies have produced over 2400 assemblies from datasets in the public domain. From these assemblies, we have produced a searchable, non-redundant protein database of over 50 million sequences. To provide improved access to the data stored within the resource, we have developed a programmatic interface that provides access to the analysis results and associated sample metadata. Finally, we have integrated the results of a series of statistical analyses that provide estimates of diversity and sample comparisons.
Paradoxically, centromeres are known both for their characteristic repeat sequences (satellite DNA) and for being epigenetically defined. Maize (Zea mays mays) is an attractive model for studying centromere positioning because many of its large (~2 Mb) centromeres are not dominated by satellite DNA. These centromeres, which we call complex centromeres, allow both assembly into reference genomes and mapping of short reads from ChIP-seq with antibodies to centromeric histone H3 (cenH3).
There has been progress towards malaria elimination in the last decade. In response, WHO launched the Global Technical Strategy (GTS), in which vector surveillance and control play important roles. Country experiences in the Eliminating Malaria Case Study Series were reviewed to identify success factors on the road to elimination using a cross-case study analytic approach.
Insecticide resistance threatens effective vector control, especially for mosquitoes and malaria. To manage resistance, recommended insecticide use strategies include mixtures, sequences and rotations. New insecticides are being developed, and there is an opportunity to design use strategies that limit the evolution of further resistance in the short term. A 2013 review of modelling and empirical studies of resistance points to the advantages of mixtures. However, there is limited recent, accessible modelling work addressing the evolution of resistance under different operational strategies. There is an opportunity to improve the level of mechanistic understanding within the operational community of how insecticide resistance can be expected to evolve in response to different strategies. This paper provides a concise, accessible description of a flexible model of the evolution of insecticide resistance. The model is used to develop a mechanistic picture of the evolution of insecticide resistance and how it is likely to respond to potential insecticide use strategies. The aim is to reach an audience unlikely to read a more detailed modelling paper. The model itself, as described here, represents two independent genes coding for resistance to two insecticides. This allows insecticide use to be represented in isolation, in sequence, and in mixtures.
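The flavour of such a model can be conveyed with a deliberately crude haploid caricature (this is not the paper's model: the fitness scheme, parameter values, and function names below are all illustrative). Each insecticide selects at one of two independent loci; deploying an insecticide multiplies susceptible fitness by (1 - s), while resistant alleles pay a constant fitness cost:

```python
def step(p1, p2, use1, use2, s=0.5, cost=0.05):
    """One generation of allele-frequency change at two independent
    resistance loci (haploid caricature). use1/use2: whether each
    insecticide is deployed this generation; s: kill probability of
    susceptibles under exposure; cost: fitness cost of resistance."""
    def update(p, used):
        w_r = 1.0 - cost                    # resistant: pays cost, survives
        w_s = (1.0 - s) if used else 1.0    # susceptible: killed with prob s
        return p * w_r / (p * w_r + (1.0 - p) * w_s)
    return update(p1, use1), update(p2, use2)

# Mixture strategy: both insecticides deployed every generation,
# starting from rare resistance (frequency 0.01) at both loci.
p1 = p2 = 0.01
for generation in range(40):
    p1, p2 = step(p1, p2, use1=True, use2=True)
print(p1, p2)
```

Setting use1=True, use2=False (and swapping the flags after some number of generations) represents sequential use, while setting both flags represents a mixture, so the strategies named above can be compared within one framework.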
The increasing application of next-generation sequencing technologies has led to the availability of thousands of reference genomes, often providing multiple genomes for the same or closely related species. The current approach of representing a species or a population with a single reference sequence and a set of variations cannot capture their full diversity and introduces bias towards the chosen reference. There is a need for the representation of multiple sequences in a composite way that is compatible with existing data sources for annotation and suitable for established sequence analysis methods. At the same time, this representation needs to be easily accessible and extendable to account for the constant change of available genomes.
Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences; accuracy then diminishes steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information, under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments.
Although everyday experiences unfold continuously over time, shifts in context, or event boundaries, can influence how those events come to be represented in memory [1-4]. Specifically, mnemonic binding across sequential representations is more challenging at context shifts, such that successful temporal associations are more likely to be formed within than across contexts [1, 2, 5-9]. However, in order to preserve a subjective sense of continuity, it is important that the memory system bridge temporally adjacent events, even if they occur in seemingly distinct contexts. Here, we applied pattern similarity analysis to scalp electroencephalographic (EEG) recordings during a sequential learning task [2, 3] in humans and showed that the detection of event boundaries triggered a rapid memory reinstatement of the just-encoded sequence episode. Memory reactivation was detected rapidly (∼200-800 ms from the onset of the event boundary) and was specific to context shifts that were preceded by an event sequence with episodic content. Memory reinstatement was not observed during the sequential encoding of events within an episode, indicating that memory reactivation was induced specifically upon context shifts. Finally, the degree of similarity between neural responses elicited during sequence encoding and at event boundaries correlated positively with participants' ability to later link across sequences of events, suggesting that this reinstatement plays a critical role in binding temporally adjacent events in long-term memory. The current results shed light on the neural mechanisms that promote episodic encoding, not only of information within an event but also of the links across events that create a memory representation of continuous experience.
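At its core, the pattern similarity analysis referred to above correlates the multichannel EEG pattern evoked at one moment with the pattern evoked at another; reinstatement appears as reliably elevated similarity between encoding and boundary windows. A minimal NumPy sketch (the 64-channel setup and the synthetic "noisy reinstatement" data are illustrative, not from the study):

```python
import numpy as np

def pattern_similarity(pattern_a, pattern_b):
    """Pearson correlation between two spatial EEG patterns
    (vectors of per-channel values), the core similarity measure."""
    a = np.asarray(pattern_a, dtype=float).ravel()
    b = np.asarray(pattern_b, dtype=float).ravel()
    a = (a - a.mean()) / a.std()   # z-score each pattern
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

rng = np.random.default_rng(0)
encoding = rng.standard_normal(64)                   # 64-channel encoding pattern
boundary = encoding + 0.3 * rng.standard_normal(64)  # noisy "reinstated" copy
print(pattern_similarity(encoding, boundary))
```

In practice such similarity values are computed across many trial pairs and time windows and compared against a baseline (e.g. similarity to unrelated episodes) to test whether boundary-evoked activity reinstates the just-encoded sequence.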