Concept: Computational phylogenetics
Figures of phylogenetic trees are widely used to illustrate the result of evolutionary analyses. However, one cannot easily extract a machine-readable representation from such images. Therefore, new software emerges that helps to preserve phylogenies digitally for future research.
The Ginglymodi is one of the most common, though poorly understood, groups of neopterygians, which includes gars, macrosemiiforms, and “semionotiforms.” In particular, the phylogenetic relationships between the widely distributed “semionotiforms,” and between them and other ginglymodians have been enigmatic. Here, the phylogenetic relationships between eight of the 11 “semionotiform” genera, five genera of living and fossil gars and three macrosemiid genera, are analysed through cladistic analysis, based on 90 morphological characters and 37 taxa, including 7 out-group taxa. The results of the analysis show that the Ginglymodi includes two main lineages: Lepisosteiformes and †Semionotiformes. The genera †Pliodetes, †Araripelepidotes, †Lepidotes, †Scheenstia, and †Isanichthys are lepisosteiforms, and not semionotiforms, as previously thought, and these taxa extend the stratigraphic range of the lineage leading to gars back to the Early Jurassic. A monophyletic †Lepidotes is restricted to the Early Jurassic species, whereas the strongly tritoral species previously referred to †Lepidotes are referred to †Scheenstia. Other species previously referred to †Lepidotes represent other genera or new taxa. The macrosemiids are well nested within semionotiforms, together with †Semionotidae, here restricted to †Semionotus, and a new family including †Callipurbeckia n. gen. minor (previously referred to †Lepidotes), †Macrosemimimus, †Tlayuamichin, †Paralepidotus, and †Semiolepis. Due to the numerous taxonomic changes needed according to the phylogenetic analysis, this article also includes formal taxonomic definitions and diagnoses for all generic and higher taxa, which are new or modified. The study of Mesozoic ginglymodians confirms Patterson’s observation that these fishes show morphological affinities with both halecomorphs and teleosts.
Therefore, the compilation of large data sets including the Mesozoic ginglymodians and the re-evaluation of several hypotheses of homology are essential to test the hypotheses of the Halecostomi vs. the Holostei, which is one of the major topics in the evolution of Mesozoic vertebrates and the origin of modern fish faunas.
BACKGROUND: Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great “Tree of Life” (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch lengths and annotations, resulting in a custom phylogeny suited to the user’s needs. Such a system could become a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact through clearly defined interfaces. RESULTS: With the aim of building such a “phylotastic” system, the NESCent Hackathons, Interoperability, Phylogenies (HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012. During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations, documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2) proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing, finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a website (www.phylotastic.org), and a server image. 
CONCLUSIONS: Approximately 9 person-months of effort (centered on a software development hackathon) resulted in the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3 end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality assessment.
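The pruning step mentioned in the abstract above (reducing a large expert phylogeny to the taxa in a user's query) can be illustrated with a minimal sketch. It is not the phylotastic implementation: the nested-tuple tree representation, the function name `prune`, and the example taxa are all assumptions made for illustration, and branch lengths and name resolution are ignored.

```python
def prune(tree, keep):
    """Return the subtree containing only leaves in `keep`, or None if none remain.

    Assumed representation: a leaf is a string, an internal node is a tuple
    of subtrees. Internal nodes left with a single child are collapsed.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse a node that lost all but one child
    return tuple(kept)


# Hypothetical "expert" tree and query list.
expert_tree = (("human", "chimp"), ("mouse", "rat"), "chicken")
query = {"human", "mouse", "rat"}
print(prune(expert_tree, query))
```

A real pruner would also need to handle branch lengths (summing them when a node is collapsed) and standard formats such as Newick or NeXML.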
SUMMARY: Two methods to add unaligned sequences into an existing multiple sequence alignment have been implemented as the “--add” and “--addfragments” options in the MAFFT package. The former option is a basic one and applicable only to full-length sequences, while the latter option is applicable even when the unaligned sequences are short and fragmentary. These methods internally infer the phylogenetic relationship among the sequences in the existing alignment, as well as the phylogenetic positions of unaligned sequences. Benchmarks based on two independent simulations consistently suggest that the “--addfragments” option outperforms recent methods, PaPaRa and PAGAN, in accuracy for difficult problems and that these three methods appropriately handle easy problems. AVAILABILITY: http://mafft.cbrc.jp/alignment/software/ CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Available at Bioinformatics online.
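As a rough illustration of how the two MAFFT options described above (written `--add` and `--addfragments` on the command line) might be invoked from a script, the sketch below only builds the argument list. The helper name `mafft_add_command` and the file names are hypothetical, and the use of `--auto` to let MAFFT choose a strategy is an assumption about a typical invocation rather than something stated in the abstract.

```python
import shlex


def mafft_add_command(existing_alignment, new_sequences, fragments=False):
    """Build a MAFFT argument list for adding sequences to an existing alignment.

    fragments=False -> --add           (full-length sequences)
    fragments=True  -> --addfragments  (short, fragmentary sequences)
    """
    option = "--addfragments" if fragments else "--add"
    # MAFFT writes the resulting alignment to standard output.
    return ["mafft", "--auto", option, new_sequences, existing_alignment]


# Hypothetical file names, shown as the equivalent shell command.
cmd = mafft_add_command("existing.aln.fasta", "reads.fasta", fragments=True)
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, stdout=out_file, check=True)` with MAFFT installed and `out_file` an open file handle.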
With the advent of high-throughput sequencing technologies, the rapid generation and accumulation of large amounts of sequencing data pose an insurmountable demand for efficient algorithms for constructing whole-genome phylogenies. The existing phylogenomic methods all use assembled sequences, which are often not available owing to the difficulty of assembling short reads; this obstructs phylogenetic investigations on species without a reference genome. In this report, we present co-phylog, an assembly-free phylogenomic approach that creates a ‘micro-alignment’ at each ‘object’ in the sequence using the ‘context’ of the object and calculates pairwise distances before reconstructing the phylogenetic tree based on those distances. We explored parameter usage and the optimal working range of co-phylog, and assessed it using simulated next-generation sequencing (NGS) data and real raw NGS data. We also compared the co-phylog method with traditional alignment and alignment-free methods and illustrated its advantages and limitations. In conclusion, we demonstrated that co-phylog is an efficient algorithm and that it delivers high-resolution and accurate phylogenies using whole-genome unassembled sequencing data, especially in the case of closely related organisms, thereby significantly alleviating the computational burden in the genomic era.
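The context/object idea in the abstract above can be sketched in a heavily simplified form. The following is not the co-phylog implementation: it assumes a single-base object flanked by two k-mers of context, works on plain strings rather than reads, and estimates distance as the fraction of shared, unambiguous contexts whose objects differ. All names and parameters are illustrative.

```python
from collections import defaultdict


def context_objects(seq, k):
    """Map each (left k-mer, right k-mer) context to the set of middle bases seen."""
    table = defaultdict(set)
    for i in range(k, len(seq) - k):
        context = (seq[i - k:i], seq[i + 1:i + 1 + k])
        table[context].add(seq[i])
    return table


def context_object_distance(seq_a, seq_b, k=4):
    """Fraction of shared contexts whose objects differ (None if nothing is shared)."""
    a = context_objects(seq_a, k)
    b = context_objects(seq_b, k)
    # Keep only contexts that are unambiguous (one object) in both sequences.
    shared = [c for c in a.keys() & b.keys() if len(a[c]) == 1 and len(b[c]) == 1]
    if not shared:
        return None
    differing = sum(1 for c in shared if a[c] != b[c])
    return differing / len(shared)


print(context_object_distance("ACGTTGCA", "ACGTTGCA", k=2))
```

Each shared context acts as a tiny alignment anchored by its flanking k-mers, which is why no global alignment or assembly is needed in this toy version; the resulting pairwise distances could then be passed to a distance-based tree method such as neighbour joining.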
We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a “top-down” strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO’s superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.
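For readers unfamiliar with the sum-of-the-pairs scoring system that the abstract above contrasts with GISMO's Bayesian measure, here is a minimal sketch. The match, mismatch, and gap values are arbitrary assumptions (real programs typically use substitution matrices and affine gap penalties), and the function name is illustrative.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment by summing pairwise scores over every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap paired with gap contributes nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

Because the score is defined for any set of equal-length rows, nothing in it tests whether the aligned residues are related at all, which is the weakness the abstract points to.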
Sequence alignment is a long-standing problem in bioinformatics. The Basic Local Alignment Search Tool (BLAST) is one of the most popular and fundamental alignment tools. The explosive growth of biological sequence data calls for faster sequence alignment tools such as BLAST. To this end, we develop high speed BLASTN (HS-BLASTN), a parallel and fast nucleotide database search tool that accelerates MegaBLAST, the default module of NCBI-BLASTN. HS-BLASTN builds a new lookup table using the FMD-index of the database and employs an accurate and effective seeding method to find short stretches of identities (called seeds) between the query and the database. HS-BLASTN produces the same alignment results as MegaBLAST but runs much faster. Specifically, our experiments conducted on a 12-core server show that HS-BLASTN can be 22 times faster than MegaBLAST and exhibits better parallel performance. HS-BLASTN is written in C++ and the related source code is available at https://github.com/chenying2016/queries under the GPLv3 license.
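The seeding step described above (finding short exact matches between query and database through a lookup table) can be sketched as follows. This toy version swaps in a plain hash table of fixed-length words in place of the FMD-index that HS-BLASTN uses, so it illustrates the general seed-lookup idea only; function names and the word size are assumptions.

```python
from collections import defaultdict


def build_lookup(database, word_size):
    """Index every word of length `word_size` in the database by its start position."""
    table = defaultdict(list)
    for pos in range(len(database) - word_size + 1):
        table[database[pos:pos + word_size]].append(pos)
    return table


def find_seeds(query, table, word_size):
    """Return (query_pos, database_pos) pairs where a word matches exactly."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for dpos in table.get(query[qpos:qpos + word_size], ()):
            seeds.append((qpos, dpos))
    return seeds


table = build_lookup("ACGTACGT", word_size=4)
print(find_seeds("TACGT", table, word_size=4))
```

In a BLAST-style search, each seed would then be extended in both directions to produce a scored alignment; that extension step is omitted here.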
Vulvovaginal candidiasis (VVC) is an important problem due to Candida spp. The aim of this study was molecular identification, phylogenetic analysis, and evaluation of antifungal susceptibility of non-albicans Candida isolates from VVC.
A new small-bodied ornithopod dinosaur, Diluvicursor pickeringi, gen. et sp. nov., is named from the lower Albian of the Eumeralla Formation in southeastern Australia and helps shed new light on the anatomy and diversity of Gondwanan ornithopods. Comprising an almost complete tail and partial lower right hindlimb, the holotype (NMV P221080) was deposited as a carcass or body-part in a log-filled scour near the base of a deep, high-energy river that incised a faunally rich, substantially forested riverine floodplain within the Australian-Antarctic rift graben. The deposit is termed the ‘Eric the Red West Sandstone.’ The holotype, interpreted as an older juvenile ∼1.2 m in total length, appears to have endured antemortem trauma to the pes. A referred, isolated posterior caudal vertebra (NMV P229456) from the holotype locality suggests D. pickeringi grew to at least 2.3 m in length. D. pickeringi is characterised by 10 potential autapomorphies, among which dorsoventrally low neural arches and transversely broad caudal ribs on the anterior-most caudal vertebrae are a visually defining combination of features. These features suggest D. pickeringi had robust anterior caudal musculature and strong locomotor abilities. Another isolated anterior caudal vertebra (NMV P228342) from the same deposit suggests that the fossil assemblage hosts at least two ornithopod taxa. D. pickeringi and two stratigraphically younger, indeterminate Eumeralla Formation ornithopods from Dinosaur Cove, NMV P185992/P185993 and NMV P186047, are closely related. However, the tail of D. pickeringi is far shorter than that of NMV P185992/P185993 and its pes more robust than that of NMV P186047. Preliminary cladistic analysis, utilising three existing datasets, failed to resolve D. pickeringi beyond a large polytomy of Ornithopoda.
However, qualitative assessment of shared anatomical features suggests that the Eumeralla Formation ornithopods, South American Anabisetia saldiviai and Gasparinisaura cincosaltensis, Afro-Laurasian dryosaurids and possibly Antarctic Morrosaurus antarcticus share a close phylogenetic progenitor. Future phylogenetic analysis with improved data on Australian ornithopods will help to test these suggested affinities.
Genomic data is increasingly being used to understand infectious disease epidemiology. Isolates from a given outbreak are sequenced, and the patterns of shared variation are used to infer which isolates within the outbreak are most closely related to each other. Unfortunately, the phylogenetic trees typically used to represent this variation are not directly informative about who infected whom: a phylogenetic tree is not a transmission tree. However, a transmission tree can be inferred from a phylogeny while accounting for within-host genetic diversity by colouring the branches of a phylogeny according to which host those branches were in. Here we extend this approach and show that it can be applied to partially sampled and ongoing outbreaks. This requires computing the correct probability of an observed transmission tree, and we demonstrate how to do this for a large class of epidemiological models. We also demonstrate how the branch colouring approach can incorporate a variable number of unique colours to represent unsampled intermediates in transmission chains. The resulting algorithm is a reversible-jump Markov chain Monte Carlo (MCMC) sampler, which we apply to both simulated data and real data from an outbreak of tuberculosis. By accounting for unsampled cases and an outbreak which may not have reached its end, our method is uniquely suited to use in a public health environment during real-time outbreak investigations. We implemented this transmission tree inference methodology in an R package called TransPhylo, which is freely available from https://github.com/xavierdidelot/TransPhylo.
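The branch-colouring idea above can be made concrete with a small sketch that reads transmission events off an already-coloured phylogeny. It is not part of TransPhylo (an R package) and does no inference: the tree is given as a child-to-parent map, every node is assumed to carry one host label, and colour changes are assumed to fall at nodes rather than along branches. All names are hypothetical.

```python
def transmission_pairs(parent, host):
    """Return the set of (infector, infectee) pairs implied by a coloured tree.

    parent: dict mapping each non-root node to its parent node
    host:   dict mapping every node to the host (colour) it is assigned to
    """
    pairs = set()
    for child, par in parent.items():
        if host[child] != host[par]:
            # The lineage passes from the parent's host to the child's host.
            pairs.add((host[par], host[child]))
    return pairs


# Hypothetical three-host outbreak: A infects B, and B infects C.
parent = {"n1": "root", "t1": "n1", "n2": "n1", "t2": "n2", "t3": "n2"}
host = {"root": "A", "n1": "A", "t1": "A", "n2": "B", "t2": "B", "t3": "C"}
print(sorted(transmission_pairs(parent, host)))
```

Hosts that appear as colours but have no sampled tip would correspond to the unsampled intermediates the abstract mentions.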