Concept: Annotation


Detailed anatomical understanding of the human brain is essential for unraveling its functional architecture, yet current reference atlases have major limitations in terms of lack of whole-brain coverage, relatively low image resolution, and sparse structural annotation. We present the first digital human brain atlas to incorporate neuroimaging, high-resolution histology, and chemoarchitecture across a complete adult female brain, consisting of MRI, DWI, and 1356 large-format cellular resolution (1 µm/pixel) Nissl and immunohistochemistry anatomical plates. The atlas is comprehensively annotated for 862 structures, including 117 white matter tracts and several novel cyto- and chemoarchitecturally defined structures, and these annotations were transferred onto the matching MRI dataset. Neocortical delineations were done for sulci, gyri, and modified Brodmann areas to link macroscopic anatomical and microscopic cytoarchitectural parcellations. Correlated neuroimaging and histological structural delineation allowed fine feature identification in MRI data and subsequent structural identification in MRI data from other brains. This interactive online digital atlas is integrated with existing Allen Institute for Brain Science gene expression atlases and is publicly accessible as a resource for the neuroscience community.

Concepts: Gene, Neuroanatomy, Brain, Histology, Human brain, Cerebral cortex, Cerebellum, Annotation


BACKGROUND: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora for training machine taggers. In this paper we have investigated this approach by pooling corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system, MedTagger, that consists of dictionary lookup, part-of-speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions. RESULTS: We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling. CONCLUSIONS: The effectiveness of pooling corpora depends on several factors, which include compatibility of annotation guidelines, distribution of report types and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora.
To facilitate the pooling and reuse of annotated corpora, we suggest that: (i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; (ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and (iii) metadata such as the type of the report should be created during the annotation process.

Concepts: Effectiveness, Ontology, Machine learning, Learning, Annotation, Annotated bibliography, Natural language processing, Marginalia


The advent of next-generation sequencing has allowed huge amounts of DNA sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The current challenge is identifying from which microorganisms and genes the DNA originated. Several tools and databases are available for annotating DNA sequences. The tools, databases and parameters used can have a significant impact on the results: naïve choice of these factors can result in a false representation of community composition and function. We use a simulated metagenome to show how different parameters affect annotation accuracy by evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and Megablast. This simulated metagenome allowed the recovery of known organism and function abundances to be quantitatively evaluated, which is not possible for environmental metagenomes. The performance of each program and database varied, e.g. One Codex correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced many false positive annotations. This effect decreased as the taxonomic level investigated increased. Selecting more stringent parameters decreases the annotation sensitivity, but increases precision. Ultimately, there is a trade-off between taxonomic resolution and annotation accuracy. These results should be considered when annotating metagenomes and interpreting results from previous studies.
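
The trade-off between sensitivity and precision described above can be illustrated with a minimal sketch. It assumes a toy simulated metagenome in which the true genus of every read is known; the `Hit` record, the `score` field and the threshold values are hypothetical and do not correspond to the output format or parameters of MEGAN, MG-RAST, One Codex or Megablast.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    true_genus: str       # known origin in the simulated metagenome
    predicted_genus: str  # label assigned by a (hypothetical) annotator
    score: float          # hypothetical match score, higher = stronger match


def evaluate(hits, min_score):
    """Return (precision, sensitivity) when only hits scoring >= min_score are kept."""
    kept = [h for h in hits if h.score >= min_score]
    correct = sum(1 for h in kept if h.predicted_genus == h.true_genus)
    precision = correct / len(kept) if kept else 0.0
    # Sensitivity is measured against all reads, including those filtered out.
    sensitivity = correct / len(hits) if hits else 0.0
    return precision, sensitivity


# Toy data: five reads with known origin (illustrative values only).
hits = [
    Hit("Escherichia", "Escherichia", 95.0),
    Hit("Escherichia", "Shigella", 60.0),
    Hit("Bacillus", "Bacillus", 88.0),
    Hit("Bacillus", "Bacillus", 55.0),
    Hit("Pseudomonas", "Azotobacter", 52.0),
]

for threshold in (50.0, 80.0):
    p, s = evaluate(hits, threshold)
    print(f"min_score={threshold}: precision={p:.2f}, sensitivity={s:.2f}")
```

In this toy data set, raising the threshold removes both false positives and one correct low-scoring hit, so precision rises while sensitivity falls.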

Concepts: DNA, Bacteria, Biology, Organism, Sequence, Annotation, Annotated bibliography, Marginalia


Electron cryomicroscopy (cryo-EM) has been used to determine the atomic coordinates (models) from density maps of biological assemblies. These models can be assessed by their overall fit to the experimental data and stereochemical information. However, these models do not annotate the actual density values of the atoms nor their positional uncertainty. Here, we introduce a computational procedure to derive an atomic model from a cryo-EM map with annotated metadata. The accuracy of such a model is validated by a faithful replication of the experimental cryo-EM map computed using the coordinates and associated metadata. The functional interpretation of any structural features in the model and its utilization for future studies can be made in the context of its measure of uncertainty. We applied this protocol to the 3.3-Å map of the mature P22 bacteriophage capsid, a large and complex macromolecular assembly. With this protocol, we identify and annotate previously undescribed molecular interactions between capsid subunits that are crucial to maintain stability in the absence of cementing proteins or cross-linking, as occur in other bacteriophages.

Concepts: Microbiology, Molecule, Chemistry, Atom, Annotation, Assembly language, Annotated bibliography, Marginalia


The number of image analysis tools supporting the extraction of architectural features of root systems has increased over the last years. These tools offer a handy set of complementary facilities, yet it is widely accepted that none of these software tools is able to efficiently extract the growing array of static and dynamic features for different types of images and species. We describe the Root System Markup Language (RSML), which has been designed to overcome two major challenges: (i) to enable portability of root architecture data between different software tools in an easy and interoperable manner, allowing seamless collaborative work, and (ii) to provide a standard format upon which to base central repositories, which will soon arise following the expanding worldwide root phenotyping effort. RSML follows the XML standard to store 2D or 3D image metadata, plant and root properties and geometries, continuous functions along individual root paths and a suite of annotations at the image, plant or root scales, at one or several time points. Plant ontologies are used to describe botanical entities that are relevant at the scale of root system architecture. An XML schema describes the features and constraints of RSML, and open-source packages have been developed in several languages (R, Excel, Java, Python, C#) to enable researchers to integrate RSML files into popular research workflows.
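
As a rough illustration of how an XML-based root description can be consumed, the sketch below parses a small hand-written document with Python's standard library and computes the length of each root path. The element and attribute names in `SAMPLE` are a simplified assumption made for this example and should be checked against the published RSML schema; this is not one of the open-source packages mentioned above.

```python
import math
import xml.etree.ElementTree as ET

# Hand-written, simplified document; element names are assumed for illustration.
SAMPLE = """<rsml>
  <metadata><unit>cm</unit></metadata>
  <scene>
    <plant>
      <root ID="1" label="primary">
        <geometry>
          <polyline>
            <point x="0" y="0"/>
            <point x="0" y="3"/>
            <point x="4" y="3"/>
          </polyline>
        </geometry>
        <root ID="1.1" label="lateral">
          <geometry>
            <polyline>
              <point x="0" y="1"/>
              <point x="3" y="5"/>
            </polyline>
          </geometry>
        </root>
      </root>
    </plant>
  </scene>
</rsml>"""


def root_lengths(xml_text):
    """Map each root ID to the summed segment length of its own polyline."""
    document = ET.fromstring(xml_text)
    lengths = {}
    for root in document.iter("root"):  # visits nested (lateral) roots too
        # The path is relative to this element, so a parent root does not
        # pick up the geometry of its child roots.
        polyline = root.find("geometry/polyline")
        if polyline is None:
            continue
        points = [(float(p.get("x")), float(p.get("y")))
                  for p in polyline.findall("point")]
        lengths[root.get("ID")] = sum(
            math.dist(a, b) for a, b in zip(points, points[1:])
        )
    return lengths


print(root_lengths(SAMPLE))
```

Nesting lateral roots inside their parent root element keeps the branching hierarchy explicit, which is one reason a tree-shaped format such as XML suits root architecture data.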

Concepts: Mathematics, Computer program, Annotation, Root, Software architecture, XML, Markup language, Weyl group


Several research groups have shown how to map fMRI responses to the meanings of presented stimuli. This paper presents new methods for doing so when only a natural language annotation is available as the description of the stimulus. We study fMRI data gathered from subjects watching an episode of BBC's Sherlock (Chen et al., 2017), and learn bidirectional mappings between fMRI responses and natural language representations. By leveraging data from multiple subjects watching the same movie, we were able to perform scene classification with 72% accuracy (random guessing would give 4%) and scene ranking with average rank in the top 4% (random guessing would give 50%). The key ingredients underlying this high level of performance are (a) the use of the Shared Response Model (SRM) and its variant SRM-ICA (Chen et al.; Zhang et al.) to aggregate fMRI data from multiple subjects, both of which are shown to be superior to standard PCA in producing low-dimensional representations for the tasks in this paper; (b) a sentence embedding technique adapted from the natural language processing (NLP) literature (Arora et al., 2017) that produces semantic vector representations of the annotations; and (c) the use of previous timestep information in the featurization of the predictor data. These optimizations in how we featurize the fMRI data and text annotations provide a substantial improvement in classification performance, relative to standard approaches.
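
The scene ranking measure mentioned above can be sketched as follows, under stated assumptions: each scene has a text embedding of its annotation and an embedding predicted from fMRI data, candidates are compared by cosine similarity, and the rank of the true annotation is expressed as a fraction of the number of candidates. The function names and toy vectors are hypothetical, and the exact metric definition used in the paper may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_rank_fraction(predicted, annotations):
    """Average rank of the true annotation, as a fraction of all candidates.

    predicted[i] is the embedding decoded from fMRI for scene i, and
    annotations[i] is the embedding of that scene's text annotation.
    """
    n = len(annotations)
    total = 0.0
    for i, pred in enumerate(predicted):
        sims = [cosine(pred, a) for a in annotations]
        # Rank 1 means the true annotation is the most similar candidate.
        rank = 1 + sum(1 for s in sims if s > sims[i])
        total += rank / n
    return total / len(predicted)


# Toy embeddings for four scenes (illustrative values only).
annotations = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
predicted = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9], [0.9, 0.2, 0.1]]

print(mean_rank_fraction(predicted, annotations))
```

Lower values are better under this definition, and a decoder that carries no information about the scene would place the true annotation around the middle of the list on average.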

Concepts: Linguistics, Language, Semantics, Annotation, Map, Ranking, Natural language processing, Natural language


Currently available sequencing technologies enable quick and economical sequencing of many new eukaryotic parasite (apicomplexan or kinetoplastid) species or strains. Compared to SNP calling approaches, de novo assembly of these genomes enables researchers to additionally determine insertion, deletion and recombination events as well as to detect complex sequence diversity, such as that seen in variable multigene families. However, there currently are no automated eukaryotic annotation pipelines offering the required range of results to facilitate such analyses. A suitable pipeline needs to perform evidence-supported gene finding as well as functional annotation and pseudogene detection up to the generation of output ready to be submitted to a public database. Moreover, no current tool includes quick yet informative comparative analyses and a first pass visualization of both annotation and analysis results. To address these needs we have developed the Companion web server, which provides parasite genome annotation as a service using a reference-based approach. We demonstrate the use and performance of Companion by annotating two Leishmania and Plasmodium genomes as typical parasite cases and evaluate the results compared to manually annotated references.

Concepts: DNA, Gene, Genetics, Bacteria, Chromosome, Annotation, Annotated bibliography, Footnote


Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype-genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas.

Concepts: Gene, Genome, Computer program, Annotation, IMAGE, Source code, SQL, Marginalia


MOTIVATION: Advancing the search, publication, and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required. RESULTS: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains, and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, data sets and publications within bioinformatics. EDAM applies to organising and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations. AVAILABILITY: The latest stable version of EDAM is available in OWL and OBO formats. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. This article describes version 1.2.

Concepts: Bioinformatics, Semantics, Computer program, Annotation, Availability, Reference, Workflow, Format


The annotation of small molecules is one of the most challenging and important steps in untargeted mass spectrometry analysis, as most of our biological interpretations rely on structural annotations. Molecular networking has emerged as a structured way to organize and mine data from untargeted tandem mass spectrometry (MS/MS) experiments and has been widely applied to propagate annotations. However, propagation is done through manual inspection of MS/MS spectra connected in the spectral networks and is only possible when a reference library spectrum is available. One of the alternative approaches used to annotate an unknown fragmentation mass spectrum is through the use of in silico predictions. One of the challenges of in silico annotation is the uncertainty around the correct structure among the predicted candidate lists. Here we show how molecular networking can be used to improve the accuracy of in silico predictions through propagation of structural annotations, even when there is no match to an MS/MS spectrum in spectral libraries. This is accomplished through creating a network consensus of re-ranked structural candidates using the molecular network topology and structural similarity to improve in silico annotations. The Network Annotation Propagation (NAP) tool is accessible through the GNPS web-platform.
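
The idea of re-ranking candidates by network consensus can be sketched in a deliberately simplified form. The example below is not the NAP scoring scheme: it assumes each network node carries a list of candidate structures with an in silico score, uses Jaccard similarity over hypothetical feature sets as a stand-in for structural similarity, and blends the two with an arbitrary `weight`. All names and values are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (stand-in for structural similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(node, candidates, neighbors, features, weight=0.5):
    """Re-rank a node's candidates using the top candidates of its neighbors.

    candidates[n] is a list of (structure, in_silico_score) pairs, assumed
    to be sorted with the best-scoring candidate first.
    """
    neighbor_tops = [candidates[n][0][0] for n in neighbors.get(node, [])
                     if candidates.get(n)]
    rescored = []
    for structure, score in candidates[node]:
        if neighbor_tops:
            # Mean similarity to what the neighboring nodes are believed to be.
            support = sum(jaccard(features[structure], features[t])
                          for t in neighbor_tops) / len(neighbor_tops)
        else:
            support = 0.0
        rescored.append((structure, (1 - weight) * score + weight * support))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical structures described by coarse feature sets.
features = {
    "flavone_A": {"ring_a", "ring_b", "ring_c", "oh"},
    "flavone_B": {"ring_a", "ring_b", "ring_c", "ome"},
    "alkaloid_X": {"n_ring", "ome"},
}
candidates = {
    "node1": [("alkaloid_X", 0.55), ("flavone_A", 0.50)],
    "node2": [("flavone_B", 0.90)],
}
neighbors = {"node1": ["node2"], "node2": ["node1"]}

print(rerank("node1", candidates, neighbors, features))
```

In this toy network, the candidate that resembles the neighbor's best structure overtakes the one with the slightly higher in silico score, which is the intuition behind using network topology to support annotation.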

Concepts: Mass spectrometry, Structure, Annotation, Computer network, Tandem mass spectrometry, Fourier transform ion cyclotron resonance, Reference, Footnote