SciCombinator

Discover the most talked about and latest scientific content & concepts.

Journal: Database : the journal of biological databases and curation

171

The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical-gene, chemical-disease and gene-disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009-2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. Database URL: http://ctdbase.org/

Concepts: Journal, Curator, Chemical industry, Comparative Toxicogenomics Database, Chemical substance, Health, Toxicogenomics, Toxicology

171

Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. Database URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/

Concepts: Existence, Biology, Bioinformatics, Data mining, Writing, Controlled vocabulary, Ontology

165

High-quality manual annotation methods and practices need to be scaled to the increased rate of genomic data production. Curation based on gene families and gene networks is one approach that can significantly increase both curation efficiency and quality. The Sol Genomics Network (SGN; http://solgenomics.net) is a comparative genomics platform, with genetic, genomic and phenotypic information of the Solanaceae family and its closely related species that incorporates a community-based gene and phenotype curation system. In this article, we describe a manual curation system for gene families aimed at facilitating curation, querying and visualization of gene interaction patterns underlying complex biological processes, including an interface for efficiently capturing information from experiments with large data sets reported in the literature. Well-annotated multigene families are useful for further exploration of genome organization and gene evolution across species. As an example, we illustrate the system with the multigene transcription factor families, WRKY and Small Auxin Up-regulated RNA (SAUR), which both play important roles in responding to abiotic stresses in plants. Database URL: http://solgenomics.net/

Concepts: Species, Biology, Genomics, DNA, Evolution, Genome, Genetics, Gene

165

The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/

Concepts: Genome, Model organism, Gene expression, Evolution, Biology, Genetics, Gene, DNA

149

Relations between chemicals and diseases are one of the most queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, and therefore alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make binary judgments about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained a 0.505F-score (0.475 precision, 0.540 recall), with a maximum theoretical recall of 0.751 due to errors with named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract) and took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available athttps://github.com/SuLab/crowd_cid_relexDatabase URL:https://github.com/SuLab/crowd_cid_relex.

Concepts: Abstract, Regression analysis, Cost, Message Understanding Conference, Crowdsourcing, Scientific method, Error, Named entity recognition

37

Open biological data are distributed over many resources making them challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. In order to improve the state of biological data, facilitate data management and dissemination, we imported all human and mouse genes, and all human and mouse proteins into Wikidata. In total, 59 721 human genes and 73 355 mouse genes have been imported from NCBI and 27 306 human proteins and 16 728 mouse proteins have been imported from the Swissprot subset of UniProt. As Wikidata is open and can be edited by anybody, our corpus of imported data serves as the starting point for integration of further data by scientists, the Wikidata community and citizen scientists alike. The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata with the data integrated above. This enables immediate updates of the Gene Wiki infoboxes as soon as the data in Wikidata are modified. Although Gene Wiki pages are currently only on the English language version of Wikipedia, the multilingual nature of Wikidata allows for usage of the data we imported in all 280 different language Wikipedias. Apart from the Gene Wiki infobox use case, a SPARQL endpoint and exporting functionality to several standard formats (e.g. JSON, XML) enable use of the data by scientists.In summary, we created a fully open and extensible data resource for human and mouse molecular biology and biochemistry data. This resource enriches all the Wikipedias with structured information and serves as a new linking hub for the biological semantic web.Database URL: https://www.wikidata.org/.

Concepts: Human genome, Data, English Wikipedia, Wikipedia, DNA, Gene, Molecular biology, Biology

22

Improving the prediction of chemical toxicity is a goal common to both environmental health research and pharmaceutical drug development. To improve safety detection assays, it is critical to have a reference set of molecules with well-defined toxicity annotations for training and validation purposes. Here, we describe a collaboration between safety researchers at Pfizer and the research team at the Comparative Toxicogenomics Database (CTD) to text mine and manually review a collection of 88 629 articles relating over 1 200 pharmaceutical drugs to their potential involvement in cardiovascular, neurological, renal and hepatic toxicity. In 1 year, CTD biocurators curated 2 54 173 toxicogenomic interactions (1 52 173 chemical-disease, 58 572 chemical-gene, 5 345 gene-disease and 38 083 phenotype interactions). All chemical-gene-disease interactions are fully integrated with public CTD, and phenotype interactions can be downloaded. We describe Pfizer’s text-mining process to collate the articles, and CTD’s curation strategy, performance metrics, enhanced data content and new module to curate phenotype information. As well, we show how data integration can connect phenotypes to diseases. This curation can be leveraged for information about toxic endpoints important to drug safety and help develop testable hypotheses for drug-disease events. The availability of these detailed, contextualized, high-quality annotations curated from seven decades' worth of the scientific literature should help facilitate new mechanistic screening assays for pharmaceutical compound survival. This unique partnership demonstrates the importance of resource sharing and collaboration between public and private entities and underscores the complementary needs of the environmental health science and pharmaceutical communities. Database URL: http://ctdbase.org/

Concepts: Toxicity, Science, Pharmaceutical drug, Scientific method, Medicine, Drug, Pharmacy, Pharmacology

16

The last 20 years of advancement in sequencing technologies have led to sequencing thousands of microbial genomes, creating mountains of genetic data. While efficiency in generating the data improves almost daily, applying meaningful relationships between taxonomic and genetic entities on this scale requires a structured and integrative approach. Currently, knowledge is distributed across a fragmented landscape of resources from government-funded institutions such as National Center for Biotechnology Information (NCBI) and UniProt to topic-focused databases like the ODB3 database of prokaryotic operons, to the supplemental table of a primary publication. A major drawback to large scale, expert-curated databases is the expense of maintaining and extending them over time. No entity apart from a major institution with stable long-term funding can consider this, and their scope is limited considering the magnitude of microbial data being generated daily. Wikidata is an openly editable, semantic web compatible framework for knowledge representation. It is a project of the Wikimedia Foundation and offers knowledge integration capabilities ideally suited to the challenge of representing the exploding body of information about microbial genomics. We are developing a microbial specific data model, based on Wikidata’s semantic web compatibility, which represents bacterial species, strains and the gene and gene products that define them. Currently, we have loaded 43 694 gene and 37 966 protein items for 21 species of bacteria, including the human pathogenic bacteriaChlamydia trachomatis.Using this pathogen as an example, we explore complex interactions between the pathogen, its host, associated genes, other microbes, disease and drugs using the Wikidata SPARQL endpoint. In our next phase of development, we will add another 99 bacterial genomes and their gene and gene products, totaling ∼900,000 additional entities. This aggregation of knowledge will be a platform for community-driven collaboration, allowing the networking of microbial genetic data through the sharing of knowledge by both the data and domain expert.

Concepts: Horizontal gene transfer, Genetics, DNA, Archaea, Genome, Organism, Gene, Bacteria

11

A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains ‘all of the information’, and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.

Concepts: Genetic disorder, Evolution, DNA, Natural selection, Cancer, Genetics, Mutation

10

Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable-those that have the crucial ability to share information, enabling smooth integration and reusability.

Concepts: Enabling act, Content format, Debut albums, The Reader, Enabling, Life, Research, Workflow