Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database
- Database: The Journal of Biological Databases and Curation
- Published almost 7 years ago
The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding of the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical-gene, chemical-disease and gene-disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach to curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009-2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52,752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation of select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases.
Database URL: http://ctdbase.org/
WikiPathways (http://www.wikipathways.org) is an open, collaborative platform for capturing and disseminating models of biological pathways for data visualization and analysis. Since our last NAR update 4 years ago, WikiPathways has experienced massive growth in content, which continues to be contributed by hundreds of individuals each year. New aspects of the diversity and depth of the collected pathways are described from the perspective of researchers interested in using pathway information in their studies. We provide updates on extensions and services to support pathway analysis and visualization via popular standalone tools, namely PathVisio and Cytoscape, web applications and common programming environments. We introduce the Quick Edit feature for pathway authors and curators, in addition to new means of publishing pathways and maintaining custom pathway collections to serve specific research topics and communities. In addition to the latest milestones in our pathway collection and curation effort, we also highlight the latest means to access the content as publishable figures, as standard data files, and as linked data, including bulk and programmatic access.
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon, an important outcome given that >98% of all annotations are inferred without direct curation.
Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. Because the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial: it preserves openness, allows editors to receive credit for their work, and makes errors introduced during curation easier to correct.
Whole-genome knockout collections are invaluable for connecting gene sequence to function, yet traditionally, their construction has required an extraordinary technical effort. Here we report a method for the construction and purification of a curated whole-genome collection of single-gene transposon disruption mutants termed Knockout Sudoku. Using simple combinatorial pooling, a highly oversampled collection of mutants is condensed into a next-generation sequencing library in a single day, a 30- to 100-fold improvement over prior methods. The identities of the mutants in the collection are then solved by a probabilistic algorithm that uses internal self-consistency within the sequencing data set, followed by rapid algorithmically guided condensation to a minimal representative set of mutants, validation, and curation. Starting from a progenitor collection of 39,918 mutants, we compile a quality-controlled knockout collection of the electroactive microbe Shewanella oneidensis MR-1 containing representatives for 3,667 genes that is functionally validated by high-throughput kinetic measurements of quinone reduction.
The advancement of high-throughput sequencing (HTS) technologies and the rapid development of numerous analysis algorithms and pipelines in this field have resulted in an unprecedentedly high demand for training scientists in HTS data analysis. Embarking on developing new training materials is challenging for many reasons. Trainers often do not have prior experience in preparing or delivering such materials and struggle to keep them up to date. A repository of curated HTS training materials would support trainers in materials preparation, reduce the duplication of effort by increasing the usage of existing materials, and allow for the sharing of teaching experience among the HTS trainers' community. To achieve this, we have developed a strategy for the curation and dissemination of materials. Standards for describing training materials have been proposed and applied to the curation of existing materials. A Git repository has been set up for sharing annotated materials that can now be reused, modified, or incorporated into new courses. Because the repository uses Git, it is decentralized, self-managed by the community, and can be forked and built upon by all users. The repository is accessible at http://bioinformatics.upsc.se/htmr.
The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb, http://www.guidetopharmacology.org) provides expert-curated molecular interactions between successful and potential drugs and their targets in the human genome. Developed by the International Union of Basic and Clinical Pharmacology (IUPHAR) and the British Pharmacological Society (BPS), this resource, and its earlier incarnation as IUPHAR-DB, is described in our 2014 publication. This update incorporates changes over the intervening seven database releases. The unique model of content capture is based on established and new target class subcommittees collaborating with in-house curators. Most information comes from journal articles, but we now also index kinase cross-screening panels. Targets are specified by UniProtKB IDs. Small molecules are defined by PubChem Compound Identifiers (CIDs); ligand capture also includes peptides and clinical antibodies. We have extended the capture of ligands and targets linked via published quantitative binding data (e.g. Ki, IC50 or Kd). The resulting pharmacological relationship network now defines a data-supported druggable genome encompassing 7% of human proteins. The database also provides an expanded substrate for the biennially published compendium, the Concise Guide to PHARMACOLOGY. This article covers content increase, entity analysis, revised curation strategies, new website features and expanded download options.
We describe a field and laboratory workflow developed for plant phylotranscriptomic projects that involves cryogenic tissue collection in the field, RNA extraction and quality control, and library preparation. We also make recommendations for sample curation.
- Database: The Journal of Biological Databases and Curation
- Published almost 3 years ago
Can we use programs for automated or semi-automated information extraction from scientific texts as practical alternatives to professional curation? I show that error rates of current information extraction programs (IEPs) are too high to replace professional curation today. Furthermore, current IEPs extract single narrow slivers of information, such as individual protein interactions; they cannot extract the large breadth of information extracted by professional curators for databases such as EcoCyc. They also cannot arbitrate among conflicting statements in the literature as curators can. Therefore, funding agencies should not hobble the curation efforts of existing databases on the assumption that a problem that has stymied Artificial Intelligence researchers for more than 60 years will be solved tomorrow. Semi-automated extraction techniques appear to have significantly more potential based on a review of recent tools that enhance curator productivity. But a full cost-benefit analysis for these tools is lacking. Without such an analysis, it is possible to expend significant effort developing information-extraction tools that automate small parts of the overall curation workflow without achieving a significant decrease in curation costs.
- Database: The Journal of Biological Databases and Curation
- Published over 3 years ago
The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. neXtA5 is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE along several predefined axes. This report focuses on three axes: Diseases, and the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named entities as stored in the BioMed database. The two ranked lists (search-engine and axis-specific) are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Processes, when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular to protein-protein interactions, which require specific relationship extraction capabilities.
In parallel, user-friendly interfaces powered by a set of JSON web services are currently being implemented into the neXtProt annotation pipeline. Available at: http://babar.unige.ch:8082/neXtA5. Database URL: http://babar.unige.ch:8082/neXtA5/fetcher.jsp.