SciCombinator

Concept: Resource Description Framework

173

MOTIVATION: Since 2011, The Cancer Genome Atlas' (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months. RESULTS: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals. AVAILABILITY: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial. CONTACT: robbinsd@uab.edu.
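
The idea of indexing file metadata as RDF-style statements and then selecting arbitrary subsets of files by their metadata can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' engine (which uses JavaScript and SPARQL): the file paths, the `disease` and `platform` property names, and the helper functions are all hypothetical, and the pattern matcher is only a toy stand-in for a SPARQL basic graph pattern.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical records: paths, diseases, and platforms are invented for illustration.
RECORDS: List[Dict[str, str]] = [
    {"path": "/tumor/gbm/platform_a/f1.txt", "disease": "gbm", "platform": "platform_a"},
    {"path": "/tumor/gbm/platform_b/f2.txt", "disease": "gbm", "platform": "platform_b"},
    {"path": "/tumor/ov/platform_a/f3.txt", "disease": "ov", "platform": "platform_a"},
]


def to_triples(record: Dict[str, str]) -> List[Triple]:
    """Turn one file record into (subject, predicate, object) statements."""
    subject = record["path"]
    return [(subject, key, value) for key, value in record.items() if key != "path"]


def match(store: Iterable[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that agree with every non-None position of the pattern."""
    for triple in store:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def files_with(store: List[Triple], **metadata: str) -> List[str]:
    """Return the files (subjects) that carry every requested property value."""
    subjects = None
    for predicate, value in metadata.items():
        found = {t[0] for t in match(store, p=predicate, o=value)}
        subjects = found if subjects is None else subjects & found
    return sorted(subjects or [])


# The "index": one flat list of statements about every file.
STORE: List[Triple] = [t for record in RECORDS for t in to_triples(record)]
```

Under these assumptions, `files_with(STORE, disease="gbm", platform="platform_a")` selects the single matching file, which mirrors how a metadata query against the index identifies the exact source files behind a reported result.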

Concepts: The Cancer Genome Atlas, Data, Web 2.0, Reproducibility, World Wide Web Consortium, Semantic Web, Resource Description Framework, World Wide Web

28

Objectives: The International Classification of Diseases and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM) ontology is a knowledge base created from the Thai modification of the World Health Organization International Classification of Diseases and Related Health Problems, 10th Revision. The objectives of this research were to develop the ICD-10-TM ontology as a knowledge base for use in a semi-automated ICD coding system and to test the usability of this system. Methods: ICD concepts and relations were identified from a tabular list and alphabetical indexes. An ICD-10-TM ontology was defined in the Resource Description Framework (RDF), Notation3 (N3) format. All ICD-10-TM contents available as Microsoft Word documents were transformed into N3 format using Python scripts. Final RDF files were validated by ICD experts. The ontology was implemented as a knowledge base by using a novel semi-automated ICD coding system. Evaluation of usability was performed by a survey of forty volunteer users. Results: The ICD-10-TM ontology consists of two main knowledge bases (a tabular list knowledge base and an index knowledge base) containing a total of 309,985 concepts and 162,092 relations. The tabular list knowledge base can be divided into an upper level ontology, which defines hierarchical relationships between 22 ICD chapters, and a lower level ontology, which defines relations between chapters, blocks, categories, rubrics and basic elements (include, exclude, synonym etc.) of the ICD tabular list. The index knowledge base describes relations between keywords, modifiers in general format and a table format of the ICD index. In this research, the creation of an ICD index ontology revealed interesting findings on problems with the current ICD index structure. One problem with the current structure is that it defines conditions that complicate pregnancy and perinatal conditions on the same hierarchical level as organ system diseases. This could mislead a coding algorithm into a wrong selection of ICD code. To prevent these coding errors by an algorithm, the ICD-10-TM index structure was modified by raising conditions complicating pregnancy and perinatal conditions to a higher hierarchical level of the index knowledge base. The modified ICD-10-TM ontology was implemented as a knowledge base in semi-automated ICD-10-TM coding software. A survey of users of the software revealed a high percentage of correct results obtained from ontology searches (>95%) and user satisfaction with the usability of the ontology. Conclusion: The ICD-10-TM ontology is the first ICD-10 ontology with a comprehensive description of all concepts and relations in an ICD-10-TM tabular list and alphabetical index. A researcher developing an automated ICD coding system should be aware of the ICD index structure and the complexity of coding processes. ICD coding is not a simple word-matching process. An ICD-10 ontology should be used as a knowledge base in ICD coding software. It can be used to facilitate successful implementation of ICD in developing countries, especially in those countries which do not have an adequate number of competent ICD coders.
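
The conversion step described in the Methods (tabular-list entries transformed into N3 by Python scripts) might look roughly like the sketch below. It is an illustration only: the `ex:` namespace, the `ex:childOf` property, the `Entry` type, and the class names are assumptions and are not taken from the ICD-10-TM ontology, and extraction of entries from the Word documents is left out.

```python
from typing import List, NamedTuple, Optional

# Hypothetical namespace; the real ontology's vocabulary is not given in the abstract.
PREFIXES = (
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ex: <http://example.org/icd10tm#> .\n"
)


class Entry(NamedTuple):
    code: str              # for example "A00" or "A00-A09"
    label: str             # human-readable title
    kind: str              # for example "Chapter", "Block", "Category"
    parent: Optional[str]  # code of the enclosing entry, if any


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a quoted N3 string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_n3(entry: Entry) -> str:
    """Serialize one entry as an N3 statement group ending in ' .'."""
    lines = [f"ex:{entry.code} a ex:{entry.kind} ;",
             f'    rdfs:label "{escape(entry.label)}"']
    if entry.parent:
        lines[-1] += " ;"
        lines.append(f"    ex:childOf ex:{entry.parent}")
    lines[-1] += " ."
    return "\n".join(lines)


def document(entries: List[Entry]) -> str:
    """Join the prefix declarations and all entries into one N3 document."""
    return PREFIXES + "\n" + "\n\n".join(to_n3(e) for e in entries) + "\n"


ENTRIES = [
    Entry("A00-A09", "Intestinal infectious diseases", "Block", None),
    Entry("A00", "Cholera", "Category", "A00-A09"),
]
```

Calling `document(ENTRIES)` yields a small N3 file in which the category points to its block. A hierarchy expressed this way is what allows a coding algorithm to walk relations instead of matching words.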

Concepts: Word processor, Array data structure, Resource Description Framework, Microsoft Word, Index, International Statistical Classification of Diseases and Related Health Problems, ICD-10, World Health Organization

27

OBJECTIVE: There is a growing realisation that clinical pathways (CPs) are vital for improving the treatment quality of healthcare organisations. However, treatment personalisation is one of the main challenges when implementing CPs, and the inadequate dynamic adaptability restricts the practicality of CPs. The purpose of this study is to improve the practicality of CPs using semantic interoperability between knowledge-based CPs and semantic electronic health records (EHRs). METHODS: The SPARQL Protocol and RDF Query Language (SPARQL) is used to gather patient information from semantic EHRs. The gathered patient information is entered into the CP ontology, which is represented in the Web Ontology Language (OWL). Then, after reasoning over rules described in the Semantic Web Rule Language (SWRL) within the Jena semantic framework, we adjust the standardised CPs to meet different patients' practical needs. RESULTS: A CP for acute appendicitis is used as an example to illustrate how to achieve CP customisation based on the semantic interoperability between knowledge-based CPs and semantic EHRs. A personalised care plan is generated by comprehensively analysing the patient’s personal allergy history and past medical history, which are stored in semantic EHRs. Additionally, by monitoring the patient’s clinical information, an exception is recorded and handled during CP execution. According to execution results of the actual example, the solutions we present are shown to be technically feasible. CONCLUSION: This study contributes towards improving the clinical personalised practicality of standardised CPs. In addition, this study establishes the foundation for future work on the research and development of an independent CP system.
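
The pipeline described in the Methods (gather patient facts, apply rules, adjust the standard pathway) can be illustrated with a deliberately simple stand-in. The sketch below uses plain Python in place of SPARQL, OWL, SWRL and Jena, and it is not the authors' rule set: the patient, the drug names, the pathway steps and the `personalise` function are all hypothetical placeholders.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Step = Tuple[str, Optional[str]]  # (step name, drug used by the step or None)

# Hypothetical facts, as if returned by a query over a semantic EHR.
PATIENT_FACTS: Set[Triple] = {
    ("patient1", "allergicTo", "drug_a"),
    ("patient1", "hasHistoryOf", "condition_x"),
}

# Hypothetical standardised pathway and substitution table.
STANDARD_PATHWAY: List[Step] = [
    ("admission", None),
    ("prophylaxis", "drug_a"),
    ("surgery", None),
]
ALTERNATIVES: Dict[str, str] = {"drug_a": "drug_b"}


def personalise(patient: str, facts: Set[Triple], pathway: List[Step],
                alternatives: Dict[str, str]) -> Tuple[List[Step], List[str]]:
    """Apply one rule: a step must not use a drug the patient is allergic to.

    Returns the adjusted plan and the names of steps that need manual review
    because no safe substitute is known.
    """
    allergies = {o for s, p, o in facts if s == patient and p == "allergicTo"}
    plan: List[Step] = []
    exceptions: List[str] = []
    for step, drug in pathway:
        if drug is not None and drug in allergies:
            substitute = alternatives.get(drug)
            if substitute is None or substitute in allergies:
                exceptions.append(step)       # record the exception for handling
                plan.append((step, None))
            else:
                plan.append((step, substitute))
        else:
            plan.append((step, drug))
    return plan, exceptions
```

With these placeholder facts, `personalise("patient1", PATIENT_FACTS, STANDARD_PATHWAY, ALTERNATIVES)` swaps `drug_a` for `drug_b` in the prophylaxis step and reports no exceptions. In a rule engine the same logic would be declared as rules over the ontology rather than coded as a loop.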

Concepts: Medical history, Electronic health record, Swoogle, Interoperability, Ontology, Resource Description Framework, Web Ontology Language, Semantic Web

27

Personal Health Record systems (PHRs) provide opportunities for patients to access their own PHR. However, PHRs are dense with medical terminology, such as disease and symptom names. Patients need readily understandable and useful health knowledge in addition to their records in order to enhance their self-care ability. This study describes a Personal Health Record and Health Knowledge Sharing System (PHR&HKS) whereby users not only can maintain and import their PHR, but also can collate useful health Web resources that are related to their personal diseases. Furthermore, they can share the collated Web resources with any user with the same diseases and vice versa. To fulfill these objectives, IHE Cross-Enterprise Document Sharing (XDS) architecture was adopted to share and integrate the PHR. A registry ontology, consisting of part of the XDS document metadata attributes, the ICD-9-CM code, and part of the Dublin Core Metadata Element Set (DCMES), was created to enhance the health knowledge collating and sharing functions. The system was then tested and evaluated by 30 users. Among these individuals, 24 (81 %) held positive views on the ease of use and usefulness of the system while the remainder, who held either neutral (14 %) or negative (5 %) attitudes, were identified as individuals who were somewhat unwilling to maintain any PHR or share any information with others.

Concepts: Medical informatics, Electronic health record, Collation, Controlled vocabulary, Resource Description Framework, Personal health record, Metadata, Dublin Core

24

Recently, synthetic biologists have developed the Synthetic Biology Open Language (SBOL), a data exchange standard for descriptions of genetic parts, devices, modules, and systems. The goals of this standard are to allow scientists to exchange designs of biological parts and systems, to facilitate the storage of genetic designs in repositories, and to facilitate the description of genetic designs in publications. In order to achieve these goals, the development of an infrastructure to store, retrieve, and exchange SBOL data is necessary. To address this problem, we have developed the SBOL Stack, a Resource Description Framework (RDF) database specifically designed for the storage, integration, and publication of SBOL data. This database allows users to define a library of synthetic parts and designs as a service, to share SBOL data with collaborators, and to store designs of biological systems locally. The database also allows external data sources to be integrated by mapping them to the SBOL data model. The SBOL Stack includes two Web interfaces: the SBOL Stack API and SynBioHub. While the former is designed for developers, the latter allows users to upload new SBOL biological designs, download SBOL documents, search by keyword, and visualize SBOL data. Since the SBOL Stack is based on semantic Web technology, the inherent distributed querying functionality of RDF databases can be used to allow different SBOL stack databases to be queried simultaneously, and therefore, data can be shared between different institutes, centers, or other users.

Concepts: DNA, Genetics, Gene, Synthetic biology, Registry of Standard Biological Parts, Biology, Resource Description Framework, Semantic Web

15

Background: Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis. Results: This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferenceable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying. Conclusions: We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.
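
Linking a record to related entries in other resources by URI can be shown with a short sketch. The example below writes and reads N-Triples lines in plain Python. It assumes `skos:exactMatch` as the linking property and uses invented `example.org`, `example.com` and `example.net` IRIs; neither is taken from ChEMBL-RDF, which the abstract says relies on ontologies such as CHEMINF and CiTO. The parser handles only lines made of three IRIs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlparse

SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical cross-references; none of these IRIs is a real ChEMBL-RDF identifier.
LINKS: List[Tuple[str, str, str]] = [
    ("http://example.org/chembl/molecule/M1", SKOS_EXACT_MATCH,
     "http://example.com/chemspider/123"),
    ("http://example.org/chembl/molecule/M1", SKOS_EXACT_MATCH,
     "http://example.net/bio2rdf/drug/9"),
]


def to_n_triple(s: str, p: str, o: str) -> str:
    """Serialize three IRIs as one N-Triples line."""
    return f"<{s}> <{p}> <{o}> ."


def parse_n_triple(line: str) -> Tuple[str, str, str]:
    """Parse a line made of exactly three IRIs (no literals or blank nodes)."""
    body = line.strip()
    if not body.endswith("."):
        raise ValueError("missing terminating '.'")
    terms = body[:-1].split()
    if len(terms) != 3 or not all(t.startswith("<") and t.endswith(">") for t in terms):
        raise ValueError("expected three IRIs in angle brackets")
    s, p, o = (t[1:-1] for t in terms)
    return s, p, o


def links_by_host(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the targets of exactMatch links by the host that serves them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        _, predicate, target = parse_n_triple(line)
        if predicate == SKOS_EXACT_MATCH:
            grouped[urlparse(target).netloc].append(target)
    return dict(grouped)


DOCUMENT: List[str] = [to_n_triple(*link) for link in LINKS]
```

Because every target is itself a URI, a client can group or follow the links without any resource-specific parsing, which is the property that makes integration across databases straightforward.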

Concepts: Uniform Resource Locator, Linked Data, World Wide Web, Uniform Resource Identifier, Resource Description Framework, Semantic Web

8

DisGeNET is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation. DisGeNET data can also be analysed via the DisGeNET Cytoscape plugin, and enriched with the annotations of other plugins of this popular network analysis software suite. Finally, the information contained in DisGeNET can be expanded and complemented using Semantic Web technologies and linked to a variety of resources already present in the Linked Data cloud. Hence, DisGeNET offers one of the most comprehensive collections of human gene-disease associations and a valuable set of tools for investigating the molecular mechanisms underlying diseases of genetic origin, designed to fulfill the needs of different user profiles, including bioinformaticians, biologists and health-care practitioners. Database URL: http://www.disgenet.org/.

Concepts: Web browser, Disease, Uniform Resource Locator, Web 2.0, Resource Description Framework, Uniform Resource Identifier, World Wide Web, Semantic Web

5

ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 Nucleic Acids Research Database Issue. Since then, a variety of new data sources and improvements in functionality have contributed to the growth and utility of the resource. In particular, more comprehensive tracking of compounds from research stages through clinical development to market is provided through the inclusion of data from United States Adopted Name applications; a new richer data model for representing drug targets has been developed; and a number of methods have been put in place to allow users to more easily identify reliable data. Finally, access to ChEMBL is now available via a new Resource Description Framework format, in addition to the web-based interface, data downloads and web services.

Concepts: Semantic Web, Web application, Nucleic acid, United States, Database, Pharmacology, PHP, Resource Description Framework

4

The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.

Concepts: Organism, Order, DNA, Life, Species, Biology, Resource Description Framework, Semantic Web

3

PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public repository for information on chemical substances and their biological activities, launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the US National Institutes of Health (NIH). Over the past 11 years, PubChem has grown into a sizable system, serving as a chemical information resource for the scientific research community. PubChem consists of three inter-linked databases: Substance, Compound and BioAssay. The Substance database contains chemical information deposited by individual data contributors to PubChem, and the Compound database stores unique chemical structures extracted from the Substance database. Biological activity data of chemical substances tested in assay experiments are contained in the BioAssay database. This paper provides an overview of the PubChem Substance and Compound databases, including data sources and contents, data organization, data submission using PubChem Upload, chemical structure standardization, web-based interfaces for textual and non-textual searches, and programmatic access. It also gives a brief description of PubChem3D, a resource derived from theoretical three-dimensional structures of compounds in PubChem, as well as PubChemRDF, Resource Description Framework (RDF)-formatted PubChem data for data sharing, analysis and integration with information contained in other databases.

Concepts: Chemical database, PHP, Resource Description Framework, Molecule, Chemical substance, Chemistry, Chemical compound, Scientific method