SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: Data mining

171

Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. Database URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/

Concepts: Bioinformatics, Biology, Ontology, Controlled vocabulary, Writing, Data mining, Existence

170

The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
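The idea of tolerating errors during graph matching can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm: it assumes node labels must agree exactly and counts only missing pattern edges, whereas the paper's own error measure is not described in the abstract. The function name, the labels (`TRIGGER`, `PROTEIN`, `prep_of`, `agent`) and the toy graphs are hypothetical.

```python
from itertools import permutations


def approx_subgraph_match(pattern_nodes, pattern_edges,
                          graph_nodes, graph_edges, max_missing=1):
    """Brute-force approximate matching of a small pattern graph.

    pattern_nodes / graph_nodes: dict mapping node id -> node label.
    pattern_edges / graph_edges: set of (head, edge_label, dependent).
    Returns (mapping, missing_edge_count) for the injective mapping with
    the fewest unmatched pattern edges, or None if every mapping leaves
    more than max_missing pattern edges unmatched.
    """
    pattern_ids = list(pattern_nodes)
    best = None
    for candidate in permutations(graph_nodes, len(pattern_ids)):
        mapping = dict(zip(pattern_ids, candidate))
        # Assumption of this sketch: node labels must agree exactly.
        if any(pattern_nodes[p] != graph_nodes[g] for p, g in mapping.items()):
            continue
        # Error tolerance: count pattern edges absent from the sentence graph.
        missing = sum(
            1 for head, label, dep in pattern_edges
            if (mapping[head], label, mapping[dep]) not in graph_edges
        )
        if missing <= max_missing and (best is None or missing < best[1]):
            best = (mapping, missing)
    return best


# Hypothetical pattern: a trigger word with a theme and a cause protein.
pattern_nodes = {"trig": "TRIGGER", "theme": "PROTEIN", "cause": "PROTEIN"}
pattern_edges = {("trig", "prep_of", "theme"), ("trig", "agent", "cause")}

# Hypothetical sentence graph in which the cause is attached indirectly.
sentence_nodes = {1: "TRIGGER", 2: "PROTEIN", 3: "PROTEIN", 4: "OTHER"}
sentence_edges = {(1, "prep_of", 2), (1, "dep", 4), (4, "agent", 3)}

exact = approx_subgraph_match(pattern_nodes, pattern_edges,
                              sentence_nodes, sentence_edges, max_missing=0)
tolerant = approx_subgraph_match(pattern_nodes, pattern_edges,
                                 sentence_nodes, sentence_edges, max_missing=1)
```

With `max_missing=0` the toy sentence yields no match, while allowing one missing edge recovers the trigger, theme and cause assignment. The brute-force search over permutations is only practical for very small patterns.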

Concepts: DNA, Gene, Bioinformatics, Molecular biology, Data mining, Natural language processing, Graph rewriting, Subgraph isomorphism problem

170

BACKGROUND: Experimental datasets are becoming larger and increasingly complex, spanning different data domains, thereby expanding the requirements for respective tool support for their analysis. Networks provide a basis for the integration, analysis and visualization of multi-omics experimental datasets. RESULTS: Here we present VANTED (version 2), a framework for systems biology applications, which comprises a comprehensive set of seven main tasks. These range from network reconstruction, data visualization, integration of various data types, network simulation to data exploration combined with a manifold support of systems biology standards for visualization and data exchange. The offered set of functionalities is instantiated by combining several tasks in order to enable users to view and explore a comprehensive dataset from different perspectives. We describe the system as well as an exemplary workflow. CONCLUSIONS: VANTED is a stand-alone framework which supports scientists during the data analysis and interpretation phase. It is available as a Java open source tool from http://www.vanted.org.

Concepts: Statistics, Mathematics, Data, Data set, Data analysis, Data mining, Open source, Real analysis

169

BACKGROUND: Hyperbilirubinemia is emerging as an increasingly common problem in newborns due to a decreasing hospital length of stay after birth. Jaundice is the most common disease of the newborn and, although benign in most cases, it can lead to severe neurological consequences if poorly evaluated. In different areas of medicine, data mining has contributed to improving the results obtained with other methodologies. Hence, the aim of this study was to improve the diagnosis of neonatal jaundice with the application of data mining techniques. METHODS: This study followed the different phases of the Cross Industry Standard Process for Data Mining model as its methodology. This observational study was performed at the Obstetrics Department of a central hospital (Centro Hospitalar Tamega e Sousa – EPE), from February to March of 2011. A total of 227 healthy newborn infants with 35 or more weeks of gestation were enrolled in the study. Over 70 variables were collected and analyzed. Also, transcutaneous bilirubin levels were measured from birth to hospital discharge with maximum time intervals of 8 hours between measurements, using a noninvasive bilirubinometer. Different attribute subsets were used to train and test classification models using algorithms included in Weka data mining software, such as decision trees (J48) and neural networks (multilayer perceptron). The accuracy results were compared with the traditional methods for prediction of hyperbilirubinemia. RESULTS: The application of different classification algorithms to the collected data allowed predicting subsequent hyperbilirubinemia with high accuracy. In particular, at 24 hours of life of newborns, the accuracy for the prediction of hyperbilirubinemia was 89%. The best results were obtained using the following algorithms: naive Bayes, multilayer perceptron and simple logistic. CONCLUSIONS: The findings of our study suggest that new approaches, such as data mining, may support medical decision-making and contribute to improved diagnosis of neonatal jaundice.

Concepts: Pregnancy, Childbirth, Infant, Fetus, Bilirubin, Data mining, Pediatrics, Neonatal jaundice

168

BACKGROUND: U-Compare is a text mining platform that allows the construction, evaluation and comparison of text mining workflows. U-Compare contains a large library of components that are tuned to the biomedical domain. Users can rapidly develop biomedical text mining workflows by mixing and matching U-Compare’s components. Workflows developed using U-Compare can be exported and sent to other users who, in turn, can import and re-use them. However, the resulting workflows are standalone applications, i.e., software tools that run and are accessible only via a local machine, and that can only be run with the U-Compare platform. RESULTS: We address the above issues by extending U-Compare to convert standalone workflows into web services automatically, via a two-click process. The resulting web services can be registered on a central server and made publicly available. Alternatively, users can make web services available on their own servers, after installing the web application framework, which is part of the extension to U-Compare. We have performed a user-oriented evaluation of the proposed extension, by asking users who have tested the enhanced functionality of U-Compare to complete questionnaires that assess its functionality, reliability, usability, efficiency and maintainability. The results obtained reveal that the new functionality is well received by users. CONCLUSIONS: The web services produced by U-Compare are built on top of open standards, i.e., REST and SOAP protocols, and therefore, they are decoupled from the underlying platform. Exported workflows can be integrated with any application that supports these open standards. We demonstrate how the newly extended U-Compare enhances the cross-platform interoperability of workflows, by seamlessly importing a number of text mining workflow web services exported from U-Compare into Taverna, i.e., a generic scientific workflow construction platform.

Concepts: Data mining, Web 2.0, Internet, Web application, Import, XML, Web application framework, Software framework

150

Our publication of the BitTorious portal [1] demonstrated the ability to create a privatized distributed data warehouse of sufficient magnitude for real-world bioinformatics studies using minimal changes to the standard BitTorrent tracker protocol. In this second phase, we release a new server-side specification to accept anonymous philanthropic storage donations by the general public, wherein a small portion of each user’s local disk may be used for archival of scientific data. We have implemented the server-side announcement and control portions of this BitTorrent extension in v3.0.0 of the BitTorious portal, upon which compatible clients may be built.

Concepts: Data mining, BitTorrent, BitTorrent tracker, UDP tracker

32

Seismology is experiencing rapid growth in the quantity of data, which has outpaced the development of processing algorithms. Earthquake detection, the identification of seismic events in continuous data, is a fundamental operation for observational seismology. We developed an efficient method to detect earthquakes using waveform similarity that overcomes the disadvantages of existing detection methods. Our method, called Fingerprint And Similarity Thresholding (FAST), can analyze a week of continuous seismic waveform data in less than 2 hours, or 140 times faster than autocorrelation. FAST adapts a data mining algorithm, originally designed to identify similar audio clips within large databases; it first creates compact “fingerprints” of waveforms by extracting key discriminative features, then groups similar fingerprints together within a database to facilitate fast, scalable search for similar fingerprint pairs, and finally generates a list of earthquake detections. FAST detected most (21 of 24) cataloged earthquakes and 68 uncataloged earthquakes in 1 week of continuous data from a station located near the Calaveras Fault in central California, achieving detection performance comparable to that of autocorrelation, with some additional false detections. FAST is expected to realize its full potential when applied to extremely long duration data sets over a distributed network of seismic stations. The widespread application of FAST has the potential to aid in the discovery of unexpected seismic signals, improve seismic monitoring, and promote a greater understanding of a variety of earthquake processes.
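The three stages named in the abstract (fingerprinting, grouping similar fingerprints, thresholding) can be illustrated with a minimal sketch. Several choices here are assumptions rather than details from the abstract: fingerprints are taken as the indices of the largest-magnitude samples in a window, grouping uses MinHash-style locality-sensitive hashing with banding, and similarity is the Jaccard index. All function names, parameters and toy windows are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # modulus for the illustrative hash family


def fingerprint(window, k):
    """Compact fingerprint: indices of the k largest-magnitude samples."""
    order = sorted(range(len(window)), key=lambda i: (-abs(window[i]), i))
    return set(order[:k])


def minhash_signature(features, hash_params):
    """One minimum per hash function over a non-empty feature set."""
    return tuple(min((a * x + b) % PRIME for x in features)
                 for a, b in hash_params)


def similar_pairs(fingerprints, num_hashes=32, bands=8, threshold=0.5, seed=0):
    """Return (key_a, key_b, jaccard) for candidate pairs above threshold."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    rows = num_hashes // bands

    # Group fingerprints: windows whose signatures agree on a whole band
    # land in the same bucket and become candidate pairs.
    buckets = defaultdict(set)
    for key, features in fingerprints.items():
        sig = minhash_signature(features, params)
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(key)

    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(sorted(members), 2))

    # Similarity thresholding on the candidate pairs only.
    detections = []
    for a, b in sorted(candidates):
        fa, fb = fingerprints[a], fingerprints[b]
        jaccard = len(fa & fb) / len(fa | fb)
        if jaccard >= threshold:
            detections.append((a, b, jaccard))
    return detections


# Hypothetical toy windows: w3 is a scaled copy of w1, w2 is unrelated.
windows = {
    "w1": [0.1, 5.0, -4.0, 0.2, 3.0, 0.0],
    "w2": [6.0, 0.1, 0.2, 5.0, 0.1, 4.0],
    "w3": [0.2, 10.0, -8.0, 0.4, 6.0, 0.0],
}
fingerprints = {name: fingerprint(w, 3) for name, w in windows.items()}
detections = similar_pairs(fingerprints)
```

The point of the grouping step is that only windows sharing a bucket are compared, which avoids the all-pairs comparison that makes autocorrelation expensive on long records.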

Concepts: Data mining, Earthquake, Earthquake engineering, Seismology, Seismometer

28

INTRODUCTION: Decision-tree analysis, a core component of data mining analysis, can build predictive models for the therapeutic outcome of antiviral therapy in chronic hepatitis C virus (HCV) patients. AIM: To develop a prediction model for the end virological response (ETR) to pegylated interferon (PEG-IFN) plus ribavirin (RBV) therapy in chronic HCV patients using routine clinical, laboratory, and histopathological data. PATIENTS AND METHODS: Retrospective initial data (19 attributes) from 3719 Egyptian patients with chronic HCV, presumably genotype-4, were assigned to model building using the J48 decision tree-inducing algorithm (Weka implementation of C4.5). All patients received PEG-IFN plus RBV at Cairo-Fatemia Hospital, Cairo, Egypt in the context of the national treatment program. Factors predictive of ETR were explored and patients were classified into seven subgroups according to the different rates of ETR. The universality of the decision-tree model was subjected to a 10-fold internal cross-validation in addition to external validation using an independent dataset of 200 chronic HCV patients. RESULTS: At week 48, overall ETR was 54% according to the intention-to-treat protocol. The decision-tree model included AFP level (<8.08 ng/ml), which was associated with a high probability of ETR (73%), followed by stages of fibrosis and Hb levels according to the patients' gender, followed by the age of patients. CONCLUSION: In a decision-tree model for the prediction of response to antiviral therapy in chronic HCV patients, AFP level was the initial split variable at a cutoff of 8.08 ng/ml. This model could represent a potential tool to identify patients' likelihood of response among difficult-to-treat presumably genotype-4 chronic HCV patients and could support clinical decisions regarding the proper selection of patients for therapy without imposing any additional costs.
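How a decision-tree learner arrives at a numeric cutoff such as the AFP split can be sketched as a search for the threshold with the highest information gain. This is a simplification and not the J48 implementation: C4.5 uses gain ratio and further refinements, and the function names and toy values below are hypothetical and unrelated to the study's patient data.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result


def best_threshold(values, labels):
    """Find the cutoff on one numeric attribute with maximal information gain.

    Candidate cutoffs are midpoints between consecutive distinct values.
    Returns (cutoff, gain); cutoff is None if no split improves on the
    unsplit data.
    """
    pairs = sorted(zip(values, labels))
    base = entropy([label for _, label in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_cut, best_gain = (low + high) / 2, gain
    return best_cut, best_gain


# Hypothetical marker levels and response labels (1 = responder).
marker = [2.0, 4.0, 6.0, 10.0, 12.0, 14.0]
response = [1, 1, 1, 0, 0, 0]
cutoff, gain = best_threshold(marker, response)
```

On this perfectly separable toy data the chosen cutoff is the midpoint between 6.0 and 10.0. A full tree learner would repeat the search over every attribute and then recurse into each resulting subgroup.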

Concepts: Cirrhosis, Hepatitis, Interferon, Hepatitis C, Hepatitis B, Data mining, Hepatitis C virus, Decision tree learning

28

Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over a hundred years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research.

Concepts: Bioinformatics, Cancer, Oncology, Data mining, Malignancy, Research and development, Natural language processing, Text mining

27

Oils of various species of Copaifera are commonly found in pharmacies and on popular markets and are widely sold for their medicinal properties. However, the chemical variability between and within species and the lack of standardization of these oils have presented barriers to their wider commercialization. With the aim of recognizing patterns in the chemical composition of copaiba oils, 22 oil samples of the species C. multijuga Hayne were collected, esterified with CH2N2, and characterized by GC-FID and GC/MS analyses. The chromatographic data were processed using hierarchical cluster analysis (HCA) and principal component analysis (PCA). In total, 35 components were identified in the oils, and the multivariate analyses (MVA) allowed the samples to be divided into three groups, with the sesquiterpenes β-caryophyllene and caryophyllene oxide as the main components. These sesquiterpenes, which were detected in all the samples analyzed in different concentrations, were the most important constituents in the differentiation of the groups. There was a prevalence of sesquiterpenes in all the oils studied. In conclusion, GC-FID and GC/MS analyses combined with MVA can be used to determine the chemical composition and to recognize chemical patterns of copaiba oils.
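Grouping samples by composition with hierarchical cluster analysis can be sketched as repeated merging of the two closest clusters. The abstract does not state the linkage or distance used, so average linkage with Euclidean distance is an assumption here, and the function name and two-component toy compositions are hypothetical rather than measured data.

```python
from math import dist


def hierarchical_clusters(points, num_clusters):
    """Agglomerative clustering with average linkage (assumed choice).

    points: list of equal-length numeric tuples, one per sample.
    Returns a sorted list of clusters, each a sorted list of sample indices.
    """
    num_clusters = max(num_clusters, 1)
    clusters = [[i] for i in range(len(points))]

    def average_linkage(a, b):
        # Mean Euclidean distance over all cross-cluster sample pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair; y > x, so deleting y leaves x in place.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


# Hypothetical relative abundances (%) of two components in six samples.
samples = [(60, 5), (58, 6), (30, 30), (32, 28), (10, 50), (12, 52)]
groups = hierarchical_clusters(samples, 3)
```

Cutting the merge sequence at three clusters mirrors the three-group outcome reported above, although in practice the number of groups is usually read from a dendrogram rather than fixed in advance.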

Concepts: Cluster analysis, Multivariate statistics, Mathematical analysis, Principal component analysis, Data mining, Linear discriminant analysis, Multivariate analysis, Kernel principal component analysis