SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: Natural language processing

169

The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.

Concepts: DNA, Gene, Bioinformatics, Molecular biology, Data mining, Natural language processing, Graph rewriting, Subgraph isomorphism problem

168

A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of the crowdsourced biomedical NLP corpora were never exceptional when compared to traditionally-developed gold standards. The previously reported results on medical named entity annotation task showed a 0.68 F-measure based agreement between crowdsourced and traditionally-developed corpora.

Concepts: Costs, Linguistics, Gold, Cost, Task, Gold standard, Web 2.0, Natural language processing

168

BACKGROUND: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester, to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system MedTagger that consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions. RESULTS: We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling. CONCLUSIONS: The effectiveness of pooling corpora, is dependent on several factors, which include compatibility of annotation guidelines, distribution of report types and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that – i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as type of the report should be created during the annotation process.

Concepts: Effectiveness, Ontology, Machine learning, Learning, Annotation, Annotated bibliography, Natural language processing, Marginalia

139

Best performing named entity recognition (NER) methods for biomedical literature are based on hand-crafted features or task-specific rules, which are costly to produce and difficult to generalize to other corpora. End-to-end neural networks achieve state-of-the-art performance without hand-crafted features and task-specific knowledge in non-biomedical NER tasks. However, in the biomedical domain, using the same architecture does not yield competitive performance compared with conventional machine learning models.

Concepts: Machine learning, Learning, Knowledge, Play, Music, Natural language processing, Named entity recognition, Message Understanding Conference

138

In biomedical articles, a named entity recognition (NER) technique that identifies entity names from texts is an important element for extracting biological knowledge from articles. After NER is applied to articles, the next step is to normalize the identified names into standard concepts (i.e., disease names are mapped to the National Library of Medicine’s Medical Subject Headings disease terms). In biomedical articles, many entity normalization methods rely on domain-specific dictionaries for resolving synonyms and abbreviations. However, the dictionaries are not comprehensive except for some entities such as genes. In recent years, biomedical articles have accumulated rapidly, and neural network-based algorithms that incorporate a large amount of unlabeled data have shown considerable success in several natural language processing problems.

Concepts: Medicine, Disease, Infectious disease, Object, Programming language, Natural language processing, Named entity recognition, Computational linguistics

137

Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed.

Concepts: Linguistics, Artificial intelligence, Machine learning, Natural language processing, Supervised learning, Named entity recognition, Computational linguistics, Message Understanding Conference

58

Creativity is a complex, multi-faceted concept encompassing a variety of related aspects, abilities, properties and behaviours. If we wish to study creativity scientifically, then a tractable and well-articulated model of creativity is required. Such a model would be of great value to researchers investigating the nature of creativity and in particular, those concerned with the evaluation of creative practice. This paper describes a unique approach to developing a suitable model of how creative behaviour emerges that is based on the words people use to describe the concept. Using techniques from the field of statistical natural language processing, we identify a collection of fourteen key components of creativity through an analysis of a corpus of academic papers on the topic. Words are identified which appear significantly often in connection with discussions of the concept. Using a measure of lexical similarity to help cluster these words, a number of distinct themes emerge, which collectively contribute to a comprehensive and multi-perspective model of creativity. The components provide an ontology of creativity: a set of building blocks which can be used to model creative practice in a variety of domains. The components have been employed in two case studies to evaluate the creativity of computational systems and have proven useful in articulating achievements of this work and directions for further research.

Concepts: Mathematics, Research, Emergence, Linguistics, Language, Semantics, Logic, Natural language processing

44

Artificial intelligence (AI) is rapidly moving from an experimental phase to an implementation phase in many fields, including medicine. The combination of improved availability of large datasets, increasing computing power, and advances in learning algorithms has created major performance breakthroughs in the development of AI applications. In the last 5¬†years, AI techniques known as deep learning have delivered rapidly improving performance in image recognition, caption generation, and speech recognition. Radiology, in particular, is a prime candidate for early adoption of these techniques. It is anticipated that the implementation of AI in radiology over the next decade will significantly improve the quality, value, and depth of radiology’s contribution to patient care and population health, and will revolutionize radiologists' workflows. The Canadian Association of Radiologists (CAR) is the national voice of radiology committed to promoting the highest standards in patient-centered imaging, lifelong learning, and research. The CAR has created an AI working group with the mandate to discuss and deliberate on practice, policy, and patient care issues related to the introduction and implementation of AI in imaging. This white paper provides recommendations for the CAR derived from deliberations between members of the AI working group. This white paper on AI in radiology will inform CAR members and policymakers on key terminology, educational needs of members, research and development, partnerships, potential clinical applications, implementation, structure and governance, role of radiologists, and potential impact of AI on radiology in Canada.

Concepts: Psychology, Artificial intelligence, Machine learning, Learning, Policy, Neural network, Artificial neural network, Natural language processing

28

Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over hundred years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research.

Concepts: Bioinformatics, Cancer, Oncology, Data mining, Malignancy, Research and development, Natural language processing, Text mining

28

BACKGROUND: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. RESULTS: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. CONCLUSIONS: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides avaluable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

Concepts: Linguistics, Semantics, Natural language processing, Natural language, Named entity recognition, Computational linguistics, Lexical analysis