SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: Programming language

420

Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (i) the words of natural human language possess a universal positivity bias, (ii) the estimated emotional content of words is consistent between languages under translation, and (iii) this positivity bias is strongly independent of frequency of word use. Alongside these general regularities, we describe interlanguage variations in the emotional spectrum of languages that allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.
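The "instruments" mentioned in the abstract amount to averaging per-word emotion ratings over a text. A minimal sketch of that idea in Python, with a small hypothetical lexicon standing in for the study's human-evaluated word scores:

```python
# Hypothetical happiness scores on a 1-9 scale, standing in for the
# study's human-evaluated lexicon.
HAPPINESS = {
    "love": 8.4, "happy": 8.5, "day": 6.0, "the": 4.98,
    "rain": 5.0, "war": 1.8, "sad": 2.4,
}

def mean_happiness(text, lexicon=HAPPINESS):
    """Average happiness of the scored words in a text."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)

print(mean_happiness("a happy day"))  # (8.5 + 6.0) / 2 = 7.25
```

Running the same average over a sliding window of a large corpus is what turns the lexicon into the "real-time" instrument the authors describe.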

Concepts: Programming language, Cognition, Reason, Mathematics, Translation, Root, Word, Language

189

Crowdsourcing linguistic phenomena with smartphone applications is relatively new. In linguistics, apps have predominantly been developed to create pronunciation dictionaries, to train acoustic models, and to archive endangered languages. This paper presents the first account of how apps can be used to collect data suitable for documenting language change: we created an app, Dialäkt Äpp (DÄ), which predicts users' dialects. For 16 linguistic variables, users select a dialectal variant from a drop-down menu. DÄ then geographically locates the user's dialect by suggesting a list of communes where dialect variants most similar to their choices are used. Underlying this prediction are 16 maps from the historical Linguistic Atlas of German-speaking Switzerland, which documents the linguistic situation around 1950. Where users disagree with the prediction, they can indicate what they consider to be their dialect's location. With this information, the 16 variables can be assessed for language change. Thanks to its playful functionality, DÄ has reached many users; our linguistic analyses are based on data from nearly 60,000 speakers. Results reveal relative stability for phonetic variables, while lexical and morphological variables seem more prone to change. Crowdsourcing large amounts of dialect data with smartphone apps has the potential to complement existing data collection techniques and to provide evidence that traditional methods cannot, with normal resources, hope to gather. Nonetheless, it is important to emphasize a range of methodological caveats, including sparse knowledge of users' linguistic backgrounds (users indicate only age and sex) and users' self-declaration of their dialect. These are discussed and evaluated in detail here. The findings remain intriguing nevertheless: as a means of quality control, we report that traditional dialectological methods have revealed trends similar to those found by the app. This underlines the validity of the crowdsourcing method. We are presently extending the DÄ architecture to other languages.

Concepts: Programming language, Semiotics, English language, German language, Historical linguistics, Dialect, Linguistics, Language

176

SUMMARY: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast, customisable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search, and a library of “widgets” performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages. AVAILABILITY: Freely available from http://www.intermine.org under the LGPL license. CONTACT: g.micklem@gen.cam.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Concepts: Data, Model organism, Type system, Bioinformatics, Programming language, Biological data, Statistics, Data management

175

Displaying chemical structures in LaTeX documents currently requires either hand-coding of the structures using one of several LaTeX packages, or the inclusion of finished graphics files produced with an external drawing program. There is currently no software tool available to render the large number of structures available in molfile or SMILES format to LaTeX source code. We here present mol2chemfig, a Python program that provides this capability. Its output is written in the syntax defined by the chemfig TeX package, which allows for the flexible and concise description of chemical structures and reaction mechanisms. The program is freely available both through a web interface and for local installation on the user's computer. The code and accompanying documentation can be found at http://chimpsky.uwaterloo.ca/mol2chemfig.

Concepts: Computer software, Programmer, Free software, Programming language, Java, Latex, Source code, Computer program

172

BACKGROUND: Although programming in a type-safe and referentially transparent style offers several advantages over working with mutable data structures and side effects, this style of programming has not seen much use in chemistry-related software. Since functional programming languages were designed with referential transparency in mind, these languages offer a lot of support when writing immutable data structures and side-effect-free code. We therefore started implementing our own toolkit based on the above programming paradigms in a modern, versatile programming language. RESULTS: We present our initial results with functional programming in chemistry by first describing an immutable data structure for molecular graphs, together with a couple of simple algorithms to calculate basic molecular properties, before writing a complete SMILES parser in accordance with the OpenSMILES specification. Along the way we show how to deal with input validation, error handling, bulk operations, and parallelization in a purely functional way. At the end we also analyze and improve our algorithms and data structures in terms of performance and compare them to existing toolkits, both object-oriented and purely functional. All code was written in Scala, a modern multi-paradigm programming language with strong support for functional programming and a highly sophisticated type system. CONCLUSIONS: We have successfully made the first important steps towards a purely functional chemistry toolkit. The data structures and algorithms presented in this article perform well, and at the same time they can be safely used in parallelized applications, such as computer-aided drug design experiments, without further adjustments. This stands in contrast to existing object-oriented toolkits, where thread safety of data structures and algorithms is a deliberate design decision that can be hard to implement. Finally, the level of type safety achieved by Scala greatly increased the reliability of our code as well as the productivity of the programmers involved in this project.
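The paper's toolkit is written in Scala; purely to illustrate the underlying idea in this page's examples language, here is a hypothetical Python sketch of an immutable molecular graph whose "editing" operations return new values instead of mutating state:

```python
# Hypothetical immutable molecular graph: frozen dataclass, and every
# edit returns a fresh Molecule, so values can be shared across threads.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Molecule:
    atoms: tuple        # e.g. ("O", "H", "H")
    bonds: frozenset    # frozenset of (i, j) atom-index pairs

    def add_atom(self, symbol):
        """Return a new Molecule with one extra atom."""
        return replace(self, atoms=self.atoms + (symbol,))

    def add_bond(self, i, j):
        """Return a new Molecule with one extra bond."""
        return replace(self, bonds=self.bonds | {(min(i, j), max(i, j))})

    def degree(self, i):
        """Number of bonds touching atom i (a basic molecular property)."""
        return sum(1 for bond in self.bonds if i in bond)

water = (Molecule(("O",), frozenset())
         .add_atom("H").add_atom("H")
         .add_bond(0, 1).add_bond(0, 2))
print(water.degree(0))  # 2
```

Because no operation mutates an existing value, intermediate molecules remain valid and can be reused freely, which is the thread-safety property the abstract contrasts with object-oriented toolkits.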

Concepts: Haskell, C Sharp, Referential transparency, Type system, Purely functional, Programming paradigm, Functional programming, Programming language

171

BACKGROUND: A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when the web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however, the web environment has changed, and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language, JavaScript. SUMMARY: The actual molecule-editing Java code of the JME editor was translated into JavaScript with the help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced with additional functionality, including a substituent menu, copy/paste, drag-and-drop, undo/redo capabilities, and integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including the iPhone, iPad, and Android phones and tablets. In analogy to JME, the new editor is named JSME. This new molecule editor is compact, easy to use, and easy to incorporate into web pages. CONCLUSIONS: A free molecule editor written in JavaScript was developed and is released under the terms of the permissive BSD license. The editor is compatible with JME and has practically the same user interface, as well as the same web application programming interface. The JSME editor is available for download from the project web page http://peter-ertl.com/jsme/

Concepts: HTML, Web server, Programming language, Web page, Google, World Wide Web, Web browser, Java

169

The concept of reachable workspace is closely tied to upper limb joint range of motion and functional capability. Currently, no practical and cost-effective methods are available in clinical and research settings to provide arm-function evaluation using an individual’s three-dimensional (3D) reachable workspace. A method to intuitively display and effectively analyze reachable workspace would not only complement traditional upper limb functional assessments, but also provide an innovative approach to quantify and monitor upper limb function.

Concepts: Programming language, Limbs, Lambda calculus, Limb, Upper limb

164

We present a web service to access Ensembl data using Representational State Transfer (REST). The Ensembl REST Server enables the easy retrieval of a wide range of Ensembl data by most programming languages, using standard formats such as JSON and FASTA whilst minimising client work. We also introduce bindings to the popular Ensembl Variant Effect Predictor (VEP) tool permitting large-scale programmatic variant analysis independent of any specific programming language. Availability: The Ensembl REST API can be accessed at http://rest.ensembl.org and source code is freely available under an Apache 2.0 license from http://github.com/Ensembl/ensembl-rest.
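As a concrete illustration, the server's /lookup/id endpoint can be queried with nothing but the standard library; the helper functions below are illustrative, and the commented-out gene ID is only an example:

```python
# Minimal client sketch for the Ensembl REST server's /lookup/id
# endpoint, requesting JSON via the content-type parameter.
import json

SERVER = "https://rest.ensembl.org"

def lookup_url(stable_id):
    """Build a /lookup/id request URL asking for a JSON response."""
    return f"{SERVER}/lookup/id/{stable_id}?content-type=application/json"

def parse_lookup(payload):
    """Pull a couple of common fields out of a /lookup/id response."""
    data = json.loads(payload)
    return data.get("display_name"), data.get("species")

# Live call (requires network access); ENSG00000157764 is human BRAF:
# from urllib.request import urlopen
# with urlopen(lookup_url("ENSG00000157764")) as response:
#     print(parse_lookup(response.read()))
```

Because the server speaks plain HTTP and JSON, the same three lines of live-call code translate directly into any language with an HTTP client, which is the point of the abstract's language-independence claim.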

Concepts: Compiler, C, Language, Programmer, Source code, Java, Computer program, Programming language

149

Despite the rapid global movement towards electronic health records, clinical letters written in unstructured natural language are still the preferred form of inter-practitioner communication about patients. These letters, when archived over a long period of time, provide invaluable longitudinal clinical details on individual patients and patient populations. In this paper we present three unsupervised approaches, sequential pattern mining (PrefixSpan), the frequency- and linguistics-based C-Value measure, and keyphrase extraction from co-occurrence graphs (TextRank), to automatically extract single- and multi-word medical terms without domain-specific knowledge. Because each of the three approaches focuses on different aspects of the language feature space, we propose a genetic algorithm to learn the weights of a linear combination of the three extractors that performs best against domain expert annotations. Around 30,000 clinical letters sent over the past decade from ophthalmology specialists to general practitioners at an eye clinic were anonymised to serve as the corpus for evaluating the effectiveness of the ensemble against the individual extractors. With minimal annotation, the ensemble achieves an average F-measure of 65.65% when considering only complex medical terms, and an F-measure of 72.47% if single-word terms (i.e. unigrams) are also taken into consideration, markedly better than any of the three term extraction techniques used alone.
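The linear-integration step can be sketched as a weighted sum of per-extractor scores; the candidate scores and hand-set weights below are hypothetical (the paper learns the weights with a genetic algorithm against expert annotations):

```python
# Hypothetical linear ensemble of three term extractors.
def combined_score(scores, weights):
    """Weighted sum of normalised extractor scores."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical normalised scores: (PrefixSpan, C-Value, TextRank)
candidates = {
    "macular degeneration": (0.9, 0.8, 0.7),
    "kind regards": (0.6, 0.1, 0.2),
}
weights = (0.2, 0.5, 0.3)  # in the paper, learned by a genetic algorithm

# Keep candidates whose combined score clears a threshold.
terms = [t for t, s in candidates.items()
         if combined_score(s, weights) >= 0.5]
print(terms)  # ['macular degeneration']
```

A genetic algorithm searches over weight vectors (and the threshold), scoring each candidate vector by F-measure against the expert-annotated letters.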

Concepts: Physician, Latin, Medical terms, Genetic algorithm, Language, Programming language, Time, Linguistics

138

In biomedical articles, a named entity recognition (NER) technique that identifies entity names in text is an important element for extracting biological knowledge. After NER is applied, the next step is to normalize the identified names into standard concepts (e.g., disease names are mapped to the National Library of Medicine's Medical Subject Headings disease terms). Many entity normalization methods rely on domain-specific dictionaries for resolving synonyms and abbreviations. However, these dictionaries are not comprehensive, except for some entity types such as genes. In recent years, biomedical articles have accumulated rapidly, and neural network-based algorithms that incorporate large amounts of unlabeled data have shown considerable success on several natural language processing problems.
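The normalization step described above can be sketched as a dictionary lookup with an embedding-similarity fallback; the synonym table and the tiny two-dimensional "embeddings" below are hypothetical stand-ins for a real lexicon and learned vectors:

```python
# Hypothetical entity normalisation: exact synonym lookup first, then
# nearest-neighbour search in an embedding space as a fallback.
import math

SYNONYMS = {"heart attack": "Myocardial Infarction"}

# Toy 2-d "embeddings" for standard concept names.
EMBEDDINGS = {
    "Myocardial Infarction": (0.9, 0.1),
    "Diabetes Mellitus": (0.1, 0.9),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def normalise(mention, vector=None):
    """Dictionary lookup first; embedding similarity as a fallback."""
    if mention.lower() in SYNONYMS:
        return SYNONYMS[mention.lower()]
    if vector is not None:
        return max(EMBEDDINGS, key=lambda c: cosine(vector, EMBEDDINGS[c]))
    return None

print(normalise("heart attack"))           # Myocardial Infarction
print(normalise("MI", vector=(0.8, 0.2)))  # Myocardial Infarction
```

The embedding fallback is what lets unlabeled-data methods cover synonyms and abbreviations that the curated dictionary misses.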

Concepts: Programming language, Infectious disease, Disease, Computational linguistics, Object, Medicine, Named entity recognition, Natural language processing