How does network structure affect diffusion? Recent studies suggest that the answer depends on the type of contagion. Complex contagions, unlike infectious diseases (simple contagions), are affected by social reinforcement and homophily. Hence, spread within highly clustered communities is enhanced, while diffusion across communities is hampered. A common hypothesis is that memes and behaviors are complex contagions. We show that, while most memes indeed spread like complex contagions, a few viral memes spread across many communities, like diseases. We demonstrate that the future popularity of a meme can be predicted by quantifying its early spreading pattern in terms of community concentration: the more communities a meme permeates, the more viral it is. We present a practical method for translating data about community structure into predictive knowledge about which information will spread widely. This connection contributes to our understanding of information diffusion and has applications in computational social science, social media analytics, and marketing.
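The community-concentration idea in this abstract can be illustrated with a small sketch: the Shannon entropy of a meme's early adopters over communities. Low entropy means adoption is confined to few communities (complex-contagion-like); high entropy means spread across many communities, which the abstract associates with virality. This is a measure in the spirit of the paper, not its exact formula, and the function name is ours.

```python
from collections import Counter
from math import log

def community_entropy(adopter_communities):
    """Shannon entropy (bits) of a meme's early adopters over communities.

    Low entropy  -> adoption concentrated in few communities.
    High entropy -> adoption spread across many communities (virus-like).
    """
    counts = Counter(adopter_communities)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total, 2) for c in counts.values())

# A meme confined to one community has zero entropy; one spread evenly
# over four communities has log2(4) = 2 bits.
confined = community_entropy(["c1"] * 8)
spread = community_entropy(["c1", "c2", "c3", "c4"] * 2)
```

In practice one would compute such a score from the first adopters of each meme and feed it to a classifier alongside other early-spreading features.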
Secondary use of electronic health records (EHRs) promises to advance clinical research and better inform clinical decision making. Challenges in summarizing and representing patient data prevent widespread practice of predictive modeling using EHRs. Here we present a novel unsupervised deep feature learning method to derive a general-purpose patient representation from EHR data that facilitates clinical predictive modeling. In particular, a three-layer stack of denoising autoencoders was used to capture hierarchical regularities and dependencies in the aggregated EHRs of about 700,000 patients from the Mount Sinai data warehouse. The result is a representation we name “deep patient”. We evaluated whether this representation is broadly predictive of health states by assessing the probability that patients would develop various diseases. We performed the evaluation using 76,214 test patients and 78 diseases from diverse clinical domains and temporal windows. Our results significantly outperformed those achieved using representations based on raw EHR data and alternative feature learning strategies. Prediction performance for severe diabetes, schizophrenia, and various cancers was among the best. These findings indicate that deep learning applied to EHRs can derive patient representations that offer improved clinical predictions, and could provide a machine learning framework for augmenting clinical decision systems.
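A single layer of the denoising-autoencoder stack described above can be sketched in NumPy: corrupt the input with masking noise, encode it, reconstruct with tied weights, and descend the reconstruction error. This is only an illustrative sketch under simplified assumptions (squared error instead of cross-entropy, toy binary data, one layer rather than three); it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    """One layer of a stacked denoising autoencoder (illustrative sketch)."""

    def __init__(self, n_in, n_hidden, noise=0.3, lr=0.1):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # tied encoder/decoder weights
        self.b = np.zeros(n_hidden)  # hidden bias
        self.c = np.zeros(n_in)      # reconstruction bias
        self.noise, self.lr = noise, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        x_tilde = x * (rng.random(x.shape) > self.noise)  # masking noise
        h = self.encode(x_tilde)
        x_hat = sigmoid(h @ self.W.T + self.c)            # tied-weight decoder
        # Backpropagate the squared reconstruction error against the CLEAN input.
        d_out = (x_hat - x) * x_hat * (1 - x_hat)
        d_hid = (d_out @ self.W) * h * (1 - h)
        self.W -= self.lr * (np.outer(x_tilde, d_hid) + np.outer(d_out, h))
        self.c -= self.lr * d_out
        self.b -= self.lr * d_hid
        return float(np.mean((x_hat - x) ** 2))

# Toy binary "patient" vectors; real EHR features would be diagnosis codes,
# medications, lab values, and note-derived concepts.
X = (rng.random((200, 30)) < 0.2).astype(float)
dae = DenoisingAutoencoder(n_in=30, n_hidden=10)
epoch_loss = [float(np.mean([dae.train_step(x) for x in X])) for _ in range(10)]
```

Stacking means feeding `dae.encode(x)` of each layer as the input to the next layer's training, yielding progressively more abstract patient features.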
Use of socially generated “big data” to access information about collective states of mind in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be predicting society’s reaction to a new product in terms of popularity and adoption rate. However, bridging the gap between “real-time monitoring” and “early predicting” remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the movie’s entry in Wikipedia, the well-known online encyclopedia.
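A minimalistic predictor of the kind described can be sketched as an ordinary least-squares fit of (log) box-office revenue on early Wikipedia activity features. The feature names and the synthetic data below are our illustrative assumptions, not the paper's variables or measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical early-activity features per movie (log scale):
# [number of edits, number of distinct editors, page views]
log_activity = rng.uniform(0.0, 10.0, (50, 3))

# Synthetic log-revenue generated from a known linear rule, for illustration only.
true_w, true_b = np.array([0.5, 0.3, 0.8]), 2.0
log_revenue = log_activity @ true_w + true_b

# Ordinary least squares with an intercept column.
A = np.hstack([log_activity, np.ones((len(log_activity), 1))])
coef, *_ = np.linalg.lstsq(A, log_revenue, rcond=None)

def predict_log_revenue(features):
    """Predict log revenue from the three activity features."""
    return float(np.dot(features, coef[:3]) + coef[3])
```

On noiseless synthetic data the fit recovers the generating coefficients exactly; with real pre-release activity data the interest lies in how early such a fit becomes accurate.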
Correctly assessing a scientist’s past research impact and potential for future impact is key in recruitment decisions and other evaluation processes. While a candidate’s future impact is the main concern for these decisions, most measures only quantify the impact of previous work. Recently, it has been argued that linear regression models are capable of predicting a scientist’s future impact. By applying that future impact model to 762 careers drawn from three disciplines (physics, biology, and mathematics), we identify a number of subtle, but critical, flaws in current models. Specifically, cumulative non-decreasing measures like the h-index contain intrinsic autocorrelation, resulting in significant overestimation of their “predictive power”. Moreover, the predictive power of these models depends heavily upon a scientist's career age, producing the least accurate estimates for young researchers. Our results place in doubt the suitability of such models, and indicate that further investigation is required before they can be used in recruitment decisions.
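The autocorrelation point can be demonstrated on purely synthetic careers: because the h-index never decreases, early values "predict" later values even when paper output and citations are random. The simulation below is our illustration, not the paper's data or model.

```python
import numpy as np

rng = np.random.default_rng(2)

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

def simulate_career(years, papers_per_year, mean_citations):
    """Yearly h-index of a career with random output (no 'talent' signal)."""
    citations, h_by_year = [], []
    for _ in range(years):
        n = rng.poisson(papers_per_year)
        # Heavy-tailed citation counts, mean roughly mean_citations - 1.
        citations.extend(rng.geometric(1.0 / mean_citations, n) - 1)
        h_by_year.append(h_index(citations))
    return h_by_year

careers = [simulate_career(10, rng.uniform(1, 10), 8.0) for _ in range(200)]
h5 = [c[4] for c in careers]   # h-index at year 5
h10 = [c[9] for c in careers]  # h-index at year 10
r = np.corrcoef(h5, h10)[0, 1]  # high despite purely random careers
```

Because h(10) >= h(5) by construction, a regression of h(10) on h(5) looks impressive without capturing anything about future scientific impact, which is exactly the overestimation the abstract describes.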
Will they participate? Predicting patients' response to clinical trial invitations in a pediatric emergency department
- Journal of the American Medical Informatics Association (JAMIA)
(1) To develop an automated algorithm to predict a patient’s response (i.e., whether the patient agrees or declines) before he or she is approached with a clinical trial invitation; (2) to assess the algorithm's performance and the predictors on real-world patient recruitment data for a diverse set of clinical trials in a pediatric emergency department; and (3) to identify directions for future studies in predicting patients' participation response.
In a random number generation task, participants are asked to generate a random sequence of numbers, most typically the digits 1 to 9. Such number sequences are not mathematically random, and both the extent and the type of bias allow one to characterize the brain’s “internal random number generator”. We assume that certain patterns and their variations will frequently occur in humanly generated random number sequences. Thus, we introduce a pattern-based analysis of random number sequences. Twenty healthy subjects each randomly generated two sequences of 300 numbers. Sequences were analysed to identify the patterns of numbers predominantly used by the subjects and to calculate the frequency of a specific pattern and its variations within the number sequence. This pattern analysis is based on the Damerau-Levenshtein distance, which counts the number of edit operations needed to convert one string into another. We built a model that predicts not only the next item in a humanly generated random number sequence based on the item’s immediate history, but also the deployment of patterns in another sequence generated by the same subject. When a history of seven items was computed, the mean correct prediction rate rose to 27% (with an individual maximum of 46%, against chance performance of 11%). Furthermore, we assumed that when predicting one subject’s sequence, predictions based on statistical information from the same subject should yield a higher success rate than predictions based on statistical information from a different subject. When provided with two sequences from the same subject and one from a different subject, an algorithm identified the foreign sequence in up to 88% of the cases. In conclusion, the pattern-based analysis using the Damerau-Levenshtein distance is able both to predict humanly generated random number sequences and to identify person-specific information within a humanly generated random number sequence.
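The edit-distance machinery underlying this analysis can be sketched as follows. This is the optimal-string-alignment variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and adjacent transpositions; it is a standard formulation, not necessarily the exact implementation used in the study.

```python
def osa_distance(a, b):
    """Optimal-string-alignment variant of the Damerau-Levenshtein distance.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent symbols needed to turn string a into string b.
    """
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# A swapped digit pair counts as a single edit:
# osa_distance("1234", "1243") == 1
```

Applied to number sequences, small distances between a candidate pattern and subsequences of a subject's output identify that pattern's variations, as described above.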
Comparison of the binding sites of proteins is an effective means for predicting protein functions based on their structure information. Despite the importance of this problem and much research in the past, it is still very challenging to predict binding ligands from the atomic structures of protein binding sites. Here, we designed a new algorithm, TIPSA (Triangulation-based Iterative-closest-point for Protein Surface Alignment), based on the iterative closest point (ICP) algorithm. TIPSA aims to find the maximum number of atoms that can be superposed between two protein binding sites, where any pair of superposed atoms has a distance smaller than a given threshold. The search starts from similar tetrahedra between two binding sites obtained from 3D Delaunay triangulation and uses the Hungarian algorithm to find additional matched atoms. We found that, due to the plasticity of protein binding sites, rigid-body matching of the point clouds of protein binding sites is not adequate for satisfactory binding ligand prediction. We therefore incorporated global geometric information, the radius of gyration of binding site atoms, and used nearest neighbor classification for binding ligand prediction. Tested on benchmark data, our method achieved a performance comparable to the best methods in the literature, while simultaneously providing the common atom set and atom correspondences.
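The Hungarian-algorithm matching step can be sketched with SciPy's assignment solver: given two already-superposed sets of atom coordinates, find a one-to-one correspondence minimizing total distance, then keep only pairs closer than the threshold. The function name and threshold value are our illustrative assumptions; TIPSA additionally performs the ICP superposition seeded from Delaunay tetrahedra, which is omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def matched_atoms(coords_a, coords_b, threshold=1.5):
    """One-to-one atom correspondence between two superposed binding sites.

    Solves the assignment problem (Hungarian algorithm) on the pairwise
    distance matrix and keeps pairs closer than `threshold` (angstroms).
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    cost = cdist(coords_a, coords_b)  # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)
    return [(int(i), int(j)) for i, j in zip(rows, cols) if cost[i, j] < threshold]

# Two nearby atoms match; the distant third atom in site B stays unpaired.
site_a = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
site_b = np.array([[0.1, 0.0, 0.0], [5.2, 0.0, 0.0], [20.0, 0.0, 0.0]])
pairs = matched_atoms(site_a, site_b)
```

The size of the returned common atom set is the quantity TIPSA maximizes over candidate superpositions.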
Recent years have witnessed much progress in computational modelling for protein subcellular localization. However, existing sequence-based predictive models demonstrate moderate or unsatisfactory performance, and gene ontology (GO) based models risk overestimating performance on novel proteins. Furthermore, many human proteins have multiple subcellular locations, which renders the computational modelling more complicated. To date, few studies have specifically addressed predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM introduces a multi-label confusion matrix, formally formulates three multi-labelling performance measures, and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario; on this basis it extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou’s PseAAC formulation for protein submitochondria localization) to multiplex human protein subcellular localization. With the advantages of proper homolog knowledge transfer, a comprehensive survey of model performance on novel proteins, and multi-labelling capability, MLMK-TLM gains greater practical applicability. Experiments on a human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some findings (predictions) are validated by the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar.
Electrical stimulation of the central nervous system has been widely used for decades, both for fundamental research purposes and for clinical treatment applications. Yet, very little is known regarding the spatial extent of an electrical stimulation. Although pioneering experimental studies reported that activation threshold currents (TCs) increase with the square of the neuron-to-electrode distance over a few hundred microns, there is no evidence that this quadratic law remains valid for larger distances. Moreover, numerical simulation approaches have now largely supplanted experimental studies for estimating TCs, yet model predictions have not been validated directly against experiments within a common paradigm. Here, we present a direct comparison between experimental determination and modeling prediction of TCs up to distances of several millimeters. First, we combined patch-clamp recording and microelectrode array stimulation in whole embryonic mouse spinal cords to determine TCs. Experimental thresholds did not follow a quadratic law beyond 1 millimeter, but rather tended to remain constant for larger distances. We next built a combined finite element–compartment model of the same experimental paradigm to predict TCs. While theoretical TCs closely matched experimental TCs for distances <250 microns, they were highly overestimated for larger distances. This discrepancy remained even after modifications of the finite element model of the potential field taking into account anisotropic, heterogeneous or dielectric properties of the tissue. In conclusion, these results show that the quadratic evolution of TCs does not always hold for large electrode-to-neuron distances and that classical models may underestimate the volumes of tissue activated by electrical stimulation.
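The classical current-distance relation referred to above is quadratic, I_th(d) = I_0 + k·d², while the experiments reported here show thresholds flattening beyond roughly 1 mm. The sketch below contrasts the two behaviors; the constants are illustrative placeholders, not the paper's fitted values.

```python
def threshold_quadratic(d_um, i0=2.0, k=1e-5):
    """Classical current-distance law: I_th = I0 + k * d^2.

    d_um: electrode-to-neuron distance in microns; i0 in uA;
    k in uA/um^2 (illustrative values only).
    """
    return i0 + k * d_um ** 2

def threshold_plateau(d_um, i0=2.0, k=1e-5, plateau_um=1000.0):
    """Quadratic growth up to ~1 mm, then roughly constant, as observed
    experimentally in the whole spinal cord preparation described above."""
    return i0 + k * min(d_um, plateau_um) ** 2
```

At 2 mm the quadratic law predicts a threshold several times the observed plateau value, which is the overestimation the abstract reports for the finite element-compartment model.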
Although the concept of genomic selection relies on linkage disequilibrium (LD) between quantitative trait loci and markers, the reliability of genomic predictions is strongly influenced by family relationships. In this study, we investigated the effects of LD and family relationships on the reliability of genomic predictions, and the potential of deterministic formulas to predict reliability from population parameters in populations with complex family structures. Five groups of selection candidates were simulated, each taking different information sources from the reference population into account: 1) allele frequencies; 2) LD pattern; 3) haplotypes; 4) haploid chromosomes; 5) individuals from the reference population, thereby having real family relationships with reference individuals. Reliabilities were predicted using genomic relationships among 529 reference individuals and their relationships with selection candidates, and with a deterministic formula in which the number of effective chromosome segments (M(e)) was estimated from the genomic and additive relationship matrices for each scenario. At a heritability of 0.6, reliabilities based on genomic relationships were 0.002±0.0001 (allele frequencies), 0.015±0.001 (LD pattern), 0.018±0.001 (haplotypes), 0.100±0.008 (haploid chromosomes) and 0.318±0.077 (family relationships). At a heritability of 0.1, relative differences among groups were similar. For all scenarios, reliabilities were similar to predictions with the deterministic formula using estimated M(e). Thus, reliabilities can be predicted accurately using an empirically estimated M(e), and the level of relationship with reference individuals has a much greater effect on reliability than linkage disequilibrium per se. Furthermore, the accumulated length of shared haplotypes is more important in determining the reliability of genomic prediction than the individual shared haplotype length.
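Deterministic formulas of the kind used above typically express expected reliability in terms of the reference population size N, heritability h², and the number of effective chromosome segments M_e, in the form r² = N h² / (N h² + M_e) (a Daetwyler-style expectation). The function below is our sketch of that relation, not the paper's exact derivation.

```python
def expected_reliability(n_ref, h2, m_e):
    """Deterministic expectation of genomic prediction reliability.

    r^2 = N * h2 / (N * h2 + Me): reliability rises with reference size N
    and heritability h2, and falls with the number of effective chromosome
    segments Me (which shrinks as candidates are more related to the
    reference population).
    """
    return (n_ref * h2) / (n_ref * h2 + m_e)

# With the paper's 529 reference individuals and h2 = 0.6, a close-family
# scenario (small Me) yields a far higher expected reliability than an
# unrelated scenario (large Me); the Me values here are illustrative.
r_family = expected_reliability(529, 0.6, 700)
r_unrelated = expected_reliability(529, 0.6, 100000)
```

This captures the abstract's central finding: the level of relationship with the reference population (which determines M_e) dominates reliability, rather than LD per se.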