Concept: Decision tree learning
The aim of this study was to develop a new data-mining model to predict axillary lymph node (AxLN) metastasis in primary breast cancer. To achieve this, we used a decision tree-based prediction method, the alternating decision tree (ADTree).
BACKGROUND: Recursive partitioning is a non-parametric modeling technique, widely used in regression and classification problems. Model-based recursive partitioning is used to identify groups of observations with similar values of the parameters of the model of interest. The mob() function in the party package in R implements the model-based recursive partitioning method. This method produces predictions based on single tree models. Predictions obtained through single tree models are very sensitive to small changes in the learning sample. We extend the model-based recursive partitioning method to produce predictions based on multiple tree models constructed on random samples obtained either through bootstrapping (random sampling with replacement) or subsampling (random sampling without replacement) of the learning data. RESULTS: Here we present an R package called “mobForest” that implements bagging and random forests methodology for model-based recursive partitioning. The mobForest package constructs a large number of model-based trees, and the predictions are aggregated across these trees, resulting in more stable predictions. The package also includes functions for computing predictive accuracy estimates and for producing predictive accuracy plots, residual plots, and variable importance plots. CONCLUSION: The mobForest package implements a random forest type approach for model-based recursive partitioning. The R package along with its source code is available at http://CRAN.R-project.org/package=mobForest.
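The difference between bootstrapping and subsampling, and the aggregation of predictions across trees, can be illustrated with a minimal Python sketch. It is not the mobForest implementation: the base learner here is a hypothetical one-split regression stump rather than a model-based tree, and the names draw_sample, fit_stump, bagged_predictor and the 0.632 subsample fraction are assumptions made for illustration only.

```python
import random
from statistics import mean


def draw_sample(data, rng, replace=True, fraction=0.632):
    """Bootstrap (with replacement) or subsample (without replacement)."""
    n = len(data)
    if replace:
        return [data[rng.randrange(n)] for _ in range(n)]
    return rng.sample(data, max(1, int(fraction * n)))


def fit_stump(sample):
    """Toy base learner: a one-split regression tree on (x, y) pairs."""
    xs = sorted({x for x, _ in sample})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in sample if x <= cut]
        right = [y for x, y in sample if x > cut]
        m_left, m_right = mean(left), mean(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, cut, m_left, m_right)
    if best is None:  # only one distinct x value: predict the overall mean
        constant = mean(y for _, y in sample)
        return lambda x: constant
    _, cut, m_left, m_right = best
    return lambda x: m_left if x <= cut else m_right


def bagged_predictor(data, n_trees=50, replace=True, seed=0):
    """Fit one stump per random sample and average their predictions."""
    rng = random.Random(seed)
    trees = [fit_stump(draw_sample(data, rng, replace)) for _ in range(n_trees)]
    return lambda x: mean(tree(x) for tree in trees)
```

Calling bagged_predictor(data, replace=False) switches from bootstrapping to subsampling; in both cases the final prediction is the average over all fitted stumps, which is what makes it less sensitive to any single learning sample.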
Background: Current inertial motion capture systems are rarely used in biomedical applications. The attachment and connection of the sensors with cables is often a complex and time-consuming task. Moreover, it is prone to errors, because each sensor has to be attached to a predefined body segment. By using wireless inertial sensors and automatic identification of their positions on the human body, the complexity of the set-up can be reduced and incorrect attachments are avoided. We present a novel method for the automatic identification of inertial sensors on human body segments during walking. This method allows the user to place (wireless) inertial sensors on arbitrary body segments. Next, the user walks for just a few seconds and the segment to which each sensor is attached is identified automatically. Methods: Walking data was recorded from ten healthy subjects using an Xsens MVN Biomech system with full-body configuration (17 inertial sensors). Subjects were asked to walk for about 6 seconds at normal walking speed (about 5 km/h). After rotating the sensor data to a global coordinate frame with x-axis in walking direction, y-axis pointing left and z-axis vertical, RMS, mean, and correlation coefficient features were extracted from x-, y- and z-components and magnitudes of the accelerations, angular velocities and angular accelerations. As a classifier, a decision tree based on the C4.5 algorithm was developed using Weka (Waikato Environment for Knowledge Analysis). Results and conclusions: After testing the algorithm with 10-fold cross-validation using 31 walking trials (involving 527 sensors), 514 sensors were correctly classified (97.5%). When a decision tree for a lower body plus trunk configuration (8 inertial sensors) was trained and tested using 10-fold cross-validation, 100% of the sensors were correctly identified. This decision tree was also tested on walking trials of 7 patients (17 walking trials) after anterior cruciate ligament reconstruction, which also resulted in 100% correct identification, thus illustrating the robustness of the method.
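The feature extraction step described above (RMS, mean, and correlation coefficient of the signal components and their magnitude) can be sketched in Python as follows. The abstract does not state which pairs of signals were correlated, so the pairwise correlation between axes used here is an assumption, and the function and feature names are hypothetical.

```python
import math


def mean(signal):
    return sum(signal) / len(signal)


def rms(signal):
    """Root mean square of a sampled signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either signal is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def magnitude(x, y, z):
    """Per-sample Euclidean norm of a three-axis signal."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]


def extract_features(x, y, z):
    """Feature dictionary for one three-axis signal (e.g. acceleration)."""
    channels = {"x": x, "y": y, "z": z, "mag": magnitude(x, y, z)}
    features = {}
    for name, sig in channels.items():
        features[f"rms_{name}"] = rms(sig)
        features[f"mean_{name}"] = mean(sig)
    # Assumption: correlations are taken between pairs of axes.
    features["corr_xy"] = pearson(x, y)
    features["corr_xz"] = pearson(x, z)
    features["corr_yz"] = pearson(y, z)
    return features
```

In a pipeline like the one described, such a dictionary would be computed per sensor for accelerations, angular velocities and angular accelerations, and the concatenated values would form one input row for the decision tree.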
Inflammatory bowel disease (IBD) and alimentary lymphoma (ALA) are common gastrointestinal diseases in cats. The very similar clinical signs and histopathologic features of these diseases make the distinction between them diagnostically challenging. We tested the use of supervised machine-learning algorithms to differentiate between the 2 diseases using data generated from noninvasive diagnostic tests. Three prediction models were developed using 3 machine-learning algorithms: naive Bayes, decision trees, and artificial neural networks. The models were trained and tested on data from complete blood count (CBC) and serum chemistry (SC) results for the following 3 groups of client-owned cats: normal, IBD, or ALA. Naive Bayes and artificial neural networks achieved higher classification accuracy (sensitivities of 70.8% and 69.2%, respectively) than the decision tree algorithm (63%, p < 0.0001). The areas under the receiver-operating characteristic curve for classifying cases into the 3 categories were 83% for naive Bayes, 79% for the decision tree, and 82% for artificial neural networks. Prediction models using machine learning provided a method for distinguishing between ALA and IBD, ALA and normal, and IBD and normal. The naive Bayes and artificial neural network classifiers used 10 and 4 of the CBC and SC variables, respectively, to outperform the C4.5 decision tree, which used 5 CBC and SC variables in classifying cats into the 3 classes. These models can provide another noninvasive diagnostic tool to assist clinicians with differentiating between IBD and ALA, and between diseased and nondiseased cats.
Epilepsy is a global disease with considerable incidence, characterized by recurrent unprovoked seizures. These seizures can be noninvasively diagnosed using the electroencephalogram (EEG), a measure of neuronal electrical activity in the brain recorded along the scalp. EEG is highly nonlinear, nonstationary and non-Gaussian in nature. Nonlinear adaptive models such as empirical mode decomposition (EMD) provide an intuitive understanding of the information present in these signals. In this study, a novel methodology is proposed to automatically classify EEG signals of normal, inter-ictal and ictal subjects using EMD. EEG decomposition using EMD yields a few intrinsic mode functions (IMFs), which are amplitude- and frequency-modulated (AM and FM) waves. The Hilbert transform of these IMFs provides the AM and FM frequencies. Features such as spectral peaks, spectral entropy and spectral energy in each IMF are extracted and fed to a decision tree classifier for automated diagnosis. In this work, we have compared the classification performance of two types of decision trees: (i) classification and regression tree (CART) and (ii) C4.5. We have obtained the highest average accuracy of 95.33%, average sensitivity of 98%, and average specificity of 97% using the C4.5 decision tree classifier. The developed methodology is ready for clinical validation on large databases and can be deployed for mass screening.
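CART and C4.5 differ, among other things, in the impurity criterion used to score candidate splits: CART is commonly described with the Gini index and C4.5 with the gain ratio. The following Python sketch computes both for a single binary split. It is a simplified illustration under that assumption, not the implementation used in the study, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(left, right):
    """Score a binary split with C4.5-style and CART-style criteria."""
    parent = left + right
    n = len(parent)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = (entropy(parent)
                 - w_left * entropy(left) - w_right * entropy(right))
    # Split information penalizes very unbalanced partitions (C4.5).
    split_info = -sum(w * math.log2(w) for w in (w_left, w_right) if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    gini_decrease = gini(parent) - w_left * gini(left) - w_right * gini(right)
    return {"info_gain": info_gain,
            "gain_ratio": gain_ratio,
            "gini_decrease": gini_decrease}


# Usage: a split that separates ictal segments from the other two classes.
scores = split_scores(["ictal"] * 4, ["normal"] * 3 + ["inter-ictal"] * 3)
```

A tree learner would evaluate such scores for every candidate feature threshold and keep the split with the highest value of its chosen criterion.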
The assessment of data mining for the prediction of therapeutic outcome in 3719 Egyptian patients with chronic hepatitis C.
- Clinics and research in hepatology and gastroenterology
- Published about 5 years ago
INTRODUCTION: Decision-tree analysis, a core component of data mining analysis, can build predictive models for the therapeutic outcome of antiviral therapy in chronic hepatitis C virus (HCV) patients. AIM: To develop a prediction model for the end virological response (ETR) to pegylated interferon (PEG-IFN) plus ribavirin (RBV) therapy in chronic HCV patients using routine clinical, laboratory, and histopathological data. PATIENTS AND METHODS: Retrospective initial data (19 attributes) from 3719 Egyptian patients with chronic HCV, presumably genotype 4, were assigned to model building using the J48 decision tree-inducing algorithm (Weka implementation of C4.5). All patients received PEG-IFN plus RBV at Cairo-Fatemia Hospital, Cairo, Egypt, in the context of the national treatment program. Factors predictive of ETR were explored and patients were classified into seven subgroups according to the different rates of ETR. The generalizability of the decision-tree model was assessed by 10-fold internal cross-validation in addition to external validation using an independent dataset of 200 chronic HCV patients. RESULTS: At week 48, overall ETR was 54% according to the intention-to-treat protocol. The decision-tree model included AFP level (<8.08 ng/ml), which was associated with a high probability of ETR (73%), followed by stage of fibrosis and Hb level according to the patient's gender, followed by the age of the patient. CONCLUSION: In a decision-tree model for the prediction of response to antiviral therapy in chronic HCV patients, AFP level was the initial split variable at a cutoff of 8.08 ng/ml. This model could represent a potential tool to identify patients' likelihood of response among difficult-to-treat, presumably genotype-4, chronic HCV patients and could support clinical decisions regarding the proper selection of patients for therapy without imposing any additional costs.
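The 10-fold internal cross-validation mentioned above can be sketched in Python as follows. This is a plain, unstratified partition written for illustration; it makes no claim about how Weka forms its folds, and k_fold_indices, cross_validate, and the fit/predict callables are hypothetical names.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return overall accuracy; every sample is tested exactly once."""
    correct = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)
```

Each sample is held out exactly once, so the returned accuracy is computed only on data the corresponding model did not see during fitting. External validation on an independent dataset, as in the study, is a separate step that this sketch does not cover.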
Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of whom were randomly selected as the “derivation cohort” to develop the dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both the derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.
Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Drug release from PLGA microspheres depends not only on polymer properties but also on drug type, particle size, morphology of microspheres, release conditions, etc. Selecting a subset of relevant properties for PLGA is a challenging machine learning task as there are over three hundred features to consider. In this work, we formulate the selection of critical attributes for PLGA as a multiobjective optimization problem with the aim of minimizing the error of predicting the dissolution profile while reducing the number of attributes selected. Four bio-inspired optimization algorithms (antlion optimization, a binary version of antlion optimization, grey wolf optimization, and social spider optimization) are used to select the optimal feature set for predicting the dissolution profile of PLGA. In addition, the LASSO algorithm is used for comparison. Selection of crucial variables is performed under the assumption that both predictability and model simplicity are of equal importance to the final result. During the feature selection process, a set of input variables is employed to find the minimum generalization error across different predictive models and their settings/architectures. The methodology is evaluated using predictive modeling for which various tools are chosen, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines, classification and regression tree, and hybrid systems of fuzzy logic and evolutionary computations (fugeR). The experimental results are compared with the results reported by Szlȩk. We obtain a normalized root mean square error (NRMSE) of 15.97% versus 15.4%, and the number of selected input features is smaller, nine versus eleven.
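The NRMSE reported above is a root mean square error divided by a scale of the observed values and expressed as a percentage. The abstract does not say which normalizer was used, so the Python sketch below offers two common conventions (range and mean of the observations) as assumptions; nrmse_percent is a hypothetical helper name.

```python
import math


def nrmse_percent(observed, predicted, normalizer="range"):
    """Normalized RMSE in percent.

    normalizer="range" divides by max(observed) - min(observed);
    normalizer="mean" divides by the mean of the observed values.
    Which convention a given study uses is an assumption to check.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    rmse = math.sqrt(squared_error / len(observed))
    if normalizer == "range":
        scale = max(observed) - min(observed)
    elif normalizer == "mean":
        scale = sum(observed) / len(observed)
    else:
        raise ValueError("normalizer must be 'range' or 'mean'")
    if scale == 0:
        raise ValueError("normalizer is zero; NRMSE is undefined")
    return 100.0 * rmse / scale
```

Because the two conventions can give quite different percentages for the same predictions, NRMSE values are only comparable when the same normalizer is used.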
Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data from 32,555 patients free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients had developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of class imbalance on the constructed model was handled by the Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
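The core idea of SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The Python sketch below illustrates that idea only; it is not the implementation used in the study, it assumes purely numeric features and at least two minority samples, and the function name smote and its parameters are chosen for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from numeric minority samples.

    minority: list of equal-length tuples of floats (at least two).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (q - b)
                               for b, q in zip(base, neighbour)))
    return synthetic
```

The synthetic points are added to the training data only, so that the evaluation data keep the original class proportions.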
Over the last 200 years the wetlands of the Upper Tietê and Upper Paraíba do Sul basins, in the southeastern Atlantic Forest, Brazil, have been almost completely transformed by urbanization, agriculture and mining. Endemic to these river basins, the São Paulo Marsh Antwren (Formicivora paludicola) survived these impacts, but remained unknown to science until its discovery in 2005. Its population status was cause for immediate concern. In order to understand the factors imperiling the species, and provide guidelines for its conservation, we investigated both the species' distribution and the distribution of areas of suitable habitat using a multiscale approach encompassing species distribution modeling, fieldwork surveys and occupancy models. Of the six species distribution modeling methods used (Generalized Linear Models, Generalized Additive Models, Multivariate Adaptive Regression Splines, Classification Tree Analysis, Artificial Neural Networks and Random Forest), Random Forest showed the best fit and was used to guide field validation. After surveying 59 sites, our results indicated that Formicivora paludicola occurred in only 13 sites, having narrow habitat specificity and restricted habitat availability. Additionally, historic maps, distribution models and satellite imagery showed that human occupation has resulted in a loss of more than 346 km² of suitable habitat for this species since the early twentieth century, so that it now only occupies a severely fragmented area (area of occupancy) of 1.42 km², and it should be considered Critically Endangered according to IUCN criteria. Furthermore, averaged occupancy models showed that marshes with lower cattail (Typha dominguensis) densities have higher probabilities of being occupied. Thus, these areas should be prioritized in future conservation efforts to protect the species, and to restore a portion of Atlantic Forest wetlands, in times of unprecedented regional water supply problems.