Concept: Expectation-maximization algorithm
BACKGROUND: Severe eczema in young children is associated with an increased risk of developing asthma and rhino-conjunctivitis. In the general population, however, most cases of eczema are mild to moderate. In an unselected cohort, we studied the risk of current asthma and the co-existence of allergy-related diseases at 6 years of age among children with and without eczema at 2 years of age. METHODS: Questionnaires assessing various environmental exposures and health variables were administered at 2 years of age. An identical health questionnaire was completed at 6 years of age. The clinical investigation of a random subsample ascertained eczema diagnoses, and missing data were handled by multiple imputation analyses. RESULTS: The estimate for the association between eczema at 2 years and current asthma at 6 years was OR=1.80 (95% CI 1.10-2.96). Four of ten children with eczema at 6 years had the onset of eczema after the age of 2 years, but the co-existence of different allergy-related diseases at 6 years was higher among those with the onset of eczema before 2 years of age. CONCLUSIONS: Although most cases of eczema in the general population were mild to moderate, early eczema was associated with an increased risk of developing childhood asthma. These findings support the hypothesis of an atopic march in the general population. TRIAL REGISTRATION: The Prevention of Allergy among Children in Trondheim study has been identified as ISRCTN28090297 in the international Current Controlled Trials database.
In many fields, including the field of nephrology, missing data are unfortunately an unavoidable problem in clinical/epidemiological research. The most common methods for dealing with missing data are complete case analysis (excluding patients with missing data), mean substitution (replacing missing values of a variable with the average of known values for that variable), and last observation carried forward. However, these methods have severe drawbacks, potentially resulting in biased estimates and/or standard errors. In recent years, a new method for dealing with missing data has arisen, called multiple imputation. This method predicts missing values based on other data present in the same patient. This procedure is repeated several times, resulting in multiple imputed data sets. Thereafter, estimates and standard errors are calculated in each imputation set and pooled into one overall estimate and standard error. The main advantage of this method is that missing data uncertainty is taken into account. Another advantage is that the method of multiple imputation gives unbiased results when data are missing at random, which is the most common type of missing data in clinical practice, whereas conventional methods do not. However, the method of multiple imputation has scarcely been used in the medical literature. We, therefore, encourage authors to do so in the future when possible.
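The pooling step described above is commonly carried out with Rubin's rules. The following is a minimal sketch of that step only, assuming the per-imputation estimates and squared standard errors have already been computed; the function name `pool_rubin` and the numbers are hypothetical and are not taken from the abstract.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    # The (1 + 1/m) factor accounts for using a finite number of imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5


# Hypothetical estimates and squared standard errors from m = 3 imputed data sets
pooled_estimate, pooled_se = pool_rubin([0.50, 0.62, 0.56], [0.040, 0.044, 0.042])
```

The between-imputation term is what carries the missing data uncertainty into the pooled standard error: if all imputed data sets gave the same estimate, the pooled standard error would reduce to the average within-imputation one.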
Global, regional, and country statistics on population and health indicators are important for assessing development and health progress and for guiding resource allocation; however, data are often lacking, especially in low- and middle-income countries. To fill the gaps, statistical modelling is frequently used to produce comparable health statistics across countries that can be combined to produce regional and global statistics. The World Health Organization (WHO), in collaboration with other United Nations agencies and academic experts, regularly updates estimates for key indicators and involves its Member States in the process. Academic institutions also publish estimates independent from the WHO using different methods. The use of sophisticated statistical estimation methods to fill missing values for countries can reduce the pressures on governments and development agencies to improve information systems. Efforts to improve estimates must be accompanied by concerted attempts to address data gaps, common standards for documentation, sharing of data and methods, and regular interaction and collaboration among all groups involved.
The analysis of clinical trials aiming to show symptomatic benefits is often complicated by the ethical requirement for rescue medication when the disease state of patients worsens. In type 2 diabetes trials, patients receive glucose-lowering rescue medications continuously for the remaining trial duration if one of several markers of glycemic control exceeds pre-specified thresholds. This may mask differences in glycemic values between treatment groups, because rescue will occur more frequently in less effective treatment groups. Traditionally, the last pre-rescue medication value was carried forward and analyzed as the end-of-trial value. The deficits of such simplistic single imputation approaches are increasingly recognized by regulatory authorities and trialists. We discuss alternative approaches and evaluate them through a simulation study. When the estimand of interest is the effect attributable to the treatments initially assigned at randomization, our recommendation for estimation and hypothesis testing is to treat data after meeting rescue criteria as deterministically ‘missing’ at random, because initiation of rescue medication is determined by observed in-trial values. An appropriate imputation of values after meeting rescue criteria is then possible either directly through multiple imputation or implicitly with a repeated measures model. Crucially, one needs to jointly impute or model all markers of glycemic control that can lead to the initiation of rescue medication. An alternative, for hypothesis testing only, is to use rank tests with outcomes from patients ‘requiring rescue medication’ ranked worst, and non-rescued patients ranked according to final visit values. However, an appropriate ranking of unobserved values may be controversial. Copyright © 2015 John Wiley & Sons, Ltd.
Missing data may seriously compromise inferences from randomised clinical trials, especially if missing data are not handled appropriately. The potential bias due to missing data depends on the mechanism causing the data to be missing, and the analytical methods applied to amend the missingness. Therefore, the analysis of trial data with missing values requires careful planning and attention.
Intraspecific variation in ploidy occurs in a wide range of species including pathogenic and nonpathogenic eukaryotes such as yeasts and oomycetes. Ploidy can be inferred indirectly - without measuring DNA content - from experiments using next-generation sequencing (NGS). We present nQuire, a statistical framework that distinguishes between diploids, triploids and tetraploids using NGS. The command-line tool models the distribution of base frequencies at variable sites using a Gaussian Mixture Model, and uses maximum likelihood to select the most plausible ploidy model. nQuire handles large genomes at high coverage efficiently and uses standard input file formats.
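The idea of choosing between ploidy models by likelihood can be sketched as follows. This is not the nQuire implementation: the fixed component means, the equal mixture weights, the shared standard deviation of 0.05, and the example frequencies are all assumptions made for illustration.

```python
import math

# Assumed base-frequency peaks for each ploidy model (illustrative only)
PLOIDY_MEANS = {
    "diploid": [0.5],
    "triploid": [1 / 3, 2 / 3],
    "tetraploid": [0.25, 0.5, 0.75],
}


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def log_likelihood(freqs, means, sigma=0.05):
    """Log-likelihood of the frequencies under an equal-weight Gaussian mixture."""
    weight = 1 / len(means)
    return sum(
        math.log(sum(weight * normal_pdf(f, mu, sigma) for mu in means))
        for f in freqs
    )


def most_plausible_ploidy(freqs):
    """Return the model name with the highest log-likelihood, plus all scores."""
    scores = {name: log_likelihood(freqs, means) for name, means in PLOIDY_MEANS.items()}
    return max(scores, key=scores.get), scores


# Hypothetical base frequencies at variable sites
freqs = [0.31, 0.35, 0.36, 0.64, 0.66, 0.69, 0.33, 0.67]
best, scores = most_plausible_ploidy(freqs)
```

With these made-up frequencies clustered around one third and two thirds, the triploid mixture scores highest. A fuller treatment would also estimate weights and variances, for example by expectation-maximization, rather than fixing them.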
Multiple imputation (MI) has been widely used for handling missing data in biomedical research. In the presence of high-dimensional data, regularized regression has been used as a natural strategy for building imputation models, but limited research has been conducted for handling general missing data patterns where multiple variables have missing values. Using the idea of multiple imputation by chained equations (MICE), we investigate two approaches of using regularized regression to impute missing values of high-dimensional data that can handle general missing data patterns. We compare our MICE methods with several existing imputation methods in simulation studies. Our simulation results demonstrate the superiority of the proposed MICE approach based on an indirect use of regularized regression in terms of bias. We further illustrate the proposed methods using two data examples.
The impact of missing data on quantitative research can be serious, leading to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings. In this paper, we discussed and demonstrated three principled missing data methods: multiple imputation, full information maximum likelihood, and expectation-maximization algorithm, applied to a real-world data set. Results were contrasted with those obtained from the complete data set and from the listwise deletion method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on the importance of statistical assumptions, and recommendations for researchers. Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.
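As a rough illustration of how the expectation-maximization algorithm handles missing values, the sketch below estimates the mean and covariance of two variables when some values of the second are missing. It assumes bivariate normal data, a fully observed `x`, and missingness in `y` only; the function name and the toy data are hypothetical and unrelated to the data set analyzed in the paper.

```python
def em_bivariate_normal(pairs, n_iter=500):
    """EM estimates of the mean and covariance of (x, y) when some y are None."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    complete = [(x, y) for x, y in pairs if y is not None]
    k = len(complete)
    # x is fully observed, so its mean and variance never change
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Starting values for the y-related parameters come from the complete cases
    cx = sum(x for x, _ in complete) / k
    mu_y = sum(y for _, y in complete) / k
    s_xy = sum((x - cx) * (y - mu_y) for x, y in complete) / k
    s_yy = sum((y - mu_y) ** 2 for _, y in complete) / k
    for _ in range(n_iter):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy  # Var(y | x) under the current parameters
        sum_y = sum_xy = sum_yy = 0.0
        for x, y in pairs:
            if y is None:
                # E-step: expected y and y^2 given x and the current parameters
                ey = mu_y + beta * (x - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_xy += x * ey
            sum_yy += eyy
        # M-step: update the parameters from the expected sufficient statistics
        mu_y = sum_y / n
        s_xy = sum_xy / n - mu_x * mu_y
        s_yy = sum_yy / n - mu_y ** 2
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


# Toy data: y is missing for the three largest x values
data = [(1, 1.2), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, None), (7, None), (8, None)]
fit = em_bivariate_normal(data)
complete_case_mean_y = sum(y for _, y in data if y is not None) / 5
```

In this toy example the complete-case mean of `y` is 3.04, while the EM estimate is close to 4.5, because EM uses the observed relationship between `x` and `y` to account for the cases that listwise deletion would discard.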
Accuracy of transcript quantification with RNA-Seq is negatively affected by positional fragment bias. This article introduces Mix2 (rd. “mixquare”), a transcript quantification method which uses a mixture of probability distributions to model and thereby neutralize the effects of positional fragment bias. The parameters of Mix2 are trained by Expectation Maximization, resulting in simultaneous transcript abundance and bias estimates. We compare Mix2 to Cufflinks, RSEM, eXpress and PennSeq, state-of-the-art quantification methods implementing some form of bias correction. On four synthetic biases we show that the accuracy of Mix2 overall exceeds the accuracy of the other methods and that its bias estimates converge to the correct solution. We further evaluate Mix2 on real RNA-Seq data from the Microarray and Sequencing Quality Control (MAQC, SEQC) Consortia. On MAQC data, Mix2 achieves improved correlation to qPCR measurements with a relative increase in R² between 4% and 50%. Mix2 also yields repeatable concentration estimates across technical replicates with a relative increase in R² between 8% and 47% and reduced standard deviation across the full concentration range. We further observe more accurate detection of differential expression with a relative increase in true positives between 74% and 378% for 5% false positives. In addition, Mix2 reveals 5 dominant biases in MAQC data deviating from the common assumption of a uniform fragment distribution. On SEQC data, Mix2 yields higher consistency between measured and predicted concentration ratios. A relative error of 20% or less is obtained for 51% of transcripts by Mix2, 40% of transcripts by Cufflinks and RSEM and 30% by eXpress. Titration order consistency is correct for 47% of transcripts for Mix2, 41% for Cufflinks and RSEM and 34% for eXpress. We further observe improved repeatability across laboratory sites with a relative increase in R² between 8% and 44% and reduced standard deviation.
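To show what training abundance estimates by Expectation Maximization can look like in its simplest form, here is a generic sketch that assigns ambiguous reads to transcripts. It is not the Mix2 model: it ignores transcript length and positional fragment bias entirely, and the function name and the read compatibility lists are hypothetical.

```python
def em_abundances(read_compat, n_transcripts, n_iter=100):
    """Estimate relative transcript abundances from ambiguous read assignments."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        for compat in read_compat:
            # E-step: split each read across its compatible transcripts
            # in proportion to the current abundance estimates
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: abundances are the normalized expected read counts
        theta = [c / len(read_compat) for c in counts]
    return theta


# Each entry lists the transcripts a read is compatible with (toy example)
reads = [(0,), (0,), (0, 1), (1,)]
theta = em_abundances(reads, n_transcripts=2)
```

In this toy example the ambiguous read is gradually shifted toward transcript 0, which has more uniquely assigned reads, and the estimates settle at roughly two thirds and one third. A bias-aware method would replace the uniform split in the E-step with fragment probabilities drawn from its bias model.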
Principled methods to appropriately analyze missing data have long existed; however, broad implementation of these methods remains challenging. In this and companion papers, we discuss issues of missing data in the epidemiologic literature. We provide details regarding missing data mechanisms and nomenclature and motivate principled analyses through a detailed comparison of multiple imputation and inverse probability weighting. We do so in the setting of a masked data-analytic challenge with missing data induced by known mechanisms to data from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974. We illustrate the deleterious effects of missing data with naïve methods and show how principled methods can sometimes mitigate such effects. For example when data were missing at random, naïve methods showed a spurious protective effect of smoking on spontaneous abortion, odds ratio (OR) of 0.43 (95% confidence interval, CI: 0.19, 0.93) while implementing principled methods multiple imputation (OR = 1.30, CI: 0.95, 1.77) or augmented inverse probability weighting (OR = 1.40, CI: 1.00, 1.97) provided estimates closer to the “true” full data effect (OR = 1.31, CI: 1.05, 1.64). We call for greater acknowledgement of and attention to missing data and for the broad use of principled missing data methods in epidemiologic research.
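The weighting idea behind inverse probability weighting can be sketched in a few lines. This is plain inverse probability weighting for a mean, not the augmented estimator reported in the abstract, and it assumes the probability of being observed depends only on a single recorded stratum; the function name and the records are hypothetical.

```python
from collections import defaultdict


def ipw_mean(records):
    """Inverse-probability-weighted mean of y; records are (stratum, y or None)."""
    n_total = defaultdict(int)
    n_obs = defaultdict(int)
    for stratum, y in records:
        n_total[stratum] += 1
        if y is not None:
            n_obs[stratum] += 1
    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, y in records:
        if y is None:
            continue
        # Estimated probability of being observed within this stratum
        p_obs = n_obs[stratum] / n_total[stratum]
        weighted_sum += y / p_obs
        weight_total += 1 / p_obs
    return weighted_sum / weight_total


# Toy data: half of the outcomes in stratum "b" are missing
records = [
    ("a", 1.0), ("a", 2.0), ("a", 1.0), ("a", 2.0),
    ("b", 4.0), ("b", 6.0), ("b", None), ("b", None),
]
ipw_estimate = ipw_mean(records)
complete_case_estimate = sum(y for _, y in records if y is not None) / 6
```

Each observed record in stratum "b" counts twice because only half of that stratum was observed, so the weighted mean (3.25) restores the stratum's share of the sample, whereas the complete-case mean (about 2.67) underrepresents it.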