SciCombinator

Concept: Estimation theory

173

We present a statistical framework for estimation and application of sample allele frequency spectra from New-Generation Sequencing (NGS) data. In this method, we first estimate the allele frequency spectrum using maximum likelihood. In contrast to previous methods, the likelihood function is calculated using a dynamic programming algorithm and numerically optimized using analytical derivatives. We then use a Bayesian method for estimating the sample allele frequency in a single site, and show how the method can be used for genotype calling and SNP calling. We also show how the method can be extended to various other cases including cases with deviations from Hardy-Weinberg equilibrium. We evaluate the statistical properties of the methods using simulations and by application to a real data set.

Concepts: Statistics, Mathematics, Estimation theory, Maximum likelihood, Computer program, Allele frequency, Bayesian inference, Likelihood function
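
The dynamic programming step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes diploid individuals, Hardy-Weinberg style binomial weighting, and per-individual genotype likelihoods for 0, 1 or 2 copies of the alternate allele. The function name and input layout are illustrative assumptions, not taken from the authors' software.

```python
from math import comb


def sample_allele_count_likelihoods(genotype_likelihoods):
    """Return the likelihood of the data for each sample allele count k = 0..2n.

    genotype_likelihoods: list of (L0, L1, L2) tuples, one per diploid
    individual, where Lg = P(reads | genotype carries g alternate alleles).
    (Hypothetical input format chosen for this sketch.)
    """
    z = [1.0]  # before any individual is added, only k = 0 is possible
    for lik in genotype_likelihoods:
        new_z = [0.0] * (len(z) + 2)  # each individual adds up to 2 alleles
        for k, z_k in enumerate(z):
            for g in range(3):
                # comb(2, g) counts the ordered ways to carry g alternate alleles
                new_z[k + g] += comb(2, g) * lik[g] * z_k
        z = new_z
    total = len(z) - 1  # 2n chromosomes in the sample
    # Divide by the number of ways to place k alternate alleles on 2n chromosomes
    return [z_k / comb(total, k) for k, z_k in enumerate(z)]
```

Each individual is folded in with one pass over the running table, so the cost grows quadratically with the number of individuals rather than exponentially with the number of genotype configurations.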

171

BACKGROUND: Lidar height data collected by the Geoscience Laser Altimeter System (GLAS) from 2002 to 2008 have the potential to form the basis of a globally consistent sample-based inventory of forest biomass. GLAS lidar return data were collected globally in spatially discrete full waveform “shots,” which have been shown to be strongly correlated with aboveground forest biomass. Relationships observed at spatially coincident field plots may be used to model biomass at all GLAS shots, and well-established methods of model-based inference may then be used to estimate biomass and variance for specific spatial domains. However, the spatial pattern of GLAS acquisition is neither random across the surface of the earth nor is it identifiable with any particular systematic design. Undefined sample properties therefore hinder the use of GLAS in global forest sampling. RESULTS: We propose a method of identifying a subset of the GLAS data which can justifiably be treated as a simple random sample in model-based biomass estimation. The relatively uniform spatial distribution and locally arbitrary positioning of the resulting sample is similar to the design used by the US national forest inventory (NFI). We demonstrated model-based estimation using a sample of GLAS data in the US state of California, where our estimate of biomass (211 Mg/hectare) was within the 1.4% standard error of the design-based estimate supplied by the US NFI. The standard error of the GLAS-based estimate was significantly higher than that of the NFI estimate, although the cost of the GLAS estimate (excluding costs for the satellite itself) was almost nothing, compared to at least US$10.5 million for the NFI estimate. CONCLUSIONS: Global application of model-based estimation using GLAS, while demanding significant consolidation of training data, would improve inter-comparability of international biomass estimates by imposing consistent methods and a globally coherent sample frame. The methods presented here constitute a globally extensible approach for generating a simple random sample from the global GLAS dataset, enabling its use in forest inventory activities.

Concepts: Statistics, Variance, Mathematics, Simple random sample, Sample size, Estimation theory, Estimator, Sampling

164

Advances in the development of micro-electromechanical systems (MEMS) have made possible the fabrication of cheap, small accelerometers and gyroscopes, which are being used in many applications where global positioning system (GPS) and inertial navigation system (INS) integration is carried out, e.g., identifying track defects, terrestrial and pedestrian navigation, unmanned aerial vehicles (UAVs), and stabilization of many platforms. Although these MEMS sensors are low-cost, they exhibit various errors, which degrade the accuracy of navigation systems in a short period of time. Therefore, suitable modeling of these errors is necessary in order to minimize them and, consequently, improve system performance. In this work, the techniques currently most used to analyze the stochastic errors that affect these sensors are shown and compared: we examine in detail the autocorrelation, the Allan variance (AV) and the power spectral density (PSD) techniques. Subsequently, an analysis and modeling of the inertial sensors, which combines autoregressive (AR) filters and wavelet de-noising, is also carried out. Since a low-cost INS (MEMS grade) presents error sources with short-term (high-frequency) and long-term (low-frequency) components, we introduce a method that compensates for these error terms by doing a complete analysis of Allan variance, wavelet de-noising and the selection of the level of decomposition for a suitable combination of these techniques. Finally, in order to assess the stochastic models obtained with these techniques, the Extended Kalman Filter (EKF) of a loosely-coupled GPS/INS integration strategy is augmented with different states. Results show a comparison between the proposed method and the traditional sensor error models under GPS signal blockages using real data collected in urban roadways.

Concepts: Estimation theory, Signal processing, Inertial navigation system, Accelerometer, Global Positioning System, Dead reckoning, Autocorrelation, Unmanned aerial vehicle
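
As a rough illustration of the Allan variance technique named in the abstract above, the sketch below computes the standard non-overlapping Allan variance for one cluster size. It assumes evenly sampled sensor output; the function name and arguments are illustrative and do not come from the paper.

```python
def allan_variance(samples, cluster_size):
    """Non-overlapping Allan variance for one averaging window.

    samples: evenly spaced sensor readings (e.g. gyro rate output).
    cluster_size: number of consecutive samples averaged per cluster, so the
    averaging time is cluster_size divided by the sampling rate.
    """
    n_clusters = len(samples) // cluster_size
    if n_clusters < 2:
        raise ValueError("need at least two full clusters")
    # Average the signal over consecutive, non-overlapping clusters
    means = [
        sum(samples[i * cluster_size:(i + 1) * cluster_size]) / cluster_size
        for i in range(n_clusters)
    ]
    # Half the mean squared difference between adjacent cluster averages
    squared_diffs = [(b - a) ** 2 for a, b in zip(means, means[1:])]
    return sum(squared_diffs) / (2 * (n_clusters - 1))
```

Evaluating this over a range of cluster sizes and plotting the square root against averaging time on log-log axes gives the usual Allan deviation curve, whose slopes are read off to identify noise terms.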

163

The problem of determining the optimal geometric configuration of a sensor network that will maximize the range-related information available for multiple target positioning is of key importance in a multitude of application scenarios. In this paper, a set of sensors that measures the distances between the targets and each of the receivers is considered, assuming that the range measurements are corrupted by white Gaussian noise, in order to search for the formation that maximizes the accuracy of the target estimates. Using tools from estimation theory and convex optimization, the problem is converted into that of maximizing, by proper choice of the sensor positions, a convex combination of the logarithms of the determinants of the Fisher Information Matrices corresponding to each of the targets in order to determine the sensor configuration that yields the minimum possible covariance of any unbiased target estimator. Analytical and numerical solutions are well defined and it is shown that the optimal configuration of the sensors depends explicitly on the constraints imposed on the sensor configuration, the target positions, and the probabilistic distributions that define the prior uncertainty in each of the target positions. Simulation examples illustrate the key results derived.

Concepts: Mathematics, Estimation theory, Estimator, Maximum likelihood, Signal processing, Optimization, Sensor, Wireless sensor network
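
The objective described in the abstract above can be made concrete with a small sketch of the log-determinant of the Fisher Information Matrix for a single target. It assumes a 2D setting, a known target position, and independent range measurements with a common noise standard deviation. These simplifications and the function name are assumptions for illustration; the paper itself handles multiple targets and prior uncertainty.

```python
import math


def range_fim_logdet(sensors, target, sigma):
    """Log-determinant of the 2D Fisher Information Matrix for range-only
    measurements with i.i.d. Gaussian noise of standard deviation sigma.

    sensors: list of (x, y) sensor positions; target: (x, y) position.
    Under these assumptions FIM = (1 / sigma^2) * sum_i u_i u_i^T, where u_i
    is the unit vector from sensor i to the target.
    """
    a = b = d = 0.0  # entries of the symmetric matrix [[a, b], [b, d]]
    for sx, sy in sensors:
        dx, dy = target[0] - sx, target[1] - sy
        r = math.hypot(dx, dy)
        if r == 0.0:
            raise ValueError("sensor coincides with target")
        ux, uy = dx / r, dy / r
        a += ux * ux
        b += ux * uy
        d += uy * uy
    det = (a * d - b * b) / sigma ** 4
    # Collinear geometry gives a singular matrix: no information in one direction
    return math.log(det) if det > 0.0 else float("-inf")
```

For example, two sensors placed at right angles as seen from the target score higher than two sensors on the same line through the target, which matches the intuition that angular spread improves positioning.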

32

Therapeutic substitution has the potential to decrease pharmaceutical expenditures and improve the efficiency of the health care system.

Concepts: Health care, Health economics, Medicine, Healthcare, Health, Estimation theory, Economics, Potential

28

Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao’s estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics (‘Hill diversities’), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao’s estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity. The ISME Journal advance online publication, 14 February 2013; doi:10.1038/ismej.2013.10.

Concepts: Statistics, Mathematics, Estimation theory, Estimator, Approximation, Estimation, Microorganism, Robust statistics
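
The quantities named in the abstract above can be sketched as follows. The code shows the classic Chao1 richness estimator and naive plug-in Hill diversities computed from observed counts. It does not implement the lower and upper bounds constructed in the paper; the function names are illustrative assumptions.

```python
import math


def chao1(counts):
    """Chao1 lower-bound estimate of species richness from abundance counts."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    if f2 == 0:
        # Bias-corrected form, used here to avoid division by zero
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)


def hill_diversity(counts, q):
    """Plug-in Hill diversity of order q (effective number of species).

    q = 0 gives observed richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index.
    """
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))
```

Orders 1 and 2 weight abundant species more heavily than order 0, which is one way to see why they are less sensitive to the unseen rare tail than richness.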

28

Gompertz-related distributions have dominated mortality studies for 187 years. However, nonrelated distributions also fit well to mortality data. These compete with the Gompertz and Gompertz-Makeham distributions when applied to data with varying extents of truncation, with no consensus as to preference. In contrast, Gaussian-related distributions are rarely applied, despite the fact that Lexis in 1879 suggested that the normal distribution itself fits well to the right of the mode. Study aims were therefore to compare skew-t fits to Human Mortality Database data with those of Gompertz-nested distributions, by implementing maximum likelihood estimation functions (mle2, R package bbmle; coding given). Results showed skew-t fits obtained lower Bayesian information criterion values than Gompertz-nested distributions, applied to low-mortality country data, including 1711 and 1810 cohorts. As Gaussian-related distributions have now been found to have almost universal application to error theory, one conclusion could be that a Gaussian-related distribution might replace Gompertz-related distributions as the basis for mortality studies.

Concepts: Estimation theory, Maximum likelihood, Ronald Fisher, Normal distribution, Probability density function, Likelihood function, Uniform distribution
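
The model comparison in the abstract above rests on the Bayesian information criterion, which can be sketched in a few lines. The paper's own fits use mle2 in R; the Python helper below and its argument names are illustrative assumptions only.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L).

    log_likelihood: maximized log-likelihood of the fitted model.
    n_params: number of free parameters k; n_obs: number of observations n.
    Lower values indicate the preferred model.
    """
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```

With equal maximized log-likelihoods, the model with more parameters receives the higher (worse) value, so a richer distribution is preferred only when its fit improves enough to offset the penalty.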

28

In this paper, we propose a class of multivariate random effects models allowing for the inclusion of study-level covariates to carry out meta-analyses. As existing algorithms for computing maximum likelihood estimates often converge poorly or may not converge at all when the random effects are multi-dimensional, we develop an efficient expectation-maximization algorithm for fitting multi-dimensional random effects regression models. In addition, we also develop a new methodology for carrying out variable selection with study-level covariates. We examine the performance of the proposed methodology via a simulation study. We apply the proposed methodology to analyze metadata from 26 studies involving statins as a monotherapy and in combination with ezetimibe. In particular, we compare the low-density lipoprotein cholesterol-lowering efficacy of monotherapy and combination therapy on two patient populations (naïve and non-naïve patients to statin monotherapy at baseline), controlling for aggregate covariates. The proposed methodology is quite general and can be applied in any meta-analysis setting for a wide range of scientific applications and therefore offers new analytic methods of clinical importance. Copyright © 2012 John Wiley & Sons, Ltd.

Concepts: Estimation theory, Atherosclerosis, Statin, Niacin, Mevalonate pathway, Maximum likelihood, Machine learning, Ezetimibe

28

Recently, phylogenetics has expanded to routinely include estimation of clade ages in addition to their relationships. Various dating methods have been used, but their relative performance remains understudied. Here, we generate and assemble an extensive phylogenomic data set for squamate reptiles (lizards and snakes) and evaluate two widely used dating methods, penalized likelihood in r8s (r8s-PL) and Bayesian estimation with uncorrelated relaxed rates among lineages (BEAST). We obtained sequence data from 25 nuclear loci (∼500-1000bp per gene; 19,020bp total) for 64 squamate species and nine outgroup taxa, estimated the phylogeny, and estimated divergence dates using 14 fossil calibrations. We then evaluated how well each method approximated these dates using random subsets of the nuclear loci (2, 5, 10, 15, and 20; replicated 10 times each), and using ∼1kb of the mitochondrial ND2 gene. We find that estimates from r8s-PL based on 2, 5, or 10 loci can differ considerably from those based on 25 loci (mean absolute value of differences between 2-locus and 25-locus estimates was 9.0Myr). Estimates from BEAST are somewhat more consistent given limited sampling of loci (mean absolute value of differences between 2-locus and 25-locus estimates was 5.0Myr). Most strikingly, age estimates using r8s-PL for ND2 were ∼68-82Myr older (mean=73.1) than those using 25 nuclear loci with r8s-PL. These results show that dates from r8s-PL with a limited number of loci (and especially mitochondrial data) can differ considerably from estimates derived from a large number of nuclear loci, whereas estimates from BEAST derived from fewer nuclear loci or mitochondrial data alone can be surprisingly similar to those from many nuclear loci. However, estimates from BEAST using relatively few loci and mitochondrial data could still show substantial deviations from the full data set (>50Myr), suggesting the benefits of sampling many nuclear loci. Finally, we found that confidence intervals on ages from BEAST were not significantly different when sampling 2 vs. 25 loci, suggesting that adding loci decreased errors but did not increase confidence in those estimates.

Concepts: Statistics, Mathematics, Estimation theory, Estimator, Phylogenetic nomenclature, Phylogenetics, Squamata, Estimation

28

The paper investigates approaches for loosely coupled GPS/INS integration. Error performance is calculated using a reference trajectory. A performance improvement can be obtained by exploiting additional map information (for example, a road boundary). A constrained solution has been developed and its performance compared with an unconstrained one. The case of GPS outages is also investigated showing how a Kalman filter that operates on the last received GPS position and velocity measurements provides a performance benefit. Results are obtained by means of simulation studies and real data.

Concepts: Estimation theory, Signal processing, Derivative, Solution, Kalman filter, Filtering problem, Limit, Wiener filter
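
The outage behaviour described in the abstract above can be illustrated with a deliberately simplified Kalman filter. It assumes a 1D constant-velocity motion model in place of a real INS mechanization, and GPS measurements of both position and velocity. All names and noise values are hypothetical and are not taken from the paper.

```python
def predict(x, P, dt, q):
    """Propagate state x = [pos, vel] and 2x2 covariance P over dt.

    Uses F = [[1, dt], [0, 1]] and process noise Q = diag(0, q).
    """
    pos, vel = x
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    x_new = [pos + vel * dt, vel]
    P_new = [
        [p00 + dt * (p01 + p10) + dt * dt * p11, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return x_new, P_new


def update(x, P, z, r_pos, r_vel):
    """Correct the state with a GPS fix z = (pos, vel); H is the identity."""
    s00, s01 = P[0][0] + r_pos, P[0][1]
    s10, s11 = P[1][0], P[1][1] + r_vel
    det = s00 * s11 - s01 * s10
    i00, i01, i10, i11 = s11 / det, -s01 / det, -s10 / det, s00 / det
    # Kalman gain K = P * S^-1
    k00 = P[0][0] * i00 + P[0][1] * i10
    k01 = P[0][0] * i01 + P[0][1] * i11
    k10 = P[1][0] * i00 + P[1][1] * i10
    k11 = P[1][0] * i01 + P[1][1] * i11
    y0, y1 = z[0] - x[0], z[1] - x[1]  # innovation
    x_new = [x[0] + k00 * y0 + k01 * y1, x[1] + k10 * y0 + k11 * y1]
    # P_new = (I - K) * P
    a00, a01, a10, a11 = 1.0 - k00, -k01, -k10, 1.0 - k11
    P_new = [
        [a00 * P[0][0] + a01 * P[1][0], a00 * P[0][1] + a01 * P[1][1]],
        [a10 * P[0][0] + a11 * P[1][0], a10 * P[0][1] + a11 * P[1][1]],
    ]
    return x_new, P_new


def run_filter(x, P, measurements, dt=1.0, q=0.1, r_pos=1.0, r_vel=0.1):
    """Run the filter; a measurement of None marks a GPS outage epoch."""
    history = []
    for z in measurements:
        x, P = predict(x, P, dt, q)
        if z is not None:
            x, P = update(x, P, z, r_pos, r_vel)
        history.append((x, P))
    return history
```

During an outage only the prediction step runs, so the position keeps advancing with the velocity estimated from the last fix while the covariance grows until GPS returns.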