
Concept: Principal component analysis


Malaria transmission is dependent on the propensity of Anopheles mosquitoes to bite humans (anthropophily) instead of other dead-end hosts. Recent increases in the usage of Long Lasting Insecticide Treated Nets (LLINs) in Africa have been associated with reductions in highly anthropophilic and endophilic vectors such as Anopheles gambiae s.s., leaving species with a broader host range, such as Anopheles arabiensis, as the most prominent remaining source of transmission in many settings. An. arabiensis appears to be more of a generalist in terms of its host choice and resting behavior, which may be due to phenotypic plasticity and/or segregating allelic variation. To investigate the genetic basis of host choice and resting behavior in An. arabiensis, we sequenced the genomes of 23 human-fed and 25 cattle-fed mosquitoes collected both indoors and outdoors in the Kilombero Valley, Tanzania. We identified a total of 4,820,851 SNPs, which were used to conduct the first genome-wide estimates of “SNP heritability” for host choice and resting behavior in this species. A genetic component was detected for host choice (human-fed vs. cattle-fed; permuted P = 0.002), but there was no evidence of a genetic component for resting behavior (indoors vs. outdoors; permuted P = 0.465). A principal component analysis (PCA) segregated individuals based on genomic variation into three groups, which were characterized by differences at the 2Rb and/or 3Ra paracentromeric chromosome inversions. There was a non-random distribution of cattle-fed mosquitoes between the PCA clusters, suggesting that alleles linked to the 2Rb and/or 3Ra inversions may influence host choice. Using a novel inversion genotyping assay, we detected a significant enrichment of the standard (non-inverted) arrangement of 3Ra among cattle-fed mosquitoes (N = 129) versus all non-cattle-fed individuals (N = 234; χ2, p = 0.007). Thus, tracking the frequency of the 3Ra inversion in An. arabiensis populations may be of use to infer selection on host choice behavior within these vector populations, possibly in response to vector control. Controlled host-choice assays are needed to discern whether the observed genetic component has a direct relationship with innate host preference. A better understanding of the genetic basis for host feeding behavior in An. arabiensis may also open avenues for novel vector control strategies based on driving genes for zoophily into wild mosquito populations.

Concepts: Gene, Genetics, Malaria, Anopheles, Mosquito, Principal component analysis, Mosquito control, Anopheles gambiae


BACKGROUND: Dengue, a mosquito-borne febrile viral disease, is found in tropical and sub-tropical regions and is now extending its range to temperate regions. The spread of the dengue viruses mainly depends on the vector population (Aedes aegypti and Aedes albopictus), which is influenced by changing climatic conditions and various land-use/land-cover types. A spatial display of the relationship between dengue vector density and land-cover types is required to describe a near-future viral outbreak scenario. This study is aimed at exploring how land-cover types are linked to the behavior of dengue-transmitting mosquitoes. METHODS: Surveys were conducted in 92 villages of Phitsanulok Province, Thailand. The sampling was conducted on three separate occasions in the months of March, May and July. Dengue indices, i.e. container index (C.I.), house index (H.I.) and Breteau index (B.I.), were used to map habitats conducive to dengue vector growth. Spatial epidemiological analysis using bivariate Pearson’s correlation was conducted to evaluate the level of interdependence between larval density and land-use types. Factor analysis using principal component analysis (PCA) with varimax rotation was performed to ascertain the variance among land-use types. Furthermore, the spatial ring method was used to visualize spatially referenced, multivariate and temporal data in a single information graphic. RESULTS: Results of dengue indices showed that the settlements around gasoline stations/workshops and in the vicinity of marsh/swamp and rice paddy appeared to be favorable habitats for dengue vector propagation, with a highly significant positive correlation (p = 0.001) in the month of May. Settlements around the institutional areas were highly significantly and positively correlated (p = 0.01) with H.I. in the month of March. Moreover, dengue indices in the month of March showed a significant and positive correlation (p <= 0.05) with deciduous forest. The H.I. of people living around horticultural land was significantly and positively correlated (p = 0.05) during the month of May, and perennial vegetation showed a highly significant and positive correlation (p = 0.001) in the month of March with C.I. and a significant and positive correlation (p <= 0.05) with B.I. CONCLUSIONS: The study concluded that gasoline stations/workshops, rice paddy, marsh/swamp and deciduous forests played a highly significant role in dengue vector growth. Thus, the spatio-temporal relationships of dengue vector larval density and land-use types may help to predict favorable dengue habitat, and thereby enable public healthcare managers to take precautionary measures to prevent an impending dengue outbreak.
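The three larval indices named in the METHODS can be computed directly from raw survey counts. The sketch below uses the standard entomological definitions, which the abstract itself does not spell out; the function name, parameter names and the counts in the usage line are hypothetical and are not taken from the study.

```python
def larval_indices(houses_inspected, houses_positive,
                   containers_inspected, containers_positive):
    """Return (house index, container index, Breteau index).

    Standard definitions (assumed; the abstract does not state them):
      H.I. = percentage of inspected houses with larvae or pupae
      C.I. = percentage of inspected water-holding containers with larvae or pupae
      B.I. = number of positive containers per 100 houses inspected
    """
    if houses_inspected <= 0 or containers_inspected <= 0:
        raise ValueError("inspected counts must be positive")
    house_index = 100.0 * houses_positive / houses_inspected
    container_index = 100.0 * containers_positive / containers_inspected
    breteau_index = 100.0 * containers_positive / houses_inspected
    return house_index, container_index, breteau_index


# Hypothetical counts for a single village, for illustration only.
hi, ci, bi = larval_indices(houses_inspected=50, houses_positive=10,
                            containers_inspected=200, containers_positive=30)
print(hi, ci, bi)
```

Note that B.I. is a rate per 100 houses rather than a percentage, so it can exceed 100 when houses hold several positive containers.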

Concepts: Mosquito, Factor analysis, Principal component analysis, Correlation and dependence, Pearson product-moment correlation coefficient, Aedes aegypti, Aedes, Dengue fever


Complex diseases are typically caused by combinations of molecular disturbances that vary widely among different patients. Endophenotypes, a combination of genetic factors associated with a disease, offer a simplified approach to dissect complex traits by reducing genetic heterogeneity. Because molecular dissimilarities often exist between patients with indistinguishable disease symptoms, these unique molecular features may reflect pathogenic heterogeneity. To detect molecular dissimilarities among patients and reduce the complexity of high-dimensional data, we have explored an endophenotype-identification analytical procedure that combines non-negative matrix factorization (NMF) and the adjusted Rand index (ARI), a measure of the similarity of two clusterings of a data set. To evaluate this procedure, we compared it with a commonly used method, principal component analysis with k-means clustering (PCA-K). A simulation study with a gene expression dataset and genotype information was conducted to examine the performance of our procedure and PCA-K. The results showed that NMF mostly outperformed PCA-K. Additionally, we applied our endophenotype-identification analytical procedure to a publicly available dataset containing data derived from patients with late-onset Alzheimer’s disease (LOAD). NMF distilled information associated with 1,116 transcripts into three metagenes and three molecular subtypes (MS) for patients in the LOAD dataset: MS1 (n1=80), MS2 (n2=73), and MS3 (n3=23). ARI was then used to determine the most representative transcripts for each metagene; 123, 89, and 71 metagene-specific transcripts were identified for MS1, MS2, and MS3, respectively. These metagene-specific transcripts were identified as the endophenotypes. Our results showed that 14, 38, 0, and 28 candidate susceptibility genes listed in the AlzGene database were found for all patients, MS1, MS2, and MS3, respectively. Moreover, we found that MS2 might be a normal-like subtype. Our proposed procedure provides an alternative approach to investigate the pathogenic mechanism of disease and better understand the relationship between phenotype and genotype.
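Since the abstract relies on the adjusted Rand index as "a measure of the similarity of two clusterings", a small worked implementation may help. The sketch below uses the Hubert-Arabie form of the index, which is an assumption: the abstract does not say which variant the authors used, and the function name and the label lists in the usage lines are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    # Pairs of items that are grouped together in both clusterings.
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each clustering separately.
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreement expected by chance, and the largest attainable agreement.
    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = 0.5 * (together_in_a + together_in_b)
    if maximum == expected:
        # Degenerate case, e.g. both clusterings place every item in one cluster.
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


# Hypothetical labels: the same partition under different label names scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
# A partition that cuts across the first one scores below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is corrected for chance, so unrelated clusterings score near zero and values can be negative.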

Concepts: DNA, Gene, Genetics, Principal component analysis, Machine learning, Object-oriented programming, K-means clustering, Rand index


The 2008-2012 global financial crisis began with the global recession in December 2007 and worsened in September 2008, during which the U.S. stock markets lost 20% of their value from the October 11, 2007 peak. Various studies have reported that financial crises are associated with increases in both the cross-correlations among stocks and stock indices and the level of systemic risk. In this paper, we study 10 different Dow Jones economic sector indexes and, applying principal component analysis (PCA), demonstrate that the rate of increase in principal components with short 12-month time windows can be effectively used as an indicator of systemic risk: the larger the change of PC1, the higher the increase of systemic risk. Clearly, the higher the level of systemic risk, the more likely a financial crisis would occur in the near future.
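One way to read the indicator described above is as the share of total variance captured by the first principal component of the index correlation matrix, tracked over a sliding window. The sketch below illustrates that reading only. It is an assumption, not the authors' procedure: the function names, the window length in observations and the synthetic return series are all hypothetical, and the dominant eigenvalue is approximated with plain power iteration to keep the example dependency-free.

```python
import math


def correlation_matrix(series):
    """Pearson correlation matrix for a list of equal-length return series."""
    n_obs = len(series[0])
    scaled = []
    for s in series:
        mean = sum(s) / n_obs
        dev = [x - mean for x in s]
        norm = math.sqrt(sum(d * d for d in dev))
        if norm == 0:
            raise ValueError("a constant series has no defined correlation")
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]


def pc1_share(series, iterations=500):
    """Approximate fraction of total variance explained by the first PC."""
    corr = correlation_matrix(series)
    k = len(corr)
    # A non-uniform start vector avoids being orthogonal to the dominant
    # eigenvector in simple symmetric cases.
    vec = [1.0 + 0.1 * i for i in range(k)]
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("power iteration collapsed; try another start vector")
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue estimate for the current vector.
    cv = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
    eigenvalue = sum(a * b for a, b in zip(vec, cv)) / sum(a * a for a in vec)
    # The eigenvalues of a correlation matrix sum to the number of series.
    return eigenvalue / k


def rolling_pc1_share(series, window):
    """PC1 share for every window of `window` consecutive observations."""
    n_obs = len(series[0])
    return [pc1_share([s[t:t + window] for s in series])
            for t in range(n_obs - window + 1)]


# Synthetic, purely illustrative "returns" for three hypothetical indexes.
returns = [[math.sin(0.7 * t + phase) for t in range(14)] for phase in (0.0, 0.4, 2.0)]
shares = rolling_pc1_share(returns, window=12)
changes = [later - earlier for earlier, later in zip(shares, shares[1:])]
print(shares, changes)
```

Under this reading, a sustained positive run in `changes` would correspond to the rising PC1 that the abstract associates with increasing systemic risk.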

Concepts: Risk, Principal component analysis, Investment, Financial crisis, Bank run, Stock market, Subprime mortgage crisis, Stock exchange


BACKGROUND: The treatment planning of spine pathologies requires information on the rigidity and permeability of the intervertebral discs (IVDs). Magnetic resonance imaging (MRI) offers great potential as a sensitive and non-invasive technique for describing the mechanical properties of IVDs. However, the literature has reported small correlation coefficients between mechanical properties and MRI parameters. Our hypothesis is that the compressive modulus and the permeability of the IVD can be predicted by a linear combination of MRI parameters. METHODS: Sixty IVDs were harvested from bovine tails and randomly separated into four groups (in-situ, digested-6h, digested-18h, digested-24h). Multi-parametric MRI acquisitions were used to quantify the relaxation times T1 and T2, the magnetization transfer ratio MTR, the apparent diffusion coefficient ADC and the fractional anisotropy FA. Unconfined compression, confined compression and direct permeability measurements were performed to quantify the compressive moduli and the hydraulic permeabilities. Differences between groups were evaluated with a one-way ANOVA. Multilinear regressions were performed between dependent mechanical properties and independent MRI parameters to verify our hypothesis. A principal component analysis was used to convert the set of possibly correlated variables into a set of linearly uncorrelated variables. Agglomerative hierarchical clustering was performed on the 3 principal components. RESULTS: Multilinear regressions showed that 45 to 80% of the Young’s modulus E, the aggregate modulus in absence of deformation HA0, the radial permeability kr and the axial permeability in absence of deformation k0 can be explained by the MRI parameters within both the nucleus pulposus and the annulus fibrosus. The principal component analysis reduced our variables to two principal components with a cumulative variability of 52-65%, which increased to 70-82% when considering the third principal component. The dendrograms showed a natural division into four clusters for the nucleus pulposus and into three or four clusters for the annulus fibrosus. CONCLUSIONS: The compressive moduli and the permeabilities of isolated IVDs can be assessed mostly by MT and diffusion sequences. However, the relationships have to be improved with the inclusion of MRI parameters more sensitive to IVD degeneration. Before this technique is used to quantify the mechanical properties of IVDs in vivo in patients suffering from various diseases, the relationships have to be defined for each degeneration state of the tissue that mimics the pathology. Our MRI protocol, combined with principal component analysis and agglomerative hierarchical clustering, is a promising tool to classify degenerated intervertebral discs and to find biomarkers and predictive factors of the evolution of the pathologies.

Concepts: Nuclear magnetic resonance, Magnetic resonance imaging, Principal component analysis, Diffusion MRI, Pearson product-moment correlation coefficient, Spin echo, Young's modulus, Helium


Principal component (PC) maps, which plot the values of a given PC estimated on the basis of allele frequency variation at the geographic sampling locations of a set of populations, are often used to investigate the properties of past range expansions. Some studies have argued that in a range expansion, the axis of greatest variation (i.e., the first PC) is parallel to the axis of expansion. In contrast, others have identified a pattern in which the axis of greatest variation is perpendicular to the axis of expansion. Here, we seek to understand this difference in outcomes by investigating the effect of the geographic sampling scheme on the direction of the axis of greatest variation under a two-dimensional range expansion model. From datasets simulated using each of two different schemes for the geographic sampling of populations under the model, we create PC maps for the first PC. We find that depending on the geographic sampling scheme, the axis of greatest variation can be either parallel or perpendicular to the axis of expansion. We provide an explanation for this result in terms of intra- and inter-population coalescence times.

Concepts: Genetics, Principal component analysis, Population genetics, Singular value decomposition, Personal computer


This paper demonstrates how multi-scale measures of rugosity, slope and aspect can be derived from fine-scale bathymetric reconstructions created from geo-referenced stereo imagery. We generate three-dimensional reconstructions over large spatial scales using data collected by Autonomous Underwater Vehicles (AUVs), Remotely Operated Vehicles (ROVs), manned submersibles and diver-held imaging systems. We propose a new method for calculating rugosity in a Delaunay triangulated surface mesh by projecting areas onto the plane of best fit using Principal Component Analysis (PCA). Slope and aspect can be calculated with very little extra effort, and fitting a plane serves to decouple rugosity from slope. We compare the results of the virtual terrain complexity calculations with experimental results using conventional in-situ measurement methods. We show that performing calculations over a digital terrain reconstruction is more flexible, robust and easily repeatable. In addition, the method is non-contact and has much less environmental impact than traditional survey techniques. For diver-based surveys, the time underwater needed to collect rugosity data is significantly reduced and, because the technique is based on images, it is possible to use robotic platforms that can operate beyond diver depths. Measurements can be calculated exhaustively at multiple scales for surveys with tens of thousands of images covering thousands of square metres. The technique is demonstrated on data gathered by a diver-rig and an AUV, on small single-transect surveys and on a larger, dense survey that covers over [Formula: see text]. Stereo images provide 3D structure as well as visual appearance, which could potentially feed into automated classification techniques. Our multi-scale rugosity, slope and aspect measures have already been adopted in a number of marine science studies. This paper presents a detailed description of the method and thoroughly validates it against traditional in-situ measurements.
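The idea of measuring rugosity against a PCA plane of best fit can be sketched as the ratio of the mesh's true surface area to its area projected onto that plane. The code below is a minimal illustration under stated assumptions: the plane is fitted to the unweighted mesh vertices, the normal is taken as the direction of least variance, and that direction is approximated with shifted power iteration so that only the standard library is needed (numpy.linalg.eigh would be the more usual choice). All function names and the example mesh are hypothetical, and the paper's actual windowing and weighting are not described in the abstract.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def best_fit_normal(vertices, iterations=2000):
    """Unit normal of the PCA plane of best fit (direction of least variance)."""
    n = len(vertices)
    centroid = [sum(p[i] for p in vertices) / n for i in range(3)]
    cov = [[sum((p[i] - centroid[i]) * (p[j] - centroid[j]) for p in vertices) / n
            for j in range(3)] for i in range(3)]
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    if trace == 0:
        raise ValueError("all vertices coincide; no plane can be fitted")
    # trace*I - cov has the same eigenvectors as cov, but its largest eigenvalue
    # belongs to cov's smallest one, so power iteration finds the plane normal.
    shifted = [[(trace if i == j else 0.0) - cov[i][j] for j in range(3)]
               for i in range(3)]
    vec = [0.3, 0.5, 0.8]  # arbitrary start vector
    for _ in range(iterations):
        nxt = [sum(shifted[i][j] * vec[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def rugosity(vertices, triangles):
    """Surface area of the mesh divided by its area projected onto the fitted plane."""
    normal = best_fit_normal(vertices)
    surface_area = 0.0
    projected_area = 0.0
    for i, j, k in triangles:
        a, b, c = vertices[i], vertices[j], vertices[k]
        area_vec = cross([b[m] - a[m] for m in range(3)],
                         [c[m] - a[m] for m in range(3)])
        surface_area += 0.5 * math.sqrt(sum(x * x for x in area_vec))
        projected_area += 0.5 * abs(sum(x * y for x, y in zip(area_vec, normal)))
    return surface_area / projected_area


# Hypothetical ridge-shaped patch: two faces rising at 45 degrees to a crest.
ridge_vertices = [(0, 0, 0), (1, 0, 1), (2, 0, 0), (0, 2, 0), (1, 2, 1), (2, 2, 0)]
ridge_triangles = [(0, 1, 4), (0, 4, 3), (1, 2, 5), (1, 5, 4)]
print(rugosity(ridge_vertices, ridge_triangles))
```

A perfectly flat patch gives a rugosity of 1 however it is tilted, which is the sense in which fitting a plane decouples rugosity from slope. Under the same assumptions, slope would follow from the angle between the fitted normal and the vertical axis.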

Concepts: Mathematics, Principal component analysis, Demonstration, Depth perception, Plane, Singular value decomposition, Calculation, Autonomous underwater vehicle


Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which combine many data points with a large dimension d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and k-means clustering. We provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and six non-biological datasets (three of which are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI(HA)). We found that when k is close to d, the quality is good (ARI(HA)>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI(HA)>0.9). In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological datasets.
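The objective stated above, the mean squared distance from each point to its nearest center, can be made concrete with the traditional (Lloyd) k-means iteration that the abstract uses as its baseline. The sketch below is that standard baseline only; it is not the PCA-based algorithm the authors propose, and the function names and sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_squared_distance(points, centers):
    """k-means objective: mean squared distance from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points) / len(points)


def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Traditional k-means: alternate assignment and center updates until stable."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(center)  # an empty cluster keeps its old center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = lloyd_kmeans(points, initial_centers=[(0, 0), (10, 10)])
print(centers, mean_squared_distance(points, centers))
```

Each iteration costs on the order of n * k * d distance terms, which is the expense on high-dimensional data that motivates faster variants.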

Concepts: Cluster analysis, Algorithm, Principal component analysis, Machine learning, Computational complexity theory, K-means clustering, Rand index


BACKGROUND: Static posture, repetitive movements and lack of physical variation are known risk factors for work-related musculoskeletal disorders, and thus need to be properly assessed in occupational studies. The aims of this study were (i) to investigate the effectiveness of a conventional exposure variation analysis (EVA) in discriminating exposure time lines and (ii) to compare it with a new cluster-based method for analysis of exposure variation. METHODS: For this purpose, we simulated a repeated cyclic exposure varying within each cycle between “low” and “high” exposure levels in a “near” or “far” range, and with “low” or “high” velocities (exposure change rates). The duration of each cycle was also manipulated by selecting a “small” or “large” standard deviation of the cycle time. These parameters reflected three dimensions of exposure variation, i.e. range, frequency and temporal similarity. Each simulation trace included two realizations of 100 concatenated cycles with either low (rho = 0.1), medium (rho = 0.5) or high (rho = 0.9) correlation between the realizations. These traces were analyzed by conventional EVA and by a novel cluster-based EVA (C-EVA). Principal component analysis (PCA) was applied to the marginal distributions of 1) the EVA of each of the realizations (univariate approach), 2) a combination of the EVA of both realizations (multivariate approach) and 3) C-EVA. The least number of principal components describing more than 90% of variability in each case was selected, and the projection of the marginal distributions along the selected principal components was calculated. A linear classifier was then applied to these projections to discriminate between the simulated exposure patterns, and the accuracy of classified realizations was determined. RESULTS: C-EVA classified exposures more correctly than the univariate and multivariate EVA approaches; classification accuracy was 49%, 47% and 52% for univariate EVA, multivariate EVA and C-EVA, respectively (p < 0.001). All three methods performed poorly in discriminating exposure patterns differing with respect to the variability in cycle time duration. CONCLUSION: While C-EVA had a higher accuracy than conventional EVA, both failed to detect differences in temporal similarity. The data-driven optimality of data reduction and the capability of handling multiple exposure time lines in a single analysis are the advantages of C-EVA.
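The component-selection rule in the METHODS, keeping the least number of principal components that describe more than 90% of variability, reduces to a cumulative sum over the PCA eigenvalues. The sketch below assumes those eigenvalues (component variances) have already been computed; the function name and the example values are hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose variance share exceeds `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        # Strict inequality mirrors the "more than 90%" wording of the abstract.
        if cumulative / total > threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues: the first two components carry 95% of the variance.
print(components_for_variance([7.0, 2.5, 0.3, 0.2]))
```

The data would then be projected onto only that many leading components before the linear classifier is applied.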

Concepts: Multivariate statistics, Factor analysis, Principal component analysis, Exposure, Singular value decomposition, Photography, Linear discriminant analysis, The Unscrambler


OBJECTIVE: To study the psychometric properties of a translated version of the Agency for Healthcare Research and Quality Hospital Survey on Patient Safety Culture (HSOPSC) in the Slovenian setting. DESIGN: A cross-sectional psychometric study including principal component and confirmatory factor analysis. The percentage of positive responses for the 12 dimensions (42 items) of patient safety culture and differences at unit and hospital level were calculated. SETTING: Three acute general hospitals. PARTICIPANTS: Census of clinical and non-clinical staff (n = 976). MAIN OUTCOME MEASURES: Model fit, internal consistency and scale score correlations. RESULTS: Principal component analysis showed a 9-factor model with 39 items would be appropriate for a Slovene sample, but a Satorra-Bentler scaled χ² difference test demonstrated that the 12-factor model fitted Slovene data significantly better. Internal consistency was found to be at an acceptable level. Most of the relationships between patient safety culture dimensions were strong to moderate. The relationship between all 12 dimensions and the patient safety grade was negative. The unit-level dimensions of patient safety were perceived better than the dimensions at the hospital level. CONCLUSION: The original 12-factor model for the HSOPSC was a good fit for a translated version of the instrument for use in the Slovene setting.

Concepts: Hospital, Psychometrics, Factor analysis, Principal component analysis, Confirmatory factor analysis, Safety, Patient safety, Slovenia