SciCombinator

K Sikorska, E Lesaffre, PF Groenen and PH Eilers
Abstract
Background Genome-wide association studies have become very popular in identifying genetic contributions to phenotypes. Millions of SNPs are being tested for their association with diseases and traits using linear or logistic regression models. This conceptually simple strategy encounters the following computational issues: a large number of tests and very large genotype files (many gigabytes) which cannot be directly loaded into the software memory. One of the solutions applied on a grand scale is cluster computing involving large-scale resources. We show how to speed up the computations using matrix operations in pure R code.

Results We improve speed: computation time is reduced from 6 hours to 10-15 minutes. Our approach can handle an essentially unlimited number of covariates efficiently, using projections. Data files in GWAS are vast and reading them into computer memory becomes an important issue. However, much improvement can be made if the data is structured beforehand in a way allowing for easy access to blocks of SNPs. We propose several solutions based on the R packages ff and ncdf. We adapted the semi-parallel computations for logistic regression. We show that in a typical GWAS setting, where SNP effects are very small, we do not lose any precision and our computations are a few hundred times faster than standard procedures.

Conclusions We provide very fast algorithms for GWAS written in pure R code. We also show how to rearrange SNP data for fast access.
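
The projection idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract describes pure R code; the block below is a Python/NumPy illustration of the same linear algebra, not the authors' implementation. The function name `semi_parallel_linear` and all variable names are hypothetical. It assumes complete data, a covariate matrix that already contains an intercept column and has full column rank, and no SNP that is constant after adjusting for covariates. The phenotype and all SNPs in a block are first projected onto the orthogonal complement of the covariates; each per-SNP regression then reduces to column sums, so a whole block is handled with a few matrix products instead of one model fit per SNP.

```python
import numpy as np


def semi_parallel_linear(y, S, X):
    """Per-SNP linear regression y ~ X + s_j for every column s_j of S.

    y : (n,) phenotype
    S : (n, m) block of SNP genotypes
    X : (n, k) covariates, including an intercept column (assumption)
    Returns the SNP effect estimates and their standard errors, each (m,).
    """
    n, k = X.shape
    # Orthonormal basis of the covariate space.
    Q, _ = np.linalg.qr(X)
    # Remove the covariate contribution from the phenotype and from all SNPs at once.
    y_res = y - Q @ (Q.T @ y)
    S_res = S - Q @ (Q.T @ S)
    # Column-wise sums replace m separate regressions.
    sxx = np.einsum("ij,ij->j", S_res, S_res)
    sxy = S_res.T @ y_res
    beta = sxy / sxx
    # Residual sum of squares of each full model, with n - k - 1 degrees of freedom.
    rss = y_res @ y_res - beta * sxy
    se = np.sqrt(rss / ((n - k - 1) * sxx))
    return beta, se


if __name__ == "__main__":
    # Simulated data, for illustration only.
    rng = np.random.default_rng(1)
    n, m = 500, 1000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    S = rng.integers(0, 3, size=(n, m)).astype(float)
    y = X @ rng.normal(size=4) + rng.normal(size=n)
    beta, se = semi_parallel_linear(y, S, X)
    print(beta.shape, se.shape)
```

Because the covariates enter only through the projection, adding more of them changes the one-off QR step but not the per-SNP work, which is consistent with the abstract's remark about handling many covariates efficiently.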
Concepts
Data, Logistic regression, Computational complexity theory, Computer science, Genome-wide association study, Computation, Computing, Computer