Concept: Genome-wide association study
Background Despite evidence that genetic factors contribute to the duration of gestation and the risk of preterm birth, robust associations with genetic variants have not been identified. We used large data sets that included the gestational duration to determine possible genetic associations. Methods We performed a genomewide association study in a discovery set of samples obtained from 43,568 women of European ancestry using gestational duration as a continuous trait and term or preterm (<37 weeks) birth as a dichotomous outcome. We used samples from three Nordic data sets (involving a total of 8643 women) to test for replication of genomic loci that had significant genomewide association (P<5.0×10⁻⁸) or an association with suggestive significance (P<1.0×10⁻⁶) in the discovery set. Results In the discovery and replication data sets, four loci (EBF1, EEFSEC, AGTR2, and WNT4) were significantly associated with gestational duration. Functional analysis showed that an implicated variant in WNT4 alters the binding of the estrogen receptor. The association between variants in ADCY5 and RAP2C and gestational duration had suggestive significance in the discovery set and significant evidence of association in the replication sets; these variants also showed genomewide significance in a joint analysis. Common variants in EBF1, EEFSEC, and AGTR2 showed association with preterm birth with genomewide significance. An analysis of mother-infant dyads suggested that these variants act at the level of the maternal genome. Conclusions In this genomewide association study, we found that variants at the EBF1, EEFSEC, AGTR2, WNT4, ADCY5, and RAP2C loci were associated with gestational duration and variants at the EBF1, EEFSEC, and AGTR2 loci with preterm birth. Previously established roles of these genes in uterine development, maternal nutrition, and vascular control support their mechanistic involvement. (Funded by the March of Dimes and others.)
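As a rough illustration of the two-tier screening described in the abstract above, the following Python sketch sorts discovery-stage P values into the genomewide tier (P<5.0×10⁻⁸) and the suggestive tier (P<1.0×10⁻⁶). The two cut-offs come from the abstract; the function names, the locus labels and the P values are hypothetical and serve only as an example. The 5.0×10⁻⁸ cut-off is commonly motivated as a Bonferroni correction of 0.05 for about one million independent tests.

```python
# Thresholds as stated in the abstract.
GENOMEWIDE = 5.0e-8
SUGGESTIVE = 1.0e-6


def classify(p_value):
    """Label a discovery-stage P value by significance tier."""
    if p_value < GENOMEWIDE:
        return "genomewide"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not carried forward"


def select_for_replication(discovery):
    """Keep only loci that reach at least the suggestive tier."""
    return {
        locus: classify(p)
        for locus, p in discovery.items()
        if p < SUGGESTIVE
    }


# Hypothetical P values for illustration; not results from the study.
discovery = {"locus_A": 3.0e-9, "locus_B": 4.0e-7, "locus_C": 2.0e-4}
carried = select_for_replication(discovery)
print(carried)
```

Only loci in the first two tiers would go on to the replication data sets; a locus in the suggestive tier still needs replication evidence, or a joint analysis, before it is reported as associated.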
As our understanding of genetics has improved, genome-wide association studies (GWAS) have identified numerous variants associated with lifestyle behaviours and health outcomes. However, what is sometimes overlooked is the possibility that genetic variants identified in GWAS of disease might reflect the effect of modifiable risk factors as well as direct genetic effects. We discuss this possibility with illustrative examples from tobacco and alcohol research, in which genetic variants that predict behavioural phenotypes have been seen in GWAS of diseases known to be causally related to these behaviours. This consideration has implications for the interpretation of GWAS findings.
A genome-wide polygenic score (GPS), derived from a 2013 genome-wide association study (N=127,000), explained 2% of the variance in total years of education (EduYears). In a follow-up study (N=329,000), a new EduYears GPS explains up to 4%. Here, we tested the association between this latest EduYears GPS and educational achievement scores at ages 7, 12 and 16 in an independent sample of 5825 UK individuals. We found that EduYears GPS explained greater amounts of variance in educational achievement over time, up to 9% at age 16, accounting for 15% of the heritable variance. This is the strongest GPS prediction to date for quantitative behavioral traits. Individuals in the highest and lowest GPS septiles differed by a whole school grade at age 16. Furthermore, EduYears GPS was associated with general cognitive ability (~3.5%) and family socioeconomic status (~7%). There was no evidence of an interaction between EduYears GPS and family socioeconomic status on educational achievement or on general cognitive ability. These results are a harbinger of future widespread use of GPS to predict genetic risk and resilience in the social and behavioral sciences. Molecular Psychiatry advance online publication, 19 July 2016; doi:10.1038/mp.2016.107.
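A polygenic score of the kind discussed above is usually built as a weighted sum of allele counts, with weights taken from GWAS effect estimates, and "variance explained" is then the squared correlation between the score and the outcome. The Python sketch below shows that arithmetic under those assumptions. The abstract does not describe the exact construction used for the EduYears GPS, and every weight, genotype and outcome value here is made up for illustration.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts (0, 1 or 2 per SNP) for one individual."""
    return sum(w * g for w, g in zip(weights, allele_counts))


def variance_explained(scores, outcomes):
    """Squared Pearson correlation between score and outcome."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_o = sum(outcomes) / n
    s_so = sum((s - mean_s) * (o - mean_o) for s, o in zip(scores, outcomes))
    s_ss = sum((s - mean_s) ** 2 for s in scores)
    s_oo = sum((o - mean_o) ** 2 for o in outcomes)
    return (s_so * s_so) / (s_ss * s_oo)


# Hypothetical per-SNP weights and genotypes (rows are individuals).
weights = [0.5, -0.25, 1.0]
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 0]]
scores = [polygenic_score(g, weights) for g in genotypes]

# Hypothetical outcome values, used only to show the R-squared calculation.
outcomes = [3.0, 1.0, 2.0, 1.5]
print(scores, variance_explained(scores, outcomes))
```

In practice the weights are estimated in one sample and the score is evaluated in an independent one, as the abstract describes, so that the variance explained is not inflated by overfitting.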
Crohn’s disease (CD) is a complex disorder resulting from the interaction of intestinal microbiota with the host immune system in genetically susceptible individuals. The largest meta-analysis of genome-wide association to date identified 71 CD-susceptibility loci in individuals of European ancestry. An important epidemiological feature of CD is that it is 2-4 times more prevalent among individuals of Ashkenazi Jewish (AJ) descent compared to non-Jewish Europeans (NJ). To explore genetic variation associated with CD in AJs, we conducted a genome-wide association study (GWAS) by combining raw genotype data across 10 AJ cohorts consisting of 907 cases and 2,345 controls in the discovery stage, followed up by a replication study in 971 cases and 2,124 controls. We confirmed genome-wide significant associations of 9 known CD loci in AJs and replicated 3 additional loci with strong signal (p<5×10⁻⁶). Novel signals detected among AJs were mapped to chromosomes 5q21.1 (rs7705924, combined p = 2×10⁻⁸; combined odds ratio OR = 1.48), 2p15 (rs6545946, p = 7×10⁻⁹; OR = 1.16), 8q21.11 (rs12677663, p = 2×10⁻⁸; OR = 1.15), 10q26.3 (rs10734105, p = 3×10⁻⁸; OR = 1.27), and 11q12.1 (rs11229030, p = 8×10⁻⁹; OR = 1.15), implicating biologically plausible candidate genes, including RPL7, CPAMD8, PRG2, and PRG3. In all, the 16 replicated and newly discovered loci, in addition to the three coding NOD2 variants, accounted for 11.2% of the total genetic variance for CD risk in the AJ population. This study demonstrates the complementary value of genetic studies in the Ashkenazim.
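For readers unfamiliar with the odds ratios quoted above, the following Python sketch computes an allelic odds ratio and an approximate 95% interval (Woolf's log-scale method) from a 2×2 table of allele counts. This is the textbook calculation, not necessarily the estimation procedure used in the study, which the abstract does not specify. The function name and the counts are hypothetical.

```python
import math


def allelic_odds_ratio(case_alt, case_ref, control_alt, control_ref, z=1.96):
    """Return (odds ratio, lower bound, upper bound) from allele counts.

    The interval is built on the log scale (Woolf's method); z=1.96
    gives an approximate 95% interval.
    """
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    se_log = math.sqrt(
        1 / case_alt + 1 / case_ref + 1 / control_alt + 1 / control_ref
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical allele counts; not taken from the study.
odds_ratio, lower, upper = allelic_odds_ratio(20, 80, 10, 90)
print(odds_ratio, lower, upper)
```

An odds ratio above 1 means the counted allele is more frequent among cases than controls; an interval that excludes 1 corresponds roughly to a nominally significant association.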
Advances in sequencing technology and genome-wide association studies are now revealing the complex interactions between hosts and pathogen through genomic variation signatures, which arise from evolutionary co-existence.
This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK’s largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online <1> under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.
Background Genome-wide association studies have become very popular in identifying genetic contributions to phenotypes. Millions of SNPs are being tested for their association with diseases and traits using linear or logistic regression models. This conceptually simple strategy encounters the following computational issues: a large number of tests and very large genotype files (many gigabytes) which cannot be directly loaded into the software memory. One of the solutions applied on a grand scale is cluster computing involving large-scale resources. We show how to speed up the computations using matrix operations in pure R code. Results We improve speed: computation time is reduced from 6 hours to 10-15 minutes. Our approach can handle an essentially unlimited number of covariates efficiently, using projections. Data files in GWAS are vast and reading them into computer memory becomes an important issue. However, much improvement can be made if the data is structured beforehand in a way allowing for easy access to blocks of SNPs. We propose several solutions based on the R packages ff and ncdf. We adapted the semi-parallel computations for logistic regression. We show that in a typical GWAS setting, where SNP effects are very small, we do not lose any precision and our computations are a few hundred times faster than standard procedures. Conclusions We provide very fast algorithms for GWAS written in pure R code. We also show how to rearrange SNP data for fast access.
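The idea of replacing per-SNP model fits with matrix operations and projections can be sketched as follows. This is not the authors' R code: it is a NumPy illustration of one standard way to do it for the linear-regression case, where the covariates are first projected out of both the phenotype and every SNP, after which each SNP's effect reduces to a ratio of dot products. The function names and the simulated data are assumptions made for the example, and the logistic-regression adaptation mentioned in the abstract is not covered.

```python
import numpy as np


def residualize(M, C):
    """Remove from M (vector or matrix) the part explained by covariates C."""
    coef, *_ = np.linalg.lstsq(C, M, rcond=None)
    return M - C @ coef


def gwas_linear(G, y, C):
    """Per-SNP effect sizes and t statistics for y ~ C + SNP, all SNPs at once.

    G: n x m genotype matrix, y: length-n phenotype,
    C: n x k covariate matrix that includes an intercept column.
    """
    n, k = C.shape
    Gr = residualize(G, C)               # SNPs with covariates projected out
    yr = residualize(y, C)               # phenotype with covariates projected out
    sxx = np.einsum("ij,ij->j", Gr, Gr)  # per-SNP sum of squares
    sxy = Gr.T @ yr                      # per-SNP cross product with phenotype
    beta = sxy / sxx
    rss = yr @ yr - beta * sxy           # residual sum of squares per SNP
    se = np.sqrt(rss / (n - k - 1) / sxx)
    return beta, beta / se


# Simulated data for illustration only.
rng = np.random.default_rng(0)
n, m = 200, 5
G = rng.integers(0, 3, size=(n, m)).astype(float)
C = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 0.3 * G[:, 0] + C @ np.array([1.0, 0.5]) + rng.normal(size=n)
beta, t_stat = gwas_linear(G, y, C)
print(beta, t_stat)
```

Because the covariates are handled once, in the projection step, adding more of them does not change the per-SNP cost, which is consistent with the abstract's remark about handling many covariates efficiently. For genotype files too large for memory, the same function could be applied to one block of SNP columns at a time.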
We performed a genome-wide association study (GWAS) and a multistage meta-analysis of type 2 diabetes (T2D) in Punjabi Sikhs from India. Our discovery GWAS in 1,616 individuals (842 case subjects) was followed by in silico replication of the top 513 independent SNPs (P < 10⁻³) in Punjabi Sikhs (n = 2,819; 801 case subjects). We further replicated 66 single nucleotide polymorphisms (SNPs) (P < 10⁻⁴) through genotyping in a Punjabi Sikh sample (n = 2,894; 1,711 case subjects). On combined meta-analysis in Sikh populations (n = 7,329; 3,354 case subjects), we identified a novel locus in association with T2D at 13q12 represented by a directly genotyped intronic SNP (rs9552911, P = 1.82 × 10⁻⁸) in the SGCG gene. Next, we undertook in silico replication (stage 2b) of the top 513 signals (P < 10⁻³) in 29,157 non-Sikh South Asians (10,971 case subjects) and de novo genotyping of up to 31 top signals (P < 10⁻⁴) in 10,817 South Asians (5,157 case subjects) (stage 3b). In combined South Asian meta-analysis, we observed six suggestive associations (P < 10⁻⁵ to < 10⁻⁷), including SNPs at HMG1L1/CTCFL, PLXNA4, SCAP, and chr5p11. Further evaluation of 31 top SNPs in 33,707 East Asians (16,746 case subjects) (stage 3c) and 47,117 Europeans (8,130 case subjects) (stage 3d), and joint meta-analysis of 128,127 individuals (44,358 case subjects) from 27 multiethnic studies, did not reveal any additional loci nor was there any evidence of replication for the new variant. Our findings provide new evidence on the presence of a population-specific signal in relation to T2D, which may provide additional insights into T2D pathogenesis.
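The abstract above does not say which meta-analysis model was used to combine the stages. A common choice for pooling per-study effect estimates is the inverse-variance weighted fixed-effect model, sketched below in Python as an assumption rather than a description of the study's method. The function name and the per-stage estimates are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effect meta-analysis.

    Returns the pooled effect, its standard error, the z score and a
    two-sided P value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value


# Hypothetical per-stage log odds ratios and standard errors.
stage_betas = [0.20, 0.15, 0.25]
stage_ses = [0.08, 0.06, 0.10]
pooled, pooled_se, z, p_value = fixed_effect_meta(stage_betas, stage_ses)
print(pooled, pooled_se, z, p_value)
```

Stages with smaller standard errors, typically the larger samples, receive more weight. A fixed-effect model assumes one shared effect across stages, which is a strong assumption when a signal may be population-specific, as the abstract suggests for the 13q12 locus.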
Next-Generation Sequencing (NGS) technologies and Genome-Wide Association Studies (GWAS) generate millions of reads and hundreds of datasets, and there is an urgent need for a better way to accurately interpret and distill such large amounts of data. Extensive pathway and network analysis allows for the discovery of highly significant pathways from a set of disease vs. healthy samples in NGS and GWAS data. Knowledge of the activation of these processes will lead to elucidation of the complex biological pathways affected by drug treatment, to patient stratification studies of new and existing drug treatments, and to understanding the underlying anti-cancer drug effects. There are approximately 141 biological human pathway resources as of Jan 2012 according to the Pathguide database. However, most currently available resources do not contain disease, drug or organ specificity information such as disease-pathway, drug-pathway, and organ-pathway associations. Systematically integrating pathway, disease, drug and organ specificity together becomes increasingly crucial for understanding the interrelationships between signaling, metabolic and regulatory pathways, drug action, disease susceptibility, and organ specificity from high-throughput omics data (genomics, transcriptomics, proteomics and metabolomics).
Genome-wide association studies (GWAS) have detected many disease associations. However, the reported variants tend to explain small fractions of risk, and there are doubts about issues such as the portability of findings over different ethnic groups or the relative roles of rare versus common variants in the genetic architecture of complex disease. Studying the degree of sharing of disease-associated variants across populations can help in solving these issues. We present a comprehensive survey of GWAS replicability across 28 diseases. Most loci and SNPs discovered in Europeans for these conditions have been extensively replicated using peoples of European and East Asian ancestry, while the replication with individuals of African ancestry is much less common. We found a strong and significant correlation of Odds Ratios across Europeans and East Asians, indicating that underlying causal variants are common and shared between the two ancestries. Moreover, SNPs that failed to replicate in East Asians map into genomic regions where Linkage Disequilibrium patterns differ significantly between populations. Finally, we observed that GWAS with larger sample sizes have detected variants with weaker effects rather than with lower frequencies. Our results indicate that most GWAS results are due to common variants. In addition, the sharing of disease alleles and the high correlation in their effect sizes suggest that most of the underlying causal variants are shared between Europeans and East Asians and that they tend to map close to the associated marker SNPs.