GenoKey has extensive experience in scientific computing and has developed a suite of innovative technologies for solving very large-scale combinatorial problems. Our array-based approach provides very efficient and scalable methods for combinatorial data mining on case-control data with large numbers of biomarkers, including genetic and clinical data.
Our technology was developed in close cooperation with leading research centers and has been tested on large biological data samples, as described below.
Genome Wide Association Studies
Genome Wide Association Studies (GWAS) attempt to compare the genotypes of a population of patients with a specific disease against a control population to identify signals that are associated with the disease. These signals are usually single base mutations (Single Nucleotide Polymorphisms or SNPs) at specific points in the DNA. Tests based on these signals can be used as research tools or companion diagnostics to help prescribe drugs only to those patients who can benefit from them.
Existing GWAS methods generally identify single SNPs, or at most pairs of SNPs. Very often, however, these associations are found not to be significant and cannot be validated clinically. This is not surprising: complex diseases such as diabetes, bipolar disorder and cancer are highly multi-factorial, with several factors combining to exert a clinical effect. In these important diseases, one or two SNP variations by themselves have little or no predictive value.
Unfortunately, the simplistic ‘SNP at a time’ view fails to reflect this biological complexity and often misses crucial factors, rarely explaining more than 50% of the disease risk that is seen. This means that key associations between SNPs may not be spotted and hence cannot be used as tools in research on the disease. The closing of this ‘heritability gap’ is a major current undertaking in the analysis of GWAS data*, and will lead to innovative new drugs and diagnostic products with significant commercial value.
To understand the complex network of metabolic influences on the disease we must therefore find sets of several SNPs that, in combination, are associated with different disease variants and treatment outcomes. This represents a massive computational challenge for traditional approaches because of the number of potential combinations. Assuming that we want to select p SNPs at a time from a total of n, the number of SNP combinations is n!/(p!(n-p)!). The accompanying graph illustrates the factorial expansion for sets of between 3 and 10 SNPs in a 500,000 SNP genotype (note the log scale).
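The growth described above can be checked numerically. The following is a minimal sketch that evaluates the combination formula for n = 500,000 and p = 3 to 10 using Python's standard library; the variable names are illustrative and not part of any GenoKey software.

```python
import math

N_SNPS = 500_000  # genotype size quoted in the text

# Number of distinct p-SNP sets, C(n, p) = n! / (p! (n - p)!), for p = 3..10.
counts = {p: math.comb(N_SNPS, p) for p in range(3, 11)}

for p, c in counts.items():
    # log10 gives the order of magnitude, as on a log-scale plot.
    print(f"p = {p:2d}: about 10^{math.log10(c):.1f} combinations")
```

Even for p = 3 the count is already roughly 2 × 10^16, and each additional SNP in the set multiplies it by a further four to five orders of magnitude.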
To overcome this challenge, we use a highly efficient geometrical approach based on nested data arrays combined with high-performance, massively parallel GPU processing. The data analysis capability that this provides allows very fast and complete examination of all potentially valid SNP combinations to find statistically significant genotype patterns for your studies. Large GWAS studies on 3-SNP sets can now be run in minutes, even on relatively low-cost GPU hardware.
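To give a feel for what an exhaustive scan over SNP combinations involves, here is a small CPU-only sketch. It is not GenoKey's nested-array or GPU method, which is not described in detail here; it is a plain brute-force illustration that assumes each SNP is stored as a bitmask over individuals (bit i set if individual i carries the variant). The function names, the toy data and the simple frequency-difference score are all hypothetical choices made for this example.

```python
from itertools import combinations


def popcount(x: int) -> int:
    """Count the set bits in a non-negative integer."""
    return bin(x).count("1")


def scan_combinations(snp_masks, case_mask, n_individuals, p):
    """Score every p-SNP set by how much more often cases than controls
    carry the variant at all p SNPs (hypothetical score, for illustration)."""
    all_mask = (1 << n_individuals) - 1
    control_mask = all_mask & ~case_mask
    n_cases = popcount(case_mask)
    n_controls = n_individuals - n_cases
    results = []
    for combo in combinations(range(len(snp_masks)), p):
        carriers = all_mask
        for idx in combo:
            carriers &= snp_masks[idx]  # individuals carrying every SNP so far
        case_freq = popcount(carriers & case_mask) / n_cases
        control_freq = popcount(carriers & control_mask) / n_controls
        results.append((case_freq - control_freq, combo))
    results.sort(reverse=True)  # largest case/control difference first
    return results


# Synthetic toy data: 8 individuals, individuals 0-3 are cases.
case_mask = 0b00001111
snp_masks = [0b00010111, 0b00100111, 0b01001111, 0b11110000]

ranked = scan_combinations(snp_masks, case_mask, n_individuals=8, p=3)
print(ranked[0])
```

In this toy data set the combination of SNPs 0, 1 and 2 is carried by three of the four cases and by none of the controls, so it ranks first. A real study would replace the frequency difference with a proper significance test, and the number of combinations makes this naive loop impractical at genome scale.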
For more information on the large-scale bipolar disorder study undertaken with the University of Copenhagen, please see http://dx.plos.org/10.1371/journal.pone.0023812.
* Zeggini E. (2011) Next-generation association studies for complex traits. Nature Genetics 43(4), 287-288.