SNP-HLA Reference Consortium
Project leaders: Nicolas Vince & Pierre-Antoine Gourraud.
Detailed project description:
Over the past 10 years, genome-wide association studies (GWAS) have identified more than 10,000 associations; the HLA genomic region is the top associated locus in GWAS, particularly in immune-related diseases. SNPs are the hallmark of GWAS; however, the information carried by this type of genetic marker is very limited, especially in the HLA region, where linkage disequilibrium (LD; defined as the non-random association of alleles at different loci) is strong and extends over several megabases. Indeed, a SNP associated with a pathology only marks a genomic region, and it is necessary to go beyond this simple association, especially by examining HLA alleles, to improve our understanding of the functional mechanisms and possibly to develop therapeutic targets. HLA typing techniques are expensive, require specialized laboratory infrastructure, and are constantly evolving.
However, thanks to statistical inference, it is now possible to impute HLA alleles from genotyped GWAS SNPs. This technique requires adequate reference panels for imputation; these reference panels need improvement in terms of sample size, population diversity and SNP coverage. The goal here is to create custom reference panels to better impute HLA from GWAS datasets. To do so, we need: (1) to gather more HLA and SNP data from numerous sources, (2) to better understand how to improve HLA imputation practices, and (3) to build a digital platform where scientists can access these custom reference panels to impute their own data (SHLARC, the SNP-HLA Reference Consortium).
In practice, we plan to gather HLA+SNP data from several sources: our own data, public data (the 1000 Genomes Project), semi-public data (via access to the dbGaP and EGA data repositories) and direct collaborations. No HLA+SNP data will ever be shared directly. Only reference panels built from these datasets will be provided within our tools. Reference panels contain only summary statistics (i.e. the probability of carrying a given HLA allele given a set of SNP genotypes), which are not sufficient to reconstruct individuals.
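To illustrate what "summary statistics only" can mean, here is a minimal sketch in Python. It is not the consortium's actual method: real HLA imputation tools use far more elaborate statistical models. The function name build_toy_panel, the three-SNP haplotype strings and the sample values are all hypothetical and chosen purely for illustration. The sketch only shows how individual-level (SNP haplotype, HLA allele) observations could be collapsed into conditional probabilities, so that the resulting table carries no per-individual records.

```python
from collections import Counter, defaultdict


def build_toy_panel(samples):
    """Collapse (snp_haplotype, hla_allele) pairs into P(allele | haplotype).

    Hypothetical helper for illustration only: the returned table holds
    aggregated probabilities, not the individual observations.
    """
    counts = defaultdict(Counter)
    for snp_haplotype, hla_allele in samples:
        counts[snp_haplotype][hla_allele] += 1

    panel = {}
    for haplotype, allele_counts in counts.items():
        total = sum(allele_counts.values())
        panel[haplotype] = {
            allele: n / total for allele, n in allele_counts.items()
        }
    return panel


# Made-up observations: (SNP haplotype, HLA-A allele)
toy_samples = [
    ("ACG", "A*02:01"),
    ("ACG", "A*02:01"),
    ("ACG", "A*01:01"),
    ("TTG", "A*03:01"),
]

panel = build_toy_panel(toy_samples)
for haplotype, probabilities in sorted(panel.items()):
    print(haplotype, probabilities)
```

In this toy table, the haplotype "ACG" maps to two candidate alleles with their estimated probabilities, while the four underlying observations are no longer present in the output.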
Milestones in years:
2020: Data collection and database setup. Optimisation of HLA allele imputation.
2021: Continuation of data collection and website construction.
2022: Website beta version release, feedback integration and release of a final version.
Data required (number, type of data, inclusion/exclusion criteria):
Several types of data are suitable, but all must contain SNP genotypes and at least second-field molecular HLA typing for HLA-A, -B, -C, -DRB1 and -DQB1.
SNP genotypes: all types of GWAS chip data, sequencing data covering 500 kb around HLA genes, whole-genome sequencing (WGS) data.
Minimal HLA typing resolution: second-field. HLA can also be called from WGS data.
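The HLA part of these inclusion criteria can be sketched as a simple check. This is an assumption-laden illustration, not a consortium tool: the names field_count and meets_hla_criteria, the input layout (a dict mapping each locus to its list of allele names) and the regular expression are hypothetical. The pattern follows the standard colon-separated HLA nomenclature (e.g. A*02:01) with an optional expression suffix letter. The presence of SNP genotypes is not checked here.

```python
import re

# Matches names such as "A*02:01", "HLA-DRB1*15:01:01:01" or "A*24:09N".
_HLA_PATTERN = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+)*)[NLSCAQ]?$")

# Loci listed in the inclusion criteria above.
REQUIRED_LOCI = ("A", "B", "C", "DRB1", "DQB1")


def field_count(allele):
    """Return the number of colon-separated fields, or 0 if unparseable."""
    match = _HLA_PATTERN.match(allele.strip())
    if match is None:
        return 0
    return len(match.group(2).split(":"))


def meets_hla_criteria(typing):
    """Check that every required locus is typed at second-field or better.

    `typing` is a hypothetical layout: {locus: [allele names]}.
    """
    for locus in REQUIRED_LOCI:
        alleles = typing.get(locus, [])
        if not alleles:
            return False
        if any(field_count(allele) < 2 for allele in alleles):
            return False
    return True


# Made-up sample for illustration.
sample = {
    "A": ["A*02:01", "A*01:01"],
    "B": ["B*07:02", "B*08:01"],
    "C": ["C*07:01", "C*07:02"],
    "DRB1": ["DRB1*15:01", "DRB1*03:01"],
    "DQB1": ["DQB1*06:02", "DQB1*02:01"],
}
print(meets_hla_criteria(sample))
```

A sample typed only at first-field resolution for any required locus (e.g. B*07), or one given in serological form (e.g. A2), would be rejected by this check.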
Data infrastructure required:
The data infrastructure will be hosted in the Nantes University data center.
We will use our local high-throughput computing center (CCIPL, Nantes) to build custom reference panels with the help of high-performance GPUs (NVIDIA P100).