Project: Haplotypes and pedigree analyses tools

Development of haplotype and pedigree analysis software tools

Project leaders: Martin Maiers

Detailed project description:

18th IHIWS project on haplotypes and pedigree analysis tools

The International HLA & Immunogenetics Workshop has a rich tradition of using families to understand complex immunogenetics.  Despite great progress in the field in terms of typing technology and computational resources, we still lack tools to address some of the basic questions where pedigree information is useful:

  • What is the recombination rate within and between alleles?
  • What are the selection processes driving HLA diversity?
  • How can families of different sizes be integrated into one analysis (duos, trios, quads, families with >2 children)?
  • How can families be used to analyze genomic structure (e.g. KIR haplotypes)?
  • How can genomic sequencing data (e.g. HapMap, GoNL) be integrated?
  • How can we use real world data with ambiguity (e.g. not all samples typed at the same resolution or at the same loci)?

The motivation for wanting to have more capable tools to analyze families is, in turn, to help understand the largest questions in the field:

  • Why are there so many HLA alleles?  Ostensibly due to pathogen-driven balancing selection.  However, the evidence for this is still lacking.  Statistical methods (e.g. Ewens-Watterson) that have been applied to HLA to support claims of balancing selection are based on a long list of assumptions which don’t apply.  Direct counting of homozygosity in large cohorts reveals evidence for the opposite: excess homozygosity, in particular at the Class I loci.
  • What is the contribution of each of the following mechanisms to allele diversity?
    o   Mutation and recombination
    o   Undetected recombination
    o   Selection
  • Lower and upper bounds on the MHC recombination rate in humans differ by an order of magnitude.  Can we narrow this range?
  • Does HLA driven mate selection exist in humans?
  • Can we detect transmission disequilibrium (deviation from Mendel’s laws) in the MHC in humans?
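
To make the "direct counting of homozygosity" idea concrete, here is a minimal Python sketch. It compares the observed fraction of homozygous individuals at one locus with the Hardy-Weinberg expectation (the sum of squared allele frequencies estimated from the same sample). This is only an illustration of the counting step, not the Ewens-Watterson test itself; the function name and the toy genotypes are assumptions made for this example.

```python
from collections import Counter


def homozygosity_summary(genotypes):
    """Return (observed, expected) homozygosity for one locus.

    genotypes: list of (allele1, allele2) tuples, one per individual.
    expected is the Hardy-Weinberg value: the sum of squared allele
    frequencies estimated from the same sample.
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    n = len(genotypes)
    # Fraction of individuals carrying two copies of the same allele.
    observed = sum(1 for a, b in genotypes if a == b) / n
    # Allele frequencies from all 2n chromosomes in the sample.
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    total = 2 * n
    expected = sum((count / total) ** 2 for count in allele_counts.values())
    return observed, expected


# Toy data (hypothetical, for illustration only).
sample = [
    ("A*01:01", "A*01:01"),
    ("A*01:01", "A*02:01"),
    ("A*02:01", "A*02:01"),
    ("A*02:01", "A*03:01"),
]
obs, exp = homozygosity_summary(sample)
print(f"observed={obs:.3f} expected={exp:.3f}")
```

An observed value above the expected one would point toward excess homozygosity; a real analysis would also need a significance test and much larger cohorts.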

The approach we have proposed for this component is to develop:

  1. a set of tools for analyzing pedigrees that address shortcomings of existing tools
  2. a set of tools for simulating meiotic reproduction, parameterized to allow the creation of datasets that mimic the family structure and typing (loci/resolution) of real data and that use specific population-level parameters (recombination rate, transmission disequilibrium, inbreeding coefficient, etc.).
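
As a rough illustration of the second item, the following Python sketch simulates meiosis for a nuclear family. Each parent is a pair of haplotypes (lists of alleles in chromosome order), and a crossover occurs between adjacent loci with a given probability. All names, the toy alleles, and the per-interval rates are hypothetical; transmission disequilibrium, inbreeding, and typing ambiguity are deliberately left out.

```python
import random


def simulate_gamete(hap_a, hap_b, recomb_rates, rng):
    """Produce one gamete from a parent's two haplotypes.

    recomb_rates[i] is the assumed crossover probability between
    locus i and locus i + 1, so it has one entry fewer than the loci.
    """
    if len(hap_a) != len(hap_b) or len(recomb_rates) != len(hap_a) - 1:
        raise ValueError("haplotype and rate lengths are inconsistent")
    sources = (hap_a, hap_b)
    current = rng.randrange(2)  # start on a randomly chosen haplotype
    gamete = [sources[current][0]]
    for locus, rate in enumerate(recomb_rates, start=1):
        if rng.random() < rate:
            current = 1 - current  # crossover: switch to the other haplotype
        gamete.append(sources[current][locus])
    return gamete


def simulate_family(father, mother, recomb_rates, n_children, seed=0):
    """Return a list of (paternal_gamete, maternal_gamete), one per child."""
    rng = random.Random(seed)
    return [
        (
            simulate_gamete(*father, recomb_rates, rng),
            simulate_gamete(*mother, recomb_rates, rng),
        )
        for _ in range(n_children)
    ]


# Toy parents with three loci (hypothetical alleles and rates).
father = (["A*01:01", "B*08:01", "DRB1*03:01"], ["A*02:01", "B*07:02", "DRB1*15:01"])
mother = (["A*03:01", "B*35:01", "DRB1*01:01"], ["A*24:02", "B*44:02", "DRB1*04:01"])
children = simulate_family(father, mother, [0.01, 0.01], n_children=4, seed=42)
for paternal, maternal in children:
    print(paternal, maternal)
```

Fixing the seed keeps a simulated dataset reproducible, which helps when the same data are used to compare tools.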

Pedigree Analysis Software feature list:

  • Ability to analyze extended families
  • Imputation to statistically resolve ambiguity using population haplotype frequencies
  • Operate over a spectrum of typing resolutions
  • HLA and KIR fluency (lift over, reduction to structures: exons, ARD exons)
  • Flexible number of loci
  • Recombination detection under ambiguity
  • Query language (e.g. Cypher graph query language)
  • Estimates of false paternity
  • A nice user interface (UI)
  • Standards based data import (HML, PED, HL7-FHIR, IHIW-XML)
  • Integration of WGS/WES/GWAS or other data
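
As one possible starting point for the recombination detection feature, the sketch below counts the minimum number of crossovers needed to explain a transmitted haplotype as a mosaic of a parent's two haplotypes. It assumes fully phased, unambiguous typing at every locus, so it does not address detection under ambiguity; the function name and data layout are assumptions for this example.

```python
def min_crossovers(hap_a, hap_b, transmitted):
    """Minimum crossovers explaining `transmitted` from hap_a and hap_b.

    Returns None if some locus matches neither parental haplotype
    (e.g. typing error, mutation, or wrong parent).
    Loci where the parent is homozygous are uninformative and never
    force a crossover.
    """
    if not (len(hap_a) == len(hap_b) == len(transmitted)):
        raise ValueError("all haplotypes must cover the same loci")
    inf = float("inf")
    # cost[0]: fewest crossovers so far if the current locus came from hap_a
    # cost[1]: the same, if it came from hap_b
    cost = [0, 0]
    for a, b, t in zip(hap_a, hap_b, transmitted):
        reach_a = min(cost[0], cost[1] + 1)  # stay on a, or switch from b
        reach_b = min(cost[1], cost[0] + 1)  # stay on b, or switch from a
        cost = [reach_a if t == a else inf, reach_b if t == b else inf]
    best = min(cost)
    return None if best == inf else best


# Toy haplotypes (hypothetical, for illustration only).
hap_a = ["A*01:01", "B*08:01", "DRB1*03:01"]
hap_b = ["A*02:01", "B*07:02", "DRB1*15:01"]
print(min_crossovers(hap_a, hap_b, ["A*01:01", "B*08:01", "DRB1*03:01"]))
print(min_crossovers(hap_a, hap_b, ["A*01:01", "B*07:02", "DRB1*15:01"]))
```

Extending this to ambiguous typing would mean tracking a set of candidate alleles per locus, weighted for instance by population haplotype frequencies.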

Which systems?

  • MHC-class I in Rhesus monkeys and other primates where the number of genes varies
  • KIR and LILR – in all species that have them
  • TCR – germ-line sequencing
  • MHC-class II in human and non-human primates

Real world datasets to analyze:

  1. Registry cord/mom pairs
  2. Clinical patient/families
  3. 17th IHIW families
  4. GoNL families (http://www.nlgenome.nl/)
  5. HapMap families (1000 Genomes)

Milestones in years:

2020:

  • Develop simulated datasets.
  • Curate/acquire clinical (real-world) datasets.

2021:

  • Assess existing tools.
  • Develop new tools.

2022:

  • Data analysis and interpretation

Patient/sample description (if applicable, details, inclusion/exclusion criteria):

Real-World family data:

  • Any pedigree size (2+)
  • Preference for 2 parents, > 2 children

Data required (number, type of data, inclusion/exclusion criteria):

  • Any data accepted
  • Higher resolution (e.g. full gene sequence) and more loci preferred.

Samples required (if applicable, number, type of samples, inclusion/exclusion criteria):

N/A

Reagents/additional assays required:

N/A

Data infrastructure required:

  • We will share software using GitHub.
  • We will distribute simulated data through AWS.
  • Real-world datasets will require data use agreements.