Handling Immunogenetic Data: Managing and Validating HLA Data

Steven J. Mack, PhD
Children's Hospital Oakland Research Institute
16th IHIW & Joint Conference
Sunday 3 June, 2012

Overview
1. Master Analytical Dataset
2. The Perils of Using MS Excel
3. HLA Ambiguity
4. Internal Validation and Standardization of Datasets
5. Analytical Validation of HLA Population Data

Create a Master Dataset for Analysis/Sharing
- Create a single master data-file for research datasets.
  - Integrate demographic, phenotypic, and genotypic information.
  - Each analytical element (sample) is a row.
  - Each variable (genotype, phenotype, measurement) is a column.
- Where possible, encode variable data numerically; this avoids errors and ambiguity of meaning, and saves space.
- A data dictionary accompanies and explains the master data-file.
  - Describe the meaning and significance of each variable.
  - Define the numerical codes for each variable (e.g., 1 = affected, 0 = control).
  - Define the code(s) for missing values (e.g., -9 = unknown).
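A data dictionary of the kind described above can be kept in machine-readable form. The following is a minimal sketch only: the column name `status` and the dictionary layout are illustrative assumptions, while the codes (1, 0, -9) are the ones given on the slide.

```python
# Hypothetical data dictionary for one column of a master data-file.
# The column name "status" is illustrative; the codes are those on the slide.
DATA_DICTIONARY = {
    "status": {
        "description": "Disease status of the sample",
        "codes": {1: "affected", 0: "control", -9: "unknown"},
    },
}


def decode(variable, code):
    """Translate a numeric code into its meaning using the data dictionary."""
    return DATA_DICTIONARY[variable]["codes"][code]


print(decode("status", 1))
print(decode("status", -9))
```

Keeping the dictionary next to the data-file means the numeric codes never have to be interpreted from memory.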

Should I Store My Data in Excel?
Microsoft Excel doesn't speak HLA.
IMGT/HLA db version   Allele Name   Excel Error
1.*.*                 01011         1011
2.*.*                 010101        10101
3.*.*                 01:01:01      1:01:01 AM
3.*.*                 01:61         0.0840278
Opening a text file by right-clicking and selecting Excel will result in Excel errors like these.
Use Excel at your own risk; store data using a text format; consider using database software.
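One way to avoid this kind of damage is to read allele names strictly as text. The sketch below uses Python's standard `csv` module, which returns every field as a string; the file contents and column names are made up for illustration.

```python
import csv
import io

# Hypothetical tab-delimited genotype file; column names are illustrative.
raw = "id\tDRB1_1\tDRB1_2\n001\t01:01:01\t01:61\n002\t010101\t01011\n"


def read_as_text(handle):
    """Read every field as a string so allele names are never coerced."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_as_text(io.StringIO(raw))

# Numeric coercion (what a spreadsheet does on import) strips the leading zero.
coerced = str(int(rows[1]["DRB1_1"]))
print(rows[1]["DRB1_1"], "->", coerced)
```

The same principle applies to any tool: declare allele columns as text before the data are loaded, not after.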

Consider Short-Term and Long-Term Needs for HLA Data
Short-Term Need: Interim Reporting, Submission Deadline, Data Analysis
- May often require an abstracted "best guess" to summarize the data.
- But that best guess may only be useful at the time it is made.
Long-Term Need: Archiving, Storage, Deposition into Public Domain Registry, Meta-Analysis, Multi-cycle projects
- Need to be able to make full use of these data in the future.
- The goal is to maximize available information, which often requires that primary or raw data be maintained.
Store both cleaned and ambiguous data.

HLA Data Ambiguity
- Allele ambiguity results when the polymorphisms that distinguish alleles fall outside of the regions assessed by the genotyping system.
  A*02:03:01/A*02:253/A*02:264 (identical exon 2 & 3 sequence)
- Genotype ambiguity results from an inability to establish chromosomal phase between identified polymorphisms.
  DRB1*04:01:01+DRB1*13:01:01 or
  DRB1*04:01:01+DRB1*13:117 or
  DRB1*04:13+DRB1*13:02:01 or
  DRB1*04:14+DRB1*14:21 or
  DRB1*04:35+DRB1*13:40 or
  DRB1*04:38+DRB1*13:20 (identical heterozygous exon 2 sequence)
- An HLA genotype can include both allele and genotype ambiguity.
  A*02:03:01/A*02:253/A*02:264+A*03:01:02 or
  A*02:171:01+A*03:50 or
  A*02:171:02+A*03:66
http://www.ebi.ac.uk/imgt/hla/ambig.html

Coding Ambiguous HLA Data: NMDP Allele Codes
DRB1*04:02:01+DRB1*11:20 or DRB1*04:14+DRB1*11:16
can be represented as DRB1*04:BK+DRB1*11:YR
- DRB1*04:BK represents DRB1*04:02/DRB1*04:14
- DRB1*11:YR represents DRB1*11:20/DRB1*11:16
But these codes also represent excluded genotypes:
DRB1*04:02:01+DRB1*11:16 and DRB1*04:14+DRB1*11:20
Allele Codes increase ambiguity and omit information in the 3rd & 4th fields.
http://bioinformatics.nmdp.org/hla/allele_codes/allele_codes.aspx
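The loss of information can be made concrete by expanding the coded genotype and comparing it with what was actually typed. This is a sketch under stated assumptions: the lookup table below holds only the two codes shown on the slide (a real lookup would come from the NMDP allele code list), and the typed genotypes are written at two fields because the codes do not carry the third field.

```python
from itertools import product

# Illustrative lookup containing only the two codes shown on the slide.
ALLELE_CODES = {
    "DRB1*04:BK": ["DRB1*04:02", "DRB1*04:14"],
    "DRB1*11:YR": ["DRB1*11:20", "DRB1*11:16"],
}


def expand(coded_genotype):
    """Return every genotype implied by a '+'-joined pair of allele codes."""
    first, second = coded_genotype.split("+")
    pairs = product(ALLELE_CODES[first], ALLELE_CODES[second])
    return {a + "+" + b for a, b in pairs}


# The two genotypes the typing result was actually consistent with.
typed = {"DRB1*04:02+DRB1*11:20", "DRB1*04:14+DRB1*11:16"}

implied = expand("DRB1*04:BK+DRB1*11:YR")
extra = implied - typed  # genotypes the typing had excluded
print(sorted(extra))
```

The two extra genotypes are exactly the excluded pairs listed on the slide.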

Coding Ambiguous HLA Data: Allele Groups
G groups: Alleles with identical nucleotide sequence in the exons encoding the peptide binding domain (exon 2 for class II alleles, and exons 2 & 3 for class I alleles)
- A*02:03:01G = A*02:03:01/A*02:253/A*02:264
- A*02:07:01G = A*02:07:01/A*02:07:02/A*02:15N/A*02:265
http://hla.alleles.org/alleles/g_groups.html
P groups: Alleles with identical peptide sequence in the peptide binding domain, with the exclusion of null alleles
- A*02:03P = A*02:03:01/A*02:03:02/A*02:03:03/A*02:03:04/A*02:253/A*02:264
- A*02:07P = A*02:07:01/A*02:07:02/A*02:265
http://hla.alleles.org/alleles/p_groups.html

Recording Ambiguous HLA Data: Genotype Strings
Genotype List String (GL String): Uses specific operators to describe the relationships between alleles, allowing genotype data to be recorded in a single line.
Order of Precedence   Data Delimiter Operator   Description
1                     ^                         Gene/Locus
2                     |                         Genotype list
3                     +                         Genotype
4                     ~                         Haplotype
5                     /                         Alleles
http://www.ebi.ac.uk/ipd/kir/standards.html

GL String Representation of Ambiguous HLA Data
/ : Allelic ambiguity delimiter (forward slash)
- Defines an allele list
Example: A*23:26/A*23:39 (two possible alleles at locus A)

GL String Representation of Ambiguous HLA Data
~ : Haplotype delimiter (tilde)
- Applied in cis to identify a haplotype
Example: DRB5*01:02~DRB1*15:04 (an allele at locus DRB5 and an allele at locus DRB1 on the same chromosome)

GL String Representation of Ambiguous HLA Data
+ : Genotype delimiter (plus sign)
- Identifies alleles on different chromosomes (trans), forming a genotype
- Delimits haplotypes
- May also indicate gene duplication (ambiguous cis)
Example: A*02:302+A*23:26/A*23:39 (an HLA-A genotype in which the second allele carries an allelic ambiguity)

GL String Representation of Ambiguous HLA Data
| : Genotype list delimiter (pipe)
- Distinguishes ambiguous genotypes
Example: A*02:69+A*23:30|A*02:302+A*23:26/A*23:39 (two possible genotypes; the second includes an ambiguous allele list)

GL String Representation of Ambiguous HLA Data
^ : Locus delimiter (caret)
- Distinguishes loci
Example: A*02:69+A*23:30|A*02:302+A*23:26/A*23:39^B*44:02:13+B*49:08 (an ambiguous genotype list of two possible genotypes for HLA-A, followed by an HLA-B genotype)
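Because the delimiters have a fixed order of precedence (^, then |, then +, then ~, then /), a GL String can be taken apart by splitting on each delimiter in turn. The function below is a minimal sketch of that idea, not a validating parser; the name `parse_gl_string` and the nested-list layout are illustrative choices.

```python
def parse_gl_string(gl):
    """Split a GL String by delimiter precedence.

    Result is indexed as:
    [locus][possible genotype][chromosome][gene in cis][possible allele]
    """
    return [
        [
            [
                [allele_list.split("/") for allele_list in haplotype.split("~")]
                for haplotype in genotype.split("+")
            ]
            for genotype in locus.split("|")
        ]
        for locus in gl.split("^")
    ]


gl = "A*02:69+A*23:30|A*02:302+A*23:26/A*23:39^B*44:02:13+B*49:08"
parsed = parse_gl_string(gl)

print(len(parsed))         # number of loci
print(len(parsed[0]))      # number of possible HLA-A genotypes
print(parsed[0][1][1][0])  # ambiguous allele list in the second HLA-A genotype
```

Splitting in precedence order keeps each ambiguity attached to the level it belongs to, which a flat split on all delimiters would lose.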

Alternative Format for Recording Ambiguous HLA Data: UNIFORMAT
Also allows ambiguous genotype data to be recorded in a single line, using different operators (colons, commas, spaces, tabs) than in GL String.
identifier{tab mark}allele,allele[{space}allele,allele...][{tab mark}allele,allele...][{tab mark}#comments]
- identifier: sample identifier
- first allele block: ambiguous alleles at one locus
- further tab-separated allele blocks: ambiguous alleles at additional loci
- #comments: comments
http://geneva.unige.ch/generate/

Know Your Nomenclature Version
Identify the IMGT/HLA database release number applicable to your data.
http://www.ebi.ac.uk/imgt/hla/
The release number identifies which alleles should and should not be in your data.
IMGT/HLA db release   Allele
1.13                  A*2416
1.14                  A*3108
1.15                  B*1522
1.16                  B*3543
2.28                  DPB1*0502
3.0.0                 DPB1*104:01
This information allows you to check your data for naming/recording errors.
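A quick first check is whether a dataset mixes the digits-only names used before release 3.0.0 with the colon-delimited names used from 3.0.0 onward. The sketch below only classifies the style of a name; it does not look names up in any database release, and the regular expressions are simplifying assumptions about name shape.

```python
import re

# Simplified name shapes; real validation needs the release's allele list.
COLON_STYLE = re.compile(r"^[A-Z0-9]+\*\d{2,}(:\d{2,}){1,3}[A-Z]?$")
DIGIT_STYLE = re.compile(r"^[A-Z0-9]+\*\d{4,}[A-Z]?$")


def name_style(allele):
    """Classify an allele name as colon-delimited, digits-only, or unrecognized."""
    if COLON_STYLE.match(allele):
        return "colon-delimited"
    if DIGIT_STYLE.match(allele):
        return "digits-only"
    return "unrecognized"


def is_mixed(alleles):
    """True if more than one naming style occurs in the list."""
    return len({name_style(a) for a in alleles}) > 1


print(name_style("A*2416"))
print(name_style("DPB1*104:01"))
print(is_mixed(["A*2416", "DPB1*104:01"]))
```

A mixed result is a prompt to translate the dataset to a single release, for example with a tool such as ANTT.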

Dataset Validation
The Allele Name Translation Tool (ANTT) can be used to validate/update the allele names in column-formatted datasets against/to any IMGT/HLA db release.
- Parses forward-slash (/) allele ambiguity delimiters; the next version will parse any delimiters (e.g., all GL String delimiters).
- Documents right-truncated allele names and unrecognized allele names.
- Identifies the id and row-column position of errors.
Example messages:
  DRB1*08:02:00 could not be found in the HLA-DRB1.upd translation file.[id = 003][Row = 4 Column = 2]
  DRB1*04:08 appears to be a truncated version of the DRB1*04:08:01 allele, and was translated to DRB1*0408. [id = 005][Row = 6 Column = 2]
http://immunogenomics.org/software.html

Internal Standardization of Data: Data Consistency
Record data consistently across individuals (and datasets).
- For individuals typed, record homozygotes as diploid:
  A*02:03:01/A*02:253/A*02:264+A*02:03:01/A*02:253/A*02:264
- Use a code to identify missing data:
  ****, -9, missing, etc.
- Use a code to identify absent loci when recording structural variants:
  DRB1*01:01:01~DRB3*BLANK~DRB4*BLANK~DRB5*BLANK+DRB1*15:01:01~DRB3*BLANK~DRB4*BLANK~DRB5*01:01:01

Internal Standardization of Data: Data Modification
Document any post-typing modifications made to data.
- How did A*02:69+A*23:30|A*02:302+A*23:26/A*23:39 become A*02:302+A*23:39?
For analysis across individuals and datasets:
- Analyze allele names at the same level of polymorphism.
  - Avoid A*01:01:01:01, A*01:01:01, A*01:01, and A*01 in the same analysis.
  - You will have to throw out some data/information.
- Analyze alleles at the same sequence level.
  - For exon 2 (class II) or exons 2 & 3 (class I) testing, don't analyze alleles in the same G group separately.
  - Analyze DRB1*14:01:01 and DRB1*14:54 as DRB1*14:01:01G.
- Analyze allele names in the same nomenclature context.
  - Don't analyze A*2416 and A*3108; update to a single nomenclature.
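Bringing allele names to the same level of polymorphism amounts to truncating every name to the number of fields carried by the least-resolved name. The following is a minimal sketch assuming colon-delimited names with no expression suffix (such as N); the function names are illustrative.

```python
def truncate_fields(allele, n_fields):
    """Keep only the first n_fields colon-delimited fields of an allele name."""
    locus, _, fields = allele.partition("*")
    return locus + "*" + ":".join(fields.split(":")[:n_fields])


def common_level(alleles):
    """Truncate all names to the field count of the least-resolved name."""
    n = min(len(a.partition("*")[2].split(":")) for a in alleles)
    return [truncate_fields(a, n) for a in alleles]


names = ["A*01:01:01:01", "A*01:01:01", "A*01:01"]
print(common_level(names))
```

As the slide notes, this discards information, so the untruncated data should be retained alongside the analysis copy.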

Analytical Validation of HLA Population Data
The Hardy-Weinberg (HW) model can give you insights into data quality.
Hardy-Weinberg Equilibrium: The frequency of the alleles should predict the frequency of the genotypes. If it does not (HW deviation), you may have problems with your data.
Multi-locus HW deviation
- Sampling error
  - Related individuals
  - Populations mixed together (admixture)
  - Solution: Review inclusion criteria and remove individuals
- Critical typing problem (uncommon)
Single-locus HW deviation
- Typing error
  - Excess of homozygotes due to missed alleles
  - Excess of heterozygotes due to poor assignment
  - Solution: Review and redo typings
- Selection (unexpected in control populations)

Example of HW Data Validation
13 DQB1 alleles in a population of n=109
Genotype                      Observed Count   Expected Count   p-value
DQB1*03:03:02+DQB1*02:01:01   0                3.137            0.0493
DQB1*03:03:02+DQB1*03:03:02   3                0.743            0.0223
Chen's test of individual genotypes in PyPop (http://www.pypop.org)
HW deviation due to poor detection of DQB1*02:01:01 in the presence of DQB1*03:03:02, resulting from a SNP in DQB1*02:01:01 under a PCR primer.
For population/control datasets, the first analysis done should be a HW test.
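The expected counts in a table like this follow from the allele frequencies: n·p² for a homozygote and 2·n·p·q for a heterozygote. The sketch below computes those expected counts for a toy dataset with made-up allele names X and Y; it does not implement Chen's test or reproduce PyPop's output.

```python
from collections import Counter


def hw_expected(genotypes):
    """Expected genotype counts under Hardy-Weinberg proportions.

    genotypes: list of (allele1, allele2) tuples, one per individual.
    Returns {(allele_a, allele_b): expected count} with allele_a <= allele_b.
    """
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    alleles = sorted(freqs)
    expected = {}
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            p = freqs[a] * freqs[b]
            expected[(a, b)] = n * (p if a == b else 2 * p)
    return expected


# Toy data (hypothetical): 3 X/X, 2 X/Y and 5 Y/Y individuals.
toy = [("X", "X")] * 3 + [("X", "Y")] * 2 + [("Y", "Y")] * 5
observed = Counter(tuple(sorted(g)) for g in toy)
expected = hw_expected(toy)
for genotype in sorted(expected):
    print(genotype, observed[genotype], round(expected[genotype], 2))
```

In the toy data the heterozygote is observed less often than expected and both homozygotes more often, the same pattern the slide attributes to a missed allele.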

Much Thanks To
Pierre-Antoine Gourraud, Jill A. Hollenbach, Thomas Barnetche, Richard Single, Steven J. Mack. Standard Methods for the Management of Immunogenetic Data. In: Frank T. Christiansen and Brian D. Tait (eds.), Immunogenetics: Methods and Applications in Clinical Practice, Methods in Molecular Biology, vol. 882, pages 197-213. 2012. doi: 10.1007/978-1-61779-842-9_12
Children's Hospital Oakland: Henry A. Erlich, Janelle Noble, Elizabeth Trachtenberg
Immunogenetics Colleagues: Glenys Thomson, Alicia Sanchez-Mazas, Owen D. Solberg, Martin Maiers, Carolyn Hurley, Marcel Tilanus, Christien Voorter
Immunogenetics Community

Handling Immunogenetic Data: Managing Highly Polymorphic Data for Disease Association Studies
Jill A. Hollenbach, PhD, MPH
Children's Hospital Oakland Research Institute
16th IHIW & Joint Conference
Sunday 3 June, 2012

Immunogenetic data require special handling in disease association studies
1. Highly polymorphic loci
   - Many rare alleles, leading to sparse cells
   - Need to identify all disease-associated alleles
2. Strong linkage disequilibrium

Case-Control Study
Statistical tests
- First step: Population analyses
  - Tests for fit to Hardy-Weinberg equilibrium proportions (HWEP)
  - Calculation of allele and haplotype frequencies
- Association tests
  - Contingency tables / chi-squared test
  - Logistic regression

Case-Control Study: Contingency tables
- Test difference (independence) of frequency distributions for categorical variables between groups (χ² test).
- Always constructed with raw counts, not frequency data.
- Analyses can be performed at the allele, genotype, haplotype, amino acid, or other levels.

Sparse cells in contingency tables
Chi-squared test statistic:
$$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
where O is the observed cell count and E is the expected cell count:
$$E = \frac{\text{row total} \times \text{column total}}{2N}$$
The χ² test is inappropriate if any expected count is less than 1, or if the expected count is less than five in more than 20% of all cells in a contingency table (also known as sparse cells).
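The statistic and the sparse-cell rule above can be sketched in a few lines of plain Python. Function names are illustrative; the grand total is computed from the table itself, which for allele counts is the 2N in the formula. The example tables use the DRB1*0701 and DRB1*0103 counts from the case-control data in this talk, each set against all other alleles (2n = 312 cases, 286 controls).

```python
def chi_squared(table):
    """Pearson chi-squared statistic and expected counts for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


def is_sparse(expected):
    """Apply the rule: any expected count < 1, or more than 20% of cells < 5."""
    cells = [e for row in expected for e in row]
    small = sum(1 for e in cells if e < 5)
    return min(cells) < 1 or small / len(cells) > 0.20


# Rows: allele vs. all other alleles; columns: case, control.
stat_0701, exp_0701 = chi_squared([[44, 21], [268, 265]])
stat_0103, exp_0103 = chi_squared([[1, 0], [311, 286]])
print(round(stat_0701, 2), is_sparse(exp_0701))
print(is_sparse(exp_0103))
```

Checking `is_sparse` before interpreting the statistic makes the validity condition explicit rather than implicit.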

Sparse cells in contingency tables
DRB1   case   control
0101   4      9
0102   14     13
0103   1      0
0301   18     24
0302   16     23
0401   8      7
0403   2      0
0404   1      2
0405   1      5
0407   0      3
0701   44     21
0801   0      1
0802   3      6
0803   1      1
0804   12     12
0806   1      1
0901   4      11
1001   7      3
1101   30     28
1102   14     11
1104   1      1
1201   22     11
1301   21     12
1302   19     21
1303   9      4
1304   1      2
1401   7      3
1402   0      2
1501   5      9
1502   2      2
1503   36     35
1602   8      3
2n     312    286

Sparse cells in contingency tables
DRB1     case   control
0701     44     21
1503     36     35
1101     30     28
1201     22     11
1301     21     12
1302     19     21
0301     18     24
0302     16     23
0102     14     13
1102     14     11
0804     12     12
1303     9      4
0401     8      7
1602     8      3
1001     7      3
1401     7      3
1501     5      9
0101     4      9
0901     4      11
0802     3      6
0405     1      5
binned   10     15
2n       312    286
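Binning pools the rare alleles into a single category so that the table no longer has sparse cells. The slides do not state the rule used to decide which alleles are binned; the sketch below assumes a minimum total count of 5 (cases plus controls), which happens to give the same binned row as the table above when applied to the full DRB1 counts.

```python
# DRB1 allele counts (allele, case, control) from the unbinned table.
RAW = """
0101 4 9   0102 14 13  0103 1 0    0301 18 24  0302 16 23  0401 8 7
0403 2 0   0404 1 2    0405 1 5    0407 0 3    0701 44 21  0801 0 1
0802 3 6   0803 1 1    0804 12 12  0806 1 1    0901 4 11   1001 7 3
1101 30 28 1102 14 11  1104 1 1    1201 22 11  1301 21 12  1302 19 21
1303 9 4   1304 1 2    1401 7 3    1402 0 2    1501 5 9    1502 2 2
1503 36 35 1602 8 3
"""
tokens = RAW.split()
counts = {
    tokens[i]: (int(tokens[i + 1]), int(tokens[i + 2]))
    for i in range(0, len(tokens), 3)
}


def bin_rare(counts, min_total):
    """Pool alleles whose case + control count is below min_total (assumed rule)."""
    kept = {}
    binned_case = binned_control = 0
    for allele, (case, control) in counts.items():
        if case + control >= min_total:
            kept[allele] = (case, control)
        else:
            binned_case += case
            binned_control += control
    kept["binned"] = (binned_case, binned_control)
    return kept


binned = bin_rare(counts, min_total=5)
print(binned["binned"])  # (10, 15)
print(len(binned))       # 22
```

Column totals are unchanged by binning, so the 2n row still reads 312 and 286.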

Sparse cells in contingency tables
DRB1     case   control   p-value
0701     44     21        0.01
1503     36     35        0.81
1101     30     28        0.95
1201     22     11        0.10
1301     21     12        0.20
1302     19     21        0.56
0301     18     24        0.24
0302     16     23        0.17
0102     14     13        0.97
1102     14     11        0.71
0804     12     12        0.83
1303     9      4         0.23
0401     8      7         0.93
1602     8      3         0.18
1001     7      3         0.27
1401     7      3         0.27
1501     5      9         0.23
0101     4      9         0.13
0901     4      11        0.05
0802     3      6         0.27
0405     1      5         0.09
binned   10     15        0.23
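A per-allele p-value can be obtained by testing each allele against all other alleles in a 2×2 table. The slides do not say which test produced the p-values shown, so the sketch below is an assumption: an uncorrected Pearson chi-squared test with 1 degree of freedom, whose results need not match the table to two decimals. For 1 degree of freedom the upper tail of the chi-squared distribution equals erfc(√(x/2)), which avoids any dependency beyond the standard library.

```python
import math


def allele_vs_rest_p(case_count, control_count, case_total, control_total):
    """p-value of a 2x2 Pearson chi-squared test: one allele vs. all others."""
    a, b = case_count, control_count
    c, d = case_total - a, control_total - b
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


p_0701 = allele_vs_rest_p(44, 21, 312, 286)
p_0804 = allele_vs_rest_p(12, 12, 312, 286)
print(round(p_0701, 3), round(p_0804, 2))
```

An exact test or a continuity correction would be a reasonable alternative for alleles with small counts.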

Identifying all disease-associated alleles
Relative predispositional effects method (RPE; Payami et al. 1989)
- A method to identify all heterogeneity in disease risk at the locus of interest.
- Contingency table testing reveals an overall difference in allele frequency distributions at a locus, but we want to identify all alleles that contribute significantly.
- Alleles with the strongest predisposing or protective effects are sequentially removed from the analysis until no further heterogeneity in risk effects is seen.

Identifying all disease-associated alleles
DRB1     case   control   p-value
0701     44     21        0.01
0901     4      11        0.05
0405     1      5         0.09
1201     22     11        0.10
0101     4      9         0.13
0302     16     23        0.17
1602     8      3         0.18
1301     21     12        0.20
1501     5      9         0.23
1303     9      4         0.23
binned   10     15        0.23
0301     18     24        0.24
0802     3      6         0.27
1001     7      3         0.27
1401     7      3         0.27
1302     19     21        0.56
1102     14     11        0.71
1503     36     35        0.81
0804     12     12        0.83
0401     8      7         0.93
1101     30     28        0.95
0102     14     13        0.97

Identifying the primary locus
The frequencies of two haplotypes of an allele at the predisposing locus may differ between patients and controls.
DRB1~DQB1 Haplotype   Case   Control
07:01~02:01           0.10   0.05
07:01~03:03           0.04   0.02
HOWEVER: the ratio of the two haplotype frequencies will be the same in cases and controls if the second locus is not involved (here, 0.10/0.04 = 0.05/0.02 = 2.5).

Identifying the primary locus
The frequencies of two haplotypes of an allele at the predisposing locus may differ between patients and controls.
DRB1~DQB1 Haplotype   Case   Control
07:01~02:01           0.12   0.05
07:01~03:03           0.02   0.02
HOWEVER: the ratio of the two haplotype frequencies will be the same in cases and controls if the second locus is not involved.
Cases: f_case(07:01~02:01) / f_case(07:01~03:03) = 0.12 / 0.02 = 6
Controls: f_cont(07:01~02:01) / f_cont(07:01~03:03) = 0.05 / 0.02 = 2.5
Here the ratios differ between cases and controls.
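The comparison of haplotype frequency ratios can be written as a small helper. This is only a sketch of the arithmetic on the slides, using the two sets of frequencies shown in this part of the talk; it is not a statistical test, and the function name is illustrative.

```python
def haplotype_ratio(freqs, hap_a, hap_b):
    """Ratio of two haplotype frequencies within one group."""
    return freqs[hap_a] / freqs[hap_b]


HAP_A, HAP_B = "07:01~02:01", "07:01~03:03"

# First example: the ratios agree between cases and controls.
case_1 = {HAP_A: 0.10, HAP_B: 0.04}
control_1 = {HAP_A: 0.05, HAP_B: 0.02}

# Second example: the ratios differ between cases and controls.
case_2 = {HAP_A: 0.12, HAP_B: 0.02}
control_2 = {HAP_A: 0.05, HAP_B: 0.02}

for label, case, control in [("first", case_1, control_1), ("second", case_2, control_2)]:
    r_case = haplotype_ratio(case, HAP_A, HAP_B)
    r_control = haplotype_ratio(control, HAP_A, HAP_B)
    print(label, round(r_case, 2), round(r_control, 2))
```

In practice the ratios would be compared using haplotype counts and a formal test, since small frequency differences can arise from sampling alone.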

Population substructure in disease association studies
[Figure: frequency (f) of an allele at Locus A in cases and controls, shown separately for PopX and PopY and for the combined sample PopXY. Within PopX and within PopY there is no association; in the combined PopXY sample, cases and controls differ at p < .05.]

Immunogenetic data require special handling in disease association studies
- Highly polymorphic loci
  - Combine low-frequency alleles: binning
  - Need to identify all associated alleles: relative predispositional effects
- Strong linkage disequilibrium

For further discussion see: Hollenbach JA, Mack SJ, Thomson G, Gourraud PA. Analytical methods for disease association studies with immunogenetic data. Methods Mol Biol. 2012;882:245-66. DOI: 10.1007/978-1-61779-842-9_14

Thank you!