Detection and Validation of Clinically Relevant High Order Epistatic Interactions in a BRCA2 Positive Breast Cancer Population




precisionlife MARKERS: AI-enabled precision medicine

Gert Lykke Møller, Erling Mellerup, Claus Erik Jensen, Dorota Matelska, Steve Gardner

Introduction

Genome Wide Association Studies (GWAS) aim to find (single) genetic variant loci associated with specific phenotypes. While GWAS has been useful for identifying disease-associated factors, it is known to provide only a limited model for explaining complex diseases, as very few loci have significant effect sizes and most diseases are highly polygenic (Boyle, 2017). In addition, GWAS cannot directly include the impact of non-genomic factors such as phenotype, lifestyle and comorbidities that may modulate disease processes and exert significant influence over disease risks.

Current detection methods for disease-associated combinations of SNPs (epistatic interactions) can find only combinations of two or at most three SNPs from a preselected list. This significantly limits the insights that can be gained from the analysis.

precisionlife MARKERS radically extends this capability. It is a highly innovative and massively scalable combinatorial multiomics association platform that can detect and annotate high order epistatic interactions at genome wide and disease population scale. precisionlife MARKERS can find and statistically validate specific combinations of up to 20 SNP genotypes (and/or other non-genomic features) that are found in many cases and zero or few controls, and associate those combinations with specific disease phenotypes.

precisionlife MARKERS overcomes three major limitations of existing large-scale analysis methods such as GWAS:

- finding combinations of multiple features that, in a specific combination, are found in patients (cases) but not controls, and associating them with an observed outcome (e.g. disease risk, protective effect or therapy response),
- identifying and validating higher order interactions (e.g. with 20+ features) in tractable time on affordable hardware,
- including different types of features (genomic, phenotypic, clinical, lifestyle and other factors) in the associations.

This opens up new opportunities to generate value from existing genomic and related patient population datasets for:

- novel target discovery and adaptive trials design
- drug repurposing
- disease protective effects
- improvement of existing polygenic risk scores
- design of personalized combinatorial therapy regimens

These are illustrated below by an analysis of the CIMBA consortium's BRCA2 positive population in the context of breast cancer disease risk (Chenevix-Trench, et al., 2007), run on a single IBM Minsky Power8 NVLink system with 4 x Nvidia GP100 GPUs:

- 7,978 BRCA2 positive cases + controls
- 1,000 fully random permutations
- 200K SNPs / person
- 10⁶⁰ possible n-SNP combinations
- 3–4 day run on GPUs
- distinct patient cohorts
- SNPs in combination
- 3–5 drug repurposing opportunities in a single network
- disease protective effect SNPs

USA: One Broadway, Cambridge, MA T: +1 (617)
UK: C9 Glyme Court, Langford Lane, Kidlington OX5 1LQ T:
precisionlife is a registered trademark of RowAnalytics Ltd

Background

Over the last 15 years of applying gene sequence analysis and GWAS, we have learned that diseases, especially common chronic diseases, are much more complex than we originally thought (Low SK, 2018). Even the most important disease loci have small effect sizes, and tens or hundreds of variants and other external (e.g. phenotypic/environmental) factors may contribute to disease risks or protective effects (Visscher, et al., 2017). Diseases which were previously single diagnoses can now be stratified into multiple patient subgroups even using just a few combined factors (Ahlqvist E, 2018).

We now understand the more complex and nuanced interconnectedness of gene regulatory networks (Boyle, 2017) and genetic control regions (ENCODE Project Consortium, 2012). Complex diseases are often caused by non-coding variants, which do not affect protein structure but may affect gene expression (Li Y, 2016). It appears that the impact of such variants is stronger when they occur in active chromatin and in expression quantitative trait loci (eQTLs), particularly in chromatin that is active in cell types relevant to the disease (Trynka, 2013).

It has therefore been hypothesised that complex disease is driven by an accumulation of weak effects on the key genes and regulatory pathways that drive disease related processes (Chakravarti, 2016). We believe that this interpretation is correct but incomplete, primarily because GWAS analyses have been ineffective at finding high-order SNP combinations that synergistically affect disease status (Wan, 2010). We therefore built precisionlife MARKERS to enable analysis of the combined effects of multiple variants affecting several genes and regulatory pathways.

Finding High-Order Epistatic Interactions

Finding combinations of SNP genotypes (and other features) that are highly associated with a specific phenotype in a large study is computationally challenging.
In a study with three types of SNP genotype (normal homozygote, heterozygote and variant homozygote) the number of possible combinations is $\frac{n!\,3^{r}}{r!\,(n-r)!}$, where n = the total number of SNPs and r = the maximum combinatorial order. The number of possible combinations increases exponentially as the order goes higher. In a study analyzing 500,000 SNPs, the theoretical number of combinations of ten SNP genotypes is 1 × 10⁵⁷. For a medium sized study such as the full CIMBA dataset (15,000 patients, 200,000 SNPs per person), each additional step in combinatorial order increases the number of combinations to be tested by a factor of over 10⁵. This becomes even more problematic if a large number of permutations are used to correct for multiple testing and establish statistical significance.

The challenges of combinatorial expansion and feature heterogeneity have made higher order SNP and multi-modal feature combinations computationally intractable for GWAS datasets. Hence such networks have not been observed, nor have their associations to phenotype been described. It has not been possible to identify and validate more causal variants that are only involved in disease processes in the context of multiple other specific features. With precisionlife MARKERS it is now possible routinely (on affordable GPU servers) to identify multiple genetically non- or minimally-overlapping cohorts within a single disease patient population that share high order disease associated combinatorial features (Mellerup, 2017).

Methodology

precisionlife MARKERS is a massively scalable combinatorial multi-omics association platform that enables the detection of high order epistatic interactions at genome wide study scale.
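As an illustration of the combinatorial count given under "Finding High-Order Epistatic Interactions", the short sketch below evaluates the formula and the factor by which the count grows from order r to order r + 1, which simplifies to 3(n − r)/(r + 1). The function names are hypothetical and chosen only for this example. With n = 200,000 the growth factor is above 10⁵ for the lowest orders and decreases slowly as r rises.

```python
from math import comb


def n_genotype_combinations(n_snps: int, order: int, n_genotypes: int = 3) -> int:
    """Ways to choose `order` SNPs out of `n_snps` and give each one of
    `n_genotypes` genotypes: C(n, r) * g**r."""
    return comb(n_snps, order) * n_genotypes ** order


def growth_factor(n_snps: int, order: int, n_genotypes: int = 3) -> float:
    """Ratio of the count at order + 1 to the count at `order`.

    C(n, r+1) / C(n, r) = (n - r) / (r + 1), and the genotype term adds
    one more factor of `n_genotypes`.
    """
    return n_genotypes * (n_snps - order) / (order + 1)


if __name__ == "__main__":
    n = 200_000  # SNPs per person, as in the CIMBA example
    for r in range(1, 6):
        count = n_genotype_combinations(n, r)
        print(f"order {r}: {count:.3e} combinations, "
              f"x{growth_factor(n, r):.3e} to reach order {r + 1}")
```

Exact integer arithmetic (`math.comb`) is used for the counts so that no precision is lost before formatting.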
It analyses GWAS datasets that have been prepared using standard techniques on distributed GPU instances, applying a pre-filtering step and a 5 stage automated analysis workflow:

Pre-filtering: optional removal of SNPs in linkage disequilibrium (LD) using LD clumping, or of SNPs that are not likely to be of sufficient statistical significance for the analysis, or selection of specific SNPs and features (usually hypothesis driven)
1. Mining: finding all (or most) of the distinct n-combinations of SNP genotypes and/or other types of features found in the cases but not in the controls (or vice versa if the study is focusing on protective factors)
2. Permutations: repeat mining using 1,000 random permutations of all cases:controls using the same mining parameters
3. Network analysis & validation: find networks, determine p-value with FDR correction to eliminate random observations
4. Network annotation: annotate networks using a semantic graph containing SNP IDs, genes, pathways, druggable targets, pharmacogenetic interactions, epigenetic modifications and other features
5. Reclustering: merge/correlate validated networks sharing specific features involving lifestyle factors, pathways, and others available in the specific disease datasets to test biological hypotheses interactively
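The mining and permutation stages above can be pictured with a deliberately small sketch. It is not the platform's algorithm, which is not described here. It only shows, for one candidate combination, how carriers are counted among cases and controls and how shuffling the case/control labels gives an empirical p-value. All names, the genotype coding (0, 1, 2) and the choice of test statistic (number of cases carrying the combination) are assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

Genotypes = Dict[str, int]  # hypothetical coding: SNP id -> genotype 0, 1 or 2


def carries(person: Genotypes, combo: Dict[str, int]) -> bool:
    """True if the person has every SNP genotype in the combination."""
    return all(person.get(snp) == genotype for snp, genotype in combo.items())


def case_control_counts(people: Sequence[Genotypes],
                        is_case: Sequence[bool],
                        combo: Dict[str, int]) -> Tuple[int, int]:
    """Count carriers of `combo` among cases and among controls."""
    cases = controls = 0
    for person, label in zip(people, is_case):
        if carries(person, combo):
            if label:
                cases += 1
            else:
                controls += 1
    return cases, controls


def permutation_p_value(people: Sequence[Genotypes],
                        is_case: Sequence[bool],
                        combo: Dict[str, int],
                        n_permutations: int = 1000,
                        seed: int = 0) -> float:
    """Empirical p-value: share of label shufflings in which at least as many
    'cases' carry the combination as in the observed data."""
    rng = random.Random(seed)
    observed, _ = case_control_counts(people, is_case, combo)
    labels: List[bool] = list(is_case)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        permuted, _ = case_control_counts(people, labels, combo)
        if permuted >= observed:
            hits += 1
    # +1 in numerator and denominator keeps the estimate away from zero
    return (hits + 1) / (n_permutations + 1)
```

In a real study the same permutation scheme would have to be applied to the whole mining run rather than to one pre-selected combination, otherwise the selection step itself is not accounted for.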


Example Analysis: BRCA2 and Breast Cancer

After more than a decade of clinical testing of BRCA1 and BRCA2, there remains considerable uncertainty regarding cancer risks associated with inherited mutations of these genes. The variable penetrance is most striking for BRCA2, and it affects treatment decisions. Reported estimates for lifetime breast cancer risk range between 18–88% (Mavaddat N, 2013). Women with the same BRCA2 mutation may develop breast, ovarian or other cancers at different ages or not at all.

In large retrospective studies, several common variants were associated with breast cancer risk for BRCA2 carriers. The effect sizes of these SNPs are small, but in specific combinations these alleles may be useful in stratifying individuals into distinct risk categories that more accurately reflect their true risk versus a generalized group average risk.

We used precisionlife MARKERS to investigate high-order combinations of genetic variations that may have the potential to distinguish which female carriers of BRCA2 mutations will develop breast cancer. Input data contained the genotypes of 200,908 variants in each of 7,978 BRCA2 carriers, including 1,576 patients who had developed breast cancer before the age of 40 (cases) and 6,402 healthy subjects who had not developed breast cancer. These data were collected by the CIMBA consortium (Chenevix-Trench, et al., 2007) using an iCOGS genotyping array. The participants, the ethics statement, and the selection of 200,908 SNPs have previously been described in detail (Hamdi et al., 2017).

Figure 1. Manhattan plot showing p-values for over-representation of single variants in cases, based on Fisher's exact test for single-locus associations, computed with PLINK 1.9. The variants with p-values < 10⁻⁵ are marked with the corresponding names of adjacent genes. Only FGFR2 variants satisfy the criterion for genome-wide significance (conventionally p-value < 5 × 10⁻⁸). Variants identified in networks of interacting SNPs are shown in green.

With high-order correlations there is always the potential for random observations, and we take care to test for and remove these. We first run 1,000 fully randomized permutations of the mining and apply a p-value cutoff for the simple networks (p < 0.05). Further correction for multiple testing applies the Benjamini-Hochberg procedure (Benjamini Y, 1995). Analysis using a False Discovery Rate (FDR) of 5% identified 3,045 states (unique n-SNP combinations) at layers (orders) 5 and 7–13 that were found to differentiate breast cancer susceptibility, shown in Figure 2. The penetrance in the cohort depends on the FDR used, as shown in Table 1.

Table 1. Penetrance for simple networks (sets of non-redundant states) validated using different FDRs for the BRCA2 dataset.

Figure 2. Distribution of states showing combinatorial order from 5 to 13 at FDR = 5%. The size of each node is proportional to the co-occurrence of the SNPs in the state in patient cases.
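For readers unfamiliar with the Benjamini-Hochberg step-up procedure cited above, a minimal sketch follows. It is a textbook implementation, not the platform's code, and the function name is hypothetical. Given m p-values, it finds the largest rank k such that the k-th smallest p-value is at most (k/m) × FDR and accepts the k smallest p-values.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is accepted at the given FDR."""
    m = len(p_values)
    # indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])

    # step-up rule: largest rank whose p-value is under its own threshold
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank

    accepted = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            accepted[idx] = True
    return accepted
```

Because the rule is step-up, a p-value above its own threshold is still accepted when a larger p-value further down the sorted list passes its threshold.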


Interestingly, the 3,045 states cluster (based on the SNPs they contain) into 533 genetically non-redundant sets. Between these there are 141,400 pairs that share at least one case, and 76,609 pairs sharing at least 10 cases. There are 16 commonly occurring genes (belonging to between 10 and 501 states) associated with non-zero genotypes (i.e., containing minor alleles). Hierarchical clustering (based on co-occurrences of cases) clusters these 16 genes into five distinct groups, shown in Figure 3. The Venn diagram in Figure 4 shows the number of states in which variants corresponding to these genes co-occur. Based on this, we can extract six sets of states: five sets, each containing genes from one of the five variant groups, and a remaining set without any of them.

Figure 3. Hierarchical cluster analysis of the 16 most commonly occurring genes associated with non-zero genotypes.

Figure 4. Venn diagram of overlap of five canonical gene groups.

The states that include one or more of these six features cluster well when mapped on the first two components in Principal Component Analysis (PCA) (Figure 5), and in the network in which edges correspond to at least 10 common cases between two given states (Figure 6). Interestingly, all states containing none of the above genes associated with non-zero genotypes (green dots in the PCA plot) comprise only variants with zero (major allele homozygous) genotypes.

Figure 5. Clustering of canonical gene groups using first two PCA dimensions.

Figure 6. Cluster analysis of 16 most commonly occurring genes associated with non-zero genotypes.
The genes in the six clusters are known to be associated with major breast cancer disease processes including the progression from ductal carcinoma in situ (DCIS) to invasive ductal carcinoma (IDC), Golgi-associated MT organization and stabilization, cell polarity and motility, invasion and metastasis, promotion of ERα-mediated transcription, formation of vascular networks, cancer cell survival, and determining cell fate in specific cell types. There are also multiple variants that have not previously been reported as being associated with breast cancer disease risk or processes.
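The state network described above, in which two states are linked when they share at least 10 cases, can be sketched in a few lines. This is an illustrative reconstruction only: the input layout (a mapping from a state identifier to the set of case identifiers in which the state occurs) and the function name are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_case_edges(state_cases: Dict[str, Set[str]],
                      min_shared: int = 10) -> List[Tuple[str, str, int]]:
    """List (state_a, state_b, n_shared) for every pair of states that
    co-occur in at least `min_shared` cases."""
    edges = []
    for a, b in combinations(sorted(state_cases), 2):
        n_shared = len(state_cases[a] & state_cases[b])
        if n_shared >= min_shared:
            edges.append((a, b, n_shared))
    return edges
```

Calling the function with `min_shared=1` and with `min_shared=10` gives the two kinds of pair counts quoted in the text. The pairwise loop is quadratic in the number of states, which is manageable for a few thousand states.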


Merged Network Analysis

The simple networks were clustered into several merged biomarker networks using a naïve (non-hypothesis) biological criterion that the constituent simple networks contain shared SNP genotypes. The largest of these merged networks exhibited no genetic overlap with the others. The graph of this largest merged network is shown in Figure 7. Nodes correspond to SNPs, edges to co-occurrence in underlying simple networks (grey lines) or in the same haplotypes (i.e., in linkage disequilibrium, LD, with r² > 0.8, yellow lines). Distance between nodes reflects the number of simple networks in which two corresponding SNPs co-occur. Nodes with red borders are SNPs with non-zero genotypes (i.e., with at least one minor allele), and node size corresponds to the odds ratio of the SNPs. The largest nodes occur in over 1,500 states (out of 3,045 in total), but half of the SNPs are present in fewer than 10 states.

Interestingly, the topology of the 5% FDR graph (grey nodes) correlates well with the clustering of simple networks at the stronger FDR threshold of 1% (blue nodes). That is, these three 1% networks are founders of different communities found at FDR 5% (see Figure 7, where SNPs found at FDR 1% are coloured blue, and the three FDR 1% networks are marked with their respective numbers, i.e., #1, #2, #3).

Figure 7. Clustering of merged biomarker networks.

This graph comprises 841 SNPs, which correspond to 744 independent haplotypes (SNPs that are in approximate linkage equilibrium). As expected, almost all variants are located outside protein-coding regions (intronic, intergenic or within ncRNAs, as shown in Figure 8).

Figure 8. Genomic location of SNPs.
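One simple way to realise the merging criterion described at the start of this section (simple networks are merged when they contain shared SNP genotypes) is to treat it as a connected-components problem. The sketch below uses a union-find structure for this. It is an assumption about how such a merge could be done, not a description of the platform's implementation, and the network names and rs identifiers in the usage example are hypothetical.

```python
from typing import Dict, List, Set, Tuple

SnpGenotype = Tuple[str, int]  # (SNP id, genotype code), hypothetical layout


def merge_networks(simple_networks: Dict[str, Set[SnpGenotype]]) -> List[Set[str]]:
    """Group simple networks that are linked, directly or through a chain of
    other networks, by at least one shared SNP genotype."""
    parent = {name: name for name in simple_networks}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # the first network seen with a feature "owns" it; later ones join it
    first_owner: Dict[SnpGenotype, str] = {}
    for name, features in simple_networks.items():
        for feature in features:
            if feature in first_owner:
                union(first_owner[feature], name)
            else:
                first_owner[feature] = name

    groups: Dict[str, Set[str]] = {}
    for name in simple_networks:
        groups.setdefault(find(name), set()).add(name)
    # largest merged network first, ties broken by member names
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


if __name__ == "__main__":
    example = {
        "A": {("rs1", 1), ("rs2", 0)},
        "B": {("rs2", 0), ("rs3", 2)},
        "C": {("rs3", 2)},
        "D": {("rs9", 1)},
    }
    print(merge_networks(example))
```

Note that the same SNP with a different genotype does not link two networks in this sketch, because the shared feature is the (SNP, genotype) pair.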


Identifying Druggable Targets

The vast majority of SNPs in the networks are non-coding, and annotating a GWAS SNP with an expression quantitative trait locus (eQTL) can help to highlight candidate causal genes within a locus (i.e., the eQTL target gene). The SNPs in one of the sub-networks that were identified as eQTLs using the Genotype-Tissue Expression (GTEx) project are colored blue in Figure 9 below (i.e., eGenes, p-value < 10⁻²). Nodes corresponding to twelve variants with p-value < 10⁻⁵ and non-zero genotypes are labelled with their respective eGenes.

Another way to find candidate causal variants is to identify those located in regulatory regions of the genome, such as promoters, transcription start sites (TSSs), enhancers or transcriptionally active regions of open chromatin. As these regions can be characterized by specific patterns of histone modifications and ChIP-seq enrichments, various genome-wide datasets can be used to predict those regulatory regions. The Roadmap Epigenomics Consortium (Bernstein BE, 2010) has used a variety of genome-wide methods to study the chromatin state of non-coding regions in the human genome (Roadmap Epigenomics Consortium, 2015). ChromHMM (Ernst, 2012) can integrate these chromatin datasets to discover major recurring combinatorial and spatial patterns of marks and to systematically annotate the genome. SNPs from the network (together with highly correlated variants) were annotated with precomputed ChromHMM states (15-state core model or 25-state model incorporating imputed data), based on multiple datasets available for breast mammary tissue. In Figures 9 and 10, nodes corresponding to variants within predicted promoters, enhancers, TSSs or DNase-hypersensitive regions are coloured blue.

Figure 9. SNPs with an eQTL.

Figure 10. SNPs with epigenetic marks.

Druggability of the eGenes' protein products can either be predicted, i.e., based on homology to known drug targets (kinases, receptors, proteases, etc.), or already be established experimentally with various levels of confidence, in vitro or, for existing drugs, through their mechanism of action. We checked both predicted and validated druggability of eGenes using the dGene (Kumar 2013), DrugBank (Wishart DS, 2006) and ChEMBL (Gaulton A, 2012) resources.

The druggability of the targets represented in one of the smaller networks was studied in more detail. Requiring that the associated eQTL target be correlated with a given SNP in breast tissue with a p-value below 0.01, there were three strong potential drug targets among the SNP targets. The strongest of these belongs to a community of eight SNPs shown in Figure 11. These form a complete graph and occur in four simple networks, which in turn occur in 53 cases and 0 controls. The variant is located within the ORF of a gene coding for a druggable target that is known to be related to breast cancer metastasis potential.

Figure 11. SNPs relating to druggable target.


Finding Disease Protective Effects

A key opportunity offered by precisionlife MARKERS is to reverse the cases and controls in order to identify features that are associated with disease protective effects rather than disease risk effects. Using the reversed controls:cases analysis approach on the BRCA2 population, we have successfully found a number of such protective effects, including several that may work to reduce an individual's overall lifetime risk of developing breast cancer.

To perform this analysis the BRCA2 population was first segregated more stringently to enable better differentiation, particularly of potentially protective effects where BRCA2 positive people have not experienced early onset of breast cancer. The population was split such that:

- Cases included all non-affected people who had not developed breast cancer by the age of 55
- Controls included all affected people who had developed breast cancer before the age of 40

The findings of this disease protective effect study are described below.

Figure 12. Disease population segregation.

Study: BRCA2-55
Number of cases (non-affected >55 years): 1,458
Number of controls (affected <40 years): 1,576
False discovery rate: 5%

Table 2. SNP networks associated with significant disease protective effect using an FDR of 5%.

In total the protective effect factors are found in 451 out of 1,458 cases (non-affected women), giving a penetrance of 30.9%.

It should be acknowledged that the size of the population was somewhat limiting for this analysis and that the arbitrary cut-off of age 55 means that some non-affected people might go on to develop the disease later in life. A separate study of BRCA2 mutation carriers suggests that approximately 72–78% of the lifetime disease risk will have been encountered by this age in the BRCA2 positive population (Kuchenbaecker, 2017).
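The reversed cohort definition and the penetrance figure quoted above can be checked with a small sketch. The function names are hypothetical, and two details are assumptions: `age` is taken to mean age at diagnosis for affected carriers and age at last follow-up for non-affected carriers, and the age-55 boundary is treated as inclusive.

```python
from typing import Optional


def protective_study_role(affected: bool, age: int) -> Optional[str]:
    """Assign a BRCA2 carrier to the reversed (protective effect) design.

    Assumption: `age` is age at diagnosis if affected, otherwise age at
    last follow-up. Carriers matching neither rule are left out.
    """
    if not affected and age >= 55:
        return "case"      # non-affected, no breast cancer by age 55
    if affected and age < 40:
        return "control"   # breast cancer before age 40
    return None


def penetrance(cases_with_feature: int, total_cases: int) -> float:
    """Fraction of cases carrying at least one validated feature."""
    return cases_with_feature / total_cases


if __name__ == "__main__":
    print(f"{penetrance(451, 1458):.1%}")
```

Excluding the intermediate group (affected between 40 and 54, or non-affected but younger than 55) is what makes the segregation more stringent than the original case/control split.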
If the number of controls were larger (approaching a case:control ratio of 1:3 instead of 1:1) and the segregation even more stringent, it is possible that some of these protective effects (particularly those with fewer states and cases, i.e., network 5) would become less significant and might even fail the p-value and FDR tests. Nonetheless, for the first time, this study identified protective combinations of multiple genetic factors from a GWAS population dataset.

Knowledge of the genes implicated in the protective effects can be used to identify new druggable R&D opportunities for further research. The known gene associations in Table 2 include a number of recognized druggable candidate targets. This knowledge could also be used to improve the accuracy and specificity of current disease risk scoring models and genetic tests, and to inform a personal disease risk scoring tool that incorporates knowledge of the impact of all the combinations of networks that a patient's genome contains.

Even a patient presenting with a risky BRCA2 mutation, who would at the moment be given a high group-based disease risk score, may potentially have a personal risk significantly different from what a single-locus or even a multi-SNP panel test would suggest. This has not been previously demonstrated and offers a more tailored (i.e., more precise) treatment option towards the goal of realizing personalized medicine.

Next Steps

Our next steps will include replication of these findings in larger datasets with higher-resolution SNP arrays and the incorporation of non-genomic data. We are also actively pursuing analysis of a variety of populations in other disease areas.

Conclusions

The results of our re-analysis of the CIMBA dataset are intriguing. This breast cancer population dataset has been analyzed multiple times using conventional analytical tools, and yet we have uncovered several novel and significant findings that could only be identified in the context of high-order (5–13) SNP combinations. Work continues with phenotypic / pathological data.

precisionlife MARKERS offers a powerful new lens through which to view diseases and identify new opportunities for:

- better patient risk scoring including disease protective effects
- novel target discovery and R&D directions
- drug repurposing and adaptive trials design

For more information please contact brca2@precisionlife.com

Methods

Single-locus association tests were computed with PLINK using Fisher's exact test. The Manhattan plot was generated with R qqman. The results reported above refer to identified variants or their highly correlated neighbours from haplotypes. LD clumping was done using PLINK at the r² cutoff of 0.8. Genes were associated using HaploReg, taking into account genomic proximity, LD and cis-eQTL. Breast cancer associations were taken from GWAS Catalog. Gene set enrichment for associated genes was performed with DEPICT and DAVID. Regulatory elements (promoters, enhancers, TSSs) were assigned based on data from HaploReg. eQTLs were assigned using the Ensembl API to the GTEx data at the p-value thresholds of 0.05 and […]. Minor allele frequency was taken from LDproxy, based on European subpopulation. Obsolete SNPs were merged according to data from dbSNP 150. Odds ratios for single variants were calculated in reference to the given SNP genotype. Druggability of the genes was assessed with PharmGKB, dGene, ChEMBL, and DrugBank resources. Visualization of networks was done using Cytoscape and in-house scripts.

References

Ahlqvist, E., et al. (2018). Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes & Endocrinology S (18):
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc., 57(1):
Bernstein, B.E., et al. (2010). The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol., 28(10):
Boyle, E.A., Li, Y., & Pritchard, J.K. (2017). An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell, 169(7):
Chakravarti, A., & Turner, T.N. (2016). Revealing rate-limiting steps in complex disease biology: The crucial importance of studying rare, extreme-phenotype families. BioEssays, 38(6):
Chenevix-Trench, G., Milne, R.L., Antoniou, A.C., Couch, F.J., Easton, D.F., Goldgar, D.E., & CIMBA. (2007). An international initiative to identify genetic modifiers of cancer risk in BRCA1 and BRCA2 mutation carriers: the Consortium of Investigators of Modifiers of BRCA1 and BRCA2 (CIMBA). Breast Cancer Res., 9(2):104.
ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489:
Ernst, J., & Kellis, M. (2012). ChromHMM: automating chromatin-state discovery and characterization. Nat Methods., 9(3):215-6.
Gaulton, A., et al. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., D
GTEx Consortium. (2013). The Genotype-Tissue Expression (GTEx) project. Nat Genet., 45(6):
Hamdi, Y., et al. (2017). Association of breast cancer risk in BRCA1 and BRCA2 mutation carriers with genetic variants showing differential allelic expression: identification of a modifier of breast cancer risk at locus 11q22.3. Breast Cancer Res Treat., 161(1):
Kuchenbaecker, K.B., et al. (2017). Risks of Breast, Ovarian and Contralateral Breast Cancer for BRCA1 and BRCA2 Mutation Carriers. JAMA, 317(23):
Kumar, R.D., Chang, L.W., Ellis, M.J., & Bose, R. (2013). Prioritizing Potentially Druggable Mutations with dGene: An Annotation Tool for Cancer Genome Sequencing Data. PLoS ONE, 8(6):e67980.
Li, Y.I., van de Geijn, B., Raj, A., Knowles, D.A., Petti, A.A., Golan, D., Gilad, Y., & Pritchard, J.K. (2016). RNA splicing is a primary link between genetic variation and disease. Science, 352:
Low, S.K., Zembutsu, H., & Nakamura, Y. (2018). Breast cancer: The translation of big genomic data to cancer precision medicine. Cancer Sci., 109(3):
Mavaddat, N., et al. (2015). Prediction of breast cancer risk based on profiling with common genetic variants. J Natl Cancer Inst., 107(5).
Mellerup, E., Andreassen, O.A., Bennike, B., Dam, H., Djurovic, S., Jorgensen, M.B., Kessing, L.V., Koefoed, P., Melle, I., Mors, O., & Moeller, G.L. (2017). Combinations of genetic variants associated with bipolar disorder. PLoS ONE, 12(12):e
Roadmap Epigenomics Consortium, et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):
Trynka, G.S., Sandor, C., Han, B., Xu, H., Stranger, B.E., Liu, X.S., & Raychaudhuri, S. (2013). Chromatin marks identify critical cell types for fine mapping complex trait variants. Nat Genet., 45(2):
Visscher, P.M., Wray, N.R., Zhang, Q., Sklar, P., McCarthy, M.I., Brown, M.A., & Yang, J. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet., 101(1):5-22.
Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, X.L., & Yu, W. (2010). BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Hum Genet., 87(3):
Wishart, D.S., et al. (2008). DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res., 36:D
