SUPPLEMENTARY INFORMATION


doi: /nature16931

Contents

1 Nomenclature
2 Fertility of homozygous B6^{H/H} and heterozygous B6^{B6/H} mice
3 DSB hotspot maps
3.1 DSB hotspot caller
3.2 Calling procedure
3.3 Overlap between hotspots and map comparison
3.3.1 Pairwise overlap between DSB hotspot maps
3.3.2 DSB hotspot map correlations
3.3.3 DSB hotspots in the (B6 × PWD)F1^{B6/PWD} and (B6 × PWD)F1^{H/PWD} mice
4 Motif analyses
4.1 Overview
4.2 Motif refinement algorithm
4.3 Search of motifs in genomes
4.4 Identification of mutated motifs
4.4.1 PWD sequencing and ancestral genome to B6 and PWD
4.4.2 Mutated ancestral motifs in DSB hotspots
5 DSB hotspot assignment in hybrids
5.1 Chromosome-aware DSB calling
5.2 Attribution of PRDM9 control in hybrids
5.3 Evidence for erosion of PRDM9 binding motifs
5.4 Fraction of asymmetric DSB hotspots explained by binding motif disruption
5.5 Systematic shift towards the chromosome less bound in hybrid mice
6 Histone marks and genomic features in the humanized mouse
6.1 H3K4me3 marks
6.2 H3K36me3 marks
6.3 Strand effect for H3K4me3 and H3K36me3
6.4 Other histone marks
6.5 Exons and Transcription Start Sites (TSS)
H3K4me3 ChIP-seq
Identifying PRDM9 binding peaks and defining PRDM9-independent H3K4me3 regions
Estimating haplotype-specific H3K4me3 signals
Chromosome effects following Prdm9 humanization
Prdm9 humanization of the (PWD × B6)F1^{PWD/B6} mouse
Chromosome effects in DMC1 signal comparisons between infertile and rescue
Chromosome effects in H3K4me3 signal comparisons between infertile and rescue
Chromosome effect predictors
Prdm9 humanization of the wild-type B6 mouse
Chromosome effects in DMC1 signal comparisons between B6^{B6/B6} and B6^{B6/H}
Chromosome effect predictors
DSB hotspot in Hstx
PWD deletions at DSB hotspot locations
Symmetry metric
Further discussions on PRDM9 binding symmetry
Fertility levels and DSB symmetry across hybrid mice
Additivity of H3K4me3 binding
Examining how PRDM9 binding on the homologue impacts DMC1 measures
Addressing our results in the context of an individual cell
Within population dynamics
Supplementary Tables

1 Nomenclature

Wild-type C57BL/6J and PWD/PhJ mice are referred to as B6 and PWD throughout the text. In general, we refer to specific mice by writing their genomic background in uppercase and their Prdm9 alleles as superscripts. In homozygous mice, when the Prdm9 allele is the same as the genomic background, we sometimes omit the superscripts to simplify notation (e.g. B6 and PWD wild-type mice carry the B6 and PWD Prdm9 alleles homozygously, respectively). F1 hybrids, the progeny of an outcross between two different inbred strains, are named according to their inbred parents, with the strain of the female written first. For example, the infertile cross is named (PWD × B6)F1^{PWD/B6}. Unless stated otherwise, we always refer to male mice throughout this work. The humanized model described in this manuscript has been attributed the allele name Prdm9 tm1(prdm9)wthg (MGI Accession ID MGI: ).

2 Fertility of homozygous B6^{H/H} and heterozygous B6^{B6/H} mice

Both male and female B6^{H/H} and B6^{B6/H} mice show normal fertility (Extended Data Fig. 2a) and Mendelian transmission of the humanized allele. Detailed cytogenetic analysis revealed no major abnormalities in DSB counts (DMC1 immunoreactivity, Extended Data Fig. 2b) or crossover counts (MLH1 foci, Extended Data Fig. 2c), and showed normal sex body formation (γH2AX immunostaining, Fig. 1b) in heterozygous and homozygous humanized male mice. No differences in quantitative measures of fertility and successful synapsis were found between genotypes (Extended Data Fig. 2d).

3 DSB hotspot maps

3.1 DSB hotspot caller

The formation of a DSB leaves a strand-specific signal, due to the 3′ to 5′ activity of the exonuclease. To take the resulting asymmetry of the coverage of the ChIP-seq reads between the two strands into account, we developed a specific DSB hotspot caller.
The assay, and the available coverage, do not permit the detection of DSBs in individual cells, but by piling up signal coming from multiple testis cells, it enables the detection of genomic regions at which DSBs occur frequently. The novel algorithm takes advantage of the shift in the mapping of single-stranded DNA (ssDNA) reads between the 5′ and the 3′ DNA strands to call hotspots. These ssDNA segments are a consequence of the resection of DNA ends that accompanies a DSB and are isolated by DMC1 ChIP amplification followed by high-throughput sequencing (Methods). For each hotspot, the caller estimates the centre of the hotspot and its heat, loosely defined as the number of reads mapping to this DSB hotspot and predicted to represent real signal (as opposed to background reads). The DSB heat is proportional to the fraction of cells possessing unrepaired breaks decorated by DMC1 at that position. The caller handles sample replicates and is able to call hotspots using several samples jointly.

To test whether the genomic position $p$ is the centre of a DSB hotspot, we compared the two following models:

No hotspot model: the DNA segment surrounding the position $p$ is divided into six bins, namely left, centre and right bins on each of the two strands, of lengths $l_e$, $l_c$ and $l_e$ respectively. The distribution of the number of reads $N_{i\pm}$, $i \in \{l, c, r\}$, in each bin is assumed to be Poisson, with a mean parameter proportional to the length of the bin:

$$N_{l\pm} \sim \mathrm{Po}(l_e \lambda), \quad N_{c\pm} \sim \mathrm{Po}(l_c \lambda), \quad N_{r\pm} \sim \mathrm{Po}(l_e \lambda),$$

where $\lambda$ is the factor of proportionality. The $N_{i\pm}$, $i \in \{l, c, r\}$, are assumed to be independent. Letting $n_{i,s}$, $i \in \{l, c, r\}$, $s \in \{+, -\}$, be the observed number of reads falling within bin $i$ on strand $s$, we estimate $\lambda$ by its maximum likelihood estimator

$$\hat{\lambda} = \frac{1}{2(2 l_e + l_c)} \sum_{i \in \{l, r, c\},\ s \in \{+, -\}} n_{i,s}.$$

Hotspot model: six bins are defined, as previously, around the position $p$. The {right, +} and {left, −} bins only carry background signal, while the four other bins exhibit both background and hotspot signals, so that the distributions of the $N_{i\pm}$, $i \in \{l, c, r\}$, become:

$$N_{l+} \sim \mathrm{Po}(l_e(\lambda + \lambda_b)), \quad N_{c+} \sim \mathrm{Po}(l_c(\lambda + \lambda_b)), \quad N_{r+} \sim \mathrm{Po}(l_e \lambda_b),$$
$$N_{l-} \sim \mathrm{Po}(l_e \lambda_b), \quad N_{c-} \sim \mathrm{Po}(l_c(\lambda + \lambda_b)), \quad N_{r-} \sim \mathrm{Po}(l_e(\lambda + \lambda_b)),$$

where $\lambda_b$ is the rate of background binding, and $\lambda$ is the rate at which reads representing true hotspot signal occur. These parameters are estimated by maximum likelihood as:

$$\hat{\lambda}_b = \frac{1}{2 l_e}\,(n_{r+} + n_{l-}),$$
$$\hat{\lambda} = \max\left\{ \frac{1}{2(l_e + l_c)} \left( n_{l+} + n_{c+} + n_{c-} + n_{r-} - \frac{l_e + l_c}{l_e}\,(n_{r+} + n_{l-}) \right),\ 0 \right\}.$$

Because the no hotspot model is nested within the hotspot model, a simple log-likelihood ratio test can be applied to test for the presence of a DSB hotspot at the position $p$. Under the null hypothesis of the absence of hotspot at $p$, the likelihood ratio is assumed to follow a chi-squared distribution with 1 degree of freedom.

If a hotspot is called (see Supplementary Information, Subsection 3.2), we refer to the quantity $2(l_e + l_c)\hat{\lambda}$ as the heat of the hotspot.
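The comparison of the two models at a single position can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the dictionary layout of the bin counts are hypothetical, the `log(n!)` term of the Poisson log-likelihood is dropped because it cancels in the ratio, and the statistic is floored at zero for the case where the hotspot rate estimate is clamped to 0.

```python
import math


def poisson_loglik(count, mean):
    """Poisson log-likelihood of one count, omitting the log(count!) term,
    which is identical under both models and cancels in the ratio."""
    if mean <= 0:
        return 0.0 if count == 0 else float("-inf")
    return count * math.log(mean) - mean


def hotspot_test(n, l_e, l_c):
    """Likelihood ratio test at one position (illustrative sketch).

    n maps (bin, strand) to a read count, with bin in 'l', 'c', 'r'
    and strand in '+', '-'. Returns (statistic, p_value, heat).
    """
    length = {"l": l_e, "c": l_c, "r": l_e}
    bins = [(i, s) for i in "lcr" for s in "+-"]

    # No hotspot model: a single rate shared by all six bins.
    lam0 = sum(n[b] for b in bins) / (2 * (2 * l_e + l_c))
    ll_null = sum(poisson_loglik(n[i, s], length[i] * lam0) for i, s in bins)

    # Hotspot model: {right, +} and {left, -} carry background only.
    background = [("r", "+"), ("l", "-")]
    signal = [b for b in bins if b not in background]
    lam_b = sum(n[b] for b in background) / (2 * l_e)
    lam = max(sum(n[b] for b in signal) / (2 * (l_e + l_c)) - lam_b, 0.0)
    ll_alt = sum(poisson_loglik(n[i, s], length[i] * lam_b)
                 for i, s in background)
    ll_alt += sum(poisson_loglik(n[i, s], length[i] * (lam + lam_b))
                  for i, s in signal)

    stat = max(2 * (ll_alt - ll_null), 0.0)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    heat = 2 * (l_e + l_c) * lam
    return stat, p_value, heat
```

With the coarse bin lengths of the calling procedure (`l_e = 750`, `l_c = 500`), counts concentrated in the four signal bins give a small p-value and a positive heat, while counts proportional to bin length give a heat of zero.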

3.2 Calling procedure

Only fragments with a mapping quality of 20 or above are considered. Duplicate fragments, defined as fragments mapping to the same genomic coordinates, are removed. A conservative, three-step procedure is then applied:

1. We test for the presence of a DSB hotspot every 250bp along chromosomes using the above model, with $l_e$ = 750bp and $l_c$ = 500bp. Positions $p_i$ associated with a p-value below $10^{-4}$ are retained. Potential DSB hotspot segments are then computed by considering the segments $[p_i - 1000, p_i]$, and overlapping segments are merged, producing a final list $[a_j, b_j]$ of potential segments.

2. Segments overlapping with hotspots called in the ChIP-seq control using this procedure are discarded.

3. For each remaining segment $[a_j, b_j]$, we independently compute the likelihood ratio for each genomic position within the range $[a_j + 600, b_j - 600]$, with the modified parameters $l_e$ = 550bp and $l_c$ = 300bp to get a better resolution. The position $p$ for which the likelihood ratio is maximal is reported to be the centre of the DSB hotspot, and the heat of the hotspot subsequently used is the one obtained under this model.

Supplementary Table 1 gives details about the number of DSB hotspots called in each sample. As each meiotic cell presents between 200 and 400 DSBs [1], we estimate that at most 1% of DSB hotspots called (∼15,000 to 17,000) are active in any given meiosis.

Modified procedure for replicates

To combine the two B6^{B6/H} replicates, we modified the above procedure in the following way. Letting $\ell_1$ and $\ell_2$ (resp. $\ell'_1$, $\ell'_2$) be the log-likelihoods of the no hotspot (resp. hotspot) model for replicates 1 and 2, we defined the new likelihood ratio to be

$$\Lambda = 2\left(\ell'_1 + \ell'_2 - \ell_1 - \ell_2\right)$$

and assumed $\Lambda$ follows a chi-squared distribution with $d = 2$ degrees of freedom under the null hypothesis of the absence of a hotspot.
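As a small worked illustration of the joint statistic, the sketch below sums the per-replicate log-likelihoods and converts the result to a p-value. The function name and argument layout are hypothetical. It relies on the fact that the survival function of a chi-squared variable with 2 degrees of freedom is exactly exp(−x/2), so no statistics library is needed.

```python
import math


def joint_replicate_test(loglik_null, loglik_hotspot, threshold=1e-4):
    """Joint likelihood ratio test over two replicates (illustrative sketch).

    loglik_null and loglik_hotspot each hold the two per-replicate
    log-likelihoods of the no hotspot and hotspot models respectively.
    Returns (statistic, p_value, called).
    """
    if len(loglik_null) != 2 or len(loglik_hotspot) != 2:
        raise ValueError("this sketch handles exactly two replicates")
    stat = max(2 * (sum(loglik_hotspot) - sum(loglik_null)), 0.0)
    # Chi-squared survival function with 2 degrees of freedom.
    p_value = math.exp(-stat / 2)
    return stat, p_value, p_value < threshold
```

For example, per-replicate log-likelihoods of (−120, −95) under the no hotspot model and (−110, −90) under the hotspot model give a statistic of 30 and a p-value of exp(−15), well below the calling threshold.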
The inference procedure is otherwise identical to the three-step procedure described above. The same calling p-value threshold ($10^{-4}$) was used. The joint call set (using both replicates) for the B6^{B6/H} mouse was used in the main text.

3.3 Overlap between hotspots and map comparison

3.3.1 Pairwise overlap between DSB hotspot maps

Given a DSB hotspot map (the donor map), we asked what proportion of the DSB hotspots in the donor map was present in a recipient sample. To do this, we set an overlapping p-value threshold $p_{\mathrm{overlap}}$, and asked which hotspots from the donor map are found in the recipient sample. Specifically, for each hotspot of the donor map, occurring at position $p_j$ on chromosome $c$, we re-ran the detailed hotspot calling procedure (step 3 in Supplementary Information Subsection 3.2) in the recipient sample in the segment $[p_j - 600, p_j + 600]$ (with $a_j = b_j = p_j$, where $a_j$ and $b_j$ are defined in step 3 of Subsection 3.2). If any position in the segment $[p_j - 600, p_j + 600]$ is found to be a DSB hotspot at a p-value $p < p_{\mathrm{overlap}}$, the hotspot from the donor is reported to overlap a hotspot in the recipient sample. Thus, two hotspots reported to overlap will have a distance of at most 600bp between their centres. Unless stated otherwise, we used $p_{\mathrm{overlap}} = 10^{-3}$ throughout analyses in this study.

Supplementary Table 2 gives the pairwise overlapping fractions of DSB hotspots. Hotspots in the humanized heterozygous mouse are on average hotter if they are shared with the homozygous humanized mouse rather than with the wild-type mouse (Extended Data Fig. 3d). This analysis also revealed a non-linear effect in heat transmission between one of the homozygotes (B6^{B6/B6}) and the heterozygote, suggesting a saturation effect in B6^{B6/B6}.

3.3.2 DSB hotspot map correlations

To correlate DSB maps, we divided the genome into non-overlapping bins of fixed size (Extended Data Fig. 3b). The heat assigned to a bin is the sum of the heats of DSB hotspots whose centres fall within the bin. We used Pearson's correlation coefficient in the correlation analyses. We studied the correlation between the DSB hotspot maps obtained for homozygous humanized and wild-type mice at different scales (ranging from 1kb to 10Mb). At the 10kb scale, a very weak correlation between the human and mouse DSB hotspot maps was observed (r=0.016), reflecting the usage of a different set of hotspots in the two mice. At larger scales, however, the correlation increased considerably (Extended Data Fig. 3b).
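The binning and correlation just described can be sketched as follows. This is a minimal single-chromosome illustration with hypothetical function names and a hypothetical input layout (each map is a list of `(centre, heat)` pairs); it is not taken from the authors' code.

```python
import math


def binned_heat(hotspots, bin_size, n_bins):
    """Sum hotspot heats into non-overlapping bins of fixed size.

    hotspots is a list of (centre, heat) pairs on one chromosome;
    a hotspot contributes to the bin containing its centre.
    """
    heat = [0.0] * n_bins
    for centre, h in hotspots:
        heat[centre // bin_size] += h
    return heat


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Repeating the calculation over a range of `bin_size` values gives the correlation as a function of scale.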
This same pattern held when comparing the rates between mice with different mouse Prdm9 alleles, including comparison with the Prdm9 knockout mouse [2]. In contrast, different mouse strains with the same Prdm9 allele showed strongly correlated DSB maps at all genomic scales, similar to the strong correlation obtained between mouse replicates. These results confirm a major role for PRDM9 zinc finger array binding specificities alone in defining the fine-scale recombination landscape, but also indicate that the determinants of recombination rates at large scales are likely to include factors other than PRDM9.

3.3.3 DSB hotspots in the (B6 × PWD)F1^{B6/PWD} and (B6 × PWD)F1^{H/PWD} mice

Similarly to the comparison between the infertile and the humanized rescue mice, we found that a genome-wide effect of humanizing the mouse is to strongly reduce hotspot asymmetry in the reciprocal rescue (B6 × PWD)F1^{H/PWD}, compared to the (B6 × PWD)F1^{B6/PWD} mouse. DSB hotspots only active in the reciprocal humanized rescue (B6 × PWD)F1^{H/PWD} (and not in the reciprocal (B6 × PWD)F1^{B6/PWD}), so attributable to the human allele, occur mainly (57%) in hotspots showing approximate symmetry between chromosomes. Conversely, only 20% of DSBs at hotspots present exclusively in the reciprocal (B6 × PWD)F1^{B6/PWD} mouse, attributable to the B6 allele, occur in symmetric hotspots. See also Extended Data Fig. 6a-c.

Section 5.1 describes our procedure to define hotspots for which a fraction of informative reads from the B6 chromosome can be calculated, and how this fraction is then calculated. In the above analysis, symmetric hotspots were defined as those for which the fraction of informative reads from the B6 chromosome could be computed, and for which this fraction fell in the range (inclusive). Asymmetric hotspots were conversely defined as those whose fraction fell outside this range.

4 Motif analyses

4.1 Overview

Search for motif enrichment at hotspot centres was performed using MEME-ChIP (MEME Suite version 4.9.1) and a motif refinement algorithm we developed (see Supplementary Information Subsection 4.2). DNA sequences of 500bp, centred at each DSB hotspot position, were considered, but only 500bp sequences displaying no more than 10 unknown ("N") or repeat-masked bases were retained for the motif discovery analysis (referred to as the discovery sequences).

For each sample, the procedure is as follows. The motif with the most significant enrichment p-value reported by MEME-ChIP (with parameters -ccut 0, -nmeme 1200) was selected as the first raw motif, and passed on to the motif refinement algorithm. Next, based on the output of the motif refinement algorithm, sequences containing the refined motif were excluded from the discovery sequences pool. The remaining discovery sequences were then analysed by MEME-ChIP again, using the same parameters, and the motif that was the most similar to the previously selected motif was selected as the new raw motif, and passed on to the motif refinement algorithm. The process is iterated until no new credible motif arises from the MEME-ChIP step.
We then searched for these refined motifs in all 500bp sequences using our motif refinement algorithm (for this last step, the motifs were fixed). The algorithm is constrained to report no more than one motif per 500bp sequence. A final run of the MEME-ChIP discovery algorithm on the 500bp sequences for which no motif was found only yields repetitive sequences. We used FIMO (MEME Suite version 4.9.1) to locate the identified motifs in the mm10 mouse genome (see Subsection 4.3), as well as in the reconstructed ancestral genome for B6 and PWD mice (see Supplementary Information Subsection 4.4). All motifs with an associated FIMO-reported p-value below $10^{-4}$ were kept.

4.2 Motif refinement algorithm

We developed a new, Bayesian approach to characterise DNA motifs enriched in a set of sequences. The method is able to refine position weight matrices (PWMs) of initially supplied motifs, e.g. those suggested by other approaches such as MEME-ChIP. In short, the algorithm samples fragments containing the motifs and iteratively updates the PWMs via a Gibbs sampler, until convergence; in practice, after a set number of iterations. The method utilises a triplet-based background sequence model. It is able to infer several motifs at the same time, and assumes that each sequence contains at most one motif. This allows separation of even closely related motifs, e.g. those found using different arrangements of PRDM9 zinc fingers. Moreover, the approach outputs probabilities of motif presence for each region, and infers a distribution of where motifs lie within each sequence, improving motif identification against background. The algorithm is presented in the case where all input sequences have the same length $l$.

1. Input:
- $n$ sequences, each of length $l$, some of which may contain instances of one of $k$ motifs;
- $k$ PWMs, denoted $\mathrm{PWM}^{(0)}_i$, $i = 1, 2, \dots, k$;
- the distribution on the proportion of sequences containing each motif, $(\alpha_1, \dots, \alpha_k)$, and no motif ($\alpha_0$), where as a prior $(\alpha_0, \alpha_1, \dots, \alpha_k) \sim \mathrm{Dirichlet}(\tfrac{1}{2}, \tfrac{1}{2k}, \dots, \tfrac{1}{2k})$ (so in particular $\sum_{i=0}^{k} \alpha_i = 1$). By default, initially, $\alpha_0 = \tfrac{1}{2}$, and $\alpha_i = \tfrac{1}{2k}$, for $i \geq 1$;
- prior distribution $(\beta_t)_{t=1,2,\dots,10} \sim \mathrm{Dirichlet}(1, 1, \dots, 1)$ on the region of any sequence in which the beginning of the motif will fall. Each sequence is divided in 10 bins of equal size, and $\beta_t$ is the probability that the $t$-th bin has a motif in it (given the sequence has a motif). By default, the vector is uniformly initialised: $\beta_t = \tfrac{1}{10}$, for $t = 1, 2, \dots, 10$;
- prior on PWMs: uniform Dirichlet(1,1,1,1) prior on each base within a motif, and a geometric prior on the number of bases within the motif, such that the mean motif length is 20 bp.

2. Initialisation: background model

(a) From the input sequences, estimate the probability vector $q_{r_3 \mid r_1 r_2}$ of having the base $r_3 \in \{A, C, G, T\}$, given we observe the bases $r_1$ and $r_2$ in the triplet $(r_1, r_2, r_3)$, at any particular position in any sequence. The vector $q$ is simply estimated by its maximum likelihood estimator.

(b) Using the vector $q$, compute the probabilities $f_{ijps}$ of observing a particular fragment starting at position $p$ in the $j$-th sequence, on strand $s$, and having the same length as the $i$-th motif. So $f_{ijps}$ represents the probability of a particular fragment occurring under the background model, in which fragments are generated according to the conditional probabilities $q_{r_3 \mid r_1 r_2}$.

3. Iteration: $m$-th step

(a) Compute the likelihood $b_{ijps}$ of observing the sequence starting at position $p$ on the strand $s$ in sequence $j$, if motif $i$ begins at this position. This is simply done by assuming independence between positions under the motif model (PWM). At each (motif) position, the relative probability of having a particular base is given by $e^{\mathrm{PWM}^{(m)}_i}$ (by definition, PWMs are on a log scale, and scaled so these probabilities sum to 1).

(b) Compute the likelihood ratio $\tilde{b}_{ijps} = b_{ijps} / f_{ijps}$ of the sequence in (a), relative to the background model.

(c) Compute the posterior probability of a particular motif occurring at a certain position as $\pi_{ijps} = c_j\, \alpha_i\, \beta_{t_p}\, \tilde{b}_{ijps}$, where $t_p = \lceil 10p/l \rceil$ indexes the bin containing position $p$ and the normalisation constant is $c_j = \{\alpha_0 + \sum_{i,p,s} \alpha_i \beta_{t_p} \tilde{b}_{ijps}\}^{-1}$. Note this relies on Bayes' formula, and on our assumption that at most one motif can arise on any one sequence.

(d) For the $j$-th sequence, assign no motif with probability $1 - \sum_{i,p,s} \pi_{ijps}$. Otherwise, assign the $i$-th motif at position $p$ on strand $s$ to this sequence, with probability $\pi_{ijps} / \sum_{i,p,s} \pi_{ijps}$.

(e) Update the $i$-th motif by computing the new matrix $\mathrm{PWM}^{(m+1)}_i$. This is done by estimating the PWM from all the fragments carrying the $i$-th motif that have been sampled in step (3d). Specifically, fragments sampled as containing the $i$-th motif, and extended by 25 base pairs on both ends, are used to create a count array of bases at each motif position among motif occurrences. Conditional on these occurrences of motif $i$, we sample a new motif length (using the geometric prior) and then sample a new PWM (given the length of the motif, we take this new PWM to be that maximising the posterior log-likelihood, rather than sampling from the full posterior distribution).

(f) Update remaining model parameters, including the fractions of sequences containing each motif, and the prior on where motifs fall within each region. These updates use conjugacy properties of the chosen priors.

Motifs obtained using this method for B6, PWD and humanized homozygous mice are reported in Extended Data Fig. 4a-c using logo plots to represent motifs.
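Steps 3(c) and 3(d) of the algorithm, applied to a single sequence, can be sketched as below. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the likelihood ratios are assumed to be precomputed and supplied in a dictionary, and start positions are taken to be 0-based so that the bin of position `p` is `10 * p // l` (an assumption about how positions map to the 10 bins).

```python
import random


def motif_posteriors(alpha, beta, ratios, seq_len):
    """Posterior probabilities of each (motif, position, strand) for one sequence.

    alpha[0] is the no-motif proportion and alpha[i] that of motif i (i >= 1);
    beta holds the bin probabilities; ratios maps (i, p, s) to the likelihood
    ratio of motif i starting at 0-based position p on strand s.
    """
    n_bins = len(beta)
    weights = {}
    for (i, p, s), ratio in ratios.items():
        t = min(n_bins * p // seq_len, n_bins - 1)  # bin containing p
        weights[(i, p, s)] = alpha[i] * beta[t] * ratio
    c = 1.0 / (alpha[0] + sum(weights.values()))  # normalisation constant
    return {key: c * w for key, w in weights.items()}


def sample_assignment(posteriors, rng=random):
    """Draw one (motif, position, strand), or None for 'no motif'."""
    u = rng.random()
    cumulative = 0.0
    for key, prob in posteriors.items():
        cumulative += prob
        if u < cumulative:
            return key
    return None  # remaining mass: 1 minus the sum of the posteriors
```

With two motifs, default priors, and likelihood ratios of 50 and 10 at two candidate positions in a 500bp sequence, the posteriors are 0.625 and 0.125, leaving probability 0.25 for no motif.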
The information content (y-axis, bits) at a given position for the $i$-th base is computed as $p_i \{2 + \sum_j p_j \log_2 p_j\}$, where $p_i$ is the relative frequency of base $i$, as given by the PWM. We find that binding motifs are enriched at the centre of DSB hotspots (Extended Data Fig. 4e-g).

4.3 Search of motifs in genomes

The FIMO module from MEME-ChIP (MEME Suite version 4.9.1) was used to scan genomes for matches to motifs obtained from the motif search within DSB sequences. For the B6 and humanized homozygous mice, the motif with the smallest enrichment p-value reported by MEME-ChIP was considered (Extended Data Fig. 4), hereafter called B6 and human motifs (obtained from B6 and humanized homozygous DSB sequences, respectively).
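The per-base logo heights defined at the end of Subsection 4.2 can be computed as in the short sketch below. The function name and the dictionary input are hypothetical; the formula is the standard sequence-logo information content for a four-letter alphabet, with the convention that a base of frequency zero contributes nothing to the sum.

```python
import math


def logo_heights(freqs):
    """Height in bits of each base at one motif position.

    freqs maps each base to its relative frequency at this position.
    The position's information content is 2 + sum_j p_j * log2(p_j),
    and each base's height is its frequency times that quantity.
    """
    info = 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return {base: p * info for base, p in freqs.items()}
```

A position that is always the same base has a height of 2 bits for that base, whereas a uniform position has height 0 for every base.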

Sequence motifs enriched within the DSB hotspots in humanized B6^{H/H} and wild-type B6^{B6/B6} mice were found to closely match the previously reported PRDM9 binding motifs for human and mouse, respectively [3, 4] (Extended Data Fig. 4). Overall, 85% and 78% of the DSB hotspots were inferred to contain a binding motif in the homozygous humanized and wild-type mice respectively. Motifs were most enriched at hotspot centres (Extended Data Fig. 4e-g), and hotter hotspots contained a strong PRDM9 binding motif more often than weaker hotspots (data not shown).

4.4 Identification of mutated motifs

4.4.1 PWD sequencing and ancestral genome to B6 and PWD

We sequenced the genome of a wild-type PWD male mouse at 50× coverage on Illumina HiSeq 2500 (Wafergen DNA library prep kit). We obtained paired-end reads of 150bp each, which we aligned to the reference genome mm10 using Stampy [5]. We subsequently used the variant caller Platypus [6] to call variants in the PWD mouse. Only variants reported with a PASS filter were kept in the following analysis. Multiallelic variants were ignored.

Using Mus famulus and Mus caroli as outgroups, we reconstructed an ancestral reference genome for B6 and PWD. Specifically, whole genome resequencing data for these two mouse subspecies were obtained from [7, 8] and re-mapped to mm10 using Stampy. We then used Platypus to genotype PWD variants in both Mus famulus and Mus caroli. For a given PWD variant, several situations can arise:

1. A genotype has been called in both Mus famulus and Mus caroli. If both subspecies are homozygous for the PWD allele, or if one is homozygous and the other is heterozygous, the PWD variant is labelled as ancestral. Conversely, if both subspecies are homozygous and carry the B6 allele, or if one is homozygous for the B6 allele and the other is heterozygous, the PWD variant is labelled as derived.

2. A genotype has only been called in one of Mus famulus and Mus caroli. In this case, the PWD variant is marked as ancestral if the subspecies for which a call has been reported is homozygous for the PWD allele. Conversely, if the subspecies is homozygous for the B6 allele, the PWD variant is called as derived.

3. Otherwise, the ancestral status of the variant is set to missing.

Using this ancestral classification of the PWD variants, we were able to recreate an ancestral reference genome for B6 and PWD using the following procedure. For each mm10 reference chromosome, start at the left end of the chromosome and walk towards its right end. At each genomic position in the reference, if a PWD variant is encountered, either (i) replace the B6 (mm10) allele by the PWD allele if the PWD variant is ancestral, (ii) leave the B6 (mm10) allele if the PWD variant is derived, or (iii) replace the B6 (mm10) allele by the PWD allele with probability 1/2 if the ancestral status of the PWD variant is missing. Hence, while walking along the mm10 reference sequence, we created an ancestral reference sequence. In the process, we kept a record of coordinate correspondences between the two reference genomes.

4.4.2 Mutated ancestral motifs in DSB hotspots

To obtain a set of ancestral motifs, the FIMO module from MEME-ChIP (MEME Suite version 4.9.1) was used to scan the ancestral genome for matches to motifs obtained from the motif search within DSB sequences in B6 and PWD homozygous mice.

To derive the fraction of DSB hotspots for which a variant is found in an ancestral motif, we applied the following procedure. Looking at, say, the hotspots that are under control of the B6 Prdm9 allele, we first restricted the hotspot list to hotspots that have exactly one occurrence of the B6 motif, within 300bp of the hotspot centre, in the ancestral reference. For all these hotspots, we only considered the 1kb segments centred at the middle of the B6 motif. For these segments, we computed the fraction of segments that harboured at least one variant (SNPs and indels) on the B6 and PWD lineages, in a 30bp window sliding from the beginning to the end of the 1kb segment. As we work from the DSB perspective throughout, while computing the fraction, each segment is weighted by the heat of the hotspot to which it corresponds. The fractions of segments that harboured at least one variant on the B6 and PWD lineages in PWD motifs are computed in the same way. The point estimates obtained along the segments are shown in Extended Data Fig. 6g.

Each binding motif called in the ancestral genome is assigned a score, which is defined as the logarithm of the probability that this motif was drawn from the motif's PWM. We also computed the corresponding scores, for each motif, in both the B6 and PWD lineages, using the following procedure. For each motif in the ancestral lineage, we created the corresponding motifs for B6 and PWD lineages by substituting any ancestral base affected by a point mutation in either B6 or PWD.
We then computed the motif scores for these modified motifs in the same way as previously. No score was assigned to motifs that were affected by variants that were not point mutations. Given a specific PRDM9 control (either B6 or PWD) and a specific lineage (either B6 or PWD), we computed the difference in motif score, which is simply the difference between the lineage-specific and the ancestral scores (Extended Data Fig. 6g). A negative difference indicates the motif was worsened by the change along its lineage, when compared to the initial (ancestral) motif.

5 DSB hotspot assignment in hybrids

5.1 Chromosome-aware DSB calling

Using the list of single nucleotide polymorphisms (SNPs) from the PWD variant calling (see Supplementary Information Subsection 4.4), each read pair from a hybrid DSB library is assigned to one of the categories B6, PWD, unclassified or uninformative using the following criteria:

1. if one or both reads of a pair overlap SNP positions from the list and the alleles carried by the read(s) are from the B6 genome, the pair is classified as B6;

2. if one or both reads of a pair overlap SNP positions from the list and the alleles carried by the read(s) are from the PWD genome, the pair is classified as PWD;

3. if one or both reads of a pair overlap SNP positions from the list and the alleles carried by the read(s) are from both B6 and PWD genomes, the pair is unclassified;

4. otherwise, the pair is uninformative.

Then, for each DSB hotspot in a sample, we extract all reads mapping within 1kb of the centre of the hotspot. Using these reads, we compute the fraction of reads coming from the B6 chromosome as the number of paired reads classified as B6 over the total number of paired reads classified either as B6 or PWD. The ratio is not computed if fewer than 10 pairs are classified as B6 or PWD in this genomic segment, or if more than 10% of the pairs marked B6, PWD or unclassified are actually unclassified. The fraction of reads coming from the B6 chromosome is also used to classify DSB hotspots as active on either (i) the B6 chromosome, if the ratio is greater than 80%, (ii) the PWD chromosome, if the ratio is less than 20%, or (iii) both chromosomes, if the ratio is between 20% and 80%.

5.2 Attribution of PRDM9 control in hybrids

Using both the DSB maps in homozygous samples (B6, PWD, B6^{H/H} and B6^{−/−}) and PRDM9 binding motif calls on the B6/PWD ancestral reference sequence, we were able to classify DSB hotspots as being under either B6, humanized or PWD Prdm9 allele control in the hybrids. In the case of the infertile (PWD × B6)F1^{PWD/B6} hybrid, for each DSB hotspot:

1. if the DSB hotspot overlaps (at the p-value < $10^{-4}$ threshold) DSB hotspots (see Supplementary Information Subsection 3.3.1) from both B6 and PWD maps, or if the DSB hotspot overlaps with hotspots from B6^{−/−}, the PRDM9 control for this hotspot is undetermined;

2. if the DSB hotspot overlaps with hotspots from exactly one of the B6 or PWD maps, the DSB hotspot is set to be under PRDM9-B6 or PRDM9-PWD control, respectively;

3. if the DSB hotspot does not overlap with the homozygous mice DSB maps, and if in the 600bp segment centred at the centre of the hotspot (in the ancestral reference sequence), there are only binding motifs from one of the Prdm9 alleles (B6 or PWD), then this hotspot is set to be under the control of the corresponding PRDM9;

4. otherwise, the PRDM9 control of the hotspot is undetermined.

The humanized rescue (PWD × B6)F1^{PWD/H} hybrid is treated in a very similar manner, replacing all occurrences of the B6 allele of PRDM9 by its humanized version.
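The read-pair classification and per-hotspot B6 fraction of Subsection 5.1 can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the alleles observed at listed SNP positions have already been extracted for each read pair, and that the pairs passed in are those mapping within 1kb of the hotspot centre.

```python
def classify_pair(alleles):
    """Classify one read pair from the alleles it carries at listed SNPs.

    alleles is a list of 'B6' / 'PWD' calls, empty if no SNP is overlapped.
    """
    observed = set(alleles)
    if not observed:
        return "uninformative"
    if observed == {"B6"}:
        return "B6"
    if observed == {"PWD"}:
        return "PWD"
    return "unclassified"


def b6_fraction(labels, min_pairs=10, max_unclassified=0.10):
    """Fraction of informative pairs from the B6 chromosome, or None
    when the ratio is not computed under the two stated criteria."""
    b6 = labels.count("B6")
    pwd = labels.count("PWD")
    unclassified = labels.count("unclassified")
    informative = b6 + pwd
    if informative < min_pairs:
        return None
    if unclassified / (informative + unclassified) > max_unclassified:
        return None
    return b6 / informative


def active_chromosome(fraction):
    """Assign hotspot activity from the B6 fraction (80% / 20% cut-offs)."""
    if fraction is None:
        return None
    if fraction > 0.8:
        return "B6"
    if fraction < 0.2:
        return "PWD"
    return "both"
```

For instance, a hotspot with 9 B6 pairs, 3 PWD pairs and 1 unclassified pair has a B6 fraction of 0.75 and is classified as active on both chromosomes.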

5.3 Evidence for erosion of PRDM9 binding motifs

Attributing DSB hotspots to chromosomes and to specific Prdm9 alleles, along with mutation detection using an ancestral reference (see Subsection 4.4), allows the detailed study of mutational patterns within and around PRDM9 binding sites on the B6 and PWD lineages (Extended Data Fig. 6g). For the B6 motifs that are totally lost in the B6 lineage (no hotspot called in the wild-type B6, but called in the infertile mouse on the PWD chromosome), the maximum point enrichment of both SNPs and indels within the motif, compared to genomic background, is 7-fold (data not shown). These analyses provide strong evidence that motif erosion is occurring at PRDM9 binding targets, and indicate that indels, along with point mutations, also play an important role in motif erosion.

Results suggest that the B6 motif is more eroded than the PWD motif (Extended Data Fig. 6g). This could be explained either by the B6 allele being older or more common in the past than the PWD allele, or by a larger ancestral effective population size for the B6 mouse. We also observe that mutations within and around the motifs (0-200bp) are subjected to strong GC bias (data not shown) and that the hotspot heats in the ancestral lineage are associated with an increased GC bias around the binding motifs. This constitutes strong evidence for the GC-biased gene conversion phenomenon at DSB hotspots in mice [9].

In the infertile mice, although a high fraction of DSB hotspots under the control of the B6 allele of Prdm9 occur solely on the PWD chromosome, due to motif erosion on the B6 lineage (and vice versa for the PWD allele), we note, as expected, that a small fraction of hotspots harbour the opposite pattern: these hotspots occur preferentially on the B6 chromosome (Fig. 2b). This is explained by chance loss of the B6 motif on the PWD lineage (vice versa for the PWD motif), and not meiotic drive.
Besides, somewhat conversely, as shown by the blue dotted line in Extended Data Fig. 6g, a small number of B6 motifs are gained on the PWD lineage. This again likely results from random mutations (not under any drive or selection) creating new binding motifs on the PWD lineage. 5.4 Fraction of asymmetric DSB hotspots explained by binding motif disruption To assess the extent to which mutations within PRDM9 binding motifs explain the observed chromosomal asymmetry in DSB initiation, we compared the fraction of mutated motifs in symmetric versus asymmetric hotspots. Specifically, for DSB hotspots under B6 PRDM9 control in the (PWD B6)F1 PWD/B6 mouse, we defined for this analysis the symmetric hotspots as those with a fraction of reads coming from B6 chromosome in the range (inclusive), and with a hotspot heat of at least 50 (to reduce uncertainty in the ratio estimate). Asymmetric hotspots were defined as those with a fraction less than or equal to 0.1 and with a hotspot heat of at least 50. For both symmetric and asymmetric hotspots, we further restricted ourselves to DSB hotspots containing a clear motif match: exactly one B6 binding motif in the ancestral sequence (see Subsection 4.4.2), within 250bp of the DSB hotspot centre. In 13

both classes of hotspots, we then computed the fraction of hotspots whose binding motif was mutated (SNP or indel) between the B6 and PWD lineages. Letting $f_s$ and $f_a$ denote these fractions in the symmetric and asymmetric cases respectively, we finally computed the fraction of hotspots whose asymmetry could be explained by a mutation in the binding motif, using the symmetric hotspots to correct for chance SNPs occurring within the motif (probability $f_s$), as

$$f_e := 1 - \frac{1 - f_a}{1 - f_s}.$$

Equivalently, $1 - f_a = (1 - f_e)(1 - f_s)$: an asymmetric hotspot has an unmutated motif only if its asymmetry is not explained by a motif mutation and no chance mutation has occurred either. For DSB hotspots under PWD PRDM9 control in the (PWD B6)F1 PWD/B6 mouse, we followed the same approach, but we defined the asymmetric hotspots as those with a fraction greater than or equal to 0.9. Finally, for DSB hotspots under the control of humanized PRDM9 in the (PWD B6)F1 PWD/H mouse, we followed the same approach, with two modifications. First, we used exact matches to the 13bp motif CCNCCNTNNCCNC to conservatively identify human binding motifs. We considered SNPs falling within the human refined motif (Extended Data Fig. 4b), which was naturally defined by extending the above 13bp motif seed. Second, we defined asymmetric hotspots as those with either a fraction less than or equal to 0.1 or greater than or equal to 0.9. The fractions $f_a$, $f_s$ and $f_e$ in those three cases are reported in Supplementary Table 3. In all cases, we could explain a very high fraction (> 83%) of asymmetric hotspots by mutations in the binding motif (affecting differently the binding motifs on the B6 and the PWD chromosomes).

5.5 Systematic shift towards the chromosome less bound in hybrid mice

We detected a subtle systematic shift ($p < 10^{-15}$, binomial test) in DSB heat towards the chromosome where, overall, PRDM9 bound less often in the infertile mouse (Fig. 2a), moving the central peak of DSB activity away from 50% on each chromosome.
This shift might reflect interference or compensation acting towards an equal number of DSBs occurring on each homologue, or alternatively differential persistence (e.g. through longer repair time) of DSBs marked by DMC1 on the respective chromosomes.

6 Histone marks and genomic features in the humanized mouse

PRDM9 binding motifs have varying probabilities of becoming the centre of a DSB hotspot, and some may never be used in this respect. To gain some information on the determinants of such probabilities, we compared the distributions in ChIP-seq coverage of several epigenetic marks around binding motifs in standard B6 testis cells, differentiating between motifs within DSB hotspots and those which lie outside. ChIP-seq peaks in 8-week mouse testis for histone modifications H3K4me1, H3K4me3, H3K27ac, H3K27me3 and H3K36me3 were obtained from the Mouse Encode Project [10]. When available, corresponding marks in heart, kidney and liver tissues were also used for comparison. We defined a DSB hotspot as overlapping a

particular mark if the centre of the hotspot was within 10 bp of a peak for this mark.

6.1 H3K4me3 marks

For the non-specific tissues we considered (heart, kidney and liver), the H3K4me3 enrichment was 2.4 times higher at the mouse motifs outside DSB hotspots than at those within DSB hotspots identified in the wild-type mouse (Extended Data Fig. 5a). In testis, however, the high enrichment of H3K4me3 marks at the motifs within DSB hotspots reflects the trimethyl-transferase activity of the PRDM9 PR/SET domain [11]. Interestingly, mean H3K4me3 enrichment at motifs outside DSB hotspots was not significantly different in testis compared to other tissues, which suggests a low trimethylation activity of H3K4 by PRDM9 at motifs that are not the centre of DSB hotspots. Besides, human motifs outside DSB hotspots found in the humanized mouse are enriched in H3K4me3 marks, with the enrichment being stronger than observed in mouse, possibly because of the GC-rich nature of the human motif (Extended Data Fig. 4b). As the ChIP-seq experiment is performed in standard B6 testis cells, this enrichment of H3K4me3 marks cannot be attributed to the action of PRDM9 at the human motifs within DSB hotspots. Therefore, the observed H3K4me3 status at human motifs within DSB hotspots must reflect the H3K4me3 status before PRDM9 binding, which is similar across tissues. To quantify the enrichment of H3K4me3 marks at motifs outside DSB hotspots, we further looked at H3K4me3 marks in kidney. For each motif within a DSB hotspot, we selected a matched motif outside DSB hotspots (at least 2kb away from the nearest hotspot) in highly mappable regions.
Using odds ratios (OR) comparing the number of motifs enriched and not enriched in H3K4me3 within and outside DSB hotspots, we found that, for both human and mouse motifs, H3K4me3-enriched motifs are significantly more frequent outside DSB hotspots than within (human OR = 1.18, 95% CI (1.118, 1.236), p-value = ; mouse OR = 1.05, 95% CI (1.002, 1.111), p-value = 0.04). Similar results are obtained with H3K4me3 marks in heart tissue. This suggests that the H3K4me3 marks in other tissues at these motifs are a good proxy for the H3K4me3 patterns prior to the action of PRDM9. Hence, in both mouse and human cases, a motif is more likely to become a DSB hotspot if its local environment is depleted for H3K4me3. Since H3K4me3 marks are enriched at promoters, this suggests that the presence of H3K4me3 marks (prior to PRDM9 activity) is involved in directing PRDM9 away from promoter regions. We also performed the same comparisons using our in-house generated H3K4me3 data, which gave almost indistinguishable results. For consistency across all comparisons (i.e. H3K36me3, H3K27me3, H3K4me1, H3K27ac) we present the published consortium data for this analysis. Brick et al. [4] had earlier shown that in the mouse, the use of PRDM9 to target sites for DSBs has the effect of moving DSBs away from promoters to the specific DNA sequence motifs targeted by PRDM9, in contrast for example to yeast, and to the Prdm9 knockout mouse (B6 -/-). Here we show further that amongst instances of the binding motif in the humanized mouse (B6 H/H), DSBs occur preferentially at motifs which do not carry H3K4me3 marks, with a similar, but weaker, effect in the wild-type B6 mouse.
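The odds-ratio comparison in Subsection 6.1 can be illustrated with a minimal sketch. The text does not state how the confidence intervals were obtained; the Wald interval on the log scale used here is an assumption, and the function name and the counts in the usage note are hypothetical.

```python
import math


def odds_ratio_ci(enriched_outside, not_enriched_outside,
                  enriched_within, not_enriched_within, z=1.96):
    """Odds ratio for H3K4me3 enrichment outside vs. within DSB hotspots.

    Returns (OR, lower, upper). The interval is a Wald interval on
    log(OR); this choice is an assumption, not taken from the source.
    All four counts must be positive.
    """
    odds_outside = enriched_outside / not_enriched_outside
    odds_within = enriched_within / not_enriched_within
    odds_ratio = odds_outside / odds_within

    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / enriched_outside + 1 / not_enriched_outside
                   + 1 / enriched_within + 1 / not_enriched_within)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

For instance, `odds_ratio_ci(20, 10, 10, 20)` (made-up counts) gives an odds ratio of 4 with an interval that excludes 1. An OR above 1 under this orientation means that enriched motifs are over-represented outside hotspots, which is the direction reported above.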

6.2 H3K36me3 marks

Recent evidence suggests PRDM9 is also able to trimethylate H3K36 [12]. We indeed find an enrichment of H3K36me3 signal at motifs that fall within a mouse DSB hotspot in testis, likely reflecting that this newly discovered activity of PRDM9 is associated with PRDM9 binding. However, neither enrichment nor depletion is seen in other tissues, or in motifs that lie outside DSB hotspots (Extended Data Fig. 5b). Furthermore, no significant differences are seen around human motifs, regardless of whether they are within or outside DSB hotspots (Extended Data Fig. 5b). It thus appears that H3K36me3 status does not influence PRDM9 binding, which suggests that any role of this mark in DSB formation lies downstream of PRDM9 binding.

6.3 Strand effect for H3K4me3 and H3K36me3

In order to better understand the biochemical mechanism underlying PRDM9 function, we asked whether its trimethylation activity harbours any strand-specific effect. After centering each mouse DSB hotspot to the nearest binding motif, the mean distributions for both H3K4me3 and H3K36me3 marks show higher densities 5′ of the motif, and this effect is seen on both strands (Extended Data Fig. 5c). It is unclear whether this asymmetry reflects a preference of PRDM9 to trimethylate 5′ histones, or whether this preferential methylation pattern at DSB hotspots is selected downstream of PRDM9 binding.

6.4 Other histone marks

We also investigated other histone marks (Extended Data Fig. 5d-h). We found in particular that H3K4me1 displays a very similar, though weaker, enrichment pattern to H3K4me3 (with DSB sites), possibly capturing e.g. a transient (de)methylation state. Moreover, H3K27me3 was found to be strongly depleted around B6 allele controlled hotspots. The histone modification H3K27me3 is involved in regulation of transcription and may be depleted around PRDM9-induced H3K4me3 marks.
Two acetyl marks (H3K9ac, H3K27ac) also show differential enrichment depending on whether or not motifs are within DSB hotspots (with motifs within DSB hotspots showing a depletion of the histone mark, while motifs outside DSB hotspots show an enrichment of the histone mark).

6.5 Exons and Transcription Start Sites (TSS)

We investigated the relationship between mouse exons and our DSB hotspot maps in both wild-type B6 and humanized mice. Genomic coordinates for genes, exons and transcription start sites were retrieved from known gene tables from the UCSC gene track (mouse assembly mm9/NCBI37). Both B6 and humanized hotspots were significantly enriched in mouse exons, which span 2.9% of the mouse genome, with 10.7% and 11.0% overlapping an exon, respectively. We then asked whether these results are due to differences in numbers of motifs within and outside of exons, or whether being in an exon is

a significant predictor of whether or not a motif will become a hotspot. In the case of the B6 hotspots, the effect is fully explained by an enrichment of motifs in exons (OR = 0.99, 95% CI (0.93, 1.05)), but not for human hotspots, which remain significantly enriched within exons after controlling for the distribution of motifs (OR = 1.15, 95% CI (1.11, 1.21), p-value = ). Furthermore, hotter hotspots overlap exons slightly more often than weaker ones. Mouse exons thus appear to be in a slightly favourable conformation to allow DSBs to take place, irrespective of the Prdm9 allele present. To investigate the overlap between DSB maps and TSS, we asked what proportion of DSB hotspots from a particular map overlap (at $p_{overlap} < 10^{-4}$) with the 200 bp segments centred around each transcription start site, as defined by the UCSC gene table. In both wild-type B6 and humanized homozygotes, we found that 1.2% of the DSB hotspots overlapped a TSS segment thus defined, indicating that, as previously observed [4], most hotspots do not overlap TSS.

7 H3K4me3 ChIP-seq

7.1 Identifying PRDM9 binding peaks and defining PRDM9-independent H3K4me3 regions

Peak calling was performed using a maximum-likelihood-based peak calling algorithm that uses fragment coverage information from both sequencing replicates and the total chromatin control (specified in [13]). For each bin in the genome the approach estimates a ChIP enrichment value relative to local background, and it also provides genome-wide estimates of the proportion of reads originating from signal vs. background, giving an estimate of the purity of each replicate. For de novo identification of enriched regions we merged adjacent 100bp non-overlapping bins with $p < 10^{-5}$, genome-wide. To enable filtering of H3K4me3 peaks corresponding to promoters and other PRDM9-independent sources, we identified H3K4me3-enriched regions shared among any pair of mice with different Prdm9 alleles.
Specifically, we took [ (PWD B6)F1 PWD/H ∩ B6 B6/B6 ] ∪ [ (PWD B6)F1 PWD/B6 ∩ B6 H/H ] ∪ [ (B6 PWD)F1 B6/PWD ∩ B6 H/H ] to define a set of regions likely to be trimethylated independently of PRDM9. For comparisons of H3K4me3 and DMC1 signals at DSB hotspots, we used the same approach to estimate H3K4me3 enrichment in a 1kb bin centred on the midpoint of a given DSB hotspot. For downstream analyses, we removed DSB hotspots overlapping any of the PRDM9-independent H3K4me3 sites defined above. When directly comparing H3K4me3 enrichment between different mice (as in Fig. 5b), we normalised H3K4me3 enrichment to the sum across all DSB hotspots being compared. For de novo peak calling in H3K4me3-enriched regions (defined above), the base with the maximum read coverage within each region was chosen as the peak centre. Then, around each peak centre we computed coverage and performed likelihood ratio testing in a 1kb window (in keeping with the 1kb windows used for force calling around DMC1 midpoints). For peak calling with published single-end H3K4me3 ChIP-seq data from a (B6 CAST)F1 B6/CAST mouse [14], we performed all steps identically, but we computed

read coverage instead of fragment coverage across the genome.

7.2 Estimating haplotype-specific H3K4me3 signals

H3K4me3 ChIP-seq reads overlapping the 1kb region surrounding each DSB hotspot centre were compared with a list of biallelic SNPs distinguishing the PWD genome from the B6 genome (described in Supplementary Information Subsection 4.4.1). Reads matching one or more PWD alleles and no B6 alleles at these sites (with base quality ≥ 20) were assigned to the PWD haplotype, and vice versa. Reads not overlapping any SNPs and reads matching alleles from both haplotypes were excluded. We then subtracted the expected background coverage at each site using information from the input lane and from our peak calling algorithm. For example, to estimate B6 haplotype coverage after subtracting expected background, we computed:

$$d^{B6}_{rep1} + d^{B6}_{rep2} - 0.5\,(\alpha_1 + \alpha_2)\,\left(d^{B6}_{input} + d^{PWD}_{input}\right),$$

where for example $d^{B6}_{rep1}$ represents the number of B6-assigned read pairs from ChIP replicate 1, and $\alpha_1$ and $\alpha_2$ are constants relating expected background coverage in each replicate to the input signal (these are estimated genome-wide by the peak calling algorithm). Any resulting background-subtracted coverage values below zero were set equal to zero, and then background-subtracted B6 coverage was divided by total background-subtracted B6 plus PWD coverage to estimate the proportion of H3K4me3 signal from the B6 chromosome. The proportion was recorded as undetermined if there were fewer than 10 haplotype-informative reads at the hotspot. This proportion was then multiplied by the total H3K4me3 enrichment estimate at each hotspot to yield a haplotype-specific enrichment estimate.
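The haplotype-specific estimate in Subsection 7.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the PWD coverage is taken to be background-subtracted with the same term as the B6 coverage, the 10-read threshold is applied to the haplotype-assigned ChIP reads summed over both replicates, and all function and argument names are hypothetical.

```python
def b6_signal_fraction(b6_rep1, b6_rep2, pwd_rep1, pwd_rep2,
                       b6_input, pwd_input, alpha1, alpha2,
                       min_informative_reads=10):
    """Proportion of H3K4me3 signal attributed to the B6 chromosome.

    Returns None when the proportion is undetermined.
    """
    # Assumption: "haplotype-informative reads" are the ChIP reads
    # assigned to either haplotype, summed over both replicates.
    informative = b6_rep1 + b6_rep2 + pwd_rep1 + pwd_rep2
    if informative < min_informative_reads:
        return None

    # Expected background, following the expression given in the text.
    background = 0.5 * (alpha1 + alpha2) * (b6_input + pwd_input)

    # Assumption: the PWD haplotype is treated symmetrically to B6.
    # Negative background-subtracted values are set to zero.
    b6 = max(0.0, b6_rep1 + b6_rep2 - background)
    pwd = max(0.0, pwd_rep1 + pwd_rep2 - background)

    total = b6 + pwd
    if total == 0:
        return None  # no signal left after background subtraction
    return b6 / total


def haplotype_enrichment(total_enrichment, b6_fraction):
    """Split a hotspot's total enrichment into (B6, PWD) components."""
    return total_enrichment * b6_fraction, total_enrichment * (1 - b6_fraction)
```

With made-up counts, `b6_signal_fraction(30, 30, 10, 10, 10, 10, 0.5, 0.5)` subtracts a background of 10 from each haplotype and returns 50/60; a hotspot with only six assigned reads returns `None`.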
8 Chromosome effects following Prdm9 humanization

8.1 Prdm9 humanization of the (PWD B6)F1 PWD/B6 mouse

Chromosome effects in DMC1 signal comparisons between infertile and rescue

To test for potential chromosome effects differentiating DMC1 or H3K4me3 signals between the infertile and rescue mice, we compared signal intensities at hotspots shared between those two mice. Specifically, we defined shared hotspots as those hotspots whose estimated centres (by DMC1 signal) in the infertile and in the rescue are no more than 500bp apart. We further required the hotspots to be under PWD PRDM9 control in both mice (as expected), and we restricted our analyses to the hotspots whose H3K4me3 signal is greater than the median signal, in each of the infertile and the rescue, amongst the shared hotspots (this is because of the relatively higher level of background noise in H3K4me3 ChIP-seq, and to enable us to compare identical hotspots as in Subsection 8.1.2). Finally, we only considered hotspots on the autosomes. After applying the different filters, the list of shared hotspots comprised 1,536 DSB hotspots. To avoid biasing any potential chromosome effect estimates by differences in overall DMC1 heat distributions between mice (for example, a non-linear relationship between

heat in the infertile and rescue mice, of which there is some visual evidence at higher heats), we used two approaches. In the first, we matched chromosomes for their DMC1 hotspot heat distribution in the (fertile) rescue mouse before comparing them to the infertile mouse. In the second, we used a generalized linear model which can explicitly model such a non-linear effect (see below). Both methods yielded almost identical estimates ($r^2 = 0.97$). In the main text, we report results using the generalized linear model, which is (in our view) a more powerful approach that directly compares hotspots, and accordingly gives somewhat smaller standard errors on estimates. All the results described in this work hold, essentially identically, with the first, non-model-based approach. Both methods estimate the ratio of DMC1 heat between the infertile and rescue mice.

First approach to estimate chromosome effects

To apply the first approach, we divided all DMC1 hotspots (across all autosomes) in the rescue mouse into 5 bins based on their heats, each bin containing the same number of hotspots. For each shared hotspot $j$, we let $w_j^{rescue}$ be the number of hotspots on that chromosome that fall into the same bin as hotspot $j$ (using DMC1 heats in the rescue mouse); this gives a weight for the bin, which we applied to equalise heat across chromosomes in the rescue mouse.
We then simply computed the ratio of average DMC1 heat in the infertile mouse to heat in the rescue mouse ($\beta_i$, $1 \le i \le 19$), on each chromosome, in the following manner: letting $d_j^{infertile}$ and $d_j^{rescue}$ be the DMC1 heats at the $j$th shared hotspot, we formed

$$\hat{\beta}_i = \frac{\sum_{\text{all hotspots } j \text{ on } i\text{th chr}} \frac{1}{w_j^{rescue}}\, d_j^{infertile}}{\sum_{\text{all hotspots } j \text{ on } i\text{th chr}} \frac{1}{w_j^{rescue}}\, d_j^{rescue}}.$$

To ensure the observed chromosome effects were not due to biases introduced by specific filters, we varied the number of bins used to match DMC1 heat distributions (5 to 20), applied filters to cap extreme outlying values of DMC1, H3K4me3, or both, and also removed the lower bound (median) filters on H3K4me3. In those different cases, we found very similar chromosome effects: the median $r^2$ when comparing the $\hat{\beta}_i$s derived above (under the previously specified filters) with those obtained with the varying conditions was 0.98 (95% CI: (0.95, 0.99)). We finally defined the chromosome effects on the log scale, as $\log(\hat{\beta}_i)$.

Second approach to estimate chromosome effects

We also applied a second method to estimate chromosome effects. We used a generalised linear modelling approach to model our DMC1 heats (which are based on counts of reads), namely a quasi-Poisson model ("quasi" allows overdispersion relative to the Poisson case, e.g. due to rescaling). We employed the default canonical (log) link function for this analysis:

$$\log\left(E\left[d^{infertile} \mid d^{rescue}, c\right]\right) = \alpha + \gamma \log\left(d^{rescue}\right) + \sum_{i=1}^{19} \beta_i\, 1_{\{c=i\}}.$$
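The first, non-model-based approach (equal-count heat bins, per-chromosome bin weights, then a weighted ratio of heats) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input layout and the handling of ties at bin edges are assumptions.

```python
import math
from collections import Counter, defaultdict


def chromosome_effects(shared_hotspots, rescue_hotspots, n_bins=5):
    """Log-scale chromosome effects from weighted heat ratios.

    shared_hotspots: list of (chrom, d_infertile, d_rescue) tuples.
    rescue_hotspots: list of (chrom, d_rescue) for all autosomal DMC1
        hotspots in the rescue mouse, used to define bins and weights.
    """
    # Upper edges of (approximately) equal-count bins over all rescue heats.
    heats = sorted(heat for _, heat in rescue_hotspots)
    n = len(heats)
    edges = [heats[min(n - 1, (k + 1) * n // n_bins - 1)] for k in range(n_bins)]

    def bin_of(heat):
        for k, edge in enumerate(edges):
            if heat <= edge:
                return k
        return n_bins - 1

    # w[(chrom, bin)]: number of rescue hotspots on that chromosome in that bin.
    w = Counter((chrom, bin_of(heat)) for chrom, heat in rescue_hotspots)

    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for chrom, d_infertile, d_rescue in shared_hotspots:
        weight = w[(chrom, bin_of(d_rescue))]
        if weight == 0:
            continue  # shared hotspot absent from the rescue list
        numerator[chrom] += d_infertile / weight
        denominator[chrom] += d_rescue / weight

    return {chrom: math.log(numerator[chrom] / denominator[chrom])
            for chrom in numerator
            if numerator[chrom] > 0 and denominator[chrom] > 0}
```

A chromosome whose shared hotspots are uniformly twice as hot in the infertile mouse as in the rescue mouse receives an effect of log 2, whatever the bin weights; the weights matter only when the ratio varies with heat.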
