Nature Immunology: doi: /ni Supplementary Figure 1

Supplementary Figure 1 A β-strand consistently places the residues at CDR3β P6 and P7 within human and mouse TCR-peptide-MHC interfaces. (a) E8 TCR containing Vβ13*06 with an 11-mer CDR3β loop bound to HLA-DR1 + pTPI, PDB code 2IAM. (b) HA1.7 TCR containing Vβ3.1 with a 14-mer CDR3β loop bound to HLA-DR1 + pIAV, PDB code 1FYT. (c) KB5-C20 TCR containing Vβ1*01 with an 18-mer CDR3β loop bound to H2-Kb + self-peptide, PDB code 1KJ2. CDR3β loop residues are shown as space-filling models in orange. A 90° rotation shows the location of the internal disulfide bond between Vβ residues Cys21 and Cys89, which connects the Vβ B and F strands and fixes the position of the β-strand.

Supplementary Figure 2 The ability of residues at CDR3β P6 or P7 to regulate reactivity to self-peptide-MHC correlates with hydrophobicity, as measured by octanol-water partitioning, but not with the molecular weight or surface area of the amino acids, and correlates only weakly with hydropathy. Correlation plots of the fold change of P6 or P7 amino acids in self-reactive TCRs against (a) hydrophobicity based on octanol-water partitioning, (b) amino acid molecular weight, (c) solvent-accessible surface area (ref. 42) or (d) hydropathy as indexed by Kyte and Doolittle (ref. 43). These data are derived from the natural log of the average fold increase of the twenty amino acids at CDR3β residues 6 or 7 in Vβ2+, Vβ6+ and Vβ8.2+ thymocyte TCRs recognizing the MHC alleles I-Ab, I-Ag7, I-Ad, H2-Kb, H2-Db and H2-Kd. Analysis is by nonparametric two-tailed Spearman correlation (r), computed in GraphPad Prism v6.04.
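The rank correlation used in this analysis can be illustrated with a minimal Python sketch. It is not the GraphPad Prism procedure itself: it computes only the Spearman coefficient r (no P value), and the input numbers are hypothetical placeholders rather than data from the figure.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_r(x, y):
    """Spearman r is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: ln fold change of five residues and a hydrophobicity score.
ln_fold_change = [0.9, 0.4, -0.2, -0.7, 0.1]
hydrophobicity = [2.1, 1.3, 0.2, -1.0, -0.3]
r = spearman_r(ln_fold_change, hydrophobicity)
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale of either property, which is why the same test can be applied to hydrophobicity, molecular weight and surface area.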

Supplementary Figure 3 A self-reactivity enrichment and/or depletion factor threshold of 0.4 balances the accuracy and completeness of the self-reactivity index. (a) Scatter plot of CDR3β P6/7 doublets expressed on pre-selection CD4+CD8+ Vβ8.2+ thymocytes derived from C57BL/6 MHC-deficient mice, and on pre-selection Vβ8.2+ thymocytes that express CD69 and Nur77-GFP following incubation with β2m-/- I-Ab-expressing cells. Data shown are derived from TCRs with a CDR3β length of 13 amino acids. CDR3β P6/7 doublets that are significantly enriched (red), unchanged (gray) or depleted (blue) in the I-Ab + self-peptide-reactive repertoire are shown. (b) Clustering of empirically derived self-reactivity enrichment factors for CDR3β P6/7 doublets expressed on Vβ8.2+ T cells recognizing three alleles of murine MHC class I (H2-Kb, H2-Db and H2-Kd) and three alleles of MHC class II (I-Ab, I-Ag7 and I-Ad). Enrichment factors (rows) were clustered using the k-means algorithm for different values of k, and an optimal value of k = 4 was determined by the gap-statistic method (ref. 44). Self-reactivity indexes were generated by multiplying the enrichment factors of the single amino acids at CDR3β P6 and P7, using ln threshold cutoffs (enrichment/depletion) of 0, 0.2/-0.175, 0.4/-0.35, 0.6/-0.525, 0.8/-0.7 and 1/-0.875. The indexes were tested for their ability to identify CDR3β P6/7 doublets that are enriched or depleted on self-reactive thymocytes. (c) Average accuracy in correctly identifying significantly enriched or depleted doublets as red or blue in Vβ2+, Vβ6+ and Vβ8+ TCRs. Accuracy is defined as the number of doublets correctly identified / (number of doublets correctly identified + number of false-positive doublets). (d) Average percent completeness in identifying significantly enriched or depleted doublets in Vβ2+, Vβ6+ and Vβ8+ TCRs. Completeness is defined as the number of doublets correctly identified / total number of significantly enriched or depleted doublets.
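The index construction and the two performance measures defined in this legend can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the single-residue enrichment factors and the observed calls are hypothetical, the function names are invented here, and a false positive is assumed to be any doublet the index calls red or blue that was not observed to be significantly changed in that direction.

```python
from math import log

def classify_doublet(ef_p6, ef_p7, enrich_cut=0.4, deplete_cut=-0.35):
    """Classify a doublet from the product of its single-residue enrichment factors."""
    ln_index = log(ef_p6 * ef_p7)  # ln of the multiplied enrichment factors
    if ln_index > enrich_cut:
        return "red"      # predicted to promote self-reactivity
    if ln_index < deplete_cut:
        return "blue"     # predicted to limit self-reactivity
    return "neutral"

def accuracy_and_completeness(predicted, observed):
    """Both arguments map doublet -> 'red' | 'blue' | 'neutral'."""
    called = {d: c for d, c in predicted.items() if c != "neutral"}
    correct = sum(1 for d, c in called.items() if observed.get(d) == c)
    false_pos = len(called) - correct  # assumption: any other index call is a false positive
    significant = sum(1 for c in observed.values() if c != "neutral")
    accuracy = correct / (correct + false_pos) if called else float("nan")
    completeness = correct / significant if significant else float("nan")
    return accuracy, completeness

# Hypothetical single-residue enrichment factors (not values from the study).
single_ef = {"F": 1.6, "W": 1.5, "A": 1.0, "G": 0.9, "K": 0.6, "E": 0.7}
doublets = ["FW", "FA", "GK", "EG", "AA", "GG"]
predicted = {d: classify_doublet(single_ef[d[0]], single_ef[d[1]]) for d in doublets}

# Hypothetical calls from the empirical doublet-level analysis.
observed = {"FW": "red", "FA": "neutral", "GK": "blue",
            "EG": "blue", "AA": "red", "GG": "blue"}
accuracy, completeness = accuracy_and_completeness(predicted, observed)
```

Raising the cutoffs makes the index call fewer doublets, which tends to raise accuracy and lower completeness; that trade-off is what panels c and d quantify across thresholds.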

Supplementary Figure 4 A CDR3β P6-P7 self-reactivity index with a threshold value of 0.4 accurately predicts whether a CDR3β P6-P7 doublet promotes or limits reactivity to self-peptide-MHC. Fold change of differentially expressed CDR3β P6-P7 doublets that promote (red), are neutral toward (white) or limit (blue) self-pMHC reactivity, as expressed by self-reactive Vβ2+, Vβ6+ or Vβ8.2+ thymocytes compared to the pre-selection repertoire. Analyses are separated by the self-pMHC allele being recognized: (a) I-Ab, (b) I-Ag7, (c) I-Ad, (d) H2-Kb, (e) H2-Db and (f) H2-Kd. The analyses are for Vβ2+, Vβ6+ or Vβ8.2+ TCRs carrying CDR3β loops of 13, 14 and 15 amino acids in length. The self-reactivity enrichment thresholds used are ln > 0.4 (red) and ln < -0.35 (blue). P values are from a hypergeometric test of the distribution of red or blue doublets against all significantly different doublets.
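A hypergeometric test of this kind can be sketched in Python as below. The exact contingency used in the study is not spelled out in the legend, so the mapping of counts is an assumption made for illustration: the population is all significantly different doublets, the successes are those that are enriched, and the draws are the doublets the index classifies as red. All counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population of size `population` containing `successes` successes."""
    upper = min(successes, draws)
    total = comb(population, draws)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total

# Hypothetical counts: 10 significantly different doublets, 6 of them enriched;
# the index calls 4 doublets red, and all 4 are among the enriched ones.
p_value = hypergeom_upper_tail(k=4, population=10, successes=6, draws=4)
```

For blue doublets the same function applies with the depleted doublets counted as successes.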

Supplementary Figure 5 CDR3β P6-P7 doublets that promote reactivity to self-peptide-MHC are more frequent among CD4+ Treg cells than among CD4+ Tconv cells and lead to increased levels of CD5 in CD4+ Tconv cells. (a) Relative fold change of differentially expressed CDR3β P6-P7 doublets that promote (red), are neutral toward (white) or limit (blue) self-pMHC reactivity, as expressed by Vβ2+, Vβ6+ or Vβ8.2+ mouse Treg cells versus CD4+ Tconv cells in NOD.H2b, C57BL/6.H2g7 and NOD mice. P values are derived from a hypergeometric test. (b-d) CD5 expression level on CD4+ Tconv cells (b), CD8+ T cells (c) and CD4+ Treg cells (d) from splenocytes of C57BL/6 plus TCRβ chain transgenic (YAe62 FW, B3K506 SS) or retrogenic (Vβ8.2 SW, Vβ8.2 AW, Vβ8.2 EG, Vβ8.2 EQ) mixed bone-marrow chimeric mice. CD5 ratios are calculated by comparing the TCRβ-transgenic cells with the same non-transgenic C57BL/6-derived cell populations within the same mouse. Splenocyte populations are pre-gated on TCRβ+ B220- cells and on CD4 or CD8 and Foxp3 expression. Data are derived from YAe62β FW (n = 5), Vβ8.2 SW (n = 8), Vβ8.2 AW (n = 6), B3K506 SS (n = 4), Vβ8.2 EG (n = 5) and Vβ8.2 EQ (n = 3) mice. Error bars indicate standard deviation. Statistical significance was assessed by one-way ANOVA: ****P < 0.0001.

Supplementary Figure 6 The H-2g7 haplotype is the main contributor to the repertoire-wide shift toward self-reactivity in the CD4+ Tconv cell repertoire of NOD mice. Relative fold change of differentially expressed CDR3β P6-P7 doublets that promote (red), are neutral toward (white) or limit (blue) self-pMHC reactivity, as expressed by CD4+ Tconv cells (a) or CD8+ Tconv cells (b) in NOD v. B6, B6.H-2g7 v. B6, NOD v. NOD.H-2b, NOD.H-2b v. B6 and NOD v. B6.H-2g7. Fold change in doublet distribution, as classified by the self-reactivity index, is reported as the natural log (ln). P values are derived from a hypergeometric test of the distribution of red or blue doublets against all enriched doublets.

Supplementary Figure 7 Estimating Neff using the CV versus f relationship. (a) Neff calculation illustrated for pre-selection Vβ8 TCRs (top), B6-CD4 Vβ6 TCRs (middle) and Bim-CD4 Vβ2 TCRs (bottom). Within each row, the three panels correspond to the three different CDR3 lengths. Neff is calculated independently for each case. Each panel plots the CV vs. f relationship on a log-log scale. The amino acids are distributed across 12-20 frequency bins such that each bin has at least 8 doublets. The black dots correspond to the average CV within each bin, and the dashed black line depicts the best fit according to Eq. 3. The colored lines indicate theoretical CV vs. f curves for various population sizes (N = 10^3, 10^4, 10^5, 10^6) calculated using Eq. 3. The inferred Neff is indicated on each panel, together with the number of raw read counts (Nreads). (b) Neff estimation on simulated data. We used simulations to explore how Neff compares with the true population size (when known). We simulated doublet counts by sampling a multinomial distribution based on the frequency distribution seen in the B6-CD4 data. This was done for various sample sizes (N(true)) ranging from 500 to 100,000, and for each case we simulated scenarios involving varying numbers of replicates (n(rep) = 3, 5, 10, 100). For each case, Neff was calculated (y-axis) based on the CV vs. f relationship described by Eq. 3 and illustrated in fig. S9. These simulations show that at n = 3 replicates (red line), the computed Neff underestimates the true population size.
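The simulation idea in panel b can be sketched in Python as follows. Eq. 3 is not reproduced in this legend, so the sketch assumes the standard multinomial sampling relation CV^2 = (1 - f) / (N f), which may differ from the authors' equation. It also replaces their binned best fit with a simple moment estimate over all categories, and uses a made-up frequency distribution in place of the B6-CD4 data. Function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, variance

def simulate_counts(freqs, n_true, n_rep, rng):
    """Draw n_rep multinomial samples of size n_true.
    Returns one list of replicate counts per category."""
    categories = range(len(freqs))
    reps = [Counter(rng.choices(categories, weights=freqs, k=n_true))
            for _ in range(n_rep)]
    return [[rep[c] for rep in reps] for c in categories]

def estimate_n_eff(counts_per_category):
    """Moment estimate of N, assuming CV^2 = (1 - f) / (N f) for every category."""
    totals = [sum(col) for col in zip(*counts_per_category)]  # reads per replicate
    sum_cv2 = 0.0       # sum of observed squared CVs
    sum_expected = 0.0  # sum of (1 - f) / f, i.e. N times the expected CV^2
    for counts in counts_per_category:
        freqs = [c / t for c, t in zip(counts, totals)]
        f = mean(freqs)
        if f == 0:
            continue  # category never observed; CV is undefined
        sum_cv2 += variance(freqs) / f ** 2
        sum_expected += (1 - f) / f
    return sum_expected / sum_cv2

# Made-up frequency distribution over 50 doublet categories.
weights = list(range(1, 51))
true_freqs = [w / sum(weights) for w in weights]

rng = random.Random(0)  # fixed seed so the example is reproducible
counts = simulate_counts(true_freqs, n_true=5000, n_rep=20, rng=rng)
n_eff = estimate_n_eff(counts)
```

Repeating the last three lines with different values of n_true and n_rep gives the kind of comparison between the estimate and the true sample size that panel b describes.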

Supplementary Figure 8 Bayesian statistical and cooperativity analyses for the identification of differential expression in CDR3β P6-P7 doublet frequencies. (a) The posterior probability distributions (Eqs. (7) and (9)) for doublet frequencies are illustrated for three doublets (FW, AW, GK) across three TCR populations. In each panel, we plot the replicate-specific distributions (red, blue, magenta) and the average distribution (black). (b) The probability distribution of the enrichment factor (Eq. (12)) for three doublets (FW, QK and GM) in three representative comparisons (left: I-Ab-activated vs. pre-selection, Vβ2, 15-mer; middle: Bim-/- CD8+ T cells vs. B6 CD8+ T cells, Vβ6, 13-mer; right: B6 CD4+ T cells vs. NOD CD4+ T cells, Vβ6, 13-mer), illustrating significant enrichment, depletion and no change, respectively. The black dashed line corresponds to an enrichment factor of 1. (c) Enrichment factor distributions for doublet FW: comparison of FW CDR3β doublet enrichment in H-2Db-activated versus pre-selection TCRs for the 9 different cases (rows: Vβ8, Vβ6, Vβ2; columns: CDR3 lengths of 13, 14 and 15 amino acids). Six of the 9 cases show significant enrichment, whereas 3 cases, owing to poor statistical power, cannot be called on the basis of our threshold (P(Enrichment) > 0.95). Notably, in the three non-significant cases the distribution is still clearly shifted to the right. (d) Cooperativity between amino acid distributions at CDR3β P6 and P7: bar graphs show the mutual information between the frequency distributions at P6 and P7, MI(P6, P7), expressed as a percentage of the Shannon entropy at P6 (panel d) or P7 (panel e). Within each panel, the calculation was performed on the pre-selection TCRs and the MHC-activated TCRs separately (left and right sets of bars).
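The cooperativity measure in the final panel can be illustrated with a short Python sketch. It uses the textbook identity MI(P6, P7) = H(P6) + H(P7) - H(P6, P7) on plug-in frequency estimates; whether the authors applied any bias correction is not stated in the legend, and the input doublets here are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter of counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c)

def mutual_information_percent(doublets):
    """doublets: list of (P6 residue, P7 residue) pairs, one per TCR sequence.
    Returns MI(P6, P7) as a percentage of H(P6) and of H(P7).
    Raises ZeroDivisionError if either position has only one residue type."""
    joint = Counter(doublets)
    p6, p7 = Counter(), Counter()
    for (res6, res7), count in joint.items():
        p6[res6] += count
        p7[res7] += count
    h6, h7 = shannon_entropy(p6), shannon_entropy(p7)
    mi = h6 + h7 - shannon_entropy(joint)
    return 100 * mi / h6, 100 * mi / h7

# Hypothetical repertoires: residues paired independently vs. perfectly coupled.
independent = [("F", "W"), ("F", "K"), ("G", "W"), ("G", "K")]
coupled = [("F", "W"), ("G", "K")]
pct_independent = mutual_information_percent(independent)
pct_coupled = mutual_information_percent(coupled)
```

A value near 0% means that knowing the residue at one position says little about the other, whereas a value near 100% means the residue at one position essentially determines the other.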