SUPPLEMENTARY INFORMATION doi:10.1038/nature22976

Supplementary Discussion

The adaptive immune system uses a highly diverse population of T lymphocytes to selectively recognize and respond to antigenic proteins. Individual T lymphocytes generate and express single recombinants from a massive repertoire of combinatorially generated T-cell receptors (TCRs), with diversification mechanisms that could theoretically produce in excess of ~10^16 unique heterodimers 4. Because of this extreme diversity, individuals typically share only ~1% of their TCR sequences, and even monozygotic twins, with identical HLA haplotypes, share only ~2% 1,3,24,25. The major αβ form of the TCR heterodimer typically recognizes peptide fragments of larger proteins, bound to major histocompatibility complex (MHC) or related molecules and displayed on the cell surface 26. The unique TCR recombinantly produced by each T lymphocyte imbues it with a unique TCR specificity: the ability to selectively recognize and respond to a given peptide-MHC (pMHC) complex or set of complexes. While αβ TCRs are highly specific for a given peptide-MHC ligand, such that a small change in an amino acid side chain of the peptide or TCR is sufficient to eliminate binding, cross-reactivity to similar and even non-homologous peptides bound to the same or a different MHC is also very common. This appears to be due to flexibility in the TCR binding site, which allows multiple stable conformations 27.

Although advances in high-throughput sequencing technologies now enable the routine analysis of millions of T-cell receptors in a single experiment, there has been no systematic way to organize groups of TCR sequences according to their likely antigen specificities. Relying on a tetramer-sorted TCR "training" dataset, we developed a novel algorithm to search for and automatically cluster TCR sequences into distinct groups according to their likely specificity.
This algorithm, Grouping Lymphocyte Interactions by Paratope Hotspots or GLIPH, combines global TCR sequence similarity, local CDR sequence similarity (motifs), spatial peptide-antigen contact propensity, V-segment bias, CDR3 length bias, shared HLA alleles among TCR contributors, and clonal expansion bias, together with other observations concerning TCR specificity from the literature, to identify and cluster TCR sequences into specificity groups: sets of TCRs that are likely to recognize the same or very similar peptide-MHC ligands (Extended Data Fig. 3, https://github.com/immunoengineer/gliph). The algorithm first searches in parallel for global and local similarity signatures. For local similarity signatures, it tests for enrichment of any 2, 3, and 4
amino acid continuous motifs within CDR3β sequences while excluding conserved N-terminal V region or C-terminal J region CDR3 amino acid positions (IMGT 104, 105, 106, 117, and 118) that have never been observed to bind the antigenic peptide in crystal structures (Extended Data Fig. 1). Optionally, it can also search for discontinuous amino acid motifs in CDR3 that allow any amino acid at one position, with 2, 3, or 4 specific amino acids around that position. A motif is considered enriched if it is elevated at least 10-fold over its expected frequency in the naïve TCR reference pool, with a probability < 0.001 of reaching that level of enrichment by chance. In parallel, it also identifies globally similar sequences: CDR3β sequences of the same length that are identical or differ by only one amino acid. Next, it clusters the TCRs with identified similarities, gathering variants of similar amino acid motifs or global CDR3 similarity into a single sequence group in the process. Finally, GLIPH confirms the significance of the formed clusters by evaluating the enrichment of independent features, including V-gene usage, CDR3 length distribution, clonal expansion bias and donor HLA usage in each sequence group. This three-step process of motif nucleation creates a powerful statistical framework for identifying TCRs that recognize the same antigen.

Optionally, GLIPH can restrict cluster members to only those TCRs that share the same V-gene. When HLA genotypes for the donors are available, GLIPH also provides a predicted HLA restriction for each TCR sequence group generated. This HLA allele prediction can in turn be used to score databases of candidate peptides, ranking them according to their odds of being the target antigen.
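The local-similarity test described above can be illustrated with a minimal Python sketch. It is not the GLIPH implementation: the function names, the index-based trimming (3 N-terminal and 2 C-terminal residues, used here as a stand-in for IMGT positions 104-106 and 117-118), and the handling of motifs never seen in the reference (infinite fold change) are all assumptions made for illustration.

```python
import random
from collections import Counter

def motifs(cdr3, ks=(2, 3, 4), trim_n=3, trim_c=2):
    """Continuous k-mers from the CDR3 core (ends trimmed by index; an assumption)."""
    core = cdr3[trim_n:len(cdr3) - trim_c]
    return {core[i:i + k] for k in ks for i in range(len(core) - k + 1)}

def motif_counts(cdr3s):
    """Number of unique CDR3s that contain each motif."""
    counts = Counter()
    for seq in set(cdr3s):
        counts.update(motifs(seq))
    return counts

def enriched_motifs(sample, reference, n_sim=1000, min_fold=10.0,
                    max_p=0.001, seed=0):
    """Motifs whose observed count is >= min_fold over the resampled
    expectation, with empirical one-tailed probability < max_p."""
    rng = random.Random(seed)
    sample = sorted(set(sample))
    observed = motif_counts(sample)
    # Repeatedly draw reference samples at the non-redundant sample depth.
    sims = [motif_counts(rng.sample(reference, len(sample)))
            for _ in range(n_sim)]
    results = {}
    for motif, obs in observed.items():
        expected = sum(s[motif] for s in sims) / n_sim
        p = sum(1 for s in sims if s[motif] >= obs) / n_sim
        # Hypothetical convention: a motif absent from every draw gets fold = inf.
        fold = obs / expected if expected > 0 else float("inf")
        if fold >= min_fold and p < max_p:
            results[motif] = (obs, fold, p)
    return results
```

The reference must be a list at least as long as the non-redundant sample, because each draw is made without replacement.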
While previous studies have noted that certain TCR positions are more likely to be in contact with antigen, or that distributions of short TCR CDR3β motifs systematically alter after immunization 5,28, no other TCR repertoire analysis tools that we are aware of are able to formalize these observations into a unified framework for automated TCR sequence group identification, with statistical validation and HLA prediction 29-31. GLIPH can cluster TCR sequences based on their likely antigen specificity, predict their likely HLA restriction and prioritize specific peptide antigen candidates. Our data also indicate that linear, usually continuous motifs of 3-4 amino acids in CDR3β and CDR3α are often highly conserved in TCRs that recognize the same peptide-MHC ligand, and that in known structures, these are the residues responsible for contacting the antigenic peptide. This builds and expands on previous work by Chain and colleagues indicating that 3 amino acid motifs in the TCR CDR3 regions are often correlated with antigen-specific responses 28.

While GLIPH requires the HLA genotypes of the subjects in order to make an HLA prediction, in our Mtb dataset of 22 individuals there were 69 unique class II HLAs, and in each evaluated sequence group we were able to predict the presenting HLA. Similarly,
while GLIPH cannot predict a specific target peptide a priori, knowing the HLA and a candidate population of peptides, it is possible to prioritize the peptides by predicted binding to the candidate HLA, greatly reducing the combinations of APCs and peptides to be tested. Where no pool of candidate antigens is available, the HLA prediction facilitates the use of the pMHC yeast display libraries developed by Garcia and colleagues 27.

An important question is how much of the TCR repertoire GLIPH will capture. The limited data we have at this time suggest that it will be substantial: of the >2000 TCR sequences that we analyzed in Fig 3A, up to 14% could be gathered into high-confidence discrete specificity groups with GLIPH, with rarefaction suggesting that more groups could be identified if more data were available. Similarly, 14% (796/5711) of the TCRs fit into specificity groups in the Mtb cohort. A partial solution to this problem will likely come from larger sample sizes, although in very diverse populations, such as the South African cohort analyzed here, there will be HLA alleles that are very rare, and only in very large population studies will there be enough matches to generate shared TCR specificities. Nonetheless, building large TCR databases around specific diseases, or TCR-omes, could be very valuable in assessing the responses of individuals or cohorts, even if only a fraction of the TCR sequences can be clustered using GLIPH.

24 Naylor, K. et al. The influence of age on T cell generation and TCR diversity. Journal of immunology 174, 7446-7452 (2005).
25 Robins, H. S. et al. Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood 114, 4099-4107, doi:10.1182/blood-2009-04-217604 (2009).
26 Garcia, K. C. et al. An alphabeta T cell receptor structure at 2.5 A and its orientation in the TCR-MHC complex. Science 274, 209-219 (1996).
27 Birnbaum, M. E. et al. Deconstructing the peptide-MHC specificity of T cell recognition. Cell 157, 1073-1087, doi:10.1016/j.cell.2014.03.047 (2014).
28 Thomas, N. et al. Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics 30, 3181-3188, doi:10.1093/bioinformatics/btu523 (2014).
29 Bolotin, D. A. et al. MiTCR: software for T-cell receptor sequencing data analysis. Nature methods 10, 813-814, doi:10.1038/nmeth.2555 (2013).
30 Nazarov, V. I. et al. tcR: an R package for T cell receptor repertoire advanced data analysis. BMC bioinformatics 16, 175, doi:10.1186/s12859-015-0613-1 (2015).
31 Yu, Y., Ceredig, R. & Seoighe, C. LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins. Nucleic acids research 44, e31, doi:10.1093/nar/gkv1016 (2016).
Supplementary Methods

GLIPH Documentation

================================================================
GROUPING OF LYMPHOCYTE INTERACTIONS BY PARATOPE HOTSPOTS - GLIPH
================================================================

Questions? Comments? Contact
    jakeg@stanford.edu
    huang7@stanford.edu

Name
----
gliph - Grouping of Lymphocyte Interactions by Paratope Hotspots

Synopsis
--------
gliph [options ] --tcr TCR_TABLE --hla HLA_TABLE

Description
-----------
GLIPH clusters TCRs that are predicted to bind the same MHC-restricted peptide antigen. When multiple donors have contributed to the clusters, and HLA genotypes for those donors are available, GLIPH can additionally provide predictions of which HLA allele is presenting the antigen.

Typically the user will pass in a sequence set of hundreds to thousands of TCR sequences. This dataset will be analyzed for very similar TCRs, or TCRs that share CDR3 motifs that appear enriched in this set relative to their expected frequencies in an unselected naive reference TCR set.

GLIPH returns significant motif lists, significant TCR convergence groups, and for each group, a collection of scores for that group indicating enrichment for motif, V-gene, CDR3 length, shared HLA among contributors, and proliferation count. When HLA data is available, it predicts the likely HLA that the set of TCRs recognizes.

Options
-------

Required Data Inputs
~~~~~~~~~~~~~~~~~~~~
The user must provide a table of TCR sequences.

--tcr TCR_TABLE

The format of the table is tab delimited, expecting the following columns in this order. Only TCRb is required for the primary component of the algorithm to function, but patient identity is required for HLA prediction.

Example:
    CDR3b           TRBV     TRBJ     CDR3a            TRAV      TRAJ    Patient  Counts
    CAADTSSGANVLTF  TRBV30   TRBJ2-6  CALSDEDTGRRALTF  TRAV19    TRAJ5   09/0217  1
    CAATGGDRAYEQYF  TRBV2    TRBJ2-7  CAASSGANSKLTF    TRAV13-1  TRAJ56  03/0492  2
    CAATQQGETQYF    TRBV2    TRBJ2-5  CAASYGGSARQLTF   TRAV13-1  TRAJ22  02/0259  1
    CACVSNTEAFF     TRBV28   TRBJ1-1  CAGDLNGAGSYQLTF  TRAV25    TRAJ28  PBMC863  1
    CAGGKGNSPLHF    TRBV2    TRBJ1-6  CVVLRGGSQGNLIF   TRAV12-1  TRAJ42  02/0207  1
    CAGQILAGSDTQYF  TRBV6-4  TRBJ2-3  CATASGNTPLVF     TRAV17    TRAJ29  09/0018  1
    CAGRTGVSTDTQYF  TRBV5-1  TRBJ2-3  CAVTPGGGADGLTF   TRAV41    TRAJ45  02/0259  1
    CAGYTGRANYGYTF  TRBV2    TRBJ1-2  CVVNGGFGNVLHC    TRAV12-1  TRAJ35  01/0873  3

Optional Data Inputs
~~~~~~~~~~~~~~~~~~~~
The user may additionally supply a table of HLA genotyping for each subject.

--hla HLA_TABLE

The format of the table is tab delimited, with each row beginning with the identity of a subject, followed by two or more columns providing HLA identification. The number of total columns (HLA defined genotypes) is flexible.

Example:

    09/0217  DPA1*01:03  DPA1*02:02  DPB1*04:01   DPB1*14:01  DQA1*01:02
    09/0125  DPA1*02:02  DPA1*02:02  DPB1*05:01   DPB1*05:01  DQA1*06:01
    03/0345  DPA1*02:01  DPA1*02:01  DPB1*17:01   DPB1*01:01  DQA1*01:03
    03/0492  DPA1*01:03  DPA1*02:01  DPB1*03:01   DPB1*11:01  DQA1*01:02
    02/0259  DPA1*01:03  DPA1*01:03  DPB1*104:01  DPB1*02:01  DQA1*02:01

Optional Arguments
~~~~~~~~~~~~~~~~~~~

--refdb DB
    optional alternative reference database

--gccutoff=1
    global convergence distance cutoff. This is the maximum CDR3 Hamming mutation distance between two clones sharing the same V, same J, and same CDR3 length in order for them to be considered likely to bind the same antigen. This number will change depending on sample depth, as with more reads, the odds of finding a similar sequence increase even in a naive repertoire. This number will also change depending on the species evaluated and even the choice of reference database (memory TCR repertoires will be more likely to contain similar TCRs than naive TCR repertoires).
    Thus, by default this is calculated at runtime if not specified.

--simdepth=1000
    simulated resampling depth for non-parametric convergence significance tests. This defines the number of random
    repeat samplings into the reference distribution that GLIPH performs when analyzing
        1) global similarity cutoff
        2) local similarity motif enrichment
        3) V-gene enrichment
        4) CDR3 length enrichment
        5) clonal proliferation enrichment
    A higher number will take longer to run but will produce more reproducible results.

--lcminp=0.01
    local convergence minimum probability score cutoff. The score reports the probability that a random sample of the same size as the sample set, drawn from the reference set (i.e. naive repertoire), would generate an enrichment of the given motif at least as high as has been observed in the sample set. It is set to 0.01 by default.

--lcminove=10
    local convergence minimum observed vs expected fold change. This is a cutoff for the minimum fold enrichment over a reference distribution that a given motif should have in the sample set in order to be considered for further evaluation. It is set to 10 by default.

--kmer_mindepth=3
    minimum observations of kmer for it to be evaluated. This is the minimum number of times a kmer should be observed in the sample set in order for it to be considered for further evaluation. The number can be set higher to provide fewer motif-based clusters with higher confidence. This could be recommended if the sample set is greater than 5000 reads. Lowering the value to 2 will identify more groups but likely at the cost of an increased False Discovery Rate.
--global=1
    Search for global TCR similarity (Default 1)

--local=1
    Search for local TCR similarity (Default 1)

--make_depth_fig=0
    Perform repeat random samplings at the sample set depth in order to visualize convergence

--discontinuous=0
    Allow discontinuous motifs (Default 0)

--positional_motifs=0
    Restrict motif clustering to a shared position that is fixed from the N-terminal end of CDR3

--cdr3len_stratify=0
    Stratify by shared cdr3length distribution (Default 0)

--vgene_stratify=0
    Stratify by shared V-gene frequency distribution (Default 0)

--public_tcrs=0
    Reward motifs in public TCRs (Default 0)

Controlling Output
~~~~~~~~~~~~~~~~~~
GLIPH produces multiple output files.
--output FILE
    Place command output into the named file.

--verbose
    Have *tmo* print messages describing

Definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convergence group - a set of multiple TCRs from one or more individuals that bind the same antigen in a similar manner through similar TCR contacts. GLIPH predicts convergence groups by asking "what is the probability that this cluster of similar TCRs could have appeared without the selection of common antigen?"

Global convergence - A pair of TCRs that share the same length CDR3 and differ by less than a certain number of amino acids in those CDR3s. Example: in a set of 300 random TCRs, finding two TCRs that only differ by one amino acid in their CDR3 would be highly unlikely.

Local convergence - A pair of TCRs that share in their CDR3 regions an amino acid motif that appears enriched in their sample set. Optionally this common motif could be positionally constrained. Example: in the malaria set, an enriched QRW motif was found in 23 unique TCRs from 12 individuals in a conserved position in their CDR3.

Reference set - A large database of TCR sequences that are not expected to be enriched for the specificities found in the sample set. Example: by default, GLIPH uses as a reference database over 200,000 nonredundant naive CD4 and CD8 TCRb sequences from 12 healthy controls.

Sample set - The input collection of TCRs under evaluation that are potentially enriched for a specificity not present in the reference set.

Function of Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GLIPH first makes the input sample set non-redundant at the CDR3 amino acid level. The identity of the V-genes, the J-genes, the clonal frequencies, and the associated HLA information for each TCR is hidden during the CDR3 analysis so that these can act as independent confirmation variables of the resulting clusters.

GLIPH next counts the number of unique CDR3 sequences in the sample dataset.
GLIPH next calculates the minimum Hamming distance from each CDR3 in the sample dataset to the next closest same-length sequence in the sample set. The result is the minimum distance distribution.

GLIPH next loads the reference dataset.
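As a rough illustration of the minimum distance distribution described above, the following Python sketch tabulates, for each unique CDR3, the Hamming distance to its nearest same-length neighbour. The function names are hypothetical, and sequences with no same-length partner are simply skipped, which is an assumption rather than documented GLIPH behaviour.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def min_distance_distribution(cdr3s):
    """Counter mapping nearest-neighbour Hamming distance -> number of CDR3s."""
    unique = sorted(set(cdr3s))          # make the set non-redundant
    by_length = defaultdict(list)
    for seq in unique:
        by_length[len(seq)].append(seq)
    distribution = Counter()
    for seq in unique:
        others = [o for o in by_length[len(seq)] if o != seq]
        if others:                       # skip CDR3s with no same-length partner
            distribution[min(hamming(seq, o) for o in others)] += 1
    return distribution
```

Running the same function on repeated random draws from a reference set, at the same depth, would give the null distribution against which the sample distribution is compared.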
GLIPH next repeatedly resamples from the reference database at the nonredundant sample set depth, calculating the minimum Hamming distance distribution in each case. This is repeated simdepth times (1000 by default). The probability of encountering next closest members at distance 0, 1, 2... is tabulated across all random samples.

GLIPH next compares the sample set minimum Hamming distance distribution to that of the reference set, to assign to each Hamming distance a probability that sequences at that distance would appear randomly.

Example: after analysis, it is found that out of 1000 simulations

    hamming distance   in-test-set   in-ref-set   probability
    1                  14            0            0.001
    2                  37            8+/-3        0.02
    3                  194           147+/-35     0.24

GLIPH then selects a maximum Hamming distance cutoff for grouping similar TCRs.

NOTE: For the standard reference set this analysis has been precomputed for a range of sampling depths, enabling a lookup table rather than simulation as a runtime accelerator.

GLIPH then analyzes all possible 3mer, 4mer, 5mer and discontinuous 3mer and 4mer motifs for their frequency in the sample set. This is done excluding the first three and last three residues in the CDR3, as they are not observed to be in contact with antigen in known crystal structures. This is done for every non-redundant CDR3 in the reference set.

Example: for a scan of the following CDR3, the following motifs would be collected

    CAS SFGSGGHYE TYF
    xxx SFGSGGHYE xxx

    SFG FGS GSG SGG GGH GHY HYE
    SFGS FGSG GSGG SGGH GGHY GHYE
    S.GS F.SG (etc)

The frequencies of all motifs across the sample set are calculated.

In order to establish whether these motifs are just naturally abundant or specifically enriched by antigen, these frequencies are compared to a repeat random sampling of the reference set at the same depth as the non-redundant database. The frequency
of every motif is collected with each random sampling in order to create a distribution.

If each motif could only be observed in a given sequence once, then the distribution of sampling motif frequency means u becomes normally distributed, and this result is equivalent to calculating the frequencies of all motifs in the reference database, and then calculating one-sided confidence intervals for the expected frequency of any given motif in the reference database at any given sampling depth:

    CI(99.9\%\ \mathrm{os}) = \bar{y} + t_{0.0005} \, \frac{s}{\sqrt{n}}

where n is the sample set non-redundant CDR3 sample size, \bar{y} is the motif mean frequency, and s is the SD estimate at that sampling depth for the motif, as

    s = \sqrt{\frac{\sum_{i} (y_i - \bar{y})^2}{n - 1}}

This approximation provides a runtime acceleration.

GLIPH then calculates the probability that each motif in the sample set would have appeared at the observed frequency by random chance (calculated as the number of times the motif was found at at least that depth in a random sampling, divided by the total number of random samplings, or alternatively through the CI).

GLIPH next generates a network with all CDR3s as nodes and all global or local interactions as edges.

The clusters can optionally be filtered with the following arguments

--positional_motifs=0
    Restrict motif edges to CDR3s where the motif is in a shared position that is fixed from the N-terminal end of CDR3

--public_tcrs=0
    Allow CDR3s from multiple subjects to increase the size of their clusters. Rewards public TCRs but risks contamination clouding results.
--global_vgene=0
    Restricts global relationships to TCRs of common V-gene. Recommended but invalidates V-gene probability score

GLIPH then scores each individual cluster (candidate convergence group) by evaluating a set of features that are independent of the CDR3 observations, assigning a probability p to each feature, and then combining those probabilities into a single score by conflation.

For tests i through N, testing the probability Pi(X=C) that cluster X is convergent, the combined conflation score is given as

    P(X=C) = \frac{\prod_{i=1}^{N} P_i(X=C)}{\prod_{i=1}^{N} P_i(X=C) + \prod_{i=1}^{N} P_i(X \neq C)}

The individual Pi(X=C) tests include
    1) global similarity probability
    2) local motif probability
    3) network size
    4) enrichment of V-gene within cluster
    5) enrichment of CDR3 length (spectratype) within cluster
    6) enrichment of clonal expansion within cluster
    7) enrichment of common HLA among donor TCR contributors in cluster

Individual score components are calculated as follows:

== 3) Calculating network size p ==

For each discrete cluster, the probability p of a given cluster topology can be obtained from the number of members of the cluster by comparison to a lookup table calculated from repeat random sampling and GLIPH clustering of naive TCR sequences at a range of different sampling depths n from 25 to 5000, each performed 1000 times.

Example: at sampling depth n=500, clusters of size 5 have a probability p=0.002 of occurring in naive TCR sample sets.

== 4) enrichment of V-gene within cluster ==

GLIPH hides V-gene usage for all clones prior to CDR3 analysis, and does not explore the V-gene template encoded amino acids in the local (motif) search.

To evaluate whether there is an enrichment of common V-segment within the cluster over the degree that would be expected from an unbiased sampling of TCRs of cluster size n from the total dataset, we perform repeat random sampling at sample size n from the total dataset, each time obtaining the V-genes for each clone and calculating a V-gene Simpson's Diversity Index D for each sample, where D is interpreted as the probability that any two members within the cluster would share a V-gene and is calculated as

    D = \frac{\sum_{i=1}^{V} v_i (v_i - 1)}{n (n - 1)}

where V is the number of all V-genes, v_i is the total count of a given V-gene, and n is the sampling size (cluster size).
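For concreteness, the index D defined above can be computed with a few lines of Python. This is a minimal sketch, not GLIPH code; the function name is hypothetical, and it assumes the standard Simpson form in which the summed v(v-1) terms are divided by n(n-1).

```python
from collections import Counter

def simpson_index(labels):
    """Probability that two distinct cluster members share a label
    (for example a V-gene), assuming the standard Simpson form."""
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two cluster members")
    counts = Counter(labels)
    return sum(v * (v - 1) for v in counts.values()) / (n * (n - 1))
```

For a cluster with V-genes TRBV2, TRBV2, TRBV2 and TRBV28 this gives (3*2 + 1*0) / (4*3) = 0.5. The same function applies unchanged to CDR3 lengths.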
The probability of the observed D for the candidate convergence group is obtained as the one-tailed probability of observing a score at least that high in the D score distribution from randomly sampled clusters of the same size n.

== 5) enrichment of CDR3 length within cluster ==
For local (motif) scoring, no stratification based on CDR3 length is performed.

To evaluate whether there is an enrichment of common CDR3 length within the cluster over the degree that would be expected from an unbiased sampling of TCRs of cluster size n from the total dataset, we perform repeat random sampling at sample size n from the total dataset, each time obtaining the CDR3 lengths for each clone and calculating a CDR3-length Simpson's Diversity Index D for each sample, where D is interpreted as the probability that any two members within the cluster would share a CDR3 length and is calculated as

    D = \frac{\sum_{i=1}^{C} c_i (c_i - 1)}{n (n - 1)}

where C is the number of all CDR3 lengths, c_i is the total count of a given CDR3 length, and n is the sampling size (cluster size).

The probability of the observed D for the candidate convergence group is obtained as the one-tailed probability of observing a score at least that high in the D score distribution from randomly sampled clusters of the same size n.

== 6) enrichment of clonal expansion within cluster ==

The first step in GLIPH is to make all CDR3s non-redundant, hiding their clone frequencies. To evaluate whether there is an enrichment of expanded clones within a given convergence group, the counts for each clone are recovered, and a convergence group expansion coefficient e is calculated as

    e = total clones / total unique clones

To evaluate whether there is an enrichment of expanded clones within the cluster over the degree that would be expected from an unbiased sampling of TCRs of cluster size n from the total dataset, we perform repeat random sampling at sample size n from the total dataset, each time obtaining the clone counts for each clone and calculating e for each random sample to establish a distribution.
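The resampling tests used for these cluster-level features share one pattern: compute a statistic on the cluster, recompute it on random draws of the same size from the whole dataset, and report the one-tailed fraction of draws that score at least as high. A minimal Python sketch follows, using the expansion coefficient e as the statistic. The function names, the fixed seed, and the choice to draw without replacement are assumptions for illustration, not details taken from GLIPH.

```python
import random

def expansion_coefficient(clone_counts):
    """e = total clones / total unique clones, given one count per unique clone."""
    return sum(clone_counts) / len(clone_counts)

def empirical_p(observed, statistic, pool, n, n_sim=1000, seed=0):
    """One-tailed probability that a random draw of size n from pool
    yields statistic >= observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        draw = rng.sample(pool, n)   # assumption: sampling without replacement
        if statistic(draw) >= observed:
            hits += 1
    return hits / n_sim
```

Usage: with `pool` holding the clone count of every unique CDR3 in the dataset and `cluster` holding the counts of the cluster members, `empirical_p(expansion_coefficient(cluster), expansion_coefficient, pool, len(cluster))` gives the probability for the expansion feature. Passing a diversity index as `statistic` covers the V-gene and CDR3 length features in the same way.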
The probability of the observed e for the candidate convergence group is obtained as the one-tailed probability of observing a score at least that high in the e score distribution from randomly sampled clusters of the same size n.

== 7) enrichment of common HLA among donor TCR contributors in cluster ==

TCRs that recognize a common antigen should be constrained by the same HLA. Thus, their contributing donors should carry that HLA allele in their genotype. When HLA genotype information is available, GLIPH uses combinatorial resampling without replacement to estimate the probability that the collection of TCRs in the convergence group recognizes any given HLA.

For each HLA allele that appears at least twice in the candidate convergence group, GLIPH obtains
    a   the number of subjects in the convergence group that harbor that allele
    A   the number of subjects in the study that harbor that allele
    n   the number of subjects in the convergence group
    N   the number of subjects in the study

GLIPH uses this information to calculate the probability that a given HLA allele is present by chance.

Examples
--------
gliph --tcr TCR_TABLE --hla HLA_TABLE
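As a worked illustration of the HLA enrichment score (feature 7 above), the quantities a, A, n and N can be combined in closed form. The sketch below assumes that the chance probability is the hypergeometric upper tail, that is, the probability that at least a of n subjects drawn without replacement from N carry an allele held by A subjects. The documentation describes combinatorial resampling without replacement and does not state this formula, so both the function name and the closed form are assumptions.

```python
from math import comb

def hla_chance_probability(a, A, n, N):
    """P(at least a carriers among n subjects drawn without replacement
    from N subjects, A of whom carry the allele)."""
    if not (0 <= a <= n <= N and a <= A <= N):
        raise ValueError("require 0 <= a <= n <= N and a <= A <= N")
    total = comb(N, n)
    upper = min(A, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(A, k) * comb(N - A, n - k)
                     for k in range(a, upper + 1))
    return favourable / total
```

For example, if all 3 subjects in a convergence group carry an allele found in only 3 of 10 study subjects, the chance probability is 1/120.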
Supplementary Methods

Training set curation

The training set was derived from single-TCR sequencing of tet+ cells, bulk sequencing of tet+ cell pools, sequences from the literature, and sequences from solved crystal structures. To simplify analysis, TCRs in the training set were curated to ensure that they were

A) Real (i.e. not a read error variant)
B) Non-redundant: each sequence was only counted once per subject source
C) Non-promiscuous: each TCR only associated with a single specificity

A) Real (i.e. not a read error variant)

Single-cell TCR sequences were sequenced to a depth of at least 1000x to eliminate read error. Bulk-cell TCR sequencing was performed in replicate. To avoid PCR error and read error, only reads identified across both replicates were retained. Frequencies of all TCRb clones were evaluated in each replicate, and any read differing by only 1 bp from a higher-frequency clone was absorbed into that clone, on the assumption that, given the read depths used in each replicate, it most likely represented a read error variant of the more dominant clone. Final sequences were reviewed to confirm in-frame VDJ recombination and functional CDR3 boundary residues.

B) Non-redundant: each sequence was only counted once per subject source

To avoid over-counting motifs in high-frequency clones, the data was rendered clonally non-redundant. Irrespective of how many times a clone was observed in single cells or across reads in bulk TCR sequencing, the clones were rendered non-redundant within a given specificity and given subject. For bulk sequencing, subjects with very large numbers of singleton clones were down-sampled to use only the highest-frequency clones, in numbers comparable to those presented by other subjects. This was done on the assumption that these large populations of low-frequency singleton reads likely represented tetramer staining background from the FACS sorting, or barcode background during sequencing.
It was also done to ensure that each subject contributed a comparable amount of information to the analysis of statistical trends in the antigen-specific repertoire.

C) Non-promiscuous: each TCR only associated with a single specificity

In almost all cases, a single TCR was only observed against a single specificity from a single individual. However, in the bulk TCR sequencing, there were some cases of a TCR being found across multiple specificities. In 37 of these cases, the TCR was found at high frequency in a single sample, but appeared at very low copy number in some other samples. As these samples were all sequenced together, it was assumed that a low rate of DNA barcode contamination was causing a low rate of sequencing read misassignment, a known issue for Illumina sequencing with multiplex identifier barcodes. To address this, clones with a clearly dominant specificity were curated by removing them from the presumed contaminant sources. In total 37 clones were curated in this way; three example cases are shown below.
    Subject        HLA      Antigen   Counts   CDR3b              Selection
    subject-0958   HLA-A1   CMV       3        CSVVAGGADTQYF
    subject-0898   HLA-A2   CMV       28396    CSVVAGGADTQYF      X
    subject-1521   HLA-B7   CMV       4        CSVVAGGADTQYF
    subject-2387   HLA-A2   flu       13438    CSAISGSGGVGDTQYF   X
    subject-2385   HLA-B7   CMV       2        CSAISGSGGVGDTQYF
    subject-2351   HLA-B7   flu       3        CSAISGSGGVGDTQYF
    subject-3183   HLA-A1   CMV       2        CASSPGTDTQYF
    subject-1mc1   HLA-A2   EBV       3        CASSPGTDTQYF
    subject-0959   HLA-B7   flu       54       CASSPGTDTQYF       X

There remained 49 additional clones that were found across two samples at frequencies that made unambiguous assignment challenging. Almost every case was found between two samples: subject-0958_HLA-A1-CMV and subject-0988_HLA-B7-flu. All ambiguous clones were removed from the final curated dataset.

The final dataset contained 2067 T-cell receptors of known, unique specificity, of which 47 were found across more than one source.
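The non-promiscuity curation described above can be sketched in Python as follows. The text does not quantify what counts as a "clearly dominant" specificity, so the `min_ratio` threshold of 10 is a hypothetical parameter chosen only so that the three example clones in the table resolve as marked; the function name and record layout are likewise illustrative.

```python
from collections import defaultdict

def curate_promiscuous(records, min_ratio=10.0):
    """records: iterable of (sample_id, cdr3, count).
    Returns (assigned, ambiguous): assigned maps each CDR3 to its dominant
    sample; ambiguous holds CDR3s with no clearly dominant sample."""
    by_clone = defaultdict(list)
    for sample_id, cdr3, count in records:
        by_clone[cdr3].append((count, sample_id))
    assigned, ambiguous = {}, set()
    for cdr3, observations in by_clone.items():
        observations.sort(reverse=True)          # highest count first
        top = observations[0]
        # Hypothetical rule: dominant if >= min_ratio times the runner-up.
        if len(observations) == 1 or top[0] >= min_ratio * observations[1][0]:
            assigned[cdr3] = top[1]
        else:
            ambiguous.add(cdr3)                  # dropped from the curated set
    return assigned, ambiguous
```

With the table rows as input, CSVVAGGADTQYF is assigned to subject-0898, CSAISGSGGVGDTQYF to subject-2387 and CASSPGTDTQYF to subject-0959, while a clone seen at similar counts in two samples would be returned as ambiguous.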