SUPPLEMENTARY INFORMATION


doi:10.1038/nature22976

Supplementary Discussion

The adaptive immune system uses a highly diverse population of T lymphocytes to selectively recognize and respond to antigenic proteins. Individual T lymphocytes generate and express single recombinants from a massive repertoire of combinatorially generated T-cell receptors (TCRs), with diversification mechanisms that could theoretically produce in excess of ~10^16 unique heterodimers [4]. Because of this extreme diversity, individuals typically share only ~1% of their TCR sequences, and even monozygotic twins, who have identical HLA haplotypes, share only ~2% [1,3,24,25]. The major αβ form of the TCR heterodimer typically recognizes peptide fragments of larger proteins, bound to major histocompatibility complex (MHC) or related molecules and displayed on the cell surface [26]. The unique TCR recombinantly produced by each T lymphocyte imbues it with a unique TCR specificity: the ability to selectively recognize and respond to a given peptide-MHC (pMHC) complex or set of complexes. While αβ TCRs are highly specific for a given peptide-MHC ligand, such that a small change in an amino acid side chain of the peptide or TCR is sufficient to eliminate binding, cross-reactivity to similar and even non-homologous peptides bound to the same or a different MHC is also very common. This appears to be due to flexibility in the TCR binding site, which allows multiple stable conformations [27].

Although advances in high-throughput sequencing technologies now enable the routine analysis of millions of T-cell receptors in a single experiment, there has been no systematic way to organize groups of TCR sequences according to their likely antigen specificities. Relying on a tetramer-sorted TCR "training" dataset, we successfully developed a novel algorithm to search for and automatically cluster TCR sequences into distinct groups according to their likely specificity.
This algorithm, Grouping Lymphocyte Interactions by Paratope Hotspots or GLIPH, combines global TCR sequence similarity, local CDR sequence similarity (motifs), spatial peptide-antigen contact propensity, V-segment bias, CDR3 length bias, shared HLA alleles among TCR contributors, and clonal expansion bias, together with other observations concerning TCR specificity from the literature, to identify and cluster TCR sequences in specificity groups: sets of TCRs that are likely to be recognizing the same or very similar peptide-MHC ligands (Extended Data Fig. 3, https://github.com/immunoengineer/gliph).

The algorithm first searches in parallel for global and local similarity signatures. For local similarity signatures, it tests for enrichment of any continuous motif of 2, 3, or 4 amino acids within CDR3β sequences, while excluding the conserved N-terminal V-region and C-terminal J-region CDR3 amino acid positions (IMGT 104, 105, 106, 117, and 118) that have never been observed to bind the antigenic peptide in crystal structures (Extended Data Fig. 1). Optionally, it can also search for discontinuous amino acid motifs in CDR3 that allow any amino acid at a position, with 2, 3, or 4 specific amino acids around that position. A motif is considered enriched if it is elevated at least 10-fold over its expected frequency in the naïve TCR reference pool, with a probability < 0.001 of reaching that level of enrichment by chance. In parallel, it also identifies global similarity: CDR3β sequences of the same length that are identical or differ by only one amino acid. Next, it clusters the TCRs with identified similarities, gathering variants of similar amino acid motifs or global CDR3 similarity into a single sequence group in the process. Finally, GLIPH confirms the significance of the formed clusters by evaluating the enrichment of independent features, including V-gene usage, CDR3 length distribution, clonal expansion bias and donor HLA usage in each sequence group. This three-step process of motif nucleation creates a powerful statistical framework for identifying TCRs that recognize the same antigen. Optionally, GLIPH can restrict cluster members to only those TCRs that share the same V-gene. When HLA genotypes for the donors are available, GLIPH also provides a predicted HLA restriction for each TCR sequence group generated. This HLA allele prediction can then in turn be used to score databases of candidate peptides, ranking them according to their odds of being the target antigen.
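The local-similarity step described above can be illustrated with a minimal sketch. It is not the GLIPH implementation: the function names, the trimming defaults (three N-terminal and two C-terminal residues, standing in for IMGT 104-106 and 117-118), and the pseudo-floor used when a motif never appears in the reference are all assumptions made for illustration.

```python
import random
from collections import Counter


def continuous_motifs(cdr3, sizes=(2, 3, 4), trim_n=3, trim_c=2):
    """Return the set of contiguous motifs found in the CDR3 core.

    trim_n / trim_c drop conserved terminal residues; the exact
    trimming used by GLIPH is an assumption in this sketch.
    """
    core = cdr3[trim_n:len(cdr3) - trim_c]
    return {core[i:i + k] for k in sizes for i in range(len(core) - k + 1)}


def motif_counts(cdr3s):
    """Count, for each motif, how many unique CDR3s contain it."""
    counts = Counter()
    for seq in set(cdr3s):
        counts.update(continuous_motifs(seq))
    return counts


def enriched_motifs(sample, reference, min_fold=10.0, max_p=0.001,
                    n_resamples=1000, seed=0):
    """Return {motif: (observed, fold, p)} for motifs passing both cutoffs.

    `reference` must be a list at least as long as the unique sample.
    """
    rng = random.Random(seed)
    sample = sorted(set(sample))
    observed = motif_counts(sample)
    total = Counter()    # summed reference counts across resamples
    exceed = Counter()   # resamples in which count >= observed
    for _ in range(n_resamples):
        null = motif_counts(rng.sample(reference, len(sample)))
        for motif, obs in observed.items():
            total[motif] += null[motif]
            if null[motif] >= obs:
                exceed[motif] += 1
    hits = {}
    for motif, obs in observed.items():
        expected = total[motif] / n_resamples
        # Assumed floor: avoids division by zero for motifs absent
        # from every reference resample.
        fold = obs / max(expected, 1.0 / n_resamples)
        p = exceed[motif] / n_resamples
        if fold >= min_fold and p < max_p:
            hits[motif] = (obs, fold, p)
    return hits
```

With 1000 resamples, a cutoff of p < 0.001 is only met when no resample reaches the observed count, so a stricter cutoff would need a larger resampling depth.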
While previous studies have noted that certain TCR positions are more likely to be in contact with antigen, or that distributions of short TCR CDR3β motifs systematically alter after immunization [5,28], no other TCR repertoire analysis tools that we are aware of are able to formalize these observations into a unified framework for automated TCR sequence group identification, with statistical validation and HLA prediction [29-31]. GLIPH can cluster TCR sequences based on their likely antigen specificity, predict their likely HLA restriction and prioritize specific peptide antigen candidates. Our data also indicate that linear, usually continuous motifs of 3-4 amino acids in CDR3β and CDR3α are often highly conserved in TCRs that recognize the same peptide-MHC ligand, and that in known structures these are the residues responsible for contacting the antigenic peptide. This builds and expands on previous work by Chain and colleagues indicating that 3 amino acid motifs in the TCR CDR3 regions are often correlated with antigen-specific responses [28].

While GLIPH requires the HLA genotypes of the subjects in order to make an HLA prediction, in our Mtb dataset of 22 individuals there were 69 unique class II HLAs, and in each evaluated sequence group we were able to predict the presenting HLA. Similarly, while GLIPH cannot predict a specific target peptide a priori, knowing the HLA and a candidate population of peptides makes it possible to prioritize the peptides by predicted binding to the candidate HLA, greatly reducing the combinations of APCs and peptides to be tested. Where no pool of candidate antigens is available, the HLA prediction facilitates the use of the pMHC yeast display libraries developed by Garcia and colleagues [27].

An important question is how much of the TCR repertoire GLIPH will capture. The limited data we have at this time suggest that the fraction will be substantial: of the >2000 TCR sequences that we analyzed in Fig 3A, up to 14% could be gathered into high-confidence discrete specificity groups with GLIPH, with rarefaction suggesting that more groups could be identified if more data were available. Similarly, 14% (796/5711) of the TCRs fit into specificity groups in the Mtb cohort. A partial solution to this incomplete coverage will likely come from larger sample sizes, although in very diverse populations, such as the South African cohort analyzed here, there will be HLA alleles that are very rare, and only in very large population studies will there be enough matches to generate shared TCR specificities. Nonetheless, building large TCR databases around specific diseases, or TCR-omes, could be very valuable in assessing the responses of individuals or cohorts, even if only a fraction of the TCR sequences can be clustered using GLIPH.

24. Naylor, K. et al. The influence of age on T cell generation and TCR diversity. Journal of immunology 174, 7446-7452 (2005).
25. Robins, H. S. et al. Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood 114, 4099-4107, doi:10.1182/blood-2009-04-217604 (2009).
26. Garcia, K. C. et al. An alphabeta T cell receptor structure at 2.5 Å and its orientation in the TCR-MHC complex. Science 274, 209-219 (1996).
27. Birnbaum, M. E. et al. Deconstructing the peptide-MHC specificity of T cell recognition. Cell 157, 1073-1087, doi:10.1016/j.cell.2014.03.047 (2014).
28. Thomas, N. et al. Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics 30, 3181-3188, doi:10.1093/bioinformatics/btu523 (2014).
29. Bolotin, D. A. et al. MiTCR: software for T-cell receptor sequencing data analysis. Nature methods 10, 813-814, doi:10.1038/nmeth.2555 (2013).
30. Nazarov, V. I. et al. tcr: an R package for T cell receptor repertoire advanced data analysis. BMC bioinformatics 16, 175, doi:10.1186/s12859-015-0613-1 (2015).
31. Yu, Y., Ceredig, R. & Seoighe, C. LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins. Nucleic acids research 44, e31, doi:10.1093/nar/gkv1016 (2016).

Supplementary Methods

GLIPH Documentation

================================================================
GROUPING OF LYMPHOCYTE INTERACTIONS BY PARATOPE HOTSPOTS - GLIPH
================================================================

Questions? Comments? Contact jakeg@stanford.edu huang7@stanford.edu

Name
----

gliph - Grouping of Lymphocyte Interactions by Paratope Hotspots

Synopsis
--------

gliph [options ] --tcr TCR_TABLE --hla HLA_TABLE

Description
-----------

GLIPH clusters TCRs that are predicted to bind the same MHC-restricted peptide antigen. When multiple donors have contributed to the clusters, and HLA genotypes for those donors are available, GLIPH can additionally predict which HLA allele is presenting the antigen.

Typically the user will pass in a sequence set of hundreds to thousands of TCR sequences. This dataset will be analyzed for very similar TCRs, or TCRs that share CDR3 motifs that appear enriched in this set relative to their expected frequencies in an unselected naive reference TCR set.

GLIPH returns significant motif lists, significant TCR convergence groups, and for each group, a collection of scores indicating enrichment for motif, V-gene, CDR3 length, shared HLA among contributors, and proliferation count. When HLA data is available, it predicts the likely HLA that the set of TCRs recognizes.

Options
-------

Required Data Inputs
~~~~~~~~~~~~~~~~~~~~

The user must provide a table of TCR sequences.

--tcr TCR_TABLE

    The format of the table is tab delimited, expecting the following columns in this order. Only TCRb is required for the primary component of the algorithm to function, but patient identity is required for HLA prediction.

    Example:

    CDR3b           TRBV     TRBJ     CDR3a            TRAV      TRAJ    Patient  Counts
    CAADTSSGANVLTF  TRBV30   TRBJ2-6  CALSDEDTGRRALTF  TRAV19    TRAJ5   09/0217  1
    CAATGGDRAYEQYF  TRBV2    TRBJ2-7  CAASSGANSKLTF    TRAV13-1  TRAJ56  03/0492  2
    CAATQQGETQYF    TRBV2    TRBJ2-5  CAASYGGSARQLTF   TRAV13-1  TRAJ22  02/0259  1
    CACVSNTEAFF     TRBV28   TRBJ1-1  CAGDLNGAGSYQLTF  TRAV25    TRAJ28  PBMC863  1
    CAGGKGNSPLHF    TRBV2    TRBJ1-6  CVVLRGGSQGNLIF   TRAV12-1  TRAJ42  02/0207  1
    CAGQILAGSDTQYF  TRBV6-4  TRBJ2-3  CATASGNTPLVF     TRAV17    TRAJ29  09/0018  1
    CAGRTGVSTDTQYF  TRBV5-1  TRBJ2-3  CAVTPGGGADGLTF   TRAV41    TRAJ45  02/0259  1
    CAGYTGRANYGYTF  TRBV2    TRBJ1-2  CVVNGGFGNVLHC    TRAV12-1  TRAJ35  01/0873  3

Optional Data Inputs
~~~~~~~~~~~~~~~~~~~~

The user may additionally supply a table of HLA genotyping for each subject.

--hla HLA_TABLE

    The format of the table is tab delimited, with each row beginning with the identity of a subject, followed by two or more columns providing HLA identification. The total number of columns (HLA defined genotypes) is flexible.

    Example:

    09/0217  DPA1*01:03  DPA1*02:02  DPB1*04:01   DPB1*14:01  DQA1*01:02
    09/0125  DPA1*02:02  DPA1*02:02  DPB1*05:01   DPB1*05:01  DQA1*06:01
    03/0345  DPA1*02:01  DPA1*02:01  DPB1*17:01   DPB1*01:01  DQA1*01:03
    03/0492  DPA1*01:03  DPA1*02:01  DPB1*03:01   DPB1*11:01  DQA1*01:02
    02/0259  DPA1*01:03  DPA1*01:03  DPB1*104:01  DPB1*02:01  DQA1*02:01

Optional Arguments
~~~~~~~~~~~~~~~~~~~

--refdb DB

    optional alternative reference database

--gccutoff=1

    global convergence distance cutoff. This is the maximum CDR3 Hamming mutation distance between two clones sharing the same V, same J, and same CDR3 length in order for them to be considered likely to bind the same antigen. This number will change depending on sample depth, because with more reads the odds of finding a similar sequence increase even in a naive repertoire. This number will also change depending on the species evaluated and even the choice of reference database (memory TCRs will be more likely to have similar TCRs than naive TCR repertoires).
    Thus, by default this is calculated at runtime if not specified.

--simdepth=1000

    simulated resampling depth for non-parametric convergence significance tests. This defines the number of random repeat samplings into the reference distribution that GLIPH performs when analyzing

    1) global similarity cutoff
    2) local similarity motif enrichment
    3) V-gene enrichment
    4) CDR3 length enrichment
    5) clonal proliferation enrichment

    A higher number will take longer to run but will produce more reproducible results.

--lcminp=0.01

    local convergence minimum probability score cutoff. The score reports the probability that a random sample of the same size as the sample set, but drawn from the reference set (i.e. naive repertoire), would generate an enrichment of the given motif at least as high as has been observed in the sample set. It is set to 0.01 by default.

--lcminove=10

    local convergence minimum observed vs expected fold change. This is a cutoff for the minimum fold enrichment over a reference distribution that a given motif should have in the sample set in order to be considered for further evaluation. It is set to 10 by default.

--kmer_mindepth=3

    minimum observations of kmer for it to be evaluated. This is the minimum number of times a kmer should be observed in the sample set in order for it to be considered for further evaluation. The number can be set higher to provide fewer motif-based clusters with higher confidence. This could be recommended if the sample set is greater than 5000 reads. Lowering the value to 2 will identify more groups, but likely at the cost of an increased False Discovery Rate.
--global=1             Search for global TCR similarity (Default 1)
--local=1              Search for local TCR similarity (Default 1)
--make_depth_fig=0     Perform repeat random samplings at the sample set depth in order to visualize convergence
--discontinuous=0      Allow discontinuous motifs (Default 0)
--positional_motifs=0  Restrict motif clustering to a shared position that is fixed from the N-terminal end of CDR3
--cdr3len_stratify=0   Stratify by shared cdr3length distribution (Default 0)
--vgene_stratify=0     Stratify by shared V-gene frequency distribution (Default 0)
--public_tcrs=0        Reward motifs in public TCRs (Default 0)

Controlling Output
~~~~~~~~~~~~~~~~~~

GLIPH produces multiple output files.

--output FILE  Place command output into the named file.
--verbose      Have *tmo* print messages describing

Definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convergence group - a set of multiple TCRs from one or more individuals that bind the same antigen in a similar manner through similar TCR contacts. GLIPH predicts convergence groups by asking "what is the probability that this cluster of similar TCRs could have appeared without the selection of common antigen?"

Global convergence - A pair of TCRs that share the same length CDR3 and differ by fewer than a certain number of amino acids in those CDR3s.
    Example: in a set of 300 random TCRs, finding two TCRs that only differ by one amino acid in their CDR3 would be highly unlikely.

Local convergence - A pair of TCRs that share in their CDR3 regions an amino acid motif that appears enriched in their sample set. Optionally this common motif could be positionally constrained.
    Example: in the malaria set, an enriched QRW motif was found in 23 unique TCRs from 12 individuals in a conserved position in their CDR3.

Reference set - A large database of TCR sequences that are not expected to be enriched for the specificities found in the sample set.
    Example: by default, GLIPH uses as a reference database over 200,000 nonredundant naive CD4 and CD8 TCRb sequences from 12 healthy controls.

Sample set - The input collection of TCRs under evaluation that are potentially enriched for a specificity not present in the reference set.

Function of Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GLIPH first makes the input sample set non-redundant at the CDR3 amino acid level. The identity of the V-genes, the J-genes, the clonal frequencies, and the associated HLA information for each TCR is hidden during the CDR3 analysis so that these can act as independent confirmation variables of the resulting clusters.

GLIPH next counts the number of unique CDR3 sequences in the sample dataset.

GLIPH next calculates the minimum Hamming distance from each CDR3 in the sample dataset to the next closest same-length sequence in the sample set. This result is the minimum distance distribution.

GLIPH next loads the reference dataset.
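The minimum distance distribution described above can be sketched as follows. This is an illustrative sketch only, not GLIPH's code: the function names are hypothetical, and skipping CDR3s that have no same-length partner is an assumption.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance_distribution(cdr3s):
    """Map each minimum Hamming distance to the number of CDR3s having it.

    The input is first made non-redundant; distances are only computed
    between CDR3s of the same length.
    """
    unique = sorted(set(cdr3s))
    by_length = {}
    for seq in unique:
        by_length.setdefault(len(seq), []).append(seq)
    distribution = Counter()
    for seq in unique:
        partners = [other for other in by_length[len(seq)] if other != seq]
        if partners:  # assumed: CDR3s with no same-length partner are skipped
            distribution[min(hamming(seq, other) for other in partners)] += 1
    return distribution
```

Running the same function on repeated random draws from the reference set, at the same depth, would give the null distributions that the sample distribution is compared against.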

GLIPH next repeatedly resamples from the reference database at the non-redundant sample set depth, calculating the minimum Hamming distance distribution in each case. This is repeated simdepth times (1000 by default). The probability of encountering next-closest members at distance 0, 1, 2... is tabulated across all random samples.

GLIPH next compares the sample set minimum Hamming distance distribution to that of the reference set, to assign to each Hamming distance a probability that sequences at that distance appear randomly.

Example: after analysis, it is found that out of 1000 simulations

    hamming distance  in-test-set  in-ref-set  probability
    1                 14           0           0.001
    2                 37           8+/-3       0.02
    3                 194          147+/-35    0.24

GLIPH then selects a maximum Hamming distance cutoff for grouping similar TCRs.

NOTE: For the standard reference set this analysis has been precomputed for a range of sampling depths, enabling a lookup table rather than simulation as a runtime accelerator.

GLIPH then analyzes all possible 3mer, 4mer, 5mer and discontinuous 3mer and 4mer motifs for their frequency in the sample set. This is done excluding the first three and last three residues in the CDR, as they are not observed to be in contact with antigen in known crystal structures. This is done for every non-redundant CDR3 in the reference set.

Example: for a scan of the following CDR3, the following motifs would be collected

    CAS SFGSGGHYE TYF
    xxx SFGSGGHYE xxx

    SFG FGS GSG SGG GGH GHY HYE
    SFGS FGSG GSGG SGGH GGHY GHYE
    S.GS F.SG (etc)

The frequencies of all motifs across the sample set are calculated.

In order to establish whether these motifs are just naturally abundant or specifically enriched by antigen, these frequencies are compared to a repeat random sampling of the reference set at the same depth as the non-redundant database. The frequency of every motif is collected with each random sampling in order to create a distribution.

If each motif could only be observed in a given sequence once, then the distribution of sampling motif frequency means μ becomes normally distributed, and this result is equivalent to calculating the frequencies of all motifs in the reference database, and then calculating one-sided confidence intervals for expected frequencies of any given motif in the reference database at any given sampling depth:

    CI(99.9% os) = \bar{y} + t_{0.0005} (s / \sqrt{n})

where n is the sample set non-redundant CDR3 sample size, \bar{y} is the motif mean frequency, and s is the SD estimate at that sampling depth for the motif, as

    s = \sqrt{ \sum_i (y_i - \bar{y})^2 / (n - 1) }

This approximation provides a runtime acceleration.

GLIPH then calculates the probability that each motif in the sample set would have appeared at the observed frequency by random chance (calculated as the number of times the motif was found at at least that depth in a random sampling, divided by the total number of random samplings, or alternatively through the CI).

GLIPH next generates a network with all CDR3s as nodes, and with edges representing either global or local interactions.

The clusters can be optionally filtered with the following arguments

--positional_motifs=0

    Restrict motif edges to CDR3s where the motif is in a shared position that is fixed from the N-terminal end of CDR3

--public_tcrs=0

    Allow CDR3s from multiple subjects to increase the size of their clusters. Rewards public TCRs but risks contamination clouding results.
--global_vgene=0

    Restricts global relationships to TCRs of common V-gene. Recommended, but invalidates the V-gene probability score.

GLIPH then scores each individual cluster (candidate convergence group) by evaluating a set of features that are independent of the CDR3 observations, assigning a probability p to each feature, and then combining those probabilities into a single score by conflation.

For tests i = 1 through N, each testing the probability P_i(X=C) that cluster X is convergent, the combined conflation score is given as

    P(X=C) = \frac{\prod_{i=1}^{N} P_i(X=C)}{\prod_{i=1}^{N} P_i(X=C) + \prod_{i=1}^{N} P_i(X \neq C)}

The individual P_i(X=C) tests include

    1) global similarity probability
    2) local motif probability
    3) network size
    4) enrichment of V-gene within cluster
    5) enrichment of CDR3 length (spectratype) within cluster
    6) enrichment of clonal expansion within cluster
    7) enrichment of common HLA among donor TCR contributors in cluster

Individual score components are calculated as follows:

== 3) Calculating network size p ==

For each discrete cluster, the probability p of a given cluster topology can be obtained from the number of members of the cluster by comparison to a lookup table calculated from repeat random sampling and GLIPH clustering of naive TCR sequences at a range of different sampling depths n from 25 to 5000, each performed 1000 times.

Example: at sampling depth n=500, clusters of size 5 have a probability p=0.002 of occurring in naive TCR sample sets.

== 4) enrichment of V-gene within cluster ==

GLIPH hides V-gene usage for all clones prior to CDR3 analysis, and does not explore the V-gene template encoded amino acids in the local (motif) search.

To evaluate whether there is an enrichment of common V-segment within the cluster over the degree that would be expected from an unbiased sampling of TCRs of cluster size n from the total dataset, we perform repeat random sampling at sample size n from the total dataset, each time obtaining the V-genes for each clone and calculating a V-gene Simpson's Diversity Index D for each sample, where D is interpreted as the probability that any two members within the cluster would share a V-gene and is calculated as

    D = \frac{\sum_{i=1}^{V} v_i (v_i - 1)}{n (n - 1)}

where V is the number of distinct V-genes, v_i is the total count of the i-th V-gene, and n is the sampling size (cluster size).
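As a small illustration of the index D defined above, the following sketch computes it from a list of per-clone labels (V-genes here; the same function applies to CDR3 lengths). The function name is hypothetical and the sketch assumes D is the probability that two members drawn without replacement share a label.

```python
from collections import Counter


def simpson_index(labels):
    """Probability that two members drawn without replacement share a label."""
    n = len(labels)
    if n < 2:
        raise ValueError("at least two cluster members are required")
    counts = Counter(labels)
    return sum(v * (v - 1) for v in counts.values()) / (n * (n - 1))
```

For example, a cluster of four clones in which three use TRBV2 and one uses TRBV30 gives D = (3*2 + 1*0) / (4*3) = 0.5.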
The probability of the observed D for the candidate convergence group is obtained as the one-tailed probability of observing a score at least that high in the D score distribution from randomly sampled clusters of the same size n.

== 5) enrichment of CDR3 length within cluster ==

For local (motif) scoring, no stratification based on CDR3 length is performed.

To evaluate whether there is an enrichment of common CDR3 length within the cluster over the degree that would be expected from an unbiased sampling of TCRs of cluster size n from the total dataset, we perform repeat random sampling at sample size n from the total dataset, each time obtaining the CDR3 lengths for each clone and calculating a CDR3-length Simpson's Diversity Index D for each sample, where D is interpreted as the probability that any two members within the cluster would share a CDR3 length and is calculated as

    D = \frac{\sum_{i=1}^{C} c_i (c_i - 1)}{n (n - 1)}

where C is the number of distinct CDR3 lengths, c_i is the total count of the i-th CDR3 length, and n is the sampling size (cluster size).

The probability of the observed D for the candidate convergence group is obtained as the one-tailed probability of observing a score at least that high in the D score distribution from randomly sampled clusters of the same size n.

== 6) enrichment of clonal expansion within cluster ==

The first step in GLIPH is to make all CDR3s non-redundant, hiding their clone frequencies. To evaluate whether there is an enrichment of expanded clones within a given convergence group, the number of counts for each clone is recovered, and a convergence group expansion coefficient e is calculated as

    e = total clones / total unique clones

To evaluate whether there is an enrichment of expanded clones within the cluster over the degree that would be expected from an unbiased sampling of TCRs of cluster size n from the total dataset, we perform repeat random sampling at sample size n from the total dataset, each time obtaining the clone counts for each clone and calculating e for each random sample to establish a distribution.
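The expansion test in this section can be sketched as below. The function names, the default resampling depth, and the seed are hypothetical; the sketch assumes that `all_counts` holds one clone count per unique CDR3 in the total dataset and that random clusters are drawn without replacement.

```python
import random


def expansion_coefficient(counts):
    """e = total clones / total unique clones for one group of clone counts."""
    return sum(counts) / len(counts)


def expansion_p(observed_e, cluster_size, all_counts,
                n_resamples=1000, seed=0):
    """Fraction of random clusters whose e is at least the observed e."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        draw = rng.sample(all_counts, cluster_size)  # without replacement
        if expansion_coefficient(draw) >= observed_e:
            hits += 1
    return hits / n_resamples
```

The same resampling loop could be reused for the V-gene and CDR3-length tests by swapping the statistic that is computed on each draw.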
The probability of the observed e for a candidate convergence group is obtained as the one-tailed probability of observing a score at least that high in the e score distribution from randomly sampled clusters of the same size n.

== 6) enrichment of common HLA among donor TCR contributors in cluster ==

TCRs that recognize a common antigen should be constrained by the same HLA. Thus, their contributing donors should carry that HLA allele in their genotype. When HLA genotype information is available, GLIPH uses combinatorial resampling without replacement to estimate the probability that the collection of TCRs in the convergence group recognizes any given HLA. For each HLA allele that appears at least twice in the candidate convergence group, GLIPH obtains:

    a    the number of subjects in the convergence group that harbor that allele
    A    the number of subjects in the study that harbor that allele
    n    the number of subjects in the convergence group
    N    the number of subjects in the study

GLIPH uses this information to calculate the probability that a given HLA allele is present by chance.

Examples
--------

    gliph --tcr TCR_TABLE --hla HLA_TABLE
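The HLA test of section 6 can be illustrated with a short Python sketch built on the four quantities a, A, n and N. It is not GLIPH's implementation. It assumes that "present by chance" means observing at least a carriers among n subjects drawn without replacement from the N study subjects. The function names, the number of draws, and the seed are hypothetical. The exact hypergeometric tail is included only as a cross-check on the resampling estimate and is not described in the text.

```python
import random
from math import comb

def hla_enrichment_p(a, A, n, N, draws=10000, seed=0):
    """Resampling estimate of P(at least `a` carriers among `n` subjects).

    Subjects are drawn without replacement from a study of `N` subjects,
    `A` of whom harbor the allele.
    """
    rng = random.Random(seed)
    study = [True] * A + [False] * (N - A)  # True marks an allele carrier
    hits = 0
    for _ in range(draws):
        carriers = sum(rng.sample(study, n))
        if carriers >= a:
            hits += 1
    return hits / draws

def hypergeom_tail(a, A, n, N):
    """Exact counterpart of the estimate above (added here for comparison)."""
    upper = min(A, n)
    favourable = sum(comb(A, k) * comb(N - A, n - k) for k in range(a, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 3 of 4 subjects in a group carry an allele
# that is carried by 5 of the 20 subjects in the study.
p_sim = hla_enrichment_p(a=3, A=5, n=4, N=20)
p_exact = hypergeom_tail(a=3, A=5, n=4, N=20)
```

With enough draws the two values should agree closely; a small value suggests the allele is shared within the group more often than random donor sampling would produce.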

Supplementary Methods

Training set curation

The training set was derived from single-cell TCR sequencing of tet+ cells, bulk sequencing of tet+ cell pools, sequences from the literature, and sequences from solved crystal structures. To simplify analysis, TCRs in the training set were curated to ensure that they were

A) Real (i.e. not a read error variant)
B) Non-redundant: each sequence was only counted once per subject source
C) Non-promiscuous: each TCR only associated with a single specificity

A) Real (i.e. not a read error variant)

Single-cell TCR sequences were sequenced to a depth of at least 1000x to eliminate read error. Bulk-cell TCR sequencing was performed in replicate. To avoid PCR error and read error, only reads identified across both replicates were retained. Frequencies of all TCRb clones were evaluated in each replicate, and any read differing by only 1 bp from a higher-frequency clone was absorbed into that clone, on the assumption that, given the read depths used in each replicate, it most likely represented a read error variant of the more dominant clone. Final sequences were reviewed to confirm that they contained in-frame VDJ recombination and functional CDR3 boundary residues.

B) Non-redundant: each sequence was only counted once per subject source

To avoid over-counting motifs in high-frequency clones, the data were rendered clonally non-redundant. Irrespective of how many times a clone was observed in single cells or across reads in bulk TCR sequencing, the clones were rendered non-redundant within a given specificity and given subject. For bulk sequencing, subjects with very large numbers of singleton clones were down-sampled to use only the highest-frequency clones, in numbers comparable to those presented by other subjects. This was done on the assumption that these large populations of low-frequency singleton reads likely represented tetramer staining background from the FACS sorting, or barcode background during sequencing.
It was also done to ensure that each subject contributed a comparable amount of information to the analysis of statistical trends in the antigen-specific repertoire.

C) Non-promiscuous: each TCR only associated with a single specificity

In almost all cases, a single TCR was only observed against a single specificity from a single individual. However, in the bulk TCR sequencing, there were some cases of a TCR being found across multiple specificities. In 37 of these cases, the TCR was found at high frequency in a single sample, but appeared at very low copy number in some other samples. As these samples were all sequenced together, it was assumed that a low rate of DNA barcode contamination was causing a low rate of DNA sequencing read misassignment, a known issue for Illumina sequencing with multiplex identifier barcodes. To address this, sequences with clearly dominant specificities were curated to remove the presence of that clone in the presumed contaminant sources. In total, 37 clones were curated in this way; three example cases are shown below.

Subject        HLA      Antigen   Counts   CDRb3              Selection
subject-0958   HLA-A1   CMV            3   CSVVAGGADTQYF
subject-0898   HLA-A2   CMV        28396   CSVVAGGADTQYF      X
subject-1521   HLA-B7   CMV            4   CSVVAGGADTQYF
subject-2387   HLA-A2   flu        13438   CSAISGSGGVGDTQYF   X
subject-2385   HLA-B7   CMV            2   CSAISGSGGVGDTQYF
subject-2351   HLA-B7   flu            3   CSAISGSGGVGDTQYF
subject-3183   HLA-A1   CMV            2   CASSPGTDTQYF
subject-1mc1   HLA-A2   EBV            3   CASSPGTDTQYF
subject-0959   HLA-B7   flu           54   CASSPGTDTQYF       X

There remained 49 additional clones that were found across two samples at frequencies that made unambiguous assignment challenging. Almost every case was found between two samples: subject-0958_HLA-A1-CMV and subject-0988_hla-b7-flu. All ambiguous clones were removed from the final curated dataset. The final dataset contained 2067 T-cell receptors of known, unique specificity, of which 47 were found across more than one source.
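The non-promiscuity curation can be illustrated on the three example clones from the table above. The text does not state how a "clearly dominant" specificity was judged, so the rule below (the top source must have at least 10 times the counts of the runner-up) is purely an assumption for illustration, as are the function name and the `min_ratio` parameter.

```python
from collections import defaultdict

# Rows copied from the example table: (subject, HLA, antigen, counts, CDR3).
rows = [
    ("subject-0958", "HLA-A1", "CMV", 3, "CSVVAGGADTQYF"),
    ("subject-0898", "HLA-A2", "CMV", 28396, "CSVVAGGADTQYF"),
    ("subject-1521", "HLA-B7", "CMV", 4, "CSVVAGGADTQYF"),
    ("subject-2387", "HLA-A2", "flu", 13438, "CSAISGSGGVGDTQYF"),
    ("subject-2385", "HLA-B7", "CMV", 2, "CSAISGSGGVGDTQYF"),
    ("subject-2351", "HLA-B7", "flu", 3, "CSAISGSGGVGDTQYF"),
    ("subject-3183", "HLA-A1", "CMV", 2, "CASSPGTDTQYF"),
    ("subject-1mc1", "HLA-A2", "EBV", 3, "CASSPGTDTQYF"),
    ("subject-0959", "HLA-B7", "flu", 54, "CASSPGTDTQYF"),
]

def curate(rows, min_ratio=10):
    """Keep one source per CDR3 when it clearly dominates; otherwise flag it.

    `min_ratio` is a hypothetical dominance threshold, not a value from the text.
    """
    by_cdr3 = defaultdict(list)
    for row in rows:
        by_cdr3[row[4]].append(row)

    kept, ambiguous = [], []
    for cdr3, group in by_cdr3.items():
        group = sorted(group, key=lambda r: r[3], reverse=True)
        top = group[0]
        runner_up = group[1][3] if len(group) > 1 else 0
        if top[3] >= min_ratio * runner_up:
            kept.append(top)        # presumed true source; other rows are dropped
        else:
            ambiguous.append(cdr3)  # no clear source; remove the clone entirely
    return kept, ambiguous

kept, ambiguous = curate(rows)
```

Under this assumed threshold the retained rows are the three marked X in the table, and a clone seen at similar counts in two samples would be flagged as ambiguous and removed.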