Gibbs sampling - Sequence alignment and sequence clustering Massimo Andreatta Center for Biological Sequence Analysis Technical University of Denmark massimo@cbs.dtu.dk Technical University of Denmark 1
MHC class II binding - MHC class II molecules selectively bind peptides derived from extracellular proteins - 9 amino acids commonly interact with the pocket BUT - The binding cleft is open at both sides - Peptides of up to 20 AAs can bind Technical University of Denmark 2
MHC class II binding 9 AAs SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT............ DFAAQVDYPSTGLY Technical University of Denmark 3
MHC class II binding SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT............ DFAAQVDYPSTGLY 9 AAs SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT............ DFAAQVDYPSTGLY Technical University of Denmark 4
Gibbs sampling - sequence alignment Why sampling? 50 sequences 12 amino acids long try all possible combinations with a 9-mer overlap 4 50 ~ 10 30 possible combinations...computationally unfeasible SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT............ DFAAQVDYPSTGLY Technical University of Denmark 5
Gibbs sampling - sequence alignment State transition SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 E = C p,a p,a log p p,a q a de = E i E i 1 Technical University of Denmark 6
Gibbs sampling - sequence alignment State transition SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT E = C p,a p,a log p p,a q a de = E i E i 1 Accept or reject the move? P = min 1,exp de T Technical University of Denmark 7 Note that the probability of going to the new state only depends on the previous state
Gibbs sampling - sequence alignment SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT Numerical example - 1 move to state +1 T = 0.2 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT E i 1 = 2.44 E i = 2.52 P = min 1,exp 0.08 = min 1, 1.49 0.2 [ ] =1 Accept move with Prob = 100% Technical University of Denmark 8
Gibbs sampling - sequence alignment SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT Numerical example - 2 move to state +1 T = 0.2 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT E i 1 = 2.44 E i = 2.35 P = min 1,exp 0.09 = min 1, 0.638 0.2 [ ] = 0.638 Accept move with Prob = 63.8% Technical University of Denmark 9
Gibbs sampling - sequence alignment T What is the MC temperature? it s a scalar decreased during the simulation iteration Technical University of Denmark 10
Gibbs sampling - sequence alignment T What is the MC temperature? it s a scalar decreased during the simulation t 1 =0.4 P(t 1 ) = min 1,exp de = min 1,exp 0.3 = 0.47 t 1 0.4 E.g. same de=-0.3 but at different temperatures t 2 =0.1 P(t 2 ) = min 1,exp 0.3 = 0.05 0.1 P(t 3 ) = min 1,exp 0.3 0 0.02 t 3 =0.02 iteration Technical University of Denmark 11
Technical University of Denmark 12
f(z) Move freely around states when the system is warm, then cool it off to force it into a state of high fitness Technical University of Denmark 13 Z
Single sequence move SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT E = C p,a p,a log p p,a q a de = E i E i 1 Accept or reject the move? P = min 1,exp de T Technical University of Denmark 14
Phase shift move SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 shift all sequences SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT E = C p,a p,a log p p,a q a de = E i E i 1 Accept or reject the move? P = min 1,exp de T Technical University of Denmark 15
Does it work? Technical University of Denmark 16
Does it work? HLA-DRB3*01:01 Technical University of Denmark 17
Gibbs clustering Multiple motifs! SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK Cluster 1 ----SLFIGLKGDIRESTV-- --DGEEEVQLIAAVPGK---- ------VFRLKGGAPIKGVTF ---SFSCIAIGIITLYLG--- ----IDQVTIAGAKLRSLN-- WIQKETLVTFKNPHAKKQDV- ------KMLLDNINTPEGIIP Cluster 2 --ELLEFHYYLSSKLNK---- ------LNKFISPKSVAGRFA ESLHNPYPDYHWLRT------ -NKVKSLRILNTRRKL----- --MMGMFNMLSTVLGVS---- AKSSPAYPSVLGQTI------ --RHLIFCHSKKKCDELAAK- Technical University of Denmark 18
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402 Mixture of 100 binders for the two alleles ATDKAAAAY A*0101 EVDQTKIQY A*0101 AETGSQGVY B*4402 ITDITKYLY A*0101 AEMKTDAAT B*4402 FEIKSAKKF B*4402 LSEMLNKEY A*0101 GELDRWEKI B*4402 LTDSSTLLV A*0101 FTIDFKLKY A*0101 TTTIKPVSY A*0101 EEKAFSPEV B*4402 AENLWVPVY B*4402 Technical University of Denmark 19
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402 Mixed G 1 A0101 B4402 G 2 Technical University of Denmark 20
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402 Mixed G 1 A0101 B4402 97 3 G 2 3 97 Resolved Technical University of Denmark 21
Five MHC class I alleles G 0 G 1 G 2 G 3 G 4 A0101 A0201 A0301 B0702 B4402 Technical University of Denmark 22
Five MHC class I alleles G 0 G 1 G 2 G 3 G 4 A0101 A0201 A0301 B0702 B4402 0 1 76 1 0 2 4 0 0 95 5 87 5 1 0 93 2 19 0 2 0 6 0 98 3 HLA-A0301 97% HLA-A0101 80% HLA-A0201 89% HLA-B4402 94% HLA-B0702 92% Technical University of Denmark 23
Dealing with noisy data Experimental data often contain false positives Outliers do not match any recurrent motif Introduce a garbage bin to collect outliers Technical University of Denmark 24
Dealing with noisy data Introduce a garbage bin to collect outliers SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK g1 g2 gn trash Move to the trash cluster peptides that do not match any motif Technical University of Denmark 25
Dealing with noisy data 200 binders to 3 MHC class I alleles 3 alleles HLA A0101 HLA B0702 HLA B4001 Random g0 0 197 2 6 205 g1 199 1 0 2 202 g2 0 0 196 5 201 trash 1 2 2 37 42 200 200 200 50 50 random sequences are added to the data set Technical University of Denmark 26
Dealing with noisy data 200 binders to 3 MHC class I alleles 3 alleles HLA A0101 HLA B0702 HLA B4001 Random g0 0 197 2 6 205 g1 199 1 0 2 202 g2 0 0 196 5 201 trash 1 2 2 37 42 200 200 200 50 50 random sequences are added to the data set DHHFTPQII NAFGWENAY SQTSYQYLI ELPIVTPAL FCSNHFTEL Technical University of Denmark 27
Dealing with noisy data NetMHC predictions NetMHC version 3.2. 9mer predictions using Artificial Neural Networks. Strong binder threshold 50 nm. Weak binder threshold score 500 nm ------------------------------------------------------------------- pos peptide logscore affinity(nm) Bind Level Allele ------------------------------------------------------------------- 0 DHHFTPQII 0.069 23725 HLA-A0101 1 NAFGWENAY 0.068 24050 HLA-B0702 2 SQTSYQYLI 0.059 26341 HLA-B0702 3 ELPIVTPAL 0.110 15217 HLA-B4001 4 FCSNHFTEL 0.173 7709 HLA-B4001 False positives? Technical University of Denmark 28
SH3 domain binding - SH3 domains are small interaction modules abundantly found in eukaryotes - They preferentially bind proline-rich regions (minimum consensus sequence is PxxP) - Two main classes of ligands exist: class I +xφpxφp class II φpxφpx+ +: R or K φ: hydrophobic x: any amino acid Surface representation of the Hck SH3 domain bound to an artificial high-affinity peptide ligand, in green. Schmidt et al. (2007) Technical University of Denmark 29
SH3 domain binding - Gibbs clustering - Large data set of 2,457 unique peptide sequences binding to Src SH3 domain - 12 amino acids long peptides, unaligned with respect to the binding core WVTAPRSLPVLP GSWVVDISNVED NYSGNRPLPGIW RSPIVRQLPSLP RALPVMPTNGPM AKSRPLPMVGLV VRPLPSPEGFGQ YRGRMLPVIWGT RALPLPRAYEGI VPLLPIRNGAVN PVRVLPNIPLSV...... Technical University of Denmark 30
SH3 domain binding Gibbs clustering - 1 cluster class I +xφpxφp 2,360 sequences (97 to trash) class II φpxφpx+ Technical University of Denmark 31
SH3 domain binding Gibbs clustering - 2 clusters 498 sequences 1,892 sequences (67 to trash) class I +xφpxφp class II φpxφpx+ Technical University of Denmark 32
SH3 domain binding Gibbs clustering - 3 clusters 1,606 sequences 490 sequences 305 sequences class I +xφpxφp (56 to trash) class II φpxφpx+ Technical University of Denmark 33
HLA-A*02:01 sub-motifs - Binding to MHC class molecules is a fundamental step for peptide immunogenicity - One important parameter is binding affinity - But not all peptides with high binding affinity become immunogenic - Immunogenicity depends also on the formation of stable MHCpeptide complexes 650 peptides binding to A*02:01 Technical University of Denmark 34
HLA-A*02:01 sub-motifs 441 sequences 209 sequences <Aff> = 6 nm <Th> = 5.7 hours = <Aff> = 9 nm <Th> = 2.1 hours Technical University of Denmark 35
In conclusion Sampling methods can solve problems where the search space is too large to be exhaustively explored Gibbs sampling can detect even weak motifs in a sequence alignment (e.g. MHC class II) Gibbs clustering can be used to separated multiple motifs contained in a data set (e.g. SH3 domain classes, MHC molecules sub-specificities) More than 1,000 papers in PubMed using Gibbs sampling methods Transcription factor binding sites Transmembrane domains MHC class I and II binding... Technical University of Denmark 36