BIOINFORMATICS. June Immunological Bioinformatics

IMMUNOLOGICAL BIOINFORMATICS MEHTEIDHWLEFSATKLSSCDSFTSTINELNHCLSLRT YLVGNSLSLADLCVWATLKGNAAWQEQLKQKKAPVH VKRWFGFLEAQQAFQSVGTKWDVSTTKARVAPEKKQ DVGKFVELPGAEMGKVTVRFPPEASGYLHIGHAKAA LLNQHYQVNFKGKLIMRFDDTNPEKEKEDFEKVILE DVAMLHIKPDQFTYTSDHFETIMKYAEKLIQEGKAYV DDTPAEQMKAEREQRIESKHRKNPIEKNLQMWEEM KKGSQFGHSCCLRAKIDMSSNNGCMRDPTLYRCKIQP HPRTGNKYNVYPTYDFACPIVDSIEGVTHALRTTEYH DRDEQFYWIIEALGIRKPYIWEYSRLNLNNTVLSKRK LTWFVNEGLVDGWDDPRFPTVRGVLRRGMTVEGLK QFIAAQGSSRSVVNMEWDKIWAFNKKVIDPVAPRYVA LLKKEVIPVNVPEAQEEMKEVAKHPKNPEVGLKPVW YSPKVFIEGADAETFSEGEMVTFINWGNLNITKIHKN June - 2011 27685 Immunological Bioinformatics

TABLE OF CONTENTS Abstract! 3 Outline! 3 Introduction to the immune system! 4 Performance measures in prediction systems! 6 B ce" epitope identification and prediction! 8 Prediction of MHC binding.! 11 MHC class I binding predictions! 11 MHC class II binding predictions! 13 MHC polymorphism, supertype clustering, and panpredictions.! 14 Prediction of the CTL presenting pathway and integration of prediction systems! 16 Integrated T-ce" epitope predictions! 17 Epitope discovery and rational vaccine design.! 19 SARS! 19 Influenza! 20 HIV! 21 The new age of vaccine development! 22 Bibliography! 24

IMMUNOLOGICAL BIOINFORMATICS Morten Nielsen Claus Lundegaard 2011 Abstract Immunological bioinformatics is a relatively novel research field, highly relevant in investigations of the complex behavior of the immune system. The immune system reacts to foreign protein segments called epitopes, and a major task is to develop methods identifying epitopes in pathogen proteomes. Predicted epitopes can eventually be experimentally verified and used in vaccine design and diagnosis. The course will demonstrate how bioinformatics can effectively be used in the search of vaccine candidates against pathogens like for instance SARS, Flu, and HIV. The methods used include artificial neural networks, Gibbs sampling, weight matrices and hidden Markov models as well as integrative models and protein structural analysis. Outline 1. Introduction to the immune system 2. Performance measures in prediction systems 3. B cell epitope identification and prediction 4. Prediction of MHC binding. 5. MHC polymorphism, supertype clustering, and panpredictions. 6. Prediction of the CTL presenting pathway and integration of prediction systems. 7. Epitope discovery and rational vaccine design. June 2011! 3

Introduction to the immune system Adapted from (Lundegaard et al., 2007) Adaptive immunity is induced by lymphocytes and can be classified into two types: humoral immunity, mediated by antibodies, which are secreted by B lymphocytes and can neutralize pathogens outside the cells; and cellular immunity, mediated by T lymphocytes that eliminate infected or malfunctioning cells, and provide help to other immune responses. Diversity is the hallmark of the adaptive immune systems. Both the B and T lymphocyte-specific receptors for antigen recognition are assembled from variable (V), diversity (D), and joining (J) gene segments early in the lymphocyte development. There are multiple copies of V, D and J segments, and a huge repertoire of T and B cells is generated by the recombination of these segments, reviewed by (Li et al., 2004). Another task faced by the immune system is the tolerance to self, which is handled by continuously removing receptors that react to self-epitopes. Special immunoglobulin molecules (antibodies) mediate the humoral response. As mentioned above, the antibodies are produced by B lymphocytes that bind to antigens by their immunoglobulin receptors, which is a membrane bound form of the antibodies. When the B lymphocytes become activated, they start to secrete the soluble form of this receptor in large amounts. The antibody is Y-shaped, and each of the two branches functions independently and can be recombinantly produced and is then known as Fabs. The highly variable tip of the Fab, which can bind to epitopes is called the paratope and is made up of the so-called complementary determining regions (CDRs). Antibodies can coat the surface of an antigen such as a virus, so that it cannot function or infect cells, reviewed by (Burton, 2002). Antibody-covered viruses or bacteria are easily phagocytosed and destroyed by scavenger cells of the immune system, e.g. the macrophages. Antigenic proteins can be recognized by the antibodies in their native form without any cleavage or interactions with other molecules. Thus the humoral immune response reacts to extracellular pathogens, and the response is crucial in the defense against most pathogens. The cellular arm of the immune system consists of two parts; cytotoxic T lymphocytes (CTL), and helper T lymphocytes (HTLs). CTLs destroy cells that present non-self peptides (epitopes). HTLs are needed for B cells activation and proliferation to produce antibodies against a given antigen. CTLs on the other hand perform surveillance of the host cells, and recognize and kill infected cells, generally explained in (Janeway, 2005). Both CTL and HTL are raised against peptides that are presented to the immune cells by major histocompatibility complex (MHC) molecules, which are the most polymorphic of mammalian proteins. The human versions of MHCs are referred to as the human leucocyte antigen (HLA). The cells of an individual are constantly screened for such peptides by the cellular arm of the immune system. In the MHC class I pathway, class I MHCs presents endogenous antigens to T cells carrying the CD8 receptor (CD8+ T cells). To be presented, a precursor peptide is normally first generated by the large cytosomal protease complex called the proteasome (Loureiroa and Ploegha, 2006). Gen- June 2011! 4

erally, it then binds to the transporter associated with antigen processing (TAP) for translocation into the endoplasmic reticulum (ER), reviewed by (Abele and Tampé, 2004), but some peptides can enter the ER independently of TAP (Anderson et al., 1991; Bacik et al., 1994; Eisenlohr et al., 1992; Larsen et al., 2006). During or after the transport into the ER the peptide must bind to the MHC class I molecule (Stoltze et al., 2000; Zhang et al., 2006) before it can be transported to the cell surface through the golgi system. The most selective step in this pathway is binding of a peptide to the MHC class I molecule as on average only approximately one out of 200 peptides will bind to a given MHC molecule with enough strength to be detected by a CD8+ T cell and induce a CTL response. June 2011! 5

Performance measures in prediction systems As an evaluation of the general quality of a prediction method a measure describing the quality is needed. However, no single measure can capture all qualities of a prediction, and not all types of data and predictions can be reasonably described by the same measure. So to be able to compare different systems, it is often needed to present several measures of quality. Most measures need the data to be classified into two groups, i.e. positives and negatives. The fraction correct predicted or accuracy is the fraction of the total predictions that falls into the correct group. This measure is intuitively easily captured, but has the weakness that if a large fraction of the total evaluation data falls into a single group one will get high performance by just blindly predicting most or even everything to belong to this category. The number of classified (experimentally measured) positives is often designated as actual positives (AP), and the number of negatives, actual negatives (AN), the number of predicted positives (PP), predicted negatives (PN), truly predicted positives (TP), falsely predicted positives (FP), truly predicted negatives (TN), and falsely predicted negatives (FN). Some of the most often used measures are briefly described here. The equations for the mentioned measures are given at the end of the section. The positive predicted value (PPV) is the fraction of the positive predictions that actually falls into the positive class. The sensitivity is the fraction of the AP that is predicted as positives using a given threshold. Sensitivity = TP AP The specificity is the fraction of the AN that is predicted as negatives. Specificity = TN AN The three latter measures are also easily grasped, however they are all dependent on the chosen prediction cutoff classifying the data into positive and negative predictions. A high sensitivity can be obtained by setting your prediction cutoff so that most of your evaluation data will fall into the positive group, but this will then be at the expense of the specificity and the PPV. Which cutoff to use is determined by the purpose of the prediction, i.e. how many verified epitopes is needed versus the resources available for experimental validation. A plot of the sensitivity against the false positive rate (1-specificity) is called a receiver operating characteristic (ROC) curve (Swets, 1988). Such a plot can be a help to set the best prediction cutoff. One of the best ways of measuring the predictive power of a method is to calculate the June 2011! 6

area under the ROC curve (AUC) since this is a threshold-independent measure. Another robust measure is the Pearson s correlation coefficient (PCC), which is a measure of how well the prediction scores correlate with the actual value on a linear scale. In situations where the correlation is not necessarily linear, the Spearman's rank correlation coefficient (SRC) is more appropriate. In this measure each prediction is ranked on the basis of the prediction score and the PCC is calculated on the basis of this rank rather than the prediction score. The SRC, like the AUC, is a threshold-independent measure of how well the predictor ranks the data when compared with the actual ranking. When comparing different methods, the threshold-independent measures are to be preferred. Otherwise a threshold has to be set under the same assumptions for all predictors. As an example one can estimate the specificity for each predictor by setting the threshold for the given predictor to a value where the sensitivity will be 0.5 (i.e. half of the total available positives is over the threshold), or estimate the sensitivity at a threshold where the specificity will be 0.8 (i.e. 80% of the AN are predicted as negatives). June 2011! 7

B cell epitope identification and prediction Adapted from (Lundegaard et al., 2007) The humoral response is mediated by antibodies (immunoglobulin molecules), which are produced by B lymphocytes. B lymphocytes can bind to antigens by their immunoglobulin receptors. When they become activated, they start to secrete a soluble form of this receptor (antibodies) in large amounts. Overall the antibody is Y-shaped (or T-shaped) with two antigen binding (Fab) fragments at the top left and right of the antibody and the constant fragment (Fc) below. The linker regions connecting the constant with the variable regions allow great flexibility in the relative orientation of the domains. The highly variable tip of a Fab fragment that can bind to epitopes is called a paratope and is made up of the so-called complementary determining regions (CDRs) in the antibody sequence. The binding of antibodies to, e.g., a virus, can coat the surface so that it cannot infect cells (Burton, 2002). Viruses or bacteria covered by antibodies are also more easily taken up (phagocytosed) and destroyed by scavenger cells of the immune system such as macrophages. B-cell epitope prediction is a highly challenging field due to the fact that the vast majority of antibodies raised against a specific protein interact with discontinuous fragments (van Regenmortel, 1996). The prediction of continuous, or linear, epitopes, however, is a somewhat simpler problem, and may be still useful for synthetic vaccines or as diagnostic tools (Regenmortel and Muller, 1999). Moreover, the determination of continuous epitopes can be integrated into determination of discontinuous epitopes, as these often contain linear stretches (Hopp, 1994). In the early 1980s, Hopp and Woods (Hopp and Woods, 1981; Hopp and Woods, 1983) developed the first linear epitope prediction method. This method takes the assumption that the regions of proteins that have a high degree of exposure to solvent contain the antigenic determinants. According to the hydrophilicity scale generated by Levitt (Levitt, 1976), Hopp and Woods (Hopp and Woods, 1981) assigned the hydrophilicity propensity to each amino acid in a sequence and looked at groups of six residues. This gave promising results and a number of methods have since been developed with the aim of predicting linear epitopes using a combination of different amino acid propensities (Alix, 1999; Debelle et al., 1992; Jameson and Wolf, 1988; Maksyutov and Zagrebelnaya, 1993; Odorico and Pellequer, 2003; Parker et al., 1986). In 1993, Pellequer et al. (Pellequer et al., 1993) proposed an evaluation set containing 85 continuous epitopes in 14 proteins and found that the method based on turn propensity (i.e. the propensity of an amino acid to occur within a turn structure) had the highest sensitivity using this set. Seventy percent of the residues predicted to be in epitopes by this method were actually part of epitopes. The sensitivity for methods based on other propensities was in the range of June 2011! 8

36 61% (Pellequer et al., 1991). Analyzing the epitope regions in the Pellequer dataset reveals that almost all the hydrophobic amino acids are underrepresented, supporting the assumption that linear B-cell epitopes will occur in hydrophilic regions of the proteins. An extensive study of linear B-cell epitope prediction methods was published by Blythe and Flower (Blythe and Flower, 2005). To test how well peaks in single amino acid scale propensity profiles are (significantly) associated with known linear epitope locations, 484 amino acid propensities from the AAindex database (http://www.genome.ad.jp) (Kawashima and Kanehisa, 2000) were used. As test set they used 50 epitope-mapped proteins defined by polyclonal antibodies, which were the best non-redundant test set available. Blythe and Flower (Blythe and Flower, 2005) found, however, that even the predictions based on the most accurate amino acid scales were only marginally better than random, suggesting that more sophisticated approaches is needed to predict the linear epitopes. BepiPred (Larsen et al., 2006), an algorithm that combines scores from the Parker hydrophilicity scale (Parker et al., 1986) and a PSSM trained on linear epitopes, shows a small, but significant, increase in AUC over earlier scalebased methods. The sequence parametrizer algorithm (Sollner and Mayer, 2006; Sollner, 2006), along with its associated machine-learning methods uses the common single amino acid propensity scales, but also incorporates neighborhood parameters reflecting the probability that a given stretch of amino acids exists within a predefined proximity of a specific amino acid residue. Training and testing on epitope sequences pulled from a high-quality proprietary database, as well as several publicly accessible databases, yields a degree of accuracy that is greatly increased over single-parameter methods. Different experimental techniques can be used to define conformational epitopes. Probably the most accurate, and easily defined is using the solved structures of antibody antigen complexes (Fleury et al., 2000; Mirza et al., 2000). The amount of this kind of data is unfortunately still scarce, compared to linear epitopes. Furthermore, very few antigens have been studied in a way where all possible epitopes on a given antigen has been identified. Unidentified epitopes within the dataset will lower the apparent performance of an accurate prediction method by increasing the apparent false positive rate. The simplest way to predict the possible epitopes in a protein of known 3D structure is to use the knowledge of surface accessibility (Novotny et al., 1986; Thornton et al., 1986). Two newer methods using protein structure and surface exposure for prediction of B-cell epitopes have been developed. The CEP method (Kulkarni-Kale et al., 2005) calculates the relative accessible surface area for each residue in the structure. Then it is determined which parts of the protein that are exposed enough to be antigenic determinants. Regions that are distant in the primary sequence, but close in three-dimensional space are considered as one epitope. The tool was tested on a dataset of 63 antigen antibody complexes and the algorithm correctly identified 76% of the epitope residues. DiscoTope (Haste Andersen et al., 2006) uses a combination of amino acid statistics, spatial information and surface exposure. It is trained on a compiled June 2011! 9

dataset of discontinuous epitopes from 76 X-ray structures of antibody antigen protein complexes. This method outperforms methods that predict linear epitopes but needs an experimentally solved structure. A more recent approach, PEPITO (now BEpro). (Sweredoski and Baldi, 2008) uses both contact numbers (i.e. the number of C atoms within a certain distance threshold) and an amino-acid propensity scale. BEpro, attempts to overcome some of the limitations of previous predictors by incorporating an aminoacid propensity scale along with side chain orientation and solvent accessibility information using half sphere exposure values (Hamelryck, 2005). BEpro uses propensity scales and half sphere exposure values at multiple distance thresholds from the target residue, which seems to increase the robustness and reaches the to date best prediction performances (Sweredoski and Baldi, 2008). Recently a workshop was held on the subject of B-cell epitope predictions attended by a broad range of the current method developers. The workshop resulted in a published review containing conclusions on the present common ground, and suggestions for the future especially concerning coordination and evaluation (Greenbaum et al., 2007). Different ways of measuring the accuracy of B-cell epitope predictions have been suggested (Hopp, 1994; van Regenmortel and Pellequer, 1994). Pellequer suggested using the specificity as a measure of accuracy, while Hopp suggested using the PPV, but, as described earlier, neither measure will alone give a good description of the performance. In accordance to this the recent workshop concluded that the AUC measure is to be preferred (Greenbaum et al., 2007). Another issue is whether to make the statistics on a per-residue or on a per-epitope basis. However, as the latter have the additional complications of defining how much of an epitope that must be included in a prediction to be considered correct, and how much extra included residues is allowed, the per residue measure is to be preferred. June 2011! 10

Prediction of MHC binding. MHC CLASS I BINDING PREDICTIONS The identification of T cell epitopes in a set of peptides can be facilitated by predicting their capability to bind MHC molecules, as MHC binding of a peptide is a necessary requirement for its recognition by T cells. The majority of peptides binding to MHC class I molecules have a length of 8 10 amino acids. Position 2 and the C-terminal position of the peptide are typically most important for the binding, and these positions are referred to as anchor positions (Rammensee et al., 1999). Only few different amino acids are tolerated for MHC binding at the anchor positions. For some alleles, the binding motifs further have auxiliary anchor positions. For example, peptides binding to the human HLA-A*0101 allele also have positions 3 as anchor (Kondo et al., 1997; Kubo et al., 1994; Rammensee et al., 1999). The importance of anchor positions for peptide binding and the allele-specific amino acid preference at the anchor positions was first described by (Falk et al., 1990). The discovery of such allele-specific motifs led to the development of the first algorithms (Pamer et al., 1991; Rotzschke et al., 1991; Sette et al., 1989), which essentially determined if a peptide did or did not match the binding motif of the MHC molecule. Matrix prediction algorithms go beyond the match/mismatch classification of a motif prediction by assigning a score to each possible amino acid at each position in a peptide. It is assumed that the amino acids at each position along the peptide sequence contribute a given binding energy, which can independently be added up to yield the overall binding energy of the peptide Meister 1995; Parker 1994; Stryhn 1996}. This type of approach is used by the EpiMatrix (Schafer et al., 1998), BIMAS (Parker et al., 1994), SYFPEITHI (Rammensee et al., 1997), RANKPEP (Reche et al., 2002), Gibbs sampler (Nielsen et al., 2004), SMM (Peters and Sette, 2005) and ARB methods (Bui et al., 2005). These methods differ in how they determine the matrix coefficients. Some are trained by statistical methods that analyze how often a given amino acid is seen at a given position in binding versus non-binding peptides. Matrix coefficients can also be determined by a machine learning procedure that aims at finding coefficients that best explains the observed binding data. This can be done by interpreting the matrix scores for a peptide as predicted binding affinities, and minimizing the distance between predicted and measured values. This is the approach utilized by SMM, which is presently the best performing matrix method (Lin et al., 2008; Peters et al., 2006). Matrix-based methods cannot take correlated effects into account, i.e. where the contribution to the binding affinity of an amino acid at one position depends on amino acids at other positions in the peptide. Higher order methods like Artificial Neural networks (ANNs) and Support Vector Machines (SVMs) are ideally suited to take such correlations into account (Adams and Koziol, 1995; Brusic et al., 1994; Buus et al., 2003; Gulukota et al., 1997; Nielsen et al., 2003). These methods can be trained with data either in the format of binder/non-binder clas- June 2011! 11

sification, or as real affinity data. In an ANN, information about the amino acid sequence is converted into a numerical representation. Twenty input neurons representing the 20 different amino acids typically found in proteins are normally used for each position in the peptide. The neuron defined to represent the amino acid found on that position in the given peptide is assigned the value 1 and the other 19 neurons are assigned the value zero. An alternative approach to represent amino acids as vectors is to take the 20 scores from a substitution matrices such as PAM or BLOSUM (Henikoff and Henikoff, 1992; Henikoff and Henikoff, 1993; Jones et al., 1992). The parameters of the neural network are optimized to minimize the difference between the predicted output and the expected output which can be 0 or 1 in the case of a binder/non-binder classification, or the actual affinity if training is done on real value affinity data. In the ANNs typically used in bioinformatics, a so called hidden layer is inserted between the input and the output layer. The input to the hidden layer is a weighted sum of the output from the input layer, and the input to the final output layer is a weighted sum of the outputs from the hidden layer. The parameters of an ANN are the weights used to calculate these sums (Baldi and Brunak, 2001). The NetMHC ANN implementation based on (Nielsen et al., 2003) and (Buus et al., 2003) was the most successful higher order method in two recent comparisons (Lin et al., 2008; Peters et al., 2006). Beside the sequence based machine-learning techniques, empirical molecular force field modeling techniques (Logean et al., 2001) and 3D Quantitative Structure Activity Relationship (3D- QSAR) (Doytchinova and Flower, 2002; Zhihua et al., 2004) analysis have been applied to predict peptide:mhc binding. These approaches are based on small molecule docking procedures, taking advantage of 3d structural data for MHC molecules. They directly calculate energy terms for peptide-mhc interactions. As these calculations are very resource intensive, none of them is directly available online. This makes it difficult to compare their performance with the peptide sequence binding data based models. However in-house benchmark experiments performed by our group and the group hosting the IEDB database strongly suggest that the structure-based prediction approach is not suitable for MHC.peptide binding predictions and that the predictive performance of structure-based method is significantly below that of the data-driven methods (unpublished data). To summarize: of the prediction methods publicly available online, the neural network based NetMHC performs best, followed by the matrix based SMM (Lin et al., 2008; Peters et al., 2006). Factors apart from performance that justify using SMM are a) that matrix predictions are inherently easier to interpret compared to an ANN which is essentially a black box, and b) that the SMM training and prediction code is freely available (Peters and Sette, 2005). The implementation of online consensus prediction tools that combine different prediction methods (Moutaftsi et al., 2006) is currently in progress at the IEDB site. The prediction performance of a given method is largely dependent on the size of the available training data set. The performance differences due to training set size are much larger than June 2011! 12

those due to the choice of prediction method. (Peters et al., 2006; Yu et al., 2002). Unfortunately, for MHC molecules with little binding data available, it is not possible to assess how good the current binding predictions are. The sizes of training sets and their corresponding prediction performance are listed in (Peters et al., 2006). The output of any prediction method will be a list of peptides associated with a binding affinity prediction. There are two approaches to select a reasonable set of T cell epitope candidates from this peptide set. First, one can select based on the actual affinities, and set a cutoff of 50, 500 or 5000 nm which are generally considered sensible for high, intermediate and low affinity. Alternatively, one can rank the predicted score, and set a cutoff of the top 0.1%, 1% or 10% of the predicted peptides. The amount of resources available to test the predicted candidates will often in practice determine the cutoffs chosen. MHC CLASS II BINDING PREDICTIONS Unlike MHC class I molecules, the binding cleft of MHC class II molecules is open-ended, which allows for the bound peptide to have significant overhangs at both ends. As a result MHC class II binding peptides have a broader length distribution, and the typical binding peptides have a length of eleven to twenty residues. However, the majority of the binding interaction is mediated by a 9 amino acid residue core sequence of the bound peptide. This complicates binding predictions, as the identification of the correct alignment of the binding core is a crucial part of identifying the MHC class II binding motif (Nielsen et al., 2004; Wang et al., 2008). Identifying this core is difficult, as the MHC class II binding motifs have relatively weak and often degenerate sequence signals. The majority of MHC class II binding prediction methods are based on the assumption that the peptide MHC binding affinity is determined solely from a nine amino acids in binding core motif. This is clearly an oversimplification, since it is known that peptide flanking residues on both sides of the binding core may contribute to the binding affinity and stability (Godkin et al., 2001). Some methods for MHC class II binding have attempted to include peptide flanking residues indirectly, in terms of the peptide length, in the prediction of binding affinities (Chang et al., 2006). Later it was demonstrated that including peptide flanking residues in MHC class II predictions does improve the prediction accuracy (Nielsen et al., 2007). Most of the methods for MHC class II binding predictions have been trained and evaluated on very limited datasets covering only a single or a few different MHC class II alleles, making it very difficult to compare the different performance values and establish generality of the methods. A recent large scale comparison of prediction methods for MHC class II binding (Wang et al., 2008) covered 14 HLA-DR (human MHC) and two mouse class II alleles. The best performing methods are described in (Nielsen et al., 2007) and (Sturniolo et al., 1999). June 2011! 13

MHC polymorphism, supertype clustering, and panpredictions. One potential drawback of epitope-based vaccines is that the HLA genes are extremely polymorphic. Each of the corresponding molecules has a different specificity. If a vaccine needs to contain a unique peptide for each of these molecules it will need to comprise hundreds of peptides. One way to counter this is to select sets of a few HLA molecules that together have a broad distribution in the human population. A factor that may reduce the number of epitopes necessary to include in a vaccine is that many of the different HLA molecules have similar specificities. The different HLA molecules have been grouped together in so-called supertypes (del Guercio, 1995; Sette and Sidney, 1999; Sidney et al., 1995). This means ideally that if a peptide can bind to one allele within a supertype, it can bind to all alleles within that supertype. In practice, however, only some peptides that bind to one allele in a supertype will bind to all alleles within that supertype. A number of different criteria have been used to define these supertypes, including structural similarities, shared peptide binding motifs, identification of cross reacting peptides, and ability to generate methods that can predict cross binding peptides (Sidney et al., 1996). For HLA class I molecules, (Sette and Sidney, 1999) defined nine supertypes (A1, A2, A3, A24, B7, B27, B44, B58, B62), which were reported to cover most of the HLA-A and -B polymorphism. They argued that the different alleles within each of these supertypes have almost identical peptide binding specificity. They found that while the frequencies at which the different alleles were found in different ethnic groups were very different, the frequencies of the supertypes were quite constant. Assuming Hardy-Weinberg equilibrium (i.e., infinitely large, random mating populations free from outside evolutionary forces), they found that >99.6% of persons in all ethnic groups surveyed possessed at least one allele within at least one of these supertypes. They also showed that the smaller collections of supertypes (A2, A3, B7) and (A1, A2, A3, B7, A24, B44) covered in the range of 83.0 88.5% and 98.1 100.0% of persons in different ethnic groups respectively. Three alleles, A29, B8 and B46, were found to be outliers with a different binding specificity than any of the supertypes. These may define supertypes themselves when the specificity of more HLA molecules is known. (Lund et al., 2004) presented a measure for the difference in the specificities of different HLA molecules and used it to cluster HLA molecules. In order to define a clustering of HLA molecules, we first calculated difference in specificities (the distance) between each pair of HLA molecules. The distance d ij between two HLA molecules (i, j) was calculated as the sum over each position in the two motifs of one minus the normalized vector products of the amino acids frequency vectors (Lyngsø et al., 1999): d ji = p 1 pi p p j p p i p p j p June 2011! 14

The result of this analysis was used to revise the class I supertypes and add additional 3 supertypes (A26, B8, and B39). Recently (Sidney et al., 2008) analyzed the complete list of alleles available through IMGT (release 2.9) using a simple approach largely based on compilation of published motifs, binding data, and analyses of shared repertoires of binding peptides, in conjunction with clustering based on the primary sequence of the B and F peptide binding pockets. On this basis they updated the supertype assignments, with new assignments for about 700 different HLA-A and -B alleles. (Gulukota and DeLisi, 1996) compiled lists with three, four, and five alleles that give the maximal coverage of different ethnic groups. One complication they had to deal with is that HLA alleles are in linkage disequilibrium; i.e., that the joint probability of an allelic pair may not be equal to the product of their individual frequencies [P (a)p (b) = P (ab)]. This means that it is not necessarily optimal to choose the alleles with the highest individual frequencies. Moreover, (Gulukota and DeLisi, 1996) find that populations like the Japanese, Chinese and Thais can be covered by fewer alleles than the North American Black population, which turns out to be very diverse. Thus, different alleles should be targeted in order to make vaccines for different ethnic groups or geographic regions. This observation combined with the fact that a binder to a given supertype representative does not necessary bind to a majority of the supertype representatives strongly reduce the assumed population coverage of a given set of epitopes. Lately, the number of available binding data has increased significantly both in terms of amount of different peptides and the number of different MHC alleles. This fact has in itself influenced the MHC prediction systems that now cover a large number of different HLA alleles (Peters et al., 2006; Zhang et al., 2008). More interestingly, however, it has enabled the development of new so called pan-specific algorithms that can predict peptide binding to HLA alleles for which limited or even no experimental data are available (Jacob and Vert, 2007; Jojic et al., 2006; Nielsen et al., 2007; Nielsen et al., 2008; Zhang et al., 2005). A common feature of these pan-specific methods is that they go beyond the conventional single allele approach, and take both the peptide sequence and the MHC contact environment into account. Large-scale benchmark evaluation of the pan-specific methods has demonstrated their power not only for predicting peptide-binding affinities to previously un-characterized MHC molecules but also to boost the predictive performance for characterized alleles by leveraging information from neighboring MHC molecules (Zhang et al., 2009). June 2011! 15

Prediction of the CTL presenting pathway and integration of prediction systems The below text is adapted from (Lundegaard et al., 2007). MHC class I binding is a necessary but not sufficient requirement for a peptide to be recognized by CD8 T cells. Another major requirement is the efficient generation of the peptide from its source antigen, which includes proteasomal cleavage in the cytosol and transport into the Endoplasmic reticulum (ER) by the Transporter associated with Antigen Processing (TAP). Successful prediction of these processes should therefore improve the prediction of naturally processed MHC ligands and T cell epitopes. The proteasome has a complex enzymatic specificity, which makes the prediction of its cleavages challenging. It comprises multiple catalytically active sites, each with a distinct specificity (Kloetzel, 2004). It is thus expected that the accuracy for prediction of proteasomal activity will be relatively low compared to that of methods for MHC peptide binding. A further complication is that two versions of the proteasome exist. The normal proteasome is called the constitutively expressed and participates in the constant turnover of proteins within cells. When a cell is alerted that an infection is present some of the newly synthesized so called immunoproteasome have different subunits, which are thought to cleave fragments that are more suited for producing epitopes. Several prediction algorithms for proteasomal cleavage are available. FragPredict, which is publicly available as a part of MAPPP service (http://www.mpiib-berlin.mpg.de/mappp/), consists of two algorithms. The first algorithm uses a statistical analysis of cleavage-enhancing and -inhibiting amino acid motifs to predict potential proteasomal cleavage sites (Holzhutter et al., 1999). The second algorithm, which uses the results of the first algorithm as an input, predicts which fragments are most likely to be generated. This model takes the time-dependent degradation into account based on a kinetic model of the 20S proteasome (Holzhutter and Kloetzel, 2000). PAProC (http://www.paproc.de) is a method for predicting cleavages by human as well as wild type and mutant yeast proteasomes. The influence of different amino acids at different positions is determined by using a stochastic hill-climbing algorithm (Kuttler et al., 2000) based on the experimentally in vitro verified cleavage and non-cleavage sites (Nussbaum et al., 2001). (Tenzer et al., 2004) have also published a weight matrix based method for prediction of both constitutive- and immunoproteasomal cleavage specificity. Both matrices are trained on in the very limited in vitro proteasomal digest data available. The NetChop (Kesmir et al., 2002) method uses the C terminus of naturally processed MHC class I ligands, which are thought to be created by proteasomal cleavage. Since some of these June 2011! 16

ligands are generated by the immunoproteasome, and some by the constitutive proteasome, such a method should predict the combined specificity of both forms of proteasomes. In 2003, NetChop-2.0 was evaluated to be the best-performing predictor on an independent evaluation set (Saxová et al., 2003). The SVM based Pcleavage proteasomal cleavage predictor, which is available online, has a published performance comparable to that of NetChop-2.0 (Bhasin and Raghava, 2005). An update of the NetChop method to version 3.0 (Nielsen et al., 2005) consists of a combination of several ANNs, each trained using a different sequence-encoding scheme of the data. NetChop 3.0 has an increase in the prediction sensitivity as compared to NetChop-2.0, without lowering the specificity, and is probably still the best predictor of proteasomal cleavage. Relatively few methods have been developed to predict the specificity of TAP. (Daniel et al., 1998) have developed ANNs using 9-mer peptides for which the TAP affinity was determined experimentally. Surprisingly, they found that some MHC alleles such as alleles blonging to the HLA-A*02 family have some natural ligands with very low TAP affinities. It has been shown that TAP ligands can be trimmed in the ER before binding to MHC molecules (Fruci et al., 2001), and the TAP ligand thus often will be a precursor to the MHC binding peptide. (Peters et al., 2003) used an SMM based matrix to predict TAP affinity of peptides. This method can be generalized to peptides that are longer than 9 amino acids by assuming that only the first three positions in the N-terminal and the last position at the C-terminal influence TAP binding. The method had been tested on a large data set and been shown to have a high accuracy. The selective influence of TAP binding in the epitope presentation pathway is much lower than MHC binding and the AUC value when this method is used alone as an epitope predictor of 0.79 is thus significantly lower than most MHC-binding prediction methods. With increasing numbers of TAP ligands being publically available (e.g. Jen-Pep database, http://www.jenner.ac.uk/antijen/aj_tap.htm) (Blythe et al., 2002), it is likely that more accurate TAP predictions will be available in the future. I NTEGRATED T-CELL EPITOPE PREDICTIONS Reliable predictions of immunogenic peptides can reduce the experimental effort needed to identify new epitopes. Although predictions of MHC binding alone can be used to rank the possible epitopes very accurately, even better predictions should be possible if molecular events other than peptide:mhc binding in the antigen presentation pathway are modeled. Accordingly, many attempts have been made to predict the outcome of the steps involved in antigen presentation: MAPP (Hakenberg et al., 2003), NetCTL (Larsen et al., 2005), MHCpathway (Tenzer et al., 2005), EpiJen (Doytchinova et al., 2006), and WAPP (Donnes and Kohlbacher, 2005). All of these methods attempt to predict antigen presentation by integrating peptide:mhc binding predictions with one or more of the other events involved in the antigen presentation pathway. June 2011! 17

To benchmark these prediction methods, a set of verified epitopes can be used as the positive data set. On the other hand, negative data set (peptides that cannot induce an immunologic response) is difficult to define because it is impossible to guarantee that a peptide will never be an epitope in any persons with a given HLA haplotype. Instead, epitopes from well-studied pathogens (e. g. HIV) are often used as the positive set, and all other peptides from the genome of the same pathogen that have never been shown to be an epitope are assumed negative as they have a very low probability of being an epitope. Running a large-scale benchmark calculation comparing the predictive performance of several publicly available MHC-I presentation prediction methods evaluated on a large set of known HIV epitopes (http://www.cbs.dtu.dk/suppl/immunology/ctl-1.2/hiv_dataset) reveals that the NetCTL and MHCpathway methods have the highest predictive performance with >75% of the epitopes being within the top 5% peptides with the highest prediction scores (Larsen et al., 2007). June 2011! 18

Epitope discovery and rational vaccine design. To exemplify the use of bioinformatics for epitope discovery in specific cases we here present a number of published works concerning selected pathogens of special concern. SARS The following text have been adapted from (Sylvester-Hvid et al., 2004) Severe Acute Respiratory Syndrome (SARS) during 2003 infected more than 8400 patients in over 30 countries and caused more than 800 deaths. Coordinated by the WHO, classical measures including case detection, isolation, infection control, contact tracing, and follow-up surveillance successfully contained the disease. However, ideally, the SARS-coronavirus (SARS- CoV) should be possible detect fast and contained by vaccination programs. To develop both vaccines and diagnostic tools in the timeframe of a new aggressive and previously unknown pathogen, the use of bioinformatical tools to shortcut and speed up experimental will be absolutely neccesary. The SARS-CoV infects epithelial cells in the respiratory tract causing interstitial pneumonia (Holmes, 2003). One would therefore expect that an effective vaccine should induce mucosal immunity such as that effected by secretory immunoglobulin A (IgA), which specifically prevents an infectious agent from penetrating the mucosal epithelium, and by cytotoxic T lymphocytes (CTLs), which specifically eradicate infected cells (Eriksson and Holmgren, 2002). IgA responses are generally considered the major protective mechanism; however, there are examples of CTLs, not antibodies, being responsible for early control of mucosal infection (Eriksson and Holmgren, 2002). Particularly noteworthy, this is the case for the infectious bronchitis virus of chicks, a prototype of the Coronaviridae family, where primary effector CD8+ CTLs play a critical role in the elimination of virus during acute infection and subsequent control of the infection (Collisson et al., 2000; Pei et al., 2003; Seo and Collisson, 1997; Seo and Collisson, 1998; Seo et al., 2000). The complete SARS genome/proteome was obtained from Gen-Bank (NC004718) and virtually digested into all 9862 unique nonamer peptides (Marra et al., 2003). Thus, close to 10,000 binding predictions were made for each of the nine HLA supertypes. Artificial neural networks (ANNs) were used to predict the binding affinity quantitatively when the corresponding data were available [e.g. for A*0201 (Buus et al., 2003; Nielsen et al., 2003)]. The performance of the ANNs is high, as the correlation coefficient between predicted and measured binding is 0.85. The remaining HLA bindings were predicted using weight matrices derived from Gibbs sampling sequence-weighting methods with pseudocount correction for low counts as well as differential position-specific anchor weighting (Nielsen et al., 2004). These weight matrices were calculated from available nonamer data from the SYFPEITHI and MHCPEP databases June 2011! 19

with the peptides clustered into the nine supertypes (A1, A2, A3, A24, B7, B27, B44, B58, and B62). The positive predictive value of the matrix-driven prediction has been found to be around 66%, whereas the negative predictive value has been found to be around 97% (Lamberth et al. unpublished observation). Proteasomal processing was predicted using NETCHOP 2.0 (Kesmir et al., 2002). NETCHOP 2.0 has been found to be superior to other proteasomal prediction algorithms (Saxová et al., 2003). Peptides with a NETCHOP 2.0 score below 0.5 (i.e. poorly predicted proteasomal processing) were excluded from further analysis. Finally, we excluded all peptides that did not represent epitopes conserved in all SARS isolates. Figure 1 shows a representative example for a member of the HLA-A3 supertype, the HLA-A*1101 (this haplotype is particularly common in Southeast Asia). The selected peptides were biochemically tested for binding and more than 70% of the selected peptides turned out to actually bind to the specific HLA molecule. These peptides were never tested for immunogenicity because of lack of patient material. I NFLUENZA As an example the use of combined knowledge and bioinformatics we have adapted the published results of (Wang et al., 2007): Influenza is a highly contagious, airborne respiratory tract infection associated with a significant disease burden. The annual mild influenza epidemics caused by antigenic drift of the virus affects 10 20% of the world s population with up to 5 million cases of serious illness and 500,000 deaths (http://www.who.int/vaccineresearch/diseases/ari/en). At various intervals, new influenza subtypes emerge against which no immunity exists in the human population and these may cause global pandemics with an even higher disease toll. The current outbreak of a new influenza subtype A (H5N1), which can be directly, although at this time rarely, transmitted from birds to humans (Chan, 2002), is an example of a potential pandemic flu threat, which has currently reached phase 3 of 6 in thewhodelineation of a flu pandemic (http://www.who.int/csr/disease/avian influenza/phase/en/index.html). It is highly pathogenic in birds, and when transmitted to humans it carries a mortality of over 50% (Beigel et al., 2005). It is the largest and most severe influenza epidemic ever registered among birds; since 2003, it has spread rapidly to poultry in many countries It is known that CD8+ T cell responses also play a major role in the control of primary influenza virus infection (Bender et al., 1992; McMichael et al., 1983). In mice, CTLs against conserved epitopes contribute to protective immunity against influenza viruses of various subtypes (Kuwano et al., 1988; Ulmer et al., 1993). Identification of CTL epitopes, especially conserved epitopes shared by multiple viral strains, might therefore be a robust vaccine strategy against emerging influenza epidemics. The NetCTL algorithm were used to select among highly conserved epitopes encompassing 12 different HLA class I supertypes with >99% population coverage, and searched for conserved epitopes from available influenza A H1N1 viral protein sequences. Peptides corresponding to June 2011! 20

167 predicted peptide HLA-I interactions were synthesized, tested for peptide HLA-I interactions in a biochemical assay and for influenza-specific, HLA-I-restricted CTL responses in an IFN- ELISPOT assay. Eighty-nine peptides could be confirmed as HLA-I binders, and 13 could be confirmed as CTL targets. The 13 epitopes, were highly conserved among human influenza A pathogens, and 12 of these epitopes were present in the emerging H5N1 bird flu isolates (Wang et al., 2007). HIV The below text is an abstract of (Lund et al., 2008). The immense genetic variation of the host cellular immune system, HLA class I, forms another major obstacle for CTL-based vaccine approaches. Few human beings will share the same set of HLA alleles and most humans will react to a pathogen infection in a unique manner. Recently, combining HLA class I molecules with significant overlap among the peptide-binding specificities has allowed a functional classification of HLA supertypes. The HLA supertypes together cover most of the HLA-A and HLA-B alleles in the population, thus allowing for large cohort studies investigating the efficacy of specific immune responses in relation to disease progression. The design of a vaccine based on such knowledge would increase the probability of eliciting broad CTL responses. A method to optimally cover a large majority of a highly diverse population were developed and used for selection of potential HIV epitopes (Perez et al., 2008). The developed algorithm, EpiSelect, were used to select a number of epitopes, which together constitute a broad coverage of the HIV-1 strains in subjects of diverse ethnical background. The EpiSelect algorithm aims at selecting a given number of predicted epitopes so that most strains are covered with the largest number of epitopes, while maximizing the coverage of the viral strain with the smallest number of epitopes. The NetCTL method (Larsen et al., 2005) was used to select a set of potential HIV epitopes restricted to the seven major HLA supertypes A1, A2, A3, A24, B7, B44, and B58. The CTL epitopes were selected from a database of HIV-1 genomic strains covering the subtypes A (A1 and A2), B, C, D, and CRF01_AE. EpiSelect was then applied to select, for each HLA supertype, a set of potential epitopes with highest possible genomic coverage. A total of 174 predicted HLA class I supertype restricted epitopes were selected from the Gag, Pol, Env, Nef, and Tat regions with maximal genomic coverage of the major HIV-1 subtypes (Figure 1) and combined with 10 experimentally verified HLA class I restricted epitopes. June 2011! 21