Prediction of In Vivo Modification Sites of Proteins from Their Primary Structures

J. Biochem. 104, (1988)

Kenta Nakai and Minoru Kanehisa¹
Institute for Chemical Research, Kyoto University, Uji, Kyoto 611

Received for publication, June 13, 1988

In order to make better use of the information contained in rapidly expanding amino acid sequence data, a new method to predict various modification sites of proteins from their primary structures is presented. It is also applicable to the prediction of other functional sites in proteins. Here we show the examples of N-glycosylation and serine/threonine phosphorylation sites. The method is essentially an elaboration of consensus sequence pattern matching based on stepwise discriminant analysis. The amino acids occurring near a potential modification site are represented by six numerical values which reflect various properties of amino acids. Longer-range effects around these sites are also considered. The stepwise procedure enabled us to select effective features for discrimination automatically. A computer program implementing our method first identifies potential modification sites by a sequence pattern, NX(S/T) for N-glycosylation or (S/T) for phosphorylation, and then decides by discriminant analysis whether a potential site is likely to be a true modification site. The prediction accuracy in the second step of discrimination was about 60% for glycosylation sites and about 80% for phosphorylation sites.

The rapid growth of the quantity of DNA sequence data is one of the most remarkable events in biosciences today. The size of the GenBank database is doubling almost every year, and this is likely to be accelerated further as the sequencing of large genomes proceeds. Although various physiological phenomena in living systems are ultimately encoded in genomic DNA, sequencing of genes does not necessarily mean their clarification. In this respect, what we can learn from sequence data is still very limited.
One of the established methods for obtaining insights into higher structural and functional properties of proteins from sequence data is to search for homologous sequences in the databases. In many cases, however, strong homologies are absent and the significance of weak homologies is difficult to assess. It is also customary to perform secondary structure predictions from the amino acid sequence, but their accuracies are too low to be useful in practice. Despite these limited success rates, computational methods are frequently used and there seems to be a great demand for new sequence analysis methods that compensate for the shortcomings of existing ones.

In this paper we propose a new approach to the prediction of functional sites in a protein from its primary structure. Suppose we have to locate, in a newly determined sequence, a DNA binding site, a calcium binding site, and an in vivo modification site by sequence analysis. The ordinary way to do this is to use pattern matching against the so-called consensus sequence. For instance, when there is an N-X-(S or T) pattern in the sequence (here X denotes any amino acid), the site is called a potential glycosylation site (1). The consensus sequence of Tufty and Kretsinger is also well known with regard to calcium binding sites (2). However, the consensus is basically a short, weak sequence pattern, and when the whole amino acid sequence database is searched, there are usually many additional sites matching only by chance.

Our approach is similar to the consensus matching method in the sense that it is also based on the identification of common features among related sites. However, it contains three new elements which will overcome the problems inherent in the simple pattern matching approach. They are: (i) preparation of the data set, (ii) numerical representation of amino acid sequences, and (iii) objective selection of features.

¹ To whom correspondence should be addressed. Abbreviation: PKA, cAMP-dependent protein kinase.
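The consensus matching step described above can be illustrated with a short sketch. It is not the authors' program; the function names are hypothetical, and X is taken to be any residue, as in the text. It only locates potential sites, which is the first step before any discrimination between true and false sites.

```python
import re

# A lookahead is used so that overlapping patterns (e.g. "NNTS") are all reported.
NGLYC_PATTERN = re.compile(r"(?=N[A-Z][ST])")


def potential_nglyc_sites(seq):
    """Return 1-based positions of Asn residues that start an N-X-(S/T) pattern."""
    return [m.start() + 1 for m in NGLYC_PATTERN.finditer(seq.upper())]


def potential_phos_sites(seq):
    """Return 1-based positions of every Ser or Thr (potential phosphorylation sites)."""
    return [i + 1 for i, aa in enumerate(seq.upper()) if aa in "ST"]
```

On a made-up peptide such as `"MNGSANNTSK"`, the first function reports the asparagines at positions 2, 6, and 7. Most matches found this way in a real database are expected to be chance hits, which is the weakness the paper sets out to address.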
First, the data set is prepared with practical applications in mind. Anyone who wishes to locate an N-glycosylation site will no doubt use the NX(S/T) consensus first. The question is therefore: is it possible to discriminate true glycosylation sites from false analogues merely having the NX(S/T) pattern? The consensus sequence is derived only from a set of true sites; there is no control group of false sites. In contrast, our approach uses data sets of both true and false sites. This is the most critical difference in our approach. By comparing the patterns of true and false sites, it is possible to establish a practical criterion for discrimination.

Second, common features in the sequence may be too weak to be represented by a limited number of characters. This is another drawback of consensus pattern matching. Because the amino acids are represented by characters, they can be either matched or not, although it is possible to include multiple matches. In reality, however, each amino acid has many characteristics which cannot be represented by binary digits. There have been a number of reports on so-called amino acid indices which reflect various aspects of amino acid residues, as summarized in our previous analysis (3). We introduce the numerical representation of

amino acids selected from our database of over 200 published amino acid indices. The numerical representation is also convenient for incorporating environmental factors, namely, long-range effects of residues surrounding the consensus. Thus, we should be able to combine weak signals within and outside of the consensus to improve the accuracy of prediction.

Third, the derivation of a consensus sequence is usually based on visual inspection and tends to be subjective. In contrast, we use an objective approach, the stepwise variable selection method by discriminant analysis, which enables us to select features automatically. Recently, another automatic approach for feature detection was developed and applied to discrimination of protein secondary structural segments (4). These automatic procedures are to be compared with our previous method by discriminant analysis (5) where the variables were selected manually. In this paper, we apply the stepwise variable selection method to the prediction of in vivo modification sites of proteins, especially the N-glycosylation sites (1) and serine/threonine phosphorylation sites (6).

MATERIALS AND METHODS

Collection of Data. In order to distinguish true modification sites from false analogues, it is necessary to collect both groups of sequence data. False analogues, in our definition, contain sequence patterns similar to true sites, the NX(S/T) consensus pattern in N-glycosylation sites and the (S/T) residue in Ser/Thr phosphorylation sites. False analogues form a control group in the sense that they are assumed to contain different, or possibly negative, patterns other than the consensus specified above. In contrast, true modification sites are expected to contain positive patterns around the consensus. The sequences around truly modified sites were mainly collected from the NBRF-PIR database (7) Release 13.0 (June 1987).
While the positions of the modified sites are reported in the database, there is no description of the sites which are not modified. Therefore, in the sequences reported to have real modification sites, other potential sites are assumed to be non-modified. Strictly speaking, this assumption may not be correct, but when the data set is sufficiently large, errors are expected to be negligible. In the case of N-glycosylation, the data set for false analogues was relatively small because the non-modified sites were selected from glycoproteins, namely, proteins that have at least one modified NX(S/T) pattern. Thus, we also collected data from another source. In 1970 Hunt and Dayhoff (8) studied glycosylated and non-glycosylated sites in their sequence database. We used their compilation of non-glycosylated sites. Because all of these sites existed in non-glycosylated proteins, there was no overlap with the sites collected from the NBRF database.

As for the Ser/Thr phosphorylation sites, another level of complication exists. The information concerning the type of enzyme, i.e., protein kinase, that causes phosphorylation is not usually reported in the NBRF database. On the other hand, there are many protein kinases identified in vivo and each kinase seems to have its own substrate specificity (9). Engstrom et al. (10) compiled a list of the in vitro phosphorylation sites of known kinases. We use their substrate data for the cAMP-dependent protein kinase (PKA), which is the most extensive.

Stepwise Discriminant Analysis. We give a minimal description of the stepwise discriminant analysis here. For further details, see appropriate textbooks (Ref. 11, for example). Discriminant analysis makes it possible to allocate an individual to one of the pre-defined groups on the basis of measurements or values of given variables.
It uses the so-called discriminant function for allocation, which is a linear combination of variables with coefficients optimized to best discriminate the given groups of data, i.e., the training set. In our analysis, discrimination is usually performed between two groups, the sequence data of modified and non-modified sites. However, when we analyze the Ser/Thr phosphorylation sites, we also perform a discrimination among three groups by further dividing modified sites into two groups.

Starting from the repertoire of numerous variables, we wish to select a minimum set of variables that gives a sufficiently good discrimination. Generally speaking, it is desirable to combine many variables in order to obtain better discrimination. However, because the number of possible combinations in the repertoire is enormous, it is impractical to test all of them. Moreover, when a new variable is added for discrimination, it may not actually contribute significantly although the accuracy will not decrease. In fact, even if a variable works quite well by itself, it may not work as well in combination with other variables. Since there is no way to tell beforehand whether the added variable is going to contribute effectively, a procedure called the stepwise discriminant analysis is applied. This procedure has another implication: since we prepare the repertoire as a comprehensive set of variables representing various aspects of sequence data, it plays, in effect, the role of feature detection. The stepwise discriminant analysis examines all candidates and enters them into or removes them from the discriminant function one by one, according to a pre-established criterion, until a stopping condition is satisfied. In practice, we used the program package FACOM ANALYST to perform these analyses (12).

Estimation of Prediction Efficiency. Once the discriminant function is determined from the training set, it can be applied to an individual outside the training set.
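A minimal sketch of these ideas follows, assuming NumPy and two groups labelled 0 and 1. It is a simplified stand-in, not the FACOM ANALYST procedure: variables are only entered (never removed), the entry criterion is the gain in unweighted accuracy rather than an F-statistic, and a small ridge term is added to keep the covariance matrix invertible. The last function is a leave-one-out estimate in the spirit of the U-method of jack-knifing that the paper uses for prediction efficiency. All function names are hypothetical.

```python
import numpy as np


def fit_discriminant(X, y):
    """Fit a two-group linear discriminant; return weights w and cutoff c."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-group covariance; the ridge term is an assumption of this sketch.
    S = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    S = S / (len(X) - 2) + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    c = w @ (m0 + m1) / 2.0  # midpoint between the projected group means
    return w, c


def predict(X, w, c):
    """Allocate each row of X to group 1 if its discriminant score exceeds c."""
    return (X @ w > c).astype(int)


def unweighted_accuracy(y_true, y_pred):
    """Average of the per-group hit rates (the diagonal percentages)."""
    return float(np.mean([(y_pred[y_true == g] == g).mean() for g in (0, 1)]))


def forward_stepwise(X, y, max_vars, min_gain=0.01):
    """Greedy forward selection: add the variable that most improves accuracy."""
    selected, best = [], 0.0
    while len(selected) < max_vars:
        scores = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            w, c = fit_discriminant(X[:, cols], y)
            scores[j] = unweighted_accuracy(y, predict(X[:, cols], w, c))
        if not scores:
            break
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:
            break  # stopping condition: no candidate contributes enough
        selected.append(j_best)
        best = scores[j_best]
    return selected, best


def jackknife_accuracy(X, y):
    """Leave-one-out estimate: refit without each individual, then predict it."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, c = fit_discriminant(X[keep], y[keep])
        preds[i] = predict(X[i:i + 1], w, c)[0]
    return unweighted_accuracy(y, preds)
```

Selecting variables by training-set accuracy tends to overfit when the data set is small, which is why the self-discrimination figure and the leave-one-out figure should be reported separately, as the paper does.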
This we call prediction. The efficiency of prediction is, in general, worse than the efficiency of self-discrimination, which is the discrimination of individuals within the training set. One way to estimate the prediction efficiency is to divide the available data into the training set and the test set. However, because the data set is rather small in our case, this may not be enough to make a reliable estimation. Thus we use an alternative method, the U-method of jack-knifing, defined as follows (13): extract an individual from the sample data and calculate the discriminant function based on the remainder, which is then used to predict the extracted one. By repeating this procedure for each individual in the data set, the average degree of prediction is calculated. This is the estimated prediction efficiency.

Variables for Discrimination. The sample data collected as described above are aligned at the possible modification site, the Asn residue for N-glycosylation and the Ser or Thr residue for phosphorylation, without any insertion or deletion. As shown in Fig. 1, we consider a segment of 31 residues containing 15 residues each on both sides of the

modification site. The relative position within this segment is specified by the numbering shown: 0 for the modification site, positive numbers toward the carboxyl terminus, and negative numbers toward the amino terminus. From the aligned sequence data set we will quantify the importance of respective positions in the 31-residue segment. For this purpose we introduce numerical representations of the amino acid sequence. The 20 naturally occurring amino acids have various physicochemical and biochemical properties which are supposed to play important roles in the manifestation of biological functions in proteins. These properties are often quantified by numerical values called amino acid indices. In our previous analysis (3), we classified 222 published amino acid indices into several categories. We use the following indices, which are representative ones from different categories, in the present analysis:
α: α-propensity by Robson and Suzuki (14)
β: β-propensity by Kanehisa and Tsong (15)
Turn: β-turn propensity by Chou and Fasman (16)
Hφ1: hydrophobicity by Eisenberg (17)
Hφ2: hydrophobicity by Fauchere and Pliska (18)
Size: residue size by Chothia (19)
The α, β, and turn propensities are the preference of amino acid residues to be in these secondary structure conformations. The two hydrophobicity indices represent somewhat different aspects: Hφ1 belongs to a category of hydrophobicity scales for amino acids in proteins derived from the analysis of X-ray crystal structures, while Hφ2, the partition energy of amino acids, is derived from experimental measurements of free amino acids. The last one is the physical size of an amino acid residue. We define the variables for discriminant analysis as follows (see Fig.
1):
X(1)-X(11): the values of numerical indices at the modification site and at each of the 5 neighboring residues on either side
X(12), X(13): the averages of numerical indices over 10 residues, the 6th through 15th neighboring residues
X(14), X(15): the averages of numerical indices over the 5 immediately neighboring residues
N: the total number of residues
R: the relative position from the amino terminus
The position dependence of a numerical index is represented by X(i) where i corresponds to a residue or a segment of residues. For example, the value of α at position -2 is called α(4). Thus, because we use 6 different numerical indices, there are, in total, 92 variables for consideration, including the total number of residues N and the relative location of the modification site R. The stepwise discriminant analysis then selects a smaller number of optimized variables with position-specific weightings. Note that the variables at position 0 are relevant only in the case of Ser/Thr phosphorylation. When the possible modification site is located in the vicinity of the N or C terminus and some portion of the neighboring 15 residues in either direction does not exist, we simply insert blanks as special amino acids which have 0 values for all properties.

RESULTS

N-Glycosylation. The number of N-glycosylation sites collected from the NBRF database was 394. Since many glycoproteins had multiple glycosylation sites, the number of proteins was 177. The control group from the NBRF database had 80 elements in 54 proteins, which are potential, but false, glycosylation sites in glycoproteins. In addition, the data of Hunt and Dayhoff contained 59 sites in 32 proteins, as false sites in non-glycosylated proteins. It seems to be widely accepted that the occurrence of the NX(S/T) sequence is necessary for N-glycosylation. However, in the NBRF database, there were four exceptions.
Three NXC patterns were found in the N-glycosylation sites of protein C precursor (human and bovine) and of von Willebrand factor precursor (human). Apparently, cysteine can take the place of serine or threonine in the N-glycosylation reaction (20). An NGG pattern was found in immunoglobulin heavy chain V regions (mouse). These data were excluded from our analysis. We prepared the following two training sets.
Training set 1: both the N-glycosylated and non-glycosylated sites are from the NBRF database.

TABLE I. Results of discriminant analysis for N-glycosylation sites.

Fig. 1. The definition of variables around a potential modification site. The numbering of residues is: 0 for the site, negative toward the amino end, and positive toward the carboxyl end. One of the six numerical indices, α, β, turn, Hφ1, Hφ2, and size, is represented by X. The value of X at the site is denoted by X(6), and that of each of the five immediate neighbors on both sides is denoted by X(1) through X(5) and X(7) through X(11). The mean values of X are defined as illustrated.
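The variable definitions of Fig. 1 can be sketched as follows. Several details are assumptions of this sketch rather than statements from the paper: X(12) and X(14) are taken to be the amino-side means and X(13) and X(15) the carboxyl-side means, R is computed as the 1-based site position divided by N, and `TOY_INDEX` holds placeholder values, not a published amino acid index. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder index for illustration only; real indices come from the literature.
TOY_INDEX = {aa: 1.0 for aa in AMINO_ACIDS}


def window_variables(seq, site, index):
    """Return X(1)..X(15) for one amino acid index around a 0-based site."""

    def value(offset):
        i = site + offset
        # Positions beyond either terminus are blanks with value 0.
        return index.get(seq[i], 0.0) if 0 <= i < len(seq) else 0.0

    def mean(offsets):
        offsets = list(offsets)
        return sum(value(o) for o in offsets) / len(offsets)

    x = [value(o) for o in range(-5, 6)]  # X(1)..X(11): positions -5..+5
    x.append(mean(range(-15, -5)))        # X(12): positions -15..-6 (assumed side)
    x.append(mean(range(6, 16)))          # X(13): positions +6..+15 (assumed side)
    x.append(mean(range(-5, 0)))          # X(14): positions -5..-1 (assumed side)
    x.append(mean(range(1, 6)))           # X(15): positions +1..+5 (assumed side)
    return x


def site_variables(seq, site, indices):
    """Concatenate X(1)..X(15) for every index, then append N and R."""
    features = []
    for index in indices:
        features.extend(window_variables(seq, site, index))
    n = len(seq)
    features.append(float(n))       # N: total number of residues
    features.append((site + 1) / n)  # R: relative position (assumed definition)
    return features
```

With six indices this yields 6 × 15 + 2 = 92 values per site, matching the count given in the text, and list element 3 of `window_variables` corresponds to X(4), the value at position -2.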

TABLE II. Selected variables for the discrimination of N-glycosylation sites. Footnote a: variable number defined in Fig. 1.

TABLE III. Results of discriminant analysis for Ser/Thr phosphorylation sites.

TABLE IV. Selected variables for the discrimination of Ser/Thr phosphorylation sites.

Training set 2: the non-glycosylated sites from Hunt and Dayhoff's compilation were added to training set 1.

According to the stepwise discriminant analysis, a set of optimized variables was selected first for training set 1. Here the stepwise procedure was stopped at the 18th step because remaining variables no longer contributed effectively to discrimination. Table I(a) shows the result of self-discriminating the training set itself by the 18 variables. The rows and columns correspond to the two groups to be discriminated and to be allocated, respectively. The number of individuals allocated from either of the two groups to either of the two groups is shown both in actual numbers and in percentages. The unweighted average of diagonal percentages is regarded as a measure of discriminant accuracy. In this case, it is about 74%. As the ANALYST program excluded some individuals automatically, the total number of the allocated data is somewhat smaller than the number of initially collected data. Next, Hunt and Dayhoff's data for non-glycosylated sites were combined with the data from the NBRF database (training set 2). As the control group became larger, we hoped to obtain a more reliable result. However, as shown in Table I(b), which is the result of self-discrimination with another optimized set of 18 variables, the accuracy was somewhat lower. We adopted the U-method of jack-knifing in order to

TABLE V. Prediction of cAMP-dependent phosphorylation sites in bovine α-crystallin A chain.

estimate the prediction accuracy when the discriminant function is applied to unknown data, as shown in Table I(c). Training set 1 was used here. The ability to discriminate between true and false sites which are outside the training set is around 60%, in contrast to the self-discrimination of over 70% (Table I(a)). The 18 variables selected in the stepwise discriminant analysis using training set 1 are summarized in Table II. The selected variables are represented by circles. The ones selected in the first 3 steps are denoted by double circles. Roughly speaking, the variables selected earlier are considered to give more important contributions. We have also examined the variables selected in the analysis with training set 2 and in the jack-knifing procedure (data not shown). While the correspondence between these cases was not perfect, the relative position R, the β-turn propensity at position +2, and the α propensity and size averaged over the region are conserved relatively well, suggesting that environmental factors are better determinants than site-specific factors.

Ser/Thr Phosphorylation. The number of Ser/Thr phosphorylation sites collected from the NBRF database was 66 in 30 proteins. In contrast, the number of all other serines and threonines in these proteins amounted to 847, which formed the control group of non-phosphorylated sites. The amino acid distributions at positions around the Ser/Thr site did not show any significant bias in the control group (data not shown).
Because half of the collected phosphorylation sites (33 sites) were contained in caseins, and because the phosphorylation of caseins appeared to be somewhat different (see "DISCUSSION"), we also performed discrimination among three groups: phosphorylation sites in caseins, phosphorylation sites in other proteins, and non-phosphorylated sites. The number of substrate sites for cAMP-dependent protein kinase (PKA) from the compilation of Engstrom et al. (10) was 21 in 17 proteins, including peptide fragments. Three of them were also in the NBRF data. The control group sites were collected from the NBRF database in these 17 proteins and amounted to 526. With these data we prepared training sets as follows:
Training set 1: phosphorylated and non-phosphorylated sites from the NBRF database
Training set 2: the phosphorylated sites in training set 1 are divided into casein sites and non-casein sites
Training set 3: phosphorylated and non-phosphorylated sites in PKA substrate proteins from the compilation of Engstrom et al.
The discrimination result for training set 1 is shown in Table III(a) in the same way as in Table I(a). The stepwise procedure was stopped at the 15th step because an overflow occurred at step 20. The discrimination accuracy of the training set was about 88%. The result of three-group discrimination with training set 2 is shown in Table III(b). Twenty steps were executed. It can be seen that the sites in caseins are easier to distinguish than the other ones. However, the overall accuracy, the average of the three diagonal values, did not improve. The discrimination result of PKA substrate sites using training set 3 is shown in Table III(c). Because there were some proteins whose sequences were not determined completely, the global variables N and R were omitted in this case from the repertoire of variables to be selected.
Almost perfect discrimination was possible, although parameter fitting against a small number of data often becomes too specific to the original data. The stepwise procedure was stopped at the 20th step. Table III(d) shows the accuracy of predicting unknown Ser/Thr phosphorylation sites estimated by the U-method of jack-knifing. Training set 1 was used. Compared with the result of self-discrimination (Table III(a)), the ability to identify true modification sites dropped about 10% because more true sites were classified as false, while the ability to identify false sites was about the same. As the overall accuracy is still high, the derived discriminant function seems to be of practical use. The prediction accuracy with training set 3 for PKA substrates was also calculated. It decreased drastically when compared with the self-discrimination accuracy; only 16 out of 21 true sites (76%) were correctly predicted. This is probably due to the too-specific nature of the variables optimized for the small data set. The variables selected in the discrimination of training sets 2 and 3 are summarized in Tables IV(a) and IV(b), respectively. The result with training set 1 was similar to Table IV(a). The symbols used here are the same as those used in Table II. In Table IV(a) no strong tendency for selecting any amino acid property at any position can be recognized, but it seems that the information about position +2 is relatively important. It may reflect the fact that, in many cases, the phosphorylated Ser/Thr is separated from a basic amino acid (Lys or Arg) by one residue (21).

Note that the variable at position 0 is also selected. It shows that the preference for phosphorylation differs between serines and threonines. In Table IV(b) we can see that the N-terminal side of the modified site, especially from -2 to -4, is essential for discrimination. In addition, hydrophobicity and β-propensity seem to be relatively more important (see "DISCUSSION" for more details).

DISCUSSION

We have considered determinants for modification sites mostly from the point of view of protein structures as they are encoded in the amino acid sequence. For example, when a certain modification occurs post-translationally, it is unlikely to involve buried residues. Thus, the consensus sequence is a result of molecular interaction between the modifying enzyme and the modified protein. As pointed out by Wold (22), there exist at least two other types of determinants, namely the compartment location of the enzymes involved in processing and the temporal changes in substrate structures during biosynthesis and transport. It will be interesting to see whether these factors can also be identified from the amino acid sequence data. For example, the first factor may be related to localization signals in the amino acid sequence such as signal sequences, although it will be difficult to identify all the enzymes in the compartment from sequence information alone. The second factor must also be encoded ultimately in the amino acid sequence, although its clarification would be as difficult as the problem of protein folding. Whatever the actual mechanism is, the present approach is an empirical one identifying characteristic features in the amino acid sequence around modified sites. These features have practical implications for use in prediction, as well as providing insights into the molecular mechanisms of modification reactions.
Hunt and Dayhoff (8) examined the occurrences of NX(S/T) patterns and bound carbohydrates in their collection of amino acid sequences. Not more than 20 of the 101 NX(S/T) sites in their collection were N-glycosylated. While they observed that the occurrence of the NX(S/T) pattern was much less frequent than that of similar patterns, such as (S/T)XN and N(S/T), these patterns occur with similar frequency in the present NBRF database. It is possible that our data set will turn out to contain such a bias in the future. We examined whether any family of closely related proteins occupied most of the data, but no serious overlaps were found. More recently Mononen and Karjalainen (23) collected and analyzed possible N-glycosylation sites. In their collection, 139 out of 196 sites were glycosylated; that is, most potential sites were actually modified, which agrees with our result. Of course, it is premature to conclude that this reflects the natural ratio of glycosylated to non-glycosylated sites in general. However, it seems possible to say that in glycoproteins many of the possible sites are actually glycosylated. Mononen and Karjalainen could not find marked differences in sequence patterns between glycosylated and non-glycosylated sites, and they suggested that the proposed β-turn structure (24) was not a determinant as far as it was estimated by the secondary structure prediction. Not only these earlier results but also ours suggest the difficulty of predicting N-glycosylation sites precisely. The estimated accuracy of predicting unknown data is about 60%. However, inspection of commonly selected variables still seems to reveal some interesting features. First, the variable selected at position +2 was β-turn propensity, which suggests a difference in inclination to be glycosylated between the NXT and NXS patterns. Although there is a natural difference in the occurrences of serines and threonines, the difference of the modification tendency is much stronger.
Second, the fact that the global variables are selected relatively well suggests that a wider interaction range is involved. Perhaps the structure surrounding a potential glycosylation site during the elongation process is at least as important as the recognition sequence of the enzyme catalyzing the reaction, since the modification occurs cotranslationally. Furthermore, the strong selection of variable R implies the importance of relative locations in the primary structure. Generally speaking, N-glycosylation sites appear more frequently near amino terminal regions than carboxyl terminal regions, which may be interpreted as glycosylation reactions occurring at earlier stages in peptide synthesis.

The Ser/Thr phosphorylation is often used in the turning on or off of various kinds of molecular mechanisms with physiological significance. The reaction is somewhat unstable and modified sites may be different under different conditions in vitro, which makes it difficult to evaluate the collected data. In addition, there are different classes of protein kinases in vivo and each has its own substrate specificity. In this sense, phosphorylation sites constitute a set of different components and a unified prediction of all types of Ser/Thr phosphorylation sites may be difficult. On the other hand, because of its biological importance, the amount of data available is relatively large for phosphorylation sites, which makes it suitable to treat with empirical approaches. Despite belonging to different classes, many protein kinases actually have a common character in their substrate specificities; they usually require basic residues near the acceptor sites. Williams (21) pointed out that phosphorylated sites were frequently separated from lysine or arginine by one amino acid. This still seems to hold to some extent. However, according to our analysis of amino acid compositions, the existence of basic residues was not so outstanding.
Furthermore, there were also quite a few examples of basic residues existing near a Ser/Thr residue which was actually not phosphorylated. Thus, it is difficult to discriminate a phosphorylation site from the neighboring basic residues only. The results for phosphorylation sites turned out to be relatively good. The accuracy was close to 90% for self-discriminating the training set and over 80% for predicting unknown data. Because the phosphorylation sites in caseins occupy one half of our data set and because the phosphorylation of caseins appears to have a somewhat different character, they were then treated separately. However, the separation did not raise the accuracy. Casein kinases prefer acidic residues around acceptor sites, whereas many other protein kinases prefer basic residues. It is therefore surprising that a single discriminant function can deal with both basic and acidic residues effectively. Possibly, it reflects a common mechanism of phosphorylation reactions. Indeed there is another

class of protein kinases, namely, tyrosine kinases, which also seem to require acidic residues and share conserved sequences (25). The variables selected for Ser/Thr phosphorylation sites were more or less dispersed over various positions and properties (Table IV(a)). However, careful inspection may suggest the importance of positions ±2 and -4, which correspond well to the positions suggested by Williams (21). In contrast, the pattern of selected variables in the discrimination of PKA substrates was quite different (Table IV(b)). The pattern suggests the importance of the region toward the amino end, especially positions -2 to -4. This coincides with suggestions from experiments using peptide analogues where the substrate specificity of PKA was analyzed (26).

In order to demonstrate how to utilize our method, we show an example of the prediction of cAMP-dependent phosphorylation sites in bovine α-crystallin A chain. The result of running our program is shown in Table V. It first locates potential sites by simple sequence pattern matching; in the present case all serines and threonines in the sequence were treated as potential phosphorylation sites. Then the program evaluates each site by the discriminant function derived from training set 3 of the Ser/Thr phosphorylation sites. As shown in Table V, three sites were predicted to be phosphorylated and one of them (Ser 122) was reported to be an actual phosphorylation site by Voorter et al. (27). The sequence pattern preceding the phosphorylated serine is RRYRLPS, while the proposed recognition sequence of PKA is either KRXXS or RRXS (26).

In this paper we tried to predict two types of modification sites from the amino acid sequence. However, our method is also applicable to other functional sites of proteins, once proper data sets are prepared for both true and false functional sites.
In reality, however, it requires a major effort to prepare a well-verified control group, simply because experimental data seldom exist about sites which are, for example, not modified in vivo. The selection of variables that characterize different features in true and false sites can also be developed further. In the present paper we defined variables representing both position-specific properties and global properties covering longer ranges of sequence data, and used stepwise discriminant analysis for the objective selection of variables. There are certainly other ways to set up the repertoire of variables and to select variables with optimally discriminating features. We believe that the prediction of protein function will become a more useful and effective means of sequence analysis as the development of both databases and algorithms continues.

We thank Dr. Sho Takahashi for helping us collect the data of modification sites. This work was partly supported by the Protein Engineering Research Institute.

REFERENCES

1. Wagh, P.V. & Bahl, O.P. (1981) CRC Crit. Rev. Biochem. 10
2. Tufty, R.M. & Kretsinger, R.H. (1975) Science 187
3. Nakai, K., Kidera, A., & Kanehisa, M. (1988) Protein Eng. 2
4. Kanehisa, M. (1988) Protein Eng. 2
5. Klein, P., Kanehisa, M., & DeLisi, C. (1984) Biochim. Biophys. Acta 787
6. Edelman, A.M., Blumenthal, D.K., & Krebs, E.G. (1987) Annu. Rev. Biochem. 56
7. George, D.G., Barker, W.C., & Hunt, L.T. (1986) Nucleic Acids Res. 14
8. Hunt, L.T. & Dayhoff, M.O. (1970) Biochem. Biophys. Res. Commun. 39
9. Hunter, T. (1987) Cell 50
10. Engström, L., Ekman, P., Humble, E., Ragnarsson, U., & Zetterqvist, Ö. (1984) Methods Enzymol. 107
11. Afifi, A.A. & Azen, S.P. (1979) Statistical Analysis: A Computer Oriented Approach, 2nd Ed., Academic Press, New York
12. Fujitsu Limited (1984) FACOM ANALYST Kaisetsusho (in Japanese)
13. Mardia, K.V., Kent, J.T., & Bibby, J.M. (1979) Multivariate Analysis, Academic Press, New York
14. Robson, B. & Suzuki, E. (1976) J. Mol. Biol. 107
15. Kanehisa, M.I. & Tsong, T.Y. (1980) Biopolymers 19
16. Chou, P.Y. & Fasman, G.D. (1978) Adv. Enzymol. 47
17. Eisenberg, D. (1984) Annu. Rev. Biochem. 53
18. Fauchere, J.-L. & Pliska, V. (1983) Eur. J. Med. Chem. 18
19. Chothia, C. (1975) Nature 254
20. Bause, E. & Legler, G. (1981) Biochem. J. 195
21. Williams, R.E. (1976) Science 192
22. Wold, F. (1981) Annu. Rev. Biochem. 50
23. Mononen, I. & Karjalainen, E. (1984) Biochim. Biophys. Acta 788
24. Aubert, J.-P., Biserte, G., & Loucheux-Lefebvre, M.-H. (1976) Arch. Biochem. Biophys. 175
25. Hunter, T. & Cooper, J.A. (1985) Annu. Rev. Biochem. 54
26. Krebs, E.G. & Beavo, J.A. (1979) Annu. Rev. Biochem. 48
27. Voorter, C.E.M., Mulders, J.W.M., Bloemendal, H., & de Jong, W.W. (1986) Eur. J. Biochem. 160

Vol. 104, No. 5, 1988
