Disulphide Connectivity Prediction in Proteins Based on Secondary Structures and Cysteine Separation

Raju Balakrishnan
India Software Labs, IBM Global Services India Pvt Ltd.
Embassy Golf Links, Koramangala Ring Road, Off Indiranagar, Bangalore, Karnataka, India

Abstract

Disulphide bonds are important in deciding the final 3D conformation of a protein. Knowing the disulphide connectivity helps to find the final protein conformation, as it limits the conformational search space. Fariselli and Casadio [1] approached the problem of predicting disulphide connectivity by equating it to a maximum weighted graph matching problem and assigning edge weights based on the residues in the nearest neighborhoods of the cysteines. This paper modifies the weights by adding constraints based on secondary structure and on the separation between the cysteines in the protein chain. Prediction results show considerable improvement. The prediction results can also provide insight into protein folding and disulphide bond formation, as they support the hypotheses on which the objective function is based.

1. Introduction

Proteins, the building blocks of life, consist of chains of amino acids. There are 20 different types of amino acids constituting protein chains. A protein can contain a single chain of amino acids or multiple amino acid chains linked together. Protein structure can be described at four levels, namely primary, secondary, tertiary, and quaternary. The sequence information of a protein, that is, the order in which amino acids constitute the chains, is its primary structure. The strands formed by these amino acids bend locally, forming sheets, coils etc. These form the secondary structure of proteins. The secondary structures in turn tangle themselves into different shapes. This is the tertiary structure. Quaternary structure refers to the manner in which different chains in a protein are bonded together.
Determining the tertiary structure of proteins is of prime importance in medical science. Predicting the tertiary structure from the primary structure removes the cost of experiments to determine it, and enables the detection of tertiary structures of many proteins for which empirical detection is difficult or not possible. Hence prediction of protein tertiary structure is an active research area.

Disulphide bonds (SS-bonds) are formed between the sulfur atoms in proteins. Cysteine is the only amino acid residue forming disulphide bonds. The major types of forces contributing to protein folding are disulphide bonds and hydrogen bonds. Compared to hydrogen bonds, disulphide bonds are far fewer in number but many times stronger. Hence it will be simpler to predict the disulphide bonds. Formation of the disulphide bonds adds to the stability of the protein conformations. Correct prediction of disulphide bonds will help tertiary structure prediction considerably by reducing the conformational search space for the tertiary structure.

Prediction of disulphide bonds involves two sub-problems. The first is to predict the bonding state of the cysteines. The second is to predict the connectivity of the cysteines. Good solutions are available for the first sub-problem. Also, bonded and un-bonded cysteines rarely co-exist in a protein chain [3]. This paper attempts the second problem, predicting the disulphide connectivity.

2. Problem Definition

Given the amino acid sequence and secondary structure information of a protein chain with 2N bonded cysteines and no un-bonded cysteines, predict the bonding pattern of the N disulphide bonds in the protein. The difficulty of prediction increases with the number of bonds in the protein, so the goodness, or accuracy, of the prediction needs to be evaluated against the number of cysteines in the protein chain.
For example, in a protein with only two cysteines and one disulphide bridge, the accuracy of prediction will always be 100%, as only one bonding pattern is possible. But for a chain with four cysteines, which can be paired in three different ways, the accuracy of a random predictor will be about 33%. Likewise, as the number of disulphide bonds in a protein increases, the probability of a correct prediction by the random predictor decreases. The data set used consists of proteins with 2 to 6 disulphide bonds (footnote 1). Nearly 82% of proteins with disulphide bonds have only 1 to 4 disulphide bonds [1], so the percentage of proteins having more than 6 SS-bonds is small.

Footnote 1: The data set does not include proteins with one disulphide bond, as they can have only one connectivity pattern and do not need prediction.

3. Prior work

Several methods have been tried to predict the disulphide connectivity of proteins [1], [2], [5]. The method suggested in this paper is an improvement on the objective function used in [1]. Paper [2] uses neural network based methods. The method in this paper gives better results than [1] for all classes of proteins (footnote 2). Even though the neural network based methods described in [2] give better accuracy than our method for some classes of proteins, the method in [2] uses a different data set and does not use an explicit objective function. The objective functions proposed here can be combined with tools like neural networks for more accurate predictions. Numerous resources are available on protein folding in general [7], [8], [9], [11], [13], [14]. Databases like PDB [12] hold collections of 3D structures of proteins, which can be visualized using tools like Rasmol [15].

Footnote 2: Proteins are classified by number of disulphide bonds, e.g. proteins having 1, 2, 3 etc. disulphide bonds.

4. System and Method

4.1 Protein Data Set

The data set contains protein chains satisfying the following constraints.

1. Single chain. Justification: To keep the program complexity low. The method can be extended to multiple chains as well.
2. All cysteines bonded. Justification: Predicting the bonding state of a cysteine (whether the cysteine is bonded or not) is an independent problem, and accurate prediction methods for it are available. These methods can be combined with the method proposed in this paper.
3. Disulphide bonds, secondary structure information and sequence information are annotated without ambiguity. Justification: The primary and secondary structure data are used as inputs to the prediction program, and the disulphide bond annotation is used for calculating the accuracy of prediction. Inaccuracy in any of this information may result in a wrong prediction or a wrong accuracy evaluation.

The data set was downloaded from the SwissProt database [6]. Protein data with disulphide bonds (proteins annotated with the string DISULFID in SwissProt) were downloaded, and those with multiple chains or un-bonded cysteines were removed. Proteins with ambiguous disulphide bonds (bonds annotated as POTENTIAL, PROBABLE, BY SIMILARITY, or REDOX ACTIVE) or with ambiguous residue names, like X, were also removed from the data set.

4.2 Measures of Accuracy

Two measures of accuracy are used for evaluating the predictions. The first is the fraction of bonds predicted correctly (Qc), and the second is the fraction of proteins for which all bonds are predicted correctly (Qp).

$$Q_c = \frac{\text{number of bonds predicted correctly}}{\text{total number of bonds}}$$

$$Q_p = \frac{\text{number of proteins with all bonds predicted correctly}}{\text{total number of proteins}}$$

The probability that a random prediction gives the correct result is taken as the accuracy of the random predictor. Any prediction accuracy needs to be evaluated against that of a random predictor; a method giving accuracy less than or equal to that of a random predictor is not valid. For a random predictor the prediction accuracies are given below [1].

$$Q_c(R_p) = \frac{1}{2B - 1} \tag{1}$$

$$Q_p(R_p) = \frac{1}{(2B - 1)!!} = \frac{1}{\prod_{i \le B} (2i - 1)} \tag{2}$$

where B is the number of disulphide bonds in the protein.
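The two accuracy measures and the random-predictor baselines, Qc(Rp) = 1/(2B-1) and Qp(Rp) = 1/(2B-1)!!, can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's Java implementation; the function names and the representation of a bond as a pair of cysteine indices are assumptions made for the example.

```python
from fractions import Fraction


def random_qc(b: int) -> Fraction:
    """Baseline Qc: chance that one bond of a random pairing is correct."""
    return Fraction(1, 2 * b - 1)


def random_qp(b: int) -> Fraction:
    """Baseline Qp: chance that a whole random pairing is correct, 1/(2B-1)!!."""
    patterns = 1
    for i in range(1, b + 1):
        patterns *= 2 * i - 1  # number of possible connectivity patterns
    return Fraction(1, patterns)


def score(predicted, actual):
    """Return (Qc, Qp) for parallel lists holding one bond list per protein.

    A bond is a pair of cysteine indices; order inside a pair is ignored.
    """
    correct_bonds = total_bonds = correct_proteins = 0
    for pred, act in zip(predicted, actual):
        pred_set = {frozenset(bond) for bond in pred}
        act_set = {frozenset(bond) for bond in act}
        correct_bonds += len(pred_set & act_set)
        total_bonds += len(act_set)
        correct_proteins += pred_set == act_set
    return (Fraction(correct_bonds, total_bonds),
            Fraction(correct_proteins, len(actual)))
```

For two bonds both baselines evaluate to 1/3, and for three bonds `random_qp(3)` is 1/15, which shows how quickly the all-bonds-correct baseline falls as B grows.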

4.3 Approach

The disulphide connectivity in proteins is modeled as an undirected graph with cysteines as nodes and bonds between the cysteines as edges. The weights of the graph edges are assigned using the objective functions described in the next section. H. Gabow's N-cubed maximum weighted matching for undirected graphs is used to find the set of pairs for which the sum of the edge weights is the maximum.

For example, suppose we have a protein with 6 bonded cysteines, say C1 to C6. We assign weights to these cysteine pairs as shown in Table 1. This is a symmetric matrix. After assigning the weights, the 3 pairs of cysteines which give the maximum sum of weights need to be found. This is an optimization problem. H. Gabow's maximum weighted matching algorithm [4] is used for this optimization. The output of the algorithm is the predicted connectivity of the disulphide graph. This output is compared against the actual connectivity described in the SwissProt database protein annotations to calculate accuracy. The optimization method is the same as that proposed by Fariselli and Casadio [1]; the objective function that assigns the weights to the graph edges is modified.

Table 1: Symmetric 6 x 6 weight matrix with rows and columns C1 to C6 (cell values not recoverable). The weights are assigned using the objective functions described in the next section. This weight matrix is given as the input to the maximum weight matching optimization.

4.4 Objective Function

The objective function used to assign the graph weights is the combination of the following four functions.

A. Monte Carlo derived contact potential [1]
B. Weights decreasing exponentially with increasing distance between cysteines in the chain
C. Penalty for bonds between cysteines which are less than two residues apart in the chain (footnote 3)
D. Penalty for bonds which cannot be formed without bending of alpha helices or beta sheets

Each of these functions is described in detail below.

Footnote 3: Each amino acid residue in the protein is indexed in ascending order starting from 0, i.e. 0, 1, 2, 3 etc. Each index in a protein chain maps to a single amino acid residue.

A. Monte Carlo derived contact potential

The odds ratio contact potential modified by Monte Carlo simulated annealing was found by Fariselli and Casadio [1] to give the best predictions. We assume that each cysteine, along with four neighboring residues (two on each side in the chain), is in contact. For example, if the cysteine at 5 and the cysteine at 21 form an SS-bond, we assume that all the residues at 3, 4, 5, 6 and 7 are in contact with all the residues at 19, 20, 21, 22 and 23. We then add the contact potentials for all these residue pairs as shown in (3).

$$W_{mc} = \sum_{i} \sum_{j} W(R_i, R_j) \tag{3}$$

Here R_i and R_j are the residues in the neighborhoods of the first and the second cysteine. The values of W(R_i, R_j) are listed in Table 5 of [1]. These are the only criteria used by Fariselli and Casadio [1].

B. Weights decreasing exponentially with increasing distance between cysteines (W_d)

A numerical factor, decreasing exponentially with the number of residues between the cysteines in the chain, is used as a weight. The following function is used to calculate these weights.

$$W_d = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-d^2 / (2\sigma^2)} \tag{4}$$

$$\sigma = 0.75, \qquad d = \frac{|I_1 - I_2|}{100}$$

where d is the number of residues between the two cysteines divided by 100 (footnote 4). This term in the objective function is based on the intuition that the disulphide bonding state in proteins may not reach the global minimum of energy, but may get trapped in local energy minima.

Footnote 4: Numerical values are determined empirically.
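As a rough illustration of terms A and B, the window sum of equation (3) and the Gaussian weight of equation (4) might be computed as below. This is a sketch under stated assumptions: the contact potential table of [1] is not reproduced here, so `potential` is a hypothetical callable supplied by the caller; clipping the window at the chain ends is also an assumption, since the text does not say how end effects are handled; and `SIGMA` simply uses the value as it appears in the text.

```python
import math

SIGMA = 0.75  # sigma as printed in the text for equation (4)


def distance_weight(i1: int, i2: int, sigma: float = SIGMA) -> float:
    """Gaussian weight W_d for cysteines at chain indices i1 and i2."""
    d = abs(i1 - i2) / 100.0  # separation in units of 100 residues
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def contact_weight(sequence: str, i1: int, i2: int, potential, flank: int = 2) -> float:
    """W_mc: sum potential(a, b) over the two windows around the cysteines.

    `potential` is a hypothetical lookup standing in for the table in [1].
    """
    def window(center: int) -> str:
        # Two residues on each side; clipped at the chain ends (assumption).
        return sequence[max(0, center - flank): center + flank + 1]

    return sum(potential(a, b) for a in window(i1) for b in window(i2))
```

With cysteines at indices 5 and 21, `contact_weight` sums 25 residue pairs, matching the worked example in the text, and `distance_weight(5, 21)` is larger than `distance_weight(5, 121)` because the weight falls off with chain separation.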

C. Penalty for bonds between cysteines less than two residues apart

$$W = \begin{cases} 0, & |I_1 - I_2| < 2 \\ W, & |I_1 - I_2| \ge 2 \end{cases} \tag{5}$$

Here I_1 and I_2 refer to the indexes of the two cysteines in the chain. If the magnitude of the difference between the indexes of two cysteines is less than two, the weight for that particular bond is set to 0, i.e. a penalty is applied. Steric hindrance and excessive strain in the bond may prevent its formation.

D. Penalty for bonds between cysteines which necessitate bending of alpha helices or beta sheets (penalty based on secondary structure constraints)

It is observed that secondary structure units like alpha helices and beta sheets do not bend; bending mostly happens in turns. So protein chains are modeled as rigid alpha and beta structures connected by flexible turns. In this model, the criterion for two points in the chain to come together is that the longest single rigid segment between the points should be shorter than the sum of the lengths of all the other segments between them. Figure 1 illustrates the protein model for this method. The criterion can be formulated (footnote 5) as below.

$$d_2 = L_{max} - \sum_{i \ne max} L_i$$

$$W = \begin{cases} W, & d_2 \le 0 \\ 0, & d_2 > 0 \end{cases} \tag{6}$$

where

d_2 is the minimum distance between I_1 and I_2 possible without bending alpha helices or beta sheets;
L_i is the length of a rigid or flexible segment between I_1 and I_2, measured from the position at which the secondary structure unit starts to the position at which it ends. If I_1 or I_2 is positioned on a segment, the length of the part of that segment between I_1 and I_2 is taken instead of the whole length of the segment;
L_max is the maximum of L_i between I_1 and I_2.

Figure 1: The protein's rigid segments are represented by thick lines and flexible segments by thin lines. If the length of the longest rigid segment (L_max) is greater than the sum of the lengths of all other segments, the points I_1 and I_2 cannot come into contact (bonding) without bending L_max.

Footnote 5: Alpha helix and beta sheet lengths adjusted based on inter-residue distance may give better results.

Overall function to assign weights to graph edges:

$$W = k_1 W_{mc} + k_2 W_d \tag{7}$$

An empirically determined value of 7 for k_2/k_1 was found to give the best accuracies. The weighting factors are obtained from empirical results. They are kept high so that integer weights can be given as input to the program implementing the EG algorithm [4], which accepts only integer values; small weights would result in larger rounding errors when converted to integers. The adjacency list representation of the graph, with edge weights assigned as described above, is given as input to Edmonds-Gabow graph matching [4].
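The penalties of equations (5) and (6), the combined weight of equation (7) and the final matching step can be sketched together as follows. Several things here are assumptions made for illustration and are not taken from the paper: segments are represented as hypothetical `(start, end, is_rigid)` tuples with half-open index ranges, L_max is taken over rigid segments only (following the prose criterion), `scale` is a made-up integer conversion factor, and the matching is done by exhaustive enumeration of perfect pairings rather than the Edmonds-Gabow algorithm used in the paper. Enumeration is practical only because the data set stops at 6 bonds (at most 10395 pairings).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: (start, end, is_rigid), half-open [start, end).
# Helices and strands are rigid, turns are flexible.
Segment = Tuple[int, int, bool]


def too_close(i1: int, i2: int) -> bool:
    """Equation (5): True when the cysteines are fewer than two indices apart."""
    return abs(i1 - i2) < 2


def needs_bending(i1: int, i2: int, segments: Sequence[Segment]) -> bool:
    """Equation (6): True when d2 = L_max - (sum of the other L_i) > 0."""
    lo, hi = sorted((i1, i2))
    parts = []
    for start, end, rigid in segments:
        # Only the part of a segment lying between the two cysteines counts.
        length = min(end, hi) - max(start, lo)
        if length > 0:
            parts.append((length, rigid))
    rigid_lengths = [length for length, rigid in parts if rigid]
    if not rigid_lengths:
        return False  # nothing rigid in between, so nothing has to bend
    l_max = max(rigid_lengths)
    others = sum(length for length, _ in parts) - l_max
    return l_max - others > 0


def edge_weight(w_mc, w_d, i1, i2, segments, k1, k2, scale) -> int:
    """Equation (7) with both penalties applied, rounded to an integer."""
    if too_close(i1, i2) or needs_bending(i1, i2, segments):
        return 0
    return round(scale * (k1 * w_mc + k2 * w_d))


def best_pairing(nodes: List[int], weight: Callable[[int, int], float]):
    """Exhaustively find the perfect pairing with the largest total weight."""
    if len(nodes) % 2:
        raise ValueError("need an even number of cysteines")
    if not nodes:
        return 0, []
    first, rest = nodes[0], nodes[1:]
    best = None
    for k, partner in enumerate(rest):
        sub_total, sub_pairs = best_pairing(rest[:k] + rest[k + 1:], weight)
        total = weight(first, partner) + sub_total
        if best is None or total > best[0]:
            best = (total, [(first, partner)] + sub_pairs)
    return best


# Toy weights: the single heaviest edge (0, 2) is not in the best pairing.
toy = {(0, 1): 6, (2, 3): 6, (0, 2): 9, (1, 3): 1, (0, 3): 1, (1, 2): 1}
total, pairs = best_pairing([0, 1, 2, 3], lambda a, b: toy[(a, b)])
```

The toy example illustrates why a matching algorithm is used instead of greedily picking the heaviest edge: pairing 0 with 2 scores 9 but forces the weak pair (1, 3), whereas the pairing (0, 1), (2, 3) reaches a total of 12. For larger graphs a polynomial-time matching routine such as the one cited in [4] would replace the enumeration.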

Figure 2: Prediction flow. Inputs: sequence information and secondary structure information. Steps: (1) assign simulated annealing contact potential; (2) modify weights based on distance between the cysteines; (3) penalty for cysteines less than two residues apart; (4) penalty based on secondary structure constraints; (5) maximum weighted graph matching. Output: predicted connectivity.

The sequence information and secondary structure information are parsed from SwissProt database protein files. This information is passed to modules 1 and 4 respectively. Modules 1 to 4 assign weights to the graph edges. This graph representation is passed as input to the maximum weighted matching module, which predicts the optimum match representing the disulphide connectivity of the protein. The entire implementation is a stand-alone Java program except the Edmonds-Gabow algorithm, which is a C module.

5. Results

Proteins are classified based on the number of disulphide bonds. Disulphide bond prediction accuracies are calculated for each class separately. The results, and a comparison with the results obtained by applying the prediction method of Fariselli and Casadio [1], are listed in Table 2. The abbreviations used in Table 2 are as follows.

EG: Simulated annealing contact potential applied in [1]
S: Constraints based on secondary structure
D: Constraints and weights based on distance between the cysteines

6. Conclusion and future work

The methods suggested in this paper show a considerable increase in the accuracy of prediction. Analyzing the objective function may give better insight into the protein folding mechanism. The secondary structure based constraints in equation (6) increase accuracy for proteins having 3 and 4 disulphide bonds. Incorporating more domain knowledge, such as adjusting inter-residue distances in alpha and beta structures, and tools like neural networks may give better accuracies.
Cysteines closer together tend to form bonds, which may indicate that the disulphide bonding state of proteins gets trapped in local energy minima rather than attaining the global energy minimum.

Table 2 (column headings; cell values not recoverable): No. of SS-bonds | No. of chains | Qp(EG,S,D) | Qp(EG,D) | Qp(EG) | Qc(EG,S,D) | Qc(EG,D) | Qc(EG)

Table 2: Qp and Qc are the measures of accuracy described in section 4.2. Qp(EG) and Qc(EG) are from the method used in prior work [1], and Qc(EG,S,D) and Qp(EG,S,D) are from the method proposed in this paper, i.e. using the EG potential (EG), secondary structure based constraints (S), and distance based weights and penalties for cysteines less than two residues apart (D). Qp(EG,D) and Qc(EG,D) are accuracies of prediction without the secondary structure based constraints (S). Methods not using secondary structure information have the advantage that the overall accuracy does not depend on the accuracy of determining secondary structures.

References

[1] Piero Fariselli and Rita Casadio, Prediction of disulphide connectivity in proteins, Bioinformatics, vol. 17, no. 10, 2001.
[2] Alessandro Vullo and Paolo Frasconi, Disulphide connectivity prediction using recursive neural networks and evolutionary information, Bioinformatics, vol. 20, no. 5.
[3] Paolo Frasconi and Andrea Passerini, A two stage SVM architecture for predicting disulphide bonding state of cysteine, Proc. of the IEEE Workshop on Neural Networks for Signal Processing, 2002.
[4] H. N. Gabow, An efficient implementation of Edmonds' algorithm for maximum weighted matching in graphs, Technical Report CU-CS, Department of Computer Science, Colorado University, 1975.
[5] P. Fariselli, P. L. Martelli and R. Casadio, A neural network based method for predicting disulphide connectivity in proteins, in Damiani et al. (KES 2002), 2002.
[6] SwissProt database of protein sequences.
[7] Lecture notes, Biochemistry, Carnegie Mellon University, 4/Lec08/lec08.pdf
[8] Lecture notes on protein folding, Indiana University, lecture_notes_27.pdf
[9] Christian Anfinsen, Studies on the Principles that Govern the Folding of Protein Chains, Nobel Lecture, Chemistry 1971-1980.
[10] Arnold Neumaier, Molecular Modeling of Proteins and Mathematical Prediction of Protein Structure, SIAM Rev. 39, 1997.
[11] Rob Russell, A Guide to Structure Prediction, EMBL.
[12] Protein Data Bank.
[13] Andriy Kryshtafovych, Torgeir R. Hvidsten et al., Fold Recognition Using Sequence Fingerprints of Protein Local Substructures, IEEE Computer Society Bioinformatics Conference (CSB'03).
[14] Mohammed J. Zaki, Shan Jin et al., Mining Residue Contacts in Proteins Using Local Structure Predictions, Proceedings of Bio-Informatics and Biomedical Engineering.
[15] Rasmol molecular visualization tool.
