Solvation properties of proteins in membranes


Anna Johansson

Abstract

Knowledge about the insertion and stabilization of membrane proteins is a key step towards understanding their function and enabling membrane protein design. Transmembrane helices are normally quite hydrophobic in order to insert efficiently, but there are many exceptions with unfavorable polar or titratable residues. Since they are evolutionarily conserved, these amino acids are likely of paramount functional importance, e.g. the four arginines in the S4 voltage sensor helix of voltage-gated ion channels. This has led to vivid discussions about their conformation, protonation state and cost of insertion. To address such questions, the main focus of this thesis has been membrane protein solvation in lipid bilayers, evaluated using molecular dynamics simulation methods. A main result is that polar and charged amino acids tend to deform the bilayer by pulling water and headgroups into the hydrophobic core to keep their hydrogen bonds paired, thus demonstrating the adaptiveness of the membrane, which allows specific and quite complex solvation. In addition, this retained hydration suggests that the solvation cost is mainly due to loss of entropy, not enthalpy. To further quantify solvation properties, free energy profiles were calculated for all amino acids in pure bilayers, with shapes correlating well with experimental in vivo values but with higher magnitudes. Additional profiles were calculated for different protonation states of the titratable amino acids, for varying lipid composition, and with transmembrane helices present in the bilayer. While the first two both influence solvation properties, the last seems to be a critical aspect. When the protein fraction in the models resembles that of biological membranes, the solvation cost drops significantly, even to values compatible with experimental data.

In conclusion, by using simulation-based methods I have been able to provide atomic-scale explanations of experimental results, and in particular present a hypothesis for how the solvation of charged groups occurs.

List of Publications

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Anna CV. Johansson and Erik Lindahl. Amino-Acid Solvation Structure in Transmembrane Helices from Molecular Dynamics Simulations. Biophys J. 91:

II Anna CV. Johansson and Erik Lindahl. Position-resolved free energy of solvation for amino acids in lipid membrane from molecular dynamics simulations. Proteins. 70(4):

III Anna CV. Johansson and Erik Lindahl. Titratable amino acid solvation in lipid membranes as a function of protonation state. J Phys Chem B. 113(1):

IV Anna CV. Johansson and Erik Lindahl. The role of lipid composition for insertion & stabilization of amino acids in membranes. J Chem Phys. In press.

V Anna CV. Johansson and Erik Lindahl. Protein contents in biological membranes can explain abnormal solvation of charged and polar residues. Submitted.

Reprints were made with permission from the publishers.

Contents

1 Introduction
  Life, cells and membranes
  Water as a solvent
  From sequence to structure to function
  What this thesis is about
Biological membranes
  Composition of a biological membrane
  Bilayer lipids
  Influence of lipids on membrane protein function
  Biological membranes vs membrane models
Membrane proteins
  What do membrane proteins look like?
  How are proteins inserted into the membrane?
  Amino acid composition
  Polar groups in transmembrane segments
  Structural features
  Computational methods
Molecular dynamics
  in silico chemistry
  Relation between statistical mechanics and thermodynamics
  The MD algorithm
  Force fields
  Bonded interactions
  Non-bonded interactions
  Parameterization
  Technical aspects
  Periodic boundary conditions
  Long range interactions
  Constraints
  Temperature and pressure coupling
  Parallelization approaches
  Applications to membrane proteins
Free energy calculations
  Definition of free energy
  Thermodynamic integration
  Free energy perturbation
  5.4 Free energy along a reaction coordinate
  Umbrella sampling
  Constraint free energy
  Adaptive biasing force
  Thermodynamic cycles
  Applications to membrane proteins
Summary of papers
  Solvation of transmembrane helices (paper I)
  Solvation free energy of amino acid analogs (paper II)
  Solvation free energy as function of protonation state (paper III)
  Solvation free energy as function of lipid type (paper IV)
  Solvation free energy as function of membrane protein content (paper V)
Final thoughts
Acknowledgment
Bibliography

1. Introduction

1.1 Life, cells and membranes

The driving force behind all physical processes is an endeavor towards an equilibrium state where competing influences are balanced, although when such a state is reached all life processes cease. Some main factors are commonly stated as necessary for life to persist, including compartmentalization, conversion of energy and reproduction, all of which can be attributed to a balance between striving for equilibrium and maintaining non-equilibrium. In living organisms a non-uniform distribution of matter is maintained by compartmentalization into different kinds of tissues, different types of cells and varying concentrations of ions and molecules in different parts of the cell. According to the second law of thermodynamics the disorder in an isolated system will increase until equilibrium is reached. This means that energy and matter tend to spread out over the universe to form an equilibrated distribution. To concentrate energy or matter in one specific place, it is necessary to compensate by spreading out a greater amount of energy across the remainder of the universe. In living cells complex molecules are constructed from simpler building blocks and energy is stored in the form of energy-rich molecules. This maintenance of non-equilibrium is so important for the cell that a large fraction of the energy absorbed as light or as chemical energy is released as heat to counterbalance the increased concentration of matter. On a population scale natural selection tends to accumulate traits important for survival, while genetic drift introduces random changes. Together these processes ensure the continuous development of life.

1.2 Water as a solvent

Many of the unique physical properties of water, including its solvation properties and high boiling and melting points, are due to its ability to form hydrogen bonds, as illustrated in figure 1.1.
According to thermodynamics, matter seeks to be in a low-energy state, and bonding reduces chemical energy. When a polar group is solvated in water it becomes part of this hydrogen bonding network, but when a non-polar group is solvated, water prefers to rearrange its hydrogen bond network and repels the hydrophobic solute. In the opposite case, when a polar group is solvated in a hydrophobic solvent, the solute will be dehydrated without any possibility to pair its hydrogen bonds, which is an energetically very unfavorable state. A main theme of this thesis will be the influence of hydrophobicity and hydrophilicity on solvation properties.

Figure 1.1: Hydrogen bonding network in liquid water.

1.3 From sequence to structure to function

Most processes in a cell are conducted by proteins acting as enzymes, as structural elements or in signal transduction. All information needed to synthesize the entire set of proteins in a cell (the proteome) is captured by a long sequence of four different nucleotide bases, denoted A, T, C and G, in the DNA molecule within the cell. When a protein is synthesized, the part of the DNA molecule that corresponds to this protein is transcribed to an RNA molecule carrying the information to the ribosome machinery, which translates the sequence and joins individual amino acids according to the coded sequence to form the actual polypeptide. How the nascent protein that emerges from the ribosome folds into its final structure is still a somewhat open question, and a very interesting one, since the structure can tell us how the protein performs its functions. As first proposed by Anfinsen [1], all the information required for folding is contained within the sequence of amino acids. If folding occurred by trial and error, the number of possible conformations would, according to the Levinthal paradox [2], be far too large to search exhaustively, which implies that folding instead follows specific pathways. The number of sequenced genomes is growing exponentially, and ever since the first successful attempts to sequence DNA there has been constant progress in developing methods to analyze and manage this vast amount of information. The amount of structure and proteomics data is growing at a slower rate, making cheap computer-based methods for predicting protein properties from sequence an appealing option. In addition it seems that nature reuses protein folds that have proven efficient. All living organisms are related but have diverged over evolution due to random mutations and natural selection. If the sequences of two proteins are similar, they usually share the same structural features as well. If the structure of a related protein is known, rather reliable conclusions can thus be drawn about the unknown structure. If no such related structures are known, structure prediction is however a very hard problem without simple solutions. Our understanding of molecular biology has increased tremendously over the last couple of decades. It nevertheless seems that for every rule one tries to state in biology there is at least one exception, and for every attempt at generalization the picture grows more complex as more data become available. Nature is continuously evolving and finding new ways to solve problems, which is what makes it such a challenging and interesting area of research.

1.4 What this thesis is about

The main focus of this thesis is computational models that aim at making sense of the fundamental biological mechanisms that allow proteins to reside in biological membranes. This class of proteins is responsible for a vast array of different cellular functions but is inherently difficult to characterize experimentally because the proteins are not water soluble, which makes them very interesting candidates for computational studies. Biological processes are usually studied with a top-down method, starting from observations of a system and, based on these, trying to understand how it works. Here the opposite approach is used: by starting with individual atoms for which the basic properties have been determined, models can be created and studied in action to gain understanding of the system as a whole.
Fortunately the two approaches are by no means mutually exclusive; rather, they complement each other, since theoretical models and predictions are just that, models and predictions, and need to be experimentally validated. In return they frequently provide a much more detailed view of the system. The first part of the thesis describes the biological problem at hand and how biological membranes and the proteins within them are constructed. This is followed by a presentation of the computational methods used to study these systems, and finally by a summary of the research papers on which this thesis is based.

2. Biological membranes

The most fundamental role of a lipid bilayer is to function as a barrier between living cells and their environment and to compartmentalize intracellular organelles within eukaryotes. While being both mechanically strong and flexible, the membrane must also be impermeable to compounds that are unwanted in the cell and have mechanisms for passage of desired compounds into the cell. A composition based on lipids provides these desired structural properties and is impermeable to polar and charged groups while allowing the passage of small, hydrophobic molecules. The transport of larger and/or hydrophilic groups is managed by membrane-bound proteins, which secure the optimal chemical composition inside the cell. In 1972 Singer and Nicolson proposed a model where biological membranes were described as 2-dimensional liquids in which lipids and proteins diffuse more or less freely in the plane [3], offering a more dynamic view than earlier hard cell models. This picture was prevalent for a number of years and can still be found in most textbooks, although our view has evolved extensively over the last couple of decades. It is now generally acknowledged that the membrane is a complex entity compartmentalized into domains whose lipid and protein composition differs from that of the bulk plasma membrane, as reviewed e.g. by Jacobson and Dietrich [4].

2.1 Composition of a biological membrane

A biological membrane consists of variable proportions of lipids, proteins and carbohydrates attached to either lipids or proteins. The mass fraction of proteins is often around half the mass of the membrane, which in most cases is roughly equivalent to 50 lipids per protein [5].
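The step from a mass fraction to a lipid count per protein can be sketched as a short calculation. The two molar masses below (38 kDa for an average membrane protein, 0.76 kDa for a phospholipid of roughly POPC size) are illustrative assumptions that are not taken from the text, and carbohydrates are ignored, so this is only a rough consistency check and not the calculation behind the figure cited from [5].

```python
# Rough conversion from a protein mass fraction to a lipid:protein molar ratio.
# Both default molar masses are illustrative assumptions, not values from the text.

def lipids_per_protein(protein_mass_fraction,
                       protein_mass_kda=38.0,
                       lipid_mass_kda=0.76):
    """Return the number of lipid molecules per protein molecule.

    Carbohydrates are ignored, so the lipid mass fraction is taken
    to be 1 - protein_mass_fraction.
    """
    if not 0.0 < protein_mass_fraction < 1.0:
        raise ValueError("protein_mass_fraction must be strictly between 0 and 1")
    lipid_mass_fraction = 1.0 - protein_mass_fraction
    # The number of molecules is proportional to mass fraction / molar mass.
    lipid_amount = lipid_mass_fraction / lipid_mass_kda
    protein_amount = protein_mass_fraction / protein_mass_kda
    return lipid_amount / protein_amount


if __name__ == "__main__":
    # Protein mass fractions quoted in this section: 18%, about 50%, 75%.
    for fraction in (0.18, 0.50, 0.75):
        ratio = lipids_per_protein(fraction)
        print(f"{fraction:.0%} protein by mass -> {ratio:.0f} lipids per protein")
```

With these assumed masses, equal mass fractions give a ratio of 50. The result scales linearly with the assumed protein mass, so the number should be read as an order-of-magnitude estimate.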
There are extreme cases with protein content as low as 18% in neurons, where the myelin membranes function as insulation, and as high as 75% in mitochondria, which have a large amount of enzymatic activity in the membrane; for data see Guidotti [6].

Bilayer lipids

A lipid typically consists of a headgroup, either charged or zwitterionic (with two charged groups of opposite sign and no overall charge), and one or more fatty acid hydrocarbon tails with a varying number of carbons and double bonds. When placed in water these amphiphilic molecules have a natural tendency to form bilayer structures, which simultaneously expose the polar groups and shield the hydrophobic parts from the water. The preferred lamellar structure depends on the geometry of headgroups and tails, where equal size promotes a flat bilayer and differences in size promote curved bilayers. Figure 2.1 shows a typical lipid structure, different possible lipid shapes and a phosphatidylcholine (PC) bilayer with varying composition as a function of bilayer depth.

Figure 2.1: A: A zwitterionic POPC lipid with a positive and a negative headgroup charge and one double bond. B: Lipid shape determines the preferred lamellar structure of the bilayer. C: A simulated membrane and the corresponding densities of different groups. Red: water; blue: polar headgroups; green: carbonyl groups; grey: hydrocarbon tails.

A membrane typically contains hundreds to thousands of chemically distinct species of lipid molecules, differing in chemical composition and structure. The most common lipids are zwitterionic phospholipids, while rarer but important examples include cholesterol, saturated lipids and trans fats. Cholesterol is essential in mammalian cell membranes to regulate fluidity over the range of physiological temperatures. This lipid is only slightly soluble in water and is transported by different lipoproteins in the blood stream, one type when going out to cells and one type when going back to the liver for excretion. High levels of the first type, known as "bad cholesterol", and low levels of the second, known as "good cholesterol", are associated with cardiovascular disease. Saturated lipids are made of the same building blocks as other lipids but arranged differently, since they lack double bonds. This configuration gives straight molecules that pack well, resulting in high melting temperatures and hard membranes. The same effect can also be seen for special unsaturated fats denoted trans fats, which unlike most naturally occurring unsaturated fatty acids have the chains on opposite sides of the double bond. These lipids are usually a side effect of attempts by the food industry to hydrogenate lipids and are found in products containing processed fats. They are not essential and they do not promote good health. Although lipids are very diverse, the exact lipid composition cannot be critical, since the fatty acyl chain composition of animal cell membranes varies with changes in diet that, with a few exceptions as indicated above, have no obvious deleterious effect on cell function or on the health of the animal [7]. The fatty acid composition is thus in part determined by the diet, although overall features of the lipid composition are nevertheless believed to be important and to influence the bilayer properties. Factors such as chain length, degree of chain saturation, headgroup geometry and charge determine the bilayer fluidity, curvature, charge distribution and thickness, a matter extensively reviewed e.g. by Anthony Lee [8, 9].

2.2 Influence of lipids on membrane protein function

The direct effects of lipid composition on the structure and function of membrane proteins have been quantified experimentally for a number of cases [10].
A striking example is the voltage-gated potassium channels, which conduct potassium when placed in a mixed bilayer containing zwitterionic (POPE) and negatively charged (POPG) phospholipids but whose function is impaired when placed in a bilayer with positively charged lipids (DOTAP) [11]. Transport efficiency of the same channel has also been shown to be affected by altering the lipid composition enzymatically [12, 13]. Other examples include rhodopsin, which does not fold properly without a specific lipid composition [14]; LacY, which might change topology when the lipid composition is altered [15]; the mechanosensitive channel MscL, which depends on a specific curvature of the membrane for activation (reviewed by Hamill and Martinac [16]); and Ca2+-ATPase, which has lower activity in membranes with a high fraction of phosphatidylethanolamine (PE) lipid headgroups as compared to phosphatidylcholine (PC), likely due to reduced ability to form hydrogen bonds in the interface region [17, 18, 19]. Varying lateral pressure profiles and charge distributions associated with different bilayer lipids have been shown to affect membrane protein conformation and hence also function [20, 21, 22]. The importance of the lipid composition of the bilayer for cell function is also emphasized by the varying lipid composition between different organelles and by the uneven distribution of lipids between the two leaflets of the bilayer in the Golgi, plasma and endosomal membranes, which results in different environments on the two sides of the membrane [23].

2.3 Biological membranes vs membrane models

A biological membrane is a very intricate medium composed of a large number of molecules with differing properties. Due to this complexity any attempt to model the bilayer results in a highly simplified model, where the aim of the modeling dictates the level of detail. If only the hydrophobicity of the membrane is needed, it can be modeled very simplistically as an implicit-solvent hydrophobic slab. If individual lipids are believed to contribute to the solvation properties, an all-atom model where all lipids and the surrounding water are modeled explicitly is needed. Even the all-atom models are of course highly simplified, since they often include only a single lipid type, or possibly a mixture of a few different types of lipids. In addition, the bilayer patch used is usually rather small, which excludes long-range effects such as undulations from the model. Fortunately the rather simple descriptions presently in use seem to be detailed enough for a large number of applications.
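To make the simplest end of this range concrete, the following is a minimal sketch of what an implicit hydrophobic slab might look like in code. The half-thickness, the interface width, the logistic switching function and both function names are hypothetical choices made purely for illustration; they are not taken from this thesis or from any particular implicit membrane model.

```python
import math

# Hypothetical parameters, chosen only for illustration.
HALF_THICKNESS_NM = 1.5    # assumed half-width of the hydrophobic core
INTERFACE_WIDTH_NM = 0.3   # assumed width of the core-to-water transition


def hydrophobic_fraction(z_nm):
    """Return a smooth 0..1 measure of how membrane-like the environment is.

    z_nm is the distance from the bilayer centre along the membrane normal.
    The value is close to 1 in the core and close to 0 in bulk water.
    """
    scaled = (abs(z_nm) - HALF_THICKNESS_NM) / INTERFACE_WIDTH_NM
    return 1.0 / (1.0 + math.exp(scaled))


def slab_free_energy(z_nm, water_to_core_cost):
    """Scale a group's water-to-core transfer cost by the local environment."""
    return water_to_core_cost * hydrophobic_fraction(z_nm)


if __name__ == "__main__":
    for z in (0.0, 1.0, 1.5, 2.0, 3.0):
        print(f"z = {z:.1f} nm -> hydrophobic fraction {hydrophobic_fraction(z):.3f}")
```

A model of this kind has no molecular detail at all: it has no lipids or water molecules that could rearrange around a solute, which is the kind of contribution that requires the all-atom description.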

3. Membrane proteins

A majority of articles about membrane proteins begin their introduction by stating the importance of membrane proteins based on their high abundance (about 30% of the genes in a typical genome code for membrane-related proteins according to Wallin et al. [24]), their importance as pharmaceutical targets [25] and the scarcity of experimentally determined structures (as of March 2009, 184 structures had been determined [26], which amounts to less than 1% of all structures in the protein data bank (PDB) [27]). This is all true, and these are very important reasons to study membrane proteins. Beyond that, understanding how this class of proteins functions and behaves inside the membrane is a very fascinating and challenging problem in itself, and an area of research where computational and experimental methods have successfully worked together. Although the number of structures [28, 29], and hence our knowledge, is steadily increasing, there are still many aspects of the membrane protein life story that are poorly characterized.

3.1 What do membrane proteins look like?

The definition of membrane proteins usually includes both peripheral membrane proteins, associated with the membrane surface, and integral or transmembrane proteins, traversing the membrane. This chapter will focus on the latter. The membrane puts a number of constraints on the possible structures that can be adopted by integral proteins, but as more structures are solved, membrane protein structures seem to be just as complex as their water-soluble counterparts [30]. Since no free hydrogen bond acceptors or donors are possible inside the hydrophobic interior of the membrane, the only secondary structure elements that fulfill the full hydrogen bonding potential of the protein backbone are α helices and β barrels, as illustrated in figure 3.1 for an α helix bundle and a β barrel protein.
For the same reason, that no unpaired hydrogen bond partners are available inside the membrane, the thickness of the membrane puts length restrictions on the secondary structure elements. In addition, the amino acid composition is dictated by chemical properties that vary as a function of membrane depth, ranging from pure water to pure hydrocarbons via the chemically complex interface region with ample possibilities for electrostatic, hydrogen bond and van der Waals interactions.

Figure 3.1: A β barrel, outer membrane protein (1QJA) (left) and an α helical bundle protein, rhodopsin (1UAZ) (right), together with small insets with the peptide bond hydrogen bonding patterns for these secondary structure elements.

β barrel proteins have thus far only been found in the outer membrane of gram-negative bacteria and in organelles such as mitochondria and chloroplasts, which are both of bacterial origin. The β strands are more difficult to predict from sequence than the transmembrane α helices due to more long-ranged interactions and fewer residues. Genome analyses predict that only 2-3% of a bacterial genome codes for this class of proteins [31], and there are also fewer structures of β barrel proteins than of α helical proteins. A typical proteome consists of about 20-30% α helical membrane proteins [24], coming in all shapes and with a large variation in size, ranging from single helices to tight bundles formed by a large number of helices. From here on the term membrane protein will implicitly refer to α helical membrane proteins, since they have been the focus of this thesis.

3.2 How are proteins inserted into the membrane?

All proteins are synthesized by the ribosome, where the genetic code in the RNA molecule is used to put together a polypeptide chain, which is subsequently folded to form the three-dimensional protein. Membrane proteins are however not designed to survive in the aqueous cytosol, and if placed directly there by the ribosome a multi-helical membrane protein would immediately aggregate into a non-native structure. Some small single-helix proteins can insert spontaneously into the membrane, but normally the insertion is aided by special channels called translocons [32, 33]. The first step in eukaryotic membrane protein folding is the emergence of a signal peptide in the nascent polypeptide chain. This signal triggers the tight binding of a signal recognition particle (SRP) to the ribosome, which temporarily halts translation and allows the ribosome-SRP complex to bind to a receptor on the ER membrane [34]. Here the SRP is released and the ribosome is placed on top of the translocon channel, into which the polypeptide chain is now synthesized. The translocon can either use a lateral gate to deliver transmembrane helices into the membrane or translocate the helix to the other side of the membrane, as illustrated in figure 3.2. This is in accordance with the classical two-state model proposed by Popot and Engelman [35], where transmembrane helices are first inserted into the membrane one by one and then assemble there to form the final three-dimensional structure.

Figure 3.2: The translocon channel can either insert the helix into the membrane or translocate it to the other side of the membrane.

The mechanism whereby the translocon recognizes sequences intended for the membrane is not completely known, although the molecular features seem to be the same as those seen to mediate protein-lipid interactions in known membrane protein structures, implying that the translocon is designed to allow the nascent polypeptide chain to sample the surrounding bilayer [15]. Attempts have been made to experimentally characterize the molecular code behind translocon recognition, resulting in position-specific solvation profiles for all amino acids in a biological membrane [36, 37, 38, 39].
In paper I in this thesis molecular dynamics simulations were performed using the same type of helices as in these experimental studies to obtain a better understanding of amino acid solvation properties within a membrane. In papers II to V molecular dynamics simulation-based methods were used to calculate solvation free energy profiles under a range of different conditions.

3.3 Amino acid composition

Transmembrane helix sequences are dominated by highly hydrophobic amino acids such as leucine, isoleucine, valine, alanine and phenylalanine, although sprinkled among these are polar or even charged residues of structural and functional importance. The otherwise rather rare aromatic residues tyrosine and tryptophan are commonly found in the interface region [40], where they lock the helix in place thanks to their shape and amphiphilic nature, with a polar group and a large apolar aromatic ring [41], as illustrated in figure 3.3 for tyrosine. Another intriguing example of the interplay between amino acid composition and function is the helix-breaking residue proline, which when found in transmembrane stretches is usually associated with helix kinks. It has been shown that when a kink-associated proline is substituted with an alanine, the kink remains [42], suggesting that the proline is not needed to maintain the kink. A proposed explanation is that the introduction of proline generates the kink, but over time additional substitutions allow the bend in the helix to be preserved. Turning to the amino acid composition of parts outside the membrane, the loops on the inside of the cell are enriched in the positively charged amino acids lysine and arginine, with a corresponding reduction on the outside of the cell, a phenomenon known as the positive-inside rule [43]. It seems to be valid over a large number of organisms [44], and protein engineering studies have shown that it is a powerful determinant of membrane protein topology and promotes helix insertion [45].
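The positive-inside rule lends itself to a very small counting sketch: tally lysine and arginine in the loops on each side of the membrane and guess that the richer side faces the cytoplasm. The function names, the tie handling and the example loop sequences below are invented for illustration; real topology predictors combine this bias with other signals.

```python
POSITIVE_RESIDUES = frozenset("KR")  # lysine and arginine, one-letter codes


def count_positive(segment):
    """Count lysine and arginine residues in a one-letter amino acid string."""
    return sum(1 for residue in segment.upper() if residue in POSITIVE_RESIDUES)


def predict_inside(loops_side_a, loops_side_b):
    """Guess which side of the membrane is cytoplasmic from the K+R bias.

    Each argument is a list of loop sequences located on one side of the
    membrane. Returns "a", "b", or "undecided" when the counts are equal.
    """
    total_a = sum(count_positive(loop) for loop in loops_side_a)
    total_b = sum(count_positive(loop) for loop in loops_side_b)
    if total_a > total_b:
        return "a"
    if total_b > total_a:
        return "b"
    return "undecided"


if __name__ == "__main__":
    # Made-up loop sequences, not taken from any real protein.
    side_a = ["MKRLSK", "GRRKT"]
    side_b = ["DSEGN", "TQDE"]
    print(predict_inside(side_a, side_b))
```

Because loops alternate sides along the chain, a caller would typically pass the odd-numbered loops as one side and the even-numbered loops as the other.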
Evolution also seems to have made use of this ability by creating homologous proteins with the same number of transmembrane helices but with opposite orientation of the topology, often referred to as dual topology proteins [46]. 3.4 Polar groups in transmembrane segments Since it is energetically highly unfavorable to expose polar groups in transmembrane stretches, their presence is a strong indicator of functional importance in each individual case, which is also highlighted by a generally strong evolutionary conservation of these residues. Even though this is a well established fact, it is still unknown how they are stabilized inside the membrane [47]. In the final protein structure they are usually directed towards polar parts 22

Figure 3.3: Left, tyrosine that intercalates with lipid tails at the same time as forming hydrogen bonds with interfacial groups, thus positioning the side chain in the interface region.

23 Figure 3.3: Left, tyrosine that intercalates with lipid tails at the same time as forming hydrogen bonds with interfacial groups, thus positioning the side chain in the interface region. Right, lysine snorkeling towards the interface region where it pairs its hydrogen bonds. in the membrane protein where they can pair their hydrogen bonds, which is also an important driving force for helix dimerization [48, 49, 50]. During protein folding when helices are inserted individually into the membrane they however still need to be exposed to the lipid environment. Experimental studies have shown that it is energetically possible to incorporate even significantly hydrophilic residues into transmembrane helices as long as they are counterbalanced by enough hydrophobic residues [36]. Molecular dynamics studies, including paper I in this thesis and others [51, 52, 53], have proposed that this can be explained by the creation of water cavities inside the membrane allowing for polar groups to keep their hydrogen bonds paired. Another option is used by the positively charged lysine and arginine, which are both long and flexible making it possible for them to direct the polar groups towards the polar interface regions [41]. This behavior is often known as snorkeling and is illustrated for lysine in figure Structural features The first solved structure of a membrane protein was that of the photosynthetic reaction center published in the mid 80 s [54] that together with subsequent structures represent highly regular examples both in terms of helix length and orientation, which was of course related to the fact that these could be crystalized and have their structure determined in the first place. As the number of available structures has increased, so has the number of encountered structural peculiarities. A typical helix-bundle structure was once believed to consist of a number of equally long interacting helices with slightly different angles 23

Figure 3.4: The highly irregular structure of a glutamate receptor (1XFH). Aromatic residues are pink to highlight the position of the aromatic belt.

24 Figure 3.4: The highly irregular structure of a glutamate receptor (1XFH). Aromatic residues are pink to highlight the position of the aromatic belt. Lysine and arginine are shown in cyan, and as predicted by the positive inside rule more abundant on the cytosolic side (i.e. downwards) of the membrane. through the membrane, but now this is extended to also include elements such as very long or very short helices, kinked helices, interfacial helices, re-entrant loops or even irregular structure elements, all which contribute to give the protein desired functional properties [55]. A number of these features can be seen in the irregular structure of the glutamate receptor shown in figure 3.4, where the aromatic belt and the positive inside rule is also highlighted by differently colored residues. A number of studies have focused on dimerization patterns between interacting helices, and some early results showed that a specific motif based on glycine, GxxxG where x can be any other residue, was often found in the interface [56, 57]. This is natural as the small size of glycine allows the helices to come in close contact. Attempts have been made to find additional dimerization targeting motifs but it appears as if the GxxxG-motif could be an exception since no other similarly strong sequence signals have been found. Even polar groups such as serine and threonine are however commonly found in dimerization interfaces, possibly due to their ability to form shared hydrogen bonds to the helix backbone [58, 59]. Even tough the task of predicting interacting helices seems to be a difficult problem, the information can be used the other way around. The DeGrado group has shown that it is possible to design peptides specifically targeting transmembrane helices in a sequence specific 24

manner based on information about geometric constraints from known cases of helix interactions [60, 61, 62].

3.6 Computational methods

Even before the first three-dimensional membrane protein structure was determined, attempts were made to model the structure of membrane proteins based solely on sequence information. The main characteristic feature of transmembrane helices is their hydrophobicity [63]. For easy cases the topology, position and orientation of helices can be determined very accurately based only on this hydrophobicity. Topology predictions can be made even more accurate by also including evolutionary information [64] or amino acid composition biases such as the positive-inside rule or the aromatic belt, allowing for correct topology predictions for close to 70% of all membrane proteins. This number can be improved even further by adding experimental information about the proteins at hand, e.g. the location of one of the peptide termini, which gives a constraint to the prediction [65, 66]. Attempts have also been made to extend the two-dimensional characterization by predicting features such as degree of lipid exposure, tilt angles relative to the membrane [67], oligomerization states [68], the presence of proline-induced kinks [42] and distance from the center of the membrane, a helpful parameter when characterizing structural elements like re-entrant loops and interfacial helices [69]. In terms of full three-dimensional structure predictions, the same limitations hold for membrane proteins as for globular proteins: if a homolog with known structure exists it is a rather easy problem, otherwise it is a very hard problem. Once the structure is known, other methods (frequently simulation based) can be used to further explore the dynamics of the protein, something that will be the main focus in the following chapters.
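As a rough illustration of hydrophobicity-based detection of transmembrane segments, the sketch below averages per-residue hydropathy over a sliding window and flags windows above a threshold. The Kyte-Doolittle scale, the 19-residue window, the 1.6 threshold and the toy sequence are all assumptions made for this example; the text above does not name a specific scale or predictor, and real topology predictors are considerably more elaborate.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale; the text
# does not specify which hydrophobicity scale is used).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, h in enumerate(window_hydropathy(seq, window))
            if h > threshold]


# Hypothetical toy sequence: polar flanks around a 19-residue Leu/Ala stretch.
toy = "KKDEQNSGRR" + "LLALLALLALLALLALLAL" + "RRKDEQNSGK"
hits = candidate_tm_windows(toy)
print(hits)
```

Windows that overlap the hydrophobic stretch sufficiently are reported, while windows dominated by the polar flanks are not.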

4. Molecular dynamics

4.1 in silico chemistry

Historically there have been two main ways to gain understanding about chemical processes, either by developing theoretical models or by conducting experimental work. During the last couple of decades another possibility somewhere in between has emerged with different computational techniques, making it possible to predict the behavior of complex chemical systems and perform theoretical experiments complementing laboratory work. To emphasize the analogy to the Latin expressions in vivo, which means in a living organism, and in vitro, in a controlled environment such as a test tube, this approach is sometimes called in silico, referring to the silicon microchips of the computer where the calculations are conducted. The first attempts to perform simulations of molecular systems were made by Alder and Wainwright in the 1950s [70], who simulated rigid spheres, and the research area soon expanded to more realistic systems, like argon [71] and even liquid water [72]. In the late 1970s the first simulation of a protein was conducted for a total of 8.8 ps or 9000 steps by Martin Karplus and colleagues [73]. Three decades later this is common practice, and the size and timescale of systems possible to simulate are still growing rapidly and now extend into the microsecond range. A wide array of simulation techniques has been developed, all representing different trade-offs between accuracy and efficiency. At one extreme are quantum mechanical calculations, for which John A. Pople was awarded the Nobel Prize in Chemistry in 1998, where the Schrödinger equation is solved for all atoms in the system, allowing for calculations of e.g. partial charges and binding strengths.
The only input needed is the participating elements, although with the limitation that the calculations are very costly and can usually only be applied to small systems under a set of somewhat unrealistic conditions such as vacuum and a temperature of zero kelvin. To model more complex systems for longer time spans, approximations are necessary. The other extreme is coarse grained simulations, where individual particles are groups of atoms, allowing for calculations of concerted dynamic properties for a large number of particles, but with lower accuracy. To conduct such a simulation you need extensive knowledge about the interactions between particles in the system. Somewhere in between these extremes you find the conventional molecular dynamics technique, which uses empirically derived potentials and Newtonian dynamics to calculate interactions between atoms to study either dynamic or equilibrium properties. Ideally one would of course prefer calculations that are as accurate as possible. Fortunately, most interaction properties are well described by classical mechanics, and except for the lightest atoms, such as hydrogen and helium, quantum effects can be neglected. This means that for a large number of applications it is possible to settle for the classical view and solve Newton's equations of motion for the interacting particles instead of the much more expensive Schrödinger formulation. There are of course cases where molecular mechanics methods are too limited, e.g. when a very accurate energy estimate is needed or when studying bond breaking and formation. In return for treating interactions classically, these techniques also allow for more realistic conditions such as explicit solvent water and room temperature. The main limitation for all simulation techniques, including the molecular dynamics technique, is the length of the simulations as determined by the system size and the available computer resources. The maximum time step in an atomistic simulation is on the order of a few femtoseconds, the exact value depending on the system and the integration and constraint algorithms, resulting in a maximum total simulation length on the order of microseconds and thus limiting the type of processes that can be studied successfully.

4.2 Relation between statistical mechanics and thermodynamics

Experiments usually measure macroscopic properties that are always averages over a representative statistical ensemble of a molecular system. In a molecular dynamics simulation the position, velocity and force are known for each particle in every time step. The knowledge of a single structure is however not sufficient, and a representative ensemble has to be generated in order to compute the macroscopic properties of interest.
Furthermore, detailed atomic descriptions of structure and motions are often not relevant for macroscopic properties and can be averaged by applying a statistical mechanics framework. This framework also offers the main justification of the MD method, namely that statistical ensemble averages are equal to time averages of the system, known as the ergodic hypothesis. The phase space of a system with N atoms is defined as a virtual space where all possible states of the system are represented. For a mechanical system the phase space usually consists of all possible values of the position and momentum variables (6N dimensions). A point in phase space is often denoted a micro-state, while a macro-state is defined by properties like temperature and pressure. The basic assumption in statistical mechanics is that all possible configurations of the system that have the same total energy are equally likely; therefore the macro-state associated with the largest number of micro-states

is the one that is most likely to be observed. A partition function describes the system at thermodynamic equilibrium and relates the probability distribution of micro-states to the available macro-states, where the proportion of microscopic states for each macroscopic state is given by the Boltzmann distribution. This partition function is defined as

$$Q = \sum_i e^{-E_i / k_B T}$$

where $E_i$ is the energy associated with micro-state $i$ and $k_B$ is the Boltzmann constant, which relates the units of temperature to the units of energy. Most of the aggregate thermodynamic variables of the system, such as the total energy, free energy, entropy and pressure, can be expressed in terms of the partition function or its derivatives.

4.3 The MD algorithm

To generate a representative equilibrium ensemble of a system, the phase space as defined by the partition function must be efficiently sampled. Two main methods exist to do this: a biased random walk through phase space using the Monte Carlo method, or solving Newton's equations of motion for the system and following the time dependent path through phase space in the molecular dynamics method. For the study of dynamic events only the second one is usable, and since it usually produces comparable amounts of statistics in a given amount of computational time, molecular dynamics simulations are more commonly used. A third approach known as stochastic dynamics combines the two by solving the equations of motion but adding a random potential in each step to improve sampling. The focus of this chapter is the molecular dynamics method, where the forces on every particle are integrated over each time step according to Newton's second law, $F = ma$, to obtain new particle positions.
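The basic update can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular simulation package: it assumes a single particle in one dimension, a harmonic force chosen purely for illustration, reduced units, and the velocity Verlet variant of the integrators discussed later in this section. All function names are hypothetical.

```python
import math


def harmonic_force(x, k=1.0):
    """Force from the illustrative potential U(x) = 0.5 * k * x**2."""
    return -k * x


def run_md(x, v, mass=1.0, dt=0.01, n_steps=1000, force=harmonic_force):
    """Integrate Newton's equations of motion with velocity Verlet.

    Returns the trajectory as a list of (time, position, velocity) tuples.
    """
    trajectory = [(0.0, x, v)]
    f = force(x)
    for step in range(1, n_steps + 1):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity, a = F/m
        x = x + dt * v_half                # new position
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((step * dt, x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the harmonic oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = run_md(x=1.0, v=0.0)
t_end, x_end, v_end = traj[-1]
x_exact = math.cos(t_end)                  # analytic solution for k = m = 1
e_start = total_energy(traj[0][1], traj[0][2])
e_end = total_energy(x_end, v_end)
print(x_end, x_exact, e_end - e_start)
```

For this toy system the numerical position can be compared with the analytic solution, and the total energy should stay close to its initial value, which is a common sanity check on the integrator and the chosen time step.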
The forces are calculated from interaction potentials between all particles as a function of their relative positions. After a completed time step the positions are updated, allowing for the next force calculation and integration to obtain the positions after yet another time step. This procedure can be repeated as long as desired. The trajectory, i.e. the coordinates as a function of time, is saved to disk and can be used to analyze dynamic and equilibrium system properties after the completed simulation. Since the force calculation is usually the most time consuming part, the main criterion for the integration algorithm is not calculation efficiency but the ability to allow for long time steps. Another often desirable criterion is time-reversibility, since Newton's equations are time reversible; taking a small time step $\Delta t$ forwards and then the same step backwards should return the system to its starting point. Good integrators are also area-preserving and leave the magnitude of

any volume element in phase space unchanged. If the latter is not fulfilled, the volume of the system will expand in phase space, which is not compatible with energy conservation. The most commonly used class of integration algorithms is based on the Verlet formulation [74], which is derived by writing two Taylor expansions of the position vector $x(t)$ in opposite time directions and combining them into an expression for the new position at time $t + \Delta t$:

$$x(t + \Delta t) = x(t) + \dot{x}(t)\,\Delta t + \frac{\ddot{x}(t)}{2!}\,\Delta t^2 + \frac{\dddot{x}(t)}{3!}\,\Delta t^3 + O(\Delta t^4)$$

$$x(t - \Delta t) = x(t) - \dot{x}(t)\,\Delta t + \frac{\ddot{x}(t)}{2!}\,\Delta t^2 - \frac{\dddot{x}(t)}{3!}\,\Delta t^3 + O(\Delta t^4)$$

$$x(t + \Delta t) \approx 2x(t) - x(t - \Delta t) + \ddot{x}(t)\,\Delta t^2$$

The error in the updated positions is of order $O(\Delta t^4)$. Variations of this basic Verlet integrator exist, e.g. the velocity Verlet algorithm, which explicitly incorporates velocities, and the leap-frog algorithm, which evaluates the velocities at half-integer time steps and uses these velocities to compute the new positions. The latter was originally an efficient choice using a minimum of memory, since only the positions at a single step needed to be stored, and it is still popular due to its combination of accuracy and efficiency, in particular for parallel simulations. More advanced schemes like the Beeman algorithm have also been developed, which produce the same positions as the Verlet integrator and give more accurate velocities, but are more expensive and need significant storage.

4.4 Force fields

The forces acting on each individual particle are calculated using empirically derived potentials. This collection of potential terms is usually referred to as a potential function in physics and a force field in chemistry. The most commonly used force fields in our type of molecular dynamics simulations are semi-empirical, with parameters derived from both experimental work and high level quantum calculations, optimized to reproduce experimental data as closely as possible.
Force fields have different levels of refinement, either only implicitly including interaction effects such as polarization, or taking these effects into explicit consideration. The basic form contains terms describing both bonded and non-bonded interactions, where the specific decomposition depends on the force field. A general approach could be

$$E_{tot} = E_{bonded} + E_{nonbonded}$$

$$E_{bonded} = E_{bond} + E_{angle} + E_{dihedral}$$

$$E_{nonbonded} = E_{electrostatic} + E_{van\,der\,Waals}$$

Force fields are usually either atomistic, describing the interactions between individual atoms, or coarse grained, describing interactions between groups of atoms. The groups can be small, as in united-atom models where methyl groups are regarded as individual particles, or larger, where each particle may represent a number of atoms.

Bonded interactions

The potential due to covalent bond stretching and bending is usually modeled as harmonic oscillators, as sketched for three bonded particles in figure 4.1. The resulting potential is larger the further away from the ideal bond length or angle the molecule has moved. Dihedrals are angles involving more than three atoms, either proper dihedrals describing torsion angles in chain molecules or improper dihedrals used to keep planar groups like aromatic rings planar.

Figure 4.1: Three bonded particles where harmonic potentials, $k_1(l - l_0)^2$ and $k_2(\xi - \xi_0)^2$, are used to model the bond length and the bond angle. The ideal length and angle are $l_0$ and $\xi_0$, and the force constants are $k_1$ and $k_2$.

Non-bonded interactions

The non-bonded interactions are more expensive to calculate since they include a much larger number of interacting neighbors for each particle. These

potentials can either be calculated as pair or many-body potentials, where the first is a simpler yet highly efficient and accurate choice, which will be further described here. A common way to jointly describe the van der Waals dispersion interactions due to induced dipoles and the short-ranged repulsion due to overlapping orbitals is to use the Lennard-Jones potential as a function of the particle separation $r$,

$$V(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]$$

where $\varepsilon$ is the depth of the potential minimum and $\sigma$ the particle separation for which the energy is zero, as also illustrated in figure 4.2. The $1/r^6$ term describes the dispersion interactions and the $1/r^{12}$ term is an approximate description of the repulsion. This repulsion should really depend exponentially on the distance, since the energy of the system increases abruptly once the electron clouds surrounding the atoms start to overlap. The approximation however allows for more efficient calculation, because $1/r^{12}$ is easily obtained as the square of $1/r^6$, and in practice atoms will not get so close that it matters.

Figure 4.2: The Lennard-Jones potential between two particles as a function of the separation $r$, with a repulsive $(\sigma/r)^{12}$ branch and an attractive $-(\sigma/r)^{6}$ branch; $\sigma$ marks the zero crossing and $\varepsilon$ the well depth.

The contribution from electrostatic interactions between two point charges $q_i$ and $q_j$ is calculated using Coulomb's law

$$V = \frac{q_i q_j}{4\pi\varepsilon_0 \varepsilon_r r_{ij}}$$

where $\varepsilon_0$ is the permittivity of vacuum, $\varepsilon_r$ the relative dielectric constant and $r_{ij}$ the distance between the charges.

Parameterization

In addition to the functional form of the potentials, a force field also defines the values of the parameters needed to calculate these potentials, including partial charges, equilibrium values for bond lengths and the force constants of the harmonic potentials associated with these bonded interactions. Often several parameters are needed for the same element when placed in different chemical environments. An example of this would be a carbon bound to two oxygens in carbon dioxide versus a carbon that is part of a hydrocarbon chain. Force fields are parametrized to reproduce experimental properties like density, enthalpy of vaporization, various spectroscopic parameters and sometimes even solvation free energy, and hence differ in how well they describe a given system.

4.5 Technical aspects

To efficiently implement the molecular dynamics method, there are a number of additional considerations.

Periodic boundary conditions

Consider a system with a finite number of particles placed in empty space. The smaller the system, the larger the fraction of the system that will be located at the edges. These particles will experience properties very different from the rest of the particles, and to minimize such edge effects a very large system must be used. Another option is to use periodic boundary conditions, where a particle on the left edge of the box is actually placed next to a particle on the right edge of the box, and if this particle leaves the simulation box on the left it will enter it again on the right. This effectively means that the simulated box is surrounded by translated copies of itself, as sketched in two dimensions in figure 4.3. When calculating short-ranged interactions between particles, only the closest replica of each molecule is considered, independent of its box location.
Care must be taken to make the simulation box sufficiently large for molecules to be embedded in enough solvent not to interact with copies in neighboring boxes.
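The closest-replica rule and the Lennard-Jones pair potential of section 4.4 can be combined in a small sketch. It assumes a cubic box, reduced units with ε = σ = 1, a plain truncation at the cut-off and a simple double loop over pairs; production codes use neighbor lists and more careful truncation schemes. The function names are hypothetical.

```python
import math


def minimum_image(d, box):
    """Shift one displacement component to the closest periodic replica."""
    return d - box * round(d / box)


def lj_pair(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)   # 1/r^12 as the square of 1/r^6


def lj_energy(positions, box, cutoff, epsilon=1.0, sigma=1.0):
    """Total pair energy in a cubic periodic box, truncated at the cut-off."""
    if cutoff > 0.5 * box:
        # Otherwise a particle could interact with more than one replica.
        raise ValueError("cut-off must not exceed half the box length")
    energy = 0.0
    n = len(positions)
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = [minimum_image(positions[j][k] - positions[i][k], box)
                 for k in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            if r < cutoff:
                energy += lj_pair(r, epsilon, sigma)
    return energy


# Two particles on opposite sides of the box that are close neighbors
# through the periodic boundary, separated by the distance of minimum energy.
r_min = 2.0 ** (1.0 / 6.0)
pair = [(0.2, 0.0, 0.0), (10.0 - r_min + 0.2, 0.0, 0.0)]
e_pair = lj_energy(pair, box=10.0, cutoff=2.5)
print(e_pair)
```

The two particles are almost a full box length apart in absolute coordinates, yet the minimum image convention finds them at the separation of minimum energy, so the pair energy equals the well depth.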

Figure 4.3: Using periodic boundary conditions the simulation box is effectively surrounded by translated replicas, illustrated here in the x and y directions.

Long range interactions

For short range interactions that decay fast as a function of inter-group distance, the periodic boundary conditions cause no additional complexity, and only the interactions between atoms within a cut-off have to be considered without any substantial loss of information. For long ranged electrostatic interactions such a cut-off approach would result in errors, since the potential function depends on $1/r$ and does not converge when integrated over 3D space. Furthermore, the evaluation of a large number of pair-wise interactions is very costly. Fortunately there exist options that combine reasonable computational cost with accuracy. The most widespread approaches are based either on reaction-field or on different implementations of the Particle Mesh Ewald method (PME). In the reaction-field method pair-wise interactions are only calculated within a cut-off. Beyond this a constant dielectric environment is assumed, which works best for homogeneous systems. In PME the interactions are divided into two parts, the computationally cheap contribution from short ranged interactions between neighbors within a cut-off and the more expensive long range interactions. The short range contribution is calculated in real space and the long range part in reciprocal space, allowing for accurate and efficient calculations.

Constraints

To allow for longer time steps and more efficient simulations, the fastest motions in the system, usually associated with bond vibrations for light atoms like hydrogens, can be removed by applying constraints to bond lengths and possibly angles. Incidentally, modeling bond lengths as being constant is quite a good description of the quantum chemical ground state of a bond, at least better than the harmonic oscillator bond description. Constraints offer an efficient way to increase the MD time step, but with the drawback of lost time reversibility of the integration step.

Temperature and pressure coupling

A molecular dynamics simulation is by default performed in an NVE ensemble with constant number of particles (N), volume (V) and energy (E). There are however very few chemical processes that occur under these conditions; an ordinary chemical experiment in a test tube is e.g. performed under constant pressure and temperature, and there is hence frequently a need to couple other properties of the system to fixed values. The most commonly used ensembles are NVT and NPT, corresponding to constant number of particles (N), volume (V) or pressure (P), and temperature (T), which means that the temperature has to be kept constant by a thermostat and, for NPT, the pressure by a barostat.

Parallelization approaches

The development of more efficient computers has shifted focus from more efficient single processors to the use of several processors in parallel. When running a simulation in parallel, the calculation has to be divided over the available nodes as efficiently as possible to minimize the need for communication between the nodes. The simplest way is to divide the particles over different nodes, perform the force calculations for these particles individually on each node and then communicate the result in an all-to-all communication step. This communication will limit how many nodes a system efficiently scales to, and more refined ways have been developed that allow systems to scale to a larger number of nodes by partitioning space instead [75]. Care must be taken, since the part of the simulation algorithm that scales least well will be the bottleneck that limits the parallelization performance. While highly parallelizable simulation software is desirable, good performance on a single processor is still of paramount importance.
4.6 Applications to membrane proteins

Molecular dynamics is a commonly used tool when studying macromolecules in general and, as emphasized by a number of recent reviews [76, 77, 78], membrane proteins are no exception. Since these proteins are solvated in a more complex environment than their water-soluble counterparts, their dynamics is a more intricate problem. A bilayer system requires longer equilibration times due to slow lipid diffusion and relaxation, and the processes of interest are often slow. This requires long calculations, with simulation times approaching µs, to sample even simple protein-membrane interactions. Attempts

have been made to use implicit solvents to mimic the membrane dielectric properties [79, 80] and make these inherently expensive simulations more easily obtainable. As shown e.g. in paper I in this thesis, a lipid bilayer is however a very adaptive solvent, and by removing the atomic representation of the lipids these solvation properties are excluded. Another option is to use coarse grained methods, where the number of particles in the system is reduced, which removes the fastest motions and allows for much longer time steps. This approach has successfully been used e.g. by the Sansom group to study protein interactions with bilayers [81, 82, 83] and by Treptow et al. [84, 85] to study ion channel gating mechanisms. Even if the atomic details of the interactions are kept, the range of processes that can be studied can be extended by the use of non-equilibrium dynamics, where instead of waiting for the system to evolve in a certain direction, it is forced there. One example is Gumbart's studies of the translocon channel being forced to open its lateral gate by pulling a helix out of the channel, allowing for increased understanding of the structural rearrangements [86, 87]. Another example is a recent microsecond simulation of the voltage-gated potassium channel with voltage applied over the membrane to induce movement [88]. Thus, if simulations are used wisely by applying small tricks, molecular dynamics can be used to study even rather slow processes on an atomic scale.

5. Free energy calculations

Every system seeks to achieve a minimum of free energy

5.1 Definition of free energy

By definition the thermodynamic free energy in a system is the amount of energy available for doing thermodynamic work, which is minimized when the system is at equilibrium. A reaction that is associated with a negative free energy difference is hence favored and will run spontaneously, which makes free energy calculations useful when evaluating e.g. the most likely conformation of a protein or which ligand provides the best fit to a given receptor. For a closed system with constant temperature and volume the free energy is described by the Helmholtz free energy $F = U - TS$, where $U$ is the internal energy of the system, $T$ the absolute temperature and $S$ the entropy. In a statistical mechanics context the partition function $Q$ is, as discussed in section 4.2, the sum over all possible states $i$ of a system in the ensemble and hence related to both the total energy and entropy of the system. This offers an alternative way to define the Helmholtz free energy as $F = -k_B T \ln Q$. In chemistry, where reactions often take place in test tubes, constant pressure is more common than constant volume. Gibbs introduced an additional term to compensate for the work performed against the pressure to change the volume, and this Gibbs free energy is defined as $G = U - TS + pV = H - TS$. It is unfortunately impossible to calculate the free energy directly from a simulation, since it is an ensemble property directly related to the volume in phase space associated with the system, not a simple average over phase space coordinates. There is however no way to measure absolute free energy experimentally either, since it is defined as the work that one system does (or can do) on another. Only the changes in free energy associated with the transition of a system from one state into another can thus be defined and measured.
Examples of this include the derivative of the free energy with respect to volume or temperature, properties that can be measured directly in simulations as well:

$$\left( \frac{\partial F}{\partial V} \right)_{N,T} = -P, \qquad \left( \frac{\partial (F/T)}{\partial (1/T)} \right)_{V,N} = E$$
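For a system small enough that its states can be enumerated, the two definitions of the Helmholtz free energy given in this section can be checked against each other numerically. The sketch below does this for a hypothetical two-level system; the energy levels, the temperature and the function names are illustrative assumptions, and such an enumeration is of course not feasible for the molecular systems discussed in this thesis.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def partition_function(energies, temperature):
    """Q as the sum over states of exp(-E_i / k_B T)."""
    return sum(math.exp(-e / (K_B * temperature)) for e in energies)


def helmholtz(energies, temperature):
    """Free energy from the partition function, F = -k_B T ln Q."""
    return -K_B * temperature * math.log(
        partition_function(energies, temperature))


def energy_and_entropy(energies, temperature):
    """Internal energy U and entropy S from the Boltzmann probabilities."""
    q = partition_function(energies, temperature)
    probs = [math.exp(-e / (K_B * temperature)) / q for e in energies]
    u = sum(p * e for p, e in zip(probs, energies))
    s = -K_B * sum(p * math.log(p) for p in probs)
    return u, s


# Hypothetical two-level system: ground state and one state 5 kJ/mol higher.
levels = [0.0, 5.0]
temp = 300.0
f_from_q = helmholtz(levels, temp)
u, s = energy_and_entropy(levels, temp)
f_from_us = u - temp * s
print(f_from_q, f_from_us)
```

Both routes give the same number, which illustrates why the free energy is an ensemble property: it depends on all accessible states through Q, not on the energy of any single configuration.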

5.2 Thermodynamic integration

The free energy difference between two states can be calculated by first finding a reversible path that links them together and then integrating the derivative of the free energy over this path, an approach denoted thermodynamic integration. The free energy is a state function that is only dependent on the states and not on the actual path travelled between them. In a simulation we are hence not restricted to a physical path that can be followed experimentally, which implies that all the parameters in the potential energy function can be used as variables for thermodynamic integration. To define the system along this path, a coupling parameter $\lambda$ can be defined where the potential energy $U_0$ of the system corresponds to $\lambda = 0$ and the potential energy $U_1$ corresponds to $\lambda = 1$. The free energy difference between the states can be calculated by taking the derivative with respect to $\lambda$ of the free energy along the path between the states, which can be written as an ensemble average, and integrating this ensemble average between $\lambda = 0$ and $\lambda = 1$. This is an important result, since ensemble averages can be obtained from simulations. The coupling parameter can either be defined for systems with the same composition or for systems with similar but distinct molecules where non-physical intermediate states are used, so-called artificial thermodynamic integration. The most common method to perform thermodynamic integration is to construct a set of configurations that are simulated at fixed values of the coupling parameter $\lambda$ to obtain the derivative of the free energy, expressed as an ensemble average from equilibrium simulations:

$$\frac{dF}{d\lambda} = \left\langle \frac{\partial H}{\partial \lambda} \right\rangle_{\lambda}$$

The ensemble averages from these simulations are numerically integrated from $\lambda = 0$ to $\lambda = 1$ to obtain the free energy difference

$$\Delta F = \int_{\lambda=0}^{\lambda=1} \left\langle \frac{\partial H}{\partial \lambda} \right\rangle_{\lambda} d\lambda$$

An alternative is to use slow growth, where the value of the coupling parameter $\lambda$ is changed slowly during a single simulation.
For the latter there is never a true average of any state since $\lambda$ is constantly changing, and the result is thus often associated with a large degree of error due to hysteresis; i.e. there is a risk of a systematic difference in the calculated free energy depending on whether the calculation is performed forwards or backwards.
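The fixed-λ procedure can be illustrated on a toy problem where everything is known analytically. The sketch below assumes a one-dimensional harmonic oscillator whose spring constant is interpolated linearly between k0 and k1, uses the analytic ensemble average as a stand-in for the averages that would normally come from one equilibrium simulation per λ value, and integrates with the trapezoid rule. The choice of system, the number of windows and the function names are assumptions made for illustration only.

```python
import math


def mean_dU_dlambda(lam, k0, k1, kT):
    """Ensemble average of dU/dlambda for U = 0.5 * k(lam) * x**2.

    With k(lam) = (1 - lam) * k0 + lam * k1 the derivative is
    0.5 * (k1 - k0) * x**2, and equipartition gives <x**2> = kT / k(lam).
    In a real calculation this average would be measured in a simulation.
    """
    k_lam = (1.0 - lam) * k0 + lam * k1
    return 0.5 * (k1 - k0) * kT / k_lam


def thermodynamic_integration(k0, k1, kT, n_windows=21):
    """Trapezoid-rule integral of the averages over equally spaced lambdas."""
    lambdas = [i / (n_windows - 1) for i in range(n_windows)]
    averages = [mean_dU_dlambda(lam, k0, k1, kT) for lam in lambdas]
    return sum(0.5 * (averages[i] + averages[i + 1])
               * (lambdas[i + 1] - lambdas[i])
               for i in range(n_windows - 1))


kT = 2.494                                  # roughly k_B T at 300 K in kJ/mol
dF_ti = thermodynamic_integration(k0=1.0, k1=4.0, kT=kT)
dF_exact = 0.5 * kT * math.log(4.0 / 1.0)   # analytic result for this model
print(dF_ti, dF_exact)
```

The small remaining difference comes from the finite spacing of the λ values; where the integrand changes quickly, more windows or an uneven spacing are typically needed.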

5.3 Free energy perturbation

Suppose we are interested in the free energy difference between two systems, 0 and 1, with partition functions $Q_0$ and $Q_1$ and free energy difference $\Delta F = F_1 - F_0 = -k_B T \ln(Q_1/Q_0)$. If a simulation is performed for system 0, the potential energy can also be calculated for system 1 for each configuration visited during this sampling. The potential energy difference, $\Delta U$, can thus be recorded and used to construct a histogram as a measure of the probability density for the potential energy difference, which can be related to the free energy by $F = -k_B T \ln p(\Delta U)$. This method to calculate free energy differences from a single simulation is called free energy perturbation and was presented already in 1954 by Robert Zwanzig [89]. If the phase space volumes of the states of interest do not overlap sufficiently and the free energy difference is not small enough to ensure convergence, additional simulations have to be performed for intermediate systems.

5.4 Free energy along a reaction coordinate

Sometimes free energies correspond closely to physical processes acting along a reaction coordinate, representing cases when not only the free energy difference between two states is of interest but also the free energy along the path the system travels between these states. The reaction coordinate can e.g. be a simple geometric coordinate representing bond length (as illustrated in figure 5.1), bond order, or perhaps the distance between the active site of an enzyme and a ligand. A naive approach to calculate the free energy profile along the reaction coordinate would be to let the system sample the phase space and note the number of times the system is found in different locations along the coordinate, to estimate the probability of the system being in that location. This probability as a function of position, $P(\xi)$, can be related to the free energy by $F(\xi) = -k_B T \ln P(\xi)$, where $k_B$ is the Boltzmann constant and $T$ the temperature.
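The naive histogram estimate can be sketched as follows. The samples here are drawn directly from the Boltzmann distribution of a harmonic well (a Gaussian) as a stand-in for positions that would be collected from a simulation; the bin layout, reduced units and function name are assumptions for this example.

```python
import math
import random


def free_energy_profile(samples, lo, hi, n_bins, kT):
    """Estimate F(xi) = -kT ln P(xi) from a histogram of sampled positions.

    Bins that were never visited are skipped, since their free energy is
    undefined. The profile is shifted so that its minimum is zero.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    centers, values = [], []
    for i, c in enumerate(counts):
        if c == 0:
            continue
        centers.append(lo + (i + 0.5) * width)
        values.append(-kT * math.log(c / total))
    f_min = min(values)
    return centers, [f - f_min for f in values]


# Stand-in for simulation data: harmonic well with k = 1 and kT = 1, so the
# positions are normally distributed with unit variance.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100000)]
centers, profile = free_energy_profile(samples, -3.0, 3.0, 12, kT=1.0)
for c, f in zip(centers, profile):
    print(f"{c:6.2f} {f:6.3f}")
```

The estimate follows the underlying parabola near the minimum, while the outermost bins rest on only a few hundred of the 100000 samples and are correspondingly noisier.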
The problem with this approach is that only states associated with sufficiently low energy will be sampled efficiently, since the probability of finding the system in higher energy states will be very small. If there are barriers along the free energy path, the system might even become stuck and some parts of the reaction coordinate pathway might not be sampled at all. To overcome this problem and ensure even sampling along the entire reaction coordinate some tricks can be used, which will be discussed in the following sections.

Figure 5.1: The free energy along a reaction coordinate represented by the separation between two bonded atoms (free energy plotted against separation).

Umbrella sampling

A frequently used approach to ensure sufficient sampling along the entire reaction coordinate is to add a potential to the system to cancel out the free energy profile and to confine the system to a small region, where the free energy is $F = -k_B T \ln P(\xi) + W(\xi)$ and $W(\xi)$ is a biasing potential. During simulation, the preferred position of the system is sampled and a histogram approach is used to evaluate the result and calculate the free energy. In the final step the biasing potential is subtracted again to recover the actual free energy. The added potentials often have the functional form of harmonic potentials, the shape of which has given the method its name. This is a well known method with the advantage of easy usage and system setup, but the disadvantage is that the biasing potential has to be guessed beforehand. Since this cannot be done exactly, the normal solution is to partition the reaction coordinate into a number of windows with sufficient overlap to ensure convergence, resulting in a large number of simulations.

Constraint free energy

According to the concept of potential of mean force (PMF), if a force $f(\xi) = -\partial F / \partial \xi$ depending on some reaction coordinate can be extracted, then the PMF can be calculated according to

$$\mathrm{PMF} = \int_{\xi=0}^{\xi=1} \frac{\partial F}{\partial \xi}\, d\xi$$

Constraint-based free energy calculation, which is the method used in papers II to V presented in this thesis, is based on the same principle as umbrella sampling, where a potential is added to the system to ensure a uniform distribution of $P(\xi)$. In this case the added potential is infinite, which means that the system will stay in the exact same position. During simulation the force needed to keep the system in this position is recorded and used to calculate the PMF, which can be interpreted as the free energy along the reaction coordinate. In the simplest case the added constraint is one-dimensional and the recorded forces in this direction will only depend on this degree of freedom, and hopefully all other degrees of freedom are sufficiently sampled during the simulation.
The advantages of this approach are that no assumptions have to be made beforehand about the system and that each subsystem represents a point on the reaction coordinate that can be simulated independently, making it embarrassingly parallelizable. The disadvantage is that an advanced constraint algorithm has to be used to keep the groups of interest at the defined distance, which is particularly complex in parallel runs.
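As a minimal sketch of the integration step described above, the following Python example averages the recorded force in each independently simulated window and integrates the averages along the reaction coordinate with the trapezoidal rule. The function name `pmf_from_constraint_forces`, the synthetic force data, and the sign convention (the input is taken to be an estimate of $\partial F / \partial \xi$) are assumptions made for illustration; they are not taken from the papers, and simulation packages differ in how they report constraint forces.

```python
import random
from statistics import fmean


def pmf_from_constraint_forces(xi, dF_dxi):
    """Integrate mean forces along xi with the trapezoidal rule.

    xi      : constrained positions, in increasing order
    dF_dxi  : time-averaged estimate of dF/dxi at each position
              (assumed sign convention; check what your code reports)
    Returns the PMF at each position, set to zero at xi[0].
    """
    pmf = [0.0]
    for i in range(1, len(xi)):
        step = 0.5 * (dF_dxi[i] + dF_dxi[i - 1]) * (xi[i] - xi[i - 1])
        pmf.append(pmf[-1] + step)
    return pmf


# Synthetic stand-in for simulation output (hypothetical data):
# the underlying profile is F(xi) = 10 * xi**2, so dF/dxi = 20 * xi.
rng = random.Random(0)
xi = [i / 100 for i in range(101)]

# One independent "window" per constrained position, each with noisy samples.
samples = [[20.0 * x + rng.gauss(0.0, 5.0) for _ in range(2000)] for x in xi]
mean_forces = [fmean(window) for window in samples]

pmf = pmf_from_constraint_forces(xi, mean_forces)
print(f"PMF difference between xi=0 and xi=1: {pmf[-1]:.2f}")
```

Because each window only contributes its own time average, the windows can be run in any order or simultaneously, which is the parallelization property mentioned above.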

Figure 5.2: The free energy between two states can be more easily calculated by constructing a thermodynamic cycle with intermediate states.

Adaptive biasing force

Another option is to calculate the biasing potential on the fly, as is the case in the adaptive biasing force method. The mean force acting on the system is calculated and then removed from the system, which then performs a one-dimensional random walk with zero mean force along the reaction coordinate $\xi$. Only the fluctuating part of the force remains, ensuring uniform and efficient sampling. The calculated mean force is updated as more statistics are collected, which gives the method its adaptive properties. The method gives better sampling than the aforementioned ones, but at the cost of being less suitable for simulation on multiple processors, since it is not possible to perform independent simulations as is the case for, e.g., the constraint method.

5.5 Thermodynamic cycles

For some applications it can be difficult to directly define a path between the states of interest. However, since the free energy is a state function and the path travelled is unimportant, intermediate states can be used to define the sought difference. An example is shown in Fig 5.2, where the sought free energy difference is the hydration of a particle, $\Delta G_4$, a difference that is easier to calculate when using intermediate steps where the particle is transformed into a dummy particle according to

$$\Delta G_4 = \Delta G_1 + \Delta G_2 - \Delta G_3$$

Applications to membrane proteins

The methods above can be used to perform the same types of free energy calculations on membrane proteins that are commonly performed on all types of proteins, e.g. ligand binding to a protein surface or assessing different folding

options, but also to more specifically evaluate membrane protein structure and function. An interesting problem well suited for such calculations is the permeability of molecules through channels of different types. Striking examples include work by Allen et al. [90], who used umbrella sampling to study model gramicidin channels in reasonable agreement with experimental results, and modeling of the permeation through the voltage-gated potassium channel using the adaptive biasing force method [84]. Other applications include recognition and association of transmembrane helical domains [91] and solvation profiles of individual amino acids as a function of position in the membrane, as presented in papers II - V of this thesis but also by the Tieleman [92, 53] and Allen groups [51, 93, 94, 95], who both used umbrella sampling, with amino acid analogs and entire helices respectively, to accomplish this. Although the development of faster and more accurate methods for free energy calculations has enabled applications that were not feasible only a couple of years ago, the methods still cannot be used as black box applications. A number of careful choices have to be made, both when designing the free energy path between the states of interest and when performing the simulations to ensure efficient and sufficient sampling. A free energy calculation in a complex system thus poses great methodological challenges, but can also contribute substantially to understanding the problem at hand.

6. Summary of papers

Knowledge about the insertion and stabilization of membrane proteins is a key step towards understanding their function and enabling membrane protein design. Transmembrane helices are normally quite hydrophobic to insert efficiently into membranes, but there are many exceptions with polar or titratable residues. An obvious example is the S4 helices of voltage-gated ion channels with up to 4 arginines, leading to vivid discussions about whether such helices can insert spontaneously and, if so, what their conformation, protonation state and cost of insertion really are [36, 96, 97, 98, 99]. To address this question, the main focus of this thesis has been membrane protein solvation in lipid bilayers. In paper I, we used single transmembrane helices with a systematically varied sequence inserted into a lipid bilayer to study how the solvation properties changed as a function of amino acid type and position in the helix. In the next paper, paper II, we tried to reproduce the experimental in vivo amino acid solvation profiles in a bilayer presented by Hessa et al. [36, 37] by using a molecular dynamics based method. Our results show good correlation with the experimental values, but the magnitudes of the in vivo values are considerably smaller, and the scale interestingly appears as a compressed version of the calculated values. At the same time other groups also presented calculated values that led to the same conclusion [51, 53], which made us believe that there was something about the process of helix insertion that we still do not understand and that is hence not included in theoretical models. The following three papers, papers III to V, all focus on possible explanations for the discrepancies between the two scales.
Paper III evaluates the influence of the protonation state of titratable amino acids on solvation properties, paper IV the corresponding influence of different types of bilayer lipids, and the last paper, paper V, the influence of transmembrane helices already present in the bilayer to mimic a biological membrane. Both the protonation state and the lipid type do have an influence on solvation, although not large enough to single-handedly offer an explanation, while the presence of transmembrane helices seems to be able to account for virtually the entire discrepancy. The answer is however most likely a combination of different aspects, since the bilayer used in models, no matter how realistic one tries to make it, will differ extensively from a complex biological membrane.

6.1 Solvation of transmembrane helices (paper I)

Inspired by experimental work by Hessa et al. [36], a set of transmembrane test helices was constructed to evaluate membrane solvation properties for amino acids in a transmembrane helix, both as a function of amino acid type and of position in the membrane. Starting from a 19-residue long poly-alanine helix with interfacial glycine and proline anchors on either side, nine test helices were constructed for each of the remaining amino acids by systematically substituting two alanines at a time, starting next to each other in the middle of the membrane and stepwise moving outwards until placed at the ends of the helix. These test helices were studied inside a DMPC bilayer using molecular dynamics simulations. The most important result was that charged and, to some extent, polar amino acids retain hydration by pulling in water and polar lipid headgroups, even when placed in the middle of the membrane. The amount of water surrounding each tested amino acid was quantified by calculating the cumulative radial distribution and, interestingly enough, these values correspond well to experimental in vivo amino acid solvation profiles [36], indicating that this retained solvation could play an important role for the solvation cost. This also implies that the cost is mostly due to entropic contributions from molecules in the hydration shell, not loss of enthalpy from bond breaking. Other interesting observations include a number of atomic-scale explanations of already observed properties of membrane protein interactions with a bilayer. Polar amino acids, and particularly long flexible residues like lysine, tend to snorkel towards the interface region to pair their hydrogen bonds. In our simulations we were able to quantify this behavior by calculating average snorkeling angles and, due to geometrical constraints of the peptide bonds, the angles are larger in the N-terminal than in the C-terminal end of the helix.
The size of the hydration shell for polar groups is also frequently larger in the N-terminal than in the C-terminal end, which is consistent with earlier studies on the amino acid composition of transmembrane helices indicating that polar groups are more common in the N-terminal than in the C-terminal end. When further comparing the relative occurrence of amino acids, basic amino acids seem to occur further into the membrane than acidic ones. This is caused by their ability to form hydrogen bonds to lipid carbonyl groups, something acidic amino acids cannot do. Tryptophan and tyrosine are known to be enriched in the interface region and to function as interfacial anchors keeping helices in place in the membrane. In our simulations we could see how the aromatic ring orients itself parallel to the lipid chains and intercalates while at the same time forming hydrogen bonds with the interface regions, thus efficiently locking the helix in place. Serine and threonine are furthermore known to be important for helix dimerization, which agrees well with our observations that they form shared hydrogen bonds to the helix backbone as illustrated in figure 6.1, making their

membrane insertion relatively cheap at the same time as it is more advantageous to later form real hydrogen bonds to other groups, e.g. to another helix. In conclusion, the results from this paper showed how adaptive the membrane environment is as a solvent and that the solvation is both specific and quite complex.

Figure 6.1: Hydrogen bonding network for serine (left) and threonine (right).

6.2 Solvation free energy of amino acid analogs (paper II)

In the second paper we wanted to further quantify the solvation properties of amino acids in a membrane environment by calculating the solvation free energy as a function of position. Sampling in an explicit membrane is very slow, and as we showed in paper I that polar groups tend to retain hydration, accurate free energy calculations would necessitate extremely long simulation times. Here, we instead chose to use amino acid analogs where the backbone part was removed, to allow for accurate calculations using shorter simulations, since individual side chains cause much less disturbance to the membrane than an entire helix. To systematically study the effect individual amino acids have on the bilayer environment, 100 conformations were constructed for each amino acid analog, except for glycine (no side chain), placed at different positions along the membrane z-axis. Simulations were performed for each of these conformations with the position of the analog relative to the lipid bilayer constrained. The direction and size of the resulting force, denoted $F_{\mathrm{constr}}$, indicate the direction in which the analog wants to move and to what extent. By integrating this set of position-dependent forces over the box length in the z-direction, a likewise position-dependent potential of mean force (PMF) was obtained, $V(z) = \int_{z} F_{\mathrm{constr}}(z) \, dz$.

Figure 6.2: Arginine side chain analogs in different positions along the membrane normal. The relative z-coordinate is constrained and the resulting forces acting on the analogs are shown with arrows.

This PMF can be interpreted as the free energy of solvating amino acid side chains at different positions in the bilayer. Figure 6.2 illustrates this with an arginine analog placed at three different locations in the membrane together with the resulting forces from simulations. Similar trends as in paper I were seen here as well, with polar groups retaining solvation and directing themselves to facilitate the pairing of hydrogen bonds, indicating that the solvation cost is to a large extent entropy driven. When comparing the results with the in vivo apparent free energy scale [36], there was initially a strikingly good correlation, and we seemed to be able to calculate the solvation free energy of entire helices with results corresponding very well to experimental results. However, as shown in the erratum to paper II, this turned out to be an artifact due to a software error, and after recalculation the net values were clearly higher than the experimental ones. This agrees well with other simulation studies [53, 51] and is somewhat perplexing: the relative correlation is quite good despite the higher magnitude, and the in vivo scale almost seems like a compressed version of the calculated scales. Altogether, this implies that membrane protein insertion is still not well understood and that some aspect was not included in the models used for calculating the solvation free energies. Suggested explanations for this discrepancy include different protonation states of the titratable amino acids, different types of bilayer lipids, the absence of protein in the pure lipid bilayers used for calculations, as well as different aspects of the still not fully understood helix recognition by the translocon machinery.
The first three suggestions are the theme of the remaining three papers of this thesis, while the last is still an interesting question for further studies.

6.3 Solvation free energy as function of protonation state (paper III)

The main objective of this paper was to evaluate the solvation properties of titratable amino acids inside a lipid bilayer as a function of protonation state. These amino acids are rare inside bilayers, and their presence in transmembrane stretches implies importance for structure and/or function, most likely associated with their charge. By calculating solvation free energy profiles for both the charged and the de-charged states, we were able to compare solvation properties and thus evaluate whether it is possible to have the charged forms in the bilayer or whether insertion would lead to de-charging. In paper II the solvation free energy was calculated for the most likely protonation states in water, hence the profiles for the charged state were calculated for the analogs to arginine, lysine, histidine, aspartate and glutamate. In paper III profiles were also calculated for the corresponding de-charged states using the same method as presented in paper II. As expected, the cost of solvating the de-charged analogs is considerably lower than for the charged states, corresponding well to profiles for other polar amino acids. The non-charged protonation states induce little distortion of the membrane even when placed in the hydrophobic core, whereas the charged forms cause local distortion to maintain hydration, as illustrated in figure 6.3. Our results show that the differences between protonation states are not large enough to rule out the possibility that charged amino acids can be inserted into a transmembrane helix if necessary for structure or function, especially not if the free energy cost of de-charging the residue is taken into account, which is of the same order as the differences. Both the charged and the de-charged solvation free energies show good correlation with experimental values, which can thus not be used to discriminate between the two sets.
It is however important to keep in mind that there is a cost associated with the solvation of polar groups in the membrane, smaller or larger depending on the protonation state but still a cost. The geometric aspects of solvation were also extensively evaluated: charged states have a larger fraction of paired hydrogen bonds inside the membrane and a higher degree of order, since they consistently direct their polar groups towards the closest interface region. This almost perfect conservation of hydration for charged amino acids also results in a local thinning of the membrane around the analog, an effect that is not as strong for the uncharged protonation states. Even though the difference between protonation states is significant, it is not large enough to explain the discrepancy between calculated and experimental values presented in paper II. Furthermore, for some proteins it is most likely necessary to have the charged state of the titratable amino acid to maintain function.
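To give a feeling for the magnitude of the de-charging cost mentioned above, the following sketch estimates the free energy of converting a titratable side chain to its neutral form in bulk water from the standard relation $\Delta G = RT \ln 10 \cdot |\mathrm{p}K_a - \mathrm{pH}|$. The function name `neutralization_cost`, the temperature, the pH and the $\mathrm{p}K_a$ values are assumptions for illustration only: the $\mathrm{p}K_a$ values are approximate textbook numbers for free side chains in water, not values from paper III, and they are expected to shift in a protein or membrane environment.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def neutralization_cost(pka, ph, acidic, temperature=300.0):
    """Free energy (kcal/mol) of converting the charged form to the neutral
    form in bulk water at the given pH.

    Bases (Arg, Lys, His): neutral form is deprotonated, cost ~ (pKa - pH).
    Acids (Asp, Glu):      neutral form is protonated,   cost ~ (pH - pKa).
    A negative value means the neutral form already dominates in water.
    """
    rt_ln10 = R_KCAL * temperature * math.log(10.0)
    delta = (ph - pka) if acidic else (pka - ph)
    return rt_ln10 * delta


# Approximate textbook side-chain pKa values in water (illustrative assumption).
side_chains = {
    "Arg": (12.5, False),
    "Lys": (10.5, False),
    "His": (6.0, False),
    "Asp": (3.9, True),
    "Glu": (4.2, True),
}

for name, (pka, acidic) in side_chains.items():
    cost = neutralization_cost(pka, ph=7.0, acidic=acidic)
    print(f"{name}: {cost:5.1f} kcal/mol to neutralize at pH 7")
```

Under these assumptions the cost amounts to a few kcal/mol for most of the titratable residues, which is why it has to be added to the solvation free energy of the de-charged state before the two protonation states are compared.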

Figure 6.3: Systems containing arginine and aspartic acid at different positions and protonation states.

6.4 Solvation free energy as function of lipid type (paper IV)

A number of experimental observations for individual membrane proteins have emphasized the importance of the membrane composition for proteins to retain proper function. The typical membrane model used in molecular dynamics simulations includes only a single type of lipid, while a biological membrane consists of a mixture of different types of lipids as well as proteins and carbohydrates. Here the aim was to systematically evaluate the influence different types of bilayer lipids have on amino acid solvation properties. To evaluate bilayers with different characteristics, a set of six different lipid types was used, including positively and negatively charged lipids as well as zwitterionic lipids with different headgroup geometries and tail lengths. The solvation properties were determined for eight different amino acids comprising a representative test set of all twenty amino acids. We used the same basic method as presented in paper II to calculate solvation profiles for all 48 configurations. A number of perhaps expected but nevertheless intriguing findings were made that explain some observed lipid preferences for different membrane proteins. The maximum solvation cost for polar amino acids is lower in a thinner membrane since the effort to retain hydration and snorkeling is facil-
