1.7: Protein Structure
So far, we have really only thought about proteins like this: CTYQVYKHPM. A sequence of amino acids. This is what we call the primary structure of a protein. Each amino acid has specific chemical properties that influence the overall structure of the protein. But a protein is much, much more than its primary structure. First, we have to note that a protein has different structures in different environments. A protein's structure defines its function (proteins work by physically interacting with other molecules, and those interactions are dictated by structure). Almost all of these functions take place inside a cell, in a very specific cellular environment. Luckily, most of that environment is just water, so we have a pretty good understanding of the basic forces that make proteins "fold" into their structures. These chemical interactions give us the secondary and tertiary structures of a protein. When we try to predict a protein's structure, we are usually looking for the tertiary structure, or, if there are multiple subunits, the quaternary structure.
Proteins are not just long strings of amino acids. They fold over onto themselves to create complex structures. How do they do that? Well, there are chemical interactions between the amino acids that pull some together, push some apart, and connect some directly. There are also chemical interactions with the external environment: for example, amino acids can be hydrophobic or hydrophilic, and because the protein is usually in water, the hydrophilic amino acids go "outside" and the hydrophobic amino acids go "inside". Other parts of the external environment, like pH, ion concentrations, and temperature, also affect the structure of the protein. Finally, many proteins require the help of other proteins (called "chaperone proteins") to fold properly.
In addition to all the different factors that go into the folding of a protein, there is also the problem of just how many possible structures a protein can take. Imagine taking a short string and finding all the ways you can twist it up. Now do that with a really long string. Exhaustively searching through possible protein structures is computationally intractable: the sequence alone isn't enough information, and the space of conformations is far too large. So we generally figure out protein structure by...looking at the protein.
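To get a feel for the scale of the problem, here's a quick back-of-the-envelope calculation. The assumption that each amino acid can take about three backbone conformations is a common illustrative simplification, not a measured value:

```python
# Levinthal-style illustration of the conformational explosion.
# "3 states per residue" is an illustrative assumption, not a measurement.

def num_conformations(n_residues, states_per_residue=3):
    """Rough count of backbone conformations for a chain of n residues."""
    return states_per_residue ** n_residues

print(num_conformations(10))    # 59049 -- already large for a tiny peptide
print(num_conformations(100))   # ~5e47 -- hopeless to enumerate
```

Even a modest 100-residue protein has astronomically more conformations than could ever be checked one by one.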
Experimental Techniques
There are three basic experimental techniques we use to determine protein structure. Experimental determination is still the standard approach today; if we have the opportunity and the resources to actually look at the protein and deduce the structure, that's the best way. We'll talk about computational techniques to predict protein structure in later sections, but experiments are still how we validate those predictions.
- X-ray Crystallography: This is the classic technique. You make a crystal out of your protein (the same way you would make a crystal out of anything else: precipitate it from a supersaturated solution) and then fire X-rays at it. The specific arrangement of the electrons in the protein crystal diffracts the X-rays, and you can deduce electron densities, and subsequently the structure of the protein, from the diffraction patterns. X-ray crystallography is one of those fields that is legitimately extremely difficult to get really good at, but it's the gold standard for protein structure. The main drawback is that some proteins do not crystallize well, so the technique simply doesn't work for them.
- (Cryo-)Electron microscopy: This one is the easiest to conceptualize: just look at the protein under a microscope! Now, you can imagine that it is very difficult to look at very tiny things, and you would be correct. You have to freeze the protein and put it on a special grid. This means that you can only get limited results for proteins with flexible structures (aka clusters of amino acids that are "floppy" and don't form rigid structures; this is actually extremely common). Also, the proteins have to be large, and the resolution tends to be lower than crystallography's.
- NMR Spectroscopy: Another common structural chemistry technique is NMR spectroscopy. What you do here is put the protein in solution in a magnetic field and blast it with radio waves. The resonance patterns indicate which nuclei are close to each other (through bonds and through space). From this information, you can solve for the whole structure. This method is good for proteins that do not crystallize well and for flexible proteins, but the proteins have to be relatively small.
So what information do we actually get out of these methods? The main stuff is literally: the position of each atom in the protein, the bond angles, and any other additional information we get from these methods (like, for instance, how stiff or floppy a particular part of the protein is). This information is given in the order of the atoms in the amino acids in the protein sequence and stored in a database called the Protein Data Bank.
Protein Data Bank
The Protein Data Bank (PDB) was established in 1971 as an open-access database of protein structural information. Each entry not only contains "solved" protein structures, but also a lot of other information about the protein itself.
The main file format used in the PDB is the PDB file format.
The main parts of a PDB file are as follows:
- First, a large header that includes a lot of metadata (aka what kind of protein this is, details of the experiment, who found the structure, etc.) as well as some structural information about the protein that isn't easy to put in the next part
- The amino acid sequence of all of the peptide chains
- A list of all the atoms in the protein, which amino acid they are in, their 3D positions, and some chemical information about them.
- A list of other atoms that are not part of the protein itself (like a ligand or an ion that affects the structure) with their positions and chemical information.
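As a concrete illustration of that third part, here's a minimal Python sketch of reading the atom records out of a PDB file. PDB is a fixed-column text format, so we slice each line by column position (the column ranges below follow the wwPDB format specification; the example ATOM line is made up):

```python
# Minimal sketch of reading ATOM/HETATM records from a PDB file.
# PDB is a fixed-column format, so we slice by column position
# (ranges from the wwPDB format specification).

def parse_atoms(lines):
    atoms = []
    for line in lines:
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):   # HETATM = non-protein atoms (ligands, ions)
            atoms.append({
                "record": record,
                "name": line[12:16].strip(),      # atom name, e.g. "CA"
                "res_name": line[17:20].strip(),  # amino acid, e.g. "ALA"
                "chain": line[21],
                "res_seq": int(line[22:26]),      # residue number in the chain
                "x": float(line[30:38]),          # 3D position in Angstroms
                "y": float(line[38:46]),
                "z": float(line[46:54]),
            })
    return atoms

example = ["ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C"]
print(parse_atoms(example)[0]["res_name"])  # ALA
```

Real parsers (and the activities for this chapter) handle many more record types, but this shows why the format is easy for people to read and awkward for programs: everything depends on exact column positions.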
If you open up a PDB file, you may notice that it is actually pretty easy to read. That's good for us! But it's bad for computers. We can write programs that read PDB files and display the protein structures, but PDB files have mostly been supplanted by a newer format, mmCIF, which is more flexible and easier for a computer to parse. New protein structures may not have PDB files associated with them, just mmCIF files. Two activities for this chapter explore the PDB itself and the PDB and mmCIF file formats.
Computational Prediction of Protein Structure
Just like with sequence assembly, there are two ways you can predict a protein structure: one way is to do it completely from scratch (ab-initio) and the other way is to use existing information from similar proteins. We'll start with the second way.
Homology Modeling
Recall that "homology" means "shared through common ancestry". There are two principles of homology modeling:
- Protein structure is entirely determined by protein sequence (we know that this is technically not true, but the problem that all of these prediction methods are trying to solve is "what is the structure of the protein given the sequence?")
- Protein structure evolves more slowly than protein sequence (there are silent mutations, and missense mutations might lead to similar amino acids that don't disrupt the overall structure much). Thus, sequence similarity implies strong structural similarity.
Basically, as long as the two aligned sequences are long enough and share enough sequence similarity, you can use homology modeling to predict one structure from the other. The first step is to take your protein sequence and BLAST it to find similar sequences. You then have to align your sequence with these best matches, and a multiple sequence alignment (MSA) algorithm is often useful here if the sequence similarity is low (so you can use information from multiple different hits). Each amino acid contributes four "backbone" atoms: N, C_alpha (the alpha carbon), C, and O. For amino acids that are aligned to the similar sequence (called the "template"), you can simply copy over the coordinates of the matched backbones. If the amino acids themselves match, you can also copy the side chain coordinates. If they don't match, you have to try to find the best conformation of the side chain in 3D space, using pre-existing information about general trends in how these side chains are positioned in the context of backbones (i.e. some backbone conformations force the side chains into specific places).
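Here's a toy Python sketch of just the backbone-copying step. The alignment, the coordinates, and the function name are all invented for illustration; real homology modeling tools do far more (side chain placement, clash resolution, loop modeling):

```python
# Toy sketch of the backbone-copying step in homology modeling.
# Input: a pairwise alignment (gaps as "-") plus template backbone
# coordinates, one dict of atom -> (x, y, z) per ungapped template residue.
# All names and data here are invented for illustration.

def copy_backbone(query_aln, template_aln, template_coords):
    """Return a model: query residue index -> backbone coordinates,
    or None where the query has no aligned template residue."""
    model = {}
    q_idx = 0          # position in the ungapped query sequence
    t_idx = 0          # position in the ungapped template sequence
    for q, t in zip(query_aln, template_aln):
        if q != "-" and t != "-":
            model[q_idx] = template_coords[t_idx]   # aligned: copy backbone
        elif q != "-":
            model[q_idx] = None                     # insertion: needs loop modeling
        if q != "-":
            q_idx += 1
        if t != "-":
            t_idx += 1
    return model

coords = [{"N": (0,0,0), "CA": (1,0,0), "C": (2,0,0), "O": (3,0,0)} for _ in range(4)]
model = copy_backbone("ACD-F", "AC-EF", coords)
print(sorted(k for k, v in model.items() if v is None))  # [2]: needs loop modeling
```

Notice that residues opposite a template gap come out with no coordinates at all; those are exactly the positions the loop-modeling step in the next paragraph has to fill in.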
To deal with potential gaps in the alignment, you have to either insert or delete some amino acids in the model compared to the template. This requires changing the backbone. One useful rule is that these backbone changes never occur inside specific secondary structure elements, like alpha helices and beta sheets, so you can move all of those possible changes to the outside of those structures (to areas called "loops"). Now it becomes difficult. You can either search through known protein structures to guess how these insertions/deletions will affect the structure, or you can use a model that just uses physics to minimize the energy of the structure (this is also what is used in ab-initio prediction).
To optimize the model, you can iteratively predict the backbone, then the loops/side chains, then the backbone, then the loops/side chains, etc. You can also put the physics into the model and then run a molecular dynamics simulation to see where the protein "ends up" given your initial guess and the laws of physics. All of this used to be done manually, with specialized 3D visualization software. Now, however, we can automate it. One major automated homology-modeling method is SWISS-MODEL, which is available online.
Fold Recognition/Threading
This method is based on the "fold recognition problem", which asks: given a particular protein sequence, which of the known ways a protein can fold would we predict? This method involves physics more heavily, and is used when there is no good homology match in the PDB. It is also called "threading" because the basic idea is that we take a backbone that we get from a homology search, then we try to "thread" a different protein sequence through it. If that sequence has a similar structure to the homologous sequence, it'll fit through well. If it doesn't, it won't. Sometimes, however, an unrelated sequence fits very well! This is the benefit of fold recognition over plain homology modeling: you can predict structures for some unrelated sequences.
Fold recognition is based on two principles:
- There aren't that many types of protein folds that exist in nature (~1000).
- The vast majority of newly-solved structures do not have newly-solved folds.
Thus, even if your protein is not closely related to other proteins, there are only a few ways in which protein folds actually happen, so you might be able to predict the structure of the protein anyways.
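To make the threading idea concrete, here's a toy Python score that reduces a fold to a "buried"/"exposed" label per position and rewards hydrophobic residues in buried positions. The scoring rule is invented for illustration; real threading potentials are derived statistically from the PDB:

```python
# Toy illustration of threading: score how well a sequence fits a known
# fold, where the fold is reduced to one environment label per position.
# The +1/-1 scoring rule is made up; real potentials are PDB-derived.

HYDROPHOBIC = set("AVLIMFWC")   # one-letter codes of hydrophobic amino acids

def threading_score(sequence, fold_envs):
    """Higher score = sequence more compatible with the fold."""
    score = 0
    for aa, env in zip(sequence, fold_envs):
        hydrophobic = aa in HYDROPHOBIC
        if (env == "buried") == hydrophobic:
            score += 1   # hydrophobic inside / polar outside: good fit
        else:
            score -= 1   # mismatched environment: poor fit
    return score

fold = ["buried", "exposed", "buried", "exposed"]
print(threading_score("VKLD", fold))   # 4: every residue fits its environment
print(threading_score("KVDL", fold))   # -4: every residue mismatched
```

The point of the example: the score depends only on whether the sequence is *compatible* with the fold, not on whether the two proteins are related, which is exactly why threading can work for unrelated sequences.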
Ab-Initio Prediction
If you want to predict a protein structure from scratch, then the general strategy is: use our knowledge of physical and chemical forces to find the arrangement of the amino acids in your sequence that minimizes the free energy of the protein (aka is in the most "relaxed" configuration). The biggest difficulty here is that there are so many possible configurations a protein can take! How do you search that space for the one with the lowest free energy?
One thing to do is simplify the search space. You can restrict the conformations of side chains to those already found in the PDB (similar to the fold recognition strategy). You can break the protein sequence down into smaller pieces and then put them together again later. You can pretend that every atom can only occupy points on a 3D lattice. All of these strategies make it easier to find a conformation with low free energy, but because they are all simplifications, there's a limit to their ultimate accuracy. Sometimes, to make a problem doable, you have to simplify it.
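Here's a small Python sketch of the lattice simplification, using a 2D lattice for brevity: every residue sits on a grid point, no two residues can overlap, and we simply enumerate the self-avoiding conformations. Even with this drastic simplification, the count grows quickly with chain length, which is why longer chains still need heuristic search rather than enumeration:

```python
# Lattice simplification sketch: a "conformation" is a self-avoiding walk
# on a 2D grid (one point per residue). We enumerate all of them for
# short chains; the counts grow fast with length.

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def count_conformations(n_residues):
    """Count self-avoiding walks of n_residues points, starting at the origin."""
    def extend(path):
        if len(path) == n_residues:
            return 1
        x, y = path[-1]
        total = 0
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in path:          # self-avoiding: residues can't overlap
                total += extend(path + [nxt])
        return total
    return extend([(0, 0)])

for n in (2, 4, 8):
    print(n, count_conformations(n))   # 4, 36, and 2172 conformations
```

In a real lattice model, each conformation would then be scored with an energy function (for example, rewarding hydrophobic contacts) and the lowest-energy one kept.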
Once you have a search space, you have to calculate the free energy of particular conformations. The function used to calculate this is called a "potential" function, and it incorporates interactions between the peptides and water (or another solvent), electrostatic interactions, van der Waals interactions, and covalent interactions (aka chemical and physical forces).
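Here's a toy potential function in Python with just two of those terms: a Lennard-Jones term for the van der Waals interactions and a Coulomb term for the electrostatics. The functional forms are standard, but the parameter values are arbitrary illustrative choices, and the solvent and covalent terms are omitted:

```python
# Toy "potential" function: total energy as a sum of pairwise terms.
# Lennard-Jones (van der Waals) + Coulomb (electrostatics) only;
# parameters are arbitrary, solvent and covalent terms omitted.

import math

def pair_energy(r, q1, q2, sigma=1.0, epsilon=1.0, k_e=1.0):
    lj = 4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)  # van der Waals
    coulomb = k_e * q1 * q2 / r                                # electrostatics
    return lj + coulomb

def total_energy(positions, charges):
    """Sum pairwise energies over all atom pairs."""
    e = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            e += pair_energy(r, charges[i], charges[j])
    return e

positions = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.2, 0.0)]
charges = [0.5, -0.5, 0.0]
print(total_energy(positions, charges))   # negative: a favorable arrangement
```

A real potential would add many more terms, but the shape is the same: loop over interacting pairs, add up physically motivated contributions.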
Finally, once you have your search space and the function you want to minimize, you can use any standard optimization algorithm to find the lowest free energy structure (or something close).
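As a sketch of that last step, here's a minimal simulated-annealing loop in Python. The "conformation" is just a vector of torsion-like angles and the "free energy" is a made-up smooth function, so this only illustrates the shape of the optimization, not a real prediction:

```python
# Minimal simulated-annealing sketch: treat structure prediction as
# minimizing an energy over a conformation. The energy here is a toy
# stand-in (minimized when all angles are 0), not a real potential.

import math, random

def energy(angles):
    return sum(1 - math.cos(a) for a in angles)   # global minimum: 0

def anneal(n_angles=5, steps=20000, seed=0):
    rng = random.Random(seed)
    state = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    e = energy(state)
    for step in range(steps):
        temp = 1.0 * (1 - step / steps) + 1e-3    # simple cooling schedule
        cand = state[:]
        cand[rng.randrange(n_angles)] += rng.gauss(0, 0.3)   # perturb one angle
        e_cand = energy(cand)
        # accept downhill moves always, uphill with Boltzmann probability
        if e_cand < e or rng.random() < math.exp(-(e_cand - e) / temp):
            state, e = cand, e_cand
    return state, e

state, e = anneal()
print(round(e, 3))   # small: near the global minimum of 0
```

Any standard optimizer (gradient descent, genetic algorithms, Monte Carlo variants) could be slotted into the same role; annealing is shown because accepting occasional uphill moves helps escape local minima, which rugged energy landscapes like protein folding are full of.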
AlphaFold
Suppose you wanted to create a method to distinguish between dogs and cats. What information would you give it? You might give it some physical characteristics, like leg-to-head length ratio, or number of teeth, or something like that. Number of legs would be useless, because both dogs and cats generally have four legs. If you have DNA sequences, that would be even better! Once you make these observations, you can train a model to distinguish between dogs and cats using them. How do you do this? A model will have some "parameters": numbers that change depending on how you train it. So let's say that "number of teeth < X --> this is a cat" is some part of the model. The "X" would be the parameter. If you trained this model on a bunch of dogs and cats (by figuring out which value of X leads to the most accurate classification), you might come up with X = 35 (dogs in general have 42 teeth and cats have 30, with some variation like everything else). Then, you can check whether your model is good by testing it on a new dataset of "unclassified" dogs and cats. Here's one reason you want to do this: suppose you got your dog and cat teeth data from a veterinarian, who would potentially oversample dogs and cats with fewer teeth than normal, which might push your initial best value of X closer to 30. If you test the model on a less biased dataset, it might not do as well. So you test your model, and if it's good, great! If it's not, you either 1) find better data to train your model on or 2) change the model structure a bit (maybe add in another observation, like leg-to-head ratio).
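Here's that teeth-threshold classifier made concrete in Python. The training data below is invented, but uses the typical adult counts from the text (dogs around 42 teeth, cats around 30, with some variation):

```python
# The dog/cat teeth classifier made concrete: "training" means finding
# the parameter X in "teeth < X -> cat" that best fits labeled data.
# The eight training animals below are invented for illustration.

def train_threshold(teeth_counts, labels):
    """Find the cutoff X that misclassifies the fewest training animals."""
    best_x, best_correct = None, -1
    for x in range(min(teeth_counts), max(teeth_counts) + 2):
        correct = sum(
            (label == "cat") == (teeth < x)        # prediction matches label?
            for teeth, label in zip(teeth_counts, labels)
        )
        if correct > best_correct:
            best_x, best_correct = x, correct
    return best_x

teeth  = [42, 41, 42, 40, 30, 30, 28, 29]
labels = ["dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat"]
x = train_threshold(teeth, labels)
print(x)   # 31: the smallest cutoff that perfectly separates this training set
```

If you reran this on a vet-biased dataset with toothless dogs mixed in, the learned X would shift, which is exactly why you hold out a separate test set.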
Anyways, I just described the process of machine learning. That's it. All that "deep learning", "AI" stuff? It's just that. It's much, much fancier! But it's fundamentally that. A neural net/deep learning algorithm uses a very large number of very simple operations like "Is this data input + 2*this other data input > 4?" (the "parameters" in this model include which data inputs to use, that "2", and that "4"), and all the results of those operations are connected in a potentially complicated way (similar to how neurons are connected; hence the name "neural network"). Note that being very clever about the "fancy" stuff won the 2024 Nobel Prize in Physics!
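Here's what one of those "very simple operations" looks like in Python, with a few of them wired together. The weights and thresholds are hand-picked (rather than trained) to compute XOR, a classic function that no single threshold unit can compute on its own:

```python
# One "very simple operation": is a weighted sum of inputs above a
# threshold? Wiring a few together computes XOR, which no single such
# unit can. Weights/thresholds are hand-picked here, not trained.

def unit(inputs, weights, threshold):
    return 1 if sum(i * w for i, w in zip(inputs, weights)) > threshold else 0

def xor(a, b):
    h1 = unit([a, b], [1, 1], 0.5)       # fires if a OR b
    h2 = unit([a, b], [1, 1], 1.5)       # fires if a AND b
    return unit([h1, h2], [1, -2], 0.5)  # OR but not AND

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

In a real network, those weights and thresholds are the parameters found by training, and there are millions (or billions) of units instead of three.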
So, why did I bother going through that here? Well, in the last 7 years, there has been a huge, game-changing development in protein structure prediction using these deep learning models. Every two years, there is an international protein structure prediction competition called CASP. In 2018, a submission called "AlphaFold" from Google's AI wing, DeepMind, placed first in the "free" modeling problems (where there is no homology in the PDB) and second in the "template-based" modeling problems (which use homology).
The original AlphaFold algorithm used a type of neural network called a convolutional neural network that was trained on PDB data. What it was trained to do is predict the physical distances between pairs of carbons in the protein (the beta carbons, or the alpha carbon for glycine, which has no beta carbon). (If the protein were just a big long line, these distances would simply grow with the number of residues between them, but if the protein is folded, then carbons that are not nearby in the sequence can be nearby in space.) The hard part was figuring out what network structure worked well for this. The neural network takes in protein sequence and multiple sequence alignment (MSA) data, and puts out a distance matrix for all of these carbons. An initial guess of the overall protein structure (aka carbon positions) is deduced from these carbon distances, and then a physical force-based model is used to optimize the structure to get to the lowest energy state. AlphaFold demonstrated that you could use machine learning to predict protein structures effectively.
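Here's a small Python sketch of the representation AlphaFold predicted: a pairwise distance matrix over carbons. We go the easy direction (coordinates to distances) just to show why the matrix captures folding; the coordinates below are invented, with roughly 3.8 Å between consecutive alpha carbons:

```python
# Sketch of a pairwise carbon distance matrix. Here we compute it from
# coordinates (the easy direction); AlphaFold predicted the matrix and
# then solved the hard inverse problem: distances -> coordinates.

import math

def distance_matrix(coords):
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

# extended chain: distance grows with sequence separation
extended = [(i * 3.8, 0.0, 0.0) for i in range(5)]   # ~3.8 A between consecutive CAs
# "folded" chain: residue 4 loops back next to residue 0
folded = [(0, 0, 0), (3.8, 0, 0), (3.8, 3.8, 0), (0, 3.8, 0), (0, 0.5, 0)]

print(round(distance_matrix(extended)[0][4], 1))  # 15.2: far in sequence, far in space
print(round(distance_matrix(folded)[0][4], 1))    # 0.5: far in sequence, close in space
```

That off-diagonal short distance in the folded case is the signature of a fold; predicting where those short distances occur is most of the battle.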
Then, two years later, at the next CASP, a completely different method by the same people called AlphaFold2 came around and completely destroyed the competition. This algorithm had a ton of new machine-learning innovations as well as better ways to integrate biological data.
Post-AlphaFold
In 2021, AlphaFold 2 was published in open-source form (meaning anyone can incorporate it into their methods, and many people have done so since). AlphaFold 3, released in 2024, focuses on predicting complexes: structures of multiple proteins and/or proteins with ligands or nucleic acids. In 2024, John Jumper, who led the team that created AlphaFold 2, and Demis Hassabis, the head of Google's AI unit, won the Nobel Prize in Chemistry, along with another person I haven't mentioned yet: David Baker. While the DeepMind team sort of just "jumped in" (get it?) to protein folding from an AI background, Baker and his group have been working on protein folding for decades. His group's main algorithm is called Rosetta, and it was one of the leading ab-initio protein structure prediction methods. His group mostly works on protein design these days: creating new proteins that can be used in clinical and other settings.
Lab Activity
A lab activity called "protein_structure_lab.zip" can be found at this link. It contains an RMarkdown file for the lab as well as data files.


