Biology LibreTexts

4.14: Predicting Structure from Sequence and Sequence from Structure/Function (New 10/24)


    Learning Goals (ChatGPT o1, 1/27/25)

    Below is a list of learning goals designed for junior and senior biochemistry majors to help them master the key concepts in predicting protein structure from sequence and the reverse protein folding problem:

    1. Understand the Evolution of Protein Structure Determination:

      • Explain the shift from experimental techniques (NMR, X‑ray crystallography, cryo‑EM) to computational structure prediction.
      • Discuss how large databases (e.g., PDB) and evolutionary information enable structure prediction from protein sequences.
    2. Grasp the Fundamentals of the Protein Folding Problem:

      • Describe the “protein folding problem” and why predicting 3D structure from a linear sequence is both challenging and crucial.
      • Compare traditional homology modeling with modern machine learning approaches (e.g., AlphaFold, RoseTTAFold).
    3. Evaluate Computational Structure Prediction Methods:

      • Interpret and compare metrics (RMSD, TM-score, ipTM, pTM) used to assess the accuracy of predicted protein structures and complexes.
      • Explain how AI models leverage hidden evolutionary signals in sequences to generate accurate 3D models.
    4. Explore Protein-Protein Interaction Prediction:

      • Understand how tools like AlphaFold3 are used to predict the quaternary structures of protein complexes.
      • Evaluate the statistical comparisons between experimentally determined and predicted complex structures.
    5. Analyze the Reverse Protein Folding Problem:

      • Describe methods for deducing protein function from 3D structure when sequence homology is limited.
      • Explain how structural comparison tools (e.g., FoldSeek) utilize a 3D interaction alphabet to compare protein structures.
    6. Understand De Novo Protein Design:

      • Outline the steps involved in designing a protein from scratch (e.g., using RFDiffusion and ProteinMPNN).
      • Describe how conditioning on functional motifs or binding sites can lead to the creation of novel protein binders, catalysts, or symmetric assemblies.
    7. Assess Applications and Case Studies:

      • Review examples such as de novo designed IL-6 mimetics, influenza HA binders, pentameric helical bundles, and symmetric oligomers.
      • Discuss how these designed proteins have potential applications in therapeutics, vaccines, biosensors, and industrial catalysts.
    8. Critically Evaluate the Integration of Computational and Experimental Methods:

      • Recognize the importance of experimental validation (e.g., X‑ray crystallography, cryo‑EM, NMR) for confirming computational predictions.
      • Reflect on how computational tools are used to narrow down plausible candidates for experimental testing, not replace bench work.
    9. Consider Future Challenges and Directions:

      • Identify remaining challenges in designing effective binders, catalysts, and proteins with dynamic conformational flexibility.
      • Discuss the potential impact of continued advancements in AI and machine learning on the field of protein engineering and functional annotation.

    These learning goals will guide you through understanding how modern computational techniques—especially protein language models and transformer architectures—are transforming our ability to predict, analyze, and design protein structures and functions.

    Recent Updates: November 2024

    The Protein Folding Problem:  Sequence to 3D Structure

    In Chapter 3.4:  Analyses of Protein Structure, we discussed how protein structure can be experimentally analyzed and the 3D structure of a protein determined using NMR, X-ray crystallography, and cryo-EM. Now, using sequence and structural databases (like the PDB with over 227,000 structures), we can often predict the 3D structure of a protein just from its linear sequence by comparing the sequence of a protein of unknown tertiary structure to homologous proteins (by sequence) whose 3D structures are known. Machine learning and artificial intelligence have extended earlier and simpler "homology modeling" attempts to allow structural predictions for millions of protein sequences using programs such as RoseTTAFold and AlphaFold. These programs produce high-quality structure predictions after training on the structures in the Protein Data Bank together with the vast sequence information in sequence databases.  Embedded in those linear sequences is a large amount of hidden (to the human eye) evolutionary information that machine learning and AI can harness to predict 3D structures.  They work less well when few related sequences are available for comparison.

    In general (for smaller proteins), the protein folding problem, the prediction of structure from sequence, appears to have been "solved". The Nobel Prize in Chemistry in 2024 was awarded to Demis Hassabis and John M. Jumper from Google DeepMind for developing AlphaFold and David Baker for developing RoseTTAFold and other powerful techniques described below.   (Check out the previous section, Chapter 4.13: Predicting Structure and Function of Biomolecules Through Natural Language Processing Tools, if you are interested in a "deep dive" into how these programs work!)

    A comparison of protein structures obtained using these programs with known 3D structures obtained through X-ray crystallography or other techniques shows that they are often almost identical. Different metrics can be used to compare predicted structures to the actual ones. The root mean squared deviation (RMSD) is a common one. RoseTTAFold uses a TM-score to assess the topological similarity of protein structures. Compared to RMSD, the TM-score weights smaller distance errors higher than larger ones, making it sensitive to the global fold, not local structural differences. TM-scores range from 0 to 1 (1 is a perfect match). Scores below 0.17 indicate no topology match, while those above 0.5 suggest a common fold.
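
    To make the difference between the two metrics concrete, here is a minimal Python sketch. It assumes the two structures are already superimposed and that residues are paired one-to-one; real tools also search for the best superposition, which this sketch skips. The TM-score is computed on the conventional 0 to 1 scale (multiply by 100 for a percentage-style scale). The function names, the toy coordinates, and the 0.5 Å floor on the distance scale for very short chains are assumptions made for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (in the input units, e.g. angstroms) between paired, pre-superimposed points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def tm_score(coords_a, coords_b, target_length=None):
    """TM-score for one fixed superposition, on the 0-1 scale.

    d0 is the length-dependent distance scale of the TM-score definition.
    The 0.5 floor used for very short chains is an assumption of this sketch.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    n = target_length if target_length is not None else len(coords_a)
    d0 = 1.24 * (n - 15) ** (1.0 / 3.0) - 1.8 if n > 21 else 0.5
    d0 = max(d0, 0.5)
    terms = (1.0 / (1.0 + (math.dist(p, q) / d0) ** 2)
             for p, q in zip(coords_a, coords_b))
    return sum(terms) / n


# Toy data: 100 "residues" on a line, 3.8 units apart (not a real protein).
reference = [(3.8 * i, 0.0, 0.0) for i in range(100)]
# Case 1: every residue is displaced by 3 units.
shifted = [(x, y + 3.0, z) for x, y, z in reference]
# Case 2: 99 residues are perfect and one is displaced by 30 units.
outlier = list(reference)
outlier[50] = (reference[50][0], 30.0, 0.0)

for name, model in (("uniform error", shifted), ("single outlier", outlier)):
    print(f"{name}: RMSD = {rmsd(reference, model):.2f}, "
          f"TM-score = {tm_score(reference, model):.3f}")
```

    Both toy models have the same RMSD, yet the TM-score is far higher for the model with a single badly placed residue, because each residue's contribution to the TM-score is capped while squared errors in the RMSD are not.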

    RoseTTAFold uses a "three-track" neural network, "meaning it simultaneously considers patterns in protein sequences, how a protein’s amino acids interact, and its possible three-dimensional structure. In this architecture, one-, two-, and three-dimensional information flows back and forth, allowing the network to collectively reason about the relationship between a protein’s chemical parts and its folded structure". Programs of this type might allow the generation of proteins with new therapeutic or commercial potential based on sequences. These include vaccines, sensors, specific immune system suppressors or activators, and antivirals. AlphaFold has now been used to predict the structure of 214 million proteins from more than one million species, essentially all known protein-coding sequences. We have included many AlphaFold iCn3D models throughout this book.  

    Figure \(\PageIndex{1}\) shows the backbone tube cartoon of the X-ray PDB structure of a small protein (1xww, cyan) and the structures predicted by both RoseTTAFold and AlphaFold (magenta) just from its primary sequence. Sulfate, a competitive inhibitor, is shown (spacefill) bound in the active site. The alignment is spectacular, except for the N-terminal 5 amino acids at the bottom of the figure (6 o'clock). This stretch is more disordered even in the X-ray structure, as its amino acids have high B-factors, indicating more conformational flexibility. 

    aligned_1xww_RoseTTAFold_Sess1_B.png 1xww_AF_p2466_align.png
    Figure \(\PageIndex{1}\): Comparison of the X-ray and computationally predicted structures of human low molecular weight protein tyrosine phosphatase

    Left panel: X-ray structure (cyan) of low molecular weight protein tyrosine phosphatase with bound SO₄²⁻ (1xww) and the corresponding structure predicted by RoseTTAFold (magenta).  Right panel: Same structures using AlphaFold for the structural prediction.

     

    Prediction of Protein-Protein Interactions

    AlphaFold3 has also been used to predict the structure of protein complexes in which multiples of the same or a different protein subunit combine to form a larger, quaternary structure. Figure \(\PageIndex{2}\) shows an interactive iCn3D model of a recent stunning example of a predicted AlphaFold complex required to bind a human sperm and egg.  The complex consists of three transmembrane human sperm proteins and a human egg protein attached to the egg membrane with a posttranslational lipid anchor (not shown). 

    HumanSpermEggProteinComplexAlphaFold4-mer.png

    Figure \(\PageIndex{2}\): Human sperm proteins and egg protein complex predicted by AlphaFold. (Copyright; author via source). The spacefill sections of the three sperm proteins attach the proteins to the sperm membrane. The egg protein is JUNO (cyan), and the three sperm proteins are IZUMO1 (magenta), SPACA6 (brown/gold), and TMEM81 (blue).

    Reference for PDB file: Deneke, V. et al. A conserved fertilization complex bridges sperm and egg in vertebrates. Cell. October, 2024.  https://doi.org/10.1016/j.cell.2024.09.035.   Creative Commons Attribution (CC BY 4.0)

    You can download this iCn3D file and load it in iCn3D using these commands to see the structure as rendered in the image above:  IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!

    • Open iCn3D
    • File, Open, iCn3D PNG appendable and browse for the file in your download folder.

    Here is the link to the AlphaFold Server 3. AlphaFold3 is based on a machine-learning process called diffusion, which is explained in more detail below. 

    The following biological species can be modeled in AlphaFold3: 

    • macromolecules, including proteins, DNA and RNA
    • common ligands including ATP, ADP, AMP, GTP, GDP, FAD, NADP, NADPH, NDP, heme, heme C, myristic acid, oleic acid, palmitic acid, citric acid, chlorophylls A and B, bacteriochlorophylls A and B
    • common ions such as Ca²⁺, Co²⁺, Cu²⁺, Fe³⁺, K⁺, Mg²⁺, Mn²⁺, Na⁺, Zn²⁺, and Cl⁻
    • common post-translational modifications (PTMs) of amino acid residues such as phosphorylation of serine, threonine, tyrosine, and histidine, acetylation of lysine residues, methylation of lysine and arginine, malonylation of cysteine, hydroxylation of proline, lysine, and asparagine, palmitoylation of cysteine, succinylation of asparagine, S-nitrosylation, formylation of tryptophan, crotonylation of lysine, citrullination of lysine and arginine
    • glycan chains (including branched chains) composed of some sugars, including alpha/beta-D-glucose, alpha/beta-D-mannose, alpha-L-fucose, beta-D-galactose, N-acetyl-beta-D-glucosamine
    • common chemical modifications of DNA (including methylation of cytosine, guanine, and adenine, carboxylation of cytosine, oxidation of guanine, formylation of cytosine) and RNA (including isomerization of uridine into pseudouridine, formylation of cytosine, and methylation of cytosine, guanine, adenine, and uracil)
    • structures composed of multiple proteins, nucleic acids, ligands, ions, and chemically modified derivatives. 

    Note that simple drugs are not on the list since those applications are proprietary (at present). AlphaFold 3 can now be downloaded for academic (non-commercial) applications and likely includes drugs. A more limited web version is available here.

    It is important to statistically compare the experimentally determined (PDB) and AF3-predicted structures of complexes.  One such comparison involves wild-type complexes and their mutated forms, which should differ by just subtle conformational changes.  For example, consider the complex of human angiogenin and placental ribonuclease inhibitor.  Figure \(\PageIndex{3}\) shows an interactive iCn3D model of the experimental human angiogenin and placental ribonuclease inhibitor complex (1A4Y).

    Human angiogenin - placental ribonuclease inhibitor complex (1A4Y).png

    Figure \(\PageIndex{3}\): Human angiogenin - placental ribonuclease inhibitor complex (1A4Y). (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...DaUAJzhh8H6LU8

    There are 27 mutant variants of the complex whose experimental structures are known.  They are included in the SKEMPI database, which contains thermodynamic and kinetic data for wild-type and mutant complexes with known PDB structures. One widely used thermodynamic parameter we have seen before is the change in thermodynamic stability (i.e. \(\Delta\Delta G^0 = \Delta G^0_{\text{mutant}} - \Delta G^0_{\text{wildtype}}\)), in this case for the complex, when key residues lining the binding pocket are mutated. The entire database contains over 317 protein-protein complexes and 8338 mutations. How well does AF3 predict the structure of mutant complexes?   
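
    As a worked illustration of the ΔΔG⁰ definition above, the short Python sketch below converts dissociation constants into standard binding free energies using ΔG⁰ = RT ln(K_D / 1 M) and then takes the mutant minus wild-type difference. The function names and the two K_D values are hypothetical and chosen only to show the arithmetic; they are not measured values for the angiogenin complex.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def delta_g_binding(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant in M."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_delta_g(kd_mutant, kd_wildtype, temperature_k=298.15):
    """ddG0 = dG0(mutant) - dG0(wildtype); positive means weaker binding."""
    return (delta_g_binding(kd_mutant, temperature_k)
            - delta_g_binding(kd_wildtype, temperature_k))


# Hypothetical values: the mutation weakens binding 10-fold.
kd_wt = 1e-9    # 1 nM (illustrative)
kd_mut = 1e-8   # 10 nM (illustrative)
print(f"ddG0 = {delta_delta_g(kd_mut, kd_wt):.2f} kcal/mol")
```

    With these illustrative numbers, a 10-fold loss in affinity corresponds to a ΔΔG⁰ of about +1.4 kcal/mol at 25 °C, which gives a feel for the scale of the values stored in such databases.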

    Three statistical values are used to compare the experimental and AF-predicted complex structures:

    • RMSD (Root-Mean-Square Deviation) measures the average distance between equivalent atoms in the complex subunits. Lower values show greater similarity between the experimental and AlphaFold 3 models.
    • ipTM (interface predicted Template Modeling score) is AF3's own estimate of how accurately the interface between the subunits is predicted.  Higher values indicate a more reliable interface.
    • pTM (predicted Template Modeling score), as with simple AF prediction, estimates the overall accuracy of the predicted structure (based on both backbone and sidechain orientations).  Higher pTM values indicate a more accurate prediction. 

    Figure \(\PageIndex{4}\) shows the structure of the wild-type RNase inhibitor-Angiogenin complex with 27 red dots indicating mutants with experimental structures (Panel A) and a comparison of the wildtype and a mutant structure (Panel B).  Statistical results comparing experimental and AF structures for all 317 complexes in the SKEMPI database are shown in Panels C and D.

    wee-wei-2024-evaluation-of-alphafold-3-s-protein-protein-complexes-for-predicting-binding-free-energy-changes-uponFig1.svg

    Figure \(\PageIndex{4}\): Wildtype and AF predicted composite structure of RNase inhibitor-Angiogenin complexes (Panel A and B) and statistical comparison of all structures in the SKEMPI database (panels C and D). JunJie Wee and Guo-Wei Wei.  J. Chem. Inf. Model. 2024, 64, 16, 6676–6683.  https://doi.org/10.1021/acs.jcim.4c00976. Published August 8, 2024. CC-BY 4.0.

    Panel A: The cartoon representation of ribonuclease inhibitor-angiogenin complex (PDB ID: 1A4Y). The ribonuclease inhibitor is shown in blue, and the angiogenin is shown in green. 27 mutation spots of 1A4Y in the S8338 data set are indicated in red.

    Panel B: The structural alignment of 1A4Y with its AF3 predicted complex.

    Panels C and D below are based on all the complexes studied, not just the RNase inhibitor-angiogenin complex.

    Panel C: The boxplot for RMSD, ipTM and pTM distributions of 317 predicted AF3 protein–protein complexes. RMSDs refer to the overall RMSD calculated by structurally aligning an AF3 complex with its original PDB complex.

    Panel D: The breakdown of AF3 protein-protein complexes based on their ipTM and pTM scoring criteria.

    The average values over the wild-type and mutant structures are 1.61 Å (RMSD), 0.803 (ipTM), and 0.847 (pTM). The graphs clearly show that most predictions (72%) had high ipTM scores (> 0.8), while 99% had reasonably high pTM scores (> 0.5). However, there were outliers, as shown in Panel C, with RMSD values > 4 Å, which indicate poorer performance by AF3.  Most complexes with high ipTM and pTM scores also had low RMSD values.   Experiments like these will be used to continually refine programs such as AlphaFold 3.

    Reverse Protein Folding Problem:  3D Structure to Function

    So now we have structures of 200 million plus proteins. Pick one, Protein X, of unknown function.  What might its function be?  Surely, its sequence could be compared to the entire database to find homologous proteins that might give a clue to the function of Protein X.  But what if the sequence of Protein X is very divergent from potential sequence homologs since they are very distant from each other evolutionarily? Also, what if comparison proteins in the database are underrepresented?  For example, our knowledge of the sequences and structures of proteins from pathogens (viruses and bacteria) is incomplete, especially since we have studied just a small fraction of the virus and bacterial world.

    To circumvent this problem, the actual predicted or determined 3D structure of Protein X (not its sequence) could be compared to the 3D structures from the databases.  This would seem difficult since it would be a 3D comparison, not a 1D comparison of linear sequences.  A program called FoldSeek makes 3D structural comparisons computationally easier.  

    Instead of using an alphabet of actual sequences (such as the single letter code for the 20 naturally occurring amino acids - ACDEFGHIKLMNPQRSTWYV), earlier methods used a "structural alphabet" based on the conformations of short stretches of 3-5 alpha C (Cα) atoms in the protein backbone, but such an alphabet doesn't explicitly contain the tertiary interactions found in proteins. Instead, FoldSeek uses a 3D interaction alphabet (3Di) with 20 states (the same number of letters as the amino acid alphabet), each defined from 10 interaction "features."  The 3Di state for residue i describes its geometry relative to its closest spatial neighbor, residue j.  The state description is therefore less dependent on the next amino acid in the linear sequence.  The state carries more information when residue i is in a conserved and packed protein core than in a nonconserved, more flexible loop. In contrast, there would be less information if just the backbone structural alphabet was used.  Figure \(\PageIndex{5}\) below gives a pictorial view of how the 3Di state for a single amino acid Val at a specific position in a 3D structure is defined.  Note that the state is built from 10 3D features, more than just the conformation of a backbone of 3 amino acids or the next amino acid in the linear sequence.

    Fast and accurate protein structure search with FoldseekFig1_x.svg

    Figure \(\PageIndex{5}\): Learning the 3Di alphabet. van Kempen, M., Kim, S.S., Tumescheit, C. et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 42, 243–246 (2024). https://doi.org/10.1038/s41587-023-01773-0.  Creative Commons Attribution 4.0 International License.   http://creativecommons.org/licenses/by/4.0/.

    (1) 3Di states describe tertiary interaction between a residue i and its nearest neighbor j. Nearest neighbors have the closest virtual center distance (yellow). Virtual center positions were optimized for maximum search sensitivity. (2) To describe the interaction geometry of residues i and j, we extract seven angles, the Euclidean Cα distance, and two sequence distance features from the six Cα coordinates of the two backbone fragments (blue and red). (3) These 10 features are used to define 20 3Di states by training a VQ-VAE modified to learn states that are maximally evolutionary conserved. The encoder predicts the best-matching 3Di state for structure searches for each residue.
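
    The first step of the caption, finding the nearest spatial neighbor j of a residue i and describing the pair geometrically, can be sketched in a few lines of Python. This is a simplified illustration, not FoldSeek's implementation: it uses raw Cα positions instead of optimized virtual centers, skips close sequence neighbors with a hypothetical `min_seq_sep` cutoff, and computes only two toy features (the Cα distance and a clipped sequence separation) rather than the full set of 10. All names and the hairpin coordinates are assumptions for illustration.

```python
import math


def nearest_spatial_neighbor(ca_coords, i, min_seq_sep=3):
    """Return (j, distance) for the residue closest in space to residue i,
    ignoring residues fewer than min_seq_sep positions away in sequence."""
    best_j, best_dist = None, math.inf
    for j, xyz in enumerate(ca_coords):
        if abs(j - i) < min_seq_sep:
            continue
        dist = math.dist(ca_coords[i], xyz)
        if dist < best_dist:
            best_j, best_dist = j, dist
    return best_j, best_dist


def pair_features(ca_coords, i, max_sep=4):
    """Two toy interaction features for residue i and its nearest neighbor j."""
    j, dist = nearest_spatial_neighbor(ca_coords, i)
    if j is None:
        raise ValueError("chain too short to have a non-local neighbor")
    sep = j - i
    clipped = int(math.copysign(min(abs(sep), max_sep), sep))
    return {"neighbor": j, "ca_distance": dist, "clipped_seq_sep": clipped}


# Toy two-stranded hairpin: residues 0-5 run along x, residues 6-11 run back.
strand_1 = [(3.8 * k, 0.0, 0.0) for k in range(6)]
strand_2 = [(3.8 * (5 - k), 5.0, 0.0) for k in range(6)]
hairpin = strand_1 + strand_2

print(pair_features(hairpin, 0))
```

    In this toy hairpin, the first residue's nearest non-local neighbor is the last residue of the chain, far away in sequence but close in space. That kind of tertiary contact is what a backbone-only alphabet would miss.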

    Recently, FoldSeek has been used to find structural and functional similarities of over 67,000 newly predicted viral proteins (underrepresented in the PDB) with other proteins of known structure.  Of these:

    • 62% had distinct structures and were not homologous to proteins in the AlphaFold database (as we indicated above).
    • Many of the 38% left were structurally homologous to nonviral proteins, suggesting a similarity in viral protein function to host analogs.  

    Similar 3D structures imply similar functions, so probable functions could be ascribed to some novel proteins. Some were involved in the viral escape from the host's innate immune system.  We'll explore that in Chapter 5.4: Complementary Interactions between Proteins and Ligands - The Immune System.

    The FoldSeek server allows multi-database searches, including AlphaFoldDB (version 4: Proteomes and Swiss-Prot), AlphaFoldDB (version 4) and CATH25 clustered at 50% sequence identity, ESM Atlas-HQ and Protein Data Bank (PDB).  These Google Colab sites are also available:

    In summary, FoldSeek is useful in several circumstances:

    • you have a protein sequence, but a comparison to other sequences doesn't give you enough information.  A structure-based search would then be helpful;
    • you want to design a brand new protein (de novo protein design), and you want to know that its structure is not similar to other proteins;
    • you want to design a protein with a particular function and want to compare its structure to other proteins with similar function.

    Reverse Protein Folding Problem:  3D Shape to Linear Sequence - Designing Proteins From Scratch

    In yet another use of machine learning and artificial intelligence, programs can start with a desired 3D shape (protein backbone, for example) and determine the amino acid sequence necessary to get it. Two programs, ProteinMPNN and RoseTTAFold Diffusion (RFDiffusion), developed by David Baker (who also won the Nobel Prize in Chemistry in 2024) and colleagues, have enabled these predictions. They allow protein structure design, not just structure prediction.

    Yet another dream that seemed so distant not so long ago was to design a protein from scratch with no linear sequence (hence little alignment information) but with a final desired structure or function in mind.  Here are some possible "de novo" design examples of novel proteins that...

    • are soluble versions of a known membrane protein, which could advance drug design;
    • bind with high affinity to a desired small molecule (much like an antibody), enabling the creation of sensors and protective agents;
    • bind target molecules and catalyze their chemical conversion to products, allowing the development of new and nontoxic catalysts;
    • bind to another target protein and modulate its function by activating or inhibiting it;
    • have novel, unrepresented folds that could further elucidate key principles of protein folding and stability while creating new functionalities.

    RFDiffusion

    This dream has also been accomplished in large measure. David Baker is a pioneer in de novo protein structure design and prediction. His group has developed and used several programs, including RoseTTAFold Diffusion (RFDiffusion), which uses AI to design new proteins with novel structures and functions. RFDiffusion is freely available to anyone for use in Google Colaboratory. It creates new structures by combining structure prediction from RoseTTAFold with an AI "Diffusion" model. 

    To understand the term diffusion in structure prediction, let's first explore AI/machine learning models for generative image creation.  Instead of starting with no previous information, start with a clear image, add random (Gaussian) noise to it (noising), and then try to recreate the original image by a "denoising" process. Some previous information and additional programs would help to constrain the denoising process for generative image creation for a requested image. This process is illustrated in Figure \(\PageIndex{6}\) below. Note the arrows are reversible.

    Single image super-resolution with denoising diffusion GANSFig4.svg

    Figure \(\PageIndex{6}\): Xiao, H., Wang, X., Wang, J. et al. Single image super-resolution with denoising diffusion GANS. Sci Rep 14, 4272 (2024). https://doi.org/10.1038/s41598-024-52370-3.  Creative Commons Attribution 4.0 International License.  http://creativecommons.org/licenses/by/4.0/.
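
    The noising and denoising steps can be sketched numerically. The minimal Python example below applies the standard forward noising formula used in denoising diffusion models, \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\), to a short list of numbers standing in for an image, and then inverts it. In a real diffusion model a trained neural network predicts the noise; here the true noise is passed back in as a stand-in "oracle", which is an assumption made purely to show that the two directions are inverses. The schedule parameters and function names are also illustrative.

```python
import math
import random


def alpha_bar_schedule(steps, beta_start=1e-4, beta_end=0.05):
    """Fraction of the original signal remaining after each noising step
    (linear beta schedule; requires steps >= 2)."""
    alpha_bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward (noising) step: mix the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def estimate_x0(xt, predicted_eps, alpha_bar):
    """Reverse (denoising) estimate of the clean signal from a noise prediction."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, predicted_eps)]


rng = random.Random(0)
signal = [math.sin(0.3 * k) for k in range(20)]  # stand-in for pixel values
alpha_bars = alpha_bar_schedule(steps=200)

noisy, true_eps = add_noise(signal, alpha_bars[-1], rng)
recovered = estimate_x0(noisy, true_eps, alpha_bars[-1])  # oracle noise, see above

print("signal fraction left after noising:", round(alpha_bars[-1], 4))
print("max reconstruction error:", max(abs(a - b) for a, b in zip(signal, recovered)))
```

    After the final step almost none of the original signal remains in the noised data, so everything the model recovers must come from what it learned about realistic images (or, for proteins, realistic structures) during training.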

    Simplistically, the denoising process is similar to solving X-ray structures of proteins.  A given protein in a crystal produces an X-ray diffraction pattern specific to the atoms and their arrangement in the crystal lattice.  In the reverse process, the X-ray diffraction pattern can be computationally analyzed to produce the arrangement of atoms (from an initial electron density map) in the lattice that would generate the given diffraction pattern.

    In a diffusion model for generative protein structure creation using RFDiffusion, randomly disordered small chemical fragments diffuse together to form a more ordered and realistic protein structure.  Information from the database of known protein structures is used to constrain the generative processes through a deep learning–based protein sequence design method called ProteinMPNN (Protein Message-Passing Neural Network).  It differs from Rosetta, a physically-based method that maximizes sidechain packing to produce the lowest energy state. Designing a sequence that produces the lowest energy state is more computationally challenging than finding it for a given sequence. Calculating energies of unwanted nonproductive oligomeric and aggregated states makes this approach intractable. 

    What is more doable is to carry out these two steps in succession:

    • first, search for the lowest-energy sequence for a given backbone structure;
    • then search the "universe" of possible structures for the sequence created in the first step to determine if it is indeed the lowest energy conformation.

    Methods like Rosetta use physical "rules" to minimize undesired results. For example, restrictions are used in placing hydrophobic side chains on the surface of a protein as these might promote unwanted aggregation states.  ProteinMPNN overcomes these issues since it uses data from all solved structures to find the most probable amino acid at a given position.  It requires less human theoretical knowledge as it extracts an energy-minimized folded state from an immense amount of structural data.  It's a bit like deriving Newton's Laws of Motion from data without the underlying theory, even though the data was acquired from systems whose motions and positions are well described by Newton's Laws.
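
    The idea of choosing "the most probable amino acid at a given position" can be illustrated with a toy Python sketch. It assumes a model has already produced, for each backbone position, a probability for each candidate residue. The three-position table below is invented for illustration and is not ProteinMPNN output, and the real method conditions each choice on the backbone and on residues already chosen rather than treating positions independently. The temperature-based sampling function is likewise a generic technique shown as an assumption about how sequence diversity could be obtained.

```python
import random


def most_probable_sequence(position_probs):
    """Pick the highest-probability residue at every position."""
    return "".join(max(probs, key=probs.get) for probs in position_probs)


def sample_sequence(position_probs, temperature, rng):
    """Sample one residue per position; low temperature favors the top choice,
    high temperature gives more diverse sequences."""
    letters = []
    for probs in position_probs:
        residues = list(probs)
        weights = [probs[r] ** (1.0 / temperature) for r in residues]
        letters.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(letters)


# Hypothetical per-position probabilities for a 3-residue stretch.
toy_probs = [
    {"L": 0.7, "V": 0.2, "A": 0.1},   # buried position: hydrophobic residues favored
    {"K": 0.5, "E": 0.4, "D": 0.1},   # surface position: charged residues favored
    {"G": 0.9, "P": 0.1},             # tight turn
]

rng = random.Random(1)
print("most probable:", most_probable_sequence(toy_probs))
print("sampled (T=1.0):", sample_sequence(toy_probs, 1.0, rng))
```

    No explicit rule about hydrophobic residues on the surface appears anywhere in this sketch. In a data-driven method, such preferences are carried implicitly by the learned probabilities.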

    Figure \(\PageIndex{7}\) below shows the noising (right to left) and denoising (left to right) processes that can generate a protein structure in a diffusion model.  It parallels the image deconstruction and reconstruction shown in Figure \(\PageIndex{6}\) above.

    De novo design of protein structure and function with RFdiffusionFig1A.svg

    Figure \(\PageIndex{7}\): Protein design using RFdiffusion.  Diffusion models for proteins are trained to recover corrupted (noised) protein structures and to generate new structures by reversing the corruption process through iterative denoising of initially random noise \(X_T\) into a realistic structure \(X_0\) (top panel).  Watson, J.L., Juergens, D., Bennett, N.R. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023). https://doi.org/10.1038/s41586-023-06415-8. Creative Commons Attribution 4.0 International License.  http://creativecommons.org/licenses/by/4.0/.

    RFDiffusion can create a protein structure without introducing conditions on the final structure (unconditional process).  Conditions placed on the denoising process can lead to conditioned structures.  Conditions such as symmetric noise, a binding target, a functional motif such as pre-positioned amino acids in an active site, and a symmetric motif can lead to synthetic oligomers, a binder protein that interacts with a target protein, an active site with the correct 3D disposition of catalytic residues, and symmetrical scaffolds, respectively.  These examples are illustrated in Figure \(\PageIndex{8}\) below.

    De novo design of protein structure and function with RFdiffusionFig1b.svg

    Figure \(\PageIndex{8}\): b, RFdiffusion is broadly applicable for protein design. RFdiffusion generates protein structures without further input (top row) or by conditioning on (top to bottom): symmetry specifications; binding targets; protein functional motifs or symmetric functional motifs. In each case, random noise and conditioning information are input to RFdiffusion, which iteratively refines that noise until a final protein structure is designed. Watson, J.L. et al., ibid.

    Figure \(\PageIndex{9}\) below shows that the final structure predicted for a 300 amino acid protein sequence by AlphaFold (bottom row) is almost identical to the final structure produced by RFDiffusion (top row).

    De novo design of protein structure and function with RFdiffusionFig1c.svg

    Figure \(\PageIndex{9}\):  An example of an unconditional design trajectory for a 300-residue chain, depicting the input to the model (Xt) and the corresponding X^0 prediction. At early timesteps (high t), X^0 bears little resemblance to a protein but is gradually refined into a realistic protein structure. Watson, J.L. et al., ibid.

    For unconditional RFDiffusion designs of up to 600 amino acids, the designed structures are essentially the same as the structures AlphaFold predicts for the designed sequences.

    Figure \(\PageIndex{10}\) below shows how hot spots (key binding residues) in a target protein are used as a condition in an RFDiffusion model in the de novo synthesis of a mini-binder for a target protein. 

    De novo design of protein structure and function with RFdiffusionFig6a.svg

    Figure \(\PageIndex{10}\):  RFdiffusion generates protein binders given a target and specification of interface hotspot residues. Watson, J.L. et al., ibid.

    The video below from the Baker lab (obtained from YouTube at https://youtu.be/geqlzPsigQo) shows how a protein structure can be created that binds to a predefined structure, in this case, the insulin receptor, using conditional RFDiffusion.

    The same video is found at the Baker site at: https://www.bakerlab.org/2023/07/11/...rotein-design/

    The structures produced by RFDiffusion and ProteinMPNN for any given sequence can be verified by making the protein and analyzing its structure using X-ray crystallography, NMR, or cryoEM.  Less precise methods, such as CD spectroscopy, are also used to get a simpler measure of the overall secondary structure of the synthesized protein.
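
    Agreement between a design model and an experimentally determined structure is commonly reported as the root-mean-square deviation (RMSD) of matched atoms after optimal superposition.  The following is a minimal sketch of that calculation using the Kabsch algorithm.  The function name and the synthetic coordinates are assumptions for illustration; real comparisons would use matched Cα atoms read from the two coordinate files.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid-body superposition."""
    P = P - P.mean(axis=0)                  # remove translation by centering both sets
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                             # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against an improper rotation (reflection)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T # rotation that best maps P onto Q
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))

# Synthetic demonstration: a "design model" and an "experimental structure" that is
# the same set of points, rotated, translated, and slightly perturbed.
rng = np.random.default_rng(1)
design = rng.standard_normal((50, 3)) * 10.0
theta = np.deg2rad(40.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
experimental = design @ Rz.T + np.array([5.0, -3.0, 2.0])
experimental = experimental + rng.normal(scale=0.5, size=design.shape)

print("RMSD after superposition (A):", kabsch_rmsd(design, experimental))
```

    The rotation and translation applied to the second coordinate set do not contribute to the result; only the small random perturbation does, which is why superposition must precede the RMSD calculation.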

    Examples

    The following iCn3D models of crystal structures using these methods illustrate the power of RFDiffusion methods in creating new proteins of defined structure and function.

    Examples of de novo protein design Interactive iCn3D model with links

    GP130 (IL6 coreceptor) in complex with a de novo designed IL-6 mimetic (8UPA)

    Cytokine storms (cytokine release syndrome) are often deadly inflammatory responses accompanying bacterial or viral infections (such as COVID-19 in those with severe disease).  The storm is associated with the overexpression and release of two proinflammatory protein cytokines, interleukin 1 (IL-1) and interleukin 6 (IL-6), by activated immune cells such as macrophages.  IL-1 and IL-6 bind to their receptors (IL-1R and IL-6R) with high affinity.  IL-6 also binds to a coreceptor (GP130) needed for cytokine release.  Inhibitors used to interfere with the cytokine:receptor complexes can persist too long and have deleterious effects since an appropriately amplified immune response is needed against bacterial and viral infections. A small protein antagonist (a minibinder or MB) with high affinity (pM to nM dissociation constant) for the receptor and the IL-6 coreceptor was made through de novo design and proved protective against a cytokine storm in animal models.  The structure of the IL-6 mimetic with its coreceptor, GP130, is shown in the iCn3D model to the right. The computationally designed structure was a close match to the X-ray structure.

    Reference:  Huang, B., Coventry, B., Borowska, M.T. et al. De novo design of miniprotein antagonists of cytokine storm inducers. Nat Commun 15, 7064 (2024). https://doi.org/10.1038/s41467-024-50919-4

    GP130 (IL6 coreceptor) in complex with a de novo designed IL-6 mimetic (8UPA).png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{11}\): GP130 (IL6 coreceptor, gray) in complex with a de novo designed IL-6 mimetic (8UPA, red). (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...rAQXeMxWaq74Y6

    Here is another link to see a surface representation of the two interacting proteins:  https://structure.ncbi.nlm.nih.gov/i...3bbFWhy2eHNGAA

    Designed Influenza HA binder, HA_20, bound to Influenza HA (8SK7)

    Hemagglutinin (HA) from the influenza virus is a trimeric membrane protein.  Each "monomer" is a heterodimer consisting of two different chains, HA1 and HA2.  The HA1 domain binds to a particular sugar, sialic acid, found on many human cells, but most importantly in the respiratory tract. The HA2 subunit is transmembrane.  We receive a new vaccine each year that elicits antibodies against the HA protein, since the globular head of HA that interacts with the human cell surface mutates quickly from year to year.  Large shifts in the structure of the influenza HA protein lead to pandemics.  

    Parts of the HA molecule are more conserved and are somewhat sequestered from the human immune system. Targeting them could lead to a more permanent and universal vaccine.  A small influenza binder was designed de novo and bound HA tightly (nanomolar dissociation constant). The de novo-designed protein had essentially the same structure as the AlphaFold computational model.   An iCn3D model showing the interaction of the HA binder with one HA1:HA2 heterodimer is shown to the right.  

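    Reference:  Watson, J.L., Juergens, D., Bennett, N.R. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023). https://doi.org/10.1038/s41586-023-06415-8. Creative Commons Attribution 4.0 International License.  http://creativecommons.org/licenses/by/4.0/.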

    Designed Influenza HA binder, HA_20, bound to Influenza HA.png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{12}\): Designed Influenza HA binder, HA_20, bound to Influenza HA (8SK7). (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...xaAUkDPBBUeob7

    HA2 is shown in gray, HA1 in cyan, and the designed HA minibinder is colored by secondary structure (red/yellow).  The biological HA complex contains three copies of the heterodimeric structure shown above.

    Here is another link to see a surface representation of the three interacting proteins:  
    https://structure.ncbi.nlm.nih.gov/i...a9LQq3SjGXp9JA

    Pentameric helical bundle protein (8U5W)

    De novo protein design was used to create a single protein chain (i.e., a monomeric protein) with a single C5 rotational symmetry axis.  Open the iCn3D model to the right.  It contains a single rotational axis (red line).  Rotation around the axis by 360°/5 (72°) reproduces the identical structure.  

    The designed protein also displays near-infrared fluorescence when it binds to a synthetic dye, merocyanine.  The protein forms a covalent Schiff base with the dye.  If the Schiff base is protonated, the fluorescence spectra show a large red shift in both the excitation and emission wavelengths. The protein/dye complex can be used for tissue imaging at a greater depth than other visible light fluorophores.

    Reference:  https://www.researchsquare.com/article/rs-4652998/v1

    Pentameric helical bundle protein (8U5W).png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{13}\): Pentameric helical bundle protein (8U5W). (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...Xm7aBQhm2GEKWA

    Symmetric Oligomers

    In contrast to the previous example of a symmetric monomer, the de novo protein models to the right contain multiple subunits in oligomers that display different types of cyclic symmetry.

    Figure \(\PageIndex{14}\) to the right displays C2 symmetry, with a rotation of 360°/2 (180°) around the axis resulting in an identical structure.  The dimer also displays allostery: it changes its shape globally when effector molecules bind.  Allostery will be explained in Chapter 5.

    Figure \(\PageIndex{15}\) to the right is a homo 6-mer of identical subunits with C6 symmetry.  Rotation of 360°/6 (60°) around the axis results in an identical structure. 

    Figure \(\PageIndex{16}\) to the right is a homo 8-mer of identical subunits with C8 symmetry.  Rotation of 360°/8 (45°) around the axis results in an identical structure. 
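
    The meaning of Cn symmetry (rotation by 360°/n about one axis gives back the identical structure) can be checked numerically.  The sketch below builds a toy C6 "oligomer" by copying an arbitrary three-point "subunit" around the z-axis and then tests which rotations map the assembly onto itself.  The function names and coordinates are invented for illustration, and the symmetry axis is assumed to be the z-axis through the origin.

```python
import numpy as np

def rotation_about_z(angle_deg):
    """3x3 matrix for a counterclockwise rotation about the z-axis."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

def build_cn_oligomer(subunit, n):
    """Place n copies of a subunit at multiples of 360/n degrees around the z-axis."""
    copies = [subunit @ rotation_about_z(k * 360.0 / n).T for k in range(n)]
    return np.vstack(copies)

def has_cn_symmetry(coords, n, tol=1e-6):
    """True if rotating by 360/n degrees about z maps the point set onto itself."""
    rotated = coords @ rotation_about_z(360.0 / n).T
    # Distance from every rotated point to every original point
    dists = np.linalg.norm(rotated[:, None, :] - coords[None, :, :], axis=-1)
    return bool(np.all(dists.min(axis=1) < tol))

# An arbitrary, asymmetric toy subunit (three points, coordinates in A)
subunit = np.array([[10.0,  0.0, 0.0],
                    [11.0,  1.0, 2.0],
                    [12.0, -1.0, 4.0]])
hexamer = build_cn_oligomer(subunit, 6)

for n in (2, 3, 5, 6, 8):
    print(f"C{n} (rotation by {360.0 / n:g} degrees):", has_cn_symmetry(hexamer, n))
```

    A C6 assembly is also unchanged by rotations of 120° and 180°, since these are multiples of 60°, but not by the 72° or 45° rotations that characterize C5 and C8 symmetry.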

     

    Allosterically Switchable De Novo Protein sr322In Closed State.png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{14}\): Allosterically switchable de novo protein sr322, in closed state (8UTM).  (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...HSZ7Ddt8v9R3J9


    designed modular protein oligomer C6-79.png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{15}\): Designed modular protein oligomer C6-79 (8F6R).  (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...SDwnTmNhovAny9


    designed modular protein oligomer C8-71.png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{16}\): Designed modular protein oligomer C8-71 (8F6Q).  (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...ydRkeJAaGGag38

    A protein with an active site

    A protein was designed using RFDiffusion that recreates an active site containing three key catalytic residues from a native enzyme, cytotoxic ribonuclease alpha-sarcin (1DE3).

    The left model in Figure \(\PageIndex{17}\) below and the iCn3D model in Figure \(\PageIndex{18}\) in the adjacent right column show the 3 active site residues used for conditional de novo protein modeling.  The middle two images below show the isolated input catalytic "triad" and the structure created by RFDiffusion.  The right image below is a zoomed image of the active site in the designed protein.

    Fig6BSupplementalNatEnzInput.png

    Figure \(\PageIndex{17}\): Comparison of native ribonuclease sarcin and RFDiffusion designed protein. Supplemental Figure,  Watson, J.L. et al., ibid.

    This iCn3D model is for cytotoxic ribonuclease alpha-sarcin (1DE3).  Three active site amino acids, H50, E96, and H137, were used as the conditioning motif to create the de novo protein with the same active site residues (shown to the left).

    Cytotoxic ribonuclease alpha-sarcin (1DE3).png

    NIH_NCBI_iCn3D_Banner.svg Figure \(\PageIndex{18}\): Cytotoxic ribonuclease alpha-sarcin (1DE3). (Copyright; author via source).  Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...WR1oo3fru633d6

    In Chapter 11.1 we will explore how RFDiffusion can create novel membrane proteins and soluble versions.

    One final comment: Structures predicted by these AI programs must be subjected to experimental validation of structure and function.  Since creating new structures with designed functions is so easy, we must be careful not to blindly accept the results without supporting experimental validation.

    AlphaProteo

    AlphaProteo from Google DeepMind is also used to design protein binders for target sites on proteins. Download this file for a video of a synthesized protein binder designed for the SARS-CoV-2 spike receptor-binding domain (reference).  This program is not yet freely available for use (as of 11/11/24). The machine learning methods used in AlphaProteo were not reported in the preprint reference because of "biosecurity and commercial considerations," so we can't explain the basis of the program as we did above for RFDiffusion.  Figure \(\PageIndex{19}\) below shows, in general, the steps involved in developing binders that interact with "hotspot" sites on target proteins.

    De novo design of high-affinity protein binders with AlphaProteoFig1.svg

    Figure \(\PageIndex{19}\):  Overview and experimental performance of AlphaProteo.  Vinicius Zambaldi et al. De novo design of high-affinity protein binders with AlphaProteo.  Submitted 9/12/24.   https://arxiv.org/abs/2409.08022  https://creativecommons.org/licenses/by-nc-sa/4.0/

    Panel (A) Schematic of the design system. The generative model outputs designed structures and sequences of binder candidates, and the filter is a model or procedure that predicts whether a design will bind.

    Panel (B) Schematic of target-structure-conditioned binder design as performed by the generative model.

    Panel (C) Crystal structures (light yellow) and hotspot residues (dark yellow spheres) of seven target proteins for binder design experiments in this work. VEGF-A and IL-17A are both disulfide-linked homodimers. See Table S1 for PDB IDs and hotspot residue numbers.

    Figure \(\PageIndex{20}\) below shows the interactions of de novo designed binders with the seven target proteins.  

    De novo design of high-affinity protein binders with AlphaProteoFig2A1.svg De novo design of high-affinity protein binders with AlphaProteoFig2A2.svg De novo design of high-affinity protein binders with AlphaProteoFig2A3.svg De novo design of high-affinity protein binders with AlphaProteoFig2A4.svg
    De novo design of high-affinity protein binders with AlphaProteoFig2A5.svg De novo design of high-affinity protein binders with AlphaProteoFig2A6.svg De novo design of high-affinity protein binders with AlphaProteoFig2A7.svg no binder reported

    Figure \(\PageIndex{20}\): Biochemical characterization of representative binders for each target-design model.

    The binders all interacted tightly with their target protein.  

    Challenges that remain

    Here are some examples that pose challenges:

    • Binders that affect the function of a protein: These include small-molecule binders (i.e., drugs) that target orthosteric or allosteric sites.  Essentially, this is the task of the drug and pharmaceutical industry.  Designing binders is especially hard for membrane proteins.  Also, binders that mimic small drugs are difficult to design since the databases are more limited and often proprietary, so the training set is smaller.  In addition, the differences between binders that activate or inhibit a target protein can be subtle.
    • De novo synthesis of protein catalysts: Much of a protein structure is used to bring key groups into a stable configuration for catalysis.  Synthetic chemists try to make small transition metal catalysts that mimic the function of proteins with a catalytic site that often contains a metal ion.  This suggests that natural proteins might not be the most efficient models for producing novel protein catalysts.  Also, proteins that differ in 3D structure can carry out similar reactions.
    • Conformational flexibility in proteins:  Unless we look at the dynamic structures of proteins, our minds can be trapped into creating just the most stable, low-energy protein structure.  Yet flexibility and conformational changes are key to protein function and regulation. Programming conformational change into the algorithms for de novo synthesis is another complicated task.
    • Creating proteins and protein complexes with functions other than catalysis:  Many macromolecular assemblies (inflammasomes, proteasomes, regulated membrane pores, mobility proteins, etc) provide critical cellular functions.  Creating new ones could offer novel ways to modulate cell function.  One example would be to create nanoparticles that can deliver "cargo" (such as vaccines) into cells or potentially sequester and eliminate deleterious intracellular components (like misfolded and aggregated proteins).  

    Summary

    In this chapter, we explore how the tremendous progress in computational biology has transformed our approach to predicting protein structures from their amino acid sequences—a challenge once known as the "protein folding problem." Traditionally, protein structures were determined through experimental methods such as NMR spectroscopy, X‑ray crystallography, and cryo‑EM, but these techniques are time-consuming and costly. With the advent of massive structural databases like the Protein Data Bank (PDB) and the explosion of sequence data, researchers have shifted toward computational methods to predict 3D structures from primary sequences.

    Modern machine learning approaches—most notably AlphaFold and RoseTTAFold—have revolutionized the field. These methods leverage hidden evolutionary signals embedded within protein sequences to infer spatial contacts and predict highly accurate 3D models, often matching experimental structures with remarkable fidelity. Key metrics, such as RMSD and TM-scores, are used to evaluate the accuracy of these predictions by comparing them to known structures.

    Beyond predicting the 3D structure of individual proteins, recent advances like AlphaFold3 extend these capabilities to protein complexes, offering insights into protein–protein interactions that underpin many cellular processes. Meanwhile, structure-based methods like FoldSeek utilize a “structural alphabet” (3Di) to compare and infer protein function from 3D structures, especially useful when sequence homology is limited.

    The chapter further addresses the reverse protein folding problem (determining a sequence that will fold into a desired structure with a desired function) and delves into the emerging field of de novo protein design. Techniques such as RFDiffusion and ProteinMPNN allow scientists to generate novel protein structures and sequences with desired structural features and functions, paving the way for innovative applications in therapeutics, enzyme engineering, and synthetic biology.

    Overall, this chapter highlights how computational protein modeling, powered by advanced neural network architectures (transformers and protein language models), is bridging the gap between sequence and structure. It emphasizes that while these computational methods greatly accelerate research, experimental validation remains essential to confirm predicted structures and functions, ensuring that these digital tools serve as powerful complements to traditional biochemical approaches.