5.8: Problems - Predicting Protein Structure and Function Using Machine Learning and AI Programs


The exercises below use simple examples to demonstrate the power of machine learning/AI algorithms to predict protein structure and function and to design proteins with specific functions. You will then model the results using iCn3D. Many of the programs are quite new and will likely change as better software becomes available, because this field is moving extremely fast. Some of the programs are commercial, with limited free use that is sufficient for these exercises. Remember that these exercises provide computational results; experimental methods should be used to give the proposed structures additional support. The background for many of the methods used below is found in Chapter 4.14: Predicting Structure from Sequence and Sequence from Structure/Function.

Glossary of AlphaFold terms

Predicting structure from Sequence with AlphaFold

Exercise \(\PageIndex{1}\)

1. Use AlphaFold3 from this link to determine a structure from this protein sequence:

MPGAISQLVSYGAQDVYLTGNPQITFFKAVYRRYTNFAMESIQQTFDGTTDFGKFPTVTISRNGDLAGPIWIEVNLPSLLGYNITPTPAEGNTSNIAAISTVFKDDYNNYWWTYNPGTTPQYSNLIAAFSNVDYKYYANAVTSTYPPTALSNVVYSWPYMITGNTGTRSTVAIPTANLRYVNGIGLALFNSIELELGGQRIDKHYSEWWDIWTELTETAEKIQGYNTMVGRYDPAVYNAGWNISQAQGGTYYVPLKFCYNRNPGLYMPLVALSYHQMKLNFNINNYLNCVKCNYPVTALTSKNGANPLSITNMKLYTDFVFLDAPERIRMSEIQHEYLVTQLQWQGSEPVTAPGDPNGSTNRKITLNFNHPVRELVFVYQAASNYDVDAVTGNNIFDYEIPANPTATPPYAGGGEVFTEVKLIINGSDRFSGRPGAYFRLVQPYEHHVRVPSKSVYVYSFALEDADSRQPNGSANFTRYDSVQLQLTLNENLASGRVQIYAPNFNILRIAAGMGGLAFAN

Select these in the input window and then paste in the above sequence.
Select Continue and Preview Job
Name the job: FOB_Unknown_Protein1
Select Confirm and Submit Job
When completed, check the job box and open the results
Take a screen capture of the results window
Select the download icon in the top menu bar to download a compressed zip file of the results
Unzip the folder/file
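
The downloaded archive contains several files, and the structure files are easiest to find by name. The following is a minimal Python sketch, assuming the naming pattern implied by the single file name used in this exercise (job name in lowercase, prefixed with fold_ and suffixed with _model_N.cif). Both function names are hypothetical, and the pattern may change if the server changes.

```python
import zipfile


def expected_model_filename(job_name: str, model_index: int = 0) -> str:
    """Guess the structure file name for a job.

    Assumption: the pattern is inferred from one example in this exercise,
    "FOB_Unknown_Protein1" -> "fold_fob_unknown_protein1_model_0.cif".
    """
    return f"fold_{job_name.lower()}_model_{model_index}.cif"


def list_cif_files(zip_source) -> list[str]:
    """Return the sorted names of all .cif files inside a results archive.

    zip_source may be a path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".cif"))


if __name__ == "__main__":
    print(expected_model_filename("FOB_Unknown_Protein1"))
```

Calling list_cif_files on the downloaded zip file is a quick way to confirm which model files are present before opening one in iCn3D.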

Use iCn3D to open the first structure file: fold_fob_unknown_protein1_model_0.cif (it is a .cif file, not a pdb file) and render as follows:

File, Open File, mmCIF
Analysis, Seq. & Annotations
In the Seq & Annotations window, choose the Details tab

iCn3D shows 2 domains, the Capsid_N and Capsid_NCLDV domains (NCLDV stands for nucleocytoplasmic large DNA viruses).

Click the blue Capsid_N domain name to highlight it
Color, Unicolor, Cyan, Cyan
Select, Save Selection, and name it Capsid_N domain
Analysis, Label, Per Selection, then name it Capsid_N, and change the size to 10
Click the blue Capsid_NCLDV domain name to highlight it
Color, Unicolor, Magenta, Magenta
Select, Save Selection, and name it Capsid_NCLDV domain
Analysis, Label, Per Selection, then name it Capsid_NCLDV, and change the size to 10
Clear Selection
Style, Background, Transparent
File, Save File, iCn3D PNG image, Original Size and HTML
name it: fold_fob_unknown_protein1_model_0

Answer

f. Here is the screen capture

A line graph with a multi-colored trajectory on the left and a green heat map on the right, depicting data patterns.

iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!

Open iCn3D
File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Try this: https://www.ncbi.nlm.nih.gov/Structu...?origin=mt-web

https://www.ncbi.nlm.nih.gov/Structu...d/pdb/3GVU.png

Here is a static image:

3D molecular structure visualization with strands in cyan, magenta, and gray representing protein folding.

Domain information from iCn3D:

Capsid_N: This is the N-terminal domain of the major capsid protein in several dsDNA viruses.(open details view...)
Capsid_NCLDV: This family includes the major capsid protein of iridoviruses, chlorella virus, and Spodoptera ascovirus, which are all dsDNA viruses with no RNA stage. This is the most abundant structural protein and can account for up to 45% of virion protein. In Chlorella virus PBCV-1 the major capsid protein is a glycoprotein. The four families of large eukaryotic DNA viruses, Poxviridae, Asfarviridae, Iridoviridae, and Phycodnaviridae, are referred to collectively as nucleocytoplasmic large DNA viruses or NCLDV. The virions of different NCLDV have dramatically different structures. The major capsid proteins of iridoviruses and phycodnaviruses, both of which have icosahedral capsids surrounding an inner lipid membrane, showed a high level of sequence conservation. A more limited but statistically significant sequence similarity was observed between these proteins and the major capsid protein (p72) of ASFV, which also has an icosahedral capsid. It was surprising, however, to find that all of these proteins shared a conserved domain with the poxvirus protein D13L, which is an integral virion component thought to form a scaffold for the formation of viral crescents and immature virions. (open details view...)

The figure below shows an interactive iCn3D model of a capsid trimer from Paramecium bursaria Chlorella virus 1 (1M4X).

Figure: Capsid trimer from Paramecium bursaria Chlorella virus 1 (1M4X). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...7MCUVpKLytn9bA

The monomer in this trimeric capsid protein is a different capsid protein with a different sequence than the one you modeled. Two monomers are shown in brown and purple. The other monomer is colored to show the two Capsid domains (same as in the one you modeled). One is in cyan, and the other is colored by secondary structure. Each of these two domains contains a jellyroll domain with eight beta-strands that form two 4-stranded sheets. Thousands of these individual trimers form the capsid shell of the virus, as shown in the figure below.

Diagram showing a cell structure on the left with DNA strands and a geometric shape on the right demonstrating geometric patterns.

Figure: Structure of a Phycodnaviridae. https://viralzone.expasy.org/145

In this virus, there are 5040 copies of the major capsid protein (hexamers), 60 copies of the penton protein (pentamers), and 1800 minor capsid proteins of different types.

In this example, you took linear information (the amino acid sequence) and turned it into 3D information, the tertiary structure of an unknown protein. Subsequent analyses within iCn3D show that this is a capsid protein from the dsDNA Paramecium bursaria Chlorella virus 1. The protein you modeled (same sequence) is listed in Uniprot (O41104) and the PDB (8H2I) as an averaged PBCV-1 capsid, but no 3D structures of this protein are available in these databases.

Predicting Structure from Sequence with ESMFold

ESMFold predicts protein structures from sequence and was used to build the ESM Metagenomic Atlas. In contrast to genomics, which involves the DNA sequence analysis of a single organism, metagenomics involves DNA sequencing of a complex sample (such as soil or the gut microbiome) containing a large number of different organisms and viruses. The Metagenomic Atlas contains information on a vast number of proteins, and the structures in it are predicted by ESMFold. These structures can be viewed in iCn3D, as shown in the figure below.

Screenshot of a software interface displaying file options, alignment settings, and product ID information.

The Atlas contains predicted protein structures for over 700 million proteins. A million of these can be previewed on their Explore page. ESMFold uses statistical scoring metrics similar to AlphaFold. They include pTM, which is the predicted TM-score, giving the reliability of the global fold, and the mean predicted LDDT (pLDDT) score, which gives the reliability of the local structure. Over 225 million structures, including the sampling of those on the explore page, have high confidence (both pTM and pLDDT >0.7).
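
The high-confidence rule quoted above (both pTM and pLDDT greater than 0.7) is easy to apply to your own predictions. This is a minimal sketch, not code from the Atlas itself; the function name is hypothetical, and the rescaling step assumes that any pLDDT value above 1 was reported on the 0-100 scale that many tools use.

```python
def is_high_confidence(ptm: float, plddt: float, threshold: float = 0.7) -> bool:
    """Apply the 'both pTM and pLDDT > 0.7' rule to one predicted structure.

    Assumption: pLDDT values above 1 are on a 0-100 scale and are
    rescaled to 0-1 before the comparison.
    """
    if plddt > 1.0:
        plddt = plddt / 100.0
    return ptm > threshold and plddt > threshold


if __name__ == "__main__":
    # Hypothetical scores for two predictions.
    print(is_high_confidence(ptm=0.85, plddt=0.90))  # both above the cutoff
    print(is_high_confidence(ptm=0.65, plddt=0.95))  # global fold is uncertain
```

Requiring both scores reflects their different roles: pTM reports on the global fold, while the mean pLDDT reports on local structure.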

Exercise \(\PageIndex{2}\)

Go to iCn3D
File, Predict by Sequence, ESM Fold.
Input this sequence:
MIKNFFQNTEKSEFFKNSLVANFIIILTMVPQFLLVPIILNFWGEAQYSSYVVFLSVLNIAIQINGAMQNGYINRLFQSGKISVAELSSMFLLNLGLLSFMLILLFVFYVSGYGAAFFSHITLIILSLPFLFALSSLKIIYQLGHKVTTPL
then select ESMFold
Render the structure in iCn3D as you wish.
File, Save File, iCn3D PNG image, Original Size and HTML
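
Copying a long sequence from a web page can introduce stray spaces, line breaks, or non-standard letters, so it can help to clean the sequence before pasting it into a prediction tool. The following is a minimal Python sketch; the function names are hypothetical, and it assumes the tool accepts only the 20 standard one-letter amino acid codes.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Remove all whitespace and convert the sequence to uppercase."""
    return "".join(raw.split()).upper()


def invalid_residues(seq: str) -> list[str]:
    """Return the sorted characters that are not standard amino acid codes."""
    return sorted(set(seq) - STANDARD_AMINO_ACIDS)


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # First residues of the Exercise 2 sequence, with a stray space and newline.
    raw = "MIKNFFQNTE KSEFFKNSLV\nANFIIILTMV"
    seq = clean_sequence(raw)
    print(len(seq), invalid_residues(seq))
    print(to_fasta("exercise2_fragment", seq, width=20))
```

An empty list from invalid_residues means the cleaned sequence contains only standard residues.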

Answer

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Here is a static image of the calculated structure:

3D protein structure in shades of blue with a color legend indicating varying confidence levels in predictions. .

Predicting binding interactions and complex formation with AlphaFold

Exercise \(\PageIndex{3}\)

Use AlphaFold3 from this link to determine the structure of a complex between a protein and a dsDNA from just their sequence. Then compare the computed structure with a very similar X-ray structure. This example comes from a Nature paper that used AlphaFold to predict the structure of the complex. Here is the reference:

Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024). https://doi.org/10.1038/s41586-024-07487-w

Protein: AMP-binding protein-catabolite gene activator from Rhizobium meliloti (strain 1021) (Ensifer meliloti) (Sinorhizobium meliloti), uniprot: Q92SD2. Here is the protein sequence:

MAEVIRSSAFWRSFPIFEEFDSETLCELSGIASYRKWSAGTVIFQRGDQGDYMIVVVSGRIKLSLFTPQGRELMLRQHEAGALFGEMALLDGQPRSADATAVTAAEGYVIGKKDFLALITQRPKTAEAVIRFLCAQLRDTTDRLETIALYDLNARVARFFLATLRQIHGSEMPQSANLRLTLSQTDIASILGASRPKVNRAILSLEESGAIKRADGIICCNVGRLLSIADPEED

dsDNA sequence: Here are sequences of both DNA strands (5'-3'):

5'-CTAGGTAACATTACTCGCG-3' (19 mer) and 5'-GCGAGTAATGTTAC-3' (14 mer)
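
Because the two strands have different lengths, it is worth checking how they pair before submitting them. The Python sketch below computes the reverse complement of the 19-mer and locates the 14-mer within it; the function names are hypothetical, and the check assumes perfect Watson-Crick pairing with no mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA strand written 5'-3'."""
    return dna.translate(COMPLEMENT)[::-1]


def describe_duplex(top: str, bottom: str) -> dict:
    """Locate the bottom strand on the top strand and report the overhangs."""
    offset = reverse_complement(top).find(bottom)
    if offset < 0:
        raise ValueError("bottom strand is not complementary to the top strand")
    end = len(top) - offset          # end of the paired region on the top strand
    start = end - len(bottom)        # start of the paired region on the top strand
    return {
        "paired_bp": len(bottom),
        "top_5prime_overhang": top[:start],
        "top_3prime_overhang": top[end:],
    }


TOP = "CTAGGTAACATTACTCGCG"     # 19-mer from the exercise
BOTTOM = "GCGAGTAATGTTAC"       # 14-mer from the exercise

if __name__ == "__main__":
    print(describe_duplex(TOP, BOTTOM))
```

For these two strands, the 14-mer pairs completely with the 19-mer, leaving CTAG unpaired at the 5' end of the longer strand and a single G unpaired at its 3' end.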

In AlphaFold 3, input the entry type, copies (number of molecules in the structure), and the sequences, as shown in the figure below.

Screenshot of text interface displaying lines of code or commands, positioned in a grid format.

In the actual crystal structure, a protein dimer binds the dsDNA.

Answer

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Here is a static image of the calculated structure:

3D molecular structure showing intertwined chains in purple and blue, representing a protein complex with nucleic acids.

The figure below shows an interactive iCn3D model of the actual X-ray crystal structure of the Clr-cAMP-DNA complex (7PZB) for comparison. Note that cAMP could not be included in the AlphaFold 3 predicted structure of the complex using the version available.

Figure: X-ray crystal structure of the Clr-cAMP-DNA complex (7PZB). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...YZJitoVd7Y2LAA

The program can also be used to determine the structure of protein complexes, as in this example from Chapter 4.14: the complex of human sperm proteins and an egg protein predicted by AlphaFold.

Designing target proteins de novo (e.g., making novel binders) with RFDiffusion

Thrombin is a serine protease: it cleaves other proteins using its active site serine as a nucleophile. Other examples of serine proteases are the gut proteases chymotrypsin and trypsin, which are used in digestion. Thrombin cleaves a limited repertoire of proteins, mostly involved in blood clotting and its control. A main substrate is the protein fibrinogen, which, after cleavage of small peptides, forms fibrin that associates to form a fibrin clot. In addition to binding at the active site, specific substrates like fibrinogen interact with an anion-binding "exosite" on thrombin near the active site. This additional binding site restricts thrombin to a limited set of substrates.

The figure below shows an interactive iCn3D model of human thrombin (3U69). Three active site residues (catalytic triad) involved in cleaving fibrinogen and other clotting proteins are shown in CPK-colored sticks. The blue side chains and surface areas represent key positively charged residues in "Exosite 1" of thrombin, involved in binding the anionic region of fibrinogen and a modulator of thrombin activity, thrombomodulin.

Figure: X-ray crystal structure of human thrombin (3U69). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...PkxukNXxPGuHa7

Organisms that feed on liquid blood have evolved peptide inhibitors that bind to Exosite 1 and block fibrinogen binding. One such organism is the leech, which secretes a protein called hirudin in its saliva. Hirudin has a long, anionic C-terminal tail that binds in the exosite groove and inhibits the binding of fibrinogen.

The figure below shows an interactive iCn3D model of the X-ray crystal structure of the hirudin-thrombin complex (4HTC). Key anionic residues in the C-terminal tail of hirudin are shown as red sticks and red surfaces interacting with the positive residues (blue sticks and surfaces) in Exosite 1 of thrombin.

Figure: X-ray crystal structure of the hirudin-thrombin complex (4HTC). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...EyESJzsUV85Ex5.. To change the background color on the popup choose Style, Background, Transparent.

Synthetic inhibitors of thrombin that target exosite 1 of thrombin have been made. One contains a negatively charged DNA (an aptamer) covalently linked to an active site inhibitor, creating an "EXosite and ACTive site (EXACT) inhibitor".

Exercise \(\PageIndex{4}\)

In this exercise, you will create a novel peptide inhibitor that targets exosite 1 of thrombin using RFDiffusion. This should be a simple example, as the peptide is expected to contain glutamates and aspartates, mimicking the C-terminal end of hirudin.

You will run RFDiffusion from a commercial company called Neurosnap. First, set up an account (free, five models/month).

Go to RFDiffusion-v2
Separately, download this pdb file for human thrombin to your computer: 3U69
Open RFdiffusion-v2 and input the following data, which is shown in red. Leave the rest alone. H stands for the Heavy Chain in the PDB file.

Binder Input Chain	H
Binder Length Maximum	20
Binder Length Minimum	17
Binder ROG	false
Binding Pocket Residue End	1
Binding Pocket Residue Start	1
Fixed Residue End	20
Fixed Residue Start	10
Hotspots	H340,H341,H388,H390,H393,H397,H426,H427
Input Structure	input_structure.pdb

Run the program.

When completed, the output page has several tabs: Data and Visuals, Config, and Files. View the results in the Data and Visuals tab

Answer

iCn3D results: Download this PNG file and upload it into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Note that the amino acid numbering of the hotspots changed in the output files from RFDiffusion, as shown in the table below.

Initial PDB numbering system for Heavy Chain (3U69)	RFDiffusion numbering for Heavy Chain Hot Spots
R340	20
K341	21
R388	68
R390	70
R393	73
K397	77
K426	106
K427	107
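
Every row of the table differs by the same amount: the RFDiffusion number equals the PDB number minus 320. The Python sketch below parses the hotspot string used as input and applies that offset. The offset is inferred only from this table and is an assumption for this particular run and structure file; the function names are hypothetical.

```python
HOTSPOTS = "H340,H341,H388,H390,H393,H397,H426,H427"

# Assumption: constant offset inferred from the table for the 3U69 heavy chain.
OFFSET = 320


def parse_hotspots(spec: str) -> list[tuple[str, int]]:
    """Split 'H340,H341,...' into (chain, residue_number) pairs."""
    return [(token[0], int(token[1:])) for token in spec.split(",")]


def renumber(spec: str, offset: int = OFFSET) -> dict[str, int]:
    """Map each input hotspot label to its number in the output files."""
    return {f"{chain}{num}": num - offset for chain, num in parse_hotspots(spec)}


if __name__ == "__main__":
    for label, new_number in renumber(HOTSPOTS).items():
        print(f"{label} -> {new_number}")
```

If a different structure file or chain is used, the offset should be rechecked against at least one residue in the output rather than assumed.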

Here is a static image:

3D model of a protein complex, showing various colored molecular structures against a translucent background.

The light chain is not shown. The heavy chain H is shown in gray. The hot spots on thrombin (H340, H341, H388, H390, H393, H397, H426, and H427) are colored by charge (blue stick and surface). The binder is shown as a gold surface with the charged side chains shown in red (negative) and blue (positive). The sequence of RFDiffusion binder is EEEEKELELLREEIEKLEKE, which has 11 glutamates (E) but also 3 lysines (K), perhaps to maintain some charge balance.
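
The composition of the binder can be checked with a few lines of Python. The net charge estimate here is deliberately rough and is an assumption of this sketch: it counts D and E as -1 and K and R as +1, and it ignores histidine, the chain termini, and any pKa shifts.

```python
from collections import Counter

BINDER = "EEEEKELELLREEIEKLEKE"  # binder sequence reported above


def composition(seq: str) -> Counter:
    """Count how many times each residue occurs in the sequence."""
    return Counter(seq)


def approximate_net_charge(seq: str) -> int:
    """Rough net charge near neutral pH: (K + R) minus (D + E)."""
    counts = composition(seq)
    return (counts["K"] + counts["R"]) - (counts["D"] + counts["E"])


if __name__ == "__main__":
    counts = composition(BINDER)
    print(len(BINDER), counts["E"], counts["K"], counts["R"])
    print(approximate_net_charge(BINDER))
```

For this sequence the count gives 11 glutamates, 3 lysines, and 1 arginine in 20 residues, for an estimated net charge of -7, which is consistent with a peptide designed to bind the positively charged Exosite 1.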

Here is an animation from the output to show how RFDiffusion produces the final peptide at the hotspot residues.

A tangled, three-dimensional structure made of green, zigzagging lines resembling a molecular model.

Here are the stats for the top-ranked binder

Rank	MPNN Score	RMSD	Mean pLDDT	Max PAE	pTM
1	1.14	19.69	92.98	23.48	0.91

Watson, J. et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv (2022). https://doi.org/10.1101/2022.12.09.519842
Neurosnap Inc. Computational Biology Platform for Research. Wilmington, DE, 2022. https://neurosnap.ai/

Exercise \(\PageIndex{5}\)

Create a binder for human thrombin using Google Colab at this site (note: this link might be replaced in the future)

a. Inputs: Accept all defaults except for the sections below. For those, use the inputs shown. The goal is not to optimize the run but to show how easy it is to create a binder. The inputs allow you to create a 17-amino-acid binder to the H (heavy) chain of human thrombin (3U69) at the positively charged Exosite 1 hot spot residues in the H chain, as described above.

A user interface displaying parameters for running RFdiffusion, including fields for name, config, path, iterations, and more.

User interface displaying settings for generating sequences and validating using specific parameters.

b. Run the program. You can run sections individually or all at once. To run one section after the other, click the arrow for the first section as shown below, wait for it to complete, and then run the rest.

Settings interface for generating a blueprint for RFdiffusion, with an option for "manual" mode highlighted.

The first section sets up the program and takes a few minutes to run. Alternatively, run all at once as shown in the image below.

Menu interface showing options under "Runtime," with "Run all" highlighted.

The results should be downloaded to a .zip file in your computer's download folder.

c. Visualizing the results: Unzip the folder and open the "best_design0.pdb" file in iCn3D. Display and render your file in iCn3D.

Answer

iCn3D results: Download this PNG file and upload it into iCn3D to see a rendered image of the protein-binder complex. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Note that the numbering system for the H chain changed, as it did in the first run. Here is a static image:

3D molecular structure with colored spheres representing atoms and white ribbon depicting protein backbone.

The gray chain is the heavy chain of thrombin. The hot spots on this chain are shown in spacefill, CPK colors, and labeled. The binder is shown as a transparent surface with side chains colored by charge (red = negative, blue = positive).

Predicting Protein Function through Structure with FoldSeek

FoldSeek can be run directly on a local PDB file or one retrieved from the PDB through this link. However, it can also be run through iCn3D, adding many more possibilities for rendering and analyses.

Use iCn3D to run FoldSeek

Exercise \(\PageIndex{6}\)

We will use FoldSeek within iCn3D to compare the structure of a newly predicted viral protein, from Wuhan insect virus 23 (a virus of uncertain family), with the structures of all proteins in the databases.

First, download this file: hypothetical_protein__YP_009329883__Wuhan_insect_virus_23__1923727.pdb

Open iCn3D
File, Open File, PDB appendable, and load the downloaded file
Color, Secondary, Sheets in Yellow. Note that the protein is predominantly alpha-helical with some beta-sheets. Here is a static view of the viral protein.

3D model of a protein structure with red, blue, and yellow elements representing different strands and helix formations.

Now let's use FoldSeek within iCn3D to find similar structures (not sequences) in the databases (PDB and AlphaFold).

File, Search Similar, FoldSeek (PDB and AlphaFold)
Click Submit

A new tab with FoldSeek appears with the best statistical structural fits. The statistics shown in this table include the probability (0-1) that the structural match of the query (the protein studied) to the target (what it might resemble) is not due to chance. The other statistic is the expected or E-value, which gives the number of matches with an equal or better score that would be expected by chance alone in a database of that size. Unlike a probability, an E-value can exceed 1. The closer the E-value is to zero, the more significant the match. (See this PDB page describing E-values for sequence alignments.)
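
An E-value is an expected count of chance matches, so it can be converted to a probability of seeing at least one chance match. The sketch below uses the relation P = 1 - exp(-E), which assumes that chance matches follow Poisson statistics, as is usually assumed for sequence-search E-values. Whether this holds exactly for FoldSeek is an assumption here, and the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, assuming Poisson statistics.

    P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1e-30, 0.01, 1.0, 10.0):
        print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.4g}")
```

For very small E-values the two numbers are nearly identical, which is why they are often used interchangeably for strong hits, but for E-values near or above 1 the probability levels off below 1 while the E-value keeps growing.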

Which hit is best? How good is the quality of the fit?
If you click the blue hyperlink for the best hit, you will get the structure from the ESM database, and you can download it. The database protein structure is shown below.

3D molecular structure of a protein shown with a blue and yellow ribbon, against a black background.

Instead, click the staggered three lines (=) to the right to get the alignment with the hypothetical protein.

This pops up:

3D protein structure visualization with a colorful molecular model in blue, yellow, and other colors on the right side.

The query (blue, the protein studied) and the target (orange, what it might resemble)

Click the PDB link to download a PDB file of the two aligned protein segments. Model them in iCn3D to replicate the structure above.

3D molecular structure of a protein, featuring intertwined blue and yellow helices and loops.

Query (hypothetical_protein__YP_009329883__Wuhan_insect_virus_23__1923727) is blue, Target (MGYP001363004432) is orange

This hypothetical viral protein structure is very dissimilar to any structure in the databases except for the helical bundles similar to protein MGYP001363004432 in the ESM Metagenomic Atlas database.

Answer

f. You will see that for this particular structure, there are just five "hits" in only one database, mgnify_esm30, from the ESM Metagenomic Atlas Database described above.

Here are the five hits.

A table displaying data with headings, including achievement metrics and progress bars in green for various items.

None of the structural alignments are good statistically.

iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the aligned query and target proteins. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Query (hypothetical_protein__YP_009329883__Wuhan_insect_virus_23__1923727) is blue, Target (MGYP001363004432) is orange

Exercise \(\PageIndex{7}\)

Now, let's try a new viral protein structure whose function we can likely infer from the FoldSeek structural overlay.

Download this protein structure (whose function has been surmised through programs like FoldSeek): matrix_protein__YP_001531158__Marburg_marburgvirus__11269.pdb.

Open iCn3D
File, Open File, PDB appendable, and load the downloaded file
Color, Secondary, Sheets in Yellow. Note that the protein is predominantly alpha-helical with some beta-sheets.

Here is a static image:

3D structure of a protein, featuring twisted strands in red, yellow, and blue colors.

Now repeat the procedure above and use FoldSeek within iCn3D (as described in Exercise 6 above) to find proteins of similar structure and presumably function.
Model one of the top hits in iCn3D

Answer

The optimal target is VP24 of the Marburg virus (UniProt ID: P35256, membrane-associated protein VP24). Both query and target proteins are from the Marburg virus, a member of the Filoviridae family, which also includes the Ebola virus. These are deadly viruses that cause hemorrhagic fevers and have high death rates. Here is one example of an almost-perfect hit.

Target	Description	Scientific Name	Prob.	Seq. Id.	E-Value	Position in query	Alignment
4or8-assembly1_B	Crystal structure of Marburg virus VP24	Marburg virus - Musoke, Kenya, 1980	1.00	94.1	9.90e-32

A visual representation of a grid with data points and a map displaying various routes and locations, labeled with color coding.

The query (blue, the protein studied) and the target (orange, what it might resemble)

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder. A static image is shown below.

3D representation of a protein structure, featuring intertwined blue and yellow ribbons in a dynamic formation.

Query (matrix_protein__YP_001531158__Marburg_marburgvirus__11269) is blue, Target (VP24 of the Marburg virus) is orange

From the Uniprot page for the VP24 protein, FoldSeek can be run to determine which proteins in the database are similar. In this case, VP24 is the query and FoldSeek finds targets of similar structure.

Table displaying database information, including database names and a link for more details.

The best hits are in the PDB database:

Target	Description	Scientific Name	Prob.	Seq. Id.	E-Value

4or8-assembly1_A	Crystal structure of Marburg virus VP24	Marburg virus - Musoke, Kenya, 1980	1.00	100	1.03e-46
4or8-assembly1_B	Crystal structure of Marburg virus VP24	Marburg virus - Musoke, Kenya, 1980	1.00	94.7	4.12e-37
6ehm-assembly1_C	Model of the Ebola virus nucleocapsid subunit from recombinant virus-like particles	Ebola virus - Mayinga, Zaire, 1976	1.00	37	7.36e-19
4u2x-assembly3_C	Ebola virus VP24 in complex with Karyopherin alpha 5 C-terminus	Ebola virus - Mayinga, Zaire, 1976	1.00	36.5	6.94e-19
3vne-assembly1_A	Structure of the ebolavirus protein VP24 from Sudan	Sudan ebolavirus	1.00	35.5	3.25e-18
4d9o-assembly1_A	Structure of ebolavirus protein VP24 from Reston	Reston ebolavirus - Reston	1.00	36	1.55e-16
3vnf-assembly1_A	Structure of the ebolavirus protein VP24 from Sudan	Sudan ebolavirus	1.00	34.8	1.75e-16
4d9o-assembly1_B	Structure of ebolavirus protein VP24 from Reston	Reston ebolavirus - Reston	1.00	35	3.84e-15
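
The table shows that structural matches can remain highly significant even when sequence identity is low. The Python sketch below encodes the rows above and pulls out the hits with low sequence identity but very small E-values. The cutoffs (identity below 40% and E-value below 1e-10) are illustrative assumptions, not thresholds defined by FoldSeek.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    target: str
    organism: str
    seq_id: float    # percent sequence identity
    e_value: float


HITS = [
    Hit("4or8-assembly1_A", "Marburg virus", 100.0, 1.03e-46),
    Hit("4or8-assembly1_B", "Marburg virus", 94.7, 4.12e-37),
    Hit("6ehm-assembly1_C", "Ebola virus", 37.0, 7.36e-19),
    Hit("4u2x-assembly3_C", "Ebola virus", 36.5, 6.94e-19),
    Hit("3vne-assembly1_A", "Sudan ebolavirus", 35.5, 3.25e-18),
    Hit("4d9o-assembly1_A", "Reston ebolavirus", 36.0, 1.55e-16),
    Hit("3vnf-assembly1_A", "Sudan ebolavirus", 34.8, 1.75e-16),
    Hit("4d9o-assembly1_B", "Reston ebolavirus", 35.0, 3.84e-15),
]


def divergent_but_significant(hits, max_identity=40.0, max_e_value=1e-10):
    """Return hits with low sequence identity but a significant E-value,
    sorted from most to least significant."""
    selected = [h for h in hits
                if h.seq_id < max_identity and h.e_value < max_e_value]
    return sorted(selected, key=lambda h: h.e_value)


if __name__ == "__main__":
    for hit in divergent_but_significant(HITS):
        print(f"{hit.target:18s} {hit.organism:18s} {hit.seq_id:5.1f}%  {hit.e_value:.2e}")
```

With these cutoffs, the six ebolavirus VP24 entries are selected: each shares only about 35-37% sequence identity with the Marburg query yet has an E-value far below 1.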

Running FoldSeek using 310 Copilot

Exercise \(\PageIndex{8}\)

Now run FoldSeek using 310 Copilot. This program is part of a suite of commercial programs from Open AI. These programs allow users to input questions and data as sentences (much like chatbots such as ChatGPT, Claude, and Gemini) to address complex questions in biology and biochemistry.

Click "try it now". Input the query shown at the bottom of the figure below. In this case, you are asking to find structures similar to VP24 of the Marburg virus using the PDB ID 4or8.

A user interface displaying a prompt to provide a username to search for similar structures within a database.

Answer

Here are the results:

Table displaying data with headers: name, subject id, version, and timestamp; contains three entries with numerical values.

As of 11/21/24, you can't upload a local pdb file (as we did with iCn3D above) and run FoldSeek within Copilot.

Exploring protein complexes with FoldSeek Multimer

In Exercise 3 above, you used AlphaFold to predict the structure of a complex (in that particular example, a protein:DNA complex) from sequences. By analogy, FoldSeek Multimer can find the 3D structures of target complexes from the 3D structure of a known (query) complex. In short, AlphaFold makes large-scale sequence-to-structure predictions, while FoldSeek Multimer makes large-scale structure-to-structure comparisons. Let's try an example with a simple complex, the hepatitis A virus 3C proteinase (1HAV), a homodimer of an A and B chain.

Exercise \(\PageIndex{9}\)

Go to the FoldSeek Multimer server

Load Accession 1hav and run the program.
Copy/Snip the top 4 results. Compare the 1st and 4th results.
Click the alignment icon for each of these two hits and take a screen snip of each.

Answer

b. Here is a snip of the top results.

Figure: table of the top FoldSeek Multimer hits for 1hav, with alignment bars for each chain.

The top results show high-quality alignments of one chain with a human protein, but that human protein is not part of a similar dimer. Result 4 shows good alignment of both chains of the query with a human dimer.

Note the label for the top hit: B ➔ ProtVar_P83110_Q96RQ3_A. P83110 is the UniProt accession number for human serine protease HTRA3.
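
If you collect many hits, labels like this one can be split apart programmatically. The sketch below is a minimal example, not part of FoldSeek. It assumes labels follow the pattern "query chain ➔ ProtVar_accession_accession_target chain", which is inferred from this single label rather than from any documentation. The function name parse_protvar_label is made up for this example.

```python
def parse_protvar_label(label: str) -> dict:
    """Split a hit label such as 'B ➔ ProtVar_P83110_Q96RQ3_A' into its parts.

    The field layout is an assumption based on one example label.
    """
    query_chain, target = (part.strip() for part in label.split("➔"))
    prefix, acc1, acc2, target_chain = target.split("_")
    if prefix != "ProtVar":
        raise ValueError(f"unexpected label prefix: {prefix!r}")
    return {
        "query_chain": query_chain,
        "accessions": (acc1, acc2),   # accessions of the two proteins in the target model
        "target_chain": target_chain,
    }


info = parse_protvar_label("B ➔ ProtVar_P83110_Q96RQ3_A")
print(info["accessions"])
```

Each accession can then be looked up in UniProt to identify the proteins in the target complex.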

c. Hit 1 - alignment for just 1 chain of the query and target

A molecular structure diagram with a ribbon representation of a protein, alongside labeled sequence information and annotations.

Query (blue, the protein studied) and the target (orange, what it might resemble)

The icons underneath the blue/orange models are explained below.

PDB: select to download the combined file and model in iCn3D
blue: toggle between the entire query and the aligned structure
orange: toggle between the entire target and the aligned structure. Note that in this case, the target (human protein) is much bigger than the query (viral) protein so the alignment is only between part of the human protein.

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder. A static image is shown below.

3D representation of a protein structure with intertwined blue and gold ribbon-like strands.

Query (blue, the protein studied) and the target (orange, what it might resemble)

Hit 4: Alignment of two chains (A and B) for the query and target

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder. A static image is shown below.

3D molecular structure representation with intertwined strands in blue, yellow, and gold colors.

Docking of Small Molecules (Ligands)

Predicting binding interactions between small molecules and proteins is at the heart of the pharmaceutical industry. Previously, docking software for small molecule ligands and proteins was proprietary, costly, and difficult to run. Now, docking can be done online. In the next exercises, you will use three programs: 310 Copilot DiffDock (from a commercial company), DiffDock through Neurosnap, and SwissDock (Molecular Modeling Group, University of Lausanne, and the SIB Swiss Institute of Bioinformatics).

The inputs for a docking program include the protein (typically given as a PDB ID) and the ligand, which can be supplied as an actual structure file but is more often given as a text code in the SMILES or InChI format. For these exercises, you will dock the small molecule pyridoxal phosphate (PLP) to a low molecular weight protein tyrosine phosphatase, a protein that cleaves phosphorylated tyrosines in proteins.

The input representation for PLP can be obtained from PubChem, as shown below.

Figure: PubChem entry for pyridoxal phosphate (PLP).

Docking using 310 Copilot

Exercise \(\PageIndex{10}\)

Docking using Copilot - DiffDock

Go to PubChem and get the SMILES representation for PLP.
Open 310 Copilot
Input this text: Run a docking experiment with the protein PDB ID = 5JNR and the small molecule SMILES=CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O
When the docked structure is shown, select Download PDB
Save as 310CoPilotDock5JNR_PLP.pdb
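
A SMILES string is easy to damage when copying and pasting (a dropped parenthesis or ring digit changes or breaks the molecule). The sketch below is a rough sanity check only: it confirms that parentheses are balanced and that every ring-closure label appears an even number of times. It is not a chemistry validator and says nothing about valence or aromaticity; a cheminformatics toolkit such as RDKit would be needed for real parsing. The function name smiles_looks_balanced is made up for this example.

```python
def smiles_looks_balanced(smiles: str) -> bool:
    """Return True if branches and ring closures in a SMILES string pair up."""
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels seen an odd number of times
    in_bracket = False   # inside [...] digits mean isotopes, H counts, or charges
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            in_bracket = ch != "]"
        elif ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a branch closed before it was opened
                return False
        elif ch == "%":            # two-digit ring label such as %12
            open_rings ^= {smiles[i:i + 3]}
            i += 2
        elif ch.isdigit():         # single-digit ring label
            open_rings ^= {ch}
        i += 1
    return depth == 0 and not open_rings and not in_bracket


plp = "CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O"   # PLP SMILES used in this exercise
print(smiles_looks_balanced(plp))
print(smiles_looks_balanced(plp[:-10]))   # truncated copy, for comparison
```

Running the check before pasting the string into a docking prompt catches the most common copy errors.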

Now model the results in iCn3D as follows:

Open iCn3D
File, Open File, PDB appendable, and load 310CoPilotDock5JNR_PLP.pdb
Color, Secondary, Sheets in Yellow. Note that the protein is predominantly alpha-helical with some beta-sheets. Here is a static view of the protein.

Answer

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Here is a static view.

3D model of a protein structure with labeled atoms, featuring a complex arrangement of helices and loops.

Docking using DiffDock in Neurosnap

Exercise \(\PageIndex{11}\)

Now, use DiffDock through Neurosnap to dock the ligand and protein.

Complete the menu as shown below.

The output folder contains a separate PDB file and 100 SDF files for the various ligand "poses." Display them in iCn3D.

Open iCn3D
File, Open File, SDF, and choose the top-ranked SDF file (rank1_confidence-0.09.sdf)
File, Open File, PDB Appendable, and choose the proteins_no_ligands pdb file.
Render it as you see fit
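
With 100 pose files in the output folder, it can help to pick out the top-ranked one programmatically. The sketch below assumes every pose file follows the naming pattern of the example above (rank, then confidence, as in rank1_confidence-0.09.sdf). The other file names in the example list are hypothetical and used only for illustration, as are the function names.

```python
import re

# Pattern inferred from the single file name given above; treat it as an assumption.
POSE_NAME = re.compile(r"rank(\d+)_confidence(-?\d+(?:\.\d+)?)\.sdf$")


def parse_pose_name(filename: str):
    """Return (rank, confidence) for a pose file name, or None if it does not match."""
    match = POSE_NAME.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def top_pose(filenames):
    """Return the file name with the lowest rank number (rank 1 is best)."""
    ranked = []
    for name in filenames:
        parsed = parse_pose_name(name)
        if parsed is not None:
            ranked.append((parsed[0], name))
    if not ranked:
        raise ValueError("no pose files found")
    return min(ranked)[1]


# Hypothetical folder listing, apart from the rank1 file named in the exercise.
names = [
    "rank2_confidence-0.31.sdf",
    "rank1_confidence-0.09.sdf",
    "rank10_confidence-1.52.sdf",
    "protein.pdb",
]
print(top_pose(names))
```

Sorting on the parsed integer rank avoids the trap of alphabetical ordering, where rank10 would sort ahead of rank2.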

Answer

Here is a screen snap of the docked structure.

3D molecular structure of a protein with a color-coded ligand, surrounded by ribbon-like representations of amino acid chains.

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.

Docking using SwissDock

Exercise \(\PageIndex{12}\)

Now use SwissDock to dock PLP to a different low molecular weight protein tyrosine phosphatase, PDB ID = 1xww. This structure has a sulfate ion (SO₄²⁻) in the active site, which must be removed before docking. The program offers two docking methods, AutoDock Vina and Attracting Cavities. Use the second one, which seems to give better results for this protein.
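
If you ever need to remove the sulfate from a local copy of a PDB file yourself, the idea can be sketched in a few lines of Python. This is a minimal illustration, not what SwissDock does internally. It assumes the ion is stored as HETATM records with the residue name SO4 in columns 18-20 of the fixed-width PDB format. The function name strip_residue and the sample coordinate lines are made up for this example.

```python
def strip_residue(pdb_text: str, resname: str = "SO4") -> str:
    """Drop HETATM records whose residue name (columns 18-20) matches resname.

    CONECT records that mention the removed atoms are left in place.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("HETATM") and line[17:20].strip() == resname:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up records in PDB fixed-width layout, for illustration only.
sample = "\n".join([
    "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00 20.00           N",
    "HETATM 1234  S   SO4 A 201      10.000  20.000  30.000  1.00 20.00           S",
    "HETATM 1300  O   HOH A 301      12.000  22.000  32.000  1.00 20.00           O",
    "END",
]) + "\n"

cleaned = strip_residue(sample)
print(cleaned)
```

Only the sulfate record is removed; protein atoms and other heteroatoms such as water are kept.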

Input the values shown in the box below and run the docking experiment. When you input the target PDB file (1xww), choose the prompts shown below.

Document excerpt with parameters and references related to a statistical analysis or data study.

Screenshot of a form with fields for setting a target URL and viewing options for the desired target.

Download the result as a zip file.

View and render the docked ligand in iCn3D. First, you must modify one of the files to view the docking results in iCn3D.

Choose Export Results as Zip file
Open the Zip folder and extract all.
Open the result.dock4 file in a simple text editor. Keep just the first set of PLP coordinates: find the first line containing TER (terminate) and delete everything after it. Save the file as result.dock4.pdb in the folder you just extracted.
Open iCn3D
File, Open File, PDB appendable, and choose both the file you just made (result.dock4.pdb) and receptor.pdb to load them.
Render the image as you wish.
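
The text-editor step above (keep everything up to the first TER line of result.dock4 and discard the rest) can also be done with a short script. The sketch below works on the file contents as a string. The function name first_pose and the sample text are made up for this example, and the real file may contain header lines not shown here.

```python
def first_pose(dock4_text: str) -> str:
    """Return the text up to and including the first line that starts with TER."""
    kept = []
    for line in dock4_text.splitlines():
        kept.append(line)
        if line.startswith("TER"):
            break
    return "\n".join(kept) + "\n"


# Made-up two-pose example; real coordinates come from the SwissDock zip file.
sample = "\n".join([
    "REMARK pose 1",
    "HETATM    1  C1  LIG     1       1.000   2.000   3.000",
    "TER",
    "REMARK pose 2",
    "HETATM    1  C1  LIG     1       4.000   5.000   6.000",
    "TER",
]) + "\n"

trimmed = first_pose(sample)
print(trimmed)
```

To use it on the downloaded results, read result.dock4 into a string, pass it to first_pose, and write the returned text to the .pdb file named in the steps above.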

Answer

Here is the link to the actual SwissDock docking results page.

Here is a snip of the best-docked structure.

3D molecular structure showing a protein with ribbon-like shapes and a central compound, highlighting chemical bonds.

Top Pose:

Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.
