5.8: Problems - Predicting Protein Structure and Function Using Machine Learning and AI Programs
The exercises below use simple examples to demonstrate the power of machine learning/AI algorithms to predict protein structure and function and to design proteins with specific functions. You will then model the results using iCn3D. Many of the programs are quite new and will likely change as new and better software becomes available. This field is moving extremely fast. Some of the programs are commercial, with limited free use that is sufficient for these exercises. Remember that these exercises provide computational results. Experimental methods should be used to give these proposed structures additional support. The background for many of the methods used below is found in Chapter 4.14: Predicting Structure from Sequence and Sequence from Structure/Function.
- Glossary of AlphaFold terms
Predicting structure from Sequence with AlphaFold
1. Use AlphaFold3 from this link to determine a structure from this protein sequence:
MPGAISQLVSYGAQDVYLTGNPQITFFKAVYRRYTNFAMESIQQTFDGTTDFGKFPTVTISRNGDLAGPIWIEVNLPSLLGYNITPTPAEGNTSNIAAISTVFKDDYNNYWWTYNPGTTPQYSNLIAAFSNVDYKYYANAVTSTYPPTALSNVVYSWPYMITGNTGTRSTVAIPTANLRYVNGIGLALFNSIELELGGQRIDKHYSEWWDIWTELTETAEKIQGYNTMVGRYDPAVYNAGWNISQAQGGTYYVPLKFCYNRNPGLYMPLVALSYHQMKLNFNINNYLNCVKCNYPVTALTSKNGANPLSITNMKLYTDFVFLDAPERIRMSEIQHEYLVTQLQWQGSEPVTAPGDPNGSTNRKITLNFNHPVRELVFVYQAASNYDVDAVTGNNIFDYEIPANPTATPPYAGGGEVFTEVKLIINGSDRFSGRPGAYFRLVQPYEHHVRVPSKSVYVYSFALEDADSRQPNGSANFTRYDSVQLQLTLNENLASGRVQIYAPNFNILRIAAGMGGLAFAN
- Select these in the input window and then paste in the above sequence.
- Select Continue and Preview Job
- Name the job: FOB_Unknown_Protein1
- Select Confirm and Submit Job
- When completed, check the job box and open the results
- Take a screen capture of the results window
- Select the download icon in the top menu bar to download a compressed zip file of the results
- Unzip the folder/file
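The unzip step can also be scripted. The sketch below is a minimal example that assumes the download is an ordinary zip archive; the archive name in the usage comment is hypothetical, and only the model file name `fold_fob_unknown_protein1_model_0.cif` comes from this exercise.

```python
import zipfile
from pathlib import Path


def extract_models(zip_path, out_dir):
    """Unzip a results archive and return the .cif model files, sorted by name."""
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    # rglob also finds model files if the archive contains a subfolder
    return sorted(out.rglob("*.cif"))


# Usage (hypothetical archive name):
# models = extract_models("fold_fob_unknown_protein1.zip", "alphafold_results")
# print(models[0].name)  # the first model file, to open in iCn3D as mmCIF
```

Sorting by name puts `..._model_0.cif` first when the models are numbered 0, 1, 2, and so on.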
Use iCn3D to open the first structure file: fold_fob_unknown_protein1_model_0.cif (it is a .cif file, not a pdb file) and render as follows:
- File, Open File, mmCIF
- Analysis, Seq. & Annotations
- In the Seq & Annotations window, choose the Details tab
iCn3D shows 2 domains, the Capsid_N and Capsid_NCLDV domains (NCLDV stands for nucleocytoplasmic large DNA viruses).
- Click the blue Capsid_N domain name to highlight it
- Color, Unicolor, Cyan, Cyan
- Select, Save Selection, and name it Capsid_N domain
- Analysis, Label, Per Selection, then name it Capsid_N, and change the size to 10
- Click the blue Capsid_NCLDV domain name to highlight it
- Color, Unicolor, Magenta, Magenta
- Select, Save Selection, and name it Capsid_NCLDV domain
- Analysis, Label, Per Selection, then name it Capsid_NCLDV, and change the size to 10
- Clear Selection
- Style, Background, Transparent
- File, Save File, iCn3D PNG image, Original Size and HTML
- name it: fold_fob_unknown_protein1_model_0
- Answer
f. Here is the screen capture
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
- Open iCn3D
- File, Open, iCn3D PNG appendable and browse for the file in your download folder.
Try this: https://www.ncbi.nlm.nih.gov/Structu...?origin=mt-web
https://www.ncbi.nlm.nih.gov/Structu...d/pdb/3GVU.png
Here is a static image:
Domain information from iCn3D:
- Capsid_N: This is the N-terminal domain of the major capsid protein in several dsDNA viruses.
- Capsid_NCLDV: This family includes the major capsid protein of iridoviruses, chlorella virus and Spodoptera ascovirus, which are all dsDNA viruses with no RNA stage. This is the most abundant structural protein and can account for up to 45% of virion protein. In Chlorella virus PBCV-1 the major capsid protein is a glycoprotein. The four families of large eukaryotic DNA viruses, Poxviridae, Asfarviridae, Iridoviridae, and Phycodnaviridae, are referred to collectively as nucleocytoplasmic large DNA viruses or NCLDV. The virions of different NCLDV have dramatically different structures. The major capsid proteins of iridoviruses and phycodnaviruses, both of which have icosahedral capsids surrounding an inner lipid membrane, showed a high level of sequence conservation. A more limited, but statistically significant sequence similarity was observed between these proteins and the major capsid protein (p72) of ASFV, which also has an icosahedral capsid. It was surprising, however, to find that all of these proteins shared a conserved domain with the poxvirus protein D13L, which is an integral virion component thought to form a scaffold for the formation of viral crescents and immature virions.
The figure below shows an interactive iCn3D model of a capsid trimer from Paramecium bursaria Chlorella virus 1 (1M4X).
Figure: Capsid trimer from Paramecium bursaria Chlorella virus 1 (1M4X). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...7MCUVpKLytn9bA
The monomer in this trimeric capsid protein is a different capsid protein, with a different sequence, than the one you modeled. Two monomers are shown in brown and purple. The third monomer is colored to show the two capsid domains (the same as in the one you modeled): one is in cyan, and the other is colored by secondary structure. Each of these 2 domains contains a jellyroll fold with 8 beta-strands that form two 4-stranded sheets. Thousands of these individual trimers form the capsid shell of the virus, as shown in the figure below.
Figure: Structure of a Phycodnaviridae. https://viralzone.expasy.org/145
In this virus, there are 5040 copies of the major capsid protein (hexamers), 60 copies of the penton protein (pentamers), and 1800 minor capsid proteins of different types.
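The copy numbers above can be turned into capsomer counts with simple division. The sketch below assumes that every major capsid protein sits in a trimer and every penton protein in a pentamer; the function name is made up for illustration.

```python
MAJOR_CAPSID_COPIES = 5040  # from the text
PENTON_COPIES = 60          # from the text


def capsomer_count(copies, subunits_per_capsomer):
    """Number of capsomers, assuming all copies assemble into complete capsomers."""
    count, leftover = divmod(copies, subunits_per_capsomer)
    if leftover:
        raise ValueError("copies do not divide evenly into capsomers")
    return count


trimers = capsomer_count(MAJOR_CAPSID_COPIES, 3)
pentamers = capsomer_count(PENTON_COPIES, 5)
print(trimers, pentamers)  # 1680 12
```

The 12 pentamers match the 12 vertices of an icosahedron, and the 1680 trimers are the "thousands of individual trimers" mentioned above.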
In this example, you took linear information (the amino acid sequence) and turned it into 3D information, the tertiary structure of an unknown protein. Subsequent analyses within iCn3D show that this is a capsid protein from the dsDNA Paramecium bursaria Chlorella virus 1. The protein you modeled (same sequence) is listed in Uniprot (O41104) and the PDB (8H2I) as an averaged PBCV-1 capsid, but no 3D structures of this protein are available in these databases.
Predicting Structure from Sequence with ESMFold
This program predicts structures from the sequence in the ESM Metagenomic Atlas. In contrast to genomics, which involves the DNA sequence analysis of a single organism, metagenomics involves DNA sequencing of a complex sample (such as soil, the gut microbiome, etc) with a large number of different organisms and viruses. The Metagenomic Atlas contains information on a vast number of proteins. Protein structures in this database are determined using another AI program, ESMFold. These structures can be viewed in iCn3D as shown in the figure below.
The Atlas contains predicted protein structures for over 700 million proteins. A million of these can be previewed on their Explore page. ESMFold uses statistical scoring metrics similar to AlphaFold. They include pTM, which is the predicted TM-score, giving the reliability of the global fold, and the mean predicted LDDT (pLDDT) score which gives the reliability of the local structure. Over 225 million structures, including the sampling of those on the explore page, have high confidence (both pTM and pLDDT >0.7).
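The high-confidence criterion (both pTM and pLDDT above 0.7) is easy to express as a filter. This is only a sketch: the `Prediction` record and the example entries are hypothetical, and pLDDT is taken on the 0-1 scale used in the text (some tools report it on a 0-100 scale instead).

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    name: str
    ptm: float    # predicted TM-score, 0-1 (global fold reliability)
    plddt: float  # mean predicted LDDT, 0-1 scale here (local reliability)


def is_high_confidence(pred, cutoff=0.7):
    """True only if BOTH scores exceed the cutoff."""
    return pred.ptm > cutoff and pred.plddt > cutoff


# Hypothetical values, for illustration only
examples = [
    Prediction("hypothetical_A", ptm=0.82, plddt=0.91),
    Prediction("hypothetical_B", ptm=0.55, plddt=0.88),
]
kept = [p.name for p in examples if is_high_confidence(p)]
print(kept)  # ['hypothetical_A']
```

The second entry is rejected even though its local structure scores well, because its global fold (pTM) is below the cutoff.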
- Go to iCn3D
- File, Predict by Sequence, ESMFold.
- Input this sequence:
MIKNFFQNTEKSEFFKNSLVANFIIILTMVPQFLLVPIILNFWGEAQYSSYVVFLSVLNIAIQINGAMQNGYINRLFQSGKISVAELSSMFLLNLGLLSFMLILLFVFYVSGYGAAFFSHITLIILSLPFLFALSSLKIIYQLGHKVTTPL
- Then select ESMFold
- Render the structure in iCn3D as you wish.
- File, Save File, iCn3D PNG image, Original Size and HTML
- Answer
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable and browse for the file in your download folder.
Here is a static image of the calculated structure:
Predicting binding interactions and complex formation with AlphaFold
Use AlphaFold3 from this link to determine the structure of a complex between a protein and a dsDNA from just their sequence. Then compare the computed structure with a very similar X-ray structure. This example comes from a Nature paper that used AlphaFold to predict the structure of the complex. Here is the reference:
Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024). https://doi.org/10.1038/s41586-024-07487-w
Protein: AMP-binding protein-catabolite gene activator from Rhizobium meliloti (strain 1021) (Ensifer meliloti) (Sinorhizobium meliloti), uniprot: Q92SD2. Here is the protein sequence:
MAEVIRSSAFWRSFPIFEEFDSETLCELSGIASYRKWSAGTVIFQRGDQGDYMIVVVSGRIKLSLFTPQGRELMLRQHEAGALFGEMALLDGQPRSADATAVTAAEGYVIGKKDFLALITQRPKTAEAVIRFLCAQLRDTTDRLETIALYDLNARVARFFLATLRQIHGSEMPQSANLRLTLSQTDIASILGASRPKVNRAILSLEESGAIKRADGIICCNVGRLLSIADPEED
dsDNA sequence: Here are sequences of both DNA strands (5'-3'):
5'-CTAGGTAACATTACTCGCG-3' (19 mer) and 5'-GCGAGTAATGTTAC-3' (14 mer)
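Before submitting the two strands, it is worth checking that they really pair. The sketch below computes the reverse complement of the 14-mer and locates it in the 19-mer; the helper name is an assumption made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


top = "CTAGGTAACATTACTCGCG"   # 19-mer from the exercise
bottom = "GCGAGTAATGTTAC"     # 14-mer from the exercise

paired = reverse_complement(bottom)  # stretch of the 19-mer the 14-mer pairs with
start = top.find(paired)
overhang_5p = top[:start]
overhang_3p = top[start + len(paired):]
print(paired, start, overhang_5p, overhang_3p)  # GTAACATTACTCGC 4 CTAG G
```

The 14-mer pairs with bases 5-18 of the 19-mer, so the duplex has 14 base pairs with a 4-nucleotide 5' overhang (CTAG) and a 1-nucleotide 3' overhang (G) on the longer strand.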
In AlphaFold 3, input the entry type, copies (number of molecules in the structure), and the sequences, as shown in the figure below.
In the actual crystal structure, a protein dimer binds the dsDNA
- Answer
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable and browse for the file in your download folder.
Here is a static image of the calculated structure:
The figure below shows an interactive iCn3D model of the actual X-ray crystal structure of the Clr-cAMP-DNA complex (7PZB) for comparison. Note that cAMP could not be included in the AlphaFold 3 predicted structure of the complex using the version available.
Figure: X-ray crystal structure of the Clr-cAMP-DNA complex (7PZB). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...YZJitoVd7Y2LAA
The program can also be used to predict the structures of protein-protein complexes, as in this example from Chapter 4.14: the complex of human sperm and egg proteins predicted by AlphaFold.
Designing target proteins de novo (ex. making novel binders) with RFDiffusion
Thrombin is a serine protease: it cleaves other proteins using its active site serine as a nucleophile. Other examples of serine proteases are the gut proteases chymotrypsin and trypsin used in digestion. Thrombin cleaves a limited repertoire of proteins, mostly involved in blood clotting and its control. A main substrate is the protein fibrinogen, which after cleavage of small peptides forms fibrin that associates to form a fibrin clot. Specific substrates like fibrinogen, in addition to binding at the active site, interact with a positively charged, anion-binding "exosite" on thrombin near the active site. This additional binding site restricts thrombin to specific substrates.
The figure below shows an interactive iCn3D model of human thrombin (3U69). Three active site residues (catalytic triad) involved in cleaving fibrinogen and other clotting proteins are shown in CPK-colored sticks. The blue side chains and surface areas represent key positively charged residues in "Exosite 1" of thrombin, involved in binding the anionic region of fibrinogen and a modulator of thrombin activity, thrombomodulin.
Figure: X-ray crystal structure of human thrombin (3U69). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...PkxukNXxPGuHa7
Organisms that feed on liquid blood have exploited Exosite 1 by evolving peptide inhibitors that bind to it and block fibrinogen binding. One such organism is the leech, which secretes a protein called hirudin in its saliva. Hirudin has a long, anionic C-terminal tail that binds in the exosite groove and inhibits the binding of fibrinogen.
The figure below shows an interactive iCn3D model of the X-ray crystal structure of the hirudin-thrombin complex (4HTC). Key anionic residues in the C-terminal tail of hirudin are shown as red sticks and red surfaces interacting with the positive residues (blue sticks and surfaces) in Exosite 1 of thrombin.
Figure: X-ray crystal structure of the hirudin-thrombin complex (4HTC). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...EyESJzsUV85Ex5
To change the background color in the popup, choose Style, Background, Transparent.
Synthetic inhibitors of thrombin that target exosite 1 of thrombin have been made. One contains a negatively charged DNA (an aptamer) covalently linked to an active site inhibitor creating an "EXosite and ACTive site (EXACT) inhibitor".
In the next exercise, you will create a novel peptide inhibitor that targets exosite 1 of thrombin using RFDiffusion. This should be a simple example as the peptide is expected to have glutamates and aspartates, mimicking the hirudin C-terminal end.
You will run RFDiffusion from a commercial company called Neurosnap. First, set up an account (free, 5 models/month).
- Go to RFDiffusion-v2
- Separately, download this pdb file for human thrombin to your computer: 3U69
- Open RFdiffusion-v2 and enter the values listed below. Leave the other settings unchanged. H stands for the Heavy Chain in the PDB file.
Binder Input Chain | H |
Binder Length Maximum | 20 |
Binder Length Minimum | 17 |
Binder ROG | false |
Binding Pocket Residue End | 1 |
Binding Pocket Residue Start | 1 |
Fixed Residue End | 20 |
Fixed Residue Start | 10 |
Hotspots | H340,H341,H388,H390,H393,H397,H426,H427 |
Input Structure | input_structure.pdb |
- Run the program.
When completed, the output page has several tabs, Data and Visuals, Config, and Files. View the results in the Data and Visuals tab
- Answer
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable, and browse for the file in your download folder.
Note that the amino acid numbering of the hotspots changed in the output files from RFDiffusion, as shown in the table below.
Initial PDB numbering for Heavy Chain (3U69) | RFDiffusion numbering for Heavy Chain Hot Spots |
R340 | 20 |
K341 | 21 |
R388 | 68 |
R390 | 70 |
R393 | 73 |
K397 | 77 |
K426 | 106 |
K427 | 107 |
Here is a static image:
The light chain is not shown. The heavy chain H is shown in gray. The hot spots on thrombin (H340, H341, H388, H390, H393, H397, H426, and H427) are colored by charge (blue stick and surface). The binder is shown as a gold surface with the charged side chains shown in red (negative) and blue (positive). The sequence of the RFDiffusion binder is EEEEKELELLREEIEKLEKE, which has 11 glutamates (E) but also 3 lysines (K) and 1 arginine (R), perhaps to maintain some charge balance.
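The composition of the binder can be checked by counting residues. The sketch below estimates a net charge by treating D and E as -1 and K and R as +1, which is a rough approximation at neutral pH that ignores histidine and the chain termini.

```python
from collections import Counter

binder = "EEEEKELELLREEIEKLEKE"  # binder sequence from this run

counts = Counter(binder)
acidic = counts["D"] + counts["E"]  # negatively charged side chains
basic = counts["K"] + counts["R"]   # positively charged side chains
net_charge = basic - acidic         # approximate, side chains only

print(len(binder), acidic, basic, net_charge)  # 20 11 4 -7
```

A net charge of about -7 fits a peptide meant to sit on the positively charged Exosite 1, in the same way as the anionic C-terminal tail of hirudin.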
Here is an animation from the output to show how RFDiffusion produces the final peptide at the hotspot residues.
Here are the stats for the top-ranked binder:
Rank | MPNN Score | RMSD | Mean pLDDT | Max PAE | pTM |
1 | 1.14 | 19.69 | 92.98 | 23.48 | 0.91 |
Watson, Joseph et al. “Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models.” bioRxiv. https://doi.org/10.1101/2022.12.09.519842
Neurosnap Inc. - Computational Biology Platform for Research. Wilmington, DE, 2022. https://neurosnap.ai/
Create a binder for human thrombin using Google Colab at this site (note: this link might be replaced in the future).
a. Inputs: Accept all defaults except for the sections below; for those, use the inputs shown. The goal is not to optimize the run but to show how easy it is to create a binder. The inputs allow you to create a 17-amino acid binder to the H (heavy) chain of human thrombin (3U69) at the positively charged Exosite 1 hot spot residues in the H chain, as described above.
b. Run the program. You can run sections individually or all at once. To run one section after the other, click the arrow for the first section as shown below, wait for it to complete, and then run the rest.
The first section sets up the program and takes a few minutes to run. Alternatively, run all at once as shown in the image below.
The results should be downloaded to a .zip file in your computer's download folder.
c. Visualizing the results: Unzip the folder and open the "best_design0.pdb" file in iCn3D. Display and render your file in iCn3D.
- Answer
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein-binder complex. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable, and browse for the file in your download folder.
Note that the numbering system for the H chain changed, as in the first run. Here is a static image:
The gray chain is the heavy chain of thrombin. The hot spots on this chain are shown in spacefill with CPK colors and are labeled. The binder is shown as a transparent surface with side chains colored by charge (red = negative, blue = positive).
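The renumbering of the heavy chain seen in the first run follows a simple pattern: in the table of hot spots given there, each RFDiffusion number equals the PDB number minus a constant. The sketch below infers that constant from the table and applies it to the hotspot string. The offset is derived only from those eight residues, so treat it as specific to that output and not as a general rule; the function names are made up for illustration.

```python
HOTSPOTS = "H340,H341,H388,H390,H393,H397,H426,H427"  # input used in the exercise

# PDB number -> RFDiffusion number, copied from the table in the first run
NUMBERING = {340: 20, 341: 21, 388: 68, 390: 70, 393: 73, 397: 77, 426: 106, 427: 107}


def parse_hotspots(text):
    """Split 'H340,H341' into [('H', 340), ('H', 341)]."""
    return [(item[0], int(item[1:])) for item in text.split(",")]


def infer_offset(numbering):
    """Return the single offset shared by all pairs, or fail if they disagree."""
    offsets = {pdb - new for pdb, new in numbering.items()}
    if len(offsets) != 1:
        raise ValueError("numbering is not a constant shift")
    return offsets.pop()


OFFSET = infer_offset(NUMBERING)
renumbered = [f"{chain}{num - OFFSET}" for chain, num in parse_hotspots(HOTSPOTS)]
print(OFFSET, renumbered)
# 320 ['H20', 'H21', 'H68', 'H70', 'H73', 'H77', 'H106', 'H107']
```

When labeling hot spots in an output file, compare a few residues by residue type (R, K) as well as by number before relying on a shift like this.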
Predicting Protein Function through Structure with FoldSeek
FoldSeek can be run directly on a local PDB file or one retrieved from the PDB through this link. However, it can also be run through iCn3D, adding many more possibilities for rendering and analyses.
Use iCn3D to run FoldSeek
We will use FoldSeek within iCn3D to compare the structure of a newly predicted protein from Wuhan insect virus 23, a virus of uncertain family, with the structures of all proteins in the databases.
First download this file: hypothetical_protein__YP_009329883__Wuhan_insect_virus_23__1923727.pdb
- Open iCn3D
- File, Open File, PDB appendable, and load the downloaded file
- Color, Secondary, Sheets in Yellow. Note that the protein is predominantly alpha-helical with some beta-sheets. Here is a static view of the viral protein.
Now let's use FoldSeek within iCn3D to find similar structures (not sequences) in the databases (PDB and AlphaFold).
- File, Search Similar, FoldSeek (PDB and AlphaFold)
- Click Submit
A new tab with FoldSeek appears with the best statistical structural fits. The statistics shown in this table include the probability (0-1) that the structural match of the query (the protein studied) to the target (what it might resemble) is not due to chance. The other statistic is the expect value (E-value): the number of matches with an equal or better score expected to occur just by chance in a database of that size. Unlike a probability, an E-value can exceed 1. The closer the E-value is to zero, the more significant the match. (See this page from the PDB that describes E-values for sequence alignments.)
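In BLAST-style statistics, an E-value E (the expected number of chance hits) and the probability P of seeing at least one chance hit are related by P = 1 - exp(-E), assuming chance hits follow a Poisson distribution. The sketch below applies that relation; whether FoldSeek's reported values follow exactly this model is an assumption here.

```python
import math


def evalue_to_pvalue(evalue):
    """P(at least one chance hit) = 1 - exp(-E), under a Poisson assumption."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E
    return -math.expm1(-evalue)


for e in (1e-30, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.3g}")
```

For very small E-values the two numbers are practically identical, which is why E is often read as a probability. For E near 1 or above they diverge: E = 1 corresponds to P of about 0.63, and P can never exceed 1 however large E becomes.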
- Which hit is best? How good is the quality of the fit?
- If you click the blue hyperlink for the best hit, you will get the structure from the ESM database and you can download it. The database protein structure is shown below.
Instead, click the staggered three lines (=) to the right to get the alignment with the hypothetical protein.
This pops up:
The query (blue, the protein studied) and the target (orange, what it might resemble)
- Click the PDB link to download a PDB file of the two aligned protein segments. Model them in iCn3D to replicate the structure above.
Query (hypothetical_protein__YP_009329883__Wuhan_insect_virus_23__1923727) is blue, Target (MGYP001363004432) is orange
This hypothetical viral protein structure is very dissimilar to any structure in the databases except for the helical bundles similar to protein MGYP001363004432 in the ESM Metagenomic Atlas database.
- Answer
f. You will see for this particular structure, there are just 5 "hits" in only one database, mgnify_esm30 from the ESM Metagenomic Atlas Database described above.
Here are the 5 hits.
None of the structural alignments are good statistically.
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the aligned proteins. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable, and browse for the file in your download folder.
This hypothetical viral protein structure is very dissimilar to any structure in the databases except for the helical bundles similar to protein MGYP001363004432 in the ESM Metagenomic Atlas database.
Query (hypothetical_protein__YP_009329883__Wuhan_insect_virus_23__1923727) is blue, Target (MGYP001363004432) is orange
Now let's try a new viral protein structure whose function we can likely infer from the FoldSeek structural overlay.
Download this protein structure (whose function has been surmised through programs like FoldSeek): matrix_protein__YP_001531158__Marburg_marburgvirus__11269.pdb.
- Open iCn3D
- File, Open File, PDB appendable, and load the downloaded file
- Color, Secondary, Sheets in Yellow. Note that the protein is predominantly alpha-helical with some beta-sheets.
Here is a static image:
- Now repeat the procedure above and use FoldSeek within iCn3D (as described in Exercise 6 above) to find proteins of similar structure and presumably function.
- Model one of the top hits in iCn3D
- Answer
The optimal target is VP24 of the Marburg virus (Uniprot ID: Membrane-associated protein VP24 - P35256). Both query and target proteins are from the Marburg virus, in the Filoviridae family, which also includes the Ebola virus. These are deadly viruses that cause hemorrhagic fevers and have high death rates. Here is one example of an almost-perfect hit.
Target | Description | Scientific Name | Prob. | Seq. Id. | E-Value |
4or8-assembly1_B | Crystal structure of Marburg virus VP24 | Marburg virus - Musoke, Kenya, 1980 | 1.00 | 94.1 | 9.90e-32 |
The query (blue, the protein studied) and the target (orange, what it might resemble)
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the aligned proteins. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable, and browse for the file in your download folder. A static image is shown below.
Query (matrix_protein__YP_001531158__Marburg_marburgvirus__11269) is blue, Target (VP24 of the Marburg virus) is orange
From the Uniprot page for the VP24 protein, FoldSeek can be run to determine which proteins in the database are similar. In this case, VP24 is the query and FoldSeek finds targets of similar structure.
The best hits are in the PDB database:
Target | Description | Scientific Name | Prob. | Seq. Id. | E-Value |
4or8-assembly1_A | Crystal structure of Marburg virus VP24 | Marburg virus - Musoke, Kenya, 1980 | 1.00 | 100 | 1.03e-46 |
4or8-assembly1_B | Crystal structure of Marburg virus VP24 | Marburg virus - Musoke, Kenya, 1980 | 1.00 | 94.7 | 4.12e-37 |
6ehm-assembly1_C | Model of the Ebola virus nucleocapsid subunit from recombinant virus-like particles | Ebola virus - Mayinga, Zaire, 1976 | 1.00 | 37 | 7.36e-19 |
4u2x-assembly3_C | Ebola virus VP24 in complex with Karyopherin alpha 5 C-terminus | Ebola virus - Mayinga, Zaire, 1976 | 1.00 | 36.5 | 6.94e-19 |
3vne-assembly1_A | Structure of the ebolavirus protein VP24 from Sudan | Sudan ebolavirus | 1.00 | 35.5 | 3.25e-18 |
4d9o-assembly1_A | Structure of ebolavirus protein VP24 from Reston | Reston ebolavirus - Reston | 1.00 | 36 | 1.55e-16 |
3vnf-assembly1_A | Structure of the ebolavirus protein VP24 from Sudan | Sudan ebolavirus | 1.00 | 34.8 | 1.75e-16 |
4d9o-assembly1_B | Structure of ebolavirus protein VP24 from Reston | Reston ebolavirus - Reston | 1.00 | 35 | 3.84e-15 |
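The hits listed above illustrate why structure searches are useful: the Ebola virus proteins share only about 35% sequence identity with Marburg VP24, yet FoldSeek reports them with probability 1.00. The sketch below re-enters the values from the list and separates close from distant relatives. The 40% identity cutoff is an arbitrary choice for illustration, not a FoldSeek setting.

```python
# (target, organism, sequence identity in %, E-value), copied from the hits above
hits = [
    ("4or8-assembly1_A", "Marburg virus", 100.0, 1.03e-46),
    ("4or8-assembly1_B", "Marburg virus", 94.7, 4.12e-37),
    ("6ehm-assembly1_C", "Ebola virus", 37.0, 7.36e-19),
    ("4u2x-assembly3_C", "Ebola virus", 36.5, 6.94e-19),
    ("3vne-assembly1_A", "Sudan ebolavirus", 35.5, 3.25e-18),
    ("4d9o-assembly1_A", "Reston ebolavirus", 36.0, 1.55e-16),
    ("3vnf-assembly1_A", "Sudan ebolavirus", 34.8, 1.75e-16),
    ("4d9o-assembly1_B", "Reston ebolavirus", 35.0, 3.84e-15),
]

IDENTITY_CUTOFF = 40.0  # illustrative threshold, an assumption

ranked = sorted(hits, key=lambda hit: hit[3])  # most significant first
distant = [hit[0] for hit in hits if hit[2] < IDENTITY_CUTOFF]

print(ranked[0][0])   # 4or8-assembly1_A
print(len(distant))   # 6
```

All six distant hits are ebolavirus proteins, which supports assigning the query the same fold, and probably a related function, as VP24.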
Running FoldSeek using 310 Copilot
Now run FoldSeek using 310 Copilot. This program is part of a suite of commercial programs from 310 AI. These programs allow users to input questions and data as sentences (much as with chatbots such as ChatGPT, Claude, Gemini, etc.) to address complex questions in biology and biochemistry.
- Click try it now. Input the query shown at the bottom of the figure below. In this case, you are asking to find structures similar to VP24 of the Marburg virus using the PDB ID = 4or8.
- Answer
Here are the results:
As of 11/21/24, you can't upload a local pdb file (as we did with iCn3D above) and run FoldSeek within Copilot.
Exploring protein complexes with FoldSeek Multimer
In Exercise 3 above, you used AlphaFold to predict the structure of protein complexes (in that particular example a protein:DNA complex) from sequences. By analogy, FoldSeek Multimer can find the 3D structures of target complexes from the 3D structure of a known (query) complex. In short, AlphaFold can do large-scale sequence-to-structure comparisons while FoldSeek Multimer can do large-scale structure-to-structure comparisons. Let's try an example with a simple complex, the hepatitis A virus C3 proteinase (1HAV), a homodimer of an A and B chain.
Go to the FoldSeek Multimer server
- Load Accession 1hav and run the program.
- Copy/Snip the top 4 results. Compare the 1st and 4th results.
- Click the alignment icon (=) for each and take screen snips.
- Answer
b. Here is a snip of the top results.
The top results show high-quality alignments of one chain with a human protein, but that human protein is not part of a similar dimer. Result 4 shows good alignment of both chains of the query with a human dimer.
Note the label for the top hit: B ➔ ProtVar_P83110_Q96RQ3_A. The P83110 is the Uniprot number for human serine protease HTRA3.
c. Hit 1 - alignment for just 1 chain of the query and target
Query (blue, the protein studied) and the target (orange, what it might resemble)
The icons underneath the blue/orange models are explained below.
- PDB: select to download the combined file and model in iCn3D
- blue: toggle between the entire query and the aligned structure
- orange: toggle between the entire target and the aligned structure. Note that in this case, the target (human protein) is much bigger than the query (viral) protein so the alignment is only between part of the human protein.
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the aligned structures. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable, and browse for the file in your download folder. A static image is shown below.
Query (blue, the protein studied) and the target (orange, what it might resemble)
Hit 4: Alignment of two chains (A and B) for the query and target
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the aligned structures. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D, then choose File, Open, iCn3D PNG appendable, and browse for the file in your download folder. A static image is shown below.
Docking of Small Molecules (Ligands)
Predicting binding interactions between small molecules and proteins is at the heart of the pharmaceutical industry. Previously, docking software for small molecule ligands and proteins was proprietary, costly, and difficult to run. Now docking can be done online. In the next exercises, you will use three programs: 310 Copilot DiffDock (from a commercial company), DiffDock through Neurosnap, and SwissDock (Molecular Modeling Group, University of Lausanne, and the SIB Swiss Institute of Bioinformatics).
The inputs for a docking program include the protein (typically given as a PDB ID) and the ligand, which can be supplied as an actual structure file but is more often given as a code in the SMILES or InChI format. For these exercises, you will dock the small molecule pyridoxal phosphate (PLP) to a low molecular weight protein tyrosine phosphatase, an enzyme that cleaves the phosphate from phosphorylated tyrosines in proteins.
The input representation for PLP can be obtained from PubChem, as shown below.
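A quick way to check that a SMILES string was copied correctly is to count its heavy atoms and compare them with the molecular formula listed in PubChem (C8H10NO6P for PLP). The Python sketch below does this for the SMILES string used in these exercises. It is a minimal illustration, not a real SMILES parser: the function name count_heavy_atoms is hypothetical, and it assumes a simple SMILES with uppercase element symbols only (no bracketed atoms, charges, or aromatic lowercase letters). Hydrogens are implicit in SMILES, so they are not counted.

```python
import re
from collections import Counter

# SMILES for pyridoxal phosphate (PLP) as used in the docking exercises
PLP_SMILES = "CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O"


def count_heavy_atoms(smiles: str) -> Counter:
    """Count element symbols in a simple SMILES string.

    Assumption: only unbracketed, uppercase organic-subset atoms appear
    (B, C, N, O, P, S, F, Cl, Br, I). Ring-closure digits, bond symbols,
    and parentheses are ignored.
    """
    # Match two-letter halogens first so "Cl" is not read as carbon
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]", smiles)
    return Counter(tokens)


if __name__ == "__main__":
    counts = count_heavy_atoms(PLP_SMILES)
    # Expect 8 C, 1 N, 6 O, 1 P, matching the formula C8H10NO6P
    for element in sorted(counts):
        print(element, counts[element])
```

If the counts do not match the PubChem formula, the string was probably truncated or altered when it was copied.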
Docking using 310 Copilot
Docking using Copilot - DiffDock
- Go to PubChem and get the SMILES representation for PLP.
- Open 310 Copilot
- Input this text: Run a docking experiment with the protein PDB ID = 5JNR and the small molecule SMILES=CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O
- When the docked structure is shown, select Download PDB
- Save as 310CoPilotDock5JNR_PLP.pdb
Now model the results in iCn3D as follows:
- Open iCn3D
- File, Open File, PDB appendable, and load 310CoPilotDock5JNR_PLP.pdb
- Color, Secondary, Sheets in Yellow. Note that the protein is predominantly alpha-helical with some beta-sheets.
- Answer
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein-binder complex. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.
Here is a static view.
Docking using DiffDock in Neurosnap
Now use DiffDock through Neurosnap to dock the ligand and protein.
- Complete the menu as shown below.
The output folder contains a separate PDB file and 100 SDF files for the various ligand "poses". Display them in iCn3D.
- Open iCn3D
- File, Open File, SDF, and choose the top-ranked SDF file (rank1_confidence-0.09.sdf)
- File, Open File, PDB Appendable, and choose the proteins_no_ligands pdb file.
- Render it as you see fit
- Answer
Here is a screen snap of the docked structure.
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein-binder complex. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D File, Open, iCn3D PNG appendable, and browse for the file in your download folder.
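With 100 pose files in the output folder, it can help to list them by rank and confidence before choosing which ones to open. The Python sketch below parses file names of the form seen above (rank1_confidence-0.09.sdf). It assumes that every pose file follows the pattern rank<N>_confidence<score>.sdf, where the score may be negative; the names Pose, parse_pose, and sort_poses are hypothetical, and the example file names other than rank1_confidence-0.09.sdf are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Pose(NamedTuple):
    rank: int
    confidence: float
    filename: str


# Assumed naming pattern: rank<N>_confidence<score>.sdf (score may be negative)
_POSE_RE = re.compile(r"^rank(\d+)_confidence(-?\d+(?:\.\d+)?)\.sdf$")


def parse_pose(filename: str) -> Optional[Pose]:
    """Return a Pose for a matching file name, or None if it does not match."""
    match = _POSE_RE.match(filename)
    if match is None:
        return None
    return Pose(int(match.group(1)), float(match.group(2)), filename)


def sort_poses(filenames: Iterable[str]) -> List[Pose]:
    """Parse pose file names and sort them numerically by rank."""
    poses = [p for p in map(parse_pose, filenames) if p is not None]
    return sorted(poses, key=lambda p: p.rank)


if __name__ == "__main__":
    # Only the first name comes from the exercise; the others are hypothetical
    names = [
        "rank10_confidence-1.20.sdf",
        "rank1_confidence-0.09.sdf",
        "rank2_confidence-0.35.sdf",
        "proteins_no_ligands.pdb",
    ]
    for pose in sort_poses(names):
        print(pose.rank, pose.confidence, pose.filename)
```

Sorting on the parsed integer avoids the alphabetical ordering of a file browser, which would place rank10 before rank2.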
Docking using SwissDock
Now use SwissDock to dock PLP to a different low molecular weight protein tyrosine phosphatase, PDB ID = 1xww. This structure has a sulfate ion (SO₄²⁻) in the active site, which must be removed before docking. The program offers two different docking methods, AutoDock Vina and Attracting Cavities. Use the second one, which seems to give better results for this protein.
- Input the values shown in the box below and run the docking experiment. When you input the target PDB file (1xww), choose the prompts shown below.
- Download the result as a zip file.
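If you would rather remove the sulfate yourself before uploading a structure, the cleanup can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the file is a standard fixed-column PDB file in which the residue name occupies columns 18-20, and the sulfate is labeled SO4. The function name strip_residue is hypothetical and is not part of SwissDock.

```python
def strip_residue(pdb_text: str, resname: str = "SO4") -> str:
    """Return PDB text with all coordinate records for one residue name removed.

    Assumption: fixed-column PDB format, where columns 18-20
    (Python slice 17:20) hold the residue name.
    """
    kept = []
    for line in pdb_text.splitlines():
        is_coord = line.startswith(("ATOM", "HETATM", "ANISOU"))
        if is_coord and line[17:20].strip() == resname:
            continue  # skip atoms belonging to the unwanted residue
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # "1xww.pdb" is a placeholder for a file you downloaded yourself
    with open("1xww.pdb") as handle:
        cleaned = strip_residue(handle.read(), "SO4")
    with open("1xww_no_so4.pdb", "w") as handle:
        handle.write(cleaned)
```

Water molecules and other heteroatoms are left untouched, so check the cleaned file in iCn3D before using it.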
View and render the docked ligand in iCn3D. First, you must modify one of the files to view the docking results in iCn3D.
- Choose Export Results as Zip file
- Open Zip folder and extract all.
- Open the result.dock4 file in a simple text editor. Keep just the coordinates of the first PLP pose: find the first line that contains TER (terminate) and delete everything after it. Save the file as result.dock4.pdb in the File subfolder you just extracted.
- Open iCn3D
- File, Open File, PDB appendable, and select both the file you just made (result.dock4.pdb) and the receptor.pdb file to load them.
- Render the image as you wish.
- Answer
Here is the link to the actual SwissDock docking results page.
Here is a snip of the best docked structure.
Top Pose:
iCn3D results: Download this PNG file for upload into iCn3D to see a rendered image of the protein. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
Open iCn3D File, Open, iCn3D PNG appendable and browse for the file in your download folder.
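The manual editing step for the SwissDock result file (keeping everything up to the first TER line) can also be scripted. The Python sketch below is a minimal illustration that assumes the result file is plain text with each pose terminated by a line beginning with TER, as described in the steps above. The function name first_pose and the output file name are hypothetical choices.

```python
def first_pose(dock4_text: str) -> str:
    """Return all lines up to and including the first line starting with TER.

    If no TER line is found, the whole text is returned unchanged
    (apart from a normalized trailing newline).
    """
    kept = []
    for line in dock4_text.splitlines():
        kept.append(line)
        if line.startswith("TER"):
            break  # the first pose ends here
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # File names are placeholders; adjust them to your extracted folder
    with open("result.dock4") as handle:
        top_pose = first_pose(handle.read())
    with open("result.dock4.pdb", "w") as handle:
        handle.write(top_pose)
```

The saved file can then be loaded into iCn3D together with receptor.pdb, exactly as in the manual procedure.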