4.13: Predicting Structure and Function of Biomolecules Through Natural Language Processing Tools


Recent Updates: New Chapter Section 11/1/23

Written by Logan Hallee and Henry Jakubowski

Introduction

So far in this chapter, you have learned about protein structure and its determination in the laboratory. After decades of work in modeling protein structure and properties, the life science community has built massive databases organizing this information. While sequencing DNA and discovering protein sequences have become relatively cheap, the actual characterization of protein structure and function is still time- and cost-intensive. Researchers therefore look to model and predict protein properties from the amino acid sequence alone to reduce the work required in the lab. The most recent and effective tool in this quest is the protein language model (pLM), which models proteins as a biological language of amino acids. At their core, pLMs are transformers with a protein vocabulary.

The transformer, an attention-based neural network, emerged as a game-changer for the scientific community with the iconic 2017 paper “Attention Is All You Need.” The crux of this work is the idea that by strategically organizing simple neural networks, performance can be enhanced beyond what is achieved by merely scaling up a single neural network. A neural network is a type of AI/machine learning model, often described as deep learning, which is patterned after the brain, with nodes (neurons) interconnected by lines (axon/dendritic connections).

Especially adept at processing sequential data like time series or sentences, transformers have become the bedrock of modern natural language processing (NLP). This technology is grounded in understanding the semantic and contextual intricacies of their trained vocabulary.

The Essence of Protein Language Models

Tokenization & Token Embedding

The fundamental problem in NLP is encoding text, a string datatype, into a meaningful numerical representation; it is awfully hard to do math on words. One approach is to give every sub-part of the vocabulary, say a word, a unique integer. That way, any sequence of words can be turned into a vector of integers, and we can easily do math on vectors; think back to physics or math, where the dot product operates on collections of numbers that have direction. The look-up table that performs this translation is called a tokenizer because it translates tokens (words, sub-words, or characters) to integers.

Vectors - A Simplified Review

To understand how AI/Machine Learning can be used to predict structure and function, we need to know about vectors and their use in physics and mathematics, particularly in matrices. Most students likely need a refresher. The review below will help you get a better understanding of the rest of the material in this section.


Most biochemistry students have taken physics in high school and college. In those courses, you were introduced to scalar and vector quantities. Scalar quantities like distance and work have no direction. Vector quantities like displacement and force have both a magnitude and a direction. Vectors are shown as arrows, with the length representing the magnitude and the arrowhead indicating the direction.

Let’s review a simple concept from elementary physics, work. Work is a scalar quantity and you probably remember that mechanical work is done on an object when an external force moves an object a given distance. Consider a force F (bold represents a vector) applied to a block which causes it to move a distance along the surface. The distance and direction together are described as displacement d, as shown in Figure \(\PageIndex{1}\) below.

Figure \(\PageIndex{1}\): Forces on a block moving along a surface

If the force were applied vertically, the block would not move along the surface, so no work would be done on the block. If the force is applied at some angle, only the horizontal component of the force causes the block to slide.

The horizontal component of the force is F cos θ, where θ is the angle between the force and the displacement. (When θ = 90°, cos θ = 0, so no work is done.) Hence the work is W = Fd cos θ, which is the “dot product” of the vectors F and d.
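The relationship between the component form and the angle form of the dot product can be checked numerically. The following is a minimal sketch; the force, angle, and distance values are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def magnitude(u):
    """Length of a vector."""
    return math.sqrt(dot(u, u))

# Hypothetical values: a 10 N force applied 60 degrees above the horizontal
# moves the block 2 m along the x-axis.
theta = math.radians(60)
F = (10 * math.cos(theta), 10 * math.sin(theta))  # force vector (x, y)
d = (2.0, 0.0)                                    # displacement vector (x, y)

work_from_components = dot(F, d)                               # F . d
work_from_angle = magnitude(F) * magnitude(d) * math.cos(theta)  # F d cos(theta)
```

Both expressions give the same work, because only the horizontal component of the force contributes.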

Vectors are also used in math and can be considered as directed line segments. Let’s consider the equations for circles and spheres. These equations are based on the Pythagorean Theorem. As you learned in high school geometry, the Cartesian equation of a circle is:

x² + y² = r²

To generate a circle, set the r-value to a fixed number (such as 1 for a “unit” circle), and for a number of x values, calculate the y values (where -1 ≤ x, y ≤ +1) from the equation. Then plot the x and y coordinate pairs and presto, you have a circle, as shown in Figure \(\PageIndex{2}\) below.

Figure \(\PageIndex{2}\): Cartesian graph of a circle

Each of the (x,y) pairs can be considered a 2D vector with the origin at (0,0), a magnitude of 1 (the fixed radius in this example), and a direction, described by the specific (x,y) point that falls on the circle. Two simple (x,y) pairs are (1,0) and (0,1) for the unit circle. Another (x,y) pair that satisfies the Pythagorean theorem is (0.5, 0.866).
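As a quick check of this idea, the sketch below generates points on the upper half of the unit circle from the Cartesian equation and confirms that each one is a vector of magnitude 1. The choice of 21 evenly spaced x values is arbitrary.

```python
import math

r = 1.0
xs = [i / 10 for i in range(-10, 11)]  # x values from -1.0 to 1.0

# Solve x^2 + y^2 = r^2 for the positive y value at each x.
upper_half = [(x, math.sqrt(max(0.0, r**2 - x**2))) for x in xs]

# The magnitude of each (x, y) vector equals the radius.
magnitudes = [math.hypot(x, y) for x, y in upper_half]
```

The point at x = 0.5 has y ≈ 0.866, matching the pair given above.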

The Pythagorean Theorem can be extended to three dimensions to give the Cartesian equation for a sphere:

x² + y² + z² = r²

To generate a sphere simply solve for z for a multitude of x and y values and a fixed r value. Then plot the x, y, and z values and presto you get a sphere, as shown in Figure \(\PageIndex{3}\) below for a “unit” sphere of radius 1.

Figure \(\PageIndex{3}\): Cartesian graph of a sphere

The sets of x, y, and z points that land on the surface of the sphere are vectors (directed line segments).

Now vectors are also used to describe matrices. Matrices are two-dimensional arrays of numbers. A matrix with just one row or one column is called a row vector or column vector, respectively, as shown in Figure \(\PageIndex{4}\) below.

Figure \(\PageIndex{4}\):

You should now see that all of the vectors that define a sphere can be written as a large matrix. Figure \(\PageIndex{5}\) below shows a 3 x 3 square matrix that represents the unit vectors that lie along the x, y, and z axes.

Figure \(\PageIndex{5}\):

In this example, any vector will lie on the surface of the unit sphere if the three components have the relationship of the Cartesian equation above.

These examples showcase how values in vectors can represent a position in space, but they can also be measurements of an object. For example, a car with 4 tires, 4 cylinders, and 180 horsepower could be represented as (4, 4, 180), where a more detailed description will lead to a longer vector with more components.

Another use of vectors and matrices comes from more complicated mathematical graphs. Graphs are arbitrary objects with nodes and edges, where edges connect nodes. A popular example of a mathematical graph is the social network Facebook. You can represent the entirety of the Facebook network with each member as a node and an edge between each node when the nodes are friends on the site. An example of a social networking graph is shown in Figure \(\PageIndex{6}\) below.


Figure \(\PageIndex{6}\): Graph showing social relationships using graph theory. Darwin Peacock. CC BY 3.0, https://commons.wikimedia.org/w/inde...?curid=6057981

This example is an unweighted undirected graph because the edges do not have a specific value or direction. We could make a weighted graph that represents Facebook by utilizing weighted edges, perhaps the number of mutual friends between different members. This is still an undirected example.

Graphs are an important concept in computational biochemistry because molecules can be represented as graphs, with atoms as nodes and bonds as edges. This is illustrated in Figure \(\PageIndex{7}\) for a multidentate adsorbate complex.


Figure \(\PageIndex{7}\): Graph theory-based algorithm to generate graphs for a given atomic model. Deshpande, S., Maxson, T. & Greeley, J. Graph theory approach to determine configurations of multidentate and high coverage adsorbates for heterogeneous catalysis. npj Comput Mater 6, 79 (2020). https://doi.org/10.1038/s41524-020-0345-2. http://creativecommons.org/licenses/by/4.0/. Creative Commons Attribution 4.0 International License.

Panel a shows an atomic model for a simple nanoparticle with adsorbates. Panel b is an algorithm to generate graph-based representations. Panel c shows the generated graph model.

Figure \(\PageIndex{8}\) below shows a protein structure graph (right) for a short stretch of an alpha helix (left).


Figure \(\PageIndex{8}\): Protein and protein graph. Baranwal, M., Magner, A., Saldinger, J. et al. Struct2Graph: a graph attention network for structure-based predictions of protein–protein interactions. BMC Bioinformatics 23, 370 (2022). https://doi.org/10.1186/s12859-022-04910-9. http://creativecommons.org/licenses/by/4.0/

Figure \(\PageIndex{9}\) shows a graph not of a single protein but of a small protein:protein interaction network.

Figure \(\PageIndex{9}\): protein interactions of TMEM8A in humans. https://commons.wikimedia.org/wiki/F...for_TMEM8A.png

You can even produce graphs that represent entire networks of molecules and their relationships. A directed molecular graph might showcase proteins and their substrate. Having a direction in an edge is important in this distinction because a protein may use a substrate for a chemical reaction but a substrate might not act on a protein on its own. Many molecular relationships are weighted and directed. A weight in the protein substrate case might be the relative affinity of binding between the molecules.

These types of graphs contain three types of edges, undirected, directed, and weighted as illustrated in Figure \(\PageIndex{10}\) below.

Figure \(\PageIndex{10}\): The main types of edges found in a network. https://www.ebi.ac.uk/training/online/courses/network-analysis-of-protein-interaction-data-an-introduction/introduction-to-graph-theory/graph-theory-graph-types-and-edge-properties/ . Attribution 4.0 International (CC BY 4.0) license

Undirected edges: Connections in protein-protein interactions, as shown in Figure 9 above, are examples. The proteins are connected through binding but without implied flow between them.
Directed edges: These are found in metabolic and signaling pathways, where arrows indicate the flow of reactants/products in a pathway. These can be arranged in complex hierarchies, as those familiar with metabolic and signaling pathways know.
Weighted edges: Either undirected or directed edges can have a quantitative weight associated with them. These may reflect affinities, similarities between genes, fold effects, etc.

These examples are great for showcasing the versatility of mathematical graphs, but how can we use matrices to represent them? Enter adjacency matrices.

Adjacency matrices state which nodes are connected. Each node can also have multiple features. Figure \(\PageIndex{11}\) below shows a network of 5 nodes, its adjacency matrix, and a feature matrix holding the features of each node. For example, if an atom is a node, the features could be electronegativity, partial charge, size, etc.

Figure \(\PageIndex{11}\): Properties of a 5 node network

An adjacency matrix stacks n vectors together for a graph that has n nodes. Each vector is also n long, so the resultant matrix is n by n. The entry in the ith row and jth column is the number of edges connecting node i to node j. So if the 1st node has an edge to the 2nd node, the entry in the 1st row and 2nd column of the adjacency matrix will be 1. For undirected graphs these matrices are symmetric, so in this example there would also be a 1 in the 2nd row and 1st column. Figure \(\PageIndex{12}\) below illustrates this:

Figure \(\PageIndex{12}\): https://mathworld.wolfram.com/AdjacencyMatrix.html

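If there is a weight on an edge, the entry in the adjacency matrix can store the weight instead of the count of edges between nodes. Direction can be encoded as well: a directed edge from node i to node j fills only the entry in row i and column j, so the matrix is no longer symmetric, and positive and negative entries can distinguish different kinds of relationships.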

Any complex network can be described mathematically as an adjacency matrix, with rows and columns indicating nodes and each entry indicating an edge. Unweighted and undirected edges lead to symmetric matrices with just 0 and 1. Directed and weighted networks can be more complicated, with different numbers used to show relationships like affinity. +/- values can be used, where + is an activation and - is an inhibition. These matrices can be manipulated using linear algebra. Examples of adjacency matrices for undirected, directed, and weighted networks are shown in Figure \(\PageIndex{13}\) below.

Figure \(\PageIndex{13}\): Adjacency matrices from undirected, directed and weighted networks. https://www.ebi.ac.uk/training/online/courses/network-analysis-of-protein-interaction-data-an-introduction/introduction-to-graph-theory/graph-theory-adjacency-matrices/ . Attribution 4.0 International (CC BY 4.0) license
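The three kinds of adjacency matrix can be built from a simple list of edges. The sketch below is illustrative only; the 4-node network, the edge list, and the function names are hypothetical, and nodes are numbered from 0 as is usual in Python.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (source, target, weight) edges."""
    A = [[0] * n for _ in range(n)]
    for i, j, w in edges:
        A[i][j] = w
        if not directed:
            A[j][i] = w  # an undirected edge is stored in both directions
    return A

def is_symmetric(A):
    """True when A[i][j] equals A[j][i] for every pair of nodes."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

# Hypothetical 4-node network with unweighted edges (weight 1).
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (0, 3, 1)]
undirected = adjacency_matrix(4, edges)
directed = adjacency_matrix(4, edges, directed=True)

# Signed weights: + for an activation, - for an inhibition.
weighted = adjacency_matrix(4, [(0, 1, 0.8), (1, 2, -0.5)], directed=True)
```

The undirected matrix is symmetric, while the directed one is not, because an edge from node 0 to node 1 does not imply an edge from node 1 to node 0.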

Hopefully, after this brief intro to vectors and matrices, you can now understand how amino acid position and their semantic properties (polarity, charge, size, etc) could be described as large matrices and their associated vectors.

Back to Tokens

Here are a few types of tokenizers commonly used in NLP:

Word Tokenizers: Word tokenizers split text into individual words based on spaces or punctuation marks. This approach assumes that words are the primary units of meaning in a language. For example, given the sentence "The cat is sleeping," a word tokenizer would split it into tokens: ["The", "cat", "is", "sleeping"].
Subword Tokenizers: Subword tokenizers split text into subword units that capture partial linguistic information. This approach is useful for handling out-of-vocabulary words like abbreviations or for reducing vocabulary size, which saves computational resources. Popular subword tokenization algorithms include Byte Pair Encoding (BPE), Unigram Language Model, and SentencePiece. Subword tokenizers can be complicated, but one possible subword tokenization of our example above would be [“The#”, “cat#”, “is#”, “sleep”, “ing#”] where # has been added to showcase the ending of a word.
Character Tokenizers: Character tokenizers treat each character as a separate token. This approach is beneficial when dealing with languages without explicit word boundaries or for character-level modeling tasks. Our sleepy cat is now [“T”, “h”, “e”, “\s”, “c”, “a”, “t”, “\s”] and so on. Here, we need to add a space token so the model can tell where the words start and finish.
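The word and character tokenizers described above can be sketched in a few lines. This is a simplified illustration (it splits on spaces only and ignores punctuation), and the integer ids it assigns are arbitrary.

```python
sentence = "The cat is sleeping"

# Word tokenizer: split on spaces.
word_tokens = sentence.split()

# Character tokenizer: one token per character, with "\s" standing in for a space.
char_tokens = ["\\s" if c == " " else c for c in sentence]

# A tokenizer is a look-up table from tokens to integers.
vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
word_ids = [vocab[tok] for tok in word_tokens]
```

Subword tokenizers such as BPE learn their vocabulary from data and are not reproduced here.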

For pLMs, researchers typically treat amino acids as tokens or “words” and protein sequences as “sentences.” For example, a protein sequence like "MVKLTA" would be tokenized into individual amino acids: 'M', 'V', 'K', 'L', 'T', and 'A'.
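A minimal amino acid tokenizer might look like the sketch below. The alphabetical ordering of the 20 standard one-letter codes, and therefore the specific integer assigned to each residue, is an assumption made for illustration; real pLMs define their own vocabularies and ids.

```python
# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
id_to_token = {i: aa for aa, i in token_to_id.items()}

def encode(sequence):
    """Translate a protein sequence into a list of integer token ids."""
    return [token_to_id[aa] for aa in sequence]

def decode(ids):
    """Translate a list of token ids back into a protein sequence."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("MVKLTA")
```

Decoding the ids returns the original sequence, so no information is lost in tokenization.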

The main problem with tokenizing sentences or sequences is that this numerical space has no semantic meaning. The grammar and word meanings are not encoded here. To store this semantic information, we will define another look-up table. However, this time, we will make it a large matrix with some arbitrary dimension, say, 768. Let us assume there are 50,000 or so English words we want to model, so our matrix is 50,000 x 768. Now, we will simply connect our tokenizer and our matrix. If the word “dog” corresponds to the token 5, the vector that represents “dog” will be the 5th row of our matrix. Maybe “protein” is token 800. The vector that represents “protein” is the 800th row in our matrix.
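The connection between the tokenizer and the embedding matrix is just a row look-up. The sketch below uses a tiny matrix (20 tokens x 8 dimensions instead of 50,000 x 768) filled with random starting values; the sizes, the random initialization scale, and the token ids are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

vocab_size, embed_dim = 20, 8

# One row per token; values start random and would be adjusted during training.
embedding = [[random.gauss(0, 0.02) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

# Hypothetical token ids for a six-residue sequence.
token_ids = [10, 17, 8, 9, 16, 0]

# Embedding a sequence means picking out the row for each token id.
embedded = [embedding[t] for t in token_ids]
```

Note that Python numbers rows from 0, so token 5 corresponds to `embedding[5]`.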

These vectors that represent each word are where we will store the semantic information behind each word, the meaning, and related grammar within the vocabulary. Unlike an object like a car, with easy numerical features to pick out (4 door, 4 wheel, 6 cylinder, etc.), we need to learn the features of a word. And so, within the transformer neural network, we have a vector that represents each word that contains learnable weights. Through the process of gradient descent, these weights will be adjusted from a random starting point to contain the necessary information for language modeling. One neat fact is that within a well-trained English language model, the word vectors satisfy King - Man + Woman ≈ Queen. How amazing that concepts behind gender and royalty can be encoded in a meaningful numerical space!
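The arithmetic behind King - Man + Woman ≈ Queen can be illustrated with a toy example. The two-dimensional vectors below are hand-built for illustration (one axis for gender, one for royalty) and are not taken from any trained model, where embeddings have hundreds of learned dimensions.

```python
import math

# Hand-built, hypothetical 2D "embeddings": (gender axis, royalty axis).
vectors = {
    "king":  (1.0, 1.0),
    "queen": (-1.0, 1.0),
    "man":   (1.0, 0.0),
    "woman": (-1.0, 0.0),
}

def nearest_word(target, exclude=()):
    """Return the vocabulary word whose vector is closest to target."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

king, man, woman = vectors["king"], vectors["man"], vectors["woman"]

# Element-wise arithmetic: king - man + woman.
result = tuple(k - m + w for k, m, w in zip(king, man, woman))
answer = nearest_word(result, exclude=("king", "man", "woman"))
```

In this toy space the result lands exactly on the vector for "queen"; in a trained model the match is only approximate.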

Of course, many words have different meanings in different contexts. This is still the case in proteins, where specific residues may be important because of their charge, or maybe just because of the space they occupy. That is why we need a portion of our transformer that handles the contextual understanding. This is where attention comes in.

Attention

Attention is used to signify the importance of a word or a part of a word. If high attention is given to a word, for example, less overall information is needed to predict the word. Attention can be conferred by weighting the importance of the word/token. In real language, we attend to certain words in sentences and weigh them more to provide context and make predictions. Consider this sentence referring to a TV show or movie. Predict which show/movie it is from.

..and Spock said that the needs of the many outweigh the needs of the few.

The answer comes quickly (if you are a Star Trek fan) by attending to the word Spock. You could maybe predict the entire sentence using databases by perhaps attending to another word such as outweigh.

Using attention gives the transformers we discussed above a form of long-term memory, allowing them to concentrate on (attend to) previously generated tokens.

The attention mechanism identifies dependencies and relationships between tokens within an input sequence. But what makes attention especially compelling is its ability to dive deep into a relational space and then seamlessly bridge back into the token embedding space we previously discussed.

Another feature is self-attention, which focuses on the relationships within a given sentence.

Self-attention, where every token in a sequence attends to every other token, allows for a dynamic weighting of significance.

Three key inputs are required. The Query is what you are asking (such as the input in a web search box), the Key is the search results, and the Values are the returned content in each search result. In a web search, the program has to find the best matches between the query and the keys.

Now consider both the query and key to be vectors, q and k. The similarity between them can be measured with the cosine similarity, which is the dot product of the two vectors divided by the product of their lengths: \( \cos\theta = \frac{\mathbf{q} \cdot \mathbf{k}}{|\mathbf{q}||\mathbf{k}|} \), since \( \mathbf{q} \cdot \mathbf{k} = |\mathbf{q}||\mathbf{k}|\cos\theta \). The cosine function makes sense as a measure of similarity since it varies between +1 (when the vectors point in the same direction, θ = 0°) and -1 (when they point in opposite directions, θ = 180°). This gives the degree of similarity.
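The cosine similarity is straightforward to compute. The sketch below uses short hypothetical vectors to show the three landmark cases: same direction, opposite direction, and perpendicular.

```python
import math

def cosine_similarity(q, k):
    """Dot product of q and k divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / (norm_q * norm_k)

same = cosine_similarity((1.0, 2.0), (2.0, 4.0))        # parallel vectors
opposite = cosine_similarity((1.0, 2.0), (-1.0, -2.0))  # antiparallel vectors
perpendicular = cosine_similarity((1.0, 0.0), (0.0, 3.0))
```

Because the lengths are divided out, only the direction of the vectors matters, not their magnitude.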

Mathematically the relationships among query, key, and values matrices are formulated as

\begin{equation}
\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
\end{equation}

Here, the Q, K, and V stand for query, key, and value matrices respectively, and d_k is the dimension of the queries and keys. These come directly by multiplying learned weights with our token embeddings.

\[ Q = W_e \times W_Q, \quad K = W_e \times W_K, \quad V = W_e \times W_V \]

The softmax function turns a vector of n real values into a vector of n values between 0 and 1 that sum to 1, in effect converting them into probabilities.

W_e symbolizes the token embedding matrix, or at least the portion extracted by our tokenizer and token embedding process. d_k denotes the dimension of the key vectors; dividing by its square root scales the dot products before the softmax is applied.
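The softmax function itself is short. The sketch below is a minimal version; subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result, and the input scores are hypothetical.

```python
import math

def softmax(values):
    """Convert real-valued scores into positive values that sum to 1."""
    m = max(values)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -1.0]
probs = softmax(scores)
```

Larger scores receive larger probabilities, and equal scores receive equal probabilities.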

During the attention operation, there is a length-wise representation and an embedding-wise representation. The length-wise representation is typically referred to as an “attention matrix” and is often analyzed by researchers to see how a model makes decisions. The attention matrix is the result of \(\operatorname{softmax}\left(QK^T/\sqrt{d_k}\right)\): a matrix \(A\) of size L x L, where L is the sequence length and each value lies between 0 and 1. A value of \(A_{i,j}\) close to 1 signifies that the i-th amino acid bears contextual relevance to the j-th one. In pLMs, such associations often reflect spatial proximity or chemical bonds; perhaps the i-th amino acid sits close to the j-th one in the protein's 3D structure. Notably, the diagonal scores \(A_{i,i}\) are typically high, reflecting an amino acid's inherent relevance to itself. When A is multiplied by V, we go from this relational space back to the embedding space, where the matrix is L x embedding dimension. In our previous example, this was 768, but this number is completely arbitrary. A smaller embedding dimension essentially leads to a smaller model with a worse theoretical top-end performance, with the advantage that it is much cheaper to use and train. A larger embedding dimension and model will have a higher theoretical top-end performance but will require more data and computational resources to train and be more expensive to use.
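The attention equation can be traced step by step on a tiny example. The sketch below is a plain-Python illustration, not an efficient implementation; the Q, K, and V matrices are hypothetical (3 tokens, dimension 2) and are given directly rather than computed from learned weights.

```python
import math

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the L x L attention matrix."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))  # L x L relational space
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights  # back to L x embedding dimension

# Hypothetical inputs for a sequence of L = 3 tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, A = attention(Q, K, V)
```

Each row of `A` sums to 1, so each row of `output` is a weighted average of the rows of V.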

Diving deeper, multi-head attention further refines this process. It transforms the input sequence into multiple smaller queries, keys, and values—each an independent attention head brandishing unique weights. Each head, in isolation, computes its own version of the attention matrix on a section of the input. These matrices are then woven together, under an additional learned linear transformation, into a final cohesive representation. This multi-headed approach has proven important for pLMs, where an increased headcount is correlated with better performance.

Bringing it all together ...

Lastly, bridging back from the realm of relational depths to the embeddings, transformers don't solely rely on attention. They also include feed-forward layers, which further improve the generalized modeling capabilities. Together, these components are the essence of the transformer: a multi-headed attention layer and a feed-forward layer. These transformer layers are simply stacked on top of each other, so the output of one is the input of the next. When such stacks are trained on language at scale, they are referred to as large language models (LLMs). LLMs have a couple of main forms and some other important parts that we did not discuss. For example, there are transformer encoders (BERT) and decoders (GPT) that have different organizations of layers. There are also some additional embeddings called positional embeddings, which help transformers learn the intended order of input sequences because the attention operation by itself does not take token order into account. Also, there are different types of normalizations and skip connections that are extremely important for protein modeling. So while this introduction to transformers is enough to get your feet wet, we have included several other resources to further refine your knowledge if desired.

Training transformers

At a high level, transformers are trained just like any other neural network. For a large corpus of training data, there is an input and a known desired output called the ground truth. The input is fed to the model and the output is compared against the ground truth with a loss function. This is all set up in a way so that the loss function measures some type of error between the output and the ground truth so that the loss function, and corresponding error, are minimized with optimization techniques. Regardless of the technical details, it is essentially strategic trial and error, where the model is rewarded for producing outputs close to or equal to the ground truth. This way, through optimization, the error is reduced over time and the model learns how to do the specified task effectively.

However, because transformers have such a large parameter count they take a lot of data to successfully adjust for a specific task. Thus, we need a way to produce a ton of labeled data without wasting too much human time annotating data. Luckily, sequences of strings are perfect for this. If we feed a transformer a sequence we can simply hide a token and ask the model to predict what word went there.

Now, in the input, let’s replace some words (or tokens) with Mask tokens randomly. Mask tokens are used to hide tokens from the model so that we can train the model to recover the missing words, where the desired output is the original sequence. Such a task forces the model to learn what words are around it; hence building the model's semantic and contextual understanding. This is called denoising because we artificially injected noise into our input and our model used the surrounding context to get rid of the noise. Another popular training objective is next token prediction, where part of a sequence is input and the model predicts what goes next. Different organizations of transformer layers are better or worse at these various tasks.
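Both training objectives produce labeled data automatically from raw sequences. The sketch below shows one way this could be set up; the `<mask>` token string, the 30% masking rate, and the example sequence are assumptions chosen for illustration rather than values from any particular model.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Randomly hide tokens; labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # this position is not scored
    return masked, labels

tokens = list("MVKLTAGHWE")  # hypothetical protein sequence

# Denoising objective: the input is the masked sequence, the target is the original.
masked, labels = mask_tokens(tokens, mask_rate=0.3)

# Next token prediction: each token is predicted from the tokens before it.
inputs, targets = tokens[:-1], tokens[1:]
```

In both cases the ground truth comes from the sequence itself, so no human annotation is needed.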

The subject of a MASK token brings up the broader topic of special tokens. Special tokens are tokens that are added to a specific vocabulary that serve a specific purpose. As we discussed, a model can learn to replace a MASK token with a correct token that belongs in the sequence.

Some other popular tokens are CLS and SEP, which stand for classification and separator respectively.

CLS tokens are typically prepended to the beginning of sequences so that models can learn to summarize the entire input in a single vector, which is useful for classification tasks. Separator tokens are typically placed at the end of sequences or between sequences if more than one was fed to the model at a time. This way, the model can treat separate sequences as individual entities even if they are input at the same time. Researchers will often create new special tokens to use alongside model training for specific tasks. For example, a pLM called ProstT5 has a special token that indicates a translation from structure to amino acid sequence and an additional token that does the reverse. The respective token was prepended to the necessary inputs during training so that the model could intuit which task it was supposed to be performing. This is particularly useful if you are designing a transformer model that needs to do multiple distinct tasks.
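Adding CLS and SEP tokens to one or two sequences can be sketched as follows. The token strings `<cls>` and `<sep>` and the function name are hypothetical; each model defines its own special tokens.

```python
def add_special_tokens(seq_a, seq_b=None, cls="<cls>", sep="<sep>"):
    """Prepend a CLS token and close each sequence with a SEP token."""
    tokens = [cls] + list(seq_a) + [sep]
    if seq_b is not None:
        tokens += list(seq_b) + [sep]  # a second sequence gets its own separator
    return tokens

single = add_special_tokens("MVK")
pair = add_special_tokens("MVK", "LTA")
```

With two inputs, the separator in the middle lets the model tell where the first sequence ends and the second begins.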

Applications of protein language models

Now that we have learned substantially about how modern language modeling works we can explore how researchers utilize these techniques to further advance our understanding of biochemistry.

Predicting 3D Structure

By and large, the most famous pLM is AlphaFold, the deep learning model that first successfully mapped amino acid sequences to protein structure at a large scale. Since then, other pLMs have also been able to learn from a large corpus of sequence and structure data to perform well on unseen sequences. The fundamental organization of sequence-to-structure models is made up of:

a transformer (pLM) that builds a semantic and contextual understanding of sequences;
a structure module that maps the latent sequence representation to 3D coordinates.

AlphaFold is special in a few ways. Firstly, the transformer used for AlphaFold is actually two transformers in one! Let us call the first transformer T1. T1 is an MSA transformer, which works on several sequences at once instead of a single sequence. This comes from the concept of multiple sequence alignment (MSA), which is a common search method in bioinformatics that compares amino acid strings based on an evolutionarily informed algorithm. The output of an MSA is a list of strings aligned based on similarity and substitution probability. Figure \(\PageIndex{13}\) below …

Figure \(\PageIndex{13}\): Correlated mutations carry information about distance relationships in protein structure. https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/

The sequence of the protein for which the 3D structure is to be predicted (each circle is an amino acid residue, typical sequence length is 50–250 residues) is part of an evolutionarily related family of sequences (amino acid residue types in standard one-letter code) that are presumed to have essentially the same fold (iso-structural family). Evolutionary variation in the sequences is constrained by many requirements, including the maintenance of favorable interactions in direct residue-residue contacts (red line, right). The inverse problem of protein folding prediction from sequence addressed here exploits pair correlations in the multiple sequence alignment (left) to deduce which residue pairs are likely to be close to each other in the three-dimensional structure (right). A subset of the predicted residue contact pairs is subsequently used to fold up any protein in the family into an approximate predicted 3D shape (‘fold’), which is then refined using standard molecular physics techniques, yielding a predicted all-atom 3D structure of the protein of interest.

Including this information allows the model to learn more about the protein than it could from a single sequence because of the concept of coevolution. When a mutation changes one residue, the residues that contact it in the folded protein often must change in a compensating way if the protein, and therefore the organism, is going to stay fit. Positions that mutate together across the alignment are thus likely to be close to each other in the 3D structure.
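
To make the idea of correlated columns concrete, here is a small sketch that scores pairs of alignment columns with mutual information. Mutual information is used only as a simple stand-in for "pair correlation"; it is not the method AlphaFold uses, and the toy alignment below is made up for illustration.

```python
from collections import Counter
from math import log2

# Toy alignment (made-up data): columns 0 and 3 vary together,
# columns 1 and 2 are fully conserved.
msa = ["AKLE", "AKLE", "GKLD", "GKLD", "AKLE", "GKLD"]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

coupled = mutual_information(msa, 0, 3)    # columns that co-vary
conserved = mutual_information(msa, 0, 1)  # one column never changes
print(coupled, conserved)
```

A high score flags a pair of positions that change together, which is the kind of signal a model can use as a hint that the two residues are in contact. A conserved column carries no such pairwise signal, so its score is zero.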

However, as we have discussed, transformers are incredibly adept at processing and understanding single sequences; how would one process many sequences simultaneously? The MSA input to the pLM is many sequences stacked on top of each other, so the attention mechanism needs to be modified. The MSA goes through row-wise attention, which picks out the important residues, and column-wise attention which picks out the important sequences. This creates a protein latent space built from an MSA instead of a single sequence.
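
The sketch below illustrates the difference between the two attention directions on a toy MSA representation. It is a deliberately stripped-down assumption: the attention has no learned query, key, or value projections and only a single head, whereas real models use both, and the input numbers are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Unparameterized scaled dot-product self-attention over a list of vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(d)])
    return out

def row_attention(msa):
    # msa[s][r] is the feature vector of residue r in sequence s.
    # Each sequence (row) attends over its own residues.
    return [self_attention(row) for row in msa]

def column_attention(msa):
    # Each alignment column attends over the sequences at that position.
    n_seq, n_res = len(msa), len(msa[0])
    cols = [self_attention([msa[s][r] for s in range(n_seq)]) for r in range(n_res)]
    # Transpose back to [sequence][residue][channel] order.
    return [[cols[r][s] for r in range(n_res)] for s in range(n_seq)]

# Toy MSA representation: 3 sequences x 4 residues x 2 channels (made-up numbers).
msa = [[[float(s + r), float(s * r)] for r in range(4)] for s in range(3)]
updated = column_attention(row_attention(msa))
print(len(updated), len(updated[0]), len(updated[0][0]))
```

Both passes keep the sequences-by-residues-by-channels shape, so they can be stacked layer after layer. The only difference is which axis the attention weights are computed over.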

The other transformer, T2, is also a transformer with modified attention (triangular self-attention). T2 computes over a pair-wise representation of the single input sequence. It builds a representation similar to a distogram, which is a matrix representation of the distance between every residue and every other residue. T1 and T2 make up the section of AlphaFold known as the EvoFormer, and there are 48 of these EvoFormer layers in total. The latent representation of the MSA input from T1 and the pair-wise representation from T2 are both inputs to the structure module.
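
To show what a distogram-like matrix looks like, the following sketch builds a residue-by-residue distance matrix from known coordinates and then discretizes it into bins. This is only an illustration of the data structure: a structure predictor has to infer such distances from sequence rather than compute them from coordinates, and the coordinates and bin edges here are made-up values.

```python
import math

# Made-up C-alpha coordinates (in angstroms) for a 4-residue toy chain.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

def distance_matrix(coords):
    """Distance between every residue and every other residue."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def to_bins(dmat, edges):
    """Replace each distance by the index of the first bin edge it does not exceed."""
    def bin_index(d):
        for k, edge in enumerate(edges):
            if d <= edge:
                return k
        return len(edges)  # beyond the last edge
    return [[bin_index(d) for d in row] for row in dmat]

dmat = distance_matrix(coords)
bins = to_bins(dmat, edges=[2.0, 4.0, 8.0])  # hypothetical bin edges
for row in bins:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, and its size grows with the square of the sequence length, which is one reason pair-wise representations are expensive for long proteins.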

The structure module also has some fancy attention (invariant point attention) that merges the information from T1 and T2. A simple computer-vision-inspired architecture called a ResNet uses the attention output to predict side chain and backbone torsion angles at the atomic level. From these outputs, the atom coordinates for the entire protein are calculated and the structure is relaxed with Amber, which removes any structural violations based on the charges and locations of the atoms.

This entire process is repeated three times, where the structure information and MSA information can inform each other through skip connections and linear transformations. This recycling greatly improves the final 3D structure. See Figure \(\PageIndex{14}\) below.

Figure \(\PageIndex{14}\): Model architecture. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2. Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/.

Arrows show the information flow among the various components. Array shapes are shown in parentheses with s, the number of sequences; r, the number of residues; c, the number of channels.

In summary, AlphaFold is the combination of three main networks that talk to each other and work together. The input sequence is used to search a large database for similar sequences, which are assembled into an MSA. These similar sequences are input to T1, which builds a semantic and contextual understanding of the amino acids called the latent space. This latent space informs a pair-wise distogram that tracks the distances between amino acid residues in the original input sequence. All of this information is utilized by the structure module, which does some fancy math to calculate the 3D coordinates of each atom. The entire process is repeated so the structure information can inform the MSA and the MSA can further inform the structure. Throughout the process, a ResNet and Amber are used to prevent any side chain or backbone angles that cannot exist in nature.

Other projects have also taken the MSA approach to structure prediction. RoseTTAFold and xTrimoPGLM utilize an MSA with different networks and attention variants to perform with similar accuracy to AlphaFold. However, building an MSA can be quite computationally expensive and requires a database in which similar sequences exist. For so-called orphan proteins, which have little to no sequence homology in recorded repositories, these methodologies fall short.

To address the computational cost and the orphan protein problem, various projects like ESM, OmegaFold, and Ember have utilized a standard pLM trained through mask denoising together with a structure module to also obtain high accuracy on structure prediction, outperforming MSA-based methods on orphan proteins. Breakthroughs from AlphaFold, like recycling, are standard practice throughout these different approaches.

Some researchers run these various projects in parallel to get multiple structure predictions for the same sequence. This ensemble approach leverages the advantages of each model and averages out the disadvantages. Low-confidence regions or disagreements between models may also indicate intrinsically disordered regions of the protein, which are often important for biological function.

All in all, sequence-to-structure mapping has been effectively achieved with modern computational methods. The combination of pLMs and structure modules has enabled the large-scale annotation of protein sequences with high-quality structures, something the scientific community can utilize for accelerated breakthroughs. However, structure prediction is not the only thing you can do with pLMs.

Protein function prediction

The latent space learned from vast mask denoising on corpora of amino acid sequences correlates extremely highly with protein structure, which is vital for the study and annotation of proteins. Interestingly, this latent space also correlates highly with other useful types of annotation like function. By averaging the last hidden state output of a pLM across the length of the sequence, one can build an effective vector representation of a protein, called a vector embedding. This way, every protein has the same size numerical representation, regardless of its length, and can easily be fed to machine learning classifiers like support vector machines, k-nearest neighbors, random forests, and more. The pLM can also be fine-tuned as a classifier given enough annotated data. Precomputed protein embeddings from popular models like ProtT5 can be downloaded from UniProt, as well as large amounts of annotated data if you would like to try this for yourself.
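
The averaging step and a downstream classifier can be sketched in a few lines. Everything below is a toy assumption: the "hidden states" are made-up two-dimensional vectors rather than real pLM outputs, the labels are invented, and a one-nearest-neighbor lookup with cosine similarity stands in for whichever classifier you prefer.

```python
import math

def mean_pool(hidden_states):
    """Average per-residue vectors into one fixed-size protein embedding."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_label(query, labeled):
    """Return the label of the most similar labeled embedding (1-nearest neighbor)."""
    return max(labeled, key=lambda item: cosine(query, item[0]))[1]

# Made-up "last hidden states": one 2-dimensional vector per residue.
protein_a = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]           # 3 residues
protein_b = [[0.0, 1.0], [0.2, 0.8]]                       # 2 residues
query = [[0.7, 0.3], [0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]   # 4 residues

labeled = [
    (mean_pool(protein_a), "oxidoreductase"),  # invented labels
    (mean_pool(protein_b), "transferase"),
]
prediction = nearest_label(mean_pool(query), labeled)
print(prediction)
```

Although the three proteins have different lengths, each one is reduced to a vector of the same size, which is what makes a standard classifier applicable.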

The types of annotations researchers are typically interested in come down to details like EC and GO classes. EC stands for Enzyme Commission, and EC numbers break down protein functionalities into a hierarchical organization scheme delimited by what type of reaction the proteins catalyze.

For example, the hierarchy of an EC number can be illustrated as:

1st digit: Represents one of the seven primary classes of enzymes, e.g., '1' stands for oxidoreductases.
2nd digit: Describes a subclass within the primary class. If an enzyme is a '1.1', it specifically deals with acting on the CH-OH group of donors.
3rd digit: Categorizes the enzyme even further by specifying the acceptor. For instance, '1.1.1' would mean that the enzyme acts on the CH-OH group with NAD+ or NADP+ as the acceptor.
4th digit: Provides a unique identifier for each enzyme within its specific class, subclass, and sub-subclass. So, '1.1.1.1' is the EC number for alcohol dehydrogenase.

In total, there are currently over 8000 unique EC numbers! pLMs have shown remarkable competency in predicting them from sequence alone, often achieving 80-90% accuracy or better on unseen data.
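
The four-level hierarchy described above is easy to work with programmatically. The helper names `ec_levels` and `shared_depth` below are hypothetical; the sketch only shows how an EC number splits into nested levels and how one might measure how closely a predicted number matches a true one.

```python
def ec_levels(ec_number):
    """Return the class, subclass, sub-subclass, and full identifier prefixes."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated fields")
    return [".".join(parts[:k]) for k in range(1, 5)]

def shared_depth(ec_a, ec_b):
    """Number of leading hierarchy levels two EC numbers have in common."""
    depth = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        depth += 1
    return depth

levels = ec_levels("1.1.1.1")
print(levels)
print(shared_depth("1.1.1.1", "1.1.1.2"))
print(shared_depth("1.1.1.1", "2.7.1.1"))
```

Scoring a prediction by shared depth gives partial credit to a model that identifies the correct class and subclass even when it misses the final digit.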

Another popular type of annotation is Gene Ontology (GO), which labels genes based on what their protein products do in a biological context. GO is split into three main categories:

Biological Process (BP): Describes a series of events accomplished by one or more ordered assemblies of molecular functions. For instance:

GO:0006955 - Immune response

GO:0006958 - Complement activation, classical pathway

GO:0045087 - Innate immune response

… and so on

Cellular Component (CC): Describes parts of a cell or environment a protein product likely localizes to. For example:

GO:0005634 - Nucleus

GO:0005654 - Nucleoplasm

GO:0005694 - Chromosome

… and others

Molecular Function (MF): Describes activities, such as binding or catalysis, that occur at the molecular level. This subcategory is very similar to EC numbers. For instance:

GO:0003824 - Catalytic activity

GO:0016491 - Oxidoreductase activity

GO:0016614 - Oxidoreductase activity, acting on CH-OH group of donors

… and further subcategories

The main difference in organization is that GO terms have parent-child relationships that are not unique. In simpler terms, a GO term can have multiple parent terms while an EC number may only have one. Regardless, pLMs also show a wide breadth of impressive performances in predicting GO terms from sequence alone.
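
The consequence of allowing multiple parents is that GO forms a graph rather than a tree. The sketch below walks such a graph to collect every ancestor of a term. The parent links are a hand-written, simplified excerpt for illustration only and should not be read as the current ontology; consult the GO release itself for the real relationships.

```python
# Hand-written, simplified parent links for illustration only.
parents = {
    "innate immune response": ["immune response", "defense response"],
    "immune response": ["biological process"],
    "defense response": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:       # a term can be reached by several paths
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

anc = ancestors("innate immune response", parents)
print(sorted(anc))
```

Because a protein annotated with a term is implicitly annotated with all of that term's ancestors, GO prediction is usually treated as a multi-label problem rather than a single-label one like EC classification.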

Additionally, researchers are interested in the complex interplay of proteins in the cell. How do proteins modify each other and their surrounding cellular components? Why does a specific gene expression cause a disease state? Which chaperones or post-translational modifications can contribute to homeostasis and which ones are detrimental? All of these questions can be chipped away at by building an understanding of protein-protein interactions and networks.

Protein-protein interactions (PPIs) can be defined in a variety of ways, but the literature typically focuses on some mediated chemical or conformational change when one protein comes in contact with another. Some definitions add the requirement that an interaction have a nonredundant function in some sense. Regardless, researchers typically simplify the problem drastically by treating it as a binary classification: proteins either interact or they do not. In a biological context, it is much more complicated than this, but good PPI classifiers are still informative towards the questions we mentioned above.

Recently, pLMs have received increasing attention for their ability to compare protein sequences and predict whether they interact in biological contexts. Vast databases of positive interactors enable this type of analysis. However, confirming that two proteins never interact is a much harder problem, so reliable negative examples are scarce and the classes are heavily imbalanced. Training PPI classifiers under this imbalance is challenging, but there are many promising modern approaches. Hence, PPI classification is another way in which protein function can be partially uncovered computationally via pLMs.
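
One common way to set up such a dataset can be sketched as follows: random pairs that are absent from the positive set are treated as assumed negatives, and class weights compensate for the resulting imbalance. This is only one possible approach, not a recommended protocol, and the protein identifiers and helper names are made up. Keep in mind that a randomly drawn pair is merely not known to interact, which is weaker than being known not to interact.

```python
import random

# Made-up interacting pairs; the identifiers are placeholders.
positives = {("P1", "P2"), ("P2", "P3"), ("P4", "P5")}
proteins = sorted({p for pair in positives for p in pair})

def sample_negatives(proteins, positives, n, seed=0):
    """Draw random protein pairs that are not known positives (assumed negatives)."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(proteins, 2)      # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)

def class_weights(n_pos, n_neg):
    """Weights inversely proportional to class frequency."""
    total = n_pos + n_neg
    return {1: total / (2 * n_pos), 0: total / (2 * n_neg)}

negatives = sample_negatives(proteins, positives, n=6)
weights = class_weights(len(positives), len(negatives))
print(negatives)
print(weights)
```

The rarer positive class receives the larger weight, so each positive example contributes more to the training loss than each assumed negative.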

Protein sequence generation

Lastly, we discuss one more general application of pLMs in protein sequence design and generation. As we discussed above, transformer decoders (or GPT models) are often trained to predict the next token given some other tokens for context. The popularized ChatGPT does this incredibly well for the English language. Generative pLMs perform the identical task on amino acid tokens, generating sequences from scratch or completing sequence prompts. Many pLM projects are notable in this space.

ProtGPT2: Stacked transformer decoders that generate viable nature-like sequences from scratch.
ProGen: Stacked transformer decoders that generate plausible sequences given control tags for context, thus being able to generate sequences of a particular family or ontology.
Ankh: A general-purpose encoder-decoder pLM that can generate proteins of a specific superfamily or plausible variants with possible increased functionality.
ProteinDT: A pLM fine-tuned by contrasting vector embeddings with an English language model to enable protein generation based on English natural language input.
xTrimoPGLM: A massive general-purpose pLM with extremely capable sequence design. It can even generate sequences with nearly identical structures that have almost no sequence similarity.
ProstT5: A fine-tuned version of ProtT5, which is also capable of sequence generation. Given a structure as input, this encoder-decoder architecture can generate a sequence that approximates that structure. This is a bilingual model with amino acid and structure-based tokens.
SaProt: An encoder-only system with a bilingual vocabulary similar to that of ProstT5 that can also translate between sequence and structure, enabling sequence generation based on a structure.
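
The next-token loop shared by the decoder-style generators above can be illustrated with a toy model. In this sketch a bigram count table stands in for the transformer decoder, which is a drastic simplification made only so the example runs without any trained weights; the training sequences, the `<bos>` and `<eos>` tokens, and the function names are all made up.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count which residue follows which; a stand-in for a trained decoder."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = ["<bos>"] + list(seq) + ["<eos>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt="", max_len=20, seed=0):
    """Sample one residue at a time, conditioned on the previous token."""
    rng = random.Random(seed)
    tokens = ["<bos>"] + list(prompt)
    while len(tokens) - 1 < max_len:
        options = counts[tokens[-1]]
        if not options:                 # unseen context: nothing to sample from
            break
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == "<eos>":              # the model chose to stop
            break
        tokens.append(nxt)
    return "".join(tokens[1:])

training = ["MKTAY", "MKTLL", "MKAAY"]  # made-up sequences
model = fit_bigram(training)
completion = generate(model, prompt="MK")
print(completion)
```

Calling `generate` with an empty prompt produces a sequence from scratch, while a non-empty prompt is completed. A real generative pLM follows the same loop but conditions each step on the entire preceding sequence rather than on the last residue alone.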

Concluding remarks

Protein language modeling is an interdisciplinary science at the intersection of bioinformatics, biochemistry, and computational sciences. Such modeling techniques are becoming an integral part of biochemical research through the fast-paced progress of NLP and computational hardware. It is easy to recognize the potential of protein language modeling in general life sciences: The generation of novel sequences for therapeutics, industrial catalysts, and synthetic biology, all the while annotating newly sequenced and generated proteins alike.

Importantly, amino acid-based vocabularies are not the only biochemically relevant use of NLP models. DNA, codon, and even atom-wise vocabularies are being explored in many applications: For example, genomics, phylogenetics, and small-molecule-to-protein interactions. There are many avenues ready to explore in the biological NLP field.

As someone developing competency in biochemistry, it is important to keep in mind the capabilities of protein language modeling while recognizing it is a somewhat new and accelerating field. Computational protein modeling may look vastly different in two years and will likely be completely different a decade from now. Regardless, computational tools serve as helpers to biochemists, not replacements. Standard biochemical assays to determine the structure and function of proteins will always be necessary to confirm and further inform computational findings. Computational methods can separate the plausible options for a given problem from the implausible ones, enabling the seemingly impossible search of biochemical space.

References

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv December 5, 2017. https://doi.org/10.48550/arXiv.1706.03762.

EMBL-EBI. Introduction to graph theory | Network analysis of protein interaction data. https://www.ebi.ac.uk/training/onlin...-graph-theory/ (accessed 2023-10-28).

Geetansh Kalra. Attention Networks: A simple way to understand Self Attention. https://medium.com/@geetkal67/attent...n-f5fb363c736d

DeepFindr. Understanding Graph Attention Networks. https://www.youtube.com/watch?v=A-yKQamf2Fc

Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A. A.; Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.; Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.; Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; Hassabis, D. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596 (7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2.

Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; Rives, A. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 2023, 379 (6637), 1123–1130. https://doi.org/10.1126/science.ade2574.

Chen, B.; Cheng, X.; Geng, Y.; Li, S.; Zeng, X.; Wang, B.; Gong, J.; Liu, C.; Zeng, A.; Dong, Y.; Tang, J.; Song, L. xTrimoPGLM: Unified 100B-Scale Pre-Trained Transformer for Deciphering the Language of Protein. bioRxiv July 6, 2023, p 2023.07.05.547496. https://doi.org/10.1101/2023.07.05.547496.

Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; Bhowmik, D.; Rost, B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 2022, 44 (10), 7112–7127. https://doi.org/10.1109/TPAMI.2021.3095381.

Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. arXiv January 16, 2023. https://doi.org/10.48550/arXiv.2301.06568.

Su, J.; Han, C.; Zhou, Y.; Shan, J.; Zhou, X.; Yuan, F. SaProt: Protein Language Modeling with Structure-Aware Vocabulary. bioRxiv October 2, 2023, p 2023.10.01.560349. https://doi.org/10.1101/2023.10.01.560349.

Heinzinger, M.; Weissenow, K.; Sanchez, J. G.; Henkel, A.; Steinegger, M.; Rost, B. ProstT5: Bilingual Language Model for Protein Sequence and Structure. bioRxiv July 25, 2023, p 2023.07.23.550085. https://doi.org/10.1101/2023.07.23.550085.

Hallee, L.; Rafailidis, N.; Gleghorn, J. P. cdsBERT - Extending Protein Language Models with Codon Awareness. bioRxiv September 17, 2023, p 2023.09.15.558027. https://doi.org/10.1101/2023.09.15.558027.

Hallee, L.; Gleghorn, J. P. Protein-Protein Interaction Prediction Is Achievable with Large Language Models. bioRxiv June 9, 2023, p 2023.06.07.544109. https://doi.org/10.1101/2023.06.07.544109.

Ferruz, N.; Schmidt, S.; Höcker, B. ProtGPT2 Is a Deep Unsupervised Language Model for Protein Design. Nat Commun 2022, 13 (1), 4348. https://doi.org/10.1038/s41467-022-32007-7.

Liu, S.; Zhu, Y.; Lu, J.; Xu, Z.; Nie, W.; Gitter, A.; Xiao, C.; Tang, J.; Guo, H.; Anandkumar, A. A Text-Guided Protein Design Framework. arXiv February 9, 2023. http://arxiv.org/abs/2302.04611 (accessed 2023-02-14).
