Biology LibreTexts

4.13: Predicting Structure and Function of Biomolecules Through Natural Language Processing Tools


    Written by Logan Hallee and Henry Jakubowski

    Learning Goals 

    (Learning goals written by Claude, Sonnet 4.6, Anthropic)

    Mathematical Foundations: Vectors, Graphs, and Matrices

    • Explain how protein sequences and molecular networks can be represented as mathematical graphs (atoms as nodes, bonds as edges) and encoded as adjacency matrices and feature vectors, and describe how directed, undirected, and weighted edges capture different types of biological relationships such as protein-protein interactions, metabolic flux, and binding affinity.
    • Describe the process of tokenization and token embedding in natural language processing — converting amino acid sequences into integer tokens and then into high-dimensional learnable vectors — and explain why this numerical representation is necessary for mathematical operations on sequence data and how it encodes semantic information such as residue polarity, charge, and structural context.

    Transformer Architecture and Protein Language Models

    • Explain the self-attention mechanism in transformer networks — including the roles of query, key, and value matrices, the cosine similarity basis of attention scoring, and the softmax normalization that converts scores to probabilities — and describe how multi-head attention allows a protein language model to simultaneously capture multiple types of residue-residue relationships (spatial proximity, chemical complementarity, evolutionary covariation).
    • Describe masked language modeling and next-token prediction as the self-supervised training strategies used to build protein language models from large sequence databases, explaining how mask denoising forces the model to develop contextual understanding of amino acid co-occurrence patterns without requiring manually annotated training data.
    • Explain the key innovations of AlphaFold — including the EvoFormer's row-wise and column-wise MSA attention and pair-wise triangular attention, the structure module's invariant point attention and ResNet-based torsion angle prediction, and the recycling strategy — and describe how these components collectively map an input sequence through multiple sequence alignment and pair-wise distance representation to predicted atomic 3D coordinates.

    Applications and Limitations of Protein Language Models

    • Describe how pLM vector embeddings are used for protein function prediction (EC number and Gene Ontology classification at 80–90+% accuracy on unseen sequences), protein-protein interaction classification, and de novo protein sequence generation — and explain the conceptual distinction between encoder-only, decoder-only, and encoder-decoder transformer architectures in terms of the tasks each is best suited for.
    • Critically evaluate the capabilities and limitations of protein language models relative to experimental biochemistry — recognizing that computational predictions require experimental validation, that MSA-based models underperform on orphan proteins, that low-confidence or disagreeing ensemble predictions may indicate intrinsically disordered regions, and that the field is evolving rapidly enough that specific model capabilities will change substantially within years.

    Introduction

    So far in this chapter, you have learned about protein structure and its determination in the laboratory. After decades of work in modeling protein structure and properties, the life science community has built massive databases organizing this information. While sequencing DNA and discovering protein sequences has become relatively inexpensive, the actual characterization of protein structure and function remains time- and cost-intensive. Instead, researchers aim to model and predict protein properties from amino acid sequences alone to speed up lab work. The most recent and effective tool in this quest is the protein language model (pLM), which models proteins as a biological language of amino acids. At their core, pLMs are transformers with a protein vocabulary.

    The transformer, an attention-based neural network, emerged as a game-changer for the scientific community with the iconic 2017 paper “Attention Is All You Need.” The crux of this work is the revolutionary idea that, by strategically organizing simple neural networks, performance can be enhanced beyond mere scaling-up of a single neural network. A neural network is a type of AI/machine learning process, often described as deep learning, that is patterned after the brain, with nodes (neurons) interconnected by lines (axons and dendrites).

    Transformers have become the bedrock of modern natural language processing (NLP) and are especially adept at processing sequential data like time series or sentences. This technology is grounded in understanding the semantic and contextual intricacies of the vocabulary it has been trained on.

    The Essence of Protein Language Models

    Tokenization & Token Embedding

    The fundamental problem in NLP is encoding text, a string datatype, into a meaningful numerical representation; it is extremely challenging to perform mathematical operations on words. One approach to the problem is to assign every subpart of the vocabulary, say, a word, a unique integer. That way, any sequence of words can be turned into a vector of integers, and we can easily do the math on vectors; think back to physics or math, where we compute the dot product on collections of numbers that have an associated direction. This process creates a lookup table, called a tokenizer, that translates tokens (strings of words, letters, or characters) to integers.

    Vectors - A Simplified Review

    To understand how AI/machine learning can be used to predict structure and function, we need to know about vectors and their use in physics and mathematics, particularly in matrices. Most students likely need a refresher, so the guided review that follows covers what you need to understand the rest of the material in this section.


    Most biochemistry students have taken physics in high school and college. In those courses, you were introduced to scalar and vector quantities. Scalar quantities like distance and work have no direction. Vector quantities, such as displacement and force, have both magnitude and direction. Vectors are shown as arrows, with the length representing the magnitude and the arrowhead indicating the direction.

    Let’s review a simple concept from elementary physics: work. Work is a scalar quantity, and you probably remember that mechanical work is done on an object when an external force moves the object a given distance. Consider a force F (bold represents a vector) applied to a block, which causes it to move a distance along the surface. The distance and direction together are described as the displacement d, as shown in Figure \(\PageIndex{1}\) below.

    Diagram illustrating forces acting on two gray blocks, with arrows indicating their directions: red (resultant), blue (normal), and black (friction).

    Figure \(\PageIndex{1}\): Forces on a block moving along a surface

    The block would not move along the surface if the force were applied vertically, so no work would be done on the block. If the force is applied at some angle θ to the surface, only the horizontal component of the force causes the block to slide.

    The horizontal component of the force is F cos θ. (When θ = 90°, cos θ = 0, so no work is done.) Hence, the work is W = Fd cos θ, the “dot product” of the vectors F and d.
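
    As a quick numerical check of the dot product, the sketch below computes the work two ways: from the vector components and from W = Fd cos θ. The force (10 N at 60°) and displacement (5 m) are made-up values chosen only for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors given as lists of numbers."""
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    """Length (magnitude) of a vector."""
    return math.sqrt(dot(v, v))

# Hypothetical values: a 10 N force applied 60 degrees above the surface,
# moving the block 5 m along the surface (taken as the x axis).
theta = math.radians(60)
force = [10 * math.cos(theta), 10 * math.sin(theta)]
displacement = [5.0, 0.0]

# Work from the components of F and d
work_from_components = dot(force, displacement)
# Work from the magnitudes and the angle between F and d
work_from_angle = magnitude(force) * magnitude(displacement) * math.cos(theta)

print(round(work_from_components, 6), round(work_from_angle, 6))

# A purely vertical force does no work on a horizontally moving block
vertical_work = dot([0.0, 10.0], displacement)
print(vertical_work)
```

    Both routes give the same work (about 25 J for these values), and the purely vertical force gives zero.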

    Vectors are also used in math and can be thought of as directed line segments.  Let’s consider the equations for circles and spheres.  These equations are based on the Pythagorean Theorem. As you learned in high school geometry, the Cartesian equation of a circle is:

    \(x^2 + y^2 = r^2\)

    To generate a circle, set r to a fixed number (such as 1 for a “unit circle”), and for many x values, calculate the corresponding y values from the equation (for the unit circle, -1 ≤ x, y ≤ +1). Then plot the x and y coordinate pairs, and presto, you have a circle, as shown in Figure \(\PageIndex{2}\) below.

    Circle diagram with a black line and a blue vertical line intersecting at a red horizontal line, forming an angle.

    Figure \(\PageIndex{2}\): Cartesian graph of a circle

    Each of the (x,y) pairs can be considered a 2D vector with its tail at the origin (0,0), a magnitude of 1 (the fixed radius in this example), and a direction described by the specific (x,y) point that falls on the circle. Two simple (x,y) pairs are (1,0) and (0,1) for the unit circle. Another (x,y) pair that satisfies the Pythagorean theorem, to three decimal places, is (0.5, 0.866).
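
    The short sketch below checks that these (x,y) pairs really do have a magnitude of about 1, and generates a few more points on the upper half of the circle from the Cartesian equation. The function names are our own, chosen for illustration.

```python
import math

def magnitude(v):
    """Length of a vector from the origin to the point v."""
    return math.sqrt(sum(x * x for x in v))

# The three example points from the text
points = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.866)]
for p in points:
    print(p, round(magnitude(p), 3))

def upper_circle_points(n, r=1.0):
    """Return n points (x, y) with y >= 0 that satisfy x^2 + y^2 = r^2."""
    xs = [-r + 2 * r * i / (n - 1) for i in range(n)]
    # max(..., 0.0) guards against tiny negative values from rounding
    return [(x, math.sqrt(max(r * r - x * x, 0.0))) for x in xs]

generated = upper_circle_points(9)
```

    Every point returned by `upper_circle_points` is a vector of length r, which is why plotting them traces out the circle.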

    The Pythagorean Theorem can be extended to three dimensions to give the Cartesian equation for a sphere:

    \(x^2 + y^2 + z^2 = r^2\)

    To generate a sphere, solve for z for many x and y values and a fixed r value.  Then plot the x, y, and z values, and presto, you get a sphere, as shown in Figure \(\PageIndex{3}\) below for a “unit” sphere of radius 1.

    3D sphere with red, green, and blue arrows representing different axes and directions in space.

    Figure \(\PageIndex{3}\): Cartesian graph of a sphere

    The sets of x, y, and z points that land on the surface are vectors (directed line segments).

    Vectors are also used to describe matrices.  Matrices are two-dimensional arrays of numbers.   A matrix with just one row or one column is called a row vector or column vector, respectively, as shown in Figure \(\PageIndex{4}\) below.

    Abstract graphic featuring various brackets arranged in a grid pattern against a plain background.

    Figure \(\PageIndex{4}\):

    You should now see that all vectors defining a sphere can be written as a large matrix. Figure \(\PageIndex{5}\) below shows a 3 × 3 square matrix representing the unit vectors along the x, y, and z axes.

    A blank white canvas framed by two black brackets on the left and right sides.

    Figure \(\PageIndex{5}\):

    In this example, any vector will lie on the unit sphere's surface if the three components have the relationship of the Cartesian equation above.

    These examples show that values in vectors can represent a position in space, but they can also represent measurements of an object. For example, a car with four tires, four cylinders, and 180 horsepower could be represented as (4, 4, 180), where a more detailed description would yield a longer vector with more components.

     

    Another use of vectors and matrices comes from more complicated mathematical graphs. Graphs are arbitrary objects with nodes and edges, where edges connect nodes. A popular example of a mathematical graph is the social network Facebook. You can represent the entire Facebook network by treating each member as a node and adding an edge between nodes when they are friends on the site.  Figure \(\PageIndex{6}\) below shows an example of a social networking graph.

    A network graph with interconnected nodes, featuring dense blue clusters and scattered black nodes, set against a white background.

    Figure \(\PageIndex{6}\):  Graph showing social relationships using graph theory. Darwin Peacock.  CC BY 3.0, https://commons.wikimedia.org/w/inde...?curid=6057981

    This example is an unweighted, undirected graph because its edges do not have specific values or directions. We could construct a weighted graph of Facebook using weighted edges, perhaps based on the number of mutual friends between members. However, this is still an undirected example.

    Graphs are important in computational biochemistry because molecules can be represented as graphs, with atoms as nodes and bonds as edges.  This is illustrated in Figure \(\PageIndex{7}\) for a multidentate adsorbate complex.

    Diagram illustrating atomic structure and graph construction process: a) molecular representation, b) steps for node and bond creation, c) resulting graph with nodes and edges.

    Figure \(\PageIndex{7}\):  Graph theory-based algorithm to generate graphs for a given atomic model.  Deshpande, S., Maxson, T., & Greeley, J. Graph theory approach to determine configurations of multidentate and high coverage adsorbates for heterogeneous catalysis. npj Comput Mater 6, 79 (2020). https://doi.org/10.1038/s41524-020-0345-2. http://creativecommons.org/licenses/by/4.0/. Creative Commons Attribution 4.0 International License. 

    Panel a shows an atomic model for a simple nanoparticle with adsorbates. Panel b shows an algorithm to generate graph-based representations. Panel c shows the resulting graph.

    Figure \(\PageIndex{8}\) below shows a protein structure graph (right) for a short stretch of an alpha helix (left).


    Figure \(\PageIndex{8}\):  Protein and protein graph. Baranwal, M., Magner, A., Saldinger, J. et al. Struct2Graph: a graph attention network for structure-based predictions of protein–protein interactions. BMC Bioinformatics 23, 370 (2022). https://doi.org/10.1186/s12859-022-04910-9. http://creativecommons.org/licenses/by/4.0/

     

    Figure \(\PageIndex{9}\) shows a graph not of a single protein but of a small protein:protein interaction network. 

    A network diagram showing protein interactions with nodes labeled as gene symbols and interconnected by blue lines.

    Figure \(\PageIndex{9}\):  Protein interactions of TMEM8A in humans.  https://commons.wikimedia.org/wiki/F...for_TMEM8A.png

    You can even produce graphs that represent entire networks of molecules and their relationships. A directed molecular graph might showcase proteins and their substrates. Having a direction on an edge is important here because a protein may use a substrate in a chemical reaction, but a substrate might not act on a protein on its own. Many molecular relationships are both weighted and directed. A weight in the protein-substrate case might be the relative binding affinity between the protein and the substrate.

    These graphs contain three types of edges: undirected, directed, and weighted, as illustrated in Figure \(\PageIndex{10}\) below.

    Three directed graphs are shown: the left with blue nodes labeled A-G, the middle with green nodes, and the right with purple nodes showing weights.

    Figure \(\PageIndex{10}\):  The main types of edges found in a network.  https://www.ebi.ac.uk/training/online/courses/network-analysis-of-protein-interaction-data-an-introduction/introduction-to-graph-theory/graph-theory-graph-types-and-edge-properties/ .  Attribution 4.0 International (CC BY 4.0) license

    • Undirected edges: Connections in protein-protein interactions, as shown in Figure 9 above, are examples.  The proteins are connected through binding, but without implied flow between them.
    • Directed edges:  These are found in metabolic and signaling pathways when arrows indicate the flow of reactants/products in a pathway.  These can be arranged in complex hierarchies, as those familiar with metabolic and signaling pathways know.
    • Weighted edges: Undirected or directed edges can have a quantitative weight.  These may reflect affinities, similarities between genes, fold effects, etc.

    These examples are great for showcasing the versatility of mathematical graphs, but how can we use matrices to represent them? Enter adjacency matrices.

    Adjacency matrices state which nodes are connected. Each node can also have multiple features. Figure \(\PageIndex{11}\) below shows a network of 5 nodes, its adjacency matrix, and a features matrix listing the features of each node. For example, if an atom is a node, the features could be electronegativity, partial charge, size, etc.

    Diagram featuring a network of blue nodes interconnected by lines, with two oval shapes labeled "fb" at the bottom.

    Figure \(\PageIndex{11}\): Properties of a 5 node network

    An adjacency matrix stacks n vectors together for a graph with n nodes. Each vector is also n long, so the resultant matrix is n by n. The entry in row i and column j is the number of edges between node i and node j. So if the 1st node has an edge to the 2nd node, the 1st row and 2nd column of the adjacency matrix will have a 1. For undirected graphs, the matrix is symmetric; in this example, there would also be a 1 at the 2nd row and 1st column. Figure \(\PageIndex{12}\) below illustrates this:

    Three graphs are depicted: a tree structure on the left, a diamond shape in the center, and a square on the right, all connected with edges.

    Figure \(\PageIndex{12}\):  https://mathworld.wolfram.com/AdjacencyMatrix.html

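    If there is a weight on an edge, the number in the adjacency matrix can be used to store the weight instead of the count of edges between nodes. Direction can be captured by making the matrix asymmetric: an entry in row i and column j without a matching entry in row j and column i describes an edge that runs only from node i to node j. Positive and negative entries can additionally encode the type of relationship.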

    Any complex network can be described mathematically as an adjacency matrix, with rows and columns indicating nodes and an edge represented by a number. Unweighted, undirected edges yield symmetric matrices with only 0s and 1s. Directed and weighted edges can be more complicated, with different numbers used to show relationships such as affinity. Signed values can also be used, where + indicates activation and - indicates inhibition. These matrices can be manipulated using linear algebra. Examples of adjacency matrices for undirected, directed, and weighted networks are shown in Figure \(\PageIndex{13}\) below.

    Three illustrations of a graph: vertices with edges, directed edges, and weighted connections, with corresponding degree matrices below.

    Figure \(\PageIndex{13}\): Adjacency matrices from undirected, directed, and weighted networks. https://www.ebi.ac.uk/training/online/courses/network-analysis-of-protein-interaction-data-an-introduction/introduction-to-graph-theory/graph-theory-adjacency-matrices/ .  Attribution 4.0 International (CC BY 4.0) license
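
    To make the three cases concrete, here is a minimal sketch that builds adjacency matrices for a made-up 4-node network. The edge list, the weights, and the helper names are all hypothetical; they are not taken from the figures above.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (source, target, weight) tuples."""
    matrix = [[0] * n for _ in range(n)]
    for source, target, weight in edges:
        matrix[source][target] = weight
        if not directed:
            # An undirected edge is recorded in both directions
            matrix[target][source] = weight
    return matrix

def is_symmetric(matrix):
    """True if the matrix equals its own transpose."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))

# Hypothetical network: nodes are numbered 0 to 3
unweighted_edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (2, 3, 1)]
undirected = adjacency_matrix(4, unweighted_edges)
directed = adjacency_matrix(4, unweighted_edges, directed=True)

# Hypothetical signed weights: positive = activation, negative = inhibition
weighted_edges = [(0, 1, 0.9), (0, 2, -0.4), (1, 2, 0.7), (2, 3, -1.5)]
weighted = adjacency_matrix(4, weighted_edges, directed=True)

for row in undirected:
    print(row)
print(is_symmetric(undirected), is_symmetric(directed))
```

    The undirected matrix is symmetric with only 0s and 1s, while the directed version of the same edge list is not symmetric.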

    Hopefully, after this brief introduction to vectors and matrices, you can understand how amino acid positions and their semantic properties (polarity, charge, size, etc.) can be described as large matrices and their associated vectors.

    Back to Tokens

    Here are a few types of tokenizers commonly used in NLP:

    1. Word Tokenizers: Split text into individual words based on spaces or punctuation. This approach assumes that words are the primary units of meaning in a language. For example, given the sentence "The cat is sleeping," a word tokenizer would split it into tokens: ["The", "cat", "is", "sleeping"].
    2. Subword Tokenizers: Split text into subword units that capture partial linguistic information. This approach is useful for handling out-of-vocabulary words, such as abbreviations, and for reducing vocabulary size, thereby saving computational resources. Popular subword tokenization algorithms include Byte Pair Encoding (BPE), Unigram Language Model, and SentencePiece. Subword tokenizers can be complicated, but one possible subword tokenization for our example above would be [“The#”, “cat#”, “is#”, “sleep”, “ing#”] where # has been added to showcase the ending of a word.
    3. Character Tokenizers: Treat each character as a separate token. This approach is beneficial when dealing with languages without explicit word boundaries or for character-level modeling tasks. Our sleepy cat is now [“T”, “h”, “e”, “\s”, “c”, “a”, “t”, “\s”] and so on. Here, we need to add a space token so the model can tell where the words start and finish.

    For pLMs, researchers typically treat amino acids as tokens or “words” and protein sequences as “sentences.” For example, a protein sequence like "MVKLTA" would be tokenized into individual amino acids: “M,” “V,” “K,” “L,” “T,” and “A.”
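
    A character-level protein tokenizer of this kind can be sketched in a few lines. The integer assigned to each amino acid below is an arbitrary choice for illustration (alphabetical order of the one-letter codes); real pLMs define their own vocabularies and include special tokens.

```python
# The 20 standard amino acids, by one-letter code
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Lookup tables in both directions; the numbering is an assumption
token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
id_to_token = {i: aa for aa, i in token_to_id.items()}

def tokenize(sequence):
    """Convert a protein sequence string into a list of integer tokens."""
    return [token_to_id[aa] for aa in sequence]

def detokenize(ids):
    """Convert a list of integer tokens back into a sequence string."""
    return "".join(id_to_token[i] for i in ids)

ids = tokenize("MVKLTA")
print(ids)              # [10, 17, 8, 9, 16, 0]
print(detokenize(ids))  # MVKLTA
```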

    The main problem with tokenizing sentences or sequences is that the resulting integers carry no semantic meaning. The grammar and word meanings are not encoded here. We will define another lookup table to store this semantic information. However, this time we will make it a large matrix with an arbitrary number of columns, say, 768. If we want to model approximately 50,000 English words, our matrix is 50,000 x 768. Now, we simply connect our tokenizer and our matrix. If the word “dog” corresponds to token 5, the vector representing “dog” will be the 5th row of our matrix. Maybe “protein” is token 800. The vector representing “protein” is then the 800th row of our matrix.
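
    The connection between the tokenizer and the embedding matrix is just row lookup. The sketch below uses a deliberately tiny matrix (25 x 8 instead of 50,000 x 768) filled with random numbers; the sizes, the random initialization, and the example token ids are assumptions for illustration.

```python
import random

random.seed(0)

VOCAB_SIZE = 25  # number of tokens in this toy vocabulary
EMBED_DIM = 8    # length of each token vector (768 in the text's example)

# One row per token, randomly initialized; training would adjust these values
embedding_matrix = [
    [random.gauss(0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the embedding vector (matrix row) for each token id."""
    return [embedding_matrix[t] for t in token_ids]

# Hypothetical token ids for a three-token sequence
vectors = embed([10, 17, 8])
print(len(vectors), len(vectors[0]))  # 3 8
```

    Note that Python counts rows from 0, so token 5 maps to `embedding_matrix[5]`.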

    These vectors representing each word are where we will store the semantic information behind each word: its meaning and related grammar within the vocabulary. Unlike an object like a car, with easily identifiable numerical features (4 doors, 4 wheels, 6 cylinders, etc.), we need to learn the features of a word. And so, within the transformer neural network, we have a vector for each word, with learnable weights. Through gradient descent, these weights will be adjusted from a random starting point to capture the necessary information for language modeling. One neat fact is that, within a well-trained English-language model, taking the vector for the word King, subtracting the vector for Man, and adding the vector for Woman gives a vector roughly equal to the one for Queen. How amazing that concepts behind gender and royalty can be encoded in a meaningful numerical space!
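
    The word-vector arithmetic often quoted for English models (king - man + woman ≈ queen) can be illustrated with hand-made two-dimensional vectors. The numbers below are invented so that the first component loosely stands for "royalty" and the second for "gender"; real learned embeddings have hundreds of dimensions with no such clean labels.

```python
import math

# Toy, hand-made vectors: [royalty, gender]. Not learned from data.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "protein": [-1.0, 0.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed component by component
result = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Find the most similar word, excluding the three words used in the query
candidates = {w: v for w, v in toy_vectors.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine_similarity(result, candidates[w]))
print(result, nearest)  # [1.0, -1.0] queen
```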

    Of course, many words have different meanings in different contexts. This is still the case in proteins, where specific residues may be important because of their charge or maybe just because of the space they occupy. That is why we need a portion of our transformer dedicated to contextual understanding. This is where attention comes in.

    Attention

    Attention is used to signify the importance of a word or its part.  For example, if high attention is given to a word, less overall information is needed to predict the word.  Attention can be conferred by weighting the importance of the word/token.  In real language, we attend to certain words in sentences and weigh them more to provide context and make predictions.  Consider this sentence referring to a TV show or movie. Predict which show/movie it is from.

    ..and Spock said that the needs of the many outweigh the needs of the few.

    The answer comes quickly (if you are a Star Trek fan) by attending to the word Spock. You could maybe predict the entire sentence using databases by attending to another word, such as outweigh.

    Using attention enables the transformers we discussed above to access long-term memory and concentrate on (attend to) previously generated tokens.

    The attention mechanism identifies dependencies and relationships between tokens within an input sequence. What makes attention especially compelling is its ability to dive deep into a relational space and seamlessly bridge back to the token-embedding space we previously discussed.

    Another feature is self-attention, which focuses on the relationships within a given sentence.

    Self-attention, where every token in a sequence attends to every other token, allows for a dynamic weighting of significance.  

    Three key inputs are required.  The Query is what you are asking (such as the input in a web search box), the Key is the search results, and the Values are the returned content in each search result.  In a web search, the program has to find the best matches between the query and the keys.

    Now, consider both the query and key to be vectors, \(\mathbf{q}\) and \(\mathbf{k}\). The similarity between them can be measured with the cosine similarity, which is the dot product of the vectors divided by the product of their lengths: \(\dfrac{\mathbf{q} \cdot \mathbf{k}}{|\mathbf{q}||\mathbf{k}|} = \dfrac{|\mathbf{q}||\mathbf{k}|\cos\theta}{|\mathbf{q}||\mathbf{k}|} = \cos\theta\). The cosine function makes sense as a similarity measure since it varies between +1 (when the vectors point in the same direction, θ = 0°) and -1 (when they point in opposite directions, θ = 180°). This gives the degree of similarity.
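
    A minimal sketch of the cosine similarity between a query vector and a few key vectors follows. The vectors are arbitrary examples picked to show the extreme cases.

```python
import math

def cosine_similarity(q, k):
    """Dot product of q and k divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / (norm_q * norm_k)

query = [1.0, 2.0, 0.0]

same_direction = cosine_similarity(query, [2.0, 4.0, 0.0])    # close to +1
opposite = cosine_similarity(query, [-1.0, -2.0, 0.0])        # close to -1
perpendicular = cosine_similarity(query, [0.0, 0.0, 3.0])     # 0

print(round(same_direction, 6), round(opposite, 6), round(perpendicular, 6))
```

    Note that the key pointing in the same direction scores about +1 even though it is twice as long as the query: cosine similarity depends only on direction, not on length.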

    Mathematically, the relationships among query, key, and values matrices are formulated as

     

    \begin{equation}
    \operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
    \end{equation}

    Here, Q, K, and V denote the query, key, and value matrices, respectively, and \(d_k\) denotes the dimension of the queries and keys. Q, K, and V are obtained directly by multiplying our token embeddings by learned weight matrices:

    \[ Q = W_e \times W_Q, \quad K = W_e \times W_K, \quad V = W_e \times W_V \]

    The softmax function turns a vector of n real values into a vector of n non-negative values that sum to 1, effectively converting them into probabilities.

    \(W_e\) symbolizes the token embedding matrix, or, at least, the portion extracted from it by our tokenizer and token embedding process. \(d_k\) denotes the dimension of the key matrix; dividing by \(\sqrt{d_k}\) scales the scores before the softmax is applied.

    During the attention operation, there is a length-wise representation and an embedding-wise representation. The length-wise representation is typically called an “attention matrix” and is often analyzed by researchers to determine how a model makes decisions. This attention matrix comes from the \(\operatorname{softmax}\left(Q K^T / \sqrt{d_k}\right)\) step of the attention operation: it is an L x L matrix A, where L symbolizes the sequence length, and its values lie between 0 and 1. When \(A_{i,j}\) is close to 1, it signifies that the ith amino acid bears contextual relevance to the jth one. In pLMs, such associations often reflect spatial or chemical relationships; perhaps the ith amino acid sits close to the jth one in the protein's 3D structure. Notably, the diagonal scores \(A_{i,i}\) are often high, emphasizing an amino acid's inherent significance to itself.

    When A is multiplied further by V, we go from this relational space back to the embedding space, where the matrix is L x embedding dimension. In our previous example, the embedding dimension was 768, but this number is completely arbitrary. A smaller embedding dimension results in a smaller model with worse theoretical top-end performance, but it is much cheaper to use and train. A larger embedding dimension and model will have a theoretically higher top-end performance but will require more data and computational resources to train and be more expensive to use.
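
    The whole attention operation fits in a short sketch. The example below uses plain Python lists, a 3-token sequence, and an embedding dimension of 2. The embeddings and the weight matrices (identity matrices, chosen only so the numbers are easy to follow) are invented for illustration; in a real model the weights are learned.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Convert a list of scores into non-negative weights that sum to 1."""
    peak = max(row)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Return (output, attention matrix) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                         # L x L
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]               # attention matrix A
    return matmul(weights, v), weights                       # back to L x d

# Toy token embeddings (W_e): L = 3 tokens, embedding dimension 2
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned W_Q, W_K, W_V

q = matmul(embeddings, identity)
k = matmul(embeddings, identity)
v = matmul(embeddings, identity)

output, attn = attention(q, k, v)
for row in attn:
    print([round(a, 3) for a in row])
```

    Each row of the printed attention matrix sums to 1, and the output has the same shape as the input embeddings (L x embedding dimension).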

    Diving deeper, multi-head attention further refines this process. It transforms the input sequence into multiple smaller sets of queries, keys, and values, each belonging to an independent attention head with its own weights. Each head, in isolation, computes its own version of the attention matrix on a section of the input. The outputs of the heads are then concatenated and passed through an additional learned linear transformation to give a final cohesive representation. This multi-headed approach has proven important for pLMs, where an increased headcount is correlated with better performance.
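
    The sketch below shows one common way to organize multi-head attention: each head projects the input to smaller queries, keys, and values with its own weights, the head outputs are concatenated, and a final linear transformation mixes them. All sizes and the random weights are assumptions for illustration, and details such as biases and normalization are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """x: L x d_model; heads: list of (W_Q, W_K, W_V); w_o: output projection."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs along the embedding axis, token by token
    concatenated = []
    for i in range(len(x)):
        row = []
        for head in head_outputs:
            row.extend(head[i])
        concatenated.append(row)
    return matmul(concatenated, w_o)

rng = random.Random(0)
SEQ_LEN, D_MODEL, N_HEADS = 4, 8, 2
D_HEAD = D_MODEL // N_HEADS

x = random_matrix(SEQ_LEN, D_MODEL, rng)
heads = [
    tuple(random_matrix(D_MODEL, D_HEAD, rng) for _ in range(3))
    for _ in range(N_HEADS)
]
w_o = random_matrix(N_HEADS * D_HEAD, D_MODEL, rng)

output = multi_head_attention(x, heads, w_o)
print(len(output), len(output[0]))  # 4 8
```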

    Bringing it all together ...

    Lastly, bridging back from the realm of relational depths to the embeddings, transformers don't solely rely on attention. They also include feed-forward layers, which further improve the model's generalization capabilities. Together, these components are the essence of the transformer layer: a multi-headed attention layer followed by a feed-forward layer. These transformer layers are simply stacked on top of each other, so the input to one is the output of the previous one. Large stacks of this kind are referred to as large language models (LLMs). LLMs have a couple of main forms, along with other important parts we did not discuss. For example, there are transformer encoders (BERT) and decoders (GPT) with different layer organizations. There are also positional embeddings, which help transformers learn the intended order of input sequences because the attention operation on its own does not take token order into account. In addition, different types of normalizations and skip connections are extremely important for protein modeling. While this introduction to transformers is enough to get your feet wet, we have included several other resources to further refine your knowledge if desired.

    Training transformers

    At a high level, transformers are trained just like any other neural network. For a large corpus of training data, there is an input and a known desired output called the ground truth. The input is fed to the model, and the model's output is compared with the ground truth using a loss function. This is set up so that the loss function measures some error between the output and the ground truth, and the loss function and corresponding error are minimized via optimization techniques. Regardless of the technical details, it is essentially a strategic trial-and-error process in which the model is rewarded for producing outputs close to or equal to the ground truth. This way, through optimization, the error is reduced over time, and the model learns how to do the specified task effectively.
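
    The loop described above can be sketched with a deliberately tiny model. The example below fits a three-parameter linear model by gradient descent on a mean-squared-error loss; it is not a transformer, and the data, learning rate, and step count are assumptions chosen so the example runs instantly. The same compare-and-adjust cycle applies to models with billions of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                         # ground truth outputs

w = np.zeros(3)                        # model parameters, initially uninformed
lr = 0.1                               # learning rate (assumption)
losses = []
for step in range(200):
    pred = X @ w                       # feed the input to the model
    error = pred - y                   # compare output with ground truth
    losses.append(float(np.mean(error ** 2)))  # loss function: mean squared error
    grad = 2 * X.T @ error / len(y)    # direction that increases the loss
    w -= lr * grad                     # step the other way to reduce it

print(losses[0], losses[-1])
```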

    However, because transformers have such a large parameter count, they require substantial data to perform well on a specific task. Thus, we need a way to generate large amounts of labeled data without requiring excessive human time for annotation. Luckily, sequences of strings are perfect for this. If we feed a transformer a sequence, we can simply hide a token and ask the model to predict which word it would have been.

    Now, in the input, let’s randomly replace some words (or tokens) with MASK tokens. MASK tokens are used to hide tokens from the model so that it can learn to recover the missing words, with the desired output being the original sequence. Such a task forces the model to learn which words tend to surround one another, thereby building its semantic and contextual understanding. This is called denoising because we artificially inject noise into the input, and the model uses the surrounding context to remove it. Another popular training objective is next-token prediction, in which a portion of a sequence is input, and the model predicts what comes next. Different transformer layer organizations perform better or worse on these tasks.
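
    A minimal sketch of how masked training pairs might be produced from a raw sequence is shown below. The 15% masking rate, the `<mask>` spelling, the example sequence, and the function name `mask_sequence` are assumptions for illustration; individual pLMs differ in these details.

```python
import random

MASK = "<mask>"

def mask_sequence(tokens, mask_prob=0.15, seed=0):
    """Return (masked input, labels). Labels are None where nothing was hidden."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the loss
    return masked, labels

sequence = list("MKTAYIAKQRQISFVKSHFSRQ")  # arbitrary example sequence
masked, labels = mask_sequence(sequence)
print("".join(t if t != MASK else "_" for t in masked))
```

    Because the labels come from the sequence itself, no human annotation is needed to build the training set.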

    The subject of a MASK token brings up the broader topic of special tokens. Special tokens are added to a specific vocabulary and serve a specific purpose. As we discussed, a model can learn to replace a MASK token with a correct token that belongs in the sequence.

    Some other popular tokens are CLS and SEP, which stand for classification and separator, respectively.

    CLS tokens are typically prepended to the beginning of sequences so that models can learn to summarize the entire input into a single vector, which is useful for classification tasks. Separator tokens are typically placed at the end of sequences or between sequences if more than one was fed to the model at a time. This way, the model can treat separate sequences as individual entities even if they are input simultaneously. Researchers often create specialized tokens to use alongside model training for specific tasks. For example, a pLM called ProstT5 has a special token that indicates a translation from structure to amino acid sequence and an additional token that does the reverse. The respective token was prepended to the necessary inputs during training so that the model could infer which task it was supposed to perform. This is particularly useful if you are designing a transformer model that needs to do multiple distinct tasks.
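
    The placement of CLS and SEP tokens can be sketched with a toy tokenizer. The vocabulary ordering, the token spellings, and the function name `encode` are assumptions for this example and do not correspond to any particular pLM's tokenizer.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL = ["<pad>", "<cls>", "<sep>", "<mask>"]
# Map every token to a unique integer; special tokens come first (assumption).
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(AMINO_ACIDS))}

def encode(*sequences):
    """Prepend CLS, then append SEP after each sequence."""
    tokens = ["<cls>"]
    for seq in sequences:
        tokens.extend(seq)
        tokens.append("<sep>")
    return tokens, [VOCAB[t] for t in tokens]

# Two sequences fed to the model at the same time.
tokens, ids = encode("MKT", "AY")
print(tokens)
print(ids)
```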

    Applications of protein language models

    Now that we have learned in depth how modern language modeling works, we can explore how researchers use these techniques to further advance our understanding of biochemistry.

    Predicting 3D Structure

    By far the most famous pLM is AlphaFold, the deep learning model that first successfully mapped amino acid sequences to protein structures at a large scale. Since then, other pLMs have also learned from large corpora of sequence and structure data to perform well on unseen sequences. The fundamental organization of sequence-to-structure models is made up of:

    • a transformer (pLM) that builds a semantic and contextual understanding of sequences;
    • a structure module that maps the latent sequence representation to 3D coordinates.

    AlphaFold is special in a few ways. Firstly, the transformer used for AlphaFold is two transformers in one! Let us call the first transformer T1. T1 is an MSA transformer that operates on multiple sequences simultaneously rather than a single sequence. This stems from the concept of multiple sequence alignment (MSA), a common bioinformatics method that compares amino acid strings using an evolutionarily informed algorithm. The output of an MSA is a list of strings aligned based on similarity and substitution probability. Figure \(\PageIndex{13}\) below …

    Diagram illustrating amino acid sequences with highlighted residues, showing constraints, inferences, and 3D contact representation.

    Figure \(\PageIndex{13}\): Correlated mutations carry information about distance relationships in protein structure. https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/

    The sequence of the protein for which the 3D structure is to be predicted (each circle is an amino acid residue, typical sequence length is 50–250 residues) is part of an evolutionarily related family of sequences (amino acid residue types in standard one-letter code) that are presumed to have essentially the same fold (iso-structural family). Evolutionary variation in the sequences is constrained by many requirements, including the maintenance of favorable interactions in direct residue-residue contacts (red line, right). The inverse problem of protein folding prediction from sequence, addressed here, exploits pair correlations in the multiple sequence alignment (left) to deduce which residue pairs are likely to be close together in the three-dimensional structure (right). A subset of the predicted residue contact pairs is subsequently used to fold up any protein in the family into an approximate predicted 3D shape (‘fold’), which is then refined using standard molecular physics techniques, yielding a predicted all-atom 3D structure of the protein of interest.

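    Including this information in the model allows it to learn more about the protein than it could from a single sequence, leveraging the concept of coevolution. Coevolution describes how residues that interact tend to change together: when a mutation alters one residue, a compensating substitution at a contacting residue is often needed if the protein, and therefore the organism, is to remain fit. Such correlated substitutions across an MSA hint at which residues are close in the folded structure.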
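
    One classical way to quantify correlated substitutions is the mutual information between two alignment columns. The sketch below is not how AlphaFold uses the MSA; it is a much simpler, hand-computed statistic shown only to make the idea of coevolution concrete. The four-sequence alignment is invented for this example.

```python
from collections import Counter
from math import log2

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    p_i = Counter(seq[i] for seq in msa)
    p_j = Counter(seq[j] for seq in msa)
    p_ij = Counter((seq[i], seq[j]) for seq in msa)
    return sum(
        (c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )

# Toy alignment (assumption): columns 0 and 1 always change together,
# column 2 never changes, and column 3 varies independently of column 0.
msa = [
    "AKLE",
    "AKLD",
    "GELE",
    "GELD",
]

coupled = mutual_information(msa, 0, 1)      # 1.0 bit: perfectly correlated
independent = mutual_information(msa, 0, 3)  # 0.0 bits: no correlation
```

    A high value for a column pair suggests the two positions may be in contact, which is the signal illustrated in the figure above.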

    However, as we have discussed, transformers are incredibly adept at processing and understanding single sequences; how would one process many sequences simultaneously? The MSA input to the pLM is many sequences stacked on top of each other, so the attention mechanism needs to be modified. The MSA transformer uses row-wise attention to pick out important residues and column-wise attention to pick out important sequences. This creates a protein latent space from an MSA rather than a single sequence.
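
    The two attention directions can be sketched on an MSA representation of shape (s, r, c), with s sequences, r residues, and c channels. This is a simplified sketch: the learned query, key, and value projections are omitted, and AlphaFold's actual layers include further details not shown here. All names and sizes are assumptions for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(X):
    """Projection-free self-attention over the second-to-last axis of X."""
    scores = X @ np.swapaxes(X, -1, -2) / np.sqrt(X.shape[-1])
    A = softmax(scores)
    return A @ X, A

rng = np.random.default_rng(0)
s, r, c = 4, 7, 8                      # sequences, residues, channels (toy sizes)
msa_repr = rng.normal(size=(s, r, c))

# Row-wise: within each sequence, every residue attends to the other residues.
row_out, row_A = attend(msa_repr)                      # row_A: (s, r, r)

# Column-wise: within each alignment column, every sequence attends to the others.
col_out, col_A = attend(np.swapaxes(msa_repr, 0, 1))   # col_A: (r, s, s)
col_out = np.swapaxes(col_out, 0, 1)                   # back to (s, r, c)
```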

    The other transformer, T2, also has modified attention (triangular self-attention). T2 computes over a pairwise representation of the single input sequence. It builds a representation similar to a distogram, a matrix representation of the distance between every residue and every other residue. T1 and T2 make up the section of AlphaFold known as the EvoFormer, and there are 48 of these EvoFormer layers in total. The latent representation of the MSA input from T1 and the pairwise representation from T2 are both input to the structure module.
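
    To make the idea of a distogram concrete, the sketch below builds a residue-by-residue distance matrix from known coordinates and assigns each distance to a bin. In a prediction model the distances are inferred rather than measured; here the coordinates and the bin edges are assumptions chosen for illustration, with residues spaced 3.8 Å apart, a typical distance between consecutive alpha carbons.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between residues; coords has shape (L, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Four toy residue positions in angstroms (assumption).
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [7.6, 3.8, 0.0],
])
D = distance_matrix(coords)            # (L, L), symmetric, zero diagonal

# Illustrative bin edges in angstroms (assumption).
edges = np.array([4.0, 8.0, 12.0])
bins = np.digitize(D, edges)           # 0: under 4, 1: 4 to 8, 2: 8 to 12, 3: 12 or more
print(np.round(D, 1))
print(bins)
```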

    The structure module also includes a specialized attention mechanism (invariant point attention) that merges information from T1 and T2. A simple computer-vision-inspired architecture, ResNet, uses the attention output to predict side-chain and backbone torsion angles. From these outputs, the atom coordinates for the entire protein are calculated, and the structure is relaxed with Amber (a molecular mechanics/dynamics force field), which removes any structural violations based on atomic charges and locations.

    This entire process is repeated three times, with the structure and MSA information informing each other through skip connections and linear transformations. This recycling greatly improves the final 3D structure (Figure \(\PageIndex{14}\) below).

    Flowchart illustrating a data processing pipeline with various databases, analysis steps, and final output as a protein structure.

    Figure \(\PageIndex{14}\): Model architecture.  Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2.  Creative Commons Attribution 4.0 International License.  http://creativecommons.org/licenses/by/4.0/.

    Arrows show the information flow among the various components. Array shapes are shown in parentheses, with s indicating the number of sequences, r indicating the number of residues, and c indicating the number of channels.

    In summary, AlphaFold combines three main networks that talk to each other and work together. The input sequence is used to search a large database for similar sequences, which are aligned into an MSA. These similar sequences are fed into T1, which builds a semantic and contextual understanding of amino acids, called the latent space. This latent space informs a pairwise representation, similar to a distogram, that tracks the distances between amino acid residues in the original input sequence. All of this information is used by the structure module, which calculates the 3D coordinates of each atom. The entire process is repeated so that the structural information can inform the MSA, and the MSA can, in turn, inform the structure. Throughout the process, ResNet and Amber are used to prevent side-chain or backbone angles that do not occur in nature.

    Other projects have also taken the MSA approach to structure prediction. RoseTTAFold and xTrimoPGLM use different networks and attention variants, achieving accuracies similar to AlphaFold. However, building an MSA can be computationally intensive and requires a database of similar sequences. These methodologies fall short for so-called protein orphans, which have little to no sequence homology in publicly available databases.

    To address the computational and protein-orphan concerns, projects such as ESM, OmegaFold, and Ember have used a standard pLM trained via mask denoising and a structure module to achieve high accuracy in structure prediction, outperforming MSA-based methods on protein orphans. Breakthroughs from AlphaFold, such as recycling, are standard practice across these different approaches.

    Some researchers use these projects in parallel to obtain multiple structural predictions for the same sequence. This ensemble approach leverages the strengths of each model and averages out their weaknesses. Low-confidence regions or model disagreements may also indicate intrinsically disordered regions of protein structure, which are incredibly important for biological function.

    Overall, sequence-to-structure mapping has been effectively addressed by modern computational methods. The combination of pLMs and structure modules has enabled large-scale annotation of protein sequences with high-quality structures, which the scientific community can use to accelerate breakthroughs. However, structure prediction is not the only thing you can do with pLMs.

    Protein function prediction

    The latent space learned from extensive masked denoising of amino acid sequences correlates extremely well with protein structure, which is vital for the study and annotation of proteins. Interestingly, this latent space also correlates highly with other useful annotation types, such as function. By averaging the last hidden state output of a pLM across the sequence length, one can build an effective fixed-length vector representation of a protein, called a vector embedding. This way, every protein has a numerical representation of the same size regardless of its length, and it can be easily fed to machine learning classifiers such as support vector machines, k-nearest neighbors, random forests, and more. The pLM can also be fine-tuned as a classifier, given enough annotated data. Precomputed protein embeddings from popular models like ProtT5 can be downloaded from UniProt, as well as large amounts of annotated data, if you want to try this for yourself.
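
    The pooling and classification steps can be sketched as follows. Random arrays stand in for real pLM hidden states and for precomputed embeddings, the labels are invented, and the helper names `mean_pool` and `knn_predict` are assumptions for this example; a real workflow would load embeddings produced by an actual model.

```python
import numpy as np

def mean_pool(hidden_states):
    """Average over the length axis: (L, d) -> (d,), whatever L is."""
    return hidden_states.mean(axis=0)

def knn_predict(query, train_embs, train_labels, k=3):
    """Majority label among the k nearest training embeddings (Euclidean distance)."""
    dists = np.linalg.norm(train_embs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(0)
d = 16                                       # toy embedding dimension (assumption)
protein_a = rng.normal(size=(120, d))        # stand-in hidden states, 120 residues
protein_b = rng.normal(size=(45, d))         # stand-in hidden states, 45 residues
emb_a, emb_b = mean_pool(protein_a), mean_pool(protein_b)  # both have shape (d,)

# Invented training set: two well-separated clusters with made-up labels.
train_embs = np.vstack([
    rng.normal(loc=2.0, scale=0.1, size=(5, d)),
    rng.normal(loc=-2.0, scale=0.1, size=(5, d)),
])
train_labels = ["enzyme"] * 5 + ["non-enzyme"] * 5
prediction = knn_predict(np.full(d, 1.5), train_embs, train_labels)
```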

    Researchers are typically interested in annotations that include details such as EC and GO classes. EC stands for Enzyme Commission, and EC numbers organize protein functions into a hierarchical scheme, delimited by the type of reaction the proteins catalyze.

    For example, the hierarchy of an EC number can be illustrated as:

    • 1st digit: Represents one of the seven primary classes of enzymes, e.g., '1' stands for oxidoreductases.
    • 2nd digit: Describes a subclass within the primary class. If an enzyme is a '1.1', it specifically acts on the CH-OH group of donors.
    • 3rd digit: Categorizes the enzyme even further by specifying the acceptor. For instance, '1.1.1' indicates that the enzyme acts on the CH-OH group with NAD+ or NADP+ as the acceptor.
    • 4th digit: Provides a unique identifier for each enzyme within its specific class, subclass, and sub-subclass. So, '1.1.1.1' is the EC number for alcohol dehydrogenase.
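
    The hierarchy in the list above can be made concrete by splitting an EC number into its levels and comparing two numbers level by level. The helper names `ec_levels` and `shared_depth` are assumptions for this example.

```python
def ec_levels(ec):
    """Expand an EC number into its class, subclass, sub-subclass, and full entry."""
    parts = ec.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def shared_depth(ec_a, ec_b):
    """Number of leading levels two EC numbers have in common (0 to 4)."""
    depth = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        depth += 1
    return depth

print(ec_levels("1.1.1.1"))                 # alcohol dehydrogenase
print(shared_depth("1.1.1.1", "1.1.1.27"))  # L-lactate dehydrogenase: same sub-subclass
print(shared_depth("1.1.1.1", "2.7.1.1"))   # hexokinase: different primary class
```

    Because each EC number has exactly one parent at every level, a prediction can be scored at whichever depth is of interest, for example counting a prediction as partly correct if the first three levels match.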

    In total, there are currently over 8000 unique EC numbers! pLMs have shown remarkable competency in predicting them from sequence alone, often achieving between 80-90+% accuracy on unseen data.

    Another popular annotation type is Gene Ontology (GO), which labels genes based on what their protein products do in biological contexts. GO is split into three main categories:

    1. Biological Process (BP): Describes a series of events accomplished by one or more ordered assemblies of molecular functions. For instance:

    GO:0006955 - Immune response

    GO:0006958 - Complement activation, classical pathway

    GO:0045087 - Innate immune response

    … and so on

    2. Cellular Component (CC): Describes parts of a cell or environment a protein product likely localizes to. For example:

    GO:0005634 - Nucleus

    GO:0005654 - Nucleoplasm

    GO:0005694 - Chromosome

    … and others

    3. Molecular Function (MF): Describes activities that occur at the molecular level, such as binding or catalysis. This subcategory is very similar to EC numbers. For instance:

    GO:0003824 - Catalytic activity

    GO:0016491 - Oxidoreductase activity

    GO:0016614 - Oxidoreductase activity, acting on CH-OH group of donors

    … and further subcategories

    The main difference in organization is that GO terms have non-unique parent-child relationships. In simpler terms, a GO term can have multiple parent terms, whereas an EC number may have only one. Regardless, pLMs also perform impressively across a wide range of GO term prediction tasks from sequence alone.
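
    The practical consequence of multiple parents is that GO must be stored as a graph rather than a simple tree. The sketch below collects all ancestors of a term. The three GO identifiers are the Molecular Function examples listed above, with the chain cut off at catalytic activity for brevity; the `TOY:` terms are hypothetical and exist only to show a term with two parents.

```python
# Child -> list of parents. A real ontology file would supply these edges.
PARENTS = {
    "GO:0016614": ["GO:0016491"],   # oxidoreductase activity, acting on CH-OH group of donors
    "GO:0016491": ["GO:0003824"],   # oxidoreductase activity
    "GO:0003824": [],               # catalytic activity (chain truncated here)
    "TOY:child": ["TOY:parent_a", "TOY:parent_b"],  # hypothetical multi-parent term
    "TOY:parent_a": ["TOY:root"],
    "TOY:parent_b": ["TOY:root"],
    "TOY:root": [],
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links, visiting each term once."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("GO:0016614")))
print(sorted(ancestors("TOY:child")))
```

    A protein annotated with a term is implicitly annotated with every ancestor of that term, which is one reason GO prediction is usually framed as a multi-label problem.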

    Additionally, researchers are interested in the complex interplay of proteins in the cell. How do proteins modify each other and their surrounding cellular components? Why does a specific gene expression cause a disease state? Which chaperones or post-translational modifications can contribute to homeostasis, and which ones are detrimental? All of these questions can be answered by building an understanding of protein-protein interactions and networks.

    Protein-protein interactions (PPIs) can be defined in various ways, but the literature typically focuses on chemical or conformational changes that occur when one protein comes into contact with another. Some other terms are often added to the definition, requiring an interaction to have a nonredundant function in some sense. Regardless, researchers typically drastically simplify the problem by treating it as a binary classification: Proteins either interact or do not. In a biological context, it is much more complicated than this, but good PPI classifiers are still informative towards the questions we mentioned above.

    Recently, pLMs have received increasing attention for their ability to compare protein sequences and predict interactions in biological contexts. Vast databases of positive interactors enable this type of analysis. However, confirming that two proteins never interact is a much harder problem, so reliable negative examples are scarce. Training PPI classifiers under this massive inherent class imbalance is challenging, but there are many promising modern approaches. Hence, PPI classification is another way to partially uncover protein function computationally via pLMs.
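
    Why class imbalance is a problem can be shown with a small calculation. In the sketch below, a classifier that always answers "no interaction" looks excellent by accuracy while finding no true interactions at all. The 10-to-990 split is an invented ratio for illustration, and the inverse-frequency class weights are one common remedy, not necessarily the one used by any specific PPI method.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 means 'interacts'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

def class_weights(y_true):
    """Inverse-frequency weights so both classes contribute equally to a loss."""
    n = len(y_true)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

y_true = [1] * 10 + [0] * 990        # invented data: 1% of pairs interact
always_negative = [0] * 1000         # a useless classifier

accuracy, recall = evaluate(y_true, always_negative)
weights = class_weights(y_true)
print(accuracy, recall, weights)
```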

    Protein sequence generation

    Lastly, we discuss one more general application of pLMs in protein sequence design and generation. As we discussed above, transformer decoders (or GPT models) are often trained to predict the next token given some other tokens for context. The popularized ChatGPT does this incredibly well in English. Generative pLMs perform the identical task on amino acid tokens, generating sequences from scratch or completing sequence prompts. Many pLM projects are notable in this space.

    • ProtGPT2: Stacked transformer decoders that generate viable nature-like sequences from scratch.
    • ProGen: Stacked transformer decoders that generate plausible sequences given control tags for context, thus being able to generate sequences of a particular family or ontology.
    • ANKH: A general-purpose encoder-decoder pLM that can generate proteins of a specific superfamily or plausible variants with possible increased functionality.
    • ProteinDT: A pLM fine-tuned by contrasting vector embeddings with an English language model to enable protein generation based on English natural language input.
    • xTrimoPGLM: A massive general-purpose pLM that is extremely capable at sequence design. It can even generate sequences that have nearly identical structures but almost no sequence similarity.
    • ProstT5: A fine-tuned version of ProtT5, which is also capable of sequence generation. Given a structure as input, this encoder-decoder architecture can generate a sequence that approximates said structure. This is a bilingual model with amino acid and structure-based tokens.
    • SaProt: An encoder-only system with a bilingual vocabulary similar to that of ProstT5 that can also translate between sequence and structure, enabling sequence generation based on a structure.
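
    The next-token generation loop shared by the decoder-style models above can be sketched with a drastically simplified stand-in. Instead of a transformer, the example uses a bigram count table that only looks at the previous amino acid; the training sequences, the `<bos>` and `<eos>` token spellings, and the function names are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = ["<bos>"] + list(seq) + ["<eos>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt="", max_len=30, seed=0):
    """Sample one token at a time, conditioning on what has been generated so far."""
    rng = random.Random(seed)
    out = list(prompt)
    prev = out[-1] if out else "<bos>"
    while len(out) < max_len:
        options = counts[prev]
        if not options:              # token never seen in training: stop
            break
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == "<eos>":           # the model chose to end the sequence
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

training = ["MKTAYIAK", "MKLAYVAK", "MKTVYIAR"]   # invented sequences
counts = fit_bigram(training)
from_scratch = generate(counts)                   # generation from scratch
completed = generate(counts, prompt="MK")         # completing a sequence prompt
```

    A generative pLM follows the same sample-and-append loop, but it predicts each next token from the entire preceding context rather than from a single previous residue.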

    Concluding remarks

    Protein language modeling is an interdisciplinary science at the intersection of bioinformatics, biochemistry, and computational sciences. Such modeling techniques are becoming an integral part of biochemical research as NLP and computational hardware advance rapidly. It is easy to recognize the potential of protein language modeling in general life sciences: The generation of novel sequences for therapeutics, industrial catalysts, and synthetic biology, all the while annotating newly sequenced and generated proteins alike. 

    Importantly, amino acid-based vocabularies are not the only biochemically relevant uses of NLP models. DNA, codon, and even atom-wise vocabularies are being explored in many applications, including genomics, phylogenetics, and small-molecule-to-protein interactions. There are many avenues to explore in the field of biological NLP.

    As someone building competency in biochemistry, it is important to remember the capabilities of protein language modeling while recognizing that it is a relatively new and rapidly evolving field. Computational protein modeling may look vastly different in two years and will likely be completely different a decade from now. Regardless, computational tools serve as helpers to biochemists, not replacements. Standard biochemical assays to determine protein structure and function will always be necessary to confirm and further inform computational findings. Computational methods simply help separate plausible hypotheses from implausible ones.

    Summary

    (Summary written by Claude, Sonnet 4.6, Anthropic)

    This chapter introduces protein language models (pLMs) as a transformative computational framework at the intersection of biochemistry, bioinformatics, and artificial intelligence, providing biochemistry students with the conceptual vocabulary needed to understand, evaluate, and apply these tools in a research context.

    Mathematical prerequisites establish the foundations: proteins and molecular networks can be represented as mathematical graphs in which atoms, residues, or proteins serve as nodes and covalent bonds, noncovalent interactions, or functional relationships serve as edges. Edges can be undirected (protein-protein binding), directed (metabolic flux), or weighted (binding affinity). Any such network can be encoded as an adjacency matrix, enabling manipulation by linear algebra. Feature matrices assign numerical properties to each node — for amino acid residues, these might include charge, polarity, size, and chemical reactivity. Vectors and matrices are the native language of machine learning, and this graph-theoretic framing explains why molecular systems are naturally amenable to modern AI methods.

    Protein language models treat amino acid sequences as sentences in a biological language. The workflow begins with tokenization — assigning each amino acid a unique integer identifier — followed by token embedding, in which each integer is mapped to a high-dimensional learnable vector (typically 768 or more dimensions). These vectors, initially random, are updated during training to encode the semantic and contextual meaning of each residue. In a well-trained model, vector arithmetic captures meaningful biological relationships, analogous to the famous King − Man + Woman ≈ Queen result in English language models. The transformer architecture processes these embedded sequences through self-attention layers. In the attention mechanism, query, key, and value matrices (derived by multiplying the input embeddings by learnable weight matrices) are combined so that the dot product of query and key vectors encodes similarity; a softmax function converts these similarities into attention weights that sum to 1, and these weights are applied to value vectors to produce a context-aware representation of each residue. The resulting attention matrix L × L encodes which residues are contextually relevant to which others — often reflecting spatial proximity in the folded protein. Multi-head attention runs multiple independent attention heads in parallel, each capturing different aspects of residue relationships, with their outputs concatenated and linearly projected. Feed-forward layers further refine the representation, and multiple transformer layers are stacked to form large language models (LLMs). Positional embeddings, normalization layers, and skip connections are additional architectural components that improve performance on sequence data. 
    Training uses masked language modeling (randomly masking tokens and training the model to predict them from context) or next-token prediction (predicting each successive residue from preceding ones), both of which generate effectively unlimited labeled training data from raw sequence databases without human annotation.

    AlphaFold is the most celebrated application of pLM technology to structural biology. Its EvoFormer module consists of two transformers: T1, an MSA transformer that processes multiple sequence alignments to extract co-evolutionary information (correlated mutations between residue pairs signal structural proximity through coevolution), using row-wise attention (identifying important residues across sequences) and column-wise attention (identifying informative sequences); and T2, which builds a pairwise distance representation (distogram) of the input sequence using triangular self-attention. The EvoFormer is stacked 48 times. The structure module uses invariant point attention to merge MSA and pairwise information, a ResNet to predict side chain and backbone torsion angles, and Amber-based energy minimization to correct steric clashes. The entire pipeline is recycled three times to enable structural information to inform MSA analysis and vice versa. For proteins lacking MSA homologs (orphans), single-sequence pLMs such as ESM and OmegaFold achieve comparable accuracy at lower computational cost than database searching.

    Beyond structure prediction, pLMs have demonstrated 80–90+% accuracy in function annotation — predicting Enzyme Commission (EC) numbers and Gene Ontology (GO) terms from sequence alone — by using the final hidden-state embeddings as protein vector representations fed to classical machine learning classifiers. Protein-protein interaction (PPI) classification from pLM embeddings is an active and challenging area due to the inherent class imbalance between known interactors and the vast unknown non-interacting space. Generative pLMs (ProtGPT2, ProGen, ANKH, ProstT5, SAProt) employ decoder or encoder-decoder architectures to generate novel protein sequences from scratch, from family-specific prompts, or from structural inputs — enabling de novo protein design for therapeutics, enzymes, and synthetic biology. The chapter concludes by emphasizing that while pLMs are powerful tools for hypothesis generation and narrowing the search space of biochemical possibilities, experimental validation through structural, functional, and biochemical assays remains indispensable, and the field is evolving rapidly enough that specific model capabilities will look dramatically different within a decade.

    References

    Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv December 5, 2017. https://doi.org/10.48550/arXiv.1706.03762.

    EMBL-EBI. Introduction to graph theory | Network analysis of protein interaction data. https://www.ebi.ac.uk/training/onlin...-graph-theory/ (accessed 2023-10-28).

    Geetansh Kalra.  Attention Networks: A simple way to understand Self Attention.  https://medium.com/@geetkal67/attent...n-f5fb363c736d

    DeepFindr.  Understanding Graph Attention Networks. https://www.youtube.com/watch?v=A-yKQamf2Fc

    Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A. A.; Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.; Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.; Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; Hassabis, D. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596 (7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2.

    Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; Rives, A. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 2023, 379 (6637), 1123–1130. https://doi.org/10.1126/science.ade2574.

    Chen, B.; Cheng, X.; Geng, Y.; Li, S.; Zeng, X.; Wang, B.; Gong, J.; Liu, C.; Zeng, A.; Dong, Y.; Tang, J.; Song, L. xTrimoPGLM: Unified 100B-Scale Pre-Trained Transformer for Deciphering the Language of Protein. bioRxiv July 6, 2023, p 2023.07.05.547496. https://doi.org/10.1101/2023.07.05.547496.

    Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; Bhowmik, D.; Rost, B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 2022, 44 (10), 7112–7127. https://doi.org/10.1109/TPAMI.2021.3095381.

    Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. arXiv January 16, 2023. https://doi.org/10.48550/arXiv.2301.06568.

    Su, J.; Han, C.; Zhou, Y.; Shan, J.; Zhou, X.; Yuan, F. SaProt: Protein Language Modeling with Structure-Aware Vocabulary. bioRxiv October 2, 2023, p 2023.10.01.560349. https://doi.org/10.1101/2023.10.01.560349.

    Heinzinger, M.; Weissenow, K.; Sanchez, J. G.; Henkel, A.; Steinegger, M.; Rost, B. ProstT5: Bilingual Language Model for Protein Sequence and Structure. bioRxiv July 25, 2023, p 2023.07.23.550085. https://doi.org/10.1101/2023.07.23.550085.

    Hallee, L.; Rafailidis, N.; Gleghorn, J. P. cdsBERT - Extending Protein Language Models with Codon Awareness. bioRxiv September 17, 2023, p 2023.09.15.558027. https://doi.org/10.1101/2023.09.15.558027.

    Hallee, L.; Gleghorn, J. P. Protein-Protein Interaction Prediction Is Achievable with Large Language Models. bioRxiv June 9, 2023, p 2023.06.07.544109. https://doi.org/10.1101/2023.06.07.544109.

    Ferruz, N.; Schmidt, S.; Höcker, B. ProtGPT2 Is a Deep Unsupervised Language Model for Protein Design. Nat Commun 2022, 13 (1), 4348. https://doi.org/10.1038/s41467-022-32007-7.

    Liu, S.; Zhu, Y.; Lu, J.; Xu, Z.; Nie, W.; Gitter, A.; Xiao, C.; Tang, J.; Guo, H.; Anandkumar, A. A Text-Guided Protein Design Framework. arXiv February 9, 2023. http://arxiv.org/abs/2302.04611 (accessed 2023-02-14).

     


    4.13: Predicting Structure and Function of Biomolecules Through Natural Language Processing Tools is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.