4.4: Secondary Structural Motifs and Domains
Differentiate Between Motifs and Domains
- Define what constitutes a secondary structural motif versus a protein domain, and explain how these concepts relate to one another within a polypeptide chain.
Recognize Common Structural Motifs
- Identify and describe key motifs (e.g., helix-turn-helix, leucine zipper, EF-hand, Greek key, β-barrel), noting the underlying secondary structures and typical functional roles.
Examine the Arrangement of Motifs into Domains
- Discuss how multiple motifs can combine to form larger, independently folding domains, and relate the modular architecture of domains to protein evolution and diversity.
Correlate Domain Structure with Function
- Illustrate how domains often correspond to specific functional units (e.g., binding sites, catalytic centers) and how domain organization impacts the overall activity and regulation of proteins.
Explore the Evolutionary Significance of Domains
- Investigate how gene duplication, shuffling, and fusion events can create new proteins by combining existing domains, driving the innovation of protein functions.
Analyze Domain Boundaries Experimentally and Computationally
- Outline how techniques like limited proteolysis, X-ray crystallography, and bioinformatics-based domain prediction (e.g., sequence alignments, structural comparisons) are used to identify domain boundaries.
Discuss Folding and Stability of Domains
- Describe how domains can fold independently or co-translationally, emphasizing the role of local stability, chaperones, and intradomain interactions in achieving native conformations.
Relate Motif and Domain Knowledge to Protein Engineering
- Highlight how understanding motifs and domains aids in the rational design of proteins (e.g., creating novel enzymes or fusion proteins), leveraging well-characterized structural modules.
By mastering these learning goals, students will better appreciate how secondary structural motifs organize into functionally distinct domains, driving protein structure's remarkable versatility and complexity.
Common Structural Motifs
Given the number of possible combinations of 1°, 2°, and 3° structures, one might guess that the 3D structure of each protein is quite distinctive. This is, in general, true. However, similar substructures are found in many proteins. For instance, common secondary structures are often grouped to form common structural motifs, often called super-secondary structures. The same motif is often found in proteins with similar functions (such as proteins that bind DNA, Ca2+, etc.). Let's explore some of the common motifs.
Alpha-loop-Alpha
These motifs are found in DNA-binding proteins that regulate transcription and in calcium-binding proteins, where the motif is often called the EF-hand. The loop region in calcium-binding proteins is enriched in Asp, Glu, Ser, and Thr. Why? The EF-hand shown below is from calmodulin.
Figure \(\PageIndex{1}\) shows an interactive iCn3D model of a basic helix-loop-helix from the c-Myc protein (1NKP). The iCn3D model shows the helices interacting with the major groove of DNA, which is shown in spacefill.
Figure \(\PageIndex{1}\): Basic helix-loop-helix from the c-Myc protein (1NKP). (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...kDv9DGzWWWoMZ8
Figure \(\PageIndex{2}\) shows an interactive iCn3D model of the EF-hand from the calcium-binding protein calmodulin (1cll).
Figure \(\PageIndex{2}\): EF-hand from calmodulin (1cll): Secondary Structure Motif. (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...vCYH7EX4sVvtu6
The EF-hand can be envisioned as a hand gripping a ball (the calcium ion), with the index finger and thumb representing alpha helices, as shown in Figure \(\PageIndex{3}\).

The EF-hand motif of calmodulin is used in various Ca2+-binding proteins. Figure \(\PageIndex{4}\) shows the alignment of the first 50 residues of human calmodulin with four other human calcium-binding proteins. The EF-hand (F12-L39) of calmodulin consists of the second half of the first helix (F12-L18), an intervening loop (F19-T28), and the second helix (T29-L39). Sometimes, it is annotated to encompass a larger stretch (8-43).
Part A shows the degree of conservation of amino acids in this first Ca2+-binding EF-hand. Part B shows the general conservation of key hydrophobic residues (F12, F19, I27) as well as residues of similar polarity (36 and 39).
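The residue ranges quoted above can be checked with a few lines of Python. This is a minimal sketch, not part of the original page: the sequence string is an assumption (the first 40 residues of mature human calmodulin, numbered from A1 so that F12 is residue 12) and should be confirmed against the UniProt entry before reuse. The helper names are hypothetical.

```python
# First 40 residues of mature human calmodulin.
# ASSUMPTION: typed in by hand, numbering starts at A1 (initiator Met removed).
CALMODULIN_N_TERM = "ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLG"


def segment(seq, start, end):
    """Return residues start..end using 1-based, inclusive numbering."""
    return seq[start - 1:end]


def count_oxygen_side_chains(residues):
    """Count Asp, Glu, Ser, and Thr, whose side chains carry oxygen atoms."""
    return sum(1 for aa in residues if aa in "DEST")


# The intervening loop of the first EF-hand (F19-T28 in the text).
loop = segment(CALMODULIN_N_TERM, 19, 28)
n_oxygen = count_oxygen_side_chains(loop)
print(loop, n_oxygen, len(loop))
```

Counting these residues in the loop is one quick way to approach the "Why?" question posed earlier: side-chain oxygens are good ligands for a hard cation such as Ca2+.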
Figure \(\PageIndex{5}\) shows an interactive iCn3D model of a bound calcium ion and interacting amino acids in human calmodulin with key amino acids labeled.
Figure \(\PageIndex{5}\): Bound calcium ion and interacting amino acids in human calmodulin (1cll) (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/icn3d/share.html?bCf4mtNbk4kjkCHw6
Hover over the amino acid side chains that are coordinating the Ca2+ ion. Are they what you would expect?
A linear connectivity "wiring" diagram showing secondary structure elements joined by connecting regions is shown in Figure \(\PageIndex{6}\). This wiring diagram shows a 2-residue beta-strand, which is too short to be considered an actual strand.

A more complicated 2D topology map is shown in Figure \(\PageIndex{7}\). In this case, it is linear, given the small section of amino acids depicted. We will see more complicated 2D topology maps with more complicated structures below.

It is presented on its side to save space on this page.
Beta-hairpin or beta-turn
This motif is present in most antiparallel beta structures, both as an isolated ribbon and as part of beta sheets.
Figure \(\PageIndex{8}\) shows an interactive iCn3D model of the beta hairpin from bovine pancreatic trypsin inhibitor (1k6u).
Figure \(\PageIndex{8}\): Beta hairpin from bovine pancreatic trypsin inhibitor (1k6u) (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...eMFdHkGogJHCCA
Figure \(\PageIndex{9}\) shows the 2D topology map for the beta-hairpin.

Greek Key
The "Greek Key" symbol represents infinity and the eternal flow of things and resembles, in part, primitive keys. The Greek Key motif in proteins can be seen in the structure of antiparallel beta sheets in ordering four adjacent antiparallel beta strands, as shown in Figure \(\PageIndex{9}\). The figure also shows the repetitive Greek key, which you will see many times if you visit Greece and tour its antiquities.
Figure \(\PageIndex{10}\) shows a partial 2D topology map of Staphylococcus nuclease (2SNS).
Figure \(\PageIndex{11}\) shows an interactive iCn3D model of the Greek Key motif from Staphylococcus nuclease (2SNS). The involved beta strands are shown in yellow.
Figure \(\PageIndex{11}\): Greek Key motif from Staphylococcus nuclease (2SNS) (Copyright; author via source).
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...x2ef4xpttXrFb9
Beta-Alpha-Beta
The motif is a common way to connect two parallel beta strands, unlike beta hairpins, which are used to connect antiparallel beta strands.
Figure \(\PageIndex{12}\) shows an interactive iCn3D model of the beta-alpha-beta structure from triose phosphate isomerase (1amk).

Figure \(\PageIndex{13}\) shows the 1D wiring diagram for the first beta-alpha-beta motif in triose phosphate isomerase.

Figure \(\PageIndex{14}\) shows a 2D topology diagram of this motif.
Larger Structural Motifs - Protein Architecture
Some proteins combine larger secondary and supersecondary structural components, often repeatedly, to produce more complex structures. We've seen this with larger twisted sheets and beta barrels, such as the TIM barrel. Let's consider three of these, which can be considered examples of protein architectures without considering connectivity within the protein.
The Rossmann Fold
Structural motifs can serve particular functions within proteins, such as enabling the binding of substrates or cofactors. For example, the Rossmann fold is responsible for binding to nucleotide cofactors such as nicotinamide adenine dinucleotide (NAD+), as shown in Figure \(\PageIndex{15}\). The Rossmann fold comprises six parallel beta strands forming an extended beta sheet. The first three strands are connected by α-helices, resulting in a beta-alpha-beta-alpha-beta structure. This pattern is duplicated once to produce an inverted tandem repeat with six strands. Overall, the strands are arranged in the order of 321456 (1 = N-terminal, 6 = C-terminal). Five stranded Rossmann-like folds are arranged in the sequential order 32145. The overall tertiary structure of the fold resembles a three-layered sandwich wherein the filling is composed of an extended beta sheet, and the connecting parallel alpha helices form the two slices of bread.

Image modified from: Boghog
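The 321456 strand order can be hard to picture from text alone. The sketch below is an illustration only (the function names are hypothetical): it treats the order as a list of sequence labels laid out across the sheet and reports which strands sit next to each other in space, and which neighboring pairs are not consecutive in sequence.

```python
# Strands across the sheet from one edge to the other, labeled by their
# position in the sequence (1 = N-terminal, 6 = C-terminal).
SPATIAL_ORDER = [3, 2, 1, 4, 5, 6]


def spatial_neighbors(order):
    """Map each strand to the strands lying directly beside it in the sheet."""
    neighbors = {}
    for i, strand in enumerate(order):
        adjacent = []
        if i > 0:
            adjacent.append(order[i - 1])
        if i < len(order) - 1:
            adjacent.append(order[i + 1])
        neighbors[strand] = adjacent
    return neighbors


def nonsequential_contacts(order):
    """Adjacent strand pairs whose sequence labels differ by more than one."""
    return [(a, b) for a, b in zip(order, order[1:]) if abs(a - b) != 1]


neighbors = spatial_neighbors(SPATIAL_ORDER)
print(neighbors[1])                           # strands flanking strand 1
print(nonsequential_contacts(SPATIAL_ORDER))  # where the two halves meet
```

In this representation, the only adjacent pair that is not consecutive in sequence is strands 1 and 4, which is where the two beta-alpha-beta-alpha-beta halves of the fold pack against each other in the middle of the sheet.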
One of the features of the Rossmann fold is its cofactor-binding specificity. The most conserved segment of Rossmann folds is the first beta-alpha-beta segment. Since this segment is in contact with the ADP portion of dinucleotides such as FAD, NAD, and NADP, it is also called an "ADP-binding beta-alpha-beta fold."
Figure \(\PageIndex{16}\) shows an interactive iCn3D model of the Rossmann fold of malate dehydrogenase (5KKA) from E. coli. The beta strands (yellow), connecting alpha helices (red), and coil (blue) of the Rossmann fold are shown in the context of the rest of the monomeric version of the protein, which is shown in gray.

The TIM barrel revisited
Interestingly, similar structural motifs do not always have a common evolutionary ancestor and can arise from convergent evolution. This is the case with the TIM Barrel, a conserved protein fold consisting of eight α-helices and eight parallel β-strands alternating along the peptide backbone. It is illustrated in Figure \(\PageIndex{17}\). The structure is named after triosephosphate isomerase, a conserved metabolic enzyme. TIM barrels are one of the most common protein folds. One of the most intriguing features among members of this class of proteins is that although they all exhibit the same tertiary fold, there is very little sequence similarity between them. At least 15 distinct enzyme families use this framework to generate the appropriate active site geometry, always at the C-terminal end of the eight parallel beta-strands of the barrel.

Figure \(\PageIndex{17}\) The TIM Barrel. TIM barrels are considered α/β protein folds because they include an alternating pattern of α-helices and β-strands in a single domain. In a TIM barrel, the helices and strands (usually 8 of each) form a solenoid that curves around to close on itself in a doughnut shape, topologically known as a toroid. The parallel β-strands form the inner wall of the doughnut (hence, a β-barrel), whereas the α-helices form the outer wall of the doughnut. Each β-strand connects to the next adjacent strand in the barrel through a long right-handed loop that includes one of the helices so that the ribbon N-to-C coloring in the top view (A) proceeds in rainbow order around the barrel. The TIM barrel can also be thought of as made up of 8 overlapping, right-handed β-α-β super-secondary structures, as shown in the side view (B).
Image modified from: WillowW
Although the ribbon diagram of the TIM Barrel shows a hole in the protein's central core, the amino acid side chains are not shown in this representation (Figure 2.26). The protein's core is tightly packed, mostly with bulky hydrophobic amino acid residues. However, a few glycines are needed to allow wiggle room for the highly constrained center of the 8 approximate repeats to fit together. The packing interactions between the strands and helices are also dominated by hydrophobicity, and the branched aliphatic residues valine, leucine, and isoleucine comprise about 40% of the total residues in the β-strands.
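The composition claim above (valine, leucine, and isoleucine making up about 40% of β-strand residues) is easy to test on any set of strand sequences. The following is a minimal sketch: the three strand strings are invented toy data, not taken from a real TIM barrel, and the function name is hypothetical.

```python
BRANCHED_ALIPHATIC = set("VLI")  # valine, leucine, isoleucine


def branched_aliphatic_fraction(strands):
    """Fraction of residues across all strands that are V, L, or I."""
    residues = "".join(strands)
    if not residues:
        return 0.0
    hits = sum(1 for aa in residues if aa in BRANCHED_ALIPHATIC)
    return hits / len(residues)


# HYPOTHETICAL strand sequences, for illustration only.
toy_strands = ["KFVVGGN", "IAVAAQN", "VVLGHSE"]
fraction = branched_aliphatic_fraction(toy_strands)
print(f"{fraction:.2f}")
```

To apply this to a real structure, the strand sequences would first have to be extracted from a secondary structure assignment (for example, the strand residues shown in the iCn3D model), which this sketch does not attempt.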
Figure \(\PageIndex{18}\) below shows an interactive iCn3D model of the TIM barrel (1WYI) from Chapter 4.2.
As our knowledge continues to increase about the myriad of structural motifs found in nature's treasure trove of protein structures, we continue to gain insight into how protein structure is related to function and are better enabled to characterize newly acquired protein sequences using in silico technologies.
Beta Helices
These right-handed parallel helical structures consist of a contiguous polypeptide chain with three parallel beta strands separated by three turns, forming a single rung of a larger helical structure, which might contain as many as nine rungs. The interstrand H-bonds are between parallel beta strands in separate rungs. Beta helices seem to be prevalent in proteins from pathogens (bacteria, viruses, toxins) that facilitate the binding of the pathogen to a host cell.
Figure \(\PageIndex{19}\) shows an interactive iCn3D model of the C-terminal fragment of the phage T4 GP5 beta helix (4osd).

Beta helices are found in the following organisms (with the diseases they cause in humans): Vibrio cholerae (cholera), Helicobacter pylori (ulcers), Plasmodium falciparum (malaria), Chlamydia trachomatis (venereal disease), Chlamydophila pneumoniae (respiratory infection), Trypanosoma brucei (sleeping sickness), Borrelia burgdorferi (Lyme disease), Bordetella parapertussis (whooping cough), Bacillus anthracis (anthrax), Neisseria meningitidis (meningitis), and Legionella pneumophila (Legionnaires' disease).
Beta Propellers
Proteins with this structure have 4-8 blade-shaped beta sheets arranged around a central axis, forming an active site shaped like a funnel.
Figure \(\PageIndex{20}\) shows an interactive iCn3D model of the C-terminal domain of Tup1 (1ERJ), a yeast transcription factor, which has a seven-bladed beta propeller. Each blade contains a WD40 repeat sequence (around 40 amino acids) that often ends in tryptophan-aspartic acid (W-D). This particular protein has four WD dipeptide sequences, shown in sticks colored with CPK colors.

The funnel provides binding sites for proteins and other molecules, with the ones with more blades usually acting as enzymes.
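Locating the WD dipeptides that mark the ends of WD40 repeats is a simple string search. The sketch below is illustrative only: the sequence is an invented toy string, not the Tup1 sequence, and the function name is hypothetical.

```python
def find_dipeptide(seq, dipeptide="WD"):
    """Return the 1-based positions at which the dipeptide starts."""
    positions = []
    start = seq.find(dipeptide)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to residue number
        start = seq.find(dipeptide, start + 1)
    return positions


# HYPOTHETICAL sequence for illustration; not a real WD40 protein.
toy_sequence = "GHSAAVWDLKTGSCWDAR"
wd_positions = find_dipeptide(toy_sequence)
print(wd_positions)
```

A WD dipeptide alone does not define a WD40 repeat, since the text notes that repeats only often end in W-D, so a search like this is a starting point for inspection rather than a repeat detector.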
Domains
Domains are the fundamental unit of tertiary (3°) structure. Domains can be considered a chain or part of a chain that can independently fold into a stable tertiary structure. Domains are units of structure but can also be units of function. Some proteins can be cleaved at a single peptide bond to form two domains. Often, these can fold independently of each other, and sometimes, each unit retains an activity it had in the uncleaved protein. Sometimes, binding sites on the proteins are found in the interface between the structural domains. Many proteins seem to share functional and structural domains, suggesting that the DNA encoding each shared domain might have arisen from the duplication of a primordial gene with a particular structure and function.
Evolution has increased complexity, requiring proteins to have new structures and functions. Increased and different functionalities in proteins have been obtained by adding domains to base proteins. Chothia (2003) has defined a domain in an evolutionary and genetic sense as "an evolutionary unit whose coding sequence can be duplicated and/or undergo recombination." Proteins range from small with a single domain (typically from 100-250 amino acids) to large with many domains. From recent analyses of genomes, new protein functionalities appear to arise from the addition or exchange of other domains, which, according to Chothia, result from
- duplication of sequences that code for one or more domains
- divergence of duplicated sequences by mutations, deletions, and insertions that produce modified structures that may have helpful new properties to be selected
- recombination of genes that result in a novel arrangement of domains.
Structural analyses show that about half of all protein-coding sequences in genomes are homologous to other known protein structures. There appear to be about 750 different families of domains (i.e., small proteins derived from a common ancestor) in vertebrates, each with about 50 homologous structures. About 430 of these domain families are found in all the genomes that have been sequenced.
Proteins with multiple domains are less likely to misfold if each domain can fold somewhat autonomously. In addition, multiple domains provide a myriad of binding sites that increase the number of biological functions expressed in a single protein. Multidomain proteins can also express multiple catalytic activities, allowing a reaction product from one domain to diffuse to another catalytic domain (or interface between domains). This would reduce the dimensionality of the search for a substrate from 3D to more of a 1D or 2D search, enormously speeding up the net reaction. The process is often called substrate channeling.
Figure \(\PageIndex{21}\) shows an interactive iCn3D model of the three domains of the enzyme pyruvate kinase (1pkn). These include a nucleotide (ADP/ATP) binding domain (blue) made of beta strands, a substrate binding domain (green) in the middle composed of alpha/beta structure, and a regulatory domain (red) composed of alpha/beta structure. These domains were analyzed by a web program called CATH-Gene3D.

One ubiquitous domain is the Immunoglobulin Fold (IGF), also called the Immunoglobulin Domain (IgD). They are abundantly found in immune proteins, cell surface proteins involved in recognition, and other proteins. They are described in detail in Chapter 5.4: Recognition of Self and Nonself - The Immune System. Here are some images and iCn3D displays of proteins with Ig Domains in bacteria (left), viruses (center), and humans (right). (some have long load times)
Intimin-190 (Int190) from Enteropathogenic E. coli (1E5U) | Ig Domain in SARS-CoV-2 Spike Glycoprotein (6VXX) | AlphaFold ID Carcinoembryonic antigen-related cell adhesion molecule 1 (CEACAM1) (P13688) |
Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/icn3d/share.html?AcPowvr2Uz37Y1Rh7 | Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...hPGzYJF9wV7gt5 | Click the image for a popup or use this external link: https://structure.ncbi.nlm.nih.gov/i...GXmU8dwdNQ2w69 |
Recent Updates: 11/1/2024
With the recent advances, driven by AI, in predicting protein structure and function, there is a greater need to refine and develop programs that can determine the domain structures of the over 200 million protein structures in the AlphaFold database. Until recently, there were two different ways to determine domain structures in proteins:
- based on the 3D structure of a protein. The program CATH does this.
- based on 1D (linear) sequences. An example of a program that uses this approach is Pfam.
CATH classifies protein structure based on the following hierarchy of organization: Class, Architecture, Topology, and Homologous Superfamilies.
- Class: the highest level of organization, which consists of four classes - mainly alpha, mainly beta, alpha-beta, and few secondary structures
- Architecture (40 types): describes the shape of the domain based on secondary structures but doesn't describe how they are connected. Ex: beta-barrel, beta-propeller
- Topology (or fold group, 1233 types): Members of topology groups share a common fold or topology in the "core" of the domain structure.
- Homologous Superfamilies (2386 types): These groups are homologous in sequence or structure and derive from a common precursor gene or protein.
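Because the four levels nest, a CATH assignment is conveniently written as a dotted four-number identifier. The sketch below shows one way to represent and split such an identifier. It is an illustration under stated assumptions: the `CathCode` class is hypothetical, the numeric class codes 1-4 are assumed to follow the order of the class list above, and "3.40.50.720" is used only as an example identifier in that dotted format (look it up in CATH to confirm what it denotes).

```python
from dataclasses import dataclass

# ASSUMPTION: class numbers follow the order in which the classes are listed.
CATH_CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


@dataclass(frozen=True)
class CathCode:
    """Hypothetical container for a C.A.T.H identifier."""
    cath_class: int
    architecture: int
    topology: int
    superfamily: int

    @classmethod
    def parse(cls, code: str) -> "CathCode":
        parts = code.strip().split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated levels, got {code!r}")
        return cls(*(int(p) for p in parts))

    def prefix(self, depth: int) -> str:
        """Identifier truncated to the first `depth` levels (1 to 4)."""
        levels = (self.cath_class, self.architecture,
                  self.topology, self.superfamily)
        return ".".join(str(x) for x in levels[:depth])

    @property
    def class_name(self) -> str:
        return CATH_CLASS_NAMES.get(self.cath_class, "other")


code = CathCode.parse("3.40.50.720")
print(code.class_name, code.prefix(2), code.prefix(3))
```

Two domains share an architecture when their depth-2 prefixes match and share a topology when their depth-3 prefixes match, which mirrors the way the hierarchy narrows from shape to fold to homologous superfamily.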
Pfam uses multiple sequence alignments to define domain families.
The pyruvate kinase example above shows three structural domains. Pfam finds two major domains: a pyruvate kinase beta-barrel domain and an alpha/beta domain. The domains determined by both programs show about a 75% overlap.
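One way to put a number on the agreement between two sets of domain boundaries is to compare the residues each method assigns to a domain. The text does not say how the 75% figure was computed, so the measure below (shared residues divided by all residues covered by either method) is an assumption, and the residue ranges are invented placeholders rather than the real pyruvate kinase boundaries.

```python
def covered_residues(segments):
    """Expand (start, end) segments, 1-based inclusive, into a residue set."""
    covered = set()
    for start, end in segments:
        covered.update(range(start, end + 1))
    return covered


def overlap_fraction(domain_a, domain_b):
    """Shared residues divided by residues covered by either assignment."""
    res_a = covered_residues(domain_a)
    res_b = covered_residues(domain_b)
    union = res_a | res_b
    if not union:
        return 0.0
    return len(res_a & res_b) / len(union)


# HYPOTHETICAL boundaries for one domain as called by two methods.
structure_based = [(1, 100)]
sequence_based = [(26, 125)]
print(overlap_fraction(structure_based, sequence_based))
```

Segments are given as lists so that discontinuous domains, which structure-based methods can report, are handled the same way as contiguous ones.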
At a simpler level, domains are built from the kinds of motif structures we discussed above. Since proteins are tightly packed structures, the organizational structure of proteins can be thought of as closely packed motifs, but not all possible combinations are found. For example, if you place one beta hairpin next to another to form a 2-unit Greek key, there are 24 possible ways to connect them, but only eight are common. The two most common account for more than the sum of the other 22. These are shown in Figure \(\PageIndex{22}\).
Figure \(\PageIndex{23}\) shows an example of the architecture of the multi-domain protein, human Attractin-like protein 1. This protein is an example of a lectin, a carbohydrate-binding protein, which we will explore in a subsequent chapter. It binds Ca2+, so it is considered a C-type lectin. Three different programs were used to analyze the domain structure.
Figure \(\PageIndex{23}\): Architecture of the multi-domain protein, human Attractin-like protein 1
A new AI-based "TED - The Encyclopedia of Domains" has been developed to identify and classify around 365 million domains in the AlphaFold database. Around 1/3 of these were not predicted through 1D sequence alignments and comparisons. Around 3/4 of the "nonredundant" domains were similar to domains predicted by the 3D structure alignments of CATH. TED identified new domain interactions between superfamilies and many new protein folds. New folds found across life likely suggest a common function, whereas new folds within a given lineage suggest evolutionary changes.
Some new folds had higher symmetry, including the beta-propeller, which repeats to give high symmetry (see below for lower C3 symmetry). Additional AI-based sequence (1D) information was used to help infer function. For example, putative Zn2+ finger-like binding sites containing 2 Cys and 2 His but lacking a traditional Zn-finger motif were found. Most putative heme binding sites contained the CXXCH motif found in heme c. Inspection of the structures predicted to have these features supported their likely functions.
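The CXXCH pattern mentioned above (Cys, any two residues, Cys, His) can be scanned for with a regular expression. This is a minimal sketch of a plain sequence search, not the AI-based method used by TED; the test sequence is invented and the function name is hypothetical.

```python
import re

# Lookahead so that overlapping occurrences are also reported.
CXXCH = re.compile(r"(?=(C..CH))")


def find_cxxch(seq):
    """Return (1-based start position, matched residues) for each CXXCH site."""
    return [(m.start() + 1, m.group(1)) for m in CXXCH.finditer(seq)]


# HYPOTHETICAL sequence for illustration only.
toy_sequence = "MKACAACHGLTCPSCHE"
sites = find_cxxch(toy_sequence)
print(sites)
```

A sequence match only flags a candidate site. As the text notes, the predicted structures still had to be inspected to support a likely heme-binding function.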
CATH and TED are both 3D structure-based, so comparing their domains and domain interactions is warranted. TED found over 27 million examples of interacting domains with about 14,000 interacting superfamily pairs, compared to around 200,000 examples with 5000 pairs for CATH.
Figure \(\PageIndex{24}\) below shows the classification of TED domains using the CATH hierarchy.
Figure \(\PageIndex{24}\): Classification of TED domains using the CATH hierarchy. The top 100 superfamilies in TED-100 for each CATH class where more matches to CATH superfamilies have been identified through structural hits in TED compared with sequence hits in Gene3D. (ii) Proportion of domains matched to CATH classes (n = 238,569,631). Andy M. Lau et al. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science 386, eadq4946 (2024). DOI:10.1126/science.adq4946. Author Accepted Manuscript (AAM) version available under a CC BY public copyright license. Manuscript published in Science, Volume 386, Issue 6721, 1 Nov 2024.
Table \(\PageIndex{1}\) below shows examples of novel domain folds and probable functional sites described in the paper by Lau et al.
TED: A0A7M3WA57_TED05 - paired beta-strands in a closed, twisted hairpin with both termini adjacent. (4Ci from Lau et al.) | TED: E1Z635_TED02 - Alpha-helical variant, Eukaryotic. (4Dii from Lau et al.) | TED: A0A2J6RQN3 - tentative Zn2+-binding protein | TED: M5FA19 - tentative heme c-binding protein |
Download iCn3D png file | Download iCn3D png file | Download iCn3D png file | Download iCn3D png file |
View this in iCn3D as follows:
- download the above files to your computer. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
- open iCn3D
- File, Open File, iCn3D appendable, navigate to the folder with the downloaded png file, and select it.
Figure \(\PageIndex{25}\) below shows new examples of symmetry (C3) and extruded repeat domains found using TED.
Figure \(\PageIndex{25}\): Examples of high-symmetry domains and extruded repeats. Domains are identified as part of the novel domain identification pipeline, and domains with high internal symmetry are identified through scoring with the SymD program. Extruded repeats are domains with many ordered cyclical repeats projecting along one axis. Coloration follows pLDDT confidence bins as per the AFDB. Dark blue indicates very high confidence (pLDDT ≥ 90); blue indicates high confidence (90 > pLDDT ≥ 70); yellow indicates low confidence (70 > pLDDT ≥ 50); and orange indicates very low confidence (pLDDT < 50). Andy M. Lau et al. ibid.
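The confidence bins in the caption translate directly into a small lookup function. The sketch below encodes only the thresholds and colors stated in the caption; the function name and the decision to reject scores outside 0-100 are assumptions.

```python
def plddt_bin(score):
    """Map a pLDDT score (0-100) to the confidence bin and color in the caption."""
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score >= 90:
        return ("very high", "dark blue")
    if score >= 70:
        return ("high", "blue")
    if score >= 50:
        return ("low", "yellow")
    return ("very low", "orange")


for value in (95.2, 70.0, 69.9, 12.0):
    print(value, plddt_bin(value))
```

Note that each boundary value belongs to the higher bin, matching the "≥" signs in the caption.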
Table \(\PageIndex{2}\) below shows iCn3D examples of higher symmetry and extruded repeat domains described in the paper by Lau et al.
C11 symmetry-A0A1V6M2Y0 | C10 symmetry-A0A6C0LIE9 | Extruded-A0A1M5CF6 | Extruded-A0A833H0U1 |
Download iCn3D png file | Download iCn3D png file | Download iCn3D png file | Download iCn3D png file |
View this in iCn3D as follows:
- download the above files to your computer. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
- open iCn3D
- File, Open File, iCn3D appendable, navigate to the folder with the downloaded png file, and select it.
Individual protein annotations can also be browsed from the TED website (https://ted.cathdb.info).
TED structural domain assignments for AlphaFold Database v4 and associated code are available for download at Zenodo. The deposition contains domain assignments for TED, PDB files for novel folds, and individual domain assignments from Chainsaw, Merizo, and UniDoc to facilitate further benchmarking efforts. Specifically:
- novel_folds_set_models.tar.gz contains PDB files of all novel fold representatives identified in TED100.
- high_symmetry_folds_set_models.tar.gz contains PDB files of all highly symmetrical fold representatives identified in TED100.
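Once one of these archives has been downloaded, its contents can be listed with Python's standard `tarfile` module before anything is extracted. This is a minimal sketch: the function name is hypothetical, and it assumes the model files inside the archive carry a `.pdb` extension, which the text does not state.

```python
import tarfile


def list_pdb_members(archive_path):
    """Return the sorted names of regular files ending in .pdb in a .tar.gz."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".pdb")
        )


# Example usage after downloading the archive named in the text:
# names = list_pdb_members("novel_folds_set_models.tar.gz")
# print(len(names), names[:5])
```

Listing the members first gives a quick check of how many representative models an archive holds and how they are named, without unpacking a large download.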