

Protein Structure Prediction





The goal of protein structure prediction programs is to predict the secondary, tertiary, or quaternary structure of proteins from their amino acid sequences. Protein structure prediction is important because the structure of a protein often gives clues to its function. Besides being an interesting computational problem, determining a protein's function is important for rational drug design, genetic engineering, modeling cellular pathways, and studying organismal function and evolution. Currently, protein structures are determined experimentally, typically through laborious crystallography experiments. Homology studies, mutagenesis, biochemical analysis, and other modeling studies on the solved structure can then be used to deduce the protein's function. Because the whole process is long and uncertain, computer algorithms capable of shortening the structure determination step would greatly enhance protein studies.



Protein Structure

Proteins are composed of monomers called amino acids. Each amino acid contains an amine group, a carboxyl group, and a variable side chain (R group). There are twenty standard amino acids, i.e. twenty different R groups, and they are joined together by peptide bonds formed through dehydration synthesis. Depending on the polarity of their side chains, amino acids are hydrophobic or hydrophilic to varying degrees.


Peptide bond formation


Proteins have four levels of structure:

  • Primary: the sequence of amino acids
  • Secondary: basic structures, such as alpha helices, beta sheets, and loops
  • Tertiary: the three-dimensional conformation of the protein
  • Quaternary: how several peptide strands interact with each other. For example, hemoglobin has four protein subunits.


The interaction of R groups with water and with other R groups determines a protein's tertiary structure. The distribution of R groups on the surface also influences quaternary structure; for example, a face of a folded subunit that contains many hydrophobic amino acids will most likely face toward the inside of the final protein complex. Most proteins can fold without the help of enzymes fairly quickly, on the scale of milliseconds. Larger proteins, however, tend to fold more slowly due to steric hindrance. The protein tends toward the folded structure with the lowest energy. Protein folding can be represented as movement across an energy landscape, a large multi-dimensional space of conformations. The Gibbs free energy barrier (ΔG*) and the temperature (T) can be used to calculate the jump time (J), where R is the gas constant: $J = e^{-\Delta G^{*}/RT}$


Protein Folding Landscape


Protein folding generally follows several principles that may be implemented by algorithms to predict structure:

  • Rigidity of the protein backbone: may be determined by the size and structure of amino acids
  • Steric complementarity: whether the shape of a section of protein fits with another section. If atoms are brought too close together, there is an energy cost due to overlapping electron clouds.
  • Secondary structure preferences/hydrogen bonds: chemical groups of opposite polarities tend to be attracted to each other.
  • Hydrophobic/polar patterning: sections of protein that are hydrophobic tend to be shielded from water (which usually surrounds the protein).
  • Electrostatics: some amino acids have charged or polar side chains, so proteins typically have regions that are positively or negatively charged.

Protein Databases



Created in 1986 by Amos Bairoch and developed by the Swiss Institute of Bioinformatics, Swiss-Prot is a biological database of protein sequences; it provides information such as the function of the protein, its domain structures, and post-translational modifications. High-quality annotations and integration with other datasets ensure a low level of redundancy.


Example: Structure of Bovine Catalase



TrEMBL (Translated EMBL) is a large protein database in Swiss-Prot format generated by computer translation of nucleotide sequence information from the EMBL (European Molecular Biology Laboratory), a research institute supported by 19 European countries. In contrast to Swiss-Prot, which contains only known proteins, TrEMBL may contain hypothetical proteins that result from faulty computer translations.


Protein Information Resource

The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is a public bioinformatics resource that supports genomic and proteomic research. It provides protein databases and analysis tools to the scientific community, including the Protein Sequence Database (PSD), the first international database. Other tools enable users to classify proteins into families (PIRSF, the Protein Family Classification System), easily find information such as function, structure, and pathways (iProClass, the Integrated Protein Knowledgebase), and retrieve relevant literature (iProLINK, Integrated Protein Literature, Information and Knowledge).


Protein Data Bank

The Protein Data Bank (PDB) provides a variety of tools and resources for protein structural and functional analysis. It is a worldwide repository of information about the 3-D structures of biological molecules, including proteins and nucleic acids. Each structure is assigned a unique four-character alphanumeric code called the PDB ID. A variety of details are available for each structure, such as sequence details, atomic coordinates, crystallization conditions, and links to additional resources.


Example: Image of Myoglobin



SCOP, Structural Classification of Proteins, is a database that serves to provide a detailed and comprehensive description of structural and evolutionary relationships among all proteins whose structure is known. It provides a survey of protein folds, information regarding close relatives of proteins, and a framework for future classification and research.

Viewing Protein Structures

The number of available protein structures is constantly on the rise. Since 2004, the number of solved structures has increased nearly two-fold, from 24,785 to 40,354 today. By comparison, there are 216,380 and 2,897,081 sequence entries available in Swiss-Prot and TrEMBL, respectively. Solved structures can be viewed through various free interactive programs in which the proteins can be moved around and manipulated. The Protein Data Bank provides a 3D coordinate file for each structure. Several other programs are also available for quick structure viewing. Of these, Swiss-PDB Viewer is perhaps the most powerful. Interactive viewers are useful for drug design and fields such as cheminformatics.

Example: Image of hemoglobin using Swiss-PDB Viewer


Comparing Protein Structures

It is often useful to compare protein structures to each other. Structure often implies function; thus, one might be able to deduce the function of a protein by searching for other proteins with similar structures. Protein structure is more conserved than sequence, so two different DNA or amino acid sequences from two different organisms might produce similar structures, which would imply an evolutionary relationship. In addition, comparing protein structures could help identify recurring structural motifs, which would help with further protein structure prediction. Finally, comparing a predicted structure with the experimentally solved structure allows new prediction methods to be assessed.


Proteins are most commonly compared using the root mean square deviation (RMSD). This measure is the root-mean-square distance between corresponding atoms of two proteins after optimal superposition of their backbones.
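For reference, the standard definition of RMSD over N pairs of corresponding atoms is $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{x}_i - \mathbf{y}_i \rVert^{2}}$, where $\mathbf{x}_i$ and $\mathbf{y}_i$ are the positions of atom i in the two structures. The following is a minimal Python sketch of that formula. It assumes the two structures have already been superposed (the superposition step is not shown), and the function name and coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two lists of (x, y, z) positions.

    Assumes both lists hold corresponding atoms in the same order and that
    the structures were already optimally superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of corresponding atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: every atom in b is shifted by 1 unit along y relative to a.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(rmsd(a, b))  # 1.0
```

Because every atom in the toy example is displaced by the same distance, the RMSD equals that distance. In real comparisons the value depends strongly on which atoms are included (for example, backbone only versus all atoms).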

For more information on RMSD, see McLachlan (1995). There are also other measures of comparison, such as contact maps and torsion angle RMSDs.


There are two key databases with protein structure comparisons and classifications:

  • SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/): Structural Classification of Proteins. SCOP is manually constructed and orders proteins according to their evolutionary and structural relationships. The 1.71 release is divided into 7 classes, 971 folds, 1589 superfamilies, 3004 families, and 75,930 domains. The distribution across folds is highly uneven, with five folds containing 20% of all homologous superfamilies. Also, some folds are multifunctional. For more information, see A. Murzin et al (1995).
  • DALI (http://www.ebi.ac.uk/dali/): DALI is an automatically calculated taxonomy of proteins based on their structural, functional, and sequence similarities. It is hosted by the European Molecular Biology Laboratory (EMBL). For more information, see S. Dietmann et al (2001).


Computational Structure Alignment


Structural Prediction

X-ray Crystallography

X-ray crystallography is a technique that uses the diffraction pattern of X-rays passing through the atomic lattice of a crystal to reveal the structure and nature of that lattice. To solve a protein structure, the protein must first be crystallized, since a single molecule in solution has insufficient scattering power. Crystallization is traditionally done through three methods: diffusion gradient, concentration through evaporation, and sublimation. Successful growth is contingent upon finding suitable conditions (e.g. pH, temperature, and the addition of salts and other additives), which makes the process long and difficult. Once crystals are obtained, they are mounted on a diffractometer coupled to an X-ray source. The interaction between the electrons in the crystal and the incoming X-rays produces a diffraction pattern that is recorded on CCD detectors and read into a computer. Images are recorded as the crystal is rotated within the X-ray beam. The position of each diffraction spot is governed by the size and shape of the unit cell. The intensity of each spot yields important information relating to the amplitudes and phases of the diffracted waves. From a constructed electron density map, a starting model of the molecule can be built; this model is then refined to best fit the observed diffraction data.


X-ray diffraction pattern of protein


Possible to predict?

As indicated by the laborious process of X-ray crystallography, the development of bioinformatic techniques that can predict protein structure from amino acid sequence would greatly facilitate protein studies. The task is by no means trivial, since structural prediction involves searching over an enormous number of bond rotations, or degrees of freedom. However, as mentioned earlier, protein folding follows certain physical principles that may assist algorithms in model predictions. When considering the primary amino acid sequence, each residue in the main chain has two degrees of freedom: φ and ψ. Each amino acid side chain can have up to four degrees of freedom: χ1 - χ4. φ represents the rotation about the N-Cα bond, and ψ represents the rotation about the Cα-C bond.
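To make these rotational degrees of freedom concrete, a torsion angle can be computed from the positions of four consecutive atoms: for φ these are C of the previous residue, N, Cα, and C, and for ψ they are N, Cα, C, and N of the next residue. The sketch below is an illustration only; the helper names and the toy coordinates are assumptions, and real coordinates would come from a structure file.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range (-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane through p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy geometry (not real atom positions): the two planes are perpendicular.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(angle)  # 90.0
```

Applying the same function to the side-chain atoms gives the χ angles, which is why a conformation can be described compactly as a list of torsion angles rather than as full Cartesian coordinates.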


Primary structure conformations


Secondary Structure Prediction

The secondary structure of a protein consists of beta sheets, alpha helices, and loops, all of which are formed by hydrogen bonds between amino acid chains.


Chou-Fasman (1974)

The Chou-Fasman algorithm (1974) was one of the first attempts to predict protein secondary structure. It worked by assigning conformational parameters (Pα, Pβ) to all residues based on 15 proteins (2473 amino acids) of known conformation. A stretch of amino acids was determined to be part of a helix if the score was greater than or equal to 4, and it was determined to be part of a beta strand if the score was greater than or equal to 3. This method is fairly intuitive, but it only had an accuracy rate of 50-60%.


GOR (1978)

In 1978, Garnier, Osguthorpe, and Robson developed another method of predicting secondary structures. The basic idea is that the probability of each residue being in a particular structure (alpha helix, beta strand, or loop) depends on (1) the amino acid type of that residue, and (2) the amino acid types of the neighboring residues. A set of GOR scoring tables was developed; these are matrices of the frequencies of each structure based on the central amino acid and its neighbors. This method has an accuracy rate of 57% to 63%.


Eisenberg & McLachlan (1986)

Eisenberg and McLachlan's 1986 Nature paper described a new method for predicting protein secondary structures based on calculations of hydrophobicity. The stability of a protein structure in water is calculated from its atomic coordinates, the accessibility of each atom to the solvent, and each atom's solvation parameter. One can then plot hydrophobicity as a function of sequence position and look for periodic repeats that imply secondary structure. A period of 3-4 amino acids corresponds to an alpha helix, whereas a period of 2 amino acids corresponds to a beta sheet.
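One simple way to look for such periodicity is a Fourier-like sum of per-residue hydrophobicity values at a chosen rotation angle per residue, a quantity often called a hydrophobic moment. The sketch below illustrates only the periodicity idea; it is not the solvation-energy calculation from the paper. The function name and the toy hydrophobicity values are assumptions, with 100 degrees per residue standing in for a helix (about 3.6 residues per turn) and 180 degrees for a period of 2.

```python
import cmath
import math

def hydrophobic_moment(values, angle_deg):
    """Normalized magnitude of sum_n H_n * exp(i * n * angle).

    A large value means the hydrophobicity pattern repeats with the period
    implied by angle_deg (period = 360 / angle_deg residues).
    """
    delta = math.radians(angle_deg)
    total = sum(h * cmath.exp(1j * n * delta) for n, h in enumerate(values))
    return abs(total) / len(values)

# Toy profiles (made-up values, not a real hydrophobicity scale):
strand_like = [1.0, -1.0] * 6                                      # period 2
helix_like = [math.cos(math.radians(100 * n)) for n in range(18)]  # period 3.6

for name, profile in (("strand-like", strand_like), ("helix-like", helix_like)):
    print(name,
          round(hydrophobic_moment(profile, 100), 3),
          round(hydrophobic_moment(profile, 180), 3))
```

The strand-like profile gives its largest value at 180 degrees and the helix-like profile at 100 degrees, which is the signal one would look for when scanning a window along a real sequence.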


Cohen (1989)

Fred Cohen of UCSF published a paper on a new way to think about secondary structure prediction in 1989. This approach was based on the assumption that turns dictate the secondary structural elements of a protein. Therefore, his approach was to predict the turns in a protein first, then assign alpha helix or beta strand structures based on the identified turns.


JPRED (1998)

JPRED (http://www.compbio.dundee.ac.uk/~www-jpred/) is a composite secondary structure prediction server. It combines six different prediction algorithms: NNSSP, DSC, PREDATOR, MULPRED, ZPRED and PHD. It standardizes the input into one of two types: a family of aligned proteins or a single protein sequence. The output is based on a simple majority vote of the six methods. This method achieves an accuracy rate of 72.9%, the best of the methods described so far, although this is an improvement of only 1% over PHD.
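The majority-vote idea can be sketched in a few lines. This is an illustration of per-residue consensus only, not JPRED's actual code: the function name, the three made-up prediction strings, and the choice to mark positions without a strict majority with "?" are all assumptions. H, E, and - stand for helix, strand, and coil.

```python
from collections import Counter

def consensus(predictions):
    """Per-residue strict-majority vote over equal-length prediction strings."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        # Assumption: without a strict majority the position is left undecided.
        result.append(state if votes > len(predictions) / 2 else "?")
    return "".join(result)

# Three hypothetical predictors disagreeing at positions 3 and 6.
predictions = ["HHHEE-", "HHEEE-", "HH-EEH"]
print(consensus(predictions))  # HH?EE-
```

A real server would also need a rule for ties and undecided positions, for example deferring to the single most accurate method.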


3D Structure Prediction


3D protein structure prediction was brought to the forefront by CASP, the Critical Assessment of Techniques for Protein Structure Prediction, a community-wide experiment and contest that has been taking place every two years since 1994 at Asilomar, California. Different research groups are given a chance to test the quality of their structural prediction tools. Proteins whose structures were recently solved but not yet released are used as prediction inputs. The quality of the predictions is evaluated by automatic programs and experts. A wide range of prediction categories, varying in input and output, is available:

  • Tertiary structure prediction
  • Secondary structure prediction
  • Prediction of structure complexes
  • Residue-residue contact prediction
  • Disordered regions prediction
  • Domain boundary prediction
  • Function prediction
  • Model quality assessment
  • Model refinement

We will be discussing three different modes of tertiary structure prediction: homology modeling, fold recognition and ab initio prediction.

CASP Category I- Homology Modeling

Homology modeling, also known as comparative modeling, models unknown protein structures after known structures of homologous sequences. This method generally assumes that protein tertiary structure is better conserved than primary amino acid sequence and that a modest level of sequence identity implies significant structural similarity. We may consider this from an evolutionary standpoint: proteins must have mechanisms for tolerating sequence mutations, since proper function depends on proper folding. Homology modeling follows a few basic steps.


1) Identify homologous proteins with known structures (usually sequence identities >25-30% are used)

2) Align sequences

3) Identify structurally conserved and variable regions

4) Generate coordinates for conserved regions

5) Generate conformations for variable regions

6) Refine and evaluate unknown structure
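As a small illustration of the identity threshold in step 1, the sketch below computes percent identity for two sequences that have already been aligned (step 2). The function name, the example sequences, and the convention of ignoring columns that contain a gap are assumptions made for illustration; other definitions of percent identity use a different denominator.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical aligned fragments; "-" marks a gap.
target = "MKT-AYIAK"
template = "MKTWAYLAR"
identity = percent_identity(target, template)
print(identity)         # 75.0
print(identity > 30.0)  # True: above the usual 25-30% cutoff
```

In practice the candidate templates would come from a database search, and the identity would be computed over the full-length alignment rather than a short fragment.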


Many computerized search programs such as BLAST can be used to identify protein homologues (having several homologues facilitates finding conserved regions). Sequence identities above 70% can generate high-resolution models (<3 Å). Homologous sequences with known structures are aligned to identify conserved and non-conserved regions. Conserved regions generally correspond to secondary structure elements and important binding sites. Variable regions correspond to surface loops and chain turns. The unknown sequence is then aligned to the known sequences so that coordinates can be transferred from the known structures directly to the unknown model. For less conserved regions, coordinates may be assigned either directly or via fragment libraries that may provide suitable models. Criteria for evaluation of quality and accuracy can include:

  • Acceptable main chain conformations
  • Planar peptide bonds
  • Proper orientation of hydrophobic/hydrophilic regions
  • Lack of holes


Homology Modeling


The quality of predicted models largely depends on the sequence identity between the target and the template. Above 50% sequence identity, results tend to have only minor errors in side chain conformations. Below 30% sequence identity, serious errors in fold prediction are likely. At high sequence identity, most errors result from poor template choice. At low sequence identity, the errors derive from poor sequence alignment. These two problems may be addressed by using structural alignment (aligning multiple sequences by comparing their solved structures) of proteins with known structures to refine modeling techniques. Conducting multiple alignment searches and using several templates may also increase the probability of finding an accurate model. Using sensitive alignment programs such as PSI-BLAST also facilitates finding the best alignment. A good alignment may still produce a wrong local structure. An insertion or deletion mutation can produce a target structure not present in the template structure. Since missing regions are found most often in loops, non-conserved regions tend to have a higher likelihood of error. Side chains additionally have different rotameric states (the stereochemistry of bond rotations) that make modeling difficult, since an optimal rotameric conformation may not correspond to the lowest energy of the protein.


Despite the difficulties associated with homology modeling, this method of structural comparison has proven useful in identifying conserved ligand binding sites, identifying important protein-protein interactions, and deriving a gene's function. Poor loop and side chain modeling would not greatly affect these uses, since non-conserved regions tend to play a small role in binding interactions. Homology models have identified cation binding sites on the ATPase and coding regions in the yeast genome.


Homology model of Class I-type Lysyl tRNA synthetase of Treponema pallidum (A) and experimental structure of Class I-type Lysyl tRNA synthetase of Pyrococcus horikoshii (B)


Web-Based Homology Modeling



MODELLER finds and aligns homologous proteins with known structures. The algorithm collects distance distributions between atoms in known protein structures and uses these distributions to compute positions for equivalent atoms. It refines models using energetics.



SWISS-MODEL is a fully automated protein structure homology-modeling server. It was initiated in 1993 by Manuel Peitsch and is currently being revised within the Swiss Institute of Bioinformatics.



WHAT IF includes three components, one to generate models, one to evaluate the quality of models, and one to evaluate the existing structures used in the predictions.


CASP Category II- Fold Recognition

The Protein Data Bank is continuously updated with new structures, yet relatively few of these structures are distinct from existing proteins, suggesting that the millions of protein sequences may actually encode only a small number of fold topologies. This observation opened the possibility of making structural predictions using complete libraries of folds. There are two types of fold recognition: 3D-1D and threading.


3D-1D transforms the three-dimensional structural information of solved structures into one-dimensional strings of symbols or profiles that can be used to align with strings derived from query sequences. The alignment is scored according to how easily the sequence can form favorable interactions that are known to stabilize certain structural components present in the template i.e. how easily the query can adopt a specific fold. These structural components or folds are made of secondary structures and can be characterized by solvent accessibility and the location of polar atoms.


3D-1D methods are limited in that important contextual information is lost when folds are simply classified without consideration of structural relationships. Threading methods pass the query sequence through a library of 3D folds and score pairwise residue interactions at each match/alignment. While theoretically capable of detecting more distant homologues than 3D-1D, neither method is hugely successful. As in homology modeling, alignment poses great difficulty due to the large number of possible sub-alignments at each match/alignment.


Alignment between SfiI and structurally characterized REases. A) Fold-recognition alignment between full-length sequences of SfiI and BglI. Amino acids are colored according to the physico-chemical properties of their side-chains (negatively charged: red, positively charged: blue, polar: magenta, hydrophobic: green). Pairs of residues conserved between SfiI and BglI are highlighted. Putative catalytic residues are indicated by "#", putative DNA-binding residues are indicated by "*". Secondary structure elements of BglI are shown below the alignment. Numbers of amino acid residues at the N-terminus of each panel are shown. B) Structure-based sequence alignment of the conserved core, corresponding to the PD-(D/E)XK motif, including SfiI, BglI, and other selected REases from the β-class. Conserved residues of the active site are highlighted.


Fold recognition results were generally disappointing. Aligning sequence to fold proved to be a big problem. On a scale of 1 to 4, where 1 is a poor alignment and 4 is a good alignment, the average performance over various targets was below expectations:

                            Homolog   Analog   New Fold
Best score from any group   3.7       2.6      1.7
Average for best group      2.5       0.8      0.9


CASP Category III- Ab initio Prediction

Ab initio modeling methods rely only on primary amino acid sequence information to deduce structures. Predictions are based on the thermodynamic assumption that native structures adopt conformations that minimize free energy. There are two components of ab initio prediction: exploring the conformational space (all possible conformations) for native-like structures and creating a scoring function that will distinguish between correct and incorrect structures. This is no easy task, given that scoring functions must consider interactions between all pairs of atoms. Simplifications such as restrictive assumptions about torsion angles can reduce the total number of possible conformations. Conformational space can be sampled using a variety of techniques:


1) Molecular dynamics calculates protein conformational changes to find native-like structures. These calculations use known atomic forces (derived from potential curves) to find acceleration functions that subsequently yield information regarding changes in position and velocity. The information produces a simulation of protein molecular movement.

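  • $x(t+\Delta t) = x(t) + v(t)\,\Delta t + \frac{1}{6}\left[4a(t) - a(t-\Delta t)\right]\Delta t^{2}$
  • $v(t+\Delta t) = v(t) + \frac{1}{6}\left[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)\right]\Delta t$
  • $U_{\text{kinetic}} = \frac{1}{2}\sum_i m_i v_i(t)^{2} = \frac{1}{2} n k_B T$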
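The position and velocity updates listed above have the form of what is commonly called Beeman's algorithm. The sketch below applies them to a one-dimensional harmonic oscillator rather than a protein, purely to show how the updates chain together. The function names, the unit mass and spring constant, and the choice to reuse a(0) for the missing a(-Δt) at the first step are assumptions made for illustration.

```python
import math

def beeman_step(x, v, a_now, a_prev, dt, accel):
    """Advance position and velocity by one time step dt."""
    x_new = x + v * dt + (4.0 * a_now - a_prev) * dt * dt / 6.0
    a_new = accel(x_new)  # acceleration from the force at the new position
    v_new = v + (2.0 * a_new + 5.0 * a_now - a_prev) * dt / 6.0
    return x_new, v_new, a_new

def simulate(x0, v0, dt, steps, accel):
    """Return the list of (x, v) states, starting from (x0, v0)."""
    x, v = x0, v0
    a_now = accel(x)
    a_prev = a_now  # assumption: no history at t = 0, so a(-dt) is set to a(0)
    states = [(x, v)]
    for _ in range(steps):
        x, v, a_new = beeman_step(x, v, a_now, a_prev, dt, accel)
        a_prev, a_now = a_now, a_new
        states.append((x, v))
    return states

def harmonic_accel(x):
    """Toy force field: unit mass on a unit spring, so a = -x."""
    return -x

# Roughly one oscillation period (2*pi is about 6.28 time units).
trajectory = simulate(1.0, 0.0, 0.01, 628, harmonic_accel)
x_end, v_end = trajectory[-1]
energy_end = 0.5 * v_end ** 2 + 0.5 * x_end ** 2
print(round(x_end, 3), round(energy_end, 3))
```

For this toy system the total energy should stay close to its starting value of 0.5, which is a common sanity check on an integrator. In a real simulation the acceleration function would be derived from the full potential over all atoms, and the kinetic energy relation above would be used to monitor or set the temperature.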

2) Continuous energy minimization selects conformations that minimize free energy (not just potential energy). This is not a trivial task given that the energy depends on thousands of atomic coordinates.



3) Monte Carlo simulations are generally used to find solutions to problems that have an overwhelming number of variables and degrees of freedom. The Monte Carlo method is a nondeterministic algorithm that models stochastic protein movements in torsion-angle or Cartesian space and evaluates the energy of the structure after each move. A new conformation is accepted if it has a high probability of occurring.


Under a simulation that is infinitely long, the conformation follows a Boltzmann distribution:

$\Pi(c) \propto e^{-E(c)/kT}$
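A common acceptance rule that leads to this Boltzmann distribution is the Metropolis criterion: a move that lowers the energy is always accepted, and a move that raises it by ΔE is accepted with probability $e^{-\Delta E/kT}$. Whether the original text used exactly this rule is an assumption. The sketch below applies it to a single torsion angle with a made-up energy function; the function names, step size, and seed are hypothetical choices for illustration.

```python
import math
import random

def acceptance_probability(delta_e, kT):
    """Metropolis rule: downhill moves always pass, uphill moves sometimes."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / kT)

def toy_energy(angle):
    """Hypothetical one-angle energy with a single minimum at angle = 0."""
    return 1.0 - math.cos(angle)

def monte_carlo(steps, kT, step_size=0.5, seed=0):
    """Return the energy after each step of a Metropolis random walk."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    angle = math.pi            # start at the top of the energy barrier
    energy = toy_energy(angle)
    energies = []
    for _ in range(steps):
        trial = angle + rng.uniform(-step_size, step_size)  # random move
        trial_energy = toy_energy(trial)
        if rng.random() < acceptance_probability(trial_energy - energy, kT):
            angle, energy = trial, trial_energy  # accept the move
        energies.append(energy)
    return energies

energies = monte_carlo(steps=5000, kT=0.1)
late = energies[len(energies) // 2:]
print(sum(late) / len(late))  # average energy after equilibration
```

At low kT the walk settles near the energy minimum; raising kT lets it accept more uphill moves and explore more of the landscape, which is the trade-off that simulated annealing schedules exploit.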


Finally, there needs to be a scoring function to select only native or native-like structures. Functions are usually physics-based and incorporate factors such as electrostatics, van der Waals interactions, solvation, and bond/angle terms. There may also be functions that are knowledge-based; these derive information about atomic properties from experimentally determined conformations (parameters are usually pairwise atomic distances and the tendency for amino acids to be buried or exposed).


Rosetta (http://boinc.bakerlab.org/rosetta/) is a protein structure prediction method pioneered by David Baker at the University of Washington. The idea is to break a protein sequence into short fragments of 7-9 amino acids. Through parallel computation, the structure with the lowest local energy is found for each fragment from a library of known structures. Then, a simulated annealing algorithm is run to find the global optimum; it considers hydrophobic burial, polar side chain interactions, hydrogen bonding between beta strands, and van der Waals forces. A thousand structures are created using this method. These structures are then clustered, and a representative is chosen from each cluster.


CASP Results


More generally, results from the early CASP rounds were highly disappointing. Before CASP, scientists had reported many protein structures solved using computational methods, but the results of the first few CASPs raised suspicions that these reports were actually biased. However, the CASP results did improve in the later rounds.

Round         Year          Results                                   RMSD        # Residues
Before CASP   before 1994   Claimed solved
CASP1         1994          Worse than random
CASP2         1996          Worse than random with one exception
CASP3         1998          Consistently predicted correct topology   6.0 Å       60+
CASP4         2000          Consistently predicted correct topology   4.0-6.0 Å   60-80+

In fact, at CASP2, A. Murzin had great success against automated methods by simply trying to solve the structures manually. He was so successful that he was disqualified from future CASPs.


Prediction Methods at CASP

A variety of ab initio prediction methods were tried at CASP. One method was to predict inter-residue contacts and then compute the structure, which met with mild success. Another approach was to use simplified energy terms and reduced search spaces, which met with moderate success. Other methods included:

  • Assembly of fragments with simulated annealing (Simons et al)
  • Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)
  • Constraint-based Monte Carlo optimisation (Skolnick et al)
  • Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimization (Lomize et al)
  • Minimization of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)
  • Neural networks (Jones, Rost)
  • Manual examination (Murzin)
  • MetaServers, which combine different methods to get consensus structures


The Future of Structural Genomics

Despite the disappointing results of the first few CASPs, computational protein structure prediction should improve as the number of solved structures and novel folds increases. In structural genomics, there is a worldwide initiative to determine protein structures in a high-throughput, collaborative fashion. There is a particular emphasis on solving structures that have no known homologs. There are several schemes that use distributed computing to predict structures:



References

  • Baker D. et al. (2001). "Protein structure prediction and structural genomics." Science 294: 93-96.
  • Bonneau R. and D. Baker (2001). "Ab initio protein structure prediction: progress and prospects." Annu. Rev. Biophys. Biomol. Struct. 30: 173-189.
  • Bowie J.U., Lüthy R. and D. Eisenberg (1991). "A method to identify protein sequences that fold into a known three-dimensional structure." Science 253: 164-170.
  • Chou P.Y. and G.D. Fasman (1974). "Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins." Biochem. 13: 211-222.
  • Cohen F.E., Gregoret L., Presnell S.R. and I.D. Kuntz. "Protein structure predictions: new theoretical approaches." Prog. Clin. Biol. Res. 289: 75-85.
  • Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M. and G.J. Barton (1998). "Jpred: A Consensus Secondary Structure Prediction Server." Bioinformatics 14: 892-893.
  • Eisenberg D. and A.D. McLachlan (1986). "Solvation energy in protein folding and binding." Nature 319: 199-203.
  • Garnier J., Osguthorpe D.J. and B. Robson (1978). "Analysis and implications of simple methods for predicting the secondary structure of globular proteins." J. Mol. Biol. 120: 97-120.
  • Godzik A. et al. (2003). "Fold recognition methods." Methods Biochem. Anal. 44: 525-546.
  • Greer J. (1991). "Comparative modeling of homologous proteins." Meth. Enzymol. 202: 239-252.
  • Murzin A.G., Brenner S.E., Hubbard T. and C. Chothia (1995). "SCOP: a structural classification of proteins database for the investigation of sequences and structures." J. Mol. Biol. 247: 536-540.
  • Sali A., Overington J.P., Johnson M.S. and T.L. Blundell (1990). "From comparisons of protein sequences and structures to protein modelling and design." Trends Biochem. Sci. 15: 235-240.


