Protein Folding: The HP Lattice Model Explained

Q: What is the HP model of protein folding?

The HP (hydrophobic-polar) model, proposed by Ken Dill in 1985, is a simplified lattice model of protein folding. It represents each amino acid as either hydrophobic (H) or polar (P), places the chain on a 2D square or 3D cubic lattice, and defines the energy of a conformation as -1 for each non-bonded H-H contact. Despite its simplicity, it captures the dominant physical force in protein folding, the hydrophobic effect, and serves as a foundational model in computational biology.

Q: What is Anfinsen's dogma and why is it important?

Anfinsen's dogma (the thermodynamic hypothesis) states that the native three-dimensional structure of a protein is the thermodynamically stable state determined entirely by its amino acid sequence under physiological conditions. Christian Anfinsen demonstrated this by denaturing ribonuclease A and showing it refolded spontaneously into its active form when denaturant was removed. This principle implies that the sequence alone encodes the structure — the foundational assumption behind all computational protein structure prediction methods.

Q: Why is protein folding NP-hard?

Finding the minimum-energy (ground-state) conformation of an HP sequence was proven NP-hard in 1998: by Crescenzi and colleagues for the 2D square lattice, and by Berger and Leighton for the 3D cubic lattice. NP-hardness means that no polynomial-time algorithm is known, or expected, to solve every instance. Separately, the search space itself grows exponentially with sequence length: the number of self-avoiding walks on a 2D square lattice scales approximately as 2.638^n, which is on the order of 10^21 for length 50. Exact optimisation therefore becomes computationally infeasible for long sequences, and heuristic methods such as Monte Carlo sampling, genetic algorithms, or reinforcement learning are used instead.

Q: What is the energy landscape and folding funnel concept?

The energy landscape is a high-dimensional surface where each point represents a protein conformation and the height represents its free energy. For proteins shaped by evolution, this landscape is funnel-shaped: as the chain becomes more compact and native-like, free energy generally decreases, guiding folding toward the native state. This funnel resolves Levinthal's paradox — folding is not a random search through all conformations but a biased descent down the funnel. Local bumps in the funnel correspond to metastable folding intermediates.

Q: How does the HP model relate to modern protein structure prediction like AlphaFold?

The HP model and AlphaFold2 sit at opposite ends of the abstraction spectrum. The HP model captures a single physical principle, hydrophobic burial, with a binary residue alphabet on a lattice. AlphaFold2 uses deep learning with geometry-aware (equivariant) network components trained on the experimentally determined structures in the Protein Data Bank (well over 100,000 entries), leveraging coevolutionary information from multiple sequence alignments. Despite this difference, the physical intuition is shared: both encode hydrophobic core burial as a dominant driving force, the HP model explicitly and AlphaFold2 implicitly. The HP model remains important as a teaching tool, a benchmark for search algorithms, and a formal proof-of-concept that even minimal folding models are computationally hard.

🧬 Molecular Biology · Computational Biology

📅 June 2026 · ⏱ 10 min · 🟡 Intermediate · Last updated: 28 June 2026

A chain of amino acids emerging from a ribosome spontaneously collapses into a precise three-dimensional shape within milliseconds — and that shape determines everything about the protein's function. The HP lattice model distils this staggeringly complex process into a single driving force: the tendency of hydrophobic residues to escape water. Despite its simplicity, the model encodes an NP-hard combinatorial problem and has guided decades of research into folding algorithms and drug design.

Written by MySimulator Team · Reviewed by MySimulator Editorial Review

1. The Protein Folding Problem

Proteins are the molecular machines of life. Every enzyme that catalyses a chemical reaction, every receptor that receives a hormonal signal, every structural fibre that holds a cell together is a protein. All proteins are built from the same fundamental components — linear chains of amino acids encoded in DNA — yet each folds into a unique, reproducible three-dimensional architecture.

The central puzzle is this: given only the sequence of amino acids (the primary structure), can we predict the final folded shape (the tertiary structure)? This is the protein folding problem, and it occupied biochemists and computational biologists for more than half a century before AlphaFold2 achieved near-experimental accuracy at the CASP14 assessment in 2020 (published in 2021).

To understand why prediction is so difficult — and why simplified models remain essential teaching and research tools — we must first understand what drives folding.

Why Folding Matters Clinically

Misfolded proteins are not merely non-functional; they are often actively toxic. Alzheimer's disease involves the misfolding and aggregation of amyloid-beta peptides into plaques. Parkinson's disease is associated with alpha-synuclein aggregates. Prion diseases (CJD, BSE) propagate by inducing normal PrP proteins to adopt the misfolded prion conformation. Understanding the rules of folding is therefore not purely academic — it is central to drug development and the treatment of neurodegenerative disease.

2. Anfinsen's Dogma

In the early 1960s, Christian Anfinsen at the NIH performed a landmark series of experiments with ribonuclease A (RNase A), a small enzyme of 124 amino acids. He denatured the protein completely using urea (which disrupts non-covalent interactions) and broke its four disulfide bonds using a reducing agent. The unfolded, inactive protein was then allowed to re-oxidise in urea-free buffer — and it spontaneously refolded into its native, fully active conformation.

This experiment established what became known as Anfinsen's dogma (or the thermodynamic hypothesis): the native structure of a protein is the thermodynamically stable state determined solely by its amino acid sequence, under physiological conditions. No additional template or instruction is required. The information is entirely within the sequence.

Levinthal's Paradox: If a protein of 100 amino acids randomly sampled even a limited set of conformations for each residue — say three rotational states — the total conformational space would be 3¹⁰⁰ ≈ 5 × 10⁴⁷ structures. Sampling them at 10¹³ per second would take longer than the age of the universe. Yet proteins fold reproducibly in microseconds to milliseconds. This paradox, articulated by Cyrus Levinthal in 1969, shows that folding cannot be a random search — the sequence must bias the search pathway toward the native state.
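The arithmetic behind Levinthal's estimate is easy to reproduce. The short sketch below uses the figures quoted above (three states per residue, 100 residues, 10¹³ samples per second); the age of the universe in seconds is an assumed round value, and the function name is purely illustrative.

```python
# Back-of-envelope check of Levinthal's estimate (figures taken from the text).
STATES_PER_RESIDUE = 3
N_RESIDUES = 100
SAMPLES_PER_SECOND = 1e13
AGE_OF_UNIVERSE_S = 4.35e17  # assumption: about 13.8 billion years, in seconds


def levinthal_search_time(n_residues=N_RESIDUES,
                          states=STATES_PER_RESIDUE,
                          rate=SAMPLES_PER_SECOND):
    """Return (number of conformations, seconds needed to visit each one once)."""
    conformations = states ** n_residues
    return conformations, conformations / rate


conformations, seconds = levinthal_search_time()
print(f"{conformations:.2e} conformations")  # 5.15e+47 conformations
print(f"{seconds:.2e} seconds of exhaustive sampling")
print(f"{seconds / AGE_OF_UNIVERSE_S:.1e} times the age of the universe")
```

Even with far more generous sampling rates, the ratio stays astronomically large, which is the point of the paradox.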

3. The HP Model: Hydrophobic–Polar Abstraction

Proposed by Ken Dill in 1985, the HP model is the simplest lattice model of protein folding that captures the dominant physical force: the hydrophobic effect. Rather than representing all 20 amino acids, the model reduces each residue to one of two types:

H (Hydrophobic): Residues that are non-polar and poorly solvated by water, such as leucine, valine, isoleucine, phenylalanine, and methionine.
P (Polar): Residues that are comfortable in aqueous environments, such as serine, threonine, and asparagine, together with charged residues such as lysine and arginine.

The protein chain is placed on a two-dimensional square lattice (or a three-dimensional cubic lattice). Each amino acid occupies one lattice site. The chain must be self-avoiding, meaning no two residues may occupy the same site, and consecutive residues must sit on adjacent lattice sites so that the chain forms a connected path.

The Energy Function

The energy of a conformation is determined by a single rule: two H residues that are adjacent on the lattice but not adjacent in sequence (called a topological contact) contribute an energy of −1. All other contacts (H–P, P–P, or P–H) contribute zero. The goal is to find the conformation with the minimum total energy — the maximum number of H–H contacts.

E = −1 × (number of non-bonded H–H contacts)

Example sequence (length 8):

    Sequence:  H P H H P H P H
    Residues:  1 2 3 4 5 6 7 8

A folded conformation might place residues 1, 3, 4, 6, 8 (all H) in a compact hydrophobic core.

    If residues 1 and 6 are lattice-adjacent (non-bonded): −1
    If residues 3 and 6 are lattice-adjacent (non-bonded): −1
    Total energy: −2 (better than the extended chain at E = 0)

Minimum energy = most negative value achievable for a given sequence.
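The energy rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the U/D/L/R move-string encoding, the function names `decode` and `hp_energy`, and the example fold `URRDLDL` are all assumptions chosen for this sketch.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def decode(moves):
    """Turn a move string into 2D lattice coordinates, starting at the origin."""
    coords = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def hp_energy(sequence, coords):
    """E = -(number of H-H pairs that touch on the lattice but are not bonded)."""
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every adjacent pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


sequence = "HPHHPHPH"
print(hp_energy(sequence, decode("RRRRRRR")))  # extended chain: 0
print(hp_energy(sequence, decode("URRDLDL")))  # compact fold: -3
```

In the compact fold, residues 1 and 6, 3 and 6, and 1 and 8 (numbered from 1) are lattice neighbours without being bonded, which gives three contacts.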

Despite its simplicity, this energy function produces rich behaviour. Sequences with many H residues clustered together in sequence tend to fold into compact globular structures — mirroring the behaviour of real hydrophobic-core proteins. Sequences with alternating H and P residues tend to remain more extended.

Why the Hydrophobic Effect Dominates

Water molecules form hydrogen bonds with each other. When a non-polar group is inserted into water, it cannot participate in hydrogen bonding, forcing surrounding water molecules to reorder into a more rigid, lower-entropy cage structure. Removing non-polar groups from contact with water — by burying them in a hydrophobic core — releases these water molecules and increases the entropy of the solvent. This entropic gain is the thermodynamic driving force behind the hydrophobic effect. Studies suggest that 60–70% of the stabilisation energy in typical globular proteins comes from hydrophobic core packing, making it the dominant term in the real energy function as well as the HP model.

4. Energy Landscapes and Funnels

Modern protein folding theory, developed largely by Wolynes, Bryngelson, Onuchic, and colleagues from the late 1980s through the 1990s, describes folding using the concept of an energy landscape: a high-dimensional surface where each axis corresponds to a degree of conformational freedom and the vertical axis represents free energy.

For a random heteropolymer, the energy landscape is rugged: many deep local minima (kinetic traps) separated by high barriers. The protein would spend most of its time stuck in these traps and never reach the native state within biological timescales.

Real proteins, shaped by billions of years of evolution, have a funnel-shaped energy landscape. The funnel is biased: as the chain becomes more compact and native-like, the free energy generally decreases. There may still be local minima — partially folded intermediates — but there is an overall thermodynamic gradient toward the native structure. This funnel principle resolves Levinthal's paradox: folding is not a random search but a guided descent down the funnel.

Frustration and Misfolding: The funnel is not perfectly smooth. Competing interactions cause local frustration — regions of a sequence where satisfying one contact necessarily prevents satisfying another. Highly frustrated sequences tend to fold slowly or misfold. Intrinsically disordered proteins (IDPs) occupy an extreme: their landscapes have such shallow funnels that they remain flexible rather than adopting a unique fold, a property that is itself functionally important in signalling and regulation.

In the HP model, the energy landscape can be visualised directly. For short sequences (length 10–20), all self-avoiding walks can be enumerated. For longer sequences, the landscape is explored by simulation. Sequences with many H–H contacts at the global minimum, and few distinct structures at that same energy, are said to have good foldability, a concept that maps onto real protein design principles.
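For a short sequence, exhaustive enumeration fits in a small depth-first search. The sketch below returns the ground-state energy together with the number of conformations that reach it (the degeneracy), which is one simple proxy for the foldability idea described above. The function name `ground_state` and the choice to fix the first bond along +x, so that rotated copies of a fold are not recounted, are assumptions of this sketch.

```python
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def ground_state(sequence):
    """Enumerate every 2D self-avoiding walk for `sequence` (length >= 2).

    Returns (minimum energy, number of walks at that energy, one example walk).
    """
    n = len(sequence)
    best = {"energy": 0, "count": 0, "example": None}

    def contacts_gained(i, site, occupied):
        """H-H contacts created by placing residue i at `site`."""
        if sequence[i] != "H":
            return 0
        gained = 0
        for dx, dy in STEPS:
            j = occupied.get((site[0] + dx, site[1] + dy))
            if j is not None and j < i - 1 and sequence[j] == "H":
                gained += 1
        return gained

    def extend(path, occupied, energy):
        i = len(path)
        if i == n:  # complete conformation: update the running minimum
            if best["example"] is None or energy < best["energy"]:
                best.update(energy=energy, count=1, example=list(path))
            elif energy == best["energy"]:
                best["count"] += 1
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site in occupied:
                continue  # self-avoidance
            gained = contacts_gained(i, site, occupied)
            occupied[site] = i
            path.append(site)
            extend(path, occupied, energy - gained)
            path.pop()
            del occupied[site]

    extend([(0, 0), (1, 0)], {(0, 0): 0, (1, 0): 1}, 0)
    return best["energy"], best["count"], best["example"]


energy, degeneracy, example = ground_state("HPHHPHPH")
print(energy)  # -3
print(degeneracy, example)
```

Mirror images are still counted separately here, so the reported degeneracy is at least 2 for any fold that is not a straight line.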

5. Computational Complexity

One of the most important theoretical results in computational biology is that finding the ground-state (minimum energy) conformation of an HP sequence on a 2D or 3D lattice is NP-hard. This was proved rigorously in 1998 by Crescenzi, Goldman, Papadimitriou, Piccolboni, and Yannakakis for the 2D square lattice, and by Berger and Leighton for the 3D cubic lattice. Hart and Istrail established related hardness and approximation results for variants of the model.

NP-hardness means that no polynomial-time algorithm is known (or expected) to solve all instances of the HP folding problem. Brute force does not help either: as the sequence length grows, the number of self-avoiding walks grows exponentially. On a 2D square lattice, the number of self-avoiding walks of n steps grows approximately as 2.638ⁿ. For n = 50 that is on the order of 10²¹, far beyond exhaustive enumeration.

Self-avoiding walks on the 2D square lattice, estimated as 2.638^n for n steps:

    n = 10:  ~1.6 × 10^4 conformations
    n = 20:  ~2.7 × 10^8 conformations
    n = 30:  ~4.3 × 10^12 conformations
    n = 50:  ~1.2 × 10^21 conformations

Growth rate: ~2.638^n (Domb, 1960). Exact counts are somewhat larger than this estimate (44,100 for n = 10) because of a slowly growing sub-exponential prefactor.

HP sequences of length n = 50 cannot be solved by exhaustive enumeration in reasonable time → heuristic algorithms required.
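The growth of the walk count can be checked directly for small n by brute-force counting. The sketch below counts self-avoiding walks by depth-first search and prints the count next to the 2.638^n estimate; the function name `count_saws` is illustrative, and the approach is only practical for small n precisely because of the exponential growth being measured.

```python
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))
GROWTH_CONSTANT = 2.638  # approximate growth rate quoted in the text


def count_saws(n_steps):
    """Count self-avoiding walks of `n_steps` steps on the 2D square lattice."""
    if n_steps == 0:
        return 1
    visited = {(0, 0), (1, 0)}

    def extend(x, y, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site not in visited:
                visited.add(site)
                total += extend(site[0], site[1], remaining - 1)
                visited.remove(site)
        return total

    # The first step is fixed along +x; the four directions are equivalent.
    return 4 * extend(1, 0, n_steps - 1)


for n in (2, 4, 6, 8, 10):
    print(n, count_saws(n), round(GROWTH_CONSTANT ** n))
```

The exact count stays above the bare 2.638^n figure at every n shown, while the ratio between successive counts drifts toward the growth constant.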

This complexity result is significant beyond the HP model itself. It does not formally bound the difficulty of real protein structure prediction, but it shows that even a radically simplified model (two residue types, a square lattice, a single energy term) is computationally intractable in the worst case. The success of AlphaFold2 does not contradict this: it predicts structures from evolutionary and co-evolutionary signals learned from data rather than by solving the worst-case optimisation problem, so it sidesteps exhaustive search.

6. Search Algorithms for HP Folding

Because exact optimisation is intractable, a rich ecosystem of heuristic algorithms has been applied to the HP model. These algorithms also serve as testbeds for methods used in real protein structure prediction.

Monte Carlo Methods

Monte Carlo sampling is the most widely used approach for exploring the HP model. A starting conformation is iteratively modified by small moves: pivot moves (rotating a segment of the chain around a lattice point), end-point moves, or crankshaft moves (local rearrangements that preserve chain connectivity). Each new conformation is accepted or rejected using the Metropolis criterion: always accept if the energy decreases; accept with probability exp(−ΔE / k_BT) if the energy increases. Accepting some uphill moves allows the chain to escape from local minima. Simulated annealing, which gradually decreases the effective temperature, is commonly applied to drive the system toward its ground state.
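A compact sketch of Metropolis sampling with simulated annealing is shown below. It uses only pivot moves (rotations and mirror reflections of the chain tail about a randomly chosen residue), a geometric cooling schedule, and units in which k_B = 1. The parameter values, function names, and cooling schedule are assumptions for illustration; nothing here is tuned, and a single run is not guaranteed to reach the ground state.

```python
import math
import random

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))
SYMMETRIES = (
    lambda dx, dy: (-dy, dx),   # rotate +90 degrees
    lambda dx, dy: (dy, -dx),   # rotate -90 degrees
    lambda dx, dy: (dx, -dy),   # mirror in the x-axis
    lambda dx, dy: (-dx, dy),   # mirror in the y-axis
)


def hp_energy(sequence, coords):
    """Minus the number of non-bonded H-H lattice contacts."""
    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in NEIGHBOURS:
            j = index_at.get((x + dx, y + dy))
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1  # j > i + 1 counts each non-bonded pair once
    return -contacts


def pivot(coords, k, transform):
    """Apply a lattice symmetry to every residue after index k, about residue k."""
    px, py = coords[k]
    moved = []
    for x, y in coords[k + 1:]:
        dx, dy = transform(x - px, y - py)
        moved.append((px + dx, py + dy))
    return coords[:k + 1] + moved


def anneal(sequence, steps=20000, t_start=2.0, t_end=0.1, seed=0):
    """Simulated annealing for sequences of length >= 3; returns (energy, coords)."""
    rng = random.Random(seed)
    n = len(sequence)
    coords = [(i, 0) for i in range(n)]  # start from the extended chain
    energy = hp_energy(sequence, coords)
    best_energy, best_coords = energy, coords
    cooling = (t_end / t_start) ** (1.0 / steps)
    temperature = t_start
    for _ in range(steps):
        trial = pivot(coords, rng.randrange(1, n - 1), rng.choice(SYMMETRIES))
        if len(set(trial)) == n:  # reject moves that break self-avoidance
            delta = hp_energy(sequence, trial) - energy
            # Metropolis criterion: downhill always, uphill with exp(-dE / T).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                coords, energy = trial, energy + delta
                if energy < best_energy:
                    best_energy, best_coords = energy, coords
        temperature *= cooling
    return best_energy, best_coords


print(anneal("HPHHPHPH"))
```

Recomputing the full energy on every trial keeps the sketch short; production codes usually update only the contacts affected by the move.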

Genetic Algorithms

A population of candidate conformations evolves over generations. Conformations are represented as sequences of moves (U/D/L/R on the lattice). Selection favours lower-energy (more compact, H-H-rich) conformations. Crossover and mutation operators generate new candidates. Genetic algorithms are effective at exploring diverse regions of conformational space simultaneously, reducing the risk of converging to a single local minimum.
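The following sketch shows one way such a genetic algorithm could look, using the U/D/L/R move-string representation described above. Truncation selection, one-point crossover, per-letter mutation, and a fixed penalty for self-intersecting walks are all assumptions chosen for brevity; published HP genetic algorithms use more elaborate operators.

```python
import random

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fitness(sequence, moves):
    """HP energy of the encoded walk (lower is better); 1 if it self-intersects."""
    coords = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        coords.append((coords[-1][0] + dx, coords[-1][1] + dy))
    if len(set(coords)) != len(coords):
        return 1  # penalty: worse than any valid fold, whose energy is <= 0
    index_at = {site: i for i, site in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        for dx, dy in ((1, 0), (0, 1)):  # right and up: each pair counted once
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[i] == sequence[j] == "H":
                energy -= 1
    return energy


def evolve(sequence, pop_size=60, generations=150, mutation_rate=0.1, seed=0):
    """Evolve move strings for a sequence of length >= 3; returns (energy, moves)."""
    rng = random.Random(seed)
    n_moves = len(sequence) - 1
    population = ["".join(rng.choice("UDLR") for _ in range(n_moves))
                  for _ in range(pop_size)]
    population[0] = "R" * n_moves  # seed with the extended chain (always valid)
    for _ in range(generations):
        population.sort(key=lambda m: fitness(sequence, m))
        parents = population[: pop_size // 2]  # truncation selection with elitism
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_moves)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = "".join(rng.choice("UDLR") if rng.random() < mutation_rate else m
                            for m in child)
            children.append(child)
        population = parents + children
    best = min(population, key=lambda m: fitness(sequence, m))
    return fitness(sequence, best), best


print(evolve("HPHHPHPH"))
```

Because parents survive unchanged into the next generation, the best energy in the population never gets worse from one generation to the next.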

Dynamic Programming and Exact Methods

For short sequences (length up to ~25 in 2D), dynamic programming combined with branch-and-bound pruning can find provably optimal conformations. Upper-bound estimators for the maximum possible H–H contacts prune branches that cannot beat the current best solution. For 3D models the practical limit is somewhat shorter. These exact methods are invaluable for generating benchmark datasets against which heuristics are tested.
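The pruning idea can be illustrated with a plain branch-and-bound search (without the dynamic programming component). The bound used here is a deliberately loose assumption of this sketch: each H residue still to be placed can add at most two contacts, or three if it is the last residue of the chain. Tighter estimators prune far more aggressively.

```python
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def branch_and_bound(sequence):
    """Exact 2D ground state. Returns (energy, coords, search nodes visited)."""
    n = len(sequence)
    # max_future[i]: upper bound on contacts that residues i..n-1 can still add.
    max_future = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        gain = (3 if i == n - 1 else 2) if sequence[i] == "H" else 0
        max_future[i] = max_future[i + 1] + gain

    # The extended chain (E = 0) is the initial incumbent solution.
    best = {"energy": 0, "coords": [(i, 0) for i in range(n)], "nodes": 0}

    def extend(path, occupied, energy):
        best["nodes"] += 1
        i = len(path)
        if i == n:
            if energy < best["energy"]:
                best["energy"], best["coords"] = energy, list(path)
            return
        if energy - max_future[i] >= best["energy"]:
            return  # even the optimistic bound cannot beat the incumbent
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            gained = 0
            if sequence[i] == "H":
                for ex, ey in STEPS:
                    j = occupied.get((site[0] + ex, site[1] + ey))
                    if j is not None and j < i - 1 and sequence[j] == "H":
                        gained += 1
            occupied[site] = i
            path.append(site)
            extend(path, occupied, energy - gained)
            path.pop()
            del occupied[site]

    if n >= 2:
        extend([(0, 0), (1, 0)], {(0, 0): 0, (1, 0): 1}, 0)
    return best["energy"], best["coords"], best["nodes"]


energy, coords, nodes = branch_and_bound("HPHHPHPH")
print(energy, nodes)
```

Since the bound never underestimates what a completed chain can still gain, pruning cannot discard the optimum, so the returned energy is provably minimal for the 2D square lattice.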

Reinforcement Learning

More recently, reinforcement learning (RL) agents have been trained to fold HP sequences by learning a policy that places residues one at a time on the lattice. The agent receives a reward signal based on the energy of the completed conformation. RL-based approaches achieve competitive results on benchmark sequences and have the advantage of generalising across sequence lengths and topologies without re-running an optimisation from scratch for each new sequence.
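The sketch below covers only the environment side of such a setup: an episode places residues one at a time, and the reward (the number of H–H contacts) arrives when the chain is complete. The class name `HPFoldEnv`, its methods, and the dead-end handling are hypothetical choices for this illustration, and the driver uses a random policy rather than a trained agent.

```python
import random

ACTIONS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # right, left, up, down


class HPFoldEnv:
    """Toy episodic environment for 2D HP folding (sequence length >= 2)."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.reset()

    def reset(self):
        self.coords = [(0, 0)]
        return tuple(self.coords)

    def legal_actions(self):
        x, y = self.coords[-1]
        taken = set(self.coords)
        return [a for a, (dx, dy) in enumerate(ACTIONS)
                if (x + dx, y + dy) not in taken]

    def contacts(self):
        index_at = {site: i for i, site in enumerate(self.coords)}
        total = 0
        for i, (x, y) in enumerate(self.coords):
            for dx, dy in ((1, 0), (0, 1)):  # each adjacent pair counted once
                j = index_at.get((x + dx, y + dy))
                if (j is not None and abs(i - j) > 1
                        and self.sequence[i] == self.sequence[j] == "H"):
                    total += 1
        return total

    def step(self, action):
        """Place the next residue. Returns (state, reward, done)."""
        dx, dy = ACTIONS[action]
        x, y = self.coords[-1]
        site = (x + dx, y + dy)
        if site in self.coords:
            raise ValueError("illegal action: site already occupied")
        self.coords.append(site)
        complete = len(self.coords) == len(self.sequence)
        dead_end = not complete and not self.legal_actions()
        reward = float(self.contacts()) if complete else 0.0
        return tuple(self.coords), reward, complete or dead_end


def random_rollout(env, rng):
    """Play one episode with uniformly random legal actions; return its reward."""
    env.reset()
    reward, done = 0.0, False
    while not done:
        _, reward, done = env.step(rng.choice(env.legal_actions()))
    return reward


rng = random.Random(0)
env = HPFoldEnv("HPHHPHPH")
print(max(random_rollout(env, rng) for _ in range(200)))
```

A learning agent would replace `rng.choice` with a policy that maps the partial chain to an action and is updated from the terminal rewards.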

7. Beyond the HP Model

The HP model's power lies in its tractability for analysis and simulation. Its limitations lie in the same source of simplicity: real proteins are not made of just two residue types, and real interactions extend far beyond pairwise hydrophobic contacts.

Extensions to the HP Model

HP+ (charge model): Adds a charged residue type (positive or negative). Allows modelling of electrostatic interactions and salt bridges that stabilise protein tertiary structure.
HPNX model: Four residue types (H, P, N, X) encoding hydrophobic, positively charged, negatively charged, and neutral residues. Richer contact energy matrix. Better captures the diversity of real amino acid interactions.
Miyazawa–Jernigan (MJ) matrix: A 20x20 empirical contact energy matrix derived from statistical analysis of known protein structures. Replaces the binary HP classification with residue-specific pairwise energies estimated from database frequencies. Used in more realistic lattice simulations.
3D lattice models: Placing the HP chain on a 3D cubic or face-centred cubic (FCC) lattice dramatically increases conformational flexibility and better approximates real protein packing. The FCC lattice is particularly useful because each residue has 12 neighbours, closer to the coordination number seen in real protein cores.
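These extensions share one structure: a set of lattice neighbour vectors plus a table of pairwise contact energies. The sketch below makes both pluggable. The neighbour sets for the cubic and FCC lattices follow from their geometry, but the function name and the `TOY_CHARGE_TABLE` values are invented for illustration and are not taken from the HPNX or Miyazawa–Jernigan parameter sets.

```python
from itertools import product

# Cubic lattice: 6 neighbours (one coordinate changes by 1).
CUBIC = [v for v in product((-1, 0, 1), repeat=3) if sum(abs(c) for c in v) == 1]
# FCC lattice: 12 neighbours (exactly two coordinates change by 1).
FCC = [v for v in product((-1, 0, 1), repeat=3) if sum(abs(c) for c in v) == 2]

HP_TABLE = {("H", "H"): -1.0}
# Hypothetical values, only to show how a richer alphabet plugs in.
TOY_CHARGE_TABLE = {("H", "H"): -1.0, ("+", "-"): -0.5,
                    ("+", "+"): 0.5, ("-", "-"): 0.5}


def contact_energy(sequence, coords, neighbours, table):
    """Sum pairwise contact energies over non-bonded lattice neighbours."""
    index_at = {site: i for i, site in enumerate(coords)}
    energy = 0.0
    for i, site in enumerate(coords):
        for step in neighbours:
            j = index_at.get(tuple(a + b for a, b in zip(site, step)))
            if j is not None and j > i + 1:  # each non-bonded pair once
                pair = (sequence[i], sequence[j])
                energy += table.get(pair, table.get(pair[::-1], 0.0))
    return energy


print(len(CUBIC), len(FCC))  # 6 12

# Cubic lattice: a U-shaped HPPH chain brings its two H ends into contact.
print(contact_energy("HPPH", [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)],
                     CUBIC, HP_TABLE))  # -1.0

# FCC lattice: residues i and i+2 can touch, which never happens on a cubic lattice.
print(contact_energy("HPH", [(0, 0, 0), (1, 1, 0), (1, 0, 1)],
                     FCC, HP_TABLE))  # -1.0
```

Swapping `HP_TABLE` for a 20×20 table keyed by amino-acid letters would give an MJ-style energy with no change to the function itself.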

From Lattice Models to AlphaFold

The trajectory from the HP model to state-of-the-art structure prediction illustrates the layered progress of computational biology. Lattice models provided mathematical insights and algorithm benchmarks. Coarse-grained off-lattice models (CASP-era fragment assembly: Rosetta, I-TASSER) added realistic geometry. Deep learning approaches — culminating in AlphaFold2's equivariant transformer architecture and multiple sequence alignment-based coevolutionary features — achieved near-experimental accuracy on most protein families. Yet the HP model remains in every structural bioinformatics curriculum because it makes the core physics accessible without computational overhead, and because the algorithms developed for it — Monte Carlo, genetic search, RL — remain foundational throughout computational science.

AlphaFold2 and the HP legacy: AlphaFold2 does not use a lattice model, but the physical intuition behind the HP model, that hydrophobic burial is the dominant driving force, is implicitly encoded in the training data. The model learns from the experimentally solved structures in the Protein Data Bank (well over 100,000 entries), in which hydrophobic cores are consistently buried. The HP model makes this insight explicit and computable.

Key Takeaways

Protein folding is driven by the thermodynamic principle that the native structure minimises free energy — Anfinsen's dogma, confirmed experimentally and theoretically.
Levinthal's paradox shows that folding cannot be a random search; an energy funnel biases the chain toward the native state.
The HP model abstracts amino acids into two types (hydrophobic H and polar P) and defines energy solely by non-bonded H–H contacts on a lattice.
The hydrophobic effect — the entropic gain from releasing ordered water around non-polar groups — accounts for 60–70% of folding stability in real proteins.
Finding the ground-state HP conformation is NP-hard; heuristic methods including Monte Carlo, genetic algorithms, and reinforcement learning are required for long sequences.
Extensions of the HP model (HPNX, MJ matrix, 3D lattice) progressively close the gap between the minimal abstraction and real protein physics.
The HP model remains the canonical entry point for understanding protein folding computationally, even in the era of AlphaFold2.