Model Compression
Compression is the family. Quantisation is one cousin. The techniques that shrink language models for self-hosting — what they do, what they cost in quality, and which ones you actually reach for.
Most people use the two words interchangeably. They are not interchangeable.
Compression is the umbrella: any technique that reduces a model's footprint or inference cost while attempting to preserve its behaviour. Quantisation is one member of that family, alongside pruning, distillation, and low-rank factorisation. They compose. A modern self-hosted small language model (SLM) has typically been touched by two or three.
Quantisation snaps continuous weights to a coarser grid.
A weight stored as FP32 has roughly 4.3 billion possible bit patterns (2^32). INT8 has 256. INT4 has sixteen. INT2 has four. The model still computes, just on a blurrier representation of itself. Lower the precision and the signal loses detail.
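The snapping can be sketched in a few lines. This is a minimal illustration assuming symmetric, per-tensor, round-to-nearest quantisation; the function name `quantise` and the sample weights are made up for the example, and production methods such as GPTQ or AWQ are considerably more careful about where the error lands.

```python
def quantise(weights, bits):
    """Snap floats to a symmetric signed integer grid, then map them back."""
    # Symmetric grids use 2**bits - 1 codes (one fewer than the full range),
    # e.g. -127..127 for INT8, -7..7 for INT4, -1..1 for INT2.
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    restored = [c * scale for c in codes]
    return restored, codes

weights = [0.12, -0.57, 0.33, 0.91, -0.08]  # illustrative values only
for bits in (8, 4, 2):
    restored, codes = quantise(weights, bits)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: codes={codes} worst error={worst:.4f}")
```

The worst-case error grows as the bit width falls, which is the blurring described above in numerical form.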
The simulator's working, shown plainly.
Three equations carry most of the weight. They are deliberately first-order — the hierarchy of cost between them matters more than the constants. Quantisation saves memory linearly. Pruning damages quality cubically past a threshold. Distillation decays logarithmically in the compression factor. That is the intuition worth carrying around.
Precision is a multiplier on memory, full stop. An 8B model at FP16 is 16 GB of weights. Drop to INT4 and it is 4 GB. The arithmetic does not care which architecture you chose.
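The arithmetic is short enough to write out. A sketch assuming decimal gigabytes (1 GB = 10^9 bytes) and weights only; it ignores the per-group scales that real quantised formats store, and it ignores activations and the KV cache. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params, bits):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(8e9, bits):.0f} GB")
```

For an 8B-parameter model this prints 32, 16, 8 and 4 GB, matching the FP16 and INT4 figures quoted above.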
Pruning subtracts a fraction. Distillation divides. They reach similar destinations by different routes — one removes weights from a trained model, the other trains a smaller model from scratch under supervision.
Multiplicative, because the techniques compose. Each factor is a number less than one; you can stack them, but the product falls faster than any individual term.
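How the three pieces stack can be shown as a rough sketch. The function names are hypothetical, the footprint formula assumes pruned weights are actually removed rather than stored as zeros, and of the quality factors only the 0.94 for INT4 comes from this piece; 0.97 and 0.90 are placeholders chosen for illustration.

```python
def footprint_gb(params, bits, pruned=0.0, distill_factor=1.0):
    """Weights-only size: pruning subtracts a fraction, distillation divides,
    precision multiplies. Assumes pruned weights are not stored at all."""
    return params * (1.0 - pruned) / distill_factor * bits / 8 / 1e9

def quality_retained(*factors):
    """Multiply per-technique retention factors, each between 0 and 1."""
    product = 1.0
    for f in factors:
        product *= f
    return product

# 8B parameters, INT4, 30% pruned, distilled to half size (all illustrative).
size = footprint_gb(8e9, bits=4, pruned=0.30, distill_factor=2.0)
# 0.94 is the INT4 figure from the text; 0.97 and 0.90 are placeholders.
quality = quality_retained(0.94, 0.97, 0.90)
print(f"{size:.2f} GB at {quality:.3f} of baseline quality")
```

The product ends up below the smallest individual factor, which is the sense in which stacking is cheap in memory but not free in quality.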
Where the cliff lives
The first 30% of pruning costs almost nothing. The next 30% costs a little. Past roughly 60%, the cubic term wakes up, and the model degrades faster than the size shrinks. This is the curve to keep in your head.
Real models will diverge from this in both directions — well-engineered structured pruning can push the cliff right; naïve magnitude pruning brings it forward.
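The shape of that curve can be sketched with a toy function. Everything numeric here is an assumption made for illustration: the 0.3 threshold and the coefficient 4.0 are invented to reproduce the "flat, then gentle, then cliff" shape, and they are not the coefficients this piece was calibrated with.

```python
def pruning_quality(sparsity, threshold=0.3, k=4.0):
    """Toy retention curve: no loss up to `threshold`, cubic loss beyond it.

    `threshold` and `k` are hypothetical values, not measured ones.
    """
    excess = max(0.0, sparsity - threshold)
    return max(0.0, 1.0 - k * excess ** 3)

for sparsity in (0.0, 0.3, 0.6, 0.9):
    print(f"{sparsity:.0%} pruned -> {pruning_quality(sparsity):.3f} retained")
```

Shifting `threshold` to the right mimics well-engineered structured pruning; shifting it left mimics naïve magnitude pruning.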
These coefficients are calibrated from published benchmarks — GPTQ, AWQ, Wanda, SparseGPT, and the various distilled-model papers — not derived from first principles. The shapes of the curves matter more than the absolute values. A reader who wants to argue with the 0.94 on INT4 is having exactly the conversation this piece is trying to start.
Compose techniques. Watch the footprint move.
The numbers below are first-order estimates: a rule-of-thumb model, not a benchmark. Real-world results depend on the calibration set, activation outliers, and whether you fine-tune after pruning. Use this to build intuition, not to commit to a procurement decision.
Quantisation
Lower-precision arithmetic for the same shape of network
Store and compute weights (and sometimes activations) at fewer bits. The matrix multiplications still happen — they just use INT4 or INT8 numbers instead of FP16.
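To make "the multiplications still happen" concrete, here is a sketch of a dot product computed on INT8 codes. It assumes symmetric per-vector scaling for both weights and activations; the function names and sample values are hypothetical, and real kernels work on whole tiles of a matrix rather than one vector pair.

```python
def quantise_int8(values):
    """Return integer codes in [-127, 127] and the scale that maps them back."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(weights, activations):
    """Dot product accumulated in integers, rescaled to float once at the end."""
    w_codes, w_scale = quantise_int8(weights)
    a_codes, a_scale = quantise_int8(activations)
    acc = sum(w * a for w, a in zip(w_codes, a_codes))  # integer accumulator
    return acc * w_scale * a_scale

weights = [0.5, -0.25, 0.75]       # illustrative values only
activations = [0.9, 2.0, -1.5]
exact = sum(w * a for w, a in zip(weights, activations))
print(f"exact={exact:.4f} int8={int8_dot(weights, activations):.4f}")
```

The inner loop touches only integers; the floating-point scales are applied once per output, which is where the speed and memory savings come from.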
Pruning
Remove parts of the network and stitch the rest back together
Identify weights, neurons, attention heads, or whole layers that contribute little to the output and zero them out (or delete them entirely).
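The simplest version of this is unstructured magnitude pruning, sketched below on a flat list of weights. The function name and sample values are made up for the example. Note that zeroing alone saves no memory unless the result is stored in a sparse format or the zeroed units are physically removed.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_drop = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]  # illustrative
print(magnitude_prune(weights, 0.5))
# prints [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.4, 0.0]
```

Structured variants apply the same idea to whole neurons, attention heads, or layers, scoring each unit by an importance measure instead of scoring single weights.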
Knowledge distillation
Train a small student to imitate a large teacher
The student model learns from the teacher's output distributions (soft labels) rather than just the ground truth. It absorbs not just the right answer but the teacher's confidence structure.
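A common way to express "learning from soft labels" is a KL divergence between temperature-softened distributions. The sketch below assumes that formulation, with a temperature of 2.0 chosen arbitrarily; the function names and logits are hypothetical, and practical recipes usually mix this term with an ordinary cross-entropy loss on the ground truth.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The temperature**2 factor keeps gradient magnitudes comparable
    # across different temperature settings.
    return kl * temperature ** 2

teacher = [4.0, 1.5, 0.2]   # illustrative logits for three classes
student = [3.0, 2.0, 0.5]
print(f"loss={distillation_loss(student, teacher):.4f}")
```

The loss is zero only when the student reproduces the teacher's full distribution, not merely its top choice, which is what "confidence structure" means in practice.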
Low-rank factorisation
Replace one big matrix with two smaller ones
A weight matrix W (d × d) is approximated as the product of two thinner matrices A (d × r) and B (r × d), where r ≪ d. The total parameter count drops from d² to 2dr.
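The saving is easy to check numerically. In this sketch d = 4096 and r = 64 are arbitrary example sizes, and the tiny matrices are made up; finding a good A and B for a trained W (for instance via a truncated SVD) is a separate problem that the code does not address.

```python
def low_rank_params(d, r):
    """Parameter counts for a full d x d matrix and its rank-r factorisation."""
    return d * d, 2 * d * r

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def low_rank_apply(A, B, x):
    """Compute (A @ B) @ x as A @ (B @ x), never forming the d x d product."""
    return matvec(A, matvec(B, x))

full, factored = low_rank_params(4096, 64)
print(full, factored, full // factored)   # prints 16777216 524288 32

A = [[1.0], [2.0], [3.0]]        # d x r with d=3, r=1
B = [[1.0, 0.0, -1.0]]           # r x d
print(low_rank_apply(A, B, [4.0, 5.0, 6.0]))   # prints [-2.0, -4.0, -6.0]
```

Applying B first and A second means the compute cost shrinks along with the parameter count, since the full matrix is never materialised.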
Where real models sit.
Quality is rough — anchored on MMLU and chat-arena Elo at time of release, normalised so Llama-3.1-8B FP16 ≈ 100. Size is on-disk bytes for the weights only.
The frontier you actually care about runs diagonally from bottom-left to top-right. INT4 quantisation pulls every model leftward without dropping it meaningfully on the vertical axis — which is why it has become the default.
A model is not really a thing; it is a budget — of memory, of arithmetic, of attention. Compression is the practice of spending that budget more cheaply for almost the same behaviour. Quantisation is the most efficient way to do it. It is not the only way.