Model Compression

Compression is the family. Quantisation is one cousin. The techniques that shrink language models for self-hosting — what they do, what they cost in quality, and which ones you actually reach for.

01
The map

Most people use the two words interchangeably. They are not the same thing.

Compression is the umbrella — any technique that reduces a model's footprint or inference cost while attempting to preserve its behaviour. Quantisation is one family within it, alongside pruning, distillation, and low-rank factorisation. They compose. A modern self-hosted SLM has typically been touched by two or three.

Model compression (umbrella term)
Quantisation: lower numerical precision
Pruning: remove weights or whole structures
Distillation: train a smaller student on a teacher
Low-rank factorisation: decompose weight matrices (SVD, LoRA-style)

02
What quantisation actually does

It snaps continuous weights to a coarser grid.

A weight stored as FP32 has roughly four billion possible values. INT8 has 256. INT4 has sixteen. INT2 has four. The model still computes — just on a blurrier representation of itself. Drag the precision down and watch the signal lose detail.

[Interactive figure: weight value across the network on a −1.0 to +1.0 axis, comparing the original (FP32-ish) signal with the same signal quantised to 4-bit (16 levels). RMS reconstruction error at this setting: 0.0361.]

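
The snapping step can be sketched in a few lines of Python. This is a minimal illustration, assuming a uniform grid over a fixed range of −1 to +1 and a synthetic signal. The names `quantise` and `rms_error` are invented here, and real quantisers choose scales per tensor, channel, or group, so the errors it prints will not match the figure exactly.

```python
import math

def quantise(weights, bits, lo=-1.0, hi=1.0):
    """Snap each weight to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    snapped = []
    for w in weights:
        clipped = min(max(w, lo), hi)          # values outside the range saturate
        index = round((clipped - lo) / step)   # nearest grid point
        snapped.append(lo + index * step)
    return snapped

def rms_error(original, approx):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((a - b) ** 2 for a, b in zip(original, approx))
    return math.sqrt(total / len(original))

# A smooth synthetic "weight signal" in [-1, 1]; purely illustrative.
weights = [math.sin(i / 7.0) * math.cos(i / 23.0) for i in range(512)]

for bits in (8, 4, 2):
    approx = quantise(weights, bits)
    print(f"{bits}-bit: {2 ** bits} levels, RMS error {rms_error(weights, approx):.4f}")
```

Each bit removed halves the number of levels and roughly doubles the step between them, which is why the error grows so quickly at the low end.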
03
The arithmetic

The simulator's working, shown plainly.

Three equations carry most of the weight. They are deliberately first-order — the hierarchy of cost between them matters more than the constants. Quantisation saves memory linearly. Pruning damages quality cubically past a threshold. Distillation decays logarithmically in the compression factor. That is the intuition worth carrying around.

01 · Footprint

Precision is a multiplier on memory, full stop. An 8B model at FP16 is 16 GB of weights. Drop to INT4 and it is 4 GB. The arithmetic does not care which architecture you chose.
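
The footprint arithmetic, as a sketch. It assumes decimal gigabytes and counts weights only. Real INT4 files carry a little extra for scales and zero-points, which this ignores. The function name is illustrative.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}

def weight_footprint_gb(n_params, precision):
    """Weights-only size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[precision] / 8 / 1e9

for precision in BITS_PER_WEIGHT:
    size = weight_footprint_gb(8e9, precision)
    print(f"8B parameters at {precision}: {size:.1f} GB")
```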

02 · Effective parameters

Pruning subtracts a fraction. Distillation divides. They reach similar destinations by different routes — one removes weights from a trained model, the other trains a smaller model from scratch under supervision.
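
One way to write that down, as a sketch. The exact form is an assumption: the prose only says that pruning subtracts a fraction and distillation divides, so the function below simply does both. `effective_params` is a name made up for this example.

```python
def effective_params(n_params, prune_fraction=0.0, distill_factor=1.0):
    """Parameters left after pruning a fraction and distilling by a factor.

    Assumed form: N_eff = N * (1 - p) / d, with 0 <= p < 1 and d >= 1.
    """
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    if distill_factor < 1.0:
        raise ValueError("distill_factor must be at least 1")
    return n_params * (1.0 - prune_fraction) / distill_factor

# Two routes to the same destination: 8B down to 4B.
pruned = effective_params(8e9, prune_fraction=0.5)
distilled = effective_params(8e9, distill_factor=2.0)
print(f"pruned 50%: {pruned / 1e9:.1f}B, distilled 2x: {distilled / 1e9:.1f}B")
```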

03 · Quality retained

Multiplicative, because the techniques compose. Each factor is a number less than one; you can stack them, but the product falls faster than any individual term.

The three quality factors

Quantisation: FP32 → 1.000, FP16 → 0.999, INT8 → 0.980, INT4 → 0.940, INT2 → 0.780. Discrete steps. The cliff lives between INT4 and INT2.
Pruning: linear at first. The cubic term overtakes the linear one near 30% pruning, and the loss steepens sharply past about 60%.
Distillation: a small, fixed loss per halving. A 4× student keeps about 92%.
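
The three factors compose by multiplication, which can be sketched as follows. The quantisation table and the pruning curve q = 1 − (0.05p + 0.6p³) come from this piece. The distillation form is an assumption: a loss of 0.04 per halving, chosen only because it reproduces "a 4× student keeps about 92%". All function names are illustrative.

```python
import math

QUANT_QUALITY = {"FP32": 1.000, "FP16": 0.999, "INT8": 0.980, "INT4": 0.940, "INT2": 0.780}

def prune_quality(p):
    """Pruning curve from the text: q = 1 - (0.05p + 0.6p^3)."""
    return 1.0 - (0.05 * p + 0.6 * p ** 3)

def distill_quality(factor, loss_per_halving=0.04):
    """Assumed form: a fixed loss per halving, so logarithmic in the factor."""
    if factor < 1.0:
        raise ValueError("factor must be at least 1")
    return 1.0 - loss_per_halving * math.log2(factor)

def quality_retained(precision, prune_fraction=0.0, distill_factor=1.0):
    """Product of the three factors; each is at most 1."""
    return (QUANT_QUALITY[precision]
            * prune_quality(prune_fraction)
            * distill_quality(distill_factor))

print(f"INT4 only: {quality_retained('INT4'):.3f}")
print(f"INT4 + 30% pruned + 2x distilled: {quality_retained('INT4', 0.3, 2.0):.3f}")
```

The second line lands below every individual factor, which is the point of the multiplicative form: stacking is allowed, but the product falls faster than any single term.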

Where the cliff lives

The first 30% of pruning costs almost nothing (about 3 points of quality under the curve below). The next 30% costs a little more (about 13 points). The last 30% is where the cubic term dominates (about 32 points), and the model degrades faster than the size shrinks. This is the curve to keep in your head.

Real models will diverge from this in both directions — well-engineered structured pruning can push the cliff right; naïve magnitude pruning brings it forward.

Quality vs pruning fraction
q = 1 − (0.05p + 0.6p³)
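
Evaluating the curve band by band makes the cliff concrete. This sketch uses the formula exactly as stated. Only the helper names are invented.

```python
import math

def prune_quality(p):
    """q = 1 - (0.05p + 0.6p^3), with p the pruned fraction in [0, 1]."""
    return 1.0 - (0.05 * p + 0.6 * p ** 3)

def marginal_loss(p_from, p_to):
    """Quality lost when pruning is pushed from p_from to p_to."""
    return prune_quality(p_from) - prune_quality(p_to)

bands = ((0.0, 0.3), (0.3, 0.6), (0.6, 0.9))
for lo, hi in bands:
    print(f"{lo:.0%} to {hi:.0%}: lose {marginal_loss(lo, hi):.3f}, "
          f"quality at {hi:.0%} is {prune_quality(hi):.3f}")

# Point where the cubic term equals the linear one: 0.6p^3 = 0.05p.
crossover = math.sqrt(0.05 / 0.6)
print(f"cubic term overtakes linear term at p = {crossover:.2f}")
```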
A note on the constants

These coefficients are calibrated from published benchmarks — GPTQ, AWQ, Wanda, SparseGPT, and the various distilled-model papers — not derived from first principles. The shapes of the curves matter more than the absolute values. A reader who wants to argue with the 0.94 on INT4 is having exactly the conversation this piece is trying to start.

04
The lab

Compose techniques. Watch the footprint move.

The numbers below are first-order estimates from a rule-of-thumb model, not a benchmark. Real-world results depend on the calibration set, activation outliers, and whether you fine-tune after pruning. Use this to build intuition, not to justify a procurement decision.

[Interactive simulator, shown at its default settings.]

Inputs
Model size: 8B parameters (slider from 0.5B to 70B)
Precision: INT4 (GPTQ / AWQ regime, 4 bits per weight)
Pruning: 0%. Fraction of weights set to zero. Above ~50% you typically need fine-tuning to recover.
Distillation: 1× smaller. Train a student with this fraction of the teacher's parameters.
Context length: 8K tokens. KV-cache scales linearly with context length; it dominates VRAM at long contexts.

Outputs
Weights on disk: 4.0 GB (88% smaller; chart: footprint vs FP32 baseline)
VRAM required: 5.2 GB, including KV cache and activations
Quality retained: 94.0% (noticeable drift)
Hardware tier: Apple Silicon, 8 GB (M-series unified memory, RTX 3060 8GB)

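
The simulator's own VRAM formula is not shown, so the following is only a sketch of the KV-cache part. It assumes a Llama-3.1-8B-like shape (32 layers, 8 key/value heads, head dimension 128) and an FP16 cache. Those defaults, and the function name, are assumptions rather than anything the simulator specifies.

```python
def kv_cache_gb(context_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                bytes_per_value=2.0, batch_size=1):
    """KV-cache size in decimal GB; linear in context length.

    The leading 2 counts one key tensor and one value tensor per layer.
    Shape defaults are assumed, not taken from the simulator.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * batch_size / 1e9

weights_gb = 4.0  # 8B parameters at INT4

for tokens in (8_192, 32_768, 131_072):
    fp16_cache = kv_cache_gb(tokens)
    fp8_cache = kv_cache_gb(tokens, bytes_per_value=1.0)
    print(f"{tokens:>7} tokens: weights {weights_gb:.1f} GB, "
          f"KV cache {fp16_cache:.2f} GB at FP16, {fp8_cache:.2f} GB at FP8")
```

Under these assumptions the cache is about 1 GB at 8K tokens, which puts weights plus cache in the same ballpark as the 5.2 GB readout once activations are added. At 128K tokens the cache is several times larger than the INT4 weights.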
05
The techniques in depth
01

Quantisation

Lower-precision arithmetic for the same shape of network

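Store and compute weights (and sometimes activations) at fewer bits. The matrix multiplications still happen; they just read INT4 or INT8 numbers instead of FP16. In weight-only schemes the low-bit weights are typically dequantised on the fly and the arithmetic itself stays in FP16.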

Post-training quantisation (PTQ). Convert a trained FP16 model with a small calibration set. Fast, free of retraining. GPTQ, AWQ, and bitsandbytes live here.
Quantisation-aware training (QAT). Simulate quantisation during training so weights settle into a representable distribution. Better quality at INT4 and below.
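Mixed-precision. Keep sensitive layers (often the first and last) at higher precision; squash the rest. Related ideas: SmoothQuant, which migrates activation outliers into the weights so both can run at INT8, and MXFP4, a block-scaled 4-bit floating-point format.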
Typical trade-off
~2× to 8× reduction relative to FP16; quality near-lossless at INT8, slightly degraded at INT4, and breaking down below INT4 without QAT.
When you reach for it
Always your first move. INT4 weight-only quantisation is the de facto default for self-hosted SLMs in 2026.
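
For the mixed-precision case, the footprint depends on the average bits per weight rather than a single precision. A small sketch, with an entirely hypothetical split of an 8B model: roughly 1B parameters in sensitive layers kept at 16 bits, the remaining 7B at 4 bits.

```python
def average_bits(layers):
    """Parameter-weighted mean bits per weight.

    layers: list of (n_params, bits) pairs, one per group of layers.
    """
    total_params = sum(n for n, _ in layers)
    total_bits = sum(n * bits for n, bits in layers)
    return total_bits / total_params

# Hypothetical split, for illustration only.
layers = [(1.0e9, 16), (7.0e9, 4)]

avg = average_bits(layers)
size_gb = sum(n for n, _ in layers) * avg / 8 / 1e9
print(f"average {avg:.2f} bits per weight, about {size_gb:.1f} GB of weights")
```

Protecting an eighth of the parameters at FP16 costs about 1.5 GB over pure INT4 in this example, which is the trade the technique asks you to weigh.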
02

Pruning

Remove parts of the network and stitch the rest back together

Identify weights, neurons, attention heads, or whole layers that contribute little to the output and zero them out (or delete them entirely).

Unstructured pruning. Zero out individual weights. Achieves high sparsity but needs specialist kernels to translate into speed (NVIDIA Ampere 2:4 sparsity is the practical case).
Structured pruning. Remove whole channels, heads, or layers. Less aggressive but the resulting model is dense and runs fast on any hardware.
Wanda / SparseGPT. Modern one-shot pruners that outperform plain magnitude pruning at the same sparsity, without retraining. Often combined with quantisation.
Typical trade-off
~1.5× to 3× reduction in practice; quality holds to ~50% sparsity, then drops sharply unless you fine-tune.
When you reach for it
Reach for it when quantisation alone leaves you ~20-30% over your VRAM budget. Less common than quantisation in production SLM stacks.
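
The difference between unstructured pruning and the 2:4 pattern is easiest to see on a toy weight vector. This is a sketch of the selection rule only, using plain magnitude as the importance score. It is not how Wanda or SparseGPT score weights, and the function names are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured: zero the smallest-magnitude fraction, wherever it falls."""
    n_zero = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_zero])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def prune_2_of_4(weights):
    """2:4 pattern: keep the 2 largest-magnitude weights in each group of 4."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

weights = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.05, 0.3]
print("unstructured 50%:", magnitude_prune(weights, 0.5))
print("2:4 pattern:     ", prune_2_of_4(weights))
```

Both results are 50% sparse. The unstructured version keeps the four largest weights overall. The 2:4 version must give up two large weights in the first group and keep two small ones in the second, which is the price of a pattern that hardware can accelerate.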
03

Knowledge distillation

Train a small student to imitate a large teacher

The student model learns from the teacher's output distributions (soft labels) rather than just the ground truth. It absorbs not just the right answer but the teacher's confidence structure.

Response distillation. Student matches the teacher's final logits. Most common.
Feature distillation. Student matches intermediate layer activations. Higher fidelity, more setup.
Synthetic data distillation. Use the teacher to generate training data for the student. The pattern behind Phi-3, Gemma 2, and most modern small models.
Typical trade-off
5–10× smaller for 90–95% of capability on the targeted task distribution. Worse generalisation outside that distribution.
When you reach for it
When you control the training pipeline and have a specific deployment target. Less of a knob you turn at inference time; more of an architectural decision upstream.
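
Learning from soft labels usually comes down to a divergence between two softened distributions. A minimal sketch of one common formulation, for a single token position: KL divergence between teacher and student at a temperature T, scaled by T². The temperature value and function names here are illustrative, and in practice this term is typically mixed with an ordinary cross-entropy loss on the ground truth.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)
    return temperature ** 2 * kl

teacher_logits = [4.0, 2.5, 0.5, -1.0]   # confident, but not one-hot
good_student = [3.8, 2.4, 0.6, -0.9]
poor_student = [0.5, 4.0, 2.5, -1.0]     # ranks the wrong token first

print(f"good student loss: {distillation_loss(good_student, teacher_logits):.4f}")
print(f"poor student loss: {distillation_loss(poor_student, teacher_logits):.4f}")
```

The raised temperature flattens both distributions, so the student is also penalised for misjudging the runner-up tokens. That is the confidence structure a hard label cannot carry.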
04

Low-rank factorisation

Replace one big matrix with two smaller ones

A weight matrix W (d × d) is approximated as the product of two thinner matrices A (d × r) and B (r × d), where r ≪ d. The total parameter count drops from d² to 2dr.

SVD-based compression. Decompose existing weights via singular value decomposition, keep the top-r components.
LoRA / QLoRA adapters. Not compression of the base model — additive low-rank adapters for efficient fine-tuning. Adjacent territory, often confused.
Tucker / tensor decomposition. Higher-order factorisations for convolutional and embedding layers.
Typical trade-off
Variable — depends entirely on the effective rank of the weight matrices, which is hard to predict.
When you reach for it
Niche for LLMs in 2026. Useful for embedding tables and specific architectural bottlenecks. LoRA dominates the conversation in this neighbourhood, but solves a different problem.
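
The parameter arithmetic behind the d² to 2dr claim, as a sketch. The dimensions are illustrative: d = 4096 is a typical hidden size, not a figure from this piece.

```python
def low_rank_params(d, r):
    """Parameters in A (d x r) plus B (r x d)."""
    return 2 * d * r

def compression_ratio(d, r):
    """How many times smaller the factorised pair is than the full d x d matrix."""
    return (d * d) / low_rank_params(d, r)

d = 4096
for r in (16, 64, 512, 2048):
    print(f"d={d}, r={r}: {d * d:,} -> {low_rank_params(d, r):,} parameters "
          f"({compression_ratio(d, r):.1f}x)")
```

The ratio is d / 2r, so the factorisation saves nothing once r reaches d / 2. Whether a small r preserves behaviour is the hard part, and it depends on how quickly the singular values of W fall off.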
06
The landscape

Where real models sit.

Quality is rough — anchored on MMLU and chat-arena Elo at time of release, normalised so Llama-3.1-8B FP16 ≈ 100. Size is on-disk bytes for the weights only.

[Scatter chart: quality against on-disk size. Legend: general-purpose open, small models (≤4B), tiny (≤1.5B), large (≥40B).]

The frontier you actually care about runs diagonally from bottom-left to top-right. INT4 quantisation pulls every model leftward without dropping it meaningfully on the vertical axis — which is why it has become the default.

07
Decision guide
Situation: Self-hosting an open SLM on consumer hardware
First reach for: INT4 weight-only quantisation (AWQ or GPTQ)
Caveat: Drop to INT3 or INT2 only if desperate; otherwise pick a smaller model.

Situation: Production inference on a datacentre GPU
First reach for: FP8 or INT8 (TensorRT-LLM, vLLM with FP8 KV cache)
Caveat: Quality holds; latency and throughput improve materially.

Situation: Edge deployment (mobile, embedded)
First reach for: Distillation to a 1–3B model, then INT4 quantisation
Caveat: Optionally structured pruning of attention heads if the budget is tight.

Situation: Long-context inference (32K+ tokens)
First reach for: Quantise the KV cache (FP8 or INT4)
Caveat: KV cache dominates VRAM at long contexts; quantising weights alone is not enough.

Situation: Fine-tuning to a domain
First reach for: LoRA / QLoRA adapters, not full retraining
Caveat: This is parameter-efficient training, not compression, but it ships in the same conversation.

A model is not really a thing; it is a budget — of memory, of arithmetic, of attention. Compression is the practice of spending that budget more cheaply for almost the same behaviour. Quantisation is the most efficient way to do it. It is not the only way.
