Model Compression

Compression is the family. Quantisation is one cousin. The techniques that shrink language models for self-hosting — what they do, what they cost in quality, and which ones you actually reach for.

01

The map

Most people use the words interchangeably. They mean different things.

Compression is the umbrella — any technique that reduces a model's footprint or inference cost while attempting to preserve its behaviour. Quantisation is one family within it, alongside pruning, distillation, and low-rank factorisation. They compose. A modern self-hosted SLM has typically been touched by two or three.

Model compression (umbrella term)

Quantisation: lower numerical precision

Pruning: remove weights or whole structures

Distillation: train a smaller student on a teacher

Low-rank factorisation: decompose weight matrices (SVD, LoRA-style)

02

What quantisation actually does

It snaps continuous weights to a coarser grid.

A weight stored as FP32 has roughly four billion possible values. INT8 has 256. INT4 has sixteen. INT2 has four. The model still computes, just on a blurrier representation of itself. As the precision drops, the signal loses detail.

Figure: weight value across the network, original (FP32-ish) vs quantised to 4-bit (16 levels). RMS reconstruction error at 4-bit: 0.0361.
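The snapping can be sketched in a few lines. This is a minimal illustration, assuming a plain uniform grid stretched between the smallest and largest weight; the stand-in weights are random numbers, not taken from any model, and real quantisers use more careful scaling.

```python
import math
import random


def quantise(values, bits):
    """Snap each value to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def rms_error(original, approx):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((a - b) ** 2 for a, b in zip(original, approx))
    return math.sqrt(total / len(original))


random.seed(0)
# Hypothetical stand-in for a slice of model weights.
weights = [random.gauss(0.0, 0.2) for _ in range(4096)]

for bits in (8, 4, 2):
    err = rms_error(weights, quantise(weights, bits))
    print(f"{bits}-bit, {2 ** bits:>3} levels: RMS error {err:.4f}")
```

Each step down in precision widens the grid spacing, so the reconstruction error grows with it.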

03

The arithmetic

The simulator's working, shown plainly.

Three equations carry most of the weight. They are deliberately first-order — the hierarchy of cost between them matters more than the constants. Quantisation saves memory linearly. Pruning damages quality cubically past a threshold. Distillation decays logarithmically in the compression factor. That is the intuition worth carrying around.

01 · Footprint

weights_{GB} \approx \frac{params \cdot bits}{8}

Precision is a multiplier on memory, full stop. With params counted in billions, an 8B model at FP16 is 8 × 16 / 8 = 16 GB of weights. Drop to INT4 and it is 4 GB. The arithmetic does not care which architecture you chose.
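The footprint equation as a one-line function. A sketch only: the function name is made up for illustration, and params is taken in billions so the result comes out in GB.

```python
def weights_gb(params_billions: float, bits: int) -> float:
    """Weight footprint in GB: params (in billions) * bits / 8."""
    return params_billions * bits / 8


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"8B at {name}: {weights_gb(8, bits):.1f} GB")
```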

02 · Effective parameters

params_{eff} = \frac{params \cdot ( 1 - prune )}{distill}

Pruning subtracts a fraction. Distillation divides. They reach similar destinations by different routes — one removes weights from a trained model, the other trains a smaller model from scratch under supervision.
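A small sketch of the effective-parameter equation, to show the two routes landing in the same place. The function name and the input checks are illustrative assumptions, not part of the simulator.

```python
def effective_params(params_billions: float, prune: float = 0.0, distill: float = 1.0) -> float:
    """params_eff = params * (1 - prune) / distill, with params in billions."""
    if not 0.0 <= prune < 1.0:
        raise ValueError("prune must be a fraction in [0, 1)")
    if distill < 1.0:
        raise ValueError("distill is a shrink factor and must be >= 1")
    return params_billions * (1.0 - prune) / distill


print(effective_params(8, prune=0.5))               # half the weights removed
print(effective_params(8, distill=2))               # student half the size
print(effective_params(8, prune=0.5, distill=2))    # both stacked
```

Pruning 50% and distilling 2× both leave 4B effective parameters from an 8B start; stacking them leaves 2B.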

03 · Quality retained

q = q_{prec} \cdot q_{prune} \cdot q_{distill}

Multiplicative, because the techniques compose. Each factor is a number no greater than one; you can stack them, but the product falls faster than any individual term.

The three quality factors

q_{prec}

FP32 → 1.000
FP16 → 0.999
INT8 → 0.980
INT4 → 0.940
INT2 → 0.780

Discrete steps. The cliff lives between INT4 and INT2.

q_{prune}

1 - (0.05 p + 0.6 p^{3})

Roughly linear at first. The cubic term overtakes the linear one at about 29% pruning and dominates the loss past about 60%.

q_{distill}

0.96 - 0.02 \log_{2}(factor)

A small, fixed loss per halving. A 4× student keeps about 92%.
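The three factors and their product can be written out directly. This is a sketch under one stated assumption: a distillation factor of 1 is treated as "no distillation" and costs nothing, which is the reading that reproduces the 94.0% the lab reports for INT4 on its own. Function names are illustrative.

```python
import math

# Discrete precision steps, as listed above.
Q_PRECISION = {"FP32": 1.000, "FP16": 0.999, "INT8": 0.980, "INT4": 0.940, "INT2": 0.780}


def q_prune(p: float) -> float:
    """Quality kept after pruning a fraction p of the weights."""
    return 1.0 - (0.05 * p + 0.6 * p ** 3)


def q_distill(factor: float) -> float:
    """Quality kept by a student that is `factor` times smaller."""
    if factor <= 1.0:
        return 1.0  # assumption: no distillation, no loss
    return 0.96 - 0.02 * math.log2(factor)


def quality(precision: str, prune: float = 0.0, factor: float = 1.0) -> float:
    """q = q_prec * q_prune * q_distill."""
    return Q_PRECISION[precision] * q_prune(prune) * q_distill(factor)


print(f"INT4 only:                 {quality('INT4'):.3f}")
print(f"INT4 + 30% pruning:        {quality('INT4', prune=0.3):.3f}")
print(f"INT4 + 30% pruning + 4x:   {quality('INT4', prune=0.3, factor=4):.3f}")
```

Because the factors multiply, the stacked result sits below every individual factor that went into it.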

Where the cliff lives

The first 30% of pruning costs about 3 points of quality. The next 30% costs about 13 more. The last 30% costs over 30: the cubic term has taken over, and the model degrades faster than the size shrinks. This is the curve to keep in your head.

Real models will diverge from this in both directions — well-engineered structured pruning can push the cliff right; naïve magnitude pruning brings it forward.

Quality vs pruning fraction

q = 1 − (0.05p + 0.6p³)
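To see where the curve bends, the same formula can be evaluated band by band. A minimal sketch: the 30-point bands follow the text above, and the crossover is simply the point where 0.6p³ equals 0.05p.

```python
import math


def q_prune(p: float) -> float:
    """Quality kept after pruning a fraction p of the weights."""
    return 1.0 - (0.05 * p + 0.6 * p ** 3)


# Where the cubic term matches the linear one: 0.6 p^3 = 0.05 p
crossover = math.sqrt(0.05 / 0.6)

bands = [(0.0, 0.3), (0.3, 0.6), (0.6, 0.9)]
band_costs = [q_prune(lo) - q_prune(hi) for lo, hi in bands]

print(f"cubic term overtakes linear at p = {crossover:.2f}")
for (lo, hi), cost in zip(bands, band_costs):
    print(f"pruning {lo:.0%} to {hi:.0%}: costs {cost * 100:.1f} points of quality")
```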

A note on the constants

These coefficients are calibrated from published benchmarks — GPTQ, AWQ, Wanda, SparseGPT, and the various distilled-model papers — not derived from first principles. The shapes of the curves matter more than the absolute values. A reader who wants to argue with the 0.94 on INT4 is having exactly the conversation this piece is trying to start.

04

The lab

Compose techniques. Watch the footprint move.

The numbers below are first-order estimates: a rule-of-thumb model, not a benchmark. Real-world results depend on the calibration set, activation outliers, and whether you fine-tune after pruning. Use this to build intuition, not to justify a procurement decision.

Base model size: 8B parameters (slider marks: 0.5B, 8B, 30B, 70B)

Numerical precision: INT4

GPTQ / AWQ regime · 4 bits per weight

Pruning (sparsity): 0%

Fraction of weights set to zero. Above ~50% you typically need fine-tuning to recover.

Distillation target: 1× smaller

Train a student with this fraction of the teacher's parameters.

Context window: 8K tokens

KV-cache scales linearly with context length, so it dominates VRAM at long contexts.

Weights on disk: 4.0 GB (88% smaller than the FP32 baseline)

VRAM required: 5.2 GB (incl. KV + activations)

Quality retained: 94.0% (noticeable drift)

Hardware tier: ≤ 8 GB (Apple Silicon M-series unified memory, RTX 3060 8GB)

Chart: footprint vs FP32 baseline
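The KV-cache part of the VRAM figure can be estimated separately. This is a rough sketch, not the simulator's own formula: the layer count, KV-head count, and head dimension below are assumptions chosen to resemble an 8B-class model with grouped-query attention, and the result is reported in GiB.

```python
def kv_cache_gib(
    context_tokens: int,
    layers: int = 32,            # assumed, typical of an 8B-class model
    kv_heads: int = 8,           # assumed, grouped-query attention
    head_dim: int = 128,         # assumed
    bytes_per_value: float = 2.0,  # 2.0 = FP16, 1.0 = FP8, 0.5 = INT4
    batch: int = 1,
) -> float:
    """Approximate KV-cache size in GiB for one or more sequences."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value * batch
    return total_bytes / 2 ** 30


for tokens in (8_192, 32_768, 131_072):
    fp16 = kv_cache_gib(tokens)
    int4 = kv_cache_gib(tokens, bytes_per_value=0.5)
    print(f"{tokens:>7} tokens: {fp16:5.1f} GiB at FP16, {int4:5.1f} GiB at INT4")
```

Under these assumptions the cache grows in direct proportion to context length, while the weights stay fixed; at long contexts the cache is the larger of the two.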

05

The techniques in depth

01

Quantisation

Lower-precision arithmetic for the same shape of network

Store and compute weights (and sometimes activations) at fewer bits. The matrix multiplications still happen — they just use INT4 or INT8 numbers instead of FP16.

Post-training quantisation (PTQ). Convert a trained FP16 model with a small calibration set. Fast, free of retraining. GPTQ, AWQ, and bitsandbytes live here.

Quantisation-aware training (QAT). Simulate quantisation during training so weights settle into a representable distribution. Better quality at INT4 and below.

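Mixed-precision. Keep sensitive layers (often the first and last) at higher precision; squash the rest. Related in spirit: SmoothQuant, which shifts activation outliers into the weights so both fit in INT8, and MXFP4 deployments, which typically hold most weights in a block-scaled 4-bit float format and leave the rest at higher precision.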

Typical trade-off

~2× to 8× reduction relative to FP16; quality near-lossless at INT8, degrading at INT4, and breaking down at INT2 and below without QAT.

When you reach for it

Always your first move. INT4 weight-only quantisation is the de facto default for self-hosted SLMs in 2026.
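What "INT4 weight-only" stores can be sketched as group-wise quantisation: each small group of weights shares one scale, and every weight becomes a 4-bit integer code. This shows only the storage idea, not the calibration that GPTQ or AWQ perform; the group size of 128 and the 16-bit scale are assumptions, and the weights are random stand-ins.

```python
import random


def quantise_group(group, bits=4):
    """Symmetric absmax quantisation of one group to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (bits - 1))       # -8 for INT4
    scale = max(abs(w) for w in group) / qmax
    if scale == 0.0:
        scale = 1.0                 # all-zero group: any scale works
    codes = [max(qmin, min(qmax, round(w / scale))) for w in group]
    return codes, scale


def dequantise_group(codes, scale):
    """Recover approximate weights from codes and the shared scale."""
    return [c * scale for c in codes]


def quantise_weights(weights, group_size=128, bits=4):
    """Split weights into groups and quantise each with its own scale."""
    return [
        quantise_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Storage cost per weight once the shared scale is counted."""
    return bits + scale_bits / group_size


random.seed(1)
weights = [random.gauss(0.0, 0.05) for _ in range(512)]  # hypothetical weights
packed = quantise_weights(weights)
restored = [v for codes, scale in packed for v in dequantise_group(codes, scale)]
worst = max(abs(a - b) for a, b in zip(weights, restored))

print(f"groups: {len(packed)}, worst-case error: {worst:.5f}")
print(f"effective bits per weight: {bits_per_weight()}")
```

The per-group scale is why real INT4 files come out slightly larger than params × 4 / 8 would suggest.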

02

Pruning

Remove parts of the network and stitch the rest back together

Identify weights, neurons, attention heads, or whole layers that contribute little to the output and zero them out (or delete them entirely).

Unstructured pruning. Zero out individual weights. Achieves high sparsity but needs specialist kernels to translate into speed (NVIDIA Ampere 2:4 sparsity is the practical case).

Structured pruning. Remove whole channels, heads, or layers. Less aggressive but the resulting model is dense and runs fast on any hardware.

Wanda / SparseGPT. Modern one-shot pruners that beat plain magnitude pruning without retraining. Often combined with quantisation.

Typical trade-off

~1.5× to 3× reduction in practice; quality holds to ~50% sparsity, then drops sharply unless you fine-tune.

When you reach for it

Reach for it when quantisation alone leaves you ~20-30% over your VRAM budget. Less common than quantisation in production SLM stacks.
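The difference between free-form sparsity and a hardware-friendly pattern is easy to show on a single row of weights. A minimal sketch, assuming plain magnitude as the importance score (Wanda and SparseGPT use better ones); the row values are made up.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured: zero the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_2_of_4(weights):
    """2:4 pattern: keep the 2 largest-magnitude weights in every block of 4."""
    out = []
    for start in range(0, len(weights), 4):
        block = weights[start:start + 4]
        order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
        keep = set(order[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return out


row = [0.9, 0.8, -0.7, 0.6, 0.1, -0.2, 0.05, 0.3]  # hypothetical weights

print(magnitude_prune(row, 0.5))  # [0.9, 0.8, -0.7, 0.6, 0.0, 0.0, 0.0, 0.0]
print(prune_2_of_4(row))          # [0.9, 0.8, 0.0, 0.0, 0.0, -0.2, 0.0, 0.3]
```

Both results are 50% sparse. The unstructured version keeps the four largest weights wherever they fall; the 2:4 version gives up some of them in exchange for a regular layout that sparse kernels can exploit.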

03

Knowledge distillation

Train a small student to imitate a large teacher

The student model learns from the teacher's output distributions (soft labels) rather than just the ground truth. It absorbs not just the right answer but the teacher's confidence structure.

Response distillation. Student matches the teacher's final logits. Most common, and the approach reported for the smaller Gemma 2 models.

Feature distillation. Student matches intermediate layer activations. Higher fidelity, more setup.

Synthetic data distillation. Use the teacher to generate training data for the student. The pattern behind Phi-3 and many modern small models.

Typical trade-off

5–10× smaller for 90–95% of capability on the targeted task distribution. Worse generalisation outside that distribution.

When you reach for it

When you control the training pipeline and have a specific deployment target. Less of a knob you turn at inference time; more of an architectural decision upstream.
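Learning from soft labels usually means minimising a divergence between the teacher's and the student's output distributions. The sketch below assumes the common formulation: a KL divergence on temperature-softened softmax outputs, scaled by the temperature squared. The logits are invented for illustration, and real training combines this term with a ground-truth loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 2.5, 0.5]        # hypothetical logits over three tokens
close_student = [3.8, 2.6, 0.4]
poor_student = [0.5, 2.5, 4.0]

print(f"close student: {distillation_loss(close_student, teacher):.4f}")
print(f"poor student:  {distillation_loss(poor_student, teacher):.4f}")
```

The loss rewards matching the whole distribution, including how much probability the teacher gives to the runner-up answers, which is the "confidence structure" described above.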

04

Low-rank factorisation

Replace one big matrix with two smaller ones

A weight matrix W (d × d) is approximated as the product of two thinner matrices A (d × r) and B (r × d), where r ≪ d. The total parameter count drops from d² to 2dr.

SVD-based compression. Decompose existing weights via singular value decomposition, keep the top-r components.

LoRA / QLoRA adapters. Not compression of the base model — additive low-rank adapters for efficient fine-tuning. Adjacent territory, often confused.

Tucker / tensor decomposition. Higher-order factorisations for convolutional and embedding layers.

Typical trade-off

Variable — depends entirely on the effective rank of the weight matrices, which is hard to predict.

When you reach for it

Niche for LLMs in 2026. Useful for embedding tables and specific architectural bottlenecks. LoRA dominates the conversation in this neighbourhood, but solves a different problem.
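The parameter arithmetic for a factorised square matrix is worth seeing with numbers. A minimal sketch; d = 4096 is an assumed hidden size, and the function names are illustrative. The ratio d² / 2dr simplifies to d / 2r, so the factorisation only saves anything when r is below d / 2.

```python
def dense_params(d: int) -> int:
    """Parameters in a full d x d weight matrix."""
    return d * d


def low_rank_params(d: int, r: int) -> int:
    """Parameters in A (d x r) plus B (r x d)."""
    return 2 * d * r


def compression_ratio(d: int, r: int) -> float:
    """Dense over factorised parameter count; equals d / (2r)."""
    return dense_params(d) / low_rank_params(d, r)


d = 4096  # assumed hidden size
for r in (64, 256, 1024, 2048):
    print(f"r = {r:>4}: {low_rank_params(d, r):>10,} params, "
          f"{compression_ratio(d, r):.1f}x smaller than {dense_params(d):,}")
```

At r = 2048 the two thin matrices hold exactly as many parameters as the original, which is the break-even point.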

06

The landscape

Where real models sit.

Quality is rough — anchored on MMLU and chat-arena Elo at time of release, normalised so Llama-3.1-8B FP16 ≈ 100. Size is on-disk bytes for the weights only.

Chart legend: General-purpose open · Small models (≤4B) · Tiny (≤1.5B) · Large (≥40B)

The frontier you actually care about runs diagonally from bottom-left to top-right. INT4 quantisation pulls every model leftward without dropping it meaningfully on the vertical axis — which is why it has become the default.

07

Decision guide

Situation: Self-hosting an open SLM on consumer hardware
First reach for: INT4 weight-only quantisation (AWQ or GPTQ)
Caveat: Drop to INT3 or INT2 only if desperate; otherwise pick a smaller model.

Situation: Production inference on a datacentre GPU
First reach for: FP8 or INT8 (TensorRT-LLM, vLLM with FP8 KV cache)
Caveat: Quality holds; latency and throughput improve materially.

Situation: Edge deployment (mobile, embedded)
First reach for: Distillation to a 1–3B model, then INT4 quantisation
Caveat: Optionally structured pruning of attention heads if the budget is tight.

Situation: Long-context inference (32K+ tokens)
First reach for: Quantise the KV cache (FP8 or INT4)
Caveat: KV cache dominates VRAM at long contexts, so quantising weights alone is not enough.

Situation: Fine-tuning to a domain
First reach for: LoRA / QLoRA adapters, not full retraining
Caveat: This is parameter-efficient training, not compression, but it ships in the same conversation.

A model is not really a thing; it is a budget — of memory, of arithmetic, of attention. Compression is the practice of spending that budget more cheaply for almost the same behaviour. Quantisation is the most efficient way to do it. It is not the only way.
