Compression Lab

Compose precision, pruning, distillation, and context length. Watch the footprint, VRAM bill, and hardware tier shift as you go.

01

Compose the stack

Compose techniques. Watch the footprint move.

The numbers below are first-order estimates — a rule-of-thumb model, not a benchmark. Real-world results depend on calibration set, activation outliers, and whether you fine-tune after pruning. Use this to build intuition, not to commit a procurement.

Base model size8B parameters

0.5B8B30B70B

Numerical precisionINT4

GPTQ / AWQ regime · 4 bits per weight

Pruning (sparsity)0%

Fraction of weights set to zero. Above ~50% you typically need fine-tuning to recover.

Distillation target1× smaller

Train a student with this fraction of the teacher's parameters.

Context window8K tokens

KV-cache scales linearly with context length — it dominates VRAM at long contexts.

Weights on disk

4.0 GB

88% smaller

VRAM required

5.2 GB

incl. KV + activations

Quality retained

94.0%

noticeable drift

Hardware tier

Apple Silicon

≤ 8 GB

M-series unified memory, RTX 3060 8GB

Footprint vs FP32 baseline

Read the full Model Compression insight →