Benchmark Summary

For each benchmark, the table highlights the strongest compression result and keeps the uncompressed control row visible for comparison. All results were produced with Llama-3.2-3B as the answering model, and Llama-3.2-3B also served as the LLM-as-a-judge.

Compression speed

330-token sample latency

LLMLingua2 is the current prompt-compression state of the art, so the speed comparison uses it as the reference point for the other methods.

Method      Latency   Speed vs. LLMLingua2
----------  --------  -------------------------
GPTZip      30 ms     40x faster
Headroom    200 ms    6x faster
Bear        250 ms    4.8x faster
LLMLingua2  1200 ms   Baseline (SOTA reference)
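
The speed-up factors above follow from dividing the reference latency by each method's latency. A minimal Python sketch of that arithmetic, using the latencies listed in this section; the names `LATENCY_MS` and `speedup_vs_reference` are illustrative and not part of any of the benchmarked tools:

```python
# Latencies in milliseconds for the 330-token sample, as listed in this section.
LATENCY_MS = {"GPTZip": 30, "Headroom": 200, "Bear": 250, "LLMLingua2": 1200}


def speedup_vs_reference(latency_ms, reference="LLMLingua2"):
    """Return how many times faster each method is than the reference method."""
    ref = latency_ms[reference]
    return {name: ref / ms for name, ms in latency_ms.items() if name != reference}


speedups = speedup_vs_reference(LATENCY_MS)
for name, factor in speedups.items():
    print(f"{name}: {factor:g}x faster than LLMLingua2")
```

Keeping the reference method as a parameter makes it easy to recompute the factors against a different baseline if the comparison point changes.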

CoQA

coqa
Method           Accuracy  vs baseline  Median compression  Max compression  Configuration                Label
---------------  --------  -----------  ------------------  ---------------  ---------------------------  --------
Control          76.6%     n/a          none                none             control                      Baseline
GPTZip (best)    76.6%     0.0pp*       13.9%               19.2%            gptzip:v1+win4@100           Lossless
Headroom         55.0%     -21.6pp      51.8%               77.3%            headroom:auto                Lossy
TheTokenCompany  76.5%     -0.1pp       13.1%               25.9%            bear:bear-1.2@0.05           Lossy
LLM-Lingua       58.1%     -18.5pp      63.0%               66.0%            llmlingua2-conservative      Lossy
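
The percentage-point (pp) figures appear to be each method's accuracy minus the control accuracy on the same benchmark. A small Python sketch that recomputes the CoQA deltas under that assumption; `delta_pp` and `COQA_ACCURACY` are hypothetical names introduced only for this illustration:

```python
# CoQA accuracies in percent, copied from the figures above.
COQA_BASELINE = 76.6
COQA_ACCURACY = {
    "GPTZip": 76.6,
    "Headroom": 55.0,
    "TheTokenCompany": 76.5,
    "LLM-Lingua": 58.1,
}


def delta_pp(accuracy_pct, baseline_pct):
    """Difference from the baseline in percentage points, rounded to one decimal."""
    return round(accuracy_pct - baseline_pct, 1)


deltas = {name: delta_pp(acc, COQA_BASELINE) for name, acc in COQA_ACCURACY.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.1f}pp vs baseline")
```

A percentage-point delta is an absolute difference between two percentages, not a relative change, so a drop from 76.6% to 55.0% is reported as -21.6pp.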

SQuAD v2

squad_v2
Method           Accuracy  vs baseline  Median compression  Max compression  Configuration                Label
---------------  --------  -----------  ------------------  ---------------  ---------------------------  --------
Control          60.1%     n/a          none                none             control                      Baseline
GPTZip (best)    60.2%     +0.1pp*      12.3%               22.4%            gptzip:v2+qaSafe+H1+win2@30  Lossless
Headroom         56.0%     -4.1pp       11.8%               51.1%            headroom:auto                Lossy
TheTokenCompany  58.5%     -1.6pp       9.0%                23.1%            bear:bear-1.2@0.05           Lossy
LLM-Lingua       40.7%     -19.4pp      62.5%               67.6%            llmlingua2-conservative      Lossy
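
The summary does not spell out how the median and max compression figures are computed. One plausible reading, sketched below in Python, is that compression is the percentage of tokens removed from each prompt, summarized across all samples of a benchmark. The definition in `compression_pct` is an assumption, and the token counts in `samples` are made-up illustrative values, not benchmark data:

```python
from statistics import median


def compression_pct(original_tokens, compressed_tokens):
    """Percentage of tokens removed by compression (assumed definition)."""
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


# Hypothetical (original, compressed) token counts; NOT taken from the benchmark.
samples = [(400, 340), (250, 225), (1000, 800), (120, 120)]

ratios = [compression_pct(orig, comp) for orig, comp in samples]
median_pct = median(ratios)  # typical per-sample saving
max_pct = max(ratios)        # best-case per-sample saving

print(f"median compression: {median_pct:.1f}%")
print(f"max compression: {max_pct:.1f}%")
```

Reporting both values is useful because the median shows the saving on a typical prompt, while the max shows how far a method can go on its most compressible input.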

GPQA Diamond

gpqa_diamond
Method           Accuracy  vs baseline  Median compression  Max compression  Configuration                Label
---------------  --------  -----------  ------------------  ---------------  ---------------------------  --------
Control          29.3%     n/a          none                none             control                      Baseline
GPTZip (best)    30.5%     +1.2pp*      21.3%               70.7%            gptzip:v1+H1+J2@100          Lossless
Headroom         26.8%     -2.5pp       12.2%               95.4%            headroom:auto                Lossy
TheTokenCompany  29.3%     0.0pp        5.8%                19.8%            bear:bear-1.2@0.05           Lossless
LLM-Lingua       23.2%     -6.1pp       57.7%               82.7%            llmlingua2-conservative      Lossy

FinanceBench

financebench
Method           Accuracy  vs baseline  Median compression  Max compression  Configuration                Label
---------------  --------  -----------  ------------------  ---------------  ---------------------------  --------
Control          84.1%     n/a          none                none             control                      Baseline
GPTZip (best)    85.0%     +0.9pp*      16.5%               40.0%            gptzip:v1+H1@100             Lossless
Headroom         75.9%     -8.2pp       26.3%               83.8%            headroom:auto                Lossy
TheTokenCompany  77.7%     -6.4pp       1.5%                14.1%            bear:bear-1.2@0.05           Lossy
LLM-Lingua       62.2%     -21.9pp      60.3%               78.1%            llmlingua2-conservative      Lossy

HumanEval

humaneval
Method           Accuracy  vs baseline  Median compression  Max compression  Configuration                Label
---------------  --------  -----------  ------------------  ---------------  ---------------------------  --------
Control          58.0%     n/a          none                none             control                      Baseline
GPTZip (best)    60.0%     +2.0pp*      14.0%               34.2%            gptzip:v4@100                Lossless
Headroom         52.0%     -6.0pp       14.7%               64.6%            headroom:auto                Lossy
TheTokenCompany  60.0%     +2.0pp       9.8%                33.6%            bear:bear-1.2@0.05           Lossless
LLM-Lingua       19.0%     -39.0pp      61.7%               70.8%            llmlingua2-conservative      Lossy

LongBench v2

longbench_v2
Method           Accuracy  vs baseline  Median compression  Max compression  Configuration                Label
---------------  --------  -----------  ------------------  ---------------  ---------------------------  --------
Control          42.1%     n/a          none                none             control                      Baseline
GPTZip (best)    47.1%     +5.0pp*      28.9%               41.8%            gptzip:v3+sw+H1@100          Lossless
TheTokenCompany  38.9%     -3.2pp       9.8%                21.8%            bear:bear-1.2@0.05           Lossy