Benchmark Summary
The table highlights the strongest compression result on each row and keeps the uncompressed control column visible for comparison. All benchmarks were run with Llama-3.2-3B as the answering model, and Llama-3.2-3B also served as the LLM judge.
Compression speed
330-token sample latency
LLMLingua2 is the current prompt-compression state of the art, so the speed comparison uses it as the reference point for the other methods.
| Method | Latency (330-token sample) | Speed vs. LLMLingua2 |
|---|---|---|
| GPTZip | 30 ms | 40x faster |
| Headroom | 200 ms | 6x faster |
| Bear | 250 ms | 4.8x faster |
| LLMLingua2 | 1200 ms | Baseline / SOTA reference |
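The speedup factors above follow directly from the listed latencies: each one is the LLMLingua2 latency divided by the method's latency. A minimal sketch of that arithmetic is shown below; the dictionary and the function name `speedup_vs_reference` are illustrative and not part of any benchmarked tool.

```python
# Latencies for the 330-token sample, in milliseconds, as listed above.
LATENCY_MS = {
    "GPTZip": 30,
    "Headroom": 200,
    "Bear": 250,
    "LLMLingua2": 1200,
}


def speedup_vs_reference(latencies, reference="LLMLingua2"):
    """Return how many times faster each method is than the reference.

    Hypothetical helper for illustration: speedup = reference latency / method latency.
    """
    ref_ms = latencies[reference]
    return {
        name: ref_ms / ms
        for name, ms in latencies.items()
        if name != reference
    }


speedups = speedup_vs_reference(LATENCY_MS)
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than LLMLingua2")
```

Note that these figures come from a single 330-token sample, so they describe that input size only.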
Each cell lists accuracy (change vs. the control, in percentage points), median and maximum compression, the configuration used, and whether the result is labeled lossless or lossy. Bold marks the result flagged as best on each row.

| Dataset | Control (no compression) | GPTZip | Headroom | TheTokenCompany | LLM-Lingua |
|---|---|---|---|---|---|
| CoQA (`coqa`) | 76.6% | **76.6% (0.0pp*)**; 13.9% median, 19.2% max; `gptzip:v1+win4@100`; Lossless | 55.0% (-21.6pp); 51.8% median, 77.3% max; `headroom:auto`; Lossy | 76.5% (-0.1pp); 13.1% median, 25.9% max; `bear:bear-1.2@0.05`; Lossy | 58.1% (-18.5pp); 63.0% median, 66.0% max; `llmlingua2-conservative`; Lossy |
| SQuAD v2 (`squad_v2`) | 60.1% | **60.2% (+0.1pp*)**; 12.3% median, 22.4% max; `gptzip:v2+qaSafe+H1+win2@30`; Lossless | 56.0% (-4.1pp); 11.8% median, 51.1% max; `headroom:auto`; Lossy | 58.5% (-1.6pp); 9.0% median, 23.1% max; `bear:bear-1.2@0.05`; Lossy | 40.7% (-19.4pp); 62.5% median, 67.6% max; `llmlingua2-conservative`; Lossy |
| GPQA Diamond (`gpqa_diamond`) | 29.3% | **30.5% (+1.2pp*)**; 21.3% median, 70.7% max; `gptzip:v1+H1+J2@100`; Lossless | 26.8% (-2.5pp); 12.2% median, 95.4% max; `headroom:auto`; Lossy | 29.3% (0.0pp); 5.8% median, 19.8% max; `bear:bear-1.2@0.05`; Lossless | 23.2% (-6.1pp); 57.7% median, 82.7% max; `llmlingua2-conservative`; Lossy |
| FinanceBench (`financebench`) | 84.1% | **85.0% (+0.9pp*)**; 16.5% median, 40.0% max; `gptzip:v1+H1@100`; Lossless | 75.9% (-8.2pp); 26.3% median, 83.8% max; `headroom:auto`; Lossy | 77.7% (-6.4pp); 1.5% median, 14.1% max; `bear:bear-1.2@0.05`; Lossy | 62.2% (-21.9pp); 60.3% median, 78.1% max; `llmlingua2-conservative`; Lossy |
| HumanEval (`humaneval`) | 58.0% | **60.0% (+2.0pp*)**; 14.0% median, 34.2% max; `gptzip:v4@100`; Lossless | 52.0% (-6.0pp); 14.7% median, 64.6% max; `headroom:auto`; Lossy | 60.0% (+2.0pp); 9.8% median, 33.6% max; `bear:bear-1.2@0.05`; Lossless | 19.0% (-39.0pp); 61.7% median, 70.8% max; `llmlingua2-conservative`; Lossy |
| LongBench v2 (`longbench_v2`) | 42.1% | **47.1% (+5.0pp*)**; 28.9% median, 41.8% max; `gptzip:v3+sw+H1@100`; Lossless | no result listed | 38.9% (-3.2pp); 9.8% median, 21.8% max; `bear:bear-1.2@0.05`; Lossy | no result listed |
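The percentage-point figures in parentheses are the compressed accuracy minus the control accuracy on the same dataset. The sketch below recomputes them for the GPTZip column as a sanity check; the dictionaries and the helper `delta_pp` are illustrative names, and the values are copied from the table.

```python
# Accuracy (%) per dataset, copied from the table above.
CONTROL = {
    "coqa": 76.6,
    "squad_v2": 60.1,
    "gpqa_diamond": 29.3,
    "financebench": 84.1,
    "humaneval": 58.0,
    "longbench_v2": 42.1,
}
GPTZIP = {
    "coqa": 76.6,
    "squad_v2": 60.2,
    "gpqa_diamond": 30.5,
    "financebench": 85.0,
    "humaneval": 60.0,
    "longbench_v2": 47.1,
}


def delta_pp(accuracy, control):
    """Change in percentage points, rounded to one decimal place.

    Hypothetical helper: the rounding only hides floating-point noise.
    """
    return round(accuracy - control, 1)


deltas = {name: delta_pp(GPTZIP[name], CONTROL[name]) for name in CONTROL}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}pp")
```

The same subtraction applies to the other columns, for example 55.0% minus 76.6% gives the -21.6pp listed for Headroom on CoQA.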