This directory contains the 100 prompts and scripts used to compare local GGUF variants against official DeepSeek V4 continuations.
The metric is target-token negative log likelihood: collect a deterministic official continuation, then ask each local GGUF how much probability it assigns to that exact continuation token by token. This avoids judging quality from one sampled answer.
export DEEPSEEK_API_KEY=...
python3 gguf-tools/quality-testing/collect_official.py \
--prompts gguf-tools/quality-testing/prompts.jsonl \
--out gguf-tools/quality-testing/data/flash \
--count 100 \
--max-tokens 24Use one output directory per official model. The default model is Flash, so
data/flash is the recommended path for Flash continuations. For PRO:
python3 gguf-tools/quality-testing/collect_official.py \
--model deepseek-v4-pro \
--prompts gguf-tools/quality-testing/prompts.jsonl \
--out gguf-tools/quality-testing/data/pro \
--count 100 \
--max-tokens 24 \
--top-logprobs 20The script writes:
data/<model>/prompts/case_*.txtdata/<model>/continuations/case_*.txtdata/<model>/responses/case_*.jsondata/<model>/manifest.tsv
The prompt list is tracked in prompts.jsonl; the official responses are not
tracked because they are derived from an external API.
make -C gguf-tools quality-scoreThe scorer links against the DS4 runtime and uses Metal by default.
gguf-tools/quality-testing/score_official \
../deepseek-v4-quants/gguf/OLD.gguf \
gguf-tools/quality-testing/data/flash/manifest.tsv \
/tmp/old.tsv \
4096
gguf-tools/quality-testing/score_official \
../deepseek-v4-quants/gguf/NEW.gguf \
gguf-tools/quality-testing/data/flash/manifest.tsv \
/tmp/new.tsv \
4096Use data/pro/manifest.tsv for PRO GGUFs. The scorer and comparator do not
care which model produced the manifest; the manifest path selects the
continuation set.
python3 gguf-tools/quality-testing/compare_scores.py /tmp/old.tsv /tmp/new.tsvOutput fields:
avg_nll: average negative log likelihood; lower is better.delta_new_minus_old: negative means the new GGUF fits the official continuation better.case_wins_new_old_ties: per-prompt NLL wins.first_token_matches: how often the local greedy first token matches the official first token.avg_greedy_lcp: average greedy longest common prefix against the official continuation.