Official-Continuation Quality Testing

This directory contains the 100 prompts and scripts used to compare local GGUF variants against official DeepSeek V4 continuations.

The metric is target-token negative log likelihood: collect a deterministic official continuation, then ask each local GGUF how much probability it assigns to that exact continuation token by token. This avoids judging quality from one sampled answer.

1. Collect Official Continuations

export DEEPSEEK_API_KEY=...
python3 gguf-tools/quality-testing/collect_official.py \
  --prompts gguf-tools/quality-testing/prompts.jsonl \
  --out gguf-tools/quality-testing/data/flash \
  --count 100 \
  --max-tokens 24

Use one output directory per official model. The default model is Flash, so data/flash is the recommended path for Flash continuations. For PRO:

python3 gguf-tools/quality-testing/collect_official.py \
  --model deepseek-v4-pro \
  --prompts gguf-tools/quality-testing/prompts.jsonl \
  --out gguf-tools/quality-testing/data/pro \
  --count 100 \
  --max-tokens 24 \
  --top-logprobs 20

The script writes:

data/<model>/prompts/case_*.txt
data/<model>/continuations/case_*.txt
data/<model>/responses/case_*.json
data/<model>/manifest.tsv

The prompt list is tracked in prompts.jsonl; the official responses are not tracked because they are derived from an external API.

2. Build The Local Scorer

make -C gguf-tools quality-score

The scorer links against the DS4 runtime and uses Metal by default.

3. Score GGUF Variants

gguf-tools/quality-testing/score_official \
  ../deepseek-v4-quants/gguf/OLD.gguf \
  gguf-tools/quality-testing/data/flash/manifest.tsv \
  /tmp/old.tsv \
  4096

gguf-tools/quality-testing/score_official \
  ../deepseek-v4-quants/gguf/NEW.gguf \
  gguf-tools/quality-testing/data/flash/manifest.tsv \
  /tmp/new.tsv \
  4096

Use data/pro/manifest.tsv for PRO GGUFs. The scorer and comparator do not care which model produced the manifest; the manifest path selects the continuation set.

4. Compare

python3 gguf-tools/quality-testing/compare_scores.py /tmp/old.tsv /tmp/new.tsv

Output fields:

avg_nll: average negative log likelihood; lower is better.
delta_new_minus_old: negative means the new GGUF fits the official continuation better.
case_wins_new_old_ties: per-prompt NLL wins.
first_token_matches: how often the local greedy first token matches the official first token.
avg_greedy_lcp: average greedy longest common prefix against the official continuation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Official-Continuation Quality Testing

1. Collect Official Continuations

2. Build The Local Scorer

3. Score GGUF Variants

4. Compare

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Official-Continuation Quality Testing

1. Collect Official Continuations

2. Build The Local Scorer

3. Score GGUF Variants

4. Compare