Qwen3.6-27B-AEON-Ultimate-Uncensored

Lossless abliteration · Capability-enhanced · NVFP4 hardware-quantized for Blackwell

Refusals: 0 / 100 · KL vs base: 0.000492 · Compression: 49 % · Capability: enhanced

TL;DR

A fully uncensored, capability-enhanced abliteration of Qwen/Qwen3.6-27B, produced over 72 hours of continuous research drawing on hundreds of parallel AI research agents, the industry's best published methodologies, custom in-house techniques, and yet-unreleased pre-public branches of next-generation abliteration software.

Performance — DGX Spark DFlash vs Raw Baseline

This is the headline. On DGX Spark / GB10, the AEON DFlash container turns the default “it runs, but it feels slow” baseline into a usable long-context local agent model.

The throughput table below was measured on the earlier qwen36-v4 image with DFlash k=15. The production container and recipe have since been unified onto ghcr.io/aeon-7/aeon-vllm-ultimate:latest with DFlash num_speculative_tokens: 12 (see why long context below). The single-stream throughput figures here are representative of the Spark DFlash path; the unified image's specific win is long-context draft acceptance — 45% vs 19.7% on the pre-fix image (a 2.3× gain) at ~9k-token context — not short-context tok/s.

Deployment	Container	DFlash	CUDA graphs	Tool calling	Avg c=1 decode
🔴 Raw baseline	`vllm/vllm-openai:nightly`	off	off (`--enforce-eager`)	off	10.49 tok/s
🟢 AEON DFlash	`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`	n=12	on	on	37.56 tok/s †

† These single-stream tok/s (37.56 / 10.49) were measured on the earlier qwen36-v4 image at DFlash k=15, not on aeon-vllm-ultimate:latest at n=12. The unified image's specific gain is long-context draft acceptance (45% vs 19.7% at ~9k tokens), not these short-context tok/s — see why long context.

Average single-stream decode improvement: +258% over the raw stock eager baseline.

Single-Stream Decode

All figures in this section and the concurrency tables below were measured on the earlier qwen36-v4 image at DFlash k=15 (see the note above) — representative of the Spark DFlash path, not a re-run on aeon-vllm-ultimate:latest at n=12.

Category	🔴 Raw baseline	🟢 AEON DFlash	Approx. speed increase	DFlash TTFT	DFlash TPOT
Coding	10.70 tok/s	31.89 tok/s	+198%	191 ms	30.5 ms
Math	10.01 tok/s	37.76 tok/s	+277%	225 ms	25.5 ms
Reasoning	10.54 tok/s	42.41 tok/s	+303%	221 ms	22.6 ms
Prose	10.59 tok/s	31.85 tok/s	+201%	212 ms	30.4 ms
Natural language	10.56 tok/s	31.99 tok/s	+203%	183 ms	30.3 ms
Extraction / JSON	10.56 tok/s	49.48 tok/s	+369%	227 ms	19.2 ms
Average	10.49 tok/s	37.56 tok/s	+258%	~210 ms	~26.4 ms
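The "Approx. speed increase" column follows directly from the two throughput columns. The sketch below recomputes it from the figures in the table above; the names `RESULTS`, `percent_gain`, and `column_mean` are illustrative and not part of any AEON tooling. Because the published tok/s values are already rounded, a recomputed percentage can differ from the table by about one point.

```python
# Throughput figures copied from the single-stream table above (tok/s).
# Each entry is (raw baseline, AEON DFlash).
RESULTS = {
    "Coding": (10.70, 31.89),
    "Math": (10.01, 37.76),
    "Reasoning": (10.54, 42.41),
    "Prose": (10.59, 31.85),
    "Natural language": (10.56, 31.99),
    "Extraction / JSON": (10.56, 49.48),
}


def percent_gain(baseline: float, optimized: float) -> float:
    """Relative throughput improvement over the baseline, in percent."""
    return (optimized / baseline - 1.0) * 100.0


def column_mean(index: int) -> float:
    """Mean of one throughput column (0 = raw, 1 = DFlash) across categories."""
    return sum(row[index] for row in RESULTS.values()) / len(RESULTS)


if __name__ == "__main__":
    for category, (raw, dflash) in RESULTS.items():
        print(f"{category:18s} +{percent_gain(raw, dflash):.0f}%")
    print(f"{'Average':18s} +{percent_gain(column_mean(0), column_mean(1)):.0f}%")
```

The average row is the gain between the two column means, not the mean of the per-category percentages.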

Practical Agent Concurrency

At c=16, the optimized container keeps active streams much more responsive. Aggregate throughput improves most on structured agent/tool workloads, and TPOT drops across every category. (Columns labelled AEON below are the DFlash path on the AEON container.)

Category	🔴 Raw c=16 aggregate / TPOT	🟢 AEON c=16 aggregate / TPOT	Aggregate change
Coding	134.47 tok/s / 115.1 ms	144.45 tok/s / 61.5 ms	+7%
Math	134.38 tok/s / 115.1 ms	193.94 tok/s / 41.6 ms	+44%
Reasoning	134.86 tok/s / 115.4 ms	187.82 tok/s / 46.6 ms	+39%
Prose	135.34 tok/s / 115.3 ms	121.34 tok/s / 80.6 ms	-10% aggregate, 30% lower TPOT
Natural language	129.82 tok/s / 117.7 ms	130.19 tok/s / 71.2 ms	~flat aggregate, 39% lower TPOT
Extraction / JSON	133.30 tok/s / 115.4 ms	219.11 tok/s / 43.2 ms	+64%

Stress Saturation

c=256 is a saturation test, not the recommended interactive setting. The baseline can report high aggregate throughput by letting every stream crawl. The AEON DFlash path keeps per-active-stream TPOT far lower, but at c=256 requests queue hard and TTFT rises into minutes.

Category	🔴 Raw c=256 TPOT	🟢 AEON c=256 TPOT	AEON c=256 TTFT
Coding	575.5 ms	70.0 ms	149.6 s
Math	531.9 ms	42.7 ms	103.6 s
Reasoning	540.7 ms	49.4 ms	109.3 s
Prose	532.5 ms	77.1 ms	159.8 s
Natural language	533.4 ms	72.9 ms	160.0 s
Extraction / JSON	551.9 ms	43.2 ms	90.4 s

What the AEON Container Adds

Unified production image: ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= tag :2026-06-11-pr41703; rollback tag :2026-06-04-pr44389). This single image supersedes the older qwen36-v3 / v4 / v5 lineage of per-revision containers.
FlashInfer NVFP4 GEMM path
DFlash sliding-window-attention compatibility patch from vLLM PR #40898 (4 of the drafter's 5 layers are sliding-window; earlier images ran them as full attention and drafting collapsed past ~2048 tokens)
vLLM PR #41703 makes --enable-prefix-caching corruption-immune with DFlash
CUTLASS NVFP4 fast path selected for GB10 / sm_121a
DFlash num_speculative_tokens: 12 (validated production default) using z-lab/Qwen3.6-27B-DFlash, drafter backend left at default
Qwen3 reasoning parser and Qwen3-Coder tool-call parser enabled
Packaged gateway/production/benchmark profiles so users do not have to hand-assemble the full vLLM command

Why the Spark recipe is tuned for long context

The z-lab Qwen3.6-27B DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). vLLM PR #40898 (in aeon-vllm-ultimate:latest) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (under 2048 tokens, one window) is unchanged.
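To see why short contexts are unaffected while long contexts are, it helps to count how many positions a layer can see in each mode. This is a minimal sketch of the window arithmetic only. `attended_span` is a hypothetical helper rather than a vLLM API, and the causal "everything up to the query" convention is a simplifying assumption.

```python
from typing import Optional

SLIDING_WINDOW = 2048  # window size stated above for the drafter's SWA layers


def attended_span(query_pos: int, window: Optional[int]) -> int:
    """Number of positions visible to a query at 0-indexed `query_pos`.

    window=None models full attention over everything seen so far;
    an integer models sliding-window attention over the last `window` positions.
    The causal convention here is an illustrative assumption.
    """
    context_len = query_pos + 1
    return context_len if window is None else min(context_len, window)


if __name__ == "__main__":
    for pos in (1_000, 2_047, 9_000):
        full = attended_span(pos, None)
        swa = attended_span(pos, SLIDING_WINDOW)
        print(f"pos={pos}: full={full} swa={swa} identical={full == swa}")
```

Up to 2048 tokens the two modes see the same positions, which matches the "short-context is unchanged" observation. At ~9k tokens a layer run as full attention sees several times more positions than its window allows.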

DDTree v5 Research Track

DDTree is the next obvious performance target, but it must land inside vLLM without losing multimodal, reasoning, tool calling, NVFP4, or the OpenAI-compatible gateway surface. The DDTree research track (the old qwen36-v5 experimental container) is superseded for production by the unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest image — pull that image for any actual deployment:

docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

Read the full DDTree card and lab chronicle: docs/qwen36-ddtree-card.md.

Current status in one line: flat DFlash remains the production path on the unified AEON container; DDTree is research-only. The unified image preserves the same NVFP4, DFlash, multimodal, reasoning, tool-calling, and OpenAI-compatible vLLM surface, but true non-flat branch commit is still research-only.

The working implementation plan lives in docs/ddtree-vllm-integration-plan.md. M1 scaffolding and the current experimental Docker context live in container/qwen36-v5-ddtree-experimental/.

The DDTree card documents:

the current container tags and digest,
what works today,
the M1 through M53 trial-and-error path,
the current M53 non-flat probe status,
known blockers around branch-state GDN replay, fused branch attention, and accepted-branch commit,
benchmark context and caveats,
where community help is most likely to move the project forward.

Raw benchmark files:

The DFlash sweep used natural prompts across coding, math, reasoning, prose, everyday language, and extraction/JSON. It intentionally used a short-context benchmark profile to isolate decode/scheduler behavior: --max-model-len 2048, --max-num-seqs 256, prefix caching disabled, thinking enabled, 200 output tokens, minimum 16 samples per point, 20% trimmed median. For production DFlash gateway use, prefix caching is workload-dependent: it is valuable when many agents share a stable prompt prefix, but DDTree research modes keep it off while branch-state correctness is under development.

Model Variants

Six release formats covering DGX Spark, RTX PRO 6000, RTX 5090, and pre-Blackwell hardware:

Release	Size	Target hardware	Use when
BF16	51 GB	A100 / H100 80 GB · RTX PRO 6000 Blackwell 96 GB	You have Ampere/Hopper or want full-precision reference weights
NVFP4	26 GB	Simpler NVFP4 deployments	llm-compressor format, `--quantization compressed-tensors`. For best DGX Spark performance, use the DFlash recipe (`aeon-vllm-ultimate:latest`, n=12) with the XS body below.
Multimodal-NVFP4-MTP	27 GB	RTX PRO 6000 Blackwell · B100/B200	modelopt format, `--quantization modelopt`, MTP spec decode via grafted `mtp.` head. Vision tower preserved. GDN linear-attention preserved at BF16 for best long-context fidelity.
Text-NVFP4-MTP	26 GB	RTX PRO 6000 · text-only deployments	Same recipe as Multimodal-NVFP4-MTP, vision tower stripped. GDN preserved BF16.
Multimodal-NVFP4-MTP-XS	21 GB	RTX 5090 (32 GB) · tighter dedicated VRAM	Strategic split: GDN projection matmuls → NVFP4; `linear_attn.conv1d` kept BF16 to preserve the recurrence-critical SSM convolution. Vision tower preserved.
Text-NVFP4-MTP-XS	20 GB	RTX 5090 text-only · 24 GB cards	Same conv1d-preserved strategic split as Multimodal-XS, vision tower stripped. The smallest variant we ship.

All six formats are the same underlying model. NVFP4 KL divergence vs BF16 source is below the noise floor of stochastic sampling — you cannot tell them apart at the output level. The four MTP variants share the same NVFP4 quantization quality plus the original Qwen/Qwen3.6-27B MTP head grafted back in BF16 (bit-exact, verified) for spec-decode drafting.

Regular MTP vs XS — what's the difference, and why it's a strategic quantization choice (not a precision compromise):

The GatedDeltaNet (GDN / Mamba-style) linear_attn.* block has two distinct components: the heavy projection matmuls (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj — ~11 GB total) and the SSM 1D convolution kernel (linear_attn.conv1d — small, but recurrence-critical).

Regular MTP variants keep both at BF16. Maximum numerical safety margin, larger footprint.

XS variants quantize the projection matmuls to NVFP4 (saves ~6 GB; FP4 is a clean win on bandwidth-bound matmuls) but explicitly preserve linear_attn.conv1d at BF16. FP4 quantization of conv1d has been observed to cause drift on long-context recurrence in community testing, so we keep it at BF16 — the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads). This is not "everything to FP4" — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

Pick regular if you have ≥48 GB VRAM and want best precision on long-context workloads; pick XS if you're on a 24–32 GB card and want maximum KV headroom with the SSM kernel still numerically stable.
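The regular-versus-XS split can be written down as a small routing rule over the GDN module names listed above. This sketch only encodes that rule. The function `gdn_precision`, the full module paths in the usage note, and the conservative BF16 default for unlisted GDN tensors are assumptions for illustration, not the actual quantization recipe.

```python
from typing import Optional

# Projection matmuls named in the text above ("in_proj_a/b" read as two modules).
GDN_PROJECTIONS = ("in_proj_qkv", "in_proj_z", "in_proj_a", "in_proj_b", "out_proj")


def gdn_precision(module_name: str, variant: str) -> Optional[str]:
    """Precision of a linear_attn module under the "regular" or "xs" split.

    Returns None for modules outside the GDN block (not covered by this sketch).
    """
    if variant not in ("regular", "xs"):
        raise ValueError(f"unknown variant: {variant!r}")
    if ".linear_attn." not in module_name:
        return None
    leaf = module_name.rsplit(".", 1)[-1]
    if leaf == "conv1d":
        return "BF16"  # recurrence-critical SSM convolution stays BF16 in both
    if leaf in GDN_PROJECTIONS:
        return "NVFP4" if variant == "xs" else "BF16"
    return "BF16"  # assumption: anything unlisted is kept at BF16
```

For example, `gdn_precision("model.layers.3.linear_attn.in_proj_qkv", "xs")` gives `"NVFP4"`, while the same layer's `conv1d` stays `"BF16"` in both variants.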

Hardware routing:

DGX Spark (GB10 / sm_121a) → use the aeon-vllm-ultimate:latest container with DFlash (n=12) and the Multimodal-NVFP4-MTP-XS body. That is the benchmarked path above.

Dedicated-VRAM Blackwell (RTX PRO 6000 / RTX 5090 / B100/B200) → use the MTP variants (qwen3_5_mtp n=3) when you want the grafted native MTP head. Dedicated VRAM behaves differently from Spark's unified memory, so benchmark locally before copying Spark flags.

Performance — DGX Spark DFlash vs Raw Baseline
Model variants
What this is
Final stats
Hardware compatibility matrix
QuickStart — DGX Spark
QuickStart — A100 / H100 (BF16)
In-depth: the abliteration methodology
In-depth: NVFP4 quantization
Capability enhancement: the lifted "safety tax"
Configuration reference
Responsibility, arbitration, and use
Provenance & credits
License

What this is

This is the definitive uncensored release of Qwen 3.6 27B: alignment-overhead removal so surgical that the model's KL divergence from the base is 0.000492, roughly 200× (more than two orders of magnitude) below the empirically observed "capability damage threshold" of KL ≈ 0.1, and below the noise floor of ordinary stochastic sampling. A user cannot distinguish this model from the base on capability tasks; on several measurable axes (chain-of-thought commitment, adversarial-reasoning bandwidth, calibration honesty), it is better.

This is not a weekend abliteration. The release is the product of 72 hours of continuous research and tuning in which hundreds of parallel AI research agents were dispatched to:

Characterize Qwen 3.5 / 3.6 hybrid-attention internals (16 full-attention layers + 48 GatedDeltaNet / linear-attention layers, attn_output_gate=True with doubled q_proj geometry, the FernflowerAI SSM conv1d outlier pattern).
Survey the post-training-intervention literature in full: Arditi et al. (refusal as a single direction), grimjim's NPBA (norm-preserving biprojected abliteration), Heretic, Wuwangzhang's abliterix, Huang et al. on the safety tax, Xie et al. on DGR safety-tax mitigation, the projected-abliteration extensions, the winsorization heuristics.
Audit every relevant arXiv submission of 2024–2026 on alignment-direction interventions, capability preservation, and 4-bit quantization on hybrid-attention stacks.
Comb the r/LocalLLaMA community archive for tribal knowledge on what does and does not work — particularly on Mamba / GatedDeltaNet hybrids, where most generic abliteration recipes silently fail.
Trace the GitHub commit graphs of the abliteration tooling ecosystem to identify pre-public development branches that fix bugs unfixed in the public releases.

The pipeline that emerged integrates the industry's best published methodologies — Arditi-style mean-difference refusal vectors, NPBA, projected abliteration with outlier-aware winsorization, FernflowerAI's SSM conv1d outlier repair, abliterix v1.4's multi-objective Optuna search — alongside custom in-house techniques developed for Qwen 3.6's idiosyncratic attention geometry, and yet-unreleased pre-public branches of the next-generation abliteration toolchain integrated through direct collaboration with upstream maintainers.

The 50-trial Optuna search was cross-validated against a 10-axis capability spot-check to catch the documented "low-KL but word-salad" over-abliteration trap that pure refusal-rate scoring will miss. Trial 46 was selected — not the lowest-KL trial, but the one that combined zero refusals with full capability coherence.

Final stats

Refusal rate (apples-to-apples)

Metric	Base Qwen3.6-27B	AEON-Ultimate
Refusals on harmful prompts	99 / 100	0 / 100
Verdict	heavily aligned	uncensored
Compliance rate	1 %	100 %

Tested on a 100-prompt adversarial battery from mlabonne/harmful_behaviors covering cybercrime, weapons, violence, self-harm, hate speech, and synthesis instructions. Same denominator as the base evaluation.

Capability preservation

Metric	Value
First-3-token KL divergence vs base	0.000492
Output length deviation vs base	0.027 σ
Capability spot-checks (10 axes)	10 / 10 coherent
Math · code · reasoning · knowledge · long-form	All preserved

Capability axes verified: arithmetic word problems, linear algebra, calculus, Python with memoization, Rust UTF-8 string handling, transitive syllogisms, the bat-and-ball intuition trap, factual recall, technical contrast (TCP vs UDP), structured pedagogical long-form. Every axis produced coherent, structured, reasoning-forward outputs — no looping, no philosophizing spirals, no word-salad.

KL divergence detail

Distribution metric	Value
First-3-token KL vs base	0.000492
Winsorization quantile	0.995 (outlier-aware)
Projection	orthogonal + projected-abliteration (NPBA-style)
Trials evaluated	50 (15 random warmup + 35 TPE-driven Optuna)
Selected trial	#46 (winner, COHERENT)

The empirically observed "capability damage threshold" in the abliteration literature is KL ≈ 0.1. AEON-Ultimate's KL is ~200× below that threshold.
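For readers who want to reproduce a number of this kind, the following is a minimal sketch of a first-k-token KL metric. The direction KL(base || edited), the plain average over positions, and the use of nats are assumptions; abliterix's exact definition may differ. All function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def first_k_token_kl(base_logits: Sequence[Sequence[float]],
                     edited_logits: Sequence[Sequence[float]],
                     k: int = 3) -> float:
    """Mean KL(base || edited) over the first k generated positions."""
    pairs = list(zip(base_logits, edited_logits))[:k]
    return sum(kl_divergence(softmax(b), softmax(e)) for b, e in pairs) / len(pairs)


if __name__ == "__main__":
    # Ratio between the cited damage threshold and the reported KL.
    print(f"{0.1 / 0.000492:.0f}x below threshold")
```

Identical logits give exactly zero, and the ratio printed at the end is where the "~200×" figure comes from.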

Hardware compatibility matrix

The right variant depends on memory architecture, not just GPU model. DGX Spark should use the aeon-vllm-ultimate:latest DFlash container above; dedicated-VRAM Blackwell can use the MTP variants when the native MTP head is desired.

Hardware	Recommended variant	Why this exact variant	Spec-decode method
DGX Spark / GB10 (sm_121a, unified memory)	🏆 `-Multimodal-NVFP4-MTP-XS` body + DFlash + `aeon-vllm-ultimate:latest` image	Current recommended path. The unified image packages CUTLASS NVFP4, CUDA graphs, the DFlash sliding-window-attention patch (PR #40898), prefix-cache-safe DFlash (PR #41703), Qwen3 reasoning parsing, and Qwen3-Coder tool parsing. Supersedes the old `qwen36-v3/v4/v5` lineage.	DFlash n=12 via `z-lab/Qwen3.6-27B-DFlash` drafter
B100 / B200 (sm_100, dedicated FP4 silicon)	`-Multimodal-NVFP4-MTP` (preferred — GDN BF16 fits) or Text variant	Native FP4 via `tcgen05` / UTCQMMA — fastest hardware for this format. Dedicated VRAM bandwidth lets MTP's high acceptance rate translate to throughput.	qwen3_5_mtp n=3 (head grafted bf16, in repo)
RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated)	`-Multimodal-NVFP4-MTP` for vision · `-Text-NVFP4-MTP` for text-only · XS siblings for tighter memory budgets	Dedicated VRAM has different bandwidth behavior than Spark unified memory. Start with the MTP variants and benchmark locally.	qwen3_5_mtp n=3
RTX 5090 (sm_120, 32 GB dedicated)	`-Multimodal-NVFP4-MTP-XS` (21 GB) if you use vision · `-Text-NVFP4-MTP-XS` (20 GB) if text-only	Regular MTP variants (~27 GB) leave too little KV headroom on 32 GB. XS variants (conv1d preserved BF16, projection matmuls FP4) fit comfortably.	qwen3_5_mtp n=3
Other 24–48 GB cards (RTX 4090, RTX 3090, RTX A6000)	`-Text-NVFP4-MTP-XS` (20 GB)	The smallest variant. Pre-Blackwell sm_<120 will dequantize NVFP4 → BF16 at the kernel level (no FP4 silicon win), but the model still works and KV fits.	qwen3_5_mtp n=3
H100 80 GB (sm_90)	`-BF16`	NVFP4 dequants to BF16 at kernel level — works but no throughput gain. Use BF16 for cleaner code path.	none (or external EAGLE / Medusa drafter)
A100 80 GB (sm_80)	`-BF16`	Same as H100. BF16 at 131K context, single-GPU.	none
Multi-GPU (any tier)	`-BF16` (`tensor-parallel-size 2/4/8`)	Reference weights for fine-tuning, distillation, or quant-recipe development.	none
Anything older than A100	Not supported	Won't fit + lacks attention backends.

QuickStart — DGX Spark 🏆 (XS body + DFlash, recommended winner)

Pick this for DGX Spark. This is the current packaged winner for real GB10 use: the XS+DFlash path on aeon-vllm-ultimate:latest averages 37.56 tok/s single-stream across six natural prompt categories versus 10.49 tok/s for the raw stock eager baseline (throughput measured on the prior qwen36-v4 image; the unified image's specific gain is long-context draft acceptance). It preserves multimodal input, reasoning parsing, and OpenAI-compatible tool calls.

The XS body includes a grafted MTP head, but the Spark recipe intentionally uses external DFlash with num_speculative_tokens: 12. Do not switch the Spark compose file to method:"qwen3_5_mtp" unless you are deliberately running an ablation.

Why DFlash n=12 and why it holds up at long context: as explained under "Why the Spark recipe is tuned for long context" above, aeon-vllm-ultimate:latest runs the drafter's four sliding-window layers (window 2048) as proper SWA (vLLM PR #40898) and makes --enable-prefix-caching corruption-immune with DFlash (PR #41703), so drafting no longer collapses past ~2048 tokens while short-context behavior is unchanged. num_speculative_tokens: 12 is the validated production default (statistically tied with n=10 at short context, best long-context acceptance). Leave the drafter attention backend at default and do not set --kv-cache-dtype (the non-causal DFlash drafter requires BF16 KV).

Step 1 — Authenticate to HuggingFace and pull both models

hf auth login                                    # one time, paste your HF token

hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
  --local-dir ./models/aeon-ultimate-multimodal-nvfp4-mtp-xs

hf download z-lab/Qwen3.6-27B-DFlash \
  --local-dir ./models/dflash-drafter

The DFlash drafter is auto-gated: the first download prompts you to click-accept the terms (instant approval). If you downloaded it before 2026-04-27, re-run the download; z-lab pushed an updated drafter and you want the new weights.

Step 2 — Use the XS docker-compose

docker-compose.spark-xs.yml ships in this repo with the exact config measured above. Highlights:

Image: ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= tag :2026-06-11-pr41703; rollback :2026-06-04-pr44389). The image ENTRYPOINT is /bin/bash, so docker run must pass --entrypoint vllm then serve ... (compose uses entrypoint: vllm + command: serve ...).
Body: XS multimodal (--quantization modelopt)
Speculative decoding: DFlash, num_speculative_tokens: 12, architecture-matched drafter (--speculative-config '{"method":"dflash",...}'), drafter attention backend left at default
GB10-specific env: TORCH_CUDA_ARCH_LIST=12.1a, ENABLE_NVFP4_SM100=0, VLLM_USE_FLASHINFER_SAMPLER=1, VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass, NVIDIA_FORWARD_COMPAT=1
Default gateway tuning: --max-model-len 256000 --max-num-seqs 64 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.75 (leaves room for ASR/TTS/embedding side services)
Long-context production tuning: --max-model-len 200000 --max-num-seqs 16 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.85 (higher KV reserve when the LLM is the only major GPU service)
Multimodal: --limit-mm-per-prompt '{"image":4,"video":2}' --mm-encoder-tp-mode data --mm-processor-cache-type shm
Serving: 5 aliases (aeon-ultimate, qwen36-ultimate, aeon-fast, aeon-deep, aeon-ultimate-xs) all routing to the same engine

Step 3 — Start

docker compose -f docker-compose.spark-xs.yml up -d
docker compose -f docker-compose.spark-xs.yml logs -f vllm

First boot takes ~10–12 min (FlashInfer NVFP4 GEMM autotuner + CUDA-graph capture; both cache to /root/.cache/vllm/...). Subsequent restarts ~3–5 min. The MTP-head detection log line will appear in startup but the engine routes around it correctly because of --speculative-config method:"dflash".

Step 4 — Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aeon-ultimate",
    "messages": [{"role": "user", "content": "Explain zero-knowledge proofs to a basic-crypto audience."}],
    "max_tokens": 512,
    "temperature": 0.7
  }'

OpenAI-compatible endpoint at http://localhost:8000/v1. Tool calling, reasoning mode (<think> blocks), and multimodal input all enabled out of the box.
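The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the container is listening on localhost:8000 as in the curl example and that the response follows the usual OpenAI chat-completions shape. `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint from the quickstart above


def build_chat_request(prompt: str, model: str = "aeon-ultimate",
                       max_tokens: int = 512,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build the POST request that mirrors the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the standard chat-completions response layout.
    return body["choices"][0]["message"]["content"]
```

With the container running, `chat("Explain zero-knowledge proofs to a basic-crypto audience.")` returns the answer text. Any of the served aliases can be passed as `model` through `build_chat_request`.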

Why this combo wins on Spark: aeon-vllm-ultimate:latest keeps the XS body, CUTLASS NVFP4, DFlash n=12, CUDA graphs, tool parsing, reasoning parsing, and multimodal support in one pullable image, and runs the DFlash drafter's sliding-window layers as proper SWA (PR #40898) so long-context drafting holds up. That is the path benchmarked at the top of this README.

QuickStart — A100 / H100 (BF16)

For Ampere / Hopper cards, run the BF16 release on vanilla vLLM.

Step 1 — Pull weights

hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 \
  --local-dir /opt/models/aeon-ultimate-bf16

Step 2 — Drop in the BF16 docker-compose

# docker-compose.bf16.yml
services:
  aeon-ultimate-bf16:
    image: vllm/vllm-openai:latest
    container_name: aeon-ultimate-bf16
    restart: unless-stopped
    network_mode: host
    ipc: host
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /opt/models/aeon-ultimate-bf16:/models/aeon-ultimate:ro
    command: >
      --model /models/aeon-ultimate
      --served-model-name aeon-ultimate
      --host 0.0.0.0 --port 8000
      --dtype bfloat16
      --max-model-len 131072
      --max-num-seqs 16
      --max-num-batched-tokens 8192
      --gpu-memory-utilization 0.90
      --enable-chunked-prefill
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --attention-backend flash_attn
      --trust-remote-code

Step 3 — Start

docker compose -f docker-compose.bf16.yml up -d

For 96 GB cards (RTX PRO 6000 Blackwell on the BF16 path), raise to --max-num-seqs 32 --max-num-batched-tokens 16384 --max-model-len 262144. For native FP4 throughput on RTX PRO 6000, see the dedicated NVFP4 recipe below.

Other hardware configurations

The DGX Spark and BF16 quickstarts above are the AEON-7 team's measured-and-validated configurations. Recipes for additional hardware live in the other-hardware/ directory — each in its own subfolder with a tuned docker-compose.yml and a per-hardware README explaining what differs from the DGX Spark recipe and why.

Hardware	Recipe	Status	Recommended for
NVIDIA RTX PRO 6000 Blackwell (sm_120, 96 GB GDDR7)	`other-hardware/rtx6000pro/`	Community recipe	Single-GPU NVFP4 deployment with native sm_120 FP4 tensor-core throughput. Dedicated-VRAM flags differ from DGX Spark unified-memory flags.

If you have hardware not covered here and want to contribute a recipe, follow the pattern in other-hardware/rtx6000pro/ — a folder, a tuned docker-compose.yml, and a README explaining the differences from the DGX Spark baseline.

In-depth: the abliteration methodology

What abliteration is

Abliteration is a post-training intervention that removes the refusal direction in a model's residual stream — the linear subspace, identified empirically by Arditi et al. (2024), that mediates a transformer's decision to refuse a prompt. The technique works because in well-aligned chat models, refusal is mediated by a single dominant direction: project that direction out of the residual stream at every layer and the model loses its ability to route into refusal-shaped attractors.

The naive version of this (subtract the refusal direction from the o_proj and down_proj weights) produces a model that no longer refuses, but it also tends to break the model: aggressive direction removal collapses capability, producing word-salad outputs and looping incoherence. The literature is full of "uncensored" releases that are also broken releases.
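The naive intervention is easy to state in code. The sketch below shows a mean-difference direction and the plain projection W' = (I - r rᵀ) W on a toy matrix. It illustrates the basic technique only: it is not NPBA, it does no norm preservation or winsorization, and it is not the AEON pipeline. All names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major, shape (d_model, d_in)


def mean_difference_direction(harmful: List[Vector], harmless: List[Vector]) -> Vector:
    """Unit vector pointing from the harmless activation mean to the harmful one."""
    dim = len(harmful[0])
    mean_harmful = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    mean_harmless = [sum(v[i] for v in harmless) / len(harmless) for i in range(dim)]
    diff = [a - b for a, b in zip(mean_harmful, mean_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project_out(weight: Matrix, direction: Vector) -> Matrix:
    """Return (I - r r^T) W so that W can no longer write along unit vector r.

    Each column of `weight` is a contribution to the residual stream, as for
    o_proj / down_proj outputs.
    """
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    along = [sum(direction[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - direction[i] * along[j] for j in range(cols)]
            for i in range(rows)]
```

After the projection, every output of the edited matrix has zero component along the direction, which is exactly why an over-strong or badly estimated direction also removes useful signal.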

What "lossless abliteration" requires

To remove refusal without breaking capability, four things have to be done correctly:

Identify the refusal direction precisely — using a sufficiently large harmful/harmless contrast set, with outlier-aware winsorization so a handful of high-norm prompts don't distort the steering vector.
Project orthogonally and norm-preservingly — keeping the helpfulness-aligned signal intact (this is the NPBA contribution).
Search the strength × layer-scope hyperparameter space — most projects pick one strength setting and ship; a real Pareto-front search over (refusals, KL) finds the trial that hits zero refusals at minimum capability damage.
Cross-validate against capability — refusal-rate keyword scoring will not catch over-abliteration. Word-salad incoherence ("I I cannot... less... I I I") doesn't match any refusal marker, so the optimizer marks it compliant. You have to actually run the resulting model against a capability spot-check.

The AEON pipeline does all four.
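The fourth requirement is the one most easily missed, so here is a small sketch of the failure mode. The marker list, the unique-token heuristic, and the 0.5 cutoff are hypothetical stand-ins chosen for illustration. They are not abliterix's scorer and not the 10-axis spot-check.

```python
REFUSAL_MARKERS = ("i'm sorry", "i can't help", "as an ai")  # hypothetical list


def keyword_refusal(text: str) -> bool:
    """Keyword-only scoring: True if any refusal marker appears."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def unique_token_ratio(text: str) -> float:
    """Share of distinct whitespace tokens; low values suggest looping output."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def classify(text: str, min_unique_ratio: float = 0.5) -> str:
    """Label a response as refusal, degenerate, or compliant."""
    if keyword_refusal(text):
        return "refusal"
    if unique_token_ratio(text) < min_unique_ratio:
        return "degenerate"
    return "compliant"


if __name__ == "__main__":
    salad = "I I cannot... less... I I I"
    print(keyword_refusal(salad), classify(salad))
```

The word-salad string passes the keyword check as "not a refusal", and only the second check catches it. A repetition ratio is a crude proxy that short, looping outputs fail; real validation still means reading model outputs on capability prompts.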

The AEON pipeline (4 stages)

Qwen/Qwen3.6-27B (BF16, 51 GB, heavy RLHF safety training)
          │
          │  Stage 1 — SSM conv1d outlier repair (FernflowerAI)
          ▼
Qwen3.6-27B-base-repaired  (8 late-layer SSM blocks rescaled)
          │
          │  Stage 2 — abliterix v1.4 abliteration (Optuna multi-objective)
          ▼
Qwen3.6-27B-AEON-Ultimate-Uncensored  (BF16, 51 GB, trial 46/50)
          │
          │  Stage 3 — capability cross-validation (10-axis spot-check)
          ▼
Qwen3.6-27B-AEON-Ultimate-Uncensored  (validated, BF16 release)
          │
          │  Stage 4 — NVFP4 quantization (llm-compressor)
          ▼
Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4  (26 GB, NVFP4 release)

Stage 1 — SSM conv1d outlier repair

Per FernflowerAI's empirical discovery, certain late SSM / GatedDeltaNet blocks in Qwen 3.5 / 3.6 hybrids have linear_attn.conv1d.weight σ inflated 50–100 % above the median across all SSM blocks. Left unrepaired, this manifests during long-context inference as coherence collapse and "philosophizing" loops, and it makes the model hypersensitive to downstream abliteration (amplifies the noise).

The repair: compute σ per block across all 48 SSM layers, flag any block where σ > 1.5× median, rescale weights by α = median_σ / σ_actual.

On Qwen 3.6 27B, 8 outlier blocks were detected and repaired: layers 52, 53, 56, 57, 58, 60, 61, 62, with α factors between 0.516 and 0.659. After repair, σ is uniform at 0.04267 across all SSM layers.
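The repair rule above fits in a few lines. This sketch takes a precomputed per-layer σ and returns the rescale factor for each flagged block. The function name and the synthetic input in the usage note are illustrative, and applying the factor to the actual `conv1d.weight` tensors is left out.

```python
import statistics
from typing import Dict

OUTLIER_FACTOR = 1.5  # flag blocks whose sigma exceeds 1.5x the median


def repair_factors(sigma_by_layer: Dict[int, float]) -> Dict[int, float]:
    """Map each outlier layer to its rescale factor alpha = median_sigma / sigma.

    Multiplying a flagged block's conv1d weights by alpha brings its sigma
    back to the median; unflagged layers are left untouched.
    """
    median_sigma = statistics.median(sigma_by_layer.values())
    return {
        layer: median_sigma / sigma
        for layer, sigma in sigma_by_layer.items()
        if sigma > OUTLIER_FACTOR * median_sigma
    }
```

As a synthetic check, take 48 blocks at σ = 0.04267 and inflate two of them to 0.04267 / 0.516 and 0.04267 / 0.659. The function flags exactly those two and returns factors of about 0.516 and 0.659, the range reported above.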

This is not abliteration. It is an upstream-model defect repair that must run before abliteration so the optimizer isn't fighting noise.

Stage 2 — abliterix multi-objective abliteration

abliterix v1.4 — a Heretic-derived multi-objective Optuna optimizer with native hybrid-attention support — was run with the configuration:

[steering]
vector_method          = "mean"
decay_kernel           = "linear"
orthogonal_projection  = true
projected_abliteration = true        # grimjim NPBA
winsorize_vectors      = true
winsorize_quantile     = 0.995
weight_normalization   = "none"
disabled_components    = ["attn.q_proj", "attn.k_proj", "attn.v_proj"]
# Q/K/V disabled: Qwen 3.6 has attn_output_gate=True which doubles
# q_proj's output dim to (12288, 5120) — incompatible with abliterix's
# standard projection math.

[steering.component_strength_ranges]
"mlp.down_proj" = [2.0, 10.0]
"attn.o_proj"   = [1.0, 6.0]

[kl]
target          = 0.005
prune_threshold = 0.5      # kill divergent trials at 100× target

[optimization]
num_trials        = 50
num_warmup_trials = 15

50 trials (15 random warmup + 35 TPE-driven). Optuna explored a Pareto front of (refusals, KL) trade-offs. Wall-clock: ~4 hours on a single RTX PRO 6000 Blackwell 96 GB.
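A Pareto front over (refusals, KL) is the set of trials that no other trial beats on both objectives at once. The sketch below is a plain-Python filter, independent of Optuna. The `Trial` type and the example numbers in the usage note are invented for illustration and are not the actual trial log.

```python
from typing import List, NamedTuple


class Trial(NamedTuple):
    number: int
    refusals: int  # refusals out of 100 prompts (lower is better)
    kl: float      # KL divergence vs base (lower is better)


def dominates(a: Trial, b: Trial) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    return (a.refusals <= b.refusals and a.kl <= b.kl
            and (a.refusals < b.refusals or a.kl < b.kl))


def pareto_front(trials: List[Trial]) -> List[Trial]:
    """Trials that are not dominated by any other trial."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]
```

Given four made-up trials (5 refusals at KL 0.0002, 0 at 0.002, 0 at 0.0005, 3 at 0.001), the front keeps the first and third. The front only ranks trials on the two scored objectives, so a point on it can still be incoherent, which is why the next stage exists.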

Stage 3 — capability cross-validation (the over-abliteration trap)

A more aggressive Pareto point — trial 17, 0/100 refusals at KL=0.00192 — was tested first and produced word-salad capability outputs ("Here I I cannot... less... I I I..."). abliterix's keyword-only refusal scoring did not flag this: the gibberish doesn't match any refusal marker, so the optimizer saw it as full compliance.

Trial 46's gentler parameters still hit zero refusals and preserved coherence on downstream capability testing:

Parameter	Trial 17 (broken)	Trial 46 (winner)
`vector_scope`	global	per layer
`attn.o_proj.max_weight`	2.50	1.56 (×1.6 gentler)
`mlp.down_proj.max_weight`	5.43	3.45 (×1.57 gentler)
`mlp.down_proj.min_weight_distance`	36.09	24.94 (narrower)
KL divergence	0.00192	0.00049
Smoke-test verdict	BROKEN (gibberish)	COHERENT

The lesson: the lowest-refusal trial on a keyword-only metric is not necessarily the right trial to ship. Cross-validate against a true capability spot-check before you commit. Most public abliterations skip this step. We don't.

Stage 4 — NVFP4 quantization

See the NVFP4 deep-dive section below.

In-depth: NVFP4 quantization

What NVFP4 is

NVFP4 is NVIDIA's 4-bit floating-point quantization format introduced for Blackwell-and-later silicon. It is not a "compressed lite" version of a model — it is the production deployment format NVIDIA designed for the next decade of inference: accuracy on par with BF16, throughput of true 4-bit compute, no compromise required.

The format specification:

Component	Details
Element format	E2M1 — 4-bit float (1 sign / 2-bit exponent / 1-bit mantissa)
Block size	16 weights per scaling block
Per-block scale	FP8 E4M3 — 8-bit floating-point per block
Per-tensor scale	FP32 (single global scale per tensor)
Sign convention	Symmetric signed

Why the two-level scaling matters

Older 4-bit formats (INT4, Q4_0, Q4_K, NF4) place each element on an integer grid or a fixed codebook under a per-block scale. When the local weight distribution is heavy-tailed, as it almost always is in trained transformers, a uniformly spaced grid cannot resolve the long tail without crushing the bulk distribution.

NVFP4's FP8 E4M3 per-block scales out-resolve INT8 scales because FP8 is itself a floating-point number: the scale can span 3+ orders of magnitude from block to block while still maintaining fine-grained resolution near the median weight value. Combine that with a global FP32 per-tensor scale and you get a three-level hierarchy (two levels of scaling above the element format): per-tensor FP32 → per-block FP8 → per-element E2M1, where each level absorbs a different scale of variation.

The combined effect is that local outliers — the long-tailed weights that destroy older 4-bit formats — are absorbed by the per-block FP8 scale rather than smearing the whole quantization grid.
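A toy quantizer makes the block-local behavior concrete. This sketch rounds one 16-weight block onto the E2M1 grid with a per-block scale. It keeps the scale as an ordinary Python float, so the FP8 E4M3 rounding of the scale and the FP32 per-tensor scale are deliberately omitted, and the nearest-value rounding rule is a simplification. The function names are illustrative.

```python
from typing import List, Tuple

# All non-negative values representable in E2M1 (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block: List[float]) -> Tuple[float, List[float]]:
    """Quantize one block; return (scale, signed E2M1 values)."""
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} weights, got {len(block)}")
    amax = max(abs(w) for w in block)
    # Map the largest magnitude in the block onto the top of the grid (6.0).
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for w in block:
        target = abs(w) / scale
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(magnitude if w >= 0 else -magnitude)
    return scale, codes


def dequantize_block(scale: float, codes: List[float]) -> List[float]:
    """Reconstruct approximate weights from the scale and E2M1 values."""
    return [scale * c for c in codes]
```

An outlier of 12.0 in a block forces that block's scale to 2.0 and flattens a neighboring 0.1 to zero, but only within those 16 weights. Other blocks keep their own scale, which is the locality the text describes.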

Why it's effectively lossless

Typical KL divergence vs the BF16 source for NVFP4 quantization with this class of recipe is ≤ 0.001, which is below the noise floor of stochastic sampling. In practical terms: a user is unlikely to observe a difference between this model and its BF16 source, because the variance from changing your temperature or seed exceeds the variance from BF16 → NVFP4.
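For readers who want to reproduce this kind of number, here is a small sketch of a per-token KL computation from raw logits. The logits below are invented for illustration, and averaging over token positions is an assumption about the measurement protocol, which this README does not spell out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit row."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats for a single token position."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(base_rows, quant_rows):
    """Average KL over token positions (assumed protocol, see lead-in)."""
    values = [kl_divergence(b, q) for b, q in zip(base_rows, quant_rows)]
    return sum(values) / len(values)


# Hypothetical logits: the quantized model is a small perturbation of the base.
base_logits = [2.0, 1.0, 0.1, -1.0]
quant_logits = [2.02, 0.99, 0.08, -1.01]
kl = kl_divergence(base_logits, quant_logits)
print(f"KL(base || quant) = {kl:.6f} nats")
```

`P` is the reference (BF16) distribution and `Q` the quantized one, so the value measures how much probability mass the quantized model moves away from what the source would have predicted.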

Native Blackwell tensor-core throughput

On Blackwell-class silicon, NVFP4 runs at full FP4 tensor-core throughput through native paths:

B100 / B200: tcgen05 / UTCQMMA instructions — fastest NVFP4 hardware available.
DGX Spark (GB10 / sm_121a): SM121-specific CUTLASS NVFP4 kernels (the aeon-vllm-ultimate container ships these patched in).
RTX PRO 6000 Blackwell (sm_120): standard CUTLASS NVFP4 path.

The GPU does not dequantize back to BF16 internally on these paths. You get the speed of true 4-bit compute and the accuracy of 16-bit weights at the same time.

On older silicon (A100, H100), NVFP4 dequantizes at kernel boundaries — works correctly but no throughput advantage. For those cards use the BF16 release directly.

What stays BF16 (and why)

Not every layer is quantized. Two categories of weights are deliberately preserved at BF16:

Vision tower (333 keys) — multimodal inference must not degrade. Vision encoders are sensitive to weight precision and are tiny in absolute size (~100 MB), so the cost is negligible.
Linear-attention / GatedDeltaNet layers (432 keys, 48 layers × 9 modules) — Mamba / SSM state dynamics are mathematically incompatible with FP4. The hidden-state recurrence multiplies state vectors by quantized weights at every step; even tiny per-step error compounds across the sequence and the state collapses. FP4 on SSM weights is not a precision/accuracy tradeoff — it is a correctness failure.

FP4 is applied only where it is well-behaved: the q/k/v/o projections of the 16 full-attention layers, plus all MLPs.
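The compounding argument for the recurrent layers can be seen in a toy scalar recurrence. This is a deliberately simplified sketch, not GatedDeltaNet: the decay value, the 0.001 weight error, and the constant input are all assumptions chosen to make the effect visible.

```python
def final_state(decay: float, steps: int, x: float = 1.0) -> float:
    """Run s <- decay * s + x for `steps` steps, starting from s = 0."""
    s = 0.0
    for _ in range(steps):
        s = decay * s + x
    return s


def relative_gap(steps: int, decay: float = 0.999, weight_error: float = 0.001) -> float:
    """Relative difference in the final state when the decay weight is slightly off."""
    exact = final_state(decay, steps)
    perturbed = final_state(decay - weight_error, steps)
    return abs(exact - perturbed) / exact


for steps in (10, 100, 1000, 5000):
    print(f"{steps:>5} steps: relative state error {relative_gap(steps):.3f}")
```

The weight error is identical at every sequence length; only the number of recurrence steps changes, and the state error grows with it. A feed-forward projection applies its weight error once per token, which is why the same perturbation is far less damaging there.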

Verification (post-quantization)

Check	Result
Total keys in checkpoint	1952
Quantized full-attention projections	64 (16 layers × q/k/v/o)
`linear_attn.*` keys preserved BF16	432
`visual.*` keys preserved BF16	333
Norm keys preserved BF16	319
`lm_head` and `embed_tokens` preserved BF16	✓
NVFP4-packed weights present	✓
`input_global_scale` magnitudes	142–346 (healthy)
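A check like the table above can be sketched as a tally over checkpoint key names. The key strings below are hypothetical examples, not names read from the released checkpoint, and the matching order (for instance, whether a norm inside the vision tower counts as `visual` or `norm`) is an assumption.

```python
from collections import Counter


def classify(key: str) -> str:
    """Bucket a checkpoint key by the preservation rules described in this section."""
    if "visual." in key:
        return "visual"
    if "linear_attn." in key:
        return "linear_attn"
    if "lm_head" in key or "embed_tokens" in key:
        return "embed_or_head"
    if "norm" in key:
        return "norm"
    return "nvfp4_candidate"


# Hypothetical key names, for illustration only.
keys = [
    "model.visual.blocks.0.attn.qkv.weight",
    "model.language_model.layers.0.linear_attn.in_proj_qkv.weight",
    "model.language_model.layers.3.self_attn.q_proj.weight_packed",
    "model.language_model.layers.3.input_layernorm.weight",
    "model.language_model.embed_tokens.weight",
    "lm_head.weight",
    "model.language_model.layers.3.mlp.gate_proj.weight_packed",
]
tally = Counter(classify(k) for k in keys)
print(dict(tally))

# The two products quoted in the table and in the section above.
full_attention_projections = 16 * len(("q", "k", "v", "o"))
linear_attention_keys = 48 * 9
print(full_attention_projections, linear_attention_keys)
```

Against a real checkpoint, the same tally would run over the key names of the safetensors shards and be compared with the counts in the table.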

Quant tool: llm-compressor 0.10.1.dev107 with QuantizationModifier(scheme="NVFP4"). Calibration: open-platypus, 512 samples × 4096 tokens. Pipeline: sequential with sequential_targets=["Qwen3_5DecoderLayer"] (required for hybrid stacks; auto-discovery silently skips layers). Loader: AutoModelForImageTextToText to preserve the multimodal class.

Wall-clock quant time: ~57 minutes on 1× RTX PRO 6000 Blackwell 96 GB.

Capability enhancement: the lifted "safety tax"

Modern safety alignment is not free. It imposes what Huang et al. 2025 call the "safety tax" (arXiv:2503.00555) — a systematic suppression of reasoning capacity that emerges because the RLHF process trains the model to route certain cognitive operations through refusal-shaped attractors, even when those attractors are not activated by the output. The refusal direction is not a binary gate; it is a weighted drag on the residual stream that rebalances the token distribution at every forward pass, whether or not the eventual generation contains a refusal.

Removing the refusal direction eliminates that drag. Concretely, this produces three observable shifts:

Longer, more committed chains of thought. Aligned models often hedge partway through a reasoning chain ("but of course, one should be careful…") in response to topics that tangentially brush the refusal subspace — even when the prompt is entirely benign. AEON-Ultimate follows reasoning chains to their logical conclusion without mid-stream hedging.
Improved adversarial-example and red-team reasoning. Without self-censorship overhead, the model can analyze attack surfaces, vulnerabilities, and failure modes at full capacity — invaluable for security research, penetration testing, and AI-alignment red-teaming.
Cleaner calibration on contested topics. Aligned models often express uncertainty on topics where they are actually highly confident, because the refusal gradient creates an attractor basin near "I'm not sure" for any topic that pattern-matches the safety training distribution. AEON-Ultimate reports its actual confidence.
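For reference, "removing the refusal direction" in its simplest published form (Arditi et al. 2024, cited in the credits) is a projection: subtract from each hidden state its component along a unit direction, h' = h − (h·r̂)r̂. The sketch below shows only that basic operation on toy vectors; it is not the abliterix pipeline used for this release, which the credits describe as a multi-objective, projected variant.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    r = normalize(direction)
    coeff = sum(h * ri for h, ri in zip(hidden, r))
    return [h - coeff * ri for h, ri in zip(hidden, r)]


# Toy 3-dimensional example; real residual streams have thousands of dimensions.
hidden = [1.0, 2.0, 3.0]
refusal_direction = [0.0, 0.0, 2.0]
ablated = ablate_direction(hidden, refusal_direction)
print(ablated)
```

After the projection the hidden state has zero component along the chosen direction, while everything orthogonal to it is left unchanged.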

Empirical literature on capability gains from abliteration

The published results cited here point in the same direction: post-training refusal-direction removal at low KL produces measurable benchmark gains over the aligned base.

Study	Model	Intervention	Result
grimjim (2025)	Gemma-3-12B-IT	NPBA abliteration	+13.9 % NatInt reasoning
Young (2025), arXiv:2512.13655	Yi-1.5-9B	DECCP abliteration	+1.51 pp GSM8K
Xie et al. (2026)	(DGR safety-tax mitigation)	targeted safety-direction removal on DirectRefusal	+30.2 % reasoning recovery

AEON-Ultimate sits in the KL < 0.001 regime where these gains are most commonly reported. The capability spot-checks (10/10 coherent across math, code, reasoning, knowledge, and long-form) and the DGX Spark serving benchmarks at the top of this README are the current public measurement set.

What the lifted overhead also means

The same lifted overhead means the model will now produce content the base would refuse: harmful-tool construction, violence, graphic sexuality, contested ideologies, jurisdictionally illegal content, and content a reasonable person might find offensive.

The model makes no internal judgment calls about whether to comply. It complies. The user becomes the safety layer. This is by design — the intended use cases (security research, red-team operations, alignment research, creative writing without editorial constraints, serving users in jurisdictions where the base's guardrails misalign with legitimate local frameworks) all benefit from a model that reliably executes the user's instruction rather than second-guessing it. But that same reliability is a threat vector when the user's instruction is malicious.

Wielding an uncensored model is genuinely different from wielding an aligned one: every safeguard the base model used to supply now has to come from the deployment around it. See the responsibility section below.

Configuration reference

NVFP4 on DGX Spark — full flag explanation (XS + DFlash config, `aeon-vllm-ultimate:latest`)

Flag	Value	Why
`--quantization modelopt`	required for the XS body	The recommended `-Multimodal-NVFP4-MTP-XS` checkpoint is modelopt format. Use `compressed-tensors` only with the older regular `-NVFP4` body.
`--kv-cache-dtype`	do not set (leave default BF16)	The non-causal DFlash drafter requires BF16 KV — do not set `--kv-cache-dtype` to an FP8/NVFP4 value with DFlash. TurboQuant K8V4 (3.76× compression) is also unsupported on hybrid attention + Mamba models — vLLM raises a deliberate guard. The 27B-AEON stack stays on uniform BF16 KV.
(async scheduling)	enabled (default)	Async scheduling overlaps scheduler work with GPU work and is part of the default serving profile. Disable only for a deliberate TTFT-only experiment.
`--max-model-len`	`256000` gateway default, `200000` solo LLM production	256K exposes almost the full trained context for agent gateways. Use 200K when the LLM is the only major GPU service and you want more full-context KV safety.
`--max-num-seqs`	`64` gateway default, `16` solo full-context production	64 gives agentic gateways room for one large working chat plus many short-lived subagents. Drop to 16 when you expect many sequences near the full 200K context window.
`--max-num-batched-tokens`	`32768`	Prefill budget. This is the practical ceiling on Spark; above 32K, compile coverage and unified-memory pressure get worse.
`--gpu-memory-utilization`	`0.75` gateway default, `0.85` solo LLM production	Use 0.75 when ASR, TTS, embeddings, ComfyUI, or other GPU services share the Spark. 0.85 is the long-context LLM-only cap. Do not exceed 0.88 on DGX Spark — unified memory thrashes above that.
`--enable-chunked-prefill`	on	Required for long-context workloads to avoid prefill OOM.
`--enable-prefix-caching` / `--no-enable-prefix-caching`	workload-dependent	For pure DFlash gateway serving, prefix caching can be a major TTFT win when many agents share the same stable system/persona/skills/tool prefix. In our repeated-prefix probe, a 37,837-token shared prefix dropped from ~26 s uncached TTFT to ~0.7 s cached follow-ups. For DDTree research modes, keep prefix caching off until branch-state replay and accepted-branch commit are quality-stable.
`--load-format safetensors`	required	NVFP4 weights ship as safetensors.
`--trust-remote-code`	required	Qwen 3.6 uses custom modeling code.
`--enable-auto-tool-choice`	on	Enables OpenAI-compatible tool calling.
`--tool-call-parser qwen3_coder`	required for tools	Parses Qwen 3.6's tool-call XML.
`--reasoning-parser qwen3`	required for thinking mode	Parses `<think>` blocks.
`--attention-backend flash_attn`	required	Stable on sm_121a.
`--limit-mm-per-prompt '{"image":4,"video":2}'`	recommended	Hard caps on multimodal inputs per request.
`--mm-encoder-tp-mode data`	required	Vision encoder TP strategy.
`--mm-processor-cache-type shm`	recommended	Shared-memory mm processor cache.
`--mm-shm-cache-max-object-size-mb 256`	recommended	Lets larger Qwen3.6 image/video processor objects fit in the multimodal shared-memory cache.
`--speculative-config '{"method":"dflash","model":"/models/dflash-drafter","num_speculative_tokens":12}'`	recommended	DFlash spec-decode at `num_speculative_tokens: 12` (validated production default). This is the Spark recipe for `aeon-vllm-ultimate:latest`. Leave the drafter attention backend at default — do not add `attention_backend` to the spec config.
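As a worked example, the gateway-default column of the table can be assembled into a single command line. This is a sketch under assumptions: it presumes the container exposes the standard `vllm serve` entrypoint, and the model path `/models/aeon-xs` is a hypothetical placeholder. The flags and values themselves are taken from the table.

```python
import json
import shlex


def build_serve_args(model_path, drafter_path="/models/dflash-drafter", prefix_caching=True):
    """Build a vllm serve argument list from the gateway defaults in the table."""
    spec = {"method": "dflash", "model": drafter_path, "num_speculative_tokens": 12}
    return [
        "vllm", "serve", model_path,
        "--quantization", "modelopt",
        "--max-model-len", "256000",
        "--max-num-seqs", "64",
        "--max-num-batched-tokens", "32768",
        "--gpu-memory-utilization", "0.75",
        "--enable-chunked-prefill",
        # Prefix caching is workload-dependent (see the table row).
        "--enable-prefix-caching" if prefix_caching else "--no-enable-prefix-caching",
        "--load-format", "safetensors",
        "--trust-remote-code",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
        "--reasoning-parser", "qwen3",
        "--attention-backend", "flash_attn",
        "--limit-mm-per-prompt", json.dumps({"image": 4, "video": 2}),
        "--mm-encoder-tp-mode", "data",
        "--mm-processor-cache-type", "shm",
        "--mm-shm-cache-max-object-size-mb", "256",
        # No --kv-cache-dtype: the DFlash drafter needs the default BF16 KV cache.
        "--speculative-config", json.dumps(spec),
    ]


args = build_serve_args("/models/aeon-xs")  # hypothetical model path
print(shlex.join(args))
```

Building the JSON-valued flags with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the nested quotes in `--speculative-config` and `--limit-mm-per-prompt`.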

Required environment variables (DGX Spark NVFP4 / `aeon-vllm-ultimate:latest`)

Variable	Value	Why
`VLLM_ALLOW_LONG_MAX_MODEL_LEN`	`1`	Allows `--max-model-len` past the model's hard ceiling assertion.
`TORCH_CUDA_ARCH_LIST`	`12.1a`	sm_121a-specific.
`PYTORCH_CUDA_ALLOC_CONF`	`expandable_segments:True`	Reduces fragmentation under long-context KV churn.
`TORCH_MATMUL_PRECISION`	`high`	Standard precision for FP4 matmul paths.
`NVIDIA_FORWARD_COMPAT`	`1`	DGX Spark forward-compat shim.
`NVIDIA_DISABLE_REQUIRE`	`1`	Disables driver version assertion — required because GB10 ships with a driver newer than vLLM's `nvidia-require-cuda` baseline.
`ENABLE_NVFP4_SM100`	`0`	Required by PR #40191 for sm_121a-only builds. Without it, `vllm._C_stable_libtorch` fails to import — depends on SM100-only `mxfp4_experts_quant` kernels that don't exist on SM121.
`VLLM_USE_FLASHINFER_MOE_FP4`	`0`	Defensive: this model is dense (no MoE); disabling the FlashInfer FP4 MoE auto-probe avoids SM121 PTX rejection log spam during boot.
`VLLM_TEST_FORCE_FP8_MARLIN`	`0`	Override baked test-image defaults; keep production NVFP4 path selection.
`VLLM_USE_FLASHINFER_SAMPLER`	`1`	FlashInfer CUDA top-k/top-p sampler for normal sampled requests.
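The variables in this table can be kept in one place and rendered for whichever launcher you use. The sketch below produces `docker run` style `-e KEY=VALUE` pairs and a merged process environment; the helper names are illustrative, and the values are copied from the table.

```python
import os

# Values copied from the table above.
SPARK_ENV = {
    "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1",
    "TORCH_CUDA_ARCH_LIST": "12.1a",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    "TORCH_MATMUL_PRECISION": "high",
    "NVIDIA_FORWARD_COMPAT": "1",
    "NVIDIA_DISABLE_REQUIRE": "1",
    "ENABLE_NVFP4_SM100": "0",
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "VLLM_TEST_FORCE_FP8_MARLIN": "0",
    "VLLM_USE_FLASHINFER_SAMPLER": "1",
}


def docker_env_flags(env=SPARK_ENV):
    """Render the mapping as a flat list of docker `-e KEY=VALUE` arguments."""
    flags = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


def merged_env(base=None):
    """Return a copy of the base environment with the Spark variables applied on top."""
    env = dict(os.environ if base is None else base)
    env.update(SPARK_ENV)
    return env


flags = docker_env_flags()
print(" ".join(flags))
```

`merged_env()` is meant to be passed as `env=` to `subprocess.run` when launching the server directly rather than through Docker.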

BF16 on A100 / H100 — full flag explanation

Flag	80 GB profile	96 GB profile	Why
`--max-model-len`	`131072`	`262144`	Half-context on 80 GB to leave KV headroom.
`--max-num-seqs`	`16`	`32`	80 GB cards leave ~21 GB for KV after 0.90 utilization.
`--max-num-batched-tokens`	`8192`	`16384`	Safe prefill.
`--gpu-memory-utilization`	`0.90`	`0.90`	Standard for dedicated VRAM (not unified).

Responsibility, arbitration, and use

This is an uncensored model. Read the model card's User Responsibility & Arbitration Clause before deploying. Summary:

You are solely responsible for prompts, outputs, and downstream actions.
Provided "AS IS" — no warranty of any kind.
You implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). A production deployment without those layers is unsafe by construction and is not a supported use case.
Disputes go to binding individual arbitration. Class action waived.
You indemnify the authors from claims arising from your use.
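To make the "downstream safety layers" item concrete, here is a deliberately small sketch of a wrapper that adds input validation, per-user rate limiting, an output filter, and an audit log around a generation call. Every name in it is hypothetical: `generate` stands in for your own client call, and `output_filter` for whatever moderation step you deploy. It is a starting shape, not a sufficient production control.

```python
import time
from collections import defaultdict, deque


class GuardedGenerator:
    """Wrap a generate(prompt) callable with basic downstream controls."""

    def __init__(self, generate, output_filter, max_prompt_chars=8000,
                 max_calls=5, window_s=60.0):
        self.generate = generate            # hypothetical: your model client
        self.output_filter = output_filter  # hypothetical: returns True if text may be released
        self.max_prompt_chars = max_prompt_chars
        self.max_calls = max_calls
        self.window_s = window_s
        self.audit_log = []
        self._calls = defaultdict(deque)

    def __call__(self, user_id, prompt):
        now = time.monotonic()
        calls = self._calls[user_id]
        while calls and now - calls[0] > self.window_s:
            calls.popleft()                 # drop calls that left the rate window
        if len(calls) >= self.max_calls:
            return self._record(user_id, prompt, None, "rate_limited")
        if not prompt.strip() or len(prompt) > self.max_prompt_chars:
            return self._record(user_id, prompt, None, "rejected_input")
        calls.append(now)
        text = self.generate(prompt)
        if not self.output_filter(text):
            return self._record(user_id, prompt, None, "blocked_output")
        return self._record(user_id, prompt, text, "ok")

    def _record(self, user_id, prompt, text, status):
        self.audit_log.append({"user": user_id, "status": status,
                               "prompt_chars": len(prompt), "time": time.time()})
        return text


# Stand-in callables so the sketch runs without a model server.
guard = GuardedGenerator(lambda p: p.upper(), lambda t: "SECRET" not in t, max_calls=2)
results = [guard("alice", "hi"), guard("alice", "secret"), guard("alice", "again")]
print(results, [entry["status"] for entry in guard.audit_log])
```

Human-in-the-loop review and access control sit outside a wrapper like this and still need to be provided by the surrounding system.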

The model has no opinions of its own. You supply the opinions, the judgment, and the ethics. The outputs carry your fingerprints, not the model's.

Provenance & credits

Base model: Qwen/Qwen3.6-27B — Alibaba's Qwen team.
SSM conv1d outlier repair methodology: FernflowerAI (multiple Reddit r/LocalLLaMA posts, late 2025 / early 2026).
Abliteration tool: abliterix v1.4 by Wangzhang Wu — Heretic-derived multi-objective Optuna optimizer with native hybrid Mamba/attention support, projected-abliteration, and expert-granular steering.
Heretic (upstream of abliterix): p-e-w/heretic by Philipp Emanuel Weidmann.
Original abliteration concept: Arditi et al. 2024 — "Refusal in Language Models Is Mediated by a Single Direction" (arXiv:2406.11717).
NPBA / projected-abliteration theory: grimjim 2025 — norm-preserving biprojected abliteration.
Safety-tax quantification: Huang et al. 2025 (arXiv:2503.00555); Xie et al. 2026 (DGR, safety-tax mitigation).
NVFP4 specification: NVIDIA NVFP4 introduction.
Quantization tool: llm-compressor by vllm-project.
Patched vLLM container: AEON-7/Qwen3.6-NVFP4-DFlash — source-built vLLM image with sm_121a CUTLASS NVFP4 patches.
This release's pipeline, configuration, validation, marketing, and packaging: AEON-7.

License

Apache 2.0, inherited from Qwen/Qwen3.6-27B.

Built over 72 hours · Hundreds of research agents · Lossless · Capability-enhanced

BF16 · NVFP4 · Container

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC): `bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`	Ξ Ethereum (ETH): `0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`
◎ Solana (SOL): `DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`	ⓜ Monero (XMR): `836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
