webgpu-q — CLAUDE.md

Project-local instructions for Claude. Load this first.

One-paragraph read

WebGPU quantum circuit simulator. Runs in a browser tab. Target: piece one of a six-level research ladder — statevector → MPS → kernel fusion → WebRTC swarm → IBM hardware cross-verify → quantum chemistry. Each level is a set of research-grade experiments (not just benchmarks): named seed, warmup, trials, fidelity pass bar, honest negative results. The master doc is RESEARCH.md. Per-level protocols live under experiments/level-N-<slug>/protocol.md.

Communication mode: hero. Terse, bold, first-principles, attempt-first. Scope-honest. See ~/.claude/skills/hero/SKILL.md. Project skill: webgpu-q-research. See ~/.claude/skills/webgpu-q-research/SKILL.md.

Roadmap to the frontier

The project is past the launchpad. All six chemistry-track phases are shipped (A through E5: foundation → 1D records → real molecules → HF SCF → MP2 → cc-pVDZ basis → CCSD → CCSD(T) → cc-pVDZ CCSD(T) on H₂O). Repo is public + CI-green. Path forward is ranked by what it costs vs what it unlocks, not by ladder position.

Shipped (recap)

✓ L1 statevector, L2 MPS (incl. GPU MPS Phase 6 v1, χ ≤ 64), L3 kernel fusion (Tier B/C/D — 4.22× headline), L6 chemistry (full quantum-chemistry stack)
✓ DMRG with Lanczos + MPO; ITensor cross-checked at N = 8 to f64
✓ Phase B: TFIM/Heisenberg N = 128 in browser, validated vs Pfeuty/Bethe
✓ Phase C/D/E1-5: HF / MP2 / FCI / CCSD / CCSD(T) on H₂ → LiH → BeH₂ → H₂O → CH₄ in STO-3G; cc-pVDZ CCSD(T) on H₂O in 106 s
✓ Tier 1 bundle: DIIS, frozen-core, spherical-d, f/g/h, aug-cc-pVDZ, Schwarz screening
✓ Tier 2 stages 1–23: geometry optimization → DFT/LDA → GGA + hybrids (BVWN5, BLYP, B3VWN5, B3LYP5) → HF + DFT analytical gradients → Lebedev grids → CIS / TDA / TDDFT (full functional ladder) → oscillator strengths → dipole moments → Mulliken + Mayer-Wiberg analysis → triplet TDA/TDDFT (full ladder via spin-polarized LSDA + B88 + LYP) → vibrational frequencies + IR + Raman + thermo → polarizability + hyperpolarizability → UHF + ΔSCF ionization potentials + electron affinities → molecular SI report page (/molecule.html)
✓ Reference-grounded validation green in CI (PySCF / FCI / ITensor / Pfeuty-Bethe gates); e2e browser benches across all levels

Next: chemistry-track tier roadmap

Ranked by ROI. One focused session ≈ a few hours.

Tier 2 — ALL SHIPPED through stage 38 (2026-05-12)

feature	status	unlocks
DFT (LDA + B3LYP + Lebedev grids)	✓	~90% of all real chemistry
HF analytical gradients + BFGS	✓	geometry optimization
WebGPU port of (T) kernel	✓ (39× on H₂O cc-pVDZ, single-run)	10-100× speedup; cc-pVTZ CCSD(T) routine
EOM-CCSD (excited states)	✓ (+ eigenvectors, oscillator strengths, spin classifier)	UV-vis, photochemistry
UHF + open-shell CCSD	✓ (UHF stage 21, UCCSD stage 25)	radicals, transition metals
Density fitting (RI)	✓ correctness + speedup (aux-basis 3-index DF now shipped: `buildAuxBasisDFStreaming` WASM, never builds the 4-index ERI; the old CD-DF 11-20× regression is retired)	half memory + faster
WebGPU aux-basis DF integrals	✓ (`df-gpu.ts`: s/p/d McMurchie–Davidson 3-index V + 2-index metric in WGSL f32, validated ~1e-4 rel; `buildDFAuto` auto-selects GPU in the d-regime)	GPU integral build, 1.1-1.35× d-regime
Fully-GPU DF-HF SCF	✓ (`makeGpuDFJK` + `buildDFAuto`: whole HF loop on GPU from a URL, no 4-index ERI; benzene cc-pVDZ 5-6× faster whole-loop vs WASM; level-0 aux → ~30 mHa screening; f32 JK floor ~6e-4 is element-precision)	fast browser HF (screening)
Hybrid GPU/WASM DF — EXPERIMENTAL	✓ but demoted (decision 2026-06-10): `buildV3idxHybrid` is chemistry-grade (DF-vs-exact ladder H₂O 0.19 / CH₂O 0.53 / C₂H₄ 0.36 mHa, `e2e/df-accuracy-ladder`) but the win is only ~1.3× on the integral BUILD in a medium band — WebGPU has no f64 so it can't touch the f64-bound J/K. f64 WASM is the recommended chemistry default; the GPU hybrid is `fast=true` opt-in, proof-of-mechanism only.	marginal; not the default
`runRHFAuto` / `runRKSAuto` entry points	✓ (`rhf-auto.ts`: size-gated exact(small)/f64-DF(large)/hybrid-GPU(fast) with honest provenance, for both HF and DFT — `runRKSDFT` gained a `useDF` option; pure functionals ride the cheap DF J, hybrids the DF K; validated H₂O LDA 0.07 mHa / B3LYP5 0.02 mHa vs exact)	one call, right method, attributed, HF+DFT
IP-EOM-CCSD / EA-EOM-CCSD	✓ (stages 37–38, beyond original Tier 2 plan)	accurate IPs / EAs

Tier 3 — Substantial (~25 sessions)

CCSDT (full triples), CASSCF (multi-reference), TD-DFT, MP2/CCSD gradients (Z-vector), PCM solvent, coupled-perturbed HF (NMR / polarizabilities). (WebGPU integral parallelization: DF 3-index/metric + DF-JK now shipped, see Tier 2; full 4-index ERI on GPU still open.)

Tier 4 — Genuinely hard (a season each)

CASPT2 / NEVPT2, periodic DFT (k-points), spin-orbit / X2C, analytical CC gradients, QM/MM.

Deferred moonshots

Phase D (WebRTC swarm) — distributed 1D chain across browsers. ~3-5 sessions. Reuse webgpu-p2p-evolution's relay.
E.1 — Verify Sycamore — 2D PEPS + Sycamore gates + distributed contraction. ~3-5 sessions on top of Phase D.
E.2 — Fault-tolerant qubit — stabilizer sim + surface code + syndrome decoder + threshold curve. ~4-6 sessions.
E.3 — Browser-native lattice QCD — 4D lattice + Wilson Dirac + fused CG solver. ~6-10 sessions.

Cleanest near-term path

WebGPU (T) → EOM-CCSD → DFT excited-state properties. ~5-6 sessions to "real chemistry tool in a browser tab, with speed." Every step ships a publishable artifact.

Unifying thesis: "every advanced physics simulation in the world ships as a URL". webgpu-q is the proof point; the chemistry track is its highest-leverage demonstration.

Current state (one-pager — for per-stage detail: `git log`)

Headline numbers:

L1 statevector: F ≥ 0.999999 vs CPU; 4-experiment ladder (E1–E4) green.
L2 MPS / DMRG: TFIM & Heisenberg N=128 in browser, χ=32, validated to Pfeuty/Bethe limits at 1/N. ITensor cross-checked at N=8 to f64.
L3 kernel fusion: 4.22× headline (Tier C, 8×8 cascade); Tier D plateau (3.78×) is the documented honest negative.
L6 chemistry:
- Accuracy ladder: HF (enforced ≤ 0.5 mHa vs PySCF general; ≤ 0.1 mHa H₂O cc-pVDZ spherical-d) → MP2 → FCI (CH₄ to 0.76 mHa) → CCSD (enforced ≥ 95% correlation capture, ~99% typical on H₂O/CH₄) → CCSD(T) (≤ 0.25 mHa vs FCI).
- cc-pVDZ CCSD(T) on H₂O: CPU 116 s, GPU 13.8× median (5 warmup + 20 trials, M2 Pro; p10=28×, p90=10×, std/median 42%, so the cell is noisy).
- Coverage: full DFT ladder (LDA/GGA/B3-hybrid) on RHF/UHF/RKS/UKS. Full {α, α(ω), α(iω), C₆} response matrix. EE/IP/EA-EOM-CCSD with eigenvectors, oscillator strengths, spin classifier.
- DF engine = f64 WASM (recommended default). runRHFAuto + siblings are size-gated: small → exact ERI (f64), large → streaming aux-basis DF (f64 WASM, buildAuxBasisDFStreaming), with honest method/engine/precision provenance.
- The GPU paths are EXPERIMENTAL (decision 2026-06-10), not the chemistry default. WebGPU has no f64, so the chemistry-grade hybrid (buildV3idxHybrid, fast=true) can only put f32 on the insensitive s/p/d-aux columns (~8 µHa) and buys just ~1.3× on the integral BUILD in a medium band; it can't touch the f64-bound J/K, and it loses at PAH scale. The fully-GPU f32-JK path (makeGpuDFJK, benzene 5-6× whole-loop) is ~30 mHa screening only. Both are kept as proof-of-mechanism ("GPU in the browser") and as the seam where a real win lands IF df64 emulation ever makes the GPU JK chemistry-grade.
- GPU genuinely wins on the f32-tolerant tracks (statevector, kernel-fusion 4.22×, (T) 39× single-run / 13.8× median); DF chemistry just isn't one (needs f64).

Live: https://webgpu-q.vercel.app — landing, /viz.html (4D hyperscope), /molecule.html (SI report), /experiments/ (E1–E33+). Standing preference: do NOT auto-deploy — deploy only when explicitly asked.

Validation surface (what's checked, not how many): reference-grounded gates green in CI — bit-exact / sub-µHa vs PySCF, EOM-CCSD full-tensor brute-force diffs (14×14 LiH), CCSD(T) sub-mHa vs FCI, ITensor N=8 and Pfeuty/Bethe 1D limits, swarm partition-sum vs single-slab below 1e-12. npx tsc --noEmit clean, npm run lint clean, vitest green, e2e browser benches (e2e/) cover all levels + the swarm/acene series.

Honest negatives / open work (each its own session):

IP-EOM-CCSD: PySCF-ported (2026-05-22), multi-electron-validated (2026-06-23). σ_1 + σ_2 follow Tu-Wang-Li 2012 Eqs (8)-(9) with PySCF eom_gccsd intermediates. The earlier R_2 satellite over-count (~60 eV on H₂) was a structural bug (NOT a curve-fit patch — unlike EA) and is closed. IP was the one EOM variant whose exact oracle stayed H₂-only (T̂²≈0, can't probe σ_2); a NEW multi-electron oracle (tests/chemistry/ip-eom-ccsd-bruteforce-lih.test.ts, LiH NSO=6, T̂²≠0, full 16-eigenvalue H̄ = e^{-T}He^{T} projection vs runIPEOMCCSD) matches to 4.97e-13 Ha — IP passed first try, confirming it carried no patch, only a too-weak verifier. Element-by-element H₂ diff < 1e-10 retained.
Phase D swarm: steps 1–3 shipped 2026-05-22; steps 4–7 followed through 2026-06-15:
- Step 1: swarmMap(items, fn) primitive + BroadcastChannelTransport (same-origin multi-tab, no infra).
- Step 2: WebRTCTransport via PeerJS broker (cross-machine, NAT-traversal via Google STUN; symmetric-NAT corporate networks may need TURN, documented).
- Step 3: real-chemistry kernel — chem-energy runs a molecule tile, swarm distributes H₂ bond-length scans (and any 1D parameter scan) across tabs. /swarm.html ships both prime-counting (Demo 1) and bond-scan (Demo 2) demos.
- Step 4 (swarm × GPU, 2026-06-09): the kernel now runs runRHFAuto per tile, so each worker tab auto-picks exact / hybrid-GPU DF and reports its own provenance. e2e/swarm-gpu distributes an N₂ cc-pVDZ batch across 2 tabs, every tile gpu+wasm, tracing N₂'s bond curve (min r=1.098 Å). GPU-accelerated chemistry-grade single-points split across the crowd — the project's two theses in one demo.
- Step 5 (distributed DF-MP2 + honest-negative measurement, 2026-06-15): the swarm's first collaborative single-molecule reduction. ONE molecule's MP2 correlation energy E_corr = Σ_i Σ_j Σ_ab … is partitioned over the outer occupied index i; each tab owns an i-slice and the master sums the scalar partials (mp2-slice kernel + mp2EnergyDF(...,iRange) + reduceMP2Slices). Comm-optimal (spec in, one f64 out, one round); reduceMP2Slices guards the deterministic-reference assumption (throws if per-tab E_HF disagree). Validated: partition-sum == single-machine to <1e-12 (tests/chemistry/mp2-slice), 2-tab e2e to <1e-9 (e2e/swarm-mp2-distributed). HONEST NEGATIVE: distributing the contraction barely speeds up one molecule (e2e/swarm-mp2-speedup, single-shot M2 Pro): H₂O 0.51×, benzene cc-pVDZ 1.10× (single 96s / 2-tab 87s). Breakdown: redundant SCF+DF setup S≈79s (82%, paid in full on every tab, on the critical path) vs splittable grind C≈17.5s (18%). speedup=(S+C)/(S+C/k) is pinned near 1 while S≫C; C≫S needs n≈600 whose DF tensor (~5 GB) won't fit a tab. The swarm's scaling axis is throughput (N independent molecules via chem-energy), NOT single-molecule wall-time. To speed up one molecule you'd have to parallelize the SCF+DF setup, not the correlation.
- Step 6 (screening + honest multi-tab scaling, 2026-06-15): the throughput axis demonstrated. chem-energy now returns the HOMO–LUMO gap (eV) as a screening descriptor; e2e/swarm-screening ranks a 10-molecule library by gap, validated to give the IDENTICAL ranking distributed vs single-tab (that spec validates ranking correctness only — its timing is indicative). e2e/swarm-scaling is the honest measurement (warmed JIT per tab, even round-robin split, true parallelism, wall = slowest tab): 1→1.00×, 2→1.73× (87% eff), 3→2.02× (67%), 4→2.36× (59%) on the library, HF/cc-pVDZ, M2 Pro. Sub-linear because molecule costs are uneven (H₂ ≪ C₂H₄) so the tab holding the heavy molecules caps the win. Further efficiency would need cost-aware scheduling (big molecules first/alone) + a larger library. (Earlier screening "1.59×" was retracted — it was warmup-inflated + a master-heavy auto-distribution; the warmed/balanced 2-tab number is 1.73×.)
- Step 7 (greedy-pull scheduler fix, 2026-06-15): rewrote swarmMap's distribution from single-claim-per-worker (master-heavy: a worker grabbed ONE tile then idled while the master ran the rest via a timeout fallback) to a greedy pull queue — every peer, master included, pulls one tile at a time and requests another only after finishing, so a slow tile parks only its own puller while everyone else keeps draining (also auto-balances uneven tile costs). Two subtleties fixed during the rewrite: (a) a failing worker's tile is run by the MASTER, not requeued to the shared pool — requeueing let a persistently-failing worker re-pull and re-fail the same tile in a tight loop (livelock, hung swarm.test.ts); (b) the master does a one-time head-start yield + keeps yielding while remote workers pull, because local kernel compute blocks the single-threaded event loop and would otherwise drain the queue via microtasks before any remote macrotask is processed. Result: auto-distributed screening went 9/1 → 4/6 (master/other), 1.05× → 1.48×; tests/parallel/swarm (13 tests) green; distributed-MP2 reduction still bit-exact.
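The scheduling change in Step 7 can be pictured with a small cost model. This is a sketch, not swarmMap itself: roundRobinMakespan and greedyMakespan are hypothetical names, the tile costs are made-up numbers, and the failure handling and master yield described above are left out.

```typescript
// Toy cost model for tile distribution (hypothetical helpers, not the repo's swarmMap).
// costs[i] is the run time of tile i; k is the number of peers, master included.

// Static split: peer w is handed tiles w, w+k, w+2k, ... up front.
function roundRobinMakespan(costs: number[], k: number): number {
  const busy = new Array<number>(k).fill(0);
  costs.forEach((c, i) => {
    busy[i % k] = (busy[i % k] ?? 0) + c;
  });
  return Math.max(...busy);
}

// Greedy pull: each peer takes one tile and asks for the next only when done.
// In this model that means every tile, in queue order, goes to the peer that
// frees up first (ties go to the lowest peer index).
function greedyMakespan(costs: number[], k: number): number {
  const freeAt = new Array<number>(k).fill(0);
  for (const c of costs) {
    let w = 0;
    for (let p = 1; p < k; p++) {
      if ((freeAt[p] ?? 0) < (freeAt[w] ?? 0)) w = p;
    }
    freeAt[w] = (freeAt[w] ?? 0) + c;
  }
  return Math.max(...freeAt);
}

// One heavy tile among light ones, two peers (illustrative costs only).
const tileCosts = [1, 9, 1, 1, 1, 1];
const staticWall = roundRobinMakespan(tileCosts, 2); // peer 1 holds 9 + 1 + 1
const greedyWall = greedyMakespan(tileCosts, 2);     // the heavy tile parks only its puller
```

In this toy case the static split finishes at 11 time units and the greedy queue at 9, because the peer that is not holding the heavy tile keeps draining the light ones.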
EE-EOM-CCSD: PySCF-ported (2026-05-21). σ_1 + σ_2 follow Wang-Tu-Wang 2014 Eqs (9)-(10) with PySCF eom_gccsd intermediates (EOM-Fvv/Foo/Wovvo with full t2 dressing + Wovoo / Wvvvo). EE's empirical stage-32c patch removed; brute-force LiH diff < 1e-10 Ha element-by-element. H₂ STO-3G now matches FCI to 8+ decimals.
EA-EOM-CCSD: PySCF-ported + multi-electron-validated (2026-06-16). The 2026-06-16 audit surfaced that ea-eom-ccsd.ts carried an empirical +½·E_corr·R₂ σ_2 diagonal patch (stage-32e) curve-fit to the H₂ brute-force (the diagnostic-loop-trap anti-pattern) — and used BARE integrals where PySCF uses dressed Wvvvo/Wvovv. A NEW multi-electron oracle (tests/chemistry/ea-eom-ccsd-bruteforce-lih.test.ts, LiH NSO=6, T̂²≠0) measured the patch ~1 mHa wrong. σ is now a direct port of PySCF eom_gccsd.eaccsd_matvec onto the shared dressed intermediates (buildEOMIntermediates: Fvv/Foo/Fov/Wvvvv/Wovvo/Wvovv/Wvvvo) + the proper −½ Σ⟨kl||cd⟩ r_l^{cd} t_{ki}^{ab} term, matching the explicit H̄ projection to ~5e-13 Ha on LiH (machine precision). All three EOM variants (EE/IP/EA) are now patch-free PySCF ports with multi-electron (LiH, T̂²≠0) brute-force verifiers — IP's LiH oracle added 2026-06-23 (it passed first try; only EA ever carried an actual curve-fit patch, EE/IP did not).
✓ Aux-basis DF (stage 31 proper) — done: buildAuxBasisDFStreaming (WASM) + df-gpu.ts (WGSL s/p/d 3-index V + metric). No longer open.
DF-CCSD via B-tensor through spin-orbital ERI build.
df64 (double-single) emulation on GPU-JK products to push past the f32 ~6e-4 element-precision floor (Kahan on the sum was a no-op — see jk-df-gpu.ts). Lower priority now: the hybrid path (buildV3idxHybrid) already gives chemistry-grade GPU-accelerated DF via f64 JK, sidestepping the f32-JK floor.
f-functions in the WGSL 3-index kernel: would raise the GPU-carried aux fraction past the current ~91% (hybrid offloads f-aux to WASM) for a bigger GPU win — but no longer needed for accuracy (the hybrid is chemistry-grade).
WASM (or GPU-side) merge kernel to replace the JS block-assembly in buildHybridDFStreaming — THE lever to extend the GPU hybrid past medium molecules. The hybrid currently gates off at n²·nAux ≥ 12 M because the per-block f32-low + f64-f-aux merge is a JS triple-loop that loses to WASM-SIMD streaming at PAH scale (naphthalene >2× slower; honest negative, 2026-06-09). Large molecules use all-WASM streaming DF, which is excellent; the GPU hybrid is a medium-molecule optimization (chemistry-grade + 1.31× V-build).
Naphthalene/PAH-scale DF-HF is feasibility-demonstrated, NOT precision-validated. The capstone (e2e/naphthalene-capstone) asserts only a sane energy window (ERI never built); DF-vs-exact chemical-accuracy is validated only up to the size ladder where the exact 4-index ERI still fits a tab (H₂O→CH₂O→C₂H₄, n≤50, e2e/df-accuracy-ladder). To claim "chemistry-grade up to naphthalene" needs an external PySCF DF-HF reference for that geometry + a sub-mHa assertion (exact ERI is uncomputable in a tab there). Surfaced by the scientific-critic pass 2026-06-09. Don't conflate feasible with validated.
WGSL (T) kernel optimization to push 39× → 100× (no warmup+trials harness yet).
UKS-TDDFT response α(ω) — only remaining {ref}×{response} cell.
Z-vector for MP2 / CCSD analytical gradients.
NMR shielding via magnetic-perturbation CPHF.
Becke-partition weight derivatives in DFT gradients (~1e-3 Ha/Bohr residual).
Spherical-d on TDA-DFT / DFT-gradient grid (refuses with clear error).
Davidson eigensolver for large-basis CIS / TDDFT.
Continuum representation for E17 σ_ion convergence.
Degenerate-eigenvector orthogonalization in eigGeneralWithVectors.
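The Step 5 and Step 6 swarm numbers in the Phase D notes above follow from two one-line formulas. A minimal sketch using the S, C and k of the Step 5 breakdown; the function names are illustrative, not repo exports.

```typescript
// Speedup of one molecule across k tabs when the setup S is repeated on every
// tab and only the grind C is split: speedup = (S + C) / (S + C / k).
function swarmSpeedup(setupS: number, splittableS: number, k: number): number {
  return (setupS + splittableS) / (setupS + splittableS / k);
}

// Parallel efficiency of a measured speedup on k tabs.
function parallelEfficiency(speedup: number, k: number): number {
  return speedup / k;
}

// Step 5, benzene cc-pVDZ: S ≈ 79 s, C ≈ 17.5 s, k = 2 tabs.
const benzeneTwoTab = swarmSpeedup(79, 17.5, 2);   // about 1.10
// Even with unlimited tabs the repeated setup caps the gain at (S + C) / S.
const benzeneCeiling = (79 + 17.5) / 79;           // about 1.22

// Step 6 throughput measurements: 1.73x on 2 tabs, 2.02x on 3, 2.36x on 4.
const step6Efficiency = [
  parallelEfficiency(1.73, 2),
  parallelEfficiency(2.02, 3),
  parallelEfficiency(2.36, 4),
];
```

The ceiling is the reason the notes call single-molecule wall-time the wrong scaling axis: no tab count pushes the benzene case past roughly 1.22× while S stays on the critical path.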

Permanent verifiers:

tests/chemistry/eom-ccsd-bruteforce-lih.test.ts — full 14×14 M_mine − M_exact diff after any σ_1/σ_2 change. Binary feedback.
tests/chemistry/ip-eom-ccsd-bruteforce.test.ts
tests/chemistry/ea-eom-ccsd-bruteforce.test.ts

Research-grade discipline (non-negotiable)

From RESEARCH.md. Every experiment enforces them.

Reproducibility

No Math.random() in any experiment path. Every random draw uses a named seed from experiments/lib/seeds.ts via mulberry32(seed).
Every JSON artifact records: git SHA (when available), navigator.userAgent, adapter.info, WebGPU limits, UTC ISO8601 timestamp, and echoes back protocol, hypothesis, passBar, seed, warmup, trials. See experiments/lib/env.ts → captureEnv(device, adapter).
Artifact shape locked: { meta, env, rows, status, diagnosis }. Don't add top-level keys without updating experiments/lib/runner.ts and the downstream dashboard.
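A minimal sketch of the seeded-draw rule above. It assumes the widely circulated mulberry32 generator; the version in experiments/lib/seeds.ts may differ in detail, and DEMO_SEEDS and drawAngles are hypothetical names used only for illustration.

```typescript
// Commonly published mulberry32: a 32-bit state PRNG returning values in [0, 1).
// Assumption: the repo's own mulberry32 follows the same recipe.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical named seed table; real names live in experiments/lib/seeds.ts.
const DEMO_SEEDS = { randomCircuitAngles: 20260422 } as const;

// Every random draw goes through a generator built from a named seed,
// so the same seed always reproduces the same rotation angles.
function drawAngles(seed: number, count: number): number[] {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, () => rand() * 2 * Math.PI);
}

const anglesRunA = drawAngles(DEMO_SEEDS.randomCircuitAngles, 8);
const anglesRunB = drawAngles(DEMO_SEEDS.randomCircuitAngles, 8);
```

Two runs with the same named seed give identical draws, which is what lets a JSON artifact that echoes its seed be regenerated later.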

Timing

performance.now() with a forced GPU sync before AND after — a mapped readback of a tiny buffer. queue.submit alone is non-blocking so raw timing is fiction. Harness: experiments/lib/runner.ts → timedRun.
Discard 5 warmup samples. Retain 20 trials. Report median, p10, p90, p99, std, IQR — never single-shot.
If std/median > 0.1 on any cell, mark the artifact "status": "noisy".
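The statistics half of the timing rules can be sketched as below. This covers only the warmup discard, the summary numbers and the noisy flag; the GPU sync fence belongs to timedRun and is not reproduced. The percentile convention (linear interpolation) and the names summarize and isNoisy are assumptions, so experiments/lib/stats.ts may differ.

```typescript
interface TimingStats {
  median: number; p10: number; p90: number; p99: number; std: number; iqr: number;
}

// Linear-interpolated percentile on an ascending array (one common convention).
function percentile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  const frac = pos - lo;
  return sorted[lo]! * (1 - frac) + sorted[hi]! * frac;
}

// Drop the warmup samples, then summarize the retained trials.
function summarize(samplesMs: number[], warmup = 5): TimingStats {
  const kept = samplesMs.slice(warmup);
  if (kept.length === 0) throw new Error("no trials left after warmup");
  const sorted = [...kept].sort((a, b) => a - b);
  const mean = kept.reduce((s, x) => s + x, 0) / kept.length;
  const variance =
    kept.reduce((s, x) => s + (x - mean) ** 2, 0) / Math.max(1, kept.length - 1);
  return {
    median: percentile(sorted, 0.5),
    p10: percentile(sorted, 0.1),
    p90: percentile(sorted, 0.9),
    p99: percentile(sorted, 0.99),
    std: Math.sqrt(variance),
    iqr: percentile(sorted, 0.75) - percentile(sorted, 0.25),
  };
}

// The rule from the text: std/median above 0.1 marks the cell as noisy.
function isNoisy(s: TimingStats): boolean {
  return s.std / s.median > 0.1;
}

// 5 slow warmup samples followed by 20 trials (synthetic numbers).
const steadyRun = summarize([...Array(5).fill(100), ...Array(20).fill(10)]);
const jitteryRun = summarize([
  ...Array(5).fill(100),
  ...Array.from({ length: 20 }, (_, i) => (i % 2 === 0 ? 5 : 15)),
]);
```

Both synthetic runs share a median of 10 ms, but only the second trips the noisy flag, which is why a median alone is not enough to report.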

Correctness

Use fidelity F = |⟨ψ_ref | ψ_test⟩|², not max|Δp|. Two states can share a probability distribution and differ in phase — that kills any downstream controlled gate. Use experiments/lib/fidelity.ts → stateMetrics.
Pass bar for f32-amplitude GPU paths: F ≥ 1 − 1e-5.
Pass bar for f64 MPS vs f64 statevector: F ≥ 0.999 (MPS has SVD truncation + accumulated Jacobi error, ~9 digits realistic at χ = 64).
Secondary: TVD, L1, L2, max|Δp|, ‖ψ_ref‖², ‖ψ_test‖² — always reported.
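To make the phase argument concrete, here is a small sketch of the fidelity metric next to max|Δp|. It assumes normalized states stored as interleaved (re, im) pairs; the function names are illustrative and stateMetrics in experiments/lib/fidelity.ts may be organized differently.

```typescript
// States are interleaved complex vectors: [re0, im0, re1, im1, ...].
function fidelity(ref: Float64Array, test: Float64Array): number {
  if (ref.length !== test.length || ref.length % 2 !== 0) {
    throw new Error("state shape mismatch");
  }
  let re = 0;
  let im = 0;
  for (let k = 0; k < ref.length; k += 2) {
    const ar = ref[k]!, ai = ref[k + 1]!;
    const br = test[k]!, bi = test[k + 1]!;
    re += ar * br + ai * bi; // Re(conj(a) * b)
    im += ar * bi - ai * br; // Im(conj(a) * b)
  }
  return re * re + im * im;  // |<ref|test>|^2
}

// Largest per-basis-state probability difference.
function maxProbDiff(ref: Float64Array, test: Float64Array): number {
  let worst = 0;
  for (let k = 0; k < ref.length; k += 2) {
    const pa = ref[k]! ** 2 + ref[k + 1]! ** 2;
    const pb = test[k]! ** 2 + test[k + 1]! ** 2;
    worst = Math.max(worst, Math.abs(pa - pb));
  }
  return worst;
}

// |+> and |-> have identical probabilities but are orthogonal.
const invSqrt2 = Math.SQRT1_2;
const plusState = Float64Array.of(invSqrt2, 0, invSqrt2, 0);
const minusState = Float64Array.of(invSqrt2, 0, -invSqrt2, 0);
const phaseBlindDiff = maxProbDiff(plusState, minusState); // 0: looks identical
const phaseAwareF = fidelity(plusState, minusState);       // 0: fully distinguishable
```

The single-qubit pair |+⟩, |−⟩ is the smallest case where max|Δp| reports a perfect match while the fidelity reports total failure.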

Honest negative results

If an experiment fails its pass bar, still commit the JSON with "status": "fail" and a "diagnosis" naming the first failing cell and the smoking gun. Failures are the evidence. No silent rerunning until it passes.
Example (MPS canonical-form bug, 2026-04-22): brick-wall F = 0.25 at depth 2. Diagnosis: "non-monotonic two-site gate order breaks mixed-canonical invariant, local Frobenius norm ≠ global norm, renormalization distorts." Fix: _canonicalizeBond(q) before every applyTwoSite.

Commands

npm install
npm run dev          # Vite dev server, http://localhost:5175
                     # experiments live at http://localhost:5175/experiments/
npm run test         # Vitest, ~500 ms (one outlier 5 s for the MPS bug repro)
npm run test:watch   # TDD loop
npm run typecheck    # tsc --noEmit (strict, noUncheckedIndexedAccess on)
npm run lint         # ESLint flat config, src/ tests/ experiments/
npm run build        # → dist/
npm run test:e2e     # Playwright, all 4 levels headless (~1.4 min on M2 Pro).
                     # Saves JSON artifacts to experiments/results/<date>/level-N/.
                     # Each level also reachable via window.__webgpuq.runLevelN()
                     # in devtools at /experiments/.
npm run test:e2e:headed   # Same, with a visible browser window.

File layout

src/
  shaders/
    single-qubit.wgsl    # 1-q gate kernel, N/2 threads, 2×2 complex matrix via uniform
    two-qubit.wgsl       # controlled-U kernel, N/4 threads
  gates.ts               # H, X, Y, Z, S/Sdg, T/Tdg, Rx/Ry/Rz, P, matrixFloats()
  quantum.ts             # QuantumCircuit (GPU) + initGPU() with requiredLimits
  cpu-reference.ts       # CpuCircuit (Float64 TS reference, ground truth)
  circuits.ts            # bell, ghz, qft, deutschJozsa, randomCircuit builders
  linalg.ts              # ComplexMatrix, Jacobi complex SVD, matmul   — Level 2
  mps.ts                 # MPS class with canonical form + TEBD         — Level 2
  bench.ts               # GPU vs CPU throughput sweep (pre-research harness)
  main.ts                # Legacy browser demo entrypoint
  chemistry/             # Level 6: HF, MP2, CCSD, CCSD(T), DFT, CIS/TDA/TDDFT,
                         # properties, gradients, geom-opt, vibrational analysis

tests/                   # Vitest unit tests (chemistry/, gates, linalg, mps, …)

experiments/
  index.html             # Research dashboard (run buttons, result tables)
  runner.ts              # Dashboard entry point — wires each level's run-all
  lib/
    seeds.ts             # Named deterministic seeds (no Math.random)
    runner.ts            # timedRun harness + Artifact / ArtifactMeta schema
    env.ts               # captureEnv(device, adapter) → EnvBlock
    fidelity.ts          # stateMetrics, FIDELITY_PASS_BAR
    stats.ts             # stats() — median, p10/p90/p99, std, IQR
  level-1-statevector/   # E1–E4 + run-all
  level-2-mps/           # E5–E7, E18, E19 + run-all
  level-3-fusion/        # E8–E13 shipped (Tiers A/B/C/D fusion)
  level-6-chemistry/     # E16, E20–E31 shipped (H₂ → CCSD(T)/cc-pVDZ)
  results/               # JSON artifacts, organized YYYY-MM-DD/level-N/

Architecture notes (carry forward)

Statevector (Level 1)

Amplitudes stored as vec2<f32> interleaved (re, im). Buffer = 2^(N+3) B.
Single-qubit gate: N/2 threads (N here is the amplitude count, 2^n for n qubits), each processing the pair (i, j) where bit q is 0 and 1. Apply the 2×2 complex matrix from the uniform buffer.
Two-qubit (controlled-U): N/4 threads, index scattered around control + target bits; only control=1 is touched.
initGPU() MUST request the adapter's max maxBufferSize and maxStorageBufferBindingSize via requiredLimits. Default 128 MiB cap silently truncates N ≥ 25 dispatches.
No atomics needed — gate application is pair-local read / write, zero contention.
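The pair indexing behind the single-qubit kernel can be sketched on the CPU. This is an illustration under assumptions, not the WGSL shader: it uses a Float64Array instead of vec2<f32>, a plain loop instead of GPU threads, and hypothetical names (applySingleQubit, statevectorBytes, Mat2).

```typescript
// 2x2 complex matrix as [u00.re, u00.im, u01.re, u01.im, u10.re, u10.im, u11.re, u11.im].
type Mat2 = readonly [number, number, number, number, number, number, number, number];

// psi is interleaved (re, im); qubit 0 is the least significant index bit.
function applySingleQubit(psi: Float64Array, nQubits: number, q: number, u: Mat2): void {
  const [u00r, u00i, u01r, u01i, u10r, u10i, u11r, u11i] = u;
  const pairs = 1 << (nQubits - 1);      // one "thread" per amplitude pair
  const lowMask = (1 << q) - 1;
  for (let t = 0; t < pairs; t++) {
    // Insert a 0 at bit q to get i; j is the same index with bit q set.
    const i = ((t & ~lowMask) << 1) | (t & lowMask);
    const j = i | (1 << q);
    const ar = psi[2 * i]!, ai = psi[2 * i + 1]!;
    const br = psi[2 * j]!, bi = psi[2 * j + 1]!;
    // Each pair is read and written by exactly one iteration: no contention.
    psi[2 * i] = u00r * ar - u00i * ai + u01r * br - u01i * bi;
    psi[2 * i + 1] = u00r * ai + u00i * ar + u01r * bi + u01i * br;
    psi[2 * j] = u10r * ar - u10i * ai + u11r * br - u11i * bi;
    psi[2 * j + 1] = u10r * ai + u10i * ar + u11r * bi + u11i * br;
  }
}

// Bytes for an n-qubit f32 statevector: 2^n amplitudes x 8 bytes = 2^(n+3).
function statevectorBytes(nQubits: number): number {
  return 2 ** (nQubits + 3);
}

// Hadamard on qubit 1 of |00>: amplitude splits between indices 0 and 2.
const hAmp = Math.SQRT1_2;
const HADAMARD: Mat2 = [hAmp, 0, hAmp, 0, hAmp, 0, -hAmp, 0];
const demoPsi = new Float64Array(8);
demoPsi[0] = 1;
applySingleQubit(demoPsi, 2, 1, HADAMARD);
```

statevectorBytes(24) is exactly 128 MiB, which lines up with the note above that the default cap starts truncating at N ≥ 25.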

MPS (Level 2)

Tensor storage: tensors[i] is a ComplexMatrix of shape (χ_L · 2, χ_R) — left-grouped. Element T[l, s, r] at row l·2 + s, col r. Single-qubit gates apply cleanly this way.
Statevector convention: qubit 0 is LSB of the index — ψ[s_0 + 2·s_1 + 4·s_2 + …]. mps.statevector() follows this for comparison with CpuCircuit.psi.
Two-site gate order within the 4×4: i = s_lo · 2 + s_hi — site q is the MSB within the pair. Controlled-U needs the right ordering; see buildControlledMatrix4(U, controlIsLo).
Canonical form invariant (critical). Two-site TEBD needs ‖M‖_F² = ‖ψ‖², which requires left-canonical on sites [0..q−1] and right-canonical on [q+2..N−1]. _canonicalizeBond(q) does the sweep. Cost: O(N · χ³) per two-site gate. Trivial at N ≤ 20, χ ≤ 64.
SVD is one-sided Jacobi on complex matrices: phase-align col q by e^(−iφ) so ⟨p, q⟩ is real, then apply the real Jacobi rotation. 60 sweep cap, TOL = 1e-14.
apply* returns void (mutates). statevector() refuses N > 24.
v1 constraint: applyTwoSite / applyControlled require |c − t| = 1. Non-adjacent two-qubit gates need SWAP ladders (not yet implemented).
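The index conventions above are easy to get backwards, so a small sketch may help. It assumes the layouts exactly as stated (row l·2 + s, qubit 0 as LSB, pair index i = s_lo·2 + s_hi); controlledMatrix4 is a hypothetical stand-in and the repo's buildControlledMatrix4 may be implemented differently.

```typescript
type C = { re: number; im: number };
const C_ZERO: C = { re: 0, im: 0 };
const C_ONE: C = { re: 1, im: 0 };

// Row of T[l, s, r] in the left-grouped (chiL * 2, chiR) matrix.
function tensorRow(l: number, s: number): number {
  return l * 2 + s;
}

// Statevector index with qubit 0 as the least significant bit.
function stateIndex(bits: number[]): number {
  return bits.reduce((acc, s, k) => acc + s * 2 ** k, 0);
}

// 4x4 controlled-U in the pair basis i = sLo * 2 + sHi (site q is the MSB).
// controlIsLo = true means the lower site q is the control.
function controlledMatrix4(u: [[C, C], [C, C]], controlIsLo: boolean): C[][] {
  const m: C[][] = Array.from({ length: 4 }, (_, r) =>
    Array.from({ length: 4 }, (_unused, c) => (r === c ? C_ONE : C_ZERO)),
  );
  for (let tOut = 0; tOut < 2; tOut++) {
    for (let tIn = 0; tIn < 2; tIn++) {
      // Control bit is fixed at 1; the target bit runs over tOut / tIn.
      const row = controlIsLo ? 2 + tOut : tOut * 2 + 1;
      const col = controlIsLo ? 2 + tIn : tIn * 2 + 1;
      m[row]![col] = u[tOut]![tIn]!;
    }
  }
  return m;
}

const PAULI_X: [[C, C], [C, C]] = [[C_ZERO, C_ONE], [C_ONE, C_ZERO]];
const cnotControlLo = controlledMatrix4(PAULI_X, true);  // swaps pair states 2 and 3
const cnotControlHi = controlledMatrix4(PAULI_X, false); // swaps pair states 1 and 3
```

The two CNOT variants differ only in which pair of basis states is swapped, which is the ordering detail the controlIsLo flag exists to pin down.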

Research harness

experiments/lib/runner.ts → timedRun(device, fn, cfg) is the only legitimate way to measure wall time on GPU paths. It owns the sync fence and the error-scope guards.
Artifact<Row> is the JSON shape. emitArtifact logs; downloadArtifact serves it as a download from a click handler.
Per-experiment logs use the [artifact:protocol] status — diagnosis prefix on stdout so CI greps can find pass/fail without parsing JSON.
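As a rough picture of the artifact contract, the sketch below types the five locked top-level keys and fills in status and diagnosis from a list of fidelity rows. Only the top-level keys and the echoed meta fields come from this document; the EnvBlock fields shown, the "pass" status value, the row shape and judgeFidelity are assumptions.

```typescript
type ArtifactStatus = "pass" | "fail" | "noisy";

interface ArtifactMeta {
  protocol: string; hypothesis: string; passBar: number;
  seed: number; warmup: number; trials: number;
}

// Trimmed, hypothetical env block; the real one also carries adapter info and limits.
interface EnvBlock { gitSha: string | null; userAgent: string; timestampUtc: string }

interface Artifact<Row> {
  meta: ArtifactMeta; env: EnvBlock; rows: Row[];
  status: ArtifactStatus; diagnosis: string;
}

interface FidelityRow { cell: string; fidelity: number }

// A failing run still yields a complete artifact; the diagnosis names the
// first cell below the pass bar instead of hiding it.
function judgeFidelity(meta: ArtifactMeta, env: EnvBlock, rows: FidelityRow[]): Artifact<FidelityRow> {
  const firstFail = rows.find((r) => r.fidelity < meta.passBar);
  return {
    meta,
    env,
    rows,
    status: firstFail ? "fail" : "pass",
    diagnosis: firstFail
      ? `first failing cell: ${firstFail.cell} (F = ${firstFail.fidelity} < ${meta.passBar})`
      : "all cells meet the pass bar",
  };
}

// Illustrative inputs; the depth-2 value echoes the MPS canonical-form example.
const demoArtifact = judgeFidelity(
  { protocol: "demo-protocol", hypothesis: "MPS matches statevector", passBar: 0.999, seed: 1, warmup: 5, trials: 20 },
  { gitSha: null, userAgent: "demo", timestampUtc: new Date().toISOString() },
  [{ cell: "brick-wall depth 1", fidelity: 1 }, { cell: "brick-wall depth 2", fidelity: 0.25 }],
);
```

Keeping status and diagnosis as required fields means a failed run cannot be serialized without saying what failed.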

WebGPU gotchas (carry forward from webgpu-dna)

initGPU() MUST pass requiredLimits for maxStorageBufferBindingSize and maxBufferSize. Default 128 MiB cap silently truncates large dispatches.
atomicAdd only on integer atomics (atomic<u32> / atomic<i32>); there is no f32 atomic. Not needed in statevector path (no contention).
No recursive function calls in WGSL. All shaders are single-pass.
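Uniform buffers must be aligned: follow WGSL's uniform-layout rules (vec3/vec4 members on 16-byte boundaries, array strides a multiple of 16), and keep dynamic offsets a multiple of minUniformBufferOffsetAlignment (256 by default).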
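The requiredLimits rule can be sketched as a pure helper so it is checkable without a GPU. The helper names and the AdapterLimitsLike type are hypothetical, and the repo's initGPU() may structure this differently; the WebGPU calls appear only in comments.

```typescript
// The two limits the text says must be raised explicitly.
interface AdapterLimitsLike {
  maxBufferSize: number;
  maxStorageBufferBindingSize: number;
}

// Ask for whatever the adapter supports instead of accepting the defaults.
function requiredLimitsFor(limits: AdapterLimitsLike): AdapterLimitsLike {
  return {
    maxBufferSize: limits.maxBufferSize,
    maxStorageBufferBindingSize: limits.maxStorageBufferBindingSize,
  };
}

// Largest qubit count whose 2^(N+3)-byte statevector fits under both limits
// (returns 0 when the cap is smaller than a single-qubit state).
function maxQubits(limits: AdapterLimitsLike): number {
  const cap = Math.min(limits.maxBufferSize, limits.maxStorageBufferBindingSize);
  let n = 0;
  while (2 ** (n + 1 + 3) <= cap) n++;
  return n;
}

// WebGPU default limits: 256 MiB buffers, 128 MiB storage bindings.
const DEFAULT_LIMITS: AdapterLimitsLike = {
  maxBufferSize: 268435456,
  maxStorageBufferBindingSize: 134217728,
};
const qubitsUnderDefaults = maxQubits(DEFAULT_LIMITS);

// In the browser the helper would feed the device request, roughly:
//   const adapter = await navigator.gpu.requestAdapter();
//   const device = await adapter.requestDevice({
//     requiredLimits: requiredLimitsFor(adapter.limits),
//   });
```

Under the default limits the 128 MiB storage binding is the tighter bound, so the statevector path tops out at 24 qubits unless the device is requested with the adapter's own limits.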

Hero-mode conventions for this repo

Scope-honest. Most research tasks here = hours for a capable agent, not weeks. Attempt now; decompose only if truly large.
Speculation labeled. "This should work" ≠ "tested". Benchmark > belief.
Raw WGSL > framework. Dispatch ceremony is the enemy.
Edge hardware underrated. The thesis is "no one has shipped this in a browser tab." Don't reinvent it; ship the numbers.

Engineering policy — port, don't re-derive (NEW 2026-05-13)

Discovered the hard way via E35/E36 EOM-CCSD bug: webgpu-q's differentiator is the browser/WebGPU layer. The chemistry methods themselves are textbook with peer-reviewed reference implementations (PySCF, libxc, ITensor). Re-deriving them from papers, as we did, produces bugs that take weeks to find. Going forward:

Hand-write only the novel layer: WGSL shaders, WebGPU dispatch + sync, MPS browser memory bookkeeping, kernel fusion, research-grade harness.
Port from references with proper Apache 2.0 attribution everything else: HF, MP2, CCSD, UCCSD, CCSD(T), EOM-CCSD, DFT functionals (libxc), gradients (Pulay), density fitting, integrals if vectorizable, basis-set tables (EMSL).

Migration framework in MIGRATION.md. Per-module status table (🔴 hand-derived → 🟢 ported), priority order, attribution recipe. LICENSE-PYSCF at root covers ported portions.

First scheduled port: eom-ccsd.ts σ_2 from PySCF pyscf/cc/eom_rccsd.py. Closes the singlet-sector bug E35 surfaced on H₂O / NH₃ / CH₄ / BeH₂ / LiH. Verifier is the LiH brute-force diagnostic (tests/chemistry/eom-ccsd-bruteforce-lih.test.ts) — after the port, M_mine − M_exact should collapse to numerical noise.

Modern reference standards (audited 2026-05)

What our claims map to in current literature. Run this audit again before any release or paper draft.

Chemical accuracy = 1 kcal/mol = 1.594 mHa (Pople pragmatic threshold). Our CCSD(T) vs FCI residuals (≤ 0.25 mHa) are sub-chemical; our GPU↔CPU |Δ| (≈ 10⁻¹⁰ Ha) is ~7 orders past chemical accuracy (1.594 × 10⁻³ / 10⁻¹⁰ ≈ 1.6 × 10⁷) and characterizes f32 reduction noise, not method error.
CCSD(T) is still the gold standard in 2025/2026 (multiple JCTC reviews). MAE ~0.2–0.3 kcal/mol at CBS for noncovalent interactions.
AFQMC (Mahajan et al. JCTC Feb 2025, arXiv:2410.02885) now beats CCSD(T) at O(N⁶) vs O(N⁷). Tier 4 candidate "beyond CCSD(T)".
EOM-CCSD literature accuracy vs FCI for singlet single-excitations is 0.1–0.2 eV (~3.7–7.4 mHa) typical, 0.3 eV conservative. Doubly-excited states: errors up to 1 eV. Our 10⁻⁵ Ha on H₂ STO-3G is algorithmic precision (T̂² = 0 for 2-electron systems makes EOM-CCSD ≡ FCI exactly there) — it validates the implementation, not the method on real systems.
GMTKN55 best functionals (2024–2025): ωB97M(2) DH WTMAD2 = 2.19 kcal/mol (best ever), xrevDSD-PBEP86-D4 = 2.23, revDSD-PBEP86-D4 = 2.33. Best RSH: ωB97X-V. Best meta-GGA: SCAN-D3(BJ). We benchmark with B3LYP5 / BLYP / LSDA / B88 / LYP — textbook, not current SOTA. Modern functionals are in the Tier 3 row.
MPS state-of-the-art: TeNPy / ITensor are the reference libraries. Production runs go to χ = 1000+. Our χ ≤ 64 is "browser-feasible"; the comparison in Schollwöck 2011 still holds (χ scales with entanglement).
WebGPU subgroups: out of WebGPU 1.0 spec (gpuweb#3950); coming later. Would unlock 2× reductions in fusion kernels (shuffle/add).
FAIR / Zenodo DOI: standard for reproducible computational chemistry data publishing. We emit JSON artifacts with full env capture but don't mint DOIs. Tier 3+ research-publishing improvement.
Browser-native quantum chemistry: as of 2026-05 web search, no published WebGPU + HF/DFT/CCSD(T) implementation exists outside this repo. Worth a paper if Phase D / hardware verify ever lands.
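The unit conversions quoted in this audit (1 kcal/mol in mHa, 0.1–0.2 eV in mHa) can be rechecked with a few lines. A minimal sketch using the standard Hartree conversion factors rounded to the digits shown; the function names are illustrative.

```typescript
// Standard conversion factors (rounded): 1 Ha = 627.5095 kcal/mol = 27.211386 eV.
const KCAL_PER_MOL_PER_HARTREE = 627.5095;
const EV_PER_HARTREE = 27.211386;

function kcalPerMolToMilliHartree(kcalPerMol: number): number {
  return (kcalPerMol / KCAL_PER_MOL_PER_HARTREE) * 1000;
}

function evToMilliHartree(ev: number): number {
  return (ev / EV_PER_HARTREE) * 1000;
}

// Chemical accuracy threshold quoted above.
const chemicalAccuracyMHa = kcalPerMolToMilliHartree(1);   // about 1.594
// Typical EOM-CCSD error band quoted above.
const eomBandMHa = [evToMilliHartree(0.1), evToMilliHartree(0.2)]; // about 3.7 and 7.3
// Is a residual (in mHa) inside chemical accuracy?
function isSubChemical(residualMHa: number): boolean {
  return Math.abs(residualMHa) < chemicalAccuracyMHa;
}
```

With these factors the 0.25 mHa CCSD(T)-vs-FCI bar sits well inside chemical accuracy, while the 0.1 eV end of the EOM-CCSD literature band is already more than twice the threshold.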

Related repos / links

Sibling: /Users/ahmetbarisgunaydin2/Downloads/webgpu-dna/ — Geant4-DNA port. Has its own CLAUDE.md. Level 6 chemistry cross-links here.
kernelfusion.dev — umbrella theory.
gpubench.dev — WebGPU bench harness reuse pattern.
Pan & Zhang 2021 (arXiv:2103.03074) — Sycamore tensor-network baseline.
Karamitros 2011 — IRT chemistry, cross-link target.
IBM Heron r2 (156q, 2025), Nighthawk (120q, Jan 2026) — E14 target.
Schollwöck 2011 — MPS / DMRG review, χ-vs-error baseline.
Vidal 2003 — iTEBD algorithm (what applyTwoSite implements).
GMTKN55: Goerigk, Hansen, Bauer et al., PCCP 2017 — main DFT benchmark.
Mahajan et al. JCTC 2025 — AFQMC beats CCSD(T) at O(N⁶).
NIST CCCBDB — experimental reference IP, EA, vibrational data.

License

MIT (simulation). Research protocol and experiment artifacts: MIT.
