pccx — Bare-Metal Transformer Accelerator on Kria KV260

PCCX™ technology / operated by Altifigence™.

This repository is the KV260 + PCCX v002 LLM application integration repo. The reusable v002 IP-core is pinned at third_party/pccx-v002. The future v003 IP-core will live in pccx-v003.

Open SystemVerilog NPU for experimental Gemma-class LLM acceleration on AMD/Xilinx Kria KV260.

PCCX KV260 Evidence State

xsim: PASS 12/0
post-synth timing: RTL synthesis-closed (WNS 0.052)
post-impl timing recovery in progress; evidence pending
bitstream: not generated
KV260 board execution: no evidence
Gemma 3N E4B runtime: no evidence
throughput: no measurement

Current status: RTL synthesis-closed (WNS 0.052); post-impl timing recovery in progress; evidence pending.

This repo hosts board integration, bare-metal driver source, application wiring, and the Vivado/sim wrappers that consume the reusable pccx-v002 LLM package through third_party/pccx-v002 and hw/vivado/filelist.v002.f.

This is not a production bitstream release. xsim and RTL synthesis evidence are present; post-impl timing recovery, bitstream generation, board execution, and full Gemma 3N E4B application wiring remain in progress or pending.

Start with the architecture docs

The design rationale, ISA, memory map, and model-mapping notes live on the pccx documentation site. This repo implements what that site specifies — read the spec first, then come back here for the KV260 integration flow.

→ pccx v002: Architecture & ISA spec · Gemma 3N E4B on pccx v002 · Korean documentation

Related repos: pccx (spec) · pccx-lab (profiler / simulator) · llm-bottleneck-lab (related research)

Project status

Public alpha — v0.1.0-alpha is published as a prerelease. Core RTL and ISA are available; verification and KV260 bring-up are in progress. This is not a bitstream release. Feedback and issues are welcome.

Entry point	Link
Architecture & ISA spec	https://pccx.pages.dev/en/docs/v002/index.html
RTL consumed by KV260	`third_party/pccx-v002/LLM/rtl/` + `third_party/pccx-v002/common/rtl/` via `hw/vivado/filelist.v002.f`; top `pccx_npu_top.sv`
Releases	https://github.com/pccxai/pccx-FPGA-NPU-LLM-kv260/releases
`v0.1.0-alpha` notes	docs/releases/v0.1.0-alpha.md
Roadmap (project board)	https://github.com/orgs/pccxai/projects/1
Contributing	https://github.com/pccxai/.github/blob/main/CONTRIBUTING.md
How to cite	CITATION.cff
Verification check	`repo-validate` required on `main`. The v002 Sail typecheck runs in `pccxai/pccx-v002`.
Discussions (board bring-up, RTL Q&A)	https://github.com/pccxai/pccx-FPGA-NPU-LLM-kv260/discussions
Good first issues	`good first issue` label (suitable for first-time contributors)

The repository Wiki is intentionally empty — RTL- and bring-up questions belong in Discussions, and the canonical architecture documentation lives on the pccx Sphinx site.

What's here vs. what's in pccx

Layer	Lives in	Authoritative source
Architecture / ISA / driver spec	`pccx/docs/v002/`	pccx v002 docs
Target-model pipeline (Gemma 3N E4B)	`pccx/docs/v002/Models/`	Models section
Reusable v002 IP-core RTL	`third_party/pccx-v002/LLM/`, `third_party/pccx-v002/common/`	`pccx-v002` compatibility contract + `third_party/PINS.md`
KV260 Vivado integration	this repo — `hw/vivado/`	Wrapper, Tcl flow, and `filelist.v002.f`
Bare-metal driver (C/C++)	this repo — `sw/driver/`	API spec: Drivers/api
Application (planned, v0.2.0)	this repo — `sw/gemma3NE4B/` (not yet in tree)	—

If you want to read about how the accelerator works, head to the pccx v002 docs — that's the canonical source for every architectural decision in this repo. If you want to inspect the KV260 integration wrapper, the submodule pin, or the board flow, stay here.

Architecture Snapshot (pccx v002)

[Figure: pccx v002 top-level architecture diagram; the annotated version is on the top-level page of the pccx site.]

Three heterogeneous cores around a centralized L2 URAM cache:

Core	Shape	What it runs
Systolic Array (GEMM)	32 × 16 × 2	Prefill — `Q · Kᵀ` across the full context, FFN in prefill
GEMV Core	32 × 1 × 4	Decode — every projection in the autoregressive step
SFU (CVO)	32 × 1 × 4	Softmax · GELU · RMSNorm · RoPE · sin/cos · reduce · scale · recip

Clock domains: AXI 250 MHz ↔ core 400 MHz, crossed by async CDC FIFOs.
Weight path: HP0/1 = GEMM weights, HP2/3 = GEMV weights, 128-bit/clk each.
Activation path: host DDR4 → ACP DMA → L2 URAM (1.75 MB, true dual port).
Direct-connect FIFO: GEMV → SFU, so softmax runs without an L2 round-trip.

Full rationale and numbers: Top-Level → · Design rationale →

Matrix Core (GEMM) — 32 × 16 × 2 Systolic Array

Weight-stationary 2D systolic layout. Activations are broadcast along columns, and partial sums propagate vertically into the result accumulator. The array is used only during prefill and sits idle during decode.

Each PE is a DSP48E2 wrapped with input flip-flops on both the Activation and Weight ports, plus an accumulator with a P-register output.

Details: GEMM core → · GEMM dataflow →

Vector Core (GEMV) — 32 × 1 × 4 Lanes

Four parallel GEMV lanes, each an 8-wide DSP pipeline fed by an Activation broadcast and a Weight row. Outputs feed a reduction tree that collapses partial products into the final vector entry register. The primary compute path during autoregressive decode.

Per-cycle operand shapes: a 1×N activation row multiplied against an N×N weight tile.

Details: GEMV core → · GEMV dataflow →

Why W4A8 with 1 DSP = 2 MAC

The DSP48E2 has a single 27×18 multiplier, not two. pccx v002 bit-packs two INT4 weights into port A alongside a single INT8 activation on port B, so each DSP emits two MACs per cycle into the 48-bit accumulator with a 19-bit guard band between the two channels.

When the lower channel's sum is negative, it borrows from the upper channel inside the shared accumulator. After accumulation, a sign-recovery step adds that borrow back to restore the upper channel.

Maximum accumulations before draining the ACCM: 2^10 = 1024 per channel (guard-band limited).
For K > 1024 (e.g. Gemma 3N's FFN with K = 16384, which needs 16 drains), the Global Scheduler drains the ACCM every 1024 cycles into a LUT-based adder tree and merges the partial sums.
Peak: 32 × 16 × 2 = 1024 DSPs × 2 MAC = 2048 MAC per cycle; 2048 MAC × 400 MHz ≈ 819 GMAC/s across the two systolic arrays.
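The packing, sign recovery, and periodic drain described above can be sketched as a small arithmetic model. This is an illustration, not the RTL: the 23-bit shift is an assumption (a 4-bit lower weight plus the 19-bit guard band, which happens to fill the 27-bit port A), all function names are hypothetical, and Python's unbounded integers stand in for the 48-bit accumulator.

```python
import random

SHIFT = 23           # ASSUMPTION: 4-bit lower weight + 19 guard bits
DRAIN_EVERY = 1024   # accumulations per ACCM drain, as stated in the text


def pack_weights(w_hi: int, w_lo: int) -> int:
    """Pack two signed INT4 weights into one port-A operand."""
    assert -8 <= w_hi <= 7 and -8 <= w_lo <= 7
    # A negative w_lo borrows one unit from the upper field here.
    return (w_hi << SHIFT) + w_lo


def sign_recover(acc: int) -> tuple[int, int]:
    """Split an accumulated value into (upper, lower) channel sums."""
    lo = acc & ((1 << SHIFT) - 1)
    if lo >= 1 << (SHIFT - 1):   # lower channel is negative
        lo -= 1 << SHIFT
    hi = (acc - lo) >> SHIFT     # remove the lower sum (undoing any borrow), then shift
    return hi, lo


def dual_dot(acts, w_hi, w_lo) -> tuple[int, int]:
    """Two dot products sharing one multiplier, drained every 1024 MACs."""
    total_hi = total_lo = 0
    for start in range(0, len(acts), DRAIN_EVERY):
        stop = start + DRAIN_EVERY
        acc = 0
        for a, h, l in zip(acts[start:stop], w_hi[start:stop], w_lo[start:stop]):
            acc += a * pack_weights(h, l)   # one multiply, two MACs
        hi, lo = sign_recover(acc)
        total_hi += hi                      # partial sums merged outside
        total_lo += lo
    return total_hi, total_lo


random.seed(0)
K = 4096
acts = [random.randint(-128, 127) for _ in range(K)]
w_hi = [random.randint(-8, 7) for _ in range(K)]
w_lo = [random.randint(-8, 7) for _ in range(K)]
result = dual_dot(acts, w_hi, w_lo)
```

The result can be compared against two ordinary dot products, `sum(a * h ...)` and `sum(a * l ...)`. The recovery only holds while the lower-channel sum stays inside its signed field, which is why the accumulator has to be drained before the guard band is exhausted.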

Details: DSP48E2 W4A8 bit-packing →

Target Gemma 3N E4B Execution Path

The target model (Google Gemma 3N E4B) has several deviations from a textbook decoder that the scheduler has to honor. The short list:

Feature	Effect
AltUp 4 residual streams	Four copies of `xs` live in L2; main stream `xs[0]` stays clean, shadow streams `xs[1..3]` receive depth-dependent updates.
Alternating RoPE θ (5-layer cycle)	θ = 10 000 (local) or 1 000 000 (global), preloaded per-layer via `MEMSET`.
No attention scaling, no softcap	Softmax sequence drops from 4 CVO instructions to 3.
LAuReL parallel branch	Two tiny GEMVs (`D × 64`, `64 × D`) + a `CVO_SCALE` by `1/sqrt(2)`.
PLE shadow-stream injection	Per-Layer Embedding only touches `xs[1..3]` at the end of each layer; the main stream is untouched.
FFN Gaussian Top-K sparsity (L0–9)	Replaces a sort with `Mean + 1.645·Std`; ~95 % of `gate_raw` becomes zero and `W_down` skips masked rows.
Cross-layer KV sharing	Only layers 0–19 own their KV cache; layers 20–34 reuse layer 18 (local) or 19 (global). Cache shape is `[20, L, 512]`, not `[35, L, 512]`.
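The Gaussian Top-K row above can be illustrated with a short sketch. It assumes a population standard deviation and that surviving values pass through unchanged; neither detail is stated here, and the function name is hypothetical. The constant 1.645 is the one-sided 95th-percentile z-score of a normal distribution, which is where the "~95 % zero" figure comes from when `gate_raw` is roughly Gaussian.

```python
import random
import statistics


def gaussian_topk_mask(gate_raw, z=1.645):
    """Zero every entry at or below mean + z * std; no sort required."""
    mean = statistics.fmean(gate_raw)
    std = statistics.pstdev(gate_raw)     # ASSUMPTION: population std
    cutoff = mean + z * std
    # ASSUMPTION: survivors are kept as-is rather than shifted by the cutoff.
    return [g if g > cutoff else 0.0 for g in gate_raw]


# Synthetic, roughly Gaussian gate activations (illustrative data only).
random.seed(0)
gate_raw = [random.gauss(0.0, 1.0) for _ in range(10_000)]
masked = gaussian_topk_mask(gate_raw)
zero_frac = sum(1 for g in masked if g == 0.0) / len(masked)
```

Because the cutoff needs only a mean and a standard deviation, it maps onto reduce-style operations instead of a full sort, and the zeroed entries identify the `W_down` rows that can be skipped.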

End-to-end decode flow, per-cycle overlap strategy, instruction-level mapping, memory layout, and the performance budget all live in the pccx Models section.

Custom 64-bit VLIW ISA

Five opcodes, 64 bits each: [63:60] opcode + [59:0] body.

Opcode	Mnemonic	Use
`4'h0`	`OP_GEMV`	Vector × Matrix — decode projections
`4'h1`	`OP_GEMM`	Matrix × Matrix — prefill Q·Kᵀ, A·V across full sequence
`4'h2`	`OP_MEMCPY`	Host DDR4 ↔ L2 DMA (ACP)
`4'h3`	`OP_MEMSET`	Write shape / size / scale constants to the Constant Cache
`4'h4`	`OP_CVO`	Element-wise non-linear (exp, sqrt, GELU, sin, cos, reduce_sum, scale, recip)

Spec: Per-instruction encoding → · Dataflow per opcode →
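As a minimal sketch of the `[63:60] opcode + [59:0] body` split, the helpers below pack and unpack a 64-bit instruction word. The opcode values come from the table above; the function names are hypothetical, and the body is treated as an opaque 60-bit field because its per-instruction layout is defined in the spec, not here.

```python
OPCODES = {
    "OP_GEMV": 0x0,
    "OP_GEMM": 0x1,
    "OP_MEMCPY": 0x2,
    "OP_MEMSET": 0x3,
    "OP_CVO": 0x4,
}
MNEMONICS = {value: name for name, value in OPCODES.items()}

BODY_BITS = 60
BODY_MASK = (1 << BODY_BITS) - 1


def encode(mnemonic: str, body: int) -> int:
    """Build a 64-bit word: opcode in [63:60], opaque body in [59:0]."""
    if not 0 <= body <= BODY_MASK:
        raise ValueError("body does not fit in 60 bits")
    return (OPCODES[mnemonic] << BODY_BITS) | body


def decode(word: int) -> tuple[str, int]:
    """Split a 64-bit word back into (mnemonic, body)."""
    opcode = (word >> BODY_BITS) & 0xF
    if opcode not in MNEMONICS:
        raise ValueError(f"unknown opcode {opcode:#x}")
    return MNEMONICS[opcode], word & BODY_MASK


word = encode("OP_CVO", 0x123)
```

A round trip through `decode(word)` should return the original mnemonic and body, and any opcode outside the five defined values is rejected.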

The pipeline is fully decoupled: the front-end decodes and enqueues into per-engine FIFOs, and each compute engine fires independently once its local dependencies (weight stream, fmap ready) are satisfied. A stall in one engine never halts another.

Example — softmax sequence (three CVO calls because Gemma 3N has no softcap)

GEMV   flags.findemax=1              ; Q · Kᵀ, track e_max
CVO    CVO_EXP  flags.sub_emax=1     ; exp(score - e_max)
CVO    CVO_REDUCE_SUM                ; Σ exp → scalar
CVO    CVO_SCALE flags.recip_scale=1 ; divide each exp by the sum
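For reference, the same sequence in plain Python, with each step labelled by the instruction it models. This is a numerical sketch of max-subtracted softmax, not the driver API; the function name is hypothetical and floats stand in for whatever fixed-point or BF16 format the SFU actually uses.

```python
import math


def softmax_three_cvo(scores):
    """Max-subtracted softmax, split the way the listing above splits it."""
    e_max = max(scores)                            # GEMV, flags.findemax=1
    exps = [math.exp(s - e_max) for s in scores]   # CVO_EXP, flags.sub_emax=1
    total = sum(exps)                              # CVO_REDUCE_SUM
    recip = 1.0 / total
    return [e * recip for e in exps]               # CVO_SCALE, flags.recip_scale=1


probs = softmax_three_cvo([1000.0, 1001.0, 1002.0])
```

Subtracting `e_max` before the exponential keeps every exponent at or below zero, so large attention scores do not overflow.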

KV Cache Strategy

KV bandwidth, not FLOPs, is what limits the context length L on KV260. At a 32 K context the cumulative cache would reach ~1.31 GB, and at DDR4's ~10 GB/s a full read of that cache takes more than 130 ms per token (1.31 GB ÷ 10 GB/s ≈ 131 ms). Three mitigations are enforced at the RTL, memory-controller, and driver levels:

KV quantization — DRAM format is INT8 (default) or INT4. 2–4× bandwidth and capacity savings, aligned with the W4A8 compute path.
Attention-sink + local-window eviction — the driver retains only the first few tokens and a sliding recent window; middle tokens are evicted on a schedule, combined with Google Turbo-Quant-style requantization.
Hard cap — the KV ring-buffer ceiling is set at init (max_tokens = 8192). Wrap-around overwrites the oldest entries. This bounds both OOM risk and worst-case memory traffic.
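The hard-cap behaviour can be sketched as a fixed-size ring buffer. The class and method names are hypothetical and the entries are placeholders; only the `max_tokens` ceiling set at init and the overwrite-oldest wrap-around come from the description above. Quantization and attention-sink eviction are left out.

```python
class KVRing:
    """Fixed-capacity KV store: once full, the oldest slot is overwritten."""

    def __init__(self, max_tokens: int = 8192):
        self.max_tokens = max_tokens
        self.slots = [None] * max_tokens   # allocated once, at init
        self.written = 0                   # total tokens ever appended

    def append(self, entry) -> int:
        """Store one token's KV entry and return the slot it landed in."""
        slot = self.written % self.max_tokens   # wrap-around index
        self.slots[slot] = entry
        self.written += 1
        return slot

    def live_tokens(self) -> int:
        """Number of tokens currently held, never above the cap."""
        return min(self.written, self.max_tokens)


ring = KVRing(max_tokens=4)
landed = [ring.append(token) for token in range(6)]
```

With a capacity of 4 and six appends, tokens 4 and 5 overwrite the slots that held tokens 0 and 1, so both memory footprint and per-token read traffic stay bounded by the cap.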

Details: KV cache strategy →

Roadmap — Integration and IP-core lines

This repository remains the KV260 + PCCX v002 LLM application integration repo. Reusable v002 IP-core ownership is in pccx-v002 and is consumed here through the pinned third_party/pccx-v002 submodule; future v003 IP-core ownership will be separate in pccx-v003.

Track	Owner	Target model	Scope	Status
v002 integration	this repo + `pccx-v002`	Gemma 3N E4B	KV260 board flow, bare-metal driver, application wiring, and v002 LLM package consumption	In progress
v003 IP-core	`pccx-v003`	Gemma 4 E4B	Separate IP-core line and compatibility contract	Future / TBD

third_party/pccx-v002 supplies the consumed v002 IP-core sources.
hw/vivado/filelist.v002.f is the KV260 integration compile entry for the submodule-backed RTL.
v003 RTL belongs to the separate IP-core repository line.

Full phase-by-phase plan, decision points, compute budget, and Year 2 Auto-Porting Pipeline α vision:

→ Roadmap (EN) · Korean

Implementation Status

Block	Status
Custom VLIW ISA	Spec complete
VLIW frontend + decoder	RTL complete
Global Scheduler	RTL complete
Systolic Array (Matrix Core, v001 32×32)	RTL complete
GEMV pipeline (Vector Core, 4 lanes)	RTL complete
CVO SFU + CORDIC	RTL complete
FMap preprocessing (BF16 → fixed-pt)	RTL complete
L2 URAM cache + ACP DMA	RTL complete
CVO ↔ L2 stream bridge	RTL complete
NPU top-level wiring	RTL complete
Python golden model	Verified
pccx v002 re-parameterization (1 DSP = 2 MAC)	In progress
uXC driver (AXI-Lite HAL)	Skeleton
Gemma 3N E4B application (`sw/gemma3NE4B/`)	Planned (v0.2.0) — not yet in tree
Simulation / trace-driven verification	xsim smoke suite active
Vivado synthesis + timing closure	RTL synthesis closed (WNS 0.052); post-impl timing recovery in progress; evidence pending.

The reusable RTL previously carried under hw/rtl/ now comes from third_party/pccx-v002/LLM/rtl/ and third_party/pccx-v002/common/rtl/. The KV260-specific wrapper remains at hw/vivado/npu_core_wrapper.sv.

Repository Layout

hw/
  vivado/
    filelist.v002.f           ← forwards to third_party/pccx-v002/LLM/scripts/filelist.f
    npu_core_wrapper.sv       ← plain-signal wrapper around pccx_npu_top
    *.tcl                     ← KV260 Vivado project, synth, and impl flow
sw/
  driver/                     ← AXI-Lite MMIO HAL + inference API (skeleton)
  gemma3NE4B/                 ← Gemma 3N E4B application — planned for v0.2.0, not yet in tree
docs/                         ← Redirect stub only — full docs live on pccx
scripts/v002/                 ← submodule sim wrapper and local candidate checks
third_party/pccx-v002/        ← pinned v002 IP-core submodule

docs/ in this repo is intentionally a redirect stub. All architectural and model documentation now lives in the pccx repo / GitHub Pages site.

Verification workflow

The repo runs the v002 xsim harness through scripts/v002/use_submodule_sources.sh, which forwards to third_party/pccx-v002/LLM/sim/run_verification.sh with the pinned submodule RTL root. One command runs the full suite and emits a .pccx trace per bench for pccx-lab to visualise:

scripts/v002/use_submodule_sources.sh

For a shorter local smoke subset:

scripts/v002/use_submodule_sources.sh --quick
scripts/v002/run-local-candidate.sh --quick

See docs/SIMULATION.md for run log paths, PASS verdict rules, and the evidence checklist.

Current testbench matrix

Testbench	Module(s) under test	Verdict count
`tb_shape_const_ram`	`shape_const_ram` (reset / write / read contract)	15
`tb_mem_dispatcher_shape_lookup`	`mem_dispatcher` + `shape_const_ram` LOAD pointer routing	11
`tb_GEMM_dsp_packer_sign_recovery`	`GEMM_dsp_packer` + `GEMM_sign_recovery` (W4A8 dual-MAC)	1024
`tb_GEMM_fmap_staggered_delay`	`GEMM_fmap_staggered_dispatch` (column stagger)	65
`tb_GEMM_weight_dispatcher`	`GEMM_weight_dispatcher` (upper / lower AND-valid)	128
`tb_mat_result_normalizer`	`mat_result_normalizer` (48 b 2sC → BF16 4-stage)	256
`tb_FROM_mat_result_packer`	`FROM_gemm_result_packer` (32 lanes → 4×128 b FSM)	4
`tb_barrel_shifter_BF16`	`barrel_shifter_BF16` (BF16 → 27 b fixed-point)	512
`tb_ctrl_npu_decoder`	`ctrl_npu_decoder` (4-bit opcode → one-hot valid)	6
`tb_mem_u_operation_queue`	`mem_u_operation_queue` (queue push / pop smoke)	32
`tb_v002_runtime_smoke_program`	v002 ISA `.memh` -> decoder/scheduler handoff	7

Every bench emits the canonical PASS: line that pccx-lab's from_xsim_log converter recognises; the deterministic runner exits nonzero on a missing verdict or an explicit non-PASS verdict.
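The verdict rule can be illustrated with a small log check. This is a sketch of the rule only, not the runner in `run_verification.sh`: the function name is hypothetical, and while the `PASS:` prefix comes from the text above, the `FAIL:` prefix is an assumed stand-in for "explicit non-PASS verdict". See docs/SIMULATION.md for the real rules.

```python
import re

PASS_LINE = re.compile(r"^PASS:", re.MULTILINE)
FAIL_LINE = re.compile(r"^FAIL:", re.MULTILINE)   # ASSUMPTION: hypothetical prefix


def verdict_exit_code(log_text: str) -> int:
    """Return 0 only when the log has a PASS: line and no failing verdict."""
    if FAIL_LINE.search(log_text):
        return 1        # explicit non-PASS verdict
    if not PASS_LINE.search(log_text):
        return 1        # missing verdict counts as a failure
    return 0


clean = verdict_exit_code("elaborating...\nPASS: 15 checks\n")
silent = verdict_exit_code("elaborating...\n$finish called\n")
```

Treating a silent log as a failure matters: a bench that crashes or times out before printing its verdict would otherwise look the same as one that passed.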

Adding a new testbench

Reusable testbench registration lives in third_party/pccx-v002/LLM/sim/run_verification.sh, with testbench sources under third_party/pccx-v002/LLM/tb/. Changes to that reusable suite belong in pccx-v002; this repo should only update KV260-specific runtime inputs and wrappers. See pccx-lab's verification-workflow doc for the end-to-end flow diagram and the Tauri IPC surface.

What plugs into pccx-lab

Surface	Command
Full suite runner	`run_verification` IPC -> `scripts/v002/use_submodule_sources.sh`
Per-tb trace loader	"Open" button per row (calls `load_pccx`)
Synth utilisation	`SynthStatusCard` (parses `hw/build/reports/`)
Roofline classification	`RooflineCard` (runs on the loaded trace)
Markdown summary	`generate_markdown_report` IPC

License

Apache 2.0, same as pccx. The license's explicit patent grant reduces patent risk for users of the architecture while keeping the ecosystem open for hardware research.

Trademark

PCCX™ is a mark used by the PCCX project. Korean trademark applications are pending for PCCX in Classes 09 and 42. Registration has not been granted; do not use any registration-mark variant until the central trademark policy is updated. See pccx/TRADEMARKS.md.
