EDABK_MXFP4_CIM

This project, submitted to the Systems to Silicon Design Contest, introduces a Compute-in-Memory accelerator for MXFP4 GEMM, integrated with and controlled by the Caravel SoC platform.

Abstract

The advancement of Large Language Models (LLMs) demands hardware that can store and process large weight matrices efficiently while sustaining high throughput and good energy and area efficiency. One promising approach is to adopt low-precision numerical formats. Standardized by the Open Compute Project (OCP) in 2023, Microscaling (MX) floating-point formats introduce a shared-scale mechanism: a block of elements shares one scale factor, which reduces both computation and memory cost for AI workloads. Specifically, MXFP4 (Microscaling 4-bit Floating-Point) uses 4-bit floating-point elements with a shared scale, cutting storage and computation costs while maintaining a wide dynamic range.
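
To make the shared-scale idea concrete, here is a minimal Python sketch of how an MXFP4 block can be decoded. It assumes the element and scale encodings given in the OCP MX specification (E2M1 elements with exponent bias 1, and an 8-bit E8M0 power-of-two scale with bias 127). The function names are hypothetical and are not taken from this repository.

```python
def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 element: 1 sign, 2 exponent, 1 mantissa bit."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 code must fit in 4 bits")
    sign = -1.0 if code & 0x8 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        # Subnormal range: the only magnitudes are 0 and 0.5.
        return sign * man * 0.5
    # Normal range: implicit leading 1, exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


def decode_e8m0(scale: int) -> float:
    """Decode the shared 8-bit scale: a pure power of two, bias 127."""
    if not 0 <= scale <= 0xFF:
        raise ValueError("scale must fit in 8 bits")
    if scale == 0xFF:
        return float("nan")  # 0xFF is reserved for NaN in the MX spec
    return 2.0 ** (scale - 127)


def decode_block(scale: int, codes: list[int]) -> list[float]:
    """Every element in the block is multiplied by the same shared scale."""
    s = decode_e8m0(scale)
    return [s * decode_e2m1(c) for c in codes]


if __name__ == "__main__":
    # Scale code 128 means 2**1, so the element values 1.0 and -3.0 are doubled.
    print(decode_block(128, [0b0010, 0b1101]))
```

Because the scale is stored once per block, each element costs only 4 bits plus a small amortized share of the 8-bit scale.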

The Microscaling Data Formats for Deep Learning study (2023) shows that MXFP4 can effectively replace FP32 with minimal accuracy loss. On the GPT-2 (1.5B) model, perplexity increases from 18.4 (FP32) to 18.7 (MXFP4). For ResNet-50, Top-1 accuracy with MXFP4 is 75.9%, close to the 76.1% baseline. MXFP4 reduces storage and computation costs by up to 8x while maintaining stable performance for large Transformer and LLM models.

On the hardware side, however, efficiently supporting LLMs in this format is challenging, particularly for General Matrix Multiplication (GEMM) accelerators: the format's block-based representation and shared scaling complicate the datapath, and moving weights between memory and compute units is costly. Compute-in-memory (CIM) addresses the data-movement cost by performing operations within the memory, reducing energy consumption and latency. MXFP4 is well suited to CIM because its compact representation and shared-scale mechanism enable efficient weight storage and scaling.

Therefore, our team proposes EDABK_MXFP4_CIM, an architecture that performs GEMM in the MXFP4 format using CIM. The overall architecture and operation waveforms are described in the System Block Diagram section.

The key optimizations of this design include the use of Compute-in-Memory (CIM), which reduces memory access time and improves efficiency by performing computations within memory. Additionally, results are accumulated before being quantized into MXFP4, which preserves precision and yields better accuracy in precision-sensitive operations such as General Matrix Multiplication (GEMM).
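
The accumulate-then-quantize ordering can be illustrated with a short software sketch. It is not the RTL of this design. It assumes the conversion rule described in the OCP MX specification (shared exponent taken from the largest magnitude in the block, elements saturated to the E2M1 range) and uses round-to-nearest with ties to even as an assumed rounding mode. All function names are hypothetical, and inputs are assumed to be finite.

```python
import math

# Magnitudes representable by E2M1; the index equals the low 3 bits of the code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (6.0 = 1.5 * 2**2)


def quantize_e2m1(x: float) -> int:
    """Round a finite value to the nearest E2M1 code, saturating at +/-6."""
    mag = min(abs(x), 6.0)
    # Nearest magnitude; on a tie prefer the code with an even mantissa bit.
    idx = min(range(8), key=lambda i: (abs(E2M1_MAGNITUDES[i] - mag), i & 1))
    return (0x8 if x < 0 else 0x0) | idx


def quantize_block(values: list[float]) -> tuple[int, list[int]]:
    """Return (E8M0 scale code, E2M1 codes) for one block of finite values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 127, [0] * len(values)  # scale 2**0, all elements zero
    _, e = math.frexp(amax)            # amax = m * 2**e with 0.5 <= m < 1
    shared_exp = (e - 1) - E2M1_EMAX   # floor(log2(amax)) - emax
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    return shared_exp + 127, [quantize_e2m1(v / scale) for v in values]


def mac_then_quantize(x: list[float], weight_columns: list[list[float]]):
    """Accumulate each dot product in full precision, then quantize once."""
    acc = [sum(a * w for a, w in zip(x, col)) for col in weight_columns]
    return quantize_block(acc)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    cols = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [4.0, -1.0, -1.0]]
    print(mac_then_quantize(x, cols))
```

Rounding happens only once per output here, after the full sum is known, so no partial product or partial sum is rounded to 4 bits along the way. The sketch accepts a block of any length, whereas the MX formats use blocks of 32 elements.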

Contributors

All members are affiliated with the EDABK Laboratory, School of Electrical and Electronic Engineering, Hanoi University of Science and Technology (HUST).

No.	Name	Study programme
1	Phuong-Linh Nguyen	Master of Engineering in IC Design
2	Ngoc-Duong Nguyen	Master of Science in IC Design
3	Hoang-Son Nguyen	Bachelor in Electronics Engineering
4	Viet-Tung Pham	Senior student in Electronics Engineering

Documentation & Resources

For detailed hardware specifications and register maps, refer to the following official documents:

Caravel Datasheet: Detailed electrical and physical specifications of the Caravel harness.
Caravel Technical Reference Manual (TRM): Complete register maps and programming guides for the management SoC.
ChipFoundry Marketplace: Access additional IP blocks, EDA tools, and shuttle services.
OCP Microscaling Formats (MX) Specification: Detailed specifications of the Microscaling Formats.

Prerequisites

Ensure your environment meets the following requirements:

Docker: Linux | Windows | Mac
Python 3.8+ with pip.
Git: For repository management.

System Block Diagram

Signal Name	Direction	Width	Description
D_in	Input	4-bit	Data input bus for MXFP4 inputs and weights.
Wr_en	Input	1-bit	Write Enable: Used to load weight data into memory.
Rd_en	Input	1-bit	Read Enable: Standard memory access to verify stored weight values.
En	Input	1-bit	Operation Enable for the Compute-in-Memory (CIM) macro.
Addr	Input	5-bit	Column Address Decoder (Supports 32 columns).
Scale	Input	8-bit	Scaling factor for MXFP4 quantization and normalization process.
Done	Output	1-bit	Indicates completion of MAC accumulation and quantization.
Sel	Input	1-bit	Mode Select: Switch between Weight Loading and CIM Computation mode.
D_rd	Output	4-bit	Standard data output for memory read-back operations.
D_out	Output	4-bit	Final computed MXFP4 result from the CIM core.
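
When writing testbench stimulus for this interface, it can help to check values against the bus widths before driving them. The Python sketch below encodes only the input widths listed in the table above. The helper name `check_stimulus` is hypothetical, and the sketch makes no assumption about signal polarity or timing, which are defined by the RTL and the waveforms.

```python
# Input pin widths in bits, copied from the signal table.
INPUT_WIDTHS = {
    "D_in": 4,
    "Wr_en": 1,
    "Rd_en": 1,
    "En": 1,
    "Addr": 5,
    "Scale": 8,
    "Sel": 1,
}


def check_stimulus(stimulus: dict[str, int]) -> dict[str, int]:
    """Validate one set of input-pin values against the table's bit widths."""
    for name, value in stimulus.items():
        if name not in INPUT_WIDTHS:
            raise KeyError(f"unknown input pin: {name}")
        width = INPUT_WIDTHS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
    return stimulus


if __name__ == "__main__":
    # Addr is 5 bits wide, so 31 is the highest valid column address.
    print(check_stimulus({"Addr": 31, "D_in": 0b0111, "Wr_en": 1}))
```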

CIM Operation Waveform

Read Operation Waveform

Timeline

Checklist for Shuttle Submission

Top-level macro is named user_project_wrapper.
Full Chip Simulation passes for both RTL and GL.
Hardened Macros are LVS and DRC clean.
user_project_wrapper matches the required pin order/template.
Design passes the local cf precheck.
Documentation (this README) is updated with project-specific details.
