edabk-hust/EDABK_MXFP4_CIM


EDABK_MXFP4_CIM

This project, submitted to the Systems to Silicon Design Contest, introduces a Compute-in-Memory accelerator for MXFP4 GEMM, integrated with and controlled by the Caravel SoC platform.



Abstract

The advancement of Large Language Models (LLMs) demands hardware solutions capable of efficiently storing and processing large weight matrices while maintaining high throughput, energy, and area efficiency. One promising solution is adopting low-precision numerical formats. Standardized by the Open Compute Project (OCP) in early 2024, Microscaling floating-point numbers (MXFP) introduce a shared-scale mechanism that optimizes both computation and memory, improving AI workload efficiency. Specifically, MXFP4 (Microscaling 4-bit Floating-Point), which uses 4-bit floating-point elements with shared scaling, reduces storage and computation costs while maintaining a wide dynamic range.
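The shared-scale idea can be illustrated with a small decoding sketch. It assumes the element and scale encodings of the OCP MX specification (E2M1 elements, one 8-bit E8M0 scale per block); the function names are hypothetical and are not taken from this repository.

```python
from typing import List

# E2M1 magnitudes indexed by the three low bits (2 exponent bits, 1 mantissa bit).
# Bit 3 of the 4-bit code is the sign. Encoding assumed from the OCP MX spec.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 code must fit in 4 bits")
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_e8m0(scale: int) -> float:
    """Decode an 8-bit E8M0 shared scale as 2**(scale - 127); 0xFF encodes NaN."""
    if not 0 <= scale <= 0xFF:
        raise ValueError("E8M0 scale must fit in 8 bits")
    if scale == 0xFF:
        return float("nan")
    return 2.0 ** (scale - 127)


def decode_block(scale: int, codes: List[int]) -> List[float]:
    """Apply one shared scale to every 4-bit element of a block."""
    shared = decode_e8m0(scale)
    return [shared * decode_e2m1(code) for code in codes]


# Hypothetical usage: a scale of 127 is 2**0, so elements keep their raw value.
values = decode_block(127, [0x1, 0x7, 0xF])  # 0.5, 6.0, -6.0
```

Because the scale is stored once per block, each element only needs four bits, while the block as a whole can still cover a wide range of magnitudes.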

The study Microscaling Data Formats for Deep Learning (2023) shows that MXFP4 can effectively replace FP32 with minimal accuracy loss. On the GPT-2 (1.5B) model, perplexity increases from 18.4 (FP32) to 18.7 (MXFP4). For ResNet-50, Top-1 accuracy is 75.9%, close to the 76.1% baseline. MXFP4 reduces storage and computation costs by up to 8x while maintaining stable performance for large Transformer and LLM models.
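The "up to 8x" storage figure can be checked with a short calculation. The sketch below assumes one 8-bit shared scale per block of 32 elements, as in the OCP MX specification; the helper names are illustrative only.

```python
def bits_per_element(element_bits: int = 4, scale_bits: int = 8,
                     block_size: int = 32) -> float:
    """Average storage per element once the shared scale is amortized."""
    return element_bits + scale_bits / block_size


def compression_vs_fp32(**kwargs) -> float:
    """Ratio of FP32 storage (32 bits per element) to MXFP4 storage."""
    return 32 / bits_per_element(**kwargs)


print(bits_per_element())               # 4.25 bits per element
print(round(compression_vs_fp32(), 2))  # 7.53
```

Counting the elements alone gives 32 / 4 = 8x, which is the upper bound; with the shared scale included, the ratio under these assumptions is about 7.5x.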

However, on the hardware side, efficiently supporting LLMs, and in particular building accelerators for General Matrix Multiplication (GEMM) in this format, is challenging because of its unique representation and scaling, especially when moving weights between memory and compute units. Compute-in-memory (CIM) addresses this by performing operations within memory, reducing energy consumption and latency. MXFP4 is well suited to CIM: its compact representation and shared-scale mechanism enable efficient weight storage and scaling.

Therefore, our team proposes EDABK_MXFP4_CIM, an architecture designed to perform GEMM in the MXFP4 format using CIM. The overall architecture and the operation waveforms are described in the System Block Diagram section.

The key optimizations of this design include the use of Compute-in-Memory (CIM), which reduces memory access time and improves efficiency by performing computations within memory. Additionally, results are accumulated before being quantized into MXFP4, which preserves precision and improves accuracy in accumulation-heavy operations such as General Matrix Multiplication (GEMM).
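The benefit of accumulating before quantizing can be seen in a minimal software sketch. It is not the hardware datapath: the scale exponents are plain unbiased integers, the rounding is nearest-value with saturation, and all function names are hypothetical.

```python
from typing import List

# E2M1 magnitudes indexed by the three low bits; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode(code: int, scale_exp: int) -> float:
    """Decode a 4-bit E2M1 code under a shared power-of-two scale."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7] * 2.0 ** scale_exp


def quantize_e2m1(value: float, scale_exp: int) -> int:
    """Round to the nearest E2M1 code; values beyond 6.0 saturate to 6.0."""
    scaled = abs(value) / 2.0 ** scale_exp
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - scaled))
    return (0x8 if value < 0 else 0x0) | idx


def mac_then_quantize(inputs: List[int], weights: List[int],
                      in_exp: int, w_exp: int, out_exp: int) -> int:
    """Accumulate all products in full precision, then quantize once."""
    acc = 0.0
    for x, w in zip(inputs, weights):
        acc += decode(x, in_exp) * decode(w, w_exp)
    return quantize_e2m1(acc, out_exp)


def quantize_every_step(inputs: List[int], weights: List[int],
                        in_exp: int, w_exp: int, out_exp: int) -> int:
    """Contrast case: re-quantize the running sum after every product."""
    acc_code = 0
    for x, w in zip(inputs, weights):
        partial = decode(acc_code, out_exp) + decode(x, in_exp) * decode(w, w_exp)
        acc_code = quantize_e2m1(partial, out_exp)
    return acc_code


# Eight products of 0.5 * 0.25 = 0.125 sum to 1.0.
xs, ws = [0x1] * 8, [0x1] * 8
late = mac_then_quantize(xs, ws, 0, -1, 0)     # code 0x2, i.e. 1.0
early = quantize_every_step(xs, ws, 0, -1, 0)  # code 0x0, i.e. 0.0
```

In this example each individual product is too small to survive rounding to E2M1, so quantizing at every step loses the whole sum, while quantizing once at the end recovers it exactly.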


Contributors

All members are affiliated with EDABK Laboratory, School of Electrical and Electronic Engineering, Hanoi University of Science and Technology (HUST).

| No. | Name | Study programme | Relevant link |
|---|---|---|---|
| 1 | Phuong-Linh Nguyen | Master of Engineering in IC Design | |
| 2 | Ngoc-Duong Nguyen | Master of Science in IC Design | |
| 3 | Hoang-Son Nguyen | Bachelor in Electronics Engineering | |
| 4 | Viet-Tung Pham | Senior student in Electronics Engineering | |

Documentation & Resources

For detailed hardware specifications and register maps, refer to the following official documents:


Prerequisites

Ensure your environment meets the following requirements:

  1. Docker: Linux | Windows | Mac.
  2. Python 3.8+ with pip.
  3. Git: for repository management.

System Block Diagram

EDABK_MXFP4_CIM's Block Diagram

| Signal Name | Direction | Width | Description |
|---|---|---|---|
| D_in | Input | 4-bit | Data input bus for MXFP4 inputs and weights. |
| Wr_en | Input | 1-bit | Write Enable: Used to load weight data into memory. |
| Rd_en | Input | 1-bit | Read Enable: Standard memory access to verify stored weight values. |
| En | Input | 1-bit | Operations Enable for the Computing-In-Memory (CIM) macro. |
| Addr | Input | 5-bit | Column Address Decoder (Supports 32 columns). |
| Scale | Input | 8-bit | Scaling factor for MXFP4 quantization and normalization process. |
| Done | Output | 1-bit | Indicates completion of MAC accumulation and quantization. |
| Sel | Input | 1-bit | Mode Select: Switch between Weight Loading and CIM Computation mode. |
| D_rd | Output | 4-bit | Standard data output for memory read-back operations. |
| D_out | Output | 4-bit | Final computed MXFP4 result from the CIM core. |

CIM operation
CIM Operation Waveform
Read operation
Read Operation Waveform


Timeline


Checklist for Shuttle Submission

  • Top-level macro is named user_project_wrapper.
  • Full Chip Simulation passes for both RTL and GL.
  • Hardened Macros are LVS and DRC clean.
  • user_project_wrapper matches the required pin order/template.
  • Design passes the local cf precheck.
  • Documentation (this README) is updated with project-specific details.
