A high-performance, polyphonic audio synthesis library for the ESP32 series (including S3, S2, C3, C6, etc.). Engineered for extreme bare-metal optimization, low-latency rendering, massive voice density, custom DSP hooks, and direct filesystem/SD-card streaming. Dual-framework support ensures compilation in both Arduino IDE and VS Code (PlatformIO) under either Arduino or native ESP-IDF.
- Architectural Philosophy: Why 500 Voices?
- PlatformIO (VS Code) & ESP-IDF Integration
- Multi-Core & Extended Chip Family Support
- Core Configuration & Latency Tuning
- Memory Footprint & Hardware Isolation
- Unified API Reference
- The Power of SMODE_PWM (LEDC Bare-Metal Audio)
- Dual-Framework Filesystem Streaming (SD Card)
- External Protocol Pull Mode (A2DP Bluetooth & Wi-Fi)
- Fixed-Point Advanced DSP & Custom Synthesis Blocks
- Development Tools & Advanced Troubleshooting
The extreme polyphony figures of ESP32Synth (300+ voices on classic ESP32 chips, up to 500 on ESP32-S3) are not an end in themselves. The voice count is a practical demonstration of how little CPU time each voice consumes.
By eliminating float operations, hardware divisions, and branch instructions from the hot audio rendering path, we achieve extreme CPU headroom. This unused processing power allows developers to build highly complex synthesis blocks, such as:
- 6-Operator FM Synthesis (emulating hardware like the Yamaha DX7)
- Acoustic Physical Modeling (string, waveguide, and drum-head modeling)
- Adaptive Multi-Pole Resonant Filters
- Dynamic Waveshaping & Phase Distortion Engines
To implement these blocks, you must maintain this performance philosophy: use strictly 16.16 or 32.32 fixed-point math, look-up tables (LUTs), and bitwise shifts (>>).
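As a rough illustration of that style, here is a minimal sketch of a sine oscillator built from a 32-bit phase accumulator, a lookup table, and a 16.16 gain multiply. The names `kSineLut`, `initSineLut`, `mulQ16`, and `renderSine` are made up for this example and are not part of the ESP32Synth API. The 256-entry table size is also an assumption. The table is filled once at startup, where floating point is acceptable because it is outside the hot path.

```cpp
#include <cmath>
#include <cstdint>

// Illustrative only: none of these names come from ESP32Synth.
constexpr int kLutBits = 8;                 // 256-entry table (assumed size)
constexpr int kLutSize = 1 << kLutBits;
constexpr int kPhaseShift = 32 - kLutBits;  // top bits of the phase index the table

static int16_t kSineLut[kLutSize];

// Fill the table once at startup; floats are fine outside the audio path.
void initSineLut() {
    const double twoPi = 2.0 * 3.14159265358979323846;
    for (int i = 0; i < kLutSize; i++) {
        kSineLut[i] = (int16_t)std::lround(32767.0 * std::sin(twoPi * i / kLutSize));
    }
}

// Multiply a sample by a 16.16 gain (65536 == 1.0): 64-bit product, then a shift.
inline int32_t mulQ16(int32_t sample, int32_t gainQ16) {
    return (int32_t)(((int64_t)sample * gainQ16) >> 16);
}

// Render `count` samples with no float, divide, or branch inside the loop body.
void renderSine(int32_t* out, int count, uint32_t& phase, uint32_t phaseInc, int32_t gainQ16) {
    for (int i = 0; i < count; i++) {
        out[i] = mulQ16(kSineLut[phase >> kPhaseShift], gainQ16);
        phase += phaseInc;  // wraps naturally at 2^32, so no modulo is needed
    }
}
```

The unsigned phase wraps on overflow, which is what lets the oscillator run without a modulo or a bounds check.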
With v2.4.3, PlatformIO integration is native. File system abstractions are unified, allowing you to run identical synth files under both Arduino and ESP-IDF frameworks.
For Arduino Framework:
[env:esp32s3]
platform = espressif32
board = esp32-s3-devkitc-1
framework = arduino
monitor_speed = 115200
build_flags =
    -O3
    -funroll-loops
For ESP-IDF Framework:
Make sure to add ESP32Synth to your project's components or src directory.
[env:esp32s3-idf]
platform = espressif32
board = esp32-s3-devkitc-1
framework = espidf
monitor_speed = 115200
build_flags =
    -O3
    -funroll-loops
While ESP32Synth is highly optimized for standard dual-core ESP32 chips operating at 240MHz, its hardware abstraction layers support the broader Espressif chip family, including single-core and RISC-V variants.
- Dual-Core SoC (Classic ESP32, ESP32-S3): The high-priority DSP loop is pinned directly to Core 1 (SYNTH_AUDIO_TASK_CORE 1). This completely isolates the real-time audio thread from application execution, Bluetooth/Wi-Fi processing, or display routines on Core 0, enabling maximum polyphony.
- Single-Core SoC (ESP32-S2, ESP32-C3, ESP32-C6, etc.): The DSP task competes with other application threads[2]. To maintain stable output, set the CPU frequency to its highest supported state (e.g., 240MHz for S2, 160MHz for C3/C6)[1][4]. Limit polyphony parameters accordingly to avoid thread starvation.
| Chip Model | SMODE_DAC | SMODE_I2S | SMODE_PDM | SMODE_PWM |
|---|---|---|---|---|
| Classic ESP32 | Supported (GPIO 25, 26)[3] | Supported | Supported | Supported (High-Speed LEDC) |
| ESP32-S3 | Not Available[3] | Supported | Supported (Ideal) | Supported (Low-Speed LEDC)[5] |
| ESP32-S2 | Supported (GPIO 17, 18)[4] | Supported | Supported | Supported (Low-Speed LEDC)[2] |
| ESP32-C3 / C6 | Not Available[3] | Supported | Supported | Supported (Low-Speed LEDC)[1] |
Static parameters can be directly edited inside ESP32Synth_Config.hpp to customize RAM consumption and balance processing latency against overall polyphony stability.
// ESP32Synth_Config.hpp Core Limits
#define MAX_VOICES 80 // Maximum active concurrent synthesis voices.
#define MAX_WAVETABLES 20 // Maximum register space for custom wavetables.
#define MAX_SAMPLES 20 // Maximum registers for loaded RAM samples.
#define MAX_ARP_NOTES 16 // Maximum steps per individual voice arpeggiator.
#define MAX_STREAMS 4 // Maximum concurrent background SD file streams.
#define STREAM_BUF_SAMPLES 2048 // Streaming ring buffer length (must be a power of 2).
For resource-constrained environments (e.g., when integrating heavy UI frameworks like LVGL alongside network tasks), reduce the synth limits to minimize memory allocation:
#define MAX_VOICES 1
#define MAX_WAVETABLES 1
#define MAX_SAMPLES 1
#define MAX_ARP_NOTES 7
#define MAX_STREAMS 1
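#define STREAM_BUF_SAMPLES 2048
You can calculate the processing latency using this formula:
Latency (ms) = (SYNTH_DMA_BUF_LEN × SYNTH_DMA_BUF_COUNT ÷ sample rate in Hz) × 1000
The approximate figures quoted for the presets below are consistent with a 48 kHz output rate. Configure these definitions directly in your build files or inside ESP32Synth_Config.hpp to achieve the required latency characteristics: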
- High Polyphony / Robust Protection (Default): SYNTH_DMA_BUF_LEN 512 | SYNTH_DMA_BUF_COUNT 6 (approx. 64 ms latency; provides a high safety margin against buffer underruns under heavy CPU load).
- Balanced / Real-Time MIDI: SYNTH_DMA_BUF_LEN 256 | SYNTH_DMA_BUF_COUNT 4 (approx. 21 ms latency; provides good responsiveness for physical keyboards).
- Live Action / Ultra-Low Latency: SYNTH_DMA_BUF_LEN 128 | SYNTH_DMA_BUF_COUNT 2 (approx. 5.3 ms latency; very immediate response, but reduces voice headroom).
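The three latency figures above can be reproduced with integer arithmetic. This is a minimal sketch: `dmaLatencyUs` and `kAssumedSampleRateHz` are hypothetical names, and the 48 kHz rate is an assumption inferred from the quoted numbers rather than a documented default.

```cpp
#include <cstdint>

// Total buffered latency in microseconds: frames * 1,000,000 / sample rate.
// The 64-bit intermediate avoids overflow for large buffer settings.
constexpr uint32_t dmaLatencyUs(uint32_t bufLen, uint32_t bufCount, uint32_t sampleRateHz) {
    return (uint32_t)(((uint64_t)bufLen * bufCount * 1000000ULL) / sampleRateHz);
}

// Assumption: 48 kHz output, inferred from the figures in the list above.
constexpr uint32_t kAssumedSampleRateHz = 48000;

static_assert(dmaLatencyUs(512, 6, kAssumedSampleRateHz) == 64000, "512 x 6 -> 64 ms");
static_assert(dmaLatencyUs(256, 4, kAssumedSampleRateHz) == 21333, "256 x 4 -> ~21.3 ms");
static_assert(dmaLatencyUs(128, 2, kAssumedSampleRateHz) == 5333,  "128 x 2 -> ~5.3 ms");
```

If your output runs at a different rate, pass that rate instead; at 44.1 kHz the default preset works out to roughly 69.7 ms.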
To maximize RAM availability, ESP32Synth packs the Voice structure tightly. State that is only needed by one waveform mode shares storage with the other modes through an explicit union block inside the Voice structure:
struct Voice {
int64_t slideVolCurr; // 8-byte alignment for fast Xtensa pipeline execution
int64_t slideVolInc;
union {
// Mode: WAVE_SAMPLE & WAVE_STREAM
struct {
uint64_t samplePos1616;
uint32_t sampleInc1616;
uint32_t sampleLoopStart;
uint32_t sampleLoopEnd;
uint32_t streamFracAccum;
};
// Mode: WAVE_WAVETABLE
struct {
const void* wtData;
uint32_t wtSize;
};
// Mode: WAVE_CUSTOM
uint32_t cw[6];
};
// ...
};
Because a voice is only ever in one mode at a time, the union sizes this part of each voice by its largest mode rather than by the sum of all modes. This keeps the voice array compact and helps keep cache misses low.
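To see the saving concretely, the sketch below copies only the mode-specific part of the structure into a standalone union and checks its size at compile time. `VoiceModeState` is a hypothetical name. The member structs are named here (`sample`, `wavetable`) because anonymous structs are a compiler extension in C++; the real `Voice` contains more fields than this.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative copy of the mode-specific state only; not the library's definition.
union VoiceModeState {
    struct {                    // WAVE_SAMPLE and WAVE_STREAM
        uint64_t samplePos1616;
        uint32_t sampleInc1616;
        uint32_t sampleLoopStart;
        uint32_t sampleLoopEnd;
        uint32_t streamFracAccum;
    } sample;
    struct {                    // WAVE_WAVETABLE
        const void* wtData;
        uint32_t wtSize;
    } wavetable;
    uint32_t cw[6];             // WAVE_CUSTOM scratch registers
};

// Storage is sized by the largest member (24 bytes), not the sum of all three.
static_assert(sizeof(VoiceModeState) == 24, "largest mode needs 24 bytes");
static_assert(sizeof(VoiceModeState) <
              sizeof(VoiceModeState::sample) + sizeof(VoiceModeState::wavetable) +
              sizeof(VoiceModeState::cw),
              "the union is smaller than separate fields would be");
```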
The engine uses conditional compilation to target either the Arduino filesystem classes or native C file I/O.
The library supports several physical output layouts. Choose the initialization method that corresponds to your hardware routing:
#include "ESP32Synth.h"
ESP32Synth synth;
void setup_audio() {
// Standard I2S Mode (External DAC like PCM5102A - BCK, WS, DATA)
// Parameters: dataPin, mode, clkPin, wsPin, bitDepth
synth.begin(4, SMODE_I2S, 15, 2, I2S_32BIT);
// Or: Single-Pin Hardware PWM Mode (10-bit audio on pin 25)
// synth.begin(25, SMODE_PWM, -1, -1, I2S_16BIT);
// Or: PDM Mode (High-Frequency 1-bit oversampled audio on pin 2)
// synth.begin(2, SMODE_PDM, 4, -1, I2S_16BIT);
// Set engine-wide volume (0-255 scaling)
synth.setMasterVolume(255);
}
Pitch is controlled in hundredths of a Hz ("CentiHz") to achieve fine tuning without utilizing slow floating-point types.
// Triggers voice 0 at C4 (Middle C), Volume 255
synth.noteOn(0, c4, 255);
// Update frequency and pulse-width dynamically
synth.setFrequency(0, cs4); // Shift pitch up to C#4
synth.setWave(0, WAVE_PULSE);
synth.setPulseWidth(0, 128); // 50% square duty cycle (0-255 scale)
// Set custom bitcrush resolution (0-32 bits, 0 means disabled)
synth.setMasterBitcrush(8); // Lo-fi 8-bit output reduction
// Triggers envelope release stage
synth.noteOff(0);
We use Bresenham's algorithm for pitch slides to perform high-resolution portamento without hardware divisions inside the control rate routine.
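The division-free idea can be sketched as follows. This is a generic Bresenham-style error accumulator, not the library's actual slide code. `Slide`, `startSlide`, and `tickSlide` are hypothetical names. The single division happens once when the slide is set up, so the per-tick update uses only additions and one compare.

```cpp
#include <cstdint>

// Hypothetical slide state; field names are illustrative, not the library's.
struct Slide {
    int32_t  current;    // current value, e.g. frequency in CentiHz
    int32_t  whole;      // whole step added on every tick
    uint32_t rem;        // remainder of |delta| / ticks
    uint32_t err;        // running error accumulator
    uint32_t ticks;      // total ticks in the slide
    uint32_t remaining;  // ticks left
    int32_t  sign;       // +1 for upward slides, -1 for downward
};

// Setup runs once per slide, so the division here is outside the per-tick path.
Slide startSlide(int32_t from, int32_t to, uint32_t ticks) {
    if (ticks == 0) ticks = 1;
    int32_t delta = to - from;
    int32_t sign = (delta < 0) ? -1 : 1;
    uint32_t mag = (uint32_t)(delta < 0 ? -(int64_t)delta : (int64_t)delta);
    return Slide{from, sign * (int32_t)(mag / ticks), mag % ticks, 0, ticks, ticks, sign};
}

// Per-tick update: the fractional part is carried as one extra unit when it overflows.
int32_t tickSlide(Slide& s) {
    if (s.remaining == 0) return s.current;
    s.current += s.whole;
    s.err += s.rem;
    if (s.err >= s.ticks) {
        s.err -= s.ticks;
        s.current += s.sign;
    }
    s.remaining--;
    return s.current;
}
```

Because the remainder is distributed evenly across the ticks, the slide lands exactly on the target value on its final tick, with no rounding drift.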
// Per-voice ADSR (Attack: 10ms, Decay: 150ms, Sustain Lvl: 120, Release: 1200ms)
synth.setEnv(0, 10, 150, 120, 1200);
// Vibrato (Frequency Modulation): LFO Rate 6.5Hz (650 cHz), LFO Depth 30Hz (3000 cHz)
synth.setVibrato(0, 650, 3000);
// Tremolo (Amplitude Modulation): LFO Rate 4Hz (400 cHz), LFO Depth 80
synth.setTremolo(0, 400, 80);
// Slide pitch to C5 over exactly 500 milliseconds
synth.slideFreqTo(0, c5, 500);
// Multi-step Arpeggiator (Voice 0, Step duration: 120ms, Notes: C4, E4, G4, C5)
synth.setArpeggio(0, 120, c4, e4, g4, c5);
No external DAC? No problem. The PWM mode (SMODE_PWM) runs completely decoupled from traditional timers. We attach our interrupt handler (ledc_ovf_isr) directly to the LEDC timer's hardware overflow event:
// From ESP32Synth_Begins.hpp
esp_intr_alloc(ETS_LEDC_INTR_SOURCE, ESP_INTR_FLAG_IRAM, ledc_ovf_isr, this, (intr_handle_t*)&pwm_timer);
The interrupt handler resides in IRAM and writes duty-cycle updates straight to the LEDC hardware registers. This bypasses FreeRTOS scheduling overhead:
#if defined(CONFIG_IDF_TARGET_ESP32)
// Classic ESP32 - High Speed Channel 0 Timer 0
if (int_st & LEDC_HSTIMER0_OVF_INT_ST) {
REG_WRITE(LEDC_INT_CLR_REG, LEDC_HSTIMER0_OVF_INT_CLR);
if (synth->_running && synth->pwm_ping_pong_buf[synth->pwm_active_buf]) {
int16_t sample = synth->pwm_ping_pong_buf[synth->pwm_active_buf][synth->pwm_read_idx];
synth->pwm_read_idx = synth->pwm_read_idx + 1;
uint32_t duty_val = ((uint32_t)(sample + 32768) >> 6) << 4; // Precise 10-bit shift
REG_WRITE(LEDC_HSCH0_DUTY_REG, duty_val);
REG_WRITE(LEDC_HSCH0_CONF1_REG, REG_READ(LEDC_HSCH0_CONF1_REG) | (1U << 31)); // Hardware Latch
}
}
#endif
This design minimizes scheduling jitter, producing a clean carrier frequency locked to 47,962 Hz with 10-bit duty cycle resolution.
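The duty-cycle arithmetic from the ISR can be checked on a host machine by isolating it in a small function. `sampleToDuty` is a hypothetical helper name. The `<< 4` appears to skip the four fractional duty bits at the bottom of the LEDC duty register; treat that reading as an assumption and confirm it against the register description for your chip.

```cpp
#include <cstdint>

// Same arithmetic as the ISR: offset the signed sample to unsigned,
// keep the top 10 bits, then shift past the assumed 4 fractional duty bits.
constexpr uint32_t sampleToDuty(int16_t sample) {
    return ((uint32_t)(sample + 32768) >> 6) << 4;
}

static_assert(sampleToDuty(-32768) == 0, "full negative maps to 0% duty");
static_assert(sampleToDuty(0) == (512u << 4), "silence maps to mid-scale, 512 of 1024");
static_assert(sampleToDuty(32767) == (1023u << 4), "full positive maps to the top 10-bit code");
```

Silence sitting at mid-scale means the output pin idles at roughly half the supply voltage, so an AC-coupling capacitor is the usual way to remove that offset before an amplifier.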
ESP32Synth natively translates filesystem calls based on the active compiler toolchain.
#ifdef ARDUINO
#include <SD.h>
#include <SPI.h>
void play_background_track() {
// Voice, FS Handle, Filepath, Volume, RootPitch, Loop
synth.playStream(1, SD, "/ambient_music.wav", 255, c4, true);
// Position/Loop controls
synth.setStreamLoopPointsMs(1, 2000, 24000); // Loops segment between 2s and 24s
}
#endif
#ifndef ARDUINO
void play_background_track_idf() {
// ESP-IDF abstracts the filesystem using POSIX. Pass the direct path.
synth.playStream(1, "/sdcard/ambient_music.wav", 255, c4, true);
}
#endif
The underlying file IO decoder runs on Core 0 inside a lower-priority background thread, loading and feeding a Ring Buffer (STREAM_BUF_SAMPLES) to prevent SD card read stalls from blocking audio rendering.
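The power-of-two requirement on STREAM_BUF_SAMPLES is what lets a ring buffer wrap with a bitmask instead of a modulo. The sketch below shows one common single-producer, single-consumer design under that constraint. `SampleRing` is a hypothetical class written for this example and is not the library's implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical single-producer/single-consumer ring; not the library's actual class.
template <size_t N>
class SampleRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Producer side (file-reader task): returns how many samples were stored.
    size_t write(const int16_t* src, size_t count) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        size_t freeSlots = N - (head - tail);
        if (count > freeSlots) count = freeSlots;
        for (size_t i = 0; i < count; i++) buf_[(head + i) & (N - 1)] = src[i];
        head_.store(head + count, std::memory_order_release);
        return count;
    }
    // Consumer side (audio task): never blocks, returns how many samples were read.
    size_t read(int16_t* dst, size_t count) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        size_t avail = head - tail;
        if (count > avail) count = avail;
        for (size_t i = 0; i < count; i++) dst[i] = buf_[(tail + i) & (N - 1)];
        tail_.store(tail + count, std::memory_order_release);
        return count;
    }
private:
    int16_t buf_[N] = {};
    std::atomic<size_t> head_{0};  // total samples ever written
    std::atomic<size_t> tail_{0};  // total samples ever read
};
```

The reader returns a short count instead of waiting, so a slow SD card read shows up as missing samples that the caller can fill with silence, not as a stalled audio task.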
To output audio over wireless connections (Bluetooth A2DP, ESP-NOW, or WebSockets), configure the engine in SMODE_CUSTOM. This turns off internal DMA timers and relies on a "Pull Mode" architecture.
ESP32Synth synth;
void setup() {
// Setup at 44.1kHz or 48kHz with no automatic timer (customOutput = nullptr)
synth.beginCustom(44100, nullptr);
synth.noteOn(0, c4, 255);
}
// Your wireless network or Bluetooth stack audio callback
void write_bluetooth_packet(uint8_t *stream_buffer, int buffer_length) {
int samplePairs = buffer_length / 4; // Each 16-bit stereo frame is 4 bytes (L + R)
// Under the hood, this converts, scales, and copies rendered frames directly
synth.generateSamplesStereo((int16_t*)stream_buffer, samplePairs);
}
Inject complex physical effects and waveshapes into the engine using the global and local callback structures.
Write a completely new synthesis generator block, assign it to a specific voice, and use standard ADSR controls:
// Highly optimized 2-Op FM Oscillator callback
void IRAM_ATTR fmTwoOpOscillator(Voice* vo, int32_t* mixBuffer, int samples, int32_t startEnv, int32_t envStep) {
int32_t currentEnv = startEnv;
int32_t volBase = ((uint32_t)vo->vol * vo->trmModGain) >> 8;
uint32_t carrierPhase = vo->phase;
uint32_t carrierInc = vo->phaseInc + vo->vibOffset;
// Modulator tracks at 2.0x carrier frequency (simple harmonic relationship)
uint32_t modulatorPhase = vo->cw[0];
uint32_t modulatorInc = carrierInc * 2;
int16_t prevSample = (int16_t)vo->cw[1]; // Feedback storage
for (int i = 0; i < samples; i++) {
// Modulator outputs sine wave with phase feedback (~12.5% scale)
uint32_t feedbackPhase = modulatorPhase + (prevSample << 12);
int32_t modSample = sineLUT[feedbackPhase >> SINE_SHIFT];
prevSample = (int16_t)modSample;
// Modulate Carrier Phase by Modulator Output
uint32_t finalCarrierPhase = carrierPhase + (modSample * 16); // Mod index
int32_t signal = sineLUT[(finalCarrierPhase >> SINE_SHIFT) & SINE_LUT_MASK];
// Apply 32-bit Envelope & Volume Scale
int32_t envSafe = currentEnv >> 14;
envSafe &= ~(envSafe >> 31); // Absolute protection against negative clipping
int32_t finalVol = (int32_t)((envSafe * volBase) >> 14);
mixBuffer[i] += (signal * finalVol) >> 16;
carrierPhase += carrierInc;
modulatorPhase += modulatorInc;
currentEnv += envStep;
}
// Save states back to custom voice array registers
vo->phase = carrierPhase;
vo->cw[0] = modulatorPhase;
vo->cw[1] = (uint32_t)prevSample;
}
void play_fm_lead() {
synth.setCustomWave(0, fmTwoOpOscillator);
synth.noteOn(0, c4, 255);
}
Apply echo or spatial filters directly on the Master 32-bit mix bus before the signal is formatted for physical DAC registers:
#define DECAY_LINE_SIZE 4096
#define DECAY_MASK (DECAY_LINE_SIZE - 1)
int32_t delayLine[DECAY_LINE_SIZE];
int32_t delayWriteIndex = 0;
// High-speed fixed-point Comb Filter
void IRAM_ATTR globalDelayDSP(int32_t* mixBuffer, int numSamples) {
for (int i = 0; i < numSamples; i++) {
int32_t inputSample = mixBuffer[i];
// Retrieve delayed sample from memory
int32_t delayedSample = delayLine[(delayWriteIndex - 3000) & DECAY_MASK];
// Comb-filtering logic (Feedback scale: ~62.5% or 5/8)
int32_t newSample = inputSample + ((delayedSample * 5) >> 3);
// Write to ring buffer
delayLine[delayWriteIndex] = newSample;
delayWriteIndex = (delayWriteIndex + 1) & DECAY_MASK;
// Mix back into active master channel
mixBuffer[i] = newSample;
}
}
void setup() {
synth.begin(2, SMODE_I2S, 4, 15, I2S_16BIT);
synth.setCustomDSP(globalDelayDSP);
}
The repository contains two high-speed Python utilities:
- WavetableMaker.py: Converts mathematical waveform equations or wave segments directly into static, aligned C tables (.h) mapped as WAVE_WAVETABLE.
- WavToEsp32SynthConverter.py: Converts short single-cycle audio files into 4-bit, 8-bit, or 16-bit aligned static memory arrays, avoiding the need for SD cards for transient instruments.
- WDT Reset / Starvation Jitter: If you hear digital clicking or trigger Core Watchdog Resets, verify that the Xtensa processor is operating at 240MHz. Standard ESP32 boards default to 160MHz in some configurations, which significantly reduces the available processing headroom.
- FPU Contention on S3: ESP32-S3 uses advanced vector SIMD registers on Core 1. If other intensive tasks (such as image analysis, cameras, or complex math) run concurrently on Core 1, task contention will occur. In these scenarios, configure standard tasks on Core 0 and preserve Core 1 exclusively for the synth engine.
- Flickering PWM Audio: Under SMODE_PWM, make sure that no other task attempts to access LEDC Channel 0 or write to Timer 0 registers. This breaks the latch alignment of the overflow ISR. If there is high-frequency carrier whistle on the pin, route the signal through a simple passive RC low-pass reconstruction filter (a 150-ohm resistor with a 100nF capacitor).
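For reference, the corner frequency of the suggested RC filter can be worked out with the standard first-order formula fc = 1 / (2πRC). This is a design-time calculation, so floating point is fine here. `rcCutoffHz` is a hypothetical helper name.

```cpp
// First-order RC low-pass corner frequency: fc = 1 / (2 * pi * R * C).
constexpr double rcCutoffHz(double ohms, double farads) {
    return 1.0 / (2.0 * 3.14159265358979323846 * ohms * farads);
}

// 150 ohm with 100 nF gives a corner of roughly 10.6 kHz.
constexpr double kFilterCornerHz = rcCutoffHz(150.0, 100e-9);
static_assert(kFilterCornerHz > 10600.0 && kFilterCornerHz < 10620.0, "about 10.6 kHz");
```

That corner sits well below the 47,962 Hz carrier, although a single pole only attenuates the carrier by about 13 dB. If whistle remains, a larger capacitor or a second RC stage may be worth trying, at the cost of some high-frequency content.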
Believe in Jesus Christ❤️
