yamnet-cry-distill-int8

Knowledge-distillation pipeline that turns Google's pretrained YAMNet (FP32, 521-class AudioSet) into a tiny INT8 student classifier targeted at on-device baby-cry detection on the ESP32-S3.

Status: in progress — see docs/experiments/ for the run log.

Architecture

[private home captures]   [public AudioSet IDs]
        │                         │
        ▼                         ▼
   librosa mel ──► mixer ──► YAMNet teacher (FP32) ──► soft logits
                              │                              │
                              └──► tiny student (DS-CNN) ──► KL loss
                                                              │
                                              held-out AudioSet eval (HEADLINE)
                                              held-out captures eval (side, private)
                                                              │
                                                              ▼
                                                 INT8 TFLite ──► HF model hub

Performance

Headline metric: held-out AudioSet crying_sobbing segment-level F1 / AUC on the 100-segment frozen test set (62 survivors after takedowns). Best-threshold F1 reported because distilled students inherit the teacher's diffuse softmax — see docs/experiments/EXP-005_eval_harness.md.

Experiment	Train data prep	Params	INT8 size	best-F1	AUC	FPR (best thr)
EXP-001 (smoke)	synthetic	80,713	—	—	—	—
EXP-002 (captures-only)	random 4 windows / clip	80,713	—	0.585	0.717	—
EXP-003 (audioset-only)	random 4 windows / clip	80,713	—	0.612	0.750	—
EXP-004 (combined)	random 4 windows / clip	80,713	—	0.667	0.823	—
EXP-006 (v0.1.0)	teacher-scored sliding window, 1:1 fixed pool	80,713	110 KB	0.696	0.860	31.8 %
EXP-007	+ per-epoch resample, 1:3 pos:neg	80,713	—	0.756	0.861	22.7 %
EXP-008 (v0.2.0)	+ LR drop 1e-3→1e-4 @ epoch 30	80,713	110 KB	0.756	0.870	22.7 %

EXP-008 is the published model (v0.2.0). Two key methodology shifts:

EXP-006 — teacher-as-filter (+0.037 AUC over random-window baseline). YAMNet pre-scores every 0.5 s-hop window of every clip; only confident-positive (p_cry > 0.30) and confident-negative (p_cry < 0.05) windows enter training. See docs/research/methodology-teacher-as-filter.md.
EXP-007–008 — deployment-realism training (-9 pp FPR at the operating point). 1:3 pos:neg ratio matches deployment audio better than 1:1; per-epoch random subsampling exposes the student to ~all 30 K negatives over training instead of the same 5 714 every epoch.

INT8 quantization cost is small (≤0.012 best-F1, ≤0.004 AUC). With 5-of-9 temporal voting at the operating threshold, the EXP-008 deployment math is ~5× fewer false alerts per night vs EXP-006 at the same recall.

Side metric: KL vs YAMNet on held-out home captures (time-stratified, private). The captures pool is the highest-confidence tier of the auto-ensemble's labels, so per-clip metrics don't translate to general nursery deployment without on-device threshold calibration. Aggregate per-tier behaviour comparison (no captures-derived data published) is at docs/research/student-on-captures-summary-20260503.md.

Deployment profile (ESP32-S3)

The published model is sized for real-time inference on the Waveshare ESP32-S3-CAM-GC2145 — 240 MHz dual-core Xtensa LX7, 8 MB PSRAM, 16 MB flash. Numbers below compare the v0.2.0 student to the YAMNet teacher running INT8 on the same device (sibling repo's chayuto/yamnet-mel-int8-tflm), since that's the reference cry detector currently deployed.

	YAMNet teacher (INT8)	EXP-008 student (INT8)	ratio
Flash footprint (`.tflite` size)	~4.0 MB	110 KB	36× smaller
Parameters	~4.0 M	80 713	50× smaller
Tensor arena (PSRAM allocation)	~600 KB	~64 KB est.	9× smaller
Inference latency / patch (CPU, training host)	~2 ms	<1 ms measured	~3× faster
Inference latency / patch (ESP32-S3)	tens of ms	single-digit ms est.	~10× faster
Real-time headroom @ 0.48 s patch hop	~10×	~100× est.	—

Student-on-ESP32-S3 latency is estimated until the on-device parallel-logging build lands (Phase A in the firmware integration plan).

Accuracy translation to deployment. Three different lenses on the same model:

	what it measures	EXP-008 INT8
AudioSet test AUC	public-domain ranking quality on a frozen 62-segment slice	0.866
AudioSet test best-F1	binary-classification quality at the F1-optimal threshold	0.744
Captures frame-correlation (high_pos)	how closely the student's per-frame cry score tracks the teacher's on the same audio	r = 0.81
Captures student max (high_neg)	how often the student wrongly fires on quiet/silent audio (lower = better)	0.06 (was 0.15 in EXP-006)

The student is a faithful approximator of the teacher on the deployment data — frame correlation 0.81 on high-confidence positives — and it is materially more conservative than v0.1.0 on quiet audio (high_neg max 0.15 → 0.06).

Calibration is per-deployment. Distilled students inherit the teacher's diffuse softmax: cry probabilities concentrate well below 0.5 even on obvious cries. The model's best-F1 threshold is ≈0.05 on the AudioSet test, but the right operating point depends on the specific room, mic gain, and ambient noise floor. The detector firmware (projects/cry-detect-01/main/detector.c in the sibling repo) exposes threshold, N-of-M consecutive-frame voting, and hold-time hysteresis as runtime-configurable parameters — so a deployment-time calibration recipe of "record 10 min of ambient, set threshold = max(score) + 0.05" is sufficient.

Why distill at all if YAMNet already runs on-device? Three things become possible with a 36× smaller model:

~3.9 MB of flash and ~540 KB of PSRAM freed for longer audio buffers (better temporal context), parallel image features from the on-board GC2145 camera, larger OTA partitions, or just less PSRAM pressure for the rest of the firmware.
Lower power draw on a battery-powered baby monitor — faster inference + smaller weight reads + fewer cache misses.
Portability beyond ESP32-S3. 110 KB INT8 fits on Cortex-M4, RP2040, ESP32-S2, and most TFLite-Micro targets where YAMNet's 4 MB does not.

The deployment story is not "the student is more accurate than the teacher" — the teacher remains the reference. The story is "the student is good enough to recover most of the teacher's behaviour at a fraction of the resource cost, freeing the rest of the device for the rest of the product."

Reproduce

git clone https://github.com/chayuto/yamnet-cry-distill-int8
cd yamnet-cry-distill-int8
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,audioset]"
pytest

# Download the public AudioSet smoke set (50 segment IDs, ~2 minutes)
python scripts/download_audioset.py --ids data/ids/audioset_smoke.csv

# Phase-1 smoke test: prove the distillation loop closes
bash scripts/run_exp001_smoke.sh

The full training pipeline reproduces against committed AudioSet segment IDs alone — no private data required.

Privacy

The headline number is reproducible from public AudioSet segment IDs. A private in-domain corpus (deployed-device home captures) is used as augmentation only during distillation; it is never committed, uploaded to HuggingFace, or required for reproduction. See docs/architecture.md and the model card for the public/private split.

Sibling repo

Device firmware, audio harvest, on-device inference, and the host-side auto-ensemble label tooling live at ws-ESP32-S3-CAM. The teacher artifact (chayuto/yamnet-mel-int8-tflm) was published from prior work in that repo.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.claude/commands		.claude/commands
configs		configs
data/ids		data/ids
docs		docs
ml-experiments		ml-experiments
models		models
scripts		scripts
src/yamnet_cry_distill_int8		src/yamnet_cry_distill_int8
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

yamnet-cry-distill-int8

Architecture

Performance

Deployment profile (ESP32-S3)

Reproduce

Privacy

Sibling repo

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

yamnet-cry-distill-int8

Architecture

Performance

Deployment profile (ESP32-S3)

Reproduce

Privacy

Sibling repo

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages