feat: add Apple Intel CPU + MPS support #576

Open
amul3 wants to merge 1 commit into karpathy:master from amul3:feat/apple-intel-cpu-support

amul3 commented May 4, 2026

Adds backend auto-detection (cuda > mps > cpu) so the training script runs on Apple Intel Macs without a discrete GPU and on Apple Silicon, in addition to the original NVIDIA CUDA path.
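The priority order described above (cuda > mps > cpu) can be sketched as follows. This is a minimal illustration, not the code from this PR: `pick_backend` and `detect_backend` are hypothetical names, and only the `torch.cuda.is_available()` and `torch.backends.mps.is_available()` calls are real PyTorch APIs.

```python
def pick_backend(cuda_ok: bool, mps_ok: bool) -> str:
    """Return the preferred backend name given what is available.

    Hypothetical helper: encodes the cuda > mps > cpu priority only.
    """
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_backend() -> str:
    """Query PyTorch for available backends and apply the priority order."""
    # Imported lazily so the pure selection logic above has no torch dependency.
    import torch

    return pick_backend(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```

A script could then build its device once with `torch.device(detect_backend())` and pass it around instead of hardcoding `"cuda"`.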

Key changes:

  • train.py: import Flash-Attention 3 lazily and only on CUDA, falling back to SDPA on MPS/CPU; apply torch.compile only on CUDA; guard cuda.synchronize / cuda.manual_seed / max_memory_allocated behind a CUDA check; keep embeddings and rotary in fp32 on non-CUDA backends to avoid slow or unsupported bf16 matmul; set per-backend defaults for DEPTH / DEVICE_BATCH_SIZE / TOTAL_BATCH_SIZE / WINDOW_PATTERN so the 5-minute budget actually completes on CPU.
  • prepare.py: allocate dataloader buffers on the detected device; enable pin_memory only on CUDA; have evaluate_bpb take its device from the model rather than hardcoding cuda.
  • pyproject.toml: drop the hard kernels dependency and the CUDA-only torch index; move kernels to an optional [cuda] extra; loosen the torch requirement to >=2.4 so CPU/MPS wheels resolve.
  • README: document the new platforms, install matrix, and per-backend defaults.
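The per-backend gating listed above might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's actual code: `backend_features`, `causal_attention`, and `maybe_synchronize` are hypothetical names, and the check assumes the optional Flash-Attention dependency is importable as `kernels`. The fallback uses plain causal SDPA and does not model WINDOW_PATTERN.

```python
import importlib.util


def backend_features(backend: str) -> dict:
    """Decide which CUDA-only features to enable for a backend name."""
    on_cuda = backend == "cuda"
    # Assumption: the optional [cuda] extra installs a module named "kernels".
    has_kernels = importlib.util.find_spec("kernels") is not None
    return {
        "flash_attention": on_cuda and has_kernels,  # otherwise fall back to SDPA
        "torch_compile": on_cuda,
        "pin_memory": on_cuda,
        "fp32_embeddings": not on_cuda,  # avoid bf16 matmul off CUDA
    }


def causal_attention(q, k, v):
    """SDPA fallback for MPS/CPU: plain causal attention, no sliding window."""
    import torch.nn.functional as F

    return F.scaled_dot_product_attention(q, k, v, is_causal=True)


def maybe_synchronize(backend: str) -> None:
    """Call torch.cuda.synchronize() only when running on CUDA."""
    if backend == "cuda":
        import torch

        torch.cuda.synchronize()
```

Keeping the decisions in one dictionary means the training loop branches on flags such as `features["torch_compile"]` rather than repeating device checks at every call site.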

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>