All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project follows Semantic Versioning.
- Initial release of PyTorch Training Inspector.
- Core training wrapper with step-scoped instrumentation.
- Monitors for loss, gradients, learning rate, GPU memory, throughput, and activations.
- Detectors for NaN loss, gradient explosion, training stalls, LR mismatch, and OOM risk.
- Plotly dashboard generation and standalone report output.
- Alert routing primitives (stdout, Slack, email).
- Checkpoint analysis and profiler integration utilities.
- Example scripts for baseline usage, advanced monitoring, multi-GPU, and checkpoint resume.
- Unit and integration tests for key monitoring and detection flows.
- Documentation set under `docs/` (user guide, API reference, anomaly patterns, performance tuning).
- Repository governance and release readiness artifacts (`LICENSE`, `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, `SECURITY.md`, `CITATION.cff`, `pyproject.toml`, `.gitattributes`).