AGENTS.md: 57 additions & 6 deletions
@@ -1,18 +1,19 @@
## Project overview
-
`cgm-format` is a Python library for converting vendor-specific Continuous Glucose Monitoring (CGM) data (Dexcom, Libre) into a standardized unified format for ML training and inference.
+
`cgm-format` is a Python library for converting vendor-specific Continuous Glucose Monitoring (CGM) data (Dexcom, Libre, Medtronic, Nightscout) into a standardized unified format for ML training and inference.
The package lives under `src/cgm_format/` (PEP 517 src layout). The two main public classes are:
- `FormatParser` (`format_parser.py`): Stages 1–3 (decode raw bytes, detect vendor format, parse to unified Polars DataFrame).
`uv` is used as the package manager. **Always run commands via `uv run`** — never use bare `pytest`, `python`, or `cgm-cli` directly. The project uses a src layout with hatchling; the package and its dependencies (including `polars`) are only available inside the uv-managed virtual environment. Running bare `pytest` picks up the system Python, which does not have the project installed, and fails with `ModuleNotFoundError`.
```bash
-
uv sync --extra dev # FIRST: install/sync all dependencies (dev includes pytest, typer, rich, pandas, pyarrow, frictionless)
+
uv sync --extra dev # FIRST: install/sync all dependencies (dev includes cli + pytest)
uv run pytest # run the full test suite
uv run pytest tests/test_format_parser.py # run a specific test file
uv run cgm-cli --help # explore the CLI
@@ -32,7 +33,7 @@ uv run cgm-cli pipeline <file> # run full 6-stage pipeline
If tests fail with `ModuleNotFoundError: No module named 'polars'` or `No module named 'cgm_format'`, run `uv sync --extra dev` first.
-
Tests are **integration tests that use real data** in `data/` — do not mock unless absolutely required.
+
Tests are **integration tests that use real data** in `data/input/` — do not mock unless absolutely required.
## Code style guidelines
@@ -78,6 +79,30 @@ The canonical output is a Polars DataFrame conforming to `CGM_SCHEMA` (`formats/
5. Export new public symbols from `src/cgm_format/__init__.py` and add to `__all__`.
6. Add real-data integration tests in `tests/`.
+
## Gap thresholds and grid-aligned gap measurement
+
+
### SMALL_GAP_MAX_MINUTES = 15 (3 intervals)
+
+
The gap threshold that separates "small" (fillable) from "large" (sequence-splitting) gaps is `SMALL_GAP_MAX_MINUTES = EXPECTED_INTERVAL_MINUTES * 3 = 15` minutes. This value is aligned with the sister library [`glucose_data_processing`](https://github.com/GlucoseDAO/glucose_data_processing), which uses the same `small_gap_max_minutes=15` default.
+
+
**Why a grid multiple matters:** `interpolate_gaps` uses grid-aligned gap measurement when `snap_to_grid=True` (the default). Raw timestamps are projected onto the 5-minute grid before gaps are measured, so effective gap sizes are always multiples of 5 (0, 5, 10, 15, 20, ...). A threshold that is itself a grid multiple (15) produces clean, deterministic fill/skip decisions. The previous threshold of 19 was not a grid multiple, which caused borderline instability: a raw gap of 18.7 min could round to 20 min on the grid (exceeding 19), while the same gap measured on raw timestamps stayed below 19. As a result, `interpolate_gaps` and `synchronize_timestamps` could disagree on whether to fill such gaps.
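The borderline case can be reproduced with plain integer arithmetic. The sketch below is illustrative only: `snap_to_grid_seconds` and `fill_decisions` are hypothetical helpers (not the library's `calculate_grid_point()`), round-to-nearest is assumed as the projection rule, and the two timestamps are made up so that an 18.7 min raw gap lands on grid points 20 min apart.

```python
GRID_SECONDS = 5 * 60  # 5-minute grid


def snap_to_grid_seconds(t: int, grid: int = GRID_SECONDS) -> int:
    """Round a timestamp in seconds to the nearest grid point (hypothetical helper)."""
    return (t + grid // 2) // grid * grid


def fill_decisions(t1: int, t2: int, threshold_minutes: int) -> tuple[bool, bool]:
    """Return (fill_by_raw_gap, fill_by_grid_gap) for one pair of readings."""
    limit = threshold_minutes * 60
    raw_gap = t2 - t1
    grid_gap = snap_to_grid_seconds(t2) - snap_to_grid_seconds(t1)
    return raw_gap <= limit, grid_gap <= limit


# Two made-up readings 18.7 min apart (00:02:24 and 00:21:06, as seconds since midnight).
t1, t2 = 144, 1266
print(fill_decisions(t1, t2, 19))  # (True, False): raw and grid measurements disagree
print(fill_decisions(t1, t2, 15))  # (False, False): both measurements agree
```

For this particular gap, the threshold of 19 yields opposite answers depending on how the gap is measured, while 15 does not.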
+
+
### Grid-aligned gap measurement for commutativity
+
+
`_interpolate_sequence` computes effective gap sizes by projecting both endpoints of each gap onto the 5-minute grid via `calculate_grid_point()`, then measuring the distance between grid positions. This ensures that `interpolate_gaps → synchronize_timestamps` and `synchronize_timestamps → interpolate_gaps` see identical gap sizes and produce identical results (commutativity). The approach is:
+
+
1. For each pair of consecutive glucose readings, compute the nearest grid point for both timestamps.
+
2. Measure the gap as the difference between grid positions (always a multiple of `expected_interval_minutes`).
+
3. Apply the `> expected_interval_minutes` and `<= SMALL_GAP_MAX_MINUTES` thresholds to the grid-aligned gap.
+
+
This is only active when `snap_to_grid=True`. When `snap_to_grid=False`, raw timestamp differences are used (no grid alignment), so commutativity with `synchronize_timestamps` is not guaranteed.
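The three steps can be sketched with the standard library. `nearest_grid_point` below is a hypothetical stand-in for `calculate_grid_point()`, whose real signature is not shown here. The grid is assumed to be anchored at midnight with round-to-nearest projection, and timestamps are plain `datetime` objects rather than a Polars column.

```python
from datetime import datetime, timedelta

EXPECTED_INTERVAL_MINUTES = 5
SMALL_GAP_MAX_MINUTES = EXPECTED_INTERVAL_MINUTES * 3  # 15


def nearest_grid_point(ts: datetime, interval_minutes: int = EXPECTED_INTERVAL_MINUTES) -> datetime:
    """Hypothetical stand-in for calculate_grid_point(): round to the nearest grid point.

    Assumption: the grid is anchored at midnight of the reading's own day.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    step = interval_minutes * 60
    elapsed = (ts - midnight).total_seconds()
    snapped = int((elapsed + step / 2) // step) * step
    return midnight + timedelta(seconds=snapped)


def grid_gaps_minutes(timestamps: list[datetime]) -> list[int]:
    """Steps 1 and 2: measure each gap between grid positions, not raw timestamps."""
    grid = [nearest_grid_point(ts) for ts in timestamps]
    return [int((b - a).total_seconds() // 60) for a, b in zip(grid, grid[1:])]


def fillable(gap_minutes: int) -> bool:
    """Step 3: fill gaps longer than one interval but within the small-gap limit."""
    return EXPECTED_INTERVAL_MINUTES < gap_minutes <= SMALL_GAP_MAX_MINUTES


readings = [
    datetime(2024, 1, 1, 8, 0, 10),   # snaps to 08:00
    datetime(2024, 1, 1, 8, 4, 55),   # snaps to 08:05
    datetime(2024, 1, 1, 8, 18, 40),  # snaps to 08:20
    datetime(2024, 1, 1, 8, 41, 5),   # snaps to 08:40
]
for gap in grid_gaps_minutes(readings):
    print(gap, fillable(gap))
```

Because every gap is measured between grid positions, the result does not change if the timestamps have already been snapped, which is the property the commutativity argument relies on.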
+
+
### Comparison operators
+
+
Both `cgm_format` and `glucose_data_processing` use the same operator convention:
+
- **Sequence splits:** `> threshold` (strictly greater → gap AT the threshold stays in the same sequence)
+
- **Interpolation fill:** `<= threshold` (less-or-equal → gap AT the threshold IS filled)
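A minimal sketch of the convention, assuming for illustration that both decisions use the same 15-minute threshold. The function names are hypothetical and belong to neither library, and the lower bound on fillable gaps (`> expected_interval_minutes`) is left out to keep the focus on the boundary case.

```python
SMALL_GAP_MAX_MINUTES = 15  # threshold value stated in this document


def splits_sequence(gap_minutes: float, threshold: float = SMALL_GAP_MAX_MINUTES) -> bool:
    """Strictly greater: a gap exactly at the threshold does not start a new sequence."""
    return gap_minutes > threshold


def is_filled(gap_minutes: float, threshold: float = SMALL_GAP_MAX_MINUTES) -> bool:
    """Less-or-equal: a gap exactly at the threshold is still interpolated."""
    return gap_minutes <= threshold


for gap in (10, 15, 20):
    # The two operators are complementary, so no gap is both split and filled,
    # and none falls between the two rules.
    assert splits_sequence(gap) != is_filled(gap)
    print(gap, "split" if splits_sequence(gap) else "fill")
```

The boundary value 15 prints `fill`, matching the rule that a gap at the threshold stays in its sequence and is interpolated.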
The CLI `report` and `validate` commands use `frictionless` if available, but degrade gracefully without it. Import it inside functions (not at module level) or guard with `HAS_FRICTIONLESS`. The core `FormatParser` / `FormatProcessor` do not depend on it.
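One possible shape for this guard is sketched below. It is not the project's actual code: `validate_table` and its return dictionary are hypothetical, and `importlib.util.find_spec` is just one way to compute a `HAS_FRICTIONLESS` flag without importing the package at module level.

```python
import importlib.util

# True only when the optional package can be found; nothing is imported here.
HAS_FRICTIONLESS = importlib.util.find_spec("frictionless") is not None


def validate_table(path: str, has_frictionless: bool = HAS_FRICTIONLESS) -> dict:
    """Hypothetical CLI helper: validate with frictionless when present, else skip."""
    if not has_frictionless:
        # Degrade gracefully: report that validation was skipped instead of failing.
        return {"path": path, "validated": False, "reason": "frictionless not installed"}

    import frictionless  # lazy import, kept out of module scope

    report = frictionless.validate(path)
    return {"path": path, "validated": True, "valid": report.valid}
```

Passing the flag as a parameter keeps the fallback branch easy to exercise in tests regardless of what is installed.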
+
### Nightscout dual-path architecture
+
+
Nightscout data is supported through two parsing paths:
+
+
1. **JSON API path** (primary): `FormatParser.parse_nightscout(entries_json, treatments_json)` or
+
`FormatParser.from_nightscout_exports(entries_path, treatments_path)` or
+
`FormatParser.from_nightscout_url(base_url, ...)`. Downloads entries and treatments as JSON,
+
combines glucose readings with insulin/carbs/temp basals. Supports `token` and `api_secret` auth.
+
+
2. **nightscout-exporter CSV path**: Combined CSV file with `# CGM ENTRIES` and `# TREATMENTS`
+
section headers. Auto-detected by `detect_format()` and parsed via `parse_file()` /
+
`parse_from_string()`. The `_process_nightscout` dispatcher handles both JSON and CSV.
+
+
The built-in Nightscout API CSV endpoints are **not supported** — entries.csv is headerless with
+
only 5 columns, and treatments.csv doesn't actually serve CSV (returns JSON regardless). The
+
`nightscout_entries.csv` file in `data/input/` is kept as a negative control.
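To make the CSV path concrete, here is a sketch of how a combined file could be split on its two section headers. This is not the library's parser: `split_nightscout_sections` and `looks_like_nightscout_export` are hypothetical helpers, and the column names in the sample are invented, since the exporter's actual columns are not listed here.

```python
import csv
import io

SECTION_HEADERS = ("# CGM ENTRIES", "# TREATMENTS")


def split_nightscout_sections(text: str) -> dict[str, list[dict[str, str]]]:
    """Split a combined export into per-section row lists (hypothetical helper)."""
    blocks: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADERS:
            current = stripped
            blocks[current] = []
        elif current is not None and stripped:
            blocks[current].append(line)
    # Each block is an ordinary CSV table whose first line is its header row.
    return {
        name: list(csv.DictReader(io.StringIO("\n".join(lines))))
        for name, lines in blocks.items()
    }


def looks_like_nightscout_export(text: str) -> bool:
    """Cheap detection check: the CGM section header must be present."""
    return any(line.strip() == "# CGM ENTRIES" for line in text.splitlines())


sample = """# CGM ENTRIES
timestamp,sgv
2024-01-01T08:00:00Z,110
2024-01-01T08:05:00Z,112
# TREATMENTS
timestamp,event,insulin
2024-01-01T08:02:00Z,Bolus,1.5
"""
sections = split_nightscout_sections(sample)
print(len(sections["# CGM ENTRIES"]), len(sections["# TREATMENTS"]))  # 2 1
```

A headerless file such as the built-in `entries.csv` output contains no section marker, so a check of this kind would reject it, consistent with its role as a negative control.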
+
+
### `httpx` is an optional dependency
+
+
The `nightscout_downloader` module requires `httpx` for HTTP requests. It is included in the `cli` and `dev` optional dependency groups. Import it inside functions and raise a clear `ImportError` with install instructions if missing.
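The lazy-import pattern might look like the sketch below. `require_optional` and `download_entries` are hypothetical names, not the module's real API. The install hint assumes the `cli` extra named above, and the `/api/v1/entries.json` path is an assumption about the Nightscout endpoint being called.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency lazily, or fail with install instructions."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: uv sync --extra {extra}"
        ) from exc


def download_entries(base_url: str) -> bytes:
    """Hypothetical downloader entry point; httpx is imported only when called."""
    httpx = require_optional("httpx", extra="cli")
    # Assumed endpoint path; adjust to whatever the real downloader requests.
    response = httpx.get(f"{base_url.rstrip('/')}/api/v1/entries.json")
    response.raise_for_status()
    return response.content
```

Importing inside the function means that `import cgm_format` keeps working without `httpx`, and only callers of the download path see the error.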
- `uv lock --upgrade` only updates `uv.lock`; `pyproject.toml` minimum version bounds must be bumped manually if you want to raise them.
+
- `tests/conftest.py` loads `.env` via `python-dotenv` and provides a session-scoped `nightscout_data_dir` fixture that downloads Nightscout JSON data from `NIGHTSCOUT_URL` (with optional `NIGHTSCOUT_TOKEN` / `NIGHTSCOUT_API_SECRET`) into `data/input/`. Files are cached; pass `--nightscout-redownload` to force refresh.
+
- `data/.gitignore` uses an ignore-all + allowlist pattern (`*` then `!input/`, `!input/**`). To commit a new top-level subdirectory under `data/`, add explicit `!<dir>/` and `!<dir>/**` entries.
+
- `detect_format()` recognizes nightscout-exporter CSV (with `# CGM ENTRIES` section headers). Nightscout JSON files do **not** go through `detect_format`; use `parse_nightscout()` or `from_nightscout_exports()` instead.
- When upgrading dependencies (`uv lock --upgrade`), also raise the lower-bound version constraints in `pyproject.toml` to match the newly resolved versions.
+
- Tests should be resilient to changing data — use `pytest.skip()` for optional data features (e.g. specific treatment types) rather than hard assertions that assume specific data content.
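As an illustration of the last point, a test can skip itself when an optional data feature is absent. The helper name, the `eventType` field, and the `Temp Basal` value are assumptions about the treatment data, and the inline list only stands in for real data loaded from `data/input/`.

```python
import pytest


def first_treatment_of_type(treatments: list[dict], event_type: str) -> dict:
    """Return the first matching treatment, or skip the calling test if none exists."""
    matches = [t for t in treatments if t.get("eventType") == event_type]
    if not matches:
        # Optional data feature: absence is not a failure, so skip instead of asserting.
        pytest.skip(f"no '{event_type}' treatments in the downloaded data")
    return matches[0]


def test_temp_basal_has_duration():
    # Stand-in for real downloaded data; a real test would load it from data/input/.
    treatments = [{"eventType": "Temp Basal", "duration": 30}]
    basal = first_treatment_of_type(treatments, "Temp Basal")
    assert basal["duration"] > 0
```

With this shape, a dataset that happens to contain no temp basals shows up as a skipped test rather than a failure.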