# tsauditor

A data-quality auditing library for **time-series tabular data**, with a focus on
financial and sensor domains. `tsauditor` scans a `DataFrame` and returns a
structured report of structural problems, anomalies, and (its core contribution)
**data leakage** between features and the prediction target.

The project grew out of a real bug: a same-day percentage-change feature (`ChangeP`)
was used to predict a direction label derived from the same price. Accuracy was
inflated to ~99%, then collapsed to ~52–58% once the leaky column was removed.
`tsauditor` exists to catch that class of mistake automatically.

## Installation

```bash
git clone <repo-url>
cd tsauditor
pip install -e ".[dev]"
```

Requires Python ≥ 3.9. Core dependencies: `pandas`, `numpy`, `scipy`, `statsmodels`, `rich`.

## Quickstart

```python
import tsauditor as tsa

# df: an existing pandas DataFrame that includes the target column
report = tsa.scan(df, target="Direction", domain="finance")

report.summary()                  # rich-formatted CLI table
report.critical                   # list[Issue] that block modeling
report.filter(module="leakage")   # programmatic filtering
report.to_json("report.json")     # structured export
```

`scan()` returns a `GuardReport` holding `Issue` dataclasses bucketed by severity
(`critical`, `warnings`, `info`) plus dataset metadata.

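To make the shape of that report concrete, here is a minimal stand-in built from standard-library dataclasses. It is only a sketch: `IssueSketch`, `ReportSketch`, and every field name are hypothetical, chosen to mirror the description above, and are not the library's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class IssueSketch:
    """Hypothetical stand-in for an Issue; field names are assumptions."""
    code: str       # e.g. "LEK001"
    module: str     # "profiler", "anomaly", or "leakage"
    severity: str   # "critical", "warning", or "info"
    message: str


@dataclass
class ReportSketch:
    """Hypothetical stand-in for a GuardReport: a flat list plus helpers."""
    issues: list = field(default_factory=list)

    @property
    def critical(self):
        # Issues that should block modeling.
        return [i for i in self.issues if i.severity == "critical"]

    def filter(self, module=None, severity=None):
        # Keep issues matching every criterion that was supplied.
        return [
            i for i in self.issues
            if (module is None or i.module == module)
            and (severity is None or i.severity == severity)
        ]


report = ReportSketch([
    IssueSketch("LEK001", "leakage", "critical", "ChangeP reproduces Direction"),
    IssueSketch("PRF004", "profiler", "warning", "Duplicate timestamps"),
])
print([i.code for i in report.critical])                   # ['LEK001']
print([i.code for i in report.filter(module="leakage")])   # ['LEK001']
```

Bucketing by severity keeps the "can I train on this?" question to a single check on `critical`, while `filter` covers the ad hoc cases.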
## What it checks

| Module | Code | Severity | Detects |
|--------|------|----------|---------|
| profiler | PRF001 | warning | Irregular timestamp frequency |
| profiler | PRF002 | warning | Clustered missing values |
| profiler | PRF003 | info | Non-stationarity (Augmented Dickey-Fuller) |
| profiler | PRF004 | warning | Duplicate timestamps |
| profiler | PRF005 | warning | Clustered gaps |
| profiler | PRF006 | warning | High overall missing rate |
| anomaly | ANO001 | warning | Stuck / repeated constant values |
| anomaly | ANO002 | warning | Point outliers (z-score + IQR) |
| anomaly | ANO003 | warning | Contextual spikes (local rolling z-score) |
| leakage | LEK001 | critical | Target equivalence (feature reproduces the target) |
| leakage | LEK002 | warning | Positive-lag cross-correlation peak (future info) |
| leakage | LEK003 | warning | Rolling-window lookahead (excess over persistence) |

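As an illustration of one row, the sketch below shows what a "local rolling z-score" check in the spirit of ANO003 could look like. The function name, the trailing (rather than centered) window, and the default `window` and `threshold` values are all assumptions made for this example, not the library's implementation.

```python
import numpy as np


def rolling_zscore_spikes(values, window=20, threshold=4.0):
    """Indices whose value deviates strongly from the preceding `window` points."""
    x = np.asarray(values, dtype=float)
    # Row i holds x[i : i + window], the trailing context for x[i + window].
    context = np.lib.stride_tricks.sliding_window_view(x, window)[:-1]
    mean = context.mean(axis=1)
    std = context.std(axis=1)
    std[std == 0] = np.nan  # flat context: z-score undefined, so never flagged
    z = np.abs(x[window:] - mean) / std
    return np.flatnonzero(z > threshold) + window


# Alternating 0/1 signal with a single injected spike at index 50.
signal = np.tile([0.0, 1.0], 50)
signal[50] = 10.0
spikes = rolling_zscore_spikes(signal)
print(spikes)  # [50]
```

Scoring each point against its own recent history, instead of the global mean, is what lets a check like this catch a spike that would look unremarkable against the full series.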
### Leakage detection (the research core)

Leakage checks are **rank-based**, chosen by target type:

- **LEK001: equivalence.** Continuous targets use `|Spearman ρ|`; binary targets use
  **AUC separation** (`max(AUC, 1−AUC)`). This is deliberate: Pearson against a binary
  0/1 target is the point-biserial correlation, which for a roughly Gaussian feature
  split at its median tops out near `√(2/π) ≈ 0.798`. A feature whose sign *defines*
  the target therefore scores only ~0.80 and slips under a naive threshold, while AUC
  scores it 1.0.
- **LEK002: cross-correlation.** Flags features whose peak association with the target
  falls at a *positive* lag (the feature aligns with the target's future).
- **LEK003: temporal lookahead.** Flags features that correlate with the future target
  *beyond* what the target's own autocorrelation can explain, which is the signature of
  a forward-looking or centered window. The persistence baseline is what keeps a
  legitimate trailing feature from being flagged falsely.

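The LEK001 argument can be checked numerically. The sketch below builds a synthetic Gaussian `change_p` whose sign defines `direction`, then compares Pearson (point-biserial) correlation with AUC separation. `auc_separation` is a hypothetical helper written for this example using the rank-sum identity; it does not average tied ranks, which a production version would need to do (for example with `scipy.stats.rankdata`).

```python
import numpy as np


def auc_separation(feature, target):
    """max(AUC, 1 - AUC) from the rank-sum (Mann-Whitney) identity; assumes no ties."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target).astype(bool)
    # Rank 1 goes to the smallest feature value, rank n to the largest.
    ranks = np.empty(len(feature))
    ranks[np.argsort(feature)] = np.arange(1, len(feature) + 1)
    n_pos = target.sum()
    n_neg = len(target) - n_pos
    auc = (ranks[target].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return max(auc, 1 - auc)


rng = np.random.default_rng(0)
change_p = rng.normal(size=5000)          # synthetic same-day change
direction = (change_p > 0).astype(int)    # label defined by the feature's sign

pearson = np.corrcoef(change_p, direction)[0, 1]
auc = auc_separation(change_p, direction)

print(f"Pearson (point-biserial): {pearson:.3f}")  # close to sqrt(2/pi), about 0.80
print(f"AUC separation:           {auc:.3f}")      # 1.000
```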
LEK002 and LEK003 are WARNING-level *suspicions*: cross-correlation alone can separate a
genuinely strong predictor from a leak only by magnitude. LEK001 is CRITICAL because
equivalence is near-deterministic.

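The two warning-level ideas can be sketched on synthetic data. Everything here is illustrative: the helper names are hypothetical, the target is a simulated AR(1) series, and plain Pearson correlation is used for brevity where the library is described as rank-based. A positive lag means the feature at time `t` is compared with the target at `t + lag`.

```python
import numpy as np


def lagged_corr(feature, target, lag):
    """Pearson correlation between feature[t] and target[t + lag]."""
    if lag > 0:
        feature, target = feature[:-lag], target[lag:]
    elif lag < 0:
        feature, target = feature[-lag:], target[:lag]
    return np.corrcoef(feature, target)[0, 1]


def peak_lag(feature, target, max_lag=5):
    """LEK002-style scan: lag in [-max_lag, max_lag] with the largest |correlation|."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda k: abs(lagged_corr(feature, target, k)))


def excess_over_persistence(feature, target, lag=1):
    """LEK003-style score: feature-to-future correlation minus the target's autocorrelation."""
    return abs(lagged_corr(feature, target, lag)) - abs(lagged_corr(target, target, lag))


# Synthetic AR(1) target: y[t] = 0.5 * y[t-1] + noise[t].
rng = np.random.default_rng(0)
n = 4000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + noise[t]

window_means = np.convolve(y, np.ones(5) / 5, mode="valid")  # [i] = mean(y[i:i+5])
target = y[4:-2]               # times t = 4 .. n-3
trailing = window_means[:-2]   # mean of y[t-4 .. t]: past and present only
centered = window_means[2:]    # mean of y[t-2 .. t+2]: includes two future values
shifted = y[6:]                # y[t+2]: the target's own future, an outright leak

print(peak_lag(shifted, target))                   # 2
print(excess_over_persistence(centered, target))   # positive: beats persistence
print(excess_over_persistence(trailing, target))   # negative: explained by persistence
```

The trailing mean still correlates with the next target value, but no more than the target's own lag-1 autocorrelation does, so it stays below the baseline; the centered window exceeds it because it already contains `y[t+1]`.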
## Architecture

```
tsauditor/
├── scanner.py            # scan() orchestrates all modules into a GuardReport
├── profiler/             # structural checks: frequency, missing, stationarity
├── anomaly/              # point.py, contextual.py
├── leakage/              # equivalence.py, correlation.py, temporal.py
├── report/summary.py     # GuardReport + Issue dataclasses, rich/JSON output
└── utils/validation.py   # input validation & DataFrame normalization
```

## Testing

```bash
pytest -q
```

## Status

Alpha (`0.1.0`). Profiler, anomaly, and leakage modules are implemented and tested.

## License

MIT; see [LICENSE](LICENSE).