Commit ad4fb56

Update: Project Complete
1 parent d3c6e27 commit ad4fb56

84 files changed

Lines changed: 1382 additions & 85 deletions

.github/workflows/ci.yml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    name: Test (Python ${{ matrix.python-version }}, ${{ matrix.os }})
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]

    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install package and dev dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"

      - name: Run tests with coverage
        run: |
          pytest --cov=tsauditor --cov-report=xml --cov-report=term-missing -v

      - name: Upload coverage to Codecov
        # Only upload once per matrix run to avoid duplicate/conflicting reports.
        # ubuntu + the lowest supported Python version is the canonical run.
        if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.9'
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.xml
          fail_ci_if_error: false
          token: ${{ secrets.CODECOV_TOKEN }}
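
The matrix above expands to one job per (OS, Python version) pair, and the `if:` guard on the upload step selects exactly one of them. The short Python sketch below only mimics that expansion for illustration; it is not how GitHub Actions evaluates the workflow. The version strings stay quoted for the same reason they are quoted in the YAML: an unquoted `3.10` would be read as the number `3.1`.

```python
from itertools import product

# Same axes as the workflow's strategy.matrix.
oses = ["ubuntu-latest", "windows-latest", "macos-latest"]
pythons = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]

# One job per combination of the two axes.
jobs = [{"os": o, "python-version": p} for o, p in product(oses, pythons)]

# Mirror of the upload step's `if:` condition.
uploaders = [
    j for j in jobs
    if j["os"] == "ubuntu-latest" and j["python-version"] == "3.9"
]

print(len(jobs), "jobs in the matrix;", len(uploaders), "uploads coverage")
```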

.github/dependabot.yml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
version: 2
updates:
  # Python package dependencies (pandas, numpy, scipy, statsmodels, rich, pytest, etc.)
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5
    labels:
      - "dependencies"

  # Keep the GitHub Actions themselves (checkout, setup-python, codecov-action) up to date
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
    labels:
      - "dependencies"
      - "github-actions"

.gitignore

606 Bytes
Binary file not shown.

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# Changelog

All notable changes to this project are documented here. The format is based on
[Keep a Changelog](https://keepachangelog.com/), and the project adheres to
[Semantic Versioning](https://semver.org/).

## [Unreleased]

### Added
- `leakage` module fully implemented: LEK001 (rank-based target equivalence:
  Spearman for continuous targets, AUC separation for binary), LEK002 (positive-lag
  cross-correlation), LEK003 (rolling-window lookahead via excess-over-persistence).
- Test suites for the leakage module: `test_equivalence.py`, `test_correlation.py`,
  `test_temporal.py`, covering clean/leak/edge cases.
- Standard repository files: `README.md`, `LICENSE`, `CHANGELOG.md`, CI workflow.

### Fixed
- ANO003 contextual spike detection no longer self-masks: rolling statistics exclude
  the current observation, use a wider window, and handle zero-variance context.
- `scan()` runs end-to-end now that all non-stub modules are implemented; stale
  scaffold tests updated to assert real behavior.
- `.gitignore` re-encoded from UTF-16 to UTF-8 so its patterns take effect.

## [0.1.0]

### Added
- Initial architecture: `profiler`, `anomaly`, `leakage` modules behind a single
  `tsa.scan()` entry point returning a `GuardReport`.
- Profiler checks (PRF001–PRF006), point anomalies (ANO002), CLI/JSON report output.
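
The ANO003 fix noted in the changelog (statistics that exclude the current observation, plus a zero-variance guard) can be illustrated with a small pure-Python sketch. This is not the project's implementation, which is not shown in this excerpt; the function name, the `window`, `threshold`, and `eps` parameters, and their defaults are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def contextual_spikes(values, window=20, threshold=4.0, eps=1e-9):
    """Return indices that deviate strongly from the preceding window.

    Hypothetical helper: the context is values[i - window:i], which leaves out
    values[i] itself, so a spike cannot inflate the mean and standard deviation
    it is being compared against (the "self-masking" problem).
    """
    flagged = []
    for i in range(window, len(values)):
        context = values[i - window:i]        # excludes the current point
        mu = mean(context)
        sigma = pstdev(context)
        deviation = abs(values[i] - mu)
        if sigma < eps:
            # Zero-variance context: avoid dividing by zero and treat any
            # real departure from the constant level as a spike.
            if deviation > eps:
                flagged.append(i)
        elif deviation / sigma > threshold:
            flagged.append(i)
    return flagged

series = [10.0, 10.2, 9.9, 10.1] * 10   # calm, low-variance context
series[30] = 25.0                        # one injected spike

flat = [5.0] * 30                        # perfectly constant context
flat[25] = 5.5                           # small departure from the constant

print(contextual_spikes(series))
print(contextual_spikes(flat))
```

Points that follow a spike see it inside their context window, which widens the local standard deviation for a while; a wider window dilutes that effect, which is consistent with the changelog's mention of a wider window.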

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Iman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# tsauditor

A data-quality auditing library for **time-series tabular data**, with a focus on
financial and sensor domains. `tsauditor` scans a `DataFrame` and returns a
structured report of structural problems, anomalies, and (its core contribution)
**data leakage** between features and the prediction target.

The project grew out of a real bug: a same-day percentage-change feature (`ChangeP`)
was used to predict a direction label derived from the same price, inflating accuracy
to ~99%. Once the leaky column was removed, accuracy collapsed to ~52–58%. `tsauditor`
exists to catch that class of mistake automatically.

## Installation

```bash
git clone <repo-url>
cd tsauditor
pip install -e ".[dev]"
```

Requires Python ≥ 3.9. Core dependencies: `pandas`, `numpy`, `scipy`, `statsmodels`, `rich`.

## Quickstart

```python
import tsauditor as tsa

# df is an existing pandas DataFrame that includes the target column ("Direction" here).
report = tsa.scan(df, target="Direction", domain="finance")

report.summary()                 # rich-formatted CLI table
report.critical                  # list[Issue] that block modeling
report.filter(module="leakage")  # programmatic filtering
report.to_json("report.json")    # structured export
```

`scan()` returns a `GuardReport` holding `Issue` dataclasses bucketed by severity
(`critical`, `warnings`, `info`) plus dataset metadata.
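
The actual `Issue` and `GuardReport` definitions are not part of this excerpt. As a rough mental model only, the sketch below shows one way severity bucketing and `filter(module=...)` could be laid out; every field name and method body here is an assumption made for illustration, not the library's API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Issue:
    # Hypothetical fields; the real dataclass may differ.
    code: str        # e.g. "LEK001"
    module: str      # "profiler", "anomaly" or "leakage"
    severity: str    # "critical", "warning" or "info"
    message: str

@dataclass
class GuardReport:
    issues: list = field(default_factory=list)

    def _by_severity(self, severity):
        return [i for i in self.issues if i.severity == severity]

    @property
    def critical(self):
        return self._by_severity("critical")

    @property
    def warnings(self):
        return self._by_severity("warning")

    @property
    def info(self):
        return self._by_severity("info")

    def filter(self, module=None):
        """Return the issues raised by one module, or all issues if module is None."""
        return [i for i in self.issues if module is None or i.module == module]

report = GuardReport([
    Issue("LEK001", "leakage", "critical", "feature reproduces the target"),
    Issue("PRF004", "profiler", "warning", "duplicate timestamps"),
])
print([i.code for i in report.critical])
print([i.code for i in report.filter(module="profiler")])
```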

## What it checks

| Module | Code | Severity | Detects |
|--------|------|----------|---------|
| profiler | PRF001 | warning | Irregular timestamp frequency |
| profiler | PRF002 | warning | Clustered missing values |
| profiler | PRF003 | info | Non-stationarity (Augmented Dickey-Fuller) |
| profiler | PRF004 | warning | Duplicate timestamps |
| profiler | PRF005 | warning | Clustered gaps |
| profiler | PRF006 | warning | High overall missing rate |
| anomaly | ANO001 | warning | Stuck / repeated constant values |
| anomaly | ANO002 | warning | Point outliers (z-score + IQR) |
| anomaly | ANO003 | warning | Contextual spikes (local rolling z-score) |
| leakage | LEK001 | critical | Target equivalence (feature reproduces the target) |
| leakage | LEK002 | warning | Positive-lag cross-correlation peak (future info) |
| leakage | LEK003 | warning | Rolling-window lookahead (excess over persistence) |

### Leakage detection (the research core)

Leakage checks are **rank-based**, chosen by target type:

- **LEK001: equivalence.** Continuous targets use `|Spearman ρ|`; binary targets use
  **AUC separation** (`max(AUC, 1−AUC)`). This is deliberate: Pearson against a binary
  0/1 target is the point-biserial correlation, which for a roughly Gaussian feature
  split at its median tops out near `√(2/π) ≈ 0.798`, so a feature whose sign *defines*
  the target scores only ~0.80 and slips under a naive threshold. AUC scores it 1.0.

- **LEK002: cross-correlation.** Flags features whose peak association with the target
  falls at a *positive* lag (the feature aligns with the target's future).

- **LEK003: temporal lookahead.** Flags features that correlate with the future target
  *beyond* what the target's own autocorrelation can explain, which is the signature of
  a forward-looking or centered window. The persistence baseline is what keeps a
  legitimate trailing feature from being false-flagged.
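
The LEK001 argument can be checked numerically. The sketch below uses only the standard library and is an illustration, not the project's `equivalence.py`: the helper names `pearson` and `auc`, the sample size, and the seed are all assumptions. It draws a standard-normal feature, defines the label as the feature's sign, and compares the two scores.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation; against a 0/1 label this is point-biserial."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U / (n_pos * n_neg)); ties get average ranks."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1            # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = random.Random(0)
feature = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
target = [1 if x > 0 else 0 for x in feature]   # the feature's sign defines the label

r = pearson(feature, target)
a = auc(feature, target)
separation = max(a, 1 - a)

print(f"point-biserial r = {r:.3f}  (sqrt(2/pi) = {math.sqrt(2 / math.pi):.3f})")
print(f"AUC separation   = {separation:.3f}")
```

Because the label is a deterministic function of the feature's rank order, every positive outranks every negative and the AUC is exactly 1, while the linear correlation stays close to the `√(2/π)` ceiling.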

LEK002/LEK003 are WARNING-level *suspicions*: in pure cross-correlation a genuine strong
predictor and a leak are distinguishable only by magnitude. LEK001 is CRITICAL because
equivalence is near-deterministic.
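
The two warning-level ideas can be sketched on synthetic data. This is a simplified illustration under stated assumptions, not the project's `correlation.py` or `temporal.py`: it uses plain Pearson correlation instead of a rank statistic, an AR(1) target with coefficient 0.5, and hypothetical helper names (`lagged_corr`, `peak_lag`, `excess_over_persistence`). A positive lag `k` pairs `feature[t]` with `target[t + k]`, that is, with the target's future.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def lagged_corr(feature, target, lag):
    """Correlation between feature[t] and target[t + lag]."""
    if lag >= 0:
        xs, ys = feature[:len(feature) - lag], target[lag:]
    else:
        xs, ys = feature[-lag:], target[:len(target) + lag]
    return pearson(xs, ys)

def peak_lag(feature, target, max_lag):
    """Lag in [-max_lag, max_lag] with the largest absolute correlation (LEK002 idea)."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda k: abs(lagged_corr(feature, target, k)))

def excess_over_persistence(feature, target, horizon=1):
    """Future-target correlation minus the target's own autocorrelation (LEK003 idea)."""
    feature_score = abs(lagged_corr(feature, target, horizon))
    persistence = abs(lagged_corr(target, target, horizon))
    return feature_score - persistence

# Synthetic AR(1) series: y[t] = 0.5 * y[t-1] + noise.
rng = random.Random(1)
n = 2000
y = [0.0]
for _ in range(n + 1):
    y.append(0.5 * y[-1] + rng.gauss(0.0, 1.0))

target = y[1:n + 1]
trailing = y[0:n]       # trailing[t] == target[t - 1]: legitimate past information
leaky = y[2:n + 2]      # leaky[t] == target[t + 1]: built by looking ahead

for name, feat in (("trailing", trailing), ("leaky", leaky)):
    print(name, "peak lag:", peak_lag(feat, target, 5),
          "excess:", round(excess_over_persistence(feat, target), 3))
```

The trailing feature peaks at a negative lag and scores below the persistence baseline, so neither check would fire; the lookahead feature peaks at lag +1 and exceeds the baseline.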

## Architecture

```
tsauditor/
├── scanner.py            # scan(): orchestrates all modules into a GuardReport
├── profiler/             # structural checks: frequency, missing, stationarity
├── anomaly/              # point.py, contextual.py
├── leakage/              # equivalence.py, correlation.py, temporal.py
├── report/summary.py     # GuardReport + Issue dataclasses, rich/JSON output
└── utils/validation.py   # input validation & DataFrame normalization
```

## Testing

```bash
pytest -q
```

## Status

Alpha (`0.1.0`). Profiler, anomaly, and leakage modules are implemented and tested.

## License

MIT; see [LICENSE](LICENSE).

pyproject.toml

Lines changed: 19 additions & 5 deletions
@@ -24,11 +24,11 @@ classifiers = [
     "Topic :: Scientific/Engineering :: Information Analysis",
 ]
 dependencies = [
-    "pandas>=1.5",
-    "numpy>=1.23",
-    "scipy>=1.9",
-    "statsmodels>=0.13",
-    "rich>=13.0",
+    "pandas>=1.5,<3",
+    "numpy>=1.23,<3",
+    "scipy>=1.9,<2",
+    "statsmodels>=0.13,<1",
+    "rich>=13.0,<15",
 ]
 
 [project.optional-dependencies]
@@ -43,3 +43,17 @@ include = ["tsauditor*"]
 
 [tool.pytest.ini_options]
 testpaths = ["tests"]
+
+[tool.coverage.run]
+source = ["tsauditor"]
+omit = [
+    "tsauditor/__pycache__/*",
+]
+
+[tool.coverage.report]
+exclude_lines = [
+    "pragma: no cover",
+    "raise NotImplementedError",
+    "if __name__ == .__main__.:",
+]
+show_missing = true
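
The `exclude_lines` entries are regular expressions, which is why the last one writes `.` where a quote would be: a single pattern then covers both quoting styles. The sketch below checks that with Python's `re` module; it assumes search-style matching against each source line and does not invoke coverage.py itself.

```python
import re

# Same pattern as the exclude_lines entry; "." matches either quote character.
pattern = "if __name__ == .__main__.:"

candidates = [
    'if __name__ == "__main__":',
    "if __name__ == '__main__':",
    "    if __name__ == '__main__':  # indented variant",
    "if name == 'main':",
]

matches = [bool(re.search(pattern, line)) for line in candidates]
for line, hit in zip(candidates, matches):
    print(hit, repr(line))
```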
