| name | setting-up-reproducible-analysis |
|---|---|
| description | Use when starting analysis work that needs isolation, or before executing a pre-registered plan - ensures an isolated, reproducible workspace with pinned environment, fixed seeds, and immutable raw data |
Ensure the analysis happens in an isolated workspace that another person (or future you) can reproduce exactly: same code, same environment, same seed, same immutable input data.
Core principle: A result you cannot reproduce is not a result. Set up reproducibility before you run anything.
Announce at start: "I'm using the setting-up-reproducible-analysis skill to set up an isolated, reproducible workspace."
Before creating anything, check whether you're already in an isolated workspace.
GIT_DIR=$(cd "$(git rev-parse --git-dir)" 2>/dev/null && pwd -P)
GIT_COMMON=$(cd "$(git rev-parse --git-common-dir)" 2>/dev/null && pwd -P)
BRANCH=$(git branch --show-current)
Submodule guard: GIT_DIR != GIT_COMMON is also true in submodules. Verify you're not in one:
git rev-parse --show-superproject-working-tree 2>/dev/null
If GIT_DIR != GIT_COMMON (and not a submodule): already in a linked worktree. Skip to Step 2 (Environment). Don't create another.
If GIT_DIR == GIT_COMMON: normal checkout. If your human partner hasn't already stated a preference, ask before creating a worktree:
"Would you like me to set up an isolated worktree? It keeps this analysis separate from your current branch."
Honor any declared preference. If declined, work in place and continue to Step 2.
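The decision above can be sketched as a small pure function. This is a minimal illustration, not part of the skill itself: `classify_checkout` and its return labels are hypothetical names, and it assumes you have already captured the output of the three git commands as strings (with the first two resolved to absolute paths).

```python
from pathlib import Path


def classify_checkout(git_dir: str, git_common: str, superproject: str) -> str:
    """Classify a checkout from values gathered by the git commands above.

    git_dir / git_common: absolute paths from `git rev-parse --git-dir` and
    `--git-common-dir`; superproject: output of
    `git rev-parse --show-superproject-working-tree` (empty outside a submodule).
    """
    if superproject.strip():
        # A superproject exists, so this is a submodule: treat as a normal repo.
        return "submodule"
    if Path(git_dir) != Path(git_common):
        # The git dir differs from the shared one: already a linked worktree.
        return "linked-worktree"
    return "normal"
```

Checking the submodule case first mirrors the guard in the text: the path comparison alone cannot tell a submodule from a linked worktree.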
Native tools first. If you have a native worktree tool (EnterWorktree, WorktreeCreate, a /worktree command, a --worktree flag), use it and skip to Step 2. Using git worktree add when a native tool exists creates phantom state the harness can't manage.
Git fallback (only if no native tool):
# Prefer an existing .worktrees/ (must be git-ignored), else default to it.
# BRANCH_NAME is the new branch for this analysis; it is not set above, so set it before running.
git check-ignore -q .worktrees 2>/dev/null || echo ".worktrees/" >> .gitignore
git worktree add ".worktrees/$BRANCH_NAME" -b "$BRANCH_NAME"
cd ".worktrees/$BRANCH_NAME"
If git worktree add fails with a sandbox permission error, tell your human partner and work in the current directory instead.
A reproducible result needs a pinned environment. Detect and set one up:
# Python
if [ -f environment.yml ]; then conda env create -f environment.yml || conda env update -f environment.yml; fi
if [ -f requirements.txt ]; then python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt; fi
if [ -f pyproject.toml ]; then python -m venv .venv && . .venv/bin/activate && pip install -e .; fi
# R
if [ -f renv.lock ]; then Rscript -e 'renv::restore()'; fi
If no environment file exists, create one and record exact versions (a lockfile, pip freeze > requirements.txt, conda env export, or renv::snapshot()). Record the language runtime version too. An analysis run against unpinned dependencies is not reproducible.
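As a rough illustration of "record exact versions and the runtime version", the sketch below lists the running interpreter's version and the installed distributions using only the standard library. `environment_record` is a hypothetical helper name, and this is not a substitute for a real lockfile from pip, conda, or renv; it only shows what information is worth capturing.

```python
import platform
from importlib import metadata


def environment_record() -> list[str]:
    """Return the runtime version followed by sorted 'name==version' lines."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    header = f"# python {platform.python_version()}"
    return [header] + sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print the record so it can be redirected into a file kept with the analysis.
    print("\n".join(environment_record()))
```

Saving this output next to the results gives a reviewer a quick way to compare their environment against yours, even when a full lockfile is also present.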
Set a single global seed and record it (it belongs in the pre-registration too). Anything stochastic — sampling, splits, bootstrap, model init, cross-validation folds — must use it.
SEED = 20260528
import os, random, numpy as np
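# PYTHONHASHSEED is read at interpreter startup: setting it here only affects child processes, so export it before launching Python as well.
os.environ["PYTHONHASHSEED"] = str(SEED)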
random.seed(SEED); np.random.seed(SEED)
# torch.manual_seed(SEED); etc.
- Raw data is immutable. Put it under data/raw/ and treat it as read-only. Every transform writes a NEW artifact under data/derived/. Never edit raw in place.
- Record provenance: where each raw file came from, when, which version/query, and a checksum.
mkdir -p data/raw data/derived
shasum -a 256 data/raw/* > data/raw/CHECKSUMS.txt # detect silent changes later (run once; on a re-run the glob would also match CHECKSUMS.txt)
chmod -w data/raw/* 2>/dev/null || true # best-effort read-only
- Keep raw and derived separate so a reviewer can see exactly what came in and what you made.
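The checksum file is only useful if something reads it back later. Below is a minimal sketch of that check, assuming the manifest uses the usual `shasum` layout of one `<hex digest>  <path>` entry per line; `sha256_of` and `changed_files` are hypothetical helper names, and file names with unusual whitespace are not handled.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file's SHA-256 in chunks so large raw files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest: Path, root: Path = Path(".")) -> list[str]:
    """Return manifest entries that are missing or whose digest no longer matches."""
    changed = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum prefixes binary-mode entries with '*'
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(name)
    return changed
```

Paths in the manifest are resolved relative to `root`, so call it from the directory where the checksums were generated (for example, `changed_files(Path("data/raw/CHECKSUMS.txt"))` from the workspace root). An empty list means the raw files match their recorded checksums.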
Before adding new work, confirm the workspace starts clean and existing analyses reproduce:
# Run existing tests / re-run an existing analysis and confirm outputs match
pytest 2>/dev/null || true
If existing results don't reproduce: report it and ask whether to fix the baseline before proceeding. You can't attribute new findings to your work if the starting point was already broken.
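"Confirm outputs match" can be made concrete by fingerprinting a result and comparing it across runs. The sketch below is illustrative only: `toy_analysis` is a hypothetical stand-in for a real analysis (a seeded bootstrap of a mean over made-up numbers), and `fingerprint` and `reproduces` are assumed helper names.

```python
import hashlib
import json
import random


def toy_analysis(seed: int) -> dict:
    """Hypothetical stand-in for a real analysis: a seeded bootstrap of a mean."""
    rng = random.Random(seed)  # local generator, independent of global state
    data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up values for illustration
    means = []
    for _ in range(200):
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return {"seed": seed, "bootstrap_mean": sum(means) / len(means)}


def fingerprint(result: dict) -> str:
    """Hash a JSON-serialisable result; sort_keys makes key order irrelevant."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def reproduces(seed: int) -> bool:
    """Run the analysis twice with the same seed and compare fingerprints."""
    return fingerprint(toy_analysis(seed)) == fingerprint(toy_analysis(seed))
```

Exact hash equality is a strict test that suits reruns on the same machine and environment. Across platforms or library versions, floating-point results may differ in the last digits, so a tolerance-based comparison may be the better baseline check there.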
Workspace ready at <path> (branch <name>)
Environment pinned: <python 3.x / conda env>, lockfile recorded
Seed fixed: <SEED>
Raw data: <N files>, checksummed, read-only
Baseline: <reproduces / N tests pass>
Ready to execute <topic>
| Situation | Action |
|---|---|
| Already in a linked worktree | Skip creation (Step 0) |
| In a submodule | Treat as normal repo |
| Native worktree tool available | Use it |
| No environment file | Create one + record a lockfile |
| Stochastic step | Use the recorded global seed |
| Need to transform raw data | Write a new derived artifact; never edit raw |
| Baseline doesn't reproduce | Report + ask before proceeding |
Never:
- Run an analysis against unpinned dependencies
- Leave seeds unset for anything stochastic
- Edit raw data in place
- Skip the baseline reproducibility check
- Use git worktree add when a native worktree tool exists
Always:
- Detect existing isolation first
- Pin the environment and record a lockfile
- Fix and record the seed
- Keep raw immutable and provenance documented
- Confirm the baseline reproduces