Complete the xFusionCorp Industries fraud-detection production DVC pipeline. Three stages are already wired in dvc.yaml, two remain, and the pipeline must finish as a reproducible, SeaweedFS-backed, v1.0-tagged release.
- A project exists at /root/code/ml-pipeline/ with Git and DVC initialised. The params.yaml is in place and the .dvc/config is pre-configured to push to the SeaweedFS bucket dvc-storage at http://localhost:8333.
- The ingest, validate, and preprocess stages are already declared in dvc.yaml, but one of them contains an incorrect output path that prevents dvc repro from completing. Find and fix it.
- The remaining two stages need to be added:
  - train – Depends on the preprocessed dataset and scripts/train.py; reads n_estimators, max_depth, test_size, and random_seed from params.yaml; outputs models/model.pkl and data/processed/test_split.csv; declares metrics.json as a DVC metric with cache: false.
  - evaluate – Depends on models/model.pkl, data/processed/test_split.csv, and scripts/evaluate.py; outputs reports/evaluation.json declared with cache: false.
- The two scripts you need are pre-staged at /root/code/ml-pipeline/scripts-staging/train.py and scripts-staging/evaluate.py. Copy them into scripts/ before adding the stages.
- Run the full pipeline with dvc repro, push the cache to the SeaweedFS remote with dvc push, and tag the current state as v1.0.
- Commit every change to Git so the release is fully captured.

Open the SeaweedFS Filer button at the top of the lab and navigate to /buckets/dvc-storage/ to confirm that the bucket holds the pushed artefacts under the files/md5/... layout.
- Let's move into the project directory and run dvc repro to identify the issue:

      cd ml-pipeline
      dvc repro

  Output:

      Running stage 'ingest':
      > python scripts/ingest.py
      Data ingested successfully: 20 rows, 5 columns
      Generating lock file 'dvc.lock'
      Updating lock file 'dvc.lock'
      Running stage 'validate':
      > python scripts/validate.py
      Validation: 20 rows, valid=True
      Updating lock file 'dvc.lock'
      Running stage 'preprocess':
      > python scripts/preprocess.py
      Preprocessed: 20 clean rows
      ERROR: failed to reproduce 'preprocess': output 'data/processed/cleaned.csv' does not exist

  So we can see it is failing at the preprocess stage: dvc.yaml declares data/processed/cleaned.csv as the output, but if we look at the preprocess.py script, the file it actually writes is clean.csv. Let's change the output path in dvc.yaml to data/processed/clean.csv and run dvc repro again.
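When a declared output "does not exist", the file the script really wrote usually sits right next to the missing path. The sketch below illustrates one way to spot it: it compares the missing file name with its siblings using the standard library's difflib. The function name suggest_output is hypothetical and is not part of DVC.

```python
import difflib
from pathlib import Path


def suggest_output(declared: str, root: str = ".") -> list[str]:
    """Return sibling files whose names resemble a missing declared output."""
    target = Path(root) / declared
    if target.exists():
        return []  # the declared output is really there, nothing to fix
    if not target.parent.is_dir():
        return []  # the directory itself is missing, nothing to compare against
    siblings = [p.name for p in target.parent.iterdir() if p.is_file()]
    # get_close_matches returns the most similar names first
    matches = difflib.get_close_matches(target.name, siblings, n=3, cutoff=0.6)
    return [(Path(declared).parent / name).as_posix() for name in matches]
```

Called from the project root as suggest_output("data/processed/cleaned.csv") after the failed run, it should point at data/processed/clean.csv, provided preprocess.py wrote that file as the walkthrough describes.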
- Add the remaining two stages: let's add the train and evaluate stages to complete the pipeline according to the requirements.

      stages:
        ingest:
          cmd: python scripts/ingest.py
          deps:
            - scripts/ingest.py
            - data/raw/data.csv
        validate:
          cmd: python scripts/validate.py
          deps:
            - data/raw/data.csv
            - scripts/validate.py
          outs:
            - reports/validation.json:
                cache: false
        preprocess:
          cmd: python scripts/preprocess.py
          deps:
            - data/raw/data.csv
            - scripts/preprocess.py
          outs:
            - data/processed/clean.csv
        train:
          cmd: python scripts/train.py
          deps:
            - data/processed/clean.csv
            - scripts/train.py
          params:
            - n_estimators
            - random_seed
            - test_size
            - max_depth
          outs:
            - models/model.pkl
            - data/processed/test_split.csv
          metrics:
            - metrics.json:
                cache: false
        evaluate:
          cmd: python scripts/evaluate.py
          deps:
            - scripts/evaluate.py
            - models/model.pkl
            - data/processed/test_split.csv
          outs:
            - reports/evaluation.json:
                cache: false

  We have added the two missing stages. Full source code of day 19 dvc.yaml
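DVC does not run stages in file order; it links a stage to whichever stage produces one of its deps and runs them in dependency order. The sketch below illustrates that idea with the standard library's graphlib. It is not DVC's internal implementation. The STAGES dictionary is transcribed by hand from the dvc.yaml listing, and run_order is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# deps/outs copied by hand from the dvc.yaml listing; metrics.json is listed
# under "outs" here because the train stage is the one that produces it.
STAGES = {
    "ingest": {"deps": ["scripts/ingest.py", "data/raw/data.csv"], "outs": []},
    "validate": {
        "deps": ["data/raw/data.csv", "scripts/validate.py"],
        "outs": ["reports/validation.json"],
    },
    "preprocess": {
        "deps": ["data/raw/data.csv", "scripts/preprocess.py"],
        "outs": ["data/processed/clean.csv"],
    },
    "train": {
        "deps": ["data/processed/clean.csv", "scripts/train.py"],
        "outs": ["models/model.pkl", "data/processed/test_split.csv", "metrics.json"],
    },
    "evaluate": {
        "deps": ["scripts/evaluate.py", "models/model.pkl", "data/processed/test_split.csv"],
        "outs": ["reports/evaluation.json"],
    },
}


def run_order(stages: dict) -> list[str]:
    """Order stages so each one comes after the stages that produce its deps."""
    # Map every declared output to the stage that writes it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages whose outputs it depends on.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

run_order(STAGES) always places preprocess before train and train before evaluate. The relative position of ingest and validate is not constrained, because no stage in the listing consumes an output they declare.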
- Add the missing scripts: let's copy the train.py and evaluate.py scripts from the staging area to the main scripts directory:

      cp scripts-staging/train.py scripts/
      cp scripts-staging/evaluate.py scripts/
- Let's run dvc repro, then push the cache to the SeaweedFS remote with dvc push:

      dvc repro
      dvc push
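To make the files/md5/... layout concrete, here is a minimal sketch of where a single cached file would be expected in the bucket. It assumes the DVC 3.x naming scheme (the first two hex characters of the MD5 become a directory and the remaining thirty the file name) and that the remote URL points at the bucket root. The function name remote_key is hypothetical. It only applies to plain file outputs, and outputs declared with cache: false (metrics.json and the reports) are not pushed at all.

```python
import hashlib


def remote_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the object key expected for a file, assuming DVC 3.x layout."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Hash in chunks so large model files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    return f"files/md5/{digest[:2]}/{digest[2:]}"
```

For example, remote_key("models/model.pkl") gives a key to look for under /buckets/dvc-storage/ in the Filer. The same MD5 is recorded for that output in dvc.lock, so the two can be compared directly.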
- Commit all changes to Git and tag the release:

      git add .
      git commit -m "Build complete DVC full pipeline v1.0"
      git tag v1.0
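A quick way to confirm the release is fully captured is to check that nothing is left uncommitted and that v1.0 resolves to the current commit. The sketch below is one possible check, not part of the lab itself. The helper names git and release_problems are hypothetical, and the git subcommands used (status --porcelain, rev-parse) are standard.

```python
import subprocess


def git(*args: str) -> str:
    """Run a git command in the current directory and return its trimmed output."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def release_problems(status: str, head: str, tag_commit: str) -> list[str]:
    """List the reasons a tag does not capture the current state (empty if fine)."""
    problems = []
    if status.strip():
        # Any line from `git status --porcelain` means changed or untracked files.
        problems.append("working tree has uncommitted changes")
    if head != tag_commit:
        problems.append("tag does not point at HEAD")
    return problems
```

Inside the project directory, release_problems(git("status", "--porcelain"), git("rev-parse", "HEAD"), git("rev-parse", "v1.0^{commit}")) should return an empty list once the commit and tag steps have both been run.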
- Run git add . and git commit before git tag v1.0; the tag must point to the final committed release, not uncommitted work.
- After dvc push, confirm artefacts land in the SeaweedFS bucket under files/md5/... by opening the SeaweedFS Filer and browsing /buckets/dvc-storage/.
- If dvc repro fails on preprocess, check that dvc.yaml uses data/processed/clean.csv, not data/processed/cleaned.csv.
Clone the full project and follow the guidelines to test the complete ml-pipeline DVC pipeline locally.