Build Complete DVC ML Pipeline with Remote Storage and Experiments

Problem

Complete the xFusionCorp Industries fraud-detection production DVC pipeline. Three stages are already wired in dvc.yaml, two remain, and the pipeline must finish as a reproducible, SeaweedFS-backed, v1.0-tagged release.

A project exists at /root/code/ml-pipeline/ with Git and DVC initialised. The params.yaml is in place and the .dvc/config is pre-configured to push to the SeaweedFS bucket dvc-storage at http://localhost:8333.
The ingest, validate, and preprocess stages are already declared in dvc.yaml, but one of them contains an incorrect output path that prevents dvc repro from completing. Find and fix it.
The remaining two stages need to be added:
- train – Depends on the preprocessed dataset and scripts/train.py; reads n_estimators, max_depth, test_size, and random_seed from params.yaml; outputs models/model.pkl and data/processed/test_split.csv; declares metrics.json as a DVC metric with cache: false.
- evaluate – Depends on models/model.pkl, data/processed/test_split.csv, and scripts/evaluate.py; outputs reports/evaluation.json declared with cache: false.
The two scripts you need are pre-staged at /root/code/ml-pipeline/scripts-staging/train.py and scripts-staging/evaluate.py. Copy them into scripts/ before adding the stages.
Run the full pipeline with dvc repro, push the cache to the SeaweedFS remote with dvc push, and tag the current state as v1.0.
Commit every change to Git so the release is fully captured.

Open the SeaweedFS Filer button at the top of the lab and navigate to /buckets/dvc-storage/ to confirm that the bucket holds the pushed artefacts under the files/md5/... layout.

Solution

Let's move into the project directory and run dvc repro to identify the issue:

cd /root/code/ml-pipeline
dvc repro

dvc repro
Running stage 'ingest':
> python scripts/ingest.py
Data ingested successfully: 20 rows, 5 columns
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'validate':
> python scripts/validate.py
Validation: 20 rows, valid=True
Updating lock file 'dvc.lock'

Running stage 'preprocess':
> python scripts/preprocess.py
Preprocessed: 20 clean rows
ERROR: failed to reproduce 'preprocess': output 'data/processed/cleaned.csv' does not exist

The run fails at the preprocess stage: dvc.yaml declares data/processed/cleaned.csv as the stage output, but scripts/preprocess.py actually writes data/processed/clean.csv. Change the path under outs in dvc.yaml to data/processed/clean.csv and run dvc repro again.
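The mismatch behind this error can be reproduced without DVC. The sketch below is only an illustration, not how DVC performs the check: the helper name `missing_outputs` and the sample CSV header are hypothetical, and the temporary directory stands in for the project after preprocess.py has run.

```python
from pathlib import Path
import tempfile


def missing_outputs(project_root, declared_outs):
    """Return the declared output paths that do not exist under project_root."""
    root = Path(project_root)
    return [out for out in declared_outs if not (root / out).exists()]


# Throwaway directory that mimics the state right after preprocess.py ran.
with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp) / "data" / "processed"
    processed.mkdir(parents=True)
    (processed / "clean.csv").write_text("id,amount\n")  # file the script writes

    # Path as originally declared in dvc.yaml: reported as missing.
    print(missing_outputs(tmp, ["data/processed/cleaned.csv"]))
    # ['data/processed/cleaned.csv']

    # Corrected path: nothing is missing.
    print(missing_outputs(tmp, ["data/processed/clean.csv"]))
    # []
```

A stage only succeeds when every path listed under its outs exists once the command finishes, which is why a one-word difference in the file name is enough to stop the pipeline.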

Add the remaining two stages:

Let's add the two remaining stages, train and evaluate, to complete the pipeline according to the requirements.

stages:
    ingest:
        cmd: python scripts/ingest.py
        deps:
            - scripts/ingest.py
            - data/raw/data.csv

    validate:
        cmd: python scripts/validate.py
        deps:
            - data/raw/data.csv
            - scripts/validate.py
        outs:
            - reports/validation.json:
                cache: false

    preprocess:
        cmd: python scripts/preprocess.py
        deps:
            - data/raw/data.csv
            - scripts/preprocess.py
        outs:
            - data/processed/clean.csv
    train:
        cmd: python scripts/train.py
        deps:
            - data/processed/clean.csv
            - scripts/train.py
        params:
            - n_estimators
            - random_seed
            - test_size
            - max_depth
        outs:
            - models/model.pkl
            - data/processed/test_split.csv
        metrics:
            - metrics.json:
                cache: false
    evaluate:
        cmd: python scripts/evaluate.py
        deps:
            - scripts/evaluate.py
            - models/model.pkl
            - data/processed/test_split.csv
        outs:
            - reports/evaluation.json:
                cache: false

Both missing stages are now in place. The full day 19 dvc.yaml is available in the project repository.
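dvc repro decides the run order by matching each stage's deps against the outs of other stages. The following is a minimal sketch of that idea, not DVC's actual implementation: the `STAGES` table is transcribed by hand from the dvc.yaml above (script dependencies are left out, and only the three stages linked through outputs are included), and `stage_order` is a hypothetical helper. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Stage name -> (deps, outs), copied from the dvc.yaml above.
STAGES = {
    "preprocess": (
        ["data/raw/data.csv"],
        ["data/processed/clean.csv"],
    ),
    "train": (
        ["data/processed/clean.csv"],
        ["models/model.pkl", "data/processed/test_split.csv", "metrics.json"],
    ),
    "evaluate": (
        ["models/model.pkl", "data/processed/test_split.csv"],
        ["reports/evaluation.json"],
    ),
}


def stage_order(stages):
    """Order stages so each one runs after the stages that produce its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    # For each stage, collect the stages whose outputs it depends on.
    graph = {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))
# ['preprocess', 'train', 'evaluate']
```

This is also why the train stage must list data/processed/clean.csv exactly as preprocess declares it: a dependency path that matches no output leaves the two stages unconnected.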

Add the missing scripts:

Let's copy the train.py and evaluate.py scripts from the staging area into the main scripts directory:
```
cp scripts-staging/train.py scripts/
cp scripts-staging/evaluate.py scripts/
```
Now run dvc repro to execute the full pipeline, then push the cache to the SeaweedFS remote with dvc push:
```
dvc repro
dvc push
```

Commit all changes to Git and tag the release:

git add .
git commit -m "Build complete DVC full pipeline v1.0"
git tag v1.0

Good to Know?

Run git add . and git commit before git tag v1.0; the tag must point to the final committed release, not uncommitted work.
After dvc push, confirm artifacts land in the SeaweedFS bucket under files/md5/... by opening the SeaweedFS Filer and browsing /buckets/dvc-storage/.
If dvc repro fails on preprocess, check that dvc.yaml uses data/processed/clean.csv, not data/processed/cleaned.csv.
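To know what to look for under files/md5/... in the bucket, it helps to see how an object path relates to a file's hash. This is a rough sketch under the assumption that the remote uses the DVC 3.x layout, where a single file is stored under the first two hex characters of its MD5 followed by the remaining thirty; `cache_relpath` is a hypothetical helper, and directory outputs are tracked differently and are not covered here.

```python
import hashlib


def cache_relpath(content: bytes) -> str:
    """Relative object path for one file, assuming files/md5/<2 hex>/<30 hex>."""
    digest = hashlib.md5(content).hexdigest()
    return f"files/md5/{digest[:2]}/{digest[2:]}"


# An empty file has a well-known MD5, which makes the split easy to see.
print(cache_relpath(b""))
# files/md5/d4/1d8cd98f00b204e9800998ecf8427e
```

Under the same assumption, the md5 values recorded for cached outputs in dvc.lock should correspond to entries of this shape below /buckets/dvc-storage/. Outputs declared with cache: false are not pushed, so they will not appear there.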

Want to test locally?

Clone the full project and follow its guidelines to test the complete ml-pipeline DVC pipeline locally.
