Build Complete DVC ML Pipeline with Remote Storage and Experiments

Problem

Complete the xFusionCorp Industries fraud-detection production DVC pipeline. Three stages are already wired in dvc.yaml, two remain, and the pipeline must finish as a reproducible, SeaweedFS-backed, v1.0-tagged release.

  1. A project exists at /root/code/ml-pipeline/ with Git and DVC initialised. The params.yaml is in place and the .dvc/config is pre-configured to push to the SeaweedFS bucket dvc-storage at http://localhost:8333.

  2. The ingest, validate, and preprocess stages are already declared in dvc.yaml, but one of them contains an incorrect output path that prevents dvc repro from completing. Find and fix it.

  3. The remaining two stages need to be added:

    • train – Depends on the preprocessed dataset and scripts/train.py; reads n_estimators, max_depth, test_size, and random_seed from params.yaml; outputs models/model.pkl and data/processed/test_split.csv; declares metrics.json as a DVC metric with cache: false.
    • evaluate – Depends on models/model.pkl, data/processed/test_split.csv, and scripts/evaluate.py; outputs reports/evaluation.json declared with cache: false.
  4. The two scripts you need are pre-staged at /root/code/ml-pipeline/scripts-staging/train.py and scripts-staging/evaluate.py. Copy them into scripts/ before adding the stages.

  5. Run the full pipeline with dvc repro, push the cache to the SeaweedFS remote with dvc push, and tag the current state as v1.0.

  6. Commit every change to Git so the release is fully captured.

Open the SeaweedFS Filer button at the top of the lab and navigate to /buckets/dvc-storage/ to confirm that the bucket holds the pushed artefacts under the files/md5/... layout.

Solution

  1. Move into the project directory and run dvc repro to identify the issue:

    cd /root/code/ml-pipeline
    dvc repro
    Running stage 'ingest':
    > python scripts/ingest.py
    Data ingested successfully: 20 rows, 5 columns
    Generating lock file 'dvc.lock'
    Updating lock file 'dvc.lock'
    
    Running stage 'validate':
    > python scripts/validate.py
    Validation: 20 rows, valid=True
    Updating lock file 'dvc.lock'
    
    Running stage 'preprocess':
    > python scripts/preprocess.py
    Preprocessed: 20 clean rows
    ERROR: failed to reproduce 'preprocess': output 'data/processed/cleaned.csv' does not exist

    The run fails at the preprocess stage. scripts/preprocess.py actually writes clean.csv, while dvc.yaml declares the output as data/processed/cleaned.csv. Change the output path in dvc.yaml to data/processed/clean.csv and run dvc repro again.
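
One way to apply this fix from the shell is sketched below. The function name fix_preprocess_out is hypothetical, and the sketch assumes the wrong path appears in dvc.yaml spelled exactly as data/processed/cleaned.csv, as the error message suggests.

```shell
# Hypothetical helper: rewrite the wrong preprocess output path in a dvc.yaml file.
# Assumes the bad path is spelled exactly data/processed/cleaned.csv.
fix_preprocess_out() {
    file="$1"
    tmp="$(mktemp)" || return 1
    # Write the corrected copy to a temp file, then overwrite the original
    # with cat so the file keeps its permissions.
    if sed 's#data/processed/cleaned\.csv#data/processed/clean.csv#g' "$file" > "$tmp"; then
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Usage from the project root (not run here):
#   grep -n "clean" scripts/preprocess.py dvc.yaml   # compare the two paths
#   fix_preprocess_out dvc.yaml && dvc repro
```

Editing dvc.yaml by hand works just as well; the point is that the path under outs must match what the script writes.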

  2. Add the remaining two stages:

    Add the train and evaluate stages to dvc.yaml to complete the pipeline as required.

    stages:
        ingest:
            cmd: python scripts/ingest.py
            deps:
                - scripts/ingest.py
                - data/raw/data.csv
    
        validate:
            cmd: python scripts/validate.py
            deps:
                - data/raw/data.csv
                - scripts/validate.py
            outs:
                - reports/validation.json:
                    cache: false
    
        preprocess:
            cmd: python scripts/preprocess.py
            deps:
                - data/raw/data.csv
                - scripts/preprocess.py
            outs:
                - data/processed/clean.csv
        train:
            cmd: python scripts/train.py
            deps:
                - data/processed/clean.csv
                - scripts/train.py
            params:
                - n_estimators
                - random_seed
                - test_size
                - max_depth
            outs:
                - models/model.pkl
                - data/processed/test_split.csv
            metrics:
                - metrics.json:
                    cache: false
        evaluate:
            cmd: python scripts/evaluate.py
            deps:
                - scripts/evaluate.py
                - models/model.pkl
                - data/processed/test_split.csv
            outs:
                - reports/evaluation.json:
                    cache: false

    The two missing stages are now in place. The full source is available in the day 19 dvc.yaml.
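
Before running the pipeline, a quick text check can confirm that all five stage names are present in dvc.yaml. This is only a sketch: missing_stages is a hypothetical helper, and it assumes each stage is declared on its own indented line as shown in the listing above.

```shell
# Hypothetical helper: print every expected stage name that is missing from a dvc.yaml file.
# Assumes each stage is declared on its own indented line as "<name>:".
missing_stages() {
    file="$1"
    for stage in ingest validate preprocess train evaluate; do
        grep -Eq "^[[:space:]]+${stage}:[[:space:]]*$" "$file" || echo "$stage"
    done
}

# Usage from the project root (not run here):
#   missing_stages dvc.yaml   # prints nothing when all five stages are declared
#   dvc stage list            # DVC's own listing of the stages it parsed
#   dvc dag                   # dependency graph of the stages
```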

  3. Add the missing scripts:

    Copy train.py and evaluate.py from the staging area into the scripts directory:

    cp scripts-staging/train.py scripts/
    cp scripts-staging/evaluate.py scripts/
  4. Run the full pipeline and push the cache to the SeaweedFS remote:

    dvc repro
    dvc push
  5. Commit all changes to Git and tag the release as v1.0:

    git add .
    git commit -m "Build complete DVC full pipeline v1.0"
    git tag v1.0
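
To double-check that the tag captures the whole release, the sketch below compares the tagged commit with HEAD and confirms that nothing is left uncommitted. The function name release_is_captured is hypothetical; it uses only git rev-parse and git status.

```shell
# Hypothetical helper: succeed only if the given tag resolves to the current HEAD commit
# and the working tree has no uncommitted or untracked changes.
release_is_captured() {
    tag="$1"
    [ "$(git rev-parse --verify --quiet "${tag}^{commit}")" = "$(git rev-parse HEAD)" ] &&
        [ -z "$(git status --porcelain)" ]
}

# Usage from the project root (not run here):
#   release_is_captured v1.0 && echo "v1.0 points at a clean HEAD"
```

If the check fails after a later commit, the tag still points at the earlier commit, which is the situation the first note in the next section warns about.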

Good to Know

  • Run git add . and git commit before git tag v1.0; the tag must point to the final committed release, not uncommitted work.
  • After dvc push, confirm artifacts land in the SeaweedFS bucket under files/md5/... by opening the SeaweedFS Filer and browsing /buckets/dvc-storage/.
  • If dvc repro fails on preprocess, check that dvc.yaml uses data/processed/clean.csv, not data/processed/cleaned.csv.
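
When browsing the Filer, it can help to know where a particular object should appear. The sketch below maps an md5 value, such as one recorded in dvc.lock, to the path expected in the bucket. It assumes that the remote URL points at the root of the dvc-storage bucket and that objects are stored as files/md5/<first two hex characters>/<remaining characters>; remote_object_path is a hypothetical helper name.

```shell
# Hypothetical helper: map an md5 hash (as recorded in dvc.lock) to the object path
# expected in the Filer, assuming the files/md5/<first 2 hex chars>/<rest> layout
# and a remote that points at the root of the dvc-storage bucket.
remote_object_path() {
    md5="$1"
    prefix="$(printf '%s' "$md5" | cut -c1-2)"
    rest="$(printf '%s' "$md5" | cut -c3-)"
    printf '/buckets/dvc-storage/files/md5/%s/%s\n' "$prefix" "$rest"
}

# Usage (not run here): copy an md5 value from dvc.lock and pass it in.
#   remote_object_path "<md5 from dvc.lock>"
```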

Want to test locally?

Clone the full project and follow the guidelines to test the full ml-pipeline DVC pipeline locally.