Prediction Workflow

The prediction stage runs forecast inference on a future (or held-out) time window using a previously trained ClassifierEnsemble. No labels are required at this stage: predictions are produced over an evenly spaced window grid.

Driver: PredictionModel (src/eruption_forecast/model/prediction_model.py). Wrapped by ForecastModel.predict(...).


Internal Pipeline

                ┌──────────────────────────────────────────────┐
                │             PredictionModel                  │
                │   inherits BaseModel                         │
                │   self.ClassifierEnsemble loaded eagerly     │
                └──────────────┬───────────────────────────────┘
                               │
                  ┌────────────┼─────────────┐
                  ▼            ▼             ▼
        ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
        │ build_label │→ │ extract_    │→ │  forecast   │
        │             │  │ features    │  │             │
        └─────────────┘  └─────────────┘  └─────────────┘
            unlabelled       prediction-      ClassifierEnsemble
            window grid      mode tsfresh     .predict_with_uncertainty
            (id + datetime)  (no relevance    → per-classifier +
                              filtering)        consensus probabilities

ForecastModel.predict(...) chains all three internally.


Trained Model Sources

PredictionModel(model=...) accepts six forms via ClassifierEnsemble.from_any():

Source                                                          Example
ClassifierEnsemble object                                       fm.TrainingModel.ClassifierEnsemble
SeedEnsemble object                                             a single-classifier ensemble (auto-wrapped)
Path to ClassifierEnsemble*.pkl                                 output/.../training/classifiers/ClassifierEnsemble_StratifiedShuffleSplit.pkl
Path to ClassifierEnsemble*.json                                the JSON registry written next to the .pkl
Path to a SeedEnsemble_*.pkl                                    bundle for a single classifier
Path to a trained-model registry .json (new) or .csv (legacy)   trained-model__RandomForestClassifier_...json

When called via fm.predict(...), the in-memory fm.ClassifierEnsemble is passed directly - no disk round-trip.
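The accepted sources can be told apart by object type, file-name prefix, and extension. The sketch below is only an illustration of that idea: describe_model_source is a hypothetical helper, its labels are invented for this example, and it is not the logic inside ClassifierEnsemble.from_any(), whose actual dispatch rules are not shown in this document.

```python
from pathlib import Path


def describe_model_source(model) -> str:
    """Label the kind of trained-model source (hypothetical helper, illustration only)."""
    # Assumption: anything that is not a path-like value is a live ensemble object.
    if not isinstance(model, (str, Path)):
        return "in-memory ensemble object"

    path = Path(model)
    name, suffix = path.name, path.suffix.lower()

    # Assumption: file kinds are recognised by the prefixes used in the table above.
    if name.startswith("ClassifierEnsemble") and suffix == ".pkl":
        return "ClassifierEnsemble pickle"
    if name.startswith("ClassifierEnsemble") and suffix == ".json":
        return "ClassifierEnsemble JSON registry"
    if name.startswith("SeedEnsemble_") and suffix == ".pkl":
        return "SeedEnsemble pickle"
    if suffix in {".json", ".csv"}:
        return "trained-model registry"
    raise ValueError(f"Unrecognised model source: {model!r}")


print(describe_model_source("classifiers/ClassifierEnsemble_StratifiedShuffleSplit.pkl"))
```

Checking the more specific prefixes first keeps a ClassifierEnsemble*.json registry from being mistaken for a generic trained-model registry.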


Build Forecast Grid (build_label)

Despite the name, prediction labels are placeholders - every is_erupted value is 0. The role of build_label() here is to lay out the window grid the model will score against.

Param              Type                   Notes
window_step        int                    Stride between consecutive forecast windows
window_step_unit   "minutes" | "hours"    A 10-minute step produces 144 forecasts/day

The grid is cached as {prediction_dir}/features/features-label_{basename}_step-{N}-{unit}.csv. On a re-run the cached grid is loaded unless overwrite=True.
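As a rough illustration of what such a grid looks like, the sketch below lays out evenly spaced windows with placeholder labels using only the standard library. build_forecast_grid and grid_cache_name are hypothetical names, the half-open interval (start included, end excluded) is an assumption, and the date-range basename is inferred from the file names used elsewhere on this page; the real build_label() may differ on all three points.

```python
from datetime import datetime, timedelta


def build_forecast_grid(start, end, window_step, window_step_unit):
    """Return one row per forecast window; every label is a placeholder 0."""
    if window_step_unit not in ("minutes", "hours"):
        raise ValueError("window_step_unit must be 'minutes' or 'hours'")
    if window_step <= 0:
        raise ValueError("window_step must be positive")

    step = timedelta(**{window_step_unit: window_step})
    rows, current, idx = [], start, 0
    # Assumption: the window [start, end) is half-open.
    while current < end:
        rows.append({"id": idx, "datetime": current, "is_erupted": 0})
        current += step
        idx += 1
    return rows


def grid_cache_name(basename, window_step, window_step_unit):
    """Format the cache file name following the pattern quoted above."""
    return f"features-label_{basename}_step-{window_step}-{window_step_unit}.csv"


one_day = build_forecast_grid(datetime(2025, 7, 27), datetime(2025, 7, 28), 10, "minutes")
print(len(one_day))  # 144 windows for one day at a 10-minute step
print(grid_cache_name("2025-07-27_2025-08-22", 10, "minutes"))
```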


Extract Features (extract_features)

Runs the same TremorMatrixBuilder → FeaturesBuilder chain as training, but in prediction mode - no tsfresh relevance filtering, because there are no labels to test relevance against.

select_tremor_columns, save_tremor_matrix_per_method, and exclude_features are mirrored from the training call so that the prediction feature columns line up with the trained model's input schema. ForecastModel.predict(...) forwards self.select_tremor_columns, self.save_tremor_matrix_per_method, and self.exclude_features from the upstream train() call automatically.
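To see why the mirrored arguments matter, a quick check of the prediction feature columns against the trained schema can be sketched as follows. compare_feature_schema is a hypothetical helper, and the feature names merely follow the tsfresh "{column}__{calculator}" naming pattern; this is not how the library itself validates columns.

```python
def compare_feature_schema(trained_columns, prediction_columns):
    """Return (missing, unexpected) feature names relative to the trained schema."""
    trained = set(trained_columns)
    predicted = set(prediction_columns)
    # Features the model expects but the prediction matrix lacks.
    missing = [c for c in trained_columns if c not in predicted]
    # Features extracted at prediction time that the model never saw.
    unexpected = [c for c in prediction_columns if c not in trained]
    return missing, unexpected


# Illustrative column names only (assumed, not taken from a real run).
trained = ["rsam_f2__mean", "rsam_f3__maximum"]
extracted = ["rsam_f2__mean", "entropy__median"]

missing, unexpected = compare_feature_schema(trained, extracted)
print(missing)     # ['rsam_f3__maximum']
print(unexpected)  # ['entropy__median']
```

A non-empty missing list means the prediction call selected different tremor columns or excluded different features than training did; extra columns are expected in prediction mode because no relevance filtering is applied.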


Forecast (forecast)

results = ClassifierEnsemble.predict_with_uncertainty(
    X=features_df,
    save=save_seed_result,             # write per-seed CSV under result_dir/{clf}/
    output_dir=result_dir,
    overwrite=overwrite,
)

The DataFrame returned by forecast() (also stored on self.results and fm.results) has one row per forecast window and the following columns:

Column                           Meaning
{clf}_eruption_probability       Mean P(eruption) across the classifier's seeds
{clf}_uncertainty                Std-dev across the classifier's seeds
{clf}_confidence                 1 - normalised_uncertainty
{clf}_prediction                 Binary prediction at the seed-averaged threshold
consensus_eruption_probability   Mean of {clf}_eruption_probability across classifiers
consensus_uncertainty            Pooled std across classifiers + seeds
consensus_confidence             1 - normalised consensus uncertainty
consensus_prediction             Binary prediction on the consensus mean

The result CSV is written at {station_dir}/forecast-results_{basename}.csv.
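The column definitions above can be sketched for a single forecast window as follows. This is a minimal illustration under stated assumptions, not the body of predict_with_uncertainty: summarise_window is a hypothetical function, uncertainty is taken as the population standard deviation, normalisation divides by 0.5 (the largest population std-dev possible for probabilities in [0, 1]), and a fixed 0.5 threshold is applied to the mean. The document does not specify the normalisation or how the threshold is derived.

```python
from statistics import fmean, pstdev

MAX_STD = 0.5  # assumption: normalise by the largest std-dev possible on [0, 1]


def summarise_window(seed_probs, threshold=0.5):
    """Build one result row from per-seed probabilities keyed by classifier name."""
    row, pooled, clf_means = {}, [], []
    for clf, probs in seed_probs.items():
        mean, std = fmean(probs), pstdev(probs)
        row[f"{clf}_eruption_probability"] = mean
        row[f"{clf}_uncertainty"] = std
        row[f"{clf}_confidence"] = 1.0 - std / MAX_STD
        row[f"{clf}_prediction"] = int(mean >= threshold)  # assumed fixed threshold
        clf_means.append(mean)
        pooled.extend(probs)

    # Consensus: mean of classifier means, std pooled over every seed of every classifier.
    consensus_std = pstdev(pooled)
    row["consensus_eruption_probability"] = fmean(clf_means)
    row["consensus_uncertainty"] = consensus_std
    row["consensus_confidence"] = 1.0 - consensus_std / MAX_STD
    row["consensus_prediction"] = int(row["consensus_eruption_probability"] >= threshold)
    return row


# Made-up probabilities for two classifiers with two seeds each.
row = summarise_window({"rf": [0.9, 0.7], "xgb": [0.5, 0.3]})
for name, value in row.items():
    print(f"{name}: {value:.3f}")
```

Pooling the seeds before taking the standard deviation lets disagreement between classifiers raise consensus_uncertainty even when each classifier's own seeds agree closely.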

Plotting

prediction/figures/forecast_{basename}.png       # always
prediction/figures/forecast_{basename}.pdf       # when plot_pdf=True (default)

forecast(...) param   Default   Effect
save_seed_result      True      Per-seed CSVs under prediction/results/{clf}/
plot_threshold        0.5       Horizontal threshold line on the forecast plot
plot_title            None      Optional title
plot_pdf              True      Also save a vector PDF
**plot_kwargs         -         Forwarded to eruption_forecast.plots.plot_forecast, e.g. eruption_dates=[...] to render eruption markers

Cache

PredictionModel inherits the cache layer from BaseModel. The cache identity includes:

  • NSLC (constructor param, threaded by ForecastModel.predict)
  • tremor DataFrame fingerprint
  • training_hash (constructor param) - the cache hash of the upstream TrainingModel
  • start_date, end_date, window_size
  • build_label kwargs (window_step, window_step_unit)
  • extract_features kwargs (select_tremor_columns, save_tremor_matrix_per_method, exclude_features)

Threading training_hash means re-training automatically invalidates the prediction cache. forecast() calls self.save(self.build_identity()); the pickle lands at {station_dir}/prediction/{hash}.PredictionModel.pkl + {hash}.PredictionModel.params.json.
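The invalidation behaviour follows from hashing the whole identity. The sketch below shows the general idea with a canonical JSON dump and SHA-256; identity_hash, the field names, the placeholder values, and the 16-character truncation are all assumptions made for illustration, since the hashing scheme BaseModel actually uses is not described here.

```python
import hashlib
import json


def identity_hash(identity: dict) -> str:
    """Hash a cache identity deterministically (illustrative scheme, not BaseModel's)."""
    # sort_keys makes the digest independent of key insertion order.
    payload = json.dumps(identity, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Placeholder identity mirroring the bullet list above; values are invented.
identity = {
    "nslc": "VG.OJN.00.EHZ",
    "tremor_fingerprint": "placeholder-fingerprint",
    "training_hash": "placeholder-training-hash-1",
    "start_date": "2025-07-27",
    "end_date": "2025-08-22",
    "window_size": 2,
    "build_label": {"window_step": 10, "window_step_unit": "minutes"},
    "extract_features": {
        "select_tremor_columns": ["rsam_f2", "rsam_f3"],
        "save_tremor_matrix_per_method": False,
        "exclude_features": None,
    },
}

# Re-training changes training_hash, so the prediction hash changes as well.
retrained = {**identity, "training_hash": "placeholder-training-hash-2"}

print(identity_hash(identity) == identity_hash(retrained))  # False
```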


Outputs

{station_dir}/
├── prediction/
│   ├── features/
│   │   ├── features-label_{basename}_step-{N}-{unit}.csv  # forecast grid
│   │   └── features-matrix_*.csv                           # tsfresh matrix
│   ├── results/{clf-slug}/{seed:05d}.csv                   # per-seed probability (save_seed_result=True)
│   ├── figures/forecast_{basename}.{png,pdf}               # forecast plot
│   └── {hash}.PredictionModel.pkl                          # content-addressable cache pickle (+ .params.json sidecar)
└── forecast-results_{basename}.csv                         # top-level results dump

fm.PredictionModel.forecast_plot_path exposes the path to the rendered plot - used by scenarios.py to attach the figure to a Telegram notification.
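For scripts that post-process a run, the layout above can be reconstructed with pathlib. prediction_paths is a hypothetical helper, and the classifier slug and seed in the example are made up; only the path patterns come from the tree above.

```python
from pathlib import Path


def prediction_paths(station_dir, basename, clf_slug, seed):
    """Assemble a few output paths following the documented layout (hypothetical helper)."""
    root = Path(station_dir)
    prediction = root / "prediction"
    return {
        # {seed:05d} zero-pads the seed to five digits, e.g. 42 -> 00042.
        "seed_result": prediction / "results" / clf_slug / f"{seed:05d}.csv",
        "plot_png": prediction / "figures" / f"forecast_{basename}.png",
        "results_csv": root / f"forecast-results_{basename}.csv",
    }


paths = prediction_paths(
    "output/VG.OJN.00.EHZ",
    "2025-07-27_2025-08-22",   # assumed basename format
    "example-classifier",      # made-up classifier slug
    42,
)
for label, path in paths.items():
    print(label, path.as_posix())
```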


Standalone Use

from eruption_forecast import PredictionModel

pm = (
    PredictionModel(
        model="output/VG.OJN.00.EHZ/training/classifiers/ClassifierEnsemble_StratifiedShuffleSplit.pkl",
        tremor_data="output/VG.OJN.00.EHZ/tremor/VG.OJN.00.EHZ_2025-01-01_2025-12-31.csv",
        start_date="2025-07-27",
        end_date="2025-08-22",
        window_size=2,                 # must match the trained model's window_size
        output_dir="output/VG.OJN.00.EHZ",
        n_jobs=4,
    )
    .build_label(window_step=10, window_step_unit="minutes")
    .extract_features(
        select_tremor_columns=["rsam_f2", "rsam_f3", "rsam_f4", "dsar_f3-f4", "entropy"],
    )
)

df_forecast = pm.forecast(
    plot_threshold=0.7,
    plot_pdf=True,
    eruption_dates=["2025-08-02", "2025-08-18"],   # forwarded to plot_forecast via **plot_kwargs
)

pm.save()                       # → {output_dir}/PredictionModel_{basename}.pkl
print(pm.forecast_plot_path)    # path to the saved plot
print(df_forecast.head())       # 10-minute resolution forecast

Reload a saved .pkl

pm = PredictionModel.load("output/VG.OJN.00.EHZ/PredictionModel_2025-07-27_2025-08-22.pkl")
df = pm.results

The reloaded pm.results is the same DataFrame returned by forecast() - no re-inference needed for downstream analysis.

Persist the prediction config

pm.save_config()   # → {prediction_dir}/prediction.config.yaml

forecast() calls save_config() automatically when it finishes, so a standalone prediction run always leaves a YAML snapshot at {output_dir}/prediction/prediction.config.yaml. The captured model and tremor_data fields preserve the path strings the user supplied (or null when a live ensemble or pre-loaded DataFrame was passed). See Configuration.