Skip to content

HauserGroup/scikit-opls

Repository files navigation

scikit-opls

License: MIT CI Documentation Code style: ruff uv Orcid: Jakob

Orthogonal Projections to Latent Structures (OPLS / OPLS-DA) with a scikit-learn interface.

OPLS (Trygg & Wold, 2002) splits the variation in X into a predictive part correlated with the response and orthogonal parts that are not. scikit-opls removes the orthogonal variation with an OSC-style orthogonal filter and then fits sklearn.cross_decomposition.PLSRegression on the cleaned X as the predictive engine. With n_orthogonal=0 the model reduces exactly to PLSRegression.

Install

uv sync

Usage

OPLS regression

import numpy as np
from scikit_opls import OPLS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

model = OPLS(n_components=1, n_orthogonal=2, scale="standard").fit(X, y)

model.predict(X)              # predictions
model.transform(X)            # predictive scores
model.transform_orthogonal(X) # orthogonal scores
model.filter_transform(X)     # preprocessed, orthogonal-filtered X fed to the engine
model.r2x_, model.r2y_        # fit summaries
model.vip_                    # variable importance (predictive), lazy property

The whole fitted pipeline (scaling → orthogonal filter → predictive PLS) is linear, so it collapses to coefficients on the raw input space:

y_hat = (X @ model.coef_raw_.T + model.intercept_raw_).ravel()  # == model.predict(X)

Let cross-validated Q2 choose the number of orthogonal components with scikit-learn's GridSearchCV — no bespoke estimator needed (scoring=None gives out-of-fold R2, which equals Q2 for OPLS):

from sklearn.model_selection import GridSearchCV
from scikit_opls import OPLS

search = GridSearchCV(
    OPLS(n_components=1), {"n_orthogonal": list(range(10))}, cv=7
).fit(X, y)
search.best_params_["n_orthogonal"]       # chosen count
search.best_estimator_                    # final OPLS refit on all data
search.cv_results_["mean_test_score"]     # out-of-fold R2/Q2 path

For OPLS-DA, wrap OPLSDA() the same way; an int cv becomes stratified automatically and scoring="roc_auc" is usually preferable.

To bias toward fewer orthogonal components — prefer the smallest count whose mean score is within a tolerance of the best — pass a refit callable:

import numpy as np

def parsimonious_refit(cv_results, tol=0.01):
    scores = np.asarray(cv_results["mean_test_score"], dtype=float)
    counts = np.asarray(cv_results["param_n_orthogonal"], dtype=int)
    within = np.flatnonzero(scores >= np.nanmax(scores) - tol)
    return int(within[np.argmin(counts[within])])

GridSearchCV(
    OPLS(n_components=1), {"n_orthogonal": list(range(10))},
    cv=7, refit=parsimonious_refit,
).fit(X, y)

OPLS-DA (binary classification)

from scikit_opls import OPLSDA

y = np.where(X[:, 0] > 0, "case", "ctrl")
clf = OPLSDA(n_components=1, n_orthogonal=2).fit(X, y)

clf.predict(X)            # class labels
clf.decision_function(X)  # raw signed OPLS regression output
clf.opls_.transform(X)    # predictive scores of the underlying OPLS model

# Probabilities: wrap in a cross-fitted calibrator when each class has enough
# samples for the chosen calibration CV split.
from sklearn.calibration import CalibratedClassifierCV
CalibratedClassifierCV(clf, cv=5).fit(X, y).predict_proba(X)

Diagnostics

Plotting needs the optional plot extra (pip install scikit-opls[plot]); it follows scikit-learn's Display convention.

from scikit_opls.plotting import OPLSScoresDisplay, SPlotDisplay
from scikit_opls.validation import permutation_test

# Draw score plot (t_pred vs t_ortho). Supports component selection for multi-component PLS
OPLSScoresDisplay.from_estimator(
    model, X, y, predictive_component=0, orthogonal_component=0
)

# Draw S-plot (covariance vs correlation) for a specific predictive component
SPlotDisplay.from_estimator(model, X, component=0)

# Permutation significance testing
permutation_test(OPLS(n_orthogonal=2), X, y)

Note

Pipeline support in plotting: Diagnostic plotting displays support OPLS, OPLSDA, pipelines ending in one, and fitted search meta-estimators exposing best_estimator_ around either shape. When passing a pipeline, pass raw X as expected by the pipeline. When passing the final OPLS step directly, pass the already transformed matrix. For pipeline S-plots, points are in the transformed feature space received by the final OPLS step.

Example datasets

Two small scripts under examples/ show usage with CSV data hosted as GitHub release assets. The examples read these URLs directly with pandas.read_csv, so the datasets do not need to be stored in the local checkout:

  • https://github.com/HauserGroup/scikit-opls/releases/download/data/colorectal_cancer_nmr.csv
  • https://github.com/HauserGroup/scikit-opls/releases/download/data/palmerpenguins.csv
uv run python examples/colorectal_cancer_nmr_oplsda.py
uv run python examples/palmerpenguins_opls_regression.py

Parameters

Parameter Meaning
n_components Predictive components (classic OPLS uses 1).
n_orthogonal Orthogonal components to remove (int; tune via GridSearchCV).
scale "none", "center", "pareto", "standard".

Wrap OPLS in GridSearchCV over n_orthogonal for cross-validated selection (see the snippet above); cv, scoring and n_jobs come from GridSearchCV.

Development

uv sync --dev              # install the project and dev tools
uv run pre-commit install  # enable the git hooks (run once)

uv run pytest --cov        # tests + coverage (incl. sklearn check_estimator)
uv run ruff check          # lint
uv run ruff format --check # format check
uv run pyright src         # type-check
uv run pre-commit run --all-files  # run every hook

See CONTRIBUTING.md for the full contributor workflow.

References

This project is inspired by the R ropls package and uses the orthogonal-scores PLS algorithm of pls::oscorespls.fit as its engine.

  • Trygg, J. & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119–128.
  • Galindo-Prieto, B., Eriksson, L. & Trygg, J. (2014). Variable influence on projection (VIP) for OPLS models. Journal of Chemometrics, 28(8), 623–632.

Packages

 
 
 

Contributors

Languages