Orthogonal Projections to Latent Structures (OPLS / OPLS-DA) with a scikit-learn interface.
OPLS (Trygg & Wold, 2002) splits the variation in X into a predictive part
correlated with the response and orthogonal parts that are not. scikit-opls
removes the orthogonal variation with an OSC-style orthogonal filter and then fits
sklearn.cross_decomposition.PLSRegression
on the cleaned X as the predictive engine. With n_orthogonal=0 the model
reduces exactly to PLSRegression.
uv syncimport numpy as np
from scikit_opls import OPLS
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)
model = OPLS(n_components=1, n_orthogonal=2, scale="standard").fit(X, y)
model.predict(X) # predictions
model.transform(X) # predictive scores
model.transform_orthogonal(X) # orthogonal scores
model.filter_transform(X) # preprocessed, orthogonal-filtered X fed to the engine
model.r2x_, model.r2y_ # fit summaries
model.vip_ # variable importance (predictive), lazy propertyThe whole fitted pipeline (scaling → orthogonal filter → predictive PLS) is linear, so it collapses to coefficients on the raw input space:
y_hat = (X @ model.coef_raw_.T + model.intercept_raw_).ravel() # == model.predict(X)Let cross-validated Q2 choose the number of orthogonal components with
scikit-learn's GridSearchCV — no bespoke estimator needed (scoring=None gives
out-of-fold R2, which equals Q2 for OPLS):
from sklearn.model_selection import GridSearchCV
from scikit_opls import OPLS
search = GridSearchCV(
OPLS(n_components=1), {"n_orthogonal": list(range(10))}, cv=7
).fit(X, y)
search.best_params_["n_orthogonal"] # chosen count
search.best_estimator_ # final OPLS refit on all data
search.cv_results_["mean_test_score"] # out-of-fold R2/Q2 pathFor OPLS-DA, wrap OPLSDA() the same way; an int cv becomes stratified
automatically and scoring="roc_auc" is usually preferable.
To bias toward fewer orthogonal components — prefer the smallest count whose mean
score is within a tolerance of the best — pass a refit callable:
import numpy as np
def parsimonious_refit(cv_results, tol=0.01):
scores = np.asarray(cv_results["mean_test_score"], dtype=float)
counts = np.asarray(cv_results["param_n_orthogonal"], dtype=int)
within = np.flatnonzero(scores >= np.nanmax(scores) - tol)
return int(within[np.argmin(counts[within])])
GridSearchCV(
OPLS(n_components=1), {"n_orthogonal": list(range(10))},
cv=7, refit=parsimonious_refit,
).fit(X, y)from scikit_opls import OPLSDA
y = np.where(X[:, 0] > 0, "case", "ctrl")
clf = OPLSDA(n_components=1, n_orthogonal=2).fit(X, y)
clf.predict(X) # class labels
clf.decision_function(X) # raw signed OPLS regression output
clf.opls_.transform(X) # predictive scores of the underlying OPLS model
# Probabilities: wrap in a cross-fitted calibrator when each class has enough
# samples for the chosen calibration CV split.
from sklearn.calibration import CalibratedClassifierCV
CalibratedClassifierCV(clf, cv=5).fit(X, y).predict_proba(X)Plotting needs the optional plot extra (pip install scikit-opls[plot]); it
follows scikit-learn's Display convention.
from scikit_opls.plotting import OPLSScoresDisplay, SPlotDisplay
from scikit_opls.validation import permutation_test
# Draw score plot (t_pred vs t_ortho). Supports component selection for multi-component PLS
OPLSScoresDisplay.from_estimator(
model, X, y, predictive_component=0, orthogonal_component=0
)
# Draw S-plot (covariance vs correlation) for a specific predictive component
SPlotDisplay.from_estimator(model, X, component=0)
# Permutation significance testing
permutation_test(OPLS(n_orthogonal=2), X, y)Note
Pipeline support in plotting: Diagnostic plotting displays support OPLS,
OPLSDA, pipelines ending in one, and fitted search meta-estimators exposing
best_estimator_ around either shape. When passing a pipeline, pass raw X as
expected by the pipeline. When passing the final OPLS step directly, pass the
already transformed matrix. For pipeline S-plots, points are in the transformed
feature space received by the final OPLS step.
Two small scripts under examples/ show usage with CSV data hosted as GitHub
release assets. The examples read these URLs directly with pandas.read_csv, so
the datasets do not need to be stored in the local checkout:
https://github.com/HauserGroup/scikit-opls/releases/download/data/colorectal_cancer_nmr.csvhttps://github.com/HauserGroup/scikit-opls/releases/download/data/palmerpenguins.csv
uv run python examples/colorectal_cancer_nmr_oplsda.py
uv run python examples/palmerpenguins_opls_regression.py| Parameter | Meaning |
|---|---|
n_components |
Predictive components (classic OPLS uses 1). |
n_orthogonal |
Orthogonal components to remove (int; tune via GridSearchCV). |
scale |
"none", "center", "pareto", "standard". |
Wrap OPLS in GridSearchCV over n_orthogonal for cross-validated selection
(see the snippet above); cv, scoring and n_jobs come from GridSearchCV.
uv sync --dev # install the project and dev tools
uv run pre-commit install # enable the git hooks (run once)
uv run pytest --cov # tests + coverage (incl. sklearn check_estimator)
uv run ruff check # lint
uv run ruff format --check # format check
uv run pyright src # type-check
uv run pre-commit run --all-files # run every hookSee CONTRIBUTING.md for the full contributor workflow.
This project is inspired by the R
ropls package
and uses the orthogonal-scores PLS algorithm of
pls::oscorespls.fit as its engine.
- Trygg, J. & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119–128.
- Galindo-Prieto, B., Eriksson, L. & Trygg, J. (2014). Variable influence on projection (VIP) for OPLS models. Journal of Chemometrics, 28(8), 623–632.