Principal Component Analysis in pure Standard ML — center a data matrix,
diagonalize its covariance with the symmetric cyclic Jacobi eigenvalue
algorithm (a fixed number of sweeps, a fixed tolerance, no data-driven
stopping rule), and read off the principal axes, eigenvalues, explained
variance, and projections. Built on top of
sml-matrix for the dense linear
algebra (covariance assembly, projection, reconstruction). No FFI, no external
dependencies at runtime, and deterministic, byte-identically under both
MLton and Poly/ML.
- 59 assertions, green on MLton and Poly/ML.
- Basis-library + vendored
sml-matrixonly; deterministic across compilers. - Vendors
sml-matrix(Layout B) underlib/github.com/sjqtentacles/sml-matrix/, so the repo builds standalone.
No FFI, no IO inside the library, no wall-clock, no ambient randomness, and no
threads. The Jacobi eigensolver runs a fixed number of sweeps at a fixed
tolerance, so the same data always produces the same model across runs,
machines, and compilers — the test suite and the demo are byte-identical under
MLton and Poly/ML. Every output is real, so tests compare them through an
explicit tolerance (Support.approxTol), never string or structural
equality on reals.
With smlpkg:
smlpkg add github.com/sjqtentacles/sml-pca
smlpkg sync
Include the MLB from your own (it pulls in the vendored sml-matrix):
local
$(SML_LIB)/basis/basis.mlb
lib/github.com/sjqtentacles/sml-pca/... (via smlpkg)
in
...
end
This brings structure Pca (and the vendored Matrix) into scope.
(* samples lie along y = x -> one dominant axis *)
val data = [[0.0,0.0],[1.0,1.0],[2.0,2.0],[3.0,3.0],[4.0,4.0]]
val model = Pca.fit data {components = 2}
val m = Pca.mean model (* [2.0, 2.0] *)
val ev = Pca.eigenvalues model (* [5.0, 0.0] (descending) *)
val cs = Pca.components model (* [[0.7071,0.7071],[...]] *)
val r = Pca.explainedVarianceRatio model (* [1.0, 0.0] (sums to ~1) *)
(* project new/centered samples onto the components *)
val scores = Pca.transform model [[5.0, 5.0]] (* one row of k scores *)
(* reconstruct from scores (exact round-trip when all components are kept) *)
val back = Pca.inverseTransform model scorestype result
exception Pca of string
val fit : real list list -> {components : int} -> result
val mean : result -> real list (* per-feature mean *)
val eigenvalues : result -> real list (* descending, >= 0 *)
val components : result -> real list list (* axes as rows *)
val explainedVariance : result -> real list
val explainedVarianceRatio : result -> real list (* sums to ~1 (all) *)
val totalVariance : result -> real
val transform : result -> real list list -> real list list
val inverseTransform : result -> real list list -> real list list
val jacobiEig : real list list -> real list * real list listfit data {components = k} takes a row-per-sample, column-per-feature data
matrix and retains k components (1 <= k <= features). transform centers
each input row with the model's mean and projects it onto the components;
inverseTransform maps scores back to feature space (exact inverse of
transform up to rounding when all components are kept, best rank-k
reconstruction otherwise).
- Centering. The per-feature mean is subtracted before anything else;
meanexposes the vector that was subtracted. - Covariance. The unbiased sample covariance
C = (Xcᵀ Xc) / (n - 1)is used forn > 1samples (population covariance, dividing byn = 1, for a single sample).Cis assembled with the vendoredMatrix(transpose then multiply). - Eigen-decomposition.
Cis symmetric and is diagonalized by the cyclic Jacobi algorithm: a fixed 100 sweeps at a fixed1e-14off-diagonal threshold. There is no data-dependent convergence test, so MLton and Poly/ML produce byte-identical eigenpairs. - Ordering. Eigenpairs are sorted by eigenvalue descending (stable
insertion sort — no
ListMergeSort). Tiny negative eigenvalues from rounding are clamped to0.0. - Sign convention (deterministic). An eigenvector is only defined up to an
overall sign. Each returned component is flipped, if necessary, so that its
entry of largest absolute value is positive (ties broken by the earliest
such index). This makes
componentsreproducible across runs and compilers. - Explained-variance ratio. Each retained eigenvalue divided by the total
variance (the sum over all eigenvalues / the covariance trace), so the full
model's ratios sum to ~1 and a truncated model's sum to
<= 1.
make test # MLton
make test-poly # Poly/ML
make all-tests # both
make example # build + run examples/demo.sml
make clean
Both compilers run the same strict-TDD suite (59 assertions):
- Jacobi eigensolver — closed-form spectrum of
[[2,1],[1,2]](eigenvalues 3 and 1; eigenvectors(1/√2, 1/√2)and(1/√2, -1/√2)after sign pinning), diagonal matrices left untouched, unit-length orthogonal eigenvectors, and the spectral identityA v = λ von a 3×3, all to a1e-9tolerance; - fit — on data along
y = xthe dominant component is(1/√2, 1/√2), the eigenvalues are non-negative and descending (λ₀ = 5.0,λ₁ ≈ 0), the explained-variance ratio sums to ~1, and the components are orthonormal; - transform — score shape, an exact
transform→inverseTransformround-trip when all components are kept, decorrelated projected scores (cross-covariance ≈ 0, i.e. a diagonal covariance), and score-column variances equal to the eigenvalues; - validation — empty/ragged/out-of-range inputs raise, non-square and non-symmetric matrices raise, and the sign convention is reproducible across repeated fits.
This library depends on
sml-matrix, whose sources are
vendored verbatim under lib/github.com/sjqtentacles/sml-matrix/ (matrix.sig,
matrix.sml, and the sources.mlb/sml-matrix.mlb). src/pca.mlb references
that sources.mlb first, then pca.sig/pca.sml; the Poly/ML use-chain
loads the vendored signature and structure first, in dependency order. sml.pkg
records the dependency in its require block so smlpkg sync can refresh it.
The covariance assembly, projection, and reconstruction all call into Matrix,
so the dependency is exercised on the compute path, not merely linked.
make example fits PCA on ten 2-D samples and prints the mean, the principal
components (sign-pinned), the eigenvalues and explained-variance ratio, a few
projected scores, and a rank-1 reconstruction (output is byte-identical under
MLton and Poly/ML):
=== sml-pca demo ==============================================
Fitted PCA on 10 samples x 2 features (built on sml-matrix).
Per-feature mean (centering): [1.8100, 1.9100]
Principal components (rows; largest-magnitude entry forced positive):
pc1 = [0.6779, 0.7352]
pc2 = [0.7352, -0.6779]
Eigenvalues (variance per axis, descending): [1.2840, 0.0491]
Explained variance ratio: [0.9632, 0.0368]
Total variance: 1.3331
Projected scores (first 3 samples onto both components):
sample 0 -> [0.8280, 0.1751]
sample 1 -> [-1.7776, -0.1429]
sample 2 -> [0.9922, -0.3844]
Rank-1 reconstruction (keep only pc1):
sample 0: [2.5000, 2.4000] -> [2.3713, 2.5187]
sample 1: [0.5000, 0.7000] -> [0.6050, 0.6032]
sample 2: [2.2000, 2.9000] -> [2.4826, 2.6394]
pc1 alone explains 0.9632 of the total variance.
===============================================================
CI builds Poly/ML 5.9.1 from source rather than using the Ubuntu package
(Poly/ML 5.7.1), whose X86 code generator crashes (asGenReg raised while compiling) on some code. See .github/workflows/ci.yml.
MIT — see LICENSE.