Skip to content

sjqtentacles/sml-pca

Repository files navigation

sml-pca

Principal Component Analysis in pure Standard ML — center a data matrix, diagonalize its covariance with the symmetric cyclic Jacobi eigenvalue algorithm (a fixed number of sweeps, a fixed tolerance, no data-driven stopping rule), and read off the principal axes, eigenvalues, explained variance, and projections. Built on top of sml-matrix for the dense linear algebra (covariance assembly, projection, reconstruction). No FFI, no external dependencies at runtime, and deterministic, byte-identically under both MLton and Poly/ML.

CI

Status

  • 59 assertions, green on MLton and Poly/ML.
  • Basis-library + vendored sml-matrix only; deterministic across compilers.
  • Vendors sml-matrix (Layout B) under lib/github.com/sjqtentacles/sml-matrix/, so the repo builds standalone.

Purity

No FFI, no IO inside the library, no wall-clock, no ambient randomness, and no threads. The Jacobi eigensolver runs a fixed number of sweeps at a fixed tolerance, so the same data always produces the same model across runs, machines, and compilers — the test suite and the demo are byte-identical under MLton and Poly/ML. Every output is real, so tests compare them through an explicit tolerance (Support.approxTol), never string or structural equality on reals.

Install

With smlpkg:

smlpkg add github.com/sjqtentacles/sml-pca
smlpkg sync

Include the MLB from your own (it pulls in the vendored sml-matrix):

local
  $(SML_LIB)/basis/basis.mlb
  lib/github.com/sjqtentacles/sml-pca/... (via smlpkg)
in
  ...
end

This brings structure Pca (and the vendored Matrix) into scope.

Quick start

(* samples lie along y = x -> one dominant axis *)
val data = [[0.0,0.0],[1.0,1.0],[2.0,2.0],[3.0,3.0],[4.0,4.0]]

val model = Pca.fit data {components = 2}

val m   = Pca.mean model                    (* [2.0, 2.0]                 *)
val ev  = Pca.eigenvalues model             (* [5.0, 0.0]  (descending)   *)
val cs  = Pca.components model               (* [[0.7071,0.7071],[...]]   *)
val r   = Pca.explainedVarianceRatio model   (* [1.0, 0.0]  (sums to ~1)  *)

(* project new/centered samples onto the components *)
val scores = Pca.transform model [[5.0, 5.0]]   (* one row of k scores *)

(* reconstruct from scores (exact round-trip when all components are kept) *)
val back = Pca.inverseTransform model scores

API (signature PCA)

type result
exception Pca of string

val fit                    : real list list -> {components : int} -> result
val mean                   : result -> real list           (* per-feature mean *)
val eigenvalues            : result -> real list           (* descending, >= 0 *)
val components             : result -> real list list       (* axes as rows    *)
val explainedVariance      : result -> real list
val explainedVarianceRatio : result -> real list           (* sums to ~1 (all) *)
val totalVariance          : result -> real
val transform              : result -> real list list -> real list list
val inverseTransform       : result -> real list list -> real list list
val jacobiEig              : real list list -> real list * real list list

fit data {components = k} takes a row-per-sample, column-per-feature data matrix and retains k components (1 <= k <= features). transform centers each input row with the model's mean and projects it onto the components; inverseTransform maps scores back to feature space (exact inverse of transform up to rounding when all components are kept, best rank-k reconstruction otherwise).

Conventions

  • Centering. The per-feature mean is subtracted before anything else; mean exposes the vector that was subtracted.
  • Covariance. The unbiased sample covariance C = (Xcᵀ Xc) / (n - 1) is used for n > 1 samples (population covariance, dividing by n = 1, for a single sample). C is assembled with the vendored Matrix (transpose then multiply).
  • Eigen-decomposition. C is symmetric and is diagonalized by the cyclic Jacobi algorithm: a fixed 100 sweeps at a fixed 1e-14 off-diagonal threshold. There is no data-dependent convergence test, so MLton and Poly/ML produce byte-identical eigenpairs.
  • Ordering. Eigenpairs are sorted by eigenvalue descending (stable insertion sort — no ListMergeSort). Tiny negative eigenvalues from rounding are clamped to 0.0.
  • Sign convention (deterministic). An eigenvector is only defined up to an overall sign. Each returned component is flipped, if necessary, so that its entry of largest absolute value is positive (ties broken by the earliest such index). This makes components reproducible across runs and compilers.
  • Explained-variance ratio. Each retained eigenvalue divided by the total variance (the sum over all eigenvalues / the covariance trace), so the full model's ratios sum to ~1 and a truncated model's sum to <= 1.

Build & test

make test        # MLton
make test-poly   # Poly/ML
make all-tests   # both
make example     # build + run examples/demo.sml
make clean

Both compilers run the same strict-TDD suite (59 assertions):

  • Jacobi eigensolver — closed-form spectrum of [[2,1],[1,2]] (eigenvalues 3 and 1; eigenvectors (1/√2, 1/√2) and (1/√2, -1/√2) after sign pinning), diagonal matrices left untouched, unit-length orthogonal eigenvectors, and the spectral identity A v = λ v on a 3×3, all to a 1e-9 tolerance;
  • fit — on data along y = x the dominant component is (1/√2, 1/√2), the eigenvalues are non-negative and descending (λ₀ = 5.0, λ₁ ≈ 0), the explained-variance ratio sums to ~1, and the components are orthonormal;
  • transform — score shape, an exact transforminverseTransform round-trip when all components are kept, decorrelated projected scores (cross-covariance ≈ 0, i.e. a diagonal covariance), and score-column variances equal to the eigenvalues;
  • validation — empty/ragged/out-of-range inputs raise, non-square and non-symmetric matrices raise, and the sign convention is reproducible across repeated fits.

Vendoring

This library depends on sml-matrix, whose sources are vendored verbatim under lib/github.com/sjqtentacles/sml-matrix/ (matrix.sig, matrix.sml, and the sources.mlb/sml-matrix.mlb). src/pca.mlb references that sources.mlb first, then pca.sig/pca.sml; the Poly/ML use-chain loads the vendored signature and structure first, in dependency order. sml.pkg records the dependency in its require block so smlpkg sync can refresh it. The covariance assembly, projection, and reconstruction all call into Matrix, so the dependency is exercised on the compute path, not merely linked.

Example

make example fits PCA on ten 2-D samples and prints the mean, the principal components (sign-pinned), the eigenvalues and explained-variance ratio, a few projected scores, and a rank-1 reconstruction (output is byte-identical under MLton and Poly/ML):

=== sml-pca demo ==============================================

Fitted PCA on 10 samples x 2 features (built on sml-matrix).

Per-feature mean (centering): [1.8100, 1.9100]

Principal components (rows; largest-magnitude entry forced positive):
  pc1 = [0.6779, 0.7352]
  pc2 = [0.7352, -0.6779]

Eigenvalues (variance per axis, descending): [1.2840, 0.0491]
Explained variance ratio: [0.9632, 0.0368]
Total variance: 1.3331

Projected scores (first 3 samples onto both components):
  sample 0 -> [0.8280, 0.1751]
  sample 1 -> [-1.7776, -0.1429]
  sample 2 -> [0.9922, -0.3844]

Rank-1 reconstruction (keep only pc1):
  sample 0: [2.5000, 2.4000] -> [2.3713, 2.5187]
  sample 1: [0.5000, 0.7000] -> [0.6050, 0.6032]
  sample 2: [2.2000, 2.9000] -> [2.4826, 2.6394]

pc1 alone explains 0.9632 of the total variance.

===============================================================

Poly/ML note

CI builds Poly/ML 5.9.1 from source rather than using the Ubuntu package (Poly/ML 5.7.1), whose X86 code generator crashes (asGenReg raised while compiling) on some code. See .github/workflows/ci.yml.

License

MIT — see LICENSE.

About

Principal Component Analysis in pure Standard ML via Jacobi eigen-decomposition: components, explained variance, transform. MLton + Poly/ML.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors