sml-pca

Principal Component Analysis in pure Standard ML — center a data matrix, diagonalize its covariance with the symmetric cyclic Jacobi eigenvalue algorithm (a fixed number of sweeps, a fixed tolerance, no data-driven stopping rule), and read off the principal axes, eigenvalues, explained variance, and projections. Built on top of sml-matrix for the dense linear algebra (covariance assembly, projection, reconstruction). No FFI, no external dependencies at runtime, and deterministic, byte-identically under both MLton and Poly/ML.

Status

59 assertions, green on MLton and Poly/ML.
Basis-library + vendored sml-matrix only; deterministic across compilers.
Vendors sml-matrix (Layout B) under lib/github.com/sjqtentacles/sml-matrix/, so the repo builds standalone.

Purity

No FFI, no IO inside the library, no wall-clock, no ambient randomness, and no threads. The Jacobi eigensolver runs a fixed number of sweeps at a fixed tolerance, so the same data always produces the same model across runs, machines, and compilers — the test suite and the demo are byte-identical under MLton and Poly/ML. Every output is real, so tests compare them through an explicit tolerance (Support.approxTol), never string or structural equality on reals.

Install

With smlpkg:

smlpkg add github.com/sjqtentacles/sml-pca
smlpkg sync

Include the MLB from your own (it pulls in the vendored sml-matrix):

local
  $(SML_LIB)/basis/basis.mlb
  lib/github.com/sjqtentacles/sml-pca/... (via smlpkg)
in
  ...
end

This brings structure Pca (and the vendored Matrix) into scope.

Quick start

(* samples lie along y = x -> one dominant axis *)
val data = [[0.0,0.0],[1.0,1.0],[2.0,2.0],[3.0,3.0],[4.0,4.0]]

val model = Pca.fit data {components = 2}

val m   = Pca.mean model                    (* [2.0, 2.0]                 *)
val ev  = Pca.eigenvalues model             (* [5.0, 0.0]  (descending)   *)
val cs  = Pca.components model               (* [[0.7071,0.7071],[...]]   *)
val r   = Pca.explainedVarianceRatio model   (* [1.0, 0.0]  (sums to ~1)  *)

(* project new/centered samples onto the components *)
val scores = Pca.transform model [[5.0, 5.0]]   (* one row of k scores *)

(* reconstruct from scores (exact round-trip when all components are kept) *)
val back = Pca.inverseTransform model scores

API (`signature PCA`)

type result
exception Pca of string

val fit                    : real list list -> {components : int} -> result
val mean                   : result -> real list           (* per-feature mean *)
val eigenvalues            : result -> real list           (* descending, >= 0 *)
val components             : result -> real list list       (* axes as rows    *)
val explainedVariance      : result -> real list
val explainedVarianceRatio : result -> real list           (* sums to ~1 (all) *)
val totalVariance          : result -> real
val transform              : result -> real list list -> real list list
val inverseTransform       : result -> real list list -> real list list
val jacobiEig              : real list list -> real list * real list list

fit data {components = k} takes a row-per-sample, column-per-feature data matrix and retains k components (1 <= k <= features). transform centers each input row with the model's mean and projects it onto the components; inverseTransform maps scores back to feature space (exact inverse of transform up to rounding when all components are kept, best rank-k reconstruction otherwise).

Conventions

Centering. The per-feature mean is subtracted before anything else; mean exposes the vector that was subtracted.
Covariance. The unbiased sample covariance C = (Xcᵀ Xc) / (n - 1) is used for n > 1 samples (population covariance, dividing by n = 1, for a single sample). C is assembled with the vendored Matrix (transpose then multiply).
Eigen-decomposition. C is symmetric and is diagonalized by the cyclic Jacobi algorithm: a fixed 100 sweeps at a fixed 1e-14 off-diagonal threshold. There is no data-dependent convergence test, so MLton and Poly/ML produce byte-identical eigenpairs.
Ordering. Eigenpairs are sorted by eigenvalue descending (stable insertion sort — no ListMergeSort). Tiny negative eigenvalues from rounding are clamped to 0.0.
Sign convention (deterministic). An eigenvector is only defined up to an overall sign. Each returned component is flipped, if necessary, so that its entry of largest absolute value is positive (ties broken by the earliest such index). This makes components reproducible across runs and compilers.
Explained-variance ratio. Each retained eigenvalue divided by the total variance (the sum over all eigenvalues / the covariance trace), so the full model's ratios sum to ~1 and a truncated model's sum to <= 1.

Build & test

make test        # MLton
make test-poly   # Poly/ML
make all-tests   # both
make example     # build + run examples/demo.sml
make clean

Both compilers run the same strict-TDD suite (59 assertions):

Jacobi eigensolver — closed-form spectrum of [[2,1],[1,2]] (eigenvalues 3 and 1; eigenvectors (1/√2, 1/√2) and (1/√2, -1/√2) after sign pinning), diagonal matrices left untouched, unit-length orthogonal eigenvectors, and the spectral identity A v = λ v on a 3×3, all to a 1e-9 tolerance;
fit — on data along y = x the dominant component is (1/√2, 1/√2), the eigenvalues are non-negative and descending (λ₀ = 5.0, λ₁ ≈ 0), the explained-variance ratio sums to ~1, and the components are orthonormal;
transform — score shape, an exact transform→inverseTransform round-trip when all components are kept, decorrelated projected scores (cross-covariance ≈ 0, i.e. a diagonal covariance), and score-column variances equal to the eigenvalues;
validation — empty/ragged/out-of-range inputs raise, non-square and non-symmetric matrices raise, and the sign convention is reproducible across repeated fits.

Vendoring

This library depends on sml-matrix, whose sources are vendored verbatim under lib/github.com/sjqtentacles/sml-matrix/ (matrix.sig, matrix.sml, and the sources.mlb/sml-matrix.mlb). src/pca.mlb references that sources.mlb first, then pca.sig/pca.sml; the Poly/ML use-chain loads the vendored signature and structure first, in dependency order. sml.pkg records the dependency in its require block so smlpkg sync can refresh it. The covariance assembly, projection, and reconstruction all call into Matrix, so the dependency is exercised on the compute path, not merely linked.

Example

make example fits PCA on ten 2-D samples and prints the mean, the principal components (sign-pinned), the eigenvalues and explained-variance ratio, a few projected scores, and a rank-1 reconstruction (output is byte-identical under MLton and Poly/ML):

=== sml-pca demo ==============================================

Fitted PCA on 10 samples x 2 features (built on sml-matrix).

Per-feature mean (centering): [1.8100, 1.9100]

Principal components (rows; largest-magnitude entry forced positive):
  pc1 = [0.6779, 0.7352]
  pc2 = [0.7352, -0.6779]

Eigenvalues (variance per axis, descending): [1.2840, 0.0491]
Explained variance ratio: [0.9632, 0.0368]
Total variance: 1.3331

Projected scores (first 3 samples onto both components):
  sample 0 -> [0.8280, 0.1751]
  sample 1 -> [-1.7776, -0.1429]
  sample 2 -> [0.9922, -0.3844]

Rank-1 reconstruction (keep only pc1):
  sample 0: [2.5000, 2.4000] -> [2.3713, 2.5187]
  sample 1: [0.5000, 0.7000] -> [0.6050, 0.6032]
  sample 2: [2.2000, 2.9000] -> [2.4826, 2.6394]

pc1 alone explains 0.9632 of the total variance.

===============================================================

Poly/ML note

CI builds Poly/ML 5.9.1 from source rather than using the Ubuntu package (Poly/ML 5.7.1), whose X86 code generator crashes (asGenReg raised while compiling) on some code. See .github/workflows/ci.yml.

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sml-pca

Status

Purity

Install

Quick start

API (`signature PCA`)

Conventions

Build & test

Vendoring

Example

Poly/ML note

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
examples		examples
lib/github.com/sjqtentacles/sml-matrix		lib/github.com/sjqtentacles/sml-matrix
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
sml.pkg		sml.pkg

Folders and files

Latest commit

History

Repository files navigation

sml-pca

Status

Purity

Install

Quick start

API (signature PCA)

Conventions

Build & test

Vendoring

Example

Poly/ML note

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

API (`signature PCA`)

Packages