Okapi BM25 and the TF-IDF building blocks it is made of, in pure Standard ML —
inverse document frequency (idf), per-document term frequency and length,
average document length, and a deterministic score that ranks a corpus
against a free-text query — built on top of
sml-invertedindex for the
underlying postings. No FFI, no external dependencies at runtime, and
deterministic, byte-identically under both MLton and
Poly/ML.
- 44 assertions, green on MLton and Poly/ML.
- Basis-library + vendored
sml-invertedindexonly; deterministic across compilers. - Vendors
sml-invertedindex(Layout B) underlib/github.com/sjqtentacles/sml-invertedindex/, so the repo builds standalone.
No FFI, no IO inside the library, no wall-clock, no ambient randomness, and no
threads. The same sequence of addDoc calls plus the same query always produce
the same ranked answer across runs, machines, and compilers — the test suite
and the demo are byte-identical under MLton and Poly/ML. All scores are real,
so tests compare them through an explicit tolerance (Support.approx), never
string or structural equality.
With smlpkg:
smlpkg add github.com/sjqtentacles/sml-bm25
smlpkg sync
Include the MLB from your own (it pulls in the vendored sml-invertedindex):
local
$(SML_LIB)/basis/basis.mlb
lib/github.com/sjqtentacles/sml-bm25/... (via smlpkg)
in
...
end
This brings structure Bm25 (and the vendored InvertedIndex) into scope.
(* build a corpus by adding documents as ordered token lists *)
val corpus =
let
val i0 = Bm25.addDoc Bm25.empty 0 ["the", "cat", "sat", "on", "the", "mat"]
val i1 = Bm25.addDoc i0 1 ["the", "dog", "sat"]
val i2 = Bm25.addDoc i1 2 ["the", "cat", "cat", "ran"]
in i2 end
(* TF-IDF building blocks *)
val n = Bm25.numDocs corpus (* 3 *)
val avg = Bm25.avgDocLength corpus (* 13/3 *)
val df = Bm25.docFreq corpus "cat" (* 2 *)
val w = Bm25.idf corpus "cat" (* ~0.4700 *)
(* rank the corpus against a query: (docid, score) sorted by score desc, ties
broken by ascending docid *)
val ranked = Bm25.score corpus { k1 = 1.2, b = 0.75 } ["cat", "sat"]
(* [(0, 0.8122), (2, 0.6605), (1, 0.5377)] *)type index
type params = { k1 : real, b : real }
val empty : index
val addDoc : index -> int -> string list -> index (* last write wins *)
val numDocs : index -> int
val docLength : index -> int -> int
val avgDocLength : index -> real
val docFreq : index -> string -> int (* n(t) *)
val idf : index -> string -> real (* BM25 idf weight *)
val score : index -> params -> string list -> (int * real) listscore idx params terms returns (docid, score) pairs sorted by score
descending, with ties broken by ascending docid for a total,
deterministic order. Documents that match no query term are omitted; duplicate
query terms are counted once.
-
BM25 formula. For a document
dand query termtwith term frequencyf(t,d)and length|d|,contribution(t,d) = idf(t) * ( f(t,d) * (k1 + 1) ) / ( f(t,d) + k1 * (1 - b + b * |d|/avgdl) )and
scoresums the contributions over the distinct query terms. -
idf.
idf(t) = ln( (N - n(t) + 0.5) / (n(t) + 0.5) + 1 ), withNthe number of documents andn(t)the document frequency. This form is nonnegative and strictly decreasing inn(t)— rarer terms weigh more. -
Parameters.
k1controls term-frequency saturation;bcontrols length normalization (b = 0disables it,b = 1applies it fully). -
Document model. The corpus is held in the vendored
InvertedIndex(for document frequencies and the candidate set) plus a per-document token list (for term frequencies and lengths); re-adding a docid replaces it.
make test # MLton
make test-poly # Poly/ML
make all-tests # both
make example # build + run examples/demo.sml
make clean
Both compilers run the same strict-TDD suite (44 assertions), pinned to hand-computed reference values on a tiny 3-document corpus:
- closed-form BM25 scores for the query
["cat","sat"](k1 = 1.2,b = 0.75), compared with a1e-9tolerance (never string equality on reals); - idf monotonicity —
idf(n=1) > idf(n=2) > idf(n=3), and equal-df terms share an idf; - length normalization — at equal term frequency the longer document
scores strictly lower, and
b = 0makes the two scores exactly equal; - deterministic tie-breaking — content-identical documents added out of docid order rank in ascending docid order, while a higher-scoring document still sorts first.
This library depends on
sml-invertedindex, whose
sources are vendored verbatim under
lib/github.com/sjqtentacles/sml-invertedindex/ (invertedindex.sig,
invertedindex.sml, and a sources.mlb). src/bm25.mlb references that
sources.mlb first, then bm25.sig/bm25.sml; the Poly/ML use-chain loads
the vendored signature and structure first, in dependency order. sml.pkg
records the dependency in its require block so smlpkg sync can refresh it.
make example indexes five short documents and ranks them against a few
queries (output is byte-identical under MLton and Poly/ML):
=== sml-bm25 demo =============================================
Indexed 5 documents on top of sml-invertedindex.
avg document length = 8.6000 tokens
k1 = 1.20, b = 0.75
Inverse document frequencies (rarer terms weigh more)
idf(the) = 0.2877 (df = 4)
idf(brown) = 0.5390 (df = 3)
idf(fox) = 0.8755 (df = 2)
idf(warm) = 0.8755 (df = 2)
idf(bears) = 1.3863 (df = 1)
Query: "brown" (brown documents)
doc 3 score 0.8386 | brown bears and brown foxes roam the brown hills
doc 2 score 0.5548 | a quick brown fox is a clever fox
doc 0 score 0.5289 | the quick brown fox jumps over the lazy dog
Query: "quick brown fox" (multi-term)
doc 2 score 2.6839 | a quick brown fox is a clever fox
doc 0 score 2.2472 | the quick brown fox jumps over the lazy dog
doc 3 score 0.8386 | brown bears and brown foxes roam the brown hills
Query: "warm sun" (warm sun)
doc 1 score 1.8024 | the lazy dog sleeps in the warm sun
doc 4 score 1.7182 | the sun is warm and the sky is clear
Query: "unicorn" (no match)
(no documents matched)
===============================================================
CI builds Poly/ML 5.9.1 from source rather than using the Ubuntu package
(Poly/ML 5.7.1), whose X86 code generator crashes (asGenReg raised while compiling) on some code. See .github/workflows/ci.yml.
MIT — see LICENSE.