Skip to content

sjqtentacles/sml-bm25

Repository files navigation

sml-bm25

Okapi BM25 and the TF-IDF building blocks it is made of, in pure Standard ML — inverse document frequency (idf), per-document term frequency and length, average document length, and a deterministic score that ranks a corpus against a free-text query — built on top of sml-invertedindex for the underlying postings. No FFI, no external dependencies at runtime, and deterministic, byte-identically under both MLton and Poly/ML.

CI

Status

  • 44 assertions, green on MLton and Poly/ML.
  • Basis-library + vendored sml-invertedindex only; deterministic across compilers.
  • Vendors sml-invertedindex (Layout B) under lib/github.com/sjqtentacles/sml-invertedindex/, so the repo builds standalone.

Purity

No FFI, no IO inside the library, no wall-clock, no ambient randomness, and no threads. The same sequence of addDoc calls plus the same query always produce the same ranked answer across runs, machines, and compilers — the test suite and the demo are byte-identical under MLton and Poly/ML. All scores are real, so tests compare them through an explicit tolerance (Support.approx), never string or structural equality.

Install

With smlpkg:

smlpkg add github.com/sjqtentacles/sml-bm25
smlpkg sync

Include the MLB from your own (it pulls in the vendored sml-invertedindex):

local
  $(SML_LIB)/basis/basis.mlb
  lib/github.com/sjqtentacles/sml-bm25/... (via smlpkg)
in
  ...
end

This brings structure Bm25 (and the vendored InvertedIndex) into scope.

Quick start

(* build a corpus by adding documents as ordered token lists *)
val corpus =
  let
    val i0 = Bm25.addDoc Bm25.empty 0 ["the", "cat", "sat", "on", "the", "mat"]
    val i1 = Bm25.addDoc i0      1 ["the", "dog", "sat"]
    val i2 = Bm25.addDoc i1      2 ["the", "cat", "cat", "ran"]
  in i2 end

(* TF-IDF building blocks *)
val n   = Bm25.numDocs corpus            (* 3            *)
val avg = Bm25.avgDocLength corpus       (* 13/3         *)
val df  = Bm25.docFreq corpus "cat"      (* 2            *)
val w   = Bm25.idf corpus "cat"          (* ~0.4700      *)

(* rank the corpus against a query: (docid, score) sorted by score desc, ties
   broken by ascending docid *)
val ranked = Bm25.score corpus { k1 = 1.2, b = 0.75 } ["cat", "sat"]
(* [(0, 0.8122), (2, 0.6605), (1, 0.5377)] *)

API (signature BM25)

type index
type params = { k1 : real, b : real }

val empty        : index
val addDoc       : index -> int -> string list -> index   (* last write wins *)
val numDocs      : index -> int
val docLength    : index -> int -> int
val avgDocLength : index -> real
val docFreq      : index -> string -> int                 (* n(t)            *)
val idf          : index -> string -> real                (* BM25 idf weight *)
val score        : index -> params -> string list -> (int * real) list

score idx params terms returns (docid, score) pairs sorted by score descending, with ties broken by ascending docid for a total, deterministic order. Documents that match no query term are omitted; duplicate query terms are counted once.

Conventions

  • BM25 formula. For a document d and query term t with term frequency f(t,d) and length |d|,

    contribution(t,d) = idf(t) * ( f(t,d) * (k1 + 1) )
                                / ( f(t,d) + k1 * (1 - b + b * |d|/avgdl) )
    

    and score sums the contributions over the distinct query terms.

  • idf. idf(t) = ln( (N - n(t) + 0.5) / (n(t) + 0.5) + 1 ), with N the number of documents and n(t) the document frequency. This form is nonnegative and strictly decreasing in n(t) — rarer terms weigh more.

  • Parameters. k1 controls term-frequency saturation; b controls length normalization (b = 0 disables it, b = 1 applies it fully).

  • Document model. The corpus is held in the vendored InvertedIndex (for document frequencies and the candidate set) plus a per-document token list (for term frequencies and lengths); re-adding a docid replaces it.

Build & test

make test        # MLton
make test-poly   # Poly/ML
make all-tests   # both
make example     # build + run examples/demo.sml
make clean

Both compilers run the same strict-TDD suite (44 assertions), pinned to hand-computed reference values on a tiny 3-document corpus:

  • closed-form BM25 scores for the query ["cat","sat"] (k1 = 1.2, b = 0.75), compared with a 1e-9 tolerance (never string equality on reals);
  • idf monotonicityidf(n=1) > idf(n=2) > idf(n=3), and equal-df terms share an idf;
  • length normalization — at equal term frequency the longer document scores strictly lower, and b = 0 makes the two scores exactly equal;
  • deterministic tie-breaking — content-identical documents added out of docid order rank in ascending docid order, while a higher-scoring document still sorts first.

Vendoring

This library depends on sml-invertedindex, whose sources are vendored verbatim under lib/github.com/sjqtentacles/sml-invertedindex/ (invertedindex.sig, invertedindex.sml, and a sources.mlb). src/bm25.mlb references that sources.mlb first, then bm25.sig/bm25.sml; the Poly/ML use-chain loads the vendored signature and structure first, in dependency order. sml.pkg records the dependency in its require block so smlpkg sync can refresh it.

Example

make example indexes five short documents and ranks them against a few queries (output is byte-identical under MLton and Poly/ML):

=== sml-bm25 demo =============================================

Indexed 5 documents on top of sml-invertedindex.
  avg document length = 8.6000 tokens
  k1 = 1.20, b = 0.75

Inverse document frequencies (rarer terms weigh more)
  idf(the) = 0.2877   (df = 4)
  idf(brown) = 0.5390   (df = 3)
  idf(fox) = 0.8755   (df = 2)
  idf(warm) = 0.8755   (df = 2)
  idf(bears) = 1.3863   (df = 1)

Query: "brown"  (brown documents)
  doc 3  score 0.8386  | brown bears and brown foxes roam the brown hills
  doc 2  score 0.5548  | a quick brown fox is a clever fox
  doc 0  score 0.5289  | the quick brown fox jumps over the lazy dog

Query: "quick brown fox"  (multi-term)
  doc 2  score 2.6839  | a quick brown fox is a clever fox
  doc 0  score 2.2472  | the quick brown fox jumps over the lazy dog
  doc 3  score 0.8386  | brown bears and brown foxes roam the brown hills

Query: "warm sun"  (warm sun)
  doc 1  score 1.8024  | the lazy dog sleeps in the warm sun
  doc 4  score 1.7182  | the sun is warm and the sky is clear

Query: "unicorn"  (no match)
  (no documents matched)

===============================================================

Poly/ML note

CI builds Poly/ML 5.9.1 from source rather than using the Ubuntu package (Poly/ML 5.7.1), whose X86 code generator crashes (asGenReg raised while compiling) on some code. See .github/workflows/ci.yml.

License

MIT — see LICENSE.

About

Okapi BM25 / TF-IDF document ranking in pure Standard ML, built on an inverted index. Deterministic, MLton + Poly/ML.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors