Skip to content

Latest commit

 

History

History
246 lines (190 loc) · 9.19 KB

File metadata and controls

246 lines (190 loc) · 9.19 KB

microdom — Architecture Specification

System overview

[HTML String] 
     ↓ (microdom::parse_html)
[coredom::DomTree (Arena)] 
     ↓
[XPath String] -> (microdom-xpath-engine::lexer/parser) -> [AST]
     ↓
(microdom-xpath-engine::evaluator) + [DomAdapter]
     ↓
[XPathValue (NodeSet / StringSet / String / Number / Boolean / Function)]
     ↓ (microdom::value_to_result)
[SelectResult]

Navigation:

Workspace layout

microdom/                    ← HTML DOM + public API
microdom-xpath-engine/       ← XPath parser + evaluator (no HTML knowledge)
coredom/                     ← raw arena DOM tree (no XPath, no HTML)

Dependencies flow (one direction only):
  coredom ← microdom-xpath-engine ← microdom

coredom/src/lib.rs

DomTree

Arena-based node storage. Nodes are identified by NodeId (u32 index). Each node stores: NodeData, parent, first_child, next_sibling. Doubly-linked sibling list for O(1) append/remove.

⛔ DO NOT change NodeId from Copy — the entire engine passes NodeId by value. If you make it non-Copy, hundreds of &node, *node dereferences break.

NodeData

enum { Document, Element { tag_name, attributes }, Text(TextBuffer) } Text uses a custom TextBuffer (not String) to avoid small-string allocations.

⛔ DO NOT add heap types to NodeData without ensuring Clone stays cheap.

coredom/src/attr.rs

AttrArray

Compact attribute storage. Small number of attrs (≤4) stored inline, upgrades to Vec on overflow. Designed for real-world HTML where most elements have 0–3 attributes.

WHY: HashMap per element would cost ~48 bytes overhead each. AttrArray costs 0 bytes for elements with no attributes.

microdom-xpath-engine/src/lexer.rs

Lexer

Byte-level scanner (operates on &[u8], not chars). Single-pass, no backtracking.

Key rules:

  • :: → DoubleColon (axis separator: child::, descendant::)
  • : → consumed as part of Name for QNames (math:pi, my:func) WHY: tokenizer can't know if "math:pi" is a namespace or an axis without parser context, so we let the Name absorb the colon.
  • => → Arrow (pipeline operator)
  • := → ColonEq (let binding)

⛔ DO NOT change the : vs :: distinction in the identifier scanner. Breaking it will confuse child::div (axis) with ns:div (QName).

microdom-xpath-engine/src/ast.rs

Expr

Full XPath 3.0 AST. Key variants:

  • Path(PathExpr) — /a/b/c, //div[@class]
  • StepsFrom { base, steps } — (expr)/step, $var/step, func()/step WHY separate from Path: Path always starts from root or context node. StepsFrom starts from an arbitrary evaluated expression.
  • Filter { target, predicate } — (expr)[pred]
  • Let / For / If / Quantified — XPath 3.0 control flow
  • InlineFunction / DynamicFunctionCall — XPath 3.0 lambdas

NodeTest

  • Name(String) — element name match
  • Wildcard — *
  • Node / Text / Comment — built-in node type tests (NOT functions)
  • FunctionCall(name, args) — custom/built-in function used as a path step e.g. //div/str:slug(), //div/string()

⛔ Text, Node, Comment are NOT the same as function calls even though they look like text(), node(), comment(). They are XPath grammar primitives. Treating them as FunctionCall breaks predicate evaluation.

microdom-xpath-engine/src/parser.rs

parse_path

Router — decides whether a token sequence is: a) location path → parse_location_path b) primary expr → parse_primary (+ maybe_steps_from for trailing /) c) keyword expr → parse_if_expr / parse_for_expr / parse_let_expr

maybe_steps_from(primary)

If current token is / or // after a primary expression, wraps it in StepsFrom { base: primary, steps: [...] }. Used by: $var/text(), func(args)/step, (expr)[n]/step.

parse_axis_and_test

Parses one step of a location path.

  • Name "::" → axis
  • Name "(" → either node test (text/node/comment with empty parens) or FunctionCall (anything else, including built-ins like string())

WHY text/node/comment are special-cased: In XPath grammar these are KindTest, not function calls. They can appear after :: (e.g. child::text()) which function calls cannot.

⛔ DO NOT remove the "text" | "node" | "comment" special case from parse_axis_and_test. Removing it breaks child::text(), //text(), etc. The is_node_test check (empty parens) allows text(arg) to work as a real function call while text() remains a node test.

microdom-xpath-engine/src/eval.rs

DomAdapter (trait)

The only interface between the engine and any DOM. string_value() has a default implementation via collect_text_rec which recursively gathers all descendant text nodes.

WHY recursive collect and not just text_content: string_value() must return "bold text italic text" (all descendants). text_content() returns None (it's an element, not a text node). text_content(text_node) returns Some("bold text").

⛔ DO NOT override string_value in HtmlDom to return only direct text. Doing so breaks string(), normalize-space(), contains() on elements.

XPathValue

enum { NodeSet, StringSet, String, Number, Boolean, Function }

StringSet vs NodeSet: NodeSet — live node references, converted to HTML/text by value_to_result StringSet — already-serialized strings (attribute values, function results)

WHY two separate string types: Attribute @class returns StringSet(["comment"]) not a NodeSet. If we returned NodeSet, we'd need fake "attribute nodes" with IDs. StringSet avoids that complexity.

⛔ DO NOT collapse StringSet into NodeSet. values_equal() has separate branches for (StringSet, String) comparison — merging them would break @class="foo" predicates.

eval_path

Core path evaluation. For each step:

  1. Attribute axis → collect StringSet, break
  2. FunctionCall node test → call function with current node as arg, collect results (NodeSet or StringSet depending on return type)
  3. Normal step → eval_axis + matches_node_test + apply_predicate

WHY FunctionCall is handled inside eval_path and not in evaluate(): The function receives each node individually with position context. //div/str:slug() means "call str:slug once per div, passing that div". If handled at evaluate() level we'd lose per-node context.

StepsFrom evaluator

Two modes depending on base value: NodeSet → normal step traversal (same as eval_path inner loop) other → pipeline mode: each step must be a FunctionCall, previous value is passed as first argument

Pipeline examples: string(//option)/upper-case() → "MERCEDES1" count(//li)/math:double() → 10 $var/text()/string-length() → 9

⛔ DO NOT make StepsFrom pipeline mode silently ignore non-FunctionCall steps. It should return the value as-is if the step is not a function (current behavior: return Ok(val)).

apply_predicate

Evaluates one predicate expression against a node list. Position context (position, size) is set per-node. Number predicates [1], [last()] handled specially for performance.

WHY local predicate application (not global): //li[1] must return the first li PER PARENT, not the globally first li. (//li)[1] uses Filter+StepsFrom for global selection.

CustomEvaluator

HashMap of user-registered functions (Arc). Arc not Box because HtmlDom stores them and clones Arc into each Evaluator.

⛔ DO NOT change Arc back to Box. HtmlCustomEvaluator stores functions and passes them to Evaluator on every select() call. Box is not Clone.

microdom/src/lib.rs

HtmlDom

Owns DomTree + root NodeId. Implements DomAdapter — bridge between coredom and xpath engine.

value_to_result

Converts XPathValue → SelectResult. NodeSet → each node serialized: text nodes as text, elements as HTML StringSet → Text (1 item) or List (many) Number → Int if whole number, Text otherwise

WHY text nodes serialize differently from elements: //li/text() should return "list 1", not "list 1". is_text() check handles this.

select / select_with

select(xpath) — no custom functions select_with(evaluator, xpath) — with custom functions

Both call value_to_result at the end. Parser::parse() is called on every select() — cheap (microseconds). For hot paths use XPath::compile() + select_compiled().

HtmlCustomEvaluator

Type alias + newtype wrapper around CustomEvaluator. Adds select(&dom, xpath) method for ergonomic API: eval.select(&dom, xpath) ←→ dom.select_with(&eval, xpath)

Performance notes

  • parse_html: ~4.7ms per 1MB document (2-3x faster than html5ever)
  • tl crate: 0.7ms but read-only and 2MB limit — different use case
  • XPath parse: microseconds, not a bottleneck
  • NodeId is u32 — fits in a register, zero heap cost per node reference
  • AttrArray: zero overhead for elements with no attributes