[HTML String]
↓ (microdom::parse_html)
[coredom::DomTree (Arena)]
↓
[XPath String] -> (microdom-xpath-engine::lexer/parser) -> [AST]
↓
(microdom-xpath-engine::evaluator) + [DomAdapter]
↓
[XPathValue (NodeSet / StringSet / String / Number / Boolean / Function)]
↓ (microdom::value_to_result)
[SelectResult]
- Workspace layout
- coredom/src/lib.rs
- coredom/src/attr.rs
- microdom-xpath-engine/src/lexer.rs
- microdom-xpath-engine/src/ast.rs
- microdom-xpath-engine/src/parser.rs
- microdom-xpath-engine/src/eval.rs
- microdom/src/lib.rs
- Performance notes
microdom/ ← HTML DOM + public API
microdom-xpath-engine/ ← XPath parser + evaluator (no HTML knowledge)
coredom/ ← raw arena DOM tree (no XPath, no HTML)
Dependencies flow (one direction only):
coredom ← microdom-xpath-engine ← microdom
Arena-based node storage. Nodes are identified by NodeId (u32 index). Each node stores: NodeData, parent, first_child, next_sibling. Doubly-linked sibling list for O(1) append/remove.
⛔ DO NOT change NodeId from Copy — the entire engine passes NodeId by value. If you make it non-Copy, hundreds of &node, *node dereferences break.
enum { Document, Element { tag_name, attributes }, Text(TextBuffer) } Text uses a custom TextBuffer (not String) to avoid small-string allocations.
⛔ DO NOT add heap types to NodeData without ensuring Clone stays cheap.
Compact attribute storage. Small number of attrs (≤4) stored inline, upgrades to Vec on overflow. Designed for real-world HTML where most elements have 0–3 attributes.
WHY: HashMap per element would cost ~48 bytes overhead each. AttrArray costs 0 bytes for elements with no attributes.
Byte-level scanner (operates on &[u8], not chars). Single-pass, no backtracking.
Key rules:
::→ DoubleColon (axis separator: child::, descendant::):→ consumed as part of Name for QNames (math:pi, my:func) WHY: tokenizer can't know if "math:pi" is a namespace or an axis without parser context, so we let the Name absorb the colon.=>→ Arrow (pipeline operator):=→ ColonEq (let binding)
⛔ DO NOT change the : vs :: distinction in the identifier scanner.
Breaking it will confuse child::div (axis) with ns:div (QName).
Full XPath 3.0 AST. Key variants:
- Path(PathExpr) — /a/b/c, //div[@class]
- StepsFrom { base, steps } — (expr)/step, $var/step, func()/step WHY separate from Path: Path always starts from root or context node. StepsFrom starts from an arbitrary evaluated expression.
- Filter { target, predicate } — (expr)[pred]
- Let / For / If / Quantified — XPath 3.0 control flow
- InlineFunction / DynamicFunctionCall — XPath 3.0 lambdas
- Name(String) — element name match
- Wildcard — *
- Node / Text / Comment — built-in node type tests (NOT functions)
- FunctionCall(name, args) — custom/built-in function used as a path step e.g. //div/str:slug(), //div/string()
⛔ Text, Node, Comment are NOT the same as function calls even though they look like text(), node(), comment(). They are XPath grammar primitives. Treating them as FunctionCall breaks predicate evaluation.
Router — decides whether a token sequence is: a) location path → parse_location_path b) primary expr → parse_primary (+ maybe_steps_from for trailing /) c) keyword expr → parse_if_expr / parse_for_expr / parse_let_expr
If current token is / or // after a primary expression, wraps it in StepsFrom { base: primary, steps: [...] }. Used by: $var/text(), func(args)/step, (expr)[n]/step.
Parses one step of a location path.
- Name "::" → axis
- Name "(" → either node test (text/node/comment with empty parens) or FunctionCall (anything else, including built-ins like string())
WHY text/node/comment are special-cased: In XPath grammar these are KindTest, not function calls. They can appear after :: (e.g. child::text()) which function calls cannot.
⛔ DO NOT remove the "text" | "node" | "comment" special case from parse_axis_and_test. Removing it breaks child::text(), //text(), etc. The is_node_test check (empty parens) allows text(arg) to work as a real function call while text() remains a node test.
The only interface between the engine and any DOM. string_value() has a default implementation via collect_text_rec which recursively gathers all descendant text nodes.
WHY recursive collect and not just text_content: string_value() must return "bold text italic text" (all descendants). text_content() returns None (it's an element, not a text node). text_content(text_node) returns Some("bold text").
⛔ DO NOT override string_value in HtmlDom to return only direct text. Doing so breaks string(), normalize-space(), contains() on elements.
enum { NodeSet, StringSet, String, Number, Boolean, Function }
StringSet vs NodeSet: NodeSet — live node references, converted to HTML/text by value_to_result StringSet — already-serialized strings (attribute values, function results)
WHY two separate string types: Attribute @class returns StringSet(["comment"]) not a NodeSet. If we returned NodeSet, we'd need fake "attribute nodes" with IDs. StringSet avoids that complexity.
⛔ DO NOT collapse StringSet into NodeSet. values_equal() has separate branches for (StringSet, String) comparison — merging them would break @class="foo" predicates.
Core path evaluation. For each step:
- Attribute axis → collect StringSet, break
- FunctionCall node test → call function with current node as arg, collect results (NodeSet or StringSet depending on return type)
- Normal step → eval_axis + matches_node_test + apply_predicate
WHY FunctionCall is handled inside eval_path and not in evaluate(): The function receives each node individually with position context. //div/str:slug() means "call str:slug once per div, passing that div". If handled at evaluate() level we'd lose per-node context.
Two modes depending on base value: NodeSet → normal step traversal (same as eval_path inner loop) other → pipeline mode: each step must be a FunctionCall, previous value is passed as first argument
Pipeline examples: string(//option)/upper-case() → "MERCEDES1" count(//li)/math:double() → 10 $var/text()/string-length() → 9
⛔ DO NOT make StepsFrom pipeline mode silently ignore non-FunctionCall steps. It should return the value as-is if the step is not a function (current behavior: return Ok(val)).
Evaluates one predicate expression against a node list. Position context (position, size) is set per-node. Number predicates [1], [last()] handled specially for performance.
WHY local predicate application (not global): //li[1] must return the first li PER PARENT, not the globally first li. (//li)[1] uses Filter+StepsFrom for global selection.
HashMap of user-registered functions (Arc). Arc not Box because HtmlDom stores them and clones Arc into each Evaluator.
⛔ DO NOT change Arc back to Box. HtmlCustomEvaluator stores functions and passes them to Evaluator on every select() call. Box is not Clone.
Owns DomTree + root NodeId. Implements DomAdapter — bridge between coredom and xpath engine.
Converts XPathValue → SelectResult. NodeSet → each node serialized: text nodes as text, elements as HTML StringSet → Text (1 item) or List (many) Number → Int if whole number, Text otherwise
WHY text nodes serialize differently from elements: //li/text() should return "list 1", not "list 1". is_text() check handles this.
select(xpath) — no custom functions select_with(evaluator, xpath) — with custom functions
Both call value_to_result at the end. Parser::parse() is called on every select() — cheap (microseconds). For hot paths use XPath::compile() + select_compiled().
Type alias + newtype wrapper around CustomEvaluator. Adds select(&dom, xpath) method for ergonomic API: eval.select(&dom, xpath) ←→ dom.select_with(&eval, xpath)
- parse_html: ~4.7ms per 1MB document (2-3x faster than html5ever)
- tl crate: 0.7ms but read-only and 2MB limit — different use case
- XPath parse: microseconds, not a bottleneck
- NodeId is u32 — fits in a register, zero heap cost per node reference
- AttrArray: zero overhead for elements with no attributes