Pre-Compiled Filter Sets and Query-Time Specialization in Temporal Databases
Query engines waste time re-evaluating constants inside hot loops. In MinnsDB, a single IN (...) predicate with 5 literal values evaluated against 100,000 binding rows produced 500,000 redundant expression evaluations. The fix was not a new index or operator, but query-time specialization: evaluate the constant part once, then run the residual predicate per row.
This article examines how MinnsDB addresses this through compile-time partial evaluation of filter expressions, what binding rows actually cost, and how the same principle extends to temporal viewport compilation.
The Problem: Redundant Work in Expression Evaluation
Consider the resolved boolean expression tree that the MinnsDB planner emits:
pub enum RBoolExpr {
Comparison(RExpr, CompOp, RExpr),
In(RExpr, Vec<RExpr>),
NotIn(RExpr, Vec<RExpr>),
And(Box<RBoolExpr>, Box<RBoolExpr>),
Or(Box<RBoolExpr>, Box<RBoolExpr>),
// ...
}
The In(RExpr, Vec<RExpr>) variant stores a left-hand expression (the property being tested) and a vector of right-hand expressions. When those right-hand expressions are all RExpr::Literal(_), their runtime Value representation is fixed. Converting Literal::String("active") to Value::String("active".to_string()) allocates a String on every call. Multiply by 100,000 rows and three literals: 300,000 string allocations for values that could have been computed once.
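To make the allocation cost concrete, here is a minimal sketch of that conversion. The Literal and Value names follow the article, but the variant sets and the function body are assumptions for illustration:

```rust
// Sketch of the conversion described above; `Literal` and `Value` mirror the
// names used in the article, but their exact definitions are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    String(String),
    Int(i64),
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    String(String), // owns its buffer: converting allocates every time
    Int(i64),
}

fn literal_to_value(lit: &Literal) -> Value {
    match lit {
        // Each call clones the backing string: one heap allocation per call.
        Literal::String(s) => Value::String(s.clone()),
        Literal::Int(i) => Value::Int(*i),
    }
}

fn main() {
    let lit = Literal::String("active".to_string());
    // In V0 this runs once per row per literal: 100,000 rows x 5 literals
    // means 500,000 allocations for values that never change.
    let v = literal_to_value(&lit);
    assert_eq!(v, Value::String("active".to_string()));
}
```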
The Evolution: V0, V1, V2
V0: Evaluate everything per row (the starting point)
The straightforward implementation evaluates every expression from scratch for every row:
RBoolExpr::In(e, vals) => {
let v = self.evaluate_expr(e, binding)?; // evaluate left side
for candidate in vals {
let c = self.evaluate_expr(candidate, binding)?; // evaluate each literal
if compare_values(&v, &CompOp::Eq, &c) {
return Ok(true);
}
}
Ok(false)
}
For WHERE n.status IN ("active", "pending", "review", "escalated", "waiting") against 100,000 rows, this calls evaluate_expr on each of the 5 literals for each of the 100,000 rows. Each call to evaluate_expr(RExpr::Literal(Literal::String("active"))) allocates a new String via literal_to_value. That is 500,000 expression evaluations and 500,000 string allocations for values that are constant across every row.
V1: Pre-compile with pointer keys (fast but fragile)
The first optimization: walk the expression tree once before the row loop, pre-evaluate all literal IN values, and store them in a map. The per-row evaluation checks the map first and skips evaluate_expr on the literals.
The question is how to key the map. The first approach used the memory address of each RBoolExpr node:
// V1: pointer-based key
struct PreCompiledSets {
sets: FxHashMap<usize, Vec<Value>>,
}
// During build:
let key = expr as *const RBoolExpr as usize;
sets.insert(key, precomputed);
// During per-row evaluation:
fn get(&self, expr: &RBoolExpr) -> Option<&Vec<Value>> {
let key = expr as *const RBoolExpr as usize;
self.sets.get(&key)
}
This was fast. Casting a pointer to usize is free, and FxHashMap with integer keys is a single multiply-shift hash. The benchmarks showed a 26-38% speedup.
But it was fragile. The approach depends on the expression tree staying pinned in memory for the entire filter step. Rust does not guarantee pointer stability for a value unless it is explicitly pinned (via Pin) or lives behind a Box that is never moved. If the plan were ever re-optimized during execution, or if the executor cloned the expression tree, the pointers would change and the hash map lookups would silently return None, falling back to the slow path without error. Worse, if memory were reused, a stale pointer could match a completely different expression.
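The failure mode is easy to demonstrate outside MinnsDB. The sketch below uses a Box&lt;u32&gt; as a stand-in for a boxed expression tree: cloning it produces a new allocation at a new address, and the pointer-keyed map misses without any error:

```rust
// Demonstration (not MinnsDB code): a clone of a boxed value lives at a
// different address, so a map keyed on the old pointer misses silently.
use std::collections::HashMap;

fn main() {
    let expr: Box<u32> = Box::new(42); // stand-in for Box<RBoolExpr>
    let key = &*expr as *const u32 as usize;

    let mut sets: HashMap<usize, &str> = HashMap::new();
    sets.insert(key, "precomputed");

    // If the executor clones the plan, the clone has a new address...
    let cloned = expr.clone();
    let cloned_key = &*cloned as *const u32 as usize;
    assert_ne!(key, cloned_key);

    // ...and the lookup falls back to the slow path without any error.
    assert!(sets.get(&cloned_key).is_none());
}
```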
V2: Pre-compile with structural hash keys (safe)
The fix: key on the structural content of the expression instead of its address. RBoolExpr derives Hash (all its variants contain types that implement Hash), so hashing the entire node produces a deterministic key that does not depend on memory layout:
struct PreCompiledSets {
sets: FxHashMap<u64, Vec<Value>>,
}
impl PreCompiledSets {
fn build(expr: &RBoolExpr) -> Self {
let mut sets = FxHashMap::default();
Self::collect(expr, &mut sets);
PreCompiledSets { sets }
}
fn collect(expr: &RBoolExpr, sets: &mut FxHashMap<u64, Vec<Value>>) {
match expr {
RBoolExpr::In(_, vals) | RBoolExpr::NotIn(_, vals) => {
let all_literals = vals.iter().all(|v| matches!(v, RExpr::Literal(_)));
if all_literals {
let precomputed: Vec<Value> = vals
.iter()
.map(|v| match v {
RExpr::Literal(lit) => literal_to_value(lit),
_ => unreachable!(),
})
.collect();
let key = Self::hash_expr(expr);
sets.insert(key, precomputed);
}
},
RBoolExpr::And(a, b) | RBoolExpr::Or(a, b) => {
Self::collect(a, sets);
Self::collect(b, sets);
},
RBoolExpr::Not(inner) | RBoolExpr::Paren(inner) => {
Self::collect(inner, sets);
},
_ => {},
}
}
#[inline]
fn hash_expr(expr: &RBoolExpr) -> u64 {
use std::hash::Hasher;
let mut hasher = std::collections::hash_map::DefaultHasher::new();
expr.hash(&mut hasher);
hasher.finish()
}
#[inline]
fn get(&self, expr: &RBoolExpr) -> Option<&[Value]> {
let key = Self::hash_expr(expr);
self.sets.get(&key).map(|v| v.as_slice())
}
}
The cost of V2 over V1: one DefaultHasher computation per IN/NotIn node per row (the get call during evaluation). For a typical filter with 1-3 IN clauses, this is 1-3 hash operations per row. The hash walks the RBoolExpr tree for that node, which for In(Property(0, "status"), [Literal("active"), Literal("pending"), ...]) touches roughly 50-100 bytes. On modern hardware this is under 10 nanoseconds. The pointer approach was faster (a no-op cast), but the structural hash is safe under any future changes to plan ownership or executor architecture.
Two structurally identical IN clauses share the same pre-compiled set, which is correct: they have the same literals, so they should use the same pre-evaluated values. Hash collisions at 64 bits are vanishingly rare for the number of IN clauses in a single query (typically 1-3).
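This sharing follows directly from how derived Hash works: the hash is a function of structure, not identity. A stand-in sketch (the Expr enum here is a simplified substitute for RBoolExpr):

```rust
// Sketch with stand-in types: two structurally identical expressions
// produce the same 64-bit key, so they share one pre-compiled set.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
enum Expr {
    In(String, Vec<String>), // simplified stand-in for RBoolExpr::In
}

fn hash_expr(expr: &Expr) -> u64 {
    let mut hasher = DefaultHasher::new();
    expr.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Built independently, at different addresses, but structurally equal.
    let a = Expr::In("status".into(), vec!["active".into(), "pending".into()]);
    let b = Expr::In("status".into(), vec!["active".into(), "pending".into()]);
    assert_eq!(hash_expr(&a), hash_expr(&b)); // same key: shared set

    // Different literals: different key (barring a 64-bit collision).
    let c = Expr::In("status".into(), vec!["review".into()]);
    assert_ne!(hash_expr(&a), hash_expr(&c));
}
```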
During per-row evaluation, the IN branch becomes:
RBoolExpr::In(e, vals) => {
let v = self.evaluate_expr(e, binding)?;
if let Some(precomputed) = precompiled.get(expr) {
return Ok(precomputed.iter().any(|c| value_eq_loose(&v, c)));
}
// Fallback for non-literal IN lists (rare)
for candidate in vals {
let c = self.evaluate_expr(candidate, binding)?;
if compare_values(&v, &CompOp::Eq, &c) {
return Ok(true);
}
}
Ok(false)
}
The fallback path exists for IN clauses containing non-literal expressions (subqueries, function calls), which are uncommon in practice. For the common case, the per-row cost drops from O(V) allocations + comparisons to O(V) comparisons with zero allocations.
Measured Impact
Criterion benchmarks on MATCH (n:Concept) WHERE n.status IN ("active", "pending", "review", "escalated", "waiting") RETURN n.name against graphs of varying size. Each node has a status property drawn from 10 possible values. Measured on the same hardware, same graph, same query, comparing the executor before and after the PreCompiledSets optimization.
IN with 5 literal values:
| Nodes | Before (no precompilation) | After (precompiled) | Change |
|---|---|---|---|
| 1,000 | 177 µs | 110 µs | -38% |
| 10,000 | 1.93 ms | 1.22 ms | -37% |
| 50,000 | 19.2 ms | 14.2 ms | -26% |
IN with 1 literal value:
| Nodes | Before | After | Change |
|---|---|---|---|
| 1,000 | 100 µs | 82 µs | -18% |
| 10,000 | 1.13 ms | 973 µs | -14% |
| 50,000 | 12.9 ms | 9.9 ms | -23% |
Baselines (unchanged by this optimization):
| Query shape | 1,000 nodes | 10,000 nodes | 50,000 nodes |
|---|---|---|---|
| Scan only (no filter) | 67 µs | 702 µs | 5.4 ms |
| Equality (= "active") | 100 µs | 1.13 ms | 12.9 ms |
The improvement scales with the number of IN values: 5 values shows a 26-38% reduction in total query time, while 1 value shows 14-23%. The scan baseline and equality filter are unaffected, confirming the optimization is isolated to the IN/NotIn evaluation path.
At larger node counts the speedup percentage decreases because scan cost (iterating all candidate nodes) dominates. The filter step becomes a smaller fraction of total execution time. At 1,000 nodes where the filter is a larger share of the work, the 38% improvement is closest to the theoretical savings from eliminating per-row literal evaluation.
Why This Matters: The Binding Row Cost Model
MinnsDB's executor uses a binding row model where each pattern variable is assigned a slot index (a u8, supporting up to 255 variables) by the planner. At runtime, each row in the intermediate result set is a BindingRow containing an array of bound values:
const INLINE_SLOTS: usize = 16;
enum BindingSlots {
Inline([Option<BoundValue>; INLINE_SLOTS]),
Dynamic(Vec<Option<BoundValue>>),
}
enum BoundValue {
Node(NodeId), // 8 bytes
Edge(EdgeId), // 8 bytes
Path(Box<[NodeId]>), // 16 bytes (pointer + length, no capacity)
}
For queries with 16 or fewer variables (the vast majority), the slot array is stack-allocated as an inline [Option<BoundValue>; 16]. The Path variant is boxed (Box<[NodeId]>, 16 bytes: pointer + length) to prevent the rare Path case from bloating every slot. Without boxing, Vec<NodeId> (24 bytes) would force every Option<BoundValue> to 32 bytes due to enum padding. With boxing, the common Node and Edge variants keep each Option<BoundValue> at 16 bytes in the current layout. The total inline size is 256 bytes.
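The size difference behind the boxing decision can be checked with std::mem::size_of. A minimal check, assuming NodeId is a u64 and a 64-bit target:

```rust
// Why Box<[NodeId]> instead of Vec<NodeId> inside BoundValue: a boxed slice
// is a fat pointer (pointer + length), while a Vec also carries a capacity.
use std::mem::size_of;

type NodeId = u64; // assumed width, matching the 8-byte Node/Edge ids above

fn main() {
    // On a 64-bit target these sizes are fixed:
    assert_eq!(size_of::<Box<[NodeId]>>(), 16); // ptr + len
    assert_eq!(size_of::<Vec<NodeId>>(), 24);   // ptr + len + capacity
}
```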
Cross-product expansion in the ScanNodes and Expand steps clones the full binding row for each output:
for node in &candidates {
let mut new_binding = binding.clone();
new_binding.set(var, BoundValue::Node(node.id));
out.push(new_binding);
}
The cost math: 256 bytes per clone, 100,000 rows (the MAX_INTERMEDIATE_ROWS cap) = 25 MB of copies per step. For a two-step plan (scan + expand), that is up to 50 MB of binding row copies. This is the dominant memory cost of query execution, and it is why every optimization that avoids producing a binding row matters. Every edge that fails a filter check or temporal viewport check avoids a 256-byte clone.
The MAX_INTERMEDIATE_ROWS cap at 100,000 acts as backpressure. Without it, a query like MATCH (a)-[r]->(b) RETURN a, b on a graph with 10,000 nodes and average degree 10 would produce 100,000 binding rows in the expand step, each cloned from the scan step's output. The cap prevents OOM from accidental cartesian products while still allowing large result sets for targeted queries.
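A minimal sketch of cap enforcement during expansion. Only the constant name comes from the article; the loop shape and the error handling are assumptions:

```rust
// Sketch: enforce a row cap during cross-product expansion. The constant
// name comes from the article; the error type here is an assumption.
const MAX_INTERMEDIATE_ROWS: usize = 100_000;

fn expand_with_cap(input_rows: usize, avg_degree: usize) -> Result<usize, String> {
    let mut out = 0usize;
    for _ in 0..input_rows {
        for _ in 0..avg_degree {
            if out >= MAX_INTERMEDIATE_ROWS {
                return Err("intermediate row cap exceeded".to_string());
            }
            out += 1; // stand-in for pushing a cloned BindingRow
        }
    }
    Ok(out)
}

fn main() {
    // 10,000 scan rows x degree 10 lands exactly on the 100,000-row cap.
    assert_eq!(expand_with_cap(10_000, 10), Ok(100_000));
    // One more input row would trip the backpressure path.
    assert!(expand_with_cap(10_001, 10).is_err());
}
```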
The Same Principle: Temporal Viewport as a Compile-Time Decision
The same pattern of pre-resolution applies to MinnsDB's temporal model. Every MinnsQL query implicitly or explicitly specifies a temporal viewport:
-- Default: only currently-active edges
MATCH (a)-[r]->(b) RETURN a.name, b.name
-- Last 30 days
MATCH (a)-[r]->(b) WHEN LAST "30d" WHERE r.confidence > 0.8 RETURN a.name, b.name
-- Point-in-time snapshot
MATCH (a)-[r]->(b) WHEN AT "2025-06-01T00:00:00Z" RETURN a.name
-- All edges, including superseded
MATCH (a)-[r]->(b) WHEN ALL RETURN a.name, r.valid_from, r.valid_until
The planner resolves the WHEN clause into a TemporalViewport enum at plan time:
pub enum TemporalViewport {
ActiveOnly,
PointInTime(Timestamp),
Range(Timestamp, Timestamp),
All,
}
Duration expressions like LAST "30d" are converted to absolute nanosecond timestamps during planning. The ago("30d") function computes now_nanos() - 30 * 86400 * 1_000_000_000 exactly once. By the time the executor sees the plan, the viewport is a fixed enum variant with concrete timestamps. No string parsing, no duration arithmetic happens per row.
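That plan-time resolution can be sketched as follows. The now_nanos name and the "30d" format follow the article; the parser and the ago signature shown here are illustrative assumptions:

```rust
// Sketch of plan-time duration resolution: LAST "30d" becomes one absolute
// nanosecond cutoff before the executor ever sees the plan.
use std::time::{SystemTime, UNIX_EPOCH};

const NANOS_PER_DAY: u64 = 86_400 * 1_000_000_000;

fn now_nanos() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_nanos() as u64
}

/// Parse a duration like "30d" into nanoseconds (days only, for brevity).
fn duration_nanos(s: &str) -> Option<u64> {
    let days: u64 = s.strip_suffix('d')?.parse().ok()?;
    Some(days * NANOS_PER_DAY)
}

/// Resolve LAST "30d" to an absolute cutoff exactly once, at plan time.
fn ago(s: &str) -> Option<u64> {
    Some(now_nanos() - duration_nanos(s)?)
}

fn main() {
    assert_eq!(duration_nanos("30d"), Some(30 * NANOS_PER_DAY));
    let cutoff = ago("30d").unwrap();
    assert!(cutoff < now_nanos()); // the cutoff is a fixed point in the past
}
```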
The executor's edge visibility check is then a constant-time branch:
fn edge_visible(&self, edge: &GraphEdge) -> bool {
if !edge.is_valid() { return false; }
if let Some(cutoff) = self.transaction_cutoff {
if edge.created_at > cutoff { return false; }
}
match &self.viewport {
TemporalViewport::ActiveOnly => edge.valid_until.is_none(),
TemporalViewport::PointInTime(ts) => edge.valid_at(*ts),
TemporalViewport::Range(t1, t2) => edge.valid_during(*t1, *t2),
TemporalViewport::All => true,
}
}
This is early temporal pruning during scan and expand. In a non-temporal graph database, every edge is always visible, and the only filtering dimension is structural (labels, properties, direction). In a bi-temporal graph like MinnsDB, each edge has four time fields: created_at (transaction time), updated_at (transaction time), valid_from (valid time), and valid_until (valid time). The viewport resolves which of these dimensions to filter on, and the AS OF clause adds an independent transaction-time cutoff.
Edges outside the viewport are never materialized into binding rows. They are skipped during the Expand step before a BindingRow is ever cloned. Given the 256-byte-per-row cost described above, this is the single most impactful optimization in the temporal query path.
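The ordering is the whole point: the visibility check must run before the clone. A simplified sketch with stand-in types (the real BindingRow and edge types are richer):

```rust
// Sketch of early temporal pruning in Expand: the visibility check runs
// before the 256-byte BindingRow clone, so invisible edges cost nothing
// beyond the check itself. Types here are simplified stand-ins.
#[derive(Clone)]
struct BindingRow(Vec<u64>); // stand-in for the real slot array

struct Edge {
    to: u64,
    valid_until: Option<u64>, // None = currently active
}

fn expand(binding: &BindingRow, edges: &[Edge]) -> Vec<BindingRow> {
    let mut out = Vec::new();
    for edge in edges {
        // ActiveOnly viewport: superseded edges are skipped here,
        // before any clone happens.
        if edge.valid_until.is_some() {
            continue;
        }
        let mut new_binding = binding.clone();
        new_binding.0.push(edge.to);
        out.push(new_binding);
    }
    out
}

fn main() {
    let row = BindingRow(vec![1]);
    let edges = vec![
        Edge { to: 2, valid_until: None },      // active: materialized
        Edge { to: 3, valid_until: Some(500) }, // superseded: pruned
        Edge { to: 4, valid_until: None },      // active: materialized
    ];
    assert_eq!(expand(&row, &edges).len(), 2);
}
```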
The Execution Pipeline
A MinnsQL execution plan is a sequence of physical operators:
pub enum PlanStep {
ScanNodes { var: SlotIdx, labels: Vec<String>, props: Vec<(String, Literal)> },
Expand { from_var: SlotIdx, edge_var: Option<SlotIdx>, to_var: SlotIdx, ... },
Filter(RBoolExpr),
ScanTable { table_name: String, alias: Option<String>, scan_limit: Option<usize> },
// ...
}
Execution follows a pull-based pipeline. Each step consumes a Vec<BindingRow> and produces a new Vec<BindingRow>:
- ScanNodes: iterates candidate nodes (using type index or concept index for fast lookup), applies property filters, and produces initial binding rows.
- Expand: for each binding row, follows outgoing or incoming edges from the bound node. Each matching edge produces a new binding row with the neighbor bound to to_var. Variable-length expansion uses BFS with an FxHashSet<NodeId> visited set (pre-allocated to 256 entries) and a VecDeque queue.
- Filter: applies the RBoolExpr predicate to each row, using PreCompiledSets for the IN/NotIn optimization. Operates in place via Vec::retain.
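The batch-to-batch shape of the pipeline can be sketched with two stand-in operators. Only the Vec&lt;BindingRow&gt;-in, Vec&lt;BindingRow&gt;-out contract comes from the article; the step bodies are simplified:

```rust
// Sketch of the pull-based pipeline driver: each step maps one row batch
// to the next. Step internals are stand-ins; only the shape is from the text.
#[derive(Clone)]
struct BindingRow(Vec<u64>);

enum PlanStep {
    ScanNodes(Vec<u64>),             // candidate node ids (stand-in)
    Filter(fn(&BindingRow) -> bool), // compiled-down predicate (stand-in)
}

fn execute(steps: &[PlanStep]) -> Vec<BindingRow> {
    let mut rows: Vec<BindingRow> = vec![BindingRow(vec![])]; // seed row
    for step in steps {
        match step {
            PlanStep::ScanNodes(candidates) => {
                // Cross-product: each input row times each candidate,
                // cloning the binding row per output (the cost model above).
                let mut out = Vec::new();
                for row in &rows {
                    for &id in candidates {
                        let mut r = row.clone();
                        r.0.push(id);
                        out.push(r);
                    }
                }
                rows = out;
            }
            // Filter operates in place, mirroring Vec::retain in the text.
            PlanStep::Filter(pred) => rows.retain(|r| pred(r)),
        }
    }
    rows
}

fn main() {
    let steps = [
        PlanStep::ScanNodes(vec![1, 2, 3, 4]),
        PlanStep::Filter(|r| r.0[0] % 2 == 0), // keep even node ids
    ];
    assert_eq!(execute(&steps).len(), 2);
}
```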
The planner also performs scan limit pushdown: if the plan has no Expand step and no aggregation, the LIMIT from the RETURN clause is pushed into the ScanNodes step to avoid collecting millions of candidates:
fn compute_scan_limit(plan: &ExecutionPlan) -> Option<usize> {
let limit = plan.limit?;
let has_expand = plan.steps.iter().any(|s| matches!(s, PlanStep::Expand { .. }));
let has_aggregation = !plan.aggregations.is_empty();
if has_expand || has_aggregation { return None; }
Some(limit as usize)
}
This is a simple but effective optimization. A query like MATCH (n:Concept) RETURN n.name LIMIT 10 on a graph with 50,000 concept nodes will scan at most 10 candidates instead of 50,000.
How Other Engines Handle This
The principle of separating compile-time work from per-row work is universal, but the implementations vary by orders of magnitude in sophistication:
- Postgres: expression JIT via LLVM. WHERE clauses are compiled to native machine code, amortized over millions of rows.
- DuckDB: vectorized batches. Filters process 2048 tuples at once through compiled functions with SIMD where possible.
- MinnsDB: interpreted executor with query-time specialization. A match dispatch per expression node, per row.
MinnsDB does not compete with these engines on raw filter throughput. The PreCompiledSets optimization and temporal viewport pre-resolution are targeted improvements within an interpreted model. They remove the most expensive redundant work (repeated literal evaluation, repeated timestamp arithmetic) without requiring JIT compilation or vectorized execution. In knowledge graph workloads, graph topology (fan-out, path length) dominates execution time rather than per-tuple filter cost. The principle is the same: move constant work to the cold path. The implementation matches the scale.
Trade-offs and Limitations
These optimizations are not free; the trade-offs are worth stating explicitly:
1. Specialization has a warm-up cost. Walking the expression tree and building the FxHashMap is overhead. For queries that process fewer than 50-100 rows, the time spent pre-evaluating literal sets may exceed the savings. The breakeven depends on the number of IN values and the cost of each literal-to-value conversion (one String allocation per literal). For MinnsDB's typical workload (graph traversals producing thousands to hundreds of thousands of binding rows), the breakeven is cleared easily. For small transactional lookups, a simpler execution path without specialization would be faster.
2. Hash collisions are theoretically possible. The structural hash key uses DefaultHasher over the RBoolExpr tree. Two structurally different IN clauses could collide, causing one to use the other's pre-compiled values. In practice, hash collisions at 64 bits are vanishingly rare for the number of IN clauses in a single query (typically 1-3). An earlier version of this code used raw pointers as keys, which was faster but fragile: it depended on the expression tree being pinned in memory. The structural hash trades a few nanoseconds of hashing overhead for safety that does not depend on the memory layout of the plan.
3. The inline binding row has a performance discontinuity at 17 variables. Queries with 16 or fewer variables use the stack-allocated inline path. Queries with 17+ variables fall back to heap-allocated Vec<Option<BoundValue>>. The transition is not 10x (the heap path amortizes allocation across rows as the Vec grows), but it is measurable: roughly 1.2-1.5x slower due to cache misses and allocation overhead. In practice, the vast majority of graph pattern queries use 2-4 variables (MATCH (a)-[r]->(b) is 3). The threshold of 16 covers over 99% of real queries. If this becomes a bottleneck, the inline size could be raised to 32 (512 bytes per row, still reasonable for stack allocation).
4. Temporal pruning operates at the executor level, not the storage level. The TemporalViewport optimization prevents invisible edges from becoming binding rows, but the executor still must examine each edge to check its temporal fields. In a graph with 10 million historical edges where only 1,000 are currently active, the scan still touches all 10 million edge objects in memory to evaluate edge.valid_until.is_none(). A time-sorted B-tree index on valid_from would allow the storage layer to skip historical edges entirely, reducing the scan to O(active_edges) instead of O(total_edges). This is an architectural gap in the current implementation. Adding a temporal B-tree index to the edge storage layer is a planned future direction. Systems like XTDB and Crux use time-sorted storage to solve this at the I/O level. MinnsDB's in-memory graph masks this cost at small-to-medium scale (hundreds of thousands of edges), but at millions of historical edges, a storage-level temporal index becomes necessary.
5. String interning would solve this at a deeper level. The root cause of the per-row allocation problem is that literal_to_value creates a new String for each string literal. If MinnsDB used string interning (storing each unique string once in an arena and referencing it by index), the "literal-to-value" conversion would be an index copy, not a string allocation. Pre-compiled filter sets solve the symptom (redundant evaluation in the hot loop) rather than the cause (string allocation as the representation of values). String interning is a larger architectural change that would benefit the entire executor, not just IN clauses. It is a candidate for future optimization.
6. An interpreted executor pays a dispatch tax on every expression. Every filter evaluation requires a match branch on the RBoolExpr enum, then a match on each RExpr variant. In a tight scan over millions of rows, this branch overhead accumulates. A JIT-compiled query (Postgres LLVM, DuckDB compiled pipelines) eliminates dispatch entirely: the filter becomes native machine code. However, graph traversal workloads are typically memory-bound rather than compute-bound. The bottleneck is cache misses from following edge pointers across the adjacency structure, not the filter dispatch itself. JIT would help MinnsDB's table executor (which does columnar scans) more than its graph executor (which does pointer-chasing traversals).
MinnsDB is source-available (AGPL-3.0) at minns.ai.