# Zero-Allocation Graph Traversal: Why Adjacency List Representation Determines Query Speed

Every call to get_edges_from allocated a new Vec. In a graph query that expands 10,000 nodes, that was 10,000 heap allocations in the hot path. The fix required changing one function signature and rethinking how adjacency lists expose their contents.

This article explains how MinnsDB's adaptive adjacency list representation eliminates per-node allocation during traversal, and why that representation is the right tradeoff for a mutable temporal graph specifically.

The Problem: Vec Allocation Per Node

The original MinnsDB graph API exposed neighbor access through methods that collected edges into a freshly allocated Vec:

// Old API (simplified)
fn get_edges_from(&self, node_id: NodeId) -> Vec<&GraphEdge> {
    let mut edges = Vec::new();
    if let Some(edge_ids) = self.adjacency_out.get(node_id) {
        for &eid in edge_ids.iter() {
            if let Some(edge) = self.edges.get(eid) {
                edges.push(edge);
            }
        }
    }
    edges
}

In an Expand step processing 10,000 binding rows (one per node in the scan output), this allocates 10,000 Vecs. Each Vec allocation calls malloc, touches at least one cache line in the allocator's metadata, and the Vec itself is immediately dropped after iteration.

The fix is to expose the raw AdjList reference:

#[inline]
pub fn adjacency_out_ref(&self, node_id: NodeId) -> Option<&AdjList> {
    self.adjacency_out.get(node_id)
}

#[inline]
pub fn adjacency_in_ref(&self, node_id: NodeId) -> Option<&AdjList> {
    self.adjacency_in.get(node_id)
}

The executor iterates the AdjList directly, looking up each edge by ID and applying filters inline. Zero allocation per node. The total allocation for the entire Expand step is the output Vec<BindingRow>, which is pre-allocated with a capacity estimate:

let estimated = (rows.len() * 4).min(MAX_INTERMEDIATE_ROWS);
let mut out = Vec::with_capacity(estimated);

The estimate of 4 edges per node is a heuristic for the average degree in knowledge graphs. Over-estimating wastes memory; under-estimating triggers reallocation. The .min(MAX_INTERMEDIATE_ROWS) cap prevents pre-allocating absurdly large buffers for dense graph regions.

But exposing AdjList directly means the representation of that type now sits on the critical path. The rest of this article is about why that representation matters.

The Power-Law Problem

Knowledge graphs follow a power-law degree distribution. In a typical MinnsDB instance with 50,000 concept nodes:

~60% of nodes have 1-3 edges (a person, a place, a date mentioned once or twice)
~25% have 4-20 edges (frequently referenced entities)
~10% have 20-100 edges (hub entities like "user", "home", a company name)
~5% have 100-1000 edges (high-connectivity concepts accumulated over long conversations)
<0.1% have 1000+ edges (rare but real: the "user" node in a personal assistant with years of data)

A single adjacency list representation pays the wrong cost for at least one end of this distribution. A HashMap<NodeId, Vec<EdgeId>> wastes 48+ bytes per entry on a node with a single edge (HashMap overhead per bucket, the Vec header, alignment padding). A flat Vec<EdgeId> wastes time on a 10,000-edge node where checking whether a specific edge exists requires a linear scan. A BTreeSet<EdgeId> wastes space and cache locality on 1-edge nodes where a single u64 would suffice.

The Solution: Adaptive AdjList Enum

MinnsDB defines four representations and promotes between them based on edge count:

const ADJ_LARGE_THRESHOLD: usize = 1024;

pub enum AdjList {
    Empty,                          // 0 edges: 0 bytes of edge data
    One(EdgeId),                    // 1 edge: 8 bytes inline
    Small(SmallVec<[EdgeId; 8]>),   // 2-1024 edges: inline up to 8, heap beyond
    Large(BTreeSet<EdgeId>),        // 1025+ edges: O(log n) membership test
}

The transitions are monotonic during normal operation:

Empty + push = One(edge_id) (zero allocation)
One(x) + push = Small([x, new]) (SmallVec inline, no heap)
Small(sv) + push when sv.len() > 1024 = Large(BTreeSet) (one bulk conversion)
Large(set) + push = Large(set) (BTreeSet insert)

The retain method can demote: a Small with one remaining element becomes One, and a Small with zero elements becomes Empty. This matters during edge deletion and graph pruning.

This is not a novel data structure. It is a standard engineering pattern: match the representation to the workload distribution. The value is that it applies the right tradeoff at each tier of the degree distribution without requiring the caller to choose.

Memory Cost Analysis

For each variant, the cost per node for the adjacency list:

Variant	Edge count	Memory per adjacency list	Memory per edge
Empty	0	0 bytes (enum discriminant only)	N/A
One	1	8 bytes	8 bytes
Small (inline)	2-8	64 bytes (8 EdgeIds inline in SmallVec)	8-32 bytes
Small (spilled)	9-1024	24 bytes (SmallVec header) + n*8 on heap	~8 bytes
Large	1025+	~48 bytes (BTreeSet header) + n*8 in B-tree nodes	~10 bytes

The enum discriminant itself occupies 8 bytes (Rust aligns to the largest variant, which is SmallVec at ~72 bytes on x86-64 including the discriminant and inline buffer). But the key metric is not raw memory, it is cache behavior.

Cache Locality

SmallVec<[EdgeId; 8]> stores up to 8 edge IDs (64 bytes) inline within the struct. On typical x86_64 hardware, a cache line is 64 bytes. This means that for the common case of nodes with 8 or fewer edges, the entire adjacency list traversal is a sequential read of at most one cache line. No pointer chasing, no TLB misses, no random heap access.

When the SmallVec spills to the heap (9+ edges), the heap allocation is a contiguous [EdgeId] buffer. Iterating it is a sequential scan with hardware prefetch.

The BTreeSet variant for 1024+ edges trades some cache locality for O(log n) membership tests. When the executor needs to check "does node X already have an edge to node Y" (as in the merge operation), a linear scan of 5000 edges is unacceptable. The BTreeSet's B-tree nodes are themselves contiguous arrays, so the log(n) tree descent has reasonable cache behavior at each level.

The AdjListIter: Zero-Overhead Dispatch

Iterating an AdjList produces an AdjListIter enum that dispatches to the correct iterator variant:

pub enum AdjListIter<'a> {
    Empty,
    One(std::iter::Once<&'a EdgeId>),
    Small(std::slice::Iter<'a, EdgeId>),
    Large(std::collections::btree_set::Iter<'a, EdgeId>),
}

impl<'a> Iterator for AdjListIter<'a> {
    type Item = &'a EdgeId;

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            AdjListIter::Empty => None,
            AdjListIter::One(it) => it.next(),
            AdjListIter::Small(it) => it.next(),
            AdjListIter::Large(it) => it.next(),
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        match self {
            AdjListIter::Empty => (0, Some(0)),
            AdjListIter::One(it) => it.size_hint(),
            AdjListIter::Small(it) => it.size_hint(),
            AdjListIter::Large(it) => it.size_hint(),
        }
    }
}

impl<'a> ExactSizeIterator for AdjListIter<'a> {}

The ExactSizeIterator implementation is significant for downstream consumers. When the executor pre-allocates output vectors, it can use edge_ids.iter().len() to compute exact sizes instead of guessing. The match dispatch adds one branch per next() call. On modern branch predictors with speculative execution, this branch is almost always predicted correctly after the first iteration because the variant does not change within a single adjacency list traversal.

BFS Optimization: Pre-Allocated Collections

Variable-length path expansion uses BFS with explicit capacity management:

fn bfs_expand(
    &mut self,
    start: NodeId,
    edge_type: &Option<String>,
    direction: &AstDirection,
    min_hops: u32,
    max_hops: u32,
) -> Result<Vec<NodeId>, QueryError> {
    let mut visited = FxHashSet::with_capacity_and_hasher(256, Default::default());
    visited.insert(start);

    let mut queue: VecDeque<(NodeId, u32)> = VecDeque::with_capacity(256);
    queue.push_back((start, 0));

    let mut result = Vec::with_capacity(64);

    while let Some((current, depth)) = queue.pop_front() {
        if depth >= max_hops { continue; }
        if visited.len() >= MAX_BFS_VISITED { break; }

        let adj = match direction {
            AstDirection::Out => self.graph.adjacency_out_ref(current),
            AstDirection::In => self.graph.adjacency_in_ref(current),
        };
        if let Some(edge_ids) = adj {
            for &eid in edge_ids.iter() {
                // ... edge visibility, type matching, temporal check ...
                if visited.insert(next) {
                    let next_depth = depth + 1;
                    if next_depth >= min_hops {
                        result.push(next);
                    }
                    queue.push_back((next, next_depth));
                }
            }
        }
    }
    Ok(result)
}

Three design choices matter here:

1. FxHashSet instead of std HashSet. FxHash (from the rustc_hash crate, originally from the Rust compiler itself) uses a single multiply-shift operation as its mixing function. For integer keys like NodeId (which is u64), this is approximately 2x faster than the default SipHash hasher. SipHash was designed to resist hash-flooding denial-of-service attacks. For an internal visited set in BFS, there is no adversarial input, and the faster hash is strictly better.

2. Pre-allocated capacity of 256. The FxHashSet and VecDeque are initialized with capacity 256. This avoids the first few doublings (16 -> 32 -> 64 -> 128 -> 256) that a default-capacity collection would undergo during BFS on a moderately connected graph.

3. MAX_BFS_VISITED cap of 10,000. This prevents BFS from consuming unbounded memory on dense graph regions. The cap is a safety valve, not a performance optimization. It ensures that even pathological queries like MATCH (a)-[*1..100]->(b) RETURN b terminate in bounded time and memory.

How Other Representations Compare

Neo4j's doubly-linked list. Each relationship record contains pointers to the next relationship for the source and target nodes. Traversing all relationships of a node means following a linked list through the relationship store. This has poor cache locality because consecutive relationships are not necessarily stored adjacently. However, linked lists handle concurrent mutations well: inserting a new relationship requires updating only two pointers, with no reallocation or copying.

CSR (Compressed Sparse Row). Used by graph analytics frameworks like Apache Giraph and many in-memory graph libraries. Two arrays: an offset array (one entry per node) and an edge array (one entry per edge). Excellent cache locality, O(1) offset lookup. CSR is the best choice for static analytics graphs, but insertions and deletions require rebuilding the entire edge array (or maintaining a separate delta buffer), which makes it impractical for a mutable temporal graph that receives continuous edge additions and supersessions.

Adjacency HashMap. HashMap<NodeId, Vec<EdgeId>>. Good average-case O(1) lookup per node, O(degree) iteration. The constant factor is high: each Vec<EdgeId> is a separate heap allocation even for single-edge nodes. For a graph with 50,000 nodes, this means 50,000 Vec allocations.

MinnsDB's adaptive enum. Matches the cost to the actual degree. Single-edge nodes (the majority) pay 8 bytes and zero heap allocations. Medium-degree nodes get a contiguous buffer with good cache locality. High-degree nodes get O(log n) membership tests for merge operations. The tradeoff is a match dispatch on every operation, which costs one predicted branch. For a mutable temporal graph with a power-law degree distribution, this is the right set of tradeoffs. For a static analytics workload, CSR would be better.

The Underlying Principle

The adjacency list representation sets the floor for every graph operation. A BFS that touches 1000 nodes and 5000 edges performs 5000 next() calls on adjacency iterators. If each call involves a pointer chase (linked list) or a hash lookup (HashMap), the floor on typical x86_64 hardware is 5000 * ~50 ns = 250 microseconds just for the iterator overhead, before any actual computation. With inline SmallVec iteration, the same 5000 calls are sequential memory reads, dominated by L1/L2 cache hit latency of ~1-4 ns each, for a floor of ~5-20 microseconds. These numbers assume hot caches and no contention; real workloads will vary.

No amount of query optimization, predicate pushdown, or result caching can recover that difference. The data structure is the bottleneck, and the data structure must be chosen for the workload.

MinnsDB is source-available (AGPL-3.0) at minns.ai.