Indexing¶

Indexes are critical for query performance in Uni. This guide covers all index types, their use cases, and configuration options.

Index Types Overview¶

Uni supports five categories of indexes:

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                              INDEX TYPES                                                     │
├─────────────────────┬─────────────────────┬──────────────────────┬──────────────────────┬────────────────────┤
│    VECTOR INDEXES   │   SCALAR INDEXES    │   FULL-TEXT INDEXES  │   JSON FTS INDEXES   │  INVERTED INDEXES  │
├─────────────────────┼─────────────────────┼──────────────────────┼──────────────────────┼────────────────────┤
│ • HNSW              │ • BTree             │ • Inverted Index     │ • Lance Inverted     │ • Set Membership   │
│ • IVF_PQ            │ • Hash              │ • Tokenizers         │ • BM25 Ranking       │ • ANY IN patterns  │
│ • Flat (exact)      │ • Bitmap            │ • Scoring            │ • Path-Specific      │ • Tag filtering    │
├─────────────────────┼─────────────────────┼──────────────────────┼──────────────────────┼────────────────────┤
│ Similarity search   │ Exact/range queries │ Keyword search       │ JSON document search │ List membership    │
│ Nearest neighbors   │ Equality checks     │ Text matching        │ CONTAINS operator    │ Multi-value props  │
│ Embeddings          │ Sorting             │ Relevance ranking    │ Phrase search        │ Security filtering │
└─────────────────────┴─────────────────────┴──────────────────────┴──────────────────────┴────────────────────┘

Vector Indexes¶

Vector indexes enable fast approximate nearest neighbor (ANN) search on embedding columns.

Supported Algorithms¶

Algorithm	Quantization	Trade-offs
Flat	None	Perfect recall, O(n) speed. Best for < 10k vectors.
IVF-Flat	None	Partition-based, exact within partitions
IVF-SQ	Scalar (int8)	Large datasets, good recall/memory tradeoff
IVF-PQ	Product	Default. Best latency/compression tradeoff; very large datasets
IVF-RQ	RaBitQ (1-bit)	Better accuracy than PQ at similar compression
HNSW-Flat	None	Graph search, no compression loss
HNSW-SQ	Scalar (int8)	Graph search, low latency for mid-size datasets
HNSW-PQ	Product	Large datasets needing graph speed + compression
MUVERA	FDE projection	Multi-vector (ColBERT / late-interaction) — see MUVERA Multi-Vector Indexes

Distance Metrics¶

Metric	Formula	Use Case
Cosine	1 - (A·B)/(‖A‖‖B‖)	Normalized embeddings
L2	√Σ(aᵢ-bᵢ)²	Euclidean distance
Dot	-A·B	Inner product (unnormalized)
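
The three formulas in the table can be written out as a minimal sketch. The function names are illustrative and are not part of Uni's API; the zero-vector convention in `cosine_distance` is an assumption made here so the function never divides by zero.

```rust
/// Inner product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine distance: 1 - (A·B)/(‖A‖‖B‖).
/// Assumption: returns 1.0 when either vector has zero length.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let norm = dot(a, a).sqrt() * dot(b, b).sqrt();
    if norm == 0.0 {
        return 1.0;
    }
    1.0 - dot(a, b) / norm
}

/// L2 (Euclidean) distance: sqrt(Σ(aᵢ - bᵢ)²).
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Dot "distance": the negated inner product, so smaller still means closer.
fn dot_distance(a: &[f32], b: &[f32]) -> f32 {
    -dot(a, b)
}
```

For unit-length vectors, cosine distance and dot distance rank neighbors identically, which is why Cosine is listed for normalized embeddings.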

Creating Vector Indexes¶

Via Cypher:

CREATE VECTOR INDEX paper_embeddings
FOR (p:Paper)
ON p.embedding
OPTIONS {
  type: "hnsw_sq"
}

The type option in DDL selects the algorithm: flat, ivf_flat, ivf_sq, ivf_pq, ivf_rq, hnsw_flat, hnsw_sq (or hnsw), hnsw_pq, or muvera (for multi-vector columns). Parameters such as m, ef_construction, partitions, sub_vectors, and num_bits can also be passed in OPTIONS. When type is omitted, the default is IVF-PQ with cosine distance, and the same default applies on every surface (Cypher DDL, the uni.schema.createIndex procedure, the Python config map, and the Rust VectorAlgo builder).

HNSW Configuration¶

HNSW parameters are configurable via DDL or the Rust schema builder:

-- HNSW-SQ with custom parameters
CREATE VECTOR INDEX paper_embeddings FOR (p:Paper) ON p.embedding
OPTIONS { type: 'hnsw_sq', m: '32', ef_construction: '200' }

-- HNSW-Flat (no quantization; graph search over full-precision vectors)
CREATE VECTOR INDEX paper_embeddings FOR (p:Paper) ON p.embedding
OPTIONS { type: 'hnsw_flat', m: '16', ef_construction: '200' }

-- HNSW-SQ with IVF partitions for very large datasets (>1M vectors)
CREATE VECTOR INDEX paper_embeddings FOR (p:Paper) ON p.embedding
OPTIONS { type: 'hnsw_sq', partitions: '32' }

use uni_db::{DataType, IndexType, VectorAlgo, VectorIndexCfg, VectorMetric};

db.schema()
    .label("Paper")
        .property("embedding", DataType::Vector { dimensions: 768 })
        .index("embedding", IndexType::Vector(VectorIndexCfg {
            algorithm: VectorAlgo::HnswSq { m: 32, ef_construction: 200, partitions: None },
            metric: VectorMetric::Cosine,
            embedding: None,
        }))
    .apply()
    .await?;

IVF Configuration¶

-- IVF-PQ with tuned parameters
CREATE VECTOR INDEX product_embeddings FOR (p:Product) ON p.embedding
OPTIONS { type: 'ivf_pq', partitions: '256', sub_vectors: '16' }

-- IVF-RQ (RaBitQ — 1 bit per dimension, best accuracy/compression)
CREATE VECTOR INDEX product_embeddings FOR (p:Product) ON p.embedding
OPTIONS { type: 'ivf_rq', partitions: '256' }

-- IVF-RQ with higher fidelity (4 bits per dimension)
CREATE VECTOR INDEX product_embeddings FOR (p:Product) ON p.embedding
OPTIONS { type: 'ivf_rq', partitions: '256', num_bits: '4' }

db.schema()
    .label("Product")
        .property("embedding", DataType::Vector { dimensions: 384 })
        .index("embedding", IndexType::Vector(VectorIndexCfg {
            algorithm: VectorAlgo::IvfRq { partitions: 256, num_bits: None },
            metric: VectorMetric::Cosine,
            embedding: None,
        }))
    .apply()
    .await?;

MUVERA Multi-Vector Indexes¶

MUVERA indexes accelerate multi-vector (ColBERT / late-interaction) columns — a List<Vector(dim)> property storing one vector per token. MUVERA encodes each row's variable set of token vectors into a single fixed-dimensional FDE (Fixed-Dimensional Encoding) vector, stored in a derived internal column (__fde_<index_name>), and indexes that with a standard single-vector ANN. At query time the FDE drives fast first-stage retrieval, then candidates are re-scored with exact MaxSim.

-- Defaults: k_sim=4, reps=20, d_proj=16, inner=ivf_pq
CREATE VECTOR INDEX doc_tokens FOR (d:Document) ON d.tokens
OPTIONS { type: 'muvera' }

-- Tuned FDE parameters + explicit inner ANN
CREATE VECTOR INDEX doc_tokens FOR (d:Document) ON d.tokens
OPTIONS { type: 'muvera', k_sim: 4, reps: 20, d_proj: 16, inner: 'ivf_pq' }

Option	Default	Description
`k_sim`	4	SimHash hyperplanes per repetition (`2^k_sim` buckets)
`reps`	20	Independent repetitions concatenated into the FDE
`d_proj`	16	Inner projection dimension (`0` = no projection)
`inner`	`ivf_pq`	Single-vector ANN over the derived FDE column

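The derived __fde_* column is internal: it never appears in RETURN, SHOW INDEXES, or uni.schema.labelInfo. Because the pipeline always finishes with an exact MaxSim re-rank, a poorly tuned FDE only costs recall (relevant rows can be missed during first-stage retrieval), never precision: the rows that are returned are scored and ordered exactly.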
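
The re-rank stage can be illustrated with a small sketch of MaxSim as it is usually defined for late-interaction models: for each query token, take the best dot product over the document's tokens, then sum. The names `max_sim` and `rerank` are hypothetical, and scoring an empty document as 0.0 is an assumption made here; Uni's internal implementation may differ.

```rust
/// MaxSim score between a query and a document, each given as a list of token vectors.
/// Assumption: an empty document scores 0.0.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                // Dot product between one query token and one document token.
                .map(|d| q.iter().zip(d).map(|(x, y)| x * y).sum::<f32>())
                // Keep the best-matching document token for this query token.
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

/// Re-score first-stage candidates exactly and keep the k best (highest score first).
fn rerank(
    query: &[Vec<f32>],
    candidates: &[(u64, Vec<Vec<f32>>)],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = candidates
        .iter()
        .map(|(id, tokens)| (*id, max_sim(query, tokens)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A candidate that the first stage never retrieves cannot be recovered by `rerank`, which is why FDE quality affects recall rather than the scores of returned rows.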

Tune FDE parameters per corpus

The defaults are reasonable starting points, not validated for any specific corpus. FDE recall is corpus-dependent — measure recall on your own data and adjust reps/k_sim to trade recall against FDE size. MUVERA is opt-in; the default vector index remains IVF-PQ.
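
To get a feel for how reps, k_sim, and d_proj affect FDE size, the sketch below computes the encoding length under the assumption that Uni follows the usual MUVERA construction: one block per SimHash bucket per repetition, each block of length d_proj (or the token dimension when d_proj is 0). The helper `fde_dim` is hypothetical, and this guide does not confirm the exact layout.

```rust
/// Hypothetical helper: FDE length implied by the index options.
/// Assumes length = reps * 2^k_sim * block, where block is d_proj,
/// or the original token dimension when d_proj == 0 (no projection).
fn fde_dim(k_sim: u32, reps: usize, d_proj: usize, token_dim: usize) -> usize {
    let buckets = 1usize << k_sim; // 2^k_sim SimHash buckets per repetition
    let block = if d_proj == 0 { token_dim } else { d_proj };
    reps * buckets * block
}
```

Under that assumption the defaults (k_sim=4, reps=20, d_proj=16) give 20 × 16 × 16 = 5120 dimensions per row, and raising k_sim by one doubles it.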

See Vector Search for the full multi-vector storage + MaxSim query model.

Query-Time Tuning¶

Vector ANN queries accept tuning options in the procedure OPTIONS map to trade recall against latency:

Option	Type	Applies to	Description
`nprobes`	Integer	IVF indexes	Number of IVF partitions to probe (higher = better recall, slower)
`refine_factor`	Integer	IVF-PQ/SQ	Re-rank `refine_factor × k` candidates with exact distances (recovers quantization error)
`over_fetch`	Float	multi-vector	Candidate over-fetch multiplier before MaxSim re-rank (default `4.0`)

CALL uni.vector.query('Paper', 'embedding', $query_vector, 10, NULL, NULL,
  { nprobes: 8, refine_factor: 10 })
YIELD node, score
RETURN node.title, score
ORDER BY score DESC

refine_factor is the main recall knob for quantized (PQ/SQ) indexes: even a small value (e.g. 10) typically recovers most of the recall lost to quantization.
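
The refinement step can be sketched as follows: keep the best `refine_factor × k` candidates by approximate distance, recompute exact distances for just those, and return the top k. The function `refine` and its `exact` callback are hypothetical stand-ins for illustration, not Uni's API.

```rust
/// Re-rank the best `refine_factor * k` approximate candidates by exact distance.
/// `approx` holds (id, approximate distance); `exact` returns the true distance for an id.
fn refine<F>(
    mut approx: Vec<(u64, f32)>,
    k: usize,
    refine_factor: usize,
    exact: F,
) -> Vec<(u64, f32)>
where
    F: Fn(u64) -> f32,
{
    // Shortlist by approximate (quantized) distance, smallest first.
    approx.sort_by(|a, b| a.1.total_cmp(&b.1));
    approx.truncate(k * refine_factor.max(1));

    // Recompute exact distances for the shortlist only.
    let mut exact_scored: Vec<(u64, f32)> = approx
        .into_iter()
        .map(|(id, _)| (id, exact(id)))
        .collect();
    exact_scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    exact_scored.truncate(k);
    exact_scored
}
```

With `refine_factor = 1` the shortlist is exactly k rows, so quantization error can push a true neighbor out before it is ever re-scored; a larger factor widens the shortlist at the cost of more exact distance computations.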

Querying Vector Indexes¶

Procedure Call:

CALL uni.vector.query('Paper', 'embedding', $query_vector, 10)
YIELD node, distance
RETURN node.title, distance
ORDER BY distance

With Threshold:

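// 0.2 is passed as the procedure's threshold argument; the WHERE clause below narrows the results further
CALL uni.vector.query('Paper', 'embedding', $query_vector, 100, NULL, 0.2)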
YIELD node, distance
WHERE distance < 0.15
RETURN node.title, distance

Hybrid (Vector + Graph):

CALL uni.vector.query('Paper', 'embedding', $query_vector, 10)
YIELD node AS paper, distance
MATCH (paper)-[:AUTHORED_BY]->(author:Author)
RETURN paper.title, author.name, distance

Scalar Indexes¶

Scalar indexes optimize exact match and range queries on primitive properties.

Index Types¶

Type	Operations	Best For
BTree	`=`, `<`, `>`, `<=`, `>=`, `BETWEEN`	General purpose, range queries
Hash	`=`, `IN`	High-cardinality equality lookups
Bitmap	`=`, `IN`	Low-cardinality, enum-like columns (< 1000 distinct values), boolean flags
LabelList	`array_contains_any`, `array_contains_all`	List columns with tag/category filtering

Creating Scalar Indexes¶

BTree Index (default):

CREATE INDEX author_email FOR (a:Author) ON (a.email)

Via procedure API (Bitmap and LabelList):

CALL uni.schema.createIndex('Event', 'status', {"type": "BITMAP"})
CALL uni.schema.createIndex('Doc', 'tags', {"type": "LABEL_LIST"})

The storage layer supports BTree, Hash, Bitmap, and LabelList scalar indexes. BTree is the default when no type is specified.

Composite Indexes¶

Index multiple properties together:

CREATE INDEX paper_venue_year FOR (p:Paper) ON (p.venue, p.year)

Query utilization:

// Uses index (prefix match)
MATCH (p:Paper) WHERE p.venue = 'NeurIPS' AND p.year > 2020

// Uses index (first column only)
MATCH (p:Paper) WHERE p.venue = 'NeurIPS'

// Does NOT use index (missing prefix)
MATCH (p:Paper) WHERE p.year > 2020
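
The prefix rule illustrated by these three queries can be sketched as a small function. It assumes conventional B-tree behavior (equality predicates extend the usable prefix, and a range predicate uses its column but ends the prefix); the names `Op` and `usable_prefix` are hypothetical, and Uni's planner may apply additional rules.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    Range,
}

/// Number of leading index columns a set of predicates can use.
/// 0 means the composite index cannot serve the query.
fn usable_prefix(index_cols: &[&str], predicates: &[(&str, Op)]) -> usize {
    let mut used = 0;
    for col in index_cols {
        match predicates.iter().find(|(c, _)| c == col) {
            // Equality on this column: keep extending the prefix.
            Some((_, Op::Eq)) => used += 1,
            // Range on this column: usable, but later columns are not.
            Some((_, Op::Range)) => {
                used += 1;
                break;
            }
            // No predicate on this column: the prefix ends here.
            None => break,
        }
    }
    used
}
```

For the index on (venue, year), a query with `venue =` and `year >` uses both columns, `venue =` alone uses one, and `year >` alone uses none.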

Index Selection¶

Uni's query planner automatically selects indexes:

Query: MATCH (p:Paper) WHERE p.year > 2020 AND p.venue = 'NeurIPS' RETURN p.title

Plan:
├── Project [p.title]
│   └── Scan [:Paper]
│         ↳ Index: paper_venue_year (venue='NeurIPS', year>2020)
│         ↳ Predicate Pushdown: venue = 'NeurIPS' AND year > 2020

Full-Text Indexes¶

Full-text indexes enable keyword search within text properties.

Creating Full-Text Indexes¶

CREATE FULLTEXT INDEX paper_search
FOR (p:Paper)
ON EACH [p.title, p.abstract]

Tokenizers¶

Tokenizer	Description	Example
`standard`	Unicode word boundaries	"Hello, World!" → ["hello", "world"]
`whitespace`	Split on whitespace only	"Hello, World!" → ["hello,", "world!"]
`ngram`	Character n-grams	"cat" → ["ca", "at"] (bigrams)
`keyword`	No tokenization	"Hello World" → ["hello world"]

Note: Tokenizer configuration is not yet exposed via DDL; the default is standard.
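
To make the differences between the tokenizers concrete, here is a minimal sketch that reproduces the examples in the table. These functions are illustrative only and are not Uni's tokenizers; in particular, `standard` approximates Unicode word boundaries by splitting on non-alphanumeric characters, and lowercasing in every tokenizer is inferred from the table's examples.

```rust
/// Approximation of the `standard` tokenizer: split on non-alphanumeric characters.
fn standard(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// `whitespace`: split on whitespace only, so punctuation stays attached.
fn whitespace(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// `ngram`: sliding character windows of length `n`.
fn ngram(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// `keyword`: the whole input is a single token.
fn keyword(text: &str) -> Vec<String> {
    vec![text.to_lowercase()]
}
```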

Querying Full-Text Indexes¶

MATCH (p:Paper)
WHERE p.title CONTAINS 'transformer' OR p.abstract CONTAINS 'attention'
RETURN p.title
LIMIT 10

Boolean Operators:

// AND (default)
'transformer attention'  // Both terms required

// OR
'transformer OR attention'

// NOT
'transformer NOT vision'

// Phrase
'"attention mechanism"'

// Wildcard
'transform*'

JSON Full-Text Indexes¶

JSON Full-Text indexes enable BM25-based full-text search on JSON document columns, leveraging Lance's native inverted index.

When to Use JSON FTS¶

Use Case	Index Type
Search within JSON documents	JSON Full-Text Index
Keyword/phrase search in text fields	JSON Full-Text Index
Exact JSON path matching	JsonPath Index
Equality filters on scalar fields	Scalar Index

Creating JSON Full-Text Indexes¶

Via Cypher:

CREATE JSON FULLTEXT INDEX article_fts
FOR (a:Article) ON _doc

With Options:

CREATE JSON FULLTEXT INDEX article_fts
FOR (a:Article) ON _doc
OPTIONS { with_positions: true }

The with_positions option enables phrase search by storing term positions.

If Not Exists:

CREATE JSON FULLTEXT INDEX article_fts IF NOT EXISTS
FOR (a:Article) ON _doc

Querying with CONTAINS¶

Use the CONTAINS operator to perform full-text search on FTS-indexed columns:

// Basic full-text search
MATCH (a:Article)
WHERE a._doc CONTAINS 'graph database'
RETURN a.title

// Path-specific search (searches within a JSON path)
MATCH (a:Article)
WHERE a._doc.title CONTAINS 'graph'
RETURN a.title

// Combined with exact matching
MATCH (a:Article)
WHERE a._doc.title CONTAINS 'graph' AND a.status = 'published'
RETURN a.title

Query Routing Priority¶

The query planner routes predicates to the most efficient index:

1. _uid = 'xxx'           → UidIndex (O(1) lookup)
2. column CONTAINS 'term' → Lance FTS (BM25 ranking)
3. path = 'exact'         → JsonPathIndex (exact match)
4. Pushable predicates    → Lance scan filter
5. Else                   → Residual (post-load filter)
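
The priority list reads naturally as a first-match-wins chain. The sketch below models it with a hypothetical `Predicate` struct whose boolean fields stand in for catalog lookups (whether the column is FTS-indexed, whether a JsonPath index covers it, whether the predicate is pushable); none of these types exist in Uni's API under these names.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    UidIndex,
    LanceFts,
    JsonPathIndex,
    LanceScanFilter,
    Residual,
}

/// Hypothetical, simplified view of one predicate plus what the catalog knows about it.
struct Predicate<'a> {
    column: &'a str,
    op: &'a str,
    fts_indexed: bool,
    json_path_indexed: bool,
    pushable: bool,
}

/// Apply the routing rules in priority order; the first matching rule wins.
fn route(p: &Predicate<'_>) -> Route {
    if p.column == "_uid" && p.op == "=" {
        Route::UidIndex
    } else if p.op == "CONTAINS" && p.fts_indexed {
        Route::LanceFts
    } else if p.op == "=" && p.json_path_indexed {
        Route::JsonPathIndex
    } else if p.pushable {
        Route::LanceScanFilter
    } else {
        Route::Residual
    }
}
```

A CONTAINS predicate on a column without an FTS index falls through to the last rule and is evaluated as a post-load filter.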

JSON FTS Configuration¶

Parameter	Default	Description
`with_positions`	false	Enable phrase search (stores term positions)

How It Works¶

JSON Full-Text indexes use Lance's inverted index with triplet tokenization:

Document: { "title": "Graph Databases", "year": 2024 }
         ↓
Tokens:  (title, string, "graph"), (title, string, "databases"), (year, int, 2024)
         ↓
Query:   title:graph → Matches documents with "graph" in title path

Inverted Indexes¶

Inverted indexes enable efficient filtering on List<String> properties, ideal for tag-based access control and multi-value attribute queries.

Use Cases¶

Use Case	Query Pattern	Benefit
Tag filtering	`ANY(tag IN d.tags WHERE tag IN $allowed)`	O(k) vs O(n) scan
Security labels	Filter by granted access tags	Multi-tenant filtering
Categories	Documents in multiple categories	Efficient set intersection
Skills matching	Users with any required skill	Fast membership checks

Creating Inverted Indexes¶

Via Schema:

{
  "indexes": {
    "document_tags": {
      "type": "inverted",
      "label": "Document",
      "property": "tags",
      "config": {
        "normalize": true,
        "max_terms_per_doc": 10000
      }
    }
  }
}

Via Cypher:

CREATE INVERTED INDEX document_tags
FOR (d:Document)
ON d.tags
OPTIONS { normalize: true, max_terms_per_doc: 10000 }

Via Rust API:

db.schema()
    .label("Document")
        .property("tags", DataType::List(Box::new(DataType::String)))
        .index("tags", IndexType::Inverted(InvertedIndexConfig {
            normalize: true,
            max_terms_per_doc: 10_000,
        }))
    .apply()
    .await?;

Inverted Index Configuration¶

Parameter	Default	Description
`normalize`	`true`	Lowercase and trim whitespace on terms
`max_terms_per_doc`	`10_000`	Maximum terms per document (DoS protection)

Query Patterns¶

ANY IN pattern (optimized):

// Finds documents with ANY of the specified tags
MATCH (d:Document)
WHERE ANY(tag IN d.tags WHERE tag IN ['public', 'team:eng'])
RETURN d.title

With session variables (multi-tenant):

// Security filtering with session-based permissions
MATCH (d:Document)
WHERE d.tenant_id = $session.tenant_id
  AND ANY(tag IN d.tags WHERE tag IN $session.granted_tags)
RETURN d

How Inverted Indexes Work¶

Document 1: tags = ['rust', 'database']
Document 2: tags = ['python', 'ml']
Document 3: tags = ['rust', 'ml']
         ↓
Inverted Index (term → VID list):
  'rust'     → [vid_1, vid_3]
  'database' → [vid_1]
  'python'   → [vid_2]
  'ml'       → [vid_2, vid_3]
         ↓
Query: ANY(tag IN d.tags WHERE tag IN ['rust', 'python'])
Result: Union of 'rust' and 'python' → [vid_1, vid_2, vid_3]
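
The diagram above can be reproduced with a toy in-memory structure. This is a sketch for intuition, not Uni's storage format: `InvertedIndex`, `insert`, and `any_in` are hypothetical names, VIDs are plain `u64` values, and the `max_terms_per_doc` limit is left out for brevity.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy inverted index: term -> list of VIDs that carry the term.
struct InvertedIndex {
    postings: HashMap<String, Vec<u64>>,
    normalize: bool,
}

impl InvertedIndex {
    fn new(normalize: bool) -> Self {
        Self { postings: HashMap::new(), normalize }
    }

    /// Lowercase and trim the term when normalization is enabled.
    fn term(&self, raw: &str) -> String {
        if self.normalize {
            raw.trim().to_lowercase()
        } else {
            raw.to_string()
        }
    }

    /// Record every tag of one document under its VID.
    fn insert(&mut self, vid: u64, tags: &[&str]) {
        for tag in tags {
            let term = self.term(tag);
            self.postings.entry(term).or_default().push(vid);
        }
    }

    /// ANY(tag IN d.tags WHERE tag IN wanted): union of the posting lists, sorted and deduplicated.
    fn any_in(&self, wanted: &[&str]) -> Vec<u64> {
        let mut out = BTreeSet::new();
        for w in wanted {
            if let Some(vids) = self.postings.get(&self.term(w)) {
                out.extend(vids.iter().copied());
            }
        }
        out.into_iter().collect()
    }
}
```

The lookup touches only the posting lists of the requested terms, so its cost depends on the number of matching entries rather than on the total number of documents.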

Query Planner Integration¶

When an inverted index exists on a List<String> property, the query planner automatically rewrites ANY IN patterns to use index lookups:

Query: MATCH (d:Document) WHERE ANY(tag IN d.tags WHERE tag IN $allowed) RETURN d

Without Index:
├─ Full Scan: Document
└─ Filter: ANY(tag IN d.tags WHERE tag IN $allowed)  // O(n × m)

With Inverted Index:
├─ Inverted Index Lookup: tags IN $allowed           // O(k)
└─ Fetch: Document properties

Performance Comparison¶

Scenario	Without Index	With Index	Speedup
1M docs, 10 tags each, query 3 tags	~5s scan	~10ms	500x
100K docs, security filter	~500ms	~5ms	100x
Multi-value category filter	~1s	~15ms	67x

Index Management¶

List Indexes¶

SHOW INDEXES

For more detail:

CALL uni.schema.indexes()

Drop Indexes¶

DROP INDEX paper_year

Rebuild Indexes¶

// Rust API: rebuild the indexes on the Paper label (the second argument is the async_ flag)
db.rebuild_indexes("Paper", false).await?;

Index Lifecycle Management¶

Uni tracks the lifecycle of each index via an IndexStatus state machine. This ensures queries only use up-to-date indexes, while queries that would otherwise hit a stale or rebuilding index transparently fall back to full scans.

Index States¶

Status	Description	Used by Query Planner?
Online	Index is up-to-date and queryable	Yes
Building	Rebuild is in progress	No (falls back to scan)
Stale	Outdated, scheduled for rebuild	No (falls back to scan)
Failed	Rebuild failed after retries exhausted	No (falls back to scan)

Status gating: The query planner only uses Online indexes. When an index is in any other state, queries transparently fall back to a full scan — no errors, no user intervention required.

State Transitions¶

Online ──(data changes exceed trigger)──► Stale
Stale  ──(rebuild starts)──────────────► Building
Building ──(success)───────────────────► Online
Building ──(failure, retries left)─────► Stale (retry after delay)
Building ──(failure, retries exhausted)► Failed
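
The transitions above can be modeled as a pure function over the current status and an event. `IndexStatus` and its four variants come from this guide; the `Event` enum and the functions `next` and `planner_may_use` are hypothetical names introduced for this sketch, and unlisted (status, event) pairs are assumed to leave the status unchanged.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexStatus {
    Online,
    Building,
    Stale,
    Failed,
}

/// Hypothetical events that drive the state machine.
#[derive(Debug, Clone, Copy)]
enum Event {
    TriggerFired,
    RebuildStarted,
    RebuildSucceeded,
    RebuildFailed { retries_left: u32 },
}

/// Apply one event; pairs not listed in the transition table are treated as no-ops.
fn next(status: IndexStatus, event: Event) -> IndexStatus {
    use IndexStatus::*;
    match (status, event) {
        (Online, Event::TriggerFired) => Stale,
        (Stale, Event::RebuildStarted) => Building,
        (Building, Event::RebuildSucceeded) => Online,
        (Building, Event::RebuildFailed { retries_left }) if retries_left > 0 => Stale,
        (Building, Event::RebuildFailed { .. }) => Failed,
        (unchanged, _) => unchanged,
    }
}

/// Status gating: only Online indexes are eligible for query planning.
fn planner_may_use(status: IndexStatus) -> bool {
    status == IndexStatus::Online
}
```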

Automatic Rebuild Triggers¶

When auto_rebuild_enabled is true, the background worker checks indexes after each flush and marks them Stale when either trigger fires:

Trigger	Condition	Default
Growth	`current_rows > row_count_at_build × (1 + growth_trigger_ratio)`	50% growth (`0.5`)
Age	`time_since_last_build > max_index_age`	Disabled (`None`)

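Set growth_trigger_ratio to 0.0 to disable the growth trigger (a value of 0.0 turns the check off; it does not mean "rebuild on any growth"). Set max_index_age to Some(Duration::from_secs(3600)) to enable time-based rebuilds, in this case after one hour.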
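
The two trigger conditions can be combined into a single check, sketched below. The `Triggers` struct and `should_mark_stale` function are hypothetical; only the field names and the two conditions come from this guide, and the explicit `> 0.0` guard reflects the rule that a ratio of 0.0 disables the growth trigger.

```rust
use std::time::Duration;

/// Hypothetical subset of the rebuild settings relevant to staleness checks.
struct Triggers {
    growth_trigger_ratio: f64,
    max_index_age: Option<Duration>,
}

/// True when either the growth trigger or the age trigger fires.
fn should_mark_stale(
    t: &Triggers,
    rows_at_build: u64,
    current_rows: u64,
    since_build: Duration,
) -> bool {
    // Growth: current_rows > row_count_at_build * (1 + ratio); a ratio of 0.0 disables it.
    let growth = t.growth_trigger_ratio > 0.0
        && (current_rows as f64) > (rows_at_build as f64) * (1.0 + t.growth_trigger_ratio);
    // Age: fires only when max_index_age is configured.
    let age = t.max_index_age.map_or(false, |max| since_build > max);
    growth || age
}
```

With the default ratio of 0.5, an index built at 1,000 rows is still considered fresh at exactly 1,500 rows and becomes stale at 1,501.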

Configuration¶

Index lifecycle is configured via IndexRebuildConfig:

use std::time::Duration;
use uni_db::UniConfig;

let mut config = UniConfig::default();
config.index_rebuild.auto_rebuild_enabled = true;   // Enable automatic rebuilds (default: false)
config.index_rebuild.growth_trigger_ratio = 0.5;    // Rebuild after 50% row growth (default: 0.5)
config.index_rebuild.max_index_age = None;          // Time-based trigger (default: None/disabled)
config.index_rebuild.max_retries = 3;               // Retry failed rebuilds (default: 3)
config.index_rebuild.retry_delay = Duration::from_secs(60); // Delay between retries (default: 60s)

Field	Type	Default	Description
`auto_rebuild_enabled`	`bool`	`false`	Enable automatic index rebuilds
`growth_trigger_ratio`	`f64`	`0.5`	Row growth ratio to trigger rebuild (0.0 disables)
`max_index_age`	`Option<Duration>`	`None`	Max time since last build before triggering rebuild
`max_retries`	`u32`	`3`	Maximum rebuild attempts before marking `Failed`
`retry_delay`	`Duration`	`60s`	Delay between retry attempts

Index Storage¶

Indexes are stored within the Lance dataset structure:

storage/
├── vertices_Paper/
│   ├── data/
│   │   └── *.lance
│   ├── _indices/                    # Lance native indexes
│   │   └── embedding_idx-uuid/      # Vector index
│   │       ├── index.idx
│   │       └── aux/
│   └── _versions/
└── indexes/
    ├── scalar_paper_year/           # Separate scalar index
    │   └── index.lance
    └── fulltext_paper_search/       # Full-text index
        └── index/

Predicate Pushdown¶

Indexes integrate with Uni's predicate pushdown optimization:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        PREDICATE PUSHDOWN FLOW                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Query: MATCH (p:Paper) WHERE p.year > 2020 AND p.title CONTAINS 'AI'     │
│                                                                             │
│   1. Predicate Analysis                                                     │
│      ├── p.year > 2020      → Pushable (scalar index or Lance filter)      │
│      └── p.title CONTAINS   → Residual (post-load filter)                  │
│                                                                             │
│   2. Index Selection                                                        │
│      └── paper_year index available? Yes → Use index scan                  │
│                                                                             │
│   3. Execution                                                              │
│      ├── Index Scan: year > 2020 → VIDs [v1, v2, v3, ...]                  │
│      ├── Load Properties: title for filtered VIDs                          │
│      └── Residual Filter: title CONTAINS 'AI'                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Pushable Predicates¶

Predicate	Index Type	Pushed?
`p.x = 5`	BTree	Yes
`p.x > 5`	BTree	Yes
`p.x IN [1,2,3]`	BTree	Yes
`p.x IS NULL`	BTree	Yes
`p._doc CONTAINS 'foo'`	JSON FTS	Yes (if FTS-indexed)
`p.x CONTAINS 'foo'`	None	No (residual, if not FTS-indexed)
`p.x STARTS WITH 'foo'`	BTree	Partial
`func(p.x) = 5`	None	No (residual)

Best Practices¶

When to Create Indexes¶

✓ CREATE INDEX when:
  • Property appears in WHERE clauses frequently
  • Property is used to join patterns (e.g., matching nodes on a shared property value)
  • Property is used in ORDER BY
  • Range queries on numeric/date properties (BTree)

✗ AVOID INDEX when:
  • Property rarely queried
  • Very small dataset (<1000 rows)
  • Property updated frequently
  • Very low selectivity (e.g., boolean with 50/50 split)

Index Sizing¶

Index Type	Memory Formula	Example (1M vectors, 768d)
HNSW	~1.5x vectors × (4 + m×8) bytes	~120 MB
IVF_PQ	vectors × (d/sub_vectors) bytes	~24 MB
BTree	~40 bytes per key	~40 MB

Index Maintenance¶

SHOW INDEXES

For rebuilds, use the Rust API (db.rebuild_indexes("Label", async_)) or drop/recreate the index in Cypher.

Performance Comparison¶

Query Type	Without Index	With Index	Speedup
Point lookup	O(n) scan	O(log n) BTree	1000x+
Range query	O(n) scan	O(log n + k)	100x+
Vector KNN	O(n×d) brute	O(log n) HNSW	1000x+
Full-text	O(n×len) scan	O(log n) inverted	100x+

Next Steps¶

Vector Search Guide — Deep dive into similarity search
Performance Tuning — Optimization strategies
Query Planning — How indexes are selected