Skip to content

Indexing

Indexes are critical for query performance in Uni. This guide covers all index types, their use cases, and configuration options.

Index Types Overview

Uni supports five categories of indexes:

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                              INDEX TYPES                                                     │
├─────────────────────┬─────────────────────┬──────────────────────┬──────────────────────┬────────────────────┤
│    VECTOR INDEXES   │   SCALAR INDEXES    │   FULL-TEXT INDEXES  │   JSON FTS INDEXES   │  INVERTED INDEXES  │
├─────────────────────┼─────────────────────┼──────────────────────┼──────────────────────┼────────────────────┤
│ • HNSW              │ • BTree             │ • Inverted Index     │ • Lance Inverted     │ • Set Membership   │
│ • IVF_PQ            │ • Hash              │ • Tokenizers         │ • BM25 Ranking       │ • ANY IN patterns  │
│ • Flat (exact)      │ • Bitmap            │ • Scoring            │ • Path-Specific      │ • Tag filtering    │
├─────────────────────┼─────────────────────┼──────────────────────┼──────────────────────┼────────────────────┤
│ Similarity search   │ Exact/range queries │ Keyword search       │ JSON document search │ List membership    │
│ Nearest neighbors   │ Equality checks     │ Text matching        │ CONTAINS operator    │ Multi-value props  │
│ Embeddings          │ Sorting             │ Relevance ranking    │ Phrase search        │ Security filtering │
└─────────────────────┴─────────────────────┴──────────────────────┴──────────────────────┴────────────────────┘

Vector Indexes

Vector indexes enable fast approximate nearest neighbor (ANN) search on embedding columns.

Supported Algorithms

Algorithm Description Trade-offs
HNSW Hierarchical Navigable Small World Best recall, higher memory
IVF_PQ Inverted File + Product Quantization Lower memory, good recall
Flat Exact brute-force search Perfect recall, O(n) speed

Distance Metrics

Metric Formula Use Case
Cosine 1 - (A·B)/(‖A‖‖B‖) Normalized embeddings
L2 √Σ(aᵢ-bᵢ)² Euclidean distance
Dot -A·B Inner product (unnormalized)

Creating Vector Indexes

Via Cypher:

CREATE VECTOR INDEX paper_embeddings
FOR (p:Paper)
ON p.embedding
OPTIONS {
  index_type: "hnsw",
  metric: "cosine",
  m: 16,
  ef_construction: 200
}

Via CLI:

uni index create vector \
    --name paper_embeddings \
    --label Paper \
    --property embedding \
    --type hnsw \
    --metric cosine \
    --m 16 \
    --ef-construction 200 \
    --path ./storage

HNSW Configuration

Parameter Default Description
m 16 Max connections per layer (higher = better recall, more memory)
ef_construction 200 Build-time search width (higher = better index, slower build)
ef_search 100 Query-time search width (higher = better recall, slower query)

Tuning Guidelines:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           HNSW Tuning Guide                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Recall vs Speed:                                                          │
│   ────────────────                                                          │
│   Higher ef_search → Better recall, slower queries                          │
│   Lower ef_search  → Faster queries, may miss results                       │
│                                                                             │
│   Memory vs Recall:                                                         │
│   ─────────────────                                                         │
│   Higher m → More connections, better recall, more memory                   │
│   Lower m  → Less memory, potentially lower recall                          │
│                                                                             │
│   Recommended Starting Points:                                              │
│   ────────────────────────────                                              │
│   Small dataset (<100K):  m=16, ef_construction=100, ef_search=50           │
│   Medium (100K-1M):       m=32, ef_construction=200, ef_search=100          │
│   Large (>1M):            m=48, ef_construction=400, ef_search=150          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

IVF_PQ Configuration

Parameter Default Description
num_partitions 256 Number of IVF clusters (√n is a good starting point)
num_sub_vectors 8 PQ compression factor (dimension / sub_vectors = bytes per vector)
num_probes 20 Clusters to search at query time

Example:

CREATE VECTOR INDEX product_embeddings
FOR (p:Product)
ON p.embedding
OPTIONS {
  index_type: "ivf_pq",
  metric: "cosine",
  num_partitions: 1024,
  num_sub_vectors: 32,
  num_probes: 50
}

Querying Vector Indexes

Procedure Call:

CALL db.idx.vector.query('Paper', 'embedding', $query_vector, 10)
YIELD node, distance
RETURN node.title, distance
ORDER BY distance

With Threshold:

CALL db.idx.vector.query('Paper', 'embedding', $query_vector, 100, 0.2)
YIELD node, distance
WHERE distance < 0.15
RETURN node.title, distance

Hybrid (Vector + Graph):

CALL db.idx.vector.query('Paper', 'embedding', $query_vector, 10)
YIELD node as paper, distance
MATCH (paper)-[:AUTHORED_BY]->(author:Author)
RETURN paper.title, author.name, distance

Scalar Indexes

Scalar indexes optimize exact match and range queries on primitive properties.

Index Types

Type Operations Best For
BTree =, <, >, <=, >=, BETWEEN General purpose, range queries
Hash = only High-cardinality exact match
Bitmap =, IN Low-cardinality categorical

Creating Scalar Indexes

BTree Index (default):

CREATE INDEX author_email FOR (a:Author) ON a.email

Explicit Type:

CREATE INDEX paper_year FOR (p:Paper) ON p.year OPTIONS { type: "btree" }
CREATE INDEX user_country FOR (u:User) ON u.country OPTIONS { type: "bitmap" }

Via CLI:

uni index create scalar \
    --name paper_year \
    --label Paper \
    --property year \
    --type btree \
    --path ./storage

Composite Indexes

Index multiple properties together:

CREATE INDEX paper_venue_year FOR (p:Paper) ON (p.venue, p.year)

Query utilization:

// Uses index (prefix match)
MATCH (p:Paper) WHERE p.venue = 'NeurIPS' AND p.year > 2020

// Uses index (first column only)
MATCH (p:Paper) WHERE p.venue = 'NeurIPS'

// Does NOT use index (missing prefix)
MATCH (p:Paper) WHERE p.year > 2020

Index Selection

Uni's query planner automatically selects indexes:

Query: MATCH (p:Paper) WHERE p.year > 2020 AND p.venue = 'NeurIPS'

Plan:
├── Project [p.title]
│   └── Scan [:Paper]
│         ↳ Index: paper_venue_year (venue='NeurIPS', year>2020)
│         ↳ Predicate Pushdown: venue = 'NeurIPS' AND year > 2020

Full-Text Indexes

Full-text indexes enable keyword search within text properties.

Creating Full-Text Indexes

CREATE FULLTEXT INDEX paper_search
FOR (p:Paper)
ON EACH [p.title, p.abstract]
OPTIONS {
  tokenizer: "standard",
  min_token_length: 2
}

Tokenizers

Tokenizer Description Example
standard Unicode word boundaries "Hello, World!" → ["hello", "world"]
whitespace Split on whitespace only "Hello, World!" → ["hello,", "world!"]
ngram Character n-grams "cat" → ["ca", "at"] (bigrams)
keyword No tokenization "Hello World" → ["hello world"]

Querying Full-Text Indexes

CALL db.index.fulltext.queryNodes('paper_search', 'transformer attention')
YIELD node, score
RETURN node.title, score
ORDER BY score DESC
LIMIT 10

Boolean Operators:

// AND (default)
'transformer attention'  // Both terms required

// OR
'transformer OR attention'

// NOT
'transformer NOT vision'

// Phrase
'"attention mechanism"'

// Wildcard
'transform*'

JSON Full-Text Indexes

JSON Full-Text indexes enable BM25-based full-text search on JSON document columns, leveraging Lance's native inverted index.

When to Use JSON FTS

Use Case Index Type
Search within JSON documents JSON Full-Text Index
Keyword/phrase search in text fields JSON Full-Text Index
Exact JSON path matching JsonPath Index
Equality filters on scalar fields Scalar Index

Creating JSON Full-Text Indexes

Via Cypher:

CREATE JSON FULLTEXT INDEX article_fts
FOR (a:Article) ON _doc

With Options:

CREATE JSON FULLTEXT INDEX article_fts
FOR (a:Article) ON _doc
OPTIONS { with_positions: true }

The with_positions option enables phrase search by storing term positions.

If Not Exists:

CREATE JSON FULLTEXT INDEX article_fts IF NOT EXISTS
FOR (a:Article) ON _doc

Querying with CONTAINS

Use the CONTAINS operator to perform full-text search on FTS-indexed columns:

// Basic full-text search
MATCH (a:Article)
WHERE a._doc CONTAINS 'graph database'
RETURN a.title

// Path-specific search (searches within a JSON path)
MATCH (a:Article)
WHERE a._doc.title CONTAINS 'graph'
RETURN a.title

// Combined with exact matching
MATCH (a:Article)
WHERE a._doc.title CONTAINS 'graph' AND a.status = 'published'
RETURN a.title

Query Routing Priority

The query planner routes predicates to the most efficient index:

1. _uid = 'xxx'           → UidIndex (O(1) lookup)
2. column CONTAINS 'term' → Lance FTS (BM25 ranking)
3. path = 'exact'         → JsonPathIndex (exact match)
4. Pushable predicates    → Lance scan filter
5. Else                   → Residual (post-load filter)

JSON FTS Configuration

Parameter Default Description
with_positions false Enable phrase search (stores term positions)

How It Works

JSON Full-Text indexes use Lance's inverted index with triplet tokenization:

Document: { "title": "Graph Databases", "year": 2024 }
Tokens:  (title, string, "graph"), (title, string, "databases"), (year, int, 2024)
Query:   title:graph → Matches documents with "graph" in title path

Inverted Indexes

Inverted indexes enable efficient filtering on List<String> properties, ideal for tag-based access control and multi-value attribute queries.

Use Cases

Use Case Query Pattern Benefit
Tag filtering ANY(tag IN d.tags WHERE tag IN $allowed) O(k) vs O(n) scan
Security labels Filter by granted access tags Multi-tenant filtering
Categories Documents in multiple categories Efficient set intersection
Skills matching Users with any required skill Fast membership checks

Creating Inverted Indexes

Via Schema:

{
  "indexes": {
    "document_tags": {
      "type": "inverted",
      "label": "Document",
      "property": "tags",
      "config": {
        "normalize": true,
        "max_terms_per_doc": 10000
      }
    }
  }
}

Via Cypher:

CREATE INVERTED INDEX document_tags
FOR (d:Document)
ON d.tags
OPTIONS { normalize: true, max_terms_per_doc: 10000 }

Via Rust API:

db.schema()
    .label("Document")
        .property("tags", DataType::List(Box::new(DataType::String)))
        .index("tags", IndexType::Inverted(InvertedIndexConfig {
            normalize: true,
            max_terms_per_doc: 10_000,
        }))
    .apply()
    .await?;

Inverted Index Configuration

Parameter Default Description
normalize true Lowercase and trim whitespace on terms
max_terms_per_doc 10_000 Maximum terms per document (DoS protection)

Query Patterns

ANY IN pattern (optimized):

// Finds documents with ANY of the specified tags
MATCH (d:Document)
WHERE ANY(tag IN d.tags WHERE tag IN ['public', 'team:eng'])
RETURN d.title

With session variables (multi-tenant):

// Security filtering with session-based permissions
MATCH (d:Document)
WHERE d.tenant_id = $session.tenant_id
  AND ANY(tag IN d.tags WHERE tag IN $session.granted_tags)
RETURN d

How Inverted Indexes Work

Document 1: tags = ['rust', 'database']
Document 2: tags = ['python', 'ml']
Document 3: tags = ['rust', 'ml']
Inverted Index (term → VID list):
  'rust'     → [vid_1, vid_3]
  'database' → [vid_1]
  'python'   → [vid_2]
  'ml'       → [vid_2, vid_3]
Query: ANY(tag IN d.tags WHERE tag IN ['rust', 'python'])
Result: Union of 'rust' and 'python' → [vid_1, vid_2, vid_3]

Query Planner Integration

When an inverted index exists on a List<String> property, the query planner automatically rewrites ANY IN patterns to use index lookups:

Query: MATCH (d:Document) WHERE ANY(tag IN d.tags WHERE tag IN $allowed) RETURN d

Without Index:
├─ Full Scan: Document
└─ Filter: ANY(tag IN d.tags WHERE tag IN $allowed)  // O(n × m)

With Inverted Index:
├─ Inverted Index Lookup: tags IN $allowed           // O(k)
└─ Fetch: Document properties

Performance Comparison

Scenario Without Index With Index Speedup
1M docs, 10 tags each, query 3 tags ~5s scan ~10ms 500x
100K docs, security filter ~500ms ~5ms 100x
Multi-value category filter ~1s ~15ms 67x

Index Management

List Indexes

Cypher:

SHOW INDEXES

CLI:

uni index list --path ./storage

Output:

┌─────────────────────┬────────┬─────────┬──────────┬───────────┐
│ Name                │ Type   │ Label   │ Property │ Status    │
├─────────────────────┼────────┼─────────┼──────────┼───────────┤
│ paper_embeddings    │ Vector │ Paper   │ embedding│ READY     │
│ paper_year          │ BTree  │ Paper   │ year     │ READY     │
│ author_email        │ BTree  │ Author  │ email    │ BUILDING  │
│ paper_search        │ FTS    │ Paper   │ [title,  │ READY     │
│                     │        │         │ abstract]│           │
└─────────────────────┴────────┴─────────┴──────────┴───────────┘

Drop Indexes

DROP INDEX paper_year
uni index drop paper_year --path ./storage

Rebuild Indexes

uni index rebuild paper_embeddings --path ./storage

Index Storage

Indexes are stored within the Lance dataset structure:

storage/
├── vertices_Paper/
│   ├── data/
│   │   └── *.lance
│   ├── _indices/                    # Lance native indexes
│   │   └── embedding_idx-uuid/      # Vector index
│   │       ├── index.idx
│   │       └── aux/
│   └── _versions/
└── indexes/
    ├── scalar_paper_year/           # Separate scalar index
    │   └── index.lance
    └── fulltext_paper_search/       # Full-text index
        └── index/

Predicate Pushdown

Indexes integrate with Uni's predicate pushdown optimization:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        PREDICATE PUSHDOWN FLOW                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Query: MATCH (p:Paper) WHERE p.year > 2020 AND p.title CONTAINS 'AI'     │
│                                                                             │
│   1. Predicate Analysis                                                     │
│      ├── p.year > 2020      → Pushable (scalar index or Lance filter)      │
│      └── p.title CONTAINS   → Residual (post-load filter)                  │
│                                                                             │
│   2. Index Selection                                                        │
│      └── paper_year index available? Yes → Use index scan                  │
│                                                                             │
│   3. Execution                                                              │
│      ├── Index Scan: year > 2020 → VIDs [v1, v2, v3, ...]                  │
│      ├── Load Properties: title for filtered VIDs                          │
│      └── Residual Filter: title CONTAINS 'AI'                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Pushable Predicates

Predicate Index Type Pushed?
p.x = 5 BTree, Hash Yes
p.x > 5 BTree Yes
p.x IN [1,2,3] BTree, Bitmap Yes
p.x IS NULL BTree Yes
p._doc CONTAINS 'foo' JSON FTS Yes (if FTS-indexed)
p.x CONTAINS 'foo' None No (residual, if not FTS-indexed)
p.x STARTS WITH 'foo' BTree Partial
func(p.x) = 5 None No (residual)

Best Practices

When to Create Indexes

✓ CREATE INDEX when:
  • Property appears in WHERE clauses frequently
  • Property is used for JOIN conditions
  • Property is used in ORDER BY
  • High-cardinality exact match queries (Hash)
  • Range queries on numeric/date properties (BTree)
  • Low-cardinality categorical filters (Bitmap)

✗ AVOID INDEX when:
  • Property rarely queried
  • Very small dataset (<1000 rows)
  • Property updated frequently
  • Very low selectivity (e.g., boolean with 50/50 split)

Index Sizing

Index Type Memory Formula Example (1M vectors, 768d)
HNSW ~1.5x vectors × (4 + m×8) bytes ~120 MB
IVF_PQ vectors × (d/sub_vectors) bytes ~24 MB
BTree ~40 bytes per key ~40 MB
Bitmap ~n_distinct × rows / 8 bytes Varies

Index Maintenance

# Check index health
uni index stats --path ./storage

# Rebuild fragmented index
uni index rebuild paper_embeddings --path ./storage

# Optimize after bulk load
uni index optimize --path ./storage

Performance Comparison

Query Type Without Index With Index Speedup
Point lookup O(n) scan O(log n) BTree 1000x+
Range query O(n) scan O(log n + k) 100x+
Vector KNN O(n×d) brute O(log n) HNSW 1000x+
Full-text O(n×len) scan O(log n) inverted 100x+

Next Steps