Schema Design Guide¶

A well-designed schema is crucial for performance, maintainability, and query efficiency. This guide covers best practices for modeling your domain in Uni.

Schema Design Principles¶

1. Model the Domain, Not the Queries¶

Design your schema around your domain entities, not specific queries:

✓ Good: Entities represent real-world concepts
  Paper, Author, Venue, Citation

✗ Bad: Entities shaped by specific queries
  PaperWithAuthorNames, RecentPapersByVenue

2. Use Labels for Types, Not States¶

Labels define entity types, not transient states:

✓ Good: Labels are stable types
  :Paper, :Author, :Venue

✗ Bad: Labels represent changeable states
  :PublishedPaper, :DraftPaper, :RetractedPaper
  (Use a "status" property instead)

3. Relationships Are First-Class¶

Graph relationships are powerful—use them:

✓ Good: Relationships as edges
  (paper)-[:AUTHORED_BY]->(author)
  (paper)-[:CITES]->(cited)

✗ Bad: Relationships as properties
  Paper { author_ids: ["a1", "a2"] }

4. Keep Vertices Focused¶

Each vertex should represent one cohesive entity:

✓ Good: Focused vertex
  Paper { title, year, abstract }

✗ Bad: Kitchen sink vertex
  Paper { title, year, author_name, venue_name, citation_count }
  (author_name, venue_name should be separate vertices)

Labels¶

Naming Conventions¶

Convention	Example	Rationale
Singular nouns	`:Paper`	Represents one entity
PascalCase	`:ResearchPaper`	Standard convention
Descriptive	`:AcademicPaper`	Clear meaning
Avoid abbreviations	`:Organization` not `:Org`	Readable

Label Granularity¶

Too Few Labels:

// All entities in one label - hard to query efficiently
:Entity { type: "paper", ... }
:Entity { type: "author", ... }

Too Many Labels:

// Fragmented - complex schema, poor caching
:NeurIPSPaper, :ICMLPaper, :ICLRPaper, :ArXivPaper

Just Right:

// Labels represent fundamental types
:Paper { venue: "NeurIPS" }  // venue is a property
:Author
:Venue

Label Hierarchy Considerations¶

Uni uses single-label vertices (encoded in VID). If you need hierarchies:

// Option 1: Property-based classification
:Paper { paper_type: "research", venue_type: "conference" }

// Option 2: Separate labels with relationships
:Paper, :ConferenceSubmission
(paper)-[:SUBMITTED_TO]->(conference)

// Option 3: Composition via edges
:Paper, :Category
(paper)-[:IN_CATEGORY]->(category)

Edge Types¶

Naming Conventions¶

Convention	Example	Rationale
UPPER_SNAKE_CASE	`:AUTHORED_BY`	Visually distinct from labels
Verb phrases	`:CITES`, `:BELONGS_TO`	Describes relationship
Past tense or present	`:WROTE` or `:WRITES`	Consistent style
Active voice	`:CITES` not `:CITED_BY`	Clear direction

Direction Semantics¶

Choose direction based on typical query patterns:

// Natural reading direction: subject -[verb]-> object
(paper)-[:CITES]->(cited_paper)      // Paper cites another paper
(paper)-[:AUTHORED_BY]->(author)     // Paper is authored by author
(author)-[:WORKS_AT]->(institution)  // Author works at institution

// Query from either direction
MATCH (a:Author)<-[:AUTHORED_BY]-(p:Paper)  // Find author's papers
MATCH (p:Paper)-[:AUTHORED_BY]->(a:Author)  // Find paper's authors

Edge Properties¶

Use edge properties sparingly for relationship metadata:

{
  "edge_types": {
    "AUTHORED_BY": {
      "id": 1,
      "src_labels": ["Paper"],
      "dst_labels": ["Author"]
    }
  },
  "properties": {
    "AUTHORED_BY": {
      "position": { "type": "Int32" },      // Author order
      "contribution": { "type": "String" }   // Role: "lead", "contributor"
    }
  }
}

When to Use Edge Properties: - Relationship metadata (timestamps, weights, roles) - Data specific to the relationship, not the connected vertices

When to Avoid Edge Properties: - Frequently updated data (edges are immutable) - Large data (embeddings, documents)

Property Design¶

Data Type Selection¶

Data Type	Use Case	Example
`String`	Text, identifiers	title, name, doi
`Int32`	Small integers	year, count
`Int64`	Large integers	timestamp_ms, big_count
`Float64`	Decimal values	price, score
`Bool`	Flags	is_published, is_retracted
`Timestamp`	Date/time	created_at, published_at
`Vector`	Embeddings	embedding, image_vector
`Json`	Semi-structured	metadata, config

Nullability¶

Be intentional about nullable properties:

{
  "Paper": {
    // Required: every paper has these
    "title": { "type": "String", "nullable": false },

    // Optional: not all papers have these
    "abstract": { "type": "String", "nullable": true },
    "doi": { "type": "String", "nullable": true }
  }
}

Property Naming¶

Convention	Example	Notes
snake_case	`created_at`	Consistent with JSON
Descriptive	`citation_count` not `cc`	Self-documenting
No prefixes	`title` not `paper_title`	Label provides context

Avoid Property Bloat¶

✓ Good: Focused properties
  Paper { title, year, venue, abstract, doi }

✗ Bad: Everything on one vertex
  Paper {
    title, year, venue, abstract, doi,
    author_names,        // Should be vertex + edge
    all_citations,       // Should be edges
    raw_pdf_bytes,       // Too large
    processing_status    // Transient state
  }

Vector Properties¶

Dimension Planning¶

Vector dimensions are immutable after schema creation:

{
  "embedding": {
    "type": "Vector",
    "dimensions": 768  // Cannot change later
  }
}

Choosing Dimensions:

Model Family	Typical Dimensions	Notes
Sentence Transformers	384-768	General text
OpenAI embeddings	1536-3072	Commercial
CLIP	512-768	Multimodal
Custom	Varies	Match your model

Multiple Embeddings¶

For different embedding types, use separate properties:

{
  "Paper": {
    "title_embedding": { "type": "Vector", "dimensions": 384 },
    "abstract_embedding": { "type": "Vector", "dimensions": 768 },
    "figure_embedding": { "type": "Vector", "dimensions": 512 }
  }
}

Embedding Versioning¶

When upgrading embedding models:

{
  "Paper": {
    // Current
    "embedding": { "type": "Vector", "dimensions": 768 },

    // Legacy (deprecated)
    "embedding_v1": { "type": "Vector", "dimensions": 384 }
  }
}

Document Mode¶

When to Use Document Mode¶

Enable is_document: true for entities with: - Highly variable/nested structure - Frequently changing schema - Semi-structured metadata

{
  "labels": {
    "Paper": {
      "id": 1,
      "is_document": true
    }
  }
}

Document Properties¶

CREATE (p:Paper {
  title: "Research Paper",
  year: 2024,
  // Document field for flexible data
  _doc: {
    figures: [
      { id: "fig1", caption: "Architecture" },
      { id: "fig2", caption: "Results" }
    ],
    supplementary: {
      code_url: "https://github.com/...",
      datasets: ["imagenet", "coco"]
    },
    review_scores: [8, 7, 9]
  }
})

JSON vs Typed Properties¶

Use Typed Properties	Use JSON
Frequently queried	Rarely queried
Stable schema	Evolving schema
Needs indexing	No indexing needed
Performance critical	Flexibility critical

Index Planning¶

Index Strategy¶

Plan indexes based on query patterns:

{
  "indexes": {
    // Vector index for similarity search
    "paper_embeddings": {
      "type": "vector",
      "label": "Paper",
      "property": "embedding",
      "config": { "index_type": "hnsw", "metric": "cosine" }
    },

    // Scalar index for frequent filters
    "paper_year": {
      "type": "scalar",
      "label": "Paper",
      "property": "year",
      "config": { "index_type": "btree" }
    },

    // Scalar index for unique lookups
    "paper_doi": {
      "type": "scalar",
      "label": "Paper",
      "property": "doi",
      "config": { "index_type": "hash" }
    }
  }
}

Index Selection Guidelines¶

Query Pattern	Index Type	Example
`WHERE x = 5`	BTree or Hash	Year, ID
`WHERE x > 5`	BTree	Year ranges
`WHERE x IN [...]`	BTree or Bitmap	Categories
Vector similarity	HNSW or IVF_PQ	Embeddings
Text search	Full-text	Title, abstract

Schema Evolution¶

Adding Properties¶

Safe operation—existing data gets NULL:

// Before
{ "Paper": { "title": "String" } }

// After (add new property)
{ "Paper": {
    "title": "String",
    "citation_count": { "type": "Int32", "nullable": true }  // New
}}

Deprecating Properties¶

Use state markers for gradual removal:

{
  "Paper": {
    "old_field": {
      "type": "String",
      "state": "deprecated",
      "deprecated_since": "2024-01-01",
      "migration_hint": "Use new_field instead"
    },
    "new_field": { "type": "String" }
  }
}

Adding Labels/Edge Types¶

Safe operation—new types get new ID ranges:

// Add new label
{
  "labels": {
    "Paper": { "id": 1 },
    "Preprint": { "id": 2 }  // New label
  }
}

Breaking Changes (Avoid)¶

These require data migration: - Changing property types - Changing vector dimensions - Renaming labels (ID is fixed) - Changing edge type direction semantics

Example Schemas¶

Academic Papers¶

{
  "schema_version": 1,

  "labels": {
    "Paper": { "id": 1, "is_document": true },
    "Author": { "id": 2 },
    "Venue": { "id": 3 },
    "Institution": { "id": 4 }
  },

  "edge_types": {
    "CITES": { "id": 1, "src_labels": ["Paper"], "dst_labels": ["Paper"] },
    "AUTHORED_BY": { "id": 2, "src_labels": ["Paper"], "dst_labels": ["Author"] },
    "PUBLISHED_IN": { "id": 3, "src_labels": ["Paper"], "dst_labels": ["Venue"] },
    "AFFILIATED_WITH": { "id": 4, "src_labels": ["Author"], "dst_labels": ["Institution"] }
  },

  "properties": {
    "Paper": {
      "title": { "type": "String", "nullable": false },
      "abstract": { "type": "String", "nullable": true },
      "year": { "type": "Int32", "nullable": false },
      "doi": { "type": "String", "nullable": true },
      "embedding": { "type": "Vector", "dimensions": 768 }
    },
    "Author": {
      "name": { "type": "String", "nullable": false },
      "email": { "type": "String", "nullable": true },
      "orcid": { "type": "String", "nullable": true }
    },
    "Venue": {
      "name": { "type": "String", "nullable": false },
      "type": { "type": "String", "nullable": true }
    },
    "AUTHORED_BY": {
      "position": { "type": "Int32", "nullable": true }
    }
  }
}

E-Commerce¶

{
  "schema_version": 1,

  "labels": {
    "User": { "id": 1 },
    "Product": { "id": 2 },
    "Category": { "id": 3 },
    "Order": { "id": 4 }
  },

  "edge_types": {
    "PURCHASED": { "id": 1, "src_labels": ["User"], "dst_labels": ["Product"] },
    "VIEWED": { "id": 2, "src_labels": ["User"], "dst_labels": ["Product"] },
    "IN_CATEGORY": { "id": 3, "src_labels": ["Product"], "dst_labels": ["Category"] },
    "ORDERED": { "id": 4, "src_labels": ["Order"], "dst_labels": ["Product"] },
    "PLACED_BY": { "id": 5, "src_labels": ["Order"], "dst_labels": ["User"] }
  },

  "properties": {
    "User": {
      "email": { "type": "String", "nullable": false },
      "name": { "type": "String", "nullable": true },
      "preference_embedding": { "type": "Vector", "dimensions": 128 }
    },
    "Product": {
      "name": { "type": "String", "nullable": false },
      "description": { "type": "String", "nullable": true },
      "price": { "type": "Float64", "nullable": false },
      "embedding": { "type": "Vector", "dimensions": 384 }
    },
    "PURCHASED": {
      "quantity": { "type": "Int32", "nullable": false },
      "timestamp": { "type": "Timestamp", "nullable": false }
    }
  }
}

Schema Validation Checklist¶

Before deploying your schema:

Next Steps¶

Data Ingestion — Import data with your schema
Indexing — Configure indexes
Cypher Querying — Query your schema