Runtime Loading
Runtime loading has two phases: provider warmup and model warmup.
Provider warmup (ModelRuntimeBuilder::warmup_policy)
- eager: await provider.warmup() for each registered provider during build().
- background: schedule provider.warmup() in detached tasks.
- lazy: no provider warmup during build.
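The three policies can be pictured with a small std-only sketch. It is not the library's implementation: the real warmup is async, while this version uses OS threads as a stand-in, and WarmupPolicy, Provider, and build_providers are hypothetical names chosen for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical mirror of the three warmup policies.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WarmupPolicy {
    Eager,
    Background,
    Lazy,
}

/// Hypothetical stand-in for a provider; `warmups` counts completed warmups.
pub struct Provider {
    pub warmups: AtomicUsize,
}

impl Provider {
    pub fn new() -> Arc<Self> {
        Arc::new(Provider { warmups: AtomicUsize::new(0) })
    }

    pub fn warmup(&self) {
        self.warmups.fetch_add(1, Ordering::SeqCst);
    }
}

/// Applies the policy at build time. Background handles are returned so a
/// caller can wait on them; a real runtime would likely just detach them.
pub fn build_providers(
    providers: &[Arc<Provider>],
    policy: WarmupPolicy,
) -> Vec<JoinHandle<()>> {
    let mut background = Vec::new();
    for provider in providers {
        match policy {
            // Eager: build does not return until warmup has finished.
            WarmupPolicy::Eager => provider.warmup(),
            // Background: warmup is scheduled and build moves on immediately.
            WarmupPolicy::Background => {
                let provider = Arc::clone(provider);
                background.push(thread::spawn(move || provider.warmup()));
            }
            // Lazy: nothing happens during build.
            WarmupPolicy::Lazy => {}
        }
    }
    background
}
```

The practical trade-off is startup latency against first-use latency: eager pays the warmup cost inside build(), background overlaps it with the rest of startup, and lazy pushes it past build entirely.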
Model warmup (ModelAliasSpec.warmup)
- eager: load and warm model during build.
- background: schedule model load after build.
- lazy: load on first handle resolution.
required = true only affects eager warmup: if the eager load of a required alias fails, startup fails.
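The interaction between warmup mode and required can be sketched as follows. This is a synchronous illustration, not the library's code: Warmup, AliasSpec, BuildReport, and build_models are hypothetical names. The text does not say what happens when a non-required eager load fails, so the sketch assumes that build records the failure and continues.

```rust
/// Hypothetical mirror of the per-alias warmup setting.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Warmup {
    Eager,
    Background,
    Lazy,
}

/// Hypothetical, reduced alias spec: only the fields this sketch needs.
pub struct AliasSpec {
    pub alias: &'static str,
    pub warmup: Warmup,
    pub required: bool,
}

/// What a simulated build did with each alias.
#[derive(Debug, PartialEq, Eq)]
pub struct BuildReport {
    pub loaded: Vec<&'static str>,
    pub deferred: Vec<&'static str>,
    pub skipped_failures: Vec<&'static str>,
}

/// Runs eager loads during build. `load` stands in for load plus warmup.
pub fn build_models<F>(specs: &[AliasSpec], mut load: F) -> Result<BuildReport, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut report = BuildReport {
        loaded: Vec::new(),
        deferred: Vec::new(),
        skipped_failures: Vec::new(),
    };
    for spec in specs {
        match spec.warmup {
            Warmup::Eager => match load(spec.alias) {
                Ok(()) => report.loaded.push(spec.alias),
                // Eager + required: the failure aborts startup.
                Err(e) if spec.required => {
                    return Err(format!("required alias {} failed: {}", spec.alias, e));
                }
                // Assumption: a non-required eager failure does not abort build.
                Err(_) => report.skipped_failures.push(spec.alias),
            },
            // `required` plays no role here: nothing is loaded during build.
            Warmup::Background | Warmup::Lazy => report.deferred.push(spec.alias),
        }
    }
    Ok(report)
}
```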
Deduplication and concurrency
Models are keyed by ModelRuntimeKey:
- task
- provider ID
- model ID
- revision
- normalized options hash
Aliases with identical runtime keys share one loaded instance.
Concurrent first-load calls for the same key are serialized with a per-key mutex so only one load happens.
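One way to get both properties, sharing by key and a single load under contention, is a map of per-key slots. The sketch below uses std::sync::Mutex and threads, whereas an async runtime would more likely use an async-aware lock. The field types of ModelRuntimeKey, and the Model and Registry types, are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Field types are assumed; the text lists the components but not their types.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ModelRuntimeKey {
    pub task: String,
    pub provider_id: String,
    pub model_id: String,
    pub revision: String,
    pub options_hash: u64,
}

/// Stand-in for a loaded model instance.
pub struct Model {
    pub key: ModelRuntimeKey,
}

/// One slot per key: the inner mutex is the per-key lock.
type Slot = Arc<Mutex<Option<Arc<Model>>>>;

#[derive(Default)]
pub struct Registry {
    slots: Mutex<HashMap<ModelRuntimeKey, Slot>>,
    /// Counts real loads so the deduplication is observable.
    pub loads: AtomicUsize,
}

impl Registry {
    pub fn get_or_load(&self, key: &ModelRuntimeKey) -> Arc<Model> {
        // Hold the map lock only long enough to find or create the slot.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            Arc::clone(slots.entry(key.clone()).or_default())
        };
        // The per-key lock serializes first loads for this key only;
        // loads for other keys can proceed in parallel.
        let mut guard = slot.lock().unwrap();
        if let Some(model) = guard.as_ref() {
            return Arc::clone(model);
        }
        self.loads.fetch_add(1, Ordering::SeqCst);
        let model = Arc::new(Model { key: key.clone() });
        *guard = Some(Arc::clone(&model));
        model
    }
}
```

Two aliases whose specs produce equal keys receive the same Arc, and callers that lose the race block on the slot lock and then return the instance the winner stored.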
Load timeout
load_timeout covers provider.load(spec) plus model warmup as a single budget. If it is not set, the runtime default is 600 seconds.
A load that exceeds the timeout returns RuntimeError::Timeout.
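The timeout behavior can be approximated with the standard library alone. In this sketch the load and warmup run together on a worker thread and the caller waits on a channel with a deadline. An async runtime would normally wrap the future in a timeout instead. load_with_timeout, DEFAULT_LOAD_TIMEOUT, and the LoadFailed variant are hypothetical; only RuntimeError::Timeout and the 600-second default come from the text.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Default from the text: 600 seconds when load_timeout is not set.
pub const DEFAULT_LOAD_TIMEOUT: Duration = Duration::from_secs(600);

#[derive(Debug, PartialEq, Eq)]
pub enum RuntimeError {
    Timeout,
    /// Hypothetical variant for non-timeout failures.
    LoadFailed(String),
}

/// Runs `load_and_warm` (load plus warmup as one unit) under a single deadline.
pub fn load_with_timeout<T, F>(
    load_and_warm: F,
    load_timeout: Option<Duration>,
) -> Result<T, RuntimeError>
where
    T: Send + 'static,
    F: FnOnce() -> Result<T, String> + Send + 'static,
{
    let timeout = load_timeout.unwrap_or(DEFAULT_LOAD_TIMEOUT);
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up, so a send error is ignored.
        let _ = tx.send(load_and_warm());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(model)) => Ok(model),
        Ok(Err(msg)) => Err(RuntimeError::LoadFailed(msg)),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(RuntimeError::Timeout),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RuntimeError::LoadFailed(
            "load thread exited without a result".to_string(),
        )),
    }
}
```

Note that this thread-based version cannot cancel the work after the deadline passes. The worker keeps running and its result is dropped.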
Prefetch APIs
- runtime.prefetch_all().await warms every alias.
- runtime.prefetch(&["embed/default", "generate/chat"]).await warms selected aliases.
These methods are useful at service startup to avoid first-request cold starts.
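The effect of prefetching on cold starts can be shown with a toy model. MockRuntime below is a synchronous stand-in written for this example, not the real runtime type: its methods only flip a warm flag, and the error returned for an unknown alias is an assumption.

```rust
use std::collections::BTreeMap;

/// Synchronous, hypothetical stand-in for the runtime (the real API is async).
pub struct MockRuntime {
    /// alias -> whether the model behind it is already warm
    warm: BTreeMap<String, bool>,
    /// Number of resolutions that had to load a cold model.
    pub cold_starts: usize,
}

impl MockRuntime {
    pub fn new(aliases: &[&str]) -> Self {
        let warm = aliases.iter().map(|a| (a.to_string(), false)).collect();
        MockRuntime { warm, cold_starts: 0 }
    }

    /// Warms every registered alias.
    pub fn prefetch_all(&mut self) {
        for warm in self.warm.values_mut() {
            *warm = true;
        }
    }

    /// Warms only the listed aliases. Assumption: unknown aliases are an error.
    pub fn prefetch(&mut self, aliases: &[&str]) -> Result<(), String> {
        for alias in aliases {
            match self.warm.get_mut(*alias) {
                Some(warm) => *warm = true,
                None => return Err(format!("unknown alias: {}", alias)),
            }
        }
        Ok(())
    }

    /// Resolving a handle for a cold alias counts as a cold start.
    pub fn resolve(&mut self, alias: &str) -> Result<(), String> {
        let warm = self
            .warm
            .get_mut(alias)
            .ok_or_else(|| format!("unknown alias: {}", alias))?;
        if !*warm {
            *warm = true;
            self.cold_starts += 1;
        }
        Ok(())
    }
}
```

In this model, calling prefetch(&["embed/default", "generate/chat"]) before serving traffic means the first resolve of either alias finds it warm, while any alias left out still pays its load on first use.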