Multimodal Generation¶

Uni-Xervo 0.2.0 extends local/mistralrs with four pipeline types: text, vision, diffusion, and speech. This guide covers when to use each pipeline, how to configure them, and how to work with the structured message API.

When to use each pipeline¶

Pipeline	Use case	Input	Output
`text`	Standard LLM chat and completion (default)	Text messages	`result.text`
`vision`	Image understanding with text prompts	Images + text messages	`result.text`
`diffusion`	Text-to-image generation	Text prompt	`result.images`
`speech`	Text-to-audio synthesis	Text prompt	`result.audio`

Message types¶

Generation in 0.2.0 uses structured Message types instead of plain strings.

Core types¶

use uni_xervo::{Message, MessageRole, ContentBlock, ImageInput};

// Simple text message (convenience constructor)
let msg = Message::user("Hello, world!");

// Full message with role
let msg = Message {
    role: MessageRole::User,
    content: vec![ContentBlock::Text("Hello".to_string())],
};

Message roles¶

MessageRole::System — system instructions
MessageRole::User — user input
MessageRole::Assistant — model responses (for multi-turn conversations)

Content blocks¶

ContentBlock::Text(String) — text content
ContentBlock::Image(ImageInput) — image content (vision pipeline)

Image input¶

// From raw bytes
let input = ImageInput::Bytes {
    data: image_bytes,
    media_type: "image/jpeg".to_string(),
};

// From URL
let input = ImageInput::Url("https://example.com/image.jpg".to_string());

Vision workflow¶

Process images with text prompts using a vision model.

Catalog configuration¶

{
  "alias": "vision/qwen",
  "task": "generate",
  "provider_id": "local/mistralrs",
  "model_id": "Qwen/Qwen2-VL-2B-Instruct",
  "options": {
    "pipeline": "vision",
    "dtype": "bf16"
  }
}

Rust code¶

let image_bytes = std::fs::read("scene.jpg")?;

let message = Message {
    role: MessageRole::User,
    content: vec![
        ContentBlock::Image(ImageInput::Bytes {
            data: image_bytes,
            media_type: "image/jpeg".to_string(),
        }),
        ContentBlock::Text("What objects are in this image?".to_string()),
    ],
};

let vision = runtime.generator("vision/qwen").await?;
let result = vision.generate(&[message], GenerationOptions::default()).await?;
println!("{}", result.text);

Diffusion workflow¶

Generate images from text prompts using FLUX models.

Catalog configuration¶

{
  "alias": "image/flux",
  "task": "generate",
  "provider_id": "local/mistralrs",
  "model_id": "black-forest-labs/FLUX.1-schnell",
  "options": {
    "pipeline": "diffusion",
    "diffusion_loader_type": "flux"
  }
}

Rust code¶

let gen = runtime.generator("image/flux").await?;
let result = gen.generate(
    &[Message::user("A serene mountain landscape at sunset")],
    GenerationOptions {
        width: Some(1024),
        height: Some(1024),
        ..Default::default()
    },
).await?;

for image in &result.images {
    std::fs::write("output.png", &image.data)?;
}

Speech workflow¶

Synthesize audio from text using Dia models.

Catalog configuration¶

{
  "alias": "tts/dia",
  "task": "generate",
  "provider_id": "local/mistralrs",
  "model_id": "nari-labs/Dia-1.6B",
  "options": {
    "pipeline": "speech",
    "speech_loader_type": "dia"
  }
}

Rust code¶

let tts = runtime.generator("tts/dia").await?;
let result = tts.generate(
    &[Message::user("[S1] Hello, welcome!")],
    GenerationOptions::default(),
).await?;

if let Some(audio) = &result.audio {
    // audio.pcm_data — raw PCM samples
    // audio.sample_rate — e.g. 24000
    // audio.channels — e.g. 1
}

GGUF models¶

Load quantized text models in GGUF format by specifying the filenames.

{
  "alias": "generate/gguf",
  "task": "generate",
  "provider_id": "local/mistralrs",
  "model_id": "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
  "options": {
    "gguf_files": ["mistral-7b-instruct-v0.2.Q4_K_M.gguf"]
  }
}

Note

gguf_files is only valid with the text pipeline. Vision, diffusion, and speech pipelines do not support GGUF.

Model precision (dtype)¶

Control model precision with the dtype option. Available on all four pipeline types.

Value	Description
`auto`	Automatic selection (BF16 on GPU, F32 on CPU)
`f16`	16-bit floating point
`bf16`	Brain floating point 16
`f32`	32-bit floating point

Default resolution logic:

Explicit dtype value in catalog options, if set
f32 when running on CPU or without GPU support
auto otherwise

Example¶

{
  "alias": "vision/qwen",
  "task": "generate",
  "provider_id": "local/mistralrs",
  "model_id": "Qwen/Qwen2-VL-2B-Instruct",
  "options": {
    "pipeline": "vision",
    "dtype": "bf16"
  }
}

Migration from 0.1.x¶

The generate() API changed from &[String] to &[Message].

Before (0.1.x):

let result = generator.generate(
    &["Hello".to_string()],
    GenerationOptions::default(),
).await?;

After (0.2.0):

let result = generator.generate(
    &[Message::user("Hello")],
    GenerationOptions::default(),
).await?;

The Message::user() convenience constructor creates a single text message with MessageRole::User. For multimodal content, build Message structs directly with multiple ContentBlock entries.