SpeechT5TTSAgent¶

The SpeechT5TTSAgent is a text-to-speech synthesis agent that converts text into spoken audio using Microsoft's SpeechT5 model via Hugging Face.

Purpose¶

This agent provides text-to-speech (TTS) capabilities within a Rustic AI guild, converting text content into natural-sounding speech with the Microsoft SpeechT5 model.

When to Use¶

Use the SpeechT5TTSAgent when your application needs to:

Convert text to spoken audio
Generate voice responses for users
Create audio content from textual data
Add voice capabilities to your AI system
Make information more accessible through audio formats

Dependencies¶

The SpeechT5TTSAgent requires:

filesystem (Guild-level dependency): A file system implementation for storing generated audio files

Message Types¶

Input Messages¶

GenerationPromptRequest¶

A request to convert text to speech:

from pydantic import BaseModel

class GenerationPromptRequest(BaseModel):
    generation_prompt: str  # The text to convert to speech

Output Messages¶

MediaLink¶

When synthesis succeeds, the agent emits a MediaLink message that references the generated audio file:

from typing import Dict

from pydantic import BaseModel

class MediaLink(BaseModel):
    url: str  # Path to the generated audio file
    name: str  # Filename
    metadata: Dict  # Metadata including sampling rate
    on_filesystem: bool  # Always True for generated audio
    mimetype: str  # Content type (audio/wav)
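
A consumer of this message usually needs to locate the file and read its audio properties. The following is a minimal sketch of such a consumer, not part of Rustic AI: `describe_audio` is a hypothetical helper, the message is assumed to have been converted to a plain dict, `url` is assumed to resolve to a local path, and the metadata key `sampling_rate` is an assumed name (the fields above only state that the metadata includes the sampling rate). A short test tone stands in for agent output so the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def describe_audio(media_link: dict) -> dict:
    """Summarise the WAV file referenced by a MediaLink-like dict (hypothetical helper)."""
    if media_link.get("mimetype") != "audio/wav":
        raise ValueError(f"unsupported mimetype: {media_link.get('mimetype')}")
    with wave.open(media_link["url"], "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    return {
        "name": media_link["name"],
        "sampling_rate": rate,
        "duration_seconds": frames / rate,
    }


# Write a one-second, 16 kHz, 16-bit mono tone so the sketch runs without the agent.
path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    tone = (int(12000 * math.sin(2 * math.pi * 440 * i / 16000)) for i in range(16000))
    wav.writeframes(b"".join(struct.pack("<h", s) for s in tone))

link = {
    "url": path,
    "name": "example.wav",
    "metadata": {"sampling_rate": 16000},  # assumed key name
    "on_filesystem": True,
    "mimetype": "audio/wav",
}
info = describe_audio(link)
print(info)
```

Reading the rate from the file header rather than from `metadata` keeps the helper independent of the exact metadata key the agent uses.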

ErrorMessage¶

Sent when speech synthesis or writing the audio file fails:

from pydantic import BaseModel

class ErrorMessage(BaseModel):
    agent_type: str
    error_type: str  # "SpeechGenerationError" or "FileWriteError"
    error_message: str
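
Because `error_type` distinguishes the two failure stages, a receiver can branch on it. The sketch below illustrates that idea under assumptions: the dataclass is a local stand-in that mirrors the fields above rather than the real Rustic AI class, `summarize_error` is a hypothetical helper, and the `agent_type` value in the example is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorMessage:
    """Local stand-in mirroring the documented fields (not the Rustic AI class)."""
    agent_type: str
    error_type: str
    error_message: str


def summarize_error(err: ErrorMessage) -> str:
    """Build a log line that names the stage that failed (hypothetical helper)."""
    if err.error_type == "SpeechGenerationError":
        stage = "speech synthesis"
    elif err.error_type == "FileWriteError":
        stage = "writing the audio file"
    else:
        stage = "an unknown stage"
    return f"{err.agent_type} failed during {stage}: {err.error_message}"


# Example values are illustrative only.
err = ErrorMessage("SpeechT5TTSAgent", "FileWriteError", "disk full")
line = summarize_error(err)
print(line)
```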

Behavior¶

The agent receives a GenerationPromptRequest with text content
It processes the text through the SpeechT5 TTS pipeline
The synthesized speech is saved to a WAV file with a generated UUID filename
A MediaLink message is emitted with a reference to the generated audio file
If any errors occur during synthesis or file writing, an ErrorMessage is sent
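
The steps above can be sketched end to end without loading the model. This is an illustration of the control flow only, under stated assumptions: `fake_synthesize` is a hypothetical stand-in for the SpeechT5 pipeline, the file is written to a local directory rather than through the guild's filesystem dependency, results are plain dicts rather than Rustic AI message objects, and the 16 kHz rate and `sampling_rate` metadata key are assumed values.

```python
import math
import os
import struct
import tempfile
import uuid
import wave


def fake_synthesize(text: str, rate: int) -> list:
    """Stand-in for the TTS pipeline: 0.1 s of tone per word (hypothetical)."""
    if not text.strip():
        raise ValueError("empty prompt")
    n = int(rate * 0.1) * len(text.split())
    return [math.sin(2 * math.pi * 220 * i / rate) for i in range(n)]


def error(error_type: str, exc: Exception) -> dict:
    """Build an ErrorMessage-like dict."""
    return {"agent_type": "SpeechT5TTSAgent", "error_type": error_type, "error_message": str(exc)}


def handle_request(generation_prompt: str, base_dir: str, rate: int = 16000) -> dict:
    """Follow the documented steps; return a MediaLink-like or ErrorMessage-like dict."""
    # Steps 1-2: receive the prompt and run it through the (stand-in) pipeline.
    try:
        samples = fake_synthesize(generation_prompt, rate)
    except Exception as exc:
        return error("SpeechGenerationError", exc)

    # Step 3: save the audio as a WAV file named with a generated UUID.
    name = f"{uuid.uuid4()}.wav"
    path = os.path.join(base_dir, name)
    try:
        with open(path, "wb") as fh, wave.open(fh, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit PCM
            wav.setframerate(rate)
            wav.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
    except OSError as exc:
        return error("FileWriteError", exc)

    # Step 4: emit a reference to the generated file.
    return {
        "url": path,
        "name": name,
        "metadata": {"sampling_rate": rate},  # assumed key name
        "on_filesystem": True,
        "mimetype": "audio/wav",
    }


ok = handle_request("Welcome to Rustic AI", tempfile.mkdtemp())
bad = handle_request("   ", tempfile.mkdtemp())
unwritable = handle_request("hello", os.path.join(tempfile.mkdtemp(), "missing", "dir"))
print(ok["name"], bad["error_type"], unwritable["error_type"])
```

Separating the two `try` blocks is what lets the sketch report synthesis failures and file-write failures under different `error_type` values, matching the two types listed for ErrorMessage.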

Sample Usage¶

from rustic_ai.core.guild.builders import AgentBuilder
from rustic_ai.core.guild.agent_ext.depends.dependency_resolver import DependencySpec
from rustic_ai.huggingface.agents.text_to_speech.speecht5_tts_agent import SpeechT5TTSAgent

# Define a file system dependency
filesystem = DependencySpec(
    class_name="rustic_ai.core.guild.agent_ext.depends.filesystem.FileSystemResolver",
    properties={
        "path_base": "/tmp",
        "protocol": "file",
        "storage_options": {
            "auto_mkdir": True,
        },
    },
)

# Create the agent spec
tts_agent_spec = (
    AgentBuilder(SpeechT5TTSAgent)
    .set_id("tts_agent")
    .set_name("Text-to-Speech")
    .set_description("Converts text to spoken audio using SpeechT5")
    .build_spec()
)

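# Add the dependency and the agent spec to the guild when launching.
# `guild_builder` is assumed to be a guild builder instance created elsewhere.
guild_builder.add_dependency("filesystem", filesystem)
guild_builder.add_agent_spec(tts_agent_spec)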

Example Request¶

from rustic_ai.core.agents.commons.message_formats import GenerationPromptRequest

# Create a text-to-speech request
tts_request = GenerationPromptRequest(
    generation_prompt="Welcome to Rustic AI, a powerful multi-agent framework."
)

# Send to the agent (`client` is assumed to be an existing guild client)
client.publish("default_topic", tts_request)

Technical Details¶

The agent uses:

The Hugging Face transformers library with the text-to-speech pipeline
Microsoft's SpeechT5 model (microsoft/speecht5_tts)
Speaker embeddings from the CMU Arctic dataset for voice characteristics
SoundFile for writing WAV audio files
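
For orientation, the components listed above can be combined roughly as follows. This sketch follows the usage example published with the `microsoft/speecht5_tts` model card and is not taken from the agent's source: the dataset id `Matthijs/cmu-arctic-xvectors`, the speaker index, and the function name `synthesize_to_wav` are assumptions, and the agent may select its embedding differently. Loading that dataset may also need adjusting depending on the installed `datasets` version. Heavy imports are deferred into the function so that defining it does not require the libraries or trigger any download.

```python
MODEL_ID = "microsoft/speecht5_tts"
XVECTOR_DATASET = "Matthijs/cmu-arctic-xvectors"  # assumed source of CMU Arctic embeddings
SPEAKER_INDEX = 7306  # assumed index, taken from the model card example


def synthesize_to_wav(text: str, out_path: str) -> int:
    """Synthesize `text` to a WAV file and return the sampling rate (illustrative sketch)."""
    # Imported lazily: these packages are large and only needed when synthesis runs.
    import soundfile as sf
    import torch
    from datasets import load_dataset
    from transformers import pipeline

    synthesiser = pipeline("text-to-speech", MODEL_ID)

    # A fixed x-vector gives every utterance the same voice.
    embeddings = load_dataset(XVECTOR_DATASET, split="validation")
    speaker_embedding = torch.tensor(embeddings[SPEAKER_INDEX]["xvector"]).unsqueeze(0)

    speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})

    # The pipeline returns the waveform and its sampling rate.
    sf.write(out_path, speech["audio"], samplerate=speech["sampling_rate"])
    return speech["sampling_rate"]
```

Calling `synthesize_to_wav("Welcome to Rustic AI.", "speech.wav")` downloads the model and embeddings on first use, which is the slow first-time initialization mentioned in the notes below.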

Notes and Limitations¶

The agent uses a fixed speaker embedding, so every request is rendered in the same voice
Produces only WAV audio files
Requires a significant amount of memory to load the SpeechT5 model
First-time initialization may take longer because the model files are downloaded
Currently supports only English text input