SpeechT5TTSAgent¶
The SpeechT5TTSAgent is a text-to-speech synthesis agent that converts text into spoken audio using Microsoft's SpeechT5 model via Hugging Face.
Purpose¶
This agent provides text-to-speech (TTS) capabilities within a Rustic AI guild, converting text content into natural-sounding speech. It uses the Microsoft SpeechT5 model, which produces high-quality synthesized speech.
When to Use¶
Use the SpeechT5TTSAgent when your application needs to:
- Convert text to spoken audio
- Generate voice responses for users
- Create audio content from textual data
- Add voice capabilities to your AI system
- Make information more accessible through audio formats
Dependencies¶
The SpeechT5TTSAgent requires:
- filesystem (Guild-level dependency): A file system implementation for storing generated audio files
Message Types¶
Input Messages¶
GenerationPromptRequest¶
A request to convert text to speech:
class GenerationPromptRequest(BaseModel):
    generation_prompt: str  # The text to convert to speech
Output Messages¶
MediaLink¶
When synthesis is successful, a MediaLink message is emitted with the audio content:
class MediaLink(BaseModel):
    url: str  # Path to the generated audio file
    name: str  # Filename
    metadata: Dict  # Metadata including sampling rate
    on_filesystem: bool  # Always True for generated audio
    mimetype: str  # Content type (audio/wav)
ErrorMessage¶
Sent when speech synthesis fails:
class ErrorMessage(BaseModel):
    agent_type: str
    error_type: str  # "SpeechGenerationError" or "FileWriteError"
    error_message: str
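To illustrate how a downstream consumer might tell the two output messages apart, here is a minimal sketch that works on plain dictionaries shaped like the models above. The function name `summarize_tts_output` and the `sampling_rate` metadata key are assumptions made for illustration; they are not taken from the Rustic AI API.

```python
from typing import Any, Dict

# The two error types documented for ErrorMessage.
KNOWN_ERRORS = {"SpeechGenerationError", "FileWriteError"}


def summarize_tts_output(payload: Dict[str, Any]) -> str:
    """Return a one-line summary for a MediaLink- or ErrorMessage-shaped payload."""
    if "error_type" in payload:
        kind = payload["error_type"]
        label = kind if kind in KNOWN_ERRORS else f"unexpected error ({kind})"
        return f"{label}: {payload.get('error_message', '')}"
    if payload.get("mimetype") == "audio/wav" and payload.get("on_filesystem"):
        # Assumption: the sampling rate is stored under a "sampling_rate" key.
        rate = payload.get("metadata", {}).get("sampling_rate", "unknown")
        return f"audio at {payload['url']} (sampling rate: {rate})"
    return "unrecognized payload"
```

Branching on `error_type` first keeps the failure path explicit, so a consumer never tries to open a file that was not written.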
Behavior¶
- The agent receives a GenerationPromptRequest with text content
- It processes the text through the SpeechT5 TTS pipeline
- The synthesized speech is saved to a WAV file with a generated UUID filename
- A MediaLink message is emitted with a reference to the generated audio file
- If any errors occur during synthesis or file writing, an ErrorMessage is sent
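The flow above can be sketched with the standard library alone. This is not the agent's actual implementation: `handle_request`, `fake_synthesizer`, the dataclass stand-ins for MediaLink and ErrorMessage, and the `sampling_rate` metadata key are all hypothetical, and a sine tone takes the place of the SpeechT5 pipeline so the example runs without downloading a model.

```python
import math
import struct
import uuid
import wave
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class MediaLink:  # stand-in mirroring the documented fields
    url: str
    name: str
    metadata: Dict = field(default_factory=dict)
    on_filesystem: bool = True
    mimetype: str = "audio/wav"


@dataclass
class ErrorMessage:  # stand-in mirroring the documented fields
    agent_type: str
    error_type: str
    error_message: str


# A synthesizer maps text to (samples in [-1, 1], sampling rate).
Synthesizer = Callable[[str], Tuple[List[float], int]]


def fake_synthesizer(text: str) -> Tuple[List[float], int]:
    """Hypothetical placeholder for SpeechT5: a 440 Hz tone, 50 ms per character."""
    rate = 16000
    n = rate // 20 * max(len(text), 1)
    return [0.3 * math.sin(2 * math.pi * 440 * i / rate) for i in range(n)], rate


def handle_request(
    text: str, out_dir: Path, synthesize: Synthesizer = fake_synthesizer
) -> Union[MediaLink, ErrorMessage]:
    # Run the text through the TTS step; failures map to SpeechGenerationError.
    try:
        samples, rate = synthesize(text)
    except Exception as exc:
        return ErrorMessage("SpeechT5TTSAgent", "SpeechGenerationError", str(exc))

    # Write 16-bit mono PCM to a UUID-named WAV file; failures map to FileWriteError.
    name = f"{uuid.uuid4()}.wav"
    path = out_dir / name
    try:
        out_dir.mkdir(parents=True, exist_ok=True)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(rate)
            wav.writeframes(frames)
    except OSError as exc:
        return ErrorMessage("SpeechT5TTSAgent", "FileWriteError", str(exc))

    # Report where the audio landed (metadata key name is an assumption).
    return MediaLink(url=str(path), name=name, metadata={"sampling_rate": rate})
```

Returning either a MediaLink or an ErrorMessage, rather than raising, mirrors the message-based contract described above: every request yields exactly one outgoing message.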
Sample Usage¶
from rustic_ai.core.guild.builders import AgentBuilder
from rustic_ai.core.guild.agent_ext.depends.dependency_resolver import DependencySpec
from rustic_ai.huggingface.agents.text_to_speech.speecht5_tts_agent import SpeechT5TTSAgent

# Define a file system dependency
filesystem = DependencySpec(
    class_name="rustic_ai.core.guild.agent_ext.depends.filesystem.FileSystemResolver",
    properties={
        "path_base": "/tmp",
        "protocol": "file",
        "storage_options": {
            "auto_mkdir": True,
        },
    },
)

# Create the agent spec
tts_agent_spec = (
    AgentBuilder(SpeechT5TTSAgent)
    .set_id("tts_agent")
    .set_name("Text-to-Speech")
    .set_description("Converts text to spoken audio using SpeechT5")
    .build_spec()
)

# Add dependency to guild when launching
# (guild_builder is assumed to be a guild builder created elsewhere in your application)
guild_builder.add_dependency("filesystem", filesystem)
guild_builder.add_agent_spec(tts_agent_spec)
Example Request¶
from rustic_ai.core.agents.commons.message_formats import GenerationPromptRequest

# Create a text-to-speech request
tts_request = GenerationPromptRequest(
    generation_prompt="Welcome to Rustic AI, a powerful multi-agent framework."
)

# Send to the agent
# (client is assumed to be a messaging client already connected to the guild)
client.publish("default_topic", tts_request)
Technical Details¶
The agent uses:
- The Hugging Face transformers library with the text-to-speech pipeline
- Microsoft's SpeechT5 model (microsoft/speecht5_tts)
- Speaker embeddings from the CMU Arctic dataset for voice characteristics
- SoundFile for writing WAV audio files
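For orientation, the components listed above are commonly combined roughly as follows. This sketch follows the usage pattern published with the microsoft/speecht5_tts model rather than the agent's own source. The function name `synthesize_to_wav`, the `Matthijs/cmu-arctic-xvectors` dataset, and the default speaker index are assumptions, and dataset loading may behave differently across versions of the datasets library.

```python
def synthesize_to_wav(text: str, out_path: str, speaker_index: int = 7306) -> int:
    """Synthesize `text` with SpeechT5, write it to `out_path`, return the sampling rate."""
    # Imports are local so that defining the function does not load the heavy packages.
    import soundfile as sf
    import torch
    from datasets import load_dataset
    from transformers import pipeline

    synthesizer = pipeline("text-to-speech", model="microsoft/speecht5_tts")

    # Assumption: x-vector speaker embeddings derived from CMU Arctic, taken from
    # the "Matthijs/cmu-arctic-xvectors" dataset on the Hugging Face Hub.
    embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embedding = torch.tensor(embeddings[speaker_index]["xvector"]).unsqueeze(0)

    # A fixed embedding yields the same voice on every call.
    speech = synthesizer(text, forward_params={"speaker_embeddings": speaker_embedding})
    sf.write(out_path, speech["audio"], samplerate=speech["sampling_rate"])
    return speech["sampling_rate"]
```

Calling `synthesize_to_wav("Hello", "hello.wav")` downloads the model and embeddings on first use, which is why initial runs take noticeably longer than later ones.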
Notes and Limitations¶
- The agent uses a fixed speaker embedding, resulting in consistent voice characteristics
- Only produces WAV format audio files
- Requires a significant amount of memory for the SpeechT5 model
- First-time initialization may take longer as models are downloaded
- Currently only supports English text input