CharacterSplitterResolver¶
The CharacterSplitterResolver provides a text splitting service that divides text into smaller chunks based on character separators. This is particularly useful for processing large documents before embedding or generating summaries.
Overview¶
- Type:
DependencyResolver[TextSplitter] - Provided Dependency:
CharacterTextSplitter - Package:
rustic_ai.langchain.agent_ext.text_splitter.character_splitter
Features¶
- Character-Based Splitting: Split text at specified separator characters
- Configurable Chunk Size: Control the approximate size of each text chunk
- Chunk Overlap: Define overlap between chunks to maintain context
- Multiple Separators: Use different separators with priority ordering
- Metadata Retention: Preserve document metadata across splits
Configuration¶
| Parameter | Type | Description | Default |
|---|---|---|---|
separator |
str |
Character(s) to split on | "\n\n" (blank line) |
chunk_size |
int |
Maximum chunk size in characters | 1000 |
chunk_overlap |
int |
Number of characters to overlap between chunks | 200 |
add_start_index |
bool |
Add chunk start index to metadata | True |
strip_whitespace |
bool |
Strip whitespace from chunk beginnings/ends | True |
Usage¶
Guild Configuration¶
from rustic_ai.core.guild.builders import GuildBuilder
from rustic_ai.core.guild.dsl import DependencySpec
guild_builder = (
GuildBuilder("text_processing_guild", "Text Processing Guild", "Guild with text processing capabilities")
.add_dependency_resolver(
"text_splitter",
DependencySpec(
class_name="rustic_ai.langchain.agent_ext.text_splitter.character_splitter.CharacterSplitterResolver",
properties={
"separator": "\n\n", # Split on blank lines
"chunk_size": 1000,
"chunk_overlap": 200
}
)
)
)
Agent Usage¶
from rustic_ai.core.guild import Agent, agent
from rustic_ai.langchain.agent_ext.text_splitter.character_splitter import CharacterTextSplitter
from rustic_ai.core.agents.commons.media import Document
class TextProcessingAgent(Agent):
@agent.processor(clz=DocumentChunkRequest, depends_on=["text_splitter"])
def split_document(self, ctx: agent.ProcessContext, text_splitter: CharacterTextSplitter):
# Create a document
document = Document(
id="doc1",
text=ctx.payload.text,
metadata={"source": ctx.payload.source}
)
# Split the document into smaller chunks
chunks = text_splitter.split_document(document)
# Each chunk is a Document object with its own ID and metadata
ctx.send_dict({
"original_id": document.id,
"chunk_count": len(chunks),
"chunks": [
{
"id": chunk.id,
"text": chunk.text,
"metadata": chunk.metadata
}
for chunk in chunks
]
})
@agent.processor(clz=TextChunkRequest, depends_on=["text_splitter"])
def split_text(self, ctx: agent.ProcessContext, text_splitter: CharacterTextSplitter):
# Split raw text (without creating a Document)
text = ctx.payload.text
chunks = text_splitter.split_text(text)
ctx.send_dict({
"chunk_count": len(chunks),
"chunks": chunks
})
API Reference¶
The CharacterTextSplitter class provides these primary methods:
| Method | Description |
|---|---|
split_text(text: str) -> List[str] |
Split a single text string into chunks |
split_documents(documents: List[Document]) -> List[Document] |
Split multiple Document objects |
split_document(document: Document) -> List[Document] |
Split a single Document object |
create_documents(texts: List[str], metadatas: List[Dict] = None) -> List[Document] |
Create Document objects from texts |
Working with Document Objects¶
When splitting Document objects, important points to note:
- The original document's metadata is preserved in each chunk
- Each chunk gets a unique ID (based on the original ID with a suffix)
- Additional metadata is added to track the chunk's position and content:
chunk_index: The position of the chunk in the sequencestart_index: The character position in the original text where the chunk starts
Example: Processing Large Documents for RAG¶
@agent.processor(clz=PrepareForRAGRequest, depends_on=["text_splitter", "vectorstore"])
def prepare_for_rag(self, ctx: agent.ProcessContext, text_splitter: CharacterTextSplitter, vectorstore: VectorStore):
# Convert input text to a Document
document = Document(
id=f"doc-{uuid.uuid4()}",
text=ctx.payload.text,
metadata={
"source": ctx.payload.source,
"author": ctx.payload.author,
"date": ctx.payload.date
}
)
# Split into manageable chunks
chunks = text_splitter.split_document(document)
# Store chunks in vector database for later retrieval
result = vectorstore.upsert(chunks)
ctx.send_dict({
"processed": True,
"document_id": document.id,
"chunk_count": len(chunks),
"success_count": len(result.succeeded),
"failed_count": len(result.failed)
})
Custom Separators¶
You can use different separators with varying priorities:
DependencySpec(
class_name="rustic_ai.langchain.agent_ext.text_splitter.character_splitter.CharacterSplitterResolver",
properties={
"separator": ["\n\n", "\n", ". ", ", "], # Try these separators in order
"chunk_size": 800,
"chunk_overlap": 150
}
)
With this configuration, the splitter will:
1. First try to split on blank lines ("\n\n")
2. If chunks are still too large, split on newlines ("\n")
3. If still too large, split on sentence endings (.)
4. If still too large, split on commas (,)
When to Use Character Splitting¶
Character splitting works well when:
- Documents have natural separators (paragraphs, lines, etc.)
- You need simple, fast splitting without complex logic
- Text is relatively well-structured
For more complex splitting needs involving recursive separators or language-aware splitting, consider using RecursiveSplitterResolver.
Related Resolvers¶
- RecursiveSplitterResolver - More advanced text splitting with recursive separator logic
- OpenAIEmbeddingsResolver - For embedding the split chunks
- ChromaResolver - For storing embedded chunks in a vector database