# Chunking
This guide provides detailed technical information about document chunking capabilities in the Aurelio SDK. Chunking is the process of dividing a document into smaller, semantically meaningful segments for improved processing and retrieval.
## Chunking Options
The SDK provides a flexible chunking API with several configurable parameters:
Parameter | Type | Default | Description |
---|---|---|---|
`max_chunk_length` | int | 400 | Maximum number of tokens per chunk |
`chunker_type` | str | "regex" | Chunking algorithm: "regex" or "semantic" |
`window_size` | int | 1 | Context window size for semantic chunking |
`delimiters` | List[str] | [] | Custom regex delimiters for regex chunking |
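
For orientation, here is a sketch that sets all four parameters; it assumes `ChunkingOptions` is the options model exported by `aurelio_sdk`:

```python
from aurelio_sdk import ChunkingOptions

# Configure chunking behavior; parameter names mirror the table above.
options = ChunkingOptions(
    max_chunk_length=400,   # cap each chunk at 400 tokens
    chunker_type="regex",   # "regex" (default) or "semantic"
    window_size=1,          # used only by the semantic chunker
    delimiters=[],          # custom regex delimiters for the regex chunker
)
```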
## Chunking Methods
The SDK offers two primary methods for chunking documents:
- Direct chunking of text content via the `chunk` function
- Chunking during file processing via the [`extract` function](file-extraction)
### Direct Text Chunking
#### Usage Example
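A minimal sketch of chunking raw text; `AurelioClient`, the `chunk` method, and the `response.document.chunks` access path are assumptions based on typical SDK usage and may differ by version:

```python
import os

from aurelio_sdk import AurelioClient, ChunkingOptions

# Assumed entry point; the client reads an API key from the environment.
client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])

content = (
    "Chunking divides a document into smaller, semantically meaningful "
    "segments for improved processing and retrieval."
)

options = ChunkingOptions(chunker_type="regex", max_chunk_length=400)
response = client.chunk(content=content, processing_options=options)

# Chunks are assumed to live on response.document.chunks.
for chunk in response.document.chunks:
    print(chunk.content)
```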
### Chunking During Extraction
When processing files, chunking can be enabled with the `chunk=True` parameter:
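A sketch of extraction with chunking enabled; the `extract_file` method and its `quality` parameter are assumptions drawn from the file-extraction guide, so check your SDK version for exact names:

```python
import os

from aurelio_sdk import AurelioClient

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])

# Extract a PDF and chunk the extracted text in one call.
response = client.extract_file(
    file_path="report.pdf",
    quality="high",
    chunk=True,  # enable chunking of the extracted text
)

for chunk in response.document.chunks:
    print(chunk.content)
```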
## Chunking Algorithms
The SDK supports two chunking algorithms, each with different characteristics and use cases.
### Regex Chunking
The default chunking method uses regular expressions to split text based on delimiters.
Key characteristics:
- Fast and deterministic
- Respects natural text boundaries like paragraphs
- Works well for well-structured documents
- Less compute-intensive than semantic chunking
Best for:
- Well-formatted text with clear paragraph breaks
- Large volumes of documents where processing speed is important
- Situations where chunk boundaries are less critical
Example with custom delimiters:
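A sketch using the assumed client and options model from above; raw strings keep the regex escapes intact:

```python
import os

from aurelio_sdk import AurelioClient, ChunkingOptions

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])

text = "# Title\n\nFirst paragraph.\n\n## Section\n\nSecond paragraph."

# Split on Markdown headers and on blank-line paragraph breaks.
options = ChunkingOptions(
    chunker_type="regex",
    max_chunk_length=400,
    delimiters=[r"\n#{1,6}\s+", r"\n\s*\n"],
)
response = client.chunk(content=text, processing_options=options)
```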
### Semantic Chunking
A more advanced algorithm that attempts to preserve semantic meaning across chunk boundaries.
Key characteristics:
- Preserves semantic meaning across chunks
- More compute-intensive than regex chunking
- Creates more coherent chunks for complex content
- Better respects topical boundaries
Best for:
- Complex documents where semantic coherence is important
- Content that will be used for semantic search or LLM context
- Documents with varied formatting where regex may struggle
Example with window size:
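A sketch under the same assumed API as the earlier examples:

```python
import os

from aurelio_sdk import AurelioClient, ChunkingOptions

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])

with open("article.txt", encoding="utf-8") as f:
    long_document = f.read()

# window_size=3 weighs more surrounding text when placing boundaries.
options = ChunkingOptions(
    chunker_type="semantic",
    max_chunk_length=400,
    window_size=3,
)
response = client.chunk(content=long_document, processing_options=options)
```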
The `window_size` parameter controls how much surrounding context is considered when determining chunk boundaries. Larger values preserve more context but increase processing time.
## Window-Based Processing
Semantic chunking uses a sliding window approach to maintain context across chunk boundaries.
### Impact of Window Size
Window Size | Context Preservation | Processing Speed | Use Case |
---|---|---|---|
1 (default) | Minimal | Fastest | Basic chunking needs |
2-3 | Moderate | Medium | Balanced approach |
4+ | Maximum | Slower | High-precision needs |
## Response Structure
The chunking response provides detailed information about each generated chunk. Each chunk carries its text content along with identifying, positional, and token-count metadata.
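A sketch of inspecting the response; the field names below (`status`, `usage`, `id`, `chunk_index`, `num_tokens`) are assumptions and should be verified against your SDK version:

```python
import os

from aurelio_sdk import AurelioClient, ChunkingOptions

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])
response = client.chunk(
    content="Some text to chunk.",
    processing_options=ChunkingOptions(max_chunk_length=400),
)

# Field names are assumptions; verify them against your SDK version.
print(response.status)  # request status
print(response.usage)   # usage accounting for the request

for chunk in response.document.chunks:
    print(chunk.id)           # unique chunk identifier
    print(chunk.chunk_index)  # position of the chunk within the document
    print(chunk.num_tokens)   # token count of the chunk
    print(chunk.content)      # the chunk text itself
```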
## Recommendations for Effective Chunking
### General Guidelines
- Choose the right algorithm: Use `regex` for speed, `semantic` for meaning preservation
- Set appropriate chunk sizes: 300-500 tokens works well for most applications
- Customize for your content: Adjust parameters based on document structure
### By Content Type
Content Type | Recommended Chunker | Max Chunk Length | Window Size | Notes |
---|---|---|---|---|
Technical documentation | regex | 400 | 1 | Often has clear section breaks |
Academic papers | semantic | 350 | 2 | Complex ideas need semantic coherence |
Legal documents | semantic | 300 | 3 | Precise context preservation is critical |
News articles | regex | 450 | 1 | Well-structured with clear paragraphs |
Transcripts | semantic | 500 | 2 | Spoken language benefits from semantic boundaries |
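
If it helps to encode these recommendations, here is a sketch of presets built on the assumed `ChunkingOptions` model; treat them as starting points and tune against your own corpus:

```python
from aurelio_sdk import ChunkingOptions

# Presets mirroring the table above (assumed ChunkingOptions model).
PRESETS = {
    "technical_docs": ChunkingOptions(chunker_type="regex", max_chunk_length=400),
    "academic_papers": ChunkingOptions(chunker_type="semantic", max_chunk_length=350, window_size=2),
    "legal": ChunkingOptions(chunker_type="semantic", max_chunk_length=300, window_size=3),
    "news": ChunkingOptions(chunker_type="regex", max_chunk_length=450),
    "transcripts": ChunkingOptions(chunker_type="semantic", max_chunk_length=500, window_size=2),
}
```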
### Performance Considerations
- Regex chunking is significantly faster (5-10x) than semantic chunking
- Processing time increases with document size and window size
- For very large documents (>1MB of text), consider preprocessing into smaller segments
## Advanced Usage: Custom Delimiters
For regex chunking, you can provide custom delimiters to better match your document structure:
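For instance, a sketch combining two of the patterns listed below, using the same assumed client and options model as earlier:

```python
from aurelio_sdk import AurelioClient, ChunkingOptions

client = AurelioClient(api_key="YOUR_API_KEY")

# Prefer numbered-section boundaries, then fall back to paragraph breaks.
options = ChunkingOptions(
    chunker_type="regex",
    max_chunk_length=400,
    delimiters=[r"\n\d+\.\s+", r"\n\s*\n"],
)
```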
Common delimiter patterns:
- Headers: `"\n#{1,6}\s+"` (matches Markdown headers)
- Paragraphs: `"\n\s*\n"` (matches paragraph breaks)
- List items: `"\n\s*[-*•]\s"` (matches list markers)
- Sections: `"\n\d+\.\s+"` (matches numbered sections)