Parameter | Type | Default | Description |
---|---|---|---|
`max_chunk_length` | `int` | `400` | Maximum number of tokens per chunk |
`chunker_type` | `str` | `"regex"` | Chunking algorithm: `"regex"` or `"semantic"` |
`window_size` | `int` | `1` | Context window size for semantic chunking |
`delimiters` | `List[str]` | `[]` | Custom regex delimiters for regex chunking |
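The interaction between `delimiters` and `max_chunk_length` can be sketched as follows. This is an illustrative implementation, not the library's actual code, and it measures length in characters rather than tokens for simplicity:

```python
import re
from typing import List

def regex_chunk(text: str, delimiters: List[str], max_chunk_length: int = 400) -> List[str]:
    """Split text at any delimiter, then merge the pieces greedily so that
    no chunk exceeds max_chunk_length (characters here, tokens in practice)."""
    pattern = "|".join(f"(?:{d})" for d in delimiters)
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # Start a new chunk when adding this piece would overflow the limit.
        if current and len(current) + len(piece) + 1 > max_chunk_length:
            chunks.append(current)
            current = piece
        else:
            current = f"{current}\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = regex_chunk(text, delimiters=[r"\n\s*\n"], max_chunk_length=40)
# → ['First paragraph.\nSecond paragraph.', 'Third paragraph.']
```

The greedy merge keeps adjacent paragraphs together whenever they fit, so chunks track the document's natural breaks rather than cutting mid-sentence.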
Chunking is enabled through the `chunk` parameter of the [`extract` function](file-extraction) by passing `chunk=True`.

The `window_size` parameter controls how much surrounding context is considered when determining chunk boundaries. Larger values preserve more context but increase processing time.
Window Size | Context Preservation | Processing Speed | Use Case |
---|---|---|---|
1 (default) | Minimal | Fastest | Basic chunking needs |
2-3 | Moderate | Medium | Balanced approach |
4+ | Maximum | Slower | High-precision needs |
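To make the trade-off concrete, here is a minimal sketch of how a window could factor into boundary detection, assuming a simple word-overlap (Jaccard) similarity rather than whatever semantic model the library actually uses. Boundaries where the `window_size` sentences before and after share little vocabulary are good split points:

```python
import re
from typing import List, Set

def _words(s: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))

def boundary_scores(sentences: List[str], window_size: int = 1) -> List[float]:
    """Jaccard similarity between the window_size sentences before and after
    each candidate boundary; low scores mark good places to split."""
    scores = []
    for i in range(1, len(sentences)):
        before = _words(" ".join(sentences[max(0, i - window_size):i]))
        after = _words(" ".join(sentences[i:i + window_size]))
        union = before | after
        scores.append(len(before & after) / len(union) if union else 0.0)
    return scores

sentences = [
    "Cats sleep all day.",
    "Cats also hunt mice.",
    "Tax season starts in April.",
    "File your tax return early.",
]
scores = boundary_scores(sentences, window_size=1)
# the lowest score falls at the cats-to-taxes topic shift
```

A larger `window_size` pools more sentences into each comparison, which smooths out noisy single-sentence overlaps at the cost of more text to compare per boundary.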
Choose `regex` for speed and `semantic` for meaning preservation.

Content Type | Recommended Chunker | Max Chunk Length | Window Size | Notes |
---|---|---|---|---|
Technical documentation | regex | 400 | 1 | Often has clear section breaks |
Academic papers | semantic | 350 | 2 | Complex ideas need semantic coherence |
Legal documents | semantic | 300 | 3 | Precise context preservation is critical |
News articles | regex | 450 | 1 | Well-structured with clear paragraphs |
Transcripts | semantic | 500 | 2 | Spoken language benefits from semantic boundaries |
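The recommendations above can be expressed as a lookup table to keep configuration in one place. The content-type keys here are hypothetical names for illustration, not identifiers the library defines:

```python
# Starting points from the recommendation table; tune per corpus.
RECOMMENDED_SETTINGS = {
    "technical_documentation": {"chunker_type": "regex", "max_chunk_length": 400, "window_size": 1},
    "academic_paper": {"chunker_type": "semantic", "max_chunk_length": 350, "window_size": 2},
    "legal_document": {"chunker_type": "semantic", "max_chunk_length": 300, "window_size": 3},
    "news_article": {"chunker_type": "regex", "max_chunk_length": 450, "window_size": 1},
    "transcript": {"chunker_type": "semantic", "max_chunk_length": 500, "window_size": 2},
}

settings = RECOMMENDED_SETTINGS["legal_document"]
# settings["window_size"] → 3
```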
"\n#{1,6}\s+"
(matches Markdown headers)"\n\s*\n"
(matches paragraph breaks)"\n\s*[-*•]\s"
(matches list markers)"\n\d+\.\s+"
(matches numbered sections)
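A quick sanity check that each pattern matches the construct it targets; the sample strings are illustrative only:

```python
import re

# The four delimiter patterns above, paired with a sample each should match.
DELIMS = {
    "markdown_header": r"\n#{1,6}\s+",
    "paragraph_break": r"\n\s*\n",
    "list_marker": r"\n\s*[-*•]\s",
    "numbered_section": r"\n\d+\.\s+",
}

samples = {
    "markdown_header": "\n## Setup ",
    "paragraph_break": "\n  \n",
    "list_marker": "\n - item",
    "numbered_section": "\n2. Step two",
}

for name, pattern in DELIMS.items():
    assert re.match(pattern, samples[name]), name
```

Note that every pattern is anchored on a leading `\n`, so splits only occur at line boundaries, never inside a line.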