Core concepts
This page covers the core concepts of the Aurelio SDK: its key components and how they work together to process documents.
Document Processing
Document processing in the Aurelio SDK converts unstructured documents (PDFs) into easily readable markdown that can be further processed or used in AI applications.
The processing pipeline:
- Ingestion: Documents are uploaded either as local files or via URLs.
- Quality Selection: Processing can run in one of two quality modes:
  - low: Faster but less accurate
  - high: More accurate but slower
- Text Extraction: Identifies and extracts the text content.
- Structure Recognition: Identifies document elements such as headers, paragraphs, and tables.
- Metadata Extraction: Retrieves document metadata when available.
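The ingestion and quality choices above can be modeled as a small request object. This is a minimal sketch, not the SDK's API: `ExtractionRequest`, its fields, and `is_url` are hypothetical names used for illustration, and only the `low` and `high` quality values come from this page.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Quality modes named on this page; everything else here is hypothetical.
QUALITY_MODES = ("low", "high")


@dataclass(frozen=True)
class ExtractionRequest:
    """Hypothetical description of one document to process."""

    source: str           # local file path or URL
    quality: str = "low"  # "low" is faster, "high" is more accurate

    def __post_init__(self):
        # Reject anything other than the two documented modes.
        if self.quality not in QUALITY_MODES:
            raise ValueError(f"quality must be one of {QUALITY_MODES}")

    @property
    def is_url(self) -> bool:
        # Treat http(s) sources as URLs, anything else as a local path.
        return urlparse(self.source).scheme in ("http", "https")
```

Validating the quality mode up front means a typo such as `"medium"` fails immediately instead of after an upload.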
Chunking
Chunking is the process of breaking long documents into smaller, semantically meaningful pieces that are optimized for downstream tasks like embedding and retrieval.
Why Chunking Matters
- Token Limitations: Embedding models accept a limited number of tokens per input, so long documents must be split to fit
- Semantic Coherence: Properly chunked documents maintain meaning and context
- Retrieval Precision: Well-defined chunks improve retrieval accuracy
- Processing Efficiency: Smaller chunks reduce computational overhead
The SDK supports different chunking strategies:
- Semantic Chunking: Creates chunks based on semantic boundaries (paragraphs, sections)
- Fixed-Size Chunking: Creates chunks of approximately equal size
- Custom Chunking: Configure chunking parameters to suit specific needs
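To make the difference between the first two strategies concrete, here is a minimal pure-Python sketch. It does not use the SDK and is not how the SDK implements chunking: `fixed_size_chunks` and `paragraph_chunks` are hypothetical helpers, sizes are measured in characters rather than tokens, and blank lines are assumed to mark paragraph boundaries.

```python
from typing import List


def fixed_size_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def paragraph_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs from repeated blank lines
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The fixed-size version can cut a sentence in half, while the paragraph version never splits a paragraph, which is the trade-off between predictable size and semantic coherence described above.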
Embedding
Embeddings are dense vector representations of text that capture semantic meaning in a form that machines can process efficiently. They enable semantic search, similarity comparison, and other NLP applications.
Embedding Applications
- Semantic Search: Find contextually similar content beyond keyword matching
- Information Retrieval: Retrieve relevant document sections for RAG applications
- Document Similarity: Compare documents based on meaning rather than exact wording
- Content Organization: Cluster similar content automatically
The SDK supports multiple embedding models, so you can trade off quality against performance.
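Semantic search and document similarity both come down to comparing vectors. The sketch below shows one common comparison, cosine similarity, in plain Python. It is illustrative only: `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and in practice the vectors would come from an embedding model rather than being written by hand.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as 0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[str]:
    """Return candidate names ordered from most to least similar to query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )
```

Ranking chunk embeddings against a query embedding in this way is the basic retrieval step behind the RAG use case listed above.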
Async vs Sync Approaches
The Aurelio SDK offers both synchronous and asynchronous APIs to accommodate different usage patterns.
When to Use Synchronous API
- Simple Scripts: For straightforward, linear processing flows
- Small Documents: When processing time is minimal
- Development/Testing: During initial development or debugging
- Single Document Processing: When handling one document at a time
When to Use Asynchronous API
- Production Applications: For high-throughput systems
- Large Documents: When processing may take significant time
- Batch Processing: When handling multiple documents simultaneously
- Web Applications: To avoid blocking request handling while documents are processed
Because asynchronous calls let many documents be in flight at once, the async API can improve throughput considerably when documents are processed concurrently. That makes it the usual choice for production applications with substantial throughput requirements.
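The batch-processing case can be illustrated with the standard library's `asyncio`. This is a minimal sketch rather than SDK code: `process_document` is a hypothetical stand-in that simulates a network-bound call with `asyncio.sleep`, and `process_batch` is likewise an illustrative name.

```python
import asyncio
from typing import List


async def process_document(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network-bound processing call."""
    await asyncio.sleep(delay)  # simulates waiting on a remote service
    return f"{name}: done"


async def process_batch(names: List[str]) -> List[str]:
    """Process all documents concurrently; results keep the input order."""
    return await asyncio.gather(*(process_document(n) for n in names))


if __name__ == "__main__":
    results = asyncio.run(process_batch(["a.pdf", "b.pdf", "c.pdf"]))
    for line in results:
        print(line)
```

Since the simulated waits overlap, the batch takes roughly as long as the slowest document rather than the sum of all of them. A synchronous loop over the same documents would wait for each one in turn.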