This page covers the fundamental concepts of the Aurelio SDK, explaining its key components and how they work together to provide powerful document processing capabilities.

Document Processing

Document processing in the Aurelio SDK converts unstructured documents (such as PDFs) into clean, readable markdown that can be further processed or used in AI applications.

The processing pipeline:

  1. Ingestion: Documents are uploaded either as local files or via URLs
  2. Quality Selection: Processing can be done in different quality modes:
    • low: Faster but less accurate
    • high: More accurate but slower
  3. Text Extraction: The system identifies and extracts text content
  4. Structure Recognition: The system identifies document elements such as headers, paragraphs, and tables
  5. Metadata Extraction: The system retrieves document metadata when available

# Example of document processing
# (client is an AurelioClient instance; see the synchronous example below)
response = client.extract_file(
    file_path="document.pdf",
    quality="high",  # use the slower, more accurate processing mode
    wait=30          # seconds to wait for processing to finish
)
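
Once extraction completes, the markdown output is available on the response object. A minimal sketch of reading it, assuming the response exposes a document object with a content attribute (verify against the ExtractResponse model in your installed SDK version):

# Access the extracted markdown (attribute names are an assumption)
markdown_text = response.document.content
print(markdown_text[:500])  # preview the first 500 characters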

Chunking

Chunking is the process of breaking long documents into smaller, semantically meaningful pieces that are optimized for downstream tasks like embedding and retrieval.

Why Chunking Matters

  1. Token Limitations: Most embedding models have maximum context windows
  2. Semantic Coherence: Properly chunked documents maintain meaning and context
  3. Retrieval Precision: Well-defined chunks improve retrieval accuracy
  4. Processing Efficiency: Smaller chunks reduce computational overhead

The SDK supports different chunking strategies:

  • Semantic Chunking: Creates chunks based on semantic boundaries (paragraphs, sections)
  • Fixed-Size Chunking: Creates chunks of approximately equal size
  • Custom Chunking: Allows you to configure chunking parameters to suit specific needs

# Example of custom chunking
from aurelio_sdk import ChunkingOptions

chunking_options = ChunkingOptions(
    chunker_type="semantic",  # chunk on semantic boundaries
    max_chunk_length=400,     # upper bound on chunk length
    window_size=5             # window used when detecting semantic boundaries
)

chunk_response = client.chunk(
    content=long_text,
    processing_options=chunking_options
)
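
The returned chunks can then be fed to downstream steps such as embedding. A minimal sketch of collecting the chunk texts, assuming the response exposes them under document.chunks with a content attribute on each chunk (verify against the ChunkResponse model in your SDK version):

# Collect chunk texts for embedding (attribute names are an assumption)
chunk_texts = [chunk.content for chunk in chunk_response.document.chunks]
print(f"Produced {len(chunk_texts)} chunks")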

Embedding

Embeddings are dense vector representations of text that capture semantic meaning in a form that machines can process efficiently. They enable semantic search, similarity comparison, and other NLP applications.

Embedding Applications

  1. Semantic Search: Find contextually similar content beyond keyword matching
  2. Information Retrieval: Retrieve relevant document sections for RAG applications
  3. Document Similarity: Compare documents based on meaning rather than exact wording
  4. Content Organization: Cluster similar content automatically

The SDK supports multiple embedding models to suit different needs, letting you balance quality against performance.

# Example of generating embeddings
embedding_response = client.embedding(
    input=chunk_texts,
    model="bm25"  # Choose embedding model based on needs
)
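
Once you have embedding vectors, applications such as semantic search reduce to comparing them, most commonly with cosine similarity. The sketch below is plain numpy over toy dense vectors; how you extract vectors from embedding_response depends on the chosen model and response schema, so treat the toy vectors as stand-ins:

# Cosine similarity between two dense vectors (generic sketch, not SDK-specific)
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for a query embedding and a chunk embedding
query_vector = [0.1, 0.7, 0.2]
chunk_vector = [0.2, 0.6, 0.1]
print(f"semantic similarity: {cosine_similarity(query_vector, chunk_vector):.3f}")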

Async vs Sync Approaches

The Aurelio SDK offers both synchronous and asynchronous APIs to accommodate different usage patterns.

When to Use Synchronous API

  • Simple Scripts: For straightforward, linear processing flows
  • Small Documents: When processing time is minimal
  • Development/Testing: During initial development or debugging
  • Single Document Processing: When handling one document at a time

# Synchronous API example
import os
from aurelio_sdk import AurelioClient

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])
response = client.extract_file(file_path="document.pdf", wait=30)

When to Use Asynchronous API

  • Production Applications: For high-throughput systems
  • Large Documents: When processing may take significant time
  • Batch Processing: When handling multiple documents simultaneously
  • Web Applications: To prevent blocking the main thread

# Asynchronous API example
import os
from aurelio_sdk import AsyncAurelioClient

async_client = AsyncAurelioClient(api_key=os.environ["AURELIO_API_KEY"])

async def process_document():
    response = await async_client.extract_file(file_path="document.pdf")
    return response

The async API provides significant performance improvements for concurrent processing scenarios, making it the preferred choice for production applications with substantial throughput requirements.
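
For example, several documents can be extracted concurrently with asyncio.gather. The file paths below are placeholders and error handling is omitted; this is a sketch of the pattern rather than a production pipeline:

# Concurrent extraction of multiple documents
import asyncio
import os
from aurelio_sdk import AsyncAurelioClient

async_client = AsyncAurelioClient(api_key=os.environ["AURELIO_API_KEY"])

async def process_all(paths):
    # Launch one extraction task per document and await them together
    tasks = [async_client.extract_file(file_path=p) for p in paths]
    return await asyncio.gather(*tasks)

responses = asyncio.run(process_all(["report1.pdf", "report2.pdf", "report3.pdf"]))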