File extraction

This guide provides technical details about processing different types of files with the Aurelio SDK, including PDFs, videos, and web content. It covers all available parameters, recommended configurations, and waiting strategies for large files.

Processing Flow

Common Parameters

All file extraction methods accept these core parameters:

Parameter	Type	Default	Description
`model`	`"aurelio-base"` \| `"docling-base"` \| `"gemini-2-flash-lite"`	`"aurelio-base"`	Model to use for processing. Different models have different capabilities and price points.
`chunk`	`bool`	`True`	Whether to chunk the document using default chunking config.
`wait`	`int`	`30`	Time in seconds to wait for processing completion. Set to `-1` to wait indefinitely. Set to `0` to return immediately with a document ID.
`polling_interval`	`int`	`5`	Time in seconds between status check requests. Set to `0` to disable polling.
`retries`	`int`	`3`	Number of retry attempts in case of API errors (5xx).
`processing_options`	`dict`	`None`	Additional processing options for customizing extraction and chunking behavior.

Note: The quality parameter has been deprecated and replaced with the model parameter.

For PDF: quality="low" is equivalent to model="aurelio-base" (fastest, cheapest, best for clean PDFs)

For PDF: quality="high" is equivalent to model="docling-base" (code-based OCR for high precision)

For PDF: A new option model="gemini-2-flash-lite" uses a Vision Language Model (VLM) for state-of-the-art text extraction. Note that VLMs can offer superior PDF-to-text performance but come with the risk of hallucinating PDF content ^{Y. Liu, et al.}

For MP4: Both quality settings used "aurelio-base" but with different chunking methods, now specified in processing_options

MP4 files can only be processed with model="aurelio-base"

Processing from PDF Files

The SDK enables extracting text from PDF documents stored as local files.

Method Signature

def extract_file(
    self,
    file: Optional[Union[IO[bytes], bytes]] = None,
    file_path: Optional[Union[str, pathlib.Path]] = None,
    model: Literal["aurelio-base", "docling-base", "gemini-2-flash-lite"] = "aurelio-base",
    chunk: bool = True,
    wait: int = 30,
    polling_interval: int = 5,
    retries: int = 3,
    processing_options: Optional[Dict[str, Any]] = None,
) -> ExtractResponse:
    """Process a document from a file synchronously."""

Usage Examples

From a file path:

from aurelio_sdk import AurelioClient

client = AurelioClient()
response = client.extract_file(
    file_path="path/to/document.pdf",
    model="aurelio-base",  # Fastest and cheapest option, best for clean PDF files
    chunk=True,
    wait=30
)

From file bytes:

with open("path/to/document.pdf", "rb") as f:
    file_bytes = f.read()

file_bytes_io = io.BytesIO(file_bytes)
file_bytes_io.name = "document.pdf"  # Name is important for content type detection

response = client.extract_file(
    file=file_bytes_io,
    model="docling-base",  # Better for complex documents requiring high precision
    chunk=True,
    wait=-1  # Wait until completion
)

PDF Processing Recommendations

Use model="aurelio-base" for faster processing of simple documents (equivalent to old quality="low")
Use model="docling-base" for complex documents with tables, diagrams, or mixed layouts (equivalent to old quality="high")
Use model="gemini-2-flash-lite" for state-of-the-art text extraction using a Vision Language Model
For large PDFs (>100 pages) or image-heavy PDFs, consider increasing wait time or using -1
The SDK automatically handles pagination and merges content across pages

Processing from Video Files

The SDK can extract transcriptions from video files (MP4 format).

Usage Examples

response = client.extract_file(
    file_path="path/to/video.mp4",
    model="aurelio-base",  # Only model supported for video processing
    chunk=True,
    wait=-1,         # Video processing can take longer
    polling_interval=15,
    processing_options={
        "chunking": {
            "chunker_type": "semantic"  # Use semantic chunking (previously achieved with quality="high")
        }
    }
)

Video Processing Recommendations

Only model="aurelio-base" is supported for video transcription
Specify chunking preferences in processing_options (use “chunker_type”: “semantic” for better chunking, equivalent to old quality="high")
Set wait=-1 for videos longer than 5 minutes
Use a longer polling_interval (15-30 seconds) for videos to reduce API calls
Video processing is more resource-intensive and may take several minutes for longer files

Processing from URLs

Extract content from web-based URLs, including PDF documents and webpages.

Method Signature

def extract_url(
    self,
    url: str,
    model: Literal["aurelio-base", "docling-base", "gemini-2-flash-lite"],
    chunk: bool,
    wait: int = 30,
    polling_interval: int = 5,
    retries: int = 3,
    processing_options: Optional[Dict[str, Any]] = None,
) -> ExtractResponse:
    """Process a document from a URL synchronously."""

Usage Examples

# PDF URL
pdf_response = client.extract_url(
    url="https://example.com/document.pdf",
    model="aurelio-base",
    chunk=True,
    wait=30
)

# Web page URL
webpage_response = client.extract_url(
    url="https://example.com/blog/article",
    model="docling-base",
    chunk=True,
    wait=30
)

# Video URL (only supports aurelio-base)
video_response = client.extract_url(
    url="https://example.com/video.mp4",
    model="aurelio-base",
    chunk=True,
    wait=60,
    processing_options={
        "chunking": {
            "chunker_type": "semantic"
        }
    }
)

URL Processing Recommendations

For PDF URLs, follow the same model recommendations as for PDF files
For web pages, use model="docling-base" to better preserve page structure
For video URLs, only model="aurelio-base" is supported
When extracting from dynamic websites, be aware that client-side rendered content may not be fully captured

Waiting Strategies for Large Files

Processing large files (extensive PDFs or long videos) requires appropriate waiting strategies to handle longer processing times.

Recommended Strategies

Immediate Return (wait=0):
- Best for very large files where you want to process asynchronously
- You must handle polling separately
- Good for user-facing applications to avoid blocking
Wait Until Completion (wait=-1):
- Simplest approach for backend processing
- Blocks until processing completes
- Use polling_interval to control how frequently to check status
- Best for batch processing jobs or automation
Fixed Wait Time (wait=30):
- Wait for a predefined time (default 30 seconds)
- Returns with whatever status is available after that time
- Good for medium-sized files where you expect processing to be quick

Example: Progressive Polling with Timeout

For large files with uncertain processing times, implement a progressive polling strategy:

from aurelio_sdk import AurelioClient, TaskStatus

client = AurelioClient()

# First request returns immediately with document ID
response = client.extract_file(
    file_path="large_document.pdf",
    model="docling-base",
    chunk=True,
    wait=0
)

document_id = response.document.id
status = response.status
total_wait_time = 0
max_wait_time = 300  # 5 minutes
polling_interval = 10  # Start polling every 10 seconds

while status == TaskStatus.pending and total_wait_time < max_wait_time:
    # Wait before polling again
    time.sleep(polling_interval)
    total_wait_time += polling_interval
    
    # Get document status
    response = client.get_document(document_id)
    status = response.status
    
    # Increase polling interval for longer waits
    if total_wait_time > 60:
        polling_interval = 30
    
    print(f"Waited {total_wait_time}s, status: {status}")

if status == TaskStatus.completed:
    print("Processing completed successfully")
    document = response.document
else:
    print(f"Processing incomplete after {total_wait_time}s")

Response Structure

The ExtractResponse object contains detailed information about the processed document:

class ExtractResponse(BaseModel):
    status: TaskStatus  # "pending", "completed", or "failed"
    usage: Usage        # Resource usage information
    message: Optional[str]  # Additional information (e.g., errors)
    processing_options: ExtractProcessingOptions  # Applied options
    document: ResponseDocument  # The processed document

The ResponseDocument contains:

class ResponseDocument(BaseModel):
    id: str           # Document ID
    content: str      # Full document content
    source: str       # Source filename or URL
    source_type: SourceType  # MIME type (application/pdf, video/mp4, etc.)
    num_chunks: int   # Number of chunks if chunking was enabled
    metadata: Dict[str, Any]  # User-definable metadata
    chunks: List[ResponseChunk]  # List of document chunks

Error Handling

The SDK can raise several exceptions during file processing:

APITimeoutError: Raised when the request exceeds the wait time
APIError: General API error with details in the message
ApiRateLimitError: Raised when API rate limits are exceeded

Example error handling:

from aurelio_sdk import AurelioClient, APIError, APITimeoutError, ApiRateLimitError

client = AurelioClient()

try:
    response = client.extract_file(
        file_path="document.pdf",
        model="docling-base",
        chunk=True,
        wait=30
    )
except APITimeoutError:
    print("Processing is taking longer than expected")
except ApiRateLimitError:
    print("Rate limit exceeded, try again later")
except APIError as e:
    print(f"API error: {e.message}")

Get Started

User Guide

Processing Flow

Common Parameters

Processing from PDF Files

Method Signature

Usage Examples

From a file path:

From file bytes:

PDF Processing Recommendations

Processing from Video Files

Usage Examples

Video Processing Recommendations

Processing from URLs

Method Signature

Usage Examples

URL Processing Recommendations

Waiting Strategies for Large Files

Recommended Strategies

Example: Progressive Polling with Timeout

Response Structure

Error Handling

Get Started

User Guide

​Processing Flow

​Common Parameters

​Processing from PDF Files

​Method Signature

​Usage Examples

​From a file path:

​From file bytes:

​PDF Processing Recommendations

​Processing from Video Files

​Usage Examples

​Video Processing Recommendations

​Processing from URLs

​Method Signature

​Usage Examples

​URL Processing Recommendations

​Waiting Strategies for Large Files

​Recommended Strategies

​Example: Progressive Polling with Timeout

​Response Structure

​Error Handling

Processing Flow

Common Parameters

Processing from PDF Files

Method Signature

Usage Examples

From a file path:

From file bytes:

PDF Processing Recommendations

Processing from Video Files

Usage Examples

Video Processing Recommendations

Processing from URLs

Method Signature

Usage Examples

URL Processing Recommendations

Waiting Strategies for Large Files

Recommended Strategies

Example: Progressive Polling with Timeout

Response Structure

Error Handling