File extraction
This guide provides technical details about processing different types of files with the Aurelio SDK, including PDFs, videos, and web content. It covers all available parameters, recommended configurations, and waiting strategies for large files.
Processing Flow
Common Parameters
All file extraction methods accept these core parameters:
Parameter | Type | Default | Description |
---|---|---|---|
model | "aurelio-base", "docling-base", or "gemini-2-flash-lite" | "aurelio-base" | Model to use for processing. Different models have different capabilities and price points. |
chunk | bool | True | Whether to chunk the document using default chunking config. |
wait | int | 30 | Time in seconds to wait for processing completion. Set to -1 to wait indefinitely. Set to 0 to return immediately with a document ID. |
polling_interval | int | 5 | Time in seconds between status check requests. Set to 0 to disable polling. |
retries | int | 3 | Number of retry attempts in case of API errors (5xx). |
processing_options | dict | None | Additional processing options for customizing extraction and chunking behavior. |
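To show how these parameters fit together, the sketch below collects them in a small dataclass and checks the value ranges described in the table. `ExtractOptions` and `to_kwargs` are hypothetical names used for illustration only and are not part of the SDK; only the parameter names, defaults, and model names come from the table.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Model names taken from the parameter table.
KNOWN_MODELS = ("aurelio-base", "docling-base", "gemini-2-flash-lite")


@dataclass
class ExtractOptions:
    """Hypothetical container mirroring the documented defaults."""

    model: str = "aurelio-base"
    chunk: bool = True
    wait: int = 30
    polling_interval: int = 5
    retries: int = 3
    processing_options: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.model not in KNOWN_MODELS:
            raise ValueError(f"unknown model: {self.model!r}")
        if self.wait < -1:
            raise ValueError("wait must be -1 (indefinite), 0 (immediate), or positive")
        if self.polling_interval < 0:
            raise ValueError("polling_interval must be 0 (disabled) or positive")
        if self.retries < 0:
            raise ValueError("retries must not be negative")

    def to_kwargs(self) -> dict:
        """Return the options as keyword arguments for an extraction call."""
        return asdict(self)
```

Validating before the request is sent means a mistyped model name fails locally instead of after an upload.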
Note: The `quality` parameter has been deprecated and replaced with the `model` parameter.
- For PDF: `quality="low"` is equivalent to `model="aurelio-base"` (fastest, cheapest, best for clean PDFs)
- For PDF: `quality="high"` is equivalent to `model="docling-base"` (code-based OCR for high precision)
- For PDF: A new option `model="gemini-2-flash-lite"` uses a Vision Language Model (VLM) for state-of-the-art text extraction. Note that VLMs can offer superior PDF-to-text performance but come with the risk of hallucinating PDF content (Y. Liu, et al.).
- For MP4: Both quality settings used `"aurelio-base"` but with different chunking methods, now specified in `processing_options`
- MP4 files can only be processed with `model="aurelio-base"`
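When migrating older code, the mapping in this note can be expressed as a small helper. This is a minimal sketch: `migrate_quality` is a hypothetical function, not an SDK utility. The `"chunker_type": "semantic"` option is the one this guide names as the replacement for `quality="high"` on video, but its exact placement inside `processing_options` is an assumption.

```python
from typing import Any, Dict


def migrate_quality(quality: str, file_type: str) -> Dict[str, Any]:
    """Translate a deprecated quality value into current keyword arguments."""
    quality = quality.lower()
    file_type = file_type.lower()
    if quality not in ("low", "high"):
        raise ValueError(f"unsupported quality: {quality!r}")

    if file_type == "pdf":
        model = "aurelio-base" if quality == "low" else "docling-base"
        return {"model": model}

    if file_type == "mp4":
        # Video always uses aurelio-base; quality only affected chunking.
        kwargs: Dict[str, Any] = {"model": "aurelio-base"}
        if quality == "high":
            # Assumption: the key may need to be nested differently.
            kwargs["processing_options"] = {"chunker_type": "semantic"}
        return kwargs

    raise ValueError(f"unsupported file type: {file_type!r}")
```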
Processing from PDF Files
The SDK enables extracting text from PDF documents stored as local files.
Method Signature
Usage Examples
From a file path:
From file bytes:
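The sketch below covers both cases with one wrapper. It rests on assumptions that should be checked against the SDK reference: that the client exposes an `extract_file` method, and that it accepts `file_path` for a path on disk and `file` for raw bytes. `extract_pdf` itself is a hypothetical helper, and the client is passed in so the snippet does not depend on a particular import path.

```python
from pathlib import Path
from typing import Any, Union


def extract_pdf(client: Any, source: Union[str, Path, bytes], **options: Any) -> Any:
    """Send a local PDF, given as a path or as bytes, for extraction.

    Assumed (not confirmed here): client.extract_file(file_path=...) for
    paths and client.extract_file(file=...) for in-memory content.
    """
    kwargs = {"model": "aurelio-base", "chunk": True, "wait": 30}
    kwargs.update(options)  # caller-supplied values override the defaults

    if isinstance(source, (bytes, bytearray)):
        return client.extract_file(file=bytes(source), **kwargs)
    return client.extract_file(file_path=str(source), **kwargs)
```

A call such as `extract_pdf(client, "report.pdf", model="docling-base", wait=-1)` would then block until the document is processed.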
PDF Processing Recommendations
- Use `model="aurelio-base"` for faster processing of simple documents (equivalent to the old `quality="low"`)
- Use `model="docling-base"` for complex documents with tables, diagrams, or mixed layouts (equivalent to the old `quality="high"`)
- Use `model="gemini-2-flash-lite"` for state-of-the-art text extraction using a Vision Language Model
- For large PDFs (>100 pages) or image-heavy PDFs, consider increasing the `wait` time or using `-1`
- The SDK automatically handles pagination and merges content across pages
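These recommendations can be turned into a simple rule of thumb. The helper below is a hypothetical sketch, not an SDK function: it picks between the two non-VLM models and switches to an indefinite wait for documents above the 100-page mark or flagged as image-heavy. Choosing `gemini-2-flash-lite` is left to the caller, since it trades accuracy against hallucination risk.

```python
def pdf_settings(page_count: int, complex_layout: bool = False, image_heavy: bool = False) -> dict:
    """Suggest model and wait values for a PDF, following the list above."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")

    # Tables, diagrams, or mixed layouts benefit from docling-base.
    model = "docling-base" if complex_layout else "aurelio-base"

    # Large (>100 pages) or image-heavy PDFs: wait until processing finishes.
    wait = -1 if page_count > 100 or image_heavy else 30

    return {"model": model, "wait": wait}
```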
Processing from Video Files
The SDK can extract transcriptions from video files (MP4 format).
Usage Examples
Video Processing Recommendations
- Only `model="aurelio-base"` is supported for video transcription
- Specify chunking preferences in `processing_options` (use `"chunker_type": "semantic"` for better chunking, equivalent to the old `quality="high"`)
- Set `wait=-1` for videos longer than 5 minutes
- Use a longer `polling_interval` (15-30 seconds) for videos to reduce API calls
- Video processing is more resource-intensive and may take several minutes for longer files
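As an illustration, the helper below assembles keyword arguments that follow these recommendations. `video_settings` is a hypothetical name, and the flat placement of `chunker_type` inside `processing_options` is an assumption; the option may need to sit under a nested key.

```python
def video_settings(duration_seconds: float, semantic_chunks: bool = True) -> dict:
    """Suggest extraction settings for an MP4 of the given length."""
    settings = {
        "model": "aurelio-base",  # the only model supported for video
        "polling_interval": 15,   # lower end of the suggested 15-30 s range
        # Videos longer than 5 minutes: block until processing completes.
        "wait": -1 if duration_seconds > 300 else 30,
    }
    if semantic_chunks:
        # Assumption: exact nesting of this key is not confirmed here.
        settings["processing_options"] = {"chunker_type": "semantic"}
    return settings
```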
Processing from URLs
Extract content from web-based URLs, including PDF documents and webpages.
Method Signature
Usage Examples
URL Processing Recommendations
- For PDF URLs, follow the same model recommendations as for PDF files
- For web pages, use `model="docling-base"` to better preserve page structure
- For video URLs, only `model="aurelio-base"` is supported
- When extracting from dynamic websites, be aware that client-side rendered content may not be fully captured
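One way to apply these rules automatically is to pick the model from the URL's file extension. The sketch below is a heuristic under that assumption; `model_for_url` is a hypothetical helper, and a URL that serves a PDF or video without a matching extension will be treated as a web page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def model_for_url(url: str, complex_pdf: bool = False) -> str:
    """Choose a model name based on the extension in the URL path."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()

    if suffix == ".mp4":
        return "aurelio-base"  # only model supported for video
    if suffix == ".pdf":
        # Same choice as for local PDF files.
        return "docling-base" if complex_pdf else "aurelio-base"
    # Anything else is treated as a web page.
    return "docling-base"
```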
Waiting Strategies for Large Files
Processing large files (extensive PDFs or long videos) requires appropriate waiting strategies to handle longer processing times.
Recommended Strategies
- Immediate Return (`wait=0`):
  - Best for very large files where you want to process asynchronously
  - You must handle polling separately
  - Good for user-facing applications to avoid blocking
- Wait Until Completion (`wait=-1`):
  - Simplest approach for backend processing
  - Blocks until processing completes
  - Use `polling_interval` to control how frequently to check status
  - Best for batch processing jobs or automation
- Fixed Wait Time (`wait=30`):
  - Wait for a predefined time (default 30 seconds)
  - Returns with whatever status is available after that time
  - Good for medium-sized files where you expect processing to be quick
Example: Progressive Polling with Timeout
For large files with uncertain processing times, implement a progressive polling strategy:
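A minimal sketch of such a strategy is shown here. It is deliberately independent of the SDK: `fetch` stands for whatever call retrieves the document by the ID returned from a `wait=0` extraction, and `is_done` inspects its status. Both, along with the name `poll_until_done` and the default intervals, are illustrative assumptions.

```python
import time
from typing import Any, Callable


def poll_until_done(
    fetch: Callable[[], Any],
    is_done: Callable[[Any], bool],
    timeout: float = 600.0,
    initial_interval: float = 5.0,
    max_interval: float = 30.0,
    backoff: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call fetch() with growing pauses until is_done(result) or timeout."""
    deadline = clock() + timeout
    interval = initial_interval
    while True:
        result = fetch()
        if is_done(result):
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"document not ready after {timeout} seconds")
        # Never sleep past the deadline; lengthen the pause after each check.
        sleep(min(interval, remaining))
        interval = min(interval * backoff, max_interval)
```

Starting with short intervals keeps latency low for documents that finish quickly, while the cap on the interval bounds the number of status requests for long-running jobs. The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real delays.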
Response Structure
The `ExtractResponse` object contains detailed information about the processed document:
The `ResponseDocument` contains:
Error Handling
The SDK can raise several exceptions during file processing:
- `APITimeoutError`: Raised when the request exceeds the wait time
- `APIError`: General API error with details in the message
- `ApiRateLimitError`: Raised when API rate limits are exceeded
Example error handling:
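The sketch below shows one possible pattern: retry with exponential backoff on exceptions the caller marks as retryable, and let everything else propagate. It is written generically so it runs without the SDK. In practice `retry_on` would hold the rate-limit exception class listed above, whose import path is not shown in this guide. `call_with_retry` is a hypothetical helper.

```python
import time
from typing import Any, Callable, Tuple, Type


def call_with_retry(
    call: Callable[[], Any],
    retry_on: Tuple[Type[Exception], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run call(); on a retryable exception, back off and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

A timeout is usually better handled by polling for the existing document than by retrying, since a retry would submit the file again.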