Document Ingestion
Drag files into your corpus. We parse, chunk, embed, and index them automatically. Your documents become searchable AI knowledge in seconds.
Supported Formats
PDF: Reports, contracts, legal filings, scanned documents with OCR fallback
DOCX: Word documents with formatting, tables, and embedded images
XLSX: Spreadsheets, with each sheet processed as a separate section
TXT: Plain text, README files, logs, and structured data
HTML: Web pages with boilerplate removal, so only meaningful content is indexed
MD: Documentation, README files, and structured content with heading hierarchy
Maximum 50 MB per file. Batch upload supported. All files processed in parallel.
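If you want to catch rejected files before uploading, a client-side pre-check along these lines may help. This is a minimal sketch, not the platform's own validation: the extension set is an assumed mapping for the six listed formats, `check_upload` is a hypothetical helper name, and "50 MB" is read here as 50 * 1024 * 1024 bytes, which the platform may define differently.

```python
from pathlib import Path

# Assumption: "50 MB" means 50 * 1024 * 1024 bytes; the platform may use 50 * 10**6.
MAX_BYTES = 50 * 1024 * 1024

# Assumed file extensions for the six supported formats (hypothetical mapping).
SUPPORTED = {".pdf", ".docx", ".xlsx", ".txt", ".html", ".md"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

Calling `check_upload("contract-2024.pdf", 1024)` returns an empty list, while an unsupported extension or an oversized file each adds one message.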
How It Works
Open the CorpusAI platform, navigate to your corpus, and drag files directly into the upload zone. You can upload a single document or an entire batch — the system processes everything in parallel.
Each file shows real-time progress: parsing, chunking, embedding, indexing. When the status turns green, your document is live and queryable.
No configuration needed. No format conversion. Just drop your files and start querying.
Accepted formats: PDF, DOCX, XLSX, TXT, HTML, and MD, up to 50 MB per file.
Under the Hood
Every document goes through a 5-stage pipeline. From raw file to queryable vector — fully automated, fully sovereign.
1. Upload: File received and validated. Format detected, size checked.
2. Parsing: Text extracted. PDF uses pdftotext with OCR fallback. DOCX is read from its underlying XML. HTML has boilerplate removed.
3. Chunking: 600-token segments with 50-token overlap. Semantic boundaries preserved at paragraph and heading breaks.
4. Embedding: 768-dimensional vectors generated on local GPU. No external API. Your text never leaves the server.
5. Indexing: Vectors stored in Qdrant with full metadata. Immediately queryable via search, chat, or API.
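To make the chunking numbers concrete, here is a rough sketch of a sliding window with 600-token segments and a 50-token overlap. It is an illustration under stated assumptions rather than the pipeline's actual code: `chunk_tokens` is a hypothetical name, tokens are taken as a plain list of strings (the real tokenizer is not specified), and the paragraph and heading boundary handling is left out.

```python
def chunk_tokens(tokens: list[str], size: int = 600, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 550 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

With whitespace splitting as a crude stand-in for tokenization, `chunk_tokens(text.split())` yields segments in which the last 50 tokens of one chunk reappear as the first 50 of the next, so a sentence that straddles a cut still appears whole in at least one chunk.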
Metadata
When you upload a document, you can attach metadata fields that make retrieval smarter. Metadata acts as a filter: when you query "employment contracts from 2023", the system first narrows to documents tagged with that year, then searches their content.
Available fields: Author, Tags, Source URL, Publication Date, Category, and Language. All optional, all filterable.
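The filter-then-search idea can be sketched in a few lines. This is an in-memory illustration only, assuming a hypothetical `Doc` record and `prefilter` helper; in the platform the filtering happens inside the vector index, whose query syntax is not shown here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Doc:
    # Hypothetical record holding a few of the optional metadata fields.
    title: str
    category: str
    published: date
    tags: list[str] = field(default_factory=list)


def prefilter(docs, *, category=None, year=None, tag=None):
    """Keep only documents whose metadata matches every filter that is set."""
    kept = []
    for d in docs:
        if category is not None and d.category != category:
            continue
        if year is not None and d.published.year != year:
            continue
        if tag is not None and tag not in d.tags:
            continue
        kept.append(d)
    return kept


docs = [
    Doc("Employment contract A", "contract", date(2023, 5, 1), ["employment"]),
    Doc("Employment contract B", "contract", date(2021, 2, 9), ["employment"]),
    Doc("Custody ruling", "court-ruling", date(2023, 3, 15), ["family-law"]),
]

# Only these candidates would go on to the content search.
candidates = prefilter(docs, year=2023, tag="employment")
```

Narrowing first keeps the content search small and avoids returning a semantically similar document from the wrong year.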
Scanned Documents and OCR
Many legal and regulatory documents are scanned images, not selectable text. Our pipeline detects this automatically and routes them through a two-stage OCR process:
Stage 1, pdftotext: Fast native text extraction. If the PDF has a text layer, this returns content immediately with no GPU cost.
Stage 2, Tesseract OCR: If pdftotext returns empty output, the document is rasterized and processed with Tesseract OCR. Output: a searchable PDF with an embedded text layer.
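The fallback decision can be sketched as follows. This is a simplified illustration, not the pipeline's code: `native_text` and `extract_text` are hypothetical names, the first stage shells out to the `pdftotext` command-line tool, and the OCR stage is passed in as a callable because the rasterization and Tesseract settings are not described here.

```python
import subprocess
from typing import Callable, Optional


def native_text(pdf_path: str) -> str:
    """Stage 1: read the PDF's text layer with the pdftotext CLI."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_text(
    pdf_path: str,
    native: Callable[[str], str] = native_text,
    ocr: Optional[Callable[[str], str]] = None,
) -> tuple[str, str]:
    """Return (text, route), where route is "native" or "ocr"."""
    text = native(pdf_path)
    if text.strip():  # whitespace-only output counts as an empty text layer
        return text, "native"
    if ocr is None:
        raise RuntimeError("no text layer found and no OCR function supplied")
    # Stage 2: assumed to rasterize the pages and run OCR on them.
    return ocr(pdf_path), "ocr"
```

Passing the two stages as functions keeps the routing logic testable without Poppler or Tesseract installed; in practice the `ocr` argument would wrap whatever rasterize-and-recognize step you use.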
REST API
Integrate document ingestion into your existing workflows with a single API call.
curl -X POST https://ai.bluenotelogic.com/api/v2/documents/upload \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract-2024.pdf" \
  -F "corpus_id=corp_abc123" \
  -F "metadata[author]=Oslo District Court" \
  -F "metadata[category]=court-ruling" \
  -F "metadata[tags]=family-law,custody" \
  -F "metadata[published]=2024-03-15"
{
  "id": "doc_7f3a9b2c",
  "filename": "contract-2024.pdf",
  "status": "processing",
  "chunks": null,
  "metadata": {
    "author": "Oslo District Court",
    "category": "court-ruling",
    "tags": ["family-law", "custody"],
    "published": "2024-03-15"
  },
  "created_at": "2026-03-25T14:30:00Z"
}
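When calling the endpoint from a script, the `metadata[...]` form fields in the curl example can be built from an ordinary dictionary. The helper below is a small sketch with a hypothetical name, `metadata_form_fields`; it assumes, based only on the example above, that list values such as tags are sent as one comma-separated string.

```python
def metadata_form_fields(metadata: dict) -> dict[str, str]:
    """Flatten a metadata dict into metadata[key]=value multipart form fields."""
    fields = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Assumption: lists travel as "a,b", as the tags do in the curl example.
            value = ",".join(str(v) for v in value)
        fields[f"metadata[{key}]"] = str(value)
    return fields


form = {
    "corpus_id": "corp_abc123",
    **metadata_form_fields({
        "author": "Oslo District Court",
        "category": "court-ruling",
        "tags": ["family-law", "custody"],
        "published": "2024-03-15",
    }),
}
```

The resulting `form` dictionary can be sent as the non-file part of a multipart POST with the HTTP client of your choice, alongside the file and the `Authorization` header.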
Create a free sandbox. Upload your first document in under 2 minutes. No credit card required.