Document Ingestion

Upload Once. Query Forever.

Drag files into your corpus. We parse, chunk, embed, and index them automatically. Your documents become searchable AI knowledge in seconds.

Supported Formats

Every Format Your Business Uses

PDF

Reports, contracts, legal filings, scanned documents with OCR fallback

DOCX

Word documents with formatting, tables, and embedded images

XLSX

Spreadsheets — each sheet processed as a separate section

TXT

Plain text, README files, logs, and structured data

HTML

Web pages with boilerplate removal — only meaningful content is indexed

Markdown

Documentation, README files, and structured content with heading hierarchy

Maximum 50 MB per file. Batch upload supported. All files processed in parallel.
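If you script uploads, a quick client-side check can catch oversized or unsupported files before they are sent. The sketch below is illustrative only: the `precheck` helper is hypothetical, the extension set is inferred from the formats listed above, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_BYTES = 50 * 1024 * 1024

# Assumption: extensions inferred from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".txt", ".html", ".md"}


def precheck(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems
```

The server still performs its own validation on upload, so a check like this only saves a round trip.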

How It Works

Drag. Drop. Done.

Open the CorpusAI platform, navigate to your corpus, and drag files directly into the upload zone. You can upload a single document or an entire batch — the system processes everything in parallel.

Each file shows real-time progress: parsing, chunking, embedding, indexing. When the status turns green, your document is live and queryable.

No configuration needed. No format conversion. Just drop your files and start querying.

Example upload status:

contract-2024-final.pdf: indexed
employee-handbook.docx: indexed
financial-report-Q4.xlsx: embedding...

Under the Hood

The Processing Pipeline

Every document goes through a five-stage pipeline that takes it from raw file to queryable vector. Fully automated, fully sovereign.

1. Upload

File received and validated. Format detected, size checked.

2. Parse

Text extracted. PDF uses pdftotext with OCR fallback. DOCX via XML. HTML with boilerplate removal.

3. Chunk

600-token segments with 50-token overlap. Semantic boundaries preserved at paragraph and heading breaks.

4. Embed

768-dimensional vectors generated on local GPU. No external API. Your text never leaves the server.

5. Index

Vectors stored in Qdrant with full metadata. Instantly queryable via search, chat, or API.
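To make the chunking stage concrete, here is a minimal sketch of a fixed-size sliding window with overlap, using the 600-token segment size and 50-token overlap quoted above. It is not the platform's implementation: `chunk_tokens` is a hypothetical helper, it works on an already tokenized list, and it does not attempt the paragraph and heading boundary handling the pipeline describes.

```python
def chunk_tokens(tokens: list, size: int = 600, overlap: int = 50) -> list[list]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Usage: whitespace splitting stands in for a real tokenizer (an assumption).
words = ("lorem ipsum " * 700).split()
pieces = chunk_tokens(words)
```

With these defaults each window starts 550 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next. The overlap keeps a sentence that straddles a boundary intact in at least one chunk.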

Metadata

Attach Context to Every File

When you upload a document, you can attach metadata fields that make retrieval smarter. Metadata acts as a filter: when you query "employment contracts from 2023", the system first narrows the search to documents whose metadata matches that year, then searches their content.

Available fields: Author, Tags, Source URL, Publication Date, Category, and Language. All optional, all filterable.

author: "Oslo District Court"
tags: ["family-law", "custody"]
source_url: "https://lovdata.no/..."
published: "2024-03-15"
category: "court-ruling"
language: "nb"
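The filter-then-search idea can be sketched in a few lines. This is an illustration under assumptions, not the platform's query engine: the `matches` helper and the two sample documents are hypothetical, and in a real deployment the filter would more likely be applied inside the vector index than in application code.

```python
from datetime import date


def matches(metadata: dict, *, category=None, year=None, tags=None) -> bool:
    """Return True if the metadata satisfies every filter that was supplied."""
    if category is not None and metadata.get("category") != category:
        return False
    if year is not None:
        published = metadata.get("published")
        if published is None or date.fromisoformat(published).year != year:
            return False
    if tags and not set(tags) <= set(metadata.get("tags", [])):
        return False
    return True


# Hypothetical sample documents using the field names shown above.
docs = [
    {"id": "doc_a", "metadata": {"category": "court-ruling",
                                 "published": "2024-03-15",
                                 "tags": ["family-law", "custody"]}},
    {"id": "doc_b", "metadata": {"category": "contract",
                                 "published": "2023-06-01",
                                 "tags": ["employment"]}},
]

# Narrow the candidate set first; content search would then run on these only.
candidates = [d["id"] for d in docs
              if matches(d["metadata"], category="contract", year=2023)]
```

Because every field is optional, a document with no `published` value is excluded from a year filter here rather than matched by default.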

Scanned PDFs? We Handle That.

Many legal and regulatory documents are scanned images, not selectable text. Our pipeline detects this automatically and routes them through a two-stage extraction process with OCR as the fallback:

Stage 1: pdftotext

Fast native text extraction. If the PDF has a text layer, this returns content instantly with zero GPU cost.

Stage 2: ocrmypdf

If pdftotext returns no text, the document is rasterized and processed with Tesseract OCR. Output: a searchable PDF with an embedded text layer.
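The two stages can be approximated with the same command-line tools. The sketch below assumes `pdftotext` and `ocrmypdf` are installed and on the PATH. `extract_text` is a hypothetical helper, and re-reading the OCR output with `pdftotext` is an assumption about how the text is recovered afterwards.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path: str, run=subprocess.run) -> str:
    """Try native extraction first; fall back to OCR when no text comes back."""
    pdf = Path(pdf_path)

    # Stage 1: "pdftotext <file> -" writes the PDF's text layer to stdout.
    native = run(["pdftotext", str(pdf), "-"],
                 capture_output=True, text=True, check=True)
    if native.stdout.strip():
        return native.stdout

    # Stage 2: ocrmypdf writes a searchable copy with an embedded text layer.
    ocr_pdf = pdf.with_name(pdf.stem + ".ocr.pdf")
    run(["ocrmypdf", str(pdf), str(ocr_pdf)], check=True)

    # Assumption: read the new text layer back with the same stage 1 tool.
    ocr = run(["pdftotext", str(ocr_pdf), "-"],
              capture_output=True, text=True, check=True)
    return ocr.stdout
```

The `run` parameter exists only so the command runner can be swapped out in tests. In normal use, call `extract_text("scan.pdf")`.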

REST API

Upload via API

Integrate document ingestion into your existing workflows with a single API call.

POST /api/v2/documents/upload
curl -X POST https://ai.bluenotelogic.com/api/v2/documents/upload \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract-2024.pdf" \
  -F "corpus_id=corp_abc123" \
  -F "metadata[author]=Oslo District Court" \
  -F "metadata[category]=court-ruling" \
  -F "metadata[tags]=family-law,custody" \
  -F "metadata[published]=2024-03-15"
Response — 201 Created
{
  "id": "doc_7f3a9b2c",
  "filename": "contract-2024.pdf",
  "status": "processing",
  "chunks": null,
  "metadata": {
    "author": "Oslo District Court",
    "category": "court-ruling",
    "tags": ["family-law", "custody"],
    "published": "2024-03-15"
  },
  "created_at": "2026-03-25T14:30:00Z"
}
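The same request can be made from Python. This is a sketch rather than an official client: it mirrors the curl call above, assumes the third-party `requests` package is installed, and uses two hypothetical helper names, `build_form_fields` and `upload_document`. Joining list values with commas follows the `metadata[tags]` field in the curl example.

```python
API_URL = "https://ai.bluenotelogic.com/api/v2/documents/upload"


def build_form_fields(corpus_id: str, metadata: dict) -> dict:
    """Flatten metadata into the metadata[key]=value form fields used above."""
    fields = {"corpus_id": corpus_id}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)  # tags are sent as a comma-separated string
        fields[f"metadata[{key}]"] = str(value)
    return fields


def upload_document(path: str, corpus_id: str, metadata: dict, api_key: str) -> dict:
    """POST the file as multipart/form-data and return the parsed JSON response."""
    import requests  # third-party dependency: pip install requests

    with open(path, "rb") as fh:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            data=build_form_fields(corpus_id, metadata),
            files={"file": fh},
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```

The sample response reports `"status": "processing"` with `"chunks": null`, which suggests the document is not queryable the moment the call returns. How to check for completion is not covered here.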

Ready to start uploading?

Create a free sandbox. Upload your first document in under 2 minutes. No credit card required.