Technical Reference

Under the Hood of
CaveauAI

Architecture, pipelines, protocols, and performance. Everything your engineering team needs to evaluate, integrate, and deploy.

PHP 8.x MariaDB Qdrant Ollama Python Tailwind CSS

01

System Architecture

Three-tier topology on dedicated bare-metal servers. No cloud dependencies. Each layer scales independently and runs in physically separate EU/EEA data centres.

Application Tier

S1 — Hetzner, Finland

  • PHP 8.x application server
  • MariaDB 10.11 (platform + RAG)
  • MCP server (Python, FastMCP)
  • Apache 2.4 + mod_php
  • REST API endpoints

AI Gateway

S4 — WireGuard VPN

  • Ollama inference server
  • Embedding generation (nomic-embed)
  • LLM inference (7B fine-tuned)
  • Model routing & load balancing
  • Health monitoring

GPU & Vector Store

Hippo + S2 — Germany

  • NVIDIA RTX PRO 6000 (96 GB VRAM)
  • 72B model inference (Q8_0)
  • Qdrant vector DB (HNSW index)
  • 768-dim cosine similarity
  • Fine-tuning pipeline (QLoRA)

Network Topology

WireGuard VPN Mesh

S1 (App) ↔ S4 (AI) — 10.0.0.x

S1 (App) ↔ Hippo (GPU) — 10.8.0.x

S1 (App) ↔ S2 (Qdrant) — 10.0.0.x

Data Centre Locations

S1: Hetzner, Helsinki, Finland — Web, DB, MCP

S2: Hetzner, Falkenstein, Germany — Qdrant

S4: Hetzner, Falkenstein, Germany — Ollama

Hippo: Hetzner, Falkenstein, Germany — RTX PRO 6000

02

Corpus Pipeline

Six-stage ingestion pipeline transforms raw documents into searchable, citation-ready knowledge. Fully automated — from PDF upload to queryable vector index.

1

Source

Scraper or upload acquires raw document (PDF, DOCX, HTML, TXT).

2

Parse

Text extraction preserving headings, tables, and footnotes. Section hierarchy detected.

3

Chunk

Heading-aware semantic chunking. Max 600 words, min 50, 75-word overlap between chunks.

4

Embed

768-dimensional vectors via nomic-embed-text. CPU-only, no GPU required for ingestion.

5

Index

Dual storage: Qdrant (HNSW vector index) + MariaDB (metadata + fallback keyword search).

6

Validate

Deduplication check, chunk count verification, Qdrant point confirmation. Bad data rejected.
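The deduplication check in the validation stage can be sketched as a single hash-set pass. This is a minimal illustration; the normalisation and hashing choices here are assumptions, not the production logic:

```python
import hashlib

def dedupe_chunks(chunks):
    """Drop chunks whose whitespace-normalised, lowercased text hashes to a
    value already seen. Assumed normalisation; illustrative only."""
    seen, kept = set(), []
    for text in chunks:
        digest = hashlib.sha256(" ".join(text.split()).lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

Hashing keeps memory bounded: the set holds fixed-size digests rather than full chunk texts.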

Chunking Parameters

Parameter | Value | Notes
Max chunk size | 600 words | Optimised for 768-dim embedding window
Min chunk size | 50 words | Prevents degenerate single-sentence chunks
Overlap | 75 words | Cross-boundary context preservation
Heading detection | Markdown H1–H4 + font-weight heuristics | Section-aware boundaries
Section title | Extracted & stored per chunk | Displayed in citations
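The chunking stage under these parameters can be sketched as follows. The heading regex and the output shape are illustrative assumptions; the documented limits (600/50-word bounds, 75-word overlap, H1–H4 detection) come from the table above:

```python
import re

MAX_WORDS, MIN_WORDS, OVERLAP = 600, 50, 75

def chunk_section(words, max_words=MAX_WORDS, overlap=OVERLAP):
    """Split one section's words into overlapping windows."""
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(words[start:end])
        if end == len(words):
            break
        start = end - overlap  # 75-word cross-boundary overlap
    return chunks

def chunk_document(text):
    """Heading-aware chunking: split on Markdown H1-H4, then window each
    section. Each chunk keeps its section title for citation display."""
    parts = re.split(r"(?m)^(#{1,4} .*)$", text)
    out, title = [], "Untitled"
    for part in parts:
        if re.match(r"^#{1,4} ", part):
            title = part.lstrip("# ").strip()
            continue
        for piece in chunk_section(part.split()):
            if len(piece) >= MIN_WORDS:  # drop degenerate fragments
                out.append({"section": title, "text": " ".join(piece)})
    return out
```

The overlap means the last 75 words of one chunk reappear at the start of the next, so a sentence straddling a boundary is always fully contained in at least one chunk.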

03

RAG Pipeline

Eight-step retrieval-augmented generation pipeline. Hybrid search combines vector similarity with keyword matching via reciprocal rank fusion.

1

Query Embedding

User question is embedded into 768-dim vector space using the same nomic-embed-text model used for document chunks.

2

Vector Search

Qdrant HNSW approximate nearest-neighbour search returns top-k candidate chunks by cosine similarity. Default k=8.
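Conceptually, the ranking that HNSW approximates is a brute-force cosine top-k, which can be sketched as below. This is illustrative only; the point of the HNSW index is precisely to avoid scoring every one of the 146K+ vectors:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = dot(a, b) / (|a| * |b|) -- the metric Qdrant ranks by."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=8):
    """Exact top-k by cosine score over (chunk_id, vector) pairs."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunks]
    return sorted(scored, reverse=True)[:k]
```

In production the vectors are 768-dimensional; the 2-dimensional examples in a test are just easier to reason about.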

3

Keyword Search

Parallel BM25-style keyword search against MariaDB full-text index for exact term matching and rare token coverage.

4

Rank Fusion

Reciprocal Rank Fusion (RRF) merges vector and keyword results. Private chunks boosted 1.5× vs shared corpus chunks.
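A sketch of RRF with the private-chunk boost. The RRF constant (k=60 is the conventional choice) and the exact point at which the 1.5× boost is applied are assumptions; only the fusion formula and the boost factor come from the text:

```python
def rrf_fuse(vector_hits, keyword_hits, k=60, private_boost=1.5, private_ids=()):
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1/(k + rank).
    Private-corpus chunks get a 1.5x multiplier on their fused score."""
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    for doc_id in scores:
        if doc_id in private_ids:
            scores[doc_id] *= private_boost
    return sorted(scores, key=scores.get, reverse=True)
```

A document appearing in both lists accumulates two reciprocal-rank terms, which is why chunks found by both vector and keyword search float to the top.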

5

Context Assembly

Top-ranked chunks assembled into a prompt context window with document titles, section headers, and jurisdiction metadata.

6

LLM Inference

Prompt + context sent to fine-tuned LLM (bnl-legal 7B or qwen2.5:72b). Model generates answer grounded in retrieved chunks.

7

Citation Extraction

Every claim mapped back to source document. Output includes document title, chunk section, relevance score, and jurisdiction.

8

Response Delivery

Structured JSON response with answer text, source citations, model used, and response time. Streaming supported via SSE.
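An illustrative response shape assembling the fields named above and in the citation step. The key names are assumptions, not the exact schema:

```json
{
  "answer": "…",
  "citations": [
    {
      "document_title": "…",
      "section": "…",
      "relevance_score": 0.87,
      "jurisdiction": "NO"
    }
  ],
  "model": "bnl-legal",
  "response_time_ms": 1840
}
```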

768

Embedding dimensions

1.5×

Private chunk boost factor

RRF

Reciprocal Rank Fusion

04

MCP Protocol

Model Context Protocol server exposing 9 tools for AI agent integration. Compatible with Claude Desktop, Claude Code, Cursor, and any MCP-compliant client.

Built with FastMCP (Python). Transports: SSH stdio (Claude Code) and SSE port 8001 (Claude Desktop).

MCP Tools Reference

Tool | Description | Key Parameters
search_corpus | Semantic vector search across shared corpuses | query, corpus, limit, language, jurisdiction, category
list_corpuses | Inventory of all corpuses with doc/chunk counts | (none)
get_document | Full document with all chunks in reading order | document_id, corpus
get_categories | Category taxonomy tree for a corpus | corpus
search_keywords | Metadata filtering (no embedding needed) | title_contains, tags, jurisdiction, authority_type, year_from/to
search_private | Tenant-isolated search of client documents | client_id, query, limit, category
ask | Full RAG: search → context → LLM answer with citations | question, corpus, model, temperature, max_tokens
ingest_text | Chunk + embed + store a text document | title, content, corpus, category, language, jurisdiction
corpus_stats | Analytics: breakdowns by category, language, year | corpus, client_id

Claude Desktop — SSE transport

{
  "mcpServers": {
    "caveauai": {
      "url": "https://ai.bluenotelogic.com/mcp",
      "auth": {
        "type": "bearer",
        "token": "$API_KEY"
      }
    }
  }
}

Claude Code — SSH stdio transport

{
  "mcpServers": {
    "legal-knowledge": {
      "command": "ssh",
      "args": [
        "user@server",
        "python server.py --stdio"
      ]
    }
  }
}

05

REST API

Standard REST endpoints authenticated via X-API-Key header. JSON request/response. HTTPS only.

Method | Endpoint | Description | Auth
GET | /api/v2/documents | List documents with pagination & filters | API Key
POST | /api/v2/documents | Upload document (multipart) with metadata | API Key
DELETE | /api/v2/documents | Remove document + chunks + Qdrant points | API Key
POST | /api/v2/chat | RAG chat with streaming (SSE) support | API Key
GET | /api/v2/search | Vector search across client corpus | API Key
POST | /api/v2/feedback | Submit feedback on AI responses | API Key
POST /api/v2/documents
curl -X POST https://ai.bluenotelogic.com/api/v2/documents \
  -H "X-API-Key: YOUR_KEY" \
  -F "file=@contract.pdf" \
  -F "title=Employment Contract" \
  -F "category=contracts" \
  -F "author=Legal Department" \
  -F "tags=employment,hr,compliance"

// Response:
{
  "success": true,
  "document_id": 1842,
  "chunks_created": 23,
  "processing_time_ms": 4200
}

06

AI Models

Three model tiers covering embedding, domain-specific reasoning, and enterprise-grade analysis. All self-hosted on EU bare-metal. No third-party API calls.

Embedding

nomic-embed-text

137M parameters

Dimensions: 768
Max tokens: 8,192
Similarity: Cosine
Hardware: CPU only
Location: S4
Domain Expert

bnl-legal

Qwen 2.5 7B — Q5_K_M

Parameters: 7.6B
Quantisation: Q5_K_M
Fine-tuned on: Norwegian law
Method: QLoRA (r=64)
Location: S4 + Hippo
Enterprise

qwen2.5:72b

Qwen 2.5 72B — Q8_0

Parameters: 72B
Quantisation: Q8_0 (77 GB)
VRAM required: 80+ GB
Hardware: RTX PRO 6000
Location: Hippo

Knowledge Distillation Pipeline

Teacher models (NorwAI-24B + Qwen-72B) generate domain-specific Q&A pairs from corpus documents. These are curated, scored, and used to QLoRA fine-tune the 7B student model. Result: expert-level answers at 7B inference cost.

07

Security & Compliance

GDPR compliance is an architectural decision, not an afterthought. Every layer enforces tenant isolation, data sovereignty, and access control.

Tenant Isolation

Each client’s documents, chunks, and vectors are stored with a client_id filter. Qdrant queries include mandatory payload filters. No cross-tenant data leakage is possible at the query level.
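The mandatory payload filter can be illustrated with Qdrant's search endpoint, `POST /collections/{collection}/points/search`. The collection name matches the table in section 08; the client_id value and the truncated vector are placeholders:

```json
{
  "vector": [0.012, -0.094, "… 768 values …"],
  "limit": 8,
  "filter": {
    "must": [
      { "key": "client_id", "match": { "value": 42 } }
    ]
  }
}
```

Because the filter is attached server-side to every query, a malformed or malicious request cannot widen the search beyond its own tenant.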

SSRF Protection

Client-submitted scraper URLs are validated against private IP ranges (RFC 1918, link-local, loopback). DNS resolution is re-checked after each redirect, blocking TOCTOU (time-of-check/time-of-use) attacks such as DNS rebinding.
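A minimal sketch of this validation using only the Python standard library. The function name and exact range checks are illustrative assumptions; the key point is resolving the hostname and testing every returned address, and re-running the check after each redirect:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    """Reject scraper targets that resolve to private, loopback, link-local,
    or reserved addresses. Call again after every redirect so a swapped DNS
    record cannot slip past the initial check (TOCTOU)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Checking the resolved addresses rather than the hostname string is what defeats tricks like `http://localtest.me/` or decimal-encoded IPs.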

Approval Workflows

New scraper sources require admin approval before activation. URL or schedule changes trigger re-approval. Deactivated sources cannot ingest until approved.

EU Data Sovereignty

All servers in EU/EEA data centres (Hetzner, Germany & Finland). No US cloud providers. No data export. No third-party AI API calls. WireGuard encrypted transit.

Authentication

Platform: session-based with CSRF tokens. API: per-client API keys via X-API-Key header. Role-based access: owner, admin, editor, viewer. Bcrypt password hashing.

Prepared Statements

All database queries use PDO prepared statements with parameterised bindings. No string concatenation in SQL. XSS prevention via htmlspecialchars() on output.
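The PDO pattern described above, illustrated here in Python with sqlite3 (an analogy, not the platform's actual code): values travel as bound parameters and are never concatenated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER, client_id INTEGER, title TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 42, 'Employment Contract')")

def find_documents(conn, client_id, title_fragment):
    """Placeholders (?) keep attacker-controlled input out of the SQL text,
    so injection payloads are matched literally instead of executed."""
    return conn.execute(
        "SELECT id, title FROM documents WHERE client_id = ? AND title LIKE ?",
        (client_id, f"%{title_fragment}%"),
    ).fetchall()
```

An input like `' OR '1'='1` simply fails to match any title, because it is bound as data rather than parsed as SQL.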

08

Performance Benchmarks

Measured on production infrastructure with real corpus data. Numbers reflect typical workloads, not synthetic benchmarks.

< 50ms

Vector Search

Qdrant HNSW, 146K+ points

1.5–3s

RAG Answer (7B)

Full pipeline, 8-chunk context

3–8s

RAG Answer (72B)

Complex multi-doc reasoning

< 200ms

Embedding

Single query, nomic-embed-text

< 5s

Document Ingest

Average 10-page PDF

60s

Batch 100 docs

Parallel chunking + embedding

99.5%

Uptime SLA

Production (Corporate+ plans)

0

Data Export

Zero bytes leave EU jurisdiction

Qdrant Vector Collections

Collection | Corpus | Documents | Vectors | Index
bnl_chunks | Norwegian Law | 8,665 | 81,367 | HNSW (m=16, ef=128)
corpus_eu_ai | EU AI & Digital Law | 710 | 24,938 | HNSW (m=16, ef=128)
corpus_telecom | Telecom Regulation | 979 | 19,784 | HNSW (m=16, ef=128)
corpus_co2 | CO2 & Climate | 874 | 14,818 | HNSW (m=16, ef=128)
bnl_client_chunks | Private Client Docs | — | — | HNSW + client_id filter

Ready to integrate?

Connect your AI agents to CaveauAI in under 2 minutes. Full API access on Professional plans and above.