Architecture Deep Dive

Retrieval-Augmented Generation

Every answer is grounded in your documents. Here's exactly how we make that happen.

The RAG Pipeline

RAG (Retrieval-Augmented Generation) is a technique that dramatically reduces AI hallucination. Instead of asking a model to answer from memory, we first retrieve the most relevant passages from your documents, then ask the model to synthesize an answer using only those passages.

1. Document Chunking

Documents are split into overlapping chunks using a sliding window approach. Each chunk is ~512 tokens with ~50 token overlap to preserve context at chunk boundaries.

// Chunking configuration
chunk_size:    512 tokens
overlap:       50 tokens
strategy:      "sliding_window"
boundary_aware: true  // respects paragraph/section breaks
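The sliding-window strategy above can be sketched in a few lines. This is an illustrative simplification, not the production chunker: it operates on a pre-tokenized list and omits the boundary-aware logic that snaps chunk edges to paragraph breaks.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into overlapping chunks via a sliding window.

    Each chunk is `chunk_size` tokens; consecutive chunks share
    `overlap` tokens so context at chunk boundaries is preserved.
    """
    step = chunk_size - overlap  # window advances 462 tokens at a time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the document
    return chunks
```

For a 1,000-token document this yields three chunks, and the last 50 tokens of each chunk reappear as the first 50 tokens of the next.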

2. Embedding

Each chunk is transformed into a 1024-dimensional vector using a multilingual embedding model. These vectors capture the semantic meaning of the text, enabling similarity search that understands concepts, not just keywords.

// Embedding specification
model:         "multilingual-e5-large"
dimensions:    1024
index:         "HNSW" (Hierarchical Navigable Small World)
distance:      "cosine_similarity"
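To make the "cosine similarity" setting concrete, here is what the index is computing under the hood. The exhaustive search below is a toy stand-in: an HNSW index returns (approximately) the same ranking without comparing the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, index):
    """Exact nearest-neighbour search over a {chunk_id: vector} dict.

    HNSW approximates this ranking by walking a layered proximity
    graph instead of scanning every vector.
    """
    return max(index, key=lambda cid: cosine_similarity(query_vec, index[cid]))
```

Because embeddings place similar meanings close together, a query vector near a chunk's vector indicates semantic relevance even when no words overlap.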

3. Hybrid Search

When you query, we run two search strategies in parallel:

Vector Search

Finds semantically similar passages. Understands that "termination clause" and "ending the agreement" mean the same thing.

BM25 Keyword Search

Finds exact term matches. Critical for proper nouns, case numbers, statute references, and technical terms.

Results from both strategies are fused using Reciprocal Rank Fusion (RRF), producing a single ranked list of the most relevant passages.
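Reciprocal Rank Fusion is simple enough to show in full. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in; the constant k (commonly 60) damps the influence of any single list's top result. This sketch assumes each strategy returns an ordered list of passage ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one list.

    score(d) = sum over lists of 1 / (k + rank_of_d), rank starting at 1.
    Passages ranked highly by both strategies float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, if vector search returns [a, b, c] and BM25 returns [b, c, a], passage b wins: it is ranked first by one list and second by the other, a stronger combined signal than a's first-plus-third.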

4. Multi-Corpus Federation

A single query can search across multiple corpora simultaneously. Each corpus maintains its own vector index and BM25 index. Results are fused across corpora before being passed to the LLM.

// Multi-corpus query
POST /api/v2/chat
{
  "message": "What are the GDPR requirements for AI systems?",
  "corpus_ids": ["internal-policies", "eu-ai-act", "gdpr-guidance"]
}
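The federation step can be sketched the same way: fan the query out to each corpus's search function, then apply RRF across the per-corpus rankings. The `corpus_search_fns` callables here are illustrative stand-ins for the real per-corpus vector + BM25 pipelines:

```python
def federated_search(query, corpus_search_fns, k=60):
    """Query every corpus, then fuse the per-corpus ranked lists with RRF.

    `corpus_search_fns` is a list of callables, each taking a query string
    and returning a ranked list of passage ids from its own corpus.
    """
    scores = {}
    for search in corpus_search_fns:
        for rank, passage_id in enumerate(search(query), start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A passage surfaced by two corpora (e.g. a regulation quoted in both `eu-ai-act` and `gdpr-guidance`) accumulates score from both rankings and rises accordingly.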

5. Citation Generation

The LLM receives the top-ranked passages as context and generates an answer. Every claim in the answer is annotated with a reference back to the specific source passage — document name, page number, and the relevant text extract.

If the retrieved passages don't contain enough information to answer the question, the system explicitly says so rather than fabricating an answer.
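One common way to implement both behaviors, citation and explicit refusal, is in the prompt itself: number each passage with its source metadata and instruct the model to cite by number or decline. The format below is illustrative, not the exact production prompt:

```python
def build_grounded_prompt(question, passages):
    """Assemble a citation-ready prompt from retrieved passages.

    `passages` is a list of dicts with `doc`, `page`, and `text` keys
    (an assumed shape for this sketch). Each passage gets a [n] label
    the model can cite, and the instructions require a refusal when
    the context is insufficient.
    """
    context = "\n".join(
        f"[{i}] {p['doc']} p.{p['page']}: {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using ONLY the passages below. Cite each claim as [n].\n"
        "If the passages do not contain enough information to answer, "
        "say so explicitly instead of guessing.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The [n] markers in the model's answer are then mapped back to document name, page number, and text extract when rendering citations.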