Automated Ingestion

Keep Your Corpus Fresh. Automatically.

100+ domain-specific scrapers monitor legal, regulatory, and industry sources. New documents are ingested, chunked, and indexed into your corpus automatically.

Three Steps

Configure. Schedule. Forget.

1

Configure Source

Choose from 100+ pre-built scrapers or define a custom source URL. Set depth limits, page count, and content filters.

2

Schedule Runs

Set daily, weekly, or custom schedules. The scraper runs on your cadence, checking for new and updated content.

3

Auto-Ingest

New documents flow through the full pipeline: parse, extract metadata, deduplicate, chunk, embed, and index. Zero manual intervention.
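The deduplication step in this pipeline can be sketched as a content-hash check: hash the normalized text and skip anything already seen. A minimal illustration — the normalization rules and function names here are assumptions, not the production code:

```php
<?php
// Sketch of the deduplication step: hash normalized text and skip
// anything already seen. Function names are illustrative.
function contentHash(string $text): string {
    // Collapse whitespace and lowercase so trivial reformatting
    // does not defeat the hash comparison.
    $normalized = strtolower(preg_replace('/\s+/', ' ', trim($text)));
    return hash('sha256', $normalized);
}

function isDuplicate(string $text, array &$seenHashes): bool {
    $hash = contentHash($text);
    if (isset($seenHashes[$hash])) {
        return true;
    }
    $seenHashes[$hash] = true;
    return false;
}
```

With this scheme, a re-fetched page whose only changes are whitespace or casing hashes identically and is skipped before chunking.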

The Architecture

Built on a Common Foundation

Every scraper extends our BaseScraper class — 428 lines of battle-tested infrastructure. Rate limiting, retry logic, PDF parsing, OCR fallback, metadata extraction, and deduplication are all handled at the base layer.

Domain-specific scrapers only need to implement page listing and content extraction. The base handles everything else.

  • Rate-limited: 2s+ base delay with random jitter
  • User agent: CorpusAI-ResearchBot/1.0
  • SSRF protection: private IP ranges blocked
  • Automatic deduplication via content hashing
// BaseScraper.php — core loop
foreach ($urls as $url) {
    $this->rateLimit();
    $html = $this->fetch($url);
    $text = $this->extractContent($html);
    $meta = $this->extractMetadata($html);
    if ($this->isDuplicate($text)) {
        continue;
    }
    $this->ingest($text, $meta);
}
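A domain scraper built on that loop might look like the sketch below: a hypothetical subclass implementing only page listing and content extraction, with the base class reduced to a stub for illustration. The class and method names (other than those shown in the loop above) are assumptions, not the real API.

```php
<?php
// Minimal stand-in for the real BaseScraper — fetching, dedup, and
// ingestion are omitted; only the extension points are shown.
abstract class BaseScraper {
    abstract public function listPages(): array;                    // page listing
    abstract public function extractContent(string $html): string;  // content extraction

    public function rateLimit(): void {
        // 2s+ base delay with random jitter, per the limits above.
        usleep(2_000_000 + random_int(0, 1_000_000));
    }
}

// Hypothetical domain scraper: only the two abstract methods are implemented.
class ExampleRegisterScraper extends BaseScraper {
    public function listPages(): array {
        return ['https://example.org/doc/1', 'https://example.org/doc/2'];
    }

    public function extractContent(string $html): string {
        // Keep only the text inside <article>, dropping page boilerplate.
        if (preg_match('/<article>(.*?)<\/article>/s', $html, $m)) {
            return trim(strip_tags($m[1]));
        }
        return trim(strip_tags($html));
    }
}
```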

Coverage

100+ Domain-Specific Scrapers

Pre-built scrapers for the sources that matter most. Each one understands the structure, pagination, and metadata patterns of its target domain.

Legal & Regulatory

Lovdata

Norwegian legislation, court rulings, legal encyclopaedia

Stortinget

Parliamentary proceedings, committee reports, legislation

EUR-Lex

EU directives, regulations, and legislative proposals

CJEU

Court of Justice of the EU — rulings and opinions

ECHR

European Court of Human Rights — case law

Datatilsynet

Norwegian Data Protection Authority — guidance and decisions

EDPB

European Data Protection Board — guidelines and opinions

AI Governance

EU AI Office

EU AI Act guidance, codes of practice, regulatory updates

NIST AI

AI Risk Management Framework, standards, and publications

OECD AI

AI policy observatory, principles, and country reports

High-Level Expert Group

EU expert group on AI — ethics guidelines and reports

Telecom & Digital

BEREC

Body of European Regulators — telecom market analysis

Ofcom

UK communications regulator — research and decisions

ITU

International Telecommunication Union — standards

PTS / NKOM

Swedish and Norwegian telecom regulators

Climate & Emissions

IPCC

Intergovernmental Panel on Climate Change — assessment reports

UNFCCC

UN Framework Convention — NDCs, decisions, COP reports

EC Climate

European Commission — Fit for 55, Green Deal policy

IEA

International Energy Agency — reports and data

Custom scraper sources available on Corporate and Sovereign plans. Request a source.

REST API

Control Scrapers via API

POST /api/v2/scrapers/create — Configure a new scraper source
curl -X POST https://ai.bluenotelogic.com/api/v2/scrapers/create \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "corpus_id": "corp_abc123",
    "source_url": "https://lovdata.no/dokument/NL/lov/1981-04-08-7",
    "scraper_type": "lovdata",
    "schedule": "weekly",
    "max_depth": 3,
    "max_pages": 500,
    "metadata_defaults": {
      "category": "legislation",
      "jurisdiction": "norway"
    }
  }'
POST /api/v2/scrapers/{id}/run — Trigger a manual scraper run
curl -X POST https://ai.bluenotelogic.com/api/v2/scrapers/scr_9d4f1a/run \
  -H "Authorization: Bearer YOUR_API_KEY"
GET /api/v2/scrapers/{id}/status — Check scraper run status
curl https://ai.bluenotelogic.com/api/v2/scrapers/scr_9d4f1a/status \
  -H "Authorization: Bearer YOUR_API_KEY"
Response — 200 OK
{
  "id": "scr_9d4f1a",
  "status": "completed",
  "last_run": "2026-03-25T02:00:00Z",
  "pages_scraped": 147,
  "documents_ingested": 89,
  "documents_skipped": 58,
  "next_scheduled": "2026-04-01T02:00:00Z",
  "errors": []
}

Pipeline

From Web Page to Vector

1

Fetch

HTTP/HTTPS with retry, rate limiting, and SSRF protection

2

Extract

Boilerplate removal. Only meaningful content survives.

3

Metadata

Dates, case numbers, authority, jurisdiction — auto-extracted

4

Deduplicate

Content hash comparison. No duplicate chunks in your corpus.

5

Chunk

600 tokens, 50 overlap. Semantic boundaries preserved.

6

Index

768d vectors stored in Qdrant. Instantly queryable.

Built-In Security

Every scraper runs under strict security controls. No scraper can access private networks, bypass rate limits, or ingest without admin oversight.

SSRF Protection

Private IP ranges (10.x, 172.16-31.x, 192.168.x) are blocked at the network layer.

Rate Limiting

Minimum 2-second delay between requests with random jitter. Respects robots.txt.

Admin Approval

New scraper sources require administrator approval before first run.

Configurable Limits

Max depth, max pages, and content filters keep scraping focused and predictable.
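The private-range check can be expressed with PHP's built-in IP filter. A sketch of the kind of check involved — the function name is illustrative, and a production control would also resolve the hostname first and re-check after redirects:

```php
<?php
// Reject IPs in private (10/8, 172.16/12, 192.168/16) or reserved
// (loopback, link-local, etc.) ranges before any request is made.
function isSafeTargetIp(string $ip): bool {
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}
```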

Set Up Your First Scraper

Start with a pre-built source or configure your own. Your corpus stays current while you focus on your work.