Automated Ingestion
100+ domain-specific scrapers monitor legal, regulatory, and industry sources. New documents are ingested, chunked, and indexed into your corpus automatically.
Three Steps
Choose from 100+ pre-built scrapers or define a custom source URL. Set depth limits, page count, and content filters.
Set daily, weekly, or custom schedules. The scraper runs on your cadence, checking for new and updated content.
New documents flow through the full pipeline: parse, extract metadata, deduplicate, chunk, embed, and index. Zero manual intervention.
The Architecture
Every scraper extends our BaseScraper class — 428 lines of battle-tested infrastructure. Rate limiting, retry logic, PDF parsing, OCR fallback, metadata extraction, and deduplication are all handled at the base layer.
Domain-specific scrapers only need to implement page listing and content extraction. The base handles everything else.
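The division of labour described above can be sketched in Python. This is an illustrative assumption, not the actual BaseScraper API: class names, method names, and the dedup logic are stand-ins for the real 428-line base layer.

```python
# Illustrative sketch only: names and behaviour are assumptions, not the real API.
import hashlib
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Base layer: orchestration and content-hash deduplication (rate limiting,
    retries, PDF/OCR handling omitted from this sketch)."""

    def __init__(self):
        self._seen_hashes = set()  # dedup handled once, at the base

    @abstractmethod
    def list_pages(self) -> list[str]:
        """Return the URLs this source should scrape."""

    @abstractmethod
    def extract_content(self, url: str) -> str:
        """Return the meaningful text content of one page."""

    def run(self) -> list[str]:
        documents = []
        for url in self.list_pages():
            text = self.extract_content(url)
            digest = hashlib.sha256(text.encode()).hexdigest()
            if digest in self._seen_hashes:
                continue  # already ingested, skip
            self._seen_hashes.add(digest)
            documents.append(text)
        return documents

class ExampleScraper(BaseScraper):
    """A domain scraper implements only page listing and content extraction."""

    def list_pages(self) -> list[str]:
        return ["https://example.com/a", "https://example.com/b"]

    def extract_content(self, url: str) -> str:
        return f"content of {url}"
```

Running the same scraper twice ingests nothing the second time, because the base layer's hash set already contains every document.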
All requests identify with the user agent CorpusAI-ResearchBot/1.0.
Coverage
Pre-built scrapers for the sources that matter most. Each one understands the structure, pagination, and metadata patterns of its target domain.
Norwegian legislation, court rulings, legal encyclopaedia
Parliamentary proceedings, committee reports, legislation
EU directives, regulations, and legislative proposals
Court of Justice of the EU — rulings and opinions
European Court of Human Rights — case law
Norwegian Data Protection Authority — guidance and decisions
European Data Protection Board — guidelines and opinions
EU AI Act guidance, codes of practice, regulatory updates
AI Risk Management Framework, standards, and publications
AI policy observatory, principles, and country reports
EU expert group on AI — ethics guidelines and reports
Body of European Regulators — telecom market analysis
UK communications regulator — research and decisions
International Telecommunication Union — standards
Swedish and Norwegian telecom regulators
Intergovernmental Panel on Climate Change — assessment reports
UN Framework Convention — NDCs, decisions, COP reports
European Commission — Fit for 55, Green Deal policy
International Energy Agency — reports and data
Custom scraper sources available on Corporate and Sovereign plans. Request a source.
REST API
curl -X POST https://ai.bluenotelogic.com/api/v2/scrapers/create \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "corpus_id": "corp_abc123",
    "source_url": "https://lovdata.no/dokument/NL/lov/1981-04-08-7",
    "scraper_type": "lovdata",
    "schedule": "weekly",
    "max_depth": 3,
    "max_pages": 500,
    "metadata_defaults": {
      "category": "legislation",
      "jurisdiction": "norway"
    }
  }'
curl -X POST https://ai.bluenotelogic.com/api/v2/scrapers/scr_9d4f1a/run \
  -H "Authorization: Bearer YOUR_API_KEY"

curl https://ai.bluenotelogic.com/api/v2/scrapers/scr_9d4f1a/status \
  -H "Authorization: Bearer YOUR_API_KEY"

{
  "id": "scr_9d4f1a",
  "status": "completed",
  "last_run": "2026-03-25T02:00:00Z",
  "pages_scraped": 147,
  "documents_ingested": 89,
  "documents_skipped": 58,
  "next_scheduled": "2026-04-01T02:00:00Z",
  "errors": []
}
Pipeline
Fetch: HTTP/HTTPS with retry, rate limiting, and SSRF protection.
Parse: boilerplate removal. Only meaningful content survives.
Extract: dates, case numbers, authority, jurisdiction, all auto-extracted.
Deduplicate: content-hash comparison. No duplicate chunks in your corpus.
Chunk: 600 tokens, 50-token overlap. Semantic boundaries preserved.
Embed and index: 768-dimensional vectors stored in Qdrant. Instantly queryable.
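A minimal sketch of the chunking step, assuming list items stand in for whatever tokenizer the pipeline actually uses; the 600/50 figures come from the pipeline description above, while the sliding-window shape is an illustrative simplification:

```python
# Simplified sketch: generic "tokens" stand in for real tokenizer output.
def chunk_tokens(tokens: list[str], size: int = 600, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens
    with its predecessor."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(1300)]
chunks = chunk_tokens(tokens)
# 1300 tokens -> 3 chunks starting at 0, 550, 1100
```

Each neighbouring pair of chunks shares exactly 50 tokens, so a sentence falling on a boundary appears intact in at least one chunk.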
Every scraper runs under strict security controls. No scraper can access private networks, bypass rate limits, or ingest without admin oversight.
Private IP ranges (10.x, 172.16-31.x, 192.168.x) are blocked at the network layer.
Minimum 2-second delay between requests with random jitter. Respects robots.txt.
New scraper sources require administrator approval before first run.
Max depth, max pages, and content filters keep scraping focused and predictable.
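The private-range rule can be illustrated with Python's standard ipaddress module. As noted above, the real enforcement happens at the network layer; this sketch shows only the address classification, not the full SSRF defence:

```python
import ipaddress

# Classifies addresses the way the SSRF rule above describes:
# private ranges (10.x, 172.16-31.x, 192.168.x), loopback, and link-local.
def is_blocked_address(ip: str) -> bool:
    """True if a scraper must not fetch from this address."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local

for candidate in ["10.0.0.5", "172.16.0.1", "192.168.1.1", "8.8.8.8"]:
    print(candidate, "blocked" if is_blocked_address(candidate) else "allowed")
```

A real implementation would also resolve hostnames before checking, since SSRF attacks typically hide a private IP behind a DNS name.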
Start with a pre-built source or configure your own. Your corpus stays current while you focus on your work.