SCOURR
Systematic Research Platform

Transform Data Chaos Into Precision

Scourr builds institutional data and research infrastructure for systematic funds. We turn noisy web-scale corpora into point-in-time, survivorship-bias–free features and event streams, delivered as APIs and notebooks for signal research, portfolio construction, and production deployment.

18M
Raw docs (2019–2024)
1.8M
Embeddings (FAISS HNSW)
35k/mo
LLM requests
0.4%
Leakage (purged CV)

AI & Machine Learning

LLM calls: 35k/mo (90% local, 10% GPT-4/5)
Vector DB: 1.8M embeddings (1.4GB, FAISS HNSW, 8-bit quant)
Agentic RAG with entity resolution & timestamp normalization
ACTIVE
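
For the vector-store piece, a minimal FAISS sketch in the spirit of the setup above (an HNSW graph over 8-bit scalar-quantized vectors); the dimension and graph parameters are illustrative assumptions, not the production configuration.

    # Illustrative only: HNSW graph over 8-bit scalar-quantized vectors.
    import numpy as np
    import faiss

    d = 768                                            # embedding dimension (assumed)
    index = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, 32)  # M = 32 graph neighbors
    index.hnsw.efConstruction = 200                    # build-time beam width
    index.hnsw.efSearch = 64                           # query-time beam width

    xb = np.random.rand(10_000, d).astype("float32")   # stand-in corpus vectors
    index.train(xb)                                    # scalar quantizer needs a training pass
    index.add(xb)

    xq = np.random.rand(5, d).astype("float32")        # stand-in query vectors
    distances, ids = index.search(xq, 10)              # top-10 neighbors per query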

Data Infrastructure

Ingested: 18M raw docs → 8.64M post‑clean (48% retained)
Dedup reduction: 21.6% • Entity resolution F1: 0.91
Infra: Redpanda streaming, Redis caching, DuckDB/Polars backfills
STREAMING
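
For the backfill piece specifically, a minimal DuckDB-to-Polars sketch; the database file, table, and column names are hypothetical, not the production schema.

    # Illustrative backfill: pull a publish-time slice of raw docs from DuckDB
    # into Polars and recompute a toy daily feature. Names are hypothetical.
    import duckdb
    import polars as pl

    con = duckdb.connect("research.duckdb")

    def backfill_doc_counts(start: str, end: str) -> pl.DataFrame:
        docs = pl.from_arrow(con.execute(
            "SELECT doc_id, entity_id, publish_ts, ingest_ts "
            "FROM raw_docs WHERE publish_ts >= ? AND publish_ts < ?",
            [start, end],
        ).arrow())
        # Keep ingest_ts so downstream joins stay point-in-time (no look-ahead).
        return (
            docs.with_columns(pl.col("publish_ts").dt.date().alias("date"))
                .group_by(["entity_id", "date"])
                .agg(pl.len().alias("doc_count"))
                .sort(["entity_id", "date"])
        )

    # features = backfill_doc_counts("2024-01-01", "2024-02-01")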

Quantitative Analytics

Sharpe: 0.92 OOS vs 0.80 baseline (2019–2024)
Capacity: ~$10M @ 5 bps cost before IC decays to 0.03
Backtest→Prod drift: −3bps over 6 months
COMPUTING

Security & Compliance

AES-256, TLS 1.3
Prometheus, Grafana, ELK
GDPR Compliance
PROTECTED

Platform SLOs (30‑day rolling)

226ms
API Latency (p95)
Target 220ms (+2.7%)
7.6 min
Feature Freshness (p95)
On track
99.94%
Pipeline Success (30d)
3 minor failures
18h
Backfill SLA
vs 24h target
SLOs are 30‑day aggregates from Prometheus/Grafana. Definitions available on request.

Data Processing Pipeline

Data Ingestion

18M raw docs (2019–2024)

Processing

8.6M cleaned (48%), 21.6% dedup

Insights

1.8M embeddings • real‑time features

Agentic RAG Systems

Autonomous corpus‑to‑feature extraction with LLM agents: entity resolution, event labeling, deduplication, and timestamp normalization. Outputs clean, point‑in‑time features to a research feature store for robust cross‑sectional modeling.
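
A minimal sketch of the point-in-time record shape such a pipeline could emit; the field names, resolver hook, and normalization rule are illustrative assumptions rather than the production schema.

    # Illustrative record shape only; field names and helpers are assumptions.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class EventFeature:
        entity_id: str       # canonical id after entity resolution
        event_type: str      # label assigned by the agent, e.g. "guidance_cut"
        event_ts: datetime   # when the event happened, normalized to UTC
        ingest_ts: datetime  # when we first saw it; research keys on this, not event_ts
        source_doc: str      # provenance pointer back to the raw document

    def normalize_ts(raw: str) -> datetime:
        # Toy normalization: coerce ISO-8601 strings to timezone-aware UTC.
        dt = datetime.fromisoformat(raw)
        return dt.astimezone(timezone.utc) if dt.tzinfo else dt.replace(tzinfo=timezone.utc)

    def to_feature(doc_id: str, raw_entity: str, label: str, raw_ts: str, resolve) -> EventFeature:
        # `resolve` maps raw mentions ("Alphabet", "GOOGL") to one canonical entity id.
        return EventFeature(
            entity_id=resolve(raw_entity),
            event_type=label,
            event_ts=normalize_ts(raw_ts),
            ingest_ts=datetime.now(timezone.utc),
            source_doc=doc_id,
        )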

OpenAI GPT-4 FAISS Pinecone Cohere
18M → 8.6M Docs (48% retained)

High-Throughput Systems

Asynchronous batched retrieval, Redis caching, and backfill orchestration to maintain lag‑corrected, vendor‑normalized panels. End‑to‑end RAG query latency median 220ms, p95 226ms. Supports ~100 concurrent signal evaluations.
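
A simplified sketch of batched retrieval behind a Redis read-through cache; the key scheme, TTL, and the fetch_batch placeholder are assumptions for illustration, not the production code path.

    # Illustrative read-through cache in front of a single batched backend fetch.
    import asyncio
    import json
    import redis.asyncio as aioredis

    cache = aioredis.from_url("redis://localhost:6379/0")
    CACHE_TTL = 300  # seconds (illustrative)

    async def fetch_batch(keys: list[str]) -> dict[str, list[float]]:
        # Placeholder for the real vector-store / feature-store round trip.
        return {k: [0.0] for k in keys}

    async def get_features(keys: list[str]) -> dict[str, list[float]]:
        hits = await asyncio.gather(*(cache.get(f"feat:{k}") for k in keys))
        out, missing = {}, []
        for key, hit in zip(keys, hits):
            if hit is not None:
                out[key] = json.loads(hit)
            else:
                missing.append(key)
        if missing:
            fresh = await fetch_batch(missing)  # one batched call for all misses
            await asyncio.gather(*(
                cache.set(f"feat:{k}", json.dumps(v), ex=CACHE_TTL) for k, v in fresh.items()
            ))
            out.update(fresh)
        return out

    # asyncio.run(get_features(["AAPL", "MSFT"]))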

Redis Async Processing Load Balancing
226ms p95 E2E query

Advanced Backtesting

Walk‑forward evaluation with 10‑fold purged CV (120‑day embargo), Monte Carlo path perturbations, and transaction‑cost/risk models. Enhanced multi‑signal model Sharpe 0.92 OOS vs 0.80 baseline (2019–2024); drift to production −3bps over 6 months.
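
A minimal sketch of purged K-fold splits with an embargo window, in the spirit of the leakage controls above; the handling of label horizons and overlapping samples is not shown here.

    # Illustrative purged K-fold with embargo; indices assumed chronologically ordered.
    import numpy as np

    def purged_kfold(n_samples: int, n_splits: int = 10, embargo: int = 120):
        """Yield (train_idx, test_idx); observations within `embargo` positions of
        the test block are dropped from training to limit label leakage."""
        edges = np.linspace(0, n_samples, n_splits + 1, dtype=int)
        for i in range(n_splits):
            test_idx = np.arange(edges[i], edges[i + 1])
            keep = np.ones(n_samples, dtype=bool)
            keep[max(0, edges[i] - embargo):min(n_samples, edges[i + 1] + embargo)] = False
            yield np.flatnonzero(keep), test_idx

    # Example: ~6 years of daily data, 10 folds, 120-day embargo.
    # for train_idx, test_idx in purged_kfold(1500): fit on train_idx, score on test_idx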

Monte Carlo Factor Analysis Risk Models
$10M Capacity @5bps

Alpha Generation Pipeline

Data Ingestion

Tens of millions of documents

AI Processing

Agentic RAG analysis

Signal Generation

Alpha extraction

Portfolio Alpha

Sharpe 0.92 OOS vs 0.80 baseline

01

Mass Data Scraping

Continuous web‑scale acquisition with point‑in‑time snapshots, entity linking, and intelligent deduplication. Raw→cleaned: 18M→8.6M (48% retained). Dedup reduction 21.6%.
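
As a toy illustration of the deduplication idea, an exact-duplicate pass keyed on a normalized content hash; this sketch covers only exact duplicates, not near-duplicate detection.

    # Toy exact-duplicate removal keyed on a normalized content hash.
    import hashlib
    import re

    def content_key(text: str) -> str:
        # Collapse whitespace and case so trivially reformatted copies collide.
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def dedup(docs: list[dict]) -> list[dict]:
        seen: set[str] = set()
        kept = []
        for doc in docs:                     # each doc assumed to carry a "body" field
            key = content_key(doc["body"])
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept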

7.6 min p95 Freshness
21.6% Dedup Reduction
Multi-source Real-time Intelligent
02

Business Intelligence

Research workbench: feature store, APIs, notebooks, diagnostics, and leakage checks (10‑fold purged CV, 120‑day embargo). Factor library and audit trails to accelerate idea→production cycles for QR/PM teams.
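
A minimal sketch of a point-in-time (as-of) feature pull in Polars, illustrating the leakage discipline described above; the frame and column names are assumptions.

    # Illustrative point-in-time pull: each panel row only sees features whose
    # ingest_ts is at or before the row's as-of timestamp. Names are assumptions.
    import polars as pl

    def point_in_time_join(panel: pl.DataFrame, features: pl.DataFrame) -> pl.DataFrame:
        return (
            panel.sort("asof_ts")
                 .join_asof(
                     features.sort("ingest_ts"),
                     left_on="asof_ts",
                     right_on="ingest_ts",
                     by="entity_id",        # match within the same entity
                     strategy="backward",   # never pick a feature from the future
                 )
        )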

7 days Idea→Prod median
0.4% Leakage (purged CV)
Predictive Analytics Dashboards
03

Quantitative Trading

Execution‑ready signals for equities/stat‑arb: cross‑sectional alphas, risk‑neutralization hooks, universe/portfolio construction primitives, and monitoring to keep models stable in production.
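
A minimal sketch of a cross-sectional transform with a crude neutralization hook (per-date z-score, then demeaning within sector); column names and the grouping choice are illustrative assumptions.

    # Illustrative cross-sectional transform; expects columns date, entity_id,
    # sector, raw_signal. Grouping and column names are assumptions.
    import polars as pl

    def neutralize(signals: pl.DataFrame) -> pl.DataFrame:
        return (
            signals
            .with_columns(
                ((pl.col("raw_signal") - pl.col("raw_signal").mean().over("date"))
                 / pl.col("raw_signal").std().over("date")).alias("zscore")
            )
            .with_columns(
                (pl.col("zscore") - pl.col("zscore").mean().over(["date", "sector"]))
                .alias("alpha_neutral")     # crude sector-neutral score
            )
        )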

−3 bps Backtest→Prod Drift
5 bps Slippage p50
Factor Analysis Portfolio Optimization Risk-Adjusted

Efficiency Through Intelligence

Shorten research cycles and ship models faster with automation for data prep, feature generation, and backtesting.

Time Savings

Automated analysis frees up strategic thinking

7 days
Idea→Prod vs ~30d

AI Automation

Self-learning agents adapt to market patterns

−72%
Manual preprocessing

Enterprise Grade

Consulting-level competitive advantage

99.94%
Pipeline success (30d)

Strategic Insights

Deep market intelligence for better decisions

3
Incidents, avg 22 min

Ready to Transform Your Data?

We work with CIOs, PMs, and QRs under NDA to deploy private, customized data and research pipelines, on-prem or in your VPC, tailored to hedge funds and family offices.

Genesis Sep 2024
Synthesis H2 2025
Scale Q2 2026
Dream The Future

Genesis

The initiative began in September 2024 as a solo effort to test whether alternative data could be ingested, cleaned, and transformed into research-grade features. The initial months focused on building ingestion pipelines, implementing entity resolution, and developing leakage-aware backtesting. Despite being driven by a single developer, the work demonstrated that meaningful alpha could be extracted from messy, unstructured data.

Synthesis

Over time, the pieces came together: robust pipelines, automated deduplication, timestamp-normalized features, and a walk-forward validation framework. This synthesis created a self-reinforcing loop in which data flowed in, features were generated, and signals were tested, producing stable out-of-sample alphas with Sharpe improvements over baseline models. It marked the transition from experimentation to a fully functioning research platform.

Scale

  • 2025: Coverage to ~12k equities; docs to ~40M/yr; embeddings >3M; p95 latency under 250ms.
  • 2026: Global ~20k universe; embeddings 8–10M; ~120k API calls/mo; <15% external dependency; hundreds of concurrent signals.

Dream

  • 2027: Expand coverage to ~50k instruments across global equities and macro series. Ingest ~250M documents per year with fully automated quality gates. Grow the vector index to ~30M embeddings while keeping p95 latency near 150ms through GPU-accelerated retrieval and aggressive caching.
  • 2028: Build a cross-market knowledge graph that links entities, events, and features. Ship an internal feature marketplace with automated lineage and drift governance. Reach ~500k API calls per day on cost-aware hybrid GPU/TPU inference.
  • 2029+: Autopilot research loops that propose, test, and promote signals with human approval. Offer private on-premises deployments for funds, with reproducibility guarantees and cryptographic audit trails. Process over one billion documents per year while keeping end-to-end latency under 100ms for common queries.

Email

scourr@proton.me

Response times vary with inbound inquiry volume

Location

Toronto, ON

Canada

Status

Pre-Seed

Actively growing