Architecture Overview
System Design
Capcat uses a hybrid architecture: config-driven sources for straightforward RSS/HTML scraping, and custom Python sources for sites requiring comment integration or complex scraping logic.
Source System
Config-Driven Sources (10 active)
YAML-based configurations requiring no Python. Add a file to Config/sources/active/config_driven/configs/ and the source is live.
Active: BBC News, BBC Sport, Google Research, The Guardian, IEEE Spectrum, InfoQ, Mashable, MIT News, Nature, Scientific American.
Custom Sources (7 active)
Python implementations in Config/sources/active/custom/. Used where comment threads, authentication, or non-standard scraping is needed.
Active: Hacker News, Lobsters, Medium, Substack, Twitter/X, Vimeo, YouTube.
Processing Pipeline
capcat fetch <source>
|
v
SourceRegistry → SourceFactory → Source instance
|
article discovery (RSS / scrape)
+ date extraction (RSS pubDate / HTML JSON-LD)
|
UnifiedSourceProcessor (8 workers)
|
ArticleFetcher + fetch_comments()
+ date extraction from fetched HTML (no extra request)
|
MediaProcessor → images / PDFs
|
HTMLGenerator → article HTML
(sorted by publication date, newest first)
|
News/<source>/ (output directory)
Key Components
SourceRegistry
Auto-discovers all sources from Config/sources/active/. Singleton - call get_source_registry().
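As a rough illustration of auto-discovery plus a singleton accessor, the sketch below scans the directory layout described in this document and caches the result. The names discover_sources and get_registry are hypothetical stand-ins, not Capcat's actual API, and the real registry may store richer objects than a name-to-kind mapping.

```python
from functools import lru_cache
from pathlib import Path


def discover_sources(active_dir: Path) -> dict[str, str]:
    """Map source name -> kind by scanning the documented directory layout."""
    found: dict[str, str] = {}
    configs = active_dir / "config_driven" / "configs"
    for pattern in ("*.yaml", "*.yml"):
        for cfg in sorted(configs.glob(pattern)):
            found[cfg.stem] = "config_driven"
    # Custom sources live in custom/<name>/source.py
    for impl in sorted((active_dir / "custom").glob("*/source.py")):
        found[impl.parent.name] = "custom"
    return found


@lru_cache(maxsize=None)
def get_registry(active_dir: str = "Config/sources/active") -> dict[str, str]:
    """Cached accessor: repeated calls with the same path return the same dict."""
    return discover_sources(Path(active_dir))
```

Caching the accessor is one simple way to get singleton behaviour in Python without a dedicated class.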
UnifiedSourceProcessor
ThreadPoolExecutor(max_workers=8) processes articles concurrently. Calls ArticleFetcher and fetch_comments() per article.
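A minimal sketch of the concurrency pattern, assuming each article can be processed independently. process_article is a hypothetical placeholder for the per-article work (ArticleFetcher plus fetch_comments()); only ThreadPoolExecutor and the worker count of 8 come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_article(url: str) -> dict:
    """Hypothetical stand-in for the per-article work (fetch + comments)."""
    # A real worker would call ArticleFetcher and fetch_comments() here.
    return {"url": url, "status": "ok"}


def process_all(urls: list[str], max_workers: int = 8) -> list[dict]:
    """Process articles concurrently; one failure does not stop the batch."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_article, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # record the failure and keep going
                results.append({"url": url, "status": "error", "error": str(exc)})
    return results
```

Results arrive in completion order, not submission order, so any ordering the output needs has to be applied afterwards.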
EthicalScrapingManager
- Robots.txt caching (15-minute TTL)
- enforce_rate_limit() - thread-safe slot reservation, used by all sources
- request_hn_api() - HN-specific Firebase API wrapper with backoff
- request_with_backoff() - exponential backoff for 429/503 errors
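The two ideas in this list, slot reservation and exponential backoff, can be sketched as follows. This is an illustration under assumptions, not the EthicalScrapingManager implementation: SlotRateLimiter, backoff_delays, and the interval, base, and cap values are all hypothetical.

```python
import threading
import time


class SlotRateLimiter:
    """Hypothetical per-domain limiter: each caller reserves the next free time slot."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot: dict[str, float] = {}

    def reserve(self, domain: str) -> float:
        """Reserve the next slot for `domain` and return how long to wait for it."""
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot.get(domain, now))
            self._next_slot[domain] = slot + self.min_interval
        return slot - now

    def wait(self, domain: str) -> None:
        delay = self.reserve(domain)
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so other threads can reserve


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays for successive retries after a 429/503: base * 2**attempt, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]
```

Reserving the slot under the lock and sleeping outside it means concurrent workers queue up in order without blocking each other while they wait.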
SessionPool
pool_connections=20, pool_maxsize=20 - shared across all workers via get_session_pool().
DateExtractor
Extracts publication dates from already-fetched HTML pages (no extra HTTP requests). Priority: JSON-LD datePublished > <meta property="article:published_time"> > <time datetime="...">. RSS sources get dates directly from feed entries.
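The priority order can be sketched with the standard-library HTML parser. This is a simplified illustration, not the DateExtractor code: extract_date and _DateParser are hypothetical names, the function returns the raw date string without parsing it, and it only handles JSON-LD blocks whose top level is a single object (arrays and @graph wrappers are ignored).

```python
import json
from html.parser import HTMLParser
from typing import Optional


class _DateParser(HTMLParser):
    """Collects the three date candidates in a single pass over the HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.json_ld: Optional[str] = None
        self.meta: Optional[str] = None
        self.time_tag: Optional[str] = None
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and a.get("property") == "article:published_time":
            self.meta = self.meta or a.get("content")
        elif tag == "time" and a.get("datetime"):
            self.time_tag = self.time_tag or a["datetime"]

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                doc = json.loads("".join(self._buffer))
            except ValueError:
                return  # malformed JSON-LD: fall through to the next source
            if isinstance(doc, dict) and isinstance(doc.get("datePublished"), str):
                self.json_ld = self.json_ld or doc["datePublished"]


def extract_date(html: str) -> Optional[str]:
    """Return the raw date string from the highest-priority source present."""
    parser = _DateParser()
    parser.feed(html)
    return parser.json_ld or parser.meta or parser.time_tag
```

Because the parser works on a string that is already in memory, the lookup adds no network traffic, which matches the "no extra HTTP requests" property described above.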
HTMLGenerator
Six templates: article-with-comments.html, article-no-comments.html, comments-with-navigation.html, article-capcats.html, root-index.html, source-index.html. Source-level article listings sort by publication date (newest first), falling back to file modification time for articles without dates.
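The sort with its fallback can be sketched as below, assuming each article is represented as a (publication date or None, file path) pair. sort_key and newest_first are hypothetical names, and treating naive datetimes as UTC is an assumption made here so that all keys are comparable.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sort_key(published: Optional[datetime], path: Path) -> datetime:
    """Publication date when known, otherwise the file's modification time."""
    if published is not None:
        # Treat naive datetimes as UTC so every key is comparable.
        return published if published.tzinfo else published.replace(tzinfo=timezone.utc)
    return datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)


def newest_first(articles: list[tuple[Optional[datetime], Path]]) -> list[Path]:
    """Sort (published, path) pairs newest first."""
    ordered = sorted(articles, key=lambda a: sort_key(a[0], a[1]), reverse=True)
    return [path for _, path in ordered]
```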
Directory Layout
~/.capcat/ ← internal state (do not edit)
Config/
sources/active/
config_driven/configs/ ← YAML source configs
custom/<name>/source.py ← Python source implementations
bundles/bundles.yml
themes/ ← CSS overrides
News/ ← batch fetch output
Capcats/ ← single-article output
Configuration Hierarchy
- CLI flags (highest priority)
- Environment variables (CAPCAT_*)
- Config/Global-settings.yaml
- Config/capcat.yml (per-vault overrides)
- Source YAML defaults (lowest priority)
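One way to picture the hierarchy is as a stack of dictionaries searched from the top. The sketch below assumes the layers are applied in exactly the order listed, that an unset CLI flag is represented as None, and that CAPCAT_FOO maps to a setting named foo. The naming rule and the functions env_layer and resolve are illustrative assumptions, not Capcat's loader.

```python
from collections import ChainMap
from typing import Any, Mapping


def env_layer(environ: Mapping[str, str], prefix: str = "CAPCAT_") -> dict[str, str]:
    """Map CAPCAT_FOO=bar to {"foo": "bar"} (hypothetical naming rule)."""
    return {k[len(prefix):].lower(): v for k, v in environ.items() if k.startswith(prefix)}


def resolve(
    cli: Mapping[str, Any],
    environ: Mapping[str, str],
    global_settings: Mapping[str, Any],
    vault: Mapping[str, Any],
    source_defaults: Mapping[str, Any],
) -> ChainMap:
    """Layer the five sources; ChainMap returns the first layer that has the key."""
    return ChainMap(
        {k: v for k, v in cli.items() if v is not None},  # unset flags do not mask lower layers
        env_layer(environ),
        dict(global_settings),
        dict(vault),
        dict(source_defaults),
    )
```

Values read from the environment stay strings in this sketch; a real loader would need to convert them to the types the settings expect.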
Design Principles
- No code required to add a config-driven source
- Rate limiting is always on - never bypassable per-source
- Privacy by default: usernames replaced with “Anonymous” in comment output
- download_files (images) and download_pdfs are independent flags; --media sets both