Ethical Scraping
Capcat enforces ethical scraping practices across all sources. The EthicalScrapingManager is the single point of control - no source bypasses it.
Core Rules
- Robots.txt is always checked before fetching any URL (15-minute TTL cache)
- Rate limiting is mandatory - minimum 1 second between requests to any domain
- Exponential backoff on 429/503 responses - never hammers a server
- User-agent disclosure - identifies as Capcat/2.0
EthicalScrapingManager API
from capcat.core.ethical_scraping import get_ethical_manager
manager = get_ethical_manager()
enforce_rate_limit
Thread-safe slot reservation. Blocks until the minimum delay has elapsed since the last request to the domain.
manager.enforce_rate_limit(domain, crawl_delay, min_delay=1.0)
Used by all sources before calling session.get(). Required - do not skip.
can_fetch
Checks robots.txt for the given URL.
allowed, reason = manager.can_fetch(url)
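As a rough illustration of what a check of this shape involves, the sketch below evaluates a URL against robots.txt text with the standard library's urllib.robotparser. It is not Capcat's implementation: the function name can_fetch_sketch, the reason strings, and passing the robots.txt body as a string (instead of fetching and caching it) are assumptions made to keep the example self-contained.

```python
from urllib.robotparser import RobotFileParser


def can_fetch_sketch(robots_txt, url, user_agent="Capcat/2.0"):
    """Return (allowed, reason) for url, given the text of a robots.txt file.

    Hypothetical helper: the real can_fetch() takes only a URL and
    presumably obtains robots.txt itself.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        return True, "allowed by robots.txt"
    return False, "disallowed by robots.txt"


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(can_fetch_sketch(rules, "https://example.com/public/page"))
    print(can_fetch_sketch(rules, "https://example.com/private/page"))
```

Returning a reason alongside the boolean lets a caller log why a URL was skipped.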
request_with_backoff
Full ethical fetch: robots.txt check + rate limiting + exponential backoff on 429/503.
response = manager.request_with_backoff(session, url, max_retries=3)
Not suitable for sites whose robots.txt contains Disallow: /, because the built-in robots.txt check blocks every URL on such sites. Use enforce_rate_limit only in that case.
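The retry behaviour on 429/503 can be sketched in isolation. The example below is a minimal illustration under stated assumptions, not the code behind request_with_backoff: the function name fetch_with_backoff, the base_delay parameter, the doubling schedule, and the injected fetch and sleep callables are all hypothetical. The actual delays Capcat uses are not documented here.

```python
import time

# Status codes that trigger a retry, as listed in the rules above.
RETRY_STATUSES = {429, 503}


def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or retries run out.

    fetch is any zero-argument callable returning an HTTP status code.
    Returns (final_status, list_of_delays_slept).
    """
    delays = []
    status = fetch()
    for attempt in range(max_retries):
        if status not in RETRY_STATUSES:
            break
        # Assumed schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
        delay = base_delay * (2 ** attempt)
        delays.append(delay)
        sleep(delay)
        status = fetch()
    return status, delays


if __name__ == "__main__":
    responses = iter([429, 503, 200])
    # A no-op sleep keeps the demonstration instant.
    print(fetch_with_backoff(lambda: next(responses), sleep=lambda s: None))
```

Injecting the sleep function keeps the sketch testable without real waiting; a production version would also typically honour a Retry-After header when the server sends one.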
request_hn_api
HN-specific Firebase API wrapper with backoff. Returns parsed JSON or None.
data = manager.request_hn_api(session, url)
Robots.txt Cache
- TTL: 15 minutes
- Per-domain
- get_cache_stats() returns hit/miss counts
- clear_stale_cache() removes expired entries
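A per-domain cache with a 15-minute TTL, hit/miss counters, and stale-entry cleanup could look roughly like the sketch below. The class name TTLCache, its method names, and the injectable clock are assumptions for illustration; the return shape of the real get_cache_stats() is not specified here.

```python
import time


class TTLCache:
    """Per-domain cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # domain -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, domain):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(domain)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        return None

    def put(self, domain, value):
        """Store value for domain, stamped with the current time."""
        self._entries[domain] = (self._clock(), value)

    def clear_stale(self):
        """Delete expired entries and return how many were removed."""
        now = self._clock()
        stale = [d for d, (t, _) in self._entries.items() if now - t >= self._ttl]
        for domain in stale:
            del self._entries[domain]
        return len(stale)
```

An expired entry counts as a miss in this sketch, so the caller refetches robots.txt and calls put() again.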
Rate Limit Enforcement
Each domain has a thread-safe last-access timestamp. With 8 concurrent workers, all requests to a domain are serialized through the slot - only one worker proceeds at a time.
Default minimum delay: 1.0 second. Sources override via rate_limit in their YAML config or self.config.rate_limit in Python sources.
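The slot-reservation idea can be sketched as follows. This is an illustrative reading of the description above, not Capcat's code: the class name DomainRateLimiter, the single shared lock, and combining the two delays with max(crawl_delay, min_delay) are assumptions. Each caller reserves the next free time slot for the domain while holding the lock, then sleeps outside the lock until that slot arrives.

```python
import threading
import time


class DomainRateLimiter:
    """Serializes requests per domain by handing out time slots."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._lock = threading.Lock()
        self._next_slot = {}  # domain -> earliest time the next request may start
        self._clock = clock
        self._sleep = sleep

    def enforce(self, domain, crawl_delay=0.0, min_delay=1.0):
        """Block until this caller's slot for domain; return seconds waited."""
        # Assumption: the effective delay is the larger of the two values.
        delay = max(crawl_delay, min_delay)
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot.get(domain, now))
            # Reserve the slot so the next caller queues behind this one.
            self._next_slot[domain] = slot + delay
        wait = slot - now
        if wait > 0:
            # Sleeping outside the lock lets other domains proceed meanwhile.
            self._sleep(wait)
        return wait
```

Because the reservation happens under the lock, concurrent workers hitting the same domain are spaced at least `delay` seconds apart, while workers on different domains do not wait on each other.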
Lobsters Note
lobste.rs/robots.txt has Disallow: / for User-agent: *. Capcat uses enforce_rate_limit() only (not request_with_backoff()) for Lobsters comment fetching to avoid the robots.txt block while still respecting rate limits.