Ethical Scraping Implementation

Status: Implemented
Version: 2.0
Last Updated: January 1, 2026

Overview

Capcat implements comprehensive ethical scraping practices to ensure respectful and compliant content collection from news sources.

Core Principles

1. Prefer Official APIs > RSS > HTML Scraping

Implementation Priority:

  1. Official APIs - always preferred when available
  2. RSS/Atom Feeds - second choice, widely supported
  3. HTML Scraping - last resort, only when no alternatives exist
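
As an illustration of this priority order, a source loader might select its collection strategy along these lines. This is a minimal sketch only; the helper and the attribute names api_url and rss_url are hypothetical, not part of the actual source configuration:

def choose_collection_method(source_config) -> str:
    """Return the most respectful collection method available (illustrative sketch)."""
    if getattr(source_config, "api_url", None):   # 1. Official API
        return "api"
    if getattr(source_config, "rss_url", None):   # 2. RSS/Atom feed
        return "rss"
    return "html"                                 # 3. HTML scraping as a last resort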

Current Source Distribution:

  • Custom sources (10): Use RSS feeds or specialized approaches
  • Config-driven sources (1): InfoQ uses RSS

2. Robots.txt Compliance

Implementation:

core/ethical_scraping.py

Features:

  • Automatic robots.txt fetching and parsing
  • 15-minute TTL cache to reduce server load
  • Crawl-delay extraction and enforcement
  • Path validation against disallow rules

Cache Management:

from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()
parser, crawl_delay = manager.get_robots_txt(base_url)  # base_url: site root, e.g. "https://example.com/"
allowed, reason = manager.can_fetch(url)                # url: the specific page to fetch

Cache Statistics:

  • TTL: 15 minutes
  • Auto-cleanup of stale entries
  • Per-domain caching
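
Cache housekeeping goes through the manager methods listed later in this document (get_cache_stats() and clear_stale_cache()); a minimal usage sketch:

from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()

# Inspect the per-domain robots.txt cache (15-minute TTL)
print(manager.get_cache_stats())

# Drop entries whose TTL has expired
manager.clear_stale_cache()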

3. User-Agent Identification

Standard User-Agent:

Capcat/2.0 (Personal news archiver)

Implementation Locations:

  • core/config.py:35 - Global default
  • core/source_system/source_config.py:54 - Source config default
  • core/source_system/config_driven_source.py:119 - RSS requests
  • sources/active/custom/lb/source.py:61,460 - Lobsters custom headers

Format Guidelines:

  • Product name and version: Capcat/2.0
  • Purpose description: (Personal news archiver)
  • No personal information or URLs
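
A sketch of how this identifier is attached to outgoing requests with the requests library; the actual wiring lives in the files listed above:

import requests

USER_AGENT = "Capcat/2.0 (Personal news archiver)"

session = requests.Session()
session.headers["User-Agent"] = USER_AGENT  # identify honestly on every request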

4. Rate Limiting

Enforcement:

  • Minimum 1-second delay between requests, enforced globally
  • Respect robots.txt crawl-delay directives
  • Per-domain rate limiting tracking
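
A minimal sketch of how a caller might apply these rules using the EthicalScrapingManager methods documented below (example.com is a placeholder):

from urllib.parse import urlparse

from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()

url = "https://example.com/article"  # placeholder URL
domain = urlparse(url).netloc

# Read the crawl-delay from robots.txt, then wait at least 1 second per request
parser, crawl_delay = manager.get_robots_txt("https://example.com/")
manager.enforce_rate_limit(domain, crawl_delay, min_delay=1.0)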

Current Rate Limits:

  • InfoQ: 20.0s rate limit vs. 20.0s crawl-delay - compliant
  • BBC: custom implementation, crawl-delay N/A - compliant
  • HN: official API, crawl-delay N/A - compliant
  • Lobsters: RSS feed, crawl-delay N/A - compliant
  • Others: 1.0-3.0s rate limits, crawl-delay varies - compliant

5. Error Handling with Exponential Backoff

Implementation:

core/ethical_scraping.py:request_with_backoff()

HTTP Status Codes Handled:

  • 429 (Too Many Requests)

    - Respect the Retry-After header when present
    - Exponential backoff if the header is missing
    - Initial delay: 1.0s, multiplier: 2x
  • 503 (Service Unavailable)

    - Exponential backoff
    - Initial delay: 1.0s, multiplier: 2x
  • Maximum retries: 3 attempts (applies to both)

Backoff Strategy:

Attempt 1: 1.0s delay
Attempt 2: 2.0s delay
Attempt 3: 4.0s delay
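
The schedule above follows a simple doubling rule; an illustrative computation (not the production code in request_with_backoff()) that reproduces it:

initial_delay, multiplier, max_retries = 1.0, 2, 3

for attempt in range(max_retries):
    delay = initial_delay * multiplier ** attempt
    print(f"Attempt {attempt + 1}: {delay:.1f}s delay")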

Implementation Details

EthicalScrapingManager Class

Location:

core/ethical_scraping.py

Key Methods:

class EthicalScrapingManager:
    def __init__(self, user_agent: str = "Capcat/2.0")

    def get_robots_txt(self, base_url: str, timeout: int = 10) -> Tuple[RobotFileParser, float]

    def can_fetch(self, url: str) -> Tuple[bool, str]

    def enforce_rate_limit(self, domain: str, crawl_delay: float, min_delay: float = 1.0)

    def request_with_backoff(
        self, session: requests.Session, url: str,
        method: str = "GET", max_retries: int = 3,
        initial_delay: float = 1.0, **kwargs
    ) -> requests.Response

    def validate_source_config(self, base_url: str, rate_limit: float) -> Tuple[bool, str]

    def get_cache_stats(self) -> Dict[str, Any]

    def clear_stale_cache(self)

Usage Example

import requests

from core.ethical_scraping import get_ethical_manager

# Get global manager instance
manager = get_ethical_manager()

# Validate source configuration
is_valid, message = manager.validate_source_config(
    base_url="https://example.com/news/",
    rate_limit=2.0
)

if not is_valid:
    print(f"Configuration issue: {message}")

# Make ethical request with backoff
response = manager.request_with_backoff(
    session=requests.Session(),
    url="https://example.com/article",
    timeout=30
)

Compliance Audit Process

Tool:

audit_ethical_compliance.py (temporary, created on-demand)

Audit Checks:

  1. Robots.txt fetching and parsing
  2. Crawl-delay requirement extraction
  3. Path allowance validation
  4. RSS feed availability detection
  5. Rate limit compliance verification
  6. Bundle membership verification
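
A minimal sketch of what such an audit pass might look like, built only on the manager API documented above; the target list is a placeholder rather than the real audit script:

from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()

# (source_id, base_url, configured rate limit) - placeholder audit targets
targets = [
    ("iq", "https://www.infoq.com/", 20.0),
]

for source_id, base_url, rate_limit in targets:
    parser, crawl_delay = manager.get_robots_txt(base_url)                   # checks 1-2
    allowed, reason = manager.can_fetch(base_url)                            # check 3
    rate_ok, message = manager.validate_source_config(base_url, rate_limit)  # check 5
    print(f"{source_id}: allowed={allowed} ({reason}), crawl_delay={crawl_delay}, rate_ok={rate_ok}, {message}")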

Last Audit:

January 1, 2026

Result:

All active sources compliant

Findings:

  • 1 active config-driven source (InfoQ)
  • 5 orphaned sources moved to inactive
  • 0 active violations

Source Compliance Status

Active Sources (11)

Config-Driven (1):

  • InfoQ - RSS feed, 20.0s rate limit, compliant

Custom Sources (10):

  • Hacker News - Official API
  • Lobsters - RSS feed
  • BBC - Custom implementation
  • Gizmodo - RSS feed
  • Futurism - RSS feed
  • IEEE Spectrum - RSS feed
  • Nature - RSS feed
  • Scientific American - RSS feed
  • LessWrong - GraphQL API
  • MIT Tech Review - RSS feed (inactive in bundles)

Red Flags to Avoid

Never Do:

  • Ignore robots.txt directives
  • Bypass anti-bot protection
  • Scrape paths explicitly blocked
  • Use aggressive rate limits (under 1 second between requests) without permission
  • Impersonate browser User-Agents deceptively
  • Scrape authentication-required content
  • Access paywalled content
  • Ignore 429/503 error responses

Always Do:

  • Check robots.txt before scraping
  • Respect crawl-delay directives
  • Use RSS/API when available
  • Identify as "Capcat/2.0 (Personal news archiver)"
  • Handle errors gracefully with backoff
  • Cache robots.txt to reduce load
  • Rate limit: at least 1 second between requests
  • Document scraping methodology

Configuration Reference

Rate Limit Configuration

File:

sources/active/config_driven/configs/iq.yaml

# Request configuration
timeout: 15
rate_limit: 20.0  # Must be >= robots.txt crawl-delay

User-Agent Configuration

File:

core/config.py

@dataclass
class NetworkConfig:
    user_agent: str = "Capcat/2.0 (Personal news archiver)"

RSS Discovery Configuration

File:

sources/active/config_driven/configs/iq.yaml

# Discovery method - use RSS for latest news articles only
discovery:
  method: "rss"
  rss_url: "https://feed.infoq.com"
  max_articles: 30
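
A rough sketch of how a config like this could drive RSS discovery, using only requests and the standard library and assuming an RSS 2.0 feed with <item><link> entries; the real logic lives in core/source_system/config_driven_source.py and may differ:

import xml.etree.ElementTree as ET

import requests

def discover_rss_articles(rss_url: str, max_articles: int, user_agent: str) -> list:
    """Fetch an RSS feed and return up to max_articles item links (illustrative only)."""
    response = requests.get(rss_url, headers={"User-Agent": user_agent}, timeout=15)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    links = [item.findtext("link") for item in root.iter("item")]
    return [link for link in links if link][:max_articles]

links = discover_rss_articles(
    "https://feed.infoq.com", 30, "Capcat/2.0 (Personal news archiver)"
)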

Testing Compliance

Manual Testing

# Test InfoQ RSS implementation
./capcat fetch iq --count 5

# Test with verbose logging
./capcat -L compliance.log fetch iq --count 5

# View logs
tail -f compliance.log

Automated Validation

from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()

# Validate all sources
sources = ["iq", "hn", "lb", "bbc"]
for source_id in sources:
    config = load_source_config(source_id)  # helper returning a config with base_url and rate_limit
    is_valid, message = manager.validate_source_config(
        config.base_url,
        config.rate_limit
    )
    print(f"{source_id}: {message}")

References

Documentation

Internal Documentation

  • ETHICAL_COMPLIANCE_REPORT.md - October 2025 audit results
  • docs/architecture.md - System architecture
  • docs/source-development.md - Adding new sources

Maintenance

Regular Tasks

Monthly:

  • Review rate limits for new sources
  • Check for RSS feed availability
  • Update User-Agent if needed

Quarterly:

  • Run full compliance audit
  • Review and update robots.txt cache
  • Validate all active sources

As Needed:

  • Update when adding new sources
  • Respond to robots.txt changes
  • Handle 429/503 rate limit errors

Contact

For questions about ethical scraping implementation:

  1. Review this documentation
  2. Check ETHICAL_COMPLIANCE_REPORT.md
  3. Review source-specific configs in sources/active/