Capcat implements comprehensive ethical scraping practices to ensure respectful and compliant content collection from news sources.
The implementation lives in `core/ethical_scraping.py`:

```python
from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()

base_url = "https://example.com"  # example site

# Fetch and cache the site's robots.txt along with its crawl-delay
parser, crawl_delay = manager.get_robots_txt(base_url)

# Check whether a specific URL may be fetched, with a reason if not
allowed, reason = manager.can_fetch(url)
```
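A minimal sketch of how these calls combine before a fetch; the URL and the per-domain keying of `enforce_rate_limit` are assumptions, not confirmed behavior:

```python
from urllib.parse import urlparse

url = "https://example.com/news/article-1"  # hypothetical URL

allowed, reason = manager.can_fetch(url)
if allowed:
    # Wait out the site's crawl delay for this domain before fetching.
    manager.enforce_rate_limit(urlparse(url).netloc, crawl_delay)
else:
    print(f"Skipping {url}: {reason}")
```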
Every request identifies itself with the user agent `Capcat/2.0 (Personal news archiver)`. That string is set in four places (a sketch of applying it follows the list):

- `core/config.py:35` - Global default
- `core/source_system/source_config.py:54` - Source config default
- `core/source_system/config_driven_source.py:119` - RSS requests
- `sources/active/custom/lb/source.py:61,460` - Lobsters custom headers
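As a sketch only (not code from the repository), this is how a custom source might stamp that identification onto its requests session:

```python
import requests

session = requests.Session()
# Identify every request from this session with the project user agent.
session.headers["User-Agent"] = "Capcat/2.0 (Personal news archiver)"
```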
As a concrete check, a source running as `Capcat/2.0 (Personal news archiver)` with a configured 20.0s rate limit against a 20.0s robots.txt crawl-delay is reported as Compliant.
Failed requests are retried by `request_with_backoff()` in `core/ethical_scraping.py`, which backs off exponentially and respects the server's `Retry-After` header:

- Attempt 1: 1.0s delay
- Attempt 2: 2.0s delay
- Attempt 3: 4.0s delay
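A minimal sketch of that retry logic, assuming the delay doubles per attempt and a numeric `Retry-After` takes precedence when present; illustrative, not the repository's implementation:

```python
import time
import requests

def fetch_with_backoff(session: requests.Session, url: str,
                       max_retries: int = 3, initial_delay: float = 1.0) -> requests.Response:
    """Retry on 429/503 with doubling delays: 1.0s, 2.0s, 4.0s for the defaults."""
    delay = initial_delay
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code not in (429, 503):
            return response
        # Prefer a numeric Retry-After hint from the server over our own schedule.
        retry_after = response.headers.get("Retry-After", "")
        time.sleep(float(retry_after) if retry_after.isdigit() else delay)
        delay *= 2
    return response
```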
The full API in `core/ethical_scraping.py`:

```python
class EthicalScrapingManager:
    def __init__(self, user_agent: str = "Capcat/2.0"): ...
    def get_robots_txt(self, base_url: str, timeout: int = 10) -> Tuple[RobotFileParser, float]: ...
    def can_fetch(self, url: str) -> Tuple[bool, str]: ...
    def enforce_rate_limit(self, domain: str, crawl_delay: float, min_delay: float = 1.0): ...
    def request_with_backoff(
        self, session: requests.Session, url: str,
        method: str = "GET", max_retries: int = 3,
        initial_delay: float = 1.0, **kwargs
    ) -> requests.Response: ...
    def validate_source_config(self, base_url: str, rate_limit: float) -> Tuple[bool, str]: ...
    def get_cache_stats(self) -> Dict[str, Any]: ...
    def clear_stale_cache(self): ...
```
Typical usage:

```python
import requests

from core.ethical_scraping import get_ethical_manager

# Get global manager instance
manager = get_ethical_manager()

# Validate source configuration
is_valid, message = manager.validate_source_config(
    base_url="https://example.com/news/",
    rate_limit=2.0
)
if not is_valid:
    print(f"Configuration issue: {message}")

# Make ethical request with backoff
response = manager.request_with_backoff(
    session=requests.Session(),
    url="https://example.com/article",
    timeout=30
)
```
Compliance can also be audited with `audit_ethical_compliance.py`, a temporary script created on demand.
Per-source request settings live in the source's YAML config, for example `sources/active/config_driven/configs/iq.yaml`:

```yaml
# Request configuration
timeout: 15
rate_limit: 20.0  # Must be >= robots.txt crawl-delay
```
The global default user agent comes from `core/config.py`:

```python
@dataclass
class NetworkConfig:
    user_agent: str = "Capcat/2.0 (Personal news archiver)"
```
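Since the manager's constructor takes a `user_agent` argument, the two can be wired together; a sketch, assuming `NetworkConfig` and `EthicalScrapingManager` import as shown:

```python
from core.config import NetworkConfig
from core.ethical_scraping import EthicalScrapingManager

# Replace the constructor's bare "Capcat/2.0" default with the full
# globally configured identification string.
manager = EthicalScrapingManager(user_agent=NetworkConfig().user_agent)
```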
Discovery settings for the same source, in `sources/active/config_driven/configs/iq.yaml`:

```yaml
# Discovery method - use RSS for latest news articles only
discovery:
  method: "rss"
  rss_url: "https://feed.infoq.com"
  max_articles: 30
```
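To inspect such a config programmatically, a sketch using PyYAML (the project's own loader may differ):

```python
import yaml

with open("sources/active/config_driven/configs/iq.yaml") as f:
    config = yaml.safe_load(f)

discovery = config["discovery"]
assert discovery["method"] == "rss"
print(f"Feed: {discovery['rss_url']} (up to {discovery['max_articles']} articles)")
```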
```bash
# Test InfoQ RSS implementation
./capcat fetch iq --count 5

# Test with verbose logging
./capcat -L compliance.log fetch iq --count 5

# View logs
tail -f compliance.log
```
The same validation can be run across every configured source:

```python
from core.ethical_scraping import get_ethical_manager

manager = get_ethical_manager()

# Validate all sources (load_source_config comes from the source system)
sources = ["iq", "hn", "lb", "bbc"]
for source_id in sources:
    config = load_source_config(source_id)
    is_valid, message = manager.validate_source_config(
        config.base_url,
        config.rate_limit
    )
    print(f"{source_id}: {message}")
```
Related documents:

- `ETHICAL_COMPLIANCE_REPORT.md` - October 2025 audit results
- `docs/architecture.md` - System architecture
- `docs/source-development.md` - Adding new sources
For questions about the ethical scraping implementation, see `ETHICAL_COMPLIANCE_REPORT.md` and the source implementations under `sources/active/`.