Source Development
Two Source Types
Config-Driven (no Python required)
Create a YAML file in Config/sources/active/config_driven/configs/:
display_name: "Example News"
base_url: "https://example.com/"
category: tech
rate_limit: 1.0
article_selectors:
- ".headline a"
- "h2.title a"
content_selectors:
- ".article-body"
- "article .content"
image_processing:
max_image_size_mb: 5
media:
download_pdfs: false
max_pdf_size_mb: 10
That’s it. Run capcat list sources to confirm it’s discovered, then capcat fetch example-news --count 5.
Custom Source (Python)
Use when you need comment integration, authentication, or non-standard scraping.
Create Config/sources/active/custom/<name>/source.py:
from capcat.core.source_system.base_source import BaseSource
class MySource(BaseSource):
def __init__(self, config=None):
super().__init__(config)
self.name = "mysource"
self.display_name = "My Source"
def discover_articles(self, count=30):
# Return list of Article objects
pass
def fetch_article_content(self, article):
# Populate article.content, article.title, etc.
pass
def fetch_comments(self, comment_url, article_title, article_folder_path):
# Optional. Called automatically if article.comment_url is set.
pass
Rate Limiting
All sources must use EthicalScrapingManager for HTTP requests. Never call session.get() without rate limiting:
from capcat.core.ethical_scraping import get_ethical_manager
manager = get_ethical_manager()
manager.enforce_rate_limit("example.com", 0.0, min_delay=self.config.rate_limit)
response = self.session.get(url)
For sites where robots.txt allows access, use request_with_backoff() for automatic 429/503 retry.
Validate
capcat fetch mysource --count 5 --html
Check the output in News/mysource/ for content, media, and any errors in the log.
Config-Driven YAML Reference
| Key | Required | Description |
|---|---|---|
display_name |
yes | Human-readable name shown in TUI |
base_url |
yes | Root URL of the site |
category |
yes | tech, science, news, etc. |
rate_limit |
no | Seconds between requests (default: 1.0) |
article_selectors |
yes | CSS selectors for article links |
content_selectors |
yes | CSS selectors for article body |
image_processing.max_image_size_mb |
no | Per-source image size cap in MB (default: 5) |
media.download_pdfs |
no | Override global PDF setting |
media.max_pdf_size_mb |
no | Per-source PDF size cap |
Testing Your Source
# Unit tests
cd ~/capcat && source venv/bin/activate && pytest tests/unit/ -q
# Acceptance test against live site
capcat fetch mysource --count 10