Comprehensive guide for configuring Capcat's hybrid architecture system and individual sources.
Capcat uses a hierarchical configuration system with the following precedence (highest to lowest):
1. Command-line arguments (--pdfs, --media, --count)
2. Environment variables (CAPCAT_*)
3. Vault config files (Config/Global-settings.yaml, capcat.yml)

All configuration lives inside your vault directory, the project folder where capcat stores your archives. There is no application-level config file.
Located at the vault root. Controls which sources are active and how many articles each fetches per run:
# <vault>/capcat.yml
sources:
  - name: hn
    article_count: 10
  - name: bbc
    article_count: 5
  - name: guardian
    article_count: 5
bundles: {}
Located at <vault>/Config/Global-settings.yaml. Controls network behaviour, processing options, PDF downloads, UI, and logging. Generate or reset it with:
cd <vault>
capcat settings --force
Edit the generated file directly. Settings take effect on the next fetch — no restart required.
Selected settings can be overridden at runtime with CAPCAT_* environment variables:
# Network
export CAPCAT_CONNECT_TIMEOUT=15
export CAPCAT_READ_TIMEOUT=10
export CAPCAT_USER_AGENT="Custom User Agent"
export CAPCAT_MAX_RETRIES=5
export CAPCAT_RETRY_DELAY=2.0
# Processing
export CAPCAT_MAX_WORKERS=12
export CAPCAT_DOWNLOAD_IMAGES=true
export CAPCAT_DOWNLOAD_VIDEOS=false
export CAPCAT_DOWNLOAD_AUDIO=false
export CAPCAT_DOWNLOAD_DOCUMENTS=false
export CAPCAT_MAX_FILENAME_LENGTH=80
# Logging
export CAPCAT_LOG_LEVEL=DEBUG
Environment variables override vault config files but are overridden by CLI flags.
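The precedence rules can be sketched as a small lookup function. This is an illustration only, not Capcat's actual loader: `resolve_setting`, its arguments, and the mapping of a setting name to `CAPCAT_<NAME>` are assumptions made for the example.

```python
import os


def resolve_setting(name, cli_args, file_settings, default=None, environ=None):
    """Return the effective value for `name` (hypothetical helper).

    Order: CLI flag, then CAPCAT_* environment variable, then vault file.
    """
    environ = os.environ if environ is None else environ

    # 1. A CLI flag wins whenever the user actually passed it (None = not given).
    if cli_args.get(name) is not None:
        return cli_args[name]

    # 2. Environment variables: assumed to be the upper-cased name with a CAPCAT_ prefix.
    env_value = environ.get(f"CAPCAT_{name.upper()}")
    if env_value is not None:
        return env_value  # note: always a string, the caller must convert it

    # 3. Values loaded from the vault config files.
    if name in file_settings:
        return file_settings[name]

    return default
```

Environment values arrive as strings, so a real loader would also need to convert them to the type of the setting (for example `int` for `max_workers`).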
Command-line arguments override all other configuration sources:
# Override worker count
capcat bundle tech --count 10 --workers 12
# Download PDFs only (independent of other media)
capcat fetch hn --count 10 --pdfs
# Download all media (images, PDFs, video, audio, documents)
capcat fetch hn,bbc --count 15 --media
# Override output path
capcat single https://example.com/article --output /custom/path
# Enable file logging (all commands)
capcat --log-file capcat.log bundle tech --count 10
# Verbose console + file logging
capcat -V -L debug.log fetch hn --count 15
Controls HTTP requests and network behavior.
network:
  connect_timeout: 10       # Connection timeout (seconds)
  read_timeout: 8           # Read timeout (seconds)
  user_agent: "Mozilla/5.0 (compatible; Capcat/2.0)"
  max_retries: 3            # Maximum retry attempts
  retry_delay: 1.0          # Delay between retries (seconds)
  pool_connections: 20      # Connection pool size
  pool_maxsize: 20          # Maximum pool size

- connect_timeout: Maximum time to wait for connection establishment
- read_timeout: Maximum time to wait for response data
- user_agent: User-Agent header for HTTP requests
- max_retries: Number of retry attempts for failed requests
- retry_delay: Base delay between retry attempts
- pool_connections: Number of connection pools to cache
- pool_maxsize: Maximum number of connections to save in pool

Controls article processing and download behavior.
processing:
  max_workers: 8                  # Parallel processing workers
  download_images: true           # Download and embed images (default: true)
  download_videos: false          # Download video files (requires --media)
  download_audio: false           # Download audio files (requires --media)
  download_documents: false       # Download non-PDF documents (requires --media)
  create_comments_file: true      # Save comments alongside articles
  max_filename_length: 100        # Maximum characters in vault filenames
  min_image_dimensions: 150       # Skip images smaller than this (px)
  max_image_size_bytes: 5242880   # Skip images larger than this (5 MB)
  markdown_line_breaks: true      # Convert <br> to hard line breaks (\)
  conversion_timeout: 30          # HTML-to-Markdown conversion timeout (seconds)
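The two image filters configured above, `min_image_dimensions` and `max_image_size_bytes`, could be applied roughly as follows. This is a minimal sketch under assumptions: `should_download_image` is a hypothetical helper, and letting images through when no Content-Length is reported is a choice made for the example, not documented Capcat behaviour.

```python
def should_download_image(width, height, content_length,
                          min_dimension=150, max_size_bytes=5_242_880):
    """Return True if an image passes both size filters (hypothetical helper)."""
    # Size check uses the Content-Length header; an unknown size (None)
    # is allowed through in this sketch.
    if content_length is not None and content_length > max_size_bytes:
        return False

    # Reject the image when either side is below the minimum, in pixels.
    if width < min_dimension or height < min_dimension:
        return False

    return True
```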
- max_workers: Number of parallel ThreadPoolExecutor workers
- download_images: Download and embed images locally (on by default)
- download_videos: Download video files (requires --media flag)
- download_audio: Download audio files (requires --media flag)
- download_documents: Download non-PDF documents (requires --media flag)
- create_comments_file: Fetch and save comments alongside each article
- max_filename_length: Truncate vault folder and file names to this length
- min_image_dimensions: Reject images whose width or height is below this value in pixels
- max_image_size_bytes: Reject images larger than this in bytes (checked via Content-Length before download)
- markdown_line_breaks: When true, <br> becomes a hard break (\); when false, a plain newline
- conversion_timeout: Timeout for HTML-to-Markdown conversion per article

PDF downloads are controlled independently of other media via the pdf: section and the --pdfs flag.
pdf:
  max_pdf_size_bytes: 20971520    # Skip PDFs larger than 20 MB
  max_pdf_per_article: 10         # Maximum PDFs queued per article

- max_pdf_size_bytes: PDFs whose Content-Length exceeds this value are skipped entirely
- max_pdf_per_article: Cap on the number of PDF links processed per article
- --pdfs: Download PDF links in articles (independent of --media)
- --no-pdfs: Explicitly suppress PDF downloads even when source defaults enable them
- --media: Download all media including PDFs (superset of --pdfs)

Individual source configs can set their own PDF behaviour via a media: block:
media:
  download_pdfs: true     # Enable PDFs for this source by default
  max_pdf_size_mb: 10     # Source-level size cap (overrides global)
Resolution order (highest to lowest priority): CLI flag → TUI prompt answer → per-source media: → global media.download_pdfs.
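One way to picture this resolution order is "first explicit answer wins". The sketch below assumes each layer is represented as `True`, `False`, or `None` (not set); `resolve_download_pdfs` and its parameter names are hypothetical, not Capcat functions.

```python
def resolve_download_pdfs(cli_flag=None, tui_answer=None,
                          source_default=None, global_default=False):
    """Return the first explicitly set value, highest priority first."""
    # --pdfs maps to True and --no-pdfs to False; None means no flag was given.
    for value in (cli_flag, tui_answer, source_default):
        if value is not None:
            return value
    # Nothing more specific was set, so fall back to the global setting.
    return global_default
```

For example, passing `cli_flag=False` (the `--no-pdfs` case) suppresses PDF downloads even when `source_default=True`.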
Controls logging behavior and output.
logging:
  console_level: INFO     # Console log level (DEBUG, INFO, WARNING, ERROR)
  file_level: DEBUG       # Log file level when --log-file is used

- DEBUG: Detailed debugging information
- INFO: General information (default)
- WARNING: Warning messages
- ERROR: Error messages only
- CRITICAL: Critical errors only

File logging is controlled via CLI flags, not configuration files:
# Enable file logging with --log-file or -L flag
capcat --log-file capcat.log bundle tech --count 10
# Verbose console + file logging
capcat -V -L debug.log fetch hn --count 15
# Timestamped log files
capcat -L logs/news-$(date +%Y%m%d-%H%M%S).log bundle news --count 10
Output is always written into the vault. The folder structure is fixed and date-based:
Use max_filename_length in processing: to control folder and file name length.
Simple sources use YAML configuration files in Config/sources/active/config_driven/configs/.
# Config/sources/active/config_driven/configs/example.yaml
display_name: "Example News"
base_url: "https://example.com/news/"
category: "general"  # tech, science, business, general
timeout: 10.0
rate_limit: 1.0      # Minimum seconds between requests
# Required: Article discovery
article_selectors:
  - ".headline a"
  - ".article-title a"
  - "h2.title a"
# Required: Content extraction
content_selectors:
  - ".article-content"
  - ".post-body"
  - "div.content"
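The `rate_limit` value is described as a minimum number of seconds between requests. A minimal sketch of that idea is shown below; `RateLimiter` is a hypothetical class written for illustration, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests (illustrative)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A source would call `limiter.wait()` immediately before each HTTP request, with `min_interval` taken from `rate_limit`.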
# Advanced config-driven source
display_name: "Advanced News"
base_url: "https://advanced.com/"
category: "tech"
timeout: 15.0
rate_limit: 2.0
supports_comments: false
article_selectors:
  - ".headline a"
  - ".story-link"
content_selectors:
  - ".article-content"
  - ".story-body"
# Skip unwanted URLs
skip_patterns:
  - "/about"
  - "/contact"
  - "/advertising"
  - "?utm_"
  - "/sponsored"
# Custom headers
custom_config:
  headers:
    Accept: "text/html,application/xhtml+xml"
    Accept-Language: "en-US,en;q=0.5"
  user_agent: "Custom Bot 1.0"
# Custom selectors for metadata
meta_selectors:
  author: ".byline .author"
  date: ".publish-date"
  tags: ".article-tags a"
# Content cleaning
remove_selectors:
  - ".advertisement"
  - ".related-links"
  - ".social-share"
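The `skip_patterns` entries in the example above look like plain URL fragments. Assuming they are matched as substrings (an assumption; the actual matching rules are not stated here), discovered links could be filtered like this. `filter_urls` is a hypothetical helper.

```python
def filter_urls(urls, skip_patterns):
    """Drop every URL that contains one of the skip patterns as a substring."""
    return [
        url for url in urls
        if not any(pattern in url for pattern in skip_patterns)
    ]


# Example with the patterns from the advanced config above.
patterns = ["/about", "/contact", "/advertising", "?utm_", "/sponsored"]
candidates = [
    "https://advanced.com/story/1",
    "https://advanced.com/about",
    "https://advanced.com/story/2?utm_source=feed",
    "https://advanced.com/sponsored/deal",
]
kept = filter_urls(candidates, patterns)
```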
Complex sources use Python implementations with YAML configuration.
# Config/sources/active/custom/example/config.yaml
display_name: "Example Custom"
base_url: "https://example.com/"
category: "tech"
timeout: 10.0
rate_limit: 1.0
supports_comments: true
# Custom source-specific configuration
custom_config:
  api_endpoint: "/api/v1/articles"
  api_key: "${EXAMPLE_API_KEY}"  # Environment variable
  max_pages: 5
  items_per_page: 50
  # Authentication
  auth_type: "bearer"  # bearer, basic, api_key
  auth_header: "Authorization"
  # Rate limiting
  requests_per_minute: 60
  burst_limit: 10
  # Content processing
  extract_metadata: true
  process_images: true
  follow_redirects: true
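The `${EXAMPLE_API_KEY}` placeholder implies that environment variables are substituted into the config at load time. How Capcat does this is not shown here; the sketch below is one possible approach, with `expand_env` as a hypothetical helper that leaves unknown placeholders untouched.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, environ=None):
    """Recursively replace ${NAME} placeholders in strings, dicts, and lists."""
    environ = os.environ if environ is None else environ
    if isinstance(value, str):
        # Keep the placeholder as-is when the variable is not set.
        return _PLACEHOLDER.sub(
            lambda match: environ.get(match.group(1), match.group(0)), value
        )
    if isinstance(value, dict):
        return {key: expand_env(item, environ) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, environ) for item in value]
    return value
```

Keeping the key in the environment rather than in `config.yaml` means the file can be committed or shared without exposing the secret.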
# Config/sources/active/custom/example/source.py
# Import path is an assumption; adjust it to match your Capcat installation.
from core.source_system.base_source import BaseSource, SourceConfig


class ExampleSource(BaseSource):
    def __init__(self, config: SourceConfig, session=None):
        super().__init__(config, session)
        # Access custom configuration
        self.api_key = config.custom_config.get('api_key')
        self.api_endpoint = config.custom_config.get('api_endpoint')
        self.max_pages = config.custom_config.get('max_pages', 5)

    def _get_headers(self):
        headers = super()._get_headers()
        # Add API authentication
        if self.api_key:
            headers['Authorization'] = f'Bearer {self.api_key}'
        return headers
Capcat automatically validates configurations during source discovery:
# Check configuration validity
from core.source_system.source_registry import get_source_registry

registry = get_source_registry()
errors = registry.validate_all_sources(deep_validation=True)
for source_name, error_list in errors.items():
    if error_list:
        print(f"{source_name}: {', '.join(error_list)}")
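To give a sense of what such validation might check, here is a standalone sketch that works on a plain dict. It does not use Capcat's validation engine, and the rules are assumptions made for illustration: the two selector lists are marked required in the examples above, while treating `display_name` and `base_url` as required is a guess.

```python
# Fields treated as mandatory in this sketch (partly assumed, see lead-in).
REQUIRED_FIELDS = ("display_name", "base_url", "article_selectors", "content_selectors")


def validate_source_config(config):
    """Return a list of problems; an empty list means no issue was found."""
    errors = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not config.get(field)
    ]

    base_url = config.get("base_url", "")
    if base_url and not base_url.startswith(("http://", "https://")):
        errors.append("base_url must start with http:// or https://")

    # timeout and rate_limit are optional, but must be positive numbers when set.
    for key in ("timeout", "rate_limit"):
        value = config.get(key)
        if value is not None and (not isinstance(value, (int, float)) or value <= 0):
            errors.append(f"{key} must be a positive number")

    return errors
```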
# Validate all sources
python -c "
from core.source_system.validation_engine import ValidationEngine
from core.source_system.source_registry import get_source_registry
registry = get_source_registry()
engine = ValidationEngine()
configs = registry.discover_sources()
results = engine.validate_all_sources(configs, deep_validation=True)
report = engine.generate_validation_report(results)
print(report)
"
# Config/Global-settings.yaml
network:
  crawl_delay: 0.1
processing:
  max_workers: 8
  download_images: true
logging:
  console_level: "WARNING"  # Suppress info noise in cron

processing:
  max_workers: 4  # Slower but stable
  download_images: true
pdf:
  max_pdf_size_bytes: 52428800  # Allow up to 50 MB
  max_pdf_per_article: 20
logging:
  console_level: "INFO"

processing:
  download_images: false
  download_videos: false
  download_audio: false
  download_documents: false
  create_comments_file: false
After a Capcat upgrade, new settings may have been added to the template. To add them without losing your edits, compare with the latest template:
# Generate fresh template to a temp file and diff
capcat settings > /tmp/latest-settings.yaml
diff Config/Global-settings.yaml /tmp/latest-settings.yaml
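If the raw diff is noisy, the comparison can be narrowed to "which keys exist in the new template but not in my file". The sketch below works on two already-parsed settings dicts (for example, loaded with a YAML parser of your choice); `missing_keys` is a hypothetical helper, not a Capcat command.

```python
def missing_keys(current, template, prefix=""):
    """Return dotted paths present in `template` but absent from `current`."""
    missing = []
    for key, value in template.items():
        path = f"{prefix}{key}"
        if key not in current:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(current[key], dict):
            # Both sides have this section, so compare the nested keys.
            missing.extend(missing_keys(current[key], value, prefix=f"{path}."))
    return missing
```

A whole section that is missing is reported once by its name, so a result such as `["network.read_timeout", "pdf"]` means one key and one entire section need to be copied over.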
Or overwrite completely (back up your edits first):
cp Config/Global-settings.yaml Config/Global-settings.yaml.bak
capcat settings --force
For source-specific configuration details, see the Source Development Guide.