Comprehensive guide for configuring Capcat's hybrid architecture system and individual sources.
Capcat uses a hierarchical configuration system with the following precedence (highest to lowest):
1. Command-line arguments
2. Environment variables (CAPCAT_ prefix)
3. Configuration files (capcat.yml, capcat.json)

Create configuration files in the application root directory:
# capcat.yml
network:
  connect_timeout: 10
  read_timeout: 8
  user_agent: "Mozilla/5.0 (compatible; Capcat/2.0)"
  max_retries: 3
  retry_delay: 1.0
processing:
  max_workers: 8
  download_images: true
  download_videos: false
  download_audio: false
  download_documents: false
logging:
  default_level: "INFO"
  use_colors: true
output:
  base_path: "../"
  date_format: "%d-%m-%Y"
  create_date_folders: true
{
  "network": {
    "connect_timeout": 10,
    "read_timeout": 8,
    "user_agent": "Mozilla/5.0 (compatible; Capcat/2.0)",
    "max_retries": 3,
    "retry_delay": 1.0
  },
  "processing": {
    "max_workers": 8,
    "download_images": true,
    "download_videos": false,
    "download_audio": false,
    "download_documents": false
  },
  "logging": {
    "default_level": "INFO",
    "use_colors": true
  },
  "output": {
    "base_path": "../",
    "date_format": "%d-%m-%Y",
    "create_date_folders": true
  }
}
All configuration options can be overridden with environment variables using the CAPCAT_ prefix:
# Network configuration
export CAPCAT_NETWORK_CONNECT_TIMEOUT=15
export CAPCAT_NETWORK_READ_TIMEOUT=10
export CAPCAT_NETWORK_USER_AGENT="Custom User Agent"
# Processing configuration
export CAPCAT_PROCESSING_MAX_WORKERS=12
export CAPCAT_PROCESSING_DOWNLOAD_VIDEOS=true
# Logging configuration
export CAPCAT_LOGGING_DEFAULT_LEVEL=DEBUG
export CAPCAT_LOGGING_USE_COLORS=false
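As a rough sketch of how such variable names could map onto the nested configuration, the snippet below splits each CAPCAT_<SECTION>_<KEY> name at the first underscore after the prefix and converts the value to a bool, int, or float where possible. The helper names (`coerce`, `env_overrides`), the list of known sections, and the conversion rules are assumptions for illustration, not Capcat's actual loader.

```python
import os

PREFIX = "CAPCAT_"
# Sections taken from the configuration file examples; anything else is ignored.
KNOWN_SECTIONS = {"network", "processing", "logging", "output"}


def coerce(value: str):
    """Best-effort conversion of an environment string to bool, int, or float."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def env_overrides(environ=None) -> dict:
    """Collect CAPCAT_<SECTION>_<KEY> variables into a nested dict (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    overrides: dict = {}
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        # Keys may contain underscores themselves, so split only once.
        section, _, key = name[len(PREFIX):].partition("_")
        section = section.lower()
        if section not in KNOWN_SECTIONS or not key:
            continue
        overrides.setdefault(section, {})[key.lower()] = coerce(raw)
    return overrides


example_env = {
    "CAPCAT_NETWORK_CONNECT_TIMEOUT": "15",
    "CAPCAT_PROCESSING_DOWNLOAD_VIDEOS": "true",
    "CAPCAT_LOGGING_DEFAULT_LEVEL": "DEBUG",
    "HOME": "/home/user",
}
print(env_overrides(example_env))
```

Passing a plain dict instead of reading `os.environ` directly keeps the sketch easy to test.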
Command-line arguments override all other configuration sources:
# Override worker count
./capcat bundle tech --count 10 --workers 12
# Override media downloading
./capcat fetch hn,bbc --count 15 --media
# Override output path
./capcat single https://example.com/article --output /custom/path
# Enable file logging (all commands)
./capcat --log-file capcat.log bundle tech --count 10
# Verbose console + file logging
./capcat -V -L debug.log fetch hn --count 15
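The precedence rules can be pictured as layers merged from lowest to highest priority, so that later layers win key by key. The following is a minimal sketch under that assumption; `deep_merge` and `resolve_config` are hypothetical names, and the three sample layers simply mirror values used in the examples above.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(*layers: dict) -> dict:
    """Merge layers given from lowest to highest precedence."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


file_layer = {"processing": {"max_workers": 8, "download_videos": False}}
env_layer = {"processing": {"download_videos": True}}  # e.g. CAPCAT_PROCESSING_DOWNLOAD_VIDEOS=true
cli_layer = {"processing": {"max_workers": 12}}        # e.g. --workers 12

config = resolve_config(file_layer, env_layer, cli_layer)
print(config)
```

Merging into copies leaves each input layer untouched, which makes it easier to report where a final value came from.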
Controls HTTP requests and network behavior.
network:
  connect_timeout: 10    # Connection timeout (seconds)
  read_timeout: 8        # Read timeout (seconds)
  user_agent: "Mozilla/5.0 (compatible; Capcat/2.0)"
  max_retries: 3         # Maximum retry attempts
  retry_delay: 1.0       # Delay between retries (seconds)
  pool_connections: 20   # Connection pool size
  pool_maxsize: 20       # Maximum pool size
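The retry settings above can be read as a simple loop. The sketch below is one possible interpretation, assuming the wait grows linearly (retry_delay multiplied by the attempt number). The guide only describes retry_delay as a base delay, so the actual schedule in Capcat may differ. `fetch_with_retries` and its `sleep` parameter are hypothetical names.

```python
import time


def fetch_with_retries(fetch, max_retries=3, retry_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a network-style error, wait and retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            attempt += 1
            if attempt > max_retries:
                raise
            # Assumed schedule: base delay times the attempt number.
            sleep(retry_delay * attempt)


# Demo: a fake fetch that fails twice, with sleep replaced by a recorder
# so the example runs instantly.
calls = []
delays = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = fetch_with_retries(flaky_fetch, max_retries=3, retry_delay=1.0, sleep=delays.append)
print(result, delays)
```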
- connect_timeout: Maximum time to wait for connection establishment
- read_timeout: Maximum time to wait for response data
- user_agent: User-Agent header for HTTP requests
- max_retries: Number of retry attempts for failed requests
- retry_delay: Base delay between retry attempts
- pool_connections: Number of connection pools to cache
- pool_maxsize: Maximum number of connections to save in the pool

Controls article processing and download behavior.
processing:
  max_workers: 8              # Parallel processing workers
  download_images: true       # Download and embed images
  download_videos: false      # Download video files
  download_audio: false       # Download audio files
  download_documents: false   # Download PDF/document files
  skip_existing: true         # Skip existing articles
  content_timeout: 30         # Content fetching timeout
- max_workers: Number of parallel ThreadPoolExecutor workers
- download_images: Always download images (embedded in articles)
- download_videos: Download video files (requires --media flag)
- download_audio: Download audio files (requires --media flag)
- download_documents: Download PDF/document files (requires --media flag)
- skip_existing: Skip articles that already exist
- content_timeout: Timeout for content fetching operations

Controls logging behavior and output.
logging:
  default_level: "INFO"   # Default log level (DEBUG, INFO, WARNING, ERROR)
  use_colors: true        # Colored console output
- DEBUG: Detailed debugging information
- INFO: General information (default)
- WARNING: Warning messages
- ERROR: Error messages only
- CRITICAL: Critical errors only

File logging is controlled via CLI flags, not configuration files:
# Enable file logging with --log-file or -L flag
./capcat --log-file capcat.log bundle tech --count 10
# Verbose console + file logging
./capcat -V -L debug.log fetch hn --count 15
# Timestamped log files
./capcat -L logs/news-$(date +%Y%m%d-%H%M%S).log bundle news --count 10
Controls output directory structure and file naming.
output:
  base_path: "../"            # Base output directory
  date_format: "%d-%m-%Y"     # Date format for folders
  create_date_folders: true   # Create date-based folders
  sanitize_filenames: true    # Clean invalid filename characters
  max_filename_length: 100    # Maximum filename length
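As a rough illustration of how the output settings above might combine into a path, the sketch below formats a date folder with strftime, replaces characters that are invalid on common filesystems, and truncates long names. The helper names (`sanitize`, `article_dir`) and the exact set of replaced characters are assumptions, not Capcat's actual implementation.

```python
import re
from datetime import date
from pathlib import Path


def sanitize(name: str, max_length: int = 100) -> str:
    """Replace characters that are invalid on common filesystems, then truncate."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", name).strip(" .")
    return cleaned[:max_length]


def article_dir(title, base_path="../", date_format="%d-%m-%Y",
                create_date_folders=True, max_filename_length=100, today=None) -> Path:
    """Build the folder for one article from the output settings (hypothetical helper)."""
    today = today or date.today()
    root = Path(base_path)
    if create_date_folders:
        root = root / today.strftime(date_format)
    return root / sanitize(title, max_filename_length)


path = article_dir('What is "AI"? A 10/10 guide', today=date(2024, 3, 5))
print(path)
```

Passing `today` explicitly keeps the example deterministic; in normal use it would default to the current date.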
- base_path: Base directory for all output
- date_format: Python strftime format for date folders
- create_date_folders: Create date-based organization
- sanitize_filenames: Remove invalid filesystem characters
- max_filename_length: Truncate long filenames

Simple sources use YAML configuration files in sources/active/config_driven/configs/.
# sources/active/config_driven/configs/example.yaml
display_name: "Example News"
base_url: "https://example.com/news/"
category: "general" # tech, science, business, general
timeout: 10.0
rate_limit: 1.0 # Minimum seconds between requests
# Required: Article discovery
article_selectors:
  - ".headline a"
  - ".article-title a"
  - "h2.title a"
# Required: Content extraction
content_selectors:
  - ".article-content"
  - ".post-body"
  - "div.content"
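Since the two selector lists are marked as required, a quick pre-flight check on a loaded source configuration can catch mistakes early. The following is a minimal sketch that operates on a plain dict; `check_source_config` is a hypothetical helper and is not part of Capcat's validation engine.

```python
REQUIRED_SELECTOR_FIELDS = ("article_selectors", "content_selectors")


def check_source_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    errors = []
    for field in REQUIRED_SELECTOR_FIELDS:
        value = config.get(field)
        if not isinstance(value, list) or not value:
            errors.append(f"{field}: expected a non-empty list of CSS selectors")
        elif not all(isinstance(item, str) and item.strip() for item in value):
            errors.append(f"{field}: every selector must be a non-empty string")
    return errors


good = {
    "display_name": "Example News",
    "article_selectors": [".headline a", "h2.title a"],
    "content_selectors": [".article-content"],
}
bad = {"display_name": "Broken", "article_selectors": []}

print(check_source_config(good))
print(check_source_config(bad))
```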
# Advanced config-driven source
display_name: "Advanced News"
base_url: "https://advanced.com/"
category: "tech"
timeout: 15.0
rate_limit: 2.0
supports_comments: false
article_selectors:
  - ".headline a"
  - ".story-link"
content_selectors:
  - ".article-content"
  - ".story-body"
# Skip unwanted URLs
skip_patterns:
  - "/about"
  - "/contact"
  - "/advertising"
  - "?utm_"
  - "/sponsored"
# Custom headers
custom_config:
  headers:
    Accept: "text/html,application/xhtml+xml"
    Accept-Language: "en-US,en;q=0.5"
  user_agent: "Custom Bot 1.0"
  # Custom selectors for metadata
  meta_selectors:
    author: ".byline .author"
    date: ".publish-date"
    tags: ".article-tags a"
  # Content cleaning
  remove_selectors:
    - ".advertisement"
    - ".related-links"
    - ".social-share"
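To make the effect of skip_patterns concrete, the sketch below filters a list of discovered URLs. It assumes each pattern is matched as a plain substring of the URL; the guide does not state the matching rule, so treat that as an assumption. `should_skip` is a hypothetical helper name.

```python
def should_skip(url: str, skip_patterns: list) -> bool:
    """True if any pattern occurs as a substring of the URL (assumed matching rule)."""
    return any(pattern in url for pattern in skip_patterns)


skip_patterns = ["/about", "/contact", "/advertising", "?utm_", "/sponsored"]
discovered = [
    "https://advanced.com/story/new-chip",
    "https://advanced.com/about",
    "https://advanced.com/story/launch?utm_source=feed",
    "https://advanced.com/sponsored/deal",
]

kept = [url for url in discovered if not should_skip(url, skip_patterns)]
print(kept)
```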
Complex sources use Python implementations with YAML configuration.
# sources/active/custom/example/config.yaml
display_name: "Example Custom"
base_url: "https://example.com/"
category: "tech"
timeout: 10.0
rate_limit: 1.0
supports_comments: true
# Custom source-specific configuration
custom_config:
  api_endpoint: "/api/v1/articles"
  api_key: "${EXAMPLE_API_KEY}"   # Environment variable
  max_pages: 5
  items_per_page: 50
  # Authentication
  auth_type: "bearer"             # bearer, basic, api_key
  auth_header: "Authorization"
  # Rate limiting
  requests_per_minute: 60
  burst_limit: 10
  # Content processing
  extract_metadata: true
  process_images: true
  follow_redirects: true
# sources/active/custom/example/source.py
class ExampleSource(BaseSource):
    def __init__(self, config: SourceConfig, session=None):
        super().__init__(config, session)
        # Access custom configuration
        self.api_key = config.custom_config.get('api_key')
        self.api_endpoint = config.custom_config.get('api_endpoint')
        self.max_pages = config.custom_config.get('max_pages', 5)

    def _get_headers(self):
        headers = super()._get_headers()
        # Add API authentication
        if self.api_key:
            headers['Authorization'] = f'Bearer {self.api_key}'
        return headers
Capcat automatically validates configurations during source discovery:
# Check configuration validity
from core.source_system.source_registry import get_source_registry
registry = get_source_registry()
errors = registry.validate_all_sources(deep_validation=True)
for source_name, error_list in errors.items():
    if error_list:
        print(f"{source_name}: {', '.join(error_list)}")
# Validate all sources
python -c "
from core.source_system.validation_engine import ValidationEngine
from core.source_system.source_registry import get_source_registry
registry = get_source_registry()
engine = ValidationEngine()
configs = registry.discover_sources()
results = engine.validate_all_sources(configs, deep_validation=True)
report = engine.generate_validation_report(results)
print(report)
"
# capcat-dev.yml
network:
  connect_timeout: 5
  read_timeout: 5
  max_retries: 1
processing:
  max_workers: 4
  download_videos: false
logging:
  default_level: "DEBUG"
  use_colors: true
  log_to_file: true

# capcat-prod.yml
network:
  connect_timeout: 15
  read_timeout: 10
  max_retries: 3
  retry_delay: 2.0
processing:
  max_workers: 16
  download_images: true
  download_videos: true
logging:
  default_level: "INFO"
  use_colors: false
  log_to_file: true
  format: "json"

# capcat-test.yml
network:
  connect_timeout: 3
  read_timeout: 3
  max_retries: 0
processing:
  max_workers: 2
  download_images: false
  download_videos: false
logging:
  default_level: "WARNING"
  use_colors: false
Use environment variables for sensitive data:
# Configuration file
custom_config:
  api_key: "${NEWS_API_KEY}"
  secret_token: "${SECRET_TOKEN}"
  database_url: "${DATABASE_URL}"
# Environment variables
export NEWS_API_KEY="your-api-key-here"
export SECRET_TOKEN="your-secret-token"
export DATABASE_URL="postgresql://user:pass@localhost/db"
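One way the ${NAME} placeholders could be resolved after the file is parsed is a recursive substitution over the loaded structure. The sketch below assumes that behavior and fails loudly when a variable is missing; `expand_placeholders` is a hypothetical helper, and Capcat's own handling of unset variables may differ.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, environ=None):
    """Recursively replace ${NAME} in strings; raise KeyError if NAME is unset."""
    environ = os.environ if environ is None else environ
    if isinstance(value, dict):
        return {key: expand_placeholders(item, environ) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(item, environ) for item in value]
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in environ:
                raise KeyError(f"environment variable {name} is not set")
            return environ[name]
        return _PLACEHOLDER.sub(substitute, value)
    return value  # numbers, booleans, None pass through unchanged


loaded = {"custom_config": {"api_key": "${NEWS_API_KEY}", "max_pages": 5}}
resolved = expand_placeholders(loaded, {"NEWS_API_KEY": "your-api-key-here"})
print(resolved)
```

Raising on a missing variable avoids silently sending an empty API key.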
Use realistic user agents to avoid being blocked:
network:
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
Configure appropriate rate limits to respect server resources:
# Global rate limiting
network:
  retry_delay: 2.0
# Per-source rate limiting
rate_limit: 1.5 # Minimum 1.5 seconds between requests
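A per-source rate_limit of this kind amounts to enforcing a minimum interval between consecutive requests. The sketch below shows one simple way to do that; `MinIntervalLimiter` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only so the demo runs without real waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure at least rate_limit seconds pass between successive wait() calls."""

    def __init__(self, rate_limit: float, clock=time.monotonic, sleep=time.sleep):
        self.rate_limit = rate_limit
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.rate_limit - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Demo with a fake clock so the example finishes instantly.
fake_now = [0.0]
limiter = MinIntervalLimiter(
    1.5,
    clock=lambda: fake_now[0],
    sleep=lambda seconds: fake_now.__setitem__(0, fake_now[0] + seconds),
)
waits = [limiter.wait(), limiter.wait()]
print(waits)
```

In real use the defaults (`time.monotonic` and `time.sleep`) apply, and `wait()` would be called immediately before each request to the source.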
network:
  connect_timeout: 8
  read_timeout: 6
  pool_connections: 50
  pool_maxsize: 50
processing:
  max_workers: 16   # Adjust based on CPU cores
  skip_existing: true
  content_timeout: 20
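Since max_workers is described as the number of ThreadPoolExecutor workers, the sketch below shows how a worker count might be chosen from the CPU count and then used to process items in parallel. The heuristic (twice the core count, capped at 16) and the names `pick_worker_count` and `process_all` are assumptions for illustration only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(configured=None, cap=16) -> int:
    """Use the configured value if given, else a CPU-based guess (assumed heuristic)."""
    if configured:
        return max(1, configured)
    return max(1, min(cap, (os.cpu_count() or 1) * 2))


def process_all(items, handler, max_workers):
    """Run handler over items in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handler, items))


workers = pick_worker_count()
titles = process_all(["first", "second", "third"], str.upper, workers)
print(workers, titles)
```

Threads suit this workload because fetching articles is mostly I/O-bound, so a count above the core number is often reasonable.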
processing:
  max_workers: 4
  download_videos: false
  download_audio: false
  download_documents: false
logging:
  log_to_file: false   # Reduce memory usage
# Optimized for news aggregation
processing:
  max_workers: 12
  download_images: true
  download_videos: false
sources:
  priority:
    - "bbc"
    - "cnn"
    - "reuters"
bundles:
  daily_news:
    - "bbc"
    - "cnn"
    - "aljazeera"
count: 50
# Optimized for research/archival
processing:
  download_images: true
  download_videos: true
  download_documents: true
  skip_existing: false
logging:
  default_level: "DEBUG"
  log_to_file: true
output:
  create_date_folders: true
  sanitize_filenames: true
# Specify configuration file
export CAPCAT_CONFIG_FILE="config/production.yml"
./capcat bundle tech --count 10
# Multiple configuration files (merged in order)
export CAPCAT_CONFIG_FILES="config/base.yml,config/production.yml"
#!/usr/bin/env python3
"""Configuration validation script."""
import yaml
from pathlib import Path
from core.config import load_config, validate_config


def validate_config_file(config_path):
    """Validate a configuration file."""
    try:
        config = load_config(config_path)
        errors = validate_config(config)
        if errors:
            print(f"Configuration errors in {config_path}:")
            for error in errors:
                print(f" - {error}")
        else:
            print(f"Configuration valid: {config_path}")
    except Exception as e:
        print(f"Failed to load {config_path}: {e}")


if __name__ == "__main__":
    for config_file in ["capcat.yml", "capcat.json"]:
        if Path(config_file).exists():
            validate_config_file(config_file)
This configuration guide covers all aspects of Capcat configuration management. For source-specific configuration details, see the Source Development Guide.