Configuration Guide

Comprehensive guide for configuring Capcat's hybrid architecture system and individual sources.

Configuration Hierarchy

Capcat uses a hierarchical configuration system with the following precedence (highest to lowest):

  1. Command-line flags (highest priority, e.g. --pdfs, --media, --count)
  2. Environment variables (CAPCAT_*)
  3. Vault config files (Config/Global-settings.yaml, capcat.yml)
  4. Built-in defaults (lowest priority)
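
The precedence list above can be pictured as a layered lookup in which the first layer that defines a key wins. The following is a minimal sketch using Python's standard collections.ChainMap; the layer dictionaries and their keys are hypothetical stand-ins, not Capcat's actual internals.

```python
from collections import ChainMap

# Hypothetical layers; keys and values are illustrative only.
defaults = {"max_workers": 8, "log_level": "INFO", "count": 5}
vault_config = {"max_workers": 4}      # e.g. from Config/Global-settings.yaml
env_vars = {"log_level": "DEBUG"}      # e.g. from CAPCAT_LOG_LEVEL
cli_flags = {"count": 10}              # e.g. from --count 10

# ChainMap searches its maps left to right, so the leftmost map that
# contains a key supplies the value.
settings = ChainMap(cli_flags, env_vars, vault_config, defaults)

print(settings["count"])        # supplied by cli_flags
print(settings["log_level"])    # supplied by env_vars
print(settings["max_workers"])  # supplied by vault_config
```

A setting that no higher layer defines falls through to the built-in defaults.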

Vault Configuration Files

All file-based configuration lives inside your vault directory, the project folder where Capcat stores your archives. There is no application-level config file.

capcat.yml — Source Selection

Located at the vault root. Controls which sources are active and how many articles each fetches per run:

# <vault>/capcat.yml
sources:
  - name: hn
    article_count: 10
  - name: bbc
    article_count: 5
  - name: guardian
    article_count: 5
bundles: {}

Config/Global-settings.yaml — All Other Settings

Located at <vault>/Config/Global-settings.yaml. Controls network behaviour, processing options, PDF downloads, UI, and logging. Generate or reset it with:

cd <vault>
capcat settings --force

Edit the generated file directly. Settings take effect on the next fetch — no restart required.

Environment Variables

Selected settings can be overridden at runtime with CAPCAT_* environment variables:

# Network
export CAPCAT_CONNECT_TIMEOUT=15
export CAPCAT_READ_TIMEOUT=10
export CAPCAT_USER_AGENT="Custom User Agent"
export CAPCAT_MAX_RETRIES=5
export CAPCAT_RETRY_DELAY=2.0

# Processing
export CAPCAT_MAX_WORKERS=12
export CAPCAT_DOWNLOAD_IMAGES=true
export CAPCAT_DOWNLOAD_VIDEOS=false
export CAPCAT_DOWNLOAD_AUDIO=false
export CAPCAT_DOWNLOAD_DOCUMENTS=false
export CAPCAT_MAX_FILENAME_LENGTH=80

# Logging
export CAPCAT_LOG_LEVEL=DEBUG

Environment variables override vault config files but are overridden by CLI flags.
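
Environment variables are always strings, so a value such as CAPCAT_MAX_WORKERS=12 has to be converted before it can replace a numeric or boolean setting. The sketch below shows one way this could work; the helper name env_override and the accepted boolean spellings are assumptions, since this guide does not document how Capcat parses these values.

```python
import os

def env_override(name, default, environ=None):
    """Return CAPCAT_<NAME> coerced to the type of `default`, else `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(f"CAPCAT_{name.upper()}")
    if raw is None:
        return default
    if isinstance(default, bool):   # test bool first: bool is a subclass of int
        return raw.strip().lower() in ("1", "true", "yes", "on")  # assumed spellings
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw

# A fake environment keeps the example independent of the real shell.
fake_env = {
    "CAPCAT_MAX_WORKERS": "12",
    "CAPCAT_DOWNLOAD_VIDEOS": "false",
    "CAPCAT_RETRY_DELAY": "2.0",
}
workers = env_override("max_workers", 8, fake_env)
videos = env_override("download_videos", True, fake_env)
timeout = env_override("read_timeout", 8, fake_env)   # not set, default is kept
```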

Command-Line Overrides

Command-line arguments override all other configuration sources:

# Override worker count
capcat bundle tech --count 10 --workers 12

# Download PDFs only (independent of other media)
capcat fetch hn --count 10 --pdfs

# Download all media (images, PDFs, video, audio, documents)
capcat fetch hn,bbc --count 15 --media

# Override output path
capcat single https://example.com/article --output /custom/path

# Enable file logging (all commands)
capcat --log-file capcat.log bundle tech --count 10

# Verbose console + file logging
capcat -V -L debug.log fetch hn --count 15

Configuration Sections

Network Configuration

Controls HTTP requests and network behavior.

network:
  connect_timeout: 10         # Connection timeout (seconds)
  read_timeout: 8             # Read timeout (seconds)
  user_agent: "Mozilla/5.0 (compatible; Capcat/2.0)"
  max_retries: 3              # Maximum retry attempts
  retry_delay: 1.0            # Base delay between retries (seconds)
  pool_connections: 20        # Number of connection pools to cache
  pool_maxsize: 20            # Maximum connections kept per pool

Options:

  • connect_timeout: Maximum time to wait for connection establishment
  • read_timeout: Maximum time to wait for response data
  • user_agent: User-Agent header for HTTP requests
  • max_retries: Number of retry attempts for failed requests
  • retry_delay: Base delay between retry attempts
  • pool_connections: Number of connection pools to cache
  • pool_maxsize: Maximum number of connections to save in pool
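
To get a feel for how max_retries and retry_delay interact, the sketch below computes a retry schedule. It assumes the "base delay" doubles after each failed attempt (exponential backoff); this guide does not state the growth rule Capcat uses, so treat the factor and both function names as hypothetical.

```python
def retry_delays(max_retries, retry_delay, factor=2.0):
    """Seconds to wait before each retry, assuming exponential backoff."""
    return [retry_delay * factor ** attempt for attempt in range(max_retries)]

def rough_worst_case(connect_timeout, read_timeout, max_retries, retry_delay):
    """Rough time spent on one URL if every attempt times out.

    Simplification: each attempt is charged connect_timeout + read_timeout.
    """
    attempts = max_retries + 1   # the initial request plus the retries
    per_attempt = connect_timeout + read_timeout
    return attempts * per_attempt + sum(retry_delays(max_retries, retry_delay))

schedule = retry_delays(3, 1.0)               # defaults from the network: section
estimate = rough_worst_case(10, 8, 3, 1.0)
```

Under these assumptions the default settings give waits of 1, 2, and 4 seconds between attempts, which is useful when deciding whether a higher max_retries is affordable for slow sources.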

Processing Configuration

Controls article processing and download behavior.

processing:
  max_workers: 8              # Parallel processing workers
  download_images: true       # Download and embed images (default: true)
  download_videos: false      # Download video files (requires --media)
  download_audio: false       # Download audio files (requires --media)
  download_documents: false   # Download non-PDF documents (requires --media)
  create_comments_file: true  # Save comments alongside articles
  max_filename_length: 100    # Maximum characters in vault filenames
  min_image_dimensions: 150   # Skip images smaller than this (px)
  max_image_size_bytes: 5242880  # Skip images larger than this (5 MB)
  markdown_line_breaks: true  # Convert <br> to hard line breaks (\)
  conversion_timeout: 30      # HTML-to-Markdown conversion timeout (seconds)

Options:

  • max_workers: Number of parallel ThreadPoolExecutor workers
  • download_images: Download and embed images locally (on by default)
  • download_videos: Download video files (requires --media flag)
  • download_audio: Download audio files (requires --media flag)
  • download_documents: Download non-PDF documents (requires --media flag)
  • create_comments_file: Fetch and save comments alongside each article
  • max_filename_length: Truncate vault folder and file names to this length
  • min_image_dimensions: Reject images whose width or height is below this value in pixels
  • max_image_size_bytes: Reject images larger than this in bytes (checked via Content-Length before download)
  • markdown_line_breaks: When true, <br> becomes a hard break (\); when false, a plain newline
  • conversion_timeout: Timeout for HTML-to-Markdown conversion per article
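
The two image filters described above can be sketched as a single predicate. The function name keep_image is hypothetical, and two details are assumptions: values exactly at a limit are accepted, and an image with no Content-Length header is not rejected on size.

```python
def keep_image(width, height, content_length,
               min_image_dimensions=150, max_image_size_bytes=5_242_880):
    """Return True if an image passes both the size and the dimension check."""
    # Size check: uses the Content-Length value, when the server sent one.
    if content_length is not None and content_length > max_image_size_bytes:
        return False
    # Dimension check: reject if either side is below the minimum.
    if width < min_image_dimensions or height < min_image_dimensions:
        return False
    return True

ok = keep_image(800, 600, 200_000)            # ordinary article image
icon = keep_image(120, 600, 200_000)          # too narrow
huge = keep_image(800, 600, 6_000_000)        # larger than 5 MB
```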

PDF Configuration

PDF downloads are controlled independently of other media via the pdf: section and the --pdfs flag.

pdf:
  max_pdf_size_bytes: 20971520   # Skip PDFs larger than 20 MB
  max_pdf_per_article: 10        # Maximum PDFs queued per article

Options:

  • max_pdf_size_bytes: PDFs whose Content-Length exceeds this value are skipped entirely
  • max_pdf_per_article: Cap on the number of PDF links processed per article

CLI flags:

  • --pdfs: Download PDF links in articles (independent of --media)
  • --no-pdfs: Explicitly suppress PDF downloads even when source defaults enable them
  • --media: Download all media including PDFs (superset of --pdfs)

Per-source PDF defaults:

Individual source configs can set their own PDF behaviour via a media: block:

media:
  download_pdfs: true          # Enable PDFs for this source by default
  max_pdf_size_mb: 10          # Source-level size cap (overrides global)

Resolution order (highest to lowest priority): CLI flag → TUI prompt answer → per-source media: → global media.download_pdfs.
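
The resolution order can be read as "take the first level that expresses an opinion". Below is a minimal sketch of that rule; the function name, the use of None for "not set", and the global default of False are assumptions made for illustration.

```python
def resolve_download_pdfs(cli_flag=None, tui_answer=None,
                          source_default=None, global_default=False):
    """Return the highest-priority value that is not None."""
    for value in (cli_flag, tui_answer, source_default):
        if value is not None:
            return value
    return global_default

# --no-pdfs (cli_flag=False) wins over a source that enables PDFs.
suppressed = resolve_download_pdfs(cli_flag=False, source_default=True)
# With no flag and no prompt answer, the per-source default applies.
from_source = resolve_download_pdfs(source_default=True)
```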

Logging Configuration

Controls logging behavior and output.

logging:
  console_level: INFO        # Console log level (DEBUG, INFO, WARNING, ERROR)
  file_level: DEBUG          # Log file level when --log-file is used

Log Levels:

  • DEBUG: Detailed debugging information
  • INFO: General information (default)
  • WARNING: Warning messages
  • ERROR: Error messages only
  • CRITICAL: Critical errors only

File Logging:

File logging is controlled via CLI flags, not configuration files:

# Enable file logging with --log-file or -L flag
capcat --log-file capcat.log bundle tech --count 10

# Verbose console + file logging
capcat -V -L debug.log fetch hn --count 15

# Timestamped log files
capcat -L logs/news-$(date +%Y%m%d-%H%M%S).log bundle news --count 10

Log Output Formats:

  • Console: Colored output with log level indicators (user-friendly)
  • File: Timestamped entries with module names and full context (debugging)
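
The split between console_level and file_level maps naturally onto two handlers in Python's standard logging module. The sketch below is illustrative only: the logger name, the format strings, and the build_logger helper are assumptions, not Capcat's implementation, and the console output here is not colored.

```python
import logging

def build_logger(console_level="INFO", file_level="DEBUG", log_file=None):
    """Create a logger with a console handler and an optional file handler."""
    logger = logging.getLogger("capcat_sketch")   # hypothetical logger name
    logger.setLevel(logging.DEBUG)                 # handlers do the filtering
    logger.handlers.clear()                        # avoid duplicates on re-run

    console = logging.StreamHandler()
    console.setLevel(getattr(logging, console_level))
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(console)

    if log_file is not None:                       # mirrors --log-file / -L
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(getattr(logging, file_level))
        file_handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(file_handler)
    return logger

log = build_logger(console_level="WARNING")
log.info("not shown on the console at WARNING level")
```

Keeping the logger itself at DEBUG and filtering per handler lets the file receive more detail than the console.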

Output Structure

Output is always written into the vault. The folder structure is fixed and date-based:

News/News_DD-MM-YYYY/Source_DD-MM-YYYY/Article-Title/
├── article.md
├── article-Comments.md    # If comments enabled
├── images/                # Downloaded images
└── files/                 # PDFs and other media

Capcats/DD-MM-YYYY-Article-Title/    # Single-article output

Use max_filename_length in processing: to control folder and file name length.
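
As an illustration of the date-based layout and of how max_filename_length shortens folder names, the sketch below builds an article path. The article_dir helper and its title-to-folder rule (spaces become hyphens, then truncate) are assumptions; Capcat's actual sanitization is not described in this guide.

```python
from datetime import date
from pathlib import PurePosixPath

def article_dir(source, title, day, max_filename_length=100):
    """Build News/News_<date>/<source>_<date>/<title> with DD-MM-YYYY dates."""
    stamp = day.strftime("%d-%m-%Y")
    folder = title.replace(" ", "-")[:max_filename_length]   # assumed rule
    return PurePosixPath("News") / f"News_{stamp}" / f"{source}_{stamp}" / folder

path = article_dir("HN", "Example Article Title", date(2025, 1, 31))
print(path)
```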

Source Configuration

Config-Driven Sources

Simple sources use YAML configuration files in Config/sources/active/config_driven/configs/.

Basic Configuration

# Config/sources/active/config_driven/configs/example.yaml
display_name: "Example News"
base_url: "https://example.com/news/"
category: "general"          # tech, science, business, general
timeout: 10.0
rate_limit: 1.0             # Minimum seconds between requests

# Required: Article discovery
article_selectors:
  - ".headline a"
  - ".article-title a"
  - "h2.title a"

# Required: Content extraction
content_selectors:
  - ".article-content"
  - ".post-body"
  - "div.content"

Advanced Configuration

# Advanced config-driven source
display_name: "Advanced News"
base_url: "https://advanced.com/"
category: "tech"
timeout: 15.0
rate_limit: 2.0
supports_comments: false

article_selectors:
  - ".headline a"
  - ".story-link"

content_selectors:
  - ".article-content"
  - ".story-body"

# Skip unwanted URLs
skip_patterns:
  - "/about"
  - "/contact"
  - "/advertising"
  - "?utm_"
  - "/sponsored"

# Custom headers
custom_config:
  headers:
    Accept: "text/html,application/xhtml+xml"
    Accept-Language: "en-US,en;q=0.5"
  user_agent: "Custom Bot 1.0"

  # Custom selectors for metadata
  meta_selectors:
    author: ".byline .author"
    date: ".publish-date"
    tags: ".article-tags a"

  # Content cleaning
  remove_selectors:
    - ".advertisement"
    - ".related-links"
    - ".social-share"

Custom Sources

Complex sources use Python implementations with YAML configuration.

Configuration File

# Config/sources/active/custom/example/config.yaml
display_name: "Example Custom"
base_url: "https://example.com/"
category: "tech"
timeout: 10.0
rate_limit: 1.0
supports_comments: true

# Custom source-specific configuration
custom_config:
  api_endpoint: "/api/v1/articles"
  api_key: "${EXAMPLE_API_KEY}"  # Environment variable
  max_pages: 5
  items_per_page: 50

  # Authentication
  auth_type: "bearer"  # bearer, basic, api_key
  auth_header: "Authorization"

  # Rate limiting
  requests_per_minute: 60
  burst_limit: 10

  # Content processing
  extract_metadata: true
  process_images: true
  follow_redirects: true
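
The "${EXAMPLE_API_KEY}" value above is a placeholder that is filled from the environment, which keeps secrets out of the YAML file. A minimal sketch of such an expansion follows; expand_env is a hypothetical helper, and raising an error for an unset variable is a design assumption rather than documented Capcat behaviour.

```python
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value, environ):
    """Replace each ${NAME} in `value`; raise KeyError if NAME is not set."""
    def substitute(match):
        name = match.group(1)
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    return _PLACEHOLDER.sub(substitute, value)

# In real use, pass os.environ; a plain dict keeps the example self-contained.
api_key = expand_env("${EXAMPLE_API_KEY}", {"EXAMPLE_API_KEY": "secret-token"})
endpoint = expand_env("/api/v1/articles", {})   # no placeholder, unchanged
```

Failing loudly on a missing variable avoids sending a literal "${EXAMPLE_API_KEY}" string as a credential.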

Python Implementation

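# Config/sources/active/custom/example/source.py
# Note: BaseSource and SourceConfig must be imported from Capcat's source
# system; the import path is not shown in this guide.
class ExampleSource(BaseSource):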
    def __init__(self, config: SourceConfig, session=None):
        super().__init__(config, session)

        # Access custom configuration
        self.api_key = config.custom_config.get('api_key')
        self.api_endpoint = config.custom_config.get('api_endpoint')
        self.max_pages = config.custom_config.get('max_pages', 5)

    def _get_headers(self):
        headers = super()._get_headers()

        # Add API authentication
        if self.api_key:
            headers['Authorization'] = f'Bearer {self.api_key}'

        return headers

Configuration Validation

Automatic Validation

Capcat automatically validates configurations during source discovery:

# Check configuration validity
from core.source_system.source_registry import get_source_registry

registry = get_source_registry()
errors = registry.validate_all_sources(deep_validation=True)

for source_name, error_list in errors.items():
    if error_list:
        print(f"{source_name}: {', '.join(error_list)}")

Manual Validation

# Validate all sources
python -c "
from core.source_system.validation_engine import ValidationEngine
from core.source_system.source_registry import get_source_registry

registry = get_source_registry()
engine = ValidationEngine()

configs = registry.discover_sources()
results = engine.validate_all_sources(configs, deep_validation=True)
report = engine.generate_validation_report(results)
print(report)
"

Common Configuration Recipes

Fast daily fetch

# Config/Global-settings.yaml
network:
  crawl_delay: 0.1

processing:
  max_workers: 8
  download_images: true

logging:
  console_level: "WARNING"   # Suppress info noise in cron

Research archival with PDFs

processing:
  max_workers: 4             # Slower but stable
  download_images: true

pdf:
  max_pdf_size_bytes: 52428800   # Allow up to 50 MB
  max_pdf_per_article: 20

logging:
  console_level: "INFO"

Minimal disk usage

processing:
  download_images: false
  download_videos: false
  download_audio: false
  download_documents: false
  create_comments_file: false

Regenerating Global-settings.yaml

After a Capcat upgrade, new settings may have been added to the template. To pick them up without losing your edits, compare your file with the latest template and copy over any new keys:

# Generate fresh template to a temp file and diff
capcat settings > /tmp/latest-settings.yaml
diff Config/Global-settings.yaml /tmp/latest-settings.yaml
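
A textual diff also reports every value you changed on purpose. If only newly added keys matter, the comparison can be done on the parsed settings instead. The sketch below works on plain dictionaries standing in for the two YAML files once they have been loaded with a YAML parser; missing_keys is a hypothetical helper.

```python
def missing_keys(current, template, prefix=""):
    """List dotted paths that exist in `template` but not in `current`."""
    missing = []
    for key, value in template.items():
        path = f"{prefix}{key}"
        if key not in current:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(current[key], dict):
            # Both sides have this section: compare its keys recursively.
            missing.extend(missing_keys(current[key], value, prefix=f"{path}."))
    return missing

current = {"network": {"connect_timeout": 15}, "processing": {"max_workers": 4}}
template = {
    "network": {"connect_timeout": 10, "read_timeout": 8},
    "processing": {"max_workers": 8},
    "pdf": {"max_pdf_size_bytes": 20971520},
}
new_keys = missing_keys(current, template)
```

Changed values (such as connect_timeout here) are ignored; only keys absent from your file are reported.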

Or overwrite completely (back up your edits first):

cp Config/Global-settings.yaml Config/Global-settings.yaml.bak
capcat settings --force

For source-specific configuration details, see the Source Development Guide.