Architecture Overview

Capcat 2.0 implements a hybrid architecture that combines config-driven simplicity with custom implementation flexibility.

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Capcat 2.0 Architecture                   │
├─────────────────────────────────────────────────────────────┤
│  User Interface Layer                                       │
│  ├── Interactive Mode (catch)                               │
│  │   ├── Main menu (bundle/fetch/single/manage/exit)        │
│  │   └── Source management submenu                          │
│  └── CLI Interface (capcat.py)                              │
│      ├── bundle tech|news|science|aggregators               │
│      ├── fetch source1,source2,source3                      │
│      └── single https://example.com/article                 │
├─────────────────────────────────────────────────────────────┤
│  Core Processing Layer                                      │
│  ├── SourceRegistry (Discovery & Management)                │
│  ├── SourceFactory (Instantiation & Pooling)                │
│  ├── PerformanceMonitor (Metrics & Health)                  │
│  └── ValidationEngine (Quality Assurance)                   │
├─────────────────────────────────────────────────────────────┤
│  Hybrid Source Layer                                        │
│  ├── Config-Driven Sources (5 sources)                      │
│  │   └── YAML configurations → ConfigDrivenSource           │
│  └── Custom Sources (6 sources)                             │
│      └── Python implementations → BaseSource                │
├─────────────────────────────────────────────────────────────┤
│  Shared Infrastructure                                      │
│  ├── SessionPool (Network Optimization)                     │
│  ├── ArticleFetcher (Content Processing)                    │
│  ├── TemplateSystem (HTML Generation)                       │
│  │   ├── article-with-comments.html                         │
│  │   ├── article-no-comments.html                           │
│  │   └── comments-with-navigation.html                      │
│  ├── UnifiedMediaProcessor (Image/Media Embedding)          │
│  │   ├── MediaEmbeddingProcessor (Core Engine)              │
│  │   ├── ImageProcessor (Global Image Coordinator)          │
│  │   ├── WebsiteClassifier (Aggregator Protection)          │
│  │   ├── MediaConfigManager (Source Configs)                │
│  │   └── Source-specific media processing configs           │
│  ├── MediaDownloader (Images/Videos/Documents)              │
│  └── Formatter (HTML → Markdown)                            │
└─────────────────────────────────────────────────────────────┘

Hybrid Source Types

Config-Driven Sources (Simple)

Purpose

Simplified development for straightforward news sites

Count

5 sources

Examples

InfoQ, Straits Times, Gizmodo, IEEE, Scientific American

# Example: sources/active/config_driven/configs/iq.yaml
display_name: "InfoQ"
base_url: "https://www.infoq.com/news/"
category: tech
article_selectors:
  - .card__title a
  - .news-headline a
content_selectors:
  - .article__content
  - article.article

Benefits

No Python coding required
Rapid deployment
Easy maintenance through configuration updates
Automatic validation and testing

Custom Sources (Complex)

Purpose

Full flexibility for complex scraping scenarios

Count

6 sources

Examples

Hacker News, BBC, CNN, Nature, Lobsters, LessWrong

# Example: sources/active/custom/hn/source.py
class HackerNewsSource(BaseSource):
    def get_articles(self, count=30):
        # Custom logic for Hacker News API integration
        # Comment system handling
        # Anti-bot protection workarounds
        pass

Benefits

Full control over scraping logic
Comment system integration
Anti-bot protection handling
Dynamic content loading support

Core Components

1. SourceRegistry Pattern

Location

core/source_system/source_registry.py

Purpose

Auto-discovery and management of all sources

# Automatic source discovery
registry = get_source_registry()
sources = registry.discover_sources()  # Returns 20 sources

# Source instantiation with session pooling
source = registry.get_source('hn', session=global_session)

Features

Auto-discovery from sources/active/
Validation during discovery
Source type management (config vs custom)
Instance caching for performance
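
The discovery and caching features listed above might look roughly like the following sketch. It assumes the directory layout implied by the example paths on this page (YAML files under config_driven/configs/ and a source.py inside each directory under custom/). SimpleRegistry, its methods, and the placeholder instance it returns are hypothetical names for illustration, not Capcat's actual registry.

```python
from pathlib import Path


class SimpleRegistry:
    """Hypothetical registry: discovers sources on disk and caches instances."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self._types: dict[str, str] = {}      # source id -> "config" | "custom"
        self._instances: dict[str, object] = {}

    def discover_sources(self) -> list[str]:
        """Scan the sources directory and record each source's type."""
        self._types.clear()
        for cfg in sorted((self.root / "config_driven" / "configs").glob("*.yaml")):
            self._types[cfg.stem] = "config"
        for src in sorted((self.root / "custom").glob("*/source.py")):
            self._types[src.parent.name] = "custom"
        return sorted(self._types)

    def get_source(self, source_id: str, session=None) -> object:
        """Return a cached instance, creating it on first use."""
        if source_id not in self._types:
            raise KeyError(f"unknown source: {source_id}")
        if source_id not in self._instances:
            # Placeholder object; a real registry would build a source class here.
            self._instances[source_id] = {
                "id": source_id,
                "type": self._types[source_id],
                "session": session,
            }
        return self._instances[source_id]
```

Caching by source id means repeated lookups return the same object, which is one plausible reading of "instance caching" above.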

2. Factory Pattern Implementation

Location

core/source_system/source_factory.py

Purpose

Unified source creation with monitoring integration

# Factory creates appropriate source type
source = SourceFactory.create_source(config, session)

# Automatic performance monitoring integration
metrics = source.get_performance_metrics()

Features

Unified creation interface
Performance monitoring integration
Health checking capabilities
Session pool optimization
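
As an illustration of the unified creation interface, the sketch below dispatches on a source name: registered custom classes take precedence and everything else falls back to the config-driven type. The "name" config key, the register_custom method, and the stand-in classes are assumptions made for this example and are not taken from Capcat's source.

```python
class BaseSource:
    """Stand-in for the BaseSource contract named on this page."""

    def __init__(self, config: dict, session=None):
        self.config = config
        self.session = session


class ConfigDrivenSource(BaseSource):
    """Stand-in for the YAML-driven source type."""


class SketchFactory:
    """Hypothetical factory: one entry point for both source types."""

    _custom_classes: dict[str, type] = {}

    @classmethod
    def register_custom(cls, name: str, source_cls: type) -> None:
        """Associate a source name with a custom implementation class."""
        cls._custom_classes[name] = source_cls

    @classmethod
    def create_source(cls, config: dict, session=None) -> BaseSource:
        """Build the right source type and hand it the shared session."""
        # Custom implementations win; everything else is config-driven.
        source_cls = cls._custom_classes.get(config["name"], ConfigDrivenSource)
        return source_cls(config, session)
```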

3. Performance Monitoring System

Location

core/source_system/performance_monitor.py

Purpose

Real-time metrics and health tracking

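from dataclasses import dataclass

@dataclass
class SourceMetrics:
    source_name: str
    total_requests: int = 0
    successful_requests: int = 0
    avg_response_time: float = 0.0
    articles_discovered: int = 0

    @property
    def success_rate(self) -> float:
        # Avoid ZeroDivisionError before the first request is recorded
        if self.total_requests == 0:
            return 0.0
        return (self.successful_requests / self.total_requests) * 100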

Capabilities

Real-time performance tracking
Success rate monitoring
Response time analysis
Health status reporting
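
Response time analysis does not require storing every sample. A minimal sketch, assuming the monitor keeps a running mean per source, is shown below; RequestStats and record are hypothetical names, and the incremental-mean update is a standard technique rather than something confirmed from the Capcat code.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    """Hypothetical per-source counters with an incremental update."""

    total_requests: int = 0
    successful_requests: int = 0
    avg_response_time: float = 0.0

    def record(self, elapsed: float, ok: bool) -> None:
        """Fold one request into the counters and the running mean."""
        self.total_requests += 1
        if ok:
            self.successful_requests += 1
        # Incremental mean: new_avg = old_avg + (x - old_avg) / n
        self.avg_response_time += (elapsed - self.avg_response_time) / self.total_requests
```

After recording response times of 1.0 s and 3.0 s, avg_response_time is 2.0.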

4. Enhanced Validation Engine

Location

core/source_system/validation_engine.py

Purpose

Comprehensive source validation and quality assurance

Validation Types

Basic
Configuration syntax and format
Network
Connectivity and accessibility testing
Selectors
CSS selector effectiveness
Deep
Live content validation

# Comprehensive validation
results = validation_engine.validate_all_sources(configs, deep=True)
report = validation_engine.generate_validation_report(results)
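
To make the "Basic" validation level concrete, here is a small sketch that checks a parsed configuration for syntax and format problems. The set of required keys is an assumption taken from the example YAML on this page, and basic_validate is a hypothetical helper, not the validation engine's real interface.

```python
from urllib.parse import urlparse

# Assumed required keys, taken from the example YAML on this page.
REQUIRED_KEYS = ("display_name", "base_url", "article_selectors", "content_selectors")


def basic_validate(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    errors = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in config]

    url = urlparse(str(config.get("base_url", "")))
    if "base_url" in config and (url.scheme not in ("http", "https") or not url.netloc):
        errors.append("base_url must be an absolute http(s) URL")

    for key in ("article_selectors", "content_selectors"):
        value = config.get(key)
        if key in config and (not isinstance(value, list) or not value):
            errors.append(f"{key} must be a non-empty list")
    return errors
```

Returning all problems at once, rather than stopping at the first, keeps a validation report useful when a config has several mistakes.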

5. Interactive Mode System

Location

core/interactive.py

Purpose

User-friendly menu interface for all Capcat operations

Architecture

# Main interactive loop
start_interactive_mode()
├── Main Menu
│   ├── Bundle selection
│   ├── Multi-source fetch
│   ├── Single source fetch
│   ├── Single URL fetch
│   └── Source management submenu
└── Source Management Submenu
    ├── Add RSS source
    ├── Generate config
    ├── Remove sources
    ├── List sources
    └── Test source

Key Features

Questionary UI Framework
Terminal-based interactive menus
Logging Suppression
Clean display during navigation
Screen Management
Automatic clearing and formatting
CLI Integration
Constructs argument lists for run_app()
Error Handling
Graceful failures with return to menu
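
The CLI integration point, where a menu choice becomes an argument list for run_app(), could be sketched as follows. The subcommand names (bundle, fetch, single) come from the CLI layout shown in the system diagram; build_args itself is a hypothetical helper and the exact argument format run_app() expects is an assumption.

```python
def build_args(action: str, selection) -> list[str]:
    """Translate a menu choice into a CLI-style argument list (hypothetical helper)."""
    if action == "bundle":
        return ["bundle", selection]
    if action == "fetch":
        # Multiple sources are passed as one comma-separated argument.
        return ["fetch", ",".join(selection)]
    if action == "single":
        return ["single", selection]
    raise ValueError(f"unsupported action: {action}")
```

Building the same argument list the CLI would receive keeps a single execution path for both interfaces.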

Technology Stack

# UI Framework
from questionary import Style, select, checkbox, text, confirm

# Custom styling
custom_style = Style([
    ('selected', 'fg:#d75f00'),  # Orange theme
    ('pointer', 'fg:#d75f00 bold'),
])

# Screen control
print('\033[2J\033[H', end='')  # Clear screen

Integration Points

Uses cli.py functions for source lists and bundles
Calls run_app() from capcat.py for execution
Integrates with source registry for discovery
Uses source management services for add/remove operations

Benefits

Zero command memorization
Visual feedback
Error prevention through validation
Ideal for new users and daily operations
Complements CLI for advanced use cases

For detailed documentation, see Interactive Mode Guide.

Session Pooling Architecture

import requests
from requests.adapters import HTTPAdapter

# Global session pool for optimal performance
class SessionPool:
    def __init__(self):
        self.session = requests.Session()
        self.session.mount('http://', HTTPAdapter(pool_connections=20, pool_maxsize=20))
        self.session.mount('https://', HTTPAdapter(pool_connections=20, pool_maxsize=20))

# All sources share the same optimized session
global_session = get_global_session()

Benefits

Connection reuse across sources
Reduced latency
Improved throughput
Resource efficiency

Content Processing Architecture

Separated Article and Comment Workflows

Design Principle

Complete separation of concerns for better maintainability and error isolation.

# Clean separation pattern
class SourceImplementation:
    def fetch_article_content(self, url, ...):
        # Handles ONLY article content extraction and media
        pass

    def fetch_comments(self, comment_url, article_folder):
        # Handles ONLY comment extraction and formatting
        pass

Article Processing Pipeline

Article URL → Content Extraction → Media Processing → Markdown Conversion
                                ↓
                          Fallback Image Detection (if needed)
                                ↓
                          Final Article + Additional Images

Key Components

Primary Content Extraction
Uses source-specific or CSS selectors
Media Processing
Downloads images (always) + other media (with --media)
Fallback Image Detection
Automatic activation when < 2 images found
Markdown Generation
Clean HTML-to-Markdown conversion

Fallback Image Detection System

Smart Content Recovery

Handles websites without custom configurations by scanning entire pages for content images.

# Automatic activation logic
if primary_image_count < 2:
    logger.info("Activating fallback image detection")

    # Full page analysis
    additional_images = scan_full_page(
        original_html=full_page_html,
        existing_images=found_images,
        ui_filters=intelligent_filters
    )

Intelligent Filtering

UI Element Detection
Removes logos, navigation, ads, social icons
Size Filtering
Skips images < 150px dimensions
Pattern Matching
Filters by class, ID, alt text, and URL patterns
Duplicate Prevention
Avoids re-downloading existing images

Filter Categories

ui_patterns:
  class_patterns: [logo, icon, avatar, nav, menu, ad, banner, social, share]
  id_patterns: [logo, icon, nav, header, footer, sidebar]
  alt_patterns: [logo, icon, advertisement, navigation, social]
  src_patterns: [logo, icon, avatar, ad, pixel, tracker, beacon, analytics]
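
One way to apply these filter categories is sketched below. The pattern lists are copied from the configuration above; matching on whole tokens (so that "ad" does not match "uploads" or "header") is an assumption, since the page does not say how patterns are compared, and is_ui_image is a hypothetical helper.

```python
import re
from typing import Optional

UI_PATTERNS = {
    "class": ["logo", "icon", "avatar", "nav", "menu", "ad", "banner", "social", "share"],
    "id": ["logo", "icon", "nav", "header", "footer", "sidebar"],
    "alt": ["logo", "icon", "advertisement", "navigation", "social"],
    "src": ["logo", "icon", "avatar", "ad", "pixel", "tracker", "beacon", "analytics"],
}
MIN_DIMENSION = 150  # pixels, per the size filter described above


def _tokens(value: str) -> set:
    """Split an attribute value into lowercase alphanumeric tokens."""
    return set(re.split(r"[^a-z0-9]+", value.lower())) - {""}


def is_ui_image(attrs: dict, width: Optional[int] = None,
                height: Optional[int] = None) -> bool:
    """Return True if an <img> looks like UI chrome rather than article content."""
    for attr, patterns in UI_PATTERNS.items():
        if _tokens(attrs.get(attr, "")) & set(patterns):
            return True
    # Size filter: skip images known to be smaller than the threshold.
    return any(d is not None and d < MIN_DIMENSION for d in (width, height))
```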

Comment Processing Pipeline

Comment URL → Comment Extraction → Thread Building → Markdown Generation
                                                  ↓
                               User Privacy Protection (anonymization)
                                                  ↓
                                      comments.md file

Independent Processing Benefits

Error Isolation
Comment failures don't affect articles
Clean Interfaces
No parameter pollution
Source Flexibility
Each source handles comments differently
Performance
Parallel processing possible
Maintainability
Clear separation of concerns
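
The error-isolation benefit can be illustrated with a short sketch: the article is fetched first, and a failure in the comment step is logged without discarding it. The process_item function, the logger name, and the two-argument call to fetch_article_content are assumptions for this example (the original signature elides its remaining parameters).

```python
import logging

logger = logging.getLogger("capcat.sketch")  # hypothetical logger name


def process_item(source, url: str, comment_url: str, article_folder: str) -> dict:
    """Fetch an article, then its comments, isolating comment failures."""
    result = {"article": False, "comments": False}
    # Article errors propagate: without an article there is nothing to keep.
    source.fetch_article_content(url, article_folder)
    result["article"] = True
    try:
        source.fetch_comments(comment_url, article_folder)
        result["comments"] = True
    except Exception as exc:  # comment errors must not discard the article
        logger.warning("Comments failed for %s: %s", url, exc)
    return result
```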

Privacy Protection

# Automatic user anonymization
comment_data = {
    'author': 'Anonymous',  # Privacy-compliant
    'original_profile': original_profile_link,  # Reference preserved
    'text': cleaned_comment_text,
    'timestamp': comment_timestamp
}

Unified Media Processing Architecture

Overview

The Unified Media Processing System eliminates the "whack-a-mole" pattern of fixing image embedding issues source by source. Instead of debugging each source individually, the system provides a single, configurable solution that works across all sources.

Architecture Components

MediaEmbeddingProcessor

Location

core/media_embedding_processor.py

Purpose

Core engine for image extraction, downloading, and URL replacement

from core.unified_media_processor import UnifiedMediaProcessor

# Simple integration for any source
updated_content = UnifiedMediaProcessor.process_article_media(
    content=markdown_content,
    html_content=html_content,
    url=article_url,
    article_folder=article_folder,
    source_name=source_name,
    session=session
)

MediaConfigManager

Location

core/media_config.py

Purpose

Source-specific media processing configurations

# Predefined configurations for major sources
configs = {
    '': {
        'hero_image_selectors': ['.featured-image img', '.post-thumbnail img'],
        'url_patterns': {'wordpress': ['/wp-content/uploads/']},
        'quality_thresholds': {'min_width': 150, 'min_height': 150}
    },
    '': {
        'hero_image_selectors': ['.featured-image img', '.hero-image img'],
        'url_patterns': {'': ['i..com/', '.com/wp-content/']},
        'skip_patterns': ['advertisement', 'ad-', 'sponsored']
    }
}

Key Features

1. Source-Specific Configuration

Hero Image Detection
Custom selectors per source
URL Pattern Recognition
WordPress, CDN, and custom patterns
Quality Filtering
Size and file type thresholds
Skip Patterns
Advertisements and irrelevant images

2. URL Processing Strategies

Protocol-relative URLs
//domain.com/image.jpg → https://domain.com/image.jpg
WordPress URLs
/wp-content/uploads/2025/09/image.jpg → images/image.jpg
Relative URLs
/media/image.jpg → https://domain.com/media/image.jpg
Absolute URLs
Direct processing with domain validation
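
The URL strategies above map closely onto the standard library. The sketch below resolves protocol-relative and relative URLs with urljoin and derives the images/filename form from the URL path; to_absolute and to_local_path are hypothetical helper names, and the real processor may handle more cases (such as domain validation).

```python
import posixpath
from urllib.parse import urljoin, urlparse


def to_absolute(image_url: str, page_url: str) -> str:
    """Resolve protocol-relative and relative image URLs against the page URL."""
    return urljoin(page_url, image_url)


def to_local_path(image_url: str) -> str:
    """Map a remote image URL to the images/<filename> form used in articles."""
    # Only the path component is used, so query strings are dropped.
    filename = posixpath.basename(urlparse(image_url).path)
    return f"images/{filename}"
```

Note that urljoin gives a protocol-relative URL the scheme of the page it was found on, so the result is https only when the page itself was fetched over https.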

3. Markdown Integration

BeautifulSoup Parsing
Extract images from original HTML
Regex URL Replacement
Multiple strategies for reliable replacement
Local Path Generation
Consistent images/filename.jpg format
Content Preservation
Maintains alt text and image context
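
A minimal sketch of the regex replacement step follows. It assumes a mapping from original image URLs to local paths is available once downloads finish; localize_images and the downloaded mapping are hypothetical, and the pattern covers only the basic Markdown image form without titles.

```python
import re

# Matches ![alt](url) where the URL contains no spaces or closing parenthesis.
IMAGE_LINK = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<url>[^)\s]+)\)")


def localize_images(markdown: str, downloaded: dict) -> str:
    """Rewrite image links whose URL was downloaded; leave the rest untouched."""

    def swap(match: re.Match) -> str:
        local = downloaded.get(match.group("url"))
        if local is None:
            return match.group(0)
        # Alt text is preserved; only the target changes.
        return f"![{match.group('alt')}]({local})"

    return IMAGE_LINK.sub(swap, markdown)
```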

Integration Points

The unified system integrates at the NewsSourceArticleFetcher level in core/news_source_adapter.py:

# Automatic source detection from URL domain
source_name = self._detect_source_from_url(url)

# Process media with unified system
updated_content = UnifiedMediaProcessor.process_article_media(
    content=article_content,
    html_content=response.text,
    url=url,
    article_folder=article_folder_path,
    source_name=source_name,
    session=self.session
)

Benefits Achieved

Eliminated "Whack-a-Mole" Pattern
Single fix applies to all 25+ sources
Configurable Processing
Easy to add new sources or adjust existing ones
Consistent Behavior
Same processing logic across all sources
Maintainable Architecture
Centralized instead of scattered fixes
Robust Error Handling
Graceful fallbacks and comprehensive logging

Performance Metrics

Gizmodo
→ 27 images
Futurism
→ 17 images
IEEE
→ 1 image

Performance Characteristics

Development Time
→ 15-30 min
Code Lines
→ ~10 (YAML)
Maintenance
→ Config update
Flexibility
→ Limited
Performance
→ Excellent

Simple Protection System

Overview

The Simple Protection System provides efficient protection against link aggregators and oversized images using clear, maintainable rules. It replaces the complex Website Classifier with simple heuristics that are easy to understand and modify.

Protection Rules

┌────────────────────────────────────────────────────────────────────┐
│                      Simple Protection Rules                       │
├──────────────────────┬─────────────────────────────────────────────┤
│ Protection Type      │ Rule                                        │
├──────────────────────┼─────────────────────────────────────────────┤
│ Aggregator Detection │ Link density > 15% OR > 10 external domains │
│ Per-Image Filtering  │ Skip images > 5MB OR < 150px                │
│ Image Count Limit    │ Maximum 20 images per article               │
│ --media Flag         │ Increases limit to 1000 images              │
└──────────────────────┴─────────────────────────────────────────────┘

Simple Detection Logic

Aggregator Detection

Link density > 15% of content (links per word count)
Links to > 10 different external domains
Navigation links automatically excluded from analysis

Per-Image Filtering

HEAD request checks image size before download
Skip images > 5MB (this limit still applies when --media is used)
Skip images < 150px (likely icons/decorative)
Skip non-image content types

Real-World Protection Examples

Aggregator Site Example

Input: News aggregator with many external links
Analysis:
  - Word count: 200, Links: 35 → Link density: 17.5% (above 15%)
  - External domains: 12 (above 10 threshold)
Result: BLOCKED (aggregator detected)

Normal Article Example

Input: Technical article about React optimization
Analysis:
  - Word count: 800, Links: 5 → Link density: 0.6% (below 15%)
  - External domains: 2 (below 10 threshold)
Result: ALLOWED (up to 20 images)
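
The two aggregator rules can be checked with a few lines of arithmetic. The sketch below works on metrics that have already been extracted (word count and a list of link targets); is_aggregator and its parameters are hypothetical, and the exclusion of navigation links is assumed to happen before this step.

```python
from urllib.parse import urlparse

MAX_LINK_DENSITY = 0.15      # links per word
MAX_EXTERNAL_DOMAINS = 10


def is_aggregator(word_count: int, links: list, base_url: str) -> bool:
    """Apply the two aggregator rules to pre-extracted page metrics."""
    if word_count <= 0:
        return False  # assumption: empty pages are handled elsewhere
    own_host = urlparse(base_url).netloc
    # Relative links have an empty netloc and count as internal.
    external = {urlparse(link).netloc for link in links} - {own_host, ""}
    density = len(links) / word_count
    return density > MAX_LINK_DENSITY or len(external) > MAX_EXTERNAL_DOMAINS
```

With the figures from the examples above, 35 links over 200 words gives a density of 0.175 and the page is blocked, while 5 links over 800 words gives 0.00625 and the page is allowed.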

--media Flag Integration

# Normal behavior (without --media)
max_images = 20  # Standard limit

# With --media flag
if media_enabled:
    max_images = 1000  # Very high limit
    # Per-image 5MB limit still applies for safety

Edge Case Handling

--media flag increases image count limit to 1000
Per-image size checking still applies for safety
Aggregator detection still blocks problematic sites
Simple, predictable behavior

Architecture Integration

Location

core/simple_protection.py

Integration Point

core/image_processor.py

# Simple protection check during image processing
protection_result = self.protection.check_content(html_content, base_url)

# Skip if aggregator detected
if protection_result.is_aggregator:
    logger.warning(f"Skipping aggregator site: {base_url}")
    return {}

Performance Characteristics

Fast Analysis
Content metrics extracted in ~10ms
No Network Calls
Uses only downloaded content
Memory Efficient
Processes content in streaming fashion
Logging
Detailed classification reasoning for debugging

Configuration Management

Hierarchy

(highest to lowest priority):

Command-line arguments
Environment variables
Configuration files (capcat.yml)
Default values

# Example: capcat.yml
network:
  connect_timeout: 10
  read_timeout: 8
  user_agent: "Mozilla/5.0 (compatible; Capcat/2.0)"

processing:
  max_workers: 8
  download_images: true
  download_videos: false

logging:
  default_level: "INFO"
  use_colors: true
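
The priority order can be expressed with collections.ChainMap, which looks keys up in each mapping in turn. This is a sketch under simplifying assumptions: settings are flattened to single-level keys, the CAPCAT_<KEY> environment variable naming is hypothetical, and resolve_settings is not Capcat's actual loader.

```python
import os
from collections import ChainMap

# Default values (lowest priority); numbers taken from the example file above.
DEFAULTS = {"connect_timeout": 10, "read_timeout": 8, "max_workers": 8}


def resolve_settings(cli_args: dict, file_config: dict, environ=os.environ) -> ChainMap:
    """Layer settings so earlier maps shadow later ones."""
    # Hypothetical CAPCAT_<KEY> environment variable naming.
    env = {
        key: int(environ[f"CAPCAT_{key.upper()}"])
        for key in DEFAULTS
        if f"CAPCAT_{key.upper()}" in environ
    }
    # Options the user did not pass (None) must not mask lower layers.
    cli = {k: v for k, v in cli_args.items() if v is not None}
    return ChainMap(cli, env, file_config, DEFAULTS)
```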

Template System Architecture

Universal HTML Generation

Consistent navigation and professional output across all sources.

Template Components

templates/
├── article-with-comments.html     # For sources supporting comments (HN, Lobsters, LessWrong)
├── article-no-comments.html       # For sources without comments (BBC, CNN, Nature, etc.)
└── comments-with-navigation.html  # For all comments pages (Back to Article button)

Template Configuration

# Source template configuration
template:
  variant: "article-with-comments"  # or "article-no-comments"
  navigation:
    back_to_news_url: "../../news.html"
    back_to_news_text: "Back to News"
    has_comments: true
    comments_url: "comments.html"
    comments_text: "View Comments"

Navigation Logic

Article Pages (with comments)
"Back to News" + "View Comments"
Article Pages (no comments)
"Back to News" only
Comments Pages
"Back to Article" only

Source Detection

Template system automatically detects 30+ source patterns:

Hacker-News, Lobsters, LessWrong → article-with-comments
BBC, CNN, Nature → article-no-comments
All comments pages → comments-with-navigation
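
The detection and navigation rules above amount to two small lookups, sketched here. The lowercase source keys and both function names are assumptions for illustration; the real template system matches 30+ patterns and may identify sources differently.

```python
# Sources with comment support, per the list above; lookup keys are assumed.
COMMENT_SOURCES = {"hacker-news", "lobsters", "lesswrong"}


def select_template(source: str, is_comments_page: bool = False) -> str:
    """Pick a template variant for a page."""
    if is_comments_page:
        return "comments-with-navigation"
    if source.lower() in COMMENT_SOURCES:
        return "article-with-comments"
    return "article-no-comments"


def navigation_buttons(template: str) -> list:
    """Return the navigation labels each variant shows."""
    return {
        "article-with-comments": ["Back to News", "View Comments"],
        "article-no-comments": ["Back to News"],
        "comments-with-navigation": ["Back to Article"],
    }[template]
```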

Benefits

100% Navigation Consistency
Same patterns across all sources
Professional Appearance
Template-driven HTML with themes
Easy Maintenance
Update all sources by modifying templates
Scalable
Add new sources with simple YAML configuration

Testing Architecture

Test Levels

Unit Tests
Individual component testing
Integration Tests
Cross-component functionality
Source Tests
Individual source validation
System Tests
End-to-end workflow validation

# Comprehensive testing framework
python test_comprehensive_sources.py  # Tests all 20 sources
python test_validation_engine.py      # Tests validation system
python test_performance_monitor.py    # Tests monitoring system

Extension Points

Adding New Config-Driven Sources

# 1. Create YAML configuration
# sources/active/config_driven/configs/newsource.yaml
display_name: "New Source"
base_url: "https://newsource.com/"
article_selectors: [".headline a"]
content_selectors: [".article-content"]

# 2. Source automatically discovered and available

Adding New Custom Sources

# 1. Create source directory and files
# sources/active/custom/newsource/
# ├── source.py      # BaseSource implementation
# └── config.yaml    # Source configuration

# 2. Implement BaseSource contract
class NewSource(BaseSource):
    def get_articles(self, count=30):
        # Custom implementation
        pass

Error Handling Strategy

Graceful Degradation

Individual source failures don't affect others
Partial success reported clearly
Detailed error logging for debugging
Automatic retry mechanisms for transient failures
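
A compact sketch of these behaviours is given below: transient network errors are retried with exponential backoff, and each source runs inside its own try block so one failure is reported without stopping the rest. The function names, retry count, and delays are illustrative assumptions, not Capcat's configured values.

```python
import time

# Built-in exception types treated as transient in this sketch.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def fetch_with_retry(fetch, retries: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fetch(), retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except TRANSIENT_ERRORS:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller record the failure
            sleep(base_delay * 2 ** attempt)


def run_sources(fetchers: dict, sleep=time.sleep) -> dict:
    """Run every source; one failure never stops the others."""
    report = {"succeeded": [], "failed": {}}
    for name, fetch in fetchers.items():
        try:
            fetch_with_retry(fetch, sleep=sleep)
            report["succeeded"].append(name)
        except Exception as exc:
            report["failed"][name] = str(exc)
    return report
```

Non-transient errors, such as a parsing failure, are not retried and go straight into the failure report.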

Error Categories

Network
Timeouts, connection failures
Parsing
Invalid HTML, missing selectors
Configuration
Invalid YAML, missing fields
Logic
Custom source implementation errors

HTML Generation Architecture

Capcat 2.0 features a compartmentalized HTML generation system that replaces monolithic conditional logic with source-specific configurations and templates.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                   HTML Generation System                    │
├─────────────────────────────────────────────────────────────┤
│  Factory Pattern                                            │
│  ├── HTMLGeneratorFactory                                   │
│  └── Source-specific Generator Instances                    │
├─────────────────────────────────────────────────────────────┤
│  Base System                                                │
│  ├── BaseHTMLGenerator (Abstract)                           │
│  ├── Base Templates ({{}} syntax)                           │
│  └── Configuration Schema                                   │
├─────────────────────────────────────────────────────────────┤
│  Source-Specific Implementations                            │
│  ├── htmlgen/hn/                                            │
│  │   ├── config.yaml                                        │
│  │   ├── generator.py                                       │
│  │   └── templates/ (overrides)                             │
│  ├── htmlgen//                                              │
│  └── htmlgen/lb/                                            │
└─────────────────────────────────────────────────────────────┘

Key Features

1. Configuration-Driven Behavior

# htmlgen/hn/config.yaml
comments:
  enabled: true
  pattern: '\*\*Anonymous\*\*\s*\(\[profile\]\([^)]*\)\)'
  conditional_display: true

navigation:
  breadcrumb_style: "technical"
  date_format: "YYYY-MM-DD"

layout:
  template_set: "base"
  custom_css_classes:
    - "hn-style"
    - "tech-focused"

2. Template System with Override Capability

Base Templates
htmlgen/base/templates/ using clean {{}} syntax
Source Overrides
htmlgen/[source]/templates/ for custom layouts
Fallback Logic
Jinja2 with string replacement fallback
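
The override and fallback behaviour might be sketched as follows. The directory layout follows the htmlgen/ tree described on this page; resolve_template and render_fallback are hypothetical helpers, and the fallback handles only simple {{ name }} placeholders, which is an assumption about what the string-replacement path covers.

```python
import re
from pathlib import Path


def resolve_template(root: Path, source: str, name: str) -> Path:
    """Prefer htmlgen/<source>/templates/<name>, else fall back to the base set."""
    override = root / source / "templates" / name
    return override if override.is_file() else root / "base" / "templates" / name


def render_fallback(template_text: str, context: dict) -> str:
    """Minimal {{ name }} substitution for when Jinja2 is unavailable."""

    def fill(match: re.Match) -> str:
        # Unknown placeholders render as empty strings.
        return str(context.get(match.group(1), ""))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", fill, template_text)
```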

3. Privacy-Compliant Comment Processing

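import re
from pathlib import Path

def count_comments(self, comments_file: Path) -> int:
    # Count comments by matching the source's anonymized-author pattern
    content = comments_file.read_text(encoding="utf-8")
    pattern = self.source_config['comments']['pattern']
    return len(re.findall(pattern, content))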

4. Validation and Testing

JSON Schema validation for all configurations
Comprehensive test suite with 100% pass rate
Source-specific pattern validation

Benefits

Maintainability
No more monolithic conditional logic
Extensibility
Easy to add new sources with YAML config
Consistency
Standardized configuration schema
Privacy
Built-in anonymization pattern support
Performance
Template caching and optimization
Testing
Isolated testing per source

Usage Example

# Automatic factory instantiation
generator = HTMLGeneratorFactory.create_generator("hn")

# Source-specific behavior from config
comment_count = generator.count_comments(comments_file)
should_show = generator.should_show_comment_link(comment_count)

# Template rendering with source customization
html_content = generator.render_template("article.html", context)

Privacy & Anonymization Architecture

Comment Processing Pipeline

Collection
Comments fetched from source APIs
Anonymization
Usernames replaced with "Anonymous"
Link Preservation
Profile URLs preserved for reference
Pattern Recognition
Source-specific regex patterns match anonymized format
HTML Generation
Templates render privacy-compliant output
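
The anonymization and pattern-recognition steps fit together as in the sketch below. The regular expression is the one from the htmlgen/hn/config.yaml example on this page; the rendered comment layout is inferred from that pattern, and render_comment and count_rendered_comments are hypothetical names.

```python
import re

# Pattern copied from the htmlgen/hn/config.yaml example on this page.
ANON_PATTERN = r"\*\*Anonymous\*\*\s*\(\[profile\]\([^)]*\)\)"


def render_comment(profile_url: str, text: str) -> str:
    """Render one comment with the author shown as Anonymous, keeping the profile link."""
    return f"**Anonymous** ([profile]({profile_url}))\n\n{text}\n\n"


def count_rendered_comments(markdown: str) -> int:
    """Count comments by matching the anonymized author line."""
    return len(re.findall(ANON_PATTERN, markdown))
```

Because every comment is written with the same author line, counting matches of the pattern gives the comment count without storing usernames.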

Legal Compliance Features

No personal username storage
Anonymous-only display in all outputs
Functional profile links preserved
Transparent privacy policy
Source attribution maintained

Design Principles

Separation of Concerns
Config vs logic clearly separated
Single Responsibility
Each component has focused purpose
Open/Closed
Open for extension, closed for modification
DRY
No code duplication across sources
Performance
Session pooling and connection optimization
Testability
Comprehensive testing at all levels
Maintainability
Clear patterns and documentation
Privacy by Design
Built-in anonymization and compliance
Template Modularity
Reusable components with override capability

This architecture provides the foundation for scalable, maintainable news source management with optimal performance characteristics.
