capcat.core.unified_article_processor

File: Application/capcat/core/unified_article_processor.py

Description

Unified Article Processor - Universal entry point for all article processing.

All commands (single, fetch, bundle) route through this processor to ensure consistent behavior across the application.

Classes

UnifiedArticleProcessor

Universal article processing entry point.

Provides consistent article fetching logic across all commands:

Single article mode
Fetch mode (multiple sources)
Bundle mode (predefined source groups)

Processing Order:

1. Check specialized sources (Twitter, YouTube, Medium, Substack)
2. Check source-specific handlers (HN, BBC, etc.)
3. Fall back to generic ArticleFetcher
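
The processing order can be illustrated with a minimal routing sketch. It is not the module's actual code: `choose_route`, the domain tables, and the route labels are hypothetical names introduced here, and the real source registry may match URLs differently.

```python
from urllib.parse import urlparse

# Hypothetical domain tables; the real registry contents are not shown in this reference.
SPECIALIZED_DOMAINS = {"twitter.com", "youtube.com", "medium.com", "substack.com"}
SOURCE_HANDLERS = {"news.ycombinator.com": "hn", "bbc.com": "bbc"}


def _host(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def choose_route(url: str) -> str:
    """Pick a processing route in the documented order."""
    host = _host(url)
    # 1. Specialized sources are checked first.
    if any(_matches(host, domain) for domain in SPECIALIZED_DOMAINS):
        return "specialized"
    # 2. Source-specific handlers are checked next.
    for domain, source in SOURCE_HANDLERS.items():
        if _matches(host, domain):
            return "source:" + source
    # 3. Everything else falls back to the generic fetcher.
    return "generic"
```

Because the checks run in a fixed order, a URL that could match more than one category always resolves to the earliest one.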

Methods

__init__

def __init__(self)

Initialize processor with source registry.

Parameters:

self

process_article

def process_article(self, url: str, title: str, index: int, base_folder: str, download_files: bool = False, progress_callback: Optional[Callable[[float, str], None]] = None, update_mode: bool = False) -> Tuple[bool, Optional[str], Optional[str]]

Universal article processing with specialized source detection.

Args:

url: Article URL to process
title: Article title (may be placeholder)
index: Article index for folder numbering
base_folder: Output directory path
download_files: Whether to download media files
progress_callback: Optional progress reporting callback
update_mode: Re-check URL validity and update timestamp

Returns: Tuple[bool, Optional[str], Optional[str]]: (success, article_title, article_folder_path)

Parameters:

self
url (str)
title (str)
index (int)
base_folder (str)
download_files (bool) optional
progress_callback (Optional[Callable[[float, str], None]]) optional
update_mode (bool) optional

Returns: Tuple[bool, Optional[str], Optional[str]]
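
A short sketch of how a caller might supply a progress callback and unpack the three-element result. `fake_process_article` is a hypothetical stand-in that mirrors the documented signature without touching the network or disk. The progress values (0.0 to 1.0), the stage strings, and the returned folder path are assumptions made for illustration only.

```python
from typing import Callable, List, Optional, Tuple

ProgressCallback = Callable[[float, str], None]
Result = Tuple[bool, Optional[str], Optional[str]]


def make_progress_recorder() -> Tuple[ProgressCallback, List[Tuple[float, str]]]:
    """Build a callback matching Callable[[float, str], None] that records each call."""
    events: List[Tuple[float, str]] = []

    def callback(progress: float, stage: str) -> None:
        events.append((progress, stage))

    return callback, events


def fake_process_article(
    url: str,
    title: str,
    index: int,
    base_folder: str,
    download_files: bool = False,
    progress_callback: Optional[ProgressCallback] = None,
    update_mode: bool = False,
) -> Result:
    """Hypothetical stand-in with the documented parameters; performs no I/O."""
    if progress_callback is not None:
        progress_callback(0.0, "starting")  # assumed progress scale
        progress_callback(1.0, "done")
    # The folder layout below is invented for this sketch.
    return True, title, f"{base_folder}/{index:02d}_{title}"


callback, events = make_progress_recorder()
success, article_title, folder = fake_process_article(
    url="https://example.org/post",
    title="Example title",
    index=1,
    base_folder="output",
    progress_callback=callback,
)
if success:
    print(f"Saved {article_title!r} to {folder}")
```

Since the title and folder path are both Optional, checking `success` before using them is the safe pattern.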

_process_with_specialized_source

def _process_with_specialized_source(self, url: str, title: str, base_folder: str, update_mode: bool) -> Tuple[bool, Optional[str], Optional[str]]

Handle specialized sources (Twitter, YouTube, Medium, Substack).

Args:

url: Article URL
title: Article title
base_folder: Output directory
update_mode: Whether to re-check URL and update timestamp

Returns: Tuple of (success, article_title, article_folder_path)

Parameters:

self
url (str)
title (str)
base_folder (str)
update_mode (bool)

Returns: Tuple[bool, Optional[str], Optional[str]]

_process_with_source_handler

def _process_with_source_handler(self, url: str, title: str, index: int, base_folder: str, download_files: bool, progress_callback: Optional[Callable], source: str) -> Tuple[bool, Optional[str], Optional[str]]

Handle source-specific processing (HN, BBC, etc.).

Delegates to existing source-specific implementations.

Args:

url: Article URL
title: Article title
index: Article index
base_folder: Output directory
download_files: Whether to download media
progress_callback: Progress callback
source: Source identifier (e.g., 'hn', 'bbc')

Returns: Tuple of (success, article_title, article_folder_path)

Parameters:

self
url (str)
title (str)
index (int)
base_folder (str)
download_files (bool)
progress_callback (Optional[Callable])
source (str)

Returns: Tuple[bool, Optional[str], Optional[str]]

_process_with_generic_handler

def _process_with_generic_handler(self, url: str, title: str, index: int, base_folder: str, download_files: bool, progress_callback: Optional[Callable]) -> Tuple[bool, Optional[str], Optional[str]]

Handle generic article processing.

Uses ArticleFetcher for unknown sources.

Args:

url: Article URL
title: Article title
index: Article index
base_folder: Output directory
download_files: Whether to download media
progress_callback: Progress callback

Returns: Tuple of (success, article_title, article_folder_path)

Parameters:

self
url (str)
title (str)
index (int)
base_folder (str)
download_files (bool)
progress_callback (Optional[Callable])

Returns: Tuple[bool, Optional[str], Optional[str]]

_check_url_validity

def _check_url_validity(self, url: str) -> bool

Check whether the URL is still valid by sending a HEAD request.

Args: url: URL to check

Returns: True if the URL is valid (status 200-399), False otherwise

Parameters:

self
url (str)

Returns: bool
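
The check described above can be sketched with the standard library alone. This is an illustration under assumptions, not the method's actual body: the real implementation may use a different HTTP client, timeout, or redirect policy. `is_valid_status`, `check_url_validity`, and the `timeout` parameter are hypothetical names.

```python
import urllib.error
import urllib.request


def is_valid_status(status: int) -> bool:
    """Apply the documented rule: statuses 200-399 count as valid."""
    return 200 <= status <= 399


def check_url_validity(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the final status is 200-399."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        # urlopen follows redirects, so the status seen here is the final one.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_valid_status(response.status)
    except urllib.error.HTTPError as exc:
        # Raised for 4xx/5xx responses; the status code is still available.
        return is_valid_status(exc.code)
    except (urllib.error.URLError, ValueError, OSError):
        # DNS failures, refused connections, timeouts, or malformed URLs.
        return False
```

A HEAD request avoids downloading the body, which keeps the check cheap. Some servers reject HEAD, so a valid page can occasionally be reported as invalid.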

_update_timestamp

def _update_timestamp(self, article_folder: str)

Update timestamp in article metadata.

Args: article_folder: Path to article folder containing article.md

Parameters:

self
article_folder (str)

_add_url_warning

def _add_url_warning(self, article_folder: str, url: str)

Add warning about invalid URL to article.

Args:

article_folder: Path to article folder
url: Invalid URL

Parameters:

self
article_folder (str)
url (str)

GenericArticleFetcher

Inherits from: ArticleFetcher

Methods

should_skip_url

def should_skip_url(self, url: str, title: str) -> bool

Never skip URLs for generic fetching.

Parameters:

self
url (str)
title (str)

Returns: bool
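
The override pattern can be shown with a stand-in base class. `ArticleFetcherStub` and its skip policy are hypothetical, since the real `ArticleFetcher` is not documented in this section. Only the always-False override reflects the behavior described above.

```python
class ArticleFetcherStub:
    """Hypothetical stand-in for ArticleFetcher; its skip policy is invented."""

    def should_skip_url(self, url: str, title: str) -> bool:
        # Example base policy for this sketch: skip entries that have no URL.
        return not url


class GenericFetcherSketch(ArticleFetcherStub):
    """Mirrors the documented override: generic fetching never skips a URL."""

    def should_skip_url(self, url: str, title: str) -> bool:
        return False
```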

Functions

get_unified_processor

def get_unified_processor() -> UnifiedArticleProcessor

Get the global unified article processor instance.

Returns: UnifiedArticleProcessor
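
A common way to expose a global instance is a lazily created module-level singleton, sketched below. This reference does not say whether the real module creates its instance lazily or at import time. `ProcessorStub`, `_processor`, and `get_processor` are hypothetical names.

```python
from typing import Optional


class ProcessorStub:
    """Hypothetical stand-in for UnifiedArticleProcessor."""


# Module-level cache for the shared instance (assumed pattern).
_processor: Optional[ProcessorStub] = None


def get_processor() -> ProcessorStub:
    """Return the shared instance, creating it on first use."""
    global _processor
    if _processor is None:
        _processor = ProcessorStub()
    return _processor
```

Every caller receives the same object, so any state held by the processor, such as a source registry, is built once and shared.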