capcat.core.source_system.base_source
File: Application/capcat/core/source_system/base_source.py
Description
Abstract base class for all news sources. Defines the contract that all source implementations must follow.
Classes
SourceConfig
Configuration data class for news sources.
Attributes:
name: Machine-readable identifier (e.g. "hn").
display_name: Human-readable name shown in menus (e.g. "Hacker News").
base_url: Root URL of the source.
timeout: HTTP request timeout in seconds.
rate_limit: Minimum delay between requests in seconds.
supports_comments: True if the source has comment threads.
has_comments: True if the current fetch includes comments.
category: Topic category (e.g. "tech", "science").
custom_config: Source-specific extra config key/value pairs.
Methods
post_init
def __post_init__(self) -> None
Ensure custom_config is never None after construction.
Parameters:
self
Returns: None
to_dict
def to_dict(self) -> Dict[str, Any]
Convert to dictionary format for compatibility.
Returns: Dictionary representation of the source configuration
Parameters:
self
Returns: Dict[str, Any]
Article
Lightweight data transfer object for a single news article.
Attributes:
title: Article headline.
url: Canonical URL of the article.
comment_url: URL of the comments page, if available.
author: Byline, anonymized to "Anonymous" where applicable.
published_date: Publication date string (format varies by source).
summary: Short excerpt or lede, if provided by the feed.
tags: List of topic tags associated with the article.
Methods
post_init
def __post_init__(self) -> None
Ensure tags is never None after construction.
Parameters:
self
Returns: None
BaseSource
Inherits from: ABC
Abstract base class for all news sources.
This defines the contract that all source implementations must follow, ensuring consistent behavior across different news sources.
Methods
init
def __init__(self, config: SourceConfig, session: Optional[requests.Session] = None)
Initialize the source with configuration.
Args: config: Source configuration session: Optional HTTP session for connection pooling
Parameters:
selfconfig(SourceConfig)session(Optional[requests.Session]) optional
source_type
def source_type(self) -> str
Return the source type (‘config_driven’ or ‘custom’).
Parameters:
self
Returns: str
discover_articles
def discover_articles(self, count: int) -> List[Article]
Discover articles from the source.
Args: count: Maximum number of articles to discover
Returns: List of Article objects
Raises: SourceError: If article discovery fails
Parameters:
selfcount(int)
Returns: List[Article]
fetch_article_content
def fetch_article_content(self, article: Article, output_dir: str, progress_callback = None, download_files: bool = False, download_pdfs: bool = False) -> Tuple[bool, Optional[str]]
Fetch and save article content.
Args: article: Article to fetch output_dir: Directory to save content progress_callback: Optional progress callback function download_files: Whether to download media files (–media flag) download_pdfs: Whether to download PDF files (–pdfs flag)
Returns: Tuple of (success, article_path)
Raises: SourceError: If content fetching fails
Parameters:
selfarticle(Article)output_dir(str)progress_callbackoptionaldownload_files(bool) optionaldownload_pdfs(bool) optional
Returns: Tuple[bool, Optional[str]]
fetch_comments
def fetch_comments(self, comment_url: str, article_title: str, article_folder_path: str) -> bool
Fetch and save article comments (if supported).
Override in sources that support comments. The default implementation returns False (no comments fetched).
Args: comment_url: URL to the comments page for this article. article_title: Title of the article (used for logging and filenames). article_folder_path: Path to the article’s output folder.
Returns: True if comments were saved, False otherwise.
Parameters:
selfcomment_url(str)article_title(str)article_folder_path(str)
Returns: bool
validate_config
def validate_config(self) -> List[str]
Validate the source configuration.
Returns: List of validation error messages (empty if valid)
Parameters:
self
Returns: List[str]
_validate_custom_config
def _validate_custom_config(self) -> List[str]
Validate custom configuration (override in subclasses).
Returns: List of validation error messages
Parameters:
self
Returns: List[str]
should_skip_url
def should_skip_url(self, url: str, title: str = '') -> bool
Check if a URL should be skipped during processing.
Args: url: URL to check title: Optional article title
Returns: True if URL should be skipped
Parameters:
selfurl(str)title(str) optional
Returns: bool
_should_skip_custom
def _should_skip_custom(self, url: str, title: str = '') -> bool
Custom skip logic (override in subclasses).
Args: url: URL to check title: Optional article title
Returns: True if URL should be skipped
Parameters:
selfurl(str)title(str) optional
Returns: bool
get_rate_limit
def get_rate_limit(self) -> float
Get the rate limit for this source.
Parameters:
self
Returns: float
get_timeout
def get_timeout(self) -> float
Get the timeout for this source.
Parameters:
self
Returns: float
_get_logger
def _get_logger(self)
Get a logger instance for this source.
Parameters:
self
_setup_performance_monitoring
def _setup_performance_monitoring(self)
Setup performance monitoring for this source.
Parameters:
self
_start_performance_timing
def _start_performance_timing(self) -> float
Start performance timing for an operation.
Parameters:
self
Returns: float
_end_performance_timing
def _end_performance_timing(self, start_time: float, success: bool, error_type: Optional[str] = None)
End performance timing for an operation.
Parameters:
selfstart_time(float)success(bool)error_type(Optional[str]) optional
_record_article_discovery
def _record_article_discovery(self, count: int)
Record successful article discovery.
Parameters:
selfcount(int)
_record_content_fetch
def _record_content_fetch(self, success: bool)
Record content fetching result.
Parameters:
selfsuccess(bool)
discover_articles_with_retry_skip
def discover_articles_with_retry_skip(self, count: int, max_retries: int = 2, batch_mode: bool = False) -> Optional[List[Article]]
Discover articles with retry-and-skip logic for network resilience.
Attempts to discover articles up to max_retries times. If all attempts fail due to timeout or connection errors, returns None (skip source).
Args: count: Maximum number of articles to discover max_retries: Maximum number of retry attempts (default: 2) batch_mode: Whether processing multiple sources (affects user messages)
Returns: List of Article objects if successful, None if skipped
Example: »> source = MySource(config) »> articles = source.discover_articles_with_retry_skip( … count=10, max_retries=2 … ) »> if articles is None: … print(“Source skipped after failures”)
Parameters:
selfcount(int)max_retries(int) optionalbatch_mode(bool) optional
Returns: Optional[List[Article]]
str
def __str__(self) -> str
String representation of the source.
Parameters:
self
Returns: str
repr
def __repr__(self) -> str
Detailed string representation of the source.
Parameters:
self
Returns: str
SourceError
Inherits from: Exception
Base exception for source-related errors.
Methods
init
def __init__(self, message: str, source_name: str = None)
Create a SourceError with an optional source name prefix.
Args:
message: Human-readable description of the error.
source_name: Source identifier prepended to the message in
__str__ (e.g. "hn"). Defaults to None.
Parameters:
selfmessage(str)source_name(str) optional
str
def __str__(self)
Return message prefixed with [source_name] when set.
Returns:
"[source_name] message" if source_name is set, otherwise
the plain message string.
Parameters:
self
ArticleDiscoveryError
Inherits from: SourceError
Exception raised when article discovery fails.
ContentFetchError
Inherits from: SourceError
Exception raised when content fetching fails.
ConfigurationError
Inherits from: SourceError
Exception raised when source configuration is invalid.