capcat.core.source_system.config_driven_source
File: Application/capcat/core/source_system/config_driven_source.py
Description
Config-driven source implementation. Handles sources that are defined purely through configuration files.
Refactored for:
- Proper XML parsing with lxml (eliminates XMLParsedAsHTMLWarning)
- Strategy pattern for discovery methods (RSS vs HTML)
- Reduced cyclomatic complexity
- Single Responsibility Principle compliance
- Improved error handling
Classes
ConfigDrivenSource
Inherits from: BaseSource
Source implementation for config-driven sources.
Uses configuration data to extract articles and content without requiring custom Python code. Delegates to strategy classes for different discovery methods (RSS, HTML).
Methods
source_type
def source_type(self) -> str
Return the source type.
Parameters:
self
Returns: str
discover_articles
def discover_articles(self, count: int) -> List[Article]
Discover articles using configured discovery method.
Supports both RSS and HTML scraping, selected by configuration. Uses the Strategy pattern to delegate to the appropriate discovery handler.
Args: count: Maximum number of articles to discover
Returns: List of Article objects
Raises: ArticleDiscoveryError: If article discovery fails
Parameters:
self, count (int)
Returns: List[Article]
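The delegation described for discover_articles can be sketched roughly as follows. This is a minimal illustration of the Strategy pattern, not the capcat implementation: `DiscoveryStrategy`, `RssDiscovery`, `HtmlDiscovery`, the `discovery`, `entries` and `links` config keys, and the simplified `Article` are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Article:
    # Simplified stand-in for the real Article type (assumption).
    title: str
    url: str


class DiscoveryStrategy(ABC):
    """Hypothetical interface shared by all discovery methods."""

    @abstractmethod
    def discover(self, config: dict, count: int) -> List[Article]:
        ...


class RssDiscovery(DiscoveryStrategy):
    def discover(self, config: dict, count: int) -> List[Article]:
        # A real strategy would download and parse the feed; here the
        # entries are read from the config to keep the sketch offline.
        entries = config.get("entries", [])
        return [Article(e["title"], e["url"]) for e in entries[:count]]


class HtmlDiscovery(DiscoveryStrategy):
    def discover(self, config: dict, count: int) -> List[Article]:
        # A real strategy would apply CSS selectors to a fetched page.
        links = config.get("links", [])
        return [Article(l["text"], l["href"]) for l in links[:count]]


# Registry mapping the configured method name to its handler.
STRATEGIES: Dict[str, DiscoveryStrategy] = {
    "rss": RssDiscovery(),
    "html": HtmlDiscovery(),
}


def discover_articles(config: dict, count: int) -> List[Article]:
    """Pick the strategy named in the config and delegate to it."""
    method = config.get("discovery", "rss")
    if method not in STRATEGIES:
        raise ValueError(f"Unknown discovery method: {method!r}")
    return STRATEGIES[method].discover(config, count)
```

With this shape, supporting another discovery method means adding one class and one registry entry; the dispatching function itself does not change.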
fetch_article_content
def fetch_article_content(self, article: Article, output_dir: str, progress_callback = None, download_files: bool = False, download_pdfs: bool = False) -> Tuple[bool, Optional[str]]
Fetch article content using configured content selectors.
Args:
- article: Article to fetch
- output_dir: Directory to save content
- progress_callback: Optional progress callback function
Returns: Tuple of (success, article_path)
Raises: ContentFetchError: If content fetching fails
Parameters:
self, article (Article), output_dir (str), progress_callback (optional), download_files (bool, optional), download_pdfs (bool, optional)
Returns: Tuple[bool, Optional[str]]
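Because fetch_article_content can signal failure in two ways (a False success flag or a raised ContentFetchError), callers may want to handle both. The sketch below shows one way to do that. `StubSource` and `fetch_or_none` are hypothetical, `ContentFetchError` is redefined locally as a stand-in, the article is modelled as a plain dict, and the arguments passed to the progress callback are an assumption since the documentation does not specify them.

```python
from typing import Optional, Tuple


class ContentFetchError(Exception):
    """Local stand-in for the exception named in the docs."""


class StubSource:
    """Hypothetical stand-in mirroring the documented method signature."""

    def fetch_article_content(
        self,
        article: dict,
        output_dir: str,
        progress_callback=None,
        download_files: bool = False,
        download_pdfs: bool = False,
    ) -> Tuple[bool, Optional[str]]:
        if not article.get("url"):
            raise ContentFetchError("article has no URL")
        if progress_callback is not None:
            # Callback arguments are an assumption, not the real contract.
            progress_callback(1.0, "saved")
        return True, f"{output_dir}/{article['title']}"


def fetch_or_none(source, article: dict, output_dir: str) -> Optional[str]:
    """Return the saved path, or None if the fetch failed either way."""
    try:
        success, path = source.fetch_article_content(article, output_dir)
    except ContentFetchError:
        return None
    return path if success else None
```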
_prepare_fetcher_config
def _prepare_fetcher_config(self) -> dict
Prepare configuration for NewsSourceArticleFetcher.
Returns: Configuration dictionary with required fields
Parameters:
self
Returns: dict
_validate_custom_config
def _validate_custom_config(self) -> List[str]
Validate config-driven source configuration.
Returns: List of validation error messages
Parameters:
self
Returns: List[str]
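A validator that returns a list of error messages, rather than raising on the first problem, lets the caller report every issue at once. The following is a minimal sketch of that shape only. The function name `validate_config` and the config keys (`base_url`, `discovery.method`, `discovery.rss_url`, `article_selectors`) are assumptions made for illustration and may not match the real configuration schema.

```python
from typing import List
from urllib.parse import urlparse


def validate_config(config: dict) -> List[str]:
    """Collect all validation problems; an empty list means valid."""
    errors: List[str] = []

    # Hypothetical rule: the source needs an absolute http(s) base URL.
    parsed = urlparse(config.get("base_url", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("base_url must be an absolute http(s) URL")

    # Hypothetical rule: each discovery method has its own required field.
    discovery = config.get("discovery", {})
    method = discovery.get("method")
    if method not in ("rss", "html"):
        errors.append("discovery.method must be 'rss' or 'html'")
    elif method == "rss" and not discovery.get("rss_url"):
        errors.append("discovery.rss_url is required for RSS discovery")
    elif method == "html" and not config.get("article_selectors"):
        errors.append("article_selectors is required for HTML discovery")

    return errors
```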
_should_skip_custom
def _should_skip_custom(self, url: str, title: str = '') -> bool
Custom skip logic for config-driven sources.
Args:
- url: URL to check
- title: Optional article title
Returns: True if URL should be skipped
Parameters:
self, url (str), title (str, optional)
Returns: bool
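The documentation does not state which rules _should_skip_custom applies. As an illustration of the interface only, a skip predicate over a URL and an optional title might look like the sketch below. The function name `should_skip` and both pattern lists are hypothetical and are not taken from capcat.

```python
from urllib.parse import urlparse

# Hypothetical skip rules; a real source would read these from its config.
SKIP_PATH_PARTS = ("/tag/", "/author/", "/login")
SKIP_TITLE_WORDS = ("sponsored", "advertisement")


def should_skip(url: str, title: str = "") -> bool:
    """Return True if the URL or title matches a skip rule."""
    path = urlparse(url).path.lower()
    if any(part in path for part in SKIP_PATH_PARTS):
        return True
    lowered = title.lower()
    return any(word in lowered for word in SKIP_TITLE_WORDS)
```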