capcat.core.source_system.feed_discovery

File: Application/capcat/core/source_system/feed_discovery.py

Description

RSS/Atom feed discovery utilities. Automatically discovers feed URLs from websites when configured feeds fail.

Functions

discover_feed_urls

def discover_feed_urls(base_url: str, timeout: int = 10) -> List[str]

Attempt to discover RSS/Atom feed URLs from a website.

Looks for:

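  • <link rel="alternate" type="application/rss+xml"> in HTML

  • <link rel="alternate" type="application/atom+xml"> in HTML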

  • Common feed paths: /feed, /rss, /atom, /feed.xml, /rss.xml

Args:

  • base_url: Base URL of the website
  • timeout: Request timeout in seconds

Returns: List of discovered feed URLs (may be empty)

Parameters:

  • base_url (str)
  • timeout (int) optional

Returns: List[str]
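
The lookup order described above can be sketched as a pure function over already-fetched HTML. This is a minimal illustration, not the module's implementation: the names candidate_feed_urls and _FeedLinkParser are hypothetical, and it assumes the HTML lookup targets <link rel="alternate"> elements with an RSS or Atom MIME type. Fetching the page and honoring timeout are left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Assumption: these MIME types mark feed links in the page head.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
# Common feed paths listed in the function description.
COMMON_FEED_PATHS = ["/feed", "/rss", "/atom", "/feed.xml", "/rss.xml"]


class _FeedLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate"> feed elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        is_feed_type = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_tokens and is_feed_type and attr.get("href"):
            self.hrefs.append(attr["href"])


def candidate_feed_urls(base_url: str, html: str) -> List[str]:
    """Hypothetical helper: declared feed links first, then common paths."""
    parser = _FeedLinkParser()
    parser.feed(html)
    candidates = [urljoin(base_url, href) for href in parser.hrefs]
    candidates += [urljoin(base_url, path) for path in COMMON_FEED_PATHS]
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))
```

Putting declared links ahead of guessed paths is a design choice of this sketch: a feed the site advertises is more likely to be valid than a conventional path.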

validate_feed

def validate_feed(content: bytes) -> bool

Quickly check whether content looks like a valid RSS/Atom feed.

Args:

  • content: Raw feed content as bytes

Returns: True if content appears to be a valid feed

Parameters:

  • content (bytes)

Returns: bool
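
One way such a quick check could work is to parse the bytes as XML and inspect the root element. The sketch below is an assumption about the approach, not the module's actual logic; looks_like_feed is a hypothetical name, and the accepted root tags (rss, feed, RDF) are chosen for illustration.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(content: bytes) -> bool:
    """Hypothetical check: True if the XML root is an RSS, Atom, or RDF element."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Not well-formed XML (including empty input), so not a feed.
        return False
    # Strip an XML namespace such as "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    return local_name in {"rss", "feed", "rdf"}
```

A root-element check is cheap but shallow: it does not confirm that the feed contains any entries.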

test_feed_url

def test_feed_url(url: str, timeout: int = 10) -> bool

Test if a URL returns a valid feed.

Args:

  • url: URL to test
  • timeout: Request timeout in seconds

Returns: True if URL returns a valid feed

Parameters:

  • url (str)
  • timeout (int) optional

Returns: bool
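
Conceptually this combines a fetch with a validation step and maps any failure to False. The following sketch shows that shape under stated assumptions: url_returns_feed, fetch_bytes, and _looks_like_feed are hypothetical names, the HTTP client (urllib) is a stand-in for whatever the module uses, and the byte-marker check is a deliberately crude validator.

```python
import urllib.request
from typing import Callable


def fetch_bytes(url: str, timeout: int = 10) -> bytes:
    """Hypothetical fetcher: download the response body with a timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def _looks_like_feed(content: bytes) -> bool:
    """Crude stand-in validator: look for a feed root marker near the start."""
    head = content[:1024].lower()
    return b"<rss" in head or b"<feed" in head or b"<rdf:rdf" in head


def url_returns_feed(
    url: str,
    timeout: int = 10,
    fetch: Callable[[str, int], bytes] = fetch_bytes,
    validate: Callable[[bytes], bool] = _looks_like_feed,
) -> bool:
    """Return True only if the URL can be fetched and the body validates."""
    try:
        content = fetch(url, timeout)
    except (OSError, ValueError):
        # Network errors (URLError is an OSError) and malformed URLs
        # both count as "not a working feed".
        return False
    return validate(content)
```

Passing fetch and validate as parameters keeps the sketch testable without network access, for example url_returns_feed("https://example.com/feed", fetch=lambda u, t: b"<rss></rss>").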

find_working_feed_url

def find_working_feed_url(base_url: str, timeout: int = 10) -> str

Discover and return the first working feed URL for a website.

Args:

  • base_url: Base URL of the website
  • timeout: Request timeout in seconds

Returns: First working feed URL found

Raises: ValueError: If no working feed URL is found

Parameters:

  • base_url (str)
  • timeout (int) optional

Returns: str
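
Because the function raises ValueError instead of returning an empty result, callers need to handle the failure case explicitly. The sketch below illustrates the first-match-or-raise pattern and one way to wrap it; first_working and feed_url_or_none are hypothetical helpers, and the candidate list and checker are passed in so the example stays independent of the real module.

```python
from typing import Callable, Iterable, Optional


def first_working(candidates: Iterable[str], is_working: Callable[[str], bool]) -> str:
    """Return the first candidate that passes the check, else raise ValueError."""
    for url in candidates:
        if is_working(url):
            return url
    raise ValueError("No working feed URL found")


def feed_url_or_none(
    candidates: Iterable[str], is_working: Callable[[str], bool]
) -> Optional[str]:
    """Caller-side wrapper: convert the ValueError into None."""
    try:
        return first_working(candidates, is_working)
    except ValueError:
        return None
```

In real use, candidates would presumably come from discover_feed_urls and is_working from test_feed_url, though how find_working_feed_url combines them internally is not stated here.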