capcat.core.formatter

File: Application/capcat/core/formatter.py

Description

HTML-to-Markdown converter for Capcat. This module converts HTML content to clean Markdown.

Functions

_normalize_url

def _normalize_url(url: str) -> str

Normalize URL by properly handling encoding/decoding issues.

Parameters:

url (str)

Returns: str
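
The page does not show the function body, so the following is only a minimal sketch of one way to handle URL encoding/decoding issues, assuming the goal is to avoid double-encoding the path. The name `normalize_url_sketch` is hypothetical, and leaving the query string untouched is an assumption of this sketch, not documented behavior of `_normalize_url`.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalize_url_sketch(url: str) -> str:
    """Hypothetical helper: percent-encode the path exactly once."""
    parts = urlsplit(url.strip())
    # Decode first so an existing "%20" is not re-encoded as "%2520",
    # then encode again so raw spaces and non-ASCII characters become safe.
    path = quote(unquote(parts.path), safe="/")
    # Assumption: query and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Decoding before encoding makes the operation idempotent: running it on an already-normalized URL returns the same URL.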

_create_smart_link

def _create_smart_link(text: str, url: str) -> str

Create a smart link that keeps functionality but has readable display text.

Parameters:

text (str)
url (str)

Returns: str

format_comment_paragraphs

def format_comment_paragraphs(comment_text: str) -> str

Format comment text with proper paragraph breaks and improved readability. This is a global utility function that all news sources can use to improve comment formatting. It returns the formatted comment text with proper paragraphs.

Parameters:

comment_text (str): Raw comment text

Returns: str
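
As an illustration of what "proper paragraph breaks" could mean, here is a minimal sketch that splits on blank lines and collapses stray whitespace. The name `format_paragraphs_sketch` and the exact rules are assumptions; the real function may apply different or additional heuristics.

```python
import re


def format_paragraphs_sketch(comment_text: str) -> str:
    """Hypothetical helper: one blank line between paragraphs, no stray whitespace."""
    if not comment_text:
        return ""
    # Normalize Windows and old Mac line endings to "\n".
    text = comment_text.replace("\r\n", "\n").replace("\r", "\n")
    # Treat any run of blank (or whitespace-only) lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace, including single newlines, to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```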

_preserve_button_images

def _preserve_button_images(soup)

Extract images from buttons before button removal destroys them.

Many sites wrap content images in <button> elements.

Parameters:

soup

html_to_markdown

def html_to_markdown(html_content: str, base_url: str = None) -> str

Convert HTML content to clean Markdown format.

Parameters:

html_content (str)
base_url (str) optional

Returns: str

_parse_srcset

def _parse_srcset(srcset: str) -> str

Parse srcset attribute and return the highest resolution image URL.

Skips data: URI entries (lazy-load SVG placeholders used by WordPress/Avada and similar CMS platforms). Returns '' (an empty string) if no real URL is found.

data: URIs may contain commas (e.g. the encoded SVG payload after the MIME type), so we re-join comma-split fragments that belong to the same data: URI before evaluating each entry.

Parameters:

srcset (str)

Returns: str

⚠️ High complexity: 23
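
The re-joining step described above can be sketched as follows. This is a simplified illustration, not the module's implementation: the names `parse_srcset_sketch`, `_merge_data_fragments`, and `_DESCRIPTOR` are hypothetical, and the sketch assumes every data: entry ends with a width or density descriptor such as `1w` or `2x`.

```python
import re

# Hypothetical pattern: a trailing descriptor such as " 300w" or " 1.5x".
_DESCRIPTOR = re.compile(r"\s(\d+(?:\.\d+)?)(w|x)$")


def _merge_data_fragments(fragments):
    """Glue comma-split pieces back together while inside a data: URI."""
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else ""
        # A data: entry is unfinished until it ends with a descriptor.
        if prev.lstrip().startswith("data:") and not _DESCRIPTOR.search(prev.rstrip()):
            merged[-1] = prev + "," + frag
        else:
            merged.append(frag)
    return merged


def parse_srcset_sketch(srcset: str) -> str:
    """Return the URL with the largest descriptor, or '' if none is usable."""
    best_url, best_score = "", -1.0
    for entry in _merge_data_fragments(srcset.split(",")):
        entry = entry.strip()
        if not entry or entry.startswith("data:"):
            continue  # skip empty pieces and lazy-load placeholders
        match = _DESCRIPTOR.search(entry)
        if match:
            url, score = entry[: match.start()].strip(), float(match.group(1))
        else:
            url, score = entry, 1.0  # no descriptor is treated as 1x
        if url and score > best_score:
            best_url, best_score = url, score
    return best_url
```

For example, `parse_srcset_sketch("data:image/svg+xml,%3Csvg%3E 1w, real.jpg 800w")` returns `"real.jpg"`. The sketch compares `w` and `x` values numerically, which is only meaningful when a srcset uses one descriptor type throughout, and it does not handle real URLs that themselves contain commas.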

_is_float

def _is_float(s: str) -> bool

Return True if s can be parsed as a float.

Parameters:

s (str)

Returns: bool
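
A check like this is commonly written with a `try`/`except` around `float()`. The sketch below uses the hypothetical name `is_float_sketch` and may differ from the module's version in which exceptions it catches.

```python
def is_float_sketch(s: str) -> bool:
    """Hypothetical helper: True if float(s) succeeds."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        return False
```

Note that Python's `float()` also accepts strings such as `"nan"` and `"inf"`, so a check of this kind returns True for them.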

_alt_from_src

def _alt_from_src(src: str) -> str

Derive a human-readable alt text from an image filename.

Parameters:

src (str)

Returns: str
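
One plausible way to derive alt text from a filename is to drop the directory, query string, and extension, then turn separators into spaces. The sketch below is illustrative only: `alt_from_src_sketch` is a hypothetical name, and the `"Image"` fallback is an assumption rather than documented behavior.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit


def alt_from_src_sketch(src: str) -> str:
    """Hypothetical helper: 'img/red-panda_cub.jpg' -> 'red panda cub'."""
    # Keep only the path so query strings and fragments are ignored.
    path = unquote(urlsplit(src).path)
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    # Replace runs of hyphens and underscores with a single space.
    words = re.sub(r"[-_]+", " ", stem).strip()
    return words or "Image"  # assumed fallback when nothing usable remains
```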

_process_images

def _process_images(soup)

Process img tags to ensure proper Markdown syntax, filtering out broken images.

Parameters:

soup

⚠️ High complexity: 16

_is_broken_image_url

def _is_broken_image_url(url: str) -> bool

Check if an image URL is likely to be broken or undownloadable.

Parameters:

url (str)

Returns: bool

_process_links

def _process_links(soup)

Process <a> tags to ensure proper Markdown syntax.

Parameters:

soup

⚠️ High complexity: 15

_process_code_blocks

def _process_code_blocks(soup)

Process pre and code tags to ensure proper Markdown code block formatting.

Parameters:

soup

⚠️ High complexity: 17

_process_media_elements

def _process_media_elements(soup)

Process audio and video elements to preserve them in the output.

Parameters:

soup

⚠️ High complexity: 42

_convert_element

def _convert_element(element, depth = 0, max_depth = 50) -> str

Recursively convert an HTML element to Markdown with improved formatting preservation.

Parameters:

element
depth optional
max_depth optional

Returns: str

⚠️ High complexity: 44

_process_list_items

def _process_list_items(list_element, marker_type, depth, start_num = 1)

Process list items with improved formatting and nesting support.

Parameters:

list_element
marker_type
depth
start_num optional

⚠️ High complexity: 20

_format_blockquote

def _format_blockquote(content)

Format blockquote content with proper Markdown quoting.

Parameters:

content
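
Markdown quoting prefixes every line of the quoted text with `>`. A minimal sketch, assuming `content` is a plain string (the page does not state its type) and using the hypothetical name `format_blockquote_sketch`:

```python
def format_blockquote_sketch(content: str) -> str:
    """Hypothetical helper: prefix each line with the Markdown quote marker."""
    lines = content.strip().split("\n")
    # Blank lines get a bare ">" so the quote is not split into two blocks.
    return "\n".join(("> " + line) if line.strip() else ">" for line in lines)
```

For example, `format_blockquote_sketch("a\n\nb")` yields three lines: `> a`, `>`, and `> b`.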

_convert_table_element

def _convert_table_element(element, children_content)

Convert table elements to Markdown table format.

Parameters:

element
children_content

_enhanced_cleanup

def _enhanced_cleanup(soup)

Enhanced cleanup for InfoQ and other sources.

Parameters:

soup

⚠️ High complexity: 90

_ends_with_descriptor

def _ends_with_descriptor(s: str) -> bool

Parameters:

s (str)

Returns: bool