capcat.core.formatter
File: Application/capcat/core/formatter.py
Description
HTML-to-Markdown converter for Capcat. This module converts HTML content to clean Markdown.
Functions
_normalize_url
def _normalize_url(url: str) -> str
Normalize URL by properly handling encoding/decoding issues.
Parameters:
url(str)
Returns: str
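The description does not say which encoding problems are handled. As a minimal sketch, assuming the goal is to percent-encode the path exactly once so that already-encoded URLs are not double-encoded, the idea can be illustrated as follows. The name `normalize_url_sketch` is hypothetical and this is not the module's actual implementation.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalize_url_sketch(url: str) -> str:
    """Hypothetical illustration: encode the URL path exactly once."""
    parts = urlsplit(url.strip())
    # Decode first, then re-encode, so "%20" and a literal space
    # both end up as a single "%20" instead of "%2520".
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

Under this assumption the function is idempotent: feeding its output back in returns the same string.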
_create_smart_link
def _create_smart_link(text: str, url: str) -> str
Create a smart link that keeps functionality but has readable display text.
Parameters:
text(str)
url(str)
Returns: str
format_comment_paragraphs
def format_comment_paragraphs(comment_text: str) -> str
Format comment text with proper paragraph breaks and improved readability. This is a global utility function for all news sources to improve comment formatting.
Args: comment_text: Raw comment text
Returns: Formatted comment text with proper paragraphs
Parameters:
comment_text(str)
Returns: str
_preserve_button_images
def _preserve_button_images(soup)
Extract images from buttons before button removal destroys them.
Many sites wrap content images in
Parameters:
soup
html_to_markdown
def html_to_markdown(html_content: str, base_url: str = None) -> str
Convert HTML content to clean markdown format.
Parameters:
html_content(str)
base_url(str) optional
Returns: str
_parse_srcset
def _parse_srcset(srcset: str) -> str
Parse srcset attribute and return the highest resolution image URL.
Skips data: URI entries (lazy-load SVG placeholders used by WordPress/Avada and similar CMS platforms). Returns an empty string ('') if no real URL is found.
data: URIs may contain commas (e.g. the encoded SVG payload after the MIME type), so we re-join comma-split fragments that belong to the same data: URI before evaluating each entry.
Parameters:
srcset(str)
Returns: str
⚠️ High complexity: 23
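The re-join and selection steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's code: the names `parse_srcset_sketch` and `ends_with_descriptor` are hypothetical, each data: entry is assumed to end in a width or density descriptor such as `1w` or `2x`, and real image URLs are assumed not to contain commas.

```python
import re

# A descriptor is a number followed by "w" (width) or "x" (density),
# separated from the URL by whitespace.
_DESCRIPTOR_AT_END = re.compile(r"\s\d+(?:\.\d+)?[wx]$")


def ends_with_descriptor(s: str) -> bool:
    """Hypothetical helper: True if s ends in a descriptor like ' 640w'."""
    return _DESCRIPTOR_AT_END.search(s) is not None


def parse_srcset_sketch(srcset: str) -> str:
    # Step 1: split on commas, then glue back fragments that belong to
    # an unfinished data: URI (its payload may itself contain commas).
    entries = []
    for fragment in srcset.split(","):
        fragment = fragment.strip()
        if not fragment:
            continue
        previous = entries[-1] if entries else ""
        if previous.startswith("data:") and not ends_with_descriptor(previous):
            entries[-1] = previous + "," + fragment
        else:
            entries.append(fragment)

    # Step 2: skip data: entries and keep the URL with the largest descriptor.
    best_url, best_score = "", -1.0
    for entry in entries:
        if entry.startswith("data:"):
            continue
        parts = entry.split()
        score = 1.0  # an entry without a descriptor is treated as 1x
        if len(parts) > 1:
            match = re.fullmatch(r"(\d+(?:\.\d+)?)[wx]", parts[1])
            if match:
                score = float(match.group(1))
        if score > best_score:
            best_url, best_score = parts[0], score
    return best_url
```

Comparing `w` and `x` values on the same numeric scale is a simplification. A valid srcset uses one descriptor kind throughout, so the comparison is only meaningful within a single kind.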
_is_float
def _is_float(s: str) -> bool
Return True if s can be parsed as a float.
Parameters:
s(str)
Returns: bool
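A check like this is commonly written as a guarded call to `float()`. The sketch below shows that pattern under the hypothetical name `is_float_sketch`; the module's own body may differ.

```python
def is_float_sketch(s: str) -> bool:
    """Hypothetical illustration: True if float(s) succeeds."""
    try:
        float(s)
        return True
    except ValueError:
        return False
```

Note that `float()` also accepts strings such as `"nan"` and `"inf"`, so a check written this way returns True for them.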
_alt_from_src
def _alt_from_src(src: str) -> str
Derive a human-readable alt text from an image filename.
Parameters:
src(str)
Returns: str
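One plausible way to derive alt text from a filename is to take the last path segment, drop the extension, and turn separators into spaces. The sketch below assumes exactly that. The name `alt_from_src_sketch` is hypothetical, and the real function may apply different rules such as capitalisation or a fallback string.

```python
import posixpath
import re
from urllib.parse import unquote, urlparse


def alt_from_src_sketch(src: str) -> str:
    """Hypothetical illustration: 'team-photo_2024.jpg' -> 'team photo 2024'."""
    # Ignore the query string and fragment; decode percent-escapes in the path.
    path = unquote(urlparse(src).path)
    stem, _extension = posixpath.splitext(posixpath.basename(path))
    # Treat runs of hyphens and underscores as word separators.
    return re.sub(r"[-_]+", " ", stem).strip()
```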
_process_images
def _process_images(soup)
Process img tags to ensure proper Markdown syntax, filtering out broken images.
Parameters:
soup
⚠️ High complexity: 16
_is_broken_image_url
def _is_broken_image_url(url: str) -> bool
Check if an image URL is likely to be broken or undownloadable.
Parameters:
url(str)
Returns: bool
_process_links
def _process_links(soup)
Process anchor (a) tags to ensure proper Markdown syntax.
Parameters:
soup
⚠️ High complexity: 15
_process_code_blocks
def _process_code_blocks(soup)
Process pre and code tags to ensure proper Markdown code block formatting.
Parameters:
soup
⚠️ High complexity: 17
_process_media_elements
def _process_media_elements(soup)
Process audio and video elements to preserve them in the output.
Parameters:
soup
⚠️ High complexity: 42
_convert_element
def _convert_element(element, depth = 0, max_depth = 50) -> str
Recursively convert an HTML element to Markdown with improved formatting preservation.
Parameters:
element
depth optional
max_depth optional
Returns: str
⚠️ High complexity: 44
_process_list_items
def _process_list_items(list_element, marker_type, depth, start_num = 1)
Process list items with improved formatting and nesting support.
Parameters:
list_element
marker_type
depth
start_num optional
⚠️ High complexity: 20
_format_blockquote
def _format_blockquote(content)
Format blockquote content with proper markdown quoting.
Parameters:
content
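Markdown quoting prefixes every line of the quoted text with `>`. A minimal sketch of that rule is shown below, assuming `content` is a plain string. The name `format_blockquote_sketch` is hypothetical and the module's handling of nesting or surrounding whitespace may differ.

```python
def format_blockquote_sketch(content: str) -> str:
    """Hypothetical illustration: prefix each line with a Markdown quote marker."""
    lines = content.strip().split("\n")
    # Blank lines get a bare ">" so the quote is not split into two blocks.
    return "\n".join(("> " + line) if line.strip() else ">" for line in lines)
```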
_convert_table_element
def _convert_table_element(element, children_content)
Convert table elements to markdown table format.
Parameters:
element
children_content
_enhanced_cleanup
def _enhanced_cleanup(soup)
Enhanced cleanup for InfoQ and other sources.
Parameters:
soup
⚠️ High complexity: 90
_ends_with_descriptor
def _ends_with_descriptor(s: str) -> bool
Parameters:
s(str)
Returns: bool