capcat.core.content_sanitizer

File: Application/capcat/core/content_sanitizer.py

Description

Content Sanitizer - Archive isolation for Capcat.

Strips tracking, analytics, scripts, and other dangerous elements from archived content. Runs as a single pass at the end of the processing pipeline, before the file is written. Always enabled; there is no config toggle.

Functions

sanitize

def sanitize(content: str, mode: str = 'markdown') -> str

Sanitize content for complete archive isolation.

Args:

content: Raw content string (markdown or HTML).
mode: "markdown" or "html". Defaults to "markdown".

Returns: Sanitized content with dangerous elements removed.

Parameters:

content (str)
mode (str) optional

Returns: str
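
A minimal sketch of how a single-pass sanitizer with this signature might dispatch on mode. Everything here is an illustrative assumption rather than Capcat's actual rule set: the name sanitize_sketch, the script-stripping pattern, the comment-stripping step applied only in HTML mode, and the ValueError raised for an unknown mode.

```python
import re

# Hypothetical stand-in rule: drop <script>...</script> elements entirely.
_SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Hypothetical HTML-only hardening rule: drop HTML comments.
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def sanitize_sketch(content: str, mode: str = "markdown") -> str:
    """Illustrative stand-in for sanitize(); not the real implementation."""
    if mode not in ("markdown", "html"):
        # Assumption: unknown modes are rejected rather than silently ignored.
        raise ValueError(f"unsupported mode: {mode!r}")
    cleaned = _SCRIPT_RE.sub("", content)  # applied in both modes
    if mode == "html":
        cleaned = _COMMENT_RE.sub("", cleaned)  # extra step for HTML output
    return cleaned
```

Called with the default mode, the sketch removes only the script element; passing mode="html" also applies the extra hardening step.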

_strip_dangerous_html

def _strip_dangerous_html(content: str) -> str

Strip dangerous HTML elements from content (rules M1-M9).

Parameters:

content (str)

Returns: str

_strip_tracking_heuristics

def _strip_tracking_heuristics(content: str) -> str

Detect and remove tracking elements by heuristic patterns.

Parameters:

content (str)

Returns: str

_apply_html_hardening

def _apply_html_hardening(content: str) -> str

Apply HTML-specific hardening rules (H1-H4).

Parameters:

content (str)

Returns: str

_stash_code_block

def _stash_code_block(match)

Parameters:

match

_restore_code_block

def _restore_code_block(match)

Parameters:

match
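
The names _stash_code_block and _restore_code_block suggest the common pattern of swapping code blocks for placeholders before sanitizing, then swapping them back, so that code samples are not mangled. The following is a sketch of that pattern under stated assumptions: the placeholder format, the fenced-block pattern, and the helper name sanitize_outside_code are hypothetical and not taken from the module.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable
_FENCE_RE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
_PLACEHOLDER_RE = re.compile(r"\x00CODE(\d+)\x00")  # hypothetical placeholder format


def sanitize_outside_code(content, scrub):
    """Apply scrub() to content while leaving fenced code blocks untouched."""
    stash = []

    def stash_block(match):
        # re.sub callback: remember the block and emit an indexed placeholder.
        stash.append(match.group(0))
        return f"\x00CODE{len(stash) - 1}\x00"

    def restore_block(match):
        # re.sub callback: look the original block up by its index.
        return stash[int(match.group(1))]

    protected = _FENCE_RE.sub(stash_block, content)
    scrubbed = scrub(protected)
    return _PLACEHOLDER_RE.sub(restore_block, scrubbed)
```

Both callbacks take a single match argument, which matches the shape of the signatures documented above; whether the module keeps its stash in a closure or elsewhere is not stated in the source.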

_remove_tracker_img

def _remove_tracker_img(match)

Parameters:

match

_clean_style_url

def _clean_style_url(match)

Parameters:

match

_is_heuristic_tracker

def _is_heuristic_tracker(tag: str) -> bool

Check whether an HTML tag matches the tracking heuristics.

Parameters:

tag (str)

Returns: bool
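
To illustrate what a heuristic check of this shape could look like, here is a sketch paired with a re.sub callback that drops matching images. The heuristics themselves (tiny dimensions, tracker-like words in the src) and the names is_tracker_sketch and strip_trackers_sketch are assumptions made for the example; the module's real heuristics are not documented here.

```python
import re

_ATTR_RE = re.compile(r'([\w-]+)\s*=\s*["\']([^"\']*)["\']')
_IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
_TRACKER_HINTS = ("pixel", "beacon", "track")  # hypothetical keyword list


def is_tracker_sketch(tag: str) -> bool:
    """Return True if the tag looks like a tracking pixel (illustrative rules)."""
    attrs = {name.lower(): value for name, value in _ATTR_RE.findall(tag)}
    # Assumed heuristic 1: images declared as 0x0 or 1x1 are treated as beacons.
    if attrs.get("width") in ("0", "1") and attrs.get("height") in ("0", "1"):
        return True
    # Assumed heuristic 2: tracker-like words in the source URL.
    src = attrs.get("src", "").lower()
    return any(hint in src for hint in _TRACKER_HINTS)


def strip_trackers_sketch(content: str) -> str:
    """Remove <img> tags flagged by is_tracker_sketch; keep all others."""
    def remove_if_tracker(match):
        tag = match.group(0)
        return "" if is_tracker_sketch(tag) else tag

    return _IMG_RE.sub(remove_if_tracker, content)
```

Keeping the predicate separate from the substitution callback makes the heuristics easy to test on single tags.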

_remove_external_stylesheet

def _remove_external_stylesheet(match)

Parameters:

match