Data Flow Diagram

flowchart TD
    Start([User Command]) --> Parse[Parse CLI Arguments]
    Parse --> Validate[Validate Configuration]
    Validate --> LoadSources[Load Source Configurations]

    LoadSources --> SourceType{Source Type?}

    SourceType -->|Config-Driven| LoadYAML[Load YAML Config]
    SourceType -->|Custom| LoadPython[Load Python Module]

    LoadYAML --> CreateSource1[Create Config-Driven Source]
    LoadPython --> CreateSource2[Create Custom Source]

    CreateSource1 --> FetchArticles[Fetch Article List]
    CreateSource2 --> FetchArticles

    FetchArticles --> ProcessParallel{Process in Parallel}

    ProcessParallel --> FetchContent[Fetch Article Content]
    FetchContent --> ExtractMedia[Extract Media URLs]
    ExtractMedia --> DownloadMedia[Download Media Files]
    DownloadMedia --> ProcessImages[Process Images]
    ProcessImages --> ConvertHTML[Convert HTML to Markdown]
    ConvertHTML --> GenerateHTML[Generate HTML Output]

    GenerateHTML --> SaveFiles[Save to File System]

    SaveFiles --> UpdateProgress[Update Progress]
    UpdateProgress --> CheckComplete{All Articles Done?}

    CheckComplete -->|No| ProcessParallel
    CheckComplete -->|Yes| Complete[Complete]

    %% Error handling
    FetchContent --> Error{Error?}
    Error -->|Yes| LogError[Log Error]
    Error -->|No| ExtractMedia
    LogError --> CheckComplete

    %% Styling
    classDef startEnd fill:#4caf50,color:#fff
    classDef process fill:#2196f3,color:#fff
    classDef decision fill:#ff9800,color:#fff
    classDef error fill:#f44336,color:#fff

    class Start,Complete startEnd
    class Parse,Validate,LoadSources,LoadYAML,LoadPython,CreateSource1,CreateSource2,FetchArticles,FetchContent,ExtractMedia,DownloadMedia,ProcessImages,ConvertHTML,GenerateHTML,SaveFiles,UpdateProgress process
    class SourceType,ProcessParallel,CheckComplete,Error decision
    class LogError error
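
The parallel-processing loop and the error branch in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `process_article`, `process_all`, and the `fetch_content` callable are hypothetical names, and only the fetch step is modeled. A failed article is logged and skipped, so the run still reaches the "All Articles Done?" check.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger("pipeline")


def process_article(url, fetch_content):
    """Run the per-article steps; only the fetch step is modeled here."""
    html = fetch_content(url)  # may raise on a network or parse failure
    # Media extraction, image processing, and conversion would follow here.
    return {"url": url, "html": html}


def process_all(urls, fetch_content, max_workers=4):
    """Process every article in parallel; log failures and keep going."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_article, u, fetch_content): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                done.append(future.result())
            except Exception as exc:  # corresponds to the "Log Error" node
                logger.error("Failed to process %s: %s", url, exc)
                failed.append(url)
    return done, failed
```

Collecting failures in a separate list, rather than aborting, mirrors the `LogError --> CheckComplete` edge: one bad article does not stop the others.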

Data Transformations

1. Input Processing

  • CLI Arguments → Configuration Object
  • Source Names → Source Instances
  • URLs → Article Metadata
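
As a rough sketch of the first two transformations, the snippet below turns an argument list into a configuration object and resolves source names to source instances. The `Config` fields, the `--output` and `--workers` flags, and `SOURCE_REGISTRY` are all assumptions made for illustration; the real CLI may use different names.

```python
import argparse
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical configuration object built from CLI arguments."""
    sources: list = field(default_factory=list)
    output_dir: str = "output"
    workers: int = 4


def parse_config(argv):
    """CLI arguments -> Config (flag names are assumptions)."""
    parser = argparse.ArgumentParser(prog="fetcher")
    parser.add_argument("sources", nargs="+")
    parser.add_argument("--output", default="output")
    parser.add_argument("--workers", type=int, default=4)
    args = parser.parse_args(argv)
    return Config(sources=args.sources, output_dir=args.output, workers=args.workers)


# Hypothetical registry mapping a source name to a factory.
SOURCE_REGISTRY = {
    "example": lambda: {"name": "example", "kind": "config-driven"},
}


def create_sources(names):
    """Source names -> source instances; unknown names fail validation."""
    unknown = [n for n in names if n not in SOURCE_REGISTRY]
    if unknown:
        raise ValueError(f"Unknown sources: {', '.join(unknown)}")
    return [SOURCE_REGISTRY[n]() for n in names]
```

For example, `parse_config(["example", "--workers", "2"])` yields a `Config` whose `sources` list can be passed straight to `create_sources`.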

2. Content Processing

  • HTML Content → Cleaned HTML
  • Cleaned HTML → Markdown Text
  • Media URLs → Local File Paths
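
The last of these transformations, media URLs to local file paths, might look something like the sketch below. It uses only the standard library; the `images/` folder and the hash-prefixed file name are assumed conventions, not the project's documented layout, and `MediaExtractor`, `local_path_for`, and `rewrite_media` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


class MediaExtractor(HTMLParser):
    """Collect absolute URLs from <img src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the article URL.
                self.urls.append(urljoin(self.base_url, src))


def local_path_for(url, media_dir="images"):
    """Map a media URL to a stable local path (layout is an assumption)."""
    name = PurePosixPath(urlparse(url).path).name or "file"
    # A short hash prefix keeps files with the same name from colliding.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{media_dir}/{digest}-{name}"


def rewrite_media(html, base_url):
    """Return a {media URL: local path} mapping for one article."""
    extractor = MediaExtractor(base_url)
    extractor.feed(html)
    return {url: local_path_for(url) for url in extractor.urls}
```

The resulting mapping can drive both the download step and the rewriting of image references in the converted Markdown.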

3. Output Generation

  • Article Data → Markdown Files
  • Media Content → Organized File Structure
  • Article + Metadata → HTML Pages
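
A minimal sketch of the output step, assuming one folder per article that holds `article.md`, `article.html`, and an `images/` subfolder. The file names, the dictionary keys (`slug`, `title`, `url`, `markdown`, `body_html`), and the `write_outputs` function are illustrative assumptions.

```python
import html
from pathlib import Path


def write_outputs(article, out_root):
    """Write the Markdown file and HTML page for one article."""
    folder = Path(out_root) / article["slug"]
    (folder / "images").mkdir(parents=True, exist_ok=True)

    # Article data -> Markdown file
    md_path = folder / "article.md"
    md_path.write_text(
        f"# {article['title']}\n\n{article['markdown']}\n", encoding="utf-8"
    )

    # Article + metadata -> HTML page (metadata values are escaped)
    title = html.escape(article["title"])
    url = html.escape(article["url"])
    page = (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<p>{url}</p>\n"
        f"{article['body_html']}\n</body></html>\n"
    )
    html_path = folder / "article.html"
    html_path.write_text(page, encoding="utf-8")
    return md_path, html_path
```

Escaping the title and URL with `html.escape` keeps characters such as `&` or `<` in metadata from breaking the generated page.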

4. Error Handling

  • Network Errors → Retry Logic
  • Parse Errors → Fallback Processing
  • File Errors → Alternative Paths
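
The first two strategies could be sketched as below. This is an illustration under stated assumptions: exponential backoff is one common choice of retry policy and is not confirmed by the source, and `with_retry` and `parse_with_fallback` are hypothetical helpers.

```python
import time


def with_retry(func, attempts=3, base_delay=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log the error
            sleep(base_delay * (2 ** attempt))


def parse_with_fallback(text, parsers):
    """Try each parser in order; return the first result that succeeds."""
    last_error = None
    for parser in parsers:
        try:
            return parser(text)
        except ValueError as exc:
            last_error = exc
    raise ValueError("all parsers failed") from last_error
```

`ConnectionError` and `TimeoutError` are subclasses of `OSError`, so the default `retry_on` covers typical network failures. Passing `sleep` as a parameter makes the delay easy to stub out in tests.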