Development Documentation

Overview

This directory contains comprehensive technical documentation for Capcat developers, including architecture details, design patterns, implementation logic, and team onboarding materials.

Document Index

1. Architecture & Logic

File

01-architecture-logic.md

Purpose

Complete technical architecture explanation for junior developers

Contents

System architecture layers (UI, orchestration, source, processing, output)
Hybrid source system (config-driven vs custom)
Core design patterns (Factory, Registry, Strategy, Session Pooling, Observer)
Complete data flow diagrams
Error handling strategy
Performance optimizations
Security considerations
Configuration management

Target Audience

Junior developers, new team members, code reviewers

Key Concepts

Hybrid architecture supporting two source types (config-driven and custom)
Registry pattern for source discovery
Factory pattern for source instantiation
Graceful degradation on errors
Connection pooling for performance

Code Examples

30+ real code snippets with explanations

2. Team Onboarding

File

02-team-onboarding.md

Purpose

Complete onboarding guide for new developers and designers

Contents

Day 1: Environment setup and verification
Day 1-2: Codebase exploration exercises
Day 3-4: First contribution tasks
Week 1: Deep dive sessions (architecture, user interface, testing)
Week 2: Collaboration practices (code review, git workflow)
Development workflows (adding sources, fixing bugs)
Common tasks and debugging tips
Success checklist and next steps

Target Audience

New team members (developers)

Time Estimate

Setup: 30 minutes
Codebase exploration: 2 days
First contribution: 3-4 days
Full onboarding: 2 weeks

Learning Path

Environment setup → Working Capcat installation
Code exploration → Understanding architecture
Small contributions → Confidence building
Independent work → Full productivity

How to Use This Documentation

For New Developers

Week 1 Roadmap

Day 1: Setup (2-4 hours)

# Follow setup in 02-team-onboarding.md
git clone <repo>
./scripts/fix_dependencies.sh
./capcat list sources  # Verify installation

Day 2: Architecture (4-6 hours)

Read 01-architecture-logic.md sections 1-3
Draw architecture diagram from memory
Complete Exercise 2: Trace code flow
Review design patterns section

Day 3: First Contribution (4-6 hours)

Pick Task 1 or 2 from onboarding guide
Add config-driven source or fix documentation
Create pull request
Iterate based on feedback

Day 4-5: Deep Understanding (8-10 hours)

Complete all hands-on exercises
Add custom source (if comfortable with Python)
Write tests for your contribution
Attend deep dive sessions

For Experienced Developers

Quick Reference

Adding New Source

Config-driven: See "Adding a New Config-Driven Source" in the onboarding doc
Custom: See "Adding a New Custom Source" in the onboarding doc, plus the design patterns section of 01-architecture-logic.md

Understanding Data Flow

Section "Complete Article Fetching Flow" in 01-architecture-logic.md
Follow execution: CLI → Registry → Factory → Source → Processing → Output

Debugging Issues

"Debugging Tips" section in onboarding doc
"Error Handling Strategy" in architecture doc
Use --debug flag for verbose logging

Code Review Guidelines

"Code Review Guidelines" in onboarding doc
Check against design patterns in architecture doc

For Product Designers

Understanding Technical Constraints

Why Hybrid Architecture?

Read: 01-architecture-logic.md "Hybrid Source Layer" section

Understand: Config-driven sources are simple (about 30 min to add); custom sources are complex (about 4 hr)
Implication: Feature complexity affects development time

Why CLI First?

Read: Architecture decisions

Understand: Automation, scriptability, power users
Implication: A GUI is secondary; the CLI is the primary interface

Performance Constraints

Read: "Performance Optimizations" in architecture doc

Understand: Network I/O is the bottleneck
Implication: Bulk operations are faster than individual ones

Privacy Architecture

Read: "Security Considerations" in architecture doc

Understand: Local-first, no telemetry by design
Implication: Analytics are limited, so user research is critical

Collaboration Points

Feature feasibility discussion
Technical constraints awareness
Implementation time estimates
Interface vs technical tradeoffs

For Code Reviewers

Review Checklist (from onboarding doc):

Architecture Compliance

[ ] Follows established patterns (Factory, Registry, Strategy)
[ ] Fits into layer architecture (doesn't violate boundaries)
[ ] Reuses existing components (no reinventing the wheel)
[ ] Maintains separation of concerns

Code Quality

[ ] PEP 8 compliant
[ ] Docstrings present and clear
[ ] No code duplication
[ ] Single responsibility principle

Testing

[ ] Unit tests for new functions
[ ] Integration tests for workflows
[ ] Edge cases covered
[ ] Tests actually test what they claim

Documentation

[ ] README updated if user-facing change
[ ] Architecture doc updated if structural change
[ ] API reference updated if interface change
[ ] Comments explain "why" not "what"

Architecture Overview

System Layers

┌─────────────────────────────────────┐
│           USER INTERFACES           │  cli.py, interactive.py
├─────────────────────────────────────┤
│         CORE ORCHESTRATION          │  capcat.py, config.py
├─────────────────────────────────────┤
│            SOURCE SYSTEM            │  Registry, Factory, Sources
├─────────────────────────────────────┤
│         CONTENT PROCESSING          │  Fetcher, MediaProcessor
├─────────────────────────────────────┤
│          OUTPUT GENERATION          │  FileWriter, HTMLGenerator
└─────────────────────────────────────┘

Key Components

Source System

SourceRegistry: Auto-discovers sources from filesystem
SourceFactory: Creates appropriate source instance
BaseSource: Abstract base for all sources
ConfigDrivenSource: YAML-based simple sources
Custom sources: Python implementations

Processing Pipeline

Article discovery (Source.get_articles)
Content fetching (ArticleFetcher)
Media processing (UnifiedMediaProcessor)
Format conversion (HTMLConverter)
File generation (FileWriter)
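The pipeline stages listed above can be pictured as a chain of functions that each take an article record and hand it to the next stage. The sketch below is only an illustration under that assumption: `ArticleSketch`, the three stage functions, and `run_pipeline` are hypothetical names, and no network or file I/O is performed. Capcat's real `ArticleFetcher`, `UnifiedMediaProcessor`, and `FileWriter` may be wired together differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class ArticleSketch:
    """Hypothetical record that accumulates results as stages run."""
    url: str
    html: str = ""
    media: List[str] = field(default_factory=list)
    output_path: str = ""


Stage = Callable[[ArticleSketch], ArticleSketch]


def fetch_stage(article: ArticleSketch) -> ArticleSketch:
    # Stand-in for content fetching; a real stage would perform an HTTP request.
    article.html = f'<p>Body of {article.url}</p><img src="cover.png">'
    return article


def media_stage(article: ArticleSketch) -> ArticleSketch:
    # Stand-in for media processing: record a media file if the page has an image.
    if "<img" in article.html:
        article.media.append("cover.png")
    return article


def write_stage(article: ArticleSketch) -> ArticleSketch:
    # Stand-in for file generation: derive an output path, write nothing.
    name = article.url.rstrip("/").rsplit("/", 1)[-1]
    article.output_path = f"output/{name}.md"
    return article


def run_pipeline(articles: Iterable[ArticleSketch],
                 stages: List[Stage]) -> Iterator[ArticleSketch]:
    """Pass each article through every stage in order, yielding it when done."""
    for article in articles:
        for stage in stages:
            article = stage(article)
        yield article


discovered = [ArticleSketch("https://example.com/news/first-post")]
results = list(run_pipeline(discovered, [fetch_stage, media_stage, write_stage]))
print(results[0].output_path)
```

Keeping each stage a plain function makes it straightforward to unit test one stage in isolation or to swap a stage for a stub.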

Design Patterns

Factory: Source creation
Registry: Source discovery
Strategy: Content extraction
Observer: Progress tracking
Singleton: Session pooling
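As a rough illustration of how the Registry and Factory patterns listed above can cooperate, here is a minimal sketch. All class and method names (`SketchSource`, `StaticSource`, `SketchRegistry`, `SketchFactory`, `register`, `lookup`, `create`) are hypothetical and are not taken from Capcat's actual `SourceRegistry` or `SourceFactory`; in particular, the real registry discovers sources from the filesystem rather than through explicit registration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class SketchSource(ABC):
    """Hypothetical stand-in for a source base class."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def get_articles(self, count: int) -> List[str]:
        ...


class StaticSource(SketchSource):
    """Toy source that returns placeholder titles without network access."""

    def get_articles(self, count: int) -> List[str]:
        return [f"{self.name} article {i}" for i in range(1, count + 1)]


class SketchRegistry:
    """Registry: maps source names to the classes that implement them."""

    def __init__(self) -> None:
        self._classes: Dict[str, Type[SketchSource]] = {}

    def register(self, name: str, cls: Type[SketchSource]) -> None:
        self._classes[name] = cls

    def lookup(self, name: str) -> Type[SketchSource]:
        if name not in self._classes:
            raise KeyError(f"Unknown source: {name}")
        return self._classes[name]

    def names(self) -> List[str]:
        return sorted(self._classes)


class SketchFactory:
    """Factory: turns a registered name into a ready-to-use instance."""

    def __init__(self, registry: SketchRegistry) -> None:
        self._registry = registry

    def create(self, name: str) -> SketchSource:
        return self._registry.lookup(name)(name)


registry = SketchRegistry()
registry.register("demo", StaticSource)
source = SketchFactory(registry).create("demo")
print(source.get_articles(2))
```

The split keeps discovery (which sources exist) separate from construction (how an instance is built), so callers only ever deal with a source name.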

Development Workflows

Quick Reference

Add Config Source (30 min):

# Create YAML config
cat > sources/active/config_driven/configs/source.yaml <<EOF
display_name: "Source Name"
base_url: "https://example.com/"
category: tech
article_selectors: [".headline a"]
content_selectors: [".article-body"]
EOF

# Test
./capcat fetch source --count 5
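
Before running the fetch, it can help to sanity-check the config. The sketch below assumes the five keys in the example above are all required, which this document does not state, and `find_config_problems` is a hypothetical helper rather than part of Capcat. It works on a plain dict; in practice the dict would come from loading the YAML file (for example with PyYAML's `yaml.safe_load`).

```python
from typing import Any, Dict, List

# Keys taken from the example config above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = (
    "display_name",
    "base_url",
    "category",
    "article_selectors",
    "content_selectors",
)


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    base_url = str(config.get("base_url", ""))
    if base_url and not base_url.startswith(("http://", "https://")):
        problems.append("base_url must start with http:// or https://")

    for key in ("article_selectors", "content_selectors"):
        if key in config:
            value = config[key]
            if not isinstance(value, list) or not value:
                problems.append(f"{key} must be a non-empty list")

    return problems


example = {
    "display_name": "Source Name",
    "base_url": "https://example.com/",
    "category": "tech",
    "article_selectors": [".headline a"],
    "content_selectors": [".article-body"],
}
print(find_config_problems(example))
```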

Add Custom Source (4 hrs):

# Create structure
mkdir -p sources/active/custom/source
cd sources/active/custom/source

# Create source.py (implement BaseSource)
# Create config.yaml
# Create __init__.py

# Test and commit
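
To give a feel for what a `source.py` might contain, here is a minimal sketch. It does not use Capcat's real `BaseSource`; `BaseSourceSketch` is a hypothetical stand-in with a single abstract method, and `ExampleSource` takes the listing HTML as an argument instead of downloading it. The actual base class may require different methods and signatures.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Dict, List, Optional


class BaseSourceSketch(ABC):
    """Hypothetical stand-in; the real BaseSource may define other methods."""

    @abstractmethod
    def get_articles(self, count: int) -> List[Dict[str, str]]:
        ...


class _LinkCollector(HTMLParser):
    """Collects the href and text of every <a> tag that has both."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Dict[str, str]] = []
        self._href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only record text that appears inside an open <a href="..."> tag.
        if self._href and data.strip():
            self.links.append({"url": self._href, "title": data.strip()})
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


class ExampleSource(BaseSourceSketch):
    def __init__(self, listing_html: str) -> None:
        # A real source would download this page; the sketch takes it as input.
        self._listing_html = listing_html

    def get_articles(self, count: int) -> List[Dict[str, str]]:
        collector = _LinkCollector()
        collector.feed(self._listing_html)
        return collector.links[:count]


page = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
print(ExampleSource(page).get_articles(1))
```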

Fix Bug

Write failing test
Fix code
Verify test passes
Run full test suite
Commit with "fix:" prefix

Add Feature

Discuss with team
Update architecture doc if needed
Implement with tests
Update user docs
Create PR with detailed description

Code Organization

Directory Structure

Application/
├── capcat.py                  # Main application
├── cli.py                     # CLI interface
├── core/                      # Core modules
│   ├── source_system/         # Source management
│   │   ├── source_registry.py
│   │   ├── source_factory.py
│   │   ├── base_source.py
│   │   └── config_driven_source.py
│   ├── article_fetcher.py
│   ├── unified_media_processor.py
│   └── config.py
├── sources/
│   └── active/
│       ├── config_driven/
│       │   └── configs/*.yaml
│       └── custom/
│           └── */source.py
├── tests/                     # Test suite
├── docs/                      # Documentation
│   ├── tutorials/             # User and exhaustive tutorials
│   └── development/           # This directory
└── scripts/                   # Utility scripts

Import Conventions

# Standard library
import os
import sys
from typing import List, Dict

# Third-party
import requests
import yaml
from bs4 import BeautifulSoup

# Local modules
from core.config import get_config
from core.source_system.base_source import BaseSource
from core.exceptions import SourceError

Naming Conventions

Files: lowercase_with_underscores.py
Classes: PascalCase
Functions: lowercase_with_underscores
Constants: UPPERCASE_WITH_UNDERSCORES
Private: _leading_underscore

Testing Strategy

Test Types

Unit Tests (fast, isolated):

# Test individual functions
def test_slugify():
    assert slugify("Hello World") == "hello_world"
    assert slugify("Test@#$123") == "test_123"
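
For reference, one possible `slugify` that satisfies the two assertions above is sketched here. It is an assumption about the behavior, not Capcat's actual implementation: it lowercases the text, collapses every run of characters outside `a-z` and `0-9` into a single underscore, and trims underscores from both ends.

```python
import re


def slugify(text: str) -> str:
    """Lowercase text and replace runs of non-alphanumerics with '_'."""
    lowered = text.lower()
    slug = re.sub(r"[^a-z0-9]+", "_", lowered)
    return slug.strip("_")


print(slugify("Hello World"))
print(slugify("Test@#$123"))
```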

Integration Tests (medium speed, component interaction):

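# Test component interactions (uses the live 'hn' source, so it needs network access)
# Import path assumed from the directory structure in this document
from core.source_system.source_registry import SourceRegistry

def test_source_registry_and_factory():
    registry = SourceRegistry()
    source = registry.get_source('hn')
    articles = source.get_articles(count=5)
    assert len(articles) == 5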

End-to-End Tests (slow, full workflow):

# Test complete user workflows
def test_fetch_command_workflow():
    result = run_capcat("fetch hn --count 5")
    assert result.success
    assert output_exists()

Coverage Goals

Core modules: 90%+
Sources: 80%+
CLI: 70%+
Overall: 85%+

Running Tests

# All tests
pytest

# With coverage
pytest --cov=core --cov=sources

# Specific module
pytest tests/test_source_registry.py

# Watch mode (re-run on change)
ptw

Common Patterns

Error Handling

try:
    result = operation()
except SpecificError as e:
    logger.error(f"Operation failed: {e}")
    return default_value
except Exception as e:
    logger.exception("Unexpected error")
    raise
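
Applied to a batch of articles, the same pattern gives graceful degradation: one failed item is logged and skipped while the rest continue. The sketch below is illustrative only. `FetchError`, `fetch_all`, and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


class FetchError(Exception):
    """Hypothetical error type for a recoverable per-article failure."""


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Fetch every URL, collecting failures instead of aborting the batch."""
    succeeded: List[str] = []
    failed: List[str] = []
    for url in urls:
        try:
            succeeded.append(fetch(url))
        except FetchError as exc:
            # Expected failure: log it, remember the URL, keep going.
            logger.error("Skipping %s: %s", url, exc)
            failed.append(url)
        # Any other exception is unexpected and propagates to the caller.
    return succeeded, failed


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP fetch.
    if "broken" in url:
        raise FetchError("simulated timeout")
    return f"content of {url}"


pages, failures = fetch_all(
    ["https://example.com/ok", "https://example.com/broken"], fake_fetch
)
print(len(pages), len(failures))
```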

Logging

logger = get_logger(__name__)

logger.debug("Detailed debug info")
logger.info("User-relevant information")
logger.warning("Something unexpected")
logger.error("Operation failed")
logger.exception("Error with traceback")
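
The `get_logger` helper used above is not shown in this document. A minimal sketch of what such a helper could look like, built on the standard `logging` module, follows; the signature, the `debug` parameter, and the log format are assumptions.

```python
import logging
import sys


def get_logger(name: str, debug: bool = False) -> logging.Logger:
    """Return a named logger with a single stderr handler attached."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if not logger.handlers:  # avoid duplicate output on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = get_logger("capcat.sketch")
log.debug("Hidden unless debug=True")
log.info("Visible at the default level")
```

Guarding on `logger.handlers` matters because `logging.getLogger` returns the same object for the same name, so an unguarded helper would add one more handler on every call.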

Configuration

# Get configuration
config = get_config()
count = config.get('default_count', 30)

# Respect CLI override (compare against None so an explicit value is never skipped)
if args.count is not None:
    count = args.count
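
The precedence described above (CLI flag, then config file, then built-in default) can be sketched end to end as follows. `resolve_count` is a hypothetical helper, the config is a plain dict rather than whatever `get_config()` returns, and the parser defines only a `--count` option.

```python
import argparse
from typing import Dict, List

DEFAULT_COUNT = 30  # fallback used when neither CLI nor config sets a value


def resolve_count(config: Dict[str, int], argv: List[str]) -> int:
    """Pick the article count: CLI flag first, then config, then default."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag omitted" apart from any real value.
    parser.add_argument("--count", type=int, default=None)
    args = parser.parse_args(argv)

    if args.count is not None:
        return args.count
    return config.get("default_count", DEFAULT_COUNT)


print(resolve_count({}, []))
print(resolve_count({"default_count": 10}, []))
print(resolve_count({"default_count": 10}, ["--count", "5"]))
```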

Performance Guidelines

Do's

Use session pooling for HTTP requests
Parallelize independent operations
Cache expensive computations
Use generators for large datasets
Lazy load when possible

Don'ts

Don't create new sessions per request
Don't process serially when you can parallelize
Don't load entire files into memory
Don't make unnecessary network calls
Don't ignore timeouts
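
Two of the guidelines above, caching expensive computations and using generators for large datasets, can be combined in a few lines. The sketch below is illustrative: `normalize_url` is a cheap stand-in for an expensive function (and lowercasing a whole URL is a simplification, since paths can be case-sensitive), while `unique_urls` is a hypothetical helper.

```python
from functools import lru_cache
from typing import Iterable, Iterator


@lru_cache(maxsize=256)
def normalize_url(url: str) -> str:
    """Stand-in for an expensive computation; results are cached per input."""
    return url.strip().rstrip("/").lower()


def unique_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct normalized URL once, one item at a time."""
    seen = set()
    for line in lines:
        url = normalize_url(line)
        if url and url not in seen:
            seen.add(url)
            yield url


raw = ["https://A.com/", "https://a.com", " ", "https://b.com/x/"]
for url in unique_urls(raw):
    print(url)
```

Because `unique_urls` is a generator, the input can be a file object or another generator, and only the set of URLs already seen is held in memory.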

Example Optimization

# Bad: Sequential processing
for article in articles:
    content = fetch_content(article.url)  # Slow
    process(content)

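# Good: Parallel processing
from concurrent.futures import ThreadPoolExecutor

article_urls = [article.url for article in articles]

with ThreadPoolExecutor(max_workers=10) as executor:
    # map() runs fetches concurrently and yields results in input order
    contents = executor.map(fetch_content, article_urls)
    for content in contents:
        process(content)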

Security Guidelines

Input Validation

def validate_count(count):
    if not 1 <= count <= 1000:
        raise ValidationError("Count must be 1-1000")
    return count

Path Sanitization

def sanitize_path(path):
    # Remove path traversal
    path = path.replace('..', '')
    # Remove absolute paths
    path = path.lstrip('/')
    return path
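
String replacement also mangles legitimate names such as `notes..txt` and does not account for backslash separators. A stricter alternative, sketched here as a suggestion rather than as Capcat's actual approach, is to resolve the path against a base directory and reject anything that ends up outside it. `resolve_inside` and the `/tmp/capcat_sketch` directory are hypothetical.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base; the check below
    # catches that case as well as '..' traversal.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Path escapes base directory: {user_path}")
    return candidate


print(resolve_inside("/tmp/capcat_sketch", "news/article.md"))

try:
    resolve_inside("/tmp/capcat_sketch", "../outside.txt")
except ValueError as exc:
    print(exc)
```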

No Secrets in Code

import os

# Bad
API_KEY = "secret_key_12345"

# Good
API_KEY = os.getenv('CAPCAT_API_KEY')
if not API_KEY:
    raise ConfigError("CAPCAT_API_KEY not set")

Contributing

Before Starting Work

Check existing issues and PRs
Discuss significant changes with team
Read relevant documentation
Understand user impact

Development Process

Create feature branch
Write failing tests
Implement feature
Make tests pass
Update documentation
Create pull request
Address review feedback
Merge when approved

Code Review Process

Self-review checklist
Automated checks (CI)
Peer review (2 approvals)
Architecture review (for significant changes)
UX review (for user-facing changes)

Resources

Internal

Codebase: GitHub repository
Documentation: docs/ directory
Examples: examples/ directory
Tests: tests/ directory

External

Python: https://docs.python.org/3/
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
Requests: https://requests.readthedocs.io/
Pytest: https://docs.pytest.org/

Contact

Tech Lead: [Name]
Senior Developer: [Name]
DevOps: [Name]

Channels

Slack: #capcat-dev
Email: dev-team@example.com
Office hours: Tue/Thu 2-3pm

Quick Command Reference

# Setup
./scripts/fix_dependencies.sh

# Run Capcat
./capcat fetch hn --count 10

# Tests
pytest
pytest --cov

# Code quality
black .
flake8 core/ sources/

# Documentation
python3 scripts/run_docs.py

# Debug mode
./capcat --debug fetch hn --count 1

Last Updated: 2025-01-06
Next Review: Quarterly
Owner: Tech Lead
Status: Active, living documentation
