Developer Guide
Getting Started
Prerequisites
- Python 3.9+
- pip (Python package manager)
- git
Setup Development Environment
# Clone the repository
git clone <repository-url>
cd capcat
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install development dependencies
pip install -r requirements-dev.txt
# Verify installation
capcat list sources
Project Structure
Application/
├── capcat/                          # Main package
│   ├── __init__.py                  # Version and entry point
│   ├── cli.py                       # CLI entry point, Global Settings template
│   ├── core/                        # Core functionality
│   │   ├── config.py                # Configuration management
│   │   ├── article_fetcher.py       # Article processing
│   │   ├── unified_article_processor.py
│   │   ├── unified_media_processor.py
│   │   ├── source_system/           # Source management framework
│   │   └── config/                  # Config subsystem
│   ├── sources/                     # News source implementations
│   │   └── builtin/
│   │       ├── config_driven/       # YAML-configured sources
│   │       ├── custom/              # Python-implemented sources (hn, lb, etc.)
│   │       └── bundles.yml          # Bundle definitions
│   └── htmlgen/                     # HTML generation system
├── tests/                           # Test suite
├── scripts/                         # Build and doc generation scripts
├── docs/                            # Documentation (GitHub Pages)
├── pyproject.toml
└── README.md
Development Workflow
Adding a New Source
Option 1: Config-Driven Source (Simple)
- Create YAML configuration:
# capcat/sources/builtin/config_driven/configs/newsource.yaml
display_name: "New Source"
base_url: "https://newsource.com/"
category: tech
article_selectors: [".headline a"]
content_selectors: [".article-content"]
- Verify the source:
capcat fetch newsource --count 5
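Before running the fetch, it can help to sanity-check a new config. The following is a minimal sketch, not part of capcat: the `validate_source_config` helper is hypothetical, and treating all five keys from the example above as required is an assumption rather than documented behavior. It works on an already-parsed dictionary, so it does not depend on any particular YAML library.

```python
from urllib.parse import urlparse

# Keys copied from the example config above. Treating every one of them
# as required is an assumption, not documented capcat behavior.
REQUIRED_KEYS = (
    "display_name",
    "base_url",
    "category",
    "article_selectors",
    "content_selectors",
)


def validate_source_config(config: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []

    # Report every missing key rather than stopping at the first one.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")

    # base_url should be an absolute http(s) URL.
    base_url = config.get("base_url", "")
    parsed = urlparse(base_url)
    if base_url and (parsed.scheme not in ("http", "https") or not parsed.netloc):
        problems.append(f"base_url is not an absolute http(s) URL: {base_url}")

    # Selector entries should be non-empty lists.
    for key in ("article_selectors", "content_selectors"):
        if key in config:
            value = config[key]
            if not isinstance(value, list) or not value:
                problems.append(f"{key} must be a non-empty list")

    return problems


example = {
    "display_name": "New Source",
    "base_url": "https://newsource.com/",
    "category": "tech",
    "article_selectors": [".headline a"],
    "content_selectors": [".article-content"],
}
print(validate_source_config(example))
```

An empty list means the config passed these basic checks; it says nothing about whether the selectors actually match the site's markup.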
Option 2: Custom Source (Advanced)
- Create source directory:
mkdir -p capcat/sources/builtin/custom/newsource
- Implement source class:
# capcat/sources/builtin/custom/newsource/source.py
from capcat.core.source_system.base_source import BaseSource


class NewSource(BaseSource):
    def __init__(self):
        super().__init__()
        self.name = "newsource"
        self.display_name = "New Source"

    def get_articles(self, count=30):
        # Implementation here
        pass

    def get_article_content(self, url):
        # Implementation here
        pass
- Validate implementation:
capcat fetch newsource --count 5
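The body of get_articles is left open above. As one possible starting point, here is a sketch of the link-extraction step using only the standard library. `LinkCollector` and `extract_links` are hypothetical names, not capcat APIs, and the sketch collects every anchor on the page; a real source would likely restrict this to its headline elements and return whatever article type BaseSource expects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (title, absolute_url) pairs from <a href="..."> elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the source's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.links.append((title, self._href))
            self._href = None


def extract_links(html: str, base_url: str, count: int = 30):
    """Return at most `count` (title, url) pairs found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links[:count]


sample = '<h2 class="headline"><a href="/a/1">First</a></h2>'
print(extract_links(sample, "https://newsource.com/"))
```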
Code Style Guidelines
Follow PEP 8 standards:
- 4 spaces for indentation
- Maximum line length: 79 characters
- Use descriptive variable names
- Add type hints to function signatures
- Write docstrings for all public functions
Example:
from pathlib import Path
from typing import Optional


def process_article(url: str, output_dir: Path) -> Optional[Article]:
    """
    Process a single article from URL.

    Args:
        url: Article URL to process
        output_dir: Directory for output files

    Returns:
        Article object if successful, None otherwise

    Raises:
        SourceError: If article cannot be fetched
        FileSystemError: If output cannot be written
    """
    # Implementation here
    pass
Development Validation
Verify your changes work correctly:
# Fetch from source
capcat fetch sourcename --count 5
# Try bundle
capcat bundle tech --count 10
Documentation
Update documentation when making changes:
# Generate documentation
python scripts/doc_generator.py
# Update architecture diagrams
python scripts/generate_diagrams.py
Debugging
Enable debug logging:
# Enable debug mode
export CAPCAT_DEBUG=1
capcat fetch hn --count 5
# Or use Python directly
python -m capcat --debug fetch hn --count 5
Common debugging techniques:
- Source Issues: Check robots.txt and rate limiting
- Media Problems: Verify URLs and file permissions
- HTML Parsing: Use browser dev tools to inspect selectors
- Performance: Profile with cProfile for bottlenecks
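For the profiling tip, a minimal sketch using the standard-library cProfile and pstats modules. `profile_call` and `slow_sum` are illustrative names only; in practice you would pass whichever function you suspect is slow.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 10, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n: int) -> int:
    # Stand-in workload; replace with the code path under investigation.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 10_000)
print(report)
```

Sorting by cumulative time puts the functions that dominate the whole call tree at the top, which is usually the quickest way to spot a bottleneck.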
Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature/new-feature
- Make your changes following the guidelines
- Update documentation as needed
- Commit with descriptive messages
- Push to your fork and create a pull request
Performance Considerations
- Parallel Processing: Use ThreadPoolExecutor for concurrent operations
- Session Pooling: Reuse HTTP connections via SessionPool
- Rate Limiting: Respect source rate limits (1 req/10 sec default)
- Memory Usage: Process articles in batches for large collections
- Caching: Implement caching for frequently accessed data
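The parallel-processing and batching points above can be sketched together as follows. This is an illustration under stated assumptions, not capcat's implementation: `batched` and `process_in_batches` are hypothetical helpers, and `worker` stands in for whatever function fetches or processes one URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(urls, worker, batch_size=10, max_workers=4):
    """Run worker(url) concurrently, one batch at a time.

    Returns (results, errors): two dicts keyed by URL. A failure in one
    URL is recorded in `errors` and does not stop the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(urls, batch_size):
            # Only one batch of futures is alive at a time, which bounds
            # the number of pending tasks held in memory.
            futures = {pool.submit(worker, url): url for url in batch}
            for future in as_completed(futures):
                url = futures[future]
                try:
                    results[url] = future.result()
                except Exception as exc:
                    errors[url] = exc
    return results, errors


def fake_worker(url):
    # Placeholder for real per-article work.
    return len(url)


done, failed = process_in_batches(["a", "bb", "ccc"], fake_worker, batch_size=2)
print(done, failed)
```

Note that a thread pool does not enforce a rate limit on its own; requests to a single source would still need to be spaced out separately.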
Error Handling
Implement comprehensive error handling:
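import logging

from capcat.core.exceptions import SourceError, FileSystemError

logger = logging.getLogger(__name__)


def fetch_or_none(url):
    # Illustrative wrapper; fetch_article stands in for the real fetch call
    try:
        article = fetch_article(url)
    except SourceError as e:
        # Recoverable: log and skip this article
        logger.error(f"Source error: {e}")
        return None
    except FileSystemError as e:
        # Not recoverable here: log and let the caller decide
        logger.error(f"File system error: {e}")
        raise
    except Exception as e:
        # logger.exception also records the traceback
        logger.exception(f"Unexpected error: {e}")
        return None
    return article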
Security Considerations
- Validate all external inputs
- Sanitize HTML content before processing
- Use secure file paths (avoid path traversal)
- Implement rate limiting to prevent abuse
- Follow ethical scraping guidelines
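As an illustration of the secure file path point, here is a minimal sketch of a path-traversal check using pathlib. `safe_join` is a hypothetical helper, not a capcat function; `Path.is_relative_to` requires Python 3.9+, which matches the prerequisites above.

```python
from pathlib import Path


def safe_join(base_dir: Path, untrusted_name: str) -> Path:
    """Join untrusted_name onto base_dir, refusing paths that escape it."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison below is made on the real target location.
    candidate = (base / untrusted_name).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes output directory: {untrusted_name}")
    return candidate


output_dir = Path("output")
print(safe_join(output_dir, "article/index.html"))
```

An absolute name such as /etc/passwd is rejected as well, because joining an absolute path onto a base directory discards the base.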