This comprehensive guide will help new developers and product designers understand Capcat, contribute effectively, and integrate with the team.
Before starting, ensure you have:
[ ] Python 3.8+ installed
[ ] Git configured with your credentials
[ ] GitHub account with repo access
[ ] Code editor (VS Code, PyCharm, etc.)
[ ] Terminal/command line proficiency
[ ] Basic understanding of Python
[ ] Familiarity with web scraping concepts
# Clone the repository
git clone <repository-url>
cd capcat
# Check out develop branch
git checkout develop
# Create your feature branch
git checkout -b feature/your-name-onboarding
# Automated setup (recommended)
./scripts/fix_dependencies.sh
# Manual setup (if automated fails)
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -r requirements-dev.txt # Development tools
# Test basic functionality
./capcat list sources
# Run test suite
pytest
# Check code quality
flake8 core/ sources/
black --check .
.vscode/settings.json):
{
"python.linting.enabled": true,
"python.linting.flake8Enabled": true,
"python.formatting.provider": "black",
"python.testing.pytestEnabled": true,
"editor.formatOnSave": true,
"editor.rulers": [79],
"files.exclude": {
"**/__pycache__": true,
"**/*.pyc": true,
"**/.pytest_cache": true
}
}
README.md - Project overviewdocs/quick-start.md - Basic usagedocs/tutorials/user/01-getting-started.md - First stepsdocs/architecture.md - System designdocs/development/01-architecture-logic.md - Technical detailsdocs/architecture/components.md - Component breakdowndocs/source-development.md - Adding sourcesdocs/testing.md - Testing strategydocs/ethical-scraping.md - Scraping guidelines# Basic fetch command
./capcat fetch hn --count 5
# Check output
cd ../News/news_$(date +%d-%m-%Y)/
ls -la
cat HackerNews_*/01_*/article.md
# Try interactive mode
./capcat catch
Add print statements to trace execution:
# In capcat.py, add at start of main():
print("DEBUG: Starting main()")
# In cli.py, in parse_arguments():
print(f"DEBUG: Command: {args.command}")
# In source_registry.py, in discover_sources():
print(f"DEBUG: Found {len(sources)} sources")
# Run and observe output
./capcat fetch hn --count 1 2>&1 | grep DEBUG
Pick one source and understand it completely:
# Choose a simple config-driven source
cat sources/active/config_driven/configs/iq.yaml
# Or a complex custom source
cat sources/active/custom/hn/source.py
# Run all tests
pytest
# Run specific test file
pytest tests/test_source_registry.py
# Run with coverage
pytest --cov=core --cov=sources
# Run with verbose output
pytest -v tests/test_source_registry.py::TestSourceRegistry::test_discover_sources
# Find typos in docs
grep -r "teh\|taht\|adn" docs/
# Fix typo
sed -i 's/teh/the/g' docs/architecture.md
# Commit
git add docs/architecture.md
git commit -m "docs: fix typo in architecture.md"
git push origin feature/your-name-onboarding
Create a new simple source (estimated time: 30 minutes):
# sources/active/config_driven/configs/testsource.yaml
display_name: "Test Source"
base_url: "https://example.com/news/"
category: test
article_selectors:
- .headline a
- article h2 a
content_selectors:
- .article-body
- article .content
max_articles: 20
Test your source:
# Verify it's discovered
./capcat list sources | grep testsource
# Test fetching
./capcat fetch testsource --count 5
Find an unclear error message and improve it:
# Before (unclear)
raise ValueError("Invalid config")
# After (helpful)
raise ValidationError(
"Invalid source configuration",
details=f"Missing required field: {missing_field}",
suggestion=f"Add '{missing_field}' to config file",
example=f"{missing_field}: value_here"
)
docs/development/01-architecture-logic.mddocs/testing.md# Ensure branch is up to date
git checkout develop
git pull origin develop
git checkout feature/your-feature
git rebase develop
# Push to GitHub
git push origin feature/your-feature
# Create PR via GitHub UI
## What does this PR do?
Brief description of changes
## Why is this needed?
Problem being solved or feature being added
## How was this tested?
- [ ] Unit tests added
- [ ] Integration tests added
- [ ] Manual testing performed
- [ ] Documentation updated
## Checklist
- [ ] Code follows PEP 8
- [ ] Tests pass locally
- [ ] Documentation updated
- [ ] No breaking changes (or documented)
## Screenshots (if UI changes)
[Attach screenshots]
## Related Issues
Fixes #123
# Make requested changes
[edit files]
# Commit changes
git add .
git commit -m "review: address feedback on error handling"
# Push updates
git push origin feature/your-feature
Code Quality:
[ ] Follows PEP 8 style guide
[ ] No code duplication
[ ] Functions are single-purpose
[ ] Variable names are descriptive
[ ] Comments explain "why" not "what"
Functionality:
[ ] Solves stated problem
[ ] Edge cases handled
[ ] Error handling present
[ ] No obvious bugs
Testing:
[ ] Tests included
[ ] Tests cover edge cases
[ ] Tests pass locally
Documentation:
[ ] Docstrings present
[ ] README updated if needed
[ ] Breaking changes documented
Security:
[ ] No hardcoded secrets
[ ] Input validation present
[ ] No SQL injection risks
[ ] No XSS vulnerabilities
# Suggestion
Consider extracting this logic into a separate function for reusability.
# Question
Why did you choose approach A over approach B?
# Nitpick (optional fix)
Minor: This could be a list comprehension for readability
# Blocking (must fix)
This introduces a security vulnerability: [explain]
# Praise
Nice solution! This is much cleaner than the previous approach.
feature/short-description # New features
bugfix/issue-number # Bug fixes
docs/what-documenting # Documentation
refactor/what-refactoring # Code refactoring
test/what-testing # Test additions
type: short description (max 50 chars)
Longer description if needed (wrap at 72 chars).
Explain what and why, not how.
Fixes #123
feat: New featurefix: Bug fixdocs: Documentationstyle: Formatting, no code changerefactor: Code restructuringtest: Adding testschore: Maintenance tasksgit commit -m "feat: add RSS feed introspection for auto-config"
git commit -m "fix: handle timeout errors in article fetcher
Previously, timeout errors would crash the entire batch.
Now individual article failures are logged and processing continues.
Fixes #234"
git commit -m "docs: add user journey map for researcher persona"
Post in team chat:
Yesterday: Completed source addition feature
Today: Working on error handling improvements
Blockers: Need review on PR #456
docs/architecture.mddocs/docs/developer/# Step 1: Create YAML config
cat > sources/active/config_driven/configs/newsource.yaml <<'EOF'
display_name: "New Source"
base_url: "https://newsource.com/articles/"
category: tech
article_selectors:
- .article-title a
- h2.headline a
content_selectors:
- article .content
- .article-body
image_selectors:
- article img
- .featured-image img
max_articles: 30
timeout: 30
EOF
# Step 2: Test discovery
./capcat list sources | grep newsource
# Step 3: Test fetching
./capcat fetch newsource --count 5 --debug
# Step 4: Verify output
ls -la ../News/news_$(date +%d-%m-%Y)/NewsSource_*/
# Step 5: Add to bundle (optional)
echo " - newsource" >> sources/active/bundles.yml
# Step 6: Commit
git add sources/active/config_driven/configs/newsource.yaml
git commit -m "feat: add NewsSource config-driven source"
# Step 1: Create source directory structure
mkdir -p sources/active/custom/newsource
cd sources/active/custom/newsource
# Step 2: Create source.py
cat > source.py <<'EOF'
from core.source_system.base_source import BaseSource, Article
from typing import List
class NewSource(BaseSource):
"""
Custom source implementation for NewsSource.com
"""
def __init__(self, config, session=None):
super().__init__(config, session)
self.api_base = "https://api.newsource.com/v1"
def get_articles(self, count=30) -> List[Article]:
"""
Fetch articles from NewsSource
Args:
count: Number of articles to fetch
Returns:
List of Article objects
"""
articles = []
# Implement fetching logic
# ...
return articles
EOF
# Step 3: Create config.yaml
cat > config.yaml <<'EOF'
display_name: "New Source"
source_id: newsource
source_type: custom
category: tech
has_comments: false
EOF
# Step 4: Create __init__.py
touch __init__.py
# Step 5: Test source
cd ../../../../ # Back to project root
./capcat fetch newsource --count 5
# Step 6: Write tests
cat > tests/test_newsource.py <<'EOF'
import pytest
from sources.active.custom.newsource.source import NewSource
def test_newsource_initialization():
source = NewSource(config, session=None)
assert source.id == 'newsource'
def test_newsource_fetch_articles():
source = NewSource(config, session=None)
articles = source.get_articles(count=5)
assert len(articles) == 5
EOF
# Step 7: Run tests
pytest tests/test_newsource.py
# Step 8: Commit
git add sources/active/custom/newsource/
git add tests/test_newsource.py
git commit -m "feat: add custom NewsSource implementation"
# Step 1: Reproduce the bug
./capcat fetch hn --count 10 # Error occurs
# Step 2: Write failing test
cat > tests/test_bugfix.py <<'EOF'
def test_error_handling_in_fetch():
# This should not raise exception
source.fetch_article("invalid-url")
# Should log error and continue
EOF
# Step 3: Run test to confirm failure
pytest tests/test_bugfix.py # FAILS
# Step 4: Fix the bug
# Edit relevant file
# Step 5: Run test to confirm fix
pytest tests/test_bugfix.py # PASSES
# Step 6: Run full test suite
pytest
# Step 7: Manual verification
./capcat fetch hn --count 10 # No error
# Step 8: Commit
git add [changed files]
git commit -m "fix: handle invalid URLs in article fetcher
Previously, invalid URLs would crash the entire batch.
Now individual failures are logged and processing continues.
Fixes #234"
# All tests
pytest
# Specific test file
pytest tests/test_source_registry.py
# Specific test function
pytest tests/test_source_registry.py::test_discover_sources
# With coverage
pytest --cov=core --cov=sources --cov-report=html
# With verbose output
pytest -v
# Stop on first failure
pytest -x
# Run only failed tests
pytest --lf
# Format code with Black
black .
# Check formatting (don't modify)
black --check .
# Lint with flake8
flake8 core/ sources/ capcat.py cli.py
# Type checking with mypy
mypy core/ sources/
# Security audit
bandit -r core/ sources/
# All checks (CI simulation)
./scripts/run_checks.sh
# Generate all documentation
python3 scripts/run_docs.py
# Generate API docs only
python3 scripts/doc_generator.py
# Generate diagrams
python3 scripts/generate_diagrams.py
# Generate user guides
python3 scripts/generate_user_guides.py
# View generated docs
open docs/index.md
# Problem: Module not found when running tests
ModuleNotFoundError: No module named 'core'
# Solution 1: Activate venv
source venv/bin/activate
# Solution 2: Install in editable mode
pip install -e .
# Solution 3: Add to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)
# Problem: Tests fail unexpectedly
================================ FAILURES =================================
# Solution 1: Check test isolation
pytest --verbose -x # Stop on first failure
# Solution 2: Clear pytest cache
rm -rf .pytest_cache
pytest --cache-clear
# Solution 3: Run single test with print statements
pytest -s tests/test_file.py::test_function
# Problem: Source not discovered by registry
./capcat list sources # Source not in list
# Solution 1: Check file structure
ls -la sources/active/custom/yoursource/
# Must have: __init__.py, source.py, config.yaml
# Solution 2: Validate config
python3 -c "import yaml; yaml.safe_load(open('sources/active/custom/yoursource/config.yaml'))"
# Solution 3: Check Python syntax
python3 -m py_compile sources/active/custom/yoursource/source.py
import logging
logger = logging.getLogger(__name__)
def problematic_function():
logger.debug(f"Input: {input_value}")
result = do_something(input_value)
logger.debug(f"Result: {result}")
return result
# Run with debug flag
./capcat --debug fetch hn --count 1
def problematic_function():
import pdb; pdb.set_trace() # Breakpoint
result = do_something()
return result
# When breakpoint hits:
# (Pdb) print(variable)
# (Pdb) step # Step into function
# (Pdb) next # Next line
# (Pdb) continue # Continue execution
# Start Python REPL
python3
>>> from sources.active.custom.hn.source import HackerNewsSource
>>> from core.source_system.source_registry import get_source_registry
>>> registry = get_source_registry()
>>> source = registry.get_source('hn')
>>> articles = source.get_articles(count=1)
>>> print(articles[0].title)
docs/ directorydocs/diagrams/docs/api/docs/testing.mdexamples/ directory[ ] Environment setup complete
[ ] Capcat runs successfully
[ ] Can fetch articles from all sources
[ ] Read all required documentation
[ ] Completed hands-on exercises
[ ] Attended all deep dive sessions
[ ] Created first pull request
[ ] First PR merged
[ ] Added or fixed a source
[ ] Wrote tests with >80% coverage
[ ] Participated in code review
[ ] Updated documentation
[ ] Comfortable with git workflow
[ ] Understand architecture completely
[ ] Independently added new feature
[ ] Led code review for others
[ ] Improved documentation
[ ] Solved complex bug
[ ] Contributed to architecture decisions
[ ] Mentored new team member
After completing this onboarding: