Team Onboarding Guide

Welcome to Capcat Development

This comprehensive guide will help new developers and product designers understand Capcat, contribute effectively, and integrate with the team.

Day 1: Environment Setup

Prerequisites Checklist

Before starting, ensure you have:

[ ] Python 3.8+ installed
[ ] Git configured with your credentials
[ ] GitHub account with repo access
[ ] Code editor (VS Code, PyCharm, etc.)
[ ] Terminal/command line proficiency
[ ] Basic understanding of Python
[ ] Familiarity with web scraping concepts

Development Environment Setup

Step 1: Clone Repository

# Clone the repository
git clone <repository-url>
cd capcat

# Check out develop branch
git checkout develop

# Create your feature branch
git checkout -b feature/your-name-onboarding

Step 2: Install Dependencies

# Automated setup (recommended)
./scripts/fix_dependencies.sh

# Manual setup (if automated fails)
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -r requirements-dev.txt  # Development tools

Step 3: Verify Installation

# Test basic functionality
./capcat list sources

# Run test suite
pytest

# Check code quality
flake8 core/ sources/
black --check .

Step 4: Configure IDE

VS Code Configuration

(.vscode/settings.json):

{
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.formatting.provider": "black",
  "python.testing.pytestEnabled": true,
  "editor.formatOnSave": true,
  "editor.rulers": [79],
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    "**/.pytest_cache": true
  }
}

PyCharm Configuration

Open Project Settings
Python Interpreter → Select venv
Tools → Black → Enable format on save
Tools → Python Integrated Tools → Set pytest as test runner

Day 1-2: Codebase Exploration

Hands-On Exercises

Exercise 1: Run Capcat (10 minutes)

# Basic fetch command
./capcat fetch hn --count 5

# Check output
cd ../News/news_$(date +%d-%m-%Y)/
ls -la
cat HackerNews_*/01_*/article.md

# Try interactive mode
./capcat catch

Learning Goals

Understand basic CLI usage
See output structure
Experience user perspective

Exercise 2: Trace Code Flow (30 minutes)

Add print statements to trace execution:

# In capcat.py, add at start of main():
print("DEBUG: Starting main()")

# In cli.py, in parse_arguments():
print(f"DEBUG: Command: {args.command}")

# In source_registry.py, in discover_sources():
print(f"DEBUG: Found {len(sources)} sources")

# Run and observe output
./capcat fetch hn --count 1 2>&1 | grep DEBUG

Learning Goals

Understand execution flow
Identify key components
Learn debugging techniques

Exercise 3: Read Source Code (45 minutes)

Pick one source and understand it completely:

# Choose a simple config-driven source
cat sources/active/config_driven/configs/iq.yaml

# Or a complex custom source
cat sources/active/custom/hn/source.py

Learning Goals

Understand source types
See config vs custom patterns
Learn scraping techniques

Exercise 4: Run Tests (15 minutes)

# Run all tests
pytest

# Run specific test file
pytest tests/test_source_registry.py

# Run with coverage
pytest --cov=core --cov=sources

# Run with verbose output
pytest -v tests/test_source_registry.py::TestSourceRegistry::test_discover_sources

Learning Goals

Understand test structure
See what's tested
Learn testing patterns

Day 3-4: First Contribution

Small Tasks for Beginners

Task 1: Fix Documentation Typo

# Find typos in docs
grep -r "teh\|taht\|adn" docs/

# Fix typo
sed -i 's/teh/the/g' docs/architecture.md

# Commit
git add docs/architecture.md
git commit -m "docs: fix typo in architecture.md"
git push origin feature/your-name-onboarding

Task 2: Add Config-Driven Source

Create a new simple source (estimated time: 30 minutes):

# sources/active/config_driven/configs/testsource.yaml
display_name: "Test Source"
base_url: "https://example.com/news/"
category: test

article_selectors:
  - .headline a
  - article h2 a

content_selectors:
  - .article-body
  - article .content

max_articles: 20

Test your source:

# Verify it's discovered
./capcat list sources | grep testsource

# Test fetching
./capcat fetch testsource --count 5

Task 3: Improve Error Message

Find an unclear error message and improve it:

# Before (unclear)
raise ValueError("Invalid config")

# After (helpful)
raise ValidationError(
    "Invalid source configuration",
    details=f"Missing required field: {missing_field}",
    suggestion=f"Add '{missing_field}' to config file",
    example=f"{missing_field}: value_here"
)

Week 1: Deep Dive Sessions

Architecture Deep Dive (2 hours)

Session Goals

Understand hybrid architecture
Learn source registry pattern
Understand processing pipeline

Activities

Draw architecture diagram from memory
Explain data flow for article fetching
Identify design patterns used
Discuss architectural decisions

Homework

Read: docs/development/01-architecture-logic.md
Exercise: Add a new extraction strategy
Question: Why hybrid architecture vs. single approach?

User Interface Deep Dive (1.5 hours)

Session Goals

Understand user workflows
Learn design principles
Experience interactive mode

Activities

Test interactive mode workflows
Complete user journey exercises
Review usability principles
Identify interface improvements

Homework

Read: Interactive mode documentation
Exercise: Observe someone using Capcat
Question: Which persona are you most like?

Testing Deep Dive (1 hour)

Session Goals

Understand testing strategy
Write effective tests
Use testing tools

Activities

Write unit test for utility function
Write integration test for source
Run test coverage report
Fix failing test

Homework

Read: docs/testing.md
Exercise: Achieve 90% coverage on new module
Question: What's hard to test and why?

Week 2: Collaboration Practices

Code Review Guidelines

Submitting Code for Review

Create Pull Request

# Ensure branch is up to date
git checkout develop
git pull origin develop
git checkout feature/your-feature
git rebase develop

# Push to GitHub
git push origin feature/your-feature

# Create PR via GitHub UI

PR Description Template

## What does this PR do?
Brief description of changes

## Why is this needed?
Problem being solved or feature being added

## How was this tested?
- [ ] Unit tests added
- [ ] Integration tests added
- [ ] Manual testing performed
- [ ] Documentation updated

## Checklist
- [ ] Code follows PEP 8
- [ ] Tests pass locally
- [ ] Documentation updated
- [ ] No breaking changes (or documented)

## Screenshots (if UI changes)
[Attach screenshots]

## Related Issues
Fixes #123

Respond to Feedback

# Make requested changes
[edit files]

# Commit changes
git add .
git commit -m "review: address feedback on error handling"

# Push updates
git push origin feature/your-feature

Reviewing Code from Others

Review Checklist

Code Quality:
[ ] Follows PEP 8 style guide
[ ] No code duplication
[ ] Functions are single-purpose
[ ] Variable names are descriptive
[ ] Comments explain "why" not "what"

Functionality:
[ ] Solves stated problem
[ ] Edge cases handled
[ ] Error handling present
[ ] No obvious bugs

Testing:
[ ] Tests included
[ ] Tests cover edge cases
[ ] Tests pass locally

Documentation:
[ ] Docstrings present
[ ] README updated if needed
[ ] Breaking changes documented

Security:
[ ] No hardcoded secrets
[ ] Input validation present
[ ] No SQL injection risks
[ ] No XSS vulnerabilities

Review Comment Templates

# Suggestion
Consider extracting this logic into a separate function for reusability.

# Question
Why did you choose approach A over approach B?

# Nitpick (optional fix)
Minor: This could be a list comprehension for readability

# Blocking (must fix)
This introduces a security vulnerability: [explain]

# Praise
Nice solution! This is much cleaner than the previous approach.

Git Workflow

Branch Naming Convention

feature/short-description    # New features
bugfix/issue-number          # Bug fixes
docs/what-documenting        # Documentation
refactor/what-refactoring    # Code refactoring
test/what-testing            # Test additions

Commit Message Format

type: short description (max 50 chars)

Longer description if needed (wrap at 72 chars).
Explain what and why, not how.

Fixes #123

Types

feat: New feature
fix: Bug fix
docs: Documentation
style: Formatting, no code change
refactor: Code restructuring
test: Adding tests
chore: Maintenance tasks

Examples

git commit -m "feat: add RSS feed introspection for auto-config"

git commit -m "fix: handle timeout errors in article fetcher

Previously, timeout errors would crash the entire batch.
Now individual article failures are logged and processing continues.

Fixes #234"

git commit -m "docs: add user journey map for researcher persona"

Communication Practices

Daily Standups (Async)

Post in team chat:

Yesterday: Completed source addition feature
Today: Working on error handling improvements
Blockers: Need review on PR #456

Weekly Planning

Review sprint goals
Assign tasks
Identify dependencies
Schedule pair programming

Documentation Updates

When to Update Docs

Adding new feature → Update relevant guide
Fixing bug → Add to troubleshooting
Changing API → Update API reference
Learning something non-obvious → Add to docs

Where to Document

Architecture changes → docs/architecture.md
New features → docs/
Bug fixes → Changelog
HOWTOs → docs/developer/

Development Workflows

Adding a New Config-Driven Source

Time Estimate

15-30 minutes

# Step 1: Create YAML config
cat > sources/active/config_driven/configs/newsource.yaml <<'EOF'
display_name: "New Source"
base_url: "https://newsource.com/articles/"
category: tech

article_selectors:
  - .article-title a
  - h2.headline a

content_selectors:
  - article .content
  - .article-body

image_selectors:
  - article img
  - .featured-image img

max_articles: 30
timeout: 30
EOF

# Step 2: Test discovery
./capcat list sources | grep newsource

# Step 3: Test fetching
./capcat fetch newsource --count 5 --debug

# Step 4: Verify output
ls -la ../News/news_$(date +%d-%m-%Y)/NewsSource_*/

# Step 5: Add to bundle (optional)
echo "    - newsource" >> sources/active/bundles.yml

# Step 6: Commit
git add sources/active/config_driven/configs/newsource.yaml
git commit -m "feat: add NewsSource config-driven source"

Adding a New Custom Source

Time Estimate

2-4 hours

# Step 1: Create source directory structure
mkdir -p sources/active/custom/newsource
cd sources/active/custom/newsource

# Step 2: Create source.py
cat > source.py <<'EOF'
from core.source_system.base_source import BaseSource, Article
from typing import List

class NewSource(BaseSource):
    """
    Custom source implementation for NewsSource.com
    """

    def __init__(self, config, session=None):
        super().__init__(config, session)
        self.api_base = "https://api.newsource.com/v1"

    def get_articles(self, count=30) -> List[Article]:
        """
        Fetch articles from NewsSource

        Args:
            count: Number of articles to fetch

        Returns:
            List of Article objects
        """
        articles = []

        # Implement fetching logic
        # ...

        return articles
EOF

# Step 3: Create config.yaml
cat > config.yaml <<'EOF'
display_name: "New Source"
source_id: newsource
source_type: custom
category: tech
has_comments: false
EOF

# Step 4: Create __init__.py
touch __init__.py

# Step 5: Test source
cd ../../../../  # Back to project root
./capcat fetch newsource --count 5

# Step 6: Write tests
cat > tests/test_newsource.py <<'EOF'
import pytest
from sources.active.custom.newsource.source import NewSource

def test_newsource_initialization():
    source = NewSource(config, session=None)
    assert source.id == 'newsource'

def test_newsource_fetch_articles():
    source = NewSource(config, session=None)
    articles = source.get_articles(count=5)
    assert len(articles) == 5
EOF

# Step 7: Run tests
pytest tests/test_newsource.py

# Step 8: Commit
git add sources/active/custom/newsource/
git add tests/test_newsource.py
git commit -m "feat: add custom NewsSource implementation"

Fixing a Bug

Workflow

# Step 1: Reproduce the bug
./capcat fetch hn --count 10  # Error occurs

# Step 2: Write failing test
cat > tests/test_bugfix.py <<'EOF'
def test_error_handling_in_fetch():
    # This should not raise exception
    source.fetch_article("invalid-url")
    # Should log error and continue
EOF

# Step 3: Run test to confirm failure
pytest tests/test_bugfix.py  # FAILS

# Step 4: Fix the bug
# Edit relevant file

# Step 5: Run test to confirm fix
pytest tests/test_bugfix.py  # PASSES

# Step 6: Run full test suite
pytest

# Step 7: Manual verification
./capcat fetch hn --count 10  # No error

# Step 8: Commit
git add [changed files]
git commit -m "fix: handle invalid URLs in article fetcher

Previously, invalid URLs would crash the entire batch.
Now individual failures are logged and processing continues.

Fixes #234"

Common Tasks

Running Tests

# All tests
pytest

# Specific test file
pytest tests/test_source_registry.py

# Specific test function
pytest tests/test_source_registry.py::test_discover_sources

# With coverage
pytest --cov=core --cov=sources --cov-report=html

# With verbose output
pytest -v

# Stop on first failure
pytest -x

# Run only failed tests
pytest --lf

Code Quality Checks

# Format code with Black
black .

# Check formatting (don't modify)
black --check .

# Lint with flake8
flake8 core/ sources/ capcat.py cli.py

# Type checking with mypy
mypy core/ sources/

# Security audit
bandit -r core/ sources/

# All checks (CI simulation)
./scripts/run_checks.sh

Documentation Generation

# Generate all documentation
python3 scripts/run_docs.py

# Generate API docs only
python3 scripts/doc_generator.py

# Generate diagrams
python3 scripts/generate_diagrams.py

# Generate user guides
python3 scripts/generate_user_guides.py

# View generated docs
open docs/index.md

Debugging Tips

Common Issues and Solutions

Issue: ModuleNotFoundError

# Problem: Module not found when running tests
ModuleNotFoundError: No module named 'core'

# Solution 1: Activate venv
source venv/bin/activate

# Solution 2: Install in editable mode
pip install -e .

# Solution 3: Add to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)

Issue: Test Failures

# Problem: Tests fail unexpectedly
================================ FAILURES =================================

# Solution 1: Check test isolation
pytest --verbose -x  # Stop on first failure

# Solution 2: Clear pytest cache
rm -rf .pytest_cache
pytest --cache-clear

# Solution 3: Run single test with print statements
pytest -s tests/test_file.py::test_function

Issue: Import Errors in Sources

# Problem: Source not discovered by registry
./capcat list sources  # Source not in list

# Solution 1: Check file structure
ls -la sources/active/custom/yoursource/
# Must have: __init__.py, source.py, config.yaml

# Solution 2: Validate config
python3 -c "import yaml; yaml.safe_load(open('sources/active/custom/yoursource/config.yaml'))"

# Solution 3: Check Python syntax
python3 -m py_compile sources/active/custom/yoursource/source.py

Debugging Techniques

Technique 1: Add Debug Logging

import logging
logger = logging.getLogger(__name__)

def problematic_function():
    logger.debug(f"Input: {input_value}")
    result = do_something(input_value)
    logger.debug(f"Result: {result}")
    return result

# Run with debug flag
./capcat --debug fetch hn --count 1

Technique 2: Use Python Debugger

def problematic_function():
    import pdb; pdb.set_trace()  # Breakpoint
    result = do_something()
    return result

# When breakpoint hits:
# (Pdb) print(variable)
# (Pdb) step  # Step into function
# (Pdb) next  # Next line
# (Pdb) continue  # Continue execution

Technique 3: Interactive Testing

# Start Python REPL
python3

>>> from sources.active.custom.hn.source import HackerNewsSource
>>> from core.source_system.source_registry import get_source_registry
>>> registry = get_source_registry()
>>> source = registry.get_source('hn')
>>> articles = source.get_articles(count=1)
>>> print(articles[0].title)

Resources

Internal Resources

Documentation
docs/ directory
Architecture diagrams
docs/diagrams/
API reference
docs/api/
Testing guide
docs/testing.md
Code examples
examples/ directory

External Resources

Python

Web Scraping

Testing

Team Contacts

Product Owner
[Name]
Tech Lead
[Name]
DevOps
[Name]

Success Checklist

End of Week 1

[ ] Environment setup complete
[ ] Capcat runs successfully
[ ] Can fetch articles from all sources
[ ] Read all required documentation
[ ] Completed hands-on exercises
[ ] Attended all deep dive sessions
[ ] Created first pull request

End of Week 2

[ ] First PR merged
[ ] Added or fixed a source
[ ] Wrote tests with >80% coverage
[ ] Participated in code review
[ ] Updated documentation
[ ] Comfortable with git workflow
[ ] Understand architecture completely

End of Month 1

[ ] Independently added new feature
[ ] Led code review for others
[ ] Improved documentation
[ ] Solved complex bug
[ ] Contributed to architecture decisions
[ ] Mentored new team member

Next Steps

After completing this onboarding:

Choose Specialization

Source development

Core architecture
Testing infrastructure
Documentation
DevOps/deployment

Take Ownership

Become expert in specific component

Lead feature development
Improve testing coverage
Enhance documentation

Contribute to Community

Answer issues on GitHub

Write blog posts
Create tutorials
Speak at meetups

Document Status

Living document

Last Updated

2025-01-06

Owner

Tech Lead

Feedback

Submit improvements via PR

Team Onboarding Guide

Welcome to Capcat Development

Day 1: Environment Setup

Prerequisites Checklist

Development Environment Setup

Step 1: Clone Repository

Step 2: Install Dependencies

Step 3: Verify Installation

Step 4: Configure IDE

VS Code Configuration

PyCharm Configuration

Day 1-2: Codebase Exploration

Recommended Reading Order

Start Here

Architecture Understanding

Development Practices

Hands-On Exercises

Exercise 1: Run Capcat (10 minutes)

Learning Goals

Exercise 2: Trace Code Flow (30 minutes)

Learning Goals

Exercise 3: Read Source Code (45 minutes)

Learning Goals

Exercise 4: Run Tests (15 minutes)

Learning Goals

Day 3-4: First Contribution

Small Tasks for Beginners

Task 1: Fix Documentation Typo

Task 2: Add Config-Driven Source

Task 3: Improve Error Message

Week 1: Deep Dive Sessions

Architecture Deep Dive (2 hours)

Session Goals

Activities

Homework

User Interface Deep Dive (1.5 hours)

Session Goals

Activities

Homework

Testing Deep Dive (1 hour)

Session Goals

Activities

Homework

Week 2: Collaboration Practices

Code Review Guidelines

Submitting Code for Review

Create Pull Request

PR Description Template

Respond to Feedback

Reviewing Code from Others

Review Checklist

Review Comment Templates

Git Workflow

Branch Naming Convention

Commit Message Format

Types

Examples

Communication Practices

Daily Standups (Async)

Weekly Planning

Documentation Updates

When to Update Docs

Where to Document

Development Workflows

Adding a New Config-Driven Source

Time Estimate

Adding a New Custom Source

Time Estimate

Fixing a Bug

Workflow

Common Tasks

Running Tests

Code Quality Checks

Documentation Generation

Debugging Tips

Common Issues and Solutions

Issue: ModuleNotFoundError

Issue: Test Failures

Issue: Import Errors in Sources

Debugging Techniques