media-downloader/docs/WORKER_PROCESS_ARCHITECTURE_PROPOSAL.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00


# Worker Process Architecture Proposal
## Problem Statement
Currently, all scrapers and downloaders run directly within the scheduler and API service processes. When these services restart (due to configuration changes, updates, or other reasons), any active scraping or downloading jobs are abruptly terminated, leaving downloads incomplete and requiring manual re-triggering.
**Current Issues:**
1. Scheduler service restarts kill active scrapers mid-process
2. API service restarts interrupt download operations
3. No job recovery mechanism - interrupted jobs are lost
4. Users must manually re-trigger failed/interrupted jobs
5. Long-running jobs (large downloads, full account scrapes) are particularly vulnerable
---
## Current Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Scheduler Service                                           │
│ ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐ │
│ │ Cron Logic  │ → │ Job Runner  │ → │ Scrapers/Downloaders│ │
│ │             │   │             │   │ (runs in-process)   │ │
│ └─────────────┘   └─────────────┘   └─────────────────────┘ │
│                                                             │
│ If scheduler restarts → ALL active jobs die                 │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ API Service                                                 │
│ ┌─────────────┐   ┌─────────────────────────────────────┐   │
│ │ Endpoints   │ → │ Manual Triggers (runs in-process) │     │
│ └─────────────┘   └─────────────────────────────────────┘   │
│                                                             │
│ If API restarts → ALL manual jobs die                       │
└─────────────────────────────────────────────────────────────┘
```
---
## Proposed Architecture
```
┌─────────────────────────┐          ┌─────────────────────────────┐
│ Scheduler Service       │          │ Worker Service              │
│ ┌─────────────┐         │          │  ┌───────────────────────┐  │
│ │ Cron Logic  │─────────┼─────────→│  │ Job Queue Consumer    │  │
│ │             │ Enqueue │          │  │                       │  │
│ └─────────────┘ Jobs    │          │  └───────────┬───────────┘  │
│                         │          │              │              │
│ • Light process         │          │              ▼              │
│ • Just scheduling       │          │  ┌───────────────────────┐  │
│ • Can restart safely    │          │  │ Scrapers/Downloaders  │  │
└─────────────────────────┘          │  │ (isolated execution)  │  │
                                     │  └───────────────────────┘  │
┌─────────────────────────┐          │                             │
│ API Service             │          │ • Runs independently        │
│ ┌─────────────┐         │          │ • Survives API restarts     │
│ │ Endpoints   │─────────┼─────────→│ • Survives sched restarts   │
│ │             │ Enqueue │          │ • Job recovery on crash     │
│ └─────────────┘ Jobs    │          └─────────────────────────────┘
│                         │
│ • Can restart safely    │          ┌─────────────────────────────┐
│ • Reads status from DB  │          │ SQLite Database             │
│                         │          │  ┌───────────────────────┐  │
└─────────────────────────┘          │  │ job_queue table       │  │
            │                        │  │ job_status table      │  │
            │                        │  └───────────────────────┘  │
            └───────────────────────→│                             │
              Read Status            └─────────────────────────────┘
```
---
## Database Schema
### Job Queue Table
```sql
CREATE TABLE IF NOT EXISTS worker_job_queue (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type         TEXT NOT NULL,           -- 'scrape', 'download', 'sync', etc.
    platform         TEXT NOT NULL,           -- 'instagram', 'paid_content', 'tiktok', etc.
    account          TEXT,                    -- username or NULL for system tasks
    priority         INTEGER DEFAULT 5,       -- 1=highest, 10=lowest
    status           TEXT DEFAULT 'pending',  -- 'pending', 'running', 'completed', 'failed', 'cancelled'
    payload          TEXT,                    -- JSON blob with job-specific data
    error_message    TEXT,                    -- Error details if failed
    retry_count      INTEGER DEFAULT 0,
    max_retries      INTEGER DEFAULT 3,
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP,
    started_at       DATETIME,
    completed_at     DATETIME,
    worker_id        TEXT,                    -- ID of worker process handling this job
    progress_percent INTEGER DEFAULT 0,
    progress_message TEXT
);
CREATE INDEX idx_job_queue_status ON worker_job_queue(status);
CREATE INDEX idx_job_queue_priority ON worker_job_queue(priority, created_at);
CREATE INDEX idx_job_queue_platform ON worker_job_queue(platform, account);
```
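To make the schema concrete, here is a minimal sketch of enqueuing against this table with Python's built-in `sqlite3`. The `enqueue` helper and the in-memory database are illustrative only; the real services would open the shared database file and go through `unified_database.py`.

```python
import json
import sqlite3

# In-memory DB for illustration; the real services share a database file.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS worker_job_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_type TEXT NOT NULL,
        platform TEXT NOT NULL,
        account TEXT,
        priority INTEGER DEFAULT 5,
        status TEXT DEFAULT 'pending',
        payload TEXT,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    )
""")

def enqueue(job_type, platform, account=None, priority=5, payload=None):
    """Insert a pending job; a worker picks it up on its next poll."""
    cur = conn.execute(
        "INSERT INTO worker_job_queue (job_type, platform, account, priority, payload) "
        "VALUES (?, ?, ?, ?, ?)",
        (job_type, platform, account, priority, json.dumps(payload or {})),
    )
    conn.commit()
    return cur.lastrowid

enqueue("scrape_account", "instagram", "user123", payload={"scrape_type": "full"})
enqueue("download_media", "instagram", priority=1, payload={"media_ids": [1, 2, 3]})

# Highest priority (lowest number) first, FIFO within a priority level
rows = conn.execute(
    "SELECT job_type, priority FROM worker_job_queue "
    "WHERE status = 'pending' ORDER BY priority, created_at"
).fetchall()
# → [('download_media', 1), ('scrape_account', 5)]
```

Note that the `ORDER BY priority, created_at` clause matches `idx_job_queue_priority`, so the worker's poll query can be served by that index.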
### Job Types
| job_type | Description | Example payload |
|----------|-------------|-----------------|
| `scrape_account` | Scrape a social media account | `{"username": "user123", "scrape_type": "full"}` |
| `download_media` | Download specific media items | `{"media_ids": [1, 2, 3], "source": "instagram"}` |
| `paid_content_sync` | Sync paid content from all creators | `{"creator_ids": null}` |
| `paid_content_creator` | Sync specific creator | `{"creator_id": 123}` |
| `forum_scrape` | Scrape forum threads | `{"thread_ids": [456, 789]}` |
| `youtube_monitor` | Check YouTube channels for new videos | `{}` |
| `easynews_monitor` | Check Easynews for new content | `{"search_queries": [...]}` |
| `appearances_sync` | Sync TMDb appearances | `{}` |
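The job types above imply a lookup from `job_type` string to handler code. One way to sketch that mapping (the `register`/`dispatch` names are hypothetical, not from the codebase):

```python
import json
from typing import Callable, Dict

# Hypothetical registry mapping job_type strings to handler callables
HANDLERS: Dict[str, Callable[[dict], None]] = {}

def register(job_type: str):
    """Decorator that registers a handler for one job_type."""
    def wrap(fn):
        HANDLERS[job_type] = fn
        return fn
    return wrap

@register("scrape_account")
def scrape_account(payload: dict) -> None:
    print(f"scraping {payload['username']} ({payload['scrape_type']})")

def dispatch(job_type: str, payload_json: str) -> None:
    """Decode a queued job's payload and route it to its handler."""
    handler = HANDLERS.get(job_type)
    if handler is None:
        raise ValueError(f"no handler registered for job_type={job_type!r}")
    handler(json.loads(payload_json))

dispatch("scrape_account", '{"username": "user123", "scrape_type": "full"}')
# prints: scraping user123 (full)
```

A registry like this keeps the worker loop generic: adding a new job type is just a new module under `job_handlers/` that registers itself.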
---
## Worker Service Design
### File: `/opt/media-downloader/services/worker_service.py`
```python
# Conceptual outline - not implementation
import time
import uuid


def generate_worker_id() -> str:
    """Unique ID for this worker process, e.g. 'worker-3fa9c2d1'."""
    return f"worker-{uuid.uuid4().hex[:8]}"


class WorkerService:
    """
    Independent worker service that processes jobs from the queue.
    Designed to run as a separate systemd service.
    """

    def __init__(self):
        self.worker_id = generate_worker_id()
        self.running = True
        self.current_job = None

    def run(self):
        """Main loop - poll for jobs and execute them."""
        self.recover_orphaned_jobs()
        while self.running:
            job = self.claim_next_job()
            if job:
                self.execute_job(job)
            else:
                time.sleep(5)  # No jobs, wait before polling again

    def claim_next_job(self):
        """
        Atomically claim the highest-priority pending job.
        Uses a database transaction to prevent race conditions.
        """
        # UPDATE worker_job_queue
        # SET status='running', worker_id=?, started_at=CURRENT_TIMESTAMP
        # WHERE id = (SELECT id FROM worker_job_queue
        #             WHERE status='pending'
        #             ORDER BY priority, created_at LIMIT 1)
        pass

    def execute_job(self, job):
        """Execute a job and update its status."""
        try:
            handler = self.get_handler(job.job_type)
            handler.execute(job.payload, progress_callback=self.update_progress)
            self.mark_completed(job.id)
        except Exception as e:
            self.handle_failure(job, e)

    def update_progress(self, job_id, percent, message):
        """Update job progress in the database for the UI to read."""
        # UPDATE worker_job_queue SET progress_percent=?, progress_message=? WHERE id=?
        pass

    def handle_failure(self, job, error):
        """Handle job failure - retry or mark as failed."""
        if job.retry_count < job.max_retries:
            pass  # Requeue for retry with incremented count
        else:
            pass  # Mark as permanently failed

    def recover_orphaned_jobs(self):
        """
        On startup, check for jobs marked 'running' with a stale worker_id.
        These are orphaned jobs from a previous crash - requeue them.
        """
        pass
```
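The `claim_next_job` step is the one place two workers could race, so it deserves a concrete sketch. The version below uses `BEGIN IMMEDIATE` to take SQLite's write lock before reading, which guarantees no other worker can claim the same row; it assumes autocommit mode (`isolation_level = None`) so transactions are managed explicitly. The trimmed-down schema here is for the demo only.

```python
import sqlite3

def claim_next_job(conn: sqlite3.Connection, worker_id: str):
    """Atomically claim the highest-priority pending job, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    row = conn.execute(
        "SELECT id FROM worker_job_queue WHERE status = 'pending' "
        "ORDER BY priority, created_at LIMIT 1"
    ).fetchone()
    if row is None:
        conn.execute("ROLLBACK")
        return None
    conn.execute(
        "UPDATE worker_job_queue "
        "SET status = 'running', worker_id = ?, started_at = CURRENT_TIMESTAMP "
        "WHERE id = ?",
        (worker_id, row[0]),
    )
    conn.execute("COMMIT")
    return row[0]

# Demo against a trimmed-down version of the schema
conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; we manage transactions manually
conn.execute("""
    CREATE TABLE worker_job_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_type TEXT, priority INTEGER DEFAULT 5,
        status TEXT DEFAULT 'pending', worker_id TEXT, started_at DATETIME,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("INSERT INTO worker_job_queue (job_type, priority) VALUES ('scrape_account', 5)")
conn.execute("INSERT INTO worker_job_queue (job_type, priority) VALUES ('download_media', 1)")

first = claim_next_job(conn, "worker-a")   # id 2: priority 1 beats priority 5
second = claim_next_job(conn, "worker-a")  # id 1
third = claim_next_job(conn, "worker-a")   # None: queue is empty
```

On SQLite 3.35+ the SELECT/UPDATE pair could be collapsed into a single `UPDATE ... RETURNING` statement, but the two-step form inside `BEGIN IMMEDIATE` works on older versions too.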
### Systemd Service: `/etc/systemd/system/media-downloader-worker.service`
```ini
[Unit]
Description=Media Downloader Worker Service
After=network.target

[Service]
Type=simple
User=media-downloader
WorkingDirectory=/opt/media-downloader
ExecStart=/opt/media-downloader/venv/bin/python -m services.worker_service
Restart=always
RestartSec=10
# Worker-specific settings
Environment="WORKER_CONCURRENCY=2"
Environment="WORKER_POLL_INTERVAL=5"

[Install]
WantedBy=multi-user.target
```
---
## API Changes
### Scheduler Service Changes
**Before:**
```python
def run_instagram_scrape(username):
    scraper = InstagramScraper(username)
    scraper.run()  # Blocks, runs in-process
```
**After:**
```python
def run_instagram_scrape(username):
    db.execute("""
        INSERT INTO worker_job_queue (job_type, platform, account, payload)
        VALUES ('scrape_account', 'instagram', ?, ?)
    """, [username, json.dumps({"scrape_type": "full"})])
    # Returns immediately - worker picks up job
```
### API Endpoints for Job Management
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs with filters (status, platform, etc.) |
| GET | `/api/jobs/{id}` | Get job details and progress |
| POST | `/api/jobs/{id}/cancel` | Cancel a pending/running job |
| POST | `/api/jobs/{id}/retry` | Retry a failed job |
| DELETE | `/api/jobs/{id}` | Delete a job from queue |
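As one example of what these endpoints do under the hood, the retry endpoint is just a guarded status transition. The sketch below shows the core of `POST /api/jobs/{id}/retry` as a plain function (the FastAPI wrapper is omitted); the `retry_job` name and the minimal demo schema are illustrative assumptions.

```python
import sqlite3

def retry_job(conn: sqlite3.Connection, job_id: int) -> bool:
    """
    Reset a failed (or cancelled) job to 'pending' and clear its error
    state so a worker picks it up again. Returns False if the job does
    not exist or is not in a retryable state.
    """
    cur = conn.execute(
        "UPDATE worker_job_queue "
        "SET status = 'pending', error_message = NULL, worker_id = NULL, "
        "    started_at = NULL, completed_at = NULL "
        "WHERE id = ? AND status IN ('failed', 'cancelled')",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1

# Demo with a trimmed-down schema
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE worker_job_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT, job_type TEXT,
        status TEXT DEFAULT 'pending', error_message TEXT, worker_id TEXT,
        started_at DATETIME, completed_at DATETIME
    )
""")
failed_id = conn.execute(
    "INSERT INTO worker_job_queue (job_type, status, error_message) "
    "VALUES ('scrape_account', 'failed', 'timeout')"
).lastrowid
running_id = conn.execute(
    "INSERT INTO worker_job_queue (job_type, status) "
    "VALUES ('download_media', 'running')"
).lastrowid
```

Putting the state check in the `WHERE` clause makes the endpoint idempotent and safe against races: a job that a worker has already re-claimed is simply not matched.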
### Status Endpoint Changes
The Dashboard currently shows "Currently Scraping" by checking active processes. This would change to query the job queue:
```python
@router.get("/api/scheduler/status")
def get_scheduler_status():
    running_jobs = db.query("""
        SELECT * FROM worker_job_queue
        WHERE status = 'running'
        ORDER BY started_at
    """)
    pending_jobs = db.query("""
        SELECT COUNT(*) AS count, platform
        FROM worker_job_queue
        WHERE status = 'pending'
        GROUP BY platform
    """)
    return {
        "running": [format_job(j) for j in running_jobs],
        "pending_counts": pending_jobs,
        "worker_healthy": check_worker_heartbeat()
    }
```
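The `check_worker_heartbeat()` call above is not defined anywhere yet; one simple way to implement it is a heartbeat table the worker touches on every loop iteration. The sketch below assumes a `worker_heartbeat` table and the `beat`/`check_worker_heartbeat` names, all of which are proposals rather than existing code.

```python
import sqlite3
import time

HEARTBEAT_TIMEOUT = 30  # seconds; should comfortably exceed WORKER_POLL_INTERVAL

def beat(conn: sqlite3.Connection, worker_id: str) -> None:
    """Worker side: record liveness once per main-loop iteration."""
    conn.execute(
        "INSERT INTO worker_heartbeat (worker_id, last_seen) VALUES (?, ?) "
        "ON CONFLICT(worker_id) DO UPDATE SET last_seen = excluded.last_seen",
        (worker_id, time.time()),
    )
    conn.commit()

def check_worker_heartbeat(conn: sqlite3.Connection) -> bool:
    """API side: True if any worker has reported within the timeout."""
    row = conn.execute("SELECT MAX(last_seen) FROM worker_heartbeat").fetchone()
    return row[0] is not None and (time.time() - row[0]) < HEARTBEAT_TIMEOUT

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE worker_heartbeat (worker_id TEXT PRIMARY KEY, last_seen REAL)"
)

healthy_before = check_worker_heartbeat(conn)  # False: no worker has reported yet
beat(conn, "worker-a")
healthy_after = check_worker_heartbeat(conn)   # True
```

The upsert keeps one row per worker, so the same query also supports a per-worker health view later if multiple workers run concurrently.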
---
## Files to Modify
### New Files
| File | Description |
|------|-------------|
| `/opt/media-downloader/services/worker_service.py` | Main worker service |
| `/opt/media-downloader/services/job_handlers/` | Directory for job type handlers |
| `/opt/media-downloader/services/job_handlers/instagram.py` | Instagram scrape handler |
| `/opt/media-downloader/services/job_handlers/paid_content.py` | Paid content sync handler |
| `/opt/media-downloader/services/job_handlers/tiktok.py` | TikTok scrape handler |
| `/opt/media-downloader/services/job_handlers/forum.py` | Forum scrape handler |
| `/etc/systemd/system/media-downloader-worker.service` | Systemd service file |
### Modified Files
| File | Changes |
|------|---------|
| `/opt/media-downloader/modules/unified_database.py` | Add job queue schema |
| `/opt/media-downloader/modules/scheduler.py` | Enqueue jobs instead of running directly |
| `/opt/media-downloader/web/backend/api.py` | Add job management endpoints |
| `/opt/media-downloader/web/backend/routers/scheduler.py` | Update status endpoint |
| `/opt/media-downloader/web/frontend/src/pages/Dashboard.tsx` | Display job queue status |
| `/opt/media-downloader/web/frontend/src/lib/api.ts` | Add job management API calls |
---
## Implementation Phases
### Phase 1: Foundation
1. Add job queue table to database schema
2. Create basic worker service structure
3. Implement job claiming with atomic transactions
4. Add systemd service file
### Phase 2: Job Handlers
1. Create job handler base class
2. Migrate Instagram scraper to job handler
3. Migrate Paid Content sync to job handler
4. Migrate remaining scrapers one by one
### Phase 3: Scheduler Integration
1. Modify scheduler to enqueue jobs instead of running directly
2. Update cron job triggers to use queue
3. Add job status endpoints to API
### Phase 4: UI Updates
1. Update Dashboard to show job queue status
2. Add job management UI (view, cancel, retry)
3. Show progress for long-running jobs
### Phase 5: Advanced Features
1. Job priority system
2. Concurrent job execution (configurable worker count)
3. Job dependencies (job B waits for job A)
4. Job scheduling (run at specific time)
---
## Benefits
1. **Reliability**: Downloads/scrapes survive service restarts
2. **Visibility**: Clear queue of pending work
3. **Control**: Cancel or reprioritize jobs
4. **Recovery**: Automatic retry of failed jobs
5. **Progress**: Real-time progress tracking for long jobs
6. **Scalability**: Can run multiple workers if needed
7. **Separation of Concerns**: Scheduler schedules, Worker works, API serves
---
## Risks and Mitigations
| Risk | Mitigation |
|------|------------|
| Worker crashes | Systemd auto-restart + orphan job recovery |
| Database locked | Use WAL mode, proper transaction handling |
| Job stuck running | Heartbeat timeout, automatic requeue |
| Memory leaks | Periodic worker restart, job isolation |
| Race conditions | Atomic job claiming with transactions |
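The "job stuck running" mitigation can be sketched concretely: a periodic sweep that requeues `running` jobs whose `started_at` is older than a cutoff. The `requeue_stale_jobs` name and the one-hour default are assumptions; in practice the cutoff should be tied to the heartbeat timeout. Note `started_at` is stored as a DATETIME string, so it is converted to Unix seconds with `strftime('%s', ...)` for comparison (both SQLite's `datetime('now')` and Python's `time.time()` are UTC-based, so the comparison is consistent).

```python
import sqlite3
import time

def requeue_stale_jobs(conn: sqlite3.Connection, stale_after_seconds: int = 3600) -> int:
    """
    Requeue 'running' jobs whose started_at is older than the cutoff,
    e.g. because the worker that claimed them crashed without updating
    status. Returns the number of jobs requeued.
    """
    cutoff = time.time() - stale_after_seconds
    cur = conn.execute(
        "UPDATE worker_job_queue "
        "SET status = 'pending', worker_id = NULL, started_at = NULL, "
        "    retry_count = retry_count + 1 "
        "WHERE status = 'running' "
        "AND CAST(strftime('%s', started_at) AS INTEGER) < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount

# Demo with a trimmed-down schema
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE worker_job_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT, job_type TEXT,
        status TEXT DEFAULT 'pending', worker_id TEXT,
        retry_count INTEGER DEFAULT 0, started_at DATETIME
    )
""")
# A job claimed long ago by a worker that never finished
conn.execute(
    "INSERT INTO worker_job_queue (job_type, status, worker_id, started_at) "
    "VALUES ('scrape_account', 'running', 'worker-dead', '2020-01-01 00:00:00')"
)
# A job started just now, still legitimately running
conn.execute(
    "INSERT INTO worker_job_queue (job_type, status, worker_id, started_at) "
    "VALUES ('download_media', 'running', 'worker-a', datetime('now'))"
)

requeued = requeue_stale_jobs(conn, stale_after_seconds=3600)  # 1
```

Incrementing `retry_count` here means a job that repeatedly strands a worker eventually exhausts `max_retries` instead of looping forever.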
---
## Questions to Resolve
1. **Concurrency**: Should workers run multiple jobs in parallel? How many?
2. **Priorities**: What priority scheme? User-triggered vs scheduled?
3. **Retention**: How long to keep completed/failed job records?
4. **Notifications**: Should users be notified of job completion/failure?
5. **Migration**: How to handle in-flight jobs during initial deployment?
---
## References
- Current scheduler: `/opt/media-downloader/modules/scheduler.py`
- Current scrapers: `/opt/media-downloader/modules/` (individual files: `snapchat_scraper.py`, `instaloader_module.py`, `imginn_module.py`, `paid_content/scraper.py`)
- Current downloaders: `/opt/media-downloader/modules/` (individual files: `forum_downloader.py`, `universal_video_downloader.py`, `download_manager.py`, `paid_content/embed_downloader.py`)
- Database module: `/opt/media-downloader/modules/unified_database.py`