media-downloader/docs/WORKER_PROCESS_ARCHITECTURE_PROPOSAL.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00

Worker Process Architecture Proposal

Problem Statement

Currently, all scrapers and downloaders run directly within the scheduler and API service processes. When these services restart (due to configuration changes, updates, or other reasons), any active scraping or downloading jobs are abruptly terminated, leaving downloads incomplete and requiring manual re-triggering.

Current Issues:

  1. Scheduler service restarts kill active scrapers mid-process
  2. API service restarts interrupt download operations
  3. No job recovery mechanism - interrupted jobs are lost
  4. Users must manually re-trigger failed/interrupted jobs
  5. Long-running jobs (large downloads, full account scrapes) are particularly vulnerable

Current Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Scheduler Service                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │ Cron Logic  │→ │ Job Runner  │→ │ Scrapers/Downloaders│  │
│  │             │  │             │  │ (runs in-process)   │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                                                              │
│  If scheduler restarts → ALL active jobs die                │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      API Service                             │
│  ┌─────────────┐  ┌─────────────────────────────────────┐   │
│  │ Endpoints   │→ │ Manual Triggers (runs in-process)   │   │
│  └─────────────┘  └─────────────────────────────────────┘   │
│                                                              │
│  If API restarts → ALL manual jobs die                      │
└─────────────────────────────────────────────────────────────┘

Proposed Architecture

┌─────────────────────────┐     ┌─────────────────────────────┐
│   Scheduler Service     │     │      Worker Service         │
│  ┌─────────────┐        │     │  ┌───────────────────────┐  │
│  │ Cron Logic  │───────────────→ │ Job Queue Consumer    │  │
│  │             │  Enqueue │     │ │                       │  │
│  └─────────────┘  Jobs   │     │ └───────────┬───────────┘  │
│                          │     │             │              │
│  • Light process         │     │             ▼              │
│  • Just scheduling       │     │  ┌───────────────────────┐ │
│  • Can restart safely    │     │  │ Scrapers/Downloaders  │ │
└─────────────────────────┘     │  │ (isolated execution)  │ │
                                 │  └───────────────────────┘ │
┌─────────────────────────┐     │                             │
│     API Service         │     │  • Runs independently       │
│  ┌─────────────┐        │     │  • Survives API restarts    │
│  │ Endpoints   │───────────────→ │  • Survives sched restarts│
│  │             │  Enqueue │     │  • Job recovery on crash   │
│  └─────────────┘  Jobs   │     └─────────────────────────────┘
│                          │
│  • Can restart safely    │     ┌─────────────────────────────┐
│  • Reads status from DB  │     │        SQLite Database      │
│                          │     │  ┌───────────────────────┐  │
└──────────────────────────┘     │  │ job_queue table       │  │
              │                  │  │ job_status table      │  │
              │                  │  └───────────────────────┘  │
              └─────────────────→│                             │
                 Read Status     └─────────────────────────────┘

Database Schema

Job Queue Table

CREATE TABLE IF NOT EXISTS worker_job_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type TEXT NOT NULL,          -- 'scrape', 'download', 'sync', etc.
    platform TEXT NOT NULL,          -- 'instagram', 'paid_content', 'tiktok', etc.
    account TEXT,                    -- username or null for system tasks
    priority INTEGER DEFAULT 5,      -- 1=highest, 10=lowest
    status TEXT DEFAULT 'pending',   -- 'pending', 'running', 'completed', 'failed', 'cancelled'
    payload TEXT,                    -- JSON blob with job-specific data
    error_message TEXT,              -- Error details if failed
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    started_at DATETIME,
    completed_at DATETIME,
    worker_id TEXT,                  -- ID of worker process handling this job
    progress_percent INTEGER DEFAULT 0,
    progress_message TEXT
);

CREATE INDEX IF NOT EXISTS idx_job_queue_status ON worker_job_queue(status);
CREATE INDEX IF NOT EXISTS idx_job_queue_priority ON worker_job_queue(priority, created_at);
CREATE INDEX IF NOT EXISTS idx_job_queue_platform ON worker_job_queue(platform, account);
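
All three services (scheduler, API, and worker) open this same SQLite file, so the connection should be configured for concurrent access. Below is a minimal sketch using Python's built-in `sqlite3` module; the helper name and the 5-second timeout are illustrative choices, and WAL mode plus a busy timeout are the mitigations listed under Risks below:

```python
import sqlite3

def open_queue_db(path):
    """Open the shared queue database with multi-process-friendly settings."""
    conn = sqlite3.connect(path, timeout=5.0)
    # WAL lets the worker write job rows while the API reads status.
    conn.execute("PRAGMA journal_mode=WAL")
    # Retry for up to 5s when another process holds the write lock,
    # instead of failing immediately with "database is locked".
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```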

Job Types

| job_type | Description | Example payload |
|----------|-------------|-----------------|
| `scrape_account` | Scrape a social media account | `{"username": "user123", "scrape_type": "full"}` |
| `download_media` | Download specific media items | `{"media_ids": [1, 2, 3], "source": "instagram"}` |
| `paid_content_sync` | Sync paid content from all creators | `{"creator_ids": null}` |
| `paid_content_creator` | Sync specific creator | `{"creator_id": 123}` |
| `forum_scrape` | Scrape forum threads | `{"thread_ids": [456, 789]}` |
| `youtube_monitor` | Check YouTube channels for new videos | `{}` |
| `easynews_monitor` | Check Easynews for new content | `{"search_queries": [...]}` |
| `appearances_sync` | Sync TMDb appearances | `{}` |
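
Enqueueing any of these job types reduces to a single INSERT. A hypothetical helper (the function name and defaults are illustrative, not part of the current codebase) that serializes the payload and returns the new queue id:

```python
import json
import sqlite3

def enqueue_job(conn, job_type, platform, account=None, payload=None, priority=5):
    """Insert a pending job into worker_job_queue and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """INSERT INTO worker_job_queue
                   (job_type, platform, account, priority, payload)
               VALUES (?, ?, ?, ?, ?)""",
            (job_type, platform, account, priority, json.dumps(payload or {})),
        )
    return cur.lastrowid
```

The row is created with the default status of 'pending', so a running worker will pick it up on its next poll.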

Worker Service Design

File: /opt/media-downloader/services/worker_service.py

# Conceptual outline - not a full implementation

import time

class WorkerService:
    """
    Independent worker service that processes jobs from the queue.
    Designed to run as a separate systemd service.
    """

    def __init__(self):
        self.worker_id = generate_worker_id()
        self.running = True
        self.current_job = None

    def run(self):
        """Main loop - poll for jobs and execute them."""
        while self.running:
            job = self.claim_next_job()
            if job:
                self.execute_job(job)
            else:
                time.sleep(5)  # No jobs, wait before polling again

    def claim_next_job(self):
        """
        Atomically claim the highest priority pending job.
        Uses database transaction to prevent race conditions.
        """
        # UPDATE worker_job_queue
        # SET status='running', worker_id=?, started_at=CURRENT_TIMESTAMP
        # WHERE id = (SELECT id FROM worker_job_queue
        #             WHERE status='pending'
        #             ORDER BY priority, created_at LIMIT 1)
        pass

    def execute_job(self, job):
        """Execute a job and update status."""
        try:
            handler = self.get_handler(job.job_type)
            handler.execute(job.payload, progress_callback=self.update_progress)
            self.mark_completed(job.id)
        except Exception as e:
            self.handle_failure(job, e)

    def update_progress(self, job_id, percent, message):
        """Update job progress in database for UI to read."""
        # UPDATE worker_job_queue SET progress_percent=?, progress_message=? WHERE id=?
        pass

    def handle_failure(self, job, error):
        """Handle job failure - retry or mark as failed."""
        if job.retry_count < job.max_retries:
            # Requeue for retry with incremented count
            pass
        else:
            # Mark as permanently failed
            pass

    def recover_orphaned_jobs(self):
        """
        On startup, check for jobs marked 'running' with stale worker_id.
        These are orphaned jobs from a previous crash - requeue them.
        """
        pass
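
The atomic claim in `claim_next_job` can be sketched as a compare-and-swap loop: select a candidate row, then UPDATE it only if it is still pending. This avoids depending on `UPDATE ... RETURNING` (SQLite 3.35+) and stays correct with multiple workers. Names mirror the outline above; treat this as a sketch under those assumptions, not the final implementation:

```python
import sqlite3

def claim_next_job(conn, worker_id):
    """Claim the highest-priority pending job; return its id, or None if idle."""
    while True:
        row = conn.execute(
            """SELECT id FROM worker_job_queue
               WHERE status = 'pending'
               ORDER BY priority, created_at
               LIMIT 1"""
        ).fetchone()
        if row is None:
            return None  # queue is empty
        job_id = row[0]
        with conn:  # the conditional update runs in its own transaction
            cur = conn.execute(
                """UPDATE worker_job_queue
                   SET status = 'running',
                       worker_id = ?,
                       started_at = CURRENT_TIMESTAMP
                   WHERE id = ? AND status = 'pending'""",
                (worker_id, job_id),
            )
        if cur.rowcount == 1:
            return job_id
        # Another worker claimed this row between SELECT and UPDATE;
        # loop and try the next candidate.
```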

Systemd Service: /etc/systemd/system/media-downloader-worker.service

[Unit]
Description=Media Downloader Worker Service
After=network.target

[Service]
Type=simple
User=media-downloader
WorkingDirectory=/opt/media-downloader
ExecStart=/opt/media-downloader/venv/bin/python -m services.worker_service
Restart=always
RestartSec=10

# Worker-specific settings
Environment="WORKER_CONCURRENCY=2"
Environment="WORKER_POLL_INTERVAL=5"

[Install]
WantedBy=multi-user.target

API Changes

Scheduler Service Changes

Before:

def run_instagram_scrape(username):
    scraper = InstagramScraper(username)
    scraper.run()  # Blocks, runs in-process

After:

def run_instagram_scrape(username):
    db.execute("""
        INSERT INTO worker_job_queue (job_type, platform, account, payload)
        VALUES ('scrape_account', 'instagram', ?, ?)
    """, [username, json.dumps({"scrape_type": "full"})])
    # Returns immediately - worker picks up job

API Endpoints for Job Management

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs with filters (status, platform, etc.) |
| GET | `/api/jobs/{id}` | Get job details and progress |
| POST | `/api/jobs/{id}/cancel` | Cancel a pending/running job |
| POST | `/api/jobs/{id}/retry` | Retry a failed job |
| DELETE | `/api/jobs/{id}` | Delete a job from queue |
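
The core of the cancel endpoint is a guarded status transition: only pending or running jobs may move to cancelled (a running job would additionally need its worker signalled, which is out of scope here). A hypothetical sketch of that transition, independent of the web framework:

```python
import sqlite3

def cancel_job(conn, job_id):
    """Mark a pending/running job as cancelled; return True if a row changed."""
    with conn:
        cur = conn.execute(
            """UPDATE worker_job_queue
               SET status = 'cancelled',
                   completed_at = CURRENT_TIMESTAMP
               WHERE id = ? AND status IN ('pending', 'running')""",
            (job_id,),
        )
    # rowcount 0 means the job was already finished, cancelled, or missing,
    # which the endpoint can translate into a 409 or 404 response.
    return cur.rowcount == 1
```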

Status Endpoint Changes

The Dashboard currently shows "Currently Scraping" by checking active processes. This would change to querying the job queue:

@router.get("/api/scheduler/status")
def get_scheduler_status():
    running_jobs = db.query("""
        SELECT * FROM worker_job_queue
        WHERE status = 'running'
        ORDER BY started_at
    """)

    pending_jobs = db.query("""
        SELECT COUNT(*) as count, platform
        FROM worker_job_queue
        WHERE status = 'pending'
        GROUP BY platform
    """)

    return {
        "running": [format_job(j) for j in running_jobs],
        "pending_counts": pending_jobs,
        "worker_healthy": check_worker_heartbeat()
    }

Files to Modify

New Files

| File | Description |
|------|-------------|
| `/opt/media-downloader/services/worker_service.py` | Main worker service |
| `/opt/media-downloader/services/job_handlers/` | Directory for job type handlers |
| `/opt/media-downloader/services/job_handlers/instagram.py` | Instagram scrape handler |
| `/opt/media-downloader/services/job_handlers/paid_content.py` | Paid content sync handler |
| `/opt/media-downloader/services/job_handlers/tiktok.py` | TikTok scrape handler |
| `/opt/media-downloader/services/job_handlers/forum.py` | Forum scrape handler |
| `/etc/systemd/system/media-downloader-worker.service` | Systemd service file |

Modified Files

| File | Changes |
|------|---------|
| `/opt/media-downloader/modules/unified_database.py` | Add job queue schema |
| `/opt/media-downloader/modules/scheduler.py` | Enqueue jobs instead of running directly |
| `/opt/media-downloader/web/backend/api.py` | Add job management endpoints |
| `/opt/media-downloader/web/backend/routers/scheduler.py` | Update status endpoint |
| `/opt/media-downloader/web/frontend/src/pages/Dashboard.tsx` | Display job queue status |
| `/opt/media-downloader/web/frontend/src/lib/api.ts` | Add job management API calls |

Implementation Phases

Phase 1: Foundation

  1. Add job queue table to database schema
  2. Create basic worker service structure
  3. Implement job claiming with atomic transactions
  4. Add systemd service file

Phase 2: Job Handlers

  1. Create job handler base class
  2. Migrate Instagram scraper to job handler
  3. Migrate Paid Content sync to job handler
  4. Migrate remaining scrapers one by one

Phase 3: Scheduler Integration

  1. Modify scheduler to enqueue jobs instead of running directly
  2. Update cron job triggers to use queue
  3. Add job status endpoints to API

Phase 4: UI Updates

  1. Update Dashboard to show job queue status
  2. Add job management UI (view, cancel, retry)
  3. Show progress for long-running jobs

Phase 5: Advanced Features

  1. Job priority system
  2. Concurrent job execution (configurable worker count)
  3. Job dependencies (job B waits for job A)
  4. Job scheduling (run at specific time)

Benefits

  1. Reliability: Downloads/scrapes survive service restarts
  2. Visibility: Clear queue of pending work
  3. Control: Cancel or reprioritize jobs
  4. Recovery: Automatic retry of failed jobs
  5. Progress: Real-time progress tracking for long jobs
  6. Scalability: Can run multiple workers if needed
  7. Separation of Concerns: Scheduler schedules, Worker works, API serves

Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Worker crashes | Systemd auto-restart + orphan job recovery |
| Database locked | Use WAL mode, proper transaction handling |
| Job stuck running | Heartbeat timeout, automatic requeue |
| Memory leaks | Periodic worker restart, job isolation |
| Race conditions | Atomic job claiming with transactions |
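
The stuck-job mitigation can be sketched the same way as the atomic claim: treat any running row whose started_at (or, in a fuller version, a dedicated last-heartbeat column) is older than a timeout as orphaned and return it to the queue. The timeout value and the use of started_at as the staleness signal are assumptions for illustration:

```python
import sqlite3

def requeue_stale_jobs(conn, timeout_minutes=60):
    """Requeue 'running' jobs with no progress within the timeout; return count."""
    with conn:
        cur = conn.execute(
            """UPDATE worker_job_queue
               SET status = 'pending',
                   worker_id = NULL,
                   started_at = NULL,
                   retry_count = retry_count + 1
               WHERE status = 'running'
                 AND started_at < datetime('now', ?)""",
            # SQLite datetime modifier, e.g. '-60 minutes'
            (f"-{timeout_minutes} minutes",),
        )
    return cur.rowcount
```

The worker would run this once in recover_orphaned_jobs() at startup and periodically thereafter, covering both crash recovery and the heartbeat-timeout case.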

Questions to Resolve

  1. Concurrency: Should workers run multiple jobs in parallel? How many?
  2. Priorities: What priority scheme? User-triggered vs scheduled?
  3. Retention: How long to keep completed/failed job records?
  4. Notifications: Should users be notified of job completion/failure?
  5. Migration: How to handle in-flight jobs during initial deployment?

References

  • Current scheduler: /opt/media-downloader/modules/scheduler.py
  • Current scrapers: /opt/media-downloader/modules/ (individual files: snapchat_scraper.py, instaloader_module.py, imginn_module.py, paid_content/scraper.py)
  • Current downloaders: /opt/media-downloader/modules/ (individual files: forum_downloader.py, universal_video_downloader.py, download_manager.py, paid_content/embed_downloader.py)
  • Database module: /opt/media-downloader/modules/unified_database.py