media-downloader/docs/WORKER_PROCESS_ARCHITECTURE_PROPOSAL.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00

Worker Process Architecture Proposal

Problem Statement

Currently, all scrapers and downloaders run directly within the scheduler and API service processes. When these services restart (due to configuration changes, updates, or other reasons), any active scraping or downloading jobs are abruptly terminated, leaving downloads incomplete and requiring manual re-triggering.

Current Issues:

  1. Scheduler service restarts kill active scrapers mid-process
  2. API service restarts interrupt download operations
  3. No job recovery mechanism - interrupted jobs are lost
  4. Users must manually re-trigger failed/interrupted jobs
  5. Long-running jobs (large downloads, full account scrapes) are particularly vulnerable

Current Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Scheduler Service                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │ Cron Logic  │→ │ Job Runner  │→ │ Scrapers/Downloaders│  │
│  │             │  │             │  │ (runs in-process)   │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                                                              │
│  If scheduler restarts → ALL active jobs die                │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      API Service                             │
│  ┌─────────────┐  ┌─────────────────────────────────────┐   │
│  │ Endpoints   │→ │ Manual Triggers (runs in-process)   │   │
│  └─────────────┘  └─────────────────────────────────────┘   │
│                                                              │
│  If API restarts → ALL manual jobs die                      │
└─────────────────────────────────────────────────────────────┘

Proposed Architecture

┌─────────────────────────┐     ┌─────────────────────────────┐
│   Scheduler Service     │     │      Worker Service         │
│  ┌─────────────┐        │     │  ┌───────────────────────┐  │
│  │ Cron Logic  │───────────────→ │ Job Queue Consumer    │  │
│  │             │  Enqueue │     │ │                       │  │
│  └─────────────┘  Jobs   │     │ └───────────┬───────────┘  │
│                          │     │             │              │
│  • Light process         │     │             ▼              │
│  • Just scheduling       │     │  ┌───────────────────────┐ │
│  • Can restart safely    │     │  │ Scrapers/Downloaders  │ │
└─────────────────────────┘     │  │ (isolated execution)  │ │
                                 │  └───────────────────────┘ │
┌─────────────────────────┐     │                             │
│     API Service         │     │  • Runs independently       │
│  ┌─────────────┐        │     │  • Survives API restarts    │
│  │ Endpoints   │───────────────→ │  • Survives sched restarts│
│  │             │  Enqueue │     │  • Job recovery on crash   │
│  └─────────────┘  Jobs   │     └─────────────────────────────┘
│                          │
│  • Can restart safely    │     ┌─────────────────────────────┐
│  • Reads status from DB  │     │        SQLite Database      │
│                          │     │  ┌───────────────────────┐  │
└──────────────────────────┘     │  │ job_queue table       │  │
              │                  │  │ job_status table      │  │
              │                  │  └───────────────────────┘  │
              └─────────────────→│                             │
                 Read Status     └─────────────────────────────┘

Database Schema

Job Queue Table

CREATE TABLE IF NOT EXISTS worker_job_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type TEXT NOT NULL,          -- 'scrape', 'download', 'sync', etc.
    platform TEXT NOT NULL,          -- 'instagram', 'paid_content', 'tiktok', etc.
    account TEXT,                    -- username or null for system tasks
    priority INTEGER DEFAULT 5,      -- 1=highest, 10=lowest
    status TEXT DEFAULT 'pending',   -- 'pending', 'running', 'completed', 'failed', 'cancelled'
    payload TEXT,                    -- JSON blob with job-specific data
    error_message TEXT,              -- Error details if failed
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    started_at DATETIME,
    completed_at DATETIME,
    worker_id TEXT,                  -- ID of worker process handling this job
    progress_percent INTEGER DEFAULT 0,
    progress_message TEXT
);

CREATE INDEX IF NOT EXISTS idx_job_queue_status ON worker_job_queue(status);
CREATE INDEX IF NOT EXISTS idx_job_queue_priority ON worker_job_queue(priority, created_at);
CREATE INDEX IF NOT EXISTS idx_job_queue_platform ON worker_job_queue(platform, account);
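
All three services (scheduler, API, and worker) open this same SQLite file, so the connection should be configured for concurrent access. Below is a minimal sketch using Python's built-in `sqlite3` module; the helper name and the 5-second timeout are illustrative choices, and WAL mode plus a busy timeout are the mitigations listed under Risks below:

```python
import sqlite3

def open_queue_db(path):
    """Open the shared queue database with multi-process-friendly settings."""
    conn = sqlite3.connect(path, timeout=5.0)
    # WAL lets the worker write job rows while the API reads status.
    conn.execute("PRAGMA journal_mode=WAL")
    # Retry for up to 5s when another process holds the write lock,
    # instead of failing immediately with "database is locked".
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```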

Job Types

| job_type | Description | Example payload |
|----------|-------------|-----------------|
| `scrape_account` | Scrape a social media account | `{"username": "user123", "scrape_type": "full"}` |
| `download_media` | Download specific media items | `{"media_ids": [1, 2, 3], "source": "instagram"}` |
| `paid_content_sync` | Sync paid content from all creators | `{"creator_ids": null}` |
| `paid_content_creator` | Sync specific creator | `{"creator_id": 123}` |
| `forum_scrape` | Scrape forum threads | `{"thread_ids": [456, 789]}` |
| `youtube_monitor` | Check YouTube channels for new videos | `{}` |
| `easynews_monitor` | Check Easynews for new content | `{"search_queries": [...]}` |
| `appearances_sync` | Sync TMDb appearances | `{}` |
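
Enqueueing any of these job types reduces to a single INSERT. A hypothetical helper (the function name and defaults are illustrative, not part of the current codebase) that serializes the payload and returns the new queue id:

```python
import json
import sqlite3

def enqueue_job(conn, job_type, platform, account=None, payload=None, priority=5):
    """Insert a pending job into worker_job_queue and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """INSERT INTO worker_job_queue
                   (job_type, platform, account, priority, payload)
               VALUES (?, ?, ?, ?, ?)""",
            (job_type, platform, account, priority, json.dumps(payload or {})),
        )
    return cur.lastrowid
```

The row is created with the default status of 'pending', so a running worker will pick it up on its next poll.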

Worker Service Design

File: /opt/media-downloader/services/worker_service.py

# Conceptual outline - not a full implementation

import time

class WorkerService:
    """
    Independent worker service that processes jobs from the queue.
    Designed to run as a separate systemd service.
    """

    def __init__(self):
        self.worker_id = generate_worker_id()
        self.running = True
        self.current_job = None

    def run(self):
        """Main loop - poll for jobs and execute them."""
        while self.running:
            job = self.claim_next_job()
            if job:
                self.execute_job(job)
            else:
                time.sleep(5)  # No jobs, wait before polling again

    def claim_next_job(self):
        """
        Atomically claim the highest priority pending job.
        Uses database transaction to prevent race conditions.
        """
        # UPDATE worker_job_queue
        # SET status='running', worker_id=?, started_at=CURRENT_TIMESTAMP
        # WHERE id = (SELECT id FROM worker_job_queue
        #             WHERE status='pending'
        #             ORDER BY priority, created_at LIMIT 1)
        pass

    def execute_job(self, job):
        """Execute a job and update status."""
        try:
            handler = self.get_handler(job.job_type)
            handler.execute(job.payload, progress_callback=self.update_progress)
            self.mark_completed(job.id)
        except Exception as e:
            self.handle_failure(job, e)

    def update_progress(self, job_id, percent, message):
        """Update job progress in database for UI to read."""
        # UPDATE worker_job_queue SET progress_percent=?, progress_message=? WHERE id=?
        pass

    def handle_failure(self, job, error):
        """Handle job failure - retry or mark as failed."""
        if job.retry_count < job.max_retries:
            # Requeue for retry with incremented count
            pass
        else:
            # Mark as permanently failed
            pass

    def recover_orphaned_jobs(self):
        """
        On startup, check for jobs marked 'running' with stale worker_id.
        These are orphaned jobs from a previous crash - requeue them.
        """
        pass
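
The atomic claim in `claim_next_job` can be sketched as a compare-and-swap loop: select a candidate row, then UPDATE it only if it is still pending. This avoids depending on `UPDATE ... RETURNING` (SQLite 3.35+) and stays correct with multiple workers. Names mirror the outline above; treat this as a sketch under those assumptions, not the final implementation:

```python
import sqlite3

def claim_next_job(conn, worker_id):
    """Claim the highest-priority pending job; return its id, or None if idle."""
    while True:
        row = conn.execute(
            """SELECT id FROM worker_job_queue
               WHERE status = 'pending'
               ORDER BY priority, created_at
               LIMIT 1"""
        ).fetchone()
        if row is None:
            return None  # queue is empty
        job_id = row[0]
        with conn:  # the conditional update runs in its own transaction
            cur = conn.execute(
                """UPDATE worker_job_queue
                   SET status = 'running',
                       worker_id = ?,
                       started_at = CURRENT_TIMESTAMP
                   WHERE id = ? AND status = 'pending'""",
                (worker_id, job_id),
            )
        if cur.rowcount == 1:
            return job_id
        # Another worker claimed this row between SELECT and UPDATE;
        # loop and try the next candidate.
```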

Systemd Service: /etc/systemd/system/media-downloader-worker.service

[Unit]
Description=Media Downloader Worker Service
After=network.target

[Service]
Type=simple
User=media-downloader
WorkingDirectory=/opt/media-downloader
ExecStart=/opt/media-downloader/venv/bin/python -m services.worker_service
Restart=always
RestartSec=10

# Worker-specific settings
Environment="WORKER_CONCURRENCY=2"
Environment="WORKER_POLL_INTERVAL=5"

[Install]
WantedBy=multi-user.target

API Changes

Scheduler Service Changes

Before:

def run_instagram_scrape(username):
    scraper = InstagramScraper(username)
    scraper.run()  # Blocks, runs in-process

After:

def run_instagram_scrape(username):
    db.execute("""
        INSERT INTO worker_job_queue (job_type, platform, account, payload)
        VALUES ('scrape_account', 'instagram', ?, ?)
    """, [username, json.dumps({"scrape_type": "full"})])
    # Returns immediately - worker picks up job

API Endpoints for Job Management

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs with filters (status, platform, etc.) |
| GET | `/api/jobs/{id}` | Get job details and progress |
| POST | `/api/jobs/{id}/cancel` | Cancel a pending/running job |
| POST | `/api/jobs/{id}/retry` | Retry a failed job |
| DELETE | `/api/jobs/{id}` | Delete a job from queue |
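
The core of the cancel endpoint is a guarded status transition: only pending or running jobs may move to cancelled (a running job would additionally need its worker signalled, which is out of scope here). A hypothetical sketch of that transition, independent of the web framework:

```python
import sqlite3

def cancel_job(conn, job_id):
    """Mark a pending/running job as cancelled; return True if a row changed."""
    with conn:
        cur = conn.execute(
            """UPDATE worker_job_queue
               SET status = 'cancelled',
                   completed_at = CURRENT_TIMESTAMP
               WHERE id = ? AND status IN ('pending', 'running')""",
            (job_id,),
        )
    # rowcount 0 means the job was already finished, cancelled, or missing,
    # which the endpoint can translate into a 409 or 404 response.
    return cur.rowcount == 1
```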

Status Endpoint Changes

The Dashboard currently shows "Currently Scraping" by checking active processes. This would change to querying the job queue:

@router.get("/api/scheduler/status")
def get_scheduler_status():
    running_jobs = db.query("""
        SELECT * FROM worker_job_queue
        WHERE status = 'running'
        ORDER BY started_at
    """)

    pending_jobs = db.query("""
        SELECT COUNT(*) as count, platform
        FROM worker_job_queue
        WHERE status = 'pending'
        GROUP BY platform
    """)

    return {
        "running": [format_job(j) for j in running_jobs],
        "pending_counts": pending_jobs,
        "worker_healthy": check_worker_heartbeat()
    }

Files to Modify

New Files

| File | Description |
|------|-------------|
| `/opt/media-downloader/services/worker_service.py` | Main worker service |
| `/opt/media-downloader/services/job_handlers/` | Directory for job type handlers |
| `/opt/media-downloader/services/job_handlers/instagram.py` | Instagram scrape handler |
| `/opt/media-downloader/services/job_handlers/paid_content.py` | Paid content sync handler |
| `/opt/media-downloader/services/job_handlers/tiktok.py` | TikTok scrape handler |
| `/opt/media-downloader/services/job_handlers/forum.py` | Forum scrape handler |
| `/etc/systemd/system/media-downloader-worker.service` | Systemd service file |

Modified Files

| File | Changes |
|------|---------|
| `/opt/media-downloader/modules/unified_database.py` | Add job queue schema |
| `/opt/media-downloader/modules/scheduler.py` | Enqueue jobs instead of running directly |
| `/opt/media-downloader/web/backend/api.py` | Add job management endpoints |
| `/opt/media-downloader/web/backend/routers/scheduler.py` | Update status endpoint |
| `/opt/media-downloader/web/frontend/src/pages/Dashboard.tsx` | Display job queue status |
| `/opt/media-downloader/web/frontend/src/lib/api.ts` | Add job management API calls |

Implementation Phases

Phase 1: Foundation

  1. Add job queue table to database schema
  2. Create basic worker service structure
  3. Implement job claiming with atomic transactions
  4. Add systemd service file

Phase 2: Job Handlers

  1. Create job handler base class
  2. Migrate Instagram scraper to job handler
  3. Migrate Paid Content sync to job handler
  4. Migrate remaining scrapers one by one

Phase 3: Scheduler Integration

  1. Modify scheduler to enqueue jobs instead of running directly
  2. Update cron job triggers to use queue
  3. Add job status endpoints to API

Phase 4: UI Updates

  1. Update Dashboard to show job queue status
  2. Add job management UI (view, cancel, retry)
  3. Show progress for long-running jobs

Phase 5: Advanced Features

  1. Job priority system
  2. Concurrent job execution (configurable worker count)
  3. Job dependencies (job B waits for job A)
  4. Job scheduling (run at specific time)

Benefits

  1. Reliability: Downloads/scrapes survive service restarts
  2. Visibility: Clear queue of pending work
  3. Control: Cancel or reprioritize jobs
  4. Recovery: Automatic retry of failed jobs
  5. Progress: Real-time progress tracking for long jobs
  6. Scalability: Can run multiple workers if needed
  7. Separation of Concerns: Scheduler schedules, Worker works, API serves

Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Worker crashes | Systemd auto-restart + orphan job recovery |
| Database locked | Use WAL mode, proper transaction handling |
| Job stuck running | Heartbeat timeout, automatic requeue |
| Memory leaks | Periodic worker restart, job isolation |
| Race conditions | Atomic job claiming with transactions |
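
The stuck-job mitigation can be sketched the same way as the atomic claim: treat any running row whose started_at (or, in a fuller version, a dedicated last-heartbeat column) is older than a timeout as orphaned and return it to the queue. The timeout value and the use of started_at as the staleness signal are assumptions for illustration:

```python
import sqlite3

def requeue_stale_jobs(conn, timeout_minutes=60):
    """Requeue 'running' jobs with no progress within the timeout; return count."""
    with conn:
        cur = conn.execute(
            """UPDATE worker_job_queue
               SET status = 'pending',
                   worker_id = NULL,
                   started_at = NULL,
                   retry_count = retry_count + 1
               WHERE status = 'running'
                 AND started_at < datetime('now', ?)""",
            # SQLite datetime modifier, e.g. '-60 minutes'
            (f"-{timeout_minutes} minutes",),
        )
    return cur.rowcount
```

The worker would run this once in recover_orphaned_jobs() at startup and periodically thereafter, covering both crash recovery and the heartbeat-timeout case.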

Questions to Resolve

  1. Concurrency: Should workers run multiple jobs in parallel? How many?
  2. Priorities: What priority scheme? User-triggered vs scheduled?
  3. Retention: How long to keep completed/failed job records?
  4. Notifications: Should users be notified of job completion/failure?
  5. Migration: How to handle in-flight jobs during initial deployment?

References

  • Current scheduler: /opt/media-downloader/modules/scheduler.py
  • Current scrapers: /opt/media-downloader/modules/ (individual files: snapchat_scraper.py, instaloader_module.py, imginn_module.py, paid_content/scraper.py)
  • Current downloaders: /opt/media-downloader/modules/ (individual files: forum_downloader.py, universal_video_downloader.py, download_manager.py, paid_content/embed_downloader.py)
  • Database module: /opt/media-downloader/modules/unified_database.py