Scraper Proxy Configuration System

Overview

This document describes the design and implementation plan for a centralized scraper configuration system that provides:

  1. Per-scraper proxy settings - Configure different proxies for different scrapers
  2. Centralized cookie management - Store cookies in database instead of files
  3. FlareSolverr integration - Test connections and refresh Cloudflare cookies
  4. Cookie upload support - Upload cookies from browser extensions for authenticated access
  5. Unified Settings UI - Single place to manage all scraper configurations

Background

Problem Statement

  • Proxy settings are not configurable per-module
  • Cookies are stored in scattered JSON files
  • No UI to test FlareSolverr connections or manage cookies
  • Adding new forums requires code changes
  • No visibility into cookie freshness or scraper health

Solution

A new scrapers database table that:

  • Stores configuration for all automated scrapers
  • Provides proxy settings per-scraper
  • Centralizes cookie storage with merge logic
  • Syncs automatically with platform configurations
  • Exposes management via Settings UI

Database Schema

Table: scrapers

```sql
CREATE TABLE scrapers (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,               -- 'direct', 'proxy', 'forum', 'cli_tool'
    module TEXT,                      -- Python module name, NULL for cli_tool
    base_url TEXT,                    -- Primary URL for the scraper
    target_platform TEXT,             -- 'instagram', 'snapchat', 'tiktok', NULL for forums/cli
    enabled INTEGER DEFAULT 1,        -- Enable/disable scraper

    -- Proxy settings
    proxy_enabled INTEGER DEFAULT 0,
    proxy_url TEXT,                   -- e.g., "socks5://user:pass@host:port"

    -- Cloudflare/Cookie settings
    flaresolverr_required INTEGER DEFAULT 0,
    cookies_json TEXT,                -- JSON blob of cookies
    cookies_updated_at TEXT,          -- ISO timestamp of last cookie update

    -- Test status
    last_test_at TEXT,                -- ISO timestamp of last test
    last_test_status TEXT,            -- 'success', 'failed', 'timeout'
    last_test_message TEXT,           -- Error message if failed

    -- Module-specific settings
    settings_json TEXT,               -- Additional JSON settings per-scraper

    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
```

Column Definitions

| Column | Type | Description |
|---|---|---|
| id | TEXT | Unique identifier (e.g., 'imginn', 'forum_phun') |
| name | TEXT | Display name shown in UI |
| type | TEXT | One of: 'direct', 'proxy', 'forum', 'cli_tool' |
| module | TEXT | Python module name (e.g., 'imginn_module'), NULL for CLI tools |
| base_url | TEXT | Primary URL for the service |
| target_platform | TEXT | What platform this scraper downloads from (instagram, snapchat, tiktok, NULL) |
| enabled | INTEGER | 1=enabled, 0=disabled |
| proxy_enabled | INTEGER | 1=use proxy, 0=direct connection |
| proxy_url | TEXT | Proxy URL (http, https, socks5 supported) |
| flaresolverr_required | INTEGER | 1=needs FlareSolverr for Cloudflare bypass |
| cookies_json | TEXT | JSON array of cookie objects |
| cookies_updated_at | TEXT | When cookies were last updated |
| last_test_at | TEXT | When connection was last tested |
| last_test_status | TEXT | Result of last test: 'success', 'failed', 'timeout' |
| last_test_message | TEXT | Error message from last failed test |
| settings_json | TEXT | Module-specific settings as JSON |

Scraper Types

| Type | Description | Examples |
|---|---|---|
| direct | Downloads directly from the platform | instagram, tiktok, snapchat, coppermine |
| proxy | Uses a proxy service to download | imginn, fastdl, toolzu |
| forum | Forum scraper | forum_phun, forum_hqcelebcorner, forum_picturepub |
| cli_tool | Command-line tool wrapper | ytdlp, gallerydl |

Target Platforms

The target_platform field indicates what platform the scraper actually downloads content from:

| Scraper | Target Platform | Notes |
|---|---|---|
| imginn | instagram | Proxy service for Instagram |
| fastdl | instagram | Proxy service for Instagram |
| toolzu | instagram | Proxy service for Instagram |
| snapchat | snapchat | Direct via Playwright scraper |
| instagram | instagram | Direct via Instaloader |
| tiktok | tiktok | Direct via yt-dlp internally |
| coppermine | NULL | Not a social platform |
| forum_* | NULL | Not a social platform |
| ytdlp | NULL | Generic tool, multiple platforms |
| gallerydl | NULL | Generic tool, multiple platforms |

Seed Data

Initial scrapers to populate on first run:

| id | name | type | module | base_url | target_platform | flaresolverr_required |
|---|---|---|---|---|---|---|
| imginn | Imginn | proxy | imginn_module | https://imginn.com | instagram | 1 |
| fastdl | FastDL | proxy | fastdl_module | https://fastdl.app | instagram | 1 |
| toolzu | Toolzu | proxy | toolzu_module | https://toolzu.com | instagram | 1 |
| snapchat | Snapchat Direct | direct | snapchat_scraper | https://snapchat.com | snapchat | 0 |
| instagram | Instagram (Direct) | direct | instaloader_module | https://instagram.com | instagram | 0 |
| tiktok | TikTok | direct | tiktok_module | https://tiktok.com | tiktok | 0 |
| coppermine | Coppermine | direct | coppermine_module | https://hqdiesel.net | NULL | 1 |
| forum_phun | Phun.org | forum | forum_downloader | https://forum.phun.org | NULL | 1 |
| forum_hqcelebcorner | HQCelebCorner | forum | forum_downloader | https://hqcelebcorner.com | NULL | 0 |
| forum_picturepub | PicturePub | forum | forum_downloader | https://picturepub.net | NULL | 0 |
| ytdlp | yt-dlp | cli_tool | NULL | NULL | NULL | 0 |
| gallerydl | gallery-dl | cli_tool | NULL | NULL | NULL | 0 |

Notes on Seed Data

  1. Snapchat: Uses direct Playwright-based scraper with optional proxy support (configured per-scraper in Scrapers settings page)

  2. Forums: Derived from existing forum_threads table entries and cookie files

  3. Excluded scrapers: YouTube and Bilibili are NOT included - they are on-demand downloaders from the Video Downloader page, not scheduled scrapers


Auto-Sync Logic

The scrapers table stays in sync with platform configurations automatically:

When Forums Change

  • New forum added in Forums settings → Create scraper entry with type='forum'
  • Forum removed from settings → Remove scraper entry

When Modules Are Enabled/Disabled

  • Module enabled → Ensure scraper entry exists
  • Module disabled → Scraper entry remains but enabled=0

No Manual Add/Delete

  • The Scrapers UI does NOT have Add or Delete buttons
  • Scrapers are managed through their respective platform configuration pages
  • Scrapers UI only manages: proxy settings, testing, cookies

Cookie Storage Format

Cookies are stored as JSON in the cookies_json column:

```json
{
  "cookies": [
    {
      "name": "cf_clearance",
      "value": "abc123...",
      "domain": ".imginn.com",
      "path": "/",
      "expiry": 1735689600
    },
    {
      "name": "session_id",
      "value": "xyz789...",
      "domain": "imginn.com",
      "path": "/",
      "expiry": -1
    }
  ],
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36..."
}
```
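
For illustration, a blob in this shape can be flattened into a plain name-to-value mapping for HTTP clients that accept cookie dicts (a minimal sketch; the helper name is not part of the design):

```python
import json

def cookies_to_dict(cookies_json: str) -> dict:
    """Flatten the stored cookie blob into a name -> value mapping.

    Domain/path/expiry handling is left to the HTTP client; this is only
    a convenience for libraries that accept plain cookie dicts.
    """
    data = json.loads(cookies_json)
    return {c["name"]: c["value"] for c in data.get("cookies", [])}

blob = json.dumps({
    "cookies": [
        {"name": "cf_clearance", "value": "abc123", "domain": ".imginn.com"},
        {"name": "session_id", "value": "xyz789", "domain": "imginn.com"},
    ],
    "user_agent": "Mozilla/5.0",
})
print(cookies_to_dict(blob))  # {'cf_clearance': 'abc123', 'session_id': 'xyz789'}
```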

CRITICAL: When updating cookies, MERGE with existing - never wipe:

```python
def merge_cookies(existing_cookies: list, new_cookies: list) -> list:
    """
    Merge new cookies into existing, preserving non-updated cookies.

    This ensures:
    - Cloudflare cookies (cf_clearance, __cf_bm) get refreshed
    - Site session/auth cookies are preserved
    - No data loss on test/refresh
    """
    # Index existing by name
    cookie_map = {c['name']: c for c in existing_cookies}

    # Update/add from new cookies
    for cookie in new_cookies:
        cookie_map[cookie['name']] = cookie

    return list(cookie_map.values())
```

Cookie Sources

  1. FlareSolverr - Automated Cloudflare bypass, returns CF cookies
  2. Upload - User uploads JSON from browser extension (EditThisCookie, Cookie-Editor)
  3. Module - Some modules save cookies during operation
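
The merge semantics are worth pinning down with a concrete example (a standalone copy of the merge logic, for clarity):

```python
def merge_cookies(existing: list, new: list) -> list:
    # Same logic as above: index by name, newest value wins.
    cookie_map = {c["name"]: c for c in existing}
    for cookie in new:
        cookie_map[cookie["name"]] = cookie
    return list(cookie_map.values())

existing = [
    {"name": "cf_clearance", "value": "old"},
    {"name": "session_id", "value": "keep-me"},
]
new = [{"name": "cf_clearance", "value": "fresh"}]

merged = merge_cookies(existing, new)
# cf_clearance is refreshed, session_id survives the refresh
print({c["name"]: c["value"] for c in merged})  # {'cf_clearance': 'fresh', 'session_id': 'keep-me'}
```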

Existing cookie files to migrate on first run:

| File | Scraper ID |
|---|---|
| cookies/coppermine_cookies.json | coppermine |
| cookies/imginn_cookies.json | imginn |
| cookies/fastdl_cookies.json | fastdl |
| cookies/snapchat_cookies.json | snapchat |
| cookies/forum_cookies_phun.org.json | forum_phun |
| cookies/forum_cookies_HQCelebCorner.json | forum_hqcelebcorner |
| cookies/forum_cookies_PicturePub.json | forum_picturepub |

Proxy Configuration

Supported Proxy Formats

```text
http://host:port
http://user:pass@host:port
https://host:port
https://user:pass@host:port
socks5://host:port
socks5://user:pass@host:port
```
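
A settings UI or API handler could validate these formats before saving (a sketch using the standard library; not part of the specified API):

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "socks5"}

def validate_proxy_url(proxy_url: str) -> bool:
    """Accept only the formats above: scheme://[user:pass@]host:port."""
    try:
        parsed = urlparse(proxy_url)
        return (parsed.scheme in ALLOWED_SCHEMES
                and parsed.hostname is not None
                and parsed.port is not None)
    except ValueError:  # e.g. a non-numeric port
        return False

print(validate_proxy_url("socks5://user:pass@10.0.0.1:1080"))  # True
print(validate_proxy_url("ftp://example.com:21"))              # False
```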

FlareSolverr Proxy Integration

When a scraper has proxy_enabled=1, the proxy is passed to FlareSolverr:

```python
payload = {
    "cmd": "request.get",
    "url": url,
    "maxTimeout": 120000
}
if proxy_url:
    payload["proxy"] = {"url": proxy_url}
```

Important: Cloudflare cookies are tied to IP address. If FlareSolverr uses a proxy, subsequent requests MUST use the same proxy or cookies will be invalid.
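
One way to honor this constraint is to derive both the proxy mapping and the cookie header for follow-up requests from the same scraper row, so they cannot drift apart (a sketch; the helper name is an assumption, shown with a requests-style per-scheme proxies dict):

```python
import json

def request_config(scraper: dict) -> dict:
    """Build client settings from a scrapers-table row so follow-up
    requests reuse the SAME proxy FlareSolverr solved through."""
    config = {"proxies": None, "headers": {}}
    if scraper.get("proxy_enabled") and scraper.get("proxy_url"):
        # Route both schemes through the one configured proxy
        config["proxies"] = {"http": scraper["proxy_url"],
                             "https": scraper["proxy_url"]}
    if scraper.get("cookies_json"):
        data = json.loads(scraper["cookies_json"])
        config["headers"]["Cookie"] = "; ".join(
            f'{c["name"]}={c["value"]}' for c in data.get("cookies", []))
        if data.get("user_agent"):
            config["headers"]["User-Agent"] = data["user_agent"]
    return config

scraper = {
    "proxy_enabled": 1,
    "proxy_url": "socks5://user:pass@10.0.0.1:1080",
    "cookies_json": json.dumps({
        "cookies": [{"name": "cf_clearance", "value": "abc123"}],
        "user_agent": "Mozilla/5.0",
    }),
}
cfg = request_config(scraper)
print(cfg["proxies"]["https"])  # socks5://user:pass@10.0.0.1:1080
```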

Per-Module Proxy Usage

| Module | How Proxy is Used |
|---|---|
| coppermine_module | requests.Session(proxies={...}) |
| imginn_module | Playwright proxy option |
| fastdl_module | Playwright proxy option |
| toolzu_module | Playwright proxy option |
| snapchat_scraper | Playwright proxy option (optional, configured in Scrapers page) |
| instaloader_module | Instaloader proxy parameter |
| tiktok_module | yt-dlp --proxy flag |
| forum_downloader | Playwright proxy option + requests |
| ytdlp | --proxy flag |
| gallerydl | --proxy flag |

API Endpoints

GET /api/scrapers

List all scrapers with optional type filter.

Query Parameters:

  • type (optional): Filter by type ('direct', 'proxy', 'forum', 'cli_tool')

Response:

```json
{
  "scrapers": [
    {
      "id": "imginn",
      "name": "Imginn",
      "type": "proxy",
      "module": "imginn_module",
      "base_url": "https://imginn.com",
      "target_platform": "instagram",
      "enabled": true,
      "proxy_enabled": false,
      "proxy_url": null,
      "flaresolverr_required": true,
      "cookies_count": 23,
      "cookies_updated_at": "2025-12-01T10:30:00",
      "cookies_fresh": true,
      "last_test_at": "2025-12-01T10:30:00",
      "last_test_status": "success",
      "last_test_message": null
    }
  ]
}
```
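
The cookies_fresh flag could be computed from the stored expiry timestamps, treating session cookies (expiry -1) as non-expiring (a sketch; the exact freshness policy is an assumption, not defined above):

```python
import time

def cookies_fresh(cookies: list, now: float = None) -> bool:
    """Fresh means: at least one cookie exists and none has expired."""
    now = time.time() if now is None else now
    if not cookies:
        return False
    return all(c.get("expiry", -1) == -1 or c["expiry"] > now
               for c in cookies)

cookies = [
    {"name": "cf_clearance", "expiry": 1735689600},
    {"name": "session_id", "expiry": -1},
]
print(cookies_fresh(cookies, now=1735000000))  # True  (nothing expired yet)
print(cookies_fresh(cookies, now=1999999999))  # False (cf_clearance expired)
```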

GET /api/scrapers/{id}

Get single scraper configuration.

PUT /api/scrapers/{id}

Update scraper settings.

Request Body:

```json
{
  "enabled": true,
  "proxy_enabled": true,
  "proxy_url": "socks5://user:pass@host:port",
  "base_url": "https://new-domain.com"
}
```

POST /api/scrapers/{id}/test

Test connection via FlareSolverr (if required) and save cookies on success.

Response:

```json
{
  "success": true,
  "message": "Connection successful, 23 cookies saved",
  "cookies_count": 23
}
```

POST /api/scrapers/{id}/cookies

Upload cookies from JSON file. Merges with existing cookies.

Request Body:

```json
{
  "cookies": [
    {"name": "session", "value": "abc123", "domain": ".example.com"}
  ]
}
```

Response:

```json
{
  "success": true,
  "message": "Merged 5 cookies (total: 28)",
  "cookies_count": 28
}
```
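
The counts in that message fall directly out of the merge logic (a sketch of how the handler might assemble the response; the function name is illustrative):

```python
def merge_and_report(existing: list, uploaded: list) -> dict:
    """Merge uploaded cookies into existing and build the response above."""
    cookie_map = {c["name"]: c for c in existing}
    for c in uploaded:
        cookie_map[c["name"]] = c
    total = len(cookie_map)
    return {
        "success": True,
        "message": f"Merged {len(uploaded)} cookies (total: {total})",
        "cookies_count": total,
    }

existing = [{"name": f"c{i}", "value": "x"} for i in range(23)]
uploaded = [{"name": f"u{i}", "value": "y"} for i in range(5)]
print(merge_and_report(existing, uploaded)["message"])  # Merged 5 cookies (total: 28)
```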

DELETE /api/scrapers/{id}/cookies

Clear all cookies for a scraper.


Frontend UI

Settings > Scrapers Tab

The Scrapers tab displays all scrapers grouped by type/platform:

```text
┌───────────────────────────────────────────────────────────────────────┐
│ Settings > Scrapers                                                   │
├───────────────────────────────────────────────────────────────────────┤
│                                              Filter: [All Types ▼]    │
│                                                                       │
│ ─── Instagram Proxies ────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Imginn                                              [Enabled ▼] │ │
│ │   https://imginn.com                                              │ │
│ │   ☐ Use Proxy  [                                              ]   │ │
│ │   Cloudflare: Required │ Cookies: ✓ Fresh (2h ago, 23 cookies)    │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── Direct ───────────────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Instagram (Direct)                                  [Enabled ▼] │ │
│ │   https://instagram.com                                           │ │
│ │   ☐ Use Proxy  [                                              ]   │ │
│ │   Cloudflare: Not Required │ Cookies: ✓ 12 cookies                │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── Forums ───────────────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Phun.org                                            [Enabled ▼] │ │
│ │   https://forum.phun.org                                          │ │
│ │   ☐ Use Proxy  [                                              ]   │ │
│ │   Cloudflare: Required │ Cookies: ⚠ Expired (3 days)              │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── CLI Tools ────────────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● yt-dlp                                              [Enabled ▼] │ │
│ │   Generic video downloader                                        │ │
│ │   ☐ Use Proxy  [                                              ]   │ │
│ │   [Test Connection] [Upload Cookies]                              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```

Button Visibility

| Button | When Shown |
|---|---|
| Test Connection | Always |
| Upload Cookies | Always |
| Clear Cookies | When cookies exist |

No Add/Delete Buttons

Scrapers are NOT added or deleted from this UI. They are managed through:

  • Forums settings (for forum scrapers)
  • Platform settings (for other scrapers)

This UI only manages:

  • Enable/disable
  • Proxy configuration
  • Cookie testing/upload/clear

Module Integration

Common Pattern

All modules follow this pattern to load scraper configuration:

```python
class SomeModule:
    def __init__(self, unified_db=None, scraper_id='some_scraper', ...):
        self.db = unified_db
        self.scraper_id = scraper_id

        # Load config from DB
        self.config = self.db.get_scraper(scraper_id) if self.db else {}

        # Check if enabled
        if not self.config.get('enabled', True):
            raise ScraperDisabledError(f"{scraper_id} is disabled")

        # Get base URL from DB (not hardcoded)
        self.base_url = self.config.get('base_url', 'https://default.com')

        # Get proxy config
        self.proxy_url = None
        if self.config.get('proxy_enabled') and self.config.get('proxy_url'):
            self.proxy_url = self.config['proxy_url']

        # Initialize CloudflareHandler with DB storage
        self.cf_handler = CloudflareHandler(
            module_name=self.scraper_id,
            scraper_id=self.scraper_id,
            unified_db=self.db,
            proxy_url=self.proxy_url,
            ...
        )
```

CloudflareHandler Changes

```python
class CloudflareHandler:
    def __init__(self,
                 module_name: str,
                 scraper_id: str = None,      # For DB cookie storage
                 unified_db = None,            # DB reference
                 proxy_url: str = None,        # Proxy support
                 cookie_file: str = None,      # DEPRECATED: backwards compat
                 ...):
        self.scraper_id = scraper_id
        self.db = unified_db
        self.proxy_url = proxy_url

    def get_cookies_via_flaresolverr(self, url: str, max_retries: int = 2) -> bool:
        payload = {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 120000
        }
        # Add proxy if configured
        if self.proxy_url:
            payload["proxy"] = {"url": self.proxy_url}

        # ... rest of implementation

        # On success, merge cookies (don't replace)
        if success:
            existing = self.load_cookies_from_db()
            merged = self.merge_cookies(existing, new_cookies)
            self.save_cookies_to_db(merged)

    def load_cookies_from_db(self) -> list:
        if self.db and self.scraper_id:
            config = self.db.get_scraper(self.scraper_id)
            if config and config.get('cookies_json'):
                data = json.loads(config['cookies_json'])
                return data.get('cookies', [])
        return []

    def save_cookies_to_db(self, cookies: list, user_agent: str = None):
        if self.db and self.scraper_id:
            data = {
                'cookies': cookies,
                'user_agent': user_agent
            }
            self.db.update_scraper_cookies(self.scraper_id, json.dumps(data))

    def merge_cookies(self, existing: list, new: list) -> list:
        cookie_map = {c['name']: c for c in existing}
        for cookie in new:
            cookie_map[cookie['name']] = cookie
        return list(cookie_map.values())
```

Scheduler Integration

The scheduler uses the scrapers table to determine what to run:

```python
def run_scheduled_downloads(self):
    # Get all enabled scrapers
    scrapers = self.db.get_all_scrapers()
    enabled_scrapers = [s for s in scrapers if s['enabled']]

    for scraper in enabled_scrapers:
        if scraper['type'] == 'forum':
            self.run_forum_download(scraper['id'])
        elif scraper['id'] == 'coppermine':
            self.run_coppermine_download()
        elif scraper['id'] == 'instagram':
            self.run_instagram_download()
        elif scraper['id'] == 'tiktok':
            self.run_tiktok_download()
        # etc.
```
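
As more scrapers are seeded, the if/elif chain above could be kept flat with a dispatch map keyed on type and id (a sketch of one alternative, not the committed design; handler names are stubbed as strings here):

```python
def pick_handler(scraper: dict, id_handlers: dict, forum_handler):
    """Route a scraper row to its runner without growing an if/elif chain."""
    if scraper["type"] == "forum":
        return forum_handler                   # all forums share one runner
    return id_handlers.get(scraper["id"])      # None -> no runner registered

id_handlers = {
    "coppermine": "run_coppermine_download",   # stand-ins for bound methods
    "instagram": "run_instagram_download",
    "tiktok": "run_tiktok_download",
}
print(pick_handler({"id": "forum_phun", "type": "forum"}, id_handlers, "run_forum_download"))
# run_forum_download
```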

Migration Plan

Step 1: Create Table

Add to unified_database.py:

```python
def _create_scrapers_table(self):
    self.cursor.execute('''
        CREATE TABLE IF NOT EXISTS scrapers (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            type TEXT NOT NULL,
            module TEXT,
            base_url TEXT,
            target_platform TEXT,
            enabled INTEGER DEFAULT 1,
            proxy_enabled INTEGER DEFAULT 0,
            proxy_url TEXT,
            flaresolverr_required INTEGER DEFAULT 0,
            cookies_json TEXT,
            cookies_updated_at TEXT,
            last_test_at TEXT,
            last_test_status TEXT,
            last_test_message TEXT,
            settings_json TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    ''')
```

Step 2: Seed Initial Data

```python
def _seed_scrapers(self):
    scrapers = [
        ('imginn', 'Imginn', 'proxy', 'imginn_module', 'https://imginn.com', 'instagram', 1),
        ('fastdl', 'FastDL', 'proxy', 'fastdl_module', 'https://fastdl.app', 'instagram', 1),
        ('toolzu', 'Toolzu', 'proxy', 'toolzu_module', 'https://toolzu.com', 'instagram', 1),
        ('snapchat', 'Snapchat Direct', 'direct', 'snapchat_scraper', 'https://snapchat.com', 'snapchat', 0),
        ('instagram', 'Instagram (Direct)', 'direct', 'instaloader_module', 'https://instagram.com', 'instagram', 0),
        ('tiktok', 'TikTok', 'direct', 'tiktok_module', 'https://tiktok.com', 'tiktok', 0),
        ('coppermine', 'Coppermine', 'direct', 'coppermine_module', 'https://hqdiesel.net', None, 1),
        ('forum_phun', 'Phun.org', 'forum', 'forum_downloader', 'https://forum.phun.org', None, 1),
        ('forum_hqcelebcorner', 'HQCelebCorner', 'forum', 'forum_downloader', 'https://hqcelebcorner.com', None, 0),
        ('forum_picturepub', 'PicturePub', 'forum', 'forum_downloader', 'https://picturepub.net', None, 0),
        ('ytdlp', 'yt-dlp', 'cli_tool', None, None, None, 0),
        ('gallerydl', 'gallery-dl', 'cli_tool', None, None, None, 0),
    ]

    for s in scrapers:
        self.cursor.execute('''
            INSERT OR IGNORE INTO scrapers
            (id, name, type, module, base_url, target_platform, flaresolverr_required)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', s)
```

Step 3: Migrate Cookies

```python
def _migrate_cookies_to_db(self):
    cookie_files = {
        'coppermine': '/opt/media-downloader/cookies/coppermine_cookies.json',
        'imginn': '/opt/media-downloader/cookies/imginn_cookies.json',
        'fastdl': '/opt/media-downloader/cookies/fastdl_cookies.json',
        'snapchat': '/opt/media-downloader/cookies/snapchat_cookies.json',
        'forum_phun': '/opt/media-downloader/cookies/forum_cookies_phun.org.json',
        'forum_hqcelebcorner': '/opt/media-downloader/cookies/forum_cookies_HQCelebCorner.json',
        'forum_picturepub': '/opt/media-downloader/cookies/forum_cookies_PicturePub.json',
    }

    for scraper_id, cookie_file in cookie_files.items():
        if os.path.exists(cookie_file):
            try:
                with open(cookie_file, 'r') as f:
                    data = json.load(f)

                # Store in DB
                self.cursor.execute('''
                    UPDATE scrapers
                    SET cookies_json = ?, cookies_updated_at = ?
                    WHERE id = ?
                ''', (json.dumps(data), datetime.now().isoformat(), scraper_id))

                self.logger.info(f"Migrated cookies for {scraper_id}")
            except Exception as e:
                self.logger.error(f"Failed to migrate cookies for {scraper_id}: {e}")
```

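Legacy files may not all use the envelope described under Storage Format (some browser extensions export a bare cookie array), so the migration may want to normalize shapes before storing. This is an assumption about the legacy files, not something this document confirms; the helper is illustrative:

```python
def normalize_cookie_blob(data):
    """Coerce legacy cookie file contents into the canonical envelope."""
    if isinstance(data, list):                        # bare cookie array
        return {"cookies": data, "user_agent": None}
    if isinstance(data, dict) and "cookies" in data:  # already canonical
        return data
    return {"cookies": [], "user_agent": None}        # unknown shape

print(normalize_cookie_blob([{"name": "sid", "value": "x"}])["cookies"][0]["name"])  # sid
```
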
Step 4: Migrate Snapchat proxy_domain

```python
def _migrate_snapchat_proxy_domain(self):
    # Get current proxy_domain from settings
    settings = self.get_setting('snapchat')
    if settings and 'proxy_domain' in settings:
        proxy_domain = settings['proxy_domain']
        base_url = f"https://{proxy_domain}"

        self.cursor.execute('''
            UPDATE scrapers SET base_url = ? WHERE id = 'snapchat'
        ''', (base_url,))

        # Remove from settings (now in scrapers table)
        del settings['proxy_domain']
        self.save_setting('snapchat', settings)
```

Implementation Order

| Step | Task | Files to Modify |
|---|---|---|
| 1 | Database schema + migration | unified_database.py |
| 2 | Backend API endpoints | api.py |
| 3 | CloudflareHandler proxy + DB storage + merge logic | cloudflare_handler.py |
| 4 | Frontend Scrapers tab | ScrapersTab.tsx, Settings.tsx, api.ts |
| 5 | Update coppermine_module (test case) | coppermine_module.py |
| 6 | Test end-to-end | - |
| 7 | Update remaining modules | imginn_module.py, fastdl_module.py, toolzu_module.py, snapchat_scraper.py, instaloader_module.py, tiktok_module.py, forum_downloader.py |
| 8 | Update scheduler | scheduler.py |
| 9 | Cookie file cleanup | Remove old cookie files after verification |

Testing Checklist

Database

  • Table created on first run
  • Seed data populated correctly
  • Cookies migrated from files
  • Snapchat proxy_domain migrated

API

  • GET /api/scrapers returns all scrapers
  • GET /api/scrapers?type=forum filters correctly
  • PUT /api/scrapers/{id} updates settings
  • POST /api/scrapers/{id}/test works with FlareSolverr
  • POST /api/scrapers/{id}/test works with proxy
  • POST /api/scrapers/{id}/cookies merges correctly
  • DELETE /api/scrapers/{id}/cookies clears cookies

Frontend

  • Scrapers tab displays all scrapers
  • Grouping by type works
  • Filter dropdown works
  • Enable/disable toggle works
  • Proxy checkbox and URL input work
  • Test Connection button works
  • Upload Cookies button works
  • Clear Cookies button works
  • Cookie status shows correctly (fresh/expired/none)

Modules

  • coppermine_module loads config from DB
  • coppermine_module uses proxy when configured
  • coppermine_module uses cookies from DB
  • All other modules updated and working

Scheduler

  • Only runs enabled scrapers
  • Passes correct scraper_id to modules

Rollback Plan

If issues occur:

  1. Database: The old cookie files are preserved as backups
  2. Modules: Can fall back to reading cookie files if DB fails
  3. API: Add backwards compatibility for old endpoints if needed

Future Enhancements

Potential additions not in initial scope:

  1. Rotating proxies - Support proxy pools with rotation
  2. Proxy health monitoring - Track proxy success/failure rates
  3. Auto-refresh cookies - Background job to refresh expiring cookies
  4. Cookie export - Download cookies as JSON for backup
  5. Scraper metrics - Track download success rates per scraper