# Scraper Proxy Configuration System

## Overview

This document describes the design and implementation plan for a centralized scraper configuration system that provides:

1. **Per-scraper proxy settings** - Configure different proxies for different scrapers
2. **Centralized cookie management** - Store cookies in the database instead of files
3. **FlareSolverr integration** - Test connections and refresh Cloudflare cookies
4. **Cookie upload support** - Upload cookies from browser extensions for authenticated access
5. **Unified Settings UI** - A single place to manage all scraper configurations

## Background

### Problem Statement

- Proxy settings are not configurable per-module
- Cookies are stored in scattered JSON files
- No UI to test FlareSolverr connections or manage cookies
- Adding new forums requires code changes
- No visibility into cookie freshness or scraper health

### Solution

A new `scrapers` database table that:

- Stores configuration for all automated scrapers
- Provides proxy settings per-scraper
- Centralizes cookie storage with merge logic
- Syncs automatically with platform configurations
- Exposes management via the Settings UI

---

## Database Schema

### Table: `scrapers`

```sql
CREATE TABLE scrapers (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,                  -- 'direct', 'proxy', 'forum', 'cli_tool'
    module TEXT,                         -- Python module name, NULL for cli_tool
    base_url TEXT,                       -- Primary URL for the scraper
    target_platform TEXT,                -- 'instagram', 'snapchat', 'tiktok', NULL for forums/cli
    enabled INTEGER DEFAULT 1,           -- Enable/disable scraper

    -- Proxy settings
    proxy_enabled INTEGER DEFAULT 0,
    proxy_url TEXT,                      -- e.g., "socks5://user:pass@host:port"

    -- Cloudflare/Cookie settings
    flaresolverr_required INTEGER DEFAULT 0,
    cookies_json TEXT,                   -- JSON blob of cookies
    cookies_updated_at TEXT,             -- ISO timestamp of last cookie update

    -- Test status
    last_test_at TEXT,                   -- ISO timestamp of last test
    last_test_status TEXT,               -- 'success', 'failed', 'timeout'
    last_test_message TEXT,              -- Error message if failed

    -- Module-specific settings
    settings_json TEXT,                  -- Additional JSON settings per-scraper

    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
```

### Column Definitions

| Column | Type | Description |
|--------|------|-------------|
| `id` | TEXT | Unique identifier (e.g., 'imginn', 'forum_phun') |
| `name` | TEXT | Display name shown in UI |
| `type` | TEXT | One of: 'direct', 'proxy', 'forum', 'cli_tool' |
| `module` | TEXT | Python module name (e.g., 'imginn_module'), NULL for CLI tools |
| `base_url` | TEXT | Primary URL for the service |
| `target_platform` | TEXT | What platform this scraper downloads from (instagram, snapchat, tiktok, NULL) |
| `enabled` | INTEGER | 1=enabled, 0=disabled |
| `proxy_enabled` | INTEGER | 1=use proxy, 0=direct connection |
| `proxy_url` | TEXT | Proxy URL (http, https, socks5 supported) |
| `flaresolverr_required` | INTEGER | 1=needs FlareSolverr for Cloudflare bypass |
| `cookies_json` | TEXT | JSON object holding the cookie list and user agent (see Storage Format) |
| `cookies_updated_at` | TEXT | When cookies were last updated |
| `last_test_at` | TEXT | When the connection was last tested |
| `last_test_status` | TEXT | Result of last test: 'success', 'failed', 'timeout' |
| `last_test_message` | TEXT | Error message from the last failed test |
| `settings_json` | TEXT | Module-specific settings as JSON |

### Scraper Types

| Type | Description | Examples |
|------|-------------|----------|
| `direct` | Downloads directly from the platform | instagram, tiktok, snapchat, coppermine |
| `proxy` | Uses a proxy service to download | imginn, fastdl, toolzu |
| `forum` | Forum scraper | forum_phun, forum_hqcelebcorner, forum_picturepub |
| `cli_tool` | Command-line tool wrapper | ytdlp, gallerydl |

### Target Platforms

The `target_platform` field indicates which platform the scraper actually downloads content from:

| Scraper | Target Platform | Notes |
|---------|-----------------|-------|
| imginn | instagram | Proxy service for Instagram |
| fastdl | instagram | Proxy service for Instagram |
| toolzu | instagram | Proxy service for Instagram |
| snapchat | snapchat | Direct via Playwright scraper |
| instagram | instagram | Direct via Instaloader |
| tiktok | tiktok | Direct via yt-dlp internally |
| coppermine | NULL | Not a social platform |
| forum_* | NULL | Not a social platform |
| ytdlp | NULL | Generic tool, multiple platforms |
| gallerydl | NULL | Generic tool, multiple platforms |

---

## Seed Data

Initial scrapers to populate on first run:

| id | name | type | module | base_url | target_platform | flaresolverr_required |
|----|------|------|--------|----------|-----------------|-----------------------|
| imginn | Imginn | proxy | imginn_module | https://imginn.com | instagram | 1 |
| fastdl | FastDL | proxy | fastdl_module | https://fastdl.app | instagram | 1 |
| toolzu | Toolzu | proxy | toolzu_module | https://toolzu.com | instagram | 1 |
| snapchat | Snapchat Direct | direct | snapchat_scraper | https://snapchat.com | snapchat | 0 |
| instagram | Instagram (Direct) | direct | instaloader_module | https://instagram.com | instagram | 0 |
| tiktok | TikTok | direct | tiktok_module | https://tiktok.com | tiktok | 0 |
| coppermine | Coppermine | direct | coppermine_module | https://hqdiesel.net | NULL | 1 |
| forum_phun | Phun.org | forum | forum_downloader | https://forum.phun.org | NULL | 1 |
| forum_hqcelebcorner | HQCelebCorner | forum | forum_downloader | https://hqcelebcorner.com | NULL | 0 |
| forum_picturepub | PicturePub | forum | forum_downloader | https://picturepub.net | NULL | 0 |
| ytdlp | yt-dlp | cli_tool | NULL | NULL | NULL | 0 |
| gallerydl | gallery-dl | cli_tool | NULL | NULL | NULL | 0 |

### Notes on Seed Data

1. **Snapchat**: Uses the direct Playwright-based scraper with optional proxy support (configured per-scraper in the Scrapers settings page)
2. **Forums**: Derived from existing `forum_threads` table entries and cookie files
3. **Excluded scrapers**: YouTube and Bilibili are NOT included - they are on-demand downloaders from the Video Downloader page, not scheduled scrapers

---

## Auto-Sync Logic

The scrapers table stays in sync with platform configurations automatically:

### When Forums Change

- New forum added in Forums settings → Create scraper entry with `type='forum'`
- Forum removed from settings → Remove scraper entry

### When Modules Are Enabled/Disabled

- Module enabled → Ensure scraper entry exists
- Module disabled → Scraper entry remains but `enabled=0`

### No Manual Add/Delete

- The Scrapers UI does NOT have Add or Delete buttons
- Scrapers are managed through their respective platform configuration pages
- The Scrapers UI only manages: proxy settings, testing, cookies

---

## Cookie Management

### Storage Format

Cookies are stored as JSON in the `cookies_json` column:

```json
{
  "cookies": [
    {
      "name": "cf_clearance",
      "value": "abc123...",
      "domain": ".imginn.com",
      "path": "/",
      "expiry": 1735689600
    },
    {
      "name": "session_id",
      "value": "xyz789...",
      "domain": "imginn.com",
      "path": "/",
      "expiry": -1
    }
  ],
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36..."
}
```

### Cookie Merge Logic

**CRITICAL**: When updating cookies, MERGE with existing - never wipe:

```python
def merge_cookies(existing_cookies: list, new_cookies: list) -> list:
    """
    Merge new cookies into existing, preserving non-updated cookies.

    This ensures:
    - Cloudflare cookies (cf_clearance, __cf_bm) get refreshed
    - Site session/auth cookies are preserved
    - No data loss on test/refresh
    """
    # Index existing cookies by name
    cookie_map = {c['name']: c for c in existing_cookies}

    # Update/add from new cookies
    for cookie in new_cookies:
        cookie_map[cookie['name']] = cookie

    return list(cookie_map.values())
```

### Cookie Sources

1. **FlareSolverr** - Automated Cloudflare bypass, returns CF cookies
2. **Upload** - User uploads JSON from a browser extension (EditThisCookie, Cookie-Editor)
3. **Module** - Some modules save cookies during operation

### Cookie File Migration

Existing cookie files to migrate on first run:

| File | Scraper ID |
|------|------------|
| `cookies/coppermine_cookies.json` | coppermine |
| `cookies/imginn_cookies.json` | imginn |
| `cookies/fastdl_cookies.json` | fastdl |
| `cookies/snapchat_cookies.json` | snapchat |
| `cookies/forum_cookies_phun.org.json` | forum_phun |
| `cookies/forum_cookies_HQCelebCorner.json` | forum_hqcelebcorner |
| `cookies/forum_cookies_PicturePub.json` | forum_picturepub |

---

## Proxy Configuration

### Supported Proxy Formats

```
http://host:port
http://user:pass@host:port
https://host:port
https://user:pass@host:port
socks5://host:port
socks5://user:pass@host:port
```

### FlareSolverr Proxy Integration

When a scraper has `proxy_enabled=1`, the proxy is passed to FlareSolverr:

```python
payload = {
    "cmd": "request.get",
    "url": url,
    "maxTimeout": 120000
}
if proxy_url:
    payload["proxy"] = {"url": proxy_url}
```

**Important**: Cloudflare cookies are tied to the IP address that solved the challenge. If FlareSolverr uses a proxy, subsequent requests MUST use the same proxy or the cookies will be invalid.

### Per-Module Proxy Usage

| Module | How Proxy is Used |
|--------|-------------------|
| coppermine_module | `requests.Session` with `session.proxies = {...}` |
| imginn_module | Playwright `proxy` option |
| fastdl_module | Playwright `proxy` option |
| toolzu_module | Playwright `proxy` option |
| snapchat_scraper | Playwright `proxy` option (optional, configured in Scrapers page) |
| instaloader_module | Instaloader `proxy` parameter |
| tiktok_module | yt-dlp `--proxy` flag |
| forum_downloader | Playwright `proxy` option + requests |
| ytdlp | `--proxy` flag |
| gallerydl | `--proxy` flag |

---

## API Endpoints

### GET /api/scrapers

List all scrapers with an optional type filter.
**Query Parameters:**

- `type` (optional): Filter by type ('direct', 'proxy', 'forum', 'cli_tool')

**Response:**

```json
{
  "scrapers": [
    {
      "id": "imginn",
      "name": "Imginn",
      "type": "proxy",
      "module": "imginn_module",
      "base_url": "https://imginn.com",
      "target_platform": "instagram",
      "enabled": true,
      "proxy_enabled": false,
      "proxy_url": null,
      "flaresolverr_required": true,
      "cookies_count": 23,
      "cookies_updated_at": "2025-12-01T10:30:00",
      "cookies_fresh": true,
      "last_test_at": "2025-12-01T10:30:00",
      "last_test_status": "success",
      "last_test_message": null
    }
  ]
}
```

### GET /api/scrapers/{id}

Get a single scraper configuration.

### PUT /api/scrapers/{id}

Update scraper settings.

**Request Body:**

```json
{
  "enabled": true,
  "proxy_enabled": true,
  "proxy_url": "socks5://user:pass@host:port",
  "base_url": "https://new-domain.com"
}
```

### POST /api/scrapers/{id}/test

Test the connection via FlareSolverr (if required) and save cookies on success.

**Response:**

```json
{
  "success": true,
  "message": "Connection successful, 23 cookies saved",
  "cookies_count": 23
}
```

### POST /api/scrapers/{id}/cookies

Upload cookies from a JSON file. Merges with existing cookies.

**Request Body:**

```json
{
  "cookies": [
    {"name": "session", "value": "abc123", "domain": ".example.com"}
  ]
}
```

**Response:**

```json
{
  "success": true,
  "message": "Merged 5 cookies (total: 28)",
  "cookies_count": 28
}
```

### DELETE /api/scrapers/{id}/cookies

Clear all cookies for a scraper.
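The `cookies_fresh` flag in the list response is derived from `cookies_updated_at` rather than stored. A minimal sketch of how the API layer could compute it, assuming a 24-hour freshness window (the cutoff is an illustrative assumption, not fixed by this design):

```python
from datetime import datetime, timedelta
from typing import Optional

# 24-hour window is an assumption for illustration; tune as needed.
FRESHNESS_WINDOW = timedelta(hours=24)

def cookies_fresh(cookies_updated_at: Optional[str],
                  now: Optional[datetime] = None) -> bool:
    """Return True if cookies were updated within the freshness window.

    `cookies_updated_at` is the ISO timestamp stored in the scrapers table;
    None or empty means the scraper has no cookies yet.
    """
    if not cookies_updated_at:
        return False
    updated = datetime.fromisoformat(cookies_updated_at)
    now = now or datetime.now()
    return now - updated <= FRESHNESS_WINDOW
```

The same helper can drive the UI's "Fresh / Expired" badge, keeping the freshness rule in one place.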
---

## Frontend UI

### Settings > Scrapers Tab

The Scrapers tab displays all scrapers grouped by type/platform:

```
┌───────────────────────────────────────────────────────────────────┐
│ Settings > Scrapers                                               │
├───────────────────────────────────────────────────────────────────┤
│ Filter: [All Types ▼]                                             │
│                                                                   │
│ ─── Instagram Proxies ─────────────────────────────────────────── │
│                                                                   │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ ● Imginn                                          [Enabled ▼] │ │
│ │   https://imginn.com                                          │ │
│ │   ☐ Use Proxy [                        ]                      │ │
│ │   Cloudflare: Required │ Cookies: ✓ Fresh (2h ago, 23 cookies)│ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]          │ │
│ └───────────────────────────────────────────────────────────────┘ │
│                                                                   │
│ ─── Direct ──────────────────────────────────────────────────────│
│                                                                   │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ ● Instagram (Direct)                              [Enabled ▼] │ │
│ │   https://instagram.com                                       │ │
│ │   ☐ Use Proxy [                        ]                      │ │
│ │   Cloudflare: Not Required │ Cookies: ✓ 12 cookies            │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]          │ │
│ └───────────────────────────────────────────────────────────────┘ │
│                                                                   │
│ ─── Forums ──────────────────────────────────────────────────────│
│                                                                   │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ ● Phun.org                                        [Enabled ▼] │ │
│ │   https://forum.phun.org                                      │ │
│ │   ☐ Use Proxy [                        ]                      │ │
│ │   Cloudflare: Required │ Cookies: ⚠ Expired (3 days)          │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]          │ │
│ └───────────────────────────────────────────────────────────────┘ │
│                                                                   │
│ ─── CLI Tools ───────────────────────────────────────────────────│
│                                                                   │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ ● yt-dlp                                          [Enabled ▼] │ │
│ │   Generic video downloader                                    │ │
│ │   ☐ Use Proxy [                        ]                      │ │
│ │   [Test Connection] [Upload Cookies]                          │ │
│ └───────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
```

### Button Visibility

| Button | When Shown |
|--------|------------|
| Test Connection | Always |
| Upload Cookies | Always |
| Clear Cookies | When cookies exist |

### No Add/Delete Buttons

Scrapers are NOT added or deleted from this UI. They are managed through:

- Forums settings (for forum scrapers)
- Platform settings (for other scrapers)

This UI only manages:

- Enable/disable
- Proxy configuration
- Cookie testing/upload/clear

---

## Module Integration

### Common Pattern

All modules follow this pattern to load scraper configuration:

```python
class SomeModule:
    def __init__(self, unified_db=None, scraper_id='some_scraper', **kwargs):
        self.db = unified_db
        self.scraper_id = scraper_id

        # Load config from DB
        self.config = self.db.get_scraper(scraper_id) if self.db else {}

        # Check if enabled
        if not self.config.get('enabled', True):
            raise ScraperDisabledError(f"{scraper_id} is disabled")

        # Get base URL from DB (not hardcoded)
        self.base_url = self.config.get('base_url', 'https://default.com')

        # Get proxy config
        self.proxy_url = None
        if self.config.get('proxy_enabled') and self.config.get('proxy_url'):
            self.proxy_url = self.config['proxy_url']

        # Initialize CloudflareHandler with DB storage
        self.cf_handler = CloudflareHandler(
            module_name=self.scraper_id,
            scraper_id=self.scraper_id,
            unified_db=self.db,
            proxy_url=self.proxy_url,
            **kwargs,  # module-specific options elided
        )
```

### CloudflareHandler Changes

```python
class CloudflareHandler:
    def __init__(self,
                 module_name: str,
                 scraper_id: str = None,   # For DB cookie storage
                 unified_db=None,          # DB reference
                 proxy_url: str = None,    # Proxy support
                 cookie_file: str = None,  # DEPRECATED: backwards compat
                 **kwargs):                # remaining options elided
        self.scraper_id = scraper_id
        self.db = unified_db
        self.proxy_url = proxy_url

    def get_cookies_via_flaresolverr(self, url: str, max_retries: int = 2) -> bool:
        payload = {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 120000
        }

        # Add proxy if configured
        if self.proxy_url:
            payload["proxy"] = {"url": self.proxy_url}

        # ... rest of implementation (sets `success` and `new_cookies`
        # from the FlareSolverr response)

        # On success, merge cookies (don't replace)
        if success:
            existing = self.load_cookies_from_db()
            merged = self.merge_cookies(existing, new_cookies)
            self.save_cookies_to_db(merged)

    def load_cookies_from_db(self) -> list:
        if self.db and self.scraper_id:
            config = self.db.get_scraper(self.scraper_id)
            if config and config.get('cookies_json'):
                data = json.loads(config['cookies_json'])
                return data.get('cookies', [])
        return []

    def save_cookies_to_db(self, cookies: list, user_agent: str = None):
        if self.db and self.scraper_id:
            data = {
                'cookies': cookies,
                'user_agent': user_agent
            }
            self.db.update_scraper_cookies(self.scraper_id, json.dumps(data))

    def merge_cookies(self, existing: list, new: list) -> list:
        cookie_map = {c['name']: c for c in existing}
        for cookie in new:
            cookie_map[cookie['name']] = cookie
        return list(cookie_map.values())
```

---

## Scheduler Integration

The scheduler uses the scrapers table to determine what to run:

```python
def run_scheduled_downloads(self):
    # Get all enabled scrapers
    scrapers = self.db.get_all_scrapers()
    enabled_scrapers = [s for s in scrapers if s['enabled']]

    for scraper in enabled_scrapers:
        if scraper['type'] == 'forum':
            self.run_forum_download(scraper['id'])
        elif scraper['id'] == 'coppermine':
            self.run_coppermine_download()
        elif scraper['id'] == 'instagram':
            self.run_instagram_download()
        elif scraper['id'] == 'tiktok':
            self.run_tiktok_download()
        # etc.
```

---

## Migration Plan

### Step 1: Create Table

Add to `unified_database.py`:

```python
def _create_scrapers_table(self):
    self.cursor.execute('''
        CREATE TABLE IF NOT EXISTS scrapers (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            type TEXT NOT NULL,
            module TEXT,
            base_url TEXT,
            target_platform TEXT,
            enabled INTEGER DEFAULT 1,
            proxy_enabled INTEGER DEFAULT 0,
            proxy_url TEXT,
            flaresolverr_required INTEGER DEFAULT 0,
            cookies_json TEXT,
            cookies_updated_at TEXT,
            last_test_at TEXT,
            last_test_status TEXT,
            last_test_message TEXT,
            settings_json TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    ''')
```

### Step 2: Seed Initial Data

```python
def _seed_scrapers(self):
    scrapers = [
        ('imginn', 'Imginn', 'proxy', 'imginn_module', 'https://imginn.com', 'instagram', 1),
        ('fastdl', 'FastDL', 'proxy', 'fastdl_module', 'https://fastdl.app', 'instagram', 1),
        ('toolzu', 'Toolzu', 'proxy', 'toolzu_module', 'https://toolzu.com', 'instagram', 1),
        ('snapchat', 'Snapchat Direct', 'direct', 'snapchat_scraper', 'https://snapchat.com', 'snapchat', 0),
        ('instagram', 'Instagram (Direct)', 'direct', 'instaloader_module', 'https://instagram.com', 'instagram', 0),
        ('tiktok', 'TikTok', 'direct', 'tiktok_module', 'https://tiktok.com', 'tiktok', 0),
        ('coppermine', 'Coppermine', 'direct', 'coppermine_module', 'https://hqdiesel.net', None, 1),
        ('forum_phun', 'Phun.org', 'forum', 'forum_downloader', 'https://forum.phun.org', None, 1),
        ('forum_hqcelebcorner', 'HQCelebCorner', 'forum', 'forum_downloader', 'https://hqcelebcorner.com', None, 0),
        ('forum_picturepub', 'PicturePub', 'forum', 'forum_downloader', 'https://picturepub.net', None, 0),
        ('ytdlp', 'yt-dlp', 'cli_tool', None, None, None, 0),
        ('gallerydl', 'gallery-dl', 'cli_tool', None, None, None, 0),
    ]
    for s in scrapers:
        self.cursor.execute('''
            INSERT OR IGNORE INTO scrapers
            (id, name, type, module, base_url, target_platform, flaresolverr_required)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', s)
```

### Step 3: Migrate Cookies

```python
def _migrate_cookies_to_db(self):
    cookie_files = {
        'coppermine': '/opt/media-downloader/cookies/coppermine_cookies.json',
        'imginn': '/opt/media-downloader/cookies/imginn_cookies.json',
        'fastdl': '/opt/media-downloader/cookies/fastdl_cookies.json',
        'snapchat': '/opt/media-downloader/cookies/snapchat_cookies.json',
        'forum_phun': '/opt/media-downloader/cookies/forum_cookies_phun.org.json',
        'forum_hqcelebcorner': '/opt/media-downloader/cookies/forum_cookies_HQCelebCorner.json',
        'forum_picturepub': '/opt/media-downloader/cookies/forum_cookies_PicturePub.json',
    }
    for scraper_id, cookie_file in cookie_files.items():
        if os.path.exists(cookie_file):
            try:
                with open(cookie_file, 'r') as f:
                    data = json.load(f)
                # Store in DB
                self.cursor.execute('''
                    UPDATE scrapers
                    SET cookies_json = ?, cookies_updated_at = ?
                    WHERE id = ?
                ''', (json.dumps(data), datetime.now().isoformat(), scraper_id))
                self.logger.info(f"Migrated cookies for {scraper_id}")
            except Exception as e:
                self.logger.error(f"Failed to migrate cookies for {scraper_id}: {e}")
```

### Step 4: Migrate Snapchat proxy_domain

```python
def _migrate_snapchat_proxy_domain(self):
    # Get current proxy_domain from settings
    settings = self.get_setting('snapchat')
    if settings and 'proxy_domain' in settings:
        proxy_domain = settings['proxy_domain']
        base_url = f"https://{proxy_domain}"
        self.cursor.execute('''
            UPDATE scrapers SET base_url = ? WHERE id = 'snapchat'
        ''', (base_url,))
        # Remove from settings (now in scrapers table)
        del settings['proxy_domain']
        self.save_setting('snapchat', settings)
```

---

## Implementation Order

| Step | Task | Files to Modify |
|------|------|-----------------|
| 1 | Database schema + migration | `unified_database.py` |
| 2 | Backend API endpoints | `api.py` |
| 3 | CloudflareHandler proxy + DB storage + merge logic | `cloudflare_handler.py` |
| 4 | Frontend Scrapers tab | `ScrapersTab.tsx`, `Settings.tsx`, `api.ts` |
| 5 | Update coppermine_module (test case) | `coppermine_module.py` |
| 6 | Test end-to-end | - |
| 7 | Update remaining modules | `imginn_module.py`, `fastdl_module.py`, `toolzu_module.py`, `snapchat_scraper.py`, `instaloader_module.py`, `tiktok_module.py`, `forum_downloader.py` |
| 8 | Update scheduler | `scheduler.py` |
| 9 | Cookie file cleanup | Remove old cookie files after verification |

---

## Testing Checklist

### Database

- [ ] Table created on first run
- [ ] Seed data populated correctly
- [ ] Cookies migrated from files
- [ ] Snapchat proxy_domain migrated

### API

- [ ] GET /api/scrapers returns all scrapers
- [ ] GET /api/scrapers?type=forum filters correctly
- [ ] PUT /api/scrapers/{id} updates settings
- [ ] POST /api/scrapers/{id}/test works with FlareSolverr
- [ ] POST /api/scrapers/{id}/test works with proxy
- [ ] POST /api/scrapers/{id}/cookies merges correctly
- [ ] DELETE /api/scrapers/{id}/cookies clears cookies

### Frontend

- [ ] Scrapers tab displays all scrapers
- [ ] Grouping by type works
- [ ] Filter dropdown works
- [ ] Enable/disable toggle works
- [ ] Proxy checkbox and URL input work
- [ ] Test Connection button works
- [ ] Upload Cookies button works
- [ ] Clear Cookies button works
- [ ] Cookie status shows correctly (fresh/expired/none)

### Modules

- [ ] coppermine_module loads config from DB
- [ ] coppermine_module uses proxy when configured
- [ ] coppermine_module uses cookies from DB
- [ ] All other modules updated and working

### Scheduler

- [ ] Only runs enabled scrapers
- [ ] Passes correct scraper_id to modules

---

## Rollback Plan

If issues occur:

1. **Database**: The old cookie files are preserved as backups
2. **Modules**: Can fall back to reading cookie files if the DB fails
3. **API**: Add backwards compatibility for old endpoints if needed

---

## Future Enhancements

Potential additions not in the initial scope:

1. **Rotating proxies** - Support proxy pools with rotation
2. **Proxy health monitoring** - Track proxy success/failure rates
3. **Auto-refresh cookies** - Background job to refresh expiring cookies
4. **Cookie export** - Download cookies as JSON for backup
5. **Scraper metrics** - Track download success rates per scraper
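As a feel for how enhancement 3 might work, the selection step of an auto-refresh job could be a pure function over rows from the `scrapers` table. This is a hypothetical sketch, not part of the initial scope; the function name and the 24-hour staleness threshold are assumptions:

```python
from datetime import datetime, timedelta

def scrapers_needing_refresh(scrapers: list,
                             now: datetime,
                             max_age: timedelta = timedelta(hours=24)) -> list:
    """Return IDs of enabled, FlareSolverr-backed scrapers with stale or missing cookies.

    `scrapers` is a list of dicts shaped like rows from the `scrapers` table
    (keys: id, enabled, flaresolverr_required, cookies_updated_at).
    """
    stale = []
    for s in scrapers:
        # Only scrapers that are enabled AND rely on FlareSolverr need refreshing.
        if not s.get('enabled') or not s.get('flaresolverr_required'):
            continue
        updated_at = s.get('cookies_updated_at')
        if not updated_at or now - datetime.fromisoformat(updated_at) > max_age:
            stale.append(s['id'])
    return stale
```

A background job would call this on a schedule and invoke the existing `POST /api/scrapers/{id}/test` flow for each returned ID, reusing the merge logic so session cookies survive the refresh.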