# Scraper Proxy Configuration System

## Overview

This document describes the design and implementation plan for a centralized scraper configuration system that provides:

1. **Per-scraper proxy settings** - Configure different proxies for different scrapers
2. **Centralized cookie management** - Store cookies in the database instead of in files
3. **FlareSolverr integration** - Test connections and refresh Cloudflare cookies
4. **Cookie upload support** - Upload cookies from browser extensions for authenticated access
5. **Unified Settings UI** - A single place to manage all scraper configurations

|
## Background

### Problem Statement

- Proxy settings are not configurable per module
- Cookies are stored in scattered JSON files
- There is no UI to test FlareSolverr connections or manage cookies
- Adding new forums requires code changes
- There is no visibility into cookie freshness or scraper health

### Solution

A new `scrapers` database table that:

- Stores configuration for all automated scrapers
- Provides per-scraper proxy settings
- Centralizes cookie storage with merge logic
- Syncs automatically with platform configurations
- Exposes management via the Settings UI

---
## Database Schema

### Table: `scrapers`

```sql
CREATE TABLE scrapers (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,              -- 'direct', 'proxy', 'forum', 'cli_tool'
    module TEXT,                     -- Python module name, NULL for cli_tool
    base_url TEXT,                   -- Primary URL for the scraper
    target_platform TEXT,            -- 'instagram', 'snapchat', 'tiktok', NULL for forums/cli
    enabled INTEGER DEFAULT 1,       -- Enable/disable scraper

    -- Proxy settings
    proxy_enabled INTEGER DEFAULT 0,
    proxy_url TEXT,                  -- e.g., "socks5://user:pass@host:port"

    -- Cloudflare/cookie settings
    flaresolverr_required INTEGER DEFAULT 0,
    cookies_json TEXT,               -- JSON blob of cookies
    cookies_updated_at TEXT,         -- ISO timestamp of last cookie update

    -- Test status
    last_test_at TEXT,               -- ISO timestamp of last test
    last_test_status TEXT,           -- 'success', 'failed', 'timeout'
    last_test_message TEXT,          -- Error message if failed

    -- Module-specific settings
    settings_json TEXT,              -- Additional JSON settings per scraper

    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
```

### Column Definitions

| Column | Type | Description |
|--------|------|-------------|
| `id` | TEXT | Unique identifier (e.g., 'imginn', 'forum_phun') |
| `name` | TEXT | Display name shown in the UI |
| `type` | TEXT | One of: 'direct', 'proxy', 'forum', 'cli_tool' |
| `module` | TEXT | Python module name (e.g., 'imginn_module'), NULL for CLI tools |
| `base_url` | TEXT | Primary URL for the service |
| `target_platform` | TEXT | Platform this scraper downloads from (instagram, snapchat, tiktok, NULL) |
| `enabled` | INTEGER | 1=enabled, 0=disabled |
| `proxy_enabled` | INTEGER | 1=use proxy, 0=direct connection |
| `proxy_url` | TEXT | Proxy URL (http, https, socks5 supported) |
| `flaresolverr_required` | INTEGER | 1=needs FlareSolverr for Cloudflare bypass |
| `cookies_json` | TEXT | JSON array of cookie objects |
| `cookies_updated_at` | TEXT | When cookies were last updated |
| `last_test_at` | TEXT | When the connection was last tested |
| `last_test_status` | TEXT | Result of the last test: 'success', 'failed', 'timeout' |
| `last_test_message` | TEXT | Error message from the last failed test |
| `settings_json` | TEXT | Module-specific settings as JSON |

### Scraper Types

| Type | Description | Examples |
|------|-------------|----------|
| `direct` | Downloads directly from the platform | instagram, tiktok, snapchat, coppermine |
| `proxy` | Uses a proxy service to download | imginn, fastdl, toolzu |
| `forum` | Forum scraper | forum_phun, forum_hqcelebcorner, forum_picturepub |
| `cli_tool` | Command-line tool wrapper | ytdlp, gallerydl |

### Target Platforms

The `target_platform` field indicates which platform the scraper actually downloads content from:

| Scraper | Target Platform | Notes |
|---------|-----------------|-------|
| imginn | instagram | Proxy service for Instagram |
| fastdl | instagram | Proxy service for Instagram |
| toolzu | instagram | Proxy service for Instagram |
| snapchat | snapchat | Direct via Playwright scraper |
| instagram | instagram | Direct via Instaloader |
| tiktok | tiktok | Direct via yt-dlp internally |
| coppermine | NULL | Not a social platform |
| forum_* | NULL | Not a social platform |
| ytdlp | NULL | Generic tool, multiple platforms |
| gallerydl | NULL | Generic tool, multiple platforms |

---

## Seed Data

Initial scrapers to populate on first run:

| id | name | type | module | base_url | target_platform | flaresolverr_required |
|----|------|------|--------|----------|-----------------|----------------------|
| imginn | Imginn | proxy | imginn_module | https://imginn.com | instagram | 1 |
| fastdl | FastDL | proxy | fastdl_module | https://fastdl.app | instagram | 1 |
| toolzu | Toolzu | proxy | toolzu_module | https://toolzu.com | instagram | 1 |
| snapchat | Snapchat Direct | direct | snapchat_scraper | https://snapchat.com | snapchat | 0 |
| instagram | Instagram (Direct) | direct | instaloader_module | https://instagram.com | instagram | 0 |
| tiktok | TikTok | direct | tiktok_module | https://tiktok.com | tiktok | 0 |
| coppermine | Coppermine | direct | coppermine_module | https://hqdiesel.net | NULL | 1 |
| forum_phun | Phun.org | forum | forum_downloader | https://forum.phun.org | NULL | 1 |
| forum_hqcelebcorner | HQCelebCorner | forum | forum_downloader | https://hqcelebcorner.com | NULL | 0 |
| forum_picturepub | PicturePub | forum | forum_downloader | https://picturepub.net | NULL | 0 |
| ytdlp | yt-dlp | cli_tool | NULL | NULL | NULL | 0 |
| gallerydl | gallery-dl | cli_tool | NULL | NULL | NULL | 0 |

### Notes on Seed Data

1. **Snapchat**: Uses the direct Playwright-based scraper with optional proxy support (configured per scraper in the Scrapers settings page)

2. **Forums**: Derived from existing `forum_threads` table entries and cookie files

3. **Excluded scrapers**: YouTube and Bilibili are NOT included - they are on-demand downloaders used from the Video Downloader page, not scheduled scrapers

---

## Auto-Sync Logic

The scrapers table stays in sync with platform configurations automatically:

### When Forums Change

- New forum added in Forums settings → create a scraper entry with `type='forum'`
- Forum removed from settings → remove the scraper entry

### When Modules Are Enabled/Disabled

- Module enabled → ensure a scraper entry exists
- Module disabled → the scraper entry remains, but with `enabled=0`

### No Manual Add/Delete

- The Scrapers UI does NOT have Add or Delete buttons
- Scrapers are managed through their respective platform configuration pages
- The Scrapers UI only manages proxy settings, testing, and cookies
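
The forum-sync rule above can be sketched as a small reconciliation function. This is a sketch, not the implementation: the `forums` dict shape and the function name are assumptions, and only the columns from the schema in this document are used.

```python
import sqlite3

def sync_forum_scrapers(conn: sqlite3.Connection, forums: dict) -> None:
    """Reconcile scraper rows of type='forum' with the configured forums.

    `forums` maps a forum id to {'name': ..., 'base_url': ...} (assumed shape).
    """
    cur = conn.cursor()
    # Add a scraper row for any newly configured forum
    for forum_id, cfg in forums.items():
        cur.execute(
            "INSERT OR IGNORE INTO scrapers (id, name, type, module, base_url) "
            "VALUES (?, ?, 'forum', 'forum_downloader', ?)",
            (forum_id, cfg['name'], cfg['base_url']),
        )
    # Remove scraper rows whose forum was deleted from settings
    cur.execute("SELECT id FROM scrapers WHERE type = 'forum'")
    for (scraper_id,) in cur.fetchall():
        if scraper_id not in forums:
            cur.execute("DELETE FROM scrapers WHERE id = ?", (scraper_id,))
    conn.commit()
```

`INSERT OR IGNORE` keeps existing rows (and their cookies/proxy settings) intact when the same forum is synced again.
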
---

## Cookie Management

### Storage Format

Cookies are stored as JSON in the `cookies_json` column:

```json
{
  "cookies": [
    {
      "name": "cf_clearance",
      "value": "abc123...",
      "domain": ".imginn.com",
      "path": "/",
      "expiry": 1735689600
    },
    {
      "name": "session_id",
      "value": "xyz789...",
      "domain": "imginn.com",
      "path": "/",
      "expiry": -1
    }
  ],
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36..."
}
```

### Cookie Merge Logic

**CRITICAL**: When updating cookies, MERGE with the existing set - never wipe it:

```python
def merge_cookies(existing_cookies: list, new_cookies: list) -> list:
    """
    Merge new cookies into existing, preserving non-updated cookies.

    This ensures:
    - Cloudflare cookies (cf_clearance, __cf_bm) get refreshed
    - Site session/auth cookies are preserved
    - No data loss on test/refresh
    """
    # Index existing by name
    cookie_map = {c['name']: c for c in existing_cookies}

    # Update/add from new cookies
    for cookie in new_cookies:
        cookie_map[cookie['name']] = cookie

    return list(cookie_map.values())
```
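
For illustration, refreshing Cloudflare cookies through `merge_cookies` replaces only the refreshed entries and leaves the site session untouched (the function is repeated here so the example is self-contained):

```python
def merge_cookies(existing_cookies: list, new_cookies: list) -> list:
    # Same logic as above: index by name, overwrite or add, keep the rest
    cookie_map = {c['name']: c for c in existing_cookies}
    for cookie in new_cookies:
        cookie_map[cookie['name']] = cookie
    return list(cookie_map.values())

existing = [
    {'name': 'cf_clearance', 'value': 'old'},
    {'name': 'session_id', 'value': 'keep-me'},
]
refreshed = [{'name': 'cf_clearance', 'value': 'new'}]

# cf_clearance is refreshed, session_id survives
merged = {c['name']: c['value'] for c in merge_cookies(existing, refreshed)}
```
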

### Cookie Sources

1. **FlareSolverr** - Automated Cloudflare bypass; returns CF cookies
2. **Upload** - User uploads JSON from a browser extension (EditThisCookie, Cookie-Editor)
3. **Module** - Some modules save cookies during operation
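
Uploaded exports do not always match the storage format above; Cookie-Editor, for example, emits an `expirationDate` field rather than `expiry`. A normalization sketch for the upload path - the export field names handled here are assumptions about the extension formats:

```python
def normalize_uploaded_cookie(raw: dict) -> dict:
    """Map a browser-extension cookie export to the internal storage shape.

    Assumes exports carry name/value/domain/path plus either 'expiry' or
    'expirationDate'; session cookies are stored with expiry == -1.
    """
    expiry = raw.get('expiry', raw.get('expirationDate', -1))
    return {
        'name': raw['name'],
        'value': raw['value'],
        'domain': raw.get('domain', ''),
        'path': raw.get('path', '/'),
        'expiry': int(expiry) if expiry != -1 else -1,
    }
```
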

### Cookie File Migration

Existing cookie files to migrate on first run:

| File | Scraper ID |
|------|------------|
| `cookies/coppermine_cookies.json` | coppermine |
| `cookies/imginn_cookies.json` | imginn |
| `cookies/fastdl_cookies.json` | fastdl |
| `cookies/snapchat_cookies.json` | snapchat |
| `cookies/forum_cookies_phun.org.json` | forum_phun |
| `cookies/forum_cookies_HQCelebCorner.json` | forum_hqcelebcorner |
| `cookies/forum_cookies_PicturePub.json` | forum_picturepub |

---

## Proxy Configuration

### Supported Proxy Formats

```
http://host:port
http://user:pass@host:port
https://host:port
https://user:pass@host:port
socks5://host:port
socks5://user:pass@host:port
```
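
Downstream clients want the proxy in different shapes (a `requests`-style proxies dict vs. Playwright's `proxy` option, which splits credentials out of the server URL), so one stored `proxy_url` can be adapted per client. A sketch using only standard-library parsing:

```python
from urllib.parse import urlsplit

def requests_proxies(proxy_url: str) -> dict:
    """Shape used by requests: the same proxy URL for both schemes."""
    return {'http': proxy_url, 'https': proxy_url}

def playwright_proxy(proxy_url: str) -> dict:
    """Playwright expects credentials separated from the server URL."""
    parts = urlsplit(proxy_url)
    proxy = {'server': f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        proxy['username'] = parts.username
        proxy['password'] = parts.password or ''
    return proxy
```
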

### FlareSolverr Proxy Integration

When a scraper has `proxy_enabled=1`, the proxy is passed to FlareSolverr:

```python
payload = {
    "cmd": "request.get",
    "url": url,
    "maxTimeout": 120000
}
if proxy_url:
    payload["proxy"] = {"url": proxy_url}
```

**Important**: Cloudflare cookies are tied to the IP address that solved the challenge. If FlareSolverr uses a proxy, subsequent requests MUST use the same proxy or the cookies will be invalid.

### Per-Module Proxy Usage

| Module | How the Proxy is Used |
|--------|-----------------------|
| coppermine_module | `requests` session `proxies` dict |
| imginn_module | Playwright `proxy` option |
| fastdl_module | Playwright `proxy` option |
| toolzu_module | Playwright `proxy` option |
| snapchat_scraper | Playwright `proxy` option (optional, configured in the Scrapers page) |
| instaloader_module | Instaloader `proxy` parameter |
| tiktok_module | yt-dlp `--proxy` flag |
| forum_downloader | Playwright `proxy` option + requests |
| ytdlp | `--proxy` flag |
| gallerydl | `--proxy` flag |

---

## API Endpoints

### GET /api/scrapers

List all scrapers, with an optional type filter.

**Query Parameters:**

- `type` (optional): Filter by type ('direct', 'proxy', 'forum', 'cli_tool')

**Response:**

```json
{
  "scrapers": [
    {
      "id": "imginn",
      "name": "Imginn",
      "type": "proxy",
      "module": "imginn_module",
      "base_url": "https://imginn.com",
      "target_platform": "instagram",
      "enabled": true,
      "proxy_enabled": false,
      "proxy_url": null,
      "flaresolverr_required": true,
      "cookies_count": 23,
      "cookies_updated_at": "2025-12-01T10:30:00",
      "cookies_fresh": true,
      "last_test_at": "2025-12-01T10:30:00",
      "last_test_status": "success",
      "last_test_message": null
    }
  ]
}
```

### GET /api/scrapers/{id}

Get a single scraper configuration.

### PUT /api/scrapers/{id}

Update scraper settings.

**Request Body:**

```json
{
  "enabled": true,
  "proxy_enabled": true,
  "proxy_url": "socks5://user:pass@host:port",
  "base_url": "https://new-domain.com"
}
```
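
Whatever framework `api.py` uses, the update handler should whitelist the mutable columns so a request body cannot overwrite test status or stored cookies. A framework-neutral sketch (the field list follows the request body above; anything beyond it is an assumption):

```python
# Only the columns the PUT endpoint is allowed to change
ALLOWED_UPDATE_FIELDS = {'enabled', 'proxy_enabled', 'proxy_url', 'base_url'}

def apply_scraper_update(payload: dict) -> dict:
    """Filter a PUT body down to the editable columns, coercing flags to 0/1."""
    updates = {k: v for k, v in payload.items() if k in ALLOWED_UPDATE_FIELDS}
    for flag in ('enabled', 'proxy_enabled'):
        if flag in updates:
            updates[flag] = 1 if updates[flag] else 0
    return updates
```

The filtered dict can then be turned into a parameterized `UPDATE scrapers SET ...` statement.
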

### POST /api/scrapers/{id}/test

Test the connection via FlareSolverr (if required) and save cookies on success.

**Response:**

```json
{
  "success": true,
  "message": "Connection successful, 23 cookies saved",
  "cookies_count": 23
}
```

### POST /api/scrapers/{id}/cookies

Upload cookies from a JSON file. Merges with existing cookies.

**Request Body:**

```json
{
  "cookies": [
    {"name": "session", "value": "abc123", "domain": ".example.com"}
  ]
}
```

**Response:**

```json
{
  "success": true,
  "message": "Merged 5 cookies (total: 28)",
  "cookies_count": 28
}
```

### DELETE /api/scrapers/{id}/cookies

Clear all cookies for a scraper.

---

## Frontend UI

### Settings > Scrapers Tab

The Scrapers tab displays all scrapers grouped by type/platform:

```
┌───────────────────────────────────────────────────────────────────────┐
│ Settings > Scrapers                                                   │
├───────────────────────────────────────────────────────────────────────┤
│ Filter: [All Types ▼]                                                 │
│                                                                       │
│ ─── Instagram Proxies ──────────────────────────────────────────────  │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Imginn                                             [Enabled ▼]  │ │
│ │   https://imginn.com                                              │ │
│ │   ☐ Use Proxy  [                                    ]             │ │
│ │   Cloudflare: Required │ Cookies: ✓ Fresh (2h ago, 23 cookies)    │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── Direct ─────────────────────────────────────────────────────────  │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Instagram (Direct)                                 [Enabled ▼]  │ │
│ │   https://instagram.com                                           │ │
│ │   ☐ Use Proxy  [                                    ]             │ │
│ │   Cloudflare: Not Required │ Cookies: ✓ 12 cookies                │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── Forums ─────────────────────────────────────────────────────────  │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Phun.org                                           [Enabled ▼]  │ │
│ │   https://forum.phun.org                                          │ │
│ │   ☐ Use Proxy  [                                    ]             │ │
│ │   Cloudflare: Required │ Cookies: ⚠ Expired (3 days)              │ │
│ │   [Test Connection] [Upload Cookies] [Clear Cookies]              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── CLI Tools ──────────────────────────────────────────────────────  │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● yt-dlp                                             [Enabled ▼]  │ │
│ │   Generic video downloader                                        │ │
│ │   ☐ Use Proxy  [                                    ]             │ │
│ │   [Test Connection] [Upload Cookies]                              │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```

### Button Visibility

| Button | When Shown |
|--------|------------|
| Test Connection | Always |
| Upload Cookies | Always |
| Clear Cookies | When cookies exist |

### No Add/Delete Buttons

Scrapers are NOT added or deleted from this UI. They are managed through:

- Forums settings (for forum scrapers)
- Platform settings (for all other scrapers)

This UI only manages:

- Enable/disable
- Proxy configuration
- Cookie testing/upload/clear

---

## Module Integration

### Common Pattern

All modules follow this pattern to load their scraper configuration:

```python
class SomeModule:
    def __init__(self, unified_db=None, scraper_id='some_scraper', ...):
        self.db = unified_db
        self.scraper_id = scraper_id

        # Load config from DB
        self.config = self.db.get_scraper(scraper_id) if self.db else {}

        # Check if enabled
        if not self.config.get('enabled', True):
            raise ScraperDisabledError(f"{scraper_id} is disabled")

        # Get base URL from DB (not hardcoded)
        self.base_url = self.config.get('base_url', 'https://default.com')

        # Get proxy config
        self.proxy_url = None
        if self.config.get('proxy_enabled') and self.config.get('proxy_url'):
            self.proxy_url = self.config['proxy_url']

        # Initialize CloudflareHandler with DB storage
        self.cf_handler = CloudflareHandler(
            module_name=self.scraper_id,
            scraper_id=self.scraper_id,
            unified_db=self.db,
            proxy_url=self.proxy_url,
            ...
        )
```

### CloudflareHandler Changes

```python
class CloudflareHandler:
    def __init__(self,
                 module_name: str,
                 scraper_id: str = None,   # For DB cookie storage
                 unified_db=None,          # DB reference
                 proxy_url: str = None,    # Proxy support
                 cookie_file: str = None,  # DEPRECATED: backwards compat
                 ...):
        self.scraper_id = scraper_id
        self.db = unified_db
        self.proxy_url = proxy_url

    def get_cookies_via_flaresolverr(self, url: str, max_retries: int = 2) -> bool:
        payload = {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 120000
        }
        # Add proxy if configured
        if self.proxy_url:
            payload["proxy"] = {"url": self.proxy_url}

        # ... rest of implementation

        # On success, merge cookies (don't replace)
        if success:
            existing = self.load_cookies_from_db()
            merged = self.merge_cookies(existing, new_cookies)
            self.save_cookies_to_db(merged)

    def load_cookies_from_db(self) -> list:
        if self.db and self.scraper_id:
            config = self.db.get_scraper(self.scraper_id)
            if config and config.get('cookies_json'):
                data = json.loads(config['cookies_json'])
                return data.get('cookies', [])
        return []

    def save_cookies_to_db(self, cookies: list, user_agent: str = None):
        if self.db and self.scraper_id:
            data = {
                'cookies': cookies,
                'user_agent': user_agent
            }
            self.db.update_scraper_cookies(self.scraper_id, json.dumps(data))

    def merge_cookies(self, existing: list, new: list) -> list:
        cookie_map = {c['name']: c for c in existing}
        for cookie in new:
            cookie_map[cookie['name']] = cookie
        return list(cookie_map.values())
```

---

## Scheduler Integration

The scheduler uses the scrapers table to determine what to run:

```python
def run_scheduled_downloads(self):
    # Get all enabled scrapers
    scrapers = self.db.get_all_scrapers()
    enabled_scrapers = [s for s in scrapers if s['enabled']]

    for scraper in enabled_scrapers:
        if scraper['type'] == 'forum':
            self.run_forum_download(scraper['id'])
        elif scraper['id'] == 'coppermine':
            self.run_coppermine_download()
        elif scraper['id'] == 'instagram':
            self.run_instagram_download()
        elif scraper['id'] == 'tiktok':
            self.run_tiktok_download()
        # etc.
```
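
The if/elif chain grows with every scraper; an alternative worth considering is a dispatch table keyed by scraper id, with forum scrapers sharing one runner keyed by type. A sketch with hypothetical runner names, written as a free function so the dispatch logic is testable in isolation:

```python
def run_scheduled_downloads(db, runners_by_id: dict, run_forum_download) -> None:
    """Dispatch each enabled scraper to its runner.

    `runners_by_id` maps scraper id -> zero-arg callable (hypothetical names);
    forum scrapers all go through run_forum_download(scraper_id).
    """
    for scraper in db.get_all_scrapers():
        if not scraper['enabled']:
            continue
        if scraper['type'] == 'forum':
            run_forum_download(scraper['id'])
        elif scraper['id'] in runners_by_id:
            runners_by_id[scraper['id']]()
```

Adding a new direct scraper then means adding one registry entry rather than another `elif` branch.
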

---

## Migration Plan

### Step 1: Create Table

Add to `unified_database.py`:

```python
def _create_scrapers_table(self):
    self.cursor.execute('''
        CREATE TABLE IF NOT EXISTS scrapers (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            type TEXT NOT NULL,
            module TEXT,
            base_url TEXT,
            target_platform TEXT,
            enabled INTEGER DEFAULT 1,
            proxy_enabled INTEGER DEFAULT 0,
            proxy_url TEXT,
            flaresolverr_required INTEGER DEFAULT 0,
            cookies_json TEXT,
            cookies_updated_at TEXT,
            last_test_at TEXT,
            last_test_status TEXT,
            last_test_message TEXT,
            settings_json TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    ''')
```

### Step 2: Seed Initial Data

```python
def _seed_scrapers(self):
    scrapers = [
        ('imginn', 'Imginn', 'proxy', 'imginn_module', 'https://imginn.com', 'instagram', 1),
        ('fastdl', 'FastDL', 'proxy', 'fastdl_module', 'https://fastdl.app', 'instagram', 1),
        ('toolzu', 'Toolzu', 'proxy', 'toolzu_module', 'https://toolzu.com', 'instagram', 1),
        ('snapchat', 'Snapchat Direct', 'direct', 'snapchat_scraper', 'https://snapchat.com', 'snapchat', 0),
        ('instagram', 'Instagram (Direct)', 'direct', 'instaloader_module', 'https://instagram.com', 'instagram', 0),
        ('tiktok', 'TikTok', 'direct', 'tiktok_module', 'https://tiktok.com', 'tiktok', 0),
        ('coppermine', 'Coppermine', 'direct', 'coppermine_module', 'https://hqdiesel.net', None, 1),
        ('forum_phun', 'Phun.org', 'forum', 'forum_downloader', 'https://forum.phun.org', None, 1),
        ('forum_hqcelebcorner', 'HQCelebCorner', 'forum', 'forum_downloader', 'https://hqcelebcorner.com', None, 0),
        ('forum_picturepub', 'PicturePub', 'forum', 'forum_downloader', 'https://picturepub.net', None, 0),
        ('ytdlp', 'yt-dlp', 'cli_tool', None, None, None, 0),
        ('gallerydl', 'gallery-dl', 'cli_tool', None, None, None, 0),
    ]

    for s in scrapers:
        self.cursor.execute('''
            INSERT OR IGNORE INTO scrapers
            (id, name, type, module, base_url, target_platform, flaresolverr_required)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', s)
```

### Step 3: Migrate Cookies

```python
def _migrate_cookies_to_db(self):
    cookie_files = {
        'coppermine': '/opt/media-downloader/cookies/coppermine_cookies.json',
        'imginn': '/opt/media-downloader/cookies/imginn_cookies.json',
        'fastdl': '/opt/media-downloader/cookies/fastdl_cookies.json',
        'snapchat': '/opt/media-downloader/cookies/snapchat_cookies.json',
        'forum_phun': '/opt/media-downloader/cookies/forum_cookies_phun.org.json',
        'forum_hqcelebcorner': '/opt/media-downloader/cookies/forum_cookies_HQCelebCorner.json',
        'forum_picturepub': '/opt/media-downloader/cookies/forum_cookies_PicturePub.json',
    }

    for scraper_id, cookie_file in cookie_files.items():
        if os.path.exists(cookie_file):
            try:
                with open(cookie_file, 'r') as f:
                    data = json.load(f)

                # Store in DB
                self.cursor.execute('''
                    UPDATE scrapers
                    SET cookies_json = ?, cookies_updated_at = ?
                    WHERE id = ?
                ''', (json.dumps(data), datetime.now().isoformat(), scraper_id))

                self.logger.info(f"Migrated cookies for {scraper_id}")
            except Exception as e:
                self.logger.error(f"Failed to migrate cookies for {scraper_id}: {e}")
```

### Step 4: Migrate Snapchat proxy_domain

```python
def _migrate_snapchat_proxy_domain(self):
    # Get current proxy_domain from settings
    settings = self.get_setting('snapchat')
    if settings and 'proxy_domain' in settings:
        proxy_domain = settings['proxy_domain']
        base_url = f"https://{proxy_domain}"

        self.cursor.execute('''
            UPDATE scrapers SET base_url = ? WHERE id = 'snapchat'
        ''', (base_url,))

        # Remove from settings (now in the scrapers table)
        del settings['proxy_domain']
        self.save_setting('snapchat', settings)
```

---

## Implementation Order

| Step | Task | Files to Modify |
|------|------|-----------------|
| 1 | Database schema + migration | `unified_database.py` |
| 2 | Backend API endpoints | `api.py` |
| 3 | CloudflareHandler proxy + DB storage + merge logic | `cloudflare_handler.py` |
| 4 | Frontend Scrapers tab | `ScrapersTab.tsx`, `Settings.tsx`, `api.ts` |
| 5 | Update coppermine_module (test case) | `coppermine_module.py` |
| 6 | Test end-to-end | - |
| 7 | Update remaining modules | `imginn_module.py`, `fastdl_module.py`, `toolzu_module.py`, `snapchat_scraper.py`, `instaloader_module.py`, `tiktok_module.py`, `forum_downloader.py` |
| 8 | Update scheduler | `scheduler.py` |
| 9 | Cookie file cleanup | Remove old cookie files after verification |

---

## Testing Checklist

### Database

- [ ] Table created on first run
- [ ] Seed data populated correctly
- [ ] Cookies migrated from files
- [ ] Snapchat proxy_domain migrated

### API

- [ ] GET /api/scrapers returns all scrapers
- [ ] GET /api/scrapers?type=forum filters correctly
- [ ] PUT /api/scrapers/{id} updates settings
- [ ] POST /api/scrapers/{id}/test works with FlareSolverr
- [ ] POST /api/scrapers/{id}/test works with a proxy
- [ ] POST /api/scrapers/{id}/cookies merges correctly
- [ ] DELETE /api/scrapers/{id}/cookies clears cookies

### Frontend

- [ ] Scrapers tab displays all scrapers
- [ ] Grouping by type works
- [ ] Filter dropdown works
- [ ] Enable/disable toggle works
- [ ] Proxy checkbox and URL input work
- [ ] Test Connection button works
- [ ] Upload Cookies button works
- [ ] Clear Cookies button works
- [ ] Cookie status shows correctly (fresh/expired/none)

### Modules

- [ ] coppermine_module loads config from DB
- [ ] coppermine_module uses proxy when configured
- [ ] coppermine_module uses cookies from DB
- [ ] All other modules updated and working

### Scheduler

- [ ] Only runs enabled scrapers
- [ ] Passes the correct scraper_id to modules

---

## Rollback Plan

If issues occur:

1. **Database**: The old cookie files are preserved as backups
2. **Modules**: Can fall back to reading cookie files if the DB fails
3. **API**: Add backwards compatibility for old endpoints if needed

---

## Future Enhancements

Potential additions not in the initial scope:

1. **Rotating proxies** - Support proxy pools with rotation
2. **Proxy health monitoring** - Track proxy success/failure rates
3. **Auto-refresh cookies** - Background job to refresh expiring cookies
4. **Cookie export** - Download cookies as JSON for backup
5. **Scraper metrics** - Track download success rates per scraper