# Scraper Proxy Configuration System

## Overview

This document describes the design and implementation plan for a centralized scraper configuration system that provides:

1. **Per-scraper proxy settings** - Configure different proxies for different scrapers
2. **Centralized cookie management** - Store cookies in the database instead of scattered files
3. **FlareSolverr integration** - Test connections and refresh Cloudflare cookies
4. **Cookie upload support** - Upload cookies from browser extensions for authenticated access
5. **Unified Settings UI** - A single place to manage all scraper configurations

## Background

### Problem Statement

- Proxy settings are not configurable per module
- Cookies are stored in scattered JSON files
- There is no UI to test FlareSolverr connections or manage cookies
- Adding new forums requires code changes
- There is no visibility into cookie freshness or scraper health

### Solution

A new `scrapers` database table that:

- Stores configuration for all automated scrapers
- Provides per-scraper proxy settings
- Centralizes cookie storage with merge logic
- Syncs automatically with platform configurations
- Exposes management via the Settings UI

---

## Database Schema

### Table: `scrapers`

```sql
CREATE TABLE scrapers (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,               -- 'direct', 'proxy', 'forum', 'cli_tool'
    module TEXT,                      -- Python module name, NULL for cli_tool
    base_url TEXT,                    -- Primary URL for the scraper
    target_platform TEXT,             -- 'instagram', 'snapchat', 'tiktok', NULL for forums/cli
    enabled INTEGER DEFAULT 1,        -- Enable/disable scraper

    -- Proxy settings
    proxy_enabled INTEGER DEFAULT 0,
    proxy_url TEXT,                   -- e.g., "socks5://user:pass@host:port"

    -- Cloudflare/cookie settings
    flaresolverr_required INTEGER DEFAULT 0,
    cookies_json TEXT,                -- JSON blob of cookies
    cookies_updated_at TEXT,          -- ISO timestamp of last cookie update

    -- Test status
    last_test_at TEXT,                -- ISO timestamp of last test
    last_test_status TEXT,            -- 'success', 'failed', 'timeout'
    last_test_message TEXT,           -- Error message if failed

    -- Module-specific settings
    settings_json TEXT,               -- Additional JSON settings per scraper

    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
```

### Column Definitions

| Column | Type | Description |
|--------|------|-------------|
| `id` | TEXT | Unique identifier (e.g., 'imginn', 'forum_phun') |
| `name` | TEXT | Display name shown in UI |
| `type` | TEXT | One of: 'direct', 'proxy', 'forum', 'cli_tool' |
| `module` | TEXT | Python module name (e.g., 'imginn_module'), NULL for CLI tools |
| `base_url` | TEXT | Primary URL for the service |
| `target_platform` | TEXT | What platform this scraper downloads from (instagram, snapchat, tiktok, NULL) |
| `enabled` | INTEGER | 1=enabled, 0=disabled |
| `proxy_enabled` | INTEGER | 1=use proxy, 0=direct connection |
| `proxy_url` | TEXT | Proxy URL (http, https, socks5 supported) |
| `flaresolverr_required` | INTEGER | 1=needs FlareSolverr for Cloudflare bypass |
| `cookies_json` | TEXT | JSON array of cookie objects |
| `cookies_updated_at` | TEXT | When cookies were last updated |
| `last_test_at` | TEXT | When the connection was last tested |
| `last_test_status` | TEXT | Result of last test: 'success', 'failed', 'timeout' |
| `last_test_message` | TEXT | Error message from the last failed test |
| `settings_json` | TEXT | Module-specific settings as JSON |

### Scraper Types

| Type | Description | Examples |
|------|-------------|----------|
| `direct` | Downloads directly from the platform | instagram, tiktok, snapchat, coppermine |
| `proxy` | Uses a proxy service to download | imginn, fastdl, toolzu |
| `forum` | Forum scraper | forum_phun, forum_hqcelebcorner, forum_picturepub |
| `cli_tool` | Command-line tool wrapper | ytdlp, gallerydl |

### Target Platforms

The `target_platform` field indicates what platform the scraper actually downloads content from:

| Scraper | Target Platform | Notes |
|---------|-----------------|-------|
| imginn | instagram | Proxy service for Instagram |
| fastdl | instagram | Proxy service for Instagram |
| toolzu | instagram | Proxy service for Instagram |
| snapchat | snapchat | Direct via Playwright scraper |
| instagram | instagram | Direct via Instaloader |
| tiktok | tiktok | Direct via yt-dlp internally |
| coppermine | NULL | Not a social platform |
| forum_* | NULL | Not a social platform |
| ytdlp | NULL | Generic tool, multiple platforms |
| gallerydl | NULL | Generic tool, multiple platforms |

---

## Seed Data

Initial scrapers to populate on first run:

| id | name | type | module | base_url | target_platform | flaresolverr_required |
|----|------|------|--------|----------|-----------------|----------------------|
| imginn | Imginn | proxy | imginn_module | https://imginn.com | instagram | 1 |
| fastdl | FastDL | proxy | fastdl_module | https://fastdl.app | instagram | 1 |
| toolzu | Toolzu | proxy | toolzu_module | https://toolzu.com | instagram | 1 |
| snapchat | Snapchat Direct | direct | snapchat_scraper | https://snapchat.com | snapchat | 0 |
| instagram | Instagram (Direct) | direct | instaloader_module | https://instagram.com | instagram | 0 |
| tiktok | TikTok | direct | tiktok_module | https://tiktok.com | tiktok | 0 |
| coppermine | Coppermine | direct | coppermine_module | https://hqdiesel.net | NULL | 1 |
| forum_phun | Phun.org | forum | forum_downloader | https://forum.phun.org | NULL | 1 |
| forum_hqcelebcorner | HQCelebCorner | forum | forum_downloader | https://hqcelebcorner.com | NULL | 0 |
| forum_picturepub | PicturePub | forum | forum_downloader | https://picturepub.net | NULL | 0 |
| ytdlp | yt-dlp | cli_tool | NULL | NULL | NULL | 0 |
| gallerydl | gallery-dl | cli_tool | NULL | NULL | NULL | 0 |

### Notes on Seed Data

1. **Snapchat**: Uses the direct Playwright-based scraper with optional proxy support (configured per scraper in the Scrapers settings page)

2. **Forums**: Derived from existing `forum_threads` table entries and cookie files

3. **Excluded scrapers**: YouTube and Bilibili are NOT included - they are on-demand downloaders from the Video Downloader page, not scheduled scrapers

---

## Auto-Sync Logic

The scrapers table stays in sync with platform configurations automatically:

### When Forums Change

- New forum added in Forums settings → create a scraper entry with `type='forum'`
- Forum removed from settings → remove the scraper entry

### When Modules Are Enabled/Disabled

- Module enabled → ensure a scraper entry exists
- Module disabled → the scraper entry remains, but with `enabled=0`

### No Manual Add/Delete

- The Scrapers UI does NOT have Add or Delete buttons
- Scrapers are managed through their respective platform configuration pages
- The Scrapers UI only manages: proxy settings, testing, cookies
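
The forum sync rules above could be implemented as an idempotent reconciliation pass. A sketch, assuming the Forums settings are available as a list of dicts with `id`, `name`, and `base_url` (the function name and input shape are illustrative, not existing code):

```python
import sqlite3


def sync_forum_scrapers(conn: sqlite3.Connection, configured_forums: list[dict]):
    """Reconcile scraper rows of type='forum' with the Forums settings."""
    existing = {row[0] for row in conn.execute(
        "SELECT id FROM scrapers WHERE type = 'forum'")}
    wanted = {f['id'] for f in configured_forums}

    # New forum added in Forums settings -> create a scraper entry
    for forum in configured_forums:
        if forum['id'] not in existing:
            conn.execute(
                "INSERT OR IGNORE INTO scrapers (id, name, type, module, base_url) "
                "VALUES (?, ?, 'forum', 'forum_downloader', ?)",
                (forum['id'], forum['name'], forum['base_url']))

    # Forum removed from settings -> remove the scraper entry
    for stale_id in existing - wanted:
        conn.execute("DELETE FROM scrapers WHERE id = ?", (stale_id,))
    conn.commit()
```

Running the pass twice with the same input is a no-op, so it can safely be invoked on every settings save.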

---

## Cookie Management

### Storage Format

Cookies are stored as JSON in the `cookies_json` column:

```json
{
  "cookies": [
    {
      "name": "cf_clearance",
      "value": "abc123...",
      "domain": ".imginn.com",
      "path": "/",
      "expiry": 1735689600
    },
    {
      "name": "session_id",
      "value": "xyz789...",
      "domain": "imginn.com",
      "path": "/",
      "expiry": -1
    }
  ],
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36..."
}
```

An `expiry` of `-1` marks a session cookie with no fixed expiration.

### Cookie Merge Logic

**CRITICAL**: When updating cookies, MERGE with existing cookies - never wipe them:

```python
def merge_cookies(existing_cookies: list, new_cookies: list) -> list:
    """
    Merge new cookies into existing, preserving non-updated cookies.

    This ensures:
    - Cloudflare cookies (cf_clearance, __cf_bm) get refreshed
    - Site session/auth cookies are preserved
    - No data loss on test/refresh
    """
    # Index existing cookies by name
    cookie_map = {c['name']: c for c in existing_cookies}

    # Update/add from new cookies
    for cookie in new_cookies:
        cookie_map[cookie['name']] = cookie

    return list(cookie_map.values())
```

### Cookie Sources

1. **FlareSolverr** - Automated Cloudflare bypass; returns CF cookies
2. **Upload** - User uploads JSON from a browser extension (EditThisCookie, Cookie-Editor)
3. **Module** - Some modules save cookies during operation
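
Uploads do not arrive in the stored format exactly: Cookie-Editor/EditThisCookie exports typically use a float `expirationDate` field and omit it for session cookies, while the storage format above uses an integer `expiry` with `-1` for session cookies. A normalization sketch (the extension field names are an assumption based on those common export formats; verify against real exports):

```python
def normalize_uploaded_cookies(raw_cookies: list[dict]) -> list[dict]:
    """Convert a browser-extension cookie export to the stored format."""
    normalized = []
    for c in raw_cookies:
        # Prefer an already-normalized 'expiry'; fall back to the
        # extension-style 'expirationDate'; default to -1 (session cookie).
        expiry = c.get('expiry', c.get('expirationDate', -1))
        normalized.append({
            'name': c['name'],
            'value': c['value'],
            'domain': c.get('domain', ''),
            'path': c.get('path', '/'),
            'expiry': int(expiry),
        })
    return normalized
```

The result can then go straight through `merge_cookies()` above.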

### Cookie File Migration

Existing cookie files to migrate on first run:

| File | Scraper ID |
|------|------------|
| `cookies/coppermine_cookies.json` | coppermine |
| `cookies/imginn_cookies.json` | imginn |
| `cookies/fastdl_cookies.json` | fastdl |
| `cookies/snapchat_cookies.json` | snapchat |
| `cookies/forum_cookies_phun.org.json` | forum_phun |
| `cookies/forum_cookies_HQCelebCorner.json` | forum_hqcelebcorner |
| `cookies/forum_cookies_PicturePub.json` | forum_picturepub |

---

## Proxy Configuration

### Supported Proxy Formats

```
http://host:port
http://user:pass@host:port
https://host:port
https://user:pass@host:port
socks5://host:port
socks5://user:pass@host:port
```
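
A sketch of validating these formats before saving `proxy_url` (the allowed-scheme set simply mirrors the list above; requiring an explicit port is an assumption drawn from the listed formats):

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {'http', 'https', 'socks5'}


def validate_proxy_url(proxy_url: str) -> bool:
    """Accept only scheme://[user:pass@]host:port with a supported scheme."""
    try:
        parts = urlsplit(proxy_url)
        return (parts.scheme in ALLOWED_SCHEMES
                and parts.hostname is not None
                and parts.port is not None)
    except ValueError:  # raised on malformed ports like "host:abc"
        return False
```

The PUT endpoint below could reject a request with an invalid `proxy_url` before it is stored.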

### FlareSolverr Proxy Integration

When a scraper has `proxy_enabled=1`, the proxy is passed to FlareSolverr:

```python
payload = {
    "cmd": "request.get",
    "url": url,
    "maxTimeout": 120000
}
if proxy_url:
    payload["proxy"] = {"url": proxy_url}
```

**Important**: Cloudflare cookies are tied to the client IP address. If FlareSolverr uses a proxy, subsequent requests MUST use the same proxy, or the cookies will be invalid.
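
A sketch of building the follow-up HTTP client from FlareSolverr's output, routed through the same proxy (this assumes the `requests` library, which the coppermine module already uses; SOCKS proxies additionally require `requests[socks]` to be installed):

```python
import requests


def build_session(cookies: list[dict], user_agent: str,
                  proxy_url: str = None) -> requests.Session:
    """Build a requests session that replays FlareSolverr's cookies.

    Because Cloudflare ties cf_clearance to the client IP, the session
    must route through the same proxy FlareSolverr used (if any), and
    the User-Agent must match the browser that solved the challenge.
    """
    session = requests.Session()
    session.headers['User-Agent'] = user_agent
    for c in cookies:
        session.cookies.set(c['name'], c['value'],
                            domain=c.get('domain'), path=c.get('path', '/'))
    if proxy_url:
        session.proxies = {'http': proxy_url, 'https': proxy_url}
    return session
```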

### Per-Module Proxy Usage

| Module | How Proxy is Used |
|--------|-------------------|
| coppermine_module | `requests.Session(proxies={...})` |
| imginn_module | Playwright `proxy` option |
| fastdl_module | Playwright `proxy` option |
| toolzu_module | Playwright `proxy` option |
| snapchat_scraper | Playwright `proxy` option (optional, configured in Scrapers page) |
| instaloader_module | Instaloader `proxy` parameter |
| tiktok_module | yt-dlp `--proxy` flag |
| forum_downloader | Playwright `proxy` option + requests |
| ytdlp | `--proxy` flag |
| gallerydl | `--proxy` flag |

---

## API Endpoints

### GET /api/scrapers

List all scrapers, with an optional type filter.

**Query Parameters:**

- `type` (optional): Filter by type ('direct', 'proxy', 'forum', 'cli_tool')

**Response:**

```json
{
  "scrapers": [
    {
      "id": "imginn",
      "name": "Imginn",
      "type": "proxy",
      "module": "imginn_module",
      "base_url": "https://imginn.com",
      "target_platform": "instagram",
      "enabled": true,
      "proxy_enabled": false,
      "proxy_url": null,
      "flaresolverr_required": true,
      "cookies_count": 23,
      "cookies_updated_at": "2025-12-01T10:30:00",
      "cookies_fresh": true,
      "last_test_at": "2025-12-01T10:30:00",
      "last_test_status": "success",
      "last_test_message": null
    }
  ]
}
```
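
The `cookies_count` and `cookies_fresh` fields are derived from the stored columns, not persisted. One way to compute them (the 24-hour freshness window is an assumption, not a spec):

```python
import json
import time
from datetime import datetime

FRESHNESS_WINDOW_SECONDS = 24 * 3600  # assumed threshold, tune as needed


def compute_cookie_status(cookies_json: str, cookies_updated_at: str) -> dict:
    """Derive the cookies_count / cookies_fresh response fields."""
    if not cookies_json:
        return {'cookies_count': 0, 'cookies_fresh': False}
    cookies = json.loads(cookies_json).get('cookies', [])
    updated = datetime.fromisoformat(cookies_updated_at)
    age_ok = (datetime.now() - updated).total_seconds() < FRESHNESS_WINDOW_SECONDS
    # expiry == -1 marks a session cookie (never counted as expired here)
    none_expired = all(c.get('expiry', -1) == -1 or c['expiry'] > time.time()
                       for c in cookies)
    return {'cookies_count': len(cookies),
            'cookies_fresh': age_ok and none_expired}
```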

### GET /api/scrapers/{id}

Get a single scraper configuration.

### PUT /api/scrapers/{id}

Update scraper settings.

**Request Body:**

```json
{
  "enabled": true,
  "proxy_enabled": true,
  "proxy_url": "socks5://user:pass@host:port",
  "base_url": "https://new-domain.com"
}
```

### POST /api/scrapers/{id}/test

Test the connection via FlareSolverr (if required) and save cookies on success.

**Response:**

```json
{
  "success": true,
  "message": "Connection successful, 23 cookies saved",
  "cookies_count": 23
}
```

### POST /api/scrapers/{id}/cookies

Upload cookies from a JSON file. Merges with existing cookies.

**Request Body:**

```json
{
  "cookies": [
    {"name": "session", "value": "abc123", "domain": ".example.com"}
  ]
}
```

**Response:**

```json
{
  "success": true,
  "message": "Merged 5 cookies (total: 28)",
  "cookies_count": 28
}
```

### DELETE /api/scrapers/{id}/cookies

Clear all cookies for a scraper.

---

## Frontend UI

### Settings > Scrapers Tab

The Scrapers tab displays all scrapers grouped by type/platform:

```
┌───────────────────────────────────────────────────────────────────────┐
│ Settings > Scrapers                                                   │
├───────────────────────────────────────────────────────────────────────┤
│ Filter: [All Types ▼]                                                 │
│                                                                       │
│ ─── Instagram Proxies ─────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Imginn                                              [Enabled ▼] │ │
│ │ https://imginn.com                                                │ │
│ │ ☐ Use Proxy  [                    ]                               │ │
│ │ Cloudflare: Required │ Cookies: ✓ Fresh (2h ago, 23 cookies)      │ │
│ │ [Test Connection] [Upload Cookies] [Clear Cookies]                │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── Direct ────────────────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Instagram (Direct)                                  [Enabled ▼] │ │
│ │ https://instagram.com                                             │ │
│ │ ☐ Use Proxy  [                    ]                               │ │
│ │ Cloudflare: Not Required │ Cookies: ✓ 12 cookies                  │ │
│ │ [Test Connection] [Upload Cookies] [Clear Cookies]                │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── Forums ────────────────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● Phun.org                                            [Enabled ▼] │ │
│ │ https://forum.phun.org                                            │ │
│ │ ☐ Use Proxy  [                    ]                               │ │
│ │ Cloudflare: Required │ Cookies: ⚠ Expired (3 days)                │ │
│ │ [Test Connection] [Upload Cookies] [Clear Cookies]                │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│ ─── CLI Tools ─────────────────────────────────────────────────────── │
│                                                                       │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ● yt-dlp                                              [Enabled ▼] │ │
│ │ Generic video downloader                                          │ │
│ │ ☐ Use Proxy  [                    ]                               │ │
│ │ [Test Connection] [Upload Cookies]                                │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```

### Button Visibility

| Button | When Shown |
|--------|------------|
| Test Connection | Always |
| Upload Cookies | Always |
| Clear Cookies | When cookies exist |

### No Add/Delete Buttons

Scrapers are NOT added or deleted from this UI. They are managed through:

- Forums settings (for forum scrapers)
- Platform settings (for other scrapers)

This UI only manages:

- Enable/disable
- Proxy configuration
- Cookie testing/upload/clear

---

## Module Integration

### Common Pattern

All modules follow this pattern to load scraper configuration:

```python
class SomeModule:
    def __init__(self, unified_db=None, scraper_id='some_scraper', ...):
        self.db = unified_db
        self.scraper_id = scraper_id

        # Load config from DB (fall back to {} if missing)
        self.config = (self.db.get_scraper(scraper_id) or {}) if self.db else {}

        # Check if enabled
        if not self.config.get('enabled', True):
            raise ScraperDisabledError(f"{scraper_id} is disabled")

        # Get base URL from DB (not hardcoded)
        self.base_url = self.config.get('base_url', 'https://default.com')

        # Get proxy config
        self.proxy_url = None
        if self.config.get('proxy_enabled') and self.config.get('proxy_url'):
            self.proxy_url = self.config['proxy_url']

        # Initialize CloudflareHandler with DB storage
        self.cf_handler = CloudflareHandler(
            module_name=self.scraper_id,
            scraper_id=self.scraper_id,
            unified_db=self.db,
            proxy_url=self.proxy_url,
            ...
        )
```

### CloudflareHandler Changes

```python
class CloudflareHandler:
    def __init__(self,
                 module_name: str,
                 scraper_id: str = None,    # For DB cookie storage
                 unified_db=None,           # DB reference
                 proxy_url: str = None,     # Proxy support
                 cookie_file: str = None,   # DEPRECATED: backwards compat
                 ...):
        self.scraper_id = scraper_id
        self.db = unified_db
        self.proxy_url = proxy_url

    def get_cookies_via_flaresolverr(self, url: str, max_retries: int = 2) -> bool:
        payload = {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 120000
        }
        # Add proxy if configured
        if self.proxy_url:
            payload["proxy"] = {"url": self.proxy_url}

        # ... rest of implementation

        # On success, merge cookies (don't replace)
        if success:
            existing = self.load_cookies_from_db()
            merged = self.merge_cookies(existing, new_cookies)
            self.save_cookies_to_db(merged)

    def load_cookies_from_db(self) -> list:
        if self.db and self.scraper_id:
            config = self.db.get_scraper(self.scraper_id)
            if config and config.get('cookies_json'):
                data = json.loads(config['cookies_json'])
                return data.get('cookies', [])
        return []

    def save_cookies_to_db(self, cookies: list, user_agent: str = None):
        if self.db and self.scraper_id:
            data = {
                'cookies': cookies,
                'user_agent': user_agent
            }
            self.db.update_scraper_cookies(self.scraper_id, json.dumps(data))

    def merge_cookies(self, existing: list, new: list) -> list:
        cookie_map = {c['name']: c for c in existing}
        for cookie in new:
            cookie_map[cookie['name']] = cookie
        return list(cookie_map.values())
```

---

## Scheduler Integration

The scheduler uses the scrapers table to determine what to run:

```python
def run_scheduled_downloads(self):
    # Get all enabled scrapers
    scrapers = self.db.get_all_scrapers()
    enabled_scrapers = [s for s in scrapers if s['enabled']]

    for scraper in enabled_scrapers:
        if scraper['type'] == 'forum':
            self.run_forum_download(scraper['id'])
        elif scraper['id'] == 'coppermine':
            self.run_coppermine_download()
        elif scraper['id'] == 'instagram':
            self.run_instagram_download()
        elif scraper['id'] == 'tiktok':
            self.run_tiktok_download()
        # etc.
```

---

## Migration Plan

### Step 1: Create Table

Add to `unified_database.py`:

```python
def _create_scrapers_table(self):
    self.cursor.execute('''
        CREATE TABLE IF NOT EXISTS scrapers (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            type TEXT NOT NULL,
            module TEXT,
            base_url TEXT,
            target_platform TEXT,
            enabled INTEGER DEFAULT 1,
            proxy_enabled INTEGER DEFAULT 0,
            proxy_url TEXT,
            flaresolverr_required INTEGER DEFAULT 0,
            cookies_json TEXT,
            cookies_updated_at TEXT,
            last_test_at TEXT,
            last_test_status TEXT,
            last_test_message TEXT,
            settings_json TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    ''')
```

### Step 2: Seed Initial Data

```python
def _seed_scrapers(self):
    scrapers = [
        ('imginn', 'Imginn', 'proxy', 'imginn_module', 'https://imginn.com', 'instagram', 1),
        ('fastdl', 'FastDL', 'proxy', 'fastdl_module', 'https://fastdl.app', 'instagram', 1),
        ('toolzu', 'Toolzu', 'proxy', 'toolzu_module', 'https://toolzu.com', 'instagram', 1),
        ('snapchat', 'Snapchat Direct', 'direct', 'snapchat_scraper', 'https://snapchat.com', 'snapchat', 0),
        ('instagram', 'Instagram (Direct)', 'direct', 'instaloader_module', 'https://instagram.com', 'instagram', 0),
        ('tiktok', 'TikTok', 'direct', 'tiktok_module', 'https://tiktok.com', 'tiktok', 0),
        ('coppermine', 'Coppermine', 'direct', 'coppermine_module', 'https://hqdiesel.net', None, 1),
        ('forum_phun', 'Phun.org', 'forum', 'forum_downloader', 'https://forum.phun.org', None, 1),
        ('forum_hqcelebcorner', 'HQCelebCorner', 'forum', 'forum_downloader', 'https://hqcelebcorner.com', None, 0),
        ('forum_picturepub', 'PicturePub', 'forum', 'forum_downloader', 'https://picturepub.net', None, 0),
        ('ytdlp', 'yt-dlp', 'cli_tool', None, None, None, 0),
        ('gallerydl', 'gallery-dl', 'cli_tool', None, None, None, 0),
    ]

    for s in scrapers:
        self.cursor.execute('''
            INSERT OR IGNORE INTO scrapers
            (id, name, type, module, base_url, target_platform, flaresolverr_required)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', s)
```

### Step 3: Migrate Cookies

```python
def _migrate_cookies_to_db(self):
    cookie_files = {
        'coppermine': '/opt/media-downloader/cookies/coppermine_cookies.json',
        'imginn': '/opt/media-downloader/cookies/imginn_cookies.json',
        'fastdl': '/opt/media-downloader/cookies/fastdl_cookies.json',
        'snapchat': '/opt/media-downloader/cookies/snapchat_cookies.json',
        'forum_phun': '/opt/media-downloader/cookies/forum_cookies_phun.org.json',
        'forum_hqcelebcorner': '/opt/media-downloader/cookies/forum_cookies_HQCelebCorner.json',
        'forum_picturepub': '/opt/media-downloader/cookies/forum_cookies_PicturePub.json',
    }

    for scraper_id, cookie_file in cookie_files.items():
        if os.path.exists(cookie_file):
            try:
                with open(cookie_file, 'r') as f:
                    data = json.load(f)

                # Store in DB
                self.cursor.execute('''
                    UPDATE scrapers
                    SET cookies_json = ?, cookies_updated_at = ?
                    WHERE id = ?
                ''', (json.dumps(data), datetime.now().isoformat(), scraper_id))

                self.logger.info(f"Migrated cookies for {scraper_id}")
            except Exception as e:
                self.logger.error(f"Failed to migrate cookies for {scraper_id}: {e}")
```

### Step 4: Migrate Snapchat proxy_domain

```python
def _migrate_snapchat_proxy_domain(self):
    # Get the current proxy_domain from settings
    settings = self.get_setting('snapchat')
    if settings and 'proxy_domain' in settings:
        proxy_domain = settings['proxy_domain']
        base_url = f"https://{proxy_domain}"

        self.cursor.execute('''
            UPDATE scrapers SET base_url = ? WHERE id = 'snapchat'
        ''', (base_url,))

        # Remove from settings (now lives in the scrapers table)
        del settings['proxy_domain']
        self.save_setting('snapchat', settings)
```

---

## Implementation Order

| Step | Task | Files to Modify |
|------|------|-----------------|
| 1 | Database schema + migration | `unified_database.py` |
| 2 | Backend API endpoints | `api.py` |
| 3 | CloudflareHandler proxy + DB storage + merge logic | `cloudflare_handler.py` |
| 4 | Frontend Scrapers tab | `ScrapersTab.tsx`, `Settings.tsx`, `api.ts` |
| 5 | Update coppermine_module (test case) | `coppermine_module.py` |
| 6 | Test end-to-end | - |
| 7 | Update remaining modules | `imginn_module.py`, `fastdl_module.py`, `toolzu_module.py`, `snapchat_scraper.py`, `instaloader_module.py`, `tiktok_module.py`, `forum_downloader.py` |
| 8 | Update scheduler | `scheduler.py` |
| 9 | Cookie file cleanup | Remove old cookie files after verification |

---

## Testing Checklist

### Database

- [ ] Table created on first run
- [ ] Seed data populated correctly
- [ ] Cookies migrated from files
- [ ] Snapchat proxy_domain migrated

### API

- [ ] GET /api/scrapers returns all scrapers
- [ ] GET /api/scrapers?type=forum filters correctly
- [ ] PUT /api/scrapers/{id} updates settings
- [ ] POST /api/scrapers/{id}/test works with FlareSolverr
- [ ] POST /api/scrapers/{id}/test works with proxy
- [ ] POST /api/scrapers/{id}/cookies merges correctly
- [ ] DELETE /api/scrapers/{id}/cookies clears cookies

### Frontend

- [ ] Scrapers tab displays all scrapers
- [ ] Grouping by type works
- [ ] Filter dropdown works
- [ ] Enable/disable toggle works
- [ ] Proxy checkbox and URL input work
- [ ] Test Connection button works
- [ ] Upload Cookies button works
- [ ] Clear Cookies button works
- [ ] Cookie status shows correctly (fresh/expired/none)

### Modules

- [ ] coppermine_module loads config from DB
- [ ] coppermine_module uses proxy when configured
- [ ] coppermine_module uses cookies from DB
- [ ] All other modules updated and working

### Scheduler

- [ ] Only runs enabled scrapers
- [ ] Passes the correct scraper_id to modules

---

## Rollback Plan

If issues occur:

1. **Database**: The old cookie files are preserved as backups
2. **Modules**: Can fall back to reading cookie files if the DB fails
3. **API**: Add backwards compatibility for old endpoints if needed

---

## Future Enhancements

Potential additions not in the initial scope:

1. **Rotating proxies** - Support proxy pools with rotation
2. **Proxy health monitoring** - Track proxy success/failure rates
3. **Auto-refresh cookies** - Background job to refresh expiring cookies
4. **Cookie export** - Download cookies as JSON for backup
5. **Scraper metrics** - Track download success rates per scraper