docs/CLOUDFLARE_HANDLER.md

# Universal Cloudflare Handler

**Version:** 12.0.1
**Module:** `modules/cloudflare_handler.py`
**Status:** Production

## Overview

The Universal Cloudflare Handler provides centralized Cloudflare bypass, error detection, cookie management, and **dynamic browser fingerprinting** for all download modules in the media-downloader system.

## Features

### 1. **Site Status Detection**

Before attempting downloads, the handler checks whether the target site is accessible and reports one of these states:

- **WORKING** - Site is accessible and responding normally
- **SERVER_ERROR** - HTTP 500, 502, 503, 504 errors (site is down)
- **CLOUDFLARE_CHALLENGE** - Cloudflare challenge page detected
- **FORBIDDEN** - HTTP 403 access denied
- **TIMEOUT** - Request timed out
- **UNKNOWN_ERROR** - Other errors

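As a rough illustration of how these states could be derived from a response, the sketch below classifies a status code and body. The `SiteStatus` member names come from the list above; the `classify_response` helper, the enum values, and the challenge markers are assumptions made for this example, not the handler's actual detection logic.

```python
from enum import Enum
from typing import Optional


class SiteStatus(Enum):
    """Mirrors the states listed above (the string values are illustrative)."""
    WORKING = "working"
    SERVER_ERROR = "server_error"
    CLOUDFLARE_CHALLENGE = "cloudflare_challenge"
    FORBIDDEN = "forbidden"
    TIMEOUT = "timeout"
    UNKNOWN_ERROR = "unknown_error"


# Hypothetical markers; real challenge pages vary and change over time.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "challenge-platform")


def classify_response(status_code: Optional[int], body: str = "",
                      timed_out: bool = False) -> SiteStatus:
    """Map a response (or a timeout) onto a SiteStatus value."""
    if timed_out:
        return SiteStatus.TIMEOUT
    lowered = body.lower()
    # Challenge pages are commonly served with 403 or 503, so inspect the
    # body before treating those codes as plain errors.
    if status_code in (403, 503) and any(m in lowered for m in CHALLENGE_MARKERS):
        return SiteStatus.CLOUDFLARE_CHALLENGE
    if status_code in (500, 502, 503, 504):
        return SiteStatus.SERVER_ERROR
    if status_code == 403:
        return SiteStatus.FORBIDDEN
    if status_code is not None and 200 <= status_code < 400:
        return SiteStatus.WORKING
    return SiteStatus.UNKNOWN_ERROR
```

Checking the body before the status code is a design choice in this sketch: it keeps a 503 challenge page from being reported as a server outage.
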
### 2. **Smart Skip Logic**

Downloads are automatically skipped when:

- Site returns server errors (500, 502, 503, 504)
- Request times out
- Unknown errors occur

This prevents wasting time and resources on unavailable sites.

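The skip rule above amounts to a set-membership test. The following is a minimal sketch under that reading; the enum here is a stand-in for the module's `SiteStatus`, and treating `FORBIDDEN` and `CLOUDFLARE_CHALLENGE` as non-skipping is an assumption based only on their absence from the list.

```python
from enum import Enum, auto


class SiteStatus(Enum):
    """Stand-in for the module's SiteStatus enum."""
    WORKING = auto()
    SERVER_ERROR = auto()
    CLOUDFLARE_CHALLENGE = auto()
    FORBIDDEN = auto()
    TIMEOUT = auto()
    UNKNOWN_ERROR = auto()


# Statuses for which a download attempt is pointless.
SKIP_STATUSES = frozenset({
    SiteStatus.SERVER_ERROR,
    SiteStatus.TIMEOUT,
    SiteStatus.UNKNOWN_ERROR,
})


def should_skip_download(status: SiteStatus) -> bool:
    """Return True when the site is considered unavailable."""
    return status in SKIP_STATUSES
```
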
### 3. **FlareSolverr Integration**

- Automatic Cloudflare bypass using FlareSolverr
- Configurable retry logic (default: 2 attempts)
- 120-second timeout for difficult challenges
- Detects cf_clearance cookie presence

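To make the retry and timeout behaviour concrete, here is a hedged sketch of one way to drive FlareSolverr's `request.get` command. The helper names (`build_payload`, `extract_cf_clearance`, `solve_with_retries`) and the injected `post` callable are hypothetical; the handler's real implementation may differ.

```python
from typing import Callable, Optional


def build_payload(url: str, timeout_s: int = 120) -> dict:
    """Build a request.get command; FlareSolverr takes maxTimeout in milliseconds."""
    return {"cmd": "request.get", "url": url, "maxTimeout": timeout_s * 1000}


def extract_cf_clearance(response: dict) -> Optional[str]:
    """Return the cf_clearance value from a FlareSolverr response, if present."""
    if response.get("status") != "ok":
        return None
    for cookie in response.get("solution", {}).get("cookies", []):
        if cookie.get("name") == "cf_clearance":
            return cookie.get("value")
    return None


def solve_with_retries(url: str, post: Callable[[dict], dict],
                       max_retries: int = 2) -> Optional[str]:
    """Try up to max_retries times; `post` sends the payload and returns parsed JSON."""
    for _ in range(max_retries):
        try:
            value = extract_cf_clearance(post(build_payload(url)))
        except Exception:
            value = None  # treat transport errors as a failed attempt
        if value:
            return value
    return None


# Example wiring with the requests library (not executed here):
#   import requests
#   post = lambda p: requests.post("http://localhost:8191/v1", json=p, timeout=130).json()
#   token = solve_with_retries("https://example.com/", post)
```

Passing the transport in as a callable keeps the retry logic testable without a running FlareSolverr instance.
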
### 4. **Cookie Management**

#### For Playwright (Browser Automation)
```python
# Load cookies into browser context
cf_handler.load_cookies_to_playwright(context)

# Save cookies from browser
cf_handler.save_cookies_from_playwright(context)

# Get cookies as list
cookies = cf_handler.get_cookies_list()
```

#### For Requests (HTTP Library)
```python
# Load cookies into session
cf_handler.load_cookies_to_requests(session)

# Get cookies as dictionary
cookies = cf_handler.get_cookies_dict()
```

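The two getters return different shapes: a list of cookie records for Playwright and a name-to-value mapping for requests. The sketch below shows one plausible conversion between them. Both helper names are hypothetical, and the field mapping assumes stored cookies use an `expiry` key while Playwright's `add_cookies` expects `expires`.

```python
from typing import Dict, List


def cookies_list_to_dict(cookies: List[dict]) -> Dict[str, str]:
    """Flatten cookie records into the name -> value mapping requests accepts."""
    return {c["name"]: c["value"] for c in cookies if "name" in c and "value" in c}


def to_playwright_cookie(cookie: dict) -> dict:
    """Convert a stored cookie record into the shape Playwright's add_cookies expects."""
    converted = {
        "name": cookie["name"],
        "value": cookie["value"],
        "domain": cookie["domain"],
        "path": cookie.get("path", "/"),
    }
    if "expiry" in cookie:
        converted["expires"] = cookie["expiry"]  # Playwright names this field "expires"
    return converted
```
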
### 5. **Cookie Expiration Strategies**

#### Aggressive Mode (Default)
- Cookies are treated as expired if they are older than 12 hours
- Cookies are treated as expired if any cookie will expire within 7 days
- Used by: imginn, fastdl, toolzu, snapchat

#### Conservative Mode
- Cookies are treated as expired only if the cf_clearance cookie has actually expired
- Minimizes FlareSolverr calls
- Used by: coppermine

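The two modes can be expressed as a single decision function. This is a sketch under stated assumptions: `cookies_expired` is a hypothetical helper, cookie records carry a Unix-timestamp `expiry` field, and an empty cookie set or a missing cf_clearance cookie counts as expired, which the rules above do not specify.

```python
import time
from typing import List, Optional

TWELVE_HOURS = 12 * 3600
SEVEN_DAYS = 7 * 24 * 3600


def cookies_expired(cookies: List[dict], saved_at: float,
                    aggressive: bool = True, now: Optional[float] = None) -> bool:
    """Decide whether stored cookies should be refreshed."""
    now = time.time() if now is None else now
    if not cookies:
        return True  # assumption: nothing stored means a refresh is needed
    if aggressive:
        if now - saved_at > TWELVE_HOURS:
            return True
        # Any cookie expiring within the next 7 days triggers a refresh.
        return any("expiry" in c and c["expiry"] - now < SEVEN_DAYS for c in cookies)
    # Conservative: only the cf_clearance cookie's own expiry matters.
    for c in cookies:
        if c.get("name") == "cf_clearance":
            return "expiry" in c and c["expiry"] <= now
    return True  # assumption: no cf_clearance cookie at all counts as expired
```
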
### 6. **Dynamic Browser Fingerprinting** (v12.0.1)

**Critical for cf_clearance cookies to work!**

The cf_clearance cookie is tied to the browser fingerprint (User-Agent, headers, etc.). If Playwright uses a different fingerprint than FlareSolverr, the cookies will be rejected.

#### Key Functions

```python
from modules.cloudflare_handler import (
    get_flaresolverr_fingerprint,
    get_playwright_context_options,
    get_playwright_stealth_scripts,
    set_fingerprint_database
)

# Initialize database persistence (call once at startup)
set_fingerprint_database(unified_db)

# Get complete fingerprint (instant from cache/database)
fingerprint = get_flaresolverr_fingerprint()
# Returns: user_agent, sec_ch_ua, locale, timezone, viewport, etc.

# Get ready-to-use Playwright context options
context_options = get_playwright_context_options()
context = browser.new_context(**context_options)

# Add anti-detection scripts (register them before navigating)
page = context.new_page()
page.add_init_script(get_playwright_stealth_scripts())
```

#### Fingerprint Persistence

Fingerprints are cached in three layers:
1. **Memory cache** - Instant access during session
2. **Database** - Persists across restarts (`key_value_store` table)
3. **FlareSolverr fetch** - Fallback if no cache available

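The three layers above behave like a read-through cache. The sketch below illustrates that lookup order; the `FingerprintCache` class, the storage key, and the dict standing in for the `key_value_store` table are all assumptions for illustration, not the module's real code.

```python
from typing import Callable, Optional


class FingerprintCache:
    """Read-through cache: memory first, then database, then a fresh fetch."""

    def __init__(self, db: dict, fetch: Callable[[], dict],
                 key: str = "flaresolverr_fingerprint"):
        self._memory: Optional[dict] = None
        self._db = db          # stand-in for the key_value_store table
        self._fetch = fetch    # stand-in for a call to FlareSolverr
        self._key = key        # hypothetical storage key

    def get(self) -> dict:
        # Layer 1: in-memory value from earlier in this session.
        if self._memory is not None:
            return self._memory
        # Layer 2: value persisted by a previous run.
        stored = self._db.get(self._key)
        if stored is not None:
            self._memory = stored
            return stored
        # Layer 3: fetch once, then populate both caches.
        fresh = self._fetch()
        self._db[self._key] = fresh
        self._memory = fresh
        return fresh
```
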
#### Important: Save Cookies with user_agent

When saving cookies to the database, **always include the user_agent**:

```python
# CORRECT - includes user_agent
self.unified_db.save_scraper_cookies(
    self.scraper_id,
    cookies,
    user_agent=self.user_agent,  # REQUIRED for cf_clearance!
    merge=True
)

# WRONG - missing user_agent (cookies won't work)
self.unified_db.save_scraper_cookies(self.scraper_id, cookies)
```

## Usage

### Basic Initialization

```python
from modules.cloudflare_handler import CloudflareHandler, SiteStatus

handler = CloudflareHandler(
    module_name="MyModule",
    cookie_file="/path/to/cookies.json",
    user_agent="Mozilla/5.0...",
    logger=logger,  # Optional
    aggressive_expiry=True  # or False for conservative
)
```

### Check Site Status

```python
# Inside a download method, where an empty list means "nothing downloaded"
status, error_msg = handler.check_site_status("https://example.com/", timeout=10)

if handler.should_skip_download(status):
    print(f"Skipping download - site unavailable: {error_msg}")
    return []
elif status == SiteStatus.CLOUDFLARE_CHALLENGE:
    print("Cloudflare challenge detected, will attempt bypass")
```

### Get Fresh Cookies via FlareSolverr

```python
success = handler.get_cookies_via_flaresolverr("https://example.com/", max_retries=2)

if success:
    print("Got fresh cookies from FlareSolverr")
else:
    print("FlareSolverr failed")
```

### Ensure Cookies Are Valid

```python
# Checks expiration and gets new cookies if needed
if handler.ensure_cookies("https://example.com/"):
    print("Cookies are valid")
else:
    print("Failed to get valid cookies")
```

### Check and Bypass Automatically

```python
# Checks site status and automatically attempts FlareSolverr if needed
status, cookies_obtained = handler.check_and_bypass("https://example.com/")

if handler.should_skip_download(status):
    print("Site is down, skipping")
else:
    print("Site is accessible, proceeding")
```

## Integration Examples

### ImgInn Module

```python
class ImgInnDownloader:
    def __init__(self, ...):
        # Initialize CloudflareHandler
        self.cf_handler = CloudflareHandler(
            module_name="ImgInn",
            cookie_file=str(self.cookie_file),
            user_agent=self.user_agent,
            logger=self.logger,
            aggressive_expiry=True
        )

    def download_posts(self, username, ...):
        # Check site status before downloading
        status, error_msg = self.cf_handler.check_site_status(
            "https://imginn.com/",
            timeout=10
        )

        if self.cf_handler.should_skip_download(status):
            self.log(f"Skipping - ImgInn unavailable: {error_msg}", "warning")
            return []

        # Proceed with download...
```

### Coppermine Module (Conservative Mode)

```python
class CoppermineDownloader:
    def __init__(self, ...):
        # Use conservative mode
        self.cf_handler = CloudflareHandler(
            module_name="Coppermine",
            cookie_file=str(self.cookie_file),
            user_agent=self.user_agent,
            logger=self.logger,
            aggressive_expiry=False  # Conservative
        )
```

## Configuration

### FlareSolverr Setup

The handler expects FlareSolverr running at `http://localhost:8191/v1`:

```bash
docker run -d \
  --name flaresolverr \
  -p 8191:8191 \
  -e LOG_LEVEL=info \
  --restart unless-stopped \
  ghcr.io/flaresolverr/flaresolverr:latest
```

### Cookie Storage

Cookies are stored in JSON format:

```json
{
  "cookies": [
    {
      "name": "cf_clearance",
      "value": "...",
      "domain": ".example.com",
      "path": "/",
      "expiry": 1234567890
    }
  ],
  "timestamp": "2025-11-18T12:00:00"
}
```

Location: `/opt/media-downloader/cookies/{module}_cookies.json`

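Reading this format back needs only the standard library. The sketch below parses a document in the format shown above and computes its age; the helper names (`parse_cookie_file`, `find_cookie`, `age_hours`) are hypothetical, and the timestamp is assumed to be an ISO 8601 string without a timezone, as in the sample.

```python
import json
from datetime import datetime
from typing import List, Optional, Tuple

# Same shape as the stored cookie file shown above.
SAMPLE = """
{
  "cookies": [
    {"name": "cf_clearance", "value": "...", "domain": ".example.com",
     "path": "/", "expiry": 1234567890}
  ],
  "timestamp": "2025-11-18T12:00:00"
}
"""


def parse_cookie_file(text: str) -> Tuple[List[dict], datetime]:
    """Return the cookie records and the time they were saved."""
    data = json.loads(text)
    saved_at = datetime.fromisoformat(data["timestamp"])
    return data.get("cookies", []), saved_at


def find_cookie(cookies: List[dict], name: str) -> Optional[dict]:
    """Return the first cookie record with the given name, or None."""
    return next((c for c in cookies if c.get("name") == name), None)


def age_hours(saved_at: datetime, now: datetime) -> float:
    """Age of the stored cookie set in hours."""
    return (now - saved_at).total_seconds() / 3600
```
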
## Error Handling

### Server Errors (500, 502, 503, 504)

```python
if status == SiteStatus.SERVER_ERROR:
    # Site is down, skip downloads
    return []
```

### Cloudflare Challenges

```python
if status == SiteStatus.CLOUDFLARE_CHALLENGE:
    # Attempt FlareSolverr bypass
    if handler.get_cookies_via_flaresolverr(url):
        # Retry with new cookies
        pass
```

### Timeouts

```python
if status == SiteStatus.TIMEOUT:
    # Site not responding, skip
    return []
```

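The three cases above can be folded into one pre-flight check. This is a sketch rather than the module's own code: `prepare_download` is a hypothetical helper, the enum is a stand-in for the real `SiteStatus`, and letting `FORBIDDEN` proceed is an assumption. The handler methods it calls are the ones named in this document.

```python
from enum import Enum, auto


class SiteStatus(Enum):
    """Stand-in for modules.cloudflare_handler.SiteStatus."""
    WORKING = auto()
    SERVER_ERROR = auto()
    CLOUDFLARE_CHALLENGE = auto()
    FORBIDDEN = auto()
    TIMEOUT = auto()
    UNKNOWN_ERROR = auto()


def prepare_download(handler, url: str) -> bool:
    """Return True if the caller should go ahead with the download."""
    status, _error_msg = handler.check_site_status(url, timeout=10)
    if handler.should_skip_download(status):
        return False  # server error, timeout, or unknown error
    if status == SiteStatus.CLOUDFLARE_CHALLENGE:
        # Proceed only if the bypass produced fresh cookies.
        return bool(handler.get_cookies_via_flaresolverr(url))
    return True
```
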
## Benefits

1. **Centralized Logic** - All Cloudflare handling in one place
2. **Reduced Duplication** - Eliminates 500+ lines of duplicate code across modules
3. **Better Error Detection** - Distinguishes server errors from Cloudflare challenges
4. **Automatic Skipping** - No wasted time on unavailable sites
5. **Unified Cookie Management** - Same cookie handling for all modules
6. **Backwards Compatible** - Existing modules work without changes

## Performance Impact

### Before CloudflareHandler
- ImgInn down with 500 error
- Wait 120 seconds for Cloudflare challenge that never resolves
- Launch browser, waste resources
- Eventually timeout with error

### After CloudflareHandler
- Check site status (at most 10 seconds, the status-check timeout)
- Detect the 500 error as soon as the response arrives
- Skip download with clear message
- No browser launch, no wasted resources

**Time Saved:** about 110 seconds per failed attempt (a 120-second wait replaced by a status check of at most 10 seconds)

## Module Integration

All 5 download modules now use CloudflareHandler:

| Module | Expiry Mode | Site URL | Notes |
|--------|-------------|----------|-------|
| imginn | Aggressive | https://imginn.com/ | Instagram proxy |
| fastdl | Aggressive | https://fastdl.app/ | Instagram API |
| toolzu | Aggressive | https://toolzu.com/ | Instagram downloader |
| snapchat | Aggressive | https://storiesdown.com/ | Snapchat proxy |
| coppermine | Conservative | Dynamic (gallery URL) | Photo galleries |

## Future Enhancements

Potential improvements:
- Rate limiting integration
- Proxy rotation support
- Multi-FlareSolverr failover
- Cookie pool management
- Site health monitoring
- Automatic retry scheduling

## Troubleshooting

### FlareSolverr Not Available

- The handler automatically disables FlareSolverr for the session
- It falls back to the Playwright-based bypass

### Cookies Not Refreshing

- Check cookie file permissions
- Verify FlareSolverr is running
- Check logs for error messages

### Site Status Always Returns Error

- Verify network connectivity
- Check firewall rules
- Ensure the target site is actually accessible

## See Also

- [FlareSolverr Integration](FLARESOLVERR.md)
- [Download Module Architecture](DOWNLOAD_MODULES.md)
- [Cookie Management](COOKIES.md)
- [Error Handling Best Practices](ERROR_HANDLING.md)