Initial commit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00
commit 0d7b2b1aab
389 changed files with 280296 additions and 0 deletions
--- a/docs/SERVICE_HEALTH_MONITORING.md
+++ b/docs/SERVICE_HEALTH_MONITORING.md
@@ -0,0 +1,289 @@
+# Service Health Monitoring
+
+## Overview
+
+The Service Health Monitor tracks service failures in scheduler mode and sends a Pushover notification when a service gets stuck, meaning it fails repeatedly because of Cloudflare blocks, rate limiting, or other errors.
+
+## Features
+
+- **Scheduler-only operation** - Only monitors during unattended daemon mode
+- **24-hour notification cooldown** - Prevents notification spam (default, configurable)
+- **Failure threshold** - 3 consecutive failures trigger stuck state (default, configurable)
+- **Automatic recovery detection** - Stops alerting when service recovers
+- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.
+
+## Configuration
+
+Located in `config/settings.json`:
+
+```json
+{
+  "service_monitoring": {
+    "enabled": true,
+    "notification_cooldown_hours": 24,
+    "failure_threshold": 3,
+    "send_recovery_notifications": false,
+    "services": {
+      "fastdl": {"monitor": true, "notify": true},
+      "imginn": {"monitor": true, "notify": true},
+      "snapchat": {"monitor": true, "notify": true},
+      "toolzu": {"monitor": true, "notify": true},
+      "tiktok": {"monitor": true, "notify": true},
+      "forums": {"monitor": true, "notify": true}
+    },
+    "pushover": {
+      "enabled": true,
+      "priority": 0,
+      "sound": "pushover"
+    }
+  }
+}
+```
+
+### Configuration Options
+
+**Main Settings:**
+- `enabled` (boolean) - Master switch for service monitoring
+- `notification_cooldown_hours` (integer) - Hours between notifications for same service (default: 24)
+- `failure_threshold` (integer) - Consecutive failures before marking as stuck (default: 3)
+- `send_recovery_notifications` (boolean) - Send notification when service recovers (default: false)
+
+**Per-Service Settings:**
+- `monitor` (boolean) - Track this service's health
+- `notify` (boolean) - Send notifications for this service
+
+**Pushover Settings:**
+- `enabled` (boolean) - Enable Pushover notifications
+- `priority` (integer) - Notification priority (-2 to 2)
+- `sound` (string) - Notification sound
+
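A minimal sketch of how the options above could be read with their documented defaults filled in. The function name `load_monitoring_config` and the `DEFAULTS` table are illustrative assumptions, not the module's actual API. The defaults for `enabled` and the `pushover` block are not documented above and are simply copied from the sample configuration.

```python
from copy import deepcopy

# Documented defaults: cooldown 24, threshold 3, recovery notifications off.
# ASSUMPTION: the remaining values mirror the sample config shown above.
DEFAULTS = {
    "enabled": True,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": False,
    "services": {},
    "pushover": {"enabled": True, "priority": 0, "sound": "pushover"},
}


def load_monitoring_config(settings: dict) -> dict:
    """Return the service_monitoring section with defaults filled in."""
    section = settings.get("service_monitoring", {})
    config = deepcopy(DEFAULTS)
    for key, value in section.items():
        # Merge nested sections (services, pushover) one level deep.
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)
        else:
            config[key] = value

    priority = config["pushover"]["priority"]
    if not -2 <= priority <= 2:
        raise ValueError(f"pushover.priority must be between -2 and 2, got {priority}")
    if config["failure_threshold"] < 1:
        raise ValueError("failure_threshold must be at least 1")
    return config
```

Validating at load time means a bad `priority` or `failure_threshold` surfaces when the scheduler starts rather than when the first alert is due.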
+## How It Works
+
+### Detection Flow
+
+1. **Service runs** in scheduler mode
+2. **Success**: `health_monitor.record_success('service_name')`
+   - Resets consecutive failure counter
+   - Marks service as healthy
+   - Sends recovery notification (if `send_recovery_notifications` is enabled)
+
+3. **Failure**: `health_monitor.record_failure('service_name', 'reason')`
+   - Increments failure counter
+   - Records failure type (cloudflare, timeout, etc.)
+   - If failures ≥ threshold → mark as stuck
+   - If stuck AND cooldown expired → send alert
+
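The detection flow above can be sketched as a small state machine. `HealthTrackerSketch` is a hypothetical stand-in for the real `ServiceHealthMonitor`: it keeps state in memory, appends to a list instead of calling Pushover, and assumes that a success does not clear the notification cooldown (the source does not say either way). Recovery notifications are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ServiceState:
    status: str = "healthy"
    consecutive_failures: int = 0
    failure_type: Optional[str] = None
    last_notification_sent: Optional[datetime] = None


class HealthTrackerSketch:
    """Illustrative only; not the real ServiceHealthMonitor."""

    def __init__(self, failure_threshold=3, cooldown_hours=24):
        self.failure_threshold = failure_threshold
        self.cooldown = timedelta(hours=cooldown_hours)
        self.states = {}
        self.alerts = []  # (service, reason) pairs standing in for Pushover calls

    def record_success(self, service):
        state = self.states.setdefault(service, ServiceState())
        state.consecutive_failures = 0  # reset the consecutive failure counter
        state.status = "healthy"
        state.failure_type = None

    def record_failure(self, service, reason, now=None):
        now = now or datetime.now()
        state = self.states.setdefault(service, ServiceState())
        state.consecutive_failures += 1
        state.failure_type = reason
        if state.consecutive_failures >= self.failure_threshold:
            state.status = "stuck"
        cooldown_expired = (
            state.last_notification_sent is None
            or now - state.last_notification_sent >= self.cooldown
        )
        # Alert only when stuck AND the cooldown has expired.
        if state.status == "stuck" and cooldown_expired:
            self.alerts.append((service, reason))
            state.last_notification_sent = now
```

Passing `now` explicitly keeps the cooldown logic testable without waiting 24 hours.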
+### Failure Types
+
+- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
+- `rate_limit` - HTTP 429 rate limiting
+- `forbidden` - HTTP 403 access denied
+- `timeout` - Connection timeout
+- `authentication` - Login/auth required
+- `captcha` - CAPTCHA challenge
+- `blocked` - IP blocked
+- `unknown` - Other errors
+
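One way to map an exception onto the failure types listed above is a keyword table. This is a sketch under stated assumptions: `classify_failure` is a hypothetical helper, and the keyword lists are guesses at what upstream error messages contain, not strings taken from the project.

```python
def classify_failure(error: Exception) -> str:
    """Return one of the documented failure types for an exception."""
    text = str(error).lower()
    # ASSUMPTION: these substrings are illustrative; real messages may differ.
    # Earlier rules win, so the more specific checks come first.
    rules = [
        ("captcha", ("captcha",)),
        ("cloudflare", ("cloudflare", "cf_clearance")),
        ("rate_limit", ("429", "rate limit", "too many requests")),
        ("forbidden", ("403", "forbidden")),
        ("timeout", ("timeout", "timed out")),
        ("authentication", ("401", "login required", "unauthorized")),
        ("blocked", ("ip blocked", "banned")),
    ]
    for reason, needles in rules:
        if any(needle in text for needle in needles):
            return reason
    return "unknown"
```

Substring matching on status codes is coarse: a message that merely contains the digits `403` (a byte count, for example) would be classified as `forbidden`. Matching on a structured status code, where one is available, would be more reliable.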
+### State Tracking
+
+State is stored in `/opt/media-downloader/database/service_health.json`:
+
+```json
+{
+  "service_health": {
+    "fastdl": {
+      "status": "stuck",
+      "consecutive_failures": 5,
+      "last_success": "2025-10-27T14:30:00",
+      "last_failure": "2025-10-28T23:30:00",
+      "last_notification_sent": "2025-10-28T08:00:00",
+      "failure_type": "cloudflare_challenge",
+      "total_failures": 12,
+      "total_successes": 145
+    }
+  }
+}
+```
+
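A sketch of reading and writing a state file of this shape. The helper names `load_state` and `save_state` are assumptions for illustration. The write goes to a temporary file first and is then moved into place with `os.replace`, so a crash mid-write does not leave a truncated JSON file behind. Whether the real module does this is not stated in the source.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path):
    """Load the state file, or return an empty structure if it is missing."""
    path = Path(path)
    if not path.exists():
        return {"service_health": {}}
    with path.open("r", encoding="utf-8") as handle:
        return json.load(handle)


def save_state(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        os.unlink(tmp_name)  # clean up the partial temp file
        raise
```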
+## Notification Format
+
+**Alert Notification:**
+```
+⚠️ Service Alert: FastDL
+
+Status: Stuck/Blocked
+Issue: Cloudflare Challenge
+Failed Since: Oct 28, 2:30 PM (5 consecutive failures)
+
+Last successful download: 9 hours ago
+
+Action may be required.
+```
+
+**Recovery Notification** (optional):
+```
+✅ Service Recovered: FastDL
+
+Status: Healthy
+Service is working again.
+
+Recovered at: Oct 28, 11:45 PM
+```
+
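The alert layout above could be produced by a formatter along these lines. All function names here are hypothetical. `failed_since` is passed in explicitly because the state file shown earlier has no such field, and how the real module derives it is not documented.

```python
from datetime import datetime


def format_clock(moment: datetime) -> str:
    """Render a timestamp like 'Oct 28, 2:30 PM' without platform-specific flags."""
    hour = moment.hour % 12 or 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{moment.strftime('%b')} {moment.day}, {hour}:{moment.minute:02d} {suffix}"


def format_age(earlier: datetime, now: datetime) -> str:
    """Render the elapsed time between two timestamps in whole hours or days."""
    hours = int((now - earlier).total_seconds() // 3600)
    if hours < 1:
        return "less than an hour ago"
    if hours < 48:
        return f"{hours} hour{'s' if hours != 1 else ''} ago"
    return f"{hours // 24} days ago"


def format_alert(label, failure_type, failed_since, consecutive_failures,
                 last_success, now):
    """Build the alert body in the layout shown above."""
    issue = failure_type.replace("_", " ").title()
    return "\n".join([
        f"⚠️ Service Alert: {label}",
        "",
        "Status: Stuck/Blocked",
        f"Issue: {issue}",
        f"Failed Since: {format_clock(failed_since)} "
        f"({consecutive_failures} consecutive failures)",
        "",
        f"Last successful download: {format_age(last_success, now)}",
        "",
        "Action may be required.",
    ])
```

The display label (`FastDL`) is passed in rather than derived from the service key, since the capitalization cannot be recovered from `fastdl` alone.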
+## Implementation Status
+
+### ✅ Completed Components
+
+1. **Core Module** - `modules/service_health_monitor.py`
+   - State management
+   - Failure/success tracking
+   - Notification logic
+   - Cooldown management
+
+2. **Configuration** - `config/settings.json`
+   - `service_monitoring` section added
+   - All services configured
+
+3. **Integration** - `media-downloader.py`
+   - Health monitor initialization (scheduler mode only)
+   - Imported ServiceHealthMonitor
+
+4. **Example Implementation** - `download_fastdl()`
+   - Success tracking after completion
+   - Failure tracking with error classification
+   - Try/except wrapper pattern
+
+### 🔄 Pending Implementation
+
+The download methods listed under Methods to Update still need success/failure tracking added. Each should follow the pattern shown first.
+
+#### Pattern to Follow
+
+```python
+def download_SERVICE(self):
+    """Download content via SERVICE"""
+    try:
+        # ... existing download logic ...
+
+        # Record success at end
+        if self.health_monitor:
+            self.health_monitor.record_success('service_name')
+
+        return total_downloaded
+
+    except Exception as e:
+        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")
+
+        # Record failure with classification
+        if self.health_monitor:
+            error_str = str(e).lower()
+            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
+                reason = 'cloudflare'
+            elif 'timeout' in error_str:
+                reason = 'timeout'
+            elif '403' in error_str:
+                reason = 'forbidden'
+            elif '429' in error_str:
+                reason = 'rate_limit'
+            else:
+                reason = 'unknown'
+            self.health_monitor.record_failure('service_name', reason)
+
+        raise  # Re-raise to maintain existing error handling
+```
+
+#### Methods to Update
+
+1. **download_imginn()** (line ~1065)
+   - Service name: `'imginn'`
+   - Common errors: Cloudflare, timeouts
+
+2. **download_toolzu()** (line ~1134)
+   - Service name: `'toolzu'`
+   - Common errors: Cloudflare, rate limits
+
+3. **download_snapchat()** (line ~1320)
+   - Service name: `'snapchat'`
+   - Common errors: Cloudflare, timeouts
+
+4. **download_tiktok()** (line ~1364)
+   - Service name: `'tiktok'`
+   - Common errors: Rate limits, geo-blocks
+
+5. **download_forums()** (line ~1442)
+   - Service name: `'forums'`
+   - Common errors: Authentication, Cloudflare
+
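Since the same try/except wrapper would be repeated in five methods, a decorator is one possible way to apply the pattern once. This is a sketch, not part of the existing code: `track_health` and `_classify` are hypothetical names, and `_classify` simply mirrors the branch order of the pattern above.

```python
import functools


def _classify(error: Exception) -> str:
    """Same branch order as the inline classification in the pattern above."""
    text = str(error).lower()
    if "cloudflare" in text or "cf_clearance" in text:
        return "cloudflare"
    if "timeout" in text:
        return "timeout"
    if "403" in text:
        return "forbidden"
    if "429" in text:
        return "rate_limit"
    return "unknown"


def track_health(service_name):
    """Decorator (hypothetical): record success or failure for a download method."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # health_monitor is None outside scheduler mode, so tracking is skipped.
            monitor = getattr(self, "health_monitor", None)
            try:
                result = method(self, *args, **kwargs)
            except Exception as error:
                if monitor:
                    monitor.record_failure(service_name, _classify(error))
                raise  # re-raise to keep existing error handling intact
            if monitor:
                monitor.record_success(service_name)
            return result
        return wrapper
    return decorator
```

Usage would look like `@track_health('imginn')` placed above `def download_imginn(self):`. Unlike the inline pattern, success is recorded outside the `try`, so an error raised by the monitor itself is not misreported as a download failure.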
+## Testing
+
+### Manual Testing (No Monitoring)
+
+```bash
+# Manual runs don't trigger monitoring
+sudo media-downloader --platform snapchat
+# Health monitor inactive - no tracking
+```
+
+### Scheduler Testing (With Monitoring)
+
+```bash
+# Start scheduler (monitoring active)
+sudo systemctl start media-downloader
+
+# Check health state
+cat /opt/media-downloader/database/service_health.json
+
+# Check logs for monitoring activity
+tail -f /opt/media-downloader/logs/*.log | grep "Service health"
+```
+
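As an alternative to reading the raw JSON with `cat`, a short script can print one line per service. This is a sketch: `summarize_health` is a hypothetical helper that only assumes the state file layout shown under State Tracking.

```python
import json
from pathlib import Path


def summarize_health(path):
    """Return one human-readable line per service in the state file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    lines = []
    for name, entry in sorted(data.get("service_health", {}).items()):
        status = entry.get("status", "unknown")
        failures = entry.get("consecutive_failures", 0)
        failure_type = entry.get("failure_type") or "n/a"
        lines.append(
            f"{name}: {status}, {failures} consecutive failures, "
            f"last failure type: {failure_type}"
        )
    return lines
```

Calling `print("\n".join(summarize_health("/opt/media-downloader/database/service_health.json")))` gives a quick view of which services are currently stuck.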
+### Simulate Failure
+
+1. Stop FlareSolverr: `docker stop flaresolverr`
+2. Run scheduler - service will fail
+3. After 3 consecutive failures, confirm that a notification was sent
+4. Confirm that no further notification arrives during the 24-hour cooldown
+5. Start FlareSolverr: `docker start flaresolverr`
+6. Run scheduler - service recovers, counter resets
+
+## Benefits
+
+✅ **Early Warning** - Get alerted as soon as a service crosses the failure threshold
+✅ **No Spam** - At most one notification per service per cooldown window (24 hours by default)
+✅ **Actionable** - Shows specific failure reason
+✅ **Auto-Recovery** - Stops alerting when fixed
+✅ **Historical Data** - Track failure/success patterns
+✅ **Granular Control** - Enable/disable per service
+
+## Troubleshooting
+
+**Not receiving notifications:**
+- Check that `service_monitoring.enabled` is `true`
+- Check the service-specific `monitor` and `notify` settings
+- Verify Pushover credentials in config
+- Confirm running in scheduler mode (not manual)
+
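The checklist above can be turned into a quick diagnostic. This is a sketch: `notification_blockers` is a hypothetical helper, and it assumes that a missing key behaves like `false`, which the source does not confirm. It does not verify Pushover credentials.

```python
def notification_blockers(config: dict, service: str, scheduler_mode: bool) -> list:
    """List the reasons a notification for `service` would not be sent."""
    monitoring = config.get("service_monitoring", {})
    service_cfg = monitoring.get("services", {}).get(service, {})
    # ASSUMPTION: any missing setting is treated as disabled.
    checks = [
        (scheduler_mode, "not running in scheduler mode"),
        (monitoring.get("enabled", False), "service_monitoring.enabled is not true"),
        (service_cfg.get("monitor", False), f"services.{service}.monitor is not true"),
        (service_cfg.get("notify", False), f"services.{service}.notify is not true"),
        (monitoring.get("pushover", {}).get("enabled", False),
         "pushover.enabled is not true"),
    ]
    return [message for passed, message in checks if not passed]
```

An empty list means none of the configuration checks is blocking delivery, which leaves credentials, the failure threshold, and the cooldown as the remaining things to inspect.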
+**Too many notifications:**
+- Increase `notification_cooldown_hours`
+- Increase `failure_threshold`
+- Disable notifications for specific services with `"notify": false`
+
+**Service marked stuck incorrectly:**
+- Increase `failure_threshold` (default: 3)
+- Check if service is actually failing
+- Review failure logs
+
+**Reset service state:**
+```python
+from modules.service_health_monitor import ServiceHealthMonitor
+monitor = ServiceHealthMonitor()
+monitor.reset_service('fastdl')
+```
+
+## Future Enhancements
+
+- Web dashboard showing service health
+- Daily digest emails/notifications
+- Success rate metrics
+- Escalation after extended downtime
+- Integration with monitoring tools (Grafana, etc.)