# Service Health Monitoring

## Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.

## Features

- **Scheduler-only operation** - Only monitors during unattended daemon mode
- **24-hour notification cooldown** - Prevents notification spam
- **Failure threshold** - 3 consecutive failures trigger the stuck state
- **Automatic recovery detection** - Stops alerting when a service recovers
- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.

## Configuration

Located in `config/settings.json`:

```json
{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}
```

### Configuration Options

**Main Settings:**

- `enabled` (boolean) - Master switch for service monitoring
- `notification_cooldown_hours` (integer) - Hours between notifications for the same service (default: 24)
- `failure_threshold` (integer) - Consecutive failures before a service is marked as stuck (default: 3)
- `send_recovery_notifications` (boolean) - Send a notification when a service recovers (default: false)

**Per-Service Settings:**

- `monitor` (boolean) - Track this service's health
- `notify` (boolean) - Send notifications for this service

**Pushover Settings:**

- `enabled` (boolean) - Enable Pushover notifications
- `priority` (integer) - Notification priority (-2 to 2)
- `sound` (string) - Notification sound

## How It Works

### Detection Flow

1. **Service runs** in scheduler mode
2. **Success**: `health_monitor.record_success('service_name')`
   - Resets the consecutive failure counter
   - Marks the service as healthy
   - Sends a recovery notification (if enabled)
3. **Failure**: `health_monitor.record_failure('service_name', 'reason')`
   - Increments the failure counter
   - Records the failure type (cloudflare, timeout, etc.)
   - If failures ≥ threshold → mark as stuck
   - If stuck AND cooldown expired → send an alert

### Failure Types

- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
- `rate_limit` - HTTP 429 rate limiting
- `forbidden` - HTTP 403 access denied
- `timeout` - Connection timeout
- `authentication` - Login/auth required
- `captcha` - CAPTCHA challenge
- `blocked` - IP blocked
- `unknown` - Other errors

### State Tracking

State is stored in `/opt/media-downloader/database/service_health.json`:

```json
{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
```

## Notification Format

**Alert Notification:**

```
⚠️ Service Alert: FastDL

Status: Stuck/Blocked
Issue: Cloudflare Challenge Failed
Since: Oct 28, 2:30 PM (5 consecutive failures)
Last successful download: 9 hours ago

Action may be required.
```

**Recovery Notification** (optional):

```
✅ Service Recovered: FastDL

Status: Healthy
Service is working again.
Recovered at: Oct 28, 11:45 PM
```

## Implementation Status

### ✅ Completed Components

1. **Core Module** - `modules/service_health_monitor.py`
   - State management
   - Failure/success tracking
   - Notification logic
   - Cooldown management
2. **Configuration** - `config/settings.json`
   - service_monitoring section added
   - All services configured
3. **Integration** - `media-downloader.py`
   - Health monitor initialization (scheduler mode only)
   - Imported ServiceHealthMonitor
4. **Example Implementation** - `download_fastdl()`
   - Success tracking after completion
   - Failure tracking with error classification
   - Try/except wrapper pattern

### 🔄 Pending Implementation

The following download methods need success/failure tracking added.

#### Pattern to Follow

```python
def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at the end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')

        return total_downloaded

    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)

        raise  # Re-raise to maintain existing error handling
```

#### Methods to Update

1. **download_imginn()** (line ~1065)
   - Service name: `'imginn'`
   - Common errors: Cloudflare, timeouts
2. **download_toolzu()** (line ~1134)
   - Service name: `'toolzu'`
   - Common errors: Cloudflare, rate limits
3. **download_snapchat()** (line ~1320)
   - Service name: `'snapchat'`
   - Common errors: Cloudflare, timeouts
4. **download_tiktok()** (line ~1364)
   - Service name: `'tiktok'`
   - Common errors: Rate limits, geo-blocks
5. **download_forums()** (line ~1442)
   - Service name: `'forums'`
   - Common errors: Authentication, Cloudflare

## Testing

### Manual Testing (No Monitoring)

```bash
# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking
```

### Scheduler Testing (With Monitoring)

```bash
# Start the scheduler (monitoring active)
sudo systemctl start media-downloader

# Check the health state
cat /opt/media-downloader/database/service_health.json

# Check the logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"
```

### Simulate a Failure

1. Stop FlareSolverr: `docker stop flaresolverr`
2. Run the scheduler - the service will fail
3. Check after 3 failures - a notification should be sent
4. Check the cooldown - no further notification for 24 hours
5. Start FlareSolverr: `docker start flaresolverr`
6. Run the scheduler - the service recovers and the counter resets

## Benefits

- ✅ **Early Warning** - Know immediately when services are stuck
- ✅ **No Spam** - A single daily notification per service
- ✅ **Actionable** - Shows the specific failure reason
- ✅ **Auto-Recovery** - Stops alerting once fixed
- ✅ **Historical Data** - Track failure/success patterns
- ✅ **Granular Control** - Enable/disable per service

## Troubleshooting

**Not receiving notifications:**

- Check that `service_monitoring.enabled` is `true`
- Check the service-specific `monitor` and `notify` settings
- Verify the Pushover credentials in the config
- Confirm the downloader is running in scheduler mode (not manually)

**Too many notifications:**

- Increase `notification_cooldown_hours`
- Increase `failure_threshold`
- Disable specific services with `notify: false`

**Service marked stuck incorrectly:**

- Increase `failure_threshold` (default: 3)
- Check whether the service is actually failing
- Review the failure logs

**Reset service state:**

```python
from modules.service_health_monitor import ServiceHealthMonitor

monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')
```

## Future Enhancements

- Web dashboard showing service health
- Daily digest emails/notifications
- Success rate metrics
- Escalation after extended downtime
- Integration with monitoring tools (Grafana, etc.)
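## Appendix: Threshold and Cooldown Behavior

The threshold and cooldown rules described in Detection Flow can be sketched in a few dozen lines. This is an illustrative model only, not the contents of `modules/service_health_monitor.py`: the `record_success`/`record_failure` names mirror the calls shown earlier, but the class name, constructor parameters, `notify` callback, and the simplified state file format are assumptions made for the sketch.

```python
import json
import time
from pathlib import Path


class HealthMonitorSketch:
    """Sketch of the stuck-detection logic: N consecutive failures mark a
    service as stuck, and at most one alert is sent per cooldown window."""

    def __init__(self, state_path, failure_threshold=3,
                 cooldown_hours=24, notify=print):
        self.state_path = Path(state_path)
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_hours * 3600
        self.notify = notify  # hypothetical callback standing in for Pushover
        self.state = {}       # service name -> health record

    def _svc(self, name):
        return self.state.setdefault(name, {
            "status": "healthy",
            "consecutive_failures": 0,
            "last_notification_sent": 0.0,
        })

    def record_success(self, name):
        svc = self._svc(name)
        svc["consecutive_failures"] = 0   # recovery resets the counter
        svc["status"] = "healthy"
        self._save()

    def record_failure(self, name, reason):
        svc = self._svc(name)
        svc["consecutive_failures"] += 1
        svc["failure_type"] = reason
        if svc["consecutive_failures"] >= self.failure_threshold:
            svc["status"] = "stuck"
            now = time.time()
            # Alert only if the cooldown since the last notification expired
            if now - svc["last_notification_sent"] >= self.cooldown_seconds:
                svc["last_notification_sent"] = now
                self.notify(f"Service Alert: {name} stuck ({reason})")
        self._save()

    def _save(self):
        # Simplified persistence; the real state file carries more fields
        payload = {"service_health": self.state}
        self.state_path.write_text(json.dumps(payload, indent=2))
```

With `failure_threshold=3`, the first two `record_failure` calls only increment the counter; the third marks the service stuck and fires exactly one alert, and further failures within the 24-hour window stay silent until the cooldown expires.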