# Service Health Monitoring

## Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.
## Features

- **Scheduler-only operation** - Only monitors during unattended daemon mode
- **24-hour notification cooldown** - Prevents notification spam
- **Failure threshold** - 3 consecutive failures trigger the stuck state
- **Automatic recovery detection** - Stops alerting when a service recovers
- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.
## Configuration

Located in `config/settings.json`:
```json
{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}
```
## Configuration Options

**Main Settings:**

- `enabled` (boolean) - Master switch for service monitoring
- `notification_cooldown_hours` (integer) - Hours between notifications for the same service (default: 24)
- `failure_threshold` (integer) - Consecutive failures before marking a service as stuck (default: 3)
- `send_recovery_notifications` (boolean) - Send a notification when a service recovers (default: false)

**Per-Service Settings:**

- `monitor` (boolean) - Track this service's health
- `notify` (boolean) - Send notifications for this service

**Pushover Settings:**

- `enabled` (boolean) - Enable Pushover notifications
- `priority` (integer) - Notification priority (-2 to 2)
- `sound` (string) - Notification sound
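As a rough illustration, the Pushover settings above map onto Pushover's message API like this. The helper names and the `APP_TOKEN`/`USER_KEY` placeholders are assumptions for the sketch; the real credentials live elsewhere in the config, and the actual module may build the request differently.

```python
import urllib.parse
import urllib.request

def build_pushover_payload(message, title, settings,
                           token="APP_TOKEN", user="USER_KEY"):
    """Map the service_monitoring.pushover config onto the Pushover API fields."""
    return {
        "token": token,                              # application token (placeholder)
        "user": user,                                # user key (placeholder)
        "title": title,
        "message": message,
        "priority": settings.get("priority", 0),     # -2 .. 2
        "sound": settings.get("sound", "pushover"),
    }

def send_pushover(payload):
    """POST the payload to Pushover's messages endpoint."""
    data = urllib.parse.urlencode(payload).encode()
    req = urllib.request.Request("https://api.pushover.net/1/messages.json", data=data)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```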
## How It Works

### Detection Flow

1. Service runs in scheduler mode.
2. **Success:** `health_monitor.record_success('service_name')`
   - Resets the consecutive failure counter
   - Marks the service as healthy
   - Sends a recovery notification (if enabled)
3. **Failure:** `health_monitor.record_failure('service_name', 'reason')`
   - Increments the failure counter
   - Records the failure type (cloudflare, timeout, etc.)
   - If failures ≥ threshold → mark as stuck
   - If stuck AND cooldown expired → send alert
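The flow above can be sketched as a minimal tracker. This is an illustrative reduction, not the real `ServiceHealthMonitor` in `modules/service_health_monitor.py`; the class name and return values here are assumptions.

```python
from datetime import datetime, timedelta

class HealthTracker:
    """Minimal sketch of the threshold + cooldown logic described above."""

    def __init__(self, failure_threshold=3, cooldown_hours=24):
        self.failure_threshold = failure_threshold
        self.cooldown = timedelta(hours=cooldown_hours)
        self.state = {}  # service name -> tracking dict

    def _entry(self, service):
        return self.state.setdefault(service, {
            "consecutive_failures": 0,
            "status": "healthy",
            "last_notification_sent": None,
        })

    def record_success(self, service):
        entry = self._entry(service)
        recovered = entry["status"] == "stuck"
        entry["consecutive_failures"] = 0   # reset the counter
        entry["status"] = "healthy"
        return recovered  # caller may send a recovery notification

    def record_failure(self, service, reason, now=None):
        now = now or datetime.now()
        entry = self._entry(service)
        entry["consecutive_failures"] += 1
        entry["failure_type"] = reason
        if entry["consecutive_failures"] >= self.failure_threshold:
            entry["status"] = "stuck"
            last = entry["last_notification_sent"]
            if last is None or now - last >= self.cooldown:
                entry["last_notification_sent"] = now
                return True   # stuck AND cooldown expired -> send alert
        return False
```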
### Failure Types

- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
- `rate_limit` - HTTP 429 rate limiting
- `forbidden` - HTTP 403 access denied
- `timeout` - Connection timeout
- `authentication` - Login/auth required
- `captcha` - CAPTCHA challenge
- `blocked` - IP blocked
- `unknown` - Other errors
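A classifier mapping an exception message onto these types might look like the sketch below. The keyword lists are illustrative assumptions, not the exact rules used by the real monitor.

```python
def classify_failure(error: Exception) -> str:
    """Map an exception's message to one of the failure types listed above."""
    text = str(error).lower()
    if "cloudflare" in text or "cf_clearance" in text:
        return "cloudflare"
    if "429" in text or "rate limit" in text:
        return "rate_limit"
    if "403" in text or "forbidden" in text:
        return "forbidden"
    if "timeout" in text or "timed out" in text:
        return "timeout"
    if "login" in text or "auth" in text:
        return "authentication"
    if "captcha" in text:
        return "captcha"
    if "blocked" in text:
        return "blocked"
    return "unknown"
```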
### State Tracking

State is stored in `/opt/media-downloader/database/service_health.json`:
```json
{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
```
## Notification Format

**Alert Notification:**

```text
⚠️ Service Alert: FastDL
Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)
Last successful download: 9 hours ago
Action may be required.
```

**Recovery Notification (optional):**

```text
✅ Service Recovered: FastDL
Status: Healthy
Service is working again.
Recovered at: Oct 28, 11:45 PM
```
## Implementation Status

### ✅ Completed Components

1. **Core Module** - `modules/service_health_monitor.py`
   - State management
   - Failure/success tracking
   - Notification logic
   - Cooldown management
2. **Configuration** - `config/settings.json`
   - `service_monitoring` section added
   - All services configured
3. **Integration** - `media-downloader.py`
   - Health monitor initialization (scheduler mode only)
   - Imported `ServiceHealthMonitor`
4. **Example Implementation** - `download_fastdl()`
   - Success tracking after completion
   - Failure tracking with error classification
   - Try/except wrapper pattern
### 🔄 Pending Implementation

The following download methods still need success/failure tracking added.
### Pattern to Follow

```python
def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')
        return total_downloaded
    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)
        raise  # Re-raise to maintain existing error handling
```
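Since the same wrapper would be repeated across five methods, it could optionally be factored into a decorator. This is a suggested refactor under the assumption that each downloader is a method on an object with a `health_monitor` attribute; it is not part of the current codebase.

```python
import functools

def track_health(service_name):
    """Wrap a download method with the success/failure tracking pattern above."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            try:
                result = func(self, *args, **kwargs)
                if self.health_monitor:
                    self.health_monitor.record_success(service_name)
                return result
            except Exception as e:
                if self.health_monitor:
                    error_str = str(e).lower()
                    if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                        reason = 'cloudflare'
                    elif 'timeout' in error_str:
                        reason = 'timeout'
                    elif '403' in error_str:
                        reason = 'forbidden'
                    elif '429' in error_str:
                        reason = 'rate_limit'
                    else:
                        reason = 'unknown'
                    self.health_monitor.record_failure(service_name, reason)
                raise  # preserve existing error handling
        return wrapper
    return decorator
```

With this in place, each pending method would only need a one-line change, e.g. `@track_health('imginn')` above `download_imginn()`.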
### Methods to Update

1. `download_imginn()` (line ~1065)
   - Service name: `'imginn'`
   - Common errors: Cloudflare, timeouts
2. `download_toolzu()` (line ~1134)
   - Service name: `'toolzu'`
   - Common errors: Cloudflare, rate limits
3. `download_snapchat()` (line ~1320)
   - Service name: `'snapchat'`
   - Common errors: Cloudflare, timeouts
4. `download_tiktok()` (line ~1364)
   - Service name: `'tiktok'`
   - Common errors: Rate limits, geo-blocks
5. `download_forums()` (line ~1442)
   - Service name: `'forums'`
   - Common errors: Authentication, Cloudflare
## Testing

### Manual Testing (No Monitoring)

```bash
# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking
```

### Scheduler Testing (With Monitoring)

```bash
# Start scheduler (monitoring active)
sudo systemctl start media-downloader

# Check health state
cat /opt/media-downloader/database/service_health.json

# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"
```
### Simulate Failure

1. Stop FlareSolverr: `docker stop flaresolverr`
2. Run the scheduler - the service will fail
3. Check after 3 failures - a notification should be sent
4. Check the cooldown - no further notification for 24 hours
5. Start FlareSolverr: `docker start flaresolverr`
6. Run the scheduler - the service recovers and the counter resets
## Benefits

- ✅ **Early Warning** - Know immediately when services are stuck
- ✅ **No Spam** - Single daily notification per service
- ✅ **Actionable** - Shows the specific failure reason
- ✅ **Auto-Recovery** - Stops alerting when fixed
- ✅ **Historical Data** - Track failure/success patterns
- ✅ **Granular Control** - Enable/disable per service
## Troubleshooting

**Not receiving notifications:**

- Check that `service_monitoring.enabled` is `true`
- Check the service-specific `monitor` and `notify` settings
- Verify the Pushover credentials in the config
- Confirm the app is running in scheduler mode (not manual)

**Too many notifications:**

- Increase `notification_cooldown_hours`
- Increase `failure_threshold`
- Disable specific services with `notify: false`

**Service marked stuck incorrectly:**

- Increase `failure_threshold` (default: 3)
- Check whether the service is actually failing
- Review the failure logs

**Reset service state:**

```python
from modules.service_health_monitor import ServiceHealthMonitor

monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')
```
## Future Enhancements

- Web dashboard showing service health
- Daily digest emails/notifications
- Success rate metrics
- Escalation after extended downtime
- Integration with monitoring tools (Grafana, etc.)