Todd 0d7b2b1aab Initial commit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-29 22:42:55 -04:00

Service Health Monitoring

Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.

Features

Scheduler-only operation - Only monitors during unattended daemon mode
24-hour notification cooldown - Prevents notification spam
Failure threshold - 3 consecutive failures trigger stuck state
Automatic recovery detection - Stops alerting when service recovers
Detailed failure tracking - Cloudflare, rate limits, timeouts, etc.

Configuration

Located in config/settings.json:

{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}

Configuration Options

Main Settings:

enabled (boolean) - Master switch for service monitoring
notification_cooldown_hours (integer) - Minimum hours between alert notifications for the same service (default: 24)
failure_threshold (integer) - Consecutive failures before marking as stuck (default: 3)
send_recovery_notifications (boolean) - Send notification when service recovers (default: false)

Per-Service Settings:

monitor (boolean) - Track this service's health
notify (boolean) - Send notifications for this service

Pushover Settings:

enabled (boolean) - Enable Pushover notifications
priority (integer) - Notification priority (-2 to 2)
sound (string) - Notification sound
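
The options above can be read with their documented defaults filled in. The following is a minimal sketch, not the module's actual loader: load_monitoring_config, should_notify, and the DEFAULTS table are hypothetical names, and the defaults for enabled and the pushover block are assumptions taken from the sample config rather than documented defaults.

```python
from copy import deepcopy

# Defaults for the three documented options (24, 3, False) come from this page.
# The values for "enabled" and "pushover" are assumptions copied from the sample config.
DEFAULTS = {
    "enabled": True,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": False,
    "services": {},
    "pushover": {"enabled": True, "priority": 0, "sound": "pushover"},
}


def load_monitoring_config(settings):
    """Return the service_monitoring section with missing keys filled from DEFAULTS."""
    cfg = deepcopy(DEFAULTS)
    section = settings.get("service_monitoring", {})
    for key, value in section.items():
        if isinstance(value, dict) and isinstance(cfg.get(key), dict):
            cfg[key].update(value)  # shallow merge for nested sections
        else:
            cfg[key] = value
    # Basic sanity checks on the documented ranges.
    if not -2 <= cfg["pushover"]["priority"] <= 2:
        raise ValueError("pushover.priority must be between -2 and 2")
    if cfg["failure_threshold"] < 1:
        raise ValueError("failure_threshold must be at least 1")
    return cfg


def should_notify(cfg, service):
    """True only if the master switch, Pushover, and both per-service flags allow it."""
    svc = cfg["services"].get(service, {})
    return bool(
        cfg["enabled"]
        and cfg["pushover"]["enabled"]
        and svc.get("monitor", False)
        and svc.get("notify", False)
    )
```

In use, the parsed contents of config/settings.json would be passed to load_monitoring_config, and should_notify(cfg, 'fastdl') would gate each alert.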

How It Works

Detection Flow

Service runs in scheduler mode
Success: health_monitor.record_success('service_name')
- Resets consecutive failure counter
- Marks service as healthy
- Sends recovery notification (if send_recovery_notifications is enabled)
Failure: health_monitor.record_failure('service_name', 'reason')
- Increments failure counter
- Records failure type (cloudflare, timeout, etc.)
- If failures ≥ threshold → mark as stuck
- If stuck AND cooldown expired → send alert
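
The flow above can be sketched as a small in-memory state machine. This is an illustration under stated assumptions, not the code in modules/service_health_monitor.py: HealthTrackerSketch and ServiceState are hypothetical names, alerts are collected in a list instead of being sent, and the optional now parameter exists only to make the cooldown testable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ServiceState:
    status: str = "healthy"
    consecutive_failures: int = 0
    failure_type: Optional[str] = None
    last_notification_sent: Optional[datetime] = None


class HealthTrackerSketch:
    """Hypothetical stand-in for ServiceHealthMonitor (not the real class)."""

    def __init__(self, failure_threshold=3, cooldown_hours=24):
        self.failure_threshold = failure_threshold
        self.cooldown = timedelta(hours=cooldown_hours)
        self.states = {}   # service name -> ServiceState
        self.alerts = []   # (service, failure_type) pairs that would be sent

    def record_success(self, service):
        state = self.states.setdefault(service, ServiceState())
        state.consecutive_failures = 0   # reset the counter
        state.status = "healthy"         # mark as healthy

    def record_failure(self, service, reason, now=None):
        now = now or datetime.now()
        state = self.states.setdefault(service, ServiceState())
        state.consecutive_failures += 1
        state.failure_type = reason
        if state.consecutive_failures >= self.failure_threshold:
            state.status = "stuck"
        cooldown_expired = (
            state.last_notification_sent is None
            or now - state.last_notification_sent >= self.cooldown
        )
        if state.status == "stuck" and cooldown_expired:
            self.alerts.append((service, reason))
            state.last_notification_sent = now
```

With the default threshold of 3, the first two failures only increment the counter; the third marks the service stuck and produces one alert, and further failures stay silent until the cooldown has elapsed.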

Failure Types

cloudflare / cloudflare_challenge - Cloudflare block detected
rate_limit - HTTP 429 rate limiting
forbidden - HTTP 403 access denied
timeout - Connection timeout
authentication - Login/auth required
captcha - CAPTCHA challenge
blocked - IP blocked
unknown - Other errors
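
One way to map an exception onto these failure types is an ordered keyword table. The sketch below is illustrative only: classify_failure and FAILURE_RULES are hypothetical names, and the keyword lists are assumptions rather than the project's actual matching rules.

```python
# Ordered (failure_type, keywords) rules; the first rule that matches wins.
# Keyword choices are illustrative assumptions.
FAILURE_RULES = [
    ("cloudflare", ("cloudflare", "cf_clearance")),
    ("captcha", ("captcha",)),
    ("rate_limit", ("429", "rate limit", "too many requests")),
    ("forbidden", ("403", "forbidden")),
    ("authentication", ("401", "unauthorized", "login required")),
    ("timeout", ("timeout", "timed out")),
    ("blocked", ("blocked",)),
]


def classify_failure(error):
    """Return one of the documented failure types for an exception."""
    text = str(error).lower()
    for failure_type, keywords in FAILURE_RULES:
        if any(keyword in text for keyword in keywords):
            return failure_type
    return "unknown"
```

Matching bare status codes such as 403 in free text can misfire on unrelated numbers, so checking a real HTTP status code is preferable wherever the exception exposes one.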

State Tracking

State stored in /opt/media-downloader/database/service_health.json:

{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
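
Because the state is plain JSON, it can be inspected from a script. A minimal sketch, assuming the file layout shown above; stuck_services is a hypothetical helper, not part of the module.

```python
import json
from pathlib import Path


def stuck_services(state_path):
    """Return {service: entry} for every service whose status is "stuck"."""
    path = Path(state_path)
    if not path.exists():
        return {}  # no state file yet, e.g. the scheduler has never run
    data = json.loads(path.read_text(encoding="utf-8"))
    return {
        name: entry
        for name, entry in data.get("service_health", {}).items()
        if entry.get("status") == "stuck"
    }
```

For example, stuck_services('/opt/media-downloader/database/service_health.json') would return the fastdl entry for the state shown above.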

Notification Format

Alert Notification:

⚠️ Service Alert: FastDL

Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)

Last successful download: 9 hours ago

Action may be required.

Recovery Notification (optional):

✅ Service Recovered: FastDL

Status: Healthy
Service is working again.

Recovered at: Oct 28, 11:45 PM
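
An alert like the one above could be assembled from a state entry and posted to Pushover's message endpoint. This is a sketch under assumptions: build_alert_request is a hypothetical helper, the message omits the timestamps shown in the sample, and the token and user values are placeholders for credentials that this page does not show.

```python
from urllib.parse import urlencode
from urllib.request import Request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_alert_request(display_name, entry, token, user, priority=0, sound="pushover"):
    """Build (but do not send) a POST request for a stuck-service alert."""
    # "cloudflare_challenge" becomes "Cloudflare Challenge"
    issue = entry.get("failure_type", "unknown").replace("_", " ").title()
    message = (
        "Status: Stuck/Blocked\n"
        f"Issue: {issue}\n"
        f"Consecutive failures: {entry.get('consecutive_failures', 0)}\n\n"
        "Action may be required."
    )
    fields = {
        "token": token,        # Pushover application token (placeholder)
        "user": user,          # Pushover user key (placeholder)
        "title": f"⚠️ Service Alert: {display_name}",
        "message": message,
        "priority": priority,  # -2 to 2, as in the pushover settings
        "sound": sound,
    }
    return Request(PUSHOVER_URL, data=urlencode(fields).encode("utf-8"), method="POST")
```

Passing the returned request to urllib.request.urlopen would send it. Keeping construction separate from sending makes the message format easy to test without network access.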

Implementation Status

✅ Completed Components

Core Module - modules/service_health_monitor.py
- State management
- Failure/success tracking
- Notification logic
- Cooldown management
Configuration - config/settings.json
- service_monitoring section added
- All services configured
Integration - media-downloader.py
- Health monitor initialization (scheduler mode only)
- Imported ServiceHealthMonitor
Example Implementation - download_fastdl()
- Success tracking after completion
- Failure tracking with error classification
- Try/except wrapper pattern

🔄 Pending Implementation

The download methods listed under Methods to Update below still need success/failure tracking added. Each should follow the same pattern as download_fastdl():

Pattern to Follow

def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')

        return total_downloaded

    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)

        raise  # Re-raise to maintain existing error handling

Methods to Update

download_imginn() (line ~1065)
- Service name: 'imginn'
- Common errors: Cloudflare, timeouts
download_toolzu() (line ~1134)
- Service name: 'toolzu'
- Common errors: Cloudflare, rate limits
download_snapchat() (line ~1320)
- Service name: 'snapchat'
- Common errors: Cloudflare, timeouts
download_tiktok() (line ~1364)
- Service name: 'tiktok'
- Common errors: Rate limits, geo-blocks
download_forums() (line ~1442)
- Service name: 'forums'
- Common errors: Authentication, Cloudflare
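
Since all five methods need the same wrapper, the pattern could also be factored into a decorator instead of being copied into each method. This is an alternative sketch, not how download_fastdl() is currently implemented: track_health and classify are hypothetical names, and the classification mirrors the checks in Pattern to Follow.

```python
import functools


def classify(error):
    """Same checks, in the same order, as the pattern above."""
    text = str(error).lower()
    if "cloudflare" in text or "cf_clearance" in text:
        return "cloudflare"
    if "timeout" in text:
        return "timeout"
    if "403" in text:
        return "forbidden"
    if "429" in text:
        return "rate_limit"
    return "unknown"


def track_health(service_name):
    """Decorator that records success or failure on self.health_monitor, if set."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # health_monitor is None outside scheduler mode, so tracking is skipped
            monitor = getattr(self, "health_monitor", None)
            try:
                result = method(self, *args, **kwargs)
            except Exception as exc:
                if monitor is not None:
                    monitor.record_failure(service_name, classify(exc))
                raise  # re-raise to maintain existing error handling
            if monitor is not None:
                monitor.record_success(service_name)
            return result
        return wrapper
    return decorator
```

A method would then be annotated with, for example, @track_health('imginn') above def download_imginn(self). The trade-off is that the per-method error logging from the original pattern would need to stay in the method body or move into the decorator.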

Testing

Manual Testing (No Monitoring)

# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking

Scheduler Testing (With Monitoring)

# Start scheduler (monitoring active)
sudo systemctl start media-downloader

# Check health state
cat /opt/media-downloader/database/service_health.json

# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"

Simulate Failure

Stop FlareSolverr: docker stop flaresolverr
Run scheduler - service will fail
Check after 3 failures - notification should be sent
Check cooldown - no notification for 24 hours
Start FlareSolverr: docker start flaresolverr
Run scheduler - service recovers, counter resets

Benefits

✅ Early Warning - Know immediately when services are stuck
✅ No Spam - Single daily notification per service
✅ Actionable - Shows specific failure reason
✅ Auto-Recovery - Stops alerting when fixed
✅ Historical Data - Track failure/success patterns
✅ Granular Control - Enable/disable per service

Troubleshooting

Not receiving notifications:

Check service_monitoring.enabled is true
Check service-specific monitor and notify settings
Verify Pushover credentials in config
Confirm running in scheduler mode (not manual)

Too many notifications:

Increase notification_cooldown_hours
Increase failure_threshold
Disable notifications for specific services with notify: false

Service marked stuck incorrectly:

Increase failure_threshold (default: 3)
Check if service is actually failing
Review failure logs

Reset service state:

from modules.service_health_monitor import ServiceHealthMonitor
monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')

Future Enhancements

Web dashboard showing service health
Daily digest emails/notifications
Success rate metrics
Escalation after extended downtime
Integration with monitoring tools (Grafana, etc.)
