media-downloader/docs/SERVICE_HEALTH_MONITORING.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00

Service Health Monitoring

Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.

Features

  • Scheduler-only operation - Only monitors during unattended daemon mode
  • 24-hour notification cooldown - Prevents notification spam
  • Failure threshold - 3 consecutive failures trigger stuck state
  • Automatic recovery detection - Stops alerting when service recovers
  • Detailed failure tracking - Cloudflare, rate limits, timeouts, etc.

Configuration

Located in config/settings.json:

{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}

Configuration Options

Main Settings:

  • enabled (boolean) - Master switch for service monitoring
  • notification_cooldown_hours (integer) - Hours between notifications for same service (default: 24)
  • failure_threshold (integer) - Consecutive failures before marking as stuck (default: 3)
  • send_recovery_notifications (boolean) - Send notification when service recovers (default: false)

Per-Service Settings:

  • monitor (boolean) - Track this service's health
  • notify (boolean) - Send notifications for this service

Pushover Settings:

  • enabled (boolean) - Enable Pushover notifications
  • priority (integer) - Notification priority (-2 to 2; in the Pushover API, priority 2 also requires retry and expire parameters)
  • sound (string) - Notification sound
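
The options above can be read and combined roughly as follows. This is a minimal sketch, not the project's loader: the helper names load_monitoring_config and should_notify are hypothetical, the default of False for enabled is an assumption (the other defaults come from the list above), and it assumes a notification needs the master switch, the Pushover switch, and both per-service flags to be on.

```python
import json
from pathlib import Path

# Defaults for the main settings. The numeric and recovery defaults are the
# documented ones; "enabled": False is an assumption made for this sketch.
DEFAULTS = {
    "enabled": False,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": False,
}


def load_monitoring_config(path="config/settings.json"):
    """Return the service_monitoring section merged over the defaults."""
    settings = json.loads(Path(path).read_text(encoding="utf-8"))
    section = settings.get("service_monitoring", {})
    return {**DEFAULTS, **section}


def should_notify(config, service):
    """True only if every switch that gates a notification is on (assumed rule)."""
    svc = config.get("services", {}).get(service, {})
    return bool(
        config.get("enabled")
        and config.get("pushover", {}).get("enabled")
        and svc.get("monitor")
        and svc.get("notify")
    )
```

A service that is missing from the services map is treated as not monitored here, which is a conservative choice for the sketch rather than documented behavior.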

How It Works

Detection Flow

  1. Service runs in scheduler mode

  2. Success: health_monitor.record_success('service_name')

    • Resets consecutive failure counter
    • Marks service as healthy
    • Sends recovery notification (if the service was stuck and send_recovery_notifications is enabled)
  3. Failure: health_monitor.record_failure('service_name', 'reason')

    • Increments failure counter
    • Records failure type (cloudflare, timeout, etc.)
    • If failures ≥ threshold → mark as stuck
    • If stuck AND cooldown expired → send alert
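
The detection flow above can be sketched as a small state machine. HealthTrackerSketch is a hypothetical stand-in, not the project's ServiceHealthMonitor: the notify callable and the now parameter exist only to make the sketch testable, recovery notifications are left out, and keeping last_notification_sent across a recovery is an assumption.

```python
from datetime import datetime, timedelta


class HealthTrackerSketch:
    """Hypothetical illustration of the threshold and cooldown logic."""

    def __init__(self, failure_threshold=3, cooldown_hours=24, notify=print):
        self.failure_threshold = failure_threshold
        self.cooldown = timedelta(hours=cooldown_hours)
        self.notify = notify  # callable that receives the alert text
        self.state = {}

    def _entry(self, service):
        # Create a healthy record the first time a service is seen.
        return self.state.setdefault(service, {
            "status": "healthy",
            "consecutive_failures": 0,
            "failure_type": None,
            "last_notification_sent": None,
        })

    def record_success(self, service):
        entry = self._entry(service)
        entry["consecutive_failures"] = 0
        entry["status"] = "healthy"
        entry["failure_type"] = None

    def record_failure(self, service, reason, now=None):
        now = now or datetime.now()
        entry = self._entry(service)
        entry["consecutive_failures"] += 1
        entry["failure_type"] = reason
        if entry["consecutive_failures"] >= self.failure_threshold:
            entry["status"] = "stuck"
        last = entry["last_notification_sent"]
        cooldown_over = last is None or now - last >= self.cooldown
        # Alert only when the service is stuck and the cooldown has expired.
        if entry["status"] == "stuck" and cooldown_over:
            self.notify(f"Service Alert: {service} ({reason})")
            entry["last_notification_sent"] = now
```

With the default settings, the third consecutive failure sends the first alert, and later failures stay silent until 24 hours have passed since that alert.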

Failure Types

  • cloudflare / cloudflare_challenge - Cloudflare block detected
  • rate_limit - HTTP 429 rate limiting
  • forbidden - HTTP 403 access denied
  • timeout - Connection timeout
  • authentication - Login/auth required
  • captcha - CAPTCHA challenge
  • blocked - IP blocked
  • unknown - Other errors
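
One way to map an exception onto these failure types is an ordered table of substrings. This is only a sketch: classify_failure and FAILURE_PATTERNS are hypothetical names, and the keyword lists are illustrative assumptions rather than the project's actual matching rules.

```python
# (failure_type, substrings) pairs, checked in order. The keywords are
# assumptions chosen for illustration, not taken from the project.
FAILURE_PATTERNS = [
    ("cloudflare", ("cloudflare", "cf_clearance")),
    ("captcha", ("captcha",)),
    ("rate_limit", ("429", "rate limit", "too many requests")),
    ("forbidden", ("403", "forbidden")),
    ("authentication", ("401", "login required", "unauthorized")),
    ("timeout", ("timeout", "timed out")),
    ("blocked", ("blocked",)),
]


def classify_failure(error):
    """Return the first failure type whose keyword appears in the error text."""
    text = str(error).lower()
    for failure_type, needles in FAILURE_PATTERNS:
        if any(needle in text for needle in needles):
            return failure_type
    return "unknown"
```

Substring matching on status codes such as 403 can misfire when the same digits appear elsewhere in a message (for example in a URL or ID), so checking a real status code attribute is safer where the exception exposes one.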

State Tracking

State stored in /opt/media-downloader/database/service_health.json:

{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
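
The state file can be inspected with a few lines of Python. The sketch below assumes the layout shown above; load_state, stuck_services, and failure_rate are hypothetical helper names and are not part of the project.

```python
import json
from pathlib import Path

STATE_PATH = Path("/opt/media-downloader/database/service_health.json")


def load_state(path=STATE_PATH):
    """Read the health state file as a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def stuck_services(state):
    """Return {service: consecutive_failures} for every service marked stuck."""
    return {
        name: entry.get("consecutive_failures", 0)
        for name, entry in state.get("service_health", {}).items()
        if entry.get("status") == "stuck"
    }


def failure_rate(entry):
    """Fraction of recorded runs that failed, or None if nothing is recorded."""
    failures = entry.get("total_failures", 0)
    total = failures + entry.get("total_successes", 0)
    return failures / total if total else None
```

For the fastdl entry shown above, stuck_services reports 5 consecutive failures, and failure_rate is 12 failures out of 157 recorded runs.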

Notification Format

Alert Notification:

⚠️ Service Alert: FastDL

Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)

Last successful download: 9 hours ago

Action may be required.

Recovery Notification (optional):

✅ Service Recovered: FastDL

Status: Healthy
Service is working again.

Recovered at: Oct 28, 11:45 PM
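
The alert text above could be assembled as follows. This is a sketch under assumptions: format_alert and format_time are hypothetical names, the caller decides which timestamps to pass for "Failed Since" and the last success, and the elapsed time is always rendered in whole hours.

```python
from datetime import datetime


def format_time(dt):
    """Render a timestamp like 'Oct 28, 2:30 PM' without a zero-padded hour."""
    hour12 = dt.hour % 12 or 12
    meridiem = "AM" if dt.hour < 12 else "PM"
    return f"{dt.strftime('%b')} {dt.day}, {hour12}:{dt.minute:02d} {meridiem}"


def format_alert(display_name, failure_type, failed_since,
                 consecutive_failures, last_success, now=None):
    """Build the multi-line alert body shown in the example above."""
    now = now or datetime.now()
    hours_ago = int((now - last_success).total_seconds() // 3600)
    # 'cloudflare_challenge' becomes 'Cloudflare Challenge'
    issue = failure_type.replace("_", " ").title()
    return "\n".join([
        f"⚠️ Service Alert: {display_name}",
        "",
        "Status: Stuck/Blocked",
        f"Issue: {issue}",
        f"Failed Since: {format_time(failed_since)} "
        f"({consecutive_failures} consecutive failures)",
        "",
        f"Last successful download: {hours_ago} hours ago",
        "",
        "Action may be required.",
    ])
```

The month abbreviation comes from strftime and therefore follows the process locale.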

Implementation Status

Completed Components

  1. Core Module - modules/service_health_monitor.py

    • State management
    • Failure/success tracking
    • Notification logic
    • Cooldown management
  2. Configuration - config/settings.json

    • service_monitoring section added
    • All services configured
  3. Integration - media-downloader.py

    • Health monitor initialization (scheduler mode only)
    • Imported ServiceHealthMonitor
  4. Example Implementation - download_fastdl()

    • Success tracking after completion
    • Failure tracking with error classification
    • Try/except wrapper pattern

Pending Implementation

The download methods listed under Methods to Update still need success/failure tracking. Each should follow the pattern below.

Pattern to Follow

def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')

        return total_downloaded

    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)

        raise  # Re-raise to maintain existing error handling
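
Because the same wrapper has to be added to five methods, the pattern could also be factored into a decorator. This is an alternative sketch, not what the project does: track_health is a hypothetical name, the classify argument stands in for whatever error classification is used, and the logging call from the pattern above is omitted for brevity.

```python
import functools


def track_health(service_name, classify=lambda exc: "unknown"):
    """Wrap a download method with success/failure recording."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # health_monitor is None outside scheduler mode, so tracking is skipped.
            monitor = getattr(self, "health_monitor", None)
            try:
                result = method(self, *args, **kwargs)
            except Exception as exc:
                if monitor:
                    monitor.record_failure(service_name, classify(exc))
                raise  # keep the caller's existing error handling
            if monitor:
                monitor.record_success(service_name)
            return result
        return wrapper
    return decorator
```

Usage would look like placing @track_health('imginn') above download_imginn(). Unlike the inline pattern, success is recorded outside the try block here, so an error raised by the monitor itself is not counted as a download failure.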

Methods to Update

  1. download_imginn() (line ~1065)

    • Service name: 'imginn'
    • Common errors: Cloudflare, timeouts
  2. download_toolzu() (line ~1134)

    • Service name: 'toolzu'
    • Common errors: Cloudflare, rate limits
  3. download_snapchat() (line ~1320)

    • Service name: 'snapchat'
    • Common errors: Cloudflare, timeouts
  4. download_tiktok() (line ~1364)

    • Service name: 'tiktok'
    • Common errors: Rate limits, geo-blocks
  5. download_forums() (line ~1442)

    • Service name: 'forums'
    • Common errors: Authentication, Cloudflare

Testing

Manual Testing (No Monitoring)

# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking

Scheduler Testing (With Monitoring)

# Start scheduler (monitoring active)
sudo systemctl start media-downloader

# Check health state
cat /opt/media-downloader/database/service_health.json

# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"

Simulate Failure

  1. Stop FlareSolverr: docker stop flaresolverr
  2. Run the scheduler - services that depend on FlareSolverr will fail
  3. After 3 consecutive failures, confirm that a notification is sent
  4. Confirm that no further notification arrives during the 24-hour cooldown
  5. Start FlareSolverr: docker start flaresolverr
  6. Run the scheduler again - the service recovers and the failure counter resets

Benefits

  • Early Warning - Know immediately when services are stuck
  • No Spam - Single daily notification per service
  • Actionable - Shows specific failure reason
  • Auto-Recovery - Stops alerting when fixed
  • Historical Data - Track failure/success patterns
  • Granular Control - Enable/disable per service

Troubleshooting

Not receiving notifications:

  • Check service_monitoring.enabled is true
  • Check service-specific monitor and notify settings
  • Verify Pushover credentials in config
  • Confirm running in scheduler mode (not manual)
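
To check Pushover credentials independently of the downloader, Pushover's user validation endpoint can be called directly. This is a sketch using only the standard library: the function names are hypothetical, and where the application token and user key are stored in the project's config is not shown in this document, so they are passed in as arguments.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

VALIDATE_URL = "https://api.pushover.net/1/users/validate.json"


def build_validation_body(app_token, user_key):
    """Form-encode the two fields the validation endpoint expects."""
    return urllib.parse.urlencode(
        {"token": app_token, "user": user_key}
    ).encode("ascii")


def validate_pushover(app_token, user_key, timeout=10):
    """Return True if Pushover accepts the token and user key pair."""
    request = urllib.request.Request(
        VALIDATE_URL, data=build_validation_body(app_token, user_key)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response).get("status") == 1
    except urllib.error.HTTPError:
        # Pushover answers invalid credentials with a 4xx response.
        return False
```

Network errors other than an HTTP error response (for example DNS failures) are left to propagate so they are not mistaken for bad credentials.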

Too many notifications:

  • Increase notification_cooldown_hours
  • Increase failure_threshold
  • Disable specific services with notify: false

Service marked stuck incorrectly:

  • Increase failure_threshold (default: 3)
  • Check if service is actually failing
  • Review failure logs

Reset service state:

from modules.service_health_monitor import ServiceHealthMonitor
monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')

Future Enhancements

  • Web dashboard showing service health
  • Daily digest emails/notifications
  • Success rate metrics
  • Escalation after extended downtime
  • Integration with monitoring tools (Grafana, etc.)