# Service Health Monitoring

## Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.

## Features

- **Scheduler-only operation** - Only monitors during unattended daemon mode
- **24-hour notification cooldown** - Prevents notification spam
- **Failure threshold** - 3 consecutive failures trigger stuck state
- **Automatic recovery detection** - Stops alerting when service recovers
- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.

## Configuration

Located in `config/settings.json`:

```json
{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}
```

### Configuration Options

**Main Settings:**

- `enabled` (boolean) - Master switch for service monitoring
- `notification_cooldown_hours` (integer) - Hours between notifications for the same service (default: 24)
- `failure_threshold` (integer) - Consecutive failures before a service is marked stuck (default: 3)
- `send_recovery_notifications` (boolean) - Send a notification when a service recovers (default: false)

**Per-Service Settings:**

- `monitor` (boolean) - Track this service's health
- `notify` (boolean) - Send notifications for this service

**Pushover Settings:**

- `enabled` (boolean) - Enable Pushover notifications
- `priority` (integer) - Notification priority (-2 to 2)
- `sound` (string) - Notification sound
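
As a sketch of how these flags might be consumed, assuming a hypothetical `monitoring_flags` helper (illustrative, not part of the project's actual code):

```python
# Hypothetical helper: read the service_monitoring section and decide
# whether a given service should be tracked and alerted on.
def monitoring_flags(settings: dict, service: str) -> tuple:
    cfg = settings.get("service_monitoring", {})
    if not cfg.get("enabled", False):
        return (False, False)  # master switch off: no tracking at all
    svc = cfg.get("services", {}).get(service, {})
    return (svc.get("monitor", False), svc.get("notify", False))

settings = {
    "service_monitoring": {
        "enabled": True,
        "services": {"fastdl": {"monitor": True, "notify": False}},
    }
}
print(monitoring_flags(settings, "fastdl"))  # (True, False)
print(monitoring_flags(settings, "tiktok"))  # unknown service: (False, False)
```

Note that an unknown or unlisted service defaults to not monitored, which matches the per-service opt-in design above.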

## How It Works

### Detection Flow

1. **Service runs** in scheduler mode

2. **Success**: `health_monitor.record_success('service_name')`
   - Resets consecutive failure counter
   - Marks service as healthy
   - Sends recovery notification (if enabled)

3. **Failure**: `health_monitor.record_failure('service_name', 'reason')`
   - Increments failure counter
   - Records failure type (cloudflare, timeout, etc.)
   - If failures ≥ threshold → mark as stuck
   - If stuck AND cooldown expired → send alert
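
The threshold and cooldown decision can be sketched as follows (field names mirror the state file below; `should_alert` is an illustrative helper, not the monitor's actual API):

```python
from datetime import datetime, timedelta

FAILURE_THRESHOLD = 3            # failures before a service counts as stuck
COOLDOWN = timedelta(hours=24)   # minimum gap between alerts per service

def should_alert(state: dict, now: datetime) -> bool:
    if state["consecutive_failures"] < FAILURE_THRESHOLD:
        return False  # not stuck yet
    last = state.get("last_notification_sent")
    if last is not None and now - last < COOLDOWN:
        return False  # stuck, but still inside the notification cooldown
    return True       # stuck and cooldown expired: send the alert

now = datetime(2025, 10, 28, 12, 0)
state = {"consecutive_failures": 5,
         "last_notification_sent": now - timedelta(hours=4)}
print(should_alert(state, now))  # False - cooldown still active

state["last_notification_sent"] = now - timedelta(hours=25)
print(should_alert(state, now))  # True - cooldown expired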

### Failure Types

- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
- `rate_limit` - HTTP 429 rate limiting
- `forbidden` - HTTP 403 access denied
- `timeout` - Connection timeout
- `authentication` - Login/auth required
- `captcha` - CAPTCHA challenge
- `blocked` - IP blocked
- `unknown` - Other errors

### State Tracking

State stored in `/opt/media-downloader/database/service_health.json`:

```json
{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
```
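
A small script can scan this file for stuck services; the `stuck_services` helper here is illustrative, not part of the module:

```python
import json

def stuck_services(state: dict) -> list:
    """Return names of services currently marked stuck."""
    return [name
            for name, info in state.get("service_health", {}).items()
            if info.get("status") == "stuck"]

# In practice the state would be loaded from the service_health.json
# path above; a literal dict stands in for it here.
state = json.loads('{"service_health": {'
                   '"fastdl": {"status": "stuck"}, '
                   '"tiktok": {"status": "healthy"}}}')
print(stuck_services(state))  # ['fastdl']
```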

## Notification Format

**Alert Notification:**

```
⚠️ Service Alert: FastDL

Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)

Last successful download: 9 hours ago

Action may be required.
```

**Recovery Notification** (optional):

```
✅ Service Recovered: FastDL

Status: Healthy
Service is working again.

Recovered at: Oct 28, 11:45 PM
```
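
These messages map naturally onto Pushover's message API (a POST to `https://api.pushover.net/1/messages.json`). A sketch of assembling the request fields, with placeholder credentials and an illustrative `build_alert_payload` helper that is not the project's actual code:

```python
def build_alert_payload(service: str, issue: str, failures: int,
                        priority: int = 0, sound: str = "pushover") -> dict:
    """Build the form fields Pushover expects for an alert message."""
    return {
        "token": "APP_TOKEN",   # placeholder: your Pushover application token
        "user": "USER_KEY",     # placeholder: your Pushover user key
        "title": f"⚠️ Service Alert: {service}",
        "message": (f"Status: Stuck/Blocked\n"
                    f"Issue: {issue}\n"
                    f"Consecutive failures: {failures}"),
        "priority": priority,   # -2 to 2, from the pushover config section
        "sound": sound,
    }

payload = build_alert_payload("FastDL", "Cloudflare Challenge", 5)
print(payload["title"])  # ⚠️ Service Alert: FastDL
```

The `priority` and `sound` values come straight from the `pushover` block in `config/settings.json`.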

## Implementation Status

### ✅ Completed Components

1. **Core Module** - `modules/service_health_monitor.py`
   - State management
   - Failure/success tracking
   - Notification logic
   - Cooldown management

2. **Configuration** - `config/settings.json`
   - `service_monitoring` section added
   - All services configured

3. **Integration** - `media-downloader.py`
   - Health monitor initialization (scheduler mode only)
   - Imported `ServiceHealthMonitor`

4. **Example Implementation** - `download_fastdl()`
   - Success tracking after completion
   - Failure tracking with error classification
   - Try/except wrapper pattern

### 🔄 Pending Implementation

The following download methods need success/failure tracking added:

#### Pattern to Follow

```python
def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')

        return total_downloaded

    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)

        raise  # Re-raise to maintain existing error handling
```
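
The classification branch could also be pulled out into a standalone helper so it is reusable across services and unit-testable in isolation (this extraction is a suggestion; the pattern above keeps it inline):

```python
def classify_failure(error: Exception) -> str:
    """Map an exception message onto one of the failure types above."""
    error_str = str(error).lower()
    if 'cloudflare' in error_str or 'cf_clearance' in error_str:
        return 'cloudflare'
    if 'timeout' in error_str:
        return 'timeout'
    if '403' in error_str:
        return 'forbidden'
    if '429' in error_str:
        return 'rate_limit'
    return 'unknown'

print(classify_failure(Exception("HTTP 429 Too Many Requests")))   # rate_limit
print(classify_failure(Exception("cf_clearance cookie rejected"))) # cloudflare
```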

#### Methods to Update

1. **download_imginn()** (line ~1065)
   - Service name: `'imginn'`
   - Common errors: Cloudflare, timeouts

2. **download_toolzu()** (line ~1134)
   - Service name: `'toolzu'`
   - Common errors: Cloudflare, rate limits

3. **download_snapchat()** (line ~1320)
   - Service name: `'snapchat'`
   - Common errors: Cloudflare, timeouts

4. **download_tiktok()** (line ~1364)
   - Service name: `'tiktok'`
   - Common errors: Rate limits, geo-blocks

5. **download_forums()** (line ~1442)
   - Service name: `'forums'`
   - Common errors: Authentication, Cloudflare

## Testing

### Manual Testing (No Monitoring)

```bash
# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking
```

### Scheduler Testing (With Monitoring)

```bash
# Start scheduler (monitoring active)
sudo systemctl start media-downloader

# Check health state
cat /opt/media-downloader/database/service_health.json

# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"
```

### Simulate Failure

1. Stop FlareSolverr: `docker stop flaresolverr`
2. Run scheduler - service will fail
3. Check after 3 failures - notification should be sent
4. Check cooldown - no notification for 24 hours
5. Start FlareSolverr: `docker start flaresolverr`
6. Run scheduler - service recovers, counter resets

## Benefits

- ✅ **Early Warning** - Know immediately when services are stuck
- ✅ **No Spam** - Single daily notification per service
- ✅ **Actionable** - Shows specific failure reason
- ✅ **Auto-Recovery** - Stops alerting when fixed
- ✅ **Historical Data** - Track failure/success patterns
- ✅ **Granular Control** - Enable/disable per service

## Troubleshooting

**Not receiving notifications:**

- Check `service_monitoring.enabled` is `true`
- Check service-specific `monitor` and `notify` settings
- Verify Pushover credentials in config
- Confirm running in scheduler mode (not manual)

**Too many notifications:**

- Increase `notification_cooldown_hours`
- Increase `failure_threshold`
- Disable specific services with `notify: false`

**Service marked stuck incorrectly:**

- Increase `failure_threshold` (default: 3)
- Check if service is actually failing
- Review failure logs

**Reset service state:**

```python
from modules.service_health_monitor import ServiceHealthMonitor

monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')
```

## Future Enhancements

- Web dashboard showing service health
- Daily digest emails/notifications
- Success rate metrics
- Escalation after extended downtime
- Integration with monitoring tools (Grafana, etc.)