# Service Health Monitoring

## Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.

## Features

- **Scheduler-only operation** - Only monitors during unattended daemon mode
- **24-hour notification cooldown** - Prevents notification spam
- **Failure threshold** - 3 consecutive failures trigger stuck state
- **Automatic recovery detection** - Stops alerting when service recovers
- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.

## Configuration

Located in `config/settings.json`:

```json
{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}
```

### Configuration Options

**Main Settings:**

- `enabled` (boolean) - Master switch for service monitoring
- `notification_cooldown_hours` (integer) - Hours between notifications for the same service (default: 24)
- `failure_threshold` (integer) - Consecutive failures before a service is marked stuck (default: 3)
- `send_recovery_notifications` (boolean) - Send a notification when a service recovers (default: false)

**Per-Service Settings:**

- `monitor` (boolean) - Track this service's health
- `notify` (boolean) - Send notifications for this service

**Pushover Settings:**

- `enabled` (boolean) - Enable Pushover notifications
- `priority` (integer) - Notification priority (-2 to 2)
- `sound` (string) - Notification sound
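
As a sketch of how these flags might be consumed, assuming a hypothetical `monitoring_flags` helper (illustrative, not part of the project's actual code):

```python
# Hypothetical helper: read the service_monitoring section and decide
# whether a given service should be tracked and alerted on.
def monitoring_flags(settings: dict, service: str) -> tuple:
    cfg = settings.get("service_monitoring", {})
    if not cfg.get("enabled", False):
        return (False, False)  # master switch off: no tracking at all
    svc = cfg.get("services", {}).get(service, {})
    return (svc.get("monitor", False), svc.get("notify", False))

settings = {
    "service_monitoring": {
        "enabled": True,
        "services": {"fastdl": {"monitor": True, "notify": False}},
    }
}
print(monitoring_flags(settings, "fastdl"))  # (True, False)
print(monitoring_flags(settings, "tiktok"))  # unknown service: (False, False)
```

Note that an unknown or unlisted service defaults to not monitored, which matches the per-service opt-in design above.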

## How It Works

### Detection Flow

1. **Service runs** in scheduler mode

2. **Success**: `health_monitor.record_success('service_name')`
   - Resets consecutive failure counter
   - Marks service as healthy
   - Sends recovery notification (if enabled)

3. **Failure**: `health_monitor.record_failure('service_name', 'reason')`
   - Increments failure counter
   - Records failure type (cloudflare, timeout, etc.)
   - If failures ≥ threshold → mark as stuck
   - If stuck AND cooldown expired → send alert
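
The threshold and cooldown decision can be sketched as follows (field names mirror the state file below; `should_alert` is an illustrative helper, not the monitor's actual API):

```python
from datetime import datetime, timedelta

FAILURE_THRESHOLD = 3            # failures before a service counts as stuck
COOLDOWN = timedelta(hours=24)   # minimum gap between alerts per service

def should_alert(state: dict, now: datetime) -> bool:
    if state["consecutive_failures"] < FAILURE_THRESHOLD:
        return False  # not stuck yet
    last = state.get("last_notification_sent")
    if last is not None and now - last < COOLDOWN:
        return False  # stuck, but still inside the notification cooldown
    return True       # stuck and cooldown expired: send the alert

now = datetime(2025, 10, 28, 12, 0)
state = {"consecutive_failures": 5,
         "last_notification_sent": now - timedelta(hours=4)}
print(should_alert(state, now))  # False - cooldown still active

state["last_notification_sent"] = now - timedelta(hours=25)
print(should_alert(state, now))  # True - cooldown expired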

### Failure Types

- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
- `rate_limit` - HTTP 429 rate limiting
- `forbidden` - HTTP 403 access denied
- `timeout` - Connection timeout
- `authentication` - Login/auth required
- `captcha` - CAPTCHA challenge
- `blocked` - IP blocked
- `unknown` - Other errors

### State Tracking

State stored in `/opt/media-downloader/database/service_health.json`:

```json
{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
```
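
A small script can scan this file for stuck services; the `stuck_services` helper here is illustrative, not part of the module:

```python
import json

def stuck_services(state: dict) -> list:
    """Return names of services currently marked stuck."""
    return [name
            for name, info in state.get("service_health", {}).items()
            if info.get("status") == "stuck"]

# In practice the state would be loaded from the service_health.json
# path above; a literal dict stands in for it here.
state = json.loads('{"service_health": {'
                   '"fastdl": {"status": "stuck"}, '
                   '"tiktok": {"status": "healthy"}}}')
print(stuck_services(state))  # ['fastdl']
```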

## Notification Format

**Alert Notification:**

```
⚠️ Service Alert: FastDL

Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)

Last successful download: 9 hours ago

Action may be required.
```

**Recovery Notification** (optional):

```
✅ Service Recovered: FastDL

Status: Healthy
Service is working again.

Recovered at: Oct 28, 11:45 PM
```
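
These messages map naturally onto Pushover's message API (a POST to `https://api.pushover.net/1/messages.json`). A sketch of assembling the request fields, with placeholder credentials and an illustrative `build_alert_payload` helper that is not the project's actual code:

```python
def build_alert_payload(service: str, issue: str, failures: int,
                        priority: int = 0, sound: str = "pushover") -> dict:
    """Build the form fields Pushover expects for an alert message."""
    return {
        "token": "APP_TOKEN",   # placeholder: your Pushover application token
        "user": "USER_KEY",     # placeholder: your Pushover user key
        "title": f"⚠️ Service Alert: {service}",
        "message": (f"Status: Stuck/Blocked\n"
                    f"Issue: {issue}\n"
                    f"Consecutive failures: {failures}"),
        "priority": priority,   # -2 to 2, from the pushover config section
        "sound": sound,
    }

payload = build_alert_payload("FastDL", "Cloudflare Challenge", 5)
print(payload["title"])  # ⚠️ Service Alert: FastDL
```

The `priority` and `sound` values come straight from the `pushover` block in `config/settings.json`.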

## Implementation Status

### ✅ Completed Components

1. **Core Module** - `modules/service_health_monitor.py`
   - State management
   - Failure/success tracking
   - Notification logic
   - Cooldown management

2. **Configuration** - `config/settings.json`
   - `service_monitoring` section added
   - All services configured

3. **Integration** - `media-downloader.py`
   - Health monitor initialization (scheduler mode only)
   - Imported `ServiceHealthMonitor`

4. **Example Implementation** - `download_fastdl()`
   - Success tracking after completion
   - Failure tracking with error classification
   - Try/except wrapper pattern

### 🔄 Pending Implementation

The following download methods need success/failure tracking added:

#### Pattern to Follow

```python
def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')

        return total_downloaded

    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)

        raise  # Re-raise to maintain existing error handling
```
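
The classification branch could also be pulled out into a standalone helper so it is reusable across services and unit-testable in isolation (this extraction is a suggestion; the pattern above keeps it inline):

```python
def classify_failure(error: Exception) -> str:
    """Map an exception message onto one of the failure types above."""
    error_str = str(error).lower()
    if 'cloudflare' in error_str or 'cf_clearance' in error_str:
        return 'cloudflare'
    if 'timeout' in error_str:
        return 'timeout'
    if '403' in error_str:
        return 'forbidden'
    if '429' in error_str:
        return 'rate_limit'
    return 'unknown'

print(classify_failure(Exception("HTTP 429 Too Many Requests")))   # rate_limit
print(classify_failure(Exception("cf_clearance cookie rejected"))) # cloudflare
```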

#### Methods to Update

1. **download_imginn()** (line ~1065)
   - Service name: `'imginn'`
   - Common errors: Cloudflare, timeouts

2. **download_toolzu()** (line ~1134)
   - Service name: `'toolzu'`
   - Common errors: Cloudflare, rate limits

3. **download_snapchat()** (line ~1320)
   - Service name: `'snapchat'`
   - Common errors: Cloudflare, timeouts

4. **download_tiktok()** (line ~1364)
   - Service name: `'tiktok'`
   - Common errors: Rate limits, geo-blocks

5. **download_forums()** (line ~1442)
   - Service name: `'forums'`
   - Common errors: Authentication, Cloudflare

## Testing

### Manual Testing (No Monitoring)

```bash
# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking
```

### Scheduler Testing (With Monitoring)

```bash
# Start scheduler (monitoring active)
sudo systemctl start media-downloader

# Check health state
cat /opt/media-downloader/database/service_health.json

# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"
```

### Simulate Failure

1. Stop FlareSolverr: `docker stop flaresolverr`
2. Run scheduler - service will fail
3. Check after 3 failures - notification should be sent
4. Check cooldown - no notification for 24 hours
5. Start FlareSolverr: `docker start flaresolverr`
6. Run scheduler - service recovers, counter resets

## Benefits

- ✅ **Early Warning** - Know immediately when services are stuck
- ✅ **No Spam** - Single daily notification per service
- ✅ **Actionable** - Shows specific failure reason
- ✅ **Auto-Recovery** - Stops alerting when fixed
- ✅ **Historical Data** - Track failure/success patterns
- ✅ **Granular Control** - Enable/disable per service

## Troubleshooting

**Not receiving notifications:**

- Check `service_monitoring.enabled` is `true`
- Check service-specific `monitor` and `notify` settings
- Verify Pushover credentials in config
- Confirm running in scheduler mode (not manual)

**Too many notifications:**

- Increase `notification_cooldown_hours`
- Increase `failure_threshold`
- Disable specific services with `notify: false`

**Service marked stuck incorrectly:**

- Increase `failure_threshold` (default: 3)
- Check if service is actually failing
- Review failure logs

**Reset service state:**

```python
from modules.service_health_monitor import ServiceHealthMonitor

monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')
```

## Future Enhancements

- Web dashboard showing service health
- Daily digest emails/notifications
- Success rate metrics
- Escalation after extended downtime
- Integration with monitoring tools (Grafana, etc.)