docs/SERVICE_HEALTH_MONITORING.md
# Service Health Monitoring

## Overview

The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when a service gets stuck because of Cloudflare blocks, rate limiting, or other issues.

## Features

- **Scheduler-only operation** - Only monitors during unattended daemon mode
- **24-hour notification cooldown** - Prevents notification spam
- **Failure threshold** - 3 consecutive failures trigger the stuck state
- **Automatic recovery detection** - Stops alerting when the service recovers
- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.

## Configuration

Located in `config/settings.json`:

```json
{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}
```

### Configuration Options

**Main Settings:**
- `enabled` (boolean) - Master switch for service monitoring
- `notification_cooldown_hours` (integer) - Hours between notifications for the same service (default: 24)
- `failure_threshold` (integer) - Consecutive failures before a service is marked as stuck (default: 3)
- `send_recovery_notifications` (boolean) - Send a notification when a service recovers (default: false)

**Per-Service Settings:**
- `monitor` (boolean) - Track this service's health
- `notify` (boolean) - Send notifications for this service

**Pushover Settings:**
- `enabled` (boolean) - Enable Pushover notifications
- `priority` (integer) - Notification priority (-2 to 2)
- `sound` (string) - Notification sound

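As a rough illustration of how these options combine, the sketch below fills in the documented defaults and checks whether an alert would be allowed for a given service. The helper names `load_monitoring_config` and `should_notify` are hypothetical and not part of the module. The Pushover defaults are copied from the example above rather than from documented defaults, and treating an unlisted service as unmonitored is an assumption.

```python
import json
from copy import deepcopy

# Top-level defaults follow the documented values; the pushover values are
# taken from the example config and are an assumption, not documented defaults.
DEFAULTS = {
    "enabled": True,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": False,
    "services": {},
    "pushover": {"enabled": True, "priority": 0, "sound": "pushover"},
}


def load_monitoring_config(settings_text: str) -> dict:
    """Return the service_monitoring section with defaults filled in."""
    section = json.loads(settings_text).get("service_monitoring", {})
    config = deepcopy(DEFAULTS)
    for key, value in section.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)  # shallow-merge nested sections
        else:
            config[key] = value
    return config


def should_notify(config: dict, service: str) -> bool:
    """True only when every relevant switch is on for this service."""
    svc = config["services"].get(service, {})  # unlisted service: assumed off
    return bool(
        config["enabled"]
        and config["pushover"]["enabled"]
        and svc.get("monitor", False)
        and svc.get("notify", False)
    )
```

In this sketch, a service with `"notify": false` can still be tracked but never produces an alert, which matches the split between `monitor` and `notify` described above.
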
## How It Works

### Detection Flow

1. **Service runs** in scheduler mode
2. **Success**: `health_monitor.record_success('service_name')`
   - Resets consecutive failure counter
   - Marks service as healthy
   - Sends recovery notification (if `send_recovery_notifications` is enabled)

3. **Failure**: `health_monitor.record_failure('service_name', 'reason')`
   - Increments failure counter
   - Records failure type (cloudflare, timeout, etc.)
   - If failures ≥ threshold → mark as stuck
   - If stuck AND cooldown expired → send alert

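The flow above can be sketched as a small state machine. This is a minimal illustration, not the code in `modules/service_health_monitor.py`: the class `HealthTrackerSketch` is hypothetical, alerts are collected in a list instead of being sent through Pushover, and recovery notifications are left out. Keeping the cooldown timestamp across a recovery is also an assumption.

```python
from datetime import datetime, timedelta


class HealthTrackerSketch:
    """Hypothetical stand-in for ServiceHealthMonitor; illustrates the flow only."""

    def __init__(self, failure_threshold=3, cooldown_hours=24):
        self.failure_threshold = failure_threshold
        self.cooldown = timedelta(hours=cooldown_hours)
        self.state = {}   # service name -> counters and timestamps
        self.alerts = []  # (service, reason) pairs standing in for Pushover sends

    def _entry(self, service):
        return self.state.setdefault(service, {
            "status": "healthy",
            "consecutive_failures": 0,
            "last_notification_sent": None,
        })

    def record_success(self, service):
        entry = self._entry(service)
        entry["consecutive_failures"] = 0
        entry["status"] = "healthy"

    def record_failure(self, service, reason, now=None):
        now = now or datetime.now()
        entry = self._entry(service)
        entry["consecutive_failures"] += 1
        entry["failure_type"] = reason
        if entry["consecutive_failures"] >= self.failure_threshold:
            entry["status"] = "stuck"
            last = entry["last_notification_sent"]
            # Alert only if none was sent yet or the cooldown has elapsed.
            if last is None or now - last >= self.cooldown:
                self.alerts.append((service, reason))
                entry["last_notification_sent"] = now
```

With the default values, the third consecutive failure produces the first alert, and further failures stay silent until 24 hours have passed since that alert.
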
### Failure Types

- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
- `rate_limit` - HTTP 429 rate limiting
- `forbidden` - HTTP 403 access denied
- `timeout` - Connection timeout
- `authentication` - Login/auth required
- `captcha` - CAPTCHA challenge
- `blocked` - IP blocked
- `unknown` - Other errors

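One possible way to map an exception onto these failure types is an ordered keyword table. The helper `classify_failure` and the keyword lists below are illustrative guesses, not the module's actual rules; real error messages from each service may need different keywords, and substring matching on status codes such as `403` can produce false positives.

```python
# Ordered (failure_type, keywords) rules; the first match wins.
# The keyword lists are assumptions chosen for illustration.
FAILURE_RULES = [
    ("cloudflare", ("cloudflare", "cf_clearance")),
    ("captcha", ("captcha",)),
    ("rate_limit", ("429", "rate limit", "too many requests")),
    ("forbidden", ("403", "forbidden")),
    ("timeout", ("timeout", "timed out")),
    ("authentication", ("401", "login", "unauthorized")),
    ("blocked", ("blocked", "banned")),
]


def classify_failure(error: Exception) -> str:
    """Return the first failure type whose keyword appears in the error text."""
    text = str(error).lower()
    for failure_type, keywords in FAILURE_RULES:
        if any(keyword in text for keyword in keywords):
            return failure_type
    return "unknown"
```

Keeping the rules in a list makes the precedence explicit: a Cloudflare page that also mentions a 403 status is reported as `cloudflare` because that rule is checked first.
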
### State Tracking

State stored in `/opt/media-downloader/database/service_health.json`:

```json
{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
```

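Because the scheduler can be stopped at any time, it helps if the state file is never left half-written. The sketch below shows one common approach (write to a temporary file, then rename over the target). The functions `load_state` and `save_state` are hypothetical names, and whether the actual module persists state this way is not confirmed here.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return stored state, or an empty structure if the file is missing or corrupt."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {"service_health": {}}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
        os.replace(tmp_name, path)  # replaces the target in a single step
    except BaseException:
        os.unlink(tmp_name)  # clean up the partial temp file
        raise
```

Creating the temporary file in the same directory as the target keeps the rename on one filesystem, which is what allows `os.replace` to swap the file in one step.
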
## Notification Format

**Alert Notification:**
```
⚠️ Service Alert: FastDL

Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)

Last successful download: 9 hours ago

Action may be required.
```

**Recovery Notification** (optional):
```
✅ Service Recovered: FastDL

Status: Healthy
Service is working again.

Recovered at: Oct 28, 11:45 PM
```

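For illustration, the alert text above could be assembled from a few values as follows. The functions `format_timestamp` and `build_alert` and their parameters are hypothetical, the real module may build its messages differently, and the "hours ago" value is simply rounded down.

```python
from datetime import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_timestamp(moment: datetime) -> str:
    """Render a timestamp like 'Oct 28, 2:30 PM' independent of locale."""
    hour = moment.hour % 12 or 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{MONTHS[moment.month - 1]} {moment.day}, {hour}:{moment.minute:02d} {suffix}"


def build_alert(service_label, failure_type, failed_since, failures, last_success, now):
    """Return (title, body) for an alert in the format shown above."""
    hours_ago = int((now - last_success).total_seconds() // 3600)
    issue = failure_type.replace("_", " ").title()
    title = f"⚠️ Service Alert: {service_label}"
    body = "\n".join([
        "Status: Stuck/Blocked",
        f"Issue: {issue}",
        f"Failed Since: {format_timestamp(failed_since)} ({failures} consecutive failures)",
        "",
        f"Last successful download: {hours_ago} hours ago",
        "",
        "Action may be required.",
    ])
    return title, body
```

Returning the title and body separately fits notification services that take the two as distinct fields.
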
## Implementation Status

### ✅ Completed Components

1. **Core Module** - `modules/service_health_monitor.py`
   - State management
   - Failure/success tracking
   - Notification logic
   - Cooldown management

2. **Configuration** - `config/settings.json`
   - `service_monitoring` section added
   - All services configured

3. **Integration** - `media-downloader.py`
   - Health monitor initialization (scheduler mode only)
   - Imported `ServiceHealthMonitor`

4. **Example Implementation** - `download_fastdl()`
   - Success tracking after completion
   - Failure tracking with error classification
   - Try/except wrapper pattern

### 🔄 Pending Implementation

The remaining download methods still need success/failure tracking. Each one should follow the pattern already used in `download_fastdl()`.

#### Pattern to Follow

```python
def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic (sets total_downloaded) ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')

        return total_downloaded

    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)

        raise  # Re-raise to maintain existing error handling
```

#### Methods to Update

1. **download_imginn()** (line ~1065)
   - Service name: `'imginn'`
   - Common errors: Cloudflare, timeouts

2. **download_toolzu()** (line ~1134)
   - Service name: `'toolzu'`
   - Common errors: Cloudflare, rate limits

3. **download_snapchat()** (line ~1320)
   - Service name: `'snapchat'`
   - Common errors: Cloudflare, timeouts

4. **download_tiktok()** (line ~1364)
   - Service name: `'tiktok'`
   - Common errors: Rate limits, geo-blocks

5. **download_forums()** (line ~1442)
   - Service name: `'forums'`
   - Common errors: Authentication, Cloudflare

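Since the same try/except wrapper has to be repeated in five methods, one alternative is to express it once as a decorator. This is only a sketch of that idea: `track_health` is a hypothetical name, the project currently uses the inline pattern, and the default classifier here simply reports `'unknown'`.

```python
import functools


def track_health(service_name, classify=lambda exc: "unknown"):
    """Hypothetical decorator applying the success/failure pattern to a method."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # health_monitor is None outside scheduler mode, so tracking is skipped.
            monitor = getattr(self, "health_monitor", None)
            try:
                result = method(self, *args, **kwargs)
            except Exception as exc:
                if monitor is not None:
                    monitor.record_failure(service_name, classify(exc))
                raise  # preserve the caller's existing error handling
            if monitor is not None:
                monitor.record_success(service_name)
            return result
        return wrapper
    return decorator
```

A method would then be declared as `@track_health('imginn')` above `def download_imginn(self): ...`, with an error classifier passed through the `classify` argument where needed.
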
## Testing

### Manual Testing (No Monitoring)

```bash
# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking
```

### Scheduler Testing (With Monitoring)

```bash
# Start scheduler (monitoring active)
sudo systemctl start media-downloader

# Check health state
cat /opt/media-downloader/database/service_health.json

# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"
```

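When the state file grows to cover several services, a short script can be easier to read than raw `cat` output. The helper below is a hypothetical convenience, not part of the project. It assumes the state file layout shown under State Tracking and lists only the services currently marked as stuck.

```python
import json


def stuck_services(state_text: str) -> list:
    """Return (name, consecutive_failures, failure_type) for each stuck service."""
    health = json.loads(state_text).get("service_health", {})
    return sorted(
        (name, entry.get("consecutive_failures", 0), entry.get("failure_type", "unknown"))
        for name, entry in health.items()
        if entry.get("status") == "stuck"
    )
```

It takes the file contents as a string, so it can be fed from `Path('/opt/media-downloader/database/service_health.json').read_text()`.
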
### Simulate Failure

1. Stop FlareSolverr: `docker stop flaresolverr`
2. Run the scheduler - the service will fail
3. Check after 3 consecutive failures - a notification should be sent
4. Check the cooldown - no further notification for 24 hours
5. Start FlareSolverr: `docker start flaresolverr`
6. Run the scheduler - the service recovers and the failure counter resets

## Benefits

- ✅ **Early Warning** - Know when a service is stuck as soon as the failure threshold is reached
- ✅ **No Spam** - At most one notification per service per cooldown period (24 hours by default)
- ✅ **Actionable** - Shows the specific failure reason
- ✅ **Auto-Recovery** - Stops alerting when the service is fixed
- ✅ **Historical Data** - Track failure/success patterns
- ✅ **Granular Control** - Enable/disable per service

## Troubleshooting

**Not receiving notifications:**
- Check `service_monitoring.enabled` is `true`
- Check service-specific `monitor` and `notify` settings
- Verify Pushover credentials in config
- Confirm running in scheduler mode (not manual)

**Too many notifications:**
- Increase `notification_cooldown_hours`
- Increase `failure_threshold`
- Disable specific services with `notify: false`

**Service marked stuck incorrectly:**
- Increase `failure_threshold` (default: 3)
- Check if service is actually failing
- Review failure logs

**Reset service state:**
```python
from modules.service_health_monitor import ServiceHealthMonitor
monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')
```

## Future Enhancements

- Web dashboard showing service health
- Daily digest emails/notifications
- Success rate metrics
- Escalation after extended downtime
- Integration with monitoring tools (Grafana, etc.)