# Service Health Monitoring
## Overview
The Service Health Monitor tracks service failures in scheduler mode and sends Pushover notifications when services get stuck due to Cloudflare blocks, rate limiting, or other issues.
## Features
- **Scheduler-only operation** - Only monitors during unattended daemon mode
- **24-hour notification cooldown** - Prevents notification spam
- **Failure threshold** - 3 consecutive failures trigger stuck state
- **Automatic recovery detection** - Stops alerting when service recovers
- **Detailed failure tracking** - Cloudflare, rate limits, timeouts, etc.
## Configuration
Located in `config/settings.json`:
```json
{
  "service_monitoring": {
    "enabled": true,
    "notification_cooldown_hours": 24,
    "failure_threshold": 3,
    "send_recovery_notifications": false,
    "services": {
      "fastdl": {"monitor": true, "notify": true},
      "imginn": {"monitor": true, "notify": true},
      "snapchat": {"monitor": true, "notify": true},
      "toolzu": {"monitor": true, "notify": true},
      "tiktok": {"monitor": true, "notify": true},
      "forums": {"monitor": true, "notify": true}
    },
    "pushover": {
      "enabled": true,
      "priority": 0,
      "sound": "pushover"
    }
  }
}
```
### Configuration Options
**Main Settings:**
- `enabled` (boolean) - Master switch for service monitoring
- `notification_cooldown_hours` (integer) - Hours between notifications for the same service (default: 24)
- `failure_threshold` (integer) - Consecutive failures before marking as stuck (default: 3)
- `send_recovery_notifications` (boolean) - Send notification when service recovers (default: false)
**Per-Service Settings:**
- `monitor` (boolean) - Track this service's health
- `notify` (boolean) - Send notifications for this service
**Pushover Settings:**
- `enabled` (boolean) - Enable Pushover notifications
- `priority` (integer) - Notification priority (-2 to 2)
- `sound` (string) - Notification sound
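As a sketch of how these options could be read with their documented defaults applied, assuming only the standard library (`load_monitoring_config` is an illustrative name, not a function from the codebase):

```python
import json

def load_monitoring_config(path="config/settings.json"):
    """Read the service_monitoring section, applying the documented defaults
    when a key is missing from the file."""
    with open(path) as f:
        settings = json.load(f)
    cfg = settings.get("service_monitoring", {})
    return {
        "enabled": cfg.get("enabled", False),
        "notification_cooldown_hours": cfg.get("notification_cooldown_hours", 24),
        "failure_threshold": cfg.get("failure_threshold", 3),
        "send_recovery_notifications": cfg.get("send_recovery_notifications", False),
        "services": cfg.get("services", {}),
        "pushover": cfg.get("pushover", {}),
    }
```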
## How It Works
### Detection Flow
1. **Service runs** in scheduler mode
2. **Success**: `health_monitor.record_success('service_name')`
- Resets consecutive failure counter
- Marks service as healthy
- Sends recovery notification (if enabled)
3. **Failure**: `health_monitor.record_failure('service_name', 'reason')`
- Increments failure counter
- Records failure type (cloudflare, timeout, etc.)
- If failures ≥ threshold → mark as stuck
- If stuck AND cooldown expired → send alert
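The flow above can be sketched as follows. `HealthTracker` is a simplified stand-in for `ServiceHealthMonitor`: it keeps state in memory and returns booleans telling the caller whether to notify, whereas the real module also persists state and delivers Pushover messages.

```python
from datetime import datetime, timedelta

class HealthTracker:
    """Illustrative core of the detection flow (threshold + cooldown)."""

    def __init__(self, failure_threshold=3, cooldown_hours=24):
        self.failure_threshold = failure_threshold
        self.cooldown = timedelta(hours=cooldown_hours)
        self.state = {}  # service -> {"failures", "stuck", "last_notified"}

    def _entry(self, service):
        return self.state.setdefault(
            service, {"failures": 0, "stuck": False, "last_notified": None}
        )

    def record_success(self, service):
        s = self._entry(service)
        recovered = s["stuck"]          # was stuck, now healthy
        s["failures"] = 0               # reset consecutive failure counter
        s["stuck"] = False
        return recovered                # True -> caller may send recovery note

    def record_failure(self, service, reason, now=None):
        now = now or datetime.now()
        s = self._entry(service)
        s["failures"] += 1
        s["reason"] = reason
        if s["failures"] >= self.failure_threshold:
            s["stuck"] = True
            # Alert only if no notification was sent within the cooldown window
            if s["last_notified"] is None or now - s["last_notified"] >= self.cooldown:
                s["last_notified"] = now
                return True             # True -> caller should send an alert
        return False
```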
### Failure Types
- `cloudflare` / `cloudflare_challenge` - Cloudflare block detected
- `rate_limit` - HTTP 429 rate limiting
- `forbidden` - HTTP 403 access denied
- `timeout` - Connection timeout
- `authentication` - Login/auth required
- `captcha` - CAPTCHA challenge
- `blocked` - IP blocked
- `unknown` - Other errors
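A standalone classifier covering these types might look like the sketch below. The helper name and marker strings are illustrative; the shipped code classifies inline in each download method, as the pattern later in this document shows.

```python
def classify_failure(error: Exception) -> str:
    """Map an exception message to one of the failure types above.
    Order matters: more specific markers are checked first."""
    msg = str(error).lower()
    checks = [
        (("cloudflare", "cf_clearance"), "cloudflare"),
        (("captcha",), "captcha"),
        (("429", "rate limit"), "rate_limit"),
        (("403", "forbidden"), "forbidden"),
        (("timeout", "timed out"), "timeout"),
        (("login", "auth"), "authentication"),
        (("blocked",), "blocked"),
    ]
    for markers, reason in checks:
        if any(m in msg for m in markers):
            return reason
    return "unknown"
```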
### State Tracking
State stored in `/opt/media-downloader/database/service_health.json`:
```json
{
  "service_health": {
    "fastdl": {
      "status": "stuck",
      "consecutive_failures": 5,
      "last_success": "2025-10-27T14:30:00",
      "last_failure": "2025-10-28T23:30:00",
      "last_notification_sent": "2025-10-28T08:00:00",
      "failure_type": "cloudflare_challenge",
      "total_failures": 12,
      "total_successes": 145
    }
  }
}
```
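Reading and writing this file could be sketched as below. The function names are assumptions (the real persistence lives inside `ServiceHealthMonitor`); the atomic-rename trick is shown because a crash mid-write should never leave a truncated JSON file behind.

```python
import json
import os
import tempfile

STATE_PATH = "/opt/media-downloader/database/service_health.json"

def save_state(state, path=STATE_PATH):
    """Write the health state atomically: write to a temp file in the
    same directory, then rename over the target (atomic on POSIX)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"service_health": state}, f, indent=2)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise

def load_state(path=STATE_PATH):
    """Return the per-service state dict, or {} if the file is missing."""
    try:
        with open(path) as f:
            return json.load(f).get("service_health", {})
    except FileNotFoundError:
        return {}
```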
## Notification Format
**Alert Notification:**
```
⚠️ Service Alert: FastDL
Status: Stuck/Blocked
Issue: Cloudflare Challenge
Failed Since: Oct 28, 2:30 PM (5 consecutive failures)
Last successful download: 9 hours ago
Action may be required.
```
**Recovery Notification** (optional):
```
✅ Service Recovered: FastDL
Status: Healthy
Service is working again.
Recovered at: Oct 28, 11:45 PM
```
## Implementation Status
### ✅ Completed Components
1. **Core Module** - `modules/service_health_monitor.py`
- State management
- Failure/success tracking
- Notification logic
- Cooldown management
2. **Configuration** - `config/settings.json`
- service_monitoring section added
- All services configured
3. **Integration** - `media-downloader.py`
- Health monitor initialization (scheduler mode only)
- Imported ServiceHealthMonitor
4. **Example Implementation** - `download_fastdl()`
- Success tracking after completion
- Failure tracking with error classification
- Try/except wrapper pattern
### 🔄 Pending Implementation
The following download methods need success/failure tracking added:
#### Pattern to Follow
```python
def download_SERVICE(self):
    """Download content via SERVICE"""
    try:
        # ... existing download logic ...

        # Record success at end
        if self.health_monitor:
            self.health_monitor.record_success('service_name')
        return total_downloaded
    except Exception as e:
        self.logger.error(f"[Core] [ERROR] SERVICE download error: {e}")

        # Record failure with classification
        if self.health_monitor:
            error_str = str(e).lower()
            if 'cloudflare' in error_str or 'cf_clearance' in error_str:
                reason = 'cloudflare'
            elif 'timeout' in error_str:
                reason = 'timeout'
            elif '403' in error_str:
                reason = 'forbidden'
            elif '429' in error_str:
                reason = 'rate_limit'
            else:
                reason = 'unknown'
            self.health_monitor.record_failure('service_name', reason)
        raise  # Re-raise to maintain existing error handling
```
#### Methods to Update
1. **download_imginn()** (line ~1065)
- Service name: `'imginn'`
- Common errors: Cloudflare, timeouts
2. **download_toolzu()** (line ~1134)
- Service name: `'toolzu'`
- Common errors: Cloudflare, rate limits
3. **download_snapchat()** (line ~1320)
- Service name: `'snapchat'`
- Common errors: Cloudflare, timeouts
4. **download_tiktok()** (line ~1364)
- Service name: `'tiktok'`
- Common errors: Rate limits, geo-blocks
5. **download_forums()** (line ~1442)
- Service name: `'forums'`
- Common errors: Authentication, Cloudflare
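Since all five methods repeat the same wrapper, the tracking could also be factored into a decorator. This is purely a sketch, not how the codebase does it: `tracked` is an invented name, and the classification branch is abbreviated to one check.

```python
import functools

def tracked(service_name):
    """Decorator sketch: wrap a download method with the success/failure
    tracking pattern above."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            try:
                result = fn(self, *args, **kwargs)
                if getattr(self, "health_monitor", None):
                    self.health_monitor.record_success(service_name)
                return result
            except Exception as e:
                if getattr(self, "health_monitor", None):
                    # Abbreviated classification; see the full pattern above
                    msg = str(e).lower()
                    reason = "cloudflare" if "cloudflare" in msg else "unknown"
                    self.health_monitor.record_failure(service_name, reason)
                raise  # preserve existing error handling
        return wrapper
    return decorator
```

Usage would then be `@tracked('imginn')` above `download_imginn()`, and likewise for the other four methods.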
## Testing
### Manual Testing (No Monitoring)
```bash
# Manual runs don't trigger monitoring
sudo media-downloader --platform snapchat
# Health monitor inactive - no tracking
```
### Scheduler Testing (With Monitoring)
```bash
# Start scheduler (monitoring active)
sudo systemctl start media-downloader
# Check health state
cat /opt/media-downloader/database/service_health.json
# Check logs for monitoring activity
tail -f /opt/media-downloader/logs/*.log | grep "Service health"
```
### Simulate Failure
1. Stop FlareSolverr: `docker stop flaresolverr`
2. Run scheduler - service will fail
3. Check after 3 failures - notification should be sent
4. Check cooldown - no notification for 24 hours
5. Start FlareSolverr: `docker start flaresolverr`
6. Run scheduler - service recovers, counter resets
## Benefits
- **Early Warning** - Know immediately when services are stuck
- **No Spam** - Single daily notification per service
- **Actionable** - Shows specific failure reason
- **Auto-Recovery** - Stops alerting when fixed
- **Historical Data** - Track failure/success patterns
- **Granular Control** - Enable/disable per service
## Troubleshooting
**Not receiving notifications:**
- Check `service_monitoring.enabled` is `true`
- Check service-specific `monitor` and `notify` settings
- Verify Pushover credentials in config
- Confirm running in scheduler mode (not manual)
**Too many notifications:**
- Increase `notification_cooldown_hours`
- Increase `failure_threshold`
- Disable specific services with `notify: false`
**Service marked stuck incorrectly:**
- Increase `failure_threshold` (default: 3)
- Check if service is actually failing
- Review failure logs
**Reset service state:**
```python
from modules.service_health_monitor import ServiceHealthMonitor
monitor = ServiceHealthMonitor()
monitor.reset_service('fastdl')
```
## Future Enhancements
- Web dashboard showing service health
- Daily digest emails/notifications
- Success rate metrics
- Escalation after extended downtime
- Integration with monitoring tools (Grafana, etc.)