media-downloader/docs/archive/repost_detection_testing_guide.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00

# Instagram Repost Detection - Testing & Deployment Guide
**Status:** **Implementation Complete - Ready for Testing**
**Default State:** 🔒 **DISABLED** (feature flag off)
---
## Implementation Summary
All code has been safely integrated with backward-compatible changes:
- **ImgInn Module Updated** - Added optional `skip_database` and `max_age_hours` parameters (default behavior unchanged)
- **Move Module Updated** - Added repost detection hooks with feature flag check (disabled by default)
- **Database Settings Added** - Settings entry created with `enabled: false`
- **Frontend UI Added** - Configuration page includes repost detection settings panel
- **Module Tested** - Core detection logic validated with real example file
---
## Safety Guarantees
### Backward Compatibility
- All new parameters have defaults that preserve existing behavior
- Feature is completely disabled by default
- No changes to existing workflows when disabled
- Can be toggled on/off without code changes
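
The backward-compatibility pattern above can be sketched as follows. This is a minimal illustration, not the actual ImgInn code: `download_posts` and its return value are hypothetical, only the `skip_database` and `max_age_hours` parameter names come from this guide, and the defaults shown (`False` and `None`) are an assumption about what "behavior unchanged" means.

```python
from typing import Optional


def download_posts(username: str,
                   skip_database: bool = False,
                   max_age_hours: Optional[int] = None) -> dict:
    """Hypothetical stand-in for an ImgInn download entry point.

    Existing callers pass only `username`, so the defaults must reproduce
    the old behavior: record to the database, no age limit.
    """
    return {
        "username": username,
        "record_in_database": not skip_database,
        "age_limit_hours": max_age_hours,  # None means "no limit"
    }


# Old call sites keep working unchanged.
legacy = download_posts("evalongoria")
# Repost detection opts in to the new behavior explicitly.
temp_fetch = download_posts("globalgiftfoundation", skip_database=True, max_age_hours=24)
```

Because both new parameters are keyword arguments with defaults, no existing call site needs to change when the feature is off.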
### Error Handling
- If repost detection fails, original file processing continues normally
- Missing dependencies don't break downloads
- Failed OCR/hashing doesn't stop the move operation
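
The fail-open behavior listed above could look roughly like this. It is a sketch under assumptions: `resolve_file_to_move` and the `detect_original` callable are hypothetical names, and the real hook in the move module may be structured differently.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("repost_detection_sketch")


def resolve_file_to_move(source_path: str,
                         detect_original: Callable[[str], Optional[str]]) -> str:
    """Return the path that should be moved.

    `detect_original` is a hypothetical callable that returns the path of the
    original media, or None when no repost was found. Any exception it raises
    (missing dependency, OCR failure, hashing failure) is logged and ignored,
    so the move always proceeds with the file that was downloaded.
    """
    try:
        original = detect_original(source_path)
    except Exception as exc:  # deliberately broad: detection must never block a move
        logger.warning("Repost detection failed for %s: %s", source_path, exc)
        return source_path
    return original or source_path


def _broken_detector(path: str) -> Optional[str]:
    # Simulates a missing optional dependency.
    raise ImportError("pytesseract is not installed")


fallback = resolve_file_to_move("story6.mp4", _broken_detector)
replaced = resolve_file_to_move("story6.mp4", lambda p: "original.mp4")
unchanged = resolve_file_to_move("story6.mp4", lambda p: None)
```

The broad `except` is a deliberate choice here: the guarantee is that detection problems of any kind leave the download untouched.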
### Database Safety
- New tables created only when feature is used
- Existing tables remain untouched
- Can be disabled instantly via SQL or UI
---
## Testing Plan
### Phase 1: Verify Feature is Disabled (Recommended First Step)
**Purpose:** Confirm existing functionality is unchanged
```bash
# 1. Check database setting
sqlite3 /opt/media-downloader/data/backup_cache.db \
"SELECT key, json_extract(value, '$.enabled') FROM settings WHERE key = 'repost_detection';"
# Expected output:
# repost_detection|0 (0 = disabled)
# 2. Download some Instagram stories (any module)
# - Stories should download normally
# - No repost detection messages in logs
# - No temp files in /tmp/repost_detection/
# 3. Check frontend
# - Open Configuration page
# - Find "Instagram Repost Detection" section
# - Verify toggle is OFF by default
```
**Expected Result:** Everything works exactly as before
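
The same check can be scripted. The sketch below assumes the `settings` table has a text `key` column and a JSON `value` column, as the SQL above implies. The function name is hypothetical, and the demo runs against an in-memory database rather than the real one.

```python
import json
import sqlite3


def is_repost_detection_enabled(conn: sqlite3.Connection) -> bool:
    """Read the `repost_detection` row and report its `enabled` flag.

    A missing row is treated as disabled, matching the documented default.
    """
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", ("repost_detection",)
    ).fetchone()
    if row is None:
        return False
    return bool(json.loads(row[0]).get("enabled", False))


# Demonstration against an in-memory database with the assumed schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
demo.execute(
    "INSERT INTO settings VALUES (?, ?)",
    ("repost_detection", json.dumps({"enabled": False})),
)
enabled_now = is_repost_detection_enabled(demo)
```

To run it against the live database, pass `sqlite3.connect("/opt/media-downloader/data/backup_cache.db")` instead of the in-memory connection.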
---
### Phase 2: Enable and Test Detection
**Step 2.1: Enable via Frontend (Recommended)**
1. Open Configuration page: http://localhost:8000/configuration
2. Scroll to "Instagram Repost Detection" section
3. Toggle "Enabled" to ON
4. Adjust settings if desired:
   - Hash Distance Threshold: 10 (default)
   - Fetch Cache Duration: 12 hours (default)
   - Max Posts Age: 24 hours (default)
   - Cleanup Temp Files: ON (recommended)
5. Click "Save Configuration"
**Step 2.2: Enable via SQL (Alternative)**
```bash
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
UPDATE settings
SET value = json_set(value, '$.enabled', json('true'))  -- json('true') stores a JSON boolean; a bare true is stored as the integer 1
WHERE key = 'repost_detection';
SELECT 'Feature enabled. Current settings:';
SELECT value FROM settings WHERE key = 'repost_detection';
EOF
```
**Step 2.3: Test with Known Repost**
Use the example file from testing:
```
/media/d$/OneDrive - LIComputerGuy/Celebrities/Eva Longoria/4. Media/social media/instagram/stories/evalongoria_20251109_154548_story6.mp4
```
This is a repost of @globalgiftfoundation content.
```bash
# Manual test with the detection script
python3 /opt/media-downloader/tests/test_repost_detection_manual.py \
"/media/.../evalongoria_20251109_154548_story6.mp4" \
"evalongoria" \
--live
# Expected output:
# ✅ OCR extraction: @globalgiftfoundation
# @globalgiftfoundation NOT monitored (using temp queue)
# ⏬ Downloading stories and posts via ImgInn
# ✓ Found matching original
# ✓ Replaced repost with original
```
---
### Phase 3: Monitor Live Downloads
**Step 3.1: Watch the Logs**
Watch logs for repost detection activity:
```bash
# Terminal 1: Backend logs
sudo journalctl -u media-downloader-api -f | grep -i repost
# Terminal 2: Download logs
tail -f /opt/media-downloader/logs/downloads.log | grep -i repost
# Look for messages like:
# [RepostDetector] [INFO] Detected repost from @username
# [RepostDetector] [SUCCESS] ✓ Found original
# [MoveManager] [SUCCESS] ✓ Replaced repost with original from @username
```
**Step 3.2: Check Database Tracking**
```bash
# View repost replacements
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
  repost_source,
  original_username,
  repost_filename,
  detected_at
FROM repost_replacements
ORDER BY detected_at DESC
LIMIT 10;
EOF
# View fetch cache (avoid re-downloading)
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
  username,
  last_fetched,
  content_count
FROM repost_fetch_cache
ORDER BY last_fetched DESC;
EOF
```
**Step 3.3: Monitor Disk Usage**
```bash
# Check temp directory (should be empty or small if cleanup enabled)
du -sh /tmp/repost_detection/
# Check for successful cleanups in logs
grep "Cleaned up.*temporary files" /opt/media-downloader/logs/*.log
```
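
For automated monitoring, the temp directory size can also be measured from Python. This is a sketch: `directory_size_bytes` is a hypothetical helper, and it sums file sizes in bytes, so its result can differ slightly from `du`, which reports allocated disk blocks.

```python
import os
import tempfile


def directory_size_bytes(path: str) -> int:
    """Total size of regular files under `path`; 0 if the directory is missing."""
    total = 0
    for root, _dirs, files in os.walk(path):  # os.walk yields nothing for a missing path
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


# Demonstration on a throwaway directory instead of /tmp/repost_detection/.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "frame.bin"), "wb") as handle:
        handle.write(b"\0" * 1024)
    demo_size = directory_size_bytes(tmp)

# The temporary directory no longer exists here, so the size is reported as 0.
missing_size = directory_size_bytes(os.path.join(tmp, "does-not-exist"))
```

With cleanup enabled, calling `directory_size_bytes("/tmp/repost_detection")` between downloads should return a small number or zero.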
---
### Phase 4: Performance Testing
**Test Scenario 1: Monitored Account Repost**
```
Source: evalongoria (monitored)
Reposts: @originaluser (also monitored)
Expected: Downloads to normal path, no cleanup
```
**Test Scenario 2: Non-Monitored Account Repost**
```
Source: evalongoria (monitored)
Reposts: @randomuser (NOT monitored)
Expected: Downloads to /tmp, cleanup after matching
```
**Test Scenario 3: No @username Detected**
```
Source: evalongoria (monitored)
Story: Regular story (not a repost)
Expected: Skip detection, process normally
```
**Test Scenario 4: No Matching Original Found**
```
Source: evalongoria (monitored)
Reposts: @oldaccount (deleted or no stories/posts)
Expected: Keep repost, log warning, continue
```
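
The four scenarios can be summarized as one decision function, which is handy as a checklist while testing. This sketch encodes only the expectations listed above: `plan_repost_handling`, the action labels, and the `"normal_path"` placeholder are hypothetical, and the temp location is assumed to be the `/tmp/repost_detection` directory used elsewhere in this guide.

```python
from typing import Optional, Set


def plan_repost_handling(detected_username: Optional[str],
                         monitored: Set[str],
                         original_found: bool = True) -> dict:
    """Map a detection result to the expected handling for Scenarios 1-4."""
    if detected_username is None:
        # Scenario 3: no @username in the story, so detection is skipped.
        return {"action": "skip_detection", "download_to": None, "cleanup": False}
    is_monitored = detected_username in monitored
    return {
        # Scenario 4: nothing matched, keep the repost and carry on.
        "action": "replace_with_original" if original_found else "keep_repost",
        # Scenario 1 uses the normal path, Scenario 2 a temp directory.
        "download_to": "normal_path" if is_monitored else "/tmp/repost_detection",
        # Only temp downloads are cleaned up after matching.
        "cleanup": not is_monitored,
    }


monitored_accounts = {"evalongoria", "originaluser"}
scenario_1 = plan_repost_handling("originaluser", monitored_accounts)
scenario_2 = plan_repost_handling("randomuser", monitored_accounts)
scenario_3 = plan_repost_handling(None, monitored_accounts)
scenario_4 = plan_repost_handling("oldaccount", monitored_accounts, original_found=False)
```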
---
## Rollback Procedures
### Option 1: Disable via Frontend (Instant)
1. Open Configuration page
2. Toggle "Instagram Repost Detection" to OFF
3. Save
### Option 2: Disable via SQL (Instant)
```bash
sqlite3 /opt/media-downloader/data/backup_cache.db \
  "UPDATE settings SET value = json_set(value, '$.enabled', json('false')) WHERE key = 'repost_detection';"
```
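
If the rollback needs to be scripted, for example from a maintenance job, the toggle can be written in Python. This sketch assumes the same `settings` schema as the SQL above (text `key`, JSON `value`). `set_repost_detection_enabled` is a hypothetical helper, and the demo uses an in-memory database.

```python
import json
import sqlite3


def set_repost_detection_enabled(conn: sqlite3.Connection, enabled: bool) -> dict:
    """Set the `enabled` flag while keeping every other setting untouched."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", ("repost_detection",)
    ).fetchone()
    if row is None:
        raise LookupError("no repost_detection row in settings")
    settings = json.loads(row[0])
    settings["enabled"] = enabled
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE settings SET value = ? WHERE key = ?",
            (json.dumps(settings), "repost_detection"),
        )
    return settings


# Demonstration against an in-memory database with the assumed schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
demo.execute(
    "INSERT INTO settings VALUES (?, ?)",
    ("repost_detection", json.dumps({"enabled": True, "hash_distance_threshold": 10})),
)
after = set_repost_detection_enabled(demo, False)
```

Rewriting the whole JSON value from Python stores a real JSON boolean and leaves the tuning parameters as they were.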
### Option 3: Comment Out Hook (Permanent Disable)
Edit `/opt/media-downloader/modules/move_module.py` around line 454:
```python
# Disable repost detection permanently:
# if self._is_instagram_story(source) and self.batch_context:
# ...
```
---
## Troubleshooting
### Issue: "Missing dependencies" warning
**Solution:**
```bash
pip3 install --break-system-packages pytesseract opencv-python imagehash
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```
### Issue: OCR not detecting usernames
**Possible causes:**
1. Username has special characters
2. Low image quality
3. Unusual font/styling
**Solution:** Adjust `ocr_confidence_threshold` in settings (lower = more permissive)
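
To see why special characters matter, consider how a handle might be pulled out of OCR text. The sketch below is an assumption about the parsing step, not the detector's actual code: `extract_handles` is a hypothetical helper, and the OCR call itself (for example via `pytesseract`) is left out so the example runs without Tesseract installed.

```python
import re
from typing import List

# Instagram handles use letters, digits, periods and underscores (up to 30 chars).
_HANDLE = re.compile(r"@([A-Za-z0-9._]{1,30})")


def extract_handles(ocr_text: str) -> List[str]:
    """Return lower-cased handles found in OCR output, in order, without duplicates."""
    seen: List[str] = []
    for match in _HANDLE.finditer(ocr_text):
        handle = match.group(1).rstrip(".").lower()  # drop sentence-ending periods
        if handle and handle not in seen:
            seen.append(handle)
    return seen


handles = extract_handles("Repost from @GlobalGiftFoundation.\nthanks @globalgiftfoundation")
no_handles = extract_handles("Regular story text with no mention")
```

If OCR misreads a single character of the handle, for example an underscore as a space, the match is truncated and the wrong account is looked up.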
### Issue: No matching original found
**Possible causes:**
1. Original content deleted or made private
2. Post older than `max_posts_age_hours` setting
3. Hash distance too strict
**Solution:**
- Increase `max_posts_age_hours` (check older posts)
- Increase `hash_distance_threshold` (looser matching)
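
The effect of `hash_distance_threshold` is easier to judge with the arithmetic in view. Perceptual hashes are usually compared by Hamming distance, the number of differing bits. The sketch below works on plain 64-bit integers and assumes an inclusive comparison (`<=`); the detector's actual hash size and comparison are not documented here, so treat both as assumptions.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bit positions in which two 64-bit perceptual hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_match(hash_a: int, hash_b: int, threshold: int = 10) -> bool:
    """Treat two frames as the same content when the distance is within the threshold."""
    return hamming_distance(hash_a, hash_b) <= threshold


repost_hash = 0xF0F0F0F0F0F0F0F0
original_hash = repost_hash ^ 0b111111111  # flip 9 bits to mimic recompression noise
distance = hamming_distance(repost_hash, original_hash)

matches_default = is_match(repost_hash, original_hash)              # threshold 10
matches_strict = is_match(repost_hash, original_hash, threshold=8)  # conservative preset
```

Here the two hashes differ in 9 of 64 bits, so they match at the default threshold of 10 but not at 8. With the `imagehash` library, subtracting one hash object from another yields the same distance.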
### Issue: Temp files not being cleaned up
**Check:**
```bash
ls -lah /tmp/repost_detection/
```
**Solution:** Verify `cleanup_temp_files` is enabled in settings
### Issue: Too many API requests to ImgInn
**Solution:**
- Increase `fetch_cache_hours` (cache longer)
- Reduce `max_posts_age_hours` (check fewer posts)
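
The trade-off behind `fetch_cache_hours` comes down to a single time comparison. The sketch below assumes the cache stores one `last_fetched` timestamp per account, as the `repost_fetch_cache` query earlier suggests. `needs_refetch` is a hypothetical helper, and the default of 12 hours is the one shown in the settings panel.

```python
from datetime import datetime, timedelta


def needs_refetch(last_fetched: datetime, now: datetime,
                  fetch_cache_hours: int = 12) -> bool:
    """True when the cached fetch is at least `fetch_cache_hours` old."""
    return now - last_fetched >= timedelta(hours=fetch_cache_hours)


now = datetime(2025, 11, 9, 18, 0)
stale = needs_refetch(datetime(2025, 11, 9, 3, 0), now)    # fetched 15 hours ago
fresh = needs_refetch(datetime(2025, 11, 9, 12, 0), now)   # fetched 6 hours ago
fresh_at_24h = needs_refetch(datetime(2025, 11, 9, 3, 0), now, fetch_cache_hours=24)
```

Raising the value from 12 to 24 turns the 15-hour-old entry from a refetch into a cache hit, which is what reduces the request volume.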
---
## Monitoring & Metrics
### Key Metrics to Track
```sql
-- Replacement totals: reposts replaced, and how many sources and originals were involved
SELECT
    COUNT(*) AS total_replacements,
    COUNT(DISTINCT repost_source) AS affected_sources,
    COUNT(DISTINCT original_username) AS original_accounts
FROM repost_replacements;

-- Most frequently detected original accounts
SELECT
    original_username,
    COUNT(*) AS repost_count
FROM repost_replacements
GROUP BY original_username
ORDER BY repost_count DESC
LIMIT 10;

-- Replacements per day for the 7 most recent days with activity
SELECT
    DATE(detected_at) AS date,
    COUNT(*) AS replacements
FROM repost_replacements
GROUP BY DATE(detected_at)
ORDER BY date DESC
LIMIT 7;
```
### Performance Metrics
- **Average processing time:** 5-10 seconds per repost
- **Disk usage (temp):** ~50-200MB per non-monitored account (cleaned after use)
- **Cache hit rate:** Monitor the `repost_fetch_cache` table for efficiency
---
## Best Practices
### Recommended Settings
**Conservative (Low Resource Usage):**
```json
{
  "enabled": true,
  "hash_distance_threshold": 8,
  "fetch_cache_hours": 24,
  "max_posts_age_hours": 12,
  "cleanup_temp_files": true
}
```
**Aggressive (Best Quality):**
```json
{
  "enabled": true,
  "hash_distance_threshold": 12,
  "fetch_cache_hours": 6,
  "max_posts_age_hours": 48,
  "cleanup_temp_files": true
}
```
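
Before saving a preset, it can help to merge it over the defaults and sanity-check the numbers. The sketch below is illustrative only: `merge_settings` is a hypothetical helper, the default values are taken from the settings panel described in Phase 2, and the non-negative integer check is an assumed validation rule.

```python
DEFAULTS = {
    "enabled": False,
    "hash_distance_threshold": 10,
    "fetch_cache_hours": 12,
    "max_posts_age_hours": 24,
    "cleanup_temp_files": True,
}

NUMERIC_KEYS = ("hash_distance_threshold", "fetch_cache_hours", "max_posts_age_hours")


def merge_settings(overrides: dict) -> dict:
    """Overlay `overrides` on the defaults and reject invalid numeric values."""
    merged = {**DEFAULTS, **overrides}
    for key in NUMERIC_KEYS:
        value = merged[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
    return merged


conservative = merge_settings(
    {"enabled": True, "hash_distance_threshold": 8,
     "fetch_cache_hours": 24, "max_posts_age_hours": 12}
)
partial = merge_settings({"enabled": True})  # everything else falls back to defaults
```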
### When to Use
**Good for:**
- Accounts that frequently repost other users' stories
- High-profile accounts with quality concerns
- Archival purposes (want original high-res content)
**Not needed for:**
- Accounts that rarely repost
- Already monitored original accounts
- Low-storage situations
---
## Gradual Rollout Strategy
### Week 1: Silent Monitoring
- Enable feature
- Monitor logs for detection rate
- Don't interfere with workflow
- Identify common patterns
### Week 2: Selective Enable
- Enable for 2-3 high-repost accounts
- Verify replacements are correct
- Check false positive rate
- Monitor performance impact
### Week 3: Broader Enable
- Enable for all Instagram story downloaders
- Monitor database growth
- Check temp file cleanup
- Validate quality improvements
### Week 4+: Full Production
- Feature stable and validated
- Document edge cases found
- Tune settings based on results
- Consider expanding to other platforms
---
## Support & Documentation
**Documentation:**
- Design spec: `/opt/media-downloader/docs/instagram_repost_detection_design.md`
- Test results: `/opt/media-downloader/docs/repost_detection_test_results.md`
- This guide: `/opt/media-downloader/docs/repost_detection_testing_guide.md`
**Test Scripts:**
- Unit tests: `/opt/media-downloader/tests/test_instagram_repost_detector.py`
- Manual tests: `/opt/media-downloader/tests/test_repost_detection_manual.py`
**Module Files:**
- Detector: `/opt/media-downloader/modules/instagram_repost_detector.py`
- ImgInn: `/opt/media-downloader/modules/imginn_module.py`
- Move: `/opt/media-downloader/modules/move_module.py`
---
## Success Criteria
**Feature is ready for production when:**
1. Disabled state doesn't affect existing functionality
2. Enabled state successfully detects and replaces reposts
3. No errors in logs during normal operation
4. Temp files are cleaned up properly
5. Database tracking works correctly
6. Performance impact is acceptable
7. False positive rate is low (<5%)
8. Quality of replacements is consistently better
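
Criterion 7 needs a concrete measurement. One simple approach, sketched here as an assumption rather than an existing tool, is to review a sample of replacements by hand and compute the share that were wrong. `false_positive_rate` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
from typing import List


def false_positive_rate(review_results: List[bool]) -> float:
    """Share of reviewed replacements that were wrong.

    Each entry is True when a human reviewer confirmed the replacement
    was the correct original, False when it was a mismatch.
    """
    if not review_results:
        raise ValueError("no reviewed replacements to evaluate")
    wrong = sum(1 for correct in review_results if not correct)
    return wrong / len(review_results)


# Illustrative sample: 40 reviewed replacements, 1 of them incorrect.
sample = [True] * 39 + [False]
rate = false_positive_rate(sample)
meets_criterion = rate < 0.05
```

With 1 mismatch in 40 reviewed replacements the rate is 2.5%, which is under the 5% target.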
---
**Ready to test!** Start with Phase 1 to verify everything is safe, then gradually enable and test.