# Instagram Repost Detection - Testing & Deployment Guide

**Status:** ✅ **Implementation Complete - Ready for Testing**
**Default State:** 🔒 **DISABLED** (feature flag off)

---

## Implementation Summary

All code has been safely integrated with backward-compatible changes:

- ✅ **ImgInn Module Updated** - Added optional `skip_database` and `max_age_hours` parameters (default behavior unchanged)
- ✅ **Move Module Updated** - Added repost detection hooks with feature flag check (disabled by default)
- ✅ **Database Settings Added** - Settings entry created with `enabled: false`
- ✅ **Frontend UI Added** - Configuration page includes repost detection settings panel
- ✅ **Module Tested** - Core detection logic validated with real example file

---

## Safety Guarantees

### Backward Compatibility

- All new parameters have defaults that preserve existing behavior
- Feature is completely disabled by default
- No changes to existing workflows when disabled
- Can be toggled on/off without code changes

### Error Handling

- If repost detection fails, original file processing continues normally
- Missing dependencies don't break downloads
- Failed OCR/hashing doesn't stop the move operation

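
The fail-open behavior listed above can be sketched as follows. This is a minimal illustration, not the project's code: `process_with_repost_check`, `detect_fn`, and the logger name are hypothetical names introduced here.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("repost_sketch")  # hypothetical logger name


def process_with_repost_check(
    file_path: str,
    detect_fn: Callable[[str], Optional[str]],
) -> str:
    """Return the path to move: the original if one was found, else file_path.

    detect_fn is a hypothetical detector that returns the path of the
    original file, or None when no original was found.
    """
    try:
        replacement = detect_fn(file_path)
    except Exception as exc:  # fail open: any detector error is non-fatal
        logger.warning("Repost detection failed for %s: %s", file_path, exc)
        return file_path
    return replacement or file_path
```

Whatever the detector does (raise, return nothing, or return a replacement), the caller always gets a usable path back, so the move step never depends on detection succeeding.
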

### Database Safety

- New tables created only when feature is used
- Existing tables remain untouched
- Can be disabled instantly via SQL or UI

---

## Testing Plan

### Phase 1: Verify Feature is Disabled (Recommended First Step)

**Purpose:** Confirm existing functionality is unchanged

```bash
# 1. Check database setting
sqlite3 /opt/media-downloader/data/backup_cache.db \
  "SELECT key, json_extract(value, '$.enabled') FROM settings WHERE key = 'repost_detection';"

# Expected output:
# repost_detection|0  (0 = disabled)

# 2. Download some Instagram stories (any module)
#    - Stories should download normally
#    - No repost detection messages in logs
#    - No temp files in /tmp/repost_detection/

# 3. Check frontend
#    - Open Configuration page
#    - Find "Instagram Repost Detection" section
#    - Verify toggle is OFF by default
```
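
Check 1 can also be scripted. The sketch below is a minimal Python equivalent, assuming (as the query above implies) that the `settings` table has `key` and `value` columns and that `value` holds JSON text. The function name `repost_detection_enabled` is hypothetical.

```python
import json
import sqlite3


def repost_detection_enabled(conn: sqlite3.Connection) -> bool:
    """Return True only if the 'repost_detection' row exists and is enabled."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", ("repost_detection",)
    ).fetchone()
    if row is None:
        return False  # no settings row: treat the feature as disabled
    # Decode the JSON in Python so the check does not rely on SQLite JSON functions.
    return bool(json.loads(row[0]).get("enabled", False))
```
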

**Expected Result:** Everything works exactly as before

---

### Phase 2: Enable and Test Detection

**Step 2.1: Enable via Frontend (Recommended)**

1. Open Configuration page: http://localhost:8000/configuration
2. Scroll to "Instagram Repost Detection" section
3. Toggle "Enabled" to ON
4. Adjust settings if desired:
   - Hash Distance Threshold: 10 (default)
   - Fetch Cache Duration: 12 hours (default)
   - Max Posts Age: 24 hours (default)
   - Cleanup Temp Files: ON (recommended)
5. Click "Save Configuration"

**Step 2.2: Enable via SQL (Alternative)**

```bash
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
-- json('true') stores a JSON boolean; a bare true would be stored as the integer 1
UPDATE settings
SET value = json_set(value, '$.enabled', json('true'))
WHERE key = 'repost_detection';

SELECT 'Feature enabled. Current settings:';
SELECT value FROM settings WHERE key = 'repost_detection';
EOF
```

**Step 2.3: Test with Known Repost**

Use the example file from testing:
```
/media/d$/OneDrive - LIComputerGuy/Celebrities/Eva Longoria/4. Media/social media/instagram/stories/evalongoria_20251109_154548_story6.mp4
```

This is a repost of @globalgiftfoundation content.

```bash
# Manual test with the detection script
python3 /opt/media-downloader/tests/test_repost_detection_manual.py \
  "/media/.../evalongoria_20251109_154548_story6.mp4" \
  "evalongoria" \
  --live

# Expected output:
# ✅ OCR extraction: @globalgiftfoundation
# ℹ️ @globalgiftfoundation NOT monitored (using temp queue)
# ⏬ Downloading stories and posts via ImgInn
# ✓ Found matching original
# ✓ Replaced repost with original
```

---

### Phase 3: Monitor Live Downloads

**Step 3.1: Enable Logging**

Watch logs for repost detection activity:
```bash
# Terminal 1: Backend logs
sudo journalctl -u media-downloader-api -f | grep -i repost

# Terminal 2: Download logs
tail -f /opt/media-downloader/logs/downloads.log | grep -i repost

# Look for messages like:
# [RepostDetector] [INFO] Detected repost from @username
# [RepostDetector] [SUCCESS] ✓ Found original
# [MoveManager] [SUCCESS] ✓ Replaced repost with original from @username
```

**Step 3.2: Check Database Tracking**

```bash
# View repost replacements
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
    repost_source,
    original_username,
    repost_filename,
    detected_at
FROM repost_replacements
ORDER BY detected_at DESC
LIMIT 10;
EOF

# View fetch cache (avoid re-downloading)
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
    username,
    last_fetched,
    content_count
FROM repost_fetch_cache
ORDER BY last_fetched DESC;
EOF
```

**Step 3.3: Monitor Disk Usage**

```bash
# Check temp directory (should be empty or small if cleanup enabled)
du -sh /tmp/repost_detection/

# Check for successful cleanups in logs
grep "Cleaned up.*temporary files" /opt/media-downloader/logs/*.log
```

---

### Phase 4: Performance Testing

**Test Scenario 1: Monitored Account Repost**

```
Source: evalongoria (monitored)
Reposts: @originaluser (also monitored)
Expected: Downloads to normal path, no cleanup
```

**Test Scenario 2: Non-Monitored Account Repost**

```
Source: evalongoria (monitored)
Reposts: @randomuser (NOT monitored)
Expected: Downloads to /tmp, cleanup after matching
```

**Test Scenario 3: No @username Detected**

```
Source: evalongoria (monitored)
Story: Regular story (not a repost)
Expected: Skip detection, process normally
```

**Test Scenario 4: No Matching Original Found**

```
Source: evalongoria (monitored)
Reposts: @oldaccount (deleted or no stories/posts)
Expected: Keep repost, log warning, continue
```
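
The routing decision behind Scenarios 1 to 3 can be written down as a small function, which is handy when turning these scenarios into automated tests. This is a sketch under assumptions: `plan_original_fetch` and its return labels are hypothetical, and `monitored_accounts` is assumed to hold lowercase usernames. Scenario 4 arises later, after the fetch, when no candidate matches and the repost is kept.

```python
from typing import Optional, Set


def plan_original_fetch(
    credited_username: Optional[str],
    monitored_accounts: Set[str],
) -> str:
    """Map a detected @username to one of the outcomes in Scenarios 1-3."""
    if not credited_username:
        return "skip"  # Scenario 3: no @username detected, process normally
    if credited_username.lower() in monitored_accounts:
        return "normal_path"  # Scenario 1: download to normal path, no cleanup
    return "temp_path"  # Scenario 2: download to /tmp, clean up after matching
```
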

---

## Rollback Procedures

### Option 1: Disable via Frontend (Instant)
1. Open Configuration page
2. Toggle "Instagram Repost Detection" to OFF
3. Save

### Option 2: Disable via SQL (Instant)
```bash
# json('false') keeps "enabled" a JSON boolean rather than the integer 0
sqlite3 /opt/media-downloader/data/backup_cache.db \
  "UPDATE settings SET value = json_set(value, '$.enabled', json('false')) WHERE key = 'repost_detection';"
```
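
For a scripted rollback, the same toggle can be done from Python. The sketch below assumes the `settings` table layout used by the SQL above (`key` and `value` columns, JSON text in `value`). `set_repost_detection` is a hypothetical helper, not part of the codebase.

```python
import json
import sqlite3


def set_repost_detection(conn: sqlite3.Connection, enabled: bool) -> None:
    """Rewrite only the 'enabled' key, leaving the other settings untouched."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = 'repost_detection'"
    ).fetchone()
    if row is None:
        raise LookupError("settings row 'repost_detection' not found")
    settings = json.loads(row[0])
    settings["enabled"] = bool(enabled)  # serialized as JSON true/false
    conn.execute(
        "UPDATE settings SET value = ? WHERE key = 'repost_detection'",
        (json.dumps(settings),),
    )
    conn.commit()
```

Calling `set_repost_detection(conn, False)` leaves keys such as `hash_distance_threshold` unchanged, so re-enabling later restores the previous tuning.
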

### Option 3: Comment Out Hook (Permanent Disable)
Edit `/opt/media-downloader/modules/move_module.py` around line 454:
```python
# Disable repost detection permanently:
# if self._is_instagram_story(source) and self.batch_context:
#     ...
```

---

## Troubleshooting

### Issue: "Missing dependencies" warning

**Solution:**
```bash
pip3 install --break-system-packages pytesseract opencv-python imagehash
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### Issue: OCR not detecting usernames

**Possible causes:**
1. Username has special characters
2. Low image quality
3. Unusual font/styling

**Solution:** Adjust `ocr_confidence_threshold` in settings (lower = more permissive)

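
To see why a lower threshold is more permissive, consider this sketch. It assumes the OCR step yields `(text, confidence)` pairs with confidence on a 0-100 scale (pytesseract's `image_to_data` reports a per-word `conf` of this kind). `extract_handles` and `HANDLE_RE` are hypothetical, and how the detector actually applies `ocr_confidence_threshold` is not shown in this guide.

```python
import re
from typing import Iterable, List, Tuple

# Assumed handle rule: letters, digits, periods, and underscores, up to 30 chars.
HANDLE_RE = re.compile(r"@([A-Za-z0-9._]{1,30})")


def extract_handles(
    words: Iterable[Tuple[str, float]],
    min_confidence: float,
) -> List[str]:
    """Return lowercase handles from OCR words at or above min_confidence."""
    handles: List[str] = []
    for text, confidence in words:
        if confidence < min_confidence:
            continue  # too uncertain; a lower threshold would keep this word
        for match in HANDLE_RE.finditer(text):
            handle = match.group(1).rstrip(".").lower()  # drop sentence-ending dots
            if handle and handle not in handles:
                handles.append(handle)
    return handles
```

With a blurry word scored at 42, a threshold of 60 drops it while a threshold of 40 keeps it, at the cost of admitting more misreads.
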

### Issue: No matching original found

**Possible causes:**
1. Original content deleted or made private
2. Post older than `max_posts_age_hours` setting
3. Hash distance too strict

**Solution:**
- Increase `max_posts_age_hours` (check older posts)
- Increase `hash_distance_threshold` (looser matching)

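
The effect of `hash_distance_threshold` is easiest to see with the hashes treated as plain integers. The sketch assumes perceptual hashes are compared by Hamming distance (the value obtained by subtracting two `imagehash.ImageHash` objects) and that a match means "distance less than or equal to the threshold". Both the comparison operator and the function names are assumptions, not confirmed details of the detector.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two equal-length hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_match(hash_a: int, hash_b: int, threshold: int = 10) -> bool:
    """Treat two hashes as the same content if they differ in <= threshold bits."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Two 64-bit hashes that differ in exactly 9 bits.
original = 0xFFFF0000FFFF0000
repost = original ^ 0x1FF
```

Here `is_match(original, repost)` holds at the default threshold of 10 but fails at 8, which is why raising the threshold recovers matches lost to re-encoding or overlays.
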

### Issue: Temp files not being cleaned up

**Check:**
```bash
ls -lah /tmp/repost_detection/
```

**Solution:** Verify `cleanup_temp_files` is enabled in settings

### Issue: Too many API requests to ImgInn

**Solution:**
- Increase `fetch_cache_hours` (cache longer)
- Reduce `max_posts_age_hours` (check fewer posts)

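
Why a longer `fetch_cache_hours` reduces requests can be shown with a freshness check. This is a sketch only: `needs_fetch` is a hypothetical helper, and it assumes the cache stores one `last_fetched` timestamp per username, as the `repost_fetch_cache` query in Step 3.2 suggests.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_fetch(
    last_fetched: Optional[datetime],
    now: datetime,
    fetch_cache_hours: int = 12,
) -> bool:
    """Return True when the cached fetch is missing or older than the window."""
    if last_fetched is None:
        return True  # never fetched: a request is unavoidable
    return now - last_fetched >= timedelta(hours=fetch_cache_hours)
```
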

---

## Monitoring & Metrics

### Key Metrics to Track

```sql
-- Replacement totals (counts successful replacements only, not a rate)
SELECT
    COUNT(*) as total_replacements,
    COUNT(DISTINCT repost_source) as affected_sources,
    COUNT(DISTINCT original_username) as original_accounts
FROM repost_replacements;

-- Most frequently detected original accounts
SELECT
    original_username,
    COUNT(*) as repost_count
FROM repost_replacements
GROUP BY original_username
ORDER BY repost_count DESC
LIMIT 10;

-- Recent activity
SELECT
    DATE(detected_at) as date,
    COUNT(*) as replacements
FROM repost_replacements
GROUP BY DATE(detected_at)
ORDER BY date DESC
LIMIT 7;
```
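
If these numbers are wanted in a script or dashboard rather than the `sqlite3` shell, the second query can be wrapped as below. This is a sketch that assumes only the `repost_replacements` columns used in the queries above. `top_original_accounts` is a hypothetical name, and the secondary sort by username is added here only to make ties deterministic.

```python
import sqlite3
from typing import List, Tuple


def top_original_accounts(
    conn: sqlite3.Connection, limit: int = 10
) -> List[Tuple[str, int]]:
    """Return (original_username, repost_count) pairs, most frequent first."""
    rows = conn.execute(
        """
        SELECT original_username, COUNT(*) AS repost_count
        FROM repost_replacements
        GROUP BY original_username
        ORDER BY repost_count DESC, original_username ASC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [(name, count) for name, count in rows]
```
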

### Performance Metrics

- **Average processing time:** 5-10 seconds per repost
- **Disk usage (temp):** ~50-200MB per non-monitored account (cleaned after use)
- **Cache hit rate:** Monitor the `repost_fetch_cache` table for efficiency

---

## Best Practices

### Recommended Settings

**Conservative (Low Resource Usage):**
```json
{
  "enabled": true,
  "hash_distance_threshold": 8,
  "fetch_cache_hours": 24,
  "max_posts_age_hours": 12,
  "cleanup_temp_files": true
}
```

**Aggressive (Best Quality):**
```json
{
  "enabled": true,
  "hash_distance_threshold": 12,
  "fetch_cache_hours": 6,
  "max_posts_age_hours": 48,
  "cleanup_temp_files": true
}
```
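
When applying a profile from a script, it can help to overlay it on the defaults and reject obviously bad values before writing to the database. The sketch below uses the defaults listed in Step 2.1 of this guide. `DEFAULTS` and `resolve_settings` are hypothetical names, and the validation rule (non-negative integers) is an assumption rather than documented behavior.

```python
from typing import Any, Dict

# Defaults as listed in Step 2.1 of this guide; "enabled" defaults to off.
DEFAULTS: Dict[str, Any] = {
    "enabled": False,
    "hash_distance_threshold": 10,
    "fetch_cache_hours": 12,
    "max_posts_age_hours": 24,
    "cleanup_temp_files": True,
}

NUMERIC_KEYS = ("hash_distance_threshold", "fetch_cache_hours", "max_posts_age_hours")


def resolve_settings(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay overrides on DEFAULTS and validate the numeric settings."""
    merged = {**DEFAULTS, **overrides}
    for key in NUMERIC_KEYS:
        value = merged[key]
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return merged
```

For example, `resolve_settings({"enabled": True, "hash_distance_threshold": 8})` keeps `fetch_cache_hours` at its default of 12 while tightening the match threshold.
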

### When to Use

✅ **Good for:**
- Accounts that frequently repost other users' stories
- High-profile accounts with quality concerns
- Archival purposes (want original high-res content)

❌ **Not needed for:**
- Accounts that rarely repost
- Already monitored original accounts
- Low-storage situations

---

## Gradual Rollout Strategy

### Week 1: Silent Monitoring
- Enable feature
- Monitor logs for detection rate
- Don't interfere with workflow
- Identify common patterns

### Week 2: Selective Enable
- Enable for 2-3 high-repost accounts
- Verify replacements are correct
- Check false positive rate
- Monitor performance impact

### Week 3: Broader Enable
- Enable for all Instagram story downloaders
- Monitor database growth
- Check temp file cleanup
- Validate quality improvements

### Week 4+: Full Production
- Feature stable and validated
- Document edge cases found
- Tune settings based on results
- Consider expanding to other platforms

---

## Support & Documentation

**Documentation:**
- Design spec: `/opt/media-downloader/docs/instagram_repost_detection_design.md`
- Test results: `/opt/media-downloader/docs/repost_detection_test_results.md`
- This guide: `/opt/media-downloader/docs/repost_detection_testing_guide.md`

**Test Scripts:**
- Unit tests: `/opt/media-downloader/tests/test_instagram_repost_detector.py`
- Manual tests: `/opt/media-downloader/tests/test_repost_detection_manual.py`

**Module Files:**
- Detector: `/opt/media-downloader/modules/instagram_repost_detector.py`
- ImgInn: `/opt/media-downloader/modules/imginn_module.py`
- Move: `/opt/media-downloader/modules/move_module.py`

---

## Success Criteria

✅ **Feature is ready for production when:**

1. Disabled state doesn't affect existing functionality
2. Enabled state successfully detects and replaces reposts
3. No errors in logs during normal operation
4. Temp files are cleaned up properly
5. Database tracking works correctly
6. Performance impact is acceptable
7. False positive rate is low (<5%)
8. Quality of replacements is consistently better

---

**Ready to test!** Start with Phase 1 to verify everything is safe, then gradually enable and test.