# Instagram Repost Detection - Testing & Deployment Guide

**Status:** ✅ **Implementation Complete - Ready for Testing**
**Default State:** 🔒 **DISABLED** (feature flag off)

---

## Implementation Summary

All code has been safely integrated with backward-compatible changes:

✅ **ImgInn Module Updated** - Added optional `skip_database` and `max_age_hours` parameters (default behavior unchanged)
✅ **Move Module Updated** - Added repost detection hooks with feature flag check (disabled by default)
✅ **Database Settings Added** - Settings entry created with `enabled: false`
✅ **Frontend UI Added** - Configuration page includes repost detection settings panel
✅ **Module Tested** - Core detection logic validated with real example file

---

## Safety Guarantees

### Backward Compatibility
- All new parameters have defaults that preserve existing behavior
- Feature is completely disabled by default
- No changes to existing workflows when disabled
- Can be toggled on/off without code changes

### Error Handling
- If repost detection fails, original file processing continues normally
- Missing dependencies don't break downloads
- Failed OCR/hashing doesn't stop the move operation
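The fail-safe behavior listed above can be sketched roughly as follows. This is an illustration only, not the code in `move_module.py`: the names `process_with_fallback` and `broken_detector` are hypothetical, and the real hook may be structured differently.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger("repost_detection_sketch")


def process_with_fallback(
    source: Path,
    detect: Callable[[Path], Optional[Path]],
) -> Path:
    """Return the file to move: the detected original, or `source` on any failure."""
    try:
        replacement = detect(source)
    except Exception as exc:  # deliberate catch-all: detection must never block the move
        logger.warning("Repost detection failed for %s: %s", source, exc)
        return source
    # `None` means "not a repost" or "no match found": keep the downloaded file.
    return replacement if replacement is not None else source


def broken_detector(path: Path) -> Optional[Path]:
    """Hypothetical detector that simulates a missing dependency."""
    raise RuntimeError("OCR backend unavailable")


print(process_with_fallback(Path("story.mp4"), broken_detector))  # falls back to story.mp4
```

The point of the catch-all is that a detection error is downgraded to a log warning, so the move operation always receives a usable path.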

### Database Safety
- New tables created only when feature is used
- Existing tables remain untouched
- Can be disabled instantly via SQL or UI

---

## Testing Plan

### Phase 1: Verify Feature is Disabled (Recommended First Step)

**Purpose:** Confirm existing functionality is unchanged

```bash
# 1. Check database setting
sqlite3 /opt/media-downloader/data/backup_cache.db \
  "SELECT key, json_extract(value, '$.enabled') FROM settings WHERE key = 'repost_detection';"

# Expected output:
# repost_detection|0  (0 = disabled)

# 2. Download some Instagram stories (any module)
#    - Stories should download normally
#    - No repost detection messages in logs
#    - No temp files in /tmp/repost_detection/

# 3. Check frontend
#    - Open Configuration page
#    - Find "Instagram Repost Detection" section
#    - Verify toggle is OFF by default
```

**Expected Result:** Everything works exactly as before
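For scripted checks, the same flag can be read from Python. The sketch below assumes only what the query above shows, namely a `settings` table with a `key` column and a JSON `value` column. The helper name `repost_detection_enabled` is hypothetical, and an in-memory database stands in for `backup_cache.db`.

```python
import json
import sqlite3


def repost_detection_enabled(conn: sqlite3.Connection) -> bool:
    """Read the `repost_detection` settings row and report whether it is enabled."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", ("repost_detection",)
    ).fetchone()
    if row is None:
        return False  # no settings row: treat the feature as disabled
    return bool(json.loads(row[0]).get("enabled", False))


# Stand-in database; the real `settings` table may have more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute(
    "INSERT INTO settings VALUES (?, ?)",
    ("repost_detection", json.dumps({"enabled": False})),
)
print(repost_detection_enabled(conn))  # False while the flag is off
```

Parsing the JSON in Python avoids depending on SQLite's JSON functions being compiled into the local build.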

---

### Phase 2: Enable and Test Detection

**Step 2.1: Enable via Frontend (Recommended)**

1. Open Configuration page: http://localhost:8000/configuration
2. Scroll to "Instagram Repost Detection" section
3. Toggle "Enabled" to ON
4. Adjust settings if desired:
   - Hash Distance Threshold: 10 (default)
   - Fetch Cache Duration: 12 hours (default)
   - Max Posts Age: 24 hours (default)
   - Cleanup Temp Files: ON (recommended)
5. Click "Save Configuration"

**Step 2.2: Enable via SQL (Alternative)**

```bash
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
UPDATE settings
SET value = json_set(value, '$.enabled', true)
WHERE key = 'repost_detection';

SELECT 'Feature enabled. Current settings:';
SELECT value FROM settings WHERE key = 'repost_detection';
EOF
```

**Step 2.3: Test with Known Repost**

Use the example file from testing:
```
/media/d$/OneDrive - LIComputerGuy/Celebrities/Eva Longoria/4. Media/social media/instagram/stories/evalongoria_20251109_154548_story6.mp4
```

This is a repost of @globalgiftfoundation content.

```bash
# Manual test with the detection script
python3 /opt/media-downloader/tests/test_repost_detection_manual.py \
  "/media/.../evalongoria_20251109_154548_story6.mp4" \
  "evalongoria" \
  --live

# Expected output:
# ✅ OCR extraction: @globalgiftfoundation
# ℹ️ @globalgiftfoundation NOT monitored (using temp queue)
# ⏬ Downloading stories and posts via ImgInn
# ✓ Found matching original
# ✓ Replaced repost with original
```

---

### Phase 3: Monitor Live Downloads

**Step 3.1: Enable Logging**

Watch logs for repost detection activity:
```bash
# Terminal 1: Backend logs
sudo journalctl -u media-downloader-api -f | grep -i repost

# Terminal 2: Download logs
tail -f /opt/media-downloader/logs/downloads.log | grep -i repost

# Look for messages like:
# [RepostDetector] [INFO] Detected repost from @username
# [RepostDetector] [SUCCESS] ✓ Found original
# [MoveManager] [SUCCESS] ✓ Replaced repost with original from @username
```
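To turn those log lines into a quick tally, something like the following may help. The regular expression is derived from the sample messages above, so it is an assumption about the real log format, and `count_detections` is a hypothetical helper rather than part of the project.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern taken from the sample message above; the real log format may differ.
DETECTED = re.compile(
    r"\[RepostDetector\] \[INFO\] Detected repost from @([A-Za-z0-9._]+)"
)


def count_detections(lines: Iterable[str]) -> Counter:
    """Count 'Detected repost' messages per original account."""
    counts: Counter = Counter()
    for line in lines:
        match = DETECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[RepostDetector] [INFO] Detected repost from @globalgiftfoundation",
    "[RepostDetector] [SUCCESS] ✓ Found original",
    "[MoveManager] [SUCCESS] ✓ Replaced repost with original from @globalgiftfoundation",
    "[RepostDetector] [INFO] Detected repost from @globalgiftfoundation",
]
print(count_detections(sample).most_common())  # [('globalgiftfoundation', 2)]
```

In practice the `lines` argument could be an open log file, since file objects iterate line by line.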

**Step 3.2: Check Database Tracking**

```bash
# View repost replacements
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
  repost_source,
  original_username,
  repost_filename,
  detected_at
FROM repost_replacements
ORDER BY detected_at DESC
LIMIT 10;
EOF

# View fetch cache (avoid re-downloading)
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
  username,
  last_fetched,
  content_count
FROM repost_fetch_cache
ORDER BY last_fetched DESC;
EOF
```

**Step 3.3: Monitor Disk Usage**

```bash
# Check temp directory (should be empty or small if cleanup enabled)
du -sh /tmp/repost_detection/

# Check for successful cleanups in logs
grep "Cleaned up.*temporary files" /opt/media-downloader/logs/*.log
```
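If the disk check needs to run from a script rather than a shell, a rough Python equivalent is sketched below. It sums apparent file sizes, so the number can differ slightly from `du`, which reports allocated blocks. The helper name `directory_size_bytes` is hypothetical.

```python
import tempfile
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under `root`; 0 if it does not exist."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Demonstration on a throwaway directory instead of /tmp/repost_detection/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.bin").write_bytes(b"x" * 1024)
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.bin").write_bytes(b"y" * 2048)
    print(directory_size_bytes(Path(tmp)))  # 3072
```

With cleanup enabled, calling this on the temp directory after a run should return a small value or 0.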

---

### Phase 4: Performance Testing

**Test Scenario 1: Monitored Account Repost**

```
Source: evalongoria (monitored)
Reposts: @originaluser (also monitored)
Expected: Downloads to normal path, no cleanup
```

**Test Scenario 2: Non-Monitored Account Repost**

```
Source: evalongoria (monitored)
Reposts: @randomuser (NOT monitored)
Expected: Downloads to /tmp, cleanup after matching
```

**Test Scenario 3: No @username Detected**

```
Source: evalongoria (monitored)
Story: Regular story (not a repost)
Expected: Skip detection, process normally
```

**Test Scenario 4: No Matching Original Found**

```
Source: evalongoria (monitored)
Reposts: @oldaccount (deleted or no stories/posts)
Expected: Keep repost, log warning, continue
```
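The four scenarios amount to a small decision table. The sketch below encodes them as a checklist aid for testers. It is not the detector's actual control flow, and `expected_outcome` and its parameters are hypothetical names.

```python
from typing import Optional


def expected_outcome(
    detected_username: Optional[str],
    original_is_monitored: bool,
    match_found: bool,
) -> str:
    """Map one test case onto the expected result listed in the scenarios."""
    if detected_username is None:
        return "skip detection, process normally"        # Scenario 3
    if not match_found:
        return "keep repost, log warning, continue"      # Scenario 4
    if original_is_monitored:
        return "download to normal path, no cleanup"     # Scenario 1
    return "download to /tmp, clean up after matching"   # Scenario 2


print(expected_outcome("randomuser", original_is_monitored=False, match_found=True))
```

Comparing the returned string with what the logs show for each test file gives a simple pass/fail record per scenario.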

---

## Rollback Procedures

### Option 1: Disable via Frontend (Instant)
1. Open Configuration page
2. Toggle "Instagram Repost Detection" to OFF
3. Save

### Option 2: Disable via SQL (Instant)
```bash
sqlite3 /opt/media-downloader/data/backup_cache.db \
  "UPDATE settings SET value = json_set(value, '$.enabled', false) WHERE key = 'repost_detection';"
```

### Option 3: Comment Out Hook (Permanent Disable)
Edit `/opt/media-downloader/modules/move_module.py` around line 454:
```python
# Disable repost detection permanently:
# if self._is_instagram_story(source) and self.batch_context:
#     ...
```

---

## Troubleshooting

### Issue: "Missing dependencies" warning

**Solution:**
```bash
pip3 install --break-system-packages pytesseract opencv-python imagehash
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### Issue: OCR not detecting usernames

**Possible causes:**
1. Username has special characters
2. Low image quality
3. Unusual font/styling

**Solution:** Adjust `ocr_confidence_threshold` in settings (lower = more permissive)
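To see why a lower threshold is more permissive, consider the sketch below. It assumes the detector filters OCR tokens by a per-word confidence on a 0-100 scale (the scale Tesseract reports) and then looks for an `@handle`. Both the filtering order and the helper `extract_handle` are assumptions for illustration. The token list is made-up sample data.

```python
import re
from typing import Iterable, Optional, Tuple

# Instagram handles: letters, digits, periods and underscores, up to 30 characters.
HANDLE = re.compile(r"@([A-Za-z0-9._]{1,30})")


def extract_handle(
    tokens: Iterable[Tuple[str, float]],
    confidence_threshold: float,
) -> Optional[str]:
    """Return the first @handle among tokens whose confidence meets the threshold."""
    for text, confidence in tokens:
        if confidence < confidence_threshold:
            continue  # token is too uncertain to trust
        match = HANDLE.search(text)
        if match:
            return match.group(1)
    return None


# Made-up (text, confidence) pairs standing in for OCR output.
tokens = [("Thank", 91.0), ("you", 88.0), ("@globalgiftfoundation", 54.0)]
print(extract_handle(tokens, confidence_threshold=60.0))  # None: handle token is below 60
print(extract_handle(tokens, confidence_threshold=50.0))  # globalgiftfoundation
```

Lowering the threshold recovers handles rendered in stylized fonts, at the cost of accepting more misread text.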

### Issue: No matching original found

**Possible causes:**
1. Original content deleted or made private
2. Post older than `max_posts_age_hours` setting
3. Hash distance too strict

**Solution:**
- Increase `max_posts_age_hours` (check older posts)
- Increase `hash_distance_threshold` (looser matching)
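Perceptual hashes are usually compared by Hamming distance, the number of bits that differ. Assuming the detector works this way and accepts a candidate when the distance is at or below `hash_distance_threshold`, the effect of the setting looks like this. The hash values are made up for illustration, and the `<=` comparison is an assumption.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_match(hash_a: str, hash_b: str, threshold: int) -> bool:
    """Treat two hashes as the same image when they differ in at most `threshold` bits."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Made-up 64-bit hashes that differ in 11 bits.
repost = "ffd8e0c0a0b09080"
candidate = "ffd8e0c0a0b0977f"

print(hamming_distance(repost, candidate))        # 11
print(is_match(repost, candidate, threshold=10))  # False with the default of 10
print(is_match(repost, candidate, threshold=12))  # True with a looser threshold
```

Raising the threshold recovers near misses like this one but also raises the chance of matching an unrelated image, so it is worth rechecking the false positive rate after any change.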

### Issue: Temp files not being cleaned up

**Check:**
```bash
ls -lah /tmp/repost_detection/
```

**Solution:** Verify `cleanup_temp_files` is enabled in settings

### Issue: Too many API requests to ImgInn

**Solution:**
- Increase `fetch_cache_hours` (cache longer)
- Reduce `max_posts_age_hours` (check fewer posts)
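Both settings reduce traffic in different ways: one decides whether an account is fetched at all, the other how far back candidates are considered. A minimal sketch of the two checks follows, assuming plain time-window comparisons. The helper names `should_fetch` and `recent_enough` are hypothetical, and the timestamps are sample values.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_fetch(
    last_fetched: Optional[datetime], now: datetime, fetch_cache_hours: int
) -> bool:
    """True when the account was never fetched or the cached fetch has expired."""
    if last_fetched is None:
        return True
    return now - last_fetched >= timedelta(hours=fetch_cache_hours)


def recent_enough(posted_at: datetime, now: datetime, max_posts_age_hours: int) -> bool:
    """True when a post is young enough to be considered as a candidate original."""
    return now - posted_at <= timedelta(hours=max_posts_age_hours)


now = datetime(2025, 11, 9, 18, 0)
last_fetched = datetime(2025, 11, 9, 8, 0)  # 10 hours earlier

print(should_fetch(last_fetched, now, fetch_cache_hours=12))  # False: cache still valid
print(should_fetch(last_fetched, now, fetch_cache_hours=6))   # True: cache expired
print(recent_enough(datetime(2025, 11, 8, 12, 0), now, max_posts_age_hours=24))  # False: 30 hours old
```

A longer cache window means repeated reposts from the same account within that window trigger no new request.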

---

## Monitoring & Metrics

### Key Metrics to Track

```sql
-- Repost replacement totals (counts successful replacements only)
SELECT
  COUNT(*) as total_replacements,
  COUNT(DISTINCT repost_source) as affected_sources,
  COUNT(DISTINCT original_username) as original_accounts
FROM repost_replacements;

-- Most frequently detected original accounts
SELECT
  original_username,
  COUNT(*) as repost_count
FROM repost_replacements
GROUP BY original_username
ORDER BY repost_count DESC
LIMIT 10;

-- Recent activity
SELECT
  DATE(detected_at) as date,
  COUNT(*) as replacements
FROM repost_replacements
GROUP BY DATE(detected_at)
ORDER BY date DESC
LIMIT 7;
```
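The same queries can be run from Python for dashboards or scheduled reports. The sketch below runs the "most frequent originals" query against an in-memory table. The column list mirrors the queries in this guide, but the column types, the `top_originals` helper, and the sample rows are assumptions made for the demonstration.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table may have more columns or constraints.
conn.execute(
    """CREATE TABLE repost_replacements (
           repost_source TEXT,
           original_username TEXT,
           repost_filename TEXT,
           detected_at TEXT)"""
)
# Made-up sample rows for the demonstration.
conn.executemany(
    "INSERT INTO repost_replacements VALUES (?, ?, ?, ?)",
    [
        ("evalongoria", "globalgiftfoundation", "story1.mp4", "2025-11-09 15:45:48"),
        ("evalongoria", "globalgiftfoundation", "story2.mp4", "2025-11-10 09:12:00"),
        ("evalongoria", "randomuser", "story3.jpg", "2025-11-10 11:30:00"),
    ],
)


def top_originals(conn: sqlite3.Connection, limit: int = 10) -> List[Tuple[str, int]]:
    """Return (original_username, repost_count) pairs, most frequent first."""
    return conn.execute(
        """SELECT original_username, COUNT(*) AS repost_count
           FROM repost_replacements
           GROUP BY original_username
           ORDER BY repost_count DESC
           LIMIT ?""",
        (limit,),
    ).fetchall()


print(top_originals(conn))  # [('globalgiftfoundation', 2), ('randomuser', 1)]
```

Accounts that show up repeatedly here may be worth monitoring directly, which avoids the temp download path for them.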

### Performance Metrics

- **Average processing time:** 5-10 seconds per repost
- **Disk usage (temp):** ~50-200MB per non-monitored account (cleaned after use)
- **Cache hit rate:** Monitor the `repost_fetch_cache` table for efficiency

---

## Best Practices

### Recommended Settings

**Conservative (Low Resource Usage):**
```json
{
  "enabled": true,
  "hash_distance_threshold": 8,
  "fetch_cache_hours": 24,
  "max_posts_age_hours": 12,
  "cleanup_temp_files": true
}
```

**Aggressive (Best Quality):**
```json
{
  "enabled": true,
  "hash_distance_threshold": 12,
  "fetch_cache_hours": 6,
  "max_posts_age_hours": 48,
  "cleanup_temp_files": true
}
```
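Before writing a profile like these into the database, it can help to merge it with defaults and sanity-check the numbers. The sketch below is one way to do that. The default values follow the ones listed in Step 2.1, except `cleanup_temp_files`, whose default of `True` is an assumption. `load_settings` is a hypothetical helper, not part of the project.

```python
import json

# Defaults as listed in Step 2.1; the cleanup default is assumed.
DEFAULTS = {
    "enabled": False,
    "hash_distance_threshold": 10,
    "fetch_cache_hours": 12,
    "max_posts_age_hours": 24,
    "cleanup_temp_files": True,
}
NUMERIC_KEYS = ("hash_distance_threshold", "fetch_cache_hours", "max_posts_age_hours")


def load_settings(raw_json: str) -> dict:
    """Merge a JSON settings blob over the defaults and validate numeric fields."""
    settings = {**DEFAULTS, **json.loads(raw_json)}
    for key in NUMERIC_KEYS:
        value = settings[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
    return settings


conservative = load_settings('{"enabled": true, "hash_distance_threshold": 8}')
print(conservative["hash_distance_threshold"], conservative["fetch_cache_hours"])  # 8 12
```

Keys not covered by the defaults, such as `ocr_confidence_threshold`, pass through unchanged in this sketch.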

### When to Use

✅ **Good for:**
- Accounts that frequently repost other users' stories
- High-profile accounts with quality concerns
- Archival purposes (want original high-res content)

❌ **Not needed for:**
- Accounts that rarely repost
- Already monitored original accounts
- Low-storage situations

---

## Gradual Rollout Strategy

### Week 1: Silent Monitoring
- Enable feature
- Monitor logs for detection rate
- Don't interfere with workflow
- Identify common patterns

### Week 2: Selective Enable
- Enable for 2-3 high-repost accounts
- Verify replacements are correct
- Check false positive rate
- Monitor performance impact

### Week 3: Broader Enable
- Enable for all Instagram story downloaders
- Monitor database growth
- Check temp file cleanup
- Validate quality improvements

### Week 4+: Full Production
- Feature stable and validated
- Document edge cases found
- Tune settings based on results
- Consider expanding to other platforms

---

## Support & Documentation

**Documentation:**
- Design spec: `/opt/media-downloader/docs/instagram_repost_detection_design.md`
- Test results: `/opt/media-downloader/docs/repost_detection_test_results.md`
- This guide: `/opt/media-downloader/docs/repost_detection_testing_guide.md`

**Test Scripts:**
- Unit tests: `/opt/media-downloader/tests/test_instagram_repost_detector.py`
- Manual tests: `/opt/media-downloader/tests/test_repost_detection_manual.py`

**Module Files:**
- Detector: `/opt/media-downloader/modules/instagram_repost_detector.py`
- ImgInn: `/opt/media-downloader/modules/imginn_module.py`
- Move: `/opt/media-downloader/modules/move_module.py`

---

## Success Criteria

✅ **Feature is ready for production when:**

1. Disabled state doesn't affect existing functionality
2. Enabled state successfully detects and replaces reposts
3. No errors in logs during normal operation
4. Temp files are cleaned up properly
5. Database tracking works correctly
6. Performance impact is acceptable
7. False positive rate is low (<5%)
8. Quality of replacements is consistently better

---

**Ready to test!** Start with Phase 1 to verify everything is safe, then gradually enable and test.