media-downloader/docs/archive/repost_detection_testing_guide.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00


Instagram Repost Detection - Testing & Deployment Guide

Status: Implementation Complete - Ready for Testing

Default State: 🔒 DISABLED (feature flag off)


Implementation Summary

All code has been safely integrated with backward-compatible changes:

  • ImgInn Module Updated - Added optional skip_database and max_age_hours parameters (default behavior unchanged)
  • Move Module Updated - Added repost detection hooks with feature flag check (disabled by default)
  • Database Settings Added - Settings entry created with enabled: false
  • Frontend UI Added - Configuration page includes repost detection settings panel
  • Module Tested - Core detection logic validated with real example file


Safety Guarantees

Backward Compatibility

  • All new parameters have defaults that preserve existing behavior
  • Feature is completely disabled by default
  • No changes to existing workflows when disabled
  • Can be toggled on/off without code changes
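
A minimal sketch of how optional parameters with defaults can keep existing call sites working. The function name `fetch_stories` and its return value are hypothetical; only the parameter names `skip_database` and `max_age_hours` come from the summary above, and the default values shown (`False` and `None`) are assumptions.

```python
from typing import Optional


def fetch_stories(username: str,
                  skip_database: bool = False,
                  max_age_hours: Optional[int] = None) -> dict:
    """Hypothetical stand-in for an ImgInn download call.

    The defaults are assumed: with both left untouched, the call behaves
    like the pre-change version (database tracking on, no age limit).
    """
    return {
        "username": username,
        "record_in_database": not skip_database,
        "max_age_hours": max_age_hours,
    }


# Existing call sites pass only the username and see unchanged behavior.
legacy = fetch_stories("evalongoria")

# Repost detection opts in to the new behavior explicitly.
detection = fetch_stories("globalgiftfoundation", skip_database=True, max_age_hours=24)
```

Because the new arguments are keyword-style with defaults, no existing caller needs to change when the feature is off.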

Error Handling

  • If repost detection fails, original file processing continues normally
  • Missing dependencies don't break downloads
  • Failed OCR/hashing doesn't stop the move operation
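
The fallback behavior described above can be sketched as a wrapper that never lets a detection failure reach the move operation. The names `resolve_source` and `broken_detector` are hypothetical and are not taken from the actual move module.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("repost_detection_sketch")


def resolve_source(path: str, detect: Callable[[str], Optional[str]]) -> str:
    """Return the file to move: the original if detection finds one, else `path`.

    `detect` is a hypothetical callable that returns the path of the original
    file, returns None when nothing matches, or raises on any failure
    (missing OCR dependency, hashing error, network error).
    """
    try:
        replacement = detect(path)
    except Exception as exc:  # detection must never block the move
        logger.warning("Repost detection failed for %s: %s", path, exc)
        return path
    return replacement or path


def broken_detector(path: str) -> Optional[str]:
    """Simulates a missing dependency."""
    raise ImportError("pytesseract is not installed")
```

Catching the broad `Exception` is deliberate here: the feature is optional, so any failure should degrade to the pre-existing behavior rather than abort the download.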

Database Safety

  • New tables created only when feature is used
  • Existing tables remain untouched
  • Can be disabled instantly via SQL or UI
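
One way to create tables only on first use is `CREATE TABLE IF NOT EXISTS`, sketched below against an in-memory database. The table and column names match the queries later in this guide; the column types, constraints, and the helper name `ensure_repost_tables` are assumptions, not the real schema.

```python
import sqlite3


def ensure_repost_tables(conn: sqlite3.Connection) -> None:
    """Create tracking tables on first use; a no-op when they already exist."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS repost_replacements (
            id INTEGER PRIMARY KEY,
            repost_source TEXT NOT NULL,
            original_username TEXT NOT NULL,
            repost_filename TEXT NOT NULL,
            detected_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS repost_fetch_cache (
            username TEXT PRIMARY KEY,
            last_fetched TEXT NOT NULL,
            content_count INTEGER DEFAULT 0
        );
        """
    )


conn = sqlite3.connect(":memory:")  # in-memory stand-in for backup_cache.db
ensure_repost_tables(conn)
ensure_repost_tables(conn)  # calling again leaves existing tables untouched
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```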

Testing Plan

Phase 1: Verify Disabled State (Default)

Purpose: Confirm existing functionality is unchanged

# 1. Check database setting
sqlite3 /opt/media-downloader/data/backup_cache.db \
  "SELECT key, json_extract(value, '$.enabled') FROM settings WHERE key = 'repost_detection';"

# Expected output:
# repost_detection|0  (0 = disabled)

# 2. Download some Instagram stories (any module)
# - Stories should download normally
# - No repost detection messages in logs
# - No temp files in /tmp/repost_detection/

# 3. Check frontend
# - Open Configuration page
# - Find "Instagram Repost Detection" section
# - Verify toggle is OFF by default

Expected Result: Everything works exactly as before
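
The same flag check can be scripted. The sketch below assumes the `settings` table stores a JSON string in `value`, as the query above implies; the helper name `is_repost_detection_enabled` is hypothetical, and an in-memory database stands in for backup_cache.db.

```python
import json
import sqlite3


def is_repost_detection_enabled(conn: sqlite3.Connection) -> bool:
    """Read the feature flag; a missing row or missing key counts as disabled."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = 'repost_detection'"
    ).fetchone()
    if row is None:
        return False
    return bool(json.loads(row[0]).get("enabled", False))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute(
    "INSERT INTO settings VALUES ('repost_detection', ?)",
    (json.dumps({"enabled": False}),),
)
enabled = is_repost_detection_enabled(conn)
```

Treating a missing row as disabled keeps the check consistent with the "disabled by default" guarantee.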


Phase 2: Enable and Test Detection

Step 2.1: Enable via Frontend (Recommended)

  1. Open Configuration page: http://localhost:8000/configuration
  2. Scroll to "Instagram Repost Detection" section
  3. Toggle "Enabled" to ON
  4. Adjust settings if desired:
    • Hash Distance Threshold: 10 (default)
    • Fetch Cache Duration: 12 hours (default)
    • Max Posts Age: 24 hours (default)
    • Cleanup Temp Files: ON (recommended)
  5. Click "Save Configuration"
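
To illustrate what the Hash Distance Threshold controls, here is a sketch that assumes the detector compares perceptual hashes by Hamming distance (the number of differing bits). Plain 64-bit integers stand in for real image hashes, and the function names are hypothetical.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two equal-length hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_match(hash_a: int, hash_b: int, threshold: int = 10) -> bool:
    """Treat two frames as the same content when at most `threshold` bits differ."""
    return hamming_distance(hash_a, hash_b) <= threshold


original = 0xF0F0F0F0F0F0F0F0
recompressed = original ^ 0b111     # 3 bits flipped, e.g. by re-encoding
unrelated = original ^ 0xFFFF0000   # 16 bits flipped
```

Under this assumption, a higher threshold accepts more visually different frames as matches, so it finds more originals at the cost of more false positives.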

Step 2.2: Enable via SQL (Alternative)

sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
-- json('true') stores a JSON boolean; a bare true would be stored as the integer 1
UPDATE settings
SET value = json_set(value, '$.enabled', json('true'))
WHERE key = 'repost_detection';

SELECT 'Feature enabled. Current settings:';
SELECT value FROM settings WHERE key = 'repost_detection';
EOF

Step 2.3: Test with Known Repost

Use the example file from testing:

/media/d$/OneDrive - LIComputerGuy/Celebrities/Eva Longoria/4. Media/social media/instagram/stories/evalongoria_20251109_154548_story6.mp4

This is a repost of @globalgiftfoundation content.

# Manual test with the detection script
python3 /opt/media-downloader/tests/test_repost_detection_manual.py \
  "/media/.../evalongoria_20251109_154548_story6.mp4" \
  "evalongoria" \
  --live

# Expected output:
# ✅ OCR extraction: @globalgiftfoundation
#   @globalgiftfoundation NOT monitored (using temp queue)
# ⏬ Downloading stories and posts via ImgInn
# ✓ Found matching original
# ✓ Replaced repost with original
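
The OCR step in the expected output yields an @username. A minimal sketch of pulling a handle out of OCR text with a regular expression follows; the pattern and the helper name `extract_handle` are assumptions and may differ from what the detector module actually does.

```python
import re
from typing import Optional

# Instagram handles use letters, digits, underscores and periods (max 30 chars).
HANDLE_RE = re.compile(r"@([A-Za-z0-9_.]{1,30})")


def extract_handle(ocr_text: str) -> Optional[str]:
    """Return the first @handle in OCR output, lowercased, or None."""
    match = HANDLE_RE.search(ocr_text)
    if match is None:
        return None
    # Drop a trailing period picked up from sentence punctuation.
    handle = match.group(1).rstrip(".").lower()
    return handle or None
```

A story with no handle returns `None`, which corresponds to the "skip detection, process normally" path.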

Phase 3: Monitor Live Downloads

Step 3.1: Enable Logging

Watch logs for repost detection activity:

# Terminal 1: Backend logs
sudo journalctl -u media-downloader-api -f | grep -i repost

# Terminal 2: Download logs
tail -f /opt/media-downloader/logs/downloads.log | grep -i repost

# Look for messages like:
# [RepostDetector] [INFO] Detected repost from @username
# [RepostDetector] [SUCCESS] ✓ Found original
# [MoveManager] [SUCCESS] ✓ Replaced repost with original from @username

Step 3.2: Check Database Tracking

# View repost replacements
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
  repost_source,
  original_username,
  repost_filename,
  detected_at
FROM repost_replacements
ORDER BY detected_at DESC
LIMIT 10;
EOF

# View fetch cache (avoid re-downloading)
sqlite3 /opt/media-downloader/data/backup_cache.db << 'EOF'
SELECT
  username,
  last_fetched,
  content_count
FROM repost_fetch_cache
ORDER BY last_fetched DESC;
EOF
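
The fetch cache exists to avoid re-downloading an account that was fetched recently. A sketch of the freshness check, assuming `last_fetched` is compared against the Fetch Cache Duration setting (the function name `needs_fetch` is hypothetical):

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_fetch(last_fetched: Optional[datetime],
                now: datetime,
                fetch_cache_hours: int = 12) -> bool:
    """Return True when the account was never fetched or the cache entry expired."""
    if last_fetched is None:
        return True
    return now - last_fetched >= timedelta(hours=fetch_cache_hours)


now = datetime(2025, 11, 9, 18, 0)
fresh = needs_fetch(now - timedelta(hours=2), now)    # fetched 2 hours ago
stale = needs_fetch(now - timedelta(hours=13), now)   # fetched 13 hours ago
```

Passing `now` explicitly keeps the check easy to test without patching the clock.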

Step 3.3: Monitor Disk Usage

# Check temp directory (should be empty or small if cleanup enabled)
du -sh /tmp/repost_detection/

# Check for successful cleanups in logs
grep "Cleaned up.*temporary files" /opt/media-downloader/logs/*.log

Phase 4: Performance Testing

Test Scenario 1: Monitored Account Repost

Source: evalongoria (monitored)
Reposts: @originaluser (also monitored)
Expected: Downloads to normal path, no cleanup

Test Scenario 2: Non-Monitored Account Repost

Source: evalongoria (monitored)
Reposts: @randomuser (NOT monitored)
Expected: Downloads to /tmp/repost_detection/, cleanup after matching

Test Scenario 3: No @username Detected

Source: evalongoria (monitored)
Story: Regular story (not a repost)
Expected: Skip detection, process normally

Test Scenario 4: No Matching Original Found

Source: evalongoria (monitored)
Reposts: @oldaccount (deleted or no stories/posts)
Expected: Keep repost, log warning, continue
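
The four scenarios above reduce to a small decision table. The sketch below encodes the expected outcomes only; the names `Action`, `plan_fetch`, and `final_file` are hypothetical and do not mirror the real module's structure.

```python
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    SKIP_DETECTION = "process normally"
    FETCH_TO_NORMAL_PATH = "download to normal path, no cleanup"
    FETCH_TO_TEMP = "download to temp dir, clean up after matching"


def plan_fetch(detected_handle: Optional[str], monitored: Set[str]) -> Action:
    """Scenarios 1-3: decide where the original account's content should go."""
    if detected_handle is None:
        return Action.SKIP_DETECTION
    if detected_handle in monitored:
        return Action.FETCH_TO_NORMAL_PATH
    return Action.FETCH_TO_TEMP


def final_file(repost_path: str, original_path: Optional[str]) -> str:
    """Scenario 4: keep the repost when no matching original was found."""
    return original_path if original_path is not None else repost_path


monitored_accounts = {"evalongoria", "originaluser"}
```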

Rollback Procedures

Option 1: Disable via Frontend (Instant)

  1. Open Configuration page
  2. Toggle "Instagram Repost Detection" to OFF
  3. Save

Option 2: Disable via SQL (Instant)

sqlite3 /opt/media-downloader/data/backup_cache.db \
  "UPDATE settings SET value = json_set(value, '$.enabled', json('false')) WHERE key = 'repost_detection';"

Option 3: Comment Out Hook (Permanent Disable)

Edit /opt/media-downloader/modules/move_module.py around line 454:

# Disable repost detection permanently:
# if self._is_instagram_story(source) and self.batch_context:
#     ...

Troubleshooting

Issue: "Missing dependencies" warning

Solution:

pip3 install --break-system-packages pytesseract opencv-python imagehash
sudo apt-get install tesseract-ocr tesseract-ocr-eng

Issue: OCR not detecting usernames

Possible causes:

  1. Username has special characters
  2. Low image quality
  3. Unusual font/styling

Solution: Adjust ocr_confidence_threshold in settings (lower = more permissive)

Issue: No matching original found

Possible causes:

  1. Original content deleted or made private
  2. Post older than max_posts_age_hours setting
  3. Hash distance too strict

Solution:

  • Increase max_posts_age_hours (check older posts)
  • Increase hash_distance_threshold (looser matching)

Issue: Temp files not being cleaned up

Check:

ls -lah /tmp/repost_detection/

Solution: Verify cleanup_temp_files is enabled in settings
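
For reference, a sketch of what honoring `cleanup_temp_files` could look like. The helper `cleanup_temp_dir` is hypothetical, and the example works in a throwaway directory created with `tempfile` rather than the real /tmp/repost_detection/ path.

```python
import shutil
import tempfile
from pathlib import Path


def cleanup_temp_dir(temp_dir: Path, cleanup_temp_files: bool = True) -> int:
    """Delete `temp_dir` and return how many files were removed (0 if skipped)."""
    if not cleanup_temp_files or not temp_dir.exists():
        return 0
    removed = sum(1 for p in temp_dir.rglob("*") if p.is_file())
    shutil.rmtree(temp_dir)
    return removed


# Throwaway directory standing in for a per-account temp download folder.
work_dir = Path(tempfile.mkdtemp(prefix="repost_detection_"))
(work_dir / "story1.mp4").write_bytes(b"")
(work_dir / "story2.jpg").write_bytes(b"")
removed = cleanup_temp_dir(work_dir)
```

If the setting is off, the function returns 0 and leaves the files in place, which matches the symptom described in this issue.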

Issue: Too many API requests to ImgInn

Solution:

  • Increase fetch_cache_hours (cache longer)
  • Reduce max_posts_age_hours (check fewer posts)

Monitoring & Metrics

Key Metrics to Track

-- Replacement totals (counts successful replacements; not a rate)
SELECT
  COUNT(*) as total_replacements,
  COUNT(DISTINCT repost_source) as affected_sources,
  COUNT(DISTINCT original_username) as original_accounts
FROM repost_replacements;

-- Most frequently detected original accounts
SELECT
  original_username,
  COUNT(*) as repost_count
FROM repost_replacements
GROUP BY original_username
ORDER BY repost_count DESC
LIMIT 10;

-- Recent activity
SELECT
  DATE(detected_at) as date,
  COUNT(*) as replacements
FROM repost_replacements
GROUP BY DATE(detected_at)
ORDER BY date DESC
LIMIT 7;

Performance Metrics

  • Average processing time: 5-10 seconds per repost
  • Disk usage (temp): ~50-200 MB per non-monitored account (cleaned after use)
  • Cache hit rate: Monitor the repost_fetch_cache table for efficiency

Best Practices

Conservative (Low Resource Usage):

{
  "enabled": true,
  "hash_distance_threshold": 8,
  "fetch_cache_hours": 24,
  "max_posts_age_hours": 12,
  "cleanup_temp_files": true
}

Aggressive (Best Quality):

{
  "enabled": true,
  "hash_distance_threshold": 12,
  "fetch_cache_hours": 6,
  "max_posts_age_hours": 48,
  "cleanup_temp_files": true
}
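
Either profile can be stored partially and merged over defaults at load time. The sketch below assumes the defaults listed in Step 2.1 (threshold 10, cache 12 hours, max age 24 hours, cleanup on, feature off); the helper name `load_settings` and the validation rules are hypothetical.

```python
import json

DEFAULTS = {
    "enabled": False,
    "hash_distance_threshold": 10,
    "fetch_cache_hours": 12,
    "max_posts_age_hours": 24,
    "cleanup_temp_files": True,
}

NUMERIC_KEYS = ("hash_distance_threshold", "fetch_cache_hours", "max_posts_age_hours")


def load_settings(raw_json: str) -> dict:
    """Merge stored settings over the defaults and reject bad numeric values."""
    settings = {**DEFAULTS, **json.loads(raw_json)}
    for key in NUMERIC_KEYS:
        value = settings[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
    return settings


conservative = load_settings(
    '{"enabled": true, "hash_distance_threshold": 8, '
    '"fetch_cache_hours": 24, "max_posts_age_hours": 12}'
)
```

Keys missing from the stored JSON fall back to the defaults, so a profile only needs to list the values it changes.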

When to Use

Good for:

  • Accounts that frequently repost other users' stories
  • High-profile accounts with quality concerns
  • Archival purposes (want original high-res content)

Not needed for:

  • Accounts that rarely repost
  • Already monitored original accounts
  • Low-storage situations

Gradual Rollout Strategy

Week 1: Silent Monitoring

  • Enable feature
  • Monitor logs for detection rate
  • Don't interfere with workflow
  • Identify common patterns

Week 2: Selective Enable

  • Enable for 2-3 high-repost accounts
  • Verify replacements are correct
  • Check false positive rate
  • Monitor performance impact

Week 3: Broader Enable

  • Enable for all Instagram story downloaders
  • Monitor database growth
  • Check temp file cleanup
  • Validate quality improvements

Week 4+: Full Production

  • Feature stable and validated
  • Document edge cases found
  • Tune settings based on results
  • Consider expanding to other platforms

Support & Documentation

Documentation:

  • Design spec: /opt/media-downloader/docs/instagram_repost_detection_design.md
  • Test results: /opt/media-downloader/docs/repost_detection_test_results.md
  • This guide: /opt/media-downloader/docs/repost_detection_testing_guide.md

Test Scripts:

  • Unit tests: /opt/media-downloader/tests/test_instagram_repost_detector.py
  • Manual tests: /opt/media-downloader/tests/test_repost_detection_manual.py

Module Files:

  • Detector: /opt/media-downloader/modules/instagram_repost_detector.py
  • ImgInn: /opt/media-downloader/modules/imginn_module.py
  • Move: /opt/media-downloader/modules/move_module.py

Success Criteria

Feature is ready for production when:

  1. Disabled state doesn't affect existing functionality
  2. Enabled state successfully detects and replaces reposts
  3. No errors in logs during normal operation
  4. Temp files are cleaned up properly
  5. Database tracking works correctly
  6. Performance impact is acceptable
  7. False positive rate is low (<5%)
  8. Quality of replacements is consistently better

Ready to test! Start with Phase 1 to verify everything is safe, then gradually enable and test.