# Smart Download Workflow with Face Recognition & Deduplication

**Your Perfect Workflow**: Download → Check Duplicate → Check Face → Auto-Sort or Review

---

## 🎯 Your Exact Requirements

### What You Want

1. **Download image**
2. **Check if face matches** (using Immich face recognition)
3. **Check if duplicate** (using existing SHA256 hash system)
4. **Decision**:
   - ✅ **Match + Not Duplicate** → Move to final destination (`/faces/person_name/`)
   - ⚠️ **No Match OR Duplicate** → Move to holding/review directory (`/faces/review/`)

The implementation below runs the duplicate check before the face check: hashing is cheap, and a duplicate never needs an Immich scan.

### Why This Makes Sense

- ✅ **Automatic for good images** - Hands-off for images you want
- ✅ **Manual review for uncertain** - You decide on edge cases
- ✅ **No duplicates** - Leverages existing deduplication system
- ✅ **Clean organization** - Final destination is curated, high-quality
- ✅ **Nothing lost** - Everything goes somewhere (review or final)

---

## 🏗️ Complete Workflow Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         DOWNLOAD IMAGE                          │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                  STEP 1: Calculate SHA256 Hash                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
                    ┌───────────────┐
                    │ Is Duplicate? │
                    └───────┬───────┘
                            │
                ┌───────────┴────────────┐
                │                        │
               YES                      NO
                │                        │
                ▼                        ▼
         ┌─────────────┐        ┌─────────────────┐
         │ Move to     │        │ STEP 2: Trigger │
         │ REVIEW/     │        │ Immich Scan     │
         │ duplicates/ │        └────────┬────────┘
         └─────────────┘                 │
                                         ▼
                                 ┌───────────────┐
                                 │ Wait for Face │
                                 │ Detection     │
                                 └───────┬───────┘
                                         │
                                         ▼
                                 ┌───────────────────┐
                                 │ Query Immich DB:  │
                                 │ Who's in photo?   │
                                 └───────┬───────────┘
                                         │
                        ┌────────────────┴────────────────┐
                        │                                 │
                   IDENTIFIED                      NOT IDENTIFIED
                 (in whitelist)                  (unknown/unwanted)
                        │                                 │
                        ▼                                 ▼
               ┌─────────────────┐               ┌─────────────────┐
               │ Move to FINAL   │               │ Move to REVIEW/ │
               │ /faces/john/    │               │ unidentified/   │
               └─────────────────┘               └─────────────────┘
                        │
                        ▼
               ┌─────────────────┐
               │ Update Database │
               │ - Record path   │
               │ - Record person │
               │ - Mark complete │
               └─────────────────┘
```

---

## 📁 Directory Structure

```
/mnt/storage/Downloads/
│
├── temp_downloads/              # Temporary download location
│   └── [images downloaded here first]
│
├── faces/                       # Final curated collection
│   ├── john_doe/                # Auto-sorted, verified
│   │   ├── 20250131_120000.jpg
│   │   └── 20250131_130000.jpg
│   │
│   ├── sarah_smith/             # Auto-sorted, verified
│   │   └── 20250131_140000.jpg
│   │
│   └── family_member/
│       └── 20250131_150000.jpg
│
└── review/                      # Holding directory for manual review
    ├── duplicates/              # Duplicate images
    │   ├── duplicate_20250131_120000.jpg
    │   └── duplicate_20250131_130000.jpg
    │
    ├── unidentified/            # No faces or unknown faces
    │   ├── unknown_20250131_120000.jpg
    │   └── noface_20250131_130000.jpg
    │
    ├── low_confidence/          # Face detected but low match confidence
    │   └── lowconf_20250131_120000.jpg
    │
    ├── multiple_faces/          # Multiple people in image
    │   └── multi_20250131_120000.jpg
    │
    └── unwanted_person/         # Blacklisted person detected
        └── unwanted_20250131_120000.jpg
```

---

## 💻 Complete Implementation

### Core Smart Download Class

```python
#!/usr/bin/env python3
"""
Smart Download with Face Recognition & Deduplication
Downloads, checks faces, checks duplicates, auto-sorts or reviews
"""

import os
import re
import shutil
import hashlib
import logging
import time
import sqlite3
from datetime import datetime
from typing import Dict, Optional

import requests

logger = logging.getLogger(__name__)


class SmartDownloader:
    """Intelligent download with face recognition and deduplication"""

    def __init__(self, config, immich_db, unified_db):
        self.config = config
        self.immich_db = immich_db
        self.unified_db = unified_db

        # All settings live under "smart_download" in config.json,
        # nested the same way as in the Configuration section
        sd = config.get('smart_download', {})
        dirs = sd.get('directories', {})

        # Directories
        self.temp_dir = dirs.get('temp_dir', '/mnt/storage/Downloads/temp_downloads')
        self.final_base = dirs.get('final_base', '/mnt/storage/Downloads/faces')
        self.review_base = dirs.get('review_base', '/mnt/storage/Downloads/review')

        # Whitelist / blacklist
        self.whitelist = sd.get('whitelist', [])
        self.blacklist = sd.get('blacklist', [])

        # Thresholds
        self.min_confidence = sd.get('thresholds', {}).get('min_confidence', 0.6)
        self.immich_wait_time = sd.get('immich', {}).get('wait_time_seconds', 5)

        # Create directories
        self._create_directories()

    def _create_directories(self):
        """Create all required directories"""
        dirs = [
            self.temp_dir,
            self.final_base,
            self.review_base,
            os.path.join(self.review_base, 'duplicates'),
            os.path.join(self.review_base, 'unidentified'),
            os.path.join(self.review_base, 'low_confidence'),
            os.path.join(self.review_base, 'multiple_faces'),
            os.path.join(self.review_base, 'unwanted_person'),
        ]
        for d in dirs:
            os.makedirs(d, exist_ok=True)

    def smart_download(self, url: str, source: Optional[str] = None) -> Dict:
        """
        Smart download workflow: Download → Check → Sort or Review

        Args:
            url: URL to download
            source: Source identifier (e.g., 'instagram', 'forum')

        Returns:
            dict: {
                'status': 'success'|'error',
                'action': 'sorted'|'reviewed' (absent when status is 'error'),
                'destination': str,
                'reason': str,
                'person': str or None
            }
        """
        try:
            # STEP 1: Download to temp
            temp_path = self._download_to_temp(url)
            if not temp_path:
                return {'status': 'error', 'reason': 'download_failed'}

            # STEP 2: Check for duplicates
            file_hash = self._calculate_hash(temp_path)
            if self._is_duplicate(file_hash):
                return self._handle_duplicate(temp_path, file_hash)

            # STEP 3: Trigger Immich scan
            self._trigger_immich_scan(temp_path)

            # STEP 4: Wait for Immich to process (fixed delay, no polling)
            time.sleep(self.immich_wait_time)

            # STEP 5: Check faces
            faces = self.immich_db.get_faces_for_file(temp_path)

            # STEP 6: Make decision based on faces
            return self._process_faces(temp_path, faces, file_hash, source)

        except Exception as e:
            logger.error(f"Smart download failed for {url}: {e}")
            return {'status': 'error', 'reason': str(e)}

    def _download_to_temp(self, url: str) -> Optional[str]:
        """Download file to temporary location"""
        try:
            filename = f"temp_{datetime.now().strftime('%Y%m%d_%H%M%S')}.jpg"
            temp_path = os.path.join(self.temp_dir, filename)

            # Stand-in download via requests; swap in your existing
            # download logic (yt-dlp, etc.) here if you have one
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()
            with open(temp_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            logger.info(f"Downloaded to temp: {temp_path}")
            return temp_path

        except Exception as e:
            logger.error(f"Download failed for {url}: {e}")
            return None

    def _calculate_hash(self, file_path: str) -> str:
        """Calculate SHA256 hash of file"""
        sha256_hash = hashlib.sha256()
        with open(file_path, "rb") as f:
            # Read in 4 KB blocks so large files are not loaded into memory
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()

    def _is_duplicate(self, file_hash: str) -> bool:
        """Check if file hash already exists in database"""
        with sqlite3.connect(self.unified_db.db_path) as conn:
            cursor = conn.execute(
                "SELECT COUNT(*) FROM downloads WHERE file_hash = ?",
                (file_hash,)
            )
            count = cursor.fetchone()[0]
            return count > 0

    def _handle_duplicate(self, temp_path: str, file_hash: str) -> Dict:
        """Handle duplicate file - move to review/duplicates"""
        filename = os.path.basename(temp_path)
        review_path = os.path.join(
            self.review_base, 'duplicates', f"duplicate_{filename}"
        )
        shutil.move(temp_path, review_path)
        logger.info(f"Duplicate detected: {filename} → review/duplicates/")

        return {
            'status': 'success',
            'action': 'reviewed',
            'destination': review_path,
            'reason': 'duplicate',
            'hash': file_hash
        }

    def _trigger_immich_scan(self, file_path: str):
        """Trigger Immich to scan new file"""
        try:
            immich_url = self.config.get('immich', {}).get('url')
            api_key = self.config.get('immich', {}).get('api_key')

            if immich_url and api_key:
                # NOTE: the scan endpoint path differs between Immich versions;
                # verify it against the API docs of your server
                response = requests.post(
                    f"{immich_url}/api/library/scan",
                    headers={'x-api-key': api_key},
                    timeout=10
                )
                logger.debug(f"Triggered Immich scan: {response.status_code}")
        except Exception as e:
            logger.warning(f"Could not trigger Immich scan: {e}")

    def _process_faces(self, temp_path: str, faces: list, file_hash: str,
                       source: Optional[str] = None) -> Dict:
        """
        Process faces and decide: final destination or review

        Returns:
            dict with status, action, destination, reason
        """
        filename = os.path.basename(temp_path)

        # NO FACES DETECTED
        if not faces:
            return self._move_to_review(
                temp_path, 'unidentified', f"noface_{filename}",
                'no_faces_detected'
            )

        # MULTIPLE FACES
        if len(faces) > 1:
            return self._move_to_review(
                temp_path, 'multiple_faces', f"multi_{filename}",
                f'multiple_faces ({len(faces)} people)'
            )

        # SINGLE FACE - Process
        face = faces[0]
        person_name = face.get('person_name')
        confidence = face.get('confidence', 1.0)

        # UNKNOWN FACE (detected but not matched to a named person)
        if not person_name:
            return self._move_to_review(
                temp_path, 'unidentified', f"unknown_{filename}",
                'unknown_face'
            )

        # BLACKLIST CHECK
        if self.blacklist and person_name in self.blacklist:
            return self._move_to_review(
                temp_path, 'unwanted_person', f"unwanted_{filename}",
                f'blacklisted_person: {person_name}'
            )

        # WHITELIST CHECK
        if self.whitelist and person_name not in self.whitelist:
            return self._move_to_review(
                temp_path, 'unidentified', f"notwhitelisted_{filename}",
                f'not_in_whitelist: {person_name}'
            )

        # CONFIDENCE CHECK (if we have confidence data)
        if confidence < self.min_confidence:
            return self._move_to_review(
                temp_path, 'low_confidence', f"lowconf_{filename}",
                f'low_confidence: {confidence:.2f}'
            )

        # ALL CHECKS PASSED - Move to final destination
        return self._move_to_final(temp_path, person_name, file_hash, source)

    def _move_to_final(self, temp_path: str, person_name: str, file_hash: str,
                       source: Optional[str] = None) -> Dict:
        """Move to final destination and record in database"""
        # Create person directory
        person_dir_name = self._sanitize_name(person_name)
        person_dir = os.path.join(self.final_base, person_dir_name)
        os.makedirs(person_dir, exist_ok=True)

        # Move file
        filename = os.path.basename(temp_path)
        final_path = os.path.join(person_dir, filename)

        # Handle name collisions in destination
        if os.path.exists(final_path):
            base, ext = os.path.splitext(filename)
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f"{base}_{timestamp}{ext}"
            final_path = os.path.join(person_dir, filename)

        shutil.move(temp_path, final_path)

        # Record in database
        self._record_download(final_path, person_name, file_hash, source)

        logger.info(f"✓ Auto-sorted: {filename} → {person_name}/")

        return {
            'status': 'success',
            'action': 'sorted',
            'destination': final_path,
            'reason': 'face_match_verified',
            'person': person_name,
            'hash': file_hash
        }

    def _move_to_review(self, temp_path: str, category: str,
                        new_filename: str, reason: str) -> Dict:
        """Move to review directory for manual processing"""
        review_dir = os.path.join(self.review_base, category)
        review_path = os.path.join(review_dir, new_filename)

        # Handle name collisions
        if os.path.exists(review_path):
            base, ext = os.path.splitext(new_filename)
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            new_filename = f"{base}_{timestamp}{ext}"
            review_path = os.path.join(review_dir, new_filename)

        shutil.move(temp_path, review_path)

        logger.info(f"⚠ Needs review: {new_filename} → review/{category}/ ({reason})")

        return {
            'status': 'success',
            'action': 'reviewed',
            'destination': review_path,
            'reason': reason,
            'category': category
        }

    def _record_download(self, file_path: str, person_name: str,
                         file_hash: str, source: Optional[str] = None):
        """Record successful download in database"""
        with sqlite3.connect(self.unified_db.db_path) as conn:
            conn.execute("""
                INSERT INTO downloads
                (file_path, filename, file_hash, source, person_name,
                 download_date, auto_sorted)
                VALUES (?, ?, ?, ?, ?, ?, 1)
            """, (
                file_path,
                os.path.basename(file_path),
                file_hash,
                source,
                person_name,
                datetime.now().isoformat()
            ))
            conn.commit()

    def _sanitize_name(self, name: str) -> str:
        """Convert person name to safe directory name"""
        safe = re.sub(r'[^\w\s-]', '', name)
        safe = re.sub(r'[-\s]+', '_', safe)
        return safe.lower()

    # REVIEW QUEUE MANAGEMENT

    def get_review_queue(self, category: Optional[str] = None) -> list:
        """Get files in review queue, newest first"""
        if category:
            categories = [category]
        else:
            categories = ['duplicates', 'unidentified', 'low_confidence',
                          'multiple_faces', 'unwanted_person']

        queue = []
        for cat in categories:
            cat_dir = os.path.join(self.review_base, cat)
            if os.path.exists(cat_dir):
                for f in os.listdir(cat_dir):
                    path = os.path.join(cat_dir, f)
                    queue.append({
                        'category': cat,
                        'filename': f,
                        'path': path,
                        'size': os.path.getsize(path),
                        'modified': os.path.getmtime(path)
                    })

        return sorted(queue, key=lambda x: x['modified'], reverse=True)

    def approve_review_item(self, file_path: str, person_name: str) -> Dict:
        """Manually approve a review item and move to final destination"""
        if not os.path.exists(file_path):
            return {'status': 'error', 'reason': 'file_not_found'}

        # Calculate hash
        file_hash = self._calculate_hash(file_path)

        # Move to final destination
        return self._move_to_final(file_path, person_name, file_hash,
                                   source='manual_review')

    def reject_review_item(self, file_path: str) -> Dict:
        """Delete a review item"""
        if not os.path.exists(file_path):
            return {'status': 'error', 'reason': 'file_not_found'}

        os.remove(file_path)
        logger.info(f"Rejected and deleted: {file_path}")

        return {
            'status': 'success',
            'action': 'deleted',
            'path': file_path
        }
```

---

## ⚙️ Configuration

### Add to `config.json`:

```json
{
  "smart_download": {
    "enabled": true,
    "directories": {
      "temp_dir": "/mnt/storage/Downloads/temp_downloads",
      "final_base": "/mnt/storage/Downloads/faces",
      "review_base": "/mnt/storage/Downloads/review"
    },
    "whitelist": [
      "john_doe",
      "sarah_smith",
      "family_member_1"
    ],
    "blacklist": [
      "ex_partner",
      "stranger"
    ],
    "thresholds": {
      "min_confidence": 0.6,
      "max_faces_per_image": 1
    },
    "immich": {
      "wait_time_seconds": 5,
      "trigger_scan": true,
      "retry_if_no_faces": true,
      "max_retries": 2
    },
    "deduplication": {
      "check_hash": true,
      "action_on_duplicate": "move_to_review"
    },
    "review_categories": {
      "duplicates": true,
      "unidentified": true,
      "low_confidence": true,
      "multiple_faces": true,
      "unwanted_person": true
    }
  }
}
```

---

## 🔄 Integration with Existing Download System

### Modify Download Completion Hook

```python
def on_download_complete(url: str, temp_path: str, source: str):
    """
    Called when download completes
    Now uses smart download workflow
    """
    # config, immich_db, unified_db, send_notification and
    # legacy_download_handler come from the existing download system
    if config.get('smart_download', {}).get('enabled', False):
        # Use smart download workflow
        smart = SmartDownloader(config, immich_db, unified_db)
        result = smart.smart_download(url, source)

        logger.info(f"Smart download result: {result}")

        # Send notification (error results carry no 'action' key)
        action = result.get('action')
        if action == 'sorted':
            send_notification(
                f"✓ Auto-sorted to {result['person']}",
                result['destination']
            )
        elif action == 'reviewed':
            send_notification(
                f"⚠ Needs review: {result['reason']}",
                result['destination']
            )

        return result
    else:
        # Fall back to old workflow
        return legacy_download_handler(url, temp_path, source)
```

---

## 📊 Database Schema Addition

```sql
-- Add person_name and auto_sorted columns to downloads table
ALTER TABLE downloads ADD COLUMN person_name TEXT;
ALTER TABLE downloads ADD COLUMN auto_sorted INTEGER DEFAULT 0;

-- Create index for quick person lookups
CREATE INDEX idx_downloads_person ON downloads(person_name);
CREATE INDEX idx_downloads_auto_sorted ON downloads(auto_sorted);

-- Create review queue table
CREATE TABLE review_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL,
    category TEXT NOT NULL,        -- duplicates, unidentified, etc.
    file_hash TEXT,
    reason TEXT,
    faces_detected INTEGER DEFAULT 0,
    suggested_person TEXT,
    created_at TEXT,
    reviewed_at TEXT,
    reviewed_by TEXT,
    action TEXT                    -- approved, rejected, pending
);

CREATE INDEX idx_review_category ON review_queue(category);
CREATE INDEX idx_review_action ON review_queue(action);
```

---

## 🎨 Web UI - Review Queue Page

### Review Queue Interface

```
┌─────────────────────────────────────────────────────────────────┐
│ Review Queue (42 items)                                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Filter: [All ▼]  [Duplicates: 5]  [Unidentified: 28]           │
│          [Low Confidence: 6]  [Multiple Faces: 3]               │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  [Image Thumbnail]                                      │   │
│   │                                                         │   │
│   │  Category: Unidentified                                 │   │
│   │  Reason: No faces detected by Immich                    │   │
│   │  File: instagram_profile_20250131_120000.jpg            │   │
│   │  Size: 2.4 MB                                           │   │
│   │  Downloaded: 2025-01-31 12:00:00                        │   │
│   │                                                         │   │
│   │  This is: [Select Person ▼] or [New Person...]          │   │
│   │                                                         │   │
│   │  [✓ Approve & Sort]  [✗ Delete]  [→ Skip]               │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  [◄ Previous]  1 of 42  [Next ►]                                │
│                                                                 │
│  Bulk Actions: [Select All] [Delete Selected] [Export List]     │
└─────────────────────────────────────────────────────────────────┘
```

---

## 📡 API Endpoints (New)

```
# Review Queue
GET    /api/smart-download/review/queue              # Get all review items
GET    /api/smart-download/review/queue/{category}   # By category
POST   /api/smart-download/review/{id}/approve       # Approve and move to person
POST   /api/smart-download/review/{id}/reject        # Delete item
GET    /api/smart-download/review/stats              # Queue statistics

# Smart Download Control
GET    /api/smart-download/status
POST   /api/smart-download/enable
POST   /api/smart-download/disable

# Configuration
GET    /api/smart-download/config
PUT    /api/smart-download/config/whitelist
PUT    /api/smart-download/config/blacklist

# Statistics
GET    /api/smart-download/stats/today
GET    /api/smart-download/stats/summary
```

---

## 📈 Statistics & Reporting

```python
import sqlite3


def get_smart_download_stats(db_path: str, days: int = 30) -> dict:
    """Get smart download statistics for the last `days` days"""
    with sqlite3.connect(db_path) as conn:
        # Auto-sorted count
        auto_sorted = conn.execute("""
            SELECT COUNT(*) FROM downloads
            WHERE auto_sorted = 1
            AND download_date >= datetime('now', ? || ' days')
        """, (f'-{days}',)).fetchone()[0]

        # Review queue count
        in_review = conn.execute("""
            SELECT COUNT(*) FROM review_queue
            WHERE action = 'pending'
        """).fetchone()[0]

        # By person
        by_person = conn.execute("""
            SELECT person_name, COUNT(*)
            FROM downloads
            WHERE auto_sorted = 1
            AND download_date >= datetime('now', ? || ' days')
            GROUP BY person_name
        """, (f'-{days}',)).fetchall()

        # By review category
        by_category = conn.execute("""
            SELECT category, COUNT(*)
            FROM review_queue
            WHERE action = 'pending'
            GROUP BY category
        """).fetchall()

    total = auto_sorted + in_review
    return {
        'auto_sorted': auto_sorted,
        'in_review': in_review,
        'by_person': dict(by_person),
        'by_category': dict(by_category),
        # Share of processed images that were auto-sorted, in percent
        'success_rate': round(auto_sorted / total * 100, 1) if total > 0 else 0
    }

# Example output:
# {
#     'auto_sorted': 145,
#     'in_review': 23,
#     'by_person': {'john_doe': 85, 'sarah_smith': 60},
#     'by_category': {'unidentified': 15, 'duplicates': 5, 'multiple_faces': 3},
#     'success_rate': 86.3
# }
```

Note that `in_review` and `by_category` read from the `review_queue` table, which the `SmartDownloader` class above does not yet write to; the review system in Week 3 of the timeline needs to insert a `pending` row whenever a file is moved to review.

---

## 🎯 Example Usage

### Example 1: Download Instagram Profile

```python
# Download profile with smart workflow
downloader = SmartDownloader(config, immich_db, unified_db)

images = get_instagram_profile_images('username')

results = {
    'sorted': 0,
    'reviewed': 0,
    'errors': 0
}

for image_url in images:
    result = downloader.smart_download(image_url, source='instagram')
    action = result.get('action')  # missing when status is 'error'

    if action == 'sorted':
        results['sorted'] += 1
        print(f"✓ {result['person']}: {result['destination']}")
    elif action == 'reviewed':
        results['reviewed'] += 1
        print(f"⚠ Review needed ({result['reason']}): {result['destination']}")
    else:
        results['errors'] += 1

print(f"\nResults: {results['sorted']} sorted, {results['reviewed']} need review")

# Output:
# ✓ john_doe: /faces/john_doe/image1.jpg
# ✓ john_doe: /faces/john_doe/image2.jpg
# ⚠ Review needed (not_in_whitelist): /review/unidentified/image3.jpg
# ⚠ Review needed (duplicate): /review/duplicates/image4.jpg
# ✓ john_doe: /faces/john_doe/image5.jpg
#
# Results: 3 sorted, 2 need review
```

### Example 2: Process Review Queue

```python
# Get pending reviews
queue = downloader.get_review_queue()

print(f"Review queue: {len(queue)} items")

for item in queue:
    print(f"\nFile: {item['filename']}")
    print(f"Category: {item['category']}")
    print(f"Path: {item['path']}")

    # Manual decision
    action = input("Action (approve/reject/skip): ")

    if action == 'approve':
        person = input("Person name: ")
        result = downloader.approve_review_item(item['path'], person)
        print(f"✓ Approved and sorted to {person}")
    elif action == 'reject':
        downloader.reject_review_item(item['path'])
        print("✗ Deleted")
    else:
        print("→ Skipped")
```

---

## ✅ Advantages of This System

### 1. **Fully Automated for Good Cases**
- Matching face + not duplicate = auto-sorted
- No manual intervention needed for most images (target: >80%, see Success Metrics)

### 2. **Safe Review for Edge Cases**
- Duplicates flagged for review
- Unknown faces queued for identification
- Multiple faces queued for decision

### 3. **Leverages Existing Systems**
- Uses your SHA256 deduplication
- Uses Immich's face recognition
- Clean integration

### 4. **Nothing Lost**
- Every image goes somewhere
- Easy to find and review
- Can always approve later

### 5. **Flexible Configuration**
- Whitelist/blacklist
- Confidence thresholds
- Review categories

### 6. **Clear Audit Trail**
- Database tracks every auto-sorted and manually approved file
- Statistics available
- Can generate reports

---

## 🚀 Implementation Timeline

### Week 1: Core Workflow
- [ ] Create SmartDownloader class
- [ ] Implement download to temp
- [ ] Add hash checking
- [ ] Basic face checking
- [ ] Move to final/review logic

### Week 2: Immich Integration
- [ ] Connect to Immich DB
- [ ] Query face data
- [ ] Trigger Immich scans
- [ ] Handle face results

### Week 3: Review System
- [ ] Create review directories
- [ ] Review queue database
- [ ] Get/approve/reject methods
- [ ] Statistics

### Week 4: Web UI
- [ ] Review queue page
- [ ] Approve/reject interface
- [ ] Statistics dashboard
- [ ] Configuration page

### Week 5: Polish
- [ ] Error handling
- [ ] Notifications
- [ ] Documentation
- [ ] Testing

---

## 🎯 Success Metrics

After implementation, track:

- **Auto-sort rate**: % of images auto-sorted vs reviewed
  - **Target**: >80% auto-sorted
- **Duplicate catch rate**: % of byte-identical duplicates caught (SHA256 does not match resized or re-encoded copies)
  - **Target**: 100%
- **False positive rate**: % of incorrectly sorted images
  - **Target**: <5%
- **Review queue size**: Average pending items
  - **Target**: <50 items

---

## ✅ Your Perfect Workflow - Summary

```
Download → Hash Check → Face Check → Decision
               ↓             ↓
          Duplicate?      Matches?
               ↓             ↓
           ┌───┴───┐     ┌───┴────┐
          YES     NO    YES      NO
           ↓       ↓     ↓        ↓
        REVIEW Continue FINAL  REVIEW
```

**Final Destinations**:
- ✅ `/faces/john_doe/` - Verified, auto-sorted
- ⚠️ `/review/duplicates/` - Needs duplicate review
- ⚠️ `/review/unidentified/` - Needs face identification
- ⚠️ `/review/low_confidence/` - Low match confidence
- ⚠️ `/review/multiple_faces/` - Multiple people
- ⚠️ `/review/unwanted_person/` - Blacklisted person detected

**This is exactly what you wanted!**

---

**Last Updated**: 2025-10-31
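
### Appendix: Routing Decision as a Pure Function

The summary above reduces to a single decision: given the duplicate flag and the faces Immich reports, where does the file go? The sketch below isolates that decision so it can be unit-tested without a filesystem, database, or Immich instance. `route_image` is a hypothetical helper, not part of the class above; it assumes each face is a dict with `person_name` and an optional `confidence` key, mirroring what `_process_faces` reads.

```python
from typing import Sequence, Tuple


def route_image(
    is_duplicate: bool,
    faces: Sequence[dict],
    whitelist: Sequence[str] = (),
    blacklist: Sequence[str] = (),
    min_confidence: float = 0.6,
) -> Tuple[str, str]:
    """Return (target, detail).

    target is 'final' or 'review'; detail is the person name for
    'final' and the review category for 'review'.
    """
    # Duplicates never reach the face check
    if is_duplicate:
        return ("review", "duplicates")

    # No faces at all, or more than one person in the image
    if not faces:
        return ("review", "unidentified")
    if len(faces) > 1:
        return ("review", "multiple_faces")

    face = faces[0]
    person = face.get("person_name")

    # Face detected but not matched to a named person
    if not person:
        return ("review", "unidentified")

    # Blacklist wins over whitelist
    if person in blacklist:
        return ("review", "unwanted_person")

    # An empty whitelist means "accept any named person"
    if whitelist and person not in whitelist:
        return ("review", "unidentified")

    # Missing confidence is treated as a full match (assumption)
    if face.get("confidence", 1.0) < min_confidence:
        return ("review", "low_confidence")

    return ("final", person)


if __name__ == "__main__":
    known = [{"person_name": "john_doe", "confidence": 0.9}]
    print(route_image(False, known, whitelist=["john_doe"]))
    print(route_image(True, known, whitelist=["john_doe"]))
```

Keeping the decision separate from the file moves is one possible design; it makes the order of the checks (duplicate, face count, blacklist, whitelist, confidence) explicit and easy to change.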