Smart Download Workflow with Face Recognition & Deduplication
Your Perfect Workflow: Download → Check Face → Check Duplicate → Auto-Sort or Review
🎯 Your Exact Requirements
What You Want
- Download image
- Check if face matches (using Immich face recognition)
- Check if duplicate (using existing SHA256 hash system)
- Decision:
  - ✅ Match + Not Duplicate → Move to final destination (/faces/person_name/)
  - ⚠️ No Match OR Duplicate → Move to holding/review directory (/review/)
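The decision rule above can be sketched as a small pure function. This is only an illustration: the function name and the default base paths are assumptions, and the duplicate check is applied first, as in the workflow diagram later in this document.

```python
from typing import Optional


def decide_destination(person: Optional[str], is_duplicate: bool,
                       final_base: str = "/faces",
                       review_base: str = "/review") -> str:
    """Return the directory an image should be moved to.

    `person` is the matched person name, or None when no face matched.
    The base paths are illustrative defaults, not fixed by the workflow.
    """
    if is_duplicate:
        # Duplicates always go to review, whoever is in the picture.
        return f"{review_base}/duplicates"
    if person is None:
        # No match: hold the image for manual identification.
        return f"{review_base}/unidentified"
    # Match + not duplicate: safe to auto-sort.
    return f"{final_base}/{person}"
```

Keeping the rule free of file system access makes it easy to test separately from the download and move steps.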
Why This Makes Sense
- ✅ Automatic for good images - Hands-off for images you want
- ✅ Manual review for uncertain - You decide on edge cases
- ✅ No duplicates - Leverages existing deduplication system
- ✅ Clean organization - Final destination is curated, high-quality
- ✅ Nothing lost - Everything goes somewhere (review or final)
🏗️ Complete Workflow Architecture
┌─────────────────────────────────────────────────────────────────┐
│                         DOWNLOAD IMAGE                          │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                  STEP 1: Calculate SHA256 Hash                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
                    ┌───────────────┐
                    │ Is Duplicate? │
                    └───────┬───────┘
                            │
                ┌───────────┴────────────┐
                │                        │
               YES                      NO
                │                        │
                ▼                        ▼
         ┌─────────────┐        ┌─────────────────┐
         │ Move to     │        │ STEP 2: Trigger │
         │ REVIEW/     │        │ Immich Scan     │
         │ duplicates/ │        └────────┬────────┘
         └─────────────┘                 │
                                         ▼
                                 ┌───────────────┐
                                 │ Wait for Face │
                                 │ Detection     │
                                 └───────┬───────┘
                                         │
                                         ▼
                               ┌───────────────────┐
                               │ Query Immich DB:  │
                               │ Who's in photo?   │
                               └─────────┬─────────┘
                                         │
                        ┌────────────────┴────────────────┐
                        │                                 │
                   IDENTIFIED                      NOT IDENTIFIED
                 (in whitelist)                  (unknown/unwanted)
                        │                                 │
                        ▼                                 ▼
               ┌─────────────────┐               ┌─────────────────┐
               │ Move to FINAL   │               │ Move to REVIEW/ │
               │ /faces/john/    │               │ unidentified/   │
               └────────┬────────┘               └─────────────────┘
                        │
                        ▼
               ┌─────────────────┐
               │ Update Database │
               │ - Record path   │
               │ - Record person │
               │ - Mark complete │
               └─────────────────┘
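The "Wait for Face Detection" step is the least predictable part of the flow, because Immich may need more or less time than a fixed pause allows. A minimal sketch of polling with a bounded number of retries follows. The function name and its parameters are hypothetical, and `fetch_faces` stands in for whatever call queries the Immich database.

```python
import time
from typing import Callable, List


def wait_for_faces(fetch_faces: Callable[[], List[dict]],
                   wait_seconds: float = 5.0,
                   max_retries: int = 2,
                   sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Poll `fetch_faces` until it returns results or retries run out.

    Makes one initial attempt plus up to `max_retries` further attempts,
    pausing `wait_seconds` before each one.
    """
    for _ in range(max_retries + 1):
        sleep(wait_seconds)
        faces = fetch_faces()
        if faces:
            return faces
    # Still empty: either no faces exist or processing has not finished.
    return []
```

An empty result after the last attempt cannot tell "no faces in the image" apart from "Immich has not processed it yet", so routing such images to review rather than discarding them remains the safe choice.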
📁 Directory Structure
/mnt/storage/Downloads/
│
├── temp_downloads/              # Temporary download location
│   └── [images downloaded here first]
│
├── faces/                       # Final curated collection
│   ├── john_doe/                # Auto-sorted, verified
│   │   ├── 20250131_120000.jpg
│   │   └── 20250131_130000.jpg
│   │
│   ├── sarah_smith/             # Auto-sorted, verified
│   │   └── 20250131_140000.jpg
│   │
│   └── family_member/
│       └── 20250131_150000.jpg
│
└── review/                      # Holding directory for manual review
    ├── duplicates/              # Duplicate images
    │   ├── duplicate_20250131_120000.jpg
    │   └── duplicate_20250131_130000.jpg
    │
    ├── unidentified/            # No faces or unknown faces
    │   ├── unknown_20250131_120000.jpg
    │   └── noface_20250131_130000.jpg
    │
    ├── low_confidence/          # Face detected but low match confidence
    │   └── lowconf_20250131_120000.jpg
    │
    ├── multiple_faces/          # Multiple people in image
    │   └── multi_20250131_120000.jpg
    │
    └── unwanted_person/         # Blacklisted person detected
        └── unwanted_20250131_120000.jpg
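Because the file names in this layout are second-resolution timestamps, two images that arrive in the same second would map to the same path. One way to avoid overwriting is to append a counter until the name is free. The helper below is a sketch with a hypothetical name, not part of the existing system.

```python
import os


def unique_destination(directory: str, filename: str) -> str:
    """Return a path inside `directory` that does not exist yet.

    If `filename` is taken, try `name_1.ext`, `name_2.ext`, and so on.
    """
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(directory, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```

The check and the later move are not atomic, so this is adequate for a single downloader process but would need a lock or an exclusive-create step if several workers write to the same directory.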
💻 Complete Implementation
Core Smart Download Class
#!/usr/bin/env python3
"""
Smart Download with Face Recognition & Deduplication
Downloads, checks faces, checks duplicates, auto-sorts or reviews
"""
import os
import shutil
import hashlib
import logging
import time
import sqlite3
from pathlib import Path
from datetime import datetime
from typing import Dict, Optional

logger = logging.getLogger(__name__)


class SmartDownloader:
    """Intelligent download with face recognition and deduplication"""

    def __init__(self, config, immich_db, unified_db):
        self.config = config
        self.immich_db = immich_db
        self.unified_db = unified_db

        # Settings are read from the nested sections used in config.json
        # ("directories", "thresholds", "immich" under "smart_download")
        sd = config.get('smart_download', {})
        dirs = sd.get('directories', {})
        thresholds = sd.get('thresholds', {})
        immich_cfg = sd.get('immich', {})

        # Directories
        self.temp_dir = dirs.get('temp_dir',
                                 '/mnt/storage/Downloads/temp_downloads')
        self.final_base = dirs.get('final_base',
                                   '/mnt/storage/Downloads/faces')
        self.review_base = dirs.get('review_base',
                                    '/mnt/storage/Downloads/review')

        # Whitelist / blacklist
        self.whitelist = sd.get('whitelist', [])
        self.blacklist = sd.get('blacklist', [])

        # Thresholds
        self.min_confidence = thresholds.get('min_confidence', 0.6)
        self.immich_wait_time = immich_cfg.get('wait_time_seconds', 5)

        # Create directories
        self._create_directories()

    def _create_directories(self):
        """Create all required directories"""
        dirs = [
            self.temp_dir,
            self.final_base,
            self.review_base,
            os.path.join(self.review_base, 'duplicates'),
            os.path.join(self.review_base, 'unidentified'),
            os.path.join(self.review_base, 'low_confidence'),
            os.path.join(self.review_base, 'multiple_faces'),
            os.path.join(self.review_base, 'unwanted_person'),
        ]
        for d in dirs:
            os.makedirs(d, exist_ok=True)

    def smart_download(self, url: str, source: str = None) -> Dict:
        """
        Smart download workflow: Download → Check → Sort or Review

        Args:
            url: URL to download
            source: Source identifier (e.g., 'instagram', 'forum')

        Returns:
            dict: {
                'status': 'success'|'error',
                'action': 'sorted'|'reviewed'|'skipped',
                'destination': str,
                'reason': str,
                'person': str or None
            }
        """
        try:
            # STEP 1: Download to temp
            temp_path = self._download_to_temp(url)
            if not temp_path:
                # 'action' is always present so callers can branch on it safely
                return {'status': 'error', 'action': 'skipped',
                        'reason': 'download_failed'}

            # STEP 2: Check for duplicates
            file_hash = self._calculate_hash(temp_path)
            if self._is_duplicate(file_hash):
                return self._handle_duplicate(temp_path, file_hash)

            # STEP 3: Trigger Immich scan
            self._trigger_immich_scan(temp_path)

            # STEP 4: Wait for Immich to process
            time.sleep(self.immich_wait_time)

            # STEP 5: Check faces
            faces = self.immich_db.get_faces_for_file(temp_path)

            # STEP 6: Make decision based on faces
            return self._process_faces(temp_path, faces, file_hash, source)
        except Exception as e:
            logger.error(f"Smart download failed for {url}: {e}")
            return {'status': 'error', 'action': 'skipped', 'reason': str(e)}

    def _download_to_temp(self, url: str) -> Optional[str]:
        """Download file to temporary location"""
        try:
            # Use your existing download logic here.
            # Placeholder: only the target path is built; nothing is fetched
            # until download_file() below is wired in, so the hash step will
            # fail on a missing file if this is run unchanged.
            filename = f"temp_{datetime.now().strftime('%Y%m%d_%H%M%S')}.jpg"
            temp_path = os.path.join(self.temp_dir, filename)

            # Download file (use requests, yt-dlp, etc.)
            # download_file(url, temp_path)

            logger.info(f"Downloaded to temp: {temp_path}")
            return temp_path
        except Exception as e:
            logger.error(f"Download failed for {url}: {e}")
            return None

    def _calculate_hash(self, file_path: str) -> str:
        """Calculate SHA256 hash of file"""
        sha256_hash = hashlib.sha256()
        with open(file_path, "rb") as f:
            # Read in blocks so large files are never loaded fully into memory
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()

    def _is_duplicate(self, file_hash: str) -> bool:
        """Check if file hash already exists in database"""
        with sqlite3.connect(self.unified_db.db_path) as conn:
            cursor = conn.execute(
                "SELECT COUNT(*) FROM downloads WHERE file_hash = ?",
                (file_hash,)
            )
            count = cursor.fetchone()[0]
        return count > 0

    def _handle_duplicate(self, temp_path: str, file_hash: str) -> Dict:
        """Handle duplicate file - move to review/duplicates"""
        filename = os.path.basename(temp_path)
        review_path = os.path.join(
            self.review_base,
            'duplicates',
            f"duplicate_{filename}"
        )
        shutil.move(temp_path, review_path)
        logger.info(f"Duplicate detected: {filename} → review/duplicates/")
        return {
            'status': 'success',
            'action': 'reviewed',
            'destination': review_path,
            'reason': 'duplicate',
            'hash': file_hash
        }

    def _trigger_immich_scan(self, file_path: str):
        """Trigger Immich to scan new file"""
        try:
            import requests
            immich_url = self.config.get('immich', {}).get('url')
            api_key = self.config.get('immich', {}).get('api_key')
            if immich_url and api_key:
                # NOTE: verify this path against your Immich version's API
                # documentation; library scans are usually addressed per
                # library ID, so the exact route may differ.
                response = requests.post(
                    f"{immich_url}/api/library/scan",
                    headers={'x-api-key': api_key},
                    timeout=10  # do not hang the workflow if Immich is down
                )
                logger.debug(f"Triggered Immich scan: {response.status_code}")
        except Exception as e:
            logger.warning(f"Could not trigger Immich scan: {e}")

    def _process_faces(self, temp_path: str, faces: list, file_hash: str,
                       source: str = None) -> Dict:
        """
        Process faces and decide: final destination or review

        Returns:
            dict with status, action, destination, reason
        """
        filename = os.path.basename(temp_path)

        # NO FACES DETECTED
        if not faces:
            return self._move_to_review(
                temp_path,
                'unidentified',
                f"noface_{filename}",
                'no_faces_detected'
            )

        # MULTIPLE FACES
        if len(faces) > 1:
            return self._move_to_review(
                temp_path,
                'multiple_faces',
                f"multi_{filename}",
                f'multiple_faces ({len(faces)} people)'
            )

        # SINGLE FACE - Process
        face = faces[0]
        person_name = face.get('person_name')
        confidence = face.get('confidence', 1.0)

        # UNKNOWN FACE - detected but not matched to a named person.
        # Without this check a None name would reach _move_to_final
        # whenever the whitelist is empty.
        if not person_name:
            return self._move_to_review(
                temp_path,
                'unidentified',
                f"unknown_{filename}",
                'unknown_face'
            )

        # BLACKLIST CHECK
        if self.blacklist and person_name in self.blacklist:
            return self._move_to_review(
                temp_path,
                'unwanted_person',
                f"unwanted_{filename}",
                f'blacklisted_person: {person_name}'
            )

        # WHITELIST CHECK (an empty whitelist allows every named person)
        if self.whitelist and person_name not in self.whitelist:
            return self._move_to_review(
                temp_path,
                'unidentified',
                f"notwhitelisted_{filename}",
                f'not_in_whitelist: {person_name}'
            )

        # CONFIDENCE CHECK (if we have confidence data)
        if confidence < self.min_confidence:
            return self._move_to_review(
                temp_path,
                'low_confidence',
                f"lowconf_{filename}",
                f'low_confidence: {confidence:.2f}'
            )

        # ALL CHECKS PASSED - Move to final destination
        return self._move_to_final(
            temp_path,
            person_name,
            file_hash,
            source
        )

    def _move_to_final(self, temp_path: str, person_name: str,
                       file_hash: str, source: str = None) -> Dict:
        """Move to final destination and record in database"""
        # Create person directory
        person_dir_name = self._sanitize_name(person_name)
        person_dir = os.path.join(self.final_base, person_dir_name)
        os.makedirs(person_dir, exist_ok=True)

        # Move file
        filename = os.path.basename(temp_path)
        final_path = os.path.join(person_dir, filename)

        # Handle name collisions in destination
        if os.path.exists(final_path):
            base, ext = os.path.splitext(filename)
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f"{base}_{timestamp}{ext}"
            final_path = os.path.join(person_dir, filename)
        shutil.move(temp_path, final_path)

        # Record in database
        self._record_download(final_path, person_name, file_hash, source)
        logger.info(f"✓ Auto-sorted: {filename} → {person_name}/")
        return {
            'status': 'success',
            'action': 'sorted',
            'destination': final_path,
            'reason': 'face_match_verified',
            'person': person_name,
            'hash': file_hash
        }

    def _move_to_review(self, temp_path: str, category: str,
                        new_filename: str, reason: str) -> Dict:
        """Move to review directory for manual processing"""
        review_dir = os.path.join(self.review_base, category)
        review_path = os.path.join(review_dir, new_filename)

        # Handle name collisions
        if os.path.exists(review_path):
            base, ext = os.path.splitext(new_filename)
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            new_filename = f"{base}_{timestamp}{ext}"
            review_path = os.path.join(review_dir, new_filename)
        shutil.move(temp_path, review_path)
        logger.info(f"⚠ Needs review: {new_filename} → review/{category}/ ({reason})")
        return {
            'status': 'success',
            'action': 'reviewed',
            'destination': review_path,
            'reason': reason,
            'category': category
        }

    def _record_download(self, file_path: str, person_name: str,
                         file_hash: str, source: str = None):
        """Record successful download in database"""
        with sqlite3.connect(self.unified_db.db_path) as conn:
            conn.execute("""
                INSERT INTO downloads
                (file_path, filename, file_hash, source, person_name,
                 download_date, auto_sorted)
                VALUES (?, ?, ?, ?, ?, ?, 1)
            """, (
                file_path,
                os.path.basename(file_path),
                file_hash,
                source,
                person_name,
                datetime.now().isoformat()
            ))
            conn.commit()

    def _sanitize_name(self, name: str) -> str:
        """Convert person name to safe directory name"""
        import re
        # Drop anything that is not a word character, whitespace or hyphen
        safe = re.sub(r'[^\w\s-]', '', name)
        # Collapse runs of hyphens and whitespace into single underscores
        safe = re.sub(r'[-\s]+', '_', safe)
        return safe.lower()

    # REVIEW QUEUE MANAGEMENT

    def get_review_queue(self, category: str = None) -> list:
        """Get files in review queue, newest first"""
        if category:
            categories = [category]
        else:
            categories = ['duplicates', 'unidentified', 'low_confidence',
                          'multiple_faces', 'unwanted_person']
        queue = []
        for cat in categories:
            cat_dir = os.path.join(self.review_base, cat)
            if os.path.exists(cat_dir):
                for f in os.listdir(cat_dir):
                    path = os.path.join(cat_dir, f)
                    queue.append({
                        'category': cat,
                        'filename': f,
                        'path': path,
                        'size': os.path.getsize(path),
                        'modified': os.path.getmtime(path)
                    })
        return sorted(queue, key=lambda x: x['modified'], reverse=True)

    def approve_review_item(self, file_path: str, person_name: str) -> Dict:
        """Manually approve a review item and move to final destination"""
        if not os.path.exists(file_path):
            return {'status': 'error', 'reason': 'file_not_found'}

        # Calculate hash
        file_hash = self._calculate_hash(file_path)

        # Move to final destination
        return self._move_to_final(file_path, person_name, file_hash,
                                   source='manual_review')

    def reject_review_item(self, file_path: str) -> Dict:
        """Delete a review item"""
        if not os.path.exists(file_path):
            return {'status': 'error', 'reason': 'file_not_found'}
        os.remove(file_path)
        logger.info(f"Rejected and deleted: {file_path}")
        return {
            'status': 'success',
            'action': 'deleted',
            'path': file_path
        }
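The class leaves the actual transfer in `_download_to_temp` as a placeholder (`download_file(url, temp_path)`). A minimal standard-library sketch of such a helper is shown below. The name `download_file` comes from the commented-out call above, but its signature and behaviour here are assumptions; the existing download logic (requests, yt-dlp, etc.) may well be a better fit.

```python
import shutil
import urllib.request


def download_file(url: str, dest_path: str, timeout: float = 30.0) -> int:
    """Stream `url` to `dest_path` and return the number of bytes written.

    The response is copied in chunks, so large files are never held
    fully in memory.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
        return out.tell()
```

`urlopen` also accepts `file://` URLs, which is convenient for local testing; for untrusted input it would be sensible to accept only `http` and `https` before calling it.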
⚙️ Configuration
Add to config.json:
{
  "smart_download": {
    "enabled": true,
    "directories": {
      "temp_dir": "/mnt/storage/Downloads/temp_downloads",
      "final_base": "/mnt/storage/Downloads/faces",
      "review_base": "/mnt/storage/Downloads/review"
    },
    "whitelist": [
      "john_doe",
      "sarah_smith",
      "family_member_1"
    ],
    "blacklist": [
      "ex_partner",
      "stranger"
    ],
    "thresholds": {
      "min_confidence": 0.6,
      "max_faces_per_image": 1
    },
    "immich": {
      "wait_time_seconds": 5,
      "trigger_scan": true,
      "retry_if_no_faces": true,
      "max_retries": 2
    },
    "deduplication": {
      "check_hash": true,
      "action_on_duplicate": "move_to_review"
    },
    "review_categories": {
      "duplicates": true,
      "unidentified": true,
      "low_confidence": true,
      "multiple_faces": true,
      "unwanted_person": true
    }
  }
}
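The whitelist entries above use a directory-style form (`john_doe`), while the names stored in Immich may be display names such as "John Doe". If that is the case (an assumption, since it depends on how people are labelled in Immich), comparing raw strings would send every image to review. A sketch of comparing normalized names, reusing the same rules as `_sanitize_name`, with a hypothetical `is_whitelisted` helper:

```python
import re
from typing import Iterable


def sanitize_name(name: str) -> str:
    """Normalize a person name: drop punctuation, join words with '_'."""
    safe = re.sub(r"[^\w\s-]", "", name)
    safe = re.sub(r"[-\s]+", "_", safe.strip())
    return safe.lower()


def is_whitelisted(person_name: str, whitelist: Iterable[str]) -> bool:
    """Return True if `person_name` matches a whitelist entry.

    Both sides are normalized before comparison. An empty whitelist
    allows everyone, mirroring the check in `_process_faces`.
    """
    allowed = {sanitize_name(entry) for entry in whitelist}
    if not allowed:
        return True
    return sanitize_name(person_name) in allowed
```

The same normalization could be applied to the blacklist so that both lists behave consistently.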
🔄 Integration with Existing Download System
Modify Download Completion Hook
def on_download_complete(url: str, temp_path: str, source: str):
    """
    Called when download completes
    Now uses smart download workflow
    """
    if config.get('smart_download', {}).get('enabled', False):
        # Use smart download workflow
        smart = SmartDownloader(config, immich_db, unified_db)
        result = smart.smart_download(url, source)
        logger.info(f"Smart download result: {result}")

        # Send notification (.get avoids a KeyError on error results,
        # which may not include an 'action' key)
        action = result.get('action')
        if action == 'sorted':
            send_notification(
                f"✓ Auto-sorted to {result['person']}",
                result['destination']
            )
        elif action == 'reviewed':
            send_notification(
                f"⚠ Needs review: {result['reason']}",
                result['destination']
            )
        return result
    else:
        # Fall back to old workflow
        return legacy_download_handler(url, temp_path, source)
📊 Database Schema Addition
-- Add person_name and auto_sorted columns to downloads table
ALTER TABLE downloads ADD COLUMN person_name TEXT;
ALTER TABLE downloads ADD COLUMN auto_sorted INTEGER DEFAULT 0;
-- Create index for quick person lookups
CREATE INDEX idx_downloads_person ON downloads(person_name);
CREATE INDEX idx_downloads_auto_sorted ON downloads(auto_sorted);
-- Create review queue table
CREATE TABLE review_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL,
    category TEXT NOT NULL,            -- duplicates, unidentified, etc.
    file_hash TEXT,
    reason TEXT,
    faces_detected INTEGER DEFAULT 0,
    suggested_person TEXT,
    created_at TEXT,
    reviewed_at TEXT,
    reviewed_by TEXT,
    action TEXT DEFAULT 'pending'      -- approved, rejected, pending
);
CREATE INDEX idx_review_category ON review_queue(category);
CREATE INDEX idx_review_action ON review_queue(action);
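The `SmartDownloader` class moves files into the review directories but does not itself write to `review_queue`, so the table stays empty unless something records each item. A minimal sketch of such a recording step follows; the function name `enqueue_review` is hypothetical, and the columns are the ones defined in the schema above.

```python
import sqlite3
from datetime import datetime
from typing import Optional


def enqueue_review(conn: sqlite3.Connection, file_path: str, category: str,
                   reason: str, file_hash: Optional[str] = None,
                   faces_detected: int = 0,
                   suggested_person: Optional[str] = None) -> int:
    """Insert a pending row into review_queue and return its id."""
    cursor = conn.execute(
        """
        INSERT INTO review_queue
            (file_path, category, file_hash, reason,
             faces_detected, suggested_person, created_at, action)
        VALUES (?, ?, ?, ?, ?, ?, ?, 'pending')
        """,
        (file_path, category, file_hash, reason,
         faces_detected, suggested_person, datetime.now().isoformat()),
    )
    conn.commit()
    return cursor.lastrowid
```

Calling this from `_move_to_review` (and from `_handle_duplicate`) would keep the table in step with the directories, which the statistics queries later in this document rely on.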
🎨 Web UI - Review Queue Page
Review Queue Interface
┌─────────────────────────────────────────────────────────────────┐
│ Review Queue (42 items)                                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Filter: [All ▼] [Duplicates: 5] [Unidentified: 28]              │
│         [Low Confidence: 6] [Multiple Faces: 3]                 │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ [Image Thumbnail]                                       │   │
│   │                                                         │   │
│   │ Category: Unidentified                                  │   │
│   │ Reason: No faces detected by Immich                     │   │
│   │ File: instagram_profile_20250131_120000.jpg             │   │
│   │ Size: 2.4 MB                                            │   │
│   │ Downloaded: 2025-01-31 12:00:00                         │   │
│   │                                                         │   │
│   │ This is: [Select Person ▼] or [New Person...]           │   │
│   │                                                         │   │
│   │ [✓ Approve & Sort] [✗ Delete] [→ Skip]                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│                  [◄ Previous] 1 of 42 [Next ►]                  │
│                                                                 │
│ Bulk Actions: [Select All] [Delete Selected] [Export List]      │
└─────────────────────────────────────────────────────────────────┘
📡 API Endpoints (New)
# Review Queue
GET /api/smart-download/review/queue # Get all review items
GET /api/smart-download/review/queue/{category} # By category
POST /api/smart-download/review/{id}/approve # Approve and move to person
POST /api/smart-download/review/{id}/reject # Delete item
GET /api/smart-download/review/stats # Queue statistics
# Smart Download Control
GET /api/smart-download/status
POST /api/smart-download/enable
POST /api/smart-download/disable
# Configuration
GET /api/smart-download/config
PUT /api/smart-download/config/whitelist
PUT /api/smart-download/config/blacklist
# Statistics
GET /api/smart-download/stats/today
GET /api/smart-download/stats/summary
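The endpoint list does not specify response bodies. As one possibility, the payload for `GET /api/smart-download/review/stats` could be built directly from the list that `get_review_queue()` returns. The function name and the response shape below are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List


def review_stats(queue: List[dict]) -> Dict:
    """Summarize review items (dicts with a 'category' key) for the API."""
    by_category = Counter(item["category"] for item in queue)
    return {
        "total": len(queue),
        "by_category": dict(by_category),
    }
```

Keeping this as a plain function means it can be reused by whichever web framework ends up serving the endpoints.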
📈 Statistics & Reporting
import sqlite3


def get_smart_download_stats(db_path: str, days: int = 30) -> dict:
    """Get smart download statistics for the last `days` days"""
    with sqlite3.connect(db_path) as conn:
        # Auto-sorted count
        auto_sorted = conn.execute("""
            SELECT COUNT(*)
            FROM downloads
            WHERE auto_sorted = 1
            AND download_date >= datetime('now', ? || ' days')
        """, (f'-{days}',)).fetchone()[0]

        # Review queue count
        in_review = conn.execute("""
            SELECT COUNT(*)
            FROM review_queue
            WHERE action = 'pending'
        """).fetchone()[0]

        # By person
        by_person = conn.execute("""
            SELECT person_name, COUNT(*)
            FROM downloads
            WHERE auto_sorted = 1
            AND download_date >= datetime('now', ? || ' days')
            GROUP BY person_name
        """, (f'-{days}',)).fetchall()

        # By review category
        by_category = conn.execute("""
            SELECT category, COUNT(*)
            FROM review_queue
            WHERE action = 'pending'
            GROUP BY category
        """).fetchall()

    total = auto_sorted + in_review
    return {
        'auto_sorted': auto_sorted,
        'in_review': in_review,
        'by_person': dict(by_person),
        'by_category': dict(by_category),
        # Percentage of processed images that were auto-sorted, one decimal
        'success_rate': round(auto_sorted / total * 100, 1) if total > 0 else 0
    }
# Example output:
# {
# 'auto_sorted': 145,
# 'in_review': 23,
# 'by_person': {'john_doe': 85, 'sarah_smith': 60},
# 'by_category': {'unidentified': 15, 'duplicates': 5, 'multiple_faces': 3},
# 'success_rate': 86.3
# }
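The `success_rate` in the example output can be checked by hand: 145 / (145 + 23) × 100 ≈ 86.3. A small standalone helper makes the arithmetic and the rounding explicit; the function name is only illustrative.

```python
def success_rate(auto_sorted: int, in_review: int) -> float:
    """Percentage of images that were auto-sorted, to one decimal place."""
    total = auto_sorted + in_review
    if total == 0:
        return 0.0  # nothing processed yet, avoid dividing by zero
    return round(auto_sorted / total * 100, 1)


print(success_rate(145, 23))  # 86.3
```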
🎯 Example Usage
Example 1: Download Instagram Profile
# Download profile with smart workflow
downloader = SmartDownloader(config, immich_db, unified_db)
images = get_instagram_profile_images('username')
results = {
    'sorted': 0,
    'reviewed': 0,
    'errors': 0
}
for image_url in images:
    result = downloader.smart_download(image_url, source='instagram')
    # .get() because error results may not include an 'action' key
    action = result.get('action')
    if action == 'sorted':
        results['sorted'] += 1
        print(f"✓ {result['person']}: {result['destination']}")
    elif action == 'reviewed':
        results['reviewed'] += 1
        print(f"⚠ Review needed ({result['reason']}): {result['destination']}")
    else:
        results['errors'] += 1
print(f"\nResults: {results['sorted']} sorted, {results['reviewed']} need review")
# Output:
# ✓ john_doe: /faces/john_doe/image1.jpg
# ✓ john_doe: /faces/john_doe/image2.jpg
# ⚠ Review needed (not_in_whitelist): /review/unidentified/image3.jpg
# ⚠ Review needed (duplicate): /review/duplicates/image4.jpg
# ✓ john_doe: /faces/john_doe/image5.jpg
#
# Results: 3 sorted, 2 need review
Example 2: Process Review Queue
# Get pending reviews
queue = downloader.get_review_queue()
print(f"Review queue: {len(queue)} items")
for item in queue:
    print(f"\nFile: {item['filename']}")
    print(f"Category: {item['category']}")
    print(f"Path: {item['path']}")

    # Manual decision
    action = input("Action (approve/reject/skip): ")
    if action == 'approve':
        person = input("Person name: ")
        result = downloader.approve_review_item(item['path'], person)
        print(f"✓ Approved and sorted to {person}")
    elif action == 'reject':
        downloader.reject_review_item(item['path'])
        print("✗ Deleted")
    else:
        print("→ Skipped")
✅ Advantages of This System
1. Fully Automated for Good Cases
- Matching face + not duplicate = auto-sorted
- No manual intervention needed for most images (the target under Success Metrics is >80% auto-sorted)
2. Safe Review for Edge Cases
- Duplicates flagged for review
- Unknown faces queued for identification
- Multiple faces queued for decision
3. Leverages Existing Systems
- Uses your SHA256 deduplication
- Uses Immich's face recognition
- Clean integration
4. Nothing Lost
- Every image goes somewhere
- Easy to find and review
- Can always approve later
5. Flexible Configuration
- Whitelist/blacklist
- Confidence thresholds
- Review categories
6. Clear Audit Trail
- Database tracks everything
- Statistics available
- Can generate reports
🚀 Implementation Timeline
Week 1: Core Workflow
- Create SmartDownloader class
- Implement download to temp
- Add hash checking
- Basic face checking
- Move to final/review logic
Week 2: Immich Integration
- Connect to Immich DB
- Query face data
- Trigger Immich scans
- Handle face results
Week 3: Review System
- Create review directories
- Review queue database
- Get/approve/reject methods
- Statistics
Week 4: Web UI
- Review queue page
- Approve/reject interface
- Statistics dashboard
- Configuration page
Week 5: Polish
- Error handling
- Notifications
- Documentation
- Testing
🎯 Success Metrics
After implementation, track:
- Auto-sort rate: % of images auto-sorted vs reviewed
  - Target: >80% auto-sorted
- Duplicate catch rate: % of exact (byte-identical) duplicates caught by the SHA256 check
  - Target: 100%
- False positive rate: % of incorrectly sorted images
  - Target: <5%
- Review queue size: Average pending items
  - Target: <50 items
✅ Your Perfect Workflow - Summary
Download → Hash Check → Face Check → Decision
               ↓             ↓
           Duplicate?     Matches?
               ↓             ↓
           ┌───┴───┐     ┌───┴────┐
          YES      NO   YES       NO
           ↓       ↓     ↓        ↓
       REVIEW Continue FINAL    REVIEW
Final Destinations:
- ✅ /faces/john_doe/ - Verified, auto-sorted
- ⚠️ /review/duplicates/ - Needs duplicate review
- ⚠️ /review/unidentified/ - Needs face identification
- ⚠️ /review/low_confidence/ - Low match confidence
- ⚠️ /review/multiple_faces/ - Multiple people
This is exactly what you wanted!
Last Updated: 2025-10-31