# File Inventory Architecture
**Version:** 6.33.5
**Date:** 2025-11-16
**Status:** Implementation Phase
---
## 📋 Overview
The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.
---
## 🎯 Problem Statement
### Current Issues
1. **Performance**: Pages scan entire directory trees on every load (2,493+ files)
2. **Accuracy**: The database is only 79.93% accurate; files on disk don't match database records
3. **Stale Records**: Downloaded files moved/deleted but database not updated
4. **Missing Records**: 1,733+ files on disk with no database entries
### Root Cause
Multiple systems track files independently:
- **Download modules** record to `downloads` table during download
- **move_module** updates paths when moving files
- **Filesystem** is the actual source of truth
- **API endpoints** scan filesystem (ignoring database)
Result: Database and filesystem drift apart over time.
---
## 💡 Solution: file_inventory Table
### Architecture
**Single Source of Truth**: `file_inventory` table tracks ALL files in their current locations.
```
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   move_module   │──> file_inventory table (current locations)
└─────────────────┘              │
                                 ▼
                         ┌───────────────┐
                         │ API Endpoints │
                         │ (Fast Queries)│
                         └───────────────┘
```
### Separation of Concerns
| Table | Purpose | Updates | Deletions |
|-------|---------|---------|-----------|
| `downloads` | Historical audit trail | Never | Never |
| `file_inventory` | Current file locations | On every move | When file deleted |
| `recycle_bin` | Deleted files (restore capability) | On delete/restore | On permanent delete |
---
## 🗄️ Database Schema
### file_inventory Table
```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,      -- Absolute path (current location)
    filename TEXT NOT NULL,              -- Basename for display
    platform TEXT NOT NULL,              -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                         -- Username, forum name, etc.
    content_type TEXT,                   -- 'image' or 'video'
    file_size INTEGER,                   -- Size in bytes
    file_hash TEXT,                      -- SHA-256 for deduplication
    width INTEGER,                       -- Image/video width (from metadata cache)
    height INTEGER,                      -- Image/video height
    location TEXT NOT NULL,              -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,             -- Last time file existence was verified
    metadata JSON                        -- Additional metadata (face recognition, etc.)
);

-- SQLite does not support inline INDEX clauses in CREATE TABLE;
-- the indexes for fast queries are created separately:
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
```
### Field Descriptions
- **file_path**: Full absolute path (e.g., `/opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg`)
- **location**: Current location type
- `'final'` - In final destination directory (ready for Immich)
- `'review'` - In review queue (no face match, pending manual review)
- `'recycle'` - In recycle bin (soft deleted, can be restored)
- **created_date**: When file was moved to current location (not original download date)
- **last_verified**: Background task updates this when verifying file still exists
---
## 📂 File Locations
### Final Destinations (location='final')
Configured in settings table, per platform:
```
/opt/immich/md/
├── social media/
│ ├── instagram/
│ │ ├── posts/
│ │ ├── stories/
│ │ ├── reels/
│ │ └── tagged/
│ ├── snapchat/stories/
│ └── tiktok/reels/
├── forums/
│ ├── HQCelebCorner/
│ └── PicturePub/
└── gallery/
└── Coppermine/
```
**Settings locations:**
- Forums: `settings.forums.configs[].destination_path`
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: `settings.coppermine.destination_path`
### Review Queue (location='review')
Path: `/opt/immich/review/` (configurable in `settings.face_recognition.review_path`)
Maintains same directory structure as final destination:
```
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
```
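Since the review queue mirrors the final directory layout, mapping a final-destination path to its review-queue equivalent is pure path arithmetic. A minimal sketch; `to_review_path`, `FINAL_ROOT`, and `REVIEW_ROOT` are hypothetical names, with the roots taken from the example paths in this document:

```python
from pathlib import Path

# Roots as used in the examples above; the real values come from settings.
FINAL_ROOT = Path("/opt/immich/md")
REVIEW_ROOT = Path("/opt/immich/review")

def to_review_path(final_path: str) -> Path:
    """Mirror a final-destination path under the review root,
    keeping the relative directory structure intact."""
    relative = Path(final_path).relative_to(FINAL_ROOT)
    return REVIEW_ROOT / relative

print(to_review_path("/opt/immich/md/forums/PicturePub/unmatched.jpg"))
# /opt/immich/review/forums/PicturePub/unmatched.jpg
```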
### Recycle Bin (location='recycle')
Path: `/opt/immich/recycle/` (fixed)
**Note**: Recycle bin uses separate `recycle_bin` table (already implemented, don't duplicate).
Files stored with UUID filenames:
```
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
```
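Generating those UUID filenames while keeping the original extension might look like this (`recycle_path_for` is a hypothetical helper, not an existing function):

```python
import uuid
from pathlib import Path

RECYCLE_ROOT = Path("/opt/immich/recycle")  # fixed path per this document

def recycle_path_for(original_path: str) -> Path:
    """Generate a collision-free recycle-bin path, keeping the extension
    so content type is still obvious after restore."""
    suffix = Path(original_path).suffix  # e.g. '.jpg'
    return RECYCLE_ROOT / f"{uuid.uuid4()}{suffix}"

p = recycle_path_for("/opt/immich/md/forums/PicturePub/photo.jpg")
print(p.suffix)  # .jpg
```

The original path is preserved in the `recycle_bin` table row, which is what makes restore possible despite the opaque filename.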
---
## 🔄 Data Flow
### 1. Download Phase
```
Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/
```
### 2. Move Phase (move_module.py)
```
move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match    → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path)  [optional - for audit trail]
```
### 3. Delete Phase
```
User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')
```
### 4. Restore Phase
```
User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
```
---
## 🚀 Implementation Plan
### Phase 1: Infrastructure (Week 1)
#### Day 1: Table Creation & Backfill Script
- [ ] Add `file_inventory` table to `unified_database.py`
- [ ] Add methods: `upsert_file_inventory()`, `delete_file_inventory()`, `query_file_inventory()`
- [ ] Write backfill script: `/opt/media-downloader/utilities/backfill_file_inventory.py`
- [ ] Test backfill on test database
#### Day 2: Initial Backfill
- [ ] Run backfill script on production database
- [ ] Verify all 2,493 files captured
- [ ] Check accuracy vs filesystem
- [ ] Document any discrepancies
#### Day 3: Testing & Validation
- [ ] Verify indexes created
- [ ] Test query performance (should be <10ms)
- [ ] Write unit tests for file_inventory methods
### Phase 2: Update move_module (Week 1)
#### Day 4-5: Integration
- [ ] Update `move_module.py` to call `upsert_file_inventory()` after successful moves
- [ ] Handle location tracking ('final' vs 'review')
- [ ] Add error handling and logging
- [ ] Test with sample downloads (Instagram, Forum, etc.)
- [ ] Verify file_inventory stays in sync
### Phase 3: Update API Endpoints (Week 2)
#### Day 1-2: Media Page
- [ ] Update `/api/media/gallery` to query `file_inventory` (location='final')
- [ ] Add filtering by platform, source, content_type
- [ ] Add pagination (already indexed)
- [ ] Test performance improvement
- [ ] Deploy and monitor
#### Day 3: Downloads Page
- [ ] Update `/api/downloads/recent` to query `file_inventory`
- [ ] Test with different platforms
- [ ] Verify sorting by created_date
#### Day 4: Review Queue
- [ ] Update `/api/review/queue` to query `file_inventory` (location='review')
- [ ] Verify face recognition integration
- [ ] Test restore from review queue
#### Day 5: Testing & Documentation
- [ ] Integration testing across all pages
- [ ] Performance testing with large datasets
- [ ] Update API documentation
- [ ] User acceptance testing
### Phase 4: Background Maintenance (Week 3)
#### Optional: File Verification Task
- [ ] Create periodic task to verify file existence
- [ ] Mark missing files in `file_inventory`
- [ ] Alert on discrepancies
- [ ] Auto-cleanup stale records (configurable)
---
## 📊 Expected Performance
### Before (Filesystem Scanning)
| Page | Method | Files Scanned | Load Time |
|------|--------|---------------|-----------|
| Media | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Downloads | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Review | `directory.rglob('*')` | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |
### After (Database Queries)
| Page | Method | Query Cost | Load Time |
|------|--------|-----------|-----------|
| Media | `SELECT ... LIMIT 50` | O(log n) index seek | <100ms ✅ |
| Downloads | `SELECT ... LIMIT 50` | O(log n) index seek | <100ms ✅ |
| Review | `SELECT ... WHERE location='review'` | O(log n) index seek | <100ms ✅ |
| Recycle Bin | Database query (unchanged) | O(log n) index seek | <100ms ✅ |
**Performance Improvement: 50-100x faster** 🚀
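Whether a query actually hits the index can be checked with SQLite's `EXPLAIN QUERY PLAN`. A sketch against a throwaway in-memory table using the `idx_fi_platform_location` index from the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_inventory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        file_path TEXT NOT NULL UNIQUE,
        platform TEXT NOT NULL,
        location TEXT NOT NULL,
        created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE INDEX idx_fi_platform_location
        ON file_inventory (platform, location, created_date DESC);
""")

# The media-page query shape: equality on the first two index columns,
# ordered by the third, so both filter and sort are index-served.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT file_path FROM file_inventory
    WHERE platform = 'instagram' AND location = 'final'
    ORDER BY created_date DESC
    LIMIT 50
""").fetchall()
print(plan[0][3])  # the detail column names idx_fi_platform_location
```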
---
## 🔍 Backfill Strategy
### Discovery Phase
Backfill script reads settings to find all file locations:
```python
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
scan_directory(forum['destination_path'], platform='forum', source=forum['name'])
# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')
# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```
### Metadata Extraction
For each file found:
- **Platform**: From directory structure or settings
- **Source**: Extract from filename or directory name
- **Content Type**: From file extension
- **File Size**: `os.stat().st_size`
- **File Hash**: Calculate SHA256 (for deduplication)
- **Dimensions**: Query from `media_metadata.db` if exists
- **Location**: 'final' or 'review' based on directory
### Idempotency
Script can be run multiple times safely:
- Uses UPSERT (`INSERT ... ON CONFLICT DO UPDATE`) semantics, so re-runs overwrite rather than duplicate
- Skips files already in database (with option to force refresh)
- Logs statistics: new files, updated files, skipped files
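In SQLite those semantics map to `INSERT ... ON CONFLICT DO UPDATE`. A minimal demonstration that re-running the same statement for the same path updates in place rather than duplicating:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_inventory (
        file_path TEXT NOT NULL UNIQUE,
        location TEXT NOT NULL
    )
""")

upsert = """
    INSERT INTO file_inventory (file_path, location)
    VALUES (?, ?)
    ON CONFLICT(file_path) DO UPDATE SET location = excluded.location
"""
# Two runs for the same path leave a single row with the latest location,
# which is what lets the backfill script re-run safely.
conn.execute(upsert, ("/opt/immich/md/gallery/Coppermine/a.jpg", "final"))
conn.execute(upsert, ("/opt/immich/md/gallery/Coppermine/a.jpg", "review"))
count, loc = conn.execute(
    "SELECT COUNT(*), MAX(location) FROM file_inventory"
).fetchone()
print(count, loc)  # 1 review
```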
---
## 🛡️ Data Integrity
### Constraints
- `file_path UNIQUE` - Prevents duplicate entries
- `location NOT NULL` - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files
### Verification
Background task (optional, runs daily):
1. Select random 1000 files from `file_inventory`
2. Check if files still exist on filesystem
3. Mark missing files or auto-delete records
4. Log discrepancies for review
### Migration Safety
- **downloads table**: Never modified (preserves audit trail)
- **recycle_bin table**: Never modified (already works perfectly)
- **New table**: No risk to existing functionality
- **Gradual rollout**: Update one endpoint at a time
---
## 📝 Database Methods
### unified_database.py
```python
def create_file_inventory_table(self):
"""Create file_inventory table and indexes"""
def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
source: str = None, content_type: str = None,
location: str = 'final', **kwargs) -> bool:
"""Insert or update file in inventory"""
def delete_file_inventory(self, file_path: str) -> bool:
"""Remove file from inventory (when permanently deleted)"""
def query_file_inventory(self, location: str = None, platform: str = None,
source: str = None, limit: int = 50,
offset: int = 0) -> List[Dict]:
"""Query file inventory with filters and pagination"""
def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
"""Update file location (e.g., final → review → recycle)"""
def verify_file_inventory(self) -> Dict:
"""Verify all files in inventory still exist on filesystem"""
```
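One way `query_file_inventory()` might assemble its SQL, appending WHERE clauses only for the filters actually supplied so the composite indexes stay usable. A sketch, not the final implementation:

```python
import sqlite3

def query_file_inventory(conn, location=None, platform=None,
                         limit=50, offset=0):
    """Filtered, paginated inventory query as the API endpoints would run it."""
    sql = "SELECT file_path, platform, location FROM file_inventory"
    clauses, params = [], []
    if location:
        clauses.append("location = ?")
        params.append(location)
    if platform:
        clauses.append("platform = ?")
        params.append(platform)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    # created_date DESC matches the index order, so no separate sort pass.
    sql += " ORDER BY created_date DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return [dict(zip(("file_path", "platform", "location"), row))
            for row in conn.execute(sql, params)]
```

Placeholders (`?`) rather than string interpolation keep the query safe even when `source` values come from scraped filenames.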
---
## 🔄 Backwards Compatibility
### During Migration
- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk
### After Migration
- Keep `downloads` table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- `file_inventory` is an index of the files, not a replacement for them
---
## 📈 Monitoring
### Metrics to Track
- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after
### Alerts
- File inventory diverges >5% from filesystem
- Query performance degrades >100ms
- Backfill failures
---
## 🚧 Future Enhancements
1. **Real-time sync**: inotify/watchdog to detect file changes
2. **Advanced queries**: Full-text search on filename/source
3. **Statistics**: Track file age, access patterns
4. **Cleanup**: Auto-remove files older than X days
5. **Export**: Generate inventory reports (CSV, JSON)
---
## 📚 Related Documentation
- [VERSIONING.md](VERSIONING.md) - Version control and backups
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [FACE_RECOGNITION.md](FACE_RECOGNITION.md) - Face recognition integration
- [CHANGELOG.md](CHANGELOG.md) - Version history
---
## ✅ Success Criteria
- [ ] All 2,493+ files tracked in `file_inventory`
- [ ] Database accuracy >98%
- [ ] Page load times <100ms
- [ ] Zero data loss
- [ ] Backward compatible
- [ ] No user-facing changes (transparent migration)
---
**Status**: Ready for implementation
**Next Step**: Create `file_inventory` table in `unified_database.py`