# File Inventory Architecture

**Version:** 6.33.5
**Date:** 2025-11-16
**Status:** Implementation Phase

---

## 📋 Overview

The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.

---

## 🎯 Problem Statement

### Current Issues

1. **Performance**: Pages scan entire directory trees on every load (2,493+ files)
2. **Accuracy**: Database only 79.93% accurate - files on disk don't match database records
3. **Stale Records**: Downloaded files moved/deleted but database not updated
4. **Missing Records**: 1,733+ files on disk with no database entries

### Root Cause

Multiple systems track files independently:

- **Download modules** record to `downloads` table during download
- **move_module** updates paths when moving files
- **Filesystem** is the actual source of truth
- **API endpoints** scan filesystem (ignoring database)

Result: Database and filesystem drift apart over time.

---

## 💡 Solution: file_inventory Table

### Architecture

**Single Source of Truth**: `file_inventory` table tracks ALL files in their current locations.
```
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   move_module   │──> file_inventory table (current locations)
└─────────────────┘
         │
         ▼
┌───────────────┐
│ API Endpoints │
│ (Fast Queries)│
└───────────────┘
```

### Separation of Concerns

| Table | Purpose | Updates | Deletions |
|-------|---------|---------|-----------|
| `downloads` | Historical audit trail | Never | Never |
| `file_inventory` | Current file locations | On every move | When file deleted |
| `recycle_bin` | Deleted files (restore capability) | On delete/restore | On permanent delete |

---

## 🗄️ Database Schema

### file_inventory Table

```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,     -- Absolute path (current location)
    filename TEXT NOT NULL,             -- Basename for display
    platform TEXT NOT NULL,             -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                        -- Username, forum name, etc.
    content_type TEXT,                  -- 'image' or 'video'
    file_size INTEGER,                  -- Size in bytes
    file_hash TEXT,                     -- SHA256 for deduplication
    width INTEGER,                      -- Image/video width (from metadata cache)
    height INTEGER,                     -- Image/video height
    location TEXT NOT NULL,             -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,            -- Last time file existence verified
    metadata JSON                       -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (SQLite does not support inline INDEX
-- clauses in CREATE TABLE, so these are separate statements)
CREATE INDEX idx_fi_platform_location ON file_inventory(platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory(source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory(location);
CREATE INDEX idx_fi_hash ON file_inventory(file_hash);
```

### Field Descriptions

- **file_path**: Full absolute path (e.g., `/opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg`)
- **location**: Current location type
  - `'final'` - In final destination directory (ready for Immich)
  - `'review'` - In review queue (no face match, pending manual review)
  - `'recycle'` - In recycle bin (soft deleted, can be restored)
- **created_date**: When file was moved to current location (not original download date)
- **last_verified**: Background task updates this when verifying file still exists

---

## 📂 File Locations

### Final Destinations (location='final')

Configured in settings table, per platform:

```
/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/
```

**Settings locations:**
- Forums: `settings.forums.configs[].destination_path`
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: `settings.coppermine.destination_path`

### Review Queue (location='review')

Path: `/opt/immich/review/` (configurable in `settings.face_recognition.review_path`)

Maintains same directory structure as final destination:

```
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
```

### Recycle Bin (location='recycle')

Path: `/opt/immich/recycle/` (fixed)

**Note**: Recycle bin uses the separate `recycle_bin` table (already implemented, don't duplicate). Files are stored with UUID filenames:

```
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
```

---
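Given the directory layout above, a file's `platform`, `source`, and `location` fields can be inferred from its path alone, which both the backfill script and `move_module` could reuse. A minimal sketch, assuming the hardcoded roots shown earlier; the helper name and parsing rules are illustrative, not part of the design:

```python
from pathlib import Path

# Hypothetical roots; real values come from the settings table.
FINAL_ROOT = Path("/opt/immich/md")
REVIEW_ROOT = Path("/opt/immich/review")

def infer_inventory_fields(file_path: str) -> dict:
    """Derive platform/source/location for a file from its path.

    Assumed layout: <root>/social media/<platform>/<kind>/<file>,
    <root>/forums/<forum name>/<file>, or <root>/gallery/Coppermine/<file>.
    """
    path = Path(file_path)
    if path.is_relative_to(REVIEW_ROOT):
        location, rel = "review", path.relative_to(REVIEW_ROOT)
    else:
        location, rel = "final", path.relative_to(FINAL_ROOT)

    parts = rel.parts
    if parts[0] == "social media":
        platform, source = parts[1], None      # e.g. instagram/posts/...
    elif parts[0] == "forums":
        platform, source = "forum", parts[1]   # e.g. forums/PicturePub/...
    else:
        platform, source = "coppermine", None  # gallery/Coppermine/...

    ext = path.suffix.lower()
    content_type = "video" if ext in {".mp4", ".mov", ".webm"} else "image"
    return {
        "filename": path.name,
        "platform": platform,
        "source": source,
        "location": location,
        "content_type": content_type,
    }
```

Keeping this logic in one helper means the backfill and the move path cannot disagree about how a path maps to inventory fields.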
## 🔄 Data Flow

### 1. Download Phase

```
Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/
```

### 2. Move Phase (move_module.py)

```
move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path)  [optional - for audit trail]
```

### 3. Delete Phase

```
User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')
```

### 4. Restore Phase

```
User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
```

---

## 🚀 Implementation Plan

### Phase 1: Infrastructure (Week 1)

#### Day 1: Table Creation & Backfill Script
- [ ] Add `file_inventory` table to `unified_database.py`
- [ ] Add methods: `upsert_file_inventory()`, `delete_file_inventory()`, `query_file_inventory()`
- [ ] Write backfill script: `/opt/media-downloader/utilities/backfill_file_inventory.py`
- [ ] Test backfill on test database

#### Day 2: Initial Backfill
- [ ] Run backfill script on production database
- [ ] Verify all 2,493 files captured
- [ ] Check accuracy vs filesystem
- [ ] Document any discrepancies

#### Day 3: Testing & Validation
- [ ] Verify indexes created
- [ ] Test query performance (should be <10ms)
- [ ] Write unit tests for file_inventory methods

### Phase 2: Update move_module (Week 1)

#### Day 4-5: Integration
- [ ] Update `move_module.py` to call `upsert_file_inventory()` after successful moves
- [ ] Handle location tracking ('final' vs 'review')
- [ ] Add error handling and logging
- [ ] Test with sample downloads (Instagram, Forum, etc.)
- [ ] Verify file_inventory stays in sync

### Phase 3: Update API Endpoints (Week 2)

#### Day 1-2: Media Page
- [ ] Update `/api/media/gallery` to query `file_inventory` (location='final')
- [ ] Add filtering by platform, source, content_type
- [ ] Add pagination (already indexed)
- [ ] Test performance improvement
- [ ] Deploy and monitor

#### Day 3: Downloads Page
- [ ] Update `/api/downloads/recent` to query `file_inventory`
- [ ] Test with different platforms
- [ ] Verify sorting by created_date

#### Day 4: Review Queue
- [ ] Update `/api/review/queue` to query `file_inventory` (location='review')
- [ ] Verify face recognition integration
- [ ] Test restore from review queue

#### Day 5: Testing & Documentation
- [ ] Integration testing across all pages
- [ ] Performance testing with large datasets
- [ ] Update API documentation
- [ ] User acceptance testing

### Phase 4: Background Maintenance (Week 3)

#### Optional: File Verification Task
- [ ] Create periodic task to verify file existence
- [ ] Mark missing files in `file_inventory`
- [ ] Alert on discrepancies
- [ ] Auto-cleanup stale records (configurable)

---

## 📊 Expected Performance

### Before (Filesystem Scanning)

| Page | Method | Files Scanned | Load Time |
|------|--------|---------------|-----------|
| Media | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Downloads | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Review | `directory.rglob('*')` | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |

### After (Database Queries)

| Page | Method | Query Cost | Load Time |
|------|--------|-----------|-----------|
| Media | `SELECT ... LIMIT 50` | O(1) with index | <100ms ✅ |
| Downloads | `SELECT ... LIMIT 50` | O(1) with index | <100ms ✅ |
| Review | `SELECT ... WHERE location='review'` | O(1) with index | <100ms ✅ |
| Recycle Bin | Already database-backed | O(1) with index | <100ms ✅ |

**Performance Improvement: 50-100x faster** 🚀

---

## 🔍 Backfill Strategy

### Discovery Phase

The backfill script reads settings to find all file locations:

```python
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```

### Metadata Extraction

For each file found:
- **Platform**: From directory structure or settings
- **Source**: Extracted from filename or directory name
- **Content Type**: From file extension
- **File Size**: `os.stat().st_size`
- **File Hash**: Calculated SHA256 (for deduplication)
- **Dimensions**: Queried from `media_metadata.db` if it exists
- **Location**: 'final' or 'review', based on directory

### Idempotency

The script can be run multiple times safely:
- Uses `INSERT OR REPLACE` / `UPSERT` semantics
- Skips files already in the database (with an option to force refresh)
- Logs statistics: new files, updated files, skipped files

---

## 🛡️ Data Integrity

### Constraints

- `file_path UNIQUE` - Prevents duplicate entries
- `location NOT NULL` - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files

### Verification

Background task (optional, runs daily):

1. Select a random 1,000 files from `file_inventory`
2. Check whether the files still exist on the filesystem
3. Mark missing files or auto-delete their records
4. Log discrepancies for review

### Migration Safety

- **downloads table**: Never modified (preserves audit trail)
- **recycle_bin table**: Never modified (already works perfectly)
- **New table**: No risk to existing functionality
- **Gradual rollout**: Update one endpoint at a time

---

## 📝 Database Methods

### unified_database.py

```python
def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: str = None, content_type: str = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: str = None, platform: str = None,
                         source: str = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
```

---

## 🔄 Backwards Compatibility

### During Migration
- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk

### After Migration
- Keep `downloads` table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- `file_inventory` is an index of the files, not a replacement for them

---

## 📈 Monitoring

### Metrics to Track
- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after

### Alerts
- File inventory diverges >5% from filesystem
- Query performance degrades >100ms
- Backfill failures

---
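The upsert semantics described in the Database Methods section (and the idempotency the backfill strategy relies on) map directly onto SQLite's `ON CONFLICT` clause, available since SQLite 3.24. A sketch of `upsert_file_inventory` as a standalone function; the real method would live on the database class, and everything beyond the documented signature is an assumption:

```python
import sqlite3

def upsert_file_inventory(conn: sqlite3.Connection, file_path: str, filename: str,
                          platform: str, source: str = None,
                          content_type: str = None, location: str = 'final') -> bool:
    """Insert a file, or update it in place if file_path already exists.

    The UNIQUE constraint on file_path drives the ON CONFLICT branch,
    so re-running a backfill never creates duplicate rows.
    """
    conn.execute(
        """
        INSERT INTO file_inventory (file_path, filename, platform, source,
                                    content_type, location)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(file_path) DO UPDATE SET
            filename = excluded.filename,
            platform = excluded.platform,
            source = excluded.source,
            content_type = excluded.content_type,
            location = excluded.location,
            created_date = CURRENT_TIMESTAMP
        """,
        (file_path, filename, platform, source, content_type, location),
    )
    conn.commit()
    return True
```

Resetting `created_date` in the UPDATE branch matches the field's documented meaning: when the file was moved to its *current* location, not when it was first seen.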
## 🚧 Future Enhancements

1. **Real-time sync**: inotify/watchdog to detect file changes
2. **Advanced queries**: Full-text search on filename/source
3. **Statistics**: Track file age, access patterns
4. **Cleanup**: Auto-remove files older than X days
5. **Export**: Generate inventory reports (CSV, JSON)

---

## 📚 Related Documentation

- [VERSIONING.md](VERSIONING.md) - Version control and backups
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [FACE_RECOGNITION.md](FACE_RECOGNITION.md) - Face recognition integration
- [CHANGELOG.md](CHANGELOG.md) - Version history

---

## ✅ Success Criteria

- [ ] All 2,493+ files tracked in `file_inventory`
- [ ] Database accuracy >98%
- [ ] Page load times <100ms
- [ ] Zero data loss
- [ ] Backward compatible
- [ ] No user-facing changes (transparent migration)

---

**Status**: Ready for implementation
**Next Step**: Create `file_inventory` table in `unified_database.py`
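As a concrete starting point for the optional verification task (Phase 4, and the Verification steps under Data Integrity), a minimal sketch of the daily existence check. The sampling query and return shape are assumptions; the real method would be `verify_file_inventory` on the database class:

```python
import os
import sqlite3

def verify_file_inventory(conn: sqlite3.Connection, sample_size: int = 1000) -> dict:
    """Spot-check a random sample of inventory rows against the filesystem.

    Files that exist get last_verified refreshed; missing files are only
    reported, leaving auto-cleanup to a separate, configurable step.
    """
    rows = conn.execute(
        "SELECT file_path FROM file_inventory ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    ).fetchall()

    missing = []
    for (path,) in rows:
        if os.path.exists(path):
            conn.execute(
                "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
                "WHERE file_path = ?",
                (path,),
            )
        else:
            missing.append(path)
    conn.commit()
    return {"checked": len(rows), "missing": missing}
```

Sampling keeps the daily cost bounded even at 100,000+ rows; the >5% divergence alert from the Monitoring section can be computed from `len(missing) / checked`.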