File Inventory Architecture
Version: 6.33.5 Date: 2025-11-16 Status: Implementation Phase
📋 Overview
The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.
🎯 Problem Statement
Current Issues
- Performance: Pages scan entire directory trees on every load (2,493+ files)
- Accuracy: Database is only 79.93% accurate; files on disk don't match database records
- Stale Records: Downloaded files moved/deleted but database not updated
- Missing Records: 1,733+ files on disk with no database entries
Root Cause
Multiple systems track files independently:
- Download modules record to the downloads table during download
- move_module updates paths when moving files
- Filesystem is the actual source of truth
- API endpoints scan filesystem (ignoring database)
Result: Database and filesystem drift apart over time.
💡 Solution: file_inventory Table
Architecture
Single Source of Truth: file_inventory table tracks ALL files in their current locations.
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   move_module   │──> file_inventory table (current locations)
└─────────────────┘              │
                                 │
                                 ▼
                         ┌───────────────┐
                         │ API Endpoints │
                         │ (Fast Queries)│
                         └───────────────┘
Separation of Concerns
| Table | Purpose | Updates | Deletions |
|---|---|---|---|
| downloads | Historical audit trail | Never | Never |
| file_inventory | Current file locations | On every move | When file deleted |
| recycle_bin | Deleted files (restore capability) | On delete/restore | On permanent delete |
🗄️ Database Schema
file_inventory Table
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,                    -- Absolute path (current location)
    filename TEXT NOT NULL,                            -- Basename for display
    platform TEXT NOT NULL,                            -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                                       -- Username, forum name, etc.
    content_type TEXT,                                 -- 'image' or 'video'
    file_size INTEGER,                                 -- Size in bytes
    file_hash TEXT,                                    -- SHA256 for deduplication
    width INTEGER,                                     -- Image/video width (from metadata cache)
    height INTEGER,                                    -- Image/video height
    location TEXT NOT NULL,                            -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,                           -- Last time file existence verified
    metadata JSON                                      -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (SQLite does not accept INDEX clauses inside CREATE TABLE)
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
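One way to check that the composite index serves the main gallery query is to ask SQLite for its query plan. The sketch below is an illustration only, not project code: it assumes SQLite, uses a trimmed copy of the table with just the columns the query touches, and explain() is a hypothetical helper.

```python
import sqlite3

# Trimmed copy of file_inventory: only the columns this query needs (assumption).
DDL = """
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,
    platform TEXT NOT NULL,
    location TEXT NOT NULL,
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_fi_platform_location
    ON file_inventory (platform, location, created_date DESC);
"""

GALLERY_QUERY = """
SELECT file_path FROM file_inventory
WHERE platform = ? AND location = ?
ORDER BY created_date DESC
LIMIT 50
"""


def explain(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return SQLite's query plan for `sql` as one string (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail text.
    return "\n".join(str(row[-1]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
plan = explain(conn, GALLERY_QUERY, ("instagram", "final"))
print(plan)
```

If the printed plan names idx_fi_platform_location, the filter is served by the index rather than a full table scan.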
Field Descriptions
- file_path: Full absolute path (e.g., /opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg)
- location: Current location type
  - 'final' - In final destination directory (ready for Immich)
  - 'review' - In review queue (no face match, pending manual review)
  - 'recycle' - In recycle bin (soft deleted, can be restored)
- created_date: When file was moved to current location (not original download date)
- last_verified: Background task updates this when verifying file still exists
📂 File Locations
Final Destinations (location='final')
Configured in settings table, per platform:
/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/
Settings locations:
- Forums: settings.forums.configs[].destination_path
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: settings.coppermine.destination_path
Review Queue (location='review')
Path: /opt/immich/review/ (configurable in settings.face_recognition.review_path)
Maintains same directory structure as final destination:
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
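Because the review queue mirrors the final directory layout, a review path can be derived from a final path by swapping the root. A minimal sketch, assuming the two roots shown above; review_path_for() is a hypothetical helper, and in practice the review root would be read from settings.face_recognition.review_path.

```python
from pathlib import PurePosixPath

# Roots taken from the examples in this document; the real values are configurable.
FINAL_ROOT = PurePosixPath("/opt/immich/md")
REVIEW_ROOT = PurePosixPath("/opt/immich/review")


def review_path_for(final_path: str) -> str:
    """Map a final-destination path to the mirrored path under the review root."""
    # relative_to() raises ValueError if final_path is not under FINAL_ROOT.
    relative = PurePosixPath(final_path).relative_to(FINAL_ROOT)
    return str(REVIEW_ROOT / relative)


print(review_path_for("/opt/immich/md/forums/PicturePub/unmatched.jpg"))
```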
Recycle Bin (location='recycle')
Path: /opt/immich/recycle/ (fixed)
Note: Recycle bin uses separate recycle_bin table (already implemented, don't duplicate).
Files stored with UUID filenames:
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
🔄 Data Flow
1. Download Phase
Download Module
↓
downloads table (audit trail)
↓
Temporary file in /opt/media-downloader/temp/
2. Move Phase (move_module.py)
move_module.move_file()
↓
Face Recognition Check
├─ Match → Final Destination
└─ No Match → Review Queue
↓
File moved to location
↓
file_inventory.upsert(file_path, location)
↓
downloads.update(file_path) [optional - for audit trail]
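The ordering in the move phase matters: the inventory should only be written once the file is really at its new path. A minimal sketch of that ordering follows. It is not the actual move_module code; move_and_record() and its parameters are hypothetical, the face recognition result is passed in as a plain boolean, and the upsert callable stands in for upsert_file_inventory().

```python
import os
import shutil
from typing import Callable


def move_and_record(src: str, final_dir: str, review_dir: str,
                    face_match: bool,
                    upsert: Callable[[str, str], None]) -> str:
    """Move `src` to its destination, then record the new path and location."""
    location = "final" if face_match else "review"
    dest_dir = final_dir if face_match else review_dir
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    # shutil.move raises on failure, so the inventory is only touched
    # after the file has actually been moved.
    shutil.move(src, dest)
    upsert(dest, location)
    return dest
```

Passing the upsert as a callable keeps the sketch independent of the database layer; a dict's __setitem__ is enough to exercise it in a test.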
3. Delete Phase
User deletes from UI
↓
File moved to /opt/immich/recycle/
↓
recycle_bin.insert(original_path, recycle_path)
↓
file_inventory.delete(file_path) OR update(location='recycle')
4. Restore Phase
User restores from recycle bin
↓
File moved back to original_path
↓
recycle_bin.delete(id)
↓
file_inventory.insert(original_path, location='final')
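The delete and restore phases can be sketched as a pair of transactions. This covers only the database bookkeeping for the "delete from file_inventory" variant; the file moves are omitted. The recycle_bin columns shown here are assumptions for illustration (the real table already exists and may differ), as are the function names.

```python
import os
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """In-memory database with trimmed, assumed versions of both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE file_inventory (
            file_path TEXT NOT NULL UNIQUE, filename TEXT NOT NULL,
            platform TEXT NOT NULL, location TEXT NOT NULL);
        CREATE TABLE recycle_bin (
            id TEXT PRIMARY KEY, original_path TEXT NOT NULL,
            recycle_path TEXT NOT NULL, platform TEXT NOT NULL);
    """)
    return conn


def soft_delete(conn: sqlite3.Connection, file_path: str,
                recycle_root: str = "/opt/immich/recycle") -> str:
    """Record the file in recycle_bin and drop it from file_inventory."""
    row = conn.execute("SELECT platform FROM file_inventory WHERE file_path = ?",
                       (file_path,)).fetchone()
    if row is None:
        raise KeyError(file_path)
    recycle_id = str(uuid.uuid4())
    recycle_path = f"{recycle_root}/{recycle_id}{os.path.splitext(file_path)[1]}"
    with conn:  # both statements commit together or roll back together
        conn.execute("INSERT INTO recycle_bin VALUES (?, ?, ?, ?)",
                     (recycle_id, file_path, recycle_path, row[0]))
        conn.execute("DELETE FROM file_inventory WHERE file_path = ?", (file_path,))
    return recycle_id


def restore(conn: sqlite3.Connection, recycle_id: str) -> str:
    """Reverse soft_delete: remove the recycle_bin row, re-add the inventory row."""
    row = conn.execute("SELECT original_path, platform FROM recycle_bin WHERE id = ?",
                       (recycle_id,)).fetchone()
    if row is None:
        raise KeyError(recycle_id)
    original_path, platform = row
    with conn:
        conn.execute("DELETE FROM recycle_bin WHERE id = ?", (recycle_id,))
        conn.execute(
            "INSERT INTO file_inventory (file_path, filename, platform, location) "
            "VALUES (?, ?, ?, 'final')",
            (original_path, os.path.basename(original_path), platform))
    return original_path
```

Wrapping each pair of statements in one transaction avoids a state where a file is listed in both tables or in neither.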
🚀 Implementation Plan
Phase 1: Infrastructure (Week 1)
Day 1: Table Creation & Backfill Script
- Add file_inventory table to unified_database.py
- Add methods: upsert_file_inventory(), delete_file_inventory(), query_file_inventory()
- Write backfill script: /opt/media-downloader/utilities/backfill_file_inventory.py
- Test backfill on test database
Day 2: Initial Backfill
- Run backfill script on production database
- Verify all 2,493 files captured
- Check accuracy vs filesystem
- Document any discrepancies
Day 3: Testing & Validation
- Verify indexes created
- Test query performance (should be <10ms)
- Write unit tests for file_inventory methods
Phase 2: Update move_module (Week 1)
Day 4-5: Integration
- Update move_module.py to call upsert_file_inventory() after successful moves
- Handle location tracking ('final' vs 'review')
- Add error handling and logging
- Test with sample downloads (Instagram, Forum, etc.)
- Verify file_inventory stays in sync
Phase 3: Update API Endpoints (Week 2)
Day 1-2: Media Page
- Update /api/media/gallery to query file_inventory (location='final')
- Add filtering by platform, source, content_type
- Add pagination (already indexed)
- Test performance improvement
- Deploy and monitor
Day 3: Downloads Page
- Update /api/downloads/recent to query file_inventory
- Test with different platforms
- Verify sorting by created_date
Day 4: Review Queue
- Update /api/review/queue to query file_inventory (location='review')
- Verify face recognition integration
- Test restore from review queue
Day 5: Testing & Documentation
- Integration testing across all pages
- Performance testing with large datasets
- Update API documentation
- User acceptance testing
Phase 4: Background Maintenance (Week 3)
Optional: File Verification Task
- Create periodic task to verify file existence
- Mark missing files in file_inventory
- Alert on discrepancies
- Auto-cleanup stale records (configurable)
📊 Expected Performance
Before (Filesystem Scanning)
| Page | Method | Files Scanned | Load Time |
|---|---|---|---|
| Media | directory.rglob('*') | 2,493 | 5-10 seconds |
| Downloads | directory.rglob('*') | 2,493 | 5-10 seconds |
| Review | directory.rglob('*') | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |
After (Database Queries)
| Page | Method | Query Cost | Load Time |
|---|---|---|---|
| Media | SELECT ... LIMIT 50 | O(log n) with index | <100ms ✅ |
| Downloads | SELECT ... LIMIT 50 | O(log n) with index | <100ms ✅ |
| Review | SELECT ... WHERE location='review' | O(log n) with index | <100ms ✅ |
| Recycle Bin | Already database | O(log n) with index | <100ms ✅ |
Performance Improvement: 50-100x faster 🚀
🔍 Backfill Strategy
Discovery Phase
Backfill script reads settings to find all file locations:
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
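The scan_directory() helper itself is not shown in this document. One possible shape, offered as a sketch under assumptions: it walks the tree with os.walk and yields one record per file, using the keyword arguments seen in the calls above. The record keys and the default location='final' are illustrative choices, not confirmed behaviour.

```python
import os
from typing import Dict, Iterator, Optional


def scan_directory(root: str, platform: Optional[str] = None,
                   source: Optional[str] = None,
                   location: str = "final") -> Iterator[Dict[str, Optional[str]]]:
    """Yield one inventory record per file found under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place gives a deterministic walk order
        for name in sorted(filenames):
            yield {
                "file_path": os.path.abspath(os.path.join(dirpath, name)),
                "filename": name,
                "platform": platform,
                "source": source,
                "location": location,
            }
```

Writing it as a generator lets the backfill script insert rows in batches without holding the whole file list in memory.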
Metadata Extraction
For each file found:
- Platform: From directory structure or settings
- Source: Extract from filename or directory name
- Content Type: From file extension
- File Size: os.stat().st_size
- File Hash: Calculate SHA256 (for deduplication)
- Dimensions: Query from media_metadata.db if exists
- Location: 'final' or 'review' based on directory
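The size, hash and content-type steps above can be sketched with the standard library alone. The extension sets below are assumptions (the document does not list which extensions count as image or video), and the function names are hypothetical. The dimension lookup against media_metadata.db is left out.

```python
import hashlib
import os
from typing import Dict, Optional

# Assumed extension lists; adjust to whatever the downloaders actually produce.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".mkv"}


def content_type_for(path: str) -> Optional[str]:
    """Return 'image', 'video', or None for an unrecognised extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"
    if ext in VIDEO_EXTENSIONS:
        return "video"
    return None


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large videos are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_metadata(path: str) -> Dict[str, object]:
    """Collect the per-file fields that need no external lookup."""
    return {
        "content_type": content_type_for(path),
        "file_size": os.stat(path).st_size,
        "file_hash": sha256_of(path),
    }
```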
Idempotency
Script can be run multiple times safely:
- Uses INSERT OR REPLACE / UPSERT semantics
- Skips files already in database (with option to force refresh)
- Logs statistics: new files, updated files, skipped files
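A sketch of the skip-or-refresh behaviour, assuming SQLite 3.24 or later for ON CONFLICT ... DO UPDATE. It uses the UPSERT form rather than INSERT OR REPLACE because REPLACE deletes the conflicting row and inserts a new one, which would change the row's id. The function name, the force flag and the trimmed table are illustrative.

```python
import sqlite3

# Trimmed table for the demo (assumption); the real schema has more columns.
DDL = """
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,
    filename TEXT NOT NULL,
    platform TEXT NOT NULL,
    location TEXT NOT NULL
)
"""


def upsert_file(conn: sqlite3.Connection, file_path: str, filename: str,
                platform: str, location: str, force: bool = False) -> str:
    """Insert or refresh one row; return 'new', 'updated' or 'skipped'."""
    exists = conn.execute("SELECT 1 FROM file_inventory WHERE file_path = ?",
                          (file_path,)).fetchone() is not None
    if exists and not force:
        return "skipped"
    with conn:
        conn.execute(
            """
            INSERT INTO file_inventory (file_path, filename, platform, location)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(file_path) DO UPDATE SET
                filename = excluded.filename,
                platform = excluded.platform,
                location = excluded.location
            """,
            (file_path, filename, platform, location))
    return "updated" if exists else "new"


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
stats = [
    upsert_file(conn, "/opt/immich/review/a.jpg", "a.jpg", "forum", "review"),
    upsert_file(conn, "/opt/immich/review/a.jpg", "a.jpg", "forum", "review"),
    upsert_file(conn, "/opt/immich/review/a.jpg", "a.jpg", "forum", "review", force=True),
]
print(stats)
```

The returned labels map directly onto the new/updated/skipped counters the script logs.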
🛡️ Data Integrity
Constraints
- file_path UNIQUE - Prevents duplicate entries
- location NOT NULL - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files
Verification
Background task (optional, runs daily):
- Select random 1000 files from file_inventory
- Check if files still exist on filesystem
- Mark missing files or auto-delete records
- Log discrepancies for review
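The sampling check can be sketched as follows. This is an assumption-laden illustration: verify_sample() is a hypothetical name, it only reports missing paths (whether to mark or delete them is left to the caller), and the exists parameter is there so the check can be tested without touching the real filesystem.

```python
import os
import sqlite3
from typing import Callable, List


def verify_sample(conn: sqlite3.Connection, sample_size: int = 1000,
                  exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Check a random sample of inventory rows; return the paths that are missing."""
    rows = conn.execute(
        "SELECT id, file_path FROM file_inventory ORDER BY RANDOM() LIMIT ?",
        (sample_size,)).fetchall()
    missing: List[str] = []
    with conn:  # commit all last_verified updates in one transaction
        for row_id, file_path in rows:
            if exists(file_path):
                conn.execute(
                    "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
                    "WHERE id = ?", (row_id,))
            else:
                missing.append(file_path)
    return missing
```

ORDER BY RANDOM() sorts the whole table, which is acceptable at the sizes discussed here but may need a different sampling method for much larger inventories.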
Migration Safety
- downloads table: Never modified (preserves audit trail)
- recycle_bin table: Never modified (already works perfectly)
- New table: No risk to existing functionality
- Gradual rollout: Update one endpoint at a time
📝 Database Methods
unified_database.py
from typing import Dict, List, Optional

def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: Optional[str] = None, content_type: Optional[str] = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: Optional[str] = None, platform: Optional[str] = None,
                         source: Optional[str] = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
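As an illustration of how the query method might be filled in, here is a standalone sketch that takes a connection instead of self. The selected columns and the id tie-breaker in ORDER BY are assumptions. Filter values are always bound as parameters; only fixed column names are interpolated into the SQL.

```python
import sqlite3
from typing import Dict, List, Optional


def query_file_inventory(conn: sqlite3.Connection,
                         location: Optional[str] = None,
                         platform: Optional[str] = None,
                         source: Optional[str] = None,
                         limit: int = 50, offset: int = 0) -> List[Dict]:
    """Return newest-first inventory rows matching the given filters."""
    clauses: List[str] = []
    params: List[object] = []
    # Column names come from this fixed tuple, never from caller input.
    for column, value in (("location", location),
                          ("platform", platform),
                          ("source", source)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = (
        "SELECT file_path, filename, platform, source, location, created_date "
        f"FROM file_inventory {where} "
        "ORDER BY created_date DESC, id DESC LIMIT ? OFFSET ?"
    )
    cursor = conn.execute(sql, (*params, limit, offset))
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

A call such as query_file_inventory(conn, location='final', platform='instagram') would then back the gallery endpoint's first page.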
🔄 Backwards Compatibility
During Migration
- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk
After Migration
- Keep downloads table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- file_inventory is an index, not a replacement for the files
📈 Monitoring
Metrics to Track
- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after
Alerts
- File inventory diverges >5% from filesystem
- Query latency exceeds 100ms
- Backfill failures
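The 5% divergence alert needs a concrete definition, which this document does not give. One possible definition, stated here as an assumption, is the share of paths that appear on only one side (database or filesystem) out of all paths seen on either side. The function names and the threshold default are illustrative.

```python
from typing import Iterable


def divergence(db_paths: Iterable[str], fs_paths: Iterable[str]) -> float:
    """Fraction of paths present in only one of the two sets (0.0 = identical)."""
    db, fs = set(db_paths), set(fs_paths)
    union = db | fs
    if not union:
        return 0.0  # nothing tracked and nothing on disk: no divergence
    return len(db ^ fs) / len(union)


def should_alert(db_paths: Iterable[str], fs_paths: Iterable[str],
                 threshold: float = 0.05) -> bool:
    """True when divergence exceeds the alert threshold (5% by default)."""
    return divergence(db_paths, fs_paths) > threshold
```

Comparing path sets rather than raw counts also catches the case where the totals match but the files themselves differ.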
🚧 Future Enhancements
- Real-time sync: inotify/watchdog to detect file changes
- Advanced queries: Full-text search on filename/source
- Statistics: Track file age, access patterns
- Cleanup: Auto-remove files older than X days
- Export: Generate inventory reports (CSV, JSON)
📚 Related Documentation
- VERSIONING.md - Version control and backups
- DATABASE_SCHEMA.md - Complete database schema
- FACE_RECOGNITION.md - Face recognition integration
- CHANGELOG.md - Version history
✅ Success Criteria
- All 2,493+ files tracked in file_inventory
- Database accuracy >98%
- Page load times <100ms
- Zero data loss
- Backward compatible
- No user-facing changes (transparent migration)
Status: Ready for implementation
Next Step: Create file_inventory table in unified_database.py