media-downloader/docs/FILE_INVENTORY.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00


File Inventory Architecture

Version: 6.33.5 | Date: 2025-11-16 | Status: Implementation Phase


📋 Overview

The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.


🎯 Problem Statement

Current Issues

  1. Performance: Pages scan entire directory trees on every load (2,493+ files)
  2. Accuracy: The database is only 79.93% accurate; files on disk don't match database records
  3. Stale Records: Downloaded files moved/deleted but database not updated
  4. Missing Records: 1,733+ files on disk with no database entries

Root Cause

Multiple systems track files independently:

  • Download modules record to downloads table during download
  • move_module updates paths when moving files
  • Filesystem is the actual source of truth
  • API endpoints scan filesystem (ignoring database)

Result: Database and filesystem drift apart over time.


💡 Solution: file_inventory Table

Architecture

Single Source of Truth: file_inventory table tracks ALL files in their current locations.

┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  move_module    │──> file_inventory table (current locations)
└─────────────────┘         │
                            │
                            ▼
                    ┌────────────────┐
                    │  API Endpoints │
                    │  (Fast Queries)│
                    └────────────────┘

Separation of Concerns

| Table | Purpose | Updates | Deletions |
|----------------|------------------------------------|-------------------|---------------------|
| downloads | Historical audit trail | Never | Never |
| file_inventory | Current file locations | On every move | When file deleted |
| recycle_bin | Deleted files (restore capability) | On delete/restore | On permanent delete |

🗄️ Database Schema

file_inventory Table

```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,        -- Absolute path (current location)
    filename TEXT NOT NULL,                -- Basename for display
    platform TEXT NOT NULL,                -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                           -- Username, forum name, etc.
    content_type TEXT,                     -- 'image' or 'video'
    file_size INTEGER,                     -- Size in bytes
    file_hash TEXT,                        -- SHA256 for deduplication
    width INTEGER,                         -- Image/video width (from metadata cache)
    height INTEGER,                        -- Image/video height
    location TEXT NOT NULL,                -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,               -- Last time file existence was verified
    metadata JSON                          -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (SQLite doesn't support inline INDEX clauses,
-- so these are created as separate statements)
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
```

Field Descriptions

  • file_path: Full absolute path (e.g., /opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg)
  • location: Current location type
    • 'final' - In final destination directory (ready for Immich)
    • 'review' - In review queue (no face match, pending manual review)
    • 'recycle' - In recycle bin (soft deleted, can be restored)
  • created_date: When file was moved to current location (not original download date)
  • last_verified: Background task updates this when verifying file still exists
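To make these semantics concrete, here is a gallery-style lookup against a trimmed-down copy of the schema (illustrative only, using an in-memory SQLite database; the real table lives in unified_database.py):

```python
import sqlite3

# In-memory database for illustration; column names match the schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_inventory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        file_path TEXT NOT NULL UNIQUE,
        filename TEXT NOT NULL,
        platform TEXT NOT NULL,
        source TEXT,
        location TEXT NOT NULL,
        created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO file_inventory (file_path, filename, platform, source, location) "
    "VALUES (?, ?, ?, ?, ?)",
    ("/opt/immich/md/social media/instagram/posts/example.jpg",
     "example.jpg", "instagram", "example", "final"),
)

# The query pattern a media page would use: filter by location, newest first.
rows = conn.execute(
    "SELECT filename, platform FROM file_inventory "
    "WHERE location = 'final' ORDER BY created_date DESC LIMIT 50"
).fetchall()
print(rows)  # [('example.jpg', 'instagram')]
```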

📂 File Locations

Final Destinations (location='final')

Configured in settings table, per platform:

/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/

Settings locations:

  • Forums: settings.forums.configs[].destination_path
  • Instagram/Snapchat/TikTok: Hardcoded or configurable
  • Coppermine: settings.coppermine.destination_path

Review Queue (location='review')

Path: /opt/immich/review/ (configurable in settings.face_recognition.review_path)

Maintains same directory structure as final destination:

/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg

Recycle Bin (location='recycle')

Path: /opt/immich/recycle/ (fixed)

Note: Recycle bin uses separate recycle_bin table (already implemented, don't duplicate).

Files stored with UUID filenames:

/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
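The UUID naming can be produced with a small helper (a sketch; the existing recycle_bin implementation is authoritative, and recycle_name is a hypothetical function name):

```python
import uuid
from pathlib import Path

def recycle_name(original_path: str) -> str:
    """Build a collision-free recycle-bin filename, preserving the extension."""
    return f"{uuid.uuid4()}{Path(original_path).suffix}"

name = recycle_name("/opt/immich/md/social media/instagram/posts/photo.jpg")
```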

🔄 Data Flow

1. Download Phase

Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/

2. Move Phase (move_module.py)

move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path) [optional - for audit trail]
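The move-phase flow above might reduce to something like this (a sketch only: move_file's signature is illustrative, and db is assumed to expose the upsert_file_inventory() method described under Database Methods):

```python
import shutil
from pathlib import Path

def move_file(db, temp_path: str, final_path: str, review_path: str,
              platform: str, source: str, face_matched: bool) -> str:
    """Move a downloaded file and record its new location in file_inventory.

    Sketch only: the real move_module.py handles more cases (collisions,
    retries, logging).
    """
    # Face recognition decides between final destination and review queue.
    dest = Path(final_path) if face_matched else Path(review_path)
    location = "final" if face_matched else "review"

    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(temp_path, str(dest))

    # Single source of truth: record the file's current location.
    db.upsert_file_inventory(
        file_path=str(dest), filename=dest.name,
        platform=platform, source=source, location=location,
    )
    return location
```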

3. Delete Phase

User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')

4. Restore Phase

User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
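At the database level, steps 3 and 4 amount to flipping the location field (a sketch over a trimmed-down table, using the update(location='recycle') variant rather than the delete variant; the path is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_inventory "
             "(file_path TEXT PRIMARY KEY, location TEXT NOT NULL)")
conn.execute("INSERT INTO file_inventory VALUES (?, 'final')",
             ("/opt/immich/md/example.jpg",))

# Delete phase: keep the row, flip its location to 'recycle'.
conn.execute("UPDATE file_inventory SET location = 'recycle' WHERE file_path = ?",
             ("/opt/immich/md/example.jpg",))

# Restore phase: the file is moved back, so location returns to 'final'.
conn.execute("UPDATE file_inventory SET location = 'final' WHERE file_path = ?",
             ("/opt/immich/md/example.jpg",))

row = conn.execute("SELECT location FROM file_inventory").fetchone()
print(row[0])  # final
```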

🚀 Implementation Plan

Phase 1: Infrastructure (Week 1)

Day 1: Table Creation & Backfill Script

  • Add file_inventory table to unified_database.py
  • Add methods: upsert_file_inventory(), delete_file_inventory(), query_file_inventory()
  • Write backfill script: /opt/media-downloader/utilities/backfill_file_inventory.py
  • Test backfill on test database

Day 2: Initial Backfill

  • Run backfill script on production database
  • Verify all 2,493 files captured
  • Check accuracy vs filesystem
  • Document any discrepancies

Day 3: Testing & Validation

  • Verify indexes created
  • Test query performance (should be <10ms)
  • Write unit tests for file_inventory methods

Phase 2: Update move_module (Week 1)

Day 4-5: Integration

  • Update move_module.py to call upsert_file_inventory() after successful moves
  • Handle location tracking ('final' vs 'review')
  • Add error handling and logging
  • Test with sample downloads (Instagram, Forum, etc.)
  • Verify file_inventory stays in sync

Phase 3: Update API Endpoints (Week 2)

Day 1-2: Media Page

  • Update /api/media/gallery to query file_inventory (location='final')
  • Add filtering by platform, source, content_type
  • Add pagination (already indexed)
  • Test performance improvement
  • Deploy and monitor
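One way the gallery endpoint's query could be assembled (a sketch; build_gallery_query and its parameters are illustrative names, not the real endpoint code):

```python
def build_gallery_query(platform=None, source=None, limit=50, offset=0):
    """Assemble the SQL for a gallery page request (illustrative sketch)."""
    sql = ("SELECT file_path, filename, platform, source "
           "FROM file_inventory WHERE location = 'final'")
    params = []
    if platform:
        sql += " AND platform = ?"
        params.append(platform)
    if source:
        sql += " AND source = ?"
        params.append(source)
    # Pagination rides on the created_date ordering baked into the indexes.
    sql += " ORDER BY created_date DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return sql, params
```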

Day 3: Downloads Page

  • Update /api/downloads/recent to query file_inventory
  • Test with different platforms
  • Verify sorting by created_date

Day 4: Review Queue

  • Update /api/review/queue to query file_inventory (location='review')
  • Verify face recognition integration
  • Test restore from review queue

Day 5: Testing & Documentation

  • Integration testing across all pages
  • Performance testing with large datasets
  • Update API documentation
  • User acceptance testing

Phase 4: Background Maintenance (Week 3)

Optional: File Verification Task

  • Create periodic task to verify file existence
  • Mark missing files in file_inventory
  • Alert on discrepancies
  • Auto-cleanup stale records (configurable)

📊 Expected Performance

Before (Filesystem Scanning)

| Page | Method | Files Scanned | Load Time |
|-------------|-----------------------|---------------|-------------|
| Media | directory.rglob('*') | 2,493 | 5-10 seconds |
| Downloads | directory.rglob('*') | 2,493 | 5-10 seconds |
| Review | directory.rglob('*') | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms |

After (Database Queries)

| Page | Method | Query Cost | Load Time |
|-------------|------------------------------------|----------------------|-----------|
| Media | SELECT ... LIMIT 50 | O(log n) with index | <100ms |
| Downloads | SELECT ... LIMIT 50 | O(log n) with index | <100ms |
| Review | SELECT ... WHERE location='review' | O(log n) with index | <100ms |
| Recycle Bin | Already database-backed | O(log n) with index | <100ms |

Performance Improvement: 50-100x faster 🚀


🔍 Backfill Strategy

Discovery Phase

Backfill script reads settings to find all file locations:

```python
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```

Metadata Extraction

For each file found:

  • Platform: From directory structure or settings
  • Source: Extract from filename or directory name
  • Content Type: From file extension
  • File Size: os.stat().st_size
  • File Hash: Calculate SHA256 (for deduplication)
  • Dimensions: Query from media_metadata.db if exists
  • Location: 'final' or 'review' based on directory
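A scan_directory() compatible with the discovery snippet could look like this (a sketch: the extension sets and hash chunk size are assumptions, and the dimensions lookup from media_metadata.db is omitted):

```python
import hashlib
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv"}

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA256 so large videos never load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scan_directory(root, platform=None, source=None, location="final"):
    """Yield one inventory record per media file found under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            content_type = "image"
        elif ext in VIDEO_EXTS:
            content_type = "video"
        else:
            continue  # skip sidecar/unknown files
        yield {
            "file_path": str(path.resolve()),
            "filename": path.name,
            "platform": platform,
            "source": source or path.parent.name,  # fall back to directory name
            "content_type": content_type,
            "file_size": path.stat().st_size,
            "file_hash": sha256_file(path),
            "location": location,
        }
```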

Idempotency

Script can be run multiple times safely:

  • Uses INSERT OR REPLACE / UPSERT semantics
  • Skips files already in database (with option to force refresh)
  • Logs statistics: new files, updated files, skipped files
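The UPSERT semantics can be expressed in SQLite with ON CONFLICT on the unique file_path column (a sketch over a trimmed-down table; the path is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_inventory ("
             "file_path TEXT NOT NULL UNIQUE, "
             "filename TEXT NOT NULL, location TEXT NOT NULL)")

upsert = """
    INSERT INTO file_inventory (file_path, filename, location)
    VALUES (?, ?, ?)
    ON CONFLICT(file_path) DO UPDATE SET
        filename = excluded.filename,
        location = excluded.location
"""
# Running the backfill twice must not create duplicate rows.
for _ in range(2):
    conn.execute(upsert, ("/opt/immich/md/example.jpg", "example.jpg", "final"))

count = conn.execute("SELECT COUNT(*) FROM file_inventory").fetchone()[0]
print(count)  # 1
```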

🛡️ Data Integrity

Constraints

  • file_path UNIQUE - Prevents duplicate entries
  • location NOT NULL - Every file must have a location
  • Indexes ensure fast lookups even with 100,000+ files

Verification

Background task (optional, runs daily):

  1. Select a random sample of 1,000 files from file_inventory
  2. Check if files still exist on filesystem
  3. Mark missing files or auto-delete records
  4. Log discrepancies for review
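The verification loop could be sketched as follows (assumptions: conn is a sqlite3 connection to the unified database, and missing files are reported rather than auto-deleted so discrepancies can be reviewed first):

```python
import os
import random

def verify_sample(conn, sample_size: int = 1000) -> dict:
    """Spot-check a random sample of inventory rows against the filesystem."""
    paths = [row[0] for row in conn.execute("SELECT file_path FROM file_inventory")]
    sample = random.sample(paths, min(sample_size, len(paths)))
    found, missing = [], []
    for path in sample:
        (found if os.path.exists(path) else missing).append(path)
    # Stamp rows that checked out; alerting/cleanup handles the rest.
    conn.executemany(
        "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
        "WHERE file_path = ?",
        [(p,) for p in found],
    )
    return {"checked": len(sample), "missing": missing}
```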

Migration Safety

  • downloads table: Never modified (preserves audit trail)
  • recycle_bin table: Never modified (already works perfectly)
  • New table: No risk to existing functionality
  • Gradual rollout: Update one endpoint at a time

📝 Database Methods

unified_database.py

```python
def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: str = None, content_type: str = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: str = None, platform: str = None,
                         source: str = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
```
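As a reference point, query_file_inventory() could be implemented roughly as below (a minimal sketch, not the real unified_database.py class; self.conn is an assumed sqlite3 connection attribute):

```python
import sqlite3
from typing import Dict, List, Optional

class UnifiedDatabase:
    """Minimal sketch holding only the inventory query method."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row  # rows become dict-convertible

    def query_file_inventory(self, location: Optional[str] = None,
                             platform: Optional[str] = None,
                             source: Optional[str] = None,
                             limit: int = 50, offset: int = 0) -> List[Dict]:
        """Query file inventory with filters and pagination."""
        sql = "SELECT * FROM file_inventory WHERE 1=1"
        params: list = []
        # Column names come from a fixed whitelist, so the f-string is safe.
        for col, val in (("location", location), ("platform", platform),
                         ("source", source)):
            if val is not None:
                sql += f" AND {col} = ?"
                params.append(val)
        sql += " ORDER BY created_date DESC LIMIT ? OFFSET ?"
        params += [limit, offset]
        return [dict(row) for row in self.conn.execute(sql, params)]
```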

🔄 Backwards Compatibility

During Migration

  • Old endpoints continue working (filesystem scan)
  • New endpoints use database queries
  • Can roll back instantly by switching endpoint logic
  • No data loss risk

After Migration

  • Keep downloads table for historical queries
  • Keep filesystem structure unchanged (Immich needs it)
  • file_inventory is an index over the files, not a replacement for them

📈 Monitoring

Metrics to Track

  • Query performance (should be <10ms)
  • File inventory count vs filesystem count
  • Missing files detected
  • Backfill success rate
  • API endpoint latency before/after

Alerts

  • File inventory diverges >5% from filesystem
  • Query performance degrades >100ms
  • Backfill failures

🚧 Future Enhancements

  1. Real-time sync: inotify/watchdog to detect file changes
  2. Advanced queries: Full-text search on filename/source
  3. Statistics: Track file age, access patterns
  4. Cleanup: Auto-remove files older than X days
  5. Export: Generate inventory reports (CSV, JSON)


Success Criteria

  • All 2,493+ files tracked in file_inventory
  • Database accuracy >98%
  • Page load times <100ms
  • Zero data loss
  • Backward compatible
  • No user-facing changes (transparent migration)

Status: Ready for implementation
Next Step: Create file_inventory table in unified_database.py