media-downloader/docs/FILE_INVENTORY.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00


File Inventory Architecture

Version: 6.33.5 | Date: 2025-11-16 | Status: Implementation Phase


📋 Overview

The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.


🎯 Problem Statement

Current Issues

  1. Performance: Pages scan entire directory trees on every load (2,493+ files)
  2. Accuracy: The database is only 79.93% accurate; files on disk don't match database records
  3. Stale Records: Downloaded files moved/deleted but database not updated
  4. Missing Records: 1,733+ files on disk with no database entries

Root Cause

Multiple systems track files independently:

  • Download modules record to downloads table during download
  • move_module updates paths when moving files
  • Filesystem is the actual source of truth
  • API endpoints scan filesystem (ignoring database)

Result: Database and filesystem drift apart over time.


💡 Solution: file_inventory Table

Architecture

Single Source of Truth: file_inventory table tracks ALL files in their current locations.

┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  move_module    │──> file_inventory table (current locations)
└─────────────────┘         │
                            │
                            ▼
                    ┌────────────────┐
                    │  API Endpoints │
                    │  (Fast Queries)│
                    └────────────────┘

Separation of Concerns

| Table | Purpose | Updates | Deletions |
|----------------|------------------------------------|-------------------|---------------------|
| downloads | Historical audit trail | Never | Never |
| file_inventory | Current file locations | On every move | When file deleted |
| recycle_bin | Deleted files (restore capability) | On delete/restore | On permanent delete |

🗄️ Database Schema

file_inventory Table

```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,        -- Absolute path (current location)
    filename TEXT NOT NULL,                -- Basename for display
    platform TEXT NOT NULL,                -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                           -- Username, forum name, etc.
    content_type TEXT,                     -- 'image' or 'video'
    file_size INTEGER,                     -- Size in bytes
    file_hash TEXT,                        -- SHA256 for deduplication
    width INTEGER,                         -- Image/video width (from metadata cache)
    height INTEGER,                        -- Image/video height
    location TEXT NOT NULL,                -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,               -- Last time file existence was verified
    metadata JSON                          -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (SQLite doesn't support inline INDEX clauses,
-- so these are created as separate statements)
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
```

Field Descriptions

  • file_path: Full absolute path (e.g., /opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg)
  • location: Current location type
    • 'final' - In final destination directory (ready for Immich)
    • 'review' - In review queue (no face match, pending manual review)
    • 'recycle' - In recycle bin (soft deleted, can be restored)
  • created_date: When file was moved to current location (not original download date)
  • last_verified: Background task updates this when verifying file still exists
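To make these semantics concrete, here is a gallery-style lookup against a trimmed-down copy of the schema (illustrative only, using an in-memory SQLite database; the real table lives in unified_database.py):

```python
import sqlite3

# In-memory database for illustration; column names match the schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_inventory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        file_path TEXT NOT NULL UNIQUE,
        filename TEXT NOT NULL,
        platform TEXT NOT NULL,
        source TEXT,
        location TEXT NOT NULL,
        created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO file_inventory (file_path, filename, platform, source, location) "
    "VALUES (?, ?, ?, ?, ?)",
    ("/opt/immich/md/social media/instagram/posts/example.jpg",
     "example.jpg", "instagram", "example", "final"),
)

# The query pattern a media page would use: filter by location, newest first.
rows = conn.execute(
    "SELECT filename, platform FROM file_inventory "
    "WHERE location = 'final' ORDER BY created_date DESC LIMIT 50"
).fetchall()
print(rows)  # [('example.jpg', 'instagram')]
```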

📂 File Locations

Final Destinations (location='final')

Configured in settings table, per platform:

/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/

Settings locations:

  • Forums: settings.forums.configs[].destination_path
  • Instagram/Snapchat/TikTok: Hardcoded or configurable
  • Coppermine: settings.coppermine.destination_path

Review Queue (location='review')

Path: /opt/immich/review/ (configurable in settings.face_recognition.review_path)

Maintains same directory structure as final destination:

/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg

Recycle Bin (location='recycle')

Path: /opt/immich/recycle/ (fixed)

Note: Recycle bin uses separate recycle_bin table (already implemented, don't duplicate).

Files stored with UUID filenames:

/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
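The UUID naming can be produced with a small helper (a sketch; the existing recycle_bin implementation is authoritative, and recycle_name is a hypothetical function name):

```python
import uuid
from pathlib import Path

def recycle_name(original_path: str) -> str:
    """Build a collision-free recycle-bin filename, preserving the extension."""
    return f"{uuid.uuid4()}{Path(original_path).suffix}"

name = recycle_name("/opt/immich/md/social media/instagram/posts/photo.jpg")
```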

🔄 Data Flow

1. Download Phase

Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/

2. Move Phase (move_module.py)

move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path) [optional - for audit trail]
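The move-phase flow above might reduce to something like this (a sketch only: move_file's signature is illustrative, and db is assumed to expose the upsert_file_inventory() method described under Database Methods):

```python
import shutil
from pathlib import Path

def move_file(db, temp_path: str, final_path: str, review_path: str,
              platform: str, source: str, face_matched: bool) -> str:
    """Move a downloaded file and record its new location in file_inventory.

    Sketch only: the real move_module.py handles more cases (collisions,
    retries, logging).
    """
    # Face recognition decides between final destination and review queue.
    dest = Path(final_path) if face_matched else Path(review_path)
    location = "final" if face_matched else "review"

    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(temp_path, str(dest))

    # Single source of truth: record the file's current location.
    db.upsert_file_inventory(
        file_path=str(dest), filename=dest.name,
        platform=platform, source=source, location=location,
    )
    return location
```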

3. Delete Phase

User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')

4. Restore Phase

User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
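At the database level, steps 3 and 4 amount to flipping the location field (a sketch over a trimmed-down table, using the update(location='recycle') variant rather than the delete variant; the path is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_inventory "
             "(file_path TEXT PRIMARY KEY, location TEXT NOT NULL)")
conn.execute("INSERT INTO file_inventory VALUES (?, 'final')",
             ("/opt/immich/md/example.jpg",))

# Delete phase: keep the row, flip its location to 'recycle'.
conn.execute("UPDATE file_inventory SET location = 'recycle' WHERE file_path = ?",
             ("/opt/immich/md/example.jpg",))

# Restore phase: the file is moved back, so location returns to 'final'.
conn.execute("UPDATE file_inventory SET location = 'final' WHERE file_path = ?",
             ("/opt/immich/md/example.jpg",))

row = conn.execute("SELECT location FROM file_inventory").fetchone()
print(row[0])  # final
```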

🚀 Implementation Plan

Phase 1: Infrastructure (Week 1)

Day 1: Table Creation & Backfill Script

  • Add file_inventory table to unified_database.py
  • Add methods: upsert_file_inventory(), delete_file_inventory(), query_file_inventory()
  • Write backfill script: /opt/media-downloader/utilities/backfill_file_inventory.py
  • Test backfill on test database

Day 2: Initial Backfill

  • Run backfill script on production database
  • Verify all 2,493 files captured
  • Check accuracy vs filesystem
  • Document any discrepancies

Day 3: Testing & Validation

  • Verify indexes created
  • Test query performance (should be <10ms)
  • Write unit tests for file_inventory methods

Phase 2: Update move_module (Week 1)

Day 4-5: Integration

  • Update move_module.py to call upsert_file_inventory() after successful moves
  • Handle location tracking ('final' vs 'review')
  • Add error handling and logging
  • Test with sample downloads (Instagram, Forum, etc.)
  • Verify file_inventory stays in sync

Phase 3: Update API Endpoints (Week 2)

Day 1-2: Media Page

  • Update /api/media/gallery to query file_inventory (location='final')
  • Add filtering by platform, source, content_type
  • Add pagination (already indexed)
  • Test performance improvement
  • Deploy and monitor
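One way the gallery endpoint's query could be assembled (a sketch; build_gallery_query and its parameters are illustrative names, not the real endpoint code):

```python
def build_gallery_query(platform=None, source=None, limit=50, offset=0):
    """Assemble the SQL for a gallery page request (illustrative sketch)."""
    sql = ("SELECT file_path, filename, platform, source "
           "FROM file_inventory WHERE location = 'final'")
    params = []
    if platform:
        sql += " AND platform = ?"
        params.append(platform)
    if source:
        sql += " AND source = ?"
        params.append(source)
    # Pagination rides on the created_date ordering baked into the indexes.
    sql += " ORDER BY created_date DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return sql, params
```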

Day 3: Downloads Page

  • Update /api/downloads/recent to query file_inventory
  • Test with different platforms
  • Verify sorting by created_date

Day 4: Review Queue

  • Update /api/review/queue to query file_inventory (location='review')
  • Verify face recognition integration
  • Test restore from review queue

Day 5: Testing & Documentation

  • Integration testing across all pages
  • Performance testing with large datasets
  • Update API documentation
  • User acceptance testing

Phase 4: Background Maintenance (Week 3)

Optional: File Verification Task

  • Create periodic task to verify file existence
  • Mark missing files in file_inventory
  • Alert on discrepancies
  • Auto-cleanup stale records (configurable)

📊 Expected Performance

Before (Filesystem Scanning)

| Page | Method | Files Scanned | Load Time |
|-------------|-----------------------|---------------|-------------|
| Media | directory.rglob('*') | 2,493 | 5-10 seconds |
| Downloads | directory.rglob('*') | 2,493 | 5-10 seconds |
| Review | directory.rglob('*') | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms |

After (Database Queries)

| Page | Method | Query Cost | Load Time |
|-------------|------------------------------------|----------------------|-----------|
| Media | SELECT ... LIMIT 50 | O(log n) with index | <100ms |
| Downloads | SELECT ... LIMIT 50 | O(log n) with index | <100ms |
| Review | SELECT ... WHERE location='review' | O(log n) with index | <100ms |
| Recycle Bin | Already database-backed | O(log n) with index | <100ms |

Performance Improvement: 50-100x faster 🚀


🔍 Backfill Strategy

Discovery Phase

Backfill script reads settings to find all file locations:

```python
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```

Metadata Extraction

For each file found:

  • Platform: From directory structure or settings
  • Source: Extract from filename or directory name
  • Content Type: From file extension
  • File Size: os.stat().st_size
  • File Hash: Calculate SHA256 (for deduplication)
  • Dimensions: Query from media_metadata.db if exists
  • Location: 'final' or 'review' based on directory
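A scan_directory() compatible with the discovery snippet could look like this (a sketch: the extension sets and hash chunk size are assumptions, and the dimensions lookup from media_metadata.db is omitted):

```python
import hashlib
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv"}

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA256 so large videos never load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scan_directory(root, platform=None, source=None, location="final"):
    """Yield one inventory record per media file found under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            content_type = "image"
        elif ext in VIDEO_EXTS:
            content_type = "video"
        else:
            continue  # skip sidecar/unknown files
        yield {
            "file_path": str(path.resolve()),
            "filename": path.name,
            "platform": platform,
            "source": source or path.parent.name,  # fall back to directory name
            "content_type": content_type,
            "file_size": path.stat().st_size,
            "file_hash": sha256_file(path),
            "location": location,
        }
```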

Idempotency

Script can be run multiple times safely:

  • Uses INSERT OR REPLACE / UPSERT semantics
  • Skips files already in database (with option to force refresh)
  • Logs statistics: new files, updated files, skipped files
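The UPSERT semantics can be expressed in SQLite with ON CONFLICT on the unique file_path column (a sketch over a trimmed-down table; the path is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_inventory ("
             "file_path TEXT NOT NULL UNIQUE, "
             "filename TEXT NOT NULL, location TEXT NOT NULL)")

upsert = """
    INSERT INTO file_inventory (file_path, filename, location)
    VALUES (?, ?, ?)
    ON CONFLICT(file_path) DO UPDATE SET
        filename = excluded.filename,
        location = excluded.location
"""
# Running the backfill twice must not create duplicate rows.
for _ in range(2):
    conn.execute(upsert, ("/opt/immich/md/example.jpg", "example.jpg", "final"))

count = conn.execute("SELECT COUNT(*) FROM file_inventory").fetchone()[0]
print(count)  # 1
```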

🛡️ Data Integrity

Constraints

  • file_path UNIQUE - Prevents duplicate entries
  • location NOT NULL - Every file must have a location
  • Indexes ensure fast lookups even with 100,000+ files

Verification

Background task (optional, runs daily):

  1. Select a random sample of 1,000 files from file_inventory
  2. Check if files still exist on filesystem
  3. Mark missing files or auto-delete records
  4. Log discrepancies for review
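The verification loop could be sketched as follows (assumptions: conn is a sqlite3 connection to the unified database, and missing files are reported rather than auto-deleted so discrepancies can be reviewed first):

```python
import os
import random

def verify_sample(conn, sample_size: int = 1000) -> dict:
    """Spot-check a random sample of inventory rows against the filesystem."""
    paths = [row[0] for row in conn.execute("SELECT file_path FROM file_inventory")]
    sample = random.sample(paths, min(sample_size, len(paths)))
    found, missing = [], []
    for path in sample:
        (found if os.path.exists(path) else missing).append(path)
    # Stamp rows that checked out; alerting/cleanup handles the rest.
    conn.executemany(
        "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
        "WHERE file_path = ?",
        [(p,) for p in found],
    )
    return {"checked": len(sample), "missing": missing}
```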

Migration Safety

  • downloads table: Never modified (preserves audit trail)
  • recycle_bin table: Never modified (already works perfectly)
  • New table: No risk to existing functionality
  • Gradual rollout: Update one endpoint at a time

📝 Database Methods

unified_database.py

```python
def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: str = None, content_type: str = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: str = None, platform: str = None,
                         source: str = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
```
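As a reference point, query_file_inventory() could be implemented roughly as below (a minimal sketch, not the real unified_database.py class; self.conn is an assumed sqlite3 connection attribute):

```python
import sqlite3
from typing import Dict, List, Optional

class UnifiedDatabase:
    """Minimal sketch holding only the inventory query method."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row  # rows become dict-convertible

    def query_file_inventory(self, location: Optional[str] = None,
                             platform: Optional[str] = None,
                             source: Optional[str] = None,
                             limit: int = 50, offset: int = 0) -> List[Dict]:
        """Query file inventory with filters and pagination."""
        sql = "SELECT * FROM file_inventory WHERE 1=1"
        params: list = []
        # Column names come from a fixed whitelist, so the f-string is safe.
        for col, val in (("location", location), ("platform", platform),
                         ("source", source)):
            if val is not None:
                sql += f" AND {col} = ?"
                params.append(val)
        sql += " ORDER BY created_date DESC LIMIT ? OFFSET ?"
        params += [limit, offset]
        return [dict(row) for row in self.conn.execute(sql, params)]
```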

🔄 Backwards Compatibility

During Migration

  • Old endpoints continue working (filesystem scan)
  • New endpoints use database queries
  • Can roll back instantly by switching endpoint logic
  • No data loss risk

After Migration

  • Keep downloads table for historical queries
  • Keep filesystem structure unchanged (Immich needs it)
  • file_inventory is an index over the files, not a replacement for them

📈 Monitoring

Metrics to Track

  • Query performance (should be <10ms)
  • File inventory count vs filesystem count
  • Missing files detected
  • Backfill success rate
  • API endpoint latency before/after

Alerts

  • File inventory diverges >5% from filesystem
  • Query performance degrades >100ms
  • Backfill failures

🚧 Future Enhancements

  1. Real-time sync: inotify/watchdog to detect file changes
  2. Advanced queries: Full-text search on filename/source
  3. Statistics: Track file age, access patterns
  4. Cleanup: Auto-remove files older than X days
  5. Export: Generate inventory reports (CSV, JSON)


Success Criteria

  • All 2,493+ files tracked in file_inventory
  • Database accuracy >98%
  • Page load times <100ms
  • Zero data loss
  • Backward compatible
  • No user-facing changes (transparent migration)

Status: Ready for implementation
Next Step: Create file_inventory table in unified_database.py