# File Inventory Architecture

**Version:** 6.33.5
**Date:** 2025-11-16
**Status:** Implementation Phase

---

## 📋 Overview

The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.

---

## 🎯 Problem Statement

### Current Issues

1. **Performance**: Pages scan entire directory trees on every load (2,493+ files)
2. **Accuracy**: Database only 79.93% accurate - files on disk don't match database records
3. **Stale Records**: Downloaded files moved/deleted but database not updated
4. **Missing Records**: 1,733+ files on disk with no database entries

### Root Cause

Multiple systems track files independently:

- **Download modules** record to `downloads` table during download
- **move_module** updates paths when moving files
- **Filesystem** is the actual source of truth
- **API endpoints** scan filesystem (ignoring database)

Result: Database and filesystem drift apart over time.

---

## 💡 Solution: file_inventory Table

### Architecture

**Single Source of Truth**: `file_inventory` table tracks ALL files in their current locations.
```
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   move_module   │──> file_inventory table (current locations)
└─────────────────┘
         │
         ▼
┌───────────────┐
│ API Endpoints │
│ (Fast Queries)│
└───────────────┘
```

### Separation of Concerns

| Table | Purpose | Updates | Deletions |
|-------|---------|---------|-----------|
| `downloads` | Historical audit trail | Never | Never |
| `file_inventory` | Current file locations | On every move | When file deleted |
| `recycle_bin` | Deleted files (restore capability) | On delete/restore | On permanent delete |

---

## 🗄️ Database Schema

### file_inventory Table

```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,     -- Absolute path (current location)
    filename TEXT NOT NULL,             -- Basename for display
    platform TEXT NOT NULL,             -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                        -- Username, forum name, etc.
    content_type TEXT,                  -- 'image' or 'video'
    file_size INTEGER,                  -- Size in bytes
    file_hash TEXT,                     -- SHA256 for deduplication
    width INTEGER,                      -- Image/video width (from metadata cache)
    height INTEGER,                     -- Image/video height
    location TEXT NOT NULL,             -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,            -- Last time file existence verified
    metadata JSON                       -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (SQLite does not support inline INDEX
-- clauses in CREATE TABLE, so these are separate statements)
CREATE INDEX idx_fi_platform_location ON file_inventory(platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory(source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory(location);
CREATE INDEX idx_fi_hash ON file_inventory(file_hash);
```

### Field Descriptions

- **file_path**: Full absolute path (e.g., `/opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg`)
- **location**: Current location type
  - `'final'` - In final destination directory (ready for Immich)
  - `'review'` - In review queue (no face match, pending manual review)
  - `'recycle'` - In recycle bin (soft deleted, can be restored)
- **created_date**: When file was moved to current location (not original download date)
- **last_verified**: Background task updates this when verifying file still exists

---

## 📂 File Locations

### Final Destinations (location='final')

Configured in settings table, per platform:

```
/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/
```

**Settings locations:**
- Forums: `settings.forums.configs[].destination_path`
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: `settings.coppermine.destination_path`

### Review Queue (location='review')

Path: `/opt/immich/review/` (configurable in `settings.face_recognition.review_path`)

Maintains same directory structure as final destination:

```
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
```

### Recycle Bin (location='recycle')

Path: `/opt/immich/recycle/` (fixed)

**Note**: Recycle bin uses the separate `recycle_bin` table (already implemented, don't duplicate). Files are stored with UUID filenames:

```
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
```

---
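Given the directory layout above, a file's `platform`, `source`, and `location` fields can be inferred from its path alone, which both the backfill script and `move_module` could reuse. A minimal sketch, assuming the hardcoded roots shown earlier; the helper name and parsing rules are illustrative, not part of the design:

```python
from pathlib import Path

# Hypothetical roots; real values come from the settings table.
FINAL_ROOT = Path("/opt/immich/md")
REVIEW_ROOT = Path("/opt/immich/review")

def infer_inventory_fields(file_path: str) -> dict:
    """Derive platform/source/location for a file from its path.

    Assumed layout: <root>/social media/<platform>/<kind>/<file>,
    <root>/forums/<forum name>/<file>, or <root>/gallery/Coppermine/<file>.
    """
    path = Path(file_path)
    if path.is_relative_to(REVIEW_ROOT):
        location, rel = "review", path.relative_to(REVIEW_ROOT)
    else:
        location, rel = "final", path.relative_to(FINAL_ROOT)

    parts = rel.parts
    if parts[0] == "social media":
        platform, source = parts[1], None      # e.g. instagram/posts/...
    elif parts[0] == "forums":
        platform, source = "forum", parts[1]   # e.g. forums/PicturePub/...
    else:
        platform, source = "coppermine", None  # gallery/Coppermine/...

    ext = path.suffix.lower()
    content_type = "video" if ext in {".mp4", ".mov", ".webm"} else "image"
    return {
        "filename": path.name,
        "platform": platform,
        "source": source,
        "location": location,
        "content_type": content_type,
    }
```

Keeping this logic in one helper means the backfill and the move path cannot disagree about how a path maps to inventory fields.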
## 🔄 Data Flow

### 1. Download Phase

```
Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/
```

### 2. Move Phase (move_module.py)

```
move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path)  [optional - for audit trail]
```

### 3. Delete Phase

```
User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')
```

### 4. Restore Phase

```
User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
```

---

## 🚀 Implementation Plan

### Phase 1: Infrastructure (Week 1)

#### Day 1: Table Creation & Backfill Script
- [ ] Add `file_inventory` table to `unified_database.py`
- [ ] Add methods: `upsert_file_inventory()`, `delete_file_inventory()`, `query_file_inventory()`
- [ ] Write backfill script: `/opt/media-downloader/utilities/backfill_file_inventory.py`
- [ ] Test backfill on test database

#### Day 2: Initial Backfill
- [ ] Run backfill script on production database
- [ ] Verify all 2,493 files captured
- [ ] Check accuracy vs filesystem
- [ ] Document any discrepancies

#### Day 3: Testing & Validation
- [ ] Verify indexes created
- [ ] Test query performance (should be <10ms)
- [ ] Write unit tests for file_inventory methods

### Phase 2: Update move_module (Week 1)

#### Day 4-5: Integration
- [ ] Update `move_module.py` to call `upsert_file_inventory()` after successful moves
- [ ] Handle location tracking ('final' vs 'review')
- [ ] Add error handling and logging
- [ ] Test with sample downloads (Instagram, Forum, etc.)
- [ ] Verify file_inventory stays in sync

### Phase 3: Update API Endpoints (Week 2)

#### Day 1-2: Media Page
- [ ] Update `/api/media/gallery` to query `file_inventory` (location='final')
- [ ] Add filtering by platform, source, content_type
- [ ] Add pagination (already indexed)
- [ ] Test performance improvement
- [ ] Deploy and monitor

#### Day 3: Downloads Page
- [ ] Update `/api/downloads/recent` to query `file_inventory`
- [ ] Test with different platforms
- [ ] Verify sorting by created_date

#### Day 4: Review Queue
- [ ] Update `/api/review/queue` to query `file_inventory` (location='review')
- [ ] Verify face recognition integration
- [ ] Test restore from review queue

#### Day 5: Testing & Documentation
- [ ] Integration testing across all pages
- [ ] Performance testing with large datasets
- [ ] Update API documentation
- [ ] User acceptance testing

### Phase 4: Background Maintenance (Week 3)

#### Optional: File Verification Task
- [ ] Create periodic task to verify file existence
- [ ] Mark missing files in `file_inventory`
- [ ] Alert on discrepancies
- [ ] Auto-cleanup stale records (configurable)

---

## 📊 Expected Performance

### Before (Filesystem Scanning)

| Page | Method | Files Scanned | Load Time |
|------|--------|---------------|-----------|
| Media | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Downloads | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Review | `directory.rglob('*')` | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |

### After (Database Queries)

| Page | Method | Query Cost | Load Time |
|------|--------|-----------|-----------|
| Media | `SELECT ... LIMIT 50` | O(1) with index | <100ms ✅ |
| Downloads | `SELECT ... LIMIT 50` | O(1) with index | <100ms ✅ |
| Review | `SELECT ... WHERE location='review'` | O(1) with index | <100ms ✅ |
| Recycle Bin | Already database-backed | O(1) with index | <100ms ✅ |

**Performance Improvement: 50-100x faster** 🚀

---

## 🔍 Backfill Strategy

### Discovery Phase

The backfill script reads settings to find all file locations:

```python
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```

### Metadata Extraction

For each file found:
- **Platform**: From directory structure or settings
- **Source**: Extracted from filename or directory name
- **Content Type**: From file extension
- **File Size**: `os.stat().st_size`
- **File Hash**: Calculated SHA256 (for deduplication)
- **Dimensions**: Queried from `media_metadata.db` if it exists
- **Location**: 'final' or 'review', based on directory

### Idempotency

The script can be run multiple times safely:
- Uses `INSERT OR REPLACE` / `UPSERT` semantics
- Skips files already in the database (with an option to force refresh)
- Logs statistics: new files, updated files, skipped files

---

## 🛡️ Data Integrity

### Constraints

- `file_path UNIQUE` - Prevents duplicate entries
- `location NOT NULL` - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files

### Verification

Background task (optional, runs daily):

1. Select a random 1,000 files from `file_inventory`
2. Check whether the files still exist on the filesystem
3. Mark missing files or auto-delete their records
4. Log discrepancies for review

### Migration Safety

- **downloads table**: Never modified (preserves audit trail)
- **recycle_bin table**: Never modified (already works perfectly)
- **New table**: No risk to existing functionality
- **Gradual rollout**: Update one endpoint at a time

---

## 📝 Database Methods

### unified_database.py

```python
def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: str = None, content_type: str = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: str = None, platform: str = None,
                         source: str = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
```

---

## 🔄 Backwards Compatibility

### During Migration
- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk

### After Migration
- Keep `downloads` table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- `file_inventory` is an index of the files, not a replacement for them

---

## 📈 Monitoring

### Metrics to Track
- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after

### Alerts
- File inventory diverges >5% from filesystem
- Query performance degrades >100ms
- Backfill failures

---
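The upsert semantics described in the Database Methods section (and the idempotency the backfill strategy relies on) map directly onto SQLite's `ON CONFLICT` clause, available since SQLite 3.24. A sketch of `upsert_file_inventory` as a standalone function; the real method would live on the database class, and everything beyond the documented signature is an assumption:

```python
import sqlite3

def upsert_file_inventory(conn: sqlite3.Connection, file_path: str, filename: str,
                          platform: str, source: str = None,
                          content_type: str = None, location: str = 'final') -> bool:
    """Insert a file, or update it in place if file_path already exists.

    The UNIQUE constraint on file_path drives the ON CONFLICT branch,
    so re-running a backfill never creates duplicate rows.
    """
    conn.execute(
        """
        INSERT INTO file_inventory (file_path, filename, platform, source,
                                    content_type, location)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(file_path) DO UPDATE SET
            filename = excluded.filename,
            platform = excluded.platform,
            source = excluded.source,
            content_type = excluded.content_type,
            location = excluded.location,
            created_date = CURRENT_TIMESTAMP
        """,
        (file_path, filename, platform, source, content_type, location),
    )
    conn.commit()
    return True
```

Resetting `created_date` in the UPDATE branch matches the field's documented meaning: when the file was moved to its *current* location, not when it was first seen.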
## 🚧 Future Enhancements

1. **Real-time sync**: inotify/watchdog to detect file changes
2. **Advanced queries**: Full-text search on filename/source
3. **Statistics**: Track file age, access patterns
4. **Cleanup**: Auto-remove files older than X days
5. **Export**: Generate inventory reports (CSV, JSON)

---

## 📚 Related Documentation

- [VERSIONING.md](VERSIONING.md) - Version control and backups
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [FACE_RECOGNITION.md](FACE_RECOGNITION.md) - Face recognition integration
- [CHANGELOG.md](CHANGELOG.md) - Version history

---

## ✅ Success Criteria

- [ ] All 2,493+ files tracked in `file_inventory`
- [ ] Database accuracy >98%
- [ ] Page load times <100ms
- [ ] Zero data loss
- [ ] Backward compatible
- [ ] No user-facing changes (transparent migration)

---

**Status**: Ready for implementation
**Next Step**: Create `file_inventory` table in `unified_database.py`
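As a concrete starting point for the optional verification task (Phase 4, and the Verification steps under Data Integrity), a minimal sketch of the daily existence check. The sampling query and return shape are assumptions; the real method would be `verify_file_inventory` on the database class:

```python
import os
import sqlite3

def verify_file_inventory(conn: sqlite3.Connection, sample_size: int = 1000) -> dict:
    """Spot-check a random sample of inventory rows against the filesystem.

    Files that exist get last_verified refreshed; missing files are only
    reported, leaving auto-cleanup to a separate, configurable step.
    """
    rows = conn.execute(
        "SELECT file_path FROM file_inventory ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    ).fetchall()

    missing = []
    for (path,) in rows:
        if os.path.exists(path):
            conn.execute(
                "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
                "WHERE file_path = ?",
                (path,),
            )
        else:
            missing.append(path)
    conn.commit()
    return {"checked": len(rows), "missing": missing}
```

Sampling keeps the daily cost bounded even at 100,000+ rows; the >5% divergence alert from the Monitoring section can be computed from `len(missing) / checked`.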