# File Inventory Architecture

**Version:** 6.33.5
**Date:** 2025-11-16
**Status:** Implementation Phase

---

## 📋 Overview

The File Inventory system takes a database-first approach to tracking media files across the application. It replaces slow filesystem scans with fast indexed database queries, with the goal of cutting page load times from 5-10 seconds to under 100ms.

---

## 🎯 Problem Statement

### Current Issues

1. **Performance**: Pages scan entire directory trees on every load (2,493+ files)
2. **Accuracy**: The database is only 79.93% accurate; files on disk do not match database records
3. **Stale Records**: Downloaded files are moved or deleted, but the database is not updated
4. **Missing Records**: 1,733+ files on disk have no database entries

### Root Cause

Multiple systems track files independently:
- **Download modules** record to `downloads` table during download
- **move_module** updates paths when moving files
- **Filesystem** is the actual source of truth
- **API endpoints** scan filesystem (ignoring database)

Result: Database and filesystem drift apart over time.

---

## 💡 Solution: file_inventory Table

### Architecture

**Single Source of Truth**: `file_inventory` table tracks ALL files in their current locations.

```
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   move_module   │──> file_inventory table (current locations)
└─────────────────┘              │
                                 │
                                 ▼
                         ┌───────────────┐
                         │ API Endpoints │
                         │ (Fast Queries)│
                         └───────────────┘
```

### Separation of Concerns

| Table | Purpose | Updates | Deletions |
|-------|---------|---------|-----------|
| `downloads` | Historical audit trail | Never | Never |
| `file_inventory` | Current file locations | On every move | When file deleted |
| `recycle_bin` | Deleted files (restore capability) | On delete/restore | On permanent delete |

---

## 🗄️ Database Schema

### file_inventory Table

```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,        -- Absolute path (current location)
    filename TEXT NOT NULL,                -- Basename for display
    platform TEXT NOT NULL,                -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                           -- Username, forum name, etc.
    content_type TEXT,                     -- 'image' or 'video'
    file_size INTEGER,                     -- Size in bytes
    file_hash TEXT,                        -- SHA256 for deduplication
    width INTEGER,                         -- Image/video width (from metadata cache)
    height INTEGER,                        -- Image/video height
    location TEXT NOT NULL,                -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,               -- Last time file existence verified
    metadata JSON                          -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (created separately; SQLite does not accept inline INDEX clauses)
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
```

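As a minimal sketch of how this schema might be applied from Python, the snippet below creates a trimmed version of the table in an in-memory database and lists the indexes that exist afterwards. It assumes the project uses SQLite through the standard `sqlite3` module (suggested by `AUTOINCREMENT`, but not confirmed here). `create_schema` and `list_indexes` are hypothetical helpers, not part of `unified_database.py`.

```python
import sqlite3

# Trimmed column list for brevity; the full schema is shown above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,
    filename TEXT NOT NULL,
    platform TEXT NOT NULL,
    source TEXT,
    location TEXT NOT NULL,
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_fi_platform_location
    ON file_inventory (platform, location, created_date DESC);
CREATE INDEX IF NOT EXISTS idx_fi_location ON file_inventory (location);
"""


def create_schema(conn):
    """Hypothetical helper: create the table and its indexes."""
    conn.executescript(SCHEMA)


def list_indexes(conn):
    """Return the names of the explicitly created file_inventory indexes."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'file_inventory' "
        "AND name LIKE 'idx_fi%' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
create_schema(conn)  # safe to repeat because of IF NOT EXISTS
print(list_indexes(conn))
```

Using `IF NOT EXISTS` keeps table creation safe to run on every application start, which may or may not match how the real module initializes its tables.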
### Field Descriptions

- **file_path**: Full absolute path (e.g., `/opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg`)
- **location**: Current location type
  - `'final'` - In final destination directory (ready for Immich)
  - `'review'` - In review queue (no face match, pending manual review)
  - `'recycle'` - In recycle bin (soft deleted, can be restored)
- **created_date**: When file was moved to current location (not original download date)
- **last_verified**: Background task updates this when verifying file still exists

---

## 📂 File Locations

### Final Destinations (location='final')

Configured in settings table, per platform:

```
/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/
```

**Settings locations:**
- Forums: `settings.forums.configs[].destination_path`
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: `settings.coppermine.destination_path`

### Review Queue (location='review')

Path: `/opt/immich/review/` (configurable in `settings.face_recognition.review_path`)

Maintains the same directory structure as the final destination:
```
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
```

### Recycle Bin (location='recycle')

Path: `/opt/immich/recycle/` (fixed)

**Note**: Recycle bin uses separate `recycle_bin` table (already implemented, don't duplicate).

Files stored with UUID filenames:
```
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
```

---

## 🔄 Data Flow

### 1. Download Phase

```
Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/
```

### 2. Move Phase (move_module.py)

```
move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path)  [optional - for audit trail]
```

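The move flow above could be sketched roughly as follows. This is an illustration, not the actual `move_module.py` code: the `move_file` signature, the `face_matched` flag, and the trimmed table definition are all assumptions, and the face recognition check itself is reduced to a boolean passed in by the caller.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def move_file(conn, src, final_dir, review_dir, platform, face_matched):
    """Move src to the final or review directory, then record the new path."""
    location = "final" if face_matched else "review"
    dest_dir = Path(final_dir if face_matched else review_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    shutil.move(str(src), str(dest))  # move on disk first
    conn.execute(
        """
        INSERT INTO file_inventory (file_path, filename, platform, location)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(file_path) DO UPDATE SET
            location = excluded.location,
            created_date = CURRENT_TIMESTAMP
        """,
        (str(dest), dest.name, platform, location),
    )
    conn.commit()
    return dest, location


# Demo against a throwaway database and a temporary directory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, file_path TEXT NOT NULL UNIQUE, "
    "filename TEXT NOT NULL, platform TEXT NOT NULL, location TEXT NOT NULL, "
    "created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "temp" / "photo.jpg"
    src.parent.mkdir()
    src.write_bytes(b"fake image data")
    dest, location = move_file(
        conn, src, Path(tmp) / "final", Path(tmp) / "review",
        platform="instagram", face_matched=False,
    )
    moved_ok = dest.exists() and not src.exists()
rows = conn.execute("SELECT filename, location FROM file_inventory").fetchall()
print(moved_ok, rows)
```

The sketch updates the filesystem before the database, so a failed move never leaves a row pointing at a file that is not there. A later move from the review queue to the final destination would also need to remove or rewrite the row for the old path, which is left out here.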
### 3. Delete Phase

```
User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')
```

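A rough sketch of the delete flow is shown below. It takes the "delete the inventory row" branch of the last step; the alternative of updating `location='recycle'` would replace the `DELETE` with an `UPDATE`. The `recycle_bin` columns (`original_path`, `recycle_path`) are inferred from the diagram, and `soft_delete` is a hypothetical name, so treat the details as assumptions rather than the existing recycle bin implementation.

```python
import shutil
import sqlite3
import tempfile
import uuid
from pathlib import Path


def soft_delete(conn, file_path, recycle_dir):
    """Move a file into the recycle directory under a UUID name and record it."""
    src = Path(file_path)
    recycle_dir = Path(recycle_dir)
    recycle_dir.mkdir(parents=True, exist_ok=True)
    recycle_path = recycle_dir / f"{uuid.uuid4()}{src.suffix}"
    shutil.move(str(src), str(recycle_path))
    with conn:  # both statements commit together, or roll back together
        conn.execute(
            "INSERT INTO recycle_bin (original_path, recycle_path) VALUES (?, ?)",
            (str(src), str(recycle_path)),
        )
        conn.execute("DELETE FROM file_inventory WHERE file_path = ?", (str(src),))
    return recycle_path


# Demo with assumed, heavily trimmed table definitions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_inventory (file_path TEXT NOT NULL UNIQUE, location TEXT NOT NULL);
CREATE TABLE recycle_bin (original_path TEXT NOT NULL, recycle_path TEXT NOT NULL);
""")
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "final" / "photo.jpg"
    original.parent.mkdir()
    original.write_bytes(b"fake image data")
    conn.execute("INSERT INTO file_inventory VALUES (?, 'final')", (str(original),))
    conn.commit()
    recycled = soft_delete(conn, original, Path(tmp) / "recycle")
    recycled_exists = recycled.exists()
inventory_count = conn.execute("SELECT COUNT(*) FROM file_inventory").fetchone()[0]
bin_count = conn.execute("SELECT COUNT(*) FROM recycle_bin").fetchone()[0]
print(recycled.name, inventory_count, bin_count)
```

Wrapping both statements in one transaction avoids a state where the file is listed in neither table or in both.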
### 4. Restore Phase

```
User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
```

---

## 🚀 Implementation Plan

### Phase 1: Infrastructure (Week 1)

#### Day 1: Table Creation & Backfill Script
- [ ] Add `file_inventory` table to `unified_database.py`
- [ ] Add methods: `upsert_file_inventory()`, `delete_file_inventory()`, `query_file_inventory()`
- [ ] Write backfill script: `/opt/media-downloader/utilities/backfill_file_inventory.py`
- [ ] Test backfill on test database

#### Day 2: Initial Backfill
- [ ] Run backfill script on production database
- [ ] Verify all 2,493 files captured
- [ ] Check accuracy vs filesystem
- [ ] Document any discrepancies

#### Day 3: Testing & Validation
- [ ] Verify indexes created
- [ ] Test query performance (should be <10ms)
- [ ] Write unit tests for file_inventory methods

### Phase 2: Update move_module (Week 1)

#### Day 4-5: Integration
- [ ] Update `move_module.py` to call `upsert_file_inventory()` after successful moves
- [ ] Handle location tracking ('final' vs 'review')
- [ ] Add error handling and logging
- [ ] Test with sample downloads (Instagram, Forum, etc.)
- [ ] Verify file_inventory stays in sync

### Phase 3: Update API Endpoints (Week 2)

#### Day 1-2: Media Page
- [ ] Update `/api/media/gallery` to query `file_inventory` (location='final')
- [ ] Add filtering by platform, source, content_type
- [ ] Add pagination (already indexed)
- [ ] Test performance improvement
- [ ] Deploy and monitor

#### Day 3: Downloads Page
- [ ] Update `/api/downloads/recent` to query `file_inventory`
- [ ] Test with different platforms
- [ ] Verify sorting by created_date

#### Day 4: Review Queue
- [ ] Update `/api/review/queue` to query `file_inventory` (location='review')
- [ ] Verify face recognition integration
- [ ] Test restore from review queue

#### Day 5: Testing & Documentation
- [ ] Integration testing across all pages
- [ ] Performance testing with large datasets
- [ ] Update API documentation
- [ ] User acceptance testing

### Phase 4: Background Maintenance (Week 3)

#### Optional: File Verification Task
- [ ] Create periodic task to verify file existence
- [ ] Mark missing files in `file_inventory`
- [ ] Alert on discrepancies
- [ ] Auto-cleanup stale records (configurable)

---

## 📊 Expected Performance

### Before (Filesystem Scanning)

| Page | Method | Files Scanned | Load Time |
|------|--------|---------------|-----------|
| Media | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Downloads | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Review | `directory.rglob('*')` | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |

### After (Database Queries)

| Page | Method | Query Cost | Load Time |
|------|--------|-----------|-----------|
| Media | `SELECT ... LIMIT 50` | O(log n) index lookup | <100ms ✅ |
| Downloads | `SELECT ... LIMIT 50` | O(log n) index lookup | <100ms ✅ |
| Review | `SELECT ... WHERE location='review'` | O(log n) index lookup | <100ms ✅ |
| Recycle Bin | Database query (unchanged) | O(log n) index lookup | <100ms ✅ |

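One way to check that a paginated query can actually use an index is to ask SQLite for its query plan. The sketch below builds a small in-memory table with made-up rows, prints the plan for a gallery-style query, and fetches the first page. The table is trimmed and the data is synthetic, so it only illustrates the technique; it says nothing about production timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,
    platform TEXT NOT NULL,
    location TEXT NOT NULL,
    created_date TIMESTAMP
);
CREATE INDEX idx_fi_platform_location
    ON file_inventory (platform, location, created_date DESC);
""")

# Synthetic rows: 200 files spread over 28 days.
conn.executemany(
    "INSERT INTO file_inventory (file_path, platform, location, created_date) "
    "VALUES (?, ?, ?, ?)",
    [
        (f"/media/{i}.jpg", "instagram", "final", f"2025-11-{i % 28 + 1:02d} 00:00:00")
        for i in range(200)
    ],
)

QUERY = (
    "SELECT file_path, created_date FROM file_inventory "
    "WHERE platform = ? AND location = ? "
    "ORDER BY created_date DESC LIMIT 50"
)
params = ("instagram", "final")

# The last column of each plan row is a human-readable description.
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, params):
    print(row[-1])

page = conn.execute(QUERY, params).fetchall()
print(len(page), "rows on the first page")
```

If the printed plan mentions a full table scan instead of an index search, the query shape and the index columns are probably out of step.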
**Performance Improvement: 50-100x faster** 🚀

---

## 🔍 Backfill Strategy

### Discovery Phase

Backfill script reads settings to find all file locations:

```python
# Assumes `db` (database handle) and `scan_directory` (directory walker)
# are defined elsewhere in the backfill script.

# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```

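The `scan_directory` helper used above is not defined in this document. A minimal sketch of what it might look like follows; the extension sets, the returned dictionary keys, and the decision to skip non-media files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumed extension lists; the real script may recognize more formats.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv"}


def scan_directory(root, platform=None, source=None, location="final"):
    """Yield one inventory record (dict) per media file found under root."""
    root = Path(root)
    if not root.is_dir():
        return  # a missing directory simply yields nothing
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            content_type = "image"
        elif ext in VIDEO_EXTS:
            content_type = "video"
        else:
            continue  # skip non-media files
        yield {
            "file_path": str(path.resolve()),
            "filename": path.name,
            "platform": platform,
            "source": source,
            "content_type": content_type,
            "file_size": path.stat().st_size,
            "location": location,
        }


# Demo on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.mp4", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"x")
    records = list(scan_directory(tmp, platform="forum", source="PicturePub"))

for rec in records:
    print(rec["filename"], rec["content_type"])
```

Because `platform` is `NOT NULL` in the schema, a caller scanning the review queue without a `platform` argument would still have to derive it, for example from the directory name, before inserting.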
### Metadata Extraction

For each file found:
- **Platform**: From directory structure or settings
- **Source**: Extract from filename or directory name
- **Content Type**: From file extension
- **File Size**: `os.stat().st_size`
- **File Hash**: Calculate SHA256 (for deduplication)
- **Dimensions**: Query from `media_metadata.db` if exists
- **Location**: 'final' or 'review' based on directory

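The size and hash fields from the list above can be computed with the standard library alone. The sketch below reads each file in chunks so that large videos are not loaded into memory at once; `sha256_of` and `file_metadata` are hypothetical helper names, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_metadata(path):
    """Collect the filesystem-derived fields for one inventory record."""
    return {
        "filename": os.path.basename(path),
        "file_size": os.stat(path).st_size,
        "file_hash": sha256_of(path),
    }


# Demo on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "evalongoria_2025-11-16.jpg")
    with open(sample, "wb") as fh:
        fh.write(b"hello")
    meta = file_metadata(sample)

print(meta["filename"], meta["file_size"], meta["file_hash"][:12])
```

Hashing every file makes the first backfill noticeably slower than a plain directory walk, so it may be worth making the hash optional on the initial run.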
### Idempotency

Script can be run multiple times safely:
- Uses `INSERT OR REPLACE` / `UPSERT` semantics
- Skips files already in database (with option to force refresh)
- Logs statistics: new files, updated files, skipped files

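The two options named above behave differently in SQLite: `INSERT OR REPLACE` deletes the conflicting row and inserts a new one, so the row gets a new `id`, while `INSERT ... ON CONFLICT DO UPDATE` (available since SQLite 3.24.0) updates the row in place. The sketch below uses the second form and runs the same backfill twice to show that the row count and ids stay stable. The `backfill` function, its statistics keys, and the trimmed table are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, file_path TEXT NOT NULL UNIQUE, "
    "platform TEXT NOT NULL, location TEXT NOT NULL, file_size INTEGER)"
)

UPSERT = """
INSERT INTO file_inventory (file_path, platform, location, file_size)
VALUES (:file_path, :platform, :location, :file_size)
ON CONFLICT(file_path) DO UPDATE SET
    platform = excluded.platform,
    location = excluded.location,
    file_size = excluded.file_size
"""


def backfill(conn, records):
    """Upsert every record and count how many were new versus updated."""
    stats = {"new": 0, "updated": 0}
    for rec in records:
        exists = conn.execute(
            "SELECT 1 FROM file_inventory WHERE file_path = ?", (rec["file_path"],)
        ).fetchone()
        conn.execute(UPSERT, rec)
        stats["updated" if exists else "new"] += 1
    conn.commit()
    return stats


records = [
    {"file_path": "/opt/immich/md/forums/PicturePub/a.jpg",
     "platform": "forum", "location": "final", "file_size": 10},
    {"file_path": "/opt/immich/review/forums/PicturePub/b.jpg",
     "platform": "forum", "location": "review", "file_size": 20},
]

first_run = backfill(conn, records)
ids_before = conn.execute("SELECT id FROM file_inventory ORDER BY file_path").fetchall()
second_run = backfill(conn, records)
ids_after = conn.execute("SELECT id FROM file_inventory ORDER BY file_path").fetchall()
print(first_run, second_run, ids_before == ids_after)
```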
---

## 🛡️ Data Integrity

### Constraints

- `file_path UNIQUE` - Prevents duplicate entries
- `location NOT NULL` - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files

### Verification

Background task (optional, runs daily):
1. Select 1,000 random files from `file_inventory`
2. Check whether each file still exists on the filesystem
3. Mark missing files or auto-delete their records
4. Log discrepancies for review

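The verification steps above could be sketched as follows. This version only reports missing files and stamps `last_verified` on the ones it finds; whether missing rows are flagged or deleted is left to the caller. `verify_sample` is a hypothetical name and the table is trimmed to the columns the check needs.

```python
import os
import sqlite3
import tempfile


def verify_sample(conn, sample_size=1000):
    """Check a random sample of inventory rows against the filesystem."""
    rows = conn.execute(
        "SELECT id, file_path FROM file_inventory ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    ).fetchall()
    missing = []
    for row_id, file_path in rows:
        if os.path.exists(file_path):
            conn.execute(
                "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
                "WHERE id = ?",
                (row_id,),
            )
        else:
            missing.append(file_path)
    conn.commit()
    return {"checked": len(rows), "missing": missing}


# Demo: one file that exists and one that was never created.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, file_path TEXT NOT NULL UNIQUE, "
    "last_verified TIMESTAMP)"
)
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.jpg")
    open(present, "wb").close()
    gone = os.path.join(tmp, "gone.jpg")
    conn.executemany(
        "INSERT INTO file_inventory (file_path) VALUES (?)", [(present,), (gone,)]
    )
    report = verify_sample(conn)

verified = conn.execute(
    "SELECT COUNT(*) FROM file_inventory WHERE last_verified IS NOT NULL"
).fetchone()[0]
print(report["checked"], len(report["missing"]), verified)
```

`ORDER BY RANDOM()` sorts the whole table before applying the limit, which is fine at the current scale of a few thousand rows but would deserve a second look at 100,000+ files.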
### Migration Safety

- **downloads table**: Never modified (preserves audit trail)
- **recycle_bin table**: Never modified (already works perfectly)
- **New table**: No risk to existing functionality
- **Gradual rollout**: Update one endpoint at a time

---

## 📝 Database Methods

### unified_database.py

```python
from typing import Dict, List, Optional

# Method signatures only; bodies are omitted in this overview.

def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: Optional[str] = None, content_type: Optional[str] = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: Optional[str] = None, platform: Optional[str] = None,
                         source: Optional[str] = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
```

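To make the intent of `query_file_inventory()` concrete, here is one possible body, written as a standalone function that takes a `sqlite3` connection instead of `self`. The filter-building approach, the `created_date DESC` ordering, and the trimmed demo table are assumptions; the real method may differ.

```python
import sqlite3
from typing import Any, Dict, List, Optional


def query_file_inventory(conn: sqlite3.Connection,
                         location: Optional[str] = None,
                         platform: Optional[str] = None,
                         source: Optional[str] = None,
                         limit: int = 50,
                         offset: int = 0) -> List[Dict[str, Any]]:
    """Return inventory rows, newest first, filtered by any given field."""
    clauses, params = [], []
    # Column names are fixed here; only the values come from the caller.
    for column, value in (("location", location),
                          ("platform", platform),
                          ("source", source)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (f"SELECT * FROM file_inventory {where} "
           "ORDER BY created_date DESC LIMIT ? OFFSET ?")
    cur = conn.cursor()
    cur.row_factory = sqlite3.Row  # allows dict(row) below
    rows = cur.execute(sql, params + [limit, offset]).fetchall()
    return [dict(row) for row in rows]


# Demo data with explicit timestamps so the ordering is predictable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory (file_path TEXT NOT NULL UNIQUE, "
    "platform TEXT NOT NULL, source TEXT, location TEXT NOT NULL, "
    "created_date TIMESTAMP)"
)
conn.executemany(
    "INSERT INTO file_inventory VALUES (?, ?, ?, ?, ?)",
    [
        ("/m/a.jpg", "instagram", "evalongoria", "final", "2025-11-14 10:00:00"),
        ("/m/b.jpg", "instagram", "evalongoria", "final", "2025-11-16 10:00:00"),
        ("/r/c.jpg", "forum", "PicturePub", "review", "2025-11-15 10:00:00"),
    ],
)
final_files = query_file_inventory(conn, location="final", platform="instagram")
print([f["file_path"] for f in final_files])
```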
---

## 🔄 Backwards Compatibility

### During Migration

- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk

### After Migration

- Keep `downloads` table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- `file_inventory` is an index over the files, not a replacement for them

---

## 📈 Monitoring

### Metrics to Track

- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after

### Alerts

- File inventory diverges >5% from filesystem
- Query latency exceeds 100ms
- Backfill failures

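The divergence alert needs a concrete definition before it can fire. One possible definition, used in the sketch below, counts paths that appear on only one side (stale records plus untracked files) as a percentage of all known paths. Both the formula and the `inventory_divergence` helper are assumptions; the document does not say how the 5% figure is measured.

```python
def inventory_divergence(db_paths, disk_paths):
    """Compare inventory paths with filesystem paths and report the drift."""
    db_paths, disk_paths = set(db_paths), set(disk_paths)
    stale = db_paths - disk_paths        # in the database, gone from disk
    untracked = disk_paths - db_paths    # on disk, missing from the database
    total = len(db_paths | disk_paths)
    pct = 100.0 * (len(stale) + len(untracked)) / total if total else 0.0
    return {
        "stale": sorted(stale),
        "untracked": sorted(untracked),
        "divergence_pct": pct,
    }


# Demo with made-up paths: one stale record and one untracked file.
report = inventory_divergence(
    db_paths=["/m/a.jpg", "/m/b.jpg", "/m/c.jpg"],
    disk_paths=["/m/b.jpg", "/m/c.jpg", "/m/d.jpg"],
)
ALERT_THRESHOLD_PCT = 5.0  # threshold taken from the alert list above
should_alert = report["divergence_pct"] > ALERT_THRESHOLD_PCT
print(report, should_alert)
```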
---

## 🚧 Future Enhancements

1. **Real-time sync**: inotify/watchdog to detect file changes
2. **Advanced queries**: Full-text search on filename/source
3. **Statistics**: Track file age, access patterns
4. **Cleanup**: Auto-remove files older than X days
5. **Export**: Generate inventory reports (CSV, JSON)

---

## 📚 Related Documentation

- [VERSIONING.md](VERSIONING.md) - Version control and backups
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [FACE_RECOGNITION.md](FACE_RECOGNITION.md) - Face recognition integration
- [CHANGELOG.md](CHANGELOG.md) - Version history

---

## ✅ Success Criteria

- [ ] All 2,493+ files tracked in `file_inventory`
- [ ] Database accuracy >98%
- [ ] Page load times <100ms
- [ ] Zero data loss
- [ ] Backward compatible
- [ ] No user-facing changes (transparent migration)

---

**Status**: Ready for implementation
**Next Step**: Create `file_inventory` table in `unified_database.py`