# File Inventory Architecture

**Version:** 6.33.5
**Date:** 2025-11-16
**Status:** Implementation Phase

---

## 📋 Overview

The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.

---

## 🎯 Problem Statement

### Current Issues

1. **Performance**: Pages scan entire directory trees on every load (2,493+ files)
2. **Accuracy**: The database is only 79.93% accurate; files on disk don't match database records
3. **Stale Records**: Downloaded files are moved or deleted, but the database is not updated
4. **Missing Records**: 1,733+ files on disk have no database entries

### Root Cause

Multiple systems track files independently:

- **Download modules** record to the `downloads` table during download
- **move_module** updates paths when moving files
- **Filesystem** is the actual source of truth
- **API endpoints** scan the filesystem (ignoring the database)

Result: Database and filesystem drift apart over time.

---

## 💡 Solution: file_inventory Table

### Architecture

**Single Source of Truth**: `file_inventory` table tracks ALL files in their current locations.

```
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  move_module    │──> file_inventory table (current locations)
└─────────────────┘         │
                            │
                            ▼
                    ┌───────────────┐
                    │ API Endpoints │
                    │ (Fast Queries)│
                    └───────────────┘
```

### Separation of Concerns

| Table | Purpose | Updates | Deletions |
|-------|---------|---------|-----------|
| `downloads` | Historical audit trail | Never | Never |
| `file_inventory` | Current file locations | On every move | When file deleted |
| `recycle_bin` | Deleted files (restore capability) | On delete/restore | On permanent delete |

---

## 🗄️ Database Schema

### file_inventory Table

```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,      -- Absolute path (current location)
    filename TEXT NOT NULL,              -- Basename for display
    platform TEXT NOT NULL,              -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                         -- Username, forum name, etc.
    content_type TEXT,                   -- 'image' or 'video'
    file_size INTEGER,                   -- Size in bytes
    file_hash TEXT,                      -- SHA256 for deduplication
    width INTEGER,                       -- Image/video width (from metadata cache)
    height INTEGER,                      -- Image/video height
    location TEXT NOT NULL,              -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,             -- Last time file existence verified
    metadata JSON                        -- Additional metadata (face recognition, etc.)
);

-- Indexes for fast queries (SQLite needs separate CREATE INDEX statements;
-- it does not accept INDEX clauses inside CREATE TABLE)
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
```
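
A minimal sketch of creating this schema from Python with the standard-library `sqlite3` module, assuming the database is SQLite (which the `AUTOINCREMENT` keyword suggests). The in-memory connection and the `create_file_inventory` helper are illustrative names, not part of the codebase. The indexes are created with separate `CREATE INDEX` statements, and `IF NOT EXISTS` makes the call repeatable.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_inventory (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path     TEXT NOT NULL UNIQUE,
    filename      TEXT NOT NULL,
    platform      TEXT NOT NULL,
    source        TEXT,
    content_type  TEXT,
    file_size     INTEGER,
    file_hash     TEXT,
    width         INTEGER,
    height        INTEGER,
    location      TEXT NOT NULL,
    created_date  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_verified TIMESTAMP,
    metadata      JSON
);
CREATE INDEX IF NOT EXISTS idx_fi_platform_location
    ON file_inventory (platform, location, created_date DESC);
CREATE INDEX IF NOT EXISTS idx_fi_source
    ON file_inventory (source, created_date DESC);
CREATE INDEX IF NOT EXISTS idx_fi_location ON file_inventory (location);
CREATE INDEX IF NOT EXISTS idx_fi_hash ON file_inventory (file_hash);
"""


def create_file_inventory(conn: sqlite3.Connection) -> None:
    """Create the table and its indexes; safe to call repeatedly."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
create_file_inventory(conn)
create_file_inventory(conn)  # second call changes nothing thanks to IF NOT EXISTS

# List the explicitly created indexes to confirm they exist.
index_names = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'file_inventory' "
        "AND name LIKE 'idx_fi_%'"
    )
)
print(index_names)
```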

### Field Descriptions

- **file_path**: Full absolute path (e.g., `/opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg`)
- **location**: Current location type
  - `'final'` - In final destination directory (ready for Immich)
  - `'review'` - In review queue (no face match, pending manual review)
  - `'recycle'` - In recycle bin (soft deleted, can be restored)
- **created_date**: When file was moved to current location (not original download date)
- **last_verified**: Background task updates this when verifying file still exists

---

## 📂 File Locations

### Final Destinations (location='final')

Configured in settings table, per platform:

```
/opt/immich/md/
├── social media/
│   ├── instagram/
│   │   ├── posts/
│   │   ├── stories/
│   │   ├── reels/
│   │   └── tagged/
│   ├── snapchat/stories/
│   └── tiktok/reels/
├── forums/
│   ├── HQCelebCorner/
│   └── PicturePub/
└── gallery/
    └── Coppermine/
```

**Settings locations:**
- Forums: `settings.forums.configs[].destination_path`
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: `settings.coppermine.destination_path`

### Review Queue (location='review')

Path: `/opt/immich/review/` (configurable in `settings.face_recognition.review_path`)

Maintains same directory structure as final destination:
```
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
```
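
One way to mirror the final-destination layout under the review root is to take the path relative to the final root and re-anchor it. The sketch below assumes a single final root of `/opt/immich/md` (forum destinations are configurable, so this may not hold everywhere); `review_path_for` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Roots taken from the directory trees above; both are configurable in practice.
FINAL_ROOT = PurePosixPath("/opt/immich/md")
REVIEW_ROOT = PurePosixPath("/opt/immich/review")


def review_path_for(final_path: str) -> PurePosixPath:
    """Map a final-destination path to the mirrored review-queue path."""
    # relative_to() raises ValueError if final_path is not under FINAL_ROOT.
    relative = PurePosixPath(final_path).relative_to(FINAL_ROOT)
    return REVIEW_ROOT / relative


mirrored = review_path_for(
    "/opt/immich/md/social media/instagram/posts/no_face_match.jpg"
)
print(mirrored)
# /opt/immich/review/social media/instagram/posts/no_face_match.jpg
```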

### Recycle Bin (location='recycle')

Path: `/opt/immich/recycle/` (fixed)

**Note**: The recycle bin uses its own `recycle_bin` table, which is already implemented; do not duplicate that functionality in `file_inventory`.

Files are stored with UUID filenames:
```
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
```

---

## 🔄 Data Flow

### 1. Download Phase

```
Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/
```

### 2. Move Phase (move_module.py)

```
move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path) [optional - for audit trail]
```
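
The move phase can be sketched as follows: the file is moved first, and the inventory row is written only after the move succeeds. This is an illustration under stated assumptions, not the actual `move_module` code: `move_and_record`, its parameters, and the reduced table are hypothetical, and the `ON CONFLICT ... DO UPDATE` upsert needs SQLite 3.24 or newer.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def move_and_record(conn: sqlite3.Connection, src: Path, final_dir: Path,
                    review_dir: Path, platform: str, face_matched: bool) -> Path:
    """Move a file to its destination, then record it in file_inventory."""
    target_dir = final_dir if face_matched else review_dir
    location = "final" if face_matched else "review"
    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / src.name
    shutil.move(str(src), str(dest))  # record only after the move succeeds
    conn.execute(
        """
        INSERT INTO file_inventory (file_path, filename, platform, location)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(file_path) DO UPDATE SET
            location = excluded.location,
            created_date = CURRENT_TIMESTAMP
        """,
        (str(dest), dest.name, platform, location),
    )
    conn.commit()
    return dest


# Demo with a reduced table and a throwaway directory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory ("
    "file_path TEXT NOT NULL UNIQUE, filename TEXT NOT NULL, "
    "platform TEXT NOT NULL, location TEXT NOT NULL, "
    "created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "temp" / "photo.jpg"
    src.parent.mkdir()
    src.write_bytes(b"fake image bytes")
    dest = move_and_record(conn, src, root / "final", root / "review",
                           platform="instagram", face_matched=False)
    moved_ok = dest.exists() and not src.exists()

row = conn.execute("SELECT filename, location FROM file_inventory").fetchone()
print(row, moved_ok)
```

If the database write fails after the move, the file is on disk but untracked; a periodic verification or backfill run would be needed to pick it up.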

### 3. Delete Phase

```
User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')
```
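
A sketch of the delete flow, taking the "delete the inventory row" branch. The `recycle_bin` columns (`original_path`, `recycle_path`) are inferred from the flow above and may not match the real table; `soft_delete` is a hypothetical name. Both database statements run in one transaction so the two tables cannot disagree.

```python
import shutil
import sqlite3
import tempfile
import uuid
from pathlib import Path


def soft_delete(conn: sqlite3.Connection, file_path: Path, recycle_dir: Path) -> Path:
    """Move a file to the recycle bin under a UUID name and update both tables."""
    recycle_dir.mkdir(parents=True, exist_ok=True)
    recycle_path = recycle_dir / f"{uuid.uuid4()}{file_path.suffix}"
    shutil.move(str(file_path), str(recycle_path))
    with conn:  # commits both statements together, or rolls back on error
        conn.execute(
            "INSERT INTO recycle_bin (original_path, recycle_path) VALUES (?, ?)",
            (str(file_path), str(recycle_path)),
        )
        conn.execute(
            "DELETE FROM file_inventory WHERE file_path = ?", (str(file_path),)
        )
    return recycle_path


# Demo with reduced tables (assumed columns) and a throwaway directory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_inventory (file_path TEXT NOT NULL UNIQUE, location TEXT NOT NULL);
CREATE TABLE recycle_bin (original_path TEXT NOT NULL, recycle_path TEXT NOT NULL);
""")
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    victim = root / "final" / "photo.jpg"
    victim.parent.mkdir()
    victim.write_bytes(b"fake image bytes")
    conn.execute("INSERT INTO file_inventory VALUES (?, 'final')", (str(victim),))
    conn.commit()
    recycled = soft_delete(conn, victim, root / "recycle")
    recycled_exists = recycled.exists()

remaining = conn.execute("SELECT COUNT(*) FROM file_inventory").fetchone()[0]
binned = conn.execute("SELECT original_path FROM recycle_bin").fetchall()
print(remaining, len(binned), recycled.name)
```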

### 4. Restore Phase

```
User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
```

---

## 🚀 Implementation Plan

### Phase 1: Infrastructure (Week 1)

#### Day 1: Table Creation & Backfill Script
- [ ] Add `file_inventory` table to `unified_database.py`
- [ ] Add methods: `upsert_file_inventory()`, `delete_file_inventory()`, `query_file_inventory()`
- [ ] Write backfill script: `/opt/media-downloader/utilities/backfill_file_inventory.py`
- [ ] Test backfill on test database

#### Day 2: Initial Backfill
- [ ] Run backfill script on production database
- [ ] Verify all 2,493 files captured
- [ ] Check accuracy vs filesystem
- [ ] Document any discrepancies

#### Day 3: Testing & Validation
- [ ] Verify indexes created
- [ ] Test query performance (should be <10ms)
- [ ] Write unit tests for file_inventory methods

### Phase 2: Update move_module (Week 1)

#### Day 4-5: Integration
- [ ] Update `move_module.py` to call `upsert_file_inventory()` after successful moves
- [ ] Handle location tracking ('final' vs 'review')
- [ ] Add error handling and logging
- [ ] Test with sample downloads (Instagram, Forum, etc.)
- [ ] Verify file_inventory stays in sync

### Phase 3: Update API Endpoints (Week 2)

#### Day 1-2: Media Page
- [ ] Update `/api/media/gallery` to query `file_inventory` (location='final')
- [ ] Add filtering by platform, source, content_type
- [ ] Add pagination (already indexed)
- [ ] Test performance improvement
- [ ] Deploy and monitor

#### Day 3: Downloads Page
- [ ] Update `/api/downloads/recent` to query `file_inventory`
- [ ] Test with different platforms
- [ ] Verify sorting by created_date

#### Day 4: Review Queue
- [ ] Update `/api/review/queue` to query `file_inventory` (location='review')
- [ ] Verify face recognition integration
- [ ] Test restore from review queue

#### Day 5: Testing & Documentation
- [ ] Integration testing across all pages
- [ ] Performance testing with large datasets
- [ ] Update API documentation
- [ ] User acceptance testing

### Phase 4: Background Maintenance (Week 3)

#### Optional: File Verification Task
- [ ] Create periodic task to verify file existence
- [ ] Mark missing files in `file_inventory`
- [ ] Alert on discrepancies
- [ ] Auto-cleanup stale records (configurable)

---

## 📊 Expected Performance

### Before (Filesystem Scanning)

| Page | Method | Files Scanned | Load Time |
|------|--------|---------------|-----------|
| Media | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Downloads | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Review | `directory.rglob('*')` | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |

### After (Database Queries)

| Page | Method | Query Cost | Load Time |
|------|--------|-----------|-----------|
| Media | `SELECT ... LIMIT 50` | O(log n) index lookup | <100ms ✅ |
| Downloads | `SELECT ... LIMIT 50` | O(log n) index lookup | <100ms ✅ |
| Review | `SELECT ... WHERE location='review'` | O(log n) index lookup | <100ms ✅ |
| Recycle Bin | Already database | O(log n) index lookup | <100ms ✅ |

**Performance Improvement: 50-100x faster** 🚀
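
Whether a page query actually uses an index can be checked with SQLite's `EXPLAIN QUERY PLAN`. The sketch below assumes the Media page filters on `platform` and `location` and sorts by `created_date`; the exact query shape in the API is not specified in this document. It uses a reduced copy of the table with only the composite index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,
    platform TEXT NOT NULL,
    location TEXT NOT NULL,
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_fi_platform_location
    ON file_inventory (platform, location, created_date DESC);
""")

plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT file_path FROM file_inventory
    WHERE platform = ? AND location = ?
    ORDER BY created_date DESC
    LIMIT 50
    """,
    ("instagram", "final"),
).fetchall()

# Each plan row has four columns; the last one is a human-readable detail string.
details = [row[3] for row in plan]
uses_index = any("idx_fi_platform_location" in d for d in details)
uses_temp_sort = any("TEMP B-TREE" in d for d in details)  # would mean an extra sort step
print(details)
print("index used:", uses_index, "| extra sort:", uses_temp_sort)
```

Because the index ends with `created_date DESC`, rows come out of the index already in display order, so the query should not need a separate sort before applying `LIMIT`.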

---

## 🔍 Backfill Strategy

### Discovery Phase

Backfill script reads settings to find all file locations:

```python
# Assumes `db` (database handle) and `scan_directory()` are defined
# elsewhere in the backfill script.

# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
    scan_directory(forum['destination_path'], platform='forum', source=forum['name'])

# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')

# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```

### Metadata Extraction

For each file found:
- **Platform**: From directory structure or settings
- **Source**: Extract from filename or directory name
- **Content Type**: From file extension
- **File Size**: `os.stat().st_size`
- **File Hash**: Calculate SHA256 (for deduplication)
- **Dimensions**: Query from `media_metadata.db` if exists
- **Location**: 'final' or 'review' based on directory
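
The per-file extraction above could look roughly like this. The extension sets and the `describe_file` / `sha256_of` names are assumptions for illustration; source and dimension lookup are left out because they depend on project-specific naming rules and on `media_metadata.db`.

```python
import hashlib
import os
import tempfile
from pathlib import Path

# Assumed extension lists; the real script may recognise a different set.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large videos are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path: Path, platform: str, location: str) -> dict:
    """Collect the file_inventory fields that can be derived from the file itself."""
    ext = path.suffix.lower()
    if ext in IMAGE_EXTS:
        content_type = "image"
    elif ext in VIDEO_EXTS:
        content_type = "video"
    else:
        content_type = None  # unknown extension: leave the column NULL
    return {
        "file_path": str(path),
        "filename": path.name,
        "platform": platform,
        "content_type": content_type,
        "file_size": os.stat(path).st_size,
        "file_hash": sha256_of(path),
        "location": location,
    }


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "clip.MP4"
    sample.write_bytes(b"hello")
    record = describe_file(sample, platform="tiktok", location="final")

print(record["content_type"], record["file_size"], record["file_hash"][:12])
```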

### Idempotency

Script can be run multiple times safely:
- Uses `INSERT OR REPLACE` / `UPSERT` semantics
- Skips files already in database (with option to force refresh)
- Logs statistics: new files, updated files, skipped files
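
The two options behave differently in SQLite: `INSERT OR REPLACE` deletes the conflicting row and inserts a new one, so the row gets a new `id` and columns not listed in the statement fall back to their defaults, while `ON CONFLICT ... DO UPDATE` (SQLite 3.24+) updates the existing row in place. A sketch of the second form on a reduced table, showing that a repeated run leaves one row with a stable `id`:

```python
import sqlite3

UPSERT = """
INSERT INTO file_inventory (file_path, filename, platform, location, file_size)
VALUES (:file_path, :filename, :platform, :location, :file_size)
ON CONFLICT(file_path) DO UPDATE SET
    filename  = excluded.filename,
    platform  = excluded.platform,
    location  = excluded.location,
    file_size = excluded.file_size
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "file_path TEXT NOT NULL UNIQUE, filename TEXT NOT NULL, "
    "platform TEXT NOT NULL, location TEXT NOT NULL, file_size INTEGER)"
)

record = {
    "file_path": "/opt/immich/md/forums/PicturePub/a.jpg",
    "filename": "a.jpg",
    "platform": "forum",
    "location": "final",
    "file_size": 100,
}

conn.execute(UPSERT, record)                       # first backfill run: inserts
first_id = conn.execute("SELECT id FROM file_inventory").fetchone()[0]

conn.execute(UPSERT, {**record, "file_size": 250})  # second run: updates in place
rows = conn.execute("SELECT id, file_size FROM file_inventory").fetchall()
conn.commit()
print(first_id, rows)
```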

---

## 🛡️ Data Integrity

### Constraints

- `file_path UNIQUE` - Prevents duplicate entries
- `location NOT NULL` - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files

### Verification

Background task (optional, runs daily):
1. Select random 1000 files from `file_inventory`
2. Check if files still exist on filesystem
3. Mark missing files or auto-delete records
4. Log discrepancies for review
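
The verification steps above can be sketched as a single function. `verify_sample` is a hypothetical name, and this version only reports missing files rather than deleting their records. `ORDER BY RANDOM()` reads the whole table to pick the sample, which is acceptable at a few thousand rows but worth revisiting for much larger inventories.

```python
import os
import sqlite3
import tempfile
from typing import List


def verify_sample(conn: sqlite3.Connection, sample_size: int = 1000) -> List[str]:
    """Check a random sample of inventory rows against the filesystem.

    Updates last_verified for files that exist and returns the missing paths.
    """
    rows = conn.execute(
        "SELECT file_path FROM file_inventory ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    ).fetchall()
    missing = []
    for (file_path,) in rows:
        if os.path.exists(file_path):
            conn.execute(
                "UPDATE file_inventory SET last_verified = CURRENT_TIMESTAMP "
                "WHERE file_path = ?",
                (file_path,),
            )
        else:
            missing.append(file_path)  # caller decides: mark, alert, or delete
    conn.commit()
    return missing


# Demo: one file that exists and one that was never created.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_inventory (file_path TEXT NOT NULL UNIQUE, last_verified TIMESTAMP)"
)
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.jpg")
    ghost = os.path.join(tmp, "ghost.jpg")  # never written to disk
    with open(present, "wb") as fh:
        fh.write(b"x")
    conn.executemany(
        "INSERT INTO file_inventory (file_path) VALUES (?)", [(present,), (ghost,)]
    )
    missing = verify_sample(conn)

verified = conn.execute(
    "SELECT last_verified FROM file_inventory WHERE file_path = ?", (present,)
).fetchone()[0]
print(missing, verified)
```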

### Migration Safety

- **downloads table**: Never modified (preserves audit trail)
- **recycle_bin table**: Never modified (already works perfectly)
- **New table**: No risk to existing functionality
- **Gradual rollout**: Update one endpoint at a time

---

## 📝 Database Methods

### unified_database.py

```python
from typing import Dict, List, Optional

# Method signatures to add to the database class in unified_database.py.

def create_file_inventory_table(self):
    """Create file_inventory table and indexes"""

def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
                          source: Optional[str] = None, content_type: Optional[str] = None,
                          location: str = 'final', **kwargs) -> bool:
    """Insert or update file in inventory"""

def delete_file_inventory(self, file_path: str) -> bool:
    """Remove file from inventory (when permanently deleted)"""

def query_file_inventory(self, location: Optional[str] = None, platform: Optional[str] = None,
                         source: Optional[str] = None, limit: int = 50,
                         offset: int = 0) -> List[Dict]:
    """Query file inventory with filters and pagination"""

def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
    """Update file location (e.g., final → review → recycle)"""

def verify_file_inventory(self) -> Dict:
    """Verify all files in inventory still exist on filesystem"""
```
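
As an illustration of how `query_file_inventory` might be implemented, here is a standalone sketch that takes the connection as an argument instead of `self`. The tie-breaking sort on `id` and the reduced demo table are assumptions. Filter values are always passed as bound parameters; only fixed column names are interpolated into the SQL text.

```python
import sqlite3
from typing import Dict, List, Optional


def query_file_inventory(conn: sqlite3.Connection, location: Optional[str] = None,
                         platform: Optional[str] = None, source: Optional[str] = None,
                         limit: int = 50, offset: int = 0) -> List[Dict]:
    """Return inventory rows, newest first, with optional filters and pagination."""
    clauses, params = [], []
    # Column names come from this fixed tuple, never from user input.
    for column, value in (("location", location), ("platform", platform), ("source", source)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (
        f"SELECT * FROM file_inventory {where} "
        "ORDER BY created_date DESC, id DESC LIMIT ? OFFSET ?"
    )
    rows = conn.execute(sql, [*params, limit, offset]).fetchall()
    return [dict(row) for row in rows]  # requires conn.row_factory = sqlite3.Row


# Demo on a reduced table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE file_inventory ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, file_path TEXT NOT NULL UNIQUE, "
    "platform TEXT NOT NULL, source TEXT, location TEXT NOT NULL, "
    "created_date TIMESTAMP)"
)
conn.executemany(
    "INSERT INTO file_inventory (file_path, platform, source, location, created_date) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("/m/a.jpg", "instagram", "user1", "final", "2025-11-14 10:00:00"),
        ("/m/b.jpg", "instagram", "user1", "final", "2025-11-16 10:00:00"),
        ("/r/c.jpg", "forum", "PicturePub", "review", "2025-11-15 10:00:00"),
    ],
)

final_files = [r["file_path"] for r in query_file_inventory(conn, location="final")]
second_page = [r["file_path"] for r in query_file_inventory(conn, location="final", limit=1, offset=1)]
print(final_files, second_page)
```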

---

## 🔄 Backwards Compatibility

### During Migration

- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk

### After Migration

- Keep `downloads` table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- `file_inventory` is an index, not a replacement for the files themselves

---

## 📈 Monitoring

### Metrics to Track

- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after

### Alerts

- File inventory diverges >5% from filesystem
- Query latency exceeds 100ms
- Backfill failures
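
The divergence alert needs a definition of "diverges". One possible reading, assumed here, is the absolute difference between the two counts as a percentage of the filesystem count; the function names and the default threshold argument are illustrative.

```python
def divergence_percent(inventory_count: int, filesystem_count: int) -> float:
    """Absolute count difference as a percentage of the filesystem count."""
    if filesystem_count == 0:
        # Nothing on disk: any inventory rows are fully divergent.
        return 0.0 if inventory_count == 0 else 100.0
    return abs(inventory_count - filesystem_count) / filesystem_count * 100


def should_alert(inventory_count: int, filesystem_count: int,
                 threshold: float = 5.0) -> bool:
    """True when divergence exceeds the alert threshold (5% by default)."""
    return divergence_percent(inventory_count, filesystem_count) > threshold


# With 2,493 files on disk: 2,350 tracked is about 5.7% off, 2,400 is about 3.7% off.
print(should_alert(2350, 2493), should_alert(2400, 2493))
```

Comparing counts alone cannot detect the case where equal numbers of files are missing and untracked, so a path-level comparison would be a stricter check.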

---

## 🚧 Future Enhancements

1. **Real-time sync**: inotify/watchdog to detect file changes
2. **Advanced queries**: Full-text search on filename/source
3. **Statistics**: Track file age, access patterns
4. **Cleanup**: Auto-remove files older than X days
5. **Export**: Generate inventory reports (CSV, JSON)

---

## 📚 Related Documentation

- [VERSIONING.md](VERSIONING.md) - Version control and backups
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [FACE_RECOGNITION.md](FACE_RECOGNITION.md) - Face recognition integration
- [CHANGELOG.md](CHANGELOG.md) - Version history

---

## ✅ Success Criteria

- [ ] All 2,493+ files tracked in `file_inventory`
- [ ] Database accuracy >98%
- [ ] Page load times <100ms
- [ ] Zero data loss
- [ ] Backward compatible
- [ ] No user-facing changes (transparent migration)

---

**Status**: Ready for implementation
**Next Step**: Create `file_inventory` table in `unified_database.py`