# File Inventory Architecture
**Version:** 6.33.5
**Date:** 2025-11-16
**Status:** Implementation Phase
---
## 📋 Overview
The File Inventory system is a database-first approach to tracking media files across the application. It replaces slow filesystem scanning with fast indexed database queries, improving page load times from 5-10 seconds to <100ms.
---
## 🎯 Problem Statement
### Current Issues
1. **Performance**: Pages scan entire directory trees on every load (2,493+ files)
2. **Accuracy**: The database is only 79.93% accurate; files on disk don't match database records
3. **Stale Records**: Downloaded files moved/deleted but database not updated
4. **Missing Records**: 1,733+ files on disk with no database entries
### Root Cause
Multiple systems track files independently:
- **Download modules** record to `downloads` table during download
- **move_module** updates paths when moving files
- **Filesystem** is the actual source of truth
- **API endpoints** scan filesystem (ignoring database)
Result: Database and filesystem drift apart over time.
---
## 💡 Solution: file_inventory Table
### Architecture
**Single Source of Truth**: `file_inventory` table tracks ALL files in their current locations.
```
┌─────────────────┐
│ Download Module │──> downloads table (historical audit trail)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   move_module   │──> file_inventory table (current locations)
└─────────────────┘              │
                                 ▼
                         ┌───────────────┐
                         │ API Endpoints │
                         │ (Fast Queries)│
                         └───────────────┘
```
### Separation of Concerns
| Table | Purpose | Updates | Deletions |
|-------|---------|---------|-----------|
| `downloads` | Historical audit trail | Never | Never |
| `file_inventory` | Current file locations | On every move | When file deleted |
| `recycle_bin` | Deleted files (restore capability) | On delete/restore | On permanent delete |
---
## 🗄️ Database Schema
### file_inventory Table
```sql
CREATE TABLE file_inventory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL UNIQUE,      -- Absolute path (current location)
    filename TEXT NOT NULL,              -- Basename for display
    platform TEXT NOT NULL,              -- instagram, tiktok, snapchat, forum, coppermine
    source TEXT,                         -- Username, forum name, etc.
    content_type TEXT,                   -- 'image' or 'video'
    file_size INTEGER,                   -- Size in bytes
    file_hash TEXT,                      -- SHA-256 for deduplication
    width INTEGER,                       -- Image/video width (from metadata cache)
    height INTEGER,                      -- Image/video height
    location TEXT NOT NULL,              -- 'final', 'review', 'recycle'
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- When moved to this location
    last_verified TIMESTAMP,             -- Last time file existence was verified
    metadata JSON                        -- Additional metadata (face recognition, etc.)
);

-- SQLite does not support inline INDEX clauses in CREATE TABLE;
-- the indexes for fast queries are created separately:
CREATE INDEX idx_fi_platform_location ON file_inventory (platform, location, created_date DESC);
CREATE INDEX idx_fi_source ON file_inventory (source, created_date DESC);
CREATE INDEX idx_fi_location ON file_inventory (location);
CREATE INDEX idx_fi_hash ON file_inventory (file_hash);
```
### Field Descriptions
- **file_path**: Full absolute path (e.g., `/opt/immich/md/social media/instagram/posts/evalongoria_2025-11-16.jpg`)
- **location**: Current location type
- `'final'` - In final destination directory (ready for Immich)
- `'review'` - In review queue (no face match, pending manual review)
- `'recycle'` - In recycle bin (soft deleted, can be restored)
- **created_date**: When file was moved to current location (not original download date)
- **last_verified**: Background task updates this when verifying file still exists
---
## 📂 File Locations
### Final Destinations (location='final')
Configured in settings table, per platform:
```
/opt/immich/md/
├── social media/
│ ├── instagram/
│ │ ├── posts/
│ │ ├── stories/
│ │ ├── reels/
│ │ └── tagged/
│ ├── snapchat/stories/
│ └── tiktok/reels/
├── forums/
│ ├── HQCelebCorner/
│ └── PicturePub/
└── gallery/
└── Coppermine/
```
**Settings locations:**
- Forums: `settings.forums.configs[].destination_path`
- Instagram/Snapchat/TikTok: Hardcoded or configurable
- Coppermine: `settings.coppermine.destination_path`
### Review Queue (location='review')
Path: `/opt/immich/review/` (configurable in `settings.face_recognition.review_path`)
Maintains same directory structure as final destination:
```
/opt/immich/review/
├── social media/
│   └── instagram/posts/
│       └── no_face_match.jpg
└── forums/
    └── PicturePub/
        └── unmatched.jpg
```
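Since the review queue mirrors the final directory layout, mapping a final-destination path to its review-queue equivalent is pure path arithmetic. A minimal sketch; `to_review_path`, `FINAL_ROOT`, and `REVIEW_ROOT` are hypothetical names, with the roots taken from the example paths in this document:

```python
from pathlib import Path

# Roots as used in the examples above; the real values come from settings.
FINAL_ROOT = Path("/opt/immich/md")
REVIEW_ROOT = Path("/opt/immich/review")

def to_review_path(final_path: str) -> Path:
    """Mirror a final-destination path under the review root,
    keeping the relative directory structure intact."""
    relative = Path(final_path).relative_to(FINAL_ROOT)
    return REVIEW_ROOT / relative

print(to_review_path("/opt/immich/md/forums/PicturePub/unmatched.jpg"))
# /opt/immich/review/forums/PicturePub/unmatched.jpg
```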
### Recycle Bin (location='recycle')
Path: `/opt/immich/recycle/` (fixed)
**Note**: Recycle bin uses separate `recycle_bin` table (already implemented, don't duplicate).
Files stored with UUID filenames:
```
/opt/immich/recycle/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890.jpg
└── f9e8d7c6-b5a4-3210-9876-543210fedcba.mp4
```
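Generating those UUID filenames while keeping the original extension might look like this (`recycle_path_for` is a hypothetical helper, not an existing function):

```python
import uuid
from pathlib import Path

RECYCLE_ROOT = Path("/opt/immich/recycle")  # fixed path per this document

def recycle_path_for(original_path: str) -> Path:
    """Generate a collision-free recycle-bin path, keeping the extension
    so content type is still obvious after restore."""
    suffix = Path(original_path).suffix  # e.g. '.jpg'
    return RECYCLE_ROOT / f"{uuid.uuid4()}{suffix}"

p = recycle_path_for("/opt/immich/md/forums/PicturePub/photo.jpg")
print(p.suffix)  # .jpg
```

The original path is preserved in the `recycle_bin` table row, which is what makes restore possible despite the opaque filename.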
---
## 🔄 Data Flow
### 1. Download Phase
```
Download Module
    ↓
downloads table (audit trail)
    ↓
Temporary file in /opt/media-downloader/temp/
```
### 2. Move Phase (move_module.py)
```
move_module.move_file()
    ↓
Face Recognition Check
    ├─ Match    → Final Destination
    └─ No Match → Review Queue
    ↓
File moved to location
    ↓
file_inventory.upsert(file_path, location)
    ↓
downloads.update(file_path)  [optional - for audit trail]
```
### 3. Delete Phase
```
User deletes from UI
    ↓
File moved to /opt/immich/recycle/
    ↓
recycle_bin.insert(original_path, recycle_path)
    ↓
file_inventory.delete(file_path) OR update(location='recycle')
```
### 4. Restore Phase
```
User restores from recycle bin
    ↓
File moved back to original_path
    ↓
recycle_bin.delete(id)
    ↓
file_inventory.insert(original_path, location='final')
```
---
## 🚀 Implementation Plan
### Phase 1: Infrastructure (Week 1)
#### Day 1: Table Creation & Backfill Script
- [ ] Add `file_inventory` table to `unified_database.py`
- [ ] Add methods: `upsert_file_inventory()`, `delete_file_inventory()`, `query_file_inventory()`
- [ ] Write backfill script: `/opt/media-downloader/utilities/backfill_file_inventory.py`
- [ ] Test backfill on test database
#### Day 2: Initial Backfill
- [ ] Run backfill script on production database
- [ ] Verify all 2,493 files captured
- [ ] Check accuracy vs filesystem
- [ ] Document any discrepancies
#### Day 3: Testing & Validation
- [ ] Verify indexes created
- [ ] Test query performance (should be <10ms)
- [ ] Write unit tests for file_inventory methods
### Phase 2: Update move_module (Week 1)
#### Day 4-5: Integration
- [ ] Update `move_module.py` to call `upsert_file_inventory()` after successful moves
- [ ] Handle location tracking ('final' vs 'review')
- [ ] Add error handling and logging
- [ ] Test with sample downloads (Instagram, Forum, etc.)
- [ ] Verify file_inventory stays in sync
### Phase 3: Update API Endpoints (Week 2)
#### Day 1-2: Media Page
- [ ] Update `/api/media/gallery` to query `file_inventory` (location='final')
- [ ] Add filtering by platform, source, content_type
- [ ] Add pagination (already indexed)
- [ ] Test performance improvement
- [ ] Deploy and monitor
#### Day 3: Downloads Page
- [ ] Update `/api/downloads/recent` to query `file_inventory`
- [ ] Test with different platforms
- [ ] Verify sorting by created_date
#### Day 4: Review Queue
- [ ] Update `/api/review/queue` to query `file_inventory` (location='review')
- [ ] Verify face recognition integration
- [ ] Test restore from review queue
#### Day 5: Testing & Documentation
- [ ] Integration testing across all pages
- [ ] Performance testing with large datasets
- [ ] Update API documentation
- [ ] User acceptance testing
### Phase 4: Background Maintenance (Week 3)
#### Optional: File Verification Task
- [ ] Create periodic task to verify file existence
- [ ] Mark missing files in `file_inventory`
- [ ] Alert on discrepancies
- [ ] Auto-cleanup stale records (configurable)
---
## 📊 Expected Performance
### Before (Filesystem Scanning)
| Page | Method | Files Scanned | Load Time |
|------|--------|---------------|-----------|
| Media | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Downloads | `directory.rglob('*')` | 2,493 | 5-10 seconds |
| Review | `directory.rglob('*')` | Variable | 2-5 seconds |
| Recycle Bin | Database query | N/A | <100ms ✅ |
### After (Database Queries)
| Page | Method | Query Cost | Load Time |
|------|--------|-----------|-----------|
| Media | `SELECT ... LIMIT 50` | O(log n) index seek | <100ms ✅ |
| Downloads | `SELECT ... LIMIT 50` | O(log n) index seek | <100ms ✅ |
| Review | `SELECT ... WHERE location='review'` | O(log n) index seek | <100ms ✅ |
| Recycle Bin | Database query (unchanged) | O(log n) index seek | <100ms ✅ |
**Performance Improvement: 50-100x faster** 🚀
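Whether a query actually hits the index can be checked with SQLite's `EXPLAIN QUERY PLAN`. A sketch against a throwaway in-memory table using the `idx_fi_platform_location` index from the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_inventory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        file_path TEXT NOT NULL UNIQUE,
        platform TEXT NOT NULL,
        location TEXT NOT NULL,
        created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE INDEX idx_fi_platform_location
        ON file_inventory (platform, location, created_date DESC);
""")

# The media-page query shape: equality on the first two index columns,
# ordered by the third, so both filter and sort are index-served.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT file_path FROM file_inventory
    WHERE platform = 'instagram' AND location = 'final'
    ORDER BY created_date DESC
    LIMIT 50
""").fetchall()
print(plan[0][3])  # the detail column names idx_fi_platform_location
```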
---
## 🔍 Backfill Strategy
### Discovery Phase
Backfill script reads settings to find all file locations:
```python
# 1. Get all destination paths from settings
forums = db.get_setting('forums')
for forum in forums['configs']:
scan_directory(forum['destination_path'], platform='forum', source=forum['name'])
# 2. Get review queue path
face_settings = db.get_setting('face_recognition')
scan_directory(face_settings['review_path'], location='review')
# 3. Hardcoded platform paths (or from settings)
scan_directory('/opt/immich/md/social media/instagram', platform='instagram')
scan_directory('/opt/immich/md/social media/snapchat', platform='snapchat')
scan_directory('/opt/immich/md/social media/tiktok', platform='tiktok')
scan_directory('/opt/immich/md/gallery/Coppermine', platform='coppermine')
```
### Metadata Extraction
For each file found:
- **Platform**: From directory structure or settings
- **Source**: Extract from filename or directory name
- **Content Type**: From file extension
- **File Size**: `os.stat().st_size`
- **File Hash**: Calculate SHA256 (for deduplication)
- **Dimensions**: Query from `media_metadata.db` if exists
- **Location**: 'final' or 'review' based on directory
### Idempotency
Script can be run multiple times safely:
- Uses UPSERT (`INSERT ... ON CONFLICT DO UPDATE`) semantics, so re-runs overwrite rather than duplicate
- Skips files already in database (with option to force refresh)
- Logs statistics: new files, updated files, skipped files
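In SQLite those semantics map to `INSERT ... ON CONFLICT DO UPDATE`. A minimal demonstration that re-running the same statement for the same path updates in place rather than duplicating:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_inventory (
        file_path TEXT NOT NULL UNIQUE,
        location TEXT NOT NULL
    )
""")

upsert = """
    INSERT INTO file_inventory (file_path, location)
    VALUES (?, ?)
    ON CONFLICT(file_path) DO UPDATE SET location = excluded.location
"""
# Two runs for the same path leave a single row with the latest location,
# which is what lets the backfill script re-run safely.
conn.execute(upsert, ("/opt/immich/md/gallery/Coppermine/a.jpg", "final"))
conn.execute(upsert, ("/opt/immich/md/gallery/Coppermine/a.jpg", "review"))
count, loc = conn.execute(
    "SELECT COUNT(*), MAX(location) FROM file_inventory"
).fetchone()
print(count, loc)  # 1 review
```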
---
## 🛡️ Data Integrity
### Constraints
- `file_path UNIQUE` - Prevents duplicate entries
- `location NOT NULL` - Every file must have a location
- Indexes ensure fast lookups even with 100,000+ files
### Verification
Background task (optional, runs daily):
1. Select random 1000 files from `file_inventory`
2. Check if files still exist on filesystem
3. Mark missing files or auto-delete records
4. Log discrepancies for review
### Migration Safety
- **downloads table**: Never modified (preserves audit trail)
- **recycle_bin table**: Never modified (already works perfectly)
- **New table**: No risk to existing functionality
- **Gradual rollout**: Update one endpoint at a time
---
## 📝 Database Methods
### unified_database.py
```python
def create_file_inventory_table(self):
"""Create file_inventory table and indexes"""
def upsert_file_inventory(self, file_path: str, filename: str, platform: str,
source: str = None, content_type: str = None,
location: str = 'final', **kwargs) -> bool:
"""Insert or update file in inventory"""
def delete_file_inventory(self, file_path: str) -> bool:
"""Remove file from inventory (when permanently deleted)"""
def query_file_inventory(self, location: str = None, platform: str = None,
source: str = None, limit: int = 50,
offset: int = 0) -> List[Dict]:
"""Query file inventory with filters and pagination"""
def update_file_inventory_location(self, file_path: str, new_location: str) -> bool:
"""Update file location (e.g., final → review → recycle)"""
def verify_file_inventory(self) -> Dict:
"""Verify all files in inventory still exist on filesystem"""
```
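One way `query_file_inventory()` might assemble its SQL, appending WHERE clauses only for the filters actually supplied so the composite indexes stay usable. A sketch, not the final implementation:

```python
import sqlite3

def query_file_inventory(conn, location=None, platform=None,
                         limit=50, offset=0):
    """Filtered, paginated inventory query as the API endpoints would run it."""
    sql = "SELECT file_path, platform, location FROM file_inventory"
    clauses, params = [], []
    if location:
        clauses.append("location = ?")
        params.append(location)
    if platform:
        clauses.append("platform = ?")
        params.append(platform)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    # created_date DESC matches the index order, so no separate sort pass.
    sql += " ORDER BY created_date DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return [dict(zip(("file_path", "platform", "location"), row))
            for row in conn.execute(sql, params)]
```

Placeholders (`?`) rather than string interpolation keep the query safe even when `source` values come from scraped filenames.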
---
## 🔄 Backwards Compatibility
### During Migration
- Old endpoints continue working (filesystem scan)
- New endpoints use database queries
- Can roll back instantly by switching endpoint logic
- No data loss risk
### After Migration
- Keep `downloads` table for historical queries
- Keep filesystem structure unchanged (Immich needs it)
- `file_inventory` is an index of the files, not a replacement for them
---
## 📈 Monitoring
### Metrics to Track
- Query performance (should be <10ms)
- File inventory count vs filesystem count
- Missing files detected
- Backfill success rate
- API endpoint latency before/after
### Alerts
- File inventory diverges >5% from filesystem
- Query performance degrades >100ms
- Backfill failures
---
## 🚧 Future Enhancements
1. **Real-time sync**: inotify/watchdog to detect file changes
2. **Advanced queries**: Full-text search on filename/source
3. **Statistics**: Track file age, access patterns
4. **Cleanup**: Auto-remove files older than X days
5. **Export**: Generate inventory reports (CSV, JSON)
---
## 📚 Related Documentation
- [VERSIONING.md](VERSIONING.md) - Version control and backups
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [FACE_RECOGNITION.md](FACE_RECOGNITION.md) - Face recognition integration
- [CHANGELOG.md](CHANGELOG.md) - Version history
---
## ✅ Success Criteria
- [ ] All 2,493+ files tracked in `file_inventory`
- [ ] Database accuracy >98%
- [ ] Page load times <100ms
- [ ] Zero data loss
- [ ] Backward compatible
- [ ] No user-facing changes (transparent migration)
---
**Status**: Ready for implementation
**Next Step**: Create `file_inventory` table in `unified_database.py`