# Media Downloader - Comprehensive Code Review **Date:** 2025-10-31 **Version:** 6.3.4 **Reviewer:** Claude Code (Automated Analysis) **Scope:** Full codebase - Backend, Frontend, Database, Architecture --- ## Executive Summary The Media Downloader is a **feature-rich, architecturally sound application** with excellent modular design and modern technology choices. The codebase demonstrates solid engineering principles with a unified database, clear separation of concerns, and comprehensive feature coverage. **Overall Assessment:** - **Code Quality:** 6.5/10 - Good structure but needs refactoring - **Security:** 4/10 - **CRITICAL issues** requiring immediate attention - **Performance:** 7/10 - Generally good with optimization opportunities - **Maintainability:** 6/10 - Large files, some duplication, limited tests - **Architecture:** 8/10 - Excellent modular design ### Key Statistics - **Total Lines of Code:** 37,966 - **Python Files:** 49 (including 20 modules, 2 backend files) - **TypeScript Files:** 20 - **Documentation Files:** 11 (in docs/) - **Test Files:** 0 ⚠️ ### Critical Findings 🔴 **4 Critical Security Issues** - Require immediate action 🟠 **4 High Priority Issues** - Fix within 1-2 weeks 🟡 **7 Medium Priority Issues** - Address within 1-3 months 🟢 **5 Low Priority Issues** - Nice to have improvements --- ## Critical Issues (🔴 Fix Immediately) ### 1. Hardcoded Secrets in Configuration **Severity:** CRITICAL | **Effort:** 2-4 hours | **Risk:** Data breach **Location:** `/opt/media-downloader/config/settings.json` **Problem:** ```json { "password": "cpc6rvm!wvf_wft2EHN", "totp_secret": "OVLX4K6NHTUJTUJVL4TLHXJ55SIEDOOY", "api_key": "SC1dje6Zo5VhGPmy9vyfkeuBY0MZ7VfgrhI8wIvjOM", "api_token": "a3jmhwnhecq9k9dz3tzv2bdk7uc29p" } ``` Credentials are stored in plaintext and tracked in version control. Anyone with repository access has full account credentials. Git history cannot be cleaned without force-pushing. **Impact:** - All forum passwords, API keys, and TOTP secrets exposed - Cannot rotate credentials without code changes - Violates OWASP A02:2021 – Cryptographic Failures **Solution:** ```bash # 1. Immediate: Add to .gitignore echo "config/settings.json" >> .gitignore echo ".env" >> .gitignore # 2. Create environment variable template cat > config/settings.example.json < .env.example < str: """Get secret from environment, raise if not found and no default""" value = os.getenv(key, default) if value is None: raise ValueError(f"Secret '{key}' not found in environment") return value def get_optional_secret(self, key: str) -> Optional[str]: """Get secret from environment, return None if not found""" return os.getenv(key) # Usage in modules secrets = SecretsManager() forum_password = secrets.get_secret('FORUM_PASSWORD') ``` **Rollout Plan:** 1. Create `.env.example` with placeholder values 2. Add `.gitignore` entries for `.env` and `config/settings.json` 3. Document secret setup in `INSTALL.md` 4. Update all modules to use `SecretsManager` 5. Notify team to create local `.env` files 6. Remove secrets from `settings.json` (keep structure) --- ### 2. SQL Injection Vulnerabilities **Severity:** CRITICAL | **Effort:** 4-6 hours | **Risk:** Database compromise **Location:** `/opt/media-downloader/web/backend/api.py` (multiple locations) **Problem:** F-string SQL queries with user-controlled input: ```python # Line ~478-482 (VULNERABLE) cursor.execute(f""" SELECT COUNT(*) FROM downloads WHERE download_date >= datetime('now', '-1 day') AND {filters} """) # Line ~830-850 (VULNERABLE) query = f"SELECT * FROM downloads WHERE platform = '{platform}'" cursor.execute(query) ``` The `filters` variable is constructed from user input (`platform`, `source`, `search`) without proper sanitization. **Impact:** - Attackers can inject arbitrary SQL commands - Can drop tables: `'; DROP TABLE downloads; --` - Can exfiltrate data: `' OR 1=1 UNION SELECT * FROM users --` - Can bypass authentication - OWASP A03:2021 – Injection **Solution:** ```python # BEFORE (VULNERABLE) platform = request.query_params.get('platform') query = f"SELECT * FROM downloads WHERE platform = '{platform}'" cursor.execute(query) # AFTER (SECURE) platform = request.query_params.get('platform') query = "SELECT * FROM downloads WHERE platform = ?" cursor.execute(query, (platform,)) # For dynamic filters def build_safe_query(filters: dict) -> tuple[str, tuple]: """Build parameterized query from filters""" conditions = [] params = [] if filters.get('platform'): conditions.append("platform = ?") params.append(filters['platform']) if filters.get('source'): conditions.append("source = ?") params.append(filters['source']) if filters.get('search'): conditions.append("(filename LIKE ? OR source LIKE ?)") search_pattern = f"%{filters['search']}%" params.extend([search_pattern, search_pattern]) where_clause = " AND ".join(conditions) if conditions else "1=1" return where_clause, tuple(params) # Usage filters = build_safe_query(request.query_params) query = f"SELECT * FROM downloads WHERE {filters[0]}" cursor.execute(query, filters[1]) ``` **Files Requiring Fixes:** - `/opt/media-downloader/web/backend/api.py` (17+ instances) - Lines 478-482, 520-540, 830-850, 910-930 - `/opt/media-downloader/utilities/db_manager.py` (2 instances) **Testing:** ```python # Test case for SQL injection prevention def test_sql_injection_prevention(): # Try to inject SQL malicious_input = "'; DROP TABLE downloads; --" response = client.get(f"/api/downloads?platform={malicious_input}") # Should not execute injection assert response.status_code in [400, 404] # Bad request or not found # Verify table still exists assert db.table_exists('downloads') ``` --- ### 3. Path Traversal Vulnerabilities **Severity:** HIGH | **Effort:** 3-4 hours | **Risk:** File system access **Location:** `/opt/media-downloader/web/backend/api.py` (media endpoints) **Problem:** File paths from user input are not validated: ```python # Lines ~1920+ (VULNERABLE) @app.get("/api/media/preview") async def get_media_preview(file_path: str, ...): # No validation - attacker could use ../../etc/passwd return FileResponse(file_path) @app.get("/api/media/thumbnail") async def get_media_thumbnail(file_path: str, ...): # No validation requested_path = Path(file_path) return FileResponse(requested_path) ``` **Impact:** - Read arbitrary files: `/etc/passwd`, `/etc/shadow`, database files - Access configuration with secrets - Data exfiltration via media endpoints - OWASP A01:2021 – Broken Access Control **Solution:** ```python from pathlib import Path from fastapi import HTTPException ALLOWED_MEDIA_BASE = Path("/opt/immich/md") def validate_file_path(file_path: str, allowed_base: Path) -> Path: """ Ensure file_path is within allowed directory. Prevents directory traversal attacks. """ try: # Resolve to absolute path requested = Path(file_path).resolve() # Check if within allowed directory if not requested.is_relative_to(allowed_base): raise ValueError(f"Path outside allowed directory") # Check file exists if not requested.exists(): raise FileNotFoundError() # Check it's a file, not directory if not requested.is_file(): raise ValueError("Path is not a file") return requested except (ValueError, FileNotFoundError) as e: raise HTTPException( status_code=403, detail="Access denied: Invalid file path" ) @app.get("/api/media/preview") async def get_media_preview( file_path: str, current_user: Dict = Depends(get_current_user_media) ): """Serve media file with path validation""" safe_path = validate_file_path(file_path, ALLOWED_MEDIA_BASE) return FileResponse(safe_path) ``` **Test Cases:** ```python # Path traversal attack attempts test_cases = [ "../../etc/passwd", "/etc/passwd", "../../../root/.ssh/id_rsa", "....//....//etc/passwd", "%2e%2e%2f%2e%2e%2fetc%2fpasswd", # URL encoded ] for attack in test_cases: response = client.get(f"/api/media/preview?file_path={attack}") assert response.status_code == 403, f"Failed to block: {attack}" ``` --- ### 4. Command Injection Risk **Severity:** HIGH | **Effort:** 2-3 hours | **Risk:** Code execution **Location:** `/opt/media-downloader/web/backend/api.py` **Problem:** Subprocess calls with user input: ```python # Line ~1314 @app.post("/api/platforms/{platform}/trigger") async def trigger_platform_download(platform: str, ...): cmd = ["python3", "/opt/media-downloader/media-downloader.py", "--platform", platform] process = await asyncio.create_subprocess_exec(*cmd, ...) ``` While using a list (safer than shell=True), the `platform` parameter is not validated against a whitelist. **Impact:** - Could inject commands if platform validation is bypassed - Potential code execution via crafted platform names - OWASP A03:2021 – Injection **Solution:** ```python from enum import Enum from typing import Literal # Define allowed platforms as enum class Platform(str, Enum): INSTAGRAM = "instagram" FASTDL = "fastdl" IMGINN = "imginn" TOOLZU = "toolzu" SNAPCHAT = "snapchat" TIKTOK = "tiktok" FORUMS = "forums" ALL = "all" @app.post("/api/platforms/{platform}/trigger") async def trigger_platform_download( platform: Platform, # Type hint enforces validation trigger_data: TriggerRequest, background_tasks: BackgroundTasks, current_user: Dict = Depends(get_current_user) ): """Trigger download with validated platform""" # FastAPI automatically validates against enum cmd = [ "python3", "/opt/media-downloader/media-downloader.py", "--platform", platform.value # Safe - enum member ] process = await asyncio.create_subprocess_exec( *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE ) ``` **Additional Hardening:** ```python # Subprocess wrapper with additional safety import shlex def safe_subprocess_exec(cmd: List[str], allowed_commands: Set[str]): """Execute subprocess with command whitelist""" if cmd[0] not in allowed_commands: raise ValueError(f"Command not allowed: {cmd[0]}") # Validate all arguments are safe for arg in cmd: if any(char in arg for char in [';', '&', '|', '$', '`']): raise ValueError(f"Dangerous character in argument: {arg}") return subprocess.run(cmd, capture_output=True, text=True, timeout=300) ``` --- ## High Priority Issues (🟠 Fix Soon) ### 5. Massive Files - Maintainability Crisis **Severity:** HIGH | **Effort:** 24-36 hours | **Risk:** Technical debt **Problem:** Several files exceed 2,000 lines, violating single responsibility principle: | File | Lines | Size | |------|-------|------| | `modules/forum_downloader.py` | 3,971 | 167 KB | | `media-downloader.py` | 2,653 | - | | `web/backend/api.py` | 2,649 | 94 KB | | `modules/imginn_module.py` | 2,542 | 129 KB | **Impact:** - Difficult to navigate and understand - Hard to test individual components - Increases cognitive load - Higher bug density - Makes code reviews painful - Merge conflicts more frequent **Recommended Structure:** ``` # For api.py refactoring: web/backend/ ├── main.py (FastAPI app initialization, 100-150 lines) ├── dependencies.py (auth dependencies, 50-100 lines) ├── middleware.py (CORS, rate limiting, 50-100 lines) ├── routers/ │ ├── __init__.py │ ├── auth.py (authentication endpoints, 150-200 lines) │ ├── downloads.py (download endpoints, 200-300 lines) │ ├── scheduler.py (scheduler endpoints, 150-200 lines) │ ├── media.py (media endpoints, 150-200 lines) │ ├── health.py (health/monitoring, 100-150 lines) │ └── config.py (configuration endpoints, 100-150 lines) ├── services/ │ ├── download_service.py (download business logic) │ ├── scheduler_service.py (scheduler business logic) │ └── media_service.py (media processing logic) ├── models/ │ ├── requests.py (Pydantic request models) │ ├── responses.py (Pydantic response models) │ └── schemas.py (database schemas) └── utils/ ├── validators.py (input validation) └── helpers.py (utility functions) ``` **Migration Plan:** 1. Create new directory structure 2. Extract routers one at a time (start with health, least dependencies) 3. Move business logic to services 4. Extract Pydantic models 5. Update imports gradually 6. Test after each extraction 7. Remove old code once verified --- ### 6. Database Connection Pool Exhaustion **Severity:** HIGH | **Effort:** 4-6 hours | **Risk:** Application hang **Location:** `/opt/media-downloader/modules/unified_database.py` **Problem:** Connection pool implementation has potential leaks: ```python # Line 119-130 (PROBLEMATIC) def get_connection(self, for_write=False): try: if self.pool: with self.pool.get_connection(for_write=for_write) as conn: yield conn else: conn = sqlite3.connect(...) # ⚠️ No try/finally - connection might not close on error yield conn ``` **Impact:** - Connection leaks under error conditions - Pool exhaustion causes application hang - No monitoring of pool health - Memory leaks **Solution:** ```python from contextlib import contextmanager from typing import Generator import sqlite3 @contextmanager def get_connection( self, for_write: bool = False ) -> Generator[sqlite3.Connection, None, None]: """ Get database connection with guaranteed cleanup. Args: for_write: If True, ensures exclusive write access Yields: sqlite3.Connection: Database connection Raises: sqlite3.Error: On connection/query errors """ conn = None try: if self.pool: conn = self.pool.get_connection(for_write=for_write) else: conn = sqlite3.connect( str(self.db_path), timeout=30, check_same_thread=False ) conn.row_factory = sqlite3.Row yield conn # Commit if no exceptions if for_write: conn.commit() except sqlite3.Error as e: # Rollback on error if conn and for_write: conn.rollback() logger.error(f"Database error: {e}") raise finally: # Always close connection if conn: conn.close() # Add pool monitoring def get_pool_stats(self) -> dict: """Get connection pool statistics""" if not self.pool: return {'pool_enabled': False} return { 'pool_enabled': True, 'active_connections': self.pool.active_connections, 'max_connections': self.pool.max_connections, 'available': self.pool.max_connections - self.pool.active_connections, 'wait_count': self.pool.wait_count, 'timeout_count': self.pool.timeout_count } # Add to health endpoint @app.get("/api/health/database") async def get_database_health(): stats = app_state.db.get_pool_stats() # Alert if low on connections if stats.get('available', 0) < 2: logger.warning("Database connection pool nearly exhausted") return stats ``` --- ### 7. No Authentication Rate Limiting (Already Fixed) **Severity:** HIGH | **Status:** ✅ FIXED in 6.3.4 Rate limiting has been implemented in version 6.3.4 using slowapi: - Login: 5 requests/minute - Auth endpoints: 10 requests/minute - Read endpoints: 100 requests/minute No additional action required. --- ### 8. Missing CSRF Protection **Severity:** HIGH | **Effort:** 2-3 hours | **Risk:** Unauthorized actions **Problem:** No CSRF tokens on state-changing operations. Attackers can craft malicious pages that trigger actions on behalf of authenticated users. **Impact:** - Delete downloads via CSRF - Trigger new downloads - Modify configuration - Stop running tasks - OWASP A01:2021 – Broken Access Control **Solution:** ```bash # Install CSRF protection pip install fastapi-csrf-protect ``` ```python # web/backend/main.py from fastapi_csrf_protect import CsrfProtect from fastapi_csrf_protect.exceptions import CsrfProtectError from pydantic import BaseModel class CsrfSettings(BaseModel): secret_key: str = os.getenv('CSRF_SECRET_KEY', secrets.token_urlsafe(32)) cookie_samesite: str = 'strict' @CsrfProtect.load_config def get_csrf_config(): return CsrfSettings() # Apply to state-changing endpoints @app.post("/api/platforms/{platform}/trigger") async def trigger_download( request: Request, csrf_protect: CsrfProtect = Depends() ): # Validate CSRF token await csrf_protect.validate_csrf(request) # Rest of code... # Frontend: Include CSRF token // api.ts async post(endpoint: string, data: any): Promise { const csrfToken = this.getCsrfToken() return fetch(`${API_BASE}${endpoint}`, { method: 'POST', headers: { 'Content-Type': 'application/json', 'X-CSRF-Token': csrfToken }, body: JSON.stringify(data) }) } ``` --- ## Medium Priority Issues (🟡 Address This Quarter) ### 9. TypeScript 'any' Type Overuse **Severity:** MEDIUM | **Effort:** 4-6 hours 70+ instances of `any` type defeat TypeScript's purpose. **Solution:** ```typescript // Define proper interfaces interface User { id: number username: string role: 'admin' | 'user' | 'viewer' email?: string preferences: UserPreferences } interface UserPreferences { theme: 'light' | 'dark' notifications: boolean } interface PlatformConfig { enabled: boolean check_interval_hours: number accounts?: Account[] usernames?: string[] run_at_start?: boolean } // Replace any with proper types async getMe(): Promise { return this.get('/auth/me') } ``` --- ### 10. No Comprehensive Error Handling **Severity:** MEDIUM | **Effort:** 6-8 hours 115 try/except blocks with generic `except Exception` catching. **Solution:** ```python # modules/exceptions.py class MediaDownloaderError(Exception): """Base exception""" pass class DownloadError(MediaDownloaderError): """Download failed""" pass class AuthenticationError(MediaDownloaderError): """Authentication failed""" pass class RateLimitError(MediaDownloaderError): """Rate limit exceeded""" pass class ValidationError(MediaDownloaderError): """Input validation failed""" pass # Structured error responses @app.exception_handler(MediaDownloaderError) async def handle_app_error(request: Request, exc: MediaDownloaderError): return JSONResponse( status_code=400, content={ 'error': exc.__class__.__name__, 'message': str(exc), 'timestamp': datetime.now().isoformat() } ) ``` --- ### 11. Code Duplication Across Modules **Severity:** MEDIUM | **Effort:** 6-8 hours Instagram modules share 60-70% similar code. **Solution:** ```python # modules/base_downloader.py from abc import ABC, abstractmethod class BaseDownloader(ABC): """Base class for all downloaders""" def __init__(self, unified_db, log_callback, show_progress): self.unified_db = unified_db self.log_callback = log_callback self.show_progress = show_progress def log(self, message: str, level: str = "info"): """Centralized logging""" if self.log_callback: self.log_callback(f"[{self.platform_name}] {message}", level) def is_downloaded(self, media_id: str) -> bool: return self.unified_db.is_downloaded(media_id, self.platform_name) @abstractmethod def download(self, username: str) -> int: """Implement in subclass""" pass ``` --- ### 12. Inconsistent Logging **Severity:** MEDIUM | **Effort:** 4-6 hours Mix of print(), custom callbacks, and logging module. **Solution:** ```python import logging import json class StructuredLogger: def __init__(self, name: str): self.logger = logging.getLogger(name) handler = logging.FileHandler('logs/media-downloader.log') handler.setFormatter(logging.Formatter('%(message)s')) self.logger.addHandler(handler) self.logger.setLevel(logging.INFO) def log(self, message: str, level: str = "info", **extra): log_entry = { 'timestamp': datetime.now().isoformat(), 'level': level.upper(), 'message': message, **extra } getattr(self.logger, level)(json.dumps(log_entry)) ``` --- ### 13. No Database Migration Strategy **Severity:** MEDIUM | **Effort:** 4-6 hours Schema changes via ad-hoc ALTER TABLE statements. **Solution:** Implement Alembic or custom migration system. --- ### 14. Missing API Validation **Severity:** MEDIUM | **Effort:** 3-4 hours Some endpoints lack Pydantic models. **Solution:** Add comprehensive request/response models. --- ### 15. No Tests **Severity:** MEDIUM | **Effort:** 40-60 hours Zero test coverage. **Solution:** Implement pytest with unit, integration, and E2E tests. --- ## Low Priority Issues (🟢 Nice to Have) ### 16. Frontend Re-render Optimization Multiple independent polling timers. Consider WebSocket-only updates. ### 17. TypeScript Strict Mode Leverage Enable additional strict checks. ### 18. API Response Caching Add caching for expensive queries. ### 19. Database Indexes Add indexes on frequently queried columns. ### 20. API Versioning Implement `/api/v1` prefix for future compatibility. --- ## Strengths ✅ **Excellent Modular Architecture** - Clear separation of concerns ✅ **Comprehensive Database Design** - WAL mode, connection pooling ✅ **Modern Frontend Stack** - TypeScript, React, TanStack Query ✅ **Good Type Hints** - Python type hints improve clarity ✅ **Rate Limiting** - Sophisticated anti-detection measures ✅ **WebSocket Real-time** - Live updates for better UX ✅ **Feature Complete** - Multi-platform support, deduplication, notifications --- ## Implementation Priorities ### Week 1 (Critical - 11-17 hours) - [ ] Remove secrets from version control - [ ] Fix SQL injection vulnerabilities - [ ] Add file path validation - [ ] Validate subprocess inputs ### Month 1 (High Priority - 32-48 hours) - [ ] Refactor large files - [ ] Fix connection pool handling - [ ] Add CSRF protection ### Quarter 1 (Medium Priority - 67-98 hours) - [ ] Replace TypeScript any types - [ ] Implement error handling strategy - [ ] Eliminate code duplication - [ ] Standardize logging - [ ] Add database migrations - [ ] Implement test suite ### Ongoing (Low Priority - 15-23 hours) - [ ] Optimize frontend performance - [ ] Leverage TypeScript strict mode - [ ] Add API caching - [ ] Add database indexes - [ ] Implement API versioning --- ## Metrics **Current State:** - Code Quality Score: 6.5/10 - Security Score: 4/10 - Test Coverage: 0% - Technical Debt: HIGH **Target State (After Improvements):** - Code Quality Score: 8.5/10 - Security Score: 9/10 - Test Coverage: 70%+ - Technical Debt: LOW --- ## Conclusion The Media Downloader is a well-architected application that demonstrates solid engineering principles. However, **critical security issues must be addressed immediately** to prevent data breaches and system compromise. With systematic implementation of these recommendations, this will evolve into a production-ready, enterprise-grade system with excellent security, maintainability, and performance. **Total Estimated Effort:** 125-186 hours (3-4 months at 10-15 hrs/week) **Next Steps:** 1. Review and prioritize recommendations 2. Create GitHub issues for each item 3. Begin with Week 1 critical fixes 4. Establish regular review cadence