Files
media-downloader/docs/archive/CODE_REVIEW_2025-10-31.md
Todd 0d7b2b1aab Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-29 22:42:55 -04:00

25 KiB
Raw Blame History

Media Downloader - Comprehensive Code Review

Date: 2025-10-31 Version: 6.3.4 Reviewer: Claude Code (Automated Analysis) Scope: Full codebase - Backend, Frontend, Database, Architecture


Executive Summary

The Media Downloader is a feature-rich, architecturally sound application with excellent modular design and modern technology choices. The codebase demonstrates solid engineering principles with a unified database, clear separation of concerns, and comprehensive feature coverage.

Overall Assessment:

  • Code Quality: 6.5/10 - Good structure but needs refactoring
  • Security: 4/10 - CRITICAL issues requiring immediate attention
  • Performance: 7/10 - Generally good with optimization opportunities
  • Maintainability: 6/10 - Large files, some duplication, limited tests
  • Architecture: 8/10 - Excellent modular design

Key Statistics

  • Total Lines of Code: 37,966
  • Python Files: 49 (including 20 modules, 2 backend files)
  • TypeScript Files: 20
  • Documentation Files: 11 (in docs/)
  • Test Files: 0 ⚠️

Critical Findings

🔴 4 Critical Security Issues - Require immediate action 🟠 4 High Priority Issues - Fix within 1-2 weeks 🟡 7 Medium Priority Issues - Address within 1-3 months 🟢 5 Low Priority Issues - Nice to have improvements


Critical Issues (🔴 Fix Immediately)

1. Hardcoded Secrets in Configuration

Severity: CRITICAL | Effort: 2-4 hours | Risk: Data breach

Location: /opt/media-downloader/config/settings.json

Problem:

{
  "password": "cpc6rvm!wvf_wft2EHN",
  "totp_secret": "OVLX4K6NHTUJTUJVL4TLHXJ55SIEDOOY",
  "api_key": "SC1dje6Zo5VhGPmy9vyfkeuBY0MZ7VfgrhI8wIvjOM",
  "api_token": "a3jmhwnhecq9k9dz3tzv2bdk7uc29p"
}

Credentials are stored in plaintext and tracked in version control. Anyone with repository access has full account credentials. Git history cannot be cleaned without force-pushing.

Impact:

  • All forum passwords, API keys, and TOTP secrets exposed
  • Cannot rotate credentials without code changes
  • Violates OWASP A02:2021 Cryptographic Failures

Solution:

# 1. Immediate: Add to .gitignore
echo "config/settings.json" >> .gitignore
echo ".env" >> .gitignore

# 2. Create environment variable template
cat > config/settings.example.json <<EOF
{
  "forums": {
    "password": "FORUM_PASSWORD",
    "totp_secret": "FORUM_TOTP_SECRET"
  },
  "snapchat": {
    "password": "SNAPCHAT_PASSWORD"
  },
  "tiktok": {
    "api_key": "TIKTOK_API_KEY",
    "api_token": "TIKTOK_API_TOKEN"
  }
}
EOF

# 3. Create .env file (add to .gitignore)
cat > .env.example <<EOF
FORUM_PASSWORD=your_password_here
FORUM_TOTP_SECRET=your_totp_secret_here
SNAPCHAT_PASSWORD=your_password_here
TIKTOK_API_KEY=your_api_key_here
TIKTOK_API_TOKEN=your_api_token_here
EOF

Implementation:

# modules/secrets_manager.py
import os
from pathlib import Path
from dotenv import load_dotenv
from typing import Optional

class SecretsManager:
    """Secure secrets management using environment variables"""

    def __init__(self, env_file: Optional[Path] = None):
        if env_file is None:
            env_file = Path(__file__).parent.parent / '.env'

        if env_file.exists():
            load_dotenv(env_file)

    def get_secret(self, key: str, default: Optional[str] = None) -> str:
        """Get secret from environment, raise if not found and no default"""
        value = os.getenv(key, default)
        if value is None:
            raise ValueError(f"Secret '{key}' not found in environment")
        return value

    def get_optional_secret(self, key: str) -> Optional[str]:
        """Get secret from environment, return None if not found"""
        return os.getenv(key)

# Usage in modules
secrets = SecretsManager()
forum_password = secrets.get_secret('FORUM_PASSWORD')

Rollout Plan:

  1. Create .env.example with placeholder values
  2. Add .gitignore entries for .env and config/settings.json
  3. Document secret setup in INSTALL.md
  4. Update all modules to use SecretsManager
  5. Notify team to create local .env files
  6. Remove secrets from settings.json (keep structure)

2. SQL Injection Vulnerabilities

Severity: CRITICAL | Effort: 4-6 hours | Risk: Database compromise

Location: /opt/media-downloader/web/backend/api.py (multiple locations)

Problem: F-string SQL queries with user-controlled input:

# Line ~478-482 (VULNERABLE)
cursor.execute(f"""
    SELECT COUNT(*) FROM downloads
    WHERE download_date >= datetime('now', '-1 day')
    AND {filters}
""")

# Line ~830-850 (VULNERABLE)
query = f"SELECT * FROM downloads WHERE platform = '{platform}'"
cursor.execute(query)

The filters variable is constructed from user input (platform, source, search) without proper sanitization.

Impact:

  • Attackers can inject arbitrary SQL commands
  • Can drop tables: '; DROP TABLE downloads; --
  • Can exfiltrate data: ' OR 1=1 UNION SELECT * FROM users --
  • Can bypass authentication
  • OWASP A03:2021 Injection

Solution:

# BEFORE (VULNERABLE)
platform = request.query_params.get('platform')
query = f"SELECT * FROM downloads WHERE platform = '{platform}'"
cursor.execute(query)

# AFTER (SECURE)
platform = request.query_params.get('platform')
query = "SELECT * FROM downloads WHERE platform = ?"
cursor.execute(query, (platform,))

# For dynamic filters
def build_safe_query(filters: dict) -> tuple[str, tuple]:
    """Build parameterized query from filters"""
    conditions = []
    params = []

    if filters.get('platform'):
        conditions.append("platform = ?")
        params.append(filters['platform'])

    if filters.get('source'):
        conditions.append("source = ?")
        params.append(filters['source'])

    if filters.get('search'):
        conditions.append("(filename LIKE ? OR source LIKE ?)")
        search_pattern = f"%{filters['search']}%"
        params.extend([search_pattern, search_pattern])

    where_clause = " AND ".join(conditions) if conditions else "1=1"
    return where_clause, tuple(params)

# Usage
filters = build_safe_query(request.query_params)
query = f"SELECT * FROM downloads WHERE {filters[0]}"
cursor.execute(query, filters[1])

Files Requiring Fixes:

  • /opt/media-downloader/web/backend/api.py (17+ instances)
    • Lines 478-482, 520-540, 830-850, 910-930
  • /opt/media-downloader/utilities/db_manager.py (2 instances)

Testing:

# Test case for SQL injection prevention
def test_sql_injection_prevention():
    # Try to inject SQL
    malicious_input = "'; DROP TABLE downloads; --"
    response = client.get(f"/api/downloads?platform={malicious_input}")

    # Should not execute injection
    assert response.status_code in [400, 404]  # Bad request or not found

    # Verify table still exists
    assert db.table_exists('downloads')

3. Path Traversal Vulnerabilities

Severity: HIGH | Effort: 3-4 hours | Risk: File system access

Location: /opt/media-downloader/web/backend/api.py (media endpoints)

Problem: File paths from user input are not validated:

# Lines ~1920+ (VULNERABLE)
@app.get("/api/media/preview")
async def get_media_preview(file_path: str, ...):
    # No validation - attacker could use ../../etc/passwd
    return FileResponse(file_path)

@app.get("/api/media/thumbnail")
async def get_media_thumbnail(file_path: str, ...):
    # No validation
    requested_path = Path(file_path)
    return FileResponse(requested_path)

Impact:

  • Read arbitrary files: /etc/passwd, /etc/shadow, database files
  • Access configuration with secrets
  • Data exfiltration via media endpoints
  • OWASP A01:2021 Broken Access Control

Solution:

from pathlib import Path
from fastapi import HTTPException

ALLOWED_MEDIA_BASE = Path("/opt/immich/md")

def validate_file_path(file_path: str, allowed_base: Path) -> Path:
    """
    Ensure file_path is within allowed directory.
    Prevents directory traversal attacks.
    """
    try:
        # Resolve to absolute path
        requested = Path(file_path).resolve()

        # Check if within allowed directory
        if not requested.is_relative_to(allowed_base):
            raise ValueError(f"Path outside allowed directory")

        # Check file exists
        if not requested.exists():
            raise FileNotFoundError()

        # Check it's a file, not directory
        if not requested.is_file():
            raise ValueError("Path is not a file")

        return requested

    except (ValueError, FileNotFoundError) as e:
        raise HTTPException(
            status_code=403,
            detail="Access denied: Invalid file path"
        )

@app.get("/api/media/preview")
async def get_media_preview(
    file_path: str,
    current_user: Dict = Depends(get_current_user_media)
):
    """Serve media file with path validation"""
    safe_path = validate_file_path(file_path, ALLOWED_MEDIA_BASE)
    return FileResponse(safe_path)

Test Cases:

# Path traversal attack attempts
test_cases = [
    "../../etc/passwd",
    "/etc/passwd",
    "../../../root/.ssh/id_rsa",
    "....//....//etc/passwd",
    "%2e%2e%2f%2e%2e%2fetc%2fpasswd",  # URL encoded
]

for attack in test_cases:
    response = client.get(f"/api/media/preview?file_path={attack}")
    assert response.status_code == 403, f"Failed to block: {attack}"

4. Command Injection Risk

Severity: HIGH | Effort: 2-3 hours | Risk: Code execution

Location: /opt/media-downloader/web/backend/api.py

Problem: Subprocess calls with user input:

# Line ~1314
@app.post("/api/platforms/{platform}/trigger")
async def trigger_platform_download(platform: str, ...):
    cmd = ["python3", "/opt/media-downloader/media-downloader.py", "--platform", platform]
    process = await asyncio.create_subprocess_exec(*cmd, ...)

While using a list (safer than shell=True), the platform parameter is not validated against a whitelist.

Impact:

  • Could inject commands if platform validation is bypassed
  • Potential code execution via crafted platform names
  • OWASP A03:2021 Injection

Solution:

from enum import Enum
from typing import Literal

# Define allowed platforms as enum
class Platform(str, Enum):
    INSTAGRAM = "instagram"
    FASTDL = "fastdl"
    IMGINN = "imginn"
    TOOLZU = "toolzu"
    SNAPCHAT = "snapchat"
    TIKTOK = "tiktok"
    FORUMS = "forums"
    ALL = "all"

@app.post("/api/platforms/{platform}/trigger")
async def trigger_platform_download(
    platform: Platform,  # Type hint enforces validation
    trigger_data: TriggerRequest,
    background_tasks: BackgroundTasks,
    current_user: Dict = Depends(get_current_user)
):
    """Trigger download with validated platform"""
    # FastAPI automatically validates against enum
    cmd = [
        "python3",
        "/opt/media-downloader/media-downloader.py",
        "--platform",
        platform.value  # Safe - enum member
    ]

    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

Additional Hardening:

# Subprocess wrapper with additional safety
import shlex

def safe_subprocess_exec(cmd: List[str], allowed_commands: Set[str]):
    """Execute subprocess with command whitelist"""
    if cmd[0] not in allowed_commands:
        raise ValueError(f"Command not allowed: {cmd[0]}")

    # Validate all arguments are safe
    for arg in cmd:
        if any(char in arg for char in [';', '&', '|', '$', '`']):
            raise ValueError(f"Dangerous character in argument: {arg}")

    return subprocess.run(cmd, capture_output=True, text=True, timeout=300)

High Priority Issues (🟠 Fix Soon)

5. Massive Files - Maintainability Crisis

Severity: HIGH | Effort: 24-36 hours | Risk: Technical debt

Problem: Several files exceed 2,000 lines, violating single responsibility principle:

File Lines Size
modules/forum_downloader.py 3,971 167 KB
media-downloader.py 2,653 -
web/backend/api.py 2,649 94 KB
modules/imginn_module.py 2,542 129 KB

Impact:

  • Difficult to navigate and understand
  • Hard to test individual components
  • Increases cognitive load
  • Higher bug density
  • Makes code reviews painful
  • Merge conflicts more frequent

Recommended Structure:

# For api.py refactoring:
web/backend/
├── main.py (FastAPI app initialization, 100-150 lines)
├── dependencies.py (auth dependencies, 50-100 lines)
├── middleware.py (CORS, rate limiting, 50-100 lines)
├── routers/
│   ├── __init__.py
│   ├── auth.py (authentication endpoints, 150-200 lines)
│   ├── downloads.py (download endpoints, 200-300 lines)
│   ├── scheduler.py (scheduler endpoints, 150-200 lines)
│   ├── media.py (media endpoints, 150-200 lines)
│   ├── health.py (health/monitoring, 100-150 lines)
│   └── config.py (configuration endpoints, 100-150 lines)
├── services/
│   ├── download_service.py (download business logic)
│   ├── scheduler_service.py (scheduler business logic)
│   └── media_service.py (media processing logic)
├── models/
│   ├── requests.py (Pydantic request models)
│   ├── responses.py (Pydantic response models)
│   └── schemas.py (database schemas)
└── utils/
    ├── validators.py (input validation)
    └── helpers.py (utility functions)

Migration Plan:

  1. Create new directory structure
  2. Extract routers one at a time (start with health, least dependencies)
  3. Move business logic to services
  4. Extract Pydantic models
  5. Update imports gradually
  6. Test after each extraction
  7. Remove old code once verified

6. Database Connection Pool Exhaustion

Severity: HIGH | Effort: 4-6 hours | Risk: Application hang

Location: /opt/media-downloader/modules/unified_database.py

Problem: Connection pool implementation has potential leaks:

# Line 119-130 (PROBLEMATIC)
def get_connection(self, for_write=False):
    try:
        if self.pool:
            with self.pool.get_connection(for_write=for_write) as conn:
                yield conn
        else:
            conn = sqlite3.connect(...)
            # ⚠️ No try/finally - connection might not close on error
            yield conn

Impact:

  • Connection leaks under error conditions
  • Pool exhaustion causes application hang
  • No monitoring of pool health
  • Memory leaks

Solution:

from contextlib import contextmanager
from typing import Generator
import sqlite3

@contextmanager
def get_connection(
    self,
    for_write: bool = False
) -> Generator[sqlite3.Connection, None, None]:
    """
    Get database connection with guaranteed cleanup.

    Args:
        for_write: If True, ensures exclusive write access

    Yields:
        sqlite3.Connection: Database connection

    Raises:
        sqlite3.Error: On connection/query errors
    """
    conn = None
    try:
        if self.pool:
            conn = self.pool.get_connection(for_write=for_write)
        else:
            conn = sqlite3.connect(
                str(self.db_path),
                timeout=30,
                check_same_thread=False
            )
            conn.row_factory = sqlite3.Row

        yield conn

        # Commit if no exceptions
        if for_write:
            conn.commit()

    except sqlite3.Error as e:
        # Rollback on error
        if conn and for_write:
            conn.rollback()
        logger.error(f"Database error: {e}")
        raise

    finally:
        # Always close connection
        if conn:
            conn.close()

# Add pool monitoring
def get_pool_stats(self) -> dict:
    """Get connection pool statistics"""
    if not self.pool:
        return {'pool_enabled': False}

    return {
        'pool_enabled': True,
        'active_connections': self.pool.active_connections,
        'max_connections': self.pool.max_connections,
        'available': self.pool.max_connections - self.pool.active_connections,
        'wait_count': self.pool.wait_count,
        'timeout_count': self.pool.timeout_count
    }

# Add to health endpoint
@app.get("/api/health/database")
async def get_database_health():
    stats = app_state.db.get_pool_stats()

    # Alert if low on connections
    if stats.get('available', 0) < 2:
        logger.warning("Database connection pool nearly exhausted")

    return stats

7. No Authentication Rate Limiting (Already Fixed)

Severity: HIGH | Status: FIXED in 6.3.4

Rate limiting has been implemented in version 6.3.4 using slowapi:

  • Login: 5 requests/minute
  • Auth endpoints: 10 requests/minute
  • Read endpoints: 100 requests/minute

No additional action required.


8. Missing CSRF Protection

Severity: HIGH | Effort: 2-3 hours | Risk: Unauthorized actions

Problem: No CSRF tokens on state-changing operations. Attackers can craft malicious pages that trigger actions on behalf of authenticated users.

Impact:

  • Delete downloads via CSRF
  • Trigger new downloads
  • Modify configuration
  • Stop running tasks
  • OWASP A01:2021 Broken Access Control

Solution:

# Install CSRF protection
pip install fastapi-csrf-protect
# web/backend/main.py
from fastapi_csrf_protect import CsrfProtect
from fastapi_csrf_protect.exceptions import CsrfProtectError
from pydantic import BaseModel

class CsrfSettings(BaseModel):
    secret_key: str = os.getenv('CSRF_SECRET_KEY', secrets.token_urlsafe(32))
    cookie_samesite: str = 'strict'

@CsrfProtect.load_config
def get_csrf_config():
    return CsrfSettings()

# Apply to state-changing endpoints
@app.post("/api/platforms/{platform}/trigger")
async def trigger_download(
    request: Request,
    csrf_protect: CsrfProtect = Depends()
):
    # Validate CSRF token
    await csrf_protect.validate_csrf(request)
    # Rest of code...

# Frontend: Include CSRF token
// api.ts
async post<T>(endpoint: string, data: any): Promise<T> {
    const csrfToken = this.getCsrfToken()
    return fetch(`${API_BASE}${endpoint}`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-CSRF-Token': csrfToken
        },
        body: JSON.stringify(data)
    })
}

Medium Priority Issues (🟡 Address This Quarter)

9. TypeScript 'any' Type Overuse

Severity: MEDIUM | Effort: 4-6 hours

70+ instances of any type defeat TypeScript's purpose.

Solution:

// Define proper interfaces
interface User {
    id: number
    username: string
    role: 'admin' | 'user' | 'viewer'
    email?: string
    preferences: UserPreferences
}

interface UserPreferences {
    theme: 'light' | 'dark'
    notifications: boolean
}

interface PlatformConfig {
    enabled: boolean
    check_interval_hours: number
    accounts?: Account[]
    usernames?: string[]
    run_at_start?: boolean
}

// Replace any with proper types
async getMe(): Promise<User> {
    return this.get<User>('/auth/me')
}

10. No Comprehensive Error Handling

Severity: MEDIUM | Effort: 6-8 hours

115 try/except blocks with generic except Exception catching.

Solution:

# modules/exceptions.py
class MediaDownloaderError(Exception):
    """Base exception"""
    pass

class DownloadError(MediaDownloaderError):
    """Download failed"""
    pass

class AuthenticationError(MediaDownloaderError):
    """Authentication failed"""
    pass

class RateLimitError(MediaDownloaderError):
    """Rate limit exceeded"""
    pass

class ValidationError(MediaDownloaderError):
    """Input validation failed"""
    pass

# Structured error responses
@app.exception_handler(MediaDownloaderError)
async def handle_app_error(request: Request, exc: MediaDownloaderError):
    return JSONResponse(
        status_code=400,
        content={
            'error': exc.__class__.__name__,
            'message': str(exc),
            'timestamp': datetime.now().isoformat()
        }
    )

11. Code Duplication Across Modules

Severity: MEDIUM | Effort: 6-8 hours

Instagram modules share 60-70% similar code.

Solution:

# modules/base_downloader.py
from abc import ABC, abstractmethod

class BaseDownloader(ABC):
    """Base class for all downloaders"""

    def __init__(self, unified_db, log_callback, show_progress):
        self.unified_db = unified_db
        self.log_callback = log_callback
        self.show_progress = show_progress

    def log(self, message: str, level: str = "info"):
        """Centralized logging"""
        if self.log_callback:
            self.log_callback(f"[{self.platform_name}] {message}", level)

    def is_downloaded(self, media_id: str) -> bool:
        return self.unified_db.is_downloaded(media_id, self.platform_name)

    @abstractmethod
    def download(self, username: str) -> int:
        """Implement in subclass"""
        pass

12. Inconsistent Logging

Severity: MEDIUM | Effort: 4-6 hours

Mix of print(), custom callbacks, and logging module.

Solution:

import logging
import json

class StructuredLogger:
    def __init__(self, name: str):
        self.logger = logging.getLogger(name)
        handler = logging.FileHandler('logs/media-downloader.log')
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log(self, message: str, level: str = "info", **extra):
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'level': level.upper(),
            'message': message,
            **extra
        }
        getattr(self.logger, level)(json.dumps(log_entry))

13. No Database Migration Strategy

Severity: MEDIUM | Effort: 4-6 hours

Schema changes via ad-hoc ALTER TABLE statements.

Solution: Implement Alembic or custom migration system.


14. Missing API Validation

Severity: MEDIUM | Effort: 3-4 hours

Some endpoints lack Pydantic models.

Solution: Add comprehensive request/response models.


15. No Tests

Severity: MEDIUM | Effort: 40-60 hours

Zero test coverage.

Solution: Implement pytest with unit, integration, and E2E tests.


Low Priority Issues (🟢 Nice to Have)

16. Frontend Re-render Optimization

Multiple independent polling timers. Consider WebSocket-only updates.

17. TypeScript Strict Mode Leverage

Enable additional strict checks.

18. API Response Caching

Add caching for expensive queries.

19. Database Indexes

Add indexes on frequently queried columns.

20. API Versioning

Implement /api/v1 prefix for future compatibility.


Strengths

Excellent Modular Architecture - Clear separation of concerns Comprehensive Database Design - WAL mode, connection pooling Modern Frontend Stack - TypeScript, React, TanStack Query Good Type Hints - Python type hints improve clarity Rate Limiting - Sophisticated anti-detection measures WebSocket Real-time - Live updates for better UX Feature Complete - Multi-platform support, deduplication, notifications


Implementation Priorities

Week 1 (Critical - 11-17 hours)

  • Remove secrets from version control
  • Fix SQL injection vulnerabilities
  • Add file path validation
  • Validate subprocess inputs

Month 1 (High Priority - 32-48 hours)

  • Refactor large files
  • Fix connection pool handling
  • Add CSRF protection

Quarter 1 (Medium Priority - 67-98 hours)

  • Replace TypeScript any types
  • Implement error handling strategy
  • Eliminate code duplication
  • Standardize logging
  • Add database migrations
  • Implement test suite

Ongoing (Low Priority - 15-23 hours)

  • Optimize frontend performance
  • Leverage TypeScript strict mode
  • Add API caching
  • Add database indexes
  • Implement API versioning

Metrics

Current State:

  • Code Quality Score: 6.5/10
  • Security Score: 4/10
  • Test Coverage: 0%
  • Technical Debt: HIGH

Target State (After Improvements):

  • Code Quality Score: 8.5/10
  • Security Score: 9/10
  • Test Coverage: 70%+
  • Technical Debt: LOW

Conclusion

The Media Downloader is a well-architected application that demonstrates solid engineering principles. However, critical security issues must be addressed immediately to prevent data breaches and system compromise.

With systematic implementation of these recommendations, this will evolve into a production-ready, enterprise-grade system with excellent security, maintainability, and performance.

Total Estimated Effort: 125-186 hours (3-4 months at 10-15 hrs/week)

Next Steps:

  1. Review and prioritize recommendations
  2. Create GitHub issues for each item
  3. Begin with Week 1 critical fixes
  4. Establish regular review cadence