rune-encoder/AI-Karaoke-Studio

AI Karaoke Studio: Create Karaoke Videos 10x Faster!

Overview

AI Karaoke Video Creator is a Gradio-based application that cuts the time needed to turn a standard song into a fully produced karaoke video. Manual workflows typically take 4-8 hours per song. By automating the pipeline with Demucs (Facebook AI) for stem separation, OpenAI Whisper for transcription, AcoustID and Genius for metadata and lyrics, and Gemini AI for correction, the app brings the process down to 5-15 minutes.

Key Highlights:

  • Stem Separation & Audio Processing: Automatic creation of an instrumental track from your uploaded audio.
  • AI Transcription: Converts vocals to timed lyrics using OpenAI Whisper.
  • Lyric Alignment & Correction: Fetches official lyrics from Genius or user input, then refines alignment via Gemini AI.
  • Subtitle & Video Generation: Highly customizable karaoke video output with dynamic subtitle styles and optional background effects.
  • Caching for Efficiency: Generates a unique hash-based cache folder for each audio track, speeding up reprocessing tasks.


Demo & Example Outputs

Check out the videos to get a sense of the final product quality!




Table of Contents

  1. Overview
  2. Demo & Example Outputs
  3. Features
  4. System Architecture
  5. Installation & Setup
  6. Installation & Setup with Docker
  7. How to Use
  8. Customization
  9. Caching Mechanism
  10. Benefits
  11. License


Features

  1. Audio Processing & Transcription

    • Demucs (Facebook AI) for automatic stem separation (vocals, bass, drums, etc.).
    • Merges stems (except vocals) to produce a karaoke-style instrumental.
    • OpenAI Whisper for vocal transcription with word-level timestamps.
  2. Metadata & Lyrics Retrieval

    • AcoustID to identify audio fingerprint and retrieve song metadata (artist, title).
    • Genius API to auto-fetch official song lyrics.
    • Manual Input option for lyrics when metadata is incomplete or for custom songs.
  3. Lyric Correction & Alignment

    • Gemini AI to align and correct transcription using official or user-provided lyrics.
    • Handles spelling errors, missing words, verse alignment, etc.
  4. Karaoke Video Generation

    • Generate .ass subtitle files with user-defined font, color, highlights, shadows, and outlines.
    • Seamlessly loop background video effects for a visually appealing background.
    • Final output as a single high-quality karaoke video (customizable resolution, bitrate, FPS).
  5. Caching for Fast Iterations

    • Creates a unique hash-based directory for each audio file.
    • Allows partial reprocessing only for sections you choose to override (metadata fetch, AI transcription, etc.).
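
The partial reprocessing described above can be pictured as a "compute once, reuse unless forced" helper. This is a minimal sketch under assumptions, not the app's actual code: `run_step`, the JSON file layout, and the `force` flag are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_step(cache_dir, name, compute, force=False):
    """Return the cached result for `name`, recomputing only when needed."""
    result_file = Path(cache_dir) / f"{name}.json"
    if result_file.exists() and not force:
        # A previous run already produced this result, so skip the expensive work.
        return json.loads(result_file.read_text(encoding="utf-8"))
    result = compute()
    result_file.parent.mkdir(parents=True, exist_ok=True)
    result_file.write_text(json.dumps(result), encoding="utf-8")
    return result


# Demo with a stand-in for an expensive task (hypothetical, no real AI call).
calls = []


def fake_transcribe():
    calls.append("ran")
    return {"text": "la la la"}


with tempfile.TemporaryDirectory() as tmp:
    run_step(tmp, "transcription", fake_transcribe)              # computed
    run_step(tmp, "transcription", fake_transcribe)              # served from cache
    run_step(tmp, "transcription", fake_transcribe, force=True)  # forced re-run
```

In the demo, the stand-in task runs twice: once on first use and once when the override is set.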


System Architecture

Below is a high-level overview of the application’s workflow:

  1. User Uploads an Audio File
  2. AcoustID: Generate audio fingerprint → Retrieve song metadata
  3. Demucs: Separate audio stems (vocals, instruments) → Merge instrument stems to create instrumental
  4. Whisper: Transcribe vocals (with timestamps)
  5. Genius / Manual Input: Fetch or provide reference lyrics
  6. Gemini AI: Align & correct transcribed lyrics with reference lyrics
  7. Subtitle & Video Generation: Create .ass subtitles → Loop selected video effect → Render final karaoke video

All of this is orchestrated within a Gradio interface. Once you launch app.py, it provides a local URL that you can open in your browser to interact with these steps visually.



Installation & Setup

Step 1: Install Conda

Windows
  1. Download and install Anaconda or Miniconda.
  2. During installation, ensure conda is added to your system PATH.
    • Example: C:\Users\<your_username>\Anaconda3\Scripts
  3. Check successful installation:
    conda --version
Linux/macOS
  1. Follow the official Conda installation guide.
  2. Check successful installation:
    conda --version


Step 2: Set Up API Keys

The app requires API keys for fetching metadata, lyrics, and AI-based modifications.

  1. AcoustID API Key - Fetches metadata (artist, song name, etc.).
  2. Genius API Key - Fetches song lyrics.
  3. Gemini API Key - AI-powered lyric modification and alignment.

Create a .env file at the root of the project with the following keys (replace placeholders with your actual tokens):

ACOUST_ID="your_acoustid_api_key"
GENIUS_API_ACCESS_TOKEN="your_genius_api_key"
GEMINI_API_KEY="your_gemini_api_key"
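
To check a .env file before launching, a small script along these lines can report which of the three keys are missing. It is only a sketch: the hand-rolled parser below is an assumption made to keep the example dependency-free and is not how the app itself loads the file; `parse_env_text` and `missing_keys` are hypothetical helper names.

```python
EXPECTED_KEYS = ("ACOUST_ID", "GENIUS_API_ACCESS_TOKEN", "GEMINI_API_KEY")


def parse_env_text(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Accept both quoted and unquoted values.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not values.get(key)]


sample = 'ACOUST_ID="abc"\nGEMINI_API_KEY=xyz\n'
report = missing_keys(parse_env_text(sample))
```

For a real file, pass `Path(".env").read_text()` (from `pathlib`) instead of the `sample` string.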


Step 3: Install FFmpeg and Chromaprint

  1. FFmpeg - Required for audio/video processing.
  2. Chromaprint (fpcalc) - Required to generate audio fingerprints.
Windows
  1. Download and extract both FFmpeg and Chromaprint.
  2. Add their bin directories to the system PATH, for example:
    C:\Users\<your_username>\ffmpeg\bin
    C:\Users\<your_username>\chromaprint-fpcalc
  3. Verify successful installation and setup:
    ffmpeg -version
    fpcalc -version
Linux/macOS
  1. Install via your package manager (e.g., apt-get install ffmpeg libchromaprint-tools on Debian/Ubuntu, or brew install ffmpeg chromaprint on macOS) or follow official documentation.
  2. Verify successful installation and setup:
    ffmpeg -version
    fpcalc -version
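
The same PATH check can be done from Python before starting the app. This is a small sketch using only the standard library; `find_missing_tools` is a hypothetical helper name, not part of the project.

```python
import shutil


def find_missing_tools(tools=("ffmpeg", "fpcalc")):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
```

An empty result means both executables are reachable from the environment that will run the app.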


Optional GPU Acceleration (NVIDIA Only)

If you have an NVIDIA GPU, installing CUDA + cuDNN can significantly speed up AI processes (Demucs, Whisper, etc.).

Windows
  1. Download and install the CUDA Toolkit and cuDNN.
  2. Add their directories to the PATH, e.g.:
    C:\Program Files\NVIDIA\CUDNN\<version_number>\bin
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version_number>\bin
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version_number>\libnvvp
  3. Verify installation by running:
    nvcc --version
Linux/macOS
  1. Refer to NVIDIA’s official documentation for your platform.
  2. Verify installation by running:
    nvcc --version


Step 4: Install Dependencies

After cloning or downloading this repo, from your terminal run:

Windows
setup.bat
Linux/macOS
chmod +x setup.sh
./setup.sh

This will:

  1. Create a Conda environment named karaoke_env.
  2. Install all necessary Python libraries (Gradio, OpenAI Whisper, Demucs, etc.).


Step 5: Running the App

conda activate karaoke_env
python app.py

A local Gradio link will appear in your terminal. Open it in your browser to use the app.



Installation & Setup with Docker

Step 1: Install Docker Desktop

Windows
  1. Follow the official WSL 2 installation guide.
  2. Install Ubuntu into WSL (the command below installs Ubuntu 24.04):
    wsl --install -d Ubuntu-24.04
  3. Download and install Docker Desktop.
  4. Check successful installation:
    docker run --rm hello-world
    The output should include "Hello from Docker!".
Linux
  1. Follow the official Docker Desktop installation guide.
  2. Check successful installation:
    docker run --rm hello-world
    The output should include "Hello from Docker!".
macOS
  1. Follow the official Docker Desktop installation guide.
  2. Check successful installation:
    docker run --rm hello-world
    The output should include "Hello from Docker!".


Step 2: Build the Docker Image

Make sure you have 120 GB of free disk space.
Even on a modern computer, the build takes more than 30 minutes.

docker compose build


Step 3: Set Up API Keys

The app requires API keys for fetching metadata, lyrics, and AI-based modifications.

  1. AcoustID API Key - (optional) Fetches metadata (artist, song name, etc.).
  2. Genius API Key - (optional) Fetches song lyrics.
  3. Gemini API Key - (required) AI-powered lyric modification and alignment.

Create a .env file at the root of the project with the following keys (replace placeholders with your actual tokens):

ACOUST_ID=your_acoustid_api_key
GENIUS_API_ACCESS_TOKEN=your_genius_api_key
GEMINI_API_KEY=your_gemini_api_key


Step 4: Running the App

docker compose up --no-build

A local Gradio link will appear in your terminal. Open it in your browser to use the app.



How to Use

Step 1: Audio Processing & Transcription

  1. Upload Audio: Provide an .mp3, .wav, or other valid audio file.
  2. Process Audio: The app will:
    • Identify metadata (song name, artist, etc.) via AcoustID.
    • Separate stems with Demucs.
    • Merge stems (except vocals) to form your instrumental track.
    • Transcribe vocals using Whisper (timestamps included).
  3. Advanced Settings (Optional): Adjust transcription accuracy, re-run processes, set specific languages if auto-detect fails, etc.
  4. Click Process Audio to proceed.

Pro Tip: If you see any mismatched data or want to refine any step, open the “Developer Settings” accordion and force specific tasks to re-run.
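
Merging the non-vocal stems into an instrumental amounts to summing their samples. The sketch below illustrates the idea with plain Python lists so it stays dependency-free; a real pipeline would work on audio arrays. The stem names and the `mix_instrumental` helper are assumptions for illustration, not the app's code.

```python
def mix_instrumental(stems, exclude=("vocals",)):
    """Sum every stem except the excluded ones, sample by sample."""
    kept = [samples for name, samples in stems.items() if name not in exclude]
    if not kept:
        return []
    length = min(len(samples) for samples in kept)
    mixed = []
    for i in range(length):
        total = sum(samples[i] for samples in kept)
        # Clip so the summed signal stays inside the valid [-1.0, 1.0] range.
        mixed.append(max(-1.0, min(1.0, total)))
    return mixed


# Two-sample toy stems (hypothetical values).
stems = {
    "vocals": [0.9, 0.9],
    "drums": [0.5, -0.25],
    "bass": [0.25, -0.25],
    "other": [0.5, 0.0],
}
instrumental = mix_instrumental(stems)
```

The vocals are dropped entirely, and the first sample is clipped because the remaining stems sum to more than 1.0.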




Step 2: Lyric Correction & Alignment

  1. Review Artist/Song Name: Edit if the auto-detected metadata is incorrect.
  2. Fetch Lyrics: Click Fetch Reference Lyrics to grab official lyrics from Genius. Alternatively, paste your own text and click Update Reference Lyrics.
  3. Modify with AI: Once you have both the raw transcription and reference lyrics, press Modify with AI to refine and align timestamps.

This ensures spelling, repeated words, and verse alignment are corrected using Gemini AI.
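
To make "correction with reference lyrics" concrete, here is a deliberately simple stand-in that uses Python's `difflib` instead of Gemini AI. It only swaps misheard words for the reference spelling where both sides line up one to one, and it keeps the original timestamps. The `correct_words` helper and the `(word, start, end)` tuple layout are assumptions for illustration; the app's AI step also handles missing words and verse alignment, which this sketch does not.

```python
from difflib import SequenceMatcher


def correct_words(transcribed, reference_words):
    """Replace misheard words with reference spellings, keeping timestamps.

    transcribed: list of (word, start_seconds, end_seconds)
    reference_words: list of words from the reference lyrics
    """
    heard = [word.lower() for word, _, _ in transcribed]
    wanted = [word.lower() for word in reference_words]
    corrected = list(transcribed)
    matcher = SequenceMatcher(a=heard, b=wanted, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        # Only handle spans where both sides have the same number of words.
        if tag in ("equal", "replace") and (a1 - a0) == (b1 - b0):
            for offset in range(a1 - a0):
                _, start, end = transcribed[a0 + offset]
                corrected[a0 + offset] = (reference_words[b0 + offset], start, end)
    return corrected


transcribed = [("twinkle", 0.0, 0.5), ("twinkle", 0.5, 1.0),
               ("little", 1.0, 1.4), ("stare", 1.4, 2.0)]
reference = ["Twinkle", "twinkle", "little", "star"]
fixed = correct_words(transcribed, reference)
```

Here the misheard "stare" becomes "star" while its 1.4 to 2.0 second timing is preserved.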




Step 3: Karaoke Video Generation

  1. Subtitle Style: Choose font, color, highlight, outline, and shadow settings.
    The app looks for font files in its fonts/ folder. To add a font, copy it into that folder and restart the app.
  2. Background Effects: Optional looping .mp4 files can be selected for a dynamic background.
  3. Advanced Video Settings: Set resolution (720p, 1080p), FPS, bitrate, etc., based on your quality needs.
  4. Generate Karaoke: Click the button to produce your final video.
  5. Output: Video is saved in the output folder. If re-generated, it overwrites the existing file.

Experiment with fonts and color combos to achieve a professional karaoke style or something playful and unique!
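
For readers curious what the generated subtitles look like, the sketch below builds one .ass Dialogue line from word timestamps using the format's `\k` karaoke tag, whose duration is given in centiseconds. Whether the app emits `\k` or another karaoke tag variant is not stated here, so treat the tag choice, the `Default` style name, and both helper names as assumptions.

```python
def ass_time(seconds):
    """Format seconds as the H:MM:SS.cc timestamp used in .ass files."""
    centis = round(seconds * 100)
    hours, centis = divmod(centis, 360000)
    minutes, centis = divmod(centis, 6000)
    secs, centis = divmod(centis, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{centis:02d}"


def karaoke_dialogue(words, style="Default"):
    """Build one Dialogue line with a karaoke tag per word.

    words: non-empty list of (text, start_seconds, end_seconds)
    """
    line_start = words[0][1]
    line_end = words[-1][2]
    parts = []
    cursor = line_start
    for text, start, end in words:
        gap = round((start - cursor) * 100)
        if gap > 0:
            parts.append(f"{{\\k{gap}}}")  # silent pause before the word
        parts.append(f"{{\\k{round((end - start) * 100)}}}{text} ")
        cursor = end
    body = "".join(parts).rstrip()
    return f"Dialogue: 0,{ass_time(line_start)},{ass_time(line_end)},{style},,0,0,0,,{body}"


line = karaoke_dialogue([("Twinkle", 1.0, 1.5), ("twinkle", 1.5, 2.0),
                         ("little", 2.25, 2.75), ("star", 2.75, 3.5)])
```

The empty `{\k25}` tag in the result covers the quarter-second pause before "little", so the highlight stays in sync with the audio.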




Customization

  1. Background Effects

    • Place any .mp4 file in the effects folder; it appears automatically in the Gradio dropdown.
    • The video is looped to match your song’s duration.
  2. Subtitle .ass Files

    • The app automatically creates an advanced subtitle file with your chosen styling (font, size, colors, etc.).
    • You can tweak the .ass file further if you want extremely fine-grained control (e.g., line spacing).
  3. Developer Settings

    • Access advanced toggles in each section to re-run certain stages (metadata fetching, stem separation, AI alignment).
    • Great for iterative improvements or debugging.
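
Looping a short background clip to cover a whole song can be sketched as follows. The first helper computes how many passes of the clip are needed; the second builds an FFmpeg argument list that repeats the clip with `-stream_loop -1` and cuts the output at the song's length with `-t`. This is an illustration under assumptions, not the command the app actually runs; the helper names and file paths are hypothetical.

```python
import math


def loops_needed(song_seconds, clip_seconds):
    """Number of times the clip must play to cover the song."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return max(1, math.ceil(song_seconds / clip_seconds))


def build_loop_command(effect_path, audio_path, song_seconds, output_path):
    """Build an ffmpeg argument list that loops the clip under the audio."""
    return [
        "ffmpeg", "-y",
        "-stream_loop", "-1", "-i", effect_path,  # repeat the clip indefinitely
        "-i", audio_path,
        "-map", "0:v", "-map", "1:a",             # video from clip, audio from song
        "-t", f"{song_seconds:.3f}",              # stop at the song's length
        output_path,
    ]


passes = loops_needed(215, 30)
command = build_loop_command("effects/stars.mp4", "instrumental.wav", 215, "output/looped.mp4")
```

The list can be passed to `subprocess.run(command, check=True)` once the paths point at real files.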


Caching Mechanism

When you upload a new song, the app:

  1. Generates a Hash of the audio file.
  2. Creates a Cache Directory at cache/<unique_hash> for storing processed data such as separated stems and transcribed lyrics.
  3. Speeds Up Reprocessing if you choose to revisit or re-generate any part of the same audio file.

This design ensures you don’t waste time repeatedly re-running expensive AI tasks.
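
The hash-to-folder mapping can be sketched like this. The README does not say which hash function the app uses, so SHA-256 is an assumption, as are the helper names `file_hash` and `cache_dir_for`.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large audio files are not loaded at once."""
    digest = hashlib.sha256()  # assumption: the app's actual hash may differ
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_dir_for(audio_path, cache_root="cache"):
    """Return (and create) the cache folder derived from the file's content."""
    directory = Path(cache_root) / file_hash(audio_path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Demo with placeholder bytes instead of a real song.
with tempfile.TemporaryDirectory() as tmp:
    song = Path(tmp) / "song.mp3"
    song.write_bytes(b"not real audio, just demo bytes")
    first = cache_dir_for(song, cache_root=Path(tmp) / "cache")
    second = cache_dir_for(song, cache_root=Path(tmp) / "cache")
    same_folder = first == second
```

Because the folder name depends only on the file's bytes, re-uploading the same audio under a different filename would land in the same folder in this sketch.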



Benefits

  • Time Savings: Cut down from 4-8 hours of manual editing to just 5-15 minutes.
  • High-Quality Output: Syncs lyrics with precise timing and offers advanced customization.
  • AI-Powered: Uses current AI models for stem separation and transcription to improve accuracy.
  • Flexible & Extensible: Gradio-based UI, easy to integrate, and modifiable for various use cases.


License

This project is licensed under the Apache License. See LICENSE for details.




Thank you for checking out the AI Karaoke Video Creator. Enjoy making awesome karaoke videos with a fraction of the usual effort!