AI Karaoke Studio: Create Karaoke Videos 10x Faster!

Overview

AI Karaoke Video Creator is a Gradio-based application that dramatically reduces the time needed to transform a standard song into a fully produced karaoke video. Typical manual workflows can take 4-8 hours per song. With AI-driven automation (Demucs from Facebook AI for stem separation, OpenAI Whisper for transcription, AcoustID + Genius for metadata/lyrics, and Gemini AI for correction), this app brings the process down to 5-15 minutes.

Key Highlights:

Stem Separation & Audio Processing: Automatic creation of an instrumental track from your uploaded audio.
AI Transcription: Converts vocals to timed lyrics using OpenAI Whisper.
Lyric Alignment & Correction: Fetches official lyrics from Genius or user input, then refines alignment via Gemini AI.
Subtitle & Video Generation: Highly customizable karaoke video output with dynamic subtitle styles and optional background effects.
Caching for Efficiency: Generates a unique hash-based cache folder for each audio track, speeding up reprocessing tasks.

Demo & Example Outputs

Demo Walkthrough: YouTube Walkthrough Link
Sample Video 1: [ Karaoke Version ] I Still Remember - Blackmore's Night
Sample Video 2: [ Karaoke Version ] The Boy Who Wouldn't Hoe Corn - Alison Krauss & Union Station
Sample Video 3: [ Karaoke Version ] Dear God - Avenged Sevenfold

Check out the videos to get a sense of the final product quality!

Features

Audio Processing & Transcription
- Demucs (Facebook AI) for automatic stem separation (vocals, bass, drums, etc.).
- Merges stems (except vocals) to produce a karaoke-style instrumental.
- OpenAI Whisper for vocal transcription with word-level timestamps.
Metadata & Lyrics Retrieval
- AcoustID to identify audio fingerprint and retrieve song metadata (artist, title).
- Genius API to auto-fetch official song lyrics.
- Manual Input option for lyrics when metadata is incomplete or for custom songs.
Lyric Correction & Alignment
- Gemini AI to align and correct transcription using official or user-provided lyrics.
- Handles spelling errors, missing words, verse alignment, etc.
Karaoke Video Generation
- Generate .ass subtitle files with user-defined font, color, highlights, shadows, and outlines.
- Seamlessly loop background video effects for a visually appealing background.
- Final output as a single high-quality karaoke video (customizable resolution, bitrate, FPS).
Caching for Fast Iterations
- Creates a unique hash-based directory for each audio file.
- Allows partial reprocessing only for sections you choose to override (metadata fetch, AI transcription, etc.).
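The partial reprocessing described above can be pictured as a small "use the cache unless forced" helper. This is only a sketch under stated assumptions: the function name `run_stage`, the one-JSON-file-per-stage layout, and the `force` flag are hypothetical and are not taken from the app's code.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_stage(cache_dir: Path, name: str, compute: Callable[[], Any], force: bool = False) -> Any:
    """Return a stage's cached result, recomputing only when it is missing or forced."""
    cache_file = cache_dir / f"{name}.json"  # hypothetical layout: one JSON file per stage
    if cache_file.exists() and not force:
        # Cache hit: skip the expensive work entirely.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Calling `run_stage(cache_dir, "metadata", fetch_metadata, force=True)` would mimic ticking an override for the metadata fetch only, while every other stage keeps reading from the cache.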

System Architecture

Below is a high-level overview of the application’s workflow:

User Uploads an Audio File
AcoustID: Generate audio fingerprint → Retrieve song metadata
Demucs: Separate audio stems (vocals, instruments) → Merge instrument stems to create instrumental
Whisper: Transcribe vocals (with timestamps)
Genius / Manual Input: Fetch or provide reference lyrics
Gemini AI: Align & correct transcribed lyrics with reference lyrics
Subtitle & Video Generation: Create .ass subtitles → Loop selected video effect → Render final karaoke video

All of this is orchestrated within a Gradio interface. Once you launch app.py, it provides a local URL that you can open in your browser to interact with these steps visually.
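To make the Whisper step of the workflow more concrete, the sketch below flattens a transcription result into a plain list of timed words that later stages could consume. It assumes the dictionary shape that the openai-whisper package returns when word timestamps are enabled (segments containing `words` entries with `word`, `start`, and `end`); the helper name `flatten_words` and the output keys are illustrative, not the app's own.

```python
from typing import Dict, List


def flatten_words(result: Dict) -> List[Dict]:
    """Collect every timed word from a Whisper-style result into one flat list."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "text": w["word"].strip(),  # Whisper words usually carry a leading space
                "start": float(w["start"]),
                "end": float(w["end"]),
            })
    return words


# Hand-written sample in the assumed shape (not real model output).
sample = {
    "segments": [
        {"words": [
            {"word": " Dear", "start": 1.0, "end": 1.5},
            {"word": " God", "start": 1.5, "end": 2.25},
        ]}
    ]
}
timed_words = flatten_words(sample)
```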

Installation & Setup

Step 1: Install Conda

Windows

Download and install Anaconda or Miniconda.
During installation, ensure conda is added to your system PATH.
- Example: C:\Users\<your_username>\Anaconda3\Scripts
Check successful installation:
```
conda --version
```

Linux/macOS

Follow the official Conda installation guide.
Check successful installation:
```
conda --version
```

Step 2: Set Up API Keys

The app requires API keys for fetching metadata, lyrics, and AI-based modifications.

AcoustID API Key - Fetches metadata (artist, song name, etc.).
Genius API Key - Fetches song lyrics.
Gemini API Key - AI-powered lyric modification and alignment.

Create a .env file at the root of the project with the following keys (replace placeholders with your actual tokens):

```
ACOUST_ID="your_acoustid_api_key"
GENIUS_API_ACCESS_TOKEN="your_genius_api_key"
GEMINI_API_KEY="your_gemini_api_key"
```

Step 3: Install FFmpeg and Chromaprint

FFmpeg - Required for audio/video processing.
Chromaprint (fpcalc) - Required to generate audio fingerprints.

Windows

Download and extract both FFmpeg and Chromaprint.

Add their bin directories to the system PATH, for example:

C:\Users\<your_username>\ffmpeg\bin
C:\Users\<your_username>\chromaprint-fpcalc

Verify successful installation and setup:
```
ffmpeg -version
fpcalc -version
```

Linux/macOS

Install via your package manager (e.g., apt-get install ffmpeg libchromaprint-tools on Debian/Ubuntu, or brew install ffmpeg chromaprint on macOS) or follow official documentation.
Verify successful installation and setup:
```
ffmpeg -version
fpcalc -version
```
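If you prefer to check both tools from Python (for example, before launching the app), a small script using the standard library's `shutil.which` is enough. The function name `find_tools` is just an illustration.

```python
import shutil
from typing import Dict, Optional, Tuple


def find_tools(names: Tuple[str, ...] = ("ffmpeg", "fpcalc")) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A `None` entry means the corresponding directory still needs to be added to PATH (or the terminal needs to be restarted after changing it).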

Optional GPU Acceleration (NVIDIA Only)

If you have an NVIDIA GPU, installing CUDA + cuDNN can significantly speed up AI processes (Demucs, Whisper, etc.).

Windows

Download and install:
- cuDNN
- CUDA Toolkit

Add their directories to the PATH, e.g.:

C:\Program Files\NVIDIA\CUDNN\<version_number>\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version_number>\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version_number>\libnvvp

Verify installation by running:
```
nvcc --version
```

Linux/macOS

Refer to NVIDIA’s official documentation for your platform.
Verify installation by running:
```
nvcc --version
```

Step 4: Install Dependencies

After cloning or downloading this repo, from your terminal run:

Windows

```
setup.bat
```

Linux/macOS

```
chmod +x setup.sh
./setup.sh
```

This will:

Create a Conda environment named karaoke_env.
Install all necessary Python libraries (Gradio, OpenAI Whisper, Demucs, etc.).

Step 5: Running the App

```
conda activate karaoke_env
python app.py
```

A local Gradio link will appear in your terminal. Open it in your browser to use the app.

Installation & Setup with Docker

Step 1: Install Docker Desktop

Windows

Follow the official WSL 2 installation guide.
Install Ubuntu 24.04 into WSL:
```
wsl --install -d Ubuntu-24.04
```
Download and install Docker Desktop.
Check successful installation:
```
docker run --rm hello-world
```
The output should include "Hello from Docker!".

Linux

Follow the official Docker Desktop installation guide.
Check successful installation:
```
docker run --rm hello-world
```
The output should include "Hello from Docker!".

macOS

Follow the official Docker Desktop installation guide.
Check successful installation:
```
docker run --rm hello-world
```
The output should include "Hello from Docker!".

Step 2: Build the Docker Image

Make sure you have 120 GB of free disk space.
Even on a modern computer, the build can take more than 30 minutes.

```
docker compose build
```

Step 3: Set Up API Keys

The app requires API keys for fetching metadata, lyrics, and AI-based modifications.

AcoustID API Key - (optional) Fetches metadata (artist, song name, etc.).
Genius API Key - (optional) Fetches song lyrics.
Gemini API Key - (required) AI-powered lyric modification and alignment.

Create a .env file at the root of the project with the following keys (replace placeholders with your actual tokens):

```
ACOUST_ID=your_acoustid_api_key
GENIUS_API_ACCESS_TOKEN=your_genius_api_key
GEMINI_API_KEY=your_gemini_api_key
```
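A quick sanity check for these keys might look like the sketch below. It uses the key names and the required/optional split listed above, and it assumes the variables are already present in the process environment (for instance, loaded from .env by Docker Compose or a dotenv loader); how the app itself reads them is not shown here, and `check_api_keys` is a hypothetical helper.

```python
import os
from typing import List, Mapping, Tuple

REQUIRED_KEYS = ("GEMINI_API_KEY",)
OPTIONAL_KEYS = ("ACOUST_ID", "GENIUS_API_ACCESS_TOKEN")


def check_api_keys(env: Mapping[str, str] = os.environ) -> Tuple[List[str], List[str]]:
    """Return (missing required keys, missing optional keys); blank values count as missing."""
    missing_required = [k for k in REQUIRED_KEYS if not env.get(k, "").strip()]
    missing_optional = [k for k in OPTIONAL_KEYS if not env.get(k, "").strip()]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_api_keys()
    if required:
        print("Missing required keys:", ", ".join(required))
    if optional:
        print("Missing optional keys (related features may be unavailable):", ", ".join(optional))
```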

Step 4: Running the App

```
docker compose up --no-build
```

A local Gradio link will appear in your terminal. Open it in your browser to use the app.

How to Use

Step 1: Audio Processing & Transcription

Upload Audio: Provide an .mp3, .wav, or other valid audio file.
Process Audio: The app will:
- Identify metadata (song name, artist, etc.) via AcoustID.
- Separate stems with Demucs.
- Merge stems (except vocals) to form your instrumental track.
- Transcribe vocals using Whisper (timestamps included).
Advanced Settings (Optional): Adjust transcription accuracy, re-run processes, set specific languages if auto-detect fails, etc.
Click Process Audio to proceed.
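The "merge stems (except vocals)" step can be understood as a sample-wise sum of the remaining stems. The sketch below works on plain Python lists of float samples to stay dependency-free; the app presumably operates on real audio arrays, so treat the function name `merge_instrumental`, the stem names, and the clamping choice as assumptions for illustration.

```python
from typing import Dict, List


def merge_instrumental(stems: Dict[str, List[float]], exclude: str = "vocals") -> List[float]:
    """Sum every stem except `exclude`, sample by sample."""
    kept = [samples for name, samples in stems.items() if name != exclude]
    if not kept:
        return []
    length = min(len(s) for s in kept)  # guard against stems of unequal length
    mixed = [sum(s[i] for s in kept) for i in range(length)]
    # Clamp to the valid [-1.0, 1.0] float range; a real mixer might normalize instead.
    return [max(-1.0, min(1.0, x)) for x in mixed]


stems = {
    "vocals": [0.9, 0.9],
    "drums": [0.25, -0.5],
    "bass": [0.25, -0.75],
    "other": [0.0, 0.0],
}
instrumental = merge_instrumental(stems)
```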

Pro Tip: If you see any mismatched data or want to refine any step, open the “Developer Settings” accordion and force specific tasks to re-run.

Step 2: Lyric Correction & Alignment

Review Artist/Song Name: Edit if the auto-detected metadata is incorrect.
Fetch Lyrics: Click Fetch Reference Lyrics to grab official lyrics from Genius. Alternatively, paste your own text and click Update Reference Lyrics.
Modify with AI: Once you have both the raw transcription and reference lyrics, press Modify with AI to refine and align timestamps.

This step uses Gemini AI to correct spelling, repeated words, and verse alignment.
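To illustrate what "aligning transcription with reference lyrics" means, here is a deliberately simple stand-in that uses Python's `difflib` rather than Gemini. It is not the app's method: it only shows the idea of keeping Whisper's timestamps while swapping in the reference spelling wherever the two word sequences line up. The names `correct_words` and `_norm` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def _norm(word: str) -> str:
    """Lowercase and drop punctuation so 'God,' and 'god' compare equal."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def correct_words(transcribed: List[Dict], reference: List[str]) -> List[Dict]:
    """Replace misheard words with reference words while keeping the original timing."""
    heard = [_norm(w["text"]) for w in transcribed]
    wanted = [_norm(w) for w in reference]
    matcher = SequenceMatcher(a=heard, b=wanted, autojunk=False)
    out: List[Dict] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and (i2 - i1) == (j2 - j1)):
            # Same number of words on both sides: map them one-to-one.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                out.append({**transcribed[i], "text": reference[j]})
        else:
            # Uneven spans: keep what was heard, since unmatched
            # reference words have no timing to borrow.
            out.extend(transcribed[i1:i2])
    return out


heard_words = [
    {"text": "I", "start": 0.0, "end": 0.2},
    {"text": "still", "start": 0.2, "end": 0.5},
    {"text": "remembre", "start": 0.5, "end": 1.0},
]
fixed = correct_words(heard_words, ["I", "still", "remember"])
```

A language model can go further than this (for example, handling missing words and re-timing whole verses), which is the part the app delegates to Gemini.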

Step 3: Karaoke Video Generation

Subtitle Style: Choose font, color, highlight, outline, and shadow settings.
The app looks for font files in the fonts/ folder inside the app directory. Copy any fonts you need into that folder and restart the app.
Background Effects: Optional looping .mp4 files can be selected for a dynamic background.
Advanced Video Settings: Set resolution (720p, 1080p), FPS, bitrate, etc., based on your quality needs.
Generate Karaoke: Click the button to produce your final video.
Output: Video is saved in the output folder. If re-generated, it overwrites the existing file.

Experiment with fonts and color combos to achieve a professional karaoke style or something playful and unique!
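For readers curious about what a karaoke subtitle line looks like, the sketch below turns timed words into one ASS `Dialogue` event using `\k` tags, whose durations are expressed in centiseconds. The helper names and the `Default` style name are assumptions; the styling the app actually writes (fonts, colors, outlines) lives in the file's style section and is not reproduced here.

```python
from typing import Dict, List


def ass_time(seconds: float) -> str:
    """Format seconds as the H:MM:SS.cc timestamp used in ASS files."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"


def karaoke_dialogue(words: List[Dict], style: str = "Default") -> str:
    """Build one Dialogue event whose words highlight in sequence via \\k tags."""
    start, end = words[0]["start"], words[-1]["end"]
    cursor_cs = int(round(start * 100))
    parts = []
    for w in words:
        end_cs = int(round(w["end"] * 100))
        # Each \k duration runs from the previous word's end to this word's end,
        # so any small gap before a word is absorbed and timing does not drift.
        parts.append(f"{{\\k{end_cs - cursor_cs}}}{w['text']}")
        cursor_cs = end_cs
    text = " ".join(parts)
    return f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{text}"


line = karaoke_dialogue([
    {"text": "Dear", "start": 1.0, "end": 1.5},
    {"text": "God", "start": 1.5, "end": 2.25},
])
```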

Customization

Background Effects
- Place any .mp4 file in the effects folder; it appears automatically in the Gradio dropdown.
- The video is looped to match your song’s duration.
Subtitle .ass Files
- The app automatically creates an advanced subtitle file with your chosen styling (font, size, colors, etc.).
- You can tweak the .ass file further if you want extremely fine-grained control (e.g., line spacing).
Developer Settings
- Access advanced toggles in each section to re-run certain stages (metadata fetching, stem separation, AI alignment).
- Great for iterative improvements or debugging.
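Looping a background clip to the song's length can be done with FFmpeg's `-stream_loop -1` input option combined with an output duration limit. The sketch below only builds the command as a list; the function name, file names, and the choice to leave codecs at FFmpeg's defaults are assumptions, and the app's real FFmpeg invocation may differ.

```python
from typing import List


def build_loop_command(effect: str, audio: str, duration: float, output: str) -> List[str]:
    """Build an FFmpeg command that loops `effect` under `audio` for `duration` seconds."""
    return [
        "ffmpeg", "-y",
        "-stream_loop", "-1", "-i", effect,  # repeat the background clip indefinitely
        "-i", audio,
        "-map", "0:v:0", "-map", "1:a:0",    # video from the clip, audio from the song
        "-t", f"{duration:.3f}",             # stop once the song's length is reached
        output,
    ]


cmd = build_loop_command("effects/example.mp4", "instrumental.wav", 215.4, "output/karaoke.mp4")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once FFmpeg is on PATH.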

Caching Mechanism

When you upload a new song, the app:

Generates a Hash of the audio file.
Creates a Cache Directory inside cache/<unique_hash> for storing processed data, such as separated stems, transcribed lyrics, and more.
Speeds Up Reprocessing if you choose to revisit or re-generate any part of the same audio file.

This design ensures you don’t waste time repeatedly re-running expensive AI tasks.
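A content-hash cache directory of this kind could be derived as sketched below. The README does not say which hash function the app uses, so SHA-256 is an assumption here, as are the helper name `cache_dir_for` and the default `cache` root.

```python
import hashlib
from pathlib import Path
from typing import Union


def cache_dir_for(audio_path: Union[str, Path], cache_root: Union[str, Path] = "cache") -> Path:
    """Hash the file's bytes and return (creating if needed) cache_root/<hash>."""
    digest = hashlib.sha256()  # assumption: the app's actual hash function is not documented
    with open(audio_path, "rb") as f:
        # Read in 1 MiB chunks so large audio files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    directory = Path(cache_root) / digest.hexdigest()
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Because the directory name depends only on the file's contents, re-uploading the same song under a different filename would still land in the same cache folder.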

Benefits

Time Savings: Cut down from 4-8 hours of manual editing to just 5-15 minutes.
High-Quality Output: Syncs lyrics with precise timing and offers advanced customization.
AI-Powered: Capitalizes on cutting-edge models for stem separation and transcription, ensuring accuracy.
Flexible & Extensible: Gradio-based UI, easy to integrate, and modifiable for various use cases.

License

This project is licensed under the Apache License. See LICENSE for details.

Thank you for checking out the AI Karaoke Video Creator. Enjoy making awesome karaoke videos with a fraction of the usual effort!
