AI Karaoke Video Creator is a Gradio-based application that dramatically reduces the time needed to transform a standard song into a fully produced karaoke video. Typical manual workflows can take 4-8 hours per song. With AI-driven automation—Demucs (Facebook AI) for stem separation, OpenAI Whisper for transcription, AcoustID + Genius for metadata/lyrics, and Gemini AI for correction—this app brings the process down to 5-15 minutes.
Key Highlights:
- Stem Separation & Audio Processing: Automatic creation of an instrumental track from your uploaded audio.
- AI Transcription: Converts vocals to timed lyrics using OpenAI Whisper.
- Lyric Alignment & Correction: Fetches official lyrics from Genius or user input, then refines alignment via Gemini AI.
- Subtitle & Video Generation: Highly customizable karaoke video output with dynamic subtitle styles and optional background effects.
- Caching for Efficiency: Generates a unique hash-based cache folder for each audio track, speeding up reprocessing tasks.
- Demo Walkthrough: YouTube Walkthrough Link
- Sample Video 1: [ Karaoke Version ] I Still Remember - Blackmore's Night
- Sample Video 2: [ Karaoke Version ] The Boy Who Wouldn't Hoe Corn - Alison Krauss & Union Station
- Sample Video 3: [ Karaoke Version ] Dear God - Avenged Sevenfold
Check out the videos to get a sense of the final product quality!
- Overview
- Demo & Example Outputs
- Features
- System Architecture
- Installation & Setup
- Installation & Setup with Docker
- How to Use
- Customization
- Caching Mechanism
- Benefits
- License
Audio Processing & Transcription
- Demucs (Facebook AI) for automatic stem separation (vocals, bass, drums, etc.).
- Merges stems (except vocals) to produce a karaoke-style instrumental.
- OpenAI Whisper for vocal transcription with word-level timestamps.
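As a rough illustration of the word-level transcription step, the sketch below assumes the `openai-whisper` Python package and its `word_timestamps=True` option. The model size (`base`) and the helper names `flatten_words` and `transcribe_vocals` are hypothetical and not taken from the app's code.

```python
from typing import Any


def flatten_words(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect per-word timing entries from a Whisper result dict."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append(
                {"word": w["word"].strip(), "start": w["start"], "end": w["end"]}
            )
    return words


def transcribe_vocals(vocals_path: str, model_name: str = "base") -> list[dict[str, Any]]:
    """Transcribe an isolated vocal track and return timed words."""
    # Imported lazily so flatten_words can be used without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)  # "base" is an assumed default
    result = model.transcribe(vocals_path, word_timestamps=True)
    return flatten_words(result)
```

Each returned entry carries the word text plus its start and end time in seconds, which is the shape the later alignment and subtitle steps need.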
Metadata & Lyrics Retrieval
- AcoustID to identify audio fingerprint and retrieve song metadata (artist, title).
- Genius API to auto-fetch official song lyrics.
- Manual Input option for lyrics when metadata is incomplete or for custom songs.
Lyric Correction & Alignment
- Gemini AI to align and correct transcription using official or user-provided lyrics.
- Handles spelling errors, missing words, verse alignment, etc.
Karaoke Video Generation
- Generate .ass subtitle files with user-defined font, color, highlights, shadows, and outlines.
- Seamlessly loop background video effects for a visually appealing background.
- Final output as a single high-quality karaoke video (customizable resolution, bitrate, FPS).
Caching for Fast Iterations
- Creates a unique hash-based directory for each audio file.
- Allows partial reprocessing only for sections you choose to override (metadata fetch, AI transcription, etc.).
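One way partial reprocessing could work is a small helper that returns a stage's cached result unless a "force" flag is set. This is only a sketch: the function name `run_stage`, the JSON file layout, and the `force` parameter are assumptions for illustration, not the app's actual implementation.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_stage(
    cache_dir: Path,
    name: str,
    compute: Callable[[], Any],
    force: bool = False,
) -> Any:
    """Return the cached JSON result for a stage, recomputing only when asked."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{name}.json"

    # Reuse the stored result unless the caller explicitly overrides it.
    if cache_file.exists() and not force:
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = compute()
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this shape, a UI toggle such as "re-run metadata fetch" maps to `force=True` for that one stage while every other stage keeps reading from the cache.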
Below is a high-level overview of the application’s workflow:
- User Uploads an Audio File
- AcoustID: Generate audio fingerprint → Retrieve song metadata
- Demucs: Separate audio stems (vocals, instruments) → Merge instrument stems to create instrumental
- Whisper: Transcribe vocals (with timestamps)
- Genius / Manual Input: Fetch or provide reference lyrics
- Gemini AI: Align & correct transcribed lyrics with reference lyrics
- Subtitle & Video Generation: Create .ass subtitles → Loop selected video effect → Render final karaoke video
All of this is orchestrated within a Gradio interface. Once you launch app.py, it provides a local URL that you can open in your browser to interact with these steps visually.
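The "merge instrument stems" step in the workflow above amounts to summing every stem except the vocals. The sketch below shows the idea on plain Python lists of float samples. A real pipeline would more likely operate on NumPy arrays or tensors, and the stem names (`vocals`, `drums`, `bass`, `other`) assume a four-stem Demucs model. The function name is hypothetical.

```python
def mix_instrumental(stems: dict[str, list[float]], exclude: str = "vocals") -> list[float]:
    """Sum all stems except `exclude`, sample by sample, clipping to [-1.0, 1.0]."""
    kept = [samples for name, samples in stems.items() if name != exclude]
    if not kept:
        return []

    # Use the shortest stem length so every index is valid in every stem.
    length = min(len(s) for s in kept)
    mixed = []
    for i in range(length):
        total = sum(s[i] for s in kept)
        mixed.append(max(-1.0, min(1.0, total)))  # hard clip to avoid overflow
    return mixed


stems = {
    "vocals": [0.9, 0.9],
    "drums": [0.5, 0.25],
    "bass": [0.25, 0.25],
    "other": [0.5, -0.75],
}
instrumental = mix_instrumental(stems)
```

Hard clipping is the simplest safeguard against samples leaving the valid range; scaling the mix down instead is an equally reasonable choice.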
Windows
Linux/macOS
- Follow the official Conda installation guide.
- Check successful installation:
conda --version
The app requires API keys for fetching metadata, lyrics, and AI-based modifications.
- AcoustID API Key - Fetches metadata (artist, song name, etc.).
- Genius API Key - Fetches song lyrics.
- Gemini API Key - AI-powered lyric modification and alignment.
Create a .env file at the root of the project with the following keys (replace placeholders with your actual tokens):
ACOUST_ID="your_acoustid_api_key"
GENIUS_API_ACCESS_TOKEN="your_genius_api_key"
GEMINI_API_KEY="your_gemini_api_key"
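Before launching the app, it can help to confirm that the .env file contains the three keys listed above. The following stand-alone check uses only the standard library. It is a sketch, not how the app itself loads the file, and the helper names are hypothetical.

```python
from pathlib import Path

# Key names as listed in the .env example above.
EXPECTED_KEYS = ("ACOUST_ID", "GENIUS_API_ACCESS_TOKEN", "GEMINI_API_KEY")


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """List expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not values.get(key)]
```

Calling `missing_keys(read_env_file(Path(".env")))` from the project root returns an empty list when all three keys are set.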
- FFmpeg - Required for audio/video processing.
- Chromaprint (fpcalc) - Required to generate audio fingerprints.
Windows
- Download and extract both FFmpeg and Chromaprint.
- Add their bin directories to the system PATH, for example:
C:\Users\<your_username>\ffmpeg\bin
C:\Users\<your_username>\chromaprint-fpcalc
- Verify successful installation and setup:
ffmpeg -version
fpcalc -version
Linux/macOS
- Install via your package manager (e.g., apt-get install ffmpeg libchromaprint-tools on Debian/Ubuntu, or brew install ffmpeg chromaprint on macOS) or follow the official documentation.
- Verify successful installation and setup:
ffmpeg -version
fpcalc -version
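The same check can be done from Python on any platform by asking whether both executables resolve on the PATH. This is a minimal sketch using the standard library's `shutil.which`; the function name `find_tools` is hypothetical.

```python
import shutil
from typing import Optional

REQUIRED_TOOLS = ("ffmpeg", "fpcalc")


def find_tools(names=REQUIRED_TOOLS) -> dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A `None` entry usually means the tool's directory was not added to the PATH, or that the terminal was opened before the PATH change took effect.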
If you have an NVIDIA GPU, installing CUDA + cuDNN can significantly speed up AI processes (Demucs, Whisper, etc.).
Windows
- Download and install the CUDA Toolkit and cuDNN.
- Add their directories to the PATH, e.g.:
C:\Program Files\NVIDIA\CUDNN\<version_number>\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version_number>\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version_number>\libnvvp
- Verify installation by running:
nvcc --version
Linux/macOS
- Refer to NVIDIA’s official documentation for your platform.
- Verify installation by running:
nvcc --version
After cloning or downloading this repo, from your terminal run:
Windows
setup.bat
Linux/macOS
chmod +x setup.sh
./setup.sh
This will:
- Create a Conda environment named karaoke_env.
- Install all necessary Python libraries (Gradio, OpenAI Whisper, Demucs, etc.).
conda activate karaoke_env
python app.py
A local Gradio link will appear in your terminal. Open it in your browser to use the app.
Windows
- Follow the official WSL 2 installation guide.
- Install Ubuntu 24.04 into WSL:
wsl --install -d Ubuntu-24.04
- Download and install Docker Desktop.
- Check successful installation:
docker run --rm hello-world
The output should include "Hello from Docker!".
Linux
- Follow the official Docker Desktop installation guide.
- Check successful installation:
docker run --rm hello-world
The output should include "Hello from Docker!".
macOS
- Follow the official Docker Desktop installation guide.
- Check successful installation:
docker run --rm hello-world
The output should include "Hello from Docker!".
Make sure you have 120 GB of free disk space before building the image.
Even on a modern computer, the build will take more than 30 minutes.
docker compose build
The app requires API keys for fetching metadata, lyrics, and AI-based modifications.
- AcoustID API Key - (optional) Fetches metadata (artist, song name, etc.).
- Genius API Key - (optional) Fetches song lyrics.
- Gemini API Key - (required) AI-powered lyric modification and alignment.
Create a .env file at the root of the project with the following keys (replace placeholders with your actual tokens):
ACOUST_ID=your_acoustid_api_key
GENIUS_API_ACCESS_TOKEN=your_genius_api_key
GEMINI_API_KEY=your_gemini_api_key
docker compose up --no-build
A local Gradio link will appear in your terminal. Open it in your browser to use the app.
- Upload Audio: Provide an .mp3, .wav, or any other valid audio file.
- Process Audio: The app will:
- Identify metadata (song name, artist, etc.) via AcoustID.
- Separate stems with Demucs.
- Merge stems (except vocals) to form your instrumental track.
- Transcribe vocals using Whisper (timestamps included).
- Advanced Settings (Optional): Adjust transcription accuracy, re-run processes, set specific languages if auto-detect fails, etc.
- Click Process Audio to proceed.
Pro Tip: If you see any mismatched data or want to refine any step, open the “Developer Settings” accordion and force specific tasks to re-run.
- Review Artist/Song Name: Edit if the auto-detected metadata is incorrect.
- Fetch Lyrics: Click Fetch Reference Lyrics to grab official lyrics from Genius. Alternatively, paste your own text and click Update Reference Lyrics.
- Modify with AI: Once you have both the raw transcription and reference lyrics, press Modify with AI to refine and align timestamps.
This ensures spelling, repeated words, and verse alignment are corrected using Gemini AI.
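To make the idea of "correcting text while keeping timestamps" concrete, here is a deliberately simple stand-in that uses the standard library's `difflib` instead of Gemini. It is not the app's method: the function names, the word dictionaries with `word`/`start`/`end` keys, and the even time-splitting rule are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def _norm(word: str) -> str:
    """Lowercase and strip punctuation so 'Remember,' matches 'remember'."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def _spread(words, start, end):
    """Give each word an equal share of the interval [start, end]."""
    step = (end - start) / len(words)
    return [
        {"word": w, "start": start + i * step, "end": start + (i + 1) * step}
        for i, w in enumerate(words)
    ]


def correct_words(transcribed, reference_words):
    """Replace transcribed text with reference text while keeping timings."""
    matcher = SequenceMatcher(
        a=[_norm(t["word"]) for t in transcribed],
        b=[_norm(w) for w in reference_words],
        autojunk=False,
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref = reference_words[j1:j2]
        if tag == "equal":
            # Same word: keep the timing, take the reference spelling.
            for t, w in zip(transcribed[i1:i2], ref):
                out.append({"word": w, "start": t["start"], "end": t["end"]})
        elif tag == "replace":
            # Misheard words: reuse the time span of the transcribed block.
            out.extend(_spread(ref, transcribed[i1]["start"], transcribed[i2 - 1]["end"]))
        elif tag == "insert":
            # Words the transcription missed: fit them into the surrounding gap.
            start = transcribed[i1 - 1]["end"] if i1 > 0 else 0.0
            end = transcribed[i1]["start"] if i1 < len(transcribed) else start
            out.extend(_spread(ref, start, end))
        # "delete": transcribed words absent from the reference are dropped.
    return out
```

A language model can handle cases this sketch cannot, such as repeated choruses or reordered verses, which is presumably why the app delegates the step to Gemini.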
- Subtitle Style: Choose font, color, highlight, outline, and shadow settings. The app looks for font files in the fonts/ folder inside the app directory. Copy any fonts you need into that folder and restart the app.
- Background Effects: Optional looping .mp4 files can be selected for a dynamic background.
- Advanced Video Settings: Set resolution (720p, 1080p), FPS, bitrate, etc., based on your quality needs.
- Generate Karaoke: Click the button to produce your final video.
- Output: Video is saved in the output folder. If re-generated, it overwrites the existing file.
Experiment with fonts and color combos to achieve a professional karaoke style or something playful and unique!
Background Effects
- Place any .mp4 file in the effects folder; it appears automatically in the Gradio dropdown.
- The video is looped to match your song’s duration.
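Looping a short clip to the length of a song can be done with FFmpeg's `-stream_loop` input option combined with an output duration limit. The sketch below only builds the argument list. The function name, file paths, and the choice to pass the duration in as a parameter are assumptions, and the app's actual FFmpeg invocation may differ.

```python
def build_loop_command(background, instrumental, output, duration_s, fps=30):
    """Build an ffmpeg argument list that loops `background` for `duration_s` seconds."""
    return [
        "ffmpeg", "-y",
        "-stream_loop", "-1",        # repeat the next input indefinitely
        "-i", background,
        "-i", instrumental,
        "-map", "0:v:0",             # video from the looped background
        "-map", "1:a:0",             # audio from the instrumental track
        "-r", str(fps),
        "-t", f"{duration_s:.3f}",   # stop once the song's length is reached
        output,
    ]


cmd = build_loop_command("effects/bg.mp4", "instrumental.wav", "out.mp4", 187.5)
```

The list can be passed to `subprocess.run(cmd, check=True)`. Note that `-stream_loop` must come before the `-i` it applies to, since it is an input option.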
Subtitle .ass Files
- The app automatically creates an advanced subtitle file with your chosen styling (font, size, colors, etc.).
- You can tweak the .ass file further if you want extremely fine-grained control (e.g., line spacing).
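For anyone editing the .ass file by hand, it helps to know how a karaoke line is encoded: each word is preceded by a `{\k<n>}` tag whose value is a duration in centiseconds, and line times use the `H:MM:SS.cc` format. The sketch below builds one such `Dialogue` line from timed words. The function names and the `Default` style name are assumptions, and the app's generated files may use different tags or styles.

```python
def ass_time(seconds: float) -> str:
    """Format seconds as the H:MM:SS.cc timestamp used in .ass files."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def karaoke_dialogue(words, style="Default"):
    """Build one Dialogue line with a \\k tag per word (durations in centiseconds)."""
    # Boundaries are each word's start plus the last word's end, in centiseconds.
    # Working from rounded boundaries keeps rounding errors from accumulating.
    bounds = [round(w["start"] * 100) for w in words]
    bounds.append(round(words[-1]["end"] * 100))
    parts = [
        f"{{\\k{bounds[i + 1] - bounds[i]}}}{w['word']}"
        for i, w in enumerate(words)
    ]
    start, end = ass_time(words[0]["start"]), ass_time(words[-1]["end"])
    return f"Dialogue: 0,{start},{end},{style},,0,0,0,,{' '.join(parts)}"


line = karaoke_dialogue([
    {"word": "Hello", "start": 1.0, "end": 1.4},
    {"word": "world", "start": 1.5, "end": 2.5},
])
```

Here each word's `\k` duration runs until the next word starts, so the highlight reaches every word exactly at its own start time.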
Developer Settings
- Access advanced toggles in each section to re-run certain stages (metadata fetching, stem separation, AI alignment).
- Great for iterative improvements or debugging.
When you upload a new song, the app:
- Generates a Hash of the audio file.
- Creates a Cache Directory at cache/<unique_hash> for storing processed data such as separated stems, transcribed lyrics, and more.
- Speeds Up Reprocessing if you choose to revisit or re-generate any part of the same audio file.
This design ensures you don’t waste time repeatedly re-running expensive AI tasks.
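A content hash makes the cache key depend on the audio bytes rather than the file name, so re-uploading the same song under a different name still hits the cache. The sketch below assumes SHA-256 and a `cache/` root; the app's actual hash algorithm and function names are not stated in this README, so treat both as illustrative.

```python
import hashlib
from pathlib import Path


def audio_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_dir_for(audio_path: Path, cache_root: Path = Path("cache")) -> Path:
    """Create (if needed) and return cache/<hash> for the given audio file."""
    target = cache_root / audio_hash(audio_path)
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Reading in chunks keeps memory use flat even for long, uncompressed audio files.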
- Time Savings: Cut down from 4-8 hours of manual editing to just 5-15 minutes.
- High-Quality Output: Syncs lyrics with precise timing and offers advanced customization.
- AI-Powered: Capitalizes on cutting-edge models for stem separation and transcription, ensuring accuracy.
- Flexible & Extensible: Gradio-based UI, easy to integrate, and modifiable for various use cases.
This project is licensed under the Apache License. See LICENSE for details.
Thank you for checking out the AI Karaoke Video Creator. Enjoy making awesome karaoke videos with a fraction of the usual effort!