Harmonize: AI-Powered Audio Processing Suite


Built with: Python · Anaconda · pandas · NumPy · TensorFlow · Keras · .ENV · OpenAI · Gradio

πŸ“– Table of Contents

  1. 🎡 Overview
  2. 🌟 Key Features
  3. 🎸 Research on Guitar Notes
  4. πŸ–₯️ How to Use the App
  5. πŸ“‚ Project Structure
  6. πŸ“¦ Installation Guide
  7. πŸ“Š Technical Highlights
  8. πŸ”§ Troubleshooting
  9. πŸš€ Future Enhancements
  10. πŸ‘©β€πŸ’» Team
  11. πŸ“œ License

🎡 Overview

This application provides musicians, producers, and enthusiasts with a powerful yet intuitive interface for audio processing. Using advanced models like Demucs, Basic Pitch, and Whisper, the app offers:

  • Audio Separation: Extract instrumental and vocal stems.
  • MIDI Conversion: Convert instrumental audio to editable MIDI files.
  • MIDI Modification: Customize MIDI files using AI-powered prompts.
  • Lyrics Extraction and Translation: Extract lyrics from songs and translate them into multiple languages.
  • Karaoke Video Generation: Create karaoke videos with synchronized lyrics and a custom background.

🌟 Key Features

1. Audio Separation

  • Description: Extract individual components (e.g., vocals, bass, drums).
  • Model Used: Demucs.
  • Use Case: Isolate instrumental tracks for practice or remixing.
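
As a sketch, Demucs can also be driven from Python through its CLI entry point. The snippet below assumes the demucs package is installed; the file name is illustrative.

    # A minimal sketch of driving Demucs from Python (assumes the `demucs`
    # package is installed; the file name is illustrative).
    import demucs.separate

    # Split the track into vocals and accompaniment ("two stems");
    # outputs land under ./separated/<model_name>/<track_name>/ by default.
    demucs.separate.main(["--two-stems", "vocals", "song.mp3"])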

2. Audio to MIDI Conversion

  • Description: Convert audio into MIDI format for further editing.
  • Model Used: Basic Pitch.
  • Use Case: Generate sheet music or integrate into DAWs.
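
For illustration, Basic Pitch exposes a small Python API; the sketch below assumes the basic-pitch package and an illustrative file name. Recent releases also accept onset/frame threshold keywords, which is roughly what the app's note-threshold setting controls.

    # A minimal sketch of Basic Pitch's Python API (assumes the
    # `basic-pitch` package; the file name is illustrative).
    from basic_pitch.inference import predict

    # Returns the raw model output, a PrettyMIDI object, and note events.
    model_output, midi_data, note_events = predict("instrumental.wav")
    midi_data.write("instrumental.mid")  # save the transcription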

3. Modify MIDI Files

  • Description: Apply transformations like changing the scale or style using AI-generated prompts.
  • Model Used: Google Gemini API.
  • Use Case: Create unique renditions of existing tracks.
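
One plausible wiring for this step, not necessarily what utilities/modify_midi.py does: encode the MIDI notes as text, send them to Gemini with the user's prompt, and parse the reply back into MIDI events. The snippet assumes the google-generativeai package and a GOOGLE_API_KEY environment variable; the model name and note encoding are illustrative.

    # One plausible wiring for prompt-driven MIDI edits (assumes the
    # `google-generativeai` package and a GOOGLE_API_KEY environment
    # variable; model name and note encoding are illustrative, and
    # utilities/modify_midi.py may work differently).
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    notes = "C4 E4 G4 C5"  # a text encoding of the MIDI notes
    prompt = f"Rewrite these notes in a jazz style, keeping the length: {notes}"
    response = model.generate_content(prompt)
    print(response.text)  # would be parsed back into MIDI events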

4. Lyrics Extraction and Translation

  • Description: Extract and translate lyrics from vocal tracks.
  • Model Used: Whisper.
  • Use Case: Understand or re-purpose song lyrics.
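
A minimal sketch with the open-source openai-whisper package (the file name is illustrative). Note that Whisper's built-in translation task targets English; other target languages would go through a separate translation step.

    # A minimal sketch with the open-source `openai-whisper` package
    # (the file name is illustrative).
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("vocals.wav")  # or task="translate" for English
    print(result["text"])                    # full transcript
    # result["segments"] carries start/end timestamps per phrase,
    # which the karaoke step can reuse for synchronization.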

5. Karaoke Video Generation

  • Description: Create karaoke videos with synchronized lyrics and custom backgrounds.
  • Tools Used: FFmpeg for video processing.
  • Use Case: Host karaoke sessions or share lyric videos online.
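
A rough sketch of the FFmpeg call, assuming ffmpeg is on the PATH and built with libass (needed for the subtitles filter), and that a lyrics.srt with Whisper timings already exists; file names are illustrative.

    # A rough sketch of the FFmpeg call (assumes `ffmpeg` on PATH, built
    # with libass for the subtitles filter; file names are illustrative).
    import subprocess

    subprocess.run([
        "ffmpeg",
        "-loop", "1", "-i", "background.png",  # still image as the video track
        "-i", "instrumental.wav",              # audio track
        "-vf", "subtitles=lyrics.srt",         # burn in the timed lyrics
        "-pix_fmt", "yuv420p",                 # widely compatible pixel format
        "-shortest",                           # stop when the audio ends
        "karaoke.mp4",
    ], check=True)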

🎸 Research on Guitar Notes

Before developing the comprehensive audio processing app, we conducted focused research on recognizing guitar notes using machine learning and signal processing. This foundational work guided our understanding of audio features and model capabilities.

Key Steps in the Research:

  1. Feature Extraction for Audio (see the sketch after this list)

    • MFCC (Mel-Frequency Cepstral Coefficients): Captures the spectral envelope of audio signals.
    • Mel Spectrogram: Provides a frequency-based visual representation.
    • Chroma Features: Highlights harmonic pitch content.
    • Spectral Contrast: Differentiates between peaks and valleys in the spectrum.
  2. Model Training

    • Used Convolutional Neural Networks (CNNs) with TensorFlow/Keras to classify guitar chords.
    • Trained on diverse datasets of .wav files containing guitar notes at varying pitches, tones, and durations.
    • Achieved 98.5% validation accuracy.
  3. Pitch Estimation and Signal Processing

    • Applied FFT (Fast Fourier Transform) and CQT (Constant-Q Transform) for frequency analysis.
    • Estimated fundamental frequencies and converted them into MIDI notes.
    • Segmented audio into smaller chunks for chord prediction.
  4. Data Augmentation

    • Applied techniques such as white noise addition, time stretching, and pitch shifting to improve model robustness.
  5. Outputs

    • Visualized predictions with CQT and FFT to validate chord recognition accuracy.
    • Generated MIDI files for the predicted notes.
    • Created music21 streams and MIDI files for playback and analysis.
    • Example Output: sweet_child_music21_with_chords.mid.
  6. Interactive UI for Fine-Tuning

    • Implemented sliders to adjust CQT parameters for better flexibility and analysis.
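
A condensed sketch of steps 1, 3, and 5 above, assuming librosa and music21. It is illustrative rather than the notebooks' exact code; in particular, the single FFT-peak pitch estimate is a deliberate simplification of the chord-level pipeline.

    # A condensed, illustrative version of steps 1, 3, and 5 (assumes
    # librosa and music21; the single FFT-peak pitch estimate is a
    # simplification of the chord-level pipeline in the notebooks).
    import librosa
    import numpy as np
    from music21 import note, stream

    y, sr = librosa.load("guitar_note.wav", sr=None)

    # Step 1: features of the kind fed to the CNN classifier.
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mel      = librosa.feature.melspectrogram(y=y, sr=sr)
    chroma   = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

    # Step 3: frequency analysis and a crude fundamental estimate.
    cqt = np.abs(librosa.cqt(y, sr=sr))
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    f0 = freqs[np.argmax(spectrum)]

    # Step 5: convert Hz to a MIDI note (69 = A4 = 440 Hz) and write it out.
    n = note.Note()
    n.pitch.midi = int(round(librosa.hz_to_midi(f0)))
    s = stream.Stream()
    s.append(n)
    s.write("midi", fp="predicted_note.mid")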

Visualizations and Results:

  • Feature Extraction Outputs:

    • MFCC visualization.
    • Mel Spectrogram comparison (before/after training).
    • Chroma features labeled with pitch classes.

  • Model Training Evaluation:

    • Validation loss and accuracy graphs over training epochs.

  • Audio Processing Visuals:

    • Raw audio waveform.
    • FFT and CQT plots.
    • Predicted guitar notes.


Why This Research Matters:

This research proved instrumental in identifying the strengths and limitations of CNNs for specific instruments. It informed our decision to later leverage pre-trained models like Demucs and Basic Pitch for broader functionality.


πŸ–₯️ How to Use the App

Step-by-Step Instructions:

Audio Separation

  1. Run python app.py from the project root, then open the Gradio interface in your browser.
  2. Go to the "Audio Separation" tab.
  3. Upload an audio file.
  4. Customize parameters (e.g., model version, bitrate).
  5. Click Separate Audio and download the stems.


Audio to MIDI Conversion

  1. Switch to the "Audio to MIDI" tab.
  2. Upload an instrumental audio file.
  3. Adjust MIDI generation settings (e.g., note threshold).
  4. Click Convert to MIDI to generate and download the file.


Modify MIDI Files

  1. Select the "Modify MIDI" tab.
  2. Upload a MIDI file.
  3. Enter a text prompt (e.g., "Change to jazz style").
  4. Click Modify MIDI to apply changes.


Lyrics Extraction and Translation

  1. Go to the "Lyrics Extraction" tab.
  2. Upload a vocal stem.
  3. Click Extract Lyrics to display text.
  4. Input a language code for translation (e.g., en, es, fr) and click Translate.


Karaoke Video Generation

  1. Upload instrumental and vocal stems.
  2. Use Whisper to synchronize lyrics.
  3. Customize lyrics and background image.
  4. Generate a karaoke video using FFmpeg.

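To make the tab layout above concrete, here is a stripped-down sketch of how app.py might wire two of the tabs in Gradio. The helper imports and their signatures from utilities/ are assumptions for illustration, not the app's actual code.

    # A stripped-down sketch of how app.py might wire two of the tabs
    # (the imports and signatures from utilities/ are assumptions).
    import gradio as gr

    from utilities.separate_audio import separate_audio  # hypothetical helper
    from utilities.audio_to_midi import convert_to_midi  # hypothetical helper

    with gr.Blocks(title="Harmonize") as demo:
        with gr.Tab("Audio Separation"):
            audio_in = gr.Audio(type="filepath", label="Input audio")
            stems_out = gr.File(label="Separated stems")
            gr.Button("Separate Audio").click(separate_audio, audio_in, stems_out)

        with gr.Tab("Audio to MIDI"):
            midi_in = gr.Audio(type="filepath", label="Instrumental audio")
            midi_out = gr.File(label="MIDI file")
            gr.Button("Convert to MIDI").click(convert_to_midi, midi_in, midi_out)

    demo.launch()  # serves locally, by default at http://127.0.0.1:7860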


πŸ“‚ Project Structure

audio_processing_app/
β”œβ”€β”€ output_stems/        # Processed audio stems
β”œβ”€β”€ output_midi/         # Generated MIDI files
β”œβ”€β”€ karaoke_videos/      # Karaoke video outputs
β”œβ”€β”€ notebooks/           # Development notebooks
β”œβ”€β”€ utilities/           # Helper scripts
β”‚   β”œβ”€β”€ separate_audio.py
β”‚   β”œβ”€β”€ audio_to_midi.py
β”‚   β”œβ”€β”€ modify_midi.py
β”‚   └── lyrics_processing.py
β”œβ”€β”€ app.py               # Main Gradio application
└── requirements.txt     # Python dependencies

πŸ“¦ Installation Guide

Pre-Installation Steps

Before installing the required dependencies, make sure to install the correct versions of PyTorch, torchvision, and torchaudio based on your system and CUDA version. Follow the instructions below:


  1. Clone the Repository

    git clone https://github.com/Corey-Holton/Group_3_Project.git
    cd Group_3_Project
  2. Set Up Conda Environment

    conda create -n audio_processing python=3.10 -y
    conda activate audio_processing
  3. Install Dependencies

    • First, install torch, torchvision, and torchaudio:
      • For GPU Users:

        • Install the appropriate CUDA toolkit.
        • Use the PyTorch installation guide to install the correct versions:
          pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cuXX
          Replace cuXX with your specific CUDA version (e.g., cu118 for CUDA 11.8).
      • For CPU Users:

        • Install the CPU versions of PyTorch, torchvision, and torchaudio:
          pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
      • For macOS Users:

        • Use the CPU-only version of PyTorch as macOS does not support CUDA:
          pip install torch torchvision torchaudio
    • Then install the remaining dependencies from requirements.txt:
      pip install -r requirements.txt
  4. Run the Application

    python app.py

πŸ“Š Technical Highlights

Models and Techniques

  • Demucs: Audio stem separation.
  • Basic Pitch: Audio-to-MIDI conversion.
  • Whisper: Lyrics extraction and translation.
  • Google Gemini API: AI-based MIDI modification.

Key Audio Features:

  • MFCC: Mel-Frequency Cepstral Coefficients.
  • Chroma Features: Harmonic pitch representation.
  • Spectral Contrast: Timbre differentiation.
  • Mel Spectrogram: Frequency-based signal representation.

Data Augmentation:

  • White noise addition.
  • Time stretching/shifting.
  • Pitch shifting.
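
A minimal sketch of the three augmentations using librosa; the noise level and factors are illustrative.

    # A minimal sketch of the three augmentations (assumes librosa;
    # the noise level and factors are illustrative).
    import librosa
    import numpy as np

    y, sr = librosa.load("guitar_note.wav", sr=None)

    noisy     = y + 0.005 * np.random.randn(len(y))               # white noise
    stretched = librosa.effects.time_stretch(y, rate=0.8)         # slow down 20%
    shifted   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up a whole tone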



πŸ”§ Troubleshooting

  • Dependency Issues: Ensure all libraries in requirements.txt are installed.
  • Missing Outputs: Verify write permissions for output_stems/ and output_midi/.
  • Model Compatibility: Use Python 3.10+ for TensorFlow compatibility.

πŸš€ Future Enhancements

  1. Expand model compatibility for non-guitar instruments.
  2. Real-time audio processing.
  3. Cloud storage integration for outputs.
  4. Enhanced lyrics editing features.

πŸ‘©β€πŸ’» Team

  • Corey Holton
  • Christian Palacios
  • Edwin Lovera
  • Montre Davis

πŸ“œ License

This project is licensed under the MIT License.

