Shuffle.ai

General

The project aims to use a vector database to find songs that fit well with a given playlist. By leveraging a vector database, the project tries to interpret the overall theme or "direction" of the playlist and suggest songs that align closely with it. The playlists themselves are built using a relational database.

The method for embedding the songs (currently a work in progress and subject to further research) involves embedding song lyrics using the e5-small model. Each song in the playlist is represented as a vector in the database. By calculating the average of these vectors, the system can then search for the nearest matching songs in the database.
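The averaging and nearest-neighbour idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the toy 3-dimensional vectors stand in for real e5-small embeddings, the in-memory `catalog` dictionary stands in for the vector database, and cosine similarity is an assumed distance metric. The function names `mean_vector`, `cosine_similarity`, and `suggest` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors (the playlist "direction")."""
    if not vectors:
        raise ValueError("playlist has no embedded songs")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest(playlist_ids, catalog, k=3):
    """Rank songs outside the playlist by similarity to the playlist average."""
    centroid = mean_vector([catalog[song_id] for song_id in playlist_ids])
    candidates = [
        (song_id, cosine_similarity(centroid, vector))
        for song_id, vector in catalog.items()
        if song_id not in playlist_ids
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [song_id for song_id, _ in candidates[:k]]


# Toy embeddings (assumption: real vectors would come from the e5-small model).
catalog = {
    "rap_1": [0.9, 0.1, 0.0],
    "rap_2": [0.8, 0.2, 0.1],
    "rap_3": [0.85, 0.15, 0.05],
    "pop_1": [0.1, 0.9, 0.2],
    "ballad_1": [0.0, 0.2, 0.9],
}

print(suggest(["rap_1", "rap_2"], catalog, k=2))
```

In this toy catalog the playlist of two rap tracks pulls in the third rap track first, because its vector lies closest to the playlist average. A real vector database would replace the linear scan with an indexed search.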

Quality analysis

Assessing the objective quality of the suggested tracks poses a significant challenge. The following quality evaluation is inherently based on subjective assessments, supplemented by extensive testing across various genres and playlist configurations.

Results on Small Dataset:

The overall intention of the playlist generation is clear. For example, when creating a playlist labeled "girly party music," the algorithm suggests tracks that generally align with this theme (with 5 out of 10 songs being a good fit). Similarly, when generating a rap playlist, the recommendations lean heavily toward the genre. Considering the limited data available, the results are quite promising.

However, it's difficult to fully assess the quality of the recommended songs. Since the dataset only includes the top 3,000 worldwide hits, it's likely that any song suggested will already be a popular, high-quality track. Therefore, further analysis would be needed to judge whether the recommendations offer variety or true relevance beyond these popular songs.

Results from Larger Dataset and Ingestion Tweaks:

During the data ingestion process, lyrics were cleaned by removing common stop words, such as "the," "and," and similar terms, which are generally believed not to impact the overall meaning of the lyrics. The aim of this preprocessing step was to enhance the output quality by focusing on more meaningful content within the lyrics, potentially improving the relevance of track recommendations.
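A stop-word filter of this kind could look roughly like the sketch below. The stop-word list is a short illustrative sample, not the list used in the project, and the lowercasing and punctuation stripping are assumptions about the cleaning step. The name `remove_stop_words` is hypothetical.

```python
import re

# Illustrative sample only; the project's actual stop-word list is not documented here.
STOP_WORDS = frozenset({"the", "and", "a", "an", "of", "to", "in"})


def remove_stop_words(lyrics, stop_words=STOP_WORDS):
    """Lowercase the lyrics, split them into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", lyrics.lower())
    return " ".join(token for token in tokens if token not in stop_words)


print(remove_stop_words("The night and the city lights"))
```

Note that a filter like this also discards casing and punctuation, which may matter to an embedding model trained on natural sentences.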

The overall quality of the suggested titles decreased with the use of a larger dataset. Only 4 out of 10 suggestions were deemed acceptable or good for inclusion in the playlist. While the intention behind the recommendations remains identifiable, and the clustering of songs based on their lyrics still holds, the quality of recommendations has suffered. Notably, the genre range of the suggestions remains consistent.

An interesting observation is that the quality of recommendations varies significantly depending on the genre of the playlist. For instance, pop music recommendations are considerably worse than those for Hip-Hop or Electronic genres.

With the expansion of the dataset, which includes many lesser-known songs, there are more selections that align with the lyrical themes of the playlist, but the overall quality of these tracks does not match that of the hit songs in the playlist. However, when a song does fit well, it often aligns better than before, likely due to the finer distinctions made possible by the larger dataset.

At this point, it is unclear how the ingestion changes have affected the output quality. Further research will be conducted to investigate this.

Architecture

Setup

Install the dependencies listed in requirements.txt.
Set up the vector store with this shell script.
Ingest the data into the vector store with the ingestion notebook.
Run run.py.
Access the project on localhost.
Try it, fork it, open an issue, ...

Datasets

Small Dataset (3.5k tracks)

The small dataset originates from a Kaggle dataset and serves as the initial source of songs. It has undergone preprocessing to remove unnecessary columns and duplicates. Additionally, the dataset has been augmented by incorporating a lyrics column through web scraping.

Large Dataset (70k tracks):

Multiple datasets were combined by matching songs on their track name and artist.
Each song was assigned a unique identifier to facilitate tracking.
Lyrics for each song were obtained via the Genius API.
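The first two steps above can be sketched as follows. This is an illustration under stated assumptions: the field names `track_name` and `artist`, the case- and whitespace-insensitive matching, and the use of UUIDs as identifiers are all hypothetical, since the source datasets' schemas and the project's ID scheme are not described here. Fetching lyrics from the Genius API is not shown.

```python
import uuid


def song_key(track_name, artist):
    """Normalise name and artist so that trivially different spellings match."""
    return (
        " ".join(track_name.lower().split()),
        " ".join(artist.lower().split()),
    )


def merge_datasets(*datasets):
    """Combine rows from several datasets, matching on (track name, artist)."""
    merged = {}
    for rows in datasets:
        for row in rows:
            key = song_key(row["track_name"], row["artist"])
            if key not in merged:
                # Assumption: a random UUID serves as the unique identifier.
                merged[key] = {"id": str(uuid.uuid4())}
            # Earlier datasets win when the same field appears twice.
            for field, value in row.items():
                merged[key].setdefault(field, value)
    return list(merged.values())


# Two tiny hypothetical source datasets.
dataset_a = [{"track_name": "Hello", "artist": "Adele", "year": 2015}]
dataset_b = [
    {"track_name": " hello ", "artist": "ADELE", "genre": "pop"},
    {"track_name": "Halo", "artist": "Beyoncé", "genre": "pop"},
]

songs = merge_datasets(dataset_a, dataset_b)
print(len(songs))
```

Matching on a normalised key keeps the same song from appearing twice when two sources differ only in casing or spacing. It will not catch remixes, featured-artist variants, or misspellings.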

Note: The large dataset is not stored in the repository due to file size limitations.

Source Datasets:

Relevant Papers

B. Logan, A. Kositsky, and P. Moreno, "Semantic analysis of song lyrics," 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), Taipei, Taiwan, 2004, pp. 827-830, vol. 2, doi: 10.1109/ICME.2004.1394328.

K. I. Batcho, M. L. DaRin, A. M. Nave, and R. R. Yaworsky, "Nostalgia and identity in song lyrics," Psychology of Aesthetics, Creativity, and the Arts, vol. 2, no. 4, pp. 236-244, 2008, doi: 10.1037/1931-3896.2.4.236.

L. Wang et al., "Multilingual E5 text embeddings: A technical report," arXiv preprint arXiv:2402.05672, 2024.

J. Wang et al., "Milvus: A purpose-built vector data management system," Proceedings of the 2021 International Conference on Management of Data, 2021.
