The project aims to use a vector database to find songs that fit well with a given playlist. It tries to interpret the overall theme or "direction" of the playlist and suggests songs that align closely with it. The playlists themselves are built using a relational database.
The method for embedding the songs (currently a work in progress and subject to further research) embeds each song's lyrics with the e5-small model and stores the result as a vector in the database. To make suggestions, the system averages the vectors of the songs in a playlist and searches the database for the songs nearest to that average.
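The averaging and nearest-neighbour step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy 3-dimensional vectors stand in for real e5-small embeddings, cosine similarity is assumed as the distance measure, songs already in the playlist are assumed to be excluded from the results, and the names `mean_vector`, `cosine_similarity` and `suggest` are hypothetical. In the project itself the search is delegated to the vector database.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors (the playlist centroid)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest(playlist_ids, catalog, k=10):
    """Return the k song ids closest to the playlist centroid.

    Assumption: songs that are already in the playlist are skipped.
    """
    centroid = mean_vector([catalog[song_id] for song_id in playlist_ids])
    scored = [
        (cosine_similarity(centroid, vector), song_id)
        for song_id, vector in catalog.items()
        if song_id not in playlist_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [song_id for _, song_id in scored[:k]]


# Toy vectors standing in for lyric embeddings (hypothetical values).
catalog = {
    "rap_1": [0.9, 0.1, 0.0],
    "rap_2": [0.8, 0.2, 0.1],
    "rap_3": [0.85, 0.15, 0.05],
    "pop_1": [0.1, 0.9, 0.2],
    "edm_1": [0.0, 0.2, 0.9],
}

print(suggest(["rap_1", "rap_2"], catalog, k=2))
```

With these toy vectors, a playlist made of the two rap songs ranks the third rap song first, because its vector points in almost the same direction as the playlist average.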
Assessing the objective quality of the suggested tracks poses a significant challenge. The following quality evaluation is inherently based on subjective assessments, supplemented by extensive testing across various genres and playlist configurations.
The overall intention of the playlist generation is clear. For example, when creating a playlist labeled "girly party music," the algorithm suggests tracks that generally align with this theme (with 5 out of 10 songs being a good fit). Similarly, when generating a rap playlist, the recommendations lean heavily toward the genre. Considering the limited data available, the results are quite promising.
However, it's difficult to fully assess the quality of the recommended songs. Since the dataset only includes the top 3,000 worldwide hits, it's likely that any song suggested will already be a popular, high-quality track. Therefore, further analysis would be needed to judge whether the recommendations offer variety or true relevance beyond these popular songs.
During the data ingestion process, lyrics were cleaned by removing common stop words such as "the", "and", and similar terms, which are generally believed not to affect the overall meaning of the lyrics. The aim of this preprocessing step was to improve output quality by focusing on the more meaningful content of the lyrics, potentially making the track recommendations more relevant.
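A minimal sketch of this kind of stop word removal is shown below. The stop word list used during ingestion is not given here, so `STOP_WORDS` is a tiny illustrative set and `remove_stop_words` is a hypothetical helper. Note that this version also lowercases the text and drops punctuation, which may or may not match the actual preprocessing.

```python
import re

# Tiny illustrative list; the list used during ingestion is an unknown here.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in"}


def remove_stop_words(lyrics, stop_words=STOP_WORDS):
    """Lowercase the lyrics, split them into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", lyrics.lower())
    return " ".join(token for token in tokens if token not in stop_words)


print(remove_stop_words("Dancing in the moonlight and the rain"))
# prints: dancing moonlight rain
```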
The overall quality of the suggested titles decreased with the use of a larger dataset. Only 4 out of 10 suggestions were deemed acceptable or good for inclusion in the playlist. While the intention behind the recommendations remains identifiable, and the clustering of songs based on their lyrics still holds, the quality of recommendations has suffered. Notably, the genre range of the suggestions remains consistent.
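The "good fit out of 10" figures quoted in this evaluation amount to a manually judged precision at 10. The sketch below shows one way such judgments could be recorded and compared. The helper `precision_at_k` is hypothetical, and the per-position labels are made up so that only their totals (5 of 10 and 4 of 10) mirror the numbers reported in the text.

```python
def precision_at_k(judgments, k=10):
    """Share of the first k suggestions that were judged a good fit (True)."""
    top = judgments[:k]
    if not top:
        raise ValueError("need at least one judgment")
    return sum(top) / len(top)


# Hypothetical per-suggestion labels; only the totals come from the text.
small_dataset = [True, False, True, True, False, False, True, False, True, False]
large_dataset = [True, False, False, True, False, True, False, False, True, False]

print(precision_at_k(small_dataset))  # 0.5
print(precision_at_k(large_dataset))  # 0.4
```

Keeping the individual labels, rather than only the totals, would make it easier to repeat the comparison per genre later.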
An interesting observation is that the quality of recommendations varies significantly with the genre of the playlist. For instance, pop music recommendations are considerably worse than those for Hip-Hop or Electronic genres.
With the expansion of the dataset, which includes many lesser-known songs, there are more selections that align with the lyrical themes of the playlist, but the overall quality of these tracks does not match that of the hit songs in the playlist. However, when a song does fit well, it often aligns better than before, likely due to the finer distinctions made possible by the larger dataset.
At this point, it is unclear how the ingestion changes have affected the output quality. Further research will be conducted to investigate this.
- Install the dependencies listed in requirements.txt.
- Set up the vector store with this shell script.
- Ingest the data into the vector store with the ingestion notebook.
- Run run.py.
- Access the project on localhost.
- Try it, fork it, open an issue, ...
Small Dataset (3.5k tracks):
The small dataset originates from a Kaggle dataset and serves as the initial source of songs. It has undergone preprocessing to remove unnecessary columns and duplicates. Additionally, the dataset has been augmented by incorporating a lyrics column through web scraping.
Large Dataset (70k tracks):
- Multiple datasets were combined by matching songs on their track name and artist.
- Each song was assigned a unique identifier to facilitate tracking.
- Lyrics for each song were obtained via the Genius API.
Note: The large dataset is not stored in the repository due to file size limitations.
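The combination steps listed for the large dataset can be sketched as follows. This is an illustration under assumptions, not the ingestion code itself: the column names `track_name` and `artist`, the case- and whitespace-insensitive matching, the rule that the first source wins on conflicting fields, and the use of UUIDs as identifiers are all hypothetical choices. The lyrics lookup via the Genius API is left out.

```python
import uuid


def normalize(text):
    """Lowercase and collapse whitespace so minor spelling differences match."""
    return " ".join(text.lower().split())


def merge_datasets(*datasets):
    """Combine rows from several datasets, matching on track name and artist.

    Each distinct song gets one merged record with a unique "id".
    Assumption: when two sources disagree on a field, the first one wins.
    """
    merged = {}
    for dataset in datasets:
        for row in dataset:
            key = (normalize(row["track_name"]), normalize(row["artist"]))
            if key not in merged:
                merged[key] = {"id": str(uuid.uuid4())}
            for field, value in row.items():
                merged[key].setdefault(field, value)
    return list(merged.values())


# Hypothetical rows standing in for two of the source datasets.
source_a = [{"track_name": "Song A", "artist": "Artist X", "popularity": 80}]
source_b = [
    {"track_name": "song a ", "artist": "artist x", "streams": 1000},
    {"track_name": "Song B", "artist": "Artist Y", "streams": 500},
]

songs = merge_datasets(source_a, source_b)
print(len(songs))  # 2
```

In this toy example the two spellings of "Song A" collapse into one record that carries both the `popularity` and the `streams` field.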
Source Datasets:
- Top 30k Spotify Songs
- Most Streamed Spotify Songs 2024
- Spotify Million Song Dataset
- Spotify 1.2M+ Songs
B. Logan, A. Kositsky, and P. Moreno, "Semantic analysis of song lyrics," 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), Taipei, Taiwan, 2004, pp. 827-830, vol. 2, doi: 10.1109/ICME.2004.1394328.
K. I. Batcho, M. L. DaRin, A. M. Nave, and R. R. Yaworsky, "Nostalgia and identity in song lyrics," Psychology of Aesthetics, Creativity, and the Arts, vol. 2, no. 4, pp. 236-244, 2008, doi: 10.1037/1931-3896.2.4.236.
L. Wang et al., "Multilingual E5 text embeddings: A technical report," arXiv preprint arXiv:2402.05672, 2024.
J. Wang et al., "Milvus: A purpose-built vector data management system," Proceedings of the 2021 International Conference on Management of Data, 2021.