
We wanted to build a system that recommends new songs to a user based on his or her tastes and listening history. The traditional approach of collaborative filtering was discarded because we do not have a metric of user ratings (only play counts), and it would make no use of song features. To solve this problem, we instead used a clustering technique (k-means) to group similar songs together, which incorporates song features and reduces the search space when recommending new songs.
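To make the flow concrete, here is a minimal, illustrative sketch of the recommendation step in Python, assuming a trained clustering model and hypothetical lookup structures (`song_cluster`, `play_counts`); the actual feature pipeline and training are described in the sections below, and "recommend from the clusters of the user's top songs" is our reading of how the reduced search space is used.

```python
from collections import Counter

# Hypothetical inputs (illustration only):
#   song_cluster: track_id -> cluster id assigned by k-means
#   play_counts:  track_id -> play count for a single user

def recommend(song_cluster, play_counts, n=10):
    # Retrieve the user's 5 most played songs from their history.
    top5 = [t for t, _ in Counter(play_counts).most_common(5)]

    # The clusters containing those songs form the reduced search space.
    user_clusters = {song_cluster[t] for t in top5 if t in song_cluster}

    # Recommend unheard songs drawn from the same clusters.
    return [t for t, c in song_cluster.items()
            if c in user_clusters and t not in play_counts][:n]
```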

#Million song dataset download#

The main dataset for the project is the Million Song Dataset (MSD), a freely available collection of audio features and metadata for a million contemporary popular music tracks. The dataset consists of over 50 features and metadata fields extracted from the song audio. It is stored in a hierarchical HDF5 format and is available to download from the Open Science Data Cloud and AWS. Additionally, the following 3 auxiliary datasets were used: the user play count, genre, and lyrics datasets described below.

The combined size of all the datasets is 300GB, too large for conventional processing. To load the Million Song Dataset, the HDF5 files in the directory structure therefore had to be read into a Spark dataframe. Only the dataset files within the HDF5 structure were read; the group and metadata files were ignored. To be able to use the data, this indexed structure had to be flattened into a dictionary format for each track.
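As an illustration, the loading step might look like the sketch below, which uses h5py to flatten each track file into a dictionary and then builds a Spark dataframe. The directory path and the HDF5 group/field names are assumptions based on the MSD layout and should be verified against the actual files; the sequential driver-side read is for clarity only (the full 300GB set would be read in parallel).

```python
import glob

import h5py
from pyspark.sql import SparkSession

def track_to_dict(path):
    """Flatten one MSD HDF5 track file into a plain dictionary."""
    with h5py.File(path, "r") as f:
        meta = f["metadata/songs"][0]      # compound row of metadata fields
        analysis = f["analysis/songs"][0]  # compound row of audio analysis
        return {
            "track_id": analysis["track_id"].decode(),
            "artist_name": meta["artist_name"].decode(),
            "artist_location": meta["artist_location"].decode(),
            "mode": int(analysis["mode"]),
            "duration": float(analysis["duration"]),
            "loudness": float(analysis["loudness"]),
            "tempo": float(analysis["tempo"]),
            "time_signature": int(analysis["time_signature"]),
            "time_signature_confidence": float(analysis["time_signature_confidence"]),
            "song_hotttnesss": float(meta["song_hotttnesss"]),
            "artist_hotttnesss": float(meta["artist_hotttnesss"]),
        }

spark = SparkSession.builder.appName("msd-load").getOrCreate()
paths = glob.glob("MillionSongSubset/**/*.h5", recursive=True)  # assumed layout
tracks_df = spark.createDataFrame([track_to_dict(p) for p in paths])
```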
We only considered the following features for our use: TrackID, Artist name, Artist location, User play count, Genre, Lyrics, Mode (major/minor), Song Duration, Song Loudness, Song hotness, Artist hotness, Tempo (in bpm), Time Signature, and Time Signature Confidence. As some of these features were available across different dataframes, they had to be joined together into a single dataframe. No processing had to be done for the user and genre datasets. Rows with a time signature confidence below 0.5 also had to be dropped, because they were filled with inaccurate garbage values, and further cleaning removed incomplete records and records with too many empty features.

Another feature, “Speed”, had to be calculated from each track’s tempo and time signature, using the number of beats per measure rather than the number of beats per unit time (the tempo). This is done to normalize how “fast” a song sounds to human ears.

The lyrics dataset consists of lyrics as a bag of words (the original lyric files are copyright protected). Natural Language Processing techniques such as standardization (removing non-English and other irrelevant characters) and stop word removal (dropping frequent, low-meaning words like ‘I’, ‘the’, ‘a’, etc.) were also applied to the lyrics to preserve only the words with actual meaning.

Each of the features had to be cleaned to remove false outliers and then min-max normalized (scaled) to a value between 1 and 5. The model was trained on the following features: mode, duration, loudness, genre, speed, and hotness. The optimal number of clusters was found by calculating the sum of squared errors within each cluster for different values of k. To generate recommendations, we then look at the user history and retrieve his or her 5 most played songs.
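A sketch of the joining, filtering, and scaling steps described above, in PySpark: `tracks_df` comes from the loading sketch earlier, while `user_df` and `genre_df` are hypothetical dataframes for the auxiliary play count and genre datasets. The Speed formula is our reading of the description (tempo divided by beats per measure, i.e., measures per minute), not a formula confirmed by the source.

```python
from pyspark.sql import functions as F

df = (tracks_df
      .join(user_df, "track_id")    # hypothetical: track_id -> play count
      .join(genre_df, "track_id")   # hypothetical: track_id -> genre
      # Drop rows whose time signature is essentially a guess.
      .filter(F.col("time_signature_confidence") >= 0.5)
      .filter(F.col("time_signature") > 0))  # avoid dividing by zero below

# "Speed": assumed here to be tempo divided by beats per measure,
# normalizing how fast a song sounds to human ears.
df = df.withColumn("speed", F.col("tempo") / F.col("time_signature"))

def min_max_scale(frame, column):
    """Min-max normalize a column to the range [1, 5]."""
    lo, hi = frame.agg(F.min(column), F.max(column)).first()
    return frame.withColumn(column, 1 + 4 * (F.col(column) - lo) / (hi - lo))

for feature in ["duration", "loudness", "speed", "song_hotttnesss"]:
    df = min_max_scale(df, feature)
```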

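The clustering step could then use Spark ML's k-means, sweeping k and recording the within-cluster sum of squared errors (WSSSE) to locate the elbow. This is a sketch assuming Spark 2.4 or later (for `summary.trainingCost`); `genre_index` stands in for a numerically encoded genre column, since a categorical feature has to be index-encoded before it can be assembled into a vector.

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the training features into a single vector column.
assembler = VectorAssembler(
    inputCols=["mode", "duration", "loudness", "genre_index",
               "speed", "song_hotttnesss"],
    outputCol="features")
train_df = assembler.transform(df)

# Sweep k and record the within-cluster sum of squared errors.
sse = {}
for k in range(2, 21):
    model = KMeans(k=k, featuresCol="features", seed=42).fit(train_df)
    sse[k] = model.summary.trainingCost

for k, cost in sorted(sse.items()):
    print(k, cost)  # inspect the curve and pick k at the elbow
```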
#Million song dataset code#

concat.py : parse output files into a single file.
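concat.py itself is not reproduced here; a minimal sketch of what such a script could look like, assuming the Spark job writes its results as `part-*` files into an output directory:

```python
import glob
import shutil
import sys

def concat(output_dir, dest):
    """Concatenate Spark part-files into a single output file."""
    with open(dest, "wb") as out:
        for part in sorted(glob.glob(f"{output_dir}/part-*")):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

if __name__ == "__main__":
    concat(sys.argv[1], sys.argv[2])
```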
