
We wanted to build a system that recommends new songs to a user based on his or her tastes and listening history. The traditional approach of collaborative filtering was discarded because we do not have a metric of user ratings (only play counts), and it would make no use of song features. To solve this problem, we instead used a clustering technique (k-means) to group similar songs together, which incorporates song features and reduces the search space when recommending new songs.
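To make the flow concrete, here is a minimal, illustrative sketch of the recommendation step in Python, assuming a trained clustering model and hypothetical lookup structures (`song_cluster`, `play_counts`); the actual feature pipeline and training are described in the sections below, and "recommend from the clusters of the user's top songs" is our reading of how the reduced search space is used.

```python
from collections import Counter

# Hypothetical inputs (illustration only):
#   song_cluster: track_id -> cluster id assigned by k-means
#   play_counts:  track_id -> play count for a single user

def recommend(song_cluster, play_counts, n=10):
    # Retrieve the user's 5 most played songs from their history.
    top5 = [t for t, _ in Counter(play_counts).most_common(5)]

    # The clusters containing those songs form the reduced search space.
    user_clusters = {song_cluster[t] for t in top5 if t in song_cluster}

    # Recommend unheard songs drawn from the same clusters.
    return [t for t, c in song_cluster.items()
            if c in user_clusters and t not in play_counts][:n]
```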

#Million song dataset download#

The main dataset for the project is the Million Song Dataset (MSD), a freely available collection of audio features and metadata for a million contemporary popular music tracks. The dataset consists of over 50 features and metadata fields extracted from the song audio. It is stored in a hierarchical HDF5 format and is available to download from the Open Science Data Cloud and AWS. Additionally, the following 3 auxiliary datasets were used: the user play count, genre, and lyrics datasets described below.

The combined size of all the datasets is 300GB, too large for conventional processing. To load the Million Song Dataset, the HDF5 files in the directory structure therefore had to be read into a Spark dataframe. Only the dataset files within the HDF5 structure were read; the group and metadata files were ignored. To be able to use the data, this indexed structure had to be flattened into a dictionary format for each track.
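As an illustration, the loading step might look like the sketch below, which uses h5py to flatten each track file into a dictionary and then builds a Spark dataframe. The directory path and the HDF5 group/field names are assumptions based on the MSD layout and should be verified against the actual files; the sequential driver-side read is for clarity only (the full 300GB set would be read in parallel).

```python
import glob

import h5py
from pyspark.sql import SparkSession

def track_to_dict(path):
    """Flatten one MSD HDF5 track file into a plain dictionary."""
    with h5py.File(path, "r") as f:
        meta = f["metadata/songs"][0]      # compound row of metadata fields
        analysis = f["analysis/songs"][0]  # compound row of audio analysis
        return {
            "track_id": analysis["track_id"].decode(),
            "artist_name": meta["artist_name"].decode(),
            "artist_location": meta["artist_location"].decode(),
            "mode": int(analysis["mode"]),
            "duration": float(analysis["duration"]),
            "loudness": float(analysis["loudness"]),
            "tempo": float(analysis["tempo"]),
            "time_signature": int(analysis["time_signature"]),
            "time_signature_confidence": float(analysis["time_signature_confidence"]),
            "song_hotttnesss": float(meta["song_hotttnesss"]),
            "artist_hotttnesss": float(meta["artist_hotttnesss"]),
        }

spark = SparkSession.builder.appName("msd-load").getOrCreate()
paths = glob.glob("MillionSongSubset/**/*.h5", recursive=True)  # assumed layout
tracks_df = spark.createDataFrame([track_to_dict(p) for p in paths])
```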
We only considered the following features for our use: TrackID, Artist name, Artist location, User play count, Genre, Lyrics, Mode (major/minor), Song Duration, Song Loudness, Song hotness, Artist hotness, Tempo (in bpm), Time Signature, and Time Signature Confidence. As some of these features were available across different dataframes, they had to be joined together into a single dataframe. No processing had to be done for the user and genre datasets. Rows with a time signature confidence below 0.5 also had to be dropped, because they were filled with inaccurate garbage values, and further cleaning removed incomplete records and records with too many empty features.

Another feature, “Speed”, had to be calculated from each track’s tempo and time signature, using the number of beats per measure rather than the number of beats per unit time (the tempo). This is done to normalize how “fast” a song sounds to human ears.

The lyrics dataset consists of lyrics as a bag of words (the original lyric files are copyright protected). Natural Language Processing techniques such as standardization (removing non-English and other irrelevant characters) and stop word removal (dropping frequent, low-meaning words like ‘I’, ‘the’, ‘a’, etc.) were also applied to the lyrics to preserve only the words with actual meaning.

Each of the features had to be cleaned to remove false outliers and then min-max normalized (scaled) to a value between 1 and 5. The model was trained on the following features: mode, duration, loudness, genre, speed, and hotness. The optimal number of clusters was found by calculating the sum of squared errors within each cluster for different values of k. To generate recommendations, we then look at the user history and retrieve his or her 5 most played songs.
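A sketch of the joining, filtering, and scaling steps described above, in PySpark: `tracks_df` comes from the loading sketch earlier, while `user_df` and `genre_df` are hypothetical dataframes for the auxiliary play count and genre datasets. The Speed formula is our reading of the description (tempo divided by beats per measure, i.e., measures per minute), not a formula confirmed by the source.

```python
from pyspark.sql import functions as F

df = (tracks_df
      .join(user_df, "track_id")    # hypothetical: track_id -> play count
      .join(genre_df, "track_id")   # hypothetical: track_id -> genre
      # Drop rows whose time signature is essentially a guess.
      .filter(F.col("time_signature_confidence") >= 0.5)
      .filter(F.col("time_signature") > 0))  # avoid dividing by zero below

# "Speed": assumed here to be tempo divided by beats per measure,
# normalizing how fast a song sounds to human ears.
df = df.withColumn("speed", F.col("tempo") / F.col("time_signature"))

def min_max_scale(frame, column):
    """Min-max normalize a column to the range [1, 5]."""
    lo, hi = frame.agg(F.min(column), F.max(column)).first()
    return frame.withColumn(column, 1 + 4 * (F.col(column) - lo) / (hi - lo))

for feature in ["duration", "loudness", "speed", "song_hotttnesss"]:
    df = min_max_scale(df, feature)
```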

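The clustering step could then use Spark ML's k-means, sweeping k and recording the within-cluster sum of squared errors (WSSSE) to locate the elbow. This is a sketch assuming Spark 2.4 or later (for `summary.trainingCost`); `genre_index` stands in for a numerically encoded genre column, since a categorical feature has to be index-encoded before it can be assembled into a vector.

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the training features into a single vector column.
assembler = VectorAssembler(
    inputCols=["mode", "duration", "loudness", "genre_index",
               "speed", "song_hotttnesss"],
    outputCol="features")
train_df = assembler.transform(df)

# Sweep k and record the within-cluster sum of squared errors.
sse = {}
for k in range(2, 21):
    model = KMeans(k=k, featuresCol="features", seed=42).fit(train_df)
    sse[k] = model.summary.trainingCost

for k, cost in sorted(sse.items()):
    print(k, cost)  # inspect the curve and pick k at the elbow
```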
#Million song dataset code#

concat.py : parse output files into a single file.
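concat.py itself is not reproduced here; a minimal sketch of what such a script could look like, assuming the Spark job writes its results as `part-*` files into an output directory:

```python
import glob
import shutil
import sys

def concat(output_dir, dest):
    """Concatenate Spark part-files into a single output file."""
    with open(dest, "wb") as out:
        for part in sorted(glob.glob(f"{output_dir}/part-*")):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

if __name__ == "__main__":
    concat(sys.argv[1], sys.argv[2])
```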
