Improved KNN Based Music Recommendation Engine

By Shivani Vijay Patil, Information Technology, VESIT

Introduction

In this blog we are going to see how to build a music recommendation engine using an improved KNN algorithm, and the main steps involved in building it. So, let's first understand: what is a music recommendation engine?

Music recommendation systems are part of a broader class of recommender systems, which filter information to predict a user's preference for a given item. There are broadly two approaches to building recommender systems: Collaborative Filtering and Content-Based Filtering.

To build our recommendation engine we will use a user-based collaborative filtering approach, in which we find similar users and make recommendations based on closely related listening histories.
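The idea behind user-based collaborative filtering can be sketched with a toy example. The listen counts below are made up, and the hand-rolled cosine similarity is just for illustration (the engine itself will use sklearn later):

```python
import numpy as np

# Toy listen-count vectors for three users over five songs (hypothetical data).
users = np.array([
    [5, 0, 2, 0, 1],   # user A
    [4, 0, 3, 0, 0],   # user B: similar taste to A
    [0, 7, 0, 6, 0],   # user C: different taste
])

def cosine_sim(a, b):
    # Cosine similarity: closer to 1.0 means more similar listening habits.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_ab = cosine_sim(users[0], users[1])
sim_ac = cosine_sim(users[0], users[2])
print(sim_ab > sim_ac)  # A and B are closer than A and C
```

Songs that user B has listened to heavily become candidate recommendations for user A, because their listening histories point in nearly the same direction.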

DataSet

Our first step is to obtain a dataset. Although there are many datasets available, we have worked with the ListenBrainz dataset, which can be downloaded from the official ListenBrainz website. To process this huge dataset, we have used some functionality from BigQuery and pandas.

BigQuery offers users the ability to manage data using fast SQL-like queries for real-time analysis. Our ListenBrainz dataset looks like this:

import pandas as pd
import numpy as np

df = pd.read_csv('/content/drive/My Drive/bq-results-20200605-173909-k0klgfi06h80.csv')
df.head()

The columns in the dataset are:

However, we still lack parameters like genres, song_duration and release_year, so we need to make REST API calls using the MusicBrainz Development/JSON Web Service.

Using a recording request URL of the form https://musicbrainz.org/ws/2/recording/{recording_id}?inc=aliases+artist-credits+releases+genres&fmt=json we can collect genre, song_duration and release_year data in a dictionary d2 (see below), which we then convert into a dataframe. Afterwards, we write the dataframe to a CSV file using the to_csv method.

import json
import requests
import time

# Dictionary for building the dataframe
d2 = {'recording_mbid': [], 'length': [], 'year': [], 'genres': []}

# reid: iterable of recording MBIDs collected earlier
for row in reid:
    resp = requests.get(f'https://musicbrainz.org/ws/2/recording/{row}?inc=aliases+artist-credits+releases+genres&fmt=json')
    time.sleep(0.5)  # stay within the MusicBrainz rate limit
    data1 = resp.json()

    try:
        d2['recording_mbid'].append(data1['id'] or None)
    except KeyError:
        d2['recording_mbid'].append(None)
    try:
        d2['length'].append(data1['length'] or None)
    except KeyError:
        d2['length'].append(None)
    try:
        d2['year'].append(data1['releases'][0]['release-events'][0]['date'][0:4] or '')
    except (KeyError, IndexError):
        d2['year'].append('')
    try:
        d2['genres'].append(data1['artist-credit'][0]['artist']['genres'][0]['name'] or None)
    except (KeyError, IndexError):
        d2['genres'].append(None)

df1 = pd.DataFrame(d2)
print(df1)
df1.to_csv("/content/drive/My Drive/new4.csv")

After merging our two dataframes on recording_mbid, we get our final data consisting of the original parameters along with year, length and genres.
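As a rough sketch of this merge step (with made-up stand-in values for the two dataframes), pandas' merge on the shared recording_mbid column looks like:

```python
import pandas as pd

# Hypothetical stand-ins: df holds the original listen data,
# df1 holds the metadata fetched from the MusicBrainz API.
df = pd.DataFrame({
    'recording_mbid': ['id1', 'id2', 'id3'],
    'user_name': ['alice', 'bob', 'alice'],
})
df1 = pd.DataFrame({
    'recording_mbid': ['id1', 'id2', 'id3'],
    'length': [210000, 185000, None],
    'year': ['1999', '2004', ''],
    'genres': ['rock', 'pop', None],
})

# Inner merge keeps only recordings present in both frames.
merged = pd.merge(df, df1, on='recording_mbid', how='inner')
print(merged.columns.tolist())
```

After this, every listen row carries year, length and genres alongside the original columns.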

import pandas as pd
import numpy as np

df = pd.read_csv('/content/drive/My Drive/datacleaned.csv')
df.drop(df.columns[0], axis=1, inplace=True)
df.head(5)

Data Preprocessing And Cleaning

After getting the required dataset we need to clean it. Data cleaning involved dropping all duplicate rows and removing rows in which most of the values were NaN.
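A minimal sketch of this cleaning step, on a tiny hypothetical frame (the exact NaN threshold used for the real dataset may differ):

```python
import pandas as pd

# Small frame with the kinds of problems described above (hypothetical values).
df = pd.DataFrame({
    'recording_mbid': ['id1', 'id1', 'id2', None],
    'genres': ['rock', 'rock', None, None],
    'year': ['1999', '1999', '2004', None],
})

df = df.drop_duplicates()        # drop exact duplicate rows
df = df.dropna(thresh=2)         # keep rows with at least 2 non-NaN values
df = df.reset_index(drop=True)
print(len(df))                   # 2 rows survive
```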

Next, preprocessing is done using LabelEncoder() from sklearn.preprocessing to map recording_mbids, artist_mbids, release_mbids and genres to numbers for ease of use.

from sklearn.preprocessing import LabelEncoder

genres = LabelEncoder()
userName = LabelEncoder()
artistmbids = LabelEncoder()
releasembid = LabelEncoder()
recordingmbid = LabelEncoder()

df['genres_encoding'] = genres.fit_transform(df['genres'])
df['user_name_encoding'] = userName.fit_transform(df['user_name'])
df['artist_encoding'] = artistmbids.fit_transform(df['artist_mbids'])
df['recording_encoding'] = recordingmbid.fit_transform(df['recording_mbid'])
df['release_encoding'] = releasembid.fit_transform(df['release_mbid'])
df.head(20)

We can now drop unnecessary columns from the dataset using the drop() method.

The next step is to find the listen count of each song, which is required to build the KNN model. We have used the query() method on user_name_encoding and recording_encoding to build a listen-count matrix.
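One way to sketch the listen-count matrix A (with a hypothetical encoded listen log; the real code uses query(), but a plain count over rows yields the same matrix):

```python
import pandas as pd
import numpy as np

# Hypothetical encoded listen log: each row is one listen event.
df = pd.DataFrame({
    'user_name_encoding': [0, 0, 0, 1, 1, 2],
    'recording_encoding': [0, 0, 1, 1, 2, 0],
})

n_users = df['user_name_encoding'].nunique()
n_songs = df['recording_encoding'].nunique()

# Listen-count matrix A: A[u, s] = number of times user u listened to song s.
A = np.zeros((n_users, n_songs))
for _, row in df.iterrows():
    A[row['user_name_encoding'], row['recording_encoding']] += 1

print(A)
```

pd.crosstab(df['user_name_encoding'], df['recording_encoding']) would produce the same counts in one line, if you prefer a dataframe over a raw array.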

To find artist_count and album_count, we proceed as follows.

We then find the mean of album_count and artist_count using statistics.mean(), which we will use to compute user_rating_count in a later section.
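A sketch of how these counts and their means might be computed (the encodings below are hypothetical; value_counts() is one way to get per-artist and per-album listen totals):

```python
import pandas as pd
import statistics

# Hypothetical encoded data: one row per listen event.
df = pd.DataFrame({
    'artist_encoding':  [0, 0, 1, 1, 1, 2],
    'release_encoding': [0, 1, 1, 2, 2, 2],
})

# artist_count[a] = total listens attributed to artist a;
# album_count[r]  = total listens attributed to release (album) r.
artist_count = df['artist_encoding'].value_counts().to_dict()
album_count = df['release_encoding'].value_counts().to_dict()

mean_artist = statistics.mean(artist_count.values())
mean_album = statistics.mean(album_count.values())
print(mean_artist, mean_album)
```

On the real dataset these means come out to roughly 12.2677 for artist_count and 10.9134 for album_count, which are the constants used in the rating formula below.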

Build improved KNN algorithm:

After completing data preprocessing, we will build the recommendation engine. Here we are going to use an improved multidimensional KNN (K-Nearest Neighbors) algorithm. We will answer questions like: What is the KNN algorithm, and what is an improved KNN algorithm? What is the difference between them, and why are we using an improved KNN that adapts a baseline algorithm?

The KNN algorithm is used for both classification and regression problems. As the name says, we find the K nearest neighbors of a point based on some distance metric.

We can observe that clustering of the data takes place here. When a new data point is introduced, we figure out its K nearest neighbours and predict which cluster it belongs to.

In the improved KNN approach we use a baseline algorithm. This baseline helps us build ratings when ratings are not explicitly provided by the user. Improved KNN also incorporates the mean values from the baseline algorithm, helping us reduce the impact of extreme highs or lows in the ratings.

To find base_count we will use the listen_count matrix found previously.

base_count is the average over all the non-null entries of the matrix.
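As a sketch, with a tiny hypothetical listen-count matrix where 0 stands for "never listened":

```python
import numpy as np

# Hypothetical listen-count matrix; 0 marks songs a user never played.
A = np.array([
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, 0.0],
])

# base_count: mean over the non-zero (i.e. actually-listened) entries only.
nonzero = A[A > 0]
base_count = nonzero.mean()
print(base_count)  # (2+1+3+1+1)/5 = 1.6
```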

So, to begin with improved KNN, we first find user_rating_count, using base_rating = 2.5 and base_count = 2.0. We use the following formula, which combines the log normalisation of user_song_listen_count against base_count with the log normalisation of artist_count and album_count against their mean values, artist_count (12.2677) and album_count (10.9134):

base_rating + log_normalisation(user_song_listen_count,base_count) + log_normalisation(artist_count[ar_count],12.267686424474187) + log_normalisation(album_count[al_count],10.91342064977037)

Constructing the ratings matrix

from math import log10

def log_normalisation(val, base):
    # Log-normalise the amount by which val exceeds base; 0 otherwise.
    to_normalise = val - base
    if to_normalise > 0:
        return log10(to_normalise)
    else:
        return 0

user_song_rating = []
base_count = 2.0
base_rating = 2.5
for i in range(len(df)):  # 64160 rows in our data
    user_song_listen_count = A[df.iloc[i].user_name_encoding, df.iloc[i].recording_encoding]
    ar_count = df.iloc[i].artist_encoding
    al_count = df.iloc[i].release_encoding
    g_count = df.iloc[i].genres_encoding
    user_song_rating.append(base_rating
                            + log_normalisation(user_song_listen_count, base_count)
                            + log_normalisation(artist_count[ar_count], 12.267686424474187)
                            + log_normalisation(album_count[al_count], 10.91342064977037))

And then we attach the ratings to our dataframe:

df['ratings'] = user_song_rating
df.head()

After that, we normalize the rating data using the MinMaxScaler() method.

To implement KNN we have used the sklearn library with cosine similarity.
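Putting the two together, a minimal sketch (with a small hypothetical ratings matrix standing in for the real pivoted data) might look like:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings matrix (rows = songs, columns = users),
# as would be produced by pivoting the ratings column computed above.
ratings = np.array([
    [3.1, 0.0, 2.5],
    [2.9, 0.2, 2.4],
    [0.0, 4.0, 0.1],
    [0.1, 3.8, 0.0],
])

# Scale each column to [0, 1] so no single user dominates the distances.
scaled = MinMaxScaler().fit_transform(ratings)

# Brute-force KNN with cosine distance, used to find similar songs.
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(scaled)

# Nearest neighbours of song 0 (itself first, then the most similar song).
distances, indices = model_knn.kneighbors(scaled[0].reshape(1, -1), n_neighbors=2)
print(indices.flatten())
```

With cosine distance, songs 0 and 1 (liked by the same users) come out as neighbours, while songs 2 and 3 form their own cluster.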

That gives us a reduction in prediction error. And how do we determine the value of K? We run the model for various values of K and pick the one that performs best.

Test our model and make some recommendations:

In this step, the KNN algorithm measures distance to determine the "closeness" of instances. It then classifies an instance by finding its nearest neighbors and picking the most popular class among them.

query_index = np.random.choice(d4.shape[0])  # pick a random song row from the ratings matrix d4
distances, indices = model_knn.kneighbors(d4.iloc[query_index, :].values.reshape(1, -1), n_neighbors=12)
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(df.index[query_index]))
    else:
        print('{0}: {1}, song_name {2} with distance of {3}:'.format(
            i, df.index[indices.flatten()[i]], df['track_name'][i], distances.flatten()[i]))

Result
