Movie Recommendation
Keywords: k-nearest neighbor, pandas, numpy, sklearn, scipy
I will explain my approach to a movie recommendation system based on user ratings, in which I recommend N movies that are similar to a given movie. The recommendations are not personalized: the identity of the user is not an input at query time. The only input is the movie for which recommendations are wanted.
The source code can be found in Movie Recommendation.
I use the user rating data from the MovieLens dataset to compute similarities between movies. The dataset contains 25,000,095 ratings for 62,423 movies by 162,541 users. The ratings were created between January 09, 1995 and November 21, 2019, and the dataset was generated on November 21, 2019.
The ratings are made on a 5-star scale with 0.5 star increments (0.5 stars - 5.0 stars).
Files
The files I need from the MovieLens dataset are ratings.csv, in which user ratings are stored, and movies.csv, in which movie information is stored.
ratings.csv contains the following columns: userId, movieId, rating, timestamp. Of these, I use only userId, movieId, and rating; my approach does not depend on the timestamp at which a rating was given.
movies.csv contains the following columns: movieId, title, genres. Of these, I use only movieId and title, and the title column is used only for printing movie names. My approach does not use the genres column.
Approach
I embed each movie into a space of user ratings: each movie is described by the ratings it received from each user. If a user did not rate a particular movie (which is usually the case, as there are >60k movies), the rating is set to 0 to express that it does not exist.
In this embedding space, I apply the k-nearest neighbor algorithm to find the movies closest to the given movie.
Code
Model Fitting
I use pandas.DataFrame to store data, sklearn.neighbors.NearestNeighbors for KNN, and scipy.sparse.csr_matrix to store sparse matrices. First, import the necessary libraries.
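The original import cell is not reproduced here; a minimal set of imports covering the three classes named above would look like this:

```python
import pandas as pd                              # DataFrames for movies and ratings
from scipy.sparse import csr_matrix              # compressed sparse row storage
from sklearn.neighbors import NearestNeighbors   # unsupervised nearest-neighbor search
```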
Next, I read the movies.csv and ratings.csv files. These files can be found in the ml-25m.zip file of the MovieLens dataset.
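A sketch of the reading step, assuming the archive has been extracted to a local directory. The function name `load_movielens` and the directory argument are hypothetical and chosen only for illustration:

```python
from pathlib import Path

import pandas as pd


def load_movielens(data_dir):
    """Read movies.csv and ratings.csv from an extracted ml-25m directory.

    `data_dir` is an assumed location; point it at wherever the zip was unpacked.
    """
    data_dir = Path(data_dir)
    movies = pd.read_csv(data_dir / "movies.csv")
    ratings = pd.read_csv(data_dir / "ratings.csv")
    return movies, ratings
```

With the archive unpacked next to the script, a call such as `movies, ratings = load_movielens("ml-25m")` would return both frames.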
Then, I create copies of the pd.DataFrames to preserve the original ones throughout the code.
Since I do not use the timestamp column of ratings.csv, I drop that column.
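The copy and drop steps can be sketched on a toy frame as follows. The name `ratings_df` and the sample values are assumptions made for this example, not taken from the original code:

```python
import pandas as pd

# Toy stand-in for the contents of ratings.csv (values are made up).
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [1000, 1001, 1002],
})

# Work on a copy so the frame read from disk stays untouched.
ratings_df = ratings.copy()

# The approach ignores when a rating was given, so the column is dropped.
ratings_df = ratings_df.drop(columns=["timestamp"])
```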
Then, I drop movies that have very few ratings, since those movies do not have enough data to produce reliable recommendations. I chose a threshold of 100: each movie that has more than 100 ratings is part of the dataset that I generate recommendations for and from.
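One way to apply such a threshold is to count ratings per movie and keep only rows whose movie exceeds it. This is a sketch under the assumption of a strict "more than" comparison; the helper name `keep_well_rated_movies` is hypothetical:

```python
import pandas as pd


def keep_well_rated_movies(ratings_df, min_ratings=100):
    """Keep only ratings of movies that have more than `min_ratings` ratings."""
    # For every row, look up how many ratings its movie received in total.
    counts = ratings_df.groupby("movieId")["rating"].transform("count")
    return ratings_df[counts > min_ratings]


# Toy data with a threshold of 1 so the effect is visible on three rows.
toy = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [10, 10, 20],
    "rating": [4.0, 5.0, 3.0],
})
filtered = keep_well_rated_movies(toy, min_ratings=1)  # movie 20 is dropped
```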
Next, I take the inner join of the filtered ratings.csv and movies.csv data to generate the movies_and_ratings dataframe.
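The join can be sketched with `DataFrame.merge` on the shared movieId column. The frame names `ratings_df` and `movies_df` and the toy rows are assumptions for illustration:

```python
import pandas as pd

# Toy stand-ins for the filtered ratings and the movie list.
ratings_df = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [10, 10, 20],
    "rating": [4.0, 5.0, 3.0],
})
movies_df = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Drama", "Comedy", "Action"],
})

# Inner join: only movies that still have ratings after filtering survive.
movies_and_ratings = ratings_df.merge(
    movies_df[["movieId", "title"]], on="movieId", how="inner"
)
```

Movie 30 has no ratings in the toy data, so it does not appear in the result.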
Next, I create a pivot table of the movies_and_ratings dataframe to embed movies into the user rating space. Each row of the pivot table is a vector that represents a movie, and each column holds the ratings of a particular user.
Then, I replace NAs in the pivot table with 0s to represent the user-movie pairs for which no rating exists.
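The pivot and fill steps together might look like the sketch below, which uses `DataFrame.pivot` and assumes each user rated a movie at most once. The toy values are made up:

```python
import pandas as pd

movies_and_ratings = pd.DataFrame({
    "movieId": [10, 10, 20],
    "userId": [1, 2, 1],
    "rating": [4.0, 5.0, 3.0],
})

# Rows are movies, columns are users, cells are ratings (NaN where missing).
movies_and_features = movies_and_ratings.pivot(
    index="movieId", columns="userId", values="rating"
)

# A missing rating is encoded as 0, as described above.
movies_and_features = movies_and_features.fillna(0)
```

In this toy case user 2 never rated movie 20, so that cell ends up as 0.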
Since only a very small fraction of users rate any given movie (there are >60k movies), the resulting movies_and_features dataframe is sparse. To make operations on the movies_and_features dataframe fast, I convert it to a sparse representation using scipy.sparse.csr_matrix.
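The conversion is a single constructor call on the underlying array. In this sketch the variable name `movie_matrix` is an assumption:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy movie-by-user table: 2 movies, 3 users, zeros mark missing ratings.
movies_and_features = pd.DataFrame(
    [[4.0, 5.0, 0.0], [3.0, 0.0, 0.0]],
    index=[10, 20],
    columns=[1, 2, 3],
)

# Only the non-zero ratings are stored explicitly in the CSR matrix.
movie_matrix = csr_matrix(movies_and_features.values)
```

Here the matrix keeps its 2 x 3 shape but stores only the three non-zero ratings.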
I fit a KNN model with K=6 to the resulting embedding of movies, using the Minkowski metric with p=2, which reduces to Euclidean distance.
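A sketch of the fitting step on a small made-up matrix. The parameters mirror the ones stated above; the name `knn` and the toy ratings are assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy embedding: 7 movies (rows) rated by 4 users (columns).
movie_matrix = csr_matrix(np.array([
    [5, 4, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 4, 5],
    [0, 1, 5, 5],
    [3, 0, 3, 0],
    [0, 2, 0, 4],
    [1, 0, 0, 0],
], dtype=float))

# K=6 with Minkowski p=2, i.e. Euclidean distance.
knn = NearestNeighbors(n_neighbors=6, metric="minkowski", p=2)
knn.fit(movie_matrix)

# Query the first movie: the result includes the movie itself at distance 0.
distances, indices = knn.kneighbors(movie_matrix[0])
```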
Recommendation
Using the learned KNN model, I recommend movies for each given movie using its nearest neighbors. The closest movie to any given movie is itself. Therefore, with K=6, I recommend 5 different movies for each given movie.
In an infinite loop, I read a movie ID as input and print that movie's closest neighbors.
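The query step could be sketched as below. The helpers `recommend` and `run_query_loop`, the `titles` mapping, and the toy data are all hypothetical and stand in for the original code, which is not shown here:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def recommend(movie_id, movies_and_features, titles, knn):
    """Return titles of the nearest neighbors of `movie_id`, excluding itself."""
    query = csr_matrix(movies_and_features.loc[[movie_id]].values)
    _, indices = knn.kneighbors(query)
    neighbor_ids = movies_and_features.index[indices[0]]
    return [titles[m] for m in neighbor_ids if m != movie_id]


def run_query_loop(movies_and_features, titles, knn):
    """Read movie IDs forever and print recommendations (stop with Ctrl+C)."""
    while True:
        movie_id = int(input("Movie ID: "))
        if movie_id not in movies_and_features.index:
            print("Unknown or filtered-out movie ID.")
            continue
        print(recommend(movie_id, movies_and_features, titles, knn))


# Toy data: 7 movies (index = movieId) rated by 4 users.
movies_and_features = pd.DataFrame(
    [[5, 4, 0, 0], [5, 5, 0, 0], [0, 0, 4, 5], [0, 1, 5, 5],
     [3, 0, 3, 0], [0, 2, 0, 4], [1, 0, 0, 0]],
    index=[1, 2, 3, 4, 5, 6, 7], columns=[101, 102, 103, 104], dtype=float,
)
titles = {i: f"Movie {c}" for i, c in zip(movies_and_features.index, "ABCDEFG")}

knn = NearestNeighbors(n_neighbors=6, metric="minkowski", p=2)
knn.fit(csr_matrix(movies_and_features.values))

recommendations = recommend(1, movies_and_features, titles, knn)
```

Filtering by movie ID rather than dropping the first result keeps the output correct even if another movie has an identical rating vector and ties at distance 0. The loop is defined but not started here, so the sketch runs without waiting for input.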
Example Recommendations
Here, I list several example recommendations.
Input: Toy Story (1995)
Recommendations: Toy Story 2 (1999), Mission: Impossible (1996), Independence Day (a.k.a. ID4) (1996), Willy Wonka & Chocolate Factory (1971), Bug’s Life, A (1998)
Input: Batman vs. Robin (2015)
Recommendations: Batman: Bad Blood (2016), Justice League Dark (2017), Justice League vs. Teen Titans (2016), Justice League: Gods vs Monsters (2015), Justice League: Throne of Atlantis (2015)
Input: Lord of the Rings: The Fellowship of the Ring, The (2001)
Recommendations: Lord of the Rings: The Return of the King, The (2003), Lord of the Rings: The Two Towers, The (2002), Pirates of the Caribbean: The Curse of the Black Pearl (2003), Spider-Man (2002), Shrek (2001)
Conclusion
The system recommends movies that are related to the input movie. However, there seems to be a period bias: movies that are similar to each other and were released in the same period get recommended more often, while movies that are similar but were released in different periods do not get recommended. For example, in the last example, while recommending other Lord of the Rings movies is good, recommending “Hobbit: An Unexpected Journey, The (2012)” instead of “Shrek (2001)” would have been better.
However, this result is expected, since I make recommendations using only user ratings, and most users rate movies during a specific period of their lives; only a handful of users rate movies consistently over decades.