Keywords: k-nearest neighbor, pandas, numpy, sklearn, scipy

I will explain my approach to a movie recommendation system based on user ratings, in which I recommend N movies that are similar to a given movie. The recommendations are not personalized, in the sense that the identity of the user is not used as an input at query time. The only input is the movie for which recommendations are wanted.

The source code can be found in Movie Recommendation.

I use the user rating data from the MovieLens database to compute similarities between different movies. The MovieLens 25M dataset contains 25,000,095 ratings for 62,423 movies by 162,541 users, collected between January 09, 1995 and November 21, 2019. The dataset was generated on November 21, 2019.

The ratings are made on a 5-star scale with 0.5 star increments (0.5 stars - 5.0 stars).

Files

The necessary files for my approach from MovieLens database are ratings.csv, in which user ratings are stored, and movies.csv, in which movie information is stored.

ratings.csv contains the following columns: userId, movieId, rating, timestamp. Among these, I utilize only the userId, movieId, and rating columns; my approach does not depend on the timestamp at which a rating was given.

movies.csv contains the following columns: movieId, title, genres. Among these, I utilize only the movieId and title columns, and the title column is only used for outputting movie names. My approach does not utilize the genres column.

Approach

I embed each movie into a space of user ratings. Each movie is described by its ratings from each user. If a user did not rate a particular movie (which is generally the case, as there are >60k movies), the rating is set to 0 to express that it does not exist.

In this embedding space, I apply the k-nearest neighbors algorithm to find the movies closest to the given movie.
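The idea can be sketched on toy data: each row is a movie, each column is a user, and entries are ratings (0 where no rating exists). The ratings below are hypothetical, made up purely for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 4 movies rated by 3 users (hypothetical ratings, 0 = no rating)
ratings = np.array([
    [5.0, 4.0, 0.0],   # movie 0
    [4.5, 4.0, 0.5],   # movie 1: rated similarly to movie 0
    [0.0, 1.0, 5.0],   # movie 2
    [0.5, 0.0, 4.5],   # movie 3: rated similarly to movie 2
])

# Euclidean KNN in this rating space
knn = NearestNeighbors(n_neighbors=2, metric='minkowski', p=2)
knn.fit(ratings)

# the nearest neighbor of movie 0 is itself (distance 0);
# the second nearest is movie 1, which was rated similarly
distances, indices = knn.kneighbors([ratings[0]])
print(indices[0])  # [0 1]
```

Movies rated highly by the same users end up close together, which is exactly the notion of similarity the full system relies on.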

Code

Model Fitting

I use pandas.DataFrame to store data, sklearn.neighbors.NearestNeighbors for KNN, and scipy.sparse.csr_matrix to store sparse matrices. First, import the necessary libraries.

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

Next, I read the movies.csv and ratings.csv files. These files can be found in the ml-25m.zip file of the MovieLens dataset.

# read the files
print("reading files...")
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

Then, I create copies of the pd.DataFrames to preserve the originals throughout the code.

# make copies
print("copying files...")
ratings_copy = ratings.copy()
movies_copy = movies.copy()

Since I do not utilize the timestamp column of ratings.csv, I drop that column.

# 'timestamp' column is not necessary
print("dropping timestamp...")
ratings_copy.drop("timestamp", axis = 1, inplace = True)

Then, I drop movies that have very few ratings, since those movies do not have enough data for meaningful recommendations. I chose a threshold of 100: each movie with at least 100 ratings is part of the dataset I generate recommendations for and from.

# top_movie_ids are the IDs of movies that have at least 100 ratings
print("removing movies with fewer than 100 ratings...")
rating_counts = ratings_copy['movieId'].value_counts()
top_movie_ids = rating_counts[rating_counts >= 100].index
top_movie_ratings = ratings_copy.loc[ratings_copy["movieId"].isin(top_movie_ids)]
top_movies = movies_copy.loc[movies_copy["movieId"].isin(top_movie_ids)]

Next, I take the inner join of the filtered ratings and movies dataframes to generate the movies_and_ratings dataframe.

# Merge movies and ratings dataframes
print("merging movies and ratings...")
movies_and_ratings = pd.merge(top_movies,top_movie_ratings, on = 'movieId')

Next, I create a pivot table of the movies_and_ratings dataframe, to embed movies to the user rating space. Each row of the pivot table is a vector that represents a movie where each column is the rating of a particular user.

# Create pivot table
print("creating pivot table...")
movies_and_features = movies_and_ratings.pivot(index = 'movieId', columns = 'userId', values = 'rating')

Then, I replace NAs in the pivot table with 0s to represent the user-movie pairs for which no rating exists.

# Replace NAs with 0s
print("replacing NAs with 0s...")
movies_and_features.fillna(0, inplace = True)

Since for each movie a very small number of users give ratings (because there are >60k movies), the resulting movies_and_features dataframe is sparse. To make the operations that work on movies_and_features dataframe fast, I convert it to a sparse representation using scipy.sparse.csr_matrix.

# movies_and_features is a sparse matrix.
print("converting to csr_matrix...")
mat_movies_and_features = csr_matrix(movies_and_features.values)

I fit a KNN model with K=6 to the resulting embedding of movies, using Minkowski metric with p=2, which reduces to Euclidean distance.

# Fit KNN model with k = 6
print("fitting model...")
neigh = NearestNeighbors(n_neighbors = 6, metric = 'minkowski', p = 2)
neigh.fit(mat_movies_and_features)

Recommendation

Using the fitted KNN model, I recommend movies for each given movie using its nearest neighbors. The closest movie to any given movie is itself; therefore, with K=6, I recommend 5 distinct movies for each given movie.

In an infinite loop, I take movie ID inputs and print the closest neighbors of each.

while True:
    # Get movie id as input
    movie_id = input('movie_id: ')
    movie_id = int(movie_id.strip())

    if movie_id in top_movie_ids:
        # get the feature vector (ratings) of the movie
        movie = movies_and_features.loc[movie_id, :]

        # Get the k nearest neighbors of the movie.
        # scores holds the distances to the nearest movies in ascending order,
        # and indices holds the positions of those movies as found by KNN.
        scores, indices = neigh.kneighbors([movie])

        print("recommended movie scores: %s " % (scores))

        indices = indices[0]

        # get closest movie ids from indices
        closest_movie_ids = movies_and_features.index.values[indices]
        
        # print closest movies
        print(movies[movies["movieId"].isin(closest_movie_ids)][["movieId", "title"]])
    else:
        print("movie_id", movie_id, "does not exist")

Example Recommendations

Here, I list several example recommendations.

Input: Toy Story (1995)
Recommendations: Toy Story 2 (1999), Mission: Impossible (1996), Independence Day (a.k.a. ID4) (1996), Willy Wonka & the Chocolate Factory (1971), Bug's Life, A (1998)

Input: Batman vs. Robin (2015)
Recommendations: Batman: Bad Blood (2016), Justice League Dark (2017), Justice League vs. Teen Titans (2016), Justice League: Gods and Monsters (2015), Justice League: Throne of Atlantis (2015)

Input: Lord of the Rings: The Fellowship of the Ring, The (2001)
Recommendations: Lord of the Rings: The Return of the King, The (2003), Lord of the Rings: The Two Towers, The (2002), Pirates of the Caribbean: The Curse of the Black Pearl (2003), Spider-Man (2002), Shrek (2001)

Conclusion

The system recommends movies that are related to the input movie. However, there seems to be a period bias: movies that are similar to each other and were released in similar time periods get recommended more often, while movies that are similar to each other but released in different time periods do not. For example, in the last case, recommending the other Lord of the Rings movies is good, but recommending "Hobbit: An Unexpected Journey, The (2012)" instead of "Shrek (2001)" would have been better.

However, this result is expected, since I make recommendations using only user ratings, and most users rate movies during a specific period of their lives; only a handful of users rate movies consistently over decades.
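One way to probe this hypothesis is to compare when each movie's ratings were actually given, using the timestamp column that the recommender itself discards. The sketch below uses a tiny hypothetical dataframe in place of the real ratings.csv; the same two lines would apply to the full data.

```python
import pandas as pd

# Toy ratings with Unix timestamps (hypothetical data): movies 1 and 2
# were mostly rated around 2000-2001, movie 3 around 2015-2016.
ratings = pd.DataFrame({
    'movieId':   [1, 1, 2, 2, 3, 3],
    'timestamp': [946684800, 978307200,     # 2000, 2001
                  946684800, 978307200,     # 2000, 2001
                  1420070400, 1451606400],  # 2015, 2016
})

# Median rating year per movie: a rough proxy for when each movie's
# audience was active. If recommended movies cluster around the input
# movie's median rating year, that supports the period-bias hypothesis.
years = pd.to_datetime(ratings['timestamp'], unit='s').dt.year
median_year = years.groupby(ratings['movieId']).median()
print(median_year)
```

On the real data, comparing the median rating year of an input movie against those of its recommendations would quantify how strongly the rating periods overlap.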