WatchThis (A Movie Recommender)

1. Problem Statement

Have you ever wanted to watch a movie in the comfort of your home, but stopped short just because nothing came to mind? Or, during a stayover with friends, have differing movie preferences led to a wild goose chase for a decision that never materialized?

Yes, there are a few movie recommender resources around. They are either complicated to use (i.e. you need to set up an account, enter a bunch of user preferences etc.), or too simple and do not allow any flexibility in the input preferences.

2. Natural Language Processing (NLP)

This project is heavily reliant on Natural Language Processing or NLP, so let us understand what NLP is all about.

NLP is broadly defined as the processing or manipulation of natural language, which can take the form of speech, text etc. It is very challenging to make “useful” sense of such information, as it is messy and can be illogical at times.

For the purpose of this project, we will compare how “similar” the movies’ plots, casts and directors are, and recommend movies accordingly. More details will be discussed as we go along.

3. Dataset

The dataset is from GroupLens Research 20M. It is a stable benchmark dataset.

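# Import the libraries used throughout, then load the data.
import numpy as np
import pandas as pd
from ast import literal_eval
movies = pd.read_csv("./movies-dataset/movies_metadata.csv")
movies.head(20)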

After some data cleaning, we are good to go!

4. Baseline Model: Simple Recommender

The Simple Recommender is a baseline model which offers simple recommendations based on a movie’s weighted rating. The top few movies by this rating are then displayed.

In addition, the user can pass in a genre argument as an extra option.

# extract the genres
movies['genres'] = movies['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
movies.genres.head(5)
0     [Animation, Comedy, Family]
1    [Adventure, Fantasy, Family]
2               [Romance, Comedy]
3        [Comedy, Drama, Romance]
4                        [Comedy]
Name: genres, dtype: object
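To make the extraction step above concrete, here is a minimal sketch of what happens to a single cell of the genres column. The `parse_genres` helper and the sample cell are hypothetical and only mirror the shape of the data; the sketch assumes each raw value is a stringified list of dicts with a 'name' key.

```python
from ast import literal_eval

def parse_genres(raw):
    """Turn a stringified list of {'id', 'name'} dicts into a list of names."""
    if not isinstance(raw, str):
        return []                      # missing value (e.g. NaN) -> no genres
    try:
        parsed = literal_eval(raw)     # safely evaluates Python literals only
    except (ValueError, SyntaxError):
        return []                      # malformed cell -> no genres
    if not isinstance(parsed, list):
        return []
    return [g['name'] for g in parsed if isinstance(g, dict) and 'name' in g]

# Hypothetical raw cell, shaped like the values in the 'genres' column
raw_cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(raw_cell))   # ['Animation', 'Comedy']
print(parse_genres(None))       # []
```

literal_eval is used rather than eval because it only accepts literals, so a stray expression in the CSV cannot execute code.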

I will use a weighted rating that takes into account both the average rating and the number of votes for the movie. Hence, a movie with an 8 rating from, say, 50k voters will score higher than a movie with the same rating but fewer voters.

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where,

  • v is the number of votes for the movie
  • m is the minimum votes required to be listed in the chart
  • R is the average rating of the movie
  • C is the mean rating across all movies

m is an arbitrary cutoff used to filter out movies that have too few votes.

We will use the 90th percentile as the cutoff. Hence, for a movie to feature in the list, it must have at least as many votes as 90% of the movies in the dataset.
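As a quick sanity check of the weighting idea, here is a small worked example. It is a sketch only: m is set to 160 (the cutoff this dataset yields), while C = 5.6 is an assumed global mean rating chosen for illustration, and the two vote counts are made up.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by votes."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 160.0   # vote-count cutoff
C = 5.6     # ASSUMED global mean rating, for illustration only

# Two hypothetical movies with the same average rating but different vote counts
popular = weighted_rating(v=50_000, R=8.0, m=m, C=C)
niche = weighted_rating(v=200, R=8.0, m=m, C=C)

print(round(popular, 3), round(niche, 3))   # 7.992 6.933
```

With many votes the score stays close to the movie's own average; with few votes it is pulled towards the global mean.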

# Minimum number of votes required to be in the chart
m = movies['vote_count'].quantile(0.90)
print(m)
160.0
# Filter out all qualified movies into a new df
q_movies = movies.copy().loc[movies['vote_count'] >= m]
q_movies.shape
(4555, 18)

There are 4555 movies which qualify to be in this list.
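For readers curious about what the 90th percentile cutoff does, here is a minimal pure-Python sketch. It assumes linear interpolation between ranks, which is the default behaviour of pandas' quantile; the `quantile_linear` helper and the vote counts are hypothetical.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)               # fractional position in the sorted list
    lower = int(pos)                           # index at or just below that position
    upper = min(lower + 1, len(ordered) - 1)   # next index, clamped to the end
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Hypothetical vote counts for ten movies
vote_counts = [3, 5, 8, 12, 20, 35, 60, 110, 160, 900]
cutoff = quantile_linear(vote_counts, 0.90)

# Keep only the movies whose vote count reaches the cutoff
qualified = [v for v in vote_counts if v >= cutoff]
print(cutoff, qualified)
```

Only the most-voted movie survives here, which shows how aggressively a 90% cutoff trims a long-tailed list.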

q_movies.head(3)
budget genres id imdb_id original_language original_title overview popularity production_companies release_date revenue runtime spoken_languages status title vote_average vote_count year
0 30000000 [Animation, Comedy, Family] 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.9469 [{'name': 'Pixar Animation Studios', 'id': 3}] 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Toy Story 7.7 5415.0 1995
1 65000000 [Adventure, Fantasy, Family] 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... 17.0155 [{'name': 'TriStar Pictures', 'id': 559}, {'na... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Jumanji 6.9 2413.0 1995
4 0 [Comedy] 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... 8.38752 [{'name': 'Sandollar Productions', 'id': 5842}... 1995-02-10 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Father of the Bride Part II 5.7 173.0 1995
# C is the mean rating across all movies
C = movies['vote_average'].mean()

# Compute the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

4.1 Top movies by weighted rating

#Sort movies based on score calculated
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 15 movies
q_movies[['title', 'vote_count', 'vote_average', 'score','genres']].head(15)
title vote_count vote_average score genres
314 The Shawshank Redemption 8358.0 8.5 8.445869 [Drama, Crime]
834 The Godfather 6024.0 8.5 8.425439 [Drama, Crime]
10309 Dilwale Dulhania Le Jayenge 661.0 9.1 8.421453 [Comedy, Drama, Romance]
12481 The Dark Knight 12269.0 8.3 8.265477 [Drama, Action, Crime, Thriller]
2843 Fight Club 9678.0 8.3 8.256385 [Drama]
292 Pulp Fiction 8670.0 8.3 8.251406 [Thriller, Crime]
522 Schindler's List 4436.0 8.3 8.206639 [Drama, History, War]
23673 Whiplash 4376.0 8.3 8.205404 [Drama]
5481 Spirited Away 3968.0 8.3 8.196055 [Fantasy, Adventure, Animation, Family]
2211 Life Is Beautiful 3643.0 8.3 8.187171 [Comedy, Drama]
1178 The Godfather: Part II 3418.0 8.3 8.180076 [Drama, Crime]
1152 One Flew Over the Cuckoo's Nest 3001.0 8.3 8.164256 [Drama]
351 Forrest Gump 8147.0 8.2 8.150272 [Comedy, Drama, Romance]
1154 The Empire Strikes Back 5998.0 8.2 8.132919 [Adventure, Action, Science Fiction]
1176 Psycho 2405.0 8.3 8.132715 [Drama, Horror, Thriller]

4.2 Top movies by genre

# To split the movies into one genre per row
s = movies.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
genre_movies = movies.drop('genres', axis=1).join(s)
genre_movies.head(2)
budget id imdb_id original_language original_title overview popularity production_companies release_date revenue runtime spoken_languages status title vote_average vote_count year genre
0 30000000 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.9469 [{'name': 'Pixar Animation Studios', 'id': 3}] 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Toy Story 7.7 5415.0 1995 Animation
0 30000000 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.9469 [{'name': 'Pixar Animation Studios', 'id': 3}] 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Toy Story 7.7 5415.0 1995 Comedy
def genre_chart(genre):
    # Keep only the rows for the requested genre
    df = genre_movies[genre_movies['genre'] == genre]

    # Apply the same vote-count cutoff m as the overall chart
    qualified = df[(df['vote_count'] >= m) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    # Note: casting to int truncates the average (e.g. 8.3 -> 8) in the displayed chart
    qualified['vote_average'] = qualified['vote_average'].astype('int')

    # Reuse the scores already computed for q_movies; rows are matched on the DataFrame index
    qualified['score'] = q_movies['score']
    qualified = qualified.sort_values('score', ascending=False).head(250)

    return qualified
genre_chart('War').head(10)
title year vote_count vote_average popularity score
522 Schindler's List 1993 4436 8 41.7251 8.206639
24860 The Imitation Game 2014 5895 8 31.5959 7.937062
5857 The Pianist 2002 1927 8 14.8116 7.909733
13605 Inglourious Basterds 2009 6598 7 16.8956 7.845977
5553 Grave of the Fireflies 1988 974 8 0.010902 7.835726
1165 Apocalypse Now 1979 2112 8 13.5963 7.832268
1919 Saving Private Ryan 1998 5148 7 21.7581 7.831220
1179 Full Metal Jacket 1987 2595 7 13.9415 7.767482
732 Dr. Strangelove or: How I Learned to Stop Worr... 1964 1472 8 9.80398 7.766491
43190 Band of Brothers 2001 725 8 7.903731 7.733235
genre_chart('Romance').head(10)
title year vote_count vote_average popularity score
10309 Dilwale Dulhania Le Jayenge 1995 661 9 34.457 8.421453
351 Forrest Gump 1994 8147 8 48.3072 8.150272
40251 Your Name. 2016 1030 8 34.461252 8.112532
40882 La La Land 2016 4745 7 19.681686 7.825568
22168 Her 2013 4215 7 13.8295 7.816552
7208 Eternal Sunshine of the Spotless Mind 2004 3758 7 12.9063 7.806818
1132 Cinema Paradiso 1988 834 8 14.177 7.784420
876 Vertigo 1958 1162 8 18.2082 7.711735
4843 Amélie 2001 3403 7 12.8794 7.702024
24982 The Theory of Everything 2014 3403 7 11.853 7.702024
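The one-genre-per-row reshaping behind these charts can be pictured with a tiny pure-Python sketch. The `explode_genres` helper is hypothetical; the two rows are taken from the genres output shown earlier.

```python
# Miniature of the movies table: one row per movie, genres as a list
movies_small = [
    {'title': 'Toy Story', 'genres': ['Animation', 'Comedy', 'Family']},
    {'title': 'Jumanji', 'genres': ['Adventure', 'Fantasy', 'Family']},
]

def explode_genres(rows):
    """Yield one row per (movie, genre) pair, so each row holds a single genre."""
    for row in rows:
        for genre in row['genres']:
            yield {'title': row['title'], 'genre': genre}

genre_rows = list(explode_genres(movies_small))

# Filtering by genre is now a simple equality test
family = [r['title'] for r in genre_rows if r['genre'] == 'Family']
print(len(genre_rows), family)   # 6 ['Toy Story', 'Jumanji']
```

A movie with several genres appears once per genre, which is why the same title can show up in more than one genre chart.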

5. Content Based Recommender (Movie Description)

This part recommends movies that are similar to a particular movie in terms of movie description. It computes the pairwise similarity scores for all movies based on their plot descriptions and recommends movies based on that similarity score.

movies['overview'].head(3)
0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
Name: overview, dtype: object
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object. Remove all english stop words.
vect_1 = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = vect_1.fit_transform(movies['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape
(45466, 75827)

Use the cosine similarity to denote the similarity between two movies.

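# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by default,
# so the plain dot product from linear_kernel equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)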
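To show what TF-IDF and cosine similarity are doing under the hood, here is a toy pure-Python sketch. It is not how scikit-learn computes things exactly: TfidfVectorizer applies its own tokenisation, stop-word removal and idf smoothing, so its numbers will differ. The weighting below (tf × (log(N/df) + 1)), the helper names and the three plots are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Simplified weighting (an assumption of this sketch): tf * (log(N/df) + 1)
        weights = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit-length vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical one-line plots
plots = [
    "mutants fight to protect humans",
    "mutants and humans go to war",
    "a chef opens a restaurant in paris",
]
vecs = tfidf_vectors(plots)
print(dot(vecs[0], vecs[1]) > dot(vecs[0], vecs[2]))   # True
```

Because every vector is scaled to unit length, a plain dot product already gives the cosine similarity, which is the reason linear_kernel can stand in for an explicit cosine computation on a TF-IDF matrix.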

Define a function that takes in a movie title as an input and outputs a list of the 8 most similar movies.

#Construct a reverse map of indices and movie titles
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
# Function that takes in movie title as input and outputs most similar movies

def get_recommendations(title, cosine_sim=cosine_sim):
    
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 8 most similar movies
    sim_scores = sim_scores[1:9]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 8 most similar movies
    return movies['title'].iloc[movie_indices]
get_recommendations('X-Men')
32251                              Superman
20806                    Hulk vs. Wolverine
23472    Mission: Impossible - Rogue Nation
13635              X-Men Origins: Wolverine
6195                                     X2
30067                               Holiday
21296                         The Wolverine
38010                              Sharkman
Name: title, dtype: object
get_recommendations('Mission: Impossible - Ghost Protocol')
23472         Mission: Impossible - Rogue Nation
10952                    Mission: Impossible III
3501                      Mission: Impossible II
19275    The President's Man: A Line in the Sand
26633                          A Dangerous Place
18674                               Act of Valor
15886                  My Girlfriend's Boyfriend
33441                             Swat: Unit 887
Name: title, dtype: object
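The lookup step in get_recommendations can also be sketched on a tiny example. The function above sorts the whole row and slices [1:9], which assumes the movie itself always ranks first. The variant below, with a hypothetical `top_k_similar` helper and a made-up 4×4 similarity matrix, excludes the movie by index instead, which may be slightly more robust when another movie has an identical description.

```python
import heapq

# Hypothetical similarity matrix for four movies (row i = similarity of i to all)
titles = ['A', 'B', 'C', 'D']
sim = [
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
]

def top_k_similar(idx, k):
    """Return the k titles most similar to movie idx, excluding the movie itself."""
    candidates = ((score, j) for j, score in enumerate(sim[idx]) if j != idx)
    best = heapq.nlargest(k, candidates)   # avoids sorting the whole row
    return [titles[j] for _, j in best]

print(top_k_similar(0, 2))   # ['C', 'D']
```

With tens of thousands of movies per row, picking the top few with heapq.nlargest does less work than a full sort, although either approach gives the same ranking.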

5.1 Content Based Recommender (Other Parameters)

The recommendations above seem reasonable given that they are based only on similar movie descriptions. However, some users might like a movie because of its cast, director and/or genre. Hence, the model will be improved using these added features.

# Load credits
credits = pd.read_csv("./movies-dataset/credits.csv")

# Remove rows with bad IDs.
movies = movies.drop([19730, 29503, 35587])

# Convert IDs to integers for merging
credits['id'] = credits['id'].astype('int')
movies['id'] = movies['id'].astype('int')
# Merge credits into movies dataframe
movies = movies.merge(credits, on='id')

From the merged dataframe, the scope of features will be defined as such:

Crew: Only the Director will be selected, as I feel the director’s vision contributes most to the movie.

Cast: Most movies have a mixture of better known and lesser known actors and actresses. Hence, I will keep only the first 3 names in the cast list.

movies['cast'] = movies['cast'].apply(literal_eval)
movies['crew'] = movies['crew'].apply(literal_eval)
# movies['cast_size'] = movies['cast'].apply(lambda x: len(x))
# movies['crew_size'] = movies['crew'].apply(lambda x: len(x))
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
# Define new director, cast and genres features that are in a suitable form.
movies['director'] = movies['crew'].apply(get_director)
# features = ['cast', 'genres']
features = ['cast']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)
# Print the new features of the first 3 films
movies[['title', 'cast', 'director', 'genres']].head(3)
title cast director genres
0 Toy Story [Tom Hanks, Tim Allen, Don Rickles] John Lasseter [Animation, Comedy, Family]
1 Jumanji [Robin Williams, Jonathan Hyde, Kirsten Dunst] Joe Johnston [Adventure, Fantasy, Family]
2 Grumpier Old Men [Walter Matthau, Jack Lemmon, Ann-Margret] Howard Deutch [Romance, Comedy]
# Function to convert all strings to lower case and strip names of spaces

def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
print(movies['cast'].iloc[0])
['Tom Hanks', 'Tim Allen', 'Don Rickles']
print (movies['director'].head())
0      John Lasseter
1       Joe Johnston
2      Howard Deutch
3    Forest Whitaker
4      Charles Shyer
Name: director, dtype: object
print (movies['genres'].head())
0     [Animation, Comedy, Family]
1    [Adventure, Fantasy, Family]
2               [Romance, Comedy]
3        [Comedy, Drama, Romance]
4                        [Comedy]
Name: genres, dtype: object
# Apply clean_data function to your features.
# features = ['cast', 'director', 'genres']
features = ['director', 'genres']

for feature in features:
    movies[feature] = movies[feature].apply(clean_data)
def create_soup(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
# Create a new soup feature
movies['soup'] = movies.apply(create_soup, axis=1)
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

vect_2 = CountVectorizer(stop_words='english')
count_matrix = vect_2.fit_transform(movies['soup'])
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
# Reset index of your main DataFrame and construct reverse mapping as before
movies = movies.reset_index()
indices = pd.Series(movies.index, index=movies['title'])
get_recommendations('The Dark Knight Rises', cosine_sim2)
12525      The Dark Knight
10158        Batman Begins
11399         The Prestige
23964            Quicksand
516      Romeo Is Bleeding
8990        State of Grace
11460          Harsh Times
14977          Harry Brown
Name: title, dtype: object
get_recommendations('The Godfather', cosine_sim2)
1187      The Godfather: Part II
1922     The Godfather: Part III
3996            Gardens of Stone
3145            Scent of a Woman
15503            The Rain People
1174              Apocalypse Now
1844           On the Waterfront
5281                 The Gambler
Name: title, dtype: object
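One detail worth illustrating is why clean_data strips the spaces out of names before they go into the soup. The sketch below is a pure-Python stand-in for the count-based cosine similarity, with hypothetical helpers and made-up casts that share no person; the cast-style names are used purely for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return shared / norm if norm else 0.0

def soup(names, strip_spaces):
    """Join names into one string, optionally fusing each name into a single token."""
    if strip_spaces:
        names = [n.replace(' ', '') for n in names]
    return Counter(' '.join(names).lower().split())

# Hypothetical casts with no person in common
movie_a = ['Tom Hanks', 'Tim Allen']
movie_b = ['Tom Cruise', 'Tim Roth']

print(cosine(soup(movie_a, False), soup(movie_b, False)))  # 0.5
print(cosine(soup(movie_a, True), soup(movie_b, True)))    # 0.0
```

Without stripping, the shared first names "tom" and "tim" make two unrelated casts look half similar; fusing each name into one token removes that false match.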

6. Prediction Of Ratings (Collaborative Filtering)

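For this part, I will attempt to predict how a user will rate a recommended movie (presuming he or she has not seen it, or at least has not rated it, before).

# These calls (evaluate, data.split, svd.train) follow the older Surprise API.
from surprise import Reader, Dataset, SVD, evaluate
reader = Reader()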
ratings3 = pd.read_csv("./movies-dataset/ratings_small.csv")
ratings3.head()
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205
data = Dataset.load_from_df(ratings3[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])
Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8947
MAE:  0.6894
------------
Fold 2
RMSE: 0.8995
MAE:  0.6926
------------
Fold 3
RMSE: 0.8910
MAE:  0.6868
------------
Fold 4
RMSE: 0.8996
MAE:  0.6917
------------
Fold 5
RMSE: 0.8991
MAE:  0.6929
------------
------------
Mean RMSE: 0.8968
Mean MAE : 0.6907
------------
------------
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.68937359473806903,
                             0.69259939130111503,
                             0.68678665677980999,
                             0.69169120460418154,
                             0.69285620150031413],
                            'rmse': [0.89472414482943841,
                             0.89948598218998499,
                             0.89096153777913623,
                             0.8996171912501465,
                             0.89907130432515781]})
trainset = data.build_full_trainset()
svd.train(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11704a908>
ratings3[ratings3['userId'] == 1]
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205
5 1 1263 2.0 1260759151
6 1 1287 2.0 1260759187
7 1 1293 2.0 1260759148
8 1 1339 3.5 1260759125
9 1 1343 2.0 1260759131
10 1 1371 2.5 1260759135
11 1 1405 1.0 1260759203
12 1 1953 4.0 1260759191
13 1 2105 4.0 1260759139
14 1 2150 3.0 1260759194
15 1 2193 2.0 1260759198
16 1 2294 2.0 1260759108
17 1 2455 2.5 1260759113
18 1 2968 1.0 1260759200
19 1 3671 3.0 1260759117
svd.predict(1, 31)
Prediction(uid=1, iid=31, r_ui=None, est=2.5823946941598028, details={'was_impossible': False})
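For reference, the two error measures reported per fold can be computed by hand. This is a minimal sketch with made-up ratings and estimates, not the Surprise implementation; `rmse` and `mae` are hypothetical helper names.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model estimates for four (user, movie) pairs
actual = [2.5, 3.0, 4.0, 1.0]
predicted = [3.0, 3.0, 3.0, 2.0]
print(rmse(actual, predicted), mae(actual, predicted))   # 0.75 0.625
```

Read this way, a mean RMSE of about 0.90 means the estimates are typically off by a bit under one rating point. Note also that the estimate of 2.58 for user 1 and movie 31 is for a rating (2.5) that the model saw during training on the full trainset, so it is not an out-of-sample check.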

7. Key Insights

  1. Baseline Model
    • Does well in recommending movies that have a high weighted rating within the user’s favourite genre.
    • Not flexible enough to take in more parameters and recommend more personalized choices for the user.
  2. Content Based Model
    • Does well in recommending movies that are similar to the user’s inputs, such as movie plot, favourite director etc.
    • Does not have a cold start problem: the user does not need to have rated any movies, since the model only needs the user to select a favourite movie and, optionally, other parameters.
  3. Rating Prediction Model
    • The SVD model from the Surprise package, a Python scikit for recommender systems, performs decently, with a mean RMSE of about 0.90 and a mean MAE of about 0.69 over five folds.
    • Does not address the cold start problem, which occurs when the user has not rated enough movies before.
  4. Other models
    • For the movie recommender engine, there also exists a Collaborative Filtering model which takes into account similar users’ choices of movies and recommends such movies to the inquiring user. Due to the time constraints of the capstone project, only the content based model was explored in depth. I will be following up with this model, so give this space a watch!

8. Future work

I hope you like what you have seen thus far.

If you have any comments or questions regarding the above work, feel free to contact me via the “Contact Me” tab at the top of the page.

Have a nice day!