WatchThis (A Movie Recommender)
1. Problem Statement
Have you ever wanted to watch a movie in the comfort of your home, but given up because nothing came to mind? Or, during a stayover with friends, have differing preferences turned the choice of movie into a wild goose chase for a decision that never materialized?
Yes, there are a few movie recommender resources around. However, they are either complicated to use (i.e. you need to set up an account, enter a bunch of user preferences, etc.) or too simple, leaving little flexibility in the preferences you can enter.
2. Natural Language Processing (NLP)
This project is heavily reliant on Natural Language Processing or NLP, so let us understand what NLP is all about.
NLP is broadly defined as the processing or manipulation of natural language, which can be in the form of speech, text and so on. Making “useful” sense of such information is very challenging, because natural language is messy and can be illogical at times.
For the purpose of this project, we will compare how “similar” movies are in terms of plot, cast and director, and then recommend movies accordingly. More details will be discussed as we go along.
3. Dataset
The dataset is from GroupLens Research 20M. It is a stable benchmark dataset.
from ast import literal_eval
import numpy as np
import pandas as pd
# Load the data.
movies = pd.read_csv("./movies-dataset/movies_metadata.csv")
movies.head(20)
After some data cleaning, we are good to go!
4. Baseline Model: Simple Recommender
The Simple Recommender is a baseline model which offers simple recommendations based on a movie’s weighted rating: movies are ranked by that rating and the top few are displayed.
In addition, I will pass in a genre argument so that the user has the option of filtering by genre.
# extract the genres
movies['genres'] = movies['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
movies.genres.head(5)
0 [Animation, Comedy, Family]
1 [Adventure, Fantasy, Family]
2 [Romance, Comedy]
3 [Comedy, Drama, Romance]
4 [Comedy]
Name: genres, dtype: object
I will use a weighted rating that takes into account both the average rating and the number of votes for the movie. Hence, a movie that has an 8 rating from, say, 50k voters will have a higher score than a movie with the same rating but fewer voters.
$$\text{WR} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C$$
where,
- v is the number of votes for the movie
- m is the minimum votes required to be listed in the chart
- R is the average rating of the movie
- C is the mean rating across all movies
The cutoff m is an arbitrary choice; it filters out movies that have too few votes.
We will use the 90th percentile as the cutoff. Hence, for a movie to feature in the chart, it must have at least as many votes as the 90th-percentile vote count of the dataset.
# Minimum number of votes required to be in the chart
m = movies['vote_count'].quantile(0.90)
print(m)
160.0
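To make the effect of the vote count concrete, here is a minimal sketch of the weighted rating as a standalone function. The cutoff `m = 160.0` is the value printed above; `C = 5.6` is an assumed global mean chosen purely for illustration, and this scalar version of `weighted_rating` is a simplification of the row-based function used later in this post.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 160.0  # 90th-percentile vote count printed above
C = 5.6    # ASSUMED global mean rating, for illustration only

# Two movies with the same average rating but very different vote counts.
popular = weighted_rating(v=50_000, R=8.0, m=m, C=C)
obscure = weighted_rating(v=200, R=8.0, m=m, C=C)

print(round(popular, 2), round(obscure, 2))  # 7.99 6.93
```

With few votes the score is pulled strongly towards C, while with many votes it stays close to the movie's own average.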
# Filter out all qualified movies into a new df
q_movies = movies.copy().loc[movies['vote_count'] >= m]
q_movies.shape
(4555, 18)
There are 4555 movies which qualify to be in this list.
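The percentile cutoff used above can be sketched in plain Python. The `quantile` helper below is hypothetical and uses linear interpolation between the two nearest ranks, which I believe is also the default behaviour of the pandas `quantile` method; the vote counts are made-up numbers, not values from the dataset.

```python
def quantile(values, q):
    """q-th quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)         # fractional position in the sorted list
    lo = int(pos)                        # rank just below the position
    hi = min(lo + 1, len(ordered) - 1)   # rank just above, clamped to the last item
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac

# Hypothetical vote counts for ten movies (not taken from the dataset).
vote_counts = [3, 8, 12, 20, 35, 50, 90, 160, 400, 5000]

cutoff = quantile(vote_counts, 0.90)
qualified = [v for v in vote_counts if v >= cutoff]

print(round(cutoff, 1), qualified)  # 860.0 [5000]
```

Only the movies at or above the cutoff survive, which is why the qualified list is roughly a tenth of the full dataset.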
q_movies.head(3)
| | budget | genres | id | imdb_id | original_language | original_title | overview | popularity | production_companies | release_date | revenue | runtime | spoken_languages | status | title | vote_average | vote_count | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30000000 | [Animation, Comedy, Family] | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.9469 | [{'name': 'Pixar Animation Studios', 'id': 3}] | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Toy Story | 7.7 | 5415.0 | 1995 |
1 | 65000000 | [Adventure, Fantasy, Family] | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | 17.0155 | [{'name': 'TriStar Pictures', 'id': 559}, {'na... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Jumanji | 6.9 | 2413.0 | 1995 |
4 | 0 | [Comedy] | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | 8.38752 | [{'name': 'Sandollar Productions', 'id': 5842}... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Father of the Bride Part II | 5.7 | 173.0 | 1995 |
# C is the mean rating across all movies
C = movies['vote_average'].mean()
# Compute the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
4.1 Top movies by weighted rating
#Sort movies based on score calculated
q_movies = q_movies.sort_values('score', ascending=False)
#Print the top 15 movies
q_movies[['title', 'vote_count', 'vote_average', 'score','genres']].head(15)
| | title | vote_count | vote_average | score | genres |
---|---|---|---|---|---|
314 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 | [Drama, Crime] |
834 | The Godfather | 6024.0 | 8.5 | 8.425439 | [Drama, Crime] |
10309 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 | [Comedy, Drama, Romance] |
12481 | The Dark Knight | 12269.0 | 8.3 | 8.265477 | [Drama, Action, Crime, Thriller] |
2843 | Fight Club | 9678.0 | 8.3 | 8.256385 | [Drama] |
292 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 | [Thriller, Crime] |
522 | Schindler's List | 4436.0 | 8.3 | 8.206639 | [Drama, History, War] |
23673 | Whiplash | 4376.0 | 8.3 | 8.205404 | [Drama] |
5481 | Spirited Away | 3968.0 | 8.3 | 8.196055 | [Fantasy, Adventure, Animation, Family] |
2211 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 | [Comedy, Drama] |
1178 | The Godfather: Part II | 3418.0 | 8.3 | 8.180076 | [Drama, Crime] |
1152 | One Flew Over the Cuckoo's Nest | 3001.0 | 8.3 | 8.164256 | [Drama] |
351 | Forrest Gump | 8147.0 | 8.2 | 8.150272 | [Comedy, Drama, Romance] |
1154 | The Empire Strikes Back | 5998.0 | 8.2 | 8.132919 | [Adventure, Action, Science Fiction] |
1176 | Psycho | 2405.0 | 8.3 | 8.132715 | [Drama, Horror, Thriller] |
4.2 Top movies by genre
# To split the movies into one genre per row
s = movies.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
genre_movies = movies.drop('genres', axis=1).join(s)
genre_movies.head(2)
| | budget | id | imdb_id | original_language | original_title | overview | popularity | production_companies | release_date | revenue | runtime | spoken_languages | status | title | vote_average | vote_count | year | genre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30000000 | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.9469 | [{'name': 'Pixar Animation Studios', 'id': 3}] | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Toy Story | 7.7 | 5415.0 | 1995 | Animation |
0 | 30000000 | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.9469 | [{'name': 'Pixar Animation Studios', 'id': 3}] | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Toy Story | 7.7 | 5415.0 | 1995 | Comedy |
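The "one genre per row" reshaping above can be hard to read in its pandas form, so here is a minimal plain-Python sketch of the same idea. The two movies and their genres come from the output shown earlier; the names `movies_toy` and `genre_rows` are hypothetical.

```python
# Two movies with list-valued genres, as in the cleaned `genres` column.
movies_toy = [
    {"title": "Toy Story", "genres": ["Animation", "Comedy", "Family"]},
    {"title": "Jumanji", "genres": ["Adventure", "Fantasy", "Family"]},
]

# Emit one row per (movie, genre) pair, repeating the movie's other fields.
genre_rows = [
    {"title": movie["title"], "genre": genre}
    for movie in movies_toy
    for genre in movie["genres"]
]

# Filtering by a single genre is now a simple equality test.
family = [row["title"] for row in genre_rows if row["genre"] == "Family"]
print(len(genre_rows), family)  # 6 ['Toy Story', 'Jumanji']
```

This is why Toy Story appears once per genre in the table above.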
def genre_chart(genre, percentile=0.90):
    df = genre_movies[genre_movies['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    # C = vote_averages.mean()
    # m = vote_counts.quantile(percentile)
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # qualified['score'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified['score'] = q_movies['score']
    qualified = qualified.sort_values('score', ascending=False).head(250)
    return qualified
genre_chart('War').head(10)
| | title | year | vote_count | vote_average | popularity | score |
---|---|---|---|---|---|---|
522 | Schindler's List | 1993 | 4436 | 8 | 41.7251 | 8.206639 |
24860 | The Imitation Game | 2014 | 5895 | 8 | 31.5959 | 7.937062 |
5857 | The Pianist | 2002 | 1927 | 8 | 14.8116 | 7.909733 |
13605 | Inglourious Basterds | 2009 | 6598 | 7 | 16.8956 | 7.845977 |
5553 | Grave of the Fireflies | 1988 | 974 | 8 | 0.010902 | 7.835726 |
1165 | Apocalypse Now | 1979 | 2112 | 8 | 13.5963 | 7.832268 |
1919 | Saving Private Ryan | 1998 | 5148 | 7 | 21.7581 | 7.831220 |
1179 | Full Metal Jacket | 1987 | 2595 | 7 | 13.9415 | 7.767482 |
732 | Dr. Strangelove or: How I Learned to Stop Worr... | 1964 | 1472 | 8 | 9.80398 | 7.766491 |
43190 | Band of Brothers | 2001 | 725 | 8 | 7.903731 | 7.733235 |
genre_chart('Romance').head(10)
| | title | year | vote_count | vote_average | popularity | score |
---|---|---|---|---|---|---|
10309 | Dilwale Dulhania Le Jayenge | 1995 | 661 | 9 | 34.457 | 8.421453 |
351 | Forrest Gump | 1994 | 8147 | 8 | 48.3072 | 8.150272 |
40251 | Your Name. | 2016 | 1030 | 8 | 34.461252 | 8.112532 |
40882 | La La Land | 2016 | 4745 | 7 | 19.681686 | 7.825568 |
22168 | Her | 2013 | 4215 | 7 | 13.8295 | 7.816552 |
7208 | Eternal Sunshine of the Spotless Mind | 2004 | 3758 | 7 | 12.9063 | 7.806818 |
1132 | Cinema Paradiso | 1988 | 834 | 8 | 14.177 | 7.784420 |
876 | Vertigo | 1958 | 1162 | 8 | 18.2082 | 7.711735 |
4843 | Amélie | 2001 | 3403 | 7 | 12.8794 | 7.702024 |
24982 | The Theory of Everything | 2014 | 3403 | 7 | 11.853 | 7.702024 |
5. Content Based Recommender (Movie Description)
This part recommends movies that are similar to a particular movie in terms of their descriptions. It computes pairwise similarity scores for all movies based on their plot descriptions and recommends movies based on those scores.
movies['overview'].head(3)
0 Led by Woody, Andy's toys live happily in his ...
1 When siblings Judy and Peter discover an encha...
2 A family wedding reignites the ancient feud be...
Name: overview, dtype: object
# Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
# Define a TF-IDF Vectorizer Object. Remove all english stop words.
vect_1 = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')
# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = vect_1.fit_transform(movies['overview'])
# Output the shape of tfidf_matrix
tfidf_matrix.shape
(45466, 75827)
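To show what a TF-IDF matrix like the one above contains, here is a minimal plain-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's defaults; it skips stop-word removal, and the three plot snippets are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, each scaled to unit length."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical plot snippets (not taken from the dataset).
plots = [
    "toys come alive when humans leave",
    "a magical board game comes alive",
    "two old men feud over a neighbor",
]
vectors = tfidf_vectors(plots)

# "alive" occurs in two plots, "toys" in only one, so "toys" gets the larger weight.
print(vectors[0]["alive"] < vectors[0]["toys"])  # True
```

Terms that are rare across the corpus end up with larger weights, so they dominate the similarity between two descriptions.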
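Use the cosine similarity to denote the similarity between two movies. Since TfidfVectorizer produces unit-length (L2-normalized) rows by default, the dot product computed by linear_kernel is equal to the cosine similarity.
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)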
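As a rough illustration of what each entry of the similarity matrix means, the sketch below computes cosine similarity on small sparse vectors stored as dicts. The term weights are hypothetical, and the helper names `dot_product`, `cosine` and `normalize` are my own.

```python
import math

def dot_product(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def cosine(u, v):
    """Cosine similarity: the dot product divided by the product of the lengths."""
    norm_u = math.sqrt(dot_product(u, u))
    norm_v = math.sqrt(dot_product(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot_product(u, v) / (norm_u * norm_v)

def normalize(u):
    """Scale a non-empty vector to unit length."""
    norm = math.sqrt(dot_product(u, u))
    return {term: w / norm for term, w in u.items()}

# Hypothetical term weights for three descriptions.
a = {"mutant": 2.0, "school": 1.0}
b = {"mutant": 1.0, "wolverine": 1.0}
c = {"wedding": 1.0}

sim_ab = cosine(a, b)   # shares "mutant", so the similarity is positive
sim_ac = cosine(a, c)   # no shared terms, so the similarity is 0
sim_ab_unit = dot_product(normalize(a), normalize(b))

print(round(sim_ab, 4), sim_ac, round(sim_ab_unit, 4))  # 0.6325 0.0 0.6325
```

The last value shows that for unit-length vectors a plain dot product already equals the cosine similarity.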
Define a function that takes in a movie title as an input and outputs a list of the 8 most similar movies.
#Construct a reverse map of indices and movie titles
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 8 most similar movies
    sim_scores = sim_scores[1:9]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 8 most similar movies
    return movies['title'].iloc[movie_indices]
get_recommendations('X-Men')
32251 Superman
20806 Hulk vs. Wolverine
23472 Mission: Impossible - Rogue Nation
13635 X-Men Origins: Wolverine
6195 X2
30067 Holiday
21296 The Wolverine
38010 Sharkman
Name: title, dtype: object
get_recommendations('Mission: Impossible - Ghost Protocol')
23472 Mission: Impossible - Rogue Nation
10952 Mission: Impossible III
3501 Mission: Impossible II
19275 The President's Man: A Line in the Sand
26633 A Dangerous Place
18674 Act of Valor
15886 My Girlfriend's Boyfriend
33441 Swat: Unit 887
Name: title, dtype: object
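The function above sorts every score and then skips the first entry, on the assumption that the movie itself ranks first. A possible alternative, sketched here with the standard-library `heapq` module, excludes the query movie by index and keeps only the top k without a full sort. The function name `top_k_similar` and the scores are hypothetical.

```python
import heapq

def top_k_similar(scores, query_idx, k=8):
    """Indices of the k highest-scoring items, excluding the query item itself."""
    candidates = ((idx, score) for idx, score in enumerate(scores) if idx != query_idx)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [idx for idx, _ in best]

# Hypothetical similarity row for movie 0 (its similarity to itself is 1.0).
row = [1.0, 0.2, 0.9, 0.0, 0.5]

print(top_k_similar(row, query_idx=0, k=2))  # [2, 4]
```

Excluding by index avoids dropping a different movie when another title happens to tie with the query at the top of the ranking.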
5.1 Content Based Recommender (Other Parameters)
Judging from the results above, the recommendations do appear to share similar plot descriptions. However, some users might like a movie because of its cast, director and/or genre. Hence, the model will be improved by adding these features.
# Load credits
credits = pd.read_csv("./movies-dataset/credits.csv")
# Remove rows with bad IDs.
movies = movies.drop([19730, 29503, 35587])
# Convert IDs to integers for merging
credits['id'] = credits['id'].astype('int')
movies['id'] = movies['id'].astype('int')
# Merge credits into movies dataframe
movies = movies.merge(credits, on='id')
From the merged dataframe, the scope of features is defined as follows:
Crew: Only the director will be selected, as I feel the director’s vision contributes the most to the movie.
Cast: Most movies have a mixture of better-known and lesser-known actors and actresses. Hence, I will choose only the first 3 names in the cast list.
movies['cast'] = movies['cast'].apply(literal_eval)
movies['crew'] = movies['crew'].apply(literal_eval)
# movies['cast_size'] = movies['cast'].apply(lambda x: len(x))
# movies['crew_size'] = movies['crew'].apply(lambda x: len(x))
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names
    #Return empty list in case of missing/malformed data
    return []
# Define new director, cast and genres features that are in a suitable form.
movies['director'] = movies['crew'].apply(get_director)
# features = ['cast', 'genres']
features = ['cast']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)
# Print the new features of the first 3 films
movies[['title', 'cast', 'director', 'genres']].head(3)
| | title | cast | director | genres |
---|---|---|---|---|
0 | Toy Story | [Tom Hanks, Tim Allen, Don Rickles] | John Lasseter | [Animation, Comedy, Family] |
1 | Jumanji | [Robin Williams, Jonathan Hyde, Kirsten Dunst] | Joe Johnston | [Adventure, Fantasy, Family] |
2 | Grumpier Old Men | [Walter Matthau, Jack Lemmon, Ann-Margret] | Howard Deutch | [Romance, Comedy] |
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
print (movies['cast'].head(1).all())
['Tom Hanks', 'Tim Allen', 'Don Rickles']
print (movies['director'].head())
0 John Lasseter
1 Joe Johnston
2 Howard Deutch
3 Forest Whitaker
4 Charles Shyer
Name: director, dtype: object
print (movies['genres'].head())
0 [Animation, Comedy, Family]
1 [Adventure, Fantasy, Family]
2 [Romance, Comedy]
3 [Comedy, Drama, Romance]
4 [Comedy]
Name: genres, dtype: object
# Apply clean_data function to your features.
# features = ['cast', 'director', 'genres']
features = ['director', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(clean_data)
def create_soup(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
# Create a new soup feature
movies['soup'] = movies.apply(create_soup, axis=1)
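To see what a single "soup" string looks like, here is a small standalone sketch for Toy Story, using the cast, director and genres printed earlier. It re-declares simplified versions of `clean_data` and `create_soup` so that it runs on its own; as in the code above, the cast names are left uncleaned.

```python
def clean_data(x):
    """Lower-case and strip spaces from a string or from each string in a list."""
    if isinstance(x, list):
        return [item.replace(" ", "").lower() for item in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ""

def create_soup(row):
    """Join cast, director and genres into one space-separated string."""
    return " ".join(row["cast"]) + " " + row["director"] + " " + " ".join(row["genres"])

row = {
    "cast": ["Tom Hanks", "Tim Allen", "Don Rickles"],  # not cleaned, as in the post
    "director": clean_data("John Lasseter"),
    "genres": clean_data(["Animation", "Comedy", "Family"]),
}

soup = create_soup(row)
print(soup)
# Tom Hanks Tim Allen Don Rickles johnlasseter animation comedy family
```

Because the cast names keep their spaces, a word-level tokenizer would likely treat first and last names as separate tokens, whereas the director is matched as the single token `johnlasseter`.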
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer
vect_2 = CountVectorizer(stop_words='english')
count_matrix = vect_2.fit_transform(movies['soup'])
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
# Reset index of your main DataFrame and construct reverse mapping as before
movies = movies.reset_index()
indices = pd.Series(movies.index, index=movies['title'])
get_recommendations('The Dark Knight Rises', cosine_sim2)
12525 The Dark Knight
10158 Batman Begins
11399 The Prestige
23964 Quicksand
516 Romeo Is Bleeding
8990 State of Grace
11460 Harsh Times
14977 Harry Brown
Name: title, dtype: object
get_recommendations('The Godfather', cosine_sim2)
1187 The Godfather: Part II
1922 The Godfather: Part III
3996 Gardens of Stone
3145 Scent of a Woman
15503 The Rain People
1174 Apocalypse Now
1844 On the Waterfront
5281 The Gambler
Name: title, dtype: object
6. Prediction Of Ratings (Collaborative Filtering)
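For this part, I will attempt to predict how a user will rate a recommended movie (presuming he or she has not seen it before, or at least has not rated it before).
# These imports assume an older Surprise release that still provides evaluate(), data.split() and train()
from surprise import Reader, Dataset, SVD, evaluate
reader = Reader()
ratings3 = pd.read_csv("./movies-dataset/ratings_small.csv")
ratings3.head()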
| | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
data = Dataset.load_from_df(ratings3[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])
Evaluating RMSE, MAE of algorithm SVD.
------------
Fold 1
RMSE: 0.8947
MAE: 0.6894
------------
Fold 2
RMSE: 0.8995
MAE: 0.6926
------------
Fold 3
RMSE: 0.8910
MAE: 0.6868
------------
Fold 4
RMSE: 0.8996
MAE: 0.6917
------------
Fold 5
RMSE: 0.8991
MAE: 0.6929
------------
------------
Mean RMSE: 0.8968
Mean MAE : 0.6907
------------
------------
CaseInsensitiveDefaultDict(list,
{'mae': [0.68937359473806903,
0.69259939130111503,
0.68678665677980999,
0.69169120460418154,
0.69285620150031413],
'rmse': [0.89472414482943841,
0.89948598218998499,
0.89096153777913623,
0.8996171912501465,
0.89907130432515781]})
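For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how RMSE and MAE are computed from (actual, predicted) rating pairs. The four pairs are hypothetical and are not taken from the dataset.

```python
import math

def rmse(pairs):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error: the average size of the error."""
    return sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)

# Hypothetical (actual, predicted) ratings.
pairs = [(4.0, 3.5), (2.0, 2.5), (3.0, 3.0), (5.0, 4.0)]

print(round(rmse(pairs), 4), mae(pairs))  # 0.6124 0.5
```

Read this way, a mean MAE of about 0.69 suggests that the predictions are off by roughly 0.7 of a rating point on average.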
trainset = data.build_full_trainset()
svd.train(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11704a908>
ratings3[ratings3['userId'] == 1]
| | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
5 | 1 | 1263 | 2.0 | 1260759151 |
6 | 1 | 1287 | 2.0 | 1260759187 |
7 | 1 | 1293 | 2.0 | 1260759148 |
8 | 1 | 1339 | 3.5 | 1260759125 |
9 | 1 | 1343 | 2.0 | 1260759131 |
10 | 1 | 1371 | 2.5 | 1260759135 |
11 | 1 | 1405 | 1.0 | 1260759203 |
12 | 1 | 1953 | 4.0 | 1260759191 |
13 | 1 | 2105 | 4.0 | 1260759139 |
14 | 1 | 2150 | 3.0 | 1260759194 |
15 | 1 | 2193 | 2.0 | 1260759198 |
16 | 1 | 2294 | 2.0 | 1260759108 |
17 | 1 | 2455 | 2.5 | 1260759113 |
18 | 1 | 2968 | 1.0 | 1260759200 |
19 | 1 | 3671 | 3.0 | 1260759117 |
svd.predict(1, 31)
Prediction(uid=1, iid=31, r_ui=None, est=2.5823946941598028, details={'was_impossible': False})
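The estimate above comes out of a biased matrix factorization. As I understand the Surprise documentation, the SVD prediction is the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors, clipped to the rating scale. The sketch below illustrates that rule with hypothetical numbers; `predict_rating` and all of its inputs are assumptions, not values learned by the model in this post.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   low=1.0, high=5.0):
    """Estimate = mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    return min(high, max(low, raw))

# Hypothetical learned parameters for one user and one movie.
est = predict_rating(
    global_mean=3.5,          # average of all ratings
    user_bias=-0.4,           # this user rates lower than average
    item_bias=0.1,            # this movie is rated slightly above average
    user_factors=[0.2, -0.1],
    item_factors=[0.5, 0.3],
)

print(round(est, 2))  # 3.27
```

A user with a negative bias, such as one who mostly gives ratings of 2 to 3, will therefore tend to receive below-average estimates.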
7. Key Insights
- Baseline Model
  - Does well in recommending movies which have a high weighted rating within the user’s favourite genre.
  - Not flexible enough to take in more parameters and recommend more personalized choices for the user.
- Content Based Model
  - Does well in recommending movies which are similar to the user’s inputs, such as movie plot, favourite director etc.
  - Does not have a cold start problem: the user does not need to have rated many movies, since the model only needs the user to select a favourite movie and, optionally, other parameters.
- Rating Prediction Model
  - The Surprise package, which is a Python scikit package for recommender systems, performs decently, with a mean RMSE of about 0.897 across the 5 folds.
  - Does not address the cold start problem, which occurs when the user has not rated enough movies before.
- Other models
  - For the movie recommender engine, there exists a Collaborative Filtering model which takes into account similar users’ choices of movies and recommends such movies to the inquiring user. Due to the time constraints of the capstone project, I was only able to explore the content based model. I will be following up with this model, so watch this space!
8. Future work
I hope you like what you have seen thus far.
If you have any comments or questions regarding the above work, feel free to contact me via the “Contact Me” tab at the top of the page.
Have a nice day!