WatchThis (A Movie Recommender)

1. Problem Statement

Have you ever wanted to watch a movie in the comfort of your home, but stopped short because nothing came to mind? Or, during a stayover with friends, did too many differing preferences turn the choice of movie into a wild goose chase for a decision that never materialized?

Yes, there are a few movie recommender resources around. However, they are either complicated to use (i.e. you need to set up an account, enter a bunch of user preferences, etc.) or too simple, and do not allow flexible preferences as input.

2. Natural Language Processing (NLP)

This project is heavily reliant on Natural Language Processing or NLP, so let us understand what NLP is all about.

NLP is broadly defined as the processing or manipulation of natural language, which can take the form of speech, text, etc. It is very challenging to make “useful” sense of such information, as it is messy and can be illogical at times.

For the purpose of this project, we will compare how “similar” movies are in terms of plot, cast and director, and recommend movies accordingly. More details will be discussed as we go along.

3. Dataset

The dataset is from the GroupLens Research MovieLens 20M collection, which is a stable benchmark dataset.

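# Imports used throughout this walkthrough
import numpy as np
import pandas as pd
from ast import literal_eval

# Load the data.
movies = pd.read_csv("./movies-dataset/movies_metadata.csv")
movies.head(20)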

After some data cleaning, we are good to go!

4. Baseline Model: Simple Recommender

The Simple Recommender is a baseline model which ranks movies by their weighted rating and displays the top few.

In addition, the user can pass in a genre as an optional argument.

# extract the genres
movies['genres'] = movies['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

movies.genres.head(5)

   [Animation, Comedy, Family]
  [Adventure, Fantasy, Family]
             [Romance, Comedy]
      [Comedy, Drama, Romance]
                      [Comedy]
Name: genres, dtype: object

I will use a weighted rating that takes into account both the average rating and the number of votes for the movie. Hence, a movie that has an 8 rating from, say, 50k voters will have a higher score than a movie with the same rating but fewer voters.

Weighted Rating (WR) = (v / (v + m)) × R + (m / (v + m)) × C

where,

v is the number of votes for the movie
m is the minimum votes required to be listed in the chart
R is the average rating of the movie
C is the mean rating across all movies in the dataset

m is an arbitrary threshold used to filter out movies that have few votes.

We will use the 90th percentile as the cutoff. Hence, for a movie to feature in the list, it has to have at least as many votes as 90% of the movies in the dataset.
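
The weighted rating described above can be checked by hand with a small, dependency-free sketch. The helper name `weighted_rating_value` is hypothetical, and the value C = 5.618 is an assumption: it is not stated in this post, but was back-derived from the scores shown in the charts further down.

```python
def weighted_rating_value(v, R, m, C):
    """Blend a movie's own average R with the global mean C.

    v: number of votes for the movie
    m: minimum votes required to be listed
    R: average rating of the movie
    C: mean rating across all movies (assumed value, see the note above)
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


m = 160.0   # 90th-percentile vote count printed in this post
C = 5.618   # ASSUMPTION: back-derived from the published scores, not stated in the post

# Two movies with the same average rating but very different vote counts
many_votes = weighted_rating_value(v=8358, R=8.5, m=m, C=C)
few_votes = weighted_rating_value(v=200, R=8.5, m=m, C=C)

print(round(many_votes, 2))
print(round(few_votes, 2))
```

With only a few votes the score is pulled towards C, while with many votes it stays close to the movie's own average R.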

# Minimum number of votes required to be in the chart
m = movies['vote_count'].quantile(0.90)
print(m)

160.0

# Filter all qualified movies into a new df (copy so that adding columns later does not touch movies)
q_movies = movies.loc[movies['vote_count'] >= m].copy()
q_movies.shape

(4555, 18)

There are 4555 movies which qualify to be in this list.
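
To make the percentile cutoff concrete, here is a minimal pure-Python sketch of a quantile with linear interpolation, which is assumed to mirror what `quantile(0.90)` does by default. The function name `quantile_linear` and the vote counts are made up for illustration.

```python
def quantile_linear(values, q):
    """q-th quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Toy vote counts (made up for illustration)
vote_counts = [3, 5, 8, 12, 20, 35, 60, 110, 250, 900, 4000]

cutoff = quantile_linear(vote_counts, 0.90)
qualified = [v for v in vote_counts if v >= cutoff]

print(cutoff)
print(qualified)
```

Only the movies at or above the cutoff remain, which is why a high percentile leaves a short list of widely rated titles.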

q_movies.head(3)

	budget	genres	id	imdb_id	original_language	original_title	overview	popularity	production_companies	release_date	revenue	runtime	spoken_languages	status	title	vote_average	vote_count	year
0	30000000	[Animation, Comedy, Family]	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.9469	[{'name': 'Pixar Animation Studios', 'id': 3}]	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Toy Story	7.7	5415.0	1995
1	65000000	[Adventure, Fantasy, Family]	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	17.0155	[{'name': 'TriStar Pictures', 'id': 559}, {'na...	1995-12-15	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Jumanji	6.9	2413.0	1995
4	0	[Comedy]	11862	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	8.38752	[{'name': 'Sandollar Productions', 'id': 5842}...	1995-02-10	76578911.0	106.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Father of the Bride Part II	5.7	173.0	1995

# Mean rating across all movies
C = movies['vote_average'].mean()

# Compute the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

4.1 Top movies by weighted rating

#Sort movies based on score calculated
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 15 movies
q_movies[['title', 'vote_count', 'vote_average', 'score','genres']].head(15)

	title	vote_count	vote_average	score	genres
314	The Shawshank Redemption	8358.0	8.5	8.445869	[Drama, Crime]
834	The Godfather	6024.0	8.5	8.425439	[Drama, Crime]
10309	Dilwale Dulhania Le Jayenge	661.0	9.1	8.421453	[Comedy, Drama, Romance]
12481	The Dark Knight	12269.0	8.3	8.265477	[Drama, Action, Crime, Thriller]
2843	Fight Club	9678.0	8.3	8.256385	[Drama]
292	Pulp Fiction	8670.0	8.3	8.251406	[Thriller, Crime]
522	Schindler's List	4436.0	8.3	8.206639	[Drama, History, War]
23673	Whiplash	4376.0	8.3	8.205404	[Drama]
5481	Spirited Away	3968.0	8.3	8.196055	[Fantasy, Adventure, Animation, Family]
2211	Life Is Beautiful	3643.0	8.3	8.187171	[Comedy, Drama]
1178	The Godfather: Part II	3418.0	8.3	8.180076	[Drama, Crime]
1152	One Flew Over the Cuckoo's Nest	3001.0	8.3	8.164256	[Drama]
351	Forrest Gump	8147.0	8.2	8.150272	[Comedy, Drama, Romance]
1154	The Empire Strikes Back	5998.0	8.2	8.132919	[Adventure, Action, Science Fiction]
1176	Psycho	2405.0	8.3	8.132715	[Drama, Horror, Thriller]

4.2 Top movies by genre

# To split the movies into one genre per row
s = movies.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
genre_movies = movies.drop('genres', axis=1).join(s)

genre_movies.head(2)

	budget	id	imdb_id	original_language	original_title	overview	popularity	production_companies	release_date	revenue	runtime	spoken_languages	status	title	vote_average	vote_count	year	genre
0	30000000	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.9469	[{'name': 'Pixar Animation Studios', 'id': 3}]	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Toy Story	7.7	5415.0	1995	Animation
0	30000000	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.9469	[{'name': 'Pixar Animation Studios', 'id': 3}]	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Toy Story	7.7	5415.0	1995	Comedy

def genre_chart(genre):
    # Select the rows for the requested genre
    df = genre_movies[genre_movies['genre'] == genre]

    # Keep movies that meet the global vote threshold m and have a rating
    qualified = df[(df['vote_count'] >= m) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']].copy()
    # Cast to int for display; note that this truncates vote_average (e.g. 8.5 becomes 8)
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')

    # Reuse the global weighted score computed for q_movies (aligned on the row index)
    qualified['score'] = q_movies['score']
    qualified = qualified.sort_values('score', ascending=False).head(250)

    return qualified

genre_chart('War').head(10)

	title	year	vote_count	vote_average	popularity	score
522	Schindler's List	1993	4436	8	41.7251	8.206639
24860	The Imitation Game	2014	5895	8	31.5959	7.937062
5857	The Pianist	2002	1927	8	14.8116	7.909733
13605	Inglourious Basterds	2009	6598	7	16.8956	7.845977
5553	Grave of the Fireflies	1988	974	8	0.010902	7.835726
1165	Apocalypse Now	1979	2112	8	13.5963	7.832268
1919	Saving Private Ryan	1998	5148	7	21.7581	7.831220
1179	Full Metal Jacket	1987	2595	7	13.9415	7.767482
732	Dr. Strangelove or: How I Learned to Stop Worr...	1964	1472	8	9.80398	7.766491
43190	Band of Brothers	2001	725	8	7.903731	7.733235

genre_chart('Romance').head(10)

	title	year	vote_count	vote_average	popularity	score
10309	Dilwale Dulhania Le Jayenge	1995	661	9	34.457	8.421453
351	Forrest Gump	1994	8147	8	48.3072	8.150272
40251	Your Name.	2016	1030	8	34.461252	8.112532
40882	La La Land	2016	4745	7	19.681686	7.825568
22168	Her	2013	4215	7	13.8295	7.816552
7208	Eternal Sunshine of the Spotless Mind	2004	3758	7	12.9063	7.806818
1132	Cinema Paradiso	1988	834	8	14.177	7.784420
876	Vertigo	1958	1162	8	18.2082	7.711735
4843	Amélie	2001	3403	7	12.8794	7.702024
24982	The Theory of Everything	2014	3403	7	11.853	7.702024

5. Content Based Recommender (Movie Description)

This part recommends movies that are similar to a particular movie in terms of movie description. It computes pairwise similarity scores for all movies based on their plot descriptions and recommends movies based on those scores.

movies['overview'].head(3)

  Led by Woody, Andy's toys live happily in his ...
  When siblings Judy and Peter discover an encha...
  A family wedding reignites the ancient feud be...
Name: overview, dtype: object

# Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object. Remove all english stop words.
vect_1 = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = vect_1.fit_transform(movies['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

Cosine similarity is used to measure how similar two movies are.
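
As a rough illustration of the idea, the sketch below builds TF-IDF vectors for three made-up plots and compares them. It is a simplification, not the vectorizer used in this post: it splits on whitespace, keeps stop words, and the smoothed IDF weighting ln((1 + n) / (1 + df)) + 1 is an assumption chosen for the example. The names `tfidf_vectors` and `dot` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF, ln((1 + n) / (1 + df)) + 1; an assumed weighting for this sketch
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


# Made-up one-line plots, not taken from the dataset
plots = [
    "toys come alive when humans leave",
    "toys get lost and find their way home",
    "a spy must stop a nuclear attack",
]
vectors = tfidf_vectors(plots)

sim_toys = dot(vectors[0], vectors[1])
sim_spy = dot(vectors[0], vectors[2])
print(sim_toys > sim_spy)
```

Because each vector is scaled to unit length, the plain dot product is already the cosine similarity: plots that share weighted terms score above zero, and plots with no terms in common score exactly zero.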

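# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix. TF-IDF rows are L2-normalised by default,
# so the plain dot product from linear_kernel equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)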

Define a function that takes in a movie title as an input and outputs a list of the 8 most similar movies.

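# Construct a reverse map from movie titles to row indices.
# Keep only the first row for each repeated title so that a lookup returns a single index.
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]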

# Function that takes in movie title as input and outputs most similar movies

def get_recommendations(title, cosine_sim=cosine_sim):
    
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 8 most similar movies
    sim_scores = sim_scores[1:9]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 8 most similar movies
    return movies['title'].iloc[movie_indices]

get_recommendations('X-Men')

                            Superman
                  Hulk vs. Wolverine
  Mission: Impossible - Rogue Nation
            X-Men Origins: Wolverine
                                   X2
                             Holiday
                       The Wolverine
                            Sharkman
Name: title, dtype: object

get_recommendations('Mission: Impossible - Ghost Protocol')

       Mission: Impossible - Rogue Nation
                  Mission: Impossible III
                    Mission: Impossible II
  The President's Man: A Line in the Sand
                        A Dangerous Place
                             Act of Valor
                My Girlfriend's Boyfriend
                           Swat: Unit 887
Name: title, dtype: object
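
The function above sorts every similarity score and then slices `[1:9]`, which relies on the queried movie itself ranking first. A possible alternative, sketched here with hypothetical names and made-up scores, skips the queried movie explicitly and keeps only the top k entries instead of sorting the full list.

```python
import heapq


def top_k_similar(scores, own_index, k=8):
    """Return the indices of the k highest scores, skipping the query item itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != own_index)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [i for i, _ in best]


# Made-up similarity row for movie 0 (its similarity to itself is 1.0)
row = [1.0, 0.12, 0.57, 0.0, 0.33, 0.41, 0.08]
print(top_k_similar(row, own_index=0, k=3))
```

This keeps the result correct even when another movie ties with the queried one at a similarity of 1.0, for example an entry with an identical description.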

5.1 Content Based Recommender (Other Parameters)

The recommendations above look reasonable given how similar the movie descriptions are. However, some users might like a movie because of its cast, director and/or genre. Hence, the model will be extended with these additional features.

# Load the credits
credits = pd.read_csv("./movies-dataset/credits.csv")

# Remove rows with bad IDs.
movies = movies.drop([19730, 29503, 35587])

# Convert IDs to integers for merging
credits['id'] = credits['id'].astype('int')
movies['id'] = movies['id'].astype('int')

# Merge credits into movies dataframe
movies = movies.merge(credits, on='id')

From the merged dataframe, the scope of features will be defined as such:

Crew: Only the director will be selected, as I feel the director's vision contributes most to the movie.

Cast: Most movies have a mixture of better known and lesser known actors and actresses. Hence, I will choose only the first 3 names in the cast list.

movies['cast'] = movies['cast'].apply(literal_eval)
movies['crew'] = movies['crew'].apply(literal_eval)
# movies['cast_size'] = movies['cast'].apply(lambda x: len(x))
# movies['crew_size'] = movies['crew'].apply(lambda x: len(x))

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Define new director, cast and genres features that are in a suitable form.
movies['director'] = movies['crew'].apply(get_director)

# features = ['cast', 'genres']
features = ['cast']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)

# Print the new features of the first 3 films
movies[['title', 'cast', 'director', 'genres']].head(3)

	title	cast	director	genres
0	Toy Story	[Tom Hanks, Tim Allen, Don Rickles]	John Lasseter	[Animation, Comedy, Family]
1	Jumanji	[Robin Williams, Jonathan Hyde, Kirsten Dunst]	Joe Johnston	[Adventure, Fantasy, Family]
2	Grumpier Old Men	[Walter Matthau, Jack Lemmon, Ann-Margret]	Howard Deutch	[Romance, Comedy]

# Function to convert all strings to lower case and strip names of spaces

def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

print(movies['cast'].iloc[0])

['Tom Hanks', 'Tim Allen', 'Don Rickles']

print (movies['director'].head())

    John Lasseter
     Joe Johnston
    Howard Deutch
  Forest Whitaker
    Charles Shyer
Name: director, dtype: object

print (movies['genres'].head())

   [Animation, Comedy, Family]
  [Adventure, Fantasy, Family]
             [Romance, Comedy]
      [Comedy, Drama, Romance]
                      [Comedy]
Name: genres, dtype: object

# Apply clean_data function to your features.
# features = ['cast', 'director', 'genres']
features = ['director', 'genres']

for feature in features:
    movies[feature] = movies[feature].apply(clean_data)

def create_soup(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new soup feature
movies['soup'] = movies.apply(create_soup, axis=1)

# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

vect_2 = CountVectorizer(stop_words='english')
count_matrix = vect_2.fit_transform(movies['soup'])

# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
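
The sketch below illustrates why `clean_data` strips spaces from names before counting tokens. It uses toy records and a hypothetical whitespace tokenizer with raw counts, so it is only an approximation of what the count vectorizer does.

```python
import math
from collections import Counter


def clean(name):
    """Lower-case a name and remove spaces, e.g. 'Tom Hanks' -> 'tomhanks'."""
    return name.lower().replace(" ", "")


def count_cosine(soup_a, soup_b):
    """Cosine similarity between two whitespace-tokenised strings, using raw counts."""
    a, b = Counter(soup_a.split()), Counter(soup_b.split())
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return shared / norm if norm else 0.0


# Toy records that share first names but no actual person
raw_a = ["Tom Hanks", "John Lasseter"]
raw_b = ["Tom Cruise", "John Woo"]

uncleaned = count_cosine(" ".join(raw_a).lower(), " ".join(raw_b).lower())
cleaned = count_cosine(" ".join(map(clean, raw_a)), " ".join(map(clean, raw_b)))

print(uncleaned)
print(cleaned)
```

Without cleaning, the shared tokens "tom" and "john" make two unrelated records look similar. In the code above, `clean_data` is applied to `director` and `genres` only, so cast names are still split into separate first-name and last-name tokens by the vectorizer.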

# Reset index of your main DataFrame and construct reverse mapping as before
movies = movies.reset_index()
indices = pd.Series(movies.index, index=movies['title'])

get_recommendations('The Dark Knight Rises', cosine_sim2)

    The Dark Knight
      Batman Begins
       The Prestige
          Quicksand
    Romeo Is Bleeding
      State of Grace
        Harsh Times
        Harry Brown
Name: title, dtype: object

get_recommendations('The Godfather', cosine_sim2)

    The Godfather: Part II
   The Godfather: Part III
          Gardens of Stone
          Scent of a Woman
          The Rain People
            Apocalypse Now
         On the Waterfront
               The Gambler
Name: title, dtype: object

6. Prediction Of Ratings (Collaborative Filtering)

For this part, I will attempt to predict how a user will rate a recommended movie (presuming they have not seen it, or at least have not rated it, before).

from surprise import Reader, Dataset, SVD, evaluate
reader = Reader()

ratings3 = pd.read_csv("./movies-dataset/ratings_small.csv")
ratings3.head()

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

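data = Dataset.load_from_df(ratings3[['userId', 'movieId', 'rating']], reader)
# Note: data.split() and evaluate() are from older Surprise releases; newer ones
# provide surprise.model_selection.cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)
data.split(n_folds=5)

svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])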

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8947
MAE:  0.6894
------------
Fold 2
RMSE: 0.8995
MAE:  0.6926
------------
Fold 3
RMSE: 0.8910
MAE:  0.6868
------------
Fold 4
RMSE: 0.8996
MAE:  0.6917
------------
Fold 5
RMSE: 0.8991
MAE:  0.6929
------------
------------
Mean RMSE: 0.8968
Mean MAE : 0.6907
------------
------------





CaseInsensitiveDefaultDict(list,
                           {'mae': [0.68937359473806903,
                             0.69259939130111503,
                             0.68678665677980999,
                             0.69169120460418154,
                             0.69285620150031413],
                            'rmse': [0.89472414482943841,
                             0.89948598218998499,
                             0.89096153777913623,
                             0.8996171912501465,
                             0.89907130432515781]})
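
For readers unfamiliar with the two error measures, here is a minimal sketch of how RMSE and MAE are computed. The ratings and predictions are made up; only the per-fold values in the second half are copied from the output above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up ratings and predictions, for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(round(rmse(actual, predicted), 4), round(mae(actual, predicted), 4))

# Per-fold values copied from the output above; their averages reproduce the reported means
fold_rmse = [0.8947, 0.8995, 0.8910, 0.8996, 0.8991]
fold_mae = [0.6894, 0.6926, 0.6868, 0.6917, 0.6929]
print(round(sum(fold_rmse) / len(fold_rmse), 4))
print(round(sum(fold_mae) / len(fold_mae), 4))
```

An RMSE of roughly 0.90 means that a typical prediction misses the true rating by a little under one point on the rating scale.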

trainset = data.build_full_trainset()
# train() is the older method name; newer Surprise releases use svd.fit(trainset)
svd.train(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11704a908>

ratings3[ratings3['userId'] == 1]

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205
5	1	1263	2.0	1260759151
6	1	1287	2.0	1260759187
7	1	1293	2.0	1260759148
8	1	1339	3.5	1260759125
9	1	1343	2.0	1260759131
10	1	1371	2.5	1260759135
11	1	1405	1.0	1260759203
12	1	1953	4.0	1260759191
13	1	2105	4.0	1260759139
14	1	2150	3.0	1260759194
15	1	2193	2.0	1260759198
16	1	2294	2.0	1260759108
17	1	2455	2.5	1260759113
18	1	2968	1.0	1260759200
19	1	3671	3.0	1260759117

svd.predict(1, 31)

Prediction(uid=1, iid=31, r_ui=None, est=2.5823946941598028, details={'was_impossible': False})

7. Key Insights

Baseline Model
- Does well in recommending movies which have a high weighted rating within the user’s favourite genre.
- Not flexible enough to take in more parameters and recommend more personalized choices for the user.
Content Based Model
- Does well in recommending movies which are similar to the user’s inputs, such as movie plot, favourite director etc.
- Does not have a cold start problem: the user does not need to have rated many movies, since the model only needs a favourite movie and, optionally, other parameters.
Rating Prediction Model
- The Surprise package, which is a Python scikit package for recommender systems, performs decently (mean RMSE of about 0.90 across five folds).
- Does not address the cold start problem, which occurs when the user has not rated enough movies before.
Other models
- For the movie recommender engine, there is also a Collaborative Filtering model, which takes similar users’ choices of movies into account and recommends those movies to the inquiring user. Due to the time constraints of the capstone project, only the content based model was explored for recommendations. I will be following up with this model, so give this space a watch!

8. Future work

I hope you like what you have seen thus far.

If you have any comments or questions regarding the above work, feel free to contact me via the “Contact Me” tab at the top of the page.

Have a nice day!

Jia Jian Woo