P4DA Example2

MovieLens 1M Data Set

GroupLens Research provides a number of collections of movie ratings data collected from user of MovieLens in the late 1990s and early 2000s. These data sets contain 1 million ratings collected from 6000 users on 4000 movies. It's spread across 3 tables: ratings, user information, and movie information.

Data set Download:

Movies Data File 167.29KB
Rating Data File 23.45MB
Users Data File 131.22KB
ReadMe File 5.07KB

Each .dat has a format like the picture beneath, so we have to parse them:

.dat Format

Each table can be loaded into a pandas DataFrame object using pandas.read_table:

import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']  
users = pd.read_table('user.dat', sep='::', header = None, names = unames)

mnames = ['movie_id', 'title', 'genres']  
movies = pd.read_table('movies.dat', sep='::', header = None, names = mnames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']  
ratings = pd.read_table('ratings.dat', sep='::', header = None, names = rnames)

#ParserWarning: Falling back to the 'python' engine because the 'c' engine does
#not support regex separators (separators > 1 char and different from '\s+' are
#interpreted as regex);

#If there is a warning, you should change your engine to 'python'

You can check the data if you want:


Users, movies, and ratings

Suppose you want to compute mean ratings for a particular movie by sex and age, this is much easier to do with all of the data merged together into a single table. Using pandas's merge function, we first merge ratings with users then merging that result with the movies data. pandas infers which columns to use as the merge keys based on overlapping names:

data = pd.merge(pd.merge(ratings, users), movies)  
#user_id                                            1
#movie_id                                        1193
#rating                                             5
#timestamp                                  978300760
#gender                                             F
#age                                                1
#occupation                                        10
#zip                                            48067
#title         One Flew Over the Cuckoo's Nest (1975)
#genres                                         Drama
#Name: 0, dtype: object

To get mean movie rating for each file grouped by gender, we can use the pivot_table method:

mean_rating = data.pivot_table('rating', index = 'title', columns = 'gender', aggfunc='mean')  
#Here is some parameter names changes in `pivot_table` method
#Do not use the old parameter or you will make a mistake

This produced another Data Frame containing mean ratings with movie totals as index(row) labels and gender as column labels. First, group the data by title and use size() to get a Series of group sizes for each title:

ratings_by_title = data.groupby('title').size()  
#$1,000,000 Duck (1971)       37
#'Night Mother (1986)         70
#'Til There Was You (1997)    52
#dtype: int64

Then we want to filter down to movies that received at least 250 ratings:

active_titles = ratings_by_title.index[ratings_by_title >=250]  

And the index of titles receiving at least 250 ratings can then be used to select rows from mean_rating above:

mean_ratings = mean_ratings.ix[active_titles]  
#gender                            F         M
#$1,000,000 Duck (1971)     3.375000  2.761905
#'Night Mother (1986)       3.388889  3.352941
#'Til There Was You (1997)  2.675676  2.733333

To see the top films among female viewers, we can sort by the F column in descending order:

top_female_ratings = mean_ratings.sort_values(by = 'F', ascending = False)  
#FutureWarning: by argument to sort_index is deprecated
#pls use .sort_values(by=...)
#Do not use .sort_index(by=...)

#gender                                               F         M
#Clean Slate (Coup de Torchon) (1981)               5.0  3.857143
#Ballad of Narayama, The (Narayama Bushiko) (1958)  5.0  3.428571
#Raw Deal (1948)                                    5.0  3.307692
#Bittersweet Motel (2000)                           5.0       NaN
#Skipped Parts (2000)                               5.0  4.000000

Measuering rating disagreement

Suppose you wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in means, then sort by that:

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']  
sorted_by_diff = mean_ratings.sort_values(by = 'diff')  

The output is like this:


Disagreement can be measured by the variance or standard deviation of the ratings:

#Standard deviation of rating grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()  
#Filter down to active_titles
rating_std_by_title = rating_std_by_title.ix[active_titles]  
#Order Series by value in descending order
rating_std_by_title.order(ascending = False)[:5]  
#Dumb & Dumber (1994)                     1.321333
#Blair Witch Project, The (1999)          1.316368
#Natural Born Killers (1994)              1.307198
#Tank Girl (1995)                         1.277695
#Rocky Horror Picture Show, The (1975)    1.260177
#Name: rating, dtype: float64