# Load libraries and data
library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")
Our Coursera instructors provided us with a random sample of movies from the websites IMDb and Rotten Tomatoes. Although they do not tell us how the random sample was selected, we place our faith in our instructors that it was done well.
We are given scores and ratings of the randomly selected films from three sources: the IMDb user rating, the Rotten Tomatoes audience score, and the Rotten Tomatoes critics score.
Both IMDb and Rotten Tomatoes build their scores based on user reviews, which suffer from biases similar to online polls. This quote from the New York Times pertains to election polling done on news websites and blogs, but the concerns are the same for our audience scores:
“[Online polls] do a good job of engaging audiences online, and they do a good job of letting you know how other people who have come to the webpage feel about whatever issue,” said Mollyann Brodie, the executive director for public opinion and survey research at the Kaiser Family Foundation. “But they’re not necessarily good at telling you, in general, what people think, because we don’t know who’s come to that website and who’s taken it.”
Professional pollsters use scientific statistical methods to make sure that their small random samples are demographically appropriate to indicate how larger groups of people think. Online polls do nothing of the sort, and are not random, allowing anyone who finds the poll to vote.
(“Why You Shouldn’t Trust ‘Polls’ Conducted Online,” New York Times, September 28, 2016)
The critics’ scores are derived by people at Rotten Tomatoes exercising expert judgement over the critics’ reviews. We do not know the process of human judgement that turns a given critic’s review into a score – why it might be 45% rather than 50% – but we won’t be predicting critics’ scores here, since our boss at Paramount is concerned with what makes a movie popular, as opposed to critically acclaimed. In fact, a critic’s score might actually be a useful predictor of the audience score.
I believe it was the computer security guru Bruce Schneier who once said, “For marketing purposes, 30% accuracy can be good enough.” Perhaps it was a wry comment, but the point stands: when you are marketing a product, reaching the correct audience with 30% accuracy beats marketing to a purely random selection of people.
In that spirit, despite the bias problems with audience scoring addressed above, at the very least our inferences will tell us what kinds of films IMDb or Rotten Tomatoes reviewers will like. For this report, we will treat the audience scores as valid for generalizing to the greater population.
Note that we can only correlate movie attributes with audience score. We cannot infer causality, even if we were guaranteed the audience scores came from randomly selected reviewers.
For me personally, the interest here is general. I’ve been a movie lover pretty much all of my life. I lived in New York City for twelve years, where going to the movies borders on being a sport. As a kid I saw “Star Wars” twelve times in the theater in its original run in 1977. I have been an avid fan of “Mystery Science Theater 3000”, have seen countless films in art house cinemas, and even sat through the film adaptation of “Battlefield Earth,” which gets a 3% rating on Rotten Tomatoes (which sounds a bit high).
Our boss here at Paramount Pictures poses two questions for us: what attributes make a movie popular, and what is something new and interesting that we can learn about movies from the data?
We will use multiple linear regression to answer her first question.
As to learning something new: consider that a movie has a theatrical release date and a DVD release date. If we measure the amount of time from the theatrical release to the DVD release, does that length of time correlate to the popularity of the movie?
To speculate why this might be, we can think about a given movie having a “hype cycle.” A film may have an enormous amount of hype leading up to its release. It may do well at the box office. If the movie studio waits too long to release the DVD, then perhaps both DVD sales and “online hype” (which probably generates more IMDb and Rotten Tomatoes responses) suffer. We might see this as a negative correlation between the number of days and the audience score.
One concern here may be: “Does this question call for time-series data?” Independence of observations is one of the conditions for the least squares line we will fit, and time-series data would violate it. Because each film contributes a single integer (its number of days from theatrical release to DVD release) rather than repeated measurements over time, this should not pose a problem.
Our instructions list several things to remove: actor1 through actor5 and the URIs for the films. We’ll also leave out the director and title_type. At first I left in “studio” thinking my boss would want to know which studios netted films with higher scores, all other predictors held equal; but this made the output messy and the data looked somewhat questionable:
## studioWalt Disney Home Entertainment 38.31529 20.15666
## studioWalt Disney Pictures 13.97832 10.88395
## studioWalt Disney Productions 3.59856 13.25061
Walt Disney Home Entertainment does not, I believe, create theatrical releases. Also, Paramount Pictures (my boss) cannot control which studio makes a film, so I decided to omit the variable.
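For the record, output like the above comes from simply including studio as a predictor. A minimal sketch of such a fit (hypothetical; not the exact model I first ran):
# Including 'studio' as a predictor produces one coefficient per
# studio level, which makes the summary unwieldy
studio_model <- lm(audience_score ~ studio + genre + runtime, data = movies)
head(summary(studio_model)$coefficients, 10)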
# Keep 'title' in case we want to identify a single film
movies_for_lm <- movies %>% select(title, genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
str(movies_for_lm)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 16 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
We need to compute the number of days from the date of the theatrical release to the release date of the DVD. First, let’s remove all films that have NA for either release year.
movies_rel <- movies %>% filter(!is.na(dvd_rel_year)) %>% filter(!is.na(thtr_rel_year))
dim(movies_rel)
## [1] 643 32
That removed eight entries. Let’s compute the number of days between theatrical release date and DVD release date.
# Build a new column of type Date for the theatrical release date
movies_rel <- movies_rel %>% mutate(theatrical_release_date = as.Date(sprintf("%d-%d-%d", thtr_rel_year, thtr_rel_month, thtr_rel_day)))
# Build another new column of type Date for the DVD release date
movies_rel <- movies_rel %>% mutate(dvd_release_date = as.Date(sprintf("%d-%d-%d", dvd_rel_year, dvd_rel_month, dvd_rel_day)))
# Finally, calculate the number of days from theatrical release to DVD release
movies_rel <- movies_rel %>% mutate(days_to_dvd_release = as.numeric(dvd_release_date - theatrical_release_date, units="days"))
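Before modeling, a quick sanity check on the new column helps catch impossible values; negative values would mean a DVD date that precedes the theatrical date. A minimal check (my own addition here):
# Summarize the gap in days and count any negative values
summary(movies_rel$days_to_dvd_release)
sum(movies_rel$days_to_dvd_release < 0)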
And finally let’s do a linear regression with our new predictor, days_to_dvd_release, and our response variable audience_score from Rotten Tomatoes:
does_dvd_matter <- lm(audience_score ~ days_to_dvd_release, data = movies_rel)
summary(does_dvd_matter)
##
## Call:
## lm(formula = audience_score ~ days_to_dvd_release, data = movies_rel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.883 -15.859 3.146 17.148 33.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.181e+01 9.849e-01 62.761 <2e-16 ***
## days_to_dvd_release 2.779e-04 2.537e-04 1.095 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.13 on 641 degrees of freedom
## Multiple R-squared: 0.001869, Adjusted R-squared: 0.0003114
## F-statistic: 1.2 on 1 and 641 DF, p-value: 0.2737
Wow, talk about “no correlation!” I wonder, then, if something is wrong with our assumption. Let’s look at the years of the theatrical releases:
ggplot(data = movies_rel, aes(x = thtr_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year")
Ah. We have data for movies released as far back as the 1970s, decades before DVDs existed, so for those films the DVD release date will come long after the theatrical release date! Let’s look at the distribution of years for the DVD release date:
ggplot(data = movies_rel, aes(x = dvd_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "DVD Release Year")
With that in mind we can narrow our data. Let’s take the years 2000 to 2010 as the “heyday” of DVD sales and limit our model to films with theatrical releases in those years:
movies_dvd <- movies_rel %>% filter(thtr_rel_year >= 2000, thtr_rel_year <= 2010)
ggplot(data = movies_dvd, aes(x = factor(thtr_rel_year))) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year 2000-2010")
Now we should have better data to make our inference.
does_dvd_matter <- lm(audience_score ~ days_to_dvd_release, data = movies_dvd)
summary(does_dvd_matter)
##
## Call:
## lm(formula = audience_score ~ days_to_dvd_release, data = movies_dvd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.344 -14.543 3.084 17.008 31.659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.792877 1.655776 37.320 <2e-16 ***
## days_to_dvd_release 0.006231 0.004530 1.376 0.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.22 on 252 degrees of freedom
## Multiple R-squared: 0.007453, Adjusted R-squared: 0.003514
## F-statistic: 1.892 on 1 and 252 DF, p-value: 0.1702
While we still see a slope for days_to_dvd_release that is pretty much zero, at least it is roughly twenty times larger than before (0.0062 versus 0.0003). More on this in our “Inference” section below. Note we have three films whose DVDs came out before the theatrical release:
ggplot(data = movies_dvd, aes(x = days_to_dvd_release, y = audience_score)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
and those films are:
direct_to_dvd_maybe <- movies_dvd[movies_dvd$days_to_dvd_release < 0,]
direct_to_dvd_maybe[c("title", "genre", "thtr_rel_year", "dvd_rel_year")]
## # A tibble: 3 × 4
## title genre thtr_rel_year dvd_rel_year
## <chr> <fctr> <dbl> <dbl>
## 1 Dirty Sanchez: The Movie Comedy 2007 2007
## 2 London to Brighton Mystery & Suspense 2008 2005
## 3 Bookies Drama 2004 2004
but this doesn’t qualify them as outliers and they probably have little influence on the model.
For our multiple linear regression we will use backwards elimination with Adjusted R-squared as the criterion, which is better suited than p-value elimination when the goal is prediction. Let’s start by looking at the full model using the audience score from Rotten Tomatoes:
full_model <- lm(audience_score ~ genre + runtime + mpaa_rating + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies_for_lm)
summary(full_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box, data = movies_for_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.877 -8.763 0.515 9.113 40.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.20448 5.00242 6.038 2.68e-09 ***
## genreAnimation 5.35017 5.46689 0.979 0.328132
## genreArt House & International 6.32227 4.21886 1.499 0.134488
## genreComedy 0.01512 2.33917 0.006 0.994844
## genreDocumentary 10.92760 3.17288 3.444 0.000611 ***
## genreDrama 2.24704 2.04338 1.100 0.271901
## genreHorror -8.40856 3.48101 -2.416 0.015997 *
## genreMusical & Performing Arts 10.69044 4.49506 2.378 0.017693 *
## genreMystery & Suspense -3.79941 2.62451 -1.448 0.148211
## genreOther 1.45588 3.96310 0.367 0.713476
## genreScience Fiction & Fantasy -7.02810 4.96991 -1.414 0.157821
## runtime 0.07012 0.03332 2.105 0.035711 *
## mpaa_ratingNC-17 -11.29602 10.58749 -1.067 0.286419
## mpaa_ratingPG -1.98305 3.85925 -0.514 0.607542
## mpaa_ratingPG-13 -2.99617 3.97551 -0.754 0.451339
## mpaa_ratingR -1.35467 3.84036 -0.353 0.724397
## mpaa_ratingUnrated -3.06938 4.37962 -0.701 0.483667
## critics_score 0.43523 0.02290 19.004 < 2e-16 ***
## best_pic_nomyes 10.00938 3.62296 2.763 0.005900 **
## best_pic_winyes -1.95947 6.36506 -0.308 0.758301
## best_actor_winyes -1.58429 1.66193 -0.953 0.340817
## best_actress_winyes -2.23071 1.83976 -1.213 0.225778
## best_dir_winyes -0.07536 2.40966 -0.031 0.975060
## top200_boxyes 4.25460 3.79858 1.120 0.263121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.93 on 626 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5426, Adjusted R-squared: 0.5258
## F-statistic: 32.29 on 23 and 626 DF, p-value: < 2.2e-16
To save space (and scrolling) I’ll just print out the Adjusted R-squared value.
After performing backwards elimination for Adjusted R-squared for the Rotten Tomatoes audience scores, we wind up with an Adjusted R-squared value of about 0.526.
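The elimination itself is mechanical: refit the model with each predictor removed in turn, keep the removal that most improves Adjusted R-squared, and repeat until no removal helps. A sketch of one such pass (hypothetical code, using the full model’s predictors):
# One pass of backwards elimination by Adjusted R-squared
predictors <- c("genre", "runtime", "mpaa_rating", "critics_score",
                "best_pic_nom", "best_pic_win", "best_actor_win",
                "best_actress_win", "best_dir_win", "top200_box")
adj_r2 <- sapply(predictors, function(p) {
  f <- reformulate(setdiff(predictors, p), response = "audience_score")
  summary(lm(f, data = movies_for_lm))$adj.r.squared
})
# Drop the predictor whose removal gives the largest value, if it
# beats the current model's Adjusted R-squared; then repeat
sort(adj_r2, decreasing = TRUE)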
rt_model <- lm(audience_score ~ genre + runtime + mpaa_rating + critics_score + best_pic_nom + best_dir_win + top200_box, data = movies_for_lm)
summary(rt_model)$adj.r.squared
## [1] 0.5260853
Plotting the residuals, the scatter is random around zero and the distribution is nearly normal:
plot(rt_model$residuals)
hist(rt_model$residuals)
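A normal probability plot gives a sharper check on the nearly-normal condition than the histogram alone (an extra check on my part):
# Q-Q plot of the residuals against the normal distribution
qqnorm(rt_model$residuals)
qqline(rt_model$residuals)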
If we instead use the IMDb rating as our response variable, backwards elimination by Adjusted R-squared arrives at:
imdb_model <- lm(imdb_rating ~ genre + runtime + mpaa_rating + critics_score + best_pic_nom + best_dir_win + top200_box, data = movies_for_lm)
summary(imdb_model)$adj.r.squared
## [1] 0.6273571
which achieves a higher Adjusted R-squared with the same predictors. Plotting the residuals here looks good too:
plot(imdb_model$residuals)
hist(imdb_model$residuals)
so both models appear to meet the requirements for multiple linear regression.
We’ll test our model with the Disney animated film Zootopia. We’ll run our predictions for both Rotten Tomatoes and IMDb.
Zootopia <- data.frame(genre="Comedy", runtime=108, mpaa_rating="PG", critics_score=98, best_pic_nom="no", best_dir_win="no", top200_box="yes")
First we predict with our Rotten Tomatoes model:
predict(rt_model, newdata=Zootopia, interval="confidence", level=.95)
## fit lwr upr
## 1 81.76869 73.60684 89.93053
And then our IMDb model:
predict(imdb_model, newdata=Zootopia, interval="confidence", level=.95)
## fit lwr upr
## 1 7.330328 6.9423 7.718356
Our Rotten Tomatoes model predicts a score of 82% with a 95% confidence interval of (74%, 90%). Zootopia might be an outlier for our model: its actual Rotten Tomatoes audience score of 92% falls outside that interval.
Our IMDb model predicts a score of 7.3 with a 95% confidence interval of (6.9, 7.7). The actual score on IMDb is 8.1.
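One caveat on these intervals: interval = "confidence" covers the mean score of films sharing Zootopia’s characteristics. For a single new film, a prediction interval is the more appropriate choice, and it will be wider:
# Prediction interval for an individual film; wider because it adds
# film-to-film variability to the uncertainty in the mean
predict(rt_model, newdata = Zootopia, interval = "prediction", level = .95)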
Since our data consists primarily of live-action films (as opposed to animated films like Zootopia), let’s try something else: Deadpool.
Deadpool <- data.frame(genre="Action & Adventure", runtime=108, mpaa_rating="R", critics_score=84, best_pic_nom="no", best_dir_win="no", top200_box="yes")
Again we first use our Rotten Tomatoes model:
predict(rt_model, newdata=Deadpool, interval="confidence", level=.95)
## fit lwr upr
## 1 76.80159 68.7703 84.83288
And again with our IMDb model:
predict(imdb_model, newdata=Deadpool, interval="confidence", level=.95)
## fit lwr upr
## 1 7.235894 6.854073 7.617716
We see a Rotten Tomatoes prediction of 77% with a 95% CI of (69%, 85%). The actual score is 84%.
For our IMDb prediction we get 7.2 with a 95% CI of (6.9, 7.6). The actual IMDb score for Deadpool is 8.0.
We were tasked with two items from our boss at Paramount Pictures:
What makes a movie popular? There are far more variables than we were provided. I can imagine more complex models that account for combinations of actors and actresses, the producers, the screenwriters, holiday releases, and so on. There are probably even models specific to genres such as animated films; one model (or two) cannot serve films as different as Zootopia and Deadpool. Still, our two models came pretty close on both movies; in fact the Rotten Tomatoes model’s confidence interval captured Deadpool’s actual audience score.
For the second item we have shown that there is no statistically significant correlation between the number of days from a film’s theatrical release to its DVD release and the audience score. In data science it can be just as important to discover “there is no correlation” as it is to find one.
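For a simple regression like ours, a direct correlation test tells the same story, since its p-value matches the t-test on the slope:
# Pearson correlation test between the day gap and audience score
cor.test(movies_dvd$days_to_dvd_release, movies_dvd$audience_score)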
Note, though, that the number of days between release dates is probably very important in terms of DVD sales, something we are not measuring. We can guess that the movie studios know very well that waiting too long to bring a DVD to market after the theatrical release can negatively impact sales.