About the Data:
For these graphs I ran a linear regression on data
that was scraped and cleaned in Python.
As a predictive model it was fairly weak, with R-squared values between 0.3 and 0.6 depending on which characters were included.
Despite this, we can still pluck out
specific pieces of information with statistical significance.
"Line" in this context is defined as all the words a character would say in a row
without being interrupted, and without scenes changing. This is in contrast with the theatrical definition of the word,
where it is possible for a character to have multiple consecutive lines in the same scene.
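The definition above amounts to merging consecutive script rows that share a scene and a speaker. Here is a minimal sketch of that idea, assuming the cleaned script is a list of (scene, speaker, text) rows in broadcast order. The row layout, the sample dialogue, and the helper name `count_lines` are all hypothetical; this is not the actual scraping code.

```python
from collections import Counter
from itertools import groupby

# Hypothetical cleaned script rows: (scene number, speaker, text).
rows = [
    (1, "Michael", "Good morning."),
    (1, "Michael", "Big day today."),    # same speaker, same scene: still one line
    (1, "Jim", "Is it?"),                # another speaker interrupts
    (1, "Michael", "It is."),            # new line after the interruption
    (2, "Michael", "Conference room."),  # scene change starts a new line
    (2, "Michael", "Five minutes."),
]

def count_lines(script_rows):
    """Count uninterrupted runs of speech per character."""
    counts = Counter()
    # groupby only merges adjacent rows with an equal key, which matches
    # "in a row, same speaker, same scene".
    for (_scene, speaker), _run in groupby(script_rows, key=lambda r: (r[0], r[1])):
        counts[speaker] += 1
    return counts

line_counts = count_lines(rows)

# Theatrical definition for comparison: every row counts as its own line.
theatrical_counts = Counter(speaker for _scene, speaker, _text in rows)
```

With these toy rows, Michael has three lines under the definition used here but five under the theatrical one.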
The first graph is powered by the following regression. Using the rating of the show as the dependent variable, this graph
uses each character's appearance as a binary independent variable. For instance, if Jim appeared in an episode, the data point
would be 1; otherwise, it would be 0. This regression allows us to look at the expected effect of each character's appearance
in the show. For instance, Jim's coefficient is 1.671. This means that, all else equal, we could expect an episode without Jim
to be rated 1.671 points lower on IMDB than a comparable episode in which Jim appears.
Character | Coefficient | Standard Error | t Statistic | P-Value |
Intercept | 6.1233 | 0.5565 | 11.0034 | 1.242E-21 |
Jim | 1.6710 | 0.4829 | 3.4603 | 6.814E-4 |
Angela | -0.0463 | 0.1282 | -0.3613 | 0.7183 |
Jan | 0.0281 | 0.0949 | 0.2967 | 0.7671 |
Andy | 0.1510 | 0.0888 | 1.6993 | 0.0911 |
Pam | -0.0613 | 0.2326 | -0.2636 | 0.7924 |
Oscar | 0.0598 | 0.1163 | 0.5144 | 0.6075 |
Phyllis | -0.0187 | 0.1122 | -0.1671 | 0.8675 |
Meredith | 0.1152 | 0.0747 | 1.5410 | 0.1252 |
Michael | 0.5188 | 0.1083 | 4.7896 | 3.604E-06 |
Ryan | -0.0374 | 0.0936 | -0.4000 | 0.6897 |
Darryl | 0.0275 | 0.0762 | 0.3608 | 0.7187 |
Kelly | 0.0403 | 0.0943 | 0.4314 | 0.6667 |
Toby | 0.1214 | 0.0668 | 1.8184 | 0.0708 |
Erin | -0.2293 | 0.0972 | -2.3584 | 0.0195 |
Kevin | -0.06957 | 0.1908 | -0.3601 | 0.7163 |
Stanley | 0.0680 | 0.1093 | 0.6219 | 0.5348 |
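A regression of this shape can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily how the numbers above were produced: the two-character toy dataset is made up, and the ratings are constructed without noise so that ordinary least squares recovers the chosen intercept and coefficients.

```python
import numpy as np

# Hypothetical toy data: one row per episode, one 0/1 column per character.
characters = ["Jim", "Michael"]
appears = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
], dtype=float)

# Made-up ratings built as 6.5 + 1.5 * Jim + 0.5 * Michael (no noise).
ratings = 6.5 + appears @ np.array([1.5, 0.5])

# Prepend a column of ones so the model fits an intercept.
X = np.column_stack([np.ones(len(ratings)), appears])

# Ordinary least squares: find beta minimizing ||X @ beta - ratings||^2.
beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)

intercept = beta[0]
coefficients = dict(zip(characters, beta[1:]))
```

Here `coefficients["Jim"]` is the expected rating difference between an episode with Jim and an otherwise identical episode without him, which is the same reading applied to the table above.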
The P-value is the probability of seeing a coefficient at least this far from zero if the character's true effect on the ratings were zero. Many of the characters had P-values greater than
0.5. These characters have a lot of "noise": high variance in their estimated effect on the ratings. Their coefficients are statistically indistinguishable from zero
and tell us little about the IMDB score.
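To make the table columns concrete: the t statistic is the coefficient divided by its standard error, and the P-value is the two-sided tail probability of that statistic. The sketch below uses a normal approximation for the tail, which is an assumption on my part. The table's P-values come from a t distribution whose degrees of freedom are not stated here, so the approximate values come out somewhat smaller than the reported ones.

```python
import math

def t_statistic(coefficient, standard_error):
    """Coefficient measured in units of its own standard error."""
    return coefficient / standard_error

def two_sided_p_normal(t):
    """Normal approximation to the two-sided P-value, P(|Z| >= |t|)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Values copied from the appearance regression table.
t_jim = t_statistic(1.6710, 0.4829)
p_jim = two_sided_p_normal(t_jim)

t_erin = t_statistic(-0.2293, 0.0972)
p_erin = two_sided_p_normal(t_erin)
```

Recomputing t this way matches the table to within rounding, and the sign of t always follows the sign of the coefficient.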
All of the characters whose appearance had at least a somewhat statistically significant impact are highlighted in blue. Michael, we
can see, had a highly significant effect on the ratings.
Additionally, Jim had a very high impact on the ratings. This surprised me, because he's in a lot of episodes, so I expected there to
be almost no statistical significance. On further investigation, the result arises because Jim is absent from only a handful of episodes, and they were
rated well below the show's average.
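A quick way to run this kind of check is to count the episodes a character misses and compare average ratings with and without them. The sketch below uses hypothetical (appears, rating) pairs rather than the real episode data. In a regression with a single 0/1 variable the coefficient equals this gap in means exactly; with many characters in the model the coefficient is an adjusted version of it.

```python
from statistics import mean

# Hypothetical (appears, rating) pairs for one character; not the real data.
episodes = [(1, 8.4), (1, 8.2), (1, 8.6), (1, 8.0), (0, 6.9), (0, 7.1)]

present = [rating for appears, rating in episodes if appears == 1]
absent = [rating for appears, rating in episodes if appears == 0]

# How many episodes drive the estimate, and how large the raw gap is.
n_absent = len(absent)
gap = mean(present) - mean(absent)
```

When `n_absent` is small, a few unusually weak episodes are enough to produce a large coefficient.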
This second regression is really interesting. I regressed the ratings on the number of lines the main characters
had in each episode. Each coefficient shows the marginal change in rating for one additional line from that character.
In plain English: if you give a character another line, what happens to the rating?
Character | Coefficient | Standard Error | t Statistic | P-Value |
Intercept | 7.6816 | 0.1159 | 66.3017 | 6.55E-124 |
Jim | 0.0035 | 0.0020 | 1.7054 | 0.0899 |
Angela | -0.0041 | 0.0042 | 0.9813 | 0.3278 |
Jan | 0.0040 | 0.0030 | 1.3401 | 0.1820 |
Andy | -0.0008 | 0.0020 | -0.3860 | 0.7000 |
Pam | 0.0004 | 0.0024 | 0.1511 | 0.8801 |
Oscar | -0.0007 | 0.0053 | -0.1262 | 0.8998 |
Phyllis | 0.0052 | 0.0083 | 0.7202 | 0.4724 |
Meredith | 0.0022 | 0.0083 | 0.2638 | 0.7923 |
Michael | 0.0050 | 0.0010 | 4.8563 | 2.685E-06 |
Ryan | -0.0022 | 0.0034 | -0.6527 | 0.5148 |
Darryl | -0.0004 | 0.0034 | -0.1056 | 0.9160 |
Kelly | 0.0039 | 0.0070 | 0.5571 | 0.5782 |
Toby | 0.0020 | 0.0050 | 0.4052 | 0.6858 |
Erin | -0.0066 | 0.0038 | -1.7125 | 0.0886 |
Kevin | -0.0112 | 0.0056 | -1.8732 | 0.0627 |
Stanley | 0.0147 | 0.0087 | 1.6942 | 0.0920 |
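Because the model is linear, the expected change in rating is the coefficient multiplied by the number of extra lines. The sketch below copies three coefficients from the table above. The helper name `rating_change` and the choice of 20 extra lines are hypothetical, and the result should be read as a rough extrapolation from a weak model rather than a prediction.

```python
# Rating points per additional line, copied from the line-count regression.
line_coefficients = {"Michael": 0.0050, "Stanley": 0.0147, "Kevin": -0.0112}

def rating_change(character, extra_lines):
    """Expected rating change under the linear model, all else equal."""
    return line_coefficients[character] * extra_lines

# Hypothetical scenario: each character gets 20 more lines in an episode.
changes = {name: rating_change(name, 20) for name in line_coefficients}
```

Twenty extra lines for Michael works out to about a tenth of a rating point, while the same number of lines moves the estimate up for Stanley and down for Kevin.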
Giving Michael more lines is associated with a statistically significant increase in IMDB rating. Jim's coefficient is positive
in both visualizations as well, although here it is only marginally significant (P-value of 0.0899).
Additionally, and I think this is a really fun takeaway, Stanley and Kevin have far and away the largest coefficients in this regression by absolute value.
Stanley's is positive, which suggests that he really shines in the episodes where he has a lot of lines, while Kevin's is negative, which suggests the opposite.
Both are only marginally significant, with P-values of 0.0920 and 0.0627.