About the Data:

For these graphs I ran a linear regression on the data, which was scraped and cleaned in Python. As a predictive model, it was pretty weak, with R-squared values between .3 and .6 depending on the characters included. Despite this, we're still able to pluck out specific pieces of information with statistical significance.

"Line" in this context is defined as all the words a character would say in a row without being interrupted, and without scenes changing. This is in contrast with the theatrical definition of the word, where it is possible for a character to have multiple consecutive lines in the same scene.

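The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the scraper actually used: the `(scene, speaker, text)` tuple layout, the `count_lines` helper, and the toy script are all assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def count_lines(utterances):
    """Count "lines" per character.

    `utterances` is a list of (scene, speaker, text) tuples in script order
    (a hypothetical layout). Consecutive utterances by the same speaker in
    the same scene are merged into a single line.
    """
    counts = Counter()
    # groupby only merges *adjacent* items that share a key, so an
    # interruption or a scene change starts a new line.
    for (_scene, speaker), _run in groupby(utterances, key=lambda u: (u[0], u[1])):
        counts[speaker] += 1
    return counts


# Toy script, invented for illustration.
script = [
    (1, "Michael", "Good morning."),
    (1, "Michael", "Everybody in the conference room."),  # merged with the row above
    (1, "Jim", "Now?"),
    (1, "Michael", "Now."),  # new line: Michael was interrupted
    (2, "Michael", "So."),   # new line: the scene changed
]
print(count_lines(script))
```

In this toy script Michael speaks five times but is credited with three lines, and Jim with one.
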
The first graph is powered by the following regression. Using the rating of the show as the dependent variable, it uses each character's appearance as a binary independent variable. For instance, if Jim appeared in an episode, his data point would be 1; otherwise, it would be 0. This regression allows us to look at the expected effect of each character's appearance in an episode. For instance, Jim's coefficient is 1.671. This means that, all else equal, we could expect an episode without Jim to be rated 1.671 points lower on IMDB than a comparable episode in which Jim appears.

Character   Coefficient   Standard Error   t Statistic   P-Value
Intercept   6.1233        0.5565           11.0034       1.242E-21
Jim         1.6710        0.4829           3.4603        6.814E-4
Angela      -0.0463       0.1282           -0.3613       0.7183
Jan         0.0281        0.0949           0.2967        0.7671
Andy        0.1510        0.0888           1.6993        0.0911
Pam         -0.0613       0.2326           -0.2636       0.7924
Oscar       0.0598        0.1163           0.5144        0.6075
Phyllis     -0.0187       0.1122           -0.1671       0.8675
Meredith    0.1152        0.0747           1.5410        0.1252
Michael     0.5188        0.1083           4.7896        3.604E-06
Ryan        -0.0374       0.0936           -0.4000       0.6897
Darryl      0.0275        0.0762           0.3608        0.7187
Kelly       0.0403        0.0943           0.4314        0.6667
Toby        0.1214        0.0668           1.8184        0.0708
Erin        -0.2293       0.0972           -2.3584       0.0195
Kevin       -0.06957      0.1908           -0.3601       0.7163
Stanley     0.0680        0.1093           0.6219        0.5348
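
To make the "all else equal" reading concrete, here is a minimal sketch of how a fitted model of this kind turns a cast list into a predicted rating. The numbers are copied from the table above (only three characters, for brevity); the `predicted_rating` helper is a hypothetical name introduced for the example, not part of the original analysis.

```python
# Coefficients copied from the table above (a subset, for brevity).
INTERCEPT = 6.1233
COEF = {"Jim": 1.6710, "Michael": 0.5188, "Erin": -0.2293}


def predicted_rating(cast):
    """Intercept plus the coefficient of every listed character who appears.

    A character contributes coefficient * 1 when present and
    coefficient * 0 when absent. Names not in COEF are ignored in this
    simplified sketch.
    """
    return INTERCEPT + sum(COEF[name] for name in cast if name in COEF)


with_jim = predicted_rating(["Jim", "Michael"])
without_jim = predicted_rating(["Michael"])
print(round(with_jim - without_jim, 4))  # 1.671, i.e. Jim's coefficient
```

The gap between the two otherwise identical episodes is exactly Jim's coefficient, which is what the 1.671 figure means.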



The P-value is the probability of seeing a coefficient at least this far from zero if the character actually had no effect on the ratings; the smaller it is, the less likely the coefficient is just noise. Many of the characters had P-values greater than .5. These characters have a lot of "noise": variance in their expected effect on the ratings. Their coefficients are little more than averages of highly varied numbers, with little correlation to IMDB score. All of the characters whose appearance had at least a somewhat statistically significant impact are highlighted in blue. Michael, we can see, had what amounts to an undeniable effect on the ratings. Additionally, Jim had a very high impact on the ratings. This surprised me: because he's in a lot of episodes, I expected there to be almost no statistical significance. Upon further investigation, this is the case because Jim is absent from only a handful of episodes, and they were well below the show's average rating.
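
The last two columns of the table follow from the first two: the t statistic is the coefficient divided by its standard error, and the P-value is how much probability lies beyond that t statistic in both tails. The sketch below checks this for Jim and Stanley. It uses a normal approximation (an assumption made here to keep the example dependency-free); the table's own P-values presumably come from a t distribution with the regression's degrees of freedom, which are not given, so the results only match approximately.

```python
import math


def t_statistic(coefficient, standard_error):
    """How many standard errors the coefficient sits away from zero."""
    return coefficient / standard_error


def approx_two_sided_p(t):
    """Two-sided P-value from a normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


# Values copied from the table above.
for name, coef, se in [("Jim", 1.6710, 0.4829), ("Stanley", 0.0680, 0.1093)]:
    t = t_statistic(coef, se)
    print(f"{name}: t = {t:.4f}, approximate P = {approx_two_sided_p(t):.4f}")
```

Jim's coefficient sits about 3.46 standard errors from zero, which is why its P-value is tiny; Stanley's is well under one standard error away, which is why it is not distinguishable from noise.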


This second regression is really interesting. I looked at the ratings against the number of lines the main characters had in each episode. The coefficients show the marginal change in rating for each additional line a character has. In plain English: if you give a character another line, what happens to the rating?

Character   Coefficient   Standard Error   t Statistic   P-Value
Intercept   7.6816        0.1159           66.3017       6.55E-124
Jim         0.0035        0.0020           1.7054        0.0899
Angela      -0.0041       0.0042           0.9813        0.3278
Jan         0.0040        0.0030           1.3401        0.1820
Andy        -0.0008       0.0020           -0.3860       0.7000
Pam         0.0004        0.0024           0.1511        0.8801
Oscar       -0.0007       0.0053           -0.1262       0.8998
Phyllis     0.0052        0.0083           0.7202        0.4724
Meredith    0.0022        0.0083           0.2638        0.7923
Michael     0.0050        0.0010           4.8563        2.685E-06
Ryan        -0.0022       0.0034           -0.6527       0.5148
Darryl      -0.0004       0.0034           -0.1056       0.9160
Kelly       0.0039        0.0070           0.5571        0.5782
Toby        0.0020        0.0050           0.4052        0.6858
Erin        -0.0066       0.0038           -1.7125       0.0886
Kevin       -0.0112       0.0056           -1.8732       0.0627
Stanley     0.0147        0.0087           1.6942        0.0920
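
Because these coefficients are per line, they look tiny until they are multiplied by a realistic change in line count. A minimal sketch, using three coefficients copied from the table above; the `rating_change` helper and the choice of ten extra lines are assumptions made purely for illustration.

```python
# Per-line coefficients copied from the table above (a subset, for brevity).
LINE_COEF = {"Michael": 0.0050, "Stanley": 0.0147, "Kevin": -0.0112}


def rating_change(character, extra_lines):
    """Expected change in rating from giving a character `extra_lines` more lines."""
    return LINE_COEF[character] * extra_lines


for name in LINE_COEF:
    print(f"{name}: {rating_change(name, 10):+.3f} for 10 extra lines")
```

The sign of each result follows the sign of the coefficient in the table, so the sketch is only as reliable as those estimates and their P-values.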

Giving Michael more lines comes with a statistically significant bump in IMDB rating. Jim is correlated with a positive marginal rate of return to IMDB in both visualizations as well. Additionally, and I think this is a really fun takeaway, Stanley and Kevin have far and away the largest coefficients by magnitude in this regression. Stanley's is positive, which implies that he really shines in most of the episodes where he has a lot of lines; Kevin's is negative in the table above, which points the other way.


