saturn·

rotten tomatoes movies

source: /home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv · 143,258 rows · 16 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 143,258 movies from Rotten Tomatoes across 16 columns covering metadata (title, director, writer, distributor), release info, runtime, genre, language, ratings, and critic/audience scores. Coverage is highly uneven: boxOffice (89.7% null), rating (90.2% null), tomatoMeter (76.4% null), and releaseDateTheaters (78.5% null) are sparse, and audienceScore is missing in roughly half the rows (48.9%). Worth a closer look first: the genre distribution, which is dominated by Drama (27,860), Documentary (15,162), and Comedy (11,514), and runtimeMinutes, which is heavily right-skewed (skew 7.6, max 2,700 minutes) with ~11.4% of non-null values flagged as outliers despite a tight interquartile range (Q1 84, Q3 103 minutes). The tomatoMeter and audienceScore distributions also differ clearly: among scored titles, critic scores skew positive (median 73) while audience scores are more middling (median 57). English accounts for 65.7% of non-null originalLanguage values, so any language-based analysis will be lopsided.

citing: row_count · column_count · columns.boxOffice.null_rate · columns.rating.null_rate · columns.tomatoMeter.null_rate · columns.releaseDateTheaters.null_rate · columns.audienceScore.null_rate · columns.genre.top_values · columns.runtimeMinutes.stats · columns.tomatoMeter.stats · columns.audienceScore.stats · columns.originalLanguage.top_rate · columns.originalLanguage.top_values

Schema

16 columns
Per-column summary.

column                type         null    unique    alerts
id                    text         0.0%    142,052   near_unique, one_word
title                 text         0.3%    126,403   multilingual
audienceScore         numeric      48.9%   101       null_rate
tomatoMeter           numeric      76.4%   101       null_rate
rating                categorical  90.2%   10        null_rate
ratingContents        text         90.2%   8,353     multilingual, null_rate, duplicates
releaseDateTheaters   text         78.5%   12,062    one_word, allcaps, null_rate, short_text, duplicates
releaseDateStreaming  text         44.6%   4,726     one_word, allcaps, null_rate, short_text, duplicates
runtimeMinutes        numeric      9.7%    324       high_skew, outliers
genre                 text         7.7%    2,912     one_word, duplicates
originalLanguage      categorical  9.7%    112       (none)
director              text         2.9%    62,207    duplicates
writer                text         37.1%   67,274    null_rate, duplicates
boxOffice             text         89.7%   4,863     one_word, allcaps, null_rate, short_text, duplicates
distributor           text         83.9%   3,694     null_rate, duplicates
soundMix              categorical  88.9%   551       null_rate

id

text identifier near_unique one_word
Slug-style identifier column: every value is a single token (one_word_rate 1.0, word_mean 1.0) with mean length ~18 chars and 142052 uniques out of 143258 rows. The 1206 duplicates (0.84%) are surprising for an id field — top repeats like 'catch_me_if_you_can' and 'hear_no_evil' suggest these are title-derived slugs rather than guaranteed-unique keys. Readability score is meaningless here (−75.5) because the tokens are underscore-joined phrases, not prose. Treatment: Use as a join key but deduplicate first — it is not strictly unique. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 0 (0.0%)
unique: 142,052
len_min: 1
len_max: 178
len_mean: 18.15
len_median: 16
len_p95: 37
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 1,206
duplicate_rate: 0.008418
vocab_size: 19,976
readability_flesch_mean: -75.47
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.005787
boilerplate_rate: 0
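The advice to deduplicate before joining can be sketched as follows. This is a minimal illustration, assuming rows are already loaded as dictionaries; the helper name `split_by_id_uniqueness` and the sample rows are hypothetical (only the slug 'hear_no_evil' comes from the profile). Which copy of a repeated id to keep is a judgment call, so the sketch only separates safe rows from repeated ids.

```python
from collections import Counter

def split_by_id_uniqueness(rows, key="id"):
    """Return (rows whose id occurs exactly once, {repeated id: count})."""
    counts = Counter(row[key] for row in rows)
    unique_rows = [row for row in rows if counts[row[key]] == 1]
    repeated_ids = {value: n for value, n in counts.items() if n > 1}
    return unique_rows, repeated_ids

# Made-up sample rows for illustration only.
sample = [
    {"id": "hear_no_evil", "title": "Hear No Evil"},
    {"id": "hear_no_evil", "title": "Hear No Evil"},
    {"id": "example_slug", "title": "Example"},
]
unique_rows, repeated_ids = split_by_id_uniqueness(sample)
```

Rows listed in `repeated_ids` can then be reviewed by hand or disambiguated with a second field before the id is used as a join key.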

title

text label multilingual
Short titles (mean 17 chars, median 3 words) of what look like films or works — top values include 'The Return', 'A Christmas Carol', 'Hero', 'Blue'. Predominantly English (3946) but 29 other languages are detected, with Spanish (123), German (80), and French (72) most common. Notable duplication: 16,488 repeats (11.5% duplicate rate) across 126,403 unique values out of 143,258 rows, and 17% are single-word titles. Treatment: Normalise case and tokenize for embedding; do not treat as a unique key given the 11.5% duplicate rate. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 367 (0.3%)
unique: 126,403
len_min: 1
len_max: 176
len_mean: 17.23
len_median: 15
len_p95: 37
word_mean: 3.074
word_median: 3
n_empty: 0
n_duplicates: 16,488
duplicate_rate: 0.1154
vocab_size: 17,597
readability_flesch_mean: 60.62
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1715
allcaps_rate: 0.003989
boilerplate_rate: 9.098e-05
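The duplicate figures reported for id and title are consistent with counting every non-null value beyond the first occurrence of each distinct value, and dividing by the non-null count. That reading is inferred from the numbers, not stated by the profiler; the function name `duplicate_stats` is hypothetical.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate) under the assumed definition:
    duplicates = non-null rows minus distinct values; rate = duplicates / non-null rows."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# Inputs taken from the id and title panels.
id_dupes, id_rate = duplicate_stats(143_258, 0, 142_052)
title_dupes, title_rate = duplicate_stats(143_258, 367, 126_403)
```

Under this assumption the results match the panels: 1,206 duplicates (rate 0.008418) for id and 16,488 duplicates (rate 0.1154) for title.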

audienceScore

numeric feature null_rate
This is an audience rating score on a 0-100 scale with 101 unique integer values, mean 55.67 and median 57. The distribution is wide (std 24.55, IQR 39) and slightly left-skewed (skew -0.23, kurtosis -0.83) with no outliers flagged. The dominant concern is missingness: 48.87% of rows are null, so nearly half the dataset lacks this score. Treatment: Impute or add a missing-indicator before modelling given the ~49% null rate. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 70,010 (48.9%)
unique: 101
min: 0
max: 100
mean: 55.67
median: 57
std: 24.55
q1: 37
q3: 76
iqr: 39
skew: -0.2257
kurtosis: -0.8322
n_outliers: 0
outlier_rate: 0
zero_rate: 0.0156

tomatoMeter

numeric feature null_rate
This is the Rotten Tomatoes critic score (tomatoMeter), a 0-100 percentage with 101 unique integer values, mean 65.77 and median 73. The distribution is left-skewed (skew -0.65) with Q1 at 45 and Q3 at 89, indicating most rated titles lean favorable. The dominant concern is coverage: 76.35% of rows are null, so the field is only populated for a minority of records. Treatment: Impute or add a missingness indicator before modelling given the 76% null rate. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 109,381 (76.4%)
unique: 101
min: 0
max: 100
mean: 65.77
median: 73
std: 28.02
q1: 45
q3: 89
iqr: 44
skew: -0.6467
kurtosis: -0.6657
n_outliers: 0
outlier_rate: 0
zero_rate: 0.02084
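Both score columns carry the same treatment: impute or add a missing-indicator. A minimal sketch of doing both together, assuming scores are held in a list with `None` for nulls and that the median is an acceptable fill value (the profile does not prescribe one). The function name and sample values are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None with the median of observed values and return a parallel 0/1 missing flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

# Made-up scores for illustration only.
scores = [90, None, 45, None, 73]
imputed, was_missing = impute_with_indicator(scores)
```

With 76% of tomatoMeter missing, the imputed values would dominate the column, so keeping the indicator alongside the filled score is probably the more informative half of this pair.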

rating

categorical feature null_rate
This is a content rating field mixing theatrical (R, PG-13, PG, NC-17, G) and television (TVPG, TV14, TVMA, TVY7, TVG) classifications across 10 distinct values. The column is 90.23% null, so only ~9.77% of the 143,258 rows carry a rating, and within those R alone accounts for 55.28% of values. The mixed rating systems and the long tail (TVG, TVY7, G each appearing once) suggest inconsistent sourcing rather than a clean controlled vocabulary. Treatment: Normalize TV vs MPAA codes into a unified scheme and add an explicit 'missing' category before encoding. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 129,267 (90.2%)
unique: 10
top_value: R
top_rate: 0.5528
cardinality: 10
entropy: 1.71
entropy_ratio: 0.5147
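One way to fold the theatrical and television codes into a single scheme is a lookup table with an explicit 'missing' category. The four tier names and the assignment of each code to a tier below are illustrative assumptions, not part of the dataset or of either rating system; adjust them to the downstream task.

```python
# Hypothetical audience tiers; the grouping is an assumption made for this sketch.
RATING_TIERS = {
    "G": "general", "TVG": "general", "TVY7": "general",
    "PG": "guidance", "TVPG": "guidance",
    "PG-13": "teen", "TV14": "teen",
    "R": "mature", "NC-17": "mature", "TVMA": "mature",
}

def unify_rating(raw):
    """Map a raw rating code to a tier; nulls become 'missing', unknown codes 'other'."""
    if raw is None or not raw.strip():
        return "missing"
    return RATING_TIERS.get(raw.strip().upper(), "other")
```

For example, `unify_rating("R")` gives `"mature"` and `unify_rating(None)` gives `"missing"`.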

ratingContents

text feature multilingual null_rate duplicates
This column stores content-rating descriptors (e.g. 'Language', 'Violence', 'Some Sexual Content') serialised as Python-style list literals rather than clean arrays. It is 90.23% null and, among the 13,991 populated rows, 40.3% are duplicates (8,353 unique values). A handful of entries are detected as non-English (12 it, 5 ro, 1 km) despite the vocabulary being tiny (1,188 words), and the bracket/quote artefacts in top_words confirm the values were never parsed out of their string representation. Treatment: Parse the list-literal strings into a multi-hot encoding of rating tags before modelling. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 129,267 (90.2%)
unique: 8,353
len_min: 5
len_max: 148
len_mean: 46.16
len_median: 44
len_p95: 88
word_mean: 5.169
word_median: 5
n_empty: 0
n_duplicates: 5,638
duplicate_rate: 0.403
vocab_size: 1,188
readability_flesch_mean: 15.02
emoji_rate: 0
url_rate: 0
one_word_rate: 0.05925
allcaps_rate: 0.06168
boilerplate_rate: 0
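The parsing step can be sketched with the standard library's `ast.literal_eval`, which evaluates a literal safely without running arbitrary code. The exact serialisation of the cells (quoting, brackets) is assumed from the description above; the helper names and sample strings are hypothetical.

```python
import ast

def parse_rating_contents(raw):
    """Parse a list-literal string into a list of tags; nulls and unparseable cells give []."""
    if raw is None:
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(tag).strip() for tag in parsed]

def multi_hot(tag_lists):
    """Return (sorted tag vocabulary, one 0/1 row per input list)."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1 if tag in tags else 0 for tag in vocab] for tags in tag_lists]
    return vocab, rows

# Made-up cells in the assumed format.
raw_values = ["['Language', 'Violence']", "['Violence']", None]
vocab, rows = multi_hot([parse_rating_contents(v) for v in raw_values])
```

Returning an empty list for nulls keeps the 90% missing rows as all-zero vectors; a separate missing flag would be needed to tell them apart from titles with no descriptors.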

releaseDateTheaters

text timestamp one_word allcaps null_rate short_text duplicates
This is a theatrical release date stored as an ISO-format string (every value is exactly 10 characters and a single token, e.g. '2018-09-14'). It is sparsely populated — 78.52% null — and the non-null values are heavily repeated, with a 60.8% duplicate rate across 12,062 distinct dates. The 'allcaps' alert is a false positive driven by digits-only strings; there's no actual text content to mine. Treatment: Parse to date and impute or flag the 78.52% missing before any time-based feature engineering. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 112,485 (78.5%)
unique: 12,062
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 18,711
duplicate_rate: 0.608
vocab_size: 9,088
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

releaseDateStreaming

text timestamp one_word allcaps null_rate short_text duplicates
This is a streaming-release date stored as ISO-8601 text (len_median 10, one_word_rate 1.0, all top values match YYYY-MM-DD). A len_min of 7 shows that at least one value is shorter than a full YYYY-MM-DD date, so parsing should tolerate partial or malformed entries. Roughly 44.56% of rows are null and the duplicate_rate is 0.94, with a single date, 2017-05-22, appearing 1,232 times, which points to heavy clustering on a few release days. The text-style alerts (allcaps, one_word, short_text) are artifacts of the date format, not a quality issue. Treatment: Parse to date dtype and treat missingness explicitly before any temporal feature engineering. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 63,838 (44.6%)
unique: 4,726
len_min: 7
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 74,694
duplicate_rate: 0.9405
vocab_size: 3,514
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
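Both release-date columns call for parsing to a date type. A minimal sketch using `datetime.strptime` with an explicit YYYY-MM-DD format, returning `None` for nulls and for anything that does not match. The function name is hypothetical, and the short value in the usage note is a made-up stand-in for whatever the 7-character entries in releaseDateStreaming actually contain.

```python
from datetime import date, datetime

def parse_release_date(raw):
    """Parse a YYYY-MM-DD string to a date; return None for nulls or non-matching text."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None
```

For example, `parse_release_date("2018-09-14")` yields `date(2018, 9, 14)`, while a hypothetical partial value such as `"2018-09"` yields `None` and can be counted separately from true nulls.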

runtimeMinutes

numeric feature high_skew outliers
Movie or episode runtime in minutes, with a typical feature-length distribution (median 92, Q1 84, Q3 103, IQR 19). The tail is extreme: max 2700, skew 7.62, kurtosis 598.65, and 11.37% of non-null values flagged as outliers (14,720 of 129,431), suggesting a mix of shorts, multi-part specials, or full series totals alongside standard films. Roughly 9.65% of rows are null. Treatment: Cap or log-transform before modelling and impute the ~10% nulls. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 13,827 (9.7%)
unique: 324
min: 1
max: 2,700
mean: 93.71
median: 92
std: 28.13
q1: 84
q3: 103
iqr: 19
skew: 7.623
kurtosis: 598.7
n_outliers: 14,720
outlier_rate: 0.1137
zero_rate: 0
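The cap and log-transform options can be sketched from the quartiles in the panel. The 1.5 × IQR fences used here are the common Tukey convention and are an assumption; the profile does not say which rule produced its outlier count. Function names are hypothetical.

```python
import math

Q1, Q3 = 84, 103               # quartiles reported in the panel above
IQR = Q3 - Q1                  # 19
LOWER_FENCE = Q1 - 1.5 * IQR   # 55.5 (assumed Tukey fence)
UPPER_FENCE = Q3 + 1.5 * IQR   # 131.5 (assumed Tukey fence)

def cap_runtime(minutes):
    """Clamp a runtime into [LOWER_FENCE, UPPER_FENCE]."""
    return min(max(minutes, LOWER_FENCE), UPPER_FENCE)

def log_runtime(minutes):
    """Compress the long right tail with log(1 + x)."""
    return math.log1p(minutes)
```

Capping pulls the 2,700-minute maximum down to 131.5 and the 1-minute minimum up to 55.5, which discards the distinction between shorts and features; the log transform keeps the ordering and may be the gentler choice when that distinction matters.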

genre

text label one_word duplicates
This is a categorical genre label for films, often a single word like 'Drama' (27,860 rows) or 'Documentary' (15,162) but sometimes a comma-separated combo such as 'Comedy, Drama'. With only 66 distinct vocabulary tokens but 2,912 unique strings and a 97.8% duplicate rate, the cardinality comes almost entirely from how genres are concatenated. Note the 7.74% null rate and that 55% of values are single-word, so multi-genre rows are the minority. Treatment: Split on comma and multi-hot encode into a small set of genre flags, since one row can carry several genres. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 11,083 (7.7%)
unique: 2,912
len_min: 3
len_max: 101
len_mean: 12.91
len_median: 11
len_p95: 32
word_mean: 1.874
word_median: 1
n_empty: 0
n_duplicates: 129,263
duplicate_rate: 0.978
vocab_size: 66
readability_flesch_mean: -19.66
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5533
allcaps_rate: 0
boilerplate_rate: 0
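Splitting the concatenated genre strings into flags can be sketched as below. The comma delimiter comes from the 'Comedy, Drama' example; the function names and the three-genre vocabulary in the usage note are illustrative.

```python
def split_genres(raw):
    """Split a comma-separated genre string into trimmed labels; nulls give []."""
    if raw is None:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]

def genre_flags(raw, vocabulary):
    """Return {genre: 0 or 1} for each genre in the supplied vocabulary."""
    present = set(split_genres(raw))
    return {genre: int(genre in present) for genre in vocabulary}
```

For example, `genre_flags("Comedy, Drama", ["Comedy", "Documentary", "Drama"])` sets the Comedy and Drama flags to 1 and Documentary to 0.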

originalLanguage

categorical feature
Categorical language label with 112 distinct values, dominated by English at 65.7% of non-null rows (about 85,000 of the 129,400 populated rows). The long tail spans regional variants (e.g., 'French (Canada)' vs 'French (France)', 'English (United Kingdom)') alongside bare language names like 'English' and 'French', suggesting inconsistent locale tagging that will fragment counts. Null rate is 9.67%, and entropy ratio of 0.38 confirms heavy concentration in a few categories. Treatment: Normalize locale variants to base language codes, then one-hot encode the top categories and bucket the rest as 'Other'. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 13,858 (9.7%)
unique: 112
top_value: English
top_rate: 0.6571
cardinality: 112
entropy: 2.605
entropy_ratio: 0.3827
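A sketch of the normalisation and bucketing steps, assuming locale variants always take the form 'Language (Region)' as in the examples above. It collapses to base language names rather than codes, since a name-to-code table is not part of the profile; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches a trailing parenthesised region such as " (Canada)".
_LOCALE_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")

def base_language(raw):
    """Strip a trailing '(Region)' suffix; nulls and blanks give None."""
    if raw is None or not raw.strip():
        return None
    return _LOCALE_SUFFIX.sub("", raw.strip())

def bucket_languages(values, top_n):
    """Keep the top_n most frequent non-null labels and replace the rest with 'Other'."""
    counts = Counter(v for v in values if v is not None)
    keep = {lang for lang, _ in counts.most_common(top_n)}
    return [v if v is None or v in keep else "Other" for v in values]
```

Running `base_language` before `bucket_languages` matters: otherwise 'French (Canada)' and 'French (France)' are counted separately and may both fall into 'Other'.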

director

text feature duplicates
Holds a film director's name, averaging 2.2 words and 14.8 characters with 62,207 unique values across 143,258 rows. The duplicate rate is 55.3% (76,857 rows), inflated by an 'Unknown Director' sentinel that occurs 3,544 times and should not be treated as a real name. Null rate is 2.93%, and the most frequent real names (David DeCoteau at 129, Sam Newfield at 124) reflect prolific B-movie directors rather than data quality issues. Treatment: Replace 'Unknown Director' with null and use as a high-cardinality categorical (target/frequency encode). high · anthropic:claude-opus-4-7
n: 143,258
nulls: 4,194 (2.9%)
unique: 62,207
len_min: 1
len_max: 326
len_mean: 14.81
len_median: 14
len_p95: 26
word_mean: 2.213
word_median: 2
n_empty: 0
n_duplicates: 76,857
duplicate_rate: 0.5527
vocab_size: 16,693
readability_flesch_mean: 47.17
emoji_rate: 0
url_rate: 0
one_word_rate: 0.007356
allcaps_rate: 7.191e-05
boilerplate_rate: 0
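The sentinel replacement and frequency encoding can be sketched as follows. Frequency encoding here means replacing each name with its share of the non-null rows, which is one common definition and an assumption about what is intended; the function names and the small sample are hypothetical.

```python
from collections import Counter

SENTINEL = "Unknown Director"  # placeholder value named in the profile

def clean_director(raw):
    """Return the trimmed name, or None for nulls and the sentinel."""
    if raw is None or raw.strip() == SENTINEL:
        return None
    return raw.strip()

def frequency_encode(values):
    """Replace each non-null value with its share of non-null rows; nulls become 0.0."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return [counts[v] / total if v is not None else 0.0 for v in values]

# Made-up sample; only the names themselves appear in the profile.
raw = ["Sam Newfield", "Unknown Director", "Sam Newfield", "David DeCoteau", None]
cleaned = [clean_director(v) for v in raw]
encoded = frequency_encode(cleaned)
```

Cleaning first keeps the 3,544 sentinel rows from being encoded as if they were the most prolific director in the dataset.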

writer

text feature null_rate duplicates
Holds writer credits, typically one or two personal names averaging 2.7 words and 21 characters, with familiar figures like Jing Wong, Woody Allen, and Ingmar Bergman topping the list. Coverage is weak: 37.1% of rows are null, and among populated rows 25.3% are duplicates across 67,274 unique values. A mean of 2.7 words against a median of 2, together with a maximum length of 262 characters, suggests that some cells concatenate multiple co-writers per title. Top tokens (michael, david, john) confirm Western personal names dominate, though 'de' hints at multi-name strings or non-English credits mixed in. Treatment: Split on delimiters into individual writers and explode for any per-person analysis; impute or flag the 37% missing before modelling. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 53,142 (37.1%)
unique: 67,274
len_min: 1
len_max: 262
len_mean: 21.34
len_median: 16
len_p95: 47
word_mean: 2.711
word_median: 2
n_empty: 0
n_duplicates: 22,842
duplicate_rate: 0.2535
vocab_size: 26,458
readability_flesch_mean: 37.79
emoji_rate: 0
url_rate: 0
one_word_rate: 0.005349
allcaps_rate: 6.658e-05
boilerplate_rate: 0
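The split-and-explode step can be sketched as below. The profile does not name the delimiter, so the comma default is an assumption to check against the data; the function name and the sample rows (including the pairing of two writers in one cell) are made up for illustration.

```python
def explode_writers(rows, key="writer", delimiter=","):
    """Emit one copy of each row per individual writer; rows with no writer are skipped."""
    exploded = []
    for row in rows:
        raw = row.get(key)
        if not raw:
            continue
        for name in raw.split(delimiter):
            name = name.strip()
            if name:
                exploded.append({**row, key: name})
    return exploded

# Made-up rows; the comma-joined cell is a hypothetical format.
rows = [
    {"id": "a", "writer": "Woody Allen, Ingmar Bergman"},
    {"id": "b", "writer": None},
]
per_writer = explode_writers(rows)
```

The exploded rows are suited to per-person counts only; title-level statistics should be computed on the original rows to avoid double counting.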

boxOffice

text feature one_word allcaps null_rate short_text duplicates
Box office gross stored as a short currency string like "$1.1M": every value is one token, 99.99% allcaps, and lengths range from 2 to 7 characters (median 6). The column is 89.71% null and only 4,863 distinct values cover the 14,743 populated rows, with a 67.01% duplicate rate concentrated on round million-dollar figures. Note this is a coarse, pre-formatted string (millions only), not a precise revenue number; the len_min of 2 also shows that not every value follows the full "$X.XM" pattern, so a parser should tolerate shorter forms. Treatment: Parse the "$X.XM" string into a numeric dollar amount and decide whether to impute or drop given the 89.71% null rate. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 128,515 (89.7%)
unique: 4,863
len_min: 2
len_max: 7
len_mean: 5.977
len_median: 6
len_p95: 7
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 9,880
duplicate_rate: 0.6701
vocab_size: 4,863
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9999
boilerplate_rate: 0
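A sketch of the currency-string parser. Only the "M" suffix is confirmed by the profile; support for "K", "B", and a bare number is an assumption added so that short values are not silently dropped. The function name is hypothetical.

```python
import re

# Multipliers per suffix; "M" is from the profile, the others are assumed.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_PATTERN = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")

def parse_box_office(raw):
    """Convert a string such as '$1.1M' to whole dollars; return None if it does not match."""
    if raw is None:
        return None
    match = _PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    number, suffix = match.groups()
    return round(float(number) * _SUFFIX[suffix])
```

For example, `parse_box_office("$1.1M")` returns `1100000`. Because the source string carries only one decimal place, the result is accurate to the nearest 0.1 of the suffix unit at best.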

distributor

text feature null_rate duplicates
This column lists film distributor names, dominated by major studios like Paramount Pictures (994), 20th Century Fox (745), and Universal Pictures (737). It is overwhelmingly sparse, with an 83.94% null rate (120,253 of 143,258 rows) and, by coincidence, an 83.94% duplicate rate among the populated rows (19,311 of 23,005) across 3,694 unique values, so most rows lack distributor data while a small set of studios accounts for the populated entries. Names are short (mean length 19.9 chars, median 2 words) and vocabulary is concentrated around terms like 'pictures', 'films', and 'entertainment'. Treatment: Normalize studio name variants and treat as a high-cardinality categorical with an explicit 'missing' bucket given the 83.94% null rate. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 120,253 (83.9%)
unique: 3,694
len_min: 3
len_max: 385
len_mean: 19.89
len_median: 17
len_p95: 43
word_mean: 2.649
word_median: 2
n_empty: 0
n_duplicates: 19,311
duplicate_rate: 0.8394
vocab_size: 2,650
readability_flesch_mean: 16.85
emoji_rate: 0
url_rate: 0
one_word_rate: 0.09376
allcaps_rate: 0.01474
boilerplate_rate: 0

soundMix

categorical feature null_rate
Catalogues the audio mix format of each title (Surround, Dolby Digital, Stereo, Mono, etc.), with 551 distinct labels across 143,258 rows. The dominant issue is sparsity: 88.89% of values are null, and even among populated rows 'Surround' covers only 25.6%. Free-form combinations like 'Stereo, Surround' vs 'Surround, Stereo' and overlapping Dolby variants suggest the field is unnormalised multi-label text rather than a clean taxonomy. Treatment: Split on commas, normalise Dolby/Surround variants, and treat as multi-hot; consider dropping if downstream task can't tolerate 89% missingness. high · anthropic:claude-opus-4-7
n: 143,258
nulls: 127,341 (88.9%)
unique: 551
top_value: Surround
top_rate: 0.256
cardinality: 551
entropy: 4.663
entropy_ratio: 0.5121
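The order-dependence between labels like 'Stereo, Surround' and 'Surround, Stereo' can be removed by splitting on commas and sorting the parts. This sketch covers only that step; merging overlapping Dolby variants would need a hand-built mapping that the profile does not supply. The function name is hypothetical.

```python
def normalise_sound_mix(raw):
    """Split on commas, trim, de-duplicate, and sort so that label order no longer matters."""
    if raw is None:
        return ()
    parts = {part.strip() for part in raw.split(",") if part.strip()}
    return tuple(sorted(parts))
```

Both `normalise_sound_mix("Stereo, Surround")` and `normalise_sound_mix("Surround, Stereo")` return `("Stereo", "Surround")`, which can then feed a multi-hot encoding.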