saturn·

rotten tomatoes rotten tomatoes movies

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv

Saturn profiled 143,258 rows across 16 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess

# Profile the CSV and write the findings JSON; --llm adds the opt-in prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv",
        "--findings", "rotten_tomatoes-rotten_tomatoes_movies.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)

Summary confidence: high

This dataset catalogs 143,258 movies from Rotten Tomatoes across 16 columns covering metadata (title, director, writer, distributor), release info, runtime, genre, language, ratings, and critic/audience scores. Coverage is highly uneven: rating (90.2% null), boxOffice (89.7% null), releaseDateTheaters (78.5% null), and tomatoMeter (76.4% null) are sparse, and audienceScore is missing in roughly half the rows (48.9%). Worth a closer look first: the genre distribution, which is dominated by Drama (27,860), Documentary (15,162), and Comedy (11,514), and runtimeMinutes, which is heavily right-skewed (skew 7.6, max 2,700 minutes) with ~11.4% of non-null values flagged as outliers because the interquartile range is so tight (Q1 84, Q3 103, IQR 19 minutes). The tomatoMeter and audienceScore distributions also differ: critic scores skew positive (median 73) while audience scores are more middling (median 57). The two scores are populated for different numbers of rows, so the medians are not a like-for-like comparison. English accounts for 65.7% of non-null originalLanguage values (59.4% of all rows), so any language-based analysis will be lopsided.

citing: row_count · column_count · columns.boxOffice.null_rate · columns.rating.null_rate · columns.tomatoMeter.null_rate · columns.releaseDateTheaters.null_rate · columns.audienceScore.null_rate · columns.genre.top_values · columns.runtimeMinutes.stats · columns.tomatoMeter.stats · columns.audienceScore.stats · columns.originalLanguage.top_rate · columns.originalLanguage.top_values

Out[4]:

saturn.schema() · 16 columns

column kind n null% unique alerts
id text 143,258 0.0% 142,052 near_unique one_word
title text 143,258 0.3% 126,403 multilingual
audienceScore numeric 143,258 48.9% 101 null_rate
tomatoMeter numeric 143,258 76.4% 101 null_rate
rating categorical 143,258 90.2% 10 null_rate
ratingContents text 143,258 90.2% 8,353 multilingual null_rate duplicates
releaseDateTheaters text 143,258 78.5% 12,062 one_word allcaps null_rate short_text duplicates
releaseDateStreaming text 143,258 44.6% 4,726 one_word allcaps null_rate short_text duplicates
runtimeMinutes numeric 143,258 9.7% 324 high_skew outliers
genre text 143,258 7.7% 2,912 one_word duplicates
originalLanguage categorical 143,258 9.7% 112
director text 143,258 2.9% 62,207 duplicates
writer text 143,258 37.1% 67,274 null_rate duplicates
boxOffice text 143,258 89.7% 4,863 one_word allcaps null_rate short_text duplicates
distributor text 143,258 83.9% 3,694 null_rate duplicates
soundMix categorical 143,258 88.9% 551 null_rate
Fig 1.
genre · Drama, Documentary, and Comedy dominate; note how long-tail combo genres fragment the rest.
Fig 2.
runtimeMinutes · Most films cluster around 84–103 minutes, but extreme outliers stretch the tail to 2,700 minutes.
Fig 3.
tomatoMeter · Critic scores skew positive with a median of 73 — look for the left tail of poorly reviewed films.
Fig 4.
audienceScore · Audience scores are flatter and more centered (median 57) than critic scores — compare the two distributions.
Fig 5.
originalLanguage · English accounts for ~66% of non-null values (59.4% of all rows); the remaining 111 languages form a long tail worth segmenting.
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column                kind         null %
id                    text         0.0%
title                 text         0.3%
audienceScore         numeric      48.9%
tomatoMeter           numeric      76.4%
rating                categorical  90.2%
ratingContents        text         90.2%
releaseDateTheaters   text         78.5%
releaseDateStreaming  text         44.6%
runtimeMinutes        numeric      9.7%
genre                 text         7.7%
originalLanguage      categorical  9.7%
director              text         2.9%
writer                text         37.1%
boxOffice             text         89.7%
distributor           text         83.9%
soundMix              categorical  88.9%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 4,885 detected strings).
lang  count  share
en    4371   89.5%
es    123    2.5%
de    80     1.6%
fr    72     1.5%
it    67     1.4%
nl    23     0.5%
pt    20     0.4%
sv    16     0.3%
ceb   12     0.2%
tr    12     0.2%
fi    10     0.2%
pl    8      0.2%
id    8      0.2%
eo    6      0.1%
ca    6      0.1%
sl    5      0.1%
ru    5      0.1%
no    5      0.1%
ro    5      0.1%
ms    4      0.1%
hu    3      0.1%
hr    3      0.1%
af    3      0.1%
et    3      0.1%
ja    3      0.1%
cs    2      0.0%
la    2      0.0%
tl    2      0.0%
da    2      0.0%
sr    2      0.0%
lv    1      0.0%
km    1      0.0%
Fig 8.
Pearson correlation across 3 numeric columns (sampled, bounded; values clipped to 2 decimals).
                audienceScore  tomatoMeter  runtimeMinutes
audienceScore   +1.00          -0.04        +0.01
tomatoMeter     -0.04          +1.00        +0.02
runtimeMinutes  +0.01          +0.02        +1.00

id text identifier

Slug-style identifier column: every value is a single token (one_word_rate 1.0, word_mean 1.0) with a mean length of ~18 characters, and there are 142,052 unique values across 143,258 rows. The 1,206 duplicates (0.84%) are surprising for an id field; top repeats such as 'catch_me_if_you_can' and 'hear_no_evil' suggest these are title-derived slugs rather than guaranteed-unique keys. The readability score (−75.5) is meaningless here because the tokens are underscore-joined phrases, not prose.

Treatment: Usable as a join key only after deduplication; it is not strictly unique.

anthropic:claude-opus-4-7 · confidence high
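
A minimal sketch of the deduplication step, using only the standard library. The sample rows are hypothetical (only the two repeated slugs come from the profile), and keep-first is one possible policy, not something the report prescribes.

```python
from collections import Counter

# Hypothetical sample rows; the two repeated slugs are the ones quoted in the profile.
rows = [
    {"id": "catch_me_if_you_can", "title": "Catch Me If You Can"},
    {"id": "hear_no_evil", "title": "Hear No Evil"},
    {"id": "catch_me_if_you_can", "title": "Catch Me if You Can"},
    {"id": "hero", "title": "Hero"},
    {"id": "hear_no_evil", "title": "Hear No Evil"},
]

# Count occurrences of each slug to find the keys that are not unique.
id_counts = Counter(row["id"] for row in rows)
repeated_ids = sorted(slug for slug, n in id_counts.items() if n > 1)

# Keep the first row seen for each slug (an assumed policy; others are possible).
seen = set()
deduped = []
for row in rows:
    if row["id"] not in seen:
        seen.add(row["id"])
        deduped.append(row)

print(repeated_ids)
print(len(rows), "->", len(deduped))
```

Inspecting the repeated slugs before dropping anything is worthwhile, since two rows sharing a slug may be different films with the same title.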
Out[14]:

saturn.columns["id"].stats

stat value
n 143,258
nulls 0 (0.0%)
unique 142,052
len_min 1
len_max 178
len_mean 18.15
len_median 16
len_p95 37
word_mean 1
word_median 1
n_empty 0
n_duplicates 1,206
duplicate_rate 0.008418
vocab_size 19,976
readability_flesch_mean -75.47
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.005787
boilerplate_rate 0
alert: near_unique 99.2% of rows are unique strings
alert: one_word 100.0% rows are a single word
Fig 9.
Character-length distribution for id (mean: 18.15).
chars        count
1 – 5        3503
5 – 10       15645
10 – 14      38487
14 – 19      31119
19 – 23      25119
23 – 28      10857
28 – 32      6439
32 – 36      4867
36 – 41      2424
41 – 45      1975
45 – 50      982
50 – 54      721
54 – 59      386
59 – 63      268
63 – 67      198
67 – 72      101
72 – 76      69
76 – 81      25
81 – 85      35
85 – 90      13
90 – 94      10
94 – 98      10
98 – 103     3
103 – 107    0
107 – 112    1
112 – 174    0 (14 empty bins)
174 – 178    1

title text label

Short titles (mean 17 characters, median 3 words) of films or similar works; top values include 'The Return', 'A Christmas Carol', 'Hero', and 'Blue'. The language sample is predominantly English (3,946 strings), but the multilingual alert reports 31 languages in total, with Spanish (123), German (80), and French (72) the most common after English. Duplication is notable: 16,488 repeats (11.5% duplicate rate) across 126,403 unique values in the 142,891 non-null rows, and 17% of titles are a single word.

Treatment: Normalise case and tokenize for embedding; do not treat as a unique key given the 11.5% duplicate rate.

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["title"].stats

stat value
n 143,258
nulls 367 (0.3%)
unique 126,403
len_min 1
len_max 176
len_mean 17.23
len_median 15
len_p95 37
word_mean 3.074
word_median 3
n_empty 0
n_duplicates 16,488
duplicate_rate 0.1154
vocab_size 17,597
readability_flesch_mean 60.62
emoji_rate 0
url_rate 0
one_word_rate 0.1715
allcaps_rate 0.003989
boilerplate_rate 9.098e-05
alert: multilingual 31 languages detected in sample
Fig 10.
Character-length distribution for title (mean: 17.23).
chars        count
1 – 5        6174
5 – 10       21384
10 – 14      38720
14 – 18      28652
18 – 23      18097
23 – 27      12387
27 – 32      5661
32 – 36      3782
36 – 40      3096
40 – 45      1631
45 – 49      1338
49 – 54      635
54 – 58      436
58 – 62      346
62 – 67      159
67 – 71      137
71 – 75      95
75 – 80      37
80 – 84      49
84 – 88      17
88 – 93      17
93 – 97      17
97 – 102     15
102 – 106    4
106 – 110    2
110 – 115    0
115 – 119    0
119 – 124    1
124 – 137    0 (3 empty bins)
137 – 141    1
141 – 172    0 (7 empty bins)
172 – 176    1

audienceScore numeric feature

This is an audience rating score on a 0-100 scale with 101 unique integer values, mean 55.67 and median 57. The distribution is wide (std 24.55, IQR 39) and slightly left-skewed (skew -0.23, kurtosis -0.83) with no outliers flagged. The dominant concern is missingness: 48.87% of rows are null, so nearly half the dataset lacks this score.

Treatment: Impute or add a missing-indicator before modelling given the ~49% null rate.

anthropic:claude-opus-4-7 · confidence high
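
One way to read the treatment above is median imputation paired with a 0/1 missing flag. The sketch below uses hypothetical values, with None standing in for a null audienceScore; whether median imputation suits a given model is left open.

```python
from statistics import median

# Hypothetical scores; None stands in for a null audienceScore.
scores = [57, None, 80, None, 37, 76]

# Fit the fill value on observed scores only.
observed = [s for s in scores if s is not None]
fill_value = median(observed)

# Pair each (possibly imputed) value with a flag recording whether it was missing.
imputed = [(fill_value if s is None else s, int(s is None)) for s in scores]

for value, was_missing in imputed:
    print(value, was_missing)
```

Keeping the flag lets a downstream model distinguish a genuine mid-range score from a filled-in one, which matters when nearly half the column is null.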
Out[20]:

saturn.columns["audienceScore"].stats

stat value
n 143,258
nulls 70,010 (48.9%)
unique 101
min 0
max 100
mean 55.67
median 57
std 24.55
q1 37
q3 76
iqr 39
skew -0.2257
kurtosis -0.8322
n_outliers 0
outlier_rate 0
zero_rate 0.0156
alert: null_rate 48.9% null
Fig 11.
Distribution of audienceScore. Vertical dash marks the median.
Histogram bins for audienceScore (median: 57.0).
bin          count
0 – 2.5      1158
2.5 – 5      84
5 – 7.5      353
7.5 – 10     437
10 – 12.5    1114
12.5 – 15    814
15 – 17.5    1413
17.5 – 20    812
20 – 22.5    2208
22.5 – 25    884
25 – 27.5    1918
27.5 – 30    1406
30 – 32.5    1799
32.5 – 35    2064
35 – 37.5    1875
37.5 – 40    1756
40 – 42.5    3056
42.5 – 45    2022
45 – 47.5    2212
47.5 – 50    1176
50 – 52.5    3593
52.5 – 55    1526
55 – 57.5    3065
57.5 – 60    1583
60 – 62.5    3772
62.5 – 65    1583
65 – 67.5    3187
67.5 – 70    1692
70 – 72.5    3102
72.5 – 75    1777
75 – 77.5    2972
77.5 – 80    1877
80 – 82.5    3343
82.5 – 85    2034
85 – 87.5    2618
87.5 – 90    1791
90 – 92.5    1830
92.5 – 95    879
95 – 97.5    668
97.5 – 100   1795

tomatoMeter numeric feature

This is the Rotten Tomatoes critic score (tomatoMeter), a 0-100 percentage with 101 unique integer values, mean 65.77 and median 73. The distribution is left-skewed (skew -0.65) with Q1 at 45 and Q3 at 89, indicating most rated titles lean favorable. The dominant concern is coverage: 76.35% of rows are null, so the field is populated for only 33,877 records.

Treatment: Impute or add a missingness indicator before modelling given the 76% null rate.

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["tomatoMeter"].stats

stat value
n 143,258
nulls 109,381 (76.4%)
unique 101
min 0
max 100
mean 65.77
median 73
std 28.02
q1 45
q3 89
iqr 44
skew -0.6467
kurtosis -0.6657
n_outliers 0
outlier_rate 0
zero_rate 0.02084
alert: null_rate 76.4% null
Fig 12.
Distribution of tomatoMeter. Vertical dash marks the median.
Histogram bins for tomatoMeter (median: 73.0).
bin          count
0 – 2.5      723
2.5 – 5      77
5 – 7.5      201
7.5 – 10     224
10 – 12.5    375
12.5 – 15    458
15 – 17.5    529
17.5 – 20    262
20 – 22.5    859
22.5 – 25    203
25 – 27.5    557
27.5 – 30    433
30 – 32.5    431
32.5 – 35    610
35 – 37.5    436
37.5 – 40    436
40 – 42.5    964
42.5 – 45    651
45 – 47.5    521
47.5 – 50    228
50 – 52.5    1119
52.5 – 55    386
55 – 57.5    964
57.5 – 60    335
60 – 62.5    1163
62.5 – 65    769
65 – 67.5    1293
67.5 – 70    493
70 – 72.5    1215
72.5 – 75    614
75 – 77.5    1166
77.5 – 80    786
80 – 82.5    1951
82.5 – 85    1247
85 – 87.5    1623
87.5 – 90    1488
90 – 92.5    1782
92.5 – 95    1104
95 – 97.5    1133
97.5 – 100   4068

rating categorical feature

This is a content rating field mixing theatrical (R, PG-13, PG, NC-17, G) and television (TVPG, TV14, TVMA, TVY7, TVG) classifications across 10 distinct values. The column is 90.23% null, so only 13,991 of the 143,258 rows (~9.77%) carry a rating, and within those R alone accounts for 55.28% of values. The mixed rating systems and the long tail (TVG, TVY7, G each appearing once) suggest inconsistent sourcing rather than a clean controlled vocabulary.

Treatment: Normalize TV vs MPAA codes into a unified scheme and add an explicit 'missing' category before encoding.

anthropic:claude-opus-4-7 · confidence high
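
As a hedged illustration of the normalisation step, the sketch below tags each of the ten observed codes with its rating system and maps nulls to an explicit 'missing' category. The (system, code) pair is an assumed scheme chosen for illustration; the report does not define what the unified scheme should look like.

```python
# The ten codes listed in the profile, grouped by rating system.
FILM_CODES = {"G", "PG", "PG-13", "R", "NC-17"}
TV_CODES = {"TVY7", "TVG", "TVPG", "TV14", "TVMA"}


def normalize_rating(value):
    """Return a (system, code) pair; nulls become an explicit 'missing' category."""
    if value is None or value == "":
        return ("missing", "missing")
    if value in FILM_CODES:
        return ("film", value)
    if value in TV_CODES:
        return ("tv", value)
    # Anything outside the ten known codes is kept but marked for review.
    return ("other", value)


for raw in ["R", "TV14", None, "PG-13"]:
    print(raw, "->", normalize_rating(raw))
```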
Out[26]:

saturn.columns["rating"].stats

stat value
n 143,258
nulls 129,267 (90.2%)
unique 10
top_value R
top_rate 0.5528
cardinality 10
entropy 1.71
entropy_ratio 0.5147
alert: null_rate 90.2% null
Fig 13.
Top values for rating (10 unique shown, of 10 total). Shares are relative to all 143,258 rows, nulls included.
value   count  share
R       7734   5.4%
PG-13   3446   2.4%
PG      1911   1.3%
TVPG    424    0.3%
TV14    397    0.3%
TVMA    57     0.0%
NC-17   19     0.0%
TVG     1      0.0%
TVY7    1      0.0%
G       1      0.0%

ratingContents text feature

This column stores content-rating descriptors (e.g. 'Language', 'Violence', 'Some Sexual Content') serialised as Python-style list literals rather than clean arrays. It is 90.23% null; among the 13,991 populated rows, 40.3% are duplicates, leaving 8,353 unique values. A handful of entries are flagged as non-English (12 it, 5 ro, 1 km) even though the vocabulary is tiny (1,188 words), and the bracket and quote artefacts in top_words confirm the values were never parsed out of their string representation.

Treatment: Parse the list-literal strings into a multi-hot encoding of rating tags before modelling.

anthropic:claude-opus-4-7 · confidence high
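
The parsing step can be sketched with `ast.literal_eval`, which evaluates list literals without executing arbitrary code. The raw strings below are hypothetical examples built from the tags quoted in the profile; real rows may need extra cleaning.

```python
import ast

# Hypothetical raw values in the Python-list-literal style the profile describes.
raw_values = ["['Language', 'Violence']", "['Some Sexual Content']", None]


def parse_tags(raw):
    """Parse a list-literal string into a list of tags; return [] on null or bad input."""
    if raw is None:
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(tag).strip() for tag in parsed]


tag_lists = [parse_tags(value) for value in raw_values]

# Build the tag vocabulary, then one 0/1 column per tag for each row.
vocabulary = sorted({tag for tags in tag_lists for tag in tags})
multi_hot = [[int(tag in tags) for tag in vocabulary] for tags in tag_lists]

print(vocabulary)
print(multi_hot)
```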
Out[29]:

saturn.columns["ratingContents"].stats

stat value
n 143,258
nulls 129,267 (90.2%)
unique 8,353
len_min 5
len_max 148
len_mean 46.16
len_median 44
len_p95 88
word_mean 5.169
word_median 5
n_empty 0
n_duplicates 5,638
duplicate_rate 0.403
vocab_size 1,188
readability_flesch_mean 15.02
emoji_rate 0
url_rate 0
one_word_rate 0.05925
allcaps_rate 0.06168
boilerplate_rate 0
alert: multilingual 5 languages detected in sample
alert: null_rate 90.2% null
alert: duplicates 40.3% duplicate strings
Fig 14.
Character-length distribution for ratingContents (mean: 46.16).
chars        count
5 – 9        326
9 – 12       791
12 – 16      207
16 – 19      529
19 – 23      281
23 – 26      915
26 – 30      839
30 – 34      592
34 – 37      956
37 – 41      702
41 – 44      910
44 – 48      757
48 – 51      889
51 – 55      819
55 – 59      539
59 – 62      633
62 – 66      493
66 – 69      532
69 – 73      369
73 – 76      398
76 – 80      310
80 – 84      246
84 – 87      215
87 – 91      158
91 – 94      146
94 – 98      96
98 – 102     95
102 – 105    70
105 – 109    45
109 – 112    43
112 – 116    19
116 – 119    26
119 – 123    11
123 – 127    11
127 – 130    5
130 – 134    3
134 – 137    5
137 – 141    6
141 – 144    2
144 – 148    2

releaseDateTheaters text timestamp

This is a theatrical release date stored as an ISO-format string (every value is exactly 10 characters and a single token, e.g. '2018-09-14'). It is sparsely populated (78.52% null), and the non-null values are heavily repeated, with a 60.8% duplicate rate across 12,062 distinct dates. The 'allcaps' alert is a false positive triggered by strings that contain no letters; there is no actual text content to mine.

Treatment: Parse to date and impute or flag the 78.52% missing before any time-based feature engineering.

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["releaseDateTheaters"].stats

stat value
n 143,258
nulls 112,485 (78.5%)
unique 12,062
len_min 10
len_max 10
len_mean 10
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 0
n_duplicates 18,711
duplicate_rate 0.608
vocab_size 9,088
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: null_rate 78.5% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 60.8% duplicate strings
Fig 15.
Character-length distribution for releaseDateTheaters (mean: 10.0).
chars  count
10     30773
All 40 bins span 10 – 10; the other 39 are empty.

releaseDateStreaming text timestamp

This is a streaming-release date stored as ISO-8601 text (len_median 10, one_word_rate 1.0, all top values match YYYY-MM-DD). Roughly 44.56% of rows are null and the duplicate_rate is 0.94, with a single date, 2017-05-22, appearing 1,232 times, which indicates heavy clustering on a few release days. One value is only 7 characters long (len_min 7), so not every entry is a full YYYY-MM-DD date. The text-style alerts (allcaps, one_word, short_text) are artifacts of the date format, not a quality issue.

Treatment: Parse to date dtype, coercing any malformed value to missing, and treat missingness explicitly before any temporal feature engineering.

anthropic:claude-opus-4-7 · confidence high
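
A small sketch of tolerant date parsing for this column. Because len_min is 7, at least one value is not a full 10-character date, so the parser returns None instead of raising. The 7-character string used below is a hypothetical stand-in; the profile does not show the actual value.

```python
from datetime import date, datetime


def parse_iso_date(raw):
    """Parse a strict YYYY-MM-DD string; return None for nulls or anything else."""
    if raw is None or len(raw) != 10:
        return None
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return None


# '2017-05-22' is the most frequent value in the profile; '2017-05' is hypothetical.
samples = ["2017-05-22", "2017-05", None]
parsed = [parse_iso_date(value) for value in samples]
missing_flags = [int(d is None) for d in parsed]

print(parsed)
print(missing_flags)
```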
Out[35]:

saturn.columns["releaseDateStreaming"].stats

stat value
n 143,258
nulls 63,838 (44.6%)
unique 4,726
len_min 7
len_max 10
len_mean 10
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 0
n_duplicates 74,694
duplicate_rate 0.9405
vocab_size 3,514
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: null_rate 44.6% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 94.0% duplicate strings
Fig 16.
Character-length distribution for releaseDateStreaming (mean: 9.99996).
chars  count
7      1
10     79419
The 40 bins span lengths 7 to 10; the other 38 are empty.

runtimeMinutes numeric feature

Movie or episode runtime in minutes, with a typical feature-length distribution (median 92, Q1 84, Q3 103, IQR 19). The tail is extreme: max 2,700, skew 7.62, kurtosis 598.65, and 11.37% of non-null rows flagged as outliers, suggesting a mix of shorts, multi-part specials, or full series totals alongside standard films. Roughly 9.65% of rows are null.

Treatment: Cap or log-transform before modelling and impute the ~10% nulls.

anthropic:claude-opus-4-7 · confidence high
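
The outlier alert reads "rows beyond 1.5 IQR", which matches the usual Tukey fences; assuming that is the rule in use, the fences follow directly from the reported quartiles. The sketch below works through that arithmetic and shows the two treatments named above, capping and a log transform. The helper names are illustrative.

```python
import math

# Quartiles as reported in the stats table.
Q1, Q3 = 84, 103
IQR = Q3 - Q1                   # 19 minutes

# Tukey fences at 1.5 x IQR (assumed to be the rule behind the outlier alert).
LOWER_FENCE = Q1 - 1.5 * IQR    # 55.5 minutes
UPPER_FENCE = Q3 + 1.5 * IQR    # 131.5 minutes


def cap_runtime(minutes):
    """Clamp a runtime to the fence interval."""
    return min(max(minutes, LOWER_FENCE), UPPER_FENCE)


def log_runtime(minutes):
    """log(1 + x) transform, which compresses the long right tail."""
    return math.log1p(minutes)


for minutes in (1, 92, 2700):
    print(minutes, cap_runtime(minutes), round(log_runtime(minutes), 3))
```

With fences this narrow, an ordinary 132-minute feature already counts as an outlier, which helps explain why the flagged share is so high.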
Out[38]:

saturn.columns["runtimeMinutes"].stats

stat value
n 143,258
nulls 13,827 (9.7%)
unique 324
min 1
max 2,700
mean 93.71
median 92
std 28.13
q1 84
q3 103
iqr 19
skew 7.623
kurtosis 598.7
n_outliers 14,720
outlier_rate 0.1137
zero_rate 0
alert: high_skew skew=+7.62
alert: outliers 11.4% rows beyond 1.5 IQR
Fig 17.
Distribution of runtimeMinutes. Vertical dash marks the median.
Histogram bins for runtimeMinutes (median: 92.0).
bin              count
1 – 68.47        12329
68.47 – 135.9    110286
135.9 – 203.4    6482
203.4 – 270.9    254
270.9 – 338.4    37
338.4 – 405.8    19
405.8 – 473.3    9
473.3 – 540.8    7
540.8 – 608.3    3
608.3 – 675.8    1
675.8 – 743.2    0
743.2 – 810.7    1
810.7 – 878.2    0
878.2 – 945.6    1
945.6 – 1013     1
1013 – 2633      0 (24 empty bins)
2633 – 2700      1

genre text label

This is a categorical genre label for films, often a single word like 'Drama' (27,860 rows) or 'Documentary' (15,162) but sometimes a comma-separated combo such as 'Comedy, Drama'. With only 66 distinct vocabulary tokens but 2,912 unique strings, the cardinality comes from how genres are concatenated; the 97.8% duplicate rate reflects how often the same strings recur. Note the 7.74% null rate and that 55% of values are single-word, so multi-genre rows are the minority.

Treatment: Split on comma and multi-hot encode into a small set of genre flags, since a row can carry several genres.

anthropic:claude-opus-4-7 · confidence high
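
A minimal sketch of splitting combined genre strings into per-genre flags. The sample values are hypothetical apart from the genre names quoted in the profile, and the comma is assumed to be the only delimiter.

```python
from collections import Counter


def split_genres(raw):
    """Split a comma-separated genre string into trimmed tokens; nulls give []."""
    if raw is None:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


# Hypothetical rows using genre names quoted in the profile.
genre_values = ["Drama", "Comedy, Drama", "Documentary", None]

# Per-genre counts after splitting, which undo the fragmentation of combo strings.
token_counts = Counter(tok for raw in genre_values for tok in split_genres(raw))

# One 0/1 flag per genre for every row.
flags = sorted(token_counts)
encoded = [
    {flag: int(flag in split_genres(raw)) for flag in flags}
    for raw in genre_values
]

print(token_counts)
print(encoded[1])
```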
Out[41]:

saturn.columns["genre"].stats

stat value
n 143,258
nulls 11,083 (7.7%)
unique 2,912
len_min 3
len_max 101
len_mean 12.91
len_median 11
len_p95 32
word_mean 1.874
word_median 1
n_empty 0
n_duplicates 129,263
duplicate_rate 0.978
vocab_size 66
readability_flesch_mean -19.66
emoji_rate 0
url_rate 0
one_word_rate 0.5533
allcaps_rate 0
boilerplate_rate 0
alert: one_word 55.3% rows are a single word
alert: duplicates 97.8% duplicate strings
Fig 18.
Character-length distribution for genre (mean: 12.91).
chars        count
3 – 5        28416
5 – 8        26877
8 – 10       3141
10 – 13      18414
13 – 15      18553
15 – 18      3196
18 – 20      10562
20 – 23      3727
23 – 25      4991
25 – 28      5352
28 – 30      854
30 – 32      2468
32 – 35      2166
35 – 37      1055
37 – 40      240
40 – 42      906
42 – 45      639
45 – 47      124
47 – 50      89
50 – 52      123
52 – 54      164
54 – 57      11
57 – 59      30
59 – 62      42
62 – 64      18
64 – 67      4
67 – 69      8
69 – 72      1
72 – 74      2
74 – 76      0
76 – 79      0
79 – 81      1
81 – 99      0 (7 empty bins)
99 – 101     1

originalLanguage categorical feature

Categorical language label with 112 distinct values, dominated by English at 65.7% of non-null rows (85,034 of 129,400), which is 59.4% of all 143,258 rows. The long tail spans regional variants (e.g., 'French (Canada)' vs 'French (France)', 'English (United Kingdom)', 'Spanish (Spain)') alongside bare language names like 'English' and 'Spanish', suggesting inconsistent locale tagging that will fragment counts. Null rate is 9.67%, and an entropy ratio of 0.38 confirms heavy concentration in a few categories.

Treatment: Normalize locale variants to base language codes, then one-hot encode the top categories and bucket the rest as 'Other'.

anthropic:claude-opus-4-7 · confidence high
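
One way to sketch the locale normalisation is to strip the trailing parenthetical and then bucket everything outside the top categories as 'Other'. This yields base language names rather than ISO codes; mapping names to codes would need a lookup table that the report does not provide. The sample labels and the cut-off of two categories are illustrative only.

```python
import re
from collections import Counter


def base_language(label):
    """Drop a trailing parenthetical locale, e.g. 'French (Canada)' -> 'French'."""
    if label is None:
        return None
    return re.sub(r"\s*\([^)]*\)\s*$", "", label).strip()


# Hypothetical sample built from labels that appear in the top-values table.
labels = [
    "English", "English (United Kingdom)", "English",
    "French (Canada)", "French (France)", "Tamil", None,
]

bases = [base_language(label) for label in labels]
counts = Counter(b for b in bases if b is not None)

TOP_K = 2  # illustrative cut-off, not a recommendation
top = {name for name, _ in counts.most_common(TOP_K)}
bucketed = [None if b is None else (b if b in top else "Other") for b in bases]

print(counts)
print(bucketed)
```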
Out[44]:

saturn.columns["originalLanguage"].stats

stat value
n 143,258
nulls 13,858 (9.7%)
unique 112
top_value English
top_rate 0.6571
cardinality 112
entropy 2.605
entropy_ratio 0.3827
Fig 19.
Top values for originalLanguage (20 unique shown, of 112 total). Shares are relative to all 143,258 rows, nulls included.
value                      count   share
English                    85034   59.4%
Spanish                    4786    3.3%
Japanese                   3482    2.4%
Hindi                      3309    2.3%
French (Canada)            3282    2.3%
Chinese                    3166    2.2%
French (France)            2760    1.9%
English (United Kingdom)   2553    1.8%
Italian                    2303    1.6%
German                     2155    1.5%
Korean                     1226    0.9%
Arabic                     938     0.7%
Spanish (Spain)            936     0.7%
Tamil                      909     0.6%
Russian                    898     0.6%
Portuguese (Brazil)        867     0.6%
Telugu                     774     0.5%
Malayalam                  642     0.4%
Unknown language           528     0.4%
Dutch                      482     0.3%

director text feature

Holds a film director's name, averaging 2.2 words and 14.8 characters, with 62,207 unique values across 143,258 rows. The duplicate rate is 55.3% (76,857 rows), inflated by an 'Unknown Director' sentinel that occurs 3,544 times and should not be treated as a real name. The null rate is 2.93%, and the most frequent real names (David DeCoteau at 129, Sam Newfield at 124) reflect prolific B-movie directors rather than data quality issues.

Treatment: Replace 'Unknown Director' with null and use as a high-cardinality categorical (target/frequency encode).

anthropic:claude-opus-4-7 · confidence high
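
A short sketch of the sentinel replacement followed by frequency encoding. The sample list is hypothetical apart from the names quoted in the profile, and encoding missing values as 0 is an assumption made for illustration.

```python
from collections import Counter

SENTINEL = "Unknown Director"

# Hypothetical rows using names quoted in the profile.
directors = ["David DeCoteau", "Unknown Director", "Sam Newfield", "David DeCoteau", None]

# Treat the sentinel exactly like a null.
cleaned = [None if d is None or d == SENTINEL else d for d in directors]

# Frequency encoding: replace each name with how often it occurs.
frequency = Counter(d for d in cleaned if d is not None)
encoded = [0 if d is None else frequency[d] for d in cleaned]

print(cleaned)
print(encoded)
```

In a modelling pipeline the frequencies would be fitted on the training split only, so that counts from held-out rows do not leak into the feature.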
Out[47]:

saturn.columns["director"].stats

stat value
n 143,258
nulls 4,194 (2.9%)
unique 62,207
len_min 1
len_max 326
len_mean 14.81
len_median 14
len_p95 26
word_mean 2.213
word_median 2
n_empty 0
n_duplicates 76,857
duplicate_rate 0.5527
vocab_size 16,693
readability_flesch_mean 47.17
emoji_rate 0
url_rate 0
one_word_rate 0.007356
allcaps_rate 7.191e-05
boilerplate_rate 0
alert: duplicates 55.3% duplicate strings
Fig 20.
Character-length distribution for director (mean: 14.81).
chars        count
1 – 9        7477
9 – 17       109547
17 – 25      14942
25 – 34      5028
34 – 42      1160
42 – 50      412
50 – 58      164
58 – 66      87
66 – 74      69
74 – 82      41
82 – 90      28
90 – 98      24
98 – 107     19
107 – 115    17
115 – 123    3
123 – 131    3
131 – 139    9
139 – 147    13
147 – 155    1
155 – 164    6
164 – 172    4
172 – 180    2
180 – 188    0
188 – 196    1
196 – 204    0
204 – 212    2
212 – 237    0 (3 empty bins)
237 – 245    1
245 – 253    0
253 – 261    0
261 – 269    1
269 – 294    0 (3 empty bins)
294 – 302    2
302 – 310    0
310 – 318    0
318 – 326    1

writer text feature

Holds writer credits, typically one or two personal names averaging 2.7 words and 21 characters, with familiar figures like Jing Wong, Woody Allen, and Ingmar Bergman topping the list. Coverage is weak: 37.1% of rows are null. Among the 90,116 populated rows there are 67,274 unique values (25.3% duplicate rate), and lengths run up to 262 characters, so the column likely concatenates multiple co-writers per title. Top tokens (michael, david, john) show that Western personal names dominate, though 'de' hints at multi-part names or non-English credits mixed in.

Treatment: Split on delimiters into individual writers and explode for any per-person analysis; impute or flag the 37% missing before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["writer"].stats

stat value
n 143,258
nulls 53,142 (37.1%)
unique 67,274
len_min 1
len_max 262
len_mean 21.34
len_median 16
len_p95 47
word_mean 2.711
word_median 2
n_empty 0
n_duplicates 22,842
duplicate_rate 0.2535
vocab_size 26,458
readability_flesch_mean 37.79
emoji_rate 0
url_rate 0
one_word_rate 0.005349
allcaps_rate 6.658e-05
boilerplate_rate 0
alert: null_rate 37.1% null
alert: duplicates 25.3% duplicate strings
Fig 21.
Character-length distribution for writer (mean: 21.34).
chars        count
1 – 8        565
8 – 14       37626
14 – 21      17271
21 – 27      12261
27 – 34      9944
34 – 40      4841
40 – 47      3059
47 – 53      1894
53 – 60      937
60 – 66      718
66 – 73      355
73 – 79      234
79 – 86      120
86 – 92      79
92 – 99      60
99 – 105     46
105 – 112    28
112 – 118    25
118 – 125    9
125 – 132    13
132 – 138    9
138 – 145    12
145 – 151    5
151 – 158    1
158 – 164    1
164 – 171    0
171 – 177    1
177 – 184    0
184 – 190    1
190 – 255    0 (10 empty bins)
255 – 262    1

boxOffice text feature

Box office gross stored as a short currency string like "$1.1M": every value is one token, 99.99% are flagged allcaps, and lengths range from 2 to 7 characters (almost all are 5 to 7). The column is 89.71% null, and only 4,863 distinct values cover the 14,743 populated rows, a 67.01% duplicate rate concentrated on round million-dollar figures. Note this is a coarse, pre-formatted string, not a precise revenue number; values as short as 2 characters suggest that not every entry follows the "$X.XM" pattern.

Treatment: Parse the "$X.XM"-style string into a numeric dollar amount and decide whether to impute or drop given the 89.71% null rate.

anthropic:claude-opus-4-7 · confidence high
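
A hedged sketch of the parsing step. Only the "$1.1M" form is confirmed by the profile; the K and B suffixes and the suffix-free form are assumptions added because the minimum length of 2 characters implies that other shapes exist. Anything that does not match returns None so it can be reviewed rather than silently mis-parsed.

```python
import re

# Suffix multipliers; only "M" is confirmed by the profile, the rest are assumed.
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_PATTERN = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")


def parse_box_office(raw):
    """Convert strings like '$1.1M' to whole dollars; None for nulls or unknown shapes."""
    if raw is None:
        return None
    match = _PATTERN.match(raw.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    # Round to whole dollars to avoid floating-point residue.
    return round(float(number) * _MULTIPLIERS[suffix])


for raw in ["$1.1M", "$25.0M", "$5", None, "n/a"]:
    print(raw, "->", parse_box_office(raw))
```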
Out[53]:

saturn.columns["boxOffice"].stats

stat value
n 143,258
nulls 128,515 (89.7%)
unique 4,863
len_min 2
len_max 7
len_mean 5.977
len_median 6
len_p95 7
word_mean 1
word_median 1
n_empty 0
n_duplicates 9,880
duplicate_rate 0.6701
vocab_size 4,863
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.9999
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: null_rate 89.7% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 67.0% duplicate strings
Fig 22.
Character-length distribution for boxOffice (mean: 5.98).
chars  count
2      2
3      6
4      107
5      4011
6      6707
7      3910
The 40 bins span lengths 2 to 7; the other 34 are empty.

distributor text feature

This column lists film distributor names, dominated by major studios like Paramount Pictures (994), 20th Century Fox (745), and Universal Pictures (737). It is overwhelmingly sparse, with an 83.94% null rate; among the 23,005 populated rows the duplicate rate is also 83.94% (3,694 unique values), so a small set of studios accounts for most of the populated entries. Names are short (mean length 19.9 chars, median 2 words) and vocabulary is concentrated around terms like 'pictures', 'films', and 'entertainment'.

Treatment: Normalize studio name variants and treat as a high-cardinality categorical with an explicit 'missing' bucket given the 83.94% null rate.

anthropic:claude-opus-4-7 · confidence high
Out[56]:

saturn.columns["distributor"].stats

stat value
n 143,258
nulls 120,253 (83.9%)
unique 3,694
len_min 3
len_max 385
len_mean 19.89
len_median 17
len_p95 43
word_mean 2.649
word_median 2
n_empty 0
n_duplicates 19,311
duplicate_rate 0.8394
vocab_size 2,650
readability_flesch_mean 16.85
emoji_rate 0
url_rate 4.347e-05
one_word_rate 0.09376
allcaps_rate 0.01474
boilerplate_rate 0
alert: null_rate 83.9% null
alert: duplicates 83.9% duplicate strings
Fig 23.
Character-length distribution for distributor (mean: 19.89).
chars        count
3 – 13       4694
13 – 22      14064
22 – 32      2077
32 – 41      968
41 – 51      437
51 – 60      254
60 – 70      146
70 – 79      125
79 – 89      58
89 – 98      50
98 – 108     39
108 – 118    22
118 – 127    17
127 – 137    10
137 – 146    7
146 – 156    8
156 – 165    6
165 – 175    3
175 – 184    3
184 – 194    4
194 – 204    2
204 – 213    5
213 – 223    0
223 – 232    2
232 – 261    0 (3 empty bins)
261 – 270    1
270 – 280    1
280 – 290    0
290 – 299    0
299 – 309    1
309 – 375    0 (7 empty bins)
375 – 385    1

soundMix categorical feature

Catalogues the audio mix format of each title (Surround, Dolby Digital, Stereo, Mono, etc.), with 551 distinct labels across 143,258 rows. The dominant issue is sparsity: 88.89% of values are null, and even among populated rows 'Surround' covers only 25.6%. Free-form combinations like 'Stereo, Surround' vs 'Surround, Stereo' and overlapping Dolby variants suggest the field is unnormalised multi-label text rather than a clean taxonomy.

Treatment: Split on commas, normalise Dolby/Surround variants, and treat as multi-hot; consider dropping if downstream task can't tolerate 89% missingness.

anthropic:claude-opus-4-7 · confidence high
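
The order-sensitivity problem can be sketched by turning each label into an unordered set of parts, so 'Stereo, Surround' and 'Surround, Stereo' collapse into one key. The counts below are taken from the top-values table; merging Dolby variants would need an alias map that is not attempted here.

```python
from collections import Counter


def canonical_mix(raw):
    """Return the label's comma-separated parts as an order-insensitive frozenset."""
    if raw is None:
        return frozenset()
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


# Counts as reported in the top-values table for soundMix.
observed = {"Stereo, Surround": 473, "Surround, Stereo": 451, "Surround": 4075}

# Re-aggregate counts under the canonical key.
merged = Counter()
for label, count in observed.items():
    merged[canonical_mix(label)] += count

for key, count in merged.items():
    print(sorted(key), count)
```

The same frozenset can feed a multi-hot encoding directly, since each member is one format flag.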
Out[59]:

saturn.columns["soundMix"].stats

stat value
n 143,258
nulls 127,341 (88.9%)
unique 551
top_value Surround
top_rate 0.256
cardinality 551
entropy 4.663
entropy_ratio 0.5121
alert: null_rate 88.9% null
Fig 24.
Top values for soundMix (20 unique shown, of 551 total). Shares are relative to all 143,258 rows, nulls included.
value  ·  count  ·  share
Surround  ·  4075  ·  2.8%
Dolby Digital  ·  2375  ·  1.7%
Stereo  ·  2082  ·  1.5%
Mono  ·  1246  ·  0.9%
Stereo, Surround  ·  473  ·  0.3%
Surround, Stereo  ·  451  ·  0.3%
Dolby  ·  411  ·  0.3%
Dolby SRD, DTS, SDDS  ·  253  ·  0.2%
Dolby Atmos  ·  241  ·  0.2%
Dolby SR  ·  198  ·  0.1%
Dolby SR, DTS, Dolby Stereo, Surround, SDDS, Dolby A, Dolby Digital  ·  192  ·  0.1%
Dolby Stereo, Dolby Digital, Dolby A, Surround, Dolby SR  ·  167  ·  0.1%
Surround, Dolby Digital  ·  133  ·  0.1%
Dolby, Surround  ·  119  ·  0.1%
SDDS, Dolby Digital, DTS  ·  118  ·  0.1%
Surround, Dolby SRD, DTS, SDDS  ·  118  ·  0.1%
Dolby SRD  ·  107  ·  0.1%
Surround, Dolby SR, Dolby Digital, Dolby A, Dolby Stereo  ·  101  ·  0.1%
Dolby Atmos, Dolby Digital  ·  93  ·  0.1%
Datasat, Dolby Digital  ·  84  ·  0.1%

How to cite

BibTeX
@misc{saturn-rotten-tomatoes-rotten-tomatoes-movies-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: rotten tomatoes rotten tomatoes movies},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/rotten_tomatoes-rotten_tomatoes_movies}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: rotten tomatoes rotten tomatoes movies. Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/rotten_tomatoes-rotten_tomatoes_movies