saturn

/home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv | 143,258 rows | sample n=143,258 | seed 42 | 2026-05-01T23:13:52+00:00

Overview

Source	/home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv
Total rows	143,258
Profiled sample	143,258
Columns	16
Generated	2026-05-01T23:13:52+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset catalogs 143,258 movies from Rotten Tomatoes across 16 columns covering metadata (title, director, writer, distributor), release info, runtime, genre, language, ratings, and critic/audience scores. Coverage is highly uneven: rating (90.2% null), boxOffice (89.7% null), releaseDateTheaters (78.5% null), and tomatoMeter (76.4% null) are sparse, and audienceScore is missing in roughly half the rows (48.9%). Worth a closer look first: the genre distribution, which is dominated by Drama (27,860), Documentary (15,162), and Comedy (11,514), and runtimeMinutes, which is heavily right-skewed (skew 7.6, max 2,700 minutes), with about 11.4% of non-null values flagged as outliers despite a tight interquartile range of 84 to 103 minutes. The tomatoMeter and audienceScore distributions also tell a clear story: among scored titles, critics skew positive (median 73) while audiences are more middling (median 57). English accounts for 65.7% of non-null originalLanguage values (59.4% of all rows), so any language-based analysis will be lopsided.

id high anthropic:claude-opus-4-7

Slug-style identifier column: every value is a single token (one_word_rate 1.0, word_mean 1.0) with mean length ~18 chars and 142,052 unique values across 143,258 rows. The 1,206 duplicates (0.84%) are surprising for an id field; top repeats like 'catch_me_if_you_can' and 'hear_no_evil' suggest these are title-derived slugs rather than guaranteed-unique keys. The readability score (−75.5) is meaningless here because the tokens are underscore- or hyphen-joined phrases, not prose.

title high anthropic:claude-opus-4-7

Short film titles (mean 17 chars, median 3 words); top values include 'The Return', 'A Christmas Carol', 'Hero', 'Blue'. The language-detection sample is predominantly English (3,946), but 31 languages are detected in total, with Spanish (123), German (80), and French (72) the most common after English. Notable duplication: 16,488 repeats (an 11.5% duplicate rate over the 142,891 non-null rows) against 126,403 unique values, and 17% of titles are a single word.
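The duplicate figures quoted in these notes appear to follow one rule: duplicates are the non-null rows minus the unique values, and the rate divides by the non-null rows. The sketch below checks that reading against the title numbers. The function name is illustrative, and the rule is inferred from the reported stats, not taken from saturn's source.

```python
def duplicate_stats(rows: int, nulls: int, unique: int) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate), both computed over non-null rows."""
    non_null = rows - nulls
    n_duplicates = non_null - unique
    return n_duplicates, n_duplicates / non_null


# Figures reported for the title column.
n_dup, rate = duplicate_stats(rows=143_258, nulls=367, unique=126_403)
print(n_dup, round(rate, 3))  # 16488 0.115
```

The same rule reproduces the id column's 1,206 duplicates (143,258 rows, no nulls, 142,052 unique).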

audienceScore high anthropic:claude-opus-4-7

This is an audience rating score on a 0-100 scale with 101 unique integer values, mean 55.67 and median 57. The distribution is wide (std 24.55, IQR 39) and slightly left-skewed (skew -0.23, kurtosis -0.83) with no outliers flagged. The dominant concern is missingness: 48.87% of rows are null, so nearly half the dataset lacks this score.

tomatoMeter high anthropic:claude-opus-4-7

This is the Rotten Tomatoes critic score (tomatoMeter), a 0-100 percentage with 101 unique integer values, mean 65.77 and median 73. The distribution is left-skewed (skew -0.65) with Q1 at 45 and Q3 at 89, indicating most rated titles lean favorable. The dominant concern is coverage: 76.35% of rows are null, so the field is only populated for a minority of records.

rating high anthropic:claude-opus-4-7

This is a content rating field mixing theatrical (R, PG-13, PG, NC-17, G) and television (TVPG, TV14, TVMA, TVY7, TVG) classifications across 10 distinct values. The column is 90.23% null, so only ~9.77% of the 143,258 rows carry a rating, and within those R alone accounts for 55.28% of values. The mixed rating systems and the long tail (TVG, TVY7, G each appearing once) suggest inconsistent sourcing rather than a clean controlled vocabulary.

ratingContents high anthropic:claude-opus-4-7

This column stores content-rating descriptors (e.g. 'Language', 'Violence', 'Some Sexual Content') serialised as Python-style list literals rather than clean arrays. It is 90.23% null; among the 13,991 populated rows there are 8,353 unique values, so 40.3% are duplicates. The language detector tags a handful of entries as non-English (12 it, 5 ro, 1 km); given the tiny vocabulary (1,188 words), these may be misclassifications of short strings rather than genuinely foreign text. The bracket and quote artefacts in top_words confirm the values were never parsed out of their string representation.
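One way to recover real lists from these stringified literals is `ast.literal_eval`, which evaluates Python literals without executing arbitrary code. This is a minimal sketch: the helper name `parse_descriptors` is hypothetical, and keeping unparseable text as a single-item list is an assumption about what a downstream consumer would want.

```python
import ast


def parse_descriptors(raw):
    """Parse a string such as "['Language', 'Violence']" into a list of str."""
    if raw is None or raw.strip() == "":
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid literal: keep the raw text instead of dropping it.
        return [raw]
    if isinstance(value, (list, tuple)):
        return [str(item).strip() for item in value]
    return [str(value)]


print(parse_descriptors("['Strong Bloody Violence', 'Language']"))
# ['Strong Bloody Violence', 'Language']
```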

releaseDateTheaters high anthropic:claude-opus-4-7

This is a theatrical release date stored as an ISO-format string (every value is exactly 10 characters and a single token, e.g. '2018-09-14'). It is sparsely populated — 78.52% null — and the non-null values are heavily repeated, with a 60.8% duplicate rate across 12,062 distinct dates. The 'allcaps' alert is a false positive driven by digits-only strings; there's no actual text content to mine.

releaseDateStreaming high anthropic:claude-opus-4-7

This is a streaming-release date stored as ISO-8601 text (len_median 10, one_word_rate 1.0, all top values match YYYY-MM-DD). Not every value is a full date, though: len_min is 7, so at least one entry is shorter than the 10-character format and should be validated before parsing. Roughly 44.56% of rows are null and the duplicate_rate is 0.94, with a single date, 2017-05-22, appearing 1,232 times, which points to heavy clustering on a few release days. The text-style alerts (allcaps, one_word, short_text) are artifacts of the date format, not a quality issue.
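Because the stats report a minimum length of 7 for this column, a strict parse that rejects anything other than a 10-character YYYY-MM-DD string is a reasonable first check. The sketch below uses only the standard library. The helper name is hypothetical, and the two malformed inputs are invented for illustration, since the report does not show the actual short values.

```python
from datetime import datetime


def parse_release_date(raw):
    """Return a date for a strict YYYY-MM-DD string, or None if it does not match."""
    if raw is None or len(raw) != 10:
        return None
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return None


# "2017-05" and "2020-13-01" are made-up bad inputs, not values from the dataset.
values = ["2017-05-22", "2017-05", "2020-13-01"]
rejected = [v for v in values if parse_release_date(v) is None]
print(rejected)  # ['2017-05', '2020-13-01']
```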

runtimeMinutes high anthropic:claude-opus-4-7

Movie or episode runtime in minutes, with a typical feature-length distribution (median 92, IQR 84-103). The tail is extreme: max 2,700, skew 7.62, kurtosis 598.65, and 11.37% of non-null rows (14,720) flagged as outliers, suggesting a mix of shorts, multi-part specials, or full series totals alongside standard films. Roughly 9.65% of rows are null.
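The alert on this column reads "rows beyond 1.5 IQR", which matches the usual Tukey fence rule. Assuming that is what saturn applies, the sketch below derives the cut-offs from the reported quartiles; the function names are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when the value falls outside the Tukey fences."""
    low, high = tukey_fences(q1, q3, k)
    return value < low or value > high


# Quartiles reported for runtimeMinutes: q1 = 84, q3 = 103.
print(tukey_fences(84, 103))      # (55.5, 131.5)
print(is_outlier(2700, 84, 103))  # True
print(is_outlier(92, 84, 103))    # False
```

With an IQR of only 19 minutes, any runtime under 55.5 or over 131.5 minutes is flagged, so ordinary shorts and long features count as outliers alongside the extreme values.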

genre high anthropic:claude-opus-4-7

This is a categorical genre label for films, often a single word like 'Drama' (27,860 rows) or 'Documentary' (15,162) but sometimes a comma-separated combo such as 'Comedy, Drama'. With only 66 distinct vocabulary tokens but 2,912 unique strings and a 97.8% duplicate rate, the cardinality comes entirely from how genres are concatenated. Note the 7.74% null rate and that 55% of values are a single word, so multi-genre rows are the minority. Single-word is only a proxy for single-genre, since labels such as 'Mystery & thriller' span several words.
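To count individual genres instead of concatenated strings, each value can be split on commas before tallying. This is a minimal sketch using the sample values shown in the stats; the function name is hypothetical, and it assumes a comma never appears inside a single genre label.

```python
from collections import Counter


def genre_counts(values):
    """Tally individual genres from comma-separated genre strings."""
    counts = Counter()
    for value in values:
        if not value:
            continue  # skip nulls and empty strings
        counts.update(part.strip() for part in value.split(",") if part.strip())
    return counts


sample = [
    "Documentary",
    "Musical, Comedy",
    "Documentary, Music",
    "Comedy, Drama",
    "Mystery & thriller, Drama",
    None,
]
counts = genre_counts(sample)
print(counts["Documentary"], counts["Mystery & thriller"])  # 2 1
```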

originalLanguage high anthropic:claude-opus-4-7

Categorical language label with 112 distinct values, dominated by English at 65.7% of non-null rows (85,034 of 129,400). The long tail spans regional variants (e.g., 'French (Canada)' vs 'French (France)', 'English (United Kingdom)') alongside bare language names like 'English' and 'Spanish', suggesting inconsistent locale tagging that will fragment counts. Null rate is 9.67%, and an entropy ratio of 0.38 confirms heavy concentration in a few categories.
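If locale variants should roll up to their base language, one option is to strip a trailing parenthesised region before counting. The sketch below applies that to a few of the top values listed in the stats. The regex and helper name are illustrative, and whether regional variants ought to be merged at all is an analysis choice, not something the report decides.

```python
import re
from collections import Counter

# Matches a trailing parenthesised region, e.g. " (Canada)".
_LOCALE_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")


def base_language(label):
    """Strip a trailing '(Region)' suffix: 'French (Canada)' -> 'French'."""
    return _LOCALE_SUFFIX.sub("", label).strip()


# Counts copied from the originalLanguage top values.
top_counts = {
    "English": 85034,
    "Spanish": 4786,
    "French (Canada)": 3282,
    "French (France)": 2760,
    "English (United Kingdom)": 2553,
    "Spanish (Spain)": 936,
}
merged = Counter()
for label, n in top_counts.items():
    merged[base_language(label)] += n

print(dict(merged))  # {'English': 87587, 'Spanish': 5722, 'French': 6042}
```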

director high anthropic:claude-opus-4-7

Holds a film director's name, averaging 2.2 words and 14.8 characters with 62,207 unique values across 143,258 rows. The duplicate rate is 55.3% (76,857 rows), inflated by an 'Unknown Director' sentinel that occurs 3,544 times and should not be treated as a real name. Null rate is 2.93%, and the most frequent real names (David DeCoteau at 129, Sam Newfield at 124) reflect prolific B-movie directors rather than data quality issues.
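Sentinel strings like this are easiest to handle by mapping them to a missing value before counting or grouping. The sketch below is one way to do that. The sentinel set contains only the two placeholders visible in this report ('Unknown Director' here and 'Unknown language' in originalLanguage), and the helper name is hypothetical.

```python
# Placeholder strings seen in this report; extend the set if others turn up.
SENTINELS = {"unknown director", "unknown language"}


def clean_label(value):
    """Return None for nulls, blanks, and known sentinel strings; else the trimmed text."""
    if value is None:
        return None
    text = value.strip()
    if not text or text.casefold() in SENTINELS:
        return None
    return text


print(clean_label("Unknown Director"))  # None
print(clean_label(" Kinji Fukasaku "))  # Kinji Fukasaku
```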

writer high anthropic:claude-opus-4-7

Holds writer credits, typically one or two personal names averaging 2.7 words and 21 characters, with familiar figures like Jing Wong, Woody Allen, and Ingmar Bergman topping the list. Coverage is weak: 37.1% of rows are null, and 25.3% of the populated rows are duplicates across 67,274 unique values. Sample values such as 'Robert Cavanah,Jon Kirby' show that co-writers are concatenated into one comma-separated string per title. Top tokens (michael, david, john) confirm Western personal names dominate, though 'de' hints at multi-part surnames or non-English credits mixed in.

boxOffice high anthropic:claude-opus-4-7

Box office gross stored as a short currency string like "$1.1M": every value is one token, all are flagged all-caps in the stats (allcaps_rate 1.000), and lengths range from 2 to 7 characters. The column is 89.71% null, and only 4,863 distinct values cover the 14,743 populated rows, a 67.01% duplicate rate concentrated on round million-dollar figures. Note this is a coarse, pre-formatted string with a K or M suffix (e.g. "$334.2K", "$41.1M"), not a precise revenue number.
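Turning these strings into numbers needs a small parser for the dollar sign and the magnitude suffix. The sketch below handles the K and M suffixes seen in the sample values. Support for a B suffix and for suffix-less amounts is an assumption, since neither appears in the report, and the helper name is hypothetical.

```python
import re

# "B" and the empty suffix are assumptions; only K and M appear in the samples.
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_PATTERN = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")


def parse_box_office(raw):
    """Convert a string like '$41.1M' to whole dollars, or None if it does not match."""
    if raw is None:
        return None
    match = _PATTERN.match(raw.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix])


print(parse_box_office("$41.1M"))   # 41100000
print(parse_box_office("$334.2K"))  # 334200
```

The parsed values are still rounded figures, so they are approximate grosses, not exact revenue.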

distributor high anthropic:claude-opus-4-7

This column lists film distributor names, dominated by major studios like Paramount Pictures (994), 20th Century Fox (745), and Universal Pictures (737). It is overwhelmingly sparse, with an 83.94% null rate and, coincidentally, an 83.94% duplicate rate among the populated rows (19,311 of 23,005) across 3,694 unique values, so most rows lack distributor data while a small set of studios accounts for the populated entries. Names are short (mean length 19.9 chars, median 2 words) and vocabulary is concentrated around terms like 'pictures', 'films', and 'entertainment', though some values pack several distributors into one comma-separated string (len_max 385).

soundMix high anthropic:claude-opus-4-7

Catalogues the audio mix format of each title (Surround, Dolby Digital, Stereo, Mono, etc.), with 551 distinct labels across 143,258 rows. The dominant issue is sparsity: 88.89% of values are null, and even among populated rows 'Surround' covers only 25.6%. Free-form combinations like 'Stereo, Surround' vs 'Surround, Stereo' and overlapping Dolby variants suggest the field is unnormalised multi-label text rather than a clean taxonomy.
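Order-only differences such as 'Stereo, Surround' vs 'Surround, Stereo' can be removed by splitting each value and sorting its parts into a canonical key. The sketch below does that for the two top values in question. The helper name is illustrative, and it does not try to reconcile the overlapping Dolby variants, which would need a hand-built mapping.

```python
from collections import Counter


def normalise_mix(raw):
    """Return a sorted tuple of distinct labels from a comma-separated sound mix string."""
    if not raw:
        return ()
    parts = {part.strip() for part in raw.split(",") if part.strip()}
    return tuple(sorted(parts))


# Counts copied from the soundMix top values.
top_counts = {"Stereo, Surround": 473, "Surround, Stereo": 451}
merged = Counter()
for label, n in top_counts.items():
    merged[normalise_mix(label)] += n

print(dict(merged))  # {('Stereo', 'Surround'): 924}
```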

Numeric correlation

Languages detected

Per-string language detection across text columns (sampled).

id text

99.2% of rows are unique strings | 100.0% rows are a single word

rows	143,258
null	0 (0.0%)
unique	142,052
len_min	1
len_max	178
len_mean	18.152
len_median	16.000
len_p95	37.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	1,206
duplicate_rate	8.42e-03
vocab_size	19,976
readability_flesch_mean	-75.475
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	5.79e-03
boilerplate_rate	0.000

Sample values (first 10)

my_sassy_hubby
kattappanayile_rithwik_roshan
pinky
bethune-the-making-of-a-hero
mr_jones_2013
los_chicos_crenen_1942
mad_wednesday_1946
in_the_dark_room
clawed
starlight_hotel

title text

31 languages detected in sample

rows	143,258
null	367 (0.3%)
unique	126,403
len_min	1
len_max	176
len_mean	17.233
len_median	15.000
len_p95	37.000
word_mean	3.074
word_median	3.000
n_empty	0
n_duplicates	16,488
duplicate_rate	0.115
vocab_size	17,597
readability_flesch_mean	60.618
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.171
allcaps_rate	3.99e-03
boilerplate_rate	9.10e-05

Sample values (first 10)

My Sassy Hubby
Marie for Memory
The Locals
All the Winters That Have Been
The Luck of Ginger Coffey
Bleeding Through
School Killer
Territories
Miral
Murt Ramirez Wants to Kick My Ass

audienceScore numeric

48.9% null

rows	143,258
null	70,010 (48.9%)
unique	101
min	0.000
max	100.000
mean	55.675
median	57.000
std	24.554
q1	37.000
q3	76.000
iqr	39.000
skew	-0.226
kurtosis	-0.832
n_outliers	0
outlier_rate	0.000
zero_rate	0.016

tomatoMeter numeric

76.4% null

rows	143,258
null	109,381 (76.4%)
unique	101
min	0.000
max	100.000
mean	65.770
median	73.000
std	28.023
q1	45.000
q3	89.000
iqr	44.000
skew	-0.647
kurtosis	-0.666
n_outliers	0
outlier_rate	0.000
zero_rate	0.021

rating categorical

90.2% null

rows	143,258
null	129,267 (90.2%)
unique	10
top_value	R
top_rate	0.553
cardinality	10
entropy	1.710
entropy_ratio	0.515

Top values (rank 1–20)

R — 7,734
PG-13 — 3,446
PG — 1,911
TVPG — 424
TV14 — 397
TVMA — 57
NC-17 — 19
TVG — 1
TVY7 — 1
G — 1
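The entropy and entropy_ratio figures for this column can be reproduced from the counts above if entropy is Shannon entropy in bits and the ratio divides by log2 of the cardinality. That convention is inferred from the numbers, not documented in the report, and the function name is illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts from the rating top values: R, PG-13, PG, TVPG, TV14, TVMA, NC-17, TVG, TVY7, G.
rating_counts = [7734, 3446, 1911, 424, 397, 57, 19, 1, 1, 1]
h = entropy_bits(rating_counts)
ratio = h / math.log2(len(rating_counts))  # normalise by the maximum possible entropy
print(round(h, 3), round(ratio, 3))
```

Both values land close to the reported 1.710 and 0.515. A ratio near 0.5 means the distribution is much more concentrated than a uniform spread over the 10 ratings would be.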

ratingContents text

5 languages detected in sample | 90.2% null | 40.3% duplicate strings

rows	143,258
null	129,267 (90.2%)
unique	8,353
len_min	5
len_max	148
len_mean	46.160
len_median	44.000
len_p95	88.000
word_mean	5.169
word_median	5.000
n_empty	0
n_duplicates	5,638
duplicate_rate	0.403
vocab_size	1,188
readability_flesch_mean	15.017
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.059
allcaps_rate	0.062
boilerplate_rate	0.000

Sample values (first 10)

['Rude and Suggestive Material']
['Thematic Elements and Language']
['Sexual References', 'Language Throughout', 'Some Violence']
['Strong Bloody Violence', 'Language']
['Some Violent Content', 'Language']
['Language and Brief Violence']
['Some Mild Action', 'Rude Material/Language']
['Some Drug Use', 'Language Throughout', 'Strong Sexual Content']
['Thematic Material', 'Some Violent Content', 'Brief Language']
['Depiction of Killing', 'Some Aberrant Sexual Content', 'Depiction of Torture']

releaseDateTheaters text

100.0% rows are a single word | 100.0% rows are all-caps | 78.5% null | 95th-percentile length under 20 chars | 60.8% duplicate strings

rows	143,258
null	112,485 (78.5%)
unique	12,062
len_min	10
len_max	10
len_mean	10.000
len_median	10.000
len_p95	10.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	18,711
duplicate_rate	0.608
vocab_size	9,088
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

1997-02-20
1980-09-10
2017-09-08
2010-05-21
2022-11-11
2021-03-12
1970-05-06
2022-07-22
2018-09-07
2011-05-06

releaseDateStreaming text

100.0% rows are a single word | 100.0% rows are all-caps | 44.6% null | 95th-percentile length under 20 chars | 94.0% duplicate strings

rows	143,258
null	63,838 (44.6%)
unique	4,726
len_min	7
len_max	10
len_mean	10.000
len_median	10.000
len_p95	10.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	74,694
duplicate_rate	0.940
vocab_size	3,514
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

2010-09-28
2012-11-27
2020-07-04
2020-12-02
2018-08-25
2012-12-04
2015-10-06
2014-03-18
2016-09-07
2019-03-26

runtimeMinutes numeric

skew=+7.62 | 11.4% rows beyond 1.5 IQR

rows	143,258
null	13,827 (9.7%)
unique	324
min	1.000
max	2,700
mean	93.709
median	92.000
std	28.129
q1	84.000
q3	103.000
iqr	19.000
skew	7.623
kurtosis	598.651
n_outliers	14,720
outlier_rate	0.114
zero_rate	0.000

genre text

55.3% rows are a single word | 97.8% duplicate strings

rows	143,258
null	11,083 (7.7%)
unique	2,912
len_min	3
len_max	101
len_mean	12.907
len_median	11.000
len_p95	32.000
word_mean	1.874
word_median	1.000
n_empty	0
n_duplicates	129,263
duplicate_rate	0.978
vocab_size	66
readability_flesch_mean	-19.665
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.553
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

Documentary
Musical, Comedy
Documentary, Music
Action, Adventure
Comedy, Drama
Mystery & thriller, Drama
Documentary
Documentary
Crime, Drama, Mystery & thriller
Drama

originalLanguage categorical

rows	143,258
null	13,858 (9.7%)
unique	112
top_value	English
top_rate	0.657
cardinality	112
entropy	2.605
entropy_ratio	0.383

Top values (rank 1–20)

English — 85,034
Spanish — 4,786
Japanese — 3,482
Hindi — 3,309
French (Canada) — 3,282
Chinese — 3,166
French (France) — 2,760
English (United Kingdom) — 2,553
Italian — 2,303
German — 2,155
Korean — 1,226
Arabic — 938
Spanish (Spain) — 936
Tamil — 909
Russian — 898
Portuguese (Brazil) — 867
Telugu — 774
Malayalam — 642
Unknown language — 528
Dutch — 482

director text

55.3% duplicate strings

rows	143,258
null	4,194 (2.9%)
unique	62,207
len_min	1
len_max	326
len_mean	14.807
len_median	14.000
len_p95	26.000
word_mean	2.213
word_median	2.000
n_empty	0
n_duplicates	76,857
duplicate_rate	0.553
vocab_size	16,693
readability_flesch_mean	47.173
emoji_rate	0.000
url_rate	0.000
one_word_rate	7.36e-03
allcaps_rate	7.19e-05
boilerplate_rate	0.000

Sample values (first 10)

Robert Cavanah
Kinji Fukasaku
Carlo Lizzani
Sacha Polak
Michael Lei
Jim Makichuk
Larry Brand
Kevin de la Isla O'Neill
Jerry Rothwell
Masaki Tsujino

writer text

37.1% null | 25.3% duplicate strings

rows	143,258
null	53,142 (37.1%)
unique	67,274
len_min	1
len_max	262
len_mean	21.345
len_median	16.000
len_p95	47.000
word_mean	2.711
word_median	2.000
n_empty	0
n_duplicates	22,842
duplicate_rate	0.253
vocab_size	26,458
readability_flesch_mean	37.788
emoji_rate	0.000
url_rate	0.000
one_word_rate	5.35e-03
allcaps_rate	6.66e-05
boilerplate_rate	0.000

Sample values (first 10)

Robert Cavanah,Jon Kirby
Albert Espinosa
Miles Bellar
Suzette Couture,Pierre Sarrazin
Kelly Fullerton
Jason Richman
Raffaella Verga
Barry Massoni,Rene Perez
Michael McGovern
Guy Hibbert

boxOffice text

100.0% rows are a single word | 100.0% rows are all-caps | 89.7% null | 95th-percentile length under 20 chars | 67.0% duplicate strings

rows	143,258
null	128,515 (89.7%)
unique	4,863
len_min	2
len_max	7
len_mean	5.977
len_median	6.000
len_p95	7.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	9,880
duplicate_rate	0.670
vocab_size	4,863
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

$41.1M
$334.2K
$5.6K
$11.1K
$12.4M
$2.0M
$83.8M
$2.0M
$321.9K
$15.5M

distributor text

83.9% null | 83.9% duplicate strings

rows	143,258
null	120,253 (83.9%)
unique	3,694
len_min	3
len_max	385
len_mean	19.888
len_median	17.000
len_p95	43.000
word_mean	2.649
word_median	2.000
n_empty	0
n_duplicates	19,311
duplicate_rate	0.839
vocab_size	2,650
readability_flesch_mean	16.846
emoji_rate	0.000
url_rate	4.35e-05
one_word_rate	0.094
allcaps_rate	0.015
boilerplate_rate	0.000

Sample values (first 10)

Grapevine Video
20th Century Fox
Samuel Goldwyn Company
Warner Bros. Pictures
Metro-Goldwyn-Mayer
Gravitas Ventures
LS Video, United Artists, Hollywood Classics, Columbia TriStar Home Video, Reel Media International [us], Madacy Entertainment Group Inc. [us]
Netflix
20th Century Fox
Fox Searchlight

soundMix categorical

88.9% null

rows	143,258
null	127,341 (88.9%)
unique	551
top_value	Surround
top_rate	0.256
cardinality	551
entropy	4.663
entropy_ratio	0.512

Top values (rank 1–20)

Surround — 4,075
Dolby Digital — 2,375
Stereo — 2,082
Mono — 1,246
Stereo, Surround — 473
Surround, Stereo — 451
Dolby — 411
Dolby SRD, DTS, SDDS — 253
Dolby Atmos — 241
Dolby SR — 198
Dolby SR, DTS, Dolby Stereo, Surround, SDDS, Dolby A, Dolby Digital — 192
Dolby Stereo, Dolby Digital, Dolby A, Surround, Dolby SR — 167
Surround, Dolby Digital — 133
Dolby, Surround — 119
SDDS, Dolby Digital, DTS — 118
Surround, Dolby SRD, DTS, SDDS — 118
Dolby SRD — 107
Surround, Dolby SR, Dolby Digital, Dolby A, Dolby Stereo — 101
Dolby Atmos, Dolby Digital — 93
Datasat, Dolby Digital — 84