saturn

/home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv | 143,258 rows | sample n=143,258 | seed 42 | 2026-05-01T23:13:52+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv
Total rows: 143,258
Profiled sample: 143,258
Columns: 16
Generated: 2026-05-01T23:13:52+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset catalogs 143,258 movies from Rotten Tomatoes across 16 columns covering metadata (title, director, writer, distributor), release info, runtime, genre, language, ratings, and critic/audience scores. Coverage is highly uneven: boxOffice (89.7% null), rating (90.2% null), tomatoMeter (76.4% null), and releaseDateTheaters (78.5% null) are sparse, and audienceScore is missing in roughly half the rows (48.9%). Worth a closer look first: the genre distribution, which is dominated by Drama (27,860), Documentary (15,162), and Comedy (11,514), and runtimeMinutes, which is heavily right-skewed (skew 7.6, max 2,700 minutes) with ~11.4% of non-null values flagged as outliers despite a tight IQR of 84–103 minutes. The two score distributions differ: critic scores skew positive (tomatoMeter median 73) while audience scores are more middling (audienceScore median 57), though the two medians are computed over different subsets of titles because the null rates differ. English dominates originalLanguage at 65.7% of non-null values, so any language-based analysis will be lopsided.

id high anthropic:claude-opus-4-7

Slug-style identifier column: every value is a single token (one_word_rate 1.0, word_mean 1.0) with mean length ~18 chars and 142,052 unique values out of 143,258 rows. The 1,206 duplicates (0.84%) are surprising for an id field; top repeats like 'catch_me_if_you_can' and 'hear_no_evil' suggest these are title-derived slugs rather than guaranteed-unique keys. Readability score is meaningless here (−75.5) because the tokens are underscore- or hyphen-joined phrases, not prose.

title high anthropic:claude-opus-4-7

Short titles (mean 17 chars, median 3 words) of what look like films or works; top values include 'The Return', 'A Christmas Carol', 'Hero', 'Blue'. Predominantly English (3946 in the language-detection sample), with 31 languages detected in total; Spanish (123), German (80), and French (72) are the most common after English. Notable duplication: 16,488 repeats (11.5% duplicate rate) across 126,403 unique values out of 143,258 rows, and 17% are single-word titles.

audienceScore high anthropic:claude-opus-4-7

This is an audience rating score on a 0-100 scale with 101 unique integer values, mean 55.67 and median 57. The distribution is wide (std 24.55, IQR 39) and slightly left-skewed (skew -0.23, kurtosis -0.83) with no outliers flagged. The dominant concern is missingness: 48.87% of rows are null, so nearly half the dataset lacks this score.

tomatoMeter high anthropic:claude-opus-4-7

This is the Rotten Tomatoes critic score (tomatoMeter), a 0-100 percentage with 101 unique integer values, mean 65.77 and median 73. The distribution is left-skewed (skew -0.65) with Q1 at 45 and Q3 at 89, indicating most rated titles lean favorable. The dominant concern is coverage: 76.35% of rows are null, so the field is only populated for a minority of records.

rating high anthropic:claude-opus-4-7

This is a content rating field mixing theatrical (R, PG-13, PG, NC-17, G) and television (TVPG, TV14, TVMA, TVY7, TVG) classifications across 10 distinct values. The column is 90.23% null, so only ~9.77% of the 143,258 rows carry a rating, and within those R alone accounts for 55.28% of values. The mixed rating systems and the long tail (TVG, TVY7, G each appearing once) suggest inconsistent sourcing rather than a clean controlled vocabulary.

ratingContents high anthropic:claude-opus-4-7

This column stores content-rating descriptors (e.g. 'Language', 'Violence', 'Some Sexual Content') serialised as Python-style list literals rather than clean arrays. It is 90.23% null; among the 13,991 populated rows, 40.3% (5,638) are duplicates, leaving 8,353 unique values. Language detection tags a handful of entries as non-English (12 it, 5 ro, 1 km), which is probably detector noise on short strings rather than real foreign-language content, given the tiny vocabulary (1,188 words). The bracket/quote artefacts in top_words confirm the values were never parsed out of their string representation.
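A minimal sketch of how the list literals in ratingContents could be parsed back into real lists, using the standard library's `ast.literal_eval`. The function name `parse_descriptors` and the choice to return an empty list for missing or unparseable values are assumptions made for illustration, not something saturn does.

```python
import ast

def parse_descriptors(raw):
    """Turn a string like "['Language', 'Violence']" into a list of strings."""
    if not raw:
        return []  # null or empty cell
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal; treat as unparsed
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, str)]

print(parse_descriptors("['Strong Bloody Violence', 'Language']"))
# ['Strong Bloody Violence', 'Language']
```

`literal_eval` only evaluates literals, so it is a safer choice than `eval` for strings that come from a scraped file.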

releaseDateTheaters high anthropic:claude-opus-4-7

This is a theatrical release date stored as an ISO-format string (every value is exactly 10 characters and a single token, e.g. '2018-09-14'). It is sparsely populated — 78.52% null — and the non-null values are heavily repeated, with a 60.8% duplicate rate across 12,062 distinct dates. The 'allcaps' alert is a false positive driven by digits-only strings; there's no actual text content to mine.

releaseDateStreaming high anthropic:claude-opus-4-7

This is a streaming-release date stored as ISO-8601 text (len_median 10, one_word_rate 1.0, all top values match YYYY-MM-DD), although a len_min of 7 shows that at least one value is shorter than a full date. Roughly 44.56% of rows are null and the duplicate_rate is 0.94, with the single date 2017-05-22 appearing 1,232 times, so releases cluster heavily on a few days. The text-style alerts (allcaps, one_word, short_text) are artifacts of the date format, not a quality issue.
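Both release-date columns could be converted from text to real dates before analysis. The sketch below assumes that only 10-character ISO strings should be accepted and that anything else (including the shorter values implied by releaseDateStreaming's len_min of 7, whose exact shape the report does not show) becomes `None`. The function name is hypothetical.

```python
from datetime import date

def parse_release_date(raw):
    """Return a date for a 10-character ISO date string, else None."""
    if raw is None or len(raw) != 10:
        return None  # null, or a shorter/longer value we do not try to guess at
    try:
        return date.fromisoformat(raw)
    except ValueError:
        return None  # right length but not a real calendar date

print(parse_release_date("2017-05-22"))  # 2017-05-22
print(parse_release_date("2017-05"))     # None
```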

runtimeMinutes high anthropic:claude-opus-4-7

Movie or episode runtime in minutes, with a typical feature-length distribution (median 92, IQR 84-103). The tail is extreme: max 2700, skew 7.62, kurtosis 598.65, and 11.37% of non-null values (14,720 of 129,431) flagged as outliers, suggesting a mix of shorts, multi-part specials, or full series totals alongside standard films. Roughly 9.65% of rows are null.
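To see why so many runtimes are flagged, it helps to work out the outlier thresholds. The sketch below assumes saturn uses the conventional Tukey fences (Q1 − 1.5·IQR and Q3 + 1.5·IQR), which matches the report's "beyond 1.5 IQR" wording but is not confirmed; the function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for runtimeMinutes: q1 = 84, q3 = 103.
lower, upper = tukey_fences(84.0, 103.0)
print(lower, upper)  # 55.5 131.5
```

Under that assumption, anything shorter than about 56 minutes or longer than about 131 minutes counts as an outlier, so ordinary shorts and long features are flagged along with the extreme values.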

genre high anthropic:claude-opus-4-7

This is a categorical genre label for films, often a single word like 'Drama' (27,860 rows) or 'Documentary' (15,162) but sometimes a comma-separated combo such as 'Comedy, Drama'. With only 66 distinct vocabulary tokens but 2,912 unique strings and a 97.8% duplicate rate, the cardinality comes entirely from how genres are concatenated. Note the 7.74% null rate and that 55% of values are single-word. Multi-genre rows are therefore the minority, since every comma-separated value is multi-word (as are single genres such as 'Mystery & thriller').
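Because the genre strings are comma-separated combinations, per-genre counts require splitting them first. A minimal sketch, using the ten sample values this report lists for the column; the helper name `split_genres` is hypothetical.

```python
from collections import Counter

def split_genres(raw):
    """Split 'Crime, Drama, Mystery & thriller' into individual genre labels."""
    return [part.strip() for part in raw.split(",") if part.strip()]

# The ten sample values shown for the genre column in this report.
sample = [
    "Documentary", "Musical, Comedy", "Documentary, Music",
    "Action, Adventure", "Comedy, Drama", "Mystery & thriller, Drama",
    "Documentary", "Documentary", "Crime, Drama, Mystery & thriller", "Drama",
]

genre_counts = Counter(g for row in sample for g in split_genres(row))
print(genre_counts["Documentary"], genre_counts["Drama"])  # 4 4
```

Splitting on the comma rather than on whitespace keeps multi-word genres such as 'Mystery & thriller' intact.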

originalLanguage high anthropic:claude-opus-4-7

Categorical language label with 112 distinct values, dominated by English at 65.7% of non-null rows (85,034 of 129,400). The long tail spans regional variants (e.g., 'French (Canada)' vs 'French (France)', 'English (United Kingdom)') alongside bare language names like 'English' and 'French', suggesting inconsistent locale tagging that will fragment counts. Null rate is 9.67%, and entropy ratio of 0.38 confirms heavy concentration in a few categories.
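One way to reduce the fragmentation is to fold regional variants into their base language. The sketch below strips a trailing parenthesised region and re-aggregates a few of the top-value counts from this report. Whether merging is appropriate depends on the analysis, and the helper name `base_language` is hypothetical.

```python
import re
from collections import Counter

# Matches a trailing "(Region)" suffix such as " (Canada)".
REGION_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")

def base_language(label):
    """Map 'French (Canada)' to 'French'; leave bare names unchanged."""
    return REGION_SUFFIX.sub("", label).strip()

# A subset of the originalLanguage top-value counts listed in this report.
top_counts = {
    "English": 85034,
    "Spanish": 4786,
    "French (Canada)": 3282,
    "French (France)": 2760,
    "English (United Kingdom)": 2553,
    "Spanish (Spain)": 936,
}

merged = Counter()
for label, n in top_counts.items():
    merged[base_language(label)] += n

print(merged.most_common())
# [('English', 87587), ('French', 6042), ('Spanish', 5722)]
```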

director high anthropic:claude-opus-4-7

Holds a film director's name, averaging 2.2 words and 14.8 characters with 62,207 unique values across 143,258 rows. The duplicate rate is 55.3% (76,857 rows), inflated by an 'Unknown Director' sentinel that occurs 3,544 times and should not be treated as a real name. Null rate is 2.93%, and the most frequent real names (David DeCoteau at 129, Sam Newfield at 124) reflect prolific B-movie directors rather than data quality issues.

writer high anthropic:claude-opus-4-7

Holds writer credits, typically one or two personal names averaging 2.7 words and 21 characters, with familiar figures like Jing Wong, Woody Allen, and Ingmar Bergman topping the list. Coverage is weak: 37.1% of rows are null, and 25.3% of the populated values are duplicates across 67,274 unique values. Sample values such as 'Robert Cavanah,Jon Kirby' show that multiple co-writers are concatenated into one comma-separated string per title. Top tokens (michael, david, john) confirm Western personal names dominate, though 'de' hints at multi-part names or non-English credits mixed in.

boxOffice high anthropic:claude-opus-4-7

Box office gross stored as a short currency string like "$1.1M": every value is one token, 99.99% allcaps, and lengths cluster between 2 and 7 characters. The column is 89.71% null and only 4,863 distinct values cover the 14,743 populated rows, with a 67.01% duplicate rate concentrated on round million-dollar figures. Note this is a coarse, pre-formatted string abbreviated with a K or M suffix (e.g. "$334.2K", "$41.1M"), not a precise revenue number.
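To use boxOffice numerically, the abbreviated strings need converting to dollar amounts. The sketch below handles the K and M suffixes seen in the sample values. Support for a B suffix and for values with no suffix is an assumption, since the report does not show such values, and the function name is hypothetical. `Decimal` is used so that "$41.1M" converts exactly rather than picking up floating-point error.

```python
import re
from decimal import Decimal

_MONEY = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")
# "B" and the empty suffix are assumptions; only K and M appear in the samples.
_SCALE = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_box_office(raw):
    """Convert '$41.1M' to 41100000; return None when the string does not match."""
    if raw is None:
        return None
    match = _MONEY.match(raw.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    return int(Decimal(number) * _SCALE[suffix])

print(parse_box_office("$41.1M"))   # 41100000
print(parse_box_office("$334.2K"))  # 334200
```

The result is still only as precise as the source string, which carries a few significant digits at most.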

distributor high anthropic:claude-opus-4-7

This column lists film distributor names, dominated by major studios like Paramount Pictures (994), 20th Century Fox (745), and Universal Pictures (737). It is overwhelmingly sparse: 83.94% of rows are null, and among the 23,005 populated rows the duplicate rate is also 83.94% across 3,694 unique values, so a small set of studios accounts for most of the populated entries. Names are short (mean length 19.9 chars, median 2 words) and vocabulary is concentrated around terms like 'pictures', 'films', and 'entertainment'.

soundMix high anthropic:claude-opus-4-7

Catalogues the audio mix format of each title (Surround, Dolby Digital, Stereo, Mono, etc.), with 551 distinct labels across 143,258 rows. The dominant issue is sparsity: 88.89% of values are null, and even among populated rows 'Surround' covers only 25.6%. Free-form combinations like 'Stereo, Surround' vs 'Surround, Stereo' and overlapping Dolby variants suggest the field is unnormalised multi-label text rather than a clean taxonomy.
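Order-only variants such as 'Stereo, Surround' and 'Surround, Stereo' can be collapsed by treating each value as a set of labels. The sketch below sorts the comma-separated parts into a canonical string and merges three of the counts from this report. It does not attempt to reconcile overlapping Dolby variants, and the helper name `canonical_mix` is hypothetical.

```python
from collections import Counter

def canonical_mix(label):
    """Return the comma-separated parts de-duplicated and in sorted order."""
    parts = {part.strip() for part in label.split(",") if part.strip()}
    return ", ".join(sorted(parts))

# Three soundMix top-value counts taken from this report.
counts = {"Surround": 4075, "Stereo, Surround": 473, "Surround, Stereo": 451}

merged = Counter()
for label, n in counts.items():
    merged[canonical_mix(label)] += n

print(merged["Stereo, Surround"])  # 924
```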

Numeric correlation

Languages detected

Per-string language detection across text columns (sampled).

id text

99.2% of rows are unique strings; 100.0% rows are a single word
rows: 143,258
null: 0 (0.0%)
unique: 142,052
len_min: 1
len_max: 178
len_mean: 18.152
len_median: 16.000
len_p95: 37.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 1,206
duplicate_rate: 8.42e-03
vocab_size: 19,976
readability_flesch_mean: -75.475
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 5.79e-03
boilerplate_rate: 0.000
Sample values (first 10)
  1. my_sassy_hubby
  2. kattappanayile_rithwik_roshan
  3. pinky
  4. bethune-the-making-of-a-hero
  5. mr_jones_2013
  6. los_chicos_crenen_1942
  7. mad_wednesday_1946
  8. in_the_dark_room
  9. clawed
  10. starlight_hotel

title text

31 languages detected in sample
rows: 143,258
null: 367 (0.3%)
unique: 126,403
len_min: 1
len_max: 176
len_mean: 17.233
len_median: 15.000
len_p95: 37.000
word_mean: 3.074
word_median: 3.000
n_empty: 0
n_duplicates: 16,488
duplicate_rate: 0.115
vocab_size: 17,597
readability_flesch_mean: 60.618
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.171
allcaps_rate: 3.99e-03
boilerplate_rate: 9.10e-05
Sample values (first 10)
  1. My Sassy Hubby
  2. Marie for Memory
  3. The Locals
  4. All the Winters That Have Been
  5. The Luck of Ginger Coffey
  6. Bleeding Through
  7. School Killer
  8. Territories
  9. Miral
  10. Murt Ramirez Wants to Kick My Ass

audienceScore numeric

48.9% null
rows: 143,258
null: 70,010 (48.9%)
unique: 101
min: 0.000
max: 100.000
mean: 55.675
median: 57.000
std: 24.554
q1: 37.000
q3: 76.000
iqr: 39.000
skew: -0.226
kurtosis: -0.832
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.016

tomatoMeter numeric

76.4% null
rows: 143,258
null: 109,381 (76.4%)
unique: 101
min: 0.000
max: 100.000
mean: 65.770
median: 73.000
std: 28.023
q1: 45.000
q3: 89.000
iqr: 44.000
skew: -0.647
kurtosis: -0.666
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.021

rating categorical

90.2% null
rows: 143,258
null: 129,267 (90.2%)
unique: 10
top_value: R
top_rate: 0.553
cardinality: 10
entropy: 1.710
entropy_ratio: 0.515
Top values (rank 1–20)
  1. R — 7,734
  2. PG-13 — 3,446
  3. PG — 1,911
  4. TVPG — 424
  5. TV14 — 397
  6. TVMA — 57
  7. NC-17 — 19
  8. TVG — 1
  9. TVY7 — 1
  10. G — 1
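The entropy and entropy_ratio figures for rating can be reproduced from the top-value counts above, assuming entropy is Shannon entropy in bits and entropy_ratio divides it by log2 of the cardinality. That definition is inferred from the numbers in this report, not taken from saturn's documentation, and the function name below is hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts copied from the rating "Top values" list above.
rating_counts = [7734, 3446, 1911, 424, 397, 57, 19, 1, 1, 1]

h = entropy_bits(rating_counts)
# Normalise by the maximum possible entropy for this many categories.
ratio = h / math.log2(len(rating_counts))
print(round(h, 2), round(ratio, 2))
```

A ratio of 1.0 would mean all ten ratings are equally common; the reported value near 0.5 reflects how much R, PG-13, and PG dominate.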

ratingContents text

5 languages detected in sample; 90.2% null; 40.3% duplicate strings
rows: 143,258
null: 129,267 (90.2%)
unique: 8,353
len_min: 5
len_max: 148
len_mean: 46.160
len_median: 44.000
len_p95: 88.000
word_mean: 5.169
word_median: 5.000
n_empty: 0
n_duplicates: 5,638
duplicate_rate: 0.403
vocab_size: 1,188
readability_flesch_mean: 15.017
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.059
allcaps_rate: 0.062
boilerplate_rate: 0.000
Sample values (first 10)
  1. ['Rude and Suggestive Material']
  2. ['Thematic Elements and Language']
  3. ['Sexual References', 'Language Throughout', 'Some Violence']
  4. ['Strong Bloody Violence', 'Language']
  5. ['Some Violent Content', 'Language']
  6. ['Language and Brief Violence']
  7. ['Some Mild Action', 'Rude Material/Language']
  8. ['Some Drug Use', 'Language Throughout', 'Strong Sexual Content']
  9. ['Thematic Material', 'Some Violent Content', 'Brief Language']
  10. ['Depiction of Killing', 'Some Aberrant Sexual Content', 'Depiction of Torture']

releaseDateTheaters text

100.0% rows are a single word; 100.0% rows are all-caps; 78.5% null; 95th-percentile length under 20 chars; 60.8% duplicate strings
rows: 143,258
null: 112,485 (78.5%)
unique: 12,062
len_min: 10
len_max: 10
len_mean: 10.000
len_median: 10.000
len_p95: 10.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 18,711
duplicate_rate: 0.608
vocab_size: 9,088
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. 1997-02-20
  2. 1980-09-10
  3. 2017-09-08
  4. 2010-05-21
  5. 2022-11-11
  6. 2021-03-12
  7. 1970-05-06
  8. 2022-07-22
  9. 2018-09-07
  10. 2011-05-06

releaseDateStreaming text

100.0% rows are a single word; 100.0% rows are all-caps; 44.6% null; 95th-percentile length under 20 chars; 94.0% duplicate strings
rows: 143,258
null: 63,838 (44.6%)
unique: 4,726
len_min: 7
len_max: 10
len_mean: 10.000
len_median: 10.000
len_p95: 10.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 74,694
duplicate_rate: 0.940
vocab_size: 3,514
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. 2010-09-28
  2. 2012-11-27
  3. 2020-07-04
  4. 2020-12-02
  5. 2018-08-25
  6. 2012-12-04
  7. 2015-10-06
  8. 2014-03-18
  9. 2016-09-07
  10. 2019-03-26

runtimeMinutes numeric

skew=+7.62; 11.4% rows beyond 1.5 IQR
rows: 143,258
null: 13,827 (9.7%)
unique: 324
min: 1.000
max: 2,700
mean: 93.709
median: 92.000
std: 28.129
q1: 84.000
q3: 103.000
iqr: 19.000
skew: 7.623
kurtosis: 598.651
n_outliers: 14,720
outlier_rate: 0.114
zero_rate: 0.000

genre text

55.3% rows are a single word; 97.8% duplicate strings
rows: 143,258
null: 11,083 (7.7%)
unique: 2,912
len_min: 3
len_max: 101
len_mean: 12.907
len_median: 11.000
len_p95: 32.000
word_mean: 1.874
word_median: 1.000
n_empty: 0
n_duplicates: 129,263
duplicate_rate: 0.978
vocab_size: 66
readability_flesch_mean: -19.665
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.553
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. Documentary
  2. Musical, Comedy
  3. Documentary, Music
  4. Action, Adventure
  5. Comedy, Drama
  6. Mystery & thriller, Drama
  7. Documentary
  8. Documentary
  9. Crime, Drama, Mystery & thriller
  10. Drama

originalLanguage categorical

rows: 143,258
null: 13,858 (9.7%)
unique: 112
top_value: English
top_rate: 0.657
cardinality: 112
entropy: 2.605
entropy_ratio: 0.383
Top values (rank 1–20)
  1. English — 85,034
  2. Spanish — 4,786
  3. Japanese — 3,482
  4. Hindi — 3,309
  5. French (Canada) — 3,282
  6. Chinese — 3,166
  7. French (France) — 2,760
  8. English (United Kingdom) — 2,553
  9. Italian — 2,303
  10. German — 2,155
  11. Korean — 1,226
  12. Arabic — 938
  13. Spanish (Spain) — 936
  14. Tamil — 909
  15. Russian — 898
  16. Portuguese (Brazil) — 867
  17. Telugu — 774
  18. Malayalam — 642
  19. Unknown language — 528
  20. Dutch — 482

director text

55.3% duplicate strings
rows: 143,258
null: 4,194 (2.9%)
unique: 62,207
len_min: 1
len_max: 326
len_mean: 14.807
len_median: 14.000
len_p95: 26.000
word_mean: 2.213
word_median: 2.000
n_empty: 0
n_duplicates: 76,857
duplicate_rate: 0.553
vocab_size: 16,693
readability_flesch_mean: 47.173
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 7.36e-03
allcaps_rate: 7.19e-05
boilerplate_rate: 0.000
Sample values (first 10)
  1. Robert Cavanah
  2. Kinji Fukasaku
  3. Carlo Lizzani
  4. Sacha Polak
  5. Michael Lei
  6. Jim Makichuk
  7. Larry Brand
  8. Kevin de la Isla O'Neill
  9. Jerry Rothwell
  10. Masaki Tsujino

writer text

37.1% null; 25.3% duplicate strings
rows: 143,258
null: 53,142 (37.1%)
unique: 67,274
len_min: 1
len_max: 262
len_mean: 21.345
len_median: 16.000
len_p95: 47.000
word_mean: 2.711
word_median: 2.000
n_empty: 0
n_duplicates: 22,842
duplicate_rate: 0.253
vocab_size: 26,458
readability_flesch_mean: 37.788
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 5.35e-03
allcaps_rate: 6.66e-05
boilerplate_rate: 0.000
Sample values (first 10)
  1. Robert Cavanah,Jon Kirby
  2. Albert Espinosa
  3. Miles Bellar
  4. Suzette Couture,Pierre Sarrazin
  5. Kelly Fullerton
  6. Jason Richman
  7. Raffaella Verga
  8. Barry Massoni,Rene Perez
  9. Michael McGovern
  10. Guy Hibbert

boxOffice text

100.0% rows are a single word; 100.0% rows are all-caps; 89.7% null; 95th-percentile length under 20 chars; 67.0% duplicate strings
rows: 143,258
null: 128,515 (89.7%)
unique: 4,863
len_min: 2
len_max: 7
len_mean: 5.977
len_median: 6.000
len_p95: 7.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 9,880
duplicate_rate: 0.670
vocab_size: 4,863
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. $41.1M
  2. $334.2K
  3. $5.6K
  4. $11.1K
  5. $12.4M
  6. $2.0M
  7. $83.8M
  8. $2.0M
  9. $321.9K
  10. $15.5M

distributor text

83.9% null; 83.9% duplicate strings
rows: 143,258
null: 120,253 (83.9%)
unique: 3,694
len_min: 3
len_max: 385
len_mean: 19.888
len_median: 17.000
len_p95: 43.000
word_mean: 2.649
word_median: 2.000
n_empty: 0
n_duplicates: 19,311
duplicate_rate: 0.839
vocab_size: 2,650
readability_flesch_mean: 16.846
emoji_rate: 0.000
url_rate: 4.35e-05
one_word_rate: 0.094
allcaps_rate: 0.015
boilerplate_rate: 0.000
Sample values (first 10)
  1. Grapevine Video
  2. 20th Century Fox
  3. Samuel Goldwyn Company
  4. Warner Bros. Pictures
  5. Metro-Goldwyn-Mayer
  6. Gravitas Ventures
  7. LS Video, United Artists, Hollywood Classics, Columbia TriStar Home Video, Reel Media International [us], Madacy Entertainment Group Inc. [us]
  8. Netflix
  9. 20th Century Fox
  10. Fox Searchlight

soundMix categorical

88.9% null
rows: 143,258
null: 127,341 (88.9%)
unique: 551
top_value: Surround
top_rate: 0.256
cardinality: 551
entropy: 4.663
entropy_ratio: 0.512
Top values (rank 1–20)
  1. Surround — 4,075
  2. Dolby Digital — 2,375
  3. Stereo — 2,082
  4. Mono — 1,246
  5. Stereo, Surround — 473
  6. Surround, Stereo — 451
  7. Dolby — 411
  8. Dolby SRD, DTS, SDDS — 253
  9. Dolby Atmos — 241
  10. Dolby SR — 198
  11. Dolby SR, DTS, Dolby Stereo, Surround, SDDS, Dolby A, Dolby Digital — 192
  12. Dolby Stereo, Dolby Digital, Dolby A, Surround, Dolby SR — 167
  13. Surround, Dolby Digital — 133
  14. Dolby, Surround — 119
  15. SDDS, Dolby Digital, DTS — 118
  16. Surround, Dolby SRD, DTS, SDDS — 118
  17. Dolby SRD — 107
  18. Surround, Dolby SR, Dolby Digital, Dolby A, Dolby Stereo — 101
  19. Dolby Atmos, Dolby Digital — 93
  20. Datasat, Dolby Digital — 84