rotten_tomatoes-rotten_tomatoes_movies

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv

Saturn profiled 143,258 rows across 16 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv",
    "--findings", "rotten_tomatoes-rotten_tomatoes_movies.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogs 143,258 movies from Rotten Tomatoes across 16 columns covering metadata (title, director, writer, distributor), release info, runtime, genre, language, ratings, and critic/audience scores. Coverage is highly uneven: rating (90.2% null), boxOffice (89.7% null), releaseDateTheaters (78.5% null), and tomatoMeter (76.4% null) are sparse, and audienceScore is missing in roughly half the rows (48.9%). Worth a closer look first: the genre distribution, which is dominated by Drama (27,860), Documentary (15,162), and Comedy (11,514), and runtimeMinutes, which is heavily right-skewed (skew 7.6, max 2,700 minutes) with ~11.4% of non-null values flagged as outliers despite a tight IQR of 84–103 minutes. The tomatoMeter and audienceScore distributions also differ: among scored titles, critics skew positive (median 73) while audiences are more middling (median 57). English accounts for 65.7% of non-null originalLanguage values (59.4% of all rows), so any language-based analysis will be lopsided.

citing: row_count · column_count · columns.boxOffice.null_rate · columns.rating.null_rate · columns.tomatoMeter.null_rate · columns.releaseDateTheaters.null_rate · columns.audienceScore.null_rate · columns.genre.top_values · columns.runtimeMinutes.stats · columns.tomatoMeter.stats · columns.audienceScore.stats · columns.originalLanguage.top_rate · columns.originalLanguage.top_values

Out[4]:

saturn.schema() · 16 columns

column	kind	n	null%	unique	alerts
id	text	143,258	0.0%	142,052	near_unique one_word
title	text	143,258	0.3%	126,403	multilingual
audienceScore	numeric	143,258	48.9%	101	null_rate
tomatoMeter	numeric	143,258	76.4%	101	null_rate
rating	categorical	143,258	90.2%	10	null_rate
ratingContents	text	143,258	90.2%	8,353	multilingual null_rate duplicates
releaseDateTheaters	text	143,258	78.5%	12,062	one_word allcaps null_rate short_text duplicates
releaseDateStreaming	text	143,258	44.6%	4,726	one_word allcaps null_rate short_text duplicates
runtimeMinutes	numeric	143,258	9.7%	324	high_skew outliers
genre	text	143,258	7.7%	2,912	one_word duplicates
originalLanguage	categorical	143,258	9.7%	112
director	text	143,258	2.9%	62,207	duplicates
writer	text	143,258	37.1%	67,274	null_rate duplicates
boxOffice	text	143,258	89.7%	4,863	one_word allcaps null_rate short_text duplicates
distributor	text	143,258	83.9%	3,694	null_rate duplicates
soundMix	categorical	143,258	88.9%	551	null_rate

Fig 1.

genre · Drama, Documentary, and Comedy dominate; note how long-tail combo genres fragment the rest.

Character-length distribution for genre (mean: 12.906979383393228).
chars	count
3 – 5	28416
5 – 8	26877
8 – 10	3141
10 – 13	18414
13 – 15	18553
15 – 18	3196
18 – 20	10562
20 – 23	3727
23 – 25	4991
25 – 28	5352
28 – 30	854
30 – 32	2468
32 – 35	2166
35 – 37	1055
37 – 40	240
40 – 42	906
42 – 45	639
45 – 47	124
47 – 50	89
50 – 52	123
52 – 54	164
54 – 57	11
57 – 59	30
59 – 62	42
62 – 64	18
64 – 67	4
67 – 69	8
69 – 72	1
72 – 74	2
74 – 76	0
76 – 79	0
79 – 81	1
81 – 84	0
84 – 86	0
86 – 89	0
89 – 91	0
91 – 94	0
94 – 96	0
96 – 99	0
99 – 101	1

Fig 2.

runtimeMinutes · Most films cluster around 84–103 minutes, but extreme outliers stretch the tail to 2,700 minutes.

Histogram bins for runtimeMinutes (median: 92.0).
bin	count
1 – 68.47	12329
68.47 – 135.9	110286
135.9 – 203.4	6482
203.4 – 270.9	254
270.9 – 338.4	37
338.4 – 405.8	19
405.8 – 473.3	9
473.3 – 540.8	7
540.8 – 608.3	3
608.3 – 675.8	1
675.8 – 743.2	0
743.2 – 810.7	1
810.7 – 878.2	0
878.2 – 945.6	1
945.6 – 1013	1
1013 – 1081	0
1081 – 1148	0
1148 – 1216	0
1216 – 1283	0
1283 – 1350	0
1350 – 1418	0
1418 – 1485	0
1485 – 1553	0
1553 – 1620	0
1620 – 1688	0
1688 – 1755	0
1755 – 1823	0
1823 – 1890	0
1890 – 1958	0
1958 – 2025	0
2025 – 2093	0
2093 – 2160	0
2160 – 2228	0
2228 – 2295	0
2295 – 2363	0
2363 – 2430	0
2430 – 2498	0
2498 – 2565	0
2565 – 2633	0
2633 – 2700	1

Fig 3.

tomatoMeter · Critic scores skew positive with a median of 73 — look for the left tail of poorly reviewed films.

Histogram bins for tomatoMeter (median: 73.0).
bin	count
0 – 2.5	723
2.5 – 5	77
5 – 7.5	201
7.5 – 10	224
10 – 12.5	375
12.5 – 15	458
15 – 17.5	529
17.5 – 20	262
20 – 22.5	859
22.5 – 25	203
25 – 27.5	557
27.5 – 30	433
30 – 32.5	431
32.5 – 35	610
35 – 37.5	436
37.5 – 40	436
40 – 42.5	964
42.5 – 45	651
45 – 47.5	521
47.5 – 50	228
50 – 52.5	1119
52.5 – 55	386
55 – 57.5	964
57.5 – 60	335
60 – 62.5	1163
62.5 – 65	769
65 – 67.5	1293
67.5 – 70	493
70 – 72.5	1215
72.5 – 75	614
75 – 77.5	1166
77.5 – 80	786
80 – 82.5	1951
82.5 – 85	1247
85 – 87.5	1623
87.5 – 90	1488
90 – 92.5	1782
92.5 – 95	1104
95 – 97.5	1133
97.5 – 100	4068

Fig 4.

audienceScore · Audience scores are flatter and more centered (median 57) than critic scores — compare the two distributions.

Histogram bins for audienceScore (median: 57.0).
bin	count
0 – 2.5	1158
2.5 – 5	84
5 – 7.5	353
7.5 – 10	437
10 – 12.5	1114
12.5 – 15	814
15 – 17.5	1413
17.5 – 20	812
20 – 22.5	2208
22.5 – 25	884
25 – 27.5	1918
27.5 – 30	1406
30 – 32.5	1799
32.5 – 35	2064
35 – 37.5	1875
37.5 – 40	1756
40 – 42.5	3056
42.5 – 45	2022
45 – 47.5	2212
47.5 – 50	1176
50 – 52.5	3593
52.5 – 55	1526
55 – 57.5	3065
57.5 – 60	1583
60 – 62.5	3772
62.5 – 65	1583
65 – 67.5	3187
67.5 – 70	1692
70 – 72.5	3102
72.5 – 75	1777
75 – 77.5	2972
77.5 – 80	1877
80 – 82.5	3343
82.5 – 85	2034
85 – 87.5	2618
87.5 – 90	1791
90 – 92.5	1830
92.5 – 95	879
95 – 97.5	668
97.5 – 100	1795

Fig 5.

originalLanguage · English accounts for ~66% of non-null titles (59.4% of all rows); the remaining 111 labels form a long tail worth segmenting.

Top values for originalLanguage (20 unique shown, of 112 total).
value	count	share of all rows
English	85034	59.4%
Spanish	4786	3.3%
Japanese	3482	2.4%
Hindi	3309	2.3%
French (Canada)	3282	2.3%
Chinese	3166	2.2%
French (France)	2760	1.9%
English (United Kingdom)	2553	1.8%
Italian	2303	1.6%
German	2155	1.5%
Korean	1226	0.9%
Arabic	938	0.7%
Spanish (Spain)	936	0.7%
Tamil	909	0.6%
Russian	898	0.6%
Portuguese (Brazil)	867	0.6%
Telugu	774	0.5%
Malayalam	642	0.4%
Unknown language	528	0.4%
Dutch	482	0.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	text	0.0%
title	text	0.3%
audienceScore	numeric	48.9%
tomatoMeter	numeric	76.4%
rating	categorical	90.2%
ratingContents	text	90.2%
releaseDateTheaters	text	78.5%
releaseDateStreaming	text	44.6%
runtimeMinutes	numeric	9.7%
genre	text	7.7%
originalLanguage	categorical	9.7%
director	text	2.9%
writer	text	37.1%
boxOffice	text	89.7%
distributor	text	83.9%
soundMix	categorical	88.9%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,885 detected strings).
lang	count	share
en	4371	89.5%
es	123	2.5%
de	80	1.6%
fr	72	1.5%
it	67	1.4%
nl	23	0.5%
pt	20	0.4%
sv	16	0.3%
ceb	12	0.2%
tr	12	0.2%
fi	10	0.2%
pl	8	0.2%
id	8	0.2%
eo	6	0.1%
ca	6	0.1%
sl	5	0.1%
ru	5	0.1%
no	5	0.1%
ro	5	0.1%
ms	4	0.1%
hu	3	0.1%
hr	3	0.1%
af	3	0.1%
et	3	0.1%
ja	3	0.1%
cs	2	0.0%
la	2	0.0%
tl	2	0.0%
da	2	0.0%
sr	2	0.0%
lv	1	0.0%
km	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	audienceScore	tomatoMeter	runtimeMinutes
audienceScore	+1.00	-0.04	+0.01
tomatoMeter	-0.04	+1.00	+0.02
runtimeMinutes	+0.01	+0.02	+1.00

id text identifier

Slug-style identifier column: every value is a single token (one_word_rate 1.0, word_mean 1.0) with mean length ~18 chars and 142,052 unique values across 143,258 rows. The 1,206 duplicates (0.84%) are surprising for an id field; top repeats like 'catch_me_if_you_can' and 'hear_no_evil' suggest these are title-derived slugs rather than guaranteed-unique keys. The readability score (−75.5) is meaningless here because the tokens are underscore-joined phrases, not prose.

Treatment: Use as a join key only after deduplicating, since it is not strictly unique.

anthropic:claude-opus-4-7 · confidence high
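
The deduplication step in the treatment above can be sketched as follows. This is a minimal illustration, not part of Saturn: the helper name `dedupe_by_id`, the keep-first policy, and the sample rows are all assumptions. Repeated slugs may belong to different films that share a title, so it is worth inspecting the collisions before dropping anything.

```python
def dedupe_by_id(rows, key="id"):
    """Keep the first row seen for each id; later repeats are dropped."""
    first_seen = {}
    for row in rows:
        # setdefault only stores the row if the id has not been seen yet
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


# Hypothetical rows; only the repeated slug itself comes from the profile.
rows = [
    {"id": "catch_me_if_you_can", "title": "Catch Me If You Can"},
    {"id": "hear_no_evil", "title": "Hear No Evil"},
    {"id": "catch_me_if_you_can", "title": "Catch Me if You Can"},
]
unique_rows = dedupe_by_id(rows)
```

After this step each id maps to exactly one row, which is what a join key needs.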

Out[14]:

saturn.columns["id"].stats

stat	value
n	143,258
nulls	0 (0.0%)
unique	142,052
len_min	1
len_max	178
len_mean	18.15
len_median	16
len_p95	37
word_mean	1
word_median	1
n_empty	0
n_duplicates	1,206
duplicate_rate	0.008418
vocab_size	19,976
readability_flesch_mean	-75.47
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.005787
boilerplate_rate	0
alert: near_unique	99.2% of rows are unique strings
alert: one_word	100.0% rows are a single word

Fig 9.

Character-length distribution for id.

Character-length distribution for id (mean: 18.15210319842522).
chars	count
1 – 5	3503
5 – 10	15645
10 – 14	38487
14 – 19	31119
19 – 23	25119
23 – 28	10857
28 – 32	6439
32 – 36	4867
36 – 41	2424
41 – 45	1975
45 – 50	982
50 – 54	721
54 – 59	386
59 – 63	268
63 – 67	198
67 – 72	101
72 – 76	69
76 – 81	25
81 – 85	35
85 – 90	13
90 – 94	10
94 – 98	10
98 – 103	3
103 – 107	0
107 – 112	1
112 – 116	0
116 – 120	0
120 – 125	0
125 – 129	0
129 – 134	0
134 – 138	0
138 – 143	0
143 – 147	0
147 – 151	0
151 – 156	0
156 – 160	0
160 – 165	0
165 – 169	0
169 – 174	0
174 – 178	1

title text label

Short film titles (mean 17 chars, median 3 words); top values include 'The Return', 'A Christmas Carol', 'Hero', 'Blue'. Predominantly English (3946), but 30 other languages are detected in the sample (31 in total), with Spanish (123), German (80), and French (72) most common. Notable duplication: 16,488 repeats (11.5% duplicate rate) across 126,403 unique values out of 143,258 rows, and 17% are single-word titles.

Treatment: Normalise case and tokenize for embedding; do not treat as a unique key given the 11.5% duplicate rate.

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["title"].stats

stat	value
n	143,258
nulls	367 (0.3%)
unique	126,403
len_min	1
len_max	176
len_mean	17.23
len_median	15
len_p95	37
word_mean	3.074
word_median	3
n_empty	0
n_duplicates	16,488
duplicate_rate	0.1154
vocab_size	17,597
readability_flesch_mean	60.62
emoji_rate	0
url_rate	0
one_word_rate	0.1715
allcaps_rate	0.003989
boilerplate_rate	9.098e-05
alert: multilingual	31 languages detected in sample

Fig 10.

Character-length distribution for title.

Character-length distribution for title (mean: 17.23315674185218).
chars	count
1 – 5	6174
5 – 10	21384
10 – 14	38720
14 – 18	28652
18 – 23	18097
23 – 27	12387
27 – 32	5661
32 – 36	3782
36 – 40	3096
40 – 45	1631
45 – 49	1338
49 – 54	635
54 – 58	436
58 – 62	346
62 – 67	159
67 – 71	137
71 – 75	95
75 – 80	37
80 – 84	49
84 – 88	17
88 – 93	17
93 – 97	17
97 – 102	15
102 – 106	4
106 – 110	2
110 – 115	0
115 – 119	0
119 – 124	1
124 – 128	0
128 – 132	0
132 – 137	0
137 – 141	1
141 – 145	0
145 – 150	0
150 – 154	0
154 – 158	0
158 – 163	0
163 – 167	0
167 – 172	0
172 – 176	1

audienceScore numeric feature

This is an audience rating score on a 0-100 scale with 101 unique integer values, mean 55.67 and median 57. The distribution is wide (std 24.55, IQR 39) and slightly left-skewed (skew -0.23, kurtosis -0.83) with no outliers flagged. The dominant concern is missingness: 48.87% of rows are null, so nearly half the dataset lacks this score.

Treatment: Impute or add a missing-indicator before modelling given the ~49% null rate.

anthropic:claude-opus-4-7 · confidence high
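
One way to read the treatment above is to keep a missing-indicator alongside a median-imputed copy of the score. The sketch below assumes nulls arrive as `None`; the helper name `impute_with_indicator` and the sample values are hypothetical, and median imputation is just one choice among several.

```python
from statistics import median


def impute_with_indicator(values):
    """Return (filled, was_missing); None entries get the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


scores = [80, None, 57, None, 37]  # hypothetical sample, not rows from the file
filled, was_missing = impute_with_indicator(scores)
```

Keeping `was_missing` as its own feature lets a model distinguish a real 57 from an imputed one, which matters when nearly half the column is null.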

Out[20]:

saturn.columns["audienceScore"].stats

stat	value
n	143,258
nulls	70,010 (48.9%)
unique	101
min	0
max	100
mean	55.67
median	57
std	24.55
q1	37
q3	76
iqr	39
skew	-0.2257
kurtosis	-0.8322
n_outliers	0
outlier_rate	0
zero_rate	0.0156
alert: null_rate	48.9% null

Fig 11.

Distribution of audienceScore. Vertical dash marks the median.

Histogram bins for audienceScore (median: 57.0).
bin	count
0 – 2.5	1158
2.5 – 5	84
5 – 7.5	353
7.5 – 10	437
10 – 12.5	1114
12.5 – 15	814
15 – 17.5	1413
17.5 – 20	812
20 – 22.5	2208
22.5 – 25	884
25 – 27.5	1918
27.5 – 30	1406
30 – 32.5	1799
32.5 – 35	2064
35 – 37.5	1875
37.5 – 40	1756
40 – 42.5	3056
42.5 – 45	2022
45 – 47.5	2212
47.5 – 50	1176
50 – 52.5	3593
52.5 – 55	1526
55 – 57.5	3065
57.5 – 60	1583
60 – 62.5	3772
62.5 – 65	1583
65 – 67.5	3187
67.5 – 70	1692
70 – 72.5	3102
72.5 – 75	1777
75 – 77.5	2972
77.5 – 80	1877
80 – 82.5	3343
82.5 – 85	2034
85 – 87.5	2618
87.5 – 90	1791
90 – 92.5	1830
92.5 – 95	879
95 – 97.5	668
97.5 – 100	1795

tomatoMeter numeric feature

This is the Rotten Tomatoes critic score (tomatoMeter), a 0-100 percentage with 101 unique integer values, mean 65.77 and median 73. The distribution is left-skewed (skew -0.65) with Q1 at 45 and Q3 at 89, indicating most rated titles lean favorable. The dominant concern is coverage: 76.35% of rows are null, so the field is only populated for a minority of records.

Treatment: Impute or add a missingness indicator before modelling given the 76% null rate.

anthropic:claude-opus-4-7 · confidence high

Out[23]:

saturn.columns["tomatoMeter"].stats

stat	value
n	143,258
nulls	109,381 (76.4%)
unique	101
min	0
max	100
mean	65.77
median	73
std	28.02
q1	45
q3	89
iqr	44
skew	-0.6467
kurtosis	-0.6657
n_outliers	0
outlier_rate	0
zero_rate	0.02084
alert: null_rate	76.4% null

Fig 12.

Distribution of tomatoMeter. Vertical dash marks the median.

Histogram bins for tomatoMeter (median: 73.0).
bin	count
0 – 2.5	723
2.5 – 5	77
5 – 7.5	201
7.5 – 10	224
10 – 12.5	375
12.5 – 15	458
15 – 17.5	529
17.5 – 20	262
20 – 22.5	859
22.5 – 25	203
25 – 27.5	557
27.5 – 30	433
30 – 32.5	431
32.5 – 35	610
35 – 37.5	436
37.5 – 40	436
40 – 42.5	964
42.5 – 45	651
45 – 47.5	521
47.5 – 50	228
50 – 52.5	1119
52.5 – 55	386
55 – 57.5	964
57.5 – 60	335
60 – 62.5	1163
62.5 – 65	769
65 – 67.5	1293
67.5 – 70	493
70 – 72.5	1215
72.5 – 75	614
75 – 77.5	1166
77.5 – 80	786
80 – 82.5	1951
82.5 – 85	1247
85 – 87.5	1623
87.5 – 90	1488
90 – 92.5	1782
92.5 – 95	1104
95 – 97.5	1133
97.5 – 100	4068

rating categorical feature

This is a content rating field mixing theatrical (R, PG-13, PG, NC-17, G) and television (TVPG, TV14, TVMA, TVY7, TVG) classifications across 10 distinct values. The column is 90.23% null, so only ~9.77% of the 143,258 rows carry a rating, and within those R alone accounts for 55.28% of values. The mixed rating systems and the long tail (TVG, TVY7, G each appearing once) suggest inconsistent sourcing rather than a clean controlled vocabulary.

Treatment: Normalize TV vs MPAA codes into a unified scheme and add an explicit 'missing' category before encoding.

anthropic:claude-opus-4-7 · confidence high
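
A possible first step toward the unified scheme mentioned above is to tag each code with the rating system it belongs to and make missingness explicit. The ten codes come from the profile; the grouping into two sets, the labels 'mpaa', 'tv', 'missing', and 'other', and the function name are assumptions for illustration.

```python
# Codes listed in the profile, grouped by the system they appear to belong to.
MPAA = {"G", "PG", "PG-13", "R", "NC-17"}
TV = {"TVY7", "TVG", "TVPG", "TV14", "TVMA"}


def rating_scheme(code):
    """Map a raw rating code to 'mpaa', 'tv', 'missing', or 'other'."""
    if code is None or code.strip() == "":
        return "missing"
    code = code.strip().upper()
    if code in MPAA:
        return "mpaa"
    if code in TV:
        return "tv"
    return "other"  # anything outside the ten profiled codes
```

Mapping TV codes onto MPAA-style tiers (for example TV14 next to PG-13) would be a further, more opinionated step that this sketch deliberately leaves out.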

Out[26]:

saturn.columns["rating"].stats

stat	value
n	143,258
nulls	129,267 (90.2%)
unique	10
top_value	R
top_rate	0.5528
cardinality	10
entropy	1.71
entropy_ratio	0.5147
alert: null_rate	90.2% null

Fig 13.

Top values for rating.

Top values for rating (10 unique shown, of 10 total).
value	count	share of all rows
R	7734	5.4%
PG-13	3446	2.4%
PG	1911	1.3%
TVPG	424	0.3%
TV14	397	0.3%
TVMA	57	0.0%
NC-17	19	0.0%
TVG	1	0.0%
TVY7	1	0.0%
G	1	0.0%

ratingContents text feature

This column stores content-rating descriptors (e.g. 'Language', 'Violence', 'Some Sexual Content') serialised as Python-style list literals rather than clean arrays. It is 90.23% null; among the 13,991 populated rows, 40.3% (5,638) are duplicates, leaving 8,353 unique values. The language detector flags a handful of entries as non-English (12 it, 5 ro, 1 km), most likely misdetections on short tag strings given how small the vocabulary is (1,188 words). The bracket/quote artefacts in top_words confirm the values were never parsed out of their string representation.

Treatment: Parse the list-literal strings into a multi-hot encoding of rating tags before modelling.

anthropic:claude-opus-4-7 · confidence high
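
The parsing step described above might look like the sketch below, which uses the standard library's `ast.literal_eval` to read the list literals and then builds a multi-hot matrix. The helper names and the sample strings are hypothetical; malformed values are returned as empty lists rather than guessed at.

```python
import ast


def parse_tags(raw):
    """Parse a Python-style list literal such as "['Language', 'Violence']"."""
    if not raw:
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # leave malformed strings empty rather than guess
    if not isinstance(value, (list, tuple)):
        return []
    return [str(tag).strip() for tag in value]


def multi_hot(rows_of_tags):
    """Return (vocabulary, matrix) with one 0/1 column per distinct tag."""
    vocab = sorted({tag for tags in rows_of_tags for tag in tags})
    matrix = [[int(tag in tags) for tag in vocab] for tags in rows_of_tags]
    return vocab, matrix


raw_values = ["['Language', 'Violence']", "['Violence']", None]  # hypothetical
parsed = [parse_tags(v) for v in raw_values]
vocab, matrix = multi_hot(parsed)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for strings that come from a scraped file.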

Out[29]:

saturn.columns["ratingContents"].stats

stat	value
n	143,258
nulls	129,267 (90.2%)
unique	8,353
len_min	5
len_max	148
len_mean	46.16
len_median	44
len_p95	88
word_mean	5.169
word_median	5
n_empty	0
n_duplicates	5,638
duplicate_rate	0.403
vocab_size	1,188
readability_flesch_mean	15.02
emoji_rate	0
url_rate	0
one_word_rate	0.05925
allcaps_rate	0.06168
boilerplate_rate	0
alert: multilingual	5 languages detected in sample
alert: null_rate	90.2% null
alert: duplicates	40.3% duplicate strings

Fig 14.

Character-length distribution for ratingContents.

Character-length distribution for ratingContents (mean: 46.15974555071117).
chars	count
5 – 9	326
9 – 12	791
12 – 16	207
16 – 19	529
19 – 23	281
23 – 26	915
26 – 30	839
30 – 34	592
34 – 37	956
37 – 41	702
41 – 44	910
44 – 48	757
48 – 51	889
51 – 55	819
55 – 59	539
59 – 62	633
62 – 66	493
66 – 69	532
69 – 73	369
73 – 76	398
76 – 80	310
80 – 84	246
84 – 87	215
87 – 91	158
91 – 94	146
94 – 98	96
98 – 102	95
102 – 105	70
105 – 109	45
109 – 112	43
112 – 116	19
116 – 119	26
119 – 123	11
123 – 127	11
127 – 130	5
130 – 134	3
134 – 137	5
137 – 141	6
141 – 144	2
144 – 148	2

releaseDateTheaters text timestamp

This is a theatrical release date stored as an ISO-format string (every value is exactly 10 characters and a single token, e.g. '2018-09-14'). It is sparsely populated (78.52% null), and the non-null values are heavily repeated, with a 60.8% duplicate rate across 12,062 distinct dates. The 'allcaps' alert is a false positive: the strings contain only digits and hyphens, so there are no lowercase letters to find and no actual text content to mine.

Treatment: Parse to date and impute or flag the 78.52% missing before any time-based feature engineering.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["releaseDateTheaters"].stats

stat	value
n	143,258
nulls	112,485 (78.5%)
unique	12,062
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	18,711
duplicate_rate	0.608
vocab_size	9,088
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: null_rate	78.5% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	60.8% duplicate strings

Fig 15.

Character-length distribution for releaseDateTheaters.

Character-length distribution for releaseDateTheaters (mean: 10.0).
chars	count
10	30773

releaseDateStreaming text timestamp

This is a streaming-release date stored as ISO-8601 text (len_median 10, one_word_rate 1.0, all top values match YYYY-MM-DD). Note that len_min is 7, so at least one value is not a full 10-character date and will need handling when parsing. Roughly 44.56% of rows are null and the duplicate_rate is 0.94, with a single date, 2017-05-22, appearing 1,232 times, which points to heavy clustering on a few release days. The text-style alerts (allcaps, one_word, short_text) are artifacts of the date format, not a quality issue.

Treatment: Parse to date dtype and treat missingness explicitly before any temporal feature engineering.

anthropic:claude-opus-4-7 · confidence high
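
A minimal sketch of the date parsing suggested above, using only the standard library. It assumes nulls arrive as `None` and treats anything that is not a strict 10-character YYYY-MM-DD string as missing, which covers the one short value implied by len_min 7. The function name and the sample inputs are hypothetical.

```python
from datetime import date


def parse_iso_date(raw):
    """Return a date for strict YYYY-MM-DD text, else None."""
    if raw is None or len(raw) != 10:
        return None
    try:
        return date.fromisoformat(raw)
    except ValueError:
        return None  # right length but not a valid calendar date


parsed = parse_iso_date("2017-05-22")   # most frequent value in the profile
too_short = parse_iso_date("2017-05")   # hypothetical 7-character value
```

Returning `None` instead of raising keeps malformed and missing values in the same bucket, so they can be flagged together before any temporal feature engineering.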

Out[35]:

saturn.columns["releaseDateStreaming"].stats

stat	value
n	143,258
nulls	63,838 (44.6%)
unique	4,726
len_min	7
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	74,694
duplicate_rate	0.9405
vocab_size	3,514
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: null_rate	44.6% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	94.0% duplicate strings

Fig 16.

Character-length distribution for releaseDateStreaming.

Character-length distribution for releaseDateStreaming (mean: 9.99996222613951).
chars	count
7	1
10	79419

runtimeMinutes numeric feature

Movie or episode runtime in minutes, with a typical feature-length distribution (median 92, IQR 84–103). The tail is extreme: max 2,700, skew 7.62, kurtosis 598.65, and 11.37% of non-null values (14,720) flagged as outliers. Assuming the usual 1.5 × IQR fences, that means anything below 55.5 or above 131.5 minutes, which suggests a mix of shorts, multi-part specials, or full series totals alongside standard films. Roughly 9.65% of rows are null.

Treatment: Cap or log-transform before modelling and impute the ~10% nulls.

anthropic:claude-opus-4-7 · confidence high
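
The cap and log-transform options can be illustrated with the quartiles from the stats table. The sketch assumes the outlier alert uses the standard Tukey fences (1.5 × IQR beyond each quartile), which the profile implies but does not spell out; the helper names are hypothetical.

```python
import math


def iqr_fences(q1, q3, k=1.5):
    """Tukey fences: values outside [q1 - k*IQR, q3 + k*IQR] count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(value, low, high):
    """Clamp a value into [low, high]."""
    return max(low, min(high, value))


low, high = iqr_fences(84, 103)  # q1 and q3 from the stats table
runtimes = (1, 92, 2700)         # min, median, and max from the stats table
capped = [cap(v, low, high) for v in runtimes]
logged = [math.log1p(v) for v in runtimes]
```

Capping at both fences also pulls genuine short films up to 55.5 minutes, so capping only the upper tail, or using the log transform instead, may be the gentler option.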

Out[38]:

saturn.columns["runtimeMinutes"].stats

stat	value
n	143,258
nulls	13,827 (9.7%)
unique	324
min	1
max	2,700
mean	93.71
median	92
std	28.13
q1	84
q3	103
iqr	19
skew	7.623
kurtosis	598.7
n_outliers	14,720
outlier_rate	0.1137
zero_rate	0
alert: high_skew	skew=+7.62
alert: outliers	11.4% rows beyond 1.5 IQR

Fig 17.

Distribution of runtimeMinutes. Vertical dash marks the median.

Histogram bins for runtimeMinutes (median: 92.0).
bin	count
1 – 68.47	12329
68.47 – 135.9	110286
135.9 – 203.4	6482
203.4 – 270.9	254
270.9 – 338.4	37
338.4 – 405.8	19
405.8 – 473.3	9
473.3 – 540.8	7
540.8 – 608.3	3
608.3 – 675.8	1
675.8 – 743.2	0
743.2 – 810.7	1
810.7 – 878.2	0
878.2 – 945.6	1
945.6 – 1013	1
1013 – 1081	0
1081 – 1148	0
1148 – 1216	0
1216 – 1283	0
1283 – 1350	0
1350 – 1418	0
1418 – 1485	0
1485 – 1553	0
1553 – 1620	0
1620 – 1688	0
1688 – 1755	0
1755 – 1823	0
1823 – 1890	0
1890 – 1958	0
1958 – 2025	0
2025 – 2093	0
2093 – 2160	0
2160 – 2228	0
2228 – 2295	0
2295 – 2363	0
2363 – 2430	0
2430 – 2498	0
2498 – 2565	0
2565 – 2633	0
2633 – 2700	1

genre text label

This is a categorical genre label for films, often a single word like 'Drama' (27,860 rows) or 'Documentary' (15,162) but sometimes a comma-separated combo such as 'Comedy, Drama'. With only 66 distinct vocabulary tokens but 2,912 unique strings and a 97.8% duplicate rate, the cardinality comes largely from how genres are concatenated. Note the 7.74% null rate, and that 55.3% of values are a single word, so at most about 45% of populated rows can be multi-genre combinations.

Treatment: Split on comma and one-hot encode into a small set of genre flags.

anthropic:claude-opus-4-7 · confidence high
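
The split-and-encode treatment above can be sketched in a few lines. The comma delimiter follows the 'Comedy, Drama' example in the profile; the helper names and the three sample values are hypothetical.

```python
def split_genres(raw):
    """Split a combined label such as 'Comedy, Drama' into its parts."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


def genre_flags(raw_values):
    """Return (genres, rows) where each row is a dict of 0/1 flags per genre."""
    parsed = [split_genres(v) for v in raw_values]
    genres = sorted({g for parts in parsed for g in parts})
    rows = [{g: int(g in parts) for g in genres} for parts in parsed]
    return genres, rows


genres, flags = genre_flags(["Drama", "Comedy, Drama", None])
```

Null genres come out as all-zero rows here; adding a separate missing flag would keep them distinguishable from titles that simply match none of the genres.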

Out[41]:

saturn.columns["genre"].stats

stat	value
n	143,258
nulls	11,083 (7.7%)
unique	2,912
len_min	3
len_max	101
len_mean	12.91
len_median	11
len_p95	32
word_mean	1.874
word_median	1
n_empty	0
n_duplicates	129,263
duplicate_rate	0.978
vocab_size	66
readability_flesch_mean	-19.66
emoji_rate	0
url_rate	0
one_word_rate	0.5533
allcaps_rate	0
boilerplate_rate	0
alert: one_word	55.3% rows are a single word
alert: duplicates	97.8% duplicate strings

Fig 18.

Character-length distribution for genre.

Character-length distribution for genre (mean: 12.906979383393228).
chars	count
3 – 5	28416
5 – 8	26877
8 – 10	3141
10 – 13	18414
13 – 15	18553
15 – 18	3196
18 – 20	10562
20 – 23	3727
23 – 25	4991
25 – 28	5352
28 – 30	854
30 – 32	2468
32 – 35	2166
35 – 37	1055
37 – 40	240
40 – 42	906
42 – 45	639
45 – 47	124
47 – 50	89
50 – 52	123
52 – 54	164
54 – 57	11
57 – 59	30
59 – 62	42
62 – 64	18
64 – 67	4
67 – 69	8
69 – 72	1
72 – 74	2
74 – 76	0
76 – 79	0
79 – 81	1
81 – 84	0
84 – 86	0
86 – 89	0
89 – 91	0
91 – 94	0
94 – 96	0
96 – 99	0
99 – 101	1

originalLanguage categorical feature

Categorical language label with 112 distinct values, dominated by English at 65.7% of non-null rows (85,034 of 129,400), which is 59.4% of all 143,258 rows. The long tail spans regional variants (e.g., 'French (Canada)' vs 'French (France)', 'English (United Kingdom)', 'Spanish (Spain)') alongside bare language names like 'English' and 'Spanish', suggesting inconsistent locale tagging that will fragment counts. Null rate is 9.67%, and entropy ratio of 0.38 confirms heavy concentration in a few categories.

Treatment: Normalize locale variants to base language codes, then one-hot encode the top categories and bucket the rest as 'Other'.

anthropic:claude-opus-4-7 · confidence high
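
A rough sketch of the normalisation above: strip the trailing locale qualifier, then keep the most common base languages and bucket the rest. It works on language names rather than ISO codes, which is a simplification; the regular expression, helper names, and the choice of k are assumptions.

```python
import re
from collections import Counter

# Matches a trailing parenthetical such as " (Canada)".
_LOCALE = re.compile(r"\s*\([^)]*\)\s*$")


def base_language(label):
    """Drop a trailing locale qualifier: 'French (Canada)' -> 'French'."""
    return _LOCALE.sub("", label).strip()


def bucket_top(labels, k):
    """Keep the k most common base languages; map everything else to 'Other'."""
    bases = [base_language(x) for x in labels]
    keep = {name for name, _ in Counter(bases).most_common(k)}
    return [b if b in keep else "Other" for b in bases]


# Labels taken from the top-values table; their real counts are ignored here.
labels = ["English", "English (United Kingdom)", "French (Canada)",
          "French (France)", "Tamil"]
bucketed = bucket_top(labels, k=2)
```

Labels such as 'Unknown language' pass through unchanged, so they would need to be mapped to null separately.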

Out[44]:

saturn.columns["originalLanguage"].stats

stat	value
n	143,258
nulls	13,858 (9.7%)
unique	112
top_value	English
top_rate	0.6571
cardinality	112
entropy	2.605
entropy_ratio	0.3827

Fig 19.

Top values for originalLanguage.

Top values for originalLanguage (20 unique shown, of 112 total).
value	count	share of all rows
English	85034	59.4%
Spanish	4786	3.3%
Japanese	3482	2.4%
Hindi	3309	2.3%
French (Canada)	3282	2.3%
Chinese	3166	2.2%
French (France)	2760	1.9%
English (United Kingdom)	2553	1.8%
Italian	2303	1.6%
German	2155	1.5%
Korean	1226	0.9%
Arabic	938	0.7%
Spanish (Spain)	936	0.7%
Tamil	909	0.6%
Russian	898	0.6%
Portuguese (Brazil)	867	0.6%
Telugu	774	0.5%
Malayalam	642	0.4%
Unknown language	528	0.4%
Dutch	482	0.3%

director text feature

Holds a film director's name, averaging 2.2 words and 14.8 characters with 62,207 unique values across 143,258 rows. The duplicate rate is 55.3% (76,857 of the 139,064 populated rows), inflated by an 'Unknown Director' sentinel that occurs 3,544 times and should not be treated as a real name. Null rate is 2.93%, and the most frequent real names (David DeCoteau at 129, Sam Newfield at 124) reflect prolific B-movie directors rather than data quality issues.

Treatment: Replace 'Unknown Director' with null and use as a high-cardinality categorical (target/frequency encode).

anthropic:claude-opus-4-7 · confidence high
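
The sentinel replacement and frequency encoding mentioned above could be sketched like this. The sentinel string comes from the profile; the function name, the choice of relative frequency over raw counts, and the sample list are assumptions.

```python
from collections import Counter

SENTINEL = "Unknown Director"  # sentinel string reported in the profile


def frequency_encode(names):
    """Return per-row relative frequency; sentinel and None rows become None."""
    cleaned = [None if n in (None, SENTINEL) else n for n in names]
    counts = Counter(n for n in cleaned if n is not None)
    total = sum(counts.values())
    return [None if n is None else counts[n] / total for n in cleaned]


# Hypothetical sample built from names that appear in the profile.
names = ["Sam Newfield", "Unknown Director", "Sam Newfield",
         "David DeCoteau", None]
encoded = frequency_encode(names)
```

Treating the sentinel as null before counting keeps it from becoming the single most "frequent director" in the encoded feature.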

Out[47]:

saturn.columns["director"].stats

stat	value
n	143,258
nulls	4,194 (2.9%)
unique	62,207
len_min	1
len_max	326
len_mean	14.81
len_median	14
len_p95	26
word_mean	2.213
word_median	2
n_empty	0
n_duplicates	76,857
duplicate_rate	0.5527
vocab_size	16,693
readability_flesch_mean	47.17
emoji_rate	0
url_rate	0
one_word_rate	0.007356
allcaps_rate	7.191e-05
boilerplate_rate	0
alert: duplicates	55.3% duplicate strings

Fig 20.

Character-length distribution for director.

Character-length distribution for director (mean: 14.80681556693321).
chars	count
1 – 9	7477
9 – 17	109547
17 – 25	14942
25 – 34	5028
34 – 42	1160
42 – 50	412
50 – 58	164
58 – 66	87
66 – 74	69
74 – 82	41
82 – 90	28
90 – 98	24
98 – 107	19
107 – 115	17
115 – 123	3
123 – 131	3
131 – 139	9
139 – 147	13
147 – 155	1
155 – 164	6
164 – 172	4
172 – 180	2
180 – 188	0
188 – 196	1
196 – 204	0
204 – 212	2
212 – 220	0
220 – 228	0
228 – 237	0
237 – 245	1
245 – 253	0
253 – 261	0
261 – 269	1
269 – 277	0
277 – 285	0
285 – 294	0
294 – 302	2
302 – 310	0
310 – 318	0
318 – 326	1

writer text feature

Holds writer credits, typically one or two personal names averaging 2.7 words and 21 characters, with familiar figures like Jing Wong, Woody Allen, and Ingmar Bergman topping the list. Coverage is weak: 37.1% of rows are null, and 25.3% of the populated values are duplicates across 67,274 unique strings. The long right tail of string lengths (p95 47, max 262 characters) suggests that a single cell often concatenates multiple co-writers. Top tokens (michael, david, john) indicate that Western personal names dominate, though 'de' hints at multi-part names or non-English credits mixed in.

Treatment: Split on delimiters into individual writers and explode for any per-person analysis; impute or flag the 37% missing before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[50]:

saturn.columns["writer"].stats

stat	value
n	143,258
nulls	53,142 (37.1%)
unique	67,274
len_min	1
len_max	262
len_mean	21.34
len_median	16
len_p95	47
word_mean	2.711
word_median	2
n_empty	0
n_duplicates	22,842
duplicate_rate	0.2535
vocab_size	26,458
readability_flesch_mean	37.79
emoji_rate	0
url_rate	0
one_word_rate	0.005349
allcaps_rate	6.658e-05
boilerplate_rate	0
alert: null_rate	37.1% null
alert: duplicates	25.3% duplicate strings

Fig 21.

Character-length distribution for writer.

Character-length distribution for writer (mean: 21.344955390829597).
chars	count
1 – 8	565
8 – 14	37626
14 – 21	17271
21 – 27	12261
27 – 34	9944
34 – 40	4841
40 – 47	3059
47 – 53	1894
53 – 60	937
60 – 66	718
66 – 73	355
73 – 79	234
79 – 86	120
86 – 92	79
92 – 99	60
99 – 105	46
105 – 112	28
112 – 118	25
118 – 125	9
125 – 132	13
132 – 138	9
138 – 145	12
145 – 151	5
151 – 158	1
158 – 164	1
164 – 171	0
171 – 177	1
177 – 184	0
184 – 190	1
190 – 197	0
197 – 203	0
203 – 210	0
210 – 216	0
216 – 223	0
223 – 229	0
229 – 236	0
236 – 242	0
242 – 249	0
249 – 255	0
255 – 262	1

boxOffice text feature

Box office gross stored as a short currency string like "$1.1M": every value is one token, 99.99% are flagged allcaps, and lengths range from 2 to 7 characters (mostly 5 to 7). The column is 89.71% null, and only 4,863 distinct values cover the 14,743 populated rows, a 67.01% duplicate rate of the kind that rounded, abbreviated amounts would produce. Note this is a coarse, pre-formatted string, not a precise revenue number.

Treatment: Parse the "$X.XM" string into a numeric dollar amount and decide whether to impute or drop given the 89.71% null rate.

anthropic:claude-opus-4-7 · confidence high
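
Parsing the currency string might look like the sketch below. Only the "$1.1M" shape is shown in the profile; the "K" and "B" suffixes, the bare-number case, and the function name are assumptions that should be checked against the actual values. Anything that does not match is returned as `None` rather than guessed.

```python
import re

# Assumed suffixes: only "M" is confirmed by the profile's example.
_MULTIPLIER = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_PATTERN = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")


def parse_box_office(raw):
    """Convert a string like '$1.1M' to dollars; None for nulls or unknown shapes."""
    if raw is None:
        return None
    match = _PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _MULTIPLIER[suffix]
```

Because the source strings are already rounded, the parsed amounts carry only two or three significant digits, whatever numeric type is used to hold them.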

Out[53]:

saturn.columns["boxOffice"].stats

stat	value
n	143,258
nulls	128,515 (89.7%)
unique	4,863
len_min	2
len_max	7
len_mean	5.977
len_median	6
len_p95	7
word_mean	1
word_median	1
n_empty	0
n_duplicates	9,880
duplicate_rate	0.6701
vocab_size	4,863
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9999
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: null_rate	89.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	67.0% duplicate strings

Fig 22.

Character-length distribution for boxOffice.

Character-length distribution for boxOffice (mean: 5.9768703791629925).
chars	count
2	2
3	6
4	107
5	4011
6	6707
7	3910

distributor text feature

This column lists film distributor names, dominated by major studios like Paramount Pictures (994), 20th Century Fox (745), and Universal Pictures (737). It is overwhelmingly sparse, with an 83.94% null rate. Among the 23,005 populated rows the duplicate rate is also 83.94% (19,311 repeats across 3,694 unique values); the two figures match by coincidence, and the second one means a small set of studios accounts for most populated entries. Names are short (mean length 19.9 chars, median 2 words) and vocabulary is concentrated around terms like 'pictures', 'films', and 'entertainment'.

Treatment: Normalize studio name variants and treat as a high-cardinality categorical with an explicit 'missing' bucket given the 83.94% null rate.

anthropic:claude-opus-4-7 · confidence high

Out[56]:

saturn.columns["distributor"].stats

stat	value
n	143,258
nulls	120,253 (83.9%)
unique	3,694
len_min	3
len_max	385
len_mean	19.89
len_median	17
len_p95	43
word_mean	2.649
word_median	2
n_empty	0
n_duplicates	19,311
duplicate_rate	0.8394
vocab_size	2,650
readability_flesch_mean	16.85
emoji_rate	0
url_rate	4.347e-05
one_word_rate	0.09376
allcaps_rate	0.01474
boilerplate_rate	0
alert: null_rate	83.9% null
alert: duplicates	83.9% duplicate strings

Fig 23.

Character-length distribution for distributor.

Character-length distribution for distributor (mean: 19.88802434253423).
chars	count
3 – 13	4694
13 – 22	14064
22 – 32	2077
32 – 41	968
41 – 51	437
51 – 60	254
60 – 70	146
70 – 79	125
79 – 89	58
89 – 98	50
98 – 108	39
108 – 118	22
118 – 127	17
127 – 137	10
137 – 146	7
146 – 156	8
156 – 165	6
165 – 175	3
175 – 184	3
184 – 194	4
194 – 204	2
204 – 213	5
213 – 223	0
223 – 232	2
232 – 242	0
242 – 251	0
251 – 261	0
261 – 270	1
270 – 280	1
280 – 290	0
290 – 299	0
299 – 309	1
309 – 318	0
318 – 328	0
328 – 337	0
337 – 347	0
347 – 356	0
356 – 366	0
366 – 375	0
375 – 385	1

soundMix categorical feature

Catalogues the audio mix format of each title (Surround, Dolby Digital, Stereo, Mono, etc.), with 551 distinct labels across 143,258 rows. The dominant issue is sparsity: 88.89% of values are null, and even among populated rows 'Surround' covers only 25.6%. Free-form combinations like 'Stereo, Surround' vs 'Surround, Stereo' and overlapping Dolby variants suggest the field is unnormalised multi-label text rather than a clean taxonomy.

Treatment: Split on commas, normalise Dolby/Surround variants, and treat as multi-hot; consider dropping if downstream task can't tolerate 89% missingness.

anthropic:claude-opus-4-7 · confidence high
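
The ordering problem noted above ('Stereo, Surround' vs 'Surround, Stereo') disappears once each value is split on commas and compared as a set. The sketch below shows only that step; it does not merge overlapping Dolby variants, and the function name is hypothetical.

```python
def mix_key(raw):
    """Order-insensitive key: 'Surround, Stereo' and 'Stereo, Surround' match."""
    if not raw:
        return frozenset()
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


same = mix_key("Stereo, Surround") == mix_key("Surround, Stereo")
```

The resulting sets can feed a multi-hot encoding directly, with one flag per distinct format label.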

Out[59]:

saturn.columns["soundMix"].stats

stat	value
n	143,258
nulls	127,341 (88.9%)
unique	551
top_value	Surround
top_rate	0.256
cardinality	551
entropy	4.663
entropy_ratio	0.5121
alert: null_rate	88.9% null

Fig 24.

Top values for soundMix.

Top values for soundMix (20 unique shown, of 551 total).
value	count	share of all rows
Surround	4075	2.8%
Dolby Digital	2375	1.7%
Stereo	2082	1.5%
Mono	1246	0.9%
Stereo, Surround	473	0.3%
Surround, Stereo	451	0.3%
Dolby	411	0.3%
Dolby SRD, DTS, SDDS	253	0.2%
Dolby Atmos	241	0.2%
Dolby SR	198	0.1%
Dolby SR, DTS, Dolby Stereo, Surround, SDDS, Dolby A, Dolby Digital	192	0.1%
Dolby Stereo, Dolby Digital, Dolby A, Surround, Dolby SR	167	0.1%
Surround, Dolby Digital	133	0.1%
Dolby, Surround	119	0.1%
SDDS, Dolby Digital, DTS	118	0.1%
Surround, Dolby SRD, DTS, SDDS	118	0.1%
Dolby SRD	107	0.1%
Surround, Dolby SR, Dolby Digital, Dolby A, Dolby Stereo	101	0.1%
Dolby Atmos, Dolby Digital	93	0.1%
Datasat, Dolby Digital	84	0.1%
