ml 32m movies

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv

Saturn profiled 87,585 rows across 3 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact; the model never sees raw rows).

[2]:

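# Notebook cell: install the profiler, then run it on the source CSV.
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv",
    "--findings", "ml-32m-movies.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the command exits non-zero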

Summary confidence: high

This dataset is a movie catalogue of 87,585 rows with three columns: a unique movieId, a title, and a pipe-delimited genres string. The genres column is the most analytically interesting: only 1,798 unique combinations exist, Drama, Documentary, and Comedy are the most frequent values, and 7,080 rows (about 8.1%) are tagged '(no genres listed)', a sizeable gap worth flagging. Titles are nearly unique (87,382 distinct of 87,585), and the frequent '(2014)'–'(2019)' tokens in titles suggest the catalogue skews toward recent years. movieId spans 1 to 292,757, so the 87,585 IDs fill only about 30% of that range: the identifier space is sparse rather than a clean sequence. Start with the genre distribution and the missing-genre share before any deeper modelling.

citing: row_count · columns.genres.n_unique · columns.genres.top_values · columns.genres.stats.one_word_rate · columns.title.n_unique · columns.title.top_words · columns.title.stats.len_mean · columns.movieId.stats.min · columns.movieId.stats.max · columns.movieId.stats.median
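
The cited keys read as dotted paths into the findings file. As a minimal sketch, assuming ml-32m-movies.json is a nested JSON object whose keys mirror those paths (the report does not show the file's layout), a small helper can look them up. The `resolve` function and the miniature `findings` dict are hypothetical stand-ins; the values are copied from this report.

```python
from functools import reduce


def resolve(findings, dotted_path):
    """Walk a nested dict along a dotted path such as 'columns.genres.n_unique'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)


# Hypothetical miniature of the findings file; the real layout may differ.
findings = {
    "row_count": 87585,
    "columns": {
        "genres": {"n_unique": 1798, "stats": {"one_word_rate": 0.9192}},
        "title": {"n_unique": 87382},
    },
}

for path in ("row_count", "columns.genres.n_unique", "columns.genres.stats.one_word_rate"):
    print(path, "=", resolve(findings, path))
```

With the real file, the dict would come from `json.load` instead of the inline literal, provided the layout assumption holds.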

Out[4]:

saturn.schema() · 3 columns

column	kind	n	null%	unique	alerts
movieId	numeric	87,585	0.0%	87,585
title	text	87,585	0.0%	87,382	near_unique
genres	text	87,585	0.0%	1,798	one_word duplicates
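
The three schema figures (n, null %, unique) can be recomputed without the profiler. The sketch below uses only the standard library and a tiny made-up sample in the movies.csv layout; the `schema` function is illustrative and is not saturn's implementation.

```python
import csv
import io

# Made-up rows in the movieId,title,genres layout; not taken from the real file.
SAMPLE = """movieId,title,genres
1,Example One (2018),Comedy|Drama
2,Example Two (2016),Drama
3,Example One (2018),Comedy|Drama
4,Example Three (2019),(no genres listed)
"""


def schema(csv_text):
    """Return {column: {"n", "null_pct", "unique"}} for a CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        nulls = sum(1 for v in values if v == "")  # treat empty strings as null
        summary[column] = {
            "n": len(values),
            "null_pct": 100.0 * nulls / len(values),
            "unique": len(set(values)),
        }
    return summary


for column, stats in schema(SAMPLE).items():
    print(column, stats)
```

On the sample, movieId is fully unique while title and genres each repeat once, the same pattern the table shows at full scale.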

Fig 1.

genres · See how Drama, Documentary, and Comedy dominate, and note the large '(no genres listed)' bucket.

Character-length distribution for genres (mean: 13.24).
chars	count
3 – 5	125
5 – 7	24325
7 – 9	3474
9 – 10	2417
10 – 12	14431
12 – 14	11262
14 – 16	3313
16 – 18	2692
18 – 20	9626
20 – 22	5612
22 – 23	3690
23 – 25	1617
25 – 27	1066
27 – 29	758
29 – 31	1008
31 – 33	640
33 – 34	366
34 – 36	430
36 – 38	193
38 – 40	76
40 – 42	135
42 – 44	159
44 – 46	42
46 – 47	45
47 – 49	25
49 – 51	30
51 – 53	5
53 – 55	9
55 – 57	6
57 – 58	3
58 – 60	3
60 – 62	0
62 – 64	0
64 – 66	1
66 – 68	0
68 – 70	0
70 – 71	0
71 – 73	0
73 – 75	0
75 – 77	1

Fig 2.

genres · Share of the top genre combinations versus the long tail of 1,798 unique strings.

Fig 3.

title · Title character lengths centre near the median of 23, with a long tail out to 191.

Character-length distribution for title (mean: 25.28).
chars	count
2 – 7	44
7 – 11	1905
11 – 16	14749
16 – 21	17609
21 – 26	20947
26 – 30	12610
30 – 35	7127
35 – 40	3670
40 – 45	3134
45 – 49	2086
49 – 54	1102
54 – 59	932
59 – 63	548
63 – 68	381
68 – 73	213
73 – 78	180
78 – 82	123
82 – 87	81
87 – 92	41
92 – 96	33
96 – 101	23
101 – 106	13
106 – 111	11
111 – 115	8
115 – 120	3
120 – 125	1
125 – 130	2
130 – 134	4
134 – 139	0
139 – 144	2
144 – 148	1
148 – 153	0
153 – 158	0
158 – 163	1
163 – 167	0
167 – 172	0
172 – 177	0
177 – 182	0
182 – 186	0
186 – 191	1

Fig 4.

movieId · Check how movieIds spread from 1 to 292,757 to gauge sparsity in the identifier range.

Histogram bins for movieId (median: 165,741).
bin	count
1 – 7320	7196
7320 – 1.464e+04	1111
1.464e+04 – 2.196e+04	0
2.196e+04 – 2.928e+04	1114
2.928e+04 – 3.66e+04	796
3.66e+04 – 4.391e+04	435
4.391e+04 – 5.123e+04	754
5.123e+04 – 5.855e+04	816
5.855e+04 – 6.587e+04	789
6.587e+04 – 7.319e+04	1132
7.319e+04 – 8.051e+04	1107
8.051e+04 – 8.783e+04	1398
8.783e+04 – 9.515e+04	1556
9.515e+04 – 1.025e+05	1541
1.025e+05 – 1.098e+05	1535
1.098e+05 – 1.171e+05	1847
1.171e+05 – 1.244e+05	2673
1.244e+05 – 1.317e+05	2763
1.317e+05 – 1.391e+05	3280
1.391e+05 – 1.464e+05	3282
1.464e+05 – 1.537e+05	3146
1.537e+05 – 1.61e+05	3258
1.61e+05 – 1.683e+05	3464
1.683e+05 – 1.757e+05	3492
1.757e+05 – 1.83e+05	3484
1.83e+05 – 1.903e+05	3486
1.903e+05 – 1.976e+05	3401
1.976e+05 – 2.049e+05	3369
2.049e+05 – 2.122e+05	3050
2.122e+05 – 2.196e+05	2907
2.196e+05 – 2.269e+05	2325
2.269e+05 – 2.342e+05	1839
2.342e+05 – 2.415e+05	1672
2.415e+05 – 2.488e+05	1531
2.488e+05 – 2.562e+05	1566
2.562e+05 – 2.635e+05	1679
2.635e+05 – 2.708e+05	1911
2.708e+05 – 2.781e+05	2168
2.781e+05 – 2.854e+05	2580
2.854e+05 – 2.928e+05	2132

Fig 5.

Per-column null rate across the corpus. Columns are ordered by input position.

column	kind	null %
movieId	numeric	0.0%
title	text	0.0%
genres	text	0.0%

movieId numeric identifier

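movieId is fully unique across all 87,585 rows with zero nulls, spanning 1 to 292,757: classic surrogate-key behaviour rather than a measurable quantity. Because 87,585 IDs cover a range of 292,757 possible values, only about 30% of the range is used, so the IDs are non-contiguous; the empty histogram bin between roughly 14,640 and 21,960 is one visible gap. The mean (157,651) and median (165,741) describe where IDs were allocated, not a property of the movies, so there is no statistical signal to extract from them.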

Treatment: use as a join key; exclude from modelling features.
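
The sparsity of the key can be put into a single number. A minimal sketch, using only the count, minimum, and maximum reported here; `id_density` is an illustrative helper, not part of saturn.

```python
def id_density(n_ids, id_min, id_max):
    """Fraction of the inclusive integer range [id_min, id_max] that is occupied."""
    span = id_max - id_min + 1
    return n_ids / span


# Values copied from this report: 87,585 unique IDs between 1 and 292,757.
density = id_density(87_585, 1, 292_757)
print(f"{density:.1%} of the ID range is occupied")
```

A density well below 100% is expected for a surrogate key and is harmless for joins; it only rules out treating movieId as a row index.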

anthropic:claude-opus-4-7 · confidence high

Out[11]:

saturn.columns["movieId"].stats

stat	value
n	87,585
nulls	0 (0.0%)
unique	87,585
min	1
max	292,757
mean	1.577e+05
median	165,741
std	7.901e+04
q1	112,657
q3	213,203
iqr	100,546
skew	-0.3914
kurtosis	-0.5775
n_outliers	0
outlier_rate	0
zero_rate	0
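
The zero outlier count can be checked by hand from the quartiles. The sketch below assumes the common 1.5 × IQR (Tukey) fence rule; the report does not say which rule saturn applies, so treat this as a plausibility check rather than a reproduction of its method.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Values copied from the stats table above.
q1, q3 = 112_657, 213_203
col_min, col_max = 1, 292_757

lower, upper = tukey_fences(q1, q3)
has_outliers = col_min < lower or col_max > upper
print(f"fences: [{lower}, {upper}]  any outliers: {has_outliers}")
```

Both the minimum and the maximum fall inside the fences, which agrees with n_outliers = 0 under this assumed rule.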

Fig 6.

Distribution of movieId. Vertical dash marks the median.

title text free_text

Short free-text titles, averaging 4.2 words and 25 characters, with 87,382 unique values across 87,585 rows. The frequent year tokens such as (2018), (2016), (2017), and (2019) are consistent with film titles that have the release year appended. The column is near-unique, with only 203 duplicates (0.23% of rows), and has a high mean Flesch readability of 81.2; emoji and URL rates are effectively zero.

Treatment: tokenize and embed (or strip the trailing year into a separate feature) before modelling; do not use as a key.
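
Stripping the trailing year could look like the following. This is a sketch that assumes the year, when present, is a four-digit number in parentheses at the very end of the title; `split_title` and the sample titles are made up for illustration.

```python
import re

# Lazy name, optional whitespace, then "(YYYY)" anchored at the end of the string.
YEAR_SUFFIX = re.compile(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title(title):
    """Return (name, year); year is None when no trailing '(YYYY)' is found."""
    match = YEAR_SUFFIX.match(title)
    if match is None:
        return title.strip(), None
    return match.group("name"), int(match.group("year"))


for raw in ("Example Film (2018)", "Example (Alternate Name) (2016)", "No Year Here"):
    print(split_title(raw))
```

Titles without a recognisable year keep their full text and get `None`, so they can be counted and inspected rather than silently dropped.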

anthropic:claude-opus-4-7 · confidence high

Out[14]:

saturn.columns["title"].stats

stat	value
n	87,585
nulls	0 (0.0%)
unique	87,382
len_min	2
len_max	191
len_mean	25.28
len_median	23
len_p95	48
word_mean	4.226
word_median	4
n_empty	0
n_duplicates	203
duplicate_rate	0.002318
vocab_size	19,981
readability_flesch_mean	81.2
emoji_rate	3.425e-05
url_rate	0
one_word_rate	0.001279
allcaps_rate	0.005526
boilerplate_rate	9.134e-05
alert: near_unique	99.8% of rows are unique strings

Fig 7.

Character-length distribution for title.

genres text feature

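Pipe-delimited movie genre tags, with 1,798 distinct combinations across 87,585 rows and a 97.9% duplicate rate. Drama (12,443), Documentary (8,132), and Comedy (7,761) dominate, while 7,080 rows carry the literal placeholder '(no genres listed)' that should be treated as missing. The 91.9% one-word rate (word_mean 1.16) does not show that multi-genre entries are the minority: a pipe-joined value such as 'Comedy|Drama' contains no spaces, so it appears to be counted as a single word. The remaining 8.1% matches the share of '(no genres listed)' rows (7,080 of 87,585), which is consistent with the placeholder being the only value that contains spaces. To count genres per row, split on '|' instead.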

Treatment: split on '|' and one-hot or multi-hot encode; recode '(no genres listed)' as null.
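
The treatment can be sketched in a few lines of standard-library Python. The helper names `parse_genres` and `multi_hot` are illustrative, the sample values are made up, and representing a missing row as `None` is one possible choice among several.

```python
MISSING = "(no genres listed)"


def parse_genres(raw):
    """Split a pipe-delimited genres string; return None for the placeholder."""
    if raw == MISSING:
        return None
    return raw.split("|")


def multi_hot(values):
    """Return (vocabulary, matrix); a missing row is None instead of a 0/1 vector."""
    parsed = [parse_genres(v) for v in values]
    vocab = sorted({genre for tags in parsed if tags is not None for genre in tags})
    matrix = [
        None if tags is None else [int(genre in tags) for genre in vocab]
        for tags in parsed
    ]
    return vocab, matrix


vocab, matrix = multi_hot(["Comedy|Drama", "Drama", "(no genres listed)"])
print(vocab)
for row in matrix:
    print(row)
```

Keeping missing rows as `None` rather than an all-zero vector preserves the distinction between "no genre recorded" and "none of these genres". In pandas, `Series.str.get_dummies(sep="|")` produces the indicator columns directly once the placeholder has been recoded.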

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["genres"].stats

stat	value
n	87,585
nulls	0 (0.0%)
unique	1,798
len_min	3
len_max	77
len_mean	13.24
len_median	12
len_p95	27
word_mean	1.162
word_median	1
n_empty	0
n_duplicates	85,787
duplicate_rate	0.9795
vocab_size	909
readability_flesch_mean	-117.7
emoji_rate	0
url_rate	0
one_word_rate	0.9192
allcaps_rate	1.142e-05
boilerplate_rate	0
alert: one_word	91.9% rows are a single word
alert: duplicates	97.9% duplicate strings
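
How the one_word alert should be read depends on how words are counted. The sketch below assumes whitespace tokenization, which is a guess about saturn's method rather than documented behaviour; under that assumption a pipe-joined value counts as one word and only the placeholder counts as several. The sample values are made up.

```python
def one_word_rate(values):
    """Share of values that contain exactly one whitespace-delimited token."""
    return sum(1 for v in values if len(v.split()) == 1) / len(values)


sample = ["Drama", "Comedy|Drama", "Sci-Fi", "(no genres listed)"]
rate = one_word_rate(sample)
print(f"one-word rate on the sample: {rate:.2f}")

# Cross-check against the report: if the 7,080 placeholder rows have three
# words each and every other row has one, the stats should line up.
placeholder_share = 7_080 / 87_585
implied_word_mean = 1 + 2 * placeholder_share
print(f"placeholder share: {placeholder_share:.4f}  (1 - one_word_rate = {1 - 0.9192:.4f})")
print(f"implied word_mean: {implied_word_mean:.3f}  (reported: 1.162)")
```

The implied values agree with the reported one_word_rate and word_mean to the precision shown, which supports reading the alert as a measure of the placeholder share, not of single-genre rows.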

Fig 8.

Character-length distribution for genres.

How to cite

BibTeX

@misc{saturn-ml-32m-movies-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: ml 32m movies},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/ml-32m-movies}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: ml 32m movies. Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/ml-32m-movies