
ml 32m movies

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv

Saturn profiled 87,585 rows across 3 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv",
    "--findings", "ml-32m-movies.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset is a movie catalogue of 87,585 rows with three columns: a unique movieId, a title, and a pipe-delimited genres string. The genres column is the most analytically interesting: only 1,798 unique combinations exist, and Drama, Documentary, and Comedy dominate, while 7,080 rows (about 8.1%) are tagged '(no genres listed)', a sizeable gap worth flagging. Titles are nearly unique (87,382 distinct of 87,585), and the frequent '(2014)' through '(2019)' tokens in titles suggest the catalogue skews toward recent years. movieId spans 1 to 292,757 with no outliers; with only 87,585 distinct values in that span, roughly 30% of the range is used, which indicates a sparse identifier range rather than a contiguous sequence. Start with the genre distribution and the missing-genre share before any deeper modelling.

citing: row_count · columns.genres.n_unique · columns.genres.top_values · columns.genres.stats.one_word_rate · columns.title.n_unique · columns.title.top_words · columns.title.stats.len_mean · columns.movieId.stats.min · columns.movieId.stats.max · columns.movieId.stats.median
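The cited keys read like dotted paths into the findings file. As a minimal sketch, assuming the findings JSON is a nested object keyed the way those paths suggest (the report does not document the schema, so the `example` dict and the `lookup` helper below are hypothetical), a path can be resolved like this:

```python
from functools import reduce
from typing import Any


def lookup(findings: dict, dotted_path: str) -> Any:
    """Walk a nested dict one key at a time, e.g. 'columns.genres.n_unique'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)


# Hypothetical fragment shaped like the cited paths; values copied from this report.
example = {
    "row_count": 87585,
    "columns": {
        "genres": {"n_unique": 1798},
        "title": {"n_unique": 87382},
    },
}

# Share of rows tagged '(no genres listed)' (7,080 rows, per the summary).
missing_genre_share = 7080 / lookup(example, "row_count")
print(lookup(example, "columns.genres.n_unique"))
print(round(missing_genre_share, 4))
```

If the real file nests its keys differently, only the `example` dict needs to change; the lookup itself is generic.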

Out[4]:

saturn.schema() · 3 columns

column   kind     n       null%  unique  alerts
movieId  numeric  87,585  0.0%   87,585
title    text     87,585  0.0%   87,382  near_unique
genres   text     87,585  0.0%   1,798   one_word, duplicates
Fig 1.
genres · See how Drama, Documentary, and Comedy dominate, and note the large '(no genres listed)' bucket.
Character-length distribution for genres (mean: 13.24).
chars    count
3 – 5    125
5 – 7    24,325
7 – 9    3,474
9 – 10   2,417
10 – 12  14,431
12 – 14  11,262
14 – 16  3,313
16 – 18  2,692
18 – 20  9,626
20 – 22  5,612
22 – 23  3,690
23 – 25  1,617
25 – 27  1,066
27 – 29  758
29 – 31  1,008
31 – 33  640
33 – 34  366
34 – 36  430
36 – 38  193
38 – 40  76
40 – 42  135
42 – 44  159
44 – 46  42
46 – 47  45
47 – 49  25
49 – 51  30
51 – 53  5
53 – 55  9
55 – 57  6
57 – 58  3
58 – 60  3
60 – 62  0
62 – 64  0
64 – 66  1
66 – 68  0
68 – 70  0
70 – 71  0
71 – 73  0
73 – 75  0
75 – 77  1
Fig 2.
genres · Share of the top genre combinations versus the long tail of 1,798 unique strings.
Fig 3.
title · Title character-length distribution centres near 23 with a long tail out to 191.
Character-length distribution for title (mean: 25.28).
chars      count
2 – 7      44
7 – 11     1,905
11 – 16    14,749
16 – 21    17,609
21 – 26    20,947
26 – 30    12,610
30 – 35    7,127
35 – 40    3,670
40 – 45    3,134
45 – 49    2,086
49 – 54    1,102
54 – 59    932
59 – 63    548
63 – 68    381
68 – 73    213
73 – 78    180
78 – 82    123
82 – 87    81
87 – 92    41
92 – 96    33
96 – 101   23
101 – 106  13
106 – 111  11
111 – 115  8
115 – 120  3
120 – 125  1
125 – 130  2
130 – 134  4
134 – 139  0
139 – 144  2
144 – 148  1
148 – 153  0
153 – 158  0
158 – 163  1
163 – 167  0
167 – 172  0
172 – 177  0
177 – 182  0
182 – 186  0
186 – 191  1
Fig 4.
movieId · Check how movieIds spread from 1 to 292,757 to gauge sparsity in the identifier range.
Histogram bins for movieId (median: 165,741).
bin                    count
1 – 7320               7,196
7320 – 1.464e+04       1,111
1.464e+04 – 2.196e+04  0
2.196e+04 – 2.928e+04  1,114
2.928e+04 – 3.66e+04   796
3.66e+04 – 4.391e+04   435
4.391e+04 – 5.123e+04  754
5.123e+04 – 5.855e+04  816
5.855e+04 – 6.587e+04  789
6.587e+04 – 7.319e+04  1,132
7.319e+04 – 8.051e+04  1,107
8.051e+04 – 8.783e+04  1,398
8.783e+04 – 9.515e+04  1,556
9.515e+04 – 1.025e+05  1,541
1.025e+05 – 1.098e+05  1,535
1.098e+05 – 1.171e+05  1,847
1.171e+05 – 1.244e+05  2,673
1.244e+05 – 1.317e+05  2,763
1.317e+05 – 1.391e+05  3,280
1.391e+05 – 1.464e+05  3,282
1.464e+05 – 1.537e+05  3,146
1.537e+05 – 1.61e+05   3,258
1.61e+05 – 1.683e+05   3,464
1.683e+05 – 1.757e+05  3,492
1.757e+05 – 1.83e+05   3,484
1.83e+05 – 1.903e+05   3,486
1.903e+05 – 1.976e+05  3,401
1.976e+05 – 2.049e+05  3,369
2.049e+05 – 2.122e+05  3,050
2.122e+05 – 2.196e+05  2,907
2.196e+05 – 2.269e+05  2,325
2.269e+05 – 2.342e+05  1,839
2.342e+05 – 2.415e+05  1,672
2.415e+05 – 2.488e+05  1,531
2.488e+05 – 2.562e+05  1,566
2.562e+05 – 2.635e+05  1,679
2.635e+05 – 2.708e+05  1,911
2.708e+05 – 2.781e+05  2,168
2.781e+05 – 2.854e+05  2,580
2.854e+05 – 2.928e+05  2,132
Fig 5.
Per-column null rate across the corpus. Columns are ordered by input position.
column   kind     null %
movieId  numeric  0.0%
title    text     0.0%
genres   text     0.0%

movieId numeric identifier

movieId is fully unique across all 87,585 rows with zero nulls and spans 1 to 292,757, which is classic surrogate-key behaviour rather than a measurable quantity. Because only 87,585 IDs cover a range of 292,757, the IDs are non-contiguous (there are gaps in the sequence). The mean of 157,651 and the median of 165,741 describe where the IDs sit in that range, but there is no statistical signal to extract from them.

Treatment: use as a join key; exclude from modelling features.
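The sparsity of the identifier range can be put as a single number. The sketch below is not a saturn statistic; `id_density` is a hypothetical helper that divides the count of distinct IDs by the number of integer slots between the reported min and max.

```python
def id_density(n_unique: int, id_min: int, id_max: int) -> float:
    """Fraction of integer slots in [id_min, id_max] that are actually used."""
    return n_unique / (id_max - id_min + 1)


# Values taken from the movieId stats in this report.
density = id_density(87_585, 1, 292_757)
print(f"{density:.3f}")  # 0.299
```

About 30% of the slots are occupied, which is why any code that assumes movieId doubles as a row index would be wrong.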

anthropic:claude-opus-4-7 · confidence high
Out[11]:

saturn.columns["movieId"].stats

stat          value
n             87,585
nulls         0 (0.0%)
unique        87,585
min           1
max           292,757
mean          1.577e+05
median        165,741
std           7.901e+04
q1            112,657
q3            213,203
iqr           100,546
skew          -0.3914
kurtosis      -0.5775
n_outliers    0
outlier_rate  0
zero_rate     0
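The report does not say which rule produces n_outliers. As an illustration only, assuming the common Tukey rule (points beyond 1.5 × IQR from the quartiles), the reported quartiles give fences that contain the whole ID range, which would be consistent with a count of zero:

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 from the table above; their difference matches the reported iqr.
low, high = tukey_fences(112_657, 213_203)
print(low, high)  # -38162.0 364022.0

# The reported min (1) and max (292,757) both fall inside the fences.
all_inside = low <= 1 and 292_757 <= high
print(all_inside)  # True
```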
Fig 6.
Distribution of movieId. Vertical dash marks the median.

title text free_text

Short free-text titles, averaging 4.2 words and 25 characters, with 87,382 unique values across 87,585 rows. Frequent year tokens such as (2018), (2016), (2017), and (2019) suggest that these are titled works (films, given the source file) with release years appended. The column is near-unique, with only 203 duplicates, and has a high mean Flesch readability of 81.2; emoji and URL rates are effectively zero.

Treatment: Tokenize and embed (or strip the trailing year into a separate feature) before modelling; do not use as a key.
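Stripping the trailing year into its own feature could look like the sketch below. It assumes the year appears as a four-digit number in parentheses at the very end of the title; the helper name and the sample titles are illustrative and not taken from the profiled file.

```python
import re
from typing import Optional, Tuple

# A four-digit year in parentheses at the end of the string, e.g. " (1995)".
_YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_year(title: str) -> Tuple[str, Optional[int]]:
    """Return (title_without_year, year); year is None when no suffix matches."""
    match = _YEAR_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[: match.start()].strip(), int(match.group(1))


print(split_title_year("Toy Story (1995)"))  # ('Toy Story', 1995)
print(split_title_year("Untitled"))          # ('Untitled', None)
```

Titles without a matching suffix are returned unchanged with a year of None, so the caller decides how to treat them.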

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["title"].stats

stat                     value
n                        87,585
nulls                    0 (0.0%)
unique                   87,382
len_min                  2
len_max                  191
len_mean                 25.28
len_median               23
len_p95                  48
word_mean                4.226
word_median              4
n_empty                  0
n_duplicates             203
duplicate_rate           0.002318
vocab_size               19,981
readability_flesch_mean  81.2
emoji_rate               3.425e-05
url_rate                 0
one_word_rate            0.001279
allcaps_rate             0.005526
boilerplate_rate         9.134e-05
alert: near_unique       99.8% of rows are unique strings
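The duplicate figures are consistent with a simple definition: every repeat of an already-seen string counts as one duplicate, so n_duplicates = n − unique. That definition is an assumption inferred from the numbers rather than from saturn's documentation; the sketch below applies it to a toy list and to the reported totals.

```python
from typing import Sequence, Tuple


def duplicate_stats(values: Sequence[str]) -> Tuple[int, float]:
    """Return (n_duplicates, duplicate_rate) under the assumed definition:
    each repeat of an already-seen string counts as one duplicate."""
    n_duplicates = len(values) - len(set(values))
    return n_duplicates, n_duplicates / len(values)


# Toy input: 'a' repeats once, so one of three rows is a duplicate.
toy_n, toy_rate = duplicate_stats(["a", "a", "b"])

# Same arithmetic on the reported totals for title (n and unique from the table).
title_n_duplicates = 87_585 - 87_382
title_duplicate_rate = title_n_duplicates / 87_585
print(title_n_duplicates, round(title_duplicate_rate, 6))  # 203 0.002318
```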
Fig 7.
Character-length distribution for title.

genres text feature

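Pipe-delimited movie genre tags, with 1,798 distinct combinations across 87,585 rows and a 97.9% duplicate rate. Drama (12,443), Documentary (8,132), and Comedy (7,761) dominate, while 7,080 rows carry the literal placeholder '(no genres listed)' that should be treated as missing. The one_word rate of 91.9% (word_mean 1.16) is consistent with counting whitespace-separated words, under which a multi-genre entry like 'Comedy|Drama' still registers as one word: the rate equals the share of rows that are not the three-word placeholder (1 − 7,080/87,585 ≈ 0.919). It therefore does not show how common multi-genre entries are; that requires splitting on '|'.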

Treatment: split on '|' and one-hot or multi-hot encode; recode '(no genres listed)' as null.
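The treatment above can be sketched with the standard library alone. The function names and the three sample strings are illustrative; a real pipeline would more likely use a dataframe library, which this report does not prescribe.

```python
from typing import List, Optional, Tuple

NO_GENRES = "(no genres listed)"  # literal placeholder reported for 7,080 rows


def parse_genres(raw: str) -> Optional[List[str]]:
    """Split on '|'; map the placeholder to None so it is treated as missing."""
    if raw == NO_GENRES:
        return None
    return raw.split("|")


def multi_hot(rows: List[Optional[List[str]]]) -> Tuple[List[str], List[Optional[List[int]]]]:
    """Build a sorted genre vocabulary and one 0/1 vector per row (None stays None)."""
    vocab = sorted({genre for row in rows if row for genre in row})
    index = {genre: i for i, genre in enumerate(vocab)}
    matrix: List[Optional[List[int]]] = []
    for row in rows:
        if row is None:
            matrix.append(None)
            continue
        vector = [0] * len(vocab)
        for genre in row:
            vector[index[genre]] = 1
        matrix.append(vector)
    return vocab, matrix


sample = ["Comedy|Drama", "Documentary", NO_GENRES]
vocab, matrix = multi_hot([parse_genres(s) for s in sample])
print(vocab)   # ['Comedy', 'Documentary', 'Drama']
print(matrix)  # [[1, 0, 1], [0, 1, 0], None]
```

Keeping the placeholder rows as None rather than an all-zero vector preserves the distinction between "no genre recorded" and "none of these genres".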

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["genres"].stats

stat                     value
n                        87,585
nulls                    0 (0.0%)
unique                   1,798
len_min                  3
len_max                  77
len_mean                 13.24
len_median               12
len_p95                  27
word_mean                1.162
word_median              1
n_empty                  0
n_duplicates             85,787
duplicate_rate           0.9795
vocab_size               909
readability_flesch_mean  -117.7
emoji_rate               0
url_rate                 0
one_word_rate            0.9192
allcaps_rate             1.142e-05
boilerplate_rate         0
alert: one_word          91.9% rows are a single word
alert: duplicates        97.9% duplicate strings
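The one_word alert deserves a sanity check before it is read as "most rows have a single genre". The sketch below assumes that saturn counts words by splitting on whitespace (an assumption, though the reported numbers fit it) and that '(no genres listed)' is the only value containing spaces.

```python
# Under whitespace tokenization a pipe-delimited combination is still one word.
words_in_combo = len("Comedy|Drama".split())
words_in_placeholder = len("(no genres listed)".split())
print(words_in_combo, words_in_placeholder)  # 1 3

# Reported totals: 87,585 rows, 7,080 of them the three-word placeholder.
n_rows, n_placeholder = 87_585, 7_080

# If every other row is a single whitespace-free token, the stats follow directly.
implied_one_word_rate = 1 - n_placeholder / n_rows
implied_word_mean = ((n_rows - n_placeholder) * 1 + n_placeholder * 3) / n_rows
print(round(implied_one_word_rate, 4), round(implied_word_mean, 3))  # 0.9192 1.162
```

Both implied values match the table (one_word_rate 0.9192, word_mean 1.162), so the alert reflects the placeholder share, not the share of single-genre rows.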
Fig 8.
Character-length distribution for genres.

How to cite


BibTeX
@misc{saturn-ml-32m-movies-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: ml 32m movies},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/ml-32m-movies}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: ml 32m movies. Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/ml-32m-movies