saturn·

ml 32m movies

source /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv · 87,585 rows · 3 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a movie catalogue of 87,585 rows with three columns: a unique movieId, a title, and a pipe-delimited genres string. The genres column is the most analytically interesting: only 1,798 distinct genre strings occur, the most frequent exact values are Drama, Documentary, and Comedy, and 7,080 rows (8.1%) carry the placeholder '(no genres listed)', which is effectively missing data. Titles are nearly unique (87,382 distinct of 87,585), and frequent year tokens from '(2014)' to '(2019)' suggest the catalogue skews toward recent releases. movieId spans 1 to 292,757 for only 87,585 rows, so the identifier range is sparse rather than a contiguous sequence. Start with the genre distribution and the missing-genre share before any deeper modelling.

citing: row_count · columns.genres.n_unique · columns.genres.top_values · columns.genres.stats.one_word_rate · columns.title.n_unique · columns.title.top_words · columns.title.stats.len_mean · columns.movieId.stats.min · columns.movieId.stats.max · columns.movieId.stats.median
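
The two checks recommended above, the genre distribution and the missing-genre share, can be sketched with the standard library. This is a minimal illustration rather than the profiler's own code: the function name genre_overview and the four sample rows are hypothetical, and the only facts taken from the report are the column names, the placeholder string, and the counts 7,080 and 87,585.

```python
import csv
import io
from collections import Counter

PLACEHOLDER = "(no genres listed)"

# Illustrative rows only (hypothetical, not taken from movies.csv).
SAMPLE_CSV = """movieId,title,genres
1,Example Film (2018),Drama
2,Another Example (2016),Comedy|Drama
5,Untagged Example (2019),(no genres listed)
9,Short Doc (2017),Documentary
"""


def genre_overview(csv_file):
    """Count exact genre strings and the share of rows carrying the placeholder."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row["genres"] for row in reader)
    total = sum(counts.values())
    missing_share = counts[PLACEHOLDER] / total if total else 0.0
    return counts, missing_share


counts, missing_share = genre_overview(io.StringIO(SAMPLE_CSV))
print(counts.most_common(3))
print(f"missing-genre share in sample: {missing_share:.1%}")

# Share implied by the figures in the report: 7,080 of 87,585 rows.
reported_share = 7080 / 87585
print(f"missing-genre share in report: {reported_share:.2%}")
```

To run this against the real file, pass an open file handle for movies.csv in place of the io.StringIO sample, assuming the file has the same three-column header.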

Schema

3 columns
Per-column summary.
Column · Type · Nulls · Unique · Alerts
movieId · numeric · 0.0% · 87,585 · none
title · text · 0.0% · 87,382 · near_unique
genres · text · 0.0% · 1,798 · one_word, duplicates

movieId

numeric identifier
movieId is fully unique across all 87,585 rows with zero nulls and spans 1 to 292,757, which is classic surrogate-key behaviour rather than a measurable quantity. Only 87,585 of the 292,757 possible values in that range are used (about 30%), so the IDs are non-contiguous; the mean of 157,651 and median of 165,741 carry no statistical signal worth extracting. Treatment: use as a join key; exclude from modelling features. high · anthropic:claude-opus-4-7
n: 87,585
nulls: 0 (0.0%)
unique: 87,585
min: 1
max: 292,757
mean: 1.577e+05
median: 165,741
std: 7.901e+04
q1: 112,657
q3: 213,203
iqr: 100,546
skew: -0.3914
kurtosis: -0.5775
n_outliers: 0
outlier_rate: 0
zero_rate: 0
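
The surrogate-key reading of movieId rests on two properties: every value is unique, and the values do not fill their range. A minimal sketch of that check follows; key_profile is a hypothetical helper and the sample IDs are made up, while the final calculation uses only the n, min, and max figures reported above.

```python
def key_profile(ids):
    """Return uniqueness and range density for an integer key column."""
    ids = list(ids)
    if not ids:
        raise ValueError("key_profile needs at least one id")
    n_unique = len(set(ids))
    # Number of IDs a gap-free sequence from min to max would contain.
    span = max(ids) - min(ids) + 1
    return {
        "is_unique": n_unique == len(ids),
        "span": span,
        "density": n_unique / span,
    }


# Hypothetical IDs for illustration: unique, but with gaps.
sample = key_profile([1, 2, 5, 9])
print(sample)

# Density implied by the reported figures: 87,585 ids between 1 and 292,757.
reported_density = 87585 / (292757 - 1 + 1)
print(f"reported density: {reported_density:.1%}")
```

A density well below 1 with is_unique true is what one would expect from a key that was assigned over time with deletions or skipped values, which is why the column is suited to joins and not to modelling.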

title

text free_text near_unique
Short free-text titles, averaging 4.2 words and 25 characters, with 87,382 unique values across 87,585 rows. Frequent year tokens such as (2018), (2016), (2017), and (2019) indicate that these are film titles with the release year appended in parentheses. The column is near-unique, with only 203 duplicate rows (0.23%); Flesch readability averages 81.2, and emoji and URL rates are effectively zero. Treatment: tokenize and embed, or strip the trailing year into a separate feature before modelling; do not use as a key, since titles are not fully unique. high · anthropic:claude-opus-4-7
n: 87,585
nulls: 0 (0.0%)
unique: 87,382
len_min: 2
len_max: 191
len_mean: 25.28
len_median: 23
len_p95: 48
word_mean: 4.226
word_median: 4
n_empty: 0
n_duplicates: 203
duplicate_rate: 0.002318
vocab_size: 19,981
readability_flesch_mean: 81.2
emoji_rate: 3.425e-05
url_rate: 0
one_word_rate: 0.001279
allcaps_rate: 0.005526
boilerplate_rate: 9.134e-05
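
Stripping the trailing year into its own feature, as the treatment note suggests, can be sketched with a regular expression. This assumes the year is a four-digit number in parentheses at the very end of the title, which the report implies but does not state for every row; split_title_year and the example titles are hypothetical.

```python
import re

# Assumption: the year is four digits in parentheses at the end of the title.
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_year(title):
    """Split 'Name (YYYY)' into ('Name', YYYY); year is None if absent."""
    match = YEAR_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[:match.start()].strip(), int(match.group(1))


# Hypothetical titles for illustration.
for raw in ["Example Film (2018)", "Example (Director's Cut) (2016) ", "No Year Here"]:
    print(split_title_year(raw))
```

Titles without a matching suffix are returned unchanged with a year of None, so rows that break the assumed pattern stay visible instead of being silently mangled.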

genres

text feature one_word duplicates
Pipe-delimited movie genre tags, with 1,798 distinct combinations across 87,585 rows and a 97.9% duplicate rate. The most frequent exact values are Drama (12,443), Documentary (8,132), and Comedy (7,761), while 7,080 rows (8.1%) carry the literal placeholder '(no genres listed)', which should be treated as missing. The one_word_rate of 91.9% (word_mean 1.16) does not measure single-genre rows: it appears to count whitespace-separated words, so 'Comedy|Drama' counts as one word, and the remaining 8.1% matches the share of the three-word placeholder. Treatment: split on '|' and one-hot or multi-hot encode; recode '(no genres listed)' as null. high · anthropic:claude-opus-4-7
n: 87,585
nulls: 0 (0.0%)
unique: 1,798
len_min: 3
len_max: 77
len_mean: 13.24
len_median: 12
len_p95: 27
word_mean: 1.162
word_median: 1
n_empty: 0
n_duplicates: 85,787
duplicate_rate: 0.9795
vocab_size: 909
readability_flesch_mean: -117.7
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9192
allcaps_rate: 1.142e-05
boilerplate_rate: 0
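
The treatment for genres (split on '|', multi-hot encode, recode the placeholder as null) can be sketched as follows. The helper names parse_genres and multi_hot and the sample values are hypothetical; only the delimiter and the placeholder string come from the report.

```python
PLACEHOLDER = "(no genres listed)"


def parse_genres(value):
    """Return a list of genre tags, or None for the missing-genre placeholder."""
    if value == PLACEHOLDER:
        return None
    return value.split("|")


def multi_hot(values):
    """Encode pipe-delimited genre strings as 0/1 vectors over a sorted vocabulary."""
    parsed = [parse_genres(v) for v in values]
    vocab = sorted({g for tags in parsed if tags is not None for g in tags})
    rows = []
    for tags in parsed:
        if tags is None:
            rows.append(None)  # keep missing rows distinct from all-zero vectors
        else:
            present = set(tags)
            rows.append([1 if g in present else 0 for g in vocab])
    return vocab, rows


# Hypothetical values for illustration.
sample = ["Drama", "Comedy|Drama", PLACEHOLDER, "Documentary"]
vocab, rows = multi_hot(sample)
print(vocab)
print(rows)

# Whitespace word counts, for comparison with the one_word_rate statistic.
word_counts = [len(v.split()) for v in sample]
print(word_counts)
```

Keeping None for placeholder rows, instead of an all-zero vector, leaves the choice of imputation or exclusion to the modelling step. The whitespace word counts show why a word-based statistic cannot separate single-genre from multi-genre rows: 'Comedy|Drama' contains no spaces, while the placeholder contains two.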