saturn·

ml 32m tags

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv

Saturn profiled 2,000,072 rows across 4 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

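[2]:
!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the CSV with the same arguments as the original cell.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv",
        "--findings", "ml-32m-tags.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)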

Summary confidence: high

This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: userId, movieId, a free-text tag, and a timestamp. The tag column is the most interesting feature: it has only 140,981 unique values across 2,000,072 rows (a 92.95% duplicate rate), and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew −1.22): most rows sit at recent dates with a long tail of older records, suggesting tagging picked up in later years. All rows come from only 15,848 distinct userIds (about 126 rows per user on average), so check how unevenly tagging is spread across users. Start by looking at the top tags and the timestamp distribution to understand what users tag and when.

citing: row_count · column_count · columns.tag.n_unique · columns.tag.stats.duplicate_rate · columns.tag.stats.one_word_rate · columns.tag.top_values · columns.timestamp.stats.skew · columns.timestamp.stats.median · columns.userId.n_unique · columns.movieId.n_unique
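
The citing line above lists dotted paths into the findings file. As a minimal sketch of how such paths could be resolved, assuming the JSON written by `--findings` is a nested object whose keys mirror those paths (an assumption inferred from the citing line, not confirmed here). `load_findings` and `resolve` are hypothetical helpers, and the sample dict only reuses values quoted in this report:

```python
import json
from functools import reduce


def load_findings(path):
    """Load a findings file; the nested layout is assumed, not verified."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def resolve(findings, dotted):
    """Walk a nested dict along a dotted path such as 'columns.tag.n_unique'."""
    return reduce(lambda node, key: node[key], dotted.split("."), findings)


# Hypothetical excerpt shaped like the cited paths; values come from this report.
sample = {
    "row_count": 2000072,
    "column_count": 4,
    "columns": {"tag": {"n_unique": 140981, "stats": {"duplicate_rate": 0.9295}}},
}

print(resolve(sample, "columns.tag.stats.duplicate_rate"))
```

If the real file follows that layout, `resolve(load_findings("ml-32m-tags.json"), "row_count")` would return the row count; a missing key raises `KeyError`, which makes a wrong path easy to spot.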

Out[4]:

saturn.schema() · 4 columns

column     kind     n          null%  unique     alerts
userId     numeric  2,000,072  0.0%   15,848     outliers
movieId    numeric  2,000,072  0.0%   51,323
tag        text     2,000,072  0.0%   140,981    one_word, duplicates
timestamp  numeric  2,000,072  0.0%   1,291,250  outliers
Fig 1.
tag · Top tags are dominated by genre-like labels such as 'sci-fi', 'atmospheric', and 'action'.
Fig 2.
tag · Most tags are very short (median 10 characters; mean ~1.7 words, median 1 word), and just over half (52.5%) are single-word labels.
Character-length distribution for tag (mean: 11.11).
chars        count
1 – 7        434,000
7 – 13       901,154
13 – 19      488,283
19 – 25      122,950
25 – 31      37,076
31 – 37      11,065
37 – 43      3,109
43 – 49      1,395
49 – 55      435
55 – 61      174
61 – 67      128
67 – 73      99
73 – 79      36
79 – 85      25
85 – 91      31
91 – 97      15
97 – 103     10
103 – 109    11
109 – 115    5
115 – 121    5
121 – 127    5
127 – 133    8
133 – 139    8
139 – 145    2
145 – 151    5
151 – 157    3
157 – 163    3
163 – 169    1
169 – 175    1
175 – 181    3
181 – 187    2
187 – 193    3
193 – 199    6
199 – 205    2
205 – 211    5
211 – 217    3
217 – 223    0
223 – 229    2
229 – 235    3
235 – 241    1
Fig 3.
timestamp · Timestamps are left-skewed toward recent years — check when tagging activity ramped up.
Histogram bins for timestamp (median: 1574071062.0).
bin                      count
1.135e+09 – 1.149e+09    18,141
1.149e+09 – 1.164e+09    8,578
1.164e+09 – 1.178e+09    14,546
1.178e+09 – 1.192e+09    8,773
1.192e+09 – 1.206e+09    7,339
1.206e+09 – 1.22e+09     7,288
1.22e+09 – 1.234e+09     6,940
1.234e+09 – 1.248e+09    34,739
1.248e+09 – 1.262e+09    21,415
1.262e+09 – 1.276e+09    20,116
1.276e+09 – 1.29e+09     19,130
1.29e+09 – 1.304e+09     25,517
1.304e+09 – 1.318e+09    23,164
1.318e+09 – 1.332e+09    17,686
1.332e+09 – 1.346e+09    18,163
1.346e+09 – 1.36e+09     17,679
1.36e+09 – 1.374e+09     26,680
1.374e+09 – 1.388e+09    16,406
1.388e+09 – 1.402e+09    17,074
1.402e+09 – 1.416e+09    13,818
1.416e+09 – 1.43e+09     24,668
1.43e+09 – 1.444e+09     46,130
1.444e+09 – 1.458e+09    44,122
1.458e+09 – 1.472e+09    38,869
1.472e+09 – 1.487e+09    35,079
1.487e+09 – 1.501e+09    38,009
1.501e+09 – 1.515e+09    34,375
1.515e+09 – 1.529e+09    237,228
1.529e+09 – 1.543e+09    64,476
1.543e+09 – 1.557e+09    46,622
1.557e+09 – 1.571e+09    38,050
1.571e+09 – 1.585e+09    41,917
1.585e+09 – 1.599e+09    82,401
1.599e+09 – 1.613e+09    297,429
1.613e+09 – 1.627e+09    222,550
1.627e+09 – 1.641e+09    103,520
1.641e+09 – 1.655e+09    97,307
1.655e+09 – 1.669e+09    68,594
1.669e+09 – 1.683e+09    61,631
1.683e+09 – 1.697e+09    33,903
Fig 4.
userId · Only ~15.8K distinct users generated 2M tags; look for heavy-tagger concentration and outliers.
Histogram bins for userId (median: 78213.0).
bin                      count
22 – 4078                22,923
4078 – 8135              37,252
8135 – 1.219e+04         25,374
1.219e+04 – 1.625e+04    24,398
1.625e+04 – 2.03e+04     29,784
2.03e+04 – 2.436e+04     29,113
2.436e+04 – 2.842e+04    30,580
2.842e+04 – 3.247e+04    41,492
3.247e+04 – 3.653e+04    47,022
3.653e+04 – 4.059e+04    33,643
4.059e+04 – 4.464e+04    25,173
4.464e+04 – 4.87e+04     27,372
4.87e+04 – 5.276e+04     22,925
5.276e+04 – 5.681e+04    17,507
5.681e+04 – 6.087e+04    38,134
6.087e+04 – 6.492e+04    24,762
6.492e+04 – 6.898e+04    50,610
6.898e+04 – 7.304e+04    28,819
7.304e+04 – 7.709e+04    26,238
7.709e+04 – 8.115e+04    764,653
8.115e+04 – 8.521e+04    24,570
8.521e+04 – 8.926e+04    20,395
8.926e+04 – 9.332e+04    26,016
9.332e+04 – 9.738e+04    26,137
9.738e+04 – 1.014e+05    31,546
1.014e+05 – 1.055e+05    40,766
1.055e+05 – 1.095e+05    38,876
1.095e+05 – 1.136e+05    30,073
1.136e+05 – 1.177e+05    20,068
1.177e+05 – 1.217e+05    52,404
1.217e+05 – 1.258e+05    31,800
1.258e+05 – 1.298e+05    25,923
1.298e+05 – 1.339e+05    22,044
1.339e+05 – 1.379e+05    29,770
1.379e+05 – 1.42e+05     21,515
1.42e+05 – 1.461e+05     42,803
1.461e+05 – 1.501e+05    53,003
1.501e+05 – 1.542e+05    31,967
1.542e+05 – 1.582e+05    35,481
1.582e+05 – 1.623e+05    47,141
Fig 5.
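movieId · Tag coverage spans ~51K movies with a right-skewed id distribution: the lowest bin (ids 1 to 7317) alone holds 740,560 rows, about 37% of all tags, so tagging clusters on low-id (presumably earlier-catalogued) films.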
Histogram bins for movieId (median: 52328.0).
bin                      count
1 – 7317                 740,560
7317 – 1.463e+04         78,542
1.463e+04 – 2.195e+04    0
2.195e+04 – 2.926e+04    42,089
2.926e+04 – 3.658e+04    44,405
3.658e+04 – 4.39e+04     25,577
4.39e+04 – 5.121e+04     60,448
5.121e+04 – 5.853e+04    60,408
5.853e+04 – 6.584e+04    48,280
6.584e+04 – 7.316e+04    60,162
7.316e+04 – 8.047e+04    49,499
8.047e+04 – 8.779e+04    48,142
8.779e+04 – 9.511e+04    54,096
9.511e+04 – 1.024e+05    50,634
1.024e+05 – 1.097e+05    61,455
1.097e+05 – 1.171e+05    56,213
1.171e+05 – 1.244e+05    47,682
1.244e+05 – 1.317e+05    20,624
1.317e+05 – 1.39e+05     32,253
1.39e+05 – 1.463e+05     29,242
1.463e+05 – 1.536e+05    20,987
1.536e+05 – 1.609e+05    24,359
1.609e+05 – 1.683e+05    47,171
1.683e+05 – 1.756e+05    32,779
1.756e+05 – 1.829e+05    39,934
1.829e+05 – 1.902e+05    42,009
1.902e+05 – 1.975e+05    30,827
1.975e+05 – 2.048e+05    33,387
2.048e+05 – 2.122e+05    25,164
2.122e+05 – 2.195e+05    15,290
2.195e+05 – 2.268e+05    14,099
2.268e+05 – 2.341e+05    5,384
2.341e+05 – 2.414e+05    10,225
2.414e+05 – 2.487e+05    3,451
2.487e+05 – 2.561e+05    9,558
2.561e+05 – 2.634e+05    7,446
2.634e+05 – 2.707e+05    6,938
2.707e+05 – 2.78e+05     9,782
2.78e+05 – 2.853e+05     7,806
2.853e+05 – 2.926e+05    3,165
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column     kind     null %
userId     numeric  0.0%
movieId    numeric  0.0%
tag        text     0.0%
timestamp  numeric  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
           userId  movieId  timestamp
userId     +1.00   -0.04    -0.01
movieId    -0.04   +1.00    +0.34
timestamp  -0.01   +0.34    +1.00
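
For reference, each cell in the matrix is a Pearson coefficient between two columns. The sketch below shows the standard formula on toy data; `pearson` is an illustrative helper, not Saturn's implementation, and Saturn's sampling and bounding are not reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values only: a perfectly increasing pair and a perfectly decreasing pair.
rising = pearson([1, 2, 3, 4], [10, 20, 30, 40])
falling = pearson([1, 2, 3, 4], [40, 30, 20, 10])
print(round(rising, 2), round(falling, 2))
```

Because userId and movieId are identifiers, the coefficients involving them say more about how ids were allocated over time than about any behavioural relationship.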

userId numeric identifier

This is almost certainly a user identifier stored as an integer, with 15,848 unique values spread across 2,000,072 rows (about 126 rows per user on average). Values range from 22 to 162,279 with no nulls or zeros, and the summary shape is near-symmetric (skew 0.12, kurtosis -0.26), consistent with sparsely allocated account IDs rather than a meaningful numeric quantity. The histogram is far from flat, though: a single bin (7.709e+04 to 8.115e+04) holds 764,653 rows, about 38% of the total, which points to heavy taggers in that ID range. The 157,830 flagged outliers (7.9%) are an artefact of treating IDs as numeric; they are not statistical anomalies in any business sense.

Treatment: Treat as a categorical/foreign key for joins and grouping; do not use as a numeric feature.
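
Grouping by userId as a key, rather than summarising it as a number, is how heavy-tagger concentration can be checked. A minimal sketch on toy rows follows; `tagger_concentration` is a hypothetical helper and the ids are made up, not taken from the dataset.

```python
from collections import Counter


def tagger_concentration(user_ids, top_n=1):
    """Return the top_n most active users and the share of rows they account for."""
    user_ids = list(user_ids)
    if not user_ids:
        return [], 0.0
    counts = Counter(user_ids)
    top = counts.most_common(top_n)
    share = sum(count for _, count in top) / len(user_ids)
    return top, share


# Toy data: hypothetical user 7 contributes 6 of 10 tag rows.
rows = [7, 7, 7, 7, 7, 7, 3, 3, 9, 12]
top, share = tagger_concentration(rows)
print(top, share)
```

On the real file the same function could be fed the userId column read with `csv.DictReader`; a high share for a small `top_n` would confirm that a few accounts dominate the log.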

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["userId"].stats

stat             value
n                2,000,072
nulls            0 (0.0%)
unique           15,848
min              22
max              162,279
mean             8.193e+04
median           78,213
std              3.811e+04
q1               68,413
q3               103,698
iqr              35,285
skew             0.1179
kurtosis         -0.2601
n_outliers       157,830
outlier_rate     0.07891
zero_rate        0
alert: outliers  7.9% rows beyond 1.5 IQR
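
The outlier alert refers to rows beyond 1.5 IQR. A minimal sketch of the fence arithmetic, assuming Saturn applies the standard Tukey rule to the reported q1 and q3 (the alert wording suggests this, but the report does not spell it out); `iqr_fences` is a hypothetical helper:

```python
def iqr_fences(q1, q3, k=1.5):
    """Tukey fences: values below q1 - k*IQR or above q3 + k*IQR count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 for userId as reported in the stats table.
low, high = iqr_fences(68_413, 103_698)
print(low, high)  # 15485.5 156625.5
```

Under that assumption, the flagged rows are simply those with a userId below about 15,486 or above about 156,626, which is why the alert carries no analytical meaning for an identifier column.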
Fig 8.
Distribution of userId. Vertical dash marks the median.

movieId numeric foreign_key

Integer movieId spanning 1 to 292,629 across 2,000,072 rows, with 51,323 unique values and no nulls. The cardinality and bounded integer range indicate a foreign key into a movies table rather than a measured quantity; summary statistics such as the mean (71,893.26) and skew (0.82) reflect how IDs were allocated, not a meaningful distribution.

Treatment: Left-join on this id to a movies dimension; do not treat it as a numeric feature.
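
A minimal sketch of the left join described above, using plain dictionaries so it runs without extra packages. The movies lookup, its placeholder titles, and the `left_join_titles` helper are all hypothetical; a real run would build the lookup from whatever movies table accompanies the tags file.

```python
def left_join_titles(tag_rows, movies_by_id):
    """Attach a title to every tag row; movieIds missing from the lookup get None."""
    return [
        {**row, "title": movies_by_id.get(row["movieId"])}
        for row in tag_rows
    ]


# Toy stand-ins: placeholder titles, not real MovieLens rows.
movies_by_id = {1: "Movie A", 2: "Movie B"}
tag_rows = [
    {"movieId": 1, "tag": "sci-fi"},
    {"movieId": 99, "tag": "action"},  # no match in the lookup
]
joined = left_join_titles(tag_rows, movies_by_id)
print(joined)
```

A left join keeps every tag row, so unmatched ids surface as `None` titles instead of silently disappearing, which is useful for spotting gaps in the dimension table.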

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["movieId"].stats

stat          value
n             2,000,072
nulls         0 (0.0%)
unique        51,323
min           1
max           292,629
mean          7.189e+04
median        52,328
std           7.48e+04
q1            4,011
q3            122,294
iqr           118,283
skew          0.821
kurtosis      -0.4024
n_outliers    0
outlier_rate  0
zero_rate     0
Fig 9.
Distribution of movieId. Vertical dash marks the median.

tag text label

This column holds short descriptive tags applied to items (likely films), with frequent values like 'sci-fi', 'atmospheric', 'action', and 'comedy'. Entries are short: the mean length is 11.1 characters, the mean word count is 1.67, and 52.5% are a single word. Despite 2,000,072 rows there are only 140,981 distinct tags, a 92.95% duplicate rate (1,859,091 duplicates), so the same labels recur heavily across items.

Treatment: Treat as a categorical/multi-label tag field; one-hot or embed top tags and group the long tail.
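
One way to read the treatment above is: keep the most frequent tags as their own categories, fold everything else into a shared bucket, then encode. The sketch below illustrates that on a toy list; `bucket_tags`, `one_hot`, the `__other__` label, and the choice of `top_k` are all illustrative assumptions, not part of the Saturn output.

```python
from collections import Counter


def bucket_tags(tags, top_k, other="__other__"):
    """Keep the top_k most frequent tags; map the long tail to one shared bucket."""
    keep = {tag for tag, _ in Counter(tags).most_common(top_k)}
    return [tag if tag in keep else other for tag in tags]


def one_hot(tag, vocabulary):
    """Encode a single tag as a 0/1 vector over a fixed vocabulary."""
    return [1 if tag == entry else 0 for entry in vocabulary]


# Toy list reusing tag strings quoted in this report; counts are made up.
tags = ["sci-fi", "sci-fi", "sci-fi", "atmospheric", "atmospheric", "action", "comedy"]
bucketed = bucket_tags(tags, top_k=2)
vocab = sorted(set(bucketed))
print(vocab, one_hot("sci-fi", vocab))
```

With 140,981 distinct tags, a full one-hot encoding would be very wide, so the cut-off for `top_k` is a modelling decision; embeddings are the alternative the treatment mentions for keeping more of the tail.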

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["tag"].stats

stat                     value
n                        2,000,072
nulls                    0 (0.0%)
unique                   140,981
len_min                  1
len_max                  241
len_mean                 11.11
len_median               10
len_p95                  21
word_mean                1.674
word_median              1
n_empty                  0
n_duplicates             1.859e+06
duplicate_rate           0.9295
vocab_size               7,273
readability_flesch_mean  17.7
emoji_rate               0
url_rate                 1.3e-05
one_word_rate            0.5247
allcaps_rate             0.01055
boilerplate_rate         1e-05
alert: one_word          52.5% rows are a single word
alert: duplicates        93.0% duplicate strings
Fig 10.
Character-length distribution for tag.

timestamp numeric timestamp

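Unix epoch timestamps (in seconds) spanning 1135429210 (late 2005) to 1697154983 (late 2023), with a median of 1574071062 (late 2019). The distribution is left-skewed (skew -1.22), and 6.4% of rows (127,898) fall beyond the 1.5 IQR fences. With q1 = 1.474e+09 and iqr = 1.411e+08, the lower fence sits at roughly 1.262e+09 (around the start of 2010), and the upper fence lies above the maximum, so the flagged rows are the long tail of early records. The histogram shows where activity concentrates: about 70% of rows fall in bins at or after 1.515e+09 (early 2018). With 1,291,250 unique values across 2,000,072 rows there is meaningful repetition, but no nulls or zeros.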

Treatment: Convert from epoch seconds to datetime and derive features (year, month, recency); consider filtering or bucketing the pre-2018 long tail.
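
A minimal sketch of the conversion described above, using only the standard library. `epoch_features` is a hypothetical helper, interpreting the values as UTC is an assumption, and using the dataset maximum as the reference point for recency is one possible choice rather than something the report prescribes.

```python
from datetime import datetime, timezone


def epoch_features(ts, now_ts):
    """Convert epoch seconds to a UTC datetime and derive simple calendar features."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "datetime": dt,
        "year": dt.year,
        "month": dt.month,
        "recency_days": (now_ts - ts) / 86_400,  # days before the reference time
    }


# Minimum and maximum timestamps quoted in this report.
first = epoch_features(1135429210, now_ts=1697154983)
print(first["year"], first["month"])  # 2005 12
```

Passing an explicit `tz` avoids results that depend on the machine's local time zone, which matters when bucketing rows by year or month.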

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["timestamp"].stats

stat             value
n                2,000,072
nulls            0 (0.0%)
unique           1,291,250
min              1.135e+09
max              1.697e+09
mean             1.529e+09
median           1.574e+09
std              1.291e+08
q1               1.474e+09
q3               1.615e+09
iqr              1.411e+08
skew             -1.219
kurtosis         0.6854
n_outliers       127,898
outlier_rate     0.06395
zero_rate        0
alert: outliers  6.4% rows beyond 1.5 IQR
Fig 11.
Distribution of timestamp. Vertical dash marks the median.

How to cite

BibTeX
@misc{saturn-ml-32m-tags-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: ml 32m tags},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/ml-32m-tags}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: ml 32m tags. Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/ml-32m-tags