ml-32m-tags · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv

Saturn profiled 2,000,072 rows across 4 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv",
    "--findings", "ml-32m-tags.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: a free-text tag, a timestamp, a userId, and a movieId. The tag column is the most interesting feature — it has only 140,981 unique values across 2M rows (a 92.95% duplicate rate) and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew −1.22) toward more recent activity, suggesting tagging picked up in later years, and userId shows that tagging is concentrated among a subset of users (only 15,848 distinct userIds for 2M rows). Start by looking at the top tags and the timestamp distribution to understand what users tag and when.

citing: row_count · column_count · columns.tag.n_unique · columns.tag.stats.duplicate_rate · columns.tag.stats.one_word_rate · columns.tag.top_values · columns.timestamp.stats.skew · columns.timestamp.stats.median · columns.userId.n_unique · columns.movieId.n_unique
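
The citing keys above read like dotted paths into the findings file. The following is a minimal sketch of resolving such a path, assuming ml-32m-tags.json is nested JSON whose keys mirror those paths. That layout is inferred from the key names and is not confirmed here; `lookup` is a hypothetical helper, and the inline dictionary stands in for the real file.

```python
from functools import reduce

# Stand-in for the parsed findings file. The nesting is an ASSUMPTION inferred
# from the citing keys; the values are the ones reported in this notebook.
findings = {
    "row_count": 2_000_072,
    "column_count": 4,
    "columns": {
        "tag": {"n_unique": 140_981, "stats": {"duplicate_rate": 0.9295}},
        "userId": {"n_unique": 15_848},
    },
}


def lookup(doc, dotted_path):
    """Walk a nested dict one key at a time, e.g. 'columns.tag.n_unique'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), doc)


for path in ("row_count", "columns.tag.n_unique", "columns.tag.stats.duplicate_rate"):
    print(path, "=", lookup(findings, path))
```

If the real file is laid out differently, only the `findings` stand-in needs to change; a missing key raises `KeyError`, which makes a stale citation easy to spot.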

Out[4]:

saturn.schema() · 4 columns

column	kind	n	null%	unique	alerts
userId	numeric	2,000,072	0.0%	15,848	outliers
movieId	numeric	2,000,072	0.0%	51,323
tag	text	2,000,072	0.0%	140,981	one_word duplicates
timestamp	numeric	2,000,072	0.0%	1,291,250	outliers

Fig 1.

tag · Top tags are dominated by genre-like labels such as 'sci-fi', 'atmospheric', and 'action'.

Fig 2.

tag · Most tags are short (median 10 characters; mean 1.67 words, median 1 word), and just over half (52.5%) are a single word.

Character-length distribution for tag (mean: 11.11).
chars	count
1 – 7	434000
7 – 13	901154
13 – 19	488283
19 – 25	122950
25 – 31	37076
31 – 37	11065
37 – 43	3109
43 – 49	1395
49 – 55	435
55 – 61	174
61 – 67	128
67 – 73	99
73 – 79	36
79 – 85	25
85 – 91	31
91 – 97	15
97 – 103	10
103 – 109	11
109 – 115	5
115 – 121	5
121 – 127	5
127 – 133	8
133 – 139	8
139 – 145	2
145 – 151	5
151 – 157	3
157 – 163	3
163 – 169	1
169 – 175	1
175 – 181	3
181 – 187	2
187 – 193	3
193 – 199	6
199 – 205	2
205 – 211	5
211 – 217	3
217 – 223	0
223 – 229	2
229 – 235	3
235 – 241	1
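
The length and word statistics behind this figure can be reproduced along the following lines. This is a sketch on made-up sample tags, and it assumes a "word" is a whitespace-separated token; Saturn's actual tokenisation is not documented in this notebook.

```python
from statistics import mean, median

# Made-up sample tags for illustration; not rows from tags.csv.
tags = ["sci-fi", "atmospheric", "action", "based on a book", "twist ending"]

char_lengths = [len(t) for t in tags]
# ASSUMPTION: a word is a whitespace-separated token, so 'sci-fi' counts as one.
word_counts = [len(t.split()) for t in tags]

len_mean = mean(char_lengths)
len_median = median(char_lengths)
word_mean = mean(word_counts)
one_word_rate = sum(1 for w in word_counts if w == 1) / len(tags)

print(f"len_mean={len_mean:.2f} len_median={len_median} "
      f"word_mean={word_mean:.2f} one_word_rate={one_word_rate:.2f}")
```

On this sample the mean word count (1.8) sits well above the median (1), the same gap the real column shows between word_mean and word_median.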

Fig 3.

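timestamp · Timestamps are left-skewed toward recent years. The two largest spikes fall in the first half of 2018 (1.515e+09 to 1.529e+09) and from roughly September 2020 to July 2021 (1.599e+09 to 1.627e+09).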

Histogram bins for timestamp (median: 1574071062.0).
bin	count
1.135e+09 – 1.149e+09	18141
1.149e+09 – 1.164e+09	8578
1.164e+09 – 1.178e+09	14546
1.178e+09 – 1.192e+09	8773
1.192e+09 – 1.206e+09	7339
1.206e+09 – 1.22e+09	7288
1.22e+09 – 1.234e+09	6940
1.234e+09 – 1.248e+09	34739
1.248e+09 – 1.262e+09	21415
1.262e+09 – 1.276e+09	20116
1.276e+09 – 1.29e+09	19130
1.29e+09 – 1.304e+09	25517
1.304e+09 – 1.318e+09	23164
1.318e+09 – 1.332e+09	17686
1.332e+09 – 1.346e+09	18163
1.346e+09 – 1.36e+09	17679
1.36e+09 – 1.374e+09	26680
1.374e+09 – 1.388e+09	16406
1.388e+09 – 1.402e+09	17074
1.402e+09 – 1.416e+09	13818
1.416e+09 – 1.43e+09	24668
1.43e+09 – 1.444e+09	46130
1.444e+09 – 1.458e+09	44122
1.458e+09 – 1.472e+09	38869
1.472e+09 – 1.487e+09	35079
1.487e+09 – 1.501e+09	38009
1.501e+09 – 1.515e+09	34375
1.515e+09 – 1.529e+09	237228
1.529e+09 – 1.543e+09	64476
1.543e+09 – 1.557e+09	46622
1.557e+09 – 1.571e+09	38050
1.571e+09 – 1.585e+09	41917
1.585e+09 – 1.599e+09	82401
1.599e+09 – 1.613e+09	297429
1.613e+09 – 1.627e+09	222550
1.627e+09 – 1.641e+09	103520
1.641e+09 – 1.655e+09	97307
1.655e+09 – 1.669e+09	68594
1.669e+09 – 1.683e+09	61631
1.683e+09 – 1.697e+09	33903

Fig 4.

userId · Only ~15.8K distinct users generated 2M tags, and a single bin (7.709e+04 to 8.115e+04) holds 764,653 rows (38.2%), so one or more ids in that range are likely very heavy taggers.

Histogram bins for userId (median: 78213.0).
bin	count
22 – 4078	22923
4078 – 8135	37252
8135 – 1.219e+04	25374
1.219e+04 – 1.625e+04	24398
1.625e+04 – 2.03e+04	29784
2.03e+04 – 2.436e+04	29113
2.436e+04 – 2.842e+04	30580
2.842e+04 – 3.247e+04	41492
3.247e+04 – 3.653e+04	47022
3.653e+04 – 4.059e+04	33643
4.059e+04 – 4.464e+04	25173
4.464e+04 – 4.87e+04	27372
4.87e+04 – 5.276e+04	22925
5.276e+04 – 5.681e+04	17507
5.681e+04 – 6.087e+04	38134
6.087e+04 – 6.492e+04	24762
6.492e+04 – 6.898e+04	50610
6.898e+04 – 7.304e+04	28819
7.304e+04 – 7.709e+04	26238
7.709e+04 – 8.115e+04	764653
8.115e+04 – 8.521e+04	24570
8.521e+04 – 8.926e+04	20395
8.926e+04 – 9.332e+04	26016
9.332e+04 – 9.738e+04	26137
9.738e+04 – 1.014e+05	31546
1.014e+05 – 1.055e+05	40766
1.055e+05 – 1.095e+05	38876
1.095e+05 – 1.136e+05	30073
1.136e+05 – 1.177e+05	20068
1.177e+05 – 1.217e+05	52404
1.217e+05 – 1.258e+05	31800
1.258e+05 – 1.298e+05	25923
1.298e+05 – 1.339e+05	22044
1.339e+05 – 1.379e+05	29770
1.379e+05 – 1.42e+05	21515
1.42e+05 – 1.461e+05	42803
1.461e+05 – 1.501e+05	53003
1.501e+05 – 1.542e+05	31967
1.542e+05 – 1.582e+05	35481
1.582e+05 – 1.623e+05	47141

Fig 5.

movieId · Tags span ~51K movies with a right-skewed id distribution: the lowest bin (ids 1 to 7317) holds 740,560 rows (37.0%). Lower ids likely correspond to earlier catalogue additions, not necessarily older films.

Histogram bins for movieId (median: 52328.0).
bin	count
1 – 7317	740560
7317 – 1.463e+04	78542
1.463e+04 – 2.195e+04	0
2.195e+04 – 2.926e+04	42089
2.926e+04 – 3.658e+04	44405
3.658e+04 – 4.39e+04	25577
4.39e+04 – 5.121e+04	60448
5.121e+04 – 5.853e+04	60408
5.853e+04 – 6.584e+04	48280
6.584e+04 – 7.316e+04	60162
7.316e+04 – 8.047e+04	49499
8.047e+04 – 8.779e+04	48142
8.779e+04 – 9.511e+04	54096
9.511e+04 – 1.024e+05	50634
1.024e+05 – 1.097e+05	61455
1.097e+05 – 1.171e+05	56213
1.171e+05 – 1.244e+05	47682
1.244e+05 – 1.317e+05	20624
1.317e+05 – 1.39e+05	32253
1.39e+05 – 1.463e+05	29242
1.463e+05 – 1.536e+05	20987
1.536e+05 – 1.609e+05	24359
1.609e+05 – 1.683e+05	47171
1.683e+05 – 1.756e+05	32779
1.756e+05 – 1.829e+05	39934
1.829e+05 – 1.902e+05	42009
1.902e+05 – 1.975e+05	30827
1.975e+05 – 2.048e+05	33387
2.048e+05 – 2.122e+05	25164
2.122e+05 – 2.195e+05	15290
2.195e+05 – 2.268e+05	14099
2.268e+05 – 2.341e+05	5384
2.341e+05 – 2.414e+05	10225
2.414e+05 – 2.487e+05	3451
2.487e+05 – 2.561e+05	9558
2.561e+05 – 2.634e+05	7446
2.634e+05 – 2.707e+05	6938
2.707e+05 – 2.78e+05	9782
2.78e+05 – 2.853e+05	7806
2.853e+05 – 2.926e+05	3165

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
userId	numeric	0.0%
movieId	numeric	0.0%
tag	text	0.0%
timestamp	numeric	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values rounded to 2 decimals).
	userId	movieId	timestamp
userId	+1.00	-0.04	-0.01
movieId	-0.04	+1.00	+0.34
timestamp	-0.01	+0.34	+1.00
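
For reference, each cell in this matrix is a Pearson coefficient: the covariance of two columns divided by the product of their spreads. The sketch below uses made-up (movieId, timestamp) pairs and a hypothetical `pearson` helper; it does not reproduce Saturn's sampling or its +0.34 value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson r: co-variation of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up (movieId, timestamp) pairs for illustration; not rows from tags.csv.
movie_ids = [1, 2571, 60069, 122904, 200000]
timestamps = [1_150_000_000, 1_300_000_000, 1_250_000_000, 1_600_000_000, 1_650_000_000]

r = pearson(movie_ids, timestamps)
print(f"r = {r:+.2f}")
```

Because both columns here are identifiers or clock values, a coefficient like the reported +0.34 says that higher movieIds tend to be tagged later, not that one quantity drives the other.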

userId numeric identifier

This is almost certainly a user identifier stored as an integer: 15,848 unique values across 2,000,072 rows, or about 126 rows per user on average. Values range from 22 to 162,279 with no nulls or zeros. Skew (0.12) and kurtosis (-0.26) are close to zero, but the moments of an ID column describe how IDs were allocated, not a meaningful numeric quantity. The 157,830 flagged outliers (7.9%) are likewise an artefact of treating IDs as numeric: they are simply rows whose userId falls outside the 1.5 × IQR fences, not anomalous behaviour.

Treatment: Treat as a categorical/foreign key for joins and grouping; do not use as a numeric feature.
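
Treating the id as a grouping key might look like the sketch below, which counts tags per user and reports how much of the log the heaviest tagger accounts for. The rows are made up for illustration and the variable names are not part of Saturn.

```python
from collections import Counter

# Made-up (userId, tag) rows for illustration; not taken from tags.csv.
rows = [
    (22, "sci-fi"), (22, "action"), (78213, "atmospheric"),
    (78213, "noir"), (78213, "twist ending"), (78213, "sci-fi"),
    (162279, "comedy"),
]

# Group by the id as a key; the id itself is never summed or averaged.
tags_per_user = Counter(user_id for user_id, _ in rows)

top_user, top_count = tags_per_user.most_common(1)[0]
top_share = top_count / len(rows)
mean_per_user = len(rows) / len(tags_per_user)

print(f"heaviest tagger {top_user}: {top_count} tags ({top_share:.1%} of rows)")
print(f"mean tags per user: {mean_per_user:.2f}")
```

Run against the real file, the same grouping would show whether the spike in the userId histogram comes from one account or several.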

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["userId"].stats

stat	value
n	2,000,072
nulls	0 (0.0%)
unique	15,848
min	22
max	162,279
mean	8.193e+04
median	78,213
std	3.811e+04
q1	68,413
q3	103,698
iqr	35,285
skew	0.1179
kurtosis	-0.2601
n_outliers	157,830
outlier_rate	0.07891
zero_rate	0
alert: outliers	7.9% of rows beyond 1.5 × IQR
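
The outlier alert can be checked by hand from the quartiles in this table. The sketch assumes the alert uses Tukey-style fences (q1 - 1.5 × IQR and q3 + 1.5 × IQR), which the alert text suggests but does not state outright.

```python
# Quartiles as reported in the stats table above.
q1, q3 = 68_413, 103_698
iqr = q3 - q1

# ASSUMPTION: Tukey-style fences are the rule behind the alert.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(value):
    """True when the value lies outside the fences."""
    return value < lower_fence or value > upper_fence


print(f"iqr={iqr} lower={lower_fence} upper={upper_fence}")
print(is_outlier(22), is_outlier(78_213), is_outlier(162_279))
```

Under that assumption, any userId below 15,485.5 or above 156,625.5 is flagged, which is why the smallest and largest ids count as outliers even though they are ordinary accounts.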

Fig 8.

Distribution of userId. Vertical dash marks the median.

movieId numeric foreign_key

movieId is an integer spanning 1 to 292,629 across 2,000,072 rows, with 51,323 unique values and no nulls. The cardinality and bounded integer range indicate a foreign key into a movies table rather than a measured quantity; summary statistics such as the mean (71,893.26) and skew (0.82) reflect ID allocation, not a meaningful distribution.

Treatment: Left-join to a movies dimension on this id; do not treat it as a numeric feature.
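
A left join keeps every tag row even when the dimension has no matching movie. The sketch below shows the idea with plain dictionaries; the `movies` lookup, its `title` field, and all rows are made up for illustration.

```python
# Made-up tag rows and a made-up movies dimension keyed by movieId.
tag_rows = [
    {"movieId": 1, "tag": "animation"},
    {"movieId": 2, "tag": "adventure"},
    {"movieId": 999, "tag": "obscure"},  # deliberately absent from the dimension
]
movies = {1: {"title": "Movie A"}, 2: {"title": "Movie B"}}

# Left join: every tag row survives; unmatched ids get title=None instead of being dropped.
joined = [
    {**row, "title": movies.get(row["movieId"], {}).get("title")}
    for row in tag_rows
]

for row in joined:
    print(row)

unmatched = [row["movieId"] for row in joined if row["title"] is None]
print("ids without a dimension row:", unmatched)
```

Listing the unmatched ids after the join is a cheap way to confirm that the key really behaves as a foreign key.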

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["movieId"].stats

stat	value
n	2,000,072
nulls	0 (0.0%)
unique	51,323
min	1
max	292,629
mean	7.189e+04
median	52,328
std	7.48e+04
q1	4,011
q3	122,294
iqr	118,283
skew	0.821
kurtosis	-0.4024
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of movieId. Vertical dash marks the median.

tag text label

This column holds short descriptive tags applied to items (likely films), with frequent values such as 'sci-fi', 'atmospheric', 'action', and 'comedy'. Entries are short: the mean length is 11.1 characters, the mean word count is 1.67, and 52.5% are a single word. Across 2,000,072 rows there are only 140,981 distinct tags, which gives 1,859,091 duplicates (a 92.95% duplicate rate), so the same labels recur heavily across items.

Treatment: Treat as a categorical/multi-label tag field; one-hot or embed top tags and group the long tail.
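
One way to act on this treatment is to keep the most frequent tags as their own categories and fold everything else into a single bucket before one-hot encoding. The sketch below uses made-up tags; `TOP_K`, `OTHER`, and `one_hot` are hypothetical names, and a real pipeline would pick a much larger cutoff.

```python
from collections import Counter

# Made-up sample tags for illustration; not rows from tags.csv.
tags = ["sci-fi", "action", "sci-fi", "atmospheric",
        "sci-fi", "action", "time loop", "slow burn"]

TOP_K = 2            # hypothetical cutoff, chosen small for the example
OTHER = "__other__"  # bucket for the long tail

top_tags = [tag for tag, _ in Counter(tags).most_common(TOP_K)]
vocabulary = top_tags + [OTHER]


def one_hot(tag):
    """Encode a tag as a 0/1 vector over the top tags plus the long-tail bucket."""
    bucket = tag if tag in top_tags else OTHER
    return [1 if bucket == name else 0 for name in vocabulary]


print(vocabulary)
for tag in ("sci-fi", "action", "time loop"):
    print(tag, one_hot(tag))
```

With 140,981 distinct tags in the real column, capping the vocabulary this way keeps the encoded width manageable while still preserving the labels that recur most.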

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["tag"].stats

stat	value
n	2,000,072
nulls	0 (0.0%)
unique	140,981
len_min	1
len_max	241
len_mean	11.11
len_median	10
len_p95	21
word_mean	1.674
word_median	1
n_empty	0
n_duplicates	1.859e+06
duplicate_rate	0.9295
vocab_size	7,273
readability_flesch_mean	17.7
emoji_rate	0
url_rate	1.3e-05
one_word_rate	0.5247
allcaps_rate	0.01055
boilerplate_rate	1e-05
alert: one_word	52.5% rows are a single word
alert: duplicates	93.0% duplicate strings
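
The duplicate figures in this table follow directly from the row and unique counts. The short check below assumes a duplicate is any row beyond the first occurrence of its string, which matches the 1,859,091 quoted in the prose.

```python
# Counts reported in the stats table above.
n_rows = 2_000_072
n_unique = 140_981

# ASSUMPTION: a duplicate is every row beyond the first occurrence of its string.
n_duplicates = n_rows - n_unique
duplicate_rate = n_duplicates / n_rows

print(n_duplicates, round(duplicate_rate, 4))
```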

Fig 10.

Character-length distribution for tag.

timestamp numeric timestamp

Unix epoch timestamps (seconds) spanning 1135429210 (late 2005) to 1697154983 (late 2023), with a median of 1574071062 (late 2019). The distribution is left-skewed (skew -1.22), and 6.39% of rows fall outside the 1.5 × IQR fences. Because the lower fence sits near 1.262e+09 (about the start of 2010) and the upper fence lies beyond the maximum, the flagged rows are all early records: a long tail that precedes the concentration of activity after 2018. There are 1,291,250 unique values across 2,000,072 rows, so many timestamps repeat, and there are no nulls or zeros.

Treatment: Convert from epoch seconds to datetime and derive features (year, month, recency); consider filtering or bucketing the older long tail, keeping in mind that the histogram under Fig 3 puts roughly 30% of rows before 2018.
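
The conversion can be sketched with the standard library alone. The example uses the minimum, median, and maximum values quoted above; `derive_features` is a hypothetical helper, and measuring recency against the newest timestamp in the file is an assumption, since any reference date would work.

```python
from datetime import datetime, timezone

# Minimum, median, and maximum epoch values quoted in the paragraph above.
samples = {"min": 1135429210, "median": 1574071062, "max": 1697154983}

NEWEST = samples["max"]  # ASSUMPTION: recency is measured from the newest row


def derive_features(epoch_seconds, reference=NEWEST):
    """Turn epoch seconds into calendar features plus whole days before the reference."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {
        "year": moment.year,
        "month": moment.month,
        "recency_days": (reference - epoch_seconds) // 86_400,
    }


for name, ts in samples.items():
    print(name, derive_features(ts))
```

Passing `tz=timezone.utc` keeps the derived year and month independent of the machine's local time zone.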

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["timestamp"].stats

stat	value
n	2,000,072
nulls	0 (0.0%)
unique	1,291,250
min	1.135e+09
max	1.697e+09
mean	1.529e+09
median	1.574e+09
std	1.291e+08
q1	1.474e+09
q3	1.615e+09
iqr	1.411e+08
skew	-1.219
kurtosis	0.6854
n_outliers	127,898
outlier_rate	0.06395
zero_rate	0
alert: outliers	6.4% of rows beyond 1.5 × IQR

Fig 11.

Distribution of timestamp. Vertical dash marks the median.
