ml-32m tags

source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv · 2,000,072 rows · 4 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: userId, movieId, a free-text tag, and a timestamp. The tag column is the most interesting feature: it has only 140,981 unique values across 2M rows (a 92.95% duplicate rate), and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew −1.22) with a median in late 2019, so most tagging happened in the later years and older records form a long tail. Tagging comes from a small pool of users (only 15,848 distinct userIds for 2M rows) and covers 51,323 distinct movies. Start by looking at the top tags and the timestamp distribution to understand what users tag and when.

citing: row_count · column_count · columns.tag.n_unique · columns.tag.stats.duplicate_rate · columns.tag.stats.one_word_rate · columns.tag.top_values · columns.timestamp.stats.skew · columns.timestamp.stats.median · columns.userId.n_unique · columns.movieId.n_unique
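
The headline ratios in the summary can be re-derived from the reported counts. The following is a minimal sketch, assuming the duplicate rate is defined as the share of rows beyond the first occurrence of each distinct tag; the profiler's exact definition is not stated in this report.

```python
# Counts copied from this report.
n_rows = 2_000_072
n_unique_tags = 140_981
n_unique_users = 15_848

# Assumed definition: every row beyond the first occurrence of a tag is a duplicate.
n_duplicates = n_rows - n_unique_tags
duplicate_rate = n_duplicates / n_rows

# Average number of tag rows per distinct user.
rows_per_user = n_rows / n_unique_users

print(n_duplicates)              # 1859091
print(round(duplicate_rate, 4))  # 0.9295
print(round(rows_per_user, 1))   # 126.2
```

Under that assumed definition, the results agree with the n_duplicates, duplicate_rate, and rows-per-user figures quoted elsewhere in the report.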

Charts the summary said to look at first

tag · Top tags are dominated by genre-like labels such as 'sci-fi', 'atmospheric', and 'action'.

tag · Most tags are very short (median 10 characters, median 1 word, mean ~1.7 words), consistent with 52.5% of entries being a single-word label.

Character-length distribution for tag (mean: 11.11).
chars	count
1 – 7	434000
7 – 13	901154
13 – 19	488283
19 – 25	122950
25 – 31	37076
31 – 37	11065
37 – 43	3109
43 – 49	1395
49 – 55	435
55 – 61	174
61 – 67	128
67 – 73	99
73 – 79	36
79 – 85	25
85 – 91	31
91 – 97	15
97 – 103	10
103 – 109	11
109 – 115	5
115 – 121	5
121 – 127	5
127 – 133	8
133 – 139	8
139 – 145	2
145 – 151	5
151 – 157	3
157 – 163	3
163 – 169	1
169 – 175	1
175 – 181	3
181 – 187	2
187 – 193	3
193 – 199	6
199 – 205	2
205 – 211	5
211 – 217	3
217 – 223	0
223 – 229	2
229 – 235	3
235 – 241	1

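timestamp · Timestamps are left-skewed toward recent years. The heaviest bins are 1.515e+09 – 1.529e+09 (237,228 rows, roughly the first half of 2018) and 1.599e+09 – 1.627e+09 (519,979 rows across two bins, roughly late 2020 to mid 2021), so check those periods first when looking at how tagging activity ramped up.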

Histogram bins for timestamp (median: 1574071062.0).
bin	count
1.135e+09 – 1.149e+09	18141
1.149e+09 – 1.164e+09	8578
1.164e+09 – 1.178e+09	14546
1.178e+09 – 1.192e+09	8773
1.192e+09 – 1.206e+09	7339
1.206e+09 – 1.22e+09	7288
1.22e+09 – 1.234e+09	6940
1.234e+09 – 1.248e+09	34739
1.248e+09 – 1.262e+09	21415
1.262e+09 – 1.276e+09	20116
1.276e+09 – 1.29e+09	19130
1.29e+09 – 1.304e+09	25517
1.304e+09 – 1.318e+09	23164
1.318e+09 – 1.332e+09	17686
1.332e+09 – 1.346e+09	18163
1.346e+09 – 1.36e+09	17679
1.36e+09 – 1.374e+09	26680
1.374e+09 – 1.388e+09	16406
1.388e+09 – 1.402e+09	17074
1.402e+09 – 1.416e+09	13818
1.416e+09 – 1.43e+09	24668
1.43e+09 – 1.444e+09	46130
1.444e+09 – 1.458e+09	44122
1.458e+09 – 1.472e+09	38869
1.472e+09 – 1.487e+09	35079
1.487e+09 – 1.501e+09	38009
1.501e+09 – 1.515e+09	34375
1.515e+09 – 1.529e+09	237228
1.529e+09 – 1.543e+09	64476
1.543e+09 – 1.557e+09	46622
1.557e+09 – 1.571e+09	38050
1.571e+09 – 1.585e+09	41917
1.585e+09 – 1.599e+09	82401
1.599e+09 – 1.613e+09	297429
1.613e+09 – 1.627e+09	222550
1.627e+09 – 1.641e+09	103520
1.641e+09 – 1.655e+09	97307
1.655e+09 – 1.669e+09	68594
1.669e+09 – 1.683e+09	61631
1.683e+09 – 1.697e+09	33903

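userId · Only ~15.8K distinct users generated 2M tags, and a single histogram bin (7.709e+04 – 8.115e+04) holds 764,653 rows, about 38% of the total; look for heavy taggers in that id range and for outliers.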

Histogram bins for userId (median: 78213.0).
bin	count
22 – 4078	22923
4078 – 8135	37252
8135 – 1.219e+04	25374
1.219e+04 – 1.625e+04	24398
1.625e+04 – 2.03e+04	29784
2.03e+04 – 2.436e+04	29113
2.436e+04 – 2.842e+04	30580
2.842e+04 – 3.247e+04	41492
3.247e+04 – 3.653e+04	47022
3.653e+04 – 4.059e+04	33643
4.059e+04 – 4.464e+04	25173
4.464e+04 – 4.87e+04	27372
4.87e+04 – 5.276e+04	22925
5.276e+04 – 5.681e+04	17507
5.681e+04 – 6.087e+04	38134
6.087e+04 – 6.492e+04	24762
6.492e+04 – 6.898e+04	50610
6.898e+04 – 7.304e+04	28819
7.304e+04 – 7.709e+04	26238
7.709e+04 – 8.115e+04	764653
8.115e+04 – 8.521e+04	24570
8.521e+04 – 8.926e+04	20395
8.926e+04 – 9.332e+04	26016
9.332e+04 – 9.738e+04	26137
9.738e+04 – 1.014e+05	31546
1.014e+05 – 1.055e+05	40766
1.055e+05 – 1.095e+05	38876
1.095e+05 – 1.136e+05	30073
1.136e+05 – 1.177e+05	20068
1.177e+05 – 1.217e+05	52404
1.217e+05 – 1.258e+05	31800
1.258e+05 – 1.298e+05	25923
1.298e+05 – 1.339e+05	22044
1.339e+05 – 1.379e+05	29770
1.379e+05 – 1.42e+05	21515
1.42e+05 – 1.461e+05	42803
1.461e+05 – 1.501e+05	53003
1.501e+05 – 1.542e+05	31967
1.542e+05 – 1.582e+05	35481
1.582e+05 – 1.623e+05	47141

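movieId · Tag coverage spans ~51K movies with a right-skewed id distribution (skew 0.82): the lowest-id bin (1 – 7317) alone holds 740,560 rows, about 37% of the total. Lower ids are probably older catalogue entries, but that is an assumption this profile cannot confirm.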

Histogram bins for movieId (median: 52328.0).
bin	count
1 – 7317	740560
7317 – 1.463e+04	78542
1.463e+04 – 2.195e+04	0
2.195e+04 – 2.926e+04	42089
2.926e+04 – 3.658e+04	44405
3.658e+04 – 4.39e+04	25577
4.39e+04 – 5.121e+04	60448
5.121e+04 – 5.853e+04	60408
5.853e+04 – 6.584e+04	48280
6.584e+04 – 7.316e+04	60162
7.316e+04 – 8.047e+04	49499
8.047e+04 – 8.779e+04	48142
8.779e+04 – 9.511e+04	54096
9.511e+04 – 1.024e+05	50634
1.024e+05 – 1.097e+05	61455
1.097e+05 – 1.171e+05	56213
1.171e+05 – 1.244e+05	47682
1.244e+05 – 1.317e+05	20624
1.317e+05 – 1.39e+05	32253
1.39e+05 – 1.463e+05	29242
1.463e+05 – 1.536e+05	20987
1.536e+05 – 1.609e+05	24359
1.609e+05 – 1.683e+05	47171
1.683e+05 – 1.756e+05	32779
1.756e+05 – 1.829e+05	39934
1.829e+05 – 1.902e+05	42009
1.902e+05 – 1.975e+05	30827
1.975e+05 – 2.048e+05	33387
2.048e+05 – 2.122e+05	25164
2.122e+05 – 2.195e+05	15290
2.195e+05 – 2.268e+05	14099
2.268e+05 – 2.341e+05	5384
2.341e+05 – 2.414e+05	10225
2.414e+05 – 2.487e+05	3451
2.487e+05 – 2.561e+05	9558
2.561e+05 – 2.634e+05	7446
2.634e+05 – 2.707e+05	6938
2.707e+05 – 2.78e+05	9782
2.78e+05 – 2.853e+05	7806
2.853e+05 – 2.926e+05	3165

Schema

4 columns

Per-column summary.
column	type	nulls	unique	alerts
userId	numeric	0.0%	15,848	outliers
movieId	numeric	0.0%	51,323	
tag	text	0.0%	140,981	one_word, duplicates
timestamp	numeric	0.0%	1,291,250	outliers

userId

numeric identifier outliers

This is almost certainly a user identifier stored as an integer, with 15,848 unique values spread across roughly 2M rows (so ~126 rows per user on average). Values range from 22 to 162,279 with no nulls or zeros, and the distribution is near-symmetric (skew 0.12, kurtosis -0.26), consistent with sparsely allocated account IDs rather than a meaningful numeric quantity. The 157,830 flagged outliers (7.9%) are an artefact of treating IDs as numeric — they are not statistical anomalies in any business sense. Treatment: Treat as a categorical/foreign key for joins and grouping; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 2,000,072
nulls: 0 (0.0%)
unique: 15,848
min: 22
max: 162,279
mean: 8.193e+04
median: 78,213
std: 3.811e+04
q1: 68,413
q3: 103,698
iqr: 35,285
skew: 0.1179
kurtosis: -0.2601
n_outliers: 157,830
outlier_rate: 0.07891
zero_rate: 0
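
Treating userId as a grouping key rather than a number makes it possible to measure how concentrated tagging is. The following is a minimal sketch on made-up rows (not taken from tags.csv); the helper name `top_user_share` is hypothetical.

```python
from collections import Counter

# Illustrative rows only (userId, movieId, tag); not taken from tags.csv.
rows = [
    (101, 1, "sci-fi"),
    (101, 2, "action"),
    (101, 3, "atmospheric"),
    (101, 4, "comedy"),
    (202, 1, "sci-fi"),
    (303, 2, "action"),
]


def top_user_share(user_ids, k=1):
    """Return the fraction of rows contributed by the k most active users."""
    counts = Counter(user_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no rows, so no concentration to report
    top_total = sum(count for _, count in counts.most_common(k))
    return top_total / total


user_ids = [user_id for user_id, _, _ in rows]
share = top_user_share(user_ids, k=1)
print(f"top user share: {share:.2f}")  # top user share: 0.67
```

Run against the real file, the same grouping would show whether a handful of accounts explains the heavy histogram bin noted for this column.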

movieId

numeric foreign_key

Integer movieId spanning 1 to 292,629 across 2,000,072 rows with 51,323 unique values and no nulls. The cardinality and bounded integer range indicate a foreign key into a movies table rather than a measured quantity; summary statistics such as the mean (71,893.26) and skew (0.82) reflect ID allocation, not a meaningful distribution. Treatment: Left-join on this id to a movies dimension; do not treat it as a numeric feature. high · anthropic:claude-opus-4-7

n: 2,000,072
nulls: 0 (0.0%)
unique: 51,323
min: 1
max: 292,629
mean: 7.189e+04
median: 52,328
std: 7.48e+04
q1: 4,011
q3: 122,294
iqr: 118,283
skew: 0.821
kurtosis: -0.4024
n_outliers: 0
outlier_rate: 0
zero_rate: 0
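
The left-join treatment can be sketched as a dictionary lookup. This is an illustration only: the `movies` mapping and its titles are placeholders, and the real movies dimension is not part of this profile.

```python
# Hypothetical movies dimension keyed by movieId; titles are placeholders.
movies = {1: "Movie A", 2: "Movie B"}

# Illustrative tag rows: (userId, movieId, tag).
tag_rows = [(101, 1, "sci-fi"), (202, 2, "action"), (303, 999, "comedy")]


def left_join_titles(tag_rows, movies):
    """Attach a title to every tag row, keeping rows with no matching movie."""
    joined = []
    for user_id, movie_id, tag in tag_rows:
        # dict.get returns None for unmatched ids, mirroring a left join's NULL.
        joined.append((user_id, movie_id, tag, movies.get(movie_id)))
    return joined


joined = left_join_titles(tag_rows, movies)
unmatched = [row for row in joined if row[3] is None]
print(len(joined), len(unmatched))  # 3 1
```

A left join keeps every tag row, so counting the unmatched rows afterwards is a quick check for movieIds missing from the dimension.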

tag

text label one_word duplicates

This column holds short descriptive tags applied to items (likely films), with frequent values like 'sci-fi', 'atmospheric', 'action', and 'comedy'. Entries are tiny — mean length 11.1 characters and word_mean 1.67 — and 52.5% are a single word. Despite 2,000,072 rows there are only 140,981 distinct tags and a 92.95% duplicate rate (1,859,091 duplicates), so the same labels recur heavily across items. Treatment: Treat as a categorical/multi-label tag field; one-hot or embed top tags and group the long tail. high · anthropic:claude-opus-4-7

n: 2,000,072
nulls: 0 (0.0%)
unique: 140,981
len_min: 1
len_max: 241
len_mean: 11.11
len_median: 10
len_p95: 21
word_mean: 1.674
word_median: 1
n_empty: 0
n_duplicates: 1,859,091
duplicate_rate: 0.9295
vocab_size: 7,273
readability_flesch_mean: 17.7
emoji_rate: 0
url_rate: 1.3e-05
one_word_rate: 0.5247
allcaps_rate: 0.01055
boilerplate_rate: 1e-05
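
One way to "group the long tail" is to keep the most frequent tags and map the rest to a single bucket. The following is a minimal sketch, not the profiler's method. The case and whitespace normalisation is an added assumption (the report does not say whether tags are normalised), and `bucket_long_tail` and the `__other__` label are hypothetical names.

```python
from collections import Counter


def bucket_long_tail(tags, top_k, other_label="__other__"):
    """Keep the top_k most frequent tags and map everything else to other_label."""
    # Assumed normalisation: fold case and trim whitespace before counting.
    normalised = [tag.strip().lower() for tag in tags]
    keep = {tag for tag, _ in Counter(normalised).most_common(top_k)}
    return [tag if tag in keep else other_label for tag in normalised]


# Illustrative tags only; not sampled from tags.csv.
tags = ["sci-fi", "Sci-Fi", "action", "atmospheric", "action", "based on a book"]
bucketed = bucket_long_tail(tags, top_k=2)
print(bucketed)
# ['sci-fi', 'sci-fi', 'action', '__other__', 'action', '__other__']
```

The bucketed values can then be one-hot encoded without creating a column for each of the 140,981 distinct tags.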

timestamp

numeric timestamp outliers

Unix epoch timestamps in seconds, spanning 1135429210 (late December 2005) to 1697154983 (October 2023), with a median of 1574071062 (November 2019). The distribution is left-skewed (skew -1.22), and 6.39% of rows (127,898) fall outside the IQR fence. Assuming the conventional 1.5 × IQR rule, the upper fence lies beyond the maximum, so the flagged rows are all older records, roughly those before 2010; they form a long tail ahead of the heavier activity from 2018 onward. With 1,291,250 unique values across 2,000,072 rows, some timestamps repeat, and there are no nulls or zeros. Treatment: Convert from epoch seconds to datetime and derive features (year, month, recency); consider filtering or bucketing the pre-2018 long tail. high · anthropic:claude-opus-4-7

n: 2,000,072
nulls: 0 (0.0%)
unique: 1,291,250
min: 1.135e+09
max: 1.697e+09
mean: 1.529e+09
median: 1.574e+09
std: 1.291e+08
q1: 1.474e+09
q3: 1.615e+09
iqr: 1.411e+08
skew: -1.219
kurtosis: 0.6854
n_outliers: 127,898
outlier_rate: 0.06395
zero_rate: 0
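
The epoch-to-datetime conversion and the outlier fence can be checked with the standard library. The following is a minimal sketch, assuming the values are seconds in UTC and that the profiler uses the conventional 1.5 × IQR rule; neither is confirmed by this report. The quartiles are the rounded figures from the stats block, so the fence dates are approximate.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Minimum, median, and maximum as quoted in the column description.
ts_min, ts_median, ts_max = 1135429210, 1574071062, 1697154983
for label, ts in [("min", ts_min), ("median", ts_median), ("max", ts_max)]:
    print(label, epoch_to_utc(ts).date().isoformat())

# Rounded quartiles from the stats block; 1.5 is the assumed fence multiplier.
q1, q3 = 1.474e9, 1.615e9
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("lower fence year:", epoch_to_utc(lower_fence).year)
print("upper fence beyond max:", upper_fence > ts_max)
```

Under these assumptions the lower fence falls near the start of 2010 and the upper fence exceeds the latest timestamp, so only early records would be flagged as outliers.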