saturn·

ml 32m tags

source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv · 2,000,072 rows · 4 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: userId, movieId, a free-text tag, and a timestamp. The tag column is the most interesting feature: it has only 140,981 unique values across 2M rows (a 92.95% duplicate rate), and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew −1.22): most rows are recent and a long tail reaches back to older records, suggesting tagging picked up in later years. Only 15,848 distinct userIds account for the 2M rows, about 126 tags per user on average. Start by looking at the top tags and the timestamp distribution to understand what users tag and when.

citing: row_count · column_count · columns.tag.n_unique · columns.tag.stats.duplicate_rate · columns.tag.stats.one_word_rate · columns.tag.top_values · columns.timestamp.stats.skew · columns.timestamp.stats.median · columns.userId.n_unique · columns.movieId.n_unique
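
A minimal sketch of that first look, using only the Python standard library. The SAMPLE rows are made up for illustration (they are not rows from the file), and the function name top_tags_and_years is hypothetical; the column names follow the schema in this report.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Made-up rows in the tags.csv column layout, for illustration only.
SAMPLE = """userId,movieId,tag,timestamp
22,1,sci-fi,1135429210
22,2,atmospheric,1574071062
34,1,sci-fi,1574071100
34,3,action,1697154983
"""

def top_tags_and_years(lines, n=3):
    """Return the n most frequent tags and the number of tag events per UTC year."""
    tag_counts = Counter()
    year_counts = Counter()
    for row in csv.DictReader(lines):
        tag_counts[row["tag"]] += 1
        # Timestamps are epoch seconds; convert in UTC to get a stable year.
        year = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc).year
        year_counts[year] += 1
    return tag_counts.most_common(n), dict(sorted(year_counts.items()))

top, per_year = top_tags_and_years(io.StringIO(SAMPLE))
```

For the real file, the same function should accept an open file handle, for example one created with open(path, newline="", encoding="utf-8").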

Schema

4 columns
Per-column summary.
column     type     nulls  unique     alerts
userId     numeric  0.0%   15,848     outliers
movieId    numeric  0.0%   51,323     none
tag        text     0.0%   140,981    one_word, duplicates
timestamp  numeric  0.0%   1,291,250  outliers

userId

numeric identifier outliers
This is almost certainly a user identifier stored as an integer, with 15,848 unique values spread across 2,000,072 rows (about 126 rows per user on average). Values range from 22 to 162,279 with no nulls or zeros, and the distribution is near-symmetric (skew 0.12, kurtosis -0.26), consistent with sparsely allocated account IDs rather than a meaningful numeric quantity. The 157,830 flagged outliers (7.9%) are an artefact of treating IDs as numeric; they are not anomalies in any business sense. Treatment: Treat as a categorical/foreign key for joins and grouping; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n: 2,000,072
nulls: 0 (0.0%)
unique: 15,848
min: 22
max: 162,279
mean: 8.193e+04
median: 78,213
std: 3.811e+04
q1: 68,413
q3: 103,698
iqr: 35,285
skew: 0.1179
kurtosis: -0.2601
n_outliers: 157,830
outlier_rate: 0.07891
zero_rate: 0
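
To see why identifiers get flagged, the fences can be recomputed from the quartiles above. This is a sketch that assumes the profiler uses the conventional 1.5×IQR rule, which the report does not state; tukey_fences is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences; k=1.5 is an assumed multiplier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported for userId in this report.
lower, upper = tukey_fences(68_413, 103_698)

# The reported range is 22 to 162,279, so IDs at both ends fall outside the fences.
min_is_flagged = 22 < lower
max_is_flagged = 162_279 > upper
```

Under that assumption the fences come out at 15,485.5 and 156,625.5, so the flag only reflects where an ID happens to sit in the allocation range.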

movieId

numeric foreign_key
Integer movieId spanning 1 to 292,629 across 2,000,072 rows with 51,323 unique values and no nulls. The cardinality and bounded integer range indicate a foreign key into a movies table rather than a measured quantity; summary statistics like mean (71,893.26) and skew (0.82) reflect ID allocation, not a meaningful distribution. Treatment: Left-join on this ID to a movies dimension; do not treat it as a numeric feature. high · anthropic:claude-opus-4-7
n: 2,000,072
nulls: 0 (0.0%)
unique: 51,323
min: 1
max: 292,629
mean: 7.189e+04
median: 52,328
std: 7.48e+04
q1: 4,011
q3: 122,294
iqr: 118,283
skew: 0.821
kurtosis: -0.4024
n_outliers: 0
outlier_rate: 0
zero_rate: 0
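
The left join in the treatment note could look like the sketch below. The movies lookup, its titles, and the tag rows are all hypothetical placeholders; the report does not describe the movies table, so its shape here is an assumption.

```python
def left_join_titles(tag_rows, movies):
    """Attach a title to each tag row; unmatched movieIds keep title=None."""
    return [{**row, "title": movies.get(row["movieId"])} for row in tag_rows]

# Hypothetical movies dimension (movieId -> title) and made-up tag rows.
movies = {1: "Movie A", 2: "Movie B"}
tag_rows = [
    {"userId": 22, "movieId": 1, "tag": "sci-fi"},
    {"userId": 34, "movieId": 999, "tag": "action"},
]

joined = left_join_titles(tag_rows, movies)
```

Because it is a left join, every tag row survives, and rows whose movieId is missing from the dimension are easy to find by filtering on a None title.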

tag

text label one_word duplicates
This column holds short descriptive tags applied to items (likely films), with frequent values like 'sci-fi', 'atmospheric', 'action', and 'comedy'. Entries are short: the mean length is 11.1 characters, the mean word count is 1.67, and 52.5% are a single word. Despite 2,000,072 rows there are only 140,981 distinct tags and a 92.95% duplicate rate (1,859,091 duplicates, that is, rows minus distinct values), so the same labels recur heavily across items. Treatment: Treat as a categorical/multi-label tag field; one-hot or embed top tags and group the long tail. high · anthropic:claude-opus-4-7
n: 2,000,072
nulls: 0 (0.0%)
unique: 140,981
len_min: 1
len_max: 241
len_mean: 11.11
len_median: 10
len_p95: 21
word_mean: 1.674
word_median: 1
n_empty: 0
n_duplicates: 1.859e+06
duplicate_rate: 0.9295
vocab_size: 7,273
readability_flesch_mean: 17.7
emoji_rate: 0
url_rate: 1.3e-05
one_word_rate: 0.5247
allcaps_rate: 0.01055
boilerplate_rate: 1e-05
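
Grouping the long tail, as the treatment note suggests, might be sketched as follows. The function name bucket_long_tail, the "__other__" label, and the sample tags are illustrative assumptions; a suitable top_k for the real data is not given in the report.

```python
from collections import Counter

def bucket_long_tail(tags, top_k, other="__other__"):
    """Keep the top_k most frequent tags and replace all others with one label."""
    counts = Counter(tags)
    keep = {tag for tag, _ in counts.most_common(top_k)}
    return [tag if tag in keep else other for tag in tags]

# Made-up tag values for illustration only.
sample = ["sci-fi", "action", "sci-fi", "atmospheric", "action", "sci-fi", "quirky"]
bucketed = bucket_long_tail(sample, top_k=2)
```

The bucketed column then has at most top_k + 1 categories, which keeps one-hot encoding manageable.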

timestamp

numeric timestamp outliers
Unix epoch timestamps spanning roughly 1135429210 (late 2005) to 1697154983 (late 2023), with a median of 1574071062 (late 2019). The distribution is left-skewed (skew -1.22) and 6.39% of rows fall outside the IQR fence, indicating a long tail of older records before activity concentrated post-2018. The flagged rows all sit on the old side if the profiler uses the conventional 1.5×IQR rule (an assumption): the reported quartiles put the lower fence near 1.262e+09 (around the start of 2010), and the upper fence lies beyond the maximum. With 1,291,250 unique values across 2,000,072 rows there is meaningful repetition but no nulls or zeros. Treatment: Convert from epoch seconds to datetime and derive features (year, month, recency); consider filtering or bucketing the pre-2018 long tail. high · anthropic:claude-opus-4-7
n: 2,000,072
nulls: 0 (0.0%)
unique: 1,291,250
min: 1.135e+09
max: 1.697e+09
mean: 1.529e+09
median: 1.574e+09
std: 1.291e+08
q1: 1.474e+09
q3: 1.615e+09
iqr: 1.411e+08
skew: -1.219
kurtosis: 0.6854
n_outliers: 127,898
outlier_rate: 0.06395
zero_rate: 0
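
The conversion and derived features from the treatment note can be sketched with the standard library. The helper name time_features and the choice of the dataset maximum as the recency reference are assumptions made for illustration; UTC is also assumed, since the report does not state a time zone.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400

def time_features(epoch_seconds, reference_epoch):
    """Derive year, month and whole-day recency from an epoch-seconds timestamp."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.month,
        # Whole days between this event and the reference point.
        "recency_days": (reference_epoch - epoch_seconds) // SECONDS_PER_DAY,
    }

# Median and maximum timestamps as quoted in this report.
median_features = time_features(1_574_071_062, reference_epoch=1_697_154_983)
```

Measuring recency against the newest timestamp in the data, rather than against the current date, keeps the feature stable when the analysis is rerun later.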