saturn

/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv | 2,000,072 rows | sample n=2,000,072 | seed 42 | 2026-05-01T23:36:36+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv
Total rows: 2,000,072
Profiled sample: 2,000,072
Columns: 4
Generated: 2026-05-01T23:36:36+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: userId, movieId, a free-text tag, and a timestamp. The tag column is the most interesting feature: it has only 140,981 unique values across roughly 2M rows (a 92.95% duplicate rate), and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew -1.22): most rows are recent and a long tail reaches back to older records, suggesting tagging picked up in later years. The userId column has only 15,848 distinct values, so each user contributes about 126 tags on average. Start by looking at the top tags and the timestamp distribution to understand what users tag and when.
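One way to act on the "top tags" suggestion is to count tag frequencies with Python's standard library. This is a minimal sketch, not how saturn computes its lists: it assumes the CSV has a header row containing a `tag` column, and the `top_tags` helper and the three-row sample are hypothetical, made up for illustration.

```python
import csv
import io
from collections import Counter


def top_tags(lines, n=3):
    """Return the n most frequent values of the 'tag' column.

    `lines` is any iterable of CSV lines (an open file object works).
    Assumes a header row with a column named 'tag'.
    """
    reader = csv.DictReader(lines)
    counts = Counter(row["tag"] for row in reader)
    return counts.most_common(n)


# Tiny made-up sample, for illustration only (not rows from the real file).
sample = io.StringIO(
    "userId,movieId,tag,timestamp\n"
    "1,10,sci-fi,1574071062\n"
    "2,10,sci-fi,1574071063\n"
    "2,11,atmospheric,1574071064\n"
)
print(top_tags(sample))  # [('sci-fi', 2), ('atmospheric', 1)]
```

Against the real file, the same helper could be called as `top_tags(open(path, newline="", encoding="utf-8"))`. Counting is case-sensitive here, so 'Sci-Fi' and 'sci-fi' would be tallied separately unless the tags are normalised first.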

userId high anthropic:claude-opus-4-7

This is almost certainly a user identifier stored as an integer, with 15,848 unique values spread across roughly 2M rows (about 126 rows per user on average). Values range from 22 to 162,279 with no nulls or zeros, and the distribution is near-symmetric (skew 0.12, kurtosis -0.26), consistent with sparsely allocated account IDs rather than a meaningful numeric quantity. The 157,830 flagged outliers (7.9%) are an artefact of treating IDs as numeric; they are not statistical anomalies in any business sense.
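The outlier count can be put in context by reconstructing the fences from the reported quartiles. The sketch below assumes saturn applies the standard Tukey rule (values beyond 1.5 × IQR from the quartiles), which the report's "beyond 1.5 IQR" label suggests but does not spell out; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the userId stats in this report.
lower, upper = tukey_fences(68_413, 103_698)
print(lower, upper)  # 15485.5 156625.5
```

Under that assumption, any row with a userId below 15,485.5 or above 156,625.5 is flagged. Both fences fall inside the observed range of 22 to 162,279, so valid IDs at either end are counted as outliers.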

movieId high anthropic:claude-opus-4-7

The movieId column is an integer spanning 1 to 292,629 across 2,000,072 rows, with 51,323 unique values and no nulls. The cardinality and bounded integer range indicate a foreign key into a movies table rather than a measured quantity; summary statistics like mean (71,893.26) and skew (0.82) reflect ID allocation, not a meaningful distribution.

tag high anthropic:claude-opus-4-7

This column holds short descriptive tags applied to items (likely films), with frequent values like 'sci-fi', 'atmospheric', 'action', and 'comedy'. Entries are tiny: mean length is 11.1 characters, word_mean is 1.67, and 52.5% are a single word. Despite 2,000,072 rows there are only 140,981 distinct tags and a 92.95% duplicate rate (1,859,091 duplicates), so the same labels recur heavily across items.
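The duplicate figures are consistent with counting every row beyond the first occurrence of each distinct tag. That definition is an assumption inferred from the numbers, not something the report states, and `duplicate_stats` is a hypothetical helper:

```python
def duplicate_stats(rows, unique):
    """Return (n_duplicates, duplicate_rate), assuming duplicates = rows - unique."""
    n_duplicates = rows - unique
    return n_duplicates, n_duplicates / rows


# Counts copied from the tag stats in this report.
n_dup, rate = duplicate_stats(2_000_072, 140_981)
print(n_dup)           # 1859091
print(round(rate, 4))  # 0.9295
```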

timestamp high anthropic:claude-opus-4-7

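Unix epoch timestamps spanning 1135429210 (late 2005) to 1697154983 (late 2023), with a median of 1574071062 (late 2019). The distribution is left-skewed (skew -1.22), and 6.39% of rows fall outside the 1.5 × IQR fences. Under the usual fence rule, the upper fence (q3 + 1.5 × IQR = 1826426991.5) lies beyond the maximum, so every flagged row sits in the long tail of older records below the lower fence of 1261927331.5 (late December 2009). The middle half of rows falls between q1 (September 2016) and q3 (March 2021). With 1,291,250 unique values across 2,000,072 rows there is meaningful repetition but no nulls or zeros.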
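The calendar dates quoted for these epoch values can be checked with a short conversion. This sketch assumes the timestamps are seconds since the Unix epoch interpreted in UTC; `epoch_to_date` is a hypothetical helper.

```python
from datetime import datetime, timezone


def epoch_to_date(seconds):
    """Convert Unix epoch seconds to an ISO date string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()


# Values copied from the timestamp stats in this report.
for label, value in [
    ("min", 1_135_429_210),
    ("median", 1_574_071_062),
    ("max", 1_697_154_983),
]:
    print(label, epoch_to_date(value))
# min 2005-12-24
# median 2019-11-18
# max 2023-10-12
```

Passing an explicit `tz` keeps the result independent of the machine's local time zone, which matters for the maximum: it falls a few minutes before midnight UTC.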

Numeric correlation

userId numeric

7.9% of rows beyond 1.5 × IQR
rows: 2,000,072
null: 0 (0.0%)
unique: 15,848
min: 22.000
max: 162,279
mean: 81,929
median: 78,213
std: 38,106
q1: 68,413
q3: 103,698
iqr: 35,285
skew: 0.118
kurtosis: -0.260
n_outliers: 157,830
outlier_rate: 0.079
zero_rate: 0.000

movieId numeric

rows: 2,000,072
null: 0 (0.0%)
unique: 51,323
min: 1.000
max: 292,629
mean: 71,893
median: 52,328
std: 74,804
q1: 4,011
q3: 122,294
iqr: 118,283
skew: 0.821
kurtosis: -0.402
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000

tag text

52.5% of rows are a single word; 93.0% are duplicate strings
rows: 2,000,072
null: 0 (0.0%)
unique: 140,981
len_min: 1
len_max: 241
len_mean: 11.110
len_median: 10.000
len_p95: 21.000
word_mean: 1.674
word_median: 1.000
n_empty: 0
n_duplicates: 1,859,091
duplicate_rate: 0.930
vocab_size: 7,273
readability_flesch_mean: 17.696
emoji_rate: 0.000
url_rate: 1.30e-05
one_word_rate: 0.525
allcaps_rate: 0.011
boilerplate_rate: 1.00e-05
Sample values (first 10)
  1. superhero
  2. Mission Impossible
  3. magic
  4. original horror
  5. deception
  6. creative
  7. degradation
  8. Star Trek
  9. bittersweet
  10. music

timestamp numeric

6.4% of rows beyond 1.5 × IQR
rows: 2,000,072
null: 0 (0.0%)
unique: 1,291,250
min: 1,135,429,210
max: 1,697,154,983
mean: 1,528,913,886
median: 1,574,071,062
std: 129,083,513
q1: 1,473,614,704
q3: 1,614,739,619
iqr: 141,124,915
skew: -1.219
kurtosis: 0.685
n_outliers: 127,898
outlier_rate: 0.064
zero_rate: 0.000