ml 32m tags
Reading
This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: a free-text tag, a timestamp, a userId, and a movieId. The tag column is the most interesting feature: it has only 140,981 unique values across 2M rows (a 92.95% duplicate rate), and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew −1.22), meaning most rows are recent and older activity forms a long tail, which suggests tagging picked up in later years. Only 15,848 distinct userIds account for the 2M rows (about 126 rows per user on average), so tagging comes from a relatively small set of users. Start with the top tags and the timestamp distribution to understand what users tag and when.
citing: row_count · column_count · columns.tag.n_unique · columns.tag.stats.duplicate_rate · columns.tag.stats.one_word_rate · columns.tag.top_values · columns.timestamp.stats.skew · columns.timestamp.stats.median · columns.userId.n_unique · columns.movieId.n_unique
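The headline ratios in the summary follow from the row and unique counts. A minimal arithmetic check in Python, assuming (as the reported n_duplicates figure suggests) that the duplicate rate is defined as rows minus distinct values, divided by rows:

```python
# Counts copied from the profile.
n_rows = 2_000_072
n_unique_tags = 140_981
n_unique_users = 15_848

# Assumption: duplicate_rate = (rows - distinct values) / rows.
n_duplicates = n_rows - n_unique_tags
duplicate_rate = n_duplicates / n_rows

# Average number of tag rows per distinct user.
rows_per_user = n_rows / n_unique_users

print(n_duplicates)              # 1859091
print(round(duplicate_rate, 4))  # 0.9295
print(round(rows_per_user, 1))   # 126.2
```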
Charts the summary said to look at first
tag length (characters)
| chars | count |
|---|---|
| 1 – 7 | 434000 |
| 7 – 13 | 901154 |
| 13 – 19 | 488283 |
| 19 – 25 | 122950 |
| 25 – 31 | 37076 |
| 31 – 37 | 11065 |
| 37 – 43 | 3109 |
| 43 – 49 | 1395 |
| 49 – 55 | 435 |
| 55 – 61 | 174 |
| 61 – 67 | 128 |
| 67 – 73 | 99 |
| 73 – 79 | 36 |
| 79 – 85 | 25 |
| 85 – 91 | 31 |
| 91 – 97 | 15 |
| 97 – 103 | 10 |
| 103 – 109 | 11 |
| 109 – 115 | 5 |
| 115 – 121 | 5 |
| 121 – 127 | 5 |
| 127 – 133 | 8 |
| 133 – 139 | 8 |
| 139 – 145 | 2 |
| 145 – 151 | 5 |
| 151 – 157 | 3 |
| 157 – 163 | 3 |
| 163 – 169 | 1 |
| 169 – 175 | 1 |
| 175 – 181 | 3 |
| 181 – 187 | 2 |
| 187 – 193 | 3 |
| 193 – 199 | 6 |
| 199 – 205 | 2 |
| 205 – 211 | 5 |
| 211 – 217 | 3 |
| 217 – 223 | 0 |
| 223 – 229 | 2 |
| 229 – 235 | 3 |
| 235 – 241 | 1 |
timestamp (epoch seconds)
| bin | count |
|---|---|
| 1.135e+09 – 1.149e+09 | 18141 |
| 1.149e+09 – 1.164e+09 | 8578 |
| 1.164e+09 – 1.178e+09 | 14546 |
| 1.178e+09 – 1.192e+09 | 8773 |
| 1.192e+09 – 1.206e+09 | 7339 |
| 1.206e+09 – 1.22e+09 | 7288 |
| 1.22e+09 – 1.234e+09 | 6940 |
| 1.234e+09 – 1.248e+09 | 34739 |
| 1.248e+09 – 1.262e+09 | 21415 |
| 1.262e+09 – 1.276e+09 | 20116 |
| 1.276e+09 – 1.29e+09 | 19130 |
| 1.29e+09 – 1.304e+09 | 25517 |
| 1.304e+09 – 1.318e+09 | 23164 |
| 1.318e+09 – 1.332e+09 | 17686 |
| 1.332e+09 – 1.346e+09 | 18163 |
| 1.346e+09 – 1.36e+09 | 17679 |
| 1.36e+09 – 1.374e+09 | 26680 |
| 1.374e+09 – 1.388e+09 | 16406 |
| 1.388e+09 – 1.402e+09 | 17074 |
| 1.402e+09 – 1.416e+09 | 13818 |
| 1.416e+09 – 1.43e+09 | 24668 |
| 1.43e+09 – 1.444e+09 | 46130 |
| 1.444e+09 – 1.458e+09 | 44122 |
| 1.458e+09 – 1.472e+09 | 38869 |
| 1.472e+09 – 1.487e+09 | 35079 |
| 1.487e+09 – 1.501e+09 | 38009 |
| 1.501e+09 – 1.515e+09 | 34375 |
| 1.515e+09 – 1.529e+09 | 237228 |
| 1.529e+09 – 1.543e+09 | 64476 |
| 1.543e+09 – 1.557e+09 | 46622 |
| 1.557e+09 – 1.571e+09 | 38050 |
| 1.571e+09 – 1.585e+09 | 41917 |
| 1.585e+09 – 1.599e+09 | 82401 |
| 1.599e+09 – 1.613e+09 | 297429 |
| 1.613e+09 – 1.627e+09 | 222550 |
| 1.627e+09 – 1.641e+09 | 103520 |
| 1.641e+09 – 1.655e+09 | 97307 |
| 1.655e+09 – 1.669e+09 | 68594 |
| 1.669e+09 – 1.683e+09 | 61631 |
| 1.683e+09 – 1.697e+09 | 33903 |
userId
| bin | count |
|---|---|
| 22 – 4078 | 22923 |
| 4078 – 8135 | 37252 |
| 8135 – 1.219e+04 | 25374 |
| 1.219e+04 – 1.625e+04 | 24398 |
| 1.625e+04 – 2.03e+04 | 29784 |
| 2.03e+04 – 2.436e+04 | 29113 |
| 2.436e+04 – 2.842e+04 | 30580 |
| 2.842e+04 – 3.247e+04 | 41492 |
| 3.247e+04 – 3.653e+04 | 47022 |
| 3.653e+04 – 4.059e+04 | 33643 |
| 4.059e+04 – 4.464e+04 | 25173 |
| 4.464e+04 – 4.87e+04 | 27372 |
| 4.87e+04 – 5.276e+04 | 22925 |
| 5.276e+04 – 5.681e+04 | 17507 |
| 5.681e+04 – 6.087e+04 | 38134 |
| 6.087e+04 – 6.492e+04 | 24762 |
| 6.492e+04 – 6.898e+04 | 50610 |
| 6.898e+04 – 7.304e+04 | 28819 |
| 7.304e+04 – 7.709e+04 | 26238 |
| 7.709e+04 – 8.115e+04 | 764653 |
| 8.115e+04 – 8.521e+04 | 24570 |
| 8.521e+04 – 8.926e+04 | 20395 |
| 8.926e+04 – 9.332e+04 | 26016 |
| 9.332e+04 – 9.738e+04 | 26137 |
| 9.738e+04 – 1.014e+05 | 31546 |
| 1.014e+05 – 1.055e+05 | 40766 |
| 1.055e+05 – 1.095e+05 | 38876 |
| 1.095e+05 – 1.136e+05 | 30073 |
| 1.136e+05 – 1.177e+05 | 20068 |
| 1.177e+05 – 1.217e+05 | 52404 |
| 1.217e+05 – 1.258e+05 | 31800 |
| 1.258e+05 – 1.298e+05 | 25923 |
| 1.298e+05 – 1.339e+05 | 22044 |
| 1.339e+05 – 1.379e+05 | 29770 |
| 1.379e+05 – 1.42e+05 | 21515 |
| 1.42e+05 – 1.461e+05 | 42803 |
| 1.461e+05 – 1.501e+05 | 53003 |
| 1.501e+05 – 1.542e+05 | 31967 |
| 1.542e+05 – 1.582e+05 | 35481 |
| 1.582e+05 – 1.623e+05 | 47141 |
movieId
| bin | count |
|---|---|
| 1 – 7317 | 740560 |
| 7317 – 1.463e+04 | 78542 |
| 1.463e+04 – 2.195e+04 | 0 |
| 2.195e+04 – 2.926e+04 | 42089 |
| 2.926e+04 – 3.658e+04 | 44405 |
| 3.658e+04 – 4.39e+04 | 25577 |
| 4.39e+04 – 5.121e+04 | 60448 |
| 5.121e+04 – 5.853e+04 | 60408 |
| 5.853e+04 – 6.584e+04 | 48280 |
| 6.584e+04 – 7.316e+04 | 60162 |
| 7.316e+04 – 8.047e+04 | 49499 |
| 8.047e+04 – 8.779e+04 | 48142 |
| 8.779e+04 – 9.511e+04 | 54096 |
| 9.511e+04 – 1.024e+05 | 50634 |
| 1.024e+05 – 1.097e+05 | 61455 |
| 1.097e+05 – 1.171e+05 | 56213 |
| 1.171e+05 – 1.244e+05 | 47682 |
| 1.244e+05 – 1.317e+05 | 20624 |
| 1.317e+05 – 1.39e+05 | 32253 |
| 1.39e+05 – 1.463e+05 | 29242 |
| 1.463e+05 – 1.536e+05 | 20987 |
| 1.536e+05 – 1.609e+05 | 24359 |
| 1.609e+05 – 1.683e+05 | 47171 |
| 1.683e+05 – 1.756e+05 | 32779 |
| 1.756e+05 – 1.829e+05 | 39934 |
| 1.829e+05 – 1.902e+05 | 42009 |
| 1.902e+05 – 1.975e+05 | 30827 |
| 1.975e+05 – 2.048e+05 | 33387 |
| 2.048e+05 – 2.122e+05 | 25164 |
| 2.122e+05 – 2.195e+05 | 15290 |
| 2.195e+05 – 2.268e+05 | 14099 |
| 2.268e+05 – 2.341e+05 | 5384 |
| 2.341e+05 – 2.414e+05 | 10225 |
| 2.414e+05 – 2.487e+05 | 3451 |
| 2.487e+05 – 2.561e+05 | 9558 |
| 2.561e+05 – 2.634e+05 | 7446 |
| 2.634e+05 – 2.707e+05 | 6938 |
| 2.707e+05 – 2.78e+05 | 9782 |
| 2.78e+05 – 2.853e+05 | 7806 |
| 2.853e+05 – 2.926e+05 | 3165 |
Schema
4 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| userId | numeric | 0.0% | 15,848 | outliers |
| movieId | numeric | 0.0% | 51,323 | |
| tag | text | 0.0% | 140,981 | one_word, duplicates |
| timestamp | numeric | 0.0% | 1,291,250 | outliers |
userId
numeric · identifier · outliers
This is almost certainly a user identifier stored as an integer, with 15,848 unique values spread across roughly 2M rows (about 126 rows per user on average). Values range from 22 to 162,279 with no nulls or zeros, and the distribution is near-symmetric (skew 0.12, kurtosis -0.26), consistent with sparsely allocated account IDs rather than a meaningful numeric quantity. Rows are not spread evenly over the ID range: one histogram bin (userIds of roughly 77,090 to 81,150) holds 764,653 rows, about 38% of the total. The 157,830 flagged outliers (7.9%) are an artefact of treating IDs as numeric; they are not statistical anomalies in any business sense. Treatment: Treat as a categorical/foreign key for joins and grouping; do not use as a numeric feature.
- n: 2,000,072
- nulls: 0 (0.0%)
- unique: 15,848
- min: 22
- max: 162,279
- mean: 8.193e+04
- median: 78,213
- std: 3.811e+04
- q1: 68,413
- q3: 103,698
- iqr: 35,285
- skew: 0.1179
- kurtosis: -0.2601
- n_outliers: 157,830
- outlier_rate: 0.07891
- zero_rate: 0
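Because userId is a grouping key rather than a quantity, the useful question is how many rows each user contributes. The following is a minimal sketch of that grouping; the helper name `top_user_share` and the sample IDs are illustrative and not taken from the dataset:

```python
from collections import Counter


def top_user_share(user_ids, k=1):
    """Return the fraction of rows contributed by the k most active users."""
    counts = Counter(user_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_rows = sum(count for _, count in counts.most_common(k))
    return top_rows / total


# Toy data (hypothetical): user 7 wrote 6 of the 10 rows.
sample_user_ids = [7, 7, 7, 7, 7, 7, 3, 3, 9, 5]
share = top_user_share(sample_user_ids, k=1)
print(share)  # 0.6
```

Run against the real userId column, the same function would show how much of the log comes from the heaviest taggers.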
movieId
numeric · foreign_key
Integer movieId spanning 1 to 292,629 across 2,000,072 rows, with 51,323 unique values and no nulls. The cardinality and bounded integer range indicate a foreign key into a movies table rather than a measured quantity; summary statistics such as the mean (71,893.26) and skew (0.82) reflect ID allocation, not a meaningful distribution. Treatment: Left-join on this ID to a movies dimension; do not treat it as a numeric feature.
- n: 2,000,072
- nulls: 0 (0.0%)
- unique: 51,323
- min: 1
- max: 292,629
- mean: 7.189e+04
- median: 52,328
- std: 7.48e+04
- q1: 4,011
- q3: 122,294
- iqr: 118,283
- skew: 0.821
- kurtosis: -0.4024
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
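The left join suggested for movieId can be sketched with plain dictionaries. This is an illustration only: the `title` field, the sample rows, and the helper name `left_join_movies` are assumptions, and the movies dimension is modelled here as a dict keyed by movieId.

```python
def left_join_movies(tag_rows, movies_by_id):
    """Attach a title to each tag row, keeping rows that have no match."""
    joined = []
    for row in tag_rows:
        movie = movies_by_id.get(row["movieId"])
        # Left join: unmatched rows are kept with title set to None.
        title = movie["title"] if movie is not None else None
        joined.append({**row, "title": title})
    return joined


# Hypothetical sample rows; the titles and IDs are illustrative only.
tag_rows = [
    {"userId": 22, "movieId": 1, "tag": "animation"},
    {"userId": 22, "movieId": 999_999, "tag": "sci-fi"},
]
movies_by_id = {1: {"title": "Example Movie A"}}

joined = left_join_movies(tag_rows, movies_by_id)
for row in joined:
    print(row["movieId"], row["tag"], row["title"])
```

Keeping unmatched rows makes it easy to count tags whose movieId is missing from the dimension table instead of silently dropping them.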
tag
text · label · one_word · duplicates
This column holds short descriptive tags applied to items (likely films), with frequent values like 'sci-fi', 'atmospheric', 'action', and 'comedy'. Entries are short: mean length is 11.1 characters, mean word count is 1.67, and 52.5% are a single word. Across 2,000,072 rows there are only 140,981 distinct tags, a 92.95% duplicate rate (1,859,091 duplicates), so the same labels recur heavily across items. Treatment: Treat as a categorical/multi-label tag field; one-hot encode or embed the top tags and group the long tail.
- n: 2,000,072
- nulls: 0 (0.0%)
- unique: 140,981
- len_min: 1
- len_max: 241
- len_mean: 11.11
- len_median: 10
- len_p95: 21
- word_mean: 1.674
- word_median: 1
- n_empty: 0
- n_duplicates: 1.859e+06
- duplicate_rate: 0.9295
- vocab_size: 7,273
- readability_flesch_mean: 17.7
- emoji_rate: 0
- url_rate: 1.3e-05
- one_word_rate: 0.5247
- allcaps_rate: 0.01055
- boilerplate_rate: 1e-05
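One way to read the suggested treatment for the tag column is to keep the most frequent tags, fold everything else into a single bucket, and one-hot encode the result. The sketch below assumes that reading; the `__other__` label, the function names, and the sample tags are illustrative choices, not part of the profile.

```python
from collections import Counter


def bucket_long_tail(tags, top_k, other_label="__other__"):
    """Keep the top_k most frequent tags; replace the rest with other_label."""
    counts = Counter(tags)
    keep = {tag for tag, _ in counts.most_common(top_k)}
    return [tag if tag in keep else other_label for tag in tags]


def one_hot(tags, vocabulary):
    """Encode each tag as a 0/1 list with one position per vocabulary entry."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for tag in tags:
        row = [0] * len(vocabulary)
        row[index[tag]] = 1
        rows.append(row)
    return rows


# Toy sample (hypothetical), not drawn from the dataset.
sample_tags = ["sci-fi", "action", "sci-fi", "atmospheric",
               "sci-fi", "action", "based on a book"]
bucketed = bucket_long_tail(sample_tags, top_k=2)
vocabulary = sorted(set(bucketed))
encoded = one_hot(bucketed, vocabulary)

print(vocabulary)  # ['__other__', 'action', 'sci-fi']
print(encoded[0])  # [0, 0, 1]
```

With 140,981 distinct tags, the choice of `top_k` controls the width of the encoding, so it is worth checking how much of the row count the retained tags cover before fixing it.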
timestamp
numeric · timestamp · outliers
Unix epoch timestamps (seconds) spanning 1135429210 (late 2005) to 1697154983 (late 2023), with a median of 1574071062 (late 2019). The distribution is left-skewed (skew -1.22): most rows are recent, with a long tail of older records from before activity concentrated post-2018. 6.39% of rows (127,898) fall outside the IQR fence; assuming the usual 1.5 × IQR rule, the lower fence sits near 1.262e+09 (around the start of 2010) and the upper fence lies beyond the maximum, so the flagged rows are the earliest records rather than errors. With 1,291,250 unique values across 2,000,072 rows, many timestamps repeat; there are no nulls or zeros. Treatment: Convert from epoch seconds to datetime and derive features (year, month, recency); consider filtering or bucketing the pre-2018 long tail.
- n: 2,000,072
- nulls: 0 (0.0%)
- unique: 1,291,250
- min: 1.135e+09
- max: 1.697e+09
- mean: 1.529e+09
- median: 1.574e+09
- std: 1.291e+08
- q1: 1.474e+09
- q3: 1.615e+09
- iqr: 1.411e+08
- skew: -1.219
- kurtosis: 0.6854
- n_outliers: 127,898
- outlier_rate: 0.06395
- zero_rate: 0
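The conversion suggested for the timestamp column can be sketched with the standard library. The example assumes the values are epoch seconds in UTC and that the outlier flag uses the common 1.5 × IQR fence; neither is stated explicitly in the profile, and the helper name `epoch_features` is illustrative.

```python
from datetime import datetime, timezone


def epoch_features(ts):
    """Convert epoch seconds (assumed UTC) into simple calendar features."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {"year": dt.year, "month": dt.month}


# Min, median, and max quoted in the profile.
first = epoch_features(1135429210)
middle = epoch_features(1574071062)
last = epoch_features(1697154983)
print(first, middle, last)

# Rounded quartile figures from the stats list; the result is approximate.
q1, q3, iqr = 1.474e9, 1.615e9, 1.411e8
lower_fence = q1 - 1.5 * iqr  # assumption: 1.5 x IQR rule
upper_fence = q3 + 1.5 * iqr
print(epoch_features(lower_fence))
print(upper_fence > 1697154983)  # True: no row can exceed the upper fence
```

Since q1 and iqr are only given to four significant figures, the fence date is good to within a few days at best.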