letterboxd users export
Reading
This dataset contains 8,139 Letterboxd user profiles with 5 columns: two identifiers (username, _id), a display name, and two activity metrics (num_reviews, num_ratings_pages). The activity metrics carry the most interesting signal: num_reviews is heavily right-skewed, with a mean of 868, a median of 588, and a max of 17,184; num_ratings_pages is skewed more strongly still (skew 11.24 versus 3.92) and has a 41.7% null rate that warrants investigation. Display names are also worth a look: about 60% are one-word, 12.3% are duplicates, and the literal string "null" appears 307 times, which points to a data quality issue. The username and _id columns are fully unique identifiers and can largely be ignored for analytical purposes.
citing: row_count · column_count · num_reviews.stats · num_ratings_pages.stats · num_ratings_pages.null_rate · display_name.stats · display_name.top_values · username.n_unique · _id.n_unique
Charts the summary said to look at first
num_reviews: histogram data
| bin | count |
|---|---|
| 0 – 429.6 | 3145 |
| 429.6 – 859.2 | 2163 |
| 859.2 – 1289 | 1165 |
| 1289 – 1718 | 615 |
| 1718 – 2148 | 396 |
| 2148 – 2578 | 217 |
| 2578 – 3007 | 158 |
| 3007 – 3437 | 98 |
| 3437 – 3866 | 56 |
| 3866 – 4296 | 37 |
| 4296 – 4726 | 22 |
| 4726 – 5155 | 15 |
| 5155 – 5585 | 10 |
| 5585 – 6014 | 11 |
| 6014 – 6444 | 5 |
| 6444 – 6874 | 9 |
| 6874 – 7303 | 4 |
| 7303 – 7733 | 1 |
| 7733 – 8162 | 2 |
| 8162 – 8592 | 1 |
| 8592 – 9022 | 0 |
| 9022 – 9451 | 1 |
| 9451 – 9881 | 2 |
| 9881 – 10310 | 1 |
| 10310 – 10740 | 0 |
| 10740 – 11170 | 0 |
| 11170 – 11600 | 0 |
| 11600 – 12030 | 0 |
| 12030 – 12460 | 1 |
| 12460 – 12890 | 0 |
| 12890 – 13320 | 0 |
| 13320 – 13750 | 1 |
| 13750 – 14180 | 0 |
| 14180 – 14610 | 1 |
| 14610 – 15040 | 1 |
| 15040 – 15470 | 0 |
| 15470 – 15900 | 0 |
| 15900 – 16320 | 0 |
| 16320 – 16750 | 0 |
| 16750 – 17180 | 1 |
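The bin edges in the table above look like equal-width bins over the column's range: there are 40 rows, and 17,184 / 40 = 429.6, which matches the first edge. The sketch below reproduces that arithmetic. The equal-width scheme is inferred from the numbers and is not stated by the report, and `equal_width_edges` is a hypothetical helper name.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return n_bins + 1 evenly spaced bin edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# num_reviews: min 0, max 17,184 (from the column stats), 40 table rows.
edges = equal_width_edges(0, 17184, 40)
first_width = edges[1] - edges[0]
print(round(first_width, 1))  # 429.6
```

The same helper applied to num_ratings_pages (min 1, max 1,208, 40 bins) gives a first upper edge of about 31.18, which agrees with that column's table.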
num_ratings_pages: histogram data
| bin | count |
|---|---|
| 1 – 31.18 | 2911 |
| 31.18 – 61.35 | 1307 |
| 61.35 – 91.53 | 355 |
| 91.53 – 121.7 | 100 |
| 121.7 – 151.9 | 36 |
| 151.9 – 182.1 | 15 |
| 182.1 – 212.2 | 9 |
| 212.2 – 242.4 | 1 |
| 242.4 – 272.6 | 3 |
| 272.6 – 302.8 | 3 |
| 302.8 – 332.9 | 2 |
| 332.9 – 363.1 | 2 |
| 363.1 – 393.3 | 0 |
| 393.3 – 423.4 | 1 |
| 423.4 – 453.6 | 0 |
| 453.6 – 483.8 | 0 |
| 483.8 – 514 | 0 |
| 514 – 544.1 | 0 |
| 544.1 – 574.3 | 0 |
| 574.3 – 604.5 | 0 |
| 604.5 – 634.7 | 0 |
| 634.7 – 664.9 | 0 |
| 664.9 – 695 | 0 |
| 695 – 725.2 | 1 |
| 725.2 – 755.4 | 0 |
| 755.4 – 785.6 | 0 |
| 785.6 – 815.7 | 0 |
| 815.7 – 845.9 | 0 |
| 845.9 – 876.1 | 0 |
| 876.1 – 906.2 | 0 |
| 906.2 – 936.4 | 0 |
| 936.4 – 966.6 | 0 |
| 966.6 – 996.8 | 0 |
| 996.8 – 1027 | 0 |
| 1027 – 1057 | 0 |
| 1057 – 1087 | 0 |
| 1087 – 1117 | 0 |
| 1117 – 1148 | 0 |
| 1148 – 1178 | 0 |
| 1178 – 1208 | 1 |
display_name length in characters: histogram data
| chars | count |
|---|---|
| 1 – 3 | 329 |
| 3 – 6 | 1711 |
| 6 – 8 | 1902 |
| 8 – 11 | 1037 |
| 11 – 13 | 1236 |
| 13 – 15 | 1354 |
| 15 – 18 | 275 |
| 18 – 20 | 167 |
| 20 – 23 | 39 |
| 23 – 25 | 27 |
| 25 – 27 | 31 |
| 27 – 30 | 4 |
| 30 – 32 | 14 |
| 32 – 35 | 2 |
| 35 – 37 | 2 |
| 37 – 39 | 2 |
| 39 – 42 | 0 |
| 42 – 44 | 1 |
| 44 – 47 | 1 |
| 47 – 49 | 1 |
| 49 – 51 | 1 |
| 51 – 54 | 0 |
| 54 – 56 | 0 |
| 56 – 59 | 0 |
| 59 – 61 | 0 |
| 61 – 63 | 0 |
| 63 – 66 | 2 |
| 66 – 68 | 0 |
| 68 – 71 | 0 |
| 71 – 73 | 0 |
| 73 – 75 | 0 |
| 75 – 78 | 0 |
| 78 – 80 | 0 |
| 80 – 83 | 0 |
| 83 – 85 | 0 |
| 85 – 87 | 0 |
| 87 – 90 | 0 |
| 90 – 92 | 0 |
| 92 – 95 | 0 |
| 95 – 97 | 1 |
Schema
5 columns

| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| _id | text | 0.0% | 8,139 | near_unique, one_word |
| display_name | text | 0.0% | 7,136 | one_word, short_text |
| num_ratings_pages | numeric | 41.7% | 177 | null_rate, high_skew |
| num_reviews | numeric | 0.0% | 2,416 | high_skew, outliers |
| username | text | 0.0% | 8,139 | near_unique, one_word, short_text |
_id
text · identifier · near_unique · one_word
This column is a per-row identifier, almost certainly MongoDB ObjectIds: every one of the 8,139 values is unique, exactly 24 characters long, single-token, and the samples are 24-char hex strings. There are no nulls, duplicates, or empties, and vocab_size equals n, confirming a pure primary key with no analytic content.
Treatment: Drop for modelling; retain only as a join key.
- n: 8,139
- nulls: 0 (0.0%)
- unique: 8,139
- len_min: 24
- len_max: 24
- len_mean: 24
- len_median: 24
- len_p95: 24
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 8,139
- readability_flesch_mean: 31.97
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
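The ObjectId guess for _id can be checked mechanically, since a MongoDB ObjectId renders as exactly 24 hexadecimal characters. The following is a minimal sketch of such a check; `looks_like_object_id` is a hypothetical helper and the sample strings are made up for illustration, not taken from the dataset.

```python
import re

# 24 hexadecimal characters; both cases are accepted here as a lenient assumption.
OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")


def looks_like_object_id(value):
    """Return True if value is a string of exactly 24 hex characters."""
    return isinstance(value, str) and OBJECT_ID_RE.fullmatch(value) is not None


# Made-up sample values, for illustration only.
samples = ["0123456789abcdef01234567", "not-an-object-id", "0123456789abcdef0123456"]
checks = [looks_like_object_id(s) for s in samples]
print(checks)  # [True, False, False]
```

If every value in the column passes, the ObjectId reading is at least consistent with the data, although it does not prove where the IDs came from.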
display_name
text · free_text · one_word · short_text
Short user display names: nearly 60% are a single word (one_word_rate 0.5979), with a median length of 9 characters and a median word count of 1; the top values are dominated by first names such as Sam, Jack, and Emma. Notable quirks: 307 rows contain the literal string "null" (not true nulls, since null_rate is 0.0), duplicate_rate is 12.3% (1,003 repeated values), and 5.8% of names include emoji. The vocabulary is wide (7,487 tokens across 8,139 rows), consistent with free-form names rather than a controlled label set.
Treatment: Treat as free text: replace literal "null" values with true missing, lowercase-normalize, and avoid using as a join key given the 12.3% duplicate rate.
- n: 8,139
- nulls: 0 (0.0%)
- unique: 7,136
- len_min: 1
- len_max: 97
- len_mean: 9.283
- len_median: 9
- len_p95: 16
- word_mean: 1.479
- word_median: 1
- n_empty: 0
- n_duplicates: 1,003
- duplicate_rate: 0.1232
- vocab_size: 7,487
- readability_flesch_mean: 45.67
- emoji_rate: 0.05836
- url_rate: 0
- one_word_rate: 0.5979
- allcaps_rate: 0.01954
- boilerplate_rate: 0
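The display_name treatment (turn literal "null" into a true missing value, then lowercase) can be sketched as below. The helper names are hypothetical, and the duplicate-rate function is one plausible definition, not necessarily the one the report uses; the report's own figure is consistent with (n - unique) / n = 1,003 / 8,139 ≈ 0.1232.

```python
def clean_display_name(value):
    """Map the literal string "null" to None; strip and lowercase anything else."""
    if value is None:
        return None
    stripped = value.strip()
    if stripped == "null":  # only the exact lowercase token is treated as missing
        return None
    return stripped.lower()


def duplicate_rate(values):
    """Share of non-missing values that repeat an earlier value."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return (len(present) - len(set(present))) / len(present)


# Made-up sample rows, for illustration only.
raw = ["Sam", "null", "sam ", "Emma"]
cleaned = [clean_display_name(v) for v in raw]
print(cleaned)  # ['sam', None, 'sam', 'emma']
```

Note that lowercasing merges names such as "Sam" and "sam", so a duplicate rate measured after cleaning will generally be higher than the 12.3% reported on the raw values.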
num_ratings_pages
numeric · feature · null_rate · high_skew
Count of rating pages per user, present for roughly 58% of rows (null_rate 0.4168), with 177 distinct values from 1 to 1,208 and a median of 25. The distribution is severely right-skewed (skew 11.24, kurtosis 298.3), with 236 outliers (5.0% of non-null rows) above the IQR fence, so the mean of 32.81 sits well above the median.
Treatment: log-transform and impute missing values before modelling.
- n: 8,139
- nulls: 3,392 (41.7%)
- unique: 177
- min: 1
- max: 1,208
- mean: 32.81
- median: 25
- std: 35.23
- q1: 15
- q3: 42
- iqr: 27
- skew: 11.24
- kurtosis: 298.3
- n_outliers: 236
- outlier_rate: 0.04972
- zero_rate: 0
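The outlier count and the suggested treatment for num_ratings_pages can be illustrated with a short sketch. It assumes the report uses the common Tukey rule (q3 + 1.5 × IQR) for its fence and that missing values are filled with the median; the report states neither, so both are assumptions, as are the helper names. The extra missing-value flag is a design choice that keeps the information that a value was absent, which may matter with a 41.7% null rate.

```python
import math
from statistics import median


def upper_fence(q1, q3, k=1.5):
    """Tukey upper fence: q3 + k * IQR (k = 1.5 is an assumed convention)."""
    return q3 + k * (q3 - q1)


def impute_and_log(values):
    """Median-impute None entries, then take the natural log.

    Returns (logged_values, missing_flags). Plain log is safe here only
    because the column minimum is 1 (zero_rate is 0).
    """
    present = [v for v in values if v is not None]
    fill = median(present)
    flags = [v is None for v in values]
    logged = [math.log(fill if v is None else v) for v in values]
    return logged, flags


# q1 = 15 and q3 = 42 come from the stats above.
print(upper_fence(15, 42))  # 82.5

# Made-up sample rows, for illustration only.
logged, flags = impute_and_log([10, None, 25, 1208, None, 40])
```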
num_reviews
numeric · feature · high_skew · outliers
num_reviews is a count of reviews per user, ranging from 0 to 17,184 with a median of 588 and a mean of 868. The distribution is heavily right-skewed (skew 3.92, kurtosis 33.3), with 505 outliers (6.2%) and only 0.96% zeros. The gap between q3 (1,130) and the max (17,184) signals a long tail of prolific reviewers.
Treatment: log1p-transform before modelling to tame the right tail.
- n: 8,139
- nulls: 0 (0.0%)
- unique: 2,416
- min: 0
- max: 17,184
- mean: 868.4
- median: 588
- std: 979.1
- q1: 267
- q3: 1,130
- iqr: 863
- skew: 3.923
- kurtosis: 33.31
- n_outliers: 505
- outlier_rate: 0.06205
- zero_rate: 0.009583
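Because about 1% of num_reviews values are zero, a plain logarithm is undefined for those rows, which is presumably why log1p (the log of 1 + x) is recommended. A minimal sketch, applied to the five summary points from the stats above rather than to real rows; `log1p_transform` is a hypothetical helper name:

```python
import math


def log1p_transform(values):
    """Apply log(1 + x) element-wise; defined at x = 0, unlike log(x)."""
    return [math.log1p(v) for v in values]


# min, q1, median, q3, max from the num_reviews stats.
summary_points = [0, 267, 588, 1130, 17184]
transformed = log1p_transform(summary_points)

# Ratio of max to q3 before and after the transform.
raw_ratio = summary_points[-1] / summary_points[-2]
log_ratio = transformed[-1] / transformed[-2]
```

On the raw scale the max is roughly 15 times q3; after the transform the two are much closer together, which is the compression of the right tail that the treatment is after.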
username
text · identifier · near_unique · one_word · short_text
This column holds unique single-token usernames: every one of the 8,139 rows has a distinct value (n_unique=8139, duplicate_rate=0.0) and one_word_rate is 1.0. Lengths are short and tightly bounded (len_min=2, len_mean≈9.79, len_max=15), consistent with a handle field rather than free text. No nulls, no URLs, no emoji, and allcaps usage is negligible (0.00037).
Treatment: Treat as a user identifier; drop from modelling features and use only for joins or deduplication.
- n: 8,139
- nulls: 0 (0.0%)
- unique: 8,139
- len_min: 2
- len_max: 15
- len_mean: 9.793
- len_median: 10
- len_p95: 15
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 8,139
- readability_flesch_mean: 2.78
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0003686
- boilerplate_rate: 0