
letterboxd users export

source: /home/coolhand/html/datavis/data_trove/entertainment/movies/letterboxd/users_export.csv
8,139 rows · 5 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 8,139 Letterboxd user profiles in 5 columns: two identifiers (username, _id), a display name, and two activity metrics (num_reviews, num_ratings_pages). The activity metrics carry most of the signal. num_reviews is heavily right-skewed, with a mean of 868 against a median of 588 and a maximum of 17,184. num_ratings_pages is skewed as well and has a 41.7% null rate that warrants investigation. Display names show data quality issues: about 60% are a single word, 12.3% are duplicates, and the literal string 'null' appears 307 times as a value. The username and _id columns are fully unique identifiers and can largely be ignored for analytical purposes.

citing: row_count · column_count · num_reviews.stats · num_ratings_pages.stats · num_ratings_pages.null_rate · display_name.stats · display_name.top_values · username.n_unique · _id.n_unique
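The null-rate figures above depend on what the reader counts as missing. The sketch below is one way to reproduce that distinction with the standard library, assuming that only an empty cell is missing and that the literal string "null" is ordinary text. The sample rows are invented for illustration; only the column names come from this report.

```python
import csv
import io

# Hypothetical five-row sample using the report's column names; values are made up.
SAMPLE_CSV = """username,_id,display_name,num_reviews,num_ratings_pages
alice01,507f1f77bcf86cd799439011,Alice,120,4
bob_k,507f1f77bcf86cd799439012,null,0,
carol,507f1f77bcf86cd799439013,Carol D,588,25
dmitri,507f1f77bcf86cd799439014,Sam,17184,
eve,507f1f77bcf86cd799439015,Sam,267,15
"""


def null_rates(csv_text):
    """Return {column: fraction of rows whose cell is the empty string}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] == "") / len(rows) for c in columns}


def literal_count(csv_text, column, token):
    """Count cells in `column` that are exactly equal to `token`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for r in rows if r[column] == token)


rates = null_rates(SAMPLE_CSV)
null_strings = literal_count(SAMPLE_CSV, "display_name", "null")
print(rates["num_ratings_pages"], rates["display_name"], null_strings)
```

If the file is loaded with pandas instead, note that `pandas.read_csv` treats the string `null` as missing by default; passing `keep_default_na=False` together with `na_values=[""]` should keep it as text, which matches the 0.0% null rate reported for display_name.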

Schema

5 columns
Per-column summary.

column             type     nulls   unique  alerts
_id                text     0.0%    8,139   near_unique, one_word
display_name       text     0.0%    7,136   one_word, short_text
num_ratings_pages  numeric  41.7%   177     null_rate, high_skew
num_reviews        numeric  0.0%    2,416   high_skew, outliers
username           text     0.0%    8,139   near_unique, one_word, short_text

_id

text identifier near_unique one_word
This column is a per-row identifier, almost certainly MongoDB ObjectIds: every one of the 8139 values is unique, exactly 24 characters long, single-token, and the samples are 24-char hex strings. There are no nulls, duplicates, or empties, and vocab_size equals n, confirming a pure primary key with no analytic content. Treatment: Drop for modelling; retain only as a join key. high · anthropic:claude-opus-4-7
n: 8,139
nulls: 0 (0.0%)
unique: 8,139
len_min: 24
len_max: 24
len_mean: 24
len_median: 24
len_p95: 24
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 8,139
readability_flesch_mean: 31.97
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
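The claim that _id holds MongoDB ObjectIds can be checked mechanically. The sketch below tests the 24-character hex shape and, if the values really are ObjectIds, decodes the creation time from the first four bytes, which in the ObjectId layout hold seconds since the Unix epoch. The function names are hypothetical, and the sample value is a generic example, not a row from this dataset.

```python
import re
from datetime import datetime, timezone

# 24 hexadecimal characters, the textual form of a 12-byte ObjectId.
OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")


def looks_like_object_id(value):
    """True if `value` is exactly 24 hex characters."""
    return OBJECT_ID_RE.fullmatch(value) is not None


def object_id_created_at(value):
    """Decode the leading 4-byte timestamp of an ObjectId as a UTC datetime.

    Only meaningful under the assumption that the value is a real ObjectId.
    """
    if not looks_like_object_id(value):
        raise ValueError(f"not a 24-character hex string: {value!r}")
    seconds = int(value[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


example = "507f1f77bcf86cd799439011"  # generic example value, not from the dataset
print(looks_like_object_id(example), object_id_created_at(example).isoformat())
```

If every value passes the shape check, the decoded timestamps might serve as a rough record-creation date, although the report itself recommends using _id only as a join key.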

display_name

text free_text one_word short_text
Short user display names: nearly 60% are a single word (one_word_rate 0.5979), with a median length of 9 characters and a median word count of 1. The top values are dominated by first names such as Sam, Jack, and Emma. Notable quirks: 307 rows contain the literal string "null" (these are not true nulls, since null_rate is 0.0), duplicate_rate is 12.3% with 1,003 repeats, and 5.8% of values include emoji. The vocabulary is wide (7,487 tokens across 8,139 rows), consistent with free-form handles rather than a controlled label set. Treatment: treat as free-text handles; replace literal "null" tokens with true missing values, lowercase-normalize, and avoid using the column as a join key given the 12% duplicates. high · anthropic:claude-opus-4-7
n: 8,139
nulls: 0 (0.0%)
unique: 7,136
len_min: 1
len_max: 97
len_mean: 9.283
len_median: 9
len_p95: 16
word_mean: 1.479
word_median: 1
n_empty: 0
n_duplicates: 1,003
duplicate_rate: 0.1232
vocab_size: 7,487
readability_flesch_mean: 45.67
emoji_rate: 0.05836
url_rate: 0
one_word_rate: 0.5979
allcaps_rate: 0.01954
boilerplate_rate: 0
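The display_name treatment (turn literal "null" into a true missing value, then lowercase-normalize) could be sketched as follows. The function name and sample names are hypothetical, and two choices here are assumptions rather than anything the report specifies: surrounding whitespace is stripped first, and only the exact lowercase token "null" is treated as missing.

```python
from collections import Counter


def clean_display_name(raw):
    """Return a normalized display name, or None for empty or literal 'null'."""
    name = raw.strip()
    if name == "" or name == "null":
        return None
    # casefold() is a more aggressive lowercase suited to caseless comparison.
    return name.casefold()


names = ["Sam", "null", "sam ", "Emma", "Jack"]  # invented sample values
cleaned = [clean_display_name(n) for n in names]

# Count repeats among the non-missing cleaned names.
counts = Counter(n for n in cleaned if n is not None)
repeats = sum(c - 1 for c in counts.values())
print(cleaned, repeats)
```

Normalizing case and whitespace merges values such as "Sam" and "sam ", so a duplicate count taken after cleaning will likely be higher than the 1,003 reported on the raw column.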

num_ratings_pages

numeric feature null_rate high_skew
Numeric count of rating pages per user, present for roughly 58% of rows (null_rate 0.4168), with 177 distinct values from 1 to 1,208 and a median of 25. The distribution is severely right-skewed (skew 11.24, kurtosis 298.34), with 236 outliers above the IQR fence, so the mean of 32.81 sits well above the typical row. The outlier_rate of 0.04972 is relative to the 4,747 non-null rows. Treatment: log-transform and impute missing values before modelling. high · anthropic:claude-opus-4-7
n: 8,139
nulls: 3,392 (41.7%)
unique: 177
min: 1
max: 1,208
mean: 32.81
median: 25
std: 35.23
q1: 15
q3: 42
iqr: 27
skew: 11.24
kurtosis: 298.3
n_outliers: 236
outlier_rate: 0.04972
zero_rate: 0
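One possible reading of the num_ratings_pages treatment is sketched below: fill missing values with the median of the observed ones, keep a flag recording which rows were filled, then take the natural log. Median imputation and the natural log are assumptions, since the report names neither; with 41.7% of rows missing, the flag may carry as much information as the imputed value. The plain log is safe here only because the reported minimum is 1 and zero_rate is 0.

```python
import math
import statistics


def impute_and_log(values):
    """Median-impute None entries, flag them, and log-transform the result.

    Returns (logged_values, was_missing_flags, fill_value).
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    was_missing = [int(v is None) for v in values]
    filled = [fill if v is None else v for v in values]
    logged = [math.log(v) for v in filled]  # requires every value > 0
    return logged, was_missing, fill


pages = [15, None, 25, 42, None]  # invented sample; None marks a missing cell
logged, was_missing, fill = impute_and_log(pages)
print(fill, was_missing, [round(x, 3) for x in logged])
```

Before imputing, it may be worth checking whether missingness correlates with num_reviews, since the summary flags the null rate as something to investigate rather than a known random gap.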

num_reviews

numeric feature high_skew outliers
num_reviews is a count of reviews per user, ranging from 0 to 17,184 with a median of 588 and a mean of 868. The distribution is heavily right-skewed (skew 3.92, kurtosis 33.3), with 505 outliers (6.2%) and only 0.96% zeros. The gap between q3 (1,130) and the maximum (17,184) signals a long tail of highly prolific reviewers. Treatment: log1p-transform before modelling to tame the right tail. high · anthropic:claude-opus-4-7
n: 8,139
nulls: 0 (0.0%)
unique: 2,416
min: 0
max: 17,184
mean: 868.4
median: 588
std: 979.1
q1: 267
q3: 1,130
iqr: 863
skew: 3.923
kurtosis: 33.31
n_outliers: 505
outlier_rate: 0.06205
zero_rate: 0.009583
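The outlier counts and the log1p treatment can be made concrete with the reported quartiles. The sketch below assumes the common Tukey rule (fences at 1.5 × IQR beyond the quartiles); the report does not state which multiplier the profiler used, so `k=1.5` is an assumption and `tukey_fences` is a hypothetical helper name.

```python
import math


def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from this report.
reviews_lower, reviews_upper = tukey_fences(267, 1130)  # num_reviews
pages_lower, pages_upper = tukey_fences(15, 42)         # num_ratings_pages


def log1p_all(values):
    """Apply log(1 + x), which is defined at zero, unlike a plain log."""
    return [math.log1p(v) for v in values]


# Minimum, median, and maximum of num_reviews from the report.
transformed = log1p_all([0, 588, 17184])
print(reviews_lower, reviews_upper, pages_lower, pages_upper)
print([round(t, 3) for t in transformed])
```

Under this rule both lower fences are negative, so for non-negative counts every flagged value lies on the high side. log1p rather than log is needed for num_reviews because about 1% of its values are zero.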

username

text identifier near_unique one_word short_text
This column holds unique single-token usernames: every one of the 8139 rows has a distinct value (n_unique=8139, duplicate_rate=0.0) and one_word_rate is 1.0. Lengths are short and tightly bounded (len_min=2, len_mean≈9.79, len_max=15), consistent with a handle field rather than free text. No nulls, no URLs, no emoji, and allcaps usage is negligible (0.00037). Treatment: Treat as a user identifier; drop from modelling features and use only for joins or deduplication. high · anthropic:claude-opus-4-7
n: 8,139
nulls: 0 (0.0%)
unique: 8,139
len_min: 2
len_max: 15
len_mean: 9.793
len_median: 10
len_p95: 15
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 8,139
readability_flesch_mean: 2.78
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
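Before using username for joins or deduplication, it may be worth confirming that it stays unique under whatever normalization the join applies. The sketch below checks uniqueness both on the raw values and after case folding; `key_report` is a hypothetical helper and the sample usernames are invented.

```python
def key_report(values):
    """Report whether values are unique as-is and after case folding."""
    exact_unique = len(set(values)) == len(values)
    casefold_unique = len({v.casefold() for v in values}) == len(values)
    return {"exact_unique": exact_unique, "casefold_unique": casefold_unique}


# Invented sample: distinct as raw strings, but colliding once case is ignored.
report = key_report(["alice01", "Alice01", "bob_k"])
print(report)
```

The profile confirms exact uniqueness (n_unique equals n), but it does not say whether the values remain distinct when compared case-insensitively, so that second check is the one that adds information.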