letterboxd users export

source: /home/coolhand/html/datavis/data_trove/entertainment/movies/letterboxd/users_export.csv · 8,139 rows · 5 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 8,139 Letterboxd user profiles with 5 columns: two identifiers (username, _id), a display name, and two activity metrics (num_reviews, num_ratings_pages). The activity metrics carry the most signal: num_reviews is heavily right-skewed, with a mean of 868, a median of 588, and a max of 17,184; num_ratings_pages is skewed even more strongly (skew 11.24 versus 3.92) and has a 41.7% null rate that warrants investigation. Display names are also worth a look: about 60% are one-word, 12.3% are duplicates, and the literal string 'null' appears 307 times, suggesting some data quality issues. The username and _id columns are fully unique identifiers and can largely be ignored for analytical purposes.

citing: row_count · column_count · num_reviews.stats · num_ratings_pages.stats · num_ratings_pages.null_rate · display_name.stats · display_name.top_values · username.n_unique · _id.n_unique

Charts the summary said to look at first

num_reviews · Look at the long right tail: at least 65% of users sit below the 868 mean, while a handful exceed 12,000 reviews.

Histogram bins for num_reviews (median: 588.0).
bin	count
0 – 429.6	3145
429.6 – 859.2	2163
859.2 – 1289	1165
1289 – 1718	615
1718 – 2148	396
2148 – 2578	217
2578 – 3007	158
3007 – 3437	98
3437 – 3866	56
3866 – 4296	37
4296 – 4726	22
4726 – 5155	15
5155 – 5585	10
5585 – 6014	11
6014 – 6444	5
6444 – 6874	9
6874 – 7303	4
7303 – 7733	1
7733 – 8162	2
8162 – 8592	1
8592 – 9022	0
9022 – 9451	1
9451 – 9881	2
9881 – 10310	1
10310 – 10740	0
10740 – 11170	0
11170 – 11600	0
11600 – 12030	0
12030 – 12460	1
12460 – 12890	0
12890 – 13320	0
13320 – 13750	1
13750 – 14180	0
14180 – 14610	1
14610 – 15040	1
15040 – 15470	0
15470 – 15900	0
15900 – 16320	0
16320 – 16750	0
16750 – 17180	1
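
The caption's point about the mean can be checked from the bin counts alone. The following is a minimal sketch, assuming the 40 bins are equal-width over 0 to 17,184 (which matches the printed edges after rounding); the variable names are illustrative and not part of the profiler.

```python
# Bin counts copied from the num_reviews histogram, in order.
counts = [
    3145, 2163, 1165, 615, 396, 217, 158, 98, 56, 37,
    22, 15, 10, 11, 5, 9, 4, 1, 2, 1,
    0, 1, 2, 1, 0, 0, 0, 0, 1, 0,
    0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
]
max_reviews = 17184
mean_reviews = 868.4

# Assumption: equal-width bins starting at 0.
bin_width = max_reviews / len(counts)   # 429.6
second_edge = 2 * bin_width             # 859.2, just under the mean

# Users in the first two bins all fall below the mean, so this share
# is a lower bound on the fraction of users below the mean.
below_edge = sum(counts[:2])
share_below = below_edge / sum(counts)

print(f"total rows in histogram: {sum(counts)}")
print(f"{below_edge} users ({share_below:.1%}) are below {second_edge:.1f} reviews")
```

The counts sum to the reported 8,139 rows, and the first two bins alone hold roughly two thirds of all users, which is what pulls the median so far below the mean.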

num_ratings_pages · Same skew pattern as reviews, but note that 41.7% of values are missing before reading the distribution.

Histogram bins for num_ratings_pages (median: 25.0).
bin	count
1 – 31.18	2911
31.18 – 61.35	1307
61.35 – 91.53	355
91.53 – 121.7	100
121.7 – 151.9	36
151.9 – 182.1	15
182.1 – 212.2	9
212.2 – 242.4	1
242.4 – 272.6	3
272.6 – 302.8	3
302.8 – 332.9	2
332.9 – 363.1	2
363.1 – 393.3	0
393.3 – 423.4	1
423.4 – 453.6	0
453.6 – 483.8	0
483.8 – 514	0
514 – 544.1	0
544.1 – 574.3	0
574.3 – 604.5	0
604.5 – 634.7	0
634.7 – 664.9	0
664.9 – 695	0
695 – 725.2	1
725.2 – 755.4	0
755.4 – 785.6	0
785.6 – 815.7	0
815.7 – 845.9	0
845.9 – 876.1	0
876.1 – 906.2	0
906.2 – 936.4	0
936.4 – 966.6	0
966.6 – 996.8	0
996.8 – 1027	0
1027 – 1057	0
1057 – 1087	0
1087 – 1117	0
1117 – 1148	0
1148 – 1178	0
1178 – 1208	1

display_name · Top display names show common first names like Sam and Jack, plus 307 literal 'null' entries flagging data quality issues.

display_name · Most display names are short (median 9 chars) but the max stretches to 97 — check for outlier-style names.

Character-length distribution for display_name (mean: 9.283).
chars	count
1 – 3	329
3 – 6	1711
6 – 8	1902
8 – 11	1037
11 – 13	1236
13 – 15	1354
15 – 18	275
18 – 20	167
20 – 23	39
23 – 25	27
25 – 27	31
27 – 30	4
30 – 32	14
32 – 35	2
35 – 37	2
37 – 39	2
39 – 42	0
42 – 44	1
44 – 47	1
47 – 49	1
49 – 51	1
51 – 54	0
54 – 56	0
56 – 59	0
59 – 61	0
61 – 63	0
63 – 66	2
66 – 68	0
68 – 71	0
71 – 73	0
73 – 75	0
75 – 78	0
78 – 80	0
80 – 83	0
83 – 85	0
85 – 87	0
87 – 90	0
90 – 92	0
92 – 95	0
95 – 97	1

Schema

5 columns

Per-column summary.
column	type	null rate	unique	alerts
_id	text	0.0%	8,139	near_unique one_word
display_name	text	0.0%	7,136	one_word short_text
num_ratings_pages	numeric	41.7%	177	null_rate high_skew
num_reviews	numeric	0.0%	2,416	high_skew outliers
username	text	0.0%	8,139	near_unique one_word short_text
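
The null-rate and unique-count columns above can be reproduced with a short pass over the CSV. This is a rough sketch using only the standard library; it assumes empty cells are what the profiler counts as null (the report does not say), and `profile_columns` plus the three sample rows are made up for illustration.

```python
import csv
import io

def profile_columns(rows):
    """Return {column: (null_rate, n_unique)}, treating empty cells as null."""
    rows = list(rows)
    summary = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        non_null = [v for v in values if v != ""]
        null_rate = 1 - len(non_null) / len(values)
        summary[col] = (null_rate, len(set(non_null)))
    return summary

# Tiny invented sample with the same five column names; not real data.
sample = io.StringIO(
    "_id,display_name,num_ratings_pages,num_reviews,username\n"
    "aaaaaaaaaaaaaaaaaaaaaaa1,Sam,12,300,sam_one\n"
    "aaaaaaaaaaaaaaaaaaaaaaa2,Sam,,45,sam_two\n"
    "aaaaaaaaaaaaaaaaaaaaaaa3,null,30,0,third_user\n"
)
summary = profile_columns(csv.DictReader(sample))
for col, (null_rate, n_unique) in summary.items():
    print(f"{col}: null_rate={null_rate:.1%} unique={n_unique}")
```

To run it against the real export, replace `sample` with an open file handle on users_export.csv. Note that the literal string "null" in display_name is counted as a value here, which is consistent with the 0.0% null rate reported for that column.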

_id

text identifier near_unique one_word

This column is a per-row identifier, almost certainly MongoDB ObjectIds: every one of the 8139 values is unique, exactly 24 characters long, single-token, and the samples are 24-char hex strings. There are no nulls, duplicates, or empties, and vocab_size equals n, confirming a pure primary key with no analytic content. Treatment: Drop for modelling; retain only as a join key. high · anthropic:claude-opus-4-7

n: 8,139
nulls: 0 (0.0%)
unique: 8,139
len_min: 24
len_max: 24
len_mean: 24
len_median: 24
len_p95: 24
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 8,139
readability_flesch_mean: 31.97
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
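
The "24-char hex string" observation is easy to verify mechanically. Below is a minimal sketch of such a check; the function name and the example values are hypothetical, and a passing check only shows that a value has the ObjectId shape, not that it came from MongoDB.

```python
import re

# 24 hexadecimal characters, upper or lower case.
OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")

def looks_like_object_id(value: str) -> bool:
    """True if the whole string is exactly 24 hex characters."""
    return OBJECT_ID_RE.fullmatch(value) is not None

# Hypothetical examples, not taken from the dataset.
print(looks_like_object_id("5fc85f3ba5bd2d2df5a0e1b2"))   # True
print(looks_like_object_id("5fc85f3ba5bd2d2df5a0e1b"))    # False: 23 chars
print(looks_like_object_id("5fc85f3ba5bd2d2df5a0e1bz"))   # False: 'z' is not hex
```

If every row passes, the ObjectId reading of the column is well supported; any failures would be worth inspecting before using _id as a join key.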

display_name

text free_text one_word short_text

Short user display names: nearly 60% are a single word (one_word_rate 0.5979), with a median length of 9 characters and a median word count of 1; the top values are dominated by first names like Sam, Jack, and Emma. Notable quirks: 307 rows contain the literal string "null" (not actual nulls, since null_rate is 0.0), duplicate_rate is 12.3% (1,003 repeated values), and 5.8% include emoji. Vocabulary is wide (7,487 tokens across 8,139 rows), consistent with free-form names rather than a controlled label set. Treatment: Treat as free text: replace literal "null" tokens with true missing values, lowercase-normalize, and avoid using as a join key given the 12% duplicate rate. high · anthropic:claude-opus-4-7

n: 8,139
nulls: 0 (0.0%)
unique: 7,136
len_min: 1
len_max: 97
len_mean: 9.283
len_median: 9
len_p95: 16
word_mean: 1.479
word_median: 1
n_empty: 0
n_duplicates: 1,003
duplicate_rate: 0.1232
vocab_size: 7,487
readability_flesch_mean: 45.67
emoji_rate: 0.05836
url_rate: 0
one_word_rate: 0.5979
allcaps_rate: 0.01954
boilerplate_rate: 0
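
The suggested treatment can be sketched in a few lines. This is one possible reading, not the profiler's own code: it assumes only the exact string "null" marks a missing name (a user could in principle have chosen it on purpose), and the helper names and sample values are invented for illustration.

```python
def clean_display_name(raw):
    """Map the literal string 'null' to None; otherwise strip and lowercase."""
    name = raw.strip()
    if name == "null":          # assumption: exact match only
        return None
    return name.lower()

def duplicate_rate(values):
    """(n - n_unique) / n, the same arithmetic as the stats above."""
    return (len(values) - len(set(values))) / len(values)

# Invented sample values, not taken from the dataset.
raw_names = ["Sam", "sam ", "null", "Emma"]
cleaned = [clean_display_name(n) for n in raw_names]
present = [n for n in cleaned if n is not None]

print(cleaned)                      # ['sam', 'sam', None, 'emma']
print(duplicate_rate(present))      # 1 duplicate among 3 present names

# Cross-check of the reported figures: 8,139 rows minus 7,136 unique values.
reported_duplicates = 8139 - 7136
print(reported_duplicates, round(reported_duplicates / 8139, 4))
```

Lowercasing before counting will raise the duplicate rate relative to the reported 12.3%, since names that differ only in case collapse together.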

num_ratings_pages

numeric feature null_rate high_skew

Count of rating pages per user, present for roughly 58% of rows (null_rate 0.4168), with 177 distinct values from 1 to 1,208 and a median of 25. The distribution is severely right-skewed (skew 11.24, kurtosis 298.3), with 236 outliers (5.0% of non-null rows) above the IQR fence, so the mean of 32.81 sits well above the typical row. Treatment: log-transform and impute missing values before modelling. high · anthropic:claude-opus-4-7

n: 8,139
nulls: 3,392 (41.7%)
unique: 177
min: 1
max: 1,208
mean: 32.81
median: 25
std: 35.23
q1: 15
q3: 42
iqr: 27
skew: 11.24
kurtosis: 298.3
n_outliers: 236
outlier_rate: 0.04972
zero_rate: 0
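
One way to apply the treatment is to impute with the median and keep a flag recording which rows were missing, so that a 41.7% null rate is not silently erased. The sketch below assumes median imputation (the report only says "impute") and uses a hypothetical helper name; the plain logarithm is safe here because the minimum is 1.

```python
import math

MEDIAN_PAGES = 25  # median reported in the stats above

def transform_ratings_pages(value):
    """Return (log_pages, was_missing) for one raw value (None if missing)."""
    was_missing = value is None
    filled = MEDIAN_PAGES if was_missing else value
    return math.log(filled), was_missing

print(transform_ratings_pages(None))   # median-imputed, flagged as missing
print(transform_ratings_pages(1))      # (0.0, False)

# The reported outlier_rate matches a denominator of non-null rows only.
non_null = 8139 - 3392                 # 4747 rows with a value
outlier_rate = 236 / non_null
print(non_null, round(outlier_rate, 5))
```

The last lines show that 236 outliers over 4,747 non-null rows gives the reported 0.04972, whereas dividing by all 8,139 rows would give about 0.029.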

num_reviews

numeric feature high_skew outliers

num_reviews is a count of reviews per user, ranging from 0 to 17,184 with a median of 588 and a mean of 868. The distribution is heavily right-skewed (skew 3.92, kurtosis 33.3), with 505 outliers (6.2%) and only 0.96% zeros. The gap between q3 (1,130) and the max (17,184) signals a long tail of highly prolific reviewers. Treatment: log1p-transform before modelling to tame the right tail. high · anthropic:claude-opus-4-7

n: 8,139
nulls: 0 (0.0%)
unique: 2,416
min: 0
max: 17,184
mean: 868.4
median: 588
std: 979.1
q1: 267
q3: 1,130
iqr: 863
skew: 3.923
kurtosis: 33.31
n_outliers: 505
outlier_rate: 0.06205
zero_rate: 0.009583
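
The outlier threshold and the transform can both be worked out from the quartiles above. This sketch assumes the profiler uses the common Tukey rule (q3 + 1.5 × IQR), which the report does not state, and the function names are illustrative. log1p is used rather than a plain log because about 1% of users have zero reviews.

```python
import math

Q1, Q3 = 267, 1130   # quartiles reported in the stats above

def tukey_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence under the Tukey rule: q3 + k * IQR."""
    return q3 + k * (q3 - q1)

def log1p_reviews(n):
    """log(1 + n), defined at n = 0 unlike a plain logarithm."""
    return math.log1p(n)

fence = tukey_upper_fence(Q1, Q3)
print(fence)                              # 2424.5

# Effect of the transform on the extremes of the column.
print(log1p_reviews(0))                   # 0.0
print(round(log1p_reviews(588), 2))       # median user
print(round(log1p_reviews(17184), 2))     # most prolific user
```

Under that assumption any user with more than about 2,424 reviews counts as an outlier, and after the transform the most prolific user sits at roughly 9.75 instead of being nearly 30 times the median.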

username

text identifier near_unique one_word short_text

This column holds unique single-token usernames: every one of the 8139 rows has a distinct value (n_unique=8139, duplicate_rate=0.0) and one_word_rate is 1.0. Lengths are short and tightly bounded (len_min=2, len_mean≈9.79, len_max=15), consistent with a handle field rather than free text. No nulls, no URLs, no emoji, and allcaps usage is negligible (0.00037). Treatment: Treat as a user identifier; drop from modelling features and use only for joins or deduplication. high · anthropic:claude-opus-4-7

n: 8,139
nulls: 0 (0.0%)
unique: 8,139
len_min: 2
len_max: 15
len_mean: 9.793
len_median: 10
len_p95: 15
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 8,139
readability_flesch_mean: 2.78
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.0003686
boilerplate_rate: 0