letterboxd users export

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/letterboxd/users_export.csv

Saturn profiled 8,139 rows across 5 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

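!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source CSV (flags unchanged from the original cell).
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/letterboxd/users_export.csv",
        "--findings", "letterboxd-users_export.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if saturn exits non-zero
)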

Summary confidence: high

This dataset contains 8,139 Letterboxd user profiles with 5 columns: two identifiers (username, _id), a display name, and two activity metrics (num_reviews, num_ratings_pages). The activity metrics carry the most signal. num_reviews is heavily right-skewed, with a mean of 868 against a median of 588 and a max of 17,184. num_ratings_pages is skewed more strongly still (skew 11.24 versus 3.92) and has a 41.7% null rate that warrants investigation. Display names are also worth a look: about 60% are one-word, 12.3% are duplicates, and the literal string 'null' appears 307 times, which points to a data quality issue. The username and _id columns are fully unique identifiers and can largely be ignored for analytical purposes.

citing: row_count · column_count · num_reviews.stats · num_ratings_pages.stats · num_ratings_pages.null_rate · display_name.stats · display_name.top_values · username.n_unique · _id.n_unique

Out[4]:

saturn.schema() · 5 columns

column	kind	n	null%	unique	alerts
_id	text	8,139	0.0%	8,139	near_unique one_word
display_name	text	8,139	0.0%	7,136	one_word short_text
num_ratings_pages	numeric	8,139	41.7%	177	null_rate high_skew
num_reviews	numeric	8,139	0.0%	2,416	high_skew outliers
username	text	8,139	0.0%	8,139	near_unique one_word short_text

Fig 1.

num_reviews · Look at the long right tail — most users sit well below the 868 mean while a few hit thousands of reviews.

Histogram bins for num_reviews (median: 588.0).
bin	count
0 – 429.6	3145
429.6 – 859.2	2163
859.2 – 1289	1165
1289 – 1718	615
1718 – 2148	396
2148 – 2578	217
2578 – 3007	158
3007 – 3437	98
3437 – 3866	56
3866 – 4296	37
4296 – 4726	22
4726 – 5155	15
5155 – 5585	10
5585 – 6014	11
6014 – 6444	5
6444 – 6874	9
6874 – 7303	4
7303 – 7733	1
7733 – 8162	2
8162 – 8592	1
8592 – 9022	0
9022 – 9451	1
9451 – 9881	2
9881 – 1.031e+04	1
1.031e+04 – 1.074e+04	0
1.074e+04 – 1.117e+04	0
1.117e+04 – 1.16e+04	0
1.16e+04 – 1.203e+04	0
1.203e+04 – 1.246e+04	1
1.246e+04 – 1.289e+04	0
1.289e+04 – 1.332e+04	0
1.332e+04 – 1.375e+04	1
1.375e+04 – 1.418e+04	0
1.418e+04 – 1.461e+04	1
1.461e+04 – 1.504e+04	1
1.504e+04 – 1.547e+04	0
1.547e+04 – 1.59e+04	0
1.59e+04 – 1.632e+04	0
1.632e+04 – 1.675e+04	0
1.675e+04 – 1.718e+04	1

Fig 2.

num_ratings_pages · Same skew pattern as reviews, but note that 41.7% of values are missing before reading the distribution.

Histogram bins for num_ratings_pages (median: 25.0).
bin	count
1 – 31.18	2911
31.18 – 61.35	1307
61.35 – 91.53	355
91.53 – 121.7	100
121.7 – 151.9	36
151.9 – 182.1	15
182.1 – 212.2	9
212.2 – 242.4	1
242.4 – 272.6	3
272.6 – 302.8	3
302.8 – 332.9	2
332.9 – 363.1	2
363.1 – 393.3	0
393.3 – 423.4	1
423.4 – 453.6	0
453.6 – 483.8	0
483.8 – 514	0
514 – 544.1	0
544.1 – 574.3	0
574.3 – 604.5	0
604.5 – 634.7	0
634.7 – 664.9	0
664.9 – 695	0
695 – 725.2	1
725.2 – 755.4	0
755.4 – 785.6	0
785.6 – 815.7	0
815.7 – 845.9	0
845.9 – 876.1	0
876.1 – 906.2	0
906.2 – 936.4	0
936.4 – 966.6	0
966.6 – 996.8	0
996.8 – 1027	0
1027 – 1057	0
1057 – 1087	0
1087 – 1117	0
1117 – 1148	0
1148 – 1178	0
1178 – 1208	1

Fig 3.

display_name · Top display names show common first names like Sam and Jack, plus 307 literal 'null' entries flagging data quality issues.

Fig 4.

display_name · Most display names are short (median 9 chars) but the max stretches to 97 — check for outlier-style names.

Character-length distribution for display_name (mean: 9.283327190072491).
chars	count
1 – 3	329
3 – 6	1711
6 – 8	1902
8 – 11	1037
11 – 13	1236
13 – 15	1354
15 – 18	275
18 – 20	167
20 – 23	39
23 – 25	27
25 – 27	31
27 – 30	4
30 – 32	14
32 – 35	2
35 – 37	2
37 – 39	2
39 – 42	0
42 – 44	1
44 – 47	1
47 – 49	1
49 – 51	1
51 – 54	0
54 – 56	0
56 – 59	0
59 – 61	0
61 – 63	0
63 – 66	2
66 – 68	0
68 – 71	0
71 – 73	0
73 – 75	0
75 – 78	0
78 – 80	0
80 – 83	0
83 – 85	0
85 – 87	0
87 – 90	0
90 – 92	0
92 – 95	0
95 – 97	1

Fig 5.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
_id	text	0.0%
display_name	text	0.0%
num_ratings_pages	numeric	41.7%
num_reviews	numeric	0.0%
username	text	0.0%

Fig 6.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	num_ratings_pages	num_reviews
num_ratings_pages	+1.00	+0.08
num_reviews	+0.08	+1.00
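
The report does not say how Saturn treats the 41.7% of rows where num_ratings_pages is missing when it computes this coefficient. The following is a minimal sketch of one common convention, pairwise deletion, under that assumption; the function name `pearson_pairwise` and the sample values are hypothetical and not taken from the export.

```python
import math

def pearson_pairwise(xs, ys):
    """Pearson r over rows where both values are present (None = missing)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Made-up values: the last row is dropped because its first value is missing.
pages = [1, 2, 3, None]
reviews = [2, 4, 6, 8]
r = pearson_pairwise(pages, reviews)
print(round(r, 2))
```

Because both columns are heavily right-skewed, a coefficient computed on raw counts can be dominated by a handful of extreme rows, so +0.08 may understate or overstate the relationship among typical users.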

_id text identifier

This column is a per-row identifier, almost certainly MongoDB ObjectIds: every one of the 8139 values is unique, exactly 24 characters long, single-token, and the samples are 24-char hex strings. There are no nulls, duplicates, or empties, and vocab_size equals n, confirming a pure primary key with no analytic content.

Treatment: Drop for modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
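
If the ObjectId reading is right, it can be checked cheaply. The sketch below assumes the standard ObjectId layout, in which the first 4 bytes (8 hex characters) are a big-endian Unix timestamp in seconds. The sample id is made up, and the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# 24 hexadecimal characters, the string form of a 12-byte ObjectId.
OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")

def looks_like_object_id(value):
    """True if the whole string is exactly 24 hex characters."""
    return OBJECT_ID_RE.fullmatch(value) is not None

def object_id_timestamp(value):
    """Decode the leading 8 hex characters as seconds since the Unix epoch (UTC)."""
    if not looks_like_object_id(value):
        raise ValueError(f"not a 24-char hex string: {value!r}")
    return datetime.fromtimestamp(int(value[:8], 16), tz=timezone.utc)

# Made-up id for illustration; not a value from the export.
SAMPLE_ID = "5fc4172e0000000000000000"
print(looks_like_object_id(SAMPLE_ID))
print(object_id_timestamp(SAMPLE_ID).isoformat())
```

If the decoded timestamps fall in a plausible range, the column may hold a record-creation date despite having no value as a modelling feature; whether that holds for this export is not established by the report.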

Out[12]:

saturn.columns["_id"].stats

stat	value
n	8,139
nulls	0 (0.0%)
unique	8,139
len_min	24
len_max	24
len_mean	24
len_median	24
len_p95	24
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	8,139
readability_flesch_mean	31.97
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word

Fig 7.

Character-length distribution for _id.

Character-length distribution for _id (mean: 24.0).
chars	count
24	8139

display_name text free_text

Short user display names: nearly 60% are a single word (one_word_rate 0.5979), with a median length of 9 chars and a median word count of 1, and the top values are dominated by first names like Sam, Jack, Emma. Notable quirks: 307 rows contain the literal string "null" (not actual nulls, since null_rate is 0.0), duplicate_rate is 12.3% with 1,003 repeats, and 5.8% include emoji. Vocabulary is wide (7,487 tokens across 8,139 rows), consistent with free-form handles rather than a controlled label set.

Treatment: Treat as free-text handles: replace literal "null" tokens with true missing, lowercase-normalize, and avoid using as a join key given 12% duplicates.

anthropic:claude-opus-4-7 · confidence high
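
The treatment above can be sketched as follows. This is a minimal illustration, not Saturn's own code: `clean_display_name` is a hypothetical helper and the sample names are made up. It assumes every literal "null" is a placeholder, although a user could in principle have chosen that name on purpose.

```python
from collections import Counter

def clean_display_name(raw):
    """Lowercase and trim a display name; map the literal 'null' to None."""
    text = raw.strip()
    if text.lower() == "null":
        return None  # treat the placeholder string as truly missing
    return text.lower()

# Made-up sample values for illustration.
names = ["Sam", " sam ", "null", "Jack", "Emma"]
cleaned = [clean_display_name(name) for name in names]

# Count repeats among the non-missing names to see why this is a poor join key.
counts = Counter(name for name in cleaned if name is not None)
duplicates = {name: count for name, count in counts.items() if count > 1}
print(cleaned)
print(duplicates)
```

When loading the file with pandas instead, note that `read_csv` treats the string "null" as missing by default, so the 307 literal values only show up as strings if `keep_default_na=False` is passed.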

Out[15]:

saturn.columns["display_name"].stats

stat	value
n	8,139
nulls	0 (0.0%)
unique	7,136
len_min	1
len_max	97
len_mean	9.283
len_median	9
len_p95	16
word_mean	1.479
word_median	1
n_empty	0
n_duplicates	1,003
duplicate_rate	0.1232
vocab_size	7,487
readability_flesch_mean	45.67
emoji_rate	0.05836
url_rate	0
one_word_rate	0.5979
allcaps_rate	0.01954
boilerplate_rate	0
alert: one_word	59.8% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for display_name.

Character-length distribution for display_name (mean: 9.283327190072491).
chars	count
1 – 3	329
3 – 6	1711
6 – 8	1902
8 – 11	1037
11 – 13	1236
13 – 15	1354
15 – 18	275
18 – 20	167
20 – 23	39
23 – 25	27
25 – 27	31
27 – 30	4
30 – 32	14
32 – 35	2
35 – 37	2
37 – 39	2
39 – 42	0
42 – 44	1
44 – 47	1
47 – 49	1
49 – 51	1
51 – 54	0
54 – 56	0
56 – 59	0
59 – 61	0
61 – 63	0
63 – 66	2
66 – 68	0
68 – 71	0
71 – 73	0
73 – 75	0
75 – 78	0
78 – 80	0
80 – 83	0
83 – 85	0
85 – 87	0
87 – 90	0
90 – 92	0
92 – 95	0
95 – 97	1

num_ratings_pages numeric feature

Numeric count of rating pages per user, present for roughly 58% of rows (null_rate 0.4168), with 177 distinct values from 1 to 1,208 and a median of 25. The distribution is severely right-skewed (skew 11.24, kurtosis 298.3), with 236 outliers above the IQR fence (4.97% of the 4,747 non-null rows), so the mean of 32.81 sits well above the typical row.

Treatment: Log-transform and impute missing before modelling.

anthropic:claude-opus-4-7 · confidence high
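
The report does not name an imputation method. As one possible reading of the treatment, the sketch below fills missing values with the observed median and keeps a missing-value flag, which matters when about 42% of rows are imputed. A plain log is safe here because the column minimum is 1 and zero_rate is 0. The function name and sample values are hypothetical.

```python
import math
import statistics

def impute_and_log(values):
    """Median-impute None entries, then take the natural log.

    Returns a list of (log_value, was_missing) tuples so that a downstream
    model can still tell imputed rows from observed ones.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    rows = []
    for v in values:
        was_missing = v is None
        rows.append((math.log(fill if was_missing else v), was_missing))
    return rows

# Made-up sample spanning the column's reported range (1 to 1,208).
sample = [1, 25, None, 1208]
rows = impute_and_log(sample)
print(rows)
```

Whether the nulls are missing at random is not something the stats can show; if they mark a distinct group of users, the flag may carry more information than the imputed value.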

Out[18]:

saturn.columns["num_ratings_pages"].stats

stat	value
n	8,139
nulls	3,392 (41.7%)
unique	177
min	1
max	1,208
mean	32.81
median	25
std	35.23
q1	15
q3	42
iqr	27
skew	11.24
kurtosis	298.3
n_outliers	236
outlier_rate	0.04972
zero_rate	0
alert: null_rate	41.7% null
alert: high_skew	skew=+11.24

Fig 9.

Distribution of num_ratings_pages. Vertical dash marks the median.

Histogram bins for num_ratings_pages (median: 25.0).
bin	count
1 – 31.18	2911
31.18 – 61.35	1307
61.35 – 91.53	355
91.53 – 121.7	100
121.7 – 151.9	36
151.9 – 182.1	15
182.1 – 212.2	9
212.2 – 242.4	1
242.4 – 272.6	3
272.6 – 302.8	3
302.8 – 332.9	2
332.9 – 363.1	2
363.1 – 393.3	0
393.3 – 423.4	1
423.4 – 453.6	0
453.6 – 483.8	0
483.8 – 514	0
514 – 544.1	0
544.1 – 574.3	0
574.3 – 604.5	0
604.5 – 634.7	0
634.7 – 664.9	0
664.9 – 695	0
695 – 725.2	1
725.2 – 755.4	0
755.4 – 785.6	0
785.6 – 815.7	0
815.7 – 845.9	0
845.9 – 876.1	0
876.1 – 906.2	0
906.2 – 936.4	0
936.4 – 966.6	0
966.6 – 996.8	0
996.8 – 1027	0
1027 – 1057	0
1057 – 1087	0
1087 – 1117	0
1117 – 1148	0
1148 – 1178	0
1178 – 1208	1

num_reviews numeric feature

num_reviews is a count of reviews per user, ranging from 0 to 17,184 with a median of 588 and a mean of 868. The distribution is heavily right-skewed (skew 3.92, kurtosis 33.3), with 505 outliers (6.2%) and only 0.96% zeros. The gap between q3 (1,130) and the max (17,184) signals a long tail of highly prolific reviewers.

Treatment: log1p-transform before modelling to tame the right tail.

anthropic:claude-opus-4-7 · confidence high
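
A short sketch of the suggested transform. log1p computes log(1 + x), which stays defined for the 0.96% of rows with zero reviews where a plain log would fail. The sample counts reuse the reported min, median, and max; the function name is hypothetical.

```python
import math

def log1p_counts(counts):
    """Apply log(1 + x) to each non-negative count."""
    return [math.log1p(c) for c in counts]

# Reported min, median, and max of num_reviews.
sample = [0, 588, 17184]
transformed = log1p_counts(sample)
print([round(t, 2) for t in transformed])

# expm1 inverts the transform when predictions need to be mapped back to counts.
restored = [math.expm1(t) for t in transformed]
```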

Out[21]:

saturn.columns["num_reviews"].stats

stat	value
n	8,139
nulls	0 (0.0%)
unique	2,416
min	0
max	17,184
mean	868.4
median	588
std	979.1
q1	267
q3	1,130
iqr	863
skew	3.923
kurtosis	33.31
n_outliers	505
outlier_rate	0.06205
zero_rate	0.009583
alert: high_skew	skew=+3.92
alert: outliers	6.2% rows beyond 1.5 IQR
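
The "beyond 1.5 IQR" alert can be reproduced from the quartiles in the stats tables. This sketch assumes the usual Tukey convention (fences at q1 - 1.5·IQR and q3 + 1.5·IQR); the function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported in the stats tables.
reviews_fences = tukey_fences(267, 1130)  # num_reviews
pages_fences = tukey_fences(15, 42)       # num_ratings_pages
print(reviews_fences)
print(pages_fences)
```

Both lower fences are negative, so for these non-negative counts every flagged outlier sits on the high side: above 2,424.5 reviews or above 82.5 rating pages.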

Fig 10.

Distribution of num_reviews. Vertical dash marks the median.

Histogram bins for num_reviews (median: 588.0).
bin	count
0 – 429.6	3145
429.6 – 859.2	2163
859.2 – 1289	1165
1289 – 1718	615
1718 – 2148	396
2148 – 2578	217
2578 – 3007	158
3007 – 3437	98
3437 – 3866	56
3866 – 4296	37
4296 – 4726	22
4726 – 5155	15
5155 – 5585	10
5585 – 6014	11
6014 – 6444	5
6444 – 6874	9
6874 – 7303	4
7303 – 7733	1
7733 – 8162	2
8162 – 8592	1
8592 – 9022	0
9022 – 9451	1
9451 – 9881	2
9881 – 1.031e+04	1
1.031e+04 – 1.074e+04	0
1.074e+04 – 1.117e+04	0
1.117e+04 – 1.16e+04	0
1.16e+04 – 1.203e+04	0
1.203e+04 – 1.246e+04	1
1.246e+04 – 1.289e+04	0
1.289e+04 – 1.332e+04	0
1.332e+04 – 1.375e+04	1
1.375e+04 – 1.418e+04	0
1.418e+04 – 1.461e+04	1
1.461e+04 – 1.504e+04	1
1.504e+04 – 1.547e+04	0
1.547e+04 – 1.59e+04	0
1.59e+04 – 1.632e+04	0
1.632e+04 – 1.675e+04	0
1.675e+04 – 1.718e+04	1

username text identifier

This column holds unique single-token usernames: every one of the 8139 rows has a distinct value (n_unique=8139, duplicate_rate=0.0) and one_word_rate is 1.0. Lengths are short and tightly bounded (len_min=2, len_mean≈9.79, len_max=15), consistent with a handle field rather than free text. No nulls, no URLs, no emoji, and allcaps usage is negligible (0.00037).

Treatment: Treat as a user identifier; drop from modelling features and use only for joins or deduplication.

anthropic:claude-opus-4-7 · confidence high

Out[24]:

saturn.columns["username"].stats

stat	value
n	8,139
nulls	0 (0.0%)
unique	8,139
len_min	2
len_max	15
len_mean	9.793
len_median	10
len_p95	15
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	8,139
readability_flesch_mean	2.78
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0003686
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 11.

Character-length distribution for username.

Character-length distribution for username (mean: 9.793463570463201).
chars	count
2	9
3	42
4	168
5	343
6	570
7	778
8	941
9	937
10	996
11	932
12	805
13	654
14	483
15	481

How to cite

BibTeX

@misc{saturn-letterboxd-users-export-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: letterboxd users export},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/letterboxd-users_export}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: letterboxd users export. Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/letterboxd/users_export.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/letterboxd-users_export