bluesky-alt-text · saturn notebook

Overview

Source: hf://lukeslp/bluesky-alt-text:[train]

Saturn profiled 404,841 rows across 21 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "huggingface", "lukeslp/bluesky-alt-text:[train]",
    "--findings", "bluesky-alt-text.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is a 404,841-row Bluesky image-post dataset (lukeslp/bluesky-alt-text) capturing posts with attached images, their alt text, author identifiers, and raw AT-Protocol records across 21 columns. Two things stand out for follow-up. First, alt-text length is extremely skewed (mean 227 chars against a median of 116 and a max of 65,192, with ~12% of rows flagged as outliers), so the mean is a poor summary of typical length. Second, authorship is concentrated: one DID accounts for 32,558 rows (about 8%), and the top handle 'firefaerie81.bsky.social' contributes 6,828, which is worth checking for bot or scraper bias. Content is overwhelmingly English (~72% of langs_json), although langs_json holds 214 distinct tag combinations, and images are predominantly JPEG (93.5%) with PNG a distant second. Note also that ~31% of image URLs and author handles are null; the null count (125,645) equals the number of 'jetstream' rows, so these fields appear to be populated only in 'author_feed' mode (69% of rows).

citing: row_count · column_count · columns.image_alt_length.stats · columns.author_did.top_values · columns.author_handle.top_values · columns.author_handle.null_rate · columns.image_mime_type.top_values · columns.langs_json.top_values · columns.source_mode.top_values · columns.image_count_in_post.stats · columns.text.language_counts

Out[4]:

saturn.schema() · 21 columns

column	kind	n	null%	unique	alerts
alt_text	text	404,841	0.0%	355,721	multilingual
image_alt_length	numeric	404,841	0.0%	2,052	high_skew outliers
text	text	404,841	0.0%	302,702	duplicates multilingual
author_handle	categorical	404,841	31.0%	489	null_rate
author_did	text	404,841	0.0%	35,183	duplicates one_word
post_uri	text	404,841	0.0%	335,384	one_word
post_cid	text	404,841	0.0%	335,545	one_word
created_at	text	404,841	0.0%	334,392	one_word allcaps
indexed_at	text	404,841	0.0%	335,543	one_word allcaps
langs_json	categorical	404,841	0.0%	214
image_index	numeric	404,841	0.0%	4	high_skew outliers
image_count_in_post	numeric	404,841	0.0%	4	outliers
image_mime_type	categorical	404,841	0.0%	8
image_ref	text	404,841	0.0%	375,188	one_word
image_thumb_url	text	404,841	31.0%	260,636	null_rate one_word url_heavy
image_fullsize_url	text	404,841	31.0%	260,636	null_rate one_word url_heavy
source_mode	categorical	404,841	0.0%	2
collected_at	text	404,841	0.0%	404,841	near_unique one_word allcaps
query	categorical	404,841	31.0%	489	null_rate
cursor	text	404,841	11.0%	105,919	duplicates one_word allcaps
raw_record_json	text	404,841	0.0%	335,545	multilingual

Fig 1.

image_mime_type · Shows JPEG dominance (~93%) with PNG and WebP as the only meaningful minorities.

Top values for image_mime_type (8 unique shown, of 8 total).
value	count	share
image/jpeg	378401	93.5%
image/png	23402	5.8%
image/webp	2906	0.7%
image/gif	79	0.0%
image/avif	34	0.0%
image/svg+xml	14	0.0%
application/octet-stream	4	0.0%
image/heic	1	0.0%

Fig 2.

source_mode · Roughly 69/31 split between author_feed and jetstream collection modes — explains the 31% null rate on author/url fields.

Top values for source_mode (2 unique shown, of 2 total).
value	count	share
author_feed	279196	69.0%
jetstream	125645	31.0%
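
One way to check that the nulls line up with collection mode is to cross-tabulate null-ness against source_mode. The sketch below assumes rows are available as dicts keyed by the column names in the schema table; the helper name and the three sample rows are made up for illustration.

```python
from collections import Counter

def null_by_group(rows, group_col, value_col):
    """Count (group value, is_null) pairs for value_col within each group."""
    counts = Counter()
    for row in rows:
        counts[(row[group_col], row.get(value_col) is None)] += 1
    return counts

# Made-up rows for illustration only; real rows come from the dataset.
sample = [
    {"source_mode": "author_feed", "author_handle": "a.example"},
    {"source_mode": "author_feed", "author_handle": "b.example"},
    {"source_mode": "jetstream", "author_handle": None},
]

table = null_by_group(sample, "source_mode", "author_handle")
for (mode, is_null), n in sorted(table.items()):
    print(mode, "null" if is_null else "present", n)
```

If the explanation holds on the full data, all 125,645 null handles should fall under jetstream and none under author_feed.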

Fig 3.

author_handle · No single handle dominates (top share 1.7% of all rows), but the 20 handles shown together account for about 22% of rows; check whether a few prolific accounts skew downstream analysis.

Top values for author_handle (20 unique shown, of 489 total).
value	count	share
firefaerie81.bsky.social	6828	1.7%
dashdashado.bsky.social	6386	1.6%
epsteinweb.bsky.social	6146	1.5%
lukesteuber.com	6019	1.5%
azgibsonz.bsky.social	5538	1.4%
majima.club	5427	1.3%
allthegalaxies.galaxyzoo.org	5000	1.2%
guggenheim.bsky.social	5000	1.2%
listigeplaylists.bsky.social	4998	1.2%
catsofyore.bsky.social	4610	1.1%
magpie.tips	4375	1.1%
skwinnicki.bsky.social	4082	1.0%
gonzo.bsky.social	3894	1.0%
ethankocak.com	3653	0.9%
maladroithe.bsky.social	3590	0.9%
nikolaidenmark.bsky.social	3429	0.8%
mbkplus.bsky.social	3384	0.8%
eustace.justshootme.ca	3312	0.8%
veronicaf.bsky.social	2686	0.7%
sarahmackattack.bsky.social	2645	0.7%

Fig 4.

image_alt_length · Heavy right-skew (median 116, max 65,192) — inspect the long tail of unusually verbose alt text.

Histogram bins for image_alt_length (median: 116.0). Only the non-empty bins of the 40 equal-width bins are listed.
bin	count
1 – 1631	400124
1631 – 3261	4709
3261 – 4890	2
6520 – 8150	1
17929 – 19558	1
37486 – 39116	1
48894 – 50524	1
63562 – 65192	2

Fig 5.

image_count_in_post · Most posts attach a single image; bar shows how often users post 2–4 image carousels.

Histogram bins for image_count_in_post (median: 1.0). Only the four non-empty bins are listed, one per integer value.
value	count
1	291415
2	54031
3	20144
4	39251

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
alt_text	text	0.0%
image_alt_length	numeric	0.0%
text	text	0.0%
author_handle	categorical	31.0%
author_did	text	0.0%
post_uri	text	0.0%
post_cid	text	0.0%
created_at	text	0.0%
indexed_at	text	0.0%
langs_json	categorical	0.0%
image_index	numeric	0.0%
image_count_in_post	numeric	0.0%
image_mime_type	categorical	0.0%
image_ref	text	0.0%
image_thumb_url	text	31.0%
image_fullsize_url	text	31.0%
source_mode	categorical	0.0%
collected_at	text	0.0%
query	categorical	31.0%
cursor	text	11.0%
raw_record_json	text	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 9,575 detected strings).
lang	count	share
en	8840	92.3%
de	186	1.9%
ja	124	1.3%
fr	103	1.1%
pt	90	0.9%
es	75	0.8%
it	31	0.3%
nl	26	0.3%
ru	16	0.2%
fi	15	0.2%
sv	8	0.1%
ko	7	0.1%
zh	7	0.1%
ar	5	0.1%
ca	5	0.1%
tr	4	0.0%
no	3	0.0%
uk	3	0.0%
sl	3	0.0%
da	2	0.0%
hi	2	0.0%
eu	2	0.0%
hu	2	0.0%
ceb	2	0.0%
is	2	0.0%
vi	2	0.0%
si	1	0.0%
pl	1	0.0%
sr	1	0.0%
te	1	0.0%
fa	1	0.0%
fy	1	0.0%
hr	1	0.0%
id	1	0.0%
eo	1	0.0%
oc	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	image_alt_length	image_index	image_count_in_post
image_alt_length	+1.00	-0.13	-0.07
image_index	-0.13	+1.00	+0.73
image_count_in_post	-0.07	+0.73	+1.00

alt_text text free_text

Free-text image alt-text descriptions, predominantly English (4071 detected in the sample) with a long multilingual tail of 29 other languages, including German (94), French (59), Japanese (47) and Spanish (37). Length varies widely: the median is 116 chars and the 95th percentile is 966, but the max is 65192. 12.1% of values are duplicates, driven by templated boilerplate such as 'Epstein Web iOS app on the App Store.' (5227 repeats) and lengthy pasted feed-rules blocks. Vocabulary is large (85972 unique tokens) and readability sits at Flesch 62.5, but the boilerplate rate (4.6%) and generic placeholders ('Image', 'Image 1', 'image') point to a meaningful share of low-information content.

Treatment: Strip boilerplate and placeholder strings, then tokenize and embed (with language detection) before modelling.

anthropic:claude-opus-4-7 · confidence high
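
A minimal sketch of the stripping step in the treatment, assuming alt texts are held as a plain list of strings. The placeholder pattern and the `max_repeats` cap are assumptions for illustration; they should be tuned after inspecting the real top values.

```python
import re
from collections import Counter

# Assumed placeholder pattern: "image", "Image 1", and similar.
PLACEHOLDER = re.compile(r"^\s*image(\s+\d+)?\s*$", re.IGNORECASE)

def clean_alt_texts(texts, max_repeats=1):
    """Drop placeholder strings and keep at most max_repeats copies of any exact string."""
    seen = Counter()
    kept = []
    for text in texts:
        if PLACEHOLDER.match(text):
            continue  # generic placeholder, carries no description
        seen[text] += 1
        if seen[text] <= max_repeats:
            kept.append(text)
    return kept

sample = [
    "Image",
    "Image 1",
    "A tabby cat asleep on a windowsill.",  # made-up example description
    "Epstein Web iOS app on the App Store.",
    "Epstein Web iOS app on the App Store.",
]
print(clean_alt_texts(sample))
```

Capping repeats rather than dropping them entirely keeps one copy of each templated string, which may be preferable if the boilerplate itself is of interest.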

Out[14]:

saturn.columns["alt_text"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	355,721
len_min	1
len_max	65,192
len_mean	227.5
len_median	116
len_p95	966
word_mean	34.21
word_median	20
n_empty	0
n_duplicates	49,120
duplicate_rate	0.1213
vocab_size	85,972
readability_flesch_mean	62.52
emoji_rate	0.02271
url_rate	0.0288
one_word_rate	0.02114
allcaps_rate	0.0159
boilerplate_rate	0.04618
alert: multilingual	30 languages detected in sample

Fig 9.

Character-length distribution for alt_text.

Character-length distribution for alt_text (mean: 227.4568361406083).
chars	count
1 – 1631	400124
1631 – 3261	4709
3261 – 4890	2
4890 – 6520	0
6520 – 8150	1
8150 – 9780	0
9780 – 11409	0
11409 – 13039	0
13039 – 14669	0
14669 – 16299	0
16299 – 17929	0
17929 – 19558	1
19558 – 21188	0
21188 – 22818	0
22818 – 24448	0
24448 – 26077	0
26077 – 27707	0
27707 – 29337	0
29337 – 30967	0
30967 – 32596	0
32596 – 34226	0
34226 – 35856	0
35856 – 37486	0
37486 – 39116	1
39116 – 40745	0
40745 – 42375	0
42375 – 44005	0
44005 – 45635	0
45635 – 47264	0
47264 – 48894	0
48894 – 50524	1
50524 – 52154	0
52154 – 53784	0
53784 – 55413	0
55413 – 57043	0
57043 – 58673	0
58673 – 60303	0
60303 – 61932	0
61932 – 63562	0
63562 – 65192	2

image_alt_length numeric feature

Character counts of the image alt-text field, ranging from 1 to 65,192 with a median of 116 and an interquartile range of 59 to 226 (IQR 167). The distribution is severely right-skewed (skew 38.4, kurtosis 5934) and 11.8% of rows (47,944) flag as outliers. The upper fence is only 476.5 chars (226 + 1.5 × 167), so most flagged rows are moderately long descriptions; the extreme tail is tiny, with just 8 rows above 3,261 chars, which may be entire passages pasted into the alt attribute. There are no nulls or zeros, so every row carries some alt content.

Treatment: Log-transform or winsorize before modelling to tame the extreme right tail.

anthropic:claude-opus-4-7 · confidence high
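
The two options in the treatment can be sketched as follows. The cap of 966 is an assumption (the reported 95th percentile of alt_text length); the five sample lengths are the reported min, q1, median, q3 and max, used purely as illustrative inputs.

```python
import math

def winsorize(values, lower, upper):
    """Clamp each value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def log1p_all(values):
    """Apply log(1 + x), which compresses the right tail and is defined at 0."""
    return [math.log1p(v) for v in values]

lengths = [1, 59, 116, 226, 65192]
CAP = 966  # assumption: cap at the reported p95, not a value prescribed by the report

print(winsorize(lengths, 1, CAP))
print([round(x, 2) for x in log1p_all(lengths)])
```

Winsorizing keeps the unit (characters) but discards information above the cap; the log transform keeps the ordering of all values while shrinking the gap between typical and extreme lengths.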

Out[17]:

saturn.columns["image_alt_length"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	2,052
min	1
max	65,192
mean	227.5
median	116
std	364.4
q1	59
q3	226
iqr	167
skew	38.39
kurtosis	5934
n_outliers	47,944
outlier_rate	0.1184
zero_rate	0
alert: high_skew	skew=+38.39
alert: outliers	11.8% rows beyond 1.5 IQR

Fig 10.

Distribution of image_alt_length. Vertical dash marks the median.

Histogram bins for image_alt_length (median: 116.0). Only the non-empty bins of the 40 equal-width bins are listed.
bin	count
1 – 1631	400124
1631 – 3261	4709
3261 – 4890	2
6520 – 8150	1
17929 – 19558	1
37486 – 39116	1
48894 – 50524	1
63562 – 65192	2

text text free_text

Short user-generated Bluesky posts (len_mean 136.5, len_max 397, word_median 15) with frequent emoji and links (emoji_rate 0.226, url_rate 0.156); top values include sign-up rule boilerplate and hashtags like #Alt4You. Predominantly English (4337 detected in the sample) but spanning 31 languages, including German (91), Japanese (75), Portuguese (48), and French (43). Watch the heavy duplication (duplicate_rate 0.252, n_duplicates 102139), part of which is presumably expected because the post text repeats on every image row of a multi-image post, and the 17970 empty strings sitting at the top of the value list.

Treatment: Drop empties and exact duplicates, then language-filter or multilingual-embed before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[20]:

saturn.columns["text"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	302,702
len_min	0
len_max	397
len_mean	136.5
len_median	131
len_p95	296
word_mean	19.73
word_median	15
n_empty	17,970
n_duplicates	102,139
duplicate_rate	0.2523
vocab_size	82,709
readability_flesch_mean	50.35
emoji_rate	0.2264
url_rate	0.1555
one_word_rate	0.06977
allcaps_rate	0.01936
boilerplate_rate	0.001766
alert: duplicates	25.2% duplicate strings
alert: multilingual	31 languages detected in sample

Fig 11.

Character-length distribution for text.

Character-length distribution for text (mean: 136.48918464285978).
chars	count
0 – 10	27167
10 – 20	16735
20 – 30	17832
30 – 40	17734
40 – 50	16396
50 – 60	15845
60 – 69	14365
69 – 79	13290
79 – 89	14179
89 – 99	13210
99 – 109	12132
109 – 119	11291
119 – 129	10399
129 – 139	9111
139 – 149	9015
149 – 159	38531
159 – 169	9235
169 – 179	10560
179 – 189	8096
189 – 198	8286
198 – 208	7590
208 – 218	6996
218 – 228	6854
228 – 238	7006
238 – 248	10714
248 – 258	8843
258 – 268	8344
268 – 278	9931
278 – 288	10595
288 – 298	19129
298 – 308	15409
308 – 318	14
318 – 328	0
328 – 337	0
337 – 347	0
347 – 357	1
357 – 367	1
367 – 377	1
377 – 387	2
387 – 397	2

author_handle categorical foreign_key

Author handles on Bluesky/AT Protocol (most values end in .bsky.social, with some custom domains like lukesteuber.com and majima.club). There are only 489 unique handles across the 279,196 non-null rows, with a high entropy ratio (0.859) and a flat top: the most prolific handle holds 2.4% of non-null rows (1.7% of all rows). This looks like a curated set of accounts rather than an open population. The remaining 31% of rows (125,645) have a null handle, a count that equals the number of jetstream rows, so handles appear to be recorded only in author_feed mode; confirm this before any per-author aggregation.

Treatment: Treat as a categorical author key for author_feed rows; for analyses that span both source modes, group on author_did (no nulls) rather than dropping or imputing handles.

anthropic:claude-opus-4-7 · confidence high

Out[23]:

saturn.columns["author_handle"].stats

stat	value
n	404,841
nulls	125,645 (31.0%)
unique	489
top_value	firefaerie81.bsky.social
top_rate	0.02446
cardinality	489
entropy	7.676
entropy_ratio	0.8593
alert: null_rate	31.0% null

Fig 12.

Top values for author_handle.

Top values for author_handle (20 unique shown, of 489 total).
value	count	share
firefaerie81.bsky.social	6828	1.7%
dashdashado.bsky.social	6386	1.6%
epsteinweb.bsky.social	6146	1.5%
lukesteuber.com	6019	1.5%
azgibsonz.bsky.social	5538	1.4%
majima.club	5427	1.3%
allthegalaxies.galaxyzoo.org	5000	1.2%
guggenheim.bsky.social	5000	1.2%
listigeplaylists.bsky.social	4998	1.2%
catsofyore.bsky.social	4610	1.1%
magpie.tips	4375	1.1%
skwinnicki.bsky.social	4082	1.0%
gonzo.bsky.social	3894	1.0%
ethankocak.com	3653	0.9%
maladroithe.bsky.social	3590	0.9%
nikolaidenmark.bsky.social	3429	0.8%
mbkplus.bsky.social	3384	0.8%
eustace.justshootme.ca	3312	0.8%
veronicaf.bsky.social	2686	0.7%
sarahmackattack.bsky.social	2645	0.7%

author_did text foreign_key

This column holds Bluesky/AT Protocol decentralized identifiers for post authors, one token per value. Length is almost constant: 404,831 of 404,841 values are exactly 32 chars, as expected for the `did:plc:` form, and the other 10 range from 17 to 33 chars. Across 404,841 rows there are only 35,183 unique authors (duplicate_rate 0.913), and one author alone accounts for 32,558 rows (about 8%), a heavy concentration at the top.

Treatment: Treat as an author key — left-join to a profile table, and consider per-author aggregation or capping before modelling.

anthropic:claude-opus-4-7 · confidence high
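
Per-author capping, as suggested in the treatment, can be sketched like this. Rows are assumed to be dicts with an `author_did` key; the default cap of 1000 and the placeholder DIDs are arbitrary choices for illustration.

```python
from collections import defaultdict

def cap_per_author(rows, key="author_did", cap=1000):
    """Keep at most `cap` rows per author, preserving input order."""
    taken = defaultdict(int)
    kept = []
    for row in rows:
        author = row[key]
        if taken[author] < cap:
            taken[author] += 1
            kept.append(row)
    return kept

# Placeholder DIDs, not real accounts.
rows = [{"author_did": "did:plc:aaa"}] * 5 + [{"author_did": "did:plc:bbb"}] * 2
capped = cap_per_author(rows, cap=3)
print(len(capped))
```

Keeping the first `cap` rows is the simplest policy; sampling uniformly within each author would avoid favouring whichever rows happen to come first.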

Out[26]:

saturn.columns["author_did"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	35,183
len_min	17
len_max	33
len_mean	32
len_median	32
len_p95	32
word_mean	1
word_median	1
n_empty	0
n_duplicates	369,658
duplicate_rate	0.9131
vocab_size	4,133
readability_flesch_mean	-228.6
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: duplicates	91.3% duplicate strings
alert: one_word	100.0% rows are a single word

Fig 13.

Character-length distribution for author_did.

Character-length distribution for author_did (mean: 31.999787570922905). Only non-empty lengths are listed.
chars	count
17	1
19	2
22	1
23	2
26	3
32	404831
33	1

post_uri text foreign_key

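This is an AT Protocol post URI (Bluesky), encoding the author DID and post rkey in a fixed `at://did:plc:.../app.bsky.feed.post/...` format. Length is almost constant (404,831 of 404,841 values are 70 chars; the full range is 55 to 71) and one_word_rate is 1.0. Despite looking like a primary key, it is not unique: 335384 unique values across 404841 rows yields a 17.2% duplicate rate. Most repeats are expected because the table appears to hold one row per image (69,490 rows have image_index > 0, against 69,457 duplicate URIs), but the top URI appears 15 times, more than the 4-image maximum, so some posts were probably also collected more than once. Readability and emoji metrics are meaningless here since each value is a single token.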

Treatment: Treat as a post identifier for joins; deduplicate or aggregate on it rather than modelling its text.

anthropic:claude-opus-4-7 · confidence high
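
A sketch of how the URI might be split for joins and how image rows might be aggregated per post. It assumes the three-part `at://<authority>/<collection>/<rkey>` layout described above and rows held as dicts; both function names and the placeholder identifiers are hypothetical.

```python
from collections import defaultdict

def parse_at_uri(uri):
    """Split at://<authority>/<collection>/<rkey> into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    authority, collection, rkey = uri[len(prefix):].split("/")
    return authority, collection, rkey

def group_images_by_post(rows):
    """Map post_uri -> alt texts ordered by image_index, ignoring repeated (uri, index) pairs."""
    by_post = defaultdict(dict)
    for row in rows:
        by_post[row["post_uri"]].setdefault(row["image_index"], row["alt_text"])
    return {uri: [imgs[i] for i in sorted(imgs)] for uri, imgs in by_post.items()}

# Placeholder identifiers with realistic lengths (24-char DID suffix, 13-char rkey).
uri = "at://did:plc:" + "a" * 24 + "/app.bsky.feed.post/" + "3" * 13
rows = [
    {"post_uri": uri, "image_index": 1, "alt_text": "second"},
    {"post_uri": uri, "image_index": 0, "alt_text": "first"},
    {"post_uri": uri, "image_index": 0, "alt_text": "first"},  # repeated collection
]
print(len(uri), parse_at_uri(uri)[1], group_images_by_post(rows)[uri])
```

The placeholder URI comes to 70 characters, which matches the dominant length in the stats for a 32-char DID plus a 13-char rkey.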

Out[29]:

saturn.columns["post_uri"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	335,384
len_min	55
len_max	71
len_mean	70
len_median	70
len_p95	70
word_mean	1
word_median	1
n_empty	0
n_duplicates	69,457
duplicate_rate	0.1716
vocab_size	19,733
readability_flesch_mean	-628.8
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word

Fig 14.

Character-length distribution for post_uri.

Character-length distribution for post_uri (mean: 69.9997875709229). Only non-empty lengths are listed.
chars	count
55	1
57	2
60	1
61	2
64	3
70	404831
71	1

post_cid text identifier

This column holds CIDv1 content identifiers for the post record (all 404841 values are exactly 59 chars and start with the 'bafyrei' prefix, which indicates the dag-cbor codec with a sha2-256 hash). It has 335545 distinct values and a 17.1% duplicate_rate, with some CIDs repeating up to 4 times. That pattern is consistent with one row per image, since a post carries at most 4 images, rather than with faulty post references. The fixed length, single-token shape, and zero nulls confirm a machine-generated identifier rather than a feature.

Treatment: Treat as a join key; deduplicate before use and don't feed into models.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["post_cid"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	335,545
len_min	59
len_max	59
len_mean	59
len_median	59
len_p95	59
word_mean	1
word_median	1
n_empty	0
n_duplicates	69,296
duplicate_rate	0.1712
vocab_size	19,734
readability_flesch_mean	-567
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word

Fig 15.

Character-length distribution for post_cid.

Character-length distribution for post_cid (mean: 59.0). All values share a single length.
chars	count
59	404841

created_at text timestamp

ISO-8601 timestamps stored as text, mixing a `Z` suffix (`.000Z`/`.911Z` style) with a `+00:00` offset. Lengths run from 20 to 35 chars around a median of 24, which suggests several fractional-second precisions as well. The allcaps and one_word flags at rate 1.0 are presumably artifacts of text profiling: each value is a single token whose only letters (`T`, and `Z` where present) are uppercase. Duplicate rate is 17.4% (70,449 repeats across 404,841 rows), and top values cluster heavily on 2026-04-16, suggesting a load-day spike or truncated export window.

Treatment: Parse to datetime (normalising the two ISO suffix formats) and use as a temporal feature.

anthropic:claude-opus-4-7 · confidence high
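
A minimal normaliser for the mixed formats, using only the standard library. It assumes every value has a `T` separator, whole seconds, an optional fraction of any length, and either `Z` or a numeric offset; fractions longer than six digits are truncated to microseconds. The function name is hypothetical.

```python
import re
from datetime import datetime, timezone

_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # date and time to whole seconds
    r"(?:\.(\d+))?"                            # optional fractional seconds
    r"(Z|[+-]\d{2}:\d{2})$"                    # 'Z' or a numeric UTC offset
)

def parse_utc(ts):
    """Parse an ISO-8601 timestamp string into a timezone-aware UTC datetime."""
    m = _ISO.match(ts)
    if m is None:
        raise ValueError(f"unrecognised timestamp: {ts!r}")
    base, frac, tz = m.groups()
    frac = (frac or "")[:6].ljust(6, "0")  # pad or truncate to microseconds
    tz = "+00:00" if tz == "Z" else tz
    return datetime.fromisoformat(f"{base}.{frac}{tz}").astimezone(timezone.utc)

for raw in ["2025-08-02T00:50:34Z", "2025-08-02T00:50:34.108Z",
            "2025-08-02T00:50:34.108000+00:00"]:
    print(len(raw), parse_utc(raw).isoformat())
```

Normalising to a fixed six-digit fraction before calling `datetime.fromisoformat` avoids depending on the more permissive parsing added in newer Python versions.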

Out[35]:

saturn.columns["created_at"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	334,392
len_min	20
len_max	35
len_mean	24.55
len_median	24
len_p95	29
word_mean	1
word_median	1
n_empty	0
n_duplicates	70,449
duplicate_rate	0.174
vocab_size	19,732
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 16.

Character-length distribution for created_at.

Character-length distribution for created_at (mean: 24.54517946551856). Only non-empty lengths are listed.
chars	count
20	14365
24	324204
25	8039
26	4
27	37444
28	244
29	1903
30	1003
32	17434
33	200
35	1

indexed_at text timestamp

This is an ISO-8601 indexing timestamp stored as text, with values like '2025-08-02T00:50:34.108Z'. Two formats are mixed: a 24-char 'Z' suffix variant (279,196 rows) and a 32-char '+00:00' offset variant (125,645 rows). These counts equal the author_feed and jetstream row counts, so the format appears to follow source_mode. Duplicate rate is 17.1% (69,298 rows) across 335,543 unique values out of 404,841; this nearly matches the post_cid duplicate count (69,296), so most repeats are probably images of the same post sharing one timestamp rather than batch indexing events.

Treatment: Parse to a single UTC datetime type and use for recency filtering or time-based joins.

anthropic:claude-opus-4-7 · confidence high

Out[38]:

saturn.columns["indexed_at"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	335,543
len_min	24
len_max	32
len_mean	26.48
len_median	24
len_p95	32
word_mean	1
word_median	1
n_empty	0
n_duplicates	69,298
duplicate_rate	0.1712
vocab_size	19,734
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 17.

Character-length distribution for indexed_at.

Character-length distribution for indexed_at (mean: 26.48285129223572). Only non-empty lengths are listed.
chars	count
24	279196
32	125645

langs_json categorical feature

This column holds JSON-encoded language tag arrays, almost certainly the detected or declared languages per record. English dominates at 71.9% of rows (["en"]), and a notable 83,264 rows carry an empty array [] indicating no language assigned. Cardinality is 214 distinct strings with low entropy ratio (0.179), and the long tail includes multi-language combinations like ["en", "sq"] and ["ae", "ay", "en"].

Treatment: Parse the JSON arrays and one-hot encode the primary language, treating [] as a separate 'unknown' category and deciding whether region variants such as en-US should fold into en.

anthropic:claude-opus-4-7 · confidence high
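
The parsing step might look like the sketch below. Two choices here are assumptions rather than anything the report states: the first array element is taken as the primary language, and region subtags (en-US) are folded into their base tag (en).

```python
import json

def primary_language(langs_json, unknown="unknown"):
    """Return the base subtag of the first listed language, or `unknown` for an empty array."""
    langs = json.loads(langs_json)
    if not langs:
        return unknown
    return langs[0].split("-")[0].lower()

def one_hot(label, vocabulary):
    """Encode `label` as 0/1 indicator columns over a fixed vocabulary."""
    return {f"lang_{v}": int(label == v) for v in vocabulary}

# Raw values copied from the top-values table for langs_json.
values = ['["en"]', '[]', '["en-US"]', '["ae", "ay", "en"]']
labels = [primary_language(v) for v in values]
print(labels)
print(one_hot(labels[0], ["en", "de", "unknown"]))
```

Taking the first element means a value like ["ae", "ay", "en"] is labelled ae; a multi-hot encoding over all listed tags would be an alternative if order is not meaningful.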

Out[41]:

saturn.columns["langs_json"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	214
top_value	["en"]
top_rate	0.7188
cardinality	214
entropy	1.389
entropy_ratio	0.1794

Fig 18.

Top values for langs_json.

Top values for langs_json (20 unique shown, of 214 total).
value	count	share
["en"]	291014	71.9%
[]	83264	20.6%
["de"]	7702	1.9%
["fr"]	3892	1.0%
["ja"]	3087	0.8%
["en-US"]	2758	0.7%
["pt"]	2623	0.6%
["es"]	2341	0.6%
["en", "sq"]	1763	0.4%
["ae", "ay", "en"]	1029	0.3%
["nl"]	959	0.2%
["en-AU", "en"]	493	0.1%
["en", "ru", "hi"]	200	0.0%
["fi"]	194	0.0%
["sv"]	183	0.0%
["tr"]	182	0.0%
["en", "ja"]	178	0.0%
["ca"]	155	0.0%
["ar"]	137	0.0%
["ko"]	128	0.0%

image_index numeric feature

A small-integer counter taking only 4 distinct values between 0 and 3, with 82.8% zeros and a mean of 0.26 — likely an image position/index within a multi-image record. The distribution is heavily right-skewed (skew 2.73, kurtosis 7.08) and the IQR is 0, so any non-zero entry registers as an outlier (17.16% flagged).

Treatment: Treat as a low-cardinality categorical (one-hot or keep as ordinal); ignore outlier flag since IQR is zero by construction.

anthropic:claude-opus-4-7 · confidence high

Out[44]:

saturn.columns["image_index"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	4
min	0
max	3
mean	0.2604
median	0
std	0.6467
q1	0
q3	0
iqr	0
skew	2.726
kurtosis	7.083
n_outliers	69,490
outlier_rate	0.1716
zero_rate	0.8284
alert: high_skew	skew=+2.73
alert: outliers	17.2% rows beyond 1.5 IQR

Fig 19.

Distribution of image_index. Vertical dash marks the median.

Histogram bins for image_index (median: 0.0). Only the four non-empty bins are listed, one per integer value.
value	count
0	335351
1	43304
2	16438
3	9748

image_count_in_post numeric feature

Discrete count of images attached to a post, taking only 4 distinct integer values between 1 and 4 with no nulls or zeros. The distribution is right-skewed (skew 1.72) toward single-image posts: median and Q1 are both 1 and Q3 is 2. Roughly 9.7% of rows (39,251) flag as outliers, which is exactly the 4-image rows: the upper fence is Q3 + 1.5 × IQR = 3.5, so 3-image rows (20,144) are not flagged.

Treatment: Treat as a small ordinal/categorical count rather than continuous; bin 3-4 together or one-hot encode.

anthropic:claude-opus-4-7 · confidence high
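
The outlier count can be reproduced from the histogram, assuming the alert uses the usual Tukey rule (values beyond 1.5 × IQR from the quartiles), which is what the alert text suggests. The helper name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Row counts per value, read off the histogram for image_count_in_post.
value_counts = {1: 291415, 2: 54031, 3: 20144, 4: 39251}

low, high = tukey_fences(q1=1, q3=2)
n_outliers = sum(n for v, n in value_counts.items() if v < low or v > high)
rate = n_outliers / sum(value_counts.values())
print(low, high, n_outliers, round(rate, 5))
```

The same rule with q1 = q3 = 0 gives fences of 0 and 0, which is why every non-zero image_index is flagged.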

Out[47]:

saturn.columns["image_count_in_post"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	4
min	1
max	4
mean	1.524
median	1
std	0.9647
q1	1
q3	2
iqr	1
skew	1.719
kurtosis	1.551
n_outliers	39,251
outlier_rate	0.09695
zero_rate	0
alert: outliers	9.7% rows beyond 1.5 IQR

Fig 20.

Distribution of image_count_in_post. Vertical dash marks the median.

Histogram bins for image_count_in_post (median: 1.0). Only the four non-empty bins are listed, one per integer value.
value	count
1	291415
2	54031
3	20144
4	39251

image_mime_type categorical metadata

This column records the MIME type of an associated image, with 8 distinct values across 404,841 rows and no nulls. It is heavily dominated by image/jpeg at 93.5% (378,401 rows), with image/png a distant second (23,402) and a long tail of webp, gif, avif, svg+xml, heic, plus 4 rows of application/octet-stream that aren't an image type at all. Entropy ratio is just 0.128, confirming very low diversity.

Treatment: Collapse rare types into an 'other' bucket (and review the 4 application/octet-stream rows) before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
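
Collapsing the tail can be sketched as follows, using the counts from the top-values table. The 0.5% threshold and the helper name are assumptions; with that threshold WebP stays separate and the five rarest types merge.

```python
def collapse_rare(counts, min_share=0.005, other="other"):
    """Merge categories whose share of the total is below min_share into one bucket."""
    total = sum(counts.values())
    collapsed = {}
    for value, n in counts.items():
        key = value if n / total >= min_share else other
        collapsed[key] = collapsed.get(key, 0) + n
    return collapsed

mime_counts = {
    "image/jpeg": 378401, "image/png": 23402, "image/webp": 2906,
    "image/gif": 79, "image/avif": 34, "image/svg+xml": 14,
    "application/octet-stream": 4, "image/heic": 1,
}
print(collapse_rare(mime_counts))
```

The four application/octet-stream rows end up in the bucket too, so reviewing them first, as the treatment suggests, has to happen before this step.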

Out[50]:

saturn.columns["image_mime_type"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	8
top_value	image/jpeg
top_rate	0.9347
cardinality	8
entropy	0.3842
entropy_ratio	0.1281
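
The entropy figures appear consistent with Shannon entropy in bits, and entropy_ratio with that entropy divided by log2(cardinality). This definition is inferred from the numbers in the report, not taken from saturn's documentation; the check below recomputes both from the eight counts.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

# Counts for the 8 MIME types, from the top-values table.
mime_counts = [378401, 23402, 2906, 79, 34, 14, 4, 1]

h = shannon_entropy_bits(mime_counts)
ratio = h / math.log2(len(mime_counts))  # log2(8) = 3 bits is the maximum possible
print(round(h, 4), round(ratio, 4))
```

A ratio near 0 means one value dominates; a ratio near 1 means the values are close to evenly spread.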

Fig 21.

Top values for image_mime_type.

Top values for image_mime_type (8 unique shown, of 8 total).
value	count	share
image/jpeg	378401	93.5%
image/png	23402	5.8%
image/webp	2906	0.7%
image/gif	79	0.0%
image/avif	34	0.0%
image/svg+xml	14	0.0%
application/octet-stream	4	0.0%
image/heic	1	0.0%

image_ref text foreign_key

This column holds CIDv1 content hashes (every non-null value is 59 chars and starts with 'bafkrei...', indicating raw codec + sha256), one token per value. With 375188 unique values and a 7.3% duplicate rate, it functions as a content-addressed image pointer rather than a row identifier; one CID appears 5227 times, the same count as the most repeated alt_text string, suggesting a heavily reused asset. Length is rigidly fixed (min=max=59), 19 rows are null (0.0%), and there are no URLs or emojis.

Treatment: Treat as an opaque content-address key; left-join to an image/asset table and do not tokenize.

anthropic:claude-opus-4-7 · confidence high
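
The 59-character length and the 'bafkrei' prefix follow from the CIDv1 layout: a multibase 'b' followed by the unpadded base32 encoding of 36 bytes (version, codec, hash code, digest length, then a 32-byte sha2-256 digest). The sketch below decodes that header; it assumes each header field fits in a single byte, which holds for the codes involved, and `make_cid` exists only to build a test value from placeholder bytes.

```python
import base64
import hashlib

CODECS = {0x55: "raw", 0x71: "dag-cbor"}  # multicodec codes relevant here

def describe_cid(cid):
    """Decode a base32 CIDv1 string into (version, codec name, digest length in bytes)."""
    if not cid.startswith("b"):
        raise ValueError("expected multibase prefix 'b' (lower-case base32)")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # b32decode needs input padded to a multiple of 8
    raw = base64.b32decode(body)
    version, codec, hash_code, digest_len = raw[:4]
    if hash_code != 0x12:
        raise ValueError("expected a sha2-256 multihash (code 0x12)")
    if len(raw) != 4 + digest_len:
        raise ValueError("digest length does not match header")
    return version, CODECS.get(codec, hex(codec)), digest_len

def make_cid(data, codec=0x55):
    """Build a CIDv1 string for `data`; used here only to produce a test value."""
    digest = hashlib.sha256(data).digest()
    raw = bytes([0x01, codec, 0x12, len(digest)]) + digest
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")

cid = make_cid(b"placeholder image bytes")
print(len(cid), cid[:7], describe_cid(cid))
```

Switching the codec byte to 0x71 (dag-cbor) yields the 'bafyrei' prefix seen in post_cid, so the two columns can be told apart by prefix alone.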

Out[53]:

saturn.columns["image_ref"].stats

stat	value
n	404,841
nulls	19 (0.0%)
unique	375,188
len_min	59
len_max	59
len_mean	59
len_median	59
len_p95	59
word_mean	1
word_median	1
n_empty	0
n_duplicates	29,634
duplicate_rate	0.0732
vocab_size	19,560
readability_flesch_mean	-539.1
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word

Fig 22.

Character-length distribution for image_ref.

Character-length distribution for image_ref (mean: 59.0). All non-null values share a single length.
chars	count
59	404822

image_thumb_url text metadata

This column holds Bluesky CDN thumbnail URLs (every non-null value is exactly 138 chars and a single token pointing at cdn.bsky.app/img/feed_thumbnail/plain/...). 31.04% of rows are null, matching the jetstream row count (125,645), and 6.65% of non-null values are duplicates, with one thumbnail repeating 1000 times, which points to a reused image or a post collected repeatedly. With 260,636 unique URLs across 279,196 non-null rows, it's effectively a per-image reference rather than a modelling feature.

Treatment: Keep as a media reference for display or image fetching; drop from tabular modelling.

anthropic:claude-opus-4-7 · confidence high

Out[56]:

saturn.columns["image_thumb_url"].stats

stat	value
n	404,841
nulls	125,645 (31.0%)
unique	260,636
len_min	138
len_max	138
len_mean	138
len_median	138
len_p95	138
word_mean	1
word_median	1
n_empty	0
n_duplicates	18,560
duplicate_rate	0.06648
vocab_size	19,666
readability_flesch_mean	-1391
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: null_rate	31.0% null
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 23.

Character-length distribution for image_thumb_url.

Character-length distribution for image_thumb_url (mean: 138.0). All non-null values share a single length.
chars	count
138	279196

image_fullsize_url text metadata

This column holds Bluesky CDN URLs pointing to full-size feed images, all exactly 137 characters long and consisting of a single token (url_rate 1.0, one_word_rate 1.0). 31.04% of rows (125,645) are null. That count matches the 'jetstream' rows in source_mode, so the nulls more likely reflect how those rows were ingested than posts without images. 6.65% of non-null values are duplicates, and one image URL repeats 1000 times, suggesting reshared or pinned media. With 260,636 unique values among 279,196 non-null rows, the field is high-cardinality but not unique.

Treatment: Treat as an image asset reference; fetch/hash for media features rather than tokenizing as text.

anthropic:claude-opus-4-7 · confidence high
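
As a rough illustration of the treatment above, the sketch below de-duplicates URLs before any fetch and keys downloaded images by a content hash. The helper names unique_urls and content_key and the example.invalid URLs are hypothetical; the actual download step is left out.

```python
import hashlib


def unique_urls(urls):
    """Drop nulls and repeats, keeping first-seen order."""
    return list(dict.fromkeys(u for u in urls if u))


def content_key(image_bytes):
    """SHA-256 hex digest of the image bytes, usable as a media identifier."""
    return hashlib.sha256(image_bytes).hexdigest()


# Hypothetical values standing in for the image_fullsize_url column.
column = [
    "https://example.invalid/img/a",
    None,
    "https://example.invalid/img/b",
    "https://example.invalid/img/a",
]
to_fetch = unique_urls(column)
print(to_fetch)
```

With the counts reported for this column, fetching only unique URLs would avoid 18,560 repeat downloads.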

Out[59]:

saturn.columns["image_fullsize_url"].stats

stat	value
n	404,841
nulls	125,645 (31.0%)
unique	260,636
len_min	137
len_max	137
len_mean	137
len_median	137
len_p95	137
word_mean	1
word_median	1
n_empty	0
n_duplicates	18,560
duplicate_rate	0.06648
vocab_size	19,666
readability_flesch_mean	-1475
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: null_rate	31.0% null
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 24.

Character-length distribution for image_fullsize_url.

Character-length distribution for image_fullsize_url (mean: 137.0).
chars	count
137	279196

source_mode categorical feature

Binary categorical flag indicating ingestion source, with values 'author_feed' (69.0%) and 'jetstream' (31.0%) across 404,841 rows, none null. The split is imbalanced but both classes are well represented: entropy is 0.8936 bits, and because the maximum for two classes is 1 bit, entropy_ratio is also 0.8936. The absence of nulls suggests the field is always populated at ingest time.

Treatment: One-hot or boolean-encode for modelling; useful as a provenance stratifier.

anthropic:claude-opus-4-7 · confidence high

Out[62]:

saturn.columns["source_mode"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	2
top_value	author_feed
top_rate	0.6896
cardinality	2
entropy	0.8936
entropy_ratio	0.8936
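
The entropy and entropy_ratio rows can be checked by hand. The sketch below computes Shannon entropy in bits and divides by log2 of the number of observed classes; that normalisation is an assumption about how saturn defines entropy_ratio, though it agrees with the values reported for both source_mode and query.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty classes)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts from the source_mode top-values table: author_feed, jetstream.
source_counts = [279_196, 125_645]
print(round(entropy_bits(source_counts), 4), round(entropy_ratio(source_counts), 4))
```

For a two-class column the denominator is exactly 1, which is why the two rows show the same number.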

Fig 25.

Top values for source_mode.

Top values for source_mode (2 unique shown, of 2 total).
value	count	share
author_feed	279196	69.0%
jetstream	125645	31.0%

collected_at text timestamp

This is an ISO-8601 UTC timestamp column (every value is exactly 27 characters, one token, ending in Z) capturing when each of the 404,841 records was collected. All values are unique with zero nulls, so it functions as a per-row event time rather than a categorical feature. The sampled values cluster on 2026-04-15 and 2026-04-16, suggesting collection spanned roughly one to two days.

Treatment: Parse to datetime and use for time-based splits or derived time features rather than feeding the raw string to a model.

anthropic:claude-opus-4-7 · confidence high
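
A minimal parsing sketch for the treatment above, assuming the 27-character values have the shape YYYY-MM-DDTHH:MM:SS.ffffffZ (microsecond precision with a literal Z). The sample string is hypothetical and was not drawn from the dataset.

```python
from datetime import datetime, timezone


def parse_collected_at(value):
    """Parse a 27-character UTC timestamp with six fractional digits."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # strptime returns a naive datetime; the trailing Z means UTC.
    return parsed.replace(tzinfo=timezone.utc)


sample = "2026-04-15T12:34:56.123456Z"  # hypothetical value, 27 characters
collected = parse_collected_at(sample)
print(collected.isoformat())
```

strptime with an explicit format is used here because datetime.fromisoformat only accepts a trailing Z from Python 3.11 onward.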

Out[65]:

saturn.columns["collected_at"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	404,841
len_min	27
len_max	27
len_mean	27
len_median	27
len_p95	27
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 26.

Character-length distribution for collected_at.

Character-length distribution for collected_at (mean: 27.0).
chars	count
27	404841

query categorical foreign_key

The column holds Bluesky-style handles (e.g., firefaerie81.bsky.social, lukesteuber.com), suggesting it captures the account being queried rather than a free-text search string. There are 489 unique values with a high entropy ratio (0.859), so rows are spread fairly evenly with no dominant target: the top handle accounts for just 2.45% of non-null rows (1.7% of all rows, which is the share shown in the top-values table). 31.04% of rows (125,645) are null, matching the 'jetstream' row count in source_mode, so the nulls probably mark rows that were not collected through a per-account query rather than missing data.

Treatment: Filter the 31% nulls before joining, or keep them as an explicit 'no query' level; treat as a categorical key, not free text.

anthropic:claude-opus-4-7 · confidence high

Out[68]:

saturn.columns["query"].stats

stat	value
n	404,841
nulls	125,645 (31.0%)
unique	489
top_value	firefaerie81.bsky.social
top_rate	0.02446
cardinality	489
entropy	7.676
entropy_ratio	0.8593
alert: null_rate	31.0% null

Fig 27.

Top values for query.

Top values for query (20 unique shown, of 489 total).
value	count	share
firefaerie81.bsky.social	6828	1.7%
dashdashado.bsky.social	6386	1.6%
epsteinweb.bsky.social	6146	1.5%
lukesteuber.com	6019	1.5%
azgibsonz.bsky.social	5538	1.4%
majima.club	5427	1.3%
allthegalaxies.galaxyzoo.org	5000	1.2%
guggenheim.bsky.social	5000	1.2%
listigeplaylists.bsky.social	4998	1.2%
catsofyore.bsky.social	4610	1.1%
magpie.tips	4375	1.1%
skwinnicki.bsky.social	4082	1.0%
gonzo.bsky.social	3894	1.0%
ethankocak.com	3653	0.9%
maladroithe.bsky.social	3590	0.9%
nikolaidenmark.bsky.social	3429	0.8%
mbkplus.bsky.social	3384	0.8%
eustace.justshootme.ca	3312	0.8%
veronicaf.bsky.social	2686	0.7%
sarahmackattack.bsky.social	2645	0.7%
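
The share column in this table and the top_rate stat (0.02446) describe the same top value with different denominators. The short sketch below makes that explicit; which denominator saturn uses for each is inferred from the numbers, not from its documentation.

```python
TOTAL_ROWS = 404_841   # n from the stats table
NULL_ROWS = 125_645    # nulls from the stats table
TOP_COUNT = 6_828      # firefaerie81.bsky.social

non_null_rows = TOTAL_ROWS - NULL_ROWS
share_of_all = TOP_COUNT / TOTAL_ROWS          # matches the table's share column
share_of_non_null = TOP_COUNT / non_null_rows  # matches top_rate

print(f"{share_of_all:.1%} of all rows, {share_of_non_null:.2%} of non-null rows")
```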

cursor text metadata

Despite the 'text' classification, every value is a single token of 16 to 24 characters (one_word_rate 1.0), interpreted here as ISO-8601 UTC timestamps (allcaps_rate 1.0 from the T and trailing Z), which suggests a pagination cursor keyed on a record timestamp. The 16-character values number 125,645, exactly the 'jetstream' row count, so that source may use a different cursor format from the 20- to 24-character values. Among the 360,442 non-null values there are only 105,919 distinct cursors, a 70.6% duplicate rate, with the top cursors repeating 200+ times each, consistent with many records sharing the same paging anchor. Null rate is 10.97%, likely the first or last page, where no cursor applies.

Treatment: Parse as timestamp if needed for ordering, otherwise drop — it is API pagination state, not a feature.

anthropic:claude-opus-4-7 · confidence high
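
If the cursor is parsed for ordering, the mixed lengths need handling. The sketch below is speculative: it assumes the 20- to 24-character values are ISO-8601 strings with zero to three or more fractional digits and a trailing Z, and that 16-digit all-numeric values are microseconds since the Unix epoch. Neither format is confirmed by the stats, and the function name parse_cursor is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# ISO-8601 with optional fractional seconds (1 to 6 digits) and a literal Z.
ISO_Z = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?Z$")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_cursor(value):
    """Return an aware UTC datetime, or None if the format is not recognised."""
    if value is None:
        return None
    match = ISO_Z.match(value)
    if match:
        base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        fraction = match.group(2) or ""
        micros = int(fraction.ljust(6, "0")) if fraction else 0
        return base.replace(microsecond=micros, tzinfo=timezone.utc)
    if len(value) == 16 and value.isdigit():
        # ASSUMPTION: 16 digits are microseconds since the Unix epoch.
        return EPOCH + timedelta(microseconds=int(value))
    return None
```

Returning None for unrecognised shapes keeps the function safe to run before the real formats have been checked against a sample of rows.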

Out[71]:

saturn.columns["cursor"].stats

stat	value
n	404,841
nulls	44,399 (11.0%)
unique	105,919
len_min	16
len_max	24
len_mean	20.98
len_median	24
len_p95	24
word_mean	1
word_median	1
n_empty	0
n_duplicates	254,523
duplicate_rate	0.7061
vocab_size	9,220
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: duplicates	70.6% duplicate strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 28.

Character-length distribution for cursor.

Character-length distribution for cursor (mean: 20.98).
chars	count
16	125645
20	15247
22	2259
23	19278
24	198013

raw_record_json text free_text

This column holds the raw JSON payload of Bluesky posts (`app.bsky.feed.post` records with embeds, replies, langs, and text), one serialized object per row. Of 404,841 rows, 335,545 are distinct, a 17.1% duplicate rate (69,296 repeats), so the same post JSON recurs. Because the schema carries image_index and image_count_in_post, a multi-image post plausibly contributes one row per image with the same record; reposts or ingestion replays are other possible causes. Mean length is 1,184 chars (max 65,764), and Flesch readability is -250 because the JSON scaffolding dominates over prose. The language sample is 432 English against only a handful of de/fi/fr/is/ja/pt, despite the multilingual alert. URLs appear in 24.7% of rows and emoji in 23.1%, both driven by content embedded in alt text and post bodies.

Treatment: Parse as JSON and project the fields you need (text, langs, embed) before any text modelling; do not feed the raw blob to a tokenizer.

anthropic:claude-opus-4-7 · confidence high
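
A minimal sketch of the projection step described in the treatment, assuming each value is a JSON object with optional text, langs, and embed.images[].alt keys as in `app.bsky.feed.post` records with image embeds. The function name project_post and the sample record are hypothetical.

```python
import json


def project_post(raw):
    """Parse one raw_record_json value and keep only the fields of interest."""
    record = json.loads(raw)
    embed = record.get("embed") or {}
    images = embed.get("images") or []
    return {
        "text": record.get("text", ""),
        "langs": record.get("langs") or [],
        "alts": [image.get("alt", "") for image in images],
    }


# Hypothetical record, not taken from the dataset.
sample_record = json.dumps({
    "$type": "app.bsky.feed.post",
    "text": "hello",
    "langs": ["en"],
    "embed": {
        "$type": "app.bsky.embed.images",
        "images": [{"alt": "a cat on a sofa"}, {"alt": ""}],
    },
})
projected = project_post(sample_record)
print(projected)
```

Using .get with fallbacks keeps the projection tolerant of records that omit langs or carry a different embed type.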

Out[74]:

saturn.columns["raw_record_json"].stats

stat	value
n	404,841
nulls	0 (0.0%)
unique	335,545
len_min	287
len_max	65,764
len_mean	1184
len_median	967
len_p95	2,504
word_mean	67.63
word_median	49
n_empty	0
n_duplicates	69,296
duplicate_rate	0.1712
vocab_size	117,811
readability_flesch_mean	-250
emoji_rate	0.2307
url_rate	0.2473
one_word_rate	0.00786
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	8 languages detected in sample

Fig 29.

Character-length distribution for raw_record_json.

Character-length distribution for raw_record_json (mean: 1184.3).
chars	count
287 – 1924	351038
1924 – 3561	49191
3561 – 5198	3450
5198 – 6835	694
6835 – 8472	235
8472 – 10109	226
10109 – 11745	0
11745 – 13382	2
13382 – 15019	0
15019 – 16656	0
16656 – 18293	0
18293 – 19930	0
19930 – 21567	0
21567 – 23204	0
23204 – 24841	0
24841 – 26478	0
26478 – 28115	0
28115 – 29752	0
29752 – 31389	0
31389 – 33026	0
33026 – 34662	0
34662 – 36299	0
36299 – 37936	0
37936 – 39573	0
39573 – 41210	1
41210 – 42847	1
42847 – 44484	0
44484 – 46121	0
46121 – 47758	0
47758 – 49395	0
49395 – 51032	1
51032 – 52669	0
52669 – 54306	0
54306 – 55942	0
55942 – 57579	0
57579 – 59216	0
59216 – 60853	0
60853 – 62490	0
62490 – 64127	0
64127 – 65764	2

How to cite