saturn·

bluesky alt text

saturn notebook · generated 2026-04-22 Report Notebook

Overview

Source: hf://lukeslp/bluesky-alt-text:[train]

Saturn profiled 404,841 rows across 21 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "huggingface", "lukeslp/bluesky-alt-text:[train]",
    "--findings", "bluesky-alt-text.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is a 404,841-row Bluesky image-post dataset (lukeslp/bluesky-alt-text) capturing posts with attached images, their alt text, author identifiers, and raw AT-Protocol records across 21 columns. Two things stand out for follow-up. First, alt-text length is extremely skewed (mean 227 chars, median 116, max 65,192, with ~12% of rows flagged as outliers), so a small number of very long descriptions pull the mean well above the median. Second, authorship is concentrated: one DID accounts for 32,558 rows and the top handle, 'firefaerie81.bsky.social', contributes 6,828, which is worth checking for bot or scraper bias. Content is overwhelmingly English (~72% of langs_json values are ["en"]) but langs_json spans 214 distinct tag combinations, and images are predominantly JPEG (93.5%) with PNG a distant second (5.8%). Note also that ~31% of image URLs and author handles are null; the null count (125,645) equals the number of 'jetstream' rows, so the nulls most likely reflect the split between the 'author_feed' (69%) and 'jetstream' (31%) source modes.

citing: row_count · column_count · columns.image_alt_length.stats · columns.author_did.top_values · columns.author_handle.top_values · columns.author_handle.null_rate · columns.image_mime_type.top_values · columns.langs_json.top_values · columns.source_mode.top_values · columns.image_count_in_post.stats · columns.text.language_counts

Out[4]:

saturn.schema() · 21 columns

column kind n null% unique alerts
alt_text text 404,841 0.0% 355,721 multilingual
image_alt_length numeric 404,841 0.0% 2,052 high_skew outliers
text text 404,841 0.0% 302,702 duplicates multilingual
author_handle categorical 404,841 31.0% 489 null_rate
author_did text 404,841 0.0% 35,183 duplicates one_word
post_uri text 404,841 0.0% 335,384 one_word
post_cid text 404,841 0.0% 335,545 one_word
created_at text 404,841 0.0% 334,392 one_word allcaps
indexed_at text 404,841 0.0% 335,543 one_word allcaps
langs_json categorical 404,841 0.0% 214
image_index numeric 404,841 0.0% 4 high_skew outliers
image_count_in_post numeric 404,841 0.0% 4 outliers
image_mime_type categorical 404,841 0.0% 8
image_ref text 404,841 0.0% 375,188 one_word
image_thumb_url text 404,841 31.0% 260,636 null_rate one_word url_heavy
image_fullsize_url text 404,841 31.0% 260,636 null_rate one_word url_heavy
source_mode categorical 404,841 0.0% 2
collected_at text 404,841 0.0% 404,841 near_unique one_word allcaps
query categorical 404,841 31.0% 489 null_rate
cursor text 404,841 11.0% 105,919 duplicates one_word allcaps
raw_record_json text 404,841 0.0% 335,545 multilingual
Fig 1.
image_mime_type · Shows JPEG dominance (~93%) with PNG and WebP as the only meaningful minorities.
Top values for image_mime_type (8 unique shown, of 8 total).
value count share
image/jpeg 378401 93.5%
image/png 23402 5.8%
image/webp 2906 0.7%
image/gif 79 0.0%
image/avif 34 0.0%
image/svg+xml 14 0.0%
application/octet-stream 4 0.0%
image/heic 1 0.0%
Fig 2.
source_mode · Roughly 69/31 split between author_feed and jetstream collection modes — explains the 31% null rate on author/url fields.
Top values for source_mode (2 unique shown, of 2 total).
value count share
author_feed 279196 69.0%
jetstream 125645 31.0%
Fig 3.
author_handle · The top handle holds 1.7% of all rows and the 20 handles shown hold about 22%; check whether a few prolific accounts skew downstream analysis.
Top values for author_handle (20 unique shown, of 489 total).
value count share
firefaerie81.bsky.social 6828 1.7%
dashdashado.bsky.social 6386 1.6%
epsteinweb.bsky.social 6146 1.5%
lukesteuber.com 6019 1.5%
azgibsonz.bsky.social 5538 1.4%
majima.club 5427 1.3%
allthegalaxies.galaxyzoo.org 5000 1.2%
guggenheim.bsky.social 5000 1.2%
listigeplaylists.bsky.social 4998 1.2%
catsofyore.bsky.social 4610 1.1%
magpie.tips 4375 1.1%
skwinnicki.bsky.social 4082 1.0%
gonzo.bsky.social 3894 1.0%
ethankocak.com 3653 0.9%
maladroithe.bsky.social 3590 0.9%
nikolaidenmark.bsky.social 3429 0.8%
mbkplus.bsky.social 3384 0.8%
eustace.justshootme.ca 3312 0.8%
veronicaf.bsky.social 2686 0.7%
sarahmackattack.bsky.social 2645 0.7%
Fig 4.
image_alt_length · Heavy right-skew (median 116, max 65,192) — inspect the long tail of unusually verbose alt text.
Histogram bins for image_alt_length (median: 116.0).
bin count
1 – 1631 400124
1631 – 3261 4709
3261 – 4890 2
6520 – 8150 1
1.793e+04 – 1.956e+04 1
3.749e+04 – 3.912e+04 1
4.889e+04 – 5.052e+04 1
6.356e+04 – 6.519e+04 2
All other bins (32 of 40) have count 0.
Fig 5.
image_count_in_post · Most posts attach a single image; bar shows how often users post 2–4 image carousels.
Histogram bins for image_count_in_post (median: 1.0).
bin count
1 – 1.075 291415
1.975 – 2.05 54031
2.95 – 3.025 20144
3.925 – 4 39251
All other bins (36 of 40) have count 0.
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
alt_text text 0.0%
image_alt_length numeric 0.0%
text text 0.0%
author_handle categorical 31.0%
author_did text 0.0%
post_uri text 0.0%
post_cid text 0.0%
created_at text 0.0%
indexed_at text 0.0%
langs_json categorical 0.0%
image_index numeric 0.0%
image_count_in_post numeric 0.0%
image_mime_type categorical 0.0%
image_ref text 0.0%
image_thumb_url text 31.0%
image_fullsize_url text 31.0%
source_mode categorical 0.0%
collected_at text 0.0%
query categorical 31.0%
cursor text 11.0%
raw_record_json text 0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 9,575 detected strings).
lang count share
en 8840 92.3%
de 186 1.9%
ja 124 1.3%
fr 103 1.1%
pt 90 0.9%
es 75 0.8%
it 31 0.3%
nl 26 0.3%
ru 16 0.2%
fi 15 0.2%
sv 8 0.1%
ko 7 0.1%
zh 7 0.1%
ar 5 0.1%
ca 5 0.1%
tr 4 0.0%
no 3 0.0%
uk 3 0.0%
sl 3 0.0%
da 2 0.0%
hi 2 0.0%
eu 2 0.0%
hu 2 0.0%
ceb 2 0.0%
is 2 0.0%
vi 2 0.0%
si 1 0.0%
pl 1 0.0%
sr 1 0.0%
te 1 0.0%
fa 1 0.0%
fy 1 0.0%
hr 1 0.0%
id 1 0.0%
eo 1 0.0%
oc 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
column image_alt_length image_index image_count_in_post
image_alt_length +1.00 -0.13 -0.07
image_index -0.13 +1.00 +0.73
image_count_in_post -0.07 +0.73 +1.00

alt_text text free_text

Free-text image alt-text descriptions, predominantly English (4071 detected) but with a long multilingual tail spanning 28 other languages including German (94), French (59), Japanese (47) and Spanish (37). Length varies wildly — median 116 chars but max 65192, with a 95th percentile of 966 — and 12.1% of values are duplicates, driven by templated boilerplate like 'Epstein Web iOS app on the App Store.' (5227 repeats) and lengthy pasted feed-rules blocks. Vocabulary is large (85972 unique tokens) and readability sits at Flesch 62.5, but boilerplate rate (4.6%) plus generic placeholders ('Image', 'Image 1', 'image') indicate substantial low-information content.

Treatment: Strip boilerplate and placeholder strings, then tokenize and embed (with language detection) before modelling.

anthropic:claude-opus-4-7 · confidence high
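
The boilerplate-stripping step in the treatment above could look roughly like the following. This is a minimal sketch, not saturn's own boilerplate detector: the placeholder pattern generalises the three strings named in the profile ('Image', 'Image 1', 'image'), and the `max_repeats` threshold and the sample cat caption are made up for illustration.

```python
import re
from collections import Counter

# Placeholder strings named in the profile: 'Image', 'Image 1', 'image'.
# Generalising them to "image" plus an optional number is an assumption.
PLACEHOLDER = re.compile(r"^\s*image(\s+\d+)?\s*$", re.IGNORECASE)


def strip_low_information(alt_texts, max_repeats=50):
    """Drop placeholders and any string repeated more than max_repeats times."""
    counts = Counter(alt_texts)
    kept = []
    for text in alt_texts:
        if PLACEHOLDER.match(text):
            continue  # generic placeholder, carries no description
        if counts[text] > max_repeats:
            continue  # templated boilerplate repeated across many rows
        kept.append(text)
    return kept


# Hypothetical sample: two placeholders, one real caption, one repeated template.
sample = (
    ["Image", "image 1", "A tabby cat asleep on a windowsill."]
    + ["Epstein Web iOS app on the App Store."] * 60
)
cleaned = strip_low_information(sample)
print(cleaned)
```

A frequency threshold is a blunt instrument: it also removes legitimately repeated captions, so the cut-off is worth tuning against the top-values list before applying it to all 404,841 rows.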
Out[14]:

saturn.columns["alt_text"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 355,721
len_min 1
len_max 65,192
len_mean 227.5
len_median 116
len_p95 966
word_mean 34.21
word_median 20
n_empty 0
n_duplicates 49,120
duplicate_rate 0.1213
vocab_size 85,972
readability_flesch_mean 62.52
emoji_rate 0.02271
url_rate 0.0288
one_word_rate 0.02114
allcaps_rate 0.0159
boilerplate_rate 0.04618
alert: multilingual · 30 languages detected in sample
Fig 9.
Character-length distribution for alt_text.
Character-length distribution for alt_text (mean: 227.4568361406083).
chars count
1 – 1631 400124
1631 – 3261 4709
3261 – 4890 2
6520 – 8150 1
17929 – 19558 1
37486 – 39116 1
48894 – 50524 1
63562 – 65192 2
All other bins (32 of 40) have count 0.

image_alt_length numeric feature

Counts of characters in image alt-text fields, ranging from 1 to 65,192 with a median of 116 and IQR of 59–226. The distribution is severely right-skewed (skew 38.4, kurtosis 5934) and 11.8% of rows (47,944) flag as outliers, suggesting a long tail of unusually verbose alt-text or possibly entire passages dumped into the alt attribute. No nulls or zeros, so every row carries some alt content.

Treatment: Log-transform or winsorize before modelling to tame the extreme right tail.

anthropic:claude-opus-4-7 · confidence high
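
To make the outlier rule and the suggested treatment concrete, here is a minimal sketch using only the quartiles from the stats table (q1 = 59, q3 = 226). The function names are illustrative, and the five sample lengths are the min, median, q3, p95 and max reported for alt text, not real rows.

```python
import math


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, lower, upper):
    """Clip every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


low, high = tukey_fences(59, 226)          # quartiles from the stats table
lengths = [1, 116, 226, 966, 65192]        # min, median, q3, p95, max
clipped = winsorize(lengths, max(low, 1), high)   # lengths cannot go below 1
logged = [math.log1p(v) for v in lengths]         # alternative: log transform

print(low, high)
print(clipped)
```

With these quartiles the upper fence is 476.5 characters, so anything longer counts as an outlier under the 1.5 IQR rule. Winsorizing at that fence discards a lot of genuine variation in long descriptions; the log transform keeps the ordering and may be the gentler choice.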
Out[17]:

saturn.columns["image_alt_length"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 2,052
min 1
max 65,192
mean 227.5
median 116
std 364.4
q1 59
q3 226
iqr 167
skew 38.39
kurtosis 5934
n_outliers 47,944
outlier_rate 0.1184
zero_rate 0
alert: high_skew · skew=+38.39
alert: outliers · 11.8% rows beyond 1.5 IQR
Fig 10.
Distribution of image_alt_length. Vertical dash marks the median.
Histogram bins for image_alt_length (median: 116.0).
bin count
1 – 1631 400124
1631 – 3261 4709
3261 – 4890 2
6520 – 8150 1
1.793e+04 – 1.956e+04 1
3.749e+04 – 3.912e+04 1
4.889e+04 – 5.052e+04 1
6.356e+04 – 6.519e+04 2
All other bins (32 of 40) have count 0.

text text free_text

Short user-generated Bluesky posts (len_mean 136.5, len_max 397, word_median 15) with frequent emoji and links (emoji_rate 0.226, url_rate 0.156); top values include sign-up rule boilerplate and hashtags like #Alt4You. Predominantly English (4337 detected), with 31 languages detected in the sample, including German (91), Japanese (75), Portuguese (48), and French (43). Watch the heavy duplication (duplicate_rate 0.252, n_duplicates 102139) and the 17970 empty strings, which sit at the top of the value list. Some of the duplication is likely structural: each row describes one image, so a multi-image post repeats its text.

Treatment: Drop empties and exact duplicates, then language-filter or multilingual-embed before modelling.

anthropic:claude-opus-4-7 · confidence high
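
A minimal sketch of the drop-empties-and-duplicates step, assuming the post texts are available as a plain list of strings. Treating whitespace-only strings as empty is an assumption, and the sample posts are invented.

```python
def drop_empty_and_duplicates(texts):
    """Keep the first occurrence of each non-empty string, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        if not text.strip():
            continue  # empty or whitespace-only post body
        if text in seen:
            continue  # exact duplicate of an earlier row
        seen.add(text)
        kept.append(text)
    return kept


# Hypothetical sample rows.
posts = [
    "",
    "New print in the shop! #Alt4You",
    "New print in the shop! #Alt4You",
    "   ",
    "Morning walk",
]
unique_posts = drop_empty_and_duplicates(posts)
print(unique_posts)
```

Because a post with several images contributes one row per image, deduplicating on the text alone also collapses those rows; grouping by post_uri first may be preferable when the images themselves matter.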
Out[20]:

saturn.columns["text"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 302,702
len_min 0
len_max 397
len_mean 136.5
len_median 131
len_p95 296
word_mean 19.73
word_median 15
n_empty 17,970
n_duplicates 102,139
duplicate_rate 0.2523
vocab_size 82,709
readability_flesch_mean 50.35
emoji_rate 0.2264
url_rate 0.1555
one_word_rate 0.06977
allcaps_rate 0.01936
boilerplate_rate 0.001766
alert: duplicates · 25.2% duplicate strings
alert: multilingual · 31 languages detected in sample
Fig 11.
Character-length distribution for text.
Character-length distribution for text (mean: 136.48918464285978).
chars count
0 – 10 27167
10 – 20 16735
20 – 30 17832
30 – 40 17734
40 – 50 16396
50 – 60 15845
60 – 69 14365
69 – 79 13290
79 – 89 14179
89 – 99 13210
99 – 109 12132
109 – 119 11291
119 – 129 10399
129 – 139 9111
139 – 149 9015
149 – 159 38531
159 – 169 9235
169 – 179 10560
179 – 189 8096
189 – 198 8286
198 – 208 7590
208 – 218 6996
218 – 228 6854
228 – 238 7006
238 – 248 10714
248 – 258 8843
258 – 268 8344
268 – 278 9931
278 – 288 10595
288 – 298 19129
298 – 308 15409
308 – 318 14
318 – 328 0
328 – 337 0
337 – 347 0
347 – 357 1
357 – 367 1
367 – 377 1
377 – 387 2
387 – 397 2

author_handle categorical foreign_key

Author handles from Bluesky/AT Protocol accounts (most values end in .bsky.social, with some custom domains like lukesteuber.com and majima.club). There are only 489 unique handles among the 279,196 non-null rows, with a high entropy ratio (0.859) and a flat top: the most prolific account holds 2.4% of non-null rows (1.7% of all rows). The handle set therefore looks like a curated list of accounts rather than an open population, although author_did records 35,183 distinct authors across all rows. Notable: 31% of rows (125,645) have a null handle, exactly the number of 'jetstream' rows in source_mode, so confirm that the nulls come from that collection mode before any per-author aggregation.

Treatment: Treat as a categorical author key; investigate the 31% nulls and decide whether to drop or impute before grouping.

anthropic:claude-opus-4-7 · confidence high
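
One way to investigate the null handles is to compute the null rate per source_mode. The sketch below assumes rows are available as dictionaries keyed by column name, with nulls represented as `None`; the three sample rows are illustrative only.

```python
from collections import Counter


def null_rate_by_group(rows, group_key, value_key):
    """Return {group: share of rows whose value_key is None}."""
    totals, nulls = Counter(), Counter()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            nulls[group] += 1
    return {group: nulls[group] / totals[group] for group in totals}


# Hypothetical rows; handles are taken from the top-values table.
rows = [
    {"source_mode": "author_feed", "author_handle": "majima.club"},
    {"source_mode": "author_feed", "author_handle": "magpie.tips"},
    {"source_mode": "jetstream", "author_handle": None},
]
rates = null_rate_by_group(rows, "source_mode", "author_handle")
print(rates)
```

If the full dataset gives a rate of 1.0 for jetstream and 0.0 for author_feed, the nulls are structural and dropping or imputing them is a question about the collection mode, not about data quality.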
Out[23]:

saturn.columns["author_handle"].stats

stat value
n 404,841
nulls 125,645 (31.0%)
unique 489
top_value firefaerie81.bsky.social
top_rate 0.02446
cardinality 489
entropy 7.676
entropy_ratio 0.8593
alert: null_rate · 31.0% null
Fig 12.
Top values for author_handle.
Top values for author_handle (20 unique shown, of 489 total).
value count share
firefaerie81.bsky.social 6828 1.7%
dashdashado.bsky.social 6386 1.6%
epsteinweb.bsky.social 6146 1.5%
lukesteuber.com 6019 1.5%
azgibsonz.bsky.social 5538 1.4%
majima.club 5427 1.3%
allthegalaxies.galaxyzoo.org 5000 1.2%
guggenheim.bsky.social 5000 1.2%
listigeplaylists.bsky.social 4998 1.2%
catsofyore.bsky.social 4610 1.1%
magpie.tips 4375 1.1%
skwinnicki.bsky.social 4082 1.0%
gonzo.bsky.social 3894 1.0%
ethankocak.com 3653 0.9%
maladroithe.bsky.social 3590 0.9%
nikolaidenmark.bsky.social 3429 0.8%
mbkplus.bsky.social 3384 0.8%
eustace.justshootme.ca 3312 0.8%
veronicaf.bsky.social 2686 0.7%
sarahmackattack.bsky.social 2645 0.7%

author_did text foreign_key

This column holds Bluesky/AT Protocol decentralized identifiers (`did:plc:` prefix) for post authors, one token per value. Almost every value is 32 chars long (404,831 of 404,841 rows); the remaining 10 range from 17 to 33 chars. Across 404,841 rows there are only 35,183 unique authors (duplicate_rate 0.913), and one author alone accounts for 32,558 rows (about 8%), so the column is heavily concentrated at the top.

Treatment: Treat as an author key — left-join to a profile table, and consider per-author aggregation or capping before modelling.

anthropic:claude-opus-4-7 · confidence high
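
The per-author capping mentioned in the treatment can be sketched as follows. The cap value is a modelling choice, not something the profile prescribes, and the DIDs in the sample are made-up placeholders.

```python
from collections import defaultdict


def cap_per_author(rows, key="author_did", cap=1000):
    """Keep at most `cap` rows per author, in their original order."""
    taken = defaultdict(int)
    kept = []
    for row in rows:
        author = row[key]
        if taken[author] < cap:
            taken[author] += 1
            kept.append(row)
    return kept


# Hypothetical rows: one prolific author and one occasional author.
rows = [{"author_did": "did:plc:aaa"}] * 5 + [{"author_did": "did:plc:bbb"}] * 2
capped = cap_per_author(rows, cap=3)
print(len(capped))
```

Keeping the first rows per author biases the sample toward whatever order the file is in; sampling at random within each author is a reasonable variation when order carries meaning.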
Out[26]:

saturn.columns["author_did"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 35,183
len_min 17
len_max 33
len_mean 32
len_median 32
len_p95 32
word_mean 1
word_median 1
n_empty 0
n_duplicates 369,658
duplicate_rate 0.9131
vocab_size 4,133
readability_flesch_mean -228.6
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: duplicates · 91.3% duplicate strings
alert: one_word · 100.0% rows are a single word
Fig 13.
Character-length distribution for author_did.
Character-length distribution for author_did (mean: 31.999787570922905).
chars count
17 – 17 1
19 – 19 2
22 – 22 1
23 – 23 2
26 – 26 3
32 – 32 404831
33 – 33 1
All other bins (33 of 40) have count 0.

post_uri text foreign_key

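This is an AT Protocol post URI (Bluesky), encoding the author DID and post rkey in a fixed `at://did:plc:.../app.bsky.feed.post/...` format; almost every value is 70 chars long (range 55-71) and one_word_rate is 1.0. Despite looking like a primary key, it is not unique: 335384 unique values across 404841 rows yields a 17.2% duplicate rate, with the top URI appearing 15 times. Much of this is likely structural, since each row describes one image (see image_index) and a multi-image post repeats its URI; a URI appearing 15 times exceeds the 4-image maximum, though, which suggests some posts were collected more than once. Readability and emoji metrics are meaningless here since each value is a single token.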

Treatment: Treat as a post identifier for joins; deduplicate or aggregate on it rather than modelling its text.

anthropic:claude-opus-4-7 · confidence high
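
A minimal sketch of using post_uri as a grouping key, assuming rows are dictionaries keyed by column name. `parse_at_uri` splits the `at://<authority>/<collection>/<rkey>` shape described above; the DID and rkey in the sample are made-up placeholders, not values from the dataset.

```python
from collections import defaultdict


def parse_at_uri(uri):
    """Split 'at://<authority>/<collection>/<rkey>' into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    authority, collection, rkey = uri[len(prefix):].split("/", 2)
    return authority, collection, rkey


def images_per_post(rows):
    """Map each post_uri to the sorted, de-duplicated image indexes seen for it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["post_uri"]].append(row["image_index"])
    return {uri: sorted(set(indexes)) for uri, indexes in grouped.items()}


# Hypothetical URI and rows; the third row repeats image 0 of the same post.
uri = "at://did:plc:example123/app.bsky.feed.post/3kexamplerkey"
rows = [
    {"post_uri": uri, "image_index": 0},
    {"post_uri": uri, "image_index": 1},
    {"post_uri": uri, "image_index": 0},
]
parts = parse_at_uri(uri)
grouped = images_per_post(rows)
print(parts)
print(grouped)
```

Grouping on the pair (post_uri, image_index) separates the expected repeats (several images in one post) from the unexpected ones (the same image row collected twice).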
Out[29]:

saturn.columns["post_uri"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 335,384
len_min 55
len_max 71
len_mean 70
len_median 70
len_p95 70
word_mean 1
word_median 1
n_empty 0
n_duplicates 69,457
duplicate_rate 0.1716
vocab_size 19,733
readability_flesch_mean -628.8
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
Fig 14.
Character-length distribution for post_uri.
Character-length distribution for post_uri (mean: 69.9997875709229).
chars count
55 – 55 1
57 – 57 2
60 – 60 1
61 – 61 2
64 – 64 3
70 – 70 404831
71 – 71 1
All other bins (33 of 40) have count 0.

post_cid text identifier

This column holds IPLD CIDv1 content identifiers (all 404841 values are exactly 59 chars and start with the 'bafyrei' CIDv1 dag-cbor prefix). It is not unique: 335545 distinct values across 404841 rows gives a 17.1% duplicate_rate, with some CIDs repeating up to 4 times. That is likely expected rather than an error, because each row is one image and a post can carry up to 4 images (see image_count_in_post). The fixed length, single-token shape, and zero nulls confirm a machine-generated identifier rather than a feature.

Treatment: Treat as a join key; deduplicate before use and don't feed into models.

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["post_cid"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 335,545
len_min 59
len_max 59
len_mean 59
len_median 59
len_p95 59
word_mean 1
word_median 1
n_empty 0
n_duplicates 69,296
duplicate_rate 0.1712
vocab_size 19,734
readability_flesch_mean -567
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
Fig 15.
Character-length distribution for post_cid.
Character-length distribution for post_cid (mean: 59.0).
chars count
59 – 59 404841
All other bins (39 of 40) have count 0.

created_at text timestamp

ISO-8601 timestamps stored as text, mixing a `Z` suffix (`.000Z`/`.911Z` style) with a `+00:00` offset and, judging by the spread of lengths in Fig 16, varying fractional-second precision; lengths range from 20 to 35 chars with a median of 24. The allcaps and one_word alerts at rate 1.0 are artifacts of the text profiler: a timestamp has no whitespace and its only letters are the uppercase `T` and `Z`. Duplicate rate is 17.4% (70,449 repeats across 404,841 rows), and top values cluster heavily on 2026-04-16, suggesting a load-day spike or truncated export window.

Treatment: Parse to datetime (normalising the two ISO suffix formats) and use as a temporal feature.

anthropic:claude-opus-4-7 · confidence high
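
A minimal sketch of normalising the mixed suffixes to a single UTC datetime, using only the standard library. It assumes every value has the shape `YYYY-MM-DDTHH:MM:SS`, an optional fractional part of any length, and either `Z` or a `+HH:MM`/`-HH:MM` offset; fractions longer than six digits are truncated to microseconds. The sample value is the one quoted for indexed_at in this report.

```python
import re
from datetime import datetime, timezone

_ISO = re.compile(
    r"^(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"(?:\.(?P<frac>\d+))?"
    r"(?P<tz>Z|[+-]\d{2}:\d{2})$"
)


def parse_utc(stamp):
    """Parse an ISO-8601 string with a 'Z' or numeric offset into UTC."""
    match = _ISO.match(stamp)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {stamp!r}")
    # Pad or truncate the fraction to exactly six digits (microseconds).
    frac = (match["frac"] or "0")[:6].ljust(6, "0")
    tz = "+00:00" if match["tz"] == "Z" else match["tz"]
    parsed = datetime.fromisoformat(f"{match['base']}.{frac}{tz}")
    return parsed.astimezone(timezone.utc)


a = parse_utc("2025-08-02T00:50:34.108Z")
b = parse_utc("2025-08-02T00:50:34.108000+00:00")
c = parse_utc("2025-08-02T00:50:34Z")
print(a == b, c.isoformat())
```

Rewriting the string before calling `datetime.fromisoformat` keeps the sketch independent of the Python version, since older versions reject the `Z` suffix and fractions that are not exactly three or six digits.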
Out[35]:

saturn.columns["created_at"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 334,392
len_min 20
len_max 35
len_mean 24.55
len_median 24
len_p95 29
word_mean 1
word_median 1
n_empty 0
n_duplicates 70,449
duplicate_rate 0.174
vocab_size 19,732
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
Fig 16.
Character-length distribution for created_at.
Character-length distribution for created_at (mean: 24.54517946551856).
chars count
20 – 20 14365
24 – 24 324204
25 – 25 8039
26 – 26 4
27 – 27 37444
28 – 28 244
29 – 29 1903
30 – 30 1003
32 – 32 17434
33 – 33 200
35 – 35 1
All other bins (29 of 40) have count 0.

indexed_at text timestamp

This is an ISO-8601 indexing timestamp stored as text, with values like '2025-08-02T00:50:34.108Z' and lengths of either 24 or 32 characters. Two formats are mixed: a 24-char 'Z' suffix variant (279,196 rows) and a 32-char '+00:00' offset variant (125,645 rows); the counts match the author_feed and jetstream rows in source_mode, so the format probably depends on the collection mode. Duplicate rate is 17.1% (69,298 rows) across 335,543 unique values out of 404,841; as with post_uri, the likely cause is that a multi-image post contributes one row per image, each with the same timestamp.

Treatment: Parse to a single UTC datetime type and use for recency filtering or time-based joins.

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["indexed_at"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 335,543
len_min 24
len_max 32
len_mean 26.48
len_median 24
len_p95 32
word_mean 1
word_median 1
n_empty 0
n_duplicates 69,298
duplicate_rate 0.1712
vocab_size 19,734
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
Fig 17.
Character-length distribution for indexed_at.
Character-length distribution for indexed_at (mean: 26.48285129223572).
chars count
24 – 24 279196
32 – 32 125645
All other bins (38 of 40) have count 0.

langs_json categorical feature

This column holds JSON-encoded language tag arrays, almost certainly the detected or declared languages per record. English dominates at 71.9% of rows (["en"]), and a notable 83,264 rows carry an empty array [] indicating no language assigned. Cardinality is 214 distinct strings with low entropy ratio (0.179), and the long tail includes multi-language combinations like ["en", "sq"] and ["ae", "ay", "en"].

Treatment: Parse the JSON arrays and one-hot encode primary language, treating [] as a separate 'unknown' category.

anthropic:claude-opus-4-7 · confidence high
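
The parse-then-encode treatment might be sketched like this. Two choices here are assumptions rather than anything the profile specifies: the first tag in the array is treated as the primary language, and region subtags are collapsed (so `en-US` becomes `en`). The sample strings are values from the top-values table.

```python
import json


def primary_language(langs_json, unknown="unknown"):
    """Return the base language of the first tag, or `unknown` for []."""
    tags = json.loads(langs_json)
    if not tags:
        return unknown
    return tags[0].split("-")[0].lower()  # "en-US" -> "en"


def one_hot(value, categories):
    """Encode `value` as a dict of 0/1 indicator columns."""
    return {f"lang_{c}": int(value == c) for c in categories}


values = ['["en"]', "[]", '["en-US"]', '["ae", "ay", "en"]']
primaries = [primary_language(v) for v in values]
encoded = one_hot(primaries[0], ["en", "de", "unknown"])
print(primaries)
print(encoded)
```

Taking the first tag turns `["ae", "ay", "en"]` into `ae`, which may not be what the author intended; a multi-hot encoding over all tags is a reasonable alternative when multi-language rows matter.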
Out[41]:

saturn.columns["langs_json"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 214
top_value ["en"]
top_rate 0.7188
cardinality 214
entropy 1.389
entropy_ratio 0.1794
Fig 18.
Top values for langs_json.
Top values for langs_json (20 unique shown, of 214 total).
value count share
["en"] 291014 71.9%
[] 83264 20.6%
["de"] 7702 1.9%
["fr"] 3892 1.0%
["ja"] 3087 0.8%
["en-US"] 2758 0.7%
["pt"] 2623 0.6%
["es"] 2341 0.6%
["en", "sq"] 1763 0.4%
["ae", "ay", "en"] 1029 0.3%
["nl"] 959 0.2%
["en-AU", "en"] 493 0.1%
["en", "ru", "hi"] 200 0.0%
["fi"] 194 0.0%
["sv"] 183 0.0%
["tr"] 182 0.0%
["en", "ja"] 178 0.0%
["ca"] 155 0.0%
["ar"] 137 0.0%
["ko"] 128 0.0%

image_index numeric feature

A small-integer counter taking only 4 distinct values between 0 and 3, with 82.8% zeros and a mean of 0.26 — likely an image position/index within a multi-image record. The distribution is heavily right-skewed (skew 2.73, kurtosis 7.08) and the IQR is 0, so any non-zero entry registers as an outlier (17.16% flagged).

Treatment: Treat as a low-cardinality categorical (one-hot or keep as ordinal); ignore outlier flag since IQR is zero by construction.

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["image_index"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 4
min 0
max 3
mean 0.2604
median 0
std 0.6467
q1 0
q3 0
iqr 0
skew 2.726
kurtosis 7.083
n_outliers 69,490
outlier_rate 0.1716
zero_rate 0.8284
alert: high_skew · skew=+2.73
alert: outliers · 17.2% rows beyond 1.5 IQR
Fig 19.
Distribution of image_index. Vertical dash marks the median.
Histogram bins for image_index (median: 0.0).
bin count
0 – 0.075 335351
0.975 – 1.05 43304
1.95 – 2.025 16438
2.925 – 3 9748
All other bins (36 of 40) have count 0.

image_count_in_post numeric feature

Discrete count of images attached to a post, taking only 4 distinct integer values between 1 and 4 with no nulls or zeros. The distribution is right-skewed (skew 1.72) toward single-image posts: median and Q1 are both 1, Q3 is 2, and roughly 9.7% of rows (39,251) flag as outliers. Those are exactly the rows with 4 images: the upper fence is Q3 + 1.5 × IQR = 3.5, so the 20,144 rows with 3 images fall inside it.

Treatment: Treat as a small ordinal/categorical count rather than continuous; bin 3-4 together or one-hot encode.

anthropic:claude-opus-4-7 · confidence high
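
A minimal sketch of the binning suggested in the treatment, together with the upper Tukey fence computed from the quartiles in the stats table (q1 = 1, q3 = 2). The bucket labels are illustrative.

```python
def count_bucket(n):
    """Map an image count of 1-4 to the label '1', '2' or '3-4'."""
    if n not in (1, 2, 3, 4):
        raise ValueError(f"unexpected image count: {n!r}")
    return "3-4" if n >= 3 else str(n)


def upper_fence(q1, q3, k=1.5):
    """Upper Tukey fence: q3 + k * IQR."""
    return q3 + k * (q3 - q1)


buckets = [count_bucket(n) for n in (1, 2, 3, 4)]
fence = upper_fence(1, 2)                        # quartiles from the stats table
flagged = [n for n in (1, 2, 3, 4) if n > fence]
print(buckets, fence, flagged)
```

The fence comes out at 3.5, so the outlier flag on this column is a by-product of the 1.5 IQR rule applied to a four-valued count and carries no information beyond "this row belongs to a 4-image post".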
Out[47]:

saturn.columns["image_count_in_post"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 4
min 1
max 4
mean 1.524
median 1
std 0.9647
q1 1
q3 2
iqr 1
skew 1.719
kurtosis 1.551
n_outliers 39,251
outlier_rate 0.09695
zero_rate 0
alert: outliers · 9.7% rows beyond 1.5 IQR
Fig 20.
Distribution of image_count_in_post. Vertical dash marks the median.
Histogram bins for image_count_in_post (median: 1.0).
bin count
1 – 1.075 291415
1.975 – 2.05 54031
2.95 – 3.025 20144
3.925 – 4 39251
All other bins (36 of 40) have count 0.

image_mime_type categorical metadata

This column records the MIME type of an associated image, with 8 distinct values across 404,841 rows and no nulls. It is heavily dominated by image/jpeg at 93.5% (378,401 rows), with image/png a distant second (23,402) and a long tail of webp, gif, avif, svg+xml, heic, plus 4 rows of application/octet-stream that aren't an image type at all. Entropy ratio is just 0.128, confirming very low diversity.

Treatment: Collapse rare types into an 'other' bucket (and review the 4 application/octet-stream rows) before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
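
Collapsing the rare MIME types could be sketched as below, using the counts from the top-values table. The 0.5% share threshold is an assumption chosen so that image/webp survives; pick whatever cut-off suits the downstream model.

```python
from collections import Counter


def collapse_rare(counts, min_share=0.005, other="other"):
    """Merge categories whose share of the total is below min_share."""
    total = sum(counts.values())
    collapsed = Counter()
    for value, n in counts.items():
        key = value if n / total >= min_share else other
        collapsed[key] += n
    return dict(collapsed)


# Counts copied from the image_mime_type top-values table.
mime_counts = {
    "image/jpeg": 378401,
    "image/png": 23402,
    "image/webp": 2906,
    "image/gif": 79,
    "image/avif": 34,
    "image/svg+xml": 14,
    "application/octet-stream": 4,
    "image/heic": 1,
}
collapsed = collapse_rare(mime_counts)
print(collapsed)
```

With this threshold the 'other' bucket holds 132 rows, including the 4 application/octet-stream rows that the treatment suggests reviewing separately.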
Out[50]:

saturn.columns["image_mime_type"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 8
top_value image/jpeg
top_rate 0.9347
cardinality 8
entropy 0.3842
entropy_ratio 0.1281
Fig 21.
Top values for image_mime_type.
Top values for image_mime_type (8 unique shown, of 8 total).
value count share
image/jpeg 378401 93.5%
image/png 23402 5.8%
image/webp 2906 0.7%
image/gif 79 0.0%
image/avif 34 0.0%
image/svg+xml 14 0.0%
application/octet-stream 4 0.0%
image/heic 1 0.0%

image_ref text foreign_key

This column holds IPFS CIDv1 content hashes (all 59 chars, all starting with 'bafkrei...', indicating raw codec + sha256), one token per row. There are 19 nulls among 404841 rows, which the schema table rounds to 0.0%. With 375188 unique values and a 7.3% duplicate rate, it functions as a content-addressed image pointer rather than a row identifier; one CID appears 5227 times (the same count as the most repeated alt_text string), suggesting a heavily reused asset. Length is rigidly fixed (min=max=59) and there are no URLs or emojis.

Treatment: Treat as an opaque content-address key; left-join to an image/asset table and do not tokenize.

anthropic:claude-opus-4-7 · confidence high
Out[53]:

saturn.columns["image_ref"].stats

stat value
n 404,841
nulls 19 (0.0%)
unique 375,188
len_min 59
len_max 59
len_mean 59
len_median 59
len_p95 59
word_mean 1
word_median 1
n_empty 0
n_duplicates 29,634
duplicate_rate 0.0732
vocab_size 19,560
readability_flesch_mean -539.1
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
Fig 22.
Character-length distribution for image_ref.
Character-length distribution for image_ref (mean: 59.0).
chars count
59 – 59 404822
All other bins (39 of 40) have count 0.

image_thumb_url text metadata

This column holds Bluesky CDN thumbnail URLs (every non-null value is exactly 138 chars and a single token pointing at cdn.bsky.app/img/feed_thumbnail/plain/...). 31.04% of rows (125,645) are null, matching the 'jetstream' row count in source_mode, and 6.65% of non-null values are duplicates, with one thumbnail repeating 1000 times, likely a heavily reshared post. With 260,636 unique URLs across 279,196 non-null rows, it's effectively a per-image reference rather than a modelling feature.

Treatment: Keep as a media reference for display or image fetching; drop from tabular modelling.

anthropic:claude-opus-4-7 · confidence high
Out[56]:

saturn.columns["image_thumb_url"].stats

stat value
n 404,841
nulls 125,645 (31.0%)
unique 260,636
len_min 138
len_max 138
len_mean 138
len_median 138
len_p95 138
word_mean 1
word_median 1
n_empty 0
n_duplicates 18,560
duplicate_rate 0.06648
vocab_size 19,666
readability_flesch_mean -1391
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: null_rate · 31.0% null
alert: one_word · 100.0% rows are a single word
alert: url_heavy · 100.0% rows contain a URL
Fig 23.
Character-length distribution for image_thumb_url.
Character-length distribution for image_thumb_url (mean: 138.0).
chars count
138 – 138 279196
All other bins (39 of 40) have count 0.

image_fullsize_url text metadata

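This column holds Bluesky CDN URLs pointing to full-size feed images, all exactly 137 characters long and consisting of a single token (url_rate 1.0, one_word_rate 1.0). 31.04% of rows (125,645) are null. This does not mean those posts lack an image, since image_ref and image_mime_type are populated on virtually every row; the null count matches the 'jetstream' rows in source_mode, so the URL was most likely not resolved in that collection mode. 6.65% of non-null values are duplicates, and one image URL repeats 1000 times, suggesting reshared or pinned media. With 260,636 unique values across 279,196 non-null rows, the field is high-cardinality but not unique.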

Treatment: Treat as an image asset reference; fetch/hash for media features rather than tokenizing as text.

anthropic:claude-opus-4-7 · confidence high
Out[59]:

saturn.columns["image_fullsize_url"].stats

stat value
n 404,841
nulls 125,645 (31.0%)
unique 260,636
len_min 137
len_max 137
len_mean 137
len_median 137
len_p95 137
word_mean 1
word_median 1
n_empty 0
n_duplicates 18,560
duplicate_rate 0.06648
vocab_size 19,666
readability_flesch_mean -1475
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: null_rate 31.0% null
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
Fig 24.
Character-length distribution for image_fullsize_url.
Character-length distribution for image_fullsize_url (mean: 137.0).
chars count
137 279196
All other histogram bins are empty.
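
The treatment for image_fullsize_url suggests fetching and hashing images instead of tokenizing the URL. A minimal sketch of that idea follows, assuming the caller supplies a `fetch` function that returns image bytes; `hash_unique_images`, `fake_fetch`, and the example URLs are hypothetical and not part of Saturn or the dataset.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional


def hash_unique_images(
    urls: Iterable[Optional[str]],
    fetch: Callable[[str], bytes],
) -> Dict[str, str]:
    """Return {url: sha256 hex digest}, fetching each distinct URL once."""
    digests: Dict[str, str] = {}
    for url in urls:
        # Skip nulls and URLs that were already fetched.
        if url is None or url in digests:
            continue
        digests[url] = hashlib.sha256(fetch(url)).hexdigest()
    return digests


# Stand-in fetcher so the sketch runs without network access.
calls = []

def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return url.encode("utf-8")


# Hypothetical URLs; real values would come from image_fullsize_url.
urls = [
    "https://example.invalid/a.jpg",
    None,
    "https://example.invalid/a.jpg",
    "https://example.invalid/b.jpg",
]
digests = hash_unique_images(urls, fake_fetch)
print(len(digests), "unique images hashed with", len(calls), "fetches")
```

Deduplicating by URL before fetching matters here because the column has 18,560 repeated values; hashing the bytes afterwards would additionally catch identical images served under different URLs.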

source_mode categorical feature

Binary categorical flag indicating ingestion source, with values 'author_feed' (69.0%) and 'jetstream' (31.0%) across 404,841 complete rows. The split is imbalanced but both classes are well-represented, and entropy_ratio of 0.89 confirms meaningful spread between the two modes. No nulls, suggesting the field is always populated at ingest time.

Treatment: One-hot or boolean-encode for modelling; useful as a provenance stratifier.

anthropic:claude-opus-4-7 · confidence high
Out[62]:

saturn.columns["source_mode"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 2
top_value author_feed
top_rate 0.6896
cardinality 2
entropy 0.8936
entropy_ratio 0.8936
Fig 25.
Top values for source_mode.
Top values for source_mode (2 unique shown, of 2 total).
value count share
author_feed 279196 69.0%
jetstream 125645 31.0%
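
The entropy figures for source_mode can be reproduced from the two counts above. This is a sketch that assumes Saturn reports Shannon entropy in bits and normalizes by log2 of the cardinality; that reading is inferred from the numbers, not taken from Saturn's documentation.

```python
import math

# Counts copied from the source_mode top-values table.
counts = {"author_feed": 279_196, "jetstream": 125_645}
total = sum(counts.values())

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# ASSUMPTION: entropy_ratio is entropy divided by its maximum, log2(cardinality).
entropy_ratio = entropy / math.log2(len(counts))

print(f"entropy:       {entropy:.4f} bits")
print(f"entropy_ratio: {entropy_ratio:.4f}")
```

With only two classes the maximum entropy is exactly 1 bit, which is why entropy and entropy_ratio are identical in the stats table.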

collected_at text timestamp

This is an ISO-8601 UTC timestamp column (every value is exactly 27 characters, one token, ending in Z) capturing when each of the 404,841 records was collected. All values are unique with zero nulls, so it functions as a per-row event time rather than a categorical feature. The sampled values cluster on 2026-04-15 and 2026-04-16, suggesting collection spans roughly a single day or two.

Treatment: Parse to datetime and use for time-based splits or derived time features rather than feeding the raw string to a model.

anthropic:claude-opus-4-7 · confidence high
Out[65]:

saturn.columns["collected_at"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 404,841
len_min 27
len_max 27
len_mean 27
len_median 27
len_p95 27
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
Fig 26.
Character-length distribution for collected_at.
Character-length distribution for collected_at (mean: 27.0).
chars count
27 404841
All other histogram bins are empty.
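
As a sketch of the treatment for collected_at, the snippet below parses 27-character UTC timestamps and splits them at a cutoff. The exact layout (microsecond precision followed by Z) is an assumption inferred from the fixed length, and the sample values and cutoff are made up for illustration.

```python
from datetime import datetime, timezone

# ASSUMPTION: 27 characters means "YYYY-MM-DDTHH:MM:SS.ffffffZ".
COLLECTED_AT_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_collected_at(value: str) -> datetime:
    """Parse a collected_at string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(value, COLLECTED_AT_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


# Hypothetical samples, not rows from the dataset.
samples = ["2026-04-15T23:59:58.000001Z", "2026-04-16T00:00:03.250000Z"]
parsed = [parse_collected_at(s) for s in samples]

# Time-based split: everything before the cutoff goes to the earlier part.
cutoff = datetime(2026, 4, 16, tzinfo=timezone.utc)
earlier = [t for t in parsed if t < cutoff]
later = [t for t in parsed if t >= cutoff]
print(len(earlier), "before cutoff,", len(later), "at or after cutoff")
```

Attaching UTC explicitly keeps comparisons against an aware cutoff valid; comparing naive and aware datetimes raises a TypeError in Python.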

query categorical foreign_key

The column holds Bluesky-style handles (e.g., firefaerie81.bsky.social, lukesteuber.com), suggesting it records the account whose feed was queried rather than a free-text search string. Cardinality is modest at 489 unique values, and the high entropy ratio (0.859) shows rows are spread fairly evenly with no dominant target: the top handle accounts for just 2.45% of non-null rows (6,828 rows, or 1.7% of all rows). 31.04% of rows are null; the null count (125,645) equals the number of 'jetstream' rows in source_mode, so the gaps most likely mark records that were not collected through a per-author query rather than missing values in a lookup key.

Treatment: Treat as a categorical key, not free text. Before joining, filter the 31% nulls or map them to an explicit 'no query' category rather than imputing a handle.

anthropic:claude-opus-4-7 · confidence high
Out[68]:

saturn.columns["query"].stats

stat value
n 404,841
nulls 125,645 (31.0%)
unique 489
top_value firefaerie81.bsky.social
top_rate 0.02446
cardinality 489
entropy 7.676
entropy_ratio 0.8593
alert: null_rate 31.0% null
Fig 27.
Top values for query.
Top values for query (20 unique shown, of 489 total).
value count share
firefaerie81.bsky.social 6828 1.7%
dashdashado.bsky.social 6386 1.6%
epsteinweb.bsky.social 6146 1.5%
lukesteuber.com 6019 1.5%
azgibsonz.bsky.social 5538 1.4%
majima.club 5427 1.3%
allthegalaxies.galaxyzoo.org 5000 1.2%
guggenheim.bsky.social 5000 1.2%
listigeplaylists.bsky.social 4998 1.2%
catsofyore.bsky.social 4610 1.1%
magpie.tips 4375 1.1%
skwinnicki.bsky.social 4082 1.0%
gonzo.bsky.social 3894 1.0%
ethankocak.com 3653 0.9%
maladroithe.bsky.social 3590 0.9%
nikolaidenmark.bsky.social 3429 0.8%
mbkplus.bsky.social 3384 0.8%
eustace.justshootme.ca 3312 0.8%
veronicaf.bsky.social 2686 0.7%
sarahmackattack.bsky.social 2645 0.7%
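
The query stats mix two denominators: top_rate (0.02446) and the share column (1.7%) describe the same 6,828 rows. The sketch below works through both, assuming top_rate is taken over non-null rows and the share column over all rows; both readings are inferred from the figures rather than documented.

```python
import math

# Figures copied from the query stats and top-values table.
n_rows = 404_841
n_null = 125_645
top_count = 6_828        # firefaerie81.bsky.social
cardinality = 489
entropy_bits = 7.676

non_null = n_rows - n_null

# ASSUMPTION: top_rate uses non-null rows; the share column uses all rows.
share_of_non_null = top_count / non_null
share_of_all = top_count / n_rows

# Entropy ratio: observed entropy over the maximum for 489 categories.
entropy_ratio = entropy_bits / math.log2(cardinality)

print(f"share of non-null rows: {share_of_non_null:.5f}")
print(f"share of all rows:      {share_of_all:.3%}")
print(f"entropy ratio:          {entropy_ratio:.4f}")
```

The small difference between this ratio and the reported 0.8593 comes from starting with the entropy already rounded to three decimals.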

cursor text metadata

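Despite the 'text' classification, every value is a single token of length 16-24 (one_word_rate 1.0) that reads as an ISO-8601 UTC timestamp (allcaps_rate 1.0 from the trailing Z), suggesting this is a pagination cursor keyed on a record timestamp. One caveat: the 125,645 values of exactly 16 characters match the 'jetstream' row count and are shorter than a full ISO-8601 timestamp with seconds and a trailing Z (20 characters), so they may use a different cursor format; the length statistics alone cannot confirm this. Across 404,841 rows there are only 105,919 distinct values and a 70.6% duplicate rate among non-null values, with the top cursors repeating 200+ times each, consistent with many records sharing the same paging anchor. Null rate is 10.97%, likely the first/last page where no cursor applies.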

Treatment: Parse as timestamp if needed for ordering, otherwise drop — it is API pagination state, not a feature.

anthropic:claude-opus-4-7 · confidence high
Out[71]:

saturn.columns["cursor"].stats

stat value
n 404,841
nulls 44,399 (11.0%)
unique 105,919
len_min 16
len_max 24
len_mean 20.98
len_median 24
len_p95 24
word_mean 1
word_median 1
n_empty 0
n_duplicates 254,523
duplicate_rate 0.7061
vocab_size 9,220
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: duplicates 70.6% duplicate strings
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
Fig 28.
Character-length distribution for cursor.
Character-length distribution for cursor (mean: 20.98).
chars count
16 125645
20 15247
22 2259
23 19278
24 198013
All other histogram bins are empty.
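
If the cursor is needed for ordering, a tolerant parser helps because the lengths vary. The sketch below handles ISO-8601 UTC strings with or without fractional seconds. It also treats 16-digit numeric strings as microseconds since the Unix epoch, which is purely an assumption about the 16-character values; check it against real rows before relying on it. `parse_cursor` and the sample strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ISO_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def parse_cursor(value: Optional[str]) -> Optional[datetime]:
    """Best-effort conversion of a cursor string to an aware UTC datetime."""
    if value is None:
        return None
    # ASSUMPTION: 16 digits means microseconds since the Unix epoch.
    if len(value) == 16 and value.isdigit():
        return EPOCH + timedelta(microseconds=int(value))
    for fmt in ISO_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    # Unknown format: return None instead of guessing.
    return None


# Hypothetical cursor values covering the observed lengths 20, 23, and 24.
for sample in ["2026-04-15T12:34:56Z", "2026-04-15T12:34:56.12Z",
               "2026-04-15T12:34:56.123Z", None]:
    print(sample, "->", parse_cursor(sample))
```

Returning None for unrecognized strings keeps unexpected formats visible as gaps instead of silently producing wrong timestamps.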

raw_record_json text free_text

This column holds the raw JSON payload of Bluesky posts (`app.bsky.feed.post` records with embeds, replies, langs, and text), one serialized object per row. With 404,841 rows, 335,545 distinct values and a 17.1% duplicate rate (69,296 repeats), the same post JSON recurs across rows. Likely causes are reposts or ingestion replays; another possibility, given that the dataset has an image_count_in_post column, is one row per image of a multi-image post. Mean length is 1,184 chars (max 65,764), and Flesch readability is -250 because the JSON scaffolding dominates over prose. The language sample is 432 English versus only a handful of de/fi/fr/is/ja/pt, so the multilingual alert reflects a thin tail rather than a balanced mix. URL-bearing rows reach 24.7% and emoji rows 23.1%, both driven by URLs and emoji embedded in alt text and post bodies.

Treatment: Parse as JSON and project the fields you need (text, langs, embed) before any text modelling; do not feed the raw blob to a tokenizer.

anthropic:claude-opus-4-7 · confidence high
Out[74]:

saturn.columns["raw_record_json"].stats

stat value
n 404,841
nulls 0 (0.0%)
unique 335,545
len_min 287
len_max 65,764
len_mean 1184
len_median 967
len_p95 2,504
word_mean 67.63
word_median 49
n_empty 0
n_duplicates 69,296
duplicate_rate 0.1712
vocab_size 117,811
readability_flesch_mean -250
emoji_rate 0.2307
url_rate 0.2473
one_word_rate 0.00786
allcaps_rate 0
boilerplate_rate 0
alert: multilingual 8 languages detected in sample
Fig 29.
Character-length distribution for raw_record_json.
Character-length distribution for raw_record_json (mean: 1184.27).
chars count
287 – 1924 351038
1924 – 3561 49191
3561 – 5198 3450
5198 – 6835 694
6835 – 8472 235
8472 – 10109 226
11745 – 13382 2
39573 – 41210 1
41210 – 42847 1
49395 – 51032 1
64127 – 65764 2
All other histogram bins are empty.
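
The treatment for raw_record_json recommends projecting fields out of the JSON before any text modelling. A minimal sketch follows, assuming each value is an `app.bsky.feed.post` record with an optional images embed; `project_post` and the sample record are hypothetical illustrations, not rows from the dataset.

```python
import json
from typing import Any, Dict


def project_post(raw: str) -> Dict[str, Any]:
    """Parse one raw record and keep only the fields needed downstream."""
    record = json.loads(raw)
    # Not every post has an embed, and not every embed carries images.
    embed = record.get("embed") or {}
    images = embed.get("images") or []
    return {
        "text": record.get("text", ""),
        "langs": record.get("langs") or [],
        "alt_texts": [image.get("alt", "") for image in images],
    }


# Hypothetical record shaped like an app.bsky.feed.post with one image.
sample = json.dumps({
    "$type": "app.bsky.feed.post",
    "text": "Morning walk",
    "langs": ["en"],
    "embed": {
        "$type": "app.bsky.embed.images",
        "images": [{"alt": "A foggy path through pine trees"}],
    },
})

projected = project_post(sample)
print(projected)
```

Projecting first means a tokenizer only sees the post text and alt text, not the braces, keys, and type strings that push the column's Flesch score to -250.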

How to cite

BibTeX
@misc{saturn-bluesky-alt-text-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: bluesky alt text},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/bluesky-alt-text}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: bluesky alt text. Source: hf://lukeslp/bluesky-alt-text:[train]. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/bluesky-alt-text