
bluesky alt text

source: hf://lukeslp/bluesky-alt-text:[train] · 404,841 rows · 21 columns · profiled 2026-04-22

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 404,841-row Bluesky image-post dataset (lukeslp/bluesky-alt-text) capturing posts with attached images, their alt text, author identifiers, and raw AT-Protocol records across 21 columns. Two things stand out for follow-up. First, alt-text length is extremely skewed (mean 227 chars against a median of 116 and a max of 65,192, with ~12% of rows flagged as outliers), so a long tail of very long descriptions pulls the mean upward. Second, authorship is highly concentrated, with one DID accounting for 32,558 rows and the top handle 'firefaerie81.bsky.social' contributing 6,828, which is worth checking for bot or scraper bias. Content is overwhelmingly English (~72% of langs_json) but spans 214 distinct language-tag combinations, and images are predominantly JPEG (93.5%) with PNG a distant second. Note also that ~31% of image URLs and author handles are null, which likely reflects the split between 'author_feed' (69%) and 'jetstream' (31%) source modes.

citing: row_count · column_count · columns.image_alt_length.stats · columns.author_did.top_values · columns.author_handle.top_values · columns.author_handle.null_rate · columns.image_mime_type.top_values · columns.langs_json.top_values · columns.source_mode.top_values · columns.image_count_in_post.stats · columns.text.language_counts

Schema

21 columns
Per-column summary.

column                type          nulls    unique     alerts
alt_text              text          0.0%     355,721    multilingual
image_alt_length      numeric       0.0%     2,052      high_skew, outliers
text                  text          0.0%     302,702    duplicates, multilingual
author_handle         categorical   31.0%    489        null_rate
author_did            text          0.0%     35,183     duplicates, one_word
post_uri              text          0.0%     335,384    one_word
post_cid              text          0.0%     335,545    one_word
created_at            text          0.0%     334,392    one_word, allcaps
indexed_at            text          0.0%     335,543    one_word, allcaps
langs_json            categorical   0.0%     214        -
image_index           numeric       0.0%     4          high_skew, outliers
image_count_in_post   numeric       0.0%     4          outliers
image_mime_type       categorical   0.0%     8          -
image_ref             text          0.0%     375,188    one_word
image_thumb_url       text          31.0%    260,636    null_rate, one_word, url_heavy
image_fullsize_url    text          31.0%    260,636    null_rate, one_word, url_heavy
source_mode           categorical   0.0%     2          -
collected_at          text          0.0%     404,841    near_unique, one_word, allcaps
query                 categorical   31.0%    489        null_rate
cursor                text          11.0%    105,919    duplicates, one_word, allcaps
raw_record_json       text          0.0%     335,545    multilingual

alt_text

text free_text multilingual
Free-text image alt-text descriptions, predominantly English (4071 detected) but with a long multilingual tail spanning 28 other languages including German (94), French (59), Japanese (47) and Spanish (37). Length varies wildly — median 116 chars but max 65192, with a 95th percentile of 966 — and 12.1% of values are duplicates, driven by templated boilerplate like 'Epstein Web iOS app on the App Store.' (5227 repeats) and lengthy pasted feed-rules blocks. Vocabulary is large (85972 unique tokens) and readability sits at Flesch 62.5, but boilerplate rate (4.6%) plus generic placeholders ('Image', 'Image 1', 'image') indicate substantial low-information content. Treatment: Strip boilerplate and placeholder strings, then tokenize and embed (with language detection) before modelling. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 355,721
len_min: 1
len_max: 65,192
len_mean: 227.5
len_median: 116
len_p95: 966
word_mean: 34.21
word_median: 20
n_empty: 0
n_duplicates: 49,120
duplicate_rate: 0.1213
vocab_size: 85,972
readability_flesch_mean: 62.52
emoji_rate: 0.02271
url_rate: 0.0288
one_word_rate: 0.02114
allcaps_rate: 0.0159
boilerplate_rate: 0.04618
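
The alt_text treatment above (strip boilerplate and placeholder strings before tokenizing) could start with a filter like the following. This is a minimal sketch: the placeholder pattern and the `min_repeats` threshold are assumptions chosen for illustration, not rules taken from the profile.

```python
import re
from collections import Counter

# Assumed pattern for generic placeholders such as 'Image', 'image', 'Image 1'.
PLACEHOLDER = re.compile(r"^\s*image(\s+\d+)?\s*$", re.IGNORECASE)

def clean_alt_texts(alts, min_repeats=100):
    """Drop placeholder strings and any exact string repeated >= min_repeats times."""
    counts = Counter(alts)
    kept = []
    for alt in alts:
        if PLACEHOLDER.match(alt):
            continue  # generic placeholder, carries no description
        if counts[alt] >= min_repeats:
            continue  # heavily repeated string, treated as templated boilerplate
        kept.append(alt)
    return kept
```

A frequency threshold is a blunt instrument: it also removes legitimately common descriptions, so the cut-off should be tuned by inspecting the top repeated values first.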

image_alt_length

numeric feature high_skew outliers
Counts of characters in image alt-text fields, ranging from 1 to 65,192 with a median of 116 and IQR of 59–226. The distribution is severely right-skewed (skew 38.4, kurtosis 5934) and 11.8% of rows (47,944) flag as outliers, suggesting a long tail of unusually verbose alt-text or possibly entire passages dumped into the alt attribute. No nulls or zeros, so every row carries some alt content. Treatment: Log-transform or winsorize before modelling to tame the extreme right tail. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 2,052
min: 1
max: 65,192
mean: 227.5
median: 116
std: 364.4
q1: 59
q3: 226
iqr: 167
skew: 38.39
kurtosis: 5934
n_outliers: 47,944
outlier_rate: 0.1184
zero_rate: 0
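
The two options named in the image_alt_length treatment, a log transform and winsorizing, can be sketched as below. The sample values are the reported min, q1, median, q3 and max; using 966 (the alt_text len_p95) as the upper cap is an assumption, not a recommendation from the profile.

```python
import math

def winsorize(values, lower, upper):
    """Clip each value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def log_lengths(values):
    """Apply log1p, which stays defined at 0 and compresses the right tail."""
    return [math.log1p(v) for v in values]

# Illustrative lengths: the reported min, q1, median, q3 and max.
lengths = [1, 59, 116, 226, 65192]
capped = winsorize(lengths, lower=1, upper=966)  # assumed cap at len_p95
logged = log_lengths(lengths)
```

Winsorizing discards how extreme the tail is, while the log transform keeps the ordering of every value; which one fits depends on whether the very long alt texts are signal or noise for the model.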

text

text free_text duplicates multilingual
Short user-generated posts (len_mean 136.5, len_max 397, word_median 15) that read like typical short social-media posts, with emoji_rate 0.226 and url_rate 0.156; top values include sign-up rule boilerplate and hashtags like #Alt4You. Predominantly English (4337) but spans 30 languages including German (91), Japanese (75), Portuguese (48), and French (43). Watch the heavy duplication (duplicate_rate 0.252, n_duplicates 102139), part of which is probably the same post text repeated on each image row of a multi-image post, and the 17970 empty strings sitting at the top of the value list. Treatment: Drop empties and exact duplicates, then language-filter or multilingual-embed before modelling. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 302,702
len_min: 0
len_max: 397
len_mean: 136.5
len_median: 131
len_p95: 296
word_mean: 19.73
word_median: 15
n_empty: 17,970
n_duplicates: 102,139
duplicate_rate: 0.2523
vocab_size: 82,709
readability_flesch_mean: 50.35
emoji_rate: 0.2264
url_rate: 0.1555
one_word_rate: 0.06977
allcaps_rate: 0.01936
boilerplate_rate: 0.001766
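
The first step of the text treatment (drop empties and exact duplicates) might look like this. It is a sketch only; treating whitespace-only strings as empty is an assumption.

```python
def drop_empty_and_duplicates(texts):
    """Keep the first occurrence of each non-empty string, preserving order."""
    seen = set()
    kept = []
    for t in texts:
        if not t.strip():
            continue  # empty or whitespace-only post body
        if t in seen:
            continue  # exact duplicate of an earlier string
        seen.add(t)
        kept.append(t)
    return kept
```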

author_handle

categorical foreign_key null_rate
Author handles from Bluesky/AT Protocol accounts (most values end in .bsky.social, with some custom domains like lukesteuber.com and majima.club). There are only 489 unique handles across the 279,196 non-null rows, with a high entropy ratio (0.859) and a flat top: the most prolific account holds just 2.4% of the non-null rows. By contrast, author_did records 35,183 distinct authors, so the handle appears to be populated only for a curated set of accounts rather than for the open population. Notable: 31% of rows have a null handle, close to the 31% 'jetstream' share in source_mode, which is worth confirming before any per-author aggregation. Treatment: Treat as a categorical author key; check whether the nulls coincide with source_mode = 'jetstream' and decide whether to drop them or fall back to author_did before grouping. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 125,645 (31.0%)
unique: 489
top_value: firefaerie81.bsky.social
top_rate: 0.02446
cardinality: 489
entropy: 7.676
entropy_ratio: 0.8593
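
One way to investigate the null handles is to compute the null rate per source_mode value. The helper below is a sketch; the three rows are hypothetical and exist only to show the shape of the check, not to report a result.

```python
from collections import defaultdict

def null_rate_by(rows, group_key, field):
    """Return {group value: share of rows in that group where `field` is None}."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(field) is None:
            nulls[group] += 1
    return {group: nulls[group] / totals[group] for group in totals}

# Hypothetical rows, invented for illustration.
rows = [
    {"source_mode": "author_feed", "author_handle": "example.bsky.social"},
    {"source_mode": "author_feed", "author_handle": "other.example.com"},
    {"source_mode": "jetstream", "author_handle": None},
]
rates = null_rate_by(rows, "source_mode", "author_handle")
```

If the real data returns a rate near 1.0 for one mode and near 0.0 for the other, the nulls are structural and imputation would be inappropriate.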

author_did

text foreign_key duplicates one_word
This column holds Bluesky/AT Protocol decentralized identifiers (`did:plc:` prefix) for post authors, with a tight length range of 17-33 chars and a single token per value. Across 404,841 rows there are only 35,183 unique authors (duplicate_rate 0.913), and one author alone accounts for 32,558 posts — heavy concentration at the top. Treatment: Treat as an author key — left-join to a profile table, and consider per-author aggregation or capping before modelling. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 35,183
len_min: 17
len_max: 33
len_mean: 32
len_median: 32
len_p95: 32
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 369,658
duplicate_rate: 0.9131
vocab_size: 4,133
readability_flesch_mean: -228.6
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
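
The capping suggested in the author_did treatment can be sketched as follows, so that a single author with 32,558 rows cannot dominate a training set. The function name and the choice to keep the earliest rows are assumptions; random sampling per author would be an equally valid variant.

```python
from collections import defaultdict

def cap_per_author(rows, max_rows, key="author_did"):
    """Keep at most `max_rows` rows per author, in original order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        author = row[key]
        if seen[author] < max_rows:
            seen[author] += 1
            kept.append(row)
    return kept
```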

post_uri

text foreign_key one_word
This is an AT Protocol post URI (Bluesky), encoding the author DID and post rkey in a fixed `at://did:plc:.../app.bsky.feed.post/...` format; note the tight length range (55-71 chars, mean 70) and one_word_rate of 1.0. Despite looking like a primary key, it is not unique: 335384 unique values across 404841 rows yields a 17.2% duplicate rate, with the top URI appearing 15 times. The 69,457 duplicate rows nearly match the 69,490 rows with image_index > 0, which suggests the table holds one row per image and that most repeats come from multi-image posts. Readability and emoji metrics are meaningless here since each value is a single token. Treatment: Treat as a post identifier for joins; deduplicate or aggregate on it rather than modelling its text. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 335,384
len_min: 55
len_max: 71
len_mean: 70
len_median: 70
len_p95: 70
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,457
duplicate_rate: 0.1716
vocab_size: 19,733
readability_flesch_mean: -628.8
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
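
To aggregate on post_uri rather than model its text, the URI can be split into its parts and rows grouped per post. This is a sketch that assumes every value has exactly the three-part shape described above; the helper names are hypothetical.

```python
from collections import defaultdict

def parse_at_uri(uri):
    """Split 'at://<authority>/<collection>/<rkey>' into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3:
        raise ValueError(f"expected authority/collection/rkey: {uri!r}")
    authority, collection, rkey = parts
    return {"authority": authority, "collection": collection, "rkey": rkey}

def group_by_post(rows):
    """Collect the rows that share a post_uri into one list per post."""
    posts = defaultdict(list)
    for row in rows:
        posts[row["post_uri"]].append(row)
    return dict(posts)
```

Grouping first and then choosing one representative row per post avoids counting a four-image post four times in per-post statistics.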

post_cid

text identifier one_word
This is a post_cid column holding CIDv1 content identifiers (all 404841 values are exactly 59 chars and start with the 'bafyrei' CIDv1 dag-cbor prefix). It has 335545 distinct values across 404841 rows, a 17.1% duplicate_rate, with some CIDs repeating up to 4 times. That ceiling matches the maximum image_count_in_post of 4, so the repeats are probably multi-image posts stored as one row per image rather than a data-quality fault. The fixed length, single-token shape, and zero nulls confirm a machine-generated identifier rather than a feature. Treatment: Treat as a join key; deduplicate before use and don't feed into models. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 335,545
len_min: 59
len_max: 59
len_mean: 59
len_median: 59
len_p95: 59
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,296
duplicate_rate: 0.1712
vocab_size: 19,734
readability_flesch_mean: -567
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

created_at

text timestamp one_word allcaps
ISO-8601 timestamps stored as text, mixing two suffix conventions (`.000Z`/`.911Z` style and `+00:00` offset). The allcaps and one_word flags at rate 1.0 are artifacts of the format: each value is a single token whose only letters are the uppercase `T` and `Z`. Lengths range from 20 to 35 chars with a median of 24, consistent with ISO datetime strings at varying fractional-second precision. Duplicate rate is 17.4% (70,449 repeats across 404,841 rows), and top values cluster heavily on 2026-04-16, suggesting a load-day spike or truncated export window. Treatment: Parse to datetime (normalising the two ISO suffix formats) and use as a temporal feature. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 334,392
len_min: 20
len_max: 35
len_mean: 24.55
len_median: 24
len_p95: 29
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 70,449
duplicate_rate: 0.174
vocab_size: 19,732
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
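
Normalising the two suffix formats into one UTC datetime could be done as below. This is a sketch that assumes every value is a full date-time with either a `Z` or a numeric offset; the regex and the `parse_utc` name are illustrative. Fractional seconds are padded or truncated to microseconds so the string is accepted by `datetime.fromisoformat` on older Python versions as well.

```python
import re
from datetime import datetime, timezone

_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # date and time to the second
    r"(?:\.(\d+))?"                              # optional fractional seconds
    r"(Z|[+-]\d{2}:\d{2})$"                      # 'Z' or a numeric UTC offset
)

def parse_utc(ts):
    """Parse an ISO-8601 string with 'Z' or '+HH:MM' suffix into UTC."""
    m = _ISO.match(ts)
    if m is None:
        raise ValueError(f"unrecognised timestamp: {ts!r}")
    base, frac, tz = m.groups()
    # Pad or truncate the fraction to exactly 6 digits (microseconds).
    frac = (frac or "")[:6].ljust(6, "0")
    if tz == "Z":
        tz = "+00:00"
    return datetime.fromisoformat(f"{base}.{frac}{tz}").astimezone(timezone.utc)
```

Values that match neither format raise an error rather than being silently coerced, which makes unexpected variants visible during the first full pass.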

indexed_at

text timestamp one_word allcaps
This is an ISO-8601 indexing timestamp stored as text, with values like '2025-08-02T00:50:34.108Z' and lengths between 24 and 32 characters. Two formats are mixed: a 24-char 'Z' suffix variant and a 32-char '+00:00' offset variant, which explains the length spread. Duplicate rate is 17.1% (69,298 rows) across 335,543 unique values out of 404,841, almost the same count as the post_cid duplicates (69,296), so the repeats most likely come from images of the same post sharing one indexing timestamp. Treatment: Parse to a single UTC datetime type and use for recency filtering or time-based joins. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 335,543
len_min: 24
len_max: 32
len_mean: 26.48
len_median: 24
len_p95: 32
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,298
duplicate_rate: 0.1712
vocab_size: 19,734
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

langs_json

categorical feature
This column holds JSON-encoded language tag arrays, almost certainly the detected or declared languages per record. English dominates at 71.9% of rows (["en"]), and a notable 83,264 rows carry an empty array [] indicating no language assigned. Cardinality is 214 distinct strings with low entropy ratio (0.179), and the long tail includes multi-language combinations like ["en", "sq"] and ["ae", "ay", "en"]. Treatment: Parse the JSON arrays and one-hot encode primary language, treating [] as a separate 'unknown' category. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 214
top_value: ["en"]
top_rate: 0.7188
cardinality: 214
entropy: 1.389
entropy_ratio: 0.1794
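
The langs_json treatment (parse the arrays, take a primary language, map `[]` to 'unknown') can be sketched like this. Taking the first tag as the primary language is an assumption; the profile does not say how multi-language arrays are ordered.

```python
import json

def primary_language(langs_json):
    """Return the first tag of a JSON array string, or 'unknown' for []."""
    tags = json.loads(langs_json)
    return tags[0] if tags else "unknown"

def one_hot(value, categories):
    """Encode `value` as a dict of 0/1 indicators over `categories`."""
    return {f"lang_{c}": int(value == c) for c in categories}
```

With 214 distinct arrays, it is usually worth one-hot encoding only the most frequent primary tags and folding the rest into a single bucket.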

image_index

numeric feature high_skew outliers
A small-integer counter taking only 4 distinct values between 0 and 3, with 82.8% zeros and a mean of 0.26 — likely an image position/index within a multi-image record. The distribution is heavily right-skewed (skew 2.73, kurtosis 7.08) and the IQR is 0, so any non-zero entry registers as an outlier (17.16% flagged). Treatment: Treat as a low-cardinality categorical (one-hot or keep as ordinal); ignore outlier flag since IQR is zero by construction. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 4
min: 0
max: 3
mean: 0.2604
median: 0
std: 0.6467
q1: 0
q3: 0
iqr: 0
skew: 2.726
kurtosis: 7.083
n_outliers: 69,490
outlier_rate: 0.1716
zero_rate: 0.8284

image_count_in_post

numeric feature outliers
Discrete count of images attached to a post, taking only 4 distinct integer values between 1 and 4 with no nulls or zeros. The distribution is right-skewed (skew 1.72) toward single-image posts: median and Q1 are both 1 and Q3 is 2. Roughly 9.7% of rows (39,251) flag as outliers; if the flag uses the common 1.5×IQR rule, the upper fence is 2 + 1.5 = 3.5, so these are the rows from 4-image posts. Treatment: Treat as a small ordinal/categorical count rather than continuous; bin 3-4 together or one-hot encode. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 4
min: 1
max: 4
mean: 1.524
median: 1
std: 0.9647
q1: 1
q3: 2
iqr: 1
skew: 1.719
kurtosis: 1.551
n_outliers: 39,251
outlier_rate: 0.09695
zero_rate: 0
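
The outlier flags on the small-integer columns are easier to read once the fences are computed explicitly. The sketch below assumes the profiler uses Tukey's 1.5×IQR rule, which the report does not state; the assumption is at least consistent with image_index, where an IQR of 0 makes every non-zero value an outlier.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

index_fences = tukey_fences(0, 0)       # image_index: q1 = q3 = 0
count_fences = tukey_fences(1, 2)       # image_count_in_post: q1 = 1, q3 = 2
length_fences = tukey_fences(59, 226)   # image_alt_length: q1 = 59, q3 = 226
```

Under this assumption the upper fence for image_count_in_post is 3.5, so only the value 4 falls outside it, and any alt text longer than 476.5 characters counts as an outlier.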

image_mime_type

categorical metadata
This column records the MIME type of an associated image, with 8 distinct values across 404,841 rows and no nulls. It is heavily dominated by image/jpeg at 93.5% (378,401 rows), with image/png a distant second (23,402) and a long tail of webp, gif, avif, svg+xml, heic, plus 4 rows of application/octet-stream that aren't an image type at all. Entropy ratio is just 0.128, confirming very low diversity. Treatment: Collapse rare types into an 'other' bucket (and review the 4 application/octet-stream rows) before one-hot encoding. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 8
top_value: image/jpeg
top_rate: 0.9347
cardinality: 8
entropy: 0.3842
entropy_ratio: 0.1281
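
Collapsing the rare MIME types into an 'other' bucket before encoding might be done as follows. The `min_count` threshold is an assumption to be chosen from the actual value counts.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace values seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]
```

The 4 application/octet-stream rows would land in the bucket automatically, but they are worth reviewing separately since they are not an image type.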

image_ref

text foreign_key one_word
This column holds IPFS CIDv1 content hashes (all 59 chars, all starting with 'bafkrei...' indicating raw codec + sha256), one token per row across 404841 rows. With 375188 unique values and a 7.3% duplicate rate, it functions as a content-addressed image pointer rather than a row identifier; one CID appears 5227 times, suggesting a heavily reused asset. Length is rigidly fixed (min=max=59), only 19 values are null, and there are no URLs or emojis. Treatment: Treat as an opaque content-address key; left-join to an image/asset table and do not tokenize. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 19 (0.0%)
unique: 375,188
len_min: 59
len_max: 59
len_mean: 59
len_median: 59
len_p95: 59
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 29,634
duplicate_rate: 0.0732
vocab_size: 19,560
readability_flesch_mean: -539.1
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

image_thumb_url

text metadata null_rate one_word url_heavy
This column holds Bluesky CDN thumbnail URLs (every non-null value is exactly 138 chars and a single token pointing at cdn.bsky.app/img/feed_thumbnail/plain/...). 31.04% of rows are null and 6.65% of the non-null values are duplicates, with one thumbnail repeating 1000 times, possibly a heavily reused image. With 260,636 unique URLs across 279,196 non-null rows, it is effectively a per-image media reference rather than a modelling feature. Treatment: Keep as a media reference for display or image fetching; drop from tabular modelling. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 125,645 (31.0%)
unique: 260,636
len_min: 138
len_max: 138
len_mean: 138
len_median: 138
len_p95: 138
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 18,560
duplicate_rate: 0.06648
vocab_size: 19,666
readability_flesch_mean: -1391
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

image_fullsize_url

text metadata null_rate one_word url_heavy
This column holds Bluesky CDN URLs pointing to full-size feed images, all exactly 137 characters long and consisting of a single token (url_rate 1.0, one_word_rate 1.0). 31.04% of rows are null. Because image_ref and image_mime_type are populated for almost every row, the nulls do not mean the post lacks an image; the null count (125,645) matches author_handle and query and the 31% 'jetstream' share in source_mode, so the URL was probably not captured in that source mode. 6.65% of non-null values are duplicates, and one image URL repeats 1000 times, suggesting reused or pinned media. With 260,636 unique values across 404,841 rows, the field is high-cardinality but not unique. Treatment: Treat as an image asset reference; fetch/hash for media features rather than tokenizing as text. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 125,645 (31.0%)
unique: 260,636
len_min: 137
len_max: 137
len_mean: 137
len_median: 137
len_p95: 137
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 18,560
duplicate_rate: 0.06648
vocab_size: 19,666
readability_flesch_mean: -1475
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

source_mode

categorical feature
Binary categorical flag indicating ingestion source, with values 'author_feed' (69.0%) and 'jetstream' (31.0%) across 404,841 complete rows. The split is imbalanced but both classes are well-represented, and entropy_ratio of 0.89 confirms meaningful spread between the two modes. No nulls, suggesting the field is always populated at ingest time. Treatment: One-hot or boolean-encode for modelling; useful as a provenance stratifier. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 2
top_value: author_feed
top_rate: 0.6896
cardinality: 2
entropy: 0.8936
entropy_ratio: 0.8936
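
The entropy_ratio figures in this report appear to be Shannon entropy in bits divided by log2 of the cardinality. That definition is an inference rather than something the report states, but it reproduces the reported values: for author_handle, 7.676 / log2(489) ≈ 0.859. A minimal sketch using the source_mode shares:

```python
import math

def entropy_ratio(probs):
    """Shannon entropy in bits divided by log2 of the number of categories."""
    probs = [p for p in probs if p > 0]
    if len(probs) < 2:
        return 0.0  # a single category has no spread
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(probs))

# Shares of 'author_feed' and 'jetstream' taken from top_rate above.
ratio = entropy_ratio([0.6896, 0.3104])
```

For a two-category column the denominator is 1, which is why entropy and entropy_ratio coincide for source_mode.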

collected_at

text timestamp near_unique one_word allcaps
This is an ISO-8601 UTC timestamp column (every value is exactly 27 characters, one token, ending in Z) capturing when each of the 404,841 records was collected. All values are unique with zero nulls, so it functions as a per-row event time rather than a categorical feature. The sampled values cluster on 2026-04-15 and 2026-04-16, suggesting collection spans roughly a single day or two. Treatment: parse to datetime and use for time-based splits or feature extraction rather than as a model input. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 404,841
len_min: 27
len_max: 27
len_mean: 27
len_median: 27
len_p95: 27
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

query

categorical foreign_key null_rate
The column holds Bluesky-style handles (e.g., firefaerie81.bsky.social, lukesteuber.com), suggesting it captures the account being queried rather than a free-text search string. Its profile is identical to author_handle (489 unique values, the same top value and top_rate, entropy ratio 0.859), so the two columns probably carry the same information. Traffic is spread fairly evenly with no dominant target: the top handle accounts for just 2.45% of non-null rows. 31.04% of rows are null, matching the 'jetstream' share in source_mode, where presumably no per-account query was issued. Treatment: Confirm that the nulls align with source_mode before joining; treat as a categorical key, not free text, and consider dropping it if it duplicates author_handle. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 125,645 (31.0%)
unique: 489
top_value: firefaerie81.bsky.social
top_rate: 0.02446
cardinality: 489
entropy: 7.676
entropy_ratio: 0.8593

cursor

text metadata duplicates one_word allcaps
Despite the 'text' classification, every value is a single ISO-8601 UTC timestamp (length 16-24, one_word_rate 1.0, allcaps_rate 1.0 from the trailing Z), suggesting this is a pagination cursor keyed on a record timestamp. Across 404,841 rows there are only 105,919 distinct values and a 70.6% duplicate rate, with the top cursors repeating 200+ times each — consistent with many records sharing the same paging anchor. Null rate is 10.97%, likely the first/last page where no cursor applies. Treatment: Parse as timestamp if needed for ordering, otherwise drop — it is API pagination state, not a feature. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 44,399 (11.0%)
unique: 105,919
len_min: 16
len_max: 24
len_mean: 20.98
len_median: 24
len_p95: 24
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 254,523
duplicate_rate: 0.7061
vocab_size: 9,220
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

raw_record_json

text free_text multilingual
This column holds the raw JSON payload of Bluesky posts (`app.bsky.feed.post` records with embeds, replies, langs, and text), one serialized object per row. With 404,841 rows, 335,545 distinct values and a 17.1% duplicate rate (69,296 repeats), the same post JSON recurs. These counts match post_cid exactly, so the repeats are most likely multi-image posts stored once per image rather than reposts or ingestion replays. Mean length is 1,184 chars (max 65,764), Flesch readability is -250 because the JSON scaffolding dominates over prose, and the language sample is 432 English vs only a handful of de/fi/fr/is/ja/pt despite the multilingual alert. URL-bearing rows hit 24.7% and emoji rows 23.1%, both inflated by emoji/URLs embedded inside alt-text and post bodies. Treatment: Parse as JSON and project the fields you need (text, langs, embed) before any text modelling; do not feed the raw blob to a tokenizer. high · anthropic:claude-opus-4-7
n: 404,841
nulls: 0 (0.0%)
unique: 335,545
len_min: 287
len_max: 65,764
len_mean: 1184
len_median: 967
len_p95: 2,504
word_mean: 67.63
word_median: 49
n_empty: 0
n_duplicates: 69,296
duplicate_rate: 0.1712
vocab_size: 117,811
readability_flesch_mean: -250
emoji_rate: 0.2307
url_rate: 0.2473
one_word_rate: 0.00786
allcaps_rate: 0
boilerplate_rate: 0
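
Projecting the needed fields out of raw_record_json could look like the sketch below. The sample record is hypothetical and much smaller than a real one, and the code assumes images sit directly under `embed.images`; other embed types may nest them differently and would need their own handling.

```python
import json

def project_record(raw):
    """Extract post text, language tags and image alt texts from a record."""
    record = json.loads(raw)
    embed = record.get("embed") or {}
    images = embed.get("images") or []  # assumed location of image entries
    return {
        "text": record.get("text", ""),
        "langs": record.get("langs") or [],
        "alts": [img.get("alt", "") for img in images],
    }

# Hypothetical record, invented for illustration.
sample = json.dumps({
    "$type": "app.bsky.feed.post",
    "text": "Morning walk",
    "langs": ["en"],
    "embed": {
        "$type": "app.bsky.embed.images",
        "images": [{"alt": "A foggy path through pine trees"}],
    },
})
projected = project_record(sample)
```

Projecting once and storing the result as separate columns keeps the JSON scaffolding out of any tokenizer and avoids re-parsing up to 65,764-character blobs on every pass.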