bluesky alt text

source hf://lukeslp/bluesky-alt-text:[train] · 404,841 rows · 21 columns · profiled 2026-04-22

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 404,841-row Bluesky image-post dataset (lukeslp/bluesky-alt-text) capturing posts with attached images, their alt text, author identifiers, and raw AT Protocol records across 21 columns. Two things stand out for follow-up. First, alt-text length is extremely skewed (median 116 and mean 227 chars, but max 65,192, with ~12% of rows flagged as outliers), so a small number of very long descriptions stretch the tail. Second, authorship is concentrated: one DID accounts for 32,558 rows (about 8%), and the top handle 'firefaerie81.bsky.social' contributes 6,828, which is worth checking for bot or scraper bias. Content is overwhelmingly English (~72% of langs_json values are ["en"]) but spans 214 distinct language-tag combinations, and images are predominantly JPEG (93.5%) with PNG a distant second (5.8%). Note also that 31% of image URLs and author handles are null; the null count (125,645) equals the number of 'jetstream' rows, so the nulls most likely reflect the split between the 'author_feed' (69%) and 'jetstream' (31%) source modes.

citing: row_count · column_count · columns.image_alt_length.stats · columns.author_did.top_values · columns.author_handle.top_values · columns.author_handle.null_rate · columns.image_mime_type.top_values · columns.langs_json.top_values · columns.source_mode.top_values · columns.image_count_in_post.stats · columns.text.language_counts

Charts the summary said to look at first

image_mime_type · Shows JPEG dominance (~93%) with PNG and WebP as the only meaningful minorities.

Top values for image_mime_type (8 unique shown, of 8 total).
value	count	share
image/jpeg	378401	93.5%
image/png	23402	5.8%
image/webp	2906	0.7%
image/gif	79	0.0%
image/avif	34	0.0%
image/svg+xml	14	0.0%
application/octet-stream	4	0.0%
image/heic	1	0.0%

source_mode · Roughly 69/31 split between author_feed and jetstream collection modes — explains the 31% null rate on author/url fields.

Top values for source_mode (2 unique shown, of 2 total).
value	count	share
author_feed	279196	69.0%
jetstream	125645	31.0%

author_handle · The 20 handles shown account for 91,002 rows (about 22% of all rows, or 33% of rows with a non-null handle); check whether these prolific accounts skew downstream analysis.

Top values for author_handle (20 unique shown, of 489 total).
value	count	share
firefaerie81.bsky.social	6828	1.7%
dashdashado.bsky.social	6386	1.6%
epsteinweb.bsky.social	6146	1.5%
lukesteuber.com	6019	1.5%
azgibsonz.bsky.social	5538	1.4%
majima.club	5427	1.3%
allthegalaxies.galaxyzoo.org	5000	1.2%
guggenheim.bsky.social	5000	1.2%
listigeplaylists.bsky.social	4998	1.2%
catsofyore.bsky.social	4610	1.1%
magpie.tips	4375	1.1%
skwinnicki.bsky.social	4082	1.0%
gonzo.bsky.social	3894	1.0%
ethankocak.com	3653	0.9%
maladroithe.bsky.social	3590	0.9%
nikolaidenmark.bsky.social	3429	0.8%
mbkplus.bsky.social	3384	0.8%
eustace.justshootme.ca	3312	0.8%
veronicaf.bsky.social	2686	0.7%
sarahmackattack.bsky.social	2645	0.7%

image_alt_length · Heavy right-skew (median 116, max 65,192) — inspect the long tail of unusually verbose alt text.

Histogram bins for image_alt_length (median: 116.0).
bin	count
1 – 1631	400124
1631 – 3261	4709
3261 – 4890	2
4890 – 6520	0
6520 – 8150	1
8150 – 9780	0
9780 – 1.141e+04	0
1.141e+04 – 1.304e+04	0
1.304e+04 – 1.467e+04	0
1.467e+04 – 1.63e+04	0
1.63e+04 – 1.793e+04	0
1.793e+04 – 1.956e+04	1
1.956e+04 – 2.119e+04	0
2.119e+04 – 2.282e+04	0
2.282e+04 – 2.445e+04	0
2.445e+04 – 2.608e+04	0
2.608e+04 – 2.771e+04	0
2.771e+04 – 2.934e+04	0
2.934e+04 – 3.097e+04	0
3.097e+04 – 3.26e+04	0
3.26e+04 – 3.423e+04	0
3.423e+04 – 3.586e+04	0
3.586e+04 – 3.749e+04	0
3.749e+04 – 3.912e+04	1
3.912e+04 – 4.075e+04	0
4.075e+04 – 4.238e+04	0
4.238e+04 – 4.4e+04	0
4.4e+04 – 4.563e+04	0
4.563e+04 – 4.726e+04	0
4.726e+04 – 4.889e+04	0
4.889e+04 – 5.052e+04	1
5.052e+04 – 5.215e+04	0
5.215e+04 – 5.378e+04	0
5.378e+04 – 5.541e+04	0
5.541e+04 – 5.704e+04	0
5.704e+04 – 5.867e+04	0
5.867e+04 – 6.03e+04	0
6.03e+04 – 6.193e+04	0
6.193e+04 – 6.356e+04	0
6.356e+04 – 6.519e+04	2

image_count_in_post · Most rows (291,415 of 404,841) come from single-image posts; the remaining bars show how many rows come from 2-, 3-, and 4-image posts.

Histogram for image_count_in_post (median: 1.0). The column takes only the integer values 1 to 4, so each non-empty bin corresponds to one value.
value	count
1	291415
2	54031
3	20144
4	39251

Schema

21 columns

Per-column summary. Click column name to jump to its detail.
Column	Type	Null rate	Unique	Alerts
alt_text	text	0.0%	355,721	multilingual
image_alt_length	numeric	0.0%	2,052	high_skew outliers
text	text	0.0%	302,702	duplicates multilingual
author_handle	categorical	31.0%	489	null_rate
author_did	text	0.0%	35,183	duplicates one_word
post_uri	text	0.0%	335,384	one_word
post_cid	text	0.0%	335,545	one_word
created_at	text	0.0%	334,392	one_word allcaps
indexed_at	text	0.0%	335,543	one_word allcaps
langs_json	categorical	0.0%	214
image_index	numeric	0.0%	4	high_skew outliers
image_count_in_post	numeric	0.0%	4	outliers
image_mime_type	categorical	0.0%	8
image_ref	text	0.0%	375,188	one_word
image_thumb_url	text	31.0%	260,636	null_rate one_word url_heavy
image_fullsize_url	text	31.0%	260,636	null_rate one_word url_heavy
source_mode	categorical	0.0%	2
collected_at	text	0.0%	404,841	near_unique one_word allcaps
query	categorical	31.0%	489	null_rate
cursor	text	11.0%	105,919	duplicates one_word allcaps
raw_record_json	text	0.0%	335,545	multilingual

alt_text

text free_text multilingual

Free-text image alt-text descriptions, predominantly English (4071 detected) but with a long multilingual tail spanning 28 other languages including German (94), French (59), Japanese (47) and Spanish (37). Length varies wildly — median 116 chars but max 65192, with a 95th percentile of 966 — and 12.1% of values are duplicates, driven by templated boilerplate like 'Epstein Web iOS app on the App Store.' (5227 repeats) and lengthy pasted feed-rules blocks. Vocabulary is large (85972 unique tokens) and readability sits at Flesch 62.5, but boilerplate rate (4.6%) plus generic placeholders ('Image', 'Image 1', 'image') indicate substantial low-information content. Treatment: Strip boilerplate and placeholder strings, then tokenize and embed (with language detection) before modelling. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 355,721
len_min: 1
len_max: 65,192
len_mean: 227.5
len_median: 116
len_p95: 966
word_mean: 34.21
word_median: 20
n_empty: 0
n_duplicates: 49,120
duplicate_rate: 0.1213
vocab_size: 85,972
readability_flesch_mean: 62.52
emoji_rate: 0.02271
url_rate: 0.0288
one_word_rate: 0.02114
allcaps_rate: 0.0159
boilerplate_rate: 0.04618
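
The strip-then-model treatment can be sketched as a small filter. This is a minimal illustration, not the profiler's method: the placeholder pattern and the names `is_placeholder` and `clean_alt_texts` are assumptions, and the boilerplate set holds only the one repeated string quoted in this report.

```python
import re

# Assumed placeholder pattern covering the examples quoted in the report:
# "Image", "image", "Image 1", and similar numbered variants.
PLACEHOLDER_RE = re.compile(r"^\s*image(\s+\d+)?\s*$", re.IGNORECASE)

# Boilerplate strings to drop. Only the string quoted in the report is
# listed; a real list would be built from the column's top values.
BOILERPLATE = {"Epstein Web iOS app on the App Store."}


def is_placeholder(alt: str) -> bool:
    """Return True for generic placeholders such as 'Image' or 'Image 1'."""
    return bool(PLACEHOLDER_RE.match(alt))


def clean_alt_texts(alts):
    """Drop empty, placeholder, and known-boilerplate alt texts."""
    kept = []
    for alt in alts:
        stripped = alt.strip()
        if not stripped or is_placeholder(stripped) or stripped in BOILERPLATE:
            continue
        kept.append(stripped)
    return kept
```

Language detection and embedding would follow this step; they are left out because the report does not name a library for either.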

image_alt_length

numeric feature high_skew outliers

Counts of characters in image alt-text fields, ranging from 1 to 65,192 with a median of 116 and IQR of 59–226. The distribution is severely right-skewed (skew 38.4, kurtosis 5934) and 11.8% of rows (47,944) flag as outliers, suggesting a long tail of unusually verbose alt-text or possibly entire passages dumped into the alt attribute. No nulls or zeros, so every row carries some alt content. Treatment: Log-transform or winsorize before modelling to tame the extreme right tail. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 2,052
min: 1
max: 65,192
mean: 227.5
median: 116
std: 364.4
q1: 59
q3: 226
iqr: 167
skew: 38.39
kurtosis: 5934
n_outliers: 47,944
outlier_rate: 0.1184
zero_rate: 0
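
The outlier count is consistent with a Tukey-style upper fence of Q3 + 1.5 × IQR, although the report does not state which rule the profiler used. Under that assumption, the sketch below computes the fence from the reported quartiles and shows the two suggested treatments on a handful of values taken from the stats in this report (min, median, p95, max). The function names are illustrative.

```python
import math


def iqr_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper Tukey fence: values above q3 + k * (q3 - q1) count as outliers."""
    return q3 + k * (q3 - q1)


def winsorize_upper(values, cap):
    """Clip every value above `cap` down to `cap`."""
    return [min(v, cap) for v in values]


def log_transform(values):
    """log(1 + x), which compresses the long right tail."""
    return [math.log1p(v) for v in values]


fence = iqr_upper_fence(59, 226)   # quartiles reported for image_alt_length
lengths = [1, 116, 966, 65192]     # min, median, p95, max from the report
capped = winsorize_upper(lengths, fence)
logged = log_transform(lengths)
```

With the reported quartiles the fence sits at 476.5 characters, so any alt text longer than that would be flagged under this rule.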

text

text free_text duplicates multilingual

Short user-generated Bluesky posts (len_mean 136.5, len_max 397, word_median 15) with frequent emoji and links (emoji_rate 0.226, url_rate 0.156); top values include sign-up rule boilerplate and hashtags like #Alt4You. Predominantly English (4337) but spans 30 languages including German (91), Japanese (75), Portuguese (48), and French (43). Watch the heavy duplication (duplicate_rate 0.252, n_duplicates 102,139) and the 17,970 empty strings sitting at the top of the value list. Treatment: Drop empties and exact duplicates, then language-filter or multilingual-embed before modelling. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 302,702
len_min: 0
len_max: 397
len_mean: 136.5
len_median: 131
len_p95: 296
word_mean: 19.73
word_median: 15
n_empty: 17,970
n_duplicates: 102,139
duplicate_rate: 0.2523
vocab_size: 82,709
readability_flesch_mean: 50.35
emoji_rate: 0.2264
url_rate: 0.1555
one_word_rate: 0.06977
allcaps_rate: 0.01936
boilerplate_rate: 0.001766

author_handle

categorical foreign_key null_rate

Author handles from Bluesky/AT Protocol activity (most values end in .bsky.social, with some custom domains like lukesteuber.com and majima.club). Only 489 unique handles appear across the 279,196 non-null rows, with a high entropy ratio (0.859) and a flat top: the most prolific account holds just 2.4% of non-null rows (1.7% of all rows), so this looks like a curated set of accounts rather than an open population. Notable: 31% of rows (125,645) have a null handle, a count that equals the number of 'jetstream' rows in source_mode, so the nulls probably mark rows collected without a handle; confirm this before any per-author aggregation. Treatment: Treat as a categorical author key; verify the source of the 31% nulls and decide whether to drop them, impute them, or group by author_did instead. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 125,645 (31.0%)
unique: 489
top_value: firefaerie81.bsky.social
top_rate: 0.02446
cardinality: 489
entropy: 7.676
entropy_ratio: 0.8593
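
The null count for this column (125,645) equals the 'jetstream' row count reported for source_mode, which suggests a quick cross-tabulation before any grouping. The sketch below assumes rows are available as dictionaries keyed by the column names in this report; the function name and sample rows are made up for illustration.

```python
from collections import Counter


def null_handles_by_source(rows):
    """Count rows with a missing author_handle, grouped by source_mode."""
    counts = Counter()
    for row in rows:
        if row.get("author_handle") is None:
            counts[row["source_mode"]] += 1
    return dict(counts)


# Made-up sample rows, for illustration only.
sample = [
    {"author_handle": "firefaerie81.bsky.social", "source_mode": "author_feed"},
    {"author_handle": None, "source_mode": "jetstream"},
    {"author_handle": None, "source_mode": "jetstream"},
]
result = null_handles_by_source(sample)
```

If every null falls under 'jetstream' on the real data, the nulls are structural and author_did is the safer grouping key.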

author_did

text foreign_key duplicates one_word

This column holds Bluesky/AT Protocol decentralized identifiers (`did:plc:` prefix) for post authors, one token per value; nearly all are 32 chars long (range 17-33). Across 404,841 rows there are only 35,183 unique authors (duplicate_rate 0.913), and one author alone accounts for 32,558 rows (about 8%), a heavy concentration at the top. Treatment: Treat as an author key: left-join to a profile table, and consider per-author aggregation or capping before modelling. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 35,183
len_min: 17
len_max: 33
len_mean: 32
len_median: 32
len_p95: 32
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 369,658
duplicate_rate: 0.9131
vocab_size: 4,133
readability_flesch_mean: -228.6
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
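
Per-author capping, mentioned in the treatment, can be sketched as follows. The cap value, the function name, and the row-as-dictionary layout are assumptions; the DIDs in the sample are made up.

```python
from collections import defaultdict


def cap_rows_per_author(rows, max_per_author):
    """Keep at most `max_per_author` rows per author_did, in input order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        did = row["author_did"]
        if seen[did] < max_per_author:
            seen[did] += 1
            kept.append(row)
    return kept


# Made-up DIDs, for illustration only.
sample = [
    {"author_did": "did:plc:aaa", "alt_text": "one"},
    {"author_did": "did:plc:aaa", "alt_text": "two"},
    {"author_did": "did:plc:aaa", "alt_text": "three"},
    {"author_did": "did:plc:bbb", "alt_text": "four"},
]
capped_rows = cap_rows_per_author(sample, max_per_author=2)
```

Keeping the first rows in input order is the simplest choice; random sampling within each author would avoid favouring whichever rows were collected first.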

post_uri

text foreign_key one_word

This is an AT Protocol post URI (Bluesky), encoding the author DID and post rkey in a fixed `at://did:plc:.../app.bsky.feed.post/...` format; note the tight length range (55-71 chars, mean 70) and one_word_rate of 1.0. Despite looking like a primary key, it is not unique: 335,384 unique values across 404,841 rows yields a 17.2% duplicate rate. Some repetition is expected if a multi-image post contributes one row per image (image_index runs 0 to 3), but the top URI appears 15 times, more than the 4-image maximum, which suggests some posts were collected more than once. Readability and emoji metrics are meaningless here since each value is a single token. Treatment: Treat as a post identifier for joins; deduplicate or aggregate on it rather than modelling its text. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 335,384
len_min: 55
len_max: 71
len_mean: 70
len_median: 70
len_p95: 70
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,457
duplicate_rate: 0.1716
vocab_size: 19,733
readability_flesch_mean: -628.8
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
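
Since the URI encodes the author DID, the collection, and the record key, splitting it gives join keys without touching the other columns. A minimal sketch, assuming every value follows the three-segment `at://<did>/<collection>/<rkey>` shape described above; the `AtUri` type and the sample URI are hypothetical.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def parse_at_uri(uri: str) -> AtUri:
    """Split an at:// URI into its DID, collection, and record key."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected did/collection/rkey, got: {uri!r}")
    return AtUri(*parts)


# Made-up URI in the documented shape, for illustration only.
parsed = parse_at_uri("at://did:plc:abc123/app.bsky.feed.post/3kxyz")
```

The parsed `did` can be checked against author_did as a consistency test on the collected rows.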

post_cid

text identifier one_word

This is a post_cid column holding IPLD/CID v1 content identifiers (all 404,841 values are exactly 59 chars and start with the 'bafyrei' CIDv1 dag-cbor prefix). It is not unique (335,545 distinct of 404,841): the 17.1% duplicate_rate, with some CIDs repeating up to 4 times, is consistent with one row per image in posts that carry up to 4 images. The fixed length, single-token shape, and zero nulls confirm a machine-generated identifier rather than a feature. Treatment: Treat as a join key; deduplicate before use and don't feed into models. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 335,545
len_min: 59
len_max: 59
len_mean: 59
len_median: 59
len_p95: 59
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,296
duplicate_rate: 0.1712
vocab_size: 19,734
readability_flesch_mean: -567
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

created_at

text timestamp one_word allcaps

ISO-8601 timestamps stored as text, mixing two suffix conventions (`.000Z`/`.911Z` style and `+00:00` offset). The allcaps and one_word flags at rate 1.0 are artifacts of the format: each value is a single token whose only letters are the uppercase `T` separator and, where present, the `Z` suffix. Length ranges 20-35 chars and median 24 align with standard ISO datetime strings. Duplicate rate is 17.4% (70,449 repeats across 404,841 rows), and top values cluster heavily on 2026-04-16, suggesting a load-day spike or truncated export window. Treatment: Parse to datetime (normalising the two ISO suffix formats) and use as a temporal feature. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 334,392
len_min: 20
len_max: 35
len_mean: 24.55
len_median: 24
len_p95: 29
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 70,449
duplicate_rate: 0.174
vocab_size: 19,732
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
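
Normalising the two suffix styles can be sketched with the standard library alone. The regular expression below is an assumption about the value shapes, inferred from the reported length range: an optional fractional part of any length, followed by either `Z` or a numeric offset. Fractions longer than six digits are truncated to microseconds.

```python
import re
from datetime import datetime, timezone

# Assumed shape: date, 'T', time, optional fraction, then 'Z' or +HH:MM.
_ISO_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)


def parse_iso_utc(value: str) -> datetime:
    """Parse either suffix style into a timezone-aware UTC datetime."""
    match = _ISO_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, fraction, tz = match.groups()
    micros = (fraction or "")[:6].ljust(6, "0")  # pad or truncate to 6 digits
    offset = "+00:00" if tz == "Z" else tz
    parsed = datetime.fromisoformat(f"{base}.{micros}{offset}")
    return parsed.astimezone(timezone.utc)
```

Padding the fraction to exactly six digits keeps the call within what `datetime.fromisoformat` accepts on older Python versions, which is why the function does not pass the raw string through directly.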

indexed_at

text timestamp one_word allcaps

This is an ISO-8601 indexing timestamp stored as text, with values like '2025-08-02T00:50:34.108Z' and lengths between 24 and 32 characters. Two formats are mixed: a 24-char 'Z' suffix variant and a 32-char '+00:00' offset variant, which explains the length spread. Duplicate rate is 17.1% (69,298 rows) across 335,543 unique values out of 404,841; the unique count nearly matches post_cid (335,545), so the repeats more likely come from several rows sharing one post than from batch indexing. Treatment: Parse to a single UTC datetime type and use for recency filtering or time-based joins. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 335,543
len_min: 24
len_max: 32
len_mean: 26.48
len_median: 24
len_p95: 32
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,298
duplicate_rate: 0.1712
vocab_size: 19,734
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

langs_json

categorical feature

This column holds JSON-encoded language tag arrays, almost certainly the detected or declared languages per record. English dominates at 71.9% of rows (["en"]), and a notable 83,264 rows carry an empty array [] indicating no language assigned. Cardinality is 214 distinct strings with low entropy ratio (0.179), and the long tail includes multi-language combinations like ["en", "sq"] and ["ae", "ay", "en"]. Treatment: Parse the JSON arrays and one-hot encode primary language, treating [] as a separate 'unknown' category. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 214
top_value: ["en"]
top_rate: 0.7188
cardinality: 214
entropy: 1.389
entropy_ratio: 0.1794
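
Parsing the arrays and mapping `[]` to an 'unknown' bucket might look like the sketch below. Taking the first tag as the primary language is an assumption: the tail examples (`["en", "sq"]`, `["ae", "ay", "en"]`) look alphabetically ordered, so the first tag may not be the dominant language of the post.

```python
import json


def primary_lang(langs_json: str) -> str:
    """Return the first language tag, or 'unknown' for empty or invalid input."""
    try:
        langs = json.loads(langs_json)
    except (TypeError, ValueError):
        return "unknown"
    if not isinstance(langs, list) or not langs:
        return "unknown"
    return str(langs[0])


def one_hot(lang: str, vocabulary):
    """One-hot encode `lang` against a fixed, ordered vocabulary."""
    return [1 if lang == known else 0 for known in vocabulary]


vocab = ["en", "de", "unknown"]
encoded = one_hot(primary_lang('["en"]'), vocab)
```

A multi-hot encoding over all tags in the array would sidestep the ordering question at the cost of a wider feature vector.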

image_index

numeric feature high_skew outliers

A small-integer counter taking only 4 distinct values between 0 and 3, with 82.8% zeros and a mean of 0.26 — likely an image position/index within a multi-image record. The distribution is heavily right-skewed (skew 2.73, kurtosis 7.08) and the IQR is 0, so any non-zero entry registers as an outlier (17.16% flagged). Treatment: Treat as a low-cardinality categorical (one-hot or keep as ordinal); ignore outlier flag since IQR is zero by construction. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 4
min: 0
max: 3
mean: 0.2604
median: 0
std: 0.6467
q1: 0
q3: 0
iqr: 0
skew: 2.726
kurtosis: 7.083
n_outliers: 69,490
outlier_rate: 0.1716
zero_rate: 0.8284

image_count_in_post

numeric feature outliers

Discrete count of images attached to a post, taking only 4 distinct integer values between 1 and 4 with no nulls or zeros. The distribution is right-skewed (skew 1.72) toward single-image posts: median and Q1 are both 1, Q3 is 2, and 9.7% of rows (39,251) flag as outliers. That count equals the number of rows with value 4, which is what a 1.5 × IQR fence at 3.5 would flag; rows from 3-image posts (20,144) fall inside the fence. Treatment: Treat as a small ordinal/categorical count rather than continuous; bin 3-4 together or one-hot encode. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 4
min: 1
max: 4
mean: 1.524
median: 1
std: 0.9647
q1: 1
q3: 2
iqr: 1
skew: 1.719
kurtosis: 1.551
n_outliers: 39,251
outlier_rate: 0.09695
zero_rate: 0

image_mime_type

categorical metadata

This column records the MIME type of an associated image, with 8 distinct values across 404,841 rows and no nulls. It is heavily dominated by image/jpeg at 93.5% (378,401 rows), with image/png a distant second (23,402) and a long tail of webp, gif, avif, svg+xml, heic, plus 4 rows of application/octet-stream that aren't an image type at all. Entropy ratio is just 0.128, confirming very low diversity. Treatment: Collapse rare types into an 'other' bucket (and review the 4 application/octet-stream rows) before one-hot encoding. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 8
top_value: image/jpeg
top_rate: 0.9347
cardinality: 8
entropy: 0.3842
entropy_ratio: 0.1281
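
Collapsing rare types can be sketched with the counts from the image_mime_type table in this report. The 0.5% threshold and the function name are assumptions chosen for illustration; the report does not prescribe a cutoff.

```python
def collapse_rare(counts, min_share):
    """Merge categories whose share of the total is below `min_share` into 'other'."""
    total = sum(counts.values())
    collapsed = {}
    for value, count in counts.items():
        key = value if count / total >= min_share else "other"
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed


# Counts copied from the image_mime_type table above.
MIME_COUNTS = {
    "image/jpeg": 378401,
    "image/png": 23402,
    "image/webp": 2906,
    "image/gif": 79,
    "image/avif": 34,
    "image/svg+xml": 14,
    "application/octet-stream": 4,
    "image/heic": 1,
}

collapsed = collapse_rare(MIME_COUNTS, min_share=0.005)
```

At this threshold JPEG, PNG, and WebP survive as their own categories and the remaining five types merge into a single bucket of 132 rows.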

image_ref

text foreign_key one_word

This column holds IPFS CIDv1 content hashes (all 59 chars, all starting with 'bafkrei...' indicating raw codec + sha256), one token per row. With 375,188 unique values and a 7.3% duplicate rate, it functions as a content-addressed image pointer rather than a row identifier; one CID appears 5,227 times (the same count as the most repeated alt_text string), suggesting a heavily reused asset. Length is rigidly fixed (min=max=59), 19 rows are null, and there are no URLs or emojis. Treatment: Treat as an opaque content-address key; left-join to an image/asset table and do not tokenize. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 19 (0.0%)
unique: 375,188
len_min: 59
len_max: 59
len_mean: 59
len_median: 59
len_p95: 59
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 29,634
duplicate_rate: 0.0732
vocab_size: 19,560
readability_flesch_mean: -539.1
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

image_thumb_url

text metadata null_rate one_word url_heavy

This column holds Bluesky CDN thumbnail URLs (every non-null value is exactly 138 chars and a single token pointing at cdn.bsky.app/img/feed_thumbnail/plain/...). 31.04% of rows are null and 6.65% are duplicates, with one thumbnail repeating 1000 times — likely a heavily reshared post. With 260,636 unique URLs across 404,841 rows, it's effectively a per-post image reference rather than a modelling feature. Treatment: Keep as a media reference for display or image fetching; drop from tabular modelling. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 125,645 (31.0%)
unique: 260,636
len_min: 138
len_max: 138
len_mean: 138
len_median: 138
len_p95: 138
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 18,560
duplicate_rate: 0.06648
vocab_size: 19,666
readability_flesch_mean: -1391
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

image_fullsize_url

text metadata null_rate one_word url_heavy

This column holds Bluesky CDN URLs pointing to full-size feed images, all exactly 137 characters long and consisting of a single token (url_rate 1.0, one_word_rate 1.0). 31.04% of rows (125,645) are null; this matches the 'jetstream' row count in source_mode, so the nulls most likely mean that collection mode did not record CDN URLs, not that the posts lack images (image_mime_type has no nulls). Among non-null values 6.65% are duplicates, and one image URL repeats 1,000 times, suggesting reshared or pinned media. With 260,636 unique values across 404,841 rows, the field is high-cardinality but not unique. Treatment: Treat as an image asset reference; fetch/hash for media features rather than tokenizing as text. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 125,645 (31.0%)
unique: 260,636
len_min: 137
len_max: 137
len_mean: 137
len_median: 137
len_p95: 137
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 18,560
duplicate_rate: 0.06648
vocab_size: 19,666
readability_flesch_mean: -1475
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

source_mode

categorical feature

Binary categorical flag indicating ingestion source, with values 'author_feed' (69.0%) and 'jetstream' (31.0%) across 404,841 complete rows. The split is imbalanced but both classes are well-represented, and entropy_ratio of 0.89 confirms meaningful spread between the two modes. No nulls, suggesting the field is always populated at ingest time. Treatment: One-hot or boolean-encode for modelling; useful as a provenance stratifier. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 2
top_value: author_feed
top_rate: 0.6896
cardinality: 2
entropy: 0.8936
entropy_ratio: 0.8936
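
The entropy figures in this report are consistent with Shannon entropy in bits, with the ratio taken against the maximum log2(k) for k categories; the report does not state the formula, so treat that reading as an assumption. The sketch below recomputes the source_mode value from the counts in its table.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


# author_feed and jetstream counts from the source_mode table above.
source_mode_counts = [279196, 125645]
h = shannon_entropy_bits(source_mode_counts)
ratio = entropy_ratio(source_mode_counts)
```

For a two-value column log2(k) is 1, which is why entropy and entropy_ratio coincide in the stats for this column.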

collected_at

text timestamp near_unique one_word allcaps

This is an ISO-8601 UTC timestamp column (every value is exactly 27 characters, one token, ending in Z) capturing when each of the 404,841 records was collected. All values are unique with zero nulls, so it functions as a per-row event time rather than a categorical feature. The sampled values cluster on 2026-04-15 and 2026-04-16, suggesting collection spans roughly a single day or two. Treatment: parse to datetime and use for time-based splits or feature extraction rather than as a model input. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 404,841
len_min: 27
len_max: 27
len_mean: 27
len_median: 27
len_p95: 27
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

query

categorical foreign_key null_rate

The column holds Bluesky-style handles (e.g., firefaerie81.bsky.social, lukesteuber.com), suggesting it captures the account being queried rather than a free-text search string. Its statistics are identical to author_handle (489 unique values, same top value, entropy ratio 0.859), so traffic is spread fairly evenly with no dominant target; the top handle accounts for just 2.45% of non-null rows. Notably, 31.04% of rows (125,645) are null, the same count as the 'jetstream' rows in source_mode, which suggests the field is only populated for 'author_feed' collection. Treatment: Filter or impute the 31% nulls before joining; treat as a high-cardinality categorical key, not free text. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 125,645 (31.0%)
unique: 489
top_value: firefaerie81.bsky.social
top_rate: 0.02446
cardinality: 489
entropy: 7.676
entropy_ratio: 0.8593

cursor

text metadata duplicates one_word allcaps

Despite the 'text' classification, every value is a single ISO-8601 UTC timestamp (length 16-24, one_word_rate 1.0, allcaps_rate 1.0 from the trailing Z), suggesting this is a pagination cursor keyed on a record timestamp. Across 404,841 rows there are only 105,919 distinct values and a 70.6% duplicate rate, with the top cursors repeating 200+ times each — consistent with many records sharing the same paging anchor. Null rate is 10.97%, likely the first/last page where no cursor applies. Treatment: Parse as timestamp if needed for ordering, otherwise drop — it is API pagination state, not a feature. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 44,399 (11.0%)
unique: 105,919
len_min: 16
len_max: 24
len_mean: 20.98
len_median: 24
len_p95: 24
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 254,523
duplicate_rate: 0.7061
vocab_size: 9,220
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

raw_record_json

text free_text multilingual

This column holds the raw JSON payload of Bluesky posts (`app.bsky.feed.post` records with embeds, replies, langs, and text), one serialized object per row. With 404,841 rows, 335,545 distinct values and a 17.1% duplicate rate (69,296 repeats), the same post JSON recurs; these counts match post_cid exactly, so the repeats most likely come from multi-image posts stored once per image row, possibly alongside ingestion replays. Mean length is 1,184 chars (max 65,764), Flesch readability is -250 because the JSON scaffolding dominates over prose, and the language sample is 432 English vs only a handful of de/fi/fr/is/ja/pt despite the multilingual alert. URL-bearing rows hit 24.7% and emoji rows 23.1%, both inflated by emoji/URLs embedded inside alt-text and post bodies. Treatment: Parse as JSON and project the fields you need (text, langs, embed) before any text modelling; do not feed the raw blob to a tokenizer. high · anthropic:claude-opus-4-7

n: 404,841
nulls: 0 (0.0%)
unique: 335,545
len_min: 287
len_max: 65,764
len_mean: 1184
len_median: 967
len_p95: 2,504
word_mean: 67.63
word_median: 49
n_empty: 0
n_duplicates: 69,296
duplicate_rate: 0.1712
vocab_size: 117,811
readability_flesch_mean: -250
emoji_rate: 0.2307
url_rate: 0.2473
one_word_rate: 0.00786
allcaps_rate: 0
boilerplate_rate: 0
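
Projecting the needed fields out of the blob can be sketched as below. The layout is an assumption: the code expects each value to be a bare `app.bsky.feed.post` record whose `embed` carries an `images` list with per-image `alt` strings. Other embed shapes (for example media nested under a quoted record) would need extra handling, and the sample record is made up.

```python
import json


def project_post(raw_record_json: str) -> dict:
    """Extract text, language tags, and per-image alt text from a post record."""
    record = json.loads(raw_record_json)
    embed = record.get("embed") or {}
    images = embed.get("images") or []
    return {
        "text": record.get("text", ""),
        "langs": record.get("langs") or [],
        "alts": [image.get("alt", "") for image in images],
    }


# Made-up record in the assumed shape, for illustration only.
sample = json.dumps({
    "$type": "app.bsky.feed.post",
    "text": "Morning walk",
    "langs": ["en"],
    "embed": {
        "$type": "app.bsky.embed.images",
        "images": [{"alt": "A foggy path through trees"}],
    },
})
projected = project_post(sample)
```

Projecting once per distinct post_cid rather than once per row avoids re-parsing the same record for every image of a multi-image post.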