bluesky alt text
Reading
This is a 404,841-row Bluesky image-post dataset (lukeslp/bluesky-alt-text) capturing posts with attached images, their alt text, author identifiers, and raw AT-Protocol records across 21 columns. Two things stand out for follow-up. First, alt-text length is extremely skewed: the mean is 227 chars against a maximum of 65,192, and ~12% of rows are flagged as outliers, so a long tail of very long descriptions is pulling the mean up. Second, authorship is highly concentrated, with one DID accounting for 32,558 rows and the top handle 'firefaerie81.bsky.social' contributing 6,828, which is worth checking for bot or scraper bias. Content is overwhelmingly English (~72% of langs_json values are ["en"]) but spans 214 distinct language-tag combinations, and images are predominantly JPEG (93.5%) with PNG a distant second (5.8%). Note also that ~31% of image URLs and author handles are null, which likely reflects the split between the 'author_feed' (69%) and 'jetstream' (31%) source modes.
citing: row_count · column_count · columns.image_alt_length.stats · columns.author_did.top_values · columns.author_handle.top_values · columns.author_handle.null_rate · columns.image_mime_type.top_values · columns.langs_json.top_values · columns.source_mode.top_values · columns.image_count_in_post.stats · columns.text.language_counts
Charts the summary said to look at first
image_mime_type, top values
| value | count | share |
|---|---|---|
| image/jpeg | 378401 | 93.5% |
| image/png | 23402 | 5.8% |
| image/webp | 2906 | 0.7% |
| image/gif | 79 | 0.0% |
| image/avif | 34 | 0.0% |
| image/svg+xml | 14 | 0.0% |
| application/octet-stream | 4 | 0.0% |
| image/heic | 1 | 0.0% |
source_mode, top values
| value | count | share |
|---|---|---|
| author_feed | 279196 | 69.0% |
| jetstream | 125645 | 31.0% |
author_handle, top 20 values
| value | count | share |
|---|---|---|
| firefaerie81.bsky.social | 6828 | 1.7% |
| dashdashado.bsky.social | 6386 | 1.6% |
| epsteinweb.bsky.social | 6146 | 1.5% |
| lukesteuber.com | 6019 | 1.5% |
| azgibsonz.bsky.social | 5538 | 1.4% |
| majima.club | 5427 | 1.3% |
| allthegalaxies.galaxyzoo.org | 5000 | 1.2% |
| guggenheim.bsky.social | 5000 | 1.2% |
| listigeplaylists.bsky.social | 4998 | 1.2% |
| catsofyore.bsky.social | 4610 | 1.1% |
| magpie.tips | 4375 | 1.1% |
| skwinnicki.bsky.social | 4082 | 1.0% |
| gonzo.bsky.social | 3894 | 1.0% |
| ethankocak.com | 3653 | 0.9% |
| maladroithe.bsky.social | 3590 | 0.9% |
| nikolaidenmark.bsky.social | 3429 | 0.8% |
| mbkplus.bsky.social | 3384 | 0.8% |
| eustace.justshootme.ca | 3312 | 0.8% |
| veronicaf.bsky.social | 2686 | 0.7% |
| sarahmackattack.bsky.social | 2645 | 0.7% |
image_alt_length, histogram (40 equal-width bins; bin edges rounded to four significant figures; the 32 bins not listed are empty)
| bin | count |
|---|---|
| 1 – 1631 | 400124 |
| 1631 – 3261 | 4709 |
| 3261 – 4890 | 2 |
| 6520 – 8150 | 1 |
| 17930 – 19560 | 1 |
| 37490 – 39120 | 1 |
| 48890 – 50520 | 1 |
| 63560 – 65190 | 2 |
image_count_in_post, histogram (integer values 1 to 4; the 36 intermediate bins of the original 40 are empty)
| value | count |
|---|---|
| 1 | 291415 |
| 2 | 54031 |
| 3 | 20144 |
| 4 | 39251 |
Schema
21 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| alt_text | text | 0.0% | 355,721 | multilingual |
| image_alt_length | numeric | 0.0% | 2,052 | high_skew, outliers |
| text | text | 0.0% | 302,702 | duplicates, multilingual |
| author_handle | categorical | 31.0% | 489 | null_rate |
| author_did | text | 0.0% | 35,183 | duplicates, one_word |
| post_uri | text | 0.0% | 335,384 | one_word |
| post_cid | text | 0.0% | 335,545 | one_word |
| created_at | text | 0.0% | 334,392 | one_word, allcaps |
| indexed_at | text | 0.0% | 335,543 | one_word, allcaps |
| langs_json | categorical | 0.0% | 214 | |
| image_index | numeric | 0.0% | 4 | high_skew, outliers |
| image_count_in_post | numeric | 0.0% | 4 | outliers |
| image_mime_type | categorical | 0.0% | 8 | |
| image_ref | text | 0.0% | 375,188 | one_word |
| image_thumb_url | text | 31.0% | 260,636 | null_rate, one_word, url_heavy |
| image_fullsize_url | text | 31.0% | 260,636 | null_rate, one_word, url_heavy |
| source_mode | categorical | 0.0% | 2 | |
| collected_at | text | 0.0% | 404,841 | near_unique, one_word, allcaps |
| query | categorical | 31.0% | 489 | null_rate |
| cursor | text | 11.0% | 105,919 | duplicates, one_word, allcaps |
| raw_record_json | text | 0.0% | 335,545 | multilingual |
alt_text
text · free_text · multilingual
Free-text image alt-text descriptions, predominantly English (4,071 detected) with a long multilingual tail spanning 28 other languages, including German (94), French (59), Japanese (47) and Spanish (37). Length varies widely: the median is 116 chars and the 95th percentile 966, but the maximum is 65,192. 12.1% of values are duplicates, driven by templated boilerplate such as 'Epstein Web iOS app on the App Store.' (5,227 repeats) and lengthy pasted feed-rules blocks. Vocabulary is large (85,972 unique tokens) and mean Flesch readability is 62.5, but the boilerplate rate (4.6%) and generic placeholders ('Image', 'Image 1', 'image') indicate a meaningful share of low-information content. Treatment: Strip boilerplate and placeholder strings, then tokenize and embed (with language detection) before modelling.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 355,721
- len_min: 1
- len_max: 65,192
- len_mean: 227.5
- len_median: 116
- len_p95: 966
- word_mean: 34.21
- word_median: 20
- n_empty: 0
- n_duplicates: 49,120
- duplicate_rate: 0.1213
- vocab_size: 85,972
- readability_flesch_mean: 62.52
- emoji_rate: 0.02271
- url_rate: 0.0288
- one_word_rate: 0.02114
- allcaps_rate: 0.0159
- boilerplate_rate: 0.04618
image_alt_length
numeric · feature · high_skew · outliers
Character counts of the image alt-text field, ranging from 1 to 65,192 with a median of 116 and an IQR of 59–226. The distribution is severely right-skewed (skew 38.4, kurtosis 5934) and 11.8% of rows (47,944) are flagged as outliers, pointing to a long tail of unusually verbose alt text or entire passages pasted into the alt attribute. There are no nulls or zeros, so every row carries some alt content. Treatment: Log-transform or winsorize before modelling to tame the extreme right tail.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 2,052
- min: 1
- max: 65,192
- mean: 227.5
- median: 116
- std: 364.4
- q1: 59
- q3: 226
- iqr: 167
- skew: 38.39
- kurtosis: 5934
- n_outliers: 47,944
- outlier_rate: 0.1184
- zero_rate: 0
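The log-transform and winsorize treatments for image_alt_length can be sketched in plain Python. This is a minimal illustration, not the profiler's own code: the helper names `upper_fence`, `winsorize` and `log_lengths` are hypothetical, and the cap assumes the conventional 1.5×IQR rule applied to the reported quartiles (226 + 1.5 × 167 = 476.5). The sample values are simply the reported min, q1, median, q3, p95 and max.

```python
import math

def upper_fence(q1, q3, k=1.5):
    """Tukey upper fence: values above q3 + k * IQR count as outliers."""
    return q3 + k * (q3 - q1)

def winsorize(values, cap):
    """One-sided winsorizing: clamp every value above `cap` down to `cap`."""
    return [min(v, cap) for v in values]

def log_lengths(values):
    """log1p preserves ordering, compresses the right tail, and is safe at 0."""
    return [math.log1p(v) for v in values]

# Quartiles reported for image_alt_length (q1=59, q3=226).
cap = upper_fence(59, 226)

# Summary statistics from the profile, used here only as sample inputs.
sample = [1, 59, 116, 226, 966, 65192]
print(winsorize(sample, cap))
print([round(x, 2) for x in log_lengths(sample)])
```

Winsorizing keeps the column on its original scale but flattens everything past the cap; the log transform keeps the ordering of long descriptions, which may matter if length itself is informative.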
text
text · free_text · duplicates · multilingual
Short user-generated Bluesky post bodies (len_mean 136.5, len_max 397, word_median 15) with typical social-media traits: emoji_rate 0.226, url_rate 0.156, and top values that include sign-up rule boilerplate and hashtags like #Alt4You. Predominantly English (4,337) but spanning 30 languages, including German (91), Japanese (75), Portuguese (48), and French (43). Watch the heavy duplication (duplicate_rate 0.252, n_duplicates 102,139); part of it is probably structural, because image_index implies one row per image, so a multi-image post repeats its text. Also note the 17,970 empty strings at the top of the value list. Treatment: Drop empties and exact duplicates, then language-filter or multilingual-embed before modelling.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 302,702
- len_min: 0
- len_max: 397
- len_mean: 136.5
- len_median: 131
- len_p95: 296
- word_mean: 19.73
- word_median: 15
- n_empty: 17,970
- n_duplicates: 102,139
- duplicate_rate: 0.2523
- vocab_size: 82,709
- readability_flesch_mean: 50.35
- emoji_rate: 0.2264
- url_rate: 0.1555
- one_word_rate: 0.06977
- allcaps_rate: 0.01936
- boilerplate_rate: 0.001766
post_uri
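text · foreign_key · one_word
An AT Protocol post URI (Bluesky) that encodes the author DID and post rkey in a fixed `at://did:plc:.../app.bsky.feed.post/...` format; note the tight length range (55–71 chars, mean 70) and one_word_rate of 1.0. Despite looking like a primary key, it is not unique: 335,384 unique values across 404,841 rows give a 17.2% duplicate rate, with the top URI appearing 15 times. Much of this is likely structural, since the table appears to hold one row per image (image_index runs 0–3 and 69,490 rows have a non-zero index, close to the 69,457 duplicate URIs). A URI repeated 15 times exceeds the maximum of four images per post seen in image_count_in_post, so some ingestion replay is probably present as well. Readability and emoji metrics are meaningless here since each value is a single token. Treatment: Treat as a post identifier for joins; deduplicate or aggregate on it rather than modelling its text.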
- n: 404,841
- nulls: 0 (0.0%)
- unique: 335,384
- len_min: 55
- len_max: 71
- len_mean: 70
- len_median: 70
- len_p95: 70
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 69,457
- duplicate_rate: 0.1716
- vocab_size: 19,733
- readability_flesch_mean: -628.8
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
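Using post_uri as a join or aggregation key usually starts with splitting it into its parts and counting rows per URI. The sketch below assumes the three-segment `at://<authority>/<collection>/<rkey>` shape described above; `parse_at_uri` and `rows_per_post` are hypothetical helper names, and the example URIs are made up for illustration.

```python
from collections import Counter

def parse_at_uri(uri):
    """Split at://<authority>/<collection>/<rkey> into a small dict."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3:
        raise ValueError(f"expected authority/collection/rkey: {uri!r}")
    authority, collection, rkey = parts
    return {"did": authority, "collection": collection, "rkey": rkey}

def rows_per_post(uris):
    """Count how many rows share each post URI."""
    return Counter(uris)

# Made-up URIs: two rows for one post, one row for another.
uris = [
    "at://did:plc:exampleauthor1/app.bsky.feed.post/3kexample001",
    "at://did:plc:exampleauthor1/app.bsky.feed.post/3kexample001",
    "at://did:plc:exampleauthor2/app.bsky.feed.post/3kexample002",
]
print(parse_at_uri(uris[0]))
print(rows_per_post(uris).most_common(1))
```

Comparing each URI's row count with image_count_in_post is one way to separate expected per-image repeats from genuine ingestion replays.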
post_cid
text · identifier · one_word
IPLD CIDv1 content identifiers for the post record (all 404,841 values are exactly 59 chars and start with the 'bafyrei' CIDv1 dag-cbor prefix). The column is high-cardinality but not unique: 335,545 distinct values, a 17.1% duplicate_rate, and some CIDs repeating up to 4 times. That pattern is what one would expect if each row is one image of a post (up to four images per post), rather than a sign of broken references. The fixed length, single-token shape, and zero nulls confirm a machine-generated identifier rather than a feature. Treatment: Treat as a join key; deduplicate before use and don't feed into models.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 335,545
- len_min: 59
- len_max: 59
- len_mean: 59
- len_median: 59
- len_p95: 59
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 69,296
- duplicate_rate: 0.1712
- vocab_size: 19,734
- readability_flesch_mean: -567
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
created_at
text · timestamp · one_word · allcaps
ISO-8601 timestamps stored as text, mixing two suffix conventions (a `Z` suffix, as in `.000Z`/`.911Z`, and a `+00:00` offset). The allcaps and one_word rates of 1.0 are artifacts of the format: each value is a single token whose only letters are the uppercase `T` and `Z`. Lengths range from 20 to 35 chars with a median of 24, in line with ISO datetime strings that presumably differ in fractional-second precision. Duplicate rate is 17.4% (70,449 repeats across 404,841 rows), and top values cluster heavily on 2026-04-16, suggesting a load-day spike or truncated export window. Treatment: Parse to datetime (normalising the two ISO suffix formats) and use as a temporal feature.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 334,392
- len_min: 20
- len_max: 35
- len_mean: 24.55
- len_median: 24
- len_p95: 29
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 70,449
- duplicate_rate: 0.174
- vocab_size: 19,732
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
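Normalising the two suffix conventions can be done with the standard library alone. The sketch below is one possible approach, assuming values look like `2025-08-02T00:50:34.108Z` or `2025-08-02T00:50:34.108000+00:00` with an optional fractional part of any length; `parse_utc` is a hypothetical helper name, and fractional digits beyond microseconds are truncated because `datetime` cannot store them.

```python
import re
from datetime import datetime, timezone

# Date-time, optional fractional seconds, then either Z or a numeric offset.
_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)

def parse_utc(ts):
    """Parse an ISO-8601 string with a Z or +HH:MM suffix into an aware UTC datetime."""
    m = _ISO.match(ts.strip())
    if m is None:
        raise ValueError(f"unrecognised timestamp: {ts!r}")
    base, frac, tz = m.groups()
    # Pad or truncate the fraction to exactly six digits (microseconds).
    frac = (frac or "0")[:6].ljust(6, "0")
    tz = "+00:00" if tz == "Z" else tz
    dt = datetime.strptime(f"{base}.{frac}{tz}", "%Y-%m-%dT%H:%M:%S.%f%z")
    return dt.astimezone(timezone.utc)

a = parse_utc("2025-08-02T00:50:34.108Z")
b = parse_utc("2025-08-02T00:50:34.108000+00:00")
print(a.isoformat(), a == b)
```

Rows that fail to match raise a `ValueError`, which makes it easy to collect and inspect any third format rather than silently coercing it.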
indexed_at
text · timestamp · one_word · allcaps
An ISO-8601 indexing timestamp stored as text, with values like '2025-08-02T00:50:34.108Z' and lengths between 24 and 32 characters. Two formats are mixed: a 24-char 'Z' suffix variant and a 32-char '+00:00' offset variant, which explains the length spread. Duplicate rate is 17.1% (69,298 rows) across 335,543 unique values out of 404,841. This closely tracks the post_cid duplicate count (69,296), so the repeats most likely come from multi-image posts occupying several rows rather than from batch indexing. Treatment: Parse to a single UTC datetime type and use for recency filtering or time-based joins.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 335,543
- len_min: 24
- len_max: 32
- len_mean: 26.48
- len_median: 24
- len_p95: 32
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 69,298
- duplicate_rate: 0.1712
- vocab_size: 19,734
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
langs_json
categorical · feature
JSON-encoded arrays of language tags, almost certainly the detected or declared languages per record. English dominates at 71.9% of rows (["en"]), and 83,264 rows (20.6%) carry an empty array [], indicating no language assigned. Cardinality is 214 distinct strings with a low entropy ratio (0.179), and the long tail includes multi-language combinations like ["en", "sq"] and ["ae", "ay", "en"]. Treatment: Parse the JSON arrays and one-hot encode the primary language, treating [] as a separate 'unknown' category.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 214
- top_value: ["en"]
- top_rate: 0.7188
- cardinality: 214
- entropy: 1.389
- entropy_ratio: 0.1794
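A minimal sketch of the parsing step for langs_json, using only the standard library. It assumes each value is a JSON array of strings and takes the first tag as the "primary" language, which is an assumption rather than something the data guarantees; `primary_lang` is a hypothetical helper name.

```python
import json
from collections import Counter

def primary_lang(langs_json, unknown="unknown"):
    """Return the first tag of a JSON-encoded array, or `unknown` if empty or invalid."""
    try:
        tags = json.loads(langs_json)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers malformed JSON.
        return unknown
    if not isinstance(tags, list) or not tags:
        return unknown
    return str(tags[0])

# Example values taken from the column description above.
values = ['["en"]', "[]", '["en", "sq"]', '["ae", "ay", "en"]']
print(Counter(primary_lang(v) for v in values))
```

Because array order drives the result, a value such as ["ae", "ay", "en"] maps to "ae"; if English should win whenever present, that rule would need to be added explicitly.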
image_index
numeric · feature · high_skew · outliers
A small-integer counter taking only 4 distinct values between 0 and 3, with 82.8% zeros and a mean of 0.26; most likely the position of the image within a multi-image post. The distribution is heavily right-skewed (skew 2.73, kurtosis 7.08) and the IQR is 0, so every non-zero entry registers as an outlier (17.16% flagged). Treatment: Treat as a low-cardinality categorical (one-hot or keep as ordinal); ignore the outlier flag, since it is a by-product of the zero IQR.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 4
- min: 0
- max: 3
- mean: 0.2604
- median: 0
- std: 0.6467
- q1: 0
- q3: 0
- iqr: 0
- skew: 2.726
- kurtosis: 7.083
- n_outliers: 69,490
- outlier_rate: 0.1716
- zero_rate: 0.8284
image_count_in_post
numeric · feature · outliers
Discrete count of images attached to a post, taking only 4 distinct integer values between 1 and 4 with no nulls or zeros. The distribution is right-skewed (skew 1.72) toward single-image posts: median and Q1 are both 1 and Q3 is 2. Roughly 9.7% of rows (39,251) are flagged as outliers. With IQR = 1, the 1.5×IQR upper fence is 2 + 1.5 × 1 = 3.5, so only the 4-image rows are flagged, not the 3-image ones. Treatment: Treat as a small ordinal/categorical count rather than continuous; bin 3-4 together or one-hot encode.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 4
- min: 1
- max: 4
- mean: 1.524
- median: 1
- std: 0.9647
- q1: 1
- q3: 2
- iqr: 1
- skew: 1.719
- kurtosis: 1.551
- n_outliers: 39,251
- outlier_rate: 0.09695
- zero_rate: 0
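The outlier count for image_count_in_post can be reproduced from the histogram counts listed earlier in this report. The sketch assumes the profiler uses the conventional 1.5×IQR upper fence (not stated in the report, but consistent with the numbers) and a simple "first value whose cumulative count reaches q·n" quantile convention; `quantile_from_counts` is a hypothetical helper.

```python
def quantile_from_counts(counts, q):
    """Smallest value whose cumulative count reaches q * n (one simple convention)."""
    n = sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= q * n:
            return value
    raise ValueError("empty counts")

# Rows per image_count_in_post value, from the histogram in this report.
counts = {1: 291415, 2: 54031, 3: 20144, 4: 39251}

q1 = quantile_from_counts(counts, 0.25)
q3 = quantile_from_counts(counts, 0.75)
fence = q3 + 1.5 * (q3 - q1)

# Only values strictly above the fence are flagged.
flagged = sum(c for v, c in counts.items() if v > fence)
rate = flagged / sum(counts.values())
print(q1, q3, fence, flagged, round(rate, 5))
```

Under these assumptions the flagged rows are exactly the 4-image rows, which is why binning 3 and 4 together is a modelling choice rather than something the outlier flag implies.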
image_mime_type
categorical · metadata
The MIME type of the associated image, with 8 distinct values across 404,841 rows and no nulls. It is heavily dominated by image/jpeg at 93.5% (378,401 rows), with image/png a distant second (23,402 rows, 5.8%) and a long tail of webp, gif, avif, svg+xml and heic, plus 4 rows of application/octet-stream, which is not an image type at all. Entropy ratio is just 0.128, confirming very low diversity. Treatment: Collapse rare types into an 'other' bucket (and review the 4 application/octet-stream rows) before one-hot encoding.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 8
- top_value: image/jpeg
- top_rate: 0.9347
- cardinality: 8
- entropy: 0.3842
- entropy_ratio: 0.1281
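The entropy and entropy_ratio figures can be recomputed from the top-value counts. The sketch below assumes entropy is Shannon entropy in bits and that the ratio divides by log2 of the cardinality; that reading is an assumption, but it reproduces the reported values closely. `collapse_rare` is a hypothetical helper for the 'other' bucket, and the choice to keep only JPEG and PNG is illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

def collapse_rare(value, keep=("image/jpeg", "image/png")):
    """Map every MIME type outside `keep` to a single 'other' bucket."""
    return value if value in keep else "other"

# Counts from the image_mime_type and source_mode tables in this report.
mime_counts = [378401, 23402, 2906, 79, 34, 14, 4, 1]
mode_counts = [279196, 125645]

print(round(entropy_bits(mime_counts), 4), round(entropy_ratio(mime_counts), 4))
print(round(entropy_bits(mode_counts), 4))
print(collapse_rare("image/webp"))
```

For a two-category column such as source_mode the maximum entropy is 1 bit, which is why its entropy and entropy_ratio are reported as the same number.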
image_ref
text · foreign_key · one_word
IPFS CIDv1 content hashes (all 59 chars, all starting with 'bafkrei...', indicating raw codec + sha256), one token per row. With 375,188 unique values and a 7.3% duplicate rate, it functions as a content-addressed image pointer rather than a row identifier; one CID appears 5,227 times, suggesting a heavily reused asset. Length is rigidly fixed (min = max = 59); there are 19 nulls (0.0%) and no URLs or emojis. Treatment: Treat as an opaque content-address key; left-join to an image/asset table and do not tokenize.
- n: 404,841
- nulls: 19 (0.0%)
- unique: 375,188
- len_min: 59
- len_max: 59
- len_mean: 59
- len_median: 59
- len_p95: 59
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 29,634
- duplicate_rate: 0.0732
- vocab_size: 19,560
- readability_flesch_mean: -539.1
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
image_thumb_url
text · metadata · null_rate · one_word · url_heavy
Bluesky CDN thumbnail URLs (every non-null value is exactly 138 chars and a single token pointing at cdn.bsky.app/img/feed_thumbnail/plain/...). 31.04% of rows are null (125,645, the same as the 'jetstream' row count in source_mode) and 6.65% of non-null values are duplicates, with one thumbnail repeating 1,000 times, plausibly a reused image or repeated ingestion. With 260,636 unique URLs across 404,841 rows, it is effectively a per-image reference rather than a modelling feature. Treatment: Keep as a media reference for display or image fetching; drop from tabular modelling.
- n: 404,841
- nulls: 125,645 (31.0%)
- unique: 260,636
- len_min: 138
- len_max: 138
- len_mean: 138
- len_median: 138
- len_p95: 138
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 18,560
- duplicate_rate: 0.06648
- vocab_size: 19,666
- readability_flesch_mean: -1391
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
image_fullsize_url
text · metadata · null_rate · one_word · url_heavy
Bluesky CDN URLs pointing to full-size feed images, all exactly 137 characters long and consisting of a single token (url_rate 1.0, one_word_rate 1.0). 31.04% of rows are null. This does not mean those rows lack an image (image_ref is populated for all but 19 rows); the 125,645 nulls equal the 'jetstream' row count in source_mode, so the URL was most likely just not captured in that ingestion mode. 6.65% of non-null values are duplicates, and one image URL repeats 1,000 times, suggesting a reused image or repeated ingestion. With 260,636 unique values across 404,841 rows, the field is high-cardinality but not unique. Treatment: Treat as an image asset reference; fetch/hash for media features rather than tokenizing as text.
- n: 404,841
- nulls: 125,645 (31.0%)
- unique: 260,636
- len_min: 137
- len_max: 137
- len_mean: 137
- len_median: 137
- len_p95: 137
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 18,560
- duplicate_rate: 0.06648
- vocab_size: 19,666
- readability_flesch_mean: -1475
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
source_mode
categorical · feature
Binary categorical flag indicating ingestion source, with values 'author_feed' (69.0%) and 'jetstream' (31.0%) across 404,841 complete rows. The split is imbalanced but both classes are well-represented, and entropy_ratio of 0.89 confirms meaningful spread between the two modes. No nulls, suggesting the field is always populated at ingest time. Treatment: One-hot or boolean-encode for modelling; useful as a provenance stratifier.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 2
- top_value: author_feed
- top_rate: 0.6896
- cardinality: 2
- entropy: 0.8936
- entropy_ratio: 0.8936
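Since several columns report exactly 125,645 nulls, the same as the 'jetstream' row count, one quick use of source_mode as a provenance stratifier is to count nulls per mode. The sketch below assumes rows are available as dictionaries keyed by column name; `null_counts_by_mode` is a hypothetical helper and the three sample rows are invented for illustration.

```python
from collections import Counter

def null_counts_by_mode(rows, column):
    """Count rows where `column` is None, grouped by source_mode."""
    out = Counter()
    for row in rows:
        if row.get(column) is None:
            out[row["source_mode"]] += 1
    return out

# Invented rows that mimic the suspected pattern: nulls only under jetstream.
rows = [
    {"source_mode": "author_feed", "query": "example.bsky.social"},
    {"source_mode": "jetstream", "query": None},
    {"source_mode": "jetstream", "query": None},
]
print(null_counts_by_mode(rows, "query"))
```

If the real data shows nulls only under 'jetstream' for query, author_handle and the two image URL columns, those nulls are structural and should not be imputed.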
collected_at
text · timestamp · near_unique · one_word · allcaps
An ISO-8601 UTC timestamp column (every value is exactly 27 characters, one token, ending in Z) capturing when each of the 404,841 records was collected. All values are unique with zero nulls, so it functions as a per-row event time rather than a categorical feature. The sampled values cluster on 2026-04-15 and 2026-04-16, suggesting collection spans roughly a single day or two. Treatment: Parse to datetime and use for time-based splits or feature extraction rather than as a model input.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 404,841
- len_min: 27
- len_max: 27
- len_mean: 27
- len_median: 27
- len_p95: 27
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
query
categorical · foreign_key · null_rate
The column holds Bluesky-style handles (e.g., firefaerie81.bsky.social, lukesteuber.com), suggesting it captures the account being queried rather than a free-text search string. Cardinality is modest at 489 unique values with a high entropy ratio (0.859), so traffic is spread fairly evenly with no dominant target; the top handle accounts for just 2.45% of non-null rows. 31.04% of rows are null, and the 125,645 nulls equal the 'jetstream' row count in source_mode, so the gap is most likely structural (no per-account query exists in that mode) rather than missing data. Treatment: Keep the nulls as their own category or filter them before joining, and avoid imputing; treat as a high-cardinality categorical key, not free text.
- n: 404,841
- nulls: 125,645 (31.0%)
- unique: 489
- top_value: firefaerie81.bsky.social
- top_rate: 0.02446
- cardinality: 489
- entropy: 7.676
- entropy_ratio: 0.8593
cursor
text · metadata · duplicates · one_word · allcaps
Despite the 'text' classification, every value is a single ISO-8601 UTC timestamp (length 16-24, one_word_rate 1.0, allcaps_rate 1.0 from the uppercase T and trailing Z), suggesting this is a pagination cursor keyed on a record timestamp. Across 404,841 rows there are only 105,919 distinct values and a 70.6% duplicate rate among non-null values, with the top cursors repeating 200+ times each, consistent with many records sharing the same paging anchor. Null rate is 10.97%, likely the first or last page where no cursor applies. Treatment: Parse as timestamp if needed for ordering, otherwise drop; it is API pagination state, not a feature.
- n: 404,841
- nulls: 44,399 (11.0%)
- unique: 105,919
- len_min: 16
- len_max: 24
- len_mean: 20.98
- len_median: 24
- len_p95: 24
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 254,523
- duplicate_rate: 0.7061
- vocab_size: 9,220
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
raw_record_json
text · free_text · multilingual
The raw JSON payload of Bluesky posts (`app.bsky.feed.post` records with embeds, replies, langs, and text), one serialized object per row. With 404,841 rows, 335,545 distinct values and a 17.1% duplicate rate (69,296 repeats), the same post JSON recurs. The counts match post_cid exactly, so the repeats most likely come from multi-image posts occupying one row per image, possibly with some ingestion replays on top. Mean length is 1,184 chars (max 65,764), Flesch readability is -250 because the JSON scaffolding dominates over prose, and the language sample is 432 English vs only a handful of de/fi/fr/is/ja/pt despite the multilingual alert. URL-bearing rows reach 24.7% and emoji rows 23.1%, both reflecting URLs and emoji embedded inside alt text and post bodies. Treatment: Parse as JSON and project the fields you need (text, langs, embed) before any text modelling; do not feed the raw blob to a tokenizer.
- n: 404,841
- nulls: 0 (0.0%)
- unique: 335,545
- len_min: 287
- len_max: 65,764
- len_mean: 1184
- len_median: 967
- len_p95: 2,504
- word_mean: 67.63
- word_median: 49
- n_empty: 0
- n_duplicates: 69,296
- duplicate_rate: 0.1712
- vocab_size: 117,811
- readability_flesch_mean: -250
- emoji_rate: 0.2307
- url_rate: 0.2473
- one_word_rate: 0.00786
- allcaps_rate: 0
- boilerplate_rate: 0
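Projecting fields out of raw_record_json might look like the sketch below. It assumes the usual `app.bsky.feed.post` shape, with `text`, `langs`, `createdAt`, and an `embed` whose `images` list carries an `alt` per image (or nests it under `media` for record-with-media embeds). These field names should be verified against the actual payloads; `project_post` is a hypothetical helper and the sample record is invented.

```python
import json

def project_post(raw_record_json):
    """Extract text, langs, createdAt and per-image alt text from a raw post record."""
    record = json.loads(raw_record_json)
    embed = record.get("embed") or {}
    # Assumed layout: images sit directly on the embed, or under embed["media"].
    images = embed.get("images")
    if images is None:
        images = (embed.get("media") or {}).get("images")
    images = images or []
    return {
        "text": record.get("text", ""),
        "langs": record.get("langs") or [],
        "created_at": record.get("createdAt"),
        "alts": [img.get("alt", "") for img in images],
    }

# Invented record for illustration only.
sample = json.dumps({
    "$type": "app.bsky.feed.post",
    "text": "Morning walk",
    "langs": ["en"],
    "createdAt": "2025-08-02T00:50:34.108Z",
    "embed": {
        "$type": "app.bsky.embed.images",
        "images": [{"alt": "A foggy path through trees"}, {"alt": ""}],
    },
})
print(project_post(sample))
```

Projecting once per distinct post_cid rather than once per row avoids re-parsing the same JSON for every image of a multi-image post.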