This is an anonymized Bluesky firehose snapshot of 101,040 posts from late December 2025, with 19 columns covering hashed identifiers, post text, timestamps, embed metadata, language, and sentiment. The content is heavily multilingual: English dominates at roughly 61% of posts, but Japanese, Korean, German, and Portuguese also have a meaningful presence, alongside an 'unknown' bucket (11,481 rows) worth investigating. Sentiment skews neutral (~48%) with positive outweighing negative roughly 2:1, and post length is right-skewed (median 68 chars, max 525). Media and link features are sparse: about 18% of posts carry a link, 14% have images, and only 1.3% have video. The embed_type field is null for ~61% of rows, which is the first flag to check, although it is consistent with most posts simply carrying no embed. Reply hashes are null for ~58% of rows, suggesting most posts are top-level rather than replies.
saturn
/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv 101,040 rows sample n=101,040 seed 42 2026-05-01T23:26:26+00:00
Overview
| Source | /home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv |
| Total rows | 101,040 |
| Profiled sample | 101,040 |
| Columns | 19 |
| Generated | 2026-05-01T23:26:26+00:00 |
Insights opt-in
Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.
Short user-generated posts (likely social media, given bsky.app URLs and hashtag-heavy entries), averaging 14 words with a median of 68 characters. The corpus is predominantly English (3,309 strings, apparently counted on the language-detection sample rather than on all rows) but spans 30 detected languages, with notable Japanese (656) and Korean (125) presence; 18.3% of rows contain emoji and 16.9% are flagged all-caps. Watch the 5,105 duplicates (5.05%) and the top entries, which are dominated by emoji spam (224 sheep emojis) and repeated promotional Thai text.
This column holds 16-character hexadecimal hashes of author DIDs, with every value exactly 16 chars long and a single token (one_word_rate 1.0). Across 101,040 rows there are only 43,998 unique authors and a 56.5% duplicate rate, with the top author hash appearing 1,016 times, so a small number of authors generate a disproportionate share of records.
This column holds 16-character single-token strings that look like truncated hex digests of URIs, with 101,039 unique values across 101,040 rows and zero nulls. It is close to a primary key but not strictly one: exactly one value occurs twice, and duplicate_rate only rounds to 0.0. Every entry is exactly 16 characters long, and no textual signal is present beyond the hash itself.
This column is a 16-character hexadecimal hash pointing at a parent message, populated only when a row is a reply — 57.67% of the 101,040 values are null. Every non-null entry is exactly 16 chars and one token, with 34,738 unique hashes and an 18.78% duplicate rate (8,032 duplicates), indicating popular parents that attract many replies (top hash 04a1db17fbc9ff3a appears 121 times). The flesch_mean of 71.73 is meaningless here since the strings are opaque IDs.
This column holds 16-character hexadecimal hashes identifying the root of a reply thread, with every value exactly 16 chars and a single token. 57.67% of rows are null (consistent with non-reply posts being root-less), and among the 42.33% populated rows there are 21,277 unique hashes with a 50.25% duplicate rate, indicating many replies share common thread roots.
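The duplicate rates quoted for the hash columns can be cross-checked from the unique counts. The sketch below assumes that saturn defines duplicate_rate as (non-null values minus unique values) divided by non-null values; the report does not state the formula, and the `duplicate_rate` helper is a hypothetical name. It also assumes author_did_hash has no nulls and that both reply columns share the same non-null count, since both report a 57.67% null rate.

```python
def duplicate_rate(n_non_null: int, n_unique: int) -> float:
    """Share of non-null values that repeat an earlier value (assumed definition)."""
    if n_non_null == 0:
        return 0.0
    return (n_non_null - n_unique) / n_non_null


TOTAL_ROWS = 101_040

# author_did_hash: 43,998 unique authors, assumed to have no nulls.
author_rate = duplicate_rate(TOTAL_ROWS, 43_998)

# reply_parent_hash: 34,738 unique hashes plus 8,032 duplicates = 42,770 non-null values.
reply_non_null = 34_738 + 8_032
parent_rate = duplicate_rate(reply_non_null, 34_738)

# reply_root_hash: 21,277 unique hashes over the same (assumed) non-null count.
root_rate = duplicate_rate(reply_non_null, 21_277)

print(f"author {author_rate:.1%}, parent {parent_rate:.2%}, root {root_rate:.2%}")
```

Under these assumptions the three rates agree with the figures quoted above to the reported precision, which suggests the rates are computed over non-null values only.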
This is a three-class sentiment label with values neutral, positive, and negative across 101,040 rows and no nulls. The distribution is uneven: neutral dominates at 48.5% (48,981 rows, top_rate 0.4848), positive accounts for 34,622 (34.3%), and negative trails at 17,437 (17.3%), roughly half the positive count. An entropy ratio of 0.93 still indicates a fairly balanced spread despite the negative class being underrepresented.
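The entropy ratio can be reproduced from the three class counts. This is a minimal sketch assuming the ratio is Shannon entropy divided by the maximum entropy for the same number of classes (the log of the class count); the report does not define the metric, and `entropy_ratio` is a hypothetical helper name.

```python
import math


def entropy_ratio(counts):
    """Shannon entropy of the counts, normalised by the maximum for that many classes."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) < 2:
        return 0.0  # a single class carries no uncertainty
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


# Class counts as listed in this report.
sentiment_counts = {"neutral": 48_981, "positive": 34_622, "negative": 17_437}
ratio = entropy_ratio(list(sentiment_counts.values()))
print(round(ratio, 2))
```

A ratio of 1.0 would mean three equally sized classes. Under the assumed definition the value for these counts rounds to the 0.93 quoted above.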
Continuous sentiment polarity bounded in [-0.998, 1.0], consistent with a score such as VADER's compound value or TextBlob's polarity. Distribution is roughly symmetric (skew 0.019, kurtosis 0.018) but dominated by a neutral spike: 47.8% of rows are exactly 0 and the median is 0, pulling Q1 to 0 as well. Despite the symmetry, 5,763 rows (5.7%) are flagged as outliers, suggesting tails of strongly polarised text alongside the neutral mass, though with kurtosis near zero the flags may partly reflect an interquartile range compressed by the zero spike.
This is a creation timestamp stored as ISO-8601 strings rather than a native datetime, with lengths from 20 to 35 characters and a single token per value. The format is inconsistent: values mix `+00:00` offsets with `Z` suffixes, and millisecond fractions (`.199Z`) with microsecond ones (`.000000Z`), so naive lexical comparison or sorting can misorder or fail to match equal instants. Cardinality is high (96,576 unique of 101,040) but 4,464 duplicates (4.4%) suggest concurrent events sharing timestamps.
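One way to make these values comparable is to parse them into timezone-aware datetimes before sorting or filtering. The sketch below is an assumption-laden example rather than part of saturn: `parse_created_at` is a hypothetical helper, and it only accepts the shapes visible in the samples (optional fractional seconds of any length, then `Z` or a `+HH:MM` offset). Any other variant among the 35-character values would raise `ValueError`.

```python
import re
from datetime import datetime, timezone

# Date-time, optional fractional seconds, then either 'Z' or a +HH:MM / -HH:MM offset.
_ISO = re.compile(
    r"^(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"(?:\.(?P<frac>\d+))?"
    r"(?P<tz>Z|[+-]\d{2}:\d{2})$"
)


def parse_created_at(value: str) -> datetime:
    """Parse one created_at string into a UTC datetime (microsecond resolution)."""
    match = _ISO.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    # Pad or truncate the fraction to exactly six digits so fromisoformat accepts it.
    frac = (match.group("frac") or "0")[:6].ljust(6, "0")
    tz = "+00:00" if match.group("tz") == "Z" else match.group("tz")
    parsed = datetime.fromisoformat(f"{match.group('base')}.{frac}{tz}")
    return parsed.astimezone(timezone.utc)


a = "2025-12-24T06:00:49.556+00:00"  # sample value from this report
b = "2025-12-24T06:00:49.556Z"       # hypothetical: the same instant with a Z suffix
print(a == b, parse_created_at(a) == parse_created_at(b))
```

The two strings differ as text but parse to the same instant, which is the failure mode that lexical comparison hides.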
This column holds ISO-8601 timestamps with microsecond precision, stored as text rather than a native datetime type. All 101,040 values are unique with zero nulls and a fixed length of 26 characters, which leaves no room for a `Z` suffix or UTC offset, so the time zone is unstated. The sampled values cluster on 2025-12-23 and 2025-12-24. The text-profile alerts (near_unique, one_word, allcaps) are artefacts of the string representation, not genuine quality issues.
Language code of the record, dominated by English ('en' at 60.8% of 101,040 rows) with Japanese ('ja') a distant second at 12,607. Notably, 11,481 rows are literally 'unknown' and the codes mix bare ISO ('en') with locale variants ('en-US', 3,617), so the 90 distinct values overstate true language diversity. No nulls, but entropy ratio of 0.34 confirms heavy concentration at the top.
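Collapsing locale variants onto their primary language subtag gives a fairer view of language diversity. The following is a minimal sketch: `primary_language` is a hypothetical helper, the counts are copied from the top-values list in this report (a subset, not all 90 codes), and keeping 'unknown' as its own bucket is an assumption about how it should be treated.

```python
from collections import Counter


def primary_language(code: str) -> str:
    """Keep only the primary subtag in lower case, e.g. 'en-US' -> 'en'."""
    code = code.strip()
    if not code or code.lower() == "unknown":
        return "unknown"
    return code.replace("_", "-").split("-", 1)[0].lower()


# Counts taken from the report's top-values list for the language column.
top_counts = {"en": 61_468, "ja": 12_607, "unknown": 11_481, "en-US": 3_617, "ja-JP": 170}

merged = Counter()
for code, n in top_counts.items():
    merged[primary_language(code)] += n

print(merged.most_common())
```

On this subset the five raw codes collapse to three buckets, with the locale variants folded into 'en' and 'ja'.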
This is almost certainly the character count of the post text (its median of 68 matches the median length reported for that column), with all 101,040 rows populated and only 341 distinct integer values between 1 and 525. The distribution is right-skewed (skew 1.02), with the median of 68 well below the mean of 97.6 and an interquartile range of 30 to 143, indicating most texts are short but a tail stretches out. Outliers are minimal (0.29%) and there are no zero-length entries.
This is a numeric feature counting words per record, ranging from 0 to 83 with a median of 10 and mean of 14.67. The distribution is right-skewed (skew 1.21) with an IQR of 19 and about 2.85% of values flagged as outliers, suggesting a long tail of unusually long entries. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) indicate counts cluster in a narrow integer range.
Binary flag indicating whether a record has images, encoded as 0/1 with only 2 unique values across 101,040 rows and no nulls. The positive class is rare at mean 0.136 (zero_rate 0.864), which is why the profiler flags high skew (2.12) and treats the 13,768 ones as 'outliers'.
This is a binary flag indicating whether a record has a video, stored numerically with only 2 unique values (0 and 1) and no nulls across 101,040 rows. The positive class is rare: 98.67% are zero and only 1.33% are ones, producing the flagged high skew (8.50) and extreme kurtosis (70.19).
This is a binary 0/1 flag indicating whether a record contains a link, with no nulls across 101,040 rows. About 17.95% are 1s and 82.05% are 0s, which is why the IQR is 0 and the minority class gets labelled as 18,140 'outliers' — that's a quirk of applying numeric outlier logic to a Bernoulli variable, not a data issue.
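The 'outlier' counts on the three binary flags can be reproduced with a standard interquartile-range rule. The sketch below assumes Tukey fences at 1.5 times the IQR, which is a common default but is not confirmed as saturn's method; `iqr_outlier_count` is a hypothetical helper.

```python
import statistics


def iqr_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences, assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)


# Rebuild has_link from the counts in this report: 18,140 ones out of 101,040 rows.
TOTAL_ROWS = 101_040
has_link = [1] * 18_140 + [0] * (TOTAL_ROWS - 18_140)

print(iqr_outlier_count(has_link))
```

Because more than 75% of the values are 0, both quartiles are 0, the fences collapse to a single point, and every 1 falls outside them. For flags like these the class share is the more useful number to read.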
This column tags the embed type attached to Bluesky posts, with five AT Protocol values like app.bsky.embed.external and app.bsky.embed.images. Note the 61.15% null rate — most posts carry no embed at all — and among posts that do, external links dominate at 46.22%, with video and recordWithMedia being rare tails. Entropy ratio of 0.74 shows reasonable spread across the five categories when present.
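The null rate and the within-embed shares follow directly from the five value counts listed for embed_type in this report. A short sketch of that arithmetic, with hypothetical variable names:

```python
TOTAL_ROWS = 101_040

# Value counts for embed_type as listed in this report.
embed_counts = {
    "app.bsky.embed.external": 18_140,
    "app.bsky.embed.images": 13_768,
    "app.bsky.embed.record": 5_126,
    "app.bsky.embed.video": 1_344,
    "app.bsky.embed.recordWithMedia": 871,
}

non_null = sum(embed_counts.values())
null_rate = 1 - non_null / TOTAL_ROWS
shares = {name: count / non_null for name, count in embed_counts.items()}

print(f"null rate {null_rate:.2%}")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The external and images counts (18,140 and 13,768) equal the number of 1s reported for has_link and has_images, which suggests, without confirming, that those flags were derived from embed_type.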
This column stores per-row hashtag lists serialised as JSON arrays, with `[]` (empty list) appearing 87,318 times — the dominant value in a 101,040-row column. Duplicates are extreme (duplicate_rate 0.90, n_unique 10,103) and 90% of values are effectively one token, consistent with most rows having zero or one hashtag (word_median 1, len_median 2). The non-empty examples mix scripts (Thai, Japanese, German, English) including bot-like patterns such as `#NowPlaying` and repeated `#dpdi`, so any text treatment must be multilingual.
This column stores serialized JSON arrays of mentioned entity IDs (16-hex tokens), but the overwhelming majority are empty: '[]' appears 98,659 of 101,040 times and the duplicate rate is 0.981. When non-empty, arrays almost always hold a single ID (word_mean 1.01, len_p95 2.0), though len_max reaches 420 indicating occasional multi-mention rows. Vocabulary is small (660) across 1,921 unique values, so signal is sparse and ID-like rather than textual.
This column holds JSON-encoded arrays of URLs, but 96,211 of 101,040 rows are the empty array '[]', driving a 96.3% duplicate rate and a median length of 2 characters. Only 3,771 unique values exist across 101,040 rows, and just 4.8% of rows actually contain a URL, well below the 17.95% flagged by has_link, so the two fields apparently do not measure the same thing. The non-empty entries point to streaming radio and news endpoints (BBC, Radio France, shoutcast streams).
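Since hashtags, mentions, and links are all stored as JSON array strings, they need decoding before any counting. The sketch below uses a hypothetical `parse_list_cell` helper; treating blank or malformed cells as empty lists is an assumption, not documented behaviour of the dataset.

```python
import json


def parse_list_cell(cell):
    """Decode one JSON-array cell; blank, malformed, or non-list values become []."""
    if cell is None or not cell.strip():
        return []
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        return []
    return value if isinstance(value, list) else []


# A few hashtag cells as they appear in the sample values of this report.
sample_hashtags = ["[]", '["#strongertogether"]', "[]", '["#dandysworld"]']

parsed = [parse_list_cell(cell) for cell in sample_hashtags]
non_empty_rate = sum(1 for tags in parsed if tags) / len(parsed)

print(parsed, non_empty_rate)
```

Once decoded, per-row list lengths and per-tag frequencies are more informative than the string-level statistics above, which mostly describe the literal '[]'.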
Numeric correlation
Languages detected
Per-string language detection across text columns (sampled).
text text
Sample values (first 10)
- Un client arrêté après avoir poignardé son livreur de repas "Un livreur de repas a été grièvement blessé lors d’une tentative de meurtre dans la nuit de dimanche à lundi, à Bülach. Le..." https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreu…
- Fındığım bu giboyla ilgili itiraf etmek istediğin bir şey varsa tam vakti 😅 Dm kutuma gel anlat,söz bende kalacak anlattıkların 😅😅
- -#strongertogether
- Tu b’Shevat is approaching rapidly. Nit to put any pressure on you, but…
- Someone mentioned it in my timeline, and so I just rewatched "The Blue Carbuncle", from The Adventures of Sherlock Holmes (1984) with Jeremy Brett. It's a great Christmas story and Jeremy Brett is without question the best Holmes ever. I found it on Britbox
- https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46 【新着記事】 宇宙ステーションは「組み立てる」時代から「一発で広げる」時代へ?
- Happy Christmas Eve, sweet Flanoy! 🎄🧡🐈
- THESE FUCKERS SERIOUSLY COULDN’T WAIT ONE DAY!? /vneg #dandysworld
- 어딜 가지.....
- Made in hckr.fr 🏴☠️🖤 Le genre de petit message qui me fait chaud au cœur.
author_did_hash text
Sample values (first 10)
- 203b2f94ca34ad57
- e3fb7462b68ce168
- 8b80d746cd58f608
- 74e2cbc89edd37a6
- ed4f29630f55ae1d
- 039de54e8bef8899
- 61ee7267320497ee
- b464fb16192641fa
- 7422e82a369d2ace
- 8dc83ec255dde07a
uri_hash text
Sample values (first 10)
- 1a925eae4a68e954
- 9c7b35c448e9f56a
- 00000da0008897eb
- cd823fdacd02b11c
- 0a525ed50b0474f2
- 175ab73228973fa3
- 60f3a1a69409b7ae
- e9e0e481dbe7f266
- d49a9bf37ba42904
- ef51be69a5ee76f8
reply_parent_hash text
Sample values (first 10)
- fc2267f29dd1a492
- 6b56ce9644d8dcfc
- 701912916dd3aecb
- f16b66c1507d3da9
- 63ea68b3eabeb6c5
- 2e341c64d79713f6
- bfbbd6900834f900
- dd990c5f31cc4ea6
- 3b2f41bfb941204a
- a5ba750d7bf30263
reply_root_hash text
Sample values (first 10)
- fc2267f29dd1a492
- 6b56ce9644d8dcfc
- 701912916dd3aecb
- f16b66c1507d3da9
- 63ea68b3eabeb6c5
- 2e341c64d79713f6
- dc0cf00aab42248a
- 65f573012a42f37a
- 152ff36a17b9ab54
- 2da15f2e55e9a171
sentiment categorical
Top values (rank 1–20)
- neutral — 48,981
- positive — 34,622
- negative — 17,437
sentiment_score numeric
created_at text
Sample values (first 10)
- 2025-12-15T15:02:56.000000Z
- 2025-12-24T05:46:28.199Z
- 2025-12-24T05:53:52.540Z
- 2025-12-24T05:51:06.770Z
- 2025-12-24T05:24:05.186Z
- 2025-12-24T06:00:49.556+00:00
- 2025-12-24T05:25:08.535Z
- 2025-12-24T05:56:08.507Z
- 2025-12-24T05:51:12.695Z
- 2025-12-24T05:00:11.869Z
timestamp text
Sample values (first 10)
- 2025-12-23T23:35:20.812256
- 2025-12-23T23:46:29.113253
- 2025-12-23T23:53:52.818216
- 2025-12-23T23:51:06.721420
- 2025-12-23T23:24:08.130284
- 2025-12-24T00:00:49.619695
- 2025-12-23T23:25:09.314207
- 2025-12-23T23:56:13.728686
- 2025-12-23T23:51:13.117978
- 2025-12-23T23:00:12.134916
language categorical
Top values (rank 1–20)
- en — 61,468
- ja — 12,607
- unknown — 11,481
- en-US — 3,617
- ko — 2,406
- de — 1,821
- pt — 1,295
- es — 1,153
- fr — 746
- th — 612
- tr — 548
- nl — 525
- zh — 315
- it — 276
- ru — 213
- fi — 193
- ja-JP — 170
- id — 158
- pl — 139
- el — 116
char_count numeric
word_count numeric
has_images numeric
has_video numeric
has_link numeric
embed_type categorical
Top values (rank 1–20)
- app.bsky.embed.external — 18,140
- app.bsky.embed.images — 13,768
- app.bsky.embed.record — 5,126
- app.bsky.embed.video — 1,344
- app.bsky.embed.recordWithMedia — 871
hashtags text
Sample values (first 10)
- []
- []
- ["#strongertogether"]
- []
- []
- []
- []
- ["#dandysworld"]
- []
- []
mentions text
Sample values (first 10)
- []
- []
- []
- []
- []
- []
- []
- []
- []
- []
links text
Sample values (first 10)
- ["https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreur-de-repas-103470448"]
- []
- []
- []
- []
- ["https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46"]
- []
- []
- []
- []