bsky firehose anonymized dec 2025 bluesky posts
Reading
This is an anonymized Bluesky firehose snapshot of 101,040 posts from late December 2025, with 19 columns covering hashed identifiers, post text, timestamps, embed metadata, language, and sentiment. The content is heavily multilingual: English dominates at roughly 61% of posts ('en' alone; about 64% once 'en-US' is folded in), followed by Japanese (12.5%), Korean (2.4%), German (1.8%), and Portuguese (1.3%), alongside an 'unknown' bucket (11.4%) worth investigating. Sentiment skews neutral (~48%) with positive outweighing negative roughly 2:1, and post length is right-skewed (median 68 chars, max 525). Embed and media features are sparse: about 18% of posts carry a link, 14% have images, and only 1.3% have video. The embed_type field is null for ~61% of rows, which is the first flag to check, although those nulls most likely mean "no embed" rather than missing data. Reply hashes are null for ~58% of rows, suggesting most posts are top-level rather than replies.
citing: row_count · column_count · language · sentiment · char_count · has_link · has_images · has_video · embed_type · reply_root_hash · text
Charts the summary said to look at first
language (top 20 of 90 values)
| value | count | share |
|---|---|---|
| en | 61468 | 60.8% |
| ja | 12607 | 12.5% |
| unknown | 11481 | 11.4% |
| en-US | 3617 | 3.6% |
| ko | 2406 | 2.4% |
| de | 1821 | 1.8% |
| pt | 1295 | 1.3% |
| es | 1153 | 1.1% |
| fr | 746 | 0.7% |
| th | 612 | 0.6% |
| tr | 548 | 0.5% |
| nl | 525 | 0.5% |
| zh | 315 | 0.3% |
| it | 276 | 0.3% |
| ru | 213 | 0.2% |
| fi | 193 | 0.2% |
| ja-JP | 170 | 0.2% |
| id | 158 | 0.2% |
| pl | 139 | 0.1% |
| el | 116 | 0.1% |
sentiment
| value | count | share |
|---|---|---|
| neutral | 48981 | 48.5% |
| positive | 34622 | 34.3% |
| negative | 17437 | 17.3% |
char_count (histogram, 40 bins)
| bin | count |
|---|---|
| 1 – 14.1 | 11042 |
| 14.1 – 27.2 | 12354 |
| 27.2 – 40.3 | 11078 |
| 40.3 – 53.4 | 8177 |
| 53.4 – 66.5 | 7049 |
| 66.5 – 79.6 | 6309 |
| 79.6 – 92.7 | 5146 |
| 92.7 – 105.8 | 4495 |
| 105.8 – 118.9 | 3865 |
| 118.9 – 132 | 3459 |
| 132 – 145.1 | 3244 |
| 145.1 – 158.2 | 2592 |
| 158.2 – 171.3 | 2219 |
| 171.3 – 184.4 | 2065 |
| 184.4 – 197.5 | 1867 |
| 197.5 – 210.6 | 1708 |
| 210.6 – 223.7 | 1624 |
| 223.7 – 236.8 | 1467 |
| 236.8 – 249.9 | 1439 |
| 249.9 – 263 | 1439 |
| 263 – 276.1 | 1563 |
| 276.1 – 289.2 | 1657 |
| 289.2 – 302.3 | 4862 |
| 302.3 – 315.4 | 33 |
| 315.4 – 328.5 | 261 |
| 328.5 – 341.6 | 3 |
| 341.6 – 354.7 | 5 |
| 354.7 – 367.8 | 1 |
| 367.8 – 380.9 | 2 |
| 380.9 – 394 | 9 |
| 394 – 407.1 | 3 |
| 407.1 – 420.2 | 0 |
| 420.2 – 433.3 | 1 |
| 433.3 – 446.4 | 1 |
| 446.4 – 459.5 | 0 |
| 459.5 – 472.6 | 0 |
| 472.6 – 485.7 | 0 |
| 485.7 – 498.8 | 0 |
| 498.8 – 511.9 | 0 |
| 511.9 – 525 | 1 |
embed_type (non-null values; share is of all rows)
| value | count | share |
|---|---|---|
| app.bsky.embed.external | 18140 | 18.0% |
| app.bsky.embed.images | 13768 | 13.6% |
| app.bsky.embed.record | 5126 | 5.1% |
| app.bsky.embed.video | 1344 | 1.3% |
| app.bsky.embed.recordWithMedia | 871 | 0.9% |
sentiment_score (histogram, 40 bins)
| bin | count |
|---|---|
| -0.998 – -0.948 | 247 |
| -0.948 – -0.8981 | 541 |
| -0.8981 – -0.8481 | 705 |
| -0.8481 – -0.7982 | 822 |
| -0.7982 – -0.7482 | 812 |
| -0.7482 – -0.6983 | 874 |
| -0.6983 – -0.6483 | 1081 |
| -0.6483 – -0.5984 | 1067 |
| -0.5984 – -0.5484 | 1130 |
| -0.5484 – -0.4985 | 1234 |
| -0.4985 – -0.4486 | 1387 |
| -0.4486 – -0.3986 | 1127 |
| -0.3986 – -0.3487 | 895 |
| -0.3487 – -0.2987 | 1120 |
| -0.2987 – -0.2488 | 1585 |
| -0.2488 – -0.1988 | 792 |
| -0.1988 – -0.1488 | 726 |
| -0.1488 – -0.0989 | 672 |
| -0.0989 – -0.04895 | 624 |
| -0.04895 – 0.001 | 48620 |
| 0.001 – 0.05095 | 358 |
| 0.05095 – 0.1009 | 740 |
| 0.1009 – 0.1508 | 623 |
| 0.1508 – 0.2008 | 781 |
| 0.2008 – 0.2508 | 1426 |
| 0.2508 – 0.3007 | 1343 |
| 0.3007 – 0.3507 | 1581 |
| 0.3507 – 0.4006 | 2179 |
| 0.4006 – 0.4506 | 3463 |
| 0.4506 – 0.5005 | 2801 |
| 0.5005 – 0.5505 | 1918 |
| 0.5505 – 0.6004 | 2382 |
| 0.6004 – 0.6503 | 2518 |
| 0.6503 – 0.7003 | 2004 |
| 0.7003 – 0.7503 | 1886 |
| 0.7503 – 0.8002 | 1870 |
| 0.8002 – 0.8501 | 2266 |
| 0.8501 – 0.9001 | 2011 |
| 0.9001 – 0.9501 | 1758 |
| 0.9501 – 1 | 1071 |
Schema
19 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| text | text | 0.0% | 95,935 | multilingual, allcaps |
| author_did_hash | text | 0.0% | 43,998 | one_word, short_text, duplicates |
| uri_hash | text | 0.0% | 101,039 | near_unique, one_word, short_text |
| reply_parent_hash | text | 57.7% | 34,738 | one_word, null_rate, short_text |
| reply_root_hash | text | 57.7% | 21,277 | one_word, null_rate, short_text, duplicates |
| sentiment | categorical | 0.0% | 3 | |
| sentiment_score | numeric | 0.0% | 1,928 | outliers |
| created_at | text | 0.0% | 96,576 | near_unique, one_word, allcaps |
| timestamp | text | 0.0% | 101,040 | near_unique, one_word, allcaps |
| language | categorical | 0.0% | 90 | |
| char_count | numeric | 0.0% | 341 | |
| word_count | numeric | 0.0% | 79 | |
| has_images | numeric | 0.0% | 2 | high_skew, outliers |
| has_video | numeric | 0.0% | 2 | high_skew |
| has_link | numeric | 0.0% | 2 | outliers |
| embed_type | categorical | 61.2% | 5 | null_rate |
| hashtags | text | 0.0% | 10,103 | one_word, duplicates |
| mentions | text | 0.0% | 1,921 | one_word, short_text, duplicates |
| links | text | 0.0% | 3,771 | one_word, short_text, duplicates |
text
text free_text multilingual allcaps
Short user-generated social media posts (bsky.app URLs and hashtag-heavy entries are common), averaging 14 words with a median of 68 characters. The language detector reports a predominantly English corpus (3309) spanning 30 detected languages, with notable Japanese (656) and Korean (125) presence; these counts are far below the language column's totals (61,468 'en', 12,607 'ja'), so they presumably come from a sample rather than all 101,040 rows. 18.3% of rows contain emoji and 16.9% are flagged all-caps. Watch the 5,105 duplicates (5.05%) and the top entries dominated by emoji spam (224 sheep emojis) and repeated promotional Thai text. Treatment: Deduplicate, normalize casing/emoji, and apply a multilingual tokenizer or embedding model before downstream NLP.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 95,935
- len_min
- 1
- len_max
- 525
- len_mean
- 97.63
- len_median
- 68
- len_p95
- 290
- word_mean
- 14.23
- word_median
- 10
- n_empty
- 0
- n_duplicates
- 5,105
- duplicate_rate
- 0.05052
- vocab_size
- 77,183
- readability_flesch_mean
- 64.09
- emoji_rate
- 0.1832
- url_rate
- 0.07586
- one_word_rate
- 0.1899
- allcaps_rate
- 0.1691
- boilerplate_rate
- 0.001049
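The deduplication and casing treatment could look roughly like the sketch below. It uses only the standard library; `normalize_text`, `dedupe`, and the sample rows are hypothetical, and the dict keys assume the column names from the schema. Matching on normalized text is looser than an exact-match duplicate count such as the 5,105 reported here, so it would drop more rows.

```python
import unicodedata


def normalize_text(s: str) -> str:
    """NFKC-normalize, case-fold, and collapse whitespace for duplicate detection."""
    s = unicodedata.normalize("NFKC", s)
    return " ".join(s.casefold().split())


def dedupe(posts):
    """Keep the first post per normalized text; return (kept_posts, n_dropped)."""
    seen = set()
    kept = []
    for post in posts:
        key = normalize_text(post["text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(post)
    return kept, len(posts) - len(kept)


# Hypothetical rows, not taken from the dataset.
sample = [
    {"uri_hash": "a" * 16, "text": "GOOD MORNING  Bluesky"},
    {"uri_hash": "b" * 16, "text": "good morning bluesky"},
    {"uri_hash": "c" * 16, "text": "おはよう"},
]
kept, n_dropped = dedupe(sample)
```

Keeping the first occurrence is an arbitrary choice; keeping the earliest by timestamp would be an equally reasonable rule.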
uri_hash
text identifier near_unique one_word short_text
This column holds 16-character single-token strings that look like truncated hex digests of URIs, with 101,039 unique values across 101,040 rows. One row is null, which accounts for the gap between the unique count and the row count. Among non-null values the duplicate_rate is 0 and every entry is exactly 16 characters long, so the column is effectively a primary key. No textual signal is present beyond the hash itself. Treatment: Use as a row key or join key after handling the single null row; drop before modelling.
- n
- 101,040
- nulls
- 1 (0.0%)
- unique
- 101,039
- len_min
- 16
- len_max
- 16
- len_mean
- 16
- len_median
- 16
- len_p95
- 16
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 20,000
- readability_flesch_mean
- 69.61
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.0005245
- boilerplate_rate
- 0
reply_parent_hash
text foreign_key one_word null_rate short_text
This column is a 16-character hexadecimal hash pointing at a parent message, populated only when a row is a reply: 57.67% of the 101,040 values are null. Every non-null entry is exactly 16 chars and one token, with 34,738 unique hashes and an 18.78% duplicate rate (8,032 duplicates), indicating popular parents that attract many replies (top hash 04a1db17fbc9ff3a appears 121 times). The flesch_mean of 71.73 is meaningless here since the strings are opaque IDs. Treatment: Treat as a self-referential foreign key to the message hash; use for thread reconstruction and leave nulls as 'not a reply'.
- n
- 101,040
- nulls
- 58,270 (57.7%)
- unique
- 34,738
- len_min
- 16
- len_max
- 16
- len_mean
- 16
- len_median
- 16
- len_p95
- 16
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 8,032
- duplicate_rate
- 0.1878
- vocab_size
- 17,415
- readability_flesch_mean
- 71.73
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.0007949
- boilerplate_rate
- 0
reply_root_hash
text foreign_key one_word null_rate short_text duplicates
This column holds 16-character hexadecimal hashes identifying the root of a reply thread, with every value exactly 16 chars and a single token. 57.67% of rows are null (consistent with non-reply posts being root-less), and among the 42.33% populated rows there are 21,277 unique hashes with a 50.25% duplicate rate, indicating many replies share common thread roots. Treatment: Left-join on this id to retrieve the root post; do not feature-engineer the hash itself.
- n
- 101,040
- nulls
- 58,270 (57.7%)
- unique
- 21,277
- len_min
- 16
- len_max
- 16
- len_mean
- 16
- len_median
- 16
- len_p95
- 16
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 21,493
- duplicate_rate
- 0.5025
- vocab_size
- 12,498
- readability_flesch_mean
- 77.23
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.0008183
- boilerplate_rate
- 0
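Thread reconstruction from the reply hashes could be sketched as follows. This assumes that `reply_root_hash` refers to values of `uri_hash`, which the profile implies but does not confirm; `build_threads` and the sample rows are hypothetical. The lookup behaves like a left join, so a root that falls outside the snapshot comes back as `None`.

```python
from collections import defaultdict


def build_threads(posts):
    """Group replies by root hash and attach the root post when it is present."""
    # Index posts by their own hash, skipping any row with a null uri_hash.
    by_uri = {p["uri_hash"]: p for p in posts if p["uri_hash"] is not None}
    replies_by_root = defaultdict(list)
    for p in posts:
        root = p["reply_root_hash"]
        if root is not None:  # a null root means the post is not a reply
            replies_by_root[root].append(p)
    # Left join: keep every thread even if its root post is not in the data.
    return {
        root: {"root": by_uri.get(root), "replies": replies}
        for root, replies in replies_by_root.items()
    }


# Hypothetical rows, not taken from the dataset.
posts = [
    {"uri_hash": "a" * 16, "reply_root_hash": None},
    {"uri_hash": "b" * 16, "reply_root_hash": "a" * 16},
    {"uri_hash": "c" * 16, "reply_root_hash": "a" * 16},
    {"uri_hash": "d" * 16, "reply_root_hash": "f" * 16},  # root not in snapshot
]
threads = build_threads(posts)
```

The same pattern applies to `reply_parent_hash` when the immediate parent is needed rather than the thread root.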
sentiment
categorical label
This is a three-class sentiment label with values neutral, positive, and negative across 101,040 rows and no nulls. The distribution is uneven: neutral dominates at 48.5% (48,981 rows, top_rate 0.4848), positive accounts for 34,622 (34.3%), and negative trails at 17,437 (17.3%), roughly half the positive count. The entropy ratio of 0.93 still indicates a fairly balanced spread despite the negative class being underrepresented. Treatment: Use as classification target; consider class weighting or resampling to offset the underrepresented negative class.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- neutral
- top_rate
- 0.4848
- cardinality
- 3
- entropy
- 1.473
- entropy_ratio
- 0.9295
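The entropy figures can be reproduced from the three class counts, assuming the profiler reports Shannon entropy in bits and divides by log2 of the number of classes for the ratio. The class weights use the common "balanced" heuristic n / (k × n_class), which is one option among several and not something the report prescribes.

```python
import math

# Class counts from the sentiment table.
counts = {"neutral": 48981, "positive": 34622, "negative": 17437}
total = sum(counts.values())

# Shannon entropy in bits and its ratio to the maximum for 3 classes.
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
entropy_ratio = entropy / math.log2(len(counts))

# "Balanced" class weights: rarer classes receive larger weights.
class_weights = {label: total / (len(counts) * c) for label, c in counts.items()}
```

Under these assumptions the results agree with the reported entropy of 1.473 and entropy_ratio of 0.9295, and the negative class ends up with roughly twice the weight of the positive class.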
sentiment_score
numeric feature outliers
Continuous sentiment polarity bounded in [-0.998, 1.0], consistent with a compound score from a tool like VADER or TextBlob. The distribution is roughly symmetric (skew 0.019, kurtosis 0.018) but dominated by a neutral spike: 47.8% of rows are exactly 0, so both the median and Q1 are 0. The 5,763 rows (5.7%) flagged as outliers are better explained by that collapsed Q1 than by heavy tails: if the profiler applies the usual 1.5×IQR rule, the fences sit at about -0.603 and 1.005, so only strongly negative scores can be flagged and no positive score can. Treatment: Treat the zero-mass separately (e.g., add an is_neutral flag) before using the score as a numeric feature.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 1,928
- min
- -0.998
- max
- 1
- mean
- 0.1074
- median
- 0
- std
- 0.4104
- q1
- 0
- q3
- 0.402
- iqr
- 0.402
- skew
- 0.01861
- kurtosis
- 0.01774
- n_outliers
- 5,763
- outlier_rate
- 0.05704
- zero_rate
- 0.478
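A small sketch of the two ideas in this section: the outlier fences implied by the reported quartiles, and a split of the zero mass into its own flag. The 1.5×IQR rule is an assumption about how the profiler flags outliers; `iqr_fences` and `split_score` are hypothetical helper names.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles from the profile: q1 = 0, q3 = 0.402.
lower, upper = iqr_fences(0.0, 0.402)  # roughly (-0.603, 1.005)


def split_score(score):
    """Separate exact zeros from genuinely polarised scores."""
    is_neutral = score == 0.0
    return {"is_neutral": is_neutral, "polarity": None if is_neutral else score}
```

Because the score's maximum is 1.0, an upper fence near 1.005 cannot be crossed, which is why only the negative side can produce flagged rows under this rule.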
created_at
text timestamp near_unique one_word allcaps
This is a creation timestamp stored as ISO-8601 strings rather than a native datetime, with lengths from 20 to 35 characters and a single token per value. The format is inconsistent: top values mix offset suffixes like `+00:00` with `Z` and microsecond `.000000Z` variants, which will break naive lexical comparisons. Cardinality is high (96,576 unique of 101,040) but 4,464 duplicates (4.4%) suggest concurrent events sharing timestamps. Treatment: Parse to a normalized UTC datetime before any time-based join or sort.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 96,576
- len_min
- 20
- len_max
- 35
- len_mean
- 24.34
- len_median
- 24
- len_p95
- 27
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 4,464
- duplicate_rate
- 0.04418
- vocab_size
- 19,720
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
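One way to normalize the mixed formats is sketched below. It assumes every value carries either `Z` or a `±HH:MM` offset and an optional fractional part of any length; the regular expression and the `parse_utc` name are illustrative, not part of the dataset tooling. Fractions are truncated or padded to six digits so that `datetime.fromisoformat` accepts them on older Python versions as well.

```python
import re
from datetime import datetime, timezone

# date-time, optional fractional seconds, then "Z" or a +HH:MM / -HH:MM offset
_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)


def parse_utc(value: str) -> datetime:
    """Parse an ISO-8601 string into a timezone-aware UTC datetime."""
    m = _ISO.match(value.strip())
    if m is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, frac, tz = m.groups()
    micros = (frac or "")[:6].ljust(6, "0")  # truncate or pad to microseconds
    offset = "+00:00" if tz == "Z" else tz
    return datetime.fromisoformat(f"{base}.{micros}{offset}").astimezone(timezone.utc)


# Three spellings of the same instant compare equal once parsed,
# even though the raw strings sort differently.
variants = [
    "2025-12-23T10:00:00Z",
    "2025-12-23T10:00:00.000000Z",
    "2025-12-23T10:00:00+00:00",
]
parsed = [parse_utc(v) for v in variants]
```

Values that do not match the pattern raise `ValueError` rather than being silently coerced, which makes unexpected formats visible early.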
timestamp
text timestamp near_unique one_word allcaps
This column holds ISO-8601 timestamps with microsecond precision, stored as text rather than a native datetime type. All 101,040 values are unique with zero nulls and a fixed length of 26 characters, and the sampled values cluster on 2025-12-23 and 2025-12-24. The text-profile alerts (near_unique, one_word, allcaps) are artefacts of the string representation, not genuine quality issues. Treatment: Parse to datetime and use for time-based ordering or feature extraction.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 101,040
- len_min
- 26
- len_max
- 26
- len_mean
- 26
- len_median
- 26
- len_p95
- 26
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 20,000
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
language
categorical feature
Language code of the record, dominated by English ('en' at 60.8% of 101,040 rows) with Japanese ('ja') a distant second at 12,607. Notably, 11,481 rows are literally 'unknown' and the codes mix bare ISO ('en') with locale variants ('en-US', 3,617), so the 90 distinct values overstate true language diversity. No nulls, but entropy ratio of 0.34 confirms heavy concentration at the top. Treatment: Normalize locale variants to base codes and treat 'unknown' as missing before one-hot or target encoding.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 90
- top_value
- en
- top_rate
- 0.6084
- cardinality
- 90
- entropy
- 2.178
- entropy_ratio
- 0.3356
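The normalization described in the treatment could be sketched like this. `normalize_language` is a hypothetical helper; it keeps only the primary subtag, which also collapses script variants such as `zh-Hans` and `zh-Hant`, so that simplification should be checked against the downstream use.

```python
from collections import Counter


def normalize_language(code):
    """Map locale variants to their base code; treat 'unknown' as missing."""
    if code is None:
        return None
    code = code.strip()
    if not code or code.lower() == "unknown":
        return None
    # Keep only the primary subtag: "en-US" -> "en", "ja_JP" -> "ja".
    return code.replace("_", "-").split("-", 1)[0].lower()


# Counts taken from the language table above (subset of the 90 values).
raw_counts = {"en": 61468, "ja": 12607, "unknown": 11481, "en-US": 3617, "ja-JP": 170}

merged = Counter()
for code, n in raw_counts.items():
    merged[normalize_language(code)] += n
```

On this subset, English rises from 61,468 to 65,085 rows and Japanese from 12,607 to 12,777 once the locale variants are folded in.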
char_count
numeric feature
This is the character count of the text column: its min of 1, max of 525, mean of 97.63, and median of 68 all match the length statistics reported for text. All 101,040 rows are populated, with only 341 distinct integer values. The distribution is right-skewed (skew 1.02), with the median of 68 well below the mean of 97.6 and an interquartile range of 30 to 143, so most posts are short but a tail stretches out. The histogram also shows a spike of 4,862 posts in the 289.2 to 302.3 bin, which lines up with Bluesky's 300-character post limit. Outliers are minimal (0.29%) and there are no zero-length entries. Treatment: Consider a log or sqrt transform before modelling to tame the right skew.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 341
- min
- 1
- max
- 525
- mean
- 97.63
- median
- 68
- std
- 86.05
- q1
- 30
- q3
- 143
- iqr
- 113
- skew
- 1.018
- kurtosis
- -0.05733
- n_outliers
- 289
- outlier_rate
- 0.00286
- zero_rate
- 0
word_count
numeric feature
This is a numeric feature counting words per record, ranging from 0 to 83 with a median of 10 and mean of 14.67. The distribution is right-skewed (skew 1.21) with an IQR of 19 and about 2.85% of values flagged as outliers, suggesting a long tail of unusually long entries. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) indicate counts cluster in a narrow integer range. Treatment: Consider a log or sqrt transform before modelling to tame the right skew.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 79
- min
- 0
- max
- 83
- mean
- 14.67
- median
- 10
- std
- 14.22
- q1
- 3
- q3
- 22
- iqr
- 19
- skew
- 1.209
- kurtosis
- 0.699
- n_outliers
- 2,882
- outlier_rate
- 0.02852
- zero_rate
- 0.0006037
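For the suggested log transform, `log1p` is a natural choice because word_count contains zeros and log(0) is undefined. The sketch below applies it to the five-number summary reported above; `log1p_feature` is a hypothetical helper name, and the same function would work for char_count.

```python
import math


def log1p_feature(counts):
    """Apply log(1 + x), which is defined at zero and compresses the right tail."""
    return [math.log1p(c) for c in counts]


# min, q1, median, q3, max of word_count from the profile.
word_counts = [0, 3, 10, 22, 83]
transformed = log1p_feature(word_counts)
```

The transform is monotonic, so rank-based models are unaffected; it mainly helps linear models and distance-based methods.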
has_images
numeric feature high_skew outliers
Binary flag indicating whether a record has images, encoded as 0/1 with only 2 unique values across 101,040 rows and no nulls. The positive class is rare at mean 0.136 (zero_rate 0.864), which is why the profiler flags high skew (2.12) and treats the 13,768 ones as 'outliers'. Treatment: Treat as a boolean feature; ignore the outlier flag since it just reflects class imbalance.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.1363
- median
- 0
- std
- 0.3431
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 2.12
- kurtosis
- 2.497
- n_outliers
- 13,768
- outlier_rate
- 0.1363
- zero_rate
- 0.8637
has_video
numeric feature high_skew
This is a binary flag indicating whether a record has a video, stored numerically with only 2 unique values (0 and 1) and no nulls across 101,040 rows. The positive class is rare: 98.67% are zero and only 1.33% are ones, producing the flagged high skew (8.50) and extreme kurtosis (70.19). Treatment: Cast to boolean and keep as-is, but expect class imbalance when using it as a predictor or target.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.0133
- median
- 0
- std
- 0.1146
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 8.497
- kurtosis
- 70.19
- n_outliers
- 1,344
- outlier_rate
- 0.0133
- zero_rate
- 0.9867
has_link
numeric feature outliers
This is a binary 0/1 flag indicating whether a record contains a link, with no nulls across 101,040 rows. About 17.95% are 1s and 82.05% are 0s, which is why the IQR is 0 and the minority class gets labelled as 18,140 'outliers'. That is a quirk of applying numeric outlier logic to a Bernoulli variable, not a data issue. Treatment: Cast to boolean and use directly as a binary feature; ignore the outlier flag.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.1795
- median
- 0
- std
- 0.3838
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 1.67
- kurtosis
- 0.7888
- n_outliers
- 18,140
- outlier_rate
- 0.1795
- zero_rate
- 0.8205
embed_type
categorical feature null_rate
This column tags the embed type attached to Bluesky posts, with five AT Protocol values like app.bsky.embed.external and app.bsky.embed.images. Note the 61.15% null rate: most posts carry no embed at all. Among posts that do, external links dominate at 46.22%, with video and recordWithMedia being rare tails. Entropy ratio of 0.74 shows reasonable spread across the five categories when present. Treatment: One-hot encode with an explicit "none" category for nulls.
- n
- 101,040
- nulls
- 61,791 (61.2%)
- unique
- 5
- top_value
- app.bsky.embed.external
- top_rate
- 0.4622
- cardinality
- 5
- entropy
- 1.717
- entropy_ratio
- 0.7394
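The one-hot treatment with an explicit "none" category could be sketched as below, together with a check of the reported shares against the embed_type counts. `one_hot_embed` and the output key names are hypothetical; the five category strings and the counts come from the table above.

```python
EMBED_TYPES = (
    "app.bsky.embed.external",
    "app.bsky.embed.images",
    "app.bsky.embed.record",
    "app.bsky.embed.video",
    "app.bsky.embed.recordWithMedia",
)
CATEGORIES = EMBED_TYPES + ("none",)


def one_hot_embed(value):
    """Encode an embed_type value (or None) as a dict of 0/1 indicators."""
    key = "none" if value is None else value
    if key not in CATEGORIES:
        raise ValueError(f"unexpected embed_type: {value!r}")
    # Use the last dotted segment as a short column suffix.
    return {f"embed_{c.rsplit('.', 1)[-1]}": int(c == key) for c in CATEGORIES}


# Counts from the embed_type table; nulls are whatever remains of 101,040 rows.
counts = dict(zip(EMBED_TYPES, (18140, 13768, 5126, 1344, 871)))
non_null = sum(counts.values())                      # 39,249
n_null = 101040 - non_null                           # 61,791
external_share = counts[EMBED_TYPES[0]] / non_null   # about 0.4622
```

Raising on unknown values is a deliberate choice here so that a new embed type in later snapshots is noticed rather than silently mapped to "none".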
mentions
text foreign_key one_word short_text duplicates
This column stores serialized JSON arrays of mentioned entity IDs (16-hex tokens), but the overwhelming majority are empty: '[]' appears 98,659 of 101,040 times (97.6%) and the duplicate rate is 0.981. The summary statistics (word_mean 1.01, len_median 2, len_p95 2) are dominated by those empty arrays, so they say little about how many IDs a non-empty row holds; a len_max of 420 shows that some rows carry many mentions. Vocabulary is small (660) across 1,921 unique values, so signal is sparse and ID-like rather than textual. Treatment: Parse the JSON array and explode to a mentions bridge table keyed by these hex IDs; treat empty as no-mention rather than null.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 1,921
- len_min
- 2
- len_max
- 420
- len_mean
- 2.67
- len_median
- 2
- len_p95
- 2
- word_mean
- 1.012
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 99,119
- duplicate_rate
- 0.981
- vocab_size
- 660
- readability_flesch_mean
- 0.702
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9962
- allcaps_rate
- 3.959e-05
- boilerplate_rate
- 0
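Exploding the serialized arrays into a bridge table could look like the sketch below. It assumes each cell is valid JSON holding a list of strings, as the profile describes; `explode_mentions` and the sample rows are hypothetical.

```python
import json


def explode_mentions(rows):
    """Return (uri_hash, mentioned_id) pairs, one per mention; '[]' yields none."""
    bridge = []
    for row in rows:
        for mentioned in json.loads(row["mentions"]):
            bridge.append((row["uri_hash"], mentioned))
    return bridge


# Hypothetical rows, not taken from the dataset.
rows = [
    {"uri_hash": "a" * 16, "mentions": "[]"},
    {"uri_hash": "b" * 16, "mentions": '["0123456789abcdef"]'},
    {"uri_hash": "c" * 16, "mentions": '["0123456789abcdef", "fedcba9876543210"]'},
]
bridge = explode_mentions(rows)
```

Posts with an empty array simply contribute no rows to the bridge, which matches the "no-mention rather than null" reading.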
links
text metadata one_word short_text duplicates
This column holds JSON-encoded arrays of URLs, but 96,211 of 101,040 rows (95.2%) are the empty array '[]', driving a 96.3% duplicate rate and a median length of 2 characters. Only 3,771 unique values exist across 101,040 rows, and just 4.8% of rows (4,829) actually contain a URL. The non-empty entries point to streaming radio and news endpoints (BBC, Radio France, shoutcast streams). Treatment: Parse the JSON array and explode for per-URL analysis; the column is too sparse to use as-is. Note that the existing has_link flag is set on 18,140 rows, which matches the app.bsky.embed.external count rather than the 4,829 non-empty rows here, so a boolean derived from this column would not reproduce has_link.
- n
- 101,040
- nulls
- 0 (0.0%)
- unique
- 3,771
- len_min
- 2
- len_max
- 266
- len_mean
- 4.95
- len_median
- 2
- len_p95
- 2
- word_mean
- 1.003
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 97,269
- duplicate_rate
- 0.9627
- vocab_size
- 904
- readability_flesch_mean
- -20.11
- emoji_rate
- 0
- url_rate
- 0.04779
- one_word_rate
- 0.9978
- allcaps_rate
- 0
- boilerplate_rate
- 0
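Since the share of non-empty `links` arrays (about 4.8%) differs from the share of rows with `has_link` set (about 18%), a quick cross-check of the two columns is worth running before relying on either. The sketch below assumes `links` is a JSON array string and `has_link` is 0/1; `link_summary` and the sample rows are hypothetical.

```python
import json


def link_summary(rows):
    """Count rows by whether the links array is non-empty and has_link is set."""
    summary = {"links_non_empty": 0, "has_link": 0, "both": 0}
    for row in rows:
        has_urls = len(json.loads(row["links"])) > 0
        flagged = row["has_link"] == 1
        summary["links_non_empty"] += has_urls
        summary["has_link"] += flagged
        summary["both"] += has_urls and flagged
    return summary


# Hypothetical rows illustrating the possible combinations.
rows = [
    {"links": "[]", "has_link": 0},
    {"links": "[]", "has_link": 1},  # flagged, but no URL in the array
    {"links": '["https://example.com/stream"]', "has_link": 1},
    {"links": '["https://example.com/news"]', "has_link": 0},
]
summary = link_summary(rows)
```

If "both" turns out to be much smaller than either count on the real data, the two columns are measuring different things and should not be merged.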