bsky firehose dec 2025
Reading
This dataset captures 101,040 anonymized Bluesky firehose posts from late December 2025, with 19 columns covering post hashes, authorship, timestamps, content text, embeds, hashtags, mentions, links, language, and sentiment. The text column is richly multilingual — English dominates at ~61% of posts, followed by Japanese (~12.6k) and a sizable 'unknown' bucket (~11.5k) — and sentiment skews neutral (48.5%) with positive outweighing negative roughly 2:1. Engagement-style features are heavily zero-inflated: only ~13.6% of posts include images, ~18% include links, and just ~1.3% include video, so most posts are plain text. About 58% of posts have no reply_root_hash, suggesting top-level posts dominate over threaded replies. The most useful first cuts are language mix, sentiment distribution, embed_type composition, and post-length shape via char_count.
citing: row_count · column_count · language.top_values · sentiment.top_values · embed_type.top_values · embed_type.null_rate · has_images.stats.mean · has_link.stats.mean · has_video.stats.mean · reply_root_hash.null_rate · char_count.stats · text.language_counts
Charts the summary said to look at first
language
| value | count | share |
|---|---|---|
| en | 61468 | 60.8% |
| ja | 12607 | 12.5% |
| unknown | 11481 | 11.4% |
| en-US | 3617 | 3.6% |
| ko | 2406 | 2.4% |
| de | 1821 | 1.8% |
| pt | 1295 | 1.3% |
| es | 1153 | 1.1% |
| fr | 746 | 0.7% |
| th | 612 | 0.6% |
| tr | 548 | 0.5% |
| nl | 525 | 0.5% |
| zh | 315 | 0.3% |
| it | 276 | 0.3% |
| ru | 213 | 0.2% |
| fi | 193 | 0.2% |
| ja-JP | 170 | 0.2% |
| id | 158 | 0.2% |
| pl | 139 | 0.1% |
| el | 116 | 0.1% |
sentiment
| value | count | share |
|---|---|---|
| neutral | 48981 | 48.5% |
| positive | 34622 | 34.3% |
| negative | 17437 | 17.3% |
embed_type
| value | count | share |
|---|---|---|
| app.bsky.embed.external | 18140 | 18.0% |
| app.bsky.embed.images | 13768 | 13.6% |
| app.bsky.embed.record | 5126 | 5.1% |
| app.bsky.embed.video | 1344 | 1.3% |
| app.bsky.embed.recordWithMedia | 871 | 0.9% |
char_count (histogram)
| bin | count |
|---|---|
| 1 – 14.1 | 11042 |
| 14.1 – 27.2 | 12354 |
| 27.2 – 40.3 | 11078 |
| 40.3 – 53.4 | 8177 |
| 53.4 – 66.5 | 7049 |
| 66.5 – 79.6 | 6309 |
| 79.6 – 92.7 | 5146 |
| 92.7 – 105.8 | 4495 |
| 105.8 – 118.9 | 3865 |
| 118.9 – 132 | 3459 |
| 132 – 145.1 | 3244 |
| 145.1 – 158.2 | 2592 |
| 158.2 – 171.3 | 2219 |
| 171.3 – 184.4 | 2065 |
| 184.4 – 197.5 | 1867 |
| 197.5 – 210.6 | 1708 |
| 210.6 – 223.7 | 1624 |
| 223.7 – 236.8 | 1467 |
| 236.8 – 249.9 | 1439 |
| 249.9 – 263 | 1439 |
| 263 – 276.1 | 1563 |
| 276.1 – 289.2 | 1657 |
| 289.2 – 302.3 | 4862 |
| 302.3 – 315.4 | 33 |
| 315.4 – 328.5 | 261 |
| 328.5 – 341.6 | 3 |
| 341.6 – 354.7 | 5 |
| 354.7 – 367.8 | 1 |
| 367.8 – 380.9 | 2 |
| 380.9 – 394 | 9 |
| 394 – 407.1 | 3 |
| 407.1 – 420.2 | 0 |
| 420.2 – 433.3 | 1 |
| 433.3 – 446.4 | 1 |
| 446.4 – 459.5 | 0 |
| 459.5 – 472.6 | 0 |
| 472.6 – 485.7 | 0 |
| 485.7 – 498.8 | 0 |
| 498.8 – 511.9 | 0 |
| 511.9 – 525 | 1 |
sentiment_score (histogram)
| bin | count |
|---|---|
| -0.998 – -0.948 | 247 |
| -0.948 – -0.8981 | 541 |
| -0.8981 – -0.8481 | 705 |
| -0.8481 – -0.7982 | 822 |
| -0.7982 – -0.7482 | 812 |
| -0.7482 – -0.6983 | 874 |
| -0.6983 – -0.6483 | 1081 |
| -0.6483 – -0.5984 | 1067 |
| -0.5984 – -0.5484 | 1130 |
| -0.5484 – -0.4985 | 1234 |
| -0.4985 – -0.4486 | 1387 |
| -0.4486 – -0.3986 | 1127 |
| -0.3986 – -0.3487 | 895 |
| -0.3487 – -0.2987 | 1120 |
| -0.2987 – -0.2488 | 1585 |
| -0.2488 – -0.1988 | 792 |
| -0.1988 – -0.1488 | 726 |
| -0.1488 – -0.0989 | 672 |
| -0.0989 – -0.04895 | 624 |
| -0.04895 – 0.001 | 48620 |
| 0.001 – 0.05095 | 358 |
| 0.05095 – 0.1009 | 740 |
| 0.1009 – 0.1508 | 623 |
| 0.1508 – 0.2008 | 781 |
| 0.2008 – 0.2508 | 1426 |
| 0.2508 – 0.3007 | 1343 |
| 0.3007 – 0.3507 | 1581 |
| 0.3507 – 0.4006 | 2179 |
| 0.4006 – 0.4506 | 3463 |
| 0.4506 – 0.5005 | 2801 |
| 0.5005 – 0.5505 | 1918 |
| 0.5505 – 0.6004 | 2382 |
| 0.6004 – 0.6503 | 2518 |
| 0.6503 – 0.7003 | 2004 |
| 0.7003 – 0.7503 | 1886 |
| 0.7503 – 0.8002 | 1870 |
| 0.8002 – 0.8501 | 2266 |
| 0.8501 – 0.9001 | 2011 |
| 0.9001 – 0.9501 | 1758 |
| 0.9501 – 1 | 1071 |
Schema
19 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| text | text | 0.0% | 95,935 | multilingual, allcaps |
| author_did_hash | text | 0.0% | 43,998 | duplicates, short_text, one_word |
| uri_hash | text | 0.0% | 101,039 | near_unique, short_text, one_word |
| reply_parent_hash | text | 57.7% | 34,738 | short_text, null_rate, one_word |
| reply_root_hash | text | 57.7% | 21,277 | duplicates, short_text, null_rate, one_word |
| sentiment | categorical | 0.0% | 3 | |
| sentiment_score | numeric | 0.0% | 1,928 | outliers |
| created_at | text | 0.0% | 96,576 | near_unique, one_word, allcaps |
| timestamp | text | 0.0% | 101,040 | near_unique, one_word, allcaps |
| language | categorical | 0.0% | 90 | |
| char_count | numeric | 0.0% | 341 | |
| word_count | numeric | 0.0% | 79 | |
| has_images | numeric | 0.0% | 2 | high_skew, outliers |
| has_video | numeric | 0.0% | 2 | high_skew |
| has_link | numeric | 0.0% | 2 | outliers |
| embed_type | categorical | 61.2% | 5 | null_rate |
| hashtags | text | 0.0% | 10,103 | duplicates, one_word |
| mentions | text | 0.0% | 1,921 | duplicates, short_text, one_word |
| links | text | 0.0% | 3,771 | duplicates, short_text, one_word |
text
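text free_text multilingual allcaps
Short user-generated posts (likely Bluesky, given the bsky.app top value and the hashtag/emoji patterns), with a median of 10 words and a maximum length of 525 characters. The length histogram earlier in this report (bins from 1 to 525) falls off sharply just past 300, consistent with Bluesky's 300-grapheme post limit; a post can exceed 300 characters because one grapheme may span several code points. The language counts reported for this column (3,309 English, 656 Japanese, 125 Korean) are far smaller than the 101,040 rows, so they appear to come from a sample rather than the full column; within them English dominates, with sizeable Japanese and Korean tails. The column also shows an 18.3% emoji rate, 16.9% all-caps lines, and 19% one-word entries. Note the 5,105 duplicates (5.05%), including spam-like Thai promo and repeated sheep-emoji posts among the top values. Treatment: deduplicate, language-detect and route per language, then tokenize/embed for modelling.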
- n: 101,040
- nulls: 0 (0.0%)
- unique: 95,935
- len_min: 1
- len_max: 525
- len_mean: 97.63
- len_median: 68
- len_p95: 290
- word_mean: 14.23
- word_median: 10
- n_empty: 0
- n_duplicates: 5,105
- duplicate_rate: 0.05052
- vocab_size: 77,183
- readability_flesch_mean: 64.09
- emoji_rate: 0.1832
- url_rate: 0.07586
- one_word_rate: 0.1899
- allcaps_rate: 0.1691
- boilerplate_rate: 0.001049
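The first step of the treatment for `text`, deduplication, might look like the sketch below. It keeps the first occurrence of each post and compares texts after collapsing whitespace; that normalisation is an assumption, since the report does not say how its 5,105 duplicates were matched. The function name `dedupe_exact` and the sample posts are hypothetical.

```python
from typing import Iterable, List


def dedupe_exact(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each text, preserving input order."""
    seen = set()
    kept = []
    for text in texts:
        # Assumption: runs of whitespace are not meaningful when comparing posts.
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


# Hypothetical sample, not rows from the dataset.
sample = ["hello  world", "hello world", "🐑🐑🐑", "🐑🐑🐑", "good morning"]
unique_posts = dedupe_exact(sample)
```

Near-duplicate spam (the same promo with a changed link, for example) would need fuzzier matching than this.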
uri_hash
text identifier near_unique short_text one_word
Fixed 16-character single-token strings with 101,039 unique values across 101,040 rows and a single null, almost certainly hex digests (16 hex chars = 64-bit hash) of URIs, used as row identifiers. The column is effectively a primary key: one_word_rate is 1.0, length is exactly 16 at min/median/max, and duplicate_rate is 0.0. No analytic signal lives here beyond identity. Treatment: drop from modelling; retain only as a join key or deduplication handle, and check the one row with a missing hash before joining.
- n: 101,040
- nulls: 1 (0.0%)
- unique: 101,039
- len_min: 16
- len_max: 16
- len_mean: 16
- len_median: 16
- len_p95: 16
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 69.61
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0005245
- boilerplate_rate: 0
reply_parent_hash
text foreign_key short_text null_rate one_word
Every non-null value is a single 16-character token (len_min=len_max=16, one_word_rate=1.0), strongly indicating a hex hash identifier pointing to a parent post. 57.67% of rows are null, consistent with a reply-only field where most posts are top-level rather than replies. Among the 42,770 populated rows there are 34,738 unique hashes, leaving 8,032 duplicates (an 18.78% duplicate rate), so some parents attract many replies (the top hash appears 121 times). Treatment: treat as a foreign key to the parent post; left-join on this hash and ignore for modelling.
- n: 101,040
- nulls: 58,270 (57.7%)
- unique: 34,738
- len_min: 16
- len_max: 16
- len_mean: 16
- len_median: 16
- len_p95: 16
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 8,032
- duplicate_rate: 0.1878
- vocab_size: 17,415
- readability_flesch_mean: 71.73
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0007949
- boilerplate_rate: 0
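The duplicate figures for `reply_parent_hash` follow from the row, null, and unique counts, provided the rate is taken over populated rows only. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def duplicate_stats(n_rows: int, n_nulls: int, n_unique: int):
    """Return (populated rows, duplicate rows, duplicate rate over populated rows)."""
    populated = n_rows - n_nulls
    duplicates = populated - n_unique
    return populated, duplicates, duplicates / populated


# Counts taken from the reply_parent_hash statistics above.
parent = duplicate_stats(101_040, 58_270, 34_738)
```

The same helper applied to `reply_root_hash` (21,277 unique values over the same 42,770 populated rows) reproduces its 21,493 duplicates and 0.5025 rate.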
reply_root_hash
text foreign_key duplicates short_text null_rate one_word
This is a 16-character hex hash identifying the root post of a reply thread, with every value being a single token of fixed length 16. About 57.67% of rows are null (likely top-level posts with no reply root) and 50.25% of the non-null values are duplicates across 21,277 unique hashes, consistent with many replies sharing the same thread root. Treatment: left-join on this hash to recover the thread's root post; do not feature-encode.
- n: 101,040
- nulls: 58,270 (57.7%)
- unique: 21,277
- len_min: 16
- len_max: 16
- len_mean: 16
- len_median: 16
- len_p95: 16
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 21,493
- duplicate_rate: 0.5025
- vocab_size: 12,498
- readability_flesch_mean: 77.23
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0008183
- boilerplate_rate: 0
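A left join from each post to its thread root could be sketched as below. It assumes `reply_root_hash` values match the `uri_hash` of the root post, which the report suggests but does not confirm; the rows and hashes are made up for illustration.

```python
from collections import Counter

# Hypothetical rows; real hashes are 16 hex characters.
posts = [
    {"uri_hash": "a" * 16, "reply_root_hash": None, "text": "root post"},
    {"uri_hash": "b" * 16, "reply_root_hash": "a" * 16, "text": "first reply"},
    {"uri_hash": "c" * 16, "reply_root_hash": "a" * 16, "text": "second reply"},
    {"uri_hash": "d" * 16, "reply_root_hash": "f" * 16, "text": "root not captured"},
]

# Index posts by their own hash so roots can be looked up directly.
by_uri = {post["uri_hash"]: post for post in posts}

# Left join: every post is kept, and root_text is None when the post is
# top-level or when its root falls outside the captured window.
joined = [
    {**post, "root_text": by_uri.get(post["reply_root_hash"], {}).get("text")}
    for post in posts
]

# Replies per thread root, counted over non-null roots only.
thread_sizes = Counter(
    post["reply_root_hash"] for post in posts if post["reply_root_hash"]
)
```

With a capture window this narrow, many roots will be missing from the table, so unmatched joins should be expected rather than treated as errors.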
sentiment
categorical label
A three-class sentiment label across 101,040 rows with no nulls and only 3 unique values: neutral, positive, negative. Distribution is uneven: neutral leads at 48.5%, followed by positive (34,622) and negative (17,437), so negatives are roughly half as common as positives. Entropy ratio of 0.93 indicates the classes are reasonably spread but not balanced. Treatment: use as a categorical target; consider class weighting to offset the under-represented negative class.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 3
- top_value: neutral
- top_rate: 0.4848
- cardinality: 3
- entropy: 1.473
- entropy_ratio: 0.9295
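The entropy figures for `sentiment` can be reproduced from the class counts, assuming entropy is measured in bits and the ratio divides by the maximum entropy for three classes. The class weights use the common "balanced" heuristic `total / (n_classes * count)`; that is one possible scheme, not something the report prescribes.

```python
import math

# Class counts from the sentiment table in this report.
counts = {"neutral": 48_981, "positive": 34_622, "negative": 17_437}
total = sum(counts.values())

# Shannon entropy in bits, and its ratio to the maximum for 3 classes.
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
entropy_ratio = entropy / math.log2(len(counts))

# "Balanced" class weights: rarer classes receive larger weights.
class_weights = {label: total / (len(counts) * c) for label, c in counts.items()}
```

Under this scheme the negative class is weighted most heavily and the neutral class least.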
sentiment_score
numeric feature outliers
This is a bounded sentiment polarity score in [-0.998, 1.0], typical of lexicon- or model-based sentiment scoring. Skew (0.019) and kurtosis (0.018) are both near zero, but the distribution is heavily zero-inflated: 47.8% of rows are exactly 0 and the median is 0, suggesting many neutral or unscoreable texts, while the mean (0.107) sits slightly on the positive side. 5,763 rows (5.7%) are flagged as outliers. If the flag uses the usual 1.5×IQR fences (an assumption; q1 = 0, q3 = 0.402), the fences sit at about -0.603 and 1.005, so only strongly negative scores can be flagged and the positive tail is never counted. Treatment: treat zeros as a separate 'neutral' indicator and use the raw score as a feature; no transform is needed given the bounded range.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 1,928
- min: -0.998
- max: 1
- mean: 0.1074
- median: 0
- std: 0.4104
- q1: 0
- q3: 0.402
- iqr: 0.402
- skew: 0.01861
- kurtosis: 0.01774
- n_outliers: 5,763
- outlier_rate: 0.05704
- zero_rate: 0.478
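To see which side of `sentiment_score` an IQR-based outlier flag can reach, the fences can be computed from the quartiles listed above. This sketch assumes the conventional multiplier of 1.5; the report does not state which rule it uses, and `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and range from the sentiment_score statistics above.
lower, upper = tukey_fences(0.0, 0.402)
score_min, score_max = -0.998, 1.0

can_flag_low = score_min < lower    # scores below the lower fence exist
can_flag_high = score_max > upper   # no score can exceed the upper fence
```

Because q1 and the median both sit at zero, the fences are lopsided: strongly negative scores fall outside them while even a score of 1.0 does not.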
created_at
text timestamp near_unique one_word allcaps
This is a creation timestamp column stored as ISO 8601 strings (lengths 20-35, single-token), not parsed datetimes. Values are near-unique (96,576 of 101,040), yet 4,464 duplicates exist and the top values cluster tightly around 2025-12-24T05:00, suggesting a narrow ingestion window or batch insert. Format is inconsistent across rows, mixing '+00:00', '.000Z', and microsecond-precision 'Z' suffixes, which will break naive string sorting. Treatment: parse to a normalized UTC datetime before any temporal analysis or joins.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 96,576
- len_min: 20
- len_max: 35
- len_mean: 24.34
- len_median: 24
- len_p95: 27
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 4,464
- duplicate_rate: 0.04418
- vocab_size: 19,720
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
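One way to normalise the mixed `created_at` formats is sketched below using only the standard library. It assumes every value is a full date-time ending in either `Z` or a `±HH:MM` offset, and it truncates fractional seconds to microseconds. The helper name `to_utc` and the sample strings are illustrative, not values from the dataset.

```python
import re
from datetime import datetime, timezone

_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # date and time to the second
    r"(?:\.(\d+))?"                            # optional fractional seconds
    r"(Z|[+-]\d{2}:\d{2})$"                    # 'Z' or a numeric UTC offset
)


def to_utc(raw: str) -> datetime:
    """Parse an ISO 8601 string with an explicit zone into an aware UTC datetime."""
    match = _ISO.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {raw!r}")
    base, frac, zone = match.groups()
    frac = (frac or "")[:6].ljust(6, "0")      # pad or truncate to microseconds
    zone = "+00:00" if zone == "Z" else zone
    return datetime.fromisoformat(f"{base}.{frac}{zone}").astimezone(timezone.utc)


# Three spellings of the same instant, mirroring the suffix styles noted above.
variants = [
    "2025-12-24T05:00:00+00:00",
    "2025-12-24T05:00:00.000Z",
    "2025-12-24T05:00:00.000000Z",
]
parsed = [to_utc(v) for v in variants]
```

Sorted as strings, these three variants would not compare equal; after parsing they are the same datetime.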
timestamp
text timestamp near_unique one_word allcaps
This is an ISO-8601 timestamp column stored as text, with every one of the 101,040 values unique and exactly 26 characters long. Sampled values cluster on 2025-12-23 and 2025-12-24, suggesting a narrow capture window rather than a broad historical range. The 'allcaps' and 'one_word' alerts are artefacts of the ISO format (the literal 'T' separator and no whitespace), not a data quality issue. Treatment: parse to datetime and derive features (hour, day, delta) instead of using as text.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 101,040
- len_min: 26
- len_max: 26
- len_mean: 26
- len_median: 26
- len_p95: 26
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
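Deriving hour, day, and delta features from `timestamp` could look like the sketch below. It assumes the 26-character values have the form `YYYY-MM-DDTHH:MM:SS.ffffff` with no zone suffix, which fits the fixed length but is not confirmed by the report. The function name and sample values are hypothetical.

```python
from datetime import datetime
from typing import Optional


def time_features(raw: str, previous: Optional[datetime] = None) -> dict:
    """Parse one timestamp and derive simple temporal features from it."""
    ts = datetime.fromisoformat(raw)
    delta = (ts - previous).total_seconds() if previous is not None else None
    return {
        "ts": ts,
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday is 0
        "delta_s": delta,         # seconds since the previous row, if any
    }


# Hypothetical consecutive rows straddling midnight.
first = time_features("2025-12-23T23:59:58.500000")
second = time_features("2025-12-24T00:00:01.250000", previous=first["ts"])
```

With a capture window of roughly two days, hour of day is likely to carry more signal than weekday.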
language
categorical feature
Language tag of each record, using ISO-style codes across 90 distinct values with no nulls. English dominates at 60.8% of rows, followed by Japanese (12,607) and a sizeable 'unknown' bucket (11,481) that signals missing-data leakage into the category itself. Note the inconsistent granularity: bare 'en' coexists with locale-specific 'en-US' (3,617), so codes need normalisation before grouping. Treatment: normalise locale codes (collapse en-US into en), treat 'unknown' as missing, then one-hot or target-encode.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 90
- top_value: en
- top_rate: 0.6084
- cardinality: 90
- entropy: 2.178
- entropy_ratio: 0.3356
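The normalisation suggested for `language` might be sketched as follows, using five of the counts from the language table in this report. Keeping only the primary subtag is an assumption that suits `en-US` and `ja-JP`, but it would also merge script variants (for example `zh-Hans` and `zh-Hant`) that some analyses want kept apart.

```python
from collections import Counter
from typing import Optional

# A subset of the counts from the language table above.
raw_counts = {"en": 61_468, "ja": 12_607, "unknown": 11_481, "en-US": 3_617, "ja-JP": 170}


def normalise(tag: Optional[str]) -> Optional[str]:
    """Collapse a locale tag to its primary subtag; map 'unknown' to None."""
    if tag is None or tag.strip().lower() in {"", "unknown"}:
        return None
    return tag.strip().replace("_", "-").split("-")[0].lower()


merged = Counter()
for tag, n in raw_counts.items():
    merged[normalise(tag)] += n
```

After merging, the `None` bucket holds the rows that should be treated as missing rather than encoded as a language.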
char_count
numeric feature
This is a per-row character count of the text column: its min (1), max (525), mean (97.63), and median (68) match that column's length statistics exactly. The distribution is right-skewed (skew 1.02) with a wide IQR of 113, but only 0.29% of values flag as outliers and there are no nulls or zeros. With just 341 unique integer values across 101,040 rows, the field is discrete and well-behaved. Treatment: consider a log or sqrt transform before regression to tame the right skew.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 341
- min: 1
- max: 525
- mean: 97.63
- median: 68
- std: 86.05
- q1: 30
- q3: 143
- iqr: 113
- skew: 1.018
- kurtosis: -0.05733
- n_outliers: 289
- outlier_rate: 0.00286
- zero_rate: 0
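As a rough illustration of what the suggested transforms do to `char_count`, the sketch below compares how far the maximum sits above the median relative to how far q1 sits below it, before and after transforming. The spread ratio is an ad hoc diagnostic chosen for this example, not a statistic from the report.

```python
import math

# Summary points from the char_count statistics above.
q1, median, maximum = 30, 68, 525


def upper_to_lower_spread(transform) -> float:
    """Ratio of (max - median) to (median - q1) after applying a transform."""
    lo, mid, hi = transform(q1), transform(median), transform(maximum)
    return (hi - mid) / (mid - lo)


raw_ratio = upper_to_lower_spread(float)
sqrt_ratio = upper_to_lower_spread(math.sqrt)
log_ratio = upper_to_lower_spread(math.log1p)
```

Both transforms pull the long upper tail in towards the bulk of the data, and the log does so more aggressively than the square root.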
word_count
numeric feature
This is a per-row word count, ranging from 0 to 83 with a median of 10 and mean of 14.67, indicating most entries are short snippets rather than long documents. The distribution is right-skewed (skew 1.21) with a wide IQR of 19 and ~2.85% outliers, suggesting a long tail of unusually verbose rows. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) confirm it's a bounded discrete count with virtually no empty texts. Treatment: log-transform or bin before modelling to dampen the right skew.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 79
- min: 0
- max: 83
- mean: 14.67
- median: 10
- std: 14.22
- q1: 3
- q3: 22
- iqr: 19
- skew: 1.209
- kurtosis: 0.699
- n_outliers: 2,882
- outlier_rate: 0.02852
- zero_rate: 0.0006037
has_images
numeric feature high_skew outliers
This is a binary indicator (only 2 unique values, min 0, max 1) flagging whether a record has images. It's heavily imbalanced: 86.4% zeros and a mean of 0.136, meaning only ~13.6% of rows have images. The 'outliers' alert simply reflects the minority class rather than anomalous values. Treatment: treat as a boolean flag; no transformation needed, but watch class imbalance if used as a target.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 2
- min: 0
- max: 1
- mean: 0.1363
- median: 0
- std: 0.3431
- q1: 0
- q3: 0
- iqr: 0
- skew: 2.12
- kurtosis: 2.497
- n_outliers: 13,768
- outlier_rate: 0.1363
- zero_rate: 0.8637
has_video
numeric feature high_skew
Binary flag indicating whether a record has an associated video, stored as 0/1 with no nulls across 101,040 rows. The positive class is rare: 98.67% are zero and only 1.33% are one, producing extreme skew (8.50) and kurtosis (70.19). The 1,344 ones are flagged as outliers purely because of the imbalance, not because they are anomalous. Treatment: treat as a boolean indicator; expect minimal signal given 98.67% zeros and consider class-imbalance handling if used as a target.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 2
- min: 0
- max: 1
- mean: 0.0133
- median: 0
- std: 0.1146
- q1: 0
- q3: 0
- iqr: 0
- skew: 8.497
- kurtosis: 70.19
- n_outliers: 1,344
- outlier_rate: 0.0133
- zero_rate: 0.9867
has_link
numeric feature outliers
Binary 0/1 flag indicating whether a record contains a link, with 82.0% zeros and a mean of 0.180 across 101,040 rows. The 18,140 'outliers' are simply the minority positive class, not anomalies: IQR-based outlier detection misfires on binary data. No nulls, exactly 2 unique values. Treatment: use directly as a boolean feature; ignore the outlier alert.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 2
- min: 0
- max: 1
- mean: 0.1795
- median: 0
- std: 0.3838
- q1: 0
- q3: 0
- iqr: 0
- skew: 1.67
- kurtosis: 0.7888
- n_outliers: 18,140
- outlier_rate: 0.1795
- zero_rate: 0.8205
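The misfire on `has_link` is easy to reproduce. The sketch below rebuilds a column with the same class counts (18,140 ones out of 101,040 rows) and applies 1.5×IQR fences; the 1.5 multiplier is an assumption about how the report's flag is computed.

```python
import statistics

# Synthetic column with the same class counts as has_link.
values = [1] * 18_140 + [0] * 82_900

# With more than 75% zeros, both quartiles are 0, so the IQR is 0.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# The fences collapse onto 0, so every 1 lands outside them.
flagged = sum(1 for v in values if v < lower or v > upper)
```

The flagged count equals the size of the positive class, which is why the alert carries no information for a binary column.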
embed_type
categorical feature null_rate
This column tags the embed type attached to a Bluesky post, with 5 distinct AT Protocol lexicon values like app.bsky.embed.external and app.bsky.embed.images. 61.15% of rows are null, consistent with most posts having no embed, and among the populated rows external links (46.2%) and images dominate while video and recordWithMedia are rare. Entropy ratio of 0.74 indicates a moderately concentrated but not degenerate distribution. Treatment: treat nulls as a 'no embed' category and one-hot encode the resulting 6 levels (the 5 embed types plus 'no embed').
- n: 101,040
- nulls: 61,791 (61.2%)
- unique: 5
- top_value: app.bsky.embed.external
- top_rate: 0.4622
- cardinality: 5
- entropy: 1.717
- entropy_ratio: 0.7394
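A minimal sketch of the `embed_type` encoding, with nulls mapped to an explicit level. The label `"none"` is a made-up placeholder for missing embeds, and the column order is arbitrary.

```python
from typing import List, Optional

# "none" is a hypothetical label for null embed_type; the other five values
# are the ones listed in the embed_type table of this report.
LEVELS = [
    "none",
    "app.bsky.embed.external",
    "app.bsky.embed.images",
    "app.bsky.embed.record",
    "app.bsky.embed.video",
    "app.bsky.embed.recordWithMedia",
]


def one_hot(embed_type: Optional[str]) -> List[int]:
    """Encode one embed_type value as a 0/1 vector over LEVELS."""
    level = "none" if embed_type is None else embed_type
    if level not in LEVELS:
        raise ValueError(f"unexpected embed_type: {embed_type!r}")
    return [1 if level == name else 0 for name in LEVELS]
```

For linear models, one of the six columns is usually dropped to avoid perfect collinearity; `"none"` is a natural reference level because it is the most common.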
mentions
text foreign_key duplicates short_text one_word
This column stores a serialized JSON array of mention IDs (hex tokens) attached to each record, but 98,659 of 101,040 rows hold the empty list `[]`, giving a 0.98 duplicate rate and 0.996 one-word rate. When mentions do appear, they are almost always single-element arrays referencing 16-character hex IDs, with the most-cited ID occurring only 16 times. Effectively a sparse foreign-key list dominated by absence. Treatment: parse the JSON array and explode to a mention-id join table, or collapse to a binary has_mentions flag given the 98% empty rate.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 1,921
- len_min: 2
- len_max: 420
- len_mean: 2.67
- len_median: 2
- len_p95: 2
- word_mean: 1.012
- word_median: 1
- n_empty: 0
- n_duplicates: 99,119
- duplicate_rate: 0.981
- vocab_size: 660
- readability_flesch_mean: 0.702
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9962
- allcaps_rate: 3.959e-05
- boilerplate_rate: 0
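Both treatments for `mentions` are sketched below: exploding the JSON arrays into a (post, mentioned id) edge list, and collapsing to a binary flag. The rows and IDs are invented for illustration and assume each value is valid JSON.

```python
import json

# Hypothetical rows; real IDs are 16-character hex tokens.
rows = [
    {"uri_hash": "a" * 16, "mentions": "[]"},
    {"uri_hash": "b" * 16, "mentions": '["0123456789abcdef"]'},
    {"uri_hash": "c" * 16, "mentions": '["0123456789abcdef", "fedcba9876543210"]'},
]

# Explode: one (post, mentioned id) pair per array element.
mention_edges = [
    (row["uri_hash"], mentioned)
    for row in rows
    for mentioned in json.loads(row["mentions"])
]

# Collapse: 1 if the post mentions anyone, else 0.
has_mentions = {
    row["uri_hash"]: int(len(json.loads(row["mentions"])) > 0) for row in rows
}
```

The edge list supports graph-style questions (who is mentioned most), while the flag is the cheaper option when mentions are only a modelling feature.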
links
text metadata duplicates short_text one_word
This column stores JSON-encoded arrays of URLs, but 96,211 of 101,040 rows hold the empty literal "[]", driving a 96.3% duplicate rate and a median length of 2 characters. Only 3,771 unique values exist and just 4.8% contain a URL, so the field is overwhelmingly absent rather than informative. That share is well below has_link (18.0%), whose 18,140 positives equal the app.bsky.embed.external count, so has_link appears to track external embeds rather than this array. When populated, entries point to radio/streaming sites (BBC, Radio France, shoutcast), suggesting a sparse list-of-links attribute. Treatment: parse the JSON arrays, derive a has_links boolean and link count, and skip the raw string for modelling.
- n: 101,040
- nulls: 0 (0.0%)
- unique: 3,771
- len_min: 2
- len_max: 266
- len_mean: 4.95
- len_median: 2
- len_p95: 2
- word_mean: 1.003
- word_median: 1
- n_empty: 0
- n_duplicates: 97,269
- duplicate_rate: 0.9627
- vocab_size: 904
- readability_flesch_mean: -20.11
- emoji_rate: 0
- url_rate: 0.04779
- one_word_rate: 0.9978
- allcaps_rate: 0
- boilerplate_rate: 0
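The derived features suggested for `links` could be computed as in this sketch, which also extracts the set of domains since the populated rows cluster on a few sites. The function name and the example URLs are placeholders, and each value is assumed to be a valid JSON array of absolute URLs.

```python
import json
from urllib.parse import urlparse


def link_features(raw: str) -> dict:
    """Derive a flag, a count, and the distinct domains from one links value."""
    urls = json.loads(raw)
    domains = sorted({urlparse(url).netloc.lower() for url in urls})
    return {"has_links": bool(urls), "n_links": len(urls), "domains": domains}


empty = link_features("[]")
populated = link_features(
    '["https://example.com/a", "https://Example.com/b", "https://example.org/"]'
)
```

Lowercasing the host keeps `Example.com` and `example.com` from being counted as separate domains.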