bsky-firehose-dec-2025 · saturn notebook

Overview

Source: /home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv

Saturn profiled 101,040 rows across 19 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

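# Notebook shell escape: install the Saturn CLI into the current environment.
!pip install saturn-dissect
import subprocess

# Profile the CSV, pass a findings path (--findings), and request the opt-in LLM prose (--llm).
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv",
    "--findings", "bsky-firehose-dec-2025.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise if the CLI exits with a non-zero status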

Summary confidence: high

This dataset captures 101,040 anonymized Bluesky firehose posts from late December 2025, with 19 columns covering post hashes, authorship, timestamps, content text, embeds, hashtags, mentions, links, language, and sentiment. The text column is richly multilingual: English dominates at ~61% of posts, followed by Japanese (~12.6k) and a sizable 'unknown' bucket (~11.5k). Sentiment skews neutral (48.5%), with positive outweighing negative roughly 2:1. The embed flags are heavily zero-inflated: only ~13.6% of posts include images, ~18% include links, and just ~1.3% include video, so most posts are plain text. About 58% of posts have no reply_root_hash, suggesting top-level posts outnumber threaded replies. The most useful first cuts are language mix, sentiment distribution, embed_type composition, and post-length shape via char_count.

citing: row_count · column_count · language.top_values · sentiment.top_values · embed_type.top_values · embed_type.null_rate · has_images.stats.mean · has_link.stats.mean · has_video.stats.mean · reply_root_hash.null_rate · char_count.stats · text.language_counts

Out[4]:

saturn.schema() · 19 columns

column	kind	n	null%	unique	alerts
text	text	101,040	0.0%	95,935	multilingual allcaps
author_did_hash	text	101,040	0.0%	43,998	duplicates short_text one_word
uri_hash	text	101,040	0.0%	101,039	near_unique short_text one_word
reply_parent_hash	text	101,040	57.7%	34,738	short_text null_rate one_word
reply_root_hash	text	101,040	57.7%	21,277	duplicates short_text null_rate one_word
sentiment	categorical	101,040	0.0%	3
sentiment_score	numeric	101,040	0.0%	1,928	outliers
created_at	text	101,040	0.0%	96,576	near_unique one_word allcaps
timestamp	text	101,040	0.0%	101,040	near_unique one_word allcaps
language	categorical	101,040	0.0%	90
char_count	numeric	101,040	0.0%	341
word_count	numeric	101,040	0.0%	79
has_images	numeric	101,040	0.0%	2	high_skew outliers
has_video	numeric	101,040	0.0%	2	high_skew
has_link	numeric	101,040	0.0%	2	outliers
embed_type	categorical	101,040	61.2%	5	null_rate
hashtags	text	101,040	0.0%	10,103	duplicates one_word
mentions	text	101,040	0.0%	1,921	duplicates short_text one_word
links	text	101,040	0.0%	3,771	duplicates short_text one_word

Fig 1.

language · See how heavily English dominates, how Japanese and 'unknown' form a second tier of roughly 11-13% each, and how quickly Korean and the remaining languages tail off.

Top values for language (20 unique shown, of 90 total).
value	count	share
en	61468	60.8%
ja	12607	12.5%
unknown	11481	11.4%
en-US	3617	3.6%
ko	2406	2.4%
de	1821	1.8%
pt	1295	1.3%
es	1153	1.1%
fr	746	0.7%
th	612	0.6%
tr	548	0.5%
nl	525	0.5%
zh	315	0.3%
it	276	0.3%
ru	213	0.2%
fi	193	0.2%
ja-JP	170	0.2%
id	158	0.2%
pl	139	0.1%
el	116	0.1%

Fig 2.

sentiment · Check the neutral-heavy split with positive roughly twice negative.

Top values for sentiment (3 unique shown, of 3 total).
value	count	share
neutral	48981	48.5%
positive	34622	34.3%
negative	17437	17.3%

Fig 3.

embed_type · Compare external links, images, quoted records, and video among posts that carry an embed. Shares in the table are of all 101,040 rows, including the 61.2% with no embed.

Top values for embed_type (5 unique shown, of 5 total).
value	count	share
app.bsky.embed.external	18140	18.0%
app.bsky.embed.images	13768	13.6%
app.bsky.embed.record	5126	5.1%
app.bsky.embed.video	1344	1.3%
app.bsky.embed.recordWithMedia	871	0.9%

Fig 4.

char_count · Look at the right-skewed length distribution — median 68 chars but a long tail out to 525.

Histogram bins for char_count (median: 68.0).
bin	count
1 – 14.1	11042
14.1 – 27.2	12354
27.2 – 40.3	11078
40.3 – 53.4	8177
53.4 – 66.5	7049
66.5 – 79.6	6309
79.6 – 92.7	5146
92.7 – 105.8	4495
105.8 – 118.9	3865
118.9 – 132	3459
132 – 145.1	3244
145.1 – 158.2	2592
158.2 – 171.3	2219
171.3 – 184.4	2065
184.4 – 197.5	1867
197.5 – 210.6	1708
210.6 – 223.7	1624
223.7 – 236.8	1467
236.8 – 249.9	1439
249.9 – 263	1439
263 – 276.1	1563
276.1 – 289.2	1657
289.2 – 302.3	4862
302.3 – 315.4	33
315.4 – 328.5	261
328.5 – 341.6	3
341.6 – 354.7	5
354.7 – 367.8	1
367.8 – 380.9	2
380.9 – 394	9
394 – 407.1	3
407.1 – 420.2	0
420.2 – 433.3	1
433.3 – 446.4	1
446.4 – 459.5	0
459.5 – 472.6	0
472.6 – 485.7	0
485.7 – 498.8	0
498.8 – 511.9	0
511.9 – 525	1

Fig 5.

sentiment_score · Inspect the spike at zero (~48%) and the modest positive lean in the continuous score.

Histogram bins for sentiment_score (median: 0.0).
bin	count
-0.998 – -0.948	247
-0.948 – -0.8981	541
-0.8981 – -0.8481	705
-0.8481 – -0.7982	822
-0.7982 – -0.7482	812
-0.7482 – -0.6983	874
-0.6983 – -0.6483	1081
-0.6483 – -0.5984	1067
-0.5984 – -0.5484	1130
-0.5484 – -0.4985	1234
-0.4985 – -0.4486	1387
-0.4486 – -0.3986	1127
-0.3986 – -0.3487	895
-0.3487 – -0.2987	1120
-0.2987 – -0.2488	1585
-0.2488 – -0.1988	792
-0.1988 – -0.1488	726
-0.1488 – -0.0989	672
-0.0989 – -0.04895	624
-0.04895 – 0.001	48620
0.001 – 0.05095	358
0.05095 – 0.1009	740
0.1009 – 0.1508	623
0.1508 – 0.2008	781
0.2008 – 0.2508	1426
0.2508 – 0.3007	1343
0.3007 – 0.3507	1581
0.3507 – 0.4006	2179
0.4006 – 0.4506	3463
0.4506 – 0.5005	2801
0.5005 – 0.5505	1918
0.5505 – 0.6004	2382
0.6004 – 0.6503	2518
0.6503 – 0.7003	2004
0.7003 – 0.7503	1886
0.7503 – 0.8002	1870
0.8002 – 0.8501	2266
0.8501 – 0.9001	2011
0.9001 – 0.9501	1758
0.9501 – 1	1071

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
text	text	0.0%
author_did_hash	text	0.0%
uri_hash	text	0.0%
reply_parent_hash	text	57.7%
reply_root_hash	text	57.7%
sentiment	categorical	0.0%
sentiment_score	numeric	0.0%
created_at	text	0.0%
timestamp	text	0.0%
language	categorical	0.0%
char_count	numeric	0.0%
word_count	numeric	0.0%
has_images	numeric	0.0%
has_video	numeric	0.0%
has_link	numeric	0.0%
embed_type	categorical	61.2%
hashtags	text	0.0%
mentions	text	0.0%
links	text	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,743 detected strings).
lang	count	share
en	3309	69.8%
ja	656	13.8%
ko	125	2.6%
de	108	2.3%
fr	78	1.6%
pt	78	1.6%
es	71	1.5%
nl	46	1.0%
zh	37	0.8%
th	34	0.7%
tr	33	0.7%
ru	30	0.6%
it	30	0.6%
fi	13	0.3%
pl	12	0.3%
el	11	0.2%
id	11	0.2%
cs	10	0.2%
ar	9	0.2%
vi	7	0.1%
sv	7	0.1%
ca	6	0.1%
eo	4	0.1%
no	3	0.1%
bg	3	0.1%
uk	3	0.1%
sr	3	0.1%
hi	2	0.0%
als	2	0.0%
et	2	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
	sentiment_score	char_count	word_count	has_images	has_video	has_link
sentiment_score	+1.00	-0.01	+0.02	+0.06	-0.02	-0.03
char_count	-0.01	+1.00	+0.86	+0.02	-0.01	+0.11
word_count	+0.02	+0.86	+1.00	+0.01	-0.03	+0.00
has_images	+0.06	+0.02	+0.01	+1.00	-0.04	-0.18
has_video	-0.02	-0.01	-0.03	-0.04	+1.00	-0.05
has_link	-0.03	+0.11	+0.00	-0.18	-0.05	+1.00
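
The negative correlations among the three has_* flags are about what one would expect if the flags never co-occur. As a rough check, and assuming the flags are mutually exclusive (an assumption suggested by the embed_type counts, not something Saturn states), the Pearson correlation of two 0/1 indicators with rates p and q reduces to -sqrt(pq / ((1-p)(1-q))). The helper name below is hypothetical; the rates are the means from the has_* stats tables.

```python
import math

def exclusive_flag_correlation(p: float, q: float) -> float:
    """Pearson correlation of two 0/1 flags that are never both 1.

    With P(A and B) = 0 the covariance is -p*q, and each flag has
    variance p*(1-p), so r = -sqrt(p*q / ((1-p)*(1-q))).
    """
    return -math.sqrt((p * q) / ((1.0 - p) * (1.0 - q)))

# Means copied from the has_images, has_video, and has_link stats tables.
rates = {"has_images": 0.1363, "has_video": 0.0133, "has_link": 0.1795}

pairs = [
    ("has_images", "has_link"),
    ("has_images", "has_video"),
    ("has_video", "has_link"),
]
for a, b in pairs:
    r = exclusive_flag_correlation(rates[a], rates[b])
    print(f"{a} vs {b}: {r:+.3f}")
```

Under that assumption the three values come out near -0.19, -0.05, and -0.05, close to the -0.18, -0.04, and -0.05 in the table, which is sampled and clipped to two decimals.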

text text free_text

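Short user-generated posts (likely social/Bluesky given the bsky.app top value and hashtag/emoji patterns), with a median of 10 words and a maximum length of 525 characters. The length histogram piles up in the 289 to 302 bin (4,862 rows) and thins out sharply above it, which is more consistent with a platform limit near 300 characters than with one at 525. The sampled language detection is heavily English-skewed (3,309 of 4,743 detected strings, 69.8%) but genuinely multilingual, with sizeable Japanese (656) and Korean (125) tails, plus an 18.3% emoji rate, 16.9% all-caps lines, and 19% one-word entries. Note the 5,105 duplicates (5.05%), including spam-like Thai promo and repeated sheep-emoji posts among the top values.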

Treatment: Deduplicate, language-detect and route per language, then tokenize/embed for modelling.

anthropic:claude-opus-4-7 · confidence high
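
The deduplication step in the treatment can be sketched as an order-preserving exact-match filter. This is a minimal illustration, not Saturn's own logic: the function name and the sample strings are hypothetical, and near-duplicate detection (for example after whitespace or case normalisation) is deliberately left out.

```python
def drop_exact_duplicates(texts):
    """Keep the first occurrence of each string, preserving input order."""
    seen = set()
    kept = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept

# Hypothetical sample; the real column has 101,040 rows.
sample = ["gm", "🐑", "gm", "hello world", "🐑"]
unique = drop_exact_duplicates(sample)

# Same definition the stats table appears to use: (n - unique) / n.
duplicate_rate = 1 - len(unique) / len(sample)
print(unique, duplicate_rate)
```

Applied to the figures in the stats table, the same ratio gives (101,040 - 95,935) / 101,040, which matches the reported duplicate_rate of 0.05052.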

Out[14]:

saturn.columns["text"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	95,935
len_min	1
len_max	525
len_mean	97.63
len_median	68
len_p95	290
word_mean	14.23
word_median	10
n_empty	0
n_duplicates	5,105
duplicate_rate	0.05052
vocab_size	77,183
readability_flesch_mean	64.09
emoji_rate	0.1832
url_rate	0.07586
one_word_rate	0.1899
allcaps_rate	0.1691
boilerplate_rate	0.001049
alert: multilingual	31 languages detected in sample
alert: allcaps	16.9% rows are all-caps

Fig 9.

Character-length distribution for text.

Character-length distribution for text (mean: 97.63).
chars	count
1 – 14	11042
14 – 27	12354
27 – 40	11078
40 – 53	8177
53 – 66	7049
66 – 80	6309
80 – 93	5146
93 – 106	4495
106 – 119	3865
119 – 132	3459
132 – 145	3244
145 – 158	2592
158 – 171	2219
171 – 184	2065
184 – 198	1867
198 – 211	1708
211 – 224	1624
224 – 237	1467
237 – 250	1439
250 – 263	1439
263 – 276	1563
276 – 289	1657
289 – 302	4862
302 – 315	33
315 – 328	261
328 – 342	3
342 – 355	5
355 – 368	1
368 – 381	2
381 – 394	9
394 – 407	3
407 – 420	0
420 – 433	1
433 – 446	1
446 – 460	0
460 – 473	0
473 – 486	0
486 – 499	0
499 – 512	0
512 – 525	1

author_did_hash text foreign_key

Fixed 16-character single-token hex strings, almost certainly hashed author DIDs acting as a pseudonymous user identifier. Across 101,040 rows there are only 43,998 unique values and a 56.5% duplicate rate, with the top author appearing 1,016 times — so this is a foreign-key-style author handle, not a per-row id. No nulls or empties, and length is constant at 16.

Treatment: Treat as a categorical author key; left-join on this to author-level features rather than feeding the raw hash into a model.

anthropic:claude-opus-4-7 · confidence high
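
One way to read the treatment is to build an author-level table keyed on the hash and left-join it back onto the posts. The sketch below uses post count as the only author feature; the function names, the `author_post_count` field, and the sample hashes are all hypothetical.

```python
from collections import Counter

def posts_per_author(rows):
    """Build an author-level feature table: post count per author hash."""
    return Counter(row["author_did_hash"] for row in rows)

def left_join_author_features(rows, author_counts):
    """Attach the author-level count to every post, keeping all posts."""
    return [
        {**row, "author_post_count": author_counts.get(row["author_did_hash"], 0)}
        for row in rows
    ]

# Hypothetical 16-character hashes, for illustration only.
posts = [
    {"author_did_hash": "a1b2c3d4e5f60718", "text": "first"},
    {"author_did_hash": "a1b2c3d4e5f60718", "text": "second"},
    {"author_did_hash": "0f9e8d7c6b5a4938", "text": "third"},
]
counts = posts_per_author(posts)
enriched = left_join_author_features(posts, counts)
print(enriched[0]["author_post_count"], enriched[2]["author_post_count"])
```

The raw hash never reaches the model this way; only features aggregated per author do.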

Out[17]:

saturn.columns["author_did_hash"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	43,998
len_min	16
len_max	16
len_mean	16
len_median	16
len_p95	16
word_mean	1
word_median	1
n_empty	0
n_duplicates	57,042
duplicate_rate	0.5645
vocab_size	13,938
readability_flesch_mean	68.35
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0003464
boilerplate_rate	0
alert: duplicates	56.5% duplicate strings
alert: short_text	95th-percentile length under 20 chars
alert: one_word	100.0% rows are a single word

Fig 10.

Character-length distribution for author_did_hash.

Character-length distribution for author_did_hash (mean: 16.0).
chars	count
16	101040

uri_hash text identifier

Fixed 16-character single-token strings with 101,039 unique values across 101,040 rows and exactly one null, almost certainly hex digests (16 hex chars = 64-bit hash) of URIs, used as row identifiers. The column is effectively a primary key: one_word_rate is 1.0, length is exactly 16 at min/median/max, and duplicate_rate is 0.0 among the non-null values. No analytic signal lives here beyond identity.

Treatment: Drop from modelling; retain only as a join key or deduplication handle, and deal with the single null row before relying on it as a key.

anthropic:claude-opus-4-7 · confidence high

Out[20]:

saturn.columns["uri_hash"].stats

stat	value
n	101,040
nulls	1 (0.0%)
unique	101,039
len_min	16
len_max	16
len_mean	16
len_median	16
len_p95	16
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	69.61
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0005245
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: short_text	95th-percentile length under 20 chars
alert: one_word	100.0% rows are a single word

Fig 11.

Character-length distribution for uri_hash.

Character-length distribution for uri_hash (mean: 16.0).
chars	count
16	101039

reply_parent_hash text foreign_key

Every non-null value is a single 16-character token (len_min=len_max=16, one_word_rate=1.0), strongly indicating a hex hash identifier pointing to a parent post. 57.67% of rows are null, consistent with a reply-only field where most posts are top-level rather than replies. Among the 42,770 populated rows there are 34,738 unique hashes and an 18.78% duplicate rate (8,032 repeats), so some parents attract many replies (top hash appears 121 times).

Treatment: Treat as a foreign key to the parent post; left-join on this hash and ignore for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[23]:

saturn.columns["reply_parent_hash"].stats

stat	value
n	101,040
nulls	58,270 (57.7%)
unique	34,738
len_min	16
len_max	16
len_mean	16
len_median	16
len_p95	16
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,032
duplicate_rate	0.1878
vocab_size	17,415
readability_flesch_mean	71.73
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0007949
boilerplate_rate	0
alert: short_text	95th-percentile length under 20 chars
alert: null_rate	57.7% null
alert: one_word	100.0% rows are a single word
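
The duplicate figures for this column are easiest to read as arithmetic over the populated rows only. The short check below reproduces n_duplicates and duplicate_rate from n, nulls, and unique; the function name is hypothetical, and treating the denominator as the non-null row count is an inference from the reported numbers rather than documented Saturn behaviour.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (populated rows, repeats beyond first occurrence, duplicate rate)."""
    n_populated = n_rows - n_nulls
    n_duplicates = n_populated - n_unique
    return n_populated, n_duplicates, n_duplicates / n_populated

# Figures copied from the reply_parent_hash and reply_root_hash stats tables.
parent = duplicate_stats(n_rows=101_040, n_nulls=58_270, n_unique=34_738)
root = duplicate_stats(n_rows=101_040, n_nulls=58_270, n_unique=21_277)

print(parent[0], parent[1], round(parent[2], 4))
print(root[0], root[1], round(root[2], 4))
```

Both columns share the same 42,770 populated rows; the root hash repeats far more often because many replies point at the same thread root.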

Fig 12.

Character-length distribution for reply_parent_hash.

Character-length distribution for reply_parent_hash (mean: 16.0).
chars	count
16	42770

reply_root_hash text foreign_key

This is a 16-character hex hash identifying the root post of a reply thread, with every value being a single token of fixed length 16. About 57.67% of rows are null (likely top-level posts with no reply root) and 50.25% of the non-null values are duplicates across 21,277 unique hashes, consistent with many replies sharing the same thread root.

Treatment: Left-join on this id to the thread-root post table; do not feature-encode.

anthropic:claude-opus-4-7 · confidence high

Out[26]:

saturn.columns["reply_root_hash"].stats

stat	value
n	101,040
nulls	58,270 (57.7%)
unique	21,277
len_min	16
len_max	16
len_mean	16
len_median	16
len_p95	16
word_mean	1
word_median	1
n_empty	0
n_duplicates	21,493
duplicate_rate	0.5025
vocab_size	12,498
readability_flesch_mean	77.23
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0008183
boilerplate_rate	0
alert: duplicates	50.3% duplicate strings
alert: short_text	95th-percentile length under 20 chars
alert: null_rate	57.7% null
alert: one_word	100.0% rows are a single word

Fig 13.

Character-length distribution for reply_root_hash.

Character-length distribution for reply_root_hash (mean: 16.0).
chars	count
16	42770

sentiment categorical label

A three-class sentiment label across 101,040 rows with no nulls and only 3 unique values: neutral, positive, negative. The distribution is uneven: neutral leads at 48.5% (48,981), followed by positive at 34.3% (34,622) and negative at 17.3% (17,437), so negatives are roughly half as common as positives. An entropy ratio of 0.93 indicates the classes are reasonably spread but not balanced.

Treatment: Use as a categorical target; consider class weighting to offset the under-represented negative class.

anthropic:claude-opus-4-7 · confidence high
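
For the class-weighting suggestion, one common heuristic is to weight each class by n / (k * count), the same form as scikit-learn's "balanced" class weights. The sketch below applies it to the counts from the sentiment table; choosing this particular heuristic is an assumption, not something the report prescribes.

```python
# Counts copied from the sentiment top-values table.
counts = {"neutral": 48_981, "positive": 34_622, "negative": 17_437}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: rarer classes get proportionally larger weights.
weights = {label: total / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts the negative class ends up weighted a little under twice the positive class, mirroring the roughly 2:1 imbalance between them.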

Out[29]:

saturn.columns["sentiment"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	3
top_value	neutral
top_rate	0.4848
cardinality	3
entropy	1.473
entropy_ratio	0.9295
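
The entropy and entropy_ratio rows can be reproduced from the class counts. The sketch below assumes Shannon entropy in bits, normalised by log2 of the cardinality; that definition is inferred from the reported values (1.473 and 0.9295), not taken from Saturn's documentation, and the function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy of a count vector, in bits."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed classes."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# neutral, positive, negative counts from the sentiment table.
sentiment_counts = [48_981, 34_622, 17_437]
print(round(entropy_bits(sentiment_counts), 3), round(entropy_ratio(sentiment_counts), 4))
```

A ratio of 1.0 would mean three equally sized classes; a value near 0.93 reflects the under-represented negative class.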

Fig 14.

Top values for sentiment.

Top values for sentiment (3 unique shown, of 3 total).
value	count	share
neutral	48981	48.5%
positive	34622	34.3%
negative	17437	17.3%

sentiment_score numeric feature

This is a bounded sentiment polarity score in [-0.998, 1.0], typical of lexicon- or model-based sentiment scoring. The reported skew (0.019) and kurtosis (0.018) are near zero, but the distribution is heavily zero-inflated: 47.8% of rows are exactly 0 and the median is 0, suggesting many neutral or unscoreable texts. The non-zero mass leans positive (mean 0.107, q3 0.402). Because q1 is 0 and q3 is 0.402, the 1.5 IQR fences sit at -0.603 and 1.005, so all 5,763 flagged outliers (5.7%) are strongly negative scores; no score can exceed the upper fence.

Treatment: Treat zeros as a separate 'neutral' indicator and use the raw score as a feature; no transform is needed given the bounded range, and the outlier alert is better read as 'strongly negative' than as anomalous.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["sentiment_score"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	1,928
min	-0.998
max	1
mean	0.1074
median	0
std	0.4104
q1	0
q3	0.402
iqr	0.402
skew	0.01861
kurtosis	0.01774
n_outliers	5,763
outlier_rate	0.05704
zero_rate	0.478
alert: outliers	5.7% rows beyond 1.5 IQR
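
The outlier alert uses a 1.5 IQR rule, so the fences follow directly from q1 and q3 in the table above. The small check below, with a hypothetical helper name, shows where those fences fall relative to the observed min and max.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1, q3, min, and max copied from the sentiment_score stats table.
lower, upper = tukey_fences(q1=0.0, q3=0.402)
score_min, score_max = -0.998, 1.0

print(f"fences: {lower:.3f} .. {upper:.3f}")
print("values can fall below the lower fence:", score_min < lower)
print("values can exceed the upper fence:", score_max > upper)
```

Because q1 coincides with the zero spike, the fences are asymmetric around the data: only scores below about -0.603 can be flagged, and nothing on the positive side can.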

Fig 15.

Distribution of sentiment_score. Vertical dash marks the median.

Histogram bins for sentiment_score (median: 0.0).
bin	count
-0.998 – -0.948	247
-0.948 – -0.8981	541
-0.8981 – -0.8481	705
-0.8481 – -0.7982	822
-0.7982 – -0.7482	812
-0.7482 – -0.6983	874
-0.6983 – -0.6483	1081
-0.6483 – -0.5984	1067
-0.5984 – -0.5484	1130
-0.5484 – -0.4985	1234
-0.4985 – -0.4486	1387
-0.4486 – -0.3986	1127
-0.3986 – -0.3487	895
-0.3487 – -0.2987	1120
-0.2987 – -0.2488	1585
-0.2488 – -0.1988	792
-0.1988 – -0.1488	726
-0.1488 – -0.0989	672
-0.0989 – -0.04895	624
-0.04895 – 0.001	48620
0.001 – 0.05095	358
0.05095 – 0.1009	740
0.1009 – 0.1508	623
0.1508 – 0.2008	781
0.2008 – 0.2508	1426
0.2508 – 0.3007	1343
0.3007 – 0.3507	1581
0.3507 – 0.4006	2179
0.4006 – 0.4506	3463
0.4506 – 0.5005	2801
0.5005 – 0.5505	1918
0.5505 – 0.6004	2382
0.6004 – 0.6503	2518
0.6503 – 0.7003	2004
0.7003 – 0.7503	1886
0.7503 – 0.8002	1870
0.8002 – 0.8501	2266
0.8501 – 0.9001	2011
0.9001 – 0.9501	1758
0.9501 – 1	1071

created_at text timestamp

This is a creation timestamp column stored as ISO 8601 strings (lengths 20-35, single-token), not parsed datetimes. Values are near-unique (96576/101040) yet 4464 duplicates exist and the top values cluster tightly around 2025-12-24T05:00, suggesting a narrow ingestion window or batch insert. Format is inconsistent across rows, mixing '+00:00', '.000Z', and microsecond-precision 'Z' suffixes, which will break naive string sorting.

Treatment: Parse to a normalized UTC datetime before any temporal analysis or joins.

anthropic:claude-opus-4-7 · confidence high
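
A minimal normalisation sketch for the mixed formats, using only the standard library. The sample strings are hypothetical stand-ins for the three suffix styles described above, and treating timezone-naive values as UTC is an assumption. On Python versions before 3.11, `datetime.fromisoformat` rejects a bare 'Z' and fractional seconds that are not exactly 3 or 6 digits, so the longest values in this column may need extra handling.

```python
from datetime import datetime, timezone

def parse_created_at(value: str) -> datetime:
    """Parse an ISO 8601 string and normalise it to an aware UTC datetime."""
    text = value.strip()
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"  # older fromisoformat() rejects a bare 'Z'
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)

# Hypothetical values covering the suffix styles seen in the column.
samples = [
    "2025-12-24T05:00:00Z",
    "2025-12-24T05:00:00.500Z",
    "2025-12-24T05:00:00+00:00",
]

by_text = sorted(samples)                        # lexicographic order
by_time = sorted(samples, key=parse_created_at)  # chronological order
print(by_text)
print(by_time)
```

The two orderings differ: as strings, the value ending in a bare 'Z' sorts last even though the '.500Z' value is half a second later.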

Out[35]:

saturn.columns["created_at"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	96,576
len_min	20
len_max	35
len_mean	24.34
len_median	24
len_p95	27
word_mean	1
word_median	1
n_empty	0
n_duplicates	4,464
duplicate_rate	0.04418
vocab_size	19,720
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	95.6% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 16.

Character-length distribution for created_at.

Character-length distribution for created_at (mean: 24.34), one row per observed length.
chars	count
20	2195
22	1
23	3
24	88967
25	2823
26	1
27	2202
28	118
29	1342
30	61
32	3296
33	28
35	3

timestamp text timestamp

This is an ISO-8601 timestamp column stored as text, with every one of the 101040 values unique and exactly 26 characters long. Sampled values cluster on 2025-12-23 and 2025-12-24, suggesting a narrow capture window rather than a broad historical range. The 'allcaps' and 'one_word' alerts are artefacts of the ISO format (the literal 'T' separator and no whitespace), not a data quality issue.

Treatment: Parse to datetime and derive features (hour, day, delta) instead of using as text.

anthropic:claude-opus-4-7 · confidence high
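
The derived features named in the treatment (hour, day, delta) might look like the sketch below. It assumes the 26-character values follow the `YYYY-MM-DDTHH:MM:SS.ffffff` layout with no timezone suffix, which matches the fixed length but is not confirmed by the report; the sample values and function name are hypothetical.

```python
from datetime import datetime

def time_features(timestamps):
    """Sort timestamps and derive hour, weekday, and gap to the previous row."""
    parsed = sorted(datetime.fromisoformat(t) for t in timestamps)
    features = []
    previous = None
    for ts in parsed:
        delta = (ts - previous).total_seconds() if previous is not None else None
        features.append({
            "hour": ts.hour,
            "weekday": ts.weekday(),   # Monday is 0
            "delta_seconds": delta,    # None for the first row
        })
        previous = ts
    return features

# Hypothetical 26-character values inside the capture window.
rows = ["2025-12-24T00:00:00.250000", "2025-12-23T23:59:59.500000"]
features = time_features(rows)
print(features)
```

With a capture window this narrow, the day-level features carry little variation; hour and inter-arrival delta are likely to be the more informative ones.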

Out[38]:

saturn.columns["timestamp"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	101,040
len_min	26
len_max	26
len_mean	26
len_median	26
len_p95	26
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 17.

Character-length distribution for timestamp.

Character-length distribution for timestamp (mean: 26.0).
chars	count
26	101040

language categorical feature

Language tag of each record, using ISO-style codes across 90 distinct values with no nulls. English dominates at 60.8% of rows, followed by Japanese (12,607) and a sizeable 'unknown' bucket (11,481) that signals missing-data leakage into the category itself. Note the inconsistent granularity: bare 'en' coexists with locale-specific 'en-US' (3,617), so codes need normalisation before grouping.

Treatment: Normalise locale codes (collapse en-US into en and ja-JP into ja), treat 'unknown' as missing, then one-hot or target-encode.

anthropic:claude-opus-4-7 · confidence high
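
The normalisation step can be sketched as reducing each tag to its primary subtag and mapping 'unknown' to a missing value. The function name is hypothetical, and collapsing every regional or script variant is a simplification that may not suit all languages (script variants of zh, for example, could be worth keeping apart).

```python
from collections import Counter

def normalise_language(tag):
    """Collapse locale variants to the primary subtag; map 'unknown' to None."""
    if tag is None:
        return None
    code = tag.strip().lower()
    if code in ("", "unknown"):
        return None
    return code.replace("_", "-").split("-", 1)[0]

# Counts copied from the language top-values table.
top_values = {"en": 61_468, "ja": 12_607, "unknown": 11_481, "en-US": 3_617, "ja-JP": 170}

merged = Counter()
for tag, count in top_values.items():
    merged[normalise_language(tag)] += count

print(merged["en"], merged["ja"], merged[None])
```

After merging, English rises from 61,468 to 65,085 rows and Japanese from 12,607 to 12,777, while the 11,481 'unknown' rows are tracked as missing.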

Out[41]:

saturn.columns["language"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	90
top_value	en
top_rate	0.6084
cardinality	90
entropy	2.178
entropy_ratio	0.3356

Fig 18.

Top values for language.

Top values for language (20 unique shown, of 90 total).
value	count	share
en	61468	60.8%
ja	12607	12.5%
unknown	11481	11.4%
en-US	3617	3.6%
ko	2406	2.4%
de	1821	1.8%
pt	1295	1.3%
es	1153	1.1%
fr	746	0.7%
th	612	0.6%
tr	548	0.5%
nl	525	0.5%
zh	315	0.3%
it	276	0.3%
ru	213	0.2%
fi	193	0.2%
ja-JP	170	0.2%
id	158	0.2%
pl	139	0.1%
el	116	0.1%

char_count numeric feature

This is the per-row character count of the text column: its range (1 to 525), median (68), and mean (97.6) match that column's len_min, len_max, len_median, and len_mean. The distribution is right-skewed (skew 1.02) with a wide IQR of 113, but only 0.29% of values flag as outliers and there are no nulls or zeros. With just 341 unique integer values across 101,040 rows, the field is discrete and well-behaved.

Treatment: Consider a log or sqrt transform before regression to tame the right skew.

anthropic:claude-opus-4-7 · confidence high

Out[44]:

saturn.columns["char_count"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	341
min	1
max	525
mean	97.63
median	68
std	86.05
q1	30
q3	143
iqr	113
skew	1.018
kurtosis	-0.05733
n_outliers	289
outlier_rate	0.00286
zero_rate	0

Fig 19.

Distribution of char_count. Vertical dash marks the median.

Histogram bins for char_count (median: 68.0).
bin	count
1 – 14.1	11042
14.1 – 27.2	12354
27.2 – 40.3	11078
40.3 – 53.4	8177
53.4 – 66.5	7049
66.5 – 79.6	6309
79.6 – 92.7	5146
92.7 – 105.8	4495
105.8 – 118.9	3865
118.9 – 132	3459
132 – 145.1	3244
145.1 – 158.2	2592
158.2 – 171.3	2219
171.3 – 184.4	2065
184.4 – 197.5	1867
197.5 – 210.6	1708
210.6 – 223.7	1624
223.7 – 236.8	1467
236.8 – 249.9	1439
249.9 – 263	1439
263 – 276.1	1563
276.1 – 289.2	1657
289.2 – 302.3	4862
302.3 – 315.4	33
315.4 – 328.5	261
328.5 – 341.6	3
341.6 – 354.7	5
354.7 – 367.8	1
367.8 – 380.9	2
380.9 – 394	9
394 – 407.1	3
407.1 – 420.2	0
420.2 – 433.3	1
433.3 – 446.4	1
446.4 – 459.5	0
459.5 – 472.6	0
472.6 – 485.7	0
485.7 – 498.8	0
498.8 – 511.9	0
511.9 – 525	1

word_count numeric feature

This is a per-row word count, ranging from 0 to 83 with a median of 10 and mean of 14.67, indicating most entries are short snippets rather than long documents. The distribution is right-skewed (skew 1.21) with a wide IQR of 19 and ~2.85% outliers, suggesting a long tail of unusually verbose rows. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) confirm it's a bounded discrete count with virtually no empty texts.

Treatment: Apply log1p (the minimum is 0, so a plain log is undefined for some rows) or bin before modelling to dampen the right skew.

anthropic:claude-opus-4-7 · confidence high
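
Both options in the treatment are short to express. The sketch below uses `math.log1p`, which is defined at zero, and bins on the q1, median, and q3 values from the stats table; using quartiles as bin edges is an illustrative choice, not something the report specifies, and the names are hypothetical.

```python
import math
from bisect import bisect_left

def log1p_word_count(count):
    """log(1 + count); safe for the rows where word_count is 0."""
    return math.log1p(count)

# q1, median, and q3 from the word_count stats table.
QUARTILE_EDGES = (3, 10, 22)

def word_count_bucket(count, edges=QUARTILE_EDGES):
    """Bucket index: 0 for <=3, 1 for 4-10, 2 for 11-22, 3 for >22."""
    return bisect_left(edges, count)

for count in (0, 3, 10, 22, 83):
    print(count, round(log1p_word_count(count), 3), word_count_bucket(count))
```

The log1p transform keeps the ordering of rows while compressing the long tail; the bucketed form trades resolution for robustness to the few very long posts.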

Out[47]:

saturn.columns["word_count"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	79
min	0
max	83
mean	14.67
median	10
std	14.22
q1	3
q3	22
iqr	19
skew	1.209
kurtosis	0.699
n_outliers	2,882
outlier_rate	0.02852
zero_rate	0.0006037

Fig 20.

Distribution of word_count. Vertical dash marks the median.

Histogram bins for word_count (median: 10.0).
bin	count
0 – 2.075	21450
2.075 – 4.15	9307
4.15 – 6.225	8043
6.225 – 8.3	7043
8.3 – 10.38	6459
10.38 – 12.45	5882
12.45 – 14.53	4926
14.53 – 16.6	4074
16.6 – 18.68	3582
18.68 – 20.75	3218
20.75 – 22.83	2640
22.83 – 24.9	2512
24.9 – 26.98	2280
26.98 – 29.05	3311
29.05 – 31.13	1923
31.13 – 33.2	1805
33.2 – 35.28	1633
35.28 – 37.35	1314
37.35 – 39.43	1144
39.43 – 41.5	1057
41.5 – 43.58	1076
43.58 – 45.65	1046
45.65 – 47.73	994
47.73 – 49.8	965
49.8 – 51.88	918
51.88 – 53.95	805
53.95 – 56.03	910
56.03 – 58.1	334
58.1 – 60.18	189
60.18 – 62.25	88
62.25 – 64.33	31
64.33 – 66.4	23
66.4 – 68.48	21
68.48 – 70.55	7
70.55 – 72.62	10
72.62 – 74.7	14
74.7 – 76.78	3
76.78 – 78.85	1
78.85 – 80.93	0
80.93 – 83	2

has_images numeric feature

This is a binary indicator (only 2 unique values, min 0, max 1) flagging whether a record has images. It's heavily imbalanced: 86.4% zeros and a mean of 0.136, meaning only ~13.6% of rows have images. The 'outliers' alert simply reflects the minority class rather than anomalous values.

Treatment: Treat as a boolean flag; no transformation needed, but watch class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[50]:

saturn.columns["has_images"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	2
min	0
max	1
mean	0.1363
median	0
std	0.3431
q1	0
q3	0
iqr	0
skew	2.12
kurtosis	2.497
n_outliers	13,768
outlier_rate	0.1363
zero_rate	0.8637
alert: high_skew	skew=+2.12
alert: outliers	13.6% rows beyond 1.5 IQR

Fig 21.

Distribution of has_images. Vertical dash marks the median.

Histogram bins for has_images (median: 0.0).
value	count
0	87272
1	13768

has_video numeric feature

Binary flag indicating whether a record has an associated video, stored as 0/1 with no nulls across 101,040 rows. The positive class is rare: 98.67% are zero and only 1.33% are one, producing extreme skew (8.50) and kurtosis (70.19). The 1,344 ones are counted as outliers (n_outliers) purely because of the imbalance, not because they are anomalous; unlike has_images and has_link, no outliers alert is raised for this column.

Treatment: Treat as a boolean indicator; expect minimal signal given 98.67% zeros and consider class-imbalance handling if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[53]:

saturn.columns["has_video"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	2
min	0
max	1
mean	0.0133
median	0
std	0.1146
q1	0
q3	0
iqr	0
skew	8.497
kurtosis	70.19
n_outliers	1,344
outlier_rate	0.0133
zero_rate	0.9867
alert: high_skew	skew=+8.50

Fig 22.

Distribution of has_video. Vertical dash marks the median.

Histogram bins for has_video (median: 0.0).
value	count
0	99696
1	1344

has_link numeric feature

Binary 0/1 flag indicating whether a record contains a link, with 82.0% zeros and a mean of 0.180 across 101,040 rows. The 18,140 'outliers' are simply the minority positive class, not anomalies — IQR-based outlier detection misfires on binary data. No nulls, exactly 2 unique values.

Treatment: Use directly as a boolean feature; ignore the outlier alert.

anthropic:claude-opus-4-7 · confidence high

Out[56]:

saturn.columns["has_link"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	2
min	0
max	1
mean	0.1795
median	0
std	0.3838
q1	0
q3	0
iqr	0
skew	1.67
kurtosis	0.7888
n_outliers	18,140
outlier_rate	0.1795
zero_rate	0.8205
alert: outliers	18.0% rows beyond 1.5 IQR
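
The misfire is easy to reproduce: once more than 75% of a binary column is zero, q1 and q3 are both 0, the IQR is 0, and both fences collapse onto 0, so every 1 counts as an outlier. The sketch below uses a hypothetical 100-row flag with the same 82/18 split as has_link; the quartile method is a standard-library default choice and may differ from Saturn's.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical flag column: 82% zeros, 18% ones.
flags = [0] * 82 + [1] * 18
outliers = iqr_outliers(flags)
print(len(outliers), set(outliers))
```

Every positive row is flagged, which is why n_outliers in the table equals the number of ones.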

Fig 23.

Distribution of has_link. Vertical dash marks the median.

Histogram bins for has_link (median: 0.0).
value	count
0	82900
1	18140

embed_type categorical feature

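This column tags the embed type attached to a Bluesky post, with 5 distinct AT Protocol lexicon values like app.bsky.embed.external and app.bsky.embed.images. 61.15% of rows are null, consistent with most posts having no embed, and among the populated rows external links (46.2%) and images (35.1%) dominate while video (3.4%) and recordWithMedia (2.2%) are rare. The external, images, and video counts (18,140, 13,768, and 1,344) equal the number of ones in has_link, has_images, and has_video, so those flags appear to be derived from this column. An entropy ratio of 0.74 indicates a moderately concentrated but not degenerate distribution.

Treatment: Treat nulls as a 'no embed' category and one-hot encode the resulting 6 levels (the 5 embed types plus 'no embed').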

anthropic:claude-opus-4-7 · confidence high
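
A minimal encoding sketch for this treatment, with nulls mapped to an explicit "none" level. The five embed names come from the top-values table; the "none" label, the level order, and the function name are hypothetical choices.

```python
EMBED_LEVELS = [
    "none",  # hypothetical label for null / missing embed_type
    "app.bsky.embed.external",
    "app.bsky.embed.images",
    "app.bsky.embed.record",
    "app.bsky.embed.video",
    "app.bsky.embed.recordWithMedia",
]

def one_hot_embed(value):
    """Return a 0/1 vector over EMBED_LEVELS; None or '' maps to 'none'."""
    level = value if value else "none"
    if level not in EMBED_LEVELS:
        raise ValueError(f"unexpected embed_type: {level!r}")
    return [1 if level == name else 0 for name in EMBED_LEVELS]

print(one_hot_embed(None))
print(one_hot_embed("app.bsky.embed.video"))
```

Raising on unseen values is a deliberate choice here so that a new embed type in a later capture is noticed rather than silently encoded as all zeros.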

Out[59]:

saturn.columns["embed_type"].stats

stat	value
n	101,040
nulls	61,791 (61.2%)
unique	5
top_value	app.bsky.embed.external
top_rate	0.4622
cardinality	5
entropy	1.717
entropy_ratio	0.7394
alert: null_rate	61.2% null

Fig 24.

Top values for embed_type.

Top values for embed_type (5 unique shown, of 5 total).
value	count	share
app.bsky.embed.external	18140	18.0%
app.bsky.embed.images	13768	13.6%
app.bsky.embed.record	5126	5.1%
app.bsky.embed.video	1344	1.3%
app.bsky.embed.recordWithMedia	871	0.9%

hashtags text feature

This column stores serialized JSON arrays of hashtags extracted from social posts, with 87,318 of 101,040 rows holding an empty list `[]` and only 10,103 distinct values overall (duplicate_rate 0.90). The length stats are dominated by those empty rows: len_median 2 is just the two characters of `[]`, and word_mean 1.38 is averaged over all rows, so neither describes the populated tag lists, which reach 63 characters at the 95th percentile and 1,122 at the maximum. Tags span multiple scripts (Thai, Japanese, German, English), so any text processing must be Unicode-aware. The mix of `#NowPlaying` (115) and `#nowplaying` (85) shows case is not normalized.

Treatment: Parse the JSON list, lowercase, and one-hot or count-encode the most frequent tags; treat `[]` as an explicit no-tag category.

anthropic:claude-opus-4-7 · confidence high
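
One way the treatment could look, sketched with the standard library. The sample rows are made up for illustration (whether stored tags carry a leading `#` is assumed from the examples quoted above), and `parse_tags`, `encode`, and the `no_tag` feature name are hypothetical helpers rather than anything saturn provides.

```python
import json
from collections import Counter

# Hypothetical raw values; real rows come from the hashtags column.
rows = ["[]", '["#NowPlaying"]', '["#nowplaying", "#Musik"]', '["#写真"]']


def parse_tags(raw):
    """Parse the JSON array and lowercase each tag (str.lower is Unicode-aware)."""
    return [tag.lower() for tag in json.loads(raw)]


def encode(tags, vocab):
    """Count-encode tags against a fixed vocab, with an explicit no_tag indicator."""
    features = {"no_tag": int(not tags)}
    for tag in vocab:
        features[f"tag={tag}"] = tags.count(tag)
    return features


parsed = [parse_tags(raw) for raw in rows]
tag_counts = Counter(tag for tags in parsed for tag in tags)

# Keep only the k most frequent tags as features; k = 1 for this tiny sample.
vocab = [tag for tag, _ in tag_counts.most_common(1)]
encoded = [encode(tags, vocab) for tags in parsed]

print(tag_counts)
print(encoded)
```

Lowercasing merges `#NowPlaying` and `#nowplaying` into one key before counting, which is the point of doing it ahead of the frequency cut.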

Out[62]:

saturn.columns["hashtags"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	10,103
len_min	2
len_max	1,122
len_mean	10.38
len_median	2
len_p95	63
word_mean	1.384
word_median	1
n_empty	0
n_duplicates	90,937
duplicate_rate	0.9
vocab_size	7,036
readability_flesch_mean	2.752
emoji_rate	0
url_rate	0
one_word_rate	0.902
allcaps_rate	0.005701
boilerplate_rate	0
alert: duplicates	90.0% duplicate strings
alert: one_word	90.2% rows are a single word

Fig 25.

Character-length distribution for hashtags.

Character-length distribution for hashtags (mean: 10.38).
chars	count
2 – 30	92111
30 – 58	3305
58 – 86	2893
86 – 114	1061
114 – 142	466
142 – 170	316
170 – 198	183
198 – 226	120
226 – 254	90
254 – 282	321
282 – 310	42
310 – 338	28
338 – 366	47
366 – 394	20
394 – 422	5
422 – 450	5
450 – 478	5
478 – 506	6
506 – 534	0
534 – 562	4
562 – 590	5
590 – 618	0
618 – 646	1
646 – 674	1
674 – 702	0
702 – 730	0
730 – 758	1
758 – 786	1
786 – 814	0
814 – 842	1
842 – 870	1
870 – 898	0
898 – 926	0
926 – 954	0
954 – 982	0
982 – 1010	0
1010 – 1038	0
1038 – 1066	0
1066 – 1094	0
1094 – 1122	1

mentions text foreign_key

This column stores a serialized JSON array of mention IDs (hex tokens) attached to each record, but 98,659 of 101,040 rows hold the empty list `[]`, giving a 0.98 duplicate rate and 0.996 one-word rate. When mentions do appear, they are almost always single-element arrays referencing 16-character hex IDs, with the most-cited ID occurring only 16 times. Effectively a sparse foreign-key list dominated by absence.

Treatment: Parse the JSON array and explode to a mention-id join table, or collapse to a binary has_mentions flag given that roughly 98% of rows (98,659 of 101,040) are empty.

anthropic:claude-opus-4-7 · confidence high
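
Both options from the treatment can be sketched in a few lines. The post IDs and hex tokens below are invented for illustration. Only the shape (a JSON array of 16-character hex IDs) follows the profile, and `explode_mentions` and `has_mentions` are hypothetical helper names.

```python
import json

# Hypothetical sample of (post_id, raw mentions string) pairs.
rows = [
    ("post_a", "[]"),
    ("post_b", '["0123456789abcdef"]'),
    ("post_c", '["0123456789abcdef", "fedcba9876543210"]'),
]


def explode_mentions(rows):
    """Return one (post_id, mention_id) pair per mention; empty lists add nothing."""
    return [(post_id, m) for post_id, raw in rows for m in json.loads(raw)]


def has_mentions(raw):
    """Binary flag: 1 if the parsed array is non-empty, else 0."""
    return int(len(json.loads(raw)) > 0)


join_table = explode_mentions(rows)
flags = {post_id: has_mentions(raw) for post_id, raw in rows}

print(join_table)
print(flags)
```

A single-mention string in this shape is 20 characters long and a two-mention string is 40, which lines up with the populated 12 – 23 and 33 – 44 bins in the length histogram for this column.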

Out[65]:

saturn.columns["mentions"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	1,921
len_min	2
len_max	420
len_mean	2.67
len_median	2
len_p95	2
word_mean	1.012
word_median	1
n_empty	0
n_duplicates	99,119
duplicate_rate	0.981
vocab_size	660
readability_flesch_mean	0.702
emoji_rate	0
url_rate	0
one_word_rate	0.9962
allcaps_rate	3.959e-05
boilerplate_rate	0
alert: duplicates	98.1% duplicate strings
alert: short_text	95th-percentile length under 20 chars
alert: one_word	99.6% rows are a single word

Fig 26.

Character-length distribution for mentions.

Character-length distribution for mentions (mean: 2.67).
chars	count
2 – 12	98659
12 – 23	1998
23 – 33	0
33 – 44	180
44 – 54	0
54 – 65	45
65 – 75	0
75 – 86	33
86 – 96	0
96 – 106	20
106 – 117	0
117 – 127	21
127 – 138	0
138 – 148	21
148 – 159	0
159 – 169	18
169 – 180	0
180 – 190	14
190 – 201	20
201 – 211	0
211 – 221	5
221 – 232	0
232 – 242	3
242 – 253	0
253 – 263	0
263 – 274	0
274 – 284	0
284 – 295	0
295 – 305	0
305 – 316	0
316 – 326	0
326 – 336	0
336 – 347	0
347 – 357	0
357 – 368	0
368 – 378	0
378 – 389	0
389 – 399	0
399 – 410	0
410 – 420	3

links text metadata

This column stores JSON-encoded arrays of URLs, but 96,211 of 101,040 rows hold the empty literal "[]", driving a 96.3% duplicate rate and a median length of 2 characters. Only 3,771 unique values exist and just 4.8% contain a URL, so the field is overwhelmingly absent rather than informative. When populated, entries point to radio/streaming sites (BBC, Radio France, shoutcast), suggesting a sparse list-of-links attribute.

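Treatment: Parse the JSON arrays, derive a has_links boolean and link count, and skip the raw string for modelling. Keep the derived flag distinct from the existing has_link column, which marks 18,140 rows (the same count as app.bsky.embed.external) against only 4,829 populated rows here.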

anthropic:claude-opus-4-7 · confidence high
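
A minimal sketch of the derived features, assuming each value is a valid JSON array of URL strings. The example URLs are placeholders and `link_features` is a hypothetical helper name.

```python
import json


def link_features(raw):
    """Derive a has_links flag and a link count from one serialized JSON array."""
    urls = json.loads(raw)
    return {"has_links": int(bool(urls)), "link_count": len(urls)}


# Hypothetical sample values; real rows come from the links column.
samples = [
    "[]",
    '["https://example.com/live"]',
    '["https://example.com/a", "https://example.org/b"]',
]
features = [link_features(raw) for raw in samples]

print(features)
```

Since `[]` accounts for 96,211 rows, both derived columns will be zero for about 95% of the data.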

Out[68]:

saturn.columns["links"].stats

stat	value
n	101,040
nulls	0 (0.0%)
unique	3,771
len_min	2
len_max	266
len_mean	4.95
len_median	2
len_p95	2
word_mean	1.003
word_median	1
n_empty	0
n_duplicates	97,269
duplicate_rate	0.9627
vocab_size	904
readability_flesch_mean	-20.11
emoji_rate	0
url_rate	0.04779
one_word_rate	0.9978
allcaps_rate	0
boilerplate_rate	0
alert: duplicates	96.3% duplicate strings
alert: short_text	95th-percentile length under 20 chars
alert: one_word	99.8% rows are a single word

Fig 27.

Character-length distribution for links.

Character-length distribution for links (mean: 4.95).
chars	count
2 – 9	96211
9 – 15	0
15 – 22	44
22 – 28	293
28 – 35	447
35 – 42	1282
42 – 48	347
48 – 55	301
55 – 61	208
61 – 68	252
68 – 75	218
75 – 81	133
81 – 88	109
88 – 94	185
94 – 101	116
101 – 108	158
108 – 114	169
114 – 121	119
121 – 127	139
127 – 134	84
134 – 141	33
141 – 147	43
147 – 154	40
154 – 160	39
160 – 167	18
167 – 174	19
174 – 180	13
180 – 187	1
187 – 193	5
193 – 200	1
200 – 207	5
207 – 213	0
213 – 220	2
220 – 226	1
226 – 233	0
233 – 240	1
240 – 246	0
246 – 253	0
253 – 259	1
259 – 266	3

How to cite