bsky firehose dec 2025

saturn notebook · generated 2026-04-22

Overview

Source: /home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv

Saturn profiled 101,040 rows across 19 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
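# Install the profiler (IPython shell escape, so run this inside a notebook cell).
!pip install saturn-dissect
import subprocess

# Profile the CSV, write the findings to JSON, and request the opt-in LLM prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv",
        "--findings", "bsky-firehose-dec-2025.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)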

Summary confidence: high

This dataset captures 101,040 anonymized Bluesky firehose posts from late December 2025, with 19 columns covering post hashes, authorship, timestamps, content text, embeds, hashtags, mentions, links, language, and sentiment. The text column is richly multilingual: English dominates at ~61% of posts, followed by Japanese (~12.6k) and a sizable 'unknown' bucket (~11.5k). Sentiment skews neutral (48.5%), with positive outweighing negative roughly 2:1. The media and link flags are heavily zero-inflated: only ~13.6% of posts include images, ~18% include links, and just ~1.3% include video, so most posts are plain text. About 58% of posts have no reply_root_hash, suggesting top-level posts outnumber threaded replies. The most useful first cuts are language mix, sentiment distribution, embed_type composition, and post-length shape via char_count.

citing: row_count · column_count · language.top_values · sentiment.top_values · embed_type.top_values · embed_type.null_rate · has_images.stats.mean · has_link.stats.mean · has_video.stats.mean · reply_root_hash.null_rate · char_count.stats · text.language_counts

Out[4]:

saturn.schema() · 19 columns

column             kind         n        null%  unique   alerts
text               text         101,040  0.0%   95,935   multilingual, allcaps
author_did_hash    text         101,040  0.0%   43,998   duplicates, short_text, one_word
uri_hash           text         101,040  0.0%   101,039  near_unique, short_text, one_word
reply_parent_hash  text         101,040  57.7%  34,738   short_text, null_rate, one_word
reply_root_hash    text         101,040  57.7%  21,277   duplicates, short_text, null_rate, one_word
sentiment          categorical  101,040  0.0%   3
sentiment_score    numeric      101,040  0.0%   1,928    outliers
created_at         text         101,040  0.0%   96,576   near_unique, one_word, allcaps
timestamp          text         101,040  0.0%   101,040  near_unique, one_word, allcaps
language           categorical  101,040  0.0%   90
char_count         numeric      101,040  0.0%   341
word_count         numeric      101,040  0.0%   79
has_images         numeric      101,040  0.0%   2        high_skew, outliers
has_video          numeric      101,040  0.0%   2        high_skew
has_link           numeric      101,040  0.0%   2        outliers
embed_type         categorical  101,040  61.2%  5        null_rate
hashtags           text         101,040  0.0%   10,103   duplicates, one_word
mentions           text         101,040  0.0%   1,921    duplicates, short_text, one_word
links              text         101,040  0.0%   3,771    duplicates, short_text, one_word
Fig 1.
language · See how heavily English dominates and where Japanese, Korean, and 'unknown' sit in the long tail.
Top values for language (20 unique shown, of 90 total).
value    count  share
en       61468  60.8%
ja       12607  12.5%
unknown  11481  11.4%
en-US    3617   3.6%
ko       2406   2.4%
de       1821   1.8%
pt       1295   1.3%
es       1153   1.1%
fr       746    0.7%
th       612    0.6%
tr       548    0.5%
nl       525    0.5%
zh       315    0.3%
it       276    0.3%
ru       213    0.2%
fi       193    0.2%
ja-JP    170    0.2%
id       158    0.2%
pl       139    0.1%
el       116    0.1%
Fig 2.
sentiment · Check the neutral-heavy split with positive roughly twice negative.
Top values for sentiment (3 unique shown, of 3 total).
value     count  share
neutral   48981  48.5%
positive  34622  34.3%
negative  17437  17.3%
Fig 3.
embed_type · Among posts that have an embed, compare external links, images, quoted records, and video shares.
Top values for embed_type (5 unique shown, of 5 total).
value                           count  share (of all rows)
app.bsky.embed.external         18140  18.0%
app.bsky.embed.images           13768  13.6%
app.bsky.embed.record           5126   5.1%
app.bsky.embed.video            1344   1.3%
app.bsky.embed.recordWithMedia  871    0.9%
Fig 4.
char_count · Look at the right-skewed length distribution — median 68 chars but a long tail out to 525.
Histogram bins for char_count (median: 68.0).
bin            count
1 – 14.1       11042
14.1 – 27.2    12354
27.2 – 40.3    11078
40.3 – 53.4    8177
53.4 – 66.5    7049
66.5 – 79.6    6309
79.6 – 92.7    5146
92.7 – 105.8   4495
105.8 – 118.9  3865
118.9 – 132    3459
132 – 145.1    3244
145.1 – 158.2  2592
158.2 – 171.3  2219
171.3 – 184.4  2065
184.4 – 197.5  1867
197.5 – 210.6  1708
210.6 – 223.7  1624
223.7 – 236.8  1467
236.8 – 249.9  1439
249.9 – 263    1439
263 – 276.1    1563
276.1 – 289.2  1657
289.2 – 302.3  4862
302.3 – 315.4  33
315.4 – 328.5  261
328.5 – 341.6  3
341.6 – 354.7  5
354.7 – 367.8  1
367.8 – 380.9  2
380.9 – 394    9
394 – 407.1    3
407.1 – 420.2  0
420.2 – 433.3  1
433.3 – 446.4  1
446.4 – 459.5  0
459.5 – 472.6  0
472.6 – 485.7  0
485.7 – 498.8  0
498.8 – 511.9  0
511.9 – 525    1
Fig 5.
sentiment_score · Inspect the spike at zero (~48%) and the modest positive lean in the continuous score.
Histogram bins for sentiment_score (median: 0.0).
bin                 count
-0.998 – -0.948     247
-0.948 – -0.8981    541
-0.8981 – -0.8481   705
-0.8481 – -0.7982   822
-0.7982 – -0.7482   812
-0.7482 – -0.6983   874
-0.6983 – -0.6483   1081
-0.6483 – -0.5984   1067
-0.5984 – -0.5484   1130
-0.5484 – -0.4985   1234
-0.4985 – -0.4486   1387
-0.4486 – -0.3986   1127
-0.3986 – -0.3487   895
-0.3487 – -0.2987   1120
-0.2987 – -0.2488   1585
-0.2488 – -0.1988   792
-0.1988 – -0.1488   726
-0.1488 – -0.0989   672
-0.0989 – -0.04895  624
-0.04895 – 0.001    48620
0.001 – 0.05095     358
0.05095 – 0.1009    740
0.1009 – 0.1508     623
0.1508 – 0.2008     781
0.2008 – 0.2508     1426
0.2508 – 0.3007     1343
0.3007 – 0.3507     1581
0.3507 – 0.4006     2179
0.4006 – 0.4506     3463
0.4506 – 0.5005     2801
0.5005 – 0.5505     1918
0.5505 – 0.6004     2382
0.6004 – 0.6503     2518
0.6503 – 0.7003     2004
0.7003 – 0.7503     1886
0.7503 – 0.8002     1870
0.8002 – 0.8501     2266
0.8501 – 0.9001     2011
0.9001 – 0.9501     1758
0.9501 – 1          1071
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column             kind         null %
text               text         0.0%
author_did_hash    text         0.0%
uri_hash           text         0.0%
reply_parent_hash  text         57.7%
reply_root_hash    text         57.7%
sentiment          categorical  0.0%
sentiment_score    numeric      0.0%
created_at         text         0.0%
timestamp          text         0.0%
language           categorical  0.0%
char_count         numeric      0.0%
word_count         numeric      0.0%
has_images         numeric      0.0%
has_video          numeric      0.0%
has_link           numeric      0.0%
embed_type         categorical  61.2%
hashtags           text         0.0%
mentions           text         0.0%
links              text         0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 4,743 detected strings).
lang  count  share
en    3309   69.8%
ja    656    13.8%
ko    125    2.6%
de    108    2.3%
fr    78     1.6%
pt    78     1.6%
es    71     1.5%
nl    46     1.0%
zh    37     0.8%
th    34     0.7%
tr    33     0.7%
ru    30     0.6%
it    30     0.6%
fi    13     0.3%
pl    12     0.3%
el    11     0.2%
id    11     0.2%
cs    10     0.2%
ar    9      0.2%
vi    7      0.1%
sv    7      0.1%
ca    6      0.1%
eo    4      0.1%
no    3      0.1%
bg    3      0.1%
uk    3      0.1%
sr    3      0.1%
hi    2      0.0%
als   2      0.0%
et    2      0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
                 sentiment_score  char_count  word_count  has_images  has_video  has_link
sentiment_score  +1.00            -0.01       +0.02       +0.06       -0.02      -0.03
char_count       -0.01            +1.00       +0.86       +0.02       -0.01      +0.11
word_count       +0.02            +0.86       +1.00       +0.01       -0.03      +0.00
has_images       +0.06            +0.02       +0.01       +1.00       -0.04      -0.18
has_video        -0.02            -0.01       -0.03       -0.04       +1.00      -0.05
has_link         -0.03            +0.11       +0.00       -0.18       -0.05      +1.00

text text free_text

Short user-generated posts (likely social/Bluesky given the bsky.app top value and hashtag/emoji patterns), with a median of 10 words and a maximum length of 525 characters. Lengths pile up just below 300 (4,862 posts fall in the 289-302 bin and only 320 rows run longer), which suggests the effective platform limit sits near 300, with 525 as a rare extreme. The sampled language detection is English-heavy (3,309 of 4,743 detected strings, 69.8%) but genuinely multilingual, with sizeable Japanese (656) and Korean (125) tails, plus an 18.3% emoji rate, 16.9% all-caps lines, and 19% one-word entries. Note 5,105 duplicates (5.05%), including spam-like Thai promo and repeated sheep-emoji posts among the top values.

Treatment: Deduplicate, language-detect and route per language, then tokenize/embed for modelling.
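
The deduplication step can be sketched as below. This is a minimal illustration, assuming the posts are already loaded as Python strings; the helper name `dedupe_posts`, the whitespace normalisation, and the sample rows are all hypothetical and not part of Saturn. Because it collapses whitespace before comparing, it may drop slightly more rows than an exact-match duplicate count would.

```python
def dedupe_posts(texts):
    """Return texts with repeats removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split())  # collapse whitespace so trivial variants match
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Hypothetical sample rows, not taken from the dataset.
sample = ["hello  world", "hello world", "🐑", "🐑", "good morning"]
deduped = dedupe_posts(sample)
print(len(sample) - len(deduped), "duplicates dropped")
```

Keeping the first occurrence preserves input order, which matters if the rows are later sorted or joined by timestamp.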

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["text"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 95,935
len_min 1
len_max 525
len_mean 97.63
len_median 68
len_p95 290
word_mean 14.23
word_median 10
n_empty 0
n_duplicates 5,105
duplicate_rate 0.05052
vocab_size 77,183
readability_flesch_mean 64.09
emoji_rate 0.1832
url_rate 0.07586
one_word_rate 0.1899
allcaps_rate 0.1691
boilerplate_rate 0.001049
alert: multilingual · 31 languages detected in sample
alert: allcaps · 16.9% rows are all-caps
Fig 9.
Character-length distribution for text (mean: 97.62657363420428).
chars      count
1 – 14     11042
14 – 27    12354
27 – 40    11078
40 – 53    8177
53 – 66    7049
66 – 80    6309
80 – 93    5146
93 – 106   4495
106 – 119  3865
119 – 132  3459
132 – 145  3244
145 – 158  2592
158 – 171  2219
171 – 184  2065
184 – 198  1867
198 – 211  1708
211 – 224  1624
224 – 237  1467
237 – 250  1439
250 – 263  1439
263 – 276  1563
276 – 289  1657
289 – 302  4862
302 – 315  33
315 – 328  261
328 – 342  3
342 – 355  5
355 – 368  1
368 – 381  2
381 – 394  9
394 – 407  3
407 – 420  0
420 – 433  1
433 – 446  1
446 – 460  0
460 – 473  0
473 – 486  0
486 – 499  0
499 – 512  0
512 – 525  1

author_did_hash text foreign_key

Fixed 16-character single-token hex strings, almost certainly hashed author DIDs acting as a pseudonymous user identifier. Across 101,040 rows there are only 43,998 unique values and a 56.5% duplicate rate, with the top author appearing 1,016 times — so this is a foreign-key-style author handle, not a per-row id. No nulls or empties, and length is constant at 16.

Treatment: Treat as a categorical author key; left-join on this to author-level features rather than feeding the raw hash into a model.
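
One way to read this treatment is to aggregate per-author features first and then left-join them back onto the posts. The sketch below assumes each post is a dict carrying `author_did_hash` and `sentiment_score`; the function names, the chosen aggregates, and the sample rows are illustrative assumptions rather than anything Saturn produces.

```python
from collections import defaultdict


def author_features(posts):
    """Aggregate per-author post count and mean sentiment_score."""
    totals = defaultdict(lambda: [0, 0.0])  # author -> [n_posts, score_sum]
    for post in posts:
        entry = totals[post["author_did_hash"]]
        entry[0] += 1
        entry[1] += post["sentiment_score"]
    return {
        author: {"n_posts": n, "mean_sentiment": total / n}
        for author, (n, total) in totals.items()
    }


def attach_author_features(posts, features):
    """Left join: every post is kept, with author-level columns appended."""
    return [{**post, **features.get(post["author_did_hash"], {})} for post in posts]


# Hypothetical rows; real hashes are 16-character strings.
posts = [
    {"author_did_hash": "a" * 16, "sentiment_score": 0.5},
    {"author_did_hash": "a" * 16, "sentiment_score": -0.5},
    {"author_did_hash": "b" * 16, "sentiment_score": 0.25},
]
features = author_features(posts)
joined = attach_author_features(posts, features)
```

A model then sees `n_posts` and `mean_sentiment` rather than the raw hash, which has no ordinal meaning.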

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["author_did_hash"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 43,998
len_min 16
len_max 16
len_mean 16
len_median 16
len_p95 16
word_mean 1
word_median 1
n_empty 0
n_duplicates 57,042
duplicate_rate 0.5645
vocab_size 13,938
readability_flesch_mean 68.35
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0003464
boilerplate_rate 0
alert: duplicates · 56.5% duplicate strings
alert: short_text · 95th-percentile length under 20 chars
alert: one_word · 100.0% rows are a single word
Fig 10.
Character-length distribution for author_did_hash (mean: 16.0).
chars  count
16     101040

uri_hash text identifier

Fixed 16-character single-token strings with 101,039 unique values across 101,040 rows and a single null, almost certainly hex digests (16 hex chars = 64-bit hash) of URIs, used as row identifiers. The column is effectively a primary key: one_word_rate is 1.0, length is exactly 16 at min/median/max, and duplicate_rate is 0.0. The one null row needs to be dropped or flagged before the column is used as a key. No analytic signal lives here beyond identity.

Treatment: drop from modelling; retain only as a join key or deduplication handle.

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["uri_hash"].stats

stat value
n 101,040
nulls 1 (0.0%)
unique 101,039
len_min 16
len_max 16
len_mean 16
len_median 16
len_p95 16
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 69.61
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0005245
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: short_text · 95th-percentile length under 20 chars
alert: one_word · 100.0% rows are a single word
Fig 11.
Character-length distribution for uri_hash (mean: 16.0).
chars  count
16     101039

reply_parent_hash text foreign_key

Every non-null value is a single 16-character token (len_min=len_max=16, one_word_rate=1.0), strongly indicating a hex hash identifier pointing to a parent post. 57.67% of rows are null, consistent with a reply-only field where most posts are top-level rather than replies. Among populated rows, 34,738 unique hashes cover 42,770 entries (101,040 rows minus 58,270 nulls) with an 18.78% duplicate rate, so some parents attract many replies (top hash appears 121 times).

Treatment: Treat as a foreign key to the parent post; left-join on this hash and ignore for modelling.
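
A left join to the parent post might look like the sketch below. It rests on an assumption the report does not confirm: that `reply_parent_hash` is produced by the same hashing as `uri_hash`, so the two can be matched directly. The function name and sample rows are hypothetical.

```python
def join_parents(posts):
    """Pair every post with its parent record, or None when there is no match."""
    by_uri = {p["uri_hash"]: p for p in posts if p["uri_hash"] is not None}
    pairs = []
    for post in posts:
        parent_hash = post["reply_parent_hash"]
        # Top-level posts have no parent hash; replies may point outside the sample.
        parent = by_uri.get(parent_hash) if parent_hash is not None else None
        pairs.append((post, parent))
    return pairs


# Hypothetical rows: a top-level post, a reply to it, and a reply to an unseen parent.
posts = [
    {"uri_hash": "1111111111111111", "reply_parent_hash": None},
    {"uri_hash": "2222222222222222", "reply_parent_hash": "1111111111111111"},
    {"uri_hash": "3333333333333333", "reply_parent_hash": "ffffffffffffffff"},
]
pairs = join_parents(posts)
matched = sum(1 for _, parent in pairs if parent is not None)
```

In a short firehose capture many parents were likely posted before the window opened, so a low match rate would be expected rather than a sign of a broken join.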

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["reply_parent_hash"].stats

stat value
n 101,040
nulls 58,270 (57.7%)
unique 34,738
len_min 16
len_max 16
len_mean 16
len_median 16
len_p95 16
word_mean 1
word_median 1
n_empty 0
n_duplicates 8,032
duplicate_rate 0.1878
vocab_size 17,415
readability_flesch_mean 71.73
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0007949
boilerplate_rate 0
alert: short_text · 95th-percentile length under 20 chars
alert: null_rate · 57.7% null
alert: one_word · 100.0% rows are a single word
Fig 12.
Character-length distribution for reply_parent_hash (mean: 16.0).
chars  count
16     42770

reply_root_hash text foreign_key

This is a 16-character hex hash identifying the root post of a reply thread, with every value being a single token of fixed length 16. About 57.67% of rows are null (likely top-level posts with no reply root) and 50.25% of the non-null values are duplicates across 21,277 unique hashes, consistent with many replies sharing the same thread root.

Treatment: left-join on this id to the root post of the thread; do not feature-encode.

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["reply_root_hash"].stats

stat value
n 101,040
nulls 58,270 (57.7%)
unique 21,277
len_min 16
len_max 16
len_mean 16
len_median 16
len_p95 16
word_mean 1
word_median 1
n_empty 0
n_duplicates 21,493
duplicate_rate 0.5025
vocab_size 12,498
readability_flesch_mean 77.23
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0008183
boilerplate_rate 0
alert: duplicates · 50.3% duplicate strings
alert: short_text · 95th-percentile length under 20 chars
alert: null_rate · 57.7% null
alert: one_word · 100.0% rows are a single word
Fig 13.
Character-length distribution for reply_root_hash (mean: 16.0).
chars  count
16     42770

sentiment categorical label

A three-class sentiment label across 101040 rows with no nulls and only 3 unique values: neutral, positive, negative. Distribution is uneven — neutral leads at 48.5%, followed by positive (34622) and negative (17437), so negatives are roughly half as common as positives. Entropy ratio of 0.93 indicates the classes are reasonably spread but not balanced.

Treatment: Use as a categorical target; consider class weighting to offset the under-represented negative class.
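
As a rough illustration of class weighting, the counts reported for this column can be turned into inverse-frequency weights. The formula below, total / (n_classes × class_count), is the common "balanced" heuristic; whether it suits a given model is a judgment call, and the function name is an assumption of this sketch.

```python
# Class counts as reported for the sentiment column.
counts = {"neutral": 48981, "positive": 34622, "negative": 17437}


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights each class contributes the same total weight to the loss, so the negative class is no longer outvoted by the neutral majority.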

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["sentiment"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 3
top_value neutral
top_rate 0.4848
cardinality 3
entropy 1.473
entropy_ratio 0.9295
Fig 14.
Top values for sentiment.
Top values for sentiment (3 unique shown, of 3 total).
value     count  share
neutral   48981  48.5%
positive  34622  34.3%
negative  17437  17.3%

sentiment_score numeric feature

This is a bounded sentiment polarity score in [-0.998, 1.0], typical of lexicon- or model-based sentiment scoring. Skew (0.019) and kurtosis (0.018) are both near zero, but the distribution is heavily zero-inflated: 47.8% of rows are exactly 0 and the median is 0, suggesting many neutral or unscoreable texts. The mean of 0.107 and a q3 of 0.402 against a q1 of 0 show a modest positive lean. The 5,763 rows (5.7%) flagged as outliers all sit on the negative side: with q1 = 0 and q3 = 0.402 the 1.5 IQR fences are -0.603 and 1.005, and the maximum score of 1.0 never crosses the upper fence.

Treatment: Treat zeros as a separate 'neutral' indicator and use the raw score as a feature; no transform needed given symmetric bounded range.
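
The outlier alert and the zero indicator can both be checked with a few lines of arithmetic. The sketch assumes the standard 1.5 × IQR fence rule, which matches the wording of the alert, and uses the quartiles and range from the stats table; `zero_flag` is a hypothetical helper name.

```python
# Quartiles and range as reported for sentiment_score.
q1, q3 = 0.0, 0.402
score_min, score_max = -0.998, 1.0

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # values below this count as outliers
upper_fence = q3 + 1.5 * iqr  # values above this count as outliers

has_low_outliers = score_min < lower_fence
has_high_outliers = score_max > upper_fence


def zero_flag(score):
    """Separate 'exactly zero' indicator to sit alongside the raw score."""
    return 1 if score == 0 else 0


print(round(lower_fence, 3), round(upper_fence, 3), has_low_outliers, has_high_outliers)
```

Because the upper fence lies above the maximum possible score, only strongly negative rows can be flagged, so the alert says more about the quartiles being pinned near zero than about anomalous data.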

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["sentiment_score"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 1,928
min -0.998
max 1
mean 0.1074
median 0
std 0.4104
q1 0
q3 0.402
iqr 0.402
skew 0.01861
kurtosis 0.01774
n_outliers 5,763
outlier_rate 0.05704
zero_rate 0.478
alert: outliers · 5.7% rows beyond 1.5 IQR
Fig 15.
Distribution of sentiment_score. Vertical dash marks the median.
Histogram bins for sentiment_score (median: 0.0).
bin                 count
-0.998 – -0.948     247
-0.948 – -0.8981    541
-0.8981 – -0.8481   705
-0.8481 – -0.7982   822
-0.7982 – -0.7482   812
-0.7482 – -0.6983   874
-0.6983 – -0.6483   1081
-0.6483 – -0.5984   1067
-0.5984 – -0.5484   1130
-0.5484 – -0.4985   1234
-0.4985 – -0.4486   1387
-0.4486 – -0.3986   1127
-0.3986 – -0.3487   895
-0.3487 – -0.2987   1120
-0.2987 – -0.2488   1585
-0.2488 – -0.1988   792
-0.1988 – -0.1488   726
-0.1488 – -0.0989   672
-0.0989 – -0.04895  624
-0.04895 – 0.001    48620
0.001 – 0.05095     358
0.05095 – 0.1009    740
0.1009 – 0.1508     623
0.1508 – 0.2008     781
0.2008 – 0.2508     1426
0.2508 – 0.3007     1343
0.3007 – 0.3507     1581
0.3507 – 0.4006     2179
0.4006 – 0.4506     3463
0.4506 – 0.5005     2801
0.5005 – 0.5505     1918
0.5505 – 0.6004     2382
0.6004 – 0.6503     2518
0.6503 – 0.7003     2004
0.7003 – 0.7503     1886
0.7503 – 0.8002     1870
0.8002 – 0.8501     2266
0.8501 – 0.9001     2011
0.9001 – 0.9501     1758
0.9501 – 1          1071

created_at text timestamp

This is a creation timestamp column stored as ISO 8601 strings (lengths 20-35, single-token), not parsed datetimes. Values are near-unique (96576/101040) yet 4464 duplicates exist and the top values cluster tightly around 2025-12-24T05:00, suggesting a narrow ingestion window or batch insert. Format is inconsistent across rows, mixing '+00:00', '.000Z', and microsecond-precision 'Z' suffixes, which will break naive string sorting.

Treatment: Parse to a normalized UTC datetime before any temporal analysis or joins.
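
A minimal sketch of the normalisation, assuming every value follows the shape `YYYY-MM-DDTHH:MM:SS`, an optional fractional part of any length, and either `Z` or a `±HH:MM` offset. The function name `parse_created_at` and the example strings are hypothetical; rows that do not match the pattern come back as None so they can be inspected rather than silently coerced.

```python
import re
from datetime import datetime, timezone

# Date-time, optional fraction of any length, then 'Z' or a numeric offset.
_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)


def parse_created_at(value):
    """Parse a mixed-format ISO 8601 string into an aware UTC datetime."""
    match = _ISO.match(value.strip())
    if match is None:
        return None  # leave unparseable rows for manual review
    base, fraction, offset = match.groups()
    fraction = (fraction or "0")[:6].ljust(6, "0")  # datetime keeps microseconds only
    offset = "+00:00" if offset == "Z" else offset
    parsed = datetime.fromisoformat(f"{base}.{fraction}{offset}")
    return parsed.astimezone(timezone.utc)


# Three spellings of the same instant, plus one with a non-UTC offset.
a = parse_created_at("2025-12-24T05:00:00Z")
b = parse_created_at("2025-12-24T05:00:00.000+00:00")
c = parse_created_at("2025-12-24T05:00:00.000000Z")
d = parse_created_at("2025-12-24T14:00:00+09:00")
```

Comparing the parsed datetimes instead of the raw strings sidesteps the sorting problem, since differing suffixes and precisions no longer affect ordering.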

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["created_at"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 96,576
len_min 20
len_max 35
len_mean 24.34
len_median 24
len_p95 27
word_mean 1
word_median 1
n_empty 0
n_duplicates 4,464
duplicate_rate 0.04418
vocab_size 19,720
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: near_unique · 95.6% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
Fig 16.
Character-length distribution for created_at (mean: 24.344883214568487).
chars  count
20     2195
21     0
22     1
23     3
24     88967
25     2823
26     1
27     2202
28     118
29     1342
30     61
31     0
32     3296
33     28
34     0
35     3

timestamp text timestamp

This is an ISO-8601 timestamp column stored as text, with every one of the 101040 values unique and exactly 26 characters long. Sampled values cluster on 2025-12-23 and 2025-12-24, suggesting a narrow capture window rather than a broad historical range. The 'allcaps' and 'one_word' alerts are artefacts of the ISO format (the literal 'T' separator and no whitespace), not a data quality issue.

Treatment: Parse to datetime and derive features (hour, day, delta) instead of using as text.

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["timestamp"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 101,040
len_min 26
len_max 26
len_mean 26
len_median 26
len_p95 26
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
Fig 17.
Character-length distribution for timestamp (mean: 26.0).
chars  count
26     101040

language categorical feature

Language tag of each record, using ISO-style codes across 90 distinct values with no nulls. English dominates at 60.8% of rows, followed by Japanese (12,607) and a sizeable 'unknown' bucket (11,481) that signals missing-data leakage into the category itself. Note the inconsistent granularity: bare 'en' coexists with locale-specific 'en-US' (3,617), so codes need normalisation before grouping.

Treatment: Normalise locale codes (collapse en-US into en), treat 'unknown' as missing, then one-hot or target-encode.
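
The normalisation could be sketched as follows. The function name is hypothetical, and collapsing every tag to its primary subtag is a simplifying assumption: it merges `en-US` into `en` as suggested, but it would also merge script or region variants of other languages that an analysis might want to keep apart.

```python
def normalise_language(tag):
    """Collapse locale variants to the base code and map 'unknown' to None."""
    if tag is None:
        return None
    tag = tag.strip()
    if not tag or tag.lower() == "unknown":
        return None  # treat the placeholder bucket as missing
    # Keep only the primary subtag: 'en-US' -> 'en', 'ja_JP' -> 'ja'.
    return tag.replace("_", "-").split("-")[0].lower()


codes = ["en", "en-US", "ja-JP", "unknown", "ko"]
normalised = [normalise_language(code) for code in codes]
print(normalised)
```

Returning None for 'unknown' lets downstream encoders apply their usual missing-value handling instead of learning a spurious category.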

anthropic:claude-opus-4-7 · confidence high
Out[41]:

saturn.columns["language"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 90
top_value en
top_rate 0.6084
cardinality 90
entropy 2.178
entropy_ratio 0.3356
Fig 18.
Top values for language.
Top values for language (20 unique shown, of 90 total).
value    count  share
en       61468  60.8%
ja       12607  12.5%
unknown  11481  11.4%
en-US    3617   3.6%
ko       2406   2.4%
de       1821   1.8%
pt       1295   1.3%
es       1153   1.1%
fr       746    0.7%
th       612    0.6%
tr       548    0.5%
nl       525    0.5%
zh       315    0.3%
it       276    0.3%
ru       213    0.2%
fi       193    0.2%
ja-JP    170    0.2%
id       158    0.2%
pl       139    0.1%
el       116    0.1%

char_count numeric feature

This is almost certainly a per-row character count of some text field, ranging from 1 to 525 with a median of 68 and mean of 97.6. The distribution is right-skewed (skew 1.02) with a wide IQR of 113, but only 0.29% of values flag as outliers and there are no nulls or zeros. With just 341 unique integer values across 101,040 rows, the field is discrete and well-behaved.

Treatment: Consider a log or sqrt transform before regression to tame the right skew.

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["char_count"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 341
min 1
max 525
mean 97.63
median 68
std 86.05
q1 30
q3 143
iqr 113
skew 1.018
kurtosis -0.05733
n_outliers 289
outlier_rate 0.00286
zero_rate 0
Fig 19.
Distribution of char_count. Vertical dash marks the median.
Histogram bins for char_count (median: 68.0).
bin            count
1 – 14.1       11042
14.1 – 27.2    12354
27.2 – 40.3    11078
40.3 – 53.4    8177
53.4 – 66.5    7049
66.5 – 79.6    6309
79.6 – 92.7    5146
92.7 – 105.8   4495
105.8 – 118.9  3865
118.9 – 132    3459
132 – 145.1    3244
145.1 – 158.2  2592
158.2 – 171.3  2219
171.3 – 184.4  2065
184.4 – 197.5  1867
197.5 – 210.6  1708
210.6 – 223.7  1624
223.7 – 236.8  1467
236.8 – 249.9  1439
249.9 – 263    1439
263 – 276.1    1563
276.1 – 289.2  1657
289.2 – 302.3  4862
302.3 – 315.4  33
315.4 – 328.5  261
328.5 – 341.6  3
341.6 – 354.7  5
354.7 – 367.8  1
367.8 – 380.9  2
380.9 – 394    9
394 – 407.1    3
407.1 – 420.2  0
420.2 – 433.3  1
433.3 – 446.4  1
446.4 – 459.5  0
459.5 – 472.6  0
472.6 – 485.7  0
485.7 – 498.8  0
498.8 – 511.9  0
511.9 – 525    1

word_count numeric feature

This is a per-row word count, ranging from 0 to 83 with a median of 10 and mean of 14.67, indicating most entries are short snippets rather than long documents. The distribution is right-skewed (skew 1.21) with a wide IQR of 19 and ~2.85% outliers, suggesting a long tail of unusually verbose rows. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) confirm it's a bounded discrete count with virtually no empty texts.

Treatment: Log-transform or bin before modelling to dampen the right skew.
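
Since word_count has a minimum of 0, a plain logarithm is undefined for some rows. One common workaround, shown here as a sketch rather than a prescribed method, is log(1 + n). The helper name is hypothetical; the inputs are the min, q1, median, q3, and max from the stats table.

```python
import math


def log_word_count(n):
    """log(1 + n): defined at n = 0, unlike a plain log."""
    return math.log1p(n)


# min, q1, median, q3, max as reported for word_count.
summary_points = (0, 3, 10, 22, 83)
transformed = [log_word_count(n) for n in summary_points]
for n, value in zip(summary_points, transformed):
    print(n, round(value, 3))
```

The transform is monotonic, so rank order is preserved while the long right tail is compressed into a much narrower range.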

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["word_count"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 79
min 0
max 83
mean 14.67
median 10
std 14.22
q1 3
q3 22
iqr 19
skew 1.209
kurtosis 0.699
n_outliers 2,882
outlier_rate 0.02852
zero_rate 0.0006037
Fig 20.
Distribution of word_count. Vertical dash marks the median.
Histogram bins for word_count (median: 10.0).
bin            count
0 – 2.075      21450
2.075 – 4.15   9307
4.15 – 6.225   8043
6.225 – 8.3    7043
8.3 – 10.38    6459
10.38 – 12.45  5882
12.45 – 14.53  4926
14.53 – 16.6   4074
16.6 – 18.68   3582
18.68 – 20.75  3218
20.75 – 22.83  2640
22.83 – 24.9   2512
24.9 – 26.98   2280
26.98 – 29.05  3311
29.05 – 31.13  1923
31.13 – 33.2   1805
33.2 – 35.28   1633
35.28 – 37.35  1314
37.35 – 39.43  1144
39.43 – 41.5   1057
41.5 – 43.58   1076
43.58 – 45.65  1046
45.65 – 47.73  994
47.73 – 49.8   965
49.8 – 51.88   918
51.88 – 53.95  805
53.95 – 56.03  910
56.03 – 58.1   334
58.1 – 60.18   189
60.18 – 62.25  88
62.25 – 64.33  31
64.33 – 66.4   23
66.4 – 68.48   21
68.48 – 70.55  7
70.55 – 72.62  10
72.62 – 74.7   14
74.7 – 76.78   3
76.78 – 78.85  1
78.85 – 80.93  0
80.93 – 83     2

has_images numeric feature

This is a binary indicator (only 2 unique values, min 0, max 1) flagging whether a record has images. It's heavily imbalanced: 86.4% zeros and a mean of 0.136, meaning only ~13.6% of rows have images. The 'outliers' alert simply reflects the minority class rather than anomalous values.

Treatment: Treat as a boolean flag; no transformation needed, but watch class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["has_images"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 2
min 0
max 1
mean 0.1363
median 0
std 0.3431
q1 0
q3 0
iqr 0
skew 2.12
kurtosis 2.497
n_outliers 13,768
outlier_rate 0.1363
zero_rate 0.8637
alert: high_skew · skew=+2.12
alert: outliers · 13.6% rows beyond 1.5 IQR
Fig 21.
Distribution of has_images. Vertical dash marks the median.
Histogram bins for has_images (median: 0.0).
value  count
0      87272
1      13768

has_video numeric feature

Binary flag indicating whether a record has an associated video, stored as 0/1 with no nulls across 101040 rows. The positive class is rare: 98.67% are zero and only 1.33% are one, producing extreme skew (8.50) and kurtosis (70.19). The 1344 ones are flagged as outliers purely because of the imbalance, not because they are anomalous.

Treatment: Treat as a boolean indicator; expect minimal signal given 98.67% zeros and consider class-imbalance handling if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[53]:

saturn.columns["has_video"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 2
min 0
max 1
mean 0.0133
median 0
std 0.1146
q1 0
q3 0
iqr 0
skew 8.497
kurtosis 70.19
n_outliers 1,344
outlier_rate 0.0133
zero_rate 0.9867
alert: high_skew · skew=+8.50
Fig 22.
Distribution of has_video. Vertical dash marks the median.
Histogram bins for has_video (median: 0.0).
value  count
0      99696
1      1344
Out[56]:

saturn.columns["has_link"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 2
min 0
max 1
mean 0.1795
median 0
std 0.3838
q1 0
q3 0
iqr 0
skew 1.67
kurtosis 0.7888
n_outliers 18,140
outlier_rate 0.1795
zero_rate 0.8205
alert: outliers · 18.0% rows beyond 1.5 IQR
Fig 23.
Distribution of has_link. Vertical dash marks the median.
Histogram bins for has_link (median: 0.0).
value  count
0      82900
1      18140

embed_type categorical feature

This column tags the embed type attached to a Bluesky post, with 5 distinct AT Protocol lexicon values like app.bsky.embed.external and app.bsky.embed.images. 61.15% of rows are null, consistent with most posts having no embed, and among the populated rows external links (46.2%) and images dominate while video and recordWithMedia are rare. Entropy ratio of 0.74 indicates a moderately concentrated but not degenerate distribution.

Treatment: Treat nulls as a 'no embed' category and one-hot encode the resulting 6 levels (the 5 observed embed types plus 'no embed').
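
A small sketch of that encoding, using the five embed types listed in the report plus an explicit level for nulls. The label "none" and the function name are assumptions of this example; any value outside the known levels raises an error so new embed types are noticed rather than silently dropped.

```python
# Five observed embed types plus an assumed "none" label for null rows.
LEVELS = [
    "none",
    "app.bsky.embed.external",
    "app.bsky.embed.images",
    "app.bsky.embed.record",
    "app.bsky.embed.video",
    "app.bsky.embed.recordWithMedia",
]


def one_hot_embed(value):
    """Encode an embed_type value (or None) as a 0/1 vector over LEVELS."""
    level = "none" if value is None else value
    if level not in LEVELS:
        raise ValueError(f"unexpected embed_type: {value!r}")
    return [1 if level == name else 0 for name in LEVELS]


print(one_hot_embed(None))
print(one_hot_embed("app.bsky.embed.video"))
```

For linear models one level is usually dropped to avoid perfect collinearity; "none" is a natural baseline since it covers 61.2% of rows.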

anthropic:claude-opus-4-7 · confidence high
Out[59]:

saturn.columns["embed_type"].stats

stat value
n 101,040
nulls 61,791 (61.2%)
unique 5
top_value app.bsky.embed.external
top_rate 0.4622
cardinality 5
entropy 1.717
entropy_ratio 0.7394
alert: null_rate · 61.2% null
Fig 24.
Top values for embed_type.
Top values for embed_type (5 unique shown, of 5 total).
value                           count  share (of all rows)
app.bsky.embed.external         18140  18.0%
app.bsky.embed.images           13768  13.6%
app.bsky.embed.record           5126   5.1%
app.bsky.embed.video            1344   1.3%
app.bsky.embed.recordWithMedia  871    0.9%

hashtags text feature

This column stores serialized JSON arrays of hashtags extracted from social posts, with 87,318 of 101,040 rows holding an empty list `[]` and only 10,103 distinct values overall (duplicate_rate 0.90). The column-wide figures (word_mean 1.38, len_median 2) are dominated by those empty lists rather than describing populated rows; non-empty lists run much longer (len_p95 63, len_max 1,122). Tags span multiple scripts (Thai, Japanese, German, English), so any text processing must be Unicode-aware. The mix of `#NowPlaying` (115) and `#nowplaying` (85) shows case is not normalized.

Treatment: Parse the JSON list, lowercase, and one-hot or count-encode the most frequent tags; treat `[]` as an explicit no-tag category.
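
The parsing and case-folding steps might be sketched as below. This assumes each cell is valid JSON (double-quoted strings); if the column was written with Python-style single quotes, `json.loads` would fail and a different parser would be needed. The sample rows and the helper name are hypothetical.

```python
import json
from collections import Counter


def parse_tags(raw):
    """Decode one serialized list and case-fold each tag."""
    return [tag.casefold() for tag in json.loads(raw)]


# Hypothetical cells in the serialized-list form described above.
rows = ["[]", '["#NowPlaying"]', '["#nowplaying", "#Musik"]', "[]"]

tag_counts = Counter(tag for raw in rows for tag in parse_tags(raw))
has_tags = [int(bool(parse_tags(raw))) for raw in rows]

print(tag_counts.most_common(2))
print(has_tags)
```

`str.casefold` is used instead of `lower` because it handles more non-ASCII case pairs, which matters for a multilingual column.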

anthropic:claude-opus-4-7 · confidence high
Out[62]:

saturn.columns["hashtags"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 10,103
len_min 2
len_max 1,122
len_mean 10.38
len_median 2
len_p95 63
word_mean 1.384
word_median 1
n_empty 0
n_duplicates 90,937
duplicate_rate 0.9
vocab_size 7,036
readability_flesch_mean 2.752
emoji_rate 0
url_rate 0
one_word_rate 0.902
allcaps_rate 0.005701
boilerplate_rate 0
alert: duplicates · 90.0% duplicate strings
alert: one_word · 90.2% rows are a single word
Fig 25.
Character-length distribution for hashtags (mean: 10.377622723673792).
chars        count
2 – 30       92111
30 – 58      3305
58 – 86      2893
86 – 114     1061
114 – 142    466
142 – 170    316
170 – 198    183
198 – 226    120
226 – 254    90
254 – 282    321
282 – 310    42
310 – 338    28
338 – 366    47
366 – 394    20
394 – 422    5
422 – 450    5
450 – 478    5
478 – 506    6
506 – 534    0
534 – 562    4
562 – 590    5
590 – 618    0
618 – 646    1
646 – 674    1
674 – 702    0
702 – 730    0
730 – 758    1
758 – 786    1
786 – 814    0
814 – 842    1
842 – 870    1
870 – 898    0
898 – 926    0
926 – 954    0
954 – 982    0
982 – 1010   0
1010 – 1038  0
1038 – 1066  0
1066 – 1094  0
1094 – 1122  1

mentions text foreign_key

This column stores a serialized JSON array of mention IDs (hex tokens) attached to each record, but 98,659 of 101,040 rows hold the empty list `[]`, giving a 0.98 duplicate rate and 0.996 one-word rate. When mentions do appear, they are almost always single-element arrays referencing 16-character hex IDs, with the most-cited ID occurring only 16 times. Effectively a sparse foreign-key list dominated by absence.

Treatment: Parse the JSON array and explode to a mention-id join table, or collapse to a binary has_mentions flag given the 98% empty rate.
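
Exploding the lists into a join table could look like the sketch below. It assumes each cell is valid JSON and that rows are available as (uri_hash, mentions) pairs; the function name and the sample hashes are hypothetical.

```python
import json


def explode_mentions(rows):
    """Turn (uri_hash, serialized list) pairs into one (uri_hash, mention) row per mention."""
    pairs = []
    for uri_hash, raw in rows:
        for mention in json.loads(raw):
            pairs.append((uri_hash, mention))
    return pairs


# Hypothetical rows: most cells are the empty list, as in the real column.
rows = [
    ("1111111111111111", "[]"),
    ("2222222222222222", '["aaaaaaaaaaaaaaaa"]'),
    ("3333333333333333", '["aaaaaaaaaaaaaaaa", "bbbbbbbbbbbbbbbb"]'),
]
mention_table = explode_mentions(rows)
has_mentions = [int(raw.strip() != "[]") for _, raw in rows]
```

The join table keeps who-mentioned-whom for graph analysis, while `has_mentions` is the cheap binary fallback for models that only need presence.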

anthropic:claude-opus-4-7 · confidence high
Out[65]:

saturn.columns["mentions"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 1,921
len_min 2
len_max 420
len_mean 2.67
len_median 2
len_p95 2
word_mean 1.012
word_median 1
n_empty 0
n_duplicates 99,119
duplicate_rate 0.981
vocab_size 660
readability_flesch_mean 0.702
emoji_rate 0
url_rate 0
one_word_rate 0.9962
allcaps_rate 3.959e-05
boilerplate_rate 0
alert: duplicates · 98.1% duplicate strings
alert: short_text · 95th-percentile length under 20 chars
alert: one_word · 99.6% rows are a single word
Fig 26.
Character-length distribution for mentions (mean: 2.6698139350752177).
chars      count
2 – 12     98659
12 – 23    1998
23 – 33    0
33 – 44    180
44 – 54    0
54 – 65    45
65 – 75    0
75 – 86    33
86 – 96    0
96 – 106   20
106 – 117  0
117 – 127  21
127 – 138  0
138 – 148  21
148 – 159  0
159 – 169  18
169 – 180  0
180 – 190  14
190 – 201  20
201 – 211  0
211 – 221  5
221 – 232  0
232 – 242  3
242 – 253  0
253 – 263  0
263 – 274  0
274 – 284  0
284 – 295  0
295 – 305  0
305 – 316  0
316 – 326  0
326 – 336  0
336 – 347  0
347 – 357  0
357 – 368  0
368 – 378  0
378 – 389  0
389 – 399  0
399 – 410  0
410 – 420  3
Out[68]:

saturn.columns["links"].stats

stat value
n 101,040
nulls 0 (0.0%)
unique 3,771
len_min 2
len_max 266
len_mean 4.95
len_median 2
len_p95 2
word_mean 1.003
word_median 1
n_empty 0
n_duplicates 97,269
duplicate_rate 0.9627
vocab_size 904
readability_flesch_mean -20.11
emoji_rate 0
url_rate 0.04779
one_word_rate 0.9978
allcaps_rate 0
boilerplate_rate 0
alert: duplicates · 96.3% duplicate strings
alert: short_text · 95th-percentile length under 20 chars
alert: one_word · 99.8% rows are a single word
Fig 27.
Character-length distribution for links (mean: 4.95048495645289).
chars      count
2 – 9      96211
9 – 15     0
15 – 22    44
22 – 28    293
28 – 35    447
35 – 42    1282
42 – 48    347
48 – 55    301
55 – 61    208
61 – 68    252
68 – 75    218
75 – 81    133
81 – 88    109
88 – 94    185
94 – 101   116
101 – 108  158
108 – 114  169
114 – 121  119
121 – 127  139
127 – 134  84
134 – 141  33
141 – 147  43
147 – 154  40
154 – 160  39
160 – 167  18
167 – 174  19
174 – 180  13
180 – 187  1
187 – 193  5
193 – 200  1
200 – 207  5
207 – 213  0
213 – 220  2
220 – 226  1
226 – 233  0
233 – 240  1
240 – 246  0
246 – 253  0
253 – 259  1
259 – 266  3

How to cite

BibTeX
@misc{saturn-bsky-firehose-dec-2025-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: bsky firehose dec 2025},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/bsky-firehose-dec-2025}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: bsky firehose dec 2025. Source: /home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/bsky-firehose-dec-2025