saturn

/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv 101,040 rows sample n=101,040 seed 42 2026-05-01T23:26:26+00:00

Overview

Source	/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv
Total rows	101,040
Profiled sample	101,040
Columns	19
Generated	2026-05-01T23:26:26+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This is an anonymized Bluesky firehose snapshot of 101,040 posts from late December 2025, with 19 columns covering hashed identifiers, post text, timestamps, embed metadata, language, and sentiment. The content is multilingual: English dominates at roughly 61% of posts, but Japanese, Korean, German, and Portuguese also have a meaningful presence, alongside an 'unknown' bucket (about 11% of rows) worth investigating. Sentiment skews neutral (~48%), with positive outweighing negative roughly 2:1, and post length is right-skewed (median 68 chars, max 525). Media and link flags are sparse: about 18% of posts carry a link, 14% have images, and only 1.3% have video. The embed_type field is null for ~61% of rows; its external, images, and video counts match the has_link, has_images, and has_video flags exactly, so these nulls most likely mean "no embed" rather than missing data, but that is the first thing to confirm. Reply hashes are null for ~58% of rows, suggesting most posts are top-level rather than replies.

text high anthropic:claude-opus-4-7

Short user-generated posts (the bsky.app URLs and hashtag-heavy entries are consistent with the Bluesky source), averaging 14 words with a median of 68 characters. The sampled language detection is predominantly English (3309) but spans 31 detected languages, with notable Japanese (656) and Korean (125) presence; 18.3% of rows contain emoji and 16.9% are flagged all-caps. Watch the 5,105 duplicates (5.05%) and the top entries, which are dominated by emoji spam (224 sheep emojis) and repeated promotional Thai text.

author_did_hash high anthropic:claude-opus-4-7

This column holds 16-character hexadecimal hashes of author DIDs, with every value exactly 16 chars long and a single token (one_word_rate 1.0). Across 101040 rows there are only 43998 unique authors and a 56.5% duplicate rate, with the top author hash appearing 1016 times — so a small number of authors generate a disproportionate share of records.

uri_hash high anthropic:claude-opus-4-7

This column holds 16-character single-token strings that look like truncated hex digests of URIs, with 101,039 unique values across 101,040 rows and one null. It is effectively a primary key once that single null row is handled: duplicate_rate is 0.0 and every non-null entry is exactly 16 characters long. No textual signal is present beyond the hash itself.

reply_parent_hash high anthropic:claude-opus-4-7

This column is a 16-character hexadecimal hash pointing at a parent message, populated only when a row is a reply — 57.67% of the 101,040 values are null. Every non-null entry is exactly 16 chars and one token, with 34,738 unique hashes and an 18.78% duplicate rate (8,032 duplicates), indicating popular parents that attract many replies (top hash 04a1db17fbc9ff3a appears 121 times). The flesch_mean of 71.73 is meaningless here since the strings are opaque IDs.

reply_root_hash high anthropic:claude-opus-4-7

This column holds 16-character hexadecimal hashes identifying the root of a reply thread, with every value exactly 16 chars and a single token. 57.67% of rows are null (consistent with non-reply posts being root-less), and among the 42.33% populated rows there are 21,277 unique hashes with a 50.25% duplicate rate, indicating many replies share common thread roots.
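The duplicate figures for the two reply columns can be reproduced from the stat blocks, assuming the profiler computes duplicates over non-null rows only (an inference from the numbers, not something the report states). A minimal sketch; `duplicate_stats` is a hypothetical helper:

```python
# Figures copied from the reply_parent_hash and reply_root_hash stat blocks.
ROWS = 101_040
NULLS = 58_270

def duplicate_stats(n_unique, rows=ROWS, nulls=NULLS):
    """Return (n_duplicates, duplicate_rate), both taken over non-null rows.

    Assumption: a "duplicate" is any non-null row beyond the first
    occurrence of its value, so n_duplicates = non_null - n_unique.
    """
    non_null = rows - nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

parent = duplicate_stats(34_738)  # reply_parent_hash
root = duplicate_stats(21_277)    # reply_root_hash
print(parent)
print(root)
```

Under that assumption the counts come out to 8,032 and 21,493 duplicates, matching the reported n_duplicates for both columns.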

sentiment high anthropic:claude-opus-4-7

This is a three-class sentiment label with values neutral, positive, and negative across 101040 rows and no nulls. The distribution is uneven: neutral dominates at 48.5% (top_rate 0.4847684085510689), positive accounts for 34622, and negative trails at 17437 — roughly half the positive count. Entropy ratio of 0.93 still indicates a fairly balanced spread despite the negative class being underrepresented.
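The entropy figures can be checked by hand. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2 of the cardinality to get entropy_ratio; that definition is an assumption, chosen because it reproduces the reported 1.473 and 0.930.

```python
import math

# Class counts copied from the sentiment "Top values" list.
counts = {"neutral": 48_981, "positive": 34_622, "negative": 17_437}

def entropy_bits(counts):
    """Shannon entropy (base 2) of a categorical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

h = entropy_bits(counts)
# Assumed normalisation: divide by the maximum entropy for this many classes.
ratio = h / math.log2(len(counts))
print(round(h, 3), round(ratio, 3))
```

A ratio of 1.0 would mean three equally sized classes, so 0.93 says the spread is close to uniform even though negative is about half the size of positive.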

sentiment_score high anthropic:claude-opus-4-7

Continuous sentiment polarity bounded in [-0.998, 1.0], consistent with a tool like VADER/TextBlob compound score. Distribution is roughly symmetric (skew 0.019, kurtosis 0.018) but dominated by a neutral spike: 47.8% of rows are exactly 0 and the median is 0, pulling Q1 to 0 as well. Despite the symmetry, 5,763 rows (5.7%) flag as outliers, suggesting heavy tails of strongly polarised text alongside the neutral mass.

created_at high anthropic:claude-opus-4-7

This is a creation timestamp stored as ISO-8601 strings rather than a native datetime, with lengths from 20 to 35 characters and a single token per value. The format is inconsistent: top values mix offset suffixes like `+00:00` with `Z` and microsecond `.000000Z` variants, which will break naive lexical comparisons. Cardinality is high (96576 unique of 101040) but 4464 duplicates (4.4%) suggest concurrent events sharing timestamps.
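One way to make these values comparable is to parse them into timezone-aware datetimes before sorting or filtering. The following is a minimal sketch, not the profiler's own logic: `parse_created_at` is a hypothetical helper, it treats a missing offset as UTC (an assumption), and it truncates fractional seconds beyond microseconds, which is a guess at what the 35-character values contain.

```python
import re
from datetime import datetime, timezone

_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # date and time to the second
    r"(?:\.(\d+))?"                             # optional fractional seconds
    r"(Z|[+-]\d{2}:\d{2})?$"                    # optional Z or numeric offset
)

def parse_created_at(value):
    """Parse a mixed-format ISO-8601 string into a UTC datetime."""
    m = _ISO.match(value.strip())
    if m is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, frac, offset = m.groups()
    # Pad or truncate the fraction to exactly six digits (microseconds).
    micros = (frac or "")[:6].ljust(6, "0")
    # Assumption: values without an offset are already in UTC.
    offset = "+00:00" if offset in (None, "Z") else offset
    return datetime.fromisoformat(f"{base}.{micros}{offset}").astimezone(timezone.utc)

a = parse_created_at("2025-12-24T06:00:49.556+00:00")
b = parse_created_at("2025-12-24T06:00:49.556Z")
c = parse_created_at("2025-12-15T15:02:56.000000Z")
print(a == b, c < a)
```

The first two inputs denote the same instant but compare unequal as raw strings, which is the failure mode the lexical comparison warning refers to.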

timestamp high anthropic:claude-opus-4-7

This column holds ISO-8601 timestamps with microsecond precision, stored as text rather than a native datetime type. All 101040 values are unique with zero nulls and a fixed length of 26 characters, and the sampled values cluster on 2025-12-23 and 2025-12-24. The text-profile alerts (near_unique, one_word, allcaps) are artefacts of the string representation, not genuine quality issues.

language high anthropic:claude-opus-4-7

Language code of the record, dominated by English ('en' at 60.8% of 101,040 rows) with Japanese ('ja') a distant second at 12,607. Notably, 11,481 rows are literally 'unknown' and the codes mix bare ISO ('en') with locale variants ('en-US', 3,617), so the 90 distinct values overstate true language diversity. No nulls, but entropy ratio of 0.34 confirms heavy concentration at the top.
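To see how much the locale variants inflate the distinct count, the codes can be collapsed to their primary subtag before counting. A minimal sketch using only a few counts from the top-values list; `primary_subtag` is a hypothetical helper, and collapsing 'en-US' into 'en' is a modelling choice rather than something the profiler does.

```python
from collections import Counter

# A subset of the language top values, copied from the report.
top_values = {"en": 61_468, "ja": 12_607, "unknown": 11_481, "en-US": 3_617, "ja-JP": 170}

def primary_subtag(code):
    """Reduce a locale-style tag such as 'en-US' to its primary subtag 'en'."""
    return code.replace("_", "-").split("-", 1)[0].lower()

merged = Counter()
for code, n in top_values.items():
    merged[primary_subtag(code)] += n

print(merged.most_common())
```

The 'unknown' bucket passes through unchanged, so it still needs separate handling.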

char_count high anthropic:claude-opus-4-7

This is almost certainly a character-count feature for some text field, with all 101040 rows populated and only 341 distinct integer values between 1 and 525. The distribution is right-skewed (skew 1.02) with median 68 well below the mean of 97.6 and an IQR of 30-143, indicating most texts are short but a tail stretches out. Outliers are minimal (0.29%) and there are no zero-length entries.

word_count high anthropic:claude-opus-4-7

This is a numeric feature counting words per record, ranging from 0 to 83 with a median of 10 and mean of 14.67. The distribution is right-skewed (skew 1.21) with an IQR of 19 and about 2.85% of values flagged as outliers, suggesting a long tail of unusually long entries. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) indicate counts cluster in a narrow integer range.

has_images high anthropic:claude-opus-4-7

Binary flag indicating whether a record has images, encoded as 0/1 with only 2 unique values across 101040 rows and no nulls. The positive class is rare at mean 0.136 (zero_rate 0.864), which is why the profiler flags high skew (2.12) and treats the 13768 ones as 'outliers'.

has_video high anthropic:claude-opus-4-7

This is a binary flag indicating whether a record has a video, stored numerically with only 2 unique values (0 and 1) and no nulls across 101040 rows. The positive class is rare: 98.67% are zero and only 1.33% are ones, producing the flagged high skew (8.50) and extreme kurtosis (70.19).

has_link high anthropic:claude-opus-4-7

This is a binary 0/1 flag indicating whether a record contains a link, with no nulls across 101,040 rows. About 17.95% are 1s and 82.05% are 0s, which is why the IQR is 0 and the minority class gets labelled as 18,140 'outliers' — that's a quirk of applying numeric outlier logic to a Bernoulli variable, not a data issue.
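The quirk is easy to reproduce: with more than 75% zeros, both quartiles are 0, the IQR is 0, and every 1 lands outside the fences. A minimal sketch using the counts from the has_link stat block; the quantile method is an assumption, since the report does not say which one the profiler uses, but any common method gives the same result here.

```python
import statistics

# Counts copied from the has_link stat block.
ones, rows = 18_140, 101_040
values = [0] * (rows - ones) + [1] * ones

# Quartiles of a 0/1 variable with more than 75% zeros are all 0.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# With both fences at 0, the whole minority class is flagged.
n_outliers = sum(1 for v in values if v < lower or v > upper)
print(q1, q3, iqr, n_outliers)
```

The same reasoning explains the 'outlier' counts on has_images and has_video.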

embed_type high anthropic:claude-opus-4-7

This column tags the embed type attached to Bluesky posts, with five AT Protocol values like app.bsky.embed.external and app.bsky.embed.images. Note the 61.15% null rate — most posts carry no embed at all — and among posts that do, external links dominate at 46.22%, with video and recordWithMedia being rare tails. Entropy ratio of 0.74 shows reasonable spread across the five categories when present.

hashtags high anthropic:claude-opus-4-7

This column stores per-row hashtag lists serialised as JSON arrays, with `[]` (empty list) appearing 87,318 times — the dominant value in a 101,040-row column. Duplicates are extreme (duplicate_rate 0.90, n_unique 10,103) and 90% of values are effectively one token, consistent with most rows having zero or one hashtag (word_median 1, len_median 2). The non-empty examples mix scripts (Thai, Japanese, German, English) including bot-like patterns such as `#NowPlaying` and repeated `#dpdi`, so any text treatment must be multilingual.
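Because the cells are JSON text rather than native lists, they need decoding before any per-tag analysis. A minimal sketch over a few of the sample values; `parse_tags` is a hypothetical helper, and case-folding the tags before counting is an assumption about how they should be grouped.

```python
import json
from collections import Counter

# A few cells copied from the hashtags sample values.
sample = ['[]', '[]', '["#strongertogether"]', '[]', '["#dandysworld"]']

def parse_tags(cell):
    """Decode one JSON-array cell into a list of tag strings; bad cells yield []."""
    try:
        tags = json.loads(cell)
    except (TypeError, json.JSONDecodeError):
        return []
    if not isinstance(tags, list):
        return []
    return [t for t in tags if isinstance(t, str)]

parsed = [parse_tags(cell) for cell in sample]
n_empty = sum(1 for tags in parsed if not tags)
# casefold() rather than lower() so non-English tags group sensibly too.
tag_counts = Counter(t.casefold() for tags in parsed for t in tags)
print(n_empty, tag_counts)
```

The same decoding applies to the mentions and links columns, which use the same array encoding.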

mentions high anthropic:claude-opus-4-7

This column stores serialized JSON arrays of mentioned entity IDs (16-hex tokens), but the overwhelming majority are empty: '[]' appears 98,659 of 101,040 times and the duplicate rate is 0.981. The length summaries (len_median 2, len_p95 2.0, word_mean 1.01) are dominated by those empty arrays, so they say little about how many IDs a non-empty row holds; len_max reaches 420 characters, indicating occasional multi-mention rows. Vocabulary is small (660) across 1,921 unique values, so signal is sparse and ID-like rather than textual.

links high anthropic:claude-opus-4-7

This column holds JSON-encoded arrays of URLs, but 96,211 of 101,040 rows are the empty array '[]', driving a 96.3% duplicate rate and a median length of 2 characters. Only 3,771 unique values exist across 101,040 rows, and just 4.8% of rows actually contain a URL. The non-empty entries point to streaming radio and news endpoints (BBC, Radio France, shoutcast streams).

Numeric correlation

Languages detected

Per-string language detection across text columns (sampled).

text text

31 languages detected in sample; 16.9% rows are all-caps

rows	101,040
null	0 (0.0%)
unique	95,935
len_min	1
len_max	525
len_mean	97.627
len_median	68.000
len_p95	290.000
word_mean	14.235
word_median	10.000
n_empty	0
n_duplicates	5,105
duplicate_rate	0.051
vocab_size	77,183
readability_flesch_mean	64.091
emoji_rate	0.183
url_rate	0.076
one_word_rate	0.190
allcaps_rate	0.169
boilerplate_rate	1.05e-03

Sample values (first 10)

Un client arrêté après avoir poignardé son livreur de repas "Un livreur de repas a été grièvement blessé lors d’une tentative de meurtre dans la nuit de dimanche à lundi, à Bülach. Le..." https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreu…
Fındığım bu giboyla ilgili itiraf etmek istediğin bir şey varsa tam vakti 😅 Dm kutuma gel anlat,söz bende kalacak anlattıkların 😅😅
-#strongertogether
Tu b’Shevat is approaching rapidly. Nit to put any pressure on you, but…
Someone mentioned it in my timeline, and so I just rewatched "The Blue Carbuncle", from The Adventures of Sherlock Holmes (1984) with Jeremy Brett. It's a great Christmas story and Jeremy Brett is without question the best Holmes ever. I found it on Britbox
https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46 【新着記事】宇宙ステーションは｢組み立てる｣時代から｢一発で広げる｣時代へ？
Happy Christmas Eve, sweet Flanoy! 🎄🧡🐈
THESE FUCKERS SERIOUSLY COULDN’T WAIT ONE DAY!? /vneg #dandysworld
어딜 가지.....
Made in hckr.fr 🏴‍☠️🖤 Le genre de petit message qui me fait chaud au cœur.

author_did_hash text

100.0% rows are a single word; 95th-percentile length under 20 chars; 56.5% duplicate strings

rows	101,040
null	0 (0.0%)
unique	43,998
len_min	16
len_max	16
len_mean	16.000
len_median	16.000
len_p95	16.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	57,042
duplicate_rate	0.565
vocab_size	13,938
readability_flesch_mean	68.345
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	3.46e-04
boilerplate_rate	0.000

Sample values (first 10)

203b2f94ca34ad57
e3fb7462b68ce168
8b80d746cd58f608
74e2cbc89edd37a6
ed4f29630f55ae1d
039de54e8bef8899
61ee7267320497ee
b464fb16192641fa
7422e82a369d2ace
8dc83ec255dde07a

uri_hash text

100.0% of rows are unique strings; 100.0% rows are a single word; 95th-percentile length under 20 chars

rows	101,040
null	1 (0.0%)
unique	101,039
len_min	16
len_max	16
len_mean	16.000
len_median	16.000
len_p95	16.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	20,000
readability_flesch_mean	69.614
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	5.25e-04
boilerplate_rate	0.000

Sample values (first 10)

1a925eae4a68e954
9c7b35c448e9f56a
00000da0008897eb
cd823fdacd02b11c
0a525ed50b0474f2
175ab73228973fa3
60f3a1a69409b7ae
e9e0e481dbe7f266
d49a9bf37ba42904
ef51be69a5ee76f8

reply_parent_hash text

100.0% rows are a single word; 57.7% null; 95th-percentile length under 20 chars

rows	101,040
null	58,270 (57.7%)
unique	34,738
len_min	16
len_max	16
len_mean	16.000
len_median	16.000
len_p95	16.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	8,032
duplicate_rate	0.188
vocab_size	17,415
readability_flesch_mean	71.729
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	7.95e-04
boilerplate_rate	0.000

Sample values (first 10)

fc2267f29dd1a492
6b56ce9644d8dcfc
701912916dd3aecb
f16b66c1507d3da9
63ea68b3eabeb6c5
2e341c64d79713f6
bfbbd6900834f900
dd990c5f31cc4ea6
3b2f41bfb941204a
a5ba750d7bf30263

reply_root_hash text

100.0% rows are a single word; 57.7% null; 95th-percentile length under 20 chars; 50.3% duplicate strings

rows	101,040
null	58,270 (57.7%)
unique	21,277
len_min	16
len_max	16
len_mean	16.000
len_median	16.000
len_p95	16.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	21,493
duplicate_rate	0.503
vocab_size	12,498
readability_flesch_mean	77.228
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	8.18e-04
boilerplate_rate	0.000

Sample values (first 10)

fc2267f29dd1a492
6b56ce9644d8dcfc
701912916dd3aecb
f16b66c1507d3da9
63ea68b3eabeb6c5
2e341c64d79713f6
dc0cf00aab42248a
65f573012a42f37a
152ff36a17b9ab54
2da15f2e55e9a171

sentiment categorical

rows	101,040
null	0 (0.0%)
unique	3
top_value	neutral
top_rate	0.485
cardinality	3
entropy	1.473
entropy_ratio	0.930

Top values (rank 1–20)

neutral — 48,981
positive — 34,622
negative — 17,437

sentiment_score numeric

5.7% rows beyond 1.5 IQR

rows	101,040
null	0 (0.0%)
unique	1,928
min	-0.998
max	1.000
mean	0.107
median	0.000
std	0.410
q1	0.000
q3	0.402
iqr	0.402
skew	0.019
kurtosis	0.018
n_outliers	5,763
outlier_rate	0.057
zero_rate	0.478

created_at text

95.6% of rows are unique strings; 100.0% rows are a single word; 100.0% rows are all-caps

rows	101,040
null	0 (0.0%)
unique	96,576
len_min	20
len_max	35
len_mean	24.345
len_median	24.000
len_p95	27.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	4,464
duplicate_rate	0.044
vocab_size	19,720
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

2025-12-15T15:02:56.000000Z
2025-12-24T05:46:28.199Z
2025-12-24T05:53:52.540Z
2025-12-24T05:51:06.770Z
2025-12-24T05:24:05.186Z
2025-12-24T06:00:49.556+00:00
2025-12-24T05:25:08.535Z
2025-12-24T05:56:08.507Z
2025-12-24T05:51:12.695Z
2025-12-24T05:00:11.869Z

timestamp text

100.0% of rows are unique strings; 100.0% rows are a single word; 100.0% rows are all-caps

rows	101,040
null	0 (0.0%)
unique	101,040
len_min	26
len_max	26
len_mean	26.000
len_median	26.000
len_p95	26.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	20,000
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

2025-12-23T23:35:20.812256
2025-12-23T23:46:29.113253
2025-12-23T23:53:52.818216
2025-12-23T23:51:06.721420
2025-12-23T23:24:08.130284
2025-12-24T00:00:49.619695
2025-12-23T23:25:09.314207
2025-12-23T23:56:13.728686
2025-12-23T23:51:13.117978
2025-12-23T23:00:12.134916

language categorical

rows	101,040
null	0 (0.0%)
unique	90
top_value	en
top_rate	0.608
cardinality	90
entropy	2.178
entropy_ratio	0.336

Top values (rank 1–20)

en — 61,468
ja — 12,607
unknown — 11,481
en-US — 3,617
ko — 2,406
de — 1,821
pt — 1,295
es — 1,153
fr — 746
th — 612
tr — 548
nl — 525
zh — 315
it — 276
ru — 213
fi — 193
ja-JP — 170
id — 158
pl — 139
el — 116

char_count numeric

rows	101,040
null	0 (0.0%)
unique	341
min	1.000
max	525.000
mean	97.627
median	68.000
std	86.052
q1	30.000
q3	143.000
iqr	113.000
skew	1.018
kurtosis	-0.057
n_outliers	289
outlier_rate	2.86e-03
zero_rate	0.000

word_count numeric

rows	101,040
null	0 (0.0%)
unique	79
min	0.000
max	83.000
mean	14.675
median	10.000
std	14.223
q1	3.000
q3	22.000
iqr	19.000
skew	1.209
kurtosis	0.699
n_outliers	2,882
outlier_rate	0.029
zero_rate	6.04e-04

has_images numeric

skew=+2.12; 13.6% rows beyond 1.5 IQR

rows	101,040
null	0 (0.0%)
unique	2
min	0.000
max	1.000
mean	0.136
median	0.000
std	0.343
q1	0.000
q3	0.000
iqr	0.000
skew	2.120
kurtosis	2.497
n_outliers	13,768
outlier_rate	0.136
zero_rate	0.864

has_video numeric

skew=+8.50

rows	101,040
null	0 (0.0%)
unique	2
min	0.000
max	1.000
mean	0.013
median	0.000
std	0.115
q1	0.000
q3	0.000
iqr	0.000
skew	8.497
kurtosis	70.192
n_outliers	1,344
outlier_rate	0.013
zero_rate	0.987

has_link numeric

18.0% rows beyond 1.5 IQR

rows	101,040
null	0 (0.0%)
unique	2
min	0.000
max	1.000
mean	0.180
median	0.000
std	0.384
q1	0.000
q3	0.000
iqr	0.000
skew	1.670
kurtosis	0.789
n_outliers	18,140
outlier_rate	0.180
zero_rate	0.820

embed_type categorical

61.2% null

rows	101,040
null	61,791 (61.2%)
unique	5
top_value	app.bsky.embed.external
top_rate	0.462
cardinality	5
entropy	1.717
entropy_ratio	0.739

Top values (rank 1–20)

app.bsky.embed.external — 18,140
app.bsky.embed.images — 13,768
app.bsky.embed.record — 5,126
app.bsky.embed.video — 1,344
app.bsky.embed.recordWithMedia — 871

hashtags text

90.2% rows are a single word; 90.0% duplicate strings

rows	101,040
null	0 (0.0%)
unique	10,103
len_min	2
len_max	1,122
len_mean	10.378
len_median	2.000
len_p95	63.000
word_mean	1.384
word_median	1.000
n_empty	0
n_duplicates	90,937
duplicate_rate	0.900
vocab_size	7,036
readability_flesch_mean	2.752
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.902
allcaps_rate	5.70e-03
boilerplate_rate	0.000

Sample values (first 10)

[]
[]
["#strongertogether"]
[]
[]
[]
[]
["#dandysworld"]
[]
[]

mentions text

99.6% rows are a single word; 95th-percentile length under 20 chars; 98.1% duplicate strings

rows	101,040
null	0 (0.0%)
unique	1,921
len_min	2
len_max	420
len_mean	2.670
len_median	2.000
len_p95	2.000
word_mean	1.012
word_median	1.000
n_empty	0
n_duplicates	99,119
duplicate_rate	0.981
vocab_size	660
readability_flesch_mean	0.702
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.996
allcaps_rate	3.96e-05
boilerplate_rate	0.000

Sample values (first 10)

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

links text

99.8% rows are a single word; 95th-percentile length under 20 chars; 96.3% duplicate strings

rows	101,040
null	0 (0.0%)
unique	3,771
len_min	2
len_max	266
len_mean	4.950
len_median	2.000
len_p95	2.000
word_mean	1.003
word_median	1.000
n_empty	0
n_duplicates	97,269
duplicate_rate	0.963
vocab_size	904
readability_flesch_mean	-20.108
emoji_rate	0.000
url_rate	0.048
one_word_rate	0.998
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

["https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreur-de-repas-103470448"]
[]
[]
[]
[]
["https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46"]
[]
[]
[]
[]