saturn

/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv | 101,040 rows | sample n=101,040 | seed 42 | 2026-05-01T23:26:26+00:00

Overview

Source: /home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv
Total rows: 101,040
Profiled sample: 101,040
Columns: 19
Generated: 2026-05-01T23:26:26+00:00

Insights (opt-in)

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This is an anonymized Bluesky firehose snapshot of 101,040 posts from late December 2025, with 19 columns covering hashed identifiers, post text, timestamps, embed metadata, language, and sentiment. The content is heavily multilingual: English dominates at roughly 61% of posts, but Japanese, Korean, German, and Portuguese also have meaningful presence, alongside an 'unknown' bucket (11,481 rows) worth investigating. Sentiment skews neutral (~48%) with positive outweighing negative roughly 2:1, and post length is right-skewed (median 68 chars, max 525). Media and link flags are sparse: about 18% of posts carry a link, 14% have images, and only 1.3% have video. The embed_type field is null for ~61% of rows; that most likely means the post has no embed rather than that a value is missing, but it is the first thing to confirm. Reply hashes are null for ~58% of rows, suggesting most posts are top-level rather than replies.

text high anthropic:claude-opus-4-7

Short user-generated posts (likely social media, given bsky.app URLs and hashtag-heavy entries), averaging 14 words with a median length of 68 characters. In the language-detection sample the corpus is predominantly English (3,309) but spans 31 detected languages, with notable Japanese (656) and Korean (125) presence. 18.3% of rows contain emoji and 16.9% are flagged all-caps. Watch the 5,105 duplicates (5.05%) and the top entries dominated by emoji spam (224 sheep emojis) and repeated promotional Thai text.

author_did_hash high anthropic:claude-opus-4-7

This column holds 16-character hexadecimal hashes of author DIDs, with every value exactly 16 chars long and a single token (one_word_rate 1.0). Across 101,040 rows there are only 43,998 unique authors and a 56.5% duplicate rate, with the top author hash appearing 1,016 times, so a small number of authors generate a disproportionate share of records.

uri_hash high anthropic:claude-opus-4-7

This column holds 16-character single-token strings that look like truncated hex digests of URIs, with 101,039 unique values across 101,040 rows. The one row that is not a unique value is a single null. Apart from that null the column is effectively a primary key: duplicate_rate is 0.0 and every non-null entry is exactly 16 characters long. No textual signal is present beyond the hash itself.

reply_parent_hash high anthropic:claude-opus-4-7

This column is a 16-character hexadecimal hash pointing at a parent message, populated only when a row is a reply; 57.67% of the 101,040 values are null. Every non-null entry is exactly 16 chars and one token, with 34,738 unique hashes and an 18.78% duplicate rate (8,032 duplicates), indicating popular parents that attract many replies (top hash 04a1db17fbc9ff3a appears 121 times). The flesch_mean of 71.73 is meaningless here since the strings are opaque IDs.

reply_root_hash high anthropic:claude-opus-4-7

This column holds 16-character hexadecimal hashes identifying the root of a reply thread, with every value exactly 16 chars and a single token. 57.67% of rows are null (consistent with non-reply posts being root-less), and among the 42.33% populated rows there are 21,277 unique hashes with a 50.25% duplicate rate, indicating many replies share common thread roots.
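
The duplicate figures quoted for the hash columns can be reproduced from the counts in the stats tables. The sketch below assumes, because that is how the reported numbers work out, that duplicates are divided by non-null rows rather than by all rows. `duplicate_stats` is a hypothetical helper for illustration, not part of saturn.

```python
def duplicate_stats(rows, nulls, unique):
    """Return (n_duplicates, duplicate_rate) computed over non-null values."""
    non_null = rows - nulls
    n_duplicates = non_null - unique  # every repeat beyond the first occurrence
    return n_duplicates, n_duplicates / non_null

# Counts copied from the stats tables in this report.
parent = duplicate_stats(rows=101_040, nulls=58_270, unique=34_738)
root = duplicate_stats(rows=101_040, nulls=58_270, unique=21_277)
author = duplicate_stats(rows=101_040, nulls=0, unique=43_998)

for name, (n_dup, rate) in [("reply_parent_hash", parent),
                            ("reply_root_hash", root),
                            ("author_did_hash", author)]:
    print(f"{name}: {n_dup:,} duplicates, rate {rate:.3f}")
```

Dividing by all 101,040 rows instead would give a noticeably lower rate for the two reply columns, so the denominator matters when comparing nullable and non-nullable columns.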

sentiment high anthropic:claude-opus-4-7

This is a three-class sentiment label with values neutral, positive, and negative across 101,040 rows and no nulls. The distribution is uneven: neutral leads with 48,981 rows (48.5%, top_rate 0.485), positive accounts for 34,622, and negative trails at 17,437, roughly half the positive count. The entropy ratio of 0.93 is close to the maximum of 1, so the three classes are still fairly evenly spread despite the smaller negative class.
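
The entropy figures can be checked by hand. The sketch below assumes entropy is Shannon entropy in bits and that the ratio divides by log2 of the cardinality; both assumptions are consistent with the reported 1.473 and 0.930, but they are inferred from the numbers rather than taken from saturn's documentation. `entropy_bits` is a hypothetical helper.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# neutral, positive, negative counts from the sentiment table.
sentiment_counts = [48_981, 34_622, 17_437]

h = entropy_bits(sentiment_counts)
ratio = h / math.log2(len(sentiment_counts))  # 1.0 would be a perfectly even split
print(f"entropy={h:.3f} bits, ratio={ratio:.3f}")
```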

sentiment_score high anthropic:claude-opus-4-7

Continuous sentiment polarity bounded in [-0.998, 1.0], consistent with a tool like VADER/TextBlob compound score. The moments look roughly symmetric (skew 0.019, kurtosis 0.018), but the distribution is dominated by a neutral spike: 47.8% of rows are exactly 0 and the median is 0, pulling Q1 to 0 as well. Even so, 5,763 rows (5.7%) are flagged as outliers. With Q1 at 0 and Q3 at 0.402, the upper 1.5 IQR fence falls just above the maximum of 1.0, so every flagged row must be a strongly negative score rather than a mix of both extremes.
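
To see which side of the distribution the outlier flag can reach, the fences can be computed from the reported quartiles. This is a minimal sketch of the 1.5 IQR rule named in the stats badge; `tukey_fences` is a hypothetical helper and the inputs are the rounded values from the table.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = tukey_fences(q1=0.0, q3=0.402)
reported_min, reported_max = -0.998, 1.0

can_flag_low = reported_min < low    # negative scores can fall below the fence
can_flag_high = reported_max > high  # positive scores cannot exceed the fence
print(f"fences: [{low:.3f}, {high:.3f}]  low side: {can_flag_low}  high side: {can_flag_high}")
```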

created_at high anthropic:claude-opus-4-7

This is a creation timestamp stored as ISO-8601 strings rather than a native datetime, with lengths from 20 to 35 characters and a single token per value. The format is inconsistent: the values mix the offset suffix `+00:00` with `Z`, and fractional seconds appear with both millisecond (`.199Z`) and microsecond (`.000000Z`) precision, which will break naive lexical comparisons. Cardinality is high (96,576 unique of 101,040) but 4,464 duplicates (4.4%) suggest concurrent events sharing timestamps.
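
One way to make these strings comparable is to parse them into timezone-aware datetimes before sorting or filtering. The sketch below is an assumption-laden illustration: `parse_created_at` is a hypothetical helper, it treats a missing zone designator as UTC, and it truncates fractions longer than six digits. Only the three variants visible in the sample values are confirmed to occur; the other branches are defensive guesses.

```python
import re
from datetime import datetime, timezone

_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # date and time to the second
    r"(?:\.(\d+))?"                              # optional fractional seconds
    r"(Z|[+-]\d{2}:\d{2})?$"                     # optional zone designator
)

def parse_created_at(value):
    """Parse one created_at string into a UTC datetime."""
    match = _ISO.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, frac, zone = match.groups()
    micro = (frac or "")[:6].ljust(6, "0")               # pad or truncate to microseconds
    offset = "+00:00" if zone in (None, "Z") else zone   # assumption: no zone means UTC
    return datetime.fromisoformat(f"{base}.{micro}{offset}").astimezone(timezone.utc)

# Two spellings of the same instant, plus a microsecond variant from the samples.
a = parse_created_at("2025-12-24T06:00:49.556+00:00")
b = parse_created_at("2025-12-24T06:00:49.556Z")
c = parse_created_at("2025-12-15T15:02:56.000000Z")
print(a == b, a.isoformat(), c.isoformat())
```

As raw strings the first two values compare unequal, which is the failure mode the narrative warns about; after parsing they are the same instant.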

timestamp high anthropic:claude-opus-4-7

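This column holds ISO-8601 timestamps with microsecond precision, stored as text rather than a native datetime type. All 101,040 values are unique with zero nulls and a fixed length of 26 characters, and the sampled values cluster on 2025-12-23 and 2025-12-24. The strings carry no zone designator, and in nine of the ten sample rows they trail the matching created_at value by about six hours, so they may be local capture times rather than UTC; confirm this before comparing the two columns. The text-profile alerts (near_unique, one_word, allcaps) are artefacts of the string representation, not genuine quality issues.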

language high anthropic:claude-opus-4-7

Language code of the record, dominated by English ('en' at 60.8% of 101,040 rows) with Japanese ('ja') a distant second at 12,607. Notably, 11,481 rows are literally 'unknown' and the codes mix bare ISO ('en') with locale variants ('en-US', 3,617), so the 90 distinct values overstate true language diversity. No nulls, but entropy ratio of 0.34 confirms heavy concentration at the top.
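
Collapsing locale variants onto their base code gives a fairer picture of language diversity. The sketch below uses only a handful of the top-20 counts shown in this report, so the totals are partial; `base_language` is a hypothetical helper, and keeping only the part before the first hyphen is an assumption that would also collapse script subtags if any are present.

```python
from collections import Counter

# A subset of the top-20 language values reported for this column.
top_values = {
    "en": 61_468, "ja": 12_607, "unknown": 11_481,
    "en-US": 3_617, "ko": 2_406, "ja-JP": 170,
}

def base_language(code):
    """Reduce a tag like 'en-US' to its primary subtag 'en'."""
    return code.split("-", 1)[0].lower()

merged = Counter()
for code, n in top_values.items():
    merged[base_language(code)] += n

print(merged.most_common())
```

The 'unknown' bucket is left as its own key here; whether to treat it as missing is a separate decision.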

char_count high anthropic:claude-opus-4-7

This is the character count of the text column: its min, max, mean, and median (1, 525, 97.627, 68) match that column's length statistics exactly. All 101,040 rows are populated, with only 341 distinct integer values. The distribution is right-skewed (skew 1.02), with the median of 68 well below the mean of 97.6 and an interquartile range of 113 (Q1 30, Q3 143), indicating most texts are short but a tail stretches out. Outliers are minimal (0.29%) and there are no zero-length entries.

word_count high anthropic:claude-opus-4-7

This is a numeric feature counting words per record, ranging from 0 to 83 with a median of 10 and a mean of 14.675. The distribution is right-skewed (skew 1.21) with an IQR of 19 (Q1 3, Q3 22) and about 2.85% of values flagged as outliers, suggesting a long tail of unusually long entries. Only 79 unique values across 101,040 rows and a near-zero zero_rate (0.06%) indicate counts cluster in a narrow integer range.

has_images high anthropic:claude-opus-4-7

Binary flag indicating whether a record has images, encoded as 0/1 with only 2 unique values across 101,040 rows and no nulls. The positive class is rare at mean 0.136 (zero_rate 0.864), which is why the profiler flags high skew (2.12) and treats the 13,768 ones as 'outliers'.

has_video high anthropic:claude-opus-4-7

This is a binary flag indicating whether a record has a video, stored numerically with only 2 unique values (0 and 1) and no nulls across 101,040 rows. The positive class is rare: 98.67% are zero and only 1.33% are ones, producing the flagged high skew (8.50) and extreme kurtosis (70.19).
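
The skew and kurtosis of a 0/1 flag are fully determined by the share of ones, so the alarming values above carry no extra information. The sketch below uses the standard closed forms for a Bernoulli variable and assumes the reported kurtosis is excess kurtosis, which is what the numbers suggest; `bernoulli_shape` is a hypothetical helper.

```python
import math

def bernoulli_shape(p):
    """Skewness and excess kurtosis of a Bernoulli(p) variable."""
    var = p * (1 - p)
    skew = (1 - 2 * p) / math.sqrt(var)
    excess_kurtosis = (1 - 6 * var) / var
    return skew, excess_kurtosis

# 1,344 ones out of 101,040 rows, from the has_video table.
skew, kurt = bernoulli_shape(1_344 / 101_040)
print(f"skew={skew:.3f}, excess kurtosis={kurt:.3f}")
```

The same function applied to the has_images and has_link proportions comes out close to their reported skew values as well.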

has_link high anthropic:claude-opus-4-7

This is a binary 0/1 flag indicating whether a record contains a link, with no nulls across 101,040 rows. About 17.95% are 1s and 82.05% are 0s, which is why the IQR is 0 and the minority class gets labelled as 18,140 'outliers'. That is a quirk of applying numeric outlier logic to a Bernoulli variable, not a data issue.
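
The outlier quirk is easy to reproduce on toy data. In the sketch below the list of flags is synthetic, sized to roughly match the has_link mix, and `iqr_outliers` is a hypothetical helper built on `statistics.quantiles`; saturn's exact quantile method is not documented here and may differ.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

# Synthetic flags: 82 zeros and 18 ones.
flags = [0] * 82 + [1] * 18
flagged = iqr_outliers(flags)

# Q1 and Q3 are both 0, so the fences collapse to [0, 0] and every 1 is flagged.
print(len(flagged), set(flagged))
```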

embed_type high anthropic:claude-opus-4-7

This column tags the embed type attached to Bluesky posts, with five AT Protocol values such as app.bsky.embed.external and app.bsky.embed.images. Note the 61.15% null rate, meaning most posts carry no embed at all. Among posts that do, external links dominate at 46.22%, with video and recordWithMedia being rare tails. Entropy ratio of 0.74 shows reasonable spread across the five categories when present.
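
The shares quoted above are relative to non-null rows, and the embed counts line up exactly with the three binary flags elsewhere in this report. The sketch below recomputes both from the reported counts. Reading the match as evidence that the flags were derived from embed_type is an assumption, not something the report states.

```python
rows = 101_040

# Counts from the embed_type table.
embed_counts = {
    "app.bsky.embed.external": 18_140,
    "app.bsky.embed.images": 13_768,
    "app.bsky.embed.record": 5_126,
    "app.bsky.embed.video": 1_344,
    "app.bsky.embed.recordWithMedia": 871,
}

non_null = sum(embed_counts.values())
nulls = rows - non_null
shares = {name: n / non_null for name, n in embed_counts.items()}

# Number of ones in each flag column, taken from the n_outliers rows.
flag_counts = {"has_link": 18_140, "has_images": 13_768, "has_video": 1_344}
flags_match = (
    flag_counts["has_link"] == embed_counts["app.bsky.embed.external"]
    and flag_counts["has_images"] == embed_counts["app.bsky.embed.images"]
    and flag_counts["has_video"] == embed_counts["app.bsky.embed.video"]
)
print(nulls, f"{shares['app.bsky.embed.external']:.4f}", flags_match)
```

If the flags really are derived this way, posts tagged recordWithMedia would not be counted in has_images or has_video, which is worth checking against the raw rows.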

hashtags high anthropic:claude-opus-4-7

This column stores per-row hashtag lists serialised as JSON arrays, with `[]` (empty list) appearing 87,318 times, the dominant value in a 101,040-row column. Duplicates are extreme (duplicate_rate 0.90, n_unique 10,103) and 90% of values are effectively one token, consistent with most rows having zero or one hashtag (word_median 1, len_median 2). The non-empty examples mix scripts (Thai, Japanese, German, English) including bot-like patterns such as `#NowPlaying` and repeated `#dpdi`, so any text treatment must be multilingual.
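
Because the cells are serialised arrays, string statistics such as length and word count say little until the values are decoded. The sketch below assumes each cell is valid JSON, as the sample values suggest. `parse_tags` is a hypothetical helper, and stripping `#` and case-folding is one possible normalisation rather than a required one.

```python
import json

# Cells copied from the hashtags sample values in this report.
samples = ["[]", '["#strongertogether"]', "[]", '["#dandysworld"]']

def parse_tags(cell):
    """Decode one hashtags cell into a list of normalised tag strings."""
    tags = json.loads(cell)
    if not isinstance(tags, list):
        raise ValueError(f"expected a JSON array, got {cell!r}")
    return [str(t).lstrip("#").casefold() for t in tags]

parsed = [parse_tags(cell) for cell in samples]
share_empty = sum(1 for tags in parsed if not tags) / len(parsed)
print(parsed, share_empty)
```

`str.casefold` is used instead of `lower` because the tags span several scripts.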

mentions high anthropic:claude-opus-4-7

This column stores serialized JSON arrays of mentioned entity IDs (16-hex tokens), but the overwhelming majority are empty: '[]' appears 98,659 of 101,040 times and the duplicate rate is 0.981. When non-empty, arrays almost always hold a single ID (word_mean 1.01, len_p95 2.0), though len_max reaches 420 indicating occasional multi-mention rows. Vocabulary is small (660) across 1,921 unique values, so signal is sparse and ID-like rather than textual.

links high anthropic:claude-opus-4-7

This column holds JSON-encoded arrays of URLs, but 96,211 of 101,040 rows are the empty array '[]', driving a 96.3% duplicate rate and a median length of 2 characters. Only 3,771 unique values exist across 101,040 rows, and just 4.8% of rows actually contain a URL. That is well below the 17.95% of rows with has_link set to 1, so the two fields evidently measure different things and should not be used interchangeably. The non-empty entries point to streaming radio and news endpoints (BBC, Radio France, shoutcast streams).
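
A quick way to see which sites dominate the non-empty rows is to tally hostnames. The sketch below runs on the two non-empty sample cells shown in this report plus one empty cell, and it assumes the cells are valid JSON arrays of absolute URLs. `host_counts` is a hypothetical name.

```python
import json
from collections import Counter
from urllib.parse import urlsplit

# Cells copied from the links sample values in this report.
cells = [
    '["https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreur-de-repas-103470448"]',
    "[]",
    '["https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46"]',
]

host_counts = Counter(
    urlsplit(url).hostname
    for cell in cells
    for url in json.loads(cell)
)
print(host_counts.most_common())
```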

Numeric correlation

Languages detected

Per-string language detection across text columns (sampled).

text text

31 languages detected in sample; 16.9% rows are all-caps
rows: 101,040
null: 0 (0.0%)
unique: 95,935
len_min: 1
len_max: 525
len_mean: 97.627
len_median: 68.000
len_p95: 290.000
word_mean: 14.235
word_median: 10.000
n_empty: 0
n_duplicates: 5,105
duplicate_rate: 0.051
vocab_size: 77,183
readability_flesch_mean: 64.091
emoji_rate: 0.183
url_rate: 0.076
one_word_rate: 0.190
allcaps_rate: 0.169
boilerplate_rate: 1.05e-03
Sample values (first 10)
  1. Un client arrêté après avoir poignardé son livreur de repas "Un livreur de repas a été grièvement blessé lors d’une tentative de meurtre dans la nuit de dimanche à lundi, à Bülach. Le..." https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreu…
  2. Fındığım bu giboyla ilgili itiraf etmek istediğin bir şey varsa tam vakti 😅 Dm kutuma gel anlat,söz bende kalacak anlattıkların 😅😅
  3. -#strongertogether
  4. Tu b’Shevat is approaching rapidly. Nit to put any pressure on you, but…
  5. Someone mentioned it in my timeline, and so I just rewatched "The Blue Carbuncle", from The Adventures of Sherlock Holmes (1984) with Jeremy Brett. It's a great Christmas story and Jeremy Brett is without question the best Holmes ever. I found it on Britbox
  6. https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46 【新着記事】 宇宙ステーションは「組み立てる」時代から「一発で広げる」時代へ?
  7. Happy Christmas Eve, sweet Flanoy! 🎄🧡🐈
  8. THESE FUCKERS SERIOUSLY COULDN’T WAIT ONE DAY!? /vneg #dandysworld
  9. 어딜 가지.....
  10. Made in hckr.fr 🏴‍☠️🖤 Le genre de petit message qui me fait chaud au cœur.

author_did_hash text

100.0% rows are a single word; 95th-percentile length under 20 chars; 56.5% duplicate strings
rows: 101,040
null: 0 (0.0%)
unique: 43,998
len_min: 16
len_max: 16
len_mean: 16.000
len_median: 16.000
len_p95: 16.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 57,042
duplicate_rate: 0.565
vocab_size: 13,938
readability_flesch_mean: 68.345
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 3.46e-04
boilerplate_rate: 0.000
Sample values (first 10)
  1. 203b2f94ca34ad57
  2. e3fb7462b68ce168
  3. 8b80d746cd58f608
  4. 74e2cbc89edd37a6
  5. ed4f29630f55ae1d
  6. 039de54e8bef8899
  7. 61ee7267320497ee
  8. b464fb16192641fa
  9. 7422e82a369d2ace
  10. 8dc83ec255dde07a

uri_hash text

100.0% of rows are unique strings; 100.0% rows are a single word; 95th-percentile length under 20 chars
rows: 101,040
null: 1 (0.0%)
unique: 101,039
len_min: 16
len_max: 16
len_mean: 16.000
len_median: 16.000
len_p95: 16.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 0
duplicate_rate: 0.000
vocab_size: 20,000
readability_flesch_mean: 69.614
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 5.25e-04
boilerplate_rate: 0.000
Sample values (first 10)
  1. 1a925eae4a68e954
  2. 9c7b35c448e9f56a
  3. 00000da0008897eb
  4. cd823fdacd02b11c
  5. 0a525ed50b0474f2
  6. 175ab73228973fa3
  7. 60f3a1a69409b7ae
  8. e9e0e481dbe7f266
  9. d49a9bf37ba42904
  10. ef51be69a5ee76f8

reply_parent_hash text

100.0% rows are a single word; 57.7% null; 95th-percentile length under 20 chars
rows: 101,040
null: 58,270 (57.7%)
unique: 34,738
len_min: 16
len_max: 16
len_mean: 16.000
len_median: 16.000
len_p95: 16.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 8,032
duplicate_rate: 0.188
vocab_size: 17,415
readability_flesch_mean: 71.729
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 7.95e-04
boilerplate_rate: 0.000
Sample values (first 10)
  1. fc2267f29dd1a492
  2. 6b56ce9644d8dcfc
  3. 701912916dd3aecb
  4. f16b66c1507d3da9
  5. 63ea68b3eabeb6c5
  6. 2e341c64d79713f6
  7. bfbbd6900834f900
  8. dd990c5f31cc4ea6
  9. 3b2f41bfb941204a
  10. a5ba750d7bf30263

reply_root_hash text

100.0% rows are a single word; 57.7% null; 95th-percentile length under 20 chars; 50.3% duplicate strings
rows: 101,040
null: 58,270 (57.7%)
unique: 21,277
len_min: 16
len_max: 16
len_mean: 16.000
len_median: 16.000
len_p95: 16.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 21,493
duplicate_rate: 0.503
vocab_size: 12,498
readability_flesch_mean: 77.228
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 8.18e-04
boilerplate_rate: 0.000
Sample values (first 10)
  1. fc2267f29dd1a492
  2. 6b56ce9644d8dcfc
  3. 701912916dd3aecb
  4. f16b66c1507d3da9
  5. 63ea68b3eabeb6c5
  6. 2e341c64d79713f6
  7. dc0cf00aab42248a
  8. 65f573012a42f37a
  9. 152ff36a17b9ab54
  10. 2da15f2e55e9a171

sentiment categorical

rows: 101,040
null: 0 (0.0%)
unique: 3
top_value: neutral
top_rate: 0.485
cardinality: 3
entropy: 1.473
entropy_ratio: 0.930
Top values (rank 1–20)
  1. neutral — 48,981
  2. positive — 34,622
  3. negative — 17,437

sentiment_score numeric

5.7% rows beyond 1.5 IQR
rows: 101,040
null: 0 (0.0%)
unique: 1,928
min: -0.998
max: 1.000
mean: 0.107
median: 0.000
std: 0.410
q1: 0.000
q3: 0.402
iqr: 0.402
skew: 0.019
kurtosis: 0.018
n_outliers: 5,763
outlier_rate: 0.057
zero_rate: 0.478

created_at text

95.6% of rows are unique strings; 100.0% rows are a single word; 100.0% rows are all-caps
rows: 101,040
null: 0 (0.0%)
unique: 96,576
len_min: 20
len_max: 35
len_mean: 24.345
len_median: 24.000
len_p95: 27.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 4,464
duplicate_rate: 0.044
vocab_size: 19,720
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. 2025-12-15T15:02:56.000000Z
  2. 2025-12-24T05:46:28.199Z
  3. 2025-12-24T05:53:52.540Z
  4. 2025-12-24T05:51:06.770Z
  5. 2025-12-24T05:24:05.186Z
  6. 2025-12-24T06:00:49.556+00:00
  7. 2025-12-24T05:25:08.535Z
  8. 2025-12-24T05:56:08.507Z
  9. 2025-12-24T05:51:12.695Z
  10. 2025-12-24T05:00:11.869Z

timestamp text

100.0% of rows are unique strings; 100.0% rows are a single word; 100.0% rows are all-caps
rows: 101,040
null: 0 (0.0%)
unique: 101,040
len_min: 26
len_max: 26
len_mean: 26.000
len_median: 26.000
len_p95: 26.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 0
duplicate_rate: 0.000
vocab_size: 20,000
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. 2025-12-23T23:35:20.812256
  2. 2025-12-23T23:46:29.113253
  3. 2025-12-23T23:53:52.818216
  4. 2025-12-23T23:51:06.721420
  5. 2025-12-23T23:24:08.130284
  6. 2025-12-24T00:00:49.619695
  7. 2025-12-23T23:25:09.314207
  8. 2025-12-23T23:56:13.728686
  9. 2025-12-23T23:51:13.117978
  10. 2025-12-23T23:00:12.134916

language categorical

rows: 101,040
null: 0 (0.0%)
unique: 90
top_value: en
top_rate: 0.608
cardinality: 90
entropy: 2.178
entropy_ratio: 0.336
Top values (rank 1–20)
  1. en — 61,468
  2. ja — 12,607
  3. unknown — 11,481
  4. en-US — 3,617
  5. ko — 2,406
  6. de — 1,821
  7. pt — 1,295
  8. es — 1,153
  9. fr — 746
  10. th — 612
  11. tr — 548
  12. nl — 525
  13. zh — 315
  14. it — 276
  15. ru — 213
  16. fi — 193
  17. ja-JP — 170
  18. id — 158
  19. pl — 139
  20. el — 116

char_count numeric

rows: 101,040
null: 0 (0.0%)
unique: 341
min: 1.000
max: 525.000
mean: 97.627
median: 68.000
std: 86.052
q1: 30.000
q3: 143.000
iqr: 113.000
skew: 1.018
kurtosis: -0.057
n_outliers: 289
outlier_rate: 2.86e-03
zero_rate: 0.000

word_count numeric

rows: 101,040
null: 0 (0.0%)
unique: 79
min: 0.000
max: 83.000
mean: 14.675
median: 10.000
std: 14.223
q1: 3.000
q3: 22.000
iqr: 19.000
skew: 1.209
kurtosis: 0.699
n_outliers: 2,882
outlier_rate: 0.029
zero_rate: 6.04e-04

has_images numeric

skew=+2.12; 13.6% rows beyond 1.5 IQR
rows: 101,040
null: 0 (0.0%)
unique: 2
min: 0.000
max: 1.000
mean: 0.136
median: 0.000
std: 0.343
q1: 0.000
q3: 0.000
iqr: 0.000
skew: 2.120
kurtosis: 2.497
n_outliers: 13,768
outlier_rate: 0.136
zero_rate: 0.864

has_video numeric

skew=+8.50
rows: 101,040
null: 0 (0.0%)
unique: 2
min: 0.000
max: 1.000
mean: 0.013
median: 0.000
std: 0.115
q1: 0.000
q3: 0.000
iqr: 0.000
skew: 8.497
kurtosis: 70.192
n_outliers: 1,344
outlier_rate: 0.013
zero_rate: 0.987

has_link numeric

18.0% rows beyond 1.5 IQR
rows: 101,040
null: 0 (0.0%)
unique: 2
min: 0.000
max: 1.000
mean: 0.180
median: 0.000
std: 0.384
q1: 0.000
q3: 0.000
iqr: 0.000
skew: 1.670
kurtosis: 0.789
n_outliers: 18,140
outlier_rate: 0.180
zero_rate: 0.820

embed_type categorical

61.2% null
rows: 101,040
null: 61,791 (61.2%)
unique: 5
top_value: app.bsky.embed.external
top_rate: 0.462
cardinality: 5
entropy: 1.717
entropy_ratio: 0.739
Top values (rank 1–20)
  1. app.bsky.embed.external — 18,140
  2. app.bsky.embed.images — 13,768
  3. app.bsky.embed.record — 5,126
  4. app.bsky.embed.video — 1,344
  5. app.bsky.embed.recordWithMedia — 871

hashtags text

90.2% rows are a single word; 90.0% duplicate strings
rows: 101,040
null: 0 (0.0%)
unique: 10,103
len_min: 2
len_max: 1,122
len_mean: 10.378
len_median: 2.000
len_p95: 63.000
word_mean: 1.384
word_median: 1.000
n_empty: 0
n_duplicates: 90,937
duplicate_rate: 0.900
vocab_size: 7,036
readability_flesch_mean: 2.752
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.902
allcaps_rate: 5.70e-03
boilerplate_rate: 0.000
Sample values (first 10)
  1. []
  2. []
  3. ["#strongertogether"]
  4. []
  5. []
  6. []
  7. []
  8. ["#dandysworld"]
  9. []
  10. []

mentions text

99.6% rows are a single word; 95th-percentile length under 20 chars; 98.1% duplicate strings
rows: 101,040
null: 0 (0.0%)
unique: 1,921
len_min: 2
len_max: 420
len_mean: 2.670
len_median: 2.000
len_p95: 2.000
word_mean: 1.012
word_median: 1.000
n_empty: 0
n_duplicates: 99,119
duplicate_rate: 0.981
vocab_size: 660
readability_flesch_mean: 0.702
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.996
allcaps_rate: 3.96e-05
boilerplate_rate: 0.000
Sample values (first 10)
  1. []
  2. []
  3. []
  4. []
  5. []
  6. []
  7. []
  8. []
  9. []
  10. []

links text

99.8% rows are a single word; 95th-percentile length under 20 chars; 96.3% duplicate strings
rows: 101,040
null: 0 (0.0%)
unique: 3,771
len_min: 2
len_max: 266
len_mean: 4.950
len_median: 2.000
len_p95: 2.000
word_mean: 1.003
word_median: 1.000
n_empty: 0
n_duplicates: 97,269
duplicate_rate: 0.963
vocab_size: 904
readability_flesch_mean: -20.108
emoji_rate: 0.000
url_rate: 0.048
one_word_rate: 0.998
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. ["https://www.20min.ch/fr/story/buelach-zh-un-client-arrete-apres-avoir-poignarde-son-livreur-de-repas-103470448"]
  2. []
  3. []
  4. []
  5. []
  6. ["https://trecome.info/articles/89cfe941-6cf1-45ae-8626-cc0241375b46"]
  7. []
  8. []
  9. []
  10. []