
curated vs firehose

saturn compare notebook · generated 2026-04-22

Overview

Two datasets are compared column by column: curated has 279,196 rows and firehose has 125,645. The divergence score combines normalised null drift, mean/length delta, entropy delta, top-value jaccard, and language-mix jaccard, with every term capped at 1.0.
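
A minimal sketch of how a capped, additive score of this kind could be assembled. The function names, the equal weighting, the pre-normalised `length_term` and `entropy_term` inputs, and the choice to count each jaccard as 1 minus the overlap are all assumptions for illustration; the report does not state the formula saturn-dissect actually uses.

```python
from typing import Optional


def capped(x: float) -> float:
    """Clamp the magnitude of a drift term to the [0, 1] range."""
    return min(1.0, abs(x))


def divergence_score(
    null_rate_delta: float = 0.0,
    length_term: float = 0.0,   # hypothetical: already-normalised mean/length delta
    entropy_term: float = 0.0,  # hypothetical: already-normalised entropy delta
    top_value_jaccard: Optional[float] = None,
    language_jaccard: Optional[float] = None,
) -> float:
    """Sum of capped terms; an illustrative stand-in, not the tool's formula."""
    terms = [capped(null_rate_delta), capped(length_term), capped(entropy_term)]
    # Assumption: zero overlap (jaccard 0.0) contributes the full 1.0.
    for j in (top_value_jaccard, language_jaccard):
        if j is not None:
            terms.append(capped(1.0 - j))
    return sum(terms)


# A column that is fully null on one side and shares no top values.
print(divergence_score(null_rate_delta=1.0, top_value_jaccard=0.0))  # 2.0
```

Under these assumptions a fully null column with no shared top values scores 1.0 + 1.0 = 2.0, which agrees with the 2.00 rows in the summary table, although that agreement alone does not confirm the formula.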

[2]:
!pip install saturn-dissect
import subprocess

# Run the CLI comparison; both sources are filtered slices of the same dataset.
subprocess.run(
    [
        "saturn", "compare",
        "hf://lukeslp/bluesky-alt-text:[train][source_mode=author_feed]",  # side A
        "hf://lukeslp/bluesky-alt-text:[train][source_mode=jetstream]",    # side B
        "--label-a", "curated",
        "--label-b", "firehose",
        "--findings", "bluesky-alt-text-curated-vs-firehose.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)

Summary

confidence: high

The curated dataset (279,196 rows) is roughly 2.2× the size of firehose (125,645 rows), and several columns differ structurally rather than just statistically. Four fields (`query`, `image_fullsize_url`, `image_thumb_url`, and `author_handle`) show a +100-point null-rate shift with zero top-value overlap: per the column tables below, they are fully populated in curated and entirely null in firehose. Text fields also diverge in shape: `alt_text` is 79 characters longer on average in firehose, with low language overlap (jaccard 0.35) and almost no shared top values (0.03), while `raw_record_json` is 154 characters longer on average with a language jaccard of 0.25. These patterns suggest the two sources capture different schemas or enrichments rather than sampling noise. Confidence is high that the divergence is structural.

citing: a_row_count · b_row_count · divergences

Out[4]:

report.divergence_summary()

column              kind         score  signals
alt_text            text         2.01   len_mean +79, lang-jaccard 0.35, top-val-jaccard 0.03
query               categorical  2.00   null +100%, top-val-jaccard 0.00
image_fullsize_url  text         2.00   null +100%, top-val-jaccard 0.00
image_thumb_url     text         2.00   null +100%, top-val-jaccard 0.00
author_handle       categorical  2.00   null +100%, top-val-jaccard 0.00
raw_record_json     text         1.89   len_mean +154, lang-jaccard 0.25, top-val-jaccard 0.00

image_mime_type categorical

Out[6]:

report.columns["image_mime_type"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                5         8
n_unique_delta                            +3
entropy_delta                             +0.552
top-value jaccard                         0.62

source_mode categorical

Out[8]:

report.columns["source_mode"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                1         1
n_unique_delta                            +0
entropy_delta                             +0
top-value jaccard                         0.00

image_ref text

Out[10]:

report.columns["image_ref"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      -0.0%
unique                260,595   114,610
n_unique_delta                            -1.46e+05
len_mean_delta                            +0
len_median_delta                          +0
len_p95_delta                             +0
word_mean_delta                           +0
duplicate_rate_delta                      +0.0213
vocab_size_delta                          -644
top-value jaccard                         0.05

query categorical

The query column is fully populated in curated (null_rate 0.0 across 279196 rows, 489 unique values led by 'firefaerie81.bsky.social' at 2.4%) but entirely null in firehose (null_rate 1.0 over 125645 rows, flagged all_null). This yields a null_rate_delta of 1.0 and a top_value_jaccard of 0.0, meaning the two slices share no overlap on this field. Curated also shows a healthy entropy of 7.68, while firehose has no distribution to compare.

anthropic:claude-opus-4-7 · confidence high
Out[12]:

report.columns["query"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      100.0%    +100.0%
unique                489
top-value jaccard                         0.00
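
The two signals cited for this column can be reproduced on toy data. This is a sketch under assumptions: nulls are represented as `None`, the delta is taken as side B minus side A (matching the +100.0% in the table), and "top values" means the `k` most frequent non-null values with `k=10` chosen arbitrarily. The helper names are hypothetical, not part of saturn-dissect.

```python
from collections import Counter


def null_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


def top_values(values, k=10):
    """Set of the k most frequent non-null values (k is an assumed cutoff)."""
    counts = Counter(v for v in values if v is not None)
    return {v for v, _ in counts.most_common(k)}


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


curated = ["alice.example", "bob.example", "alice.example"]  # toy, fully populated
firehose = [None, None]                                       # toy, all null

null_rate_delta = null_rate(firehose) - null_rate(curated)
top_value_jaccard = jaccard(top_values(curated), top_values(firehose))
print(null_rate_delta, top_value_jaccard)  # 1.0 0.0
```

An all-null side has an empty top-value set, so the jaccard is 0.0 regardless of what the populated side contains.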

image_fullsize_url text

The columns diverge completely: curated has a 0.0 null_rate across 279196 rows with url_rate 1.0, while firehose is entirely missing (null_rate 1.0 over 125645 rows, flagged all_empty). With null_rate_delta of 1.0 and top_value_jaccard at 0.0, there is no overlap to compare further. Curated also shows one_word_rate 1.0 and a duplicate_rate of 0.0665 (18560 duplicates), but firehose provides no values to contrast.

anthropic:claude-opus-4-7 · confidence high
Out[14]:

report.columns["image_fullsize_url"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      100.0%    +100.0%
unique                260,636
top-value jaccard                         0.00
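
The duplicate figures quoted for curated follow from the row and unique counts in the table, assuming a duplicate is any row beyond the first occurrence of a value (an assumption that happens to reproduce both reported numbers):

```python
n_rows = 279_196     # curated row count from the table
n_unique = 260_636   # curated unique image_fullsize_url values

# Assumption: duplicates = rows that repeat an already-seen value.
n_duplicates = n_rows - n_unique
duplicate_rate = n_duplicates / n_rows

print(n_duplicates, round(duplicate_rate, 4))  # 18560 0.0665
```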

post_cid text

Out[16]:

report.columns["post_cid"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                232,017   103,528
n_unique_delta                            -1.28e+05
len_mean_delta                            +0
len_median_delta                          +0
len_p95_delta                             +0
word_mean_delta                           +0
duplicate_rate_delta                      +0.00705
vocab_size_delta                          -389
top-value jaccard                         0.00

image_index numeric

Out[18]:

report.columns["image_index"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                4         4
n_unique_delta                            +0
mean_delta                                +0.00735
median_delta                              +0
std_delta                                 +0.00287
q1_delta                                  +0
q3_delta                                  +0
min_delta                                 +0
max_delta                                 +0
outlier_rate_delta                        +0.00826
skew_delta                                -0.0347

post_uri text

Out[20]:

report.columns["post_uri"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                232,017   103,367
n_unique_delta                            -1.29e+05
len_mean_delta                            -0.000684
len_median_delta                          +0
len_p95_delta                             +0
word_mean_delta                           +0
duplicate_rate_delta                      +0.00833
vocab_size_delta                          -402
top-value jaccard                         0.00

alt_text text

firehose alt_text is far noisier and more skewed than curated: duplicate_rate rises from 0.086 to 0.200 and the median length falls from 127 to 84 even as the mean climbs to 281.9 and p95 nearly doubles to 1284, indicating a long tail of very long entries alongside many short repeats. Readability also drops sharply (flesch_mean falls from 67.3 in curated to 22.7 in firehose), while url_rate (0.010 to 0.070) and one_word_rate (0.002 to 0.063) both spike. The slices barely overlap on language mix (language_jaccard 0.35) or frequent values (top_value_jaccard 0.03), with firehose introducing ar, ceb, ko, he and others absent from curated.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

report.columns["alt_text"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                255,318   100,455
n_unique_delta                            -1.55e+05
len_mean_delta                            +78.9
len_median_delta                          -43
len_p95_delta                             +582
word_mean_delta                           +2.39
duplicate_rate_delta                      +0.115
vocab_size_delta                          +2.12e+04
top-value jaccard                         0.03
language jaccard                          0.35
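
The combination of a falling median and a rising mean is the signature of a long right tail. The toy lengths below are invented purely to show the effect and are not drawn from either dataset; the nearest-rank `p95` helper is likewise an assumption about how a 95th percentile might be computed.

```python
import math
import statistics


def p95(xs):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    s = sorted(xs)
    rank = max(1, math.ceil(len(s) * 95 / 100))
    return s[rank - 1]


# Toy length distributions: side B has shorter typical entries plus a few very long ones.
side_a = [120] * 90 + [130] * 10
side_b = [80] * 90 + [2000] * 10

for name, xs in (("a", side_a), ("b", side_b)):
    print(name, statistics.mean(xs), statistics.median(xs), p95(xs))
```

On side B the median drops to 80 while the mean rises to 272 and the 95th percentile jumps to 2000, the same direction of movement reported for alt_text.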

collected_at text

Out[24]:

report.columns["collected_at"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                279,196   125,645
n_unique_delta                            -1.54e+05
len_mean_delta                            +0
len_median_delta                          +0
len_p95_delta                             +0
word_mean_delta                           +0
duplicate_rate_delta                      +0
vocab_size_delta                          +0

cursor text

Out[26]:

report.columns["cursor"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             15.9%     0.0%      -15.9%
unique                2,391     103,528
n_unique_delta                            +1.01e+05
len_mean_delta                            -7.64
len_median_delta                          -8
len_p95_delta                             -8
word_mean_delta                           +0
duplicate_rate_delta                      -0.814
vocab_size_delta                          +1.7e+04
top-value jaccard                         0.00

langs_json categorical

Out[28]:

report.columns["langs_json"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                56        189
n_unique_delta                            +133
entropy_delta                             +0.833
top-value jaccard                         0.29

image_count_in_post numeric

Out[30]:

report.columns["image_count_in_post"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                4         4
n_unique_delta                            +0
mean_delta                                +0.0161
median_delta                              +0
std_delta                                 +0.000328
q1_delta                                  +0
q3_delta                                  +0
min_delta                                 +0
max_delta                                 +0
outlier_rate_delta                        +0.00408
skew_delta                                -0.0177

created_at text

Out[32]:

report.columns["created_at"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                232,017   102,375
n_unique_delta                            -1.3e+05
len_mean_delta                            +1.17
len_median_delta                          +0
len_p95_delta                             +5
word_mean_delta                           +0
duplicate_rate_delta                      +0.0162
vocab_size_delta                          -436
top-value jaccard                         0.00

author_did categorical

Out[34]:

report.columns["author_did"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                489       34,785
n_unique_delta                            +3.43e+04
entropy_delta                             +3.77
top-value jaccard                         0.03

raw_record_json text

firehose records carry a url_rate of 0.4649 versus curated's 0.1494, tripling link density and triggering its url_heavy alert. Language mix also diverges sharply (language_jaccard 0.25): firehose adds de, fi, fr, ja, ru, and tr on top of en/pt, earning the multilingual alert, while curated is effectively English-only (396 en vs 2 pt). firehose payloads are longer too (len_mean 1290.7 vs 1136.37, len_p95 2774 vs 2354) with a far more negative readability_flesch_mean of -422.28 vs -155.70, and its one_word_rate of 0.0242 dwarfs curated's 0.0005.

anthropic:claude-opus-4-7 · confidence high
Out[36]:

report.columns["raw_record_json"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                232,017   103,528
n_unique_delta                            -1.28e+05
len_mean_delta                            +154
len_median_delta                          +81
len_p95_delta                             +420
word_mean_delta                           -2.45
duplicate_rate_delta                      +0.00705
vocab_size_delta                          +1.19e+04
top-value jaccard                         0.00
language jaccard                          0.25
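
A sketch of how a url_rate style metric could be computed, assuming it means the share of non-empty values containing at least one `http://` or `https://` link. That definition, the regex, and the helper name are assumptions; the report does not say how saturn-dissect detects URLs.

```python
import re

# Assumption: a "URL" is any http(s):// token up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def url_rate(values):
    """Fraction of non-empty strings that contain at least one URL."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    return sum(bool(URL_RE.search(v)) for v in non_empty) / len(non_empty)


sample = ["see https://example.com", "no link here", "http://a.b/c", "plain text"]
print(url_rate(sample))  # 0.5

# Ratio of the two reported rates (firehose over curated).
print(round(0.4649 / 0.1494, 2))  # 3.11
```

The second line checks the "tripling" claim directly from the reported figures: 0.4649 is about 3.1 times 0.1494.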

text text

Out[38]:

report.columns["text"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                210,092   92,997
n_unique_delta                            -1.17e+05
len_mean_delta                            -7.78
len_median_delta                          +17
len_p95_delta                             -4
word_mean_delta                           -5.95
duplicate_rate_delta                      +0.0123
vocab_size_delta                          +7.65e+03
top-value jaccard                         0.05
language jaccard                          0.43

image_thumb_url text

The image_thumb_url column is fully populated in curated (null_rate 0.0) but entirely missing in firehose (null_rate 1.0, all_empty alert), a null_rate_delta of 1.0. Curated also shows url_rate 1.0 across 279196 rows with 260636 unique values, while firehose contributes no values at all, yielding a top_value_jaccard of 0.0. The two slices share no overlap on this field.

anthropic:claude-opus-4-7 · confidence high
Out[40]:

report.columns["image_thumb_url"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      100.0%    +100.0%
unique                260,636
top-value jaccard                         0.00

image_alt_length numeric

Out[42]:

report.columns["image_alt_length"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                1,951     2,036
n_unique_delta                            +85
mean_delta                                +78.9
median_delta                              -43
std_delta                                 +246
q1_delta                                  -36
q3_delta                                  +58
min_delta                                 +0
max_delta                                 +1.5e+04
outlier_rate_delta                        +0.0695
skew_delta                                +9.61

author_handle categorical

The author_handle column is fully populated in curated (null_rate 0.0 across 279196 rows) but entirely missing in firehose, which triggers an all_null alert at null_rate 1.0 over 125645 rows. As a result, curated's 489 unique handles and top value 'firefaerie81.bsky.social' (top_rate 0.02446, entropy 7.6765) have no counterpart in firehose, yielding a top_value_jaccard of 0.0. This is a structural divergence: firehose is not carrying author_handle at all.

anthropic:claude-opus-4-7 · confidence high
Out[44]:

report.columns["author_handle"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      100.0%    +100.0%
unique                489
top-value jaccard                         0.00
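
To put the reported entropy of 7.6765 in context, it helps to compare it with the maximum possible for 489 distinct values. The sketch below assumes Shannon entropy; the report does not state the log base, but 7.6765 exceeds ln(489) ≈ 6.19, the ceiling in nats, so the figure is consistent with bits (base 2). The helper name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2.0):
    """Shannon entropy of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


# Four equally likely values give exactly 2 bits.
print(shannon_entropy(["a", "b", "c", "d"]))

# Upper bound for 489 distinct values, reached only if all are equally frequent.
max_bits = math.log2(489)
print(round(max_bits, 2))           # 8.93
print(round(7.6765 / max_bits, 2))  # 0.86
```

On that reading curated's handles sit at roughly 86% of the maximum entropy, meaning posts are spread fairly evenly across the 489 authors with a mild concentration at the top.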

indexed_at text

Out[46]:

report.columns["indexed_at"].delta

stat                  curated   firehose  Δ
n                     279,196   125,645
null rate             0.0%      0.0%      +0.0%
unique                232,015   103,528
n_unique_delta                            -1.28e+05
len_mean_delta                            +8
len_median_delta                          +8
len_p95_delta                             +8
word_mean_delta                           +0
duplicate_rate_delta                      +0.00704
vocab_size_delta                          -389
top-value jaccard                         0.00

How to cite

BibTeX
@misc{saturn-bluesky-alt-text-curated-vs-firehose-2026,
  author       = {Steuber, Luke},
  title        = {Saturn compare: curated vs firehose},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/bluesky-alt-text-curated-vs-firehose}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn compare: curated vs firehose. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/bluesky-alt-text-curated-vs-firehose