saturn

curated vs firehose

curated: 279,196 rows · firehose: 125,645 rows · 21 columns compared · generated 2026-04-22

Reading

compare summary · high confidence · anthropic:claude-opus-4-7

The curated dataset (279,196 rows) is roughly 2.2× the size of firehose (125,645 rows), and several columns differ structurally rather than just statistically. Four fields (`query`, `image_fullsize_url`, `image_thumb_url`, and `author_handle`) show a +100% null shift with zero top-value overlap: each is fully populated in curated and entirely null in firehose, as the per-column tables below show. Text fields also diverge in shape. In firehose, `alt_text` is 79 characters longer on average, with low language overlap (jaccard 0.35) and almost no shared top values (0.03), while `raw_record_json` is 154 characters longer on average with a language jaccard of 0.25. These patterns suggest the two sources capture different schemas or enrichments rather than sampling noise. Confidence is high that the divergence is structural.

citing: a_row_count · b_row_count · divergences

Most divergent columns

ranked by composite score
  1. alt_text (text) · score 2.01 · len_mean +79 · lang-jaccard 0.35 · top-val-jaccard 0.03
  2. query (categorical) · score 2.00 · null +100% · top-val-jaccard 0.00
  3. image_fullsize_url (text) · score 2.00 · null +100% · top-val-jaccard 0.00
  4. image_thumb_url (text) · score 2.00 · null +100% · top-val-jaccard 0.00
  5. author_handle (categorical) · score 2.00 · null +100% · top-val-jaccard 0.00
  6. raw_record_json (text) · score 1.89 · len_mean +154 · lang-jaccard 0.25 · top-val-jaccard 0.00

image_mime_type

categorical
Side-by-side stats for image_mime_type
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 5 8
n_unique_delta +3
entropy_delta +0.552
top-value jaccard 0.62
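
The entropy_delta row can be read as the difference in Shannon entropy between the two value distributions. The sketch below is a minimal illustration, assuming entropy is measured in bits and the delta is firehose minus curated (the sign the n_unique_delta rows follow). The report does not state either convention, though the entropy of 7.68 quoted for a 489-value column rules out natural logs, since ln(489) ≈ 6.19 would be the maximum. The sample values are made up for illustration and are not drawn from either dataset.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy (base 2) of the non-null values in an iterable."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical MIME-type samples, for illustration only.
curated_sample = ["image/jpeg"] * 6 + ["image/png"] * 2
firehose_sample = ["image/jpeg"] * 4 + ["image/png"] * 2 + ["image/webp"] * 2

# Assumed convention: delta = firehose - curated.
entropy_delta = shannon_entropy_bits(firehose_sample) - shannon_entropy_bits(curated_sample)
print(f"entropy_delta = {entropy_delta:+.3f}")
```

A positive delta means the firehose values are spread more evenly across more categories, which matches the direction reported here (8 unique values against 5).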

source_mode

categorical
Side-by-side stats for source_mode
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 1 1
n_unique_delta +0
entropy_delta +0
top-value jaccard 0.00

image_ref

text
Side-by-side stats for image_ref
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% -0.0%
unique 260,595 114,610
n_unique_delta -1.46e+05
len_mean_delta +0
len_median_delta +0
len_p95_delta +0
word_mean_delta +0
duplicate_rate_delta +0.0213
vocab_size_delta -644
top-value jaccard 0.05

query

categorical
The query column is fully populated in curated (null_rate 0.0 across 279,196 rows, 489 unique values led by 'firefaerie81.bsky.social' at 2.4%) but entirely null in firehose (null_rate 1.0 over 125,645 rows, flagged all_null). This yields a null_rate_delta of 1.0 and a top_value_jaccard of 0.0: the two slices share no values on this field. Curated has an entropy of 7.68, while firehose has no distribution to compare.
high · anthropic:claude-opus-4-7
Side-by-side stats for query
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 100.0% +100.0%
unique 489
top-value jaccard 0.00
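
The null-rate comparison above can be sketched as follows. This is an illustration under stated assumptions, not the report's implementation: nulls are represented as `None`, the delta is firehose minus curated, and an all-null flag fires when the rate is exactly 1.0. The function names and toy columns are hypothetical.

```python
def null_rate(values):
    """Fraction of entries that are None; 0.0 for an empty column."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_summary(curated_col, firehose_col):
    """Compare null rates; the delta is firehose minus curated (assumed)."""
    a, b = null_rate(curated_col), null_rate(firehose_col)
    return {
        "null_rate_delta": b - a,
        "curated_all_null": a == 1.0,
        "firehose_all_null": b == 1.0,
    }


# Toy columns standing in for `query`: populated on one side, null on the other.
summary = null_summary(["handle-a", "handle-b", "handle-a"], [None, None])
print(summary)
```

Reporting which side is all-null alongside the delta removes any ambiguity about the direction of a +100% shift.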

image_fullsize_url

text
The columns diverge completely: curated has a null_rate of 0.0 across 279,196 rows with url_rate 1.0, while firehose is entirely missing (null_rate 1.0 over 125,645 rows, flagged all_empty). With a null_rate_delta of 1.0 and a top_value_jaccard of 0.0, there is no overlap to compare further. Curated also shows one_word_rate 1.0 and a duplicate_rate of 0.0665 (18,560 duplicates), but firehose provides no values to contrast.
high · anthropic:claude-opus-4-7
Side-by-side stats for image_fullsize_url
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 100.0% +100.0%
unique 260,636
top-value jaccard 0.00
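
The duplicate figures quoted for curated follow from the row and unique counts in the table, assuming duplicate_rate is defined as (n - unique) / n. That definition is inferred from the numbers and is not stated in the report, so treat this as a consistency check rather than the tool's formula.

```python
def duplicate_stats(n_rows, n_unique):
    """Return (duplicate count, duplicate rate), assuming rate = (n - unique) / n."""
    duplicates = n_rows - n_unique
    return duplicates, duplicates / n_rows


# Row and unique counts for curated image_fullsize_url, taken from the table above.
dups, rate = duplicate_stats(279_196, 260_636)
print(dups, round(rate, 4))  # 18560 0.0665
```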

post_cid

text
Side-by-side stats for post_cid
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 232,017 103,528
n_unique_delta -1.28e+05
len_mean_delta +0
len_median_delta +0
len_p95_delta +0
word_mean_delta +0
duplicate_rate_delta +0.00705
vocab_size_delta -389
top-value jaccard 0.00

image_index

numeric
Side-by-side stats for image_index
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 4 4
n_unique_delta +0
mean_delta +0.00735
median_delta +0
std_delta +0.00287
q1_delta +0
q3_delta +0
min_delta +0
max_delta +0
outlier_rate_delta +0.00826
skew_delta -0.0347

post_uri

text
Side-by-side stats for post_uri
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 232,017 103,367
n_unique_delta -1.29e+05
len_mean_delta -0.000684
len_median_delta +0
len_p95_delta +0
word_mean_delta +0
duplicate_rate_delta +0.00833
vocab_size_delta -402
top-value jaccard 0.00

alt_text

text
firehose alt_text is noisier and more skewed than curated: duplicate_rate rises from 0.086 to 0.200 and the median length drops from 127 to 84, even as the mean climbs to 281.9 and p95 nearly doubles to 1,284, indicating a long tail of very long entries alongside many short repeats. Readability also falls sharply (flesch_mean 67.3 in curated vs 22.7 in firehose), while url_rate (0.070 in firehose vs 0.010 in curated) and one_word_rate (0.063 vs 0.002) rise. The slices barely overlap on language mix (language_jaccard 0.35) or frequent values (top_value_jaccard 0.03), with firehose introducing ar, ceb, ko, he and others absent from curated.
high · anthropic:claude-opus-4-7
Side-by-side stats for alt_text
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 255,318 100,455
n_unique_delta -1.55e+05
len_mean_delta +78.9
len_median_delta -43
len_p95_delta +582
word_mean_delta +2.39
duplicate_rate_delta +0.115
vocab_size_delta +2.12e+04
top-value jaccard 0.03
language jaccard 0.35
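
A mean that rises while the median falls is the signature of a long right tail. The sketch below reproduces that pattern on toy strings. The data are invented, the nearest-rank p95 is an assumption (the report does not say which percentile method it uses), and `length_profile` is a hypothetical helper.

```python
import math
import statistics


def length_profile(texts):
    """Mean, median and nearest-rank p95 of character lengths."""
    lengths = sorted(len(t) for t in texts)
    # Nearest-rank percentile: smallest value with >= 95% of data at or below it.
    rank = max(1, math.ceil(0.95 * len(lengths)))
    return {
        "len_mean": statistics.fmean(lengths),
        "len_median": statistics.median(lengths),
        "len_p95": lengths[rank - 1],
    }


# Toy data: uniform medium-length entries vs. many short entries plus a long tail.
even = ["x" * 100] * 20
skewed = ["x" * 40] * 18 + ["x" * 2000] * 2

a, b = length_profile(even), length_profile(skewed)
deltas = {k: b[k] - a[k] for k in a}  # second slice minus first
print(deltas)
```

In the toy case the mean and p95 deltas are positive while the median delta is negative, the same sign pattern as the len_mean_delta, len_p95_delta and len_median_delta rows above.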

collected_at

text
Side-by-side stats for collected_at
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 279,196 125,645
n_unique_delta -1.54e+05
len_mean_delta +0
len_median_delta +0
len_p95_delta +0
word_mean_delta +0
duplicate_rate_delta +0
vocab_size_delta +0

cursor

text
Side-by-side stats for cursor
Stat curated firehose Δ
n 279,196 125,645
null rate 15.9% 0.0% -15.9%
unique 2,391 103,528
n_unique_delta +1.01e+05
len_mean_delta -7.64
len_median_delta -8
len_p95_delta -8
word_mean_delta +0
duplicate_rate_delta -0.814
vocab_size_delta +1.7e+04
top-value jaccard 0.00

langs_json

categorical
Side-by-side stats for langs_json
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 56 189
n_unique_delta +133
entropy_delta +0.833
top-value jaccard 0.29

image_count_in_post

numeric
Side-by-side stats for image_count_in_post
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 4 4
n_unique_delta +0
mean_delta +0.0161
median_delta +0
std_delta +0.000328
q1_delta +0
q3_delta +0
min_delta +0
max_delta +0
outlier_rate_delta +0.00408
skew_delta -0.0177

created_at

text
Side-by-side stats for created_at
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 232,017 102,375
n_unique_delta -1.3e+05
len_mean_delta +1.17
len_median_delta +0
len_p95_delta +5
word_mean_delta +0
duplicate_rate_delta +0.0162
vocab_size_delta -436
top-value jaccard 0.00

author_did

categorical
Side-by-side stats for author_did
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 489 34,785
n_unique_delta +3.43e+04
entropy_delta +3.77
top-value jaccard 0.03

raw_record_json

text
firehose records carry a url_rate of 0.4649 versus curated's 0.1494, roughly tripling link density and triggering the url_heavy alert. Language mix also diverges sharply (language_jaccard 0.25): firehose adds de, fi, fr, ja, ru, and tr on top of en/pt, earning the multilingual alert, while curated is effectively English-only (396 en vs 2 pt). firehose payloads are longer too (len_mean 1290.7 vs 1136.37, len_p95 2774 vs 2354), with a far more negative readability_flesch_mean (-422.28 vs -155.70) and a one_word_rate of 0.0242 against curated's 0.0005.
high · anthropic:claude-opus-4-7
Side-by-side stats for raw_record_json
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 232,017 103,528
n_unique_delta -1.28e+05
len_mean_delta +154
len_median_delta +81
len_p95_delta +420
word_mean_delta -2.45
duplicate_rate_delta +0.00705
vocab_size_delta +1.19e+04
top-value jaccard 0.00
language jaccard 0.25
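
The language jaccard of 0.25 is consistent with a plain set Jaccard over detected language codes, taking the languages named in the note above: en and pt on the curated side, and those two plus de, fi, fr, ja, ru and tr on the firehose side. The sketch assumes those lists are complete and that the metric is intersection over union. Neither is confirmed by the report.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)


# Language codes as listed in the raw_record_json note.
curated_langs = {"en", "pt"}
firehose_langs = {"en", "pt", "de", "fi", "fr", "ja", "ru", "tr"}

print(jaccard(curated_langs, firehose_langs))  # 0.25
```

The same function returns 0.0 for two disjoint sets, which is how a top-value jaccard of 0.00 arises when the most frequent values on each side never coincide.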

text

text
Side-by-side stats for text
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 210,092 92,997
n_unique_delta -1.17e+05
len_mean_delta -7.78
len_median_delta +17
len_p95_delta -4
word_mean_delta -5.95
duplicate_rate_delta +0.0123
vocab_size_delta +7.65e+03
top-value jaccard 0.05
language jaccard 0.43

image_thumb_url

text
The image_thumb_url column is fully populated in curated (null_rate 0.0) but entirely missing in firehose (null_rate 1.0, all_empty alert), a null_rate_delta of 1.0. Curated also shows url_rate 1.0 across 279,196 rows with 260,636 unique values, while firehose contributes no values at all, yielding a top_value_jaccard of 0.0. The two slices share no overlap on this field.
high · anthropic:claude-opus-4-7
Side-by-side stats for image_thumb_url
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 100.0% +100.0%
unique 260,636
top-value jaccard 0.00

image_alt_length

numeric
Side-by-side stats for image_alt_length
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 1,951 2,036
n_unique_delta +85
mean_delta +78.9
median_delta -43
std_delta +246
q1_delta -36
q3_delta +58
min_delta +0
max_delta +1.5e+04
outlier_rate_delta +0.0695
skew_delta +9.61

author_handle

categorical
The author_handle column is fully populated in curated (null_rate 0.0 across 279,196 rows) but entirely missing in firehose, which triggers an all_null alert at null_rate 1.0 over 125,645 rows. As a result, curated's 489 unique handles and top value 'firefaerie81.bsky.social' (top_rate 0.02446, entropy 7.6765) have no counterpart in firehose, yielding a top_value_jaccard of 0.0. This is a structural divergence: firehose is not carrying author_handle at all.
high · anthropic:claude-opus-4-7
Side-by-side stats for author_handle
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 100.0% +100.0%
unique 489
top-value jaccard 0.00

indexed_at

text
Side-by-side stats for indexed_at
Stat curated firehose Δ
n 279,196 125,645
null rate 0.0% 0.0% +0.0%
unique 232,015 103,528
n_unique_delta -1.28e+05
len_mean_delta +8
len_median_delta +8
len_p95_delta +8
word_mean_delta +0
duplicate_rate_delta +0.00704
vocab_size_delta -389
top-value jaccard 0.00