curated vs firehose
Reading
The curated dataset (279,196 rows) is roughly 2.2× the size of firehose (125,645 rows), and several columns differ structurally rather than just statistically. Four fields (`query`, `image_fullsize_url`, `image_thumb_url`, and `author_handle`) show a +100% null shift with zero top-value overlap, indicating they are populated on one side and entirely absent on the other; the per-column tables for `query` and the two image URL fields show that curated is the populated side and firehose the null one. Text fields also diverge in shape: `alt_text` is about 79 characters longer on average in firehose, with low language overlap (jaccard 0.35) and almost no shared top values (0.03), while `raw_record_json` is about 154 characters longer on average with language jaccard 0.25. These patterns suggest the two sources capture different schemas or enrichments rather than differing only through sampling noise. Confidence is high that the divergence is structural.
citing: a_row_count · b_row_count · divergences
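As a rough illustration of how a "+100% null shift with zero top-value overlap" can arise, the sketch below computes a null-rate delta and a top-value Jaccard for two toy columns. The function names, the use of `None` as the null marker, the cutoff `k=10`, and the firehose-minus-curated sign convention are assumptions made for this example, not details confirmed by the report.

```python
from collections import Counter


def null_rate(values):
    """Fraction of entries that are None (assumed null marker)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def top_values(values, k=10):
    """Set of the k most frequent non-null values (k is an assumed cutoff)."""
    counts = Counter(v for v in values if v is not None)
    return {value for value, _ in counts.most_common(k)}


def jaccard(a, b):
    """|a & b| / |a | b|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data: populated on one side, entirely null on the other.
curated = ["a.example", "b.example", "a.example"]
firehose = [None, None, None]

null_delta = null_rate(firehose) - null_rate(curated)
overlap = jaccard(top_values(curated), top_values(firehose))
print(null_delta, overlap)  # 1.0 0.0
```

A column that is null on one side has an empty top-value set there, so its Jaccard with any populated column is necessarily 0.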
Most divergent columns
ranked by composite score

| Score | Column | Type | Signals |
|---|---|---|---|
| 2.01 | alt_text | text | len_mean +79, lang-jaccard 0.35, top-val-jaccard 0.03 |
| 2.00 | query | categorical | null +100%, top-val-jaccard 0.00 |
| 2.00 | image_fullsize_url | text | null +100%, top-val-jaccard 0.00 |
| 2.00 | image_thumb_url | text | null +100%, top-val-jaccard 0.00 |
| 2.00 | author_handle | categorical | null +100%, top-val-jaccard 0.00 |
| 1.89 | raw_record_json | text | len_mean +154, lang-jaccard 0.25, top-val-jaccard 0.00 |

image_mime_type
categorical

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 5 | 8 | — |
| n_unique_delta | — | — | +3 |
| entropy_delta | — | — | +0.552 |
| top-value jaccard | — | — | 0.62 |
source_mode
categorical

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 1 | 1 | — |
| n_unique_delta | — | — | +0 |
| entropy_delta | — | — | +0 |
| top-value jaccard | — | — | 0.00 |
image_ref
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | -0.0% |
| unique | 260,595 | 114,610 | — |
| n_unique_delta | — | — | -1.46e+05 |
| len_mean_delta | — | — | +0 |
| len_median_delta | — | — | +0 |
| len_p95_delta | — | — | +0 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | +0.0213 |
| vocab_size_delta | — | — | -644 |
| top-value jaccard | — | — | 0.05 |
query
categorical

The query column is fully populated in curated (null_rate 0.0 across 279,196 rows; 489 unique values, led by 'firefaerie81.bsky.social' at 2.4%) but entirely null in firehose (null_rate 1.0 over 125,645 rows, flagged all_null). This yields a null_rate_delta of 1.0 and a top_value_jaccard of 0.0: the two slices share no values on this field. Curated has an entropy of 7.68, while firehose has no distribution to compare.

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 100.0% | +100.0% |
| unique | 489 | — | — |
| top-value jaccard | — | — | 0.00 |
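To put the entropy figure of 7.68 in context, here is a minimal sketch of Shannon entropy over a categorical column. The report does not state the logarithm base. Bits (base 2) are assumed here because 489 distinct values cap entropy at ln 489 ≈ 6.19 in nats, which is below 7.68, whereas the cap in bits is log2 489 ≈ 8.93. The function name and the treatment of `None` as null are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits over non-null values; None if nothing to count."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return None  # an all-null column has no distribution
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Four equally frequent values give exactly log2(4) = 2 bits.
uniform = shannon_entropy(["a", "b", "c", "d"])

# An all-null column (like query in firehose) yields no entropy at all.
empty = shannon_entropy([None, None, None])

# Upper bound for 489 distinct values, reached only if all are equally likely.
max_bits = math.log2(489)
print(uniform, empty, round(max_bits, 2))
```

Under this assumption, 7.68 bits against a ceiling of about 8.93 would describe a fairly even spread across the 489 query values.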
image_fullsize_url
text

This column diverges completely between the two slices: curated has a null_rate of 0.0 across 279,196 rows with url_rate 1.0, while firehose is entirely missing (null_rate 1.0 over 125,645 rows, flagged all_empty). With a null_rate_delta of 1.0 and a top_value_jaccard of 0.0, there is no overlap to compare further. Curated also shows one_word_rate 1.0 and a duplicate_rate of 0.0665 (18,560 duplicates), but firehose provides no values to contrast.

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 100.0% | +100.0% |
| unique | 260,636 | — | — |
| top-value jaccard | — | — | 0.00 |
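The duplicate figures above can be checked from the row and unique counts in the table. The sketch assumes duplicate_rate is defined as (rows minus unique values) divided by rows. The report does not state this definition, but it reproduces the quoted 18,560 and 0.0665. The function name is illustrative.

```python
def duplicate_rate(n_rows, n_unique):
    """Share of rows that repeat an earlier value (assumed definition)."""
    if n_rows == 0:
        return 0.0
    return (n_rows - n_unique) / n_rows


# Counts taken from the image_fullsize_url table for curated.
n_rows, n_unique = 279_196, 260_636

duplicates = n_rows - n_unique
rate = duplicate_rate(n_rows, n_unique)
print(duplicates, round(rate, 4))  # 18560 0.0665
```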
post_cid
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 232,017 | 103,528 | — |
| n_unique_delta | — | — | -1.28e+05 |
| len_mean_delta | — | — | +0 |
| len_median_delta | — | — | +0 |
| len_p95_delta | — | — | +0 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | +0.00705 |
| vocab_size_delta | — | — | -389 |
| top-value jaccard | — | — | 0.00 |
image_index
numeric

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 4 | 4 | — |
| n_unique_delta | — | — | +0 |
| mean_delta | — | — | +0.00735 |
| median_delta | — | — | +0 |
| std_delta | — | — | +0.00287 |
| q1_delta | — | — | +0 |
| q3_delta | — | — | +0 |
| min_delta | — | — | +0 |
| max_delta | — | — | +0 |
| outlier_rate_delta | — | — | +0.00826 |
| skew_delta | — | — | -0.0347 |
post_uri
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 232,017 | 103,367 | — |
| n_unique_delta | — | — | -1.29e+05 |
| len_mean_delta | — | — | -0.000684 |
| len_median_delta | — | — | +0 |
| len_p95_delta | — | — | +0 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | +0.00833 |
| vocab_size_delta | — | — | -402 |
| top-value jaccard | — | — | 0.00 |
alt_text
text

In firehose, alt_text is far dirtier and more skewed than in curated: duplicate_rate jumps from 0.086 to 0.200 and the median length collapses from 127 to 84, even as the mean climbs to 281.9 and p95 nearly doubles to 1284. This indicates a long tail of very long entries alongside many short repeats. Readability also craters (flesch_mean falls from 67.3 to 22.7), while url_rate (0.010 to 0.070) and one_word_rate (0.002 to 0.063) spike. The slices barely overlap on language mix (language_jaccard 0.35) or frequent values (top_value_jaccard 0.03), with firehose introducing ar, ceb, ko, he and others absent from curated.

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 255,318 | 100,455 | — |
| n_unique_delta | — | — | -1.55e+05 |
| len_mean_delta | — | — | +78.9 |
| len_median_delta | — | — | -43 |
| len_p95_delta | — | — | +582 |
| word_mean_delta | — | — | +2.39 |
| duplicate_rate_delta | — | — | +0.115 |
| vocab_size_delta | — | — | +2.12e+04 |
| top-value jaccard | — | — | 0.03 |
| language jaccard | — | — | 0.35 |
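The combination of a lower median with a higher mean and p95 is the usual signature of a long right tail. The sketch below reproduces that pattern on toy strings. The function name and the nearest-rank percentile method are assumptions for this example. The report does not say how its p95 is computed.

```python
import math
import statistics


def length_summary(texts):
    """Mean, median, and nearest-rank 95th percentile of string lengths."""
    lengths = sorted(len(t) for t in texts)
    if not lengths:
        raise ValueError("need at least one text")
    # Nearest-rank percentile: smallest value with at least 95% of data at or below it.
    p95_index = math.ceil(0.95 * len(lengths)) - 1
    return {
        "mean": statistics.fmean(lengths),
        "median": statistics.median(lengths),
        "p95": lengths[p95_index],
    }


# Toy column: nine short entries and one very long one.
texts = ["x" * 10] * 9 + ["x" * 1000]
summary = length_summary(texts)
print(summary)
```

A single long entry lifts the mean to 109 and the p95 to 1000 while the median stays at 10, the same direction of movement the alt_text deltas show.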
collected_at
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 279,196 | 125,645 | — |
| n_unique_delta | — | — | -1.54e+05 |
| len_mean_delta | — | — | +0 |
| len_median_delta | — | — | +0 |
| len_p95_delta | — | — | +0 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | +0 |
| vocab_size_delta | — | — | +0 |
cursor
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 15.9% | 0.0% | -15.9% |
| unique | 2,391 | 103,528 | — |
| n_unique_delta | — | — | +1.01e+05 |
| len_mean_delta | — | — | -7.64 |
| len_median_delta | — | — | -8 |
| len_p95_delta | — | — | -8 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | -0.814 |
| vocab_size_delta | — | — | +1.7e+04 |
| top-value jaccard | — | — | 0.00 |
langs_json
categorical

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 56 | 189 | — |
| n_unique_delta | — | — | +133 |
| entropy_delta | — | — | +0.833 |
| top-value jaccard | — | — | 0.29 |
image_count_in_post
numeric

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 4 | 4 | — |
| n_unique_delta | — | — | +0 |
| mean_delta | — | — | +0.0161 |
| median_delta | — | — | +0 |
| std_delta | — | — | +0.000328 |
| q1_delta | — | — | +0 |
| q3_delta | — | — | +0 |
| min_delta | — | — | +0 |
| max_delta | — | — | +0 |
| outlier_rate_delta | — | — | +0.00408 |
| skew_delta | — | — | -0.0177 |
created_at
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 232,017 | 102,375 | — |
| n_unique_delta | — | — | -1.3e+05 |
| len_mean_delta | — | — | +1.17 |
| len_median_delta | — | — | +0 |
| len_p95_delta | — | — | +5 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | +0.0162 |
| vocab_size_delta | — | — | -436 |
| top-value jaccard | — | — | 0.00 |
raw_record_json
text

In firehose, raw_record_json carries a url_rate of 0.4649 versus 0.1494 in curated, roughly tripling link density and triggering the url_heavy alert. Language mix also diverges sharply (language_jaccard 0.25): firehose adds de, fi, fr, ja, ru, and tr on top of en/pt, earning the multilingual alert, while curated is effectively English-only (396 en vs 2 pt). Firehose payloads are longer too (len_mean 1290.7 vs 1136.37, len_p95 2774 vs 2354), with a far more negative readability_flesch_mean (-422.28 vs -155.70) and a one_word_rate of 0.0242 against 0.0005 in curated.

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 232,017 | 103,528 | — |
| n_unique_delta | — | — | -1.28e+05 |
| len_mean_delta | — | — | +154 |
| len_median_delta | — | — | +81 |
| len_p95_delta | — | — | +420 |
| word_mean_delta | — | — | -2.45 |
| duplicate_rate_delta | — | — | +0.00705 |
| vocab_size_delta | — | — | +1.19e+04 |
| top-value jaccard | — | — | 0.00 |
| language jaccard | — | — | 0.25 |
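The language_jaccard of 0.25 is consistent with the languages named in the narrative. The sketch assumes the detected language sets are exactly those listed there (en and pt in curated, plus six more in firehose), which the report does not confirm. It also checks the "tripling" of url_rate.

```python
def jaccard(a, b):
    """|a & b| / |a | b|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Assumed language sets, read off the narrative above.
curated_langs = {"en", "pt"}
firehose_langs = curated_langs | {"de", "fi", "fr", "ja", "ru", "tr"}

language_jaccard = jaccard(curated_langs, firehose_langs)  # 2 shared / 8 total
only_in_firehose = sorted(firehose_langs - curated_langs)

# Ratio of the two url_rate values quoted in the narrative.
url_ratio = 0.4649 / 0.1494

print(language_jaccard, only_in_firehose, round(url_ratio, 2))
# 0.25 ['de', 'fi', 'fr', 'ja', 'ru', 'tr'] 3.11
```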
text
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 210,092 | 92,997 | — |
| n_unique_delta | — | — | -1.17e+05 |
| len_mean_delta | — | — | -7.78 |
| len_median_delta | — | — | +17 |
| len_p95_delta | — | — | -4 |
| word_mean_delta | — | — | -5.95 |
| duplicate_rate_delta | — | — | +0.0123 |
| vocab_size_delta | — | — | +7.65e+03 |
| top-value jaccard | — | — | 0.05 |
| language jaccard | — | — | 0.43 |
image_thumb_url
text

The image_thumb_url column is fully populated in curated (null_rate 0.0) but entirely missing in firehose (null_rate 1.0, all_empty alert), a null_rate_delta of 1.0. Curated also shows url_rate 1.0 across 279,196 rows with 260,636 unique values, while firehose contributes no values at all, yielding a top_value_jaccard of 0.0. The two slices share no overlap on this field.

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 100.0% | +100.0% |
| unique | 260,636 | — | — |
| top-value jaccard | — | — | 0.00 |
image_alt_length
numeric

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 1,951 | 2,036 | — |
| n_unique_delta | — | — | +85 |
| mean_delta | — | — | +78.9 |
| median_delta | — | — | -43 |
| std_delta | — | — | +246 |
| q1_delta | — | — | -36 |
| q3_delta | — | — | +58 |
| min_delta | — | — | +0 |
| max_delta | — | — | +1.5e+04 |
| outlier_rate_delta | — | — | +0.0695 |
| skew_delta | — | — | +9.61 |
indexed_at
text

| Stat | curated | firehose | Δ |
|---|---|---|---|
| n | 279,196 | 125,645 | — |
| null rate | 0.0% | 0.0% | +0.0% |
| unique | 232,015 | 103,528 | — |
| n_unique_delta | — | — | -1.28e+05 |
| len_mean_delta | — | — | +8 |
| len_median_delta | — | — | +8 |
| len_p95_delta | — | — | +8 |
| word_mean_delta | — | — | +0 |
| duplicate_rate_delta | — | — | +0.00704 |
| vocab_size_delta | — | — | -389 |
| top-value jaccard | — | — | 0.00 |