saturn compare
curated vs firehose
curated: hf://lukeslp/bluesky-alt-text:[train][source_mode=author_feed] · 279,196 rows
firehose: hf://lukeslp/bluesky-alt-text:[train][source_mode=jetstream] · 125,645 rows
2026-04-22T05:55:01+00:00
Most divergent columns
Ranked by a composite of null-rate drift, mean/length delta, entropy delta, top-value jaccard, and language-mix jaccard.
- alt_text text score 2.01 len_mean +79 lang-jaccard 0.35 top-val-jaccard 0.03
- query categorical score 2.0 null +100% top-val-jaccard 0.00
- image_fullsize_url text score 2.0 null +100% top-val-jaccard 0.00
- image_thumb_url text score 2.0 null +100% top-val-jaccard 0.00
- author_handle categorical score 2.0 null +100% top-val-jaccard 0.00
- raw_record_json text score 1.886 len_mean +154 lang-jaccard 0.25 top-val-jaccard 0.00
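
The scores listed above appear consistent with a plain sum of four terms: the absolute null-rate drift, the mean-length delta relative to the curated mean, and one minus each Jaccard overlap. This is an inference from the numbers in this report, not saturn's documented formula. The function name `composite_score` and its parameters are hypothetical, and the entropy term is left out because its normalization cannot be recovered from the report. The inputs are taken from the per-column tables further down.

```python
from typing import Optional


def composite_score(
    null_drift: float = 0.0,
    mean_len_curated: Optional[float] = None,
    mean_len_delta: float = 0.0,
    top_value_jaccard: Optional[float] = None,
    language_jaccard: Optional[float] = None,
) -> float:
    """Hypothetical reconstruction: sum of per-metric divergences."""
    score = abs(null_drift)
    if mean_len_curated:
        # Length delta expressed relative to the curated mean length.
        score += abs(mean_len_delta) / mean_len_curated
    for jaccard in (top_value_jaccard, language_jaccard):
        if jaccard is not None:
            # A Jaccard overlap of 0 contributes a full point of divergence.
            score += 1.0 - jaccard
    return score


alt_text = composite_score(mean_len_curated=202.974, mean_len_delta=78.886,
                           top_value_jaccard=0.026, language_jaccard=0.353)
raw_record_json = composite_score(mean_len_curated=1136, mean_len_delta=154.334,
                                  top_value_jaccard=0.000, language_jaccard=0.250)
query = composite_score(null_drift=1.0, top_value_jaccard=0.000)

print(round(alt_text, 2), round(raw_record_json, 3), round(query, 1))
# 2.01 1.886 2.0
```

Under this reading, a column that is entirely null on one side and shares no top values scores exactly 2.0, which matches the four fully-null columns in the list.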
image_mime_type categorical
curated: 279,196 rows · 0.0% null · 5 unique · top value is 97.5% of rows
firehose: 125,645 rows · 0.0% null · 8 unique

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 5 | 8 | +3.000 |
| entropy | 0.172 | 0.724 | +0.552 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.625 |

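
The top-value overlap rows report a Jaccard similarity: the size of the intersection of two value sets divided by the size of their union. For this column the reported 0.625 equals 5/8, which is what results if all five curated MIME types also occur among the eight firehose types. The sketch below illustrates that case. The MIME type strings are hypothetical placeholders (the report does not list them); only the set sizes come from the table, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|; returns 0.0 when both sets are empty (assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical values: 5 curated types, all contained in 8 firehose types.
curated = {"image/jpeg", "image/png", "image/webp", "image/gif", "image/heic"}
firehose = curated | {"image/avif", "image/bmp", "image/tiff"}

print(jaccard(curated, firehose))  # 0.625
```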
source_mode categorical
curated: 279,196 rows · 0.0% null · 1 unique · top value is 100.0% of rows
firehose: 125,645 rows · 0.0% null · 1 unique · top value is 100.0% of rows

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 1 | 1 | +0.000 |
| entropy | -0.000 | -0.000 | +0.000 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.000 |

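
The entropy rows are presumably the Shannon entropy of each column's value distribution. The firehose entropy reported for author_did later in this report (11.450) is larger than ln(34,785) ≈ 10.46, the maximum possible in nats for that many distinct values, so the unit is likely bits. The sketch below assumes base 2; the function name is hypothetical. It also shows one plausible reason a single-valued column prints as -0.000: negating a floating-point zero yields negative zero.

```python
import math
from collections import Counter


def shannon_entropy_bits(values) -> float:
    """Shannon entropy (base 2) of the empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A column with a single value, like source_mode within one slice.
single = shannon_entropy_bits(["author_feed"] * 1000)
print(f"{single:.3f}")  # -0.000

# Four equally likely values carry two bits.
print(shannon_entropy_bits(["a", "b", "c", "d"]))  # 2.0
```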
image_ref text
curated: 279,196 rows · 0.0% null · 260,595 unique · 100.0% rows are a single word
firehose: 125,645 rows · 0.0% null · 114,610 unique · 100.0% rows are a single word

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 260,595 | 114,610 | -145,985 |
| mean length | 59.000 | 59.000 | +0.000 |
| median length | 59.000 | 59.000 | +0.000 |
| p95 length | 59.000 | 59.000 | +0.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.067 | 0.088 | +0.021 |
| vocab size (top-K) | 19,666 | 19,022 | -644.000 |
| null rate | 6.81e-05 | 0.000 | -6.81e-05 |
| top-value overlap (jaccard) | — | — | +0.053 |
| top-word overlap (jaccard) | — | — | +0.042 |

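
The duplicate rates in this table match one minus the ratio of distinct values to rows. The check below uses the counts reported above; the formula is inferred from the numbers rather than taken from saturn's documentation, and the function name is hypothetical. Null rows are ignored here because this column has almost none.

```python
def duplicate_rate(rows: int, distinct: int) -> float:
    """Share of rows whose value already appeared in an earlier row."""
    return 1.0 - distinct / rows


print(round(duplicate_rate(279_196, 260_595), 3))  # 0.067 (curated)
print(round(duplicate_rate(125_645, 114_610), 3))  # 0.088 (firehose)
```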
query categorical
curated: 279,196 rows · 0.0% null · 489 unique
firehose: 125,645 rows · 100.0% null · column has no values

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| entropy | 7.676 | — | |
| null rate | 0.000 | 1.000 | +1.000 |
| top-value overlap (jaccard) | — | — | +0.000 |

image_fullsize_url text
curated: 279,196 rows · 0.0% null · 260,636 unique · 100.0% rows are a single word · 100.0% rows contain a URL
firehose: 125,645 rows · 100.0% null · column has no non-empty values

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

post_cid text
curated: 279,196 rows · 0.0% null · 232,017 unique · 100.0% rows are a single word
firehose: 125,645 rows · 0.0% null · 103,528 unique · 100.0% rows are a single word

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 232,017 | 103,528 | -128,489 |
| mean length | 59.000 | 59.000 | +0.000 |
| median length | 59.000 | 59.000 | +0.000 |
| p95 length | 59.000 | 59.000 | +0.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.169 | 0.176 | +0.007 |
| vocab size (top-K) | 19,639 | 19,250 | -389.000 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

image_index numeric
curated: 279,196 rows · 0.0% null · 4 unique · skew=+2.74 · 16.9% rows beyond 1.5 IQR
firehose: 125,645 rows · 0.0% null · 4 unique · skew=+2.70 · 17.7% rows beyond 1.5 IQR

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 4 | 4 | +0.000 |
| mean | 0.258 | 0.265 | +0.007 |
| median | 0.000 | 0.000 | +0.000 |
| std | 0.646 | 0.649 | +0.003 |
| q1 | 0.000 | 0.000 | +0.000 |
| q3 | 0.000 | 0.000 | +0.000 |
| min | 0.000 | 0.000 | +0.000 |
| max | 3.000 | 3.000 | +0.000 |
| outlier rate | 0.169 | 0.177 | +0.008 |
| skew | 2.737 | 2.702 | -0.035 |
| null rate | 0.000 | 0.000 | +0.000 |

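
The "beyond 1.5 IQR" figure is presumably computed with Tukey's fences: a row is an outlier when it lies below q1 − 1.5·IQR or above q3 + 1.5·IQR. Because q1 and q3 are both 0 for this column, the IQR is 0 and every row with image_index above 0 is flagged, which explains an outlier rate near 17% for a column with only four values. The sketch below illustrates this on synthetic data; the function name, the sample, and the inclusive quartile method are assumptions.

```python
import statistics


def iqr_outlier_rate(values, k: float = 1.5) -> float:
    """Fraction of values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(v < low or v > high for v in values) / len(values)


# Synthetic image_index-like sample: most images are first in their post.
sample = [0] * 83 + [1] * 10 + [2] * 5 + [3] * 2
print(iqr_outlier_rate(sample))  # 0.17
```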
post_uri text
curated: 279,196 rows · 0.0% null · 232,017 unique · 100.0% rows are a single word
firehose: 125,645 rows · 0.0% null · 103,367 unique · 100.0% rows are a single word

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 232,017 | 103,367 | -128,650 |
| mean length | 70.000 | 69.999 | -6.84e-04 |
| median length | 70.000 | 70.000 | +0.000 |
| p95 length | 70.000 | 70.000 | +0.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.169 | 0.177 | +0.008 |
| vocab size (top-K) | 19,639 | 19,237 | -402.000 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

alt_text text
curated: 279,196 rows · 0.0% null · 255,318 unique · 17 languages detected in sample
firehose: 125,645 rows · 0.0% null · 100,455 unique · 20.0% duplicate strings · 31 languages detected in sample

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 255,318 | 100,455 | -154,863 |
| mean length | 202.974 | 281.860 | +78.886 |
| median length | 127.000 | 84.000 | -43.000 |
| p95 length | 702.000 | 1,284 | +582.000 |
| mean words | 33.466 | 35.854 | +2.388 |
| duplicate rate | 0.086 | 0.200 | +0.115 |
| vocab size (top-K) | 76,067 | 97,238 | +21,171 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.026 |
| top-word overlap (jaccard) | — | — | +0.471 |
| language overlap (jaccard) | — | — | +0.353 |
| languages only in a | — | — | fy, hr, la, war |
| languages only in b | — | — | ar, ceb, da, el, et, gl, he, ko, ku, lt |

collected_at text
curated: 279,196 rows · 0.0% null · 279,196 unique · 100.0% of rows are unique strings · 100.0% rows are a single word · 100.0% rows are all-caps
firehose: 125,645 rows · 0.0% null · 125,645 unique · 100.0% of rows are unique strings · 100.0% rows are a single word · 100.0% rows are all-caps

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 279,196 | 125,645 | -153,551 |
| mean length | 27.000 | 27.000 | +0.000 |
| median length | 27.000 | 27.000 | +0.000 |
| p95 length | 27.000 | 27.000 | +0.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.000 | 0.000 | +0.000 |
| vocab size (top-K) | 20,000 | 20,000 | +0.000 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

cursor text
curated: 279,196 rows · 15.9% null · 2,391 unique · 99.0% duplicate strings · 100.0% rows are a single word · 100.0% rows are all-caps
firehose: 125,645 rows · 0.0% null · 103,528 unique · 95th-percentile length under 20 chars · 100.0% rows are a single word · 100.0% rows are all-caps

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 2,391 | 103,528 | +101,137 |
| mean length | 23.639 | 16.000 | -7.639 |
| median length | 24.000 | 16.000 | -8.000 |
| p95 length | 24.000 | 16.000 | -8.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.990 | 0.176 | -0.814 |
| vocab size (top-K) | 2,287 | 19,250 | +16,963 |
| null rate | 0.159 | 0.000 | -0.159 |
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

langs_json categorical
curated: 279,196 rows · 0.0% null · 56 unique
firehose: 125,645 rows · 0.0% null · 189 unique

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 56 | 189 | +133.000 |
| entropy | 0.992 | 1.825 | +0.833 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.290 |

image_count_in_post numeric
curated: 279,196 rows · 0.0% null · 4 unique · 9.6% rows beyond 1.5 IQR
firehose: 125,645 rows · 0.0% null · 4 unique · 10.0% rows beyond 1.5 IQR

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 4 | 4 | +0.000 |
| mean | 1.519 | 1.535 | +0.016 |
| median | 1.000 | 1.000 | +0.000 |
| std | 0.965 | 0.965 | +3.28e-04 |
| q1 | 1.000 | 1.000 | +0.000 |
| q3 | 2.000 | 2.000 | +0.000 |
| min | 1.000 | 1.000 | +0.000 |
| max | 4.000 | 4.000 | +0.000 |
| outlier rate | 0.096 | 0.100 | +0.004 |
| skew | 1.724 | 1.706 | -0.018 |
| null rate | 0.000 | 0.000 | +0.000 |

created_at text
curated: 279,196 rows · 0.0% null · 232,017 unique · 100.0% rows are a single word · 100.0% rows are all-caps
firehose: 125,645 rows · 0.0% null · 102,375 unique · 100.0% rows are a single word · 100.0% rows are all-caps

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 232,017 | 102,375 | -129,642 |
| mean length | 24.181 | 25.355 | +1.174 |
| median length | 24.000 | 24.000 | +0.000 |
| p95 length | 27.000 | 32.000 | +5.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.169 | 0.185 | +0.016 |
| vocab size (top-K) | 19,639 | 19,203 | -436.000 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

author_did categorical
curated: 279,196 rows · 0.0% null · 489 unique
firehose: 125,645 rows · 0.0% null · 34,785 unique · 21452 singleton categories

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 489 | 34,785 | +34,296 |
| entropy | 7.676 | 11.450 | +3.774 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.026 |

raw_record_json text
curated: 279,196 rows · 0.0% null · 232,017 unique
firehose: 125,645 rows · 0.0% null · 103,528 unique · 9 languages detected in sample · 46.5% rows contain a URL

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 232,017 | 103,528 | -128,489 |
| mean length | 1,136 | 1,291 | +154.334 |
| median length | 948.000 | 1,029 | +81.000 |
| p95 length | 2,354 | 2,774 | +420.000 |
| mean words | 68.388 | 65.942 | -2.446 |
| duplicate rate | 0.169 | 0.176 | +0.007 |
| vocab size (top-K) | 109,460 | 121,394 | +11,934 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.562 |
| language overlap (jaccard) | — | — | +0.250 |
| languages only in b | — | — | de, fi, fr, ja, ru, tr |

text text
curated: 279,196 rows · 0.0% null · 210,092 unique · 24.8% duplicate strings · 21 languages detected in sample
firehose: 125,645 rows · 0.0% null · 92,997 unique · 26.0% duplicate strings · 31 languages detected in sample · 33.7% rows contain a URL

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 210,092 | 92,997 | -117,095 |
| mean length | 138.904 | 131.124 | -7.780 |
| median length | 126.000 | 143.000 | +17.000 |
| p95 length | 297.000 | 293.000 | -4.000 |
| mean words | 21.581 | 15.626 | -5.954 |
| duplicate rate | 0.248 | 0.260 | +0.012 |
| vocab size (top-K) | 76,438 | 84,089 | +7,651 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.053 |
| top-word overlap (jaccard) | — | — | +0.613 |
| language overlap (jaccard) | — | — | +0.429 |
| languages only in a | — | — | fy, ga, jv, uk, war |
| languages only in b | — | — | ar, cs, da, el, et, fi, hi, hu, ko, lt |

image_thumb_url text
curated: 279,196 rows · 0.0% null · 260,636 unique · 100.0% rows are a single word · 100.0% rows contain a URL
firehose: 125,645 rows · 100.0% null · column has no non-empty values

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |

image_alt_length numeric
curated: 279,196 rows · 0.0% null · 1,951 unique · skew=+26.05 · 9.7% rows beyond 1.5 IQR
firehose: 125,645 rows · 0.0% null · 2,036 unique · skew=+35.66 · 16.6% rows beyond 1.5 IQR

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 1,951 | 2,036 | +85.000 |
| mean | 202.974 | 281.860 | +78.886 |
| median | 127.000 | 84.000 | -43.000 |
| std | 267.794 | 514.084 | +246.290 |
| q1 | 73.000 | 37.000 | -36.000 |
| q3 | 217.000 | 275.000 | +58.000 |
| min | 1.000 | 1.000 | +0.000 |
| max | 50,148 | 65,192 | +15,044 |
| outlier rate | 0.097 | 0.166 | +0.069 |
| skew | 26.050 | 35.656 | +9.606 |
| null rate | 0.000 | 0.000 | +0.000 |

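
The skew values above are presumably the standardized third central moment, where a large positive value means a long right tail (a few very long alt texts pulling the mean well above the median). Below is a minimal sketch of the population form. Whether saturn applies a sample-size bias correction is not stated, so this is an assumption, and the data is synthetic.

```python
def skewness(values) -> float:
    """Population skewness m3 / m2**1.5 (undefined for constant data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
lengths = [40, 60, 80, 100, 120, 5000]  # one extreme alt-text length

print(skewness(symmetric))      # 0.0
print(skewness(lengths) > 1.0)  # True
```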
author_handle categorical
curated: 279,196 rows · 0.0% null · 489 unique
firehose: 125,645 rows · 100.0% null · column has no values

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| entropy | 7.676 | — | |
| null rate | 0.000 | 1.000 | +1.000 |
| top-value overlap (jaccard) | — | — | +0.000 |

indexed_at text
curated: 279,196 rows · 0.0% null · 232,015 unique · 100.0% rows are a single word · 100.0% rows are all-caps
firehose: 125,645 rows · 0.0% null · 103,528 unique · 100.0% rows are a single word · 100.0% rows are all-caps

Δ = firehose − curated

| metric | curated | firehose | Δ |
|---|---|---|---|
| distinct | 232,015 | 103,528 | -128,487 |
| mean length | 24.000 | 32.000 | +8.000 |
| median length | 24.000 | 32.000 | +8.000 |
| p95 length | 24.000 | 32.000 | +8.000 |
| mean words | 1.000 | 1.000 | +0.000 |
| duplicate rate | 0.169 | 0.176 | +0.007 |
| vocab size (top-K) | 19,639 | 19,250 | -389.000 |
| null rate | 0.000 | 0.000 | +0.000 |
| top-value overlap (jaccard) | — | — | +0.000 |
| top-word overlap (jaccard) | — | — | +0.000 |