Summary confidence: high
The curated dataset (279,196 rows) is roughly 2.2× the size of the firehose dataset (125,645 rows), and several columns differ structurally rather than just statistically. Four fields (`query`, `image_fullsize_url`, `image_thumb_url`, and `author_handle`) show a +100% null shift with zero top-value overlap, which indicates they are populated on one side and entirely absent on the other. Text fields also diverge in shape: `alt_text` shows a mean length shift of +79 characters, low language overlap (Jaccard 0.35), and almost no shared top values (0.03), while `raw_record_json` is 154 characters longer on average with a Jaccard of 0.25. Taken together, these patterns suggest the two sources capture different schemas or enrichments rather than differing only by sampling noise. Confidence is high that the divergence is structural, though the evidence does not say which side carries the nulls.
citing: a_row_count · b_row_count · divergences
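
For readers unfamiliar with the metrics cited above, the following is a minimal sketch of how a null shift and a top-value Jaccard overlap could be computed for a single column. It is not the pipeline that produced this summary. The function names, the `k=10` cutoff for top values, the toy data, and the sign convention (second argument minus first) are all assumptions made for illustration.

```python
from collections import Counter

def null_rate(values):
    """Fraction of entries that are None (None stands in for null here)."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def top_values(values, k=10):
    """Set of the k most frequent non-null values (k=10 is an assumed cutoff)."""
    counts = Counter(v for v in values if v is not None)
    return {v for v, _ in counts.most_common(k)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def column_divergence(col_a, col_b, k=10):
    """Hypothetical helper: null shift (b minus a, an assumed sign convention)
    and top-value overlap for one column."""
    return {
        "null_shift": null_rate(col_b) - null_rate(col_a),
        "top_value_jaccard": jaccard(top_values(col_a, k), top_values(col_b, k)),
    }

# Toy column: fully populated on one side, entirely null on the other.
side_a = ["cats", "dogs", "cats", "birds"]
side_b = [None, None, None]
result = column_divergence(side_a, side_b)

# Row-count ratio from the figures quoted in the summary.
row_ratio = 279_196 / 125_645
```

On the toy column, `result["null_shift"]` is 1.0 (a 100% shift) and `result["top_value_jaccard"]` is 0.0, the same signature reported for the four fields above. Swapping the arguments flips the sign of the shift, which is why the direction cannot be read from the magnitude alone.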