saturn·

data trove onion headlines

saturn notebook · generated 2026-06-21

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv

Saturn profiled 2,103 rows across 3 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

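[2]:
# Install the profiler (Jupyter shell escape), then run the CLI on the source CSV.
!pip install saturn-dissect
import subprocess

subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv",
        "--findings", "data-trove-onion-headlines.json",
        "--llm", "anthropic:default",  # opt-in language-model interpretation (see Overview)
    ],
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)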

Summary confidence: medium

This dataset is an index of 2,103 articles from The Onion, a satirical news outlet, containing article headlines, thumbnail image URLs, and a sequential row ID. The headline column is the most analytically interesting field: it has a vocabulary of 7,613 unique words and a mean headline length of about 9 words (61 characters), which makes its length distribution and common vocabulary worth exploring. The image column contains 209 duplicate URLs (about 10% of rows), so some thumbnails are reused across articles; the most-reused image appears 11 times.

citing: row_count · column_count · n_unique · duplicate_rate · n_duplicates · vocab_size · len_mean · len_median · len_max · word_mean
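The duplicate figures cited above can be reproduced from the row and unique counts alone. The sketch below assumes a "duplicate" is any row beyond the first occurrence of a value, so n_duplicates = n - n_unique; that definition is an assumption, but it agrees with the reported numbers. The function name and the toy column are hypothetical.

```python
from collections import Counter


def duplicate_stats(values):
    """Return (n_unique, n_duplicates, duplicate_rate, max_count) for one column."""
    counts = Counter(values)
    n = len(values)
    n_unique = len(counts)
    # Assumption: every row after the first occurrence of a value counts as a duplicate.
    n_duplicates = n - n_unique
    rate = n_duplicates / n if n else 0.0
    max_count = max(counts.values(), default=0)
    return n_unique, n_duplicates, rate, max_count


# Toy column for illustration; these values are not taken from the dataset.
urls = ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "a.jpg"]
print(duplicate_stats(urls))  # (3, 2, 0.4, 3)

# Arithmetic check against the figures reported in this notebook.
print(round((2103 - 1894) / 2103, 5))  # image URLs: 0.09938
print(round((2103 - 2095) / 2103, 6))  # headlines: 0.003804
```

Under this definition the image column's 209 duplicates are spread over an unknown number of reused images; only the largest group (11 rows) is reported.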

Fig 1.
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg · Most-reused thumbnail URLs show one image appears 11 times — check whether repeated images indicate article templates or content categories.
Character-length distribution for https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg (mean: 107.88445078459344).
chars        count
107 – 107    1,948
107 – 119    0 (38 empty bins)
119 – 119    155
Fig 2.
1 · The row ID is perfectly uniform with no gaps or outliers, confirming this is a clean sequential index with no missing records.
Histogram bins for 1 (median: 1053.0).
bin              count
2 – 54.55        53
54.55 – 107.1    53
107.1 – 159.6    52
159.6 – 212.2    53
212.2 – 264.8    52
264.8 – 317.3    53
317.3 – 369.8    52
369.8 – 422.4    53
422.4 – 474.9    52
474.9 – 527.5    53
527.5 – 580      53
580 – 632.6      52
632.6 – 685.1    53
685.1 – 737.7    52
737.7 – 790.2    53
790.2 – 842.8    52
842.8 – 895.3    53
895.3 – 947.9    52
947.9 – 1000     53
1000 – 1053      52
1053 – 1106      53
1106 – 1158      53
1158 – 1211      52
1211 – 1263      53
1263 – 1316      52
1316 – 1368      53
1368 – 1421      52
1421 – 1473      53
1473 – 1526      52
1526 – 1578      53
1578 – 1631      53
1631 – 1684      52
1684 – 1736      53
1736 – 1789      52
1789 – 1841      53
1841 – 1894      52
1894 – 1946      53
1946 – 1999      52
1999 – 2051      53
2051 – 2104      53
Fig 3.
Per-column null rate across the corpus. Columns are ordered by input position.
column · kind · null %
1 · numeric · 0.0%
‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident · text · 0.0%
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg · text · 0.0%
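The three column names in the table above look like data values (an integer, a headline, an image URL), which may mean the CSV has no header row and its first record was read as the header. The sketch below shows one way to check for that and to re-read the file with every row kept as data. It uses only the standard library; `read_rows`, `header_looks_like_data`, the `col_N` names, and the sample text are all hypothetical and not part of Saturn.

```python
import csv
import io


def header_looks_like_data(names):
    """Heuristic: a header cell that is an integer or a URL is probably a data value."""
    for name in names:
        cell = name.strip()
        if cell.lstrip("-").isdigit() or cell.lower().startswith(("http://", "https://")):
            return True
    return False


def read_rows(text, has_header):
    """Parse CSV text and return (column_names, data_rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        return rows[0], rows[1:]
    # No header row: synthesise placeholder names and keep every row as data.
    names = [f"col_{i}" for i in range(len(rows[0]))] if rows else []
    return names, rows


# Toy sample with the same three-column shape; not taken from the dataset.
sample = (
    '1,"Placeholder headline A",https://example.com/a.jpg\n'
    '2,"Placeholder headline B",https://example.com/b.jpg\n'
)

names, data = read_rows(sample, has_header=True)
if header_looks_like_data(names):
    # Re-read so the first record is counted as data rather than as column names.
    names, data = read_rows(sample, has_header=False)
print(names, len(data))
```

If the source file is in fact headerless, the true record count would be one more than the 2,103 rows profiled here, which would also explain why the ID column starts at 2.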

1 numeric identifier

This column is almost certainly a row index or sequential identifier: it contains 2,103 values, all unique, with min=2 and max=2,104, a perfectly symmetric distribution (skew=0.0, mean=median=1053.0), and zero outliers. The uniformity (platykurtic, kurtosis=-1.2; IQR=1,051, exactly half of the 2,102-wide range) is what an unbroken arithmetic sequence with step 1 would produce. The column name '1' is probably not an unnamed index emitted during export: all three column names look like data values and the IDs start at 2, which suggests the CSV has no header row and its first record was read as the header.

Treatment: Drop before modelling; carries no predictive signal and is a row index artifact.
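The sequence claim and the drop step can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming the IDs are exactly the integers from the reported min (2) to the reported max (2,104); the helper names are hypothetical.

```python
import statistics


def is_sequential_index(values):
    """True if the sorted values increase by exactly 1 at every step."""
    ordered = sorted(values)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


def drop_first_column(rows):
    """Return the rows without their leading index field."""
    return [row[1:] for row in rows]


# Assumed IDs: 2..2104 inclusive, matching the reported min, max, and n.
ids = list(range(2, 2105))
print(len(ids))                         # 2103
print(is_sequential_index(ids))         # True
print(statistics.mean(ids))             # 1053
print(round(statistics.stdev(ids), 1))  # 607.2 (sample standard deviation)

# Toy rows for illustration; not taken from the dataset.
rows = [(2, "headline a", "url a"), (3, "headline b", "url b")]
print(drop_first_column(rows))
```

The reported std of 607.2 matches the sample (n - 1) standard deviation of this run of integers, which is consistent with the column being a gap-free index.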

anthropic:default · confidence high
Out[9]:

saturn.columns["1"].stats

stat value
n 2,103
nulls 0 (0.0%)
unique 2,103
min 2
max 2,104
mean 1,053
median 1,053
std 607.2
q1 527.5
q3 1578
iqr 1,051
skew 0
kurtosis -1.2
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 4.
Distribution of 1. Vertical dash marks the median.
Histogram bins for 1 (median: 1053.0).
bin              count
2 – 54.55        53
54.55 – 107.1    53
107.1 – 159.6    52
159.6 – 212.2    53
212.2 – 264.8    52
264.8 – 317.3    53
317.3 – 369.8    52
369.8 – 422.4    53
422.4 – 474.9    52
474.9 – 527.5    53
527.5 – 580      53
580 – 632.6      52
632.6 – 685.1    53
685.1 – 737.7    52
737.7 – 790.2    53
790.2 – 842.8    52
842.8 – 895.3    53
895.3 – 947.9    52
947.9 – 1000     53
1000 – 1053      52
1053 – 1106      53
1106 – 1158      53
1158 – 1211      52
1211 – 1263      53
1263 – 1316      52
1316 – 1368      53
1368 – 1421      52
1421 – 1473      53
1473 – 1526      52
1526 – 1578      53
1578 – 1631      53
1631 – 1684      52
1684 – 1736      53
1736 – 1789      52
1789 – 1841      53
1841 – 1894      52
1894 – 1946      53
1946 – 1999      52
1999 – 2051      53
2051 – 2104      53

‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident text free_text

This column contains satirical news headlines; the column name itself is a representative example (‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident). With 2,095 unique values in 2,103 rows and a mean length of about 61 characters (about 9 words), these are near-unique short strings consistent with headline-style content. The near_unique alert and the 8 duplicate headlines do not conflict: 99.6% of rows are unique, and the remaining 8 rows repeat an earlier headline. The mean Flesch reading-ease score of 46.87 falls in the band usually labelled difficult, which is harder to read than typical clickbait but consistent with crafted satirical writing. The maximum length of 926 characters is far above the 95th percentile of 102, so at least one row is probably not an ordinary headline. The top words ('to', 'of', 'in', 'new') are mostly generic function words and offer little discriminative signal on their own.

Treatment: Tokenize and embed (e.g., sentence-transformers) before modelling; deduplicate the 8 duplicate headlines first.
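The deduplication half of this treatment can be sketched with the standard library. The example assumes exact string matches are what the profiler counted as duplicates, and it leaves the embedding step out, since that depends on a third-party model. Function names and the toy headlines are hypothetical.

```python
import statistics


def dedupe_keep_first(headlines):
    """Drop exact repeats, keeping the first occurrence and the original order."""
    seen = set()
    kept = []
    for headline in headlines:
        if headline not in seen:
            seen.add(headline)
            kept.append(headline)
    return kept


def length_profile(headlines):
    """Character and whitespace-token length summaries, loosely mirroring the stats table."""
    chars = [len(h) for h in headlines]
    words = [len(h.split()) for h in headlines]
    return {
        "len_mean": statistics.mean(chars),
        "len_max": max(chars),
        "word_median": statistics.median(words),
    }


# Toy headlines for illustration; not taken from the dataset.
toy = [
    "placeholder headline one",
    "placeholder headline two",
    "placeholder headline one",
]
kept = dedupe_keep_first(toy)
print(len(toy) - len(kept), "duplicate(s) removed")
print(length_profile(kept))
```

Exact matching will not catch near-duplicates that differ only in case or whitespace; normalising the strings first would be a separate design choice. Checking `len_max` after deduplication is also a quick way to locate the 926-character row.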

anthropic:default · confidence high
Out[12]:

saturn.columns["‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident"].stats

stat value
n 2,103
nulls 0 (0.0%)
unique 2,095
len_min 8
len_max 926
len_mean 60.99
len_median 58
len_p95 102
word_mean 9.268
word_median 9
n_empty 0
n_duplicates 8
duplicate_rate 0.003804
vocab_size 7,613
readability_flesch_mean 46.87
emoji_rate 0
url_rate 0
one_word_rate 0.0004755
allcaps_rate 0.0004755
boilerplate_rate 0.0004755
alert: near_unique · 99.6% of rows are unique strings
Fig 5.
Character-length distribution for ‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident.
Character-length distribution for ‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident (mean: 60.99048977650975).
chars        count
8 – 31       131
31 – 54      781
54 – 77      719
77 – 100     353
100 – 123    99
123 – 146    13
146 – 169    4
169 – 238    0 (3 empty bins)
238 – 260    1
260 – 283    1
283 – 903    0 (27 empty bins)
903 – 926    1

https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg text metadata

This column contains Kinja/Gawker CDN image URLs that serve as thumbnail or article image references across 2,103 rows. Every value is a URL (url_rate 1.0). The column-name URL carries the transform parameters w_645 and q_60, which suggests fixed-size JPEG thumbnails, but the rows are not all structurally identical: lengths take only two values, 107 characters (1,948 rows) and 119 characters (155 rows), so the URLs fall into two groups that differ by 12 characters. There are 209 duplicate URLs (duplicate_rate about 9.9%), and one URL appears 11 times, so multiple rows share a thumbnail; this may be a placeholder or default image, or an image reused across related articles.

Treatment: Extract the hash filename as a unique image identifier; flag the top duplicate URL (count=11) as a likely default/placeholder and treat separately from article-specific images.
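A minimal sketch of this treatment, using only the standard library. It assumes the image identifier is the last path component without its extension, which holds for the column-name URL but is not verified for every row; `image_id` and `reused_images` are hypothetical names.

```python
import posixpath
from collections import Counter
from urllib.parse import urlparse


def image_id(url):
    """Return the final path component of the URL without its file extension."""
    filename = posixpath.basename(urlparse(url).path)
    stem, _extension = posixpath.splitext(filename)
    return stem


def reused_images(urls, min_count=2):
    """List (image_id, count) pairs for images used at least min_count times, most used first."""
    counts = Counter(image_id(u) for u in urls)
    return [(img, n) for img, n in counts.most_common() if n >= min_count]


# The column-name URL from this notebook.
url = (
    "https://i.kinja-img.com/gawker-media/image/upload/"
    "c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg"
)
print(image_id(url))  # a30d02609da7e7ddb46d6edea9460d5e

# Toy list for illustration: the URL above twice, plus one made-up URL.
toy_urls = [url, url, "https://example.com/x/abc.jpg"]
print(reused_images(toy_urls))
```

Grouping on the extracted identifier rather than on the full URL also merges rows that point at the same image through different transform parameters, which may or may not be wanted.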

anthropic:default · confidence high
Out[15]:

saturn.columns["https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg"].stats

stat value
n 2,103
nulls 0 (0.0%)
unique 1,894
len_min 107
len_max 119
len_mean 107.9
len_median 107
len_p95 119
word_mean 1
word_median 1
n_empty 0
n_duplicates 209
duplicate_rate 0.09938
vocab_size 1,894
readability_flesch_mean -1672
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: url_heavy · 100.0% rows contain a URL
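The strongly negative readability_flesch_mean in the table above is expected for single-token URLs and is not a data-quality problem. The sketch below uses the standard Flesch reading-ease formula; how Saturn counts syllables and sentences is not described in this report, so the syllable counts in the examples are hypothetical.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch reading-ease score computed from raw counts."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# A nine-word, one-sentence headline with a hypothetical 15 syllables.
print(round(flesch_reading_ease(9, 1, 15), 2))  # 56.7

# One URL treated as a single "word" with a hypothetical 22 syllables.
print(round(flesch_reading_ease(1, 1, 22), 2))  # -1655.38
```

Because the syllables-per-word term is multiplied by 84.6, a value that is one long token drives the score far below zero, so the metric carries no useful signal for this column.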
Fig 6.
Character-length distribution for https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg.
Character-length distribution for https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg (mean: 107.88445078459344).
chars        count
107 – 107    1,948
107 – 119    0 (38 empty bins)
119 – 119    155

How to cite

BibTeX
@misc{saturn-data-trove-onion-headlines-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove onion headlines},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-onion-headlines}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove onion headlines. Source: /home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-onion-headlines