data trove onion headlines

source: /home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv · 2,103 rows · 3 columns · profiled 2026-06-21

Reading

dataset summary · medium confidence anthropic:default

This dataset is an index of 2,103 articles from The Onion, a satirical news outlet, containing article headlines, thumbnail image URLs, and a sequential row ID. The headline column is the most analytically interesting field, with a vocabulary of 7,613 unique words and mean headline length of about 9 words and 61 characters — worth exploring for length distribution and common vocabulary patterns. There are also 209 duplicate image URLs (~10% of rows), suggesting some thumbnails are reused across multiple articles, with one image appearing 11 times.

citing: row_count · column_count · n_unique · duplicate_rate · n_duplicates · vocab_size · len_mean · len_median · len_max · word_mean

Charts the summary said to look at first

https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg · Most-reused thumbnail URLs show one image appears 11 times — check whether repeated images indicate article templates or content categories.

Character-length distribution for https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg (mean: 107.88445078459344).
chars	count
107	1948
108 – 118	0
119	155
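
As a quick consistency check, the two populated bins in the table above (lengths 107 and 119) should reproduce the reported row count and mean on their own. A minimal sketch, using only the counts shown in the table:

```python
# Populated bins from the character-length table: {length: count}.
length_counts = {107: 1948, 119: 155}

n_rows = sum(length_counts.values())
mean_len = sum(length * count for length, count in length_counts.items()) / n_rows
share_long = length_counts[119] / n_rows

print(n_rows)                # 2103
print(round(mean_len, 4))    # 107.8845
print(round(share_long, 4))  # 0.0737
```

The weighted mean matches the reported 107.884..., so URL lengths take exactly two values, and roughly 7% of rows fall in the longer group.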

1 · The row ID is perfectly uniform with no gaps or outliers, confirming this is a clean sequential index with no missing records.

Histogram bins for 1 (median: 1053.0).
bin	count
2 – 54.55	53
54.55 – 107.1	53
107.1 – 159.6	52
159.6 – 212.2	53
212.2 – 264.8	52
264.8 – 317.3	53
317.3 – 369.8	52
369.8 – 422.4	53
422.4 – 474.9	52
474.9 – 527.5	53
527.5 – 580	53
580 – 632.6	52
632.6 – 685.1	53
685.1 – 737.7	52
737.7 – 790.2	53
790.2 – 842.8	52
842.8 – 895.3	53
895.3 – 947.9	52
947.9 – 1000	53
1000 – 1053	52
1053 – 1106	53
1106 – 1158	53
1158 – 1211	52
1211 – 1263	53
1263 – 1316	52
1316 – 1368	53
1368 – 1421	52
1421 – 1473	53
1473 – 1526	52
1526 – 1578	53
1578 – 1631	53
1631 – 1684	52
1684 – 1736	53
1736 – 1789	52
1789 – 1841	53
1841 – 1894	52
1894 – 1946	53
1946 – 1999	52
1999 – 2051	53
2051 – 2104	53
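
The "no gaps" reading can be checked from the summary statistics alone. The sketch below assumes the column holds integers (the profile only says "numeric"); the function name is illustrative and not part of the profiler.

```python
import math


def is_gapless_sequence(n, n_unique, lo, hi):
    """True when n unique integers in [lo, hi] must cover every integer in that range."""
    return n == n_unique and hi - lo + 1 == n


# Values taken from the profile of column '1'.
n, n_unique, lo, hi = 2103, 2103, 2, 2104
gapless = is_gapless_sequence(n, n_unique, lo, hi)

# For n consecutive integers the mean is the midpoint and the sample
# standard deviation is sqrt(n * (n + 1) / 12).
expected_mean = (lo + hi) / 2
expected_std = math.sqrt(n * (n + 1) / 12)

print(gapless, expected_mean, round(expected_std, 1))
```

Both derived values agree with the reported mean (1,053) and std (607.2), which is what a run of consecutive IDs would produce.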

Schema

3 columns

Per-column summary.
column	type	nulls	unique	alerts
1	numeric	0.0%	2,103
‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident	text	0.0%	2,095	near_unique
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg	text	0.0%	1,894	one_word url_heavy

1

numeric identifier

This column is almost certainly a row index or sequential identifier: it contains 2,103 values, all unique, running from min=2 to max=2104 with no gaps (2104 - 2 + 1 = 2103), with a perfectly symmetric distribution (skew=0.0, mean=median=1053.0) and zero outliers. The flat shape (platykurtic, kurtosis=-1.2; IQR=1051, half the range of 2102) is what an arithmetic sequence produces. The column name '1', together with min=2, suggests the file has no header row and its first record (ID 1) was read as the header; if so, the file holds 2,104 records and should be reloaded with explicit column names. Treatment: Drop before modelling; it is a row index artifact and carries no predictive signal. high · anthropic:default

n: 2,103
nulls: 0 (0.0%)
unique: 2,103
min: 2
max: 2,104
mean: 1,053
median: 1,053
std: 607.2
q1: 527.5
q3: 1578
iqr: 1,051
skew: 0
kurtosis: -1.2
n_outliers: 0
outlier_rate: 0
zero_rate: 0
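
The three column names in this profile ('1', a headline, an image URL) look like a data record, not labels, which points to a headerless file. A minimal sketch of reading such a file without losing the first record follows. The names `row_id`, `headline`, and `image_url` are hypothetical, and the second sample row is an invented placeholder, not real data.

```python
import csv
import io

# Hypothetical column names: the profiled file does not name its columns.
COLUMNS = ["row_id", "headline", "image_url"]


def read_headerless(handle):
    """Parse every line as data, attaching COLUMNS instead of consuming a header."""
    return list(csv.DictReader(handle, fieldnames=COLUMNS))


# Build a small in-memory stand-in for the real file. The first record is the
# one the profile shows as column names; the second is an invented placeholder.
buffer = io.StringIO()
csv.writer(buffer).writerows([
    [1,
     "‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident",
     "https://i.kinja-img.com/gawker-media/image/upload/"
     "c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg"],
    [2, "Placeholder headline", "https://example.invalid/placeholder.jpg"],
])
buffer.seek(0)

rows = read_headerless(buffer)
print(len(rows), rows[0]["row_id"])
```

With pandas, the equivalent is passing `header=None` and `names=COLUMNS` to `read_csv`. Either way the URL field must be quoted in the file, because the transform segment itself contains commas.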

‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident

text free_text near_unique

This column contains satirical news headlines. The column name is itself a representative example ('That'll Be $3,' Says Trump After Handing Water Bottle To Sick Ohio Resident), which suggests the first data row was read as the header. With 2,095 unique values out of 2,103 rows and a mean length of ~61 characters (~9 words), these are near-unique short text strings consistent with headline-style content. The 8 duplicate headlines (duplicate_rate ~0.38%) are compatible with the near_unique alert. The mean Flesch reading-ease score of 46.87 falls in the 30-50 band usually labelled difficult (lower scores mean harder text), which is consistent with crafted satirical writing. len_max=926 is far above len_p95=102, so at least one value is probably not an ordinary headline and is worth inspecting. The one_word, allcaps, and boilerplate rates of 0.0004755 each correspond to a single row (1/2103). Top words ('to', 'of', 'in', 'new') are mostly generic function words offering little discriminative signal on their own. Treatment: Deduplicate the 8 duplicate headlines first, then tokenize and embed (e.g., sentence-transformers) before modelling. high · anthropic:default

n: 2,103
nulls: 0 (0.0%)
unique: 2,095
len_min: 8
len_max: 926
len_mean: 60.99
len_median: 58
len_p95: 102
word_mean: 9.268
word_median: 9
n_empty: 0
n_duplicates: 8
duplicate_rate: 0.003804
vocab_size: 7,613
readability_flesch_mean: 46.87
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0004755
allcaps_rate: 0.0004755
boilerplate_rate: 0.0004755
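
The deduplication step and a check for overlong values can be sketched as below. This is an illustration under assumptions: duplicates are matched on exact text unless `normalize=True` (how the profiler matched them is not stated), and the 102-character threshold simply reuses len_p95. Function names are hypothetical.

```python
def dedupe_keep_first(headlines, normalize=False):
    """Drop repeated headlines, keeping the first occurrence and original order."""
    seen = set()
    unique = []
    for text in headlines:
        # Optional normalization: collapse whitespace and ignore case.
        key = " ".join(text.split()).casefold() if normalize else text
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def length_outliers(headlines, max_chars=102):
    """Return headlines longer than max_chars (default reuses len_p95)."""
    return [text for text in headlines if len(text) > max_chars]


# Invented placeholder strings, not rows from the dataset.
sample = ["Area Man Does Thing", "Nation Reacts", "Area Man Does Thing", "x" * 926]
print(len(dedupe_keep_first(sample)))
print([len(text) for text in length_outliers(sample)])
```

Reviewing the output of `length_outliers` by hand before embedding is cheap at this scale and would show what the 926-character value actually contains.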

https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg

text metadata one_word url_heavy

This column contains Kinja/Gawker CDN image URLs that serve as thumbnail or article image references across 2,103 rows. Every value is a URL (url_rate 1.0), and the profiler reports that the w_645 and q_60 transform parameters are baked into the URLs, so within a given URL form only the hash filename varies. Lengths take exactly two values, however: 1,948 URLs are 107 characters and 155 are 119 characters, so the two groups must differ somewhere in the path or filename, and the assumption of one identical prefix across all rows should be verified. 209 duplicate URLs exist (duplicate_rate ~9.9%), and one image URL appears 11 times, so multiple rows share a thumbnail, possibly a placeholder or default image. Treatment: Extract the hash filename as an image identifier; flag the top duplicate URL (count=11) as a likely default/placeholder and treat it separately from article-specific images. high · anthropic:default

n: 2,103
nulls: 0 (0.0%)
unique: 1,894
len_min: 107
len_max: 119
len_mean: 107.9
len_median: 107
len_p95: 119
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 209
duplicate_rate: 0.09938
vocab_size: 1,894
readability_flesch_mean: -1672
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
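
The treatment for the image URL column (extract the hash filename, then flag reused thumbnails) can be sketched with the standard library. The function names are illustrative, and the second URL in the sample is an invented placeholder.

```python
from collections import Counter
from posixpath import basename, splitext
from urllib.parse import urlsplit


def image_id(url):
    """Return the filename stem of the URL path, i.e. the hash before '.jpg'."""
    return splitext(basename(urlsplit(url).path))[0]


def reuse_counts(urls, min_count=2):
    """List (url, count) pairs for URLs seen at least min_count times, most reused first."""
    return [(url, count) for url, count in Counter(urls).most_common() if count >= min_count]


known = ("https://i.kinja-img.com/gawker-media/image/upload/"
         "c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg")
placeholder = "https://example.invalid/upload/deadbeef.jpg"  # invented for the demo

urls = [known, placeholder, known]
print(image_id(known))
print(reuse_counts(urls))
```

Grouping the same list by `len(url)` with a `Counter` would also separate the 107-character and 119-character URL forms for inspection.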