data trove onion headlines

source: /home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv · 2,103 rows · 3 columns · profiled 2026-06-21

Reading

dataset summary · medium confidence anthropic:default

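This dataset is an index of 2,103 articles from The Onion, a satirical news outlet, with three fields per row: a sequential row ID, the article headline, and a thumbnail image URL. The column names are themselves a data record (ID 1, a headline, and an image URL), so the file most likely has no header row and its first record was read as the header; if so, the file holds 2,104 records. The headline column is the most analytically interesting field, with a vocabulary of 7,613 unique words, a mean length of about 9 words (61 characters), and a median of 58 characters. Its longest value is 926 characters, far above the typical length, so the length distribution and common vocabulary are both worth exploring. The image URL column has 209 duplicate values (about 10% of rows), so some thumbnails are reused across articles, with one image appearing 11 times.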

citing: row_count · column_count · n_unique · duplicate_rate · n_duplicates · vocab_size · len_mean · len_median · len_max · word_mean

Schema

3 columns
Per-column summary: column name · type · nulls · unique · alerts
1 · numeric · 0.0% · 2,103 · (none)
‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident · text · 0.0% · 2,095 · near_unique
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg · text · 0.0% · 1,894 · one_word, url_heavy

1

numeric identifier
This column is almost certainly a sequential row identifier: it contains 2,103 values, all unique, running from min=2 to max=2,104, with a perfectly symmetric distribution (skew=0.0, mean=median=1,053) and zero outliers. If the values are integers, 2,103 distinct values can fit in the range 2 to 2,104 only when every integer is present, so the column is a gap-free sequence with step 1. The flat shape (excess kurtosis of -1.2, IQR=1,051 spanning exactly half the 2,102-wide range) is what such a sequence produces. The column name '1', one below the minimum, suggests that the first record (ID 1) was read as the header row rather than that the column was exported under that name. Treatment: drop before modelling; it is a row-index artifact and carries no predictive signal.
high · anthropic:default
n: 2,103
nulls: 0 (0.0%)
unique: 2,103
min: 2
max: 2,104
mean: 1,053
median: 1,053
std: 607.2
q1: 527.5
q3: 1,578.5
iqr: 1,051
skew: 0
kurtosis: -1.2
n_outliers: 0
outlier_rate: 0
zero_rate: 0
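
The statistics above can be checked against the sequence they are said to describe. This sketch assumes the column holds every integer from 2 to 2104 exactly once (an assumption, since the raw values are not shown) and recomputes the summary with Python's `statistics` module. It also assumes that the reported std is the sample standard deviation, that quartiles use linear interpolation, and that kurtosis means excess kurtosis; the report does not state which definitions it uses.

```python
import statistics

# Assumption: the column is the gap-free integer sequence 2, 3, ..., 2104.
ids = list(range(2, 2105))

n = len(ids)
mean = statistics.mean(ids)
median = statistics.median(ids)
std = statistics.stdev(ids)  # sample standard deviation (n - 1 denominator)

# Quartiles by linear interpolation over the observed range.
q1, _, q3 = statistics.quantiles(ids, n=4, method="inclusive")
iqr = q3 - q1

# Population central moments, used for skewness and excess kurtosis.
m2 = sum((x - mean) ** 2 for x in ids) / n
m3 = sum((x - mean) ** 3 for x in ids) / n
m4 = sum((x - mean) ** 4 for x in ids) / n
skew = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print(n, mean, median, round(std, 1), q1, q3, iqr)
print(round(skew, 3), round(excess_kurtosis, 3))
```

Under these assumptions the recomputed values agree with the reported ones up to display rounding, which supports reading the column as a plain row index.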

‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident

text free_text near_unique
This column contains satirical news headlines; the column name itself is a representative example ('That'll Be $3,' Says Trump After Handing Water Bottle To Sick Ohio Resident). With 2,095 unique values in 2,103 rows and a mean length of about 61 characters (about 9 words), these are near-unique short strings consistent with headline-style content. The 8 duplicate headlines (0.38% of rows) are compatible with the near_unique alert, which flags columns where almost every value is distinct. The maximum length of 926 characters is roughly nine times the 95th percentile (102), so at least one value is probably not a plain headline and is worth inspecting. The mean Flesch reading-ease score of 46.87 falls in the 30 to 50 band conventionally labelled difficult, which is plausible for dense, crafted satirical writing, though the formula was designed for longer passages and is only a rough guide on single headlines. The top words ('to', 'of', 'in', 'new') are mostly generic function words and offer little discriminative signal on their own. Treatment: deduplicate the 8 repeated headlines first, then tokenize and embed (e.g., with sentence-transformers) before modelling.
high · anthropic:default
n: 2,103
nulls: 0 (0.0%)
unique: 2,095
len_min: 8
len_max: 926
len_mean: 60.99
len_median: 58
len_p95: 102
word_mean: 9.268
word_median: 9
n_empty: 0
n_duplicates: 8
duplicate_rate: 0.003804
vocab_size: 7,613
readability_flesch_mean: 46.87
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0004755
allcaps_rate: 0.0004755
boilerplate_rate: 0.0004755
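
The duplicate and length figures above follow from simple counts, and the recommended deduplication is a one-liner. This is a minimal sketch under stated assumptions: `n_duplicates` is taken to mean rows minus unique values (which matches 2,103 - 2,095 = 8), and the tokenisation used for `vocab_size` is a guess, since the profiler's own rule is not documented here. The function names and the sample headlines are hypothetical.

```python
import re
from collections import Counter


def profile_headlines(headlines):
    """Summarise headline strings using simple, assumed definitions."""
    n = len(headlines)
    unique = len(set(headlines))
    n_duplicates = n - unique  # rows beyond the first occurrence of each value
    # Assumed tokenisation: lowercase runs of letters, digits and apostrophes.
    tokens = [re.findall(r"[a-z0-9'’]+", h.lower()) for h in headlines]
    vocab = Counter(tok for row in tokens for tok in row)
    return {
        "n": n,
        "unique": unique,
        "n_duplicates": n_duplicates,
        "duplicate_rate": n_duplicates / n,
        "len_mean": sum(len(h) for h in headlines) / n,
        "word_mean": sum(len(row) for row in tokens) / n,
        "vocab_size": len(vocab),
    }


def deduplicate(headlines):
    """Drop repeated headlines, keeping the first occurrence and the order."""
    return list(dict.fromkeys(headlines))


# Invented placeholder headlines, not taken from the dataset.
sample = ["Area Man Does Thing", "Area Man Does Thing", "New Study Finds Things"]
stats = profile_headlines(sample)
clean = deduplicate(sample)
print(stats)
print(clean)

# The reported rate is consistent with 8 duplicates in 2,103 rows.
reported_rate = round(8 / 2103, 6)
print(reported_rate)
```

Exact-match deduplication only removes identical strings; headlines that differ in punctuation or casing would need normalising first.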

https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg

text metadata one_word url_heavy
This column contains Kinja/Gawker CDN image URLs that serve as thumbnail or article image references across 2,103 rows. Every value is a URL (url_rate 1.0), and the example in the column name carries transform parameters in its path, including width and quality settings (w_645, q_60). The URLs are not all the same shape, however: lengths range from 107 to 119 characters with a median of 107, and the example URL is 119 characters long, so most rows are 12 characters shorter than it, differing in either the transform segment or the filename. There are 209 duplicate URLs (duplicate_rate of about 9.9%), and one URL appears 11 times, so some rows share a thumbnail; the most repeated URL may be a placeholder or default image. Treatment: extract the filename stem as an image identifier, and flag the top duplicate URL (count=11) as a possible default/placeholder to be handled separately from article-specific images.
high · anthropic:default
n: 2,103
nulls: 0 (0.0%)
unique: 1,894
len_min: 107
len_max: 119
len_mean: 107.9
len_median: 107
len_p95: 119
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 209
duplicate_rate: 0.09938
vocab_size: 1,894
readability_flesch_mean: -1672
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
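
The treatment suggested for this column (pull out the filename as an image identifier, then flag heavily reused images) could be sketched as follows with the standard library. The helper names `image_id` and `reuse_report` are hypothetical, and apart from the one URL quoted in this report, the sample URLs are invented placeholders.

```python
from collections import Counter
from posixpath import basename, splitext
from urllib.parse import urlsplit


def image_id(url):
    """Return the filename stem of an image URL (the part before the extension)."""
    return splitext(basename(urlsplit(url).path))[0]


def reuse_report(urls, min_count=2):
    """List (image_id, count) pairs for images used at least min_count times."""
    counts = Counter(image_id(u) for u in urls)
    return [(img, c) for img, c in counts.most_common() if c >= min_count]


PREFIX = "https://i.kinja-img.com/gawker-media/image/upload/"
known = PREFIX + "c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg"

# Invented placeholder URLs standing in for the rest of the column.
sample_urls = [
    known,
    PREFIX + "c_fit,f_auto,g_center,q_60,w_645/placeholder-one.jpg",
    PREFIX + "c_fit,f_auto,g_center,q_60,w_645/placeholder-one.jpg",
    PREFIX + "c_fit,f_auto,g_center,q_60,w_645/placeholder-two.jpg",
]

known_id = image_id(known)
report = reuse_report(sample_urls)
print(known_id)
print(report)
```

Grouping by filename stem rather than by full URL also catches the same image requested with different transform parameters, which matters if the path segment before the filename varies between rows.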