saturn·

satire theonion index to dataset

source /home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv · 2,103 rows · 3 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 2,103 rows and three columns scraped from The Onion: a numeric index, a satirical headline, and an associated image URL. The headline column is the most substantive: it has a vocabulary of 7,613 unique tokens, a median length of 9 words, and a mean Flesch reading-ease score of about 46.9, consistent with typical news-headline phrasing. The image URL column is uniform in structure (every value is a single URL averaging about 108 characters), but 9.9% of its rows (209) repeat an earlier URL and one image is reused 11 times, which is worth a look if you are checking scrape integrity. The numeric index column is a clean, gap-free sequence from 2 to 2,104 with no outliers and is essentially just a row identifier.

citing: row_count · column_count · columns[0].stats.duplicate_rate · columns[0].stats.n_duplicates · columns[0].stats.len_mean · columns[0].stats.url_rate · columns[1].stats.word_median · columns[1].stats.vocab_size · columns[1].stats.readability_flesch_mean · columns[1].top_words · columns[2].stats.min · columns[2].stats.max · columns[2].stats.mean

Schema

3 columns
column · type · nulls · unique · alerts
1 · numeric · 0.0% · 2,103 · none
‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident · text · 0.0% · 2,095 · near_unique
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg · text · 0.0% · 1,894 · one_word, url_heavy

1

numeric identifier
Column '1' holds unique integers running contiguously from 2 to 2,104 across all 2,103 rows, with zero nulls and a symmetric, flat distribution (skew 0.0, kurtosis -1.2, mean = median = 1,053). The contiguous range and one-to-one cardinality strongly suggest a row index or sequential identifier rather than a measured feature. The column name '1' and the start at 2 are consistent with the first data row having been read as the header. Treatment: drop before modelling; it is a sequential row id with no predictive content. high · anthropic:claude-opus-4-7
n: 2,103
nulls: 0 (0.0%)
unique: 2,103
min: 2
max: 2,104
mean: 1,053
median: 1,053
std: 607.2
q1: 527.5
q3: 1,578
iqr: 1,051
skew: 0
kurtosis: -1.2
n_outliers: 0
outlier_rate: 0
zero_rate: 0
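The reading of column '1' as a plain row counter can be checked directly, because a gap-free run of integers from 2 to 2104 has fully determined summary statistics. The following is a minimal sketch using only the Python standard library. The `is_contiguous` helper is hypothetical, and the `inclusive` (linear interpolation) quartile method is an assumption about how the profiler computed q1 and q3.

```python
import statistics

def is_contiguous(values):
    """Return True if the values are unique integers forming an unbroken run."""
    ordered = sorted(values)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))

# Stand-in for column '1': the integers 2..2104 reported in the profile.
ids = list(range(2, 2105))

n = len(ids)
mean = statistics.mean(ids)
median = statistics.median(ids)
std = statistics.stdev(ids)  # sample standard deviation (divides by n - 1)
q1, _, q3 = statistics.quantiles(ids, n=4, method="inclusive")

print(n, is_contiguous(ids))
print(mean, median, round(std, 1))
print(q1, q3, q3 - q1)
```

Under these assumptions the count, mean, median, std, q1, and iqr match the reported values. The computed q3 is 1578.5, which the report appears to display at four significant digits.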

‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident

text free_text near_unique
This column holds short English headlines (mean 9.27 words, median length 58 characters). The maximum length of 926 characters is far above the 95th percentile of 102, so at least one row is not a typical headline and is worth inspecting. The column name is itself a headline, which suggests the file was loaded without a header row. Of 2,103 rows, 2,095 are unique, leaving 8 duplicates and near-unique cardinality, and the top words are common function words such as 'to', 'of', and 'in', consistent with news titles. Flesch reading ease averages 46.87, which falls in the 30 to 50 band conventionally labelled difficult, and there are no URLs, emoji, or empty strings. Treatment: tokenize and embed before modelling; do not use as a categorical key given the near-unique values. high · anthropic:claude-opus-4-7
n: 2,103
nulls: 0 (0.0%)
unique: 2,095
len_min: 8
len_max: 926
len_mean: 60.99
len_median: 58
len_p95: 102
word_mean: 9.268
word_median: 9
n_empty: 0
n_duplicates: 8
duplicate_rate: 0.003804
vocab_size: 7,613
readability_flesch_mean: 46.87
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0004755
allcaps_rate: 0.0004755
boilerplate_rate: 0.0004755
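Several of the text metrics above are simple ratios over the row count: n_duplicates equals n minus unique (2,103 - 2,095 = 8), and the three rates of 0.0004755 each correspond to a single row out of 2,103. The sketch below shows one way such metrics could be computed. The `text_profile` function is hypothetical, the whitespace and lowercase tokenization is an assumption rather than the profiler's confirmed method, and the sample strings are invented placeholders, not rows from the dataset.

```python
from collections import Counter

def text_profile(values):
    """Compute a few per-column text metrics in the style of the report."""
    n = len(values)
    n_unique = len(Counter(values))
    word_counts = [len(v.split()) for v in values]
    vocab = {w.lower() for v in values for w in v.split()}
    return {
        "n": n,
        "unique": n_unique,
        "n_duplicates": n - n_unique,  # rows beyond the first occurrence
        "duplicate_rate": (n - n_unique) / n,
        "word_mean": sum(word_counts) / n,
        "one_word_rate": sum(c == 1 for c in word_counts) / n,
        "vocab_size": len(vocab),
    }

# Invented placeholder strings, not taken from the dataset.
sample = [
    "Local Cat Declines Comment",
    "Local Cat Declines Comment",
    "Weather Continues",
    "Breaking",
]
profile = text_profile(sample)
print(profile)

# Arithmetic check of the reported rates for the headline column.
reported_duplicate_rate = round(8 / 2103, 6)
reported_single_row_rate = round(1 / 2103, 7)
print(reported_duplicate_rate, reported_single_row_rate)
```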

https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg

text metadata one_word url_heavy
This column holds Kinja (Gawker Media) CDN image URLs, all sharing the same transform path (c_fit,f_auto,g_center,q_60,w_645) and differing only by a hashed filename, with url_rate 1.0 and one_word_rate 1.0. Lengths are nearly fixed (min 107, max 119, median 107). Across 2,103 rows there are 1,894 unique values, so 209 rows repeat an earlier URL (duplicate_rate 9.9%), and the most frequent URL recurs 11 times. The Flesch score of -1672 is an artifact of scoring URLs as prose and carries no meaning here. The column header is itself a URL, suggesting the file was loaded without a proper header row. Treatment: treat as an image asset reference; fix the header, then either drop the column or fetch and encode the images separately rather than feeding the URL string into a model. high · anthropic:claude-opus-4-7
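One way to fix the header is to reload the file as header-less and supply names explicitly, so that the first line is kept as data. The sketch below uses the standard `csv` module on a small in-memory stand-in. The column names `row_id`, `headline`, and `image_url` are hypothetical (the source file provides none), and the three rows and example.com URLs are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the source file does not provide any.
COLUMNS = ["row_id", "headline", "image_url"]

def read_headerless(handle):
    """Read a CSV with no header row, keeping the first line as data."""
    reader = csv.reader(handle)
    return [dict(zip(COLUMNS, row)) for row in reader]

# Tiny in-memory stand-in for the real file (invented rows).
raw = io.StringIO(
    '1,"First placeholder headline",https://example.com/a.jpg\n'
    '2,"Second placeholder headline",https://example.com/b.jpg\n'
    '3,"Third placeholder headline",https://example.com/a.jpg\n'
)
rows = read_headerless(raw)

# Count how often each image URL appears and keep the reused ones.
url_counts = Counter(r["image_url"] for r in rows)
reused = {url: c for url, c in url_counts.items() if c > 1}
print(len(rows), reused)
```

With pandas, passing `header=None` and `names=[...]` to `read_csv` has the same effect. Reloading this way should restore the row that is currently acting as the header, which would bring the index back to starting at 1.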
n: 2,103
nulls: 0 (0.0%)
unique: 1,894
len_min: 107
len_max: 119
len_mean: 107.9
len_median: 107
len_p95: 119
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 209
duplicate_rate: 0.09938
vocab_size: 1,894
readability_flesch_mean: -1672
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
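The readability_flesch_mean of -1672 looks alarming but follows from the formula if each URL is scored as one very long word. The sketch below uses the standard Flesch reading-ease formula; that the profiler uses exactly this formula, and how it counts syllables in a URL, are assumptions. Both function names are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading-ease score from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def implied_syllables_per_word(score, words_per_sentence=1.0):
    """Invert the formula: syllables per word that would yield this score."""
    return (206.835 - 1.015 * words_per_sentence - score) / 84.6

# A single-token "sentence" counted as 22 syllables (an illustrative figure).
url_like = flesch_reading_ease(words=1, sentences=1, syllables=22)

# Syllables per word implied by the reported mean of -1672.
implied = implied_syllables_per_word(-1672)
print(round(url_like, 2), round(implied, 1))
```

Under these assumptions, a mean of -1672 corresponds to roughly 22 counted syllables per URL, so the value reflects token length rather than reading difficulty.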