data trove onion headlines
Reading
This dataset is an index of 2,103 articles from The Onion, a satirical news outlet, containing article headlines, thumbnail image URLs, and a sequential row ID. The headline column is the most analytically interesting field, with a vocabulary of 7,613 unique words and a mean headline length of about 9 words and 61 characters (median 58, maximum 926); its length distribution and common vocabulary patterns are worth exploring. There are also 209 duplicate image URLs (~10% of rows), suggesting that some thumbnails are reused across multiple articles; one image appears 11 times.
citing: row_count · column_count · n_unique · duplicate_rate · n_duplicates · vocab_size · len_mean · len_median · len_max · word_mean
Charts the summary said to look at first
Image URL length in characters (only two lengths occur; every bin between them is empty):
| chars | count |
|---|---|
| 107 | 1,948 |
| 108 – 118 | 0 |
| 119 | 155 |
Row ID (column '1') in 40 equal-width bins:
| bin | count |
|---|---|
| 2 – 54.55 | 53 |
| 54.55 – 107.1 | 53 |
| 107.1 – 159.6 | 52 |
| 159.6 – 212.2 | 53 |
| 212.2 – 264.8 | 52 |
| 264.8 – 317.3 | 53 |
| 317.3 – 369.8 | 52 |
| 369.8 – 422.4 | 53 |
| 422.4 – 474.9 | 52 |
| 474.9 – 527.5 | 53 |
| 527.5 – 580 | 53 |
| 580 – 632.6 | 52 |
| 632.6 – 685.1 | 53 |
| 685.1 – 737.7 | 52 |
| 737.7 – 790.2 | 53 |
| 790.2 – 842.8 | 52 |
| 842.8 – 895.3 | 53 |
| 895.3 – 947.9 | 52 |
| 947.9 – 1000 | 53 |
| 1000 – 1053 | 52 |
| 1053 – 1106 | 53 |
| 1106 – 1158 | 53 |
| 1158 – 1211 | 52 |
| 1211 – 1263 | 53 |
| 1263 – 1316 | 52 |
| 1316 – 1368 | 53 |
| 1368 – 1421 | 52 |
| 1421 – 1473 | 53 |
| 1473 – 1526 | 52 |
| 1526 – 1578 | 53 |
| 1578 – 1631 | 53 |
| 1631 – 1684 | 52 |
| 1684 – 1736 | 53 |
| 1736 – 1789 | 52 |
| 1789 – 1841 | 53 |
| 1841 – 1894 | 52 |
| 1894 – 1946 | 53 |
| 1946 – 1999 | 52 |
| 1999 – 2051 | 53 |
| 2051 – 2104 | 53 |
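The near-constant counts of 52 or 53 per bin are what a gap-free integer sequence produces when it is cut into 40 equal-width bins. The sketch below reproduces that shape under stated assumptions: `equal_width_counts` is a hypothetical helper, the stand-in values `range(2, 2105)` are built from the reported min and max rather than read from the dataset, and the profiler's own binning rule is not known, so the edge handling here is only one plausible choice.

```python
from collections import Counter


def equal_width_counts(values, n_bins=40):
    """Count integer values in n_bins equal-width bins spanning [min, max].

    Assumes at least two distinct integer values (so the span is non-zero).
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    counts = Counter()
    for v in values:
        # Integer arithmetic avoids floating-point trouble at bin edges;
        # the maximum value is clamped into the last bin.
        idx = min((v - lo) * n_bins // span, n_bins - 1)
        counts[idx] += 1
    return [counts[i] for i in range(n_bins)]


ids = list(range(2, 2105))  # stand-in for the ID column (assumption)
counts = equal_width_counts(ids)
print(sorted(set(counts)), sum(counts))
```

If the real column had gaps or repeated IDs, some bins would fall outside the 52 to 53 range, which makes this a quick visual check as well as a numeric one.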
Schema
3 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| 1 | numeric | 0.0% | 2,103 | |
| ‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident | text | 0.0% | 2,095 | near_unique |
| https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg | text | 0.0% | 1,894 | one_word, url_heavy |
1
numeric identifier
This column is almost certainly a sequential row identifier: it contains 2,103 values, all unique, running from min=2 to max=2,104, with a perfectly symmetric distribution (skew=0.0, mean=median=1,053) and zero outliers. The flat shape (excess kurtosis of -1.2, the value for a uniform distribution, and IQR=1,051, exactly half the range of 2,102) is what a gap-free arithmetic sequence produces. The column name '1', together with min=2, suggests that the file has no header row and that its first record (ID 1) was read as the column names, which would also explain why the other two column names look like data values.
Treatment: Drop before modelling; the column is a row-index artifact and carries no predictive signal.
- n: 2,103
- nulls: 0 (0.0%)
- unique: 2,103
- min: 2
- max: 2,104
- mean: 1,053
- median: 1,053
- std: 607.2
- q1: 527.5
- q3: 1,578
- iqr: 1,051
- skew: 0
- kurtosis: -1.2
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
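The statistics above can be checked directly before the column is dropped. This is a minimal sketch: `is_sequential_id` is a hypothetical helper, and the stand-in list is built from the reported min and max instead of being loaded from the file.

```python
import statistics


def is_sequential_id(values):
    """True if values are unique integers forming a gap-free run min..max.

    Assumes a non-empty iterable of integers.
    """
    vals = sorted(values)
    return vals == list(range(vals[0], vals[-1] + 1))


ids = list(range(2, 2105))  # stand-in for the ID column (assumption)

print(is_sequential_id(ids))                          # gap-free and unique?
print(statistics.mean(ids), statistics.median(ids))   # centre of the run
print(round(statistics.stdev(ids), 1))                # sample standard deviation
```

A column that passes this check encodes only row order, so dropping it loses nothing that the row position does not already provide.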
‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident
text free_text near_unique
This column contains satirical news headlines; the column name is itself one of them ('That'll Be $3,' Says Trump After Handing Water Bottle To Sick Ohio Resident). With 2,095 unique values in 2,103 rows and a mean length of ~61 characters (~9 words), these are near-unique short strings consistent with headline-style content. The near_unique alert does not mean fully unique: 8 rows repeat an earlier headline. The longest value is 926 characters, far above the 95th percentile of 102, so it is probably not a plain headline and is worth inspecting. The mean Flesch reading-ease score of 46.87 falls in the band usually labelled difficult (30 to 50), which is consistent with crafted satirical writing. The top words ('to', 'of', 'in', 'new') are mostly generic function words and offer little discriminative signal on their own.
Treatment: Tokenize and embed (e.g., sentence-transformers) before modelling; deduplicate the 8 duplicate headlines first.
- n: 2,103
- nulls: 0 (0.0%)
- unique: 2,095
- len_min: 8
- len_max: 926
- len_mean: 60.99
- len_median: 58
- len_p95: 102
- word_mean: 9.268
- word_median: 9
- n_empty: 0
- n_duplicates: 8
- duplicate_rate: 0.003804
- vocab_size: 7,613
- readability_flesch_mean: 46.87
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.0004755
- allcaps_rate: 0.0004755
- boilerplate_rate: 0.0004755
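The length, duplicate, and vocabulary figures above could be recomputed along the following lines. This is a sketch under assumptions: `headline_profile` is a hypothetical helper, the tokenizer (lowercase runs of word characters) is a guess and may not match the profiler's, so `vocab_size` can differ from the reported 7,613, and the sample rows are placeholders, not dataset values.

```python
import re
from statistics import mean, median


def headline_profile(headlines):
    """Summarise a list of headline strings (all rows, duplicates included)."""
    # Exact-match deduplication that keeps the first occurrence of each headline.
    unique = list(dict.fromkeys(headlines))
    char_lens = [len(h) for h in headlines]
    word_counts = [len(h.split()) for h in headlines]
    # Crude tokenizer (an assumption): lowercase, runs of word characters.
    vocab = {tok for h in headlines for tok in re.findall(r"\w+", h.lower())}
    stats = {
        "n": len(headlines),
        "unique": len(unique),
        "n_duplicates": len(headlines) - len(unique),
        "len_mean": mean(char_lens),
        "len_median": median(char_lens),
        "len_max": max(char_lens),
        "word_mean": mean(word_counts),
        "vocab_size": len(vocab),
    }
    return stats, unique


# Placeholder rows, not taken from the dataset.
sample = [
    "Placeholder Headline One",
    "Another Placeholder Headline",
    "Placeholder Headline One",
]
stats, deduped = headline_profile(sample)
print(stats)
print(deduped)
```

The returned `deduped` list is what would be passed on to tokenization or embedding, while `stats` describes the column as it arrived.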
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg
text metadata one_word url_heavy
This column contains Kinja/Gawker CDN image URLs that serve as thumbnail or article image references across 2,103 rows. Every value is a URL (url_rate 1.0). The sample shown as the column name is a JPEG with transform parameters in its path (c_fit, f_auto, g_center, q_60, w_645) followed by a hash filename. Not every row has exactly that shape, though: URL lengths take only two values, 107 characters (1,948 rows) and 119 characters (155 rows), and the sample itself is 119 characters long, so most URLs are 12 characters shorter than the sample and the path pattern should be checked per length group before parsing. There are 209 duplicate URLs (duplicate_rate ~9.9%), and one image URL appears 11 times, which suggests that multiple articles share a thumbnail, possibly a placeholder or default image.
Treatment: Extract the hash filename as an image identifier; flag the top duplicate URL (count=11) as a likely default/placeholder and treat it separately from article-specific images.
- n: 2,103
- nulls: 0 (0.0%)
- unique: 1,894
- len_min: 107
- len_max: 119
- len_mean: 107.9
- len_median: 107
- len_p95: 119
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 209
- duplicate_rate: 0.09938
- vocab_size: 1,894
- readability_flesch_mean: -1672
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
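The suggested treatment (extract the hash filename, then flag reused URLs) might look like the sketch below. `image_id` and `reused_urls` are hypothetical helper names, the only real URL used is the one shown as the column name, and taking the filename stem assumes the hash is always the last path segment, which should be confirmed for the shorter 107-character URLs.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_id(url):
    """Return the filename stem of the URL path (the hash in the sample URL)."""
    return PurePosixPath(urlparse(url).path).stem


def reused_urls(urls, min_count=2):
    """Return (url, count) pairs for URLs seen at least min_count times,
    most frequent first."""
    counts = Counter(urls)
    return [(u, c) for u, c in counts.most_common() if c >= min_count]


# The URL that appears as the column name in the report.
sample_url = (
    "https://i.kinja-img.com/gawker-media/image/upload/"
    "c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg"
)
print(image_id(sample_url))

# Toy list standing in for the URL column (placeholders, not dataset values).
print(reused_urls(["a", "b", "a", "a", "c", "b"]))
```

On the real column, the first entry returned by `reused_urls` would be the URL that appears 11 times, which is the candidate placeholder image to review by hand.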