satire theonion index to dataset
Reading
This dataset contains 2,103 rows and three columns scraped from The Onion: a numeric index, a satirical headline, and an associated image URL. The headline column is the most substantive: it has 7,613 unique vocabulary tokens, a median of 9 words per headline, and a mean Flesch reading-ease score of about 46.9, which sits in the moderately difficult range and is consistent with dense news-headline phrasing. The image URL column is uniform in structure (every value is a single URL, averaging about 108 characters) but has a duplicate rate of roughly 9.9%, with one image reused 11 times, which is worth a look if you are checking scrape integrity. The numeric index column is a clean sequence from 2 to 2,104 with no outliers and is essentially a row identifier.
citing: row_count · column_count · columns[0].stats.duplicate_rate · columns[0].stats.n_duplicates · columns[0].stats.len_mean · columns[0].stats.url_rate · columns[1].stats.word_median · columns[1].stats.vocab_size · columns[1].stats.readability_flesch_mean · columns[1].top_words · columns[2].stats.min · columns[2].stats.max · columns[2].stats.mean
Charts the summary said to look at first
Headline length (characters)
| chars | count |
|---|---|
| 8 – 31 | 131 |
| 31 – 54 | 781 |
| 54 – 77 | 719 |
| 77 – 100 | 353 |
| 100 – 123 | 99 |
| 123 – 146 | 13 |
| 146 – 169 | 4 |
| 169 – 192 | 0 |
| 192 – 215 | 0 |
| 215 – 238 | 0 |
| 238 – 260 | 1 |
| 260 – 283 | 1 |
| 283 – 306 | 0 |
| 306 – 329 | 0 |
| 329 – 352 | 0 |
| 352 – 375 | 0 |
| 375 – 398 | 0 |
| 398 – 421 | 0 |
| 421 – 444 | 0 |
| 444 – 467 | 0 |
| 467 – 490 | 0 |
| 490 – 513 | 0 |
| 513 – 536 | 0 |
| 536 – 559 | 0 |
| 559 – 582 | 0 |
| 582 – 605 | 0 |
| 605 – 628 | 0 |
| 628 – 651 | 0 |
| 651 – 674 | 0 |
| 674 – 696 | 0 |
| 696 – 719 | 0 |
| 719 – 742 | 0 |
| 742 – 765 | 0 |
| 765 – 788 | 0 |
| 788 – 811 | 0 |
| 811 – 834 | 0 |
| 834 – 857 | 0 |
| 857 – 880 | 0 |
| 880 – 903 | 0 |
| 903 – 926 | 1 |
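The bin edges in this table are consistent with 40 equal-width bins spanning the minimum and maximum length (8 to 926, so each bin is about 22.95 characters wide). The following is a minimal sketch of that binning, not the profiler's actual code; `equal_width_histogram` is a hypothetical helper and the `lengths` list is toy data rather than the real column.

```python
def equal_width_histogram(values, n_bins=40):
    """Count values into n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # Clamp so the maximum value falls in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Toy headline lengths (assumed values, chosen to span 8..926).
lengths = [8, 30, 45, 58, 60, 75, 102, 926]
edges, counts = equal_width_histogram(lengths)

# Report only the non-empty bins, which is where the long tail shows up.
for i, c in enumerate(counts):
    if c:
        print(f"{edges[i]:.0f} - {edges[i + 1]:.0f}: {c}")
```

With a single very long value, most bins stay empty, which is the same pattern the table shows for the one headline near 926 characters.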
Image URL length (characters)
| chars | count |
|---|---|
| 107 | 1,948 |
| 108 – 118 | 0 |
| 119 | 155 |
Index value (column '1')
| bin | count |
|---|---|
| 2 – 54.55 | 53 |
| 54.55 – 107.1 | 53 |
| 107.1 – 159.6 | 52 |
| 159.6 – 212.2 | 53 |
| 212.2 – 264.8 | 52 |
| 264.8 – 317.3 | 53 |
| 317.3 – 369.8 | 52 |
| 369.8 – 422.4 | 53 |
| 422.4 – 474.9 | 52 |
| 474.9 – 527.5 | 53 |
| 527.5 – 580 | 53 |
| 580 – 632.6 | 52 |
| 632.6 – 685.1 | 53 |
| 685.1 – 737.7 | 52 |
| 737.7 – 790.2 | 53 |
| 790.2 – 842.8 | 52 |
| 842.8 – 895.3 | 53 |
| 895.3 – 947.9 | 52 |
| 947.9 – 1000 | 53 |
| 1000 – 1053 | 52 |
| 1053 – 1106 | 53 |
| 1106 – 1158 | 53 |
| 1158 – 1211 | 52 |
| 1211 – 1263 | 53 |
| 1263 – 1316 | 52 |
| 1316 – 1368 | 53 |
| 1368 – 1421 | 52 |
| 1421 – 1473 | 53 |
| 1473 – 1526 | 52 |
| 1526 – 1578 | 53 |
| 1578 – 1631 | 53 |
| 1631 – 1684 | 52 |
| 1684 – 1736 | 53 |
| 1736 – 1789 | 52 |
| 1789 – 1841 | 53 |
| 1841 – 1894 | 52 |
| 1894 – 1946 | 53 |
| 1946 – 1999 | 52 |
| 1999 – 2051 | 53 |
| 2051 – 2104 | 53 |
Schema
3 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| 1 | numeric | 0.0% | 2,103 | |
| ‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident | text | 0.0% | 2,095 | near_unique |
| https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg | text | 0.0% | 1,894 | one_word, url_heavy |
1
numeric identifier
Column '1' is a perfectly unique integer running from 2 to 2,104 across all 2,103 rows, with zero nulls and a symmetric, flat distribution (skew 0.0, kurtosis -1.2, mean = median = 1,053). The contiguous range and one-to-one cardinality strongly suggest a row index or sequential identifier rather than a measured feature. Treatment: drop before modelling; it is a sequential row id with no predictive content.
- n: 2,103
- nulls: 0 (0.0%)
- unique: 2,103
- min: 2
- max: 2,104
- mean: 1,053
- median: 1,053
- std: 607.2
- q1: 527.5
- q3: 1,578
- iqr: 1,051
- skew: 0
- kurtosis: -1.2
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
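The claim that column '1' is a contiguous row id can be checked directly. This is a minimal sketch, assuming the values are available as a list of integers; `is_sequential_id` is a hypothetical helper name. Using the sample standard deviation is also an assumption, one that happens to reproduce the reported std of 607.2.

```python
import statistics


def is_sequential_id(values):
    """True if values are unique integers covering one contiguous range."""
    vals = list(values)
    if not vals:
        return False
    all_unique = len(set(vals)) == len(vals)
    contiguous = max(vals) - min(vals) + 1 == len(vals)
    return all_unique and contiguous


# Stand-in for the column: the integers 2..2104 that the profile reports.
ids = list(range(2, 2105))

print(is_sequential_id(ids))            # uniqueness plus contiguity check
print(statistics.mean(ids))             # midpoint of the range
print(round(statistics.stdev(ids), 1))  # sample standard deviation
```

A column that passes this check carries no information beyond row order, which is why the treatment above is to drop it before modelling.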
‘That’ll Be $3,’ Says Trump After Handing Water Bottle To Sick Ohio Resident
text free_text near_unique
This column appears to hold short English headlines (mean 9.27 words, median length 58 chars, max length 926 chars), with the column name itself being one such headline. Of 2,103 rows, 2,095 are unique and only 8 are duplicates, giving near-unique cardinality, while the top words are common function words like 'to', 'of', and 'in', consistent with news titles. Flesch readability averages 46.87 (moderately difficult), and there are no URLs, emoji, or empty strings. Treatment: tokenize and embed before modelling; do not use as a categorical key given near-unique values.
- n: 2,103
- nulls: 0 (0.0%)
- unique: 2,095
- len_min: 8
- len_max: 926
- len_mean: 60.99
- len_median: 58
- len_p95: 102
- word_mean: 9.268
- word_median: 9
- n_empty: 0
- n_duplicates: 8
- duplicate_rate: 0.003804
- vocab_size: 7,613
- readability_flesch_mean: 46.87
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.0004755
- allcaps_rate: 0.0004755
- boilerplate_rate: 0.0004755
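The reported figures are consistent with `n_duplicates = n - unique` and `duplicate_rate = n_duplicates / n` (here 2,103 - 2,095 = 8 and 8 / 2,103 ≈ 0.003804). That definition is inferred from the numbers, not confirmed by the profiler, so the sketch below is an assumption about how the statistic is computed; `duplicate_stats` is a hypothetical helper.

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate), counting rows beyond the first
    occurrence of each distinct value as duplicates."""
    n = len(values)
    n_duplicates = n - len(set(values))
    rate = n_duplicates / n if n else 0.0
    return n_duplicates, rate


# Toy headlines (not from the dataset): "a" appears three times.
toy = ["a", "b", "a", "c", "a"]
print(duplicate_stats(toy))

# Arithmetic check against the reported headline-column figures.
n, unique = 2103, 2095
print(n - unique, round((n - unique) / n, 6))
```

Under this definition a value that appears three times contributes two duplicates, so the count measures surplus rows rather than the number of distinct repeated values.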
https://i.kinja-img.com/gawker-media/image/upload/c_fit,f_auto,g_center,q_60,w_645/a30d02609da7e7ddb46d6edea9460d5e.jpg
text metadata one_word url_heavy
This column holds Kinja (Gawker Media) CDN image URLs, all sharing the same transform path (c_fit,f_auto,g_center,q_60,w_645) and differing only by a hashed filename, with url_rate 1.0 and one_word_rate 1.0. Lengths are nearly fixed (min 107, max 119, median 107), and across 2,103 rows there are 1,894 unique values with a 9.9% duplicate_rate (209 duplicates), the top URL recurring 11 times. The column header itself is a URL, suggesting the file was loaded without a proper header row. Treatment: treat as an image asset reference; fix the header, then either drop the column or fetch and encode the image separately rather than feeding the URL string into a model.
- n: 2,103
- nulls: 0 (0.0%)
- unique: 1,894
- len_min: 107
- len_max: 119
- len_mean: 107.9
- len_median: 107
- len_p95: 119
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 209
- duplicate_rate: 0.09938
- vocab_size: 1,894
- readability_flesch_mean: -1672
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
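The header fix suggested above can be sketched as re-reading the file with explicit column names, so that the first data row is no longer consumed as a header. This is a minimal sketch under stated assumptions: the file is taken to be comma-separated with columns in the order index, headline, image URL; the names in `COLUMNS` are hypothetical; and the in-memory sample uses placeholder `example.com` URLs rather than real rows.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the source file does not supply any.
COLUMNS = ["index", "headline", "image_url"]


def read_headerless(fileobj):
    """Read every row as data and attach explicit column names."""
    reader = csv.reader(fileobj)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Placeholder sample standing in for the real file.
sample = io.StringIO(
    '1,"First headline",https://example.com/a.jpg\n'
    '2,"Second headline",https://example.com/b.jpg\n'
    '3,"Third headline",https://example.com/a.jpg\n'
)
rows = read_headerless(sample)

# Count how often each image URL is reused across rows.
reuse = Counter(row["image_url"] for row in rows)
most_common_url, times = reuse.most_common(1)[0]
print(len(rows), most_common_url, times)
```

Reading this way keeps the row that was previously lost to the header, and the `Counter` gives a quick view of which images are shared across headlines before deciding whether to drop or fetch them.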