saturn·

us attention data wikipedia trending

source /home/coolhand/datasets/us-attention-data/wikipedia_trending.json · 500 rows · 5 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset captures 500 trending Wikipedia articles. Each row is identified by a unique title and described by days_in_top_100, peak_views, total_views, and a daily_views series. All three numeric columns are heavily right-skewed with many outliers: total_views has a skew of 10.4 and a max of ~23.9M against a median of ~213K, and peak_views behaves similarly (skew 9.7). Most articles spend only a few days in the top 100 (median 3), but a long tail stretches to the 30-day maximum. Start by examining the distributions of total_views and days_in_top_100 to see how concentrated attention is on a few breakout articles.
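One way to put a number on that concentration is the share of all views held by the top k articles. The following is a minimal sketch, not part of the profiler output: `top_share` is a hypothetical helper, and it assumes the JSON file loads as a list of row objects with a `total_views` key, which this report does not confirm.

```python
def top_share(rows, k, key="total_views"):
    """Fraction of the summed `key` held by the k largest rows."""
    values = sorted((row[key] for row in rows), reverse=True)
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(values[:k]) / total


# Toy rows standing in for the real file; the values are made up for illustration.
toy_rows = [
    {"title": "Example_A", "total_views": 900},
    {"title": "Example_B", "total_views": 50},
    {"title": "Example_C", "total_views": 30},
    {"title": "Example_D", "total_views": 20},
]
share = top_share(toy_rows, k=1)  # 900 out of 1000 total views
```

On the real data, the same function could be called with the rows loaded via `json.load`, provided the file has the assumed list-of-objects layout.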

citing: days_in_top_100.stats · peak_views.stats · total_views.stats · title.top_values · row_count

Schema

5 columns
Per-column summary.

column           type         nulls  unique  alerts
title            categorical  0.0%   500     long_tail
total_views      numeric      0.0%   500     high_skew, outliers
days_in_top_100  numeric      0.0%   27      high_skew, outliers
peak_views       numeric      0.0%   499     high_skew, outliers
daily_views      unknown      0.0%           skipped

title

categorical identifier long_tail
Wikipedia-style article titles with underscores (e.g. '1989_Tiananmen_Square_protests_and_massacre', 'Stranger_Things_season_5'), unique across all 500 rows (n_unique=500, entropy_ratio=1.0). Every value appears exactly once, so this functions as a row identifier rather than a categorical feature. The long_tail alert simply reflects that uniqueness. Treatment: Treat as a row key; drop from modelling or use only for joins and lookup. high · anthropic:claude-opus-4-7
n              500
nulls          0 (0.0%)
unique         500
top_value      1989_Tiananmen_Square_protests_and_massacre
top_rate       0.002
cardinality    500
entropy        8.966
entropy_ratio  1
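
Before relying on title as a row key, it may be worth re-checking uniqueness at load time. This is a minimal sketch under the assumption that rows are dictionaries with a `title` key; `index_by_title` is a hypothetical helper, not something the profiler provides.

```python
def index_by_title(rows, key="title"):
    """Build a title -> row lookup, failing loudly if any title repeats."""
    index = {}
    for row in rows:
        title = row[key]
        if title in index:
            raise ValueError(f"duplicate {key}: {title!r}")
        index[title] = row
    return index


# Toy rows with placeholder titles and made-up values.
toy_rows = [
    {"title": "Example_article_A", "days_in_top_100": 4},
    {"title": "Example_article_B", "days_in_top_100": 2},
]
lookup = index_by_title(toy_rows)
```

Raising on a duplicate, rather than silently overwriting, keeps later joins from quietly dropping rows if a future snapshot of the data breaks the uniqueness seen here.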

total_views

numeric feature high_skew outliers
Likely a per-row view count, with all 500 values unique and no nulls or zeros. The distribution is severely right-skewed (skew 10.44, kurtosis 149.66): the median is 213,065 but the mean is 580,331 and the max reaches 23,890,102, roughly 112x the median. Outliers make up 10.2% of rows (51 of 500), so a small set of viral entries dominates the tail. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n             500
nulls         0 (0.0%)
unique        500
min           76,451
max           2.389e+07
mean          5.803e+05
median        213,065
std           1.424e+06
q1            1.151e+05
q3            535,224
iqr           4.201e+05
skew          10.44
kurtosis      149.7
n_outliers    51
outlier_rate  0.102
zero_rate     0
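
The log-transform treatment suggested for this column can be checked by comparing skewness before and after the transform. The sketch below uses only the standard library and the population (Fisher-Pearson) skewness formula; the profiler's exact skew estimator is not stated in this report, so its numbers may differ slightly. `skewness` and `log_transform` are hypothetical helper names, and the sample values are made up.

```python
import math


def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Natural log of each value; requires strictly positive inputs."""
    return [math.log(v) for v in values]


# Made-up heavy-tailed sample, not taken from the dataset.
toy_views = [80_000, 120_000, 200_000, 250_000, 500_000, 20_000_000]
raw_skew = skewness(toy_views)
log_skew = skewness(log_transform(toy_views))
```

A plain `math.log` is safe for total_views because the profile reports a zero_rate of 0 and a minimum of 76,451; a column that contained zeros would need `math.log1p` or a similar offset instead.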

days_in_top_100

numeric feature high_skew outliers
This column counts the days a record spent in some top-100 ranking, with 500 non-null integer values ranging from 1 to 30 and a median of just 3. The distribution is heavily right-skewed (skew 2.45, kurtosis 5.98): most items churn out fast while a long tail lingers, producing 56 outliers (11.2% outlier rate) well beyond the q3 of 6. The mean (5.18) sits well above the median (3), and only 27 unique values suggest tenure is bounded and discrete. Treatment: Log- or sqrt-transform before modelling to tame the right skew and outlier mass. high · anthropic:claude-opus-4-7
n             500
nulls         0 (0.0%)
unique        27
min           1
max           30
mean          5.176
median        3
std           6.239
q1            2
q3            6
iqr           4
skew          2.449
kurtosis      5.979
n_outliers    56
outlier_rate  0.112
zero_rate     0
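
The report does not say which rule produced the outlier count. Assuming the common 1.5 × IQR box-plot rule (an assumption, not confirmed by the profiler), the quartiles reported above give the fences below. `tukey_fences` and `count_outliers` are hypothetical helper names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the k * IQR box-plot rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, lower, upper):
    """Number of values strictly outside the interval [lower, upper]."""
    return sum(1 for v in values if v < lower or v > upper)


# Quartiles reported above for days_in_top_100: q1 = 2, q3 = 6, so IQR = 4.
lower, upper = tukey_fences(q1=2, q3=6)
```

Under this assumption the upper fence is 6 + 1.5 × 4 = 12 days, so only articles with more than 12 days in the top 100 would be flagged; the lower fence is negative and can never trigger, since the minimum is 1. Whether this reproduces the reported 56 outliers depends on the rule the profiler actually applies.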

peak_views

numeric feature high_skew outliers
A numeric measure of peak viewership per record, with 499 unique values across 500 rows and no nulls or zeros. The distribution is severely right-skewed (skew 9.71, kurtosis 127.99): the median is 104,303 but the mean is 159,797 and the max reaches 4,011,044, well above q3 of 148,907. 57 outliers (11.4%) sit above the upper whisker, suggesting a small tail of viral peaks dominates the variance (std 250,171). Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n             500
nulls         0 (0.0%)
unique        499
min           40,332
max           4.011e+06
mean          1.598e+05
median        104,303
std           2.502e+05
q1            7.73e+04
q3            148,907
iqr           7.16e+04
skew          9.709
kurtosis      128
n_outliers    57
outlier_rate  0.114
zero_rate     0

daily_views

unknown other skipped
Column 'daily_views' was skipped by the profiler, so no type, uniqueness, or distribution stats were computed even though all 500 rows are non-null. The name suggests a per-day view count (likely numeric and right-skewed in practice), but nothing in the evidence confirms that. Re-run profiling with this column included before drawing any conclusions. Treatment: Re-profile the column to recover type and distribution before any downstream use. low · anthropic:claude-opus-4-7
n      500
nulls  0 (0.0%)
unique
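
A quick first step before re-profiling is to tally which Python types the column actually holds after loading. This is a minimal sketch with a hypothetical helper, `value_type_counts`. It assumes rows are dictionaries, and the list-valued toy data is a guess at the shape, not something this report confirms.

```python
from collections import Counter


def value_type_counts(rows, column):
    """Tally the Python type name of each row's value in `column`."""
    return Counter(type(row.get(column)).__name__ for row in rows)


# Toy rows; the nested-list shape is an assumption, not observed in the report.
toy_rows = [
    {"title": "Example_A", "daily_views": [10, 20, 30]},
    {"title": "Example_B", "daily_views": [5]},
    {"title": "Example_C", "daily_views": None},
]
counts = value_type_counts(toy_rows, "daily_views")
```

If the tally shows a nested type such as list or dict, that would explain why a scalar-oriented profiler skipped the column, and the values would need flattening or per-row summarising before distribution stats make sense.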