data trove accessibility audit tools

saturn notebook · generated 2026-06-21

Overview

Source: /home/coolhand/html/datavis/data_trove/accessibility_audit/web_accessibility_data_top100.csv

Saturn profiled 92 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

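# Install the profiler (IPython shell escape; use plain `pip install` outside a notebook).
!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the source CSV with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/accessibility_audit/web_accessibility_data_top100.csv",
    "--findings", "data-trove-accessibility-audit-tools.json",
    "--llm", "anthropic:default",
], check=True)  # raise CalledProcessError if the command exits non-zero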

Summary confidence: high

This dataset is a web accessibility audit of roughly the top 100 websites: 92 rows with error counts, error density, popularity rank, and a WAVE-derived rank (wave_rank). The most striking finding is that both 'errors' and 'error_density' are heavily right-skewed with extreme outliers: the median error count is just 5, but the maximum reaches 364, so a small cluster of sites is dramatically worse than the rest. A second angle worth exploring is the 'notes' column, where 'Low contrast text' is the most common single value (12 occurrences, 13.0% of rows) and also appears inside many of the compound values, pointing to a systemic problem across high-traffic sites. The near-uniform distribution of 'popularity_rank' (1 to 100, skew close to zero) suggests the sample spans the full top-100 range evenly, making comparisons across popularity tiers feasible.

citing: errors.median · errors.max · errors.skew · errors.n_outliers · error_density.skew · error_density.n_outliers · notes.top_value · notes.top_rate · popularity_rank.skew · popularity_rank.min · popularity_rank.max

Out[4]:

saturn.schema() · 6 columns

column	kind	n	null%	unique	alerts
domain	categorical	92	0.0%	92	long_tail
wave_rank	numeric	92	9.8%	83
popularity_rank	numeric	92	9.8%	82
errors	numeric	92	9.8%	36	high_skew outliers
error_density	numeric	92	9.8%	64	high_skew outliers
notes	categorical	92	0.0%	38	long_tail

Fig 1.

errors · Look for the extreme right tail — most sites cluster near zero errors but a handful have counts above 100, driving the mean far above the median.

Histogram bins for errors (median: 5.0).
bin	count
0 – 40.44	70
40.44 – 80.89	7
80.89 – 121.3	1
121.3 – 161.8	2
161.8 – 202.2	0
202.2 – 242.7	2
242.7 – 283.1	0
283.1 – 323.6	0
323.6 – 364	1

Fig 2.

error_density · Similar skew to raw errors but normalised per page element; outliers here flag sites that are structurally inaccessible, not just large.

Histogram bins for error_density (median: 0.0057).
bin	count
0 – 0.03367	65
0.03367 – 0.06733	10
0.06733 – 0.101	2
0.101 – 0.1347	2
0.1347 – 0.1683	0
0.1683 – 0.202	1
0.202 – 0.2357	2
0.2357 – 0.2693	0
0.2693 – 0.303	1

Fig 3.

notes · Low contrast text and missing alt text dominate — this bar reveals which accessibility failure types are most prevalent across top websites.

Top values for notes (20 unique shown, of 38 total).
value	count	share
Low contrast text	12	13.0%
Missing form input label	9	9.8%
Low contrast text, missing alt text	9	9.8%
Missing alt text	5	5.4%
Asia-based	5	5.4%
No detected errors	5	5.4%
Low contrast text, empty link	4	4.3%
Low contrast text, missing alt text, empty link, empty button	4	4.3%
No data	3	3.3%
Low contrast text, missing alt text, missing labels	3	3.3%
Low contrast text, missing alt text, empty button	3	3.3%
Missing document language	2	2.2%
Missing alt text, missing document language	2	2.2%
Missing form input label, empty button	2	2.2%
Low contrast text, missing alt text, empty link, missing form labels, empty button	1	1.1%
Low contrast text, missing alt text, empty links	1	1.1%
Missing alt text, missing form labels, empty buttons	1	1.1%
Missing form input label, missing document language	1	1.1%
Low contrast text, missing alt text, missing language	1	1.1%
Missing alt text, missing form labels	1	1.1%

Fig 4.

popularity_rank · The near-flat distribution confirms even coverage across the top 100, so error patterns can be compared fairly across all popularity tiers.

Histogram bins for popularity_rank (median: 52.0).
bin	count
1 – 12	11
12 – 23	5
23 – 34	10
34 – 45	10
45 – 56	9
56 – 67	9
67 – 78	9
78 – 89	9
89 – 100	11

Fig 5.

wave_rank · Values span 1,183 to 991,094 with a right skew. wave_rank is a rank, not a score, and its near-zero correlation with popularity_rank (+0.01, Fig 7) indicates that a site's WAVE rank varies largely independently of its popularity.

Histogram bins for wave_rank (median: 203569.0).
bin	count
1183 – 1.112e+05	27
1.112e+05 – 2.212e+05	17
2.212e+05 – 3.312e+05	10
3.312e+05 – 4.411e+05	5
4.411e+05 – 5.511e+05	7
5.511e+05 – 6.611e+05	6
6.611e+05 – 7.711e+05	2
7.711e+05 – 8.811e+05	4
8.811e+05 – 9.911e+05	5

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
domain	categorical	0.0%
wave_rank	numeric	9.8%
popularity_rank	numeric	9.8%
errors	numeric	9.8%
error_density	numeric	9.8%
notes	categorical	0.0%
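
All four numeric columns report the same null rate (9 of 92 rows), which raises the question of whether the same nine rows are missing everywhere. The following is a minimal sketch of that check, assuming rows are available as dictionaries keyed by the column names above. The sample rows, the `.example` domains, and the `null_patterns` helper are illustrative and are not Saturn output.

```python
from collections import Counter

NUMERIC = ["wave_rank", "popularity_rank", "errors", "error_density"]

def null_patterns(rows):
    """Count rows by the tuple of numeric columns that are missing in them."""
    patterns = Counter()
    for row in rows:
        missing = tuple(col for col in NUMERIC if row.get(col) in (None, ""))
        patterns[missing] += 1
    return patterns

# Illustrative rows only, not taken from the audited CSV.
rows = [
    {"domain": "a.example", "wave_rank": 1183, "popularity_rank": 1,
     "errors": 0, "error_density": 0.0, "notes": "No detected errors"},
    {"domain": "b.example", "wave_rank": None, "popularity_rank": None,
     "errors": None, "error_density": None, "notes": "No data"},
    {"domain": "c.example", "wave_rank": 203569, "popularity_rank": 52,
     "errors": 5, "error_density": 0.0057, "notes": "Low contrast text"},
]

patterns = null_patterns(rows)
for missing, count in patterns.most_common():
    print(count, "row(s) missing:", missing or "nothing")
```

If one pattern covering all four columns accounts for every null row, the missingness is row-level (for example, sites that could not be scanned) and should be handled once per row, not column by column.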

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	wave_rank	popularity_rank	errors	error_density
wave_rank	+1.00	+0.01	+0.77	+0.52
popularity_rank	+0.01	+1.00	+0.06	-0.09
errors	+0.77	+0.06	+1.00	+0.28
error_density	+0.52	-0.09	+0.28	+1.00
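
The matrix above can be approximated with a pairwise-complete Pearson computation. This is a sketch only: it assumes a row is dropped from a pair when either value is null, which the report does not state, and the sample values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson r over the pairs where both values are present (None = null)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only, not rows from the audited CSV.
errors = [0, 2, 5, None, 26, 140]
wave_rank = [1200, 30000, 90000, 400000, None, 950000]
r = pearson(errors, wave_rank)
print(f"r = {r:+.2f}")
```

Because errors and error_density are heavily skewed, a few extreme sites can dominate a Pearson coefficient, so it is worth re-checking the strongest entries after a log transform.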

domain categorical identifier

This column contains internet domain names (e.g., google.com, youtube.com, wikipedia.org) and functions as a unique identifier for each row. All 92 rows hold distinct values (cardinality 92), so the entropy_ratio is exactly 1.0 and every domain appears exactly once; the top_rate of 0.0109 (1/92) confirms this. There is no distributional signal beyond the list itself, so the column is unsuitable as a categorical feature without further engineering.

Treatment: Use as a row key or for joins; extract TLD / domain-level features (e.g., suffix, brand) before any modelling.
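
A minimal sketch of the suffix extraction suggested above, using domains from Fig 8. The `split_domain` helper is hypothetical and splits naively on the last dot; multi-part suffixes such as `co.uk` would need a Public Suffix List lookup instead.

```python
def split_domain(domain: str) -> dict:
    """Naively split 'name.suffix' on the last dot."""
    name, _, suffix = domain.strip().lower().rpartition(".")
    return {"domain": domain, "name": name, "suffix": suffix}

sample = ["google.com", "wikipedia.org", "yandex.ru", "twitch.tv"]
features = [split_domain(d) for d in sample]

# Tally suffixes to get a low-cardinality feature from a unique-valued column.
suffix_counts = {}
for row in features:
    suffix_counts[row["suffix"]] = suffix_counts.get(row["suffix"], 0) + 1

print(suffix_counts)
```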

anthropic:default · confidence high

Out[13]:

saturn.columns["domain"].stats

stat	value
n	92
nulls	0 (0.0%)
unique	92
top_value	google.com
top_rate	0.01087
cardinality	92
entropy	6.524
entropy_ratio	1
alert: long_tail	92 singleton categories

Fig 8.

Top values for domain.

Top values for domain (20 unique shown, of 92 total).
value	count	share
google.com	1	1.1%
youtube.com	1	1.1%
facebook.com	1	1.1%
wikipedia.org	1	1.1%
instagram.com	1	1.1%
bing.com	1	1.1%
reddit.com	1	1.1%
x.com	1	1.1%
chatgpt.com	1	1.1%
yandex.ru	1	1.1%
whatsapp.com	1	1.1%
amazon.com	1	1.1%
yahoo.com	1	1.1%
weather.com	1	1.1%
duckduckgo.com	1	1.1%
microsoftonline.com	1	1.1%
twitch.tv	1	1.1%
linkedin.com	1	1.1%
live.com	1	1.1%
fandom.com	1	1.1%

wave_rank numeric feature

This column appears to be a rank derived from the WAVE accessibility evaluation tool (the summary describes it as an automated WAVE metric). Values range from 1,183 to 991,094, a span that suggests a ranking over a very large population of sites and not just the roughly 100 audited here. The distribution is moderately right-skewed (skew ≈ 0.96) with a wide IQR of 395,414.5 and a mean (301,136) well above the median (203,569): most sites sit at the low end of the range, with a long tail reaching toward the maximum. The +0.77 correlation with errors (Fig 7) is consistent with larger rank values indicating worse results, though the report does not state the direction. There are 83 unique values against 83 non-null rows, so, assuming nulls are not counted as a value, every non-null rank is distinct; the remaining 9 rows (~9.8%) are null, which warrants investigation before use.

Treatment: Investigate nulls (9.78% missing) before modelling; consider rank-normalization or log-transform to reduce skew prior to regression.
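
The two transforms named in the treatment can be sketched as follows. The input list reuses the min, q1 (rounded), median, q3, and max from the stats table plus one null; `rank_normalize` is a hypothetical helper, and base-10 is just one possible choice for the log.

```python
import math

def rank_normalize(values):
    """Map non-null values to (0, 1] by average rank; nulls stay None."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    positions = {}
    for i, v in enumerate(present, start=1):
        positions.setdefault(v, []).append(i)  # tied values share positions
    avg_rank = {v: sum(p) / len(p) for v, p in positions.items()}
    return [None if v is None else avg_rank[v] / n for v in values]

# Summary points of wave_rank from the stats table, plus one null.
wave_rank = [1183, 82210, 203569, None, 477628, 991094]

normalized = rank_normalize(wave_rank)
logged = [None if v is None else math.log10(v) for v in wave_rank]

print(normalized)
print([None if v is None else round(v, 2) for v in logged])
```

Rank normalization discards the spacing between values entirely, while the log transform keeps it in compressed form; which is preferable depends on whether the magnitude of the gap between two ranks is meaningful.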

anthropic:default · confidence medium

Out[16]:

saturn.columns["wave_rank"].stats

stat	value
n	92
nulls	9 (9.8%)
unique	83
min	1,183
max	991,094
mean	3.011e+05
median	203,569
std	2.746e+05
q1	8.221e+04
q3	477,628
iqr	3.954e+05
skew	0.9633
kurtosis	-0.1876
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of wave_rank. Vertical dash marks the median.

Histogram bins for wave_rank (median: 203569.0).
bin	count
1183 – 1.112e+05	27
1.112e+05 – 2.212e+05	17
2.212e+05 – 3.312e+05	10
3.312e+05 – 4.411e+05	5
4.411e+05 – 5.511e+05	7
5.511e+05 – 6.611e+05	6
6.611e+05 – 7.711e+05	2
7.711e+05 – 8.811e+05	4
8.811e+05 – 9.911e+05	5

popularity_rank numeric feature

This column is a popularity rank, almost certainly an ordinal position from 1 to 100 assigned to each site. Its distribution is close to uniform: mean 51.2, median 52.0, an IQR of 49.0 spanning Q1 = 27.5 to Q3 = 76.5, near-zero skew (-0.05), and a platykurtic shape (kurtosis -1.19), all consistent with an approximately flat spread across the full 1–100 range. There are 82 unique values among the 83 non-null rows, so, assuming nulls are not counted as a value, exactly one rank is shared by two sites; that points to a single tie or duplicate entry, not binned scoring. The 9.78% null rate (9 rows) warrants attention: records missing this field may represent unranked or newly added sites.

Treatment: Use as-is or normalize to [0,1]; investigate null records to determine if missingness is informative before imputing.
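
A minimal sketch of the normalization step that keeps missingness visible instead of imputing it away. The bounds 1 and 100 come from the stats table; the `minmax_with_flag` helper and the sample values are illustrative.

```python
def minmax_with_flag(values, lo=1, hi=100):
    """Scale ranks to [0, 1] and return a parallel is-missing flag."""
    scaled, missing = [], []
    for v in values:
        missing.append(v is None)
        scaled.append(None if v is None else (v - lo) / (hi - lo))
    return scaled, missing

# min, median, a null, and max of popularity_rank.
popularity_rank = [1, 52, None, 100]
scaled, missing = minmax_with_flag(popularity_rank)

print(scaled)
print(missing)
```

Keeping the flag as its own column makes it possible to test later whether rows without a rank differ systematically from ranked rows before deciding on an imputation strategy.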

anthropic:default · confidence high

Out[19]:

saturn.columns["popularity_rank"].stats

stat	value
n	92
nulls	9 (9.8%)
unique	82
min	1
max	100
mean	51.23
median	52
std	29.39
q1	27.5
q3	76.5
iqr	49
skew	-0.05046
kurtosis	-1.192
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 10.

Distribution of popularity_rank. Vertical dash marks the median.

Histogram bins for popularity_rank (median: 52.0).
bin	count
1 – 12	11
12 – 23	5
23 – 34	10
34 – 45	10
45 – 56	9
56 – 67	9
67 – 78	9
78 – 89	9
89 – 100	11

errors numeric feature

This column most likely records the number of accessibility errors detected on each audited site, with values ranging from 0 to 364. The distribution is severely right-skewed (skew = 3.80, kurtosis = 16.12): the median is only 5.0 while the mean is 27.17 and the standard deviation is 57.20, so a small number of extreme sites pull the average far upward. Eight outliers (9.6% of the 83 non-null rows) drive this effect, and 9 rows (~9.8%) are null; both figures warrant investigation before modelling.

Treatment: Log-transform (log1p) or apply a robust scaler before modelling; investigate and impute or flag the 9.78% nulls and 8 extreme outliers separately.
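
The two options in the treatment behave quite differently on this column. The sketch below applies both to a few values from the stats table, using the reported median (5) and IQR (24.5) as the robust-scaling parameters; the function names are illustrative.

```python
import math

MEDIAN, IQR = 5.0, 24.5  # errors.median and errors.iqr from the stats table

def log1p_transform(x):
    """log(1 + x), defined at zero; nulls pass through unchanged."""
    return None if x is None else math.log1p(x)

def robust_scale(x):
    """Center on the median and divide by the IQR; nulls pass through."""
    return None if x is None else (x - MEDIAN) / IQR

# min, median, q3, max, and a null
for value in (0, 5, 26.5, 364, None):
    print(value, log1p_transform(value), robust_scale(value))
```

log1p compresses the maximum of 364 to roughly 5.9, whereas robust scaling leaves it about 14.7 IQRs above the median. Robust scaling makes the scale insensitive to outliers but does not remove the skew, so a model that is sensitive to long tails would still need the log transform or explicit outlier handling.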

anthropic:default · confidence high

Out[22]:

saturn.columns["errors"].stats

stat	value
n	92
nulls	9 (9.8%)
unique	36
min	0
max	364
mean	27.17
median	5
std	57.2
q1	2
q3	26.5
iqr	24.5
skew	3.803
kurtosis	16.12
n_outliers	8
outlier_rate	0.09639
zero_rate	0.06024
alert: high_skew	skew=+3.80
alert: outliers	9.6% rows beyond 1.5 IQR
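
The outlier alert reads like Tukey's 1.5 × IQR rule. Assuming that is what Saturn applies (the report does not say so explicitly), the fences follow directly from the quartiles in the stats tables; `tukey_fences` is an illustrative helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from the errors and error_density stats tables.
errors_lo, errors_hi = tukey_fences(2, 26.5)
density_lo, density_hi = tukey_fences(0.0025, 0.02545)

print("errors fences:", errors_lo, errors_hi)
print("error_density fences:", density_lo, density_hi)
```

Under this assumption, any site with more than 63.25 errors is flagged. The lower fence is negative, so for a count that cannot go below zero every flagged value lies in the upper tail.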

Fig 11.

Distribution of errors. Vertical dash marks the median.

Histogram bins for errors (median: 5.0).
bin	count
0 – 40.44	70
40.44 – 80.89	7
80.89 – 121.3	1
121.3 – 161.8	2
161.8 – 202.2	0
202.2 – 242.7	2
242.7 – 283.1	0
283.1 – 323.6	0
323.6 – 364	1

error_density numeric feature

This column represents an error rate: given the audit context and the Fig 2 description, most likely accessibility errors normalised per page element, with values bounded in [0, 0.303]. The distribution is severely right-skewed (skew = 3.23, kurtosis = 10.95): the median is only 0.0057 while the mean is 0.027, and 8 outliers (9.6% of the 83 non-null rows) pull the tail toward the maximum of 0.303. With 9 nulls (~9.8% of rows) and ~6% zeros among the non-null values, the bulk of observations cluster near zero, but a meaningful minority show substantially elevated error rates.

Treatment: Log-transform (e.g., log1p) before modelling to reduce skew; impute or flag nulls separately; investigate the 8 outliers for data quality issues.

anthropic:default · confidence high

Out[25]:

saturn.columns["error_density"].stats

stat	value
n	92
nulls	9 (9.8%)
unique	64
min	0
max	0.303
mean	0.02725
median	0.0057
std	0.05376
q1	0.0025
q3	0.02545
iqr	0.02295
skew	3.229
kurtosis	10.95
n_outliers	8
outlier_rate	0.09639
zero_rate	0.06024
alert: high_skew	skew=+3.23
alert: outliers	9.6% rows beyond 1.5 IQR

Fig 12.

Distribution of error_density. Vertical dash marks the median.

Histogram bins for error_density (median: 0.0057).
bin	count
0 – 0.03367	65
0.03367 – 0.06733	10
0.06733 – 0.101	2
0.101 – 0.1347	2
0.1347 – 0.1683	0
0.1683 – 0.202	1
0.202 – 0.2357	2
0.2357 – 0.2693	0
0.2693 – 0.303	1

notes categorical label

This column contains free-text accessibility audit notes describing detected WCAG/accessibility violations (e.g., 'Low contrast text', 'Missing alt text', 'Missing form input label') for 92 records. The top value 'Low contrast text' appears 12 times (13% of rows), and with 38 unique values across 92 rows the entropy ratio is high at 0.89, flagging a long tail of compound violation combinations (24 values occur only once). Labels are also spelled inconsistently across values, for example 'missing labels', 'missing form labels', and 'Missing form input label', or 'empty link' and 'empty links', which inflates the unique count. Notably, values like 'Asia-based' and 'No data' appear alongside structured violation strings, suggesting the column conflates geographic metadata and data-availability flags with accessibility findings, an inconsistency an analyst should investigate. No nulls exist, but the semantic mixing across categories limits direct modelling use.

Treatment: Parse and split compound violation strings (comma-separated) into multi-label binary indicators; handle 'Asia-based' and 'No data' as separate flags.
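
A minimal sketch of that parsing step. The synonym map is an assumption built from the spelling variants visible in Fig 13 (for instance, treating 'missing labels' as 'missing form input label'), and routing 'No detected errors' to the flag set alongside 'Asia-based' and 'No data' is likewise a choice made for this example.

```python
# Hypothetical synonym map, based on variants seen in the notes values.
CANONICAL = {
    "missing labels": "missing form input label",
    "missing form labels": "missing form input label",
    "empty links": "empty link",
    "empty buttons": "empty button",
    "missing language": "missing document language",
}
# Values that are not violations; kept as separate flags (assumption).
NON_FINDINGS = {"asia-based", "no data", "no detected errors"}

def parse_notes(note):
    """Split one notes string into (violation labels, non-finding flags)."""
    labels, flags = set(), set()
    for part in note.split(","):
        token = part.strip().lower()
        if not token:
            continue
        if token in NON_FINDINGS:
            flags.add(token)
        else:
            labels.add(CANONICAL.get(token, token))
    return labels, flags

def to_indicators(notes):
    """Build a sorted label vocabulary and one 0/1 row per note."""
    parsed = [parse_notes(n) for n in notes]
    vocab = sorted(set().union(*(labels for labels, _ in parsed)))
    rows = [{label: int(label in labels) for label in vocab} for labels, _ in parsed]
    return vocab, rows

sample = [
    "Low contrast text, missing alt text, empty links",
    "Missing form input label",
    "Asia-based",
]
vocab, rows = to_indicators(sample)
print(vocab)
print(rows)
```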

anthropic:default · confidence high

Out[28]:

saturn.columns["notes"].stats

stat	value
n	92
nulls	0 (0.0%)
unique	38
top_value	Low contrast text
top_rate	0.1304
cardinality	38
entropy	4.663
entropy_ratio	0.8885
alert: long_tail	24 singleton categories
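
The entropy figures can be cross-checked from the counts in Fig 13. The 20 values shown sum to 74 rows, so the 18 values not shown must each occur once (74 + 18 = 92, and 6 + 18 = 24 singletons, matching the alert). The sketch assumes entropy is measured in bits and that entropy_ratio divides by log2 of the cardinality; both are inferred from the numbers agreeing, not stated by Saturn.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    if k < 2:
        return 0.0  # convention chosen for this sketch
    return shannon_entropy(counts) / math.log2(k)

# notes: 14 repeated values from Fig 13 plus 24 singletons.
notes_counts = [12, 9, 9, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2] + [1] * 24
# domain: every one of the 92 values occurs once.
domain_counts = [1] * 92

print(round(shannon_entropy(notes_counts), 3), round(entropy_ratio(notes_counts), 4))
print(round(shannon_entropy(domain_counts), 3), round(entropy_ratio(domain_counts), 4))
```

Under these assumptions the notes counts give approximately 4.663 bits and a ratio of about 0.8885, and the domain column gives log2(92) ≈ 6.524 bits with a ratio of 1, in line with the reported stats.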

Fig 13.

Top values for notes.

Top values for notes (20 unique shown, of 38 total).
value	count	share
Low contrast text	12	13.0%
Missing form input label	9	9.8%
Low contrast text, missing alt text	9	9.8%
Missing alt text	5	5.4%
Asia-based	5	5.4%
No detected errors	5	5.4%
Low contrast text, empty link	4	4.3%
Low contrast text, missing alt text, empty link, empty button	4	4.3%
No data	3	3.3%
Low contrast text, missing alt text, missing labels	3	3.3%
Low contrast text, missing alt text, empty button	3	3.3%
Missing document language	2	2.2%
Missing alt text, missing document language	2	2.2%
Missing form input label, empty button	2	2.2%
Low contrast text, missing alt text, empty link, missing form labels, empty button	1	1.1%
Low contrast text, missing alt text, empty links	1	1.1%
Missing alt text, missing form labels, empty buttons	1	1.1%
Missing form input label, missing document language	1	1.1%
Low contrast text, missing alt text, missing language	1	1.1%
Missing alt text, missing form labels	1	1.1%

How to cite

BibTeX

@misc{saturn-data-trove-accessibility-audit-tools-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove accessibility audit tools},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-accessibility-audit-tools}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}

APA

Steuber, L. (2026). Saturn reading: data trove accessibility audit tools. Source: /home/coolhand/html/datavis/data_trove/accessibility_audit/web_accessibility_data_top100.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-accessibility-audit-tools