accessibility wlasl index

saturn notebook · generated 2026-05-01 Report Notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv

Saturn profiled 2,000 rows across 2 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

# Notebook shell escape: install the saturn-dissect package.
!pip install saturn-dissect
import subprocess

# Run the saturn CLI against the source CSV.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv",
    "--findings", "accessibility-wlasl_index.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset contains 2,000 rows with two text columns, 'gloss' and 'instances'. The 'gloss' column is essentially a vocabulary list: every entry is unique, 97.75% are single words, and the mean length is just 6 characters, suggesting these are sign-language word labels. The 'instances' column is very different: every value is unique, every row contains a URL (url_rate 1.0), and values average 2,732 characters and about 263 words, with a maximum of 9,982 characters, indicating that each row holds a JSON-like or URL-laden payload of video references. The most useful first look is the length distribution of 'instances' to understand payload variability, alongside confirming that 'gloss' behaves like a clean lexicon.

citing: row_count · column_count · columns.gloss.n_unique · columns.gloss.stats.one_word_rate · columns.gloss.stats.len_mean · columns.gloss.stats.word_mean · columns.instances.stats.url_rate · columns.instances.stats.len_mean · columns.instances.stats.len_max · columns.instances.stats.word_mean
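The cited keys read like dotted paths into the findings file. A minimal sketch of resolving such paths, assuming the findings JSON is a nested object whose keys match the paths one segment at a time; the `lookup` helper and the `sample_findings` excerpt are hypothetical, with values copied from this report:

```python
from functools import reduce
from typing import Any


def lookup(findings: dict, dotted_path: str) -> Any:
    """Walk a nested dict one key at a time, e.g. 'columns.gloss.n_unique'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)


# Hypothetical excerpt shaped like the cited paths (assumed layout, not the real file).
sample_findings = {
    "row_count": 2000,
    "column_count": 2,
    "columns": {
        "gloss": {"n_unique": 2000, "stats": {"one_word_rate": 0.9775, "len_mean": 6.0075}},
        "instances": {"stats": {"url_rate": 1.0, "len_mean": 2731.8595, "len_max": 9982}},
    },
}

for path in ("row_count", "columns.gloss.stats.one_word_rate", "columns.instances.stats.len_max"):
    print(f"{path} = {lookup(sample_findings, path)}")
```

A path that does not exist raises `KeyError`, which is usually preferable to silently returning a default when checking a summary against its cited stats.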

Out[4]:

saturn.schema() · 2 columns

column	kind	n	null%	unique	alerts
gloss	text	2,000	0.0%	2,000	near_unique one_word short_text
instances	text	2,000	0.0%	2,000	near_unique vocab_skipped url_heavy

Fig 1.

gloss · Confirm that gloss entries are uniformly short word-labels (mean ~6 characters, max 16).

Character-length distribution for gloss (mean: 6.0075).
chars	count
1	21
2	13
3	139
4	377
5	376
6	337
7	285
8	189
9	127
10	66
11	35
12	19
13	10
14	3
15	2
16	1
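Because gloss lengths are small integers, a per-length tally is a natural way to rebuild this distribution. A minimal sketch, assuming the glosses are available as a list of strings; `length_counts` is a hypothetical helper and the sample values are made up for illustration, not rows from the dataset:

```python
from collections import Counter


def length_counts(values):
    """Return {character_length: count}, ordered by length."""
    counts = Counter(len(v) for v in values)
    return dict(sorted(counts.items()))


# Made-up glosses for illustration only.
sample = ["up", "book", "take", "hearing", "a", "last"]
print(length_counts(sample))
```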

Fig 2.

instances · Inspect the wide spread of payload sizes from 1,590 up to 9,982 characters to gauge per-word video coverage.

Character-length distribution for instances (mean: 2731.8595).
chars	count
1590 – 1800	212
1800 – 2010	232
2010 – 2219	269
2219 – 2429	236
2429 – 2639	201
2639 – 2849	131
2849 – 3059	118
3059 – 3268	131
3268 – 3478	106
3478 – 3688	93
3688 – 3898	60
3898 – 4108	44
4108 – 4317	35
4317 – 4527	23
4527 – 4737	30
4737 – 4947	18
4947 – 5157	17
5157 – 5366	15
5366 – 5576	10
5576 – 5786	7
5786 – 5996	2
5996 – 6206	1
6206 – 6415	1
6415 – 6625	2
6625 – 6835	2
6835 – 7045	1
7045 – 7255	0
7255 – 7464	0
7464 – 7674	0
7674 – 7884	0
7884 – 8094	1
8094 – 8304	0
8304 – 8513	0
8513 – 8723	0
8723 – 8933	0
8933 – 9143	1
9143 – 9353	0
9353 – 9562	0
9562 – 9772	0
9772 – 9982	1
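The bin boundaries in this table are consistent with 40 equal-width bins between the minimum (1,590) and maximum (9,982) lengths, with edges rounded for display: (9,982 − 1,590) / 40 = 209.8. That is an inference from the printed edges, not a documented Saturn behaviour. A minimal sketch under that assumption, with hypothetical helper names:

```python
def bin_edges(lo, hi, n_bins=40):
    """Equal-width bin edges between lo and hi, rounded for display."""
    width = (hi - lo) / n_bins
    return [round(lo + i * width) for i in range(n_bins + 1)]


def histogram(lengths, lo, hi, n_bins=40):
    """Count lengths per equal-width bin; the maximum value falls in the last bin."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for n in lengths:
        idx = min(int((n - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


edges = bin_edges(1590, 9982)
print(edges[:4], edges[-1])
```

The first few edges printed by this sketch can be compared directly with the first rows of the table.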

Fig 3.

gloss · Show the few word tokens that recur across multi-word glosses (e.g., 'up' appears 7 times). The glosses themselves are all unique.

Fig 4.

Per-column null rate across the corpus. Columns are ordered by input position.

column	kind	null %
gloss	text	0.0%
instances	text	0.0%
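A per-column null rate like the one above can be sketched as follows. This is an illustration only: it assumes rows are dictionaries keyed by column name and, as a simplification, counts both missing values and blank strings as null, whereas Saturn reports nulls and empty strings as separate stats. The helper name and sample rows are hypothetical.

```python
def null_rates(rows, columns):
    """Fraction of rows per column whose value is missing or blank."""
    total = len(rows)
    rates = {}
    for col in columns:
        missing = sum(
            1 for row in rows
            if row.get(col) is None or str(row[col]).strip() == ""
        )
        rates[col] = missing / total if total else 0.0
    return rates


# Made-up rows for illustration only.
sample_rows = [
    {"gloss": "book", "instances": "https://example.com/a.mp4"},
    {"gloss": "", "instances": "https://example.com/b.mp4"},
]
print(null_rates(sample_rows, ["gloss", "instances"]))
```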

gloss text label

The `gloss` column holds short text labels: 97.75% are single words, with a mean length of 6 characters and a maximum of 16. All 2,000 values are unique, so the column behaves like a per-row term or key rather than a categorical feature. The vocab_size of 1,984 is slightly below the row count, which suggests that a few word tokens recur across multi-word glosses. Top words such as 'up', 'hearing', 'last', and 'take' point to English lexical entries, and the column contains no duplicate values, URLs, or emoji.

Treatment: Treat as a per-row text label; tokenize or embed if used as a feature, otherwise keep as join key to a lexicon.

anthropic:claude-opus-4-7 · confidence high

Out[10]:

saturn.columns["gloss"].stats

stat	value
n	2,000
nulls	0 (0.0%)
unique	2,000
len_min	1
len_max	16
len_mean	6.008
len_median	6
len_p95	10
word_mean	1.024
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	1,984
readability_flesch_mean	54.58
emoji_rate	0
url_rate	0
one_word_rate	0.9775
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	97.8% rows are a single word
alert: short_text	95th-percentile length under 20 chars
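The word-level stats fit together arithmetically: 2,000 rows × a word_mean of 1.024 gives about 2,048 word tokens, against a vocab_size of 1,984 distinct ones, so a small number of tokens must repeat. A minimal sketch of how such stats could be computed, assuming whitespace tokenization and lowercasing; Saturn's exact definitions are not stated in this report, and `word_stats` is a hypothetical helper:

```python
def word_stats(values):
    """Compute one_word_rate, word_mean and vocab_size from a list of strings."""
    token_lists = [v.lower().split() for v in values]
    n = len(token_lists)
    total_tokens = sum(len(tokens) for tokens in token_lists)
    return {
        "one_word_rate": sum(1 for tokens in token_lists if len(tokens) == 1) / n,
        "word_mean": total_tokens / n,
        "vocab_size": len({tok for tokens in token_lists for tok in tokens}),
    }


# Made-up glosses for illustration only.
sample = ["up", "look up", "give up", "hearing"]
stats = word_stats(sample)
print(stats)
```

In the sample, every string is unique while the token 'up' appears three times, which mirrors how a column can be 100% unique yet have a vocabulary smaller than its token count.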

Fig 5.

Character-length distribution for gloss.

instances text free_text

All 2,000 values are unique and every one contains a URL (url_rate 1.0). Lengths range from 1,590 to 9,982 characters (mean 2,731.86), with roughly 263 words per value on average. Saturn classes the column as free text, but the uniform URL presence and long values fit the summary's reading of a JSON-like payload of video references at least as well as scraped pages or annotated prose. Vocabulary expansion was skipped because of the row length, and the mean Flesch score of 54.82 should be read with caution, since readability formulas assume natural-language sentences.

Treatment: Inspect a few raw values to confirm the format. If they are prose, strip URLs and tokenize/embed the remainder before modelling; if they are serialized records, parse them and extract the URLs instead. Either way, do not treat the column as a categorical feature.
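A minimal sketch of the URL handling mentioned in the treatment, using only the standard library. The regular expression is a deliberately simple assumption rather than a full URL grammar, the helper names are hypothetical, and the JSON attempt reflects the summary's guess that values may be JSON-like, which this report does not confirm:

```python
import json
import re

# Simple pattern: scheme, then everything up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def split_payload(text):
    """Return (urls, remainder), where remainder is the text with URLs removed."""
    urls = URL_RE.findall(text)
    remainder = " ".join(URL_RE.sub(" ", text).split())
    return urls, remainder


def try_parse_json(text):
    """Return the parsed value if text is valid JSON, otherwise None."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Made-up value for illustration only.
sample = "clip one https://example.com/a.mp4 and clip two https://example.com/b.mp4"
urls, remainder = split_payload(sample)
print(urls)
print(remainder)
```

Trying `try_parse_json` on a handful of raw values first would show whether the column needs parsing or prose-style cleaning.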

anthropic:claude-opus-4-7 · confidence medium

Out[13]:

saturn.columns["instances"].stats

stat	value
n	2,000
nulls	0 (0.0%)
unique	2,000
len_min	1,590
len_max	9,982
len_mean	2,732
len_median	2,546
len_p95	4,575
word_mean	263.5
word_median	250
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	0
readability_flesch_mean	54.82
emoji_rate	0
url_rate	1
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: vocab_skipped	avg row 2731 chars — vocab expansion would be too costly
alert: url_heavy	100.0% rows contain a URL

Fig 6.

Character-length distribution for instances.

How to cite

BibTeX

@misc{saturn-accessibility-wlasl-index-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: accessibility wlasl index},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/accessibility-wlasl_index}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: accessibility wlasl index. Source: /home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/accessibility-wlasl_index