accessibility wlasl index

source: /home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv · 2,000 rows · 2 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 2,000 rows with two text columns: 'gloss' and 'instances'. The 'gloss' column is essentially a vocabulary list: every entry is unique, 97.75% are single words, and the mean length is just 6 characters, suggesting these are sign-language word labels. The 'instances' column is very different: every value is unique and contains at least one URL (url_rate 100%), and values average 2,731 characters with a max of 9,982, indicating each row holds a JSON-like or URL-laden payload of video references. The most useful first look is the length distribution of 'instances', to understand payload variability, alongside a check that 'gloss' behaves like a clean lexicon.

citing: row_count · column_count · columns.gloss.n_unique · columns.gloss.stats.one_word_rate · columns.gloss.stats.len_mean · columns.gloss.stats.word_mean · columns.instances.stats.url_rate · columns.instances.stats.len_mean · columns.instances.stats.len_max · columns.instances.stats.word_mean
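
The statistics cited above can be recomputed directly from the raw column values. The following is a minimal sketch, not the profiler's actual implementation: the function name `profile_text_column`, the URL pattern, and the whitespace-based word split are all assumptions made for illustration.

```python
import re
from statistics import mean

# Assumed URL pattern; the profiler's real url_rate rule is not documented here.
URL_RE = re.compile(r"https?://\S+")

def profile_text_column(values):
    """Return a few profile statistics for a list of strings."""
    n = len(values)
    lengths = [len(v) for v in values]
    # Words are approximated by splitting on whitespace.
    word_counts = [len(v.split()) for v in values]
    return {
        "n": n,
        "n_unique": len(set(values)),
        "len_mean": mean(lengths),
        "len_max": max(lengths),
        "word_mean": mean(word_counts),
        "one_word_rate": sum(c == 1 for c in word_counts) / n,
        "url_rate": sum(bool(URL_RE.search(v)) for v in values) / n,
    }

# Hypothetical sample values, not rows from wlasl_index.csv.
sample = ["book", "thank you", "drink", "https://example.com/a.mp4"]
stats = profile_text_column(sample)
```

To run this against the real file, each column could be read with `csv.DictReader` and passed to the function as a list of strings.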

Charts the summary said to look at first

gloss · Confirm that gloss entries are uniformly short word-labels (mean ~6 characters, max 16).

Character-length distribution for gloss (mean: 6.0075).
chars	count
1	21
2	13
3	139
4	377
5	376
6	337
7	285
8	189
9	127
10	66
11	35
12	19
13	10
14	3
15	2
16	1
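
As a consistency check, the non-zero counts in the table above can be used to recompute the row total and the reported mean length. This is a small sketch; the helper names are illustrative and the counts are copied from the table.

```python
from collections import Counter

def length_counts(values):
    """Count how many strings have each exact character length."""
    counts = Counter(len(v) for v in values)
    return dict(sorted(counts.items()))

def mean_length_from_counts(counts):
    """Weighted mean of lengths, given a {length: count} mapping."""
    total = sum(counts.values())
    return sum(length * c for length, c in counts.items()) / total

# Non-zero rows of the gloss table: character length -> number of glosses.
gloss_counts = {
    1: 21, 2: 13, 3: 139, 4: 377, 5: 376, 6: 337, 7: 285, 8: 189,
    9: 127, 10: 66, 11: 35, 12: 19, 13: 10, 14: 3, 15: 2, 16: 1,
}

total_rows = sum(gloss_counts.values())
mean_len = mean_length_from_counts(gloss_counts)
```

The counts sum to 2,000 rows and give a mean of 12,015 / 2,000 = 6.0075, which agrees with the figure in the caption.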

instances · Inspect the wide spread of payload sizes from 1,590 up to 9,982 characters to gauge per-word video coverage.

Character-length distribution for instances (mean: 2731.8595).
chars	count
1590 – 1800	212
1800 – 2010	232
2010 – 2219	269
2219 – 2429	236
2429 – 2639	201
2639 – 2849	131
2849 – 3059	118
3059 – 3268	131
3268 – 3478	106
3478 – 3688	93
3688 – 3898	60
3898 – 4108	44
4108 – 4317	35
4317 – 4527	23
4527 – 4737	30
4737 – 4947	18
4947 – 5157	17
5157 – 5366	15
5366 – 5576	10
5576 – 5786	7
5786 – 5996	2
5996 – 6206	1
6206 – 6415	1
6415 – 6625	2
6625 – 6835	2
6835 – 7045	1
7045 – 7255	0
7255 – 7464	0
7464 – 7674	0
7674 – 7884	0
7884 – 8094	1
8094 – 8304	0
8304 – 8513	0
8513 – 8723	0
8723 – 8933	0
8933 – 9143	1
9143 – 9353	0
9353 – 9562	0
9562 – 9772	0
9772 – 9982	1
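
The bin edges in this table look like 40 equal-width bins spanning len_min to len_max, with edges rounded for display. That scheme is inferred from the table rather than stated by the report, so the sketch below is an assumption; the function names are illustrative.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return the n_bins + 1 edges of equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def bin_index(value, lo, hi, n_bins):
    """Return the zero-based bin that holds value; the last bin includes hi."""
    if value == hi:
        return n_bins - 1
    width = (hi - lo) / n_bins
    return int((value - lo) / width)

# len_min and len_max for the instances column; 40 is the number of table rows.
edges = equal_width_edges(1590, 9982, 40)
rounded = [round(e) for e in edges]

# The median length (2,546) falls in the fifth bin, 2429 to 2639.
median_bin = bin_index(2546, 1590, 9982, 40)
```

Each bin is (9982 - 1590) / 40 = 209.8 characters wide, which is why the displayed widths alternate between 209 and 210 after rounding.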

gloss · Show the few words that recur across glosses (e.g., 'up' appears 7 times) to spot any non-unique tokens.
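
A gloss column can have fully unique rows while individual word tokens still recur inside multi-word glosses. The sketch below counts such tokens; the sample glosses are hypothetical and the lowercase, whitespace-based tokenization is an assumption about how the profiler counts words.

```python
from collections import Counter

def recurring_tokens(glosses, min_count=2):
    """Return (token, count) pairs for tokens seen at least min_count times."""
    tokens = Counter(tok for g in glosses for tok in g.lower().split())
    return [(t, c) for t, c in tokens.most_common() if c >= min_count]

# Hypothetical glosses: every row is unique, yet 'up' and 'last' recur.
sample = ["up", "wake up", "give up", "last", "last year", "book"]
top = recurring_tokens(sample)
```

Here all six rows are distinct, but only seven distinct tokens appear among nine, which is the same effect as a vocabulary size slightly below the total token count.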

Schema

2 columns

Per-column summary.
Column	Type	Nulls	Unique	Alerts
gloss	text	0.0%	2,000	near_unique one_word short_text
instances	text	0.0%	2,000	near_unique vocab_skipped url_heavy

gloss

text label near_unique one_word short_text

The `gloss` column holds short text labels: 97.75% are single words, with a mean length of 6 characters and a max of 16. Every one of the 2,000 rows is unique, so this behaves like a per-row gloss or term rather than a categorical feature. The vocab_size of 1,984 appears to count distinct word tokens rather than rows, which is why it can be lower than the row count when a token recurs inside multi-word glosses. Top words such as 'up', 'hearing', 'last', and 'take' suggest English lexical entries, and there are no duplicate rows, URLs, or emoji. Treatment: Treat as a per-row text label; tokenize or embed if used as a feature, otherwise keep as a join key to a lexicon. high · anthropic:claude-opus-4-7

n: 2,000
nulls: 0 (0.0%)
unique: 2,000
len_min: 1
len_max: 16
len_mean: 6.008
len_median: 6
len_p95: 10
word_mean: 1.024
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,984
readability_flesch_mean: 54.58
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9775
allcaps_rate: 0
boilerplate_rate: 0

instances

text free_text near_unique vocab_skipped url_heavy

Every one of the 2,000 rows is unique and contains a URL (url_rate 1.0), with lengths ranging from 1,590 to 9,982 characters (mean 2,731.86) and roughly 263 words on average. The profiler types this as free text, but the dataset summary reads it as a JSON-like payload of video references, and a URL in every single row fits that reading better than scraped pages or prose. Vocabulary was skipped, and the Flesch score of 54.8 is probably not meaningful if the values are structured data rather than sentences. Treatment: Inspect a few raw values first; if they are structured, parse them and extract the URLs and per-instance fields rather than tokenizing them as prose; do not treat as a categorical feature. medium · anthropic:claude-opus-4-7

n: 2,000
nulls: 0 (0.0%)
unique: 2,000
len_min: 1,590
len_max: 9,982
len_mean: 2,732
len_median: 2,546
len_p95: 4,575
word_mean: 263.5
word_median: 250
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 0
readability_flesch_mean: 54.82
emoji_rate: 0
url_rate: 1
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
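
Because every `instances` value contains at least one URL, a reasonable first step is to pull the URLs out of each payload. The sketch below tries JSON first and falls back to a regular expression. The payload layout, the `url` and `video_id` keys in the samples, and the URL pattern are assumptions, since the report does not show a raw value.

```python
import json
import re

# Assumed URL pattern: stops at whitespace, quotes, brackets, braces and commas,
# so a URL that legitimately contains a comma would be truncated.
URL_RE = re.compile(r"""https?://[^\s"'<>\[\],}]+""")

def _walk(node):
    """Yield URLs found in strings anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)
    elif isinstance(node, str):
        yield from URL_RE.findall(node)

def extract_urls(payload):
    """Return the URLs in one payload, in order of appearance."""
    try:
        parsed = json.loads(payload)
    except ValueError:
        # Not valid JSON (for example a Python-style repr): scan the raw text.
        return URL_RE.findall(payload)
    return list(_walk(parsed))

# Hypothetical payloads, not rows from wlasl_index.csv.
sample_json = ('[{"video_id": "001", "url": "https://example.com/a.mp4"}, '
               '{"video_id": "002", "url": "https://example.com/b.mp4"}]')
sample_repr = "[{'url': 'https://example.com/c.mp4'}]"

urls_json = extract_urls(sample_json)
urls_repr = extract_urls(sample_repr)
```

The number of URLs per row is then a more direct measure of per-word video coverage than character length.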