accessibility wlasl index

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv

Saturn profiled 2,000 rows across 2 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

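[2]:
# Install the profiler (notebook shell escape), then run it on the source CSV.
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv",
    "--findings", "accessibility-wlasl_index.json",  # findings file named for this dataset
    "--llm", "anthropic:claude-opus-4-7",            # model for the opt-in prose interpretation
])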

Summary confidence: high

This dataset contains 2,000 rows and two text columns, 'gloss' and 'instances'. The 'gloss' column is essentially a vocabulary list: every entry is unique, 97.75% are single words, and the mean length is about 6 characters, which suggests sign-language word labels. The 'instances' column is very different: every value is unique, every row contains a URL (url_rate 1.0), and lengths average about 2,732 characters with a maximum of 9,982, which suggests each row holds a JSON-like or URL-laden payload of video references. A useful first look is the length distribution of 'instances', to understand payload variability, alongside a check that 'gloss' behaves like a clean lexicon.

citing: row_count · column_count · columns.gloss.n_unique · columns.gloss.stats.one_word_rate · columns.gloss.stats.len_mean · columns.gloss.stats.word_mean · columns.instances.stats.url_rate · columns.instances.stats.len_mean · columns.instances.stats.len_max · columns.instances.stats.word_mean
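
The `citing:` keys above read like dotted paths into the findings file written by `--findings`. A minimal sketch of resolving such paths, assuming the file is nested JSON whose keys mirror them; that layout is inferred from the key names and is not confirmed by the report, and `resolve` is a hypothetical helper:

```python
import json
from functools import reduce

def resolve(findings: dict, dotted_path: str):
    """Walk a nested dict along a dotted path such as 'columns.gloss.stats.len_mean'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)

# Stand-in for the contents of accessibility-wlasl_index.json. The nesting is an
# assumption; the values are copied from this report.
findings = json.loads("""
{
  "row_count": 2000,
  "column_count": 2,
  "columns": {
    "gloss": {"n_unique": 2000, "stats": {"one_word_rate": 0.9775, "len_mean": 6.0075}},
    "instances": {"stats": {"url_rate": 1.0, "len_max": 9982}}
  }
}
""")

for path in ("row_count", "columns.gloss.stats.one_word_rate", "columns.instances.stats.len_max"):
    print(path, "=", resolve(findings, path))
```

A missing key raises `KeyError`, which is probably the desired behaviour when checking that a cited stat really exists.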

Out[4]:

saturn.schema() · 2 columns

column kind n null% unique alerts
gloss text 2,000 0.0% 2,000 near_unique one_word short_text
instances text 2,000 0.0% 2,000 near_unique vocab_skipped url_heavy
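
The `n`, `null%`, and `unique` columns of the schema table can be recomputed with the standard library. This is a sketch only: it treats an empty string as null, which is an assumption about how the profiler counts nulls, and the two sample rows are made up.

```python
import csv
import io

def schema_summary(rows: list[dict], columns: list[str]) -> dict:
    """Per column: row count, null percentage, and number of distinct values."""
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        nulls = sum(1 for v in values if v is None or v == "")  # assumption: empty == null
        summary[col] = {
            "n": len(values),
            "null_pct": 100.0 * nulls / len(values) if values else 0.0,
            "unique": len(set(values)),
        }
    return summary

# Made-up sample with the same two column names; replace with the real CSV to profile it.
sample = io.StringIO(
    "gloss,instances\n"
    "book,https://example.com/a.mp4\n"
    "drink,https://example.com/b.mp4\n"
)
rows = list(csv.DictReader(sample))
summary = schema_summary(rows, ["gloss", "instances"])
print(summary)
```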
Fig 1.
gloss · Confirm that gloss entries are uniformly short word-labels (mean ~6 characters, max 16).
Character-length distribution for gloss (mean: 6.0075). Empty histogram bins between integer lengths are omitted.
chars count
1     21
2     13
3     139
4     377
5     376
6     337
7     285
8     189
9     127
10    66
11    35
12    19
13    10
14    3
15    2
16    1
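
As a consistency check, the reported mean can be recovered from the non-empty bins of the table above. The sketch below only redoes the arithmetic on those counts; it does not read the CSV.

```python
# Non-empty bins from the Fig 1 table: gloss length in characters -> row count.
length_counts = {
    1: 21, 2: 13, 3: 139, 4: 377, 5: 376, 6: 337, 7: 285, 8: 189,
    9: 127, 10: 66, 11: 35, 12: 19, 13: 10, 14: 3, 15: 2, 16: 1,
}

total_rows = sum(length_counts.values())
total_chars = sum(length * count for length, count in length_counts.items())
mean_length = total_chars / total_rows

# Share of glosses that are at most 10 characters long.
share_le_10 = sum(c for length, c in length_counts.items() if length <= 10) / total_rows

print(total_rows, total_chars, mean_length)  # 2000 12015 6.0075
print(share_le_10)                           # 0.965
```

The counts sum to the 2,000 profiled rows, and 96.5% of glosses are 10 characters or shorter, which agrees with the reported `len_p95` of 10.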
Fig 2.
instances · Inspect the wide spread of payload sizes from 1,590 up to 9,982 characters to gauge per-word video coverage.
Character-length distribution for instances (mean: 2731.8595).
chars        count
1590 – 1800  212
1800 – 2010  232
2010 – 2219  269
2219 – 2429  236
2429 – 2639  201
2639 – 2849  131
2849 – 3059  118
3059 – 3268  131
3268 – 3478  106
3478 – 3688  93
3688 – 3898  60
3898 – 4108  44
4108 – 4317  35
4317 – 4527  23
4527 – 4737  30
4737 – 4947  18
4947 – 5157  17
5157 – 5366  15
5366 – 5576  10
5576 – 5786  7
5786 – 5996  2
5996 – 6206  1
6206 – 6415  1
6415 – 6625  2
6625 – 6835  2
6835 – 7045  1
7045 – 7255  0
7255 – 7464  0
7464 – 7674  0
7674 – 7884  0
7884 – 8094  1
8094 – 8304  0
8304 – 8513  0
8513 – 8723  0
8723 – 8933  0
8933 – 9143  1
9143 – 9353  0
9353 – 9562  0
9562 – 9772  0
9772 – 9982  1
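
To gauge how heavy the right tail is, the bin counts above can be accumulated. This is a sketch over the table's numbers only; `share_up_to` is a hypothetical helper, and it resolves shares at bin edges rather than exact character counts.

```python
from itertools import accumulate

# (upper bin edge in characters, row count) for the 40 bins of the Fig 2 table.
bins = [
    (1800, 212), (2010, 232), (2219, 269), (2429, 236), (2639, 201),
    (2849, 131), (3059, 118), (3268, 131), (3478, 106), (3688, 93),
    (3898, 60), (4108, 44), (4317, 35), (4527, 23), (4737, 30),
    (4947, 18), (5157, 17), (5366, 15), (5576, 10), (5786, 7),
    (5996, 2), (6206, 1), (6415, 1), (6625, 2), (6835, 2),
    (7045, 1), (7255, 0), (7464, 0), (7674, 0), (7884, 0),
    (8094, 1), (8304, 0), (8513, 0), (8723, 0), (8933, 0),
    (9143, 1), (9353, 0), (9562, 0), (9772, 0), (9982, 1),
]

total = sum(count for _, count in bins)
running = list(accumulate(count for _, count in bins))

def share_up_to(edge: int) -> float:
    """Fraction of rows in bins whose upper edge is at or below `edge`."""
    covered = [cum for (upper, _), cum in zip(bins, running) if upper <= edge]
    return covered[-1] / total if covered else 0.0

tail_rows = sum(count for upper, count in bins if upper > 5996)

print(total)              # 2000
print(share_up_to(3478))  # 0.818
print(tail_rows)          # 10
```

About 82% of payloads are under 3,478 characters, and only 10 rows exceed 5,996, so the maximum of 9,982 is an outlier rather than a typical size.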
Fig 3.
gloss · Show the few words that recur across glosses (e.g., 'up' appears 7 times) to spot any non-unique tokens.
Character-length distribution for gloss (mean: 6.0075); the data table is identical to the one under Fig 1.
Fig 4.
Per-column null rate across the corpus. Columns are ordered by input position.
column kind null %
gloss text 0.0%
instances text 0.0%

gloss text label

The `gloss` column holds short text labels: 97.75% are single words, with a mean length of about 6 characters and a maximum of 16. All 2,000 rows are unique (n_duplicates 0), so this behaves like a per-row gloss or term rather than a categorical feature. The vocab_size of 1,984 is lower than the row count, which is consistent with a few words such as 'up' recurring inside multi-word glosses. Top words like 'up', 'hearing', 'last', and 'take' suggest English lexical entries, and the column contains no URLs or emoji.

Treatment: Treat as a per-row text label; tokenize or embed if used as a feature, otherwise keep as join key to a lexicon.
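
A minimal sketch of the join-key option, assuming a lookup keyed by normalised gloss. The `lexicon` dictionary, its fields, and the trim-and-lowercase normalisation are all hypothetical and not part of the dataset.

```python
def normalize(gloss: str) -> str:
    """Normalise a gloss for matching: trim whitespace and lowercase (an assumption)."""
    return gloss.strip().lower()

# Hypothetical lexicon keyed by normalised gloss; the fields are illustrative only.
lexicon = {
    "up": {"part_of_speech": "adverb"},
    "take": {"part_of_speech": "verb"},
}

glosses = ["up", "Take ", "hearing"]

# Left join: keep every gloss, attach the lexicon entry when one exists.
joined = [(g, lexicon.get(normalize(g))) for g in glosses]
unmatched = [g for g, entry in joined if entry is None]

print(joined)
print(unmatched)  # ['hearing']
```

Because every gloss is unique, the join is one-to-one on the dataset side; listing the unmatched glosses shows how much of the lexicon is covered.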

anthropic:claude-opus-4-7 · confidence high
Out[10]:

saturn.columns["gloss"].stats

stat value
n 2,000
nulls 0 (0.0%)
unique 2,000
len_min 1
len_max 16
len_mean 6.008
len_median 6
len_p95 10
word_mean 1.024
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 1,984
readability_flesch_mean 54.58
emoji_rate 0
url_rate 0
one_word_rate 0.9775
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 97.8% rows are a single word
alert: short_text 95th-percentile length under 20 chars
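
Several of these stats are simple to approximate. The sketch below shows one plausible set of definitions (whitespace tokenisation, lowercased vocabulary, nearest-rank percentile); these are assumptions, and saturn-dissect may define them differently. The sample rows are made up.

```python
import math

def text_stats(values: list[str]) -> dict:
    """Approximate length, word, and vocabulary stats for a non-empty text column."""
    lengths = sorted(len(v) for v in values)
    word_counts = [len(v.split()) for v in values]
    vocab = {w.lower() for v in values for w in v.split()}
    # Nearest-rank 95th percentile; a profiler may interpolate instead.
    p95_index = max(0, math.ceil(0.95 * len(lengths)) - 1)
    return {
        "n": len(values),
        "unique": len(set(values)),
        "len_min": lengths[0],
        "len_max": lengths[-1],
        "len_mean": sum(lengths) / len(lengths),
        "len_p95": lengths[p95_index],
        "word_mean": sum(word_counts) / len(values),
        "one_word_rate": sum(1 for c in word_counts if c == 1) / len(values),
        "vocab_size": len(vocab),
    }

sample = ["up", "take", "hearing", "last", "take up"]  # made-up rows, not from the CSV
stats = text_stats(sample)
print(stats)
```

In this sample all five rows are unique but the vocabulary holds only four words, because 'take' and 'up' recur inside 'take up'. The same effect would let `vocab_size` fall below `unique` in the table above.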
Fig 5.
Character-length distribution for gloss (mean: 6.0075); the data table is identical to the one under Fig 1.

instances text free_text

Every one of the 2,000 rows is unique and contains a URL (url_rate 1.0), with lengths ranging from 1,590 to 9,982 characters (mean 2,731.86) and roughly 263 words on average. This looks like a free-text field holding long, link-bearing instance documents, possibly scraped pages or annotated examples, rather than a categorical feature. Vocabulary expansion was skipped because of the row length, and the mean Flesch readability is 54.8. That score would indicate moderately complex prose, but it is hard to interpret if the payload is structured, as the Overview's reading of JSON-like video references suggests.

Treatment: Strip URLs and tokenize/embed the remaining prose before modelling; do not treat as a categorical feature.
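
A minimal sketch of the URL-stripping step, using a deliberately simple regular expression. The pattern and the `split_urls` helper are assumptions for illustration; the payload is made up and much shorter than a real `instances` value.

```python
import re

# Simple pattern for http(s) URLs; it stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\])}]+""")

def split_urls(text: str) -> tuple[list[str], str]:
    """Return (URLs found, text with URLs removed and whitespace collapsed)."""
    urls = URL_PATTERN.findall(text)
    stripped = URL_PATTERN.sub(" ", text)
    return urls, " ".join(stripped.split())

# Made-up payload for illustration only.
payload = "first https://example.com/a.mp4 then https://example.com/b.mp4 end"
urls, remainder = split_urls(payload)

print(urls)       # ['https://example.com/a.mp4', 'https://example.com/b.mp4']
print(remainder)  # first then end
```

Keeping the extracted URLs next to the stripped text preserves the video references, which may matter more than the remaining text if the payload turns out to be structured.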

anthropic:claude-opus-4-7 · confidence medium
Out[13]:

saturn.columns["instances"].stats

stat value
n 2,000
nulls 0 (0.0%)
unique 2,000
len_min 1,590
len_max 9,982
len_mean 2,732
len_median 2,546
len_p95 4,575
word_mean 263.5
word_median 250
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 0
readability_flesch_mean 54.82
emoji_rate 0
url_rate 1
one_word_rate 0
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: vocab_skipped avg row 2,731 chars; vocab expansion would be too costly
alert: url_heavy 100.0% rows contain a URL
Fig 6.
Character-length distribution for instances (mean: 2731.8595); the data table is identical to the one under Fig 2.

How to cite

BibTeX
@misc{saturn-accessibility-wlasl-index-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: accessibility wlasl index},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/accessibility-wlasl_index}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: accessibility wlasl index. Source: /home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/accessibility-wlasl_index