saturn·

accessibility wlasl index

source /home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv 2,000 rows 2 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 2,000 rows and two text columns, 'gloss' and 'instances'. The 'gloss' column is essentially a vocabulary list: every entry is unique, 97.75% are single words, and the mean length is about 6 characters, which suggests sign-language word labels. The 'instances' column is very different: every value is unique, every value contains a URL (url_rate 100%), and lengths average about 2,732 characters with a maximum of 9,982, which indicates that each row holds a JSON-like or URL-laden payload of video references. The most useful first look is the length distribution of 'instances', to understand payload variability, alongside a check that 'gloss' behaves like a clean lexicon.

citing: row_count · column_count · columns.gloss.n_unique · columns.gloss.stats.one_word_rate · columns.gloss.stats.len_mean · columns.gloss.stats.word_mean · columns.instances.stats.url_rate · columns.instances.stats.len_mean · columns.instances.stats.len_max · columns.instances.stats.word_mean
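The suggested first look, the length distribution of 'instances', can be sketched with the standard library alone. This is a minimal sketch, not the profiler's own code: the helper names `length_profile` and `profile_csv` are hypothetical, the percentile uses the nearest-rank method (the report's method is not stated), and the in-memory sample stands in for the real CSV with made-up values.

```python
import csv
import io
import statistics


def length_profile(values):
    """Return basic character-length statistics for a list of strings."""
    lengths = sorted(len(v) for v in values)
    n = len(lengths)
    # Nearest-rank 95th percentile (an assumption; the report's method is unknown).
    p95_index = (95 * n + 99) // 100 - 1
    return {
        "n": n,
        "len_min": lengths[0],
        "len_max": lengths[-1],
        "len_mean": statistics.mean(lengths),
        "len_median": statistics.median(lengths),
        "len_p95": lengths[p95_index],
    }


def profile_csv(handle, column):
    """Profile one column of a CSV file-like object that has a header row."""
    reader = csv.DictReader(handle)
    return length_profile([row[column] for row in reader])


# Tiny in-memory stand-in for wlasl_index.csv (hypothetical values).
sample = io.StringIO(
    'gloss,instances\n'
    'book,"[{""url"": ""https://example.com/a.mp4""}]"\n'
    'drink,"[{""url"": ""https://example.com/b.mp4""}, '
    '{""url"": ""https://example.com/c.mp4""}]"\n'
)
profile = profile_csv(sample, "instances")
print(profile)
```

To run it against the real file, pass `open(path, newline="", encoding="utf-8")` as `handle`; the encoding is an assumption about the source file.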

Schema

2 columns
Per-column summary.
Alerts
gloss text 0.0% 2,000
near_unique one_word short_text
instances text 0.0% 2,000
near_unique vocab_skipped url_heavy

gloss

text label near_unique one_word short_text
The `gloss` column holds short text labels: 97.75% are single words, with a mean length of about 6 characters and a maximum of 16. All 2,000 values are distinct, so the column behaves like a per-row gloss or term rather than a categorical feature. The vocab_size of 1,984 is lower than the row count, presumably because it counts distinct word tokens and some multi-word glosses share words. Top words such as 'up', 'hearing', 'last', and 'take' suggest English lexical entries, and the column contains no duplicates, URLs, or emoji. Treatment: treat as a per-row text label; tokenize or embed it if used as a feature, otherwise keep it as a join key to a lexicon. high · anthropic:claude-opus-4-7
n
2,000
nulls
0 (0.0%)
unique
2,000
len_min
1
len_max
16
len_mean
6.008
len_median
6
len_p95
10
word_mean
1.024
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
1,984
readability_flesch_mean
54.58
emoji_rate
0
url_rate
0
one_word_rate
0.9775
allcaps_rate
0
boilerplate_rate
0
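
The gloss statistics above can be reproduced approximately with a few lines of Python. This is a sketch under stated assumptions: `lexicon_checks` is a hypothetical helper, tokens are split on whitespace and lowercased (the profiler's tokenizer is not documented here), and `n_duplicates` is taken to mean rows minus distinct values. The sample labels are invented to show how vocab_size can fall below the unique count.

```python
from collections import Counter


def lexicon_checks(glosses):
    """Summarize how closely a list of labels resembles a clean lexicon."""
    n = len(glosses)
    # Whitespace tokenization is an assumption, not the profiler's known method.
    tokens = [tok for g in glosses for tok in g.lower().split()]
    counts = Counter(glosses)
    return {
        "n": n,
        "unique": len(counts),
        # Assumed definition: rows minus distinct values.
        "n_duplicates": n - len(counts),
        "one_word_rate": sum(len(g.split()) == 1 for g in glosses) / n,
        "word_mean": len(tokens) / n,
        "vocab_size": len(set(tokens)),
    }


# Invented labels: four distinct glosses built from only three distinct words.
sample_glosses = ["ice cream", "ice", "cream", "book"]
checks = lexicon_checks(sample_glosses)
print(checks)
```

In the sample, every label is distinct, yet the vocabulary is smaller than the label count because "ice cream" reuses words that also appear as standalone glosses.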

instances

text free_text near_unique vocab_skipped url_heavy
All 2,000 values are distinct and every one contains a URL (url_rate 1.0). Lengths range from 1,590 to 9,982 characters (mean 2,731.86), with roughly 263 words on average. This looks like a free-text field holding long, link-bearing instance documents, possibly scraped pages or annotated examples, rather than a categorical feature. Vocabulary profiling was skipped, and the mean Flesch readability of 54.82 would suggest moderately complex prose interleaved with links, although that score is not meaningful if the values are serialized records, as the dataset summary suggests. Treatment: strip URLs and tokenize/embed the remaining text before modelling, or parse the payload first if it proves to be structured; do not treat as a categorical feature. medium · anthropic:claude-opus-4-7
n
2,000
nulls
0 (0.0%)
unique
2,000
len_min
1,590
len_max
9,982
len_mean
2,732
len_median
2,546
len_p95
4,575
word_mean
263.5
word_median
250
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
0
readability_flesch_mean
54.82
emoji_rate
0
url_rate
1
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0
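
Before choosing a treatment for 'instances', it may help to check whether each cell is structured data or prose. The sketch below tries to parse a cell as JSON, extracts URLs with a simple pattern, and returns the text with URLs removed. Everything here is illustrative: `inspect_payload` and `url_rate` are hypothetical helpers, the regular expression is a rough matcher rather than a full URL grammar, and the sample cells (including the `fps` key) are invented, not taken from the file.

```python
import json
import re

# Rough URL pattern for illustration; it is not a complete URL grammar.
URL_RE = re.compile(r'https?://[^\s"<>\[\]{},]+')


def inspect_payload(cell):
    """Report whether a cell parses as JSON and which URLs it contains."""
    try:
        parsed = json.loads(cell)
        is_json = True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed, is_json = None, False
    return {
        "is_json": is_json,
        # Record count only makes sense when the payload is a JSON list.
        "n_records": len(parsed) if isinstance(parsed, list) else None,
        "urls": URL_RE.findall(cell),
        "text_without_urls": URL_RE.sub(" ", cell),
    }


def url_rate(cells):
    """Fraction of cells that contain at least one URL."""
    return sum(bool(URL_RE.search(c)) for c in cells) / len(cells)


# Invented cells: one JSON-like payload, one prose-like value, one without links.
structured = '[{"url": "https://example.com/a.mp4", "fps": 25}]'
prose = "see https://example.com/b.mp4 for the clip"
info_structured = inspect_payload(structured)
info_prose = inspect_payload(prose)
print(info_structured["is_json"], info_structured["urls"])
print(info_prose["is_json"], info_prose["urls"])
print(url_rate([structured, prose, "no link here"]))
```

If most cells parse as JSON, working with the parsed records is likely more useful than tokenizing the raw string; if they do not, `text_without_urls` gives the input for the tokenize/embed route.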