accessibility wlasl index
Reading
This dataset contains 2,000 rows with two text columns, 'gloss' and 'instances'. The 'gloss' column is essentially a vocabulary list: every entry is unique, 97.75% are single words, and the mean length is just 6 characters, which suggests sign-language word labels. The 'instances' column is very different: every value is unique, every value contains a URL (url_rate 100%), and lengths average 2,732 characters with a maximum of 9,982, which indicates that each row holds a JSON-like or URL-laden payload of video references. The most useful first look is the length distribution of 'instances', to understand payload variability, together with a check that 'gloss' behaves like a clean lexicon.
citing: row_count · column_count · columns.gloss.n_unique · columns.gloss.stats.one_word_rate · columns.gloss.stats.len_mean · columns.gloss.stats.word_mean · columns.instances.stats.url_rate · columns.instances.stats.len_mean · columns.instances.stats.len_max · columns.instances.stats.word_mean
Charts the summary said to look at first
| gloss length (chars) | count |
|---|---|
| 1 | 21 |
| 2 | 13 |
| 3 | 139 |
| 4 | 377 |
| 5 | 376 |
| 6 | 337 |
| 7 | 285 |
| 8 | 189 |
| 9 | 127 |
| 10 | 66 |
| 11 | 35 |
| 12 | 19 |
| 13 | 10 |
| 14 | 3 |
| 15 | 2 |
| 16 | 1 |

| instances length (chars) | count |
|---|---|
| 1590 – 1800 | 212 |
| 1800 – 2010 | 232 |
| 2010 – 2219 | 269 |
| 2219 – 2429 | 236 |
| 2429 – 2639 | 201 |
| 2639 – 2849 | 131 |
| 2849 – 3059 | 118 |
| 3059 – 3268 | 131 |
| 3268 – 3478 | 106 |
| 3478 – 3688 | 93 |
| 3688 – 3898 | 60 |
| 3898 – 4108 | 44 |
| 4108 – 4317 | 35 |
| 4317 – 4527 | 23 |
| 4527 – 4737 | 30 |
| 4737 – 4947 | 18 |
| 4947 – 5157 | 17 |
| 5157 – 5366 | 15 |
| 5366 – 5576 | 10 |
| 5576 – 5786 | 7 |
| 5786 – 5996 | 2 |
| 5996 – 6206 | 1 |
| 6206 – 6415 | 1 |
| 6415 – 6625 | 2 |
| 6625 – 6835 | 2 |
| 6835 – 7045 | 1 |
| 7045 – 7255 | 0 |
| 7255 – 7464 | 0 |
| 7464 – 7674 | 0 |
| 7674 – 7884 | 0 |
| 7884 – 8094 | 1 |
| 8094 – 8304 | 0 |
| 8304 – 8513 | 0 |
| 8513 – 8723 | 0 |
| 8723 – 8933 | 0 |
| 8933 – 9143 | 1 |
| 9143 – 9353 | 0 |
| 9353 – 9562 | 0 |
| 9562 – 9772 | 0 |
| 9772 – 9982 | 1 |
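
The bin edges in the table above are consistent with 40 equal-width bins spanning len_min to len_max, that is a width of (9982 - 1590) / 40 = 209.8 characters, with edges rounded for display. The following is a minimal sketch of that binning, assuming this is how the table was built; the function names are illustrative and not part of any profiler's API.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return n_bins + 1 evenly spaced bin edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def histogram(values, lo, hi, n_bins):
    """Count values per equal-width bin; the top edge is inclusive."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == hi lands in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# len_min and len_max for 'instances' come from the report;
# the bin count of 40 is inferred from the number of table rows.
edges = equal_width_edges(1590, 9982, 40)
labels = [f"{round(a)} - {round(b)}" for a, b in zip(edges, edges[1:])]

# A few made-up lengths, only to show where boundary values fall.
counts = histogram([1590, 1799, 1800, 9982], 1590, 9982, 40)
```

Applied to integer lengths over a narrow range, the same scheme produces bins narrower than one character, which is why a histogram of short strings can show many empty bins and repeated labels.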
Schema
2 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| gloss | text | 0.0% | 2,000 | near_unique, one_word, short_text |
| instances | text | 0.0% | 2,000 | near_unique, vocab_skipped, url_heavy |
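
The alert tags in the schema can be read as simple rules over the per-column statistics. The sketch below shows one way such rules might look. Every threshold, and the rule for `vocab_skipped`, is an assumption made for illustration; the report does not state how the tags are actually assigned.

```python
def text_alerts(stats, near_unique_min=0.95, one_word_min=0.9,
                short_len_max=20, url_min=0.5):
    """Derive alert tags from column stats. All thresholds are assumed."""
    alerts = []
    if stats["unique"] / stats["n"] >= near_unique_min:
        alerts.append("near_unique")
    if stats["one_word_rate"] >= one_word_min:
        alerts.append("one_word")
    if stats["len_max"] <= short_len_max:
        alerts.append("short_text")
    # Assumption: a zero vocabulary on a non-empty column means it was skipped.
    if stats["vocab_size"] == 0 and stats["n_empty"] < stats["n"]:
        alerts.append("vocab_skipped")
    if stats["url_rate"] >= url_min:
        alerts.append("url_heavy")
    return alerts


# Values copied from the per-column stats in this report.
gloss = {"n": 2000, "unique": 2000, "one_word_rate": 0.9775,
         "len_max": 16, "vocab_size": 1984, "n_empty": 0, "url_rate": 0.0}
instances = {"n": 2000, "unique": 2000, "one_word_rate": 0.0,
             "len_max": 9982, "vocab_size": 0, "n_empty": 0, "url_rate": 1.0}

gloss_alerts = text_alerts(gloss)
instances_alerts = text_alerts(instances)
```

With these assumed thresholds the sketch reproduces the tags listed for both columns, which says only that the rules are plausible, not that they are the ones used.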
gloss
text label near_unique one_word short_text
The `gloss` column holds short text labels: 97.75% are single words, with a mean length of 6 characters and a maximum of 16. All 2,000 values are unique, so the column behaves like a per-row gloss or term rather than a categorical feature. (The vocab_size of 1,984 is a token count, so it need not equal the number of unique values.) Top words such as 'up', 'hearing', 'last', and 'take' suggest English lexical entries; there are no duplicates, URLs, or emoji. Treatment: treat as a per-row text label. Tokenize or embed it if used as a feature; otherwise keep it as a join key to a lexicon.
- n: 2,000
- nulls: 0 (0.0%)
- unique: 2,000
- len_min: 1
- len_max: 16
- len_mean: 6.008
- len_median: 6
- len_p95: 10
- word_mean: 1.024
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 1,984
- readability_flesch_mean: 54.58
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9775
- allcaps_rate: 0
- boilerplate_rate: 0
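
Several of the statistics listed for `gloss` can be recomputed with a few lines of standard-library Python. This is a sketch under stated assumptions: tokens are lowercased and split on whitespace, which may differ from the profiler's tokenizer, so `vocab_size` in particular may not match the report. The two multi-word entries in the sample are invented for illustration.

```python
import statistics


def text_stats(values):
    """Compute basic length, word, and vocabulary stats for a list of strings."""
    n = len(values)
    lengths = [len(v) for v in values]
    word_counts = [len(v.split()) for v in values]
    # Assumed tokenizer: lowercase, whitespace split.
    vocab = {tok.lower() for v in values for tok in v.split()}
    n_unique = len(set(values))
    return {
        "n": n,
        "unique": n_unique,
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_mean": statistics.mean(lengths),
        "len_median": statistics.median(lengths),
        "word_mean": statistics.mean(word_counts),
        "n_duplicates": n - n_unique,
        "vocab_size": len(vocab),
        "one_word_rate": sum(c == 1 for c in word_counts) / n,
    }


# The first four entries are top words from the report; the last two are made up.
sample = ["up", "hearing", "last", "take", "thank you", "take care"]
stats = text_stats(sample)
```

In the sample, "take" appears both alone and inside a two-word entry, so the vocabulary holds fewer tokens than the total token count even though every value is unique.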
instances
text free_text near_unique vocab_skipped url_heavy
All 2,000 values are unique and contain a URL (url_rate 1.0). Lengths range from 1,590 to 9,982 characters (mean 2,731.86), with roughly 263 words on average. The profiler classifies the column as free text holding long, link-bearing instance documents, possibly scraped pages or annotated examples; the summary above reads the same values as a JSON-like payload of video references. Either way, it is not a categorical feature. Vocabulary extraction was skipped, and readability sits at Flesch 54.8, which would indicate moderately complex prose only if the content is in fact prose. Treatment: strip URLs and tokenize or embed the remaining text before modelling, or parse the payload first if it proves to be structured; do not treat as a categorical feature.
- n: 2,000
- nulls: 0 (0.0%)
- unique: 2,000
- len_min: 1,590
- len_max: 9,982
- len_mean: 2,732
- len_median: 2,546
- len_p95: 4,575
- word_mean: 263.5
- word_median: 250
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 0
- readability_flesch_mean: 54.82
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
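
The summary describes the `instances` payload as JSON-like, while the column note treats it as prose with links. A hedged sketch that covers both readings: measure the share of values containing a URL, then try to parse each value as JSON and fall back to URL stripping. The URL pattern is a simple approximation, and the sample records, including the field names `video_id` and `url`, are hypothetical.

```python
import json
import re

# Approximate URL matcher: stops at whitespace, double quotes, and angle brackets.
URL_RE = re.compile(r'https?://[^\s"<>]+')


def url_rate(values):
    """Fraction of values that contain at least one URL."""
    return sum(bool(URL_RE.search(v)) for v in values) / len(values)


def parse_or_strip(value):
    """Return parsed JSON if the value is valid JSON, else the text with URLs removed."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return URL_RE.sub("", value).strip()


# Hypothetical values; the real payload layout is not shown in this report.
sample = [
    '[{"video_id": "00001", "url": "https://example.com/a.mp4"}]',
    "see https://example.com/b.mp4 for the clip",
    "no link here",
]

rate = url_rate(sample)
parsed = [parse_or_strip(v) for v in sample]
```

Trying the structured parse first keeps the URLs as fields when the payload is JSON, whereas stripping them up front would discard what is probably the most useful part of a video-reference record.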