saturn · wlasl index

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/wlasl_index.json

Saturn profiled 2,000 rows across 2 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

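[2]:
!pip install saturn-dissect
import subprocess

# Run the saturn CLI against the source file with the same arguments used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/wlasl_index.json",
    "--findings", "wlasl_index.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if saturn exits with a non-zero status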

Summary confidence: high

This dataset is a 2000-row index from a WLASL (Word-Level American Sign Language) source, with two columns: 'gloss' (text labels) and 'instances' (an unparsed/unknown field, likely nested data). The 'gloss' column is essentially a vocabulary list — every one of the 2000 rows is unique, 97.75% are single words, and the mean length is just 6 characters. The 'instances' column was skipped by the profiler and warrants manual inspection, since it likely contains the actual sign-language sample records keyed to each gloss. Start by looking at the gloss length distribution to confirm the single-word pattern, then dig into the structure of 'instances' separately.

citing: row_count · column_count · columns[gloss].n_unique · columns[gloss].stats.one_word_rate · columns[gloss].stats.len_mean · columns[gloss].stats.len_max · columns[gloss].stats.word_mean · columns[gloss].stats.vocab_size · columns[gloss].top_words · columns[instances].alerts

Out[4]:

saturn.schema() · 2 columns

column kind n null% unique alerts
gloss text 2,000 0.0% 2,000 near_unique one_word short_text
instances unknown 2,000 0.0% skipped
Fig 1.
gloss · Character-length distribution of glosses — expect a tight cluster around 6 with a max of 16.
Character-length distribution for gloss (mean: 6.0075).
chars count
1     21
2     13
3     139
4     377
5     376
6     337
7     285
8     189
9     127
10    66
11    35
12    19
13    10
14    3
15    2
16    1
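
The non-empty bins of the character-length table can be checked against the reported summary statistics. The sketch below recomputes the row total, mean, median, and 95th percentile from the bin counts. The nearest-rank percentile rule is an assumption; this report does not state which method saturn uses, and `nearest_rank` is a helper name introduced here for illustration.

```python
from math import ceil

# Non-empty bins from the Fig 1 data table: gloss length in characters -> row count.
LENGTH_COUNTS = {
    1: 21, 2: 13, 3: 139, 4: 377, 5: 376, 6: 337, 7: 285, 8: 189,
    9: 127, 10: 66, 11: 35, 12: 19, 13: 10, 14: 3, 15: 2, 16: 1,
}


def nearest_rank(counts, q):
    """Return the smallest length whose cumulative count reaches rank ceil(q * n).

    Nearest-rank is an assumed percentile rule, not necessarily saturn's.
    """
    n = sum(counts.values())
    target = ceil(q * n)
    running = 0
    for length in sorted(counts):
        running += counts[length]
        if running >= target:
            return length
    raise ValueError("empty histogram")


total = sum(LENGTH_COUNTS.values())
mean = sum(length * count for length, count in LENGTH_COUNTS.items()) / total
median = nearest_rank(LENGTH_COUNTS, 0.5)
p95 = nearest_rank(LENGTH_COUNTS, 0.95)
print(total, mean, median, p95)
```

Under these assumptions the bins sum to 2,000 rows with a mean of 6.0075, a median of 6, and a 95th percentile of 10, which agrees with the n, len_mean, len_median, and len_p95 values reported for gloss.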
Fig 2.
gloss · Top recurring tokens within glosses; note that even the most common word ('up') appears only 7 times, confirming near-unique labels.
Fig 3.
Per-column null rate across the corpus. Columns are ordered by input position.
column    kind    null %
gloss     text    0.0%
instances unknown 0.0%

gloss text label

This column holds short glosses—2000 rows, all unique, with 97.75% being a single word and a mean length of 6 characters (max 16). The vocabulary is 1984 distinct tokens across 2000 rows, so almost every entry is its own term, with only minor repeats like 'up' (7) or 'hearing' (3). It reads as a lexicon-style label field rather than free text.

Treatment: Treat as a high-cardinality categorical key; embed or hash rather than one-hot.

anthropic:claude-opus-4-7 · confidence high
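
One way to follow the "hash rather than one-hot" treatment is a stable hashing step that maps each gloss to a fixed number of buckets. This is a minimal sketch, not part of saturn: the function name `gloss_bucket`, the bucket count of 4096, and the lowercase/strip normalization are all assumptions made for illustration. It uses `hashlib` rather than the built-in `hash()`, because Python randomizes string hashes per process unless `PYTHONHASHSEED` is fixed.

```python
import hashlib


def gloss_bucket(gloss: str, n_buckets: int = 4096) -> int:
    """Map a gloss to a stable bucket index in [0, n_buckets).

    n_buckets and the normalization below are illustrative assumptions.
    """
    normalized = gloss.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer, then fold into range.
    return int.from_bytes(digest[:8], "big") % n_buckets


# 'up' and 'hearing' are the two tokens named in the column summary.
for g in ("up", "hearing"):
    print(g, gloss_bucket(g))
```

With 2,000 unique glosses and a few thousand buckets some collisions are expected, so the bucket count is a trade-off between table size and collision rate rather than a fixed recommendation.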
Out[9]:

saturn.columns["gloss"].stats

stat value
n 2,000
nulls 0 (0.0%)
unique 2,000
len_min 1
len_max 16
len_mean 6.008
len_median 6
len_p95 10
word_mean 1.024
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 1,984
readability_flesch_mean 54.58
emoji_rate 0
url_rate 0
one_word_rate 0.9775
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 97.8% of rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 4.
Character-length distribution for gloss.

instances unknown other

The column is named "instances" but saturn skipped detailed profiling, so its kind is unknown and no descriptive statistics were computed. We can only confirm 2000 rows with a 0.0 null rate; uniqueness, distribution, and dtype are all unreported. Without a sample value or type signal, the semantic role cannot be inferred from the evidence.

Treatment: Re-profile or inspect raw values manually before deciding how to use this column.

anthropic:claude-opus-4-7 · confidence low
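
A manual inspection of this column could start with a structural summary like the sketch below. It assumes the source file is a JSON array of objects that each carry an "instances" key, which this report does not confirm. The helper names `summarize_instances` and `summarize_file` are introduced here for illustration, and the sample records are made up purely to exercise the code; they are not values from the dataset.

```python
import json
from collections import Counter


def summarize_instances(records):
    """Count value types, list lengths, and nested dict keys seen under 'instances'."""
    type_counts = Counter()
    lengths = []
    nested_keys = Counter()
    for record in records:
        value = record.get("instances")
        type_counts[type(value).__name__] += 1
        if isinstance(value, list):
            lengths.append(len(value))
            for item in value:
                if isinstance(item, dict):
                    nested_keys.update(item.keys())
    return {
        "types": dict(type_counts),
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
        "nested_keys": dict(nested_keys),
    }


def summarize_file(path):
    """Load a JSON file (assumed: top-level array of objects) and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize_instances(json.load(fh))


# Made-up records for illustration only; the real field contents are unknown here.
sample = [
    {"gloss": "example_a", "instances": [{"id": 1}, {"id": 2}]},
    {"gloss": "example_b", "instances": []},
]
result = summarize_instances(sample)
print(result)
```

Running `summarize_file` on the source path would show whether the field is a list, a dict, or something else, and which nested keys recur, which is enough to decide how the column should be re-profiled.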
Out[12]:

saturn.columns["instances"].stats

stat value
n 2,000
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

How to cite

BibTeX
@misc{saturn-wlasl-index-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: wlasl index},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/wlasl_index}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: wlasl index. Source: /home/coolhand/html/datavis/data_trove/cache/wlasl_index.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/wlasl_index