language data world languages integrated

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/datasets/language-data/world_languages_integrated.json

Saturn profiled 7,130 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

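# Install the profiler (notebook shell escape).
!pip install saturn-dissect

import subprocess

# Profile the dataset, write the findings JSON, and request the LLM commentary.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/datasets/language-data/world_languages_integrated.json",
        "--findings", "language-data-world_languages_integrated.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if saturn exits non-zero
)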

Summary confidence: medium

This dataset catalogs 7,130 world languages, each row keyed by a unique ISO 639-3 code paired with a language name. Only two of the eight columns (iso_639_3, name) were profiled as text; the other six (joshua_project, glottolog, language_history, us_indigenous, speaker_count, data_sources) were skipped with kind=unknown, most likely because they hold nested or non-scalar structures, and will need flattening before they yield insight. The name column is the most informative surface feature: it averages about 9 characters (median 7), is a single word ~73% of the time, and its top tokens show heavy use of regional qualifiers such as 'southern', 'northern', 'eastern', and 'central', plus 152–153 entries containing 'language' or 'sign'. Start by unpacking the skipped columns (especially speaker_count and glottolog), then examine name-token frequencies to understand naming conventions and language-family clustering.

citing: row_count · column_count · columns.iso_639_3.n_unique · columns.iso_639_3.stats.len_mean · columns.name.stats.len_mean · columns.name.stats.one_word_rate · columns.name.top_words · kinds
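
The `citing:` line lists dotted paths into the findings file. The sketch below resolves such a path, assuming the findings JSON nests its keys the way the paths read. That layout is an assumption, the helper name `lookup` is hypothetical, and the sample dict is hand-built from figures reported in this notebook rather than loaded from the real file.

```python
from functools import reduce
from typing import Any, Mapping


def lookup(findings: Mapping[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'columns.name.stats.len_median'.

    Raises KeyError if any segment of the path is missing.
    """
    return reduce(lambda node, key: node[key], path.split("."), findings)


# Hand-built stand-in for the findings file (assumed layout, values from this report).
findings = {
    "row_count": 7130,
    "column_count": 8,
    "columns": {"name": {"stats": {"len_median": 7, "one_word_rate": 0.7292}}},
}

row_count = lookup(findings, "row_count")
name_median = lookup(findings, "columns.name.stats.len_median")
```

Raising `KeyError` on a missing segment is deliberate: a cited stat that does not exist should fail loudly rather than return a silent default.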

Out[4]:

saturn.schema() · 8 columns

column	kind	n	null%	unique	alerts
iso_639_3	text	7,130	0.0%	7,130	near_unique one_word short_text
name	text	7,130	0.0%	7,130	near_unique one_word
joshua_project	unknown	7,130	0.0%	—	skipped
glottolog	unknown	7,130	0.0%	—	skipped
language_history	unknown	7,130	0.0%	—	skipped
us_indigenous	unknown	7,130	0.0%	—	skipped
speaker_count	unknown	7,130	0.0%	—	skipped
data_sources	unknown	7,130	0.0%	—	skipped
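
Six of the eight columns above were skipped. If they turn out to hold nested objects, one way to make them profilable is to flatten each record into dotted scalar keys. The following is a minimal sketch under that assumption; the nested field names in `sample` are invented for illustration and are not taken from the dataset.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict with dotted keys.

    Lists and empty dicts are kept as values; lists need a separate explode step.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


# Hypothetical record: every nested key and value here is a placeholder.
sample = {
    "iso_639_3": "aas",
    "name": "Placeholder",
    "speaker_count": {"count": 1200, "source": "example"},
    "data_sources": ["example_a", "example_b"],
}
flat_sample = flatten(sample)
```

After flattening, keys such as `speaker_count.count` are plain scalars that a text or numeric profiler can handle, while `data_sources` stays a list.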

Fig 1.

name · Distribution of language-name lengths shows most names are short (median 7 chars) with a long tail up to 43.

Character-length distribution for name (mean: 8.993688639551193).
chars	count
1 – 2	25
2 – 3	195
3 – 4	761
4 – 5	1106
5 – 6	1096
6 – 7	847
7 – 8	529
8 – 9	351
9 – 10	278
10 – 12	238
12 – 13	213
13 – 14	215
14 – 15	191
15 – 16	202
16 – 17	149
17 – 18	95
18 – 19	75
19 – 20	72
20 – 21	66
21 – 22	70
22 – 23	154
23 – 24	50
24 – 25	26
25 – 26	30
26 – 27	23
27 – 28	8
28 – 29	7
29 – 30	10
30 – 31	14
31 – 32	6
32 – 34	8
34 – 35	2
35 – 36	10
36 – 37	1
37 – 38	4
38 – 39	0
39 – 40	1
40 – 41	0
41 – 42	1
42 – 43	1

Fig 2.

name · Top tokens in language names highlight directional qualifiers (southern, northern, eastern) and the prominence of 'sign' languages.

Fig 3.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
iso_639_3	text	0.0%
name	text	0.0%
joshua_project	unknown	0.0%
glottolog	unknown	0.0%
language_history	unknown	0.0%
us_indigenous	unknown	0.0%
speaker_count	unknown	0.0%
data_sources	unknown	0.0%

iso_639_3 text identifier

This column holds ISO 639-3 language codes: every value is exactly 3 characters, single-word, and unique across all 7130 rows (vocab_size 7130, len_min/max 3, one_word_rate 1.0). With n_unique equal to n and zero nulls or duplicates, it functions as a primary key for languages rather than a feature. Sample tokens like 'aou', 'aiw', 'aas' match the ISO 639-3 three-letter convention.

Treatment: Use as the join key to language-level metadata; do not feed into a model directly.

anthropic:claude-opus-4-7 · confidence high
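
Before relying on iso_639_3 as a join key, it is cheap to re-check the two properties the stats report: uniqueness and the three-letter shape. A minimal sketch, assuming codes are three lowercase ASCII letters; the function name `check_key` is hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

ISO_639_3 = re.compile(r"[a-z]{3}")


def check_key(codes: Iterable[object]) -> Dict[str, List[object]]:
    """Report duplicated and malformed codes; empty lists mean the key is clean."""
    counts = Counter(codes)
    duplicates = sorted((c for c, n in counts.items() if n > 1), key=str)
    malformed = sorted(
        (c for c in counts if not isinstance(c, str) or not ISO_639_3.fullmatch(c)),
        key=str,
    )
    return {"duplicates": duplicates, "malformed": malformed}


# Sample tokens quoted in the column note above.
clean = check_key(["aou", "aiw", "aas"])
# Deliberately broken input to show what a failure looks like.
dirty = check_key(["aou", "aou", "AB1"])
```

An empty report on the full column would confirm the `n_duplicates 0` and `len_min`/`len_max 3` figures independently of the profiler.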

Out[9]:

saturn.columns["iso_639_3"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	7,130
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,130
readability_flesch_mean	120.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 4.

Character-length distribution for iso_639_3.

Character-length distribution for iso_639_3 (mean: 3.0).
chars	count
3	7130

name text identifier

This column appears to be a unique name identifier, likely for languages given the prominence of 'language', 'sign', 'zapotec', 'mixtec', and 'naga' in top words. All 7130 values are unique (n_unique equals n) with zero nulls or duplicates, and 72.9% are single words with a mean length of ~9 characters. The directional modifiers (southern, northern, western, eastern, central) suggest many entries are regional language variants.

Treatment: Use as a primary key or label; do not feed raw into models.

anthropic:claude-opus-4-7 · confidence high
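
The top-word observation can be reproduced with a simple token count. This sketch assumes tokens are lowercased runs of letters and counts each token once per name, so the result reads as "number of entries containing the token". Saturn's own tokenizer may differ, and the names in `names` are placeholders, not rows from the dataset.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

WORD = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def top_tokens(names: Iterable[str], k: int = 10) -> List[Tuple[str, int]]:
    """Return the k tokens that appear in the most names."""
    counts: Counter = Counter()
    for name in names:
        # set() so a token repeated inside one name is counted once
        counts.update(set(WORD.findall(name.lower())))
    return counts.most_common(k)


# Placeholder names that mimic the directional-qualifier pattern.
names = ["Northern Alpha", "Southern Alpha", "Alpha Sign Language", "Beta"]
most_common = top_tokens(names, k=1)
```

Grouping rows by a shared non-directional token (here the placeholder `alpha`) is one way to surface the regional-variant clusters the note describes.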

Out[12]:

saturn.columns["name"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	7,130
len_min	1
len_max	43
len_mean	8.994
len_median	7
len_p95	21
word_mean	1.372
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,124
readability_flesch_mean	56.15
emoji_rate	0
url_rate	0
one_word_rate	0.7292
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	72.9% rows are a single word
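
Several of the stats above are easy to recompute as a cross-check. The sketch below assumes length is counted in characters and words are whitespace-delimited; Saturn's exact definitions are not stated in this report, so small differences are possible. The input list is a placeholder.

```python
from statistics import mean, median
from typing import Dict, Sequence


def name_stats(names: Sequence[str]) -> Dict[str, float]:
    """Recompute basic length and word-count stats for a text column."""
    lengths = [len(n) for n in names]
    words = [len(n.split()) for n in names]
    return {
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_mean": mean(lengths),
        "len_median": median(lengths),
        "word_mean": mean(words),
        "one_word_rate": sum(w == 1 for w in words) / len(words),
    }


# Placeholder values: lengths 2, 4, 5, 6 and word counts 1, 1, 2, 1.
stats = name_stats(["Aa", "Bbbb", "Cc Dd", "Eeeeee"])
```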

Fig 5.

Character-length distribution for name.

joshua_project unknown other

The column "joshua_project" was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 7130 and a null rate of 0.0. Without unique counts or sample values it is not possible to infer whether this is an identifier, code, or descriptive field. The name suggests a reference to the Joshua Project (an ethnographic/missiological dataset), which would typically indicate a people-group code or link.

Treatment: Re-profile with the skip cleared to determine type and cardinality before deciding on use.

anthropic:claude-opus-4-7 · confidence low
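
A quick way to learn what a skipped column contains is to tally the Python types of its values and, for dict values, the keys they carry. This is a sketch only: it assumes the source file loads as a list of dict records, and the rows and keys below are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def shape_report(rows: Iterable[Mapping[str, Any]], column: str) -> Dict[str, Dict[str, int]]:
    """Count value types in a column and the keys seen inside dict values."""
    types: Counter = Counter()
    keys: Counter = Counter()
    for row in rows:
        value = row.get(column)
        types[type(value).__name__] += 1
        if isinstance(value, dict):
            keys.update(value.keys())
    return {"types": dict(types), "dict_keys": dict(keys)}


# Hypothetical rows; the inner keys "a" and "b" are placeholders.
rows = [
    {"joshua_project": {"a": 1}},
    {"joshua_project": {"a": 2, "b": 3}},
    {"joshua_project": None},
]
report = shape_report(rows, "joshua_project")
```

If the report shows mostly `dict` with a stable key set, flattening is the natural next step; a mix of `NoneType` and `dict` would also explain a 0.0% null rate that hides structurally empty rows.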

Out[15]:

saturn.columns["joshua_project"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

glottolog unknown identifier

The column 'glottolog' was skipped by the profiler and carries no computed statistics beyond a row count of 7130 and a null rate of 0.0. The name suggests Glottolog language identifiers (e.g., codes like 'stan1293'), which would align with the ~7000-language scale of that catalogue, but uniqueness, type, and value distribution are all unknown here. No further signal is available to characterise it.

Treatment: Re-profile with parsing enabled; if confirmed as Glottolog codes, treat as a foreign key to join against the Glottolog reference.

anthropic:claude-opus-4-7 · confidence low
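
If the field does hold Glottolog codes, a format check is a reasonable first test. The sketch assumes the commonly documented glottocode shape of four lowercase alphanumeric characters followed by four digits, as in 'stan1293', and assumes the code is available as a plain string once the column is unpacked. Both points should be confirmed against the actual data.

```python
import re
from typing import Iterable, List

GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode(value: object) -> bool:
    """True if value is a string in the assumed glottocode shape."""
    return isinstance(value, str) and GLOTTOCODE.fullmatch(value) is not None


def invalid_codes(values: Iterable[object]) -> List[object]:
    """Return the values that do not look like glottocodes, in input order."""
    return [v for v in values if not is_glottocode(v)]


# 'stan1293' is the example quoted in the note above; the others are malformed on purpose.
bad = invalid_codes(["stan1293", "stan12", None])
```

A passing format check does not prove the code exists in Glottolog; that requires a join against the reference list.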

Out[17]:

saturn.columns["glottolog"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

language_history unknown other

The column `language_history` was skipped by the profiler, which emitted no type, uniqueness, or value statistics. All 7130 rows are non-null, but with `kind` reported as unknown and `stats` empty, nothing can be inferred about the contents from this evidence alone. The name suggests a per-language historical record (possibly a list or sequence per row), which would explain why a scalar profiler bailed out.

Treatment: Re-profile after parsing the field (e.g., explode list values) before deciding on use.

anthropic:claude-opus-4-7 · confidence low
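
"Explode list values" means emitting one output row per list item while carrying the key along. A minimal sketch, assuming the field is a list in at least some rows; the row contents are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Mapping


def explode(
    rows: Iterable[Mapping[str, Any]], column: str, key: str = "iso_639_3"
) -> Iterator[Dict[str, Any]]:
    """Yield one {key, column} row per list item; non-list values pass through once.

    Rows whose list is empty yield nothing.
    """
    for row in rows:
        value = row.get(column)
        items = value if isinstance(value, list) else [value]
        for item in items:
            yield {key: row[key], column: item}


# Hypothetical rows: the history entries are placeholders.
rows = [
    {"iso_639_3": "aou", "language_history": ["x", "y"]},
    {"iso_639_3": "aiw", "language_history": "z"},
    {"iso_639_3": "aas", "language_history": []},
]
exploded = list(explode(rows, "language_history"))
```

The exploded rows keep `iso_639_3`, so they can be re-profiled as a long table and joined back to the language-level record.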

Out[19]:

saturn.columns["language_history"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

us_indigenous unknown other

The column `us_indigenous` was skipped by the profiler, so its data type and value distribution are unknown. We can only confirm it has 7130 rows with a 0.0 null rate; cardinality and any descriptive statistics are missing. Without further inspection it is impossible to tell whether this is a flag, category, or something else.

Treatment: Re-profile or inspect manually to determine type before any downstream use.

anthropic:claude-opus-4-7 · confidence low

Out[21]:

saturn.columns["us_indigenous"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

speaker_count unknown other

Column `speaker_count` was skipped by the profiler, so no type, cardinality, or distribution stats are available beyond a row count of 7130 and zero nulls. The name suggests a count of speakers per language, but this cannot be confirmed from the evidence.

Treatment: Re-run the profiler on this column to recover type and distribution before use.

anthropic:claude-opus-4-7 · confidence low
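
Once the raw values are visible, a tolerant coercion step can recover integers where they exist. The sketch below is an assumption about what the values might look like (ints, whole floats, or digit strings with thousands separators); none of these forms is confirmed for this column.

```python
from typing import Optional


def to_int(value: object) -> Optional[int]:
    """Coerce a value to int where that is unambiguous, else return None."""
    if isinstance(value, bool):  # bool is a subclass of int; treat it as not a count
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return int(value) if value.is_integer() else None
    if isinstance(value, str):
        digits = value.replace(",", "").strip()
        return int(digits) if digits.isascii() and digits.isdigit() else None
    return None


# Illustrative inputs only; not values from the dataset.
parsed = [to_int(v) for v in [1200, "1,200", 1200.0, "unknown", None, True]]
```

Returning `None` instead of raising keeps the unparseable share measurable, which is the number to look at before trusting any distribution.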

Out[23]:

saturn.columns["speaker_count"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

data_sources unknown other

The column 'data_sources' was skipped by saturn's profiler (alerts: 'skipped') and reports kind 'unknown' with no unique count or stats. Only the row count (7130) and a null_rate of 0.0 are available, so the actual content and structure are unverified here. Likely a nested or non-scalar field (e.g. list/struct of source identifiers) that the profiler could not flatten.

Treatment: Inspect manually and parse/explode the structure before any downstream modelling.

anthropic:claude-opus-4-7 · confidence low
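
If data_sources is a list of source identifiers per row, a coverage tally shows how many languages each source contributes to. This is a sketch under that assumption; the source names are placeholders, and rows whose value is not a list are counted separately rather than guessed at.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Tuple


def source_coverage(
    rows: Iterable[Mapping[str, Any]], column: str = "data_sources"
) -> Tuple[Counter, int]:
    """Return (rows per source, number of rows whose value is not a list)."""
    coverage: Counter = Counter()
    non_list = 0
    for row in rows:
        value = row.get(column)
        if isinstance(value, list):
            # set comprehension so a source repeated within one row counts once
            coverage.update({str(item) for item in value})
        else:
            non_list += 1
    return coverage, non_list


# Hypothetical rows with placeholder source names.
rows = [
    {"data_sources": ["example_a", "example_b"]},
    {"data_sources": ["example_a", "example_a"]},
    {"data_sources": None},
]
coverage, non_list = source_coverage(rows)
```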

Out[25]:

saturn.columns["data_sources"].stats

stat	value
n	7,130
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

How to cite

BibTeX

@misc{saturn-language-data-world-languages-integrated-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: language data world languages integrated},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/language-data-world_languages_integrated}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: language data world languages integrated. Source: /home/coolhand/datasets/language-data/world_languages_integrated.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/language-data-world_languages_integrated