saturn

/home/coolhand/datasets/language-data/world_languages_integrated.json 7,130 rows sample n=7,130 seed 42 2026-05-01T23:00:44+00:00

Overview

Source	/home/coolhand/datasets/language-data/world_languages_integrated.json
Total rows	7,130
Profiled sample	7,130
Columns	8
Generated	2026-05-01T23:00:44+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset medium anthropic:claude-opus-4-7

This dataset catalogs 7,130 world languages, each row keyed by a unique ISO 639-3 code paired with a language name. Only two of the eight columns (iso_639_3, name) parsed cleanly as text. The other six (joshua_project, glottolog, language_history, us_indigenous, speaker_count, and data_sources) were skipped, most likely because they hold nested or non-scalar structures, and will need flattening before they yield insight. The name column is the most informative surface feature: it averages about 9 characters, is a single word ~73% of the time, and its top tokens show heavy use of regional qualifiers such as 'southern', 'northern', 'eastern', and 'central', plus 'language' and 'sign' at 152 to 153 entries apiece. Start by unpacking the skipped object columns (especially speaker_count and glottolog), then examine name-token frequencies to understand naming conventions and language-family clustering.
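The flattening step recommended above could look roughly like the sketch below. The report does not show what the skipped columns contain, so the nested field names (`count`, `year`) and the example record are hypothetical; only the column names `iso_639_3`, `name`, `speaker_count`, and `data_sources` come from the report.

```python
import json
from typing import Any


def flatten(value: Any, prefix: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys; lists are kept as JSON strings."""
    if isinstance(value, dict):
        flat: dict[str, Any] = {}
        for key, child in value.items():
            child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(child, child_prefix, sep))
        return flat
    if isinstance(value, list):
        # Serialise lists so the result stays one scalar per column.
        return {prefix: json.dumps(value, ensure_ascii=False)}
    return {prefix: value}


# Hypothetical record: the nested keys and all values are invented for illustration.
record = {
    "iso_639_3": "aaa",
    "name": "Example",
    "speaker_count": {"count": 1000, "year": 2010},
    "data_sources": ["source_a", "source_b"],
}

flat_record = flatten(record)
print(flat_record)
```

Once every record is flattened this way, each dotted key is a scalar column that a text or numeric profiler can handle. Note that an empty nested dict contributes no keys at all in this sketch.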

iso_639_3 high anthropic:claude-opus-4-7

This column holds ISO 639-3 language codes: every value is exactly 3 characters, single-word, and unique across all 7130 rows (vocab_size 7130, len_min/max 3, one_word_rate 1.0). With n_unique equal to n and zero nulls or duplicates, it functions as a primary key for languages rather than a feature. Sample tokens like 'aou', 'aiw', 'aas' match the ISO 639-3 three-letter convention.
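The primary-key reading can be checked independently of the profiler. The sketch below is one way to do it, not saturn's own logic: it counts duplicates and flags any value that is not three lowercase ASCII letters. The helper name `check_key` is hypothetical; the clean sample codes are the three quoted above, and the "dirty" list is invented to show what a failure looks like.

```python
import re
from collections import Counter

# ISO 639-3 identifiers are three lowercase ASCII letters.
CODE_PATTERN = re.compile(r"[a-z]{3}")


def check_key(codes: list[str]) -> dict:
    """Summarise whether a list of codes is usable as a primary key."""
    counts = Counter(codes)
    return {
        "n": len(codes),
        "n_unique": len(counts),
        "duplicates": sorted(code for code, seen in counts.items() if seen > 1),
        "malformed": sorted(code for code in counts if not CODE_PATTERN.fullmatch(code)),
    }


# Codes quoted in the report.
clean = check_key(["aou", "aiw", "aas"])

# Invented input with one duplicate and two malformed values.
dirty = check_key(["aou", "aou", "AIW", "aa"])

print(clean)
print(dirty)
```

A column passes when `n == n_unique` and both lists are empty, which is what the profiler's figures (7,130 rows, 7,130 unique, length 3 throughout) imply for iso_639_3.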

name high anthropic:claude-opus-4-7

This column appears to be a unique name identifier, likely for languages given the prominence of 'language', 'sign', 'zapotec', 'mixtec', and 'naga' in top words. All 7130 values are unique (n_unique equals n) with zero nulls or duplicates, and 72.9% are single words with a mean length of ~9 characters. The directional modifiers (southern, northern, western, eastern, central) suggest many entries are regional language variants.
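A token-frequency count of the kind referred to above can be reproduced with a few lines of standard-library Python. This is a minimal sketch that assumes lowercasing and whitespace splitting; saturn's actual tokenizer is not described in the report. The names in the list are made-up placeholders, not rows from the dataset.

```python
from collections import Counter


def token_counts(names: list[str]) -> Counter:
    """Count lowercase, whitespace-delimited tokens across all names."""
    counts: Counter = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts


# Placeholder names, invented for illustration only.
names = [
    "Northern Alpha",
    "Southern Alpha",
    "Central Beta",
    "Beta Sign Language",
    "Gamma",
]

counts = token_counts(names)
print(counts.most_common(3))
```

Running the same function over the real name column would show how often qualifiers such as 'southern' or 'central' occur, and which base names they attach to.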

joshua_project low anthropic:claude-opus-4-7

The column "joshua_project" was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 7130 and a null rate of 0.0. Without unique counts or sample values it is not possible to infer whether this is an identifier, code, or descriptive field. The name suggests a reference to the Joshua Project (an ethnographic/missiological dataset), which would typically indicate a people-group code or link.

glottolog low anthropic:claude-opus-4-7

The column 'glottolog' was skipped by the profiler and carries no computed statistics beyond a row count of 7130 and a null rate of 0.0. The name suggests Glottolog language identifiers (e.g., codes like 'stan1293'), which would align with the ~7000-language scale of that catalogue, but uniqueness, type, and value distribution are all unknown here. No further signal is available to characterise it.

language_history low anthropic:claude-opus-4-7

The column `language_history` was skipped by the profiler, which emitted no type, uniqueness, or value statistics. All 7130 rows are non-null, but with `kind` reported as unknown and `stats` empty, nothing can be inferred about the contents from this evidence alone. The name suggests a per-language history record (possibly a list or sequence per row), which would explain why a scalar profiler bailed out.

us_indigenous low anthropic:claude-opus-4-7

The column `us_indigenous` was skipped by the profiler, so its data type and value distribution are unknown. We can only confirm it has 7130 rows with a 0.0 null rate; cardinality and any descriptive statistics are missing. Without further inspection it is impossible to tell whether this is a flag, category, or something else.

speaker_count low anthropic:claude-opus-4-7

Column `speaker_count` was skipped by the profiler, so no type, cardinality, or distribution stats are available beyond a row count of 7130 and zero nulls. The name suggests an integer count of speakers per record, but this cannot be confirmed from the evidence. Re-profile with the appropriate parser before drawing conclusions.

data_sources low anthropic:claude-opus-4-7

The column 'data_sources' was skipped by saturn's profiler (alerts: 'skipped') and reports kind 'unknown' with no unique count or stats. Only the row count (7130) and a null_rate of 0.0 are available, so the actual content and structure are unverified here. Likely a nested or non-scalar field (e.g. list/struct of source identifiers) that the profiler could not flatten.
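Before re-profiling, a quick census of the Python types found in each skipped column would settle whether a field is a flag, a number, a list, or a nested object. The sketch below assumes the JSON file decodes to a list of dicts, one per language, which the report does not confirm; the helper `type_census` and the two example records are hypothetical.

```python
from collections import Counter

SKIPPED_COLUMNS = [
    "joshua_project",
    "glottolog",
    "language_history",
    "us_indigenous",
    "speaker_count",
    "data_sources",
]


def type_census(records: list[dict], column: str) -> Counter:
    """Count the Python type names observed for one column."""
    return Counter(type(record.get(column)).__name__ for record in records)


# Hypothetical records; with the real file one might instead use
#   records = json.load(open(path, encoding="utf-8"))
# assuming the top level is a list of dicts.
records = [
    {"us_indigenous": True, "glottolog": {"id": "stan1293"}, "data_sources": ["a"]},
    {"us_indigenous": False, "glottolog": {"id": "stan1293"}, "data_sources": []},
]

census = {column: type_census(records, column) for column in SKIPPED_COLUMNS}
for column, types in census.items():
    print(column, dict(types))
```

A column that reports only `dict` or `list` needs flattening; one that reports `bool` or `int` could be profiled directly once its kind is declared. Columns missing from a record show up as `NoneType` in this sketch.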

iso_639_3 text

100.0% of rows are unique strings; 100.0% of rows are a single word; 95th-percentile length under 20 chars

rows	7,130
null	0 (0.0%)
unique	7,130
len_min	3
len_max	3
len_mean	3.000
len_median	3.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	7,130
readability_flesch_mean	120.374
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

name text

100.0% of rows are unique strings; 72.9% of rows are a single word

rows	7,130
null	0 (0.0%)
unique	7,130
len_min	1
len_max	43
len_mean	8.994
len_median	7.000
len_p95	21.000
word_mean	1.372
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	7,124
readability_flesch_mean	56.151
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.729
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

Abaza
Quileute
Tày Sa Pa
Soli
Kokola
Yanomámi
Kpasam
Trumai
Songoora
Angolar
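For orientation, the length and word statistics listed for this column can be approximated on the ten sample values above. This sketch assumes length means character count and a word means a whitespace-delimited token; saturn's exact definitions are not stated in the report, and ten rows will not reproduce the full-column figures.

```python
from statistics import mean, median

# The ten sample values shown in the report.
names = [
    "Abaza", "Quileute", "Tày Sa Pa", "Soli", "Kokola",
    "Yanomámi", "Kpasam", "Trumai", "Songoora", "Angolar",
]

lengths = [len(name) for name in names]
word_counts = [len(name.split()) for name in names]

stats = {
    "len_min": min(lengths),
    "len_max": max(lengths),
    "len_mean": mean(lengths),
    "len_median": median(lengths),
    "word_mean": mean(word_counts),
    # Share of values that consist of exactly one token.
    "one_word_rate": sum(count == 1 for count in word_counts) / len(names),
}

for key, value in stats.items():
    print(key, value)
```

Nine of the ten sample names are a single token, so the sample's one-word rate is higher than the 0.729 measured over all 7,130 rows.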

joshua_project unknown

no profiler for kind=unknown

rows	7,130
null	0 (0.0%)

glottolog unknown

no profiler for kind=unknown

rows	7,130
null	0 (0.0%)

language_history unknown

no profiler for kind=unknown

rows	7,130
null	0 (0.0%)

us_indigenous unknown

no profiler for kind=unknown

rows	7,130
null	0 (0.0%)

speaker_count unknown

no profiler for kind=unknown

rows	7,130
null	0 (0.0%)

data_sources unknown

no profiler for kind=unknown

rows	7,130
null	0 (0.0%)