saturn

/home/coolhand/html/datavis/data_trove/data/linguistic/world_languages_integrated.json | 7,130 rows | sample n=7,130 | seed 42 | 2026-06-22T00:12:06+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/linguistic/world_languages_integrated.json
Total rows: 7,130
Profiled sample: 7,130
Columns: 8
Generated: 2026-06-22T00:12:06+00:00
Per-column null rate across the corpus.
column            kind     null %
iso_639_3         text     0.0%
name              text     0.0%
joshua_project    unknown  0.0%
glottolog         unknown  0.0%
language_history  unknown  0.0%
us_indigenous     unknown  0.0%
speaker_count     unknown  0.0%
data_sources      unknown  0.0%
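
The null rates above can be re-derived directly from the records. The following is a minimal sketch, assuming the JSON file holds a list of objects, one per language; the `null_rates` helper and the two sample records are hypothetical and are not part of saturn.

```python
from collections import Counter

def null_rates(records, columns):
    """Return the fraction of records where each column is missing or None."""
    nulls = Counter()
    for rec in records:
        for col in columns:
            if rec.get(col) is None:
                nulls[col] += 1
    total = len(records)
    return {col: (nulls[col] / total if total else 0.0) for col in columns}

# Hypothetical records for illustration only; real rows come from the JSON file.
sample = [
    {"iso_639_3": "abq", "name": "Abaza", "speaker_count": None},
    {"iso_639_3": "qui", "name": "Quileute", "speaker_count": {}},
]
rates = null_rates(sample, ["iso_639_3", "name", "speaker_count"])
```

Under this definition an empty object or list counts as non-null, so a 0.0% null rate on a nested column does not by itself show that the field carries data.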

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset medium anthropic:default

This dataset is a reference catalogue of 7,130 world languages, each identified by a unique ISO 639-3 three-letter code alongside its name and several linked data sources (Glottolog, Joshua Project, speaker counts, and more). Every row is distinct, with no duplicate language codes or names, so the table works primarily as a lookup/reference table. Two things stand out for closer inspection. First, the name column shows clusters around directional qualifiers (Southern, Northern, Western, Eastern) and language groups such as Zapotec (58), Mixtec (52), and Naga (49), which points to geographic and genealogical structure worth exploring. Second, 'sign' appears 152 times in language names, indicating that sign languages make up a substantial share of the catalogue.

iso_639_3 high anthropic:default

This column contains ISO 639-3 language codes, the international standard three-letter identifiers for individual human languages. Every value is exactly 3 characters long (len_min=3, len_max=3, len_mean=3.0) and unique across all 7,130 rows, with zero nulls or duplicates. The sample values are all lowercase and allcaps_rate is 0.000. The count of 7,130 distinct codes is close to the 7,000-plus living languages commonly catalogued, which suggests this may be a near-complete reference table of language codes.
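
The shape and uniqueness claims for this column are easy to check mechanically. The sketch below uses the ten sample values from this report; the `check_codes` helper is hypothetical, and the regular expression only tests the three-lowercase-letter shape, not whether a code is actually assigned in the ISO 639-3 registry.

```python
import re

# Shape of an ISO 639-3 identifier: exactly three lowercase ASCII letters.
ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")

def check_codes(codes):
    """Return (malformed codes, duplicated codes) for an iterable of strings."""
    seen, dupes, malformed = set(), set(), []
    for code in codes:
        if not ISO_639_3_SHAPE.fullmatch(code):
            malformed.append(code)
        if code in seen:
            dupes.add(code)
        seen.add(code)
    return malformed, sorted(dupes)

# First ten sample values listed for iso_639_3 in this report.
codes = ["abq", "qui", "tys", "sby", "kzn", "wca", "pbn", "tpy", "sod", "aoa"]
malformed, dupes = check_codes(codes)
```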

name high anthropic:default

This column contains names of human languages or dialects, evidenced by the dominant top words ('language', 'sign', 'zapotec', 'mixtec', 'naga') and directional qualifiers ('southern', 'northern', 'western', 'eastern', 'central') typical of language naming conventions. All 7,130 rows are unique with zero duplicates and zero nulls, so the name also works as an identifier within this table. The one-word rate of 72.9% and a mean word count of 1.37 reflect the compact, often single-token nature of language names, while the maximum length of 43 characters accommodates longer multi-word names. The vocabulary size of 7,124 is not a measure of row uniqueness: if it counts distinct word tokens, then roughly 7,130 × 1.372 ≈ 9,780 tokens map to 7,124 distinct ones, so about a quarter of token occurrences repeat an earlier token, which is consistent with the frequent words listed above.
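
The word-level figures cited above (one_word_rate, word_mean, vocab_size) can be approximated as follows. This is a sketch under stated assumptions: saturn's tokenizer is not documented in this report, so lowercasing and splitting on whitespace are guesses, and `word_stats` is a hypothetical helper.

```python
from collections import Counter

def word_stats(names):
    """Compute one-word rate, mean word count, token total and distinct tokens."""
    # Assumption: tokens are lowercased, whitespace-separated words.
    token_lists = [name.lower().split() for name in names]
    tokens = Counter(tok for toks in token_lists for tok in toks)
    n = len(names)
    return {
        "one_word_rate": sum(len(t) == 1 for t in token_lists) / n,
        "word_mean": sum(len(t) for t in token_lists) / n,
        "total_tokens": sum(tokens.values()),
        "vocab_size": len(tokens),
    }

# First ten sample values listed for the name column in this report.
names = ["Abaza", "Quileute", "Tày Sa Pa", "Soli", "Kokola",
         "Yanomámi", "Kpasam", "Trumai", "Songoora", "Angolar"]
stats = word_stats(names)
```

Comparing `total_tokens` with `vocab_size` on the full column shows how much token reuse there is; comparing `vocab_size` with the row count does not.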

data_sources low anthropic:default

The column 'data_sources' has 7,130 rows with zero nulls, but saturn skipped profiling it entirely: no type was resolved, no unique count was computed, and no statistics were collected. This suggests the column contains a complex or nested type (e.g., JSON objects, arrays, or mixed structures) that the profiler could not parse. No further characterisation is possible from the available evidence.

glottolog low anthropic:default

This column likely holds Glottolog data for each language, although its contents were not inspected. Glottolog identifies languages, dialects, and families with glottocodes, eight-character identifiers made of four alphanumeric characters followed by four digits. The profiler skipped detailed analysis (kind='unknown', no stats, null_rate=0.0), so cardinality and value distribution are unavailable. With 7,130 rows and zero nulls, the column is fully populated, but uniqueness cannot be assessed from the available evidence.

joshua_project low anthropic:default

This column references the Joshua Project, a well-known ethnoreligious people-group classification system commonly used in missiology and demographic datasets. The profiler returned a 'skipped' alert with no computed stats or uniqueness count, meaning the column type was not resolved and no distributional analysis was available. With 7,130 non-null rows and zero null rate, the data is fully populated, but nothing further can be inferred about cardinality, value distribution, or data type from the evidence provided.

language_history low anthropic:default

This column contains an unresolved data type that saturn skipped during profiling, likely a complex or nested structure (e.g., JSON, array, or serialized object) describing the history of each language. With 7,130 rows, zero nulls, and no stats computed, the actual content and distribution are entirely opaque from this evidence alone. The 'skipped' alert means no cardinality, frequency, or value-level analysis was performed, so downstream usability is unknown.

speaker_count low anthropic:default

The column 'speaker_count' most likely records the number of speakers of each language, since every row in this dataset is a language. No distributional statistics or uniqueness data were computed because the profiler skipped this column; kind='unknown' suggests the values are not plain numbers, for example nested objects or mixed types. With 7,130 rows and zero nulls, the column is fully populated, but no further characterisation is possible from the available evidence.

us_indigenous low anthropic:default

The column 'us_indigenous' likely indicates whether each of the 7,130 languages is indigenous to the United States, or carries related detail for those that are. The profiler emitted a 'skipped' alert with no computed stats or uniqueness count, meaning the column's data type or content prevented standard analysis. No nulls are present, but without distribution or cardinality data, the actual value composition (e.g., boolean flags, coded strings, nested objects) cannot be determined from this evidence alone.
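
Since six of the eight columns were skipped with kind='unknown', a quick first step is to count which value types each one actually holds. The sketch below is an illustration only: `value_kinds` is a hypothetical helper, the sample records are invented, and the real structure of these fields is not known from this report.

```python
from collections import Counter

def value_kinds(records, column):
    """Count the Python type names seen in one column across all records."""
    return Counter(type(rec.get(column)).__name__ for rec in records)

# Hypothetical records; the real field structure is not known from the report.
sample = [
    {"us_indigenous": False, "data_sources": ["a", "b"]},
    {"us_indigenous": True, "data_sources": []},
]
columns = ("us_indigenous", "data_sources", "glottolog")
kinds = {col: value_kinds(sample, col) for col in columns}
```

With the real file, the records could be loaded with `json.load`, assuming the top level is a list of objects. A column reporting only `dict` or `list` would confirm the nested-structure explanation, while a mix of types would point to inconsistent encoding.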

iso_639_3 text

100.0% of rows are unique strings | 100.0% of rows are a single word | 95th-percentile length under 20 chars
rows: 7,130
null: 0 (0.0%)
unique: 7,130
len_min: 3
len_max: 3
len_mean: 3.000
len_median: 3.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 0
duplicate_rate: 0.000
vocab_size: 7,130
readability_flesch_mean: 120.374
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Character-length distribution for iso_639_3 (mean: 3.0).
chars    count
3 – 3    7,130
The other 39 histogram bins (from 2 – 3 up to 3 – 4) are empty.
Sample values (first 10)
  1. abq
  2. qui
  3. tys
  4. sby
  5. kzn
  6. wca
  7. pbn
  8. tpy
  9. sod
  10. aoa

name text

100.0% of rows are unique strings | 72.9% of rows are a single word
rows: 7,130
null: 0 (0.0%)
unique: 7,130
len_min: 1
len_max: 43
len_mean: 8.994
len_median: 7.000
len_p95: 21.000
word_mean: 1.372
word_median: 1.000
n_empty: 0
n_duplicates: 0
duplicate_rate: 0.000
vocab_size: 7,124
readability_flesch_mean: 56.151
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.729
allcaps_rate: 0.000
boilerplate_rate: 0.000
Character-length distribution for name (mean: 8.994).
chars      count
1 – 2      25
2 – 3      195
3 – 4      761
4 – 5      1,106
5 – 6      1,096
6 – 7      847
7 – 8      529
8 – 9      351
9 – 10     278
10 – 12    238
12 – 13    213
13 – 14    215
14 – 15    191
15 – 16    202
16 – 17    149
17 – 18    95
18 – 19    75
19 – 20    72
20 – 21    66
21 – 22    70
22 – 23    154
23 – 24    50
24 – 25    26
25 – 26    30
26 – 27    23
27 – 28    8
28 – 29    7
29 – 30    10
30 – 31    14
31 – 32    6
32 – 34    8
34 – 35    2
35 – 36    10
36 – 37    1
37 – 38    4
38 – 39    0
39 – 40    1
40 – 41    0
41 – 42    1
42 – 43    1
Sample values (first 10)
  1. Abaza
  2. Quileute
  3. Tày Sa Pa
  4. Soli
  5. Kokola
  6. Yanomámi
  7. Kpasam
  8. Trumai
  9. Songoora
  10. Angolar
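
The length statistics reported for this column (len_min, len_max, len_mean, len_median) can be reproduced on any list of names. The sketch below runs on the ten sample values above; `length_stats` is a hypothetical helper, and counting Unicode code points after NFC normalisation is an assumption, since the report does not say how saturn measures length.

```python
import statistics
import unicodedata

def length_stats(values):
    """Character-length summary, counting code points after NFC normalisation."""
    lengths = [len(unicodedata.normalize("NFC", v)) for v in values]
    return {
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_mean": statistics.mean(lengths),
        "len_median": statistics.median(lengths),
    }

# First ten sample values listed for the name column in this report.
names = ["Abaza", "Quileute", "Tày Sa Pa", "Soli", "Kokola",
         "Yanomámi", "Kpasam", "Trumai", "Songoora", "Angolar"]
stats = length_stats(names)
```

Normalising first keeps accented names such as "Tày Sa Pa" from being counted differently depending on whether the accent is stored as a combining mark.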

joshua_project unknown

no profiler for kind=unknown
rows: 7,130
null: 0 (0.0%)

glottolog unknown

no profiler for kind=unknown
rows: 7,130
null: 0 (0.0%)

language_history unknown

no profiler for kind=unknown
rows: 7,130
null: 0 (0.0%)

us_indigenous unknown

no profiler for kind=unknown
rows: 7,130
null: 0 (0.0%)

speaker_count unknown

no profiler for kind=unknown
rows: 7,130
null: 0 (0.0%)

data_sources unknown

no profiler for kind=unknown
rows: 7,130
null: 0 (0.0%)