glottolog languages

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet

Saturn profiled 27,037 rows across 15 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
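# Install the profiler (IPython shell escape), then run the CLI on the parquet file.
!pip install saturn-dissect
import subprocess
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet",
        "--findings", "glottolog_languages.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)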

Summary confidence: high

This dataset is a Glottolog catalogue of 27,037 entries with 15 columns covering identifiers (Glottocode, ISO codes), geographic info (Latitude, Longitude, Countries, Macroarea), classification (Family_ID, Level, Is_Isolate), and documentation years. The Level column splits the catalogue into dialects (50.3%), languages (31.9%), and families (17.9%), while Macroarea is dominated by Eurasia (29.8%) and Africa (29.7%) with Papunesia (23.4%) close behind. Family_ID is concentrated in a few large families: the three largest (atla1278, aust1307, indo1319) hold about 45% of rows, out of 297 families in total. Note that the documentation-year fields are almost entirely null (Last_Year_Of_Documentation 96.0%, First_Year_Of_Documentation 99.2%) and Is_Isolate is missing for 68.1% of rows, so those columns are unreliable for whole-catalogue analysis. The geographic coordinates are nearly complete (1.8% null) and would support mapping work.

citing: Level · Macroarea · Family_ID · Countries · Is_Isolate · Last_Year_Of_Documentation · First_Year_Of_Documentation · Latitude · Longitude · Name

Out[4]:

saturn.schema() · 15 columns

column                       kind         n       null%  unique  alerts
ID                           text         27,037  0.0%   27,037  near_unique, one_word, short_text
Name                         text         27,037  0.0%   27,037  near_unique, one_word
Macroarea                    categorical  27,037  0.8%   30
Latitude                     numeric      27,037  1.8%   13,231
Longitude                    numeric      27,037  1.8%   13,203
Glottocode                   text         27,037  0.0%   27,037  near_unique, one_word, short_text
ISO639P3code                 text         27,037  69.7%  8,180   near_unique, one_word, null_rate, short_text
Level                        categorical  27,037  0.0%   3
Countries                    categorical  27,037  66.4%  737     null_rate
Family_ID                    categorical  27,037  1.6%   297
Language_ID                  text         27,037  49.7%  3,110   one_word, null_rate, short_text, duplicates
Closest_ISO369P3code         text         27,037  21.3%  8,180   one_word, null_rate, short_text, duplicates
First_Year_Of_Documentation  numeric      27,037  99.2%  114     null_rate
Last_Year_Of_Documentation   numeric      27,037  96.0%  269     null_rate, high_skew, outliers
Is_Isolate                   categorical  27,037  68.1%  2       null_rate, imbalance
Fig 1.
Level · Roughly half of all entries are dialects, with languages and families making up the rest.
Top values for Level (3 unique shown, of 3 total).
value     count  share
dialect   13593  50.3%
language  8612   31.9%
family    4832   17.9%
Fig 2.
Macroarea · Eurasia and Africa dominate, together accounting for over half of the entries.
Top values for Macroarea (20 unique shown, of 30 total).
value                                                            count  share
Eurasia                                                          8060   29.8%
Africa                                                           8020   29.7%
Papunesia                                                        6326   23.4%
North America                                                    1782   6.6%
South America                                                    1524   5.6%
Australia                                                        919    3.4%
Africa;Eurasia                                                   29     0.1%
Eurasia;Papunesia                                                22     0.1%
Africa;Eurasia;North America;Papunesia;South America             18     0.1%
Africa;Australia;Eurasia;North America;Papunesia;South America   17     0.1%
North America;South America                                      15     0.1%
Eurasia;North America                                            12     0.0%
Africa;North America                                             12     0.0%
Eurasia;South America                                            11     0.0%
Eurasia;Papunesia;South America                                  8      0.0%
Africa;Eurasia;Papunesia;South America                           7      0.0%
Eurasia;North America;South America                              5      0.0%
Eurasia;North America;Papunesia;South America                    4      0.0%
Africa;Australia;Eurasia;North America;Papunesia                 3      0.0%
Papunesia;South America                                          3      0.0%
Fig 3.
Family_ID · Three large families (atla1278, aust1307, indo1319) together carry about 45% of the rows; 297 families are present in total.
Top values for Family_ID (20 unique shown, of 297 total).
value     count  share
atla1278  4861   18.0%
aust1307  4108   15.2%
indo1319  3173   11.7%
sino1245  1926   7.1%
afro1255  1458   5.4%
nucl1709  834    3.1%
pama1250  642    2.4%
aust1305  526    1.9%
otom1299  385    1.4%
book1242  382    1.4%
sign1238  343    1.3%
mand1469  322    1.2%
drav1251  281    1.0%
turk1311  273    1.0%
cent2225  267    1.0%
taik1256  261    1.0%
ural1272  236    0.9%
nilo1247  235    0.9%
nakh1245  190    0.7%
araw1281  188    0.7%
Fig 4.
Latitude · Shows the geographic spread of languages, skewed toward equatorial and northern latitudes.
Histogram bins for Latitude (median: 8.52697).
bin               count
-55.27 – -52.06   20
-52.06 – -48.85   6
-48.85 – -45.64   3
-45.64 – -42.43   11
-42.43 – -39.22   17
-39.22 – -36.01   51
-36.01 – -32.8    70
-32.8 – -29.59    90
-29.59 – -26.38   139
-26.38 – -23.17   276
-23.17 – -19.96   323
-19.96 – -16.75   446
-16.75 – -13.54   670
-13.54 – -10.33   665
-10.33 – -7.121   1558
-7.121 – -3.911   2186
-3.911 – -0.7005  1999
-0.7005 – 2.51    1102
2.51 – 5.72       1636
5.72 – 8.93       2277
8.93 – 12.14      2361
12.14 – 15.35     993
15.35 – 18.56     1013
18.56 – 21.77     696
21.77 – 24.98     956
24.98 – 28.19     1358
28.19 – 31.4      620
31.4 – 34.61      733
34.61 – 37.82     938
37.82 – 41.03     558
41.03 – 44.24     736
44.24 – 47.45     374
47.45 – 50.66     418
50.66 – 53.87     527
53.87 – 57.08     185
57.08 – 60.29     143
60.29 – 63.5      204
63.5 – 66.71      109
66.71 – 69.93     66
69.93 – 73.14     25
Fig 5.
Countries · Papua New Guinea, Indonesia, and Nigeria lead — useful context for where linguistic diversity concentrates.
Top values for Countries (20 unique shown, of 737 total).
value  count  share
PG     905    3.3%
ID     708    2.6%
NG     512    1.9%
AU     476    1.8%
IN     402    1.5%
MX     316    1.2%
CN     315    1.2%
BR     277    1.0%
US     255    0.9%
CM     205    0.8%
PH     188    0.7%
CD     162    0.6%
VU     129    0.5%
RU     104    0.4%
TZ     103    0.4%
PE     102    0.4%
MY     88     0.3%
TD     88     0.3%
NP     82     0.3%
CO     80     0.3%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column                       kind         null %
ID                           text         0.0%
Name                         text         0.0%
Macroarea                    categorical  0.8%
Latitude                     numeric      1.8%
Longitude                    numeric      1.8%
Glottocode                   text         0.0%
ISO639P3code                 text         69.7%
Level                        categorical  0.0%
Countries                    categorical  66.4%
Family_ID                    categorical  1.6%
Language_ID                  text         49.7%
Closest_ISO369P3code         text         21.3%
First_Year_Of_Documentation  numeric      99.2%
Last_Year_Of_Documentation   numeric      96.0%
Is_Isolate                   categorical  68.1%
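
Some of these null rates may be structural rather than random. The sketch below is a small arithmetic check using only figures reported in this notebook: it converts null counts into populated-row counts and compares them with the Level counts. The helper names are illustrative, and equal counts only suggest (they do not prove) that the populated rows are exactly the rows at that level; confirming it would need the raw rows.

```python
# Row count and per-column null counts, copied from the per-column stats in this report.
N_ROWS = 27_037
NULLS = {
    "ISO639P3code": 18_857,
    "Countries": 17_956,
    "Language_ID": 13_444,
    "Is_Isolate": 18_425,
}
# Level counts, copied from the "Top values for Level" table.
LEVEL_COUNTS = {"dialect": 13_593, "language": 8_612, "family": 4_832}


def populated(column):
    """Number of non-null rows in a column."""
    return N_ROWS - NULLS[column]


def matching_levels(column):
    """Levels whose row count equals the column's non-null count."""
    return [lvl for lvl, cnt in LEVEL_COUNTS.items() if cnt == populated(column)]


for col in NULLS:
    print(col, populated(col), matching_levels(col))
```

Under this reading, Language_ID is populated on as many rows as there are dialects and Is_Isolate on as many rows as there are languages, while ISO639P3code and Countries match no single level.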
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
                             Latitude  Longitude  First_Year_Of_Documentation  Last_Year_Of_Documentation
Latitude                     +1.00     -0.31      -0.06                        +0.04
Longitude                    -0.31     +1.00      +0.01                        -0.08
First_Year_Of_Documentation  -0.06     +0.01      +1.00                        +0.18
Last_Year_Of_Documentation   +0.04     -0.08      +0.18                        +1.00
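
The report does not say how nulls are handled when this matrix is computed. The sketch below shows one common convention, pairwise deletion, as an assumption rather than a description of Saturn's method: each coefficient uses only the rows where both columns are present. The function name and the toy values are illustrative and are not taken from the dataset.

```python
import math


def pearson_pairwise(xs, ys):
    """Pearson r over positions where both values are present (not None).

    Returns (r, n_pairs); r is NaN when fewer than two pairs remain
    or when either column is constant over the remaining pairs.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan"), n
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan"), n
    return cov / math.sqrt(var_x * var_y), n


# Toy values only: None marks a missing year.
first_year = [None, 1500, None, 1700, 1900, None]
last_year = [1950, 1600, None, 1800, 2000, 1990]
r, n_pairs = pearson_pairwise(first_year, last_year)
print(r, n_pairs)
```

If pairwise deletion is what was used, any coefficient involving First_Year_Of_Documentation rests on at most 215 populated rows, so the small values in that row and column deserve little weight.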

ID text identifier

Fixed-length 8-character single-token codes (len_min=len_max=8, word_mean=1.0) that are perfectly unique across all 27037 rows with zero nulls or duplicates. Sample values like 'cent1996' and 'chan1318' look like 4-letter prefix plus 4-digit suffix codes, consistent with Glottolog-style language identifiers rather than arbitrary surrogate keys.

Treatment: Use as the row key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
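
Before using the column as a join key, the fixed-shape claim can be checked directly. A minimal sketch, assuming a code is four lowercase letters or digits followed by four digits; this pattern is slightly looser than the "4-letter prefix plus 4-digit suffix" described above and is an assumption, not a rule taken from the report. The function names are illustrative.

```python
import re

# Assumed shape: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode_like(value):
    """True when the whole string matches the assumed 8-character shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None


def invalid_codes(values):
    """Return the values that do not match the assumed shape."""
    return [v for v in values if not is_glottocode_like(v)]


# The first two samples are quoted in the report; the last two are deliberately malformed.
sample = ["cent1996", "chan1318", "Cent1996", "cent199"]
print(invalid_codes(sample))
```

An empty result on the full column would support using it as a key; any non-empty result is worth inspecting before a join.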
Out[13]:

saturn.columns["ID"].stats

stat value
n 27,037
nulls 0 (0.0%)
unique 27,037
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 92.03
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for ID.
Character-length distribution for ID (mean: 8.0).
chars  count
8      27037

Name text identifier

`Name` is a fully unique short label column (27,037 rows, 27,037 distinct values, no nulls or duplicates), with a mean length of 10.4 characters and 66.7% of entries being a single word. The vocabulary of 18,126 tokens skews toward geographic and positional descriptors ('nuclear', 'central', 'western', 'northern', 'eastern', 'southern' lead the frequency list), which fits names of languages, dialects, and subgroups rather than personal names. The combination of perfect uniqueness and short, often one-word values flags it as an identifier-like label.

Treatment: Treat as a unique key; drop from modelling features or use only for joins and display.

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["Name"].stats

stat value
n 27,037
nulls 0 (0.0%)
unique 27,037
len_min 1
len_max 109
len_mean 10.44
len_median 8
len_p95 23
word_mean 1.444
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 18,126
readability_flesch_mean 29.91
emoji_rate 0
url_rate 0
one_word_rate 0.6675
allcaps_rate 0.0001479
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 66.7% rows are a single word
Fig 9.
Character-length distribution for Name.
Character-length distribution for Name (mean: 10.44).
chars      count
1 – 4      565
4 – 6      8233
6 – 9      6520
9 – 12     2380
12 – 14    3551
14 – 17    2357
17 – 20    937
20 – 23    1041
23 – 25    722
25 – 28    241
28 – 31    217
31 – 33    137
33 – 36    69
36 – 39    16
39 – 42    26
42 – 44    9
44 – 47    5
47 – 50    4
50 – 52    3
52 – 55    0
55 – 58    1
58 – 60    2
60 – 63    0
63 – 66    0
66 – 68    0
68 – 71    0
71 – 74    0
74 – 77    0
77 – 79    0
79 – 82    0
82 – 85    0
85 – 87    0
87 – 90    0
90 – 93    0
93 – 96    0
96 – 98    0
98 – 101   0
101 – 104  0
104 – 106  0
106 – 109  1

Macroarea categorical feature

Geographic macroarea label for each record, almost certainly tagging languages or populations by world region. Six canonical regions dominate (Eurasia 8060, Africa 8020, Papunesia 6326, North America 1782, South America 1524, Australia 919), but cardinality is 30 because some rows carry semicolon-joined multi-region strings like 'Africa;Eurasia' (29) or even all six regions concatenated (17). Null rate is low at 0.83% and entropy_ratio of 0.46 reflects the heavy Eurasia/Africa/Papunesia concentration (top_rate 0.30).

Treatment: Split the semicolon-delimited compound values into a multi-hot encoding over the six base regions before modelling.

anthropic:claude-opus-4-7 · confidence high
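
The multi-hot treatment above can be sketched in plain Python. The six base regions and the ';' separator come from the value table in this report; the function name and the choice to map nulls to an all-zero row are illustrative assumptions.

```python
# The six base regions listed in the report, in a fixed column order.
BASE_REGIONS = [
    "Africa", "Australia", "Eurasia",
    "North America", "Papunesia", "South America",
]


def multi_hot(value, regions=BASE_REGIONS, sep=";"):
    """Return one 0/1 flag per base region; None (null) maps to all zeros."""
    present = set() if value is None else {part.strip() for part in value.split(sep)}
    unknown = present - set(regions)
    if unknown:
        # Fail loudly rather than silently dropping an unexpected label.
        raise ValueError(f"unexpected region label(s): {sorted(unknown)}")
    return [1 if region in present else 0 for region in regions]


print(multi_hot("Africa;Eurasia"))  # [1, 0, 1, 0, 0, 0]
print(multi_hot("Papunesia"))       # [0, 0, 0, 0, 1, 0]
print(multi_hot(None))              # [0, 0, 0, 0, 0, 0]
```

With pandas, `Series.str.get_dummies(sep=";")` builds comparable indicator columns; a separate missing-value flag keeps nulls distinguishable from rows that simply match no region.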
Out[19]:

saturn.columns["Macroarea"].stats

stat value
n 27,037
nulls 224 (0.8%)
unique 30
top_value Eurasia
top_rate 0.3006
cardinality 30
entropy 2.271
entropy_ratio 0.4628
Fig 10.
Top values for Macroarea.
Top values for Macroarea (20 unique shown, of 30 total).
value                                                            count  share
Eurasia                                                          8060   29.8%
Africa                                                           8020   29.7%
Papunesia                                                        6326   23.4%
North America                                                    1782   6.6%
South America                                                    1524   5.6%
Australia                                                        919    3.4%
Africa;Eurasia                                                   29     0.1%
Eurasia;Papunesia                                                22     0.1%
Africa;Eurasia;North America;Papunesia;South America             18     0.1%
Africa;Australia;Eurasia;North America;Papunesia;South America   17     0.1%
North America;South America                                      15     0.1%
Eurasia;North America                                            12     0.0%
Africa;North America                                             12     0.0%
Eurasia;South America                                            11     0.0%
Eurasia;Papunesia;South America                                  8      0.0%
Africa;Eurasia;Papunesia;South America                           7      0.0%
Eurasia;North America;South America                              5      0.0%
Eurasia;North America;Papunesia;South America                    4      0.0%
Africa;Australia;Eurasia;North America;Papunesia                 3      0.0%
Papunesia;South America                                          3      0.0%

Latitude numeric feature

Geographic latitude in decimal degrees, spanning -55.2748 to 73.1354, which fits the global range. The distribution is mildly right-skewed (0.42) with a median of 8.52697, consistent with land mass concentrated in the Northern Hemisphere. About 1.77% of rows are null and only 48 outliers (0.18%) sit outside the IQR fence, so the column is largely clean.

Treatment: Pair with longitude for geospatial features; impute or drop the 1.77% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
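
The IQR fence behind the outlier count can be reproduced from the quartiles in this section. A minimal sketch, assuming the usual 1.5 × IQR rule and using the rounded quartiles shown in the stats (q1 = -3.747, q3 = 26), so the fences are approximate. The function name is illustrative.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded Latitude quartiles as reported in this section.
lat_low, lat_high = iqr_fences(-3.747, 26.0)
print(round(lat_low, 2), round(lat_high, 2))  # -48.37 70.62
```

Both fences fall inside the valid latitude range and inside the observed minimum and maximum (-55.27 and 73.14), so the flagged rows look like genuine far-southern and far-northern locations rather than coordinate errors.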
Out[22]:

saturn.columns["Latitude"].stats

stat value
n 27,037
nulls 479 (1.8%)
unique 13,231
min -55.27
max 73.14
mean 11.59
median 8.527
std 20.57
q1 -3.747
q3 26
iqr 29.75
skew 0.4211
kurtosis -0.1912
n_outliers 48
outlier_rate 0.001807
zero_rate 0
Fig 11.
Distribution of Latitude. Vertical dash marks the median.
Histogram bins for Latitude (median: 8.52697).
bin               count
-55.27 – -52.06   20
-52.06 – -48.85   6
-48.85 – -45.64   3
-45.64 – -42.43   11
-42.43 – -39.22   17
-39.22 – -36.01   51
-36.01 – -32.8    70
-32.8 – -29.59    90
-29.59 – -26.38   139
-26.38 – -23.17   276
-23.17 – -19.96   323
-19.96 – -16.75   446
-16.75 – -13.54   670
-13.54 – -10.33   665
-10.33 – -7.121   1558
-7.121 – -3.911   2186
-3.911 – -0.7005  1999
-0.7005 – 2.51    1102
2.51 – 5.72       1636
5.72 – 8.93       2277
8.93 – 12.14      2361
12.14 – 15.35     993
15.35 – 18.56     1013
18.56 – 21.77     696
21.77 – 24.98     956
24.98 – 28.19     1358
28.19 – 31.4      620
31.4 – 34.61      733
34.61 – 37.82     938
37.82 – 41.03     558
41.03 – 44.24     736
44.24 – 47.45     374
47.45 – 50.66     418
50.66 – 53.87     527
53.87 – 57.08     185
57.08 – 60.29     143
60.29 – 63.5      204
63.5 – 66.71      109
66.71 – 69.93     66
69.93 – 73.14     25

Longitude numeric feature

This column captures geographic longitude, with values spanning -178.785 to 179.43, essentially the full -180/180 range. The distribution is wide (std 74.05, IQR 110.17) and slightly left-skewed (-0.47), with 13,203 unique values across 27,037 rows and a 1.77% null rate. Only 51 values (0.19%) are flagged as outliers, which is expected since longitude is bounded.

Treatment: Pair with latitude for geospatial features; consider sin/cos encoding to handle the -180/180 wraparound.

anthropic:claude-opus-4-7 · confidence high
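
The sin/cos encoding mentioned in the treatment can be sketched as follows. It maps each longitude to a point on the unit circle, so that values just either side of the -180/180 seam end up close together. The function names are illustrative.

```python
import math


def encode_longitude(lon_deg):
    """Map a longitude in degrees to (sin, cos) of the angle in radians."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)


def encoded_distance(lon_a, lon_b):
    """Euclidean distance between two encoded longitudes (0 to 2)."""
    return math.dist(encode_longitude(lon_a), encode_longitude(lon_b))


# The rounded observed extremes are 358.2 degrees apart as raw numbers,
# but only 1.8 degrees apart on the globe.
print(encoded_distance(179.4, -178.8))
print(encoded_distance(179.4, 0.0))
```

The first distance is small and the second is close to the maximum of 2, which is the behaviour a raw degree value cannot give a model.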
Out[25]:

saturn.columns["Longitude"].stats

stat value
n 27,037
nulls 479 (1.8%)
unique 13,203
min -178.8
max 179.4
mean 51.82
median 44.07
std 74.05
q1 9.225
q3 119.4
iqr 110.2
skew -0.468
kurtosis -0.4518
n_outliers 51
outlier_rate 0.00192
zero_rate 0
Fig 12.
Distribution of Longitude. Vertical dash marks the median.
Histogram bins for Longitude (median: 44.065281).
bin               count
-178.8 – -169.8   25
-169.8 – -160.9   11
-160.9 – -151.9   23
-151.9 – -143     47
-143 – -134       39
-134 – -125.1     45
-125.1 – -116.1   357
-116.1 – -107.1   122
-107.1 – -98.19   212
-98.19 – -89.23   627
-89.23 – -80.28   151
-80.28 – -71.32   543
-71.32 – -62.36   558
-62.36 – -53.41   337
-53.41 – -44.45   144
-44.45 – -35.5    79
-35.5 – -26.54    0
-26.54 – -17.59   7
-17.59 – -8.632   378
-8.632 – 0.3229   1242
0.3229 – 9.278    1732
9.278 – 18.23     2769
18.23 – 27.19     1327
27.19 – 36.14     1659
36.14 – 45.1      1017
45.1 – 54.06      729
54.06 – 63.01     168
63.01 – 71.97     342
71.97 – 80.92     849
80.92 – 89.88     765
89.88 – 98.83     1092
98.83 – 107.8     1350
107.8 – 116.7     904
116.7 – 125.7     1659
125.7 – 134.7     883
134.7 – 143.6     1695
143.6 – 152.6     1807
152.6 – 161.5     330
161.5 – 170.5     469
170.5 – 179.4     65

Glottocode text identifier

This column holds Glottocodes—the standard 8-character identifiers used by the Glottolog language catalogue (e.g. 'cent1996', 'chan1318'). Every one of the 27,037 rows is unique with a fixed length of 8 and exactly one word, and there are no nulls or duplicates, so it functions as a primary key for languages/dialects. Nothing surprising in the distribution; it behaves exactly like a clean ID field.

Treatment: Use as a primary key to left-join against Glottolog metadata; do not feed into models as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["Glottocode"].stats

stat value
n 27,037
nulls 0 (0.0%)
unique 27,037
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 92.03
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 13.
Character-length distribution for Glottocode.
Character-length distribution for Glottocode (mean: 8.0).
chars  count
8      27037

ISO639P3code text foreign_key

This column holds ISO 639-3 language codes: exactly 3 characters, one word, every value lowercase alphabetic. It is 69.75% null, and the 8,180 populated values are all distinct, so each code maps to exactly one entry; that is consistent with a language-registry foreign key rather than a feature. There are no duplicates or empty strings among the populated rows.

Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and encode missingness explicitly.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["ISO639P3code"].stats

stat value
n 27,037
nulls 18,857 (69.7%)
unique 8,180
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 8,180
readability_flesch_mean 119.1
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: null_rate · 69.7% null
alert: short_text · 95th-percentile length under 20 chars
Fig 14.
Character-length distribution for ISO639P3code.
Character-length distribution for ISO639P3code (mean: 3.0).
chars  count
3      8180

Level categorical label

This is a low-cardinality categorical taxonomy field with exactly 3 levels: dialect, language, and family. Distribution is uneven but not pathological — dialect dominates at 50.3% (13,593 of 27,037), followed by language (8,612) and family (4,832), yielding entropy ratio 0.93. No nulls, suggesting a curated classification scheme likely from a linguistic dataset.

Treatment: one-hot or ordinal encode for modelling; safe to use as a stratification key.

anthropic:claude-opus-4-7 · confidence high
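
The entropy figures quoted for this column can be reproduced from the three counts. A minimal sketch, assuming entropy is measured in bits and the ratio divides by log2 of the number of categories; that definition is inferred because it matches the reported 1.468 and 0.9265, and it is not stated by the tool. The function names are illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# dialect, language, family counts from the Level table.
level_counts = [13593, 8612, 4832]
print(round(entropy_bits(level_counts), 3), round(entropy_ratio(level_counts), 4))
```

A ratio near 1 means the levels are close to evenly used, which is why the column is safe as a stratification key despite dialects making up half of the rows.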
Out[34]:

saturn.columns["Level"].stats

stat value
n 27,037
nulls 0 (0.0%)
unique 3
top_value dialect
top_rate 0.5028
cardinality 3
entropy 1.468
entropy_ratio 0.9265
Fig 15.
Top values for Level.
Top values for Level (3 unique shown, of 3 total).
value     count  share
dialect   13593  50.3%
language  8612   31.9%
family    4832   17.9%

Countries categorical feature

Two-letter ISO country codes, with 737 distinct values across 27,037 rows. Two-thirds of rows are null (null_rate 0.6641), and even among present values the distribution is broad (entropy_ratio 0.69) with PG topping out at just 9.97%. The presence of 737 distinct codes is surprising since ISO 3166-1 alpha-2 only defines ~250, suggesting multi-country concatenations or non-standard codes mixed in.

Treatment: Normalize/split non-standard codes, add an explicit missing indicator, then group rare levels before encoding.

anthropic:claude-opus-4-7 · confidence high
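
The split, missing-indicator, and rare-level steps in the treatment can be sketched in plain Python. The ';' separator is an assumption borrowed from the Macroarea column (this report never shows a multi-country value), and the function names, the `OTHER` label, and the toy cells are illustrative only.

```python
from collections import Counter


def split_codes(value, sep=";"):
    """Split a Countries cell into codes; None (null) becomes an empty list."""
    if value is None:
        return []
    return [code.strip() for code in value.split(sep) if code.strip()]


def group_rare(rows, min_count=2, other="OTHER"):
    """Replace codes seen fewer than min_count times with a shared label."""
    counts = Counter(code for codes in rows for code in codes)
    return [
        [code if counts[code] >= min_count else other for code in codes]
        for codes in rows
    ]


# Toy cells, not real rows from the dataset.
cells = ["PG", "PG;ID", None, "VU", "ID"]
rows = [split_codes(cell) for cell in cells]
missing = [cell is None for cell in cells]  # explicit missing indicator
print(group_rare(rows))
print(missing)
```

Keeping the missing flag separate from the code lists matters here because two-thirds of the column is null.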
Out[37]:

saturn.columns["Countries"].stats

stat value
n 27,037
nulls 17,956 (66.4%)
unique 737
top_value PG
top_rate 0.09966
cardinality 737
entropy 6.562
entropy_ratio 0.6888
alert: null_rate · 66.4% null
Fig 16.
Top values for Countries.
Top values for Countries (20 unique shown, of 737 total).
value  count  share
PG     905    3.3%
ID     708    2.6%
NG     512    1.9%
AU     476    1.8%
IN     402    1.5%
MX     316    1.2%
CN     315    1.2%
BR     277    1.0%
US     255    0.9%
CM     205    0.8%
PH     188    0.7%
CD     162    0.6%
VU     129    0.5%
RU     104    0.4%
TZ     103    0.4%
PE     102    0.4%
MY     88     0.3%
TD     88     0.3%
NP     82     0.3%
CO     80     0.3%

Family_ID categorical foreign_key

Family_ID holds Glottolog-style language family codes (e.g., atla1278, aust1307, indo1319), making it a categorical grouping key across 27,037 rows with 297 distinct families. The distribution is heavily skewed: the top family atla1278 alone covers 18.27% of non-null rows (18.0% of all rows), and the top three together account for about 45% of the data, yielding an entropy ratio of 0.60. Null rate is low at 1.59%.

Treatment: left-join on this id to a language-family reference, or group-by for stratified analysis.

anthropic:claude-opus-4-7 · confidence high
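
The two shares quoted for atla1278 (18.0% in the value table, 0.1827 as top_rate) differ only in the denominator. The sketch below is a worked check using counts from this report; the variable names are illustrative.

```python
N_ROWS = 27_037   # all rows
N_NULL = 429      # rows with no Family_ID
# Counts for the three largest families, from the Family_ID value table.
TOP_FAMILIES = [("atla1278", 4861), ("aust1307", 4108), ("indo1319", 3173)]

top1 = TOP_FAMILIES[0][1]
top3 = sum(count for _, count in TOP_FAMILIES)

share_all = top1 / N_ROWS                 # share of all rows
share_non_null = top1 / (N_ROWS - N_NULL)  # share of rows with a family
top3_share_all = top3 / N_ROWS

print(round(share_all, 4))       # 0.1798
print(round(share_non_null, 4))  # 0.1827
print(round(top3_share_all, 4))  # 0.4491
```

So top_rate appears to be computed over non-null rows, while the share column in the value table is computed over all rows.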
Out[40]:

saturn.columns["Family_ID"].stats

stat value
n 27,037
nulls 429 (1.6%)
unique 297
top_value atla1278
top_rate 0.1827
cardinality 297
entropy 4.938
entropy_ratio 0.6011
Fig 17.
Top values for Family_ID.
Top values for Family_ID (20 unique shown, of 297 total).
value     count  share
atla1278  4861   18.0%
aust1307  4108   15.2%
indo1319  3173   11.7%
sino1245  1926   7.1%
afro1255  1458   5.4%
nucl1709  834    3.1%
pama1250  642    2.4%
aust1305  526    1.9%
otom1299  385    1.4%
book1242  382    1.4%
sign1238  343    1.3%
mand1469  322    1.2%
drav1251  281    1.0%
turk1311  273    1.0%
cent2225  267    1.0%
taik1256  261    1.0%
ural1272  236    0.9%
nilo1247  235    0.9%
nakh1245  190    0.7%
araw1281  188    0.7%

Language_ID text foreign_key

This column holds 8-character single-token codes (len_min/max=8, one_word_rate=1.0) that look like Glottolog language identifiers (e.g., 'nucl1643', 'stan1293'). With 3,110 unique values across 13,593 populated rows and a 0.7712 duplicate rate, it behaves like a categorical foreign key into a language registry. Note that 49.72% of rows are null. The populated count equals the number of rows with Level = dialect (13,593), which suggests the field records the parent language of each dialect rather than being missing at random.

Treatment: Left-join on this id to a language reference table; treat missing as a separate category.

anthropic:claude-opus-4-7 · confidence high
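
The left-join in the treatment can be sketched without any library. This assumes that Language_ID refers to the Glottocode of another row in the same table, which the report does not confirm; the codes, names, function name, and `<none>` label below are all hypothetical.

```python
MISSING = "<none>"  # hypothetical category label for rows without a Language_ID


def attach_parent(rows, missing=MISSING):
    """Left-join each row to the row whose Glottocode equals its Language_ID."""
    name_by_code = {row["Glottocode"]: row["Name"] for row in rows}
    joined = []
    for row in rows:
        parent = row.get("Language_ID")
        joined.append({
            **row,
            # Missing becomes its own category instead of staying null.
            "Language_ID": missing if parent is None else parent,
            # Unmatched or missing parents yield None, as in a left join.
            "Parent_Name": None if parent is None else name_by_code.get(parent),
        })
    return joined


# Hypothetical rows for illustration only.
rows = [
    {"Glottocode": "aaaa0001", "Name": "Language A", "Language_ID": None},
    {"Glottocode": "aaaa0002", "Name": "Dialect A1", "Language_ID": "aaaa0001"},
    {"Glottocode": "aaaa0003", "Name": "Dialect X1", "Language_ID": "zzzz9999"},
]
joined = attach_parent(rows)
for row in joined:
    print(row["Name"], row["Language_ID"], row["Parent_Name"])
```

Rows whose Language_ID finds no match keep a None parent name, which makes dangling references easy to count after the join.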
Out[43]:

saturn.columns["Language_ID"].stats

stat value
n 27,037
nulls 13,444 (49.7%)
unique 3,110
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 10,483
duplicate_rate 0.7712
vocab_size 3,110
readability_flesch_mean 86.53
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: null_rate · 49.7% null
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 77.1% duplicate strings
Fig 18.
Character-length distribution for Language_ID.
Character-length distribution for Language_ID (mean: 8.0).
chars  count
8      13593

Closest_ISO369P3code text feature

This column holds ISO 639-3 three-letter language codes: every value is exactly 3 characters and one word (len_mean 3.0, one_word_rate 1.0), with 8180 unique codes led by jpn (120), eng (115), and pes (64). Notable signals: 21.28% nulls and a 61.57% duplicate rate (13103 duplicates), so coverage is partial but the field is a clean categorical.

Treatment: Treat as a categorical language code; impute or flag the 21% nulls and join to an ISO 639-3 reference table for names/families.

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["Closest_ISO369P3code"].stats

stat value
n 27,037
nulls 5,754 (21.3%)
unique 8,180
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 13,103
duplicate_rate 0.6157
vocab_size 7,877
readability_flesch_mean 117.4
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: null_rate · 21.3% null
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 61.6% duplicate strings
Fig 19.
Character-length distribution for Closest_ISO369P3code.
Character-length distribution for Closest_ISO369P3code (mean: 3.0).
chars  count
3      21283

First_Year_Of_Documentation numeric metadata

This column appears to record the earliest year an item was documented, spanning from -2100 (BCE) to 1932 CE with a median of 711. Severe nullity is the headline: 99.2% of the 27,037 rows are missing, leaving only ~215 populated values across 114 unique years. The wide IQR (-300 to 1710.5) and negative skew indicate a long tail into antiquity rather than a modern-era concentration.

Treatment: Drop or treat as a sparse indicator; too null to use as a feature without heavy imputation.

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["First_Year_Of_Documentation"].stats

stat value
n 27,037
nulls 26,822 (99.2%)
unique 114
min -2,100
max 1,932
mean 673.7
median 711
std 1055
q1 -300
q3 1710
iqr 2010
skew -0.4581
kurtosis -0.9206
n_outliers 0
outlier_rate 0
zero_rate 0
alert: null_rate · 99.2% null
Fig 20.
Distribution of First_Year_Of_Documentation. Vertical dash marks the median.
Histogram bins for First_Year_Of_Documentation (median: 711.0).
bin             count
-2100 – -1812   2
-1812 – -1524   4
-1524 – -1236   3
-1236 – -948    1
-948 – -660     14
-660 – -372     26
-372 – -84      12
-84 – 204       13
204 – 492       15
492 – 780       19
780 – 1068      12
1068 – 1356     10
1356 – 1644     19
1644 – 1932     65

Last_Year_Of_Documentation numeric timestamp

This appears to be the last year a record was documented, populated for only ~4% of rows (null_rate 0.9605). Values span -3100 to 2024 with a median of 1960, and the heavy left skew (-3.35) plus kurtosis of 12.3 yields 170 outliers (15.9% of non-null entries). The negative minimum suggests BCE-style dating, as in First_Year_Of_Documentation (minimum -2100), although sentinel values cannot be ruled out from the stats alone.

Treatment: Validate the year range before clipping anything, since negative values may be legitimate BCE years; treat the column as mostly-missing and impute or flag presence rather than relying on the raw value.

anthropic:claude-opus-4-7 · confidence high
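
Flagging presence instead of relying on the raw value can be sketched as below. It assumes negative years denote BCE dates, which the stats suggest but do not confirm; the function name and the returned tuple layout are illustrative.

```python
def year_features(year):
    """Return (has_year, era, abs_year) for a nullable integer year.

    Negative years are treated as BCE (an assumption); None means missing.
    """
    if year is None:
        return (False, None, None)
    era = "BCE" if year < 0 else "CE"
    return (True, era, abs(year))


# The reported minimum, median, and maximum, plus a missing value.
for value in (-3100, 1960, 2024, None):
    print(value, year_features(value))
```

The presence flag alone is usable on all 27,037 rows, while the era and year parts only carry information for the roughly 4% of rows that are populated.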
Out[52]:

saturn.columns["Last_Year_Of_Documentation"].stats

stat value
n 27,037
nulls 25,969 (96.0%)
unique 269
min -3,100
max 2,024
mean 1700
median 1,960
std 699.3
q1 1858
q3 1987
iqr 129.5
skew -3.345
kurtosis 12.32
n_outliers 170
outlier_rate 0.1592
zero_rate 0
alert: null_rate · 96.0% null
alert: high_skew · skew=-3.35
alert: outliers · 15.9% rows beyond 1.5 IQR
Fig 21.
Distribution of Last_Year_Of_Documentation. Vertical dash marks the median.
Histogram bins for Last_Year_Of_Documentation (median: 1960.0).
bin               count
-3100 – -2940     2
-2940 – -2780     0
-2780 – -2620     0
-2620 – -2460     1
-2460 – -2299     2
-2299 – -2139     1
-2139 – -1979     0
-1979 – -1819     0
-1819 – -1659     1
-1659 – -1499     1
-1499 – -1339     4
-1339 – -1178     2
-1178 – -1018     3
-1018 – -858.2    1
-858.2 – -698.1   2
-698.1 – -538     3
-538 – -377.9     9
-377.9 – -217.8   8
-217.8 – -57.62   14
-57.62 – 102.5    10
102.5 – 262.6     7
262.6 – 422.8     7
422.8 – 582.9     4
582.9 – 743       13
743 – 903.1       9
903.1 – 1063      9
1063 – 1223       17
1223 – 1384       7
1384 – 1544       18
1544 – 1704       21
1704 – 1864       96
1864 – 2024       796

Is_Isolate categorical feature

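Boolean flag indicating isolate status, present on only ~32% of the 27,037 rows (null_rate 0.6815). Among non-null values, 'False' dominates at 97.89% (8,430 rows) with just 182 'True' cases, yielding very low entropy (0.148). The 8,612 populated rows match the number of rows with Level = language, which suggests the flag is only defined for language-level entries.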

Treatment: Impute or add a missingness indicator; near-constant, so expect little predictive lift.

anthropic:claude-opus-4-7 · confidence high
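
One way to add the missingness indicator is to fold it into a three-state category, so that a null is never silently read as 'False'. A minimal sketch; the state labels and function name are illustrative, and the accepted inputs ('True'/'False' strings or Python booleans) are an assumption about how the column is stored.

```python
def isolate_state(value):
    """Map a nullable Is_Isolate cell to 'isolate', 'not_isolate', or 'missing'."""
    if value is None:
        return "missing"
    text = str(value).strip().lower()
    if text == "true":
        return "isolate"
    if text == "false":
        return "not_isolate"
    # Anything else is unexpected and should be inspected, not guessed at.
    raise ValueError(f"unexpected Is_Isolate value: {value!r}")


print([isolate_state(v) for v in ("True", "False", None, True)])
```

With 68.1% of rows in the 'missing' state, a model given only the raw boolean would otherwise treat most of the catalogue as non-isolates.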
Out[55]:

saturn.columns["Is_Isolate"].stats

stat value
n 27,037
nulls 18,425 (68.1%)
unique 2
top_value False
top_rate 0.9789
cardinality 2
entropy 0.1478
entropy_ratio 0.1478
alert: null_rate · 68.1% null
alert: imbalance · top value is 97.9% of rows
Fig 22.
Top values for Is_Isolate.
Top values for Is_Isolate (2 unique shown, of 2 total).
value  count  share
False  8430   31.2%
True   182    0.7%

How to cite

BibTeX
@misc{saturn-glottolog-languages-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: glottolog languages},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/glottolog_languages}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: glottolog languages. Source: /home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/glottolog_languages