glottolog_languages · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet

Saturn profiled 27,037 rows across 15 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet",
    "--findings", "glottolog_languages.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset is a Glottolog catalogue of 27,037 entries in 15 columns covering identifiers (Glottocode, ISO codes), geography (Latitude, Longitude, Countries, Macroarea), classification (Family_ID, Level, Is_Isolate), and documentation years. Level splits the catalogue into dialects (50.3%), languages (31.9%), and families (17.9%). Macroarea is led by Eurasia (29.8%) and Africa (29.7%), with Papunesia (23.4%) close behind. Family_ID is concentrated in a few large families: atla1278, aust1307, and indo1319 together hold about 45% of rows across 297 families. The documentation-year fields are almost entirely null (Last_Year_Of_Documentation 96.0%, First_Year_Of_Documentation 99.2%) and Is_Isolate is null for 68.1% of rows, so treat those columns with caution. Latitude and Longitude are 98.2% populated and would support mapping work.

citing: Level · Macroarea · Family_ID · Countries · Is_Isolate · Last_Year_Of_Documentation · First_Year_Of_Documentation · Latitude · Longitude · Name

Out[4]:

saturn.schema() · 15 columns

column	kind	n	null%	unique	alerts
ID	text	27,037	0.0%	27,037	near_unique one_word short_text
Name	text	27,037	0.0%	27,037	near_unique one_word
Macroarea	categorical	27,037	0.8%	30
Latitude	numeric	27,037	1.8%	13,231
Longitude	numeric	27,037	1.8%	13,203
Glottocode	text	27,037	0.0%	27,037	near_unique one_word short_text
ISO639P3code	text	27,037	69.7%	8,180	near_unique one_word null_rate short_text
Level	categorical	27,037	0.0%	3
Countries	categorical	27,037	66.4%	737	null_rate
Family_ID	categorical	27,037	1.6%	297
Language_ID	text	27,037	49.7%	3,110	one_word null_rate short_text duplicates
Closest_ISO369P3code	text	27,037	21.3%	8,180	one_word null_rate short_text duplicates
First_Year_Of_Documentation	numeric	27,037	99.2%	114	null_rate
Last_Year_Of_Documentation	numeric	27,037	96.0%	269	null_rate high_skew outliers
Is_Isolate	categorical	27,037	68.1%	2	null_rate imbalance
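
The null% and unique columns in the schema can be cross-checked without Saturn. The following is a minimal sketch with pandas, not Saturn's own implementation; the helper name `null_and_unique` is hypothetical and the toy frame stands in for the real parquet file.

```python
import pandas as pd


def null_and_unique(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column row count, null percentage, and distinct non-null values."""
    return pd.DataFrame({
        "n": len(df),
        "null_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(dropna=True),
    })


# Toy stand-in with invented rows; for the real data, load the parquet file
# with pd.read_parquet(...) and pass the resulting frame instead.
toy = pd.DataFrame({
    "Glottocode": ["aaaa1234", "bbbb1234", "cccc1234", "dddd1234"],
    "Level": ["dialect", "dialect", "language", "family"],
    "Is_Isolate": [None, None, False, None],
})
summary = null_and_unique(toy)
print(summary)
```

Run against the real frame, the `null_pct` and `unique` values should line up with the schema table if Saturn counts nulls and distinct values the same way.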

Fig 1.

Level · Roughly half of all entries are dialects, with languages and families making up the rest.

Top values for Level (3 unique shown, of 3 total).
value	count	share
dialect	13593	50.3%
language	8612	31.9%
family	4832	17.9%

Fig 2.

Macroarea · Eurasia and Africa dominate, together accounting for over half of the entries.

Top values for Macroarea (20 unique shown, of 30 total).
value	count	share
Eurasia	8060	29.8%
Africa	8020	29.7%
Papunesia	6326	23.4%
North America	1782	6.6%
South America	1524	5.6%
Australia	919	3.4%
Africa;Eurasia	29	0.1%
Eurasia;Papunesia	22	0.1%
Africa;Eurasia;North America;Papunesia;South America	18	0.1%
Africa;Australia;Eurasia;North America;Papunesia;South America	17	0.1%
North America;South America	15	0.1%
Eurasia;North America	12	0.0%
Africa;North America	12	0.0%
Eurasia;South America	11	0.0%
Eurasia;Papunesia;South America	8	0.0%
Africa;Eurasia;Papunesia;South America	7	0.0%
Eurasia;North America;South America	5	0.0%
Eurasia;North America;Papunesia;South America	4	0.0%
Africa;Australia;Eurasia;North America;Papunesia	3	0.0%
Papunesia;South America	3	0.0%

Fig 3.

Family_ID · Three large families (atla1278, aust1307, indo1319) together carry about 45% of rows; there are 297 families in total.

Top values for Family_ID (20 unique shown, of 297 total).
value	count	share
atla1278	4861	18.0%
aust1307	4108	15.2%
indo1319	3173	11.7%
sino1245	1926	7.1%
afro1255	1458	5.4%
nucl1709	834	3.1%
pama1250	642	2.4%
aust1305	526	1.9%
otom1299	385	1.4%
book1242	382	1.4%
sign1238	343	1.3%
mand1469	322	1.2%
drav1251	281	1.0%
turk1311	273	1.0%
cent2225	267	1.0%
taik1256	261	1.0%
ural1272	236	0.9%
nilo1247	235	0.9%
nakh1245	190	0.7%
araw1281	188	0.7%

Fig 4.

Latitude · Shows the geographic spread of languages, skewed toward equatorial and northern latitudes.

Histogram bins for Latitude (median: 8.52697).
bin	count
-55.27 – -52.06	20
-52.06 – -48.85	6
-48.85 – -45.64	3
-45.64 – -42.43	11
-42.43 – -39.22	17
-39.22 – -36.01	51
-36.01 – -32.8	70
-32.8 – -29.59	90
-29.59 – -26.38	139
-26.38 – -23.17	276
-23.17 – -19.96	323
-19.96 – -16.75	446
-16.75 – -13.54	670
-13.54 – -10.33	665
-10.33 – -7.121	1558
-7.121 – -3.911	2186
-3.911 – -0.7005	1999
-0.7005 – 2.51	1102
2.51 – 5.72	1636
5.72 – 8.93	2277
8.93 – 12.14	2361
12.14 – 15.35	993
15.35 – 18.56	1013
18.56 – 21.77	696
21.77 – 24.98	956
24.98 – 28.19	1358
28.19 – 31.4	620
31.4 – 34.61	733
34.61 – 37.82	938
37.82 – 41.03	558
41.03 – 44.24	736
44.24 – 47.45	374
47.45 – 50.66	418
50.66 – 53.87	527
53.87 – 57.08	185
57.08 – 60.29	143
60.29 – 63.5	204
63.5 – 66.71	109
66.71 – 69.93	66
69.93 – 73.14	25

Fig 5.

Countries · Papua New Guinea, Indonesia, and Nigeria lead — useful context for where linguistic diversity concentrates.

Top values for Countries (20 unique shown, of 737 total).
value	count	share
PG	905	3.3%
ID	708	2.6%
NG	512	1.9%
AU	476	1.8%
IN	402	1.5%
MX	316	1.2%
CN	315	1.2%
BR	277	1.0%
US	255	0.9%
CM	205	0.8%
PH	188	0.7%
CD	162	0.6%
VU	129	0.5%
RU	104	0.4%
TZ	103	0.4%
PE	102	0.4%
MY	88	0.3%
TD	88	0.3%
NP	82	0.3%
CO	80	0.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
ID	text	0.0%
Name	text	0.0%
Macroarea	categorical	0.8%
Latitude	numeric	1.8%
Longitude	numeric	1.8%
Glottocode	text	0.0%
ISO639P3code	text	69.7%
Level	categorical	0.0%
Countries	categorical	66.4%
Family_ID	categorical	1.6%
Language_ID	text	49.7%
Closest_ISO369P3code	text	21.3%
First_Year_Of_Documentation	numeric	99.2%
Last_Year_Of_Documentation	numeric	96.0%
Is_Isolate	categorical	68.1%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	Latitude	Longitude	First_Year_Of_Documentation	Last_Year_Of_Documentation
Latitude	+1.00	-0.31	-0.06	+0.04
Longitude	-0.31	+1.00	+0.01	-0.08
First_Year_Of_Documentation	-0.06	+0.01	+1.00	+0.18
Last_Year_Of_Documentation	+0.04	-0.08	+0.18	+1.00

ID text identifier

Fixed-length 8-character single-token codes (len_min=len_max=8, word_mean=1.0) that are perfectly unique across all 27037 rows with zero nulls or duplicates. Sample values like 'cent1996' and 'chan1318' look like 4-letter prefix plus 4-digit suffix codes, consistent with Glottolog-style language identifiers rather than arbitrary surrogate keys.

Treatment: Use as the row key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["ID"].stats

stat	value
n	27,037
nulls	0 (0.0%)
unique	27,037
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	92.03
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for ID.

Character-length distribution for ID (mean: 8.0).
chars	count
8	27037

Name text identifier

`Name` is a fully unique short label column (27,037 rows, 27,037 distinct values, no nulls or duplicates), with a mean length of 10.4 characters; 66.7% of entries are a single word. The vocabulary of 18,126 tokens is led by geographic and positional descriptors ('nuclear', 'central', 'western', 'northern', 'eastern', 'southern'), which fits names of languages, dialects, and family subgroups rather than personal names. Perfect uniqueness combined with short, often one-word values marks it as an identifier-like label.

Treatment: Treat as a unique key; drop from modelling features or use only for joins and display.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["Name"].stats

stat	value
n	27,037
nulls	0 (0.0%)
unique	27,037
len_min	1
len_max	109
len_mean	10.44
len_median	8
len_p95	23
word_mean	1.444
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	18,126
readability_flesch_mean	29.91
emoji_rate	0
url_rate	0
one_word_rate	0.6675
allcaps_rate	0.0001479
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	66.7% rows are a single word

Fig 9.

Character-length distribution for Name.

Character-length distribution for Name (mean: 10.44).
chars	count
1 – 4	565
4 – 6	8233
6 – 9	6520
9 – 12	2380
12 – 14	3551
14 – 17	2357
17 – 20	937
20 – 23	1041
23 – 25	722
25 – 28	241
28 – 31	217
31 – 33	137
33 – 36	69
36 – 39	16
39 – 42	26
42 – 44	9
44 – 47	5
47 – 50	4
50 – 52	3
52 – 55	0
55 – 58	1
58 – 60	2
60 – 63	0
63 – 66	0
66 – 68	0
68 – 71	0
71 – 74	0
74 – 77	0
77 – 79	0
79 – 82	0
82 – 85	0
85 – 87	0
87 – 90	0
90 – 93	0
93 – 96	0
96 – 98	0
98 – 101	0
101 – 104	0
104 – 106	0
106 – 109	1

Macroarea categorical feature

Glottolog macroarea label giving the world region of each entry. Six base regions dominate (Eurasia 8,060, Africa 8,020, Papunesia 6,326, North America 1,782, South America 1,524, Australia 919), but cardinality is 30 because some rows carry semicolon-joined multi-region strings such as 'Africa;Eurasia' (29) or all six regions concatenated (17). Null rate is low at 0.83%, and the entropy_ratio of 0.46 reflects the heavy Eurasia/Africa/Papunesia concentration (top_rate 0.30 among non-null rows).

Treatment: Split the semicolon-delimited compound values into a multi-hot encoding over the six base regions before modelling.

anthropic:claude-opus-4-7 · confidence high
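
The multi-hot treatment suggested for Macroarea can be sketched with pandas. This is an illustration on toy values, assuming the semicolon is the only delimiter, as the top-values table suggests; the `Macroarea_missing` column name is hypothetical.

```python
import pandas as pd

# Toy values in the same shape as the real column, including one null.
macroarea = pd.Series(
    ["Eurasia", "Africa;Eurasia", None, "Papunesia", "Africa;Eurasia"],
    name="Macroarea",
)

# One 0/1 column per base region; a compound value sets several columns at once.
multi_hot = macroarea.str.get_dummies(sep=";")

# str.get_dummies leaves null rows all-zero, so keep an explicit missing flag.
multi_hot["Macroarea_missing"] = macroarea.isna().astype(int)
print(multi_hot)
```

On the real column this should yield six region columns plus the missing flag, replacing the 30 raw string levels.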

Out[19]:

saturn.columns["Macroarea"].stats

stat	value
n	27,037
nulls	224 (0.8%)
unique	30
top_value	Eurasia
top_rate	0.3006
cardinality	30
entropy	2.271
entropy_ratio	0.4628

Fig 10.

Top values for Macroarea.

Top values for Macroarea (20 unique shown, of 30 total).
value	count	share
Eurasia	8060	29.8%
Africa	8020	29.7%
Papunesia	6326	23.4%
North America	1782	6.6%
South America	1524	5.6%
Australia	919	3.4%
Africa;Eurasia	29	0.1%
Eurasia;Papunesia	22	0.1%
Africa;Eurasia;North America;Papunesia;South America	18	0.1%
Africa;Australia;Eurasia;North America;Papunesia;South America	17	0.1%
North America;South America	15	0.1%
Eurasia;North America	12	0.0%
Africa;North America	12	0.0%
Eurasia;South America	11	0.0%
Eurasia;Papunesia;South America	8	0.0%
Africa;Eurasia;Papunesia;South America	7	0.0%
Eurasia;North America;South America	5	0.0%
Eurasia;North America;Papunesia;South America	4	0.0%
Africa;Australia;Eurasia;North America;Papunesia	3	0.0%
Papunesia;South America	3	0.0%

Latitude numeric feature

Geographic latitude in decimal degrees, spanning -55.2748 to 73.1354, which fits the global range. The distribution is mildly right-skewed (0.42) with a median of 8.52697, consistent with land mass concentrated in the Northern Hemisphere. About 1.77% of rows are null and only 48 outliers (0.18%) sit outside the IQR fence, so the column is largely clean.

Treatment: Pair with longitude for geospatial features; impute or drop the 1.77% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["Latitude"].stats

stat	value
n	27,037
nulls	479 (1.8%)
unique	13,231
min	-55.27
max	73.14
mean	11.59
median	8.527
std	20.57
q1	-3.747
q3	26
iqr	29.75
skew	0.4211
kurtosis	-0.1912
n_outliers	48
outlier_rate	0.001807
zero_rate	0
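
The outlier count depends on where the IQR fence falls. The sketch below computes Tukey fences from the quartiles in the table, assuming Saturn applies the 1.5 × IQR rule that its outlier alert names elsewhere in this report; the function name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the Latitude stats table (already rounded there).
lower, upper = iqr_fences(q1=-3.747, q3=26.0)
print(f"{lower:.2f} to {upper:.2f}")
```

With these quartiles the fences land near -48.4 and 70.6 degrees, so the flagged rows sit in the far southern and far northern tails of the histogram.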

Fig 11.

Distribution of Latitude. Vertical dash marks the median.

Histogram bins for Latitude (median: 8.52697).
bin	count
-55.27 – -52.06	20
-52.06 – -48.85	6
-48.85 – -45.64	3
-45.64 – -42.43	11
-42.43 – -39.22	17
-39.22 – -36.01	51
-36.01 – -32.8	70
-32.8 – -29.59	90
-29.59 – -26.38	139
-26.38 – -23.17	276
-23.17 – -19.96	323
-19.96 – -16.75	446
-16.75 – -13.54	670
-13.54 – -10.33	665
-10.33 – -7.121	1558
-7.121 – -3.911	2186
-3.911 – -0.7005	1999
-0.7005 – 2.51	1102
2.51 – 5.72	1636
5.72 – 8.93	2277
8.93 – 12.14	2361
12.14 – 15.35	993
15.35 – 18.56	1013
18.56 – 21.77	696
21.77 – 24.98	956
24.98 – 28.19	1358
28.19 – 31.4	620
31.4 – 34.61	733
34.61 – 37.82	938
37.82 – 41.03	558
41.03 – 44.24	736
44.24 – 47.45	374
47.45 – 50.66	418
50.66 – 53.87	527
53.87 – 57.08	185
57.08 – 60.29	143
60.29 – 63.5	204
63.5 – 66.71	109
66.71 – 69.93	66
69.93 – 73.14	25

Longitude numeric feature

This column captures geographic longitude, with values spanning -178.785 to 179.43, essentially the full -180/180 globe. The distribution is wide (std 74.05, IQR 110.17) and slightly left-skewed (-0.47), with 13,203 unique values across 27,037 rows and a 1.77% null rate. Only 51 values (0.19%) are flagged as outliers, which is expected since longitude is bounded.

Treatment: Pair with latitude for geospatial features; consider sin/cos encoding to handle the -180/180 wraparound.

anthropic:claude-opus-4-7 · confidence high
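
The sin/cos encoding mentioned in the treatment can be sketched as follows. This is a minimal illustration, not a prescribed feature pipeline; the function name `encode_longitude` is hypothetical, and the two test points are the rounded min and max from the stats table.

```python
import math


def encode_longitude(lon_deg: float) -> tuple[float, float]:
    """Map a longitude in degrees to (sin, cos) of the angle in radians."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)


# The raw values differ by 358.2 degrees but are neighbours on the globe.
west = encode_longitude(-178.8)
east = encode_longitude(179.4)
gap = math.dist(west, east)
print(gap)
```

In the encoded space the two points end up close together, which is the property a distance-based model needs across the -180/180 seam.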

Out[25]:

saturn.columns["Longitude"].stats

stat	value
n	27,037
nulls	479 (1.8%)
unique	13,203
min	-178.8
max	179.4
mean	51.82
median	44.07
std	74.05
q1	9.225
q3	119.4
iqr	110.2
skew	-0.468
kurtosis	-0.4518
n_outliers	51
outlier_rate	0.00192
zero_rate	0

Fig 12.

Distribution of Longitude. Vertical dash marks the median.

Histogram bins for Longitude (median: 44.065281).
bin	count
-178.8 – -169.8	25
-169.8 – -160.9	11
-160.9 – -151.9	23
-151.9 – -143	47
-143 – -134	39
-134 – -125.1	45
-125.1 – -116.1	357
-116.1 – -107.1	122
-107.1 – -98.19	212
-98.19 – -89.23	627
-89.23 – -80.28	151
-80.28 – -71.32	543
-71.32 – -62.36	558
-62.36 – -53.41	337
-53.41 – -44.45	144
-44.45 – -35.5	79
-35.5 – -26.54	0
-26.54 – -17.59	7
-17.59 – -8.632	378
-8.632 – 0.3229	1242
0.3229 – 9.278	1732
9.278 – 18.23	2769
18.23 – 27.19	1327
27.19 – 36.14	1659
36.14 – 45.1	1017
45.1 – 54.06	729
54.06 – 63.01	168
63.01 – 71.97	342
71.97 – 80.92	849
80.92 – 89.88	765
89.88 – 98.83	1092
98.83 – 107.8	1350
107.8 – 116.7	904
116.7 – 125.7	1659
125.7 – 134.7	883
134.7 – 143.6	1695
143.6 – 152.6	1807
152.6 – 161.5	330
161.5 – 170.5	469
170.5 – 179.4	65

Glottocode text identifier

This column holds Glottocodes, the standard 8-character identifiers used by the Glottolog language catalogue (e.g. 'cent1996', 'chan1318'). Every one of the 27,037 values is unique, exactly 8 characters long, and a single token, with no nulls or duplicates, so it functions as a primary key for every entry (dialect, language, or family). Nothing surprising in the distribution; it behaves exactly like a clean ID field.

Treatment: Use as a primary key to left-join against Glottolog metadata; do not feed into models as a feature.

anthropic:claude-opus-4-7 · confidence high
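
A key-based left join of this kind can be guarded so that it fails loudly if uniqueness ever breaks. The sketch below uses pandas on toy rows: the `reference` table and its `note` column are hypothetical, and the Level values are invented for illustration.

```python
import pandas as pd

languages = pd.DataFrame({
    "Glottocode": ["cent1996", "chan1318"],
    "Level": ["language", "dialect"],  # invented for this example
})
# Hypothetical reference table keyed on the same codes.
reference = pd.DataFrame({
    "Glottocode": ["cent1996", "chan1318"],
    "note": ["first", "second"],
})

# Fail fast if the key ever stops being unique.
assert languages["Glottocode"].is_unique

# validate="one_to_one" makes pandas raise MergeError on duplicate keys on either side.
joined = languages.merge(reference, on="Glottocode", how="left", validate="one_to_one")
print(joined)
```

The `validate` argument costs little and catches accidental row multiplication when the reference table carries repeated keys.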

Out[28]:

saturn.columns["Glottocode"].stats

stat	value
n	27,037
nulls	0 (0.0%)
unique	27,037
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	92.03
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 13.

Character-length distribution for Glottocode.

Character-length distribution for Glottocode (mean: 8.0).
chars	count
8	27037

ISO639P3code text foreign_key

This column holds ISO 639-3 language codes: exactly 3 characters, one word, every value lowercase alphabetic. It is 69.75% null. The 8,180 populated values are all distinct (n_duplicates 0), so each code maps to exactly one row, which fits a language-registry foreign key rather than a feature.

Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and encode missingness explicitly.

anthropic:claude-opus-4-7 · confidence high
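
Encoding missingness explicitly while joining to a reference table might look like the sketch below. It assumes pandas; the placeholder Glottocodes, the two-row `iso_reference` table, and the `has_iso` and `Ref_Name` column names are all hypothetical.

```python
import pandas as pd

languages = pd.DataFrame({
    "Glottocode": ["aaaa1111", "bbbb2222", "cccc3333"],  # placeholder codes
    "ISO639P3code": ["jpn", None, "eng"],
})
# Hypothetical two-row stand-in for an ISO 639-3 reference table.
iso_reference = pd.DataFrame({
    "ISO639P3code": ["jpn", "eng"],
    "Ref_Name": ["Japanese", "English"],
})

# Record presence before the join so the signal survives later imputation.
languages["has_iso"] = languages["ISO639P3code"].notna()

# indicator=True adds a _merge column showing which rows found a match.
joined = languages.merge(iso_reference, on="ISO639P3code", how="left", indicator=True)
print(joined)
```

Rows without a code come back as `left_only`, which separates "no ISO code assigned" from "code present but absent from the reference".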

Out[31]:

saturn.columns["ISO639P3code"].stats

stat	value
n	27,037
nulls	18,857 (69.7%)
unique	8,180
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	8,180
readability_flesch_mean	119.1
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: null_rate	69.7% null
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for ISO639P3code.

Character-length distribution for ISO639P3code (mean: 3.0).
chars	count
3	8180

Level categorical label

This is a low-cardinality categorical taxonomy field with exactly 3 levels: dialect, language, and family. Distribution is uneven but not pathological — dialect dominates at 50.3% (13,593 of 27,037), followed by language (8,612) and family (4,832), yielding entropy ratio 0.93. No nulls, suggesting a curated classification scheme likely from a linguistic dataset.

Treatment: One-hot or ordinal encode for modelling; safe to use as a stratification key.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["Level"].stats

stat	value
n	27,037
nulls	0 (0.0%)
unique	3
top_value	dialect
top_rate	0.5028
cardinality	3
entropy	1.468
entropy_ratio	0.9265
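
The entropy figures can be checked by hand from the value counts. The sketch below assumes `entropy` is Shannon entropy in bits and `entropy_ratio` divides it by log2 of the cardinality; the reported numbers are consistent with that reading, but Saturn's exact definition is not stated here. Function names are hypothetical.

```python
import math


def entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: list[int]) -> float:
    """Entropy divided by its maximum, log2 of the number of observed levels."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts for dialect, language, family from the Level table.
level_counts = [13593, 8612, 4832]
h = entropy_bits(level_counts)
ratio = entropy_ratio(level_counts)
print(round(h, 3), round(ratio, 4))
```

A ratio near 1 means the levels are close to evenly used; values near 0 point to a near-constant column.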

Fig 15.

Top values for Level.

Top values for Level (3 unique shown, of 3 total).
value	count	share
dialect	13593	50.3%
language	8612	31.9%
family	4832	17.9%

Countries categorical feature

Two-letter ISO country codes, with 737 distinct values across 27,037 rows. About two-thirds of rows are null (null_rate 0.6641), and the populated values are spread broadly (entropy_ratio 0.69): the top value, PG, accounts for just 9.97% of non-null rows (3.3% of all rows). The 737 distinct values are surprising because ISO 3166-1 alpha-2 defines only about 250 codes, which suggests multi-country concatenations or non-standard codes are mixed in.

Treatment: Normalize/split non-standard codes, add an explicit missing indicator, then group rare levels before encoding.

anthropic:claude-opus-4-7 · confidence high
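
The missing-indicator and rare-level steps of this treatment can be sketched in plain Python. The helper name `group_rare`, the `OTHER` and `MISSING` labels, and the threshold are all hypothetical choices; splitting compound values is left out because the delimiter is not confirmed by the profile.

```python
from collections import Counter
from typing import Optional


def group_rare(values: list[Optional[str]], min_count: int = 2,
               other: str = "OTHER", missing: str = "MISSING") -> list[str]:
    """Replace nulls with a missing label and infrequent levels with a catch-all."""
    counts = Counter(v for v in values if v is not None)
    grouped = []
    for v in values:
        if v is None:
            grouped.append(missing)
        elif counts[v] < min_count:
            grouped.append(other)
        else:
            grouped.append(v)
    return grouped


# Toy values; None stands for a null row.
countries = ["PG", "PG", "ID", None, "NG", "ID", None]
grouped = group_rare(countries)
is_missing = [v is None for v in countries]
print(grouped)
```

On the real column a higher `min_count` would be needed, since 737 levels share only about a third of the rows.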

Out[37]:

saturn.columns["Countries"].stats

stat	value
n	27,037
nulls	17,956 (66.4%)
unique	737
top_value	PG
top_rate	0.09966
cardinality	737
entropy	6.562
entropy_ratio	0.6888
alert: null_rate	66.4% null

Fig 16.

Top values for Countries.

Top values for Countries (20 unique shown, of 737 total).
value	count	share
PG	905	3.3%
ID	708	2.6%
NG	512	1.9%
AU	476	1.8%
IN	402	1.5%
MX	316	1.2%
CN	315	1.2%
BR	277	1.0%
US	255	0.9%
CM	205	0.8%
PH	188	0.7%
CD	162	0.6%
VU	129	0.5%
RU	104	0.4%
TZ	103	0.4%
PE	102	0.4%
MY	88	0.3%
TD	88	0.3%
NP	82	0.3%
CO	80	0.3%

Family_ID categorical foreign_key

Family_ID holds Glottolog-style language family codes (e.g., atla1278, aust1307, indo1319), making it a categorical grouping key with 297 distinct families across 27,037 rows. The distribution is heavily skewed: the top family, atla1278, covers 18.27% of non-null rows (18.0% of all rows), and the top three together hold about 45% of all rows, yielding an entropy ratio of 0.60. Null rate is low at 1.59%.

Treatment: Left-join on this id to a language-family reference, or group by it for stratified analysis.

anthropic:claude-opus-4-7 · confidence high

Out[40]:

saturn.columns["Family_ID"].stats

stat	value
n	27,037
nulls	429 (1.6%)
unique	297
top_value	atla1278
top_rate	0.1827
cardinality	297
entropy	4.938
entropy_ratio	0.6011

Fig 17.

Top values for Family_ID.

Top values for Family_ID (20 unique shown, of 297 total).
value	count	share
atla1278	4861	18.0%
aust1307	4108	15.2%
indo1319	3173	11.7%
sino1245	1926	7.1%
afro1255	1458	5.4%
nucl1709	834	3.1%
pama1250	642	2.4%
aust1305	526	1.9%
otom1299	385	1.4%
book1242	382	1.4%
sign1238	343	1.3%
mand1469	322	1.2%
drav1251	281	1.0%
turk1311	273	1.0%
cent2225	267	1.0%
taik1256	261	1.0%
ural1272	236	0.9%
nilo1247	235	0.9%
nakh1245	190	0.7%
araw1281	188	0.7%

Language_ID text foreign_key

This column holds 8-character single-token codes (len_min/max=8, one_word_rate=1.0) that look like Glottolog language identifiers (e.g., 'nucl1643', 'stan1293'). With 3,110 unique values and a 77.12% duplicate rate among populated rows, it behaves like a categorical foreign key into a language registry. 49.72% of rows are null. The 13,593 populated rows equal the dialect count in Level, which suggests the field may link each dialect to its parent language; the profile alone cannot confirm that.

Treatment: Left-join on this id to a language reference table; treat missing as a separate category.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["Language_ID"].stats

stat	value
n	27,037
nulls	13,444 (49.7%)
unique	3,110
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	10,483
duplicate_rate	0.7712
vocab_size	3,110
readability_flesch_mean	86.53
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	49.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	77.1% duplicate strings

Fig 18.

Character-length distribution for Language_ID.

Character-length distribution for Language_ID (mean: 8.0).
chars	count
8	13593

Closest_ISO369P3code text feature

This column holds ISO 639-3 three-letter language codes: every value is exactly 3 characters and one word (len_mean 3.0, one_word_rate 1.0), with 8180 unique codes led by jpn (120), eng (115), and pes (64). Notable signals: 21.28% nulls and a 61.57% duplicate rate (13103 duplicates), so coverage is partial but the field is a clean categorical.

Treatment: Treat as a categorical language code; impute or flag the 21% nulls and join to an ISO 639-3 reference table for names/families.

anthropic:claude-opus-4-7 · confidence high

Out[46]:

saturn.columns["Closest_ISO369P3code"].stats

stat	value
n	27,037
nulls	5,754 (21.3%)
unique	8,180
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	13,103
duplicate_rate	0.6157
vocab_size	7,877
readability_flesch_mean	117.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	21.3% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	61.6% duplicate strings

Fig 19.

Character-length distribution for Closest_ISO369P3code.

Character-length distribution for Closest_ISO369P3code (mean: 3.0).
chars	count
3	21283

First_Year_Of_Documentation numeric metadata

This column appears to record the earliest year a language was documented, spanning -2100 (BCE) to 1932 CE with a median of 711. Severe nullity is the headline: 99.2% of the 27,037 rows are missing, leaving only 215 populated values across 114 unique years. The wide IQR (-300 to 1710.5) and negative skew (-0.46) indicate a long tail into antiquity rather than a modern-era concentration.

Treatment: Drop or treat as a sparse indicator; too null to use as a feature without heavy imputation.

anthropic:claude-opus-4-7 · confidence high

Out[49]:

saturn.columns["First_Year_Of_Documentation"].stats

stat	value
n	27,037
nulls	26,822 (99.2%)
unique	114
min	-2,100
max	1,932
mean	673.7
median	711
std	1055
q1	-300
q3	1710
iqr	2010
skew	-0.4581
kurtosis	-0.9206
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	99.2% null

Fig 20.

Distribution of First_Year_Of_Documentation. Vertical dash marks the median.

Histogram bins for First_Year_Of_Documentation (median: 711.0).
bin	count
-2100 – -1812	2
-1812 – -1524	4
-1524 – -1236	3
-1236 – -948	1
-948 – -660	14
-660 – -372	26
-372 – -84	12
-84 – 204	13
204 – 492	15
492 – 780	19
780 – 1068	12
1068 – 1356	10
1356 – 1644	19
1644 – 1932	65

Last_Year_Of_Documentation numeric timestamp

This appears to be the last year an entry was documented, populated for only about 4% of rows (null_rate 0.9605, leaving 1,068 values). Values span -3100 to 2024 with a median of 1960, and the heavy left skew (-3.35) plus kurtosis of 12.3 yields 170 outliers (15.9% of non-null entries). The negative values are consistent with BCE dating, as in First_Year_Of_Documentation, though sentinel values cannot be ruled out from the stats alone.

Treatment: Validate the year range before clipping, since negative years may be legitimate BCE dates; treat the column as mostly missing and flag presence rather than relying on the raw value.

anthropic:claude-opus-4-7 · confidence high
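
Flagging presence and checking the year range can be sketched with pandas as below. The bounds `MIN_YEAR` and `MAX_YEAR` are hypothetical choices, and the toy series includes a made-up sentinel to show what the check catches.

```python
import pandas as pd

# Hypothetical bounds for illustration; choose limits that suit the analysis.
MIN_YEAR = -4000
MAX_YEAR = 2024  # latest year reported in the stats for this column

years = pd.Series(
    [1960.0, None, -3100.0, 2024.0, None, 9999.0],  # toy values; 9999 mimics a sentinel
    name="Last_Year_Of_Documentation",
)

# 1 where a year is recorded, 0 where it is null.
has_year = years.notna().astype(int)

# Populated values outside the accepted window; nulls are never flagged.
suspect = years.notna() & ~years.between(MIN_YEAR, MAX_YEAR)
print(pd.DataFrame({"year": years, "has_year": has_year, "suspect": suspect}))
```

Keeping `has_year` as its own feature preserves the one signal this column carries for most rows, and BCE values inside the window pass through unchanged.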

Out[52]:

saturn.columns["Last_Year_Of_Documentation"].stats

stat	value
n	27,037
nulls	25,969 (96.0%)
unique	269
min	-3,100
max	2,024
mean	1700
median	1,960
std	699.3
q1	1858
q3	1987
iqr	129.5
skew	-3.345
kurtosis	12.32
n_outliers	170
outlier_rate	0.1592
zero_rate	0
alert: null_rate	96.0% null
alert: high_skew	skew=-3.35
alert: outliers	15.9% rows beyond 1.5 IQR

Fig 21.

Distribution of Last_Year_Of_Documentation. Vertical dash marks the median.

Histogram bins for Last_Year_Of_Documentation (median: 1960.0).
bin	count
-3100 – -2940	2
-2940 – -2780	0
-2780 – -2620	0
-2620 – -2460	1
-2460 – -2299	2
-2299 – -2139	1
-2139 – -1979	0
-1979 – -1819	0
-1819 – -1659	1
-1659 – -1499	1
-1499 – -1339	4
-1339 – -1178	2
-1178 – -1018	3
-1018 – -858.2	1
-858.2 – -698.1	2
-698.1 – -538	3
-538 – -377.9	9
-377.9 – -217.8	8
-217.8 – -57.62	14
-57.62 – 102.5	10
102.5 – 262.6	7
262.6 – 422.8	7
422.8 – 582.9	4
582.9 – 743	13
743 – 903.1	9
903.1 – 1063	9
1063 – 1223	17
1223 – 1384	7
1384 – 1544	18
1544 – 1704	21
1704 – 1864	96
1864 – 2024	796

Is_Isolate categorical feature

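Boolean flag indicating isolate status, present on only about 32% of the 27,037 rows (null_rate 0.6815). Among non-null values, 'False' dominates at 0.9789 with just 182 'True' cases, yielding very low entropy (0.148). The 8,612 populated rows equal the count of language-level entries in Level, so the nulls may be structural (the flag set only for languages) rather than missing data; the profile alone cannot confirm that.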

Treatment: Impute or add a missingness indicator; near-constant, so expect little predictive lift.

anthropic:claude-opus-4-7 · confidence high
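
Before imputing, it may be worth checking whether the nulls in Is_Isolate follow Level. The sketch below cross-tabulates presence against Level with pandas; the toy rows are invented, and the `present`/`null` labels are hypothetical.

```python
import pandas as pd

# Invented rows; on the real data use the Level and Is_Isolate columns directly.
toy = pd.DataFrame({
    "Level": ["dialect", "language", "family", "language", "dialect"],
    "Is_Isolate": [None, False, None, True, None],
})

# Label each row by whether the flag is populated, then count per Level.
status = toy["Is_Isolate"].notna().map({True: "present", False: "null"})
presence = pd.crosstab(toy["Level"], status)
print(presence)
```

If the real table shows the flag populated only on one Level, a missingness indicator would simply duplicate Level, and imputation would be inappropriate.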

Out[55]:

saturn.columns["Is_Isolate"].stats

stat	value
n	27,037
nulls	18,425 (68.1%)
unique	2
top_value	False
top_rate	0.9789
cardinality	2
entropy	0.1478
entropy_ratio	0.1478
alert: null_rate	68.1% null
alert: imbalance	top value is 97.9% of rows

Fig 22.

Top values for Is_Isolate.

Top values for Is_Isolate (2 unique shown, of 2 total).
value	count	share
False	8430	31.2%
True	182	0.7%
