glottolog languages

source /home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet · 27,037 rows · 15 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Glottolog catalogue of 27,037 entries with 15 columns covering identifiers (ID, Name, Glottocode, ISO 639-3 codes), geography (Latitude, Longitude, Countries, Macroarea), classification (Family_ID, Language_ID, Level, Is_Isolate), and documentation years. Level splits the catalogue into dialects (50.3%), languages (31.9%), and families (17.9%). Macroarea is led by Eurasia (8,060) and Africa (8,020), with Papunesia (6,326) third. Family_ID is concentrated in a few large families (atla1278, aust1307, indo1319) out of 297 in total. The documentation-year fields are almost entirely null (Last_Year_Of_Documentation 96.0%, First_Year_Of_Documentation 99.2%), and Countries (66.4%) and Is_Isolate (68.1%) are missing for about two-thirds of rows, so none of these four columns supports whole-catalogue analysis unless missingness is handled explicitly. Latitude and Longitude are 98.2% complete and would support mapping work.

citing: Level · Macroarea · Family_ID · Countries · Is_Isolate · Last_Year_Of_Documentation · First_Year_Of_Documentation · Latitude · Longitude · Name
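The null rates quoted in the summary can be recomputed directly from the table. The following is a minimal sketch in plain Python, assuming the rows are available as dictionaries with None for missing values; the `null_rates` helper and the sample rows are illustrative and are not part of the profiler.

```python
def null_rates(rows, columns):
    """Return the fraction of rows in which each column is None."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical sample rows; the real table has 27,037 rows.
sample = [
    {"Glottocode": "cent1996", "Is_Isolate": None},
    {"Glottocode": "chan1318", "Is_Isolate": False},
    {"Glottocode": "nucl1643", "Is_Isolate": None},
    {"Glottocode": "stan1293", "Is_Isolate": None},
]
rates = null_rates(sample, ["Glottocode", "Is_Isolate"])
```

With pandas installed, `pd.read_parquet(path).isna().mean()` would typically give the same per-column figures for the real file.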

Schema

15 columns
Per-column summary.
Column                       Type         Null %  Unique  Alerts
ID                           text         0.0%    27,037  near_unique one_word short_text
Name                         text         0.0%    27,037  near_unique one_word
Macroarea                    categorical  0.8%    30      -
Latitude                     numeric      1.8%    13,231  -
Longitude                    numeric      1.8%    13,203  -
Glottocode                   text         0.0%    27,037  near_unique one_word short_text
ISO639P3code                 text         69.7%   8,180   near_unique one_word null_rate short_text
Level                        categorical  0.0%    3       -
Countries                    categorical  66.4%   737     null_rate
Family_ID                    categorical  1.6%    297     -
Language_ID                  text         49.7%   3,110   one_word null_rate short_text duplicates
Closest_ISO369P3code         text         21.3%   8,180   one_word null_rate short_text duplicates
First_Year_Of_Documentation  numeric      99.2%   114     null_rate
Last_Year_Of_Documentation   numeric      96.0%   269     null_rate high_skew outliers
Is_Isolate                   categorical  68.1%   2       null_rate imbalance

ID

text identifier near_unique one_word short_text
Fixed-length, 8-character, single-token codes (len_min = len_max = 8, word_mean = 1.0), unique across all 27,037 rows with no nulls or duplicates. Sample values such as 'cent1996' and 'chan1318' are a 4-letter prefix plus a 4-digit suffix, the Glottocode pattern, so this is a natural key rather than an arbitrary surrogate. Its statistics and sample values match those of the Glottocode column exactly, which suggests the two columns carry the same values. Treatment: Use as the row key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7
n
27,037
nulls
0 (0.0%)
unique
27,037
len_min
8
len_max
8
len_mean
8
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
20,000
readability_flesch_mean
92.03
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
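The fixed-length claim for ID can be turned into a validation check. This is a sketch only: the pattern (four lowercase letters or digits followed by four digits) is an assumption based on the sample values, and `is_glottocode` and `invalid_codes` are hypothetical helper names.

```python
import re

# Assumed pattern: four lowercase letters or digits, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode(value):
    """True if value is a string matching the assumed Glottocode shape."""
    return isinstance(value, str) and GLOTTOCODE_RE.fullmatch(value) is not None


def invalid_codes(values):
    """Return the values that do not look like Glottocodes."""
    return [v for v in values if not is_glottocode(v)]
```

Running `invalid_codes` over the ID column should return an empty list if the profile's fixed-format reading holds.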

Name

text identifier near_unique one_word
`Name` is a fully unique short label column (27,037 rows, 27,037 distinct values, no nulls or duplicates) with a mean length of 10.4 characters; 66.75% of entries are a single word. Lengths range from 1 to 109 characters (p95 = 23). The vocabulary of 18,126 tokens is led by directional and positional modifiers ('nuclear', 'central', 'western', 'northern', 'eastern', 'southern'), which fits names of languages, dialects, and families rather than personal names. Perfect uniqueness combined with short, often one-word values makes it an identifier-like label. Treatment: Use for display; although the values are unique, join on Glottocode rather than on a free-text label, and drop Name from modelling features. high · anthropic:claude-opus-4-7
n
27,037
nulls
0 (0.0%)
unique
27,037
len_min
1
len_max
109
len_mean
10.44
len_median
8
len_p95
23
word_mean
1.444
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
18,126
readability_flesch_mean
29.91
emoji_rate
0
url_rate
0
one_word_rate
0.6675
allcaps_rate
0.0001479
boilerplate_rate
0

Macroarea

categorical feature
Geographic macroarea label that tags each entry by world region. Six base regions dominate (Eurasia 8,060, Africa 8,020, Papunesia 6,326, North America 1,782, South America 1,524, Australia 919), but cardinality is 30 because some rows carry semicolon-joined multi-region strings such as 'Africa;Eurasia' (29 rows) or all six regions concatenated (17 rows). The six single-region values account for 26,631 of the 26,813 non-null rows, leaving 182 rows with compound values. Null rate is low at 0.83%. The entropy_ratio of 0.46 is low mainly because entropy is normalised over all 30 values while 24 of them are rare compounds; the top value, Eurasia, holds only 30% of non-null rows (top_rate 0.30). Treatment: Split the semicolon-delimited compound values into a multi-hot encoding over the six base regions before modelling. high · anthropic:claude-opus-4-7
n
27,037
nulls
224 (0.8%)
unique
30
top_value
Eurasia
top_rate
0.3006
cardinality
30
entropy
2.271
entropy_ratio
0.4628
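The multi-hot treatment suggested for Macroarea could look like the sketch below. It assumes the separator is a semicolon with no other delimiters in use, and that the six region names listed in the profile are the only base values; `macroarea_multi_hot` is a hypothetical helper.

```python
BASE_REGIONS = (
    "Africa", "Australia", "Eurasia",
    "North America", "Papunesia", "South America",
)


def macroarea_multi_hot(value, regions=BASE_REGIONS, sep=";"):
    """Map a Macroarea string to a {region: 0/1} dict; None gives all zeros."""
    present = set()
    if value is not None:
        present = {part.strip() for part in value.split(sep) if part.strip()}
    unknown = present - set(regions)
    if unknown:
        # Fail loudly so unexpected labels are not silently dropped.
        raise ValueError(f"unexpected region(s): {sorted(unknown)}")
    return {region: int(region in present) for region in regions}
```

For example, 'Africa;Eurasia' sets two indicators to 1, and a null row yields six zeros, which keeps missingness distinguishable from any real region.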

Latitude

numeric feature
Geographic latitude in decimal degrees, spanning -55.2748 to 73.1354, within the valid -90 to 90 range. The distribution is mildly right-skewed (0.42) with a median of 8.53 and quartiles at -3.75 and 26.0, so half of the entries lie in a band from just south of the equator to the northern subtropics. About 1.77% of rows (479) are null and only 48 values (0.18% of non-null rows) fall outside the IQR fence, so the column is largely clean. Treatment: Pair with Longitude for geospatial features; impute or drop the 1.77% nulls before modelling. high · anthropic:claude-opus-4-7
n
27,037
nulls
479 (1.8%)
unique
13,231
min
-55.27
max
73.14
mean
11.59
median
8.527
std
20.57
q1
-3.747
q3
26
iqr
29.75
skew
0.4211
kurtosis
-0.1912
n_outliers
48
outlier_rate
0.001807
zero_rate
0

Longitude

numeric feature
This column captures geographic longitude in decimal degrees, with values spanning -178.785 to 179.43, nearly the full -180 to 180 range. The distribution is wide (std 74.05, IQR 110.17) and slightly left-skewed (-0.47), with 13,203 unique values across 27,037 rows and a 1.77% null rate (479 rows, the same count as Latitude). Only 51 values (0.19% of non-null rows) are flagged as outliers; with an IQR this wide the fence covers most of the bounded range, so few values can fall outside it. Treatment: Pair with Latitude for geospatial features; consider sin/cos encoding to handle the -180/180 wraparound. high · anthropic:claude-opus-4-7
n
27,037
nulls
479 (1.8%)
unique
13,203
min
-178.8
max
179.4
mean
51.82
median
44.07
std
74.05
q1
9.225
q3
119.4
iqr
110.2
skew
-0.468
kurtosis
-0.4518
n_outliers
51
outlier_rate
0.00192
zero_rate
0
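The sin/cos encoding mentioned in the Longitude treatment can be sketched as follows. The function names are illustrative; the point is that longitudes on either side of the -180/180 seam end up close together in the encoded space.

```python
import math


def encode_longitude(lon_deg):
    """Return (sin, cos) of a longitude given in degrees."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)


def encoded_distance(lon_a, lon_b):
    """Euclidean distance between two longitudes in (sin, cos) space."""
    sa, ca = encode_longitude(lon_a)
    sb, cb = encode_longitude(lon_b)
    return math.hypot(sa - sb, ca - cb)
```

With this encoding the column's extremes, -178.785 and 179.43, are near neighbours, whereas their raw difference of about 358 degrees would suggest they are far apart.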

Glottocode

text identifier near_unique one_word short_text
This column holds Glottocodes—the standard 8-character identifiers used by the Glottolog language catalogue (e.g. 'cent1996', 'chan1318'). Every one of the 27,037 rows is unique with a fixed length of 8 and exactly one word, and there are no nulls or duplicates, so it functions as a primary key for languages/dialects. Nothing surprising in the distribution; it behaves exactly like a clean ID field. Treatment: Use as a primary key to left-join against Glottolog metadata; do not feed into models as a feature. high · anthropic:claude-opus-4-7
n
27,037
nulls
0 (0.0%)
unique
27,037
len_min
8
len_max
8
len_mean
8
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
20,000
readability_flesch_mean
92.03
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

ISO639P3code

text foreign_key near_unique one_word null_rate short_text
This column holds ISO 639-3 language codes: exactly 3 characters, one word, every value lowercase alphabetic. It is 69.75% null; the remaining 8,180 populated rows carry 8,180 distinct codes, so each code appears exactly once and maps to a single entry. That makes it a key into the ISO 639-3 registry rather than a modelling feature. There are no duplicates or empty strings among the populated rows. Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and encode missingness explicitly. high · anthropic:claude-opus-4-7
n
27,037
nulls
18,857 (69.7%)
unique
8,180
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
8,180
readability_flesch_mean
119.1
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
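A left join that keeps every row and records missingness explicitly might be sketched like this. The reference table, the placeholder Glottocodes, and the `join_iso_names` helper are all hypothetical; only the ISO639P3code column name comes from the profile.

```python
def join_iso_names(rows, iso_names):
    """Left-join rows to an ISO 639-3 name lookup, keeping every row.

    Adds 'iso_name' (None when unmatched) and 'has_iso', an explicit
    missingness flag for the ISO639P3code column.
    """
    joined = []
    for row in rows:
        code = row.get("ISO639P3code")
        joined.append({
            **row,
            "has_iso": code is not None,
            "iso_name": iso_names.get(code) if code is not None else None,
        })
    return joined


# Hypothetical reference table and rows, for illustration only.
iso_names = {"jpn": "Japanese", "eng": "English"}
rows = [
    {"Glottocode": "xxxx0001", "ISO639P3code": "jpn"},
    {"Glottocode": "xxxx0002", "ISO639P3code": None},
]
result = join_iso_names(rows, iso_names)
```

Keeping `has_iso` as its own field lets a model or report distinguish "no ISO code assigned" from "code assigned but absent from the reference table".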

Level

categorical label
This is a low-cardinality categorical field with exactly 3 levels: dialect, language, and family. The distribution is uneven but not pathological: dialect leads at 50.3% (13,593 of 27,037), followed by language (8,612, 31.9%) and family (4,832, 17.9%), yielding an entropy ratio of 0.93. There are no nulls, as expected for the level field of a curated catalogue such as Glottolog. Treatment: One-hot encode for modelling, or ordinal encode if the family > language > dialect hierarchy should be preserved; safe to use as a stratification key. high · anthropic:claude-opus-4-7
n
27,037
nulls
0 (0.0%)
unique
3
top_value
dialect
top_rate
0.5028
cardinality
3
entropy
1.468
entropy_ratio
0.9265
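The entropy and entropy_ratio figures for Level can be reproduced from the three counts. The sketch below assumes entropy is measured in bits and that the ratio divides by log2 of the number of categories; under those assumptions the results agree with the reported 1.468 and 0.9265 to the displayed precision.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Level counts from the profile: dialect, language, family.
level_counts = [13593, 8612, 4832]
h = entropy_bits(level_counts)
ratio = entropy_ratio(level_counts)
```

A ratio near 1 means the categories are close to evenly used; a ratio near 0 means one category dominates.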

Countries

categorical feature null_rate
Two-letter ISO country codes, with 737 distinct values across 27,037 rows. Two-thirds of rows are null (null_rate 0.6641), and among the 9,081 populated rows the distribution is broad (entropy_ratio 0.69), with the top value PG (Papua New Guinea) at just 9.97%. The 737 distinct values exceed the roughly 250 codes defined by ISO 3166-1 alpha-2, which suggests multi-country concatenations or non-standard codes mixed in. Treatment: Normalize/split non-standard codes, add an explicit missing indicator, then group rare levels before encoding. high · anthropic:claude-opus-4-7
n
27,037
nulls
17,956 (66.4%)
unique
737
top_value
PG
top_rate
0.09966
cardinality
737
entropy
6.562
entropy_ratio
0.6888
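If the surplus Countries values are concatenations, splitting them is the first normalisation step. The sketch below assumes a semicolon separator, by analogy with Macroarea; the profile does not confirm the delimiter, and `split_country_codes` is a hypothetical helper. It checks only the two-uppercase-letter shape, not membership in ISO 3166-1.

```python
import re

ALPHA2_RE = re.compile(r"[A-Z]{2}")


def split_country_codes(value, sep=";"):
    """Split a Countries value into (valid_codes, other_tokens).

    The separator is an assumption; None yields two empty lists.
    """
    if value is None:
        return [], []
    tokens = [t.strip() for t in value.split(sep) if t.strip()]
    valid = [t for t in tokens if ALPHA2_RE.fullmatch(t)]
    other = [t for t in tokens if not ALPHA2_RE.fullmatch(t)]
    return valid, other
```

Any tokens returned in the second list would need manual review before encoding.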

Family_ID

categorical foreign_key
Family_ID holds Glottolog family codes (e.g., atla1278, aust1307, indo1319), making it a categorical grouping key with 297 distinct families across 27,037 rows. The distribution is heavily skewed: the top family, atla1278, alone covers 18.27% of non-null rows, and a few large families hold a disproportionate share, yielding an entropy ratio of 0.60. Null rate is low at 1.59% (429 rows). Treatment: Left-join on this ID to a language-family reference, or group by it for stratified analysis. high · anthropic:claude-opus-4-7
n
27,037
nulls
429 (1.6%)
unique
297
top_value
atla1278
top_rate
0.1827
cardinality
297
entropy
4.938
entropy_ratio
0.6011
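How concentrated Family_ID is can be measured as the share held by the k largest families. This is a sketch on a toy column; `top_k_share` is a hypothetical helper, and 'fam_a' and 'fam_b' are placeholder labels, not real family codes.

```python
from collections import Counter


def top_k_share(values, k=3):
    """Share of non-null values covered by the k most frequent categories."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Hypothetical toy column; replace with the real Family_ID values.
toy = (
    ["atla1278"] * 5 + ["aust1307"] * 3 + ["indo1319"] * 2
    + [None] * 2 + ["fam_a", "fam_b"]
)
share = top_k_share(toy, k=3)
```

Nulls are excluded from the denominator, which matches how the profile's top_rate appears to be computed for Macroarea.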

Language_ID

text foreign_key one_word null_rate short_text duplicates
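This column holds 8-character single-token codes (len_min = len_max = 8, one_word_rate = 1.0) in the Glottocode format (e.g., 'nucl1643', 'stan1293'). With 3,110 unique values across 13,593 populated rows and a 0.7712 duplicate rate, it behaves like a categorical foreign key into the language entries. 49.72% of rows (13,444) are null. The populated count equals the number of dialect rows in Level (13,593), and the null count equals languages plus families (8,612 + 4,832 = 13,444), which suggests the field is filled only for dialects and points to the parent language; if so, the nulls are structural rather than missing data. Treatment: Left-join on this ID to a language reference table; treat null as a separate "not applicable" category rather than imputing it. high · anthropic:claude-opus-4-7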
n
27,037
nulls
13,444 (49.7%)
unique
3,110
len_min
8
len_max
8
len_mean
8
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
10,483
duplicate_rate
0.7712
vocab_size
3,110
readability_flesch_mean
86.53
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
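The profile's counts allow one hypothesis about the nulls: Language_ID has 27,037 - 13,444 = 13,593 populated rows, the same as the number of dialect rows reported for Level. The sketch below checks whether Language_ID is set exactly when Level is 'dialect'; the helper name and the sample rows are hypothetical.

```python
def language_id_mismatches(rows):
    """Return rows where Language_ID presence disagrees with Level.

    Hypothesis under test: Language_ID is set if and only if
    Level == 'dialect'.
    """
    return [
        row for row in rows
        if (row.get("Language_ID") is not None) != (row.get("Level") == "dialect")
    ]


# Hypothetical rows for illustration.
rows = [
    {"Glottocode": "xxxx0001", "Level": "dialect", "Language_ID": "stan1293"},
    {"Glottocode": "stan1293", "Level": "language", "Language_ID": None},
    {"Glottocode": "xxxx0002", "Level": "family", "Language_ID": "nucl1643"},
]
bad = language_id_mismatches(rows)
```

An empty result on the real table would indicate that the nulls are structural, in which case imputation would be inappropriate.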

Closest_ISO369P3code

text feature one_word null_rate short_text duplicates
This column holds ISO 639-3 three-letter language codes: every value is exactly 3 characters and one word (len_mean 3.0, one_word_rate 1.0), with 8180 unique codes led by jpn (120), eng (115), and pes (64). Notable signals: 21.28% nulls and a 61.57% duplicate rate (13103 duplicates), so coverage is partial but the field is a clean categorical. Treatment: Treat as a categorical language code; impute or flag the 21% nulls and join to an ISO 639-3 reference table for names/families. high · anthropic:claude-opus-4-7
n
27,037
nulls
5,754 (21.3%)
unique
8,180
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
13,103
duplicate_rate
0.6157
vocab_size
7,877
readability_flesch_mean
117.4
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

First_Year_Of_Documentation

numeric metadata null_rate
This column appears to record the earliest year in which a language was documented, spanning -2100 (read as BCE) to 1932 with a median of 711. Severe nullity is the headline: 99.2% of the 27,037 rows are missing, leaving only 215 populated values across 114 unique years. The wide IQR (-300 to 1710.5) and mild negative skew (-0.46) indicate values spread across about four millennia with a tail into antiquity rather than a modern-era concentration. Treatment: Drop or treat as a sparse indicator; too null to use as a feature without heavy imputation. high · anthropic:claude-opus-4-7
n
27,037
nulls
26,822 (99.2%)
unique
114
min
-2,100
max
1,932
mean
673.7
median
711
std
1055
q1
-300
q3
1710
iqr
2010
skew
-0.4581
kurtosis
-0.9206
n_outliers
0
outlier_rate
0
zero_rate
0
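Treating the column as a sparse indicator, and labelling the signed years for display, could be sketched as below. Reading negative values as BCE years is an interpretation that the profile does not confirm, and both helper names are hypothetical. Missing values are expected as None, so a NaN from a numeric column would need converting first.

```python
def year_label(year):
    """Format a signed year as 'N BCE' or 'N CE'; None stays None.

    Assumes negative values are BCE years and that year 0 is not used.
    """
    if year is None:
        return None
    if year == 0:
        raise ValueError("year 0 has no BCE/CE label under this assumption")
    n = int(abs(year))  # also accepts floats such as -2100.0
    return f"{n} BCE" if year < 0 else f"{n} CE"


def has_value(year):
    """Sparse presence indicator: 1 if a year is recorded, else 0."""
    return int(year is not None)
```

The presence indicator keeps the 215 documented rows usable as a flag even though the raw year is too sparse to model directly.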

Last_Year_Of_Documentation

numeric timestamp null_rate high_skew outliers
This appears to be the last year in which a language was documented, populated for only ~4% of rows (1,068 values; null_rate 0.9605). Values span -3100 to 2024 with a median of 1960. Half of the values fall between 1858 and 1987, so the long tail of early dates produces a heavy left skew (-3.35), a kurtosis of 12.3, and 170 values (15.9% of non-null entries) outside the IQR fence. Because First_Year_Of_Documentation also holds negative years down to -2100, the negative values here are more plausibly BCE dates for ancient languages than sentinels or errors, though this should be confirmed against the source. Treatment: Validate the year range rather than clipping it blindly, treat the column as mostly missing, and flag presence rather than relying on the raw value. high · anthropic:claude-opus-4-7
n
27,037
nulls
25,969 (96.0%)
unique
269
min
-3,100
max
2,024
mean
1700
median
1,960
std
699.3
q1
1858
q3
1987
iqr
129.5
skew
-3.345
kurtosis
12.32
n_outliers
170
outlier_rate
0.1592
zero_rate
0
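The outlier count depends on where the IQR fence falls. The sketch below works this out from the reported quartiles, assuming the conventional Tukey multiplier of 1.5; the profiler's actual rule is not stated, and `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q3 and iqr as reported in the profile; q1 is derived as q3 - iqr
# because the displayed q1 (1858) appears to be rounded.
q3, iqr = 1987.0, 129.5
q1 = q3 - iqr
lower, upper = tukey_fences(q1, q3)
```

Under these assumptions the lower fence sits at 1663.25, so every documentation year before the mid-1600s is flagged, including dates that may be perfectly valid for ancient or medieval languages.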

Is_Isolate

categorical feature null_rate imbalance
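Boolean flag indicating isolate status, present on only ~32% of the 27,037 rows (null_rate 0.6815). Among the 8,612 non-null values, 'False' dominates at 0.9789 with just 182 'True' cases, yielding very low entropy (0.148). The non-null count equals the number of language rows in Level (8,612), which suggests the flag is set only for language-level entries and that the nulls are structural. Treatment: Add a missingness indicator rather than imputing; the flag is near-constant, so expect little predictive lift. high · anthropic:claude-opus-4-7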
n
27,037
nulls
18,425 (68.1%)
unique
2
top_value
False
top_rate
0.9789
cardinality
2
entropy
0.1478
entropy_ratio
0.1478
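A missingness indicator for Is_Isolate can be kept alongside the flag itself, as in the sketch below. It assumes values arrive as booleans, as the strings 'True' and 'False', or as None; `encode_is_isolate` is a hypothetical helper.

```python
def encode_is_isolate(value):
    """Encode the flag as two features: (is_isolate, is_missing).

    None maps to (0, 1) so that missingness is explicit instead of
    being imputed as False.
    """
    if value is None:
        return 0, 1
    if isinstance(value, str):
        # Compare text explicitly; bool("False") would be True.
        value = value.strip().lower() == "true"
    return int(bool(value)), 0
```

Encoding the two signals separately avoids folding the 68.1% of null rows into the 'False' class.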