data raw glottolog languoid

source /home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv · 19,401 rows · 7 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Glottolog languoid catalogue with 19,401 rows and 7 columns covering identifiers (glottocode, isocodes, name), geographic coordinates (latitude, longitude), and classification fields (macroarea, level). The most striking feature is missingness: roughly 59% of rows lack ISO codes and roughly 59% lack coordinates, so any geographic or ISO-based analysis will cover only about 41% of entries. Worth a closer look first: the macroarea distribution (Africa leads at 32%, followed by Eurasia and Papunesia) and the level split between dialect (56%) and language (44%). About 72% of names are single words, and the rest contain recurring qualifiers like 'nuclear', 'sign', 'central', and 'southern' that hint at naming conventions worth exploring.

citing: row_count · column_count · columns · kinds

Schema

7 columns
Per-column summary.
column      kind         nulls   unique   alerts
glottocode  text         0.0%    19,401   near_unique, one_word, short_text
name        text         0.0%    19,401   near_unique, one_word
isocodes    text         59.2%   7,922    near_unique, one_word, null_rate, short_text
level       categorical  0.0%    2        none
macroarea   categorical  4.3%    6        none
latitude    numeric      59.1%   7,786    null_rate
longitude   numeric      59.1%   7,745    null_rate
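
The null and unique figures in the schema summary can be recomputed directly from the CSV. The following is a minimal sketch using only the standard library; it assumes empty strings represent nulls, and the three sample rows are hypothetical placeholders, not rows copied from the real file.

```python
import csv
import io

# Hypothetical sample in the same 7-column layout as the profiled file.
# The values are placeholders for illustration only.
SAMPLE_CSV = """glottocode,name,isocodes,level,macroarea,latitude,longitude
aaaa0001,Example A,aaa,language,Africa,1.5,30.0
aaaa0002,Example B,,dialect,Africa,,
aaaa0003,Example C,,dialect,,,
"""


def profile_columns(handle):
    """Return {column: (null_rate, n_unique)}, treating "" as null."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    summary = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        non_null = [v for v in values if v != ""]
        null_rate = 1 - len(non_null) / len(values) if values else 0.0
        summary[column] = (null_rate, len(set(non_null)))
    return summary


summary = profile_columns(io.StringIO(SAMPLE_CSV))
for column, (null_rate, n_unique) in summary.items():
    print(f"{column:<11} nulls={null_rate:.1%} unique={n_unique}")
```

To run it against the real data, pass an open file handle for the CSV in place of the `io.StringIO` sample.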

glottocode

text identifier near_unique one_word short_text
This column holds Glottocodes: fixed 8-character identifiers (len_min=len_max=8) that uniquely tag languoids (here, languages and dialects) in the Glottolog catalogue. Every one of the 19,401 values is unique (n_unique=19401, duplicate_rate=0.0) and single-token (one_word_rate=1.0), matching the four-letters-plus-four-digits pattern visible in top_words like 'aala1237' and 'aari1239'. There are no nulls and no collisions, so this is a clean primary key rather than a feature. Treatment: Use as the primary key; join on this id rather than feeding it to a model. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 0 (0.0%)
unique: 19,401
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 19,401
readability_flesch_mean: 93.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
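
Before relying on glottocode as a primary key, the format and uniqueness claims can be checked in a few lines. This is a sketch, not the profiler's own check: the regular expression encodes the four-letters-plus-four-digits pattern described above, which is an assumption about the full file (if some codes carry digits in their first four characters, the pattern would need loosening). The function name `check_key` and the sample list are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: four lowercase letters followed by four digits,
# as described in the column summary.
GLOTTOCODE_RE = re.compile(r"[a-z]{4}[0-9]{4}")


def check_key(codes):
    """Return (malformed, duplicated) lists for a sequence of codes."""
    malformed = [c for c in codes if not GLOTTOCODE_RE.fullmatch(c)]
    duplicated = [c for c, n in Counter(codes).items() if n > 1]
    return malformed, duplicated


# 'aala1237' and 'aari1239' appear in the report; the repeat and the
# malformed entry are added here to exercise both checks.
malformed, duplicated = check_key(
    ["aala1237", "aari1239", "aari1239", "bad-code"]
)
print("malformed:", malformed)
print("duplicated:", duplicated)
```

On a clean key column both lists should come back empty.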

name

text identifier near_unique one_word
The `name` column holds languoid names: all 19,401 values are unique, with no nulls or duplicates, so it works as a human-readable label alongside the glottocode key. Entries are terse (mean 9.2 characters, median 7 characters, median 1 word) and 71.7% are single words; the top tokens (`nuclear`, `language`, `central`, `southern`, `western`) are recurring qualifiers, which points to naming conventions rather than personal names. Vocabulary is wide (17,861 distinct words) for such short strings. Treatment: Treat as a unique label; do not feature-engineer directly, but tokenize if used for matching or embedding. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 0 (0.0%)
unique: 19,401
len_min: 1
len_max: 58
len_mean: 9.211
len_median: 7
len_p95: 20
word_mean: 1.369
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 17,861
readability_flesch_mean: 60.53
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7169
allcaps_rate: 0
boilerplate_rate: 0
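
The token statistics for name (top tokens, one_word_rate) can be reproduced with a simple whitespace tokenizer. The sketch below assumes lowercasing and splitting on whitespace, which may differ from how the profiler tokenizes; the sample names are made up for illustration.

```python
from collections import Counter


def token_counts(names):
    """Count lowercase whitespace-separated tokens across all names."""
    counts = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts


def one_word_rate(names):
    """Share of names consisting of exactly one token."""
    return sum(1 for n in names if len(n.split()) == 1) / len(names)


# Hypothetical names, chosen only to show a repeated qualifier pattern.
names = ["Nuclear Example", "Southern Example", "Central Placeholder", "Sample"]
counts = token_counts(names)
print(counts.most_common(3))
print(one_word_rate(names))
```

Sorting `counts` by frequency is one way to surface qualifiers such as 'nuclear' or 'southern' for a closer look at naming conventions.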

isocodes

text identifier near_unique one_word null_rate short_text
This column holds 3-character ISO-style codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). It is sparsely populated, with a null rate of 0.5917 (11,479 of 19,401 rows); the remaining 7,922 populated rows carry 7,922 distinct codes, so no code repeats (vocab_size=7922, n_duplicates=0). The top_words sample (aiw, aay, aas, kbt…) suggests ISO 639-3 language codes rather than country codes. Treatment: Treat as a foreign key; left-join to an ISO code lookup and flag the ~59% missing values rather than imputing them, since an identifier cannot be meaningfully imputed. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 11,479 (59.2%)
unique: 7,922
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,922
readability_flesch_mean: 118.7
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
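
The recommended left join with a missing-value flag might look like the sketch below. The lookup table, its contents, the function name `attach_iso_names`, and the output field names `iso_name` and `has_iso` are all hypothetical; the codes 'aiw' and 'aay' are taken from the top_words sample, but the glottocodes paired with them are placeholders.

```python
def attach_iso_names(rows, iso_lookup):
    """Left-join rows to an ISO lookup, keeping rows without a code.

    Adds 'has_iso' (bool) and 'iso_name' (None when the code is
    missing or absent from the lookup).
    """
    joined = []
    for row in rows:
        code = row.get("isocodes") or None  # treat "" as missing
        joined.append(
            {
                **row,
                "has_iso": code is not None,
                "iso_name": iso_lookup.get(code) if code else None,
            }
        )
    return joined


# Hypothetical lookup; a real one would come from an ISO 639-3 table.
iso_lookup = {"aiw": "placeholder name for aiw"}

rows = [
    {"glottocode": "aaaa0001", "isocodes": "aiw"},
    {"glottocode": "aaaa0002", "isocodes": ""},
    {"glottocode": "aaaa0003", "isocodes": "aay"},
]
joined = attach_iso_names(rows, iso_lookup)
for record in joined:
    print(record)
```

Keeping `has_iso` separate from `iso_name` distinguishes rows with no code at all from rows whose code is simply absent from the lookup.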

level

categorical label
A binary categorical flag distinguishing 'dialect' (10,920 rows, 56.3%) from 'language' (8,481 rows, 43.7%). The split is fairly balanced (entropy 0.989 bits, against a maximum of 1 bit for two classes), and there are no nulls across all 19,401 rows. Treatment: One-hot or binary-encode for modelling. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 0 (0.0%)
unique: 2
top_value: dialect
top_rate: 0.5629
cardinality: 2
entropy: 0.9886
entropy_ratio: 0.9886
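
The entropy figures follow from the two class counts. As a worked check, assuming the profiler uses Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the cardinality (an assumption that reproduces the reported values):

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    if len(counts) < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(len(counts))


# Counts taken from the level summary above.
level_counts = [10920, 8481]  # dialect, language
h = entropy_bits(level_counts)
ratio = entropy_ratio(level_counts)
print(round(h, 4), round(ratio, 4))
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for this column.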

macroarea

categorical feature
Categorical macro-region label with 6 values covering the world's linguistic areas (Africa, Eurasia, Papunesia, South America, North America, Australia). The distribution is moderately balanced (entropy ratio 0.84), with Africa leading at 32.1% and Australia trailing at 602 rows; 839 rows (4.32%) are null. No single category dominates, but the Americas and Australia are markedly underrepresented relative to Africa and Eurasia. Treatment: One-hot or target-encode; impute or flag the ~4% missing before modelling. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 839 (4.3%)
unique: 6
top_value: Africa
top_rate: 0.3208
cardinality: 6
entropy: 2.176
entropy_ratio: 0.8418
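
One way to one-hot encode macroarea while flagging the missing values, rather than silently dropping them, is sketched below. The output field names and the function `one_hot_macroarea` are hypothetical, and treating the empty string as null is an assumption about how the CSV encodes missing values.

```python
# The six macroareas listed in the column summary.
MACROAREAS = [
    "Africa",
    "Australia",
    "Eurasia",
    "North America",
    "Papunesia",
    "South America",
]


def one_hot_macroarea(value):
    """Encode one macroarea value as 0/1 indicators plus a missing flag."""
    encoded = {
        "macroarea_" + area.lower().replace(" ", "_"): int(value == area)
        for area in MACROAREAS
    }
    # Explicit indicator so the ~4% null rows stay distinguishable.
    encoded["macroarea_missing"] = int(value is None or value == "")
    return encoded


africa = one_hot_macroarea("Africa")
missing = one_hot_macroarea("")
print(africa)
print(missing)
```

A value outside the six known areas would encode as all zeros, which makes unexpected categories easy to spot downstream.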

latitude

numeric feature null_rate
Geographic latitude coordinates ranging from -55.2748 to 73.1354, consistent with a worldwide point dataset. Nearly 59% of rows are null, which limits any spatial analysis to the remaining 7,929 rows. The median of 6.29 and mean of 8.16 both sit just north of the equator, and the mean exceeding the median is consistent with the moderate positive skew (0.54). Treatment: Pair with longitude for geospatial features; impute or filter the 59% nulls before mapping. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 11,472 (59.1%)
unique: 7,786
min: -55.27
max: 73.14
mean: 8.164
median: 6.292
std: 18.96
q1: -5.139
q3: 19.27
iqr: 24.41
skew: 0.5425
kurtosis: 0.3048
n_outliers: 135
outlier_rate: 0.01703
zero_rate: 0
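
The report does not say how n_outliers is defined. A common convention is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles; the sketch below applies that rule to the reported quartiles as an assumption, not as the profiler's documented method.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, low, high):
    """True when a value falls outside the fences."""
    return value < low or value > high


# Quartiles taken from the latitude summary above.
lat_low, lat_high = tukey_fences(-5.139, 19.27)
print(round(lat_low, 2), round(lat_high, 2))

# The reported min and max both fall outside these fences.
print(is_outlier(-55.27, lat_low, lat_high), is_outlier(73.14, lat_low, lat_high))
```

Under this assumption the latitude fences sit at about -41.75 and 55.88, so both extremes of the range count as outliers. Applying the same rule to the longitude quartiles (7.18 and 124.1) puts the upper fence near 299.5, beyond any valid longitude, so only far-western values could be flagged there.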

longitude

numeric feature null_rate
Geographic longitude coordinates spanning nearly the full globe (min -178.785, max 179.306) with a median of 47.57, suggesting a worldwide dataset weighted toward the eastern hemisphere. The 59.13% null rate is the dominant concern, since most records lack a location; only 13 outliers (0.16%) appear and skew is mild (-0.48). Treatment: Impute or flag the 59% missing values before any geospatial modelling; pair with latitude for joint use. high · anthropic:claude-opus-4-7
n: 19,401
nulls: 11,472 (59.1%)
unique: 7,745
min: -178.8
max: 179.3
mean: 51.22
median: 47.57
std: 81.15
q1: 7.18
q3: 124.1
iqr: 117
skew: -0.4814
kurtosis: -0.7765
n_outliers: 13
outlier_rate: 0.00164
zero_rate: 0
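
Pairing latitude with longitude for joint use could start with a filter that keeps only rows where both coordinates are present and within valid ranges. This is a minimal sketch: the function name `parse_point` is hypothetical, the sample rows are placeholders, and treating empty strings as nulls is an assumption about the CSV.

```python
def parse_point(row):
    """Return (lat, lon) as floats, or None if either value is missing.

    Raises ValueError when a coordinate is outside its valid range.
    """
    lat_text = row.get("latitude", "")
    lon_text = row.get("longitude", "")
    if not lat_text or not lon_text:
        return None
    lat, lon = float(lat_text), float(lon_text)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinate out of range: {lat}, {lon}")
    return lat, lon


# Hypothetical rows; coordinates are placeholders within the reported ranges.
rows = [
    {"glottocode": "aaaa0001", "latitude": "6.3", "longitude": "47.6"},
    {"glottocode": "aaaa0002", "latitude": "", "longitude": ""},
    {"glottocode": "aaaa0003", "latitude": "-5.1", "longitude": ""},
]
points = {}
for row in rows:
    point = parse_point(row)
    if point is not None:
        points[row["glottocode"]] = point
print(points)
```

Rows with only one coordinate present are dropped along with fully empty ones, since a lone latitude or longitude cannot be mapped.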