saturn·

language data glottolog languoid

source /home/coolhand/datasets/language-data/glottolog_languoid.csv · 23,740 rows · 16 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Glottolog languoid catalog with 23,740 rows and 16 columns describing languages, dialects, and families along with geographic and endangerment metadata. The `level` field splits the records into three classes: dialect (10,920), language (8,481), and family (4,339), which makes it the natural first variable to group by. Endangerment `status` is dominated by 'safe' (~79.9%); the remaining ~20.1% of rows (roughly 4,800) range from vulnerable to extinct and are worth investigating. Geography is concentrated: `country_ids` is led by PG (874), ID (695), and NG (480), and `family_id` is heavily skewed toward atla1278 (4,663) and aust1307 (3,850). Note that `iso639P3code`, `latitude`, and `longitude` are ~66% null, so spatial analysis will only cover about a third of rows.

citing: level · status · country_ids · family_id · latitude · longitude · iso639P3code · child_dialect_count · bookkeeping
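
The null rates quoted in the summary can be re-derived with a single pass over the CSV. The following is a minimal sketch using only the Python standard library. It assumes that a missing value is stored as an empty string, which the profile does not confirm, and the helper name `null_rates` and the sample rows are hypothetical.

```python
import csv
import io


def null_rates(csv_file):
    """Return {column: fraction of rows whose value is empty}.

    Assumption: a missing value is stored as an empty string.
    """
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counts:
            if not (row[name] or "").strip():
                counts[name] += 1
    if n_rows == 0:
        return {name: 0.0 for name in counts}
    return {name: counts[name] / n_rows for name in counts}


# Tiny made-up sample that reuses four of the profiled column names.
sample = io.StringIO(
    "id,level,latitude,longitude\n"
    "aaaa0001,language,1.5,2.5\n"
    "aaaa0002,dialect,,\n"
    "aaaa0003,family,,\n"
)
rates = null_rates(sample)
print(rates)
```

For the real file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`.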

Schema

16 columns
Per-column summary.

| Column | Type | Null % | Unique | Alerts |
|---|---|---|---|---|
| id | text | 0.0% | 23,740 | near_unique one_word short_text |
| family_id | categorical | 1.8% | 287 | |
| parent_id | text | 1.8% | 7,338 | one_word short_text duplicates |
| name | text | 0.0% | 23,740 | near_unique one_word |
| bookkeeping | categorical | 0.0% | 2 | imbalance |
| level | categorical | 0.0% | 3 | |
| status | categorical | 0.0% | 6 | |
| latitude | numeric | 66.5% | 7,798 | null_rate |
| longitude | numeric | 66.5% | 7,757 | null_rate |
| iso639P3code | text | 66.4% | 7,968 | near_unique one_word null_rate short_text |
| description | unknown | 0.0% | | skipped |
| markup_description | unknown | 0.0% | | skipped |
| child_family_count | numeric | 0.0% | 88 | high_skew outliers |
| child_language_count | numeric | 0.0% | 126 | high_skew outliers |
| child_dialect_count | numeric | 0.0% | 164 | high_skew outliers |
| country_ids | categorical | 64.2% | 680 | null_rate |

id

text identifier near_unique one_word short_text
Fixed 8-character single-token codes (e.g., 'melk1240', 'yang1299'), unique across all 23,740 rows with no nulls or duplicates. The pattern of four letters followed by four digits is consistent with Glottolog-style language identifiers, making this a primary key rather than analyzable text. Treatment: Use as the row key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 23,740
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 86.11
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
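
Before `id` is used as a join key, its format and uniqueness can be checked directly. This is a minimal sketch: the pattern encodes the profile's own description (four lowercase letters followed by four digits) and may need loosening if real identifiers deviate from it; the function name `check_ids` and the sample list are hypothetical.

```python
import re

# Pattern taken from the profile's description: four lowercase letters + four digits.
GLOTTOCODE_RE = re.compile(r"[a-z]{4}[0-9]{4}")


def check_ids(ids):
    """Return (malformed, duplicated) ids from an iterable of strings."""
    seen = set()
    malformed = []
    duplicated = []
    for value in ids:
        if not GLOTTOCODE_RE.fullmatch(value):
            malformed.append(value)
        if value in seen:
            duplicated.append(value)
        seen.add(value)
    return malformed, duplicated


# 'melk1240' and 'yang1299' are quoted in the profile; 'bad-id' is made up.
malformed, duplicated = check_ids(["melk1240", "yang1299", "bad-id", "melk1240"])
print(malformed)   # ['bad-id']
print(duplicated)  # ['melk1240']
```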

family_id

categorical foreign_key
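Looks like a Glottolog-style language family identifier (e.g. 'atla1278' for Atlantic-Congo, 'aust1307' for Austronesian), present on 23,311 of the 23,740 rows. The distribution is heavily skewed: the top family alone covers 20.0% of the non-null rows (top_rate is computed over populated values, so it is slightly lower as a share of all rows) and the top 10 of 287 families dominate, yielding an entropy ratio of 0.598. Null rate is low at 1.81% (429 rows, the same count as the null `parent_id` values, which is consistent with top-level entries that have no family above them). Treatment: left-join on this id to a language-family reference table; consider grouping the long tail before modelling. high · anthropic:claude-opus-4-7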
n: 23,740
nulls: 429 (1.8%)
unique: 287
top_value: atla1278
top_rate: 0.2
cardinality: 287
entropy: 4.886
entropy_ratio: 0.5984
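
Grouping the long tail of `family_id`, as the treatment suggests, could look like the sketch below. The function name `collapse_long_tail`, the `other` label, and the sample values are hypothetical; only 'atla1278' and 'aust1307' are identifiers quoted in the profile.

```python
from collections import Counter


def collapse_long_tail(values, top_n, other_label="other"):
    """Keep the top_n most frequent non-null values; map the rest to other_label.

    None is passed through unchanged so missing family ids stay visible.
    """
    counts = Counter(v for v in values if v is not None)
    keep = {value for value, _ in counts.most_common(top_n)}
    return [v if v is None or v in keep else other_label for v in values]


# Made-up sample: two frequent families, two rare ones, one missing value.
sample = ["atla1278"] * 4 + ["aust1307"] * 3 + ["xxxx0001", "yyyy0002", None]
collapsed = collapse_long_tail(sample, top_n=2)
print(Counter(collapsed))
```

Keeping nulls separate from the `other` bucket is a design choice: a row without a family is a different case from a row in a small family.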

parent_id

text foreign_key one_word short_text duplicates
Fixed-width 8-character single-token codes (e.g. 'book1242', 'uncl1493') with len_min=len_max=8 and one_word_rate=1.0; these look like Glottolog-style language/family identifiers used as a parent reference, most likely pointing at the `id` column of this same table. With 7,338 unique values across the 23,311 non-null rows and a 68.5% duplicate_rate, many children share parents; 'book1242' alone accounts for 399 rows. Null rate is low at 1.81%. Treatment: join `parent_id` to the `id` of a parent/language lookup (a left join keeps the 429 parentless rows). high · anthropic:claude-opus-4-7
n: 23,740
nulls: 429 (1.8%)
unique: 7,338
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 15,973
duplicate_rate: 0.6852
vocab_size: 7,189
readability_flesch_mean: 91.19
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
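
If `parent_id` does refer to `id` values in this same table (an assumption the profile supports but does not prove), the parent links can be turned into a child index and walked upward. The helper names and all ids in the sample rows below are invented for illustration.

```python
from collections import defaultdict


def build_child_index(rows):
    """Map each parent_id to the list of ids that name it as parent.

    Rows with a missing parent_id are collected under the key None (roots).
    """
    children = defaultdict(list)
    for row in rows:
        children[row.get("parent_id") or None].append(row["id"])
    return dict(children)


def ancestors(languoid_id, parent_of):
    """Walk parent links upward; stops at a root or if a cycle is detected."""
    chain = []
    seen = {languoid_id}
    current = parent_of.get(languoid_id)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


# Hypothetical rows: the ids are invented for illustration.
rows = [
    {"id": "fami0001", "parent_id": ""},
    {"id": "lang0001", "parent_id": "fami0001"},
    {"id": "lang0002", "parent_id": "fami0001"},
    {"id": "dial0001", "parent_id": "lang0001"},
]
index = build_child_index(rows)
parent_of = {r["id"]: (r["parent_id"] or None) for r in rows}
print(index["fami0001"])                 # ['lang0001', 'lang0002']
print(ancestors("dial0001", parent_of))  # ['lang0001', 'fami0001']
```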

name

text identifier near_unique one_word
The `name` column holds 23,740 fully unique short labels (null_rate 0.0, n_unique equals n), with a mean length of 9.95 characters and 69.5% being a single word. Top tokens like `nuclear`, `central`, `western`, `eastern`, `northern`, `southern`, and `language` are modifiers typical of subgroup names in a language classification, so these are languoid names (languages, dialects, and families) rather than person names. Vocabulary is broad (17,915 distinct words at only ~1.4 words per row), and there are no duplicates, URLs, or emoji. Treatment: Treat as a unique label key; drop from modelling features or hash/embed if semantic content is needed. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 23,740
len_min: 1
len_max: 58
len_mean: 9.95
len_median: 8
len_p95: 22
word_mean: 1.398
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 17,915
readability_flesch_mean: 42.62
emoji_rate: 0
url_rate: 0
one_word_rate: 0.6953
allcaps_rate: 0.0001685
boilerplate_rate: 0

bookkeeping

categorical feature imbalance
Boolean flag with only two values across 23,740 rows and no nulls. It appears to mark languoids that Glottolog keeps on record for bookkeeping purposes rather than as part of the active classification: the 399 'True' rows match the 399 rows whose `parent_id` is 'book1242'. The distribution is severely imbalanced: 'False' covers 98.3% (23,341 rows) versus only 399 'True' cases, yielding an entropy ratio of just 0.12. Treatment: Encode as boolean; consider using it to filter out bookkeeping rows, and apply class-imbalance handling (e.g., stratification or reweighting) if it is used as a target. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 2
top_value: False
top_rate: 0.9832
cardinality: 2
entropy: 0.1231
entropy_ratio: 0.1231
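
To make the imbalance concrete, the sketch below parses the flag and derives reweighting factors from the two class counts quoted above. The weighting formula, total / (number of classes × class count), is one common convention chosen here for illustration, not something the profile prescribes; both helper names are hypothetical.

```python
def parse_flag(text):
    """Convert the strings 'True' / 'False' to bool (case-insensitive)."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"unexpected bookkeeping value: {text!r}")
    return lowered == "true"


def balanced_class_weights(class_counts):
    """Weight each class by n_total / (n_classes * class_count)."""
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts quoted in the profile: 23,341 'False' rows and 399 'True' rows.
weights = balanced_class_weights({False: 23341, True: 399})
print(weights)
```

With these counts the rare 'True' class receives a weight far above 1 and the majority class a weight just above 0.5.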

level

categorical feature
This is a categorical taxonomy tag with exactly 3 levels (dialect, language, family) and no nulls across 23,740 rows. The distribution is well-spread (entropy_ratio 0.94), with 'dialect' leading at 46.0%, followed by 'language' (8,481) and 'family' (4,339). Looks like a linguistic classification level rather than a free-form attribute. Treatment: one-hot encode for modelling or use directly as a stratification key. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 3
top_value: dialect
top_rate: 0.46
cardinality: 3
entropy: 1.494
entropy_ratio: 0.9426
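
The profile does not state how `entropy` and `entropy_ratio` are computed. The reported figures are consistent with Shannon entropy in bits, divided by its maximum, log2 of the number of categories, and the sketch below applies that assumed definition to the three `level` counts from the dataset summary.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Level counts from the dataset summary: dialect, language, family.
level_counts = [10920, 8481, 4339]
print(round(entropy_bits(level_counts), 3))
print(round(entropy_ratio(level_counts), 4))
```

A ratio near 1 means the categories are close to evenly used; a ratio near 0 means one category dominates.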

status

categorical label
This is a categorical status column with 6 levels matching UNESCO-style language endangerment categories (safe, definitely endangered, vulnerable, extinct, critically endangered, severely endangered). The distribution is heavily imbalanced: 'safe' accounts for 79.9% of 23,740 rows, while the rarest level 'severely endangered' has only 413 records. Entropy ratio is 0.44, confirming low diversity despite 6 classes. Treatment: Treat as ordinal target; stratify or rebalance before classification given the 80/20 dominance of 'safe'. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 6
top_value: safe
top_rate: 0.7989
cardinality: 6
entropy: 1.15
entropy_ratio: 0.4447
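
Treating `status` as ordinal requires an explicit ordering of the six labels. The order below, from least to most endangered, is an assumption based on the usual reading of these categories, and the exact spelling of the labels in the file should be checked before use. The names `STATUS_ORDER` and `encode_status` are hypothetical.

```python
# Assumed severity order, from least to most endangered.
STATUS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
STATUS_RANK = {label: rank for rank, label in enumerate(STATUS_ORDER)}


def encode_status(label):
    """Return the ordinal rank (0 = safe ... 5 = extinct) for a status string."""
    key = label.strip().lower()
    if key not in STATUS_RANK:
        raise ValueError(f"unknown status: {label!r}")
    return STATUS_RANK[key]


print([encode_status(s) for s in ["safe", "extinct", "Vulnerable"]])  # [0, 5, 1]
```

Raising on an unknown label, rather than mapping it to a default rank, keeps spelling mismatches from silently entering the encoded column.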

latitude

numeric feature null_rate
Geographic latitude coordinate spanning -55.27 to 73.14, consistent with valid Earth latitudes. Two-thirds of rows are null (null_rate 0.6654), which severely limits coverage. The distribution is mildly right-skewed (0.54) with median 6.31 and IQR ~24.5, suggesting a bias toward northern-hemisphere but tropical-leaning observations. Treatment: Impute or filter the 66% nulls before any spatial modelling; pair with longitude for geo-features. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 15,797 (66.5%)
unique: 7,798
min: -55.27
max: 73.14
mean: 8.17
median: 6.306
std: 18.96
q1: -5.137
q3: 19.34
iqr: 24.47
skew: 0.5403
kurtosis: 0.3006
n_outliers: 129
outlier_rate: 0.01624
zero_rate: 0

longitude

numeric feature null_rate
Geographic longitude in decimal degrees, with values spanning -178.785 to 179.306, consistent with the global WGS84 range. Two-thirds of rows are null (null_rate 0.6654), so coverage is the dominant concern; among populated rows the distribution is mildly left-skewed (-0.48) with a median of 47.72 suggesting an Eastern-Hemisphere bias. Only 13 outliers (0.16%) and no zeros, so the populated values themselves look clean. Treatment: Pair with latitude for geospatial features; impute or filter the 66.5% missing before modelling. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 15,797 (66.5%)
unique: 7,757
min: -178.8
max: 179.3
mean: 51.27
median: 47.72
std: 81.14
q1: 7.235
q3: 124.1
iqr: 116.9
skew: -0.4832
kurtosis: -0.7745
n_outliers: 13
outlier_rate: 0.001637
zero_rate: 0
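
Pairing `latitude` with `longitude` and filtering the missing rows might be sketched as follows. The code assumes that missing coordinates arrive as empty strings and treats out-of-range values as missing; the helper names and the sample rows are invented.

```python
def parse_coordinate(text, lower, upper):
    """Parse a decimal-degree string; return None if empty or out of range."""
    text = (text or "").strip()
    if not text:
        return None
    value = float(text)
    return value if lower <= value <= upper else None


def rows_with_coordinates(rows):
    """Yield (id, lat, lon) only for rows where both coordinates are usable."""
    for row in rows:
        lat = parse_coordinate(row.get("latitude"), -90.0, 90.0)
        lon = parse_coordinate(row.get("longitude"), -180.0, 180.0)
        if lat is not None and lon is not None:
            yield row["id"], lat, lon


# Hypothetical rows; ids and coordinates are invented.
rows = [
    {"id": "lang0001", "latitude": "6.31", "longitude": "47.72"},
    {"id": "dial0001", "latitude": "", "longitude": ""},
    {"id": "lang0002", "latitude": "95.0", "longitude": "10.0"},
]
located = list(rows_with_coordinates(rows))
print(located)  # [('lang0001', 6.31, 47.72)]
```

Filtering is shown rather than imputation because a guessed location is rarely meaningful for a row that has no coordinates at all.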

iso639P3code

text identifier near_unique one_word null_rate short_text
This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). Two-thirds are missing (null_rate=0.6644), and the 7,968 populated rows (23,740 - 15,772) carry 7,968 distinct codes with no duplicates, so each code identifies exactly one row rather than acting as a repeated categorical. Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and handle the 66% nulls explicitly. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 15,772 (66.4%)
unique: 7,968
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,968
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
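
Using `iso639P3code` as a join key could start from a lookup like the one sketched here. It assumes codes are three lowercase letters and that missing codes are empty strings; the function name `build_iso_index` and the sample rows are hypothetical.

```python
import re

ISO_RE = re.compile(r"[a-z]{3}")  # assumed shape: three lowercase letters


def build_iso_index(rows):
    """Map ISO 639-3 code -> languoid id, skipping rows without a code.

    Raises ValueError on a malformed or repeated code so problems surface early.
    """
    index = {}
    for row in rows:
        code = (row.get("iso639P3code") or "").strip()
        if not code:
            continue  # missing code: expected for roughly two-thirds of rows
        if not ISO_RE.fullmatch(code):
            raise ValueError(f"malformed ISO 639-3 code: {code!r}")
        if code in index:
            raise ValueError(f"duplicate ISO 639-3 code: {code!r}")
        index[code] = row["id"]
    return index


# Hypothetical rows for illustration only; the id/code pairings are invented.
rows = [
    {"id": "lang0001", "iso639P3code": "aaa"},
    {"id": "dial0001", "iso639P3code": ""},
    {"id": "lang0002", "iso639P3code": "bbb"},
]
iso_index = build_iso_index(rows)
print(iso_index)  # {'aaa': 'lang0001', 'bbb': 'lang0002'}
```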

description

unknown free_text skipped
This column is named 'description' but saturn skipped profiling, so no kind, uniqueness, or value statistics are available. Only the row count (23740) and a null rate of 0.0 are reported. Without further stats, the content and structure cannot be characterized. Treatment: Re-profile or sample manually before deciding; if textual, tokenize and embed before modelling. low · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: not reported

markup_description

unknown free_text skipped
This column was skipped by the profiler, so no statistics, uniqueness count, or value samples are available beyond a row count of 23,740 and a null rate of 0.0. The name suggests it holds a marked-up version (for example HTML or Markdown) of the `description` text for each languoid, but that is inferred from the label alone, not from evidence. No distributional signal can be assessed here. Treatment: Re-run the profiler with text handling enabled, then tokenize and embed before modelling. low · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: not reported

child_family_count

numeric feature high_skew outliers
A numeric count of child families per record, with 23,740 rows and only 88 distinct values. It is overwhelmingly zero (zero_rate 0.9082), so q1, median, and q3 are all 0 and the IQR is 0, yet a long tail pushes the max to 859 with mean 0.879 and std 13.20. Skew of 44.40 and kurtosis of 2352.94 confirm an extreme heavy-tailed distribution. The 2,179 rows (9.18%) flagged as outliers equal the non-zero share (1 - 0.9082), which is what an IQR-based rule produces when the IQR is 0: every non-zero value is flagged, so here the flag means "has at least one child family" rather than "anomalous". Treatment: Binarize (zero vs non-zero) or apply log1p before modelling to tame the extreme skew. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 88
min: 0
max: 859
mean: 0.8792
median: 0
std: 13.2
q1: 0
q3: 0
iqr: 0
skew: 44.4
kurtosis: 2353
n_outliers: 2,179
outlier_rate: 0.09179
zero_rate: 0.9082
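
The profile does not say which outlier rule saturn applies. The reported rate (0.0918, equal to 1 minus the zero rate) is consistent with the common 1.5×IQR fence rule, under which an IQR of 0 flags every non-zero value. The sketch below illustrates that behaviour on a made-up sample; `tukey_outliers` is a hypothetical helper and the quartile method is an arbitrary choice.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return values outside [q1 - k*iqr, q3 + k*iqr]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up zero-inflated sample: 9 zeros and a single non-zero count.
sample = [0] * 9 + [5]
print(tukey_outliers(sample))  # [5]
```

Because q1 and q3 are both 0 here, the fences collapse to the single point 0 and the lone non-zero count is flagged, however small it is.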

child_language_count

numeric feature high_skew outliers
A numeric count of child languages per record, where 81.7% of rows are zero and Q1=median=Q3=0, so the typical entity has none. The distribution is extremely long-tailed (skew 41.86, kurtosis 2115) with a max of 1,435. The 4,339 flagged outliers (18.3%) are exactly the non-zero rows (1 - 0.8172), because an IQR of 0 makes any non-zero value an outlier; the count also equals the number of family-level rows in `level`, which is consistent with only families having child languages. Treatment: Binarize (has_children vs none) or log1p-transform before modelling given the heavy zero-inflation and skew. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 126
min: 0
max: 1,435
mean: 1.996
median: 0
std: 23.41
q1: 0
q3: 0
iqr: 0
skew: 41.86
kurtosis: 2115
n_outliers: 4,339
outlier_rate: 0.1828
zero_rate: 0.8172

child_dialect_count

numeric feature high_skew outliers
This is a count of child dialects per record, dominated by zeros (zero_rate 0.7442), with a median of 0 and a Q3 of 1 yet a max of 2,369. Skew of 42.22 and kurtosis of 2159 confirm an extreme long tail, and 17.99% of rows flag as outliers. The mean of 3.39 is pulled far above the median by that tail, so it is a poor summary of a typical row. Treatment: Log1p-transform or bin (zero / one / many) before modelling. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 0 (0.0%)
unique: 164
min: 0
max: 2,369
mean: 3.389
median: 0
std: 36.8
q1: 0
q3: 1
iqr: 1
skew: 42.22
kurtosis: 2159
n_outliers: 4,272
outlier_rate: 0.1799
zero_rate: 0.7442
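
Both suggested treatments for the child-count columns are simple to express. This is a minimal sketch with hypothetical helper names; the values 0 and 2369 are the reported min and max of `child_dialect_count`, and the other inputs are arbitrary.

```python
import math


def log1p_count(count):
    """log(1 + count): maps 0 to 0 and compresses the long right tail."""
    return math.log1p(count)


def bin_count(count):
    """Three coarse bins matching the suggested zero / one / many split."""
    if count == 0:
        return "zero"
    if count == 1:
        return "one"
    return "many"


# 0 and 2369 are the min and max reported in the profile; 1 and 3 are arbitrary.
for c in [0, 1, 3, 2369]:
    print(c, round(log1p_count(c), 3), bin_count(c))
```

The log1p transform keeps the ordering of counts while shrinking the gap between typical and extreme rows; binning discards magnitude entirely, which may suit models that only need to know whether a languoid has dialects.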

country_ids

categorical feature null_rate
This column holds ISO-style country codes (PG, ID, NG, AU, IN…) as a categorical feature with 680 distinct values across 23,740 rows. Coverage is poor: 64.24% of rows are null, and the non-null distribution is broad (entropy 6.49, ratio 0.69), with Papua New Guinea leading at only 10.29% of the populated rows. The 680 distinct values far exceed the ~250 real countries, which suggests that many cells hold several codes concatenated into one string, as the plural 'country_ids' implies, or contain non-standard tokens. Treatment: Split multi-code entries, normalise to ISO-3166, and impute or flag the 64% missing before one-hot or target encoding. high · anthropic:claude-opus-4-7
n: 23,740
nulls: 15,250 (64.2%)
unique: 680
top_value: PG
top_rate: 0.1029
cardinality: 680
entropy: 6.493
entropy_ratio: 0.6901
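
Splitting multi-code `country_ids` cells before counting might look like the sketch below. The separator is not documented in the profile, so the code assumes whitespace, commas, or semicolons; the helper names and sample cells are hypothetical, with only PG, ID, and NG taken from the report.

```python
import re
from collections import Counter

# Assumption: codes within one cell are separated by spaces, commas, or semicolons.
SEPARATOR_RE = re.compile(r"[\s,;]+")


def split_country_ids(cell):
    """Split one country_ids cell into a list of upper-cased codes."""
    if not cell or not cell.strip():
        return []
    return [code.upper() for code in SEPARATOR_RE.split(cell.strip()) if code]


def country_counts(cells):
    """Count how many rows mention each individual country code."""
    counts = Counter()
    for cell in cells:
        counts.update(set(split_country_ids(cell)))
    return counts


# Made-up cells, including empty and missing values.
cells = ["PG", "ID PG", "", "NG", None, "ID;PG"]
counts = country_counts(cells)
print(counts.most_common(3))
```

Counting each code at most once per row (via `set`) gives per-country row counts, which should bring the number of distinct values down toward the number of real countries if concatenation is indeed the cause of the 680 figure.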