language data glottolog languoid

source /home/coolhand/datasets/language-data/glottolog_languoid.csv 23,740 rows 16 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Glottolog languoid catalog with 23,740 rows and 16 columns describing languages, dialects, and families, along with geographic and endangerment metadata. The `level` field splits the records into three classes: dialect (10,920), language (8,481), and family (4,339), which makes it the natural primary lens. Endangerment `status` is dominated by 'safe' (79.9%), but the remaining five categories cover 4,775 languoids ranging from vulnerable to extinct. Geography is concentrated: `country_ids` is led by PG (874), ID (695), and NG (480), and `family_id` is heavily skewed toward atla1278 (4,663) and aust1307 (3,850). Note that `iso639P3code`, `latitude`, and `longitude` are each about 66% null, and `country_ids` is about 64% null, so spatial analysis will cover only about a third of rows.

citing: level · status · country_ids · family_id · latitude · longitude · iso639P3code · child_dialect_count · bookkeeping

Charts the summary said to look at first

level · Shows the split between dialect, language, and family entries — the core taxonomy of the dataset.

Top values for level (3 unique shown, of 3 total).
value	count	share
dialect	10920	46.0%
language	8481	35.7%
family	4339	18.3%

status · Highlights how many languoids are safe versus various endangerment levels, including 889 already extinct.

Top values for status (6 unique shown, of 6 total).
value	count	share
safe	18965	79.9%
definitely endangered	1814	7.6%
vulnerable	1194	5.0%
extinct	889	3.7%
critically endangered	465	2.0%
severely endangered	413	1.7%

family_id · Reveals the dominant language families; atla1278 and aust1307 alone account for over a third of records.

Top values for family_id (20 unique shown, of 287 total).
value	count	share
atla1278	4663	19.6%
aust1307	3850	16.2%
indo1319	2201	9.3%
sino1245	1666	7.0%
afro1255	1259	5.3%
nucl1709	762	3.2%
pama1250	598	2.5%
aust1305	503	2.1%
book1242	399	1.7%
otom1299	338	1.4%
mand1469	303	1.3%
sign1238	259	1.1%
drav1251	255	1.1%
cent2225	251	1.1%
turk1311	229	1.0%
taik1256	223	0.9%
nilo1247	201	0.8%
ural1272	185	0.8%
japo1237	179	0.8%
tupi1275	157	0.7%

country_ids · Shows geographic concentration, with Papua New Guinea, Indonesia, and Nigeria leading by languoid count.

Top values for country_ids (20 unique shown, of 680 total).
value	count	share
PG	874	3.7%
ID	695	2.9%
NG	480	2.0%
AU	432	1.8%
IN	356	1.5%
MX	297	1.3%
CN	271	1.1%
BR	263	1.1%
US	247	1.0%
CM	196	0.8%
PH	177	0.7%
CD	156	0.7%
VU	118	0.5%
SD	99	0.4%
PE	97	0.4%
TZ	93	0.4%
MY	90	0.4%
TD	88	0.4%
RU	83	0.3%
CO	82	0.3%

latitude · Distribution of languoid latitudes (where known) — note the ~66% null rate limits coverage.

Histogram bins for latitude (median: 6.30619).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	48
-26.38 – -23.17	78
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	281
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	379
2.51 – 5.72	469
5.72 – 8.93	664
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	387
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	373
28.19 – 31.4	167
31.4 – 34.61	144
34.61 – 37.82	179
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	78
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6

Schema

16 columns

Per-column summary.
column	type	null rate	unique	alerts
id	text	0.0%	23,740	near_unique one_word short_text
family_id	categorical	1.8%	287
parent_id	text	1.8%	7,338	one_word short_text duplicates
name	text	0.0%	23,740	near_unique one_word
bookkeeping	categorical	0.0%	2	imbalance
level	categorical	0.0%	3
status	categorical	0.0%	6
latitude	numeric	66.5%	7,798	null_rate
longitude	numeric	66.5%	7,757	null_rate
iso639P3code	text	66.4%	7,968	near_unique one_word null_rate short_text
description	unknown	0.0%	—	skipped
markup_description	unknown	0.0%	—	skipped
child_family_count	numeric	0.0%	88	high_skew outliers
child_language_count	numeric	0.0%	126	high_skew outliers
child_dialect_count	numeric	0.0%	164	high_skew outliers
country_ids	categorical	64.2%	680	null_rate

id

text identifier near_unique one_word short_text

Fixed 8-character single-token codes (e.g., 'melk1240', 'yang1299'), unique across all 23,740 rows with no nulls or duplicates. The pattern of four letters followed by four digits is consistent with Glottolog-style language identifiers, making this a primary key rather than analyzable text. Treatment: Use as the row key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 23,740
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 86.11
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
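
The fixed shape described above (four letters followed by four digits) can be checked before `id` is used as a join key. This is a minimal sketch, assuming the pattern holds for every row as the profile reports and that the letters are lowercase; the names `looks_like_glottocode` and `find_bad_ids` are illustrative and not part of any Glottolog tooling.

```python
import re

# Assumption: ids are four lowercase letters followed by four digits,
# as described in the profile above (e.g. 'melk1240').
GLOTTOCODE_RE = re.compile(r"[a-z]{4}[0-9]{4}")


def looks_like_glottocode(value: str) -> bool:
    """Return True if value matches the 8-character id shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None


def find_bad_ids(ids):
    """Return ids that break the pattern or repeat an earlier id."""
    seen, bad = set(), []
    for value in ids:
        if not looks_like_glottocode(value) or value in seen:
            bad.append(value)
        seen.add(value)
    return bad
```

An empty result from `find_bad_ids` over the full column would support treating `id` as a primary key.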

family_id

categorical foreign_key

Looks like a Glottolog-style language family identifier (e.g. 'atla1278' for Atlantic-Congo, 'aust1307' for Austronesian) tagging each of the 23,740 rows. The distribution is heavily skewed: the top family alone covers 20.0% of non-null rows (19.6% of all rows), and the top 10 of 287 families cover about 68% of all rows, yielding an entropy ratio of 0.598. Null rate is low at 1.81%. Treatment: left-join on this id to a language-family reference table; consider grouping the long tail before modelling. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 429 (1.8%)
unique: 287
top_value: atla1278
top_rate: 0.2
cardinality: 287
entropy: 4.886
entropy_ratio: 0.5984
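
The `top_rate` of 0.2 and the 19.6% share shown in the chart table for atla1278 appear to differ only in their denominator. The worked check below uses counts from this report; the assumption that `top_rate` excludes nulls is inferred from the arithmetic, not stated by the profiler, and the helper name `share` is illustrative.

```python
def share(count: int, n_rows: int, n_nulls: int = 0) -> float:
    """Share of `count` among the rows left after excluding nulls."""
    return count / (n_rows - n_nulls)


N_ROWS = 23_740

# family_id: the top value atla1278 has 4,663 rows; 429 rows are null.
family_non_null = share(4_663, N_ROWS, 429)  # compare with top_rate
family_all_rows = share(4_663, N_ROWS)       # compare with the chart table

print(f"{family_non_null:.3f} {family_all_rows:.3f}")
```

The same check applies to `country_ids`, where 874 PG rows over 8,490 non-null rows give the reported 0.1029, against 3.7% of all rows.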

parent_id

text foreign_key one_word short_text duplicates

Fixed-width 8-character single-token codes (e.g. 'book1242', 'uncl1493') with len_min=len_max=8 and one_word_rate=1.0 — these look like Glottolog-style language/family identifiers used as a parent reference. With 7338 unique values across 23740 rows and a 68.5% duplicate_rate, many children share parents; 'book1242' alone accounts for 399 rows. Null rate is low at 1.81%. Treatment: left-join on this id to a parent/language lookup table. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 429 (1.8%)
unique: 7,338
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 15,973
duplicate_rate: 0.6852
vocab_size: 7,189
readability_flesch_mean: 91.19
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
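
The left-join treatment can be sketched as a dictionary lookup. This assumes that `parent_id` refers to the `id` column of the same table, which the profile suggests but does not confirm; the sample rows are hypothetical placeholders, and `attach_parent_name` is an illustrative name.

```python
def attach_parent_name(rows):
    """Left-join each row to its parent row on parent_id -> id.

    Rows whose parent_id is empty or absent from the lookup keep
    parent_name=None instead of being dropped.
    """
    by_id = {row["id"]: row for row in rows}
    joined = []
    for row in rows:
        parent = by_id.get(row.get("parent_id") or "")
        joined.append({**row, "parent_name": parent["name"] if parent else None})
    return joined


# Hypothetical rows for illustration only; ids and names are placeholders.
sample = [
    {"id": "aaaa0001", "name": "Example family", "parent_id": ""},
    {"id": "bbbb0002", "name": "Example language", "parent_id": "aaaa0001"},
    {"id": "cccc0003", "name": "Orphan example", "parent_id": "zzzz9999"},
]
result = attach_parent_name(sample)
```

Keeping unmatched rows with a `None` parent preserves the 429 null-parent rows rather than silently discarding them.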

name

text identifier near_unique one_word

The `name` column holds 23,740 fully unique short labels (null_rate 0.0, n_unique equals n), with a mean length of 9.95 characters and 69.5% being a single word. Top tokens like `nuclear`, `central`, `western`, `eastern`, `northern`, `southern`, and `language` suggest these are languoid names (languages, dialects, and family groupings) rather than person names. Vocabulary is broad (17,915 distinct words across only ~1.4 words per row), and there are no duplicates, URLs, or emoji. Treatment: Treat as a unique label key; drop from modelling features or hash/embed if semantic content is needed. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 23,740
len_min: 1
len_max: 58
len_mean: 9.95
len_median: 8
len_p95: 22
word_mean: 1.398
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 17,915
readability_flesch_mean: 42.62
emoji_rate: 0
url_rate: 0
one_word_rate: 0.6953
allcaps_rate: 0.0001685
boilerplate_rate: 0

bookkeeping

categorical feature imbalance

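Binary flag with only two values across 23,740 rows and no nulls; the name suggests it marks bookkeeping entries rather than ordinary languoids. The distribution is severely imbalanced: 'False' covers 98.3% (23,341 rows) versus only 399 'True' cases, yielding an entropy ratio of just 0.12. The 399 'True' rows equal the 399 rows that `family_id` and `parent_id` report for 'book1242', which suggests the flag and that id mark the same records, though the profile does not confirm this. Treatment: Encode as boolean and apply class-imbalance handling (e.g., stratification or reweighting) if used as a target. high · anthropic:claude-opus-4-7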

n: 23,740
nulls: 0 (0.0%)
unique: 2
top_value: False
top_rate: 0.9832
cardinality: 2
entropy: 0.1231
entropy_ratio: 0.1231

level

categorical feature

This is a categorical taxonomy tag with exactly 3 levels (dialect, language, family) and no nulls across 23,740 rows. The distribution is well-spread (entropy_ratio 0.94), with 'dialect' leading at 46.0%, followed by 'language' (8,481) and 'family' (4,339). Looks like a linguistic classification level rather than a free-form attribute. Treatment: one-hot encode for modelling or use directly as a stratification key. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 3
top_value: dialect
top_rate: 0.46
cardinality: 3
entropy: 1.494
entropy_ratio: 0.9426
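
The `entropy` and `entropy_ratio` figures can be reproduced from the category counts. This is a minimal sketch, assuming the profiler uses Shannon entropy in bits and normalises by the log2 of the number of categories; that reading matches the reported numbers but is not documented in the report itself.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts taken from the level table in this report.
level_counts = [10_920, 8_481, 4_339]
print(round(entropy_bits(level_counts), 3), round(entropy_ratio(level_counts), 4))
```

A ratio near 1 means the categories are close to evenly spread; the same functions applied to the six `status` counts give a ratio near 0.44.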

status

categorical label

This is a categorical status column with 6 levels matching UNESCO-style language endangerment categories (safe, definitely endangered, vulnerable, extinct, critically endangered, severely endangered, listed here by frequency). The distribution is heavily imbalanced: 'safe' accounts for 79.9% of 23,740 rows, while the rarest level 'severely endangered' has only 413 records. Entropy ratio is 0.44, confirming low diversity despite 6 classes. Treatment: Treat as an ordinal target (safe < vulnerable < definitely endangered < severely endangered < critically endangered < extinct, assuming the UNESCO ordering); stratify or rebalance before classification given the roughly 80/20 split between 'safe' and all other levels combined. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 6
top_value: safe
top_rate: 0.7989
cardinality: 6
entropy: 1.15
entropy_ratio: 0.4447
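
Treating `status` as ordinal requires an explicit severity order, since the tables in this report sort levels by frequency. The sketch below assumes the UNESCO-style ordering from least to most endangered; `STATUS_ORDER` and `encode_status` are illustrative names, not part of the profiler.

```python
# Assumption: severity order follows the UNESCO-style scale, from least
# to most endangered. The report itself lists levels by frequency.
STATUS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
STATUS_RANK = {label: rank for rank, label in enumerate(STATUS_ORDER)}


def encode_status(label: str) -> int:
    """Map a status label to its ordinal rank; raise on unknown labels."""
    try:
        return STATUS_RANK[label.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown status: {label!r}") from None
```

Raising on unknown labels, rather than returning a default rank, keeps a misspelled or new category from being silently scored as 'safe'.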

latitude

numeric feature null_rate

Geographic latitude coordinate spanning -55.27 to 73.14, consistent with valid Earth latitudes. Two-thirds of rows are null (null_rate 0.6654), which severely limits coverage. Among populated rows the distribution is mildly right-skewed (0.54) with median 6.31 and IQR ~24.5 (q1 -5.14, q3 19.34), so most located languoids sit in the tropics, centred slightly north of the equator. Treatment: Impute or filter the 66% nulls before any spatial modelling; pair with longitude for geo-features. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 15,797 (66.5%)
unique: 7,798
min: -55.27
max: 73.14
mean: 8.17
median: 6.306
std: 18.96
q1: -5.137
q3: 19.34
iqr: 24.47
skew: 0.5403
kurtosis: 0.3006
n_outliers: 129
outlier_rate: 0.01624
zero_rate: 0

longitude

numeric feature null_rate

Geographic longitude in decimal degrees, with values spanning -178.785 to 179.306, consistent with the global WGS84 range. Two-thirds of rows are null (null_rate 0.6654), so coverage is the dominant concern; among populated rows the distribution is mildly left-skewed (-0.48) with a median of 47.72 suggesting an Eastern-Hemisphere bias. Only 13 outliers (0.16%) and no zeros, so the populated values themselves look clean. Treatment: Pair with latitude for geospatial features; impute or filter the 66.5% missing before modelling. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 15,797 (66.5%)
unique: 7,757
min: -178.8
max: 179.3
mean: 51.27
median: 47.72
std: 81.14
q1: 7.235
q3: 124.1
iqr: 116.9
skew: -0.4832
kurtosis: -0.7745
n_outliers: 13
outlier_rate: 0.001637
zero_rate: 0
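
The filter option in the treatment above can be sketched with the standard `csv` module. This assumes nulls appear as empty strings in the CSV and that `latitude` and `longitude` should be kept or dropped together; the sample rows and their coordinate values are hypothetical placeholders.

```python
import csv
import io


def rows_with_coordinates(rows):
    """Yield rows whose latitude and longitude both parse as in-range floats."""
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # empty or missing coordinate: skip the row
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            yield {**row, "latitude": lat, "longitude": lon}


# Hypothetical CSV snippet; ids and coordinate values are placeholders.
sample_csv = io.StringIO(
    "id,latitude,longitude\n"
    "aaaa0001,6.3,47.7\n"
    "bbbb0002,,\n"
    "cccc0003,12.5,\n"
)
located = list(rows_with_coordinates(csv.DictReader(sample_csv)))
```

Only the first sample row survives, because the other two lack one or both coordinates.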

iso639P3code

text identifier near_unique one_word null_rate short_text

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with 7,968 distinct codes across 23,740 rows. Two-thirds are missing (null_rate=0.6644), and every populated value is distinct (7,968 codes for 7,968 non-null rows, n_duplicates 0), indicating one code per language entry rather than a repeated categorical. Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and handle the 66% nulls explicitly. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 15,772 (66.4%)
unique: 7,968
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,968
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
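
Handling the nulls explicitly can start with partitioning rows by whether they carry a well-formed code. This is a minimal sketch, assuming codes are three lowercase ASCII letters (consistent with len_min = len_max = 3 and allcaps_rate = 0 above) and that nulls arrive as empty strings or missing keys; `split_by_iso_code` is an illustrative name.

```python
import re

# Assumption: codes are three lowercase ASCII letters, consistent with
# len_min = len_max = 3 and allcaps_rate = 0 in the stats above.
ISO_639_3_RE = re.compile(r"[a-z]{3}")


def split_by_iso_code(rows):
    """Partition rows into (coded, uncoded) by iso639P3code validity."""
    coded, uncoded = [], []
    for row in rows:
        code = (row.get("iso639P3code") or "").strip()
        (coded if ISO_639_3_RE.fullmatch(code) else uncoded).append(row)
    return coded, uncoded
```

Only the `coded` partition would then be joined to an ISO 639-3 reference table, while `uncoded` rows are kept for analyses that do not need the code.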

description

unknown free_text skipped

This column is named 'description' but saturn skipped profiling, so no kind, uniqueness, or value statistics are available. Only the row count (23740) and a null rate of 0.0 are reported. Without further stats, the content and structure cannot be characterized. Treatment: Re-profile or sample manually before deciding; if textual, tokenize and embed before modelling. low · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: —

markup_description

unknown free_text skipped

This column was skipped by the profiler, so no statistics, uniqueness count, or value samples are available beyond a row count of 23,740 and a null rate of 0.0. The name suggests it holds markup or descriptive text (likely HTML or otherwise formatted languoid descriptions), but that is inferred from the label alone, not from evidence. No distributional signal can be assessed here. Treatment: Re-run the profiler with text handling enabled, then tokenize and embed before modelling. low · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: —

child_family_count

numeric feature high_skew outliers

A numeric count of child families per record, with 23,740 rows and only 88 distinct values. It is overwhelmingly zero (zero_rate 0.9082), so q1, median, and q3 are all 0 and IQR is 0, yet a long tail pushes max to 859 with mean 0.879 and std 13.20. Skew of 44.40 and kurtosis of 2352.94 confirm an extreme heavy-tailed distribution. The 2,179 rows (9.18%) flagged as outliers match the share of non-zero rows (1 - 0.9082), as expected when the IQR is 0, so the flag here means "has at least one child family" rather than "anomalous". Treatment: Binarize (zero vs non-zero) or apply log1p before modelling to tame the extreme skew. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 88
min: 0
max: 859
mean: 0.8792
median: 0
std: 13.2
q1: 0
q3: 0
iqr: 0
skew: 44.4
kurtosis: 2353
n_outliers: 2,179
outlier_rate: 0.09179
zero_rate: 0.9082
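
The link between `zero_rate` and `outlier_rate` can be checked with a short calculation. This assumes the profiler flags outliers with the common 1.5 × IQR (Tukey) rule, which the report does not state; the function name `tukey_fences` is illustrative.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# child_family_count: q1 = q3 = 0, so both fences collapse to 0 and
# every non-zero count lies outside them.
low, high = tukey_fences(0, 0)

# Non-zero row count implied by zero_rate = 0.9082 over 23,740 rows.
implied_outliers = round((1 - 0.9082) * 23_740)
```

Under that assumption `implied_outliers` agrees with the reported `n_outliers` of 2,179, up to rounding of the published rate.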

child_language_count

numeric feature high_skew outliers

A numeric count of child languages per record, where 81.7% of rows are zero and Q1=median=Q3=0, so the typical entity has none. The distribution is extremely long-tailed (skew 41.86, kurtosis 2115) with a max of 1,435. The 4,339 flagged outliers (18.3%) are exactly the non-zero rows, a consequence of the IQR being 0, and that count equals the number of family-level rows in `level`, which suggests that only families carry child languages. Treatment: Binarize (has_children vs none) or log1p-transform before modelling given the heavy zero-inflation and skew. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 126
min: 0
max: 1,435
mean: 1.996
median: 0
std: 23.41
q1: 0
q3: 0
iqr: 0
skew: 41.86
kurtosis: 2115
n_outliers: 4,339
outlier_rate: 0.1828
zero_rate: 0.8172

child_dialect_count

numeric feature high_skew outliers

This is a count of child dialects per record, dominated by zeros (zero_rate 0.7442), with a median of 0 and a Q3 of 1 yet a max of 2,369. Skew of 42.22 and kurtosis of 2159 confirm an extreme long tail, and 17.99% of rows flag as outliers. The mean of 3.39 is pulled far above the median, so it is a poor summary of the typical row. Treatment: Log1p-transform or bin (zero / one / many) before modelling. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 0 (0.0%)
unique: 164
min: 0
max: 2,369
mean: 3.389
median: 0
std: 36.8
q1: 0
q3: 1
iqr: 1
skew: 42.22
kurtosis: 2159
n_outliers: 4,272
outlier_rate: 0.1799
zero_rate: 0.7442
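
Both treatments named above (log1p and zero / one / many binning) can be sketched in a few lines. The function names are illustrative, and the bin boundaries simply follow the wording of the treatment rather than any tuned threshold.

```python
import math


def log1p_count(count: int) -> float:
    """Compress a non-negative count; log1p(0) stays exactly 0."""
    return math.log1p(count)


def bin_count(count: int) -> str:
    """Bucket a count into 'zero', 'one', or 'many'."""
    if count == 0:
        return "zero"
    return "one" if count == 1 else "many"


# The reported maximum of 2,369 shrinks to under 8 on the log1p scale.
max_transformed = log1p_count(2_369)
```

The log1p form keeps the ordering of counts while shrinking the tail, whereas binning discards magnitude entirely; which is preferable depends on the downstream model.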

country_ids

categorical feature null_rate

This column holds ISO-style country codes (PG, ID, NG, AU, IN…) as a categorical feature with 680 distinct values across 23,740 rows. Coverage is poor: 64.24% of rows are null, and the non-null distribution is broad (entropy 6.49, ratio 0.69), with Papua New Guinea leading at only 10.29% of non-null rows (3.7% of all rows). The 680 distinct values far exceed the ~250 real countries, suggesting multi-value concatenations or non-standard tokens behind the plural 'country_ids'. Treatment: Split multi-code entries, normalise to ISO-3166, and impute or flag the 64% missing before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 23,740
nulls: 15,250 (64.2%)
unique: 680
top_value: PG
top_rate: 0.1029
cardinality: 680
entropy: 6.493
entropy_ratio: 0.6901
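
Splitting multi-code entries, the first step of the treatment above, might look like the sketch below. The delimiter is an assumption: the report does not show a multi-code cell, so the code accepts whitespace, commas, or semicolons. The sample cells are hypothetical, and both function names are illustrative.

```python
import re
from collections import Counter


def split_country_ids(value):
    """Split a raw country_ids cell into upper-cased two-letter codes.

    Assumption: multiple codes are separated by whitespace, commas, or
    semicolons; the report does not state the actual delimiter.
    """
    if not value:
        return []
    tokens = re.split(r"[\s,;]+", value.strip())
    return [t.upper() for t in tokens if re.fullmatch(r"[A-Za-z]{2}", t)]


def count_countries(cells):
    """Count languoids per country after splitting multi-code cells."""
    counts = Counter()
    for cell in cells:
        counts.update(set(split_country_ids(cell)))
    return counts


# Hypothetical cells for illustration; None stands for a null entry.
counts = count_countries(["PG", "PG ID", "ng", None, ""])
```

Counting per code after the split gives per-country totals that are not fragmented across combined values, which the 680-value cardinality suggests is happening in the raw column.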