saturn·

language data glottolog languoid

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/datasets/language-data/glottolog_languoid.csv

Saturn profiled 23,740 rows across 16 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in and added after the fact; the model never sees raw rows).

[2]:
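# Install the profiler (IPython shell escape, so run this inside a notebook).
!pip install saturn-dissect
import subprocess

# Profile the CSV; check=True raises CalledProcessError if the CLI exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/language-data/glottolog_languoid.csv",
    "--findings", "language-data-glottolog_languoid.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)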

Summary confidence: high

This dataset is a Glottolog languoid catalog with 23,740 rows and 16 columns describing languages, dialects, and families along with geographic and endangerment metadata. The `level` field splits the records into three classes, dialect (10,920), language (8,481), and family (4,339), making it the natural primary lens. Endangerment `status` is dominated by 'safe' (~79.9%), but the remaining five categories still cover 4,775 languoids, from vulnerable to extinct, and are worth investigating. Geography is concentrated: `country_ids` is led by PG (874), ID (695), and NG (480), and `family_id` is heavily skewed toward atla1278 (4,663) and aust1307 (3,850). Note that `iso639P3code`, `latitude`, and `longitude` are ~66% null and `country_ids` is ~64% null, so spatial analysis will only cover about a third of rows.

citing: level · status · country_ids · family_id · latitude · longitude · iso639P3code · child_dialect_count · bookkeeping

Out[4]:

saturn.schema() · 16 columns

column                kind         n       null%  unique  alerts
id                    text         23,740  0.0%   23,740  near_unique, one_word, short_text
family_id             categorical  23,740  1.8%   287
parent_id             text         23,740  1.8%   7,338   one_word, short_text, duplicates
name                  text         23,740  0.0%   23,740  near_unique, one_word
bookkeeping           categorical  23,740  0.0%   2       imbalance
level                 categorical  23,740  0.0%   3
status                categorical  23,740  0.0%   6
latitude              numeric      23,740  66.5%  7,798   null_rate
longitude             numeric      23,740  66.5%  7,757   null_rate
iso639P3code          text         23,740  66.4%  7,968   near_unique, one_word, null_rate, short_text
description           unknown      23,740  0.0%           skipped
markup_description    unknown      23,740  0.0%           skipped
child_family_count    numeric      23,740  0.0%   88      high_skew, outliers
child_language_count  numeric      23,740  0.0%   126     high_skew, outliers
child_dialect_count   numeric      23,740  0.0%   164     high_skew, outliers
country_ids           categorical  23,740  64.2%  680     null_rate
Fig 1.
level · Shows the split between dialect, language, and family entries — the core taxonomy of the dataset.
Top values for level (3 unique shown, of 3 total).
value     count  share
dialect   10920  46.0%
language  8481   35.7%
family    4339   18.3%
Fig 2.
status · Highlights how many languoids are safe versus various endangerment levels, including 889 already extinct.
Top values for status (6 unique shown, of 6 total).
value                  count  share
safe                   18965  79.9%
definitely endangered  1814   7.6%
vulnerable             1194   5.0%
extinct                889    3.7%
critically endangered  465    2.0%
severely endangered    413    1.7%
Fig 3.
family_id · Reveals the dominant language families; atla1278 and aust1307 alone account for over a third of records.
Top values for family_id (20 unique shown, of 287 total).
value     count  share
atla1278  4663   19.6%
aust1307  3850   16.2%
indo1319  2202   9.3%
sino1245  1666   7.0%
afro1255  1259   5.3%
nucl1709  762    3.2%
pama1250  598    2.5%
aust1305  503    2.1%
book1242  399    1.7%
otom1299  338    1.4%
mand1469  303    1.3%
sign1238  259    1.1%
drav1251  255    1.1%
cent2225  251    1.1%
turk1311  229    1.0%
taik1256  223    0.9%
nilo1247  201    0.8%
ural1272  185    0.8%
japo1237  179    0.8%
tupi1275  157    0.7%
Fig 4.
country_ids · Shows geographic concentration, with Papua New Guinea, Indonesia, and Nigeria leading by languoid count.
Top values for country_ids (20 unique shown, of 680 total).
value  count  share
PG     874    3.7%
ID     695    2.9%
NG     480    2.0%
AU     432    1.8%
IN     356    1.5%
MX     297    1.3%
CN     271    1.1%
BR     263    1.1%
US     247    1.0%
CM     196    0.8%
PH     177    0.7%
CD     156    0.7%
VU     118    0.5%
SD     99     0.4%
PE     97     0.4%
TZ     93     0.4%
MY     90     0.4%
TD     88     0.4%
RU     83     0.3%
CO     82     0.3%
Fig 5.
latitude · Distribution of languoid latitudes (where known) — note the ~66% null rate limits coverage.
Histogram bins for latitude (median: 6.30619).
bin                count
-55.27 – -52.06    5
-52.06 – -48.85    1
-48.85 – -45.64    1
-45.64 – -42.43    4
-42.43 – -39.22    7
-39.22 – -36.01    16
-36.01 – -32.8     29
-32.8 – -29.59     26
-29.59 – -26.38    48
-26.38 – -23.17    78
-23.17 – -19.96    125
-19.96 – -16.75    141
-16.75 – -13.54    281
-13.54 – -10.33    256
-10.33 – -7.121    495
-7.121 – -3.911    788
-3.911 – -0.7005   681
-0.7005 – 2.51     379
2.51 – 5.72        469
5.72 – 8.93        664
8.93 – 12.14       710
12.14 – 15.35      303
15.35 – 18.56      387
18.56 – 21.77      233
21.77 – 24.98      318
24.98 – 28.19      373
28.19 – 31.4       167
31.4 – 34.61       144
34.61 – 37.82      179
37.82 – 41.03      113
41.03 – 44.24      138
44.24 – 47.45      79
47.45 – 50.66      78
50.66 – 53.87      76
53.87 – 57.08      46
57.08 – 60.29      21
60.29 – 63.5       41
63.5 – 66.71       23
66.71 – 69.93      14
69.93 – 73.14      6
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column                kind         null %
id                    text         0.0%
family_id             categorical  1.8%
parent_id             text         1.8%
name                  text         0.0%
bookkeeping           categorical  0.0%
level                 categorical  0.0%
status                categorical  0.0%
latitude              numeric      66.5%
longitude             numeric      66.5%
iso639P3code          text         66.4%
description           unknown      0.0%
markup_description    unknown      0.0%
child_family_count    numeric      0.0%
child_language_count  numeric      0.0%
child_dialect_count   numeric      0.0%
country_ids           categorical  64.2%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 5 numeric columns (values clipped to 2 decimals).
                      latitude  longitude  child_family_count  child_language_count  child_dialect_count
latitude              +1.00     -0.31      +0.03               +0.01                 +0.06
longitude             -0.31     +1.00      -0.03               -0.04                 -0.05
child_family_count    +0.03     -0.03      +1.00               +0.96                 +0.74
child_language_count  +0.01     -0.04      +0.96               +1.00                 +0.69
child_dialect_count   +0.06     -0.05      +0.74               +0.69                 +1.00

id text identifier

Fixed 8-character single-token codes (e.g., 'melk1240', 'yang1299'), unique across all 23,740 rows with no nulls or duplicates. The pattern of four letters followed by four digits is consistent with Glottolog-style language identifiers, making this a primary key rather than analyzable text.

Treatment: Use as the row key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
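
Before relying on `id` as a join key, it can help to confirm the two properties the card claims: a fixed code shape and no repeats. The following is a minimal sketch using only the standard library. The regular expression is an assumption (the report only establishes "8 characters, one token"), and `check_row_keys` is a hypothetical helper name, not part of saturn.

```python
import re

# Assumed shape: four lowercase alphanumerics followed by four digits.
# The report does not state the exact pattern, so treat this as a guess.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def check_row_keys(ids):
    """Return (malformed, duplicated) values from an iterable of key strings."""
    seen = set()
    malformed, duplicated = [], []
    for value in ids:
        if GLOTTOCODE_RE.fullmatch(value) is None:
            malformed.append(value)
        if value in seen:
            duplicated.append(value)
        seen.add(value)
    return malformed, duplicated


# The first two ids are the examples quoted in the card; the repeat and
# "bad-id" are deliberately broken toy values.
malformed, duplicated = check_row_keys(["melk1240", "yang1299", "melk1240", "bad-id"])
print(malformed)   # ['bad-id']
print(duplicated)  # ['melk1240']
```

On the real file both lists should come back empty if the reported stats (23,740 unique, len_min = len_max = 8) hold.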
Out[13]:

saturn.columns["id"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 23,740
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 86.11
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for id (mean: 8.0).
chars  count
8      23740
All other histogram bins are empty.

family_id categorical foreign_key

Looks like a Glottolog-style language family identifier (e.g. 'atla1278' for Atlantic-Congo, 'aust1307' for Austronesian) tagging each of the 23,740 rows. The distribution is heavily skewed: the top family alone covers 20.0% of non-null rows (19.6% of all rows), and the top 10 of 287 families account for about 68% of all rows, yielding an entropy ratio of 0.598. Null rate is low at 1.81% (429 rows).

Treatment: left-join on this id to a language-family reference table; consider grouping the long tail before modelling.

anthropic:claude-opus-4-7 · confidence high
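
Grouping the long tail, as the treatment suggests, can be sketched as follows. This is an illustrative standard-library version, not saturn functionality: `group_long_tail`, the `keep` threshold, and the `"other"` label are all hypothetical choices, and the toy column only borrows family codes from the table for readability.

```python
from collections import Counter


def group_long_tail(values, keep=10, other="other"):
    """Keep the `keep` most frequent labels and map the rest to `other`.

    Missing values (None) pass through unchanged so they can be handled
    separately from the rare-label bucket.
    """
    counts = Counter(v for v in values if v is not None)
    kept = {label for label, _ in counts.most_common(keep)}
    return [v if v is None or v in kept else other for v in values]


# Toy column: eight made-up rows, not taken from the dataset.
families = ["atla1278"] * 3 + ["aust1307"] * 2 + ["tupi1275", None, "ural1272"]
grouped = group_long_tail(families, keep=2)
print(Counter(grouped))
```

With `keep=10` on the real column, the bucket would absorb 277 of the 287 families while leaving roughly two-thirds of rows under their own label.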
Out[16]:

saturn.columns["family_id"].stats

stat value
n 23,740
nulls 429 (1.8%)
unique 287
top_value atla1278
top_rate 0.2
cardinality 287
entropy 4.886
entropy_ratio 0.5984
Fig 9.
Top values for family_id.
Top values for family_id (20 unique shown, of 287 total).
value     count  share
atla1278  4663   19.6%
aust1307  3850   16.2%
indo1319  2202   9.3%
sino1245  1666   7.0%
afro1255  1259   5.3%
nucl1709  762    3.2%
pama1250  598    2.5%
aust1305  503    2.1%
book1242  399    1.7%
otom1299  338    1.4%
mand1469  303    1.3%
sign1238  259    1.1%
drav1251  255    1.1%
cent2225  251    1.1%
turk1311  229    1.0%
taik1256  223    0.9%
nilo1247  201    0.8%
ural1272  185    0.8%
japo1237  179    0.8%
tupi1275  157    0.7%

parent_id text foreign_key

Fixed-width 8-character single-token codes (e.g. 'book1242', 'uncl1493') with len_min=len_max=8 and one_word_rate=1.0. These look like Glottolog-style language/family identifiers used as a parent reference. With 7,338 unique values across the 23,311 non-null rows and a 68.5% duplicate_rate, many children share parents; 'book1242' alone accounts for 399 rows. Null rate is low at 1.81% (429 rows).

Treatment: left-join on this id to a parent/language lookup table.

anthropic:claude-opus-4-7 · confidence high
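
If `parent_id` points back at the `id` column of the same file (an assumption; the stats only show that both columns share the same 8-character shape), the left-join becomes a self-join. A minimal sketch with plain dictionaries follows; `attach_parent_names` and the toy rows are hypothetical.

```python
def attach_parent_names(rows):
    """Left-join each row to its parent row via parent_id -> id.

    Rows whose parent_id is missing or has no match keep parent_name=None,
    which is what a left join would produce.
    """
    by_id = {row["id"]: row for row in rows}
    joined = []
    for row in rows:
        parent = by_id.get(row["parent_id"])
        parent_name = parent["name"] if parent is not None else None
        joined.append({**row, "parent_name": parent_name})
    return joined


# Made-up rows for illustration; none of these ids come from the dataset.
rows = [
    {"id": "fami0001", "parent_id": None, "name": "Toy family"},
    {"id": "lang0001", "parent_id": "fami0001", "name": "Toy language"},
    {"id": "dial0001", "parent_id": "lang0001", "name": "Toy dialect"},
    {"id": "orph0001", "parent_id": "miss0001", "name": "Toy orphan"},
]
joined = attach_parent_names(rows)
for row in joined:
    print(row["id"], "->", row["parent_name"])
```

Counting the rows that end up with `parent_name=None` despite a non-null `parent_id` is a quick way to test the self-reference assumption on the real data.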
Out[19]:

saturn.columns["parent_id"].stats

stat value
n 23,740
nulls 429 (1.8%)
unique 7,338
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 15,973
duplicate_rate 0.6852
vocab_size 7,189
readability_flesch_mean 91.19
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 68.5% duplicate strings
Fig 10.
Character-length distribution for parent_id (mean: 8.0).
chars  count
8      23311
All other histogram bins are empty.

name text identifier

The `name` column holds 23,740 fully unique short labels (null_rate 0.0, n_unique equals n), with a mean length of 9.95 characters and 69.5% being a single word. Top tokens like `nuclear`, `central`, `western`, `eastern`, `northern`, `southern`, and `language` point to languoid names qualified by branch or region rather than person names. Vocabulary is broad (17,915 distinct words at only ~1.4 words per row), and there are no duplicates, URLs, or emoji.

Treatment: Treat as a unique label key; drop from modelling features or hash/embed if semantic content is needed.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["name"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 23,740
len_min 1
len_max 58
len_mean 9.95
len_median 8
len_p95 22
word_mean 1.398
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 17,915
readability_flesch_mean 42.62
emoji_rate 0
url_rate 0
one_word_rate 0.6953
allcaps_rate 0.0001685
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 69.5% rows are a single word
Fig 11.
Character-length distribution for name (mean: 9.95).
chars      count
1 – 2      54
2 – 4      496
4 – 5      4648
5 – 7      3145
7 – 8      4510
8 – 10     1373
10 – 11    1069
11 – 12    1997
12 – 14    1023
14 – 15    1768
15 – 17    644
17 – 18    848
18 – 20    340
20 – 21    281
21 – 22    518
22 – 24    224
24 – 25    298
25 – 27    101
27 – 28    140
28 – 30    46
30 – 31    44
31 – 32    69
32 – 34    24
34 – 35    34
35 – 37    12
37 – 38    9
38 – 39    4
39 – 41    2
41 – 42    3
42 – 44    3
44 – 45    5
45 – 47    1
47 – 48    2
48 – 49    1
49 – 51    0
51 – 52    2
52 – 54    0
54 – 55    0
55 – 57    0
57 – 58    2

bookkeeping categorical feature

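Boolean flag with only two values across 23,740 rows and no nulls. The distribution is severely imbalanced: 'False' covers 98.3% (23,341 rows) versus only 399 'True' cases, yielding an entropy ratio of just 0.12. The 399 'True' rows match in count the 399 rows whose `family_id` is book1242, which suggests the flag marks entries filed under that bookkeeping family, although the stats alone cannot confirm this.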

Treatment: Encode as boolean and apply class-imbalance handling (e.g., stratification or reweighting) if used as a target.

anthropic:claude-opus-4-7 · confidence high
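
The treatment above has two parts, boolean encoding and reweighting, and both fit in a few lines. The sketch assumes the CSV stores the flag as the strings 'True' and 'False' (the table shows those labels, but the on-disk encoding is not confirmed). `balanced_weights` implements the common n / (k * count) heuristic; the names are hypothetical.

```python
def parse_flag(text):
    """Map the strings 'True'/'False' to booleans and reject anything else."""
    mapping = {"True": True, "False": False}
    if text not in mapping:
        raise ValueError(f"unexpected bookkeeping value: {text!r}")
    return mapping[text]


def balanced_weights(counts):
    """Inverse-frequency class weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Class counts taken from the report's table for bookkeeping.
weights = balanced_weights({False: 23341, True: 399})
print(parse_flag("True"), weights)
```

Under this heuristic the rare 'True' class is weighted roughly 58 times more heavily than 'False', which mirrors the 23,341 to 399 imbalance.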
Out[25]:

saturn.columns["bookkeeping"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 2
top_value False
top_rate 0.9832
cardinality 2
entropy 0.1231
entropy_ratio 0.1231
alert: imbalance · top value is 98.3% of rows
Fig 12.
Top values for bookkeeping (2 unique shown, of 2 total).
value  count  share
False  23341  98.3%
True   399    1.7%

level categorical feature

This is a categorical taxonomy tag with exactly 3 levels (dialect, language, family) and no nulls across 23,740 rows. The distribution is well-spread (entropy_ratio 0.94), with 'dialect' leading at 46.0%, followed by 'language' (8,481) and 'family' (4,339). Looks like a linguistic classification level rather than a free-form attribute.

Treatment: one-hot encode for modelling or use directly as a stratification key.

anthropic:claude-opus-4-7 · confidence high
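
The entropy and entropy_ratio figures quoted for categorical columns can be reproduced from the value counts alone. The sketch below assumes Shannon entropy in bits, normalised by log2 of the number of observed categories; saturn's exact definition is not stated in the report, but this formula matches the reported values for `level` to the precision shown.

```python
import math


def entropy_ratio(counts):
    """Return (entropy in bits, entropy / log2(k)) for a list of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) < 2:
        return 0.0, 0.0  # a single class carries no uncertainty
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy, entropy / math.log2(len(probs))


# Counts for dialect, language, family from the report's table.
level_entropy, level_ratio = entropy_ratio([10920, 8481, 4339])
print(round(level_entropy, 3), round(level_ratio, 4))
```

A ratio near 1 means the classes are close to evenly used; the same function applied to the bookkeeping counts (23341, 399) gives about 0.12.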
Out[28]:

saturn.columns["level"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 3
top_value dialect
top_rate 0.46
cardinality 3
entropy 1.494
entropy_ratio 0.9426
Fig 13.
Top values for level.
Top values for level (3 unique shown, of 3 total).
value     count  share
dialect   10920  46.0%
language  8481   35.7%
family    4339   18.3%

status categorical label

This is a categorical status column with 6 levels matching UNESCO-style language endangerment categories (safe, vulnerable, definitely endangered, severely endangered, critically endangered, extinct). The distribution is heavily imbalanced: 'safe' accounts for 79.9% of 23,740 rows, while the rarest level, 'severely endangered', has only 413 records. Entropy ratio is 0.44, confirming low diversity despite 6 classes.

Treatment: Treat as an ordinal target, ordered from safe to extinct rather than by frequency; stratify or rebalance before classification given the roughly 80/20 split between 'safe' and all other levels.

anthropic:claude-opus-4-7 · confidence high
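
Treating `status` as ordinal needs an explicit ranking, because the frequency order in the table is not the severity order. A minimal sketch follows; the ranking from least to most endangered is an assumption based on the UNESCO-style labels, and `STATUS_ORDER` and `encode_status` are hypothetical names.

```python
# Assumed severity order, least to most endangered. The report lists the
# six labels but does not state their order.
STATUS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
STATUS_RANK = {label: rank for rank, label in enumerate(STATUS_ORDER)}


def encode_status(label):
    """Return the ordinal rank of a status label; unknown labels raise KeyError."""
    return STATUS_RANK[label]


# Sorting by rank restores severity order from the table's frequency order.
by_frequency = ["safe", "definitely endangered", "vulnerable", "extinct",
                "critically endangered", "severely endangered"]
print(sorted(by_frequency, key=encode_status))
```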
Out[31]:

saturn.columns["status"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 6
top_value safe
top_rate 0.7989
cardinality 6
entropy 1.15
entropy_ratio 0.4447
Fig 14.
Top values for status.
Top values for status (6 unique shown, of 6 total).
value                  count  share
safe                   18965  79.9%
definitely endangered  1814   7.6%
vulnerable             1194   5.0%
extinct                889    3.7%
critically endangered  465    2.0%
severely endangered    413    1.7%

latitude numeric feature

Geographic latitude coordinate spanning -55.27 to 73.14, consistent with valid Earth latitudes. Two-thirds of rows are null (null_rate 0.6654), which severely limits coverage. Among populated rows the distribution is mildly right-skewed (0.54) with median 6.31 and IQR ~24.5 (q1 -5.14, q3 19.34), so at least half of the located languoids sit in the tropics, centred slightly north of the equator.

Treatment: Impute or filter the 66% nulls before any spatial modelling; pair with longitude for geo-features.

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["latitude"].stats

stat value
n 23,740
nulls 15,797 (66.5%)
unique 7,798
min -55.27
max 73.14
mean 8.17
median 6.306
std 18.96
q1 -5.137
q3 19.34
iqr 24.47
skew 0.5403
kurtosis 0.3006
n_outliers 129
outlier_rate 0.01624
zero_rate 0
alert: null_rate · 66.5% null
Fig 15.
Distribution of latitude. Vertical dash marks the median.
Histogram bins for latitude (median: 6.30619).
bin                count
-55.27 – -52.06    5
-52.06 – -48.85    1
-48.85 – -45.64    1
-45.64 – -42.43    4
-42.43 – -39.22    7
-39.22 – -36.01    16
-36.01 – -32.8     29
-32.8 – -29.59     26
-29.59 – -26.38    48
-26.38 – -23.17    78
-23.17 – -19.96    125
-19.96 – -16.75    141
-16.75 – -13.54    281
-13.54 – -10.33    256
-10.33 – -7.121    495
-7.121 – -3.911    788
-3.911 – -0.7005   681
-0.7005 – 2.51     379
2.51 – 5.72        469
5.72 – 8.93        664
8.93 – 12.14       710
12.14 – 15.35      303
15.35 – 18.56      387
18.56 – 21.77      233
21.77 – 24.98      318
24.98 – 28.19      373
28.19 – 31.4       167
31.4 – 34.61       144
34.61 – 37.82      179
37.82 – 41.03      113
41.03 – 44.24      138
44.24 – 47.45      79
47.45 – 50.66      78
50.66 – 53.87      76
53.87 – 57.08      46
57.08 – 60.29      21
60.29 – 63.5       41
63.5 – 66.71       23
66.71 – 69.93      14
69.93 – 73.14      6

longitude numeric feature

Geographic longitude in decimal degrees, with values spanning -178.785 to 179.306, consistent with the global WGS84 range. Two-thirds of rows are null (null_rate 0.6654), so coverage is the dominant concern; among populated rows the distribution is mildly left-skewed (-0.48) with a median of 47.72 suggesting an Eastern-Hemisphere bias. Only 13 outliers (0.16%) and no zeros, so the populated values themselves look clean.

Treatment: Pair with latitude for geospatial features; impute or filter the 66.5% missing before modelling.

anthropic:claude-opus-4-7 · confidence high
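
Both coordinate columns report the same 15,797 nulls, which hints that latitude and longitude tend to be missing together. Filtering to rows with a complete, in-range pair can be sketched as below. It assumes the values arrive as CSV strings with empty cells for nulls; `parse_coord` and `located_rows` are hypothetical helpers, and the toy rows only reuse values quoted in the report.

```python
def parse_coord(text, limit):
    """Parse one coordinate string; return None for empty or missing cells.

    Raises ValueError when the value falls outside [-limit, limit].
    """
    if text is None or text.strip() == "":
        return None
    value = float(text)
    if not -limit <= value <= limit:
        raise ValueError(f"coordinate out of range: {value}")
    return value


def located_rows(rows):
    """Keep only rows where both latitude and longitude are present."""
    kept = []
    for row in rows:
        lat = parse_coord(row.get("latitude"), 90.0)
        lon = parse_coord(row.get("longitude"), 180.0)
        if lat is not None and lon is not None:
            kept.append({**row, "latitude": lat, "longitude": lon})
    return kept


# Toy rows; the ids are made up.
toy = [
    {"id": "toyy0001", "latitude": "-55.27", "longitude": "-178.8"},
    {"id": "toyy0002", "latitude": "", "longitude": ""},
    {"id": "toyy0003", "latitude": "6.31", "longitude": None},
]
located = located_rows(toy)
print(len(located), "of", len(toy), "rows have a full coordinate pair")
```

Filtering rather than imputing keeps the spatial subset honest about covering only about a third of rows.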
Out[37]:

saturn.columns["longitude"].stats

stat value
n 23,740
nulls 15,797 (66.5%)
unique 7,757
min -178.8
max 179.3
mean 51.27
median 47.72
std 81.14
q1 7.235
q3 124.1
iqr 116.9
skew -0.4832
kurtosis -0.7745
n_outliers 13
outlier_rate 0.001637
zero_rate 0
alert: null_rate · 66.5% null
Fig 16.
Distribution of longitude. Vertical dash marks the median.
Histogram bins for longitude (median: 47.7236).
bin                count
-178.8 – -169.8    13
-169.8 – -160.9    4
-160.9 – -151.9    10
-151.9 – -143      11
-143 – -134        10
-134 – -125.1      17
-125.1 – -116.1    124
-116.1 – -107.2    47
-107.2 – -98.21    78
-98.21 – -89.26    280
-89.26 – -80.31    59
-80.31 – -71.36    235
-71.36 – -62.41    218
-62.41 – -53.45    150
-53.45 – -44.5     60
-44.5 – -35.55     40
-35.55 – -26.6     0
-26.6 – -17.64     4
-17.64 – -8.692    105
-8.692 – 0.2605    275
0.2605 – 9.213     444
9.213 – 18.17      751
18.17 – 27.12      322
27.12 – 36.07      430
36.07 – 45.02      228
45.02 – 53.97      126
53.97 – 62.93      35
62.93 – 71.88      80
71.88 – 80.83      210
80.83 – 89.78      208
89.78 – 98.74      269
98.74 – 107.7      457
107.7 – 116.6      239
116.6 – 125.6      502
125.6 – 134.5      316
134.5 – 143.5      598
143.5 – 152.4      667
152.4 – 161.4      123
161.4 – 170.4      186
170.4 – 179.3      12

iso639P3code text identifier

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with 7,968 distinct codes across 23,740 rows. Two-thirds are missing (null_rate=0.6644), and the 7,968 populated rows carry 7,968 distinct codes with no duplicates, so each code identifies exactly one entry rather than acting as a repeated categorical.

Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and handle the 66% nulls explicitly.

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["iso639P3code"].stats

stat value
n 23,740
nulls 15,772 (66.4%)
unique 7,968
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,968
readability_flesch_mean 119.5
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: null_rate · 66.4% null
alert: short_text · 95th-percentile length under 20 chars
Fig 17.
Character-length distribution for iso639P3code (mean: 3.0).
chars  count
3      7968
All other histogram bins are empty.

description unknown free_text

This column is named 'description' but saturn skipped profiling, so no kind, uniqueness, or value statistics are available. Only the row count (23740) and a null rate of 0.0 are reported. Without further stats, the content and structure cannot be characterized.

Treatment: Re-profile or sample manually before deciding; if textual, tokenize and embed before modelling.

anthropic:claude-opus-4-7 · confidence low
Out[43]:

saturn.columns["description"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

markup_description unknown free_text

This column was skipped by the profiler, so no statistics, uniqueness count, or value samples are available beyond a row count of 23,740 and a null rate of 0.0. The name suggests it holds a marked-up version of the `description` text (possibly HTML or another markup format), but that is inferred from the label alone, not from evidence. No distributional signal can be assessed here.

Treatment: Re-run the profiler with text handling enabled, then tokenize and embed before modelling.

anthropic:claude-opus-4-7 · confidence low
Out[45]:

saturn.columns["markup_description"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

child_family_count numeric feature

A numeric count of child families per record, with 23,740 rows and only 88 distinct values. It is overwhelmingly zero (zero_rate 0.9082), so q1, median, and q3 are all 0 and IQR is 0, yet a long tail pushes max to 859 with mean 0.879 and std 13.20. Skew of 44.40 and kurtosis of 2352.94 confirm an extreme heavy-tailed distribution. Because the IQR is 0, the 1.5 IQR fences collapse to 0 and every non-zero row is flagged: the 2,179 outliers (9.18%) are exactly the non-zero rows (1 - 0.9082).

Treatment: Binarize (zero vs non-zero) or apply log1p before modelling to tame the extreme skew.

anthropic:claude-opus-4-7 · confidence high
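
The outlier alert and the suggested transforms can be illustrated together. The sketch below computes 1.5 IQR fences with `statistics.quantiles`, shows how they collapse when more than three quarters of values are zero, and then applies the binarize and log1p options. The quantile method saturn uses is not stated, so `method="inclusive"` is an assumption, and the toy values are invented (only the maximum, 859, comes from the report).

```python
import math
import statistics


def iqr_fences(values):
    """Return the (lower, upper) 1.5 * IQR outlier fences for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread


# Toy column: 91 zeros and 9 non-zero counts, mimicking ~91% zero inflation.
values = [0] * 91 + [1, 1, 2, 3, 5, 8, 21, 120, 859]

lower, upper = iqr_fences(values)
outliers = [v for v in values if v < lower or v > upper]

has_children = [v > 0 for v in values]        # binarize: zero vs non-zero
logged = [math.log1p(v) for v in values]      # log1p keeps 0 at 0, shrinks the tail

print((lower, upper), len(outliers), sum(has_children))
```

Here both fences sit at 0, so all nine non-zero values are "outliers". That is why the outlier rate in this column tracks the non-zero rate rather than signalling anomalous records.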
Out[47]:

saturn.columns["child_family_count"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 88
min 0
max 859
mean 0.8792
median 0
std 13.2
q1 0
q3 0
iqr 0
skew 44.4
kurtosis 2353
n_outliers 2,179
outlier_rate 0.09179
zero_rate 0.9082
alert: high_skew · skew=+44.40
alert: outliers · 9.2% rows beyond 1.5 IQR
Fig 18.
Distribution of child_family_count. Vertical dash marks the median.
Histogram bins for child_family_count (median: 0.0).
bin              count
0 – 21.48        23588
21.48 – 42.95    93
42.95 – 64.43    27
64.43 – 85.9     6
85.9 – 107.4     4
107.4 – 128.9    3
128.9 – 150.3    2
150.3 – 171.8    2
171.8 – 193.3    1
193.3 – 214.8    1
214.8 – 236.2    0
236.2 – 257.7    0
257.7 – 279.2    1
279.2 – 300.7    2
300.7 – 322.1    1
322.1 – 343.6    1
343.6 – 365.1    0
365.1 – 386.6    0
386.6 – 408      2
408 – 429.5      1
429.5 – 451      0
451 – 472.5      0
472.5 – 493.9    0
493.9 – 515.4    0
515.4 – 536.9    0
536.9 – 558.4    0
558.4 – 579.8    1
579.8 – 601.3    0
601.3 – 622.8    0
622.8 – 644.2    0
644.2 – 665.7    0
665.7 – 687.2    0
687.2 – 708.7    2
708.7 – 730.2    0
730.2 – 751.6    0
751.6 – 773.1    0
773.1 – 794.6    0
794.6 – 816.1    1
816.1 – 837.5    0
837.5 – 859      1

child_language_count numeric feature

A numeric count of child languages per record, where 81.7% of rows are zero and q1 = median = q3 = 0, so the typical entity has none. The distribution is extremely long-tailed (skew 41.86, kurtosis 2115) with a max of 1,435, so a small set of hub-like records dominates. With an IQR of 0, the 4,339 outliers (18.3%) are simply the non-zero rows; that count equals the number of `family`-level records, which suggests only families carry child languages.

Treatment: Binarize (has_children vs none) or log1p-transform before modelling given the heavy zero-inflation and skew.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["child_language_count"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 126
min 0
max 1,435
mean 1.996
median 0
std 23.41
q1 0
q3 0
iqr 0
skew 41.86
kurtosis 2115
n_outliers 4,339
outlier_rate 0.1828
zero_rate 0.8172
alert: high_skew · skew=+41.86
alert: outliers · 18.3% rows beyond 1.5 IQR
Fig 19.
Distribution of child_language_count. Vertical dash marks the median.
Histogram bins for child_language_count (median: 0.0).
bin              count
0 – 35.88        23547
35.88 – 71.75    121
71.75 – 107.6    37
107.6 – 143.5    6
143.5 – 179.4    3
179.4 – 215.2    4
215.2 – 251.1    3
251.1 – 287      2
287 – 322.9      2
322.9 – 358.8    0
358.8 – 394.6    1
394.6 – 430.5    1
430.5 – 466.4    0
466.4 – 502.2    1
502.2 – 538.1    1
538.1 – 574      2
574 – 609.9      1
609.9 – 645.8    0
645.8 – 681.6    0
681.6 – 717.5    1
717.5 – 753.4    2
753.4 – 789.2    0
789.2 – 825.1    0
825.1 – 861      0
861 – 896.9      0
896.9 – 932.8    0
932.8 – 968.6    0
968.6 – 1004     1
1004 – 1040      0
1040 – 1076      0
1076 – 1112      0
1112 – 1148      0
1148 – 1184      0
1184 – 1220      0
1220 – 1256      0
1256 – 1292      2
1292 – 1327      0
1327 – 1363      0
1363 – 1399      1
1399 – 1435      1

child_dialect_count numeric feature

This is a count of child dialects per record, dominated by zeros (zero_rate 0.7442) with a median of 0 and q3 of 1, yet a max of 2,369. Skew of 42.22 and kurtosis of 2159 confirm an extreme long tail. With an IQR of 1 the upper outlier fence is 1 + 1.5 = 2.5, so the 17.99% of rows flagged as outliers are those with three or more child dialects. The mean of 3.39 is pulled far above the median, so it is a poor summary of a typical row.

Treatment: Log1p-transform or bin (zero / one / many) before modelling.

anthropic:claude-opus-4-7 · confidence high
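
The zero / one / many binning from the treatment is a three-way mapping. A minimal sketch, with `bin_count` as a hypothetical helper and invented toy values (only the maximum, 2369, comes from the report):

```python
from collections import Counter


def bin_count(value):
    """Map a non-negative count to 'zero', 'one', or 'many'."""
    if value < 0:
        raise ValueError(f"counts cannot be negative: {value}")
    if value == 0:
        return "zero"
    if value == 1:
        return "one"
    return "many"


# Toy values for illustration.
toy_counts = [0, 0, 0, 1, 2, 7, 2369]
binned = Counter(bin_count(v) for v in toy_counts)
print(binned["zero"], binned["one"], binned["many"])
```

Unlike log1p, the binned form discards magnitude entirely, so it suits models where only the presence and plurality of dialects matter.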
Out[53]:

saturn.columns["child_dialect_count"].stats

stat value
n 23,740
nulls 0 (0.0%)
unique 164
min 0
max 2,369
mean 3.389
median 0
std 36.8
q1 0
q3 1
iqr 1
skew 42.22
kurtosis 2159
n_outliers 4,272
outlier_rate 0.1799
zero_rate 0.7442
alert: high_skew · skew=+42.22
alert: outliers · 18.0% rows beyond 1.5 IQR
Fig 20.
Distribution of child_dialect_count. Vertical dash marks the median.
Histogram bins for child_dialect_count (median: 0.0).
bin              count
0 – 59.23        23575
59.23 – 118.5    99
118.5 – 177.7    24
177.7 – 236.9    18
236.9 – 296.1    4
296.1 – 355.4    2
355.4 – 414.6    0
414.6 – 473.8    1
473.8 – 533      1
533 – 592.2      1
592.2 – 651.5    1
651.5 – 710.7    2
710.7 – 769.9    0
769.9 – 829.1    1
829.1 – 888.4    0
888.4 – 947.6    3
947.6 – 1007     0
1007 – 1066      0
1066 – 1125      1
1125 – 1184      1
1184 – 1244      0
1244 – 1303      0
1303 – 1362      1
1362 – 1421      0
1421 – 1481      0
1481 – 1540      0
1540 – 1599      1
1599 – 1658      0
1658 – 1718      0
1718 – 1777      0
1777 – 1836      1
1836 – 1895      1
1895 – 1954      0
1954 – 2014      0
2014 – 2073      0
2073 – 2132      0
2132 – 2191      0
2191 – 2251      1
2251 – 2310      0
2310 – 2369      1

country_ids categorical feature

This column holds ISO-style two-letter country codes (PG, ID, NG, AU, IN…) as a categorical feature with 680 distinct values across 23,740 rows. Coverage is poor, with 64.24% of rows null, and the non-null distribution is broad (entropy 6.49, ratio 0.69): Papua New Guinea leads with only 10.29% of the 8,490 non-null rows (3.7% of all rows). The 680 distinct values far exceed the ~250 real countries, suggesting multi-value concatenations or non-standard tokens behind the plural 'country_ids'.

Treatment: Split multi-code entries, normalise to ISO-3166, and impute or flag the 64% missing before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
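
Splitting multi-code cells before counting can be sketched as follows. The separator is an assumption: the report never shows a multi-value cell, so whitespace is used here as a guess and should be checked against the raw file. `split_country_ids`, `count_countries`, and the toy cells are hypothetical.

```python
import re
from collections import Counter

# Two uppercase letters, the shape of ISO 3166-1 alpha-2 codes. This checks
# the shape only, not whether the code is actually assigned.
ALPHA2_RE = re.compile(r"[A-Z]{2}")


def split_country_ids(cell, sep=None):
    """Split one cell into codes; missing or empty cells give an empty list.

    sep=None splits on any run of whitespace (assumed separator).
    """
    if cell is None:
        return []
    return cell.split(sep)


def count_countries(cells):
    """Count individual country codes across all cells."""
    counts = Counter()
    for cell in cells:
        for code in split_country_ids(cell):
            if ALPHA2_RE.fullmatch(code) is None:
                raise ValueError(f"not a two-letter code: {code!r}")
            counts[code] += 1
    return counts


# Toy cells; the multi-code combinations are invented for illustration.
country_counts = count_countries(["PG", "ID PG", "", None, "NG CM"])
print(country_counts.most_common(1))  # [('PG', 2)]
```

If the assumption holds, the number of distinct keys after splitting should fall well below 680, and per-country totals will exceed the single-value counts shown in the table.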
Out[56]:

saturn.columns["country_ids"].stats

stat value
n 23,740
nulls 15,250 (64.2%)
unique 680
top_value PG
top_rate 0.1029
cardinality 680
entropy 6.493
entropy_ratio 0.6901
alert: null_rate · 64.2% null
Fig 21.
Top values for country_ids (20 unique shown, of 680 total).
value  count  share
PG     874    3.7%
ID     695    2.9%
NG     480    2.0%
AU     432    1.8%
IN     356    1.5%
MX     297    1.3%
CN     271    1.1%
BR     263    1.1%
US     247    1.0%
CM     196    0.8%
PH     177    0.7%
CD     156    0.7%
VU     118    0.5%
SD     99     0.4%
PE     97     0.4%
TZ     93     0.4%
MY     90     0.4%
TD     88     0.4%
RU     83     0.3%
CO     82     0.3%

How to cite

BibTeX
@misc{saturn-language-data-glottolog-languoid-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: language data glottolog languoid},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/language-data-glottolog_languoid}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: language data glottolog languoid. Source: /home/coolhand/datasets/language-data/glottolog_languoid.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/language-data-glottolog_languoid