data-trove-glottolog-languoids

Overview

Source: /home/coolhand/html/datavis/data_trove/data/linguistic/glottolog_languoid.csv

Saturn profiled 23,740 rows across 16 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source CSV; check=True raises if the command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/linguistic/glottolog_languoid.csv",
    "--findings", "data-trove-glottolog-languoids.json",
    "--llm", "anthropic:default",
], check=True)

Summary confidence: high

This dataset is a comprehensive catalogue of the world's languoids from Glottolog, covering 23,740 entries that span dialects (10,920), languages (8,481), and language families (4,339). The most striking pattern is in endangerment status: while the majority (18,965) are marked 'safe', 4,775 entries are vulnerable, endangered, or extinct, which is worth examining against family and geographic distribution. A second area of interest is the highly skewed child-count columns: most languoids have no children (74% have zero child dialects, 82% zero child languages), but a handful of nodes have hundreds or even thousands of descendants, indicating a very uneven tree structure. Geographic coverage is also incomplete: latitude and longitude are missing for 66.5% of rows, which limits spatial analysis to a subset of the data.

citing: row_count · level.top_values · status.top_values · child_dialect_count.zero_rate · child_dialect_count.max · child_language_count.zero_rate · child_language_count.max · latitude.null_rate · family_id.top_values · country_ids.top_values

Out[4]:

saturn.schema() · 16 columns

column	kind	n	null%	unique	alerts
id	text	23,740	0.0%	23,740	near_unique one_word short_text
family_id	categorical	23,740	1.8%	287
parent_id	text	23,740	1.8%	7,338	one_word short_text duplicates
name	text	23,740	0.0%	23,740	near_unique one_word
bookkeeping	categorical	23,740	0.0%	2	imbalance
level	categorical	23,740	0.0%	3
status	categorical	23,740	0.0%	6
latitude	numeric	23,740	66.5%	7,798	null_rate
longitude	numeric	23,740	66.5%	7,757	null_rate
iso639P3code	text	23,740	66.4%	7,968	near_unique one_word null_rate short_text
description	unknown	23,740	0.0%	—	skipped
markup_description	unknown	23,740	0.0%	—	skipped
child_family_count	numeric	23,740	0.0%	88	high_skew outliers
child_language_count	numeric	23,740	0.0%	126	high_skew outliers
child_dialect_count	numeric	23,740	0.0%	164	high_skew outliers
country_ids	categorical	23,740	64.2%	680	null_rate

Fig 1.

status · Look for how many languoids fall into each endangerment tier versus the dominant 'safe' category — the endangered tail is substantial.

Top values for status (6 unique shown, of 6 total).
value	count	share
safe	18965	79.9%
definitely endangered	1814	7.6%
vulnerable	1194	5.0%
extinct	889	3.7%
critically endangered	465	2.0%
severely endangered	413	1.7%

Fig 2.

level · Shows the split between dialects, languages, and families, which sets the context for interpreting all other columns.

Top values for level (3 unique shown, of 3 total).
value	count	share
dialect	10920	46.0%
language	8481	35.7%
family	4339	18.3%

Fig 3.

child_dialect_count · Expect a steep spike at zero with a very long right tail — a few nodes dominate the entire dialect tree.

Histogram bins for child_dialect_count (median: 0.0).
bin	count
0 – 59.23	23575
59.23 – 118.5	99
118.5 – 177.7	24
177.7 – 236.9	18
236.9 – 296.1	4
296.1 – 355.4	2
355.4 – 414.6	0
414.6 – 473.8	1
473.8 – 533	1
533 – 592.2	1
592.2 – 651.5	1
651.5 – 710.7	2
710.7 – 769.9	0
769.9 – 829.1	1
829.1 – 888.4	0
888.4 – 947.6	3
947.6 – 1007	0
1007 – 1066	0
1066 – 1125	1
1125 – 1184	1
1184 – 1244	0
1244 – 1303	0
1303 – 1362	1
1362 – 1421	0
1421 – 1481	0
1481 – 1540	0
1540 – 1599	1
1599 – 1658	0
1658 – 1718	0
1718 – 1777	0
1777 – 1836	1
1836 – 1895	1
1895 – 1954	0
1954 – 2014	0
2014 – 2073	0
2073 – 2132	0
2132 – 2191	0
2191 – 2251	1
2251 – 2310	0
2310 – 2369	1

Fig 4.

family_id · Atlantic-Congo (atla1278) and Austronesian (aust1307) together account for over a third of all languoids — look for which families dominate.

Top values for family_id (20 unique shown, of 287 total).
value	count	share
atla1278	4663	19.6%
aust1307	3850	16.2%
indo1319	2201	9.3%
sino1245	1666	7.0%
afro1255	1259	5.3%
nucl1709	762	3.2%
pama1250	598	2.5%
aust1305	503	2.1%
book1242	399	1.7%
otom1299	338	1.4%
mand1469	303	1.3%
sign1238	259	1.1%
drav1251	255	1.1%
cent2225	251	1.1%
turk1311	229	1.0%
taik1256	223	0.9%
nilo1247	201	0.8%
ural1272	185	0.8%
japo1237	179	0.8%
tupi1275	157	0.7%

Fig 5.

country_ids · Papua New Guinea, Indonesia, and Nigeria top the list — check whether linguistic diversity concentrates in a handful of countries.

Top values for country_ids (20 unique shown, of 680 total).
value	count	share
PG	874	3.7%
ID	695	2.9%
NG	480	2.0%
AU	432	1.8%
IN	356	1.5%
MX	297	1.3%
CN	271	1.1%
BR	263	1.1%
US	247	1.0%
CM	196	0.8%
PH	177	0.7%
CD	156	0.7%
VU	118	0.5%
SD	99	0.4%
PE	97	0.4%
TZ	93	0.4%
MY	90	0.4%
TD	88	0.4%
RU	83	0.3%
CO	82	0.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	text	0.0%
family_id	categorical	1.8%
parent_id	text	1.8%
name	text	0.0%
bookkeeping	categorical	0.0%
level	categorical	0.0%
status	categorical	0.0%
latitude	numeric	66.5%
longitude	numeric	66.5%
iso639P3code	text	66.4%
description	unknown	0.0%
markup_description	unknown	0.0%
child_family_count	numeric	0.0%
child_language_count	numeric	0.0%
child_dialect_count	numeric	0.0%
country_ids	categorical	64.2%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 5 numeric columns (values clipped to 2 decimals).
	latitude	longitude	child_family_count	child_language_count	child_dialect_count
latitude	+1.00	-0.31	+0.03	+0.01	+0.06
longitude	-0.31	+1.00	-0.03	-0.04	-0.05
child_family_count	+0.03	-0.03	+1.00	+0.96	+0.74
child_language_count	+0.01	-0.04	+0.96	+1.00	+0.69
child_dialect_count	+0.06	-0.05	+0.74	+0.69	+1.00

id text identifier

This column is a structured language or dialect identifier, following the Glottolog-style 8-character code pattern (4 letters + 4 digits, e.g. 'tibe1272', 'east2553'). All 23,740 values are exactly 8 characters long (len_min/max/mean/median all equal 8) with zero nulls and zero duplicates, making it a perfect primary key. The top_words sample aligns precisely with Glottolog languoid codes, suggesting this is a linguistic database. No anomalies — the column is unusually clean.

Treatment: Retain as primary key; left-join on this id to enrich with Glottolog metadata.

anthropic:default · confidence high
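
A minimal sketch of the primary-key check described above, assuming the id values have already been read into a list of strings. The `check_ids` helper and the toy input are illustrative and not part of saturn; the pattern follows the "4 letters + 4 digits" description in the text and may need loosening if real codes deviate from it.

```python
import re
from collections import Counter

# Pattern taken from the description above: 4 lowercase letters + 4 digits.
GLOTTOCODE = re.compile(r"[a-z]{4}\d{4}")

def check_ids(ids):
    """Return (malformed ids, duplicated ids) for a list of id strings."""
    malformed = [i for i in ids if not GLOTTOCODE.fullmatch(i)]
    duplicated = [i for i, n in Counter(ids).items() if n > 1]
    return malformed, duplicated

# Toy input: two codes quoted in the text, one repeat, one deliberately bad value.
sample = ["tibe1272", "east2553", "tibe1272", "bad-id"]
malformed, duplicated = check_ids(sample)
print(malformed)   # ['bad-id']
print(duplicated)  # ['tibe1272']
```

On the real column both lists should come back empty if the profile above holds.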

Out[13]:

saturn.columns["id"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	23,740
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	86.11
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for id.

Character-length distribution for id (mean: 8.0).
chars	count
8	23740

family_id categorical foreign_key

This column holds language-family identifiers as Glottolog codes (e.g., 'atla1278' = Atlantic-Congo, 'aust1307' = Austronesian, 'indo1319' = Indo-European). With 287 unique values across 23,740 rows, it is a low-to-medium cardinality grouping key. The top value 'atla1278' covers 4,663 records: 19.6% of all rows, or 20.0% of the 23,311 non-null rows, which is the figure top_rate reports. The top two families together account for roughly 36% of the dataset, a heavy skew toward large African and Pacific language families.

Treatment: Left-join on this id to a language-family reference table; use as a categorical grouping variable or fixed effect in models, noting the severe class imbalance toward 'atla1278' and 'aust1307'.

anthropic:default · confidence high
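
The stats table reports top_rate as 0.2 while the top-values table shows 19.6% for the same value. The arithmetic below, using only counts from this report, shows the two agree if top_rate is computed over non-null rows. That denominator is inferred from the numbers, not taken from saturn's documentation.

```python
n_rows = 23_740        # total rows
n_null = 429           # rows with no family_id
top_count = 4_663      # rows with family_id == 'atla1278'
second_count = 3_850   # rows with family_id == 'aust1307'

share_all = top_count / n_rows                 # denominator: every row
share_non_null = top_count / (n_rows - n_null) # denominator: non-null rows only
top_two_all = (top_count + second_count) / n_rows

print(f"{share_all:.1%}")       # 19.6%
print(f"{share_non_null:.1%}")  # 20.0%
print(f"{top_two_all:.1%}")     # 35.9%
```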

Out[16]:

saturn.columns["family_id"].stats

stat	value
n	23,740
nulls	429 (1.8%)
unique	287
top_value	atla1278
top_rate	0.2
cardinality	287
entropy	4.886
entropy_ratio	0.5984

Fig 9.

Top values for family_id.

Top values for family_id (20 unique shown, of 287 total).
value	count	share
atla1278	4663	19.6%
aust1307	3850	16.2%
indo1319	2201	9.3%
sino1245	1666	7.0%
afro1255	1259	5.3%
nucl1709	762	3.2%
pama1250	598	2.5%
aust1305	503	2.1%
book1242	399	1.7%
otom1299	338	1.4%
mand1469	303	1.3%
sign1238	259	1.1%
drav1251	255	1.1%
cent2225	251	1.1%
turk1311	229	1.0%
taik1256	223	0.9%
nilo1247	201	0.8%
ural1272	185	0.8%
japo1237	179	0.8%
tupi1275	157	0.7%

parent_id text foreign_key

This column contains Glottolog language family or clade identifiers — fixed 8-character alphanumeric codes (e.g. 'book1242', 'pidg1258') linking each row to a parent node in a linguistic taxonomy. Every value is exactly 8 characters long (len_min=8, len_max=8) and all are single tokens (one_word_rate=1.0), consistent with Glottolog's standardised glottocode format. The duplicate rate is notably high at 68.5%, with 'book1242' appearing 399 times, indicating many child languages share the same parent grouping — expected for a hierarchical classification. Null rate is low at 1.81%, suggesting most records have a resolvable parent.

Treatment: Left-join on this id against a Glottolog taxonomy table to enrich with family-level linguistic metadata.

anthropic:default · confidence high
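
Because parent_id points back at id within the same table, the hierarchy can also be rebuilt with a self-join. The sketch below assumes rows are dicts with `id` and `parent_id` keys and that an empty parent_id marks a top-level node; the ids in the toy rows are made up for illustration.

```python
from collections import defaultdict

def build_children(rows):
    """Map each parent_id to the ids of its direct children; collect roots."""
    children = defaultdict(list)
    roots = []
    for row in rows:
        parent = row["parent_id"]
        if parent:                      # non-empty string: row has a parent
            children[parent].append(row["id"])
        else:                           # empty or None: top-level node
            roots.append(row["id"])
    return children, roots

# Toy rows with made-up ids, not taken from the dataset.
rows = [
    {"id": "fami0001", "parent_id": ""},
    {"id": "lang0001", "parent_id": "fami0001"},
    {"id": "lang0002", "parent_id": "fami0001"},
    {"id": "dial0001", "parent_id": "lang0001"},
]
children, roots = build_children(rows)
print(roots)                 # ['fami0001']
print(children["fami0001"])  # ['lang0001', 'lang0002']
```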

Out[19]:

saturn.columns["parent_id"].stats

stat	value
n	23,740
nulls	429 (1.8%)
unique	7,338
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	15,973
duplicate_rate	0.6852
vocab_size	7,189
readability_flesch_mean	91.19
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	68.5% duplicate strings

Fig 10.

Character-length distribution for parent_id.

Character-length distribution for parent_id (mean: 8.0).
chars	count
8	23311

name text label

This column contains languoid names: language varieties, dialects, and family or regional groupings, as the top words 'nuclear', 'central', 'western', 'eastern', 'northern', 'southern', 'language', and 'sign' suggest. All 23,740 values are unique with zero nulls and zero duplicates, so the name can serve as a human-readable key alongside id. 69.5% of values are single words; the rest are multi-word descriptors, which pull the mean length (~9.95 characters) above the median (8) and stretch the maximum to 58 characters. The vocabulary spans 17,915 distinct tokens.

Treatment: Use as a human-readable label or index key; tokenize and embed if semantic similarity between names is needed downstream.

anthropic:default · confidence high

Out[22]:

saturn.columns["name"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	23,740
len_min	1
len_max	58
len_mean	9.95
len_median	8
len_p95	22
word_mean	1.398
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	17,915
readability_flesch_mean	42.62
emoji_rate	0
url_rate	0
one_word_rate	0.6953
allcaps_rate	0.0001685
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	69.5% rows are a single word

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 9.95).
chars	count
1 – 2	54
2 – 4	496
4 – 5	4648
5 – 7	3145
7 – 8	4510
8 – 10	1373
10 – 11	1069
11 – 12	1997
12 – 14	1023
14 – 15	1768
15 – 17	644
17 – 18	848
18 – 20	340
20 – 21	281
21 – 22	518
22 – 24	224
24 – 25	298
25 – 27	101
27 – 28	140
28 – 30	46
30 – 31	44
31 – 32	69
32 – 34	24
34 – 35	34
35 – 37	12
37 – 38	9
38 – 39	4
39 – 41	2
41 – 42	3
42 – 44	3
44 – 45	5
45 – 47	1
47 – 48	2
48 – 49	1
49 – 51	0
51 – 52	2
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	2

bookkeeping categorical label

This column is a boolean flag stored as the strings 'True' and 'False'. The 399 'True' rows match the 399 rows listed under 'book1242' in the family_id table, which suggests the flag marks entries in Glottolog's Bookkeeping pseudo-family (retired or superseded languoids) and is not a generic soft-delete marker. The distribution is severely imbalanced: 98.3% of rows are 'False' (23,341) versus 1.7% 'True' (399), which triggered the imbalance alert, and the entropy of 0.123 bits confirms the flag carries little information. Analysts should decide up front whether to exclude the 'True' rows or handle them separately.

Treatment: Use as a filter or stratification variable. For most analyses, excluding the 399 'True' rows is the simpler choice; oversample or reweight them only if the flag itself is the modelling target.

anthropic:default · confidence high
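
Since the flag is stored as text, a plain `bool()` cast would treat 'False' as true. A small sketch of a safer filter follows; `parse_flag` and the toy rows are illustrative names, not part of the dataset or of saturn.

```python
def parse_flag(value):
    """Convert the strings 'True'/'False' to a real bool.

    bool('False') is True in Python (any non-empty string is truthy),
    so the text is compared explicitly.
    """
    if value == "True":
        return True
    if value == "False":
        return False
    raise ValueError(f"unexpected bookkeeping value: {value!r}")

# Toy rows with made-up ids.
rows = [
    {"id": "aaaa0001", "bookkeeping": "False"},
    {"id": "aaaa0002", "bookkeeping": "True"},
]
kept = [r for r in rows if not parse_flag(r["bookkeeping"])]
print([r["id"] for r in kept])  # ['aaaa0001']
```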

Out[25]:

saturn.columns["bookkeeping"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	2
top_value	False
top_rate	0.9832
cardinality	2
entropy	0.1231
entropy_ratio	0.1231
alert: imbalance	top value is 98.3% of rows

Fig 12.

Top values for bookkeeping.

Top values for bookkeeping (2 unique shown, of 2 total).
value	count	share
False	23341	98.3%
True	399	1.7%

level categorical label

This column classifies each languoid into one of three hierarchical levels ('dialect', 'language', 'family'), consistent with Glottolog's taxonomy. With 3 unique values across 23,740 rows and zero nulls, it is a clean, fully populated categorical field. 'dialect' is the modal class at 46.0% (10,920), followed by 'language' at 35.7% (8,481) and 'family' at 18.3% (4,339). The entropy ratio of 0.943 (entropy divided by its maximum, log2 of 3) shows the classes are moderately unbalanced but far from degenerate.

Treatment: One-hot encode or ordinal-encode (dialect < language < family) depending on whether hierarchy is meaningful to the model.

anthropic:default · confidence high
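
The entropy and entropy_ratio figures can be reproduced from the three counts above. This sketch assumes entropy is measured in bits and the ratio divides by log2 of the number of classes; saturn's exact definition is not shown in this report, but the numbers match under that assumption.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

level_counts = {"dialect": 10_920, "language": 8_481, "family": 4_339}
h = entropy_bits(list(level_counts.values()))
ratio = h / math.log2(len(level_counts))   # 1.0 would mean a uniform split
print(round(h, 3), round(ratio, 4))        # 1.494 0.9426
```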

Out[28]:

saturn.columns["level"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	3
top_value	dialect
top_rate	0.46
cardinality	3
entropy	1.494
entropy_ratio	0.9426

Fig 13.

Top values for level.

Top values for level (3 unique shown, of 3 total).
value	count	share
dialect	10920	46.0%
language	8481	35.7%
family	4339	18.3%

status categorical label

This column is an endangerment status for each languoid, with 6 ordinal categories ranging from 'safe' to 'extinct'. The distribution is heavily skewed: 'safe' dominates at nearly 80% of records (18,965 of 23,740), while the five other categories collectively account for about 20% (4,775). The low entropy ratio (0.445) confirms this imbalance, so classification models built on this column should plan for it.

Treatment: Encode as ordinal (safe < vulnerable < definitely endangered < severely endangered < critically endangered < extinct) and apply class-imbalance mitigation (e.g., oversampling or class weights) before modelling.

anthropic:default · confidence high
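
A sketch of the ordinal encoding and class weights mentioned in the treatment, using the conventional severity ordering of the six tiers (safe, vulnerable, definitely, severely, critically endangered, extinct) and the counts from the status table. The weight formula n / (k × count) is one common "balanced" heuristic chosen here for illustration, not something saturn prescribes.

```python
STATUS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
STATUS_RANK = {label: rank for rank, label in enumerate(STATUS_ORDER)}

def encode_status(values):
    """Map status labels to integer ranks (0 = safe, 5 = extinct)."""
    return [STATUS_RANK[v] for v in values]

# Counts copied from the status table in this report.
status_counts = {
    "safe": 18_965, "definitely endangered": 1_814, "vulnerable": 1_194,
    "extinct": 889, "critically endangered": 465, "severely endangered": 413,
}
n, k = sum(status_counts.values()), len(status_counts)
class_weight = {label: n / (k * count) for label, count in status_counts.items()}

print(encode_status(["safe", "extinct", "definitely endangered"]))  # [0, 5, 2]
```

Rare tiers such as 'severely endangered' receive the largest weights and 'safe' the smallest.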

Out[31]:

saturn.columns["status"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	6
top_value	safe
top_rate	0.7989
cardinality	6
entropy	1.15
entropy_ratio	0.4447

Fig 14.

Top values for status.

Top values for status (6 unique shown, of 6 total).
value	count	share
safe	18965	79.9%
definitely endangered	1814	7.6%
vulnerable	1194	5.0%
extinct	889	3.7%
critically endangered	465	2.0%
severely endangered	413	1.7%

latitude numeric feature

This column contains geographic latitude values for records in the dataset, spanning from -55.27° (southern South America or similar) to 73.14° (Arctic latitudes), consistent with global coverage. The most striking issue is a 66.54% null rate — two-thirds of rows lack a latitude value, which severely limits spatial analysis. The distribution is mildly right-skewed (skew 0.54) with a mean of ~8.2° and median of ~6.3°, suggesting a concentration of records in tropical/equatorial regions. Only 129 outliers are flagged (1.6%), so the non-null values themselves appear geographically plausible.

Treatment: Investigate and impute or filter nulls (66.54% missing) before use; pair with longitude for spatial modelling or clustering.

anthropic:default · confidence high

Out[34]:

saturn.columns["latitude"].stats

stat	value
n	23,740
nulls	15,797 (66.5%)
unique	7,798
min	-55.27
max	73.14
mean	8.17
median	6.306
std	18.96
q1	-5.137
q3	19.34
iqr	24.47
skew	0.5403
kurtosis	0.3006
n_outliers	129
outlier_rate	0.01624
zero_rate	0
alert: null_rate	66.5% null

Fig 15.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 6.30619).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	48
-26.38 – -23.17	78
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	281
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	379
2.51 – 5.72	469
5.72 – 8.93	664
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	387
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	373
28.19 – 31.4	167
31.4 – 34.61	144
34.61 – 37.82	179
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	78
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6

longitude numeric feature

This column represents geographic longitude, covering nearly the full global range from -178.785° to 179.306°. The null rate of 66.54% is the main concern: two-thirds of records lack a coordinate. The IQR of ~117° and std of ~81° confirm values are spread across both hemispheres. The mean (51.27°) and median (47.72°) are both positive, indicating a concentration in the Eastern Hemisphere; the histogram peaks at roughly 9° to 18° and 134° to 152°, bands that would be consistent with the Nigeria, Cameroon, and Papua New Guinea entries in the country_ids table. Only 13 outliers are flagged despite the extreme range, consistent with legitimate global coordinates and not data entry errors.

Treatment: Investigate and impute or filter the 66.54% null records before use; pair with latitude for geospatial modelling or clustering.

anthropic:default · confidence high
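
One way to restrict spatial work to usable rows is to keep only records where both coordinates parse. The sketch assumes missing values arrive as empty strings, as they would from `csv.DictReader`; the helper names and toy rows are illustrative.

```python
def parse_coord(text):
    """Return a float, or None when the cell is empty or missing."""
    text = (text or "").strip()
    return float(text) if text else None

def with_coordinates(rows):
    """Keep only rows where both latitude and longitude are present."""
    located = []
    for row in rows:
        lat = parse_coord(row.get("latitude"))
        lon = parse_coord(row.get("longitude"))
        if lat is not None and lon is not None:
            located.append({**row, "latitude": lat, "longitude": lon})
    return located

# Toy rows with made-up ids and values.
rows = [
    {"id": "aaaa0001", "latitude": "6.3", "longitude": "47.7"},
    {"id": "aaaa0002", "latitude": "", "longitude": ""},
]
located = with_coordinates(rows)
print(len(located), "of", len(rows), "rows have coordinates")  # 1 of 2 rows have coordinates
```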

Out[37]:

saturn.columns["longitude"].stats

stat	value
n	23,740
nulls	15,797 (66.5%)
unique	7,757
min	-178.8
max	179.3
mean	51.27
median	47.72
std	81.14
q1	7.235
q3	124.1
iqr	116.9
skew	-0.4832
kurtosis	-0.7745
n_outliers	13
outlier_rate	0.001637
zero_rate	0
alert: null_rate	66.5% null

Fig 16.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: 47.7236).
bin	count
-178.8 – -169.8	13
-169.8 – -160.9	4
-160.9 – -151.9	10
-151.9 – -143	11
-143 – -134	10
-134 – -125.1	17
-125.1 – -116.1	124
-116.1 – -107.2	47
-107.2 – -98.21	78
-98.21 – -89.26	280
-89.26 – -80.31	59
-80.31 – -71.36	235
-71.36 – -62.41	218
-62.41 – -53.45	150
-53.45 – -44.5	60
-44.5 – -35.55	40
-35.55 – -26.6	0
-26.6 – -17.64	4
-17.64 – -8.692	105
-8.692 – 0.2605	275
0.2605 – 9.213	444
9.213 – 18.17	751
18.17 – 27.12	322
27.12 – 36.07	430
36.07 – 45.02	228
45.02 – 53.97	126
53.97 – 62.93	35
62.93 – 71.88	80
71.88 – 80.83	210
80.83 – 89.78	208
89.78 – 98.74	269
98.74 – 107.7	457
107.7 – 116.6	239
116.6 – 125.6	502
125.6 – 134.5	316
134.5 – 143.5	598
143.5 – 152.4	667
152.4 – 161.4	123
161.4 – 170.4	186
170.4 – 179.3	12

iso639P3code text foreign_key

This column contains ISO 639-3 language codes, the standardised three-letter identifiers for individual languages (e.g., 'aiz', 'kbt', 'mij'). Every non-null value is exactly 3 characters long (min=3, max=3, mean=3.0) and appears only once, yielding 7,968 unique codes with zero duplicates. The 66.44% null rate means two-thirds of the 23,740 rows carry no code. ISO 639-3 codes identify individual languages, so the dialect and family rows (64.3% of the table by level) would mostly not be expected to have one; the remaining gap likely reflects languages not mapped to an ISO 639-3 identifier.

Treatment: Left-join on this code to enrich with ISO 639-3 language metadata; flag the 66.44% nulls instead of imputing them, since a missing code cannot be inferred from the other columns.

anthropic:default · confidence high
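
A minimal sketch of such a left-join with a missing-code flag, assuming a lookup dict keyed by ISO 639-3 code. The `iso_table` contents are placeholders, not real language names, and `attach_iso_metadata` is a hypothetical helper.

```python
def attach_iso_metadata(rows, iso_table):
    """Left-join: every input row is kept; rows without a match get iso_name=None."""
    out = []
    for row in rows:
        code = row.get("iso639P3code") or None   # treat '' as missing
        out.append({
            **row,
            "has_iso_code": code is not None,
            "iso_name": iso_table.get(code) if code else None,
        })
    return out

# Placeholder reference table; the names are not real ISO 639-3 entries.
iso_table = {"aiz": "placeholder name A", "kbt": "placeholder name B"}
rows = [
    {"id": "aaaa0001", "iso639P3code": "aiz"},
    {"id": "aaaa0002", "iso639P3code": ""},
]
joined = attach_iso_metadata(rows, iso_table)
print([(r["id"], r["has_iso_code"]) for r in joined])
# [('aaaa0001', True), ('aaaa0002', False)]
```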

Out[40]:

saturn.columns["iso639P3code"].stats

stat	value
n	23,740
nulls	15,772 (66.4%)
unique	7,968
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,968
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: null_rate	66.4% null
alert: short_text	95th-percentile length under 20 chars

Fig 17.

Character-length distribution for iso639P3code.

Character-length distribution for iso639P3code (mean: 3.0).
chars	count
3	7968

description unknown free_text

This column contains textual descriptions with zero null values across 23,740 rows. The profile was skipped by saturn, so no uniqueness, length, or token statistics are available, and the column's exact nature (short labels vs. long free text) cannot be determined from the evidence. The 'skipped' alert reads 'no profiler for kind=unknown', so saturn excluded the column because it could not infer a type, not because profiling failed partway.

Treatment: Tokenize and embed before modelling, or apply NLP preprocessing (stopword removal, TF-IDF/sentence embeddings) depending on downstream task.

anthropic:default · confidence low

Out[43]:

saturn.columns["description"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

markup_description unknown free_text

This column contains markup or formatted description text (likely HTML, Markdown, or similar) across 23,740 rows with zero nulls. The profiler skipped detailed analysis — indicated by the 'skipped' alert — so no uniqueness, length, or content statistics are available. Without further inspection, the content type and cardinality are unknown, but the name strongly implies free-form or templated descriptive text with embedded formatting syntax.

Treatment: Strip markup tags, then tokenize and embed before modelling, or inspect raw values to confirm structure before deciding on parsing strategy.

anthropic:default · confidence low

Out[45]:

saturn.columns["markup_description"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

child_family_count numeric feature

This column appears to count the sub-families (family-level nodes) nested under each languoid. 90.8% of values are exactly zero and the median is 0, so the vast majority of rows have no sub-families beneath them. The distribution is extremely right-skewed (skew = 44.4, kurtosis = 2352.9) with a max of 859. Because q1 and q3 are both 0, the 1.5 × IQR rule flags every non-zero value, which is why the 2,179 outliers (9.2% of rows) equal the non-zero share. In a taxonomy, large counts are expected for the biggest families and are not by themselves evidence of data-entry errors.

Treatment: Spot-check the largest values (max 859) against the parent_id hierarchy; apply a log1p transform before modelling, and cap or winsorize only if a downstream model is sensitive to the tail.

anthropic:default · confidence high
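
The outlier alert uses the 1.5 × IQR rule, which degenerates when q1 and q3 are both 0: both fences collapse to 0, so every positive value is flagged. The sketch below shows this on a toy column and checks the reported rate against the counts in the stats table; `tukey_outlier_count` is an illustrative helper, not saturn's implementation.

```python
def tukey_outlier_count(values, q1, q3, k=1.5):
    """Count values outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)

# Toy column: mostly zeros with one positive count, so q1 = q3 = 0.
values = [0] * 9 + [3]
print(tukey_outlier_count(values, q1=0, q3=0))  # 1

# Reported figures for child_family_count: 2,179 outliers of 23,740 rows,
# which matches the non-zero share 1 - zero_rate = 1 - 0.9082.
print(round(2_179 / 23_740, 4))  # 0.0918
```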

Out[47]:

saturn.columns["child_family_count"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	88
min	0
max	859
mean	0.8792
median	0
std	13.2
q1	0
q3	0
iqr	0
skew	44.4
kurtosis	2353
n_outliers	2,179
outlier_rate	0.09179
zero_rate	0.9082
alert: high_skew	skew=+44.40
alert: outliers	9.2% rows beyond 1.5 IQR

Fig 18.

Distribution of child_family_count. Vertical dash marks the median.

Histogram bins for child_family_count (median: 0.0).
bin	count
0 – 21.48	23588
21.48 – 42.95	93
42.95 – 64.43	27
64.43 – 85.9	6
85.9 – 107.4	4
107.4 – 128.9	3
128.9 – 150.3	2
150.3 – 171.8	2
171.8 – 193.3	1
193.3 – 214.8	1
214.8 – 236.2	0
236.2 – 257.7	0
257.7 – 279.2	1
279.2 – 300.7	2
300.7 – 322.1	1
322.1 – 343.6	1
343.6 – 365.1	0
365.1 – 386.6	0
386.6 – 408	2
408 – 429.5	1
429.5 – 451	0
451 – 472.5	0
472.5 – 493.9	0
493.9 – 515.4	0
515.4 – 536.9	0
536.9 – 558.4	0
558.4 – 579.8	1
579.8 – 601.3	0
601.3 – 622.8	0
622.8 – 644.2	0
644.2 – 665.7	0
665.7 – 687.2	0
687.2 – 708.7	2
708.7 – 730.2	0
730.2 – 751.6	0
751.6 – 773.1	0
773.1 – 794.6	0
794.6 – 816.1	1
816.1 – 837.5	0
837.5 – 859	1

child_language_count numeric feature

This column counts the number of child languages associated with each languoid. The distribution is extremely right-skewed (skew = 41.86, kurtosis = 2115.08): 81.7% of rows have a value of zero, the median is 0.0, and the IQR is 0.0, yet the mean is ~2.0 and the max reaches 1,435. A small number of family nodes dominate with very large child counts, while most entries are leaf nodes with no children. Over 18% of rows (4,339) are flagged as outliers; since q1 and q3 are both 0, that is simply every non-zero row, and the count matches the 4,339 rows whose level is 'family'.

Treatment: Log-transform (log1p) or apply a zero-inflated model; consider a binary 'has_children' flag as a complementary feature given the 81.7% zero rate.

anthropic:default · confidence high
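
A short sketch of the two derived features suggested in the treatment, a log1p-transformed count and a binary has_children flag. `transform_count` is an illustrative name; the inputs are the minimum, a small value, and the reported maximum of this column.

```python
import math

def transform_count(count):
    """Derive a zero/non-zero flag and a log1p-compressed count."""
    return {"has_children": count > 0, "log_count": math.log1p(count)}

for c in [0, 1, 1435]:
    print(c, transform_count(c))
```

log1p maps 0 to 0 exactly, so the zero-heavy bulk stays at the origin while the long tail is compressed.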

Out[50]:

saturn.columns["child_language_count"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	126
min	0
max	1,435
mean	1.996
median	0
std	23.41
q1	0
q3	0
iqr	0
skew	41.86
kurtosis	2115
n_outliers	4,339
outlier_rate	0.1828
zero_rate	0.8172
alert: high_skew	skew=+41.86
alert: outliers	18.3% rows beyond 1.5 IQR

Fig 19.

Distribution of child_language_count. Vertical dash marks the median.

Histogram bins for child_language_count (median: 0.0).
bin	count
0 – 35.88	23547
35.88 – 71.75	121
71.75 – 107.6	37
107.6 – 143.5	6
143.5 – 179.4	3
179.4 – 215.2	4
215.2 – 251.1	3
251.1 – 287	2
287 – 322.9	2
322.9 – 358.8	0
358.8 – 394.6	1
394.6 – 430.5	1
430.5 – 466.4	0
466.4 – 502.2	1
502.2 – 538.1	1
538.1 – 574	2
574 – 609.9	1
609.9 – 645.8	0
645.8 – 681.6	0
681.6 – 717.5	1
717.5 – 753.4	2
753.4 – 789.2	0
789.2 – 825.1	0
825.1 – 861	0
861 – 896.9	0
896.9 – 932.8	0
932.8 – 968.6	0
968.6 – 1004	1
1004 – 1040	0
1040 – 1076	0
1076 – 1112	0
1112 – 1148	0
1148 – 1184	0
1184 – 1220	0
1220 – 1256	0
1256 – 1292	2
1292 – 1327	0
1327 – 1363	0
1363 – 1399	1
1399 – 1435	1

child_dialect_count numeric feature

This column counts the number of child dialects associated with each languoid. The distribution is extreme: 74.4% of rows have zero child dialects, the median is 0, and the interquartile range is 1 (q1 = 0, q3 = 1), yet the mean is 3.39 and the maximum reaches 2,369, producing a skew of 42.2 and a kurtosis of 2159.3. Nearly 18% of rows (4,272) are flagged as outliers, indicating that a small number of nodes act as hubs with enormous fan-out while the vast majority are leaf nodes.

Treatment: Log-transform (log1p) before regression or distance-based modelling; consider also a binary 'has_children' flag and a separate high-cardinality indicator given the extreme outliers.

anthropic:default · confidence high

Out[53]:

saturn.columns["child_dialect_count"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	164
min	0
max	2,369
mean	3.389
median	0
std	36.8
q1	0
q3	1
iqr	1
skew	42.22
kurtosis	2159
n_outliers	4,272
outlier_rate	0.1799
zero_rate	0.7442
alert: high_skew	skew=+42.22
alert: outliers	18.0% rows beyond 1.5 IQR

Fig 20.

Distribution of child_dialect_count. Vertical dash marks the median.

Histogram bins for child_dialect_count (median: 0.0).
bin	count
0 – 59.23	23575
59.23 – 118.5	99
118.5 – 177.7	24
177.7 – 236.9	18
236.9 – 296.1	4
296.1 – 355.4	2
355.4 – 414.6	0
414.6 – 473.8	1
473.8 – 533	1
533 – 592.2	1
592.2 – 651.5	1
651.5 – 710.7	2
710.7 – 769.9	0
769.9 – 829.1	1
829.1 – 888.4	0
888.4 – 947.6	3
947.6 – 1007	0
1007 – 1066	0
1066 – 1125	1
1125 – 1184	1
1184 – 1244	0
1244 – 1303	0
1303 – 1362	1
1362 – 1421	0
1421 – 1481	0
1481 – 1540	0
1540 – 1599	1
1599 – 1658	0
1658 – 1718	0
1718 – 1777	0
1777 – 1836	1
1836 – 1895	1
1895 – 1954	0
1954 – 2014	0
2014 – 2073	0
2073 – 2132	0
2132 – 2191	0
2191 – 2251	1
2251 – 2310	0
2310 – 2369	1

country_ids categorical feature

This column contains ISO 3166-1 alpha-2 country codes, functioning as a categorical geographic identifier for each record. The most striking issue is its 64.24% null rate, meaning nearly two-thirds of the 23,740 rows carry no country information. Among populated rows, Papua New Guinea ('PG') leads at 10.29% of non-null values (3.7% of all rows, the denominator used in the top-values table), followed by Indonesia ('ID') and Nigeria ('NG'); this skew may reflect a real concentration of linguistic diversity, data collection bias, or both. The 680 unique values far exceed the roughly 250 two-letter ISO codes that exist, suggesting some values are multi-code concatenations, malformed entries, or non-standard codes.

Treatment: Investigate the 680 distinct values for malformed or concatenated codes; flag nulls (64.24% missing) before use; one-hot encode or embed after cleaning.

anthropic:default · confidence medium
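
If the extra distinct values turn out to be several codes packed into one cell, splitting them would recover per-country counts. The sketch below assumes a space separator, which is a guess that should be confirmed against the raw values; `split_country_ids` and the toy cells are illustrative.

```python
from collections import Counter

def split_country_ids(cell, sep=" "):
    """Split a possibly multi-valued cell into individual codes.

    The separator is an assumption; check the raw column before relying on it.
    """
    return [code for code in (cell or "").split(sep) if code]

# Toy cells: single code, two codes in one cell, empty, single code.
cells = ["PG", "ID PG", "", "NG"]
per_country = Counter(code for cell in cells for code in split_country_ids(cell))
multi = [c for c in cells if len(split_country_ids(c)) > 1]

print(per_country["PG"])  # 2
print(multi)              # ['ID PG']
```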

Out[56]:

saturn.columns["country_ids"].stats

stat	value
n	23,740
nulls	15,250 (64.2%)
unique	680
top_value	PG
top_rate	0.1029
cardinality	680
entropy	6.493
entropy_ratio	0.6901
alert: null_rate	64.2% null

Fig 21.

Top values for country_ids.

Top values for country_ids (20 unique shown, of 680 total).
value	count	share
PG	874	3.7%
ID	695	2.9%
NG	480	2.0%
AU	432	1.8%
IN	356	1.5%
MX	297	1.3%
CN	271	1.1%
BR	263	1.1%
US	247	1.0%
CM	196	0.8%
PH	177	0.7%
CD	156	0.7%
VU	118	0.5%
SD	99	0.4%
PE	97	0.4%
TZ	93	0.4%
MY	90	0.4%
TD	88	0.4%
RU	83	0.3%
CO	82	0.3%

How to cite