language-data-glottolog_languoid

Overview

Source: /home/coolhand/datasets/language-data/glottolog_languoid.csv

Saturn profiled 23,740 rows across 16 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

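!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source CSV and write the findings JSON.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/datasets/language-data/glottolog_languoid.csv",
        "--findings", "language-data-glottolog_languoid.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)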

Summary confidence: high

This dataset is a Glottolog languoid catalog with 23,740 rows and 16 columns describing languages, dialects, and families along with geographic and endangerment metadata. The `level` field splits the records into three classes, dialect (10,920), language (8,481), and family (4,339), making it the natural primary lens. Endangerment `status` is dominated by 'safe' (~79.9%); the other five categories, from vulnerable to extinct, cover the remaining 4,775 languoids (~20.1%) and are worth investigating. Geography is concentrated: `country_ids` is led by PG (874), ID (695), and NG (480), and `family_id` is heavily skewed toward atla1278 (4,663) and aust1307 (3,850). Note that `iso639P3code`, `latitude`, and `longitude` are ~66% null, so spatial analysis will only cover about a third of rows.

citing: level · status · country_ids · family_id · latitude · longitude · iso639P3code · child_dialect_count · bookkeeping

Out[4]:

saturn.schema() · 16 columns

column	kind	n	null%	unique	alerts
id	text	23,740	0.0%	23,740	near_unique one_word short_text
family_id	categorical	23,740	1.8%	287
parent_id	text	23,740	1.8%	7,338	one_word short_text duplicates
name	text	23,740	0.0%	23,740	near_unique one_word
bookkeeping	categorical	23,740	0.0%	2	imbalance
level	categorical	23,740	0.0%	3
status	categorical	23,740	0.0%	6
latitude	numeric	23,740	66.5%	7,798	null_rate
longitude	numeric	23,740	66.5%	7,757	null_rate
iso639P3code	text	23,740	66.4%	7,968	near_unique one_word null_rate short_text
description	unknown	23,740	0.0%	—	skipped
markup_description	unknown	23,740	0.0%	—	skipped
child_family_count	numeric	23,740	0.0%	88	high_skew outliers
child_language_count	numeric	23,740	0.0%	126	high_skew outliers
child_dialect_count	numeric	23,740	0.0%	164	high_skew outliers
country_ids	categorical	23,740	64.2%	680	null_rate

Fig 1.

level · Shows the split between dialect, language, and family entries — the core taxonomy of the dataset.

Top values for level (3 unique shown, of 3 total).
value	count	share
dialect	10920	46.0%
language	8481	35.7%
family	4339	18.3%

Fig 2.

status · Highlights how many languoids are safe versus various endangerment levels, including 889 already extinct.

Top values for status (6 unique shown, of 6 total).
value	count	share
safe	18965	79.9%
definitely endangered	1814	7.6%
vulnerable	1194	5.0%
extinct	889	3.7%
critically endangered	465	2.0%
severely endangered	413	1.7%

Fig 3.

family_id · Reveals the dominant language families; atla1278 and aust1307 alone account for over a third of records.

Top values for family_id (20 unique shown, of 287 total).
value	count	share
atla1278	4663	19.6%
aust1307	3850	16.2%
indo1319	2201	9.3%
sino1245	1666	7.0%
afro1255	1259	5.3%
nucl1709	762	3.2%
pama1250	598	2.5%
aust1305	503	2.1%
book1242	399	1.7%
otom1299	338	1.4%
mand1469	303	1.3%
sign1238	259	1.1%
drav1251	255	1.1%
cent2225	251	1.1%
turk1311	229	1.0%
taik1256	223	0.9%
nilo1247	201	0.8%
ural1272	185	0.8%
japo1237	179	0.8%
tupi1275	157	0.7%

Fig 4.

country_ids · Shows geographic concentration, with Papua New Guinea, Indonesia, and Nigeria leading by languoid count.

Top values for country_ids (20 unique shown, of 680 total).
value	count	share
PG	874	3.7%
ID	695	2.9%
NG	480	2.0%
AU	432	1.8%
IN	356	1.5%
MX	297	1.3%
CN	271	1.1%
BR	263	1.1%
US	247	1.0%
CM	196	0.8%
PH	177	0.7%
CD	156	0.7%
VU	118	0.5%
SD	99	0.4%
PE	97	0.4%
TZ	93	0.4%
MY	90	0.4%
TD	88	0.4%
RU	83	0.3%
CO	82	0.3%

Fig 5.

latitude · Distribution of languoid latitudes (where known) — note the ~66% null rate limits coverage.

Histogram bins for latitude (median: 6.30619).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	48
-26.38 – -23.17	78
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	281
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	379
2.51 – 5.72	469
5.72 – 8.93	664
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	387
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	373
28.19 – 31.4	167
31.4 – 34.61	144
34.61 – 37.82	179
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	78
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	text	0.0%
family_id	categorical	1.8%
parent_id	text	1.8%
name	text	0.0%
bookkeeping	categorical	0.0%
level	categorical	0.0%
status	categorical	0.0%
latitude	numeric	66.5%
longitude	numeric	66.5%
iso639P3code	text	66.4%
description	unknown	0.0%
markup_description	unknown	0.0%
child_family_count	numeric	0.0%
child_language_count	numeric	0.0%
child_dialect_count	numeric	0.0%
country_ids	categorical	64.2%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 5 numeric columns (values clipped to 2 decimals).
	latitude	longitude	child_family_count	child_language_count	child_dialect_count
latitude	+1.00	-0.31	+0.03	+0.01	+0.06
longitude	-0.31	+1.00	-0.03	-0.04	-0.05
child_family_count	+0.03	-0.03	+1.00	+0.96	+0.74
child_language_count	+0.01	-0.04	+0.96	+1.00	+0.69
child_dialect_count	+0.06	-0.05	+0.74	+0.69	+1.00

id text identifier

Fixed 8-character single-token codes (e.g., 'melk1240', 'yang1299'), unique across all 23,740 rows with no nulls or duplicates. The pattern of four letters followed by four digits is consistent with Glottolog-style language identifiers, making this a primary key rather than analyzable text.

Treatment: Use as the row key for joins; exclude from modelling features.
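
Before relying on `id` as a join key, it can help to re-check the two properties the profile reports: a fixed code shape and no duplicates. The sketch below is a minimal, stdlib-only illustration. The regular expression (four lowercase alphanumerics plus four digits) is an assumption that is slightly looser than the "four letters" pattern described above, and `check_row_keys` and the sample list are hypothetical names and values, not part of saturn.

```python
import re
from collections import Counter

# Assumed code shape: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}\d{4}")


def check_row_keys(ids):
    """Return (malformed, duplicated) for a sequence of id strings."""
    malformed = [i for i in ids if not GLOTTOCODE_RE.fullmatch(i)]
    counts = Counter(ids)
    duplicated = sorted(i for i, c in counts.items() if c > 1)
    return malformed, duplicated


# Illustrative input only: two ids quoted in the profile plus deliberate errors.
sample_ids = ["melk1240", "yang1299", "book1242", "bad-id", "melk1240"]
malformed, duplicated = check_row_keys(sample_ids)
print(malformed, duplicated)
```

On the real file both lists should come back empty if the profile's "unique, fixed 8-character" finding holds.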

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	23,740
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	86.11
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for id.

Character-length distribution for id (mean: 8.0).
chars	count
8	23740

family_id categorical foreign_key

Looks like a Glottolog-style language family identifier (e.g. 'atla1278' for Atlantic-Congo, 'aust1307' for Austronesian) tagging 23,311 of the 23,740 rows. The distribution is heavily skewed: the top family alone covers 20.0% of non-null rows (19.6% of all rows), and the top 10 of 287 families together account for about 68% of all rows, yielding an entropy ratio of 0.598. Null rate is low at 1.81%.

Treatment: left-join on this id to a language-family reference table; consider grouping the long tail before modelling.
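
One way to group the long tail is to keep families above a share threshold and fold the rest into a single bucket. This is a minimal sketch rather than a recommended cutoff: `group_long_tail`, the `min_share` default, and the `"other"` / `"missing"` labels are all hypothetical choices, and the sample values only reuse family ids that appear in the table above.

```python
from collections import Counter


def group_long_tail(values, min_share=0.01, other_label="other", missing_label="missing"):
    """Replace rare categories with other_label and nulls with missing_label."""
    present = [v for v in values if v]
    counts = Counter(present)
    total = len(present)
    # Shares are computed over non-null rows only.
    keep = {v for v, c in counts.items() if c / total >= min_share}
    grouped = []
    for v in values:
        if not v:
            grouped.append(missing_label)
        elif v in keep:
            grouped.append(v)
        else:
            grouped.append(other_label)
    return grouped


# Illustrative input only; the counts here are not the real distribution.
sample = ["atla1278"] * 5 + ["aust1307"] * 4 + ["tupi1275"] + [None]
grouped = group_long_tail(sample, min_share=0.2)
print(Counter(grouped))
```

Keeping nulls as their own label, instead of folding them into "other", preserves the distinction between an unclassified languoid and a small family.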

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["family_id"].stats

stat	value
n	23,740
nulls	429 (1.8%)
unique	287
top_value	atla1278
top_rate	0.2
cardinality	287
entropy	4.886
entropy_ratio	0.5984

Fig 9.

Top values for family_id.

parent_id text foreign_key

Fixed-width 8-character single-token codes (e.g. 'book1242', 'uncl1493') with len_min=len_max=8 and one_word_rate=1.0; these look like Glottolog-style language/family identifiers used as a parent reference. The 23,311 non-null rows hold 7,338 unique values, so 15,973 entries repeat an earlier parent (duplicate_rate 68.5%): many children share parents, and 'book1242' alone accounts for 399 rows. Null rate is low at 1.81%.

Treatment: left-join on this id to a parent/language lookup table.
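
If `parent_id` points back at `id` in the same file (an assumption; the profile does not confirm it), the lookup table is the dataset itself and the join becomes a self-join. The sketch below shows a left-join for the parent's name and a walk up the ancestor chain. The function names and the three sample rows, including their ids, are hypothetical placeholders.

```python
def build_index(rows):
    """Map each id to its row so parents can be looked up in O(1)."""
    return {row["id"]: row for row in rows}


def parent_name(row, index):
    """Left-join semantics: None when parent_id is null or has no match."""
    parent = index.get(row.get("parent_id"))
    return parent["name"] if parent else None


def ancestors(languoid_id, index, max_depth=64):
    """Follow parent_id links upward; stop on a null, a cycle, or max_depth."""
    chain, seen = [], {languoid_id}
    current = index.get(languoid_id)
    while current is not None and current.get("parent_id"):
        parent_id = current["parent_id"]
        if parent_id in seen or len(chain) >= max_depth:
            break
        chain.append(parent_id)
        seen.add(parent_id)
        current = index.get(parent_id)
    return chain


# Placeholder rows, not real catalog entries.
rows = [
    {"id": "fami0001", "parent_id": None, "name": "Example family"},
    {"id": "lang0001", "parent_id": "fami0001", "name": "Example language"},
    {"id": "dial0001", "parent_id": "lang0001", "name": "Example dialect"},
]
index = build_index(rows)
print(parent_name(rows[2], index), ancestors("dial0001", index))
```

The cycle and depth guards are defensive; a well-formed tree should never trigger them.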

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["parent_id"].stats

stat	value
n	23,740
nulls	429 (1.8%)
unique	7,338
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	15,973
duplicate_rate	0.6852
vocab_size	7,189
readability_flesch_mean	91.19
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	68.5% duplicate strings

Fig 10.

Character-length distribution for parent_id.

Character-length distribution for parent_id (mean: 8.0).
chars	count
8	23311

name text identifier

The `name` column holds 23,740 fully unique short labels (null_rate 0.0, n_unique equals n), with a mean length of 9.95 characters and 69.5% being a single word. Top tokens like `nuclear`, `central`, `western`, `eastern`, `northern`, `southern`, and `language` are typical qualifiers in language, dialect, and family names, which fits a languoid catalog. Vocabulary is broad (17,915 distinct words at only ~1.4 words per row), and there are no duplicates, URLs, or emoji.

Treatment: Treat as a unique label key; drop from modelling features or hash/embed if semantic content is needed.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["name"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	23,740
len_min	1
len_max	58
len_mean	9.95
len_median	8
len_p95	22
word_mean	1.398
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	17,915
readability_flesch_mean	42.62
emoji_rate	0
url_rate	0
one_word_rate	0.6953
allcaps_rate	0.0001685
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	69.5% rows are a single word

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 9.950126368997473).
chars	count
1 – 2	54
2 – 4	496
4 – 5	4648
5 – 7	3145
7 – 8	4510
8 – 10	1373
10 – 11	1069
11 – 12	1997
12 – 14	1023
14 – 15	1768
15 – 17	644
17 – 18	848
18 – 20	340
20 – 21	281
21 – 22	518
22 – 24	224
24 – 25	298
25 – 27	101
27 – 28	140
28 – 30	46
30 – 31	44
31 – 32	69
32 – 34	24
34 – 35	34
35 – 37	12
37 – 38	9
38 – 39	4
39 – 41	2
41 – 42	3
42 – 44	3
44 – 45	5
45 – 47	1
47 – 48	2
48 – 49	1
49 – 51	0
51 – 52	2
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	2

bookkeeping categorical feature

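Boolean flag with only two values across 23,740 rows and no nulls; the column name suggests it marks bookkeeping entries, though the profile does not confirm the semantics. The distribution is severely imbalanced: 'False' covers 98.3% (23,341 rows) versus only 399 'True' cases, yielding an entropy ratio of just 0.12. The 399 'True' rows match the 399 rows assigned to family_id 'book1242', which hints that the flag and that family mark the same records.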

Treatment: Encode as boolean and apply class-imbalance handling (e.g., stratification or reweighting) if used as a target.
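
A minimal sketch of both steps, assuming the CSV stores the flag as the strings 'True' and 'False' as shown in the value table. `parse_bool_flag` and `balanced_class_weights` are hypothetical helpers; the weighting formula, total / (classes × class count), is one common inverse-frequency heuristic and not something saturn prescribes.

```python
def parse_bool_flag(value):
    """Strictly convert 'True'/'False' text to bool; reject anything else."""
    text = str(value).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"unexpected bookkeeping value: {value!r}")


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Counts taken from the bookkeeping value table in this report.
weights = balanced_class_weights({False: 23341, True: 399})
print(parse_bool_flag("True"), weights)
```

Raising on unexpected values, instead of coercing with `bool(...)`, avoids silently treating the non-empty string 'False' as true.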

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["bookkeeping"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	2
top_value	False
top_rate	0.9832
cardinality	2
entropy	0.1231
entropy_ratio	0.1231
alert: imbalance	top value is 98.3% of rows

Fig 12.

Top values for bookkeeping.

Top values for bookkeeping (2 unique shown, of 2 total).
value	count	share
False	23341	98.3%
True	399	1.7%

level categorical feature

This is a categorical taxonomy tag with exactly 3 levels (dialect, language, family) and no nulls across 23,740 rows. The distribution is well-spread (entropy_ratio 0.94), with 'dialect' leading at 46.0%, followed by 'language' (8,481) and 'family' (4,339). Looks like a linguistic classification level rather than a free-form attribute.

Treatment: one-hot encode for modelling or use directly as a stratification key.
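
The entropy figures in this report are consistent with Shannon entropy in bits, and the entropy ratio with that entropy divided by log2 of the cardinality. That reading is inferred from the reported numbers, not from saturn documentation. The sketch below recomputes both from the value counts for `level` and `status`; the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Counts copied from the value tables in this report.
level_counts = [10920, 8481, 4339]
status_counts = [18965, 1814, 1194, 889, 465, 413]

print(round(entropy_bits(level_counts), 3), round(entropy_ratio(level_counts), 4))
print(round(entropy_bits(status_counts), 3), round(entropy_ratio(status_counts), 4))
```

A ratio near 1 means the categories are close to evenly used; a ratio near 0 means one category dominates.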

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["level"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	3
top_value	dialect
top_rate	0.46
cardinality	3
entropy	1.494
entropy_ratio	0.9426

Fig 13.

Top values for level.

status categorical label

This is a categorical status column with 6 levels matching UNESCO-style language endangerment categories (safe, definitely endangered, vulnerable, extinct, critically endangered, severely endangered). The distribution is heavily imbalanced: 'safe' accounts for 79.9% of 23,740 rows, while the rarest level 'severely endangered' has only 413 records. Entropy ratio is 0.44, confirming low diversity despite 6 classes.

Treatment: Treat as ordinal target; stratify or rebalance before classification given the 80/20 dominance of 'safe'.
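
Treating `status` as ordinal requires an explicit severity order, because the value table lists the levels by frequency. The sketch below encodes the six labels as ranks 0 to 5. The order used is an assumption based on the usual UNESCO-style progression from safe to extinct, and `STATUS_ORDER` and `encode_status` are hypothetical names.

```python
# Assumed severity order, least to most endangered.
STATUS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
STATUS_RANK = {name: rank for rank, name in enumerate(STATUS_ORDER)}


def encode_status(value):
    """Map a status label to its ordinal rank; raise KeyError on unknown labels."""
    return STATUS_RANK[value.strip().lower()]


# The labels below are the six values reported for the status column.
observed = ["safe", "definitely endangered", "vulnerable", "extinct",
            "critically endangered", "severely endangered"]
ranks = [encode_status(s) for s in observed]
print(ranks)
```

Failing loudly on an unknown label is deliberate: a silent default rank would hide new categories in a later Glottolog release.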

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["status"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	6
top_value	safe
top_rate	0.7989
cardinality	6
entropy	1.15
entropy_ratio	0.4447

Fig 14.

Top values for status.

latitude numeric feature

Geographic latitude coordinate spanning -55.27 to 73.14, consistent with valid Earth latitudes. Two-thirds of rows are null (null_rate 0.6654), which severely limits coverage. The distribution is mildly right-skewed (0.54) with median 6.31 and IQR ~24.5 (q1 -5.14, q3 19.34), so the populated rows cluster in the tropics with a slight northern-hemisphere lean.

Treatment: Impute or filter the 66% nulls before any spatial modelling; pair with longitude for geo-features.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["latitude"].stats

stat	value
n	23,740
nulls	15,797 (66.5%)
unique	7,798
min	-55.27
max	73.14
mean	8.17
median	6.306
std	18.96
q1	-5.137
q3	19.34
iqr	24.47
skew	0.5403
kurtosis	0.3006
n_outliers	129
outlier_rate	0.01624
zero_rate	0
alert: null_rate	66.5% null

Fig 15.

Distribution of latitude. Vertical dash marks the median.

longitude numeric feature

Geographic longitude in decimal degrees, with values spanning -178.785 to 179.306, consistent with the global WGS84 range. Two-thirds of rows are null (null_rate 0.6654), so coverage is the dominant concern; among populated rows the distribution is mildly left-skewed (-0.48) with a median of 47.72 suggesting an Eastern-Hemisphere bias. Only 13 outliers (0.16%) and no zeros, so the populated values themselves look clean.

Treatment: Pair with latitude for geospatial features; impute or filter the 66.5% missing before modelling.
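
For the filtering option, a minimal sketch is to keep only rows where both coordinates parse and fall inside valid ranges. It assumes rows arrive as dictionaries of strings, as `csv.DictReader` would produce, with empty strings for nulls. `parse_coord`, `rows_with_coordinates`, and the sample rows (placeholder ids, with the two reported medians reused as coordinates) are hypothetical.

```python
def parse_coord(text, low, high):
    """Parse a coordinate string; None if blank, non-numeric, or out of range."""
    if text is None or str(text).strip() == "":
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return value if low <= value <= high else None


def rows_with_coordinates(rows):
    """Return (id, lat, lon) tuples for rows where both coordinates are usable."""
    kept = []
    for row in rows:
        lat = parse_coord(row.get("latitude"), -90.0, 90.0)
        lon = parse_coord(row.get("longitude"), -180.0, 180.0)
        if lat is not None and lon is not None:
            kept.append((row["id"], lat, lon))
    return kept


# Placeholder rows for illustration only.
rows = [
    {"id": "aaaa0001", "latitude": "6.30619", "longitude": "47.7236"},
    {"id": "aaaa0002", "latitude": "", "longitude": ""},
    {"id": "aaaa0003", "latitude": "95.0", "longitude": "10.0"},
]
kept = rows_with_coordinates(rows)
print(kept)
```

Filtering as a pair matters because a row with only one coordinate is unusable for geo-features even though each column looks populated on its own.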

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["longitude"].stats

stat	value
n	23,740
nulls	15,797 (66.5%)
unique	7,757
min	-178.8
max	179.3
mean	51.27
median	47.72
std	81.14
q1	7.235
q3	124.1
iqr	116.9
skew	-0.4832
kurtosis	-0.7745
n_outliers	13
outlier_rate	0.001637
zero_rate	0
alert: null_rate	66.5% null

Fig 16.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: 47.7236).
bin	count
-178.8 – -169.8	13
-169.8 – -160.9	4
-160.9 – -151.9	10
-151.9 – -143	11
-143 – -134	10
-134 – -125.1	17
-125.1 – -116.1	124
-116.1 – -107.2	47
-107.2 – -98.21	78
-98.21 – -89.26	280
-89.26 – -80.31	59
-80.31 – -71.36	235
-71.36 – -62.41	218
-62.41 – -53.45	150
-53.45 – -44.5	60
-44.5 – -35.55	40
-35.55 – -26.6	0
-26.6 – -17.64	4
-17.64 – -8.692	105
-8.692 – 0.2605	275
0.2605 – 9.213	444
9.213 – 18.17	751
18.17 – 27.12	322
27.12 – 36.07	430
36.07 – 45.02	228
45.02 – 53.97	126
53.97 – 62.93	35
62.93 – 71.88	80
71.88 – 80.83	210
80.83 – 89.78	208
89.78 – 98.74	269
98.74 – 107.7	457
107.7 – 116.6	239
116.6 – 125.6	502
125.6 – 134.5	316
134.5 – 143.5	598
143.5 – 152.4	667
152.4 – 161.4	123
161.4 – 170.4	186
170.4 – 179.3	12

iso639P3code text identifier

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). Two-thirds of rows are missing (null_rate=0.6644), and the 7,968 populated rows carry 7,968 distinct codes with no duplicates, so this is one code per entry rather than a repeated categorical.

Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and handle the 66% nulls explicitly.
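
Because the populated codes are unique, the column can serve as a crosswalk from ISO 639-3 code to row `id`, with nulls counted rather than dropped silently. The sketch below is one hedged way to build it. `build_iso_crosswalk` is a hypothetical helper, and the sample rows use placeholder ids and codes from the ISO 639 private-use range (`qaa`, `qab`), not real assignments.

```python
def build_iso_crosswalk(rows):
    """Return ({iso_code: id}, n_missing); raise if a code appears twice."""
    crosswalk = {}
    n_missing = 0
    for row in rows:
        code = (row.get("iso639P3code") or "").strip()
        if not code:
            n_missing += 1  # handle nulls explicitly instead of joining on ''
            continue
        if code in crosswalk:
            raise ValueError(f"duplicate ISO code: {code}")
        crosswalk[code] = row["id"]
    return crosswalk, n_missing


# Placeholder rows for illustration only.
rows = [
    {"id": "aaaa0001", "iso639P3code": "qaa"},
    {"id": "aaaa0002", "iso639P3code": ""},
    {"id": "aaaa0003", "iso639P3code": "qab"},
    {"id": "aaaa0004", "iso639P3code": None},
]
crosswalk, n_missing = build_iso_crosswalk(rows)
print(crosswalk, n_missing)
```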

anthropic:claude-opus-4-7 · confidence high

Out[40]:

saturn.columns["iso639P3code"].stats

stat	value
n	23,740
nulls	15,772 (66.4%)
unique	7,968
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,968
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: null_rate	66.4% null
alert: short_text	95th-percentile length under 20 chars

Fig 17.

Character-length distribution for iso639P3code.

Character-length distribution for iso639P3code (mean: 3.0).
chars	count
3	7968

description unknown free_text

This column is named 'description' but saturn skipped profiling, so no kind, uniqueness, or value statistics are available. Only the row count (23740) and a null rate of 0.0 are reported. Without further stats, the content and structure cannot be characterized.

Treatment: Re-profile or sample manually before deciding; if textual, tokenize and embed before modelling.

anthropic:claude-opus-4-7 · confidence low

Out[43]:

saturn.columns["description"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

markup_description unknown free_text

This column was skipped by the profiler, so no statistics, uniqueness count, or value samples are available beyond a row count of 23,740 and a null rate of 0.0. The name suggests it holds a marked-up (possibly HTML or Markdown) version of the languoid description, but that is inferred from the label alone, not from evidence. No distributional signal can be assessed here.

Treatment: Re-run the profiler with text handling enabled, then tokenize and embed before modelling.

anthropic:claude-opus-4-7 · confidence low

Out[45]:

saturn.columns["markup_description"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

child_family_count numeric feature

A numeric count of child families per record, with 23,740 rows and only 88 distinct values. It is overwhelmingly zero (zero_rate 0.9082), so q1, median, and q3 are all 0 and IQR is 0, yet a long tail pushes max to 859 with mean 0.879 and std 13.20. Skew of 44.40 and kurtosis of 2352.94 confirm an extreme heavy-tailed distribution. Because the IQR is 0, the 1.5 IQR rule flags every non-zero row, which is why the 2,179 outliers (9.18%) match the non-zero share exactly.

Treatment: Binarize (zero vs non-zero) or apply log1p before modelling to tame the extreme skew.
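
Both suggested transforms are one-liners; the sketch below shows them side by side on a tiny illustrative list that includes the reported maximum of 859. The function names are hypothetical.

```python
import math


def log1p_transform(counts):
    """Compress the long tail with log(1 + x); zeros stay at 0.0."""
    return [math.log1p(c) for c in counts]


def binarize_nonzero(counts):
    """1 if the record has at least one child family, else 0."""
    return [1 if c > 0 else 0 for c in counts]


# Illustrative values only; 859 is the reported maximum.
sample = [0, 0, 0, 3, 859]
logged = log1p_transform(sample)
flags = binarize_nonzero(sample)
print([round(v, 3) for v in logged], flags)
```

`log1p` keeps the ordering and maps 0 to 0, while binarizing discards magnitude entirely; which is preferable depends on whether the size of a family matters for the downstream task.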

anthropic:claude-opus-4-7 · confidence high

Out[47]:

saturn.columns["child_family_count"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	88
min	0
max	859
mean	0.8792
median	0
std	13.2
q1	0
q3	0
iqr	0
skew	44.4
kurtosis	2353
n_outliers	2,179
outlier_rate	0.09179
zero_rate	0.9082
alert: high_skew	skew=+44.40
alert: outliers	9.2% rows beyond 1.5 IQR

Fig 18.

Distribution of child_family_count. Vertical dash marks the median.

Histogram bins for child_family_count (median: 0.0).
bin	count
0 – 21.48	23588
21.48 – 42.95	93
42.95 – 64.43	27
64.43 – 85.9	6
85.9 – 107.4	4
107.4 – 128.9	3
128.9 – 150.3	2
150.3 – 171.8	2
171.8 – 193.3	1
193.3 – 214.8	1
214.8 – 236.2	0
236.2 – 257.7	0
257.7 – 279.2	1
279.2 – 300.7	2
300.7 – 322.1	1
322.1 – 343.6	1
343.6 – 365.1	0
365.1 – 386.6	0
386.6 – 408	2
408 – 429.5	1
429.5 – 451	0
451 – 472.5	0
472.5 – 493.9	0
493.9 – 515.4	0
515.4 – 536.9	0
536.9 – 558.4	0
558.4 – 579.8	1
579.8 – 601.3	0
601.3 – 622.8	0
622.8 – 644.2	0
644.2 – 665.7	0
665.7 – 687.2	0
687.2 – 708.7	2
708.7 – 730.2	0
730.2 – 751.6	0
751.6 – 773.1	0
773.1 – 794.6	0
794.6 – 816.1	1
816.1 – 837.5	0
837.5 – 859	1

child_language_count numeric feature

A numeric count of child languages per record, where 81.7% of rows are zero and Q1 = median = Q3 = 0, so the typical record has none. The distribution is extremely long-tailed (skew 41.86, kurtosis 2115) with a max of 1,435. Because the IQR is 0, the 1.5 IQR rule flags every non-zero row: the 4,339 outliers (18.3%) equal the count of `family`-level rows, which suggests that child languages are recorded only on family records.

Treatment: Binarize (has_children vs none) or log1p-transform before modelling given the heavy zero-inflation and skew.
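
The outlier alert on a zero-inflated count is easier to interpret once the 1.5 IQR rule is written out: when q1 and q3 are both 0, both fences collapse to 0 and every non-zero value is flagged. The sketch below demonstrates this on a small illustrative list. The linear-interpolation quantile is an assumption; saturn's exact quantile method is not stated in this report.

```python
import math


def quantile(sorted_values, q):
    """Quantile by linear interpolation between the two nearest ranks."""
    n = len(sorted_values)
    pos = q * (n - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def tukey_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) under the k * IQR rule."""
    s = sorted(values)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]


# Illustrative zero-inflated sample: 80% zeros, like the shape reported above.
sample = [0] * 8 + [3, 50]
lower, upper, outliers = tukey_outliers(sample)
print(lower, upper, outliers)
```

In this situation the outlier count measures how many records have children at all, not how many are unusually large.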

anthropic:claude-opus-4-7 · confidence high

Out[50]:

saturn.columns["child_language_count"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	126
min	0
max	1,435
mean	1.996
median	0
std	23.41
q1	0
q3	0
iqr	0
skew	41.86
kurtosis	2115
n_outliers	4,339
outlier_rate	0.1828
zero_rate	0.8172
alert: high_skew	skew=+41.86
alert: outliers	18.3% rows beyond 1.5 IQR

Fig 19.

Distribution of child_language_count. Vertical dash marks the median.

Histogram bins for child_language_count (median: 0.0).
bin	count
0 – 35.88	23547
35.88 – 71.75	121
71.75 – 107.6	37
107.6 – 143.5	6
143.5 – 179.4	3
179.4 – 215.2	4
215.2 – 251.1	3
251.1 – 287	2
287 – 322.9	2
322.9 – 358.8	0
358.8 – 394.6	1
394.6 – 430.5	1
430.5 – 466.4	0
466.4 – 502.2	1
502.2 – 538.1	1
538.1 – 574	2
574 – 609.9	1
609.9 – 645.8	0
645.8 – 681.6	0
681.6 – 717.5	1
717.5 – 753.4	2
753.4 – 789.2	0
789.2 – 825.1	0
825.1 – 861	0
861 – 896.9	0
896.9 – 932.8	0
932.8 – 968.6	0
968.6 – 1004	1
1004 – 1040	0
1040 – 1076	0
1076 – 1112	0
1112 – 1148	0
1148 – 1184	0
1184 – 1220	0
1220 – 1256	0
1256 – 1292	2
1292 – 1327	0
1327 – 1363	0
1363 – 1399	1
1399 – 1435	1

child_dialect_count numeric feature

This is a count of child dialects per record, dominated by zeros (zero_rate 0.7442): the median is 0 and Q3 is 1, yet the max is 2,369. Skew of 42.22 and kurtosis of 2159 confirm an extreme long tail. With Q3 = 1 and IQR = 1, the 1.5 IQR upper fence sits at 2.5, so the 17.99% of rows flagged as outliers are those with three or more child dialects. The mean of 3.39 is pulled far above the median, so mean-based aggregates will misrepresent the typical record.

Treatment: Log1p-transform or bin (zero / one / many) before modelling.
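
The zero / one / many binning can be sketched as a small mapping function. The bin labels and the name `bin_count` are hypothetical, and the sample list is illustrative apart from the reported maximum of 2369.

```python
from collections import Counter


def bin_count(n):
    """Map a non-negative count to 'zero', 'one', or 'many'."""
    if n < 0:
        raise ValueError(f"count cannot be negative: {n}")
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    return "many"


# Illustrative values only; 2369 is the reported maximum.
sample = [0, 0, 0, 1, 2, 17, 2369]
binned = [bin_count(n) for n in sample]
print(Counter(binned))
```

Three bins keep the distinction between a language with a single recorded dialect and one with several, which a plain zero vs non-zero flag would lose.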

anthropic:claude-opus-4-7 · confidence high

Out[53]:

saturn.columns["child_dialect_count"].stats

stat	value
n	23,740
nulls	0 (0.0%)
unique	164
min	0
max	2,369
mean	3.389
median	0
std	36.8
q1	0
q3	1
iqr	1
skew	42.22
kurtosis	2159
n_outliers	4,272
outlier_rate	0.1799
zero_rate	0.7442
alert: high_skew	skew=+42.22
alert: outliers	18.0% rows beyond 1.5 IQR

Fig 20.

Distribution of child_dialect_count. Vertical dash marks the median.

Histogram bins for child_dialect_count (median: 0.0).
bin	count
0 – 59.23	23575
59.23 – 118.5	99
118.5 – 177.7	24
177.7 – 236.9	18
236.9 – 296.1	4
296.1 – 355.4	2
355.4 – 414.6	0
414.6 – 473.8	1
473.8 – 533	1
533 – 592.2	1
592.2 – 651.5	1
651.5 – 710.7	2
710.7 – 769.9	0
769.9 – 829.1	1
829.1 – 888.4	0
888.4 – 947.6	3
947.6 – 1007	0
1007 – 1066	0
1066 – 1125	1
1125 – 1184	1
1184 – 1244	0
1244 – 1303	0
1303 – 1362	1
1362 – 1421	0
1421 – 1481	0
1481 – 1540	0
1540 – 1599	1
1599 – 1658	0
1658 – 1718	0
1718 – 1777	0
1777 – 1836	1
1836 – 1895	1
1895 – 1954	0
1954 – 2014	0
2014 – 2073	0
2073 – 2132	0
2132 – 2191	0
2191 – 2251	1
2251 – 2310	0
2310 – 2369	1

country_ids categorical feature

This column holds ISO-style country codes (PG, ID, NG, AU, IN…) as a categorical feature with 680 distinct values across 23,740 rows. Coverage is poor: 64.24% of rows are null, and the non-null distribution is broad (entropy 6.49, ratio 0.69), with Papua New Guinea leading at only 10.29% of non-null rows (3.7% of all rows). The 680 distinct values far exceed the ~250 real countries, suggesting multi-value concatenations or non-standard tokens behind the plural 'country_ids'.

Treatment: Split multi-code entries, normalise to ISO-3166, and impute or flag the 64% missing before one-hot or target encoding.
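
A minimal sketch of the split step, assuming multi-country cells separate codes with whitespace, commas, or semicolons. The delimiter is not confirmed by the profile, so inspect a few raw cells first. `split_country_ids`, `country_counts`, and the sample cells are hypothetical; tokens that are not two ASCII letters are collected separately instead of being discarded.

```python
import re
from collections import Counter

ALPHA2_RE = re.compile(r"[A-Z]{2}")


def split_country_ids(cell):
    """Split one cell into (valid_codes, rejected_tokens), deduplicated in order."""
    if cell is None:
        return [], []
    tokens = [t.upper() for t in re.split(r"[\s,;]+", cell.strip()) if t]
    tokens = list(dict.fromkeys(tokens))  # drop repeats within a single row
    valid = [t for t in tokens if ALPHA2_RE.fullmatch(t)]
    rejected = [t for t in tokens if not ALPHA2_RE.fullmatch(t)]
    return valid, rejected


def country_counts(cells):
    """Count how many rows mention each country code."""
    counts = Counter()
    for cell in cells:
        valid, _ = split_country_ids(cell)
        counts.update(valid)
    return counts


# Illustrative cells only; the combinations are not taken from the file.
cells = ["PG", "PG ID", "", None, "NG CM", "xx1"]
counts = country_counts(cells)
print(counts)
```

Counting per code after splitting gives a per-country tally that the raw 680-value categorical cannot, since a combined cell such as two codes together is otherwise treated as its own category.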

anthropic:claude-opus-4-7 · confidence high

Out[56]:

saturn.columns["country_ids"].stats

stat	value
n	23,740
nulls	15,250 (64.2%)
unique	680
top_value	PG
top_rate	0.1029
cardinality	680
entropy	6.493
entropy_ratio	0.6901
alert: null_rate	64.2% null

Fig 21.

Top values for country_ids.
