saturn

/home/coolhand/html/datavis/data_trove/data/linguistic/glottolog_languoid.csv · 23,740 rows · sample n=23,740 · seed 42 · 2026-06-22T00:19:32+00:00

Overview

Source	/home/coolhand/html/datavis/data_trove/data/linguistic/glottolog_languoid.csv
Total rows	23,740
Profiled sample	23,740
Columns	16
Generated	2026-06-22T00:19:32+00:00

Per-column null rate across the corpus.
column	kind	null %
id	text	0.0%
family_id	categorical	1.8%
parent_id	text	1.8%
name	text	0.0%
bookkeeping	categorical	0.0%
level	categorical	0.0%
status	categorical	0.0%
latitude	numeric	66.5%
longitude	numeric	66.5%
iso639P3code	text	66.4%
description	unknown	0.0%
markup_description	unknown	0.0%
child_family_count	numeric	0.0%
child_language_count	numeric	0.0%
child_dialect_count	numeric	0.0%
country_ids	categorical	64.2%
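
The null rates above can be reproduced with a short standard-library sketch. The function name `null_rates` and the rule that an empty cell counts as null are assumptions for illustration; this report does not document how saturn itself detects nulls.

```python
import csv
import io


def null_rates(csv_text: str) -> dict[str, float]:
    """Return the share of empty cells per column (assumption: empty string means null)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = {col: 0 for col in columns}
    total = 0
    for row in reader:
        total += 1
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    if total == 0:
        return {col: 0.0 for col in columns}
    return {col: missing[col] / total for col in columns}


# Tiny hypothetical sample in the shape of the languoid table (not real rows).
SAMPLE = """id,latitude,iso639P3code
aaaa0001,1.5,aaa
aaaa0002,,
aaaa0003,,
aaaa0004,-3.0,
"""

rates = null_rates(SAMPLE)
for column, rate in rates.items():
    print(f"{column}\t{rate:.1%}")
```

For the real file, the same function could be fed the text of glottolog_languoid.csv instead of the inline sample.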

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset high anthropic:default

This dataset is a catalogue of the world's languoids from Glottolog, covering 23,740 entries that span dialects (10,920), languages (8,481), and language families (4,339). The most striking pattern is in endangerment status: the majority (18,965) are marked 'safe', while 4,775 entries are vulnerable, endangered, or extinct, which is worth examining against family and geographic distribution. A second area of interest is the highly skewed child-count columns: most languoids have no children (74.4% have zero child dialects, 81.7% zero child languages, 90.8% zero child families), but a handful of nodes have hundreds or even thousands of descendants, suggesting a very uneven tree structure. Geographic coverage is also incomplete, with latitude and longitude missing for 66.5% of rows, limiting spatial analysis to the remaining 7,943 rows.

iso639P3code high anthropic:default

This column contains ISO 639-3 language codes: standardised three-letter identifiers for individual languages (e.g., 'aiz', 'kbt', 'mij'). Every non-null value is exactly 3 characters long (min=3, max=3, mean=3.0) and appears only once, giving 7,968 unique codes with zero duplicates, consistent with a reference table of distinct languages. The 66.44% null rate means two-thirds of the 23,740 rows carry no code. Much of this is probably structural rather than a data-quality gap: ISO 639-3 identifies individual languages, and 15,259 rows are dialects or families according to the level column, close to the 15,772 nulls.

parent_id high anthropic:default

This column contains the Glottolog identifier (glottocode) of each row's parent node in the classification tree: fixed 8-character codes (e.g. 'book1242', 'pidg1258'). Every value is exactly 8 characters long (len_min=8, len_max=8) and all are single tokens (one_word_rate=1.0), consistent with the glottocode format. The duplicate rate is high at 68.5% (15,973 of the 23,311 non-null values are repeats of one of the 7,338 distinct parents), with 'book1242' appearing 399 times, which is expected for a hierarchical classification where many children share a parent. The null rate is low at 1.81% (429 rows); these are presumably top-level nodes with no parent, since family_id is null for the same number of rows.
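
Whether every parent actually resolves can be checked directly. The sketch below is a minimal illustration, not saturn's method: `check_tree` is a hypothetical helper, the rows are made-up toy data, and a missing parent_id is assumed to mark a root node.

```python
from collections import Counter


def check_tree(rows: list[dict]) -> tuple[list[str], Counter]:
    """Return (parent ids that match no row id, direct-child counts per parent)."""
    known_ids = {row["id"] for row in rows}
    children = Counter()
    unresolved = []
    for row in rows:
        parent = row.get("parent_id")
        if not parent:  # None or empty string: treated as a root node
            continue
        if parent not in known_ids:
            unresolved.append(parent)
        children[parent] += 1
    return unresolved, children


# Hypothetical toy rows; the codes are made up for illustration.
ROWS = [
    {"id": "root0001", "parent_id": None},
    {"id": "lang0001", "parent_id": "root0001"},
    {"id": "lang0002", "parent_id": "root0001"},
    {"id": "dial0001", "parent_id": "lang0001"},
    {"id": "orph0001", "parent_id": "gone0001"},
]

unresolved, children = check_tree(ROWS)
print(unresolved)
print(children.most_common(1))
```

An empty `unresolved` list on the real data would confirm that parent_id is a clean foreign key into id.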

id high anthropic:default

This column is the languoid identifier, following the Glottolog 8-character code pattern (4 letters + 4 digits, e.g. 'tibe1272', 'east2553'). All 23,740 values are exactly 8 characters long (len_min, len_max, len_mean, and len_median all equal 8) with zero nulls and zero duplicates, so it can serve as a primary key. The sample values match Glottolog languoid codes. No anomalies were found; the column is unusually clean.
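
The format and uniqueness claims are easy to re-check. The sketch below uses the pattern as described in this report (4 lowercase letters + 4 digits); the helper names are hypothetical, and the pattern may need loosening if real glottocodes allow digits among the first four characters.

```python
import re

# Pattern as described in this report: 4 lowercase letters + 4 digits.
# Assumption: loosen to [a-z0-9]{4} if valid ids turn out to be rejected.
GLOTTOCODE = re.compile(r"[a-z]{4}[0-9]{4}")


def invalid_ids(ids: list[str]) -> list[str]:
    """Return the ids that do not match the expected 8-character pattern."""
    return [value for value in ids if not GLOTTOCODE.fullmatch(value)]


def is_primary_key(ids: list[str]) -> bool:
    """True when no id repeats."""
    return len(set(ids)) == len(ids)


# First three sample values from this report, plus one deliberately bad value.
ids = ["abbe1238", "sanm1298", "thur1255", "bad"]
print(invalid_ids(ids))
print(is_primary_key(ids))
```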

name high anthropic:default

This column contains languoid names covering families, languages, and dialects, as suggested by top words such as 'nuclear', 'central', 'western', 'eastern', 'northern', 'southern', 'language', and 'sign'. All 23,740 values are unique with zero nulls, so the name can act as a human-readable identifier alongside id. 69.5% of values are single words (mean length ~9.95 characters, median 8); the rest are multi-word descriptors of up to 58 characters, with a vocabulary of 17,915 distinct tokens across the corpus.

child_dialect_count high anthropic:default

This column counts the child dialects under each languoid. The distribution is extreme: 74.4% of rows have zero child dialects, the median is 0, and the IQR is just 0–1, yet the mean is 3.39 and the maximum reaches 2,369, producing a skew of 42.2 and a kurtosis of 2,159.3. Nearly 18% of rows (4,272) are flagged as outliers, but with q3 = 1 and IQR = 1 the 1.5 IQR fence sits at 2.5, so any row with three or more child dialects is flagged. The picture is a small number of hub nodes with enormous fan-out and a large majority of leaf nodes.

child_family_count high anthropic:default

This column counts the child families (subgroups) under each languoid, not family members in any demographic sense. 90.8% of values are exactly zero and the median is 0.0, meaning the vast majority of languoids have no subfamilies beneath them. The distribution is extremely right-skewed (skew = 44.4, kurtosis = 2,352.9) with a max of 859 and 2,179 flagged outliers (9.2% of rows). Because the IQR is 0, every non-zero value is flagged, so the flagged rows are simply the rows with at least one child family (outlier_rate 0.092 = 1 - zero_rate 0.908). The largest counts are plausible for the top nodes of large families and are not, on their own, evidence of data-entry errors.

child_language_count high anthropic:default

This column counts the child languages under each languoid. The distribution is extremely right-skewed (skew = 41.86, kurtosis = 2,115.08): 81.7% of rows have a value of zero, the median is 0.0, and the IQR is 0.0, yet the mean is ~2.0 and the max reaches 1,435. A small number of parent nodes carry very large child counts, while most entries are leaf nodes with no child languages. Over 18% of rows (4,339) are flagged as outliers; because the IQR is 0, every non-zero value is flagged, and 4,339 is exactly the number of family-level rows, which suggests that families are the only rows with child languages.
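
The outlier count is easier to interpret with the 1.5 IQR rule written out. The sketch below is an illustration under assumptions: `iqr_outliers` is a hypothetical helper, the data is a toy list, and the "inclusive" quantile method is chosen for the example because this report does not say which method saturn uses.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Toy counts shaped like the child-count columns: mostly zeros, a few hubs.
counts = [0] * 8 + [3, 120]
flagged = iqr_outliers(counts)
print(flagged)
```

With more than three quarters of the values at zero, q1 and q3 are both 0, the fences collapse to 0, and every non-zero value is flagged, regardless of how large it is.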

latitude high anthropic:default

This column contains geographic latitude, spanning from -55.27° (roughly the latitude of southern South America) to 73.14° (Arctic latitudes), consistent with global coverage. The main issue is a 66.54% null rate: two-thirds of rows lack a latitude, which limits spatial analysis to 7,943 rows. The distribution is mildly right-skewed (skew 0.54) with a mean of ~8.2° and median of ~6.3°, and half of the values fall between -5.1° and 19.3°, a concentration in the tropics. Only 129 outliers are flagged (1.6% of non-null values), so the populated values appear geographically plausible.

longitude high anthropic:default

This column represents geographic longitude, covering nearly the full global range from -178.785° to 179.306°. The null rate of 66.54% matches latitude exactly (15,797 rows), which suggests coordinates are present or absent as a pair. The IQR of ~117° and std of ~81° confirm that values are spread across both hemispheres. The mean (51.27°) and median (47.72°) sit in the Eastern Hemisphere, but they do not mark a peak: the histogram is multimodal, with its largest bins at 9–18° E (751) and 134–152° E (1,265 across two bins), while 45–63° E is comparatively sparse. Only 13 outliers are flagged despite the wide range, consistent with legitimate global coordinates rather than data-entry errors.
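
Since two-thirds of rows lack coordinates, any spatial work starts with a filtering step. The sketch below shows one way to do it; `parse_coordinate` is a hypothetical helper, and treating an empty cell as missing is an assumption about how the CSV encodes nulls.

```python
from typing import Optional


def parse_coordinate(lat_text: str, lon_text: str) -> Optional[tuple[float, float]]:
    """Return (lat, lon) as floats, or None when either is missing or out of range."""
    if not lat_text or not lon_text:
        return None
    try:
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return None
    # NaN fails both comparisons, so it is rejected here as well.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon


# Hypothetical cell values, not taken from the dataset.
print(parse_coordinate("6.3", "47.7"))
print(parse_coordinate("", "47.7"))
print(parse_coordinate("95", "10"))
```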

bookkeeping high anthropic:default

This column is a boolean flag stored as the strings 'True' and 'False'. The 399 'True' rows match the 399 rows whose family_id is 'book1242', which suggests the flag marks languoids filed under Glottolog's Bookkeeping pseudo-family, that is, retired or superseded entries kept for reference. The distribution is severely imbalanced: 98.3% of rows are 'False' (23,341) versus 1.7% 'True' (399), which triggered the imbalance alert. The entropy of 0.123 bits (out of a possible 1.0 for two classes) confirms that the flag carries little information. Analysts should usually exclude the 'True' rows or handle them separately.
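
Because the flag is stored as text, it needs explicit parsing before filtering: in Python, `bool("False")` is `True`. The sketch below is a minimal illustration with a hypothetical helper and made-up rows.

```python
def is_bookkeeping(row: dict) -> bool:
    """True when the bookkeeping cell holds the string 'True' (case-insensitive)."""
    return str(row.get("bookkeeping", "")).strip().lower() == "true"


# Hypothetical toy rows; the ids are made up.
ROWS = [
    {"id": "aaaa0001", "bookkeeping": "False"},
    {"id": "aaaa0002", "bookkeeping": "True"},
    {"id": "aaaa0003", "bookkeeping": "False"},
]

active = [row for row in ROWS if not is_bookkeeping(row)]
print([row["id"] for row in active])
```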

country_ids medium anthropic:default

This column contains ISO 3166-1 alpha-2 country codes, functioning as a geographic identifier for each record. Its null rate is 64.24%, meaning nearly two-thirds of the 23,740 rows carry no country information. Among populated rows, Papua New Guinea ('PG') leads at 10.29% of non-null values (874 of 8,490), followed by Indonesia ('ID') and Nigeria ('NG'), all countries with very high linguistic diversity. The 680 unique values exceed the roughly 250 two-letter codes that exist, suggesting that some values hold more than one code, most likely delimited lists for languoids spoken in several countries; malformed or non-standard codes cannot be ruled out without inspecting the raw values.
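
If the values do turn out to be multi-code lists, per-country counts need a split step first. The sketch below assumes whitespace-separated codes, which is an assumption to verify against the raw column, and `country_counts` is a hypothetical helper run on toy values.

```python
from collections import Counter
from typing import Iterable, Optional


def country_counts(values: Iterable[Optional[str]]) -> Counter:
    """Count languoids per country code (assumption: codes are whitespace-separated)."""
    counts = Counter()
    for value in values:
        if not value:  # skip None and empty cells
            continue
        # set() guards against a code repeated within one cell.
        counts.update(set(value.split()))
    return counts


# Hypothetical toy cells, not real rows.
cells = ["PG", "ID PG", None, "", "NG"]
counts = country_counts(cells)
print(counts.most_common())
```

After splitting, the number of distinct keys should fall to at most the number of assigned ISO 3166-1 alpha-2 codes if the multi-code explanation is right.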

description low anthropic:default

This column is fully populated (zero nulls across 23,740 rows), but saturn classified its kind as 'unknown' and reports 'no profiler for kind=unknown', so no uniqueness, length, or token statistics are available. The column's exact nature (short labels vs. long free text) cannot be determined from the evidence.

markup_description low anthropic:default

Judging by its name, this column presumably holds a marked-up version of the description text (HTML, Markdown, or similar). It is fully populated across 23,740 rows, but saturn reports 'no profiler for kind=unknown', so no uniqueness, length, or content statistics are available, and the content type and cardinality remain unknown without further inspection.

level high anthropic:default

This column classifies each languoid into one of three hierarchical levels: 'dialect', 'language', and 'family'. With only 3 unique values across 23,740 rows and zero nulls, it is a clean, fully populated categorical field. 'dialect' is the modal class at 46.0% (10,920), followed by 'language' at 35.7% (8,481) and 'family' at 18.3% (4,339), so the class imbalance is moderate. The entropy ratio of 0.943 (1.494 bits out of a maximum of log2(3) ≈ 1.585) reflects this: the distribution is fairly close to uniform without being uniform.
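
The entropy figures can be recomputed from the class counts. The sketch below assumes Shannon entropy in bits, normalised by log2 of the number of classes; that definition is inferred from the reported numbers rather than documented, and the function names are hypothetical.

```python
import math


def entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: list[int]) -> float:
    """Entropy divided by its maximum, log2(number of non-empty classes)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Counts from the level table in this report: dialect, language, family.
level_counts = [10920, 8481, 4339]
print(round(entropy_bits(level_counts), 3))
print(round(entropy_ratio(level_counts), 3))
```

A ratio of 1.0 would mean three equally sized classes; the same functions applied to the bookkeeping counts (23,341 and 399) give a ratio near 0.123.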

status high anthropic:default

This column is an endangerment status classification for languoids, with 6 ordinal categories ranging from 'safe' to 'extinct'. The distribution is heavily skewed: 'safe' accounts for 79.9% of records (18,965 of 23,740), while the five other categories together account for 20.1% (4,775). The low entropy ratio (0.445) confirms this imbalance, and analysts building classification models should expect a significant class-imbalance problem.

family_id high anthropic:default

This column represents language family identifiers, using Glottolog-style codes (e.g., 'atla1278' = Atlantic-Congo, 'aust1307' = Austronesian, 'indo1319' = Indo-European). With only 287 unique values across 23,740 rows it acts as a low-to-medium cardinality grouping key for linguistic families. The top value 'atla1278' accounts for 4,663 records: 20.0% of the 23,311 non-null rows (the top_rate figure) or 19.6% of all rows (the share shown in the table). The top two families together account for roughly 36% of the dataset, a heavy skew toward large African and Pacific language families.
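
The report shows two figures for 'atla1278': a top_rate of 0.200 and a table share of 19.6%. A quick check suggests they differ only in the denominator. The helper below is hypothetical; the inputs are the counts reported for family_id.

```python
def shares(count: int, total_rows: int, null_rows: int) -> tuple[float, float]:
    """Return (share of all rows, share of non-null rows) for one value."""
    non_null = total_rows - null_rows
    return count / total_rows, count / non_null


# atla1278: 4,663 rows out of 23,740, with 429 null family_id values.
all_share, non_null_share = shares(4663, 23740, 429)
print(f"share of all rows: {all_share:.1%}")
print(f"share of non-null rows: {non_null_share:.1%}")
```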

Numeric correlation

Pearson correlation across 5 numeric columns (values rounded to 2 decimals).
column	latitude	longitude	child_family_count	child_language_count	child_dialect_count
latitude	+1.00	-0.31	+0.03	+0.01	+0.06
longitude	-0.31	+1.00	-0.03	-0.04	-0.05
child_family_count	+0.03	-0.03	+1.00	+0.96	+0.74
child_language_count	+0.01	-0.04	+0.96	+1.00	+0.69
child_dialect_count	+0.06	-0.05	+0.74	+0.69	+1.00
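
For reference, a Pearson coefficient over columns with missing values can be sketched as follows. Dropping a pair when either value is missing is an assumption, since this report does not state how saturn treats nulls here, and `pearson` is a hypothetical helper.

```python
import math
from typing import Optional, Sequence


def pearson(xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]) -> float:
    """Pearson r over the pairs where both values are present (pairwise deletion)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Toy columns; None stands in for a missing cell.
print(pearson([1, 2, 3, None, 5], [2, 4, 6, 8, 10]))
```

With columns as skewed as the child counts, a few very large nodes can dominate the coefficient, so the +0.96 between child_family_count and child_language_count is best read alongside the histograms.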

id text

100.0% of rows are unique strings; 100.0% of rows are a single word; 95th-percentile length under 20 chars

rows	23,740
null	0 (0.0%)
unique	23,740
len_min	8
len_max	8
len_mean	8.000
len_median	8.000
len_p95	8.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	20,000
readability_flesch_mean	86.111
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Character-length distribution for id (mean: 8.0).
chars	count
8	23740

Sample values (first 10)

abbe1238
sanm1298
thur1255
suar1238
kukn1238
yagu1244
labu1252
uist1237
suku1258
arak1251

family_id categorical

rows	23,740
null	429 (1.8%)
unique	287
top_value	atla1278
top_rate	0.200
cardinality	287
entropy	4.886
entropy_ratio	0.598

Top values for family_id (20 unique shown, of 287 total). Share is the fraction of all 23,740 rows, nulls included.
value	count	share
atla1278	4663	19.6%
aust1307	3850	16.2%
indo1319	2201	9.3%
sino1245	1666	7.0%
afro1255	1259	5.3%
nucl1709	762	3.2%
pama1250	598	2.5%
aust1305	503	2.1%
book1242	399	1.7%
otom1299	338	1.4%
mand1469	303	1.3%
sign1238	259	1.1%
drav1251	255	1.1%
cent2225	251	1.1%
turk1311	229	1.0%
taik1256	223	0.9%
nilo1247	201	0.8%
ural1272	185	0.8%
japo1237	179	0.8%
tupi1275	157	0.7%

parent_id text

100.0% of rows are a single word; 95th-percentile length under 20 chars; 68.5% duplicate strings

rows	23,740
null	429 (1.8%)
unique	7,338
len_min	8
len_max	8
len_mean	8.000
len_median	8.000
len_p95	8.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	15,973
duplicate_rate	0.685
vocab_size	7,189
readability_flesch_mean	91.187
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Character-length distribution for parent_id (mean: 8.0).
chars	count
8	23311

Sample values (first 10)

abee1242
cent2144
yuag1237
mnon1258
pama1253
raic1241
kenh1234
uygh1240
taih1244
book1242

name text

100.0% of rows are unique strings; 69.5% of rows are a single word

rows	23,740
null	0 (0.0%)
unique	23,740
len_min	1
len_max	58
len_mean	9.950
len_median	8.000
len_p95	22.000
word_mean	1.398
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	17,915
readability_flesch_mean	42.625
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.695
allcaps_rate	1.68e-04
boilerplate_rate	0.000

Character-length distribution for name (mean: 9.95).
chars	count
1 – 2	54
2 – 4	496
4 – 5	4648
5 – 7	3145
7 – 8	4510
8 – 10	1373
10 – 11	1069
11 – 12	1997
12 – 14	1023
14 – 15	1768
15 – 17	644
17 – 18	848
18 – 20	340
20 – 21	281
21 – 22	518
22 – 24	224
24 – 25	298
25 – 27	101
27 – 28	140
28 – 30	46
30 – 31	44
31 – 32	69
32 – 34	24
34 – 35	34
35 – 37	12
37 – 38	9
38 – 39	4
39 – 41	2
41 – 42	3
42 – 44	3
44 – 45	5
45 – 47	1
47 – 48	2
48 – 49	1
49 – 51	0
51 – 52	2
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	2

Sample values (first 10)

Abbey-Ve
San Martín Itunyoso Triqui
Thuri
Asabano
Kukna
Yagua
Labuan
Uis Tasae
Sukurase
Araki (Iran)

bookkeeping categorical

top value is 98.3% of rows

rows	23,740
null	0 (0.0%)
unique	2
top_value	False
top_rate	0.983
cardinality	2
entropy	0.123
entropy_ratio	0.123

Top values for bookkeeping (2 unique shown, of 2 total).
value	count	share
False	23341	98.3%
True	399	1.7%

level categorical

rows	23,740
null	0 (0.0%)
unique	3
top_value	dialect
top_rate	0.460
cardinality	3
entropy	1.494
entropy_ratio	0.943

Top values for level (3 unique shown, of 3 total).
value	count	share
dialect	10920	46.0%
language	8481	35.7%
family	4339	18.3%

status categorical

rows	23,740
null	0 (0.0%)
unique	6
top_value	safe
top_rate	0.799
cardinality	6
entropy	1.150
entropy_ratio	0.445

Top values for status (6 unique shown, of 6 total).
value	count	share
safe	18965	79.9%
definitely endangered	1814	7.6%
vulnerable	1194	5.0%
extinct	889	3.7%
critically endangered	465	2.0%
severely endangered	413	1.7%

latitude numeric

66.5% null

rows	23,740
null	15,797 (66.5%)
unique	7,798
min	-55.275
max	73.135
mean	8.170
median	6.306
std	18.962
q1	-5.137
q3	19.336
iqr	24.472
skew	0.540
kurtosis	0.301
n_outliers	129
outlier_rate	0.016
zero_rate	0.000

Histogram bins for latitude (median: 6.30619).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	48
-26.38 – -23.17	78
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	281
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	379
2.51 – 5.72	469
5.72 – 8.93	664
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	387
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	373
28.19 – 31.4	167
31.4 – 34.61	144
34.61 – 37.82	179
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	78
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6

longitude numeric

66.5% null

rows	23,740
null	15,797 (66.5%)
unique	7,757
min	-178.785
max	179.306
mean	51.270
median	47.724
std	81.138
q1	7.235
q3	124.122
iqr	116.887
skew	-0.483
kurtosis	-0.774
n_outliers	13
outlier_rate	1.64e-03
zero_rate	0.000

Histogram bins for longitude (median: 47.7236).
bin	count
-178.8 – -169.8	13
-169.8 – -160.9	4
-160.9 – -151.9	10
-151.9 – -143	11
-143 – -134	10
-134 – -125.1	17
-125.1 – -116.1	124
-116.1 – -107.2	47
-107.2 – -98.21	78
-98.21 – -89.26	280
-89.26 – -80.31	59
-80.31 – -71.36	235
-71.36 – -62.41	218
-62.41 – -53.45	150
-53.45 – -44.5	60
-44.5 – -35.55	40
-35.55 – -26.6	0
-26.6 – -17.64	4
-17.64 – -8.692	105
-8.692 – 0.2605	275
0.2605 – 9.213	444
9.213 – 18.17	751
18.17 – 27.12	322
27.12 – 36.07	430
36.07 – 45.02	228
45.02 – 53.97	126
53.97 – 62.93	35
62.93 – 71.88	80
71.88 – 80.83	210
80.83 – 89.78	208
89.78 – 98.74	269
98.74 – 107.7	457
107.7 – 116.6	239
116.6 – 125.6	502
125.6 – 134.5	316
134.5 – 143.5	598
143.5 – 152.4	667
152.4 – 161.4	123
161.4 – 170.4	186
170.4 – 179.3	12

iso639P3code text

100.0% of rows are unique strings; 100.0% of rows are a single word; 66.4% null; 95th-percentile length under 20 chars

rows	23,740
null	15,772 (66.4%)
unique	7,968
len_min	3
len_max	3
len_mean	3.000
len_median	3.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	0
duplicate_rate	0.000
vocab_size	7,968
readability_flesch_mean	119.528
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Character-length distribution for iso639P3code (mean: 3.0).
chars	count
3	7968

description unknown

no profiler for kind=unknown

rows	23,740
null	0 (0.0%)

markup_description unknown

no profiler for kind=unknown

rows	23,740
null	0 (0.0%)

child_family_count numeric

skew=+44.40; 9.2% of rows beyond 1.5 IQR

rows	23,740
null	0 (0.0%)
unique	88
min	0.000
max	859.000
mean	0.879
median	0.000
std	13.204
q1	0.000
q3	0.000
iqr	0.000
skew	44.398
kurtosis	2,353
n_outliers	2,179
outlier_rate	0.092
zero_rate	0.908

Histogram bins for child_family_count (median: 0.0).
bin	count
0 – 21.48	23588
21.48 – 42.95	93
42.95 – 64.43	27
64.43 – 85.9	6
85.9 – 107.4	4
107.4 – 128.9	3
128.9 – 150.3	2
150.3 – 171.8	2
171.8 – 193.3	1
193.3 – 214.8	1
214.8 – 236.2	0
236.2 – 257.7	0
257.7 – 279.2	1
279.2 – 300.7	2
300.7 – 322.1	1
322.1 – 343.6	1
343.6 – 365.1	0
365.1 – 386.6	0
386.6 – 408	2
408 – 429.5	1
429.5 – 451	0
451 – 472.5	0
472.5 – 493.9	0
493.9 – 515.4	0
515.4 – 536.9	0
536.9 – 558.4	0
558.4 – 579.8	1
579.8 – 601.3	0
601.3 – 622.8	0
622.8 – 644.2	0
644.2 – 665.7	0
665.7 – 687.2	0
687.2 – 708.7	2
708.7 – 730.2	0
730.2 – 751.6	0
751.6 – 773.1	0
773.1 – 794.6	0
794.6 – 816.1	1
816.1 – 837.5	0
837.5 – 859	1

child_language_count numeric

skew=+41.86; 18.3% of rows beyond 1.5 IQR

rows	23,740
null	0 (0.0%)
unique	126
min	0.000
max	1,435
mean	1.996
median	0.000
std	23.408
q1	0.000
q3	0.000
iqr	0.000
skew	41.859
kurtosis	2,115
n_outliers	4,339
outlier_rate	0.183
zero_rate	0.817

Histogram bins for child_language_count (median: 0.0).
bin	count
0 – 35.88	23547
35.88 – 71.75	121
71.75 – 107.6	37
107.6 – 143.5	6
143.5 – 179.4	3
179.4 – 215.2	4
215.2 – 251.1	3
251.1 – 287	2
287 – 322.9	2
322.9 – 358.8	0
358.8 – 394.6	1
394.6 – 430.5	1
430.5 – 466.4	0
466.4 – 502.2	1
502.2 – 538.1	1
538.1 – 574	2
574 – 609.9	1
609.9 – 645.8	0
645.8 – 681.6	0
681.6 – 717.5	1
717.5 – 753.4	2
753.4 – 789.2	0
789.2 – 825.1	0
825.1 – 861	0
861 – 896.9	0
896.9 – 932.8	0
932.8 – 968.6	0
968.6 – 1004	1
1004 – 1040	0
1040 – 1076	0
1076 – 1112	0
1112 – 1148	0
1148 – 1184	0
1184 – 1220	0
1220 – 1256	0
1256 – 1292	2
1292 – 1327	0
1327 – 1363	0
1363 – 1399	1
1399 – 1435	1

child_dialect_count numeric

skew=+42.22; 18.0% of rows beyond 1.5 IQR

rows	23,740
null	0 (0.0%)
unique	164
min	0.000
max	2,369
mean	3.389
median	0.000
std	36.799
q1	0.000
q3	1.000
iqr	1.000
skew	42.219
kurtosis	2,159
n_outliers	4,272
outlier_rate	0.180
zero_rate	0.744

Histogram bins for child_dialect_count (median: 0.0).
bin	count
0 – 59.23	23575
59.23 – 118.5	99
118.5 – 177.7	24
177.7 – 236.9	18
236.9 – 296.1	4
296.1 – 355.4	2
355.4 – 414.6	0
414.6 – 473.8	1
473.8 – 533	1
533 – 592.2	1
592.2 – 651.5	1
651.5 – 710.7	2
710.7 – 769.9	0
769.9 – 829.1	1
829.1 – 888.4	0
888.4 – 947.6	3
947.6 – 1007	0
1007 – 1066	0
1066 – 1125	1
1125 – 1184	1
1184 – 1244	0
1244 – 1303	0
1303 – 1362	1
1362 – 1421	0
1421 – 1481	0
1481 – 1540	0
1540 – 1599	1
1599 – 1658	0
1658 – 1718	0
1718 – 1777	0
1777 – 1836	1
1836 – 1895	1
1895 – 1954	0
1954 – 2014	0
2014 – 2073	0
2073 – 2132	0
2132 – 2191	0
2191 – 2251	1
2251 – 2310	0
2310 – 2369	1

country_ids categorical

64.2% null

rows	23,740
null	15,250 (64.2%)
unique	680
top_value	PG
top_rate	0.103
cardinality	680
entropy	6.493
entropy_ratio	0.690

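Top values for country_ids (20 unique shown, of 680 total). Share is the fraction of all 23,740 rows, nulls included; top_rate (0.103) is the fraction of the 8,490 non-null rows.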
value	count	share
PG	874	3.7%
ID	695	2.9%
NG	480	2.0%
AU	432	1.8%
IN	356	1.5%
MX	297	1.3%
CN	271	1.1%
BR	263	1.1%
US	247	1.0%
CM	196	0.8%
PH	177	0.7%
CD	156	0.7%
VU	118	0.5%
SD	99	0.4%
PE	97	0.4%
TZ	93	0.4%
MY	90	0.4%
TD	88	0.4%
RU	83	0.3%
CO	82	0.3%
