parquet languages

saturn notebook · generated 2026-05-01 Report Notebook

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/languages.parquet

Saturn profiled 19,401 rows across 11 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/languages.parquet",
    "--findings", "parquet-languages.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogues 19,401 world languages, each identified by a unique Glottocode and name, with attributes such as geographic coordinates, macroarea, language family, ISO code, and phoneme count. Two things stand out for closer inspection. First, phoneme_count is missing for 88.8% of rows and is heavily right-skewed (mean ~38, median 34, max 231), so any analysis of phonological inventories rests on a subsample of 2,173 languages with notable outliers. Second, latitude and longitude are null for 59.1% of rows, which limits mapping coverage. On the categorical side, macroarea covers six regions, with Africa the largest at 30.7% of all rows (32.1% of rows that have a macroarea), while the status column is uninformative because every language is labelled 'living'.

citing: phoneme_count · latitude · longitude · macroarea · status · name · glottocode

Out[4]:

saturn.schema() · 11 columns

column             kind         n       null%  unique  alerts
glottocode         text         19,401  0.0%   19,401  near_unique, one_word, short_text
iso_639_3          unknown      19,401  0.0%   -       skipped
name               text         19,401  0.0%   19,401  near_unique, one_word
family_name        unknown      19,401  0.0%   -       skipped
family_glottocode  unknown      19,401  0.0%   -       skipped
macroarea          categorical  19,401  4.3%   6       -
latitude           numeric      19,401  59.1%  7,786   null_rate
longitude          numeric      19,401  59.1%  7,745   null_rate
status             categorical  19,401  0.0%   1       imbalance
speakers_count     unknown      19,401  0.0%   -       skipped
phoneme_count      numeric      19,401  88.8%  100     null_rate, high_skew
Fig 1.
macroarea · Shows the geographic distribution of languages, with Africa, Eurasia, and Papunesia accounting for the bulk of entries.
Top values for macroarea (6 unique shown, of 6 total).
value          count  share (of all rows)
Africa         5955   30.7%
Eurasia        5028   25.9%
Papunesia      4847   25.0%
South America  1095   5.6%
North America  1035   5.3%
Australia      602    3.1%
Fig 2.
phoneme_count · Reveals the right-skewed distribution of phoneme inventories; note that this is based on only ~11% of rows due to the high null rate.
Histogram bins for phoneme_count (median: 34.0).
bin              count
11 – 16.5        19
16.5 – 22        208
22 – 27.5        460
27.5 – 33        285
33 – 38.5        348
38.5 – 44        233
44 – 49.5        200
49.5 – 55        115
55 – 60.5        116
60.5 – 66        51
66 – 71.5        45
71.5 – 77        14
77 – 82.5        21
82.5 – 88        14
88 – 93.5        14
93.5 – 99        9
99 – 104.5       2
104.5 – 110      3
110 – 115.5      2
115.5 – 121      3
121 – 126.5      3
126.5 – 132      2
132 – 137.5      1
137.5 – 143      2
143 – 148.5      0
148.5 – 154      1
154 – 159.5      0
159.5 – 165      1
165 – 170.5      0
170.5 – 176      0
176 – 181.5      0
181.5 – 187      0
187 – 192.5      0
192.5 – 198      0
198 – 203.5      0
203.5 – 209      0
209 – 214.5      0
214.5 – 220      0
220 – 225.5      0
225.5 – 231      1
Fig 3.
latitude · Indicates where languages cluster latitudinally, concentrated near the equator and tropics.
Histogram bins for latitude (median: 6.2918).
bin                  count
-55.27 – -52.06      5
-52.06 – -48.85      1
-48.85 – -45.64      1
-45.64 – -42.43      4
-42.43 – -39.22      7
-39.22 – -36.01      16
-36.01 – -32.8       29
-32.8 – -29.59       26
-29.59 – -26.38      47
-26.38 – -23.17      77
-23.17 – -19.96      125
-19.96 – -16.75      141
-16.75 – -13.54      280
-13.54 – -10.33      256
-10.33 – -7.121      495
-7.121 – -3.911      788
-3.911 – -0.7005     681
-0.7005 – 2.51       378
2.51 – 5.72          468
5.72 – 8.93          663
8.93 – 12.14         710
12.14 – 15.35        303
15.35 – 18.56        384
18.56 – 21.77        233
21.77 – 24.98        318
24.98 – 28.19        371
28.19 – 31.4         167
31.4 – 34.61         143
34.61 – 37.82        178
37.82 – 41.03        113
41.03 – 44.24        138
44.24 – 47.45        79
47.45 – 50.66        77
50.66 – 53.87        76
53.87 – 57.08        46
57.08 – 60.29        21
60.29 – 63.5         41
63.5 – 66.71         23
66.71 – 69.93        14
69.93 – 73.14        6
Fig 4.
longitude · Shows longitudinal spread across the globe, useful for spotting regional concentrations.
Histogram bins for longitude (median: 47.565486).
bin                  count
-178.8 – -169.8      13
-169.8 – -160.9      4
-160.9 – -151.9      10
-151.9 – -143        11
-143 – -134          10
-134 – -125.1        17
-125.1 – -116.1      123
-116.1 – -107.2      47
-107.2 – -98.21      78
-98.21 – -89.26      280
-89.26 – -80.31      59
-80.31 – -71.36      235
-71.36 – -62.41      218
-62.41 – -53.45      150
-53.45 – -44.5       60
-44.5 – -35.55       40
-35.55 – -26.6       0
-26.6 – -17.64       4
-17.64 – -8.692      105
-8.692 – 0.2605      275
0.2605 – 9.213       443
9.213 – 18.17        751
18.17 – 27.12        322
27.12 – 36.07        429
36.07 – 45.02        228
45.02 – 53.97        126
53.97 – 62.93        35
62.93 – 71.88        79
71.88 – 80.83        210
80.83 – 89.78        207
89.78 – 98.74        269
98.74 – 107.7        454
107.7 – 116.6        239
116.6 – 125.6        497
125.6 – 134.5        316
134.5 – 143.5        598
143.5 – 152.4        667
152.4 – 161.4        122
161.4 – 170.4        186
170.4 – 179.3        12
Fig 5.
name · Most language names are one or two words; check the long tail for compound or descriptive names.
Character-length distribution for name (mean: 9.211483944126591).
chars      count
1 – 2      52
2 – 4      455
4 – 5      4355
5 – 7      2890
7 – 8      3995
8 – 10     1115
10 – 11    813
11 – 12    1417
12 – 14    724
14 – 15    1264
15 – 17    439
17 – 18    524
18 – 20    214
20 – 21    175
21 – 22    343
22 – 24    143
24 – 25    173
25 – 27    58
27 – 28    86
28 – 30    26
30 – 31    26
31 – 32    43
32 – 34    12
34 – 35    24
35 – 37    11
37 – 38    8
38 – 39    3
39 – 41    1
41 – 42    2
42 – 44    1
44 – 45    3
45 – 47    1
47 – 48    2
48 – 49    1
49 – 51    0
51 – 52    0
52 – 54    0
54 – 55    0
55 – 57    0
57 – 58    2
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column             kind         null %
glottocode         text         0.0%
iso_639_3          unknown      0.0%
name               text         0.0%
family_name        unknown      0.0%
family_glottocode  unknown      0.0%
macroarea          categorical  4.3%
latitude           numeric      59.1%
longitude          numeric      59.1%
status             categorical  0.0%
speakers_count     unknown      0.0%
phoneme_count      numeric      88.8%
Fig 7.
Pearson correlation across 3 numeric columns (sampled, bounded; values clipped to 2 decimals).
               latitude  longitude  phoneme_count
latitude       +1.00     -0.30      +0.05
longitude      -0.30     +1.00      +0.03
phoneme_count  +0.05     +0.03      +1.00
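
The three numeric columns have very different null rates, so a correlation between any two of them can only use rows where both values are present. The following is a minimal sketch of pairwise-complete Pearson correlation using only the standard library. The function name `pearson_pairwise` and the toy values are illustrative assumptions; this is not Saturn's implementation, and the values are not rows from this dataset.

```python
import math

def pearson_pairwise(xs, ys):
    """Pearson r over positions where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # correlation is undefined for fewer than two pairs
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): None marks a missing value.
latitude = [10.0, None, 30.0, 40.0, None]
phoneme_count = [20, 25, None, 50, 31]
r = pearson_pairwise(latitude, phoneme_count)
print(r)
```

Because phoneme_count is present in only 2,173 rows, the phoneme_count entries of the matrix rest on at most that many pairs, which is worth keeping in mind when reading coefficients as small as +0.05 and +0.03.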

glottocode text identifier

This column holds Glottocodes, the standard 8-character identifiers for languages in the Glottolog catalogue (e.g. 'aala1237', 'aari1239'). Every one of the 19,401 rows is unique with length exactly 8 and a single token, and there are no nulls or duplicates, consistent with a primary key over languages. Nothing anomalous: the column is a clean identifier rather than analysable text.

Treatment: Use as the primary key; left-join other language metadata on this code rather than modelling it.

anthropic:claude-opus-4-7 · confidence high
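
The treatment above (use glottocode as the primary key and left-join other metadata onto it) could be sketched as follows. This is an illustration under stated assumptions: the pattern assumes Glottocodes are four lowercase alphanumerics followed by four digits, the helper `left_join_on_glottocode` is hypothetical, and only the two example codes come from the report.

```python
import re

# Assumption: a Glottocode is four lowercase alphanumerics plus four digits.
GLOTTOCODE_RE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")

def left_join_on_glottocode(languages, extra):
    """Left-join `extra` rows onto `languages` by glottocode.

    Both arguments are lists of dicts. Every row in `languages` is kept;
    rows without a match simply gain no new fields.
    """
    bad = [row["glottocode"] for row in languages
           if not GLOTTOCODE_RE.match(row["glottocode"])]
    if bad:
        raise ValueError(f"malformed glottocodes: {bad}")
    keys = [row["glottocode"] for row in languages]
    if len(set(keys)) != len(keys):
        raise ValueError("glottocode is not unique; it cannot serve as a key")
    lookup = {row["glottocode"]: row for row in extra}
    return [{**lang, **lookup.get(lang["glottocode"], {})} for lang in languages]

# The two codes appear in the report; the "source" field is hypothetical.
languages = [{"glottocode": "aala1237"}, {"glottocode": "aari1239"}]
extra = [{"glottocode": "aari1239", "source": "hypothetical-table"}]
joined = left_join_on_glottocode(languages, extra)
print(joined)
```

Checking format and uniqueness before the join keeps a silent many-to-one match from inflating or dropping rows later.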
Out[13]:

saturn.columns["glottocode"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique 19,401
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 19,401
readability_flesch_mean 93.3
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% of rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for glottocode (mean: 8.0).
chars  count
8 – 8  19401 (the 39 other bins, also labelled 8 – 8, are empty)

iso_639_3 unknown metadata

This column is named iso_639_3, which suggests it holds ISO 639-3 three-letter language codes. Saturn skipped profiling (kind=unknown), so no uniqueness or value distribution is available; only the row count of 19,401 and a 0.0% null rate are confirmed. Without cardinality or sample values, the actual contents and their conformance to the ISO 639-3 standard cannot be verified here.

Treatment: Re-profile with explicit string typing to recover cardinality, then use as a categorical language tag.

anthropic:claude-opus-4-7 · confidence low
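
Once the column is read as strings, recovering its cardinality and checking its shape is straightforward. The sketch below assumes the values are already loaded into a Python list. The helper `profile_codes` and the sample values are hypothetical, since the real contents were never profiled. A 0.0% null rate does not rule out empty strings, so those are counted separately.

```python
import re
from collections import Counter

# ISO 639-3 identifiers are three lowercase ASCII letters.
ISO_639_3_RE = re.compile(r"^[a-z]{3}$")

def profile_codes(values):
    """Summarise a column of candidate ISO 639-3 codes read as strings."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    malformed = sorted(code for code in counts if not ISO_639_3_RE.match(code))
    return {
        "n": len(values),
        "n_present": len(present),
        "unique": len(counts),
        "malformed": malformed,
    }

# Hypothetical values for illustration only.
sample = ["aal", "aiw", "", None, "aiw", "AAL"]
summary = profile_codes(sample)
print(summary)
```

If `n_present` turns out well below `n`, the column is probably sparse in practice and should be treated as an optional tag, not a key.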
Out[16]:

saturn.columns["iso_639_3"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

name text identifier

This is a short-text 'name' field with 19,401 entirely unique values (unique equals n) and no nulls or duplicates. Entries are mostly single tokens (71.7% are one word and word_median is 1), averaging 9.2 characters and capping at 58. The top vocabulary (nuclear, sign, language, central, southern, western, northern, eastern) is consistent with language and dialect names that carry directional or sign-language qualifiers, as expected in a language catalogue.

Treatment: Treat as a unique key — join or display only, do not use as a model feature.

anthropic:claude-opus-4-7 · confidence high
Out[18]:

saturn.columns["name"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique 19,401
len_min 1
len_max 58
len_mean 9.211
len_median 7
len_p95 20
word_mean 1.369
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 17,861
readability_flesch_mean 60.53
emoji_rate 0
url_rate 0
one_word_rate 0.7169
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 71.7% of rows are a single word
Fig 9.
Character-length distribution for name.

family_name unknown other

This column is named family_name and was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 19,401 and a null rate of 0.0%. Given the dataset, the name most plausibly refers to the language family of each entry, but without distribution evidence the actual content cannot be confirmed. The only notable signal is the explicit 'skipped' alert, which means downstream consumers have no profile to rely on for this field.

Treatment: Re-run profiling with this column included before deciding how to use it.

anthropic:claude-opus-4-7 · confidence low
Out[21]:

saturn.columns["family_name"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

family_glottocode unknown foreign_key

The column was skipped by the profiler, so beyond a complete absence of nulls across 19,401 rows there is no distributional evidence to draw on. The name family_glottocode points to Glottolog family identifiers (a linguistic taxonomy code), but uniqueness, cardinality, and value distribution are all unknown here.

Treatment: Re-profile with string handling enabled, then left-join to a Glottolog reference table.

anthropic:claude-opus-4-7 · confidence low
Out[23]:

saturn.columns["family_glottocode"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

macroarea categorical feature

Six-valued geographic grouping that assigns each language to a continental-scale macroarea (Africa, Eurasia, Papunesia, South America, North America, Australia). The distribution is uneven: Africa leads with 5,955 rows (30.7% of all rows, 32.1% of non-null rows) while Australia holds only 602, and an entropy ratio of 0.84 indicates moderate rather than extreme imbalance. About 4.3% of rows (839) are null.

Treatment: One-hot encode and impute or flag the 4.3% nulls.

anthropic:claude-opus-4-7 · confidence high
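
The report quotes two different shares for Africa (30.7% in the value table, 0.3208 as top_rate) because they use different denominators. The sketch below recomputes both from the published counts, along with Shannon entropy in bits and its ratio to the maximum for six categories. That Saturn computes entropy in bits over non-null rows is an assumption, made here because it is the convention under which the reported figures can be reproduced.

```python
import math

# Counts from the macroarea table in this report.
counts = {
    "Africa": 5955, "Eurasia": 5028, "Papunesia": 4847,
    "South America": 1095, "North America": 1035, "Australia": 602,
}
n_rows = 19401                    # all rows, including nulls
n_present = sum(counts.values())  # rows that have a macroarea

share_of_all = counts["Africa"] / n_rows
share_of_present = counts["Africa"] / n_present

# Shannon entropy in bits over the non-null distribution (assumed convention).
probs = [c / n_present for c in counts.values()]
entropy = -sum(p * math.log2(p) for p in probs)
entropy_ratio = entropy / math.log2(len(counts))

print(n_rows - n_present, round(share_of_all, 3), round(share_of_present, 4))
print(round(entropy, 3), round(entropy_ratio, 4))
```

Under these assumptions the null count comes out at 839 and the results agree, to rounding, with the reported top_rate of 0.3208, entropy of 2.176, and entropy_ratio of 0.8418.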
Out[25]:

saturn.columns["macroarea"].stats

stat value
n 19,401
nulls 839 (4.3%)
unique 6
top_value Africa
top_rate 0.3208
cardinality 6
entropy 2.176
entropy_ratio 0.8418
Fig 10.
Top values for macroarea.

latitude numeric feature

This is a geographic latitude coordinate, with values spanning -55.2748 to 73.1354 and a median of 6.2918, consistent with degrees north or south of the equator. Just over 59% of rows are null, which is the dominant concern; among present values, the distribution is mildly right-skewed (0.54) and centered slightly north of the equator (mean 8.16). The 7,786 unique values should be read against the 7,929 non-null rows rather than all 19,401: nearly every located language has its own latitude, with few repeats.

Treatment: Pair with longitude for geospatial features; impute or flag missingness given the 59% null rate before modelling.

anthropic:claude-opus-4-7 · confidence high
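
A missingness flag for the coordinates, as the treatment suggests, could look like the sketch below. The helper `add_coordinate_flag` and the example rows are hypothetical; the counts in the middle of the block are taken from the stats table for this column and show how the non-null count is derived.

```python
def add_coordinate_flag(rows):
    """Return copies of rows with a boolean `has_coords` field.

    A row counts as located only when both latitude and longitude are
    present, since a single coordinate cannot be mapped.
    """
    flagged = []
    for row in rows:
        has_coords = (row.get("latitude") is not None
                      and row.get("longitude") is not None)
        flagged.append({**row, "has_coords": has_coords})
    return flagged

# Counts from the latitude stats table: total rows, nulls, distinct values.
n_rows, n_null, n_unique = 19401, 11472, 7786
n_present = n_rows - n_null
print(n_present, round(n_unique / n_present, 3))

# Hypothetical rows for illustration only.
rows = [
    {"glottocode": "aaaa0001", "latitude": 6.29, "longitude": 47.57},
    {"glottocode": "aaaa0002", "latitude": None, "longitude": None},
]
flagged = add_coordinate_flag(rows)
print(flagged)
```

Keeping the flag as its own feature lets a model distinguish "no location recorded" from any imputed position.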
Out[28]:

saturn.columns["latitude"].stats

stat value
n 19,401
nulls 11,472 (59.1%)
unique 7,786
min -55.27
max 73.14
mean 8.164
median 6.292
std 18.96
q1 -5.139
q3 19.27
iqr 24.41
skew 0.5425
kurtosis 0.3048
n_outliers 135
outlier_rate 0.01703
zero_rate 0
alert: null_rate · 59.1% null
Fig 11.
Distribution of latitude. Vertical dash marks the median.

longitude numeric feature

Geographic longitude in decimal degrees, with values spanning -178.785 to 179.306 and a median of 47.57, consistent with a global coordinate range. The distribution is mildly left-skewed (-0.48) and platykurtic (-0.78), with only 13 outliers (0.16% of present values), but 59.13% of rows are null, so location is missing for most records. The 7,745 unique values cover 7,929 non-null rows, so almost every located language has a distinct longitude.

Treatment: Pair with latitude for geospatial features; impute or add a missingness flag given the high null rate.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["longitude"].stats

stat value
n 19,401
nulls 11,472 (59.1%)
unique 7,745
min -178.8
max 179.3
mean 51.22
median 47.57
std 81.15
q1 7.18
q3 124.1
iqr 117
skew -0.4814
kurtosis -0.7765
n_outliers 13
outlier_rate 0.00164
zero_rate 0
alert: null_rate · 59.1% null
Fig 12.
Distribution of longitude. Vertical dash marks the median.

status categorical metadata

This is a single-value categorical field where every one of the 19,401 rows is "living". With cardinality 1 and entropy 0, it carries no information and cannot discriminate between records.

Treatment: Drop; constant column with zero entropy.

anthropic:claude-opus-4-7 · confidence high
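
Detecting and dropping constant columns such as status can be automated. The following is a minimal sketch over rows held as dicts. Both helper names and the example rows are hypothetical; only the value 'living' is taken from the report.

```python
def constant_columns(rows):
    """Return names of columns whose non-null values are all identical."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            if value is not None:
                seen.setdefault(name, set()).add(value)
    return sorted(name for name, values in seen.items() if len(values) == 1)

def drop_columns(rows, names):
    """Return copies of rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]

# Hypothetical rows for illustration only.
rows = [
    {"glottocode": "aaaa0001", "status": "living", "macroarea": "Africa"},
    {"glottocode": "aaaa0002", "status": "living", "macroarea": "Eurasia"},
]
to_drop = constant_columns(rows)
cleaned = drop_columns(rows, to_drop)
print(to_drop, cleaned)
```

Dropping the column is safe for modelling, though the constant value is itself a fact about how the table was built and may be worth recording in the dataset notes.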
Out[34]:

saturn.columns["status"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique 1
top_value living
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 13.
Top values for status (1 unique shown, of 1 total).
value   count  share
living  19401  100.0%

speakers_count unknown other

The column is named speakers_count, which suggests a numeric tally of speakers per language, but Saturn skipped profiling and returned no type, uniqueness, or distribution stats. The only confirmed signals are 19,401 rows and a 0.0% null rate. Without a kind or summary statistics, nothing further can be said about its actual contents.

Treatment: Re-profile this column with type inference forced before deciding on downstream use.

anthropic:claude-opus-4-7 · confidence low
Out[37]:

saturn.columns["speakers_count"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique (not reported)
alert: skipped · no profiler for kind=unknown

phoneme_count numeric feature

Counts of phonemes per language, ranging from 11 to 231 with a median of 34 and an IQR of 20. The distribution is heavily right-skewed (skew 2.32, kurtosis 11.5) with 79 outliers (3.6% of present values), and, critically, 88.8% of rows are null, so only 2,173 values are present.

Treatment: Impute or flag missingness and log-transform before modelling given the 88.8% null rate and skew of 2.32.

anthropic:claude-opus-4-7 · confidence high
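
The outlier count and the suggested log transform can be illustrated with the quartiles from the stats table. The sketch below uses the common Tukey rule (1.5 × IQR beyond the quartiles) and `math.log1p`. Whether Saturn applies this exact rule is an assumption, and the sample values are hypothetical apart from the reported minimum, median, and maximum.

```python
import math

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from the phoneme_count stats table (q1 = 26, q3 = 46).
lower, upper = tukey_fences(26, 46)
print(lower, upper)

def log_transform(values):
    """Apply log(1 + x) to present values and keep None for missing ones."""
    return [None if v is None else math.log1p(v) for v in values]

# Reported min, median, and max, plus one missing value.
sample = [11, 34, None, 231]
transformed = log_transform(sample)
is_outlier = [v is not None and (v < lower or v > upper) for v in sample]
print(transformed, is_outlier)
```

The upper fence works out to 76, which is consistent with the report: the histogram bins from 77 upward sum to 79, matching the stated n_outliers. The lower fence is negative, so no small inventory is flagged.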
Out[39]:

saturn.columns["phoneme_count"].stats

stat value
n 19,401
nulls 17,228 (88.8%)
unique 100
min 11
max 231
mean 38.2
median 34
std 17.78
q1 26
q3 46
iqr 20
skew 2.325
kurtosis 11.54
n_outliers 79
outlier_rate 0.03636
zero_rate 0
alert: null_rate · 88.8% null
alert: high_skew · skew=+2.32
Fig 14.
Distribution of phoneme_count. Vertical dash marks the median.

How to cite

BibTeX
@misc{saturn-parquet-languages-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet languages},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-languages}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: parquet languages. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/languages.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-languages