language data glottolog languoid
Reading
This dataset is a Glottolog languoid catalog with 23,740 rows and 16 columns describing languages, dialects, and families along with geographic and endangerment metadata. The `level` field splits the records into three classes, dialect (10,920), language (8,481), and family (4,339), making it the natural primary lens. Endangerment `status` is dominated by 'safe' (~79.9%), but the remaining five categories, from vulnerable to extinct, still cover 4,775 rows worth investigating. Geography is concentrated: `country_ids` is led by PG (874), ID (695), and NG (480), and `family_id` is heavily skewed toward atla1278 (4,663) and aust1307 (3,850). Note that `iso639P3code`, `latitude`, and `longitude` are ~66% null, so spatial analysis will only cover about a third of rows.
citing: level · status · country_ids · family_id · latitude · longitude · iso639P3code · child_dialect_count · bookkeeping
Charts the summary said to look at first
`level` value counts
| value | count | share |
|---|---|---|
| dialect | 10920 | 46.0% |
| language | 8481 | 35.7% |
| family | 4339 | 18.3% |
`status` value counts
| value | count | share |
|---|---|---|
| safe | 18965 | 79.9% |
| definitely endangered | 1814 | 7.6% |
| vulnerable | 1194 | 5.0% |
| extinct | 889 | 3.7% |
| critically endangered | 465 | 2.0% |
| severely endangered | 413 | 1.7% |
`family_id` value counts (top 20)
| value | count | share |
|---|---|---|
| atla1278 | 4663 | 19.6% |
| aust1307 | 3850 | 16.2% |
| indo1319 | 2201 | 9.3% |
| sino1245 | 1666 | 7.0% |
| afro1255 | 1259 | 5.3% |
| nucl1709 | 762 | 3.2% |
| pama1250 | 598 | 2.5% |
| aust1305 | 503 | 2.1% |
| book1242 | 399 | 1.7% |
| otom1299 | 338 | 1.4% |
| mand1469 | 303 | 1.3% |
| sign1238 | 259 | 1.1% |
| drav1251 | 255 | 1.1% |
| cent2225 | 251 | 1.1% |
| turk1311 | 229 | 1.0% |
| taik1256 | 223 | 0.9% |
| nilo1247 | 201 | 0.8% |
| ural1272 | 185 | 0.8% |
| japo1237 | 179 | 0.8% |
| tupi1275 | 157 | 0.7% |
`country_ids` value counts (top 20)
| value | count | share |
|---|---|---|
| PG | 874 | 3.7% |
| ID | 695 | 2.9% |
| NG | 480 | 2.0% |
| AU | 432 | 1.8% |
| IN | 356 | 1.5% |
| MX | 297 | 1.3% |
| CN | 271 | 1.1% |
| BR | 263 | 1.1% |
| US | 247 | 1.0% |
| CM | 196 | 0.8% |
| PH | 177 | 0.7% |
| CD | 156 | 0.7% |
| VU | 118 | 0.5% |
| SD | 99 | 0.4% |
| PE | 97 | 0.4% |
| TZ | 93 | 0.4% |
| MY | 90 | 0.4% |
| TD | 88 | 0.4% |
| RU | 83 | 0.3% |
| CO | 82 | 0.3% |
`latitude` histogram (40 equal-width bins)
| bin | count |
|---|---|
| -55.27 – -52.06 | 5 |
| -52.06 – -48.85 | 1 |
| -48.85 – -45.64 | 1 |
| -45.64 – -42.43 | 4 |
| -42.43 – -39.22 | 7 |
| -39.22 – -36.01 | 16 |
| -36.01 – -32.8 | 29 |
| -32.8 – -29.59 | 26 |
| -29.59 – -26.38 | 48 |
| -26.38 – -23.17 | 78 |
| -23.17 – -19.96 | 125 |
| -19.96 – -16.75 | 141 |
| -16.75 – -13.54 | 281 |
| -13.54 – -10.33 | 256 |
| -10.33 – -7.121 | 495 |
| -7.121 – -3.911 | 788 |
| -3.911 – -0.7005 | 681 |
| -0.7005 – 2.51 | 379 |
| 2.51 – 5.72 | 469 |
| 5.72 – 8.93 | 664 |
| 8.93 – 12.14 | 710 |
| 12.14 – 15.35 | 303 |
| 15.35 – 18.56 | 387 |
| 18.56 – 21.77 | 233 |
| 21.77 – 24.98 | 318 |
| 24.98 – 28.19 | 373 |
| 28.19 – 31.4 | 167 |
| 31.4 – 34.61 | 144 |
| 34.61 – 37.82 | 179 |
| 37.82 – 41.03 | 113 |
| 41.03 – 44.24 | 138 |
| 44.24 – 47.45 | 79 |
| 47.45 – 50.66 | 78 |
| 50.66 – 53.87 | 76 |
| 53.87 – 57.08 | 46 |
| 57.08 – 60.29 | 21 |
| 60.29 – 63.5 | 41 |
| 63.5 – 66.71 | 23 |
| 66.71 – 69.93 | 14 |
| 69.93 – 73.14 | 6 |
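The bin boundaries in the latitude histogram are consistent with 40 equal-width bins between the reported minimum (-55.27) and maximum (73.14). The sketch below reproduces that binning under this assumption; the function names are illustrative and are not taken from the profiler.

```python
def equal_width_edges(lo: float, hi: float, n_bins: int) -> list[float]:
    """Return n_bins + 1 equally spaced bin edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Return the 0-based bin a value falls into; the maximum goes in the last bin."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


# Assumed inputs: min and max latitude as reported in the column statistics.
edges = equal_width_edges(-55.27, 73.14, 40)
median_bin = bin_index(6.306, -55.27, 73.14, 40)  # bin holding the reported median
```

With these inputs each bin is about 3.21 degrees wide, which matches the step between consecutive rows of the table.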
Schema
16 columns

| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| id | text | 0.0% | 23,740 | near_unique, one_word, short_text |
| family_id | categorical | 1.8% | 287 | |
| parent_id | text | 1.8% | 7,338 | one_word, short_text, duplicates |
| name | text | 0.0% | 23,740 | near_unique, one_word |
| bookkeeping | categorical | 0.0% | 2 | imbalance |
| level | categorical | 0.0% | 3 | |
| status | categorical | 0.0% | 6 | |
| latitude | numeric | 66.5% | 7,798 | null_rate |
| longitude | numeric | 66.5% | 7,757 | null_rate |
| iso639P3code | text | 66.4% | 7,968 | near_unique, one_word, null_rate, short_text |
| description | unknown | 0.0% | n/a | skipped |
| markup_description | unknown | 0.0% | n/a | skipped |
| child_family_count | numeric | 0.0% | 88 | high_skew, outliers |
| child_language_count | numeric | 0.0% | 126 | high_skew, outliers |
| child_dialect_count | numeric | 0.0% | 164 | high_skew, outliers |
| country_ids | categorical | 64.2% | 680 | null_rate |
id
text identifier near_unique one_word short_text
Fixed 8-character single-token codes (e.g., 'melk1240', 'yang1299'), unique across all 23,740 rows with no nulls or duplicates. The pattern of four letters followed by four digits is consistent with Glottolog-style language identifiers, making this a primary key rather than analyzable text. Treatment: Use as the row key for joins; exclude from modelling features.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 23,740
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 86.11
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
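Before using `id` as a join key, it may be worth confirming the shape described above. This is a minimal sketch that checks the "four letters followed by four digits" pattern exactly as the profile states it, and reports duplicates; the helper names are hypothetical.

```python
import re

# Pattern as described in the profile: four lowercase letters, then four digits.
ID_PATTERN = re.compile(r"[a-z]{4}[0-9]{4}")


def looks_like_id(value: str) -> bool:
    """True if the whole string matches the four-letters-four-digits shape."""
    return ID_PATTERN.fullmatch(value) is not None


def find_key_problems(ids: list[str]) -> tuple[list[str], list[str]]:
    """Return (malformed ids, ids seen more than once), each in first-seen order."""
    seen: set[str] = set()
    malformed: list[str] = []
    duplicated: list[str] = []
    for value in ids:
        if not looks_like_id(value):
            malformed.append(value)
        if value in seen and value not in duplicated:
            duplicated.append(value)
        seen.add(value)
    return malformed, duplicated


# Two ids from the profile plus one deliberate duplicate and one bad value.
problems = find_key_problems(["melk1240", "yang1299", "melk1240", "bad"])
```

An empty result for both lists would support treating the column as a primary key.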
family_id
categorical foreign_key
Looks like a Glottolog-style language family identifier (e.g. 'atla1278' for Atlantic-Congo, 'aust1307' for Austronesian) tagging each of the 23,740 rows. The distribution is heavily skewed: the top family alone covers 20.0% of non-null rows (19.6% of all rows), and the top 10 of 287 families account for about 68% of all rows, yielding an entropy ratio of 0.598. Null rate is low at 1.81%. Treatment: left-join on this id to a language-family reference table; consider grouping the long tail before modelling.
- n: 23,740
- nulls: 429 (1.8%)
- unique: 287
- top_value: atla1278
- top_rate: 0.2
- cardinality: 287
- entropy: 4.886
- entropy_ratio: 0.5984
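The reported `top_rate` of 0.2 and the 19.6% share shown in the `family_id` chart differ slightly. The numbers are consistent with `top_rate` being computed over non-null rows while the chart share uses all rows; that reading is an inference from the figures, not something the profiler states. A short check:

```python
def top_shares(top_count: int, n_rows: int, n_nulls: int) -> tuple[float, float]:
    """Return (share of all rows, share of non-null rows) for the top value."""
    share_all = top_count / n_rows
    share_non_null = top_count / (n_rows - n_nulls)
    return share_all, share_non_null


# Counts taken from the profile: atla1278 in family_id, PG in country_ids.
family_all, family_non_null = top_shares(4663, 23740, 429)
country_all, country_non_null = top_shares(874, 23740, 15250)
```

The same reasoning would explain why PG shows 3.7% in the `country_ids` chart but a `top_rate` of 0.1029 in that column's statistics.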
parent_id
text foreign_key one_word short_text duplicates
Fixed-width 8-character single-token codes (e.g. 'book1242', 'uncl1493') with len_min=len_max=8 and one_word_rate=1.0; these look like Glottolog-style language/family identifiers used as a parent reference. With 7,338 unique values across 23,740 rows and a 68.5% duplicate_rate, many children share parents; 'book1242' alone accounts for 399 rows. Null rate is low at 1.81%. Treatment: left-join on this id to a parent/language lookup table.
- n: 23,740
- nulls: 429 (1.8%)
- unique: 7,338
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 15,973
- duplicate_rate: 0.6852
- vocab_size: 7,189
- readability_flesch_mean: 91.19
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
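Because `parent_id` points back at `id`, the table can be read as a tree without an external lookup. The sketch below builds a parent-to-children map and counts descendants. The rows are made-up toy identifiers, not values from the dataset, and treating a null `parent_id` as a root is an assumption.

```python
from collections import defaultdict


def build_children(rows):
    """Map each parent id to its direct children; rows with no parent are roots."""
    children = defaultdict(list)
    roots = []
    for row_id, parent_id in rows:
        if parent_id is None:
            roots.append(row_id)
        else:
            children[parent_id].append(row_id)
    return dict(children), roots


def count_descendants(node, children):
    """Count all nodes below `node` using an explicit stack (no recursion limit)."""
    total = 0
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        total += 1
        stack.extend(children.get(current, []))
    return total


# Toy (id, parent_id) pairs: one family, two languages, one dialect.
toy_rows = [
    ("aaaa0001", None),
    ("bbbb0001", "aaaa0001"),
    ("bbbb0002", "aaaa0001"),
    ("cccc0001", "bbbb0001"),
]
children, roots = build_children(toy_rows)
n_below_root = count_descendants("aaaa0001", children)
```

Counts derived this way could be compared against the `child_*_count` columns as a consistency check.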
name
text identifier near_unique one_word
The `name` column holds 23,740 fully unique short labels (null_rate 0.0, n_unique equals n), with a mean length of 9.95 characters and 69.5% being a single word. Top tokens like `nuclear`, `central`, `western`, `eastern`, `northern`, `southern`, and `language` suggest these are names of languages and subgroups rather than person names. Vocabulary is broad (17,915 distinct words at only ~1.4 words per row), and there are no duplicates, URLs, or emoji. Treatment: Treat as a unique label key; drop from modelling features or hash/embed if semantic content is needed.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 23,740
- len_min: 1
- len_max: 58
- len_mean: 9.95
- len_median: 8
- len_p95: 22
- word_mean: 1.398
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 17,915
- readability_flesch_mean: 42.62
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.6953
- allcaps_rate: 0.0001685
- boilerplate_rate: 0
bookkeeping
categorical feature imbalance
Boolean flag marking bookkeeping records, with only two values across 23,740 rows and no nulls. The distribution is severely imbalanced: 'False' covers 98.3% (23,341 rows) versus only 399 'True' cases, yielding an entropy ratio of just 0.12. The 399 'True' rows match the 399 rows whose `family_id` and `parent_id` are 'book1242', which suggests the flag identifies that group. Treatment: Encode as boolean and apply class-imbalance handling (e.g., stratification or reweighting) if used as a target.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 2
- top_value: False
- top_rate: 0.9832
- cardinality: 2
- entropy: 0.1231
- entropy_ratio: 0.1231
level
categorical feature
This is a categorical taxonomy tag with exactly 3 levels (dialect, language, family) and no nulls across 23,740 rows. The distribution is well-spread (entropy_ratio 0.94), with 'dialect' leading at 46.0%, followed by 'language' (8,481) and 'family' (4,339). Looks like a linguistic classification level rather than a free-form attribute. Treatment: one-hot encode for modelling or use directly as a stratification key.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 3
- top_value: dialect
- top_rate: 0.46
- cardinality: 3
- entropy: 1.494
- entropy_ratio: 0.9426
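The `entropy` and `entropy_ratio` figures reported for the categorical columns are consistent with Shannon entropy in bits, divided by the maximum possible entropy log2(k) for k distinct values. The profiler's exact definition is not shown, so treat this as a sketch that happens to reproduce the reported numbers:

```python
import math


def entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: list[int]) -> float:
    """Entropy divided by log2(k), the entropy of a uniform k-way split."""
    k = len(counts)
    if k < 2:
        return 0.0  # a single class carries no uncertainty
    return entropy_bits(counts) / math.log2(k)


level_counts = [10920, 8481, 4339]   # dialect, language, family
bookkeeping_counts = [23341, 399]    # False, True

level_entropy = entropy_bits(level_counts)
level_ratio = entropy_ratio(level_counts)
bookkeeping_ratio = entropy_ratio(bookkeeping_counts)
```

A ratio near 1 means the classes are close to evenly spread (as for `level`), while a ratio near 0 means one class dominates (as for `bookkeeping`).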
status
categorical label
This is a categorical status column with 6 levels matching UNESCO-style language endangerment categories (safe, definitely endangered, vulnerable, extinct, critically endangered, severely endangered). The distribution is heavily imbalanced: 'safe' accounts for 79.9% of 23,740 rows, while the rarest level 'severely endangered' has only 413 records. Entropy ratio is 0.44, confirming low diversity despite 6 classes. Treatment: Treat as an ordinal target, ordering the levels by severity rather than by frequency; stratify or rebalance before classification, given that 'safe' makes up about 80% of rows.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 6
- top_value: safe
- top_rate: 0.7989
- cardinality: 6
- entropy: 1.15
- entropy_ratio: 0.4447
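One way to act on the ordinal and rebalancing suggestions is sketched below. The severity ordering is an assumption based on the usual UNESCO-style scale, and the weighting formula (total / (classes × count)) is a common "balanced" heuristic rather than anything the profile prescribes.

```python
# Assumed severity order, least to most endangered.
STATUS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
STATUS_RANK = {name: rank for rank, name in enumerate(STATUS_ORDER)}

# Row counts per status, copied from the profile.
STATUS_COUNTS = {
    "safe": 18965,
    "definitely endangered": 1814,
    "vulnerable": 1194,
    "extinct": 889,
    "critically endangered": 465,
    "severely endangered": 413,
}


def balanced_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


weights = balanced_weights(STATUS_COUNTS)
```

Under this scheme 'safe' receives a weight well below 1 and the rare classes receive weights well above 1, so each class contributes equally in aggregate.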
latitude
numeric feature null_rate
Geographic latitude coordinate spanning -55.27 to 73.14, consistent with valid Earth latitudes. Two-thirds of rows are null (null_rate 0.6654), which severely limits coverage. The distribution is mildly right-skewed (0.54) with median 6.31 and IQR ~24.5, suggesting observations concentrated in the tropics with a slight northern-hemisphere lean. Treatment: Impute or filter the 66% nulls before any spatial modelling; pair with longitude for geo-features.
- n: 23,740
- nulls: 15,797 (66.5%)
- unique: 7,798
- min: -55.27
- max: 73.14
- mean: 8.17
- median: 6.306
- std: 18.96
- q1: -5.137
- q3: 19.34
- iqr: 24.47
- skew: 0.5403
- kurtosis: 0.3006
- n_outliers: 129
- outlier_rate: 0.01624
- zero_rate: 0
longitude
numeric feature null_rate
Geographic longitude in decimal degrees, with values spanning -178.785 to 179.306, consistent with the global WGS84 range. Two-thirds of rows are null (null_rate 0.6654), so coverage is the dominant concern; among populated rows the distribution is mildly left-skewed (-0.48) with a median of 47.72, suggesting an Eastern-Hemisphere bias. Only 13 outliers (0.16%) and no zeros, so the populated values themselves look clean. Treatment: Pair with latitude for geospatial features; impute or filter the 66.5% missing before modelling.
- n: 23,740
- nulls: 15,797 (66.5%)
- unique: 7,757
- min: -178.8
- max: 179.3
- mean: 51.27
- median: 47.72
- std: 81.14
- q1: 7.235
- q3: 124.1
- iqr: 116.9
- skew: -0.4832
- kurtosis: -0.7745
- n_outliers: 13
- outlier_rate: 0.001637
- zero_rate: 0
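For spatial work, filtering to rows that have both coordinates is the simpler of the two treatments mentioned. The sketch below keeps only rows with a valid latitude and longitude pair and reports coverage. The rows are toy records with made-up coordinates, and representing a row as a dict with `None` for nulls is an assumption about how the data is loaded.

```python
def has_valid_coords(row: dict) -> bool:
    """True if both coordinates are present and inside the valid global range."""
    lat, lon = row.get("latitude"), row.get("longitude")
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Toy rows: one fully located, one missing both values, one missing longitude.
toy_rows = [
    {"id": "aaaa0001", "latitude": -6.0, "longitude": 145.0},
    {"id": "bbbb0001", "latitude": None, "longitude": None},
    {"id": "cccc0001", "latitude": 9.5, "longitude": None},
]

located = [row for row in toy_rows if has_valid_coords(row)]
coverage = len(located) / len(toy_rows)
```

On the real table, the null counts reported above imply that at most about a third of rows would survive this filter.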
iso639P3code
text identifier near_unique one_word null_rate short_text
This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). Two-thirds are missing (null_rate=0.6644), and every populated value is distinct (7,968 codes across the 7,968 non-null rows), suggesting one code per language entry rather than a repeated categorical. Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and handle the 66% nulls explicitly.
- n: 23,740
- nulls: 15,772 (66.4%)
- unique: 7,968
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 7,968
- readability_flesch_mean: 119.5
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
description
unknown free_text skipped
This column is named 'description' but saturn skipped profiling, so no kind, uniqueness, or value statistics are available. Only the row count (23,740) and a null rate of 0.0 are reported. Without further stats, the content and structure cannot be characterized. Treatment: Re-profile or sample manually before deciding; if textual, tokenize and embed before modelling.
- n: 23,740
- nulls: 0 (0.0%)
- unique: n/a
markup_description
unknown free_text skipped
This column was skipped by the profiler, so no statistics, uniqueness count, or value samples are available beyond a row count of 23,740 and a null rate of 0.0. The name suggests it holds a marked-up version of the descriptive text (possibly HTML or another markup format), but that is inferred from the label alone, not from evidence. No distributional signal can be assessed here. Treatment: Re-run the profiler with text handling enabled, then tokenize and embed before modelling.
- n: 23,740
- nulls: 0 (0.0%)
- unique: n/a
child_family_count
numeric feature high_skew outliers
A numeric count of child families per record, with 23,740 rows and only 88 distinct values. It is overwhelmingly zero (zero_rate 0.9082), so q1, median, and q3 are all 0 and IQR is 0, yet a long tail pushes max to 859 with mean 0.879 and std 13.20. Skew of 44.40 and kurtosis of 2352.94 confirm an extreme heavy-tailed distribution, and 2,179 rows (9.18%) flag as outliers. Treatment: Binarize (zero vs non-zero) or apply log1p before modelling to tame the extreme skew.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 88
- min: 0
- max: 859
- mean: 0.8792
- median: 0
- std: 13.2
- q1: 0
- q3: 0
- iqr: 0
- skew: 44.4
- kurtosis: 2353
- n_outliers: 2,179
- outlier_rate: 0.09179
- zero_rate: 0.9082
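The outlier rate here (0.0918) is exactly one minus the zero rate (0.9082). That is what a 1.5 × IQR fence would produce when the IQR is 0: both fences collapse to 0, so every non-zero value is flagged. Whether the profiler uses this rule is an assumption; the sketch below simply shows the effect on a toy sample with a similar share of zeros.

```python
import statistics


def tukey_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Toy sample: 90 zeros and 10 non-zero counts with a long tail.
sample = [0] * 90 + [1, 1, 1, 2, 2, 3, 5, 12, 40, 859]
outliers = tukey_outliers(sample)
outlier_rate = len(outliers) / len(sample)
```

In this situation the "outlier" flag mostly restates "is non-zero", so it says little about which records are genuinely extreme.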
child_language_count
numeric feature high_skew outliers
A numeric count of child languages per record, where 81.7% of rows are zero and Q1=median=Q3=0, so the typical entity has none. The distribution is extremely long-tailed (skew 41.86, kurtosis 2115) with a max of 1,435 and 4,339 outliers (18.3% outlier rate), suggesting a small set of hub-like records dominate. Treatment: Binarize (has_children vs none) or log1p-transform before modelling given the heavy zero-inflation and skew.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 126
- min: 0
- max: 1,435
- mean: 1.996
- median: 0
- std: 23.41
- q1: 0
- q3: 0
- iqr: 0
- skew: 41.86
- kurtosis: 2115
- n_outliers: 4,339
- outlier_rate: 0.1828
- zero_rate: 0.8172
child_dialect_count
numeric feature high_skew outliers
This is a count of child dialects per record, dominated by zeros (zero_rate 0.7442), with a median of 0 and a Q3 of 1 yet a max of 2,369. Skew of 42.22 and kurtosis of 2159 confirm an extreme long tail, and 17.99% of rows flag as outliers. The mean of 3.39 is pulled far above the median, so any aggregate using it will mislead. Treatment: Log1p-transform or bin (zero / one / many) before modelling.
- n: 23,740
- nulls: 0 (0.0%)
- unique: 164
- min: 0
- max: 2,369
- mean: 3.389
- median: 0
- std: 36.8
- q1: 0
- q3: 1
- iqr: 1
- skew: 42.22
- kurtosis: 2159
- n_outliers: 4,272
- outlier_rate: 0.1799
- zero_rate: 0.7442
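Both treatments suggested for the child-count columns are one-liners. This sketch shows a log1p transform and the zero / one / many binning; the bin labels and function names are illustrative choices.

```python
import math


def log1p_count(n: int) -> float:
    """log(1 + n): maps 0 to 0 and compresses the long right tail."""
    return math.log1p(n)


def bin_count(n: int) -> str:
    """Three-way bin for zero-inflated counts."""
    if n < 0:
        raise ValueError("counts must be non-negative")
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    return "many"


# Reported min, q3, and max of child_dialect_count.
examples = [0, 1, 2369]
transformed = [log1p_count(n) for n in examples]
binned = [bin_count(n) for n in examples]
```

The log transform keeps ordering information for the few hub records, while binning discards it in exchange for a simple categorical feature.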
country_ids
categorical feature null_rate
This column holds ISO-style country codes (PG, ID, NG, AU, IN…) as a categorical feature with 680 distinct values across 23,740 rows. Coverage is poor: 64.24% of rows are null, and the non-null distribution is broad (entropy 6.49, ratio 0.69), with PG (Papua New Guinea) leading at only 10.29% of non-null rows (3.7% of all rows). The 680 distinct values far exceed the ~250 real countries, suggesting multi-value concatenations or non-standard tokens behind the plural 'country_ids'. Treatment: Split multi-code entries, normalise to ISO-3166, and impute or flag the 64% missing before one-hot or target encoding.
- n: 23,740
- nulls: 15,250 (64.2%)
- unique: 680
- top_value: PG
- top_rate: 0.1029
- cardinality: 680
- entropy: 6.493
- entropy_ratio: 0.6901
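If the excess distinct values do come from several codes packed into one cell, splitting them before counting gives per-country totals. The sketch below assumes the codes are separated by whitespace; the profile does not show the delimiter, so check a few raw cells first. The input cells are toy values.

```python
from collections import Counter
from typing import Optional


def split_codes(cell: Optional[str]) -> list[str]:
    """Split a cell into upper-cased codes; None or blank cells yield an empty list."""
    if cell is None:
        return []
    return [code.upper() for code in cell.split()]


def count_codes(cells: list[Optional[str]]) -> Counter:
    """Count each country code across all cells, one count per occurrence."""
    counter: Counter = Counter()
    for cell in cells:
        counter.update(split_codes(cell))
    return counter


# Toy cells: single codes, one multi-code entry, and one null.
toy_cells = ["PG", "ID PG", None, "NG", "ID"]
code_counts = count_codes(toy_cells)
```

After splitting, the number of distinct codes can be compared against the ISO-3166 list to spot non-standard tokens.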