data raw glottolog languoid
Reading
This dataset is a Glottolog languoid catalogue with 19,401 rows and 7 columns covering identifiers (glottocode, isocodes, name), geographic coordinates (latitude, longitude), and classification fields (macroarea, level). The most striking feature is missingness: roughly 59% of rows lack an ISO code and roughly 59% lack coordinates, so any geographic or ISO-based analysis covers only about 41% of entries. Worth a closer look first: the macroarea distribution (Africa leads with 30.7% of all rows, or 32.1% of rows that have a macroarea, followed by Eurasia and Papunesia) and the level split between dialect (56%) and language (44%). The name field is mostly single words but contains recurring qualifiers like 'nuclear', 'sign', 'central', and 'southern' that hint at naming conventions worth exploring.
citing: row_count · column_count · columns · kinds
Charts the summary said to look at first
macroarea: value counts (share of all 19,401 rows)
| value | count | share |
|---|---|---|
| Africa | 5955 | 30.7% |
| Eurasia | 5028 | 25.9% |
| Papunesia | 4847 | 25.0% |
| South America | 1095 | 5.6% |
| North America | 1035 | 5.3% |
| Australia | 602 | 3.1% |
level: value counts (share of all 19,401 rows)
| value | count | share |
|---|---|---|
| dialect | 10920 | 56.3% |
| language | 8481 | 43.7% |
latitude: histogram (degrees, 40 bins)
| bin | count |
|---|---|
| -55.27 – -52.06 | 5 |
| -52.06 – -48.85 | 1 |
| -48.85 – -45.64 | 1 |
| -45.64 – -42.43 | 4 |
| -42.43 – -39.22 | 7 |
| -39.22 – -36.01 | 16 |
| -36.01 – -32.8 | 29 |
| -32.8 – -29.59 | 26 |
| -29.59 – -26.38 | 47 |
| -26.38 – -23.17 | 77 |
| -23.17 – -19.96 | 125 |
| -19.96 – -16.75 | 141 |
| -16.75 – -13.54 | 280 |
| -13.54 – -10.33 | 256 |
| -10.33 – -7.121 | 495 |
| -7.121 – -3.911 | 788 |
| -3.911 – -0.7005 | 681 |
| -0.7005 – 2.51 | 378 |
| 2.51 – 5.72 | 468 |
| 5.72 – 8.93 | 663 |
| 8.93 – 12.14 | 710 |
| 12.14 – 15.35 | 303 |
| 15.35 – 18.56 | 384 |
| 18.56 – 21.77 | 233 |
| 21.77 – 24.98 | 318 |
| 24.98 – 28.19 | 371 |
| 28.19 – 31.4 | 167 |
| 31.4 – 34.61 | 143 |
| 34.61 – 37.82 | 178 |
| 37.82 – 41.03 | 113 |
| 41.03 – 44.24 | 138 |
| 44.24 – 47.45 | 79 |
| 47.45 – 50.66 | 77 |
| 50.66 – 53.87 | 76 |
| 53.87 – 57.08 | 46 |
| 57.08 – 60.29 | 21 |
| 60.29 – 63.5 | 41 |
| 63.5 – 66.71 | 23 |
| 66.71 – 69.93 | 14 |
| 69.93 – 73.14 | 6 |
longitude: histogram (degrees, 40 bins)
| bin | count |
|---|---|
| -178.8 – -169.8 | 13 |
| -169.8 – -160.9 | 4 |
| -160.9 – -151.9 | 10 |
| -151.9 – -143 | 11 |
| -143 – -134 | 10 |
| -134 – -125.1 | 17 |
| -125.1 – -116.1 | 123 |
| -116.1 – -107.2 | 47 |
| -107.2 – -98.21 | 78 |
| -98.21 – -89.26 | 280 |
| -89.26 – -80.31 | 59 |
| -80.31 – -71.36 | 235 |
| -71.36 – -62.41 | 218 |
| -62.41 – -53.45 | 150 |
| -53.45 – -44.5 | 60 |
| -44.5 – -35.55 | 40 |
| -35.55 – -26.6 | 0 |
| -26.6 – -17.64 | 4 |
| -17.64 – -8.692 | 105 |
| -8.692 – 0.2605 | 275 |
| 0.2605 – 9.213 | 443 |
| 9.213 – 18.17 | 751 |
| 18.17 – 27.12 | 322 |
| 27.12 – 36.07 | 429 |
| 36.07 – 45.02 | 228 |
| 45.02 – 53.97 | 126 |
| 53.97 – 62.93 | 35 |
| 62.93 – 71.88 | 79 |
| 71.88 – 80.83 | 210 |
| 80.83 – 89.78 | 207 |
| 89.78 – 98.74 | 269 |
| 98.74 – 107.7 | 454 |
| 107.7 – 116.6 | 239 |
| 116.6 – 125.6 | 497 |
| 125.6 – 134.5 | 316 |
| 134.5 – 143.5 | 598 |
| 143.5 – 152.4 | 667 |
| 152.4 – 161.4 | 122 |
| 161.4 – 170.4 | 186 |
| 170.4 – 179.3 | 12 |
name: length histogram (characters, 40 bins)
| chars | count |
|---|---|
| 1 – 2 | 52 |
| 2 – 4 | 455 |
| 4 – 5 | 4355 |
| 5 – 7 | 2890 |
| 7 – 8 | 3995 |
| 8 – 10 | 1115 |
| 10 – 11 | 813 |
| 11 – 12 | 1417 |
| 12 – 14 | 724 |
| 14 – 15 | 1264 |
| 15 – 17 | 439 |
| 17 – 18 | 524 |
| 18 – 20 | 214 |
| 20 – 21 | 175 |
| 21 – 22 | 343 |
| 22 – 24 | 143 |
| 24 – 25 | 173 |
| 25 – 27 | 58 |
| 27 – 28 | 86 |
| 28 – 30 | 26 |
| 30 – 31 | 26 |
| 31 – 32 | 43 |
| 32 – 34 | 12 |
| 34 – 35 | 24 |
| 35 – 37 | 11 |
| 37 – 38 | 8 |
| 38 – 39 | 3 |
| 39 – 41 | 1 |
| 41 – 42 | 2 |
| 42 – 44 | 1 |
| 44 – 45 | 3 |
| 45 – 47 | 1 |
| 47 – 48 | 2 |
| 48 – 49 | 1 |
| 49 – 51 | 0 |
| 51 – 52 | 0 |
| 52 – 54 | 0 |
| 54 – 55 | 0 |
| 55 – 57 | 0 |
| 57 – 58 | 2 |
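The uneven neighbouring counts in the name-length table (for example 4,355 in "4 – 5" versus 2,890 in "5 – 7") look like a binning artifact rather than a property of the names. A minimal sketch, assuming the chart uses 40 equal-width bins between len_min=1 and len_max=58 (the table has 40 rows, but the report does not state its binning rule), shows that bins of width 1.425 capture two integer lengths in some cases and only one in others:

```python
import math

# Assumptions: 40 equal-width bins spanning the reported name lengths 1..58.
LEN_MIN, LEN_MAX, N_BINS = 1, 58, 40

# Multiply before dividing so the final edge is exactly LEN_MAX.
edges = [LEN_MIN + i * (LEN_MAX - LEN_MIN) / N_BINS for i in range(N_BINS + 1)]


def integer_lengths(lo, hi, is_last):
    """Integer lengths L with lo <= L < hi; the last bin also includes hi."""
    first = math.ceil(lo)
    last = math.floor(hi) if is_last else math.ceil(hi) - 1
    return list(range(first, last + 1))


bins = [
    integer_lengths(edges[i], edges[i + 1], is_last=(i == N_BINS - 1))
    for i in range(N_BINS)
]

# Show the rounded label next to the integer lengths each bin really holds.
for i in range(5):
    label = f"{round(edges[i])} – {round(edges[i + 1])}"
    print(label, "covers lengths", bins[i])
```

Under that assumption, re-binning with one bin per integer length would remove the sawtooth and make the distribution easier to read.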
Schema
7 columns
| column | kind | nulls | unique | alerts |
|---|---|---|---|---|
| glottocode | text | 0.0% | 19,401 | near_unique, one_word, short_text |
| name | text | 0.0% | 19,401 | near_unique, one_word |
| isocodes | text | 59.2% | 7,922 | near_unique, one_word, null_rate, short_text |
| level | categorical | 0.0% | 2 | |
| macroarea | categorical | 4.3% | 6 | |
| latitude | numeric | 59.1% | 7,786 | null_rate |
| longitude | numeric | 59.1% | 7,745 | null_rate |
glottocode
text identifier near_unique one_word short_text
This column holds Glottocodes: fixed 8-character identifiers (len_min=len_max=8) that uniquely tag languoids (languages and dialects) in the Glottolog catalogue. Every one of the 19,401 values is unique (n_unique=19401, duplicate_rate=0.0) and single-token (one_word_rate=1.0), matching the Glottocode pattern of four alphanumeric characters followed by four digits, as seen in top_words such as 'aala1237' and 'aari1239'. There are no nulls and no collisions, so this is a clean primary key rather than a feature. Treatment: Use as the primary key; left-join on this id rather than feeding it to a model.
- n: 19,401
- nulls: 0 (0.0%)
- unique: 19,401
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 19,401
- readability_flesch_mean: 93.3
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
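The identifier format can be checked before the column is used as a join key. A minimal sketch, assuming the pattern is four lowercase alphanumeric characters followed by four digits; the helper names `is_glottocode` and `check_primary_key` are illustrative and not part of any Glottolog tooling:

```python
import re

# Assumed pattern: 4 lowercase alphanumerics + 4 digits (8 characters total).
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode(value):
    """Return True if value is a string matching the assumed Glottocode shape."""
    return isinstance(value, str) and GLOTTOCODE_RE.fullmatch(value) is not None


def check_primary_key(codes):
    """Return (malformed, duplicated) values; both empty means a usable key."""
    malformed = [c for c in codes if not is_glottocode(c)]
    seen, duplicated = set(), set()
    for c in codes:
        if c in seen:
            duplicated.add(c)
        seen.add(c)
    return malformed, sorted(duplicated, key=str)


# The two sample values quoted in the profile.
print(check_primary_key(["aala1237", "aari1239"]))
```

Running the same check on the full column should return two empty lists if the profile's uniqueness and length statistics hold.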
name
text identifier near_unique one_word
The `name` column holds each languoid's display name, with all 19,401 values unique and no nulls or duplicates, so it works as a human-readable label alongside the glottocode key. Entries are terse (mean 9.2 chars, median 7 chars, median 1 word) and 71.7% are single words; the top tokens (`nuclear`, `language`, `central`, `southern`, `western`) look like qualifiers attached to a base name rather than names in their own right. Vocabulary is wide (17,861 distinct words) for such short strings. Treatment: Treat as a unique label; do not feature-engineer directly, but tokenize if used for matching or embedding.
- n: 19,401
- nulls: 0 (0.0%)
- unique: 19,401
- len_min: 1
- len_max: 58
- len_mean: 9.211
- len_median: 7
- len_p95: 20
- word_mean: 1.369
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 17,861
- readability_flesch_mean: 60.53
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7169
- allcaps_rate: 0
- boilerplate_rate: 0
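If names are tokenized for matching, counting tokens across the column is a quick way to surface recurring qualifiers. A minimal sketch using only the standard library; the sample names are invented placeholders built around qualifiers the profile mentions, not rows from the dataset:

```python
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it on runs of non-word characters."""
    return [t for t in re.split(r"[^\w]+", name.lower()) if t]


def qualifier_counts(names, min_count=2):
    """Count tokens that recur across names; these are candidate qualifiers."""
    counts = Counter()
    for name in names:
        counts.update(set(tokenize(name)))  # count each token once per name
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Hypothetical names for illustration only.
sample = ["Nuclear Alpha", "Central Alpha", "Southern Beta", "Central Beta", "Gamma"]
print(qualifier_counts(sample))
```

Counting each token once per name keeps a repeated word inside a single long name from inflating its frequency.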
isocodes
text identifier near_unique one_word null_rate short_text
This column holds 3-character ISO-style codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). It is sparsely populated, with a null rate of 0.5917 (11,479 of 19,401 rows); the 7,922 populated rows carry 7,922 distinct codes with no duplicates (vocab_size=7922, n_duplicates=0). The top_words sample (aiw, aay, aas, kbt…) suggests ISO 639-3 language codes rather than country codes, each appearing only once. Treatment: Treat as a nullable foreign key; left-join to an ISO code lookup and flag the ~59% missing values rather than imputing them, since an identifier cannot be inferred from the other columns.
- n: 19,401
- nulls: 11,479 (59.2%)
- unique: 7,922
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 7,922
- readability_flesch_mean: 118.7
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
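One way to handle the sparse codes is to keep missing values as missing and add an explicit flag when joining to a lookup. A minimal sketch with plain dictionaries; the rows, the lookup table, and the field names `has_iso` and `iso_name` are hypothetical:

```python
def attach_iso(rows, iso_lookup):
    """Left-join rows to iso_lookup on 'isocodes', keeping rows without a code."""
    out = []
    for row in rows:
        code = row.get("isocodes")
        out.append({
            **row,
            "has_iso": code is not None,       # explicit missingness flag
            "iso_name": iso_lookup.get(code),  # None when missing or unmatched
        })
    return out


# Placeholder data, not taken from the dataset.
rows = [
    {"glottocode": "xxxx0001", "isocodes": "xxa"},
    {"glottocode": "xxxx0002", "isocodes": None},
]
iso_lookup = {"xxa": "Example language"}

for r in attach_iso(rows, iso_lookup):
    print(r["glottocode"], r["has_iso"], r["iso_name"])
```

Every input row survives the join, so the row count stays at 19,401 and downstream steps can filter on the flag when they need ISO coverage.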
level
categorical label
A binary categorical flag distinguishing 'dialect' (10,920 rows, 56.3%) from 'language' (8,481 rows, 43.7%). The split is fairly balanced, with an entropy of 0.989 bits out of a maximum of 1 bit, and there are no nulls across all 19,401 rows. Treatment: One-hot or binary-encode for modelling.
- n: 19,401
- nulls: 0 (0.0%)
- unique: 2
- top_value: dialect
- top_rate: 0.5629
- cardinality: 2
- entropy: 0.9886
- entropy_ratio: 0.9886
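The entropy figure can be reproduced from the two counts. A minimal sketch, assuming the profiler reports Shannon entropy in bits (base-2 logarithm) and defines entropy_ratio as entropy divided by log2 of the cardinality; the report does not state either definition:

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


level_counts = {"dialect": 10920, "language": 8481}
h = shannon_entropy_bits(level_counts.values())
ratio = h / math.log2(len(level_counts))  # maximum entropy for 2 classes is 1 bit
print(round(h, 3), round(ratio, 3))
```

With two categories the maximum is exactly 1 bit, which is why entropy and entropy_ratio coincide for this column.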
macroarea
categorical feature
Categorical macro-region label with just 6 values covering the world's linguistic areas (Africa, Eurasia, Papunesia, South America, North America, Australia). Distribution is moderately balanced (entropy ratio 0.84), with Africa leading at 32.1% of non-null rows (30.7% of all rows) and Australia smallest at 602 rows; 4.32% of rows (839) are null. No single category dominates, but the Americas and Australia have far fewer rows than Africa, Eurasia, and Papunesia. Treatment: one-hot or target-encode; impute or flag the ~4% missing before modelling.
- n: 19,401
- nulls: 839 (4.3%)
- unique: 6
- top_value: Africa
- top_rate: 0.3208
- cardinality: 6
- entropy: 2.176
- entropy_ratio: 0.8418
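The two Africa percentages in this report (30.7% in the chart table, 0.3208 as top_rate) differ only in their denominator. A minimal sketch using the counts from the macroarea table, assuming top_rate is computed over non-null rows:

```python
TOTAL_ROWS = 19401
macroarea_counts = {
    "Africa": 5955, "Eurasia": 5028, "Papunesia": 4847,
    "South America": 1095, "North America": 1035, "Australia": 602,
}

non_null = sum(macroarea_counts.values())  # rows with a macroarea
nulls = TOTAL_ROWS - non_null              # rows without one

for area, n in macroarea_counts.items():
    share_all = n / TOTAL_ROWS             # denominator: every row
    share_non_null = n / non_null          # denominator: non-null rows only
    print(f"{area:14s} {share_all:6.1%} {share_non_null:6.1%}")
```

Stating the denominator next to each percentage avoids the apparent mismatch when the two figures are quoted side by side.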
latitude
numeric feature null_rate
Geographic latitude coordinates ranging from -55.2748 to 73.1354, consistent with a worldwide point dataset. Nearly 59% of rows are null, which limits any spatial analysis to the remaining 7,929 rows. Among those, the median of 6.29 and mean of 8.16 indicate a slight northern-hemisphere lean, and the moderate positive skew (0.54) reflects a longer tail toward high northern latitudes. Treatment: Pair with longitude for geospatial features; filter out or flag the 59% nulls before mapping, since an imputed coordinate would place a languoid at an arbitrary point.
- n: 19,401
- nulls: 11,472 (59.1%)
- unique: 7,786
- min: -55.27
- max: 73.14
- mean: 8.164
- median: 6.292
- std: 18.96
- q1: -5.139
- q3: 19.27
- iqr: 24.41
- skew: 0.5425
- kurtosis: 0.3048
- n_outliers: 135
- outlier_rate: 0.01703
- zero_rate: 0
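The n_outliers figure depends on an outlier rule the report does not name. A minimal sketch, assuming the common Tukey rule (values more than 1.5 × IQR beyond the quartiles), computes the fences implied by the reported latitude quartiles:

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1, q3, k=1.5):
    """Count non-null values that fall outside the fences."""
    lo, hi = tukey_fences(q1, q3, k)
    return sum(1 for v in values if v is not None and (v < lo or v > hi))


# Quartiles reported for latitude.
lat_lo, lat_hi = tukey_fences(q1=-5.139, q3=19.27)
print(round(lat_lo, 2), round(lat_hi, 2))
```

Under this assumption, latitudes below about -41.75 or above about 55.88 would be counted as outliers. Such values are plausible high-latitude locations, so they are better reviewed than dropped.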
longitude
numeric feature null_rate
Geographic longitude coordinates spanning nearly the full globe (min -178.785, max 179.306) with a median of 47.57, suggesting a worldwide dataset weighted toward the Eastern Hemisphere. The 59.13% null rate is the dominant concern, since most records lack a location; only 13 outliers (0.16% of non-null values) appear, and skew is mild (-0.48). Treatment: Filter out or flag the 59% missing values before any geospatial modelling; pair with latitude for joint use.
- n: 19,401
- nulls: 11,472 (59.1%)
- unique: 7,745
- min: -178.8
- max: 179.3
- mean: 51.22
- median: 47.57
- std: 81.15
- q1: 7.18
- q3: 124.1
- iqr: 117
- skew: -0.4814
- kurtosis: -0.7765
- n_outliers: 13
- outlier_rate: 0.00164
- zero_rate: 0
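Because latitude and longitude are only useful together, a mapping step can keep just the rows where both are present and report the resulting coverage. A minimal sketch; the sample rows are placeholders, and the coverage figure uses the null count reported for the coordinate columns:

```python
def with_coordinates(rows):
    """Keep rows whose latitude and longitude are both present."""
    return [
        r for r in rows
        if r.get("latitude") is not None and r.get("longitude") is not None
    ]


# Placeholder rows, not taken from the dataset.
rows = [
    {"glottocode": "xxxx0001", "latitude": 6.29, "longitude": 47.57},
    {"glottocode": "xxxx0002", "latitude": None, "longitude": None},
    {"glottocode": "xxxx0003", "latitude": 10.0, "longitude": None},
]
mappable = with_coordinates(rows)
print(len(mappable), "of", len(rows), "rows can be mapped")

# Coverage implied by the reported counts: 11,472 nulls out of 19,401 rows.
coverage = (19401 - 11472) / 19401
print(f"{coverage:.1%}")
```

Filtering on both columns at once also catches any row where only one coordinate is missing, which the per-column null counts alone would not reveal.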