processed word forms
Reading
This dataset contains 25,731 word forms drawn from a single source ('iecor'), each tagged with a concept, language, and cognate identifier: in effect a comparative wordlist across 160 languages and 170 concepts. The 'form' column is mostly single-word entries (94.6% one-word, mean length ~5.4 characters), and about 24.9% of rows repeat a form found elsewhere in the column, suggesting many shared forms across languages. Language coverage is broad and well balanced (entropy ratio ~0.99 across 142 ISO codes). The ISO codes with the most rows are Greek (ell, 522), Slovenian (slv, 509), and Macedonian (mkd, 497); since no single language_name exceeds 178 rows, these totals probably reflect several varieties sharing one code rather than denser sampling of one language. Worth a closer look: the concept distribution is remarkably even (159-170 forms for each of the 20 most frequent concepts, against an average of about 151 per concept), and the language_name distribution is similarly flat, with the top entries (Bakhtiari, Nepali, Greek: Italiot) at 177-178 rows each. The 'source_dataset' column is constant and can be ignored.
citing: row_count · column_count · form.stats.one_word_rate · form.stats.duplicate_rate · form.stats.len_mean · iso_639_3.n_unique · iso_639_3.top_values · concept.n_unique · concept.top_values · language_name.top_values · source_dataset.top_rate
Charts the summary said to look at first
iso_639_3: top 20 values
| value | count | share |
|---|---|---|
| ell | 522 | 2.0% |
| slv | 509 | 2.0% |
| mkd | 497 | 1.9% |
| bre | 353 | 1.4% |
| swe | 347 | 1.3% |
| ces | 346 | 1.3% |
| pol | 345 | 1.3% |
| sdh | 345 | 1.3% |
| src | 343 | 1.3% |
| por | 341 | 1.3% |
| oss | 341 | 1.3% |
| cat | 340 | 1.3% |
| grc | 332 | 1.3% |
| bsh | 289 | 1.1% |
| bqi | 178 | 0.7% |
| nep | 177 | 0.7% |
| osp | 177 | 0.7% |
| pnt | 177 | 0.7% |
| hin | 176 | 0.7% |
| ron | 176 | 0.7% |
concept: top 20 values
| value | count | share |
|---|---|---|
| say | 170 | 0.7% |
| man | 166 | 0.6% |
| big | 163 | 0.6% |
| stone | 163 | 0.6% |
| house | 163 | 0.6% |
| foot | 161 | 0.6% |
| hand | 161 | 0.6% |
| head | 161 | 0.6% |
| see | 161 | 0.6% |
| woman | 161 | 0.6% |
| year | 161 | 0.6% |
| day | 160 | 0.6% |
| good | 160 | 0.6% |
| name | 160 | 0.6% |
| water | 160 | 0.6% |
| do | 160 | 0.6% |
| come | 159 | 0.6% |
| give | 159 | 0.6% |
| know | 159 | 0.6% |
| red | 159 | 0.6% |
language_name: top 20 values
| value | count | share |
|---|---|---|
| Bakhtiari | 178 | 0.7% |
| Nepali | 177 | 0.7% |
| Greek: Italiot | 177 | 0.7% |
| Old Spanish | 177 | 0.7% |
| Greek: Pontic | 177 | 0.7% |
| Breton: Treger | 177 | 0.7% |
| Hindi | 176 | 0.7% |
| Romanian | 176 | 0.7% |
| Greek: Cappadocian | 176 | 0.7% |
| Breton: Gwened | 176 | 0.7% |
| Middle Welsh | 176 | 0.7% |
| Ladin | 175 | 0.7% |
| Old Church Slavonic | 175 | 0.7% |
| Elfdalian | 175 | 0.7% |
| Old Swedish | 175 | 0.7% |
| Welsh: North | 175 | 0.7% |
| Old Polish | 175 | 0.7% |
| Lithuanian | 174 | 0.7% |
| Sinhalese | 174 | 0.7% |
| Urdu | 174 | 0.7% |
form: length distribution (characters)
| chars | count |
|---|---|
| 1 – 3 | 530 |
| 3 – 4 | 9411 |
| 4 – 6 | 6280 |
| 6 – 7 | 6745 |
| 7 – 9 | 1056 |
| 9 – 10 | 823 |
| 10 – 12 | 222 |
| 12 – 13 | 284 |
| 13 – 15 | 100 |
| 15 – 16 | 128 |
| 16 – 18 | 65 |
| 18 – 20 | 14 |
| 20 – 21 | 27 |
| 21 – 23 | 5 |
| 23 – 24 | 7 |
| 24 – 26 | 10 |
| 26 – 27 | 5 |
| 27 – 29 | 2 |
| 29 – 30 | 3 |
| 30 – 32 | 0 |
| 32 – 34 | 4 |
| 34 – 35 | 6 |
| 35 – 37 | 0 |
| 37 – 38 | 0 |
| 38 – 40 | 0 |
| 40 – 41 | 0 |
| 41 – 43 | 0 |
| 43 – 44 | 0 |
| 44 – 46 | 0 |
| 46 – 48 | 0 |
| 48 – 49 | 0 |
| 49 – 51 | 1 |
| 51 – 52 | 1 |
| 52 – 54 | 0 |
| 54 – 55 | 0 |
| 55 – 57 | 0 |
| 57 – 58 | 0 |
| 58 – 60 | 0 |
| 60 – 61 | 0 |
| 61 – 63 | 2 |
cognate_id: value distribution
| bin | count |
|---|---|
| 3 – 252.5 | 3663 |
| 252.5 – 501.9 | 3046 |
| 501.9 – 751.4 | 1164 |
| 751.4 – 1001 | 2020 |
| 1001 – 1250 | 1934 |
| 1250 – 1500 | 677 |
| 1500 – 1749 | 687 |
| 1749 – 1999 | 370 |
| 1999 – 2248 | 629 |
| 2248 – 2498 | 897 |
| 2498 – 2747 | 263 |
| 2747 – 2997 | 436 |
| 2997 – 3246 | 392 |
| 3246 – 3496 | 234 |
| 3496 – 3745 | 129 |
| 3745 – 3995 | 141 |
| 3995 – 4244 | 158 |
| 4244 – 4494 | 105 |
| 4494 – 4743 | 99 |
| 4743 – 4992 | 175 |
| 4992 – 5242 | 1554 |
| 5242 – 5491 | 368 |
| 5491 – 5741 | 274 |
| 5741 – 5990 | 296 |
| 5990 – 6240 | 373 |
| 6240 – 6489 | 695 |
| 6489 – 6739 | 602 |
| 6739 – 6988 | 318 |
| 6988 – 7238 | 281 |
| 7238 – 7487 | 504 |
| 7487 – 7737 | 574 |
| 7737 – 7986 | 263 |
| 7986 – 8236 | 275 |
| 8236 – 8485 | 652 |
| 8485 – 8735 | 273 |
| 8735 – 8984 | 106 |
| 8984 – 9234 | 199 |
| 9234 – 9483 | 285 |
| 9483 – 9733 | 314 |
| 9733 – 9982 | 306 |
Schema
8 columns

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| form | text | 0.0% | 19,334 | one_word, short_text, duplicates |
| language_id | numeric | 0.0% | 160 | |
| language_name | categorical | 0.0% | 160 | |
| glottocode | categorical | 0.0% | 152 | |
| iso_639_3 | categorical | 0.7% | 142 | |
| concept | categorical | 0.0% | 170 | |
| cognate_id | numeric | 0.0% | 4,979 | |
| source_dataset | categorical | 0.0% | 1 | imbalance |
form
text feature one_word short_text duplicates
This column holds short lexical forms, mostly single words: 94.6% are one-word entries, with a mean length of 5.4 characters and a median word count of 1. The vocabulary spans 16,219 distinct words across 25,731 rows, and top values like 'noga', 'pā', 'dūr', 'voda', and 'bitter' suggest a multilingual mix (Slavic, Germanic, and probably Indo-Iranian, given that every language named in this report is Indo-European). Notably, 24.9% of rows (6,397) repeat a form already present in the column, yet no single form dominates: the most frequent ('noga') appears only 24 times. Treatment: Treat as a categorical lexical token and normalize Unicode before embedding; the language_name, glottocode, and iso_639_3 columns already label each row's language, so prefer those over running language detection.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 19,334
- len_min: 1
- len_max: 63
- len_mean: 5.373
- len_median: 5
- len_p95: 9
- word_mean: 1.067
- word_median: 1
- n_empty: 0
- n_duplicates: 6,397
- duplicate_rate: 0.2486
- vocab_size: 16,219
- readability_flesch_mean: 86.62
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9464
- allcaps_rate: 0
- boilerplate_rate: 0
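The Unicode normalization step and the duplicate-rate arithmetic can be sketched as follows. This is a minimal illustration, not the profiler's actual code: `normalize_form`, `duplicate_rate`, and the sample list are hypothetical, and the sketch assumes the reported rate is computed as (n - unique) / n, which is consistent with (25,731 - 19,334) / 25,731 ≈ 0.2486.

```python
import unicodedata
from typing import List


def normalize_form(form: str) -> str:
    """Return the NFC-normalized form with surrounding whitespace removed."""
    return unicodedata.normalize("NFC", form.strip())


def duplicate_rate(forms: List[str]) -> float:
    """Share of rows that repeat an earlier value: (n - unique) / n."""
    if not forms:
        return 0.0
    return (len(forms) - len(set(forms))) / len(forms)


# Hypothetical sample. "pa\u0304" is "a" plus a combining macron, while
# "p\u0101" uses the precomposed character; both render as the same form.
sample = ["noga", "pa\u0304", "p\u0101", "voda", "noga "]
normalized = [normalize_form(f) for f in sample]

print(duplicate_rate(sample))      # raw strings are all distinct
print(duplicate_rate(normalized))  # repeats surface after normalization
```

Normalizing before counting matters here because forms with diacritics can be encoded in more than one way, which would otherwise understate the duplicate rate.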
language_id
numeric foreign_key
Numeric code with 160 distinct values across 25,731 rows and zero nulls, ranging from 3 to 317 with a near-symmetric distribution (skew -0.05) and flat shape (kurtosis -1.47). The flat, wide spread and integer-valued quartiles (65, 174, 266) suggest this is a categorical language identifier stored as an integer rather than a true numeric measurement; the 160 distinct values also match the 160 distinct language_name values. There are no outliers and no zeros, which is consistent with a lookup key. Treatment: Treat as categorical and left-join to a language lookup table; do not model as a continuous variable.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 160
- min: 3
- max: 317
- mean: 166
- median: 174
- std: 101.4
- q1: 65
- q3: 266
- iqr: 201
- skew: -0.04885
- kurtosis: -1.471
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
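A left join on language_id might look like the sketch below. The lookup table, its placeholder names, and the `left_join` helper are assumptions made for illustration; no such lookup table ships with this report.

```python
# Hypothetical lookup table keyed by language_id; the names are placeholders.
language_lookup = {
    3: {"language_label": "Language A"},
    65: {"language_label": "Language B"},
}

# Hypothetical rows; 317 deliberately has no lookup entry.
rows = [
    {"form": "noga", "language_id": 3},
    {"form": "voda", "language_id": 65},
    {"form": "bitter", "language_id": 317},
]


def left_join(rows, lookup, key="language_id"):
    """Keep every input row; attach lookup fields, or None if the key is absent."""
    fields = sorted({name for entry in lookup.values() for name in entry})
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{name: match.get(name) for name in fields}})
    return joined


joined = left_join(rows, language_lookup)
for row in joined:
    print(row)
```

A left join keeps rows whose identifier is missing from the lookup, so gaps show up as None instead of silently shrinking the wordlist.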
language_name
categorical feature
This column holds language names, with 160 distinct values across 25,731 rows and no nulls. The distribution is remarkably flat: entropy_ratio is 0.996 and the most common value 'Bakhtiari' appears just 178 times (0.69%), suggesting a near-uniform sampling of languages rather than a natural population. Several entries use a 'Language: Variety' convention (e.g., 'Greek: Italiot', 'Breton: Treger'), indicating dialect-level granularity mixed with top-level language names. Treatment: Use as a categorical feature; consider splitting on ':' to separate the parent language from the variety before encoding.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 160
- top_value: Bakhtiari
- top_rate: 0.006918
- cardinality: 160
- entropy: 7.294
- entropy_ratio: 0.9962
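Splitting on ':' could be done along these lines. `split_language_name` is a hypothetical helper, and the sketch assumes that a name contains at most one meaningful colon, which holds for the values shown in this report but has not been checked against all 160 names.

```python
from typing import Optional, Tuple


def split_language_name(name: str) -> Tuple[str, Optional[str]]:
    """Split 'Language: Variety' into (language, variety).

    Names without a colon are returned with variety set to None.
    """
    head, sep, tail = name.partition(":")
    if not sep:
        return name.strip(), None
    return head.strip(), tail.strip()


for name in ["Greek: Italiot", "Breton: Treger", "Bakhtiari"]:
    print(split_language_name(name))
```

Using `str.partition` splits only on the first colon, so any further colon would stay inside the variety part.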
glottocode
categorical foreign_key
Glottocodes are Glottolog's stable language identifiers, so this column tags each of the 25,731 rows with one of 152 distinct codes (fewer than the 160 language_name values, so some varieties presumably share a code). The distribution is remarkably flat: the entropy ratio is 0.991 and the most frequent code 'mace1250' covers only 1.93% of rows, with several Slavic and Germanic codes clustered around 340–350 occurrences. There are no nulls, and a visible drop-off after the top seven codes (down to ~177) suggests a tiered sampling design rather than a long tail. Treatment: Left-join on this id to Glottolog metadata, or one-hot/target-encode for modelling.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 152
- top_value: mace1250
- top_rate: 0.01932
- cardinality: 152
- entropy: 7.184
- entropy_ratio: 0.9912
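The entropy ratio quoted for this and the other categorical columns can be reproduced with a short sketch. It assumes the ratio is Shannon entropy in bits divided by log2 of the cardinality, which matches the reported figures (7.184 / log2(152) ≈ 0.991), though the profiler's exact definition is not stated in the report. `entropy_ratio` is a hypothetical helper.

```python
import math
from collections import Counter


def entropy_ratio(values) -> float:
    """Shannon entropy (bits) divided by its maximum, log2(number of distinct values).

    Returns 0.0 for an empty or constant column, where the maximum is zero.
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


# Check against the reported glottocode statistics: 7.184 bits over 152 codes.
print(round(7.184 / math.log2(152), 3))

# A perfectly balanced column reaches the maximum ratio.
print(entropy_ratio(["a", "b", "a", "b"]))
```

A ratio near 1 means the rows are spread almost evenly over the distinct values, which is what the "remarkably flat" descriptions in this report refer to.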
iso_639_3
categorical feature
This column holds ISO 639-3 language codes, with 142 distinct codes spread across 25,731 rows. The distribution is remarkably flat, with an entropy ratio of 0.985 and the top code 'ell' covering only 2.04% of non-null rows (522 of 25,558), so no single language dominates. The null rate is negligible at 0.67% (173 rows). Treatment: Treat as a categorical feature; one-hot or target-encode given the 142 levels, or group rare codes.
- n: 25,731
- nulls: 173 (0.7%)
- unique: 142
- top_value: ell
- top_rate: 0.02042
- cardinality: 142
- entropy: 7.044
- entropy_ratio: 0.9853
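Grouping rare codes and handling the 173 null rows could be sketched as below. The `group_rare` helper, the `min_count` threshold, and the "other" and "missing" labels are all illustrative assumptions; a sensible threshold depends on the downstream model.

```python
from collections import Counter
from typing import List, Optional


def group_rare(codes: List[Optional[str]], min_count: int,
               other: str = "other", missing: str = "missing") -> List[str]:
    """Replace codes seen fewer than min_count times with `other`, and None with `missing`."""
    counts = Counter(code for code in codes if code is not None)
    grouped = []
    for code in codes:
        if code is None:
            grouped.append(missing)
        elif counts[code] < min_count:
            grouped.append(other)
        else:
            grouped.append(code)
    return grouped


# Hypothetical sample: one rare code and one null.
codes = ["ell", "ell", "ell", "slv", "slv", "bqi", None]
print(group_rare(codes, min_count=2))
```

Keeping nulls in their own bucket, separate from rare codes, preserves the distinction between a language without an ISO code and a language with few rows.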
concept
categorical foreign_key
This column holds 170 distinct concept labels (e.g., 'say', 'man', 'big', 'stone', 'house') spread almost perfectly evenly across 25,731 rows, with the top value covering only 0.66% of the data and entropy at 99.98% of the maximum. The vocabulary resembles a Swadesh-style basic concept list, and the near-uniform distribution suggests each concept is recorded roughly once per language: 25,731 rows over 170 concepts is about 151 forms per concept, against 160 languages. The top concept 'say' has 170 forms, more than the number of languages, so at least one language must list more than one form for it. Treatment: Treat as a categorical key; group or pivot on it rather than one-hot encoding given 170 balanced levels.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 170
- top_value: say
- top_rate: 0.006607
- cardinality: 170
- entropy: 7.408
- entropy_ratio: 0.9998
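Pivoting on concept might be sketched as follows. The rows, the placeholder language names, and the `pivot_by_concept` helper are hypothetical. Each cell holds a list on the assumption that a language can have more than one form per concept, as the 170 forms for 'say' across 160 languages imply.

```python
from collections import defaultdict

# Hypothetical rows; language names are placeholders, forms are taken from the
# top values listed in this report but their pairing here is illustrative.
rows = [
    {"language_name": "Language A", "concept": "water", "form": "voda"},
    {"language_name": "Language A", "concept": "foot", "form": "noga"},
    {"language_name": "Language B", "concept": "foot", "form": "pā"},
    {"language_name": "Language B", "concept": "foot", "form": "noga"},
]


def pivot_by_concept(rows):
    """Build {concept: {language_name: [forms]}} from long-format rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["concept"]][row["language_name"]].append(row["form"])
    # Convert nested defaultdicts to plain dicts so missing keys raise KeyError.
    return {concept: dict(langs) for concept, langs in table.items()}


pivot = pivot_by_concept(rows)
print(pivot["foot"])
```

The result is a concept-by-language layout, the usual shape for comparing forms for the same meaning across languages.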
cognate_id
numeric foreign_key
This is almost certainly a cognate group identifier: 4,979 distinct integer values spread across 25,731 rows (roughly 5x repetition) with no nulls and no zeros. Despite being stored as numeric, the wide range (3 to 9,982), moderate skew (0.73), and negative kurtosis (-0.90) suggest these are arbitrary group labels rather than a measured quantity. The lack of outliers is consistent with categorical-style codes packed into the integer space. Treatment: Treat as a categorical group key; join or group-by rather than feeding as a numeric feature.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 4,979
- min: 3
- max: 9,982
- mean: 3,086
- median: 1,610
- std: 3,024
- q1: 411
- q3: 5,640
- iqr: 5,229
- skew: 0.7307
- kurtosis: -0.9048
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
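Grouping by cognate_id, rather than treating it as a number, could look like this sketch. The rows, their cognate ids, and the `cognate_sets` helper are hypothetical; the id is cast to a string only to make the categorical treatment explicit.

```python
from collections import defaultdict

# Hypothetical rows; the cognate ids and language names are placeholders.
rows = [
    {"language_name": "Language A", "form": "noga", "cognate_id": 411},
    {"language_name": "Language B", "form": "noga", "cognate_id": 411},
    {"language_name": "Language C", "form": "pā", "cognate_id": 1610},
]


def cognate_sets(rows):
    """Group (language_name, form) pairs under their cognate_id, used as a string key."""
    sets = defaultdict(list)
    for row in rows:
        sets[str(row["cognate_id"])].append((row["language_name"], row["form"]))
    return dict(sets)


sets = cognate_sets(rows)
sizes = {key: len(members) for key, members in sets.items()}
print(sizes)

# Average set size implied by the report: 25,731 rows over 4,979 ids.
print(round(25731 / 4979, 2))
```

Set sizes and set membership are the meaningful quantities here; arithmetic on the ids themselves, such as means or distances, has no interpretation.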
source_dataset
categorical metadata imbalance
This column is a constant provenance tag identifying the source dataset, with every one of the 25,731 rows labelled "iecor". Cardinality is 1 and entropy is 0, so it carries no discriminative signal. Treatment: Drop before modelling; retain only as a provenance flag.
- n: 25,731
- nulls: 0 (0.0%)
- unique: 1
- top_value: iecor
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
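Detecting and dropping constant columns before modelling can be sketched in a few lines. The helper names and sample rows are hypothetical, and the sketch assumes every row has the same keys.

```python
def constant_columns(rows):
    """Return the names of columns that take a single value across all rows."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    dropped = set(columns)
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


# Hypothetical rows; only source_dataset is constant.
rows = [
    {"form": "noga", "concept": "foot", "source_dataset": "iecor"},
    {"form": "voda", "concept": "water", "source_dataset": "iecor"},
]

constants = constant_columns(rows)
features = drop_columns(rows, constants)
print(constants)
print(features)
```

If the wordlist is later merged with other sources, the original rows still carry source_dataset, so provenance can be recovered.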