parquet cognate sets
Reading
This dataset catalogs 4,981 cognate sets from the IE-CoR source. Each row is identified by a unique cognate_id and carries a 'words' payload, a serialized JSON array of word entries with their languages. The numeric columns language_count and word_count are nearly identical: both are highly right-skewed (skew ~6.84), with a median of 2, a max of 157, and ~13% of rows flagged as outliers, so a small set of cognate groups is dramatically larger than the rest. Three columns (concept, confidence, source_dataset) are constant and carry no analytic signal; concept holds only the empty string. Start by examining the distribution of language_count to understand the long tail of cross-linguistic coverage, then inspect the longest 'words' entries (len_max 14,956 characters) to see which cognate sets dominate.
citing: language_count · word_count · words · cognate_id · concept · confidence · source_dataset
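The two first-pass checks suggested in the summary can be sketched in plain Python. This is a minimal illustration, assuming the table has already been loaded into a list of dicts keyed by the column names in this profile. The helper name `tail_summary` and the toy `rows` are invented for the example and are not part of the dataset.

```python
from collections import Counter

def tail_summary(rows, n=2):
    """Frequency of each language_count value, plus the n longest 'words' payloads."""
    freq = Counter(r["language_count"] for r in rows)
    longest = sorted(rows, key=lambda r: len(r["words"]), reverse=True)[:n]
    return freq, [(r["cognate_id"], len(r["words"])) for r in longest]

# Toy rows (hypothetical values, shaped like the profiled columns)
rows = [
    {"cognate_id": "iecor:1", "language_count": 1, "words": "x" * 90},
    {"cognate_id": "iecor:2", "language_count": 2, "words": "x" * 200},
    {"cognate_id": "iecor:3", "language_count": 2, "words": "x" * 150},
    {"cognate_id": "iecor:4", "language_count": 40, "words": "x" * 4000},
]

freq, longest = tail_summary(rows)
print(sorted(freq.items()))  # how many sets share each language_count
print(longest)               # ids of the largest payloads, longest first
```

On the real table, the frequency counts show how quickly coverage falls off after one or two languages, and the longest payloads point at the cognate sets that dominate the tail.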
Charts the summary said to look at first
language_count distribution (histogram bins)
| bin | count |
|---|---|
| 1 – 4.9 | 3849 |
| 4.9 – 8.8 | 483 |
| 8.8 – 12.7 | 194 |
| 12.7 – 16.6 | 96 |
| 16.6 – 20.5 | 90 |
| 20.5 – 24.4 | 107 |
| 24.4 – 28.3 | 32 |
| 28.3 – 32.2 | 25 |
| 32.2 – 36.1 | 15 |
| 36.1 – 40 | 4 |
| 40 – 43.9 | 8 |
| 43.9 – 47.8 | 8 |
| 47.8 – 51.7 | 6 |
| 51.7 – 55.6 | 5 |
| 55.6 – 59.5 | 3 |
| 59.5 – 63.4 | 3 |
| 63.4 – 67.3 | 8 |
| 67.3 – 71.2 | 3 |
| 71.2 – 75.1 | 3 |
| 75.1 – 79 | 4 |
| 79 – 82.9 | 0 |
| 82.9 – 86.8 | 1 |
| 86.8 – 90.7 | 2 |
| 90.7 – 94.6 | 4 |
| 94.6 – 98.5 | 5 |
| 98.5 – 102.4 | 2 |
| 102.4 – 106.3 | 3 |
| 106.3 – 110.2 | 2 |
| 110.2 – 114.1 | 2 |
| 114.1 – 118 | 3 |
| 118 – 121.9 | 0 |
| 121.9 – 125.8 | 1 |
| 125.8 – 129.7 | 1 |
| 129.7 – 133.6 | 0 |
| 133.6 – 137.5 | 2 |
| 137.5 – 141.4 | 2 |
| 141.4 – 145.3 | 0 |
| 145.3 – 149.2 | 0 |
| 149.2 – 153.1 | 1 |
| 153.1 – 157 | 4 |
word_count distribution (histogram bins)
| bin | count |
|---|---|
| 1 – 4.9 | 3848 |
| 4.9 – 8.8 | 484 |
| 8.8 – 12.7 | 194 |
| 12.7 – 16.6 | 96 |
| 16.6 – 20.5 | 90 |
| 20.5 – 24.4 | 107 |
| 24.4 – 28.3 | 32 |
| 28.3 – 32.2 | 25 |
| 32.2 – 36.1 | 15 |
| 36.1 – 40 | 4 |
| 40 – 43.9 | 8 |
| 43.9 – 47.8 | 8 |
| 47.8 – 51.7 | 6 |
| 51.7 – 55.6 | 5 |
| 55.6 – 59.5 | 3 |
| 59.5 – 63.4 | 3 |
| 63.4 – 67.3 | 8 |
| 67.3 – 71.2 | 3 |
| 71.2 – 75.1 | 3 |
| 75.1 – 79 | 3 |
| 79 – 82.9 | 1 |
| 82.9 – 86.8 | 1 |
| 86.8 – 90.7 | 2 |
| 90.7 – 94.6 | 4 |
| 94.6 – 98.5 | 5 |
| 98.5 – 102.4 | 2 |
| 102.4 – 106.3 | 3 |
| 106.3 – 110.2 | 2 |
| 110.2 – 114.1 | 2 |
| 114.1 – 118 | 3 |
| 118 – 121.9 | 0 |
| 121.9 – 125.8 | 1 |
| 125.8 – 129.7 | 1 |
| 129.7 – 133.6 | 0 |
| 133.6 – 137.5 | 2 |
| 137.5 – 141.4 | 2 |
| 141.4 – 145.3 | 0 |
| 145.3 – 149.2 | 0 |
| 149.2 – 153.1 | 1 |
| 153.1 – 157 | 4 |
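The bin edges in the two tables above are consistent with 40 equal-width bins spanning the observed range 1 to 157, which gives a width of (157 - 1) / 40 = 3.9. A short sketch of that arithmetic, using only numbers from the tables (the variable names are illustrative):

```python
lo, hi, n_bins = 1, 157, 40
width = (hi - lo) / n_bins  # 3.9
edges = [round(lo + i * width, 1) for i in range(n_bins + 1)]

n_rows = 4981
first_bin = 3849  # rows in the lowest bin (1 to 4.9) of the first table
share_first_bin = first_bin / n_rows  # fraction of sets with a count of 1 to 4

print(edges[:3], edges[-1])
print(f"{share_first_bin:.1%}")
```

Because the counts are integers, the lowest bin covers values 1 through 4 and the second bin covers 5 through 8.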
words length distribution in characters (histogram bins)
| chars | count |
|---|---|
| 83 – 455 | 3857 |
| 455 – 827 | 475 |
| 827 – 1198 | 191 |
| 1198 – 1570 | 99 |
| 1570 – 1942 | 93 |
| 1942 – 2314 | 98 |
| 2314 – 2686 | 36 |
| 2686 – 3058 | 27 |
| 3058 – 3429 | 14 |
| 3429 – 3801 | 6 |
| 3801 – 4173 | 8 |
| 4173 – 4545 | 7 |
| 4545 – 4917 | 6 |
| 4917 – 5289 | 4 |
| 5289 – 5660 | 4 |
| 5660 – 6032 | 4 |
| 6032 – 6404 | 4 |
| 6404 – 6776 | 6 |
| 6776 – 7148 | 3 |
| 7148 – 7520 | 4 |
| 7520 – 7891 | 1 |
| 7891 – 8263 | 0 |
| 8263 – 8635 | 3 |
| 8635 – 9007 | 3 |
| 9007 – 9379 | 4 |
| 9379 – 9750 | 2 |
| 9750 – 10122 | 3 |
| 10122 – 10494 | 3 |
| 10494 – 10866 | 2 |
| 10866 – 11238 | 2 |
| 11238 – 11610 | 1 |
| 11610 – 11981 | 0 |
| 11981 – 12353 | 2 |
| 12353 – 12725 | 1 |
| 12725 – 13097 | 1 |
| 13097 – 13469 | 2 |
| 13469 – 13841 | 0 |
| 13841 – 14212 | 0 |
| 14212 – 14584 | 2 |
| 14584 – 14956 | 3 |
source_dataset value counts
| value | count | share |
|---|---|---|
| iecor | 4981 | 100.0% |
Schema
7 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| cognate_id | text | 0.0% | 4,981 | near_unique, one_word, short_text |
| concept | categorical | 0.0% | 1 | imbalance |
| word_count | numeric | 0.0% | 93 | high_skew, outliers |
| language_count | numeric | 0.0% | 94 | high_skew, outliers |
| words | text | 0.0% | 4,963 | near_unique |
| source_dataset | categorical | 0.0% | 1 | imbalance |
| confidence | numeric | 0.0% | 1 | constant |
cognate_id
text identifier near_unique one_word short_text
This is a unique cognate identifier: each of the 4,981 rows carries a distinct single-token value (vocab_size 4,981, one_word_rate 1.0, null rate 0). Values follow a fixed `iecor:` scheme with lengths between 7 and 10 characters, consistent with a namespaced primary key from the IE-CoR lexical database. There is nothing to model here; the column is pure row identity with zero duplicates. Treatment: Use as a join key; drop before modelling.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 4,981
- len_min: 7
- len_max: 10
- len_mean: 9.884
- len_median: 10
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 4,981
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
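Before using cognate_id as a join key, it may be worth re-checking the properties reported here (no duplicates, `iecor:` prefix, lengths 7 to 10). The following is a minimal sketch; `check_key` is a hypothetical helper and the sample identifiers are invented to match the described scheme.

```python
def check_key(ids, prefix="iecor:"):
    """Summarise whether a list of identifiers behaves like a namespaced primary key."""
    ids = list(ids)
    lengths = [len(i) for i in ids]
    return {
        "n": len(ids),
        "n_duplicates": len(ids) - len(set(ids)),
        "all_prefixed": all(i.startswith(prefix) for i in ids),
        "len_min": min(lengths),
        "len_max": max(lengths),
    }

# Hypothetical identifiers following the `iecor:` scheme described above
sample_ids = ["iecor:1", "iecor:25", "iecor:1234"]
report = check_key(sample_ids)
print(report)
```

A non-zero `n_duplicates` or a false `all_prefixed` on the real column would mean the key is not safe to join on as-is.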
concept
categorical metadata imbalance
The 'concept' column is constant: all 4,981 rows hold the empty string, giving cardinality 1, entropy 0, and a top_rate of 1.0. There is no variation to exploit, and no non-empty category was observed. Treatment: Drop the column; it carries no information.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 1
- top_value: (empty string)
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
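Constant columns such as concept (and, per this profile, source_dataset and confidence) can be detected and dropped mechanically. The sketch below assumes rows are held as a list of dicts with hashable values; both helper names and the toy rows are hypothetical.

```python
def constant_columns(rows):
    """Return the column names whose value is identical in every row."""
    return [col for col in rows[0] if len({r[col] for r in rows}) == 1]

def drop_columns(rows, cols):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in r.items() if k not in cols} for r in rows]

# Toy rows (hypothetical values, shaped like the profiled columns)
rows = [
    {"cognate_id": "iecor:1", "concept": "", "language_count": 1,
     "source_dataset": "iecor", "confidence": 1.0},
    {"cognate_id": "iecor:2", "concept": "", "language_count": 3,
     "source_dataset": "iecor", "confidence": 1.0},
]

constant = constant_columns(rows)
slim = drop_columns(rows, constant)
print(constant)
print(slim)
```

Recording the dropped constants somewhere (for example, the "iecor" provenance tag) keeps that information without carrying the columns into modelling.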
word_count
numeric feature high_skew outliers
Counts of word entries per cognate set (presumably the number of items in the 'words' array), with 4,981 non-null integer values ranging from 1 to 157. The distribution is heavily right-skewed (skew 6.84, kurtosis 59.74): the median is 2 and Q3 is 4, yet the max reaches 157, pulling the mean to 5.17 with a std of 12.13. About 13% of rows (649) are flagged as outliers, a long tail of unusually large cognate sets. Treatment: Log-transform or clip before modelling to tame the long right tail.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 93
- min: 1
- max: 157
- mean: 5.168
- median: 2
- std: 12.13
- q1: 1
- q3: 4
- iqr: 3
- skew: 6.837
- kurtosis: 59.74
- n_outliers: 649
- outlier_rate: 0.1303
- zero_rate: 0
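The outlier count is consistent with the common 1.5 × IQR (Tukey fence) rule, although the profile does not state which rule was used, so treat that as an assumption. With Q1 = 1 and Q3 = 4 the upper fence is 4 + 1.5 × 3 = 8.5, so any count of 9 or more is flagged. The histogram tables earlier in this report put 4,332 rows in the two lowest bins (values up to 8), leaving 4,981 - 4,332 = 649 rows above the fence, which matches n_outliers. A sketch of the arithmetic:

```python
q1, q3, n_rows = 1, 4, 4981
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # -3.5, below the minimum of 1, so no low outliers
upper_fence = q3 + 1.5 * iqr  # 8.5

# Two lowest bins of the second histogram table (1 to 4.9 and 4.9 to 8.8)
rows_up_to_8 = 3848 + 484
n_outliers = n_rows - rows_up_to_8
outlier_rate = n_outliers / n_rows

print(upper_fence, n_outliers, round(outlier_rate, 4))
```

All flagged rows sit on the high side, so "outlier" here simply means a cognate set with nine or more entries.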
language_count
numeric feature high_skew outliers
`language_count` is a positive integer feature counting languages per cognate set, ranging from 1 to 157 with a median of 2 and a Q3 of 4. The distribution is severely right-skewed (skew 6.84, kurtosis 59.77), with 649 outliers (13.0% outlier rate) stretching the mean to 5.17 against a std of 12.13. There are no nulls or zeros, and only 94 distinct values across 4,981 rows. Treatment: Log1p-transform or cap at a high quantile before modelling to tame the heavy right tail.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 94
- min: 1
- max: 157
- mean: 5.166
- median: 2
- std: 12.13
- q1: 1
- q3: 4
- iqr: 3
- skew: 6.838
- kurtosis: 59.77
- n_outliers: 649
- outlier_rate: 0.1303
- zero_rate: 0
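The suggested treatment (cap at a high quantile, then log1p) might look like the sketch below. The nearest-rank quantile and the choice of `q` are assumptions made for illustration, not something the profile prescribes, and the toy counts are invented.

```python
import math

def cap_and_log1p(values, q=0.99):
    """Cap values at the q-th empirical quantile (nearest rank), then apply log1p."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, math.ceil(q * len(ordered)) - 1)
    cap = ordered[max(rank, 0)]
    return cap, [math.log1p(min(v, cap)) for v in values]

# Toy counts with a long right tail (hypothetical, not the real column)
counts = [1, 1, 1, 2, 2, 2, 3, 4, 20, 157]
cap, transformed = cap_and_log1p(counts, q=0.9)
print(cap)
print([round(t, 3) for t in transformed])
```

Capping limits the influence of the few very large sets, while log1p compresses the remaining range; since the column has no zeros, plain log would also be defined, but log1p matches the treatment named above.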
words
text free_text near_unique
This column holds serialized JSON arrays of word entries, each carrying fields like "form", "language", "iso_639_3", and "glottocode"; every one of the 4,981 rows starts with `[{"form":`. Values are nearly unique (4,963 distinct out of 4,981, i.e. 18 duplicates), and lengths vary widely from 83 to 14,956 characters (mean 499, median 184), indicating variable-size nested records rather than free prose. The top tokens reveal a multilingual etymology dataset spanning Greek, Armenian, Albanian, and "Old" (possibly from Old Iranian or Old English language names), so the Flesch readability score (48.4) is meaningless here. Treatment: Parse as JSON and explode into a child table of word entries before any modelling.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 4,963
- len_min: 83
- len_max: 14,956
- len_mean: 498.9
- len_median: 184
- len_p95: 1,988
- word_mean: 44.53
- word_median: 16
- n_empty: 0
- n_duplicates: 18
- duplicate_rate: 0.003614
- vocab_size: 12,094
- readability_flesch_mean: 48.43
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
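The "parse and explode" treatment can be sketched with the standard `json` module. This assumes every payload is valid JSON holding an array of objects, which the profile suggests but does not guarantee. The helper name `explode_words` is hypothetical, and the payload values are placeholders that only reuse the field names reported above.

```python
import json

def explode_words(rows):
    """Flatten each row's JSON 'words' array into one child record per word entry."""
    child = []
    for row in rows:
        for position, entry in enumerate(json.loads(row["words"])):
            # Parent keys go last so they win if an entry reuses the same name
            child.append({**entry, "cognate_id": row["cognate_id"], "position": position})
    return child

# Hypothetical payload: field names from the profile, placeholder values
rows = [{
    "cognate_id": "iecor:1",
    "words": json.dumps([
        {"form": "form_a", "language": "Language A", "iso_639_3": "aaa", "glottocode": "aaaa1234"},
        {"form": "form_b", "language": "Language B", "iso_639_3": "bbb", "glottocode": "bbbb1234"},
    ]),
}]

child = explode_words(rows)
languages_per_set = len({c["language"] for c in child})
print(child[0])
print(len(child), languages_per_set)
```

Counting child records and distinct languages per cognate_id would give a way to cross-check word_count and language_count, assuming those columns were derived from this payload.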
source_dataset
categorical metadata imbalance
This column records dataset provenance, but every one of the 4,981 rows carries the single value "iecor". With cardinality 1 and entropy 0, it conveys no information and serves only as a constant tag. Treatment: Drop before modelling; retain only as a provenance note.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 1
- top_value: iecor
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
confidence
numeric metadata constant
This column is labelled 'confidence' and appears to be a numeric score, but every one of the 4,981 rows holds the value 1.0, so the standard deviation is zero and the column was flagged constant. It is likely an upstream default or placeholder rather than a measured confidence. Treatment: Drop; a constant column has no variance to model.
- n: 4,981
- nulls: 0 (0.0%)
- unique: 1
- min: 1
- max: 1
- mean: 1
- median: 1
- std: 0
- q1: 1
- q3: 1
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0