processed cognate sets
Reading
This dataset contains 4,981 cognate sets, all from the 'iecor' source_dataset, each identified by a unique cognate_id. The two main numeric signals are language_count and word_count, whose distributions are nearly identical: both have a median of 2 and a mean of about 5.17, but reach a maximum of 157, with skew above 6.8 and roughly 13% of rows flagged as outliers. That long tail is the most interesting feature: most cognate sets are small, but a minority span very many languages and words and deserve a closer look. Note that concept is empty for every row, confidence is constant at 1.0, and source_dataset has only one value, so those columns carry no analytic signal.
citing: row_count · column_count · language_count · word_count · concept · confidence · source_dataset · cognate_id
Charts the summary said to look at first
language_count (histogram data)
| bin | count |
|---|---|
| 1 – 4.9 | 3849 |
| 4.9 – 8.8 | 483 |
| 8.8 – 12.7 | 194 |
| 12.7 – 16.6 | 96 |
| 16.6 – 20.5 | 90 |
| 20.5 – 24.4 | 107 |
| 24.4 – 28.3 | 32 |
| 28.3 – 32.2 | 25 |
| 32.2 – 36.1 | 15 |
| 36.1 – 40 | 4 |
| 40 – 43.9 | 8 |
| 43.9 – 47.8 | 8 |
| 47.8 – 51.7 | 6 |
| 51.7 – 55.6 | 5 |
| 55.6 – 59.5 | 3 |
| 59.5 – 63.4 | 3 |
| 63.4 – 67.3 | 8 |
| 67.3 – 71.2 | 3 |
| 71.2 – 75.1 | 3 |
| 75.1 – 79 | 4 |
| 79 – 82.9 | 0 |
| 82.9 – 86.8 | 1 |
| 86.8 – 90.7 | 2 |
| 90.7 – 94.6 | 4 |
| 94.6 – 98.5 | 5 |
| 98.5 – 102.4 | 2 |
| 102.4 – 106.3 | 3 |
| 106.3 – 110.2 | 2 |
| 110.2 – 114.1 | 2 |
| 114.1 – 118 | 3 |
| 118 – 121.9 | 0 |
| 121.9 – 125.8 | 1 |
| 125.8 – 129.7 | 1 |
| 129.7 – 133.6 | 0 |
| 133.6 – 137.5 | 2 |
| 137.5 – 141.4 | 2 |
| 141.4 – 145.3 | 0 |
| 145.3 – 149.2 | 0 |
| 149.2 – 153.1 | 1 |
| 153.1 – 157 | 4 |
word_count (histogram data)
| bin | count |
|---|---|
| 1 – 4.9 | 3848 |
| 4.9 – 8.8 | 484 |
| 8.8 – 12.7 | 194 |
| 12.7 – 16.6 | 96 |
| 16.6 – 20.5 | 90 |
| 20.5 – 24.4 | 107 |
| 24.4 – 28.3 | 32 |
| 28.3 – 32.2 | 25 |
| 32.2 – 36.1 | 15 |
| 36.1 – 40 | 4 |
| 40 – 43.9 | 8 |
| 43.9 – 47.8 | 8 |
| 47.8 – 51.7 | 6 |
| 51.7 – 55.6 | 5 |
| 55.6 – 59.5 | 3 |
| 59.5 – 63.4 | 3 |
| 63.4 – 67.3 | 8 |
| 67.3 – 71.2 | 3 |
| 71.2 – 75.1 | 3 |
| 75.1 – 79 | 3 |
| 79 – 82.9 | 1 |
| 82.9 – 86.8 | 1 |
| 86.8 – 90.7 | 2 |
| 90.7 – 94.6 | 4 |
| 94.6 – 98.5 | 5 |
| 98.5 – 102.4 | 2 |
| 102.4 – 106.3 | 3 |
| 106.3 – 110.2 | 2 |
| 110.2 – 114.1 | 2 |
| 114.1 – 118 | 3 |
| 118 – 121.9 | 0 |
| 121.9 – 125.8 | 1 |
| 125.8 – 129.7 | 1 |
| 129.7 – 133.6 | 0 |
| 133.6 – 137.5 | 2 |
| 137.5 – 141.4 | 2 |
| 141.4 – 145.3 | 0 |
| 145.3 – 149.2 | 0 |
| 149.2 – 153.1 | 1 |
| 153.1 – 157 | 4 |
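The bin edges in the two histograms above are consistent with 40 equal-width bins over the observed range 1 to 157 (width 3.9). A minimal sketch of that binning, assuming plain equal-width bins with the maximum folded into the last bin; the function names are illustrative, not the profiler's API:

```python
def bin_edges(lo, hi, n_bins):
    """Return n_bins + 1 equally spaced edges from lo to hi."""
    return [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]


def bin_index(value, lo, hi, n_bins):
    """Map a value to its 0-based bin; the maximum falls in the last bin."""
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


edges = bin_edges(1, 157, 40)

# Integer counts 1..4 land in the first bin (1 to 4.9), 5..8 in the second.
first_bin = [v for v in range(1, 10) if bin_index(v, 1, 157, 40) == 0]
```

Because the first bin covers counts 1 through 4, it holds roughly 77% of all rows in both histograms.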
source_dataset (value counts)
| value | count | share |
|---|---|---|
| iecor | 4981 | 100.0% |
cognate_id length in characters (histogram data; empty sub-bins omitted)
| chars | count |
|---|---|
| 7 | 5 |
| 8 | 44 |
| 9 | 477 |
| 10 | 4455 |
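As a sanity check, the len_mean reported for cognate_id can be recovered from the four non-empty length buckets above. A small worked calculation; the dictionary simply restates the non-zero counts from the table:

```python
# Non-zero buckets from the cognate_id length histogram: length -> count.
length_counts = {7: 5, 8: 44, 9: 477, 10: 4455}

n = sum(length_counts.values())
len_mean = sum(length * count for length, count in length_counts.items()) / n
```

The counts sum to 4,981 and the weighted mean rounds to 9.884, which agrees with the profile.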
Schema
8 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| cognate_id | text | 0.0% | 4,981 | near_unique, one_word, short_text |
| concept | categorical | 0.0% | 1 | imbalance |
| words | unknown | 0.0% | — | skipped |
| source_dataset | categorical | 0.0% | 1 | imbalance |
| confidence | numeric | 0.0% | 1 | constant |
| word_count | numeric | 0.0% | 93 | high_skew, outliers |
| language_count | numeric | 0.0% | 94 | high_skew, outliers |
| sources | unknown | 0.0% | — | skipped |
cognate_id
text · identifier · near_unique · one_word · short_text
This is a primary identifier column: every one of the 4,981 values is unique, non-null, and single-token, and each follows an `iecor:` prefix pattern (7 to 10 characters long). With n_unique == n and a duplicate_rate of 0, it serves as a cognate-set key from the IECoR resource rather than a modelling feature. Treatment: Use as the join key to cognate metadata; exclude from model features.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- 4,981
- len_min
- 7
- len_max
- 10
- len_mean
- 9.884
- len_median
- 10
- len_p95
- 10
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 4,981
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
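Before using cognate_id as a join key, it may be worth re-checking the two properties the profile relies on: the `iecor:` prefix and uniqueness. A minimal sketch, assuming the part after the prefix is a run of one to four digits (the profile only confirms the prefix and the 7 to 10 character length); the sample IDs are made up for illustration:

```python
import re
from collections import Counter

# Assumed shape: "iecor:" followed by 1-4 digits, giving lengths 7-10.
ID_PATTERN = re.compile(r"iecor:\d{1,4}")


def check_ids(ids):
    """Return (malformed ids, duplicated ids) for a list of cognate_id strings."""
    malformed = [i for i in ids if not ID_PATTERN.fullmatch(i)]
    duplicated = [i for i, c in Counter(ids).items() if c > 1]
    return malformed, duplicated


sample_ids = ["iecor:7", "iecor:42", "iecor:1234", "iecor:42", "IECoR:9"]
malformed, duplicated = check_ids(sample_ids)
```

On the real column both lists would be expected to come back empty if the digit-suffix assumption holds.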
concept
categorical · other · imbalance
The 'concept' column is a constant categorical field: all 4,981 rows hold the same empty-string value, giving cardinality 1 and entropy 0. It carries no information and was flagged for imbalance with a top_rate of 1.0. Treatment: Drop; the column is constant and contributes no signal.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
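For reference, an entropy of 0 is what Shannon entropy gives for any single-valued column. A sketch under the assumption that entropy is reported in bits and that entropy_ratio is entropy divided by log2 of the cardinality; neither definition is stated in the profile:

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical value distribution."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def entropy_ratio(values):
    """Entropy normalised by its maximum, log2(cardinality); 0 if constant."""
    k = len(set(values))
    return shannon_entropy(values) / math.log2(k) if k > 1 else 0.0


constant = [""] * 4981       # like 'concept': a single empty-string value
balanced = ["a", "b"] * 10   # two equally frequent values, for contrast
```

Under these assumptions the constant column scores 0 on both measures, while the balanced two-value column scores 1 bit and a ratio of 1.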
words
unknown · other · skipped
The column is named "words", but saturn skipped profiling it, so its kind is unknown and no descriptive statistics were computed. The only confirmed facts are 4,981 rows and a null rate of 0.0; uniqueness, type, and value distribution are all unavailable. Treatment: Re-profile or manually inspect this column before any downstream use, since saturn skipped it.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- —
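A manual inspection could start by tallying what kinds of values the column actually holds. A minimal sketch, assuming rows are available as Python dicts and that 'words' may hold nested values such as lists (the profile does not say what it contains; the sample rows are invented):

```python
from collections import Counter


def summarise_column(rows, column):
    """Count Python type names in a column and, for containers, the length range."""
    types = Counter(type(row[column]).__name__ for row in rows)
    sizes = [len(row[column]) for row in rows
             if isinstance(row[column], (list, tuple, dict, set))]
    size_range = (min(sizes), max(sizes)) if sizes else None
    return types, size_range


rows = [
    {"words": ["pater", "father"]},   # hypothetical list-valued cell
    {"words": ["mater"]},
    {"words": "unparsed string"},     # hypothetical stray string cell
]
types, size_range = summarise_column(rows, "words")
```

If the cells do turn out to be lists, comparing their lengths against word_count would be a natural follow-up check.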
source_dataset
categorical · metadata · imbalance
This column records the originating dataset for each row, but all 4,981 records hold the single value "iecor". Cardinality is 1 and entropy is 0, so the field provides no information for modelling or grouping. Treatment: Drop; constant column with no discriminative signal.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- iecor
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
confidence
numeric · metadata · constant
The 'confidence' column is a constant numeric field: all 4,981 rows hold the value 1.0, with zero standard deviation and a single unique value. It carries no information for downstream modelling and likely reflects a default or hard-coded score rather than a measured probability. Treatment: Drop; constant column with no variance.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 1
- max
- 1
- mean
- 1
- median
- 1
- std
- 0
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
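The three constant columns (concept, source_dataset, confidence) can be detected and dropped mechanically. A minimal sketch over rows held as plain dicts; the helper names and sample rows are illustrative only:

```python
def constant_columns(rows):
    """Return names of columns that hold one distinct value across all rows."""
    columns = rows[0].keys()
    # repr() lets unhashable cells (e.g. lists) take part in the comparison.
    return [c for c in columns if len({repr(r[c]) for r in rows}) == 1]


def drop_columns(rows, to_drop):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in r.items() if k not in to_drop} for r in rows]


rows = [
    {"cognate_id": "iecor:1", "concept": "", "confidence": 1.0, "word_count": 3},
    {"cognate_id": "iecor:2", "concept": "", "confidence": 1.0, "word_count": 1},
]
constant = constant_columns(rows)
slim = drop_columns(rows, constant)
```

Detecting constants from the data rather than hard-coding the column names keeps the step valid if a later export populates concept or adds a second source.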
word_count
numeric · feature · high_skew · outliers
The number of words per cognate set, ranging from 1 to 157 with a median of 2 and an IQR of 3. The distribution is severely right-skewed (skew 6.84, kurtosis 59.74), with 649 outliers (13.0% of rows) pulling the mean up to 5.17 against a standard deviation of 12.13. Most sets contain only one or a few words, while a long tail of very large sets dominates the variance. Treatment: Apply a log1p transform before modelling to tame the heavy right tail.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- 93
- min
- 1
- max
- 157
- mean
- 5.168
- median
- 2
- std
- 12.13
- q1
- 1
- q3
- 4
- iqr
- 3
- skew
- 6.837
- kurtosis
- 59.74
- n_outliers
- 649
- outlier_rate
- 0.1303
- zero_rate
- 0
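The outlier count is consistent with Tukey's 1.5 × IQR rule: with q1 = 1 and q3 = 4 the upper fence is 4 + 1.5 × 3 = 8.5, and the histogram bins above 8.8 sum to 649 rows. The profile does not state which rule it uses, so treat this as an inference. A sketch of the fence and the suggested log1p transform, on made-up sample values:

```python
import math


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(1, 4)   # quartiles taken from the profile

sample = [1, 2, 2, 4, 9, 157]       # illustrative values, not real rows
outliers = [v for v in sample if v < lower or v > upper]

# log1p(x) = ln(1 + x) compresses the right tail while keeping order.
logged = [math.log1p(v) for v in sample]
```

Since the minimum count is 1, a plain log would also be defined here; log1p is simply the transform the profile recommends.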
language_count
numeric · feature · high_skew · outliers
The number of languages per cognate set, ranging from 1 to 157 with a median of just 2 and an IQR of 3. The distribution is severely right-skewed (skew 6.84, kurtosis 59.77), with 649 outliers (13.0% of rows): a minority of sets span dozens of languages while most span only a handful. Treatment: Log-transform or cap before modelling to tame the heavy tail.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- 94
- min
- 1
- max
- 157
- mean
- 5.166
- median
- 2
- std
- 12.13
- q1
- 1
- q3
- 4
- iqr
- 3
- skew
- 6.838
- kurtosis
- 59.77
- n_outliers
- 649
- outlier_rate
- 0.1303
- zero_rate
- 0
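Capping is the alternative to a log transform mentioned in the treatment note. A minimal sketch that caps at a nearest-rank percentile; the 90th percentile and the sample values are arbitrary choices for illustration, not a recommendation from the profile:

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cap_upper(values, pct=95):
    """Clip values above the chosen percentile; return (capped values, cap)."""
    cap = percentile_nearest_rank(values, pct)
    return [min(v, cap) for v in values], cap


sample = [1, 1, 2, 2, 3, 4, 5, 9, 40, 157]   # illustrative values only
capped, cap = cap_upper(sample, pct=90)
```

Unlike a log transform, capping leaves the bulk of the distribution untouched and only flattens the extreme tail, at the cost of discarding how large the largest sets are.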
sources
unknown · other · skipped
The 'sources' column was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. We can only confirm that it has 4,981 rows with a null rate of 0.0 and no recorded unique count. Without further evidence, its content and structure cannot be characterised. Treatment: Re-profile or manually inspect this column before deciding on downstream handling.
- n
- 4,981
- nulls
- 0 (0.0%)
- unique
- —