quirky silence data
Reading
This dataset catalogs 6,998 world languages, each with a name (n), speaker population (p), geographic coordinates (lat/lng), a numeric score (s), and an endangerment status (ss). The most striking feature is the extreme skew in speaker population: the median language has just 11,000 speakers but the mean is over 1.1 million, with a max near 965 million and roughly 13% of entries showing zero speakers — a tell-tale signature of dying or dormant languages. The endangerment status field is also worth a close look: only about 44% of languages are 'safe', while the remaining categories span from 'vulnerable' all the way to 'extinct' (219 cases). Geography is broadly distributed (latitude centered near the tropics, longitude spanning the globe), so the dataset supports both statistical and map-based exploration.
citing: row_count · columns.p.stats.median · columns.p.stats.mean · columns.p.stats.max · columns.p.stats.zero_rate · columns.p.stats.skew · columns.ss.top_rate · columns.ss.top_values · columns.lat.stats.median · columns.lng.stats.median · columns.s.n_unique
Charts the summary said to look at first
ss: counts and shares by endangerment status
| value | count | share |
|---|---|---|
| safe | 3074 | 43.9% |
| definitely endangered | 1753 | 25.1% |
| vulnerable | 1160 | 16.6% |
| severely endangered | 374 | 5.3% |
| critically endangered | 327 | 4.7% |
| extinct | 219 | 3.1% |
| unknown | 91 | 1.3% |
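The shares in this table, and the entropy figures reported for 'ss' in the schema section, can be re-derived from the counts alone. The sketch below assumes the profiler uses Shannon entropy in bits and normalizes by log2 of the number of categories; the report does not state either choice, so treat both as assumptions.

```python
import math

# Counts copied from the 'ss' table in this report.
counts = {
    "safe": 3074,
    "definitely endangered": 1753,
    "vulnerable": 1160,
    "severely endangered": 374,
    "critically endangered": 327,
    "extinct": 219,
    "unknown": 91,
}

total = sum(counts.values())  # expected to equal the row count, 6,998
shares = {label: c / total for label, c in counts.items()}

# Assumption: Shannon entropy in bits, normalized by the maximum
# possible entropy for this many categories (log2 of 7).
entropy = -sum(p * math.log2(p) for p in shares.values())
entropy_ratio = entropy / math.log2(len(counts))

for label, share in shares.items():
    print(f"{label:<24}{share:6.1%}")
print(f"entropy={entropy:.3f} bits, ratio={entropy_ratio:.4f}")
```

Under these assumptions the results land close to the reported entropy (2.122) and entropy_ratio (0.7557), which suggests the profiler uses the same definitions.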
p: histogram of speaker population (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 2.411e+07 | 6943 |
| 2.411e+07 – 4.823e+07 | 28 |
| 4.823e+07 – 7.234e+07 | 5 |
| 7.234e+07 – 9.646e+07 | 12 |
| 9.646e+07 – 1.206e+08 | 0 |
| 1.206e+08 – 1.447e+08 | 3 |
| 1.447e+08 – 1.688e+08 | 1 |
| 1.688e+08 – 1.929e+08 | 0 |
| 1.929e+08 – 2.17e+08 | 0 |
| 2.17e+08 – 2.411e+08 | 2 |
| 2.411e+08 – 2.653e+08 | 0 |
| 2.653e+08 – 2.894e+08 | 0 |
| 2.894e+08 – 3.135e+08 | 0 |
| 3.135e+08 – 3.376e+08 | 0 |
| 3.376e+08 – 3.617e+08 | 0 |
| 3.617e+08 – 3.858e+08 | 1 |
| 3.858e+08 – 4.099e+08 | 0 |
| 4.099e+08 – 4.34e+08 | 0 |
| 4.34e+08 – 4.582e+08 | 1 |
| 4.582e+08 – 4.823e+08 | 0 |
| 4.823e+08 – 5.064e+08 | 0 |
| 5.064e+08 – 5.305e+08 | 0 |
| 5.305e+08 – 5.546e+08 | 0 |
| 5.546e+08 – 5.787e+08 | 0 |
| 5.787e+08 – 6.028e+08 | 0 |
| 6.028e+08 – 6.27e+08 | 0 |
| 6.27e+08 – 6.511e+08 | 0 |
| 6.511e+08 – 6.752e+08 | 0 |
| 6.752e+08 – 6.993e+08 | 0 |
| 6.993e+08 – 7.234e+08 | 1 |
| 7.234e+08 – 7.475e+08 | 0 |
| 7.475e+08 – 7.716e+08 | 0 |
| 7.716e+08 – 7.958e+08 | 0 |
| 7.958e+08 – 8.199e+08 | 0 |
| 8.199e+08 – 8.44e+08 | 0 |
| 8.44e+08 – 8.681e+08 | 0 |
| 8.681e+08 – 8.922e+08 | 0 |
| 8.922e+08 – 9.163e+08 | 0 |
| 9.163e+08 – 9.404e+08 | 0 |
| 9.404e+08 – 9.646e+08 | 1 |
lat: histogram of latitude (40 equal-width bins)
| bin | count |
|---|---|
| -55.27 – -52.06 | 1 |
| -52.06 – -48.85 | 1 |
| -48.85 – -45.64 | 1 |
| -45.64 – -42.43 | 0 |
| -42.43 – -39.22 | 2 |
| -39.22 – -36.01 | 5 |
| -36.01 – -32.8 | 9 |
| -32.8 – -29.59 | 13 |
| -29.59 – -26.38 | 23 |
| -26.38 – -23.17 | 48 |
| -23.17 – -19.96 | 100 |
| -19.96 – -16.75 | 101 |
| -16.75 – -13.54 | 217 |
| -13.54 – -10.33 | 202 |
| -10.33 – -7.116 | 461 |
| -7.116 – -3.906 | 749 |
| -3.906 – -0.6958 | 640 |
| -0.6958 – 2.514 | 344 |
| 2.514 – 5.725 | 434 |
| 5.725 – 8.935 | 609 |
| 8.935 – 12.15 | 676 |
| 12.15 – 15.36 | 264 |
| 15.36 – 18.57 | 368 |
| 18.57 – 21.78 | 210 |
| 21.78 – 24.99 | 282 |
| 24.99 – 28.2 | 320 |
| 28.2 – 31.41 | 131 |
| 31.41 – 34.62 | 124 |
| 34.62 – 37.83 | 148 |
| 37.83 – 41.04 | 81 |
| 41.04 – 44.25 | 107 |
| 44.25 – 47.46 | 65 |
| 47.46 – 50.67 | 68 |
| 50.67 – 53.88 | 68 |
| 53.88 – 57.09 | 35 |
| 57.09 – 60.3 | 20 |
| 60.3 – 63.51 | 34 |
| 63.51 – 66.72 | 18 |
| 66.72 – 69.93 | 13 |
| 69.93 – 73.14 | 6 |
s: histogram of score (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 0.125 | 250 |
| 0.125 – 0.25 | 0 |
| 0.25 – 0.375 | 0 |
| 0.375 – 0.5 | 0 |
| 0.5 – 0.625 | 0 |
| 0.625 – 0.75 | 0 |
| 0.75 – 0.875 | 0 |
| 0.875 – 1 | 0 |
| 1 – 1.125 | 328 |
| 1.125 – 1.25 | 0 |
| 1.25 – 1.375 | 0 |
| 1.375 – 1.5 | 0 |
| 1.5 – 1.625 | 0 |
| 1.625 – 1.75 | 0 |
| 1.75 – 1.875 | 0 |
| 1.875 – 2 | 0 |
| 2 – 2.125 | 383 |
| 2.125 – 2.25 | 0 |
| 2.25 – 2.375 | 0 |
| 2.375 – 2.5 | 0 |
| 2.5 – 2.625 | 0 |
| 2.625 – 2.75 | 0 |
| 2.75 – 2.875 | 0 |
| 2.875 – 3 | 0 |
| 3 – 3.125 | 1768 |
| 3.125 – 3.25 | 0 |
| 3.25 – 3.375 | 0 |
| 3.375 – 3.5 | 0 |
| 3.5 – 3.625 | 0 |
| 3.625 – 3.75 | 0 |
| 3.75 – 3.875 | 0 |
| 3.875 – 4 | 0 |
| 4 – 4.125 | 1176 |
| 4.125 – 4.25 | 0 |
| 4.25 – 4.375 | 0 |
| 4.375 – 4.5 | 0 |
| 4.5 – 4.625 | 0 |
| 4.625 – 4.75 | 0 |
| 4.75 – 4.875 | 0 |
| 4.875 – 5 | 3093 |
lng: histogram of longitude (40 equal-width bins)
| bin | count |
|---|---|
| -178.8 – -169.8 | 11 |
| -169.8 – -160.9 | 4 |
| -160.9 – -151.9 | 9 |
| -151.9 – -143 | 11 |
| -143 – -134 | 10 |
| -134 – -125.1 | 15 |
| -125.1 – -116.1 | 94 |
| -116.1 – -107.2 | 37 |
| -107.2 – -98.21 | 71 |
| -98.21 – -89.26 | 257 |
| -89.26 – -80.31 | 50 |
| -80.31 – -71.35 | 176 |
| -71.35 – -62.4 | 150 |
| -62.4 – -53.45 | 112 |
| -53.45 – -44.5 | 48 |
| -44.5 – -35.54 | 21 |
| -35.54 – -26.59 | 0 |
| -26.59 – -17.64 | 3 |
| -17.64 – -8.687 | 95 |
| -8.687 – 0.265 | 261 |
| 0.265 – 9.217 | 420 |
| 9.217 – 18.17 | 687 |
| 18.17 – 27.12 | 297 |
| 27.12 – 36.07 | 402 |
| 36.07 – 45.03 | 211 |
| 45.03 – 53.98 | 113 |
| 53.98 – 62.93 | 27 |
| 62.93 – 71.88 | 73 |
| 71.88 – 80.84 | 212 |
| 80.84 – 89.79 | 191 |
| 89.79 – 98.74 | 242 |
| 98.74 – 107.7 | 386 |
| 107.7 – 116.6 | 202 |
| 116.6 – 125.6 | 437 |
| 125.6 – 134.5 | 266 |
| 134.5 – 143.5 | 512 |
| 143.5 – 152.5 | 597 |
| 152.5 – 161.4 | 106 |
| 161.4 – 170.4 | 170 |
| 170.4 – 179.3 | 12 |
Schema
6 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| n | text | 0.0% | 6,998 | near_unique, one_word |
| lat | numeric | 0.0% | 4,048 | |
| lng | numeric | 0.0% | 5,560 | |
| p | numeric | 0.0% | 1,627 | high_skew, outliers |
| s | numeric | 0.0% | 6 | |
| ss | categorical | 0.0% | 7 | |
n
text identifier · near_unique · one_word
Column 'n' holds short text labels, one per row, with all 6,998 values unique and no nulls. About 73% are single words (word_mean 1.37, len_mean ~9), and the top tokens ('language', 'sign', 'zapotec', 'mixtec', 'naga', plus directional modifiers) strongly suggest these are language names (e.g., 'Southern Zapotec', 'sign language' variants). The vocab_size (7,003) slightly exceeds the row count, consistent with mostly distinct names that share a small set of recurring qualifiers. Treatment: Treat as a name key; left-join on this rather than using it as a model feature.
- n: 6,998
- nulls: 0 (0.0%)
- unique: 6,998
- len_min: 1
- len_max: 43
- len_mean: 8.972
- len_median: 7
- len_p95: 21
- word_mean: 1.369
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 7,003
- readability_flesch_mean: 57.13
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7314
- allcaps_rate: 0
- boilerplate_rate: 0
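The treatment for 'n' recommends using it as a join key. A minimal sketch of such a left join follows, in plain Python. The second table, its `family` field, and all row values are hypothetical placeholders, not part of this dataset.

```python
# Hypothetical rows in the shape of this dataset (placeholder values).
languages = [
    {"n": "Example A", "p": 11000},
    {"n": "Example B", "p": 0},
]

# Hypothetical lookup table keyed by the same name column.
extra = {
    "Example A": {"family": "Family X"},
}


def left_join(rows, lookup, key="n", fields=("family",)):
    """Keep every row from `rows`; fill `fields` from `lookup` or None."""
    joined = []
    for row in rows:
        match = lookup.get(row[key])
        added = {f: (match.get(f) if match else None) for f in fields}
        joined.append({**row, **added})
    return joined


joined = left_join(languages, extra)
for row in joined:
    print(row)
```

Because every value of 'n' is unique in this profile, a left join on it cannot duplicate rows on the left side; unmatched names simply carry `None` in the added fields.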
lat
numeric feature
Latitude coordinates spanning -55.27 to 73.14, covering most inhabited latitudes from sub-Antarctic to high Arctic. The distribution leans north of the equator (mean 8.44, median 6.37) with mild positive skew (0.70) and 149 outliers (2.1%), likely the highest-latitude observations at either end of the range. With 4,048 unique values across 6,998 rows and no nulls, this is clean geospatial data ready for use. Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or climate zone.
- n: 6,998
- nulls: 0 (0.0%)
- unique: 4,048
- min: -55.27
- max: 73.14
- mean: 8.437
- median: 6.37
- std: 18
- q1: -4.65
- q3: 18.29
- iqr: 22.94
- skew: 0.6975
- kurtosis: 0.4773
- n_outliers: 149
- outlier_rate: 0.02129
- zero_rate: 0
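The report does not say how outliers are flagged. Assuming the common Tukey rule (1.5 × IQR beyond the quartiles), the fences for 'lat' can be worked out from the quartiles listed here. The same sketch also shows one way to bin latitude by hemisphere and zone, as the treatment suggests; the zone names and the cut-offs at the tropics and polar circles are illustrative choices.

```python
# Quartiles reported for 'lat' in this profile.
Q1, Q3 = -4.65, 18.29

# Assumption: Tukey fences with multiplier 1.5 (not stated by the report).
iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr
print(f"fences: {lower_fence:.2f} to {upper_fence:.2f}")

# Approximate latitudes of the tropics and polar circles, in degrees.
TROPIC = 23.44
POLAR_CIRCLE = 66.56


def latitude_zone(lat):
    """Label a latitude as tropical/temperate/polar plus hemisphere."""
    magnitude = abs(lat)
    if magnitude <= TROPIC:
        zone = "tropical"
    elif magnitude < POLAR_CIRCLE:
        zone = "temperate"
    else:
        zone = "polar"
    hemisphere = "N" if lat >= 0 else "S"
    return f"{zone}-{hemisphere}"


for lat in (-55.27, 6.37, 73.14):  # min, median, max from this profile
    print(lat, latitude_zone(lat))
```

If the profiler does use this rule, the fences fall at roughly -39° and 53°, so flagged rows would include mid-to-high latitudes rather than strictly polar ones.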
lng
numeric feature
This is a longitude feature spanning -178.78 to 179.31 across 6,998 rows with 5,560 unique values and no nulls. The distribution is mildly left-skewed (-0.498) with a median of 47.65 and an IQR from 8.28 to 123.94, suggesting an Eastern-Hemisphere bias rather than a uniform global spread. Only 12 outliers (0.17%) are flagged, which is expected given the bounded geographic range. Treatment: Pair with latitude as a geospatial coordinate; consider cyclical encoding or geohashing rather than treating as a plain scalar.
- n: 6,998
- nulls: 0 (0.0%)
- unique: 5,560
- min: -178.8
- max: 179.3
- mean: 52.46
- median: 47.65
- std: 79.67
- q1: 8.282
- q3: 123.9
- iqr: 115.7
- skew: -0.4982
- kurtosis: -0.673
- n_outliers: 12
- outlier_rate: 0.001715
- zero_rate: 0
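Cyclical encoding matters for longitude because -178.78 and 179.31 are numerically far apart but geographically adjacent across the antimeridian. A minimal sketch, using sine and cosine of the angle; the function names are illustrative.

```python
import math


def encode_longitude(lng_deg):
    """Map a longitude in degrees to a point on the unit circle."""
    rad = math.radians(lng_deg)
    return math.sin(rad), math.cos(rad)


def encoded_distance(a_deg, b_deg):
    """Straight-line distance between two encoded longitudes."""
    s1, c1 = encode_longitude(a_deg)
    s2, c2 = encode_longitude(b_deg)
    return math.hypot(s1 - s2, c1 - c2)


# Min and max longitude from this profile: far apart as scalars,
# close together once encoded.
print(encoded_distance(-178.78, 179.31))
# Q1 and Q3 from this profile, for comparison.
print(encoded_distance(8.28, 123.94))
```

The encoding replaces one column with two, and it captures east-west wraparound only; true geographic distance still needs latitude as well.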
p
numeric feature · high_skew · outliers
Column 'p' is a heavily right-skewed numeric count spanning 0 to 964,553,200 with a median of just 11,000; the dataset summary identifies it as speaker population. The skew of 39.45 and kurtosis of 1870 are extreme, with 16.56% of rows flagged as outliers and 13.46% sitting at exactly zero. The mean (1,157,741) sits more than 100x above the median, so summary averages will be misleading. Treatment: Apply a log1p transform before modelling and consider a separate zero-vs-nonzero indicator.
- n: 6,998
- nulls: 0 (0.0%)
- unique: 1,627
- min: 0
- max: 9.646e+08
- mean: 1.158e+06
- median: 11,000
- std: 1.73e+07
- q1: 1,200
- q3: 83,000
- iqr: 81,800
- skew: 39.45
- kurtosis: 1870
- n_outliers: 1,159
- outlier_rate: 0.1656
- zero_rate: 0.1346
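The recommended treatment for 'p' can be sketched as follows: a log1p transform, which is defined at zero, plus a separate zero indicator. The output field names `p_log1p` and `p_is_zero` are illustrative; the sample values are the min, quartiles, and max quoted in this profile.

```python
import math


def transform_population(p):
    """Return the log1p-transformed value and a zero-vs-nonzero flag."""
    return {"p_log1p": math.log1p(p), "p_is_zero": int(p == 0)}


# min, q1, median, q3, max as reported for column 'p'.
samples = [0, 1_200, 11_000, 83_000, 964_553_200]

for p in samples:
    out = transform_population(p)
    print(f"{p:>12,} -> log1p={out['p_log1p']:.2f}, zero={out['p_is_zero']}")
```

On the log1p scale the full range compresses to roughly 0 through 21, so the few very large languages no longer dominate, while the indicator keeps the zero-speaker rows distinguishable from small nonzero populations.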
s
numeric feature
Column 's' holds an integer-valued score on a 0-5 scale with only 6 unique values across 6,998 rows and no nulls. The distribution skews high (mean 3.80, median 4.0, skew -1.04), with Q1-Q3 spanning 3 to 5 and about 3.6% of rows at zero. This looks like an ordinal rating rather than a continuous measurement. Treatment: Treat as an ordinal rating; consider ordinal encoding or bucket low scores before modelling.
- n: 6,998
- nulls: 0 (0.0%)
- unique: 6
- min: 0
- max: 5
- mean: 3.796
- median: 4
- std: 1.366
- q1: 3
- q3: 5
- iqr: 2
- skew: -1.041
- kurtosis: 0.4154
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.03572
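One way to bucket low scores, as the treatment suggests, is to collapse the sparse values 0 to 2 into a single level. The cut-off at 2 and the label "low" are illustrative assumptions; the per-value counts are read from the non-empty bins of the 's' histogram in this report.

```python
from collections import Counter

# Rows per score value, taken from the 's' histogram in this report.
score_counts = {0: 250, 1: 328, 2: 383, 3: 1768, 4: 1176, 5: 3093}


def bucket_score(s):
    """Collapse scores 0-2 into 'low'; keep 3, 4 and 5 as their own levels."""
    return "low" if s <= 2 else str(s)


bucketed = Counter()
for score, count in score_counts.items():
    bucketed[bucket_score(score)] += count

total = sum(bucketed.values())
for level in ("low", "3", "4", "5"):
    print(f"{level:>3}: {bucketed[level]:>5} ({bucketed[level] / total:.1%})")
```

With this cut-off the merged bucket holds 961 rows, comparable in size to the other levels, which avoids three thin categories at the bottom of the scale.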
ss
categorical label
The column 'ss' is a categorical endangerment status with 7 distinct levels: six ranging from 'safe' to 'extinct', plus an 'unknown' bucket. 'safe' dominates at 43.9% (3,074 of 6,998), with 'definitely endangered' second at 1,753, and there are no nulls. The distribution is moderately balanced (entropy ratio 0.76), but tail categories like 'extinct' (219) and 'unknown' (91) are sparse. Treatment: Encode as an ordinal scale (safe → extinct) and treat 'unknown' separately or impute.
- n: 6,998
- nulls: 0 (0.0%)
- unique: 7
- top_value: safe
- top_rate: 0.4393
- cardinality: 7
- entropy: 2.122
- entropy_ratio: 0.7557
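The ordinal encoding suggested for 'ss' might look like the sketch below. The ordering of the intermediate levels is an assumption based on the usual severity reading of the labels (the report only gives the endpoints, safe → extinct), and the output field names are illustrative. 'unknown' is kept out of the scale and flagged separately.

```python
# Assumed severity order, least to most endangered.
SS_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
SS_RANK = {label: rank for rank, label in enumerate(SS_ORDER)}


def encode_ss(label):
    """Return an ordinal rank (None if not on the scale) and an unknown flag."""
    rank = SS_RANK.get(label)  # None for 'unknown' or any unexpected label
    return {"ss_rank": rank, "ss_unknown": int(rank is None)}


for label in ("safe", "definitely endangered", "extinct", "unknown"):
    print(label, encode_ss(label))
```

Leaving the rank as `None` for 'unknown' defers the choice between dropping those 91 rows and imputing them, rather than silently placing them somewhere on the scale.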