data trove steam games catalog
Reading
This is a Steam games catalogue with 122,611 rows and 40 columns, covering titles, publishers, developers, genres, pricing, review counts, and associated URLs. Examine first the extreme skew across nearly all numeric engagement columns (column23, column24, column27, column29, column31): medians sit at 0–5 while means run into the hundreds or thousands, so a tiny fraction of blockbuster titles accounts for the vast majority of reviews and activity. Next, look at genre distribution (column36), where a handful of Casual/Indie/Action combinations account for the bulk of the catalogue, and at the estimated owner-count banding (column03): 61.5% of games fall in the '0 - 20000' band and a further 17.7% in '0 - 0', so roughly 79% of titles have fewer than 20,000 estimated owners. This points to a long-tail market dominated by low-visibility titles.
citing: column23.stats.median · column23.stats.mean · column27.stats.zero_rate · column29.stats.zero_rate · column03.top_values · column03.stats.top_rate · column36.top_values · column36.stats.duplicate_rate · column06.stats.median · column06.stats.mean
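The concentration claim in the summary can be checked directly once the raw values are loaded. The sketch below is a minimal, standard-library illustration: `top_share` is a hypothetical helper name, and the toy list is made-up data, not values from the catalogue.

```python
def top_share(values, fraction):
    """Return the share of the total held by the largest `fraction` of values."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Always keep at least one row so tiny fractions still select something.
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


# Toy data (NOT taken from the catalogue): many small values, one blockbuster.
toy_reviews = [0, 0, 0, 1, 2, 3, 5, 8, 40, 9000]
print(top_share(toy_reviews, 0.10))
```

Running the same function on an engagement column such as column23 with a fraction like 0.01 would show how much of the total the top 1% of titles holds.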
Charts the summary said to look at first
column03: value counts and share of rows
| value | count | share |
|---|---|---|
| 0 - 20000 | 75404 | 61.5% |
| 0 - 0 | 21641 | 17.7% |
| 20000 - 50000 | 11396 | 9.3% |
| 50000 - 100000 | 5355 | 4.4% |
| 100000 - 200000 | 3454 | 2.8% |
| 200000 - 500000 | 2853 | 2.3% |
| 500000 - 1000000 | 1154 | 0.9% |
| 1000000 - 2000000 | 729 | 0.6% |
| 2000000 - 5000000 | 405 | 0.3% |
| 5000000 - 10000000 | 125 | 0.1% |
| 10000000 - 20000000 | 51 | 0.0% |
| 20000000 - 50000000 | 31 | 0.0% |
| 50000000 - 100000000 | 9 | 0.0% |
| 100000000 - 200000000 | 4 | 0.0% |
String length distribution (characters per value)
| chars | count |
|---|---|
| 3 – 9 | 12259 |
| 9 – 15 | 21084 |
| 15 – 20 | 22318 |
| 20 – 26 | 25837 |
| 26 – 32 | 12284 |
| 32 – 38 | 8026 |
| 38 – 44 | 5596 |
| 44 – 50 | 2995 |
| 50 – 55 | 1587 |
| 55 – 61 | 848 |
| 61 – 67 | 593 |
| 67 – 73 | 229 |
| 73 – 79 | 196 |
| 79 – 85 | 137 |
| 85 – 90 | 71 |
| 90 – 96 | 35 |
| 96 – 102 | 37 |
| 102 – 108 | 13 |
| 108 – 114 | 15 |
| 114 – 120 | 10 |
| 120 – 125 | 6 |
| 125 – 131 | 6 |
| 131 – 137 | 4 |
| 137 – 143 | 2 |
| 143 – 149 | 2 |
| 149 – 154 | 1 |
| 154 – 160 | 0 |
| 160 – 166 | 1 |
| 166 – 172 | 4 |
| 172 – 178 | 0 |
| 178 – 184 | 0 |
| 184 – 189 | 0 |
| 189 – 195 | 0 |
| 195 – 201 | 0 |
| 201 – 207 | 0 |
| 207 – 213 | 1 |
| 213 – 219 | 0 |
| 219 – 224 | 0 |
| 224 – 230 | 0 |
| 230 – 236 | 1 |
column23: value distribution (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 1.911e+05 | 122511 |
| 1.911e+05 – 3.821e+05 | 57 |
| 3.821e+05 – 5.732e+05 | 16 |
| 5.732e+05 – 7.642e+05 | 10 |
| 7.642e+05 – 9.553e+05 | 6 |
| 9.553e+05 – 1.146e+06 | 5 |
| 1.146e+06 – 1.337e+06 | 1 |
| 1.337e+06 – 1.528e+06 | 2 |
| 1.528e+06 – 1.719e+06 | 0 |
| 1.719e+06 – 1.911e+06 | 1 |
| 1.911e+06 – 2.102e+06 | 1 |
| 2.102e+06 – 2.293e+06 | 0 |
| 2.293e+06 – 2.484e+06 | 0 |
| 2.484e+06 – 2.675e+06 | 0 |
| 2.675e+06 – 2.866e+06 | 0 |
| 2.866e+06 – 3.057e+06 | 0 |
| 3.057e+06 – 3.248e+06 | 0 |
| 3.248e+06 – 3.439e+06 | 0 |
| 3.439e+06 – 3.63e+06 | 0 |
| 3.63e+06 – 3.821e+06 | 0 |
| 3.821e+06 – 4.012e+06 | 0 |
| 4.012e+06 – 4.203e+06 | 0 |
| 4.203e+06 – 4.394e+06 | 0 |
| 4.394e+06 – 4.585e+06 | 0 |
| 4.585e+06 – 4.776e+06 | 0 |
| 4.776e+06 – 4.967e+06 | 0 |
| 4.967e+06 – 5.158e+06 | 0 |
| 5.158e+06 – 5.349e+06 | 0 |
| 5.349e+06 – 5.541e+06 | 0 |
| 5.541e+06 – 5.732e+06 | 0 |
| 5.732e+06 – 5.923e+06 | 0 |
| 5.923e+06 – 6.114e+06 | 0 |
| 6.114e+06 – 6.305e+06 | 0 |
| 6.305e+06 – 6.496e+06 | 0 |
| 6.496e+06 – 6.687e+06 | 0 |
| 6.687e+06 – 6.878e+06 | 0 |
| 6.878e+06 – 7.069e+06 | 0 |
| 7.069e+06 – 7.26e+06 | 0 |
| 7.26e+06 – 7.451e+06 | 0 |
| 7.451e+06 – 7.642e+06 | 1 |
column06: value distribution (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 25 | 120926 |
| 25 – 50 | 1081 |
| 50 – 75 | 248 |
| 75 – 100 | 47 |
| 100 – 125 | 6 |
| 125 – 150 | 13 |
| 150 – 175 | 1 |
| 175 – 200 | 282 |
| 200 – 225 | 0 |
| 225 – 250 | 0 |
| 250 – 275 | 1 |
| 275 – 300 | 2 |
| 300 – 325 | 0 |
| 325 – 350 | 0 |
| 350 – 375 | 0 |
| 375 – 400 | 0 |
| 400 – 425 | 0 |
| 425 – 450 | 0 |
| 450 – 475 | 0 |
| 475 – 500 | 0 |
| 500 – 525 | 1 |
| 525 – 550 | 0 |
| 550 – 575 | 0 |
| 575 – 600 | 0 |
| 600 – 625 | 0 |
| 625 – 650 | 0 |
| 650 – 675 | 0 |
| 675 – 700 | 0 |
| 700 – 725 | 0 |
| 725 – 750 | 0 |
| 750 – 775 | 0 |
| 775 – 800 | 0 |
| 800 – 825 | 0 |
| 825 – 850 | 0 |
| 850 – 875 | 0 |
| 875 – 900 | 0 |
| 900 – 925 | 0 |
| 925 – 950 | 0 |
| 950 – 975 | 0 |
| 975 – 1000 | 3 |
column02: string length distribution (all values are 11 or 12 characters long)
| chars | count |
|---|---|
| 11 | 34494 |
| 12 | 88117 |
Schema
40 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| column00 | numeric | 0.0% | 122,611 | |
| column01 | text | 0.0% | 121,454 | near_unique |
| column02 | text | 0.0% | 5,081 | short_text, duplicates |
| column03 | categorical | 0.0% | 14 | |
| column04 | numeric | 0.0% | 1,110 | high_skew, outliers |
| column05 | numeric | 0.0% | 15 | high_skew |
| column06 | numeric | 0.0% | 941 | high_skew, outliers |
| column07 | numeric | 0.0% | 88 | |
| column08 | numeric | 0.0% | 117 | high_skew, outliers |
| column09 | text | 6.9% | 113,556 | near_unique |
| column10 | text | 0.0% | 19,113 | one_word, duplicates |
| column11 | text | 0.0% | 3,710 | one_word, duplicates |
| column12 | text | 90.2% | 11,884 | near_unique, null_rate |
| column13 | text | 0.1% | 122,420 | near_unique, one_word, url_heavy |
| column14 | text | 59.5% | 39,703 | one_word, url_heavy, null_rate, duplicates |
| column15 | text | 55.8% | 35,399 | one_word, url_heavy, null_rate, duplicates |
| column16 | text | 18.1% | 60,519 | one_word, duplicates |
| column17 | categorical | 0.0% | 2 | imbalance |
| column18 | categorical | 0.0% | 2 | |
| column19 | categorical | 0.0% | 2 | |
| column20 | numeric | 0.0% | 73 | high_skew |
| column21 | text | 96.5% | 4,160 | near_unique, one_word, url_heavy, null_rate |
| column22 | numeric | 0.0% | 31 | high_skew |
| column23 | numeric | 0.0% | 5,540 | high_skew, outliers |
| column24 | numeric | 0.0% | 2,725 | high_skew, outliers |
| column25 | numeric | 100.0% | 3 | null_rate |
| column26 | numeric | 0.0% | 448 | high_skew, outliers |
| column27 | numeric | 0.0% | 5,332 | high_skew, outliers |
| column28 | text | 81.7% | 18,620 | multilingual, null_rate |
| column29 | numeric | 0.0% | 3,037 | high_skew, outliers |
| column30 | numeric | 0.0% | 993 | high_skew |
| column31 | numeric | 0.0% | 2,511 | high_skew, outliers |
| column32 | numeric | 0.0% | 993 | high_skew |
| column33 | text | 6.9% | 70,816 | one_word, duplicates |
| column34 | text | 7.2% | 62,689 | one_word, duplicates |
| column35 | text | 7.3% | 13,291 | duplicates |
| column36 | text | 6.9% | 2,894 | one_word, duplicates |
| column37 | text | 32.0% | 77,179 | multilingual, null_rate |
| column38 | text | 4.9% | 116,483 | near_unique, one_word, url_heavy |
| column39 | unknown | 0.0% | - | skipped |
column00
numeric identifier
This column contains 122,611 numeric values that are all unique and null-free and span 10 to 4,264,350, which strongly suggests a unique numeric identifier (in a Steam catalogue, plausibly the app ID). The distribution is close to uniform: kurtosis is -1.05, skew is a negligible 0.18, and no outliers are detected, which is unusual for a natural measurement and consistent with a sequentially or pseudo-randomly assigned integer key. The IQR of 1,806,385 covers about 42% of the full range, close to the 50% expected for a uniform spread across the ID space.
Treatment: Drop before modelling or use as a row key only; do not use as a predictive feature.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 122,611
- min: 10
- max: 4.264e+06
- mean: 1.985e+06
- median: 1.907e+06
- std: 1.088e+06
- q1: 1.063e+06
- q3: 2.87e+06
- iqr: 1.806e+06
- skew: 0.1772
- kurtosis: -1.05
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
column01
text label near_unique
This column contains short, near-unique text strings averaging about 3 words and 18 characters, consistent with game titles. The dominant top words ('playtest', 'vr', 'simulator') fit a games catalogue that includes playtest entries, VR titles, and simulators. Surprising signals include 1,156 duplicates (about 0.94% duplicate rate) despite the near-unique alert, a small emoji presence (0.26%), and a maximum length of 413 characters, which is anomalously long relative to the median of 16.
Treatment: Use as a descriptive label; deduplicate or flag the 1,156 repeated entries, and investigate the long-tail outliers (len_max 413) before any downstream grouping or embedding.
- n: 122,611
- nulls: 1 (0.0%)
- unique: 121,454
- len_min: 1
- len_max: 413
- len_mean: 18.07
- len_median: 16
- len_p95: 37.55
- word_mean: 2.912
- word_median: 3
- n_empty: 0
- n_duplicates: 1,156
- duplicate_rate: 0.009428
- vocab_size: 18,813
- readability_flesch_mean: 52.87
- emoji_rate: 0.002585
- url_rate: 0
- one_word_rate: 0.1866
- allcaps_rate: 0.06731
- boilerplate_rate: 4.078e-05
column02
text timestamp short_text duplicates
This column contains dates formatted as 'Mon DD, YYYY' (e.g., 'Oct 23, 2025'), stored as text rather than a native date type. The values span at least 2021–2025 based on top word frequencies. The duplicate rate is 95.86%: the 122,611 rows share only 5,081 distinct dates, so 117,530 rows repeat a date that already appears elsewhere. The near-constant string length (median 12, min 11, max 12) and vocabulary of just 68 tokens confirm a tightly formatted date field with no free-text variation.
Treatment: Parse to a native date type (e.g., datetime64) before any time-series analysis or feature engineering.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 5,081
- len_min: 11
- len_max: 12
- len_mean: 11.72
- len_median: 12
- len_p95: 12
- word_mean: 3
- word_median: 3
- n_empty: 0
- n_duplicates: 117,530
- duplicate_rate: 0.9586
- vocab_size: 68
- readability_flesch_mean: 98.6
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
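The date parsing recommended for column02 can be sketched with the standard library alone. The function name `parse_catalog_date` is hypothetical, and the sketch assumes English month abbreviations: `%b` is locale-dependent, so it should run under a C or English locale.

```python
from datetime import date, datetime


def parse_catalog_date(text):
    """Parse 'Mon DD, YYYY' text (e.g. 'Oct 23, 2025') into a date, or None."""
    try:
        # %b = abbreviated month name, %d = day (one or two digits), %Y = 4-digit year.
        return datetime.strptime(text.strip(), "%b %d, %Y").date()
    except (ValueError, AttributeError):
        # ValueError: text does not match the format; AttributeError: text is None.
        return None


print(parse_catalog_date("Oct 23, 2025"))
print(parse_catalog_date("not a date"))
```

Returning `None` instead of raising keeps malformed rows visible for a later count, rather than aborting a full-column pass.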
column03
categorical feature
This column encodes a numeric quantity as binned range labels. Given the catalogue context and the summary above, it is most plausibly an estimated owner-count band; the bands run from '0 - 0' up to '100000000 - 200000000' with roughly logarithmic bin edges. The distribution is concentrated in the lowest bands: 61.5% of rows fall in the '0 - 20000' bucket alone, and a notable 21,641 rows sit in '0 - 0', a zero-value spike that may warrant separate treatment. With only 14 distinct values and zero nulls across 122,611 rows, the encoding is clean but lossy.
Treatment: Ordinal-encode using the natural bin order, or extract bin midpoints as a numeric approximation; investigate the '0 - 0' segment (21,641 rows) as a potential distinct class.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 14
- top_value: 0 - 20000
- top_rate: 0.615
- cardinality: 14
- entropy: 1.814
- entropy_ratio: 0.4764
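The two encodings suggested for column03 (ordinal rank and bin midpoint) can be sketched as follows. The helper names are hypothetical, and the code assumes every label has the exact form `'<low> - <high>'` seen in the value table.

```python
def parse_band(label):
    """Split a label such as '20000 - 50000' into integer (low, high) bounds."""
    low_text, high_text = label.split(" - ")
    return int(low_text), int(high_text)


def ordinal_codes(labels):
    """Map each distinct band label to its rank in natural (low, high) order."""
    ordered = sorted(set(labels), key=parse_band)
    return {label: rank for rank, label in enumerate(ordered)}


def band_midpoint(label):
    """Numeric approximation of a band: the midpoint of its bounds."""
    low, high = parse_band(label)
    return (low + high) / 2


bands = ["0 - 20000", "0 - 0", "20000 - 50000"]
print(ordinal_codes(bands))
print(band_midpoint("20000 - 50000"))
```

Sorting on the `(low, high)` pair places '0 - 0' before '0 - 20000', so the zero segment keeps its own code and can still be treated as a distinct class.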
column04
numeric feature high_skew outliers
This column is a heavily zero-inflated numeric field, likely a count or volume metric: 83.95% of values are exactly zero and the interquartile range is 0, so the entire middle half of the distribution sits at zero. The remaining values are extremely right-skewed (skew = 209.95, kurtosis = 51,452.44), with a max of 1,013,936 against a mean of only 54.59. The 19,676 rows (16.05%) flagged as outliers are exactly the non-zero rows, as expected when the IQR is 0. The 1,110 unique values and zero null rate suggest a sparse activity or volume metric.
Treatment: Consider a two-part model (zero-inflation indicator + log1p transform on non-zero values) or cap/winsorize at a high percentile before modelling.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 1,110
- min: 0
- max: 1.014e+06
- mean: 54.59
- median: 0
- std: 3729
- q1: 0
- q3: 0
- iqr: 0
- skew: 210
- kurtosis: 5.145e+04
- n_outliers: 19,676
- outlier_rate: 0.1605
- zero_rate: 0.8395
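The two-part treatment named for column04 splits each value into a zero/non-zero indicator and a log-compressed magnitude. The sketch below shows only the feature construction, not a fitted hurdle model; `two_part_features` is a hypothetical name.

```python
import math


def two_part_features(value):
    """Return (is_nonzero, log1p_magnitude) for a non-negative value."""
    if value < 0:
        raise ValueError("expected a non-negative value")
    is_nonzero = 1 if value > 0 else 0
    # log1p(0) is 0.0, so zero rows stay at zero on the magnitude feature.
    return is_nonzero, math.log1p(value)


for raw in (0, 10, 1_013_936):
    print(raw, two_part_features(raw))
```

The same construction applies to the other zero-inflated columns in this report. The indicator carries the "any activity at all" signal that a log transform alone would blur.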
column05
numeric feature high_skew
This column is a low-cardinality integer count (only 15 distinct values, range 0–21) in which 98.96% of rows are exactly zero, making it an extremely sparse count feature that likely records rare events per record. The distribution is severely right-skewed (skew 9.88, kurtosis 96.52), with only 1,272 rows (1.04%) carrying any non-zero signal, all of them flagged as outliers; the IQR is zero because all three quartiles collapse to 0.
Treatment: Treat as a sparse count; consider binarising (0 vs >0) or applying log1p transform, and flag the 1,272 non-zero rows as a minority sub-population for modelling.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 15
- min: 0
- max: 21
- mean: 0.1676
- median: 0
- std: 1.654
- q1: 0
- q3: 0
- iqr: 0
- skew: 9.883
- kurtosis: 96.52
- n_outliers: 1,272
- outlier_rate: 0.01037
- zero_rate: 0.9896
column06
numeric feature high_skew outliers
This column is a continuous non-negative measure where most values are small, likely a monetary amount such as a price (a duration or rate is also possible). The distribution is extreme: the median is 2.24 and Q3 is only 5.24, yet the max reaches 999.98 (shown rounded to 1000 in the statistics), producing a skew of 22.4 and a kurtosis of 1,135. Over 7.5% of rows (9,297) are flagged as outliers, and 21.4% of values are exactly zero, suggesting a two-part structure (zero-inflation plus a heavy-tailed positive component) that would violate standard regression assumptions.
Treatment: Model the zero-inflation separately (e.g., hurdle or Tweedie model), then log1p-transform the positive portion before regression or scaling.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 941
- min: 0
- max: 1000
- mean: 4.765
- median: 2.24
- std: 12.53
- q1: 0.55
- q3: 5.24
- iqr: 4.69
- skew: 22.4
- kurtosis: 1135
- n_outliers: 9,297
- outlier_rate: 0.07583
- zero_rate: 0.2137
column07
numeric feature
This column is a bounded numeric score or percentage, ranging from 0 to 100 with only 88 distinct values, suggesting a discretized or rounded measurement (e.g., a completion rate, satisfaction score, or grade). The most striking feature is that 66.8% of values are exactly zero, making the distribution heavily zero-inflated; the median is 0.0 while the mean is 18.35 and Q3 is only 40.0, confirming that the mass is concentrated at the floor. Kurtosis is near zero (−0.05), but that figure describes the whole distribution, including the spike at zero, so on its own it says little about the shape of the non-zero values. Analysts should treat this as a zero-inflated bounded variable rather than a standard continuous feature.
Treatment: Model with a two-part (hurdle/zero-inflated) approach, or apply an indicator for zero alongside the raw value; avoid log-transform without offset due to zero mass.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 88
- min: 0
- max: 100
- mean: 18.35
- median: 0
- std: 28.86
- q1: 0
- q3: 40
- iqr: 40
- skew: 1.22
- kurtosis: -0.05072
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.6682
column08
numeric feature high_skew outliers
This column is a sparse count or event-frequency field: 85.5% of its 122,611 rows are exactly zero, the median and IQR are both 0, yet the mean is 0.55 and the max reaches 3,703. The extreme concentration at zero combined with a skew of 171.8 and kurtosis of 38,359 indicates a heavy-tailed distribution driven by rare but very large values; 14.5% of rows (17,771) are flagged as outliers. Only 117 distinct values across 122,611 rows further suggests this is a discrete count, not a continuous measure.
Treatment: Apply log1p transform or use a zero-inflated / Poisson model; consider capping or winsorizing at a high quantile given the max of 3,703.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 117
- min: 0
- max: 3,703
- mean: 0.5459
- median: 0
- std: 14.52
- q1: 0
- q3: 0
- iqr: 0
- skew: 171.8
- kurtosis: 3.836e+04
- n_outliers: 17,771
- outlier_rate: 0.1449
- zero_rate: 0.8551
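In column08 the outlier rate (14.49%) is exactly the complement of the zero rate (85.51%). One plausible explanation, offered here as an assumption because the report does not state its outlier rule, is a Tukey-style fence at 1.5 × IQR: when Q1 and Q3 are both 0 the fences collapse to 0, and every non-zero row is flagged. The sketch below illustrates that on made-up data; `tukey_outlier_count` is a hypothetical helper.

```python
import statistics


def tukey_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)


# Toy data (NOT from the catalogue): 85% zeros, 15% non-zero counts.
toy = [0] * 85 + list(range(1, 16))
print(tukey_outlier_count(toy))
```

If this is indeed the rule in use, the `outliers` alert on zero-dominated columns mostly restates the zero rate and should not be read as evidence of data errors.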
column09
text free_text near_unique
This column contains long-form natural language text with a mean of 1,297 characters and 214 words per entry and a vocabulary of 105,903 unique terms; in a games catalogue this is most plausibly the store description of each title, though reviews or other user-generated content cannot be ruled out from the statistics alone. The near-unique alert (113,556 unique values out of 122,611 rows) confirms these are free-text narratives rather than categorical labels. Notably, 4.7% of entries contain emojis, suggesting informal or consumer-facing content, and the max length of 89,665 characters indicates extreme outliers well beyond the 95th-percentile length of 2,966 characters. The Flesch readability mean of 58.7 falls in the 50–60 band, conventionally labelled 'fairly difficult', just below the 60–70 'standard' band.
Treatment: Tokenize and embed (e.g., sentence-transformers) before modelling; flag or truncate the extreme-length outliers above len_p95 of 2,966 characters.
- n: 122,611
- nulls: 8,449 (6.9%)
- unique: 113,556
- len_min: 1
- len_max: 89,665
- len_mean: 1297
- len_median: 1,064
- len_p95: 2,966
- word_mean: 214.3
- word_median: 177
- n_empty: 0
- n_duplicates: 606
- duplicate_rate: 0.005308
- vocab_size: 105,903
- readability_flesch_mean: 58.75
- emoji_rate: 0.04672
- url_rate: 0.0003504
- one_word_rate: 0.0004029
- allcaps_rate: 0.007559
- boilerplate_rate: 0.01517
column10
text feature one_word duplicates
This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or game). The dominant value is `['English']`, appearing 55,314 times, with `[]` (no languages listed) in 8,380 rows. The duplicate rate is very high at 84.4%, which is expected given the limited vocabulary of 217 unique tokens and only 19,113 unique values across 122,611 rows. The data is stored as raw string-serialized lists rather than a normalized structure, which is a notable preprocessing concern.
Treatment: Parse the string-serialized lists into actual list structures, then multi-hot encode each language as a binary feature column.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 19,113
- len_min: 2
- len_max: 1,216
- len_mean: 68.02
- len_median: 11
- len_p95: 224
- word_mean: 6.889
- word_median: 1
- n_empty: 0
- n_duplicates: 103,498
- duplicate_rate: 0.8441
- vocab_size: 217
- readability_flesch_mean: 14.07
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.5333
- allcaps_rate: 0
- boilerplate_rate: 0
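Parsing the string-serialized lists in column10 and multi-hot encoding them might look like the sketch below. It uses `ast.literal_eval`, which evaluates Python literals only and never executes code; the helper names are hypothetical, and unparseable rows are treated as empty lists by assumption.

```python
import ast


def parse_language_list(text):
    """Turn a string like "['English', 'French']" into a list of names."""
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return []  # assumption: treat malformed or missing rows as "no languages"
    if not isinstance(parsed, list):
        return []
    return [str(item).strip() for item in parsed]


def multi_hot(rows):
    """Return (vocabulary, matrix) with one 0/1 column per language."""
    parsed_rows = [parse_language_list(text) for text in rows]
    vocabulary = sorted({lang for langs in parsed_rows for lang in langs})
    matrix = [[1 if lang in langs else 0 for lang in vocabulary]
              for langs in parsed_rows]
    return vocabulary, matrix


vocabulary, matrix = multi_hot(["['English', 'French']", "['English']", "[]"])
print(vocabulary)
print(matrix)
```

With a vocabulary of roughly 200 language tokens the resulting matrix stays small, so a dense representation is usually workable.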
column11
text feature one_word duplicates
This column contains a second set of serialized Python lists of language names, stored separately from column10 and populated far less often. The dominant value is '[]' (empty list), appearing 72,730 times (nearly 60% of rows), so most records have no language metadata in this field. Only 3,710 unique values exist across 122,611 rows and the duplicate rate is 96.97%, which is expected for a categorical-list field; the vocabulary of just 194 words confirms a closed set of language names.
Treatment: Parse the serialized list strings into proper multi-label indicators (one binary column per language) before modelling; treat '[]' as missing/unknown.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 3,710
- len_min: 2
- len_max: 1,216
- len_mean: 24.31
- len_median: 2
- len_p95: 46
- word_mean: 2.854
- word_median: 1
- n_empty: 0
- n_duplicates: 118,901
- duplicate_rate: 0.9697
- vocab_size: 194
- readability_flesch_mean: 8.003
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.813
- allcaps_rate: 0
- boilerplate_rate: 0
column12
text free_text near_unique null_rate
This column contains substantial free-text descriptions or reviews, most likely about games: the word 'game' is the top non-stopword at 7,882 occurrences, average text length is about 340 characters (about 57 words), and the vocabulary spans 61,840 unique tokens. The 90.16% null rate is a major alert: only 12,070 of 122,611 rows carry any content, so this field is sparsely populated. An emoji_rate of about 1.6% and a mean Flesch readability score of about 57.8 suggest informal, consumer-written prose. The near_unique flag is partially explained by the sparse population: 11,884 unique values among 12,070 non-null rows means almost every entry is distinct.
Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) before modelling; impute or mask nulls explicitly given the 90.16% null rate.
- n: 122,611
- nulls: 110,541 (90.2%)
- unique: 11,884
- len_min: 3
- len_max: 2,912
- len_mean: 340.3
- len_median: 295
- len_p95: 763
- word_mean: 57.37
- word_median: 49
- n_empty: 0
- n_duplicates: 186
- duplicate_rate: 0.01541
- vocab_size: 61,840
- readability_flesch_mean: 57.83
- emoji_rate: 0.01649
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0.008202
- boilerplate_rate: 0
column13
text metadata near_unique one_word url_heavy
This column contains Steam CDN URLs pointing to game header images hosted on Akamai's steamstatic.com infrastructure, specifically `header.jpg` assets keyed by Steam app ID. With a url_rate of 1.0 and one_word_rate of 1.0, every value is a single URL. The column is near-unique (122,420 distinct values out of 122,611 rows), with only 110 duplicates, suggesting these map closely to individual game or product records; the small number of repeated URLs (max frequency 5) likely reflects games appearing in multiple dataset rows.
Treatment: Extract Steam app ID from URL path for joining; drop raw URL before modelling or store as-is for image retrieval pipelines.
- n: 122,611
- nulls: 81 (0.1%)
- unique: 122,420
- len_min: 93
- len_max: 153
- len_mean: 104.6
- len_median: 98
- len_p95: 139
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 110
- duplicate_rate: 0.0008977
- vocab_size: 19,992
- readability_flesch_mean: -834.3
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
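Extracting the app ID from a column13 URL could be done with a small regular expression. The report says the images are keyed by app ID but does not show a full URL, so the `/apps/<digits>/` path segment is an assumption that should be checked against real values; the example URL below is a made-up placeholder.

```python
import re

# ASSUMPTION: the app ID appears as a numeric path segment after "/apps/".
APP_ID_PATTERN = re.compile(r"/apps/(\d+)/")


def extract_app_id(url):
    """Return the integer app ID found in the URL path, or None."""
    if not url:
        return None
    match = APP_ID_PATTERN.search(url)
    return int(match.group(1)) if match else None


# Placeholder URL for illustration only, not a value from the dataset.
print(extract_app_id("https://cdn.example.invalid/steam/apps/12345/header.jpg?t=1"))
```

If the extracted IDs match column00 row by row, that would support reading column00 as the app ID.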
column14
text metadata one_word url_heavy null_rate duplicates
This column contains publisher or developer website URLs, almost certainly scraped from a Steam or similar games catalogue. Virtually every non-null value is a single URL (one_word_rate 0.9999, url_rate 0.9999), pointing to publisher homepages, Facebook pages, or Steam publisher/group pages. Two signals stand out: 59.48% of rows are null, meaning many game records carry no website; and 20.08% of non-null values are duplicates (9,973 repeated URLs), reflecting publishers with large catalogues who share one website across many titles.
Treatment: Extract domain as a categorical publisher identifier; flag or impute nulls; do not embed raw URL strings.
- n: 122,611
- nulls: 72,935 (59.5%)
- unique: 39,703
- len_min: 7
- len_max: 236
- len_mean: 32.57
- len_median: 29
- len_p95: 56
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 9,973
- duplicate_rate: 0.2008
- vocab_size: 17,059
- readability_flesch_mean: -260.3
- emoji_rate: 0
- url_rate: 0.9999
- one_word_rate: 0.9999
- allcaps_rate: 0
- boilerplate_rate: 0
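The domain extraction suggested for column14 (and equally applicable to column15) can be sketched with `urllib.parse`. The function name is hypothetical; stripping a leading `www.` and tolerating a missing scheme are choices made for this illustration.

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the lowercased host of a URL without a leading 'www.', or None."""
    if url is None:
        return None
    text = url.strip()
    if not text:
        return None
    # urlsplit only recognises a host after '//', so add it when the scheme is missing.
    if "://" not in text:
        text = "//" + text
    try:
        host = urlsplit(text).hostname  # hostname is already lowercased
    except ValueError:
        return None  # e.g. malformed bracketed IPv6 literal
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


print(extract_domain("https://www.Example.com/games"))
print(extract_domain("example.org/support"))
```

Shared hosts such as social-media pages will collapse many unrelated publishers into one domain, so the resulting category is a rough publisher proxy at best.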
column15
text metadata one_word url_heavy null_rate duplicates
This column is a support/contact URL field, almost certainly a developer or publisher support link associated with game or software records. 95.6% of non-null values are URLs, and the one-word rate is 99.9%, consistent with bare URL strings. Two surprises stand out: the null rate is very high at 55.8%, meaning more than half of records lack this URL, and the duplicate rate is 34.7% (18,808 duplicate values out of about 54,200 non-null rows), reflecting that many games share the same support domain (e.g., Big Fish Games, EA, Facebook pages).
Treatment: Extract domain as a categorical feature; treat raw URL as a grouping key rather than a text feature; impute or flag nulls separately given the 55.8% null rate.
- n: 122,611
- nulls: 68,404 (55.8%)
- unique: 35,399
- len_min: 1
- len_max: 851
- len_mean: 31.19
- len_median: 29
- len_p95: 51
- word_mean: 1.002
- word_median: 1
- n_empty: 0
- n_duplicates: 18,808
- duplicate_rate: 0.347
- vocab_size: 14,875
- readability_flesch_mean: -245.1
- emoji_rate: 0
- url_rate: 0.9559
- one_word_rate: 0.9993
- allcaps_rate: 0.0007933
- boilerplate_rate: 0
column16
text foreign_key one_word duplicates
This column contains email addresses for game developers or publishers, as evidenced by the top values (e.g., 'info@bigfishgames.com', 'support@quanticlab.com'). Nearly all values (99.86%) are single tokens, consistent with email format. The duplicate rate is high at 39.7% (39,849 duplicates among the 100,368 non-null rows), indicating that many records share a contact email, as expected for a publisher-level field where one entity owns multiple titles. The null rate of 18.14% is notable and should be investigated for systematic missingness.
Treatment: Use as a grouping/join key on publisher or developer entity; normalize to lowercase and strip whitespace before joining.
- n: 122,611
- nulls: 22,243 (18.1%)
- unique: 60,519
- len_min: 1
- len_max: 169
- len_mean: 22.91
- len_median: 23
- len_p95: 31
- word_mean: 1.004
- word_median: 1
- n_empty: 0
- n_duplicates: 39,849
- duplicate_rate: 0.397
- vocab_size: 15,319
- readability_flesch_mean: -223.7
- emoji_rate: 9.963e-06
- url_rate: 0.003906
- one_word_rate: 0.9986
- allcaps_rate: 0.001016
- boilerplate_rate: 0
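The normalisation recommended for column16 before joining could be sketched as below. `normalize_email` is a hypothetical helper, and its validity check (exactly one `@` with text on both sides) is deliberately loose, not a full address validator.

```python
def normalize_email(raw):
    """Lowercase and strip an email-like string; return None if it is unusable."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    # Loose sanity check only: exactly one '@' with a non-empty part on each side.
    if cleaned.count("@") != 1:
        return None
    local, domain = cleaned.split("@")
    if not local or not domain:
        return None
    return cleaned


print(normalize_email("  Info@BigFishGames.com "))
print(normalize_email("not-an-email"))
```

Lowercasing the local part follows the report's recommendation. It is safe for grouping in practice, although mailbox names are technically allowed to be case-sensitive.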
column17
categorical feature imbalance
This column is a boolean flag stored as string values ('True'/'False'), covering 122,611 rows with no nulls. It is severely imbalanced: 'True' accounts for 99.964% of rows (122,567 occurrences) while 'False' appears only 44 times. The near-zero entropy (0.0046) confirms the column carries almost no information, making it nearly constant.
Treatment: Investigate whether the 44 'False' rows are meaningful anomalies; otherwise drop as near-constant with no predictive variance.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 2
- top_value: True
- top_rate: 0.9996
- cardinality: 2
- entropy: 0.004625
- entropy_ratio: 0.004625
column18
categorical feature
This column is a binary boolean flag stored as string literals 'True'/'False', with zero nulls across 122,611 rows. The dominant value is 'False' at 82.6% (101,319 occurrences), leaving 'True' at roughly 17.4% (21,292), a moderately imbalanced split that may matter for classification tasks. The entropy ratio of 0.666 confirms meaningful but uneven information content.
Treatment: Cast to boolean/integer (0/1) and monitor class imbalance if used as a target or predictor.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 2
- top_value: False
- top_rate: 0.8263
- cardinality: 2
- entropy: 0.666
- entropy_ratio: 0.666
column19
categorical label
This column is a boolean flag stored as string literals 'True'/'False', covering all 122,611 rows with zero nulls. The distribution is heavily skewed: 'False' dominates at 87.2% (106,905 rows) versus 'True' at only 12.8% (15,706 rows). The low entropy of 0.552 confirms the imbalance. An analyst building a classifier on this as a target should anticipate class imbalance requiring resampling or adjusted class weights.
Treatment: Encode as binary integer (False=0, True=1) and address class imbalance (~87/13 split) before modelling.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 2
- top_value: False
- top_rate: 0.8719
- cardinality: 2
- entropy: 0.5522
- entropy_ratio: 0.5522
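For the three string-encoded flags (column17 to column19), the cast to integers and one common way of deriving class weights can be sketched as follows. The weighting formula `total / (n_classes * count)` is the usual "balanced" heuristic, chosen here as an illustration and not prescribed by the report; the counts are the column19 figures quoted above.

```python
def parse_flag(text):
    """Map the string literals 'True'/'False' to 1/0; anything else to None."""
    if not isinstance(text, str):
        return None
    return {"True": 1, "False": 0}.get(text.strip())


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# column19 counts from the description: 106,905 'False' rows and 15,706 'True' rows.
weights = balanced_class_weights({0: 106_905, 1: 15_706})
print(parse_flag("True"), parse_flag("False"))
print(weights)
```

The minority class receives the larger weight, so each 'True' row counts for more in a weighted loss than each 'False' row.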
column20
numeric feature high_skew
This column is a sparse numeric count or score with only 73 distinct values across 122,611 rows and a maximum of 97, which fits a discrete rating or score at least as well as an event count. The distribution is concentrated at zero (96.5% of values are exactly 0), with an IQR of 0 and a median of 0, so the non-zero values produce strong positive skew (5.23) and kurtosis (25.75). The 4,256 outlier rows (3.47%) are the rows carrying non-zero values; they likely represent a small active or engaged sub-population, which is the analytically interesting segment.
Treatment: Apply log1p transform or binarise (zero vs. non-zero) before modelling; consider separating the zero-inflated mass from the active tail for two-part modelling.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 73
- min: 0
- max: 97
- mean: 2.565
- median: 0
- std: 13.66
- q1: 0
- q3: 0
- iqr: 0
- skew: 5.227
- kurtosis: 25.75
- n_outliers: 4,256
- outlier_rate: 0.03471
- zero_rate: 0.9653
column21
text near_unique one_word url_heavy null_rate
- n: 122,611
- nulls: 118,355 (96.5%)
- unique: 4,160
- len_min: 42
- len_max: 142
- len_mean: 72.43
- len_median: 70
- len_p95: 91
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 96
- duplicate_rate: 0.02256
- vocab_size: 4,160
- readability_flesch_mean: -704.1
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
column22
numeric feature high_skew
This column is almost certainly a sparse indicator or rare-event count: 99.97% of its 122,611 values are exactly zero, with only 40 flagged outliers and a maximum of 100.0. The 31 unique values and an IQR of 0.0 confirm that the vast majority of rows carry no signal at all. The extreme skew (59.25) and kurtosis (3,627.8) are a direct consequence of this near-total zero mass, making standard continuous modelling inappropriate without transformation or binarisation.
Treatment: Binarise (zero vs. non-zero) or treat as a rare-event indicator; if the raw magnitude matters, cap at a sensible percentile and log1p-transform before modelling.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 31
- min: 0
- max: 100
- mean: 0.02455
- median: 0
- std: 1.395
- q1: 0
- q3: 0
- iqr: 0
- skew: 59.25
- kurtosis: 3628
- n_outliers: 40
- outlier_rate: 0.0003262
- zero_rate: 0.9997
column23
numeric feature high_skew outliers
This column is a numeric count or magnitude field, likely an activity volume or similar accumulation metric, with 122,611 non-null records and only 5,540 distinct values. The distribution is extraordinarily right-skewed (skew = 177.84, kurtosis = 45,295.94): the median is just 5.0 while the mean is 1,044.99, and the maximum reaches 7,642,084, roughly 272 standard deviations above the mean. About 34.5% of values are zero and 17.0% are flagged as outliers (20,797 rows), so this is a zero-inflated distribution whose mean is dominated by rare extreme values in a heavy right tail.
Treatment: Apply log1p-transform (to handle zeros) before modelling, and consider capping or winsorizing at a high percentile to suppress the extreme outliers up to 7,642,084.
- n: 122,611
- nulls: 0 (0.0%)
- unique: 5,540
- min: 0
- max: 7.642e+06
- mean: 1045
- median: 5
- std: 2.809e+04
- q1: 0
- q3: 37
- iqr: 37
- skew: 177.8
- kurtosis: 4.53e+04
- n_outliers: 20,797
- outlier_rate: 0.1696
- zero_rate: 0.3448
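The capping and log1p steps suggested for column23 can be combined in a few lines. This is a minimal standard-library sketch: the nearest-rank percentile and the default 99th-percentile cap are illustrative choices, and the sample values only echo the scale of the statistics above, not actual rows.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def winsorize_log1p(values, upper_pct=99):
    """Cap values at the upper percentile, then apply log1p (which keeps zeros at 0.0)."""
    if not values:
        return []
    cap = nearest_rank_percentile(values, upper_pct)
    return [math.log1p(min(v, cap)) for v in values]


# Illustrative values only: zeros, typical counts, and one extreme maximum.
sample = [0, 5, 37, 1000, 7_642_084]
print(winsorize_log1p(sample, upper_pct=80))
```

Capping before the log limits the influence of the single largest titles, while log1p keeps the 34.5% of zero rows at exactly zero.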
column24
numeric feature high_skew outliers
This column is likely a count or frequency measure (e.g., event occurrences, transaction counts, or interaction tallies) given its non-negative integer-like range and high zero rate. The distribution is extraordinarily right-skewed: the median is 1.0 and Q3 is only 10.0, yet the maximum reaches 1,173,003, about six orders of magnitude above the median. With 45% zeros, about 16.9% flagged outliers (20,696 rows), a skew of 156.86, and kurtosis exceeding 30,000, the bulk of records cluster near zero while a small number of extreme values dominate the mean (169.20 vs. median 1.0). This is a severe long-tail distribution that will distort any linear model if used as-is.
Treatment: Apply log1p-transform (or cap at a high percentile) before modelling to reduce extreme skew.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 2,725
- min
- 0
- max
- 1.173e+06
- mean
- 169.2
- median
- 1
- std
- 5375
- q1
- 0
- q3
- 10
- iqr
- 10
- skew
- 156.9
- kurtosis
- 3.063e+04
- n_outliers
- 20,696
- outlier_rate
- 0.1688
- zero_rate
- 0.4502
column25
numeric null_rate
- n
- 122,611
- nulls
- 122,571 (100.0%)
- unique
- 3
- min
- 98
- max
- 100
- mean
- 99.17
- median
- 99
- std
- 0.6751
- q1
- 99
- q3
- 100
- iqr
- 1
- skew
- -0.2149
- kurtosis
- -0.7872
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
column26
numeric feature high_skew outliers
This column is likely a count or frequency metric (e.g., event occurrences, transaction counts, or tenure in days/months), given its non-negative integer values with only 448 distinct values across 122,611 rows. The distribution is severely right-skewed (skew=32.63, kurtosis=1192.15): the median is just 2.0 while the mean is 18.09, Q1 is 0.0, and the maximum reaches 9,821, an extreme outlier relative to the IQR of 19. Nearly half the rows (48.6%) are zero, and 6.9% are flagged as outliers, signaling a zero-inflated, heavy-tailed distribution that will distort any linear model trained on raw values. Treatment: Apply log1p-transform (or a zero-inflated model) to compress the extreme right tail before modelling.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 448
- min
- 0
- max
- 9,821
- mean
- 18.09
- median
- 2
- std
- 141.5
- q1
- 0
- q3
- 19
- iqr
- 19
- skew
- 32.63
- kurtosis
- 1192
- n_outliers
- 8,433
- outlier_rate
- 0.06878
- zero_rate
- 0.4859
column27
numeric feature high_skew outliers
This column is a sparse, heavily right-skewed numeric count or amount field, likely representing an event frequency, transaction volume, or similar quantity that is zero for the vast majority of records. 82.9% of the 122,611 rows are exactly zero, the median is 0.0, and the IQR is 0.0, yet the mean is 961.8 and the maximum reaches 4,830,455, indicating a tiny fraction of extreme values driving nearly all the variance. The skew of 113.9 and kurtosis of 20,874.5 are extraordinary, and 17.1% of rows are flagged as outliers, confirming that the non-zero tail is severely extreme relative to the bulk of the distribution. Treatment: Apply log1p-transform (or treat as two-part: zero/non-zero indicator + log-transformed non-zero value) before modelling to handle extreme skew and outliers.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 5,332
- min
- 0
- max
- 4.83e+06
- mean
- 961.8
- median
- 0
- std
- 2.188e+04
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 113.9
- kurtosis
- 2.087e+04
- n_outliers
- 20,906
- outlier_rate
- 0.1705
- zero_rate
- 0.8295
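The two-part treatment suggested for column27 (a zero/non-zero indicator plus a log-transformed non-zero value) might look like the sketch below. The function name `two_part_encode` and the choice to represent the missing log value as `None` are assumptions for illustration.

```python
import math

def two_part_encode(values):
    """Return (is_nonzero, log_value) pairs; log_value is None when the value is zero."""
    encoded = []
    for v in values:
        if v < 0:
            raise ValueError("expected a non-negative count")
        if v == 0:
            # Zero rows contribute only to the indicator part.
            encoded.append((0, None))
        else:
            # Non-zero rows carry the log of the raw value.
            encoded.append((1, math.log(v)))
    return encoded

pairs = two_part_encode([0, 0, 1, 100])
```

With roughly 83% of rows at zero, the indicator alone carries most of the information about the bulk of the column, and the log value is modelled only on the non-zero minority.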
column28
text free_text multilingual null_rate
This column contains free-text content warnings or age-rating disclosures for video games, with recurring phrases about mature content, nudity, sexual content, and violence. It is very sparse: 81.68% of rows are null, meaning most games carry no such warning. The duplicate rate of 17.09% (3,839 duplicates among the 22,459 non-null rows, which hold 18,620 unique values) reflects the use of templated boilerplate warning strings, while a small multilingual signal (2 Chinese, 1 Japanese entries) indicates some non-English publisher submissions. Flesch readability of 44.38 and a median length of 124 characters are consistent with dense legal/disclaimer prose. Treatment: Encode as binary 'has_warning' flag and/or extract categorical warning types (violence, nudity, sexual content) via keyword/regex before modelling; drop raw text.
- n
- 122,611
- nulls
- 100,152 (81.7%)
- unique
- 18,620
- len_min
- 2
- len_max
- 2,020
- len_mean
- 164.1
- len_median
- 124
- len_p95
- 445
- word_mean
- 25.74
- word_median
- 20
- n_empty
- 0
- n_duplicates
- 3,839
- duplicate_rate
- 0.1709
- vocab_size
- 23,061
- readability_flesch_mean
- 44.38
- emoji_rate
- 0.0007124
- url_rate
- 8.905e-05
- one_word_rate
- 0.009039
- allcaps_rate
- 0.008193
- boilerplate_rate
- 0.009484
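The 'has_warning' flag and keyword extraction suggested for column28 could be sketched as follows. The regular expressions are hypothetical keyword lists chosen for this example, not patterns derived from the data, and they would need tuning against the real warning strings (including the few non-English entries).

```python
import re

# Hypothetical keyword patterns; the categories follow the report's wording.
WARNING_PATTERNS = {
    "violence": re.compile(r"\b(violen\w*|gore|blood)\b", re.IGNORECASE),
    "nudity": re.compile(r"\b(nud\w*|naked)\b", re.IGNORECASE),
    "sexual_content": re.compile(r"\bsex\w*\b", re.IGNORECASE),
}

def encode_warning(text):
    """Map a raw warning string (or None) to binary indicator features."""
    has_text = bool(text and text.strip())
    flags = {"has_warning": int(has_text)}
    for name, pattern in WARNING_PATTERNS.items():
        flags[name] = int(has_text and pattern.search(text) is not None)
    return flags

example = encode_warning("Contains nudity and graphic violence.")
missing = encode_warning(None)
```

Null rows map to all-zero features, so the 81.68% of games without a warning need no separate imputation step.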
column29
numeric feature high_skew outliers
This column is a heavily zero-inflated count or amount field: 78.7% of its 122,611 rows are exactly zero, and the interquartile range is 0.0, meaning the entire middle 50% of the distribution is zero. Despite a median of 0 and mean of only 208, the max reaches 3,429,544, producing extreme skew (262.89) and kurtosis (75,698), with 21.3% of rows flagged as outliers. This pattern is consistent with a sparse event-count, transaction amount, or usage metric where most entities are inactive but a small tail drives enormous values. Treatment: Apply log1p-transform or treat as two-part model (zero vs. non-zero) before regression or ML use.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 3,037
- min
- 0
- max
- 3.43e+06
- mean
- 208
- median
- 0
- std
- 1.122e+04
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 262.9
- kurtosis
- 7.57e+04
- n_outliers
- 26,119
- outlier_rate
- 0.213
- zero_rate
- 0.787
column30
numeric feature high_skew
This column is a heavily zero-inflated count or amount field: 96.8% of its 122,611 rows are exactly zero, driving a median of 0.0 and an IQR of 0.0. The remaining values are extremely skewed (skew = 51.68, kurtosis = 3252.96), with a mean of 13.79 pulled far right by a maximum of 20,088, likely representing rare but large events such as transaction amounts, error counts, or penalty values. The 3,898 outliers (3.2% of rows) account for virtually all non-zero variance, which is the defining surprise here. Treatment: Apply zero-inflated modelling or split into a binary indicator plus a log-transformed positive-value sub-model before regression.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 993
- min
- 0
- max
- 20,088
- mean
- 13.79
- median
- 0
- std
- 270.4
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 51.68
- kurtosis
- 3253
- n_outliers
- 3,898
- outlier_rate
- 0.03179
- zero_rate
- 0.9682
column31
numeric feature high_skew outliers
This column is a sparse count or activity metric where the overwhelming majority of records (78.7%) are zero, producing a median of 0.0 and an IQR of exactly 0.0. The distribution is extraordinarily right-skewed (skew = 263.99, kurtosis = 76112.44), driven by extreme outliers reaching a max of 3,429,544 against a mean of only 173.57, indicating a tiny fraction of records carry massive values. Roughly 21.3% of rows (26,119) are flagged as outliers, which is an unusually high outlier rate and signals a power-law or heavy-tailed phenomenon rather than a simple data error. Treatment: Apply log1p-transform (or a zero-inflated model) before regression; consider capping at a high percentile to manage extreme outliers.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 2,511
- min
- 0
- max
- 3.43e+06
- mean
- 173.6
- median
- 0
- std
- 1.12e+04
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 264
- kurtosis
- 7.611e+04
- n_outliers
- 26,119
- outlier_rate
- 0.213
- zero_rate
- 0.787
column32
numeric feature high_skew
This column is almost certainly a sparse count or occurrence field, likely an event frequency, error count, or similar rare-event tally. The zero_rate of 96.8% means the vast majority of rows have no event, while the remaining ~3.2% drive an extreme right tail (skew=48.9, kurtosis=2848.5) reaching a maximum of 20,088 against a median of 0 and mean of 14.7. The IQR of 0.0 confirms the middle 50% of the distribution is entirely flat at zero, with 3,898 flagged outliers carrying virtually all the variance. Treatment: Apply log1p transform or treat as binary (zero vs. non-zero) flag before modelling; consider capping at a high percentile to suppress the extreme tail.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- 993
- min
- 0
- max
- 20,088
- mean
- 14.72
- median
- 0
- std
- 294.5
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 48.91
- kurtosis
- 2848
- n_outliers
- 3,898
- outlier_rate
- 0.03179
- zero_rate
- 0.9682
column33
text label one_word duplicates
This column contains game developer or publisher names, evidenced by top values such as 'Choice of Games', 'KOEI TECMO GAMES CO., LTD.', and dominant vocabulary including 'games', 'studio', 'studios', 'interactive', and 'entertainment'. The duplicate rate of 37.98% (43,364 duplicates among the 114,180 non-null rows) is expected, since companies release multiple titles, but the 70,816 unique values and a max length of 584 characters suggest occasional free-text entries or combined multi-company strings. The one-word rate of 31.8% and mean word count of ~2 words are consistent with company name formats, though the wide length range (1–584 chars) warrants inspection for outliers. Treatment: Normalize casing and strip punctuation variants before grouping; use as a categorical grouping key or encode as a feature via target/frequency encoding.
- n
- 122,611
- nulls
- 8,431 (6.9%)
- unique
- 70,816
- len_min
- 1
- len_max
- 584
- len_mean
- 14.37
- len_median
- 13
- len_p95
- 27
- word_mean
- 2.019
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 43,364
- duplicate_rate
- 0.3798
- vocab_size
- 18,429
- readability_flesch_mean
- 38.73
- emoji_rate
- 0.0008933
- url_rate
- 0.000219
- one_word_rate
- 0.3181
- allcaps_rate
- 0.07974
- boilerplate_rate
- 0
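The normalization and frequency encoding suggested for column33 can be sketched as below. The helper names and the exact cleaning rules (casefold, replace punctuation with spaces, collapse whitespace) are assumptions; real company names may need gentler rules, for example to keep distinct studios that differ only in punctuation.

```python
import re
from collections import Counter

def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace; None stays None."""
    if name is None:
        return None
    cleaned = re.sub(r"[^\w\s]", " ", name.casefold())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned or None

def frequency_encode(names):
    """Replace each name with how often its normalized form occurs (0 for nulls)."""
    normalized = [normalize_name(n) for n in names]
    counts = Counter(n for n in normalized if n is not None)
    return [counts[n] if n is not None else 0 for n in normalized]

encoded = frequency_encode(["Choice of Games", "choice of games", None, "Strategy First"])
```

Normalizing before counting merges casing and punctuation variants of the same company, which otherwise inflate the unique count and dilute the frequency signal.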
column34
text label one_word duplicates
This column contains game publisher or developer company names, as evidenced by top values like 'BFG Entertainment', 'Choice of Games', and 'Strategy First', and top words dominated by 'games', 'studio', 'studios', 'entertainment', and corporate suffixes ('llc', 'inc.', 'ltd.'). The duplicate rate is notably high at 44.9% (51,089 duplicates among the 113,778 non-null rows), which is expected since many games share the same publisher. The one-word rate of 31.8% reflects single-token studio names, and the 7.2% null rate warrants attention for records with unknown publishers. Treatment: Encode as a categorical feature (e.g. frequency or target encoding); investigate nulls at 7.2% before modelling.
- n
- 122,611
- nulls
- 8,833 (7.2%)
- unique
- 62,689
- len_min
- 1
- len_max
- 164
- len_mean
- 13.82
- len_median
- 13
- len_p95
- 26
- word_mean
- 1.988
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 51,089
- duplicate_rate
- 0.449
- vocab_size
- 15,765
- readability_flesch_mean
- 40.22
- emoji_rate
- 0.0009141
- url_rate
- 0.0002285
- one_word_rate
- 0.3178
- allcaps_rate
- 0.0817
- boilerplate_rate
- 0
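The reported duplicate figures for column34 are consistent with counting duplicates as non-null rows minus unique values and dividing by the non-null row count. That definition is inferred from the numbers in this report, not taken from the profiler's documentation, and the function name `duplicate_rate` is an assumption. A short check:

```python
def duplicate_rate(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, rate), with the rate taken over non-null rows only."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# column34: n = 122,611, nulls = 8,833, unique = 62,689
dups, rate = duplicate_rate(122_611, 8_833, 62_689)
print(dups, round(rate, 3))  # 51089 0.449
```

The denominator matters: dividing 51,089 by all 122,611 rows would give about 41.7%, not the reported 44.9%.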
column35
text feature duplicates
This column contains a comma-delimited list of Steam game features/categories (e.g., 'Single-player', 'Steam Achievements', 'Family Sharing', 'Full controller support'), typical of the Steam store's supported features field per game. The extreme duplicate rate (88.3%, 100,367 of the 113,658 non-null rows) is expected because many games share identical feature sets, and the tiny vocabulary size of 589 words confirms a finite, enumerated tag system. The 'da' language detection on 12 rows is almost certainly a false positive from short comma-separated tokens, not actual Danish text. With only 13,291 unique combinations out of 122,611 rows, this column is highly suitable for multi-label binarization. Treatment: Split on commas and one-hot encode each feature tag for modelling.
- n
- 122,611
- nulls
- 8,953 (7.3%)
- unique
- 13,291
- len_min
- 3
- len_max
- 534
- len_mean
- 71.58
- len_median
- 59
- len_p95
- 178
- word_mean
- 5.089
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 100,367
- duplicate_rate
- 0.8831
- vocab_size
- 589
- readability_flesch_mean
- -105.9
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.04047
- allcaps_rate
- 8.798e-06
- boilerplate_rate
- 0
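The split-and-encode treatment suggested for column35 can be sketched as a small multi-hot encoder. The function name `multi_hot` and the handling of nulls as all-zero rows are assumptions for this example; the same approach would apply to other comma-separated tag columns.

```python
def multi_hot(rows, sep=","):
    """Split each string on sep and return (vocab, matrix) of 0/1 indicators."""
    token_lists = [
        [t.strip() for t in row.split(sep) if t.strip()] if row else []
        for row in rows
    ]
    # Sorted vocabulary gives a stable column order across runs.
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        vec = [0] * len(vocab)
        for t in tokens:
            vec[index[t]] = 1
        matrix.append(vec)
    return vocab, matrix

vocab, matrix = multi_hot([
    "Single-player,Steam Achievements",
    None,  # null rows become all-zero vectors
    "Single-player,Family Sharing",
])
```

Splitting on the comma only keeps multi-word tags such as 'Steam Achievements' intact as single features.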
column36
text label one_word duplicates
This column contains comma-separated game genre tags (e.g., 'Casual,Indie', 'Action,Adventure,Indie'), consistent with a Steam or similar game catalog dataset. The duplicate rate is extremely high at 97.5% (111,304 duplicates among the 114,198 non-null rows), reflecting the natural cardinality collapse when games share genre combinations: only 2,894 unique tag-sets exist across 122,611 rows. The top words 'to', 'access', and 'play' most plausibly come from multi-word genre labels such as 'Early Access' and 'Free to Play', which are standard Steam genre names rather than free text; the maximum length of 236 characters is still worth inspecting for malformed values. Treatment: Split on commas only (so multi-word labels stay intact) to multi-hot encode genre tags before modelling; review unusually long values for cleansing.
- n
- 122,611
- nulls
- 8,413 (6.9%)
- unique
- 2,894
- len_min
- 3
- len_max
- 236
- len_mean
- 22.21
- len_median
- 21
- len_p95
- 45
- word_mean
- 1.364
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 111,304
- duplicate_rate
- 0.9747
- vocab_size
- 940
- readability_flesch_mean
- -206.1
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7892
- allcaps_rate
- 0.009781
- boilerplate_rate
- 0
column37
text label multilingual null_rate
This column contains comma-separated genre/tag lists for software or game products (e.g., 'Adventure,Casual,Hidden Object', 'Action,Indie'), consistent with a Steam-style app catalog. The null rate of 32.02% is notably high and warrants investigation before modelling. A multilingual alert is raised, but the non-English content is negligible (26 records out of 3,376 detected), suggesting near-uniform English data with minor noise. The duplicate rate of 7.4% (6,167 duplicates among the 83,346 non-null rows) is low: 77,179 of the non-null values are distinct, so most rows carry their own tag combination and the full string is a poor categorical key. Treatment: Split on commas to multi-hot encode individual tags; investigate and decide on imputation strategy for the 32.02% null rows before modelling.
- n
- 122,611
- nulls
- 39,265 (32.0%)
- unique
- 77,179
- len_min
- 3
- len_max
- 295
- len_mean
- 141.3
- len_median
- 163
- len_p95
- 228
- word_mean
- 4.923
- word_median
- 5
- n_empty
- 0
- n_duplicates
- 6,167
- duplicate_rate
- 0.07399
- vocab_size
- 57,260
- readability_flesch_mean
- -449.7
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.1233
- allcaps_rate
- 4.799e-05
- boilerplate_rate
- 0
column38
text metadata near_unique one_word url_heavy
This column contains comma-separated lists of Steam screenshot URLs (Akamai CDN), one packed string per row representing all screenshot images for a given Steam game entry. Every value is technically 'one word' (no spaces) because the URLs are concatenated without whitespace, explaining the paradoxical one_word_rate of 1.0 alongside a mean length of ~1319 characters and a max of 29132. With 116,483 unique values among the 116,593 non-null rows and only 110 duplicates, this is near-unique; the small duplicate count likely reflects games with identical screenshot sets. Treatment: Split on commas to extract individual screenshot URLs per game; store as a list-type column or explode into a separate screenshots table keyed by game id.
- n
- 122,611
- nulls
- 6,018 (4.9%)
- unique
- 116,483
- len_min
- 144
- len_max
- 29,132
- len_mean
- 1319
- len_median
- 1,039
- len_p95
- 2,773
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 110
- duplicate_rate
- 0.0009435
- vocab_size
- 19,994
- readability_flesch_mean
- -5099
- emoji_rate
- 0
- url_rate
- 1
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
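Exploding column38 into a separate screenshots table, as the treatment suggests, could look like the sketch below. The `game_id` key, the `(game_id, position, url)` row layout, and the example URLs are hypothetical; the sketch also assumes that no individual URL contains a comma, which should be verified on the real data.

```python
def explode_screenshots(rows):
    """rows: iterable of (game_id, packed_urls). Return (game_id, position, url) tuples."""
    out = []
    for game_id, packed in rows:
        if not packed:
            # Null or empty packed strings produce no screenshot rows.
            continue
        for position, url in enumerate(u.strip() for u in packed.split(",")):
            if url:
                out.append((game_id, position, url))
    return out

table = explode_screenshots([
    (1, "https://example.com/a.jpg,https://example.com/b.jpg"),
    (2, None),
])
```

Keeping the position preserves the original screenshot order, which a plain set of URLs would lose.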
column39
unknown other skipped
This column was skipped by the profiler, so its content and type are entirely unknown. With 122,611 rows, zero nulls, and no computed statistics or uniqueness information, no data-driven characterisation is possible. The 'skipped' alert is the only signal available. Treatment: Manually inspect raw values to determine type and role before any further processing.
- n
- 122,611
- nulls
- 0 (0.0%)
- unique
- —