data trove steam games catalog

source /home/coolhand/html/datavis/data_trove/entertainment/gaming/enriched/games.csv · 122,611 rows · 40 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This is a Steam games catalogue with 122,611 rows and 40 columns, covering titles, publishers, developers, genres, pricing, review counts, and associated URLs. The most important thing to examine first is the extreme skew across nearly all numeric engagement columns (column23, column24, column27, column29, column31): medians sit at 0–5 while means run into the hundreds or thousands, meaning a tiny fraction of blockbuster titles account for the vast majority of reviews and activity. A second area worth attention is genre distribution (column36), where just a handful of Casual/Indie/Action combinations account for the bulk of the catalogue, and the estimated owner-count banding (column03) shows over 61% of games have fewer than 20,000 owners — pointing to a long-tail market dominated by low-visibility titles.

citing: column23.stats.median · column23.stats.mean · column27.stats.zero_rate · column29.stats.zero_rate · column03.top_values · column03.stats.top_rate · column36.top_values · column36.stats.duplicate_rate · column06.stats.median · column06.stats.mean

Charts the summary said to look at first

column03 · Owner-count bands reveal the long tail: over 61% of games sit in the lowest tier (0–20,000 owners), with rapid drop-off at higher counts.

Top values for column03 (14 unique shown, of 14 total).
value	count	share
0 - 20000	75404	61.5%
0 - 0	21641	17.7%
20000 - 50000	11396	9.3%
50000 - 100000	5355	4.4%
100000 - 200000	3454	2.8%
200000 - 500000	2853	2.3%
500000 - 1000000	1154	0.9%
1000000 - 2000000	729	0.6%
2000000 - 5000000	405	0.3%
5000000 - 10000000	125	0.1%
10000000 - 20000000	51	0.0%
20000000 - 50000000	31	0.0%
50000000 - 100000000	9	0.0%
100000000 - 200000000	4	0.0%

column36 · Genre tag combinations show Casual, Indie, and Action dominate Steam listings — look for which combos are over-represented relative to their review volumes.

Character-length distribution for column36 (mean: 22.21).
chars	count
3 – 9	12259
9 – 15	21084
15 – 20	22318
20 – 26	25837
26 – 32	12284
32 – 38	8026
38 – 44	5596
44 – 50	2995
50 – 55	1587
55 – 61	848
61 – 67	593
67 – 73	229
73 – 79	196
79 – 85	137
85 – 90	71
90 – 96	35
96 – 102	37
102 – 108	13
108 – 114	15
114 – 120	10
120 – 125	6
125 – 131	6
131 – 137	4
137 – 143	2
143 – 149	2
149 – 154	1
154 – 160	0
160 – 166	1
166 – 172	4
172 – 178	0
178 – 184	0
184 – 189	0
189 – 195	0
195 – 201	0
201 – 207	0
207 – 213	1
213 – 219	0
219 – 224	0
224 – 230	0
230 – 236	1

column23 · Review counts are extremely right-skewed with a median of 5 and a max of over 7 million, illustrating the blockbuster vs. obscurity divide.

Histogram bins for column23 (median: 5.0).
bin	count
0 – 1.911e+05	122511
1.911e+05 – 3.821e+05	57
3.821e+05 – 5.732e+05	16
5.732e+05 – 7.642e+05	10
7.642e+05 – 9.553e+05	6
9.553e+05 – 1.146e+06	5
1.146e+06 – 1.337e+06	1
1.337e+06 – 1.528e+06	2
1.528e+06 – 1.719e+06	0
1.719e+06 – 1.911e+06	1
1.911e+06 – 2.102e+06	1
2.102e+06 – 2.293e+06	0
2.293e+06 – 2.484e+06	0
2.484e+06 – 2.675e+06	0
2.675e+06 – 2.866e+06	0
2.866e+06 – 3.057e+06	0
3.057e+06 – 3.248e+06	0
3.248e+06 – 3.439e+06	0
3.439e+06 – 3.63e+06	0
3.63e+06 – 3.821e+06	0
3.821e+06 – 4.012e+06	0
4.012e+06 – 4.203e+06	0
4.203e+06 – 4.394e+06	0
4.394e+06 – 4.585e+06	0
4.585e+06 – 4.776e+06	0
4.776e+06 – 4.967e+06	0
4.967e+06 – 5.158e+06	0
5.158e+06 – 5.349e+06	0
5.349e+06 – 5.541e+06	0
5.541e+06 – 5.732e+06	0
5.732e+06 – 5.923e+06	0
5.923e+06 – 6.114e+06	0
6.114e+06 – 6.305e+06	0
6.305e+06 – 6.496e+06	0
6.496e+06 – 6.687e+06	0
6.687e+06 – 6.878e+06	0
6.878e+06 – 7.069e+06	0
7.069e+06 – 7.26e+06	0
7.26e+06 – 7.451e+06	0
7.451e+06 – 7.642e+06	1

column06 · Price distribution is heavily skewed toward low values (median $2.24, mean $4.77) with a long tail up to $999.98 — check for outlier pricing strategies.

Histogram bins for column06 (median: 2.24).
bin	count
0 – 25	120926
25 – 50	1081
50 – 75	248
75 – 100	47
100 – 125	6
125 – 150	13
150 – 175	1
175 – 200	282
200 – 225	0
225 – 250	0
250 – 275	1
275 – 300	2
300 – 325	0
325 – 350	0
350 – 375	0
375 – 400	0
400 – 425	0
425 – 450	0
450 – 475	0
475 – 500	0
500 – 525	1
525 – 550	0
550 – 575	0
575 – 600	0
600 – 625	0
625 – 650	0
650 – 675	0
675 – 700	0
700 – 725	0
725 – 750	0
750 – 775	0
775 – 800	0
800 – 825	0
825 – 850	0
850 – 875	0
875 – 900	0
900 – 925	0
925 – 950	0
950 – 975	0
975 – 1000	3

column02 · Release dates cluster heavily in 2024–2025, showing recent acceleration in Steam publishing volume worth comparing against owner and review counts.

Character-length distribution for column02 (mean: 11.72).
chars	count
11	34494
12	88117

Schema

40 columns

Per-column summary.
column	type	null rate	unique	alerts
column00	numeric	0.0%	122,611
column01	text	0.0%	121,454	near_unique
column02	text	0.0%	5,081	short_text duplicates
column03	categorical	0.0%	14
column04	numeric	0.0%	1,110	high_skew outliers
column05	numeric	0.0%	15	high_skew
column06	numeric	0.0%	941	high_skew outliers
column07	numeric	0.0%	88
column08	numeric	0.0%	117	high_skew outliers
column09	text	6.9%	113,556	near_unique
column10	text	0.0%	19,113	one_word duplicates
column11	text	0.0%	3,710	one_word duplicates
column12	text	90.2%	11,884	near_unique null_rate
column13	text	0.1%	122,420	near_unique one_word url_heavy
column14	text	59.5%	39,703	one_word url_heavy null_rate duplicates
column15	text	55.8%	35,399	one_word url_heavy null_rate duplicates
column16	text	18.1%	60,519	one_word duplicates
column17	categorical	0.0%	2	imbalance
column18	categorical	0.0%	2
column19	categorical	0.0%	2
column20	numeric	0.0%	73	high_skew
column21	text	96.5%	4,160	near_unique one_word url_heavy null_rate
column22	numeric	0.0%	31	high_skew
column23	numeric	0.0%	5,540	high_skew outliers
column24	numeric	0.0%	2,725	high_skew outliers
column25	numeric	100.0%	3	null_rate
column26	numeric	0.0%	448	high_skew outliers
column27	numeric	0.0%	5,332	high_skew outliers
column28	text	81.7%	18,620	multilingual null_rate
column29	numeric	0.0%	3,037	high_skew outliers
column30	numeric	0.0%	993	high_skew
column31	numeric	0.0%	2,511	high_skew outliers
column32	numeric	0.0%	993	high_skew
column33	text	6.9%	70,816	one_word duplicates
column34	text	7.2%	62,689	one_word duplicates
column35	text	7.3%	13,291	duplicates
column36	text	6.9%	2,894	one_word duplicates
column37	text	32.0%	77,179	multilingual null_rate
column38	text	4.9%	116,483	near_unique one_word url_heavy
column39	unknown	0.0%	—	skipped

column00

numeric identifier

This column contains 122,611 numeric values that are all unique, null-free, and span from 10 to 4,264,350 — strongly suggesting it is a unique numeric identifier (e.g., a record ID or transaction number). The distribution is remarkably flat and near-uniform: kurtosis of -1.05, negligible skew of 0.18, and zero detected outliers, which is highly unusual for a natural measurement or feature and is consistent with a sequentially or pseudo-randomly assigned integer key. The IQR of 1,806,385 is close to half the full range, further supporting a uniform spread across the ID space. Treatment: Drop before modelling or use as a row key only; do not use as a predictive feature. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 122,611
min: 10
max: 4.264e+06
mean: 1.985e+06
median: 1.907e+06
std: 1.088e+06
q1: 1.063e+06
q3: 2.87e+06
iqr: 1.806e+06
skew: 0.1772
kurtosis: -1.05
n_outliers: 0
outlier_rate: 0
zero_rate: 0

column01

text label near_unique

This column contains short, near-unique text strings averaging ~3 words and 18 characters, consistent with the game titles described in the dataset summary. The dominant top words ('playtest', 'vr', 'simulator') suggest that playtest, VR, and simulator entries are common among them. Notable signals include 1,156 duplicates (~0.94% duplicate rate), a small emoji presence (0.26%), and a maximum length of 413 characters, which is anomalously long relative to the median of 16. Treatment: Use as a descriptive label; deduplicate or flag the 1,156 repeated entries, and investigate the long-tail outliers (len_max 413) before any downstream grouping or embedding. medium · anthropic:default

n: 122,611
nulls: 1 (0.0%)
unique: 121,454
len_min: 1
len_max: 413
len_mean: 18.07
len_median: 16
len_p95: 37.55
word_mean: 2.912
word_median: 3
n_empty: 0
n_duplicates: 1,156
duplicate_rate: 0.009428
vocab_size: 18,813
readability_flesch_mean: 52.87
emoji_rate: 0.002585
url_rate: 0
one_word_rate: 0.1866
allcaps_rate: 0.06731
boilerplate_rate: 4.078e-05

column02

text timestamp short_text duplicates

This column contains dates formatted as 'Mon DD, YYYY' (e.g., 'Oct 23, 2025'), stored as text rather than a native date type. The values span at least 2021–2025 based on top word frequencies. The duplicate rate is a striking 95.86%: the 122,611 rows hold only 5,081 distinct dates, so 117,530 rows repeat a date that already appears elsewhere and many records map to the same calendar day. The near-constant string length (median 12, min 11, max 12) and vocabulary of just 68 tokens confirm this is a tightly formatted date field with no free-text variation. Treatment: Parse to a native date type (e.g., datetime64) before any time-series analysis or feature engineering. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 5,081
len_min: 11
len_max: 12
len_mean: 11.72
len_median: 12
len_p95: 12
word_mean: 3
word_median: 3
n_empty: 0
n_duplicates: 117,530
duplicate_rate: 0.9586
vocab_size: 68
readability_flesch_mean: 98.6
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
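
The parsing step recommended for column02 can be sketched with the standard library alone. This is a minimal sketch, assuming every value follows the 'Mon DD, YYYY' pattern described above and that month abbreviations are English; `parse_release_date` is a hypothetical helper name, not part of the profiler.

```python
from datetime import date, datetime
from typing import Optional


def parse_release_date(text: str) -> Optional[date]:
    """Parse a 'Mon DD, YYYY' string; return None when it does not match."""
    try:
        # %b matches abbreviated English month names under the default C locale.
        return datetime.strptime(text.strip(), "%b %d, %Y").date()
    except (ValueError, AttributeError):
        # Malformed strings and non-string inputs are reported as missing.
        return None


parsed = parse_release_date("Oct 23, 2025")
```

Returning `None` instead of raising keeps malformed rows visible as missing values, so they can be counted before any time-series work.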

column03

categorical feature

This column encodes a numeric quantity as binned range labels. The dataset summary identifies it as the estimated owner-count band for each game, with roughly logarithmically spaced bin edges running from '0 - 0' up to '100000000 - 200000000'. The distribution is heavily right-skewed: 61.5% of rows fall in the '0 - 20000' bucket alone, and a notable 21,641 rows sit in '0 - 0', suggesting a zero-value spike that may warrant separate treatment. With only 14 distinct values and zero nulls across 122,611 rows, the encoding is clean but lossy. Treatment: Ordinal-encode using the natural bin order, or extract bin midpoints as a numeric approximation; investigate the '0 - 0' segment (21,641 rows) as a potential distinct class. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 14
top_value: 0 - 20000
top_rate: 0.615
cardinality: 14
entropy: 1.814
entropy_ratio: 0.4764
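
The two encodings suggested for column03 (ordinal rank and bin midpoint) might look like the following. This is a sketch that assumes every label has the 'low - high' shape shown in the top-values table; the function names are illustrative only.

```python
from typing import Dict, List, Tuple


def parse_band(label: str) -> Tuple[int, int]:
    """Split a label such as '20000 - 50000' into integer (low, high) bounds."""
    low, high = label.split(" - ")
    return int(low), int(high)


def ordinal_codes(labels: List[str]) -> Dict[str, int]:
    """Map each distinct band label to its rank in natural bin order."""
    ordered = sorted(set(labels), key=parse_band)
    return {label: rank for rank, label in enumerate(ordered)}


def midpoint(label: str) -> float:
    """Numeric approximation of a band: the middle of its bounds."""
    low, high = parse_band(label)
    return (low + high) / 2


bands = ["0 - 20000", "0 - 0", "20000 - 50000", "50000 - 100000"]
codes = ordinal_codes(bands)
```

Sorting on the parsed bounds places '0 - 0' before '0 - 20000', so the zero band gets its own code and can still be separated out as a distinct class.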

column04

numeric feature high_skew outliers

This column is a heavily zero-inflated numeric field — likely a count, transaction amount, or event frequency — where 83.95% of values are exactly zero and the interquartile range is 0.0, meaning the entire middle half of the distribution is flat at zero. The remaining values are extremely right-skewed (skew = 209.95, kurtosis = 51452.44) with a max of 1,013,936 against a mean of only 54.59, indicating a small number of very large outliers; 16.05% of rows (19,676) are flagged as outliers. The 1,110 unique values and zero null rate suggest this may be a sparse activity or volume metric. Treatment: Consider a two-part model (zero-inflation indicator + log1p transform on non-zero values) or cap/winsorize at a high percentile before modelling. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 1,110
min: 0
max: 1.014e+06
mean: 54.59
median: 0
std: 3729
q1: 0
q3: 0
iqr: 0
skew: 210
kurtosis: 5.145e+04
n_outliers: 19,676
outlier_rate: 0.1605
zero_rate: 0.8395
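
A two-part representation of the kind recommended for column04 can be sketched as a zero/non-zero indicator paired with a log1p-compressed magnitude. This is only the feature-building step under the assumption of non-negative inputs, not a fitted zero-inflated model; `two_part_features` is a hypothetical name.

```python
import math
from typing import List, Tuple


def two_part_features(values: List[float]) -> List[Tuple[int, float]]:
    """Return (is_nonzero, log1p(value)) pairs for non-negative inputs."""
    features = []
    for value in values:
        if value < 0:
            raise ValueError("this sketch assumes non-negative values")
        # log1p(0) is exactly 0.0, so zero rows stay at zero after the transform.
        features.append((int(value > 0), math.log1p(value)))
    return features


sample = [0, 0, 3, 1013936]
features = two_part_features(sample)
```

The indicator carries the "any activity at all" signal, while the log1p term keeps the ordering of the non-zero tail without letting the maximum dominate.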

column05

numeric feature high_skew

This column is a low-cardinality integer count (only 15 distinct values, range 0–21) where 98.96% of rows are exactly zero, making it an extreme sparse count feature — likely recording rare events or occurrences per record. The distribution is severely right-skewed (skew 9.88, kurtosis 96.52) with only 1,272 outlier rows (1.04%) carrying any non-zero signal; the IQR is zero because all three quartiles collapse to 0. Treatment: Treat as a sparse count; consider binarising (0 vs >0) or applying log1p transform, and flag the 1,272 non-zero rows as a minority sub-population for modelling. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 15
min: 0
max: 21
mean: 0.1676
median: 0
std: 1.654
q1: 0
q3: 0
iqr: 0
skew: 9.883
kurtosis: 96.52
n_outliers: 1,272
outlier_rate: 0.01037
zero_rate: 0.9896

column06

numeric feature high_skew outliers

The dataset summary treats this column as the listed price in dollars: a continuous non-negative measure where most values are small. The distribution is extreme: the median is 2.24 and Q3 is only 5.24, yet the max reaches 999.98 (displayed rounded as 1000 in the stats below), producing a skew of 22.4 and a kurtosis of 1,135. Over 7.5% of rows (9,297) are flagged as outliers, and 21.4% of values are exactly zero, suggesting a two-part structure (zero-inflation plus a heavy-tailed positive component) that would violate standard regression assumptions. Treatment: Model the zero-inflation separately (e.g., hurdle or Tweedie model), then log1p-transform the positive portion before regression or scaling. medium · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 941
min: 0
max: 1000
mean: 4.765
median: 2.24
std: 12.53
q1: 0.55
q3: 5.24
iqr: 4.69
skew: 22.4
kurtosis: 1135
n_outliers: 9,297
outlier_rate: 0.07583
zero_rate: 0.2137

column07

numeric feature

This column is a bounded numeric score or percentage, ranging from 0 to 100 with only 88 distinct values, suggesting a discretized or rounded measurement (e.g., a completion rate, satisfaction score, or grade). The most striking feature is that 66.8% of values are exactly zero, making the distribution heavily zero-inflated; the median is 0.0 while the mean is 18.35 and Q3 is only 40.0, confirming the mass is concentrated at the floor. Despite the zero inflation, kurtosis is near zero (−0.05) and no rows are flagged as outliers, so the non-zero values appear to be spread across the 0–100 range rather than concentrated in an extreme tail. Analysts should treat this as a zero-inflated bounded variable rather than a standard continuous feature. Treatment: Model with a two-part (hurdle/zero-inflated) approach, or apply an indicator for zero alongside the raw value; avoid log-transform without offset due to zero mass. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 88
min: 0
max: 100
mean: 18.35
median: 0
std: 28.86
q1: 0
q3: 40
iqr: 40
skew: 1.22
kurtosis: -0.05072
n_outliers: 0
outlier_rate: 0
zero_rate: 0.6682

column08

numeric feature high_skew outliers

This column is a sparse count or event-frequency field: 85.5% of its 122,611 rows are exactly zero, the median and IQR are both 0, yet the mean is 0.55 and the max reaches 3,703. The extreme concentration at zero combined with a skew of 171.8 and kurtosis of 38,359 indicates a heavy-tailed distribution driven by rare but very large values; 14.5% of rows (17,771) are flagged as outliers. Only 117 distinct values across 122,611 rows further suggests this is a discrete count, not a continuous measure. Treatment: Apply log1p transform or use a zero-inflated / Poisson model; consider capping or winsorizing at a high quantile given the max of 3703. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 117
min: 0
max: 3,703
mean: 0.5459
median: 0
std: 14.52
q1: 0
q3: 0
iqr: 0
skew: 171.8
kurtosis: 3.836e+04
n_outliers: 17,771
outlier_rate: 0.1449
zero_rate: 0.8551

column09

text free_text near_unique

This column contains long-form natural language text, likely user-generated content such as reviews, product descriptions, or messages — with a mean of 1,297 characters and 214 words per entry, and a vocabulary of 105,903 unique terms. The near-unique alert (113,556 unique values out of 122,611 rows) confirms these are essentially free-text narratives rather than categorical labels. Notably, 4.7% of entries contain emojis, suggesting informal or consumer-facing content, and the max length of 89,665 characters indicates some extreme outliers well beyond the 95th-percentile length of 2,966 characters. Flesch readability mean of 58.7 places the text in a 'fairly easy' register, consistent with consumer writing. Treatment: Tokenize and embed (e.g., sentence-transformers) before modelling; flag or truncate the extreme-length outliers above len_p95 of 2,966 characters. high · anthropic:default

n: 122,611
nulls: 8,449 (6.9%)
unique: 113,556
len_min: 1
len_max: 89,665
len_mean: 1297
len_median: 1,064
len_p95: 2,966
word_mean: 214.3
word_median: 177
n_empty: 0
n_duplicates: 606
duplicate_rate: 0.005308
vocab_size: 105,903
readability_flesch_mean: 58.75
emoji_rate: 0.04672
url_rate: 0.0003504
one_word_rate: 0.0004029
allcaps_rate: 0.007559
boilerplate_rate: 0.01517

column10

text feature one_word duplicates

This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or game). The dominant value is `['English']` appearing 55,314 times, with `[]` (no languages listed) in 8,380 rows. The duplicate rate is extremely high at 84.4%, which is expected given the limited vocabulary of 217 unique tokens and only 19,113 unique values across 122,611 rows — the data is stored as raw string-serialized lists rather than a normalized structure, which is a notable preprocessing concern. Treatment: Parse the string-serialized lists into actual list structures, then multi-hot encode each language as a binary feature column. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 19,113
len_min: 2
len_max: 1,216
len_mean: 68.02
len_median: 11
len_p95: 224
word_mean: 6.889
word_median: 1
n_empty: 0
n_duplicates: 103,498
duplicate_rate: 0.8441
vocab_size: 217
readability_flesch_mean: 14.07
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5333
allcaps_rate: 0
boilerplate_rate: 0
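
Parsing the string-serialized lists and multi-hot encoding them could be sketched as below. It assumes the cells are valid Python list literals such as `['English']` or `[]`; unparseable cells are treated as empty, which is a choice of this sketch, and 'French' is an illustrative value.

```python
import ast
from typing import Dict, List


def parse_list_cell(cell: str) -> List[str]:
    """Turn a string like "['English', 'French']" into a real list of strings."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []  # assumption: unparseable cells are treated as empty
    if not isinstance(parsed, list):
        return []
    return [str(item).strip() for item in parsed]


def multi_hot(rows: List[List[str]]) -> List[Dict[str, int]]:
    """Build one 0/1 indicator per language seen anywhere in the rows."""
    vocabulary = sorted({lang for row in rows for lang in row})
    return [{lang: int(lang in row) for lang in vocabulary} for row in rows]


cells = ["['English']", "[]", "['English', 'French']"]
encoded = multi_hot([parse_list_cell(c) for c in cells])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for cells that come from scraped data.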

column11

text feature one_word duplicates

This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or media item). The dominant value is '[]' (empty list) appearing 72,730 times — nearly 60% of rows — indicating most records have no language metadata populated. Despite 122,611 rows, only 3,710 unique values exist and the duplicate rate is 96.97%, which is expected for a categorical-list field, but the vocabulary is tiny at just 194 words, confirming a closed set of language names. Treatment: Parse the serialized list strings into proper multi-label indicators (one binary column per language) before modelling; treat '[]' as missing/unknown. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 3,710
len_min: 2
len_max: 1,216
len_mean: 24.31
len_median: 2
len_p95: 46
word_mean: 2.854
word_median: 1
n_empty: 0
n_duplicates: 118,901
duplicate_rate: 0.9697
vocab_size: 194
readability_flesch_mean: 8.003
emoji_rate: 0
url_rate: 0
one_word_rate: 0.813
allcaps_rate: 0
boilerplate_rate: 0

column12

text free_text near_unique null_rate

This column contains substantial free-text descriptions or reviews, most likely about games: the word 'game' is the top non-stopword at 7,882 occurrences, average text length is ~340 characters (~57 words), and the vocabulary spans 61,840 unique tokens. The 90.16% null rate is a major alert: only about 12,000 of 122,611 rows carry any content, meaning this field is sparsely populated. An emoji_rate of ~1.6% and a mean Flesch readability score of ~57.8 suggest informal, consumer-written prose. The near_unique flag is partially explained by the sparse population: 11,884 unique values among ~12,000 non-null rows confirms almost every entry is distinct. Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) before modelling; impute or mask nulls explicitly given the 90.16% null rate. high · anthropic:default

n: 122,611
nulls: 110,541 (90.2%)
unique: 11,884
len_min: 3
len_max: 2,912
len_mean: 340.3
len_median: 295
len_p95: 763
word_mean: 57.37
word_median: 49
n_empty: 0
n_duplicates: 186
duplicate_rate: 0.01541
vocab_size: 61,840
readability_flesch_mean: 57.83
emoji_rate: 0.01649
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.008202
boilerplate_rate: 0

column13

text metadata near_unique one_word url_heavy

This column contains Steam CDN URLs pointing to game header images hosted on Akamai's steamstatic.com infrastructure — specifically `header.jpg` assets keyed by Steam app ID. With a url_rate of 1.0 and one_word_rate of 1.0, every single value is a single URL. The column is near-unique (122,420 distinct values out of 122,611 rows), with only 110 duplicates, suggesting these map closely to individual game or product records; the small number of repeated URLs (max frequency 5) likely reflects games appearing in multiple dataset rows. Treatment: Extract Steam app ID from URL path for joining; drop raw URL before modelling or store as-is for image retrieval pipelines. high · anthropic:default

n: 122,611
nulls: 81 (0.1%)
unique: 122,420
len_min: 93
len_max: 153
len_mean: 104.6
len_median: 98
len_p95: 139
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 110
duplicate_rate: 0.0008977
vocab_size: 19,992
readability_flesch_mean: -834.3
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
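
Extracting the app ID for joining might be done with a single regular expression. The path layout (an `/apps/<digits>/` segment somewhere before `header.jpg`) is an assumption of this sketch and should be checked against real values; the example URL is made up.

```python
import re
from typing import Optional

# Assumed layout: ".../apps/<digits>/..." appears somewhere in the URL path.
APP_ID_PATTERN = re.compile(r"/apps/(\d+)/")


def extract_app_id(url: Optional[str]) -> Optional[int]:
    """Return the numeric app ID embedded in the URL, or None if absent."""
    if not url:
        return None
    match = APP_ID_PATTERN.search(url)
    return int(match.group(1)) if match else None


# Hypothetical URL shaped like the header.jpg assets described above.
example = "https://cdn.example.steamstatic.com/steam/apps/12345/header.jpg"
app_id = extract_app_id(example)
```

Rows where the function returns `None` (including the 81 nulls) are worth listing separately before relying on the extracted ID as a join key.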

column14

text metadata one_word url_heavy null_rate duplicates

This column contains publisher or developer website URLs, almost certainly scraped from a Steam or similar games catalogue. Virtually every non-null value is a single URL (one_word_rate 0.9999, url_rate 0.9999), pointing to publisher homepages, Facebook pages, or Steam publisher/group pages. Two signals stand out: 59.48% of rows are null, meaning many game records carry no website; and 20.08% of non-null values are duplicates (9,973 repeated URLs), reflecting publishers with large catalogues who share one website across many titles. Treatment: Extract domain as a categorical publisher identifier; flag or impute nulls; do not embed raw URL strings. high · anthropic:default

n: 122,611
nulls: 72,935 (59.5%)
unique: 39,703
len_min: 7
len_max: 236
len_mean: 32.57
len_median: 29
len_p95: 56
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 9,973
duplicate_rate: 0.2008
vocab_size: 17,059
readability_flesch_mean: -260.3
emoji_rate: 0
url_rate: 0.9999
one_word_rate: 0.9999
allcaps_rate: 0
boilerplate_rate: 0
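
Reducing each URL to a domain that can serve as a categorical publisher identifier could look like this. It is a sketch: `domain_of` is a hypothetical helper, and stripping a leading `www.` is a normalisation choice rather than something the profile prescribes.

```python
from typing import Optional
from urllib.parse import urlparse


def domain_of(url: Optional[str]) -> Optional[str]:
    """Return the lowercased host without a leading 'www.', or None."""
    if not url or not url.strip():
        return None
    text = url.strip()
    # urlparse only fills in the host when a scheme or '//' prefix is present.
    if "://" not in text:
        text = "//" + text
    try:
        host = urlparse(text).hostname or ""
    except ValueError:
        return None
    if host.startswith("www."):
        host = host[4:]
    return host or None


domain = domain_of("https://www.example.com/games")
```

The same helper applies to any URL-valued column where the host, rather than the full path, identifies the owning entity.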

column15

text metadata one_word url_heavy null_rate duplicates

This column is a support/contact URL field — almost certainly a developer or publisher support link associated with game or software records. 95.6% of non-null values are URLs, and the one-word rate is 99.9%, consistent with bare URL strings. Two surprises stand out: the null rate is very high at 55.8%, meaning more than half of records lack this URL, and the duplicate rate is 34.7% (18,808 duplicate values out of ~54,200 non-null rows), reflecting that many games share the same support domain (e.g., Big Fish Games, EA, Facebook pages). Treatment: Extract domain as a categorical feature; treat raw URL as a grouping key rather than a text feature; impute or flag nulls separately given 55.8% null rate. high · anthropic:default

n: 122,611
nulls: 68,404 (55.8%)
unique: 35,399
len_min: 1
len_max: 851
len_mean: 31.19
len_median: 29
len_p95: 51
word_mean: 1.002
word_median: 1
n_empty: 0
n_duplicates: 18,808
duplicate_rate: 0.347
vocab_size: 14,875
readability_flesch_mean: -245.1
emoji_rate: 0
url_rate: 0.9559
one_word_rate: 0.9993
allcaps_rate: 0.0007933
boilerplate_rate: 0

column16

text foreign_key one_word duplicates

This column contains email addresses for game developers or publishers, as evidenced by the top values (e.g., 'info@bigfishgames.com', 'support@quanticlab.com'). Nearly all values (99.86%) are single tokens, consistent with email format. The duplicate rate is high at 39.7% (39,849 duplicates among the 100,368 non-null rows), indicating many records share a contact email, which is expected for a publisher-level field where one entity owns multiple titles. The null rate of 18.14% is notable and should be investigated for systematic missingness. Treatment: Use as a grouping/join key on publisher or developer entity; normalize to lowercase and strip whitespace before joining. high · anthropic:default

n: 122,611
nulls: 22,243 (18.1%)
unique: 60,519
len_min: 1
len_max: 169
len_mean: 22.91
len_median: 23
len_p95: 31
word_mean: 1.004
word_median: 1
n_empty: 0
n_duplicates: 39,849
duplicate_rate: 0.397
vocab_size: 15,319
readability_flesch_mean: -223.7
emoji_rate: 9.963e-06
url_rate: 0.003906
one_word_rate: 0.9986
allcaps_rate: 0.001016
boilerplate_rate: 0
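
The normalisation recommended before joining can be sketched as follows. The validity check (exactly one '@' with text on both sides and no internal spaces) is a deliberately crude assumption of this sketch, intended only to set aside the small share of values that are URLs or stray tokens.

```python
from typing import Optional


def normalize_email(value: Optional[str]) -> Optional[str]:
    """Lowercase and strip an address; return None if it does not look like one."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.partition("@")
    # Reject values without '@', with an empty side, a second '@', or spaces.
    if not sep or not local or not domain or "@" in domain or " " in cleaned:
        return None
    return cleaned


key = normalize_email("  Info@BigFishGames.com ")
```

Values that normalise to `None` are best kept out of the join key rather than grouped together, since they do not identify a single publisher.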

column17

categorical feature imbalance

This column is a boolean flag stored as string values ('True'/'False'), covering 122,611 rows with no nulls. It is severely imbalanced: 'True' accounts for 99.964% of rows (122,567 occurrences) while 'False' appears only 44 times. The near-zero entropy (0.0046) confirms the column carries almost no information, making it nearly constant. Treatment: Investigate whether the 44 'False' rows are meaningful anomalies; otherwise drop as near-constant with no predictive variance. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 2
top_value: True
top_rate: 0.9996
cardinality: 2
entropy: 0.004625
entropy_ratio: 0.004625

column18

categorical feature

This column is a binary boolean flag stored as string literals 'True'/'False', with zero nulls across 122,611 rows. The dominant value is 'False' at 82.6% (101,319 occurrences), leaving 'True' at roughly 17.4% (21,292) — a moderately imbalanced split that may matter for classification tasks. The entropy ratio of 0.666 confirms meaningful but uneven information content. Treatment: Cast to boolean/integer (0/1) and monitor class imbalance if used as a target or predictor. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 2
top_value: False
top_rate: 0.8263
cardinality: 2
entropy: 0.666
entropy_ratio: 0.666
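
Casting the 'True'/'False' strings to integers and checking the class balance might be sketched like this. The helper names are illustrative, and mapping any unexpected value to `None` is an assumption about how to treat values outside the two observed literals.

```python
from typing import List, Optional

FLAG_MAP = {"True": 1, "False": 0}


def flag_to_int(value: object) -> Optional[int]:
    """Map 'True'/'False' strings to 1/0; anything else becomes None."""
    if not isinstance(value, str):
        return None
    return FLAG_MAP.get(value.strip())


def positive_rate(values: List[object]) -> float:
    """Share of recognised flags that are 1 (0.0 when none are recognised)."""
    known = [code for code in map(flag_to_int, values) if code is not None]
    return sum(known) / len(known) if known else 0.0


rate = positive_rate(["True", "False", "False", "False"])
```

The same mapping applies to the other boolean-as-string columns in this profile; `positive_rate` gives a quick check on the imbalance before a flag is used as a target.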

column19

categorical label

This column is a boolean flag stored as string literals 'True'/'False', covering all 122,611 rows with zero nulls. The distribution is heavily skewed: 'False' dominates at 87.2% (106,905 rows) versus 'True' at only 12.8% (15,706 rows). The low entropy of 0.552 confirms the imbalance. An analyst building a classifier on this as a target should anticipate class imbalance requiring resampling or adjusted class weights. Treatment: encode as binary integer (False=0, True=1) and address class imbalance (~87/13 split) before modelling. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 2
top_value: False
top_rate: 0.8719
cardinality: 2
entropy: 0.5522
entropy_ratio: 0.5522

column20

numeric feature high_skew

This column is a sparse numeric count or score with only 73 distinct values across 122,611 rows, almost certainly representing an event count, frequency, or discrete rating. The distribution is extraordinarily concentrated at zero — 96.5% of values are exactly 0 — with IQR of 0.0 and a median of 0.0, yet the max reaches 97.0, producing extreme positive skew (5.23) and kurtosis (25.75). The 4,256 outlier rows (3.47%) carrying non-zero values likely represent a small active or engaged sub-population, which is the analytically interesting segment. Treatment: Apply log1p transform or binarise (zero vs. non-zero) before modelling; consider separating the zero-inflated mass from the active tail for two-part modelling. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 73
min: 0
max: 97
mean: 2.565
median: 0
std: 13.66
q1: 0
q3: 0
iqr: 0
skew: 5.227
kurtosis: 25.75
n_outliers: 4,256
outlier_rate: 0.03471
zero_rate: 0.9653

column21

text near_unique one_word url_heavy null_rate

n: 122,611
nulls: 118,355 (96.5%)
unique: 4,160
len_min: 42
len_max: 142
len_mean: 72.43
len_median: 70
len_p95: 91
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 96
duplicate_rate: 0.02256
vocab_size: 4,160
readability_flesch_mean: -704.1
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

column22

numeric feature high_skew

This column is almost certainly a sparse indicator or rare-event count: 99.97% of its 122,611 values are exactly zero, with only 40 flagged outliers and a maximum of 100.0. The 31 unique values and an IQR of 0.0 confirm that the vast majority of rows carry no signal at all. The extreme skew (59.25) and kurtosis (3,627.8) are a direct consequence of this near-total zero mass, making standard continuous modelling inappropriate without transformation or binarisation. Treatment: Binarise (zero vs. non-zero) or treat as a rare-event indicator; if the raw magnitude matters, cap at a sensible percentile and log1p-transform before modelling. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 31
min: 0
max: 100
mean: 0.02455
median: 0
std: 1.395
q1: 0
q3: 0
iqr: 0
skew: 59.25
kurtosis: 3628
n_outliers: 40
outlier_rate: 0.0003262
zero_rate: 0.9997

column23

numeric feature high_skew outliers

The dataset summary identifies this column as the review count per game, with 122,611 non-null records and only 5,540 distinct values. The distribution is extraordinarily right-skewed (skew=177.84, kurtosis=45,295.94): the median is just 5.0 while the mean is 1,044.99, and the maximum reaches 7,642,084, roughly 272 standard deviations above the mean. About 34.5% of values are zero and 17.0% are flagged as outliers (20,797 rows), indicating a zero-inflated distribution whose mean is dominated by rare extreme values. Treatment: Apply log1p-transform (to handle zeros) before modelling, and consider capping or winsorizing at a high percentile to suppress the extreme outliers up to 7,642,084. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 5,540
min: 0
max: 7.642e+06
mean: 1045
median: 5
std: 2.809e+04
q1: 0
q3: 37
iqr: 37
skew: 177.8
kurtosis: 4.53e+04
n_outliers: 20,797
outlier_rate: 0.1696
zero_rate: 0.3448
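
Capping at a high percentile and then applying log1p, as suggested for column23, can be sketched without external libraries. The nearest-rank percentile definition and the default of 99 are choices made for this sketch; other percentile conventions give slightly different caps.

```python
import math
from typing import List


def percentile_nearest_rank(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def winsorize_then_log1p(values: List[float], pct: float = 99.0) -> List[float]:
    """Cap values at the pct-th percentile (upper tail only), then apply log1p."""
    cap = percentile_nearest_rank(values, pct)
    return [math.log1p(min(value, cap)) for value in values]


counts = [0, 5, 37, 7642084]
transformed = winsorize_then_log1p(counts, pct=75)
```

Only the upper tail is capped because the column is bounded below by zero; the tiny example uses `pct=75` so that the cap visibly replaces the largest value.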

column24

numeric feature high_skew outliers

This column is likely a count or frequency measure (e.g., event occurrences, transaction counts, or interaction tallies) given its non-negative integer-like range and high zero rate. The distribution is extraordinarily right-skewed: the median is 1.0 and Q3 is only 10.0, yet the maximum reaches 1,173,003, about six orders of magnitude above the median. With 45% zeros, ~16.9% flagged outliers (20,696 rows), a skew of 156.86, and kurtosis exceeding 30,000, the bulk of records cluster near zero while a small number of extreme values dominate the mean (169.20 vs. median 1.0). This is a severe long-tail distribution that will distort any linear model if used as-is. Treatment: Apply log1p-transform (or cap at a high percentile) before modelling to reduce extreme skew. medium · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 2,725
min: 0
max: 1.173e+06
mean: 169.2
median: 1
std: 5375
q1: 0
q3: 10
iqr: 10
skew: 156.9
kurtosis: 3.063e+04
n_outliers: 20,696
outlier_rate: 0.1688
zero_rate: 0.4502

column25

numeric null_rate

n: 122,611
nulls: 122,571 (100.0%)
unique: 3
min: 98
max: 100
mean: 99.17
median: 99
std: 0.6751
q1: 99
q3: 100
iqr: 1
skew: -0.2149
kurtosis: -0.7872
n_outliers: 0
outlier_rate: 0
zero_rate: 0

column26

numeric feature high_skew outliers

This column is likely a count or frequency metric (e.g., event occurrences, transaction counts, or tenure in days/months), given its non-negative integer values with only 448 distinct values across 122,611 rows. The distribution is severely right-skewed (skew=32.63, kurtosis=1192.15): the median is just 2.0 while the mean is 18.09, Q1 is 0.0, and the maximum reaches 9,821—an extreme outlier relative to the IQR of 19. Nearly half the rows (48.6%) are zero, and 6.9% are flagged as outliers, signaling a heavy zero-inflated tail that will distort any linear model trained on raw values. Treatment: Apply log1p-transform (or a zero-inflated model) to compress the extreme right tail before modelling. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 448
min: 0
max: 9,821
mean: 18.09
median: 2
std: 141.5
q1: 0
q3: 19
iqr: 19
skew: 32.63
kurtosis: 1192
n_outliers: 8,433
outlier_rate: 0.06878
zero_rate: 0.4859

column27

numeric feature high_skew outliers

This column is a sparse, heavily right-skewed numeric count or amount field, likely an event frequency, transaction volume, or similar quantity that is zero for the vast majority of records. 82.9% of the 122,611 rows are exactly zero, the median is 0.0, and the IQR is 0.0, yet the mean is 961.8 and the maximum reaches 4,830,455, so a tiny fraction of extreme values drives nearly all the variance. The skew of 113.9 and kurtosis of 20,874.5 are extraordinary. The 17.1% of rows flagged as outliers (20,906) are exactly the non-zero rows: with an IQR of 0, an IQR-based rule (assumed here) flags every non-zero value, so the outlier flag says nothing about how extreme the tail is. Treatment: Apply a log1p transform (or treat as two-part: zero/non-zero indicator + log-transformed non-zero value) before modelling to handle extreme skew and outliers. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 5,332
min: 0
max: 4.83e+06
mean: 961.8
median: 0
std: 2.188e+04
q1: 0
q3: 0
iqr: 0
skew: 113.9
kurtosis: 2.087e+04
n_outliers: 20,906
outlier_rate: 0.1705
zero_rate: 0.8295
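
The two-part treatment (zero/non-zero indicator plus log-transformed non-zero value) could look roughly like this. It is a sketch under assumptions: `two_part_encode` is a hypothetical helper, and using `None` for the positive part of zero rows is one possible convention among several.

```python
import math

def two_part_encode(values):
    """Split a zero-inflated column into (is_nonzero, log1p_value) pairs.

    is_nonzero is 1 when the raw value is > 0, else 0.
    log1p_value is log(1 + x) for non-zero rows and None for zero rows,
    so a model for the positive part can skip the zeros.
    """
    encoded = []
    for v in values:
        if v > 0:
            encoded.append((1, math.log1p(v)))
        else:
            encoded.append((0, None))
    return encoded

# Made-up sample: mostly zeros, one moderate and one extreme value.
sample = [0, 0, 0, 0, 0, 12, 0, 4_830_455]
pairs = two_part_encode(sample)
nonzero_share = sum(flag for flag, _ in pairs) / len(pairs)
```

A classifier can then be fitted to the indicator and a regressor to the log values of the non-zero rows only.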

column28

text free_text multilingual null_rate

This column contains free-text content warnings or age-rating disclosures for video games, with recurring phrases about mature content, nudity, sexual content, and violence. It is massively sparse — 81.68% of rows are null — meaning most games carry no such warning. The duplicate rate of 17.09% (3,839 duplicates across 18,620 unique values) reflects the use of templated boilerplate warning strings, while a small multilingual signal (2 Chinese, 1 Japanese entries) indicates some non-English publisher submissions. Flesch readability of 44.38 and a median length of 124 characters are consistent with dense legal/disclaimer prose. Treatment: Encode as binary 'has_warning' flag and/or extract categorical warning types (violence, nudity, sexual content) via keyword/regex before modelling; drop raw text. high · anthropic:default

n: 122,611
nulls: 100,152 (81.7%)
unique: 18,620
len_min: 2
len_max: 2,020
len_mean: 164.1
len_median: 124
len_p95: 445
word_mean: 25.74
word_median: 20
n_empty: 0
n_duplicates: 3,839
duplicate_rate: 0.1709
vocab_size: 23,061
readability_flesch_mean: 44.38
emoji_rate: 0.0007124
url_rate: 8.905e-05
one_word_rate: 0.009039
allcaps_rate: 0.008193
boilerplate_rate: 0.009484
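
The suggested encoding (a binary has_warning flag plus keyword-derived categories) might be sketched as follows. The regular expressions are illustrative guesses and would need tuning against the real warning strings; `WARNING_PATTERNS` and `encode_warning` are hypothetical names.

```python
import re

# Hypothetical keyword patterns; adjust after inspecting the actual text.
WARNING_PATTERNS = {
    "violence": re.compile(r"\b(violen\w*|gore|blood)\b", re.IGNORECASE),
    "nudity": re.compile(r"\b(nudity|nude)\b", re.IGNORECASE),
    "sexual_content": re.compile(r"\bsexual\w*\b", re.IGNORECASE),
}

def encode_warning(text):
    """Return has_warning plus one 0/1 flag per warning category."""
    present = bool(text and text.strip())
    flags = {"has_warning": int(present)}
    for name, pattern in WARNING_PATTERNS.items():
        flags[name] = int(present and pattern.search(text) is not None)
    return flags

example = encode_warning("Contains nudity and sexual content.")
missing = encode_warning(None)
```

Because the few non-English entries would not match English keywords, they would receive has_warning = 1 with all category flags at 0.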

column29

numeric feature high_skew outliers

This column is a heavily zero-inflated count or amount field: 78.7% of its 122,611 rows are exactly zero and the interquartile range is 0.0, so the entire middle 50% of the distribution is zero. Despite a median of 0 and a mean of only 208, the maximum reaches 3,429,544, producing extreme skew (262.89) and kurtosis (75,698). The 21.3% of rows flagged as outliers (26,119) are exactly the non-zero rows, which is what an IQR-based rule (assumed here) yields when the IQR is 0, so the flag marks activity rather than anomaly. This pattern is consistent with a sparse event-count, transaction amount, or usage metric where most entities are inactive but a small tail drives enormous values. Treatment: Apply a log1p transform or fit a two-part model (zero vs. non-zero) before regression or ML use. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 3,037
min: 0
max: 3.43e+06
mean: 208
median: 0
std: 1.122e+04
q1: 0
q3: 0
iqr: 0
skew: 262.9
kurtosis: 7.57e+04
n_outliers: 26,119
outlier_rate: 0.213
zero_rate: 0.787
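
The outlier count here equals the number of non-zero rows implied by the zero rate. Assuming the profiler uses Tukey's 1.5 × IQR rule (the report does not say), this follows directly from an IQR of 0, as the toy example below illustrates; the sample values are made up.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# 80% zeros, so Q1 = Q3 = 0 and both fences collapse to 0.
sample = [0] * 8 + [3, 3_429_544]
flags = tukey_outliers(sample)
n_flagged = sum(flags)
n_nonzero = sum(v != 0 for v in sample)
```

Even the small value 3 is flagged, which is why n_outliers on such columns is better read as a count of non-zero rows.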

column30

numeric feature high_skew

This column is a heavily zero-inflated count or amount field: 96.8% of its 122,611 rows are exactly zero, giving a median of 0.0 and an IQR of 0.0. The remaining values are extremely skewed (skew = 51.68, kurtosis = 3252.96), with a mean of 13.79 pulled far right by a maximum of 20,088, possibly reflecting rare but large events such as transaction amounts, error counts, or penalty values. The 3,898 flagged outliers (3.2% of rows) are exactly the non-zero rows, so they carry all of the column's variance. Treatment: Apply zero-inflated modelling or split into a binary indicator plus a log-transformed positive-value sub-model before regression. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 993
min: 0
max: 20,088
mean: 13.79
median: 0
std: 270.4
q1: 0
q3: 0
iqr: 0
skew: 51.68
kurtosis: 3253
n_outliers: 3,898
outlier_rate: 0.03179
zero_rate: 0.9682

column31

numeric feature high_skew outliers

This column is a sparse count or activity metric in which 78.7% of records are zero, producing a median of 0.0 and an IQR of exactly 0.0. The distribution is extraordinarily right-skewed (skew = 263.99, kurtosis = 76112.44), with a maximum of 3,429,544 against a mean of only 173.57, so a tiny fraction of records carry massive values. The 21.3% of rows flagged as outliers (26,119) coincide with the non-zero rows, as expected from an IQR-based rule (assumed here) when the IQR is 0; the flag therefore reflects the heavy tail of a sparse metric rather than data errors. The maximum, zero rate, and outlier count match those reported for column29, so the two columns are probably closely related and should be checked for redundancy before both are used. Treatment: Apply a log1p transform (or a zero-inflated model) before regression; consider capping at a high percentile to manage extreme outliers. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 2,511
min: 0
max: 3.43e+06
mean: 173.6
median: 0
std: 1.12e+04
q1: 0
q3: 0
iqr: 0
skew: 264
kurtosis: 7.611e+04
n_outliers: 26,119
outlier_rate: 0.213
zero_rate: 0.787
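
column29 and column31 report the same max (3.43e+06), zero_rate (0.787), and n_outliers (26,119), which suggests they may be near-duplicates. A minimal sketch of a redundancy check follows; the helper names and the sample values are invented for illustration and do not come from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def compare_columns(a, b):
    """Summarise how closely two columns track each other."""
    n = len(a)
    return {
        "same_zero_pattern": sum((x == 0) == (y == 0) for x, y in zip(a, b)) / n,
        "identical": sum(x == y for x, y in zip(a, b)) / n,
        "pearson": pearson(a, b),
    }

col_a = [0, 0, 0, 120, 0, 45, 3000]  # made-up values
col_b = [0, 0, 0, 110, 0, 45, 3000]  # made-up values
report = compare_columns(col_a, col_b)
```

On heavy-tailed data the Pearson figure is dominated by the largest values, so the zero-pattern and identical-row shares are the more informative part of the summary.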

column32

numeric feature high_skew

This column is likely a sparse count or occurrence field, such as an event frequency, error count, or similar rare-event tally. The zero rate of 96.8% means the vast majority of rows have no event, while the remaining 3.2% drive an extreme right tail (skew=48.9, kurtosis=2848.5) reaching a maximum of 20,088 against a median of 0 and a mean of 14.7. The IQR of 0.0 confirms the middle 50% of the distribution is entirely flat at zero, and the 3,898 flagged outliers are exactly the non-zero rows, so they carry all of the variance. The maximum, unique count (993), zero rate, and outlier count match those reported for column30, so the two columns are probably closely related. Treatment: Apply a log1p transform or treat as a binary (zero vs. non-zero) flag before modelling; consider capping at a high percentile to suppress the extreme tail. high · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: 993
min: 0
max: 20,088
mean: 14.72
median: 0
std: 294.5
q1: 0
q3: 0
iqr: 0
skew: 48.91
kurtosis: 2848
n_outliers: 3,898
outlier_rate: 0.03179
zero_rate: 0.9682

column33

text label one_word duplicates

This column contains game developer or publisher names, evidenced by top values such as 'Choice of Games', 'KOEI TECMO GAMES CO., LTD.', and dominant vocabulary including 'games', 'studio', 'studios', 'interactive', and 'entertainment'. The duplicate rate of 37.98% (43,364 duplicates among the 114,180 non-null rows) is expected, since companies release multiple titles, but the 70,816 unique values and a max length of 584 characters suggest occasional free-text entries or combined multi-company strings. The one-word rate of 31.8% and mean word count of about 2 are consistent with company name formats, though the wide length range (1–584 chars) warrants inspection for outliers. Treatment: Normalize casing and strip punctuation variants before grouping; use as a categorical grouping key or encode as a feature via target/frequency encoding. high · anthropic:default

n: 122,611
nulls: 8,431 (6.9%)
unique: 70,816
len_min: 1
len_max: 584
len_mean: 14.37
len_median: 13
len_p95: 27
word_mean: 2.019
word_median: 2
n_empty: 0
n_duplicates: 43,364
duplicate_rate: 0.3798
vocab_size: 18,429
readability_flesch_mean: 38.73
emoji_rate: 0.0008933
url_rate: 0.000219
one_word_rate: 0.3181
allcaps_rate: 0.07974
boilerplate_rate: 0
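
Normalizing casing and punctuation before grouping could be sketched like this. `normalize_name` is a hypothetical helper, and the second spelling in the sample is an invented variant used only to show two strings collapsing to one key.

```python
import re
import unicodedata

def normalize_name(name):
    """Casefold, replace punctuation with spaces, collapse whitespace."""
    if name is None:
        return None
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text or None

raw = ["KOEI TECMO GAMES CO., LTD.", "Koei Tecmo Games Co, Ltd", "Choice of Games", None]
groups = {}
for value in raw:
    key = normalize_name(value)
    if key is not None:
        groups.setdefault(key, []).append(value)
```

Keeping the original spellings under each key makes it easy to review which variants were merged before committing to the grouping.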

column34

text label one_word duplicates

This column contains game publisher or developer company names, as evidenced by top values like 'BFG Entertainment', 'Choice of Games', and 'Strategy First', and top words dominated by 'games', 'studio', 'studios', 'entertainment', and corporate suffixes ('llc', 'inc.', 'ltd.'). The duplicate rate is notably high at 44.9% (51,089 duplicates among the 113,778 non-null rows), which is expected since many games share the same company. The one-word rate of 31.8% reflects single-token studio names, and the 7.2% null rate (8,833 rows) needs a handling decision for records with no company listed. Treatment: Encode as a categorical feature (e.g. frequency or target encoding); investigate the 7.2% nulls before modelling. high · anthropic:default

n: 122,611
nulls: 8,833 (7.2%)
unique: 62,689
len_min: 1
len_max: 164
len_mean: 13.82
len_median: 13
len_p95: 26
word_mean: 1.988
word_median: 2
n_empty: 0
n_duplicates: 51,089
duplicate_rate: 0.449
vocab_size: 15,765
readability_flesch_mean: 40.22
emoji_rate: 0.0009141
url_rate: 0.0002285
one_word_rate: 0.3178
allcaps_rate: 0.0817
boilerplate_rate: 0
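
Frequency encoding, one of the options named above, might look like the sketch below. Treating nulls as their own category through a sentinel is an assumption made for the example, not something the report prescribes.

```python
from collections import Counter

MISSING = "__missing__"  # hypothetical sentinel for null company names

def frequency_encode(values):
    """Map each value to its relative frequency within the column."""
    filled = [MISSING if v is None or v == "" else v for v in values]
    counts = Counter(filled)
    total = len(filled)
    return [counts[v] / total for v in filled]

publishers = ["Choice of Games", "Strategy First", "Choice of Games", None, "BFG Entertainment"]
encoded = frequency_encode(publishers)
```

In a modelling pipeline the frequencies would normally be computed on the training split only and then applied to held-out rows.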

column35

text feature duplicates

This column contains a comma-delimited list of Steam game features/categories (e.g., 'Single-player', 'Steam Achievements', 'Family Sharing', 'Full controller support'), typical of the Steam store's supported-features field per game. The extreme duplicate rate (88.3%, 100,367 of the 113,658 non-null rows) is expected because many games share identical feature sets, and the small vocabulary of 589 words confirms a finite, enumerated tag system. The 'da' language detection on 12 rows is almost certainly a false positive from short comma-separated tokens, not actual Danish text. With only 13,291 unique combinations, this column is well suited to multi-label binarization. Treatment: Split on commas and multi-hot encode each feature tag for modelling. high · anthropic:default

n: 122,611
nulls: 8,953 (7.3%)
unique: 13,291
len_min: 3
len_max: 534
len_mean: 71.58
len_median: 59
len_p95: 178
word_mean: 5.089
word_median: 4
n_empty: 0
n_duplicates: 100,367
duplicate_rate: 0.8831
vocab_size: 589
readability_flesch_mean: -105.9
emoji_rate: 0
url_rate: 0
one_word_rate: 0.04047
allcaps_rate: 8.798e-06
boilerplate_rate: 0
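
Splitting on commas and encoding each tag as its own 0/1 column can be sketched without any libraries. `multi_hot` is a hypothetical helper; null rows are assumed to map to an all-zero vector.

```python
def multi_hot(rows, sep=","):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if row i has tag j."""
    tag_sets = []
    for row in rows:
        tags = {t.strip() for t in row.split(sep)} if row else set()
        tags.discard("")
        tag_sets.append(tags)
    vocabulary = sorted(set().union(*tag_sets))
    matrix = [[int(tag in tags) for tag in vocabulary] for tags in tag_sets]
    return vocabulary, matrix

rows = [
    "Single-player,Steam Achievements",
    "Single-player,Family Sharing,Full controller support",
    None,  # null row: encoded as all zeros
]
vocab, matrix = multi_hot(rows)
```

Splitting on the comma alone keeps multi-word tags such as 'Full controller support' intact as a single feature.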

column36

text label one_word duplicates

This column contains comma-separated game genre tags (e.g., 'Casual,Indie', 'Action,Adventure,Indie'), consistent with a Steam or similar game catalog dataset. The duplicate rate is extremely high at 97.5% because games share genre combinations: only 2,894 unique tag-sets exist among the 114,198 non-null rows. The top words 'to', 'access', and 'play' most likely come from the multi-word Steam genre labels 'Free to Play' and 'Early Access', which whitespace tokenisation breaks apart; they are legitimate values rather than free-text pollution. Treatment: Split on commas only (never on whitespace) and multi-hot encode the genre labels before modelling; review any labels that remain rare or malformed after splitting. high · anthropic:default

n: 122,611
nulls: 8,413 (6.9%)
unique: 2,894
len_min: 3
len_max: 236
len_mean: 22.21
len_median: 21
len_p95: 45
word_mean: 1.364
word_median: 1
n_empty: 0
n_duplicates: 111,304
duplicate_rate: 0.9747
vocab_size: 940
readability_flesch_mean: -206.1
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7892
allcaps_rate: 0.009781
boilerplate_rate: 0
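
The effect of the delimiter choice on this field can be shown with a short count. The sample rows are illustrative, built from label strings quoted in the description above; `count_labels` is a hypothetical helper.

```python
from collections import Counter

def count_labels(rows, sep=","):
    """Count labels after splitting on the separator only, never on spaces."""
    counts = Counter()
    for row in rows:
        if not row:
            continue
        counts.update(label.strip() for label in row.split(sep) if label.strip())
    return counts

rows = ["Casual,Indie", "Action,Adventure,Indie", "Free to Play,Indie", "Action,Early Access", None]
by_comma = count_labels(rows)
# Whitespace tokenisation, as a word-level profiler would see the field.
by_space = Counter(word for row in rows if row for word in row.replace(",", " ").split())
```

Splitting on commas keeps 'Free to Play' and 'Early Access' whole, whereas word-level tokenisation surfaces 'to', 'Play', and 'Access' as separate tokens.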

column37

text label multilingual null_rate

This column contains comma-separated genre/tag lists for software or game products (e.g., 'Adventure,Casual,Hidden Object', 'Action,Indie'), consistent with a Steam-style app catalog. The null rate of 32.02% (39,265 rows) is high and warrants investigation before modelling. A multilingual alert is raised, but the non-English content is negligible (26 records out of 3,376 detected), suggesting near-uniform English data with minor noise. The duplicate rate is only 7.4% (6,167 duplicates among the 83,346 non-null rows): with 77,179 unique values, most tag lists are specific to a single game, so the raw string is of little use as a category. Treatment: Split on commas to multi-hot encode the individual tags; investigate and decide on an imputation strategy for the 32.02% null rows before modelling. high · anthropic:default

n: 122,611
nulls: 39,265 (32.0%)
unique: 77,179
len_min: 3
len_max: 295
len_mean: 141.3
len_median: 163
len_p95: 228
word_mean: 4.923
word_median: 5
n_empty: 0
n_duplicates: 6,167
duplicate_rate: 0.07399
vocab_size: 57,260
readability_flesch_mean: -449.7
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1233
allcaps_rate: 4.799e-05
boilerplate_rate: 0

column38

text metadata near_unique one_word url_heavy

This column contains comma-separated lists of Steam screenshot URLs (Akamai CDN), one packed string per row holding all screenshot images for a given Steam game entry. Every value counts as 'one word' because the URLs are concatenated without whitespace, which explains a one_word_rate of 1.0 alongside a mean length of about 1,319 characters and a max of 29,132. With 116,483 unique values among the 116,593 non-null rows and only 110 duplicates, this is near-unique; the small duplicate count likely reflects games with identical screenshot sets. Treatment: Split on commas to extract individual screenshot URLs per game; store as a list-type column or explode into a separate screenshots table keyed by game id. high · anthropic:default

n: 122,611
nulls: 6,018 (4.9%)
unique: 116,483
len_min: 144
len_max: 29,132
len_mean: 1319
len_median: 1,039
len_p95: 2,773
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 110
duplicate_rate: 0.0009435
vocab_size: 19,994
readability_flesch_mean: -5099
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
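
Exploding the packed strings into a separate table keyed by game id might be sketched as below. The game ids and example.com URLs are placeholders, and the sketch assumes no URL contains a comma of its own.

```python
def explode_screenshots(games):
    """Turn {game_id: "url1,url2,..."} into (game_id, position, url) rows."""
    rows = []
    for game_id, packed in games.items():
        if not packed:
            continue  # null or empty: no screenshots recorded
        urls = [u for u in packed.split(",") if u]
        for position, url in enumerate(urls):
            rows.append((game_id, position, url))
    return rows

# Placeholder ids and URLs, not taken from the dataset.
games = {
    10: "https://example.com/a.jpg,https://example.com/b.jpg",
    20: None,
    30: "https://example.com/c.jpg",
}
table = explode_screenshots(games)
```

Keeping the position preserves the original screenshot order, which a list-type column would also retain.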

column39

unknown other skipped

This column was skipped by the profiler, so its content and type are entirely unknown. With 122,611 rows, zero nulls, and no computed statistics or uniqueness information, no data-driven characterisation is possible. The 'skipped' alert is the only signal available. Treatment: Manually inspect raw values to determine type and role before any further processing. low · anthropic:default

n: 122,611
nulls: 0 (0.0%)
unique: —