saturn·

data trove steam games catalog

source /home/coolhand/html/datavis/data_trove/entertainment/gaming/enriched/games.csv · 122,611 rows · 40 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This is a Steam games catalogue with 122,611 rows and 40 columns, covering titles, publishers, developers, genres, pricing, review counts, and associated URLs. The first thing to examine is the extreme skew across nearly all numeric engagement columns (column23, column24, column27, column29, column31): medians sit at 0–5 while means run into the hundreds or thousands, so a tiny fraction of blockbuster titles accounts for the vast majority of reviews and activity. A second area worth attention is genre distribution (column36), where a handful of Casual/Indie/Action combinations account for the bulk of the catalogue. The estimated owner-count banding (column03) tells the same story: 61.5% of games fall in the '0 - 20000' band, which points to a long-tail market dominated by low-visibility titles.

citing: column23.stats.median · column23.stats.mean · column27.stats.zero_rate · column29.stats.zero_rate · column03.top_values · column03.stats.top_rate · column36.top_values · column36.stats.duplicate_rate · column06.stats.median · column06.stats.mean
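
One way to put a number on the concentration claim above is to measure what share of a column's total is held by its largest values. The sketch below is a minimal illustration under stated assumptions: `top_share` is a hypothetical helper, and the `reviews` list is synthetic data shaped like the profile describes (mostly zeros and small values, one very large value), not values from the file.

```python
def top_share(values, top_fraction=0.01):
    """Share of the column total contributed by the largest `top_fraction` of rows."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))  # always keep at least one row
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Synthetic sample: 60 zeros, 30 small values, 9 mid-sized values, 1 blockbuster.
reviews = [0] * 60 + [5] * 30 + [200] * 9 + [100_000]
share = top_share(reviews, top_fraction=0.01)
print(f"top 1% of rows hold {share:.1%} of the total")
```

Run against the real engagement columns, a share close to 1.0 would support the long-tail reading; a much lower share would suggest the mean is inflated by only a handful of rows.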

Schema

40 columns
Per-column summary.

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| column00 | numeric | 0.0% | 122,611 | |
| column01 | text | 0.0% | 121,454 | near_unique |
| column02 | text | 0.0% | 5,081 | short_text duplicates |
| column03 | categorical | 0.0% | 14 | |
| column04 | numeric | 0.0% | 1,110 | high_skew outliers |
| column05 | numeric | 0.0% | 15 | high_skew |
| column06 | numeric | 0.0% | 941 | high_skew outliers |
| column07 | numeric | 0.0% | 88 | |
| column08 | numeric | 0.0% | 117 | high_skew outliers |
| column09 | text | 6.9% | 113,556 | near_unique |
| column10 | text | 0.0% | 19,113 | one_word duplicates |
| column11 | text | 0.0% | 3,710 | one_word duplicates |
| column12 | text | 90.2% | 11,884 | near_unique null_rate |
| column13 | text | 0.1% | 122,420 | near_unique one_word url_heavy |
| column14 | text | 59.5% | 39,703 | one_word url_heavy null_rate duplicates |
| column15 | text | 55.8% | 35,399 | one_word url_heavy null_rate duplicates |
| column16 | text | 18.1% | 60,519 | one_word duplicates |
| column17 | categorical | 0.0% | 2 | imbalance |
| column18 | categorical | 0.0% | 2 | |
| column19 | categorical | 0.0% | 2 | |
| column20 | numeric | 0.0% | 73 | high_skew |
| column21 | text | 96.5% | 4,160 | near_unique one_word url_heavy null_rate |
| column22 | numeric | 0.0% | 31 | high_skew |
| column23 | numeric | 0.0% | 5,540 | high_skew outliers |
| column24 | numeric | 0.0% | 2,725 | high_skew outliers |
| column25 | numeric | 100.0% | 3 | null_rate |
| column26 | numeric | 0.0% | 448 | high_skew outliers |
| column27 | numeric | 0.0% | 5,332 | high_skew outliers |
| column28 | text | 81.7% | 18,620 | multilingual null_rate |
| column29 | numeric | 0.0% | 3,037 | high_skew outliers |
| column30 | numeric | 0.0% | 993 | high_skew |
| column31 | numeric | 0.0% | 2,511 | high_skew outliers |
| column32 | numeric | 0.0% | 993 | high_skew |
| column33 | text | 6.9% | 70,816 | one_word duplicates |
| column34 | text | 7.2% | 62,689 | one_word duplicates |
| column35 | text | 7.3% | 13,291 | duplicates |
| column36 | text | 6.9% | 2,894 | one_word duplicates |
| column37 | text | 32.0% | 77,179 | multilingual null_rate |
| column38 | text | 4.9% | 116,483 | near_unique one_word url_heavy |
| column39 | unknown | 0.0% | | skipped |

column00

numeric identifier
This column contains 122,611 numeric values that are all unique and null-free, spanning 10 to 4,264,350, which strongly suggests a unique numeric identifier. Because column13 holds image URLs keyed by Steam app ID, this is plausibly the app ID itself, though the profile does not confirm that. The distribution is close to uniform: kurtosis of -1.05, negligible skew of 0.18, and zero detected outliers, which is unusual for a natural measurement and consistent with a sequentially or pseudo-randomly assigned integer key. The IQR of 1,806,385 is a little under half the full range (about 4.26 million), further supporting a broad, even spread across the ID space. Treatment: Drop before modelling or use as a row key only; do not use as a predictive feature. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
122,611
min
10
max
4.264e+06
mean
1.985e+06
median
1.907e+06
std
1.088e+06
q1
1.063e+06
q3
2.87e+06
iqr
1.806e+06
skew
0.1772
kurtosis
-1.05
n_outliers
0
outlier_rate
0
zero_rate
0

column01

text label near_unique
This column contains short, near-unique text strings averaging about 3 words and 18 characters, consistent with game titles in a Steam catalogue. The dominant top words ('playtest', 'vr', 'simulator') fit that reading: they plausibly reflect playtest builds, VR titles, and simulator games rather than a separate kind of record. Surprising signals include 1,156 duplicates (about 0.94% duplicate rate) despite the near-unique alert, a small emoji presence (0.26%), and a maximum length of 413 characters, which is anomalously long relative to the median of 16. Treatment: Use as a descriptive label; deduplicate or flag the 1,156 repeated entries, and investigate the long-tail outliers (len_max 413) before any downstream grouping or embedding. medium · anthropic:default
n
122,611
nulls
1 (0.0%)
unique
121,454
len_min
1
len_max
413
len_mean
18.07
len_median
16
len_p95
37.55
word_mean
2.912
word_median
3
n_empty
0
n_duplicates
1,156
duplicate_rate
0.009428
vocab_size
18,813
readability_flesch_mean
52.87
emoji_rate
0.002585
url_rate
0
one_word_rate
0.1866
allcaps_rate
0.06731
boilerplate_rate
4.078e-05

column02

text timestamp short_text duplicates
This column contains dates formatted as 'Mon DD, YYYY' (e.g., 'Oct 23, 2025'), stored as text rather than a native date type. The values span at least 2021–2025 based on top word frequencies. The duplicate rate is 95.86%: the 122,611 rows take only 5,081 distinct dates, so 117,530 rows repeat a date already seen elsewhere, which is expected when many records map to the same calendar day. The near-constant string length (median 12, min 11, max 12) and vocabulary of just 68 tokens confirm this is a tightly formatted date field with no free-text variation. Treatment: Parse to a native date type (e.g., datetime64) before any time-series analysis or feature engineering. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
5,081
len_min
11
len_max
12
len_mean
11.72
len_median
12
len_p95
12
word_mean
3
word_median
3
n_empty
0
n_duplicates
117,530
duplicate_rate
0.9586
vocab_size
68
readability_flesch_mean
98.6
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0
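
The parsing step recommended for column02 could look like the sketch below. It assumes pandas is available and uses a three-value sample: the first string is the example quoted in the profile, while the other two are hypothetical, including a deliberately malformed entry to show how `errors="coerce"` behaves.

```python
import pandas as pd

# "Oct 23, 2025" is quoted in the profile; the other two values are hypothetical.
raw = pd.Series(["Oct 23, 2025", "Jan 5, 2021", "not a date"], name="column02")

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
# errors="coerce" turns strings that do not match the format into NaT
# instead of raising, so unparseable rows can be counted and inspected.
parsed = pd.to_datetime(raw, format="%b %d, %Y", errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(parsed)
print("rows that failed to parse:", n_unparsed)
```

Passing an explicit `format` avoids per-row format guessing and makes any row that deviates from the expected layout visible as `NaT` rather than silently reinterpreted.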

column03

categorical feature
This column encodes a numeric quantity as binned range labels. The dataset summary identifies it as estimated owner-count banding, which fits the scale (0 to 10,000,000+) and the roughly logarithmic spacing of the bin edges. The distribution is heavily right-skewed: 61.5% of rows fall in the '0 - 20000' bucket alone, and a notable 21,641 rows (about 17.7%) sit in '0 - 0', a zero-value spike that may warrant separate treatment. With only 14 distinct values and zero nulls across 122,611 rows, the encoding is clean but lossy. Treatment: Ordinal-encode using the natural bin order, or extract bin midpoints as a numeric approximation; investigate the '0 - 0' segment (21,641 rows) as a potential distinct class. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
14
top_value
0 - 20000
top_rate
0.615
cardinality
14
entropy
1.814
entropy_ratio
0.4764
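
The ordinal-encoding and midpoint options mentioned for column03 can be sketched as follows. Only the labels '0 - 20000' and '0 - 0' appear in the profile; the third label and the `parse_band` helper are hypothetical, and the code assumes every label has the form `low - high` with plain integers, which may not hold for an open-ended top band.

```python
from typing import Tuple

def parse_band(label: str) -> Tuple[int, int]:
    """Split a label such as '0 - 20000' into integer (low, high) bounds."""
    low, high = label.split(" - ")
    return int(low), int(high)

# First two labels come from the profile; the third is a hypothetical example.
labels = ["0 - 20000", "0 - 0", "20000 - 50000"]

# Sorting the (low, high) tuples gives the natural bin order.
bands = sorted({parse_band(label) for label in labels})

# Ordinal code: position of the band in that order.
ordinal = {band: code for code, band in enumerate(bands)}

# Midpoint: a rough numeric stand-in for the banded quantity.
midpoint = {band: (band[0] + band[1]) / 2 for band in bands}

print(ordinal)
print(midpoint)
```

With this ordering the '0 - 0' band receives its own code (0), which keeps it separable from the '0 - 20000' band if it turns out to be a distinct class.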

column04

numeric feature high_skew outliers
This column is a heavily zero-inflated numeric field — likely a count, transaction amount, or event frequency — where 83.95% of values are exactly zero and the interquartile range is 0.0, meaning the entire middle half of the distribution is flat at zero. The remaining values are extremely right-skewed (skew = 209.95, kurtosis = 51452.44) with a max of 1,013,936 against a mean of only 54.59, indicating a small number of very large outliers; 16.05% of rows (19,676) are flagged as outliers. The 1,110 unique values and zero null rate suggest this may be a sparse activity or volume metric. Treatment: Consider a two-part model (zero-inflation indicator + log1p transform on non-zero values) or cap/winsorize at a high percentile before modelling. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
1,110
min
0
max
1.014e+06
mean
54.59
median
0
std
3729
q1
0
q3
0
iqr
0
skew
210
kurtosis
5.145e+04
n_outliers
19,676
outlier_rate
0.1605
zero_rate
0.8395
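
The two-part treatment suggested for column04 amounts to deriving two features from one column: a zero/non-zero indicator and a log-compressed magnitude. The sketch below shows that split using only the standard library; the `values` list is illustrative (its largest entry echoes the reported maximum) and is not a sample from the file.

```python
import math

# Illustrative values only: mostly zeros with a few large entries,
# mirroring the shape (not the data) described for column04.
values = [0, 0, 0, 0, 12, 0, 350, 0, 0, 1_013_936]

# Part 1: binary indicator for "any activity at all".
is_nonzero = [1 if v > 0 else 0 for v in values]

# Part 2: log1p of the magnitude. log1p(0) == 0, so zeros stay at zero
# and the indicator carries the zero/non-zero distinction.
log_magnitude = [math.log1p(v) for v in values]

features = list(zip(is_nonzero, log_magnitude))
for row in features:
    print(row)
```

A downstream model can then treat the indicator and the magnitude separately, which is the core idea behind hurdle-style approaches; `math.expm1` inverts the transform when original units are needed.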

column05

numeric feature high_skew
This column is a low-cardinality integer count (only 15 distinct values, range 0–21) where 98.96% of rows are exactly zero, making it an extreme sparse count feature — likely recording rare events or occurrences per record. The distribution is severely right-skewed (skew 9.88, kurtosis 96.52) with only 1,272 outlier rows (1.04%) carrying any non-zero signal; the IQR is zero because all three quartiles collapse to 0. Treatment: Treat as a sparse count; consider binarising (0 vs >0) or applying log1p transform, and flag the 1,272 non-zero rows as a minority sub-population for modelling. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
15
min
0
max
21
mean
0.1676
median
0
std
1.654
q1
0
q3
0
iqr
0
skew
9.883
kurtosis
96.52
n_outliers
1,272
outlier_rate
0.01037
zero_rate
0.9896

column06

numeric feature high_skew outliers
This column likely represents a monetary amount, duration, or rate — a continuous positive measure where most values are small. The distribution is extreme: the median is 2.24 and Q3 is only 5.24, yet the max reaches 999.98, producing a skew of 22.4 and a kurtosis of 1,135. Over 7.5% of rows (9,297) are flagged as outliers, and 21.4% of values are exactly zero, suggesting a two-part structure (zero-inflation plus a heavy-tailed positive component) that would violate standard regression assumptions. Treatment: Model the zero-inflation separately (e.g., hurdle or Tweedie model), then log1p-transform the positive portion before regression or scaling. medium · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
941
min
0
max
1000
mean
4.765
median
2.24
std
12.53
q1
0.55
q3
5.24
iqr
4.69
skew
22.4
kurtosis
1135
n_outliers
9,297
outlier_rate
0.07583
zero_rate
0.2137

column07

numeric feature
This column is a bounded numeric score or percentage, ranging from 0 to 100 with only 88 distinct values, suggesting a discretized or rounded measurement (e.g., a completion rate, satisfaction score, or grade). The most striking feature is that 66.8% of values are exactly zero, making the distribution heavily zero-inflated; the median is 0.0 while the mean is 18.35 and Q3 is only 40.0, confirming the mass is concentrated at the floor. Kurtosis is near zero (-0.05) and no outliers are flagged, but those figures describe the whole column, zeros included, so they say little about the shape of the non-zero portion; profile the non-zero values separately before assuming any particular shape. Analysts should treat this as a zero-inflated bounded variable rather than a standard continuous feature. Treatment: Model with a two-part (hurdle/zero-inflated) approach, or apply an indicator for zero alongside the raw value; avoid log-transform without offset due to zero mass. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
88
min
0
max
100
mean
18.35
median
0
std
28.86
q1
0
q3
40
iqr
40
skew
1.22
kurtosis
-0.05072
n_outliers
0
outlier_rate
0
zero_rate
0.6682

column08

numeric feature high_skew outliers
This column is a sparse count or event-frequency field: 85.5% of its 122,611 rows are exactly zero, the median and IQR are both 0, yet the mean is 0.55 and the max reaches 3,703. The extreme concentration at zero combined with a skew of 171.8 and kurtosis of 38,359 indicates a heavy-tailed distribution driven by rare but very large values; 14.5% of rows (17,771) are flagged as outliers. Only 117 distinct values across 122,611 rows further suggests this is a discrete count, not a continuous measure. Treatment: Apply log1p transform or use a zero-inflated / Poisson model; consider capping or winsorizing at a high quantile given the max of 3703. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
117
min
0
max
3,703
mean
0.5459
median
0
std
14.52
q1
0
q3
0
iqr
0
skew
171.8
kurtosis
3.836e+04
n_outliers
17,771
outlier_rate
0.1449
zero_rate
0.8551

column09

text free_text near_unique
This column contains long-form natural language text, likely user-generated content such as reviews, product descriptions, or messages — with a mean of 1,297 characters and 214 words per entry, and a vocabulary of 105,903 unique terms. The near-unique alert (113,556 unique values out of 122,611 rows) confirms these are essentially free-text narratives rather than categorical labels. Notably, 4.7% of entries contain emojis, suggesting informal or consumer-facing content, and the max length of 89,665 characters indicates some extreme outliers well beyond the 95th-percentile length of 2,966 characters. Flesch readability mean of 58.7 places the text in a 'fairly easy' register, consistent with consumer writing. Treatment: Tokenize and embed (e.g., sentence-transformers) before modelling; flag or truncate the extreme-length outliers above len_p95 of 2,966 characters. high · anthropic:default
n
122,611
nulls
8,449 (6.9%)
unique
113,556
len_min
1
len_max
89,665
len_mean
1297
len_median
1,064
len_p95
2,966
word_mean
214.3
word_median
177
n_empty
0
n_duplicates
606
duplicate_rate
0.005308
vocab_size
105,903
readability_flesch_mean
58.75
emoji_rate
0.04672
url_rate
0.0003504
one_word_rate
0.0004029
allcaps_rate
0.007559
boilerplate_rate
0.01517

column10

text feature one_word duplicates
This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or game). The dominant value is `['English']` appearing 55,314 times, with `[]` (no languages listed) in 8,380 rows. The duplicate rate is extremely high at 84.4%, which is expected given the limited vocabulary of 217 unique tokens and only 19,113 unique values across 122,611 rows — the data is stored as raw string-serialized lists rather than a normalized structure, which is a notable preprocessing concern. Treatment: Parse the string-serialized lists into actual list structures, then multi-hot encode each language as a binary feature column. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
19,113
len_min
2
len_max
1,216
len_mean
68.02
len_median
11
len_p95
224
word_mean
6.889
word_median
1
n_empty
0
n_duplicates
103,498
duplicate_rate
0.8441
vocab_size
217
readability_flesch_mean
14.07
emoji_rate
0
url_rate
0
one_word_rate
0.5333
allcaps_rate
0
boilerplate_rate
0
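
Parsing the string-serialized lists in column10 and multi-hot encoding them can be sketched with the standard library alone. The values `['English']` and `[]` are taken from the profile; the third sample value is hypothetical. `ast.literal_eval` is used rather than `eval` because it only accepts Python literals.

```python
import ast

# First two values appear in the profile; the third is a hypothetical example.
raw = ["['English']", "[]", "['English', 'French', 'German']"]

# Turn each serialized string into a real Python list.
parsed = [ast.literal_eval(cell) for cell in raw]

# Fixed, sorted vocabulary so every row gets the same column order.
languages = sorted({lang for langs in parsed for lang in langs})

# One 0/1 entry per language for each row.
multi_hot = [[int(lang in langs) for lang in languages] for langs in parsed]

print(languages)
for row in multi_hot:
    print(row)
```

An empty list becomes an all-zero row, so rows with no languages listed stay distinguishable from rows that list English only.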

column11

text feature one_word duplicates
This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or media item). The dominant value is '[]' (empty list) appearing 72,730 times — nearly 60% of rows — indicating most records have no language metadata populated. Despite 122,611 rows, only 3,710 unique values exist and the duplicate rate is 96.97%, which is expected for a categorical-list field, but the vocabulary is tiny at just 194 words, confirming a closed set of language names. Treatment: Parse the serialized list strings into proper multi-label indicators (one binary column per language) before modelling; treat '[]' as missing/unknown. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
3,710
len_min
2
len_max
1,216
len_mean
24.31
len_median
2
len_p95
46
word_mean
2.854
word_median
1
n_empty
0
n_duplicates
118,901
duplicate_rate
0.9697
vocab_size
194
readability_flesch_mean
8.003
emoji_rate
0
url_rate
0
one_word_rate
0.813
allcaps_rate
0
boilerplate_rate
0

column12

text free_text near_unique null_rate
This column contains substantial free-text descriptions or reviews, most likely about games: the word 'game' is the top non-stopword at 7,882 occurrences, average text length is about 340 characters (about 57 words), and the vocabulary spans 61,840 unique tokens. The 90.16% null rate is a major alert: only 12,070 of 122,611 rows carry any content, so this field is sparsely populated. An emoji_rate of about 1.6% and a mean Flesch readability score of about 57.8 suggest informal, consumer-written prose. The near_unique flag is partly explained by the sparse population: 11,884 unique values among 12,070 non-null rows means almost every entry is distinct. Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) before modelling; impute or mask nulls explicitly given the 90.16% null rate. high · anthropic:default
n
122,611
nulls
110,541 (90.2%)
unique
11,884
len_min
3
len_max
2,912
len_mean
340.3
len_median
295
len_p95
763
word_mean
57.37
word_median
49
n_empty
0
n_duplicates
186
duplicate_rate
0.01541
vocab_size
61,840
readability_flesch_mean
57.83
emoji_rate
0.01649
url_rate
0
one_word_rate
0
allcaps_rate
0.008202
boilerplate_rate
0

column13

text metadata near_unique one_word url_heavy
This column contains Steam CDN URLs pointing to game header images hosted on Akamai's steamstatic.com infrastructure — specifically `header.jpg` assets keyed by Steam app ID. With a url_rate of 1.0 and one_word_rate of 1.0, every single value is a single URL. The column is near-unique (122,420 distinct values out of 122,611 rows), with only 110 duplicates, suggesting these map closely to individual game or product records; the small number of repeated URLs (max frequency 5) likely reflects games appearing in multiple dataset rows. Treatment: Extract Steam app ID from URL path for joining; drop raw URL before modelling or store as-is for image retrieval pipelines. high · anthropic:default
n
122,611
nulls
81 (0.1%)
unique
122,420
len_min
93
len_max
153
len_mean
104.6
len_median
98
len_p95
139
word_mean
1
word_median
1
n_empty
0
n_duplicates
110
duplicate_rate
0.0008977
vocab_size
19,992
readability_flesch_mean
-834.3
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
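
Extracting an app ID from the header-image URLs in column13 might be done with a small regular expression. The profile does not show a full URL, so the `/apps/<digits>/` path layout, the example host, and the `extract_app_id` helper are all assumptions made for illustration; the pattern should be checked against real values before use.

```python
import re
from typing import Optional

# Assumed layout: the numeric ID sits in a path segment right after "/apps/".
APP_ID_PATTERN = re.compile(r"/apps/(\d+)/")

def extract_app_id(url: Optional[str]) -> Optional[int]:
    """Return the integer ID found after '/apps/', or None if absent."""
    if not url:
        return None
    match = APP_ID_PATTERN.search(url)
    return int(match.group(1)) if match else None

# Hypothetical URL on a placeholder host, shaped like the assumed layout.
example = "https://cdn.example.invalid/steam/apps/570/header.jpg"
print(extract_app_id(example))
print(extract_app_id("https://cdn.example.invalid/other/header.jpg"))
```

Returning `None` for non-matching or missing values keeps the 81 null rows and any unexpected URL shapes from raising during a bulk apply.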

column14

text metadata one_word url_heavy null_rate duplicates
This column contains publisher or developer website URLs, almost certainly scraped from a Steam or similar games catalogue. Virtually every non-null value is a single URL (one_word_rate 0.9999, url_rate 0.9999), pointing to publisher homepages, Facebook pages, or Steam publisher/group pages. Two signals stand out: 59.48% of rows are null, meaning many game records carry no website; and 20.08% of non-null values are duplicates (9,973 repeated URLs), reflecting publishers with large catalogues who share one website across many titles. Treatment: Extract domain as a categorical publisher identifier; flag or impute nulls; do not embed raw URL strings. high · anthropic:default
n
122,611
nulls
72,935 (59.5%)
unique
39,703
len_min
7
len_max
236
len_mean
32.57
len_median
29
len_p95
56
word_mean
1
word_median
1
n_empty
0
n_duplicates
9,973
duplicate_rate
0.2008
vocab_size
17,059
readability_flesch_mean
-260.3
emoji_rate
0
url_rate
0.9999
one_word_rate
0.9999
allcaps_rate
0
boilerplate_rate
0
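
Reducing the website URLs in column14 to a domain-level categorical could be sketched as below, using `urllib.parse.urlparse` from the standard library. The `domain_of` helper is hypothetical, and stripping a leading `www.` is a simplifying assumption; it does not collapse other subdomains or resolve redirects.

```python
from typing import Optional
from urllib.parse import urlparse

def domain_of(url: Optional[str]) -> Optional[str]:
    """Return the lowercased host of a URL without a leading 'www.', or None."""
    if not url or not url.strip():
        return None
    candidate = url.strip()
    # urlparse only fills in the host when a scheme or '//' prefix is present,
    # so bare values such as 'example.org/page' get a '//' prefix first.
    if "://" not in candidate:
        candidate = "//" + candidate
    host = urlparse(candidate).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host

# Illustrative inputs, not rows from the file.
print(domain_of("https://www.bigfishgames.com/support"))
print(domain_of("Example.org/page"))
print(domain_of(None))
```

Nulls map to `None`, so the 59.5% of rows without a website remain explicitly missing rather than being folded into a real domain.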

column15

text metadata one_word url_heavy null_rate duplicates
This column is a support/contact URL field — almost certainly a developer or publisher support link associated with game or software records. 95.6% of non-null values are URLs, and the one-word rate is 99.9%, consistent with bare URL strings. Two surprises stand out: the null rate is very high at 55.8%, meaning more than half of records lack this URL, and the duplicate rate is 34.7% (18,808 duplicate values out of ~54,200 non-null rows), reflecting that many games share the same support domain (e.g., Big Fish Games, EA, Facebook pages). Treatment: Extract domain as a categorical feature; treat raw URL as a grouping key rather than a text feature; impute or flag nulls separately given 55.8% null rate. high · anthropic:default
n
122,611
nulls
68,404 (55.8%)
unique
35,399
len_min
1
len_max
851
len_mean
31.19
len_median
29
len_p95
51
word_mean
1.002
word_median
1
n_empty
0
n_duplicates
18,808
duplicate_rate
0.347
vocab_size
14,875
readability_flesch_mean
-245.1
emoji_rate
0
url_rate
0.9559
one_word_rate
0.9993
allcaps_rate
0.0007933
boilerplate_rate
0

column16

text foreign_key one_word duplicates
This column contains email addresses for game developers or publishers, as evidenced by the top values (e.g., 'info@bigfishgames.com', 'support@quanticlab.com'). Nearly all values (99.86%) are single tokens, consistent with email format. The duplicate rate is high at 39.7% (39,849 duplicates among the 100,368 non-null rows), indicating many records share a contact email, as expected for a publisher-level field where one entity owns multiple titles. The null rate of 18.14% is notable and should be investigated for systematic missingness. Treatment: Use as a grouping/join key on publisher or developer entity; normalize to lowercase and strip whitespace before joining. high · anthropic:default
n
122,611
nulls
22,243 (18.1%)
unique
60,519
len_min
1
len_max
169
len_mean
22.91
len_median
23
len_p95
31
word_mean
1.004
word_median
1
n_empty
0
n_duplicates
39,849
duplicate_rate
0.397
vocab_size
15,319
readability_flesch_mean
-223.7
emoji_rate
9.963e-06
url_rate
0.003906
one_word_rate
0.9986
allcaps_rate
0.001016
boilerplate_rate
0
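
The normalisation recommended before using column16 as a join key is small but easy to get subtly wrong when nulls are present. A minimal sketch, with a hypothetical sample built from the two addresses quoted in the profile plus a differently cased variant and a missing value:

```python
from collections import Counter

# Two addresses are quoted in the profile; the cased/padded variant and the
# None entry are hypothetical additions to exercise the normalisation.
emails = [
    "info@bigfishgames.com",
    "  Info@BigFishGames.com ",
    "support@quanticlab.com",
    None,
]

# Strip surrounding whitespace and lowercase; leave missing values as None.
normalized = [e.strip().lower() if e else None for e in emails]

# Count titles per contact address, ignoring missing values.
titles_per_contact = Counter(e for e in normalized if e)
print(titles_per_contact.most_common())
```

Without the normalisation step the two Big Fish Games variants would be counted as separate entities, understating how many titles share a contact.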

column17

categorical feature imbalance
This column is a boolean flag stored as string values ('True'/'False'), covering 122,611 rows with no nulls. It is severely imbalanced: 'True' accounts for 99.964% of rows (122,567 occurrences) while 'False' appears only 44 times. The near-zero entropy (0.0046) confirms the column carries almost no information, making it nearly constant. Treatment: Investigate whether the 44 'False' rows are meaningful anomalies; otherwise drop as near-constant with no predictive variance. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
2
top_value
True
top_rate
0.9996
cardinality
2
entropy
0.004625
entropy_ratio
0.004625
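
The entropy figures reported for the categorical columns can be reproduced from value counts. The sketch below assumes the profiler uses Shannon entropy in bits and normalises by log2 of the cardinality; that convention is inferred from the reported numbers rather than stated in the report, and both helper names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts for column17 as given in its description: 122,567 'True', 44 'False'.
column17_counts = [122_567, 44]
h = entropy_bits(column17_counts)
print(round(h, 6), round(entropy_ratio(column17_counts), 6))
```

For a two-value column log2(2) = 1, so entropy and entropy_ratio coincide, which matches the identical figures listed for column17. Under the same assumption, column03's reported entropy of 1.814 divided by log2(14) gives roughly 0.476, in line with its entropy_ratio.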

column18

categorical feature
This column is a binary boolean flag stored as string literals 'True'/'False', with zero nulls across 122,611 rows. The dominant value is 'False' at 82.6% (101,319 occurrences), leaving 'True' at roughly 17.4% (21,292) — a moderately imbalanced split that may matter for classification tasks. The entropy ratio of 0.666 confirms meaningful but uneven information content. Treatment: Cast to boolean/integer (0/1) and monitor class imbalance if used as a target or predictor. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
2
top_value
False
top_rate
0.8263
cardinality
2
entropy
0.666
entropy_ratio
0.666

column19

categorical label
This column is a boolean flag stored as string literals 'True'/'False', covering all 122,611 rows with zero nulls. The distribution is heavily skewed: 'False' dominates at 87.2% (106,905 rows) versus 'True' at only 12.8% (15,706 rows). The low entropy of 0.552 confirms the imbalance. An analyst building a classifier on this as a target should anticipate class imbalance requiring resampling or adjusted class weights. Treatment: encode as binary integer (False=0, True=1) and address class imbalance (~87/13 split) before modelling. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
2
top_value
False
top_rate
0.8719
cardinality
2
entropy
0.5522
entropy_ratio
0.5522
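
Encoding the 'True'/'False' strings and deriving class weights for the roughly 87/13 split can be sketched as follows. The explicit mapping matters because any non-empty string is truthy in Python, so `bool("False")` evaluates to `True`. The weighting formula n / (k × count) is one common heuristic and is used here as an assumption, not as the report's prescribed method; the `raw` sample is hypothetical.

```python
# Explicit mapping: a plain bool() cast would turn "False" into True.
LABELS = {"False": 0, "True": 1}

raw = ["False", "True", "False", "False"]  # hypothetical sample
encoded = [LABELS[value] for value in raw]

# Class counts for column19 as given in its description.
counts = {0: 106_905, 1: 15_706}
n_rows = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: each class contributes equally in total.
class_weight = {label: n_rows / (n_classes * c) for label, c in counts.items()}

print(encoded)
print({label: round(w, 4) for label, w in class_weight.items()})
```

The minority class ends up weighted roughly seven times more heavily than the majority class, which offsets the imbalance without resampling.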

column20

numeric feature high_skew
This column is a sparse numeric count or score with only 73 distinct values across 122,611 rows, almost certainly representing an event count, frequency, or discrete rating. The distribution is extraordinarily concentrated at zero — 96.5% of values are exactly 0 — with IQR of 0.0 and a median of 0.0, yet the max reaches 97.0, producing extreme positive skew (5.23) and kurtosis (25.75). The 4,256 outlier rows (3.47%) carrying non-zero values likely represent a small active or engaged sub-population, which is the analytically interesting segment. Treatment: Apply log1p transform or binarise (zero vs. non-zero) before modelling; consider separating the zero-inflated mass from the active tail for two-part modelling. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
73
min
0
max
97
mean
2.565
median
0
std
13.66
q1
0
q3
0
iqr
0
skew
5.227
kurtosis
25.75
n_outliers
4,256
outlier_rate
0.03471
zero_rate
0.9653

column21

text near_unique one_word url_heavy null_rate
n
122,611
nulls
118,355 (96.5%)
unique
4,160
len_min
42
len_max
142
len_mean
72.43
len_median
70
len_p95
91
word_mean
1
word_median
1
n_empty
0
n_duplicates
96
duplicate_rate
0.02256
vocab_size
4,160
readability_flesch_mean
-704.1
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

column22

numeric feature high_skew
This column is almost certainly a sparse indicator or rare-event count: 99.97% of its 122,611 values are exactly zero, with only 40 flagged outliers and a maximum of 100.0. The 31 unique values and an IQR of 0.0 confirm that the vast majority of rows carry no signal at all. The extreme skew (59.25) and kurtosis (3,627.8) are a direct consequence of this near-total zero mass, making standard continuous modelling inappropriate without transformation or binarisation. Treatment: Binarise (zero vs. non-zero) or treat as a rare-event indicator; if the raw magnitude matters, cap at a sensible percentile and log1p-transform before modelling. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
31
min
0
max
100
mean
0.02455
median
0
std
1.395
q1
0
q3
0
iqr
0
skew
59.25
kurtosis
3628
n_outliers
40
outlier_rate
0.0003262
zero_rate
0.9997

column23

numeric feature high_skew outliers
This column is a numeric count or magnitude field, likely an activity volume, transaction amount, or similar accumulation metric, with 122,611 non-null records and only 5,540 distinct values. The distribution is extraordinarily right-skewed (skew=177.84, kurtosis=45,295.94): the median is just 5.0 while the mean is 1,044.99, and the maximum of 7,642,084 lies roughly 272 standard deviations above the mean. About 34.5% of values are zero and 17.0% are flagged as outliers (20,797 rows), a zero-inflated, heavy-tailed distribution in which rare extreme values dominate the mean. Treatment: Apply log1p-transform (to handle zeros) before modelling, and consider capping or winsorizing at a high percentile to suppress the extreme outliers up to 7,642,084. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
5,540
min
0
max
7.642e+06
mean
1045
median
5
std
2.809e+04
q1
0
q3
37
iqr
37
skew
177.8
kurtosis
4.53e+04
n_outliers
20,797
outlier_rate
0.1696
zero_rate
0.3448
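
Capping at a high percentile before the log1p transform, as suggested for column23, can be sketched with the standard library. The nearest-rank percentile used here is one of several percentile definitions and is chosen for simplicity; both helpers are hypothetical, and the sample is synthetic apart from its largest value, which echoes the reported maximum.

```python
import math

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def winsorize_upper(values, pct=99.0):
    """Cap every value above the given percentile at that percentile."""
    cap = nearest_rank_percentile(values, pct)
    return [min(v, cap) for v in values]

# Synthetic sample: 0..99 plus one extreme value matching the reported max.
sample = list(range(100)) + [7_642_084]

capped = winsorize_upper(sample, pct=99.0)
transformed = [math.log1p(v) for v in capped]

print(max(sample), max(capped), round(max(transformed), 3))
```

Capping first limits how much a single extreme row can stretch the transformed scale; whether the 99th percentile is the right threshold depends on how many genuine blockbusters the analysis needs to keep distinguishable.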

column24

numeric feature high_skew outliers
This column is likely a count or frequency measure (e.g., event occurrences, transaction counts, or interaction tallies) given its non-negative integer-like range and high zero rate. The distribution is extraordinarily right-skewed: the median is 1.0 and Q3 is only 10.0, yet the maximum reaches 1,173,003 — a difference of over six orders of magnitude. With 45% zeros, ~16.9% flagged outliers (20,696 rows), a skew of 156.86, and kurtosis exceeding 30,000, the bulk of records cluster near zero while a small number of extreme values dominate the mean (169.20 vs. median 1.0). This is a severe long-tail distribution that will distort any linear model if used as-is. Treatment: Apply log1p-transform (or cap at a high percentile) before modelling to reduce extreme skew. medium · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
2,725
min
0
max
1.173e+06
mean
169.2
median
1
std
5375
q1
0
q3
10
iqr
10
skew
156.9
kurtosis
3.063e+04
n_outliers
20,696
outlier_rate
0.1688
zero_rate
0.4502

column25

numeric null_rate
n
122,611
nulls
122,571 (100.0%)
unique
3
min
98
max
100
mean
99.17
median
99
std
0.6751
q1
99
q3
100
iqr
1
skew
-0.2149
kurtosis
-0.7872
n_outliers
0
outlier_rate
0
zero_rate
0

column26

numeric feature high_skew outliers
This column is likely a count or frequency metric (e.g., event occurrences, transaction counts, or tenure in days/months), given its non-negative integer values with only 448 distinct values across 122,611 rows. The distribution is severely right-skewed (skew=32.63, kurtosis=1192.15): the median is just 2.0 while the mean is 18.09, Q1 is 0.0, and the maximum reaches 9,821—an extreme outlier relative to the IQR of 19. Nearly half the rows (48.6%) are zero, and 6.9% are flagged as outliers, signaling a heavy zero-inflated tail that will distort any linear model trained on raw values. Treatment: Apply log1p-transform (or a zero-inflated model) to compress the extreme right tail before modelling. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
448
min
0
max
9,821
mean
18.09
median
2
std
141.5
q1
0
q3
19
iqr
19
skew
32.63
kurtosis
1192
n_outliers
8,433
outlier_rate
0.06878
zero_rate
0.4859

column27

numeric feature high_skew outliers
This column is a sparse, heavily right-skewed numeric count or amount field — likely representing an event frequency, transaction volume, or similar quantity that is zero for the vast majority of records. 82.9% of the 122,611 rows are exactly zero, the median is 0.0, and the IQR is 0.0, yet the mean is 961.8 and the maximum reaches 4,830,455 — indicating a tiny fraction of extreme values driving nearly all the variance. The skew of 113.9 and kurtosis of 20,874.5 are extraordinary, and 17.1% of rows are flagged as outliers, confirming that the non-zero tail is severely extreme relative to the bulk of the distribution. Treatment: Apply log1p-transform (or treat as two-part: zero/non-zero indicator + log-transformed non-zero value) before modelling to handle extreme skew and outliers. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
5,332
min
0
max
4.83e+06
mean
961.8
median
0
std
2.188e+04
q1
0
q3
0
iqr
0
skew
113.9
kurtosis
2.087e+04
n_outliers
20,906
outlier_rate
0.1705
zero_rate
0.8295

column28

text free_text multilingual null_rate
This column contains free-text content warnings or age-rating disclosures for video games, with recurring phrases about mature content, nudity, sexual content, and violence. It is massively sparse: 81.68% of rows are null, meaning most games carry no such warning. The duplicate rate of 17.09% (3,839 duplicates among the 22,459 non-null rows, which hold 18,620 unique values) reflects the use of templated boilerplate warning strings, while a small multilingual signal (2 Chinese, 1 Japanese entries) indicates some non-English publisher submissions. Flesch readability of 44.38 and a median length of 124 characters are consistent with dense legal/disclaimer prose. Treatment: Encode as binary 'has_warning' flag and/or extract categorical warning types (violence, nudity, sexual content) via keyword/regex before modelling; drop raw text. high · anthropic:default
n
122,611
nulls
100,152 (81.7%)
unique
18,620
len_min
2
len_max
2,020
len_mean
164.1
len_median
124
len_p95
445
word_mean
25.74
word_median
20
n_empty
0
n_duplicates
3,839
duplicate_rate
0.1709
vocab_size
23,061
readability_flesch_mean
44.38
emoji_rate
0.0007124
url_rate
8.905e-05
one_word_rate
0.009039
allcaps_rate
0.008193
boilerplate_rate
0.009484
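
The 'has_warning' flag and keyword-based warning types proposed for column28 could be derived as in the sketch below. The three categories come from the description above; the specific regular expressions, the `warning_features` helper, and the sample sentence are assumptions, and English-only patterns will miss the handful of Chinese and Japanese entries.

```python
import re

# Hypothetical keyword patterns for the three warning types named in the profile.
PATTERNS = {
    "violence": re.compile(r"\bviolen(?:ce|t)\b", re.IGNORECASE),
    "nudity": re.compile(r"\bnud(?:e|ity)\b", re.IGNORECASE),
    "sexual_content": re.compile(r"\bsexual\b", re.IGNORECASE),
}

def warning_features(text):
    """Return a has_warning flag plus one 0/1 flag per warning type."""
    has_text = isinstance(text, str) and bool(text.strip())
    features = {"has_warning": int(has_text)}
    for name, pattern in PATTERNS.items():
        features[name] = int(has_text and bool(pattern.search(text)))
    return features

# Hypothetical warning text and a missing value.
print(warning_features("Contains nudity and graphic violence."))
print(warning_features(None))
```

Keeping `has_warning` separate from the type flags preserves the signal from warnings that match none of the keywords.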

column29

numeric feature high_skew outliers
This column is a heavily zero-inflated count or amount field — 78.7% of its 122,611 rows are exactly zero, and the interquartile range is 0.0, meaning the entire middle 50% of the distribution is zero. Despite a median of 0 and mean of only 208, the max reaches 3,429,544, producing extreme skew (262.89) and kurtosis (75,698), with 21.3% of rows flagged as outliers. This pattern is consistent with a sparse event-count, transaction amount, or usage metric where most entities are inactive but a small tail drives enormous values. Treatment: Apply log1p-transform or treat as two-part model (zero vs. non-zero) before regression or ML use. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
3,037
min
0
max
3.43e+06
mean
208
median
0
std
1.122e+04
q1
0
q3
0
iqr
0
skew
262.9
kurtosis
7.57e+04
n_outliers
26,119
outlier_rate
0.213
zero_rate
0.787

column30

numeric feature high_skew
This column is a heavily zero-inflated count or amount field: 96.8% of its 122,611 rows are exactly zero, driving a median of 0.0 and an IQR of 0.0. The remaining values are extremely skewed (skew = 51.68, kurtosis = 3252.96), with a mean of 13.79 pulled far right by a maximum of 20,088 — likely representing rare but large events such as transaction amounts, error counts, or penalty values. The 3,898 outliers (3.2% of rows) account for virtually all non-zero variance, which is the defining surprise here. Treatment: Apply zero-inflated modelling or split into a binary indicator plus a log-transformed positive-value sub-model before regression. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
993
min
0
max
20,088
mean
13.79
median
0
std
270.4
q1
0
q3
0
iqr
0
skew
51.68
kurtosis
3253
n_outliers
3,898
outlier_rate
0.03179
zero_rate
0.9682
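
The "binary indicator plus log-transformed positive value" split from the treatment above might look like this. It is a sketch of the feature preparation only, not of a fitted zero-inflated model, and `two_part_features` is a hypothetical helper name.

```python
import math


def two_part_features(value):
    """Split a zero-inflated value into (is_nonzero, log of the positive part)."""
    is_nonzero = int(value > 0)
    # The log is only defined for positive values; zeros get None so that a
    # second-stage model can be fitted on the non-zero rows alone.
    log_positive = math.log(value) if value > 0 else None
    return is_nonzero, log_positive


# Illustrative values; 20088 is the column maximum reported above.
values = [0, 0, 0, 12, 20088]
features = [two_part_features(v) for v in values]
```

The first element feeds a classifier (zero vs. non-zero), and the second feeds a regression restricted to rows where it is not `None`.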

column31

numeric feature high_skew outliers
This column is a sparse count or activity metric where the large majority of records (78.7%) are zero, producing a median of 0 and an IQR of exactly 0. The distribution is extraordinarily right-skewed (skew = 263.99, kurtosis = 76112.44), driven by extreme values reaching a max of 3,429,544 against a mean of only 173.57, so a tiny fraction of records carry massive values. Roughly 21.3% of rows (26,119) are flagged as outliers; with an IQR of 0 this is simply the share of non-zero rows, so the high rate reflects zero inflation and a heavy tail rather than data errors. The column shares its zero rate, outlier count, and maximum with column29, so the two are probably related measures of the same quantity. Treatment: Apply a log1p transform (or a zero-inflated model) before regression; consider capping at a high percentile to manage extreme values. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
2,511
min
0
max
3.43e+06
mean
173.6
median
0
std
1.12e+04
q1
0
q3
0
iqr
0
skew
264
kurtosis
7.611e+04
n_outliers
26,119
outlier_rate
0.213
zero_rate
0.787
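
A minimal sketch of percentile capping followed by log1p, as suggested in the treatment above. The helper name `cap_and_log1p` and the 99th-percentile default are assumptions; with 78.7% zeros, any cap below the 79th percentile would collapse the whole column to zero.

```python
import math
import statistics


def cap_and_log1p(values, percentile=99):
    """Cap values at the given percentile, then apply log1p."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    cap = cuts[percentile - 1]
    return [math.log1p(min(v, cap)) for v in values], cap


# Synthetic sample: mostly zeros, a modest tail, and one extreme value
# equal to the column maximum reported above.
values = [0] * 79 + list(range(1, 21)) + [3_429_544]
transformed, cap = cap_and_log1p(values, percentile=99)
```

On a sample this small the cap is interpolated between the two largest values; on the full column it would sit well inside the tail.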

column32

numeric feature high_skew
This column is a sparse count or occurrence field. The zero_rate of 96.8% means the vast majority of rows record nothing, while the remaining ~3.2% drive an extreme right tail (skew=48.9, kurtosis=2848.5) reaching a maximum of 20,088 against a median of 0 and mean of 14.7. The IQR of 0 confirms the middle 50% of the distribution is flat at zero, so the 3,898 flagged outliers are simply the non-zero rows and account for all of the spread. The column shares its zero rate, outlier count, unique count, and maximum with column30, so the two are probably related measures of the same quantity. Treatment: Apply a log1p transform or treat as a binary (zero vs. non-zero) flag before modelling; consider capping at a high percentile to suppress the extreme tail. high · anthropic:default
n
122,611
nulls
0 (0.0%)
unique
993
min
0
max
20,088
mean
14.72
median
0
std
294.5
q1
0
q3
0
iqr
0
skew
48.91
kurtosis
2848
n_outliers
3,898
outlier_rate
0.03179
zero_rate
0.9682

column33

text label one_word duplicates
This column contains game developer or publisher names, evidenced by top values such as 'Choice of Games', 'KOEI TECMO GAMES CO., LTD.', and dominant vocabulary including 'games', 'studio', 'studios', 'interactive', and 'entertainment'. The duplicate rate of 37.98% (43,364 duplicates among the 114,180 non-null rows) is expected, since companies release multiple titles, but the 70,816 unique values and a max length of 584 characters suggest occasional free-text entries or combined multi-company strings. The one-word rate of 31.8% and mean word count of ~2 words are consistent with company name formats, though the wide length range (1–584 chars) warrants inspection for outliers. Treatment: Normalize casing and strip punctuation variants before grouping; use as a categorical grouping key or encode as a feature via target/frequency encoding. high · anthropic:default
n
122,611
nulls
8,431 (6.9%)
unique
70,816
len_min
1
len_max
584
len_mean
14.37
len_median
13
len_p95
27
word_mean
2.019
word_median
2
n_empty
0
n_duplicates
43,364
duplicate_rate
0.3798
vocab_size
18,429
readability_flesch_mean
38.73
emoji_rate
0.0008933
url_rate
0.000219
one_word_rate
0.3181
allcaps_rate
0.07974
boilerplate_rate
0
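
The "normalize casing and strip punctuation variants" step could be sketched as below. Removing every ASCII punctuation character is an assumption that may be too aggressive for some names, and the second spelling of the KOEI TECMO name is a made-up variant used only to show two strings collapsing to one key.

```python
import re
import string

# Translation table that deletes all ASCII punctuation (an assumption).
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_name(name):
    """Case-fold, drop ASCII punctuation, and collapse runs of whitespace."""
    if name is None:
        return None
    cleaned = name.casefold().translate(_PUNCT)
    return re.sub(r"\s+", " ", cleaned).strip()


names = [
    "KOEI TECMO GAMES CO., LTD.",
    "Koei Tecmo Games Co. Ltd",  # hypothetical variant spelling
    "Choice of Games",
    None,
]
keys = [normalize_name(n) for n in names]
```

Grouping on the normalized key rather than the raw string keeps the original value available for display while merging trivial variants.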

column34

text label one_word duplicates
This column contains game publisher or developer company names, as evidenced by top values like 'BFG Entertainment', 'Choice of Games', and 'Strategy First', and top words dominated by 'games', 'studio', 'studios', 'entertainment', and corporate suffixes ('llc', 'inc.', 'ltd.'). The duplicate rate is notably high at 44.9% (51,089 duplicates among the 113,778 non-null rows), which is expected since many games share the same publisher. The one-word rate of 31.8% reflects single-token studio names, and the 7.2% null rate (8,833 rows) warrants attention for records with unknown publishers. Treatment: Encode as a categorical feature (e.g. frequency or target encoding); investigate the 7.2% of nulls before modelling. high · anthropic:default
n
122,611
nulls
8,833 (7.2%)
unique
62,689
len_min
1
len_max
164
len_mean
13.82
len_median
13
len_p95
26
word_mean
1.988
word_median
2
n_empty
0
n_duplicates
51,089
duplicate_rate
0.449
vocab_size
15,765
readability_flesch_mean
40.22
emoji_rate
0.0009141
url_rate
0.0002285
one_word_rate
0.3178
allcaps_rate
0.0817
boilerplate_rate
0
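
Frequency encoding, one of the options named in the treatment above, can be sketched as follows. Treating nulls as their own `__missing__` bucket is an assumption made for this example; the report only says the nulls should be investigated.

```python
from collections import Counter


def frequency_encode(values, missing_label="__missing__"):
    """Replace each category with its relative frequency in the column."""
    # Nulls are mapped to a placeholder category (an assumption) so that
    # they receive a frequency of their own instead of being dropped.
    filled = [missing_label if v is None else v for v in values]
    counts = Counter(filled)
    total = len(filled)
    return [counts[v] / total for v in filled]


publishers = ["Choice of Games", "Strategy First", "Choice of Games", None]
encoded = frequency_encode(publishers)
```

With 62,689 unique values, a single numeric frequency column avoids the very wide matrix that one-hot encoding would produce.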

column35

text feature duplicates
This column contains a comma-delimited list of Steam game features/categories (e.g., 'Single-player', 'Steam Achievements', 'Family Sharing', 'Full controller support'), typical of the Steam store's supported features field per game. The extreme duplicate rate (88.3%, 100,367 of the 113,658 non-null rows) is expected because many games share identical feature sets, and the tiny vocabulary size of 589 words confirms a finite, enumerated tag system. The 'da' language detection on 12 rows is almost certainly a false positive from short comma-separated tokens, not actual Danish text. With only 13,291 unique combinations among 113,658 non-null rows, this column is highly suitable for multi-label binarization. Treatment: Split on commas and multi-hot encode the feature tags (one 0/1 column per tag) for modelling. high · anthropic:default
n
122,611
nulls
8,953 (7.3%)
unique
13,291
len_min
3
len_max
534
len_mean
71.58
len_median
59
len_p95
178
word_mean
5.089
word_median
4
n_empty
0
n_duplicates
100,367
duplicate_rate
0.8831
vocab_size
589
readability_flesch_mean
-105.9
emoji_rate
0
url_rate
0
one_word_rate
0.04047
allcaps_rate
8.798e-06
boilerplate_rate
0
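
A minimal sketch of the split-and-binarize treatment above, using only the standard library. It assumes that no individual feature name contains a comma; `multi_hot` is a hypothetical helper, and the sample rows reuse the feature names quoted in the description.

```python
def multi_hot(rows, sep=","):
    """Return (sorted tag vocabulary, one 0/1 vector per row)."""
    tag_sets = [
        {t.strip() for t in row.split(sep) if t.strip()} if row else set()
        for row in rows
    ]
    vocab = sorted(set().union(*tag_sets))
    matrix = [[int(tag in tags) for tag in vocab] for tags in tag_sets]
    return vocab, matrix


rows = [
    "Single-player,Steam Achievements",
    "Single-player,Family Sharing,Full controller support",
    None,  # null rows become all-zero vectors
]
vocab, matrix = multi_hot(rows)
```

Null rows encode as all zeros here, so a separate missing-value flag is needed if "no features listed" and "features unknown" should be distinguished.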

column36

text label one_word duplicates
This column contains comma-separated game genre tags (e.g., 'Casual,Indie', 'Action,Adventure,Indie'), consistent with a Steam or similar game catalog dataset. The duplicate rate is extremely high at 97.5%, reflecting the natural cardinality collapse when games share genre combinations: only 2,894 unique tag-sets exist among the 114,198 non-null rows. The top words 'to', 'access', and 'play' most likely come from the multi-word genre labels 'Early Access' and 'Free to Play', which are standard Steam genres rather than stray free text; they show up as separate words only because the profiler tokenizes on whitespace. Treatment: Split on commas only (not whitespace) to multi-hot encode genre tags before modelling, so that multi-word genres stay intact; review any token that is not a recognised genre. high · anthropic:default
n
122,611
nulls
8,413 (6.9%)
unique
2,894
len_min
3
len_max
236
len_mean
22.21
len_median
21
len_p95
45
word_mean
1.364
word_median
1
n_empty
0
n_duplicates
111,304
duplicate_rate
0.9747
vocab_size
940
readability_flesch_mean
-206.1
emoji_rate
0
url_rate
0
one_word_rate
0.7892
allcaps_rate
0.009781
boilerplate_rate
0
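
To check whether words such as 'to' and 'play' belong to multi-word genres, the field can be tokenized on commas only and the resulting labels counted. The rows below are illustrative, not taken from the dataset, and `genre_counts` is a hypothetical helper.

```python
from collections import Counter


def genre_counts(rows):
    """Count genre labels, splitting on commas only so phrases stay whole."""
    counts = Counter()
    for row in rows:
        if row:
            counts.update(g.strip() for g in row.split(",") if g.strip())
    return counts


# Illustrative rows; only the first two appear verbatim in the description.
rows = [
    "Casual,Indie",
    "Action,Adventure,Indie",
    "Action,Free to Play",
    "Indie,Early Access",
    None,
]
counts = genre_counts(rows)
```

A short list of labels from `counts.most_common()` is usually enough to tell genuine genres from polluted values.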

column37

text label multilingual null_rate
This column contains comma-separated tag lists for software or game products (e.g., 'Adventure,Casual,Hidden Object', 'Action,Indie'), consistent with a Steam-style app catalog. The null rate of 32.02% is notably high and warrants investigation before modelling. A multilingual alert is raised, but the non-English content is negligible (26 records out of 3,376 detected), suggesting near-uniform English data with minor noise. The duplicate rate is low at 7.4% (6,167 duplicates among the 83,346 non-null rows); with 77,179 unique values and a median length of 163 characters, the lists are much longer and more varied than the genre combinations in column36, which is more consistent with user-defined tags than with a small fixed genre set. Treatment: Split on commas to multi-hot encode the tags, limiting the encoding to the most frequent tags if the vocabulary proves large; investigate and decide on an imputation strategy for the 32.02% null rows before modelling. high · anthropic:default
n
122,611
nulls
39,265 (32.0%)
unique
77,179
len_min
3
len_max
295
len_mean
141.3
len_median
163
len_p95
228
word_mean
4.923
word_median
5
n_empty
0
n_duplicates
6,167
duplicate_rate
0.07399
vocab_size
57,260
readability_flesch_mean
-449.7
emoji_rate
0
url_rate
0
one_word_rate
0.1233
allcaps_rate
4.799e-05
boilerplate_rate
0
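
One way to handle both the large tag variety and the 32% nulls is to encode only the most frequent tags and add an explicit missing flag. This is a sketch under assumptions: the choice of `k`, the `tags_missing` column name, and the helper `top_k_tag_features` are all hypothetical.

```python
from collections import Counter


def top_k_tag_features(rows, k=2):
    """Encode the k most frequent tags plus a flag for null rows."""
    parsed = [
        None if row is None else [t.strip() for t in row.split(",") if t.strip()]
        for row in rows
    ]
    # Count each tag once per row so repeated tags do not inflate the ranking.
    counts = Counter(
        tag for tags in parsed if tags is not None for tag in set(tags)
    )
    top_tags = [tag for tag, _ in counts.most_common(k)]
    features = []
    for tags in parsed:
        row_features = {"tags_missing": int(tags is None)}
        for tag in top_tags:
            row_features[tag] = int(tags is not None and tag in tags)
        features.append(row_features)
    return top_tags, features


rows = [
    "Adventure,Casual,Hidden Object",
    "Action,Indie",
    "Adventure,Indie,Casual",
    "Indie,Adventure",
    None,
]
top_tags, features = top_k_tag_features(rows, k=2)
```

The explicit `tags_missing` flag lets a model distinguish "no tags recorded" from "none of the top tags apply", which matters when a third of the rows are null.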

column38

text metadata near_unique one_word url_heavy
This column contains comma-separated lists of Steam screenshot URLs (Akamai CDN), one packed string per row representing all screenshot images for a given Steam game entry. Every value is technically 'one word' (no spaces) because the URLs are concatenated without whitespace, explaining the paradoxical one_word_rate of 1.0 alongside a mean length of ~1,319 characters and a max of 29,132. With 116,483 unique values among the 116,593 non-null rows and only 110 duplicates, this is near-unique; the small duplicate count likely reflects games with identical screenshot sets. Treatment: Split on commas to extract individual screenshot URLs per game; store as a list-type column or explode into a separate screenshots table keyed by game id. high · anthropic:default
n
122,611
nulls
6,018 (4.9%)
unique
116,483
len_min
144
len_max
29,132
len_mean
1319
len_median
1,039
len_p95
2,773
word_mean
1
word_median
1
n_empty
0
n_duplicates
110
duplicate_rate
0.0009435
vocab_size
19,994
readability_flesch_mean
-5099
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
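
Exploding the packed string into a separate screenshots table might look like the sketch below. It assumes that the URLs themselves contain no commas and that some game id column is available as the key; the ids and `example.com` URLs are placeholders, and `explode_screenshots` is a hypothetical helper.

```python
def explode_screenshots(games):
    """Turn (game_id, packed_urls) pairs into (game_id, position, url) rows."""
    rows = []
    for game_id, packed in games:
        if not packed:
            continue  # null rows contribute no screenshot records
        urls = [u.strip() for u in packed.split(",") if u.strip()]
        rows.extend((game_id, pos, url) for pos, url in enumerate(urls))
    return rows


# Placeholder ids and URLs, for illustration only.
games = [
    (10, "https://example.com/a.jpg,https://example.com/b.jpg"),
    (20, None),
    (30, "https://example.com/c.jpg"),
]
screenshots = explode_screenshots(games)
```

Keeping the position index preserves the original ordering of screenshots, which would otherwise be lost once the rows are stored in a separate table.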

column39

unknown other skipped
This column was skipped by the profiler, so its content and type are entirely unknown. With 122,611 rows, zero nulls, and no computed statistics or uniqueness information, no data-driven characterisation is possible. The 'skipped' alert is the only signal available. Treatment: Manually inspect raw values to determine type and role before any further processing. low · anthropic:default
n
122,611
nulls
0 (0.0%)
unique