saturn·

data trove steam games catalog

saturn notebook · generated 2026-06-21

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/gaming/enriched/games.csv

Saturn profiled 122,611 rows across 40 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/gaming/enriched/games.csv",
    "--findings", "data-trove-steam-games-catalog.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This is a Steam games catalogue with 122,611 rows and 40 columns, covering titles, publishers, developers, genres, pricing, review counts, and associated URLs. The most important thing to examine first is the extreme skew across nearly all numeric engagement columns (column23, column24, column27, column29, column31): medians sit at 0–5 while means run into the hundreds or thousands, meaning a tiny fraction of blockbuster titles account for the vast majority of reviews and activity. A second area worth attention is genre distribution (column36), where just a handful of Casual/Indie/Action combinations account for the bulk of the catalogue, and the estimated owner-count banding (column03) shows over 61% of games have fewer than 20,000 owners — pointing to a long-tail market dominated by low-visibility titles.

citing: column23.stats.median · column23.stats.mean · column27.stats.zero_rate · column29.stats.zero_rate · column03.top_values · column03.stats.top_rate · column36.top_values · column36.stats.duplicate_rate · column06.stats.median · column06.stats.mean

Out[4]:

saturn.schema() · 40 columns

column kind n null% unique alerts
column00 numeric 122,611 0.0% 122,611
column01 text 122,611 0.0% 121,454 near_unique
column02 text 122,611 0.0% 5,081 short_text duplicates
column03 categorical 122,611 0.0% 14
column04 numeric 122,611 0.0% 1,110 high_skew outliers
column05 numeric 122,611 0.0% 15 high_skew
column06 numeric 122,611 0.0% 941 high_skew outliers
column07 numeric 122,611 0.0% 88
column08 numeric 122,611 0.0% 117 high_skew outliers
column09 text 122,611 6.9% 113,556 near_unique
column10 text 122,611 0.0% 19,113 one_word duplicates
column11 text 122,611 0.0% 3,710 one_word duplicates
column12 text 122,611 90.2% 11,884 near_unique null_rate
column13 text 122,611 0.1% 122,420 near_unique one_word url_heavy
column14 text 122,611 59.5% 39,703 one_word url_heavy null_rate duplicates
column15 text 122,611 55.8% 35,399 one_word url_heavy null_rate duplicates
column16 text 122,611 18.1% 60,519 one_word duplicates
column17 categorical 122,611 0.0% 2 imbalance
column18 categorical 122,611 0.0% 2
column19 categorical 122,611 0.0% 2
column20 numeric 122,611 0.0% 73 high_skew
column21 text 122,611 96.5% 4,160 near_unique one_word url_heavy null_rate
column22 numeric 122,611 0.0% 31 high_skew
column23 numeric 122,611 0.0% 5,540 high_skew outliers
column24 numeric 122,611 0.0% 2,725 high_skew outliers
column25 numeric 122,611 100.0% 3 null_rate
column26 numeric 122,611 0.0% 448 high_skew outliers
column27 numeric 122,611 0.0% 5,332 high_skew outliers
column28 text 122,611 81.7% 18,620 multilingual null_rate
column29 numeric 122,611 0.0% 3,037 high_skew outliers
column30 numeric 122,611 0.0% 993 high_skew
column31 numeric 122,611 0.0% 2,511 high_skew outliers
column32 numeric 122,611 0.0% 993 high_skew
column33 text 122,611 6.9% 70,816 one_word duplicates
column34 text 122,611 7.2% 62,689 one_word duplicates
column35 text 122,611 7.3% 13,291 duplicates
column36 text 122,611 6.9% 2,894 one_word duplicates
column37 text 122,611 32.0% 77,179 multilingual null_rate
column38 text 122,611 4.9% 116,483 near_unique one_word url_heavy
column39 unknown 122,611 0.0% skipped
Fig 1.
column03 · Owner-count bands reveal the long tail: 61.5% of games sit in the 0–20,000 band and a further 17.7% in the 0–0 band, with rapid drop-off at higher counts.
Top values for column03 (14 unique shown, of 14 total).
value                    count   share
0 - 20000                75404   61.5%
0 - 0                    21641   17.7%
20000 - 50000            11396   9.3%
50000 - 100000           5355    4.4%
100000 - 200000          3454    2.8%
200000 - 500000          2853    2.3%
500000 - 1000000         1154    0.9%
1000000 - 2000000        729     0.6%
2000000 - 5000000        405     0.3%
5000000 - 10000000       125     0.1%
10000000 - 20000000      51      0.0%
20000000 - 50000000      31      0.0%
50000000 - 100000000     9       0.0%
100000000 - 200000000    4       0.0%
Fig 2.
column36 · Genre tag combinations show Casual, Indie, and Action dominate Steam listings — look for which combos are over-represented relative to their review volumes.
Character-length distribution for column36 (mean: 22.205064887301003).
chars        count
3 – 9        12259
9 – 15       21084
15 – 20      22318
20 – 26      25837
26 – 32      12284
32 – 38      8026
38 – 44      5596
44 – 50      2995
50 – 55      1587
55 – 61      848
61 – 67      593
67 – 73      229
73 – 79      196
79 – 85      137
85 – 90      71
90 – 96      35
96 – 102     37
102 – 108    13
108 – 114    15
114 – 120    10
120 – 125    6
125 – 131    6
131 – 137    4
137 – 143    2
143 – 149    2
149 – 154    1
154 – 160    0
160 – 166    1
166 – 172    4
172 – 178    0
178 – 184    0
184 – 189    0
189 – 195    0
195 – 201    0
201 – 207    0
207 – 213    1
213 – 219    0
219 – 224    0
224 – 230    0
230 – 236    1
Fig 3.
column23 · Review counts are extremely right-skewed with a median of 5 and a max of over 7 million, illustrating the blockbuster vs. obscurity divide.
Histogram bins for column23 (median: 5.0).
bin                      count
0 – 1.911e+05            122511
1.911e+05 – 3.821e+05    57
3.821e+05 – 5.732e+05    16
5.732e+05 – 7.642e+05    10
7.642e+05 – 9.553e+05    6
9.553e+05 – 1.146e+06    5
1.146e+06 – 1.337e+06    1
1.337e+06 – 1.528e+06    2
1.528e+06 – 1.719e+06    0
1.719e+06 – 1.911e+06    1
1.911e+06 – 2.102e+06    1
2.102e+06 – 2.293e+06    0
2.293e+06 – 2.484e+06    0
2.484e+06 – 2.675e+06    0
2.675e+06 – 2.866e+06    0
2.866e+06 – 3.057e+06    0
3.057e+06 – 3.248e+06    0
3.248e+06 – 3.439e+06    0
3.439e+06 – 3.63e+06     0
3.63e+06 – 3.821e+06     0
3.821e+06 – 4.012e+06    0
4.012e+06 – 4.203e+06    0
4.203e+06 – 4.394e+06    0
4.394e+06 – 4.585e+06    0
4.585e+06 – 4.776e+06    0
4.776e+06 – 4.967e+06    0
4.967e+06 – 5.158e+06    0
5.158e+06 – 5.349e+06    0
5.349e+06 – 5.541e+06    0
5.541e+06 – 5.732e+06    0
5.732e+06 – 5.923e+06    0
5.923e+06 – 6.114e+06    0
6.114e+06 – 6.305e+06    0
6.305e+06 – 6.496e+06    0
6.496e+06 – 6.687e+06    0
6.687e+06 – 6.878e+06    0
6.878e+06 – 7.069e+06    0
7.069e+06 – 7.26e+06     0
7.26e+06 – 7.451e+06     0
7.451e+06 – 7.642e+06    1
Fig 4.
column06 · Price distribution is heavily skewed toward low values (median $2.24, mean $4.77) with a long tail up to $999.98 — check for outlier pricing strategies.
Histogram bins for column06 (median: 2.24).
bin           count
0 – 25        120926
25 – 50       1081
50 – 75       248
75 – 100      47
100 – 125     6
125 – 150     13
150 – 175     1
175 – 200     282
200 – 225     0
225 – 250     0
250 – 275     1
275 – 300     2
300 – 325     0
325 – 350     0
350 – 375     0
375 – 400     0
400 – 425     0
425 – 450     0
450 – 475     0
475 – 500     0
500 – 525     1
525 – 550     0
550 – 575     0
575 – 600     0
600 – 625     0
625 – 650     0
650 – 675     0
675 – 700     0
700 – 725     0
725 – 750     0
750 – 775     0
775 – 800     0
800 – 825     0
825 – 850     0
850 – 875     0
875 – 900     0
900 – 925     0
925 – 950     0
950 – 975     0
975 – 1000    3
Fig 5.
column02 · Release dates cluster heavily in 2024–2025, showing recent acceleration in Steam publishing volume worth comparing against owner and review counts.
Character-length distribution for column02 (mean: 11.718671244831214).
chars    count
11       34494
12       88117
The export splits the 11–12 range into 40 bins; the 38 bins between the first (11) and the last (12) are all empty.
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column      kind           null %
column00    numeric        0.0%
column01    text           0.0%
column02    text           0.0%
column03    categorical    0.0%
column04    numeric        0.0%
column05    numeric        0.0%
column06    numeric        0.0%
column07    numeric        0.0%
column08    numeric        0.0%
column09    text           6.9%
column10    text           0.0%
column11    text           0.0%
column12    text           90.2%
column13    text           0.1%
column14    text           59.5%
column15    text           55.8%
column16    text           18.1%
column17    categorical    0.0%
column18    categorical    0.0%
column19    categorical    0.0%
column20    numeric        0.0%
column21    text           96.5%
column22    numeric        0.0%
column23    numeric        0.0%
column24    numeric        0.0%
column25    numeric        100.0%
column26    numeric        0.0%
column27    numeric        0.0%
column28    text           81.7%
column29    numeric        0.0%
column30    numeric        0.0%
column31    numeric        0.0%
column32    numeric        0.0%
column33    text           6.9%
column34    text           7.2%
column35    text           7.3%
column36    text           6.9%
column37    text           32.0%
column38    text           4.9%
column39    unknown        0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 18,468 detected strings).
lang    count    share
en      18419    99.7%
da      12       0.1%
de      10       0.1%
zh      9        0.0%
ja      9        0.0%
es      6        0.0%
pt      1        0.0%
fr      1        0.0%
ca      1        0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
          column00 column04 column05 column06 column07 column08 column20 column22 column23 column24 column25 column26
column00  +1.00    +0.19    +nan     -0.04    -0.16    -0.05    -0.25    +nan     -0.43    -0.49    -0.06    -0.29
column04  +0.19    +1.00    +nan     -0.04    -0.06    +0.38    -0.03    +nan     +0.11    -0.05    -0.04    -0.04
column05  +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan
column06  -0.04    -0.04    +nan     +1.00    -0.17    +0.10    -0.08    +nan     +0.12    -0.04    -0.27    +0.14
column07  -0.16    -0.06    +nan     -0.17    +1.00    +0.09    -0.06    +nan     +0.10    +0.26    +0.01    -0.04
column08  -0.05    +0.38    +nan     +0.10    +0.09    +1.00    -0.07    +nan     +0.19    +0.24    -0.22    +0.20
column20  -0.25    -0.03    +nan     -0.08    -0.06    -0.07    +1.00    +nan     -0.09    -0.07    +0.20    +0.08
column22  +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan     +nan
column23  -0.43    +0.11    +nan     +0.12    +0.10    +0.19    -0.09    +nan     +1.00    +0.71    -0.10    +0.21
column24  -0.49    -0.05    +nan     -0.04    +0.26    +0.24    -0.07    +nan     +0.71    +1.00    -0.14    +0.12
column25  -0.06    -0.04    +nan     -0.27    +0.01    -0.22    +0.20    +nan     -0.10    -0.14    +1.00    +0.08
column26  -0.29    -0.04    +nan     +0.14    -0.04    +0.20    +0.08    +nan     +0.21    +0.12    +0.08    +1.00

column00 numeric identifier

This column contains 122,611 numeric values that are all unique, null-free, and span from 10 to 4,264,350, strongly suggesting a unique numeric identifier; in a Steam catalogue this is most plausibly the Steam app ID. The distribution is flat and near-uniform: kurtosis of -1.05, negligible skew of 0.18, and zero detected outliers, which is highly unusual for a natural measurement or feature and is consistent with a sequentially or pseudo-randomly assigned integer key. The IQR of 1,806,385 is about 42% of the full range (4,264,340), close to the 50% that a perfectly uniform spread would give.

Treatment: Drop before modelling or use as a row key only; do not use as a predictive feature.

anthropic:default · confidence high
Out[14]:

saturn.columns["column00"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 122,611
min 10
max 4.264e+06
mean 1.985e+06
median 1.907e+06
std 1.088e+06
q1 1.063e+06
q3 2.87e+06
iqr 1.806e+06
skew 0.1772
kurtosis -1.05
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 9.
Distribution of column00. Vertical dash marks the median.
Histogram bins for column00 (median: 1907380.0).
bin                      count
10 – 1.066e+05           1208
1.066e+05 – 2.132e+05    280
2.132e+05 – 3.198e+05    2468
3.198e+05 – 4.264e+05    3789
4.264e+05 – 5.331e+05    3626
5.331e+05 – 6.397e+05    3651
6.397e+05 – 7.463e+05    4043
7.463e+05 – 8.529e+05    4024
8.529e+05 – 9.595e+05    3772
9.595e+05 – 1.066e+06    3926
1.066e+06 – 1.173e+06    4091
1.173e+06 – 1.279e+06    3895
1.279e+06 – 1.386e+06    3659
1.386e+06 – 1.493e+06    3825
1.493e+06 – 1.599e+06    3975
1.599e+06 – 1.706e+06    3815
1.706e+06 – 1.812e+06    3854
1.812e+06 – 1.919e+06    3820
1.919e+06 – 2.026e+06    3749
2.026e+06 – 2.132e+06    2991
2.132e+06 – 2.239e+06    3698
2.239e+06 – 2.345e+06    3661
2.345e+06 – 2.452e+06    3593
2.452e+06 – 2.559e+06    3305
2.559e+06 – 2.665e+06    3120
2.665e+06 – 2.772e+06    3232
2.772e+06 – 2.878e+06    3172
2.878e+06 – 2.985e+06    3134
2.985e+06 – 3.092e+06    3098
3.092e+06 – 3.198e+06    3011
3.198e+06 – 3.305e+06    2792
3.305e+06 – 3.411e+06    2880
3.411e+06 – 3.518e+06    2626
3.518e+06 – 3.625e+06    2386
3.625e+06 – 3.731e+06    2314
3.731e+06 – 3.838e+06    2229
3.838e+06 – 3.945e+06    2212
3.945e+06 – 4.051e+06    1778
4.051e+06 – 4.158e+06    1353
4.158e+06 – 4.264e+06    556

column01 text label

This column contains short, near-unique text strings averaging ~3 words and 18 characters, consistent with game titles in a Steam catalogue. The dominant top words ('playtest', 'vr', 'simulator') point to playtest builds, VR titles, and simulator games rather than to a single product type. Surprising signals include 1,156 duplicates (~0.94% duplicate rate) despite the near-unique alert, a small emoji presence (0.26%), and a maximum length of 413 characters, which is anomalously long relative to the median of 16.

Treatment: Use as a descriptive label; deduplicate or flag the 1,156 repeated entries, and investigate the long-tail outliers (len_max 413) before any downstream grouping or embedding.

anthropic:default · confidence medium
Out[17]:

saturn.columns["column01"].stats

stat value
n 122,611
nulls 1 (0.0%)
unique 121,454
len_min 1
len_max 413
len_mean 18.07
len_median 16
len_p95 37.55
word_mean 2.912
word_median 3
n_empty 0
n_duplicates 1,156
duplicate_rate 0.009428
vocab_size 18,813
readability_flesch_mean 52.87
emoji_rate 0.002585
url_rate 0
one_word_rate 0.1866
allcaps_rate 0.06731
boilerplate_rate 4.078e-05
alert: near_unique · 99.1% of rows are unique strings
Fig 10.
Character-length distribution for column01.
Character-length distribution for column01 (mean: 18.069627273468722).
chars        count
1 – 11       34028
11 – 22      53727
22 – 32      22850
32 – 42      8502
42 – 52      2281
52 – 63      858
63 – 73      213
73 – 83      69
83 – 94      28
94 – 104     21
104 – 114    13
114 – 125    8
125 – 135    4
135 – 145    3
145 – 156    1
156 – 166    1
166 – 176    0
176 – 186    1
186 – 197    0
197 – 207    0
207 – 217    0
217 – 228    0
228 – 238    0
238 – 248    0
248 – 258    1
258 – 269    0
269 – 279    0
279 – 289    0
289 – 300    0
300 – 310    0
310 – 320    0
320 – 331    0
331 – 341    0
341 – 351    0
351 – 362    0
362 – 372    0
372 – 382    0
382 – 392    0
392 – 403    0
403 – 413    1
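
The column01 treatment (flag repeated titles and unusually long ones) could be sketched as below. This is a minimal illustration on made-up titles, not part of the Saturn output: `flag_titles` and its `max_len=100` cut-off are hypothetical names and values chosen for the example. It flags every occurrence of a repeated title, whereas `n_duplicates` appears to count only the repeats beyond the first (122,610 non-null rows minus 121,454 unique values gives 1,156).

```python
from collections import Counter


def flag_titles(titles, max_len=100):
    """Return (duplicated, too_long) row-index lists for a list of titles.

    max_len is an illustrative cut-off, not a value taken from the report.
    None entries (nulls) are ignored.
    """
    counts = Counter(t for t in titles if t is not None)
    duplicated = [i for i, t in enumerate(titles)
                  if t is not None and counts[t] > 1]
    too_long = [i for i, t in enumerate(titles)
                if t is not None and len(t) > max_len]
    return duplicated, too_long


# Made-up sample: one repeated title, one null, one 413-character title.
sample = ["Alpha Playtest", "Beta VR", "Alpha Playtest", None, "x" * 413]
dups, long_rows = flag_titles(sample)
print(dups)       # indices of rows whose title occurs more than once
print(long_rows)  # indices of rows longer than max_len characters
```

Flagging rather than dropping keeps the decision reversible, which matters here because repeated titles may be legitimate distinct listings.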

column02 text timestamp

This column contains dates formatted as 'Mon DD, YYYY' (e.g., 'Oct 23, 2025'), stored as text rather than a native date type. The values span at least 2021–2025 based on top word frequencies, with a striking duplicate rate of 95.86%: the column holds only 5,081 distinct dates, so 117,530 of the 122,611 rows repeat a date that already appears in another row. The near-constant string length (median 12, min 11, max 12) and vocabulary of just 68 tokens confirm this is a tightly formatted date field with no free-text variation.

Treatment: Parse to a native date type (e.g., datetime64) before any time-series analysis or feature engineering.

anthropic:default · confidence high
Out[20]:

saturn.columns["column02"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 5,081
len_min 11
len_max 12
len_mean 11.72
len_median 12
len_p95 12
word_mean 3
word_median 3
n_empty 0
n_duplicates 117,530
duplicate_rate 0.9586
vocab_size 68
readability_flesch_mean 98.6
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 0
boilerplate_rate 0
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 95.9% duplicate strings
Fig 11.
Character-length distribution for column02.
Character-length distribution for column02 (mean: 11.718671244831214).
chars    count
11       34494
12       88117
The export splits the 11–12 range into 40 bins; the 38 bins between the first (11) and the last (12) are all empty.
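
Parsing column02 to a native date, as the treatment suggests, might look like the sketch below. It assumes every value follows the 'Mon DD, YYYY' pattern quoted above (with the day not zero-padded in the 11-character case) and that month abbreviations are English; `parse_release_date` is a hypothetical helper name. `%b` is locale-dependent, so this relies on an English or C locale.

```python
from datetime import date, datetime

# Assumed from the example in the report, e.g. "Oct 23, 2025".
DATE_FORMAT = "%b %d, %Y"


def parse_release_date(text):
    """Parse a 'Mon DD, YYYY' string to a date; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), DATE_FORMAT).date()
    except (ValueError, AttributeError):
        # ValueError: wrong format; AttributeError: non-string input such as None.
        return None


print(parse_release_date("Oct 23, 2025"))
print(parse_release_date("Oct 3, 2025"))
print(parse_release_date("Coming soon"))
```

Returning `None` for unparseable values keeps them visible as nulls instead of failing the whole column.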

column03 categorical feature

This column encodes a numeric quantity as binned range labels. The Overview reads these as estimated owner-count bands, which fits the scale (0 to 200,000,000) and the roughly logarithmic 1-2-5 spacing of the bin edges. The distribution is heavily right-skewed: 61.5% of rows fall in the '0 - 20000' bucket alone, and a notable 21,641 rows sit in '0 - 0', suggesting a zero-value spike that may warrant separate treatment. With only 14 distinct values and zero nulls across 122,611 rows, the encoding is clean but lossy.

Treatment: Ordinal-encode using the natural bin order, or extract bin midpoints as a numeric approximation; investigate the '0 - 0' segment (21,641 rows) as a potential distinct class.

anthropic:default · confidence high
Out[23]:

saturn.columns["column03"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 14
top_value 0 - 20000
top_rate 0.615
cardinality 14
entropy 1.814
entropy_ratio 0.4764
Fig 12.
Top values for column03.
Top values for column03 (14 unique shown, of 14 total).
value                    count   share
0 - 20000                75404   61.5%
0 - 0                    21641   17.7%
20000 - 50000            11396   9.3%
50000 - 100000           5355    4.4%
100000 - 200000          3454    2.8%
200000 - 500000          2853    2.3%
500000 - 1000000         1154    0.9%
1000000 - 2000000        729     0.6%
2000000 - 5000000        405     0.3%
5000000 - 10000000       125     0.1%
10000000 - 20000000      51      0.0%
20000000 - 50000000      31      0.0%
50000000 - 100000000     9       0.0%
100000000 - 200000000    4       0.0%
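
The ordinal and midpoint encodings suggested for column03 can be derived directly from the band labels. The sketch below assumes every label has the 'low - high' form shown in the table; `parse_band` and `encode_bands` are hypothetical helper names.

```python
def parse_band(label):
    """Split a label such as '20000 - 50000' into integer (low, high) bounds."""
    low, high = (int(part) for part in label.split(" - "))
    return low, high


def encode_bands(labels):
    """Map each distinct label to an ordinal rank and to its band midpoint."""
    # Sorting by (low, high) puts '0 - 0' before '0 - 20000'.
    distinct = sorted(set(labels), key=parse_band)
    rank = {label: i for i, label in enumerate(distinct)}
    midpoint = {label: sum(parse_band(label)) / 2 for label in distinct}
    return rank, midpoint


labels = ["0 - 20000", "0 - 0", "20000 - 50000", "0 - 20000"]
rank, midpoint = encode_bands(labels)
print(rank)
print(midpoint)
```

Because '0 - 0' gets its own rank and a midpoint of 0, it stays distinguishable from the '0 - 20000' band, in line with the advice to examine that segment separately.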

column04 numeric feature

This column is a heavily zero-inflated numeric field — likely a count, transaction amount, or event frequency — where 83.95% of values are exactly zero and the interquartile range is 0.0, meaning the entire middle half of the distribution is flat at zero. The remaining values are extremely right-skewed (skew = 209.95, kurtosis = 51452.44) with a max of 1,013,936 against a mean of only 54.59, indicating a small number of very large outliers; 16.05% of rows (19,676) are flagged as outliers. The 1,110 unique values and zero null rate suggest this may be a sparse activity or volume metric.

Treatment: Consider a two-part model (zero-inflation indicator + log1p transform on non-zero values) or cap/winsorize at a high percentile before modelling.

anthropic:default · confidence high
Out[26]:

saturn.columns["column04"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 1,110
min 0
max 1.014e+06
mean 54.59
median 0
std 3729
q1 0
q3 0
iqr 0
skew 210
kurtosis 5.145e+04
n_outliers 19,676
outlier_rate 0.1605
zero_rate 0.8395
alert: high_skew · skew=+209.95
alert: outliers · 16.0% rows beyond 1.5 IQR
Fig 13.
Distribution of column04. Vertical dash marks the median.
Histogram bins for column04 (median: 0.0).
bin                      count
0 – 2.535e+04            122573
2.535e+04 – 5.07e+04     22
5.07e+04 – 7.605e+04     6
7.605e+04 – 1.014e+05    2
1.014e+05 – 1.267e+05    2
1.267e+05 – 1.521e+05    1
1.521e+05 – 1.774e+05    2
1.774e+05 – 2.028e+05    0
2.028e+05 – 2.281e+05    0
2.281e+05 – 2.535e+05    0
2.535e+05 – 2.788e+05    0
2.788e+05 – 3.042e+05    0
3.042e+05 – 3.295e+05    1
3.295e+05 – 3.549e+05    0
3.549e+05 – 3.802e+05    0
3.802e+05 – 4.056e+05    0
4.056e+05 – 4.309e+05    0
4.309e+05 – 4.563e+05    0
4.563e+05 – 4.816e+05    0
4.816e+05 – 5.07e+05     0
5.07e+05 – 5.323e+05     0
5.323e+05 – 5.577e+05    0
5.577e+05 – 5.83e+05     0
5.83e+05 – 6.084e+05     0
6.084e+05 – 6.337e+05    1
6.337e+05 – 6.591e+05    0
6.591e+05 – 6.844e+05    0
6.844e+05 – 7.098e+05    0
7.098e+05 – 7.351e+05    0
7.351e+05 – 7.605e+05    0
7.605e+05 – 7.858e+05    0
7.858e+05 – 8.111e+05    0
8.111e+05 – 8.365e+05    0
8.365e+05 – 8.618e+05    0
8.618e+05 – 8.872e+05    0
8.872e+05 – 9.125e+05    0
9.125e+05 – 9.379e+05    0
9.379e+05 – 9.632e+05    0
9.632e+05 – 9.886e+05    0
9.886e+05 – 1.014e+06    1
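
The two-part treatment proposed for column04 (a zero indicator plus a log1p transform of the value) can be sketched as follows. This is only a feature-engineering illustration on made-up values, not a fitted model; `two_part` is a hypothetical helper name, and it assumes the values are non-negative, as the reported minimum of 0 indicates.

```python
import math


def two_part(values):
    """Return (is_nonzero, log1p_value) pairs for non-negative numbers."""
    # log1p(0) is exactly 0.0, so zero rows need no special offset.
    return [(int(v > 0), math.log1p(v)) for v in values]


# Made-up sample: mostly zeros plus one very large value.
sample = [0, 0, 0, 9, 1013936]
for flag, logged in two_part(sample):
    print(flag, round(logged, 3))
```

The indicator carries the zero versus non-zero split on its own, while the logged value compresses the long right tail into a range that is easier to scale.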

column05 numeric feature

This column is a low-cardinality integer count (only 15 distinct values, range 0–21) where 98.96% of rows are exactly zero, making it an extreme sparse count feature — likely recording rare events or occurrences per record. The distribution is severely right-skewed (skew 9.88, kurtosis 96.52) with only 1,272 outlier rows (1.04%) carrying any non-zero signal; the IQR is zero because all three quartiles collapse to 0.

Treatment: Treat as a sparse count; consider binarising (0 vs >0) or applying log1p transform, and flag the 1,272 non-zero rows as a minority sub-population for modelling.

anthropic:default · confidence high
Out[29]:

saturn.columns["column05"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 15
min 0
max 21
mean 0.1676
median 0
std 1.654
q1 0
q3 0
iqr 0
skew 9.883
kurtosis 96.52
n_outliers 1,272
outlier_rate 0.01037
zero_rate 0.9896
alert: high_skew · skew=+9.88
Fig 14.
Distribution of column05. Vertical dash marks the median.
Histogram bins for column05 (median: 0.0).
bin              count
0 – 0.525        121339
0.525 – 1.05     2
1.05 – 1.575     0
1.575 – 2.1      0
2.1 – 2.625      0
2.625 – 3.15     6
3.15 – 3.675     0
3.675 – 4.2      0
4.2 – 4.725      0
4.725 – 5.25     0
5.25 – 5.775     0
5.775 – 6.3      4
6.3 – 6.825      0
6.825 – 7.35     5
7.35 – 7.875     0
7.875 – 8.4      0
8.4 – 8.925      0
8.925 – 9.45     0
9.45 – 9.975     0
9.975 – 10.5     26
10.5 – 11.03     0
11.03 – 11.55    0
11.55 – 12.08    23
12.08 – 12.6     0
12.6 – 13.12     175
13.12 – 13.65    0
13.65 – 14.18    1
14.18 – 14.7     0
14.7 – 15.23     3
15.23 – 15.75    0
15.75 – 16.28    32
16.28 – 16.8     0
16.8 – 17.32     828
17.32 – 17.85    0
17.85 – 18.38    164
18.38 – 18.9     0
18.9 – 19.43     0
19.43 – 19.95    0
19.95 – 20.48    1
20.48 – 21       2

column06 numeric feature

This column is a continuous non-negative measure where most values are small; the Overview (Fig 4) reads it as price in dollars. The distribution is extreme: the median is 2.24 and Q3 is only 5.24, yet the max reaches 999.98, producing a skew of 22.4 and a kurtosis of 1,135. Over 7.5% of rows (9,297) are flagged as outliers, and 21.4% of values are exactly zero, suggesting a two-part structure (zero-inflation plus a heavy-tailed positive component) that would violate standard regression assumptions.

Treatment: Model the zero-inflation separately (e.g., hurdle or Tweedie model), then log1p-transform the positive portion before regression or scaling.

anthropic:default · confidence medium
Out[32]:

saturn.columns["column06"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 941
min 0
max 1000
mean 4.765
median 2.24
std 12.53
q1 0.55
q3 5.24
iqr 4.69
skew 22.4
kurtosis 1135
n_outliers 9,297
outlier_rate 0.07583
zero_rate 0.2137
alert: high_skew · skew=+22.40
alert: outliers · 7.6% rows beyond 1.5 IQR
Fig 15.
Distribution of column06. Vertical dash marks the median.
Histogram bins for column06 (median: 2.24).
bin           count
0 – 25        120926
25 – 50       1081
50 – 75       248
75 – 100      47
100 – 125     6
125 – 150     13
150 – 175     1
175 – 200     282
200 – 225     0
225 – 250     0
250 – 275     1
275 – 300     2
300 – 325     0
325 – 350     0
350 – 375     0
375 – 400     0
400 – 425     0
425 – 450     0
450 – 475     0
475 – 500     0
500 – 525     1
525 – 550     0
550 – 575     0
575 – 600     0
600 – 625     0
625 – 650     0
650 – 675     0
675 – 700     0
700 – 725     0
725 – 750     0
750 – 775     0
775 – 800     0
800 – 825     0
825 – 850     0
850 – 875     0
875 – 900     0
900 – 925     0
925 – 950     0
950 – 975     0
975 – 1000    3

column07 numeric feature

This column is a bounded numeric score or percentage, ranging from 0 to 100 with only 88 distinct values, suggesting a discretized or rounded measurement (e.g., a completion rate, satisfaction score, or grade). The most striking feature is that 66.8% of values are exactly zero, making the distribution heavily zero-inflated; the median is 0.0 while the mean is 18.35 and Q3 is only 40.0, confirming the mass is concentrated at the floor. Kurtosis is near zero (−0.05), which reflects the hard 0–100 bounds rather than a flat non-zero portion: the histogram bins show the non-zero values piling up at round numbers, with 9,742 rows in the 50–52.5 bin alone. Analysts should treat this as a zero-inflated bounded variable rather than a standard continuous feature.

Treatment: Model with a two-part (hurdle/zero-inflated) approach, or apply an indicator for zero alongside the raw value; avoid log-transform without offset due to zero mass.

anthropic:default · confidence high
Out[35]:

saturn.columns["column07"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 88
min 0
max 100
mean 18.35
median 0
std 28.86
q1 0
q3 40
iqr 40
skew 1.22
kurtosis -0.05072
n_outliers 0
outlier_rate 0
zero_rate 0.6682
Fig 16.
Distribution of column07. Vertical dash marks the median.
Histogram bins for column07 (median: 0.0).
bin           count
0 – 2.5       81930
2.5 – 5       0
5 – 7.5       0
7.5 – 10      0
10 – 12.5     620
12.5 – 15     16
15 – 17.5     437
17.5 – 20     15
20 – 22.5     2394
22.5 – 25     124
25 – 27.5     1471
27.5 – 30     39
30 – 32.5     2689
32.5 – 35     482
35 – 37.5     955
37.5 – 40     46
40 – 42.5     2358
42.5 – 45     52
45 – 47.5     419
47.5 – 50     73
50 – 52.5     9742
52.5 – 55     24
55 – 57.5     534
57.5 – 60     33
60 – 62.5     2493
62.5 – 65     45
65 – 67.5     1150
67.5 – 70     166
70 – 72.5     3120
72.5 – 75     77
75 – 77.5     2918
77.5 – 80     106
80 – 82.5     3754
82.5 – 85     263
85 – 87.5     1353
87.5 – 90     252
90 – 92.5     2217
92.5 – 95     56
95 – 97.5     182
97.5 – 100    6

column08 numeric feature

This column is a sparse count or event-frequency field: 85.5% of its 122,611 rows are exactly zero, the median and IQR are both 0, yet the mean is 0.55 and the max reaches 3,703. The extreme concentration at zero combined with a skew of 171.8 and kurtosis of 38,359 indicates a heavy-tailed distribution driven by rare but very large values; 14.5% of rows (17,771) are flagged as outliers. Only 117 distinct values across 122,611 rows further suggests this is a discrete count, not a continuous measure.

Treatment: Apply log1p transform or use a zero-inflated / Poisson model; consider capping or winsorizing at a high quantile given the max of 3703.

anthropic:default · confidence high
Out[38]:

saturn.columns["column08"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 117
min 0
max 3,703
mean 0.5459
median 0
std 14.52
q1 0
q3 0
iqr 0
skew 171.8
kurtosis 3.836e+04
n_outliers 17,771
outlier_rate 0.1449
zero_rate 0.8551
alert: high_skew · skew=+171.83
alert: outliers · 14.5% rows beyond 1.5 IQR
Fig 17.
Distribution of column08. Vertical dash marks the median.
Histogram bins for column08 (median: 0.0).
bin              count
0 – 92.58        122551
92.58 – 185.2    39
185.2 – 277.7    6
277.7 – 370.3    3
370.3 – 462.9    2
462.9 – 555.5    0
555.5 – 648      0
648 – 740.6      0
740.6 – 833.2    5
833.2 – 925.8    1
925.8 – 1018     1
1018 – 1111      1
1111 – 1203      0
1203 – 1296      0
1296 – 1389      0
1389 – 1481      0
1481 – 1574      0
1574 – 1666      0
1666 – 1759      0
1759 – 1852      0
1852 – 1944      0
1944 – 2037      1
2037 – 2129      0
2129 – 2222      0
2222 – 2314      0
2314 – 2407      0
2407 – 2500      0
2500 – 2592      0
2592 – 2685      0
2685 – 2777      0
2777 – 2870      0
2870 – 2962      0
2962 – 3055      0
3055 – 3148      0
3148 – 3240      0
3240 – 3333      0
3333 – 3425      0
3425 – 3518      0
3518 – 3610      0
3610 – 3703      1
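
The outliers alert for column08 uses a 1.5 × IQR rule. When q1 and q3 are both 0, the fences collapse onto zero and every non-zero row is flagged, which is consistent with outlier_rate (0.1449) and zero_rate (0.8551) summing to 1. The sketch below illustrates this on made-up data. It assumes standard Tukey fences and Python's inclusive quantile method; Saturn's exact quantile method is not stated in the report, and `tukey_outliers` is a hypothetical helper name.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample: 85% zeros, mirroring column08's zero_rate.
sample = [0] * 17 + [1, 4, 3703]
flagged = tukey_outliers(sample)
print(flagged)
print(len(flagged) / len(sample))  # share flagged; zero share is 17/20
```

For columns like this the outlier count mostly restates the non-zero count, so capping at a high quantile of the non-zero values may be a more informative way to bound the tail.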

column09 text free_text

This column contains long-form natural language text, likely user-generated content such as reviews, product descriptions, or messages — with a mean of 1,297 characters and 214 words per entry, and a vocabulary of 105,903 unique terms. The near-unique alert (113,556 unique values out of 122,611 rows) confirms these are essentially free-text narratives rather than categorical labels. Notably, 4.7% of entries contain emojis, suggesting informal or consumer-facing content, and the max length of 89,665 characters indicates some extreme outliers well beyond the 95th-percentile length of 2,966 characters. Flesch readability mean of 58.7 places the text in a 'fairly easy' register, consistent with consumer writing.

Treatment: Tokenize and embed (e.g., sentence-transformers) before modelling; flag or truncate the extreme-length outliers above len_p95 of 2,966 characters.

anthropic:default · confidence high
Out[41]:

saturn.columns["column09"].stats

stat value
n 122,611
nulls 8,449 (6.9%)
unique 113,556
len_min 1
len_max 89,665
len_mean 1297
len_median 1,064
len_p95 2,966
word_mean 214.3
word_median 177
n_empty 0
n_duplicates 606
duplicate_rate 0.005308
vocab_size 105,903
readability_flesch_mean 58.75
emoji_rate 0.04672
url_rate 0.0003504
one_word_rate 0.0004029
allcaps_rate 0.007559
boilerplate_rate 0.01517
alert: near_unique · 99.5% of rows are unique strings
Fig 18.
Character-length distribution for column09.
Character-length distribution for column09 (mean: 1297.112095092938).
chars            count
1 – 2243         100879
2243 – 4484      11956
4484 – 6726      1006
6726 – 8967      201
8967 – 11209     66
11209 – 13451    22
13451 – 15692    14
15692 – 17934    7
17934 – 20175    0
20175 – 22417    3
22417 – 24659    0
24659 – 26900    1
26900 – 29142    0
29142 – 31383    1
31383 – 33625    0
33625 – 35867    0
35867 – 38108    0
38108 – 40350    0
40350 – 42591    0
42591 – 44833    0
44833 – 47075    1
47075 – 49316    0
49316 – 51558    0
51558 – 53799    0
53799 – 56041    0
56041 – 58283    0
58283 – 60524    0
60524 – 62766    1
62766 – 65007    0
65007 – 67249    0
67249 – 69491    1
69491 – 71732    0
71732 – 73974    0
73974 – 76215    0
76215 – 78457    0
78457 – 80699    0
80699 – 82940    0
82940 – 85182    0
85182 – 87423    0
87423 – 89665    3

column10 text feature

This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or game). The dominant value is `['English']` appearing 55,314 times, with `[]` (no languages listed) in 8,380 rows. The duplicate rate is extremely high at 84.4%, which is expected given the limited vocabulary of 217 unique tokens and only 19,113 unique values across 122,611 rows — the data is stored as raw string-serialized lists rather than a normalized structure, which is a notable preprocessing concern.

Treatment: Parse the string-serialized lists into actual list structures, then multi-hot encode each language as a binary feature column.

anthropic:default · confidence high
Out[44]:

saturn.columns["column10"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 19,113
len_min 2
len_max 1,216
len_mean 68.02
len_median 11
len_p95 224
word_mean 6.889
word_median 1
n_empty 0
n_duplicates 103,498
duplicate_rate 0.8441
vocab_size 217
readability_flesch_mean 14.07
emoji_rate 0
url_rate 0
one_word_rate 0.5333
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 53.3% rows are a single word
alert: duplicates · 84.4% duplicate strings
Fig 19.
Character-length distribution for column10.
Character-length distribution for column10 (mean: 68.01866064219361).
chars          count
2 – 32         79054
32 – 63        13178
63 – 93        7232
93 – 123       5289
123 – 154      5081
154 – 184      3884
184 – 214      2265
214 – 245      1157
245 – 275      630
275 – 306      318
306 – 336      341
336 – 366      555
366 – 397      1037
397 – 427      512
427 – 457      49
457 – 488      28
488 – 518      12
518 – 548      5
548 – 579      9
579 – 609      2
609 – 639      0
639 – 670      2
670 – 700      0
700 – 730      2
730 – 761      0
761 – 791      1
791 – 821      2
821 – 852      6
852 – 882      1
882 – 912      4
912 – 943      8
943 – 973      0
973 – 1004     1
1004 – 1034    2
1034 – 1064    5
1064 – 1095    1
1095 – 1125    4
1125 – 1155    6
1155 – 1186    1
1186 – 1216    1927
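
Parsing the string-serialized lists in column10 and multi-hot encoding them, as the treatment recommends, could be sketched as follows. It assumes the cells are valid Python list literals such as `['English']` or `[]`, as quoted above; `parse_languages` and `multi_hot` are hypothetical helper names, and the sample cells are made up.

```python
import ast


def parse_languages(cell):
    """Turn a string such as "['English', 'French']" into a list of names."""
    try:
        value = ast.literal_eval(cell)  # evaluates literals only, never code
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


def multi_hot(cells):
    """Return (vocab, rows) with one 0/1 indicator per language in vocab."""
    parsed = [parse_languages(c) for c in cells]
    vocab = sorted({lang for langs in parsed for lang in langs})
    rows = [[int(lang in langs) for lang in vocab] for langs in parsed]
    return vocab, rows


cells = ["['English']", "[]", "['English', 'French']"]
vocab, rows = multi_hot(cells)
print(vocab)
print(rows)
```

`ast.literal_eval` is used instead of `eval` because the cells come from an external file and should never be executed as code.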

column11 text feature

This column contains serialized Python lists of language names, representing the supported or available languages for each record (likely a software product or media item). The dominant value is '[]' (empty list) appearing 72,730 times — nearly 60% of rows — indicating most records have no language metadata populated. Despite 122,611 rows, only 3,710 unique values exist and the duplicate rate is 96.97%, which is expected for a categorical-list field, but the vocabulary is tiny at just 194 words, confirming a closed set of language names.

Treatment: Parse the serialized list strings into proper multi-label indicators (one binary column per language) before modelling; treat '[]' as missing/unknown.

anthropic:default · confidence high
Out[47]:

saturn.columns["column11"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 3,710
len_min 2
len_max 1,216
len_mean 24.31
len_median 2
len_p95 46
word_mean 2.854
word_median 1
n_empty 0
n_duplicates 118,901
duplicate_rate 0.9697
vocab_size 194
readability_flesch_mean 8.003
emoji_rate 0
url_rate 0
one_word_rate 0.813
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 81.3% rows are a single word
alert: duplicates · 97.0% duplicate strings
Fig 20.
Character-length distribution for column11.
Character-length distribution for column11 (mean: 24.311350531355263).
chars          count
2 – 32         110168
32 – 63        7499
63 – 93        1015
93 – 123       742
123 – 154      555
154 – 184      422
184 – 214      254
214 – 245      161
245 – 275      75
275 – 306      32
306 – 336      34
336 – 366      85
366 – 397      345
397 – 427      70
427 – 457      10
457 – 488      11
488 – 518      1
518 – 548      1
548 – 579      0
579 – 609      0
609 – 639      0
639 – 670      0
670 – 700      0
700 – 730      1
730 – 761      0
761 – 791      0
791 – 821      0
821 – 852      5
852 – 882      0
882 – 912      0
912 – 943      0
943 – 973      0
973 – 1004     0
1004 – 1034    1
1034 – 1064    3
1064 – 1095    0
1095 – 1125    1
1125 – 1155    3
1155 – 1186    0
1186 – 1216    1117

column12 text free_text

This column contains substantial free-text descriptions or reviews, most likely about games: the word 'game' is the top non-stopword at 7,882 occurrences, average text length is ~340 characters (~57 words), and the vocabulary spans 61,840 unique tokens. The 90.16% null rate is a major alert: only 12,070 of 122,611 rows carry any content, meaning this field is sparsely populated. An emoji_rate of ~1.6% and a mean Flesch readability score of ~57.8 suggest informal, consumer-written prose. The near_unique flag is partially explained by the sparse population: 11,884 unique values among the 12,070 non-null rows confirms almost every entry is distinct.

Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) before modelling; impute or mask nulls explicitly given the 90.16% null rate.

anthropic:default · confidence high
Out[50]:

saturn.columns["column12"].stats

stat value
n 122,611
nulls 110,541 (90.2%)
unique 11,884
len_min 3
len_max 2,912
len_mean 340.3
len_median 295
len_p95 763
word_mean 57.37
word_median 49
n_empty 0
n_duplicates 186
duplicate_rate 0.01541
vocab_size 61,840
readability_flesch_mean 57.83
emoji_rate 0.01649
url_rate 0
one_word_rate 0
allcaps_rate 0.008202
boilerplate_rate 0
alert: near_unique 98.5% of rows are unique strings
alert: null_rate 90.2% null
Fig 21.
Character-length distribution for column12.
Character-length distribution for column12 (mean: 340.28823529411767).
chars count
3 – 76 812
76 – 148 1475
148 – 221 1866
221 – 294 1856
294 – 367 1625
367 – 439 1313
439 – 512 954
512 – 585 652
585 – 658 495
658 – 730 299
730 – 803 226
803 – 876 128
876 – 948 104
948 – 1021 80
1021 – 1094 45
1094 – 1167 28
1167 – 1239 18
1239 – 1312 31
1312 – 1385 20
1385 – 1458 11
1458 – 1530 4
1530 – 1603 5
1603 – 1676 3
1676 – 1748 3
1748 – 1821 5
1821 – 1894 2
1894 – 1967 1
1967 – 2039 1
2039 – 2112 1
2112 – 2185 1
2185 – 2257 0
2257 – 2330 0
2330 – 2403 1
2403 – 2476 1
2476 – 2548 1
2548 – 2621 0
2621 – 2694 0
2694 – 2767 0
2767 – 2839 1
2839 – 2912 2

column13 text metadata

This column contains Steam CDN URLs pointing to game header images hosted on Akamai's steamstatic.com infrastructure, specifically `header.jpg` assets keyed by Steam app ID. With a url_rate of 1.0 and one_word_rate of 1.0, every non-null value is a single URL. The column is near-unique (122,420 distinct values among 122,530 non-null rows), with only 110 duplicates, suggesting these map closely to individual game or product records; the small number of repeated URLs (max frequency 5) likely reflects games appearing in multiple dataset rows.

Treatment: Extract Steam app ID from URL path for joining; drop raw URL before modelling or store as-is for image retrieval pipelines.
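
A minimal sketch of the app-ID extraction, assuming the numeric ID appears in the URL path as `/apps/<digits>/`. That path layout is an assumption (the report only says the assets are keyed by app ID), and both the helper name `extract_app_id` and the example URL are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed path layout: ".../apps/<numeric app id>/header.jpg".
APP_ID_PATTERN = re.compile(r"/apps/(\d+)/")


def extract_app_id(url):
    """Return the app ID as an int, or None if the URL is missing or does not match."""
    if not url:
        return None
    match = APP_ID_PATTERN.search(urlparse(url).path)
    return int(match.group(1)) if match else None


# Hypothetical URL, for illustration only.
app_id = extract_app_id("https://cdn.example.invalid/steam/apps/12345/header.jpg?t=1700000000")
```

Parsing the path rather than the whole string keeps query-string digits (cache-busting timestamps, for example) from being mistaken for the ID.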

anthropic:default · confidence high
Out[53]:

saturn.columns["column13"].stats

stat value
n 122,611
nulls 81 (0.1%)
unique 122,420
len_min 93
len_max 153
len_mean 104.6
len_median 98
len_p95 139
word_mean 1
word_median 1
n_empty 0
n_duplicates 110
duplicate_rate 0.0008977
vocab_size 19,992
readability_flesch_mean -834.3
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.9% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
Fig 22.
Character-length distribution for column13.
Character-length distribution for column13 (mean: 104.62648331020975).
chars count
93 – 94 29
94 – 96 238
96 – 98 27191
98 – 99 74714
99 – 100 0
100 – 102 0
102 – 104 0
104 – 105 0
105 – 106 0
106 – 108 0
108 – 110 0
110 – 111 1
111 – 112 16
112 – 114 0
114 – 116 0
116 – 117 0
117 – 118 0
118 – 120 0
120 – 122 0
122 – 123 0
123 – 124 0
124 – 126 0
126 – 128 0
128 – 129 0
129 – 130 0
130 – 132 0
132 – 134 0
134 – 135 0
135 – 136 11
136 – 138 39
138 – 140 19722
140 – 141 0
141 – 142 0
142 – 144 0
144 – 146 0
146 – 147 0
147 – 148 0
148 – 150 0
150 – 152 64
152 – 153 505

column14 text metadata

This column contains publisher or developer website URLs, almost certainly scraped from a Steam or similar games catalogue. Virtually every non-null value is a single URL (one_word_rate 0.9999, url_rate 0.9999), pointing to publisher homepages, Facebook pages, or Steam publisher/group pages. Two signals stand out: 59.48% of rows are null, meaning many game records carry no website; and 20.08% of non-null values are duplicates (9,973 repeated URLs), reflecting publishers with large catalogues who share one website across many titles.

Treatment: Extract domain as a categorical publisher identifier; flag or impute nulls; do not embed raw URL strings.
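
Domain extraction can be sketched with the standard library. The helper name `extract_domain` is hypothetical, and stripping a leading `www.` is a design choice assumed here, not something the report specifies.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return a lowercased host name without a leading 'www.', or None."""
    if not url or not url.strip():
        return None
    text = url.strip()
    if "://" not in text:
        # urlparse only fills in the host when a scheme (or '//') is present.
        text = "http://" + text
    try:
        host = urlparse(text).hostname
    except ValueError:
        return None  # malformed URL
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


# Hypothetical URL, for illustration only.
domain = extract_domain("https://www.Example.com/games")
```

Shared hosts such as social-media or storefront pages collapse many publishers into one domain, so the resulting category is a coarse publisher proxy rather than an exact identifier.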

anthropic:default · confidence high
Out[56]:

saturn.columns["column14"].stats

stat value
n 122,611
nulls 72,935 (59.5%)
unique 39,703
len_min 7
len_max 236
len_mean 32.57
len_median 29
len_p95 56
word_mean 1
word_median 1
n_empty 0
n_duplicates 9,973
duplicate_rate 0.2008
vocab_size 17,059
readability_flesch_mean -260.3
emoji_rate 0
url_rate 0.9999
one_word_rate 0.9999
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 59.5% null
alert: duplicates 20.1% duplicate strings
Fig 23.
Character-length distribution for column14.
Character-length distribution for column14 (mean: 32.568805861985666).
chars count
7 – 13 2
13 – 18 1050
18 – 24 10802
24 – 30 13931
30 – 36 10202
36 – 41 5397
41 – 47 3308
47 – 53 1755
53 – 59 1232
59 – 64 791
64 – 70 277
70 – 76 245
76 – 81 223
81 – 87 149
87 – 93 53
93 – 99 34
99 – 104 22
104 – 110 29
110 – 116 26
116 – 122 33
122 – 127 26
127 – 133 18
133 – 139 16
139 – 144 20
144 – 150 4
150 – 156 6
156 – 162 12
162 – 167 3
167 – 173 1
173 – 179 7
179 – 184 1
184 – 190 0
190 – 196 0
196 – 202 0
202 – 207 0
207 – 213 0
213 – 219 0
219 – 225 0
225 – 230 0
230 – 236 1

column15 text metadata

This column is a support/contact URL field — almost certainly a developer or publisher support link associated with game or software records. 95.6% of non-null values are URLs, and the one-word rate is 99.9%, consistent with bare URL strings. Two surprises stand out: the null rate is very high at 55.8%, meaning more than half of records lack this URL, and the duplicate rate is 34.7% (18,808 duplicate values out of ~54,200 non-null rows), reflecting that many games share the same support domain (e.g., Big Fish Games, EA, Facebook pages).

Treatment: Extract domain as a categorical feature; treat raw URL as a grouping key rather than a text feature; impute or flag nulls separately given 55.8% null rate.

anthropic:default · confidence high
Out[59]:

saturn.columns["column15"].stats

stat value
n 122,611
nulls 68,404 (55.8%)
unique 35,399
len_min 1
len_max 851
len_mean 31.19
len_median 29
len_p95 51
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 18,808
duplicate_rate 0.347
vocab_size 14,875
readability_flesch_mean -245.1
emoji_rate 0
url_rate 0.9559
one_word_rate 0.9993
allcaps_rate 0.0007933
boilerplate_rate 0
alert: one_word 99.9% rows are a single word
alert: url_heavy 95.6% rows contain a URL
alert: null_rate 55.8% null
alert: duplicates 34.7% duplicate strings
Fig 24.
Character-length distribution for column15.
Character-length distribution for column15 (mean: 31.185455752947036).
chars count
1 – 22 9827
22 – 44 38992
44 – 65 4610
65 – 86 477
86 – 107 155
107 – 128 82
128 – 150 34
150 – 171 10
171 – 192 2
192 – 214 12
214 – 235 1
235 – 256 2
256 – 277 0
277 – 298 0
298 – 320 1
320 – 341 0
341 – 362 1
362 – 384 0
384 – 405 0
405 – 426 0
426 – 447 0
447 – 468 0
468 – 490 0
490 – 511 0
511 – 532 0
532 – 554 0
554 – 575 0
575 – 596 0
596 – 617 0
617 – 638 0
638 – 660 0
660 – 681 0
681 – 702 0
702 – 724 0
724 – 745 0
745 – 766 0
766 – 787 0
787 – 808 0
808 – 830 0
830 – 851 1

column16 text foreign_key

This column contains email addresses for game developers or publishers, as evidenced by the top values (e.g., 'info@bigfishgames.com', 'support@quanticlab.com'). Nearly all values (99.86%) are single tokens, consistent with email format. The duplicate rate is high at 39.7% (39,849 duplicates among the 100,368 non-null rows), indicating many records share a contact email, as expected for a publisher-level field where one entity owns multiple titles. The null rate of 18.14% is notable and should be investigated for systematic missingness.

Treatment: Use as a grouping/join key on publisher or developer entity; normalize to lowercase and strip whitespace before joining.
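
The normalisation-then-group step can be sketched as below. The helper names are hypothetical, the `(record_id, email)` row shape is an assumption, and the `"@"` test is only a crude validity check added because len_min is 1, so some values cannot be real addresses.

```python
from collections import defaultdict


def normalize_email(value):
    """Lowercase and strip an email; return None for nulls or values without '@'."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned if "@" in cleaned else None  # crude check, an assumption


def group_by_contact(rows):
    """Group record IDs by normalized email. rows: iterable of (record_id, email)."""
    groups = defaultdict(list)
    for record_id, email in rows:
        key = normalize_email(email)
        if key is not None:
            groups[key].append(record_id)
    return dict(groups)


# Invented rows, for illustration only.
groups = group_by_contact([(1, " Info@Example.com "), (2, "info@example.com"), (3, None), (4, "n/a")])
```

Normalising before grouping matters because case or whitespace variants of one address would otherwise split a single publisher into several keys.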

anthropic:default · confidence high
Out[62]:

saturn.columns["column16"].stats

stat value
n 122,611
nulls 22,243 (18.1%)
unique 60,519
len_min 1
len_max 169
len_mean 22.91
len_median 23
len_p95 31
word_mean 1.004
word_median 1
n_empty 0
n_duplicates 39,849
duplicate_rate 0.397
vocab_size 15,319
readability_flesch_mean -223.7
emoji_rate 9.963e-06
url_rate 0.003906
one_word_rate 0.9986
allcaps_rate 0.001016
boilerplate_rate 0
alert: one_word 99.9% rows are a single word
alert: duplicates 39.7% duplicate strings
Fig 25.
Character-length distribution for column16.
Character-length distribution for column16 (mean: 22.90802845528455).
chars count
1 – 5 54
5 – 9 23
9 – 14 1014
14 – 18 10969
18 – 22 28430
22 – 26 39631
26 – 30 14560
30 – 35 3948
35 – 39 954
39 – 43 416
43 – 47 184
47 – 51 55
51 – 56 61
56 – 60 37
60 – 64 8
64 – 68 4
68 – 72 3
72 – 77 2
77 – 81 1
81 – 85 0
85 – 89 3
89 – 93 0
93 – 98 0
98 – 102 3
102 – 106 1
106 – 110 3
110 – 114 1
114 – 119 0
119 – 123 0
123 – 127 0
127 – 131 0
131 – 135 0
135 – 140 0
140 – 144 0
144 – 148 0
148 – 152 0
152 – 156 2
156 – 161 0
161 – 165 0
165 – 169 1

column17 categorical feature

This column is a boolean flag stored as string values ('True'/'False'), covering 122,611 rows with no nulls. It is severely imbalanced: 'True' accounts for 99.964% of rows (122,567 occurrences) while 'False' appears only 44 times. The near-zero entropy (0.0046) confirms the column carries almost no information, making it nearly constant.

Treatment: Investigate whether the 44 'False' rows are meaningful anomalies; otherwise drop as near-constant with no predictive variance.
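
The entropy figure can be reproduced from the two value counts. The sketch below assumes Saturn reports Shannon entropy in bits (base-2 logarithm) and normalises by log2 of the cardinality for entropy_ratio; that reading is consistent with the numbers in this report but is not confirmed by it, and the function names are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for count in counts if count > 0)
    if k < 2:
        return 0.0
    return shannon_entropy(counts) / math.log2(k)


# Counts taken from the column17 value table: 122,567 'True' and 44 'False'.
h = shannon_entropy([122567, 44])
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for the boolean columns in this report.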

anthropic:default · confidence high
Out[65]:

saturn.columns["column17"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 2
top_value True
top_rate 0.9996
cardinality 2
entropy 0.004625
entropy_ratio 0.004625
alert: imbalance top value is 100.0% of rows
Fig 26.
Top values for column17.
Top values for column17 (2 unique shown, of 2 total).
value count share
True 122567 100.0%
False 44 0.0%

column18 categorical feature

This column is a binary boolean flag stored as string literals 'True'/'False', with zero nulls across 122,611 rows. The dominant value is 'False' at 82.6% (101,319 occurrences), leaving 'True' at roughly 17.4% (21,292) — a moderately imbalanced split that may matter for classification tasks. The entropy ratio of 0.666 confirms meaningful but uneven information content.

Treatment: Cast to boolean/integer (0/1) and monitor class imbalance if used as a target or predictor.

anthropic:default · confidence high
Out[68]:

saturn.columns["column18"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 2
top_value False
top_rate 0.8263
cardinality 2
entropy 0.666
entropy_ratio 0.666
Fig 27.
Top values for column18.
Top values for column18 (2 unique shown, of 2 total).
value count share
False 101319 82.6%
True 21292 17.4%

column19 categorical label

This column is a boolean flag stored as string literals 'True'/'False', covering all 122,611 rows with zero nulls. The distribution is heavily skewed: 'False' dominates at 87.2% (106,905 rows) versus 'True' at only 12.8% (15,706 rows). The low entropy of 0.552 confirms the imbalance. An analyst building a classifier on this as a target should anticipate class imbalance requiring resampling or adjusted class weights.

Treatment: encode as binary integer (False=0, True=1) and address class imbalance (~87/13 split) before modelling.
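
One way to act on this treatment is to encode the flag and derive per-class weights from the counts. The sketch uses the common "balanced" heuristic, weight = n / (k * count), which is one option among several (resampling is another) and is not prescribed by the report; the function names are hypothetical.

```python
def encode_flag(value):
    """Map the strings 'True'/'False' to 1/0; any other value becomes None."""
    return {"True": 1, "False": 0}.get(str(value).strip())


def balanced_class_weights(counts):
    """Weights inversely proportional to class frequency: n / (k * count).

    counts: dict mapping class label -> number of rows.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * count) for label, count in counts.items()}


# Counts taken from the column19 value table (False=106,905, True=15,706).
weights = balanced_class_weights({0: 106905, 1: 15706})
```

Under this heuristic the minority class receives a weight roughly seven times that of the majority class, mirroring the ~87/13 split.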

anthropic:default · confidence high
Out[71]:

saturn.columns["column19"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 2
top_value False
top_rate 0.8719
cardinality 2
entropy 0.5522
entropy_ratio 0.5522
Fig 28.
Top values for column19.
Top values for column19 (2 unique shown, of 2 total).
value count share
False 106905 87.2%
True 15706 12.8%

column20 numeric feature

This column is a sparse numeric field with only 73 distinct values across 122,611 rows. The distribution is concentrated at zero: 96.5% of values are exactly 0, so the median and IQR are both 0.0, while the max reaches 97.0, producing strong positive skew (5.23) and kurtosis (25.75). Because the IQR is 0, the 1.5 IQR rule flags every non-zero row, so the 4,256 outlier rows (3.47%) are simply the non-zero sub-population. In the histogram those rows sit mostly between about 50 and 95 instead of decaying away from zero, which looks more like a bounded 0 to 100 score in which 0 may stand for "no value" than like an event count. That non-zero segment is the analytically interesting one.

Treatment: Apply log1p transform or binarise (zero vs. non-zero) before modelling; consider separating the zero-inflated mass from the active tail for two-part modelling.

anthropic:default · confidence high
Out[74]:

saturn.columns["column20"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 73
min 0
max 97
mean 2.565
median 0
std 13.66
q1 0
q3 0
iqr 0
skew 5.227
kurtosis 25.75
n_outliers 4,256
outlier_rate 0.03471
zero_rate 0.9653
alert: high_skew skew=+5.23
Fig 29.
Distribution of column20. Vertical dash marks the median.
Histogram bins for column20 (median: 0.0).
bin count
0 – 2.425 118355
2.425 – 4.85 0
4.85 – 7.275 1
7.275 – 9.7 0
9.7 – 12.12 0
12.12 – 14.55 0
14.55 – 16.97 0
16.97 – 19.4 0
19.4 – 21.82 1
21.82 – 24.25 2
24.25 – 26.67 0
26.67 – 29.1 2
29.1 – 31.52 2
31.52 – 33.95 3
33.95 – 36.38 8
36.38 – 38.8 6
38.8 – 41.22 16
41.22 – 43.65 14
43.65 – 46.07 19
46.07 – 48.5 21
48.5 – 50.92 29
50.92 – 53.35 62
53.35 – 55.77 56
55.77 – 58.2 100
58.2 – 60.62 85
60.62 – 63.05 178
63.05 – 65.47 154
65.47 – 67.9 179
67.9 – 70.32 398
70.32 – 72.75 270
72.75 – 75.17 514
75.17 – 77.6 383
77.6 – 80.02 605
80.02 – 82.45 367
82.45 – 84.88 283
84.88 – 87.3 268
87.3 – 89.72 109
89.72 – 92.15 88
92.15 – 94.57 24
94.57 – 97 9

column21 text

Out[77]:

saturn.columns["column21"].stats

stat value
n 122,611
nulls 118,355 (96.5%)
unique 4,160
len_min 42
len_max 142
len_mean 72.43
len_median 70
len_p95 91
word_mean 1
word_median 1
n_empty 0
n_duplicates 96
duplicate_rate 0.02256
vocab_size 4,160
readability_flesch_mean -704.1
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 97.7% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 96.5% null
Fig 30.
Character-length distribution for column21.
Character-length distribution for column21 (mean: 72.42857142857143).
chars count
42 – 44 2
44 – 47 1
47 – 50 0
50 – 52 5
52 – 54 5
54 – 57 1
57 – 60 89
60 – 62 209
62 – 64 597
64 – 67 444
67 – 70 632
70 – 72 377
72 – 74 441
74 – 77 238
77 – 80 307
80 – 82 189
82 – 84 241
84 – 87 94
87 – 90 106
90 – 92 74
92 – 94 74
94 – 97 33
97 – 100 38
100 – 102 17
102 – 104 10
104 – 107 10
107 – 110 12
110 – 112 3
112 – 114 1
114 – 117 0
117 – 120 5
120 – 122 0
122 – 124 0
124 – 127 0
127 – 130 0
130 – 132 0
132 – 134 0
134 – 137 0
137 – 140 0
140 – 142 1

column22 numeric feature

This column is almost certainly a sparse indicator or rare-event count: 99.97% of its 122,611 values are exactly zero, with only 40 flagged outliers and a maximum of 100.0. The 31 unique values and an IQR of 0.0 confirm that the vast majority of rows carry no signal at all. The extreme skew (59.25) and kurtosis (3,627.8) are a direct consequence of this near-total zero mass, making standard continuous modelling inappropriate without transformation or binarisation.

Treatment: Binarise (zero vs. non-zero) or treat as a rare-event indicator; if the raw magnitude matters, cap at a sensible percentile and log1p-transform before modelling.

anthropic:default · confidence high
Out[80]:

saturn.columns["column22"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 31
min 0
max 100
mean 0.02455
median 0
std 1.395
q1 0
q3 0
iqr 0
skew 59.25
kurtosis 3628
n_outliers 40
outlier_rate 0.0003262
zero_rate 0.9997
alert: high_skew skew=+59.25
Fig 31.
Distribution of column22. Vertical dash marks the median.
Histogram bins for column22 (median: 0.0).
bin count
0 – 2.5 122571
2.5 – 5 0
5 – 7.5 0
7.5 – 10 0
10 – 12.5 0
12.5 – 15 0
15 – 17.5 0
17.5 – 20 0
20 – 22.5 0
22.5 – 25 0
25 – 27.5 0
27.5 – 30 0
30 – 32.5 0
32.5 – 35 0
35 – 37.5 1
37.5 – 40 0
40 – 42.5 0
42.5 – 45 0
45 – 47.5 2
47.5 – 50 0
50 – 52.5 2
52.5 – 55 1
55 – 57.5 2
57.5 – 60 0
60 – 62.5 2
62.5 – 65 1
65 – 67.5 2
67.5 – 70 3
70 – 72.5 1
72.5 – 75 1
75 – 77.5 3
77.5 – 80 1
80 – 82.5 3
82.5 – 85 3
85 – 87.5 1
87.5 – 90 1
90 – 92.5 1
92.5 – 95 1
95 – 97.5 3
97.5 – 100 5

column23 numeric feature

This column is a non-negative count or magnitude field, grouped with the numeric engagement columns in the summary, with 122,611 non-null records and only 5,540 distinct values. The distribution is extraordinarily right-skewed (skew=177.84, kurtosis=45,295.94): the median is just 5.0 while the mean is 1,044.99, and the maximum reaches 7,642,084, roughly 272 standard deviations above the mean. About 34.5% of values are zero and 17.0% are flagged as outliers (20,797 rows), so the column combines a large mass at zero with a heavy right tail in which a few extreme values dominate the mean.

Treatment: Apply log1p-transform (to handle zeros) before modelling, and consider capping or winsorizing at a high percentile to suppress the extreme outliers up to 7,642,084.
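
A minimal sketch of the cap-then-transform step described in this treatment. The 99th-percentile default and the linear-interpolation percentile are illustrative choices, not values taken from the report, and the function names are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted values."""
    if not sorted_values:
        raise ValueError("percentile of empty data")
    pos = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def winsorize_log1p(values, upper_q=99.0):
    """Cap values at the upper_q percentile, then apply log1p (safe at zero)."""
    cap = percentile(sorted(values), upper_q)
    return [math.log1p(min(v, cap)) for v in values]


# Invented sample with one extreme value, capped at the 75th percentile.
transformed = winsorize_log1p([0, 0, 5, 37, 1_000_000], upper_q=75)
```

Capping before the log keeps a single multi-million value from stretching the transformed scale, while log1p maps the many zeros to exactly 0.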

anthropic:default · confidence high
Out[83]:

saturn.columns["column23"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 5,540
min 0
max 7.642e+06
mean 1045
median 5
std 2.809e+04
q1 0
q3 37
iqr 37
skew 177.8
kurtosis 4.53e+04
n_outliers 20,797
outlier_rate 0.1696
zero_rate 0.3448
alert: high_skew skew=+177.84
alert: outliers 17.0% rows beyond 1.5 IQR
Fig 32.
Distribution of column23. Vertical dash marks the median.
Histogram bins for column23 (median: 5.0).
bin count
0 – 1.911e+05 122511
1.911e+05 – 3.821e+05 57
3.821e+05 – 5.732e+05 16
5.732e+05 – 7.642e+05 10
7.642e+05 – 9.553e+05 6
9.553e+05 – 1.146e+06 5
1.146e+06 – 1.337e+06 1
1.337e+06 – 1.528e+06 2
1.528e+06 – 1.719e+06 0
1.719e+06 – 1.911e+06 1
1.911e+06 – 2.102e+06 1
2.102e+06 – 2.293e+06 0
2.293e+06 – 2.484e+06 0
2.484e+06 – 2.675e+06 0
2.675e+06 – 2.866e+06 0
2.866e+06 – 3.057e+06 0
3.057e+06 – 3.248e+06 0
3.248e+06 – 3.439e+06 0
3.439e+06 – 3.63e+06 0
3.63e+06 – 3.821e+06 0
3.821e+06 – 4.012e+06 0
4.012e+06 – 4.203e+06 0
4.203e+06 – 4.394e+06 0
4.394e+06 – 4.585e+06 0
4.585e+06 – 4.776e+06 0
4.776e+06 – 4.967e+06 0
4.967e+06 – 5.158e+06 0
5.158e+06 – 5.349e+06 0
5.349e+06 – 5.541e+06 0
5.541e+06 – 5.732e+06 0
5.732e+06 – 5.923e+06 0
5.923e+06 – 6.114e+06 0
6.114e+06 – 6.305e+06 0
6.305e+06 – 6.496e+06 0
6.496e+06 – 6.687e+06 0
6.687e+06 – 6.878e+06 0
6.878e+06 – 7.069e+06 0
7.069e+06 – 7.26e+06 0
7.26e+06 – 7.451e+06 0
7.451e+06 – 7.642e+06 1

column24 numeric feature

This column is likely a count or frequency measure (e.g., event occurrences, transaction counts, or interaction tallies) given its non-negative integer-like range and high zero rate. The distribution is extraordinarily right-skewed: the median is 1.0 and Q3 is only 10.0, yet the maximum reaches 1,173,003 — a difference of over six orders of magnitude. With 45% zeros, ~16.9% flagged outliers (20,696 rows), a skew of 156.86, and kurtosis exceeding 30,000, the bulk of records cluster near zero while a small number of extreme values dominate the mean (169.20 vs. median 1.0). This is a severe long-tail distribution that will distort any linear model if used as-is.

Treatment: Apply log1p-transform (or cap at a high percentile) before modelling to reduce extreme skew.

anthropic:default · confidence medium
Out[86]:

saturn.columns["column24"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 2,725
min 0
max 1.173e+06
mean 169.2
median 1
std 5375
q1 0
q3 10
iqr 10
skew 156.9
kurtosis 3.063e+04
n_outliers 20,696
outlier_rate 0.1688
zero_rate 0.4502
alert: high_skew skew=+156.86
alert: outliers 16.9% rows beyond 1.5 IQR
Fig 33.
Distribution of column24. Vertical dash marks the median.
Histogram bins for column24 (median: 1.0).
bin count
0 – 2.933e+04 122529
2.933e+04 – 5.865e+04 40
5.865e+04 – 8.798e+04 18
8.798e+04 – 1.173e+05 9
1.173e+05 – 1.466e+05 3
1.466e+05 – 1.76e+05 3
1.76e+05 – 2.053e+05 0
2.053e+05 – 2.346e+05 1
2.346e+05 – 2.639e+05 2
2.639e+05 – 2.933e+05 1
2.933e+05 – 3.226e+05 1
3.226e+05 – 3.519e+05 1
3.519e+05 – 3.812e+05 0
3.812e+05 – 4.106e+05 0
4.106e+05 – 4.399e+05 0
4.399e+05 – 4.692e+05 1
4.692e+05 – 4.985e+05 0
4.985e+05 – 5.279e+05 0
5.279e+05 – 5.572e+05 0
5.572e+05 – 5.865e+05 0
5.865e+05 – 6.158e+05 0
6.158e+05 – 6.452e+05 0
6.452e+05 – 6.745e+05 0
6.745e+05 – 7.038e+05 0
7.038e+05 – 7.331e+05 0
7.331e+05 – 7.625e+05 0
7.625e+05 – 7.918e+05 0
7.918e+05 – 8.211e+05 0
8.211e+05 – 8.504e+05 0
8.504e+05 – 8.798e+05 0
8.798e+05 – 9.091e+05 0
9.091e+05 – 9.384e+05 0
9.384e+05 – 9.677e+05 0
9.677e+05 – 9.971e+05 0
9.971e+05 – 1.026e+06 0
1.026e+06 – 1.056e+06 1
1.056e+06 – 1.085e+06 0
1.085e+06 – 1.114e+06 0
1.114e+06 – 1.144e+06 0
1.144e+06 – 1.173e+06 1

column25 numeric

Out[89]:

saturn.columns["column25"].stats

stat value
n 122,611
nulls 122,571 (100.0%)
unique 3
min 98
max 100
mean 99.17
median 99
std 0.6751
q1 99
q3 100
iqr 1
skew -0.2149
kurtosis -0.7872
n_outliers 0
outlier_rate 0
zero_rate 0
alert: null_rate 100.0% null
Fig 34.
Distribution of column25. Vertical dash marks the median.
Histogram bins for column25 (median: 99.0).
bin count
98 – 98.33 6
98.33 – 98.67 0
98.67 – 99 0
99 – 99.33 21
99.33 – 99.67 0
99.67 – 100 13

column26 numeric feature

This column is likely a count or frequency metric, given its non-negative integer values with only 448 distinct values across 122,611 rows. The distribution is severely right-skewed (skew=32.63, kurtosis=1192.15): the median is just 2.0 while the mean is 18.09, Q1 is 0.0, and the maximum reaches 9,821, an extreme outlier relative to the IQR of 19. Nearly half the rows (48.6%) are zero, and 6.9% are flagged as outliers, signaling a large mass at zero plus a heavy right tail that will distort any linear model trained on raw values.

Treatment: Apply log1p-transform (or a zero-inflated model) to compress the extreme right tail before modelling.

anthropic:default · confidence high
Out[92]:

saturn.columns["column26"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 448
min 0
max 9,821
mean 18.09
median 2
std 141.5
q1 0
q3 19
iqr 19
skew 32.63
kurtosis 1192
n_outliers 8,433
outlier_rate 0.06878
zero_rate 0.4859
alert: high_skew skew=+32.63
alert: outliers 6.9% rows beyond 1.5 IQR
Fig 35.
Distribution of column26. Vertical dash marks the median.
Histogram bins for column26 (median: 2.0).
bin count
0 – 245.5 122280
245.5 – 491.1 109
491.1 – 736.6 44
736.6 – 982.1 16
982.1 – 1228 18
1228 – 1473 14
1473 – 1719 12
1719 – 1964 4
1964 – 2210 11
2210 – 2455 5
2455 – 2701 2
2701 – 2946 2
2946 – 3192 7
3192 – 3437 4
3437 – 3683 1
3683 – 3928 0
3928 – 4174 3
4174 – 4419 2
4419 – 4665 2
4665 – 4910 5
4910 – 5156 68
5156 – 5402 1
5402 – 5647 0
5647 – 5893 0
5893 – 6138 0
6138 – 6384 0
6384 – 6629 0
6629 – 6875 0
6875 – 7120 0
7120 – 7366 0
7366 – 7611 0
7611 – 7857 0
7857 – 8102 0
8102 – 8348 0
8348 – 8593 0
8593 – 8839 0
8839 – 9084 0
9084 – 9330 0
9330 – 9575 0
9575 – 9821 1

column27 numeric feature

This column is a sparse, heavily right-skewed numeric count or amount field that is zero for the vast majority of records. 82.9% of the 122,611 rows are exactly zero, the median is 0.0, and the IQR is 0.0, yet the mean is 961.8 and the maximum reaches 4,830,455, so a small number of extreme values drives nearly all the variance. The skew of 113.9 and kurtosis of 20,874.5 are extraordinary. Because the IQR is 0, the 1.5 IQR rule flags every non-zero row, so the 17.1% outlier rate is just the complement of the zero rate and does not by itself show that the non-zero values are extreme.

Treatment: Apply log1p-transform (or treat as two-part: zero/non-zero indicator + log-transformed non-zero value) before modelling to handle extreme skew and outliers.
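
The two-part encoding mentioned in this treatment can be sketched as a pair of features per row: a zero/non-zero indicator and a log-transformed magnitude defined only for non-zero rows. The function name is hypothetical, and using None for the undefined magnitude is an assumed convention.

```python
import math


def two_part_encode(values):
    """Return (is_nonzero, log_value) per row; log_value is None when the value is 0."""
    encoded = []
    for value in values:
        if value < 0:
            raise ValueError("expected non-negative values")
        if value == 0:
            encoded.append((0, None))  # magnitude is undefined for zero rows
        else:
            encoded.append((1, math.log(value)))
    return encoded


# Invented sample, for illustration only.
pairs = two_part_encode([0, 1, 0, 4830455])
```

A classifier can then be fitted on the indicator for all rows and a regression on the log magnitude for the non-zero rows only, which avoids forcing one model to explain both the zero mass and the tail.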

anthropic:default · confidence high
Out[95]:

saturn.columns["column27"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 5,332
min 0
max 4.83e+06
mean 961.8
median 0
std 2.188e+04
q1 0
q3 0
iqr 0
skew 113.9
kurtosis 2.087e+04
n_outliers 20,906
outlier_rate 0.1705
zero_rate 0.8295
alert: high_skew skew=+113.91
alert: outliers 17.1% rows beyond 1.5 IQR
Fig 36.
Distribution of column27. Vertical dash marks the median.
Histogram bins for column27 (median: 0.0).
bin count
0 – 1.208e+05 122458
1.208e+05 – 2.415e+05 85
2.415e+05 – 3.623e+05 30
3.623e+05 – 4.83e+05 11
4.83e+05 – 6.038e+05 2
6.038e+05 – 7.246e+05 4
7.246e+05 – 8.453e+05 8
8.453e+05 – 9.661e+05 2
9.661e+05 – 1.087e+06 2
1.087e+06 – 1.208e+06 1
1.208e+06 – 1.328e+06 5
1.328e+06 – 1.449e+06 0
1.449e+06 – 1.57e+06 0
1.57e+06 – 1.691e+06 0
1.691e+06 – 1.811e+06 1
1.811e+06 – 1.932e+06 1
1.932e+06 – 2.053e+06 0
2.053e+06 – 2.174e+06 0
2.174e+06 – 2.294e+06 0
2.294e+06 – 2.415e+06 0
2.415e+06 – 2.536e+06 0
2.536e+06 – 2.657e+06 0
2.657e+06 – 2.778e+06 0
2.778e+06 – 2.898e+06 0
2.898e+06 – 3.019e+06 0
3.019e+06 – 3.14e+06 0
3.14e+06 – 3.261e+06 0
3.261e+06 – 3.381e+06 0
3.381e+06 – 3.502e+06 0
3.502e+06 – 3.623e+06 0
3.623e+06 – 3.744e+06 0
3.744e+06 – 3.864e+06 0
3.864e+06 – 3.985e+06 0
3.985e+06 – 4.106e+06 0
4.106e+06 – 4.227e+06 0
4.227e+06 – 4.347e+06 0
4.347e+06 – 4.468e+06 0
4.468e+06 – 4.589e+06 0
4.589e+06 – 4.71e+06 0
4.71e+06 – 4.83e+06 1

column28 text free_text

This column contains free-text content warnings or age-rating disclosures for video games, with recurring phrases about mature content, nudity, sexual content, and violence. It is massively sparse — 81.68% of rows are null — meaning most games carry no such warning. The duplicate rate of 17.09% (3,839 duplicates across 18,620 unique values) reflects the use of templated boilerplate warning strings, while a small multilingual signal (2 Chinese, 1 Japanese entries) indicates some non-English publisher submissions. Flesch readability of 44.38 and a median length of 124 characters are consistent with dense legal/disclaimer prose.

Treatment: Encode as binary 'has_warning' flag and/or extract categorical warning types (violence, nudity, sexual content) via keyword/regex before modelling; drop raw text.
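
A rough sketch of the flag-and-keyword encoding. The keyword patterns below are hypothetical and English-only, so the multilingual rows noted in the alert would be missed; a real keyword list would need to be derived from the data.

```python
import re

# Hypothetical keyword patterns; not derived from the dataset.
WARNING_PATTERNS = {
    "violence": re.compile(r"\b(violen\w*|gore|blood)\b", re.IGNORECASE),
    "nudity": re.compile(r"\b(nud\w*|naked)\b", re.IGNORECASE),
    "sexual_content": re.compile(r"\bsex\w*\b", re.IGNORECASE),
}


def warning_features(text):
    """Return a has_warning flag plus one 0/1 flag per warning type."""
    has_text = bool(text and text.strip())
    features = {"has_warning": int(has_text)}
    for name, pattern in WARNING_PATTERNS.items():
        features[name] = int(has_text and bool(pattern.search(text)))
    return features


# Invented warning text, for illustration only.
flags = warning_features("Contains violence and partial nudity.")
```

Keeping has_warning separate from the type flags preserves the signal for rows whose warning text matches none of the keywords.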

anthropic:default · confidence high
Out[98]:

saturn.columns["column28"].stats

stat value
n 122,611
nulls 100,152 (81.7%)
unique 18,620
len_min 2
len_max 2,020
len_mean 164.1
len_median 124
len_p95 445
word_mean 25.74
word_median 20
n_empty 0
n_duplicates 3,839
duplicate_rate 0.1709
vocab_size 23,061
readability_flesch_mean 44.38
emoji_rate 0.0007124
url_rate 8.905e-05
one_word_rate 0.009039
allcaps_rate 0.008193
boilerplate_rate 0.009484
alert: multilingual 4 languages detected in sample
alert: null_rate 81.7% null
Fig 37.
Character-length distribution for column28.
Character-length distribution for column28 (mean: 164.09902488979918).
chars count
2 – 52 4251
52 – 103 5273
103 – 153 3975
153 – 204 3096
204 – 254 1915
254 – 305 1167
305 – 355 787
355 – 406 579
406 – 456 389
456 – 506 246
506 – 557 175
557 – 607 143
607 – 658 91
658 – 708 64
708 – 759 60
759 – 809 49
809 – 860 34
860 – 910 25
910 – 961 32
961 – 1011 23
1011 – 1061 18
1061 – 1112 13
1112 – 1162 10
1162 – 1213 6
1213 – 1263 4
1263 – 1314 8
1314 – 1364 5
1364 – 1415 3
1415 – 1465 2
1465 – 1516 4
1516 – 1566 2
1566 – 1616 3
1616 – 1667 0
1667 – 1717 2
1717 – 1768 2
1768 – 1818 0
1818 – 1869 2
1869 – 1919 0
1919 – 1970 0
1970 – 2020 1

column29 numeric feature

This column is a heavily zero-inflated count or amount field: 78.7% of its 122,611 rows are exactly zero, and the interquartile range is 0.0, meaning the entire middle 50% of the distribution is zero. Despite a median of 0 and mean of only 208, the max reaches 3,429,544, producing extreme skew (262.89) and kurtosis (75,698). The 21.3% of rows flagged as outliers are exactly the non-zero rows, since any non-zero value lies beyond 1.5 IQR when the IQR is 0. This pattern is consistent with a sparse engagement metric where most titles register nothing but a small tail reaches enormous values.

Treatment: Apply log1p-transform or treat as two-part model (zero vs. non-zero) before regression or ML use.
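
The outlier alert describes rows "beyond 1.5 IQR". Assuming that means the usual Tukey fences (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), the sketch below shows why a zero IQR makes the rule flag every non-zero row. The function name is hypothetical.

```python
def iqr_outlier_mask(values, q1, q3, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (Tukey fences)."""
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [value < lower or value > upper for value in values]


# With q1 = q3 = 0 both fences collapse to 0, so any non-zero value is flagged.
mask = iqr_outlier_mask([0, 0, 0, 1, 250], q1=0, q3=0)
```

For zero-inflated columns like this one the outlier rate therefore equals one minus the zero rate, and a percentile-based cap is a more informative way to identify extreme values.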

anthropic:default · confidence high
Out[101]:

saturn.columns["column29"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 3,037
min 0
max 3.43e+06
mean 208
median 0
std 1.122e+04
q1 0
q3 0
iqr 0
skew 262.9
kurtosis 7.57e+04
n_outliers 26,119
outlier_rate 0.213
zero_rate 0.787
alert: high_skew skew=+262.89
alert: outliers 21.3% rows beyond 1.5 IQR
Fig 38.
Distribution of column29. Vertical dash marks the median.
Histogram bins for column29 (median: 0.0).
bin count
0 – 8.574e+04 122594
8.574e+04 – 1.715e+05 10
1.715e+05 – 2.572e+05 3
2.572e+05 – 3.43e+05 1
3.43e+05 – 4.287e+05 1
4.287e+05 – 5.144e+05 0
5.144e+05 – 6.002e+05 0
6.002e+05 – 6.859e+05 0
6.859e+05 – 7.716e+05 0
7.716e+05 – 8.574e+05 0
8.574e+05 – 9.431e+05 0
9.431e+05 – 1.029e+06 0
1.029e+06 – 1.115e+06 0
1.115e+06 – 1.2e+06 0
1.2e+06 – 1.286e+06 0
1.286e+06 – 1.372e+06 0
1.372e+06 – 1.458e+06 0
1.458e+06 – 1.543e+06 0
1.543e+06 – 1.629e+06 0
1.629e+06 – 1.715e+06 1
1.715e+06 – 1.801e+06 0
1.801e+06 – 1.886e+06 0
1.886e+06 – 1.972e+06 0
1.972e+06 – 2.058e+06 0
2.058e+06 – 2.143e+06 0
2.143e+06 – 2.229e+06 0
2.229e+06 – 2.315e+06 0
2.315e+06 – 2.401e+06 0
2.401e+06 – 2.486e+06 0
2.486e+06 – 2.572e+06 0
2.572e+06 – 2.658e+06 0
2.658e+06 – 2.744e+06 0
2.744e+06 – 2.829e+06 0
2.829e+06 – 2.915e+06 0
2.915e+06 – 3.001e+06 0
3.001e+06 – 3.087e+06 0
3.087e+06 – 3.172e+06 0
3.172e+06 – 3.258e+06 0
3.258e+06 – 3.344e+06 0
3.344e+06 – 3.43e+06 1

column30 numeric feature

This column is a heavily zero-inflated count or amount field: 96.8% of its 122,611 rows are exactly zero, driving a median of 0.0 and an IQR of 0.0. The remaining values are extremely skewed (skew = 51.68, kurtosis = 3252.96), with a mean of 13.79 pulled far right by a maximum of 20,088. The 3,898 outliers (3.2% of rows) are exactly the non-zero rows, because an IQR of 0 puts every non-zero value beyond the 1.5 IQR fence, so they carry all of the column's variance.

Treatment: Apply zero-inflated modelling or split into a binary indicator plus a log-transformed positive-value sub-model before regression.

anthropic:default · confidence high
Out[104]:

saturn.columns["column30"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 993
min 0
max 20,088
mean 13.79
median 0
std 270.4
q1 0
q3 0
iqr 0
skew 51.68
kurtosis 3253
n_outliers 3,898
outlier_rate 0.03179
zero_rate 0.9682
alert: high_skew skew=+51.68
Fig 39.
Distribution of column30. Vertical dash marks the median.
Histogram bins for column30 (median: 0.0).
bin count
0 – 502.2 121943
502.2 – 1004 377
1004 – 1507 112
1507 – 2009 57
2009 – 2511 35
2511 – 3013 14
3013 – 3515 7
3515 – 4018 7
4018 – 4520 9
4520 – 5022 2
5022 – 5524 2
5524 – 6026 3
6026 – 6529 4
6529 – 7031 5
7031 – 7533 2
7533 – 8035 2
8035 – 8537 4
8537 – 9040 1
9040 – 9542 0
9542 – 1.004e+04 2
1.004e+04 – 1.055e+04 2
1.055e+04 – 1.105e+04 1
1.105e+04 – 1.155e+04 0
1.155e+04 – 1.205e+04 1
1.205e+04 – 1.256e+04 0
1.256e+04 – 1.306e+04 1
1.306e+04 – 1.356e+04 3
1.356e+04 – 1.406e+04 0
1.406e+04 – 1.456e+04 0
1.456e+04 – 1.507e+04 0
1.507e+04 – 1.557e+04 0
1.557e+04 – 1.607e+04 0
1.607e+04 – 1.657e+04 3
1.657e+04 – 1.707e+04 3
1.707e+04 – 1.758e+04 0
1.758e+04 – 1.808e+04 0
1.808e+04 – 1.858e+04 0
1.858e+04 – 1.908e+04 0
1.908e+04 – 1.959e+04 0
1.959e+04 – 2.009e+04 9

column31 numeric feature

This column is a sparse count or activity metric where the overwhelming majority of records (78.7%) are zero, producing a median of 0.0 and an IQR of exactly 0.0. The distribution is extraordinarily right-skewed (skew = 263.99, kurtosis = 76112.44), driven by extreme values reaching a max of 3,429,544 against a mean of only 173.57, so a tiny fraction of records carry massive values. Roughly 21.3% of rows (26,119) are flagged as outliers; with an IQR of 0 that is every non-zero row, so the rate reflects zero inflation and is not by itself a sign of data errors. The zero rate, outlier count, and maximum all match column29, which suggests the two columns are closely related and should be checked for redundancy.

Treatment: Apply log1p-transform (or a zero-inflated model) before regression; consider capping at a high percentile to manage extreme outliers.

anthropic:default · confidence high
Out[107]:

saturn.columns["column31"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 2,511
min 0
max 3.43e+06
mean 173.6
median 0
std 1.12e+04
q1 0
q3 0
iqr 0
skew 264
kurtosis 7.611e+04
n_outliers 26,119
outlier_rate 0.213
zero_rate 0.787
alert: high_skew skew=+263.99
alert: outliers 21.3% rows beyond 1.5 IQR
Fig 40.
Distribution of column31. Vertical dash marks the median.
Histogram bins for column31 (median: 0.0).
bin count
0 – 8.574e+04 122592
8.574e+04 – 1.715e+05 12
1.715e+05 – 2.572e+05 3
2.572e+05 – 3.43e+05 1
3.43e+05 – 4.287e+05 1
4.287e+05 – 5.144e+05 0
5.144e+05 – 6.002e+05 0
6.002e+05 – 6.859e+05 0
6.859e+05 – 7.716e+05 0
7.716e+05 – 8.574e+05 0
8.574e+05 – 9.431e+05 0
9.431e+05 – 1.029e+06 0
1.029e+06 – 1.115e+06 0
1.115e+06 – 1.2e+06 0
1.2e+06 – 1.286e+06 0
1.286e+06 – 1.372e+06 0
1.372e+06 – 1.458e+06 0
1.458e+06 – 1.543e+06 0
1.543e+06 – 1.629e+06 0
1.629e+06 – 1.715e+06 1
1.715e+06 – 1.801e+06 0
1.801e+06 – 1.886e+06 0
1.886e+06 – 1.972e+06 0
1.972e+06 – 2.058e+06 0
2.058e+06 – 2.143e+06 0
2.143e+06 – 2.229e+06 0
2.229e+06 – 2.315e+06 0
2.315e+06 – 2.401e+06 0
2.401e+06 – 2.486e+06 0
2.486e+06 – 2.572e+06 0
2.572e+06 – 2.658e+06 0
2.658e+06 – 2.744e+06 0
2.744e+06 – 2.829e+06 0
2.829e+06 – 2.915e+06 0
2.915e+06 – 3.001e+06 0
3.001e+06 – 3.087e+06 0
3.087e+06 – 3.172e+06 0
3.172e+06 – 3.258e+06 0
3.258e+06 – 3.344e+06 0
3.344e+06 – 3.43e+06 1

column32 numeric feature

This column is a sparse count or occurrence field, such as an event frequency or other rare-event tally. The zero_rate of 96.8% means the vast majority of rows record no occurrence, while the remaining ~3.2% drive an extreme right tail (skew = 48.9, kurtosis = 2848.5) reaching a maximum of 20,088 against a median of 0 and a mean of 14.7. The IQR of 0.0 confirms that the middle 50% of the distribution is flat at zero, so the 3,898 flagged outliers (3.2%) are simply the non-zero rows, and they carry virtually all the variance.

Treatment: Apply log1p transform or treat as binary (zero vs. non-zero) flag before modelling; consider capping at a high percentile to suppress the extreme tail.
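
The link between a zero IQR and the outlier count can be shown with a short sketch. It assumes the profiler uses Tukey fences at 1.5 IQR, as the outlier alert wording elsewhere in this report suggests; the function names and the toy data are hypothetical.

```python
import statistics

def tukey_outlier_count(values, k=1.5):
    """Count values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)

def nonzero_flag(values):
    """Binary indicator: 1 where the count is non-zero, else 0."""
    return [1 if v != 0 else 0 for v in values]

# Toy data with the same shape as column32: about 97% zeros.
sample = [0] * 97 + [5, 40, 20_088]
flags = nonzero_flag(sample)
n_outliers = tukey_outlier_count(sample)
```

When q1 and q3 are both 0 the fences collapse onto 0, so the outlier count equals the number of non-zero rows. The binary flag therefore carries the same information as the outlier alert and is often the simpler feature to model.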

anthropic:default · confidence high
Out[110]:

saturn.columns["column32"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique 993
min 0
max 20,088
mean 14.72
median 0
std 294.5
q1 0
q3 0
iqr 0
skew 48.91
kurtosis 2848
n_outliers 3,898
outlier_rate 0.03179
zero_rate 0.9682
alert: high_skew skew=+48.91
Fig 41.
Distribution of column32. Vertical dash marks the median.
Histogram bins for column32 (median: 0.0).
bin  count
0 – 502.2  121952
502.2 – 1004  342
1004 – 1507  114
1507 – 2009  66
2009 – 2511  35
2511 – 3013  19
3013 – 3515  5
3515 – 4018  8
4018 – 4520  12
4520 – 5022  6
5022 – 5524  2
5524 – 6026  4
6026 – 6529  5
6529 – 7031  5
7031 – 7533  1
7533 – 8035  1
8035 – 8537  3
8537 – 9040  1
9040 – 9542  0
9542 – 1.004e+04  0
1.004e+04 – 1.055e+04  2
1.055e+04 – 1.105e+04  1
1.105e+04 – 1.155e+04  1
1.155e+04 – 1.205e+04  1
1.205e+04 – 1.256e+04  1
1.256e+04 – 1.306e+04  1
1.306e+04 – 1.356e+04  3
1.356e+04 – 1.406e+04  0
1.406e+04 – 1.456e+04  0
1.456e+04 – 1.507e+04  1
1.507e+04 – 1.557e+04  0
1.557e+04 – 1.607e+04  1
1.607e+04 – 1.657e+04  3
1.657e+04 – 1.707e+04  4
1.707e+04 – 1.758e+04  0
1.758e+04 – 1.808e+04  0
1.808e+04 – 1.858e+04  0
1.858e+04 – 1.908e+04  0
1.908e+04 – 1.959e+04  1
1.959e+04 – 2.009e+04  10

column33 text label

This column contains game developer or publisher names, as shown by top values such as 'Choice of Games' and 'KOEI TECMO GAMES CO., LTD.' and by a dominant vocabulary of 'games', 'studio', 'studios', 'interactive', and 'entertainment'. The duplicate rate of 37.98% (43,364 duplicates among the 114,180 non-null rows) is expected because companies release multiple titles, but the 70,816 unique values and a max length of 584 characters suggest occasional free-text entries or combined multi-company strings. The one-word rate of 31.8% and mean word count of ~2 are consistent with company name formats, though the wide length range (1–584 chars) warrants inspection for outliers.

Treatment: Normalize casing and strip punctuation variants before grouping; use as a categorical grouping key or encode as a feature via target/frequency encoding.
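
A minimal sketch of the normalisation and frequency encoding named in the treatment, using only the standard library. The helper names are hypothetical, and the second spelling in the sample is invented to show two variants collapsing into one key.

```python
import re
from collections import Counter

def normalize_name(name):
    """Case-fold, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", name.casefold())
    return " ".join(no_punct.split())

def frequency_encode(names):
    """Map each name to the share of non-null rows with the same normalized form.

    None stays None so that nulls can be handled separately.
    """
    keys = [normalize_name(n) if n is not None else None for n in names]
    counts = Counter(k for k in keys if k is not None)
    total = sum(counts.values())
    return [counts[k] / total if k is not None else None for k in keys]

names = [
    "KOEI TECMO GAMES CO., LTD.",
    "Koei Tecmo Games Co Ltd",   # invented spelling variant
    "Choice of Games",
    None,
]
encoded = frequency_encode(names)
```

Stripping punctuation can also merge names that are genuinely different, so it is worth reviewing the largest merged groups before relying on the encoded feature.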

anthropic:default · confidence high
Out[113]:

saturn.columns["column33"].stats

stat value
n 122,611
nulls 8,431 (6.9%)
unique 70,816
len_min 1
len_max 584
len_mean 14.37
len_median 13
len_p95 27
word_mean 2.019
word_median 2
n_empty 0
n_duplicates 43,364
duplicate_rate 0.3798
vocab_size 18,429
readability_flesch_mean 38.73
emoji_rate 0.0008933
url_rate 0.000219
one_word_rate 0.3181
allcaps_rate 0.07974
boilerplate_rate 0
alert: one_word 31.8% rows are a single word
alert: duplicates 38.0% duplicate strings
Fig 42.
Character-length distribution for column33.
Character-length distribution for column33 (mean: 14.365659485023647).
chars  count
1 – 16  77264
16 – 30  33047
30 – 45  2696
45 – 59  611
59 – 74  241
74 – 88  129
88 – 103  71
103 – 118  34
118 – 132  18
132 – 147  15
147 – 161  10
161 – 176  7
176 – 190  7
190 – 205  8
205 – 220  2
220 – 234  4
234 – 249  1
249 – 263  4
263 – 278  1
278 – 292  2
292 – 307  2
307 – 322  0
322 – 336  1
336 – 351  0
351 – 365  0
365 – 380  1
380 – 395  0
395 – 409  0
409 – 424  1
424 – 438  0
438 – 453  0
453 – 467  0
467 – 482  0
482 – 497  0
497 – 511  0
511 – 526  0
526 – 540  1
540 – 555  1
555 – 569  0
569 – 584  1

column34 text label

This column contains game publisher or developer company names, as shown by top values like 'BFG Entertainment', 'Choice of Games', and 'Strategy First', and by top words dominated by 'games', 'studio', 'studios', 'entertainment', and corporate suffixes ('llc', 'inc.', 'ltd.'). The duplicate rate is high at 44.9% (51,089 duplicates among the 113,778 non-null rows), which is expected since many games share the same company. The one-word rate of 31.8% reflects single-token studio names, and the 7.2% null rate (8,833 rows) warrants attention for records with no company recorded.

Treatment: Encode as a categorical feature (e.g. frequency or target encoding); investigate nulls at 7.2% before modelling.
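
The duplicate rates reported for the text columns appear to be computed over non-null rows rather than all 122,611 rows. That definition is inferred from the published numbers, not taken from saturn documentation; the sketch below recomputes it from the nulls and unique counts in the stats tables.

```python
TOTAL_ROWS = 122_611

# column: (nulls, unique, reported duplicate_rate), copied from the stats tables.
REPORTED = {
    "column33": (8_431, 70_816, 0.3798),
    "column34": (8_833, 62_689, 0.449),
    "column35": (8_953, 13_291, 0.8831),
    "column36": (8_413, 2_894, 0.9747),
    "column37": (39_265, 77_179, 0.07399),
    "column38": (6_018, 116_483, 0.0009435),
}

def duplicate_rate(total, nulls, unique):
    """Share of non-null rows that repeat an earlier value (inferred definition)."""
    non_null = total - nulls
    return (non_null - unique) / non_null

computed = {
    name: duplicate_rate(TOTAL_ROWS, nulls, unique)
    for name, (nulls, unique, _) in REPORTED.items()
}
for name, rate in computed.items():
    print(f"{name}: computed {rate:.4f}, reported {REPORTED[name][2]}")
```

For column34 this gives (113,778 − 62,689) / 113,778, which matches the reported 0.449, whereas dividing by all 122,611 rows would not.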

anthropic:default · confidence high
Out[116]:

saturn.columns["column34"].stats

stat value
n 122,611
nulls 8,833 (7.2%)
unique 62,689
len_min 1
len_max 164
len_mean 13.82
len_median 13
len_p95 26
word_mean 1.988
word_median 2
n_empty 0
n_duplicates 51,089
duplicate_rate 0.449
vocab_size 15,765
readability_flesch_mean 40.22
emoji_rate 0.0009141
url_rate 0.0002285
one_word_rate 0.3178
allcaps_rate 0.0817
boilerplate_rate 0
alert: one_word 31.8% rows are a single word
alert: duplicates 44.9% duplicate strings
Fig 43.
Character-length distribution for column34.
Character-length distribution for column34 (mean: 13.824825537450122).
chars  count
1 – 5  6733
5 – 9  22067
9 – 13  32481
13 – 17  28105
17 – 21  13359
21 – 25  5059
25 – 30  2901
30 – 34  1309
34 – 38  764
38 – 42  388
42 – 46  199
46 – 50  132
50 – 54  58
54 – 58  90
58 – 62  46
62 – 66  35
66 – 70  11
70 – 74  15
74 – 78  7
78 – 82  5
82 – 87  2
87 – 91  1
91 – 95  1
95 – 99  2
99 – 103  3
103 – 107  1
107 – 111  0
111 – 115  0
115 – 119  0
119 – 123  0
123 – 127  1
127 – 131  1
131 – 135  0
135 – 140  0
140 – 144  0
144 – 148  0
148 – 152  0
152 – 156  0
156 – 160  0
160 – 164  2

column35 text feature

This column contains a comma-delimited list of Steam game features/categories (e.g., 'Single-player', 'Steam Achievements', 'Family Sharing', 'Full controller support'), typical of the Steam store's supported features field per game. The extreme duplicate rate (88.3%, 100,367 of the 113,658 non-null rows) is expected because many games share identical feature sets, and the tiny vocabulary size of 589 words confirms a finite, enumerated tag system. The 'da' language detection on 12 rows is almost certainly a false positive from short comma-separated tokens, not actual Danish text. With only 13,291 unique combinations among 113,658 non-null rows, this column is highly suitable for multi-label binarization.

Treatment: Split on commas and one-hot encode each feature tag for modelling.
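
A minimal sketch of that split-and-encode step in plain Python. The function name `multi_hot` and the choice to encode a null cell as an all-zero row are assumptions for illustration; the sample rows reuse feature names quoted above.

```python
def multi_hot(rows, sep=","):
    """Turn delimited tag strings into a sorted vocabulary and 0/1 rows.

    None (a null cell) becomes an all-zero row.
    """
    tag_sets = [
        {t.strip() for t in row.split(sep) if t.strip()} if row is not None else set()
        for row in rows
    ]
    vocab = sorted(set().union(*tag_sets))
    matrix = [[1 if tag in tags else 0 for tag in vocab] for tags in tag_sets]
    return vocab, matrix

rows = [
    "Single-player,Steam Achievements",
    "Single-player,Family Sharing,Full controller support",
    None,
]
vocab, matrix = multi_hot(rows)
```

Encoding nulls as all-zero rows makes them indistinguishable from a game with no listed features, so a separate missing-value indicator may be worth adding for the 7.3% null rows.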

anthropic:default · confidence high
Out[119]:

saturn.columns["column35"].stats

stat value
n 122,611
nulls 8,953 (7.3%)
unique 13,291
len_min 3
len_max 534
len_mean 71.58
len_median 59
len_p95 178
word_mean 5.089
word_median 4
n_empty 0
n_duplicates 100,367
duplicate_rate 0.8831
vocab_size 589
readability_flesch_mean -105.9
emoji_rate 0
url_rate 0
one_word_rate 0.04047
allcaps_rate 8.798e-06
boilerplate_rate 0
alert: duplicates 88.3% duplicate strings
Fig 44.
Character-length distribution for column35.
Character-length distribution for column35 (mean: 71.58431434654841).
chars  count
3 – 16  4980
16 – 30  24221
30 – 43  6419
43 – 56  20213
56 – 69  12806
69 – 83  10422
83 – 96  9469
96 – 109  5839
109 – 122  3729
122 – 136  2863
136 – 149  2700
149 – 162  2033
162 – 176  1967
176 – 189  1499
189 – 202  1190
202 – 215  830
215 – 229  625
229 – 242  475
242 – 255  363
255 – 268  282
268 – 282  175
282 – 295  137
295 – 308  105
308 – 322  70
322 – 335  66
335 – 348  54
348 – 361  38
361 – 375  19
375 – 388  21
388 – 401  18
401 – 415  6
415 – 428  7
428 – 441  3
441 – 454  4
454 – 468  1
468 – 481  6
481 – 494  1
494 – 507  0
507 – 521  0
521 – 534  2

column36 text label

This column contains comma-separated game genre tags (e.g., 'Casual,Indie', 'Action,Adventure,Indie'), consistent with a Steam or similar game catalog dataset. The duplicate rate is extremely high at 97.5% (111,304 of the 114,198 non-null rows), reflecting the natural cardinality collapse when games share genre combinations: only 2,894 unique tag-sets exist. The 78.9% one-word rate most likely arises because tags are joined by commas without spaces, so a whole tag-set counts as one whitespace-delimited word. The top words 'to', 'access', and 'play' most likely come from the multi-word Steam genre labels 'Free to Play' and 'Early Access' rather than from free-text pollution, which is worth confirming against the distinct tag list.

Treatment: Split on comma to multi-hot encode genre tags before modelling; confirm that tags containing 'to', 'access', or 'play' are multi-word genre labels such as 'Free to Play' and 'Early Access' before treating any row as polluted.
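
One way to check whether words such as 'to', 'access', and 'play' come from multi-word genre labels rather than stray text is to count whole tags after splitting on commas. The sketch below is illustrative: the function name and the sample rows are hypothetical, with the first two rows taken from the examples quoted above.

```python
from collections import Counter

def tag_counts(rows, sep=","):
    """Count individual tags across delimited tag strings, skipping nulls."""
    counts = Counter()
    for row in rows:
        if row is None:
            continue
        counts.update(t.strip() for t in row.split(sep) if t.strip())
    return counts

rows = [
    "Casual,Indie",
    "Action,Adventure,Indie",
    "Free to Play,Indie",      # invented row with a multi-word label
    "Action,Early Access",     # invented row with a multi-word label
    None,
]
counts = tag_counts(rows)
multi_word = sorted(tag for tag in counts if " " in tag)
```

If the multi-word tags found this way form a short, recurring list, they are labels to keep; long or one-off strings would be the candidates for cleansing.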

anthropic:default · confidence high
Out[122]:

saturn.columns["column36"].stats

stat value
n 122,611
nulls 8,413 (6.9%)
unique 2,894
len_min 3
len_max 236
len_mean 22.21
len_median 21
len_p95 45
word_mean 1.364
word_median 1
n_empty 0
n_duplicates 111,304
duplicate_rate 0.9747
vocab_size 940
readability_flesch_mean -206.1
emoji_rate 0
url_rate 0
one_word_rate 0.7892
allcaps_rate 0.009781
boilerplate_rate 0
alert: one_word 78.9% rows are a single word
alert: duplicates 97.5% duplicate strings
Fig 45.
Character-length distribution for column36.
Character-length distribution for column36 (mean: 22.205064887301003).
chars  count
3 – 9  12259
9 – 15  21084
15 – 20  22318
20 – 26  25837
26 – 32  12284
32 – 38  8026
38 – 44  5596
44 – 50  2995
50 – 55  1587
55 – 61  848
61 – 67  593
67 – 73  229
73 – 79  196
79 – 85  137
85 – 90  71
90 – 96  35
96 – 102  37
102 – 108  13
108 – 114  15
114 – 120  10
120 – 125  6
125 – 131  6
131 – 137  4
137 – 143  2
143 – 149  2
149 – 154  1
154 – 160  0
160 – 166  1
166 – 172  4
172 – 178  0
178 – 184  0
184 – 189  0
189 – 195  0
195 – 201  0
201 – 207  0
207 – 213  1
213 – 219  0
219 – 224  0
224 – 230  0
230 – 236  1

column37 text label

This column contains comma-separated genre/tag lists for software or game products (e.g., 'Adventure,Casual,Hidden Object', 'Action,Indie'), consistent with a Steam-style app catalog. The null rate of 32.02% (39,265 rows) is high and warrants investigation before modelling. A multilingual alert is raised, but the non-English content is negligible (26 records out of 3,376 detected), suggesting near-uniform English data with minor noise. The duplicate rate is low at 7.4% (6,167 duplicates among the 83,346 non-null rows): with 77,179 unique values, most games carry a distinct tag list.

Treatment: Split on commas to multi-hot encode genre tags; investigate and decide on imputation strategy for the 32.02% null rows before modelling.

anthropic:default · confidence high
Out[125]:

saturn.columns["column37"].stats

stat value
n 122,611
nulls 39,265 (32.0%)
unique 77,179
len_min 3
len_max 295
len_mean 141.3
len_median 163
len_p95 228
word_mean 4.923
word_median 5
n_empty 0
n_duplicates 6,167
duplicate_rate 0.07399
vocab_size 57,260
readability_flesch_mean -449.7
emoji_rate 0
url_rate 0
one_word_rate 0.1233
allcaps_rate 4.799e-05
boilerplate_rate 0
alert: multilingual 9 languages detected in sample
alert: null_rate 32.0% null
Fig 46.
Character-length distribution for column37.
Character-length distribution for column37 (mean: 141.31500011998176).
chars  count
3 – 10  584
10 – 18  1637
18 – 25  2337
25 – 32  3056
32 – 40  2371
40 – 47  2362
47 – 54  2419
54 – 61  1862
61 – 69  1883
69 – 76  1806
76 – 83  1905
83 – 91  1584
91 – 98  1595
98 – 105  1879
105 – 112  1648
112 – 120  1657
120 – 127  1916
127 – 134  1722
134 – 142  1741
142 – 149  1845
149 – 156  2108
156 – 164  1958
164 – 171  2426
171 – 178  3392
178 – 186  3991
186 – 193  4902
193 – 200  6178
200 – 207  5471
207 – 215  4901
215 – 222  3666
222 – 229  2999
229 – 237  1612
237 – 244  918
244 – 251  599
251 – 258  239
258 – 266  120
266 – 273  37
273 – 280  13
280 – 288  5
288 – 295  2

column38 text metadata

This column contains comma-separated lists of Steam screenshot URLs (Akamai CDN), one packed string per row representing all screenshot images for a given Steam game entry. Every value counts as 'one word' because the URLs are concatenated without whitespace, which explains the one_word_rate of 1.0 alongside a mean length of ~1,319 characters and a max of 29,132. With 116,483 unique values among the 116,593 non-null rows and only 110 duplicates, the column is near-unique; the small duplicate count likely reflects games with identical screenshot sets.

Treatment: Split on commas to extract individual screenshot URLs per game; store as a list-type column or explode into a separate screenshots table keyed by game id.
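
The explode step can be sketched as follows. The function name, the `(game_id, packed_urls)` input shape, and the example.com URLs are hypothetical; the sketch also assumes no URL contains a comma, which should be verified on the real data first.

```python
def explode_screenshots(records, sep=","):
    """Return (game_id, position, url) rows from (game_id, packed_urls) pairs."""
    rows = []
    for game_id, packed in records:
        if not packed:  # skip null or empty cells
            continue
        urls = [u.strip() for u in packed.split(sep) if u.strip()]
        rows.extend((game_id, pos, url) for pos, url in enumerate(urls))
    return rows

records = [
    (10, "https://example.com/a.jpg,https://example.com/b.jpg"),
    (20, None),
    (30, "https://example.com/c.jpg"),
]
screenshots = explode_screenshots(records)
```

Keeping the position alongside the game id preserves the original screenshot order, which a plain set of URLs would lose.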

anthropic:default · confidence high
Out[128]:

saturn.columns["column38"].stats

stat value
n 122,611
nulls 6,018 (4.9%)
unique 116,483
len_min 144
len_max 29,132
len_mean 1319
len_median 1,039
len_p95 2,773
word_mean 1
word_median 1
n_empty 0
n_duplicates 110
duplicate_rate 0.0009435
vocab_size 19,994
readability_flesch_mean -5099
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.9% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
Fig 47.
Character-length distribution for column38.
Character-length distribution for column38 (mean: 1318.9448423147187).
chars  count
144 – 869  28597
869 – 1593  59377
1593 – 2318  18118
2318 – 3043  6216
3043 – 3768  2157
3768 – 4492  959
4492 – 5217  470
5217 – 5942  274
5942 – 6666  167
6666 – 7391  95
7391 – 8116  47
8116 – 8840  29
8840 – 9565  25
9565 – 10290  20
10290 – 11014  7
11014 – 11739  10
11739 – 12464  2
12464 – 13189  4
13189 – 13913  2
13913 – 14638  0
14638 – 15363  1
15363 – 16087  5
16087 – 16812  2
16812 – 17537  1
17537 – 18262  0
18262 – 18986  1
18986 – 19711  0
19711 – 20436  0
20436 – 21160  4
21160 – 21885  0
21885 – 22610  0
22610 – 23334  0
23334 – 24059  0
24059 – 24784  1
24784 – 25508  0
25508 – 26233  0
26233 – 26958  1
26958 – 27683  0
27683 – 28407  0
28407 – 29132  1

column39 unknown other

This column was skipped by the profiler, so its content and type are entirely unknown. With 122,611 rows, zero nulls, and no computed statistics or uniqueness information, no data-driven characterisation is possible. The 'skipped' alert is the only signal available.

Treatment: Manually inspect raw values to determine type and role before any further processing.
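
A first manual look could be as small as the sketch below, which samples one CSV field and tallies whether its values look numeric, textual, or empty. The helper names are hypothetical, and the sketch assumes the file has a header row; the in-memory demo data is invented.

```python
import csv
import io
from collections import Counter

def classify(value):
    """Rough shape of a raw CSV cell: empty, numeric, or text."""
    if value is None or value == "":
        return "empty"
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"

def peek_column(fileobj, index, limit=1000):
    """Read up to `limit` data rows and summarize one column by position."""
    reader = csv.reader(fileobj)
    next(reader, None)  # assumption: the first row is a header
    samples = []
    for row_number, row in enumerate(reader):
        if row_number >= limit:
            break
        samples.append(row[index] if index < len(row) else None)
    kinds = Counter(classify(v) for v in samples)
    return samples[:5], kinds

# Invented in-memory data standing in for the real file.
demo = io.StringIO("id,name,extra\n1,Alpha,0.5\n2,Beta,\n3,Gamma,n/a\n")
head, kinds = peek_column(demo, index=2)
```

For the real file, the same function could be called on a handle opened with `open(path, newline="")`, with `index=39` if column39 is the 40th field as the schema numbering suggests.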

anthropic:default · confidence low
Out[131]:

saturn.columns["column39"].stats

stat value
n 122,611
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

How to cite


BibTeX
@misc{saturn-data-trove-steam-games-catalog-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove steam games catalog},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-steam-games-catalog}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove steam games catalog. Source: /home/coolhand/html/datavis/data_trove/entertainment/gaming/enriched/games.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-steam-games-catalog