accessibility-ssa_sa_fywl · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv

Saturn profiled 1,093 rows across 30 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess

# Profile the CSV and write the findings JSON; check=True raises if saturn exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv",
    "--findings", "accessibility-ssa_sa_fywl.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)

Summary confidence: medium

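This appears to be the SSA-SA-FYWL dataset (Social Security Administration state/area fiscal-year workload data) with 1,093 rows and 30 columns, but the headers were not parsed correctly. The first column's name is a free-text note and most other columns carry placeholder names like `_duplicated_*`, which suggests the note occupied the header line and the real header row was ingested as data: several columns contain exactly one stray label ("File Name", "File Version", "Update Date", "Region Code", "Date Type"). That leaves 1,092 data rows. Several columns hold metadata constants (file name, update date 3/13/2023, date type 'FY'). The most informative real fields are the geographic and time dimensions: `_duplicated_2` holds 53 distinct values, consistent with 52 state codes appearing 21 times each (52 × 21 = 1,092) plus one stray value; `_duplicated_1` holds 10 region codes plus the stray "Region Code" label, led by ATL (168 rows); and `_duplicated_4` holds 22 distinct values, consistent with 21 fiscal years from 2001 onward at 52 rows each plus one stray value, in a balanced panel. Many numeric measures (e.g. `_duplicated_22`, `_duplicated_12`, `_duplicated_10`) were ingested as text/categorical strings of decimal numbers, so they should be retyped before analysis. Start by fixing headers and dtypes, then look at the region/state/year structure to confirm the panel layout.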

citing: _duplicated_1 · _duplicated_2 · _duplicated_4 · _duplicated_22 · _duplicated_12 · _duplicated_0 · _duplicated_3
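
One way to act on the "fix headers" advice is to skip the note line and let the next line supply the column names. The sketch below uses only the standard library and a made-up three-line excerpt; the layout (note on line 1, header on line 2) is inferred from the profile, and the "State Code" label is an assumption, since the report never shows that column's real header.

```python
import csv
import io

# Hypothetical excerpt mimicking the inferred layout:
# line 1 is a free-text note, line 2 is the real header, line 3+ is data.
# "State Code" is an assumed label; the other four appear as stray values in the profile.
SAMPLE = (
    '"Please note 2021 data ...",,,,\n'
    "File Name,File Version,Update Date,Region Code,State Code\n"
    "SSA-SA-FYWL.csv,2,3/13/2023,ATL,AL \n"
)

def read_skipping_note(handle, note_lines=1):
    """Skip leading note lines, then parse the rest using its own header row."""
    for _ in range(note_lines):
        next(handle)
    return list(csv.DictReader(handle))

rows = read_skipping_note(io.StringIO(SAMPLE))
print(rows[0]["Region Code"])  # ATL
```

With pandas, `pd.read_csv(path, header=1)` should have the same effect, assuming the note really is confined to the first line of the file.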

Out[4]:

saturn.schema() · 30 columns

column	kind	n	null%	unique	alerts
Please note 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released.	categorical	1,093	0.0%	2	imbalance
	categorical	1,093	0.0%	2	imbalance
_duplicated_0	categorical	1,093	0.0%	2	imbalance
_duplicated_1	categorical	1,093	0.0%	11
_duplicated_2	categorical	1,093	0.0%	53
_duplicated_3	categorical	1,093	0.0%	2	imbalance
_duplicated_4	categorical	1,093	0.0%	22
_duplicated_5	text	1,093	0.0%	1,037	one_word allcaps short_text
_duplicated_6	text	1,093	0.0%	1,090	near_unique one_word allcaps short_text
_duplicated_7	categorical	1,093	0.0%	511
_duplicated_8	text	1,093	0.0%	1,041	near_unique one_word allcaps short_text
_duplicated_9	text	1,093	0.0%	1,081	near_unique one_word allcaps short_text
_duplicated_10	categorical	1,093	0.0%	199
_duplicated_11	text	1,093	0.0%	1,062	near_unique one_word allcaps short_text
_duplicated_12	categorical	1,093	0.0%	69
_duplicated_13	text	1,093	0.0%	1,079	near_unique one_word allcaps short_text
_duplicated_14	categorical	1,093	0.0%	883	long_tail
_duplicated_15	text	1,093	0.0%	1,019	one_word allcaps short_text
_duplicated_16	text	1,093	0.0%	1,057	near_unique one_word allcaps short_text
_duplicated_17	categorical	1,093	0.0%	272
_duplicated_18	text	1,093	0.1%	1,021	one_word allcaps short_text
_duplicated_19	text	1,093	0.0%	1,018	one_word allcaps short_text
_duplicated_20	categorical	1,093	0.0%	156
_duplicated_21	categorical	1,093	0.0%	957	long_tail
_duplicated_22	categorical	1,093	0.0%	70
_duplicated_23	text	1,093	0.0%	1,028	one_word allcaps short_text
_duplicated_24	categorical	1,093	0.5%	900	long_tail
_duplicated_25	text	1,093	0.0%	1,088	near_unique one_word allcaps short_text
_duplicated_26	text	1,093	0.0%	1,069	near_unique one_word allcaps short_text
_duplicated_27	categorical	1,093	0.0%	873	long_tail

Fig 1.

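_duplicated_1 · Row counts by SSA region code. ATL leads at 168; every region's count is a multiple of 21 (168 = 8 × 21), consistent with regions that contain different numbers of state codes. The single "Region Code" row is a leaked header label.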

Top values for _duplicated_1 (11 unique shown, of 11 total).
value	count	share
ATL	168	15.4%
DEN	126	11.5%
BOS	126	11.5%
PHL	126	11.5%
CHI	126	11.5%
DAL	105	9.6%
SEA	84	7.7%
SFO	84	7.7%
KCM	84	7.7%
NYC	63	5.8%
Region Code	1	0.1%

Fig 2.

_duplicated_4 · Distribution across fiscal years 2001+ — note the flat 52-rows-per-year pattern indicating a balanced panel.

Show data table

Top values for _duplicated_4 (20 unique shown, of 22 total).
value	count	share
2001	52	4.8%
2002	52	4.8%
2003	52	4.8%
2004	52	4.8%
2005	52	4.8%
2006	52	4.8%
2007	52	4.8%
2008	52	4.8%
2009	52	4.8%
2010	52	4.8%
2011	52	4.8%
2012	52	4.8%
2013	52	4.8%
2014	52	4.8%
2015	52	4.8%
2016	52	4.8%
2017	52	4.8%
2018	52	4.8%
2019	52	4.8%
2020	52	4.8%

Fig 3.

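_duplicated_2 · State-code coverage. The counts are consistent with 52 of the 53 distinct values appearing 21 times each (52 × 21 = 1,092 rows), i.e. one row per state code per fiscal year, with the remaining value occurring once.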

Top values for _duplicated_2 (20 unique shown, of 53 total).
value	count	share
AK	21	1.9%
AL	21	1.9%
AR	21	1.9%
AZ	21	1.9%
CA	21	1.9%
CO	21	1.9%
CT	21	1.9%
DC	21	1.9%
DE	21	1.9%
FL	21	1.9%
GA	21	1.9%
HI	21	1.9%
IA	21	1.9%
ID	21	1.9%
IL	21	1.9%
IN	21	1.9%
KS	21	1.9%
KY	21	1.9%
LA	21	1.9%
MA	21	1.9%
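
The "one row per state per year" reading can be tested directly once the key columns are clean. This is a minimal sketch on synthetic keys (three made-up codes and four years, not the real data); `is_balanced` is a hypothetical helper, not part of saturn.

```python
from collections import Counter
from itertools import product

# Synthetic stand-in for the (state, year) key columns: 3 made-up codes x 4 years.
states = ["AA", "BB", "CC"]
years = [2001, 2002, 2003, 2004]
keys = list(product(states, years))

def is_balanced(keys):
    """True when every observed state/year pair occurs exactly once and none are missing."""
    counts = Counter(keys)
    seen_states = {s for s, _ in keys}
    seen_years = {y for _, y in keys}
    full_grid = len(seen_states) * len(seen_years)
    return len(counts) == full_grid and all(c == 1 for c in counts.values())

print(is_balanced(keys))       # True
print(is_balanced(keys[:-1]))  # False: one pair is missing

# Arithmetic implied by the profile: 52 codes x 21 years, plus 1 stray row.
assert 52 * 21 + 1 == 1093
```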

Fig 4.

_duplicated_22 · A numeric ratio stored as text (mode 0.18; the 20 most frequent values span 0.00 to 0.29). Convert to numeric to inspect its true distribution.

Top values for _duplicated_22 (20 unique shown, of 70 total).
value	count	share
0.18	77	7.0%
0.20	75	6.9%
0.21	64	5.9%
0.22	62	5.7%
0.25	61	5.6%
0.17	61	5.6%
0.23	49	4.5%
0.19	46	4.2%
0.24	45	4.1%
0.16	43	3.9%
0.15	41	3.8%
0.26	36	3.3%
0.14	27	2.5%
0.12	26	2.4%
0.27	26	2.4%
0.13	25	2.3%
0.11	25	2.3%
0.10	22	2.0%
0.00	21	1.9%
0.29	20	1.8%
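
Converting such a column is mostly a matter of parsing each token and setting aside whatever fails. A minimal standard-library sketch, using a few values from the table above plus an invented non-numeric token to stand in for a leaked header label:

```python
def to_float_or_none(token):
    """Parse a decimal string; return None for anything that is not a number."""
    try:
        return float(token.strip())
    except (ValueError, AttributeError):
        return None

# "Some Header" is a made-up stand-in for a leaked header label.
raw = ["0.18", "0.20", " 0.25 ", "Some Header", ""]
parsed = [to_float_or_none(t) for t in raw]
rejected = [t for t, v in zip(raw, parsed) if v is None]
print(parsed)    # [0.18, 0.2, 0.25, None, None]
print(rejected)  # ['Some Header', '']
```

Inspecting `rejected` before discarding it is the point: in this file it should contain only the stray header label. In pandas, `pd.to_numeric(series, errors="coerce")` performs the equivalent coercion.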

Fig 5.

_duplicated_12 · Another numeric column held as 69 distinct strings; the 20 most frequent fall between 0.30 and 0.54. Recast to float and replot as a true histogram.

Top values for _duplicated_12 (20 unique shown, of 69 total).
value	count	share
0.38	55	5.0%
0.34	45	4.1%
0.32	43	3.9%
0.40	41	3.8%
0.35	39	3.6%
0.44	36	3.3%
0.37	35	3.2%
0.31	35	3.2%
0.36	33	3.0%
0.33	33	3.0%
0.39	33	3.0%
0.46	32	2.9%
0.43	31	2.8%
0.48	30	2.7%
0.45	30	2.7%
0.42	29	2.7%
0.30	29	2.7%
0.41	27	2.5%
0.52	26	2.4%
0.54	25	2.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
Please note 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released.	categorical	0.0%
	categorical	0.0%
_duplicated_0	categorical	0.0%
_duplicated_1	categorical	0.0%
_duplicated_2	categorical	0.0%
_duplicated_3	categorical	0.0%
_duplicated_4	categorical	0.0%
_duplicated_5	text	0.0%
_duplicated_6	text	0.0%
_duplicated_7	categorical	0.0%
_duplicated_8	text	0.0%
_duplicated_9	text	0.0%
_duplicated_10	categorical	0.0%
_duplicated_11	text	0.0%
_duplicated_12	categorical	0.0%
_duplicated_13	text	0.0%
_duplicated_14	categorical	0.0%
_duplicated_15	text	0.0%
_duplicated_16	text	0.0%
_duplicated_17	categorical	0.0%
_duplicated_18	text	0.1%
_duplicated_19	text	0.0%
_duplicated_20	categorical	0.0%
_duplicated_21	categorical	0.0%
_duplicated_22	categorical	0.0%
_duplicated_23	text	0.0%
_duplicated_24	categorical	0.5%
_duplicated_25	text	0.0%
_duplicated_26	text	0.0%
_duplicated_27	categorical	0.0%

Please note 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. categorical metadata

This column is effectively a constant file-name tag ("SSA-SA-FYWL.csv" appears 1092 of 1093 times, top_rate 0.999) with a single stray "File Name" value that looks like a header row leaked into the data. The column header itself is a free-text note about 2021 data being backfilled with 2020 data, suggesting this is provenance metadata rather than a feature. Entropy is essentially zero (0.0106), so it carries no discriminative signal.

Treatment: Drop; near-constant provenance field with a leaked header row.

anthropic:claude-opus-4-7 · confidence high

Out[12]:

saturn.columns["**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. "].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	2
top_value	SSA-SA-FYWL.csv
top_rate	0.9991
cardinality	2
entropy	0.01055
entropy_ratio	0.01055
alert: imbalance	top value is 99.9% of rows

Fig 7.

Top values for **Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. .

Top values for **Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. (2 unique shown, of 2 total).
value	count	share
SSA-SA-FYWL.csv	1092	99.9%
File Name	1	0.1%

categorical metadata

Binary categorical column with 1093 rows and only 2 distinct values, but it is effectively a constant: "2" appears 1092 times (top_rate 0.999) while "File Version" appears once. The lone "File Version" string alongside numeric "2" suggests a stray header row leaked into the data. Entropy of 0.0106 confirms there is virtually no information here.

Treatment: Drop the column and investigate the stray "File Version" row as a parsing artifact.

anthropic:claude-opus-4-7 · confidence high

Out[15]:

saturn.columns[""].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	2
top_value	2
top_rate	0.9991
cardinality	2
entropy	0.01055
entropy_ratio	0.01055
alert: imbalance	top value is 99.9% of rows

Fig 8.

Top values for .

Top values for (2 unique shown, of 2 total).
value	count	share
2	1092	99.9%
File Version	1	0.1%

_duplicated_0 categorical metadata

A constant date column: 1092 of 1093 rows hold the single value '3/13/2023', and the lone other entry is the literal string 'Update Date', almost certainly a header row that leaked into the data. Entropy is effectively zero (0.0106) and the top rate is 0.999, so the column carries no discriminative signal. The `_duplicated_0` name is a placeholder assigned at ingest, not evidence of a duplicated column, and the 'Update Date' value confirms a parsing/ingest issue worth fixing upstream.

Treatment: Drop; constant column with a leaked header value.

anthropic:claude-opus-4-7 · confidence high

Out[18]:

saturn.columns["_duplicated_0"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	2
top_value	3/13/2023
top_rate	0.9991
cardinality	2
entropy	0.01055
entropy_ratio	0.01055
alert: imbalance	top value is 99.9% of rows

Fig 9.

Top values for _duplicated_0.

Top values for _duplicated_0 (2 unique shown, of 2 total).
value	count	share
3/13/2023	1092	99.9%
Update Date	1	0.1%
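
The same pattern (one stray label in an otherwise constant column) shows up in several columns, which makes the leaked header row easy to locate in an already-loaded table. The sketch below is one possible approach, assuming rows are plain lists of strings; `split_header_row` and its `min_hits` threshold are hypothetical, and the labels are the stray values reported in this profile.

```python
# Labels the profile found as one-off stray values in otherwise constant columns.
KNOWN_LABELS = {"File Name", "File Version", "Update Date", "Region Code", "Date Type"}

def split_header_row(rows, labels=KNOWN_LABELS, min_hits=3):
    """Return (header, data), where header is the first row containing at least
    `min_hits` known labels; return (None, rows) if no row qualifies."""
    for i, row in enumerate(rows):
        if len(labels.intersection(row)) >= min_hits:
            return row, rows[:i] + rows[i + 1:]
    return None, rows

# Toy rows; only the first four columns are shown.
rows = [
    ["File Name", "File Version", "Update Date", "Region Code"],
    ["SSA-SA-FYWL.csv", "2", "3/13/2023", "ATL"],
    ["SSA-SA-FYWL.csv", "2", "3/13/2023", "BOS"],
]
header, data = split_header_row(rows)
print(header)     # ['File Name', 'File Version', 'Update Date', 'Region Code']
print(len(data))  # 2
```

The recovered `header` can then replace the placeholder column names before any retyping.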

_duplicated_1 categorical feature

Three-letter SSA region codes (ATL, DEN, BOS, PHL, CHI, DAL, SEA, SFO, KCM, NYC) across 1093 rows with no nulls. Of the 11 unique values, 10 are region codes and the eleventh is a single 'Region Code' string, a header label that leaked into the data. Distribution is fairly even (entropy ratio 0.947, top value ATL only 15.4%), so this is a balanced categorical rather than a skewed label. The name `_duplicated_1` is an ingest placeholder and does not indicate a duplicate of another column.

Treatment: Remove the leaked header row, rename the column (its true header appears to be 'Region Code'), and keep it as a 10-level categorical; one-hot encode if modelling.

anthropic:claude-opus-4-7 · confidence high

Out[21]:

saturn.columns["_duplicated_1"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	11
top_value	ATL
top_rate	0.1537
cardinality	11
entropy	3.277
entropy_ratio	0.9473
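
The entropy figures in these stats tables are consistent with Shannon entropy in bits, and with `entropy_ratio` being that entropy divided by log2 of the cardinality. That definition is inferred from the numbers, not taken from saturn's documentation. A short check against the region counts reported for this column:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts from the _duplicated_1 table: ten regions plus the stray "Region Code" row.
region_counts = [168, 126, 126, 126, 126, 105, 84, 84, 84, 63, 1]
print(round(entropy_bits(region_counts), 3))   # 3.277
print(round(entropy_ratio(region_counts), 4))  # 0.9473
```

The same functions applied to `[1092, 1]` give roughly 0.0106, matching the near-constant metadata columns.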

Fig 10.

Top values for _duplicated_1.

Top values for _duplicated_1 (11 unique shown, of 11 total).
value	count	share
ATL	168	15.4%
DEN	126	11.5%
BOS	126	11.5%
PHL	126	11.5%
CHI	126	11.5%
DAL	105	9.6%
SEA	84	7.7%
SFO	84	7.7%
KCM	84	7.7%
NYC	63	5.8%
Region Code	1	0.1%

_duplicated_2 categorical feature

This column holds two-letter US state/territory abbreviations with a trailing space (e.g. 'AK ', 'AL ', 'AR '), with 53 distinct values across 1093 rows and no nulls. The distribution is almost perfectly uniform (entropy_ratio 0.996, top value appearing just 21 times, 1.92%). The counts are consistent with 52 codes appearing 21 times each (1,092 rows) plus one value that occurs once, probably the leaked header label seen in neighbouring columns. Fifty-two codes exceed the 50 states, consistent with DC plus one further code such as a territory, and the trailing whitespace in every value is a data-hygiene flag.

Treatment: Strip trailing whitespace, remove the stray header row, and treat as a categorical state code (one-hot or target-encode).

anthropic:claude-opus-4-7 · confidence high

Out[24]:

saturn.columns["_duplicated_2"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	53
top_value	AK
top_rate	0.01921
cardinality	53
entropy	5.706
entropy_ratio	0.9961
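
Trailing whitespace matters because 'AK ' and 'AK' count as different categories and will not match in a join. A tiny illustration with invented mixed-padding values (the real column is reported to have a trailing space on every value):

```python
from collections import Counter

# Invented examples mixing padded and unpadded codes.
codes = ["AK ", "AL ", "AK", " AL", "AR "]
before = Counter(codes)
after = Counter(code.strip() for code in codes)
print(len(before), len(after))  # 5 3
print(after["AK"])              # 2
```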

Fig 11.

Top values for _duplicated_2.

Top values for _duplicated_2 (20 unique shown, of 53 total).
value	count	share
AK	21	1.9%
AL	21	1.9%
AR	21	1.9%
AZ	21	1.9%
CA	21	1.9%
CO	21	1.9%
CT	21	1.9%
DC	21	1.9%
DE	21	1.9%
FL	21	1.9%
GA	21	1.9%
HI	21	1.9%
IA	21	1.9%
ID	21	1.9%
IL	21	1.9%
IN	21	1.9%
KS	21	1.9%
KY	21	1.9%
LA	21	1.9%
MA	21	1.9%

_duplicated_3 categorical other

A binary categorical column completely dominated by the value 'FY' (1092 of 1093 rows, top_rate 0.999), with a single stray 'Date Type' entry. Entropy is effectively zero (0.0106). The lone 'Date Type' value looks like a header row that leaked into the data, and the '_duplicated_3' name is an ingest placeholder rather than a real field name.

Treatment: Drop; constant column with a likely header-leak artifact.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["_duplicated_3"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	2
top_value	FY
top_rate	0.9991
cardinality	2
entropy	0.01055
entropy_ratio	0.01055
alert: imbalance	top value is 99.9% of rows

Fig 12.

Top values for _duplicated_3.

Top values for _duplicated_3 (2 unique shown, of 2 total).
value	count	share
FY	1092	99.9%
Date Type	1	0.1%

_duplicated_4 categorical timestamp

This column holds 22 distinct values, fiscal-year strings from 2001 onward, with each year shown appearing exactly 52 times across 1,093 rows and zero nulls. The near-uniform distribution (entropy ratio 0.986, top rate just 0.0476) and the count of 52 match the 52 state codes in `_duplicated_2`, so each fiscal year contributes one row per state code; the data are annual, not weekly. Twenty-one years at 52 rows account for 1,092 rows, which leaves one value occurring once, likely the leaked header label. The '_duplicated_4' name is an ingest placeholder, not a sign that this column duplicates another.

Treatment: Keep as the time key; remove the stray header row and cast to integer year.

anthropic:claude-opus-4-7 · confidence high

Out[30]:

saturn.columns["_duplicated_4"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	22
top_value	2001
top_rate	0.04758
cardinality	22
entropy	4.399
entropy_ratio	0.9864
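
Casting the year strings is safer with an explicit plausibility check, so that a leaked label becomes a visible missing value instead of an exception. A minimal sketch; `parse_year`, its bounds, and the "Year" token are illustrative assumptions (the report does not show the stray value in this column).

```python
def parse_year(token, lo=1900, hi=2100):
    """Return an int year, or None when the token is not a plausible four-digit year."""
    text = token.strip()
    if text.isascii() and text.isdigit() and len(text) == 4 and lo <= int(text) <= hi:
        return int(text)
    return None

# "Year" is a made-up stand-in for the leaked header label.
raw_years = ["2001", "2002 ", "Year", "2021"]
years = [parse_year(t) for t in raw_years]
print(years)  # [2001, 2002, None, 2021]

valid = [y for y in years if y is not None]
print(min(valid), max(valid))  # 2001 2021
```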

Fig 13.

Top values for _duplicated_4.

Top values for _duplicated_4 (20 unique shown, of 22 total).
value	count	share
2001	52	4.8%
2002	52	4.8%
2003	52	4.8%
2004	52	4.8%
2005	52	4.8%
2006	52	4.8%
2007	52	4.8%
2008	52	4.8%
2009	52	4.8%
2010	52	4.8%
2011	52	4.8%
2012	52	4.8%
2013	52	4.8%
2014	52	4.8%
2015	52	4.8%
2016	52	4.8%
2017	52	4.8%
2018	52	4.8%
2019	52	4.8%
2020	52	4.8%

_duplicated_5 text identifier

Stored as text but the values are short numeric tokens (length 6-21, mean 6.85, one word in 99.9% of rows). Cardinality is high (1037 distinct out of 1093) yet 56 rows repeat a value (5.1% duplicate rate), which would be unexpected for an identifier; in a state-by-fiscal-year workload file it is more plausibly an integer count that occasionally repeats. The single 21-character value is probably the leaked header label. The column name '_duplicated_5' is a placeholder generated during ingest.

Treatment: Remove the stray header row and cast to integer; treat as a numeric measure rather than a join key unless the restored header says otherwise.

anthropic:claude-opus-4-7 · confidence high

Out[33]:

saturn.columns["_duplicated_5"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,037
len_min	6
len_max	21
len_mean	6.846
len_median	7
len_p95	8
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	56
duplicate_rate	0.05124
vocab_size	1,039
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for _duplicated_5.

Character-length distribution for _duplicated_5 (mean: 6.846).
chars	count
6	265
7	744
8	83
9–20	0
21	1
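
A length outlier like the single long value above is usually easiest to find by pattern rather than by length. The sketch below lists every token that is not purely digits; the sample values are invented, and real data may need a looser pattern (for example, thousands separators) that this report gives no evidence for either way.

```python
import re

INTEGER = re.compile(r"\d+")

def non_integer_tokens(values):
    """Return (index, value) pairs whose stripped text is not all digits."""
    return [(i, v) for i, v in enumerate(values) if not INTEGER.fullmatch(v.strip())]

# Invented sample: three digit strings and one header-like label.
column = ["123456", "1234567", "A Hypothetical Header", "12345678"]
print(non_integer_tokens(column))  # [(2, 'A Hypothetical Header')]
```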

_duplicated_6 text identifier

Almost every value is a single token of 5-7 characters (len_mean 5.68, one_word_rate 0.999), with 1090 unique values across 1093 rows and only 3 duplicates. Top tokens are numeric strings like '91371', '18795', '158314'. The vocabulary also contains the fragments 'ssa', 'disability', 'beneficiaries', 'age', '18-64*', which together account for the one 40-character value: the column's real header leaked into the data. That header suggests the column is a count of SSA disability beneficiaries aged 18-64 stored as text, not an identifier.

Treatment: Remove the stray header row and cast to integer; treat as a numeric measure, not a join key.

anthropic:claude-opus-4-7 · confidence high

Out[36]:

saturn.columns["_duplicated_6"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,090
len_min	5
len_max	40
len_mean	5.683
len_median	6
len_p95	6
word_mean	1.005
word_median	1
n_empty	0
n_duplicates	3
duplicate_rate	0.002745
vocab_size	1,094
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	99.7% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 15.

Character-length distribution for _duplicated_6.

Character-length distribution for _duplicated_6 (mean: 5.683).
chars	count
5	397
6	678
7	17
8–39	0
40	1

_duplicated_7 categorical feature

Column is typed categorical but holds 511 distinct numeric strings like "5.50", "5.07", "4.90" across 1093 rows, suggesting a continuous measurement (a rate or similar) stored as text. Distribution is nearly flat: entropy ratio is 0.968 and the most common value covers only 1.01% of rows. The "_duplicated_7" name is an ingest placeholder and says nothing about whether the values copy another column.

Treatment: Cast to float and treat as a continuous feature once the real header is restored.

anthropic:claude-opus-4-7 · confidence medium

Out[39]:

saturn.columns["_duplicated_7"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	511
top_value	5.50
top_rate	0.01006
cardinality	511
entropy	8.71
entropy_ratio	0.9681

Fig 16.

Top values for _duplicated_7.

Top values for _duplicated_7 (20 unique shown, of 511 total).
value	count	share
5.50	11	1.0%
5.07	9	0.8%
4.90	9	0.8%
4.19	8	0.7%
5.08	8	0.7%
4.70	8	0.7%
4.96	7	0.6%
5.29	7	0.6%
4.11	6	0.5%
4.55	6	0.5%
5.18	6	0.5%
4.45	6	0.5%
6.18	6	0.5%
4.98	6	0.5%
5.63	6	0.5%
7.16	6	0.5%
5.33	5	0.5%
5.15	5	0.5%
5.45	5	0.5%
4.71	5	0.5%

_duplicated_8 text identifier

Single-token short strings (length 6-26, mean 6.84, ~1 word each) that are overwhelmingly numeric: top values like '468802', '2702811', '1646445' are integers stored as text. With 1041 unique values across 1093 rows and 52 duplicates, this looks like a high-cardinality integer measure rather than an identifier, since a true identifier would not repeat. The 'allcaps' and Flesch=121.22 signals are artifacts of digit-only tokens; no URLs, emojis, or boilerplate appear. The one 26-character value is probably the leaked header label.

Treatment: Remove the stray header row and cast to integer.

anthropic:claude-opus-4-7 · confidence high

Out[42]:

saturn.columns["_duplicated_8"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,041
len_min	6
len_max	26
len_mean	6.835
len_median	7
len_p95	8
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	52
duplicate_rate	0.04758
vocab_size	1,043
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	95.2% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
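
The duplicate figures in this table follow from `n` and `unique` alone, which is a quick way to sanity-check them. The relationships below (duplicates = rows minus distinct values; the near_unique percentage = distinct values over rows) are inferred from the reported numbers, not from saturn's documentation.

```python
def text_uniqueness(n_rows, n_unique):
    """Derive duplicate statistics from a row count and a distinct-value count."""
    n_duplicates = n_rows - n_unique
    return {
        "n_duplicates": n_duplicates,
        "duplicate_rate": n_duplicates / n_rows,
        "unique_share": n_unique / n_rows,
    }

stats = text_uniqueness(1093, 1041)  # n and unique for _duplicated_8
print(stats["n_duplicates"])              # 52
print(round(stats["duplicate_rate"], 5))  # 0.04758
print(round(stats["unique_share"], 3))    # 0.952
```

By the same arithmetic `_duplicated_5` has a unique share of about 0.949 and carries no near_unique alert, which hints at (but does not prove) a threshold near 95%.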

Fig 17.

Character-length distribution for _duplicated_8.

Character-length distribution for _duplicated_8 (mean: 6.835).
chars	count
6	279
7	733
8	80
9–25	0
26	1

_duplicated_9 text identifier

A short numeric column stored as text: 1081 unique values across 1093 rows, single-token entries averaging 4.85 characters, and the top repeated values are numeric strings like '4190' and '8630'. The 99.9% allcaps and one_word rates are artifacts of digit-only tokens, and the single 14-character value is probably the leaked header label. The 12 duplicates (1.1%) are what occasional ties in an integer count would produce.

Treatment: Remove the stray header row and cast to integer; do not use as a join key without confirming uniqueness.

anthropic:claude-opus-4-7 · confidence high

Out[45]:

saturn.columns["_duplicated_9"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,081
len_min	3
len_max	14
len_mean	4.854
len_median	5
len_p95	6
word_mean	1.001
word_median	1
n_empty	0
n_duplicates	12
duplicate_rate	0.01098
vocab_size	1,082
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	98.9% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 18.

Character-length distribution for _duplicated_9.

Character-length distribution for _duplicated_9 (mean: 4.854).
chars	count
3	1
4	269
5	720
6	102
7–13	0
14	1

_duplicated_10 categorical feature

Stored as a categorical but the values are numeric strings clustered tightly around 1.0 (top values include '0.97', '1.11', '1.01', '1.04', '0.92'), suggesting a ratio, multiplier, or normalised index. Distribution is highly diffuse with 199 distinct values across 1093 rows and an entropy ratio of 0.929, so no single bucket dominates (top_rate just 0.023). The '_duplicated_10' name is an ingest placeholder, not evidence of a redundant copy.

Treatment: Cast to float and treat as a continuous feature.

anthropic:claude-opus-4-7 · confidence medium

Out[48]:

saturn.columns["_duplicated_10"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	199
top_value	0.97
top_rate	0.02287
cardinality	199
entropy	7.097
entropy_ratio	0.9293

Fig 19.

Top values for _duplicated_10.

Top values for _duplicated_10 (20 unique shown, of 199 total).
value	count	share
0.97	25	2.3%
1.11	24	2.2%
1.01	23	2.1%
1.04	19	1.7%
0.92	19	1.7%
1.08	18	1.6%
1.02	17	1.6%
1.12	16	1.5%
1.07	16	1.5%
1.15	16	1.5%
0.96	15	1.4%
1.00	14	1.3%
1.13	14	1.3%
1.10	14	1.3%
0.89	13	1.2%
1.23	13	1.2%
0.94	13	1.2%
0.90	13	1.2%
1.05	13	1.2%
0.85	13	1.2%

_duplicated_11 text identifier

A short numeric column stored as text: 1062 distinct values across 1093 rows, 99.9% one-word, lengths between 3 and 30 characters with a median of 4. Top tokens are bare numeric strings like '6632' and '1573', each appearing only 2-3 times. The 31 duplicates (2.8%) fit occasional ties in an integer count, the single 30-character value is probably the leaked header label, and the '_duplicated_11' name is an ingest placeholder.

Treatment: Remove the stray header row and cast to integer; treat as a numeric measure rather than a key.

anthropic:claude-opus-4-7 · confidence high

Out[51]:

saturn.columns["_duplicated_11"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,062
len_min	3
len_max	30
len_mean	4.498
len_median	4
len_p95	5
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	31
duplicate_rate	0.02836
vocab_size	1,064
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	97.2% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 20.

Character-length distribution for _duplicated_11.

Character-length distribution for _duplicated_11 (mean: 4.498).
chars	count
3	11
4	552
5	529
6–29	0
30	1

_duplicated_12 categorical feature

This column holds 69 distinct numeric-looking strings (e.g. '0.38', '0.34', '0.32') across 1093 rows with no nulls, suggesting a decimal ratio or rate stored as text. The distribution is fairly flat (top value '0.38' covers only 5.0% and entropy ratio is 0.905), so no single value dominates. The '_duplicated_12' name is an ingest placeholder; the thing to flag is the text dtype.

Treatment: Cast to float and treat as a continuous feature.

anthropic:claude-opus-4-7 · confidence medium

Out[54]:

saturn.columns["_duplicated_12"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	69
top_value	0.38
top_rate	0.05032
cardinality	69
entropy	5.527
entropy_ratio	0.9048

Fig 21.

Top values for _duplicated_12.

Top values for _duplicated_12 (20 unique shown, of 69 total).
value	count	share
0.38	55	5.0%
0.34	45	4.1%
0.32	43	3.9%
0.40	41	3.8%
0.35	39	3.6%
0.44	36	3.3%
0.37	35	3.2%
0.31	35	3.2%
0.36	33	3.0%
0.33	33	3.0%
0.39	33	3.0%
0.46	32	2.9%
0.43	31	2.8%
0.48	30	2.7%
0.45	30	2.7%
0.42	29	2.7%
0.30	29	2.7%
0.41	27	2.5%
0.52	26	2.4%
0.54	25	2.3%

_duplicated_13 text identifier

This column holds short, single-token strings that are almost entirely unique (1079 unique out of 1093), with lengths between 4 and 24 characters and a median of 5. The top-frequency tokens are all numeric strings ('17955', '5808', etc.) appearing only twice each, suggesting an integer count stored as text rather than natural text. The 'allcaps' and 'one_word' rates near 99.9% are artifacts of digit-only tokens, the single 24-character value is probably the leaked header label, and the column name '_duplicated_13' is an ingest placeholder.

Treatment: Remove the stray header row and cast to integer; treat as a numeric measure rather than a join key.

anthropic:claude-opus-4-7 · confidence high

Out[57]:

saturn.columns["_duplicated_13"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,079
len_min	4
len_max	24
len_mean	4.849
len_median	5
len_p95	6
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	14
duplicate_rate	0.01281
vocab_size	1,081
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	98.7% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 22.

Character-length distribution for _duplicated_13.

Character-length distribution for _duplicated_13 (mean: 4.849).
chars	count
4	283
5	710
6	99
7–23	0
24	1

_duplicated_14 categorical feature

This column, labelled `_duplicated_14`, holds 1093 numeric-looking strings (e.g. "31.13", "44.89") with 883 unique values and no nulls, almost certainly a continuous measurement that was ingested as categorical. An entropy ratio of 0.99 and a top frequency of just 4 (0.37%) confirm that values rarely repeat, which is what triggers the `long_tail` alert. The `_duplicated_` prefix is an ingest placeholder.

Treatment: Cast to float and treat as a continuous feature.

anthropic:claude-opus-4-7 · confidence high

Out[60]:

saturn.columns["_duplicated_14"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	883
top_value	31.13
top_rate	0.00366
cardinality	883
entropy	9.686
entropy_ratio	0.9897
alert: long_tail	707 singleton categories
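
A "singleton category" here presumably means a value that occurs exactly once; under that assumption the count behind the `long_tail` alert can be reproduced with a frequency table. The sketch uses two values from this column's top-values list plus three invented singletons.

```python
from collections import Counter

def singleton_count(values):
    """Number of distinct values that occur exactly once."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1)

# "31.13" and "44.89" come from the profile; the last three values are made up.
values = ["31.13"] * 4 + ["44.89"] * 3 + ["12.01", "18.40", "27.95"]
print(singleton_count(values))  # 3
```

For a continuous measure stored as text, a large singleton count is expected and stops being meaningful once the column is cast to float.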

Fig 23.

Top values for _duplicated_14.

Top values for _duplicated_14 (20 unique shown, of 883 total).
value	count	share
31.13	4	0.4%
44.89	3	0.3%
33.20	3	0.3%
47.46	3	0.3%
30.73	3	0.3%
35.51	3	0.3%
41.78	3	0.3%
40.12	3	0.3%
36.06	3	0.3%
29.74	3	0.3%
36.98	3	0.3%
37.02	3	0.3%
38.32	3	0.3%
29.63	3	0.3%
36.17	3	0.3%
30.34	3	0.3%
32.50	3	0.3%
36.14	3	0.3%
32.47	3	0.3%
31.93	3	0.3%

_duplicated_15 text identifier

This column holds short single-token numeric strings (one_word_rate 0.999, len_mean 6.4, max 24) stored as text rather than integers, with 1019 unique values across 1093 rows. The value '0' appears 21 times while every other top value occurs only twice; 21 is also the number of rows per state code, so the zeros may all belong to one state code, which is worth checking. The name '_duplicated_15' is an ingest placeholder, and the 6.8% duplicate rate points to a count measure rather than an identifier.

Treatment: Remove the stray header row and cast to integer; confirm whether '0' is a true zero or a missing-data sentinel.

anthropic:claude-opus-4-7 · confidence medium

Out[63]:

saturn.columns["_duplicated_15"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,019
len_min	1
len_max	24
len_mean	6.415
len_median	6
len_p95	7
word_mean	1.003
word_median	1
n_empty	0
n_duplicates	74
duplicate_rate	0.0677
vocab_size	1,022
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9799
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	98.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 24.

Character-length distribution for _duplicated_15.

Character-length distribution for _duplicated_15 (mean: 6.415).
chars	count
1	21
2–5	0
6	530
7	541
8–23	0
24	1

_duplicated_16 text identifier

Despite being typed as text, this column is dominated by short single-token numeric strings (one_word_rate 0.999, len_mean 4.54, max 38) with 1057 unique values across 1093 rows. The top tokens are bare integers like "0" (21 occurrences), "1358", "840", suggesting an integer count stored as text rather than natural language. The allcaps_rate of 0.98 is an artifact of digits/non-letter content, the single 38-character value is probably the leaked header label, and the column name `_duplicated_16` is a placeholder generated at ingest.

Treatment: Remove the stray header row and cast to integer; do not tokenize as text.

anthropic:claude-opus-4-7 · confidence high

Out[66]:

saturn.columns["_duplicated_16"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,057
len_min	1
len_max	38
len_mean	4.535
len_median	5
len_p95	5
word_mean	1.004
word_median	1
n_empty	0
n_duplicates	36
duplicate_rate	0.03294
vocab_size	1,061
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9799
boilerplate_rate	0
alert: near_unique	96.7% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	98.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 25.

Character-length distribution for _duplicated_16.

Character-length distribution for _duplicated_16 (mean: 4.535). Bin edges are rounded to whole characters; bins not listed have a count of 0.
chars	count
1 – 2	21
3 – 4	25
4 – 5	446
5 – 6	562
6 – 7	37
7 – 7	1
37 – 38	1

_duplicated_17 categorical feature

Stored as categorical strings but the values are numeric ('0.00', '1.68', '0.58', '1.07'), suggesting a small-magnitude continuous measurement that was read as text. Cardinality is high (272 unique across 1,093 rows) with a very flat distribution: the top value '0.00' covers only 1.92% of rows and the entropy ratio is 0.949. The '_duplicated_17' name is a placeholder from the header-parsing problem noted in the Overview and does not by itself show that the column copies another.

Treatment: Cast to float and check whether it duplicates an existing column; drop if redundant.

anthropic:claude-opus-4-7 · confidence medium
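
To test whether a placeholder-named column really duplicates another, one option is to compare every pair of columns after loading. The sketch below uses an invented three-column frame; `find_identical_columns` is a hypothetical helper and the column names `a`, `b`, `c` are illustrative only.

```python
import itertools
import pandas as pd

def find_identical_columns(df: pd.DataFrame) -> list[tuple[str, str]]:
    """Return pairs of column names whose values match row for row."""
    pairs = []
    for left, right in itertools.combinations(df.columns, 2):
        # Series.equals treats missing values in the same position as equal.
        if df[left].equals(df[right]):
            pairs.append((left, right))
    return pairs

frame = pd.DataFrame({
    "a": ["0.00", "1.68", "0.58"],
    "b": ["0.00", "1.68", "0.58"],  # exact copy of "a"
    "c": ["0.00", "1.07", "0.58"],  # differs in one row
})
print(find_identical_columns(frame))
```

With 30 columns this is 435 comparisons, which is cheap at 1,093 rows. Only an exact match justifies dropping a column as redundant.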

Out[69]:

saturn.columns["_duplicated_17"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	272
top_value	0.00
top_rate	0.01921
cardinality	272
entropy	7.671
entropy_ratio	0.9485

Fig 26.

Top values for _duplicated_17.

Top values for _duplicated_17 (20 unique shown, of 272 total).
value	count	share
0.00	21	1.9%
1.68	13	1.2%
0.58	12	1.1%
1.07	12	1.1%
1.08	12	1.1%
1.24	12	1.1%
1.15	12	1.1%
0.64	12	1.1%
1.52	11	1.0%
1.42	11	1.0%
1.18	11	1.0%
1.70	11	1.0%
1.81	11	1.0%
1.20	10	0.9%
1.09	10	0.9%
1.44	10	0.9%
1.11	10	0.9%
0.94	10	0.9%
1.78	10	0.9%
1.56	10	0.9%

_duplicated_18 text identifier

Despite being typed as text, this column holds single-token numeric strings (one_word_rate 0.999, word_mean 1.00, len_mean 6.4) with 1,021 unique values across 1,093 rows (1 null), so it is effectively a high-cardinality numeric ID stored as text. The value '0' appears 20 times while every other top value occurs at most twice, hinting that '0' is a sentinel or placeholder among otherwise near-unique values. The 'allcaps' alert is a quirk of digit-only strings rather than meaningful casing.

Treatment: Cast to integer (treating '0' as missing) or drop as near-unique identifier before modelling.

anthropic:claude-opus-4-7 · confidence high
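
A short sketch of the "treat '0' as missing" option, assuming pandas and invented sample values. Whether '0' is a placeholder or a true zero is an assumption the profile cannot confirm, so it is worth checking which rows carry it before masking.

```python
import pandas as pd

# Invented values: two placeholder zeros, one existing null, two real numbers.
raw = pd.Series(["0", "104233", None, "98741", "0"])

numeric = pd.to_numeric(raw, errors="coerce")
# Assumption: '0' is a placeholder rather than a genuine zero.
as_int = numeric.mask(numeric == 0).astype("Int64")

print(as_int.tolist())
print(int(as_int.isna().sum()))
```

The nullable `Int64` dtype keeps the column integer-typed while still recording the masked zeros and the original null as missing.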

Out[72]:

saturn.columns["_duplicated_18"].stats

stat	value
n	1,093
nulls	1 (0.1%)
unique	1,021
len_min	1
len_max	26
len_mean	6.407
len_median	6
len_p95	7
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	71
duplicate_rate	0.06502
vocab_size	1,023
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9808
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	98.1% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 27.

Character-length distribution for _duplicated_18.

Character-length distribution for _duplicated_18 (mean: 6.407). Bin edges are rounded to whole characters; bins not listed have a count of 0.
chars	count
1 – 2	20
5 – 5	1
6 – 7	545
7 – 7	525
25 – 26	1

_duplicated_19 text identifier

Despite being typed as text, this column is essentially short numeric tokens: 99.9% of rows are single words, with a mean length of 4.05 characters and a single outlier of 32 characters. With 1,018 unique values across 1,093 rows and the most common entry '0' appearing only 21 times, it behaves like a high-cardinality numeric identifier stored as strings. The 'allcaps' alert (97.99%) is an artifact of digits having no lowercase form rather than a meaningful signal.

Treatment: Cast to integer and treat as an ID; drop from modelling features unless joined as a key.

anthropic:claude-opus-4-7 · confidence high

Out[75]:

saturn.columns["_duplicated_19"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,018
len_min	1
len_max	32
len_mean	4.047
len_median	4
len_p95	5
word_mean	1.004
word_median	1
n_empty	0
n_duplicates	75
duplicate_rate	0.06862
vocab_size	1,022
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9799
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	98.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 28.

Character-length distribution for _duplicated_19.

Character-length distribution for _duplicated_19 (mean: 4.047). Bin edges are rounded to whole characters, so adjacent labels can overlap; bins not listed have a count of 0.
chars	count
1 – 2	21
3 – 3	181
3 – 4	623
5 – 6	267
31 – 32	1

_duplicated_20 categorical feature

Despite being typed categorical, the values shown are two-decimal numeric strings (the top 20 range from 0.00 to 0.71), suggesting a proportion or rate that was stored as text. The distribution is nearly flat (entropy ratio 0.907), with the modal value '0.30' covering only 2.6% of 1,093 rows and no nulls. The column name '_duplicated_20' is a placeholder from the header-parsing problem noted in the Overview, so whether it copies another column has to be checked against the data.

Treatment: Cast strings to float and treat as a numeric feature; verify against the source column and drop if it is an exact duplicate.

anthropic:claude-opus-4-7 · confidence medium
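
Only the top 20 of 156 values are visible in the profile, so it may be worth confirming the two-decimal format before casting. A minimal standard-library sketch, with an invented sample and a hypothetical helper name `split_by_format`:

```python
import re

# Digits, a dot, then exactly two digits (e.g. "0.30").
TWO_DECIMALS = re.compile(r"\d+\.\d{2}")

def split_by_format(values):
    """Split strings into (floats matching N.NN, strings that do not match)."""
    parsed, rejected = [], []
    for text in values:
        if TWO_DECIMALS.fullmatch(text.strip()):
            parsed.append(float(text))
        else:
            rejected.append(text)
    return parsed, rejected

# Invented sample; "n/a" stands in for any value that breaks the pattern.
parsed, rejected = split_by_format(["0.30", "0.71", "0.00", "n/a"])
print(min(parsed), max(parsed), rejected)
```

An empty `rejected` list on the full column would support the cast to float; anything else should be inspected first.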

Out[78]:

saturn.columns["_duplicated_20"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	156
top_value	0.30
top_rate	0.02562
cardinality	156
entropy	6.609
entropy_ratio	0.9071

Fig 29.

Top values for _duplicated_20.

Top values for _duplicated_20 (20 unique shown, of 156 total).
value	count	share
0.30	28	2.6%
0.35	26	2.4%
0.33	26	2.4%
0.37	24	2.2%
0.45	24	2.2%
0.36	23	2.1%
0.61	23	2.1%
0.42	22	2.0%
0.00	21	1.9%
0.43	21	1.9%
0.38	20	1.8%
0.40	19	1.7%
0.48	19	1.7%
0.32	19	1.7%
0.41	18	1.6%
0.39	18	1.6%
0.58	18	1.6%
0.18	18	1.6%
0.57	17	1.6%
0.71	17	1.6%

_duplicated_21 categorical identifier

This column is labelled `_duplicated_21`, a placeholder name from the header-parsing problem noted in the Overview; its values appear to be numeric strings stored as categorical. With 957 unique values across 1,093 rows and an entropy ratio of 0.9885, it is nearly an identifier: the only meaningful concentration is `"0"` at 21 occurrences (1.92%), likely a sentinel or default. The long_tail alert (852 singleton categories) and near-unique cardinality mean it carries almost no categorical signal as-is.

Treatment: Cast to a numeric type, or drop it if it proves to be a near-unique identifier; reconcile against any column it may duplicate before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[81]:

saturn.columns["_duplicated_21"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	957
top_value	0
top_rate	0.01921
cardinality	957
entropy	9.789
entropy_ratio	0.9885
alert: long_tail	852 singleton categories
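
The reported figures are consistent with Shannon entropy in bits divided by log2 of the cardinality (9.789 / log2(957) ≈ 0.9885). That definition is inferred from the numbers here, not taken from saturn's documentation. A small standard-library sketch on invented values, with `entropy_stats` as a hypothetical helper:

```python
import math
from collections import Counter

def entropy_stats(values):
    """Shannon entropy (bits), ratio to the maximum, and singleton count."""
    counts = Counter(values)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Maximum entropy is reached when every category is equally likely.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "entropy": entropy,
        "entropy_ratio": entropy / max_entropy,
        "singletons": singletons,
    }

stats = entropy_stats(["0", "0", "1321", "352"])
print(stats)
```

A ratio near 1 means the values are spread almost evenly across categories, which is why a near-unique column scores so high.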

Fig 30.

Top values for _duplicated_21.

Top values for _duplicated_21 (20 unique shown, of 957 total).
value	count	share
0	21	1.9%
1321	4	0.4%
352	3	0.3%
597	3	0.3%
777	3	0.3%
580	3	0.3%
1184	3	0.3%
1353	3	0.3%
710	3	0.3%
3128	3	0.3%
463	3	0.3%
227	3	0.3%
1043	2	0.2%
421	2	0.2%
2891	2	0.2%
5079	2	0.2%
3228	2	0.2%
299	2	0.2%
3337	2	0.2%
238	2	0.2%

_duplicated_22 categorical feature

This column holds 70 distinct short decimal strings; the ten most frequent fall between 0.16 and 0.25, suggesting a numeric ratio or rate (perhaps a proportion or probability) that has been stored as text. The distribution is fairly even, with the top value '0.18' taking only 7.0% of rows and an entropy ratio of 0.84, so no single bucket dominates. The 'categorical' kind reflects values parsed as strings rather than floats, and the '_duplicated_22' name is a placeholder from the header-parsing problem noted in the Overview, not evidence of a duplicate column.

Treatment: Cast to float and verify it is not redundant with the original column before modelling.

anthropic:claude-opus-4-7 · confidence medium

Out[84]:

saturn.columns["_duplicated_22"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	70
top_value	0.18
top_rate	0.07045
cardinality	70
entropy	5.138
entropy_ratio	0.8383

Fig 31.

Top values for _duplicated_22.

Top values for _duplicated_22 (20 unique shown, of 70 total).
value	count	share
0.18	77	7.0%
0.20	75	6.9%
0.21	64	5.9%
0.22	62	5.7%
0.25	61	5.6%
0.17	61	5.6%
0.23	49	4.5%
0.19	46	4.2%
0.24	45	4.1%
0.16	43	3.9%
0.15	41	3.8%
0.26	36	3.3%
0.14	27	2.5%
0.12	26	2.4%
0.27	26	2.4%
0.13	25	2.3%
0.11	25	2.3%
0.10	22	2.0%
0.00	21	1.9%
0.29	20	1.8%

_duplicated_23 text identifier

Despite the text kind, nearly every value is a single short token (word_mean 1.004, len_mean 4.05, with one 37-character outlier) and the top values are all numeric strings like "0", "406", "404". With 1,028 unique values across 1,093 rows and a 5.9% duplicate_rate, of which "0" (21 occurrences) contributes 20 of the 65 duplicate rows, this looks like a numeric identifier or count stored as text. The allcaps_rate of 0.98 is a quirk of digit-only strings being flagged as uppercase.

Treatment: Cast to integer and treat as a numeric id or count rather than free text.

anthropic:claude-opus-4-7 · confidence high
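
The stats for this column fit n_duplicates = n − unique (1,093 − 1,028 = 65) and duplicate_rate = n_duplicates / n (65 / 1,093 ≈ 0.0595). The sketch below reproduces that arithmetic on invented values and shows how much each repeated value contributes; `duplicate_breakdown` is a hypothetical helper.

```python
from collections import Counter

def duplicate_breakdown(values):
    """Count rows beyond the first occurrence of each value."""
    counts = Counter(values)
    # Each value contributes (occurrences - 1) duplicate rows.
    extra = {v: c - 1 for v, c in counts.items() if c > 1}
    n_duplicates = sum(extra.values())
    return n_duplicates, n_duplicates / len(values), extra

sample = ["0", "0", "0", "406", "406", "404", "512"]
n_dup, rate, extra = duplicate_breakdown(sample)
print(n_dup, round(rate, 4), extra)
```

Breaking the total down by value separates a single repeated placeholder from many values that each repeat once or twice.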

Out[87]:

saturn.columns["_duplicated_23"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,028
len_min	1
len_max	37
len_mean	4.053
len_median	4
len_p95	5
word_mean	1.004
word_median	1
n_empty	0
n_duplicates	65
duplicate_rate	0.05947
vocab_size	1,032
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9799
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	98.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 32.

Character-length distribution for _duplicated_23.

Character-length distribution for _duplicated_23 (mean: 4.053). Bin edges are rounded to whole characters; bins not listed have a count of 0.
chars	count
1 – 2	21
3 – 4	182
4 – 5	619
5 – 6	270
36 – 37	1

_duplicated_24 categorical feature

Despite being typed categorical, the values are numeric strings (e.g. '0.00', '47.52', '51.82'), suggesting a monetary or measurement field that was read as text. With 900 unique values across 1,093 rows (6 nulls) and an entropy ratio of 0.9874, it is nearly unique; the only meaningful concentration is '0.00' at 15 rows (1.38% of non-null rows). The '_duplicated_24' name is a placeholder from the header-parsing problem noted in the Overview rather than proof that the column repeats another.

Treatment: Cast to float and treat as numeric; verify whether it duplicates another column before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[90]:

saturn.columns["_duplicated_24"].stats

stat	value
n	1,093
nulls	6 (0.5%)
unique	900
top_value	0.00
top_rate	0.0138
cardinality	900
entropy	9.69
entropy_ratio	0.9874
alert: long_tail	756 singleton categories

Fig 33.

Top values for _duplicated_24.

Top values for _duplicated_24 (20 unique shown, of 900 total).
value	count	share
0.00	15	1.4%
47.52	5	0.5%
51.82	4	0.4%
47.04	4	0.4%
54.24	4	0.4%
48.91	4	0.4%
51.90	3	0.3%
48.89	3	0.3%
37.97	3	0.3%
51.35	3	0.3%
44.18	3	0.3%
63.06	3	0.3%
40.64	3	0.3%
38.66	3	0.3%
57.98	3	0.3%
30.92	3	0.3%
53.94	3	0.3%
60.20	3	0.3%
39.15	3	0.3%
30.05	3	0.3%

_duplicated_25 text identifier

Almost every value is a single short ALLCAPS token (one_word_rate 0.999, allcaps_rate 0.999, len_mean 4.9, word_mean 1.0), and the 1,093 rows hold 1,088 distinct values with only 5 duplicates. The top tokens are mostly numeric strings like '3584' or '14860', suggesting this is a near-unique short code rather than natural text. The column name '_duplicated_25' is a placeholder from the header-parsing problem noted in the Overview.

Treatment: Drop or treat as an ID key; do not tokenize as free text.

anthropic:claude-opus-4-7 · confidence high

Out[93]:

saturn.columns["_duplicated_25"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,088
len_min	4
len_max	18
len_mean	4.906
len_median	5
len_p95	6
word_mean	1.001
word_median	1
n_empty	0
n_duplicates	5
duplicate_rate	0.004575
vocab_size	1,089
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	99.5% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 34.

Character-length distribution for _duplicated_25.

Character-length distribution for _duplicated_25 (mean: 4.906). Bin edges are rounded to whole characters; bins not listed have a count of 0.
chars	count
4 – 4	248
5 – 5	712
6 – 6	132
18 – 18	1
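
Because the figure's bin labels are rounded, tabulating exact string lengths can be easier to read. A minimal standard-library sketch on invented values, with `length_counts` as a hypothetical helper:

```python
from collections import Counter

def length_counts(values):
    """Map exact string length to number of rows, sorted by length."""
    return dict(sorted(Counter(len(v) for v in values).items()))

# Invented sample: short numeric strings plus one 18-character outlier.
sample = ["3584", "14860", "20771", "100234", "X" * 18]
print(length_counts(sample))
```

A lone entry far above the 95th-percentile length, like the single 18-character value here, is the row to look at first.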

_duplicated_26 text identifier

Single-token, all-caps strings averaging 4.57 characters with 1,069 unique values across 1,093 rows, almost certainly an identifier or short code column. The top values are all numeric strings (e.g., '2280', '2086') appearing 2-3 times each, suggesting these are numeric IDs stored as text rather than meaningful tokens. The 99.9% one-word and all-caps rates plus near-unique cardinality rule out free text.

Treatment: Treat as a categorical/key field; drop from modelling features or use only for joins.

anthropic:claude-opus-4-7 · confidence high

Out[96]:

saturn.columns["_duplicated_26"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	1,069
len_min	3
len_max	28
len_mean	4.574
len_median	5
len_p95	5
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	24
duplicate_rate	0.02196
vocab_size	1,071
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9991
allcaps_rate	0.9991
boilerplate_rate	0
alert: near_unique	97.8% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 35.

Character-length distribution for _duplicated_26.

Character-length distribution for _duplicated_26 (mean: 4.574). Bin edges are rounded to whole characters; bins not listed have a count of 0.
chars	count
3 – 4	4
4 – 4	487
5 – 6	595
6 – 6	6
27 – 28	1

_duplicated_27 categorical feature

Stored as categorical strings, but the values shown parse as two-decimal numbers (e.g. '37.60', '41.85'), so this is almost certainly a numeric measurement (possibly a price, rate or score) that was ingested as text. With 873 unique values across 1,093 rows and an entropy ratio of 0.989, it is near-unique; the most frequent value '37.60' appears just 4 times (top rate 0.37%). The '_duplicated_27' name is a placeholder from the header-parsing problem noted in the Overview and does not show which column, if any, it duplicates.

Treatment: Cast to float and treat as a numeric feature; verify it is not redundant with another column.

anthropic:claude-opus-4-7 · confidence high
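
Once cast to float, a near-unique column like this is better summarised by range and binned counts than by a top-values table. A sketch assuming pandas; the six strings are taken from the top-values table for this column, while the bin edges (20, 30, 40, 50) are an arbitrary choice for illustration.

```python
import pandas as pd

# Six values from the top-values table, still stored as strings.
raw = pd.Series(["37.60", "36.60", "41.85", "38.47", "49.19", "29.96"])

values = raw.astype(float)
summary = values.describe()

# Assumed bin edges; intervals are right-closed: (20, 30], (30, 40], (40, 50].
bins = pd.cut(values, bins=[20, 30, 40, 50]).value_counts().sort_index()

print(summary[["min", "max", "mean"]])
print(bins)
```

On the full column the same two calls would show whether the measure is bounded, skewed, or multi-modal, none of which the categorical profile can reveal.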

Out[99]:

saturn.columns["_duplicated_27"].stats

stat	value
n	1,093
nulls	0 (0.0%)
unique	873
top_value	37.60
top_rate	0.00366
cardinality	873
entropy	9.662
entropy_ratio	0.989
alert: long_tail	693 singleton categories

Fig 36.

Top values for _duplicated_27.

Top values for _duplicated_27 (20 unique shown, of 873 total).
value	count	share
37.60	4	0.4%
36.60	4	0.4%
41.85	4	0.4%
38.47	4	0.4%
49.19	3	0.3%
32.63	3	0.3%
42.28	3	0.3%
29.96	3	0.3%
42.14	3	0.3%
38.12	3	0.3%
33.04	3	0.3%
40.70	3	0.3%
40.45	3	0.3%
33.84	3	0.3%
30.27	3	0.3%
31.35	3	0.3%
39.43	3	0.3%
33.77	3	0.3%
30.69	3	0.3%
31.39	3	0.3%

accessibility ssa sa fywl

Overview

Summary confidence: medium

Please note 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. categorical metadata

categorical metadata

_duplicated_0 categorical metadata

_duplicated_1 categorical feature

_duplicated_2 categorical feature

_duplicated_3 categorical other

_duplicated_4 categorical timestamp

_duplicated_5 text identifier

_duplicated_6 text identifier

_duplicated_7 categorical feature

_duplicated_8 text identifier

_duplicated_9 text identifier

_duplicated_10 categorical feature

_duplicated_11 text identifier

_duplicated_12 categorical feature

_duplicated_13 text identifier

_duplicated_14 categorical feature

_duplicated_15 text identifier

_duplicated_16 text identifier

_duplicated_17 categorical feature

_duplicated_18 text identifier

_duplicated_19 text identifier

_duplicated_20 categorical feature

_duplicated_21 categorical identifier

_duplicated_22 categorical feature

_duplicated_23 text identifier

_duplicated_24 categorical feature

_duplicated_25 text identifier

_duplicated_26 text identifier

_duplicated_27 categorical feature

How to cite
