accessibility ssa sa fywl

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv

Saturn profiled 1,093 rows across 30 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
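# Notebook shell escape: install the saturn-dissect package into this environment.
!pip install saturn-dissect
import subprocess

# Profile the CSV, write the findings to JSON, and request the opt-in LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv",
    "--findings", "accessibility-ssa_sa_fywl.json",
    "--llm", "anthropic:claude-opus-4-7",
])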

Summary confidence: medium

This appears to be the SSA-SA-FYWL dataset (Social Security Administration state/area fiscal-year workload data) with 1,093 rows and 30 columns, but the headers were not parsed correctly: most columns carry placeholder names like `_duplicated_*`, and several columns hold metadata constants (file name, update date 3/13/2023, date type 'FY'). The real header row appears to have been read as data, so each column's unique count includes one stray header label (for example 'Region Code' in `_duplicated_1`). The most informative real fields are the geographic and time dimensions. `_duplicated_2` holds 53 distinct values, consistent with 52 state/territory codes appearing 21 times each plus one header label (52 × 21 + 1 = 1,093). `_duplicated_1` holds 10 region codes plus the 'Region Code' label, led by ATL (168 rows). `_duplicated_4` holds 22 distinct values, consistent with 21 fiscal years from 2001 onward at 52 rows each plus one header label, which is a balanced panel. Many numeric measures (e.g. `_duplicated_22`, `_duplicated_12`, `_duplicated_10`) were ingested as text/categorical strings of decimal numbers, so they should be retyped before analysis. Start by fixing headers and dtypes, then look at the region/state/year structure to confirm the panel layout.

citing: _duplicated_1 · _duplicated_2 · _duplicated_4 · _duplicated_22 · _duplicated_12 · _duplicated_0 · _duplicated_3
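
The header repair recommended in the summary can be sketched as follows. The stray labels reported later in this profile ('File Name', 'File Version', 'Update Date', 'Region Code', 'Date Type') suggest that the real header sits on the second line of the CSV, under the "**Please note**" banner. That layout is an assumption, and the sample text and the helper `read_with_second_line_header` are hypothetical.

```python
import csv
import io

# Hypothetical miniature of the assumed file layout:
# line 1 is a free-text banner, line 2 is the real header, data follows.
SAMPLE = (
    '"Banner note, with commas",,,,\n'
    "File Name,File Version,Update Date,Region Code,Date Type\n"
    "SSA-SA-FYWL.csv,2,3/13/2023,ATL,FY\n"
    "SSA-SA-FYWL.csv,2,3/13/2023,BOS,FY\n"
)


def read_with_second_line_header(text):
    """Discard the first physical line, then parse the rest as a headed CSV."""
    stream = io.StringIO(text)
    next(stream)  # skip the banner line (assumes it has no embedded newline)
    return list(csv.DictReader(stream))


rows = read_with_second_line_header(SAMPLE)
print(list(rows[0].keys()))
print(rows[0]["Region Code"], len(rows))
```

With pandas, `pd.read_csv(path, header=1)` expresses the same assumption.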

Out[4]:

saturn.schema() · 30 columns

column kind n null% unique alerts
**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. categorical 1,093 0.0% 2 imbalance
"" categorical 1,093 0.0% 2 imbalance
_duplicated_0 categorical 1,093 0.0% 2 imbalance
_duplicated_1 categorical 1,093 0.0% 11
_duplicated_2 categorical 1,093 0.0% 53
_duplicated_3 categorical 1,093 0.0% 2 imbalance
_duplicated_4 categorical 1,093 0.0% 22
_duplicated_5 text 1,093 0.0% 1,037 one_word allcaps short_text
_duplicated_6 text 1,093 0.0% 1,090 near_unique one_word allcaps short_text
_duplicated_7 categorical 1,093 0.0% 511
_duplicated_8 text 1,093 0.0% 1,041 near_unique one_word allcaps short_text
_duplicated_9 text 1,093 0.0% 1,081 near_unique one_word allcaps short_text
_duplicated_10 categorical 1,093 0.0% 199
_duplicated_11 text 1,093 0.0% 1,062 near_unique one_word allcaps short_text
_duplicated_12 categorical 1,093 0.0% 69
_duplicated_13 text 1,093 0.0% 1,079 near_unique one_word allcaps short_text
_duplicated_14 categorical 1,093 0.0% 883 long_tail
_duplicated_15 text 1,093 0.0% 1,019 one_word allcaps short_text
_duplicated_16 text 1,093 0.0% 1,057 near_unique one_word allcaps short_text
_duplicated_17 categorical 1,093 0.0% 272
_duplicated_18 text 1,093 0.1% 1,021 one_word allcaps short_text
_duplicated_19 text 1,093 0.0% 1,018 one_word allcaps short_text
_duplicated_20 categorical 1,093 0.0% 156
_duplicated_21 categorical 1,093 0.0% 957 long_tail
_duplicated_22 categorical 1,093 0.0% 70
_duplicated_23 text 1,093 0.0% 1,028 one_word allcaps short_text
_duplicated_24 categorical 1,093 0.5% 900 long_tail
_duplicated_25 text 1,093 0.0% 1,088 near_unique one_word allcaps short_text
_duplicated_26 text 1,093 0.0% 1,069 near_unique one_word allcaps short_text
_duplicated_27 categorical 1,093 0.0% 873 long_tail
Fig 1.
_duplicated_1 · Row counts by SSA region code — ATL leads at 168, showing uneven regional coverage.
Top values for _duplicated_1 (11 unique shown, of 11 total).
value count share
ATL 168 15.4%
DEN 126 11.5%
BOS 126 11.5%
PHL 126 11.5%
CHI 126 11.5%
DAL 105 9.6%
SEA 84 7.7%
SFO 84 7.7%
KCM 84 7.7%
NYC 63 5.8%
Region Code 1 0.1%
Fig 2.
_duplicated_4 · Distribution across fiscal years 2001+ — note the flat 52-rows-per-year pattern indicating a balanced panel.
Top values for _duplicated_4 (20 unique shown, of 22 total).
value count share
2001 52 4.8%
2002 52 4.8%
2003 52 4.8%
2004 52 4.8%
2005 52 4.8%
2006 52 4.8%
2007 52 4.8%
2008 52 4.8%
2009 52 4.8%
2010 52 4.8%
2011 52 4.8%
2012 52 4.8%
2013 52 4.8%
2014 52 4.8%
2015 52 4.8%
2016 52 4.8%
2017 52 4.8%
2018 52 4.8%
2019 52 4.8%
2020 52 4.8%
Fig 3.
_duplicated_2 · State-code coverage: 53 distinct values, with the listed codes appearing 21 times each, which suggests one row per state code per fiscal year (52 codes × 21 years, plus one stray header label).
Top values for _duplicated_2 (20 unique shown, of 53 total).
value count share
AK 21 1.9%
AL 21 1.9%
AR 21 1.9%
AZ 21 1.9%
CA 21 1.9%
CO 21 1.9%
CT 21 1.9%
DC 21 1.9%
DE 21 1.9%
FL 21 1.9%
GA 21 1.9%
HI 21 1.9%
IA 21 1.9%
ID 21 1.9%
IL 21 1.9%
IN 21 1.9%
KS 21 1.9%
KY 21 1.9%
LA 21 1.9%
MA 21 1.9%
Fig 4.
_duplicated_22 · A numeric ratio stored as text (mode 0.18; the most frequent values fall between about 0.16 and 0.25, and the 20 listed values span 0.00 to 0.29); convert to numeric to inspect its true distribution.
Top values for _duplicated_22 (20 unique shown, of 70 total).
value count share
0.18 77 7.0%
0.20 75 6.9%
0.21 64 5.9%
0.22 62 5.7%
0.25 61 5.6%
0.17 61 5.6%
0.23 49 4.5%
0.19 46 4.2%
0.24 45 4.1%
0.16 43 3.9%
0.15 41 3.8%
0.26 36 3.3%
0.14 27 2.5%
0.12 26 2.4%
0.27 26 2.4%
0.13 25 2.3%
0.11 25 2.3%
0.10 22 2.0%
0.00 21 1.9%
0.29 20 1.8%
Fig 5.
_duplicated_12 · Another numeric column held as 69 string buckets clustered near 0.30–0.40 — recast and replot as a true histogram.
Top values for _duplicated_12 (20 unique shown, of 69 total).
value count share
0.38 55 5.0%
0.34 45 4.1%
0.32 43 3.9%
0.40 41 3.8%
0.35 39 3.6%
0.44 36 3.3%
0.37 35 3.2%
0.31 35 3.2%
0.36 33 3.0%
0.33 33 3.0%
0.39 33 3.0%
0.46 32 2.9%
0.43 31 2.8%
0.48 30 2.7%
0.45 30 2.7%
0.42 29 2.7%
0.30 29 2.7%
0.41 27 2.5%
0.52 26 2.4%
0.54 25 2.3%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column kind null %
**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. categorical 0.0%
"" categorical 0.0%
_duplicated_0 categorical 0.0%
_duplicated_1 categorical 0.0%
_duplicated_2 categorical 0.0%
_duplicated_3 categorical 0.0%
_duplicated_4 categorical 0.0%
_duplicated_5 text 0.0%
_duplicated_6 text 0.0%
_duplicated_7 categorical 0.0%
_duplicated_8 text 0.0%
_duplicated_9 text 0.0%
_duplicated_10 categorical 0.0%
_duplicated_11 text 0.0%
_duplicated_12 categorical 0.0%
_duplicated_13 text 0.0%
_duplicated_14 categorical 0.0%
_duplicated_15 text 0.0%
_duplicated_16 text 0.0%
_duplicated_17 categorical 0.0%
_duplicated_18 text 0.1%
_duplicated_19 text 0.0%
_duplicated_20 categorical 0.0%
_duplicated_21 categorical 0.0%
_duplicated_22 categorical 0.0%
_duplicated_23 text 0.0%
_duplicated_24 categorical 0.5%
_duplicated_25 text 0.0%
_duplicated_26 text 0.0%
_duplicated_27 categorical 0.0%

**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. categorical metadata

This column is effectively a constant file-name tag ("SSA-SA-FYWL.csv" appears 1092 of 1093 times, top_rate 0.999) with a single stray "File Name" value that looks like a header row leaked into the data. The column header itself is a free-text note about 2021 data being backfilled with 2020 data, suggesting this is provenance metadata rather than a feature. Entropy is essentially zero (0.0106), so it carries no discriminative signal.

Treatment: Drop; near-constant provenance field with a leaked header row.

anthropic:claude-opus-4-7 · confidence high
Out[12]:

saturn.columns["**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. "].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 2
top_value SSA-SA-FYWL.csv
top_rate 0.9991
cardinality 2
entropy 0.01055
entropy_ratio 0.01055
alert: imbalance · top value is 99.9% of rows
Fig 7.
Top values for **Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. (2 unique shown, of 2 total).
value count share
SSA-SA-FYWL.csv 1092 99.9%
File Name 1 0.1%
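
The entropy figures for this column can be reproduced from the two counts alone. This is a minimal sketch assuming saturn reports Shannon entropy in bits and divides by log2 of the cardinality to get `entropy_ratio`; that reading is an inference from the reported numbers, not taken from saturn's documentation. `entropy_bits` and `entropy_ratio` are hypothetical helper names.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# 1,092 rows of "SSA-SA-FYWL.csv" and one stray "File Name".
counts = [1092, 1]
print(round(entropy_bits(counts), 5), round(entropy_ratio(counts), 5))
```

With only two categories the maximum entropy is 1 bit, which is why `entropy` and `entropy_ratio` coincide for this column.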

"" categorical metadata

Binary categorical column with 1093 rows and only 2 distinct values, but it is effectively a constant: "2" appears 1092 times (top_rate 0.999) while "File Version" appears once. The lone "File Version" string alongside numeric "2" suggests a stray header row leaked into the data. Entropy of 0.0106 confirms there is virtually no information here.

Treatment: Drop the column and investigate the stray "File Version" row as a parsing artifact.

anthropic:claude-opus-4-7 · confidence high
Out[15]:

saturn.columns[""].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 2
top_value 2
top_rate 0.9991
cardinality 2
entropy 0.01055
entropy_ratio 0.01055
alert: imbalance · top value is 99.9% of rows
Fig 8.
Top values for "" (2 unique shown, of 2 total).
value count share
2 1092 99.9%
File Version 1 0.1%

_duplicated_0 categorical metadata

This column holds the file's update date: 1092 of 1093 rows hold the single value '3/13/2023', and the lone other entry is the literal string 'Update Date', almost certainly the header row leaked into the data. The `_duplicated_0` name is a placeholder from the header-parsing problem, not evidence of a duplicated column. Entropy is effectively zero (0.0106) and the top rate is 0.999, so the column carries no discriminative signal. The 'Update Date' value also confirms a parsing/ingest issue worth fixing upstream.

Treatment: Drop; constant column with a leaked header value.

anthropic:claude-opus-4-7 · confidence high
Out[18]:

saturn.columns["_duplicated_0"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 2
top_value 3/13/2023
top_rate 0.9991
cardinality 2
entropy 0.01055
entropy_ratio 0.01055
alert: imbalance · top value is 99.9% of rows
Fig 9.
Top values for _duplicated_0 (2 unique shown, of 2 total).
value count share
3/13/2023 1092 99.9%
Update Date 1 0.1%
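
The three metadata columns profiled so far share one pattern: a constant value plus a single stray header label. A minimal sketch of the clean-up follows, under the assumption that the stray labels all sit in one leaked header row. The sample rows, the column keys, and the helper names are hypothetical.

```python
# Header labels that the profile reports as stray values.
LEAKED_LABELS = {"File Name", "File Version", "Update Date"}

# Hypothetical rows keyed by made-up names for the three metadata columns.
rows = [
    {"file_name": "File Name", "file_version": "File Version", "update_date": "Update Date"},
    {"file_name": "SSA-SA-FYWL.csv", "file_version": "2", "update_date": "3/13/2023"},
    {"file_name": "SSA-SA-FYWL.csv", "file_version": "2", "update_date": "3/13/2023"},
]


def drop_leaked_header_rows(rows, labels=LEAKED_LABELS):
    """Keep only rows in which no cell equals a known header label."""
    return [row for row in rows if labels.isdisjoint(row.values())]


def constant_columns(rows):
    """Return the names of columns holding exactly one distinct value."""
    names = list(rows[0].keys()) if rows else []
    return [name for name in names if len({row[name] for row in rows}) == 1]


clean = drop_leaked_header_rows(rows)
print(len(clean), constant_columns(clean))
```

Removing the header row first matters: before it is dropped, none of the three columns counts as constant.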

_duplicated_1 categorical feature

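Three-letter SSA region codes (ATL, DEN, BOS, PHL, CHI, DAL, SEA, SFO, KCM, NYC) across 1093 rows with no nulls. Of the 11 unique values, 10 are region codes; the eleventh, 'Region Code', occurs once and is the header label leaked into the data. The distribution is fairly even (entropy ratio 0.947, top value ATL at only 15.4%), and the differences between regions plausibly reflect how many state codes each one covers. The name `_duplicated_1` is a placeholder from the header-parsing problem described in the Overview, not evidence that the column duplicates another.

Treatment: Keep as the region dimension; remove the leaked 'Region Code' row, then one-hot encode the 10 region codes if modelling requires it.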

anthropic:claude-opus-4-7 · confidence high
Out[21]:

saturn.columns["_duplicated_1"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 11
top_value ATL
top_rate 0.1537
cardinality 11
entropy 3.277
entropy_ratio 0.9473
Fig 10.
Top values for _duplicated_1 (11 unique shown, of 11 total).
value count share
ATL 168 15.4%
DEN 126 11.5%
BOS 126 11.5%
PHL 126 11.5%
CHI 126 11.5%
DAL 105 9.6%
SEA 84 7.7%
SFO 84 7.7%
KCM 84 7.7%
NYC 63 5.8%
Region Code 1 0.1%
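
The region counts are all multiples of 21, which is what one row per state code per fiscal year would produce. A quick arithmetic check follows, assuming 21 yearly rows per state code (an inference from the state and year columns, not something the stats state directly).

```python
# Row counts per region code, copied from the table (header label excluded).
region_rows = {
    "ATL": 168, "DEN": 126, "BOS": 126, "PHL": 126, "CHI": 126,
    "DAL": 105, "SEA": 84, "SFO": 84, "KCM": 84, "NYC": 63,
}
YEARS_PER_CODE = 21  # assumption: each state code contributes 21 yearly rows

codes_per_region = {}
for region, n_rows in region_rows.items():
    n_codes, remainder = divmod(n_rows, YEARS_PER_CODE)
    assert remainder == 0, f"{region} is not a whole number of state codes"
    codes_per_region[region] = n_codes

print(codes_per_region)
print(sum(codes_per_region.values()))  # 52
```

Under that assumption ATL covers 8 state codes and NYC covers 3, and the total of 52 matches the 52 rows per fiscal year seen in `_duplicated_4`.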

_duplicated_2 categorical feature

This column holds two-letter US state/territory abbreviations with a trailing space (e.g. 'AK ', 'AL ', 'AR '), with 53 distinct values across 1093 rows and no nulls. The distribution is almost perfectly uniform (entropy_ratio 0.996; the top value appears just 21 times, 1.92%), suggesting a regular grid in which each code repeats 21 times. Because 53 × 21 would exceed 1,093 rows, the likelier layout is 52 codes × 21 rows plus one leaked header label, matching the stray labels in the neighbouring columns. Fifty-two codes are consistent with the 50 states, DC, and one additional area code. The trailing whitespace in every value is a data-hygiene flag.

Treatment: Strip trailing whitespace and treat as a categorical state code (one-hot or target-encode).

anthropic:claude-opus-4-7 · confidence high
Out[24]:

saturn.columns["_duplicated_2"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 53
top_value AK
top_rate 0.01921
cardinality 53
entropy 5.706
entropy_ratio 0.9961
Fig 11.
Top values for _duplicated_2 (20 unique shown, of 53 total).
value count share
AK 21 1.9%
AL 21 1.9%
AR 21 1.9%
AZ 21 1.9%
CA 21 1.9%
CO 21 1.9%
CT 21 1.9%
DC 21 1.9%
DE 21 1.9%
FL 21 1.9%
GA 21 1.9%
HI 21 1.9%
IA 21 1.9%
ID 21 1.9%
IL 21 1.9%
IN 21 1.9%
KS 21 1.9%
KY 21 1.9%
LA 21 1.9%
MA 21 1.9%
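
The whitespace fix is a one-liner, but it is worth pairing with a format check so that a leaked header label does not survive as a state code. A minimal sketch follows; `normalise_state_code` is a hypothetical helper, and the non-code sample value is invented for illustration.

```python
import re

STATE_CODE = re.compile(r"[A-Z]{2}")


def normalise_state_code(raw):
    """Strip surrounding whitespace; return None if the result is not two capitals."""
    code = raw.strip()
    return code if STATE_CODE.fullmatch(code) else None


samples = ["AK ", "AL ", "DC ", "some header label"]
print([normalise_state_code(s) for s in samples])
```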

_duplicated_3 categorical other

A binary categorical column completely dominated by the value 'FY' (1092 of 1093 rows, top_rate 0.999), with a single stray 'Date Type' entry. Entropy is effectively zero (0.0106), and the name '_duplicated_3' suggests this is a residual from a duplicated header or pivot artifact rather than a real feature. The lone 'Date Type' value looks like a header row that leaked into the data.

Treatment: Drop; constant column with a likely header-leak artifact.

anthropic:claude-opus-4-7 · confidence high
Out[27]:

saturn.columns["_duplicated_3"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 2
top_value FY
top_rate 0.9991
cardinality 2
entropy 0.01055
entropy_ratio 0.01055
alert: imbalance · top value is 99.9% of rows
Fig 12.
Top values for _duplicated_3 (2 unique shown, of 2 total).
value count share
FY 1092 99.9%
Date Type 1 0.1%

_duplicated_4 categorical timestamp

This column holds 22 distinct values across 1,093 rows with zero nulls: year strings from 2001 onward, each of the listed years appearing 52 times, plus what is probably one leaked header label (21 × 52 + 1 = 1,093). The near-uniform distribution (entropy ratio 0.986, top rate just 0.0476) points to a balanced annual panel. The count of 52 matches the number of state codes implied by `_duplicated_2`, not weeks in a year, and the 'FY' date type in `_duplicated_3` indicates fiscal-year records. The `_duplicated_4` name is a placeholder from the header-parsing problem, not a sign that the column duplicates another.

Treatment: Keep; remove the leaked header row, cast to integer year, and use as the time key.

anthropic:claude-opus-4-7 · confidence high
Out[30]:

saturn.columns["_duplicated_4"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 22
top_value 2001
top_rate 0.04758
cardinality 22
entropy 4.399
entropy_ratio 0.9864
Fig 13.
Top values for _duplicated_4 (20 unique shown, of 22 total).
value count share
2001 52 4.8%
2002 52 4.8%
2003 52 4.8%
2004 52 4.8%
2005 52 4.8%
2006 52 4.8%
2007 52 4.8%
2008 52 4.8%
2009 52 4.8%
2010 52 4.8%
2011 52 4.8%
2012 52 4.8%
2013 52 4.8%
2014 52 4.8%
2015 52 4.8%
2016 52 4.8%
2017 52 4.8%
2018 52 4.8%
2019 52 4.8%
2020 52 4.8%
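
A balanced panel can be confirmed by casting the year strings to integers and counting rows per year. The sketch below builds a hypothetical stand-in for the column (21 years × 52 rows plus one non-numeric label). The 2001 to 2021 range is an assumption, and `to_year` and the label text are illustrative names, not saturn output.

```python
from collections import Counter


def to_year(value):
    """Return the value as an int year, or None if it is not all digits."""
    text = value.strip()
    return int(text) if text.isdigit() else None


# Hypothetical stand-in: 52 rows for each year 2001..2021, plus one stray label.
raw_years = [str(y) for y in range(2001, 2022) for _ in range(52)] + ["header label"]

years = [to_year(v) for v in raw_years]
rows_per_year = Counter(y for y in years if y is not None)

# A panel is balanced when every year has the same number of rows.
is_balanced = len(set(rows_per_year.values())) == 1
print(len(raw_years), len(rows_per_year), is_balanced)
```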

_duplicated_5 text identifier

Stored as text, but the values are short numeric tokens (length 6-21, mean 6.85, one word in 99.9% of rows). Cardinality is high (1037 distinct out of 1093), and 56 rows repeat an earlier value (5.1% duplicate rate), which would be unexpected for a true identifier. In a state-by-fiscal-year panel this pattern fits a count measure at least as well as an ID, so the identifier label should be treated as provisional. The single 21-character value is probably the leaked header label, and the name '_duplicated_5' is a placeholder from the header-parsing problem described in the Overview.

Treatment: Remove the leaked header row and cast to integer; investigate the 56 repeated values before using the column as a key.

anthropic:claude-opus-4-7 · confidence high
Out[33]:

saturn.columns["_duplicated_5"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,037
len_min 6
len_max 21
len_mean 6.846
len_median 7
len_p95 8
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 56
duplicate_rate 0.05124
vocab_size 1,039
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: one_word · 99.9% rows are a single word
alert: allcaps · 99.9% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 14.
Character-length distribution for _duplicated_5 (mean: 6.846). Only non-empty bins are listed.
chars    count
6 – 6    265
7 – 7    744
8 – 8    83
21 – 21    1
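
The headline stats for this column can be cross-checked against each other. The sketch assumes that `n_duplicates` is `n` minus `unique`, and that the four non-empty histogram bins correspond to exact lengths 6, 7, 8 and 21. Both readings are inferences that happen to reproduce the reported figures.

```python
n_rows = 1093
n_unique = 1037

# Assumed definition: rows that repeat an earlier value.
n_duplicates = n_rows - n_unique
duplicate_rate = n_duplicates / n_rows
print(n_duplicates, round(duplicate_rate, 5))

# Non-empty bins of the character-length distribution: {length: row count}.
length_counts = {6: 265, 7: 744, 8: 83, 21: 1}
total = sum(length_counts.values())
mean_length = sum(length * count for length, count in length_counts.items()) / total
print(total, round(mean_length, 3))
```

The bins sum to 1,093 rows, and the weighted mean agrees with the reported `len_mean` of 6.846.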

_duplicated_6 text identifier

Almost every value is a single all-caps token of 5-6 characters (len_mean 5.68, one_word_rate 0.999), with 1090 unique values across 1093 rows and only 3 duplicates. Top tokens are mostly numeric strings like '91371', '18795', '158314'. A stray header-like fragment ('ssa', 'disability', 'beneficiaries', 'age', '18-64*') shows that a header row was mixed into the data; if that fragment is this column's real header, the values may be beneficiary counts and not identifiers, which would also explain the near-uniqueness.

Treatment: Strip the stray header row first; then either cast to integer and treat as a count measure, or keep it out of modelling if it proves to be a code.

anthropic:claude-opus-4-7 · confidence high
Out[36]:

saturn.columns["_duplicated_6"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,090
len_min 5
len_max 40
len_mean 5.683
len_median 6
len_p95 6
word_mean 1.005
word_median 1
n_empty 0
n_duplicates 3
duplicate_rate 0.002745
vocab_size 1,094
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique · 99.7% of rows are unique strings
alert: one_word · 99.9% rows are a single word
alert: allcaps · 99.9% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 15.
Character-length distribution for _duplicated_6 (mean: 5.683). Only non-empty bins are listed.
chars    count
5 – 6    397
6 – 7    678
7 – 8    17
39 – 40    1

_duplicated_7 categorical feature

Column is typed categorical but holds 511 distinct numeric strings like "5.50", "5.07", "4.90" across 1093 rows, suggesting a continuous measurement stored as text. Distribution is nearly flat: entropy ratio is 0.968 and the most common value covers only 1.01% of rows. The "_duplicated_7" name is a placeholder from the header-parsing problem described in the Overview, so it is not by itself evidence of a redundant copy.

Treatment: Cast to float and treat as a continuous feature; drop only if a direct comparison shows it duplicates another numeric column.

anthropic:claude-opus-4-7 · confidence medium
Out[39]:

saturn.columns["_duplicated_7"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 511
top_value 5.50
top_rate 0.01006
cardinality 511
entropy 8.71
entropy_ratio 0.9681
Fig 16.
Top values for _duplicated_7 (20 unique shown, of 511 total).
value count share
5.50 11 1.0%
5.07 9 0.8%
4.90 9 0.8%
4.19 8 0.7%
5.08 8 0.7%
4.70 8 0.7%
4.96 7 0.6%
5.29 7 0.6%
4.11 6 0.5%
4.55 6 0.5%
5.18 6 0.5%
4.45 6 0.5%
6.18 6 0.5%
4.98 6 0.5%
5.63 6 0.5%
7.16 6 0.5%
5.33 5 0.5%
5.15 5 0.5%
5.45 5 0.5%
4.71 5 0.5%
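
Recasting the strings is simple as long as a leaked header label is tolerated instead of being allowed to raise. A minimal sketch using a few of the values listed for this column; `to_float` is a hypothetical helper and the non-numeric sample entry is invented.

```python
import statistics


def to_float(value):
    """Parse a decimal string; return None for anything that is not a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


raw = ["5.50", "5.07", "4.90", "4.19", "header label"]
parsed = [to_float(v) for v in raw]
numbers = [x for x in parsed if x is not None]

print(parsed)
print(len(raw) - len(numbers), "value(s) failed to parse")
print(min(numbers), max(numbers), round(statistics.mean(numbers), 3))
```

In pandas the equivalent is `pd.to_numeric(series, errors="coerce")`, which turns unparseable entries into NaN.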

_duplicated_8 text identifier

Single-token, all-caps short strings (length 6-26, mean 6.84, ~1 word each) that are overwhelmingly numeric — top values like '468802', '2702811', '1646445' are integers stored as text. With 1041 unique values across 1093 rows and only 52 duplicates, this looks like a near-unique numeric identifier rather than a feature. The 'allcaps' and Flesch=121.22 signals are artifacts of digit-only tokens; no URLs, emojis, or boilerplate appear.

Treatment: Drop from modelling or use as a join key; cast to integer if needed.

anthropic:claude-opus-4-7 · confidence high
Out[42]:

saturn.columns["_duplicated_8"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,041
len_min 6
len_max 26
len_mean 6.835
len_median 7
len_p95 8
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 52
duplicate_rate 0.04758
vocab_size 1,043
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique · 95.2% of rows are unique strings
alert: one_word · 99.9% rows are a single word
alert: allcaps · 99.9% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 17.
Character-length distribution for _duplicated_8 (mean: 6.835). Only non-empty bins are listed.
chars    count
6 – 6    279
7 – 8    733
8 – 8    80
26 – 26    1

_duplicated_9 text identifier

Almost certainly an identifier-like code column: 1081 unique values across 1093 rows, single-token entries averaging 4.85 characters, and the top repeated values are short numeric strings like '4190' and '8630'. The 99.9% allcaps and one_word rates plus max length of 14 suggest compact alphanumeric codes rather than prose. The 12 duplicates (1.1%) are minor but worth checking given the column is otherwise near-unique.

Treatment: Treat as an identifier; drop from modelling features or use only for joins/lookups.

anthropic:claude-opus-4-7 · confidence high
Out[45]:

saturn.columns["_duplicated_9"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,081
len_min 3
len_max 14
len_mean 4.854
len_median 5
len_p95 6
word_mean 1.001
word_median 1
n_empty 0
n_duplicates 12
duplicate_rate 0.01098
vocab_size 1,082
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique · 98.9% of rows are unique strings
alert: one_word · 99.9% rows are a single word
alert: allcaps · 99.9% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 18.
Character-length distribution for _duplicated_9 (mean: 4.854). Only non-empty bins are listed.
chars    count
3 – 3    1
4 – 4    269
5 – 5    720
6 – 6    102
14 – 14    1

_duplicated_10 categorical feature

Stored as a categorical, but the values are numeric strings clustered tightly around 1.0 (top values include '0.97', '1.11', '1.01', '1.04', '0.92'), suggesting a ratio, multiplier, or normalised index. The distribution is diffuse, with 199 distinct values across 1093 rows and an entropy ratio of 0.929, so no single bucket dominates (top_rate just 0.023). The '_duplicated_10' name is a placeholder from the header-parsing problem described in the Overview and does not by itself indicate a redundant copy.

Treatment: Cast to float and treat as a continuous feature; compare against the other numeric columns before modelling if redundancy is a concern.

anthropic:claude-opus-4-7 · confidence medium
Out[48]:

saturn.columns["_duplicated_10"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 199
top_value 0.97
top_rate 0.02287
cardinality 199
entropy 7.097
entropy_ratio 0.9293
Fig 19.
Top values for _duplicated_10 (20 unique shown, of 199 total).
value count share
0.97 25 2.3%
1.11 24 2.2%
1.01 23 2.1%
1.04 19 1.7%
0.92 19 1.7%
1.08 18 1.6%
1.02 17 1.6%
1.12 16 1.5%
1.07 16 1.5%
1.15 16 1.5%
0.96 15 1.4%
1.00 14 1.3%
1.13 14 1.3%
1.10 14 1.3%
0.89 13 1.2%
1.23 13 1.2%
0.94 13 1.2%
0.90 13 1.2%
1.05 13 1.2%
0.85 13 1.2%

_duplicated_11 text identifier

Probably a short numeric code or count column: 1062 distinct values across 1093 rows, 99.9% one-word and 99.9% all-caps, lengths between 3 and 30 characters with a median of 4. Top tokens are bare numeric strings like '6632' and '1573', each appearing only 2-3 times, so the column behaves like a near-unique value and not a category. The 31 duplicates (2.8%) are minor collisions; the '_duplicated_11' name is a placeholder from the header-parsing problem and does not show that this is a copy of another column.

Treatment: Remove the leaked header row and cast to integer; keep it out of categorical encodings, and decide between measure and key once the real header is recovered.

anthropic:claude-opus-4-7 · confidence high
Out[51]:

saturn.columns["_duplicated_11"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,062
len_min 3
len_max 30
len_mean 4.498
len_median 4
len_p95 5
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 31
duplicate_rate 0.02836
vocab_size 1,064
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique · 97.2% of rows are unique strings
alert: one_word · 99.9% rows are a single word
alert: allcaps · 99.9% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 20.
Character-length distribution for _duplicated_11 (mean: 4.498). Only non-empty bins are listed.
chars    count
3 – 4    11
4 – 4    552
4 – 5    529
29 – 30    1

_duplicated_12 categorical feature

This column holds 69 distinct numeric-looking strings (e.g. '0.38', '0.34', '0.32') across 1093 rows with no nulls, suggesting a decimal ratio or rate stored as text. The distribution is fairly flat (top value '0.38' covers only 5.0% and entropy ratio is 0.905), so no single value dominates. The '_duplicated_12' name is a placeholder from the header-parsing problem described in the Overview; the main thing to flag is the text dtype, not duplication.

Treatment: Cast to float; drop only if a direct comparison confirms it matches another column.

anthropic:claude-opus-4-7 · confidence medium
Out[54]:

saturn.columns["_duplicated_12"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 69
top_value 0.38
top_rate 0.05032
cardinality 69
entropy 5.527
entropy_ratio 0.9048
Fig 21.
Top values for _duplicated_12 (20 unique shown, of 69 total).
value count share
0.38 55 5.0%
0.34 45 4.1%
0.32 43 3.9%
0.40 41 3.8%
0.35 39 3.6%
0.44 36 3.3%
0.37 35 3.2%
0.31 35 3.2%
0.36 33 3.0%
0.33 33 3.0%
0.39 33 3.0%
0.46 32 2.9%
0.43 31 2.8%
0.48 30 2.7%
0.45 30 2.7%
0.42 29 2.7%
0.30 29 2.7%
0.41 27 2.5%
0.52 26 2.4%
0.54 25 2.3%
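
Once the values are numeric they can be grouped into fixed-width bins instead of 69 separate string categories. The sketch below bins the 20 listed values (687 of the 1,093 rows) into 0.05-wide bins and parses with `Decimal` so that bin edges are exact. The bin width and the helper name `bin_floor` are arbitrary choices for illustration.

```python
from collections import Counter
from decimal import Decimal

# The 20 most frequent values and their counts, as listed for this column.
top_values = {
    "0.38": 55, "0.34": 45, "0.32": 43, "0.40": 41, "0.35": 39,
    "0.44": 36, "0.37": 35, "0.31": 35, "0.36": 33, "0.33": 33,
    "0.39": 33, "0.46": 32, "0.43": 31, "0.48": 30, "0.45": 30,
    "0.42": 29, "0.30": 29, "0.41": 27, "0.52": 26, "0.54": 25,
}


def bin_floor(value, width_hundredths=5):
    """Lower edge of the bin containing value, in hundredths ("0.38" -> 35)."""
    hundredths = int(Decimal(value) * 100)
    return hundredths - hundredths % width_hundredths


histogram = Counter()
for value, count in top_values.items():
    histogram[bin_floor(value)] += count

for edge in sorted(histogram):
    print(f"{edge / 100:.2f}-{(edge + 4) / 100:.2f}: {histogram[edge]}")
```

Because only the top 20 of 69 values are available here, the result is a partial histogram; the full column would be needed for the tails.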

_duplicated_13 text identifier

This column holds short, single-token uppercase strings that are almost entirely unique (1079 unique out of 1093), with lengths between 4 and 24 characters and a median of 5. The top-frequency tokens are all numeric strings ('17955', '5808', etc.) appearing only twice each, suggesting this is a near-unique identifier code rather than natural text. The 'allcaps' and 'one_word' rates near 99.9% confirm a structured code format, and the column name '_duplicated_13' hints it was auto-generated during a join or pivot.

Treatment: Drop or use as a join key; not suitable as a modelling feature due to near-uniqueness.

anthropic:claude-opus-4-7 · confidence high
Out[57]:

saturn.columns["_duplicated_13"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,079
len_min 4
len_max 24
len_mean 4.849
len_median 5
len_p95 6
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 14
duplicate_rate 0.01281
vocab_size 1,081
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique · 98.7% of rows are unique strings
alert: one_word · 99.9% rows are a single word
alert: allcaps · 99.9% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 22.
Character-length distribution for _duplicated_13 (mean: 4.849). Only non-empty bins are listed.
chars    count
4 – 4    283
5 – 6    710
6 – 6    99
24 – 24    1

_duplicated_14 categorical feature

This column, labelled `_duplicated_14`, holds 1093 numeric-looking strings (e.g. "31.13", "44.89") with 883 unique values and no nulls, almost certainly a continuous measurement that was ingested as categorical. An entropy ratio of 0.99 and a top frequency of just 4 (0.37%) confirm near-uniqueness, and the `long_tail` alert (707 singleton categories) follows from the same cause. The `_duplicated_` prefix is a placeholder from the header-parsing problem described in the Overview, so it does not by itself mark the column as a redundant copy.

Treatment: Cast to float and keep as a continuous feature; drop only if an equality check against another column shows it is a duplicate.

anthropic:claude-opus-4-7 · confidence high
Out[60]:

saturn.columns["_duplicated_14"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 883
top_value 31.13
top_rate 0.00366
cardinality 883
entropy 9.686
entropy_ratio 0.9897
alert: long_tail · 707 singleton categories
Fig 23.
Top values for _duplicated_14 (20 unique shown, of 883 total).
value count share
31.13 4 0.4%
44.89 3 0.3%
33.20 3 0.3%
47.46 3 0.3%
30.73 3 0.3%
35.51 3 0.3%
41.78 3 0.3%
40.12 3 0.3%
36.06 3 0.3%
29.74 3 0.3%
36.98 3 0.3%
37.02 3 0.3%
38.32 3 0.3%
29.63 3 0.3%
36.17 3 0.3%
30.34 3 0.3%
32.50 3 0.3%
36.14 3 0.3%
32.47 3 0.3%
31.93 3 0.3%
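
Whether a column really duplicates another is cheap to test directly instead of inferring it from a placeholder name. A minimal sketch follows that compares two columns after a float cast; the comparison columns and the helper `mismatch_count` are hypothetical.

```python
import math


def mismatch_count(left, right, tolerance=1e-9):
    """Count positions where two equally long columns of decimal strings differ."""
    if len(left) != len(right):
        raise ValueError("columns must have the same length")
    return sum(
        not math.isclose(float(a), float(b), abs_tol=tolerance)
        for a, b in zip(left, right)
    )


candidate = ["31.13", "44.89", "33.20"]  # values of the kind listed for this column
same_copy = ["31.13", "44.890", "33.2"]  # hypothetical copy with different formatting
other_col = ["31.13", "44.98", "33.20"]  # hypothetical column that differs in one row

print(mismatch_count(candidate, same_copy))  # 0
print(mismatch_count(candidate, other_col))  # 1
```

Comparing parsed floats instead of raw strings keeps formatting differences such as trailing zeros from being counted as mismatches.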

_duplicated_15 text identifier

This column holds short single-token numeric strings (one_word_rate 0.999, len_mean 6.4, max 24) stored as text instead of integers, with 1019 unique values across 1093 rows. The value '0' appears 21 times while every other top value occurs only twice; '0' may be a sentinel or default, although 21 is also the number of fiscal years per state code, so the zeros could belong to a single state code that reports 0 every year. The name '_duplicated_15' is a placeholder from the header-parsing problem, and the 6.8% duplicate rate is not by itself evidence of a redundant copy.

Treatment: Cast to integer after removing the leaked header row; check where the zeros fall before recoding them as missing.

anthropic:claude-opus-4-7 · confidence medium
Out[63]:

saturn.columns["_duplicated_15"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,019
len_min 1
len_max 24
len_mean 6.415
len_median 6
len_p95 7
word_mean 1.003
word_median 1
n_empty 0
n_duplicates 74
duplicate_rate 0.0677
vocab_size 1,022
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9799
boilerplate_rate 0
alert: one_word · 99.9% rows are a single word
alert: allcaps · 98.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 24.
Character-length distribution for _duplicated_15 (mean: 6.415). Only non-empty bins are listed.
chars    count
1 – 2    21
6 – 6    530
7 – 7    541
23 – 24    1
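
Before treating '0' as a sentinel, it helps to see where the zeros fall. One possibility (a guess, not something the stats show) is that a single state code reports 0 in every fiscal year, which would make the zeros structural and not missing data. A sketch with invented rows follows; the state codes and values in the sample are hypothetical.

```python
from collections import Counter

# Hypothetical (state_code, value) pairs standing in for two joined columns.
rows = [
    ("AA", "0"), ("AA", "0"), ("AA", "0"),
    ("BB", "104233"), ("BB", "0"), ("BB", "98117"),
]

zeros_by_code = Counter(code for code, value in rows if value == "0")
rows_by_code = Counter(code for code, _ in rows)

# Codes whose every row is zero: candidates for structural zeros.
always_zero = sorted(
    code for code in rows_by_code if zeros_by_code[code] == rows_by_code[code]
)
print(dict(zeros_by_code), always_zero)
```

If the real data show one code holding all the zeros, recoding '0' as missing would discard genuine information.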

_duplicated_16 text identifier

Despite being typed as text, this column is dominated by short single-token numeric strings (one_word_rate 0.999, len_mean 4.54, max 38) with 1057 unique values across 1093 rows. The top tokens are bare integers like "0" (21 occurrences), "1358", "840", suggesting an ID or numeric code stored as text rather than natural language. The allcaps_rate of 0.98 is an artifact of digits/non-letter content, and the column name `_duplicated_16` implies it was auto-generated during a column-name collision.

Treatment: Drop or treat as a high-cardinality ID; do not tokenize as text.

anthropic:claude-opus-4-7 · confidence high
Out[66]:

saturn.columns["_duplicated_16"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,057
len_min 1
len_max 38
len_mean 4.535
len_median 5
len_p95 5
word_mean 1.004
word_median 1
n_empty 0
n_duplicates 36
duplicate_rate 0.03294
vocab_size 1,061
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9799
boilerplate_rate 0
alert: near_unique 96.7% of rows are unique strings
alert: one_word 99.9% rows are a single word
alert: allcaps 98.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 25. Character-length distribution for _duplicated_16.
Character-length distribution for _duplicated_16 (mean: 4.535224153705398).
chars      count
1 – 2      21
3 – 4      25
4 – 5      446
5 – 6      562
6 – 7      37
7 – 7      1
37 – 38    1
(The remaining 33 of 40 bins are empty.)
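
The treatment above warns against tokenizing `_duplicated_16` as text. One way to test the numeric-code reading is to parse every value as an integer and inspect whatever fails, such as the single 37 to 38 character outlier. This is a minimal sketch, not part of saturn: `parse_int` is a hypothetical helper, the comma handling is an assumption, and the long string is a made-up stand-in for the outlier, whose real content the report does not show.

```python
def parse_int(token):
    """Return the integer value of a digit-only string, else None."""
    # Assumption: a comma, if present, is a thousands separator.
    cleaned = token.strip().replace(",", "")
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return None


# "0", "1358" and "840" are top tokens quoted in the report; the last value
# is a hypothetical stand-in for the one long outlier row.
sample = ["0", "1358", "840", "hypothetical stray header or footnote text"]

parsed = [parse_int(value) for value in sample]
rejected = [value for value, number in zip(sample, parsed) if number is None]

print(parsed)    # integers where parsing succeeded, None elsewhere
print(rejected)  # rows to inspect by hand before casting the column
```

With pandas, `pd.to_numeric(series, errors="coerce")` plays a similar role by turning unparseable values into NaN.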

_duplicated_17 categorical feature

Stored as categorical strings, but the values are numeric ('0.00', '1.68', '0.58', '1.07'), suggesting a small-magnitude continuous measurement that was read as text. Cardinality is high (272 unique across 1093 rows) with a very flat distribution: the top value '0.00' covers only 1.92% of rows and the entropy ratio is 0.949. The '_duplicated_17' name is a placeholder assigned when the header row failed to parse (see Summary), so it does not by itself show that the values repeat another column.

Treatment: Cast to float and check whether it duplicates an existing column; drop if redundant.

anthropic:claude-opus-4-7 · confidence medium
Out[69]:

saturn.columns["_duplicated_17"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 272
top_value 0.00
top_rate 0.01921
cardinality 272
entropy 7.671
entropy_ratio 0.9485
Fig 26. Top values for _duplicated_17.
Top values for _duplicated_17 (20 unique shown, of 272 total).
value  count  share
0.00   21     1.9%
1.68   13     1.2%
0.58   12     1.1%
1.07   12     1.1%
1.08   12     1.1%
1.24   12     1.1%
1.15   12     1.1%
0.64   12     1.1%
1.52   11     1.0%
1.42   11     1.0%
1.18   11     1.0%
1.70   11     1.0%
1.81   11     1.0%
1.20   10     0.9%
1.09   10     0.9%
1.44   10     0.9%
1.11   10     0.9%
0.94   10     0.9%
1.78   10     0.9%
1.56   10     0.9%
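
The treatment for `_duplicated_17` (cast to float, then check for redundancy) can be sketched in plain Python. Both helpers are hypothetical names, and the two comparison columns are invented for illustration; only the four `_duplicated_17` values come from the report.

```python
def to_float(token):
    """Parse a decimal string; return None when it is not numeric."""
    try:
        return float(token)
    except (TypeError, ValueError):
        return None


def is_exact_duplicate(col_a, col_b):
    """True when both columns parse to the same value in every row.

    Rows where neither side parses compare equal (None == None); that is a
    choice made for this sketch.
    """
    if len(col_a) != len(col_b):
        return False
    return all(to_float(a) == to_float(b) for a, b in zip(col_a, col_b))


col_17 = ["0.00", "1.68", "0.58", "1.07"]       # values quoted in the report
same_numbers = ["0.0", "1.680", "0.58", "1.07"]  # hypothetical: same values, other formatting
different = ["0.00", "1.68", "0.58", "1.08"]     # hypothetical: last row differs

print([to_float(v) for v in col_17])
print(is_exact_duplicate(col_17, same_numbers))
print(is_exact_duplicate(col_17, different))
```

Comparing parsed numbers rather than raw strings matters here because '1.68' and '1.680' are different strings but the same measurement.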

_duplicated_18 text identifier

Despite being typed as text, this column holds single-token numeric strings (one_word_rate 0.999, word_mean 1.00, len_mean 6.4) with 1021 unique values across 1093 rows — effectively a high-cardinality numeric ID stored as text. The value '0' appears 20 times while every other top value occurs at most twice, hinting at '0' as a sentinel/placeholder amid otherwise near-unique IDs. The 'allcaps' alert is a quirk of digit-only strings rather than meaningful casing.

Treatment: Cast to integer (treating '0' as missing) or drop as near-unique identifier before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[72]:

saturn.columns["_duplicated_18"].stats

stat value
n 1,093
nulls 1 (0.1%)
unique 1,021
len_min 1
len_max 26
len_mean 6.407
len_median 6
len_p95 7
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 71
duplicate_rate 0.06502
vocab_size 1,023
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9808
boilerplate_rate 0
alert: one_word 99.9% rows are a single word
alert: allcaps 98.1% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 27. Character-length distribution for _duplicated_18.
Character-length distribution for _duplicated_18 (mean: 6.406593406593407).
chars      count
1 – 2      20
5 – 5      1
6 – 7      545
7 – 7      525
25 – 26    1
(The remaining 35 of 40 bins are empty.)
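
The cast suggested for `_duplicated_18` (integer, with '0' treated as missing) might look like the sketch below. `to_int_or_missing` is a hypothetical helper, and every sample value except '0' is invented. Whether '0' really is a sentinel is the report's guess; if the column is a count, a genuine zero is plausible, so the sentinel is a parameter rather than hard-coded.

```python
def to_int_or_missing(token, sentinel="0"):
    """Parse a digit string to int; map the sentinel and non-numeric input to None."""
    if token is None:
        return None
    cleaned = token.strip()
    if cleaned == sentinel:
        return None
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return None


# Only "0" is quoted in the report; the other values are hypothetical.
raw = ["0", "123456", "7654321", None, "n/a"]

as_missing = [to_int_or_missing(value) for value in raw]
keep_zero = [to_int_or_missing(value, sentinel=None) for value in raw]

print(as_missing)  # "0" becomes None
print(keep_zero)   # "0" is kept as the integer 0
```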

_duplicated_19 text identifier

Despite being typed as text, this column is essentially short numeric tokens — 99.9% are single words with mean length 4.05 characters and a max of 32. With 1018 unique values across 1093 rows and the most common entry '0' appearing only 21 times, it behaves like a high-cardinality numeric identifier stored as strings. The 'allcaps' alert (97.99%) is an artifact of digits having no lowercase form rather than a meaningful signal.

Treatment: Cast to integer and treat as an ID; drop from modelling features unless joined as a key.

anthropic:claude-opus-4-7 · confidence high
Out[75]:

saturn.columns["_duplicated_19"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,018
len_min 1
len_max 32
len_mean 4.047
len_median 4
len_p95 5
word_mean 1.004
word_median 1
n_empty 0
n_duplicates 75
duplicate_rate 0.06862
vocab_size 1,022
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9799
boilerplate_rate 0
alert: one_word 99.9% rows are a single word
alert: allcaps 98.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 28. Character-length distribution for _duplicated_19.
Character-length distribution for _duplicated_19 (mean: 4.046660567246112).
chars      count
1 – 2      21
3 – 3      181
3 – 4      623
5 – 6      267
31 – 32    1
(The remaining 35 of 40 bins are empty.)
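
The allcaps quirk noted for `_duplicated_19` is easy to reproduce. How saturn actually computes `allcaps_rate` is not stated in this report, so the "naive" check below is only an assumption about one way such a result can arise; the tokens are illustrative.

```python
def naive_allcaps(text):
    """Flags any string unchanged by upper-casing, which includes digit strings."""
    return text == text.upper()


def strict_allcaps(text):
    """str.isupper() requires at least one cased character, so digits fail."""
    return text.isupper()


tokens = ["0", "1358", "ATL", "Atl"]

naive = [naive_allcaps(token) for token in tokens]
strict = [strict_allcaps(token) for token in tokens]

print(naive)   # [True, True, True, False]
print(strict)  # [False, False, True, False]
```

Under the naive check a column of digit strings scores close to 100% all-caps, which matches the pattern of alerts on the numeric-looking text columns here.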

_duplicated_20 categorical feature

Despite being typed categorical, the values shown are two-decimal numeric strings (0.00 to 0.71 among the top 20), suggesting a proportion or rate that was stored as text. The distribution is nearly flat (entropy ratio 0.907), with the modal value '0.30' covering only 2.6% of 1093 rows and no nulls. The '_duplicated_20' name is a placeholder from the mis-parsed header row (see Summary) rather than evidence that the column copies another one.

Treatment: Cast strings to float and treat as a numeric feature; verify against the source column and drop if it is an exact duplicate.

anthropic:claude-opus-4-7 · confidence medium
Out[78]:

saturn.columns["_duplicated_20"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 156
top_value 0.30
top_rate 0.02562
cardinality 156
entropy 6.609
entropy_ratio 0.9071
Fig 29. Top values for _duplicated_20.
Top values for _duplicated_20 (20 unique shown, of 156 total).
value  count  share
0.30   28     2.6%
0.35   26     2.4%
0.33   26     2.4%
0.37   24     2.2%
0.45   24     2.2%
0.36   23     2.1%
0.61   23     2.1%
0.42   22     2.0%
0.00   21     1.9%
0.43   21     1.9%
0.38   20     1.8%
0.40   19     1.7%
0.48   19     1.7%
0.32   19     1.7%
0.41   18     1.6%
0.39   18     1.6%
0.58   18     1.6%
0.18   18     1.6%
0.57   17     1.6%
0.71   17     1.6%
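
The `entropy_ratio` figures in this report are consistent with Shannon entropy in bits divided by log2 of the cardinality, so that 1.0 means a perfectly flat distribution. That reading is inferred from the reported numbers, not taken from saturn's documentation. A small sketch, with hypothetical function names:

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the value distribution, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # convention chosen for this sketch
    return entropy_bits(values) / math.log2(k)


# (entropy, cardinality, entropy_ratio) as reported for two columns.
reported = {
    "_duplicated_17": (7.671, 272, 0.9485),
    "_duplicated_20": (6.609, 156, 0.9071),
}
for name, (entropy, cardinality, ratio) in reported.items():
    recomputed = entropy / math.log2(cardinality)
    print(name, round(recomputed, 3), ratio)
```

The recomputed ratios agree with the reported ones to within the rounding of the published entropy values.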

_duplicated_21 categorical identifier

The `_duplicated_21` label is a placeholder from the mis-parsed header row (see Summary), not evidence of duplicated content; the values appear to be numeric strings stored as categorical. With 957 unique values across 1093 rows and an entropy ratio of 0.9885, the column is nearly an identifier: the only meaningful concentration is `"0"` at 21 occurrences (1.92%), likely a sentinel or default. The long_tail alert (852 singleton categories) and near-unique cardinality mean it carries almost no categorical signal as-is.

Treatment: Drop as a near-unique column, or recover its original header and retype it before any modelling.

anthropic:claude-opus-4-7 · confidence high
Out[81]:

saturn.columns["_duplicated_21"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 957
top_value 0
top_rate 0.01921
cardinality 957
entropy 9.789
entropy_ratio 0.9885
alert: long_tail 852 singleton categories
Fig 30. Top values for _duplicated_21.
Top values for _duplicated_21 (20 unique shown, of 957 total).
value  count  share
0      21     1.9%
1321   4      0.4%
352    3      0.3%
597    3      0.3%
777    3      0.3%
580    3      0.3%
1184   3      0.3%
1353   3      0.3%
710    3      0.3%
3128   3      0.3%
463    3      0.3%
227    3      0.3%
1043   2      0.2%
421    2      0.2%
2891   2      0.2%
5079   2      0.2%
3228   2      0.2%
299    2      0.2%
3337   2      0.2%
238    2      0.2%
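
The long_tail alert counts categories that occur exactly once. A minimal sketch of that count, plus the arithmetic it implies for `_duplicated_21`; `singleton_count` is a hypothetical name and the toy list is invented.

```python
from collections import Counter


def singleton_count(values):
    """Number of distinct values that appear exactly once."""
    counts = Counter(values)
    return sum(1 for count in counts.values() if count == 1)


# Toy data: "0" and "1321" repeat, "352" and "597" are singletons.
toy = ["0", "0", "0", "1321", "1321", "352", "597"]
print(singleton_count(toy))

# Figures reported for _duplicated_21.
n_rows, n_unique, n_singletons = 1093, 957, 852
repeated_categories = n_unique - n_singletons  # categories seen more than once
rows_in_repeated = n_rows - n_singletons       # rows those categories cover
print(repeated_categories, rows_in_repeated)
```

So 105 repeated values account for 241 rows, and 21 of those rows are the single value "0".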

_duplicated_22 categorical feature

This column holds 70 distinct short decimal strings clustered tightly around 0.16–0.25, suggesting a numeric ratio or rate (perhaps a proportion or probability) that has been stored as text. The distribution is fairly even, with the top value '0.18' taking only 7.0% of rows and an entropy ratio of 0.84, so no single bucket dominates. The 'categorical' kind reflects that saturn parsed the values as strings rather than floats; the '_duplicated_22' name is a placeholder from the mis-parsed header row (see Summary).

Treatment: Cast to float and verify it is not redundant with another column before modelling.

anthropic:claude-opus-4-7 · confidence medium
Out[84]:

saturn.columns["_duplicated_22"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 70
top_value 0.18
top_rate 0.07045
cardinality 70
entropy 5.138
entropy_ratio 0.8383
Fig 31. Top values for _duplicated_22.
Top values for _duplicated_22 (20 unique shown, of 70 total).
value  count  share
0.18   77     7.0%
0.20   75     6.9%
0.21   64     5.9%
0.22   62     5.7%
0.25   61     5.6%
0.17   61     5.6%
0.23   49     4.5%
0.19   46     4.2%
0.24   45     4.1%
0.16   43     3.9%
0.15   41     3.8%
0.26   36     3.3%
0.14   27     2.5%
0.12   26     2.4%
0.27   26     2.4%
0.13   25     2.3%
0.11   25     2.3%
0.10   22     2.0%
0.00   21     1.9%
0.29   20     1.8%

_duplicated_23 text identifier

Despite the text kind, nearly every value is a single short token (one_word_rate 0.999, word_mean 1.004, len_mean 4.05, len_max 37) and the top values are all numeric strings like "0", "406", "404". With 1028 unique values across 1093 rows and a 5.9% duplicate_rate dominated by "0" (21 occurrences), this looks like a numeric identifier or count stored as text. The allcaps_rate of 0.98 is a quirk of digit-only strings being flagged as uppercase.

Treatment: Cast to integer and treat as a numeric ID or count rather than free text.

anthropic:claude-opus-4-7 · confidence high
Out[87]:

saturn.columns["_duplicated_23"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,028
len_min 1
len_max 37
len_mean 4.053
len_median 4
len_p95 5
word_mean 1.004
word_median 1
n_empty 0
n_duplicates 65
duplicate_rate 0.05947
vocab_size 1,032
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9799
boilerplate_rate 0
alert: one_word 99.9% rows are a single word
alert: allcaps 98.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 32. Character-length distribution for _duplicated_23.
Character-length distribution for _duplicated_23 (mean: 4.053064958828911).
chars      count
1 – 2      21
3 – 4      182
4 – 5      619
5 – 6      270
36 – 37    1
(The remaining 35 of 40 bins are empty.)
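
The `n_duplicates` and `duplicate_rate` figures for the text columns can be reproduced from `n`, `nulls` and `unique`. The formula below is inferred from the reported numbers (duplicates are non-null rows beyond the first occurrence of each value, and the rate is taken over non-null rows); it is not confirmed as saturn's definition.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate) under the inferred definition."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null


# (n, nulls, unique) as reported for three columns.
reported = {
    "_duplicated_18": (1093, 1, 1021),
    "_duplicated_19": (1093, 0, 1018),
    "_duplicated_23": (1093, 0, 1028),
}
for name, (n_rows, n_nulls, n_unique) in reported.items():
    n_dup, rate = duplicate_stats(n_rows, n_nulls, n_unique)
    print(name, n_dup, round(rate, 5))
```

The three results match the stats tables: 71 and 0.06502, 75 and 0.06862, 65 and 0.05947.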

_duplicated_24 categorical feature

Despite being typed categorical, the values are numeric strings (e.g. '0.00', '47.52', '51.82'), suggesting a monetary or measurement field that was read as text. With 900 unique values across 1093 rows and an entropy ratio of 0.9874, it is nearly unique; the only meaningful concentration is '0.00' at 1.38% of non-null rows (15 rows). The '_duplicated_24' name is a placeholder from the mis-parsed header row (see Summary), so it does not by itself show that the column repeats another one.

Treatment: Cast to float and treat as numeric; verify whether it duplicates another column before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[90]:

saturn.columns["_duplicated_24"].stats

stat value
n 1,093
nulls 6 (0.5%)
unique 900
top_value 0.00
top_rate 0.0138
cardinality 900
entropy 9.69
entropy_ratio 0.9874
alert: long_tail 756 singleton categories
Fig 33. Top values for _duplicated_24.
Top values for _duplicated_24 (20 unique shown, of 900 total).
value  count  share
0.00   15     1.4%
47.52  5      0.5%
51.82  4      0.4%
47.04  4      0.4%
54.24  4      0.4%
48.91  4      0.4%
51.90  3      0.3%
48.89  3      0.3%
37.97  3      0.3%
51.35  3      0.3%
44.18  3      0.3%
63.06  3      0.3%
40.64  3      0.3%
38.66  3      0.3%
57.98  3      0.3%
30.92  3      0.3%
53.94  3      0.3%
60.20  3      0.3%
39.15  3      0.3%
30.05  3      0.3%
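
`_duplicated_24` has 6 nulls, so it matters which denominator `top_rate` uses. The reported 0.0138 is reproduced only when nulls are excluded; that is an inference from the numbers, and `top_rate` below is a hypothetical helper rather than saturn's code.

```python
def top_rate(top_count, n_rows, n_nulls=0, skip_nulls=True):
    """Share of rows holding the most common value."""
    denominator = n_rows - n_nulls if skip_nulls else n_rows
    return top_count / denominator


over_non_null = top_rate(15, 1093, n_nulls=6)                    # 15 / 1087
over_all_rows = top_rate(15, 1093, n_nulls=6, skip_nulls=False)  # 15 / 1093

print(round(over_non_null, 4))  # 0.0138, matches the stats table
print(round(over_all_rows, 4))  # 0.0137
```

Both versions round to the 1.4% shown in the top-values table, which is why the difference only shows up at four decimals.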

_duplicated_25 text identifier

Almost every value is a single short ALLCAPS token (one_word_rate 0.999, allcaps_rate 0.999, len_mean 4.9, word_mean 1.0), and 1088 of 1093 rows are unique with only 5 duplicates. The top tokens are mostly numeric strings like '3584' or '14860', suggesting this is a near-unique short code rather than natural text. The column name '_duplicated_25' hints it was auto-generated from a duplicated source column during profiling.

Treatment: Drop or treat as an ID key; do not tokenize as free text.

anthropic:claude-opus-4-7 · confidence high
Out[93]:

saturn.columns["_duplicated_25"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,088
len_min 4
len_max 18
len_mean 4.906
len_median 5
len_p95 6
word_mean 1.001
word_median 1
n_empty 0
n_duplicates 5
duplicate_rate 0.004575
vocab_size 1,089
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique 99.5% of rows are unique strings
alert: one_word 99.9% rows are a single word
alert: allcaps 99.9% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 34. Character-length distribution for _duplicated_25.
Character-length distribution for _duplicated_25 (mean: 4.90576395242452).
chars      count
4 – 4      248
5 – 5      712
6 – 6      132
18 – 18    1
(The remaining 36 of 40 bins are empty.)
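
Bin labels such as "4 – 4" or a repeated "5 – 5" look odd, but they are what one would get from 40 equal-width bins between `len_min` and `len_max` whose edges are rounded to whole numbers for display. This is a reconstruction that happens to match the table for `_duplicated_25`, not saturn's documented behaviour; both function names are hypothetical.

```python
def rounded_bin_labels(lo, hi, n_bins=40):
    """Labels for equal-width bins, with each edge rounded to an integer."""
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    return [f"{round(left)} – {round(right)}" for left, right in zip(edges, edges[1:])]


def bin_index(length, lo, hi, n_bins=40):
    """Index of the bin a given length falls into (last bin is closed)."""
    width = (hi - lo) / n_bins
    return min(int((length - lo) / width), n_bins - 1)


# len_min and len_max reported for _duplicated_25.
labels = rounded_bin_labels(4, 18)

print(labels[:6])
for length in (4, 5, 6, 18):
    print(length, "->", labels[bin_index(length, 4, 18)])
```

With a bin width of 0.35 characters, each integer length lands in exactly one bin and the bins between them stay empty, which explains the long runs of zero counts in the original table.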

_duplicated_26 text identifier

Single-token, all-caps strings averaging 4.57 characters with 1069 unique values across 1093 rows — almost certainly an identifier or short code column. The top values are all numeric strings (e.g., '2280', '2086') appearing 2-3 times each, suggesting these are numeric IDs stored as text rather than meaningful tokens. The 99.9% one-word and all-caps rates plus near-unique cardinality rule out free text.

Treatment: Treat as a categorical/key field; drop from modelling features or use only for joins.

anthropic:claude-opus-4-7 · confidence high
Out[96]:

saturn.columns["_duplicated_26"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 1,069
len_min 3
len_max 28
len_mean 4.574
len_median 5
len_p95 5
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 24
duplicate_rate 0.02196
vocab_size 1,071
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9991
allcaps_rate 0.9991
boilerplate_rate 0
alert: near_unique 97.8% of rows are unique strings
alert: one_word 99.9% rows are a single word
alert: allcaps 99.9% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 35. Character-length distribution for _duplicated_26.
Character-length distribution for _duplicated_26 (mean: 4.573650503202196).
chars      count
3 – 4      4
4 – 4      487
5 – 6      595
6 – 6      6
27 – 28    1
(The remaining 35 of 40 bins are empty.)

_duplicated_27 categorical feature

Stored as categorical strings, but every observed value parses as a two-decimal number (e.g. '37.60', '41.85'), so this is almost certainly a numeric measurement, possibly a price, rate or score, that was ingested as text. With 873 unique values across 1093 rows and an entropy ratio of 0.989, it is near-unique; the most frequent value '37.60' appears just 4 times (top rate 0.37%). The '_duplicated_27' name is a placeholder from the mis-parsed header row (see Summary), so it does not by itself show that the column repeats another one.

Treatment: Cast to float and treat as a numeric feature; verify it is not redundant with another column.

anthropic:claude-opus-4-7 · confidence high
Out[99]:

saturn.columns["_duplicated_27"].stats

stat value
n 1,093
nulls 0 (0.0%)
unique 873
top_value 37.60
top_rate 0.00366
cardinality 873
entropy 9.662
entropy_ratio 0.989
alert: long_tail 693 singleton categories
Fig 36. Top values for _duplicated_27.
Top values for _duplicated_27 (20 unique shown, of 873 total).
value  count  share
37.60  4      0.4%
36.60  4      0.4%
41.85  4      0.4%
38.47  4      0.4%
49.19  3      0.3%
32.63  3      0.3%
42.28  3      0.3%
29.96  3      0.3%
42.14  3      0.3%
38.12  3      0.3%
33.04  3      0.3%
40.70  3      0.3%
40.45  3      0.3%
33.84  3      0.3%
30.27  3      0.3%
31.35  3      0.3%
39.43  3      0.3%
33.77  3      0.3%
30.69  3      0.3%
31.39  3      0.3%

How to cite

BibTeX
@misc{saturn-accessibility-ssa-sa-fywl-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: accessibility ssa sa fywl},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/accessibility-ssa_sa_fywl}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: accessibility ssa sa fywl. Source: /home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/accessibility-ssa_sa_fywl