parking-parking_violations_sample_20260119

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/parking/parking_violations_sample_20260119.json

Saturn profiled 10,000 rows across 40 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

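# Install the profiler (IPython shell escape; run this inside a notebook cell).
!pip install saturn-dissect

import subprocess

# Run the Saturn CLI on the sample with the same arguments as the original cell.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/parking/parking_violations_sample_20260119.json",
        "--findings", "parking-parking_violations_sample_20260119.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)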

Summary confidence: high

This is a 10,000-row sample of NYC parking violations with 40 fields covering ticket metadata, vehicle attributes, and location/precinct codes. The violation mix is dominated by one category: 'PHTO SCHOOL ZN SPEED VIOLATION' accounts for 4,416 tickets (44.2% of all rows, about 52% of non-null descriptions), which also drives the issuing_agency skew toward 'V' and law_section '408'. Geographically, registration_state is heavily NY (6,935) with NJ and PA trailing, and violation_county splits mainly across Queens, Manhattan, Brooklyn, and the Bronx, but with inconsistent codes (e.g., 'QN' vs 'Qns', 'BK' vs 'Kings'). Watch out for heavy nulls and placeholder zeros: meter_number, unregistered_vehicle, time_first_observed, and violation_post_code are more than 70% null, while issuer_precinct, feet_from_curb, and street_code* are dominated by '0' sentinel values. vehicle_color also has unnormalized variants ('BK'/'BLK'/'BLACK', 'GY'/'GREY'/'GRY') that will need cleanup before any analysis.

citing: violation_description · issuing_agency · registration_state · violation_county · vehicle_make · vehicle_body_type · vehicle_color · plate_type · law_section · meter_number · unregistered_vehicle · violation_post_code · feet_from_curb · issuer_precinct

Out[4]:

saturn.schema() · 40 columns

column	kind	n	null%	unique	alerts
summons_number	text	10,000	0.0%	10,000	near_unique one_word allcaps short_text
plate_id	text	10,000	0.0%	9,519	near_unique one_word allcaps short_text
registration_state	categorical	10,000	0.0%	50
plate_type	categorical	10,000	0.0%	27
issue_date	categorical	10,000	0.0%	687	long_tail
violation_code	categorical	10,000	0.0%	62
vehicle_body_type	categorical	10,000	1.3%	81
vehicle_make	categorical	10,000	0.8%	126
issuing_agency	categorical	10,000	0.0%	20
street_code1	text	10,000	0.0%	1,420	one_word allcaps short_text duplicates
street_code2	text	10,000	0.0%	1,349	one_word allcaps short_text duplicates
street_code3	text	10,000	0.0%	1,317	one_word allcaps short_text duplicates
vehicle_expiration_date	text	10,000	0.0%	1,040	one_word allcaps short_text duplicates
violation_location	categorical	10,000	44.9%	87	null_rate
violation_precinct	categorical	10,000	0.0%	88
issuer_precinct	categorical	10,000	0.0%	121
issuer_code	text	10,000	0.0%	1,420	one_word allcaps short_text duplicates
issuer_command	categorical	10,000	44.2%	228	null_rate
issuer_squad	categorical	10,000	63.6%	23	null_rate
violation_time	text	10,000	0.0%	1,432	one_word allcaps short_text duplicates
violation_county	categorical	10,000	2.7%	13
violation_in_front_of_or_opposite	categorical	10,000	46.9%	4	null_rate
street_name	text	10,000	0.0%	3,115	multilingual allcaps duplicates
intersecting_street	text	10,000	48.1%	1,413	allcaps null_rate short_text duplicates
date_first_observed	categorical	10,000	0.0%	147	long_tail imbalance
law_section	categorical	10,000	0.0%	2
sub_division	categorical	10,000	0.0%	76
days_parking_in_effect	categorical	10,000	50.6%	21	null_rate
from_hours_in_effect	categorical	10,000	68.3%	26	null_rate
to_hours_in_effect	categorical	10,000	68.3%	31	null_rate
vehicle_color	categorical	10,000	9.4%	99
unregistered_vehicle	categorical	10,000	84.9%	1	null_rate imbalance
vehicle_year	categorical	10,000	0.0%	39
meter_number	categorical	10,000	84.8%	7	long_tail null_rate imbalance
feet_from_curb	categorical	10,000	0.0%	11	imbalance
house_number	text	10,000	47.7%	2,350	one_word allcaps null_rate short_text duplicates
time_first_observed	categorical	10,000	78.5%	207	long_tail null_rate
violation_description	categorical	10,000	15.1%	74
violation_post_code	categorical	10,000	78.7%	53	null_rate
violation_legal_code	categorical	10,000	55.8%	1	null_rate imbalance

Fig 1.

violation_description · Shows how heavily school-zone speed violations dominate the ticket mix versus traditional parking offenses.

Top values for violation_description (20 unique shown, of 74 total).
value	count	share
PHTO SCHOOL ZN SPEED VIOLATION	4416	44.2%
No Parking Street Cleaning	1428	14.3%
14-No Standing	598	6.0%
40-Fire Hydrant	463	4.6%
20A-No Parking (Non-COM)	131	1.3%
19-No Stand (bus stop)	123	1.2%
16A-No Std (Com Veh) Non-COM	117	1.2%
46A-Double Parking (Non-COM)	106	1.1%
Detached Trailer	105	1.1%
Fire Hydrant	94	0.9%
71A-Insp Sticker Expired (NYS)	77	0.8%
70A-Reg. Sticker Expired (NYS)	66	0.7%
No Standing	61	0.6%
Missing Equipment	56	0.6%
50-Crosswalk	53	0.5%
74-Missing Display Plate	39	0.4%
13-No Stand (taxi stand)	38	0.4%
17-No Stand (exc auth veh)	38	0.4%
Double Parking	32	0.3%
98-Obstructing Driveway	31	0.3%

Fig 2.

registration_state · Confirms NY plates make up the bulk, with NJ and PA as the main out-of-state contributors.

Top values for registration_state (20 unique shown, of 50 total).
value	count	share
NY	6935	69.3%
NJ	917	9.2%
PA	470	4.7%
CT	247	2.5%
FL	235	2.4%
VA	220	2.2%
ME	125	1.2%
MA	124	1.2%
99	88	0.9%
GA	65	0.7%
NC	64	0.6%
TN	56	0.6%
MD	53	0.5%
TX	36	0.4%
OH	31	0.3%
IN	30	0.3%
IL	28	0.3%
CA	27	0.3%
SC	26	0.3%
MI	24	0.2%

Fig 3.

vehicle_make · Highlights the most-ticketed manufacturers, led by Honda and Toyota.

Top values for vehicle_make (20 unique shown, of 126 total).
value	count	share
HONDA	1331	13.3%
TOYOT	1302	13.0%
NISSA	770	7.7%
FORD	603	6.0%
BMW	559	5.6%
ME/BE	521	5.2%
JEEP	450	4.5%
CHEVR	449	4.5%
HYUND	365	3.6%
SUBAR	273	2.7%
KIA	268	2.7%
LEXUS	268	2.7%
MAZDA	257	2.6%
AUDI	242	2.4%
ACURA	221	2.2%
VOLKS	199	2.0%
DODGE	175	1.8%
TESLA	152	1.5%
INFIN	151	1.5%
GMC	134	1.3%

Fig 4.

violation_county · Useful for spotting both the borough distribution and the inconsistent county code spellings that need normalizing.

Top values for violation_county (13 unique shown, of 13 total).
value	count	share
QN	1773	17.7%
NY	1472	14.7%
BK	1233	12.3%
BX	1027	10.3%
K	936	9.4%
Q	902	9.0%
Kings	737	7.4%
Qns	414	4.1%
ST	403	4.0%
Bronx	395	4.0%
MN	314	3.1%
R	116	1.2%
Rich	7	0.1%
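
The inconsistent spellings in the table above can be folded into one label per borough. The sketch below is an illustration rather than Saturn output: the code-to-borough mapping (for example, that K and Kings both mean Brooklyn, and that R, Rich, and ST all mean Staten Island) is an assumption based on common NYC county abbreviations, and `normalize_county` is a hypothetical helper name.

```python
from collections import Counter

# Assumed mapping from raw violation_county codes to borough names.
COUNTY_TO_BOROUGH = {
    "QN": "Queens", "Q": "Queens", "Qns": "Queens",
    "NY": "Manhattan", "MN": "Manhattan",
    "BK": "Brooklyn", "K": "Brooklyn", "Kings": "Brooklyn",
    "BX": "Bronx", "Bronx": "Bronx",
    "ST": "Staten Island", "R": "Staten Island", "Rich": "Staten Island",
}


def normalize_county(code):
    """Return the borough name, or None for missing or unrecognized codes."""
    if code is None:
        return None
    return COUNTY_TO_BOROUGH.get(code.strip())


# Counts copied from the violation_county table above.
raw_counts = {
    "QN": 1773, "NY": 1472, "BK": 1233, "BX": 1027, "K": 936, "Q": 902,
    "Kings": 737, "Qns": 414, "ST": 403, "Bronx": 395, "MN": 314,
    "R": 116, "Rich": 7,
}

borough_totals = Counter()
for code, count in raw_counts.items():
    borough_totals[normalize_county(code)] += count
```

Under this mapping the thirteen raw codes collapse to five boroughs, and the mapped rows total 9,729, which leaves 271 rows and is in line with the 2.7% null rate reported for the column.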

Fig 5.

vehicle_color · Reveals the long tail of color codes and duplicate labels (BK/BLK/BLACK) that should be consolidated.

Top values for vehicle_color (20 unique shown, of 99 total).
value	count	share
GY	2079	20.8%
BK	1784	17.8%
WH	1579	15.8%
BL	631	6.3%
RD	348	3.5%
WHITE	347	3.5%
BLK	275	2.8%
BLACK	273	2.7%
GREY	239	2.4%
GRY	167	1.7%
GR	148	1.5%
BLUE	131	1.3%
RED	124	1.2%
GRAY	113	1.1%
SILVE	99	1.0%
WHT	65	0.7%
YW	62	0.6%
BR	60	0.6%
WHI	57	0.6%
BLU	46	0.5%
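
One way to consolidate the duplicate color labels is an alias table that maps each variant to a canonical name and passes unknown codes through for manual review. This is a sketch under assumptions: the alias groups below are inferred from the top-20 values in the table above (for instance, that BL means blue rather than black), the ambiguous code GR is deliberately left unmapped, and `normalize_color` is a hypothetical helper.

```python
# Assumed alias groups for the most common vehicle_color codes.
COLOR_ALIASES = {
    "GRAY": ("GY", "GREY", "GRY", "GRAY"),
    "BLACK": ("BK", "BLK", "BLACK"),
    "WHITE": ("WH", "WHITE", "WHT", "WHI"),
    "BLUE": ("BL", "BLUE", "BLU"),
    "RED": ("RD", "RED"),
}

# Invert the groups into a flat alias -> canonical lookup.
ALIAS_TO_COLOR = {
    alias: canonical
    for canonical, aliases in COLOR_ALIASES.items()
    for alias in aliases
}


def normalize_color(raw):
    """Return (color, was_mapped); unmapped codes pass through uppercased."""
    if raw is None:
        return None, False
    code = raw.strip().upper()
    if code in ALIAS_TO_COLOR:
        return ALIAS_TO_COLOR[code], True
    return code, False
```

Returning the `was_mapped` flag makes it easy to list the remaining long-tail codes that still need a decision instead of silently guessing at them.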

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
summons_number	text	0.0%
plate_id	text	0.0%
registration_state	categorical	0.0%
plate_type	categorical	0.0%
issue_date	categorical	0.0%
violation_code	categorical	0.0%
vehicle_body_type	categorical	1.3%
vehicle_make	categorical	0.8%
issuing_agency	categorical	0.0%
street_code1	text	0.0%
street_code2	text	0.0%
street_code3	text	0.0%
vehicle_expiration_date	text	0.0%
violation_location	categorical	44.9%
violation_precinct	categorical	0.0%
issuer_precinct	categorical	0.0%
issuer_code	text	0.0%
issuer_command	categorical	44.2%
issuer_squad	categorical	63.6%
violation_time	text	0.0%
violation_county	categorical	2.7%
violation_in_front_of_or_opposite	categorical	46.9%
street_name	text	0.0%
intersecting_street	text	48.1%
date_first_observed	categorical	0.0%
law_section	categorical	0.0%
sub_division	categorical	0.0%
days_parking_in_effect	categorical	50.6%
from_hours_in_effect	categorical	68.3%
to_hours_in_effect	categorical	68.3%
vehicle_color	categorical	9.4%
unregistered_vehicle	categorical	84.9%
vehicle_year	categorical	0.0%
meter_number	categorical	84.8%
feet_from_curb	categorical	0.0%
house_number	text	47.7%
time_first_observed	categorical	78.5%
violation_description	categorical	15.1%
violation_post_code	categorical	78.7%
violation_legal_code	categorical	55.8%
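
The "more than 70% null" group called out in the summary can be recovered from this table with a simple threshold filter. The sketch below copies a subset of the null percentages above into a dictionary; `columns_above` is a hypothetical helper, and the 70% cutoff is the one used in the summary.

```python
# Null percentages copied from the table above (only the sparser columns).
null_pct = {
    "issuer_squad": 63.6,
    "from_hours_in_effect": 68.3,
    "to_hours_in_effect": 68.3,
    "unregistered_vehicle": 84.9,
    "meter_number": 84.8,
    "time_first_observed": 78.5,
    "violation_post_code": 78.7,
    "violation_legal_code": 55.8,
}


def columns_above(null_rates, threshold_pct):
    """Return column names with a null percentage above the threshold, highest first."""
    flagged = [(pct, name) for name, pct in null_rates.items() if pct > threshold_pct]
    return [name for pct, name in sorted(flagged, reverse=True)]


mostly_null = columns_above(null_pct, 70.0)
```

With the values above, the filter returns the same four columns the summary names: unregistered_vehicle, meter_number, violation_post_code, and time_first_observed.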

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,612 detected strings).
lang	count	share
en	3596	78.0%
ja	614	13.3%
es	86	1.9%
zh	59	1.3%
de	56	1.2%
fr	39	0.8%
nl	25	0.5%
it	21	0.5%
pt	15	0.3%
ko	14	0.3%
ca	14	0.3%
ar	14	0.3%
ru	12	0.3%
cs	10	0.2%
no	9	0.2%
uk	8	0.2%
eu	5	0.1%
id	4	0.1%
pl	3	0.1%
gl	1	0.0%
da	1	0.0%
ms	1	0.0%
te	1	0.0%
ro	1	0.0%
lt	1	0.0%
sv	1	0.0%
mk	1	0.0%

summons_number text identifier

This is a unique 10-digit numeric identifier per row (likely a parking/traffic summons number), with all 10,000 values distinct and uniformly 10 characters long. There are no nulls, no duplicates, and every value is a single token, consistent with a primary key rather than a feature. The allcaps flag is a quirk of the detector treating digit-only strings as uppercase.

Treatment: Drop from modelling; retain as a join key.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["summons_number"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	10,000
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	10,000
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for summons_number.

Character-length distribution for summons_number (mean: 10.0), by string length; empty bins omitted.
chars	count
10	10000

plate_id text identifier

This is almost certainly a license-plate identifier: 9,519 unique values across 10,000 rows, all single-token, 99.99% uppercase, with lengths between 2 and 10 characters (mean 6.88). Notably, the placeholder 'BLANKPLATE' appears 34 times and the duplicate rate is 4.81% (481 records), suggesting placeholder values plus a small amount of legitimate plate reuse. There are no nulls or empties, but the placeholder token will pollute any join or uniqueness assumption.

Treatment: Treat as an identifier key; normalize case, map 'BLANKPLATE' to null, then left-join on this id.

anthropic:claude-opus-4-7 · confidence high
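
The cleanup part of this treatment (normalize case, map the placeholder to null) might look like the following. It is a minimal sketch: `clean_plate_id` is a hypothetical helper, and the placeholder set contains only the one token named in the note above.

```python
# Placeholder tokens to treat as missing (only the one reported for this column).
PLACEHOLDER_PLATES = {"BLANKPLATE"}


def clean_plate_id(raw):
    """Uppercase and trim a plate id; return None for empties and placeholders."""
    if raw is None:
        return None
    plate = raw.strip().upper()
    if not plate or plate in PLACEHOLDER_PLATES:
        return None
    return plate
```

Mapping the placeholder to `None` before any join keeps the 34 placeholder rows from matching each other as if they were the same vehicle.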

Out[16]:

saturn.columns["plate_id"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	9,519
len_min	2
len_max	10
len_mean	6.879
len_median	7
len_p95	7
word_mean	1
word_median	1
n_empty	0
n_duplicates	481
duplicate_rate	0.0481
vocab_size	9,519
readability_flesch_mean	100.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9999
boilerplate_rate	0
alert: near_unique	95.2% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for plate_id.

Character-length distribution for plate_id (mean: 6.8788), by string length; empty bins omitted.
chars	count
2	1
3	5
4	12
5	58
6	1578
7	7876
8	432
9	3
10	35

registration_state categorical feature

Two-letter US state codes identifying where a vehicle is registered, dominated by NY at 69.35% of 10,000 rows with NJ, PA, CT, and FL trailing — consistent with an NYC-area enforcement dataset. Cardinality is 50 with entropy ratio 0.36, and one suspicious non-state token "99" appears 88 times, likely a sentinel for unknown/out-of-country plates.

Treatment: One-hot encode top states, bucket the long tail into 'OTHER', and recode '99' as missing.

anthropic:claude-opus-4-7 · confidence high
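
A sketch of this treatment in plain Python follows. The choice of five "top" states is an assumption made for illustration (the note does not fix a cutoff), and `bucket_state` and `one_hot_state` are hypothetical helper names.

```python
# Assumed cutoff: keep the five most common codes as their own levels.
TOP_STATES = ("NY", "NJ", "PA", "CT", "FL")
UNKNOWN_CODES = {"99"}  # sentinel reported for this column


def bucket_state(code):
    """Return the code for top states, 'OTHER' for the tail, None for unknowns."""
    if code is None or code in UNKNOWN_CODES:
        return None
    return code if code in TOP_STATES else "OTHER"


def one_hot_state(code):
    """One-hot encode the bucketed state; missing values get all zeros."""
    bucket = bucket_state(code)
    levels = TOP_STATES + ("OTHER",)
    return {f"state_{level}": int(bucket == level) for level in levels}
```

Encoding missing values as an all-zero row is one option; adding an explicit missing indicator is another, depending on whether the '99' rows are expected to carry signal.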

Out[19]:

saturn.columns["registration_state"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	50
top_value	NY
top_rate	0.6935
cardinality	50
entropy	2.034
entropy_ratio	0.3603

Fig 10.

Top values for registration_state.

Top values for registration_state (20 unique shown, of 50 total).
value	count	share
NY	6935	69.3%
NJ	917	9.2%
PA	470	4.7%
CT	247	2.5%
FL	235	2.4%
VA	220	2.2%
ME	125	1.2%
MA	124	1.2%
99	88	0.9%
GA	65	0.7%
NC	64	0.6%
TN	56	0.6%
MD	53	0.5%
TX	36	0.4%
OH	31	0.3%
IN	30	0.3%
IL	28	0.3%
CA	27	0.3%
SC	26	0.3%
MI	24	0.2%

plate_type categorical feature

This column encodes vehicle plate type codes (e.g., PAS, OMT, COM), almost certainly from a parking or traffic dataset. The distribution is heavily dominated by passenger plates, with PAS accounting for 9072 of 10000 rows (top_rate 0.9072) and entropy_ratio of just 0.146 across 27 categories. A small but notable 77 rows carry the placeholder code '999', which likely represents missing or unknown plate types despite null_rate being 0.

Treatment: Collapse rare codes into 'other' and treat '999' as missing before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["plate_type"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	27
top_value	PAS
top_rate	0.9072
cardinality	27
entropy	0.6958
entropy_ratio	0.1463

Fig 11.

Top values for plate_type.

Top values for plate_type (20 unique shown, of 27 total).
value	count	share
PAS	9072	90.7%
OMT	330	3.3%
COM	245	2.5%
SRF	103	1.0%
OMS	84	0.8%
999	77	0.8%
ORG	17	0.2%
TRL	11	0.1%
MED	8	0.1%
RGL	8	0.1%
MOT	7	0.1%
NYS	6	0.1%
PSD	6	0.1%
SPO	4	0.0%
VAS	3	0.0%
ITP	3	0.0%
OMR	2	0.0%
TRC	2	0.0%
TOW	2	0.0%
APP	2	0.0%

issue_date categorical timestamp

This is an issue_date column stored as ISO-8601 timestamps but profiled as categorical, with 687 distinct dates across 10,000 rows and no nulls. The distribution is severely concentrated: 65.42% of rows fall on 2025-12-28, with 2025-12-30 (1,594) and 2025-12-29 (356) accounting for most of the rest, leaving a long tail of 2026 dates with single- or low-double-digit counts. Several of those tail dates (for example 2026-06-26 through 2026-10-29) fall after the 20260119 stamp in the source filename and should be validated. The clustering at end-of-2025 suggests a bulk-issue or backfill event rather than organic daily issuance.

Treatment: Parse as datetime and derive features (day, month, days-since-epoch); investigate the 2025-12-28 spike before using as a model feature.

anthropic:claude-opus-4-7 · confidence high
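
The parsing and feature derivation described in the treatment could be sketched as below. The timestamp format matches the values shown for this column; the `SNAPSHOT` date is an assumption read off the "20260119" in the source filename, and `issue_date_features` is a hypothetical helper.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)
# Assumption: the sample was taken on the date embedded in the filename.
SNAPSHOT = date(2026, 1, 19)


def issue_date_features(raw):
    """Parse an ISO-8601 issue_date string and derive simple calendar features."""
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f").date()
    return {
        "day": parsed.day,
        "month": parsed.month,
        "days_since_epoch": (parsed - EPOCH).days,
        # Flags dates later than the assumed snapshot so they can be reviewed.
        "after_snapshot": parsed > SNAPSHOT,
    }
```

The `after_snapshot` flag is not part of the original treatment; it is included because some of the listed dates are later than the filename stamp and are worth separating out before modelling.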

Out[25]:

saturn.columns["issue_date"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	687
top_value	2025-12-28T00:00:00.000
top_rate	0.6542
cardinality	687
entropy	2.765
entropy_ratio	0.2934
alert: long_tail	368 singleton categories

Fig 12.

Top values for issue_date.

Top values for issue_date (20 unique shown, of 687 total).
value	count	share
2025-12-28T00:00:00.000	6542	65.4%
2025-12-30T00:00:00.000	1594	15.9%
2025-12-29T00:00:00.000	356	3.6%
2026-06-26T00:00:00.000	14	0.1%
2026-09-27T00:00:00.000	13	0.1%
2026-09-25T00:00:00.000	12	0.1%
2025-12-31T00:00:00.000	12	0.1%
2026-06-27T00:00:00.000	11	0.1%
2026-10-25T00:00:00.000	10	0.1%
2026-08-31T00:00:00.000	10	0.1%
2026-08-27T00:00:00.000	10	0.1%
2026-07-26T00:00:00.000	10	0.1%
2026-06-30T00:00:00.000	10	0.1%
2026-08-25T00:00:00.000	9	0.1%
2026-10-29T00:00:00.000	8	0.1%
2026-10-17T00:00:00.000	8	0.1%
2026-10-05T00:00:00.000	8	0.1%
2026-09-23T00:00:00.000	8	0.1%
2026-07-27T00:00:00.000	8	0.1%
2026-07-24T00:00:00.000	8	0.1%

violation_code categorical feature

Categorical violation_code with 62 distinct codes stored as strings and zero nulls across 10,000 rows. Distribution is heavily concentrated: code '36' alone accounts for 44.16% of records, followed by '21' at 14.63%, giving an entropy ratio of 0.52, roughly half the maximum for this cardinality. The top 10 codes cover 8,877 rows (88.8%), leaving a long tail of rare codes.

Treatment: Group rare codes into an 'other' bucket, then one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
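
Grouping rare codes into an 'other' bucket can be done with a share threshold, as in this sketch. The threshold value is an assumption to tune per use case, `collapse_rare` is a hypothetical helper, and the sample list is synthetic rather than drawn from the dataset.

```python
from collections import Counter


def collapse_rare(values, min_share=0.01, other_label="other"):
    """Replace levels whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {level for level, count in counts.items() if count / total >= min_share}
    return [value if value in keep else other_label for value in values]


# Synthetic example: '13' holds 10% of rows and falls below the 20% threshold.
sample = ["36"] * 6 + ["21"] * 3 + ["13"]
collapsed = collapse_rare(sample, min_share=0.2)
```

The collapsed column can then be one-hot or target-encoded with far fewer levels than the original 62.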

Out[28]:

saturn.columns["violation_code"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	62
top_value	36
top_rate	0.4416
cardinality	62
entropy	3.1
entropy_ratio	0.5206

Fig 13.

Top values for violation_code.

Top values for violation_code (20 unique shown, of 62 total).
value	count	share
36	4416	44.2%
21	1463	14.6%
40	862	8.6%
14	858	8.6%
46	360	3.6%
20	244	2.4%
98	242	2.4%
19	168	1.7%
16	133	1.3%
66	131	1.3%
71	116	1.2%
74	114	1.1%
70	104	1.0%
50	96	1.0%
17	79	0.8%
78	75	0.8%
51	66	0.7%
80	56	0.6%
67	47	0.5%
13	38	0.4%

vehicle_body_type categorical feature

Column holds short vehicle body-type codes (SUBN, 4DSD, SDN, VAN, PICK...) typical of DMV/parking-violation feeds. Distribution is heavily concentrated: SUBN alone covers 51.2% of all rows (51.9% of non-null values, the basis for top_rate) and the top two codes account for roughly 72%, yet there are 81 distinct codes producing a long tail. Entropy ratio of 0.41 confirms the lopsided spread, and nulls are minor at 1.31%.

Treatment: Group rare codes into an 'other' bucket and one-hot encode the top categories.

anthropic:claude-opus-4-7 · confidence high
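
The stats for this column report top_rate 0.5188, while the top-values table lists SUBN at 5,120 rows (51.2%). The two figures appear to differ only in denominator: the table share divides by all rows, and top_rate seems to divide by non-null rows. The arithmetic below checks that reading using the counts reported for this column; that Saturn computes top_rate this way is an inference from the numbers, not documented behaviour.

```python
# Figures reported for vehicle_body_type in this profile.
n_rows = 10_000
n_nulls = 131
top_count = 5_120  # rows with the value 'SUBN'

share_of_all_rows = top_count / n_rows
share_of_non_null = top_count / (n_rows - n_nulls)

print(f"share of all rows: {share_of_all_rows:.4f}")
print(f"share of non-null: {share_of_non_null:.4f}")
```

The same distinction matters for any column with a meaningful null rate, where the two shares can differ by a wide margin.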

Out[31]:

saturn.columns["vehicle_body_type"].stats

stat	value
n	10,000
nulls	131 (1.3%)
unique	81
top_value	SUBN
top_rate	0.5188
cardinality	81
entropy	2.599
entropy_ratio	0.41

Fig 14.

Top values for vehicle_body_type.

Top values for vehicle_body_type (20 unique shown, of 81 total).
value	count	share
SUBN	5120	51.2%
4DSD	2067	20.7%
SDN	670	6.7%
VAN	275	2.8%
SPOR	233	2.3%
PICK	194	1.9%
TRLR	175	1.8%
SEDA	115	1.1%
UT	111	1.1%
SW	110	1.1%
2DSD	94	0.9%
4D	82	0.8%
DELV	77	0.8%
SEMI	53	0.5%
TAXI	52	0.5%
SU	52	0.5%
SD	51	0.5%
P-U	46	0.5%
TRAC	41	0.4%
CONV	26	0.3%

vehicle_make categorical feature

This is a categorical vehicle make field with 126 distinct values across 10,000 rows and a low 0.78% null rate. Values appear truncated to 5 characters (TOYOT, NISSA, ME/BE, CHEVR, HYUND, SUBAR), which can collide brands and complicate joins. HONDA leads at 13.3% of all rows (13.4% of non-null values), followed closely by TOYOT at 13.0%, and the entropy ratio of 0.67 indicates a long tail beyond the top 10.

Treatment: Normalize truncated codes to canonical make names, then group rare levels before encoding.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["vehicle_make"].stats

stat	value
n	10,000
nulls	78 (0.8%)
unique	126
top_value	HONDA
top_rate	0.1341
cardinality	126
entropy	4.661
entropy_ratio	0.668

Fig 15.

Top values for vehicle_make.

Top values for vehicle_make (20 unique shown, of 126 total).
value	count	share
HONDA	1331	13.3%
TOYOT	1302	13.0%
NISSA	770	7.7%
FORD	603	6.0%
BMW	559	5.6%
ME/BE	521	5.2%
JEEP	450	4.5%
CHEVR	449	4.5%
HYUND	365	3.6%
SUBAR	273	2.7%
KIA	268	2.7%
LEXUS	268	2.7%
MAZDA	257	2.6%
AUDI	242	2.4%
ACURA	221	2.2%
VOLKS	199	2.0%
DODGE	175	1.8%
TESLA	152	1.5%
INFIN	151	1.5%
GMC	134	1.3%

issuing_agency categorical feature

Single-character codes for the agency that issued each record, with 20 distinct values and no nulls across 10,000 rows. Distribution is heavily concentrated: 'V' alone accounts for 44.16% and the top four codes (V, T, S, P) cover 98.2% of rows, while codes like M and O fall to single or low double digits. Entropy ratio of 0.46 confirms the imbalance.

Treatment: One-hot encode the top categories and bucket the long tail into 'other'.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["issuing_agency"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	20
top_value	V
top_rate	0.4416
cardinality	20
entropy	2.007
entropy_ratio	0.4644

Fig 16.

Top values for issuing_agency.

Top values for issuing_agency (20 unique shown, of 20 total).
value	count	share
V	4416	44.2%
T	2131	21.3%
S	1946	19.5%
P	1325	13.2%
K	41	0.4%
N	38	0.4%
A	23	0.2%
Y	17	0.2%
M	13	0.1%
O	9	0.1%
C	9	0.1%
8	8	0.1%
3	6	0.1%
X	5	0.1%
9	3	0.0%
W	3	0.0%
L	2	0.0%
R	2	0.0%
F	2	0.0%
U	1	0.0%
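
Because this table lists all 20 levels and the column has no nulls, the reported entropy and entropy_ratio can be checked by hand. The sketch assumes Saturn uses Shannon entropy in bits and normalizes by log2 of the cardinality; that formula is an assumption, although it comes out close to the reported 2.007 and 0.4644 for these counts.

```python
import math

# Counts copied from the issuing_agency table above (all 20 levels, no nulls).
agency_counts = [4416, 2131, 1946, 1325, 41, 38, 23, 17, 13, 9,
                 9, 8, 6, 5, 3, 3, 2, 2, 2, 1]


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of the distribution implied by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of observed levels)."""
    levels = sum(1 for c in counts if c > 0)
    if levels < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(levels)


entropy = shannon_entropy_bits(agency_counts)
ratio = entropy_ratio(agency_counts)
```

A ratio near 1 means the levels are used about equally; a ratio near 0 means one level dominates, which is why the heavily skewed columns in this profile score well below 0.5.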

street_code1 text foreign_key

This is `street_code1`, a short numeric street identifier stored as text (len_max 5, one_word_rate 1.0, 1420 distinct codes across 10000 rows). The dominant surprise is that 4817 rows — nearly half — carry the placeholder value "0", driving the 0.858 duplicate_rate; the next most frequent code ("13610") only appears 52 times. No nulls, but the "0" sentinel effectively functions as a missing/unknown marker.

Treatment: Treat as a categorical code, recode "0" to missing, then left-join to a street reference table.

anthropic:claude-opus-4-7 · confidence high
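
The recode-then-join treatment could be sketched as follows. The reference table here is entirely hypothetical (its street names are placeholders, not real lookups for these codes), and `recode_street_code` and `left_join_street` are illustrative helper names.

```python
MISSING_CODES = {"0"}  # sentinel reported for this column

# Hypothetical reference table; the names are placeholders for illustration.
street_reference = {"13610": "EXAMPLE AVE", "40404": "SAMPLE ST"}


def recode_street_code(code):
    """Map the '0' sentinel (and real nulls) to None; keep other codes as strings."""
    return None if code is None or code in MISSING_CODES else code


def left_join_street(rows, reference):
    """Keep every row; attach a street label when the code is known, else None."""
    joined = []
    for row in rows:
        code = recode_street_code(row.get("street_code1"))
        joined.append({**row, "street_code1": code, "street_label": reference.get(code)})
    return joined


rows = [{"street_code1": "0"}, {"street_code1": "13610"}, {"street_code1": "99999"}]
joined = left_join_street(rows, street_reference)
```

Keeping the codes as strings avoids losing any leading zeros, and the left join preserves all rows so the share of unmatched codes can be measured afterwards.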

Out[40]:

saturn.columns["street_code1"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	1,420
len_min	1
len_max	5
len_mean	3.026
len_median	4
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,580
duplicate_rate	0.858
vocab_size	1,420
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.5168
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	51.7% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	85.8% duplicate strings

Fig 17.

Character-length distribution for street_code1.

Character-length distribution for street_code1 (mean: 3.0255), by string length; empty bins omitted.
chars	count
1	4817
2	15
3	3
4	426
5	4739

street_code2 text feature

This looks like a secondary street code stored as text, with 1349 unique short tokens (max length 5, mean 2.94 chars) and every value a single word. The column is dominated by '0', which accounts for 5005 of 10000 rows, with another 359 rows holding '40404'; combined with an 86.51% duplicate rate this leaves very little discriminating signal. No nulls, but the heavy '0' mass likely encodes a missing/sentinel state rather than a real code.

Treatment: Treat '0' as a missing sentinel and collapse rare codes before using it as a categorical, or drop it given the extreme mode dominance.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["street_code2"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	1,349
len_min	1
len_max	5
len_mean	2.938
len_median	1
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,651
duplicate_rate	0.8651
vocab_size	1,349
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.496
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	49.6% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	86.5% duplicate strings

Fig 18.

Character-length distribution for street_code2.

Character-length distribution for street_code2 (mean: 2.9379), by string length; empty bins omitted.
chars	count
1	5005
2	35
3	3
4	490
5	4467

street_code3 text feature

Stored as text but the values are short numeric codes (len_max 5, one_word_rate 1.0, 1317 unique tokens), consistent with a street or address code lookup. The distribution is dominated by '0' (5235/10000) with '40404' a distant second at 359, driving an 86.83% duplicate_rate. The allcaps_rate of 0.4753 is an artifact of digit-only strings being counted as uppercase.

Treatment: Treat as a categorical code: keep as string, collapse rare levels, and consider a binary flag for the dominant '0' value.

anthropic:claude-opus-4-7 · confidence high

Out[46]:

saturn.columns["street_code3"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	1,317
len_min	1
len_max	5
len_mean	2.85
len_median	1
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,683
duplicate_rate	0.8683
vocab_size	1,317
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.4753
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	47.5% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	86.8% duplicate strings

Fig 19.

Character-length distribution for street_code3.

Character-length distribution for street_code3 (mean: 2.8496), by string length; empty bins omitted.
chars	count
1	5235
2	12
3	3
4	522
5	4228

vehicle_expiration_date text timestamp

Stored as text but functionally a YYYYMMDD vehicle expiration date, with lengths of either 8 or 10 characters and one token per row. Over half the rows (5,492) are the sentinel '0.00000000' and another 556 are '88888888', plus 126 entries like '20258888' that mix a real year with a placeholder month/day, so roughly 62% of values (6,174 rows) are not real dates. Genuine dates such as 20260930 and 20260228 appear only in the dozens, and the column is 89.6% duplicates across just 1,040 unique values.

Treatment: Parse as YYYYMMDD after mapping '0.00000000', '88888888', and any *8888 tails to null.

anthropic:claude-opus-4-7 · confidence high
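
A parser for this treatment might look like the sketch below. The sentinel set and the "ends in 8888" rule come from the note above; `parse_expiration` is a hypothetical helper, and returning None for anything that is not a valid eight-digit date is an added assumption.

```python
from datetime import date, datetime

SENTINELS = {"0.00000000", "88888888"}


def parse_expiration(raw):
    """Parse a YYYYMMDD string into a date; return None for sentinels and junk."""
    if raw is None:
        return None
    value = raw.strip()
    # Full sentinels, plus real years with a placeholder month/day (e.g. 20258888).
    if value in SENTINELS or value.endswith("8888"):
        return None
    if len(value) != 8 or not value.isdigit():
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return None
```

The explicit length and digit check keeps partially filled values from being parsed into a plausible but wrong date.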

Out[49]:

saturn.columns["vehicle_expiration_date"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	1,040
len_min	8
len_max	10
len_mean	9.098
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,960
duplicate_rate	0.896
vocab_size	1,040
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	89.6% duplicate strings

Fig 20.

Character-length distribution for vehicle_expiration_date.

Character-length distribution for vehicle_expiration_date (mean: 9.0984), by string length; empty bins omitted.
chars	count
8	4508
10	5492

violation_location categorical feature

This appears to be a violation_location code, likely a precinct or zone identifier stored as a zero-padded string (e.g., '0018', '0047') mixed with non-padded variants ('115', '114'), suggesting inconsistent formatting across sources. Cardinality is 87 with a very high entropy ratio (0.929), so values are spread evenly with no dominant location: the top code accounts for only 5.1% of non-null values (2.8% of all rows). Most striking: 44.85% of rows are null, which is flagged as an alert and severely limits usability.

Treatment: Normalize the zero-padding inconsistency and treat nulls as a separate category before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
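
One way to apply this treatment is to strip leading zeros and give nulls an explicit label, as sketched here. Stripping (rather than padding everything to a fixed width) is a design assumption, chosen because it yields the same unpadded form that violation_precinct uses elsewhere in this profile; `normalize_location` is a hypothetical helper.

```python
def normalize_location(code, missing_label="MISSING"):
    """Strip zero-padding from a location code; label nulls and blanks explicitly."""
    if code is None or not code.strip():
        return missing_label
    stripped = code.strip().lstrip("0")
    # An all-zero code would strip to an empty string; keep it as a single "0".
    return stripped or "0"
```

After normalization, '0018' and '18' become the same level, and the missing rows form their own category for one-hot or target encoding.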

Out[52]:

saturn.columns["violation_location"].stats

stat	value
n	10,000
nulls	4,485 (44.9%)
unique	87
top_value	0018
top_rate	0.0515
cardinality	87
entropy	5.988
entropy_ratio	0.9294
alert: null_rate	44.9% null

Fig 21.

Top values for violation_location.

Top values for violation_location (20 unique shown, of 87 total).
value	count	share
0018	284	2.8%
115	220	2.2%
114	178	1.8%
103	162	1.6%
0047	140	1.4%
0075	139	1.4%
0077	135	1.4%
0062	130	1.3%
0072	129	1.3%
110	127	1.3%
0001	125	1.2%
112	124	1.2%
0061	120	1.2%
0084	106	1.1%
0034	105	1.1%
0020	104	1.0%
109	102	1.0%
0067	96	1.0%
0073	96	1.0%
108	93	0.9%

violation_precinct categorical feature

This column encodes the NYPD precinct where the violation occurred, stored as a string with 88 distinct codes. The dominant surprise is that 44.85% of rows carry the value '0', which is almost certainly a sentinel for missing/non-applicable precinct rather than a real precinct number; legitimate precincts like 18, 115, and 114 follow far behind. Entropy ratio of 0.66 reflects this heavy concentration on the sentinel.

Treatment: Recode '0' as missing before using as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
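
Recoding the sentinel changes the column's effective missing rate, which the reported 0.0% null rate hides. The sketch below uses a small synthetic list rather than the real data, and `recode_sentinel` and `missing_rate` are hypothetical helpers.

```python
def recode_sentinel(values, sentinels=("0",)):
    """Replace sentinel values with None so they are counted as missing."""
    return [None if v is None or v in sentinels else v for v in values]


def missing_rate(values):
    """Share of entries that are None (0.0 for an empty list)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


# Synthetic example, not rows from the dataset.
precincts = ["0", "18", "0", "115", "114"]
cleaned = recode_sentinel(precincts)
```

Applied to the full column, the same recode would turn the 4,485 '0' rows into missing values, giving an effective missing rate of 44.85%.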

Out[55]:

saturn.columns["violation_precinct"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	88
top_value	0
top_rate	0.4485
cardinality	88
entropy	4.295
entropy_ratio	0.6649

Fig 22.

Top values for violation_precinct.

Top values for violation_precinct (20 unique shown, of 88 total).
value	count	share
0	4485	44.9%
18	284	2.8%
115	220	2.2%
114	178	1.8%
103	162	1.6%
47	140	1.4%
75	139	1.4%
77	135	1.4%
62	130	1.3%
72	129	1.3%
110	127	1.3%
1	125	1.2%
112	124	1.2%
61	120	1.2%
84	106	1.1%
34	105	1.1%
20	104	1.0%
109	102	1.0%
67	96	1.0%
73	96	1.0%

issuer_precinct categorical feature

This is the precinct number of the issuing officer, with 121 distinct codes across 10,000 rows and no nulls. The distribution is dominated by the value "0" at 64.18% of rows, which is almost certainly a sentinel/placeholder rather than a real precinct, leaving genuine precincts (e.g., 18, 115, 103) as long-tail minorities. Entropy ratio of 0.446 confirms the heavy concentration on that single code.

Treatment: Treat "0" as missing/sentinel and group rare precincts before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high

Out[58]:

saturn.columns["issuer_precinct"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	121
top_value	0
top_rate	0.6418
cardinality	121
entropy	3.088
entropy_ratio	0.4463

Fig 23.

Top values for issuer_precinct.

Top values for issuer_precinct (20 unique shown, of 121 total).
value	count	share
0	6418	64.2%
18	282	2.8%
115	161	1.6%
103	139	1.4%
61	115	1.1%
62	113	1.1%
1	104	1.0%
109	90	0.9%
70	87	0.9%
114	81	0.8%
19	80	0.8%
14	80	0.8%
20	78	0.8%
110	75	0.8%
63	73	0.7%
102	68	0.7%
6	65	0.7%
60	59	0.6%
75	57	0.6%
10	57	0.6%

issuer_code text foreign_key

This is a categorical issuer identifier stored as text, with 1,420 distinct codes across 10,000 rows and an 85.8% duplicate rate. The dominant value '0' accounts for 5,314 rows (over half the column), suggesting it is a sentinel or 'unknown issuer' placeholder rather than a real code. Remaining values look like 6-digit numeric IDs (len_max 6, len_p95 6), but the median length of 1 confirms how heavily the '0' bucket skews the distribution.

Treatment: Treat '0' as missing and join the remaining codes to an issuer reference table.

anthropic:claude-opus-4-7 · confidence high

Out[61]:

saturn.columns["issuer_code"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	1,420
len_min	1
len_max	6
len_mean	3.338
len_median	1
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,580
duplicate_rate	0.858
vocab_size	1,420
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.4686
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	46.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	85.8% duplicate strings

Fig 24.

Character-length distribution for issuer_code.

Character-length distribution for issuer_code (mean: 3.338; empty bins omitted).
chars	count
1	5314
3	7
4	1
5	24
6	4654

issuer_command categorical feature

This column holds short alphanumeric codes (e.g., T302, T401, EPIU, MTTF) that appear to identify the issuer's command, with 228 distinct values across 10,000 rows. It is sparsely populated (44.16% null), yet the entropy_ratio of 0.83 over the non-null portion shows the codes are spread fairly evenly: the most common value, T302, covers only 6.7% of non-null rows (3.7% of all rows). No single code dominates, so missingness is a more striking signal than concentration.

Treatment: Treat as a categorical feature with an explicit 'missing' level, then target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
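
As a rough illustration of the frequency-encoding option, the sketch below gives nulls their own level before counting. The function name `frequency_encode` and the "MISSING" label are hypothetical, and nulls are assumed to arrive as `None`.

```python
from collections import Counter

def frequency_encode(values, missing_label="MISSING"):
    """Replace each category with its relative frequency; nulls form their own level."""
    labelled = [missing_label if v is None else v for v in values]
    counts = Counter(labelled)
    total = len(labelled)
    return [counts[v] / total for v in labelled]

sample = ["T302", "T302", None, "T401"]
print(frequency_encode(sample))  # -> [0.5, 0.5, 0.25, 0.25]
```

In a modelling pipeline the frequencies would normally be learned on the training split only and then applied to held-out rows.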

Out[64]:

saturn.columns["issuer_command"].stats

stat	value
n	10,000
nulls	4,416 (44.2%)
unique	228
top_value	T302
top_rate	0.06698
cardinality	228
entropy	6.523
entropy_ratio	0.8327
alert: null_rate	44.2% null

Fig 25.

Top values for issuer_command.

Top values for issuer_command (20 unique shown, of 228 total).
value	count	share
T302	374	3.7%
T401	301	3.0%
T103	230	2.3%
T402	223	2.2%
T106	187	1.9%
T301	169	1.7%
T102	145	1.5%
EPIU	142	1.4%
MTTF	133	1.3%
KN02	105	1.1%
T105	104	1.0%
KN08	95	0.9%
QW01	95	0.9%
MN12	88	0.9%
BX12	86	0.9%
T303	85	0.9%
T201	84	0.8%
MN07	83	0.8%
QW02	63	0.6%
KN04	63	0.6%

issuer_squad categorical feature

A categorical code labelled issuer_squad with 23 distinct values, dominated by the placeholder-looking '0000' which accounts for 41.6% of non-nulls (1513 rows) while the rest are single-letter tags like F, N, A, L. Most striking: 63.6% of rows are null, and entropy_ratio of 0.73 suggests the remaining values are reasonably spread but anchored by that '0000' bucket which may represent unassigned issuers. The mix of a numeric-looking code with letter codes hints at inconsistent encoding conventions.

Treatment: Treat '0000' as a missing sentinel, impute or bucket rare letters, then one-hot encode.

anthropic:claude-opus-4-7 · confidence medium

Out[67]:

saturn.columns["issuer_squad"].stats

stat	value
n	10,000
nulls	6,361 (63.6%)
unique	23
top_value	0000
top_rate	0.4158
cardinality	23
entropy	3.309
entropy_ratio	0.7315
alert: null_rate	63.6% null

Fig 26.

Top values for issuer_squad.

Top values for issuer_squad (20 unique shown, of 23 total).
value	count	share
0000	1513	15.1%
F	331	3.3%
N	245	2.5%
A	214	2.1%
L	180	1.8%
J	122	1.2%
M	120	1.2%
Q	110	1.1%
R	100	1.0%
H	99	1.0%
E	99	1.0%
X	95	0.9%
Y	85	0.9%
S	71	0.7%
B	70	0.7%
P	44	0.4%
U	29	0.3%
D	28	0.3%
C	25	0.2%
G	22	0.2%

violation_time text timestamp

This column encodes the time of a violation as a fixed 5-character token like '0839A' or '1200P' (HHMM plus A/P meridiem indicator), with every value uppercase and exactly one word. Across 10,000 rows there are only 1,432 distinct values and an 85.7% duplicate rate, which is expected for clock times rather than a data quality issue. Null rate is negligible (0.0004) and the format is strikingly uniform — len_min, len_median, and len_max all equal 5.

Treatment: Parse the HHMM+A/P token into a proper time-of-day feature (e.g., minutes since midnight) before modelling.

anthropic:claude-opus-4-7 · confidence high
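
A minimal sketch of the parsing step, assuming the token is always two hour digits, two minute digits, and a trailing 'A' or 'P'. The function name is hypothetical, and accepting hour "00" as equivalent to "12" is an assumption rather than something the profile confirms.

```python
import re

_TIME_RE = re.compile(r"^(\d{2})(\d{2})([AP])$")

def minutes_since_midnight(token):
    """Convert 'HHMM' + 'A'/'P' (e.g. '0839A') to minutes since midnight, or None."""
    if token is None:
        return None
    match = _TIME_RE.match(token.strip().upper())
    if not match:
        return None                    # wrong shape, e.g. a '00000' placeholder
    hour, minute, meridiem = int(match[1]), int(match[2]), match[3]
    if hour > 12 or minute > 59:
        return None                    # out-of-range clock values are rejected
    hour %= 12                         # 12 (and the assumed 00) wrap to hour 0
    if meridiem == "P":
        hour += 12                     # shift afternoon/evening hours
    return hour * 60 + minute

print(minutes_since_midnight("0839A"))  # -> 519
print(minutes_since_midnight("1200P"))  # -> 720
```

Returning None for unparseable tokens keeps bad values visible as missing instead of silently mapping them to midnight.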

Out[70]:

saturn.columns["violation_time"].stats

stat	value
n	10,000
nulls	4 (0.0%)
unique	1,432
len_min	5
len_max	5
len_mean	5
len_median	5
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,564
duplicate_rate	0.8567
vocab_size	1,433
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9999
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	85.7% duplicate strings

Fig 27.

Character-length distribution for violation_time.

Character-length distribution for violation_time (mean: 5.0; empty bins omitted).
chars	count
5	9996

violation_county categorical feature

This is a categorical county/borough code for NYC violations, with 13 distinct values across 10,000 rows and a 2.71% null rate. The top value 'QN' (Queens) covers 18.2% of non-null rows (17.7% of all rows), but the codes are inconsistent: Queens appears as 'QN', 'Q', and 'Qns'; Brooklyn as 'BK', 'K', and 'Kings'; and the Bronx as 'BX' and 'Bronx'. 'NY' and 'MN' likely both denote Manhattan, and 'ST', 'R', and 'Rich' likely denote Staten Island (Richmond County), so the 13 codes probably collapse to five boroughs. The entropy ratio of 0.897 reflects this fragmentation rather than genuine diversity.

Treatment: Normalize aliases (QN/Q/Qns→Queens, BK/K/Kings→Brooklyn, BX/Bronx→Bronx, and likely NY/MN→Manhattan and ST/R/Rich→Staten Island) before encoding.

anthropic:claude-opus-4-7 · confidence high
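
One way to sketch the alias normalization is a lookup table applied after upper-casing. The mapping below is an assumption: the Queens, Brooklyn, and Bronx aliases come from the interpretation above, while reading 'NY'/'MN' as Manhattan and 'ST'/'R'/'Rich' as Staten Island is a plausible guess that should be checked against the source's data dictionary.

```python
from collections import Counter

BOROUGH_ALIASES = {
    "QN": "Queens", "Q": "Queens", "QNS": "Queens",
    "BK": "Brooklyn", "K": "Brooklyn", "KINGS": "Brooklyn",
    "BX": "Bronx", "BRONX": "Bronx",
    # Assumed readings, not confirmed by the profile:
    "NY": "Manhattan", "MN": "Manhattan",
    "ST": "Staten Island", "R": "Staten Island", "RICH": "Staten Island",
}

def normalize_county(code):
    """Map a raw county code to a borough name; unknown codes pass through unchanged."""
    if code is None:
        return None
    return BOROUGH_ALIASES.get(code.strip().upper(), code)

# Counts copied from the top-values table for violation_county.
observed = {"QN": 1773, "NY": 1472, "BK": 1233, "BX": 1027, "K": 936, "Q": 902,
            "Kings": 737, "Qns": 414, "ST": 403, "Bronx": 395, "MN": 314,
            "R": 116, "Rich": 7}
totals = Counter()
for code, n in observed.items():
    totals[normalize_county(code)] += n
print(dict(totals))
```

Under this mapping the table's counts collapse to Queens 3,089, Brooklyn 2,906, Manhattan 1,786, Bronx 1,422, and Staten Island 526, which sum to the 9,729 non-null rows.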

Out[73]:

saturn.columns["violation_county"].stats

stat	value
n	10,000
nulls	271 (2.7%)
unique	13
top_value	QN
top_rate	0.1822
cardinality	13
entropy	3.32
entropy_ratio	0.8973

Fig 28.

Top values for violation_county.

Top values for violation_county (13 unique shown, of 13 total).
value	count	share
QN	1773	17.7%
NY	1472	14.7%
BK	1233	12.3%
BX	1027	10.3%
K	936	9.4%
Q	902	9.0%
Kings	737	7.4%
Qns	414	4.1%
ST	403	4.0%
Bronx	395	4.0%
MN	314	3.1%
R	116	1.2%
Rich	7	0.1%

violation_in_front_of_or_opposite categorical feature

A categorical position code indicating where a violation occurred relative to a reference point: most likely 'F' (in front of) and 'O' (opposite), plus 'I' (meaning unclear, possibly 'inside') and a single 'R'. Nearly half the rows (46.85%) are null, and among the populated rows 'F' dominates at 66.85%. The 'R' code appears just once in 10,000 rows, suggesting either a data-entry error or a rare legitimate category worth investigating.

Treatment: Impute or add an explicit 'missing' level, collapse the singleton 'R', then one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[76]:

saturn.columns["violation_in_front_of_or_opposite"].stats

stat	value
n	10,000
nulls	4,685 (46.9%)
unique	4
top_value	F
top_rate	0.6685
cardinality	4
entropy	1.229
entropy_ratio	0.6145
alert: null_rate	46.9% null

Fig 29.

Top values for violation_in_front_of_or_opposite.

Top values for violation_in_front_of_or_opposite (4 unique shown, of 4 total).
value	count	share
F	3553	35.5%
O	1140	11.4%
I	621	6.2%
R	1	0.0%

street_name text feature

Street/intersection labels, mostly directional roadway descriptors like 'SB CROSS BAY BLVD @' with direction prefixes (SB/WB/NB/EB) and '@' separators marking cross-streets, consistent with NYC traffic or transit data. Values are heavily duplicated (6,881 duplicate strings among 9,996 non-null rows; 3,115 unique) and 77% are all-caps. Strings are capped at 20 characters, and 3,656 rows sit exactly at that limit, which suggests upstream truncation. Language detection flags Japanese (614) and Chinese (59), but these are almost certainly false positives from short uppercase abbreviations rather than true multilingual content.

Treatment: Normalize case and parse into direction/street/cross-street components before using as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
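
A rough sketch of the parsing step follows. It assumes the direction prefix is one of NB/SB/EB/WB and that anything after '@' is the cross-street; the function name and output keys are hypothetical, and truncated values will often leave the cross-street empty.

```python
import re

_DIRECTIONS = {"NB", "SB", "EB", "WB"}

def parse_street(raw):
    """Split e.g. 'SB CROSS BAY BLVD @' into direction, street, and cross_street."""
    if raw is None:
        return None
    text = re.sub(r"\s+", " ", raw.strip().upper())   # normalize case and spacing
    street, _, cross = text.partition("@")
    tokens = street.split()
    direction = tokens[0] if tokens and tokens[0] in _DIRECTIONS else None
    if direction:
        tokens = tokens[1:]                           # drop the prefix from the name
    return {
        "direction": direction,
        "street": " ".join(tokens) or None,
        "cross_street": cross.strip() or None,        # None when nothing follows '@'
    }

print(parse_street("SB CROSS BAY BLVD @"))
```

Because so many values sit at the 20-character cap, the parsed cross-street is best treated as a hint rather than a reliable key.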

Out[79]:

saturn.columns["street_name"].stats

stat	value
n	10,000
nulls	4 (0.0%)
unique	3,115
len_min	2
len_max	20
len_mean	14.87
len_median	16
len_p95	20
word_mean	3.513
word_median	3
n_empty	0
n_duplicates	6,881
duplicate_rate	0.6884
vocab_size	1,760
readability_flesch_mean	62.55
emoji_rate	0
url_rate	0
one_word_rate	0.01371
allcaps_rate	0.7732
boilerplate_rate	0
alert: multilingual	28 languages detected in sample
alert: allcaps	77.3% rows are all-caps
alert: duplicates	68.8% duplicate strings

Fig 30.

Character-length distribution for street_name.

Character-length distribution for street_name (mean: 14.87; empty bins omitted).
chars	count
2	1
3	9
4	15
5	150
6	105
7	533
8	727
9	1009
10	396
11	415
12	374
13	489
14	404
15	344
16	155
17	90
18	111
19	1013
20	3656

intersecting_street text metadata

This column holds short uppercase street-name fragments (mean length 9.3 chars, ~2.4 words) describing an intersecting street, likely paired with a primary address elsewhere. Nearly half the rows are null (48.1%) and 72.8% of the non-null values are duplicates, with common tokens like 'ST' (189), 'AVE' (84), and 'RD' dominating — many top values are bare suffixes ('ST', 'AVE', 'T', 'E') suggesting truncation or partial captures. Vocabulary is small (1,053 words across 1,413 unique values) and 80.9% of entries are all-caps, consistent with a municipal data-entry convention.

Treatment: Normalize case and standardize street-suffix abbreviations, then use as a categorical join key with the primary street column.

anthropic:claude-opus-4-7 · confidence high

Out[82]:

saturn.columns["intersecting_street"].stats

stat	value
n	10,000
nulls	4,813 (48.1%)
unique	1,413
len_min	1
len_max	20
len_mean	9.271
len_median	8
len_p95	19
word_mean	2.423
word_median	2
n_empty	0
n_duplicates	3,774
duplicate_rate	0.7276
vocab_size	1,053
readability_flesch_mean	83.04
emoji_rate	0
url_rate	0
one_word_rate	0.1139
allcaps_rate	0.8091
boilerplate_rate	0
alert: allcaps	80.9% rows are all-caps
alert: null_rate	48.1% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	72.8% duplicate strings

Fig 31.

Character-length distribution for intersecting_street.

Character-length distribution for intersecting_street (mean: 9.271; empty bins omitted).
chars	count
1	192
2	253
3	130
4	407
5	308
6	455
7	646
8	241
9	357
10	508
11	210
12	164
13	265
14	125
15	88
16	78
17	110
18	149
19	253
20	248

date_first_observed categorical metadata

Stored as a categorical YYYYMMDD string capturing the date a record was first observed, with 147 distinct values across 10,000 rows. The column is overwhelmingly dominated by the sentinel '0' (98.27%), leaving only 173 rows with real dates; the most frequent of these fall between 2024 and 2026. Several (e.g., 20261006) postdate the 20260119 stamp in the source filename and are worth validating. An entropy ratio of 0.035 confirms almost no information content as-is.

Treatment: Replace '0' with null, parse remainder as date, and consider dropping unless the rare observed dates matter.

anthropic:claude-opus-4-7 · confidence high
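
A minimal sketch of the recode-and-parse step, assuming values are strings in YYYYMMDD form with '0' as the only sentinel. The function name is hypothetical.

```python
from datetime import date, datetime

def parse_first_observed(value):
    """Return a date for 'YYYYMMDD' strings; None for the '0' sentinel or bad input."""
    if value is None:
        return None
    text = value.strip()
    if len(text) != 8 or not text.isdigit():
        return None                    # covers '0', blanks, and malformed tokens
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None                    # e.g. month 13 or day 40

print(parse_first_observed("20251229"))  # -> 2025-12-29
print(parse_first_observed("0"))         # -> None
```

Once parsed, a simple comparison against the extract date would surface any implausible future dates.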

Out[85]:

saturn.columns["date_first_observed"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	147
top_value	0
top_rate	0.9827
cardinality	147
entropy	0.2485
entropy_ratio	0.03452
alert: long_tail	125 singleton categories
alert: imbalance	top value is 98.3% of rows

Fig 32.

Top values for date_first_observed.

Top values for date_first_observed (20 unique shown, of 147 total).
value	count	share
0	9827	98.3%
20251229	6	0.1%
20261006	3	0.0%
20260901	3	0.0%
20261106	2	0.0%
20261101	2	0.0%
20260927	2	0.0%
20260925	2	0.0%
20260919	2	0.0%
20260911	2	0.0%
20260907	2	0.0%
20260829	2	0.0%
20250828	2	0.0%
20260629	2	0.0%
20240601	2	0.0%
20260520	2	0.0%
20260426	2	0.0%
20260328	2	0.0%
20260327	2	0.0%
20260314	2	0.0%

law_section categorical feature

This is a binary categorical field encoding a law/statute section, taking only two codes ('408' and '1180') across all 10,000 rows with no nulls. The split is fairly balanced at 55.84%/44.16%, yielding near-maximal entropy (0.99 of the possible 1.0). Despite the numeric-looking values, with only 2 distinct levels it behaves as a flag rather than a continuous code.

Treatment: One-hot encode or treat as a binary indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
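
The entropy figures quoted in these column notes can be reproduced from value counts. The sketch below assumes Saturn reports Shannon entropy in bits and obtains entropy_ratio by dividing by log2 of the cardinality; that definition is an inference that matches the numbers shown here, not a documented formula.

```python
import math

def entropy_stats(counts):
    """Return (entropy in bits, entropy / log2(k)) for a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = len(probs)
    ratio = entropy / math.log2(k) if k > 1 else 0.0   # a single level carries no information
    return entropy, ratio

# Counts for law_section: '408' and '1180'.
h, r = entropy_stats([5584, 4416])
print(round(h, 4), round(r, 4))  # -> 0.9901 0.9901
```

With two levels the maximum entropy is exactly 1 bit, so entropy and entropy_ratio coincide for this column.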

Out[88]:

saturn.columns["law_section"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	2
top_value	408
top_rate	0.5584
cardinality	2
entropy	0.9901
entropy_ratio	0.9901

Fig 33.

Top values for law_section.

Top values for law_section (2 unique shown, of 2 total).
value	count	share
408	5584	55.8%
1180	4416	44.2%

sub_division categorical feature

Categorical sub-division code with 76 distinct values across 10,000 rows and almost no nulls (0.02%). The distribution is heavily concentrated: 'B' alone covers 44.4% of rows and the top two values ('B' and 'd1') together exceed 58%, giving an entropy ratio of just 0.52. The mix of single-letter codes ('B','C','D') with letter-digit codes ('d1','E2','C3') — including a lowercase 'd1' alongside uppercase letters — suggests inconsistent coding conventions worth normalising.

Treatment: Normalise case and group rare levels before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high

Out[91]:

saturn.columns["sub_division"].stats

stat	value
n	10,000
nulls	2 (0.0%)
unique	76
top_value	B
top_rate	0.4445
cardinality	76
entropy	3.227
entropy_ratio	0.5166

Fig 34.

Top values for sub_division.

Top values for sub_division (20 unique shown, of 76 total).
value	count	share
B	4444	44.4%
d1	1428	14.3%
C	781	7.8%
E2	779	7.8%
F1	314	3.1%
F2	240	2.4%
D	222	2.2%
C3	183	1.8%
K2	124	1.2%
J2	123	1.2%
J6	122	1.2%
k4	120	1.2%
J3	109	1.1%
e2	94	0.9%
C4	84	0.8%
E5	81	0.8%
E3	61	0.6%
c	61	0.6%
K6	57	0.6%
n8	56	0.6%

days_parking_in_effect categorical feature

This is a 7-character day-of-week mask indicating which days parking rules are in effect, with each position likely corresponding to Sun-Sat (Y = in effect; B possibly a holiday/exempt marker; space = off). Over half the rows are null (50.59%), and among the populated values 'YYYYYYY' (every day) dominates at 42.38% (2,094 rows), followed by 'BBBBBBB' at 1,449, giving a low entropy ratio of 0.46 across 21 distinct patterns. The mix of Y, B, and spaces in the same field suggests an encoded multi-flag rather than a clean categorical.

Treatment: Split into seven per-day indicator columns and add a missingness flag before modelling.

anthropic:claude-opus-4-7 · confidence high
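
A sketch of the split, under several loud assumptions: the raw value is a positional string of up to seven characters, only 'Y' means "in effect", and short strings lost trailing blanks. The slot names are deliberately neutral because the weekday order is not confirmed by the profile.

```python
DAY_SLOTS = ("d1", "d2", "d3", "d4", "d5", "d6", "d7")  # positional; weekday order is an assumption

def split_day_mask(mask):
    """Expand a day mask into seven 'Y' indicators plus a missingness flag."""
    if mask is None:
        row = {slot: None for slot in DAY_SLOTS}
        row["mask_missing"] = True
        return row
    padded = mask.ljust(7)[:7]         # pad short strings, assuming trailing blanks were trimmed
    row = {slot: ch == "Y" for slot, ch in zip(DAY_SLOTS, padded)}
    row["mask_missing"] = False
    return row

print(split_day_mask("YYYYYBB"))
```

Whether 'B' should become its own indicator rather than a plain False depends on what it encodes, which the profile alone cannot settle.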

Out[94]:

saturn.columns["days_parking_in_effect"].stats

stat	value
n	10,000
nulls	5,059 (50.6%)
unique	21
top_value	YYYYYYY
top_rate	0.4238
cardinality	21
entropy	2.025
entropy_ratio	0.461
alert: null_rate	50.6% null

Fig 35.

Top values for days_parking_in_effect.

Top values for days_parking_in_effect (20 unique shown, of 21 total).
value	count	share
YYYYYYY	2094	20.9%
BBBBBBB	1449	14.5%
Y Y	771	7.7%
Y	495	5.0%
Y Y Y	25	0.2%
YYYYY	23	0.2%
YYYYYY	21	0.2%
YYYYYBB	19	0.2%
YYYYYYB	8	0.1%
Y Y	7	0.1%
YY YY	6	0.1%
BYBBYBB	4	0.0%
BBBBBYB	3	0.0%
YBBYBBB	3	0.0%
Y Y	3	0.0%
BYBBBBB	2	0.0%
BBYBBBB	2	0.0%
YBBBBBB	2	0.0%
BBBYBBB	2	0.0%
BYBYBYB	1	0.0%

from_hours_in_effect categorical feature

Start time of a parking/traffic regulation's effective window, encoded as a clock string like '0830A' with a sentinel 'ALL' meaning 24-hour applicability. The column is null 68.3% of the time, and among the 3,170 populated rows 'ALL' alone accounts for 45.6% (1,445), so actual start times are a minority signal. Cardinality is just 26 with entropy ratio 0.63, and the populated values cluster heavily in morning hours (0700A–1130A).

Treatment: Parse to minutes-since-midnight with a separate 'ALL'/missing flag before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[97]:

saturn.columns["from_hours_in_effect"].stats

stat	value
n	10,000
nulls	6,830 (68.3%)
unique	26
top_value	ALL
top_rate	0.4558
cardinality	26
entropy	2.956
entropy_ratio	0.6289
alert: null_rate	68.3% null

Fig 36.

Top values for from_hours_in_effect.

Top values for from_hours_in_effect (20 unique shown, of 26 total).
value	count	share
ALL	1445	14.4%
1130A	303	3.0%
0830A	242	2.4%
0930A	220	2.2%
0900A	161	1.6%
0800A	156	1.6%
0730A	119	1.2%
0700A	116	1.2%
1200A	102	1.0%
1100A	83	0.8%
0300A	61	0.6%
0200P	51	0.5%
0900P	31	0.3%
1000A	17	0.2%
1200P	14	0.1%
0600A	13	0.1%
0400P	8	0.1%
1130P	8	0.1%
0200A	8	0.1%
0100P	5	0.1%

to_hours_in_effect categorical feature

This appears to be the end-time of a parking or traffic regulation's effective window, encoded as a clock string (e.g., '0100P', '1100A') with 'ALL' meaning the rule applies all day. Two-thirds of rows are null (null_rate 0.683), and among the 3,170 populated values 'ALL' dominates at 45.6%, leaving the 30 actual time codes thinly spread (entropy_ratio 0.62). The mix of sentinel ('ALL') and time-of-day strings in the same field is the main gotcha.

Treatment: Split into a boolean 'all_day' flag and a parsed time-of-day, and impute or mask the 68% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
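
The split into an all-day flag and a parsed time might look like the sketch below. The function name and the tuple layout are hypothetical, and nulls are assumed to arrive as `None`.

```python
def split_hours_in_effect(value):
    """Return (all_day, minutes_since_midnight) for 'ALL', a clock token, or a null."""
    if value is None:
        return (None, None)            # unknown: keep missing rather than guess
    token = value.strip().upper()
    if token == "ALL":
        return (True, None)            # rule applies all day, so no clock time
    if len(token) == 5 and token[:4].isdigit() and token[4] in ("A", "P"):
        hour, minute = int(token[:2]), int(token[2:4])
        if hour <= 12 and minute <= 59:
            hour = hour % 12 + (12 if token[4] == "P" else 0)
            return (False, hour * 60 + minute)
    return (None, None)                # malformed tokens are treated as missing

print(split_hours_in_effect("ALL"))    # -> (True, None)
print(split_hours_in_effect("0100P"))  # -> (False, 780)
```

Keeping the flag separate from the numeric time avoids encoding 'ALL' as an arbitrary clock value.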

Out[100]:

saturn.columns["to_hours_in_effect"].stats

stat	value
n	10,000
nulls	6,830 (68.3%)
unique	31
top_value	ALL
top_rate	0.4558
cardinality	31
entropy	3.079
entropy_ratio	0.6215
alert: null_rate	68.3% null

Fig 37.

Top values for to_hours_in_effect.

Top values for to_hours_in_effect (20 unique shown, of 31 total).
value	count	share
ALL	1445	14.4%
0100P	299	3.0%
1100A	219	2.2%
1000A	212	2.1%
1030A	154	1.5%
0800A	118	1.2%
0600A	117	1.2%
0300A	97	1.0%
0930A	95	0.9%
0700P	85	0.9%
1230P	85	0.9%
0830A	45	0.4%
0500A	35	0.4%
0900A	27	0.3%
1200A	23	0.2%
0600P	22	0.2%
0130P	19	0.2%
0400P	15	0.1%
0500P	11	0.1%
1000P	11	0.1%

vehicle_color categorical feature

This is a vehicle color code field, almost certainly from a parking or traffic citation feed. The encoding is inconsistent: short codes (GY, BK, WH) dominate but overlapping long forms (WHITE, BLACK, GREY) and alternate abbreviations (BLK, GRY) appear as separate categories, inflating cardinality to 99 across only 10,000 rows. About 9.4% of values are null, and the top code GY covers 22.95% of non-null rows (20.8% of all rows).

Treatment: Normalize synonyms (e.g. map BK/BLK/BLACK to one code) before encoding.

anthropic:claude-opus-4-7 · confidence high

Out[103]:

saturn.columns["vehicle_color"].stats

stat	value
n	10,000
nulls	943 (9.4%)
unique	99
top_value	GY
top_rate	0.2295
cardinality	99
entropy	3.676
entropy_ratio	0.5545

Fig 38.

Top values for vehicle_color.

Top values for vehicle_color (20 unique shown, of 99 total).
value	count	share
GY	2079	20.8%
BK	1784	17.8%
WH	1579	15.8%
BL	631	6.3%
RD	348	3.5%
WHITE	347	3.5%
BLK	275	2.8%
BLACK	273	2.7%
GREY	239	2.4%
GRY	167	1.7%
GR	148	1.5%
BLUE	131	1.3%
RED	124	1.2%
GRAY	113	1.1%
SILVE	99	1.0%
WHT	65	0.7%
YW	62	0.6%
BR	60	0.6%
WHI	57	0.6%
BLU	46	0.5%

unregistered_vehicle categorical feature

This column appears to be a binary flag for whether a vehicle was unregistered, but it carries no information: 84.87% of rows are null and the remaining 1513 rows all hold the single value "0". Cardinality is 1 and entropy is 0, so the field is effectively constant where populated.

Treatment: Drop; constant value with majority nulls offers no modelling signal.

anthropic:claude-opus-4-7 · confidence high

Out[106]:

saturn.columns["unregistered_vehicle"].stats

stat	value
n	10,000
nulls	8,487 (84.9%)
unique	1
top_value	0
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	84.9% null
alert: imbalance	top value is 100.0% of rows

Fig 39.

Top values for unregistered_vehicle.

Top values for unregistered_vehicle (1 unique shown, of 1 total).
value	count	share
0	1513	15.1%

vehicle_year categorical feature

Vehicle model year stored as a categorical string with 39 distinct values across 10,000 rows and no nulls. The dominant value is "0" at 23.29% of rows, which almost certainly encodes missing/unknown rather than a real year. Among real years, recent models lead, with 2024 (759 rows) and 2025 (696 rows) at the top. Entropy ratio of 0.78 reflects a fairly even spread across the remaining year codes.

Treatment: Recode "0" as missing, cast to integer year, and optionally bucket into age bands before modelling.

anthropic:claude-opus-4-7 · confidence high
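
A sketch of the recode, cast, and bucket sequence. The reference year of 2026 and the band edges are illustrative assumptions, not values taken from the profile, and the function name is hypothetical.

```python
def vehicle_age_band(year_str, reference_year=2026):
    """Recode '0' as missing, cast to int, and bucket vehicle age into coarse bands."""
    if year_str is None or not year_str.strip().isdigit():
        return None
    year = int(year_str)
    if year == 0:
        return None                    # '0' sentinel becomes missing
    age = reference_year - year
    if age < 0:
        return "future"                # model year ahead of the reference year
    if age <= 3:
        return "0-3"
    if age <= 10:
        return "4-10"
    return "11+"

print(vehicle_age_band("2024"))  # -> 0-3
print(vehicle_age_band("0"))     # -> None
```

Model years can legitimately run one year ahead of the calendar, so the "future" band is worth inspecting before it is discarded.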

Out[109]:

saturn.columns["vehicle_year"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	39
top_value	0
top_rate	0.2329
cardinality	39
entropy	4.134
entropy_ratio	0.7822

Fig 40.

Top values for vehicle_year.

Top values for vehicle_year (20 unique shown, of 39 total).
value	count	share
0	2329	23.3%
2024	759	7.6%
2025	696	7.0%
2023	569	5.7%
2019	512	5.1%
2022	489	4.9%
2021	472	4.7%
2018	448	4.5%
2020	442	4.4%
2017	431	4.3%
2016	372	3.7%
2015	358	3.6%
2014	285	2.9%
2013	283	2.8%
2012	250	2.5%
2011	199	2.0%
2010	178	1.8%
2008	168	1.7%
2026	136	1.4%
2007	119	1.2%

meter_number categorical identifier

Likely a parking-meter identifier, but it is effectively empty: 84.79% of rows are null and 99.47% of the non-null values are the placeholder "-". Only 7 distinct values exist across 10,000 rows, with the genuine-looking numeric IDs (e.g. 309485, 103553) appearing at most twice. Entropy ratio of 0.022 confirms there is almost no information here.

Treatment: Drop; column is near-constant placeholder with overwhelming nulls.

anthropic:claude-opus-4-7 · confidence high

Out[112]:

saturn.columns["meter_number"].stats

stat	value
n	10,000
nulls	8,479 (84.8%)
unique	7
top_value	-
top_rate	0.9947
cardinality	7
entropy	0.06054
entropy_ratio	0.02156
alert: long_tail	4 singleton categories
alert: null_rate	84.8% null
alert: imbalance	top value is 99.5% of rows

Fig 41.

Top values for meter_number.

Top values for meter_number (7 unique shown, of 7 total).
value	count	share
-	1513	15.1%
309485	2	0.0%
103553	2	0.0%
109720	1	0.0%
424475	1	0.0%
106506	1	0.0%
101339	1	0.0%

feet_from_curb categorical

Out[115]:

saturn.columns["feet_from_curb"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	11
top_value	0
top_rate	0.9783
cardinality	11
entropy	0.2169
entropy_ratio	0.06271
alert: imbalance	top value is 97.8% of rows

Fig 42.

Top values for feet_from_curb.

Top values for feet_from_curb (11 unique shown, of 11 total).
value	count	share
0	9783	97.8%
1	42	0.4%
2	40	0.4%
3	35	0.4%
5	30	0.3%
4	18	0.2%
6	15	0.1%
10	12	0.1%
8	12	0.1%
7	10	0.1%
9	3	0.0%

house_number text feature

Almost certainly a street address house-number fragment, with values that are short single tokens (word_mean 1.00, len_mean 3.36, max length 7) and a vocabulary of 2,345 over 10,000 rows. Surprisingly, the most frequent values are cardinal-direction letters N/E/W/S (161/136/122/120) rather than digits; these likely belong in a separate directional prefix field. Nearly half the column is null (47.7%) and 55.1% of non-null values are duplicates, so the column is sparse yet still fairly high-cardinality (2,350 distinct values among 5,229 populated rows).

Treatment: Split into numeric house-number and directional-prefix fields, then impute or flag the 47.7% nulls before joining to address data.

anthropic:claude-opus-4-7 · confidence high
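
A hedged sketch of the split: it separates bare direction letters from numeric tokens and leaves everything else for review. The hyphenated-number pattern is an assumption about the data, and the function name and output keys are hypothetical.

```python
import re

_DIRECTION_LETTERS = {"N", "E", "W", "S"}
_NUMBER_RE = re.compile(r"\d+(-\d+)?")   # plain or hyphenated number (hyphen form is an assumption)

def split_house_number(value):
    """Return a dict with 'number', 'direction', and 'unparsed' keys."""
    parts = {"number": None, "direction": None, "unparsed": None}
    if value is None:
        return parts
    token = value.strip().upper()
    if token in _DIRECTION_LETTERS:
        parts["direction"] = token       # bare directional letter, no number
    elif _NUMBER_RE.fullmatch(token):
        parts["number"] = token          # kept as text so hyphenated forms survive
    else:
        parts["unparsed"] = token        # left for manual review rather than guessed at
    return parts

print(split_house_number("N"))
print(split_house_number("123"))
```

Counting how many rows land in 'unparsed' gives a quick measure of how well these two patterns cover the column.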

Out[118]:

saturn.columns["house_number"].stats

stat	value
n	10,000
nulls	4,771 (47.7%)
unique	2,350
len_min	1
len_max	7
len_mean	3.36
len_median	3
len_p95	5
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	2,879
duplicate_rate	0.5506
vocab_size	2,345
readability_flesch_mean	116.1
emoji_rate	0
url_rate	0
one_word_rate	0.9977
allcaps_rate	0.7887
boilerplate_rate	0
alert: one_word	99.8% rows are a single word
alert: allcaps	78.9% rows are all-caps
alert: null_rate	47.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	55.1% duplicate strings

Fig 43.

Character-length distribution for house_number.

Character-length distribution for house_number (mean: 3.36; empty bins omitted).
chars	count
1	626
2	479
3	1797
4	1299
5	777
6	246
7	5

time_first_observed categorical timestamp

Looks like a time-of-first-observation field encoded as HHMMx where x is A/P (e.g., '0230A', '0930P'). It is 78.45% null and, among the 2,155 non-null rows, the sentinel '00000' dominates at 1,926 occurrences (top_rate 0.8937), leaving only 229 rows with real timestamps spread across 206 distinct values. Entropy ratio of 0.169 confirms almost no information content once nulls and the placeholder are removed.

Treatment: Treat '00000' as null, then drop or collapse to a binary 'has_observed_time' flag — too sparse to use as a real timestamp.

anthropic:claude-opus-4-7 · confidence high

Out[121]:

saturn.columns["time_first_observed"].stats

stat	value
n	10,000
nulls	7,845 (78.5%)
unique	207
top_value	00000
top_rate	0.8937
cardinality	207
entropy	1.299
entropy_ratio	0.1688
alert: long_tail	187 singleton categories
alert: null_rate	78.5% null

Fig 44.

Top values for time_first_observed.

Top values for time_first_observed (20 unique shown, of 207 total).
value	count	share
00000	1926	19.3%
0230A	3	0.0%
0930P	3	0.0%
0505P	3	0.0%
1115P	3	0.0%
0900P	2	0.0%
1150P	2	0.0%
0920A	2	0.0%
1000A	2	0.0%
0235A	2	0.0%
0645P	2	0.0%
0252A	2	0.0%
0110A	2	0.0%
0955A	2	0.0%
0905P	2	0.0%
0456P	2	0.0%
1149A	2	0.0%
1251A	2	0.0%
1030P	2	0.0%
0119A	2	0.0%

violation_description categorical feature

This column records the parking/traffic violation type as a short text label, with 74 distinct values across 10,000 rows. It is heavily concentrated: 'PHTO SCHOOL ZN SPEED VIOLATION' alone covers 52% of non-null rows, and 15.13% of values are null. The label formatting is inconsistent: some entries carry numeric code prefixes like '14-No Standing' while others are bare labels like 'Fire Hydrant' or 'Detached Trailer' (both '40-Fire Hydrant' and 'Fire Hydrant' appear), suggesting multiple source systems or schema versions merged together.

Treatment: Normalise label formatting (strip numeric prefixes) and group rare categories before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
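
Stripping the numeric prefixes can be sketched with a single regular expression. The pattern assumes prefixes are digits, an optional capital letter, and a hyphen (as in '14-' or '20A-'); that shape is inferred from the top values shown, and the function name is hypothetical.

```python
import re

_PREFIX_RE = re.compile(r"^\d+[A-Z]?-\s*")

def strip_code_prefix(label):
    """Remove a leading code such as '14-' or '20A-' from a violation label."""
    if label is None:
        return None
    return _PREFIX_RE.sub("", label.strip())

print(strip_code_prefix("40-Fire Hydrant"))           # -> Fire Hydrant
print(strip_code_prefix("20A-No Parking (Non-COM)"))  # -> No Parking (Non-COM)
```

Prefix stripping merges pairs such as '40-Fire Hydrant' and 'Fire Hydrant', but abbreviated variants like '19-No Stand (bus stop)' would still need a separate mapping.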

Out[124]:

saturn.columns["violation_description"].stats

stat	value
n	10,000
nulls	1,513 (15.1%)
unique	74
top_value	PHTO SCHOOL ZN SPEED VIOLATION
top_rate	0.5203
cardinality	74
entropy	2.808
entropy_ratio	0.4522

Fig 45.

Top values for violation_description.

Top values for violation_description (20 unique shown, of 74 total).
value	count	share
PHTO SCHOOL ZN SPEED VIOLATION	4416	44.2%
No Parking Street Cleaning	1428	14.3%
14-No Standing	598	6.0%
40-Fire Hydrant	463	4.6%
20A-No Parking (Non-COM)	131	1.3%
19-No Stand (bus stop)	123	1.2%
16A-No Std (Com Veh) Non-COM	117	1.2%
46A-Double Parking (Non-COM)	106	1.1%
Detached Trailer	105	1.1%
Fire Hydrant	94	0.9%
71A-Insp Sticker Expired (NYS)	77	0.8%
70A-Reg. Sticker Expired (NYS)	66	0.7%
No Standing	61	0.6%
Missing Equipment	56	0.6%
50-Crosswalk	53	0.5%
74-Missing Display Plate	39	0.4%
13-No Stand (taxi stand)	38	0.4%
17-No Stand (exc auth veh)	38	0.4%
Double Parking	32	0.3%
98-Obstructing Driveway	31	0.3%

violation_post_code categorical feature

This appears to be a violation post code categorical field, likely a sub-classifier within a citation or inspection record. It has 53 unique codes with high entropy (ratio 0.89), and the top code "99" only covers 14.5% of non-null rows. Two notable surprises: 78.74% of rows are null, and the code values are heterogeneous in format—numeric ("99", "01", "311"), single letters ("B", "U"), and tokens like "SPCL"—suggesting multiple coding schemes coexist.

Treatment: Treat missingness as its own category and standardize/group the mixed-format codes before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high

Out[127]:

saturn.columns["violation_post_code"].stats

stat	value
n	10,000
nulls	7,874 (78.7%)
unique	53
top_value	99
top_rate	0.1453
cardinality	53
entropy	5.089
entropy_ratio	0.8885
alert: null_rate	78.7% null

Fig 46.

Top values for violation_post_code.

Top values for violation_post_code (20 unique shown, of 53 total).
value	count	share
99	309	3.1%
01	207	2.1%
311	135	1.4%
SPCL	85	0.9%
06	71	0.7%
10	66	0.7%
B	57	0.6%
17	56	0.6%
U	53	0.5%
15	49	0.5%
04	48	0.5%
05	48	0.5%
16	45	0.4%
11	42	0.4%
A	40	0.4%
I	35	0.4%
10-P	35	0.4%
38	34	0.3%
03-A	34	0.3%
12	33	0.3%

violation_legal_code categorical feature

This column appears to be a violation legal code indicator, but it carries no information: every one of the 4,416 non-null rows contains the single value "T" (top_rate 1.0, cardinality 1, entropy 0.0). On top of that, 55.84% of rows are null. There is nothing here to discriminate records.

Treatment: Drop; constant column with majority nulls offers no signal.

anthropic:claude-opus-4-7 · confidence high

Out[130]:

saturn.columns["violation_legal_code"].stats

stat	value
n	10,000
nulls	5,584 (55.8%)
unique	1
top_value	T
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	55.8% null
alert: imbalance	top value is 100.0% of rows

Fig 47.

Top values for violation_legal_code.

Top values for violation_legal_code (1 unique shown, of 1 total).
value	count	share
T	4416	44.2%

How to cite