urban-parking_violations_sample

Overview

Source: /home/coolhand/html/datavis/data_trove/data/urban/parking_violations_sample.csv

Saturn profiled 10,000 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

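# Notebook cell: the leading "!" runs pip through the shell (IPython only).
!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the source CSV; the LLM prose is opt-in (see the overview).
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/urban/parking_violations_sample.csv",
    "--findings", "urban-parking_violations_sample.json",
    "--llm", "anthropic:claude-opus-4-7",
])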

Summary confidence: high

This is a 10,000-row sample of NYC-style parking violations with 9 columns covering summons IDs, issue dates and times, locations, violation codes and descriptions, issuing agencies, and vehicle make and color. Two concentrations stand out. First, issue_date is heavily concentrated on a single day: 2025-12-28 accounts for 65% of rows. Second, violation_description is dominated by 'PHTO SCHOOL ZN SPEED VIOLATION' at 52% of non-null values (4,416 rows), and issuing_agency 'V' covers 44% of rows (also 4,416), which suggests the sample is skewed toward automated school-zone camera tickets. Three data-quality issues also need attention. vehicle_color records the same color under multiple codes (e.g., WH/WHITE, BLK/BLACK/BK, GREY/GRY) and needs normalization before analysis. violation_code is typed numeric, with a ~10% outlier rate and right skew, and is worth reading alongside the categorical description. street_name is messy free text: 77% of values are all-caps and many carry directional prefixes (SB, NB, WB, EB).

citing: row_count · column_count · issue_date.top_rate · issue_date.top_value · violation_description.top_rate · violation_description.top_value · violation_description.null_rate · issuing_agency.top_rate · issuing_agency.top_value · vehicle_color.top_values · vehicle_make.top_values · violation_code.outlier_rate · violation_code.skew · street_name.allcaps_rate

Out[4]:

saturn.schema() · 9 columns

column	kind	n	null%	unique	alerts
summons_number	numeric	10,000	0.0%	10,000
issue_date	categorical	10,000	0.0%	687	long_tail
violation_code	numeric	10,000	0.0%	62	outliers
violation_description	categorical	10,000	15.1%	74
street_name	text	10,000	0.0%	3,115	multilingual allcaps duplicates
vehicle_make	categorical	10,000	0.8%	126
vehicle_color	categorical	10,000	9.4%	99
violation_time	text	10,000	0.0%	1,432	one_word allcaps short_text duplicates
issuing_agency	categorical	10,000	0.0%	20

Fig 1.

violation_description · Check how dominant school-zone speed violations are versus all other categories.

Top values for violation_description (20 unique shown, of 74 total).
value	count	share
PHTO SCHOOL ZN SPEED VIOLATION	4416	44.2%
No Parking Street Cleaning	1428	14.3%
14-No Standing	598	6.0%
40-Fire Hydrant	463	4.6%
20A-No Parking (Non-COM)	131	1.3%
19-No Stand (bus stop)	123	1.2%
16A-No Std (Com Veh) Non-COM	117	1.2%
46A-Double Parking (Non-COM)	106	1.1%
Detached Trailer	105	1.1%
Fire Hydrant	94	0.9%
71A-Insp Sticker Expired (NYS)	77	0.8%
70A-Reg. Sticker Expired (NYS)	66	0.7%
No Standing	61	0.6%
Missing Equipment	56	0.6%
50-Crosswalk	53	0.5%
74-Missing Display Plate	39	0.4%
13-No Stand (taxi stand)	38	0.4%
17-No Stand (exc auth veh)	38	0.4%
Double Parking	32	0.3%
98-Obstructing Driveway	31	0.3%

Fig 2.

issuing_agency · See the concentration in agency 'V' and how the remaining agencies split the rest.

Top values for issuing_agency (20 unique shown, of 20 total).
value	count	share
V	4416	44.2%
T	2131	21.3%
S	1946	19.5%
P	1325	13.2%
K	41	0.4%
N	38	0.4%
A	23	0.2%
Y	17	0.2%
M	13	0.1%
O	9	0.1%
C	9	0.1%
8	8	0.1%
3	6	0.1%
X	5	0.1%
9	3	0.0%
W	3	0.0%
L	2	0.0%
R	2	0.0%
F	2	0.0%
U	1	0.0%

Fig 3.

vehicle_color · Look for duplicate color encodings (WH vs WHITE, BLK vs BLACK vs BK) signalling data-cleaning work.

Top values for vehicle_color (20 unique shown, of 99 total).
value	count	share
GY	2079	20.8%
BK	1784	17.8%
WH	1579	15.8%
BL	631	6.3%
RD	348	3.5%
WHITE	347	3.5%
BLK	275	2.8%
BLACK	273	2.7%
GREY	239	2.4%
GRY	167	1.7%
GR	148	1.5%
BLUE	131	1.3%
RED	124	1.2%
GRAY	113	1.1%
SILVE	99	1.0%
WHT	65	0.7%
YW	62	0.6%
BR	60	0.6%
WHI	57	0.6%
BLU	46	0.5%

Fig 4.

violation_code · Inspect the right-skewed distribution and the ~10% of values flagged as outliers.

Histogram bins for violation_code (median: 36.0).
bin	count
4 – 6.375	9
6.375 – 8.75	0
8.75 – 11.12	48
11.12 – 13.5	38
13.5 – 15.88	858
15.88 – 18.25	216
18.25 – 20.62	412
20.62 – 23	1483
23 – 25.38	14
25.38 – 27.75	6
27.75 – 30.12	0
30.12 – 32.5	3
32.5 – 34.88	4
34.88 – 37.25	4426
37.25 – 39.62	4
39.62 – 42	880
42 – 44.38	0
44.38 – 46.75	376
46.75 – 49.12	27
49.12 – 51.5	162
51.5 – 53.88	24
53.88 – 56.25	3
56.25 – 58.62	0
58.62 – 61	4
61 – 63.38	28
63.38 – 65.75	7
65.75 – 68.12	185
68.12 – 70.5	105
70.5 – 72.88	117
72.88 – 75.25	121
75.25 – 77.62	14
77.62 – 80	76
80 – 82.38	57
82.38 – 84.75	14
84.75 – 87.12	21
87.12 – 89.5	0
89.5 – 91.88	0
91.88 – 94.25	2
94.25 – 96.62	2
96.62 – 99	254

Fig 5.

vehicle_make · Confirm the expected long tail behind Honda and Toyota among 126 distinct makes.

Top values for vehicle_make (20 unique shown, of 126 total).
value	count	share
HONDA	1331	13.3%
TOYOT	1302	13.0%
NISSA	770	7.7%
FORD	603	6.0%
BMW	559	5.6%
ME/BE	521	5.2%
JEEP	450	4.5%
CHEVR	449	4.5%
HYUND	365	3.6%
SUBAR	273	2.7%
KIA	268	2.7%
LEXUS	268	2.7%
MAZDA	257	2.6%
AUDI	242	2.4%
ACURA	221	2.2%
VOLKS	199	2.0%
DODGE	175	1.8%
TESLA	152	1.5%
INFIN	151	1.5%
GMC	134	1.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
summons_number	numeric	0.0%
issue_date	categorical	0.0%
violation_code	numeric	0.0%
violation_description	categorical	15.1%
street_name	text	0.0%
vehicle_make	categorical	0.8%
vehicle_color	categorical	9.4%
violation_time	text	0.0%
issuing_agency	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,612 detected strings).
lang	count	share
en	3596	78.0%
ja	614	13.3%
es	86	1.9%
zh	59	1.3%
de	56	1.2%
fr	39	0.8%
nl	25	0.5%
it	21	0.5%
pt	15	0.3%
ko	14	0.3%
ca	14	0.3%
ar	14	0.3%
ru	12	0.3%
cs	10	0.2%
no	9	0.2%
uk	8	0.2%
eu	5	0.1%
id	4	0.1%
pl	3	0.1%
gl	1	0.0%
da	1	0.0%
ms	1	0.0%
te	1	0.0%
ro	1	0.0%
lt	1	0.0%
sv	1	0.0%
mk	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	summons_number	violation_code
summons_number	+1.00	-0.08
violation_code	-0.08	+1.00

summons_number numeric identifier

Every one of the 10,000 rows carries a distinct value (n_unique = 10000, null_rate = 0.0), and the magnitudes (min 1.12e9, max 9.26e9) match the size of NYC parking summons numbers. The wide spread (std ≈ 2.71e9) and lack of outliers reflect identifier allocation rather than a measurable quantity. Despite being typed numeric, no arithmetic interpretation applies.

Treatment: drop from modelling; retain only as a row key for joins.

anthropic:claude-opus-4-7 · confidence high

Out[14]:

saturn.columns["summons_number"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	10,000
min	1.125e+09
max	9.255e+09
mean	4.779e+09
median	4.976e+09
std	2.706e+09
q1	2.028e+09
q3	4.976e+09
iqr	2.948e+09
skew	0.4681
kurtosis	-0.9135
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of summons_number. Vertical dash marks the median.

Histogram bins for summons_number (median: 4976262516.0).
bin	count
1.125e+09 – 1.328e+09	9
1.328e+09 – 1.531e+09	1504
1.531e+09 – 1.735e+09	0
1.735e+09 – 1.938e+09	0
1.938e+09 – 2.141e+09	1945
2.141e+09 – 2.344e+09	0
2.344e+09 – 2.548e+09	0
2.548e+09 – 2.751e+09	0
2.751e+09 – 2.954e+09	0
2.954e+09 – 3.157e+09	0
3.157e+09 – 3.361e+09	0
3.361e+09 – 3.564e+09	0
3.564e+09 – 3.767e+09	0
3.767e+09 – 3.971e+09	0
3.971e+09 – 4.174e+09	0
4.174e+09 – 4.377e+09	0
4.377e+09 – 4.58e+09	0
4.58e+09 – 4.784e+09	0
4.784e+09 – 4.987e+09	4416
4.987e+09 – 5.19e+09	0
5.19e+09 – 5.393e+09	0
5.393e+09 – 5.597e+09	0
5.597e+09 – 5.8e+09	0
5.8e+09 – 6.003e+09	0
6.003e+09 – 6.206e+09	0
6.206e+09 – 6.41e+09	0
6.41e+09 – 6.613e+09	0
6.613e+09 – 6.816e+09	0
6.816e+09 – 7.019e+09	0
7.019e+09 – 7.223e+09	0
7.223e+09 – 7.426e+09	0
7.426e+09 – 7.629e+09	0
7.629e+09 – 7.832e+09	0
7.832e+09 – 8.036e+09	0
8.036e+09 – 8.239e+09	0
8.239e+09 – 8.442e+09	0
8.442e+09 – 8.646e+09	0
8.646e+09 – 8.849e+09	5
8.849e+09 – 9.052e+09	174
9.052e+09 – 9.255e+09	1947

issue_date categorical timestamp

issue_date holds ISO datetime strings and is profiled here as a categorical column: 687 distinct days, no nulls. The distribution is severely concentrated. 2025-12-28 alone accounts for 6,542 of the 10,000 rows (65.42%), followed by 1,594 on 2025-12-30 and 356 on 2025-12-29, so roughly 85% of rows (8,492) cluster in three late-December 2025 days before tapering into a long tail through 2026. An entropy ratio of 0.29 confirms the heavy skew, and the year-end spike looks like a backfill or batch-load artifact that should be confirmed before treating this as a true event date.

Treatment: Parse to datetime and bucket by month or week; investigate the 2025-12-28 spike before using as a feature.

anthropic:claude-opus-4-7 · confidence high
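
The treatment above (parse to datetime, then bucket by month or week) can be sketched with the standard library alone. This is a minimal sketch, assuming every value follows the `YYYY-MM-DDTHH:MM:SS.fff` layout seen in `top_value`; the helper names `parse_issue_date`, `month_bucket`, and `week_bucket` are illustrative and not part of Saturn.

```python
from collections import Counter
from datetime import datetime

# Layout inferred from the reported top_value, e.g. "2025-12-28T00:00:00.000".
ISSUE_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_issue_date(raw):
    """Parse one issue_date string into a datetime (hypothetical helper)."""
    return datetime.strptime(raw, ISSUE_DATE_FORMAT)


def month_bucket(raw):
    """Return a 'YYYY-MM' key for monthly aggregation."""
    return parse_issue_date(raw).strftime("%Y-%m")


def week_bucket(raw):
    """Return an ISO (year, week) pair for weekly aggregation."""
    iso = parse_issue_date(raw).isocalendar()
    return (iso[0], iso[1])


# Counts copied from the issue_date top-values table (top three days plus one tail day).
top_days = {
    "2025-12-28T00:00:00.000": 6542,
    "2025-12-30T00:00:00.000": 1594,
    "2025-12-29T00:00:00.000": 356,
    "2026-06-26T00:00:00.000": 14,
}

by_month = Counter()
for raw, count in top_days.items():
    by_month[month_bucket(raw)] += count

print(by_month)
```

Monthly buckets keep the late-December cluster in one key. ISO weeks split it, because 2025-12-28 is the last day of one ISO week and 2025-12-29 starts the first ISO week of 2026, so the choice of bucket matters when investigating the spike.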

Out[17]:

saturn.columns["issue_date"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	687
top_value	2025-12-28T00:00:00.000
top_rate	0.6542
cardinality	687
entropy	2.765
entropy_ratio	0.2934
alert: long_tail	368 singleton categories

Fig 10.

Top values for issue_date.

Top values for issue_date (20 unique shown, of 687 total).
value	count	share
2025-12-28T00:00:00.000	6542	65.4%
2025-12-30T00:00:00.000	1594	15.9%
2025-12-29T00:00:00.000	356	3.6%
2026-06-26T00:00:00.000	14	0.1%
2026-09-27T00:00:00.000	13	0.1%
2026-09-25T00:00:00.000	12	0.1%
2025-12-31T00:00:00.000	12	0.1%
2026-06-27T00:00:00.000	11	0.1%
2026-10-25T00:00:00.000	10	0.1%
2026-08-31T00:00:00.000	10	0.1%
2026-08-27T00:00:00.000	10	0.1%
2026-07-26T00:00:00.000	10	0.1%
2026-06-30T00:00:00.000	10	0.1%
2026-08-25T00:00:00.000	9	0.1%
2026-10-29T00:00:00.000	8	0.1%
2026-10-17T00:00:00.000	8	0.1%
2026-10-05T00:00:00.000	8	0.1%
2026-09-23T00:00:00.000	8	0.1%
2026-07-27T00:00:00.000	8	0.1%
2026-07-24T00:00:00.000	8	0.1%

violation_code numeric feature

This is almost certainly a categorical violation code stored as a number, with 62 distinct values across 10,000 rows and no nulls. Despite the numeric type, the distribution is meaningless as a quantity: values span 4 to 99, the IQR runs 21–36, and 10.07% of rows (1,007) flag as outliers under a numeric rule, with skew 1.51 and kurtosis 2.99. Median equals Q3 at 36, suggesting a heavy concentration at one or two dominant codes.

Treatment: Treat as categorical and one-hot or target-encode rather than using the raw integer.

anthropic:claude-opus-4-7 · confidence high

Out[20]:

saturn.columns["violation_code"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	62
min	4
max	99
mean	35.85
median	36
std	17.55
q1	21
q3	36
iqr	15
skew	1.514
kurtosis	2.986
n_outliers	1,007
outlier_rate	0.1007
zero_rate	0
alert: outliers	10.1% rows beyond 1.5 IQR
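
The outlier count can be cross-checked from the reported quartiles and histogram. The sketch below assumes the alert's "1.5 IQR" wording means the usual Tukey fences (Q1 - 1.5·IQR and Q3 + 1.5·IQR); Saturn's exact rule is not shown in this report, so treat that as an inference. The bin edges and counts are copied from the violation_code histogram.

```python
# Quartiles copied from the violation_code stats table.
q1, q3 = 21, 36
iqr = q3 - q1

# Tukey fences: values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# (bin lower edge, count) for the histogram bins starting at 58.62 and above.
upper_bins = [
    (58.62, 4), (61.0, 28), (63.38, 7), (65.75, 185), (68.12, 105),
    (70.5, 117), (72.88, 121), (75.25, 14), (77.62, 76), (80.0, 57),
    (82.38, 14), (84.75, 21), (87.12, 0), (89.5, 0), (91.88, 2),
    (94.25, 2), (96.62, 254),
]

# Sum the bins that lie entirely above the upper fence.
n_high = sum(count for edge, count in upper_bins if edge > upper_fence)

print(lower_fence, upper_fence, n_high, n_high / 10_000)
# prints: -1.5 58.5 1007 0.1007
```

The lower fence (-1.5) sits below the column minimum of 4, so under this assumption every flagged row is a high code, and the total matches the reported n_outliers of 1,007. That supports the reading that the alert reflects the numeric typing of a categorical code, not anomalous records.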

Fig 11.

Distribution of violation_code. Vertical dash marks the median.

Histogram bins for violation_code (median: 36.0).
bin	count
4 – 6.375	9
6.375 – 8.75	0
8.75 – 11.12	48
11.12 – 13.5	38
13.5 – 15.88	858
15.88 – 18.25	216
18.25 – 20.62	412
20.62 – 23	1483
23 – 25.38	14
25.38 – 27.75	6
27.75 – 30.12	0
30.12 – 32.5	3
32.5 – 34.88	4
34.88 – 37.25	4426
37.25 – 39.62	4
39.62 – 42	880
42 – 44.38	0
44.38 – 46.75	376
46.75 – 49.12	27
49.12 – 51.5	162
51.5 – 53.88	24
53.88 – 56.25	3
56.25 – 58.62	0
58.62 – 61	4
61 – 63.38	28
63.38 – 65.75	7
65.75 – 68.12	185
68.12 – 70.5	105
70.5 – 72.88	117
72.88 – 75.25	121
75.25 – 77.62	14
77.62 – 80	76
80 – 82.38	57
82.38 – 84.75	14
84.75 – 87.12	21
87.12 – 89.5	0
89.5 – 91.88	0
91.88 – 94.25	2
94.25 – 96.62	2
96.62 – 99	254

violation_description categorical feature

Categorical column describing the parking or traffic violation issued, with 74 distinct labels across 10,000 rows. It is heavily concentrated: 'PHTO SCHOOL ZN SPEED VIOLATION' alone covers 52.03% of non-null records (4,416 of 8,487, or 44.2% of all rows), with 'No Parking Street Cleaning' a distant second at 1,428, yielding a low entropy ratio of 0.452. Note that 15.13% of values are null, and the labels mix numeric-prefixed legal codes (e.g. '14-No Standing') with free-form descriptions like 'Fire Hydrant', some of which appear to duplicate the coded versions ('40-Fire Hydrant' vs 'Fire Hydrant').

Treatment: Normalise the coded vs free-text duplicates, then group rare categories before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
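
One way to approach the normalisation in the treatment above is to strip the leading legal-code prefix and then fold rare labels into a single bucket. This is a sketch under assumptions: the prefix pattern (digits, an optional capital letter, a hyphen) is inferred from labels such as '40-' and '20A-', the `min_count` threshold is arbitrary, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumed prefix shape: digits, an optional capital letter, then a hyphen ("40-", "20A-").
CODE_PREFIX = re.compile(r"^\d+[A-Z]?-")


def strip_code_prefix(label):
    """Drop a leading legal-code prefix; leave nulls and free-text labels alone."""
    if label is None:
        return None
    return CODE_PREFIX.sub("", label).strip()


def group_rare(counts, min_count, other="OTHER"):
    """Fold categories with fewer than min_count rows into one bucket."""
    grouped = Counter()
    for label, count in counts.items():
        grouped[label if count >= min_count else other] += count
    return grouped


# A few rows copied from the violation_description top-values table.
raw_counts = {
    "PHTO SCHOOL ZN SPEED VIOLATION": 4416,
    "14-No Standing": 598,
    "40-Fire Hydrant": 463,
    "Fire Hydrant": 94,
    "No Standing": 61,
    "98-Obstructing Driveway": 31,
}

merged = Counter()
for label, count in raw_counts.items():
    merged[strip_code_prefix(label)] += count

grouped = group_rare(merged, min_count=100)  # threshold chosen only for illustration
print(merged["Fire Hydrant"], merged["No Standing"], grouped["OTHER"])
# prints: 557 659 31
```

Prefix stripping alone does not reconcile every pair: '46A-Double Parking (Non-COM)' would become 'Double Parking (Non-COM)', which still differs from 'Double Parking', so a small manual mapping is likely needed on top.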

Out[23]:

saturn.columns["violation_description"].stats

stat	value
n	10,000
nulls	1,513 (15.1%)
unique	74
top_value	PHTO SCHOOL ZN SPEED VIOLATION
top_rate	0.5203
cardinality	74
entropy	2.808
entropy_ratio	0.4522
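
The `share` column in the top-values tables and the `top_rate` stat differ for columns with nulls. The arithmetic below suggests that `share` divides by all rows while `top_rate` divides by non-null rows; this is inferred from the reported numbers, not from Saturn's documentation.

```python
# (n, nulls, count of the top value) copied from the stats and top-values tables.
columns = {
    "violation_description": (10_000, 1_513, 4_416),
    "vehicle_color": (10_000, 943, 2_079),
    "vehicle_make": (10_000, 78, 1_331),
}

rates = {}
for name, (n, nulls, top_count) in columns.items():
    share_of_all = top_count / n                # denominator: every row
    rate_of_non_null = top_count / (n - nulls)  # denominator: non-null rows only
    rates[name] = (share_of_all, rate_of_non_null)
    print(f"{name}: share={share_of_all:.3f} top_rate={rate_of_non_null:.4f}")
# prints:
# violation_description: share=0.442 top_rate=0.5203
# vehicle_color: share=0.208 top_rate=0.2295
# vehicle_make: share=0.133 top_rate=0.1341
```

All three computed rates agree with the reported `top_rate` values, so percentages quoted from the tables and from the stats should not be compared without checking which denominator they use.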

Fig 12.

Top values for violation_description.

Top values for violation_description (20 unique shown, of 74 total).
value	count	share
PHTO SCHOOL ZN SPEED VIOLATION	4416	44.2%
No Parking Street Cleaning	1428	14.3%
14-No Standing	598	6.0%
40-Fire Hydrant	463	4.6%
20A-No Parking (Non-COM)	131	1.3%
19-No Stand (bus stop)	123	1.2%
16A-No Std (Com Veh) Non-COM	117	1.2%
46A-Double Parking (Non-COM)	106	1.1%
Detached Trailer	105	1.1%
Fire Hydrant	94	0.9%
71A-Insp Sticker Expired (NYS)	77	0.8%
70A-Reg. Sticker Expired (NYS)	66	0.7%
No Standing	61	0.6%
Missing Equipment	56	0.6%
50-Crosswalk	53	0.5%
74-Missing Display Plate	39	0.4%
13-No Stand (taxi stand)	38	0.4%
17-No Stand (exc auth veh)	38	0.4%
Double Parking	32	0.3%
98-Obstructing Driveway	31	0.3%

street_name text feature

Street-name strings, mostly truncated NYC traffic-camera or incident locations like 'SB CROSS BAY BLVD @', with directional prefixes (sb/wb/nb/eb) and '@' delimiters dominating the top words. Values are heavily repeated (68.8% duplicate rate, only 3,115 uniques in 10,000 rows) and 77.3% are all-caps; lengths cap sharply at 20 characters, suggesting upstream truncation. Language detection, which Fig 7 pools across all text columns, flags ja (614) and zh (59) alongside 3,596 en, but this is almost certainly a false positive on short all-caps abbreviations rather than genuine multilingual content.

Treatment: Normalise case, split off the directional prefix and the '@' cross-street, and treat as a categorical location feature.

anthropic:claude-opus-4-7 · confidence high
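
The split described in the treatment above might look like the following. It is a sketch, assuming the only directional prefixes are NB/SB/EB/WB and that '@' always separates the main street from the cross street; `StreetParts` and `split_street` are hypothetical names.

```python
from typing import NamedTuple, Optional

# Directional prefixes named in the report summary.
DIRECTIONS = {"NB", "SB", "EB", "WB"}


class StreetParts(NamedTuple):
    direction: Optional[str]
    street: str
    cross_street: Optional[str]


def split_street(raw: str) -> StreetParts:
    """Normalise case and whitespace, then peel off direction and cross street."""
    text = " ".join(raw.upper().split())
    main, _, cross = text.partition("@")
    tokens = main.split()
    direction = None
    if tokens and tokens[0] in DIRECTIONS:
        direction = tokens[0]
        tokens = tokens[1:]
    cross = cross.strip()
    # An empty cross street usually means the value was cut off at 20 characters.
    return StreetParts(direction, " ".join(tokens), cross or None)


print(split_street("SB CROSS BAY BLVD @"))
# prints: StreetParts(direction='SB', street='CROSS BAY BLVD', cross_street=None)
```

Because values appear truncated at 20 characters, the cross street is often missing or partial; keeping it as a separate, nullable field avoids polluting the main street category with fragments.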

Out[26]:

saturn.columns["street_name"].stats

stat	value
n	10,000
nulls	4 (0.0%)
unique	3,115
len_min	2
len_max	20
len_mean	14.87
len_median	16
len_p95	20
word_mean	3.513
word_median	3
n_empty	0
n_duplicates	6,881
duplicate_rate	0.6884
vocab_size	1,760
readability_flesch_mean	62.55
emoji_rate	0
url_rate	0
one_word_rate	0.01371
allcaps_rate	0.7732
boilerplate_rate	0
alert: multilingual	28 languages detected in sample
alert: allcaps	77.3% rows are all-caps
alert: duplicates	68.8% duplicate strings

Fig 13.

Character-length distribution for street_name.

Character-length distribution for street_name (mean: 14.87).
chars	count
2	1
3	9
4	15
5	150
6	105
7	533
8	727
9	1009
10	396
11	415
12	374
13	489
14	404
15	344
16	155
17	90
18	111
19	1013
20	3656

vehicle_make categorical feature

Categorical vehicle manufacturer codes, with HONDA leading at 1,331 rows (13.4% of the 9,922 non-null rows) and TOYOT close behind at 1,302 (13.1%). Values appear truncated to 5 characters (TOYOT, NISSA, ME/BE, CHEVR, HYUND, SUBAR), which will fragment any join against full make names. Cardinality is 126 with entropy ratio 0.668, indicating a long but moderately concentrated tail, and nulls are negligible at 0.78%.

Treatment: Normalize the truncated codes to canonical make names, then target- or frequency-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
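
A lookup table is probably the simplest way to expand the truncated codes before encoding. The mapping below is a hypothetical starting point covering only codes visible in the top-values table; the full names are assumptions about what each 5-character code abbreviates, and `canonical_make` and `frequency_encode` are illustrative helpers.

```python
# Hypothetical expansion of truncated codes; extend after reviewing all 126 values.
MAKE_NAMES = {
    "TOYOT": "TOYOTA",
    "NISSA": "NISSAN",
    "ME/BE": "MERCEDES-BENZ",
    "CHEVR": "CHEVROLET",
    "HYUND": "HYUNDAI",
    "SUBAR": "SUBARU",
    "VOLKS": "VOLKSWAGEN",
    "INFIN": "INFINITI",
}


def canonical_make(code):
    """Expand a truncated make code; pass unknown or untruncated codes through."""
    if code is None:
        return None
    key = code.strip().upper()
    return MAKE_NAMES.get(key, key)


def frequency_encode(counts, n_non_null):
    """Map each make to its share of non-null rows (a simple frequency encoding)."""
    return {make: count / n_non_null for make, count in counts.items()}


# Top three rows copied from the vehicle_make table; 9,922 is the non-null row count.
top_makes = {"HONDA": 1331, "TOYOT": 1302, "NISSA": 770}
canonical_counts = {canonical_make(code): n for code, n in top_makes.items()}
encoded = frequency_encode(canonical_counts, n_non_null=9_922)

print(sorted(encoded))
# prints: ['HONDA', 'NISSAN', 'TOYOTA']
```

Frequency encoding keeps the feature to one numeric column, which suits the 126-value cardinality better than one-hot encoding would.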

Out[29]:

saturn.columns["vehicle_make"].stats

stat	value
n	10,000
nulls	78 (0.8%)
unique	126
top_value	HONDA
top_rate	0.1341
cardinality	126
entropy	4.661
entropy_ratio	0.668

Fig 14.

Top values for vehicle_make.

Top values for vehicle_make (20 unique shown, of 126 total).
value	count	share
HONDA	1331	13.3%
TOYOT	1302	13.0%
NISSA	770	7.7%
FORD	603	6.0%
BMW	559	5.6%
ME/BE	521	5.2%
JEEP	450	4.5%
CHEVR	449	4.5%
HYUND	365	3.6%
SUBAR	273	2.7%
KIA	268	2.7%
LEXUS	268	2.7%
MAZDA	257	2.6%
AUDI	242	2.4%
ACURA	221	2.2%
VOLKS	199	2.0%
DODGE	175	1.8%
TESLA	152	1.5%
INFIN	151	1.5%
GMC	134	1.3%

vehicle_color categorical feature

Vehicle color codes, encoded inconsistently: short codes like GY (2,079), BK (1,784), and WH (1,579) coexist with spelled-out forms WHITE (347), BLACK (273), and GREY (239), and with alternate abbreviations such as BLK (275) and GRY (167) for the same underlying colors. With 99 distinct values across 10,000 rows and a 9.43% null rate, the cardinality is inflated by these duplicate encodings rather than by true diversity. The entropy ratio of 0.55 reflects a heavy concentration in the gray/black/white head of the distribution.

Treatment: Normalize synonymous codes (e.g., BK/BLK/BLACK → black) before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
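
The synonym merge can be sketched as a dictionary lookup applied to the top-values table. The mapping is an assumption: in particular, reading GY as gray and BL as blue is a guess from context, and GR is left unmapped because it could plausibly mean green or gray. The counts are copied from the vehicle_color table.

```python
from collections import Counter

# Assumed synonym map; GY -> gray and BL -> blue are guesses, GR is deliberately absent.
COLOR_SYNONYMS = {
    "BK": "black", "BLK": "black", "BLACK": "black",
    "WH": "white", "WHT": "white", "WHI": "white", "WHITE": "white",
    "GY": "gray", "GRY": "gray", "GREY": "gray", "GRAY": "gray",
    "BL": "blue", "BLU": "blue", "BLUE": "blue",
    "RD": "red", "RED": "red",
}


def canonical_color(code):
    """Map a raw color code to a canonical name; pass unknown codes through."""
    if code is None:
        return None
    key = code.strip().upper()
    return COLOR_SYNONYMS.get(key, key)


# The 20 rows shown in the vehicle_color top-values table.
top_colors = {
    "GY": 2079, "BK": 1784, "WH": 1579, "BL": 631, "RD": 348,
    "WHITE": 347, "BLK": 275, "BLACK": 273, "GREY": 239, "GRY": 167,
    "GR": 148, "BLUE": 131, "RED": 124, "GRAY": 113, "SILVE": 99,
    "WHT": 65, "YW": 62, "BR": 60, "WHI": 57, "BLU": 46,
}

merged = Counter()
for code, count in top_colors.items():
    merged[canonical_color(code)] += count

print(len(top_colors), "->", len(merged))
print(merged.most_common(3))
# prints:
# 20 -> 9
# [('gray', 2598), ('black', 2332), ('white', 2048)]
```

Under this mapping the 20 listed codes collapse to 9 categories, which illustrates how much of the 99-value cardinality may be encoding noise. Ambiguous codes such as GR are better resolved against a data dictionary than guessed.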

Out[32]:

saturn.columns["vehicle_color"].stats

stat	value
n	10,000
nulls	943 (9.4%)
unique	99
top_value	GY
top_rate	0.2295
cardinality	99
entropy	3.676
entropy_ratio	0.5545

Fig 15.

Top values for vehicle_color.

Top values for vehicle_color (20 unique shown, of 99 total).
value	count	share
GY	2079	20.8%
BK	1784	17.8%
WH	1579	15.8%
BL	631	6.3%
RD	348	3.5%
WHITE	347	3.5%
BLK	275	2.8%
BLACK	273	2.7%
GREY	239	2.4%
GRY	167	1.7%
GR	148	1.5%
BLUE	131	1.3%
RED	124	1.2%
GRAY	113	1.1%
SILVE	99	1.0%
WHT	65	0.7%
YW	62	0.6%
BR	60	0.6%
WHI	57	0.6%
BLU	46	0.5%

violation_time text timestamp

This column encodes time-of-day stamps in a compact HHMM format with a trailing A or P for AM/PM (e.g. '0839A', '1200P'); every value is exactly 5 characters and uppercase. Duplication is high (85.67%) across 1,432 unique stamps, which is expected for clock times sampled across 10,000 records. Nulls are negligible (4 rows, a rate of 0.0004) and there are no empty strings.

Treatment: Parse the HHMMa/p format into a proper time-of-day (minutes since midnight) before modelling.

anthropic:claude-opus-4-7 · confidence high
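
Parsing the stamp into minutes since midnight could be done as below. This is a sketch, assuming a 12-hour clock in which '12' (and, if it occurs, '00') denotes the first hour of each half-day; malformed stamps return None instead of raising. `minutes_since_midnight` is a hypothetical helper name.

```python
import re

# Two hour digits, two minute digits, then A or P.
TIME_STAMP = re.compile(r"^(\d{2})(\d{2})([AP])$")


def minutes_since_midnight(stamp):
    """Convert 'HHMMA' / 'HHMMP' to minutes since midnight, or None if invalid."""
    if stamp is None:
        return None
    match = TIME_STAMP.match(stamp.strip().upper())
    if match is None:
        return None
    hour, minute, half = int(match.group(1)), int(match.group(2)), match.group(3)
    if hour > 12 or minute > 59:
        return None
    hour = hour % 12  # 12 (and an assumed 00) map to the start of the half-day
    if half == "P":
        hour += 12
    return hour * 60 + minute


for stamp in ("0839A", "1200P", "1200A"):
    print(stamp, minutes_since_midnight(stamp))
# prints:
# 0839A 519
# 1200P 720
# 1200A 0
```

Returning None for unparseable values keeps the four null rows and any out-of-range stamps visible for a later data-quality check instead of silently coercing them.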

Out[35]:

saturn.columns["violation_time"].stats

stat	value
n	10,000
nulls	4 (0.0%)
unique	1,432
len_min	5
len_max	5
len_mean	5
len_median	5
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	8,564
duplicate_rate	0.8567
vocab_size	1,433
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9999
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	85.7% duplicate strings

Fig 16.

Character-length distribution for violation_time.

Character-length distribution for violation_time (mean: 5.0).
chars	count
5	9996

issuing_agency categorical feature

Single-character codes for the agency issuing each record, drawn from a closed set of 20 values (17 letters and the digits 8, 3, and 9) with no nulls. The distribution is heavily concentrated: 'V' alone covers 44.16% (4,416/10,000) and the top four codes (V, T, S, P) together account for 98.2% of rows (9,818), while codes like K, N, A, Y, M, O appear fewer than 50 times each. An entropy ratio of 0.46 confirms the imbalance.

Treatment: One-hot encode the top categories and bucket the long tail into 'other' before modelling.

anthropic:claude-opus-4-7 · confidence high
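
Bucketing the tail before one-hot encoding can be sketched without any third-party library. The example uses a six-code subset of the issuing_agency table (the top four codes plus K and N); `bucket_tail` and `one_hot` are hypothetical helpers, and keeping the top four codes is a choice made here for illustration.

```python
from collections import Counter


def bucket_tail(values, top_n, other="other"):
    """Keep the top_n most frequent values and replace the rest with `other`."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [value if value in keep else other for value in values]


def one_hot(values):
    """Return (rows, categories): one 0/1 indicator per category for each value."""
    categories = sorted(set(values))
    rows = [[int(value == cat) for cat in categories] for value in values]
    return rows, categories


# Subset of the issuing_agency table: top four codes plus two tail codes.
subset_counts = {"V": 4416, "T": 2131, "S": 1946, "P": 1325, "K": 41, "N": 38}
values = [code for code, n in subset_counts.items() for _ in range(n)]

bucketed = bucket_tail(values, top_n=4)
rows, categories = one_hot(bucketed)

print(categories)
print(Counter(bucketed)["other"])
# prints:
# ['P', 'S', 'T', 'V', 'other']
# 79
```

With the full table the 'other' bucket would hold the 182 rows outside V, T, S, and P, so the encoded feature shrinks from 20 indicator columns to 5 with little loss of information.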

Out[38]:

saturn.columns["issuing_agency"].stats

stat	value
n	10,000
nulls	0 (0.0%)
unique	20
top_value	V
top_rate	0.4416
cardinality	20
entropy	2.007
entropy_ratio	0.4644

Fig 17.

Top values for issuing_agency.

Top values for issuing_agency (20 unique shown, of 20 total).
value	count	share
V	4416	44.2%
T	2131	21.3%
S	1946	19.5%
P	1325	13.2%
K	41	0.4%
N	38	0.4%
A	23	0.2%
Y	17	0.2%
M	13	0.1%
O	9	0.1%
C	9	0.1%
8	8	0.1%
3	6	0.1%
X	5	0.1%
9	3	0.0%
W	3	0.0%
L	2	0.0%
R	2	0.0%
F	2	0.0%
U	1	0.0%
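
Since this table lists all 20 agency codes, the reported entropy figures can be recomputed from it. The sketch below assumes `entropy` is Shannon entropy in bits and `entropy_ratio` is that entropy divided by log2 of the cardinality; both definitions are inferred from the reported values and are not confirmed by Saturn's documentation.

```python
import math

# All 20 rows of the issuing_agency top-values table.
agency_counts = {
    "V": 4416, "T": 2131, "S": 1946, "P": 1325, "K": 41, "N": 38, "A": 23,
    "Y": 17, "M": 13, "O": 9, "C": 9, "8": 8, "3": 6, "X": 5, "9": 3,
    "W": 3, "L": 2, "R": 2, "F": 2, "U": 1,
}

total = sum(agency_counts.values())

# Shannon entropy in bits: -sum(p * log2(p)) over the observed categories.
entropy_bits = -sum(
    (count / total) * math.log2(count / total) for count in agency_counts.values()
)

# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy_bits / math.log2(len(agency_counts))

print(total, round(entropy_bits, 3), round(entropy_ratio, 3))
```

The results should land close to the reported entropy of 2.007 and entropy_ratio of 0.4644. A ratio of 1.0 would mean all 20 codes are equally common, so a value under 0.5 is consistent with four codes carrying almost all rows.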
