urban parking violations sample

source /home/coolhand/html/datavis/data_trove/data/urban/parking_violations_sample.csv · 10,000 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 10,000-row sample of NYC-style parking violations with 9 columns covering summons IDs, issue dates and times, locations, violation codes and descriptions, issuing agencies, and vehicle make and color. Two things stand out. First, issue_date is heavily concentrated on a single day: 2025-12-28 accounts for 65% of rows. Second, violation_description is dominated by 'PHTO SCHOOL ZN SPEED VIOLATION' at 52% of non-null values (44% of all rows), and issuing_agency 'V' covers the same 4,416 rows' worth of the sample (44%), which suggests a skew toward automated school-zone camera tickets. vehicle_color shows clear data-quality issues: the same color appears under multiple codes (e.g., WH/WHITE, BLK/BLACK/BK, GREY/GRY) that need normalization before analysis. violation_code is stored as numeric with a ~10% outlier rate and right skew, so it is worth examining alongside the categorical description. street_name is messy free text: 77% of values are all-caps, and many carry directional prefixes (SB, NB, WB, EB).

citing: row_count · column_count · issue_date.top_rate · issue_date.top_value · violation_description.top_rate · violation_description.top_value · violation_description.null_rate · issuing_agency.top_rate · issuing_agency.top_value · vehicle_color.top_values · vehicle_make.top_values · violation_code.outlier_rate · violation_code.skew · street_name.allcaps_rate

Charts the summary said to look at first

violation_description · Check how dominant school-zone speed violations are versus all other categories.

Top values for violation_description (20 unique shown, of 74 total). Shares are of all 10,000 rows, nulls included.
value	count	share
PHTO SCHOOL ZN SPEED VIOLATION	4416	44.2%
No Parking Street Cleaning	1428	14.3%
14-No Standing	598	6.0%
40-Fire Hydrant	463	4.6%
20A-No Parking (Non-COM)	131	1.3%
19-No Stand (bus stop)	123	1.2%
16A-No Std (Com Veh) Non-COM	117	1.2%
46A-Double Parking (Non-COM)	106	1.1%
Detached Trailer	105	1.1%
Fire Hydrant	94	0.9%
71A-Insp Sticker Expired (NYS)	77	0.8%
70A-Reg. Sticker Expired (NYS)	66	0.7%
No Standing	61	0.6%
Missing Equipment	56	0.6%
50-Crosswalk	53	0.5%
74-Missing Display Plate	39	0.4%
13-No Stand (taxi stand)	38	0.4%
17-No Stand (exc auth veh)	38	0.4%
Double Parking	32	0.3%
98-Obstructing Driveway	31	0.3%
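The table reports the top value at 44.2% while the summary and top_rate quote 52%. The arithmetic below is a small sketch of why both figures can hold, assuming the table share divides by all rows and top_rate divides by non-null rows; the profile does not state its denominators, but the reported counts are consistent with that reading.

```python
# Figures copied from this profile.
n_rows = 10_000      # total rows
n_null = 1_513       # null violation_description values
top_count = 4_416    # rows with 'PHTO SCHOOL ZN SPEED VIOLATION'

# Assumed denominators: all rows for the table share, non-null rows for top_rate.
share_of_all_rows = top_count / n_rows
share_of_non_null = top_count / (n_rows - n_null)

print(f"share of all rows:        {share_of_all_rows:.1%}")
print(f"share of non-null values: {share_of_non_null:.2%}")
```

When comparing columns, it helps to pick one denominator and use it throughout, since a 15% null rate moves the headline figure by about eight points here.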

issuing_agency · See the concentration in agency 'V' and how the remaining agencies split the rest.

Top values for issuing_agency (20 unique shown, of 20 total).
value	count	share
V	4416	44.2%
T	2131	21.3%
S	1946	19.5%
P	1325	13.2%
K	41	0.4%
N	38	0.4%
A	23	0.2%
Y	17	0.2%
M	13	0.1%
O	9	0.1%
C	9	0.1%
8	8	0.1%
3	6	0.1%
X	5	0.1%
9	3	0.0%
W	3	0.0%
L	2	0.0%
R	2	0.0%
F	2	0.0%
U	1	0.0%

vehicle_color · Look for duplicate color encodings (WH vs WHITE, BLK vs BLACK vs BK) signalling data-cleaning work.

Top values for vehicle_color (20 unique shown, of 99 total). Shares are of all 10,000 rows, nulls included.
value	count	share
GY	2079	20.8%
BK	1784	17.8%
WH	1579	15.8%
BL	631	6.3%
RD	348	3.5%
WHITE	347	3.5%
BLK	275	2.8%
BLACK	273	2.7%
GREY	239	2.4%
GRY	167	1.7%
GR	148	1.5%
BLUE	131	1.3%
RED	124	1.2%
GRAY	113	1.1%
SILVE	99	1.0%
WHT	65	0.7%
YW	62	0.6%
BR	60	0.6%
WHI	57	0.6%
BLU	46	0.5%

violation_code · Inspect the right-skewed distribution and the ~10% of values flagged as outliers.

Histogram bins for violation_code (median: 36.0).
bin	count
4 – 6.375	9
6.375 – 8.75	0
8.75 – 11.12	48
11.12 – 13.5	38
13.5 – 15.88	858
15.88 – 18.25	216
18.25 – 20.62	412
20.62 – 23	1483
23 – 25.38	14
25.38 – 27.75	6
27.75 – 30.12	0
30.12 – 32.5	3
32.5 – 34.88	4
34.88 – 37.25	4426
37.25 – 39.62	4
39.62 – 42	880
42 – 44.38	0
44.38 – 46.75	376
46.75 – 49.12	27
49.12 – 51.5	162
51.5 – 53.88	24
53.88 – 56.25	3
56.25 – 58.62	0
58.62 – 61	4
61 – 63.38	28
63.38 – 65.75	7
65.75 – 68.12	185
68.12 – 70.5	105
70.5 – 72.88	117
72.88 – 75.25	121
75.25 – 77.62	14
77.62 – 80	76
80 – 82.38	57
82.38 – 84.75	14
84.75 – 87.12	21
87.12 – 89.5	0
89.5 – 91.88	0
91.88 – 94.25	2
94.25 – 96.62	2
96.62 – 99	254

vehicle_make · Confirm the expected long tail behind Honda and Toyota among 126 distinct makes.

Top values for vehicle_make (20 unique shown, of 126 total). Shares are of all 10,000 rows, nulls included.
value	count	share
HONDA	1331	13.3%
TOYOT	1302	13.0%
NISSA	770	7.7%
FORD	603	6.0%
BMW	559	5.6%
ME/BE	521	5.2%
JEEP	450	4.5%
CHEVR	449	4.5%
HYUND	365	3.6%
SUBAR	273	2.7%
KIA	268	2.7%
LEXUS	268	2.7%
MAZDA	257	2.6%
AUDI	242	2.4%
ACURA	221	2.2%
VOLKS	199	2.0%
DODGE	175	1.8%
TESLA	152	1.5%
INFIN	151	1.5%
GMC	134	1.3%

Schema

9 columns

Per-column summary.
column	type	null rate	unique	alerts
summons_number	numeric	0.0%	10,000
issue_date	categorical	0.0%	687	long_tail
violation_code	numeric	0.0%	62	outliers
violation_description	categorical	15.1%	74
street_name	text	0.0%	3,115	multilingual allcaps duplicates
vehicle_make	categorical	0.8%	126
vehicle_color	categorical	9.4%	99
violation_time	text	0.0%	1,432	one_word allcaps short_text duplicates
issuing_agency	categorical	0.0%	20

summons_number

numeric identifier

Every one of the 10,000 rows carries a distinct value (n_unique = 10000, null_rate = 0.0), and the magnitudes (min 1.12e9, max 9.26e9) match the size of NYC parking summons numbers. The wide spread (std ≈ 2.71e9) and lack of outliers reflect identifier allocation rather than a measurable quantity. Despite being typed numeric, no arithmetic interpretation applies. Treatment: drop from modelling; retain only as a row key for joins. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 0 (0.0%)
unique: 10,000
min: 1.125e+09
max: 9.255e+09
mean: 4.779e+09
median: 4.976e+09
std: 2.706e+09
q1: 2.028e+09
q3: 4.976e+09
iqr: 2.948e+09
skew: 0.4681
kurtosis: -0.9135
n_outliers: 0
outlier_rate: 0
zero_rate: 0

issue_date

categorical timestamp long_tail

The issue_date column holds ISO datetime strings and is treated here as categorical: 687 distinct days, no nulls. The distribution is severely concentrated: 65.42% of the 10,000 rows (6,542) fall on 2025-12-28, with another 1,594 on 2025-12-30 and 356 on 2025-12-29, so roughly 85% of rows cluster in three days of late December 2025, and the remaining ~15% form a long tail spread thinly across the other 684 dates. An entropy ratio of 0.29 confirms the heavy skew, and the year-end spike looks like a backfill or batch-load artifact that should be confirmed before treating this as a true event date. Treatment: Parse to datetime and bucket by month or week; investigate the 2025-12-28 spike before using as a feature. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 0 (0.0%)
unique: 687
top_value: 2025-12-28T00:00:00.000
top_rate: 0.6542
cardinality: 687
entropy: 2.765
entropy_ratio: 0.2934
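The treatment above (parse, bucket, check the spike) could be sketched roughly as follows. This is a minimal illustration using only the standard library; the helper names `month_bucket` and `top_day_share` and the four sample rows are hypothetical, chosen to match the string format shown in top_value rather than taken from the data.

```python
from collections import Counter
from datetime import datetime


def month_bucket(stamp: str) -> str:
    """Parse an ISO datetime string and return its 'YYYY-MM' bucket."""
    return datetime.fromisoformat(stamp).strftime("%Y-%m")


def top_day_share(stamps: list[str]) -> tuple[str, float]:
    """Return the most frequent calendar day and its share of all rows."""
    days = Counter(datetime.fromisoformat(s).date().isoformat() for s in stamps)
    day, count = days.most_common(1)[0]
    return day, count / len(stamps)


# Hypothetical rows in the format the profile reports; not the real data.
sample = [
    "2025-12-28T00:00:00.000",
    "2025-12-28T00:00:00.000",
    "2025-12-30T00:00:00.000",
    "2026-01-05T00:00:00.000",
]
by_month = Counter(month_bucket(s) for s in sample)
spike_day, spike_share = top_day_share(sample)
print(by_month, spike_day, spike_share)
```

On the real column, a `spike_share` near the reported 0.6542 would be the cue to check load timestamps or source batches before trusting the date as an event time.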

violation_code

numeric feature outliers

This is almost certainly a categorical violation code stored as a number, with 62 distinct values across 10,000 rows and no nulls. Despite the numeric type, the distribution is meaningless as a quantity: values span 4 to 99, the interquartile range runs from 21 to 36 (IQR 15), and 10.07% of rows (1,007) are flagged as outliers, a count consistent with a 1.5×IQR fence (upper bound 58.5) applied to the histogram; skew is 1.51 and kurtosis 2.99. The median equals Q3 at 36, and the histogram bin containing 36 holds 4,426 rows, so a single dominant code accounts for much of the column. Treatment: Treat as categorical and one-hot or target-encode rather than using the raw integer. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 0 (0.0%)
unique: 62
min: 4
max: 99
mean: 35.85
median: 36
std: 17.55
q1: 21
q3: 36
iqr: 15
skew: 1.514
kurtosis: 2.986
n_outliers: 1,007
outlier_rate: 0.1007
zero_rate: 0
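The profile does not say which rule produced n_outliers. The sketch below assumes the common 1.5×IQR (Tukey) fences and checks that assumption against the histogram bins listed earlier; `bins_from_58` is a hand-copied subset of that histogram, not an output of the profiler.

```python
# Quartiles copied from the stats above.
q1, q3 = 21, 36
iqr = q3 - q1

# Assumed rule: Tukey fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr   # -1.5, below the column minimum of 4
upper_fence = q3 + 1.5 * iqr   # 58.5

# (lower bin edge, count) for every histogram bin starting at 58.62 or higher.
# The bin just below (56.25 - 58.62) has count 0, so the cut is unambiguous.
bins_from_58 = [
    (58.62, 4), (61, 28), (63.38, 7), (65.75, 185), (68.12, 105),
    (70.5, 117), (72.88, 121), (75.25, 14), (77.62, 76), (80, 57),
    (82.38, 14), (84.75, 21), (87.12, 0), (89.5, 0), (91.88, 2),
    (94.25, 2), (96.62, 254),
]

# No value can fall below the lower fence, so only the upper side contributes.
n_outliers = sum(count for edge, count in bins_from_58 if edge > upper_fence)
outlier_rate = n_outliers / 10_000
print(n_outliers, outlier_rate)
```

Because the codes are labels, these "outliers" are simply the less common violation codes above 58; the flag says nothing about data errors.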

violation_description

categorical feature

Categorical column describing the parking/traffic violation issued, with 74 distinct values across 10,000 rows. It is heavily concentrated: 'PHTO SCHOOL ZN SPEED VIOLATION' alone covers 52.03% of the 8,487 non-null records (4,416 rows, or 44.2% of all rows), with 'No Parking Street Cleaning' a distant second at 1,428, yielding a low entropy ratio of 0.452. Note that 15.13% of values are null, and the labels mix numeric-prefixed legal codes (e.g. '14-No Standing') with free-form descriptions like 'Fire Hydrant', some of which appear to duplicate the coded versions ('40-Fire Hydrant' vs 'Fire Hydrant'). Treatment: Normalise the coded vs free-text duplicates, then group rare categories before one-hot encoding. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 1,513 (15.1%)
unique: 74
top_value: PHTO SCHOOL ZN SPEED VIOLATION
top_rate: 0.5203
cardinality: 74
entropy: 2.808
entropy_ratio: 0.4522
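One way to approach the normalisation step is to strip the leading code prefix and then fold rare labels together. The following is a sketch under assumptions: the prefix pattern (digits, an optional capital letter, a hyphen) is inferred from the labels shown in the top-values table, and `strip_code_prefix` and `group_rare` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed prefix shape, inferred from labels such as '40-' and '46A-'.
_CODE_PREFIX = re.compile(r"^\d+[A-Z]?-")


def strip_code_prefix(label):
    """Drop a leading 'NN-' or 'NNA-' code; leave nulls and other labels alone."""
    if label is None:
        return None
    return _CODE_PREFIX.sub("", label).strip()


def group_rare(labels, min_count):
    """Replace labels seen fewer than min_count times with 'OTHER'; keep nulls."""
    counts = Counter(label for label in labels if label is not None)
    return [
        label if label is None or counts[label] >= min_count else "OTHER"
        for label in labels
    ]


print(strip_code_prefix("40-Fire Hydrant"))
print(strip_code_prefix("46A-Double Parking (Non-COM)"))
print(group_rare(["A", "A", "B", None], min_count=2))
```

Prefix stripping only merges labels that match exactly afterwards ('40-Fire Hydrant' and 'Fire Hydrant'); abbreviated variants such as '14-No Standing' versus '19-No Stand (bus stop)' stay distinct, and whether any pair is truly the same violation is best confirmed against violation_code.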

street_name

text feature multilingual allcaps duplicates

Street-name strings, mostly truncated NYC traffic-camera or incident locations like 'SB CROSS BAY BLVD @' with directional prefixes (sb/wb/nb/eb) and '@' delimiters dominating the top words. Values are heavily repeated (68.8% duplicate rate, only 3,115 uniques in 10,000 rows) and 77.3% are all-caps; lengths cap sharply at 20 characters, suggesting upstream truncation. Language detection flags ja (614) and zh (59) alongside 3,596 en, but this is almost certainly a false positive on short ALL-CAPS abbreviations rather than genuine multilingual content. Treatment: Normalise case, split off the directional prefix and the '@' cross-street, and treat as a categorical location feature. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 4 (0.0%)
unique: 3,115
len_min: 2
len_max: 20
len_mean: 14.87
len_median: 16
len_p95: 20
word_mean: 3.513
word_median: 3
n_empty: 0
n_duplicates: 6,881
duplicate_rate: 0.6884
vocab_size: 1,760
readability_flesch_mean: 62.55
emoji_rate: 0
url_rate: 0
one_word_rate: 0.01371
allcaps_rate: 0.7732
boilerplate_rate: 0
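The suggested split into direction, street, and cross-street could look something like this. It is a sketch that assumes the directional prefix is always the first token and that '@' separates the cross-street, as in the 'SB CROSS BAY BLVD @' example; `split_street` is a hypothetical helper, and the second and third inputs below are made up for illustration.

```python
DIRECTIONS = {"NB", "SB", "EB", "WB"}


def split_street(raw: str) -> dict:
    """Upper-case the value, then split off a leading direction and an '@' cross-street."""
    text = raw.strip().upper()
    main, sep, cross = text.partition("@")
    tokens = main.split()
    direction = None
    # Only treat the first token as a direction if something follows it.
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        direction = tokens[0]
        tokens = tokens[1:]
    return {
        "direction": direction,
        "street": " ".join(tokens),
        "cross_street": cross.strip() or None,
        "has_at": bool(sep),
    }


print(split_street("SB CROSS BAY BLVD @"))
print(split_street("wb atlantic ave @ cla"))
print(split_street("Broadway"))
```

Since values appear truncated at 20 characters, a cross-street after '@' is often missing or cut short, so the `street` part is likely the more reliable categorical feature.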

vehicle_make

categorical feature

Categorical vehicle manufacturer codes, with HONDA leading at 1,331 rows (13.4% of the 9,922 non-null rows) and TOYOT close behind at 1,302. Values appear truncated to 5 characters (TOYOT, NISSA, ME/BE, CHEVR, HYUND, SUBAR), which will fragment any join against full make names. Cardinality is 126 with an entropy ratio of 0.668, indicating a long tail with only moderate concentration at the head, and nulls are negligible at 0.78%. Treatment: Normalize the truncated codes to canonical make names, then target- or frequency-encode for modelling. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 78 (0.8%)
unique: 126
top_value: HONDA
top_rate: 0.1341
cardinality: 126
entropy: 4.661
entropy_ratio: 0.668
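A rough sketch of the normalize-then-encode treatment follows. The alias table is an assumption covering only the truncated codes visible in the top-values table (including the guess that ME/BE stands for Mercedes-Benz); the remaining makes would need their own entries, and `canonical_make` and `frequency_encode` are hypothetical helpers.

```python
from collections import Counter

# Assumed expansions for truncated codes seen in the top-values table.
MAKE_ALIASES = {
    "TOYOT": "TOYOTA",
    "NISSA": "NISSAN",
    "ME/BE": "MERCEDES-BENZ",
    "CHEVR": "CHEVROLET",
    "HYUND": "HYUNDAI",
    "SUBAR": "SUBARU",
    "VOLKS": "VOLKSWAGEN",
    "INFIN": "INFINITI",
}


def canonical_make(code):
    """Map a raw make code to a canonical name; unknown codes pass through."""
    if code is None:
        return None
    key = code.strip().upper()
    return MAKE_ALIASES.get(key, key)


def frequency_encode(values):
    """Replace each non-null value with its share of non-null rows."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return [counts[v] / total if v is not None else None for v in values]


makes = [canonical_make(m) for m in ["HONDA", "honda", "TOYOT", None]]
print(makes)
print(frequency_encode(makes))
```

Frequency encoding keeps the feature to one numeric column, which may suit a 126-value categorical better than one-hot encoding; target encoding would additionally need a label column that this profile does not describe.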

vehicle_color

categorical feature

Vehicle color codes, encoded inconsistently: short codes like GY (2,079), BK (1,784), WH (1,579) coexist with verbose forms WHITE (347), BLACK (273), GREY (239), and alternate abbreviations BLK (275), GRY (167) for the same underlying colors. With 99 distinct values across 10,000 rows and a 9.43% null rate, the cardinality is inflated by these duplicate encodings rather than by true diversity. An entropy ratio of 0.55 reflects a heavy concentration in gray, black, and white at the head of the distribution. Treatment: Normalize synonymous codes (e.g., BK/BLK/BLACK → black) before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 943 (9.4%)
unique: 99
top_value: GY
top_rate: 0.2295
cardinality: 99
entropy: 3.676
entropy_ratio: 0.5545
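To see how much the duplicate encodings matter, the sketch below merges the 20 top values from the table under an assumed alias map. The map is illustrative: GR is left unmapped because it could plausibly mean green or gray, and SILVE, YW, and BR are also passed through unchanged rather than guessed.

```python
from collections import Counter

# Assumed synonyms; only codes whose meaning seems unambiguous are mapped.
COLOR_ALIASES = {
    "GY": "GRAY", "GREY": "GRAY", "GRY": "GRAY", "GRAY": "GRAY",
    "BK": "BLACK", "BLK": "BLACK", "BLACK": "BLACK",
    "WH": "WHITE", "WHITE": "WHITE", "WHT": "WHITE", "WHI": "WHITE",
    "BL": "BLUE", "BLUE": "BLUE", "BLU": "BLUE",
    "RD": "RED", "RED": "RED",
}

# Counts copied from the vehicle_color top-values table.
top_counts = {
    "GY": 2079, "BK": 1784, "WH": 1579, "BL": 631, "RD": 348,
    "WHITE": 347, "BLK": 275, "BLACK": 273, "GREY": 239, "GRY": 167,
    "GR": 148, "BLUE": 131, "RED": 124, "GRAY": 113, "SILVE": 99,
    "WHT": 65, "YW": 62, "BR": 60, "WHI": 57, "BLU": 46,
}

merged = Counter()
for code, count in top_counts.items():
    # Unmapped codes keep their raw spelling.
    merged[COLOR_ALIASES.get(code, code)] += count

for color, count in merged.most_common():
    print(color, count)
```

Under these assumptions the 20 listed values collapse to 9, which gives a sense of how far the full 99-value cardinality might shrink once the long tail is mapped as well.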

violation_time

text timestamp one_word allcaps short_text duplicates

This column encodes time-of-day stamps in a compact format: four digits HHMM followed by A or P for AM/PM (e.g. '0839A', '1200P'), with every value exactly 5 characters and uppercase. Duplication is high (85.67%) across 1,432 unique stamps, which is expected for clock times sampled across 10,000 records. The null rate is negligible (4 rows, 0.04%) and there are no empty strings. Treatment: Parse the HHMM + A/P format into a proper time-of-day (minutes since midnight) before modelling. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 4 (0.0%)
unique: 1,432
len_min: 5
len_max: 5
len_mean: 5
len_median: 5
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 8,564
duplicate_rate: 0.8567
vocab_size: 1,433
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9999
allcaps_rate: 1
boilerplate_rate: 0
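A parser for the treatment above might look like the sketch below. It assumes 12-hour clock semantics, where '12xxA' is the hour after midnight and '12xxP' the hour after noon, and it returns None for anything that does not fit; the profile does not say whether malformed stamps exist, so that branch is a precaution. `minutes_since_midnight` is a hypothetical helper.

```python
import re

_STAMP = re.compile(r"(\d{2})(\d{2})([AP])")


def minutes_since_midnight(stamp):
    """Convert 'HHMM' + 'A'/'P' to minutes since midnight, or None if invalid."""
    if stamp is None:
        return None
    match = _STAMP.fullmatch(stamp.strip().upper())
    if match is None:
        return None
    hour, minute, half = int(match[1]), int(match[2]), match[3]
    if hour > 12 or minute > 59:
        return None
    # Assumed 12-hour convention: 12 wraps to 0 before adding the PM offset.
    hour = hour % 12 + (12 if half == "P" else 0)
    return hour * 60 + minute


for stamp in ["0839A", "1200P", "1200A", "0115P", "1300P"]:
    print(stamp, minutes_since_midnight(stamp))
```

For modelling, the resulting minute count is cyclic (1439 is adjacent to 0), so a sine/cosine pair or coarse hour buckets may serve better than the raw integer.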

issuing_agency

categorical feature

Single-character codes (mostly letters, plus the digits 3, 8, and 9) for the agency issuing each record, drawn from a closed set of 20 values with no nulls. The distribution is heavily concentrated: 'V' alone covers 44.16% (4,416/10,000) and the top four codes (V, T, S, P) together account for 9,818 rows (98.2%), while codes like K, N, A, Y, M, O appear fewer than 50 times each. An entropy ratio of 0.46 confirms the imbalance. Treatment: One-hot encode the top categories and bucket the long tail into 'other' before modelling. high · anthropic:claude-opus-4-7

n: 10,000
nulls: 0 (0.0%)
unique: 20
top_value: V
top_rate: 0.4416
cardinality: 20
entropy: 2.007
entropy_ratio: 0.4644
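The entropy and entropy_ratio fields are not defined anywhere in the report. A plausible reading, sketched here, is Shannon entropy in bits divided by log2 of the cardinality; the counts are copied from the issuing_agency table, which lists all 20 values, so the result can be compared with the reported 2.007 and 0.4644.

```python
import math

# Counts copied from the issuing_agency top-values table (all 20 values).
agency_counts = {
    "V": 4416, "T": 2131, "S": 1946, "P": 1325, "K": 41, "N": 38,
    "A": 23, "Y": 17, "M": 13, "O": 9, "C": 9, "8": 8, "3": 6,
    "X": 5, "9": 3, "W": 3, "L": 2, "R": 2, "F": 2, "U": 1,
}


def entropy_bits(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


entropy = entropy_bits(list(agency_counts.values()))
# Assumed definition: entropy divided by its maximum, log2(number of categories).
entropy_ratio = entropy / math.log2(len(agency_counts))

top_four_share = sum(agency_counts[k] for k in "VTSP") / sum(agency_counts.values())
print(round(entropy, 3), round(entropy_ratio, 4), top_four_share)
```

A ratio of 1.0 would mean all 20 codes are equally common; a value under 0.5 with a top-four share this high supports bucketing the remaining 16 codes into a single 'other' category.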