saturn

urban parking violations sample

source: /home/coolhand/html/datavis/data_trove/data/urban/parking_violations_sample.csv · 10,000 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 10,000-row sample of NYC-style parking violations with 9 columns covering summons IDs, issue dates and times, locations, violation codes and descriptions, issuing agencies, and vehicle make and color. Two things stand out. First, issue_date is heavily concentrated on a single day: 2025-12-28 accounts for 65% of rows. Second, violation_description is dominated by 'PHTO SCHOOL ZN SPEED VIOLATION' at 52% of non-null values, and issuing_agency 'V' covers 44% of rows. With 15% of descriptions null, both shares work out to roughly 4,416 rows, which suggests the sample is skewed toward automated school-zone camera tickets. vehicle_color shows clear data-quality issues: the same color appears under multiple codes (e.g., WH/WHITE, BK/BLK/BLACK, GY/GRY/GREY) that need normalization before analysis. violation_code is stored as a number with a ~10% outlier rate and right skew (1.51), so it is worth reviewing alongside the categorical description. street_name is messy free text: 77% of values are all-caps and many carry directional prefixes (SB, NB, WB, EB).

citing: row_count · column_count · issue_date.top_rate · issue_date.top_value · violation_description.top_rate · violation_description.top_value · violation_description.null_rate · issuing_agency.top_rate · issuing_agency.top_value · vehicle_color.top_values · vehicle_make.top_values · violation_code.outlier_rate · violation_code.skew · street_name.allcaps_rate

Schema

9 columns
Per-column summary.

column                 type         nulls  unique  alerts
summons_number         numeric      0.0%   10,000  -
issue_date             categorical  0.0%   687     long_tail
violation_code         numeric      0.0%   62      outliers
violation_description  categorical  15.1%  74      -
street_name            text         0.0%   3,115   multilingual, allcaps, duplicates
vehicle_make           categorical  0.8%   126     -
vehicle_color          categorical  9.4%   99      -
violation_time         text         0.0%   1,432   one_word, allcaps, short_text, duplicates
issuing_agency         categorical  0.0%   20      -

summons_number

numeric identifier
Every one of the 10,000 rows carries a distinct value (n_unique = 10000, null_rate = 0.0), and the magnitudes (min 1.12e9, max 9.26e9) match the size of NYC parking summons numbers. The wide spread (std ≈ 2.71e9) and lack of outliers reflect identifier allocation rather than a measurable quantity. Despite being typed numeric, no arithmetic interpretation applies. Treatment: drop from modelling; retain only as a row key for joins. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 0 (0.0%)
unique: 10,000
min: 1.125e+09
max: 9.255e+09
mean: 4.779e+09
median: 4.976e+09
std: 2.706e+09
q1: 2.028e+09
q3: 4.976e+09
iqr: 2.948e+09
skew: 0.4681
kurtosis: -0.9135
n_outliers: 0
outlier_rate: 0
zero_rate: 0
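
The treatment above (keep summons_number only as a row key) can be sketched as follows. This is a minimal illustration that assumes rows are held as dicts keyed by column name, for example from `csv.DictReader`; the helper name `split_key_and_features` and the sample values are hypothetical.

```python
def split_key_and_features(rows, key="summons_number"):
    """Check that `key` is unique, then separate it from the feature columns."""
    keys = [row[key] for row in rows]
    if len(set(keys)) != len(keys):
        raise ValueError(f"{key} is not unique; it cannot serve as a row key")
    # Everything except the key column is kept as a candidate feature.
    features = [{k: v for k, v in row.items() if k != key} for row in rows]
    return keys, features


# Made-up rows for illustration; the values are not taken from the dataset.
sample = [
    {"summons_number": "1000000001", "issuing_agency": "V"},
    {"summons_number": "1000000002", "issuing_agency": "T"},
]
keys, features = split_key_and_features(sample)
```

Keeping the identifier as a string avoids any accidental arithmetic on it.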

issue_date

categorical timestamp long_tail
An issue_date column stored as ISO datetime strings and profiled as categorical: 687 distinct dates, no nulls. The distribution is severely concentrated: 6,542 of the 10,000 rows (65.42%) fall on 2025-12-28, with another 1,594 on 2025-12-30 and 356 on 2025-12-29, so roughly 85% of rows cluster in those three late-December days and the remaining ~15% spread over a long tail of 684 other dates. An entropy ratio of 0.29 confirms the heavy skew, and the year-end spike looks like a backfill or batch-load artifact that should be confirmed before treating this as a true event date. Treatment: Parse to datetime and bucket by month or week; investigate the 2025-12-28 spike before using as a feature. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 0 (0.0%)
unique: 687
top_value: 2025-12-28T00:00:00.000
top_rate: 0.6542
cardinality: 687
entropy: 2.765
entropy_ratio: 0.2934
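
A minimal sketch of the parse-and-bucket treatment, assuming every value follows the format of the reported top value ('2025-12-28T00:00:00.000'). The helper names and the sample list are made up for illustration; only the string format mirrors the column.

```python
from collections import Counter
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # matches '2025-12-28T00:00:00.000'


def month_bucket(raw):
    """Parse an ISO datetime string and return a 'YYYY-MM' bucket label."""
    return datetime.strptime(raw, ISO_FORMAT).strftime("%Y-%m")


def top_date_share(values):
    """Return the most frequent raw value and its share of all rows."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Made-up sample, not rows from the dataset.
sample = [
    "2025-12-28T00:00:00.000",
    "2025-12-28T00:00:00.000",
    "2025-12-30T00:00:00.000",
    "2025-11-03T00:00:00.000",
]
by_month = Counter(month_bucket(v) for v in sample)
spike, share = top_date_share(sample)
```

Checking `top_date_share` before bucketing makes a dominant date such as 2025-12-28 visible, so it can be investigated rather than silently absorbed into a monthly total.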

violation_code

numeric feature outliers
This is almost certainly a categorical violation code stored as a number, with 62 distinct values across 10,000 rows and no nulls. Despite the numeric type, the distribution has no meaning as a quantity: values span 4 to 99, the interquartile range runs from 21 to 36, and 10.07% of rows (1,007) are flagged as outliers under a numeric rule, with skew 1.51 and kurtosis 2.99. The median equals Q3 at 36, which means at least a quarter of rows carry code 36 and points to one or two dominant codes. Treatment: Treat as categorical and one-hot or target-encode rather than using the raw integer. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 0 (0.0%)
unique: 62
min: 4
max: 99
mean: 35.85
median: 36
std: 17.55
q1: 21
q3: 36
iqr: 15
skew: 1.514
kurtosis: 2.986
n_outliers: 1,007
outlier_rate: 0.1007
zero_rate: 0
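
The profile does not say which numeric rule produced the 1,007 flagged rows. A common choice is Tukey's 1.5 × IQR fences; assuming that rule (an assumption, not something the report confirms), the reported quartiles give the following cut-offs. The function name `tukey_fences` is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported for violation_code (q1 = 21, q3 = 36).
lower, upper = tukey_fences(21, 36)
print(lower, upper)  # -1.5 58.5
```

Under this assumption only codes above 58.5 would be flagged, because the minimum (4) sits above the lower fence. That illustrates why an outlier alert on a code column says more about code numbering than about anomalous tickets.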

violation_description

categorical feature
Categorical column describing the parking or traffic violation issued, with 74 distinct labels across 10,000 rows. It is heavily concentrated: 'PHTO SCHOOL ZN SPEED VIOLATION' alone covers 52.03% of non-null values (about 4,416 rows), with 'No Parking Street Cleaning' a distant second at 1,428 rows, yielding a low entropy ratio of 0.452. Note that 15.13% of values are null, and the labels mix numeric-prefixed legal codes (e.g. '14-No Standing') with free-form descriptions like 'Fire Hydrant', some of which appear to duplicate the coded versions ('40-Fire Hydrant' vs 'Fire Hydrant'). Treatment: Normalise the coded vs free-text duplicates, then group rare categories before one-hot encoding. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 1,513 (15.1%)
unique: 74
top_value: PHTO SCHOOL ZN SPEED VIOLATION
top_rate: 0.5203
cardinality: 74
entropy: 2.808
entropy_ratio: 0.4522
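
One possible way to carry out the treatment above, shown as a sketch. It assumes the coded labels always take the form of digits, a hyphen, then the description, as in '40-Fire Hydrant'; the helper names and the `min_count` threshold are hypothetical.

```python
import re
from collections import Counter

CODE_PREFIX = re.compile(r"^\d+\s*-\s*")  # e.g. '40-' in '40-Fire Hydrant'


def normalise_label(label):
    """Strip a leading numeric code and fold case; pass None through."""
    if label is None:
        return None
    return CODE_PREFIX.sub("", label).strip().casefold()


def group_rare(labels, min_count):
    """Replace labels seen fewer than `min_count` times with 'other'."""
    counts = Counter(label for label in labels if label is not None)
    return [
        label if label is None or counts[label] >= min_count else "other"
        for label in labels
    ]


raw = ["40-Fire Hydrant", "Fire Hydrant", "14-No Standing", None]
clean = [normalise_label(v) for v in raw]
grouped = group_rare(clean, min_count=2)
print(grouped)  # ['fire hydrant', 'fire hydrant', 'other', None]
```

Nulls are passed through rather than lumped into 'other', so the 15.1% missing share stays visible. Stripping the numeric prefix discards the legal code, which is reasonable only if violation_code is kept as a separate feature.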

street_name

text feature multilingual allcaps duplicates
Street-name strings, mostly truncated NYC traffic-camera or incident locations like 'SB CROSS BAY BLVD @' with directional prefixes (sb/wb/nb/eb) and '@' delimiters dominating the top words. Values are heavily repeated (68.8% duplicate rate, only 3,115 uniques in 10,000 rows) and 77.3% are all-caps; lengths cap sharply at 20 characters, suggesting upstream truncation. Language detection flags ja (614) and zh (59) alongside 3,596 en, but this is almost certainly a false positive on short ALL-CAPS abbreviations rather than genuine multilingual content. Treatment: Normalise case, split off the directional prefix and the '@' cross-street, and treat as a categorical location feature. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 4 (0.0%)
unique: 3,115
len_min: 2
len_max: 20
len_mean: 14.87
len_median: 16
len_p95: 20
word_mean: 3.513
word_median: 3
n_empty: 0
n_duplicates: 6,881
duplicate_rate: 0.6884
vocab_size: 1,760
readability_flesch_mean: 62.55
emoji_rate: 0
url_rate: 0
one_word_rate: 0.01371
allcaps_rate: 0.7732
boilerplate_rate: 0
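
The split described in the treatment could look like the sketch below. It assumes the directional prefix is always the first token and is one of NB, SB, EB or WB, and that '@' separates the main street from the cross street; `split_street` is a hypothetical helper.

```python
DIRECTIONS = {"NB", "SB", "EB", "WB"}


def split_street(raw):
    """Split a raw street string into (direction, street, cross_street)."""
    text = " ".join(raw.upper().split())  # normalise case and whitespace
    main, _, cross = text.partition("@")
    words = main.split()
    direction = None
    if words and words[0] in DIRECTIONS:
        direction = words.pop(0)
    street = " ".join(words) or None
    # A trailing '@' with nothing after it means the cross street is missing,
    # which fits the 20-character truncation noted in the profile.
    cross_street = cross.strip() or None
    return direction, street, cross_street


print(split_street("SB CROSS BAY BLVD @"))
# ('SB', 'CROSS BAY BLVD', None)
```

Returning `None` for a missing cross street keeps truncated values distinguishable from locations that never had one.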

vehicle_make

categorical feature
Categorical vehicle manufacturer codes, with HONDA leading at 13.4% of the 9,922 non-null rows and TOYOT close behind at 1,302 rows (13.1%). Values appear truncated to 5 characters (TOYOT, NISSA, ME/BE, CHEVR, HYUND, SUBAR), which will fragment any join against full make names. Cardinality is 126 with an entropy ratio of 0.668, indicating a moderately concentrated distribution with a long tail, and nulls are negligible at 0.78%. Treatment: Normalize the truncated codes to canonical make names, then target- or frequency-encode for modelling. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 78 (0.8%)
unique: 126
top_value: HONDA
top_rate: 0.1341
cardinality: 126
entropy: 4.661
entropy_ratio: 0.668
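
A sketch of the normalise-then-encode treatment. The expansions in `CANONICAL_MAKE` are assumptions inferred from the truncated codes named above (ME/BE in particular is a guess) and should be checked against a reference list of makes; the helper names are hypothetical.

```python
from collections import Counter

# Assumed expansions of the 5-character codes; verify before relying on them.
CANONICAL_MAKE = {
    "TOYOT": "TOYOTA",
    "NISSA": "NISSAN",
    "CHEVR": "CHEVROLET",
    "HYUND": "HYUNDAI",
    "SUBAR": "SUBARU",
    "ME/BE": "MERCEDES-BENZ",
}


def canonical_make(code):
    """Expand a truncated make code; unknown codes pass through unchanged."""
    if code is None:
        return None
    code = code.strip().upper()
    return CANONICAL_MAKE.get(code, code)


def frequency_encode(values):
    """Map each non-null value to its share of the non-null rows."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return [None if v is None else counts[v] / len(present) for v in values]


# Made-up sample for illustration.
makes = [canonical_make(c) for c in ["HONDA", "TOYOT", "toyot", None]]
encoded = frequency_encode(makes)
```

Frequency encoding keeps the feature to a single numeric column, which is often easier to handle than one-hot encoding 126 categories.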

vehicle_color

categorical feature
Vehicle color codes, but encoded inconsistently: short codes like GY (2,079), BK (1,784), WH (1,579) coexist with verbose forms WHITE (347), BLACK (273), GREY (239), and alternate abbreviations BLK (275), GRY (167) for the same underlying colors. With 99 distinct values across 10,000 rows and a 9.43% null rate, the cardinality is inflated by these duplicate encodings rather than true diversity. The entropy ratio of 0.55 reflects a heavy concentration in gray, black, and white. Treatment: Normalize synonymous codes (e.g., BK/BLK/BLACK → black) before one-hot or target encoding. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 943 (9.4%)
unique: 99
top_value: GY
top_rate: 0.2295
cardinality: 99
entropy: 3.676
entropy_ratio: 0.5545
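
To show what the normalisation does to the counts, the sketch below merges only the synonyms named in the profile. The canonical spellings ('black', 'white', 'gray') and the helper name are illustrative choices; the remaining codes among the 99 distinct values would need their own entries.

```python
from collections import Counter

# Only the synonyms named in the profile are mapped here.
COLOR_SYNONYMS = {
    "BK": "black", "BLK": "black", "BLACK": "black",
    "WH": "white", "WHITE": "white",
    "GY": "gray", "GRY": "gray", "GREY": "gray",
}


def normalise_color(code):
    """Map a raw color code to a canonical name; unknown codes pass through."""
    if code is None:
        return None
    key = code.strip().upper()
    return COLOR_SYNONYMS.get(key, key)


# Counts as reported in the profile above.
reported = {"GY": 2079, "BK": 1784, "WH": 1579, "WHITE": 347,
            "BLK": 275, "BLACK": 273, "GREY": 239, "GRY": 167}
merged = Counter()
for code, n in reported.items():
    merged[normalise_color(code)] += n
print(dict(merged))
# {'gray': 2485, 'black': 2332, 'white': 1926}
```

After merging, gray alone accounts for 2,485 of the 9,057 non-null rows (about 27%), up from the 22.95% reported for GY by itself.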

violation_time

text timestamp one_word allcaps short_text duplicates
This column encodes time of day in a compact HHMM format with an A/P suffix (e.g. '0839A', '1200P'); every value is exactly 5 characters and uppercase. Duplication is high (85.67%) across 1,432 unique stamps, which is expected for clock times sampled across 10,000 records. The null rate is negligible (0.0004, or 4 rows) and there are no empty strings. Treatment: Parse the HHMM plus A/P format into a proper time of day (minutes since midnight) before modelling. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 4 (0.0%)
unique: 1,432
len_min: 5
len_max: 5
len_mean: 5
len_median: 5
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 8,564
duplicate_rate: 0.8567
vocab_size: 1,433
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9999
allcaps_rate: 1
boilerplate_rate: 0
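
A minimal parser for the treatment above, assuming a 12-hour clock in which 'A' means AM and 'P' means PM and hours run from 01 to 12. Values that do not fit that assumption return `None` rather than a guess; the function name is hypothetical.

```python
import re

TIME_PATTERN = re.compile(r"^(\d{2})(\d{2})([AP])$")


def minutes_since_midnight(stamp):
    """Convert 'HHMMA' or 'HHMMP' to minutes since midnight, or None if invalid."""
    if stamp is None:
        return None
    match = TIME_PATTERN.match(stamp.strip().upper())
    if match is None:
        return None
    hour, minute, half = int(match[1]), int(match[2]), match[3]
    if not (1 <= hour <= 12 and minute <= 59):
        return None  # out-of-range stamps need a separate decision
    hour = hour % 12 + (12 if half == "P" else 0)  # 12A -> 0, 12P -> 12
    return hour * 60 + minute


print(minutes_since_midnight("0839A"))  # 519
print(minutes_since_midnight("1200P"))  # 720
```

Counting how many rows come back as `None` is a quick check on whether the 12-hour assumption holds across all 1,432 distinct stamps.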

issuing_agency

categorical feature
Single-letter codes for the agency issuing each record, drawn from a closed set of 20 values with no nulls. Distribution is heavily concentrated: 'V' alone covers 44.16% (4,416/10,000) and the top four codes (V, T, S, P) account for the bulk of rows, while letters like K, N, A, Y, M, O appear fewer than 50 times each. Entropy ratio of 0.46 confirms the imbalance. Treatment: One-hot encode the top categories and bucket the long tail into 'other' before modelling. high · anthropic:claude-opus-4-7
n: 10,000
nulls: 0 (0.0%)
unique: 20
top_value: V
top_rate: 0.4416
cardinality: 20
entropy: 2.007
entropy_ratio: 0.4644
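
The report does not state how entropy_ratio is computed. The reported figures are consistent with Shannon entropy in bits divided by log2 of the number of distinct values, so the sketch below assumes that definition; the function names are illustrative and the toy sample is made up.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy in bits of the non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    distinct = len({v for v in values if v is not None})
    if distinct < 2:
        return 0.0
    return entropy_bits(values) / math.log2(distinct)


# Cross-check against the reported figures: entropy 2.007 over 20 codes.
reported_ratio = 2.007 / math.log2(20)
print(round(reported_ratio, 3))  # 0.464

# A perfectly balanced toy sample reaches the maximum ratio.
print(entropy_ratio(["V", "T", "S", "P"]))  # 1.0
```

A ratio of 1.0 means all categories are equally common, so 0.46 for 20 agency codes indicates that a few codes carry most of the rows.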