wild-nyc_311_sample_20260121 · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample_20260121.json

Saturn profiled 1,000 rows across 47 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

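# Install the profiler (IPython shell escape; run inside a notebook cell).
!pip install saturn-dissect

import subprocess

# Profile the sample, write the findings file, and request the opt-in LLM prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample_20260121.json",
        "--findings", "wild-nyc_311_sample_20260121.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if saturn exits with a non-zero status
)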

Summary confidence: high

This is a 1,000-row sample of NYC 311 service requests (47 columns), almost entirely categorical, capturing complaints by agency, location, and resolution status. NYPD dominates routing at 59.4% of requests, followed by HPD (23.2%) and DSNY (8.2%), and the top complaint types are Noise - Residential (23.4%), Illegal Parking (18.7%), and HEAT/HOT WATER (17.0%) — a good first place to look. Geographically, Brooklyn (31.2%), Queens (26.1%), and the Bronx (23.0%) account for most cases, while Staten Island is just 2.2%. Status is split across In Progress (38.8%), Closed (33.6%), and Open (27.6%), so a sizable share remains unresolved. Note that many specialized fields (taxi, bridge/highway, vehicle, facility_type) are >95% null and not informative, and `location` was skipped during profiling.

citing: agency · complaint_type · borough · status · open_data_channel_type · descriptor · facility_type · taxi_company_borough · vehicle_type

Out[4]:

saturn.schema() · 47 columns

column	kind	n	null%	unique	alerts
unique_key	categorical	1,000	0.0%	1,000	long_tail
created_date	categorical	1,000	0.0%	910	long_tail
agency	categorical	1,000	0.0%	10
agency_name	categorical	1,000	0.0%	10
complaint_type	categorical	1,000	0.0%	61
descriptor	categorical	1,000	0.0%	105
location_type	categorical	1,000	4.8%	19
incident_zip	categorical	1,000	0.3%	156
incident_address	categorical	1,000	1.6%	787	long_tail
street_name	categorical	1,000	1.6%	592	long_tail
cross_street_1	categorical	1,000	25.0%	457	long_tail null_rate
cross_street_2	categorical	1,000	25.0%	458	long_tail null_rate
intersection_street_1	categorical	1,000	26.7%	436	long_tail null_rate
intersection_street_2	categorical	1,000	26.7%	443	long_tail null_rate
address_type	categorical	1,000	0.3%	4
city	categorical	1,000	2.9%	40
landmark	categorical	1,000	30.8%	445	long_tail null_rate
status	categorical	1,000	0.0%	3
community_board	categorical	1,000	0.0%	64
council_district	categorical	1,000	1.3%	51
police_precinct	categorical	1,000	0.0%	76
bbl	categorical	1,000	5.9%	744	long_tail
borough	categorical	1,000	0.0%	5
x_coordinate_state_plane	categorical	1,000	1.2%	788	long_tail
y_coordinate_state_plane	categorical	1,000	1.2%	788	long_tail
open_data_channel_type	categorical	1,000	0.0%	4
park_facility_name	categorical	1,000	0.0%	3	long_tail imbalance
park_borough	categorical	1,000	0.0%	5
latitude	categorical	1,000	1.2%	794	long_tail
longitude	categorical	1,000	1.2%	794	long_tail
location	unknown	1,000	0.0%	—	skipped
:@computed_region_f5dn_yrer	categorical	1,000	1.2%	62
:@computed_region_yeji_bk3q	categorical	1,000	1.2%	5
:@computed_region_sbqj_enih	categorical	1,000	1.2%	75
:@computed_region_92fq_4b7q	categorical	1,000	1.2%	51
descriptor_2	categorical	1,000	59.6%	75	long_tail null_rate
resolution_description	categorical	1,000	41.2%	19	null_rate
resolution_action_updated_date	categorical	1,000	40.9%	354	long_tail null_rate
taxi_pick_up_location	categorical	1,000	98.8%	3	null_rate
vehicle_type	categorical	1,000	95.7%	5	null_rate
closed_date	categorical	1,000	66.4%	334	long_tail null_rate
bridge_highway_name	categorical	1,000	99.6%	4	long_tail null_rate
bridge_highway_segment	categorical	1,000	99.6%	4	long_tail null_rate
facility_type	categorical	1,000	95.9%	1	null_rate imbalance
bridge_highway_direction	categorical	1,000	99.7%	3	long_tail null_rate
road_ramp	categorical	1,000	99.7%	2	null_rate
taxi_company_borough	categorical	1,000	99.9%	1	long_tail null_rate imbalance

Fig 1.

agency · NYPD handles nearly 60% of requests; check how the long tail of smaller agencies splits the remainder.

Top values for agency (10 unique shown, of 10 total).
value	count	share
NYPD	594	59.4%
HPD	232	23.2%
DSNY	82	8.2%
DOT	34	3.4%
DEP	24	2.4%
DOHMH	16	1.6%
TLC	12	1.2%
DPR	4	0.4%
DCWP	1	0.1%
OOS	1	0.1%

Fig 2.

complaint_type · Noise - Residential, Illegal Parking, and HEAT/HOT WATER lead; see how concentrated the top three complaints are versus the long tail of the 61 categories.

Top values for complaint_type (20 unique shown, of 61 total).
value	count	share
Noise - Residential	234	23.4%
Illegal Parking	187	18.7%
HEAT/HOT WATER	170	17.0%
Snow or Ice	69	6.9%
Blocked Driveway	64	6.4%
Noise - Commercial	38	3.8%
Noise - Vehicle	22	2.2%
UNSANITARY CONDITION	21	2.1%
Noise - Street/Sidewalk	20	2.0%
Street Condition	14	1.4%
Noise	14	1.4%
Traffic Signal Condition	9	0.9%
Abandoned Vehicle	8	0.8%
Taxi Complaint	8	0.8%
DOOR/WINDOW	8	0.8%
Dirty Condition	7	0.7%
PLUMBING	7	0.7%
Non-Emergency Police Matter	6	0.6%
FLOORING/STAIRS	6	0.6%
Indoor Air Quality	5	0.5%

Fig 3.

borough · Brooklyn, Queens, and Bronx dominate; Staten Island barely registers at ~2%.

Top values for borough (5 unique shown, of 5 total).
value	count	share
BROOKLYN	312	31.2%
QUEENS	261	26.1%
BRONX	230	23.0%
MANHATTAN	175	17.5%
STATEN ISLAND	22	2.2%

Fig 4.

status · Roughly two-thirds of tickets are still In Progress or Open — a useful backlog indicator.

Top values for status (3 unique shown, of 3 total).
value	count	share
In Progress	388	38.8%
Closed	336	33.6%
Open	276	27.6%

Fig 5.

open_data_channel_type · Online intake (52.7%) outpaces Mobile and Phone, showing how residents are actually filing.

Top values for open_data_channel_type (4 unique shown, of 4 total).
value	count	share
ONLINE	527	52.7%
MOBILE	234	23.4%
PHONE	214	21.4%
UNKNOWN	25	2.5%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
unique_key	categorical	0.0%
created_date	categorical	0.0%
agency	categorical	0.0%
agency_name	categorical	0.0%
complaint_type	categorical	0.0%
descriptor	categorical	0.0%
location_type	categorical	4.8%
incident_zip	categorical	0.3%
incident_address	categorical	1.6%
street_name	categorical	1.6%
cross_street_1	categorical	25.0%
cross_street_2	categorical	25.0%
intersection_street_1	categorical	26.7%
intersection_street_2	categorical	26.7%
address_type	categorical	0.3%
city	categorical	2.9%
landmark	categorical	30.8%
status	categorical	0.0%
community_board	categorical	0.0%
council_district	categorical	1.3%
police_precinct	categorical	0.0%
bbl	categorical	5.9%
borough	categorical	0.0%
x_coordinate_state_plane	categorical	1.2%
y_coordinate_state_plane	categorical	1.2%
open_data_channel_type	categorical	0.0%
park_facility_name	categorical	0.0%
park_borough	categorical	0.0%
latitude	categorical	1.2%
longitude	categorical	1.2%
location	unknown	0.0%
:@computed_region_f5dn_yrer	categorical	1.2%
:@computed_region_yeji_bk3q	categorical	1.2%
:@computed_region_sbqj_enih	categorical	1.2%
:@computed_region_92fq_4b7q	categorical	1.2%
descriptor_2	categorical	59.6%
resolution_description	categorical	41.2%
resolution_action_updated_date	categorical	40.9%
taxi_pick_up_location	categorical	98.8%
vehicle_type	categorical	95.7%
closed_date	categorical	66.4%
bridge_highway_name	categorical	99.6%
bridge_highway_segment	categorical	99.6%
facility_type	categorical	95.9%
bridge_highway_direction	categorical	99.7%
road_ramp	categorical	99.7%
taxi_company_borough	categorical	99.9%

unique_key categorical identifier

This is a row-level identifier: every one of the 1000 values is unique (n_unique=1000, entropy_ratio=1.0) and no nulls are present. The values are 8-digit numeric strings clustered around 6753xxxx–6754xxxx, consistent with a sequential record key. There is no predictive signal here, only joinability.

Treatment: Drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
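
Before relying on `unique_key` as a join key, it may be worth re-checking uniqueness on the full extract, since the profile above only covers this 1,000-row sample. A minimal sketch using only the standard library; `duplicate_keys` is a hypothetical helper, and the sample values are copied from the Fig 7 table:

```python
from collections import Counter


def duplicate_keys(keys):
    """Return the keys that occur more than once (an empty list means safe to join on)."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]


# Four values copied from the unique_key table; a real check would pass the full column.
sample_keys = ["67534607", "67540784", "67540752", "67537071"]
dupes = duplicate_keys(sample_keys)
print(dupes)
```

An empty result supports using the column as a join key; any returned value would need investigation before a merge.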

Out[12]:

saturn.columns["unique_key"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	1,000
top_value	67534607
top_rate	0.001
cardinality	1,000
entropy	9.966
entropy_ratio	1
alert: long_tail	1000 singleton categories

Fig 7.

Top values for unique_key.

Top values for unique_key (20 unique shown, of 1000 total).
value	count	share
67534607	1	0.1%
67540784	1	0.1%
67540752	1	0.1%
67537071	1	0.1%
67535886	1	0.1%
67543458	1	0.1%
67541950	1	0.1%
67537003	1	0.1%
67538353	1	0.1%
67543223	1	0.1%
67534550	1	0.1%
67537050	1	0.1%
67540783	1	0.1%
67535884	1	0.1%
67541981	1	0.1%
67540793	1	0.1%
67539537	1	0.1%
67534594	1	0.1%
67543236	1	0.1%
67535866	1	0.1%

created_date categorical timestamp

ISO-8601 datetime stamps stored as strings, with 910 unique values across 1000 rows and zero nulls. The 10 most frequent timestamps all fall on 2026-01-19 between 22:01 and 23:57, so the most repeated values cluster in a roughly two-hour window; the top-20 table also includes timestamps from 2026-01-20 as late as 02:00, so the sample as a whole spans at least about four hours. Entropy ratio 0.9915 confirms near-unique values, but the column is typed as categorical instead of a proper timestamp.

Treatment: Parse to datetime and derive features (hour, day, delta); do not use raw string as a category.

anthropic:claude-opus-4-7 · confidence high
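
The treatment above (parse, then derive hour, day, and delta) can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's own pipeline: the three strings are copied from the Fig 8 table, and the feature names (`hour`, `weekday`, `minutes_since_first`) are hypothetical.

```python
from datetime import datetime

# Three values copied from the created_date table.
raw = [
    "2026-01-19T22:01:09.000",
    "2026-01-19T23:57:16.000",
    "2026-01-20T02:00:04.000",
]

# datetime.fromisoformat accepts this "YYYY-MM-DDTHH:MM:SS.fff" layout.
parsed = [datetime.fromisoformat(value) for value in raw]
earliest = min(parsed)

features = [
    {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday == 0
        "minutes_since_first": (ts - earliest).total_seconds() / 60,
    }
    for ts in parsed
]

for row in features:
    print(row)
```

Deriving the delta from the earliest timestamp in the sample is one choice among several; a delta to `closed_date` would be another, where that column is populated.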

Out[15]:

saturn.columns["created_date"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	910
top_value	2026-01-19T22:01:09.000
top_rate	0.009
cardinality	910
entropy	9.746
entropy_ratio	0.9915
alert: long_tail	850 singleton categories

Fig 8.

Top values for created_date.

Top values for created_date (20 unique shown, of 910 total).
value	count	share
2026-01-19T22:01:09.000	9	0.9%
2026-01-19T22:38:46.000	7	0.7%
2026-01-19T23:17:18.000	6	0.6%
2026-01-19T23:57:16.000	5	0.5%
2026-01-19T22:05:00.000	5	0.5%
2026-01-19T23:31:38.000	4	0.4%
2026-01-19T23:10:44.000	3	0.3%
2026-01-19T23:07:39.000	3	0.3%
2026-01-19T23:04:06.000	3	0.3%
2026-01-19T23:01:05.000	3	0.3%
2026-01-19T22:48:53.000	3	0.3%
2026-01-19T22:15:33.000	3	0.3%
2026-01-20T02:00:04.000	2	0.2%
2026-01-20T01:04:29.000	2	0.2%
2026-01-20T00:41:14.000	2	0.2%
2026-01-20T00:12:00.000	2	0.2%
2026-01-19T23:58:19.000	2	0.2%
2026-01-19T23:57:12.000	2	0.2%
2026-01-19T23:43:00.000	2	0.2%
2026-01-19T23:41:00.000	2	0.2%

agency categorical feature

This column records the NYC agency handling each record, drawn from 10 distinct codes with no nulls across 1000 rows. Distribution is heavily concentrated: NYPD alone accounts for 594 rows (top_rate 0.594) and HPD another 232, while DCWP and OOS appear just once each. Entropy ratio of 0.527 confirms the long tail is thin, which may starve models of signal for rare agencies.

Treatment: One-hot encode the top few agencies and bucket the rare codes (DPR, DCWP, OOS) into an 'other' category.

anthropic:claude-opus-4-7 · confidence high
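
One way to read the treatment above is to fold the three rarest codes into a single level and then emit one indicator per remaining level. A minimal sketch, assuming the level list from the agency table; `bucket_agency`, `one_hot`, and the `OTHER` label are hypothetical names:

```python
# The three rarest codes in the agency table (4, 1, and 1 rows).
RARE_AGENCIES = {"DPR", "DCWP", "OOS"}

# Levels kept as-is, plus the catch-all bucket.
LEVELS = ["NYPD", "HPD", "DSNY", "DOT", "DEP", "DOHMH", "TLC", "OTHER"]


def bucket_agency(code):
    """Map rare agency codes to 'OTHER' and leave the rest unchanged."""
    return "OTHER" if code in RARE_AGENCIES else code


def one_hot(value, levels):
    """Return a dict of 0/1 indicators, one per level."""
    return {f"agency_{level}": int(value == level) for level in levels}


row = one_hot(bucket_agency("OOS"), LEVELS)
print(row)
```

Codes outside `LEVELS` that are not in `RARE_AGENCIES` would produce an all-zero row here; whether that is acceptable depends on the downstream model.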

Out[18]:

saturn.columns["agency"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	10
top_value	NYPD
top_rate	0.594
cardinality	10
entropy	1.75
entropy_ratio	0.5268
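
The `entropy` and `entropy_ratio` rows appear consistent with Shannon entropy in bits and with that entropy divided by log2 of the number of distinct values. This reading is inferred from the numbers in the table, not taken from saturn's documentation. A short check using the agency counts from Fig 9:

```python
import math

# Counts copied from the agency table.
counts = [594, 232, 82, 34, 24, 16, 12, 4, 1, 1]


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


entropy = shannon_entropy_bits(counts)
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))

print(round(entropy, 2), round(entropy_ratio, 4))
```

Under this reading, a ratio near 1 means the values are spread almost evenly across categories, and a ratio near 0 means one category dominates.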

Fig 9.

Top values for agency.

Top values for agency (10 unique shown, of 10 total).
value	count	share
NYPD	594	59.4%
HPD	232	23.2%
DSNY	82	8.2%
DOT	34	3.4%
DEP	24	2.4%
DOHMH	16	1.6%
TLC	12	1.2%
DPR	4	0.4%
DCWP	1	0.1%
OOS	1	0.1%

agency_name categorical feature

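This column names the NYC agency handling each record, with 10 distinct agencies and no nulls across 1000 rows. The distribution is highly concentrated: the New York City Police Department alone accounts for 594 records (59.4%), followed by the Department of Housing Preservation and Development at 232 (23.2%) and the Department of Sanitation at 82 (8.2%), while two agencies appear just once. The entropy ratio of 0.53 confirms the heavy skew toward a single dominant category. The counts, entropy, and entropy ratio match those of `agency` exactly, which suggests the two columns carry the same information.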

Treatment: One-hot encode or group rare agencies into an 'Other' bucket before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[21]:

saturn.columns["agency_name"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	10
top_value	New York City Police Department
top_rate	0.594
cardinality	10
entropy	1.75
entropy_ratio	0.5268

Fig 10.

Top values for agency_name.

Top values for agency_name (10 unique shown, of 10 total).
value	count	share
New York City Police Department	594	59.4%
Department of Housing Preservation and Development	232	23.2%
Department of Sanitation	82	8.2%
Department of Transportation	34	3.4%
Department of Environmental Protection	24	2.4%
Department of Health and Mental Hygiene	16	1.6%
Taxi and Limousine Commission	12	1.2%
Department of Parks and Recreation	4	0.4%
Department of Consumer and Worker Protection	1	0.1%
Office of the Sheriff	1	0.1%

complaint_type categorical label

This is a categorical complaint-type field, almost certainly from a 311-style service request log, with 61 distinct categories across 1000 rows and no nulls. The distribution is moderately concentrated: 'Noise - Residential' leads at 23.4%, followed by 'Illegal Parking' (187) and 'HEAT/HOT WATER' (170), with entropy ratio 0.637 indicating a long tail. Worth noting the inconsistent casing (e.g., 'HEAT/HOT WATER' and 'UNSANITARY CONDITION' in caps vs. mixed-case neighbours), and that 'Noise' alone fragments across at least four sub-types.

Treatment: Normalise casing and consider grouping rare tail categories before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
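
The casing and rare-category steps in the treatment above could look like the following. This is a sketch under assumptions: upper-casing is an arbitrary choice of convention, the `min_count` threshold is illustrative, and the lower-case variant in the toy input is hypothetical (it does not appear in the profiled sample).

```python
from collections import Counter


def normalise_label(label):
    """Collapse repeated whitespace and upper-case the label."""
    return " ".join(label.split()).upper()


def group_rare(labels, min_count):
    """Replace labels seen fewer than min_count times with 'OTHER'."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else "OTHER" for label in labels]


# Toy input; the lower-case variant is hypothetical, added to show the casing merge.
raw = (
    ["Noise - Residential"] * 3
    + ["HEAT/HOT WATER"] * 2
    + ["heat/hot water"]
    + ["Indoor Air Quality"]
)

cleaned = [normalise_label(label) for label in raw]
grouped = Counter(group_rare(cleaned, min_count=2))
print(grouped)
```

Normalising before counting matters: the casing variants merge first, so a label is only bucketed as rare if it is rare after cleaning.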

Out[24]:

saturn.columns["complaint_type"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	61
top_value	Noise - Residential
top_rate	0.234
cardinality	61
entropy	3.776
entropy_ratio	0.6366

Fig 11.

Top values for complaint_type.

Top values for complaint_type (20 unique shown, of 61 total).
value	count	share
Noise - Residential	234	23.4%
Illegal Parking	187	18.7%
HEAT/HOT WATER	170	17.0%
Snow or Ice	69	6.9%
Blocked Driveway	64	6.4%
Noise - Commercial	38	3.8%
Noise - Vehicle	22	2.2%
UNSANITARY CONDITION	21	2.1%
Noise - Street/Sidewalk	20	2.0%
Street Condition	14	1.4%
Noise	14	1.4%
Traffic Signal Condition	9	0.9%
Abandoned Vehicle	8	0.8%
Taxi Complaint	8	0.8%
DOOR/WINDOW	8	0.8%
Dirty Condition	7	0.7%
PLUMBING	7	0.7%
Non-Emergency Police Matter	6	0.6%
FLOORING/STAIRS	6	0.6%
Indoor Air Quality	5	0.5%

descriptor categorical feature

Categorical descriptor field detailing the specific nature of a complaint or issue, with values like 'Banging/Pounding', 'Loud Music/Party', and 'Blocked Hydrant' suggesting NYC 311-style service requests. Cardinality is moderate at 105 unique values across 1000 rows, with the top value covering 13.3% and entropy ratio of 0.71 indicating a reasonably spread distribution. Notably, casing is inconsistent ('ENTIRE BUILDING' and 'APARTMENT ONLY' in caps versus title-case elsewhere), hinting at multiple upstream sources or schemas merged together.

Treatment: Normalize casing and group rare levels before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["descriptor"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	105
top_value	Banging/Pounding
top_rate	0.133
cardinality	105
entropy	4.739
entropy_ratio	0.7059

Fig 12.

Top values for descriptor.

Top values for descriptor (20 unique shown, of 105 total).
value	count	share
Banging/Pounding	133	13.3%
Loud Music/Party	127	12.7%
ENTIRE BUILDING	112	11.2%
Blocked Hydrant	98	9.8%
Sidewalk	69	6.9%
APARTMENT ONLY	58	5.8%
No Access	50	5.0%
Loud Talking	31	3.1%
Commercial Overnight Parking	26	2.6%
Posted Parking Sign Violation	21	2.1%
Partial Access	14	1.4%
Engine Idling	12	1.2%
Pothole	11	1.1%
PESTS	11	1.1%
Double Parked Blocking Traffic	10	1.0%
Driver Complaint - Passenger	9	0.9%
Blocked Sidewalk	9	0.9%
With License Plate	8	0.8%
Trash	7	0.7%
Blocked Crosswalk	7	0.7%

location_type categorical feature

Categorical descriptor of where an incident occurred, with 19 distinct values across 1000 rows and a 4.8% null rate. The top value 'Street/Sidewalk' appears 306 times, which is 32.1% of the 952 non-null values (30.6% of all rows), but the vocabulary is clearly inconsistent: 'Residential Building/House' (245), 'RESIDENTIAL BUILDING' (232), and 'Residential Building' (5) appear as separate categories, as do 'Sidewalk' (80) vs 'Street/Sidewalk', and '3+ Family Apartment Building' vs '3+ Family Apt. Building'. Entropy ratio of 0.57 reflects this fragmentation rather than genuine diversity.

Treatment: Normalize casing and merge synonymous labels into a controlled vocabulary before encoding.

anthropic:claude-opus-4-7 · confidence high
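
A controlled vocabulary for the treatment above can be as simple as a lookup keyed on the lower-cased label. The sketch below is illustrative: treating 'Residential Building/House' as a synonym of 'Residential Building' is an assumption that should be confirmed against the source agencies' definitions, and `canonical_location_type` is a hypothetical helper. Counts are copied from the Fig 13 table.

```python
from collections import Counter

# Lower-cased variant -> canonical label (assumed synonyms, not confirmed).
CANONICAL = {
    "residential building/house": "Residential Building",
    "residential building": "Residential Building",
    "3+ family apt. building": "3+ Family Apartment Building",
    "3+ family apartment building": "3+ Family Apartment Building",
}


def canonical_location_type(label):
    """Return the canonical label, or the stripped input if no mapping exists."""
    if label is None:
        return None
    cleaned = label.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


# Counts copied from the location_type table.
observed = Counter({
    "Residential Building/House": 245,
    "RESIDENTIAL BUILDING": 232,
    "Residential Building": 5,
    "3+ Family Apartment Building": 6,
    "3+ Family Apt. Building": 4,
})

merged = Counter()
for label, count in observed.items():
    merged[canonical_location_type(label)] += count

print(merged)
```

Under these assumed synonyms, the three residential variants would collapse into a single level covering 482 rows.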

Out[30]:

saturn.columns["location_type"].stats

stat	value
n	1,000
nulls	48 (4.8%)
unique	19
top_value	Street/Sidewalk
top_rate	0.3214
cardinality	19
entropy	2.437
entropy_ratio	0.5737
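
The `top_rate` above (0.3214) differs from the 30.6% share shown in the table below. The numbers are consistent with `top_rate` being computed over non-null rows and `share` over all rows; this is inferred from the arithmetic rather than from saturn's documentation:

```python
# Values copied from the location_type stats and top-values table.
n_rows = 1000
n_null = 48
top_count = 306  # 'Street/Sidewalk'

rate_over_non_null = top_count / (n_rows - n_null)  # denominator: 952
share_over_all = top_count / n_rows                 # denominator: 1000

print(round(rate_over_non_null, 4), round(share_over_all, 3))
```

The same pattern would explain the small gaps between `top_rate` and `share` in other columns that contain nulls.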

Fig 13.

Top values for location_type.

Top values for location_type (19 unique shown, of 19 total).
value	count	share
Street/Sidewalk	306	30.6%
Residential Building/House	245	24.5%
RESIDENTIAL BUILDING	232	23.2%
Sidewalk	80	8.0%
Club/Bar/Restaurant	23	2.3%
Street	20	2.0%
Store/Commercial	15	1.5%
3+ Family Apartment Building	6	0.6%
Residential Building	5	0.5%
3+ Family Apt. Building	4	0.4%
Taxi	3	0.3%
Highway	3	0.3%
Yard	2	0.2%
Restaurant/Bar/Deli/Bakery	2	0.2%
Park/Playground	2	0.2%
Subway	1	0.1%
Business	1	0.1%
Subway Station	1	0.1%
Park	1	0.1%

incident_zip categorical feature

This is a US ZIP code field for incident locations, with 156 distinct codes across 1000 rows and only 0.3% nulls. The distribution is highly diffuse (entropy ratio 0.94) — the most frequent ZIP, 10461, accounts for just 2.9% of rows, and the top values (10461, 10462, 11226, 11214, 10034) are all NYC codes, suggesting a NYC-scoped dataset.

Treatment: Treat as high-cardinality categorical: group by borough/region or target-encode rather than one-hot.

anthropic:claude-opus-4-7 · confidence high
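
For the target-encoding option named in the treatment above, one common variant is a smoothed per-category mean that shrinks small groups toward the global mean. The sketch below is an assumption-laden illustration: the binary target (whether a ticket is closed), the toy rows, and the `smoothing` value are all hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=10.0):
    """Smoothed mean of the target per category, shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }


# Hypothetical toy rows: ZIP codes from Fig 14 paired with a made-up closed flag.
zips = ["10461", "10461", "11226", "10034"]
closed = [1, 0, 1, 1]

encoding = target_encode(zips, closed, smoothing=2.0)
print(encoding)
```

To avoid leaking the target, the encoding would normally be fitted on training folds only and then applied to held-out rows.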

Out[33]:

saturn.columns["incident_zip"].stats

stat	value
n	1,000
nulls	3 (0.3%)
unique	156
top_value	10461
top_rate	0.02909
cardinality	156
entropy	6.826
entropy_ratio	0.937

Fig 14.

Top values for incident_zip.

Top values for incident_zip (20 unique shown, of 156 total).
value	count	share
10461	29	2.9%
10462	25	2.5%
11226	23	2.3%
11214	20	2.0%
10034	19	1.9%
11373	19	1.9%
11385	18	1.8%
10473	17	1.7%
10029	16	1.6%
11103	16	1.6%
11230	16	1.6%
11225	16	1.6%
11204	15	1.5%
10040	15	1.5%
11366	14	1.4%
10468	14	1.4%
11212	13	1.3%
11208	13	1.3%
10457	13	1.3%
11233	13	1.3%

incident_address categorical metadata

Street-level incident addresses, almost certainly NYC given entries like JOHN F KENNEDY AIRPORT and LA GUARDIA AIRPORT. Cardinality is extreme: 787 unique values across 1000 rows with entropy ratio 0.98, and the modal address '31 HARRISON AVENUE' appears just 9 times (0.9%). Null rate is low at 1.6%, but the long tail makes this unusable as a categorical feature without aggregation.

Treatment: Geocode to borough/zip or coordinates rather than using raw strings.

anthropic:claude-opus-4-7 · confidence high

Out[36]:

saturn.columns["incident_address"].stats

stat	value
n	1,000
nulls	16 (1.6%)
unique	787
top_value	31 HARRISON AVENUE
top_rate	0.009146
cardinality	787
entropy	9.431
entropy_ratio	0.9803
alert: long_tail	681 singleton categories

Fig 15.

Top values for incident_address.

Top values for incident_address (20 unique shown, of 787 total).
value	count	share
31 HARRISON AVENUE	9	0.9%
64 SOUTH 11 STREET	8	0.8%
41-72 JUDGE STREET	7	0.7%
170 VERMILYEA AVENUE	7	0.7%
JOHN F KENNEDY AIRPORT	6	0.6%
42-01 LAYTON STREET	6	0.6%
21 TRUXTON STREET	6	0.6%
LA GUARDIA AIRPORT	5	0.5%
ROCKAWAY BOULEVARD	5	0.5%
1790 STORY AVENUE	5	0.5%
9 EAST 193 STREET	5	0.5%
147-40 ARCHER AVENUE	5	0.5%
108-26 159 STREET	4	0.4%
239 OCEAN AVENUE	4	0.4%
66 WELDON STREET	4	0.4%
1365 5 AVENUE	4	0.4%
538 EAST 21 STREET	4	0.4%
2854 KINGSBRIDGE TERRACE	4	0.4%
67 DRIVE	4	0.4%
3990 BRONX BOULEVARD	4	0.4%

street_name categorical feature

Street-name strings, almost certainly NYC thoroughfares given entries like OCEAN AVENUE, BROADWAY, and BAY PARKWAY. Cardinality is extreme: 592 unique values across 1000 rows with the top value covering only 1.0% and entropy_ratio of 0.96, so the distribution is essentially flat with a long tail. Null rate is low at 1.6%.

Treatment: Group rare streets into an 'other' bucket or target/frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
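
Frequency encoding, one of the options in the treatment above, replaces each street with its relative frequency among non-null values. A minimal sketch; `frequency_encode` is a hypothetical helper and the four-row input is a toy example using street names from Fig 16:

```python
from collections import Counter


def frequency_encode(values):
    """Map each value to its share of the non-null values; nulls stay None."""
    non_null = [value for value in values if value is not None]
    counts = Counter(non_null)
    total = len(non_null)
    return [counts[value] / total if value is not None else None for value in values]


streets = ["OCEAN AVENUE", "OCEAN AVENUE", "BROADWAY", None]
encoded = frequency_encode(streets)
print(encoded)
```

With 401 singleton streets in this sample, most rows would receive the same very small value, so the encoding mainly separates the handful of repeated streets from the tail.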

Out[39]:

saturn.columns["street_name"].stats

stat	value
n	1,000
nulls	16 (1.6%)
unique	592
top_value	OCEAN AVENUE
top_rate	0.01016
cardinality	592
entropy	8.886
entropy_ratio	0.9648
alert: long_tail	401 singleton categories

Fig 16.

Top values for street_name.

Top values for street_name (20 unique shown, of 592 total).
value	count	share
OCEAN AVENUE	10	1.0%
EAST 21 STREET	9	0.9%
HARRISON AVENUE	9	0.9%
SOUTH 11 STREET	8	0.8%
66 STREET	8	0.8%
BROADWAY	8	0.8%
JUDGE STREET	8	0.8%
WASHINGTON AVENUE	7	0.7%
BAY PARKWAY	7	0.7%
VERMILYEA AVENUE	7	0.7%
RANDALL AVENUE	6	0.6%
JOHN F KENNEDY AIRPORT	6	0.6%
30 AVENUE	6	0.6%
WOODYCREST AVENUE	6	0.6%
RYER AVENUE	6	0.6%
ROCKAWAY BOULEVARD	6	0.6%
HERING AVENUE	6	0.6%
LAYTON STREET	6	0.6%
TRUXTON STREET	6	0.6%
LA GUARDIA AIRPORT	5	0.5%

cross_street_1 categorical metadata

Street-name field used as a cross-street reference, likely from NYC service or incident records given entries like 'WYTHE AVENUE', 'ADAM CLAYTON POWELL JR BOULEVARD', and numbered avenues. Cardinality is extreme (457 unique across 1000 rows, entropy ratio 0.97) and the top value 'BEND' appears only 10 times (1.3% of the 750 non-null values, 1.0% of all rows), so no value dominates. A 25% null rate and the presence of the placeholder-like 'DEAD END' suggest inconsistent capture that should be reviewed before use.

Treatment: Normalise street strings and treat as high-cardinality location metadata; do not one-hot encode directly.

anthropic:claude-opus-4-7 · confidence high

Out[42]:

saturn.columns["cross_street_1"].stats

stat	value
n	1,000
nulls	250 (25.0%)
unique	457
top_value	BEND
top_rate	0.01333
cardinality	457
entropy	8.534
entropy_ratio	0.9658
alert: long_tail	309 singleton categories
alert: null_rate	25.0% null

Fig 17.

Top values for cross_street_1.

Top values for cross_street_1 (20 unique shown, of 457 total).
value	count	share
BEND	10	1.0%
WYTHE AVENUE	8	0.8%
DORCHESTER ROAD	8	0.8%
75 AVENUE	8	0.8%
99 STREET	7	0.7%
ADAM CLAYTON POWELL JR BOULEVARD	7	0.7%
5 AVENUE	7	0.7%
3 AVENUE	6	0.6%
18 AVENUE	6	0.6%
DEAD END	6	0.6%
19 AVENUE	6	0.6%
94 STREET	6	0.6%
WEST 162 STREET	6	0.6%
36 STREET	5	0.5%
8 AVENUE	5	0.5%
EAST BURNSIDE AVENUE	5	0.5%
ST NICHOLAS AVENUE	5	0.5%
MYRTLE AVENUE	5	0.5%
ROSEDALE AVENUE	5	0.5%
108 AVENUE	4	0.4%

cross_street_2 categorical feature

This is a free-text street name field, almost certainly the second cross-street bounding an incident or location record. Cardinality is extreme (458 unique across 750 non-null rows, entropy ratio 0.97) and the modal value 'BERRY STREET' appears just 8 times (1.07% of non-null rows, 0.8% of all rows), so no street dominates. A quarter of values are null, and oddities like 'DEAD END' and 'AIRTRAIN-HOWARD BCH/JAMAICA LINE' appear alongside normal street names, suggesting inconsistent free-text entry.

Treatment: Normalise casing/synonyms and bucket rare values, or drop in favour of geocoded coordinates.

anthropic:claude-opus-4-7 · confidence high

Out[45]:

saturn.columns["cross_street_2"].stats

stat	value
n	1,000
nulls	250 (25.0%)
unique	458
top_value	BERRY STREET
top_rate	0.01067
cardinality	458
entropy	8.536
entropy_ratio	0.9657
alert: long_tail	313 singleton categories
alert: null_rate	25.0% null

Fig 18.

Top values for cross_street_2.

Top values for cross_street_2 (20 unique shown, of 458 total).
value	count	share
BERRY STREET	8	0.8%
FREDERICK DOUGLASS BOULEVARD	8	0.8%
DITMAS AVENUE	8	0.8%
BOOTH STREET	7	0.7%
UNION TURNPIKE	7	0.7%
109 AVENUE	6	0.6%
AIRTRAIN-HOWARD BCH/JAMAICA LINE	6	0.6%
20 AVENUE	6	0.6%
DEAD END	6	0.6%
WEST 163 STREET	6	0.6%
BROADWAY	6	0.6%
102 STREET	6	0.6%
3 AVENUE	6	0.6%
10 AVENUE	6	0.6%
BEND	5	0.5%
19 AVENUE	5	0.5%
37 STREET	5	0.5%
18 AVENUE	5	0.5%
EAST 115 STREET	5	0.5%
GRAND CENTRAL PARKWAY ET 7 EB	5	0.5%

intersection_street_1 categorical feature

Cross-street name for an incident location, judging from values like 'WYTHE AVENUE', 'ADAM CLAYTON POWELL JR BOULEVARD', and 'DEAD END'. The field is sparse and highly diverse: 26.7% null and 436 distinct values across 1000 rows, with the most common ('BEND') appearing only 9 times and entropy ratio 0.965 indicating an almost-flat long tail. Note the presence of non-street tokens like 'BEND' and 'DEAD END', which suggest mixed semantics rather than clean street names.

Treatment: Normalize street names and bucket the long tail; treat nulls as a separate category before encoding.

anthropic:claude-opus-4-7 · confidence high
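
The treatment above asks for nulls to become their own category, and the prose flags 'BEND' and 'DEAD END' as non-street tokens. Both steps can be combined in one cleaning function. This is a sketch under assumptions: the `MISSING` and `NON_STREET` labels are hypothetical, and the token set only covers the two values named in this section.

```python
# Tokens flagged in the prose as non-street values; extend after reviewing the full column.
NON_STREET_TOKENS = {"BEND", "DEAD END"}


def clean_cross_street(value):
    """Return 'MISSING' for nulls, 'NON_STREET' for placeholder tokens, else the tidied name."""
    if value is None or not value.strip():
        return "MISSING"
    tidied = " ".join(value.split()).upper()
    return "NON_STREET" if tidied in NON_STREET_TOKENS else tidied


examples = [None, "dead  end", "WYTHE AVENUE", "BEND"]
cleaned = [clean_cross_street(value) for value in examples]
print(cleaned)
```

Keeping `MISSING` and `NON_STREET` as separate levels preserves the distinction between a value that was never captured and one that was captured but is not a street.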

Out[48]:

saturn.columns["intersection_street_1"].stats

stat	value
n	1,000
nulls	267 (26.7%)
unique	436
top_value	BEND
top_rate	0.01228
cardinality	436
entropy	8.465
entropy_ratio	0.9654
alert: long_tail	286 singleton categories
alert: null_rate	26.7% null

Fig 19.

Top values for intersection_street_1.

Top values for intersection_street_1 (20 unique shown, of 436 total).
value	count	share
BEND	9	0.9%
WYTHE AVENUE	8	0.8%
DORCHESTER ROAD	8	0.8%
75 AVENUE	8	0.8%
99 STREET	7	0.7%
ADAM CLAYTON POWELL JR BOULEVARD	7	0.7%
5 AVENUE	7	0.7%
3 AVENUE	6	0.6%
18 AVENUE	6	0.6%
DEAD END	6	0.6%
19 AVENUE	6	0.6%
94 STREET	6	0.6%
WEST 162 STREET	6	0.6%
36 STREET	5	0.5%
8 AVENUE	5	0.5%
EAST BURNSIDE AVENUE	5	0.5%
ST NICHOLAS AVENUE	5	0.5%
AVENUE N	5	0.5%
MYRTLE AVENUE	5	0.5%
ROSEDALE AVENUE	5	0.5%

intersection_street_2 categorical feature

This column holds the second cross-street of an intersection, drawn from NYC street names like BERRY STREET, FREDERICK DOUGLASS BOULEVARD, and DITMAS AVENUE. It is extremely high-cardinality (443 unique values across 1000 rows, entropy ratio 0.97) with a long flat tail: the top value appears only 8 times (1.1% of the 733 non-null values, 0.8% of all rows). Notably, 26.7% of rows are null, and oddities like 'DEAD END' and 'AIRTRAIN-HOWARD BCH/JAMAICA LINE' appear alongside conventional street names.

Treatment: Normalize street strings and group rare values or geocode to coordinates before modelling; impute or flag the 26.7% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[51]:

saturn.columns["intersection_street_2"].stats

stat	value
n	1,000
nulls	267 (26.7%)
unique	443
top_value	BERRY STREET
top_rate	0.01091
cardinality	443
entropy	8.489
entropy_ratio	0.9657
alert: long_tail	298 singleton categories
alert: null_rate	26.7% null

Fig 20.

Top values for intersection_street_2.

Top values for intersection_street_2 (20 unique shown, of 443 total).
value	count	share
BERRY STREET	8	0.8%
FREDERICK DOUGLASS BOULEVARD	8	0.8%
DITMAS AVENUE	8	0.8%
BOOTH STREET	7	0.7%
UNION TURNPIKE	7	0.7%
AIRTRAIN-HOWARD BCH/JAMAICA LINE	6	0.6%
20 AVENUE	6	0.6%
DEAD END	6	0.6%
WEST 163 STREET	6	0.6%
BROADWAY	6	0.6%
102 STREET	6	0.6%
3 AVENUE	6	0.6%
10 AVENUE	6	0.6%
109 AVENUE	5	0.5%
19 AVENUE	5	0.5%
37 STREET	5	0.5%
18 AVENUE	5	0.5%
EAST 115 STREET	5	0.5%
GRAND CENTRAL PARKWAY ET 7 EB	5	0.5%
EAST 180 STREET	5	0.5%

address_type categorical feature

Categorical tag describing the kind of geolocation reference, with four levels: ADDRESS, INTERSECTION, PLACE, and BLOCKFACE. The distribution is highly imbalanced: ADDRESS covers 939 of 1000 rows (93.9%, or 94.2% of the 997 non-null values), leaving the other three categories with 36, 12, and 10 occurrences respectively. Entropy ratio is just 0.199, and 0.3% of rows are null.

Treatment: One-hot encode, but expect the non-ADDRESS levels to contribute little signal given the severe imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[54]:

saturn.columns["address_type"].stats

stat	value
n	1,000
nulls	3 (0.3%)
unique	4
top_value	ADDRESS
top_rate	0.9418
cardinality	4
entropy	0.3978
entropy_ratio	0.1989

Fig 21.

Top values for address_type.

Top values for address_type (4 unique shown, of 4 total).
value	count	share
ADDRESS	939	93.9%
INTERSECTION	36	3.6%
PLACE	12	1.2%
BLOCKFACE	10	1.0%

city categorical feature

Categorical column listing NYC-area city/neighborhood names, dominated by borough-level names with BROOKLYN top at 307 of 1000 rows (30.7%, or 31.6% of the 971 non-null values), followed by BRONX (227) and NEW YORK (170). Cardinality is modest (40 unique) and null rate is low (2.9%), but the mix conflates borough names (BROOKLYN, BRONX, QUEENS) with neighborhood names (ASTORIA, ELMHURST, RIDGEWOOD), so granularity is inconsistent. Entropy ratio of 0.61 confirms the heavy concentration in a few labels.

Treatment: Normalize to a consistent geographic level (e.g., map neighborhoods to boroughs) and one-hot or target-encode.

anthropic:claude-opus-4-7 · confidence high
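
Mapping `city` to a consistent borough level, as the treatment above suggests, can start from a small lookup. The table below is deliberately partial and illustrative (it covers only a few of the 40 values), `city_to_borough` is a hypothetical helper, and the `UNMAPPED` label is an assumption chosen to make gaps visible rather than guessing.

```python
# Partial, illustrative lookup; extend it to cover all 40 observed values before use.
CITY_TO_BOROUGH = {
    "NEW YORK": "MANHATTAN",
    "BROOKLYN": "BROOKLYN",
    "BRONX": "BRONX",
    "STATEN ISLAND": "STATEN ISLAND",
    "QUEENS": "QUEENS",
    "JAMAICA": "QUEENS",
    "ASTORIA": "QUEENS",
    "ELMHURST": "QUEENS",
    "FLUSHING": "QUEENS",
}


def city_to_borough(city):
    """Return the borough for a city label, None for nulls, 'UNMAPPED' for unknown labels."""
    if city is None:
        return None
    return CITY_TO_BOROUGH.get(city.strip().upper(), "UNMAPPED")


print(city_to_borough("Astoria"), city_to_borough("NEW YORK"), city_to_borough("SPRINGFIELD"))
```

Since the dataset already has a fully populated `borough` column, comparing the mapped result against it is a cheap consistency check.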

Out[57]:

saturn.columns["city"].stats

stat	value
n	1,000
nulls	29 (2.9%)
unique	40
top_value	BROOKLYN
top_rate	0.3162
cardinality	40
entropy	3.234
entropy_ratio	0.6077

Fig 22.

Top values for city.

Top values for city (20 unique shown, of 40 total).
value	count	share
BROOKLYN	307	30.7%
BRONX	227	22.7%
NEW YORK	170	17.0%
JAMAICA	34	3.4%
ASTORIA	23	2.3%
STATEN ISLAND	21	2.1%
ELMHURST	19	1.9%
RIDGEWOOD	15	1.5%
FRESH MEADOWS	15	1.5%
QUEENS	13	1.3%
FLUSHING	12	1.2%
WOODSIDE	10	1.0%
FAR ROCKAWAY	9	0.9%
REGO PARK	9	0.9%
CORONA	8	0.8%
EAST ELMHURST	7	0.7%
MIDDLE VILLAGE	7	0.7%
JACKSON HEIGHTS	5	0.5%
SOUTH OZONE PARK	5	0.5%
FOREST HILLS	5	0.5%

landmark categorical metadata

This column holds landmark or street-name references, mostly NYC thoroughfares such as EAST 21 STREET and OCEAN AVENUE, alongside places such as JOHN F KENNEDY AIRPORT. Nearly a third of rows are null (30.8%) and the distribution is extremely long-tailed: 445 unique values across only 692 non-null rows, with the top value covering just 1.3% of non-null rows and entropy ratio at 0.969. No single landmark carries signal on its own.

Treatment: Treat as high-cardinality free text: drop or bucket into broad categories rather than one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[60]:

saturn.columns["landmark"].stats

stat	value
n	1,000
nulls	308 (30.8%)
unique	445
top_value	EAST 21 STREET
top_rate	0.01301
cardinality	445
entropy	8.522
entropy_ratio	0.9686
alert: long_tail	312 singleton categories
alert: null_rate	30.8% null

Fig 23.

Top values for landmark.

Top values for landmark (20 unique shown, of 445 total).
value	count	share
EAST 21 STREET	9	0.9%
SOUTH 11 STREET	8	0.8%
OCEAN AVENUE	8	0.8%
WASHINGTON AVENUE	7	0.7%
RANDALL AVENUE	6	0.6%
JOHN F KENNEDY AIRPORT	6	0.6%
30 AVENUE	6	0.6%
WOODYCREST AVENUE	6	0.6%
BROADWAY	6	0.6%
RYER AVENUE	6	0.6%
HERING AVENUE	6	0.6%
66 STREET	5	0.5%
LA GUARDIA AIRPORT	5	0.5%
YATES AVENUE	5	0.5%
KINGSBRIDGE TERRACE	5	0.5%
STORY AVENUE	5	0.5%
159 STREET	4	0.4%
BAY PARKWAY	4	0.4%
WELDON STREET	4	0.4%
5 AVENUE	4	0.4%

status categorical label

A 3-level categorical status field (In Progress, Closed, Open) with no nulls across 1000 rows. The distribution is nearly uniform — entropy_ratio of 0.991 and a modest top_rate of 0.388 — so no class dominates, which is unusual for status fields that often skew heavily to one state.

Treatment: One-hot or ordinal encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[63]:

saturn.columns["status"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	3
top_value	In Progress
top_rate	0.388
cardinality	3
entropy	1.571
entropy_ratio	0.9913

Fig 24.

Top values for status.

Top values for status (3 unique shown, of 3 total).
value	count	share
In Progress	388	38.8%
Closed	336	33.6%
Open	276	27.6%

community_board categorical metadata

This column encodes NYC community board assignments, formatted as a zero-padded board number plus borough name (e.g., '09 BRONX', '12 MANHATTAN'). With 64 unique values across 1000 rows, no nulls, and a near-uniform distribution (entropy ratio 0.955, top value only 4.4%), no single board dominates. The composite format mixes two facts (board id + borough) into one string, which is worth splitting before analysis.

Treatment: Split into borough and board-number fields, then treat as a categorical geographic key.

anthropic:claude-opus-4-7 · confidence high
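
The split recommended above might look like the following sketch. It assumes every populated value follows the "two digits, space, borough name" pattern seen in the top values; anything else is returned as None so it can be inspected rather than silently mis-parsed. `split_community_board` is a hypothetical helper name.

```python
import re

# Two-digit board number, one space, then the borough name.
_BOARD_RE = re.compile(r"^(\d{2}) (.+)$")


def split_community_board(value):
    """Split '12 MANHATTAN' into ('12', 'MANHATTAN'); None if the pattern fails."""
    if value is None:
        return None
    match = _BOARD_RE.match(value.strip())
    if match is None:
        return None
    # Keep the board number as a string so the zero padding survives.
    return match.group(1), match.group(2)


print(split_community_board("12 MANHATTAN"))
print(split_community_board("09 BRONX"))
```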

Out[66]:

saturn.columns["community_board"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	64
top_value	12 MANHATTAN
top_rate	0.044
cardinality	64
entropy	5.73
entropy_ratio	0.955

Fig 25.

Top values for community_board.

Top values for community_board (20 unique shown, of 64 total).
value	count	share
12 MANHATTAN	44	4.4%
11 BROOKLYN	40	4.0%
09 BRONX	38	3.8%
14 BROOKLYN	33	3.3%
11 BRONX	33	3.3%
12 QUEENS	31	3.1%
10 BRONX	26	2.6%
05 QUEENS	26	2.6%
01 QUEENS	25	2.5%
10 MANHATTAN	25	2.5%
05 BROOKLYN	24	2.4%
12 BRONX	23	2.3%
04 QUEENS	23	2.3%
01 BROOKLYN	22	2.2%
08 QUEENS	22	2.2%
12 BROOKLYN	22	2.2%
16 BROOKLYN	22	2.2%
17 BROOKLYN	19	1.9%
03 BRONX	19	1.9%
10 QUEENS	19	1.9%

council_district categorical feature

Categorical column holding council district codes as zero-padded strings (e.g. '09', '13'), with 51 distinct values across 1000 rows and a 1.3% null rate. The distribution is nearly uniform: entropy ratio is 0.968 and the top district '13' accounts for only 4.7% of non-null rows (46 of 987), so no single district dominates. Treat the codes as discrete labels rather than numbers since they are stored as strings with leading zeros.

Treatment: Keep as string categorical and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
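
The top_rate stat (0.04661) and the share column in the value table (4.6%) differ because they appear to use different denominators. A small worked check, assuming top_rate is computed over non-null rows and share over all rows; this is inferred from the numbers, not from saturn's documentation, and `top_rates` is a hypothetical helper:

```python
def top_rates(top_count, n_rows, n_nulls):
    """Return (rate over non-null rows, rate over all rows) for the top value."""
    non_null = n_rows - n_nulls
    return top_count / non_null, top_count / n_rows


# council_district: top value '13' appears 46 times, 13 of 1,000 rows are null.
rate_non_null, rate_all = top_rates(top_count=46, n_rows=1000, n_nulls=13)
print(f"non-null rate={rate_non_null:.5f} all-rows rate={rate_all:.3f}")
```

The same reading explains the other columns with nulls in this report, where top_rate is slightly higher than the share shown for the top row.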

Out[69]:

saturn.columns["council_district"].stats

stat	value
n	1,000
nulls	13 (1.3%)
unique	51
top_value	13
top_rate	0.04661
cardinality	51
entropy	5.492
entropy_ratio	0.9682

Fig 26.

Top values for council_district.

Top values for council_district (20 unique shown, of 51 total).
value	count	share
13	46	4.6%
10	44	4.4%
18	43	4.3%
40	38	3.8%
34	29	2.9%
37	29	2.9%
09	29	2.9%
35	28	2.8%
15	28	2.8%
43	27	2.7%
17	26	2.6%
28	25	2.5%
14	24	2.4%
25	23	2.3%
08	23	2.3%
11	23	2.3%
22	23	2.3%
24	23	2.3%
42	22	2.2%
30	22	2.2%

police_precinct categorical feature

This column identifies the police precinct associated with each record, using labels like 'Precinct 62' across 76 distinct values with no nulls. The distribution is nearly uniform: the top value appears in only 4% of rows and entropy ratio is 0.94, so no precinct dominates. Treat it as a high-cardinality categorical feature rather than a meaningful ranking.

Treatment: Target- or frequency-encode before modelling; avoid one-hot given 76 levels.

anthropic:claude-opus-4-7 · confidence high
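
Frequency encoding, one of the two options named in the treatment, replaces each label with its relative frequency. A minimal standard-library sketch; `frequency_encode` is a hypothetical helper, and in a real pipeline the frequencies would be learned on the training split only.

```python
from collections import Counter


def frequency_encode(values):
    """Map each label to its share of non-null rows; None stays None."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return [counts[v] / total if v is not None else None for v in values]


# Toy sample using labels from the police_precinct table.
sample = ["Precinct 62", "Precinct 62", "Precinct 43", None]
print(frequency_encode(sample))
```

This yields one numeric column instead of 76 indicator columns, at the cost of giving two precincts with equal frequency the same code.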

Out[72]:

saturn.columns["police_precinct"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	76
top_value	Precinct 62
top_rate	0.04
cardinality	76
entropy	5.883
entropy_ratio	0.9416

Fig 27.

Top values for police_precinct.

Top values for police_precinct (20 unique shown, of 76 total).
value	count	share
Precinct 62	40	4.0%
Precinct 43	38	3.8%
Precinct 34	36	3.6%
Precinct 70	33	3.3%
Precinct 49	33	3.3%
Precinct 104	29	2.9%
Precinct 114	28	2.8%
Precinct 45	26	2.6%
Precinct 75	24	2.4%
Precinct 47	23	2.3%
Precinct 110	23	2.3%
Precinct 107	22	2.2%
Precinct 66	22	2.2%
Precinct 73	22	2.2%
Precinct 67	19	1.9%
Precinct 113	19	1.9%
Precinct 42	19	1.9%
Precinct 106	19	1.9%
Precinct 71	19	1.9%
Precinct 90	18	1.8%

bbl categorical foreign_key

This column holds NYC Borough-Block-Lot (BBL) parcel identifiers — 10-digit codes where the leading digit encodes the borough (values 1–5 appear in the top entries). With 744 unique values across 1000 rows and entropy ratio 0.979, it is near-unique but has mild repetition (top BBL '5010780006' appears 9 times), suggesting multiple records per parcel rather than a primary key. Null rate is 5.9%, and the long_tail alert confirms most BBLs occur only once or twice.

Treatment: left-join on this id to a parcel/property reference table; do not use as a model feature directly.

anthropic:claude-opus-4-7 · confidence high
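
Before joining, it can help to split a BBL into its parts. A BBL is a 1-digit borough code, a 5-digit block, and a 4-digit lot; the sketch below relies on that layout and on the standard borough codes. `parse_bbl` is a hypothetical helper, and malformed values return None rather than raising.

```python
BOROUGH_BY_CODE = {
    1: "MANHATTAN",
    2: "BRONX",
    3: "BROOKLYN",
    4: "QUEENS",
    5: "STATEN ISLAND",
}


def parse_bbl(bbl):
    """Split a 10-digit BBL string into borough, block, and lot."""
    if bbl is None or len(bbl) != 10 or not (bbl.isascii() and bbl.isdigit()):
        return None
    borough_code = int(bbl[0])
    if borough_code not in BOROUGH_BY_CODE:
        return None
    return {
        "borough": BOROUGH_BY_CODE[borough_code],
        "block": int(bbl[1:6]),
        "lot": int(bbl[6:]),
    }


# Top value from the bbl table.
print(parse_bbl("5010780006"))
```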

Out[75]:

saturn.columns["bbl"].stats

stat	value
n	1,000
nulls	59 (5.9%)
unique	744
top_value	5010780006
top_rate	0.009564
cardinality	744
entropy	9.342
entropy_ratio	0.9793
alert: long_tail	636 singleton categories

Fig 28.

Top values for bbl.

Top values for bbl (20 unique shown, of 744 total).
value	count	share
5010780006	9	0.9%
3021600005	8	0.8%
4015070054	7	0.7%
1022280007	7	0.7%
4142600001	6	0.6%
2025110068	6	0.6%
4015080001	6	0.6%
3015420044	6	0.6%
4009260001	5	0.5%
2036370001	5	0.5%
2031910054	5	0.5%
4099987501	5	0.5%
4101460051	4	0.4%
3050260344	4	0.4%
2039730005	4	0.4%
3041640018	4	0.4%
1016180001	4	0.4%
3051840025	4	0.4%
2032530133	4	0.4%
2048200042	4	0.4%

borough categorical feature

This is a NYC borough categorical with all 5 expected values present and no nulls across 1000 rows. Distribution is fairly balanced (entropy ratio 0.895) with BROOKLYN leading at 31.2% and QUEENS at 26.1%; STATEN ISLAND is notably underrepresented at just 22 rows.

Treatment: one-hot encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[78]:

saturn.columns["borough"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	5
top_value	BROOKLYN
top_rate	0.312
cardinality	5
entropy	2.079
entropy_ratio	0.8953

Fig 29.

Top values for borough.

Top values for borough (5 unique shown, of 5 total).
value	count	share
BROOKLYN	312	31.2%
QUEENS	261	26.1%
BRONX	230	23.0%
MANHATTAN	175	17.5%
STATEN ISLAND	22	2.2%

x_coordinate_state_plane categorical feature

This column holds X coordinates in a state plane projection, stored as strings rather than numerics — values like "946638" and "1016981" are typical NYC-area easting values. Cardinality is extremely high (788 unique across 1000 rows, entropy ratio 0.98) with the modal value appearing only 9 times (0.9%), so it behaves as a near-continuous spatial measurement. Null rate is low at 1.2%.

Treatment: Cast to numeric and pair with the Y coordinate for spatial features rather than treating as categorical.

anthropic:claude-opus-4-7 · confidence high

Out[81]:

saturn.columns["x_coordinate_state_plane"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	788
top_value	946638
top_rate	0.009109
cardinality	788
entropy	9.433
entropy_ratio	0.9803
alert: long_tail	679 singleton categories

Fig 30.

Top values for x_coordinate_state_plane.

Top values for x_coordinate_state_plane (20 unique shown, of 788 total).
value	count	share
946638	9	0.9%
993540	8	0.8%
1016981	7	0.7%
1006636	7	0.7%
1043001	6	0.6%
1016259	6	0.6%
1009589	6	0.6%
1018236	5	0.5%
1021634	5	0.5%
1012556	5	0.5%
1037889	5	0.5%
1041463	4	0.4%
994731	4	0.4%
1019629	4	0.4%
998703	4	0.4%
1025645	4	0.4%
995841	4	0.4%
1011173	4	0.4%
1025158	4	0.4%
1021852	4	0.4%

y_coordinate_state_plane categorical feature

This is a State Plane Y-coordinate (northing), stored as strings rather than numerics — 788 unique values across 1000 rows with only a 1.2% null rate. The distribution is essentially flat (entropy ratio 0.98, top value '171301' appearing just 9 times), consistent with continuous spatial coordinates rather than a true category. The categorical typing is the surprise here; it should be a numeric geo-feature.

Treatment: Cast to numeric and pair with the X-coordinate as a geospatial feature; do not one-hot encode.

anthropic:claude-opus-4-7 · confidence high
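
The cast-and-pair step recommended for both state plane columns could be sketched as follows. `to_float` and `pair_xy` are hypothetical helpers; the example pairs the top X and top Y values purely for illustration, since the report does not show whether they come from the same rows.

```python
def to_float(value):
    """Convert a string coordinate to float; None or unparseable input gives None."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def pair_xy(xs, ys):
    """Zip X and Y columns into (x, y) tuples, or None when either side is missing."""
    points = []
    for x, y in zip(xs, ys):
        fx, fy = to_float(x), to_float(y)
        points.append((fx, fy) if fx is not None and fy is not None else None)
    return points


# Illustrative values taken from the two top-value tables.
print(pair_xy(["946638", "993540", None], ["171301", "197050", "210691"]))
```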

Out[84]:

saturn.columns["y_coordinate_state_plane"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	788
top_value	171301
top_rate	0.009109
cardinality	788
entropy	9.437
entropy_ratio	0.9808
alert: long_tail	675 singleton categories

Fig 31.

Top values for y_coordinate_state_plane.

Top values for y_coordinate_state_plane (20 unique shown, of 788 total).
value	count	share
171301	9	0.9%
197050	8	0.8%
210691	7	0.7%
255285	7	0.7%
175548	6	0.6%
210546	6	0.6%
186419	6	0.6%
221443	5	0.5%
239284	5	0.5%
255017	5	0.5%
194669	5	0.5%
192497	4	0.4%
178635	4	0.4%
187056	4	0.4%
230311	4	0.4%
172790	4	0.4%
257790	4	0.4%
203433	4	0.4%
263320	4	0.4%
241935	4	0.4%

open_data_channel_type categorical feature

This is a low-cardinality categorical recording the intake channel for a request, with only 4 distinct values and no nulls. ONLINE leads at 52.7% (527/1000), followed by MOBILE (234) and PHONE (214), while UNKNOWN appears 25 times and may warrant treatment as missing. The entropy ratio of 0.79 reflects that split: ONLINE takes just over half the rows, and MOBILE and PHONE divide most of the remainder almost evenly.

Treatment: One-hot encode and consider mapping UNKNOWN to null.

anthropic:claude-opus-4-7 · confidence high
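
One way to apply that treatment is to fold UNKNOWN into a missing indicator while one-hot encoding the three real channels. This is a sketch only; the output field names (`channel_ONLINE`, `channel_missing`, and so on) are hypothetical.

```python
CHANNELS = ["ONLINE", "MOBILE", "PHONE"]


def encode_channel(value):
    """One-hot encode the channel, treating None and 'UNKNOWN' as missing."""
    cleaned = None if value in (None, "UNKNOWN") else value
    row = {f"channel_{name}": int(cleaned == name) for name in CHANNELS}
    row["channel_missing"] = int(cleaned is None)
    return row


print(encode_channel("ONLINE"))
print(encode_channel("UNKNOWN"))
```

A value outside the three known channels would encode as all zeros, so new channels should be checked for before reusing this on a fresh extract.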

Out[87]:

saturn.columns["open_data_channel_type"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	4
top_value	ONLINE
top_rate	0.527
cardinality	4
entropy	1.586
entropy_ratio	0.7932

Fig 32.

Top values for open_data_channel_type.

Top values for open_data_channel_type (4 unique shown, of 4 total).
value	count	share
ONLINE	527	52.7%
MOBILE	234	23.4%
PHONE	214	21.4%
UNKNOWN	25	2.5%

park_facility_name categorical feature

Categorical column for a park or facility name, but it is effectively a constant: 998 of 1000 rows are 'Unspecified', with single occurrences of 'Marcus Garvey Park' and 'Forest Park'. Entropy ratio is 0.014 and top_rate is 0.998, so this column carries virtually no signal despite having no nulls.

Treatment: Drop; near-constant with 99.8% 'Unspecified'.

anthropic:claude-opus-4-7 · confidence high

Out[90]:

saturn.columns["park_facility_name"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	3
top_value	Unspecified
top_rate	0.998
cardinality	3
entropy	0.02281
entropy_ratio	0.01439
alert: long_tail	2 singleton categories
alert: imbalance	top value is 99.8% of rows

Fig 33.

Top values for park_facility_name.

Top values for park_facility_name (3 unique shown, of 3 total).
value	count	share
Unspecified	998	99.8%
Marcus Garvey Park	1	0.1%
Forest Park	1	0.1%

park_borough categorical feature

This column records one of New York City's five boroughs (likely the borough of an associated park), with no missing values across 1000 rows. Distribution is fairly even — entropy ratio 0.8953 — but Staten Island is sharply underrepresented at just 22 rows versus Brooklyn's 312.

Treatment: One-hot encode the five borough levels before modelling.

anthropic:claude-opus-4-7 · confidence high
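
The counts reported for park_borough (312, 261, 230, 175, 22) are identical to those reported for borough, which hints that the two columns may carry the same value on every row. Matching counts do not prove that, so a row-level check is worth running before encoding both. A sketch, assuming the records are available as a list of dicts; `columns_identical` is a hypothetical helper.

```python
def columns_identical(rows, col_a, col_b):
    """Return (True, []) if two columns match on every row, else (False, mismatch indexes)."""
    mismatches = [i for i, row in enumerate(rows) if row.get(col_a) != row.get(col_b)]
    return len(mismatches) == 0, mismatches


# Toy records; the real check would run over all 1,000 rows.
rows = [
    {"borough": "BROOKLYN", "park_borough": "BROOKLYN"},
    {"borough": "QUEENS", "park_borough": "QUEENS"},
    {"borough": "BRONX", "park_borough": "MANHATTAN"},
]
print(columns_identical(rows, "borough", "park_borough"))
```

If the columns do match everywhere, keeping only one avoids perfectly collinear one-hot features.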

Out[93]:

saturn.columns["park_borough"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	5
top_value	BROOKLYN
top_rate	0.312
cardinality	5
entropy	2.079
entropy_ratio	0.8953

Fig 34.

Top values for park_borough.

Top values for park_borough (5 unique shown, of 5 total).
value	count	share
BROOKLYN	312	31.2%
QUEENS	261	26.1%
BRONX	230	23.0%
MANHATTAN	175	17.5%
STATEN ISLAND	22	2.2%

latitude categorical feature

Latitude coordinates stored as strings rather than floats, with 794 unique values across 988 non-null rows (1.2% nulls). All top values cluster in the 40.6-40.87 range, consistent with the New York City area. The near-maximum entropy ratio (0.98) and 0.9% top rate confirm this is effectively continuous geospatial data miscast as categorical.

Treatment: cast to float and pair with longitude for geospatial features rather than treating as categorical.

anthropic:claude-opus-4-7 · confidence high

Out[96]:

saturn.columns["latitude"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	794
top_value	40.63677840515416
top_rate	0.009109
cardinality	794
entropy	9.449
entropy_ratio	0.9809
alert: long_tail	687 singleton categories

Fig 35.

Top values for latitude.

Top values for latitude (20 unique shown, of 794 total).
value	count	share
40.63677840515416	9	0.9%
40.7075286122006	8	0.8%
40.744914143624996	7	0.7%
40.86734453983834	7	0.7%
40.64832048620134	6	0.6%
40.74451879806252	6	0.6%
40.67831758601989	6	0.6%
40.77442086598845	5	0.5%
40.82337577056312	5	0.5%
40.866591963493335	5	0.5%
40.700835637070725	5	0.5%
40.694851633900804	4	0.4%
40.65698230332589	4	0.4%
40.680031571772965	4	0.4%
40.79881467021513	4	0.4%
40.64093766389099	4	0.4%
40.874207322276355	4	0.4%
40.72495870564602	4	0.4%
40.88934639713765	4	0.4%
40.83066324047387	4	0.4%

longitude categorical feature

This is a longitude coordinate column, almost certainly geographic points in the New York City area given values clustered around -73.8 to -74.1. It has been ingested as categorical text rather than numeric, with 794 unique values across 1000 rows and a top value frequency of just 0.9%, indicating a long tail of near-unique floats. Null rate is low at 1.2%, and entropy ratio of 0.98 confirms values are spread very evenly.

Treatment: Cast to float and use as a numeric geospatial feature, optionally paired with latitude for distance or grid encoding.

anthropic:claude-opus-4-7 · confidence high
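
Once latitude and longitude are cast to float, a distance feature is straightforward. The sketch below uses the haversine formula with a mean Earth radius of 6371 km, which is a common approximation. `haversine_km` is a hypothetical helper, and the two points are top values from the latitude and longitude tables paired for illustration only, since the report does not show which rows they share.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Strings as stored in the column, cast to float before use.
point_a = (float("40.63677840515416"), float("-74.13551741527912"))
point_b = (float("40.7075286122006"), float("-73.96649227605458"))
print(round(haversine_km(*point_a, *point_b), 2), "km")
```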

Out[99]:

saturn.columns["longitude"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	794
top_value	-74.13551741527912
top_rate	0.009109
cardinality	794
entropy	9.449
entropy_ratio	0.9809
alert: long_tail	687 singleton categories

Fig 36.

Top values for longitude.

Top values for longitude (20 unique shown, of 794 total).
value	count	share
-74.13551741527912	9	0.9%
-73.96649227605458	8	0.8%
-73.88187760396336	7	0.7%
-73.91906278851496	7	0.7%
-73.78828125130184	6	0.6%
-73.88448390553113	6	0.6%
-73.90864580709884	6	0.6%
-73.87729410513894	5	0.5%
-73.86492637415863	5	0.5%
-73.89766001179035	5	0.5%
-73.80655093855829	5	0.5%
-73.79367980321378	4	0.4%
-73.96222514965272	4	0.4%
-73.8724454944903	4	0.4%
-73.94779857268328	4	0.4%
-73.95823461385314	4	0.4%
-73.9026490846547	4	0.4%
-73.85241193799187	4	0.4%
-73.86400388206638	4	0.4%
-73.87487410501063	4	0.4%

location unknown metadata

The column is named "location" but saturn skipped detailed profiling, so its kind is unknown and no value statistics are available. We can only confirm there are 1000 rows with a 0.0 null rate; uniqueness, cardinality, and value distribution are all missing. The name suggests geographic or place data, but without evidence we cannot verify format (string, coordinates, codes) or quality.

Treatment: Re-profile with an appropriate parser before deciding; do not use as-is.

anthropic:claude-opus-4-7 · confidence low

Out[102]:

saturn.columns["location"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

:@computed_region_f5dn_yrer categorical foreign_key

This is a Socrata-style computed region column (`:@computed_region_f5dn_yrer`), almost certainly a spatial join key mapping each row to one of 62 geographic regions. The distribution is remarkably flat: the entropy ratio is 0.957 and the top value '47' covers only 4.5% of non-null rows (44 of 988), so no single region dominates. Null rate is low at 1.2%.

Treatment: Treat as a categorical region id; one-hot or target-encode, or left-join to a region lookup.

anthropic:claude-opus-4-7 · confidence high
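
The left-join option applies to all of the computed region columns. A standard-library sketch, assuming a region lookup is available as a dict keyed by region id; the lookup contents and the `region_name` field are hypothetical, because the report does not say which boundary set these ids refer to.

```python
def left_join(rows, lookup, key, new_field):
    """Attach lookup[row[key]] to each row; unmatched or null keys get None."""
    joined = []
    for row in rows:
        merged = dict(row)  # copy so the input rows are left untouched
        merged[new_field] = lookup.get(row.get(key))
        joined.append(merged)
    return joined


REGION_COL = ":@computed_region_f5dn_yrer"
# Hypothetical lookup: real names would come from the boundary dataset.
region_lookup = {"47": "Region 47 (placeholder)", "1": "Region 1 (placeholder)"}

rows = [{REGION_COL: "47"}, {REGION_COL: "58"}, {REGION_COL: None}]
for row in left_join(rows, region_lookup, REGION_COL, "region_name"):
    print(row)
```

Keeping the join as a left join preserves the 12 rows with null region ids instead of dropping them.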

Out[104]:

saturn.columns[":@computed_region_f5dn_yrer"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	62
top_value	47
top_rate	0.04453
cardinality	62
entropy	5.698
entropy_ratio	0.957

Fig 37.

Top values for :@computed_region_f5dn_yrer.

Top values for :@computed_region_f5dn_yrer (20 unique shown, of 62 total).
value	count	share
47	44	4.4%
1	40	4.0%
58	38	3.8%
60	33	3.3%
59	33	3.3%
41	31	3.1%
43	26	2.6%
54	26	2.6%
18	26	2.6%
39	25	2.5%
45	24	2.4%
29	23	2.3%
66	23	2.3%
36	22	2.2%
25	22	2.2%
2	22	2.2%
55	22	2.2%
61	19	1.9%
70	19	1.9%
34	19	1.9%

:@computed_region_yeji_bk3q categorical metadata

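This appears to be an auto-generated Socrata computed region column (`:@computed_region_yeji_bk3q`), holding a small geographic or administrative bucket id encoded as strings. Cardinality is just 5 with a fairly balanced spread (top value '2' at 31.6% of non-null rows, entropy ratio 0.896), though category '1' is rare at only 22 occurrences and 1.2% of rows are null. The counts track the borough column closely (312, 175, and 22 match BROOKLYN, MANHATTAN, and STATEN ISLAND exactly), so the ids probably correspond to borough boundaries.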

Treatment: Treat as a low-cardinality categorical region code; one-hot encode or drop if the region mapping is not needed.

anthropic:claude-opus-4-7 · confidence high

Out[107]:

saturn.columns[":@computed_region_yeji_bk3q"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	5
top_value	2
top_rate	0.3158
cardinality	5
entropy	2.08
entropy_ratio	0.8959

Fig 38.

Top values for :@computed_region_yeji_bk3q.

Top values for :@computed_region_yeji_bk3q (5 unique shown, of 5 total).
value	count	share
2	312	31.2%
3	250	25.0%
5	229	22.9%
4	175	17.5%
1	22	2.2%

:@computed_region_sbqj_enih categorical foreign_key

This is a Socrata-style computed region column (`:@computed_region_sbqj_enih`), almost certainly a spatial-join key assigning each row to one of 75 region polygons. The distribution is very flat: the entropy ratio is 0.94 and the top value '37' covers only 4.05% of non-null rows (40 of 988), so no single region dominates. Null rate is low at 1.2% (12 rows), the same count as the rows missing latitude and longitude, so the nulls are more likely rows with no coordinates to join than points that fell outside every polygon.

Treatment: Treat as a categorical region id; left-join to the region lookup or drop if spatial context isn't needed.

anthropic:claude-opus-4-7 · confidence high

Out[110]:

saturn.columns[":@computed_region_sbqj_enih"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	75
top_value	37
top_rate	0.04049
cardinality	75
entropy	5.867
entropy_ratio	0.942

Fig 39.

Top values for :@computed_region_sbqj_enih.

Top values for :@computed_region_sbqj_enih (20 unique shown, of 75 total).
value	count	share
37	40	4.0%
26	38	3.8%
22	36	3.6%
43	33	3.3%
32	33	3.3%
72	27	2.7%
28	26	2.6%
62	26	2.6%
47	24	2.4%
30	23	2.3%
68	23	2.3%
65	22	2.2%
39	22	2.2%
46	22	2.2%
40	19	1.9%
71	19	1.9%
25	19	1.9%
44	19	1.9%
56	18	1.8%
61	18	1.8%

:@computed_region_92fq_4b7q categorical foreign_key

This is a Socrata-style computed region column (`:@computed_region_92fq_4b7q`) holding 51 distinct integer-coded region IDs across 1000 rows, with only 1.2% nulls. The distribution is remarkably flat: the entropy ratio is 0.969 and the top value '12' covers just 4.66% of non-null rows (46 of 988). No single region drives the data, so this behaves like a geographic foreign key into an external boundary lookup.

Treatment: Left-join on this id to a region lookup, or treat as a high-cardinality categorical (target-encode) for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[113]:

saturn.columns[":@computed_region_92fq_4b7q"].stats

stat	value
n	1,000
nulls	12 (1.2%)
unique	51
top_value	12
top_rate	0.04656
cardinality	51
entropy	5.497
entropy_ratio	0.9691

Fig 40.

Top values for :@computed_region_92fq_4b7q.

Top values for :@computed_region_92fq_4b7q (20 unique shown, of 51 total).
value	count	share
12	46	4.6%
31	45	4.5%
39	41	4.1%
11	34	3.4%
36	32	3.2%
46	30	3.0%
22	29	2.9%
37	28	2.8%
35	27	2.7%
45	27	2.7%
48	27	2.7%
34	25	2.5%
28	24	2.4%
44	23	2.3%
5	23	2.3%
29	23	2.3%
4	22	2.2%
43	22	2.2%
24	22	2.2%
30	22	2.2%

descriptor_2 categorical feature

Secondary descriptor for service complaints, dominated by heating issues: "NO HEAT" leads at 29.0% of non-null rows (117 of 404), with "NO HOT WATER" variants and pest/structural codes filling the tail across 75 distinct values. Nearly 60% of rows are null (null_rate 0.596) and "N/A" appears as a literal value 98 times, so missingness is encoded inconsistently. The entropy ratio of 0.64 shows moderate concentration in the top categories, and the long_tail alert (42 singleton categories) confirms the long tail behind them.

Treatment: Normalize "N/A" to null, then group rare levels before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
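
The two-step treatment could be sketched as below. The threshold `min_count` and the `__RARE__` label are hypothetical choices; a distinct label is used because "Other" already occurs as a real value in this column. `normalise_and_group` is a hypothetical helper.

```python
from collections import Counter


def normalise_and_group(values, min_count=5, null_tokens=("N/A",), rare_label="__RARE__"):
    """Turn literal null tokens into None, then bucket levels seen fewer than min_count times."""
    cleaned = [
        None if v is None or v.strip() in null_tokens else v.strip()
        for v in values
    ]
    counts = Counter(v for v in cleaned if v is not None)
    return [
        v if v is None or counts[v] >= min_count else rare_label
        for v in cleaned
    ]


# Toy sample with a low threshold so the effect is visible.
sample = ["NO HEAT", "NO HEAT", "NO HEAT", "N/A", "ROACHES", None]
print(normalise_and_group(sample, min_count=2))
```

The column also mixes upper-case and title-case labels; this sketch leaves case untouched, so any case folding would be a separate decision.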

Out[116]:

saturn.columns["descriptor_2"].stats

stat	value
n	1,000
nulls	596 (59.6%)
unique	75
top_value	NO HEAT
top_rate	0.2896
cardinality	75
entropy	3.995
entropy_ratio	0.6413
alert: long_tail	42 singleton categories
alert: null_rate	59.6% null

Fig 41.

Top values for descriptor_2.

Top values for descriptor_2 (20 unique shown, of 75 total).
value	count	share
NO HEAT	117	11.7%
N/A	98	9.8%
NO HEAT AND NO HOT WATER	36	3.6%
NO HOT WATER	19	1.9%
NM1	6	0.6%
BROKEN OR MISSING	6	0.6%
ROACHES	6	0.6%
Operating Improperly	5	0.5%
NR5	5	0.5%
AT WALL OR CEILING	5	0.5%
Littering	4	0.4%
Fare/Tip Complaint - Credit Card	4	0.4%
Dog	4	0.4%
L10	4	0.4%
Blocking Driveway	3	0.3%
Not Cleaned by Property Owner	3	0.3%
Out	3	0.3%
MISSING OR INADEQUATE CANS/LID	3	0.3%
SAGGING OR SLOPING	3	0.3%
Other	2	0.2%

resolution_description categorical label

Canned resolution narratives attached to 311-style complaints, drawn from a fixed template library of 19 distinct strings covering HPD, NYPD, DOT, and Department of Health and Mental Hygiene outcomes. 41.2% of rows are null (likely still-open cases), and the top template alone covers 24.1% of populated rows (142 of 588), with entropy ratio 0.72 indicating moderate concentration across the small vocabulary. The text is boilerplate agency response language rather than free-form notes, so it behaves as a low-cardinality categorical disposition code.

Treatment: Map the 19 templates to short disposition codes (agency + outcome) and treat nulls as an explicit 'open/unresolved' category.

anthropic:claude-opus-4-7 · confidence high
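
A keyword-based mapping is one way to turn the templates into short codes. Everything in this sketch is illustrative: the marker phrases are taken from the templates listed in this report, but the code names (`NYPD_SUMMONS`, `NO_RESOLUTION`, and so on) and the choice of outcomes are hypothetical, and a production mapping would more likely be an explicit table of all 19 templates.

```python
# (marker phrase, agency code); first match wins.
AGENCY_MARKERS = [
    ("Police Department", "NYPD"),
    ("HPD", "HPD"),
    ("Department of Transportation", "DOT"),
    ("DOT", "DOT"),
    ("Department of Health and Mental Hygiene", "DOHMH"),
]

# (marker phrase, outcome code); first match wins, default is OTHER.
OUTCOME_MARKERS = [
    ("issued a summons", "SUMMONS"),
    ("made an arrest", "ARREST"),
    ("unable to gain entry", "NO_ENTRY"),
    ("duplicate", "DUPLICATE"),
]


def disposition_code(text):
    """Map a resolution template to '<AGENCY>_<OUTCOME>'; null text gets its own code."""
    if text is None:
        return "NO_RESOLUTION"
    agency = next((code for marker, code in AGENCY_MARKERS if marker in text), "UNKNOWN")
    outcome = next((code for marker, code in OUTCOME_MARKERS if marker in text), "OTHER")
    return f"{agency}_{outcome}"


print(disposition_code("This request required re-assignment to a new DOT unit."))
print(disposition_code(None))
```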

Out[119]:

saturn.columns["resolution_description"].stats

stat	value
n	1,000
nulls	412 (41.2%)
unique	19
top_value	The following complaint conditions are still open. HPD has already attempted to notify the property owner that the condition exists; the tenant should provide access for the owner to make the repair. HPD may attempt to contact the tenant by phone to verify the correction of the condition or an HPD Inspector may attempt to conduct an inspection.
top_rate	0.2415
cardinality	19
entropy	3.071
entropy_ratio	0.7228
alert: null_rate	41.2% null

Fig 42.

Top values for resolution_description.

Top values for resolution_description (19 unique shown, of 19 total).
value	count	share
The following complaint conditions are still open. HPD has already attempted to notify the property owner that the condition exists; the tenant should provide access for the owner to make the repair. HPD may attempt to contact the tenant by phone to verify the correction of the condition or an HPD Inspector may attempt to conduct an inspection.	142	14.2%
The New York City Police Department responded to the complaint and with the information available observed no evidence of a criminal violation at that time. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	120	12.0%
This complaint is a duplicate of a building-wide condition already reported by another tenant. The original complaint is still open, and HPD may only need to confirm that the condition exists by inspecting one apartment. If we cannot contact the tenant from the original complaint or get access to that apartment, HPD may attempt to contact the person who filed this complaint to verify the correction of the condition or may conduct an inspection of your unit. You can check HPDONLINE to see if a	89	8.9%
The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	67	6.7%
The New York City Police Department responded to the complaint and their investigation determined that a violation of law occurred. Police issued a summons in response to the complaint. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	41	4.1%
The New York City Police Department responded to the complaint but officers were unable to gain entry into the premises. If the problem persists, please contact 311 to create another complaint and ensure that contact information (e.g., buzzer number, phone number, etc.) is available to assist the responding officers in gaining entry to properly investigate the complaint. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	37	3.7%
The New York City Police Department responded to the complaint and their investigation determined that police action was not necessary. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	34	3.4%
The New York City Police Department responded to the complaint and observed no criminal violation upon their arrival. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	27	2.7%
The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.	9	0.9%
Your complaint has been received by the New York City Police Department and assigned to a unit at your local precinct. The responding officers will conduct an assessment of the complaint and may contact you for further information, if you left contact information. Our team is committed to resolving this matter promptly, and we appreciate your contribution to maintaining the well-being of our community.	8	0.8%
The New York City Police Department responded to the complaint and observed no encampment at the noted location. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	3	0.3%
The Department of Transportation determined that this complaint is a duplicate of a previously filed complaint. The original complaint is being addressed.	2	0.2%
The Police Department reviewed your complaint and provided additional information below.	2	0.2%
The Department of Health and Mental Hygiene has sent official written notification to the Owner/Landlord warning them of potential violations and instructing them to correct the situation. If the situation persists 21 days after your initial complaint, please make a new complaint.	2	0.2%
The Department of Health and Mental Hygiene has received and processed your complaint. All restaurants and mobile food vendors are inspected annually. Restaurant inspection results can be found on WWW.NYC.GOV or a copy of the inspection can be requested from 311.	1	0.1%
The New York City Police Department responded to the complaint and their investigation determined that a violation of law occurred. Police made an arrest in response to the complaint. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	1	0.1%
Service Request status for this request is available on the Department of Transportation’s website. Please click the “Learn More” link below.	1	0.1%
This request required re-assignment to a new DOT unit.	1	0.1%
The New York City Police Department responded to the complaint and a report was prepared as part of their investigation. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	1	0.1%

resolution_action_updated_date categorical timestamp

ISO-8601 datetimes recording when a resolution action was last updated, stored as strings rather than parsed timestamps. 40.9% of rows are null and a single value, 2026-01-19T00:00:00.000, accounts for 39.3% of non-null entries (232 of 591). It is likely a default or batch-stamped backfill at midnight, since most of the other visible timestamps fall on 2026-01-20 with second-level precision (two date back to 2025). The remaining 353 unique values appear at most twice, so the entropy ratio of 0.72 reflects one large spike followed by a long tail of near-unique timestamps.

Treatment: Parse to datetime, treat the midnight 2026-01-19 spike as a sentinel/default, and engineer null-flag plus recency features rather than using the raw string.

anthropic:claude-opus-4-7 · confidence high
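
The parse, sentinel flag, null flag, and recency feature could be derived together, as in this sketch. The reference time used for recency is an assumption (midnight on 2026-01-21, chosen only because the sample file name carries that date), and `date_features` with its output field names is hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"
SENTINEL = datetime(2026, 1, 19, 0, 0, 0)  # the midnight spike seen in the value table


def date_features(raw, reference):
    """Return null flag, sentinel flag, and age in hours relative to `reference`."""
    if raw is None:
        return {"is_null": 1, "is_sentinel": 0, "age_hours": None}
    ts = datetime.strptime(raw, TS_FORMAT)
    is_sentinel = int(ts == SENTINEL)
    # Recency is left empty for the sentinel, since its timestamp is not trusted.
    age_hours = None if is_sentinel else (reference - ts).total_seconds() / 3600
    return {"is_null": 0, "is_sentinel": is_sentinel, "age_hours": age_hours}


reference = datetime(2026, 1, 21)  # assumed snapshot time
print(date_features("2026-01-19T00:00:00.000", reference))
print(date_features("2026-01-20T01:49:13.000", reference))
print(date_features(None, reference))
```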

Out[122]:

saturn.columns["resolution_action_updated_date"].stats

stat	value
n	1,000
nulls	409 (40.9%)
unique	354
top_value	2026-01-19T00:00:00.000
top_rate	0.3926
cardinality	354
entropy	6.102
entropy_ratio	0.7206
alert: long_tail	347 singleton categories
alert: null_rate	40.9% null

Fig 43.

Top values for resolution_action_updated_date.

Top values for resolution_action_updated_date (20 unique shown, of 354 total).
value	count	share
2026-01-19T00:00:00.000	232	23.2%
2026-01-20T01:49:13.000	2	0.2%
2026-01-20T01:55:13.000	2	0.2%
2026-01-20T01:12:29.000	2	0.2%
2026-01-20T02:01:00.000	2	0.2%
2026-01-20T01:08:29.000	2	0.2%
2026-01-20T00:19:12.000	2	0.2%
2026-01-20T01:59:27.000	1	0.1%
2026-01-20T02:04:38.000	1	0.1%
2026-01-20T01:49:45.000	1	0.1%
2026-01-20T02:05:08.000	1	0.1%
2026-01-20T01:43:12.000	1	0.1%
2026-01-20T02:04:57.000	1	0.1%
2026-01-20T01:28:28.000	1	0.1%
2026-01-20T01:32:18.000	1	0.1%
2025-03-31T12:11:47.000	1	0.1%
2025-05-28T08:40:41.000	1	0.1%
2026-01-20T01:46:37.000	1	0.1%
2026-01-20T01:45:05.000	1	0.1%
2026-01-20T02:04:52.000	1	0.1%

taxi_pick_up_location categorical metadata

This is a free-text taxi pickup address, populated for only 12 of 1000 rows (null_rate 0.988). Among the 12 non-null entries, just 3 distinct addresses appear, dominated by JFK Airport (6) and LaGuardia Airport (5), with a single Manhattan street address making up the rest, which suggests the field is filled mostly for airport-related trips. With near-total nullity, this column carries almost no signal as-is.

Treatment: Drop or collapse to a binary 'pickup_address_present' flag given 98.8% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[125]:

saturn.columns["taxi_pick_up_location"].stats

stat	value
n	1,000
nulls	988 (98.8%)
unique	3
top_value	JOHN F KENNEDY AIRPORT, QUEENS (JAMAICA) ,NY, 11430
top_rate	0.5
cardinality	3
entropy	1.325
entropy_ratio	0.836
alert: null_rate	98.8% null

Fig 44.

Top values for taxi_pick_up_location.

Top values for taxi_pick_up_location (3 unique shown, of 3 total).
value	count	share
JOHN F KENNEDY AIRPORT, QUEENS (JAMAICA) ,NY, 11430	6	0.6%
LA GUARDIA AIRPORT, QUEENS (EAST ELMHURST) ,NY, 11369	5	0.5%
141 NAGLE AVENUE, MANHATTAN (NEW YORK), NY, 10040	1	0.1%

vehicle_type categorical feature

Categorical descriptor of vehicle class with five levels (Car, SUV, Other, Van, Truck), almost certainly a feature describing involved vehicles. The column is essentially empty: 95.7% of rows are null, leaving only 43 populated values, of which 65% are 'Car'. With such severe missingness, any modelling signal is fragile despite a reasonable entropy ratio (0.69) across the observed sample.

Treatment: Drop or encode missingness as its own level; do not rely on this column given 95.7% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[128]:

saturn.columns["vehicle_type"].stats

stat	value
n	1,000
nulls	957 (95.7%)
unique	5
top_value	Car
top_rate	0.6512
cardinality	5
entropy	1.599
entropy_ratio	0.6886
alert: null_rate	95.7% null

Fig 45.

Top values for vehicle_type.

Top values for vehicle_type (5 unique shown, of 5 total).
value	count	share
Car	28	2.8%
SUV	5	0.5%
Other	5	0.5%
Van	3	0.3%
Truck	2	0.2%
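
The stats table and the data table use different denominators: top_rate (0.6512) is relative to the 43 non-null values, while the share column (2.8% for Car) is relative to all 1,000 rows. A small sketch of the arithmetic, using the counts from the table above; the variable names are illustrative only.

```python
n_rows = 1000
nulls = 957
counts = {"Car": 28, "SUV": 5, "Other": 5, "Van": 3, "Truck": 2}

non_null = n_rows - nulls  # 43 populated rows
top_value, top_count = max(counts.items(), key=lambda kv: kv[1])

top_rate = top_count / non_null          # share among non-null values
share_of_all_rows = top_count / n_rows   # share among every row

print(top_value, round(top_rate, 4), f"{share_of_all_rows:.1%}")
```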

closed_date categorical timestamp

ISO-8601 timestamps recording when a record was closed, stored as strings rather than parsed datetimes. 66.4% of rows are null, which matches the combined share of records whose status is In Progress (38.8%) or Open (27.6%), so a null here means the record has not been closed. Among the 336 non-null entries there are 334 unique values. The 20 top values shown all fall within a narrow window on 2026-01-20 (roughly 01:00 to 02:05), which hints that closures cluster tightly in time, although 20 of 334 values is too small a slice to confirm it.

Treatment: Parse to datetime and derive features (e.g., time-to-close); treat nulls as 'still open' rather than imputing.

anthropic:claude-opus-4-7 · confidence high
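
A minimal sketch of the parse-and-derive treatment, using only the standard library. It assumes created_date uses the same ISO-8601 format as closed_date; the created_date values below are hypothetical, and the helper names are illustrative.

```python
from datetime import datetime


def parse_timestamp(text):
    """Parse an ISO-8601 string such as '2026-01-20T01:00:32.000'; None stays None."""
    return None if text is None else datetime.fromisoformat(text)


def hours_to_close(created, closed):
    """Hours between creation and closure, or None when the record is still open."""
    created_at, closed_at = parse_timestamp(created), parse_timestamp(closed)
    if created_at is None or closed_at is None:
        return None
    return (closed_at - created_at).total_seconds() / 3600


# The closed_date value is from the profile; both created_date values are hypothetical.
print(hours_to_close("2026-01-19T22:30:32.000", "2026-01-20T01:00:32.000"))  # 2.5
print(hours_to_close("2026-01-19T23:15:00.000", None))                        # None
```

Returning None for open records keeps "not yet closed" distinct from any real duration, which is the point of not imputing.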

Out[131]:

saturn.columns["closed_date"].stats

stat	value
n	1,000
nulls	664 (66.4%)
unique	334
top_value	2026-01-20T01:00:32.000
top_rate	0.005952
cardinality	334
entropy	8.38
entropy_ratio	0.9996
alert: long_tail	332 singleton categories
alert: null_rate	66.4% null

Fig 46. Top values for closed_date.

Top values for closed_date (20 unique shown, of 334 total).
value	count	share of all rows
2026-01-20T01:00:32.000	2	0.2%
2026-01-20T01:08:26.000	2	0.2%
2026-01-20T02:04:35.000	1	0.1%
2026-01-20T02:05:03.000	1	0.1%
2026-01-20T01:43:08.000	1	0.1%
2026-01-20T01:55:05.000	1	0.1%
2026-01-20T02:04:52.000	1	0.1%
2026-01-20T01:28:25.000	1	0.1%
2026-01-20T01:32:15.000	1	0.1%
2026-01-20T01:46:35.000	1	0.1%
2026-01-20T01:45:02.000	1	0.1%
2026-01-20T02:04:49.000	1	0.1%
2026-01-20T01:44:55.000	1	0.1%
2026-01-20T01:52:28.000	1	0.1%
2026-01-20T01:38:17.000	1	0.1%
2026-01-20T01:12:33.000	1	0.1%
2026-01-20T01:59:33.000	1	0.1%
2026-01-20T01:16:46.000	1	0.1%
2026-01-20T01:26:42.000	1	0.1%
2026-01-20T01:47:42.000	1	0.1%

bridge_highway_name categorical metadata

This column appears to record the bridge or highway name associated with each row, but it is essentially empty — 99.6% of values are null and only 4 distinct strings appear across 1000 rows. The four observed values ("J", "Cross Island Pkwy", "Long Island Expwy", "Nassau Expwy") each occur exactly once, with one entry ("J") looking like a stray code or data-entry artifact rather than a real road name.

Treatment: Drop or retain only as a sparse flag; too null-heavy to model directly.

anthropic:claude-opus-4-7 · confidence high

Out[134]:

saturn.columns["bridge_highway_name"].stats

stat	value
n	1,000
nulls	996 (99.6%)
unique	4
top_value	J
top_rate	0.25
cardinality	4
entropy	2
entropy_ratio	1
alert: long_tail	4 singleton categories
alert: null_rate	99.6% null

Fig 47. Top values for bridge_highway_name.

Top values for bridge_highway_name (4 unique shown, of 4 total).
value	count	share of all rows
J	1	0.1%
Cross Island Pkwy	1	0.1%
Long Island Expwy	1	0.1%
Nassau Expwy	1	0.1%

bridge_highway_segment categorical metadata

Categorical field naming a bridge or highway segment, but 99.6% of rows are null, with only 4 non-null values observed across 1,000 records. Each of the 4 surviving entries is unique (Mezzanine, Hempstead Ave, Van Dam St, Belt Pkwy), suggesting this attribute applies to a rare subset of incidents tied to specific NYC transportation infrastructure. Three of the values name highway exits; 'Mezzanine' does not look like a roadway segment.

Treatment: Drop or convert to a binary 'has_segment' flag given the 99.6% null rate.

anthropic:claude-opus-4-7 · confidence high

Out[137]:

saturn.columns["bridge_highway_segment"].stats

stat	value
n	1,000
nulls	996 (99.6%)
unique	4
top_value	Mezzanine
top_rate	0.25
cardinality	4
entropy	2
entropy_ratio	1
alert: long_tail	4 singleton categories
alert: null_rate	99.6% null

Fig 48. Top values for bridge_highway_segment.

Top values for bridge_highway_segment (4 unique shown, of 4 total).
value	count	share of all rows
Mezzanine	1	0.1%
Hempstead Ave (NY 24) (Exit 26C)	1	0.1%
Van Dam St (Exit 15) - Queens Midtown Tunnel	1	0.1%
Belt Pkwy So Conduit Ave (Exit 2 N)	1	0.1%

facility_type categorical metadata

This column is meant to record a facility type but is effectively empty: 95.9% of rows are null, and the only non-null value across all 1,000 rows is the literal string "N/A" (41 occurrences), which is itself a missing-value placeholder. Cardinality is 1 and entropy is 0, so the field carries no information as-is.

Treatment: Drop; zero entropy and 95.9% nulls leave nothing to model.

anthropic:claude-opus-4-7 · confidence high
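
Because the only non-null value is the literal string "N/A", the reported 95.9% null rate understates how empty the column is. The sketch below shows one way to normalize placeholder strings before computing a null rate; the placeholder list and helper names are hypothetical, and the column is rebuilt from the counts in the profile.

```python
PLACEHOLDERS = {"n/a", "na", "none", "null", ""}  # hypothetical list; adjust per dataset


def normalize_missing(value):
    """Map None and placeholder strings such as 'N/A' to None; keep real values."""
    if value is None:
        return None
    return None if str(value).strip().lower() in PLACEHOLDERS else value


def null_rate(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)


# Rebuild the column from the profile: 959 nulls and 41 literal "N/A" strings.
column = [None] * 959 + ["N/A"] * 41

print(null_rate(column))                                  # 0.959
print(null_rate([normalize_missing(v) for v in column]))  # 1.0
```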

Out[140]:

saturn.columns["facility_type"].stats

stat	value
n	1,000
nulls	959 (95.9%)
unique	1
top_value	N/A
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	95.9% null
alert: imbalance	top value is 100.0% of rows

Fig 49. Top values for facility_type.

Top values for facility_type (1 unique shown, of 1 total).
value	count	share of all rows
N/A	41	4.1%

bridge_highway_direction categorical metadata

Almost certainly a directional descriptor for bridge or highway traffic, with values like "South/Long Island Bound", "West/Manhattan Bound", and "Eastbound". The column is effectively empty: 99.7% of rows are null, leaving only 3 populated rows each with a distinct value. With such sparse coverage, the cardinality and entropy stats describe three observations rather than a meaningful distribution.

Treatment: Drop or retain only as a contextual flag; too sparse (99.7% null) for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[143]:

saturn.columns["bridge_highway_direction"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	3
top_value	South/Long Island Bound
top_rate	0.3333
cardinality	3
entropy	1.585
entropy_ratio	1
alert: long_tail	3 singleton categories
alert: null_rate	99.7% null

Fig 50. Top values for bridge_highway_direction.

Top values for bridge_highway_direction (3 unique shown, of 3 total).
value	count	share of all rows
South/Long Island Bound	1	0.1%
West/Manhattan Bound	1	0.1%
Eastbound	1	0.1%

road_ramp categorical feature

A categorical flag distinguishing 'Ramp' vs 'Roadway' segments, but it is effectively empty: 997 of 1000 rows are null, leaving only 3 observed values (2 'Ramp', 1 'Roadway'). With such sparse signal the entropy and top-rate stats are computed on just 3 records and carry no real information.

Treatment: Drop; null rate of 0.997 leaves no usable signal.

anthropic:claude-opus-4-7 · confidence high

Out[146]:

saturn.columns["road_ramp"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	Ramp
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 51. Top values for road_ramp.

Top values for road_ramp (2 unique shown, of 2 total).
value	count	share of all rows
Ramp	2	0.2%
Roadway	1	0.1%

taxi_company_borough categorical metadata

This column appears to be the borough associated with a taxi company, but it is effectively empty: 99.9% of rows are null and the single non-null value is 'BRONX'. With cardinality of 1 and entropy of 0, the field carries no information in this sample.

Treatment: Drop; the column is 99.9% null with only one observed category.

anthropic:claude-opus-4-7 · confidence high

Out[149]:

saturn.columns["taxi_company_borough"].stats

stat	value
n	1,000
nulls	999 (99.9%)
unique	1
top_value	BRONX
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: long_tail	1 singleton categories
alert: null_rate	99.9% null
alert: imbalance	top value is 100.0% of rows

Fig 52. Top values for taxi_company_borough.

Top values for taxi_company_borough (1 unique shown, of 1 total).
value	count	share of all rows
BRONX	1	0.1%
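
Most of the treatments for these sparse columns reduce to the same rule: drop a column once its null rate passes a cutoff. The sketch below applies that rule to the null counts reported in the stats tables above. The 0.95 threshold is an assumption that mirrors the ">95% null" wording in the summary, and the variable names are illustrative.

```python
N_ROWS = 1000
NULL_THRESHOLD = 0.95  # assumption: mirrors the ">95% null" cutoff in the summary

# Null counts as reported in the stats tables for these nine columns.
null_counts = {
    "taxi_pick_up_location": 988,
    "vehicle_type": 957,
    "closed_date": 664,
    "bridge_highway_name": 996,
    "bridge_highway_segment": 996,
    "facility_type": 959,
    "bridge_highway_direction": 997,
    "road_ramp": 997,
    "taxi_company_borough": 999,
}

to_drop = [col for col, nulls in null_counts.items() if nulls / N_ROWS > NULL_THRESHOLD]
to_keep = [col for col in null_counts if col not in to_drop]

print("drop:", to_drop)
print("keep:", to_keep)
```

Only closed_date survives the cutoff, which fits its treatment: its nulls mean "not yet closed" rather than "not recorded".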

How to cite