data-trove-nyc-311-service-requests

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample.json

Saturn profiled 1,000 rows across 47 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample.json",
    "--findings", "data-trove-nyc-311-service-requests.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset is a sample of 1,000 NYC 311 service requests, capturing complaints logged across the five boroughs with details on complaint type, location, agency, and resolution status. The dominant signal is complaint type: 'Noise - Residential' alone accounts for 39.3% of records, followed by 'Illegal Parking' (19.7%) and 'Noise - Commercial' (14.8%), pointing to a dataset heavily skewed toward NYPD-handled quality-of-life complaints (NYPD handles 88% of cases). A second area worth examining is resolution status — 61% of complaints are closed, 30.5% are still in progress, and 8.5% remain open, which raises questions about agency workload and response time. Many specialty columns (road_ramp, taxi_company_borough, bridge_highway fields) are nearly entirely null (99%+), indicating they apply only to rare complaint subtypes and can largely be ignored.

citing: complaint_type.top_value · complaint_type.top_rate · agency.top_value · agency.top_rate · status.top_values · borough.top_values · open_data_channel_type.top_values · descriptor.top_value · road_ramp.null_rate · taxi_company_borough.null_rate

Out[4]:

saturn.schema() · 47 columns

column	kind	n	null%	unique	alerts
unique_key	categorical	1,000	0.0%	1,000	long_tail
created_date	categorical	1,000	0.0%	939	long_tail
agency	categorical	1,000	0.0%	11
agency_name	categorical	1,000	0.0%	11
complaint_type	categorical	1,000	0.0%	45
descriptor	categorical	1,000	0.0%	71
location_type	categorical	1,000	1.4%	20
incident_zip	categorical	1,000	0.3%	146
incident_address	categorical	1,000	0.5%	776	long_tail
street_name	categorical	1,000	0.5%	569	long_tail
cross_street_1	categorical	1,000	8.0%	511	long_tail
cross_street_2	categorical	1,000	7.9%	504	long_tail
intersection_street_1	categorical	1,000	8.5%	506	long_tail
intersection_street_2	categorical	1,000	8.4%	499	long_tail
address_type	categorical	1,000	0.2%	4	imbalance
city	categorical	1,000	2.1%	36
landmark	categorical	1,000	10.5%	515	long_tail
status	categorical	1,000	0.0%	3
community_board	categorical	1,000	0.0%	65
council_district	categorical	1,000	0.9%	51
police_precinct	categorical	1,000	0.0%	76
bbl	categorical	1,000	5.5%	719	long_tail
borough	categorical	1,000	0.0%	5
x_coordinate_state_plane	categorical	1,000	0.7%	763	long_tail
y_coordinate_state_plane	categorical	1,000	0.7%	768	long_tail
open_data_channel_type	categorical	1,000	0.0%	4
park_facility_name	categorical	1,000	0.0%	2	imbalance
park_borough	categorical	1,000	0.0%	5
latitude	categorical	1,000	0.7%	770	long_tail
longitude	categorical	1,000	0.7%	770	long_tail
location	unknown	1,000	0.0%	—	skipped
:@computed_region_f5dn_yrer	categorical	1,000	0.7%	62
:@computed_region_yeji_bk3q	categorical	1,000	0.7%	5
:@computed_region_sbqj_enih	categorical	1,000	0.7%	75
:@computed_region_92fq_4b7q	categorical	1,000	0.7%	51
descriptor_2	categorical	1,000	88.2%	43	long_tail null_rate
resolution_description	categorical	1,000	30.7%	16	null_rate
resolution_action_updated_date	categorical	1,000	30.6%	585	long_tail null_rate
closed_date	categorical	1,000	39.0%	585	long_tail null_rate
taxi_pick_up_location	categorical	1,000	99.4%	6	long_tail null_rate
vehicle_type	categorical	1,000	96.5%	4	null_rate
facility_type	categorical	1,000	99.0%	1	null_rate imbalance
taxi_company_borough	categorical	1,000	99.9%	1	long_tail null_rate imbalance
bridge_highway_name	categorical	1,000	99.7%	2	null_rate
bridge_highway_direction	categorical	1,000	99.7%	2	null_rate
road_ramp	categorical	1,000	99.9%	1	long_tail null_rate imbalance
bridge_highway_segment	categorical	1,000	99.7%	2	null_rate

Fig 1.

complaint_type · Look for the steep drop-off after 'Noise - Residential' (39.3%) and 'Illegal Parking' (19.7%) — a small number of complaint types dominate the dataset.

Top values for complaint_type (20 unique shown, of 45 total).
value	count	share
Noise - Residential	393	39.3%
Illegal Parking	197	19.7%
Noise - Commercial	148	14.8%
Blocked Driveway	55	5.5%
HEAT/HOT WATER	49	4.9%
Noise - Street/Sidewalk	44	4.4%
Noise - Vehicle	17	1.7%
UNSANITARY CONDITION	12	1.2%
Street Condition	7	0.7%
Non-Emergency Police Matter	7	0.7%
Smoking or Vaping	5	0.5%
Abandoned Vehicle	5	0.5%
Animal-Abuse	5	0.5%
Rodent	4	0.4%
Encampment	4	0.4%
Taxi Complaint	3	0.3%
Dirty Condition	3	0.3%
PAINT/PLASTER	3	0.3%
GENERAL	3	0.3%
Drinking	2	0.2%

Fig 2.

status · With 61% closed, 30.5% in progress, and 8.5% open, check whether specific agencies or complaint types are driving the backlog.

Top values for status (3 unique shown, of 3 total).
value	count	share
Closed	610	61.0%
In Progress	305	30.5%
Open	85	8.5%

Fig 3.

borough · Queens, Manhattan, and Brooklyn each account for roughly a quarter of complaints, while Staten Island is a clear outlier at just 1.1%.

Top values for borough (5 unique shown, of 5 total).
value	count	share
QUEENS	274	27.4%
MANHATTAN	258	25.8%
BROOKLYN	254	25.4%
BRONX	203	20.3%
STATEN ISLAND	11	1.1%

Fig 4.

open_data_channel_type · Online (46.6%) and Mobile (28.8%) together dominate submission channels — look at whether channel type correlates with complaint category.

Top values for open_data_channel_type (4 unique shown, of 4 total).
value	count	share
ONLINE	466	46.6%
MOBILE	288	28.8%
PHONE	236	23.6%
UNKNOWN	10	1.0%

Fig 5.

descriptor · 'Loud Music/Party' at 42.7% dwarfs all other descriptors — confirm whether this single descriptor is inflating the noise complaint count.

Top values for descriptor (20 unique shown, of 71 total).
value	count	share
Loud Music/Party	427	42.7%
Banging/Pounding	114	11.4%
Blocked Hydrant	79	7.9%
No Access	45	4.5%
Posted Parking Sign Violation	42	4.2%
Loud Talking	36	3.6%
ENTIRE BUILDING	35	3.5%
Blocked Sidewalk	21	2.1%
Commercial Overnight Parking	20	2.0%
APARTMENT ONLY	14	1.4%
Car/Truck Music	13	1.3%
Double Parked Blocking Traffic	13	1.3%
Loud Television	10	1.0%
Partial Access	10	1.0%
PESTS	10	1.0%
Blocked Crosswalk	9	0.9%
Pothole	6	0.6%
Allowed in Smoke Free Area	5	0.5%
With License Plate	5	0.5%
No Shelter	5	0.5%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

column	kind	null %
unique_key	categorical	0.0%
created_date	categorical	0.0%
agency	categorical	0.0%
agency_name	categorical	0.0%
complaint_type	categorical	0.0%
descriptor	categorical	0.0%
location_type	categorical	1.4%
incident_zip	categorical	0.3%
incident_address	categorical	0.5%
street_name	categorical	0.5%
cross_street_1	categorical	8.0%
cross_street_2	categorical	7.9%
intersection_street_1	categorical	8.5%
intersection_street_2	categorical	8.4%
address_type	categorical	0.2%
city	categorical	2.1%
landmark	categorical	10.5%
status	categorical	0.0%
community_board	categorical	0.0%
council_district	categorical	0.9%
police_precinct	categorical	0.0%
bbl	categorical	5.5%
borough	categorical	0.0%
x_coordinate_state_plane	categorical	0.7%
y_coordinate_state_plane	categorical	0.7%
open_data_channel_type	categorical	0.0%
park_facility_name	categorical	0.0%
park_borough	categorical	0.0%
latitude	categorical	0.7%
longitude	categorical	0.7%
location	unknown	0.0%
:@computed_region_f5dn_yrer	categorical	0.7%
:@computed_region_yeji_bk3q	categorical	0.7%
:@computed_region_sbqj_enih	categorical	0.7%
:@computed_region_92fq_4b7q	categorical	0.7%
descriptor_2	categorical	88.2%
resolution_description	categorical	30.7%
resolution_action_updated_date	categorical	30.6%
closed_date	categorical	39.0%
taxi_pick_up_location	categorical	99.4%
vehicle_type	categorical	96.5%
facility_type	categorical	99.0%
taxi_company_borough	categorical	99.9%
bridge_highway_name	categorical	99.7%
bridge_highway_direction	categorical	99.7%
road_ramp	categorical	99.9%
bridge_highway_segment	categorical	99.7%

unique_key categorical identifier

This column is a unique row identifier, likely a primary key or transaction/record ID — every one of the 1000 rows has a distinct value, giving it perfect cardinality and a maximum entropy ratio of 1.0. Values appear to be numeric strings in a narrow range (~67519046–67525266), suggesting sequential or near-sequential ID assignment. There is nothing analytically surprising here beyond the expected long-tail alert, which is a trivial consequence of full uniqueness. The null rate is 0.0.

Treatment: Drop before modelling; retain only for row tracing or joins.

anthropic:default · confidence high

Out[12]:

saturn.columns["unique_key"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	1,000
top_value	67519092
top_rate	0.001
cardinality	1,000
entropy	9.966
entropy_ratio	1
alert: long_tail	1000 singleton categories
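
The long_tail alert above reports a count of singleton categories. A minimal sketch of how such a count could be computed, assuming (the Saturn output does not confirm this) that a singleton is a non-null value occurring exactly once. `count_singletons` is a hypothetical helper, not part of Saturn:

```python
from collections import Counter


def count_singletons(values):
    """Count distinct non-null values that occur exactly once."""
    counts = Counter(v for v in values if v is not None)
    return sum(1 for c in counts.values() if c == 1)


# Every unique_key is distinct, so each value is a singleton.
keys = ["67519092", "67525266", "67525239", "67521094"]
singletons = count_singletons(keys)

# A repeated value is not a singleton, and nulls are ignored.
mixed_singletons = count_singletons(["a", "a", "b", None])
```

For an identifier column the singleton count equals the row count, which is why the alert is uninformative here.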

Fig 7.

Top values for unique_key.

Top values for unique_key (20 unique shown, of 1000 total).
value	count	share
67519092	1	0.1%
67525266	1	0.1%
67525239	1	0.1%
67521094	1	0.1%
67519046	1	0.1%
67519060	1	0.1%
67524218	1	0.1%
67524221	1	0.1%
67520253	1	0.1%
67519078	1	0.1%
67522157	1	0.1%
67521098	1	0.1%
67524234	1	0.1%
67523183	1	0.1%
67519091	1	0.1%
67524231	1	0.1%
67520104	1	0.1%
67521095	1	0.1%
67522127	1	0.1%
67523186	1	0.1%

created_date categorical timestamp

This column is a creation timestamp, stored as a categorical string in ISO-8601 millisecond format (e.g., '2026-01-17T23:49:56.000'). With 939 unique values out of 1000 rows and an entropy ratio of 0.995, it behaves almost like a unique identifier. The long-tail alert and the fact that the most frequent value appears only 9 times (0.9% of rows) suggest occasional duplicate timestamps — likely batch inserts or rapid successive record creation within the same second.

Treatment: Parse to datetime, then engineer time-based features (hour, day-of-week, recency); do not use raw string for modelling.

anthropic:default · confidence high
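
The treatment above can be sketched with the standard library alone. This is an illustration under the assumption that every value follows the millisecond format shown in the example; `time_features` is a hypothetical helper, and the feature names are chosen here for illustration:

```python
from datetime import datetime


def time_features(raw):
    """Parse an ISO-8601 string with milliseconds and derive simple features."""
    ts = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f")
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,
    }


# The most frequent created_date value in this sample.
features = time_features("2026-01-17T23:49:56.000")
```

A recency feature would additionally need a reference timestamp, which the report does not specify.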

Out[15]:

saturn.columns["created_date"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	939
top_value	2026-01-17T23:49:56.000
top_rate	0.009
cardinality	939
entropy	9.828
entropy_ratio	0.9952
alert: long_tail	889 singleton categories

Fig 8.

Top values for created_date.

Top values for created_date (20 unique shown, of 939 total).
value	count	share
2026-01-17T23:49:56.000	9	0.9%
2026-01-17T23:43:55.000	4	0.4%
2026-01-18T01:11:10.000	3	0.3%
2026-01-17T23:04:06.000	3	0.3%
2026-01-18T01:25:57.000	2	0.2%
2026-01-18T01:25:16.000	2	0.2%
2026-01-18T01:18:23.000	2	0.2%
2026-01-18T01:12:27.000	2	0.2%
2026-01-18T01:08:57.000	2	0.2%
2026-01-18T01:07:55.000	2	0.2%
2026-01-18T01:04:36.000	2	0.2%
2026-01-18T01:02:54.000	2	0.2%
2026-01-18T00:53:20.000	2	0.2%
2026-01-18T00:47:20.000	2	0.2%
2026-01-18T00:38:56.000	2	0.2%
2026-01-18T00:35:43.000	2	0.2%
2026-01-18T00:35:09.000	2	0.2%
2026-01-18T00:31:54.000	2	0.2%
2026-01-18T00:28:06.000	2	0.2%
2026-01-18T00:26:25.000	2	0.2%

agency categorical label

This column identifies the New York City municipal agency associated with each record, with 11 distinct agency codes and no nulls across 1,000 rows. The distribution is severely skewed: NYPD alone accounts for 88% of all records (880 of 1,000), while the remaining 10 agencies collectively cover only 120 rows, and five of them (DHS, OOS, DCWP, DPR, DEP) appear just twice each. The low entropy ratio of 0.223 confirms this near-monolithic concentration, which could bias any agency-level analysis or model trained on this sample.

Treatment: One-hot encode for modelling, but note severe class imbalance — consider oversampling or stratified splitting to ensure minority agencies are represented.

anthropic:default · confidence high

Out[18]:

saturn.columns["agency"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	11
top_value	NYPD
top_rate	0.88
cardinality	11
entropy	0.7715
entropy_ratio	0.223
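
The entropy and entropy_ratio figures above are consistent with Shannon entropy in bits, divided by log2 of the cardinality. That reading is inferred from the reported numbers rather than from Saturn's documentation, so treat the sketch below as an assumption. `entropy_bits` and `entropy_ratio` are hypothetical helpers:

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Counts from the agency top-values table (11 categories, 1,000 rows).
agency_counts = [880, 74, 13, 11, 6, 6, 2, 2, 2, 2, 2]
h = entropy_bits(agency_counts)
ratio = entropy_ratio(agency_counts)
```

With these counts the sketch reproduces the reported 0.7715 and 0.223 to the displayed precision. A column of all-distinct values gives a ratio of 1, matching unique_key.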

Fig 9.

Top values for agency.

Top values for agency (11 unique shown, of 11 total).
value	count	share
NYPD	880	88.0%
HPD	74	7.4%
DOHMH	13	1.3%
DOT	11	1.1%
TLC	6	0.6%
DSNY	6	0.6%
DHS	2	0.2%
OOS	2	0.2%
DCWP	2	0.2%
DPR	2	0.2%
DEP	2	0.2%

agency_name categorical label

This column identifies the New York City municipal agency responsible for each record, with 11 distinct agencies across 1,000 rows and no nulls. It is severely dominated by the New York City Police Department, which accounts for 88% of all records (880 of 1,000), producing a very low entropy ratio of 0.223. The remaining 10 agencies collectively cover only 120 records, with several (e.g., Department of Homeless Services, Office of the Sheriff) appearing just twice. This extreme class imbalance will distort any agency-level aggregation or model that uses this column as a feature.

Treatment: One-hot encode with caution given severe class imbalance; consider grouping rare agencies (≤6 occurrences) into an 'Other' category before modelling.

anthropic:default · confidence high
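
The grouping suggested in the treatment can be sketched as follows. The threshold of 6 comes from the treatment text; `group_rare` and the 'Other' label are illustrative choices, not Saturn functionality:

```python
from collections import Counter


def group_rare(counts, max_rare=6, other_label="Other"):
    """Fold categories with at most max_rare occurrences into one bucket."""
    grouped = Counter()
    for value, count in counts.items():
        key = other_label if count <= max_rare else value
        grouped[key] += count
    return dict(grouped)


# Counts from the agency_name top-values table.
agency_name_counts = {
    "New York City Police Department": 880,
    "Department of Housing Preservation and Development": 74,
    "Department of Health and Mental Hygiene": 13,
    "Department of Transportation": 11,
    "Taxi and Limousine Commission": 6,
    "Department of Sanitation": 6,
    "Department of Homeless Services": 2,
    "Office of the Sheriff": 2,
    "Department of Consumer and Worker Protection": 2,
    "Department of Parks and Recreation": 2,
    "Department of Environmental Protection": 2,
}
grouped = group_rare(agency_name_counts)
```

On this sample the 11 agencies collapse to 5 levels, with 22 rows in the 'Other' bucket.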

Out[21]:

saturn.columns["agency_name"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	11
top_value	New York City Police Department
top_rate	0.88
cardinality	11
entropy	0.7715
entropy_ratio	0.223

Fig 10.

Top values for agency_name.

Top values for agency_name (11 unique shown, of 11 total).
value	count	share
New York City Police Department	880	88.0%
Department of Housing Preservation and Development	74	7.4%
Department of Health and Mental Hygiene	13	1.3%
Department of Transportation	11	1.1%
Taxi and Limousine Commission	6	0.6%
Department of Sanitation	6	0.6%
Department of Homeless Services	2	0.2%
Office of the Sheriff	2	0.2%
Department of Consumer and Worker Protection	2	0.2%
Department of Parks and Recreation	2	0.2%
Department of Environmental Protection	2	0.2%

complaint_type categorical label

This column contains the categorized type of civic complaint (likely NYC 311 service requests), with 45 distinct complaint categories across 1,000 records and zero nulls. The distribution is heavily skewed: 'Noise - Residential' alone accounts for 39.3% of all records, and the three largest noise categories ('Noise - Residential', 'Noise - Commercial', and 'Noise - Street/Sidewalk') together represent 58.5% of the dataset. The entropy ratio of 0.53 indicates moderate concentration, far from uniform, with a long tail of rare complaint types below the top 10.

Treatment: One-hot encode or target-encode for modelling; consider grouping rare categories (below ~7 occurrences) into an 'Other' bucket to reduce noise.

anthropic:default · confidence high

Out[24]:

saturn.columns["complaint_type"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	45
top_value	Noise - Residential
top_rate	0.393
cardinality	45
entropy	2.935
entropy_ratio	0.5345

Fig 11.

Top values for complaint_type.

Top values for complaint_type (20 unique shown, of 45 total).
value	count	share
Noise - Residential	393	39.3%
Illegal Parking	197	19.7%
Noise - Commercial	148	14.8%
Blocked Driveway	55	5.5%
HEAT/HOT WATER	49	4.9%
Noise - Street/Sidewalk	44	4.4%
Noise - Vehicle	17	1.7%
UNSANITARY CONDITION	12	1.2%
Street Condition	7	0.7%
Non-Emergency Police Matter	7	0.7%
Smoking or Vaping	5	0.5%
Abandoned Vehicle	5	0.5%
Animal-Abuse	5	0.5%
Rodent	4	0.4%
Encampment	4	0.4%
Taxi Complaint	3	0.3%
Dirty Condition	3	0.3%
PAINT/PLASTER	3	0.3%
GENERAL	3	0.3%
Drinking	2	0.2%

descriptor categorical label

This column contains descriptive sub-type labels for service requests or complaints — likely from a system such as NYC 311 — further specifying the nature of each complaint beyond a top-level category. 'Loud Music/Party' dominates with 42.7% of all 1,000 records, creating notable class imbalance; the top two values alone account for 54.1% of rows. With 71 unique values, entropy ratio of 0.576, and zero nulls, the column is moderately concentrated but still covers a meaningful range of complaint types including noise, parking, and building issues.

Treatment: One-hot encode for modelling, but consider grouping rare categories (below ~1% frequency) into an 'Other' bucket to manage the 71-level cardinality.

anthropic:default · confidence high

Out[27]:

saturn.columns["descriptor"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	71
top_value	Loud Music/Party
top_rate	0.427
cardinality	71
entropy	3.54
entropy_ratio	0.5756

Fig 12.

Top values for descriptor.

Top values for descriptor (20 unique shown, of 71 total).
value	count	share
Loud Music/Party	427	42.7%
Banging/Pounding	114	11.4%
Blocked Hydrant	79	7.9%
No Access	45	4.5%
Posted Parking Sign Violation	42	4.2%
Loud Talking	36	3.6%
ENTIRE BUILDING	35	3.5%
Blocked Sidewalk	21	2.1%
Commercial Overnight Parking	20	2.0%
APARTMENT ONLY	14	1.4%
Car/Truck Music	13	1.3%
Double Parked Blocking Traffic	13	1.3%
Loud Television	10	1.0%
Partial Access	10	1.0%
PESTS	10	1.0%
Blocked Crosswalk	9	0.9%
Pothole	6	0.6%
Allowed in Smoke Free Area	5	0.5%
With License Plate	5	0.5%
No Shelter	5	0.5%

location_type categorical label

This column encodes the type of location where an incident or event occurred, with 20 distinct values across 1,000 rows. The dominant category is 'Residential Building/House' with 396 occurrences (40.2% of the 986 non-null rows, 39.6% of all rows), followed by 'Street/Sidewalk' at 324. A significant data quality issue is immediately apparent: the same real-world concept appears under multiple inconsistent labels. 'Residential Building/House' (396), 'RESIDENTIAL BUILDING' (74), and 'Residential Building' (7) are clearly duplicates, as are 'Street/Sidewalk' (324), 'Street' (8), and 'Sidewalk' (5). True cardinality is therefore substantially lower than 20, and the frequencies of the affected categories are understated.

Treatment: Normalize case and consolidate synonymous labels (e.g. merge 'RESIDENTIAL BUILDING', 'Residential Building', 'Residential Building/House') before encoding as a categorical feature.

anthropic:default · confidence high
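
A minimal sketch of the consolidation described in the treatment. The `CANONICAL` mapping is a hypothetical choice made for illustration; which labels truly denote the same concept is a judgment call for the analyst:

```python
from collections import Counter

# Hypothetical mapping from a casefolded raw label to a canonical label.
CANONICAL = {
    "residential building/house": "Residential Building/House",
    "residential building": "Residential Building/House",
    "street/sidewalk": "Street/Sidewalk",
    "street": "Street/Sidewalk",
    "sidewalk": "Street/Sidewalk",
}


def canonical_label(raw):
    """Collapse whitespace and case, then map known synonyms."""
    key = " ".join(raw.split()).casefold()
    return CANONICAL.get(key, raw)


# Counts from the location_type top-values table.
raw_counts = {
    "Residential Building/House": 396,
    "Street/Sidewalk": 324,
    "RESIDENTIAL BUILDING": 74,
    "Street": 8,
    "Residential Building": 7,
    "Sidewalk": 5,
}
merged = Counter()
for label, count in raw_counts.items():
    merged[canonical_label(label)] += count
```

Under this mapping the residential labels merge to 477 rows and the street labels to 337.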

Out[30]:

saturn.columns["location_type"].stats

stat	value
n	1,000
nulls	14 (1.4%)
unique	20
top_value	Residential Building/House
top_rate	0.4016
cardinality	20
entropy	2.237
entropy_ratio	0.5177

Fig 13.

Top values for location_type.

Top values for location_type (20 unique shown, of 20 total).
value	count	share
Residential Building/House	396	39.6%
Street/Sidewalk	324	32.4%
Club/Bar/Restaurant	91	9.1%
RESIDENTIAL BUILDING	74	7.4%
Store/Commercial	60	6.0%
Street	8	0.8%
Residential Building	7	0.7%
Sidewalk	5	0.5%
3+ Family Apartment Building	3	0.3%
House and Store	3	0.3%
3+ Family Apt. Building	2	0.2%
Business	2	0.2%
Subway	2	0.2%
Other	2	0.2%
Park/Playground	2	0.2%
1-2 Family Mixed Use Building	1	0.1%
Restaurant/Bar/Deli/Bakery	1	0.1%
Bridge	1	0.1%
Taxi	1	0.1%
1-2 Family Dwelling	1	0.1%

incident_zip categorical feature

This column contains US ZIP codes associated with incidents, all values appearing to be New York City ZIP codes (10xxx and 11xxx series; the top values are consistent with Manhattan, the Bronx, Brooklyn, and Queens). With 146 unique values across 1,000 rows and a high entropy ratio of 0.931, the distribution is remarkably spread: the most frequent ZIP '10011' appears only 42 times (4.2%), indicating incidents are distributed broadly across many neighbourhoods rather than concentrated in a few. Null rate is negligible at 0.3%.

Treatment: Encode as geographic feature; consider grouping by borough or joining to a ZIP-code reference table for lat/lon or demographic enrichment.

anthropic:default · confidence high
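
One rough way to group ZIP codes by borough is a three-digit prefix lookup. The mapping below is an approximation written for illustration, not an authoritative reference; a few ZIP codes straddle borough or county lines, so a proper ZIP reference table is preferable for real work. `zip_to_borough` is a hypothetical helper:

```python
# Approximate three-digit ZIP prefixes per borough (an assumption, see above).
ZIP3_TO_BOROUGH = {
    "100": "MANHATTAN",
    "101": "MANHATTAN",
    "102": "MANHATTAN",
    "103": "STATEN ISLAND",
    "104": "BRONX",
    "112": "BROOKLYN",
    "110": "QUEENS",
    "111": "QUEENS",
    "113": "QUEENS",
    "114": "QUEENS",
    "116": "QUEENS",
}


def zip_to_borough(zip_code):
    """Return an approximate borough for a ZIP code, or None if unknown."""
    if not zip_code:
        return None
    return ZIP3_TO_BOROUGH.get(str(zip_code).strip()[:3])


# A few of the most frequent incident_zip values in this sample.
boroughs = {z: zip_to_borough(z) for z in ["10011", "11421", "10463", "11206"]}
```

Since the dataset already has a borough column, this lookup is mainly useful as a consistency check between incident_zip and borough.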

Out[33]:

saturn.columns["incident_zip"].stats

stat	value
n	1,000
nulls	3 (0.3%)
unique	146
top_value	10011
top_rate	0.04213
cardinality	146
entropy	6.696
entropy_ratio	0.9313

Fig 14.

Top values for incident_zip.

Top values for incident_zip (20 unique shown, of 146 total).
value	count	share
10011	42	4.2%
11421	35	3.5%
11385	29	2.9%
10463	22	2.2%
10031	22	2.2%
11368	21	2.1%
10456	20	2.0%
10009	19	1.9%
10462	17	1.7%
10002	16	1.6%
11206	15	1.5%
11212	15	1.5%
11226	14	1.4%
11375	14	1.4%
10012	14	1.4%
11373	14	1.4%
10461	13	1.3%
10452	12	1.2%
10468	12	1.2%
10034	12	1.2%

incident_address categorical feature

This column contains free-text street addresses of incident locations, likely from a New York City incident or complaints dataset given recognizable street names (Lenox Avenue, Avenue of the Americas, Bruckner Boulevard). With 776 unique values across 1,000 rows and an entropy ratio of 0.97, the distribution is highly dispersed — a classic long-tail pattern. Notably, '126 WEST 13 STREET' appears 31 times (3.1% of rows), a disproportionate spike that may indicate a shelter, institution, or high-incident venue worth investigating. Inconsistent spacing in values like '60 EAST 93 STREET' suggests formatting irregularities that will need normalization.

Treatment: Normalize whitespace and casing, then geocode or extract street/borough components for spatial modelling.

anthropic:default · confidence high
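
The whitespace and casing step of the treatment can be sketched in a few lines. `normalize_address` is a hypothetical helper; geocoding is out of scope for this sketch:

```python
def normalize_address(raw):
    """Collapse runs of whitespace, trim the ends, and upper-case the result."""
    if raw is None:
        return None
    return " ".join(raw.split()).upper()


# Irregular spacing and mixed case reduce to one canonical form.
cleaned = normalize_address("  60 east   93  street ")
```

Applying this before counting would merge address variants that differ only in spacing or case.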

Out[36]:

saturn.columns["incident_address"].stats

stat	value
n	1,000
nulls	5 (0.5%)
unique	776
top_value	126 WEST 13 STREET
top_rate	0.03116
cardinality	776
entropy	9.321
entropy_ratio	0.9709
alert: long_tail	664 singleton categories

Fig 15.

Top values for incident_address.

Top values for incident_address (20 unique shown, of 776 total).
value	count	share
126 WEST 13 STREET	31	3.1%
60 EAST 93 STREET	13	1.3%
71 LENOX AVENUE	8	0.8%
105 MACDOUGAL STREET	8	0.8%
1465 WASHINGTON AVENUE	8	0.8%
2918 BRUCKNER BOULEVARD	7	0.7%
190 AVENUE OF THE AMERICAS	5	0.5%
235 COURT STREET	5	0.5%
74-03 85 DRIVE	5	0.5%
85 TOMPKINS AVENUE	5	0.5%
1365 5 AVENUE	4	0.4%
112-17 NORTHERN BOULEVARD	4	0.4%
182 NAGLE AVENUE	4	0.4%
108-26 159 STREET	4	0.4%
402 ONDERDONK AVENUE	4	0.4%
465 SENECA AVENUE	3	0.3%
2140 MATTHEWS AVENUE	3	0.3%
28-10 JACKSON AVENUE	3	0.3%
62-11 108 STREET	3	0.3%
499 MYRTLE AVENUE	3	0.3%

street_name categorical feature

This column contains street names, almost certainly from a New York City dataset given entries like 'WEST 13 STREET', 'LENOX AVENUE', 'BRUCKNER BOULEVARD', and 'JAMAICA AVENUE'. With 569 unique values across 1,000 rows and an entropy ratio of 0.955, the distribution is highly spread — the top value 'WEST 13 STREET' appears only 31 times (3.1% of rows), confirming the long-tail alert. The near-zero null rate (0.5%) is clean, but the high cardinality and long tail mean most street names are rare, which limits direct one-hot encoding utility.

Treatment: Frequency-encode or embed as a categorical feature; avoid one-hot encoding due to 569-level cardinality and long-tail distribution.

anthropic:default · confidence high
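
Frequency encoding replaces each category with its relative frequency, which yields one numeric column instead of 569 indicator columns. A minimal sketch, assuming frequencies are taken over non-null values and nulls are encoded as 0.0 (both are choices made here, not prescribed by the report):

```python
from collections import Counter


def frequency_encode(values):
    """Map each value to its share of the non-null values; nulls map to 0.0."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    return [counts[v] / total if v is not None else 0.0 for v in values]


# Toy column using street names from the table; the counts here are illustrative.
streets = ["WEST 13 STREET", "WEST 13 STREET", "BROADWAY", None, "JAMAICA AVENUE"]
encoded = frequency_encode(streets)
```

In a modelling pipeline the frequencies should be learned on the training split only and then applied to held-out data.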

Out[39]:

saturn.columns["street_name"].stats

stat	value
n	1,000
nulls	5 (0.5%)
unique	569
top_value	WEST 13 STREET
top_rate	0.03116
cardinality	569
entropy	8.742
entropy_ratio	0.9551
alert: long_tail	371 singleton categories

Fig 16.

Top values for street_name.

Top values for street_name (20 unique shown, of 569 total).
value	count	share
WEST 13 STREET	31	3.1%
EAST 93 STREET	13	1.3%
JAMAICA AVENUE	11	1.1%
WASHINGTON AVENUE	11	1.1%
LENOX AVENUE	10	1.0%
MACDOUGAL STREET	10	1.0%
BROADWAY	9	0.9%
BRUCKNER BOULEVARD	8	0.8%
BROOKLYN AVENUE	7	0.7%
NORTHERN BOULEVARD	7	0.7%
76 STREET	7	0.7%
COURT STREET	6	0.6%
EAST 74 STREET	6	0.6%
GRAND STREET	5	0.5%
BUSHWICK AVENUE	5	0.5%
EAST 3 STREET	5	0.5%
SENECA AVENUE	5	0.5%
111 AVENUE	5	0.5%
AVENUE OF THE AMERICAS	5	0.5%
AMSTERDAM AVENUE	5	0.5%

cross_street_1 categorical feature

This column captures the first cross street in a NYC street-address intersection, most likely from a traffic incident, 311, or similar geospatial events dataset. With 511 unique values across 1,000 rows and an entropy ratio of 0.953, the distribution is nearly flat, a strong long-tail signal: most street names appear only once or twice. The top value 'AVENUE OF THE AMERICAS' appears just 35 times (3.8% of non-null rows, 3.5% of all rows), and the 8% null rate suggests some records lack cross-street data entirely.

Treatment: Normalize street name strings (abbreviations, casing), then use as a categorical geographic feature or join to a street reference table; high cardinality warrants frequency encoding or embedding rather than one-hot encoding.

anthropic:default · confidence high

Out[42]:

saturn.columns["cross_street_1"].stats

stat	value
n	1,000
nulls	80 (8.0%)
unique	511
top_value	AVENUE OF THE AMERICAS
top_rate	0.03804
cardinality	511
entropy	8.574
entropy_ratio	0.953
alert: long_tail	324 singleton categories

Fig 17.

Top values for cross_street_1.

Top values for cross_street_1 (20 unique shown, of 511 total).
value	count	share
AVENUE OF THE AMERICAS	35	3.5%
BLEECKER STREET	11	1.1%
AMSTERDAM AVENUE	11	1.1%
WEST 113 STREET	8	0.8%
HIMROD STREET	8	0.8%
112 STREET	8	0.8%
ST PAULS PLACE	8	0.8%
DEXTER COURT	8	0.8%
EAST TREMONT AVENUE	7	0.7%
BROADWAY	7	0.7%
BEND	6	0.6%
107 AVENUE	6	0.6%
80 STREET	6	0.6%
AVENUE B	6	0.6%
3 AVENUE	6	0.6%
EAST 182 STREET	5	0.5%
5 AVENUE	5	0.5%
4 AVENUE	5	0.5%
VANDAM STREET	5	0.5%
DEAD END	5	0.5%

cross_street_2 categorical feature

This column contains the secondary cross-street name associated with a location record, likely from a NYC incident or address dataset. With 504 unique values across 1,000 rows and an entropy ratio of 0.948, the distribution is nearly flat — a strong long-tail signal where most street names appear very rarely. The top value '7 AVENUE' appears only 36 times (3.9% of non-null rows), and 'DEAD END' appearing 10 times is a notable data-quality flag indicating unresolved or terminus locations.

Treatment: Standardize street name abbreviations, flag 'DEAD END' as a sentinel value, and consider encoding via frequency bucketing or embedding rather than one-hot due to high cardinality.

anthropic:default · confidence high
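
Flagging sentinel values can be sketched as splitting each entry into a street name and a boolean flag. Treating 'BEND' as a sentinel alongside 'DEAD END' is an assumption made here because both appear among the top values; `split_sentinel` is a hypothetical helper:

```python
# Tokens treated as non-street placeholders (an assumption, see above).
SENTINEL_TOKENS = {"DEAD END", "BEND"}


def split_sentinel(raw):
    """Return (street_name_or_None, is_sentinel) for one raw value."""
    if raw is None:
        return None, False
    token = " ".join(raw.split()).upper()
    if token in SENTINEL_TOKENS:
        return None, True
    return token, False


street, flagged = split_sentinel("DEAD END")
plain_street, plain_flagged = split_sentinel("7 AVENUE")
```

Keeping the flag as its own feature preserves the information that a location ends at a terminus without polluting the street-name vocabulary.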

Out[45]:

saturn.columns["cross_street_2"].stats

stat	value
n	1,000
nulls	79 (7.9%)
unique	504
top_value	7 AVENUE
top_rate	0.03909
cardinality	504
entropy	8.511
entropy_ratio	0.9481
alert: long_tail	321 singleton categories

Fig 18.

Top values for cross_street_2.

Top values for cross_street_2 (20 unique shown, of 504 total).
value	count	share
7 AVENUE	36	3.6%
BROADWAY	19	1.9%
MINETTA LANE	10	1.0%
DEAD END	10	1.0%
109 AVENUE	9	0.9%
EAST 171 STREET	9	0.9%
WEST 114 STREET	8	0.8%
HARMAN STREET	8	0.8%
75 STREET	8	0.8%
EDISON AVENUE	7	0.7%
112 PLACE	7	0.7%
BEND	7	0.7%
AVENUE D	6	0.6%
DITMAS AVENUE	6	0.6%
EAST 174 STREET	6	0.6%
10 AVENUE	6	0.6%
3 AVENUE	6	0.6%
1 AVENUE	6	0.6%
EAST 170 STREET	6	0.6%
BALTIC STREET	6	0.6%

intersection_street_1 categorical feature

This column captures the name of the first cross-street at an intersection, consistent with NYC street-incident or infrastructure data. With 506 unique values across 1,000 rows (entropy ratio 0.95), the distribution is nearly flat — a long-tail alert confirms that the vast majority of street names appear only once or twice, while even the top value ('AVENUE OF THE AMERICAS') accounts for just 3.8% of non-null records. An 8.5% null rate suggests intersections are not always fully recorded, which may indicate missing location data rather than true absence of a street.

Treatment: Standardize street name strings (abbreviations, spacing), then encode as a high-cardinality categorical using target encoding or embedding before modelling; impute or flag nulls separately.

anthropic:default · confidence high

Out[48]:

saturn.columns["intersection_street_1"].stats

stat	value
n	1,000
nulls	85 (8.5%)
unique	506
top_value	AVENUE OF THE AMERICAS
top_rate	0.03825
cardinality	506
entropy	8.558
entropy_ratio	0.9527
alert: long_tail	319 singleton categories

Fig 19.

Top values for intersection_street_1.

Top values for intersection_street_1 (20 unique shown, of 506 total).
value	count	share
AVENUE OF THE AMERICAS	35	3.5%
BLEECKER STREET	11	1.1%
AMSTERDAM AVENUE	11	1.1%
WEST 113 STREET	8	0.8%
HIMROD STREET	8	0.8%
112 STREET	8	0.8%
ST PAULS PLACE	8	0.8%
DEXTER COURT	8	0.8%
EAST TREMONT AVENUE	7	0.7%
BROADWAY	7	0.7%
BEND	6	0.6%
5 AVENUE	6	0.6%
107 AVENUE	6	0.6%
80 STREET	6	0.6%
AVENUE B	6	0.6%
3 AVENUE	6	0.6%
EAST 182 STREET	5	0.5%
4 AVENUE	5	0.5%
VANDAM STREET	5	0.5%
DEAD END	5	0.5%

intersection_street_2 categorical feature

This column captures the secondary (cross) street name at an intersection, likely from a New York City incident or traffic dataset. With 499 unique values across 1,000 rows and an entropy ratio of 0.948, the distribution is nearly flat: a strong long-tail pattern where '7 AVENUE' is the modal value at only 3.9% of non-null rows (36 occurrences). The presence of 'DEAD END' (10 occurrences) as a top value is notable, indicating it is used as a literal street descriptor rather than a true street name, which may require cleaning. An 8.4% null rate suggests some incidents occurred at non-intersection locations (e.g., mid-block).

Treatment: Standardize 'DEAD END' and similar non-street tokens, impute or flag nulls, then encode as a categorical feature or use for geospatial joining.

anthropic:default · confidence high

Out[51]:

saturn.columns["intersection_street_2"].stats

stat	value
n	1,000
nulls	84 (8.4%)
unique	499
top_value	7 AVENUE
top_rate	0.0393
cardinality	499
entropy	8.495
entropy_ratio	0.9478
alert: long_tail	316 singleton categories

Fig 20.

Top values for intersection_street_2.

Top values for intersection_street_2 (20 unique shown, of 499 total).
value	count	share
7 AVENUE	36	3.6%
BROADWAY	19	1.9%
MINETTA LANE	10	1.0%
DEAD END	10	1.0%
109 AVENUE	9	0.9%
EAST 171 STREET	9	0.9%
WEST 114 STREET	8	0.8%
HARMAN STREET	8	0.8%
75 STREET	8	0.8%
EDISON AVENUE	7	0.7%
112 PLACE	7	0.7%
BEND	7	0.7%
AVENUE D	6	0.6%
DITMAS AVENUE	6	0.6%
EAST 174 STREET	6	0.6%
10 AVENUE	6	0.6%
3 AVENUE	6	0.6%
1 AVENUE	6	0.6%
EAST 170 STREET	6	0.6%
BALTIC STREET	6	0.6%

address_type categorical label

This column classifies the type of geographic address entry, with four categories: ADDRESS, INTERSECTION, BLOCKFACE, and PLACE. It is severely imbalanced: 'ADDRESS' dominates at 97.2% of valid records (970 of 998 non-null rows), while the remaining three types collectively account for only 28 records. The entropy ratio of 0.108 confirms near-minimal informational diversity, meaning this field will contribute little discriminative power in most models without special handling.

Treatment: One-hot encode with caution; minority classes (INTERSECTION=19, BLOCKFACE=7, PLACE=2) may need oversampling or collapsing into an 'OTHER' bucket before modelling.

anthropic:default · confidence high

Out[54]:

saturn.columns["address_type"].stats

stat	value
n	1,000
nulls	2 (0.2%)
unique	4
top_value	ADDRESS
top_rate	0.9719
cardinality	4
entropy	0.2169
entropy_ratio	0.1084
alert: imbalance	top value is 97.2% of rows
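
The top_rate in this stats table (0.9719) and the share in the data table for the same value (97.0%) differ because they appear to use different denominators: non-null rows versus all rows. That interpretation is inferred from the numbers, not stated by Saturn. A short sketch with hypothetical helper names:

```python
def top_rate(top_count, n, nulls):
    """Share of the top value among non-null rows."""
    non_null = n - nulls
    return top_count / non_null if non_null else 0.0


def share_of_rows(count, n):
    """Share of a value among all rows, nulls included."""
    return count / n if n else 0.0


# address_type: 970 'ADDRESS' rows, 1,000 rows in total, 2 nulls.
rate = top_rate(970, n=1000, nulls=2)
share = share_of_rows(970, n=1000)
```

The same distinction explains the gap between top_rate and the table share in other columns with nulls, such as cross_street_1 and landmark.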

Fig 21.

Top values for address_type.

Top values for address_type (4 unique shown, of 4 total).
value	count	share
ADDRESS	970	97.0%
INTERSECTION	19	1.9%
BLOCKFACE	7	0.7%
PLACE	2	0.2%

city categorical feature

This column contains NYC neighborhood and borough city names, almost certainly a mailing-address city field from a dataset heavily concentrated in New York City. The top three values alone, BROOKLYN (250), NEW YORK (249), and BRONX (198), account for roughly 70% of the 1,000 rows. Of the remaining 33 values, those shown in the table are mostly Queens neighborhoods used as postal city names (e.g., WOODHAVEN, JAMAICA, RIDGEWOOD), alongside STATEN ISLAND and QUEENS itself. The top value covers only a quarter of non-null rows (top_rate 0.255), and the entropy_ratio of 0.635 across 36 unique values suggests moderate spread beyond the top cluster; the 2.1% null rate is minor.

Treatment: One-hot encode or target-encode for modelling; consider grouping sub-neighbourhood postal cities (e.g., WOODHAVEN, JAMAICA) into borough-level categories to reduce sparsity.

anthropic:default · confidence high
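
The borough-level grouping suggested in the treatment could be sketched as a lookup. The mapping below is hypothetical and deliberately partial: it covers only some of the city values shown in the table, and unmapped values are flagged instead of guessed.

```python
# Hypothetical, partial lookup built from city values shown in the table;
# extend it for the remaining unique values before real use.
CITY_TO_BOROUGH = {
    "NEW YORK": "MANHATTAN",
    "BROOKLYN": "BROOKLYN",
    "BRONX": "BRONX",
    "STATEN ISLAND": "STATEN ISLAND",
    "QUEENS": "QUEENS",
    "WOODHAVEN": "QUEENS",
    "JAMAICA": "QUEENS",
    "RIDGEWOOD": "QUEENS",
    "CORONA": "QUEENS",
    "ASTORIA": "QUEENS",
}

def city_to_borough(city):
    """Map a postal city name to a borough; None stays None, unknowns are flagged."""
    if city is None:
        return None
    return CITY_TO_BOROUGH.get(city.strip().upper(), "UNMAPPED")

print([city_to_borough(c) for c in ["Woodhaven", "NEW YORK", None, "FLORAL PARK"]])
```

Counting the "UNMAPPED" results is a quick way to see how much of the long tail the lookup still misses.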

Out[57]:

saturn.columns["city"].stats

stat	value
n	1,000
nulls	21 (2.1%)
unique	36
top_value	BROOKLYN
top_rate	0.2554
cardinality	36
entropy	3.282
entropy_ratio	0.6349

Fig 22.

Top values for city.

Top values for city (20 unique shown, of 36 total).
value	count	share
BROOKLYN	250	25.0%
NEW YORK	249	24.9%
BRONX	198	19.8%
WOODHAVEN	35	3.5%
JAMAICA	29	2.9%
RIDGEWOOD	29	2.9%
CORONA	20	2.0%
ASTORIA	14	1.4%
ELMHURST	14	1.4%
FOREST HILLS	11	1.1%
STATEN ISLAND	11	1.1%
SOUTH OZONE PARK	11	1.1%
OZONE PARK	9	0.9%
SOUTH RICHMOND HILL	9	0.9%
REGO PARK	9	0.9%
JACKSON HEIGHTS	9	0.9%
RICHMOND HILL	7	0.7%
COLLEGE POINT	6	0.6%
QUEENS	6	0.6%
WOODSIDE	6	0.6%

landmark categorical label

This column contains street names, likely the street or nearest geographic reference point recorded for each request. With 515 unique values across 1,000 rows and a 10.5% null rate, coverage is incomplete. The distribution is heavily long-tailed: the top value 'WEST 13 STREET' appears only 31 times (3.46% of non-null rows), 335 values occur exactly once, and the entropy ratio of 0.955 is close to the maximum. Most landmark names are therefore rare or unique, which makes this column unsuitable as a grouping key without significant consolidation.

Treatment: Impute or flag nulls (10.5%), then either group by street type/prefix for coarser features or embed as a high-cardinality categorical; avoid one-hot encoding given 515 levels.

anthropic:default · confidence high
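
One way to derive the coarser street-type feature mentioned in the treatment is to look for a known type word at the end or start of the name. This is a minimal sketch; the type list and the function name are illustrative assumptions, not part of the profile.

```python
# Illustrative list of street-type words; extend as needed.
STREET_TYPES = {"STREET", "AVENUE", "BOULEVARD", "ROAD", "PLACE", "DRIVE", "PARKWAY"}

def street_type(name):
    """Return a coarse street-type label for a landmark string."""
    if name is None or not name.strip():
        return "MISSING"
    tokens = name.upper().split()
    # Check the last token first ("WEST 13 STREET"), then the first
    # ("AVENUE OF THE AMERICAS").
    for token in (tokens[-1], tokens[0]):
        if token in STREET_TYPES:
            return token
    return "OTHER"

for name in ["WEST 13 STREET", "AVENUE OF THE AMERICAS", "BROADWAY", None]:
    print(name, "->", street_type(name))
```

Names with no type word, such as BROADWAY, fall into "OTHER", and nulls get their own "MISSING" label so the 10.5% of empty rows stay visible.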

Out[60]:

saturn.columns["landmark"].stats

stat	value
n	1,000
nulls	105 (10.5%)
unique	515
top_value	WEST 13 STREET
top_rate	0.03464
cardinality	515
entropy	8.603
entropy_ratio	0.955
alert: long_tail	335 singleton categories

Fig 23.

Top values for landmark.

Top values for landmark (20 unique shown, of 515 total).
value	count	share
WEST 13 STREET	31	3.1%
JAMAICA AVENUE	11	1.1%
WASHINGTON AVENUE	11	1.1%
LENOX AVENUE	10	1.0%
MAC DOUGAL STREET	10	1.0%
BRUCKNER BOULEVARD	8	0.8%
BROADWAY	8	0.8%
BROOKLYN AVENUE	7	0.7%
NORTHERN BOULEVARD	7	0.7%
76 STREET	7	0.7%
COURT STREET	6	0.6%
EAST 74 STREET	6	0.6%
GRAND STREET	5	0.5%
BUSHWICK AVENUE	5	0.5%
EAST 3 STREET	5	0.5%
SENECA AVENUE	5	0.5%
111 AVENUE	5	0.5%
AVENUE OF THE AMERICAS	5	0.5%
MYRTLE AVENUE	5	0.5%
90 STREET	5	0.5%

status categorical label

This column is a workflow/lifecycle status field with exactly 3 states: Closed, In Progress, and Open. 'Closed' dominates at 61% (610/1000), 'In Progress' follows at 30.5% (305/1000), and 'Open' is the rarest at only 8.5% (85/1000). Most requests in the sample are therefore resolved, although 39% were still unresolved when the sample was taken. There are no nulls across the 1,000 rows.

Treatment: One-hot encode or ordinal-encode (Open→In Progress→Closed) depending on whether a natural progression is to be modelled; consider class imbalance if used as a target.

anthropic:default · confidence high
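
If the Open→In Progress→Closed progression is modelled, the ordinal encoding could look like the sketch below. The integer codes are an arbitrary choice made for illustration.

```python
# Assumed lifecycle order, taken from the treatment note above.
STATUS_ORDER = {"Open": 0, "In Progress": 1, "Closed": 2}

def encode_status(values):
    """Map status strings to ordinal codes; an unseen status raises KeyError."""
    return [STATUS_ORDER[v] for v in values]

print(encode_status(["Closed", "Open", "In Progress"]))
```

Raising on an unseen status is deliberate here, so that a new lifecycle state is noticed instead of being silently mapped to a default.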

Out[63]:

saturn.columns["status"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	3
top_value	Closed
top_rate	0.61
cardinality	3
entropy	1.26
entropy_ratio	0.7948

Fig 24.

Top values for status.

Top values for status (3 unique shown, of 3 total).
value	count	share
Closed	610	61.0%
In Progress	305	30.5%
Open	85	8.5%

community_board categorical label

This column represents NYC Community Board designations, combining a numeric district ID with a borough name (e.g., '02 MANHATTAN'). With 65 unique values across 1,000 rows and zero nulls, coverage is complete. The distribution is notably flat — entropy ratio of 0.93 indicates near-uniform spread, with the most frequent value ('02 MANHATTAN') appearing only 5.6% of the time, suggesting no single board dominates the dataset. Manhattan and Queens boards appear disproportionately among the top 10, which may reflect a geographic sampling bias worth investigating.

Treatment: One-hot encode or target-encode for modelling; consider grouping by the borough suffix (the text after the district number) to reduce cardinality from 65 to roughly one level per borough.

anthropic:default · confidence high
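
Grouping boards by borough only requires splitting each value at the first space. The sketch below assumes every value follows the 'NN BOROUGH' pattern seen in the table; any value that does not would need separate handling. The counts are copied from the first four table rows.

```python
from collections import Counter

def split_community_board(value):
    """Split '02 MANHATTAN' into ('02', 'MANHATTAN')."""
    district, _, borough = value.strip().partition(" ")
    return district, borough

# Counts copied from the top four rows of the community_board table.
sample_counts = {"02 MANHATTAN": 56, "09 QUEENS": 48, "03 MANHATTAN": 38, "05 QUEENS": 38}

by_borough = Counter()
for board, count in sample_counts.items():
    by_borough[split_community_board(board)[1]] += count
print(dict(by_borough))
```

Splitting at the first space keeps multi-word boroughs such as STATEN ISLAND intact.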

Out[66]:

saturn.columns["community_board"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	65
top_value	02 MANHATTAN
top_rate	0.056
cardinality	65
entropy	5.61
entropy_ratio	0.9316

Fig 25.

Top values for community_board.

Top values for community_board (20 unique shown, of 65 total).
value	count	share
02 MANHATTAN	56	5.6%
09 QUEENS	48	4.8%
03 MANHATTAN	38	3.8%
05 QUEENS	38	3.8%
12 QUEENS	30	3.0%
03 BRONX	29	2.9%
12 MANHATTAN	29	2.9%
10 MANHATTAN	28	2.8%
08 BRONX	28	2.8%
09 MANHATTAN	28	2.8%
17 BROOKLYN	27	2.7%
10 QUEENS	27	2.7%
03 QUEENS	25	2.5%
01 BROOKLYN	24	2.4%
06 QUEENS	24	2.4%
11 BROOKLYN	23	2.3%
04 BROOKLYN	22	2.2%
09 BRONX	21	2.1%
04 QUEENS	20	2.0%
10 BRONX	20	2.0%

council_district categorical label

This column represents a council district code, a zero-padded numeric string (e.g., '03', '32') that assigns each record to a geographic administrative unit. There are 51 distinct values across 1,000 rows, which matches the number of New York City Council districts, and the entropy ratio of 0.954 shows a flat distribution in which no single district dominates; the most frequent value '03' appears only 56 times (5.65% of non-null rows). This suggests broad geographic coverage rather than concentration in a few districts. The null rate is negligible at 0.9%.

Treatment: Use as a categorical grouping variable; one-hot or target-encode for modelling, or retain as-is for geographic aggregation and joins.

anthropic:default · confidence high

Out[69]:

saturn.columns["council_district"].stats

stat	value
n	1,000
nulls	9 (0.9%)
unique	51
top_value	03
top_rate	0.05651
cardinality	51
entropy	5.412
entropy_ratio	0.9541

Fig 26.

Top values for council_district.

Top values for council_district (20 unique shown, of 51 total).
value	count	share
03	56	5.6%
32	48	4.8%
02	39	3.9%
34	38	3.8%
13	36	3.6%
10	36	3.6%
07	33	3.3%
16	32	3.2%
09	32	3.2%
28	30	3.0%
01	28	2.8%
11	26	2.6%
30	25	2.5%
21	25	2.5%
17	24	2.4%
14	24	2.4%
47	24	2.4%
41	23	2.3%
35	22	2.2%
18	22	2.2%

police_precinct categorical label

This column represents the police precinct assignment for each record, with 76 distinct precinct labels across 1,000 rows and zero nulls. The distribution is remarkably flat: the most common value, 'Precinct 6', appears only 50 times (5% of rows), and entropy ratio is 0.936 — nearly as uniform as a perfectly even distribution across all 76 precincts. No dominant precinct stands out, suggesting either a geographically broad dataset or deliberate sampling across jurisdictions.

Treatment: One-hot encode or target-encode for modelling; high cardinality (76 levels) warrants regularised encoding rather than naive dummies.

anthropic:default · confidence high
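
A regularised encoding for a 76-level column could be a smoothed target encoding, which shrinks each precinct's mean target toward the global mean in proportion to how few rows it has. This is a minimal sketch; the binary target, the toy rows, and the smoothing value are all hypothetical.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Shrink each category's mean target toward the global mean.

    encoding = (sum_of_targets + smoothing * global_mean) / (n + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }

# Toy data: 1 = request closed, 0 = not closed (hypothetical labels).
precincts = ["Precinct 6"] * 4 + ["Precinct 32"]
closed = [1, 1, 0, 1, 0]
print(smoothed_target_encoding(precincts, closed, smoothing=2.0))
```

In practice the encoding should be fitted on training folds only, since computing it on the full data leaks the target into the feature.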

Out[72]:

saturn.columns["police_precinct"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	76
top_value	Precinct 6
top_rate	0.05
cardinality	76
entropy	5.847
entropy_ratio	0.9358

Fig 27.

Top values for police_precinct.

Top values for police_precinct (20 unique shown, of 76 total).
value	count	share
Precinct 6	50	5.0%
Precinct 102	48	4.8%
Precinct 104	38	3.8%
Precinct 42	29	2.9%
Precinct 50	28	2.8%
Precinct 67	27	2.7%
Precinct 106	27	2.7%
Precinct 30	26	2.6%
Precinct 110	24	2.4%
Precinct 112	24	2.4%
Precinct 62	23	2.3%
Precinct 115	23	2.3%
Precinct 83	22	2.2%
Precinct 43	21	2.1%
Precinct 9	20	2.0%
Precinct 103	20	2.0%
Precinct 45	20	2.0%
Precinct 49	19	1.9%
Precinct 34	19	1.9%
Precinct 32	18	1.8%

bbl categorical foreign_key

This column contains New York City Borough-Block-Lot (BBL) codes, a standard 10-digit property identifier encoding borough (leading digit 1–5), block, and lot numbers. With 719 unique values across 1,000 rows and an entropy ratio of 0.97, it is near-unique, but the top value '1006080026' appears 31 times (3.3% of non-null rows), indicating repeated records tied to a single property — flagged as a long tail. The 5.5% null rate and clustering of repeats suggest this may be a foreign key referencing a property registry rather than a row-level unique identifier.

Treatment: Left-join on this BBL to enrich with property-level attributes from a NYC PLUTO or similar property reference table.

anthropic:default · confidence high
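
Before joining, it can help to validate and split each BBL into its parts. The sketch below relies on the standard layout of 1 borough digit, 5 block digits, and 4 lot digits; the function name and the returned dictionary shape are illustrative.

```python
# Standard NYC borough codes used as the first BBL digit.
BOROUGH_CODES = {
    "1": "MANHATTAN",
    "2": "BRONX",
    "3": "BROOKLYN",
    "4": "QUEENS",
    "5": "STATEN ISLAND",
}

def parse_bbl(bbl):
    """Split a 10-digit BBL string into borough, block, and lot; None stays None."""
    if bbl is None:
        return None
    text = str(bbl).strip()
    if len(text) != 10 or not text.isdigit() or text[0] not in BOROUGH_CODES:
        raise ValueError(f"not a valid BBL: {bbl!r}")
    return {
        "borough": BOROUGH_CODES[text[0]],
        "block": int(text[1:6]),
        "lot": int(text[6:]),
    }

print(parse_bbl("1006080026"))  # the top value in this column
```

Keeping the BBL itself as a string preserves the exact key for joins, while the parsed parts are convenient for sanity checks against the borough column.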

Out[75]:

saturn.columns["bbl"].stats

stat	value
n	1,000
nulls	55 (5.5%)
unique	719
top_value	1006080026
top_rate	0.0328
cardinality	719
entropy	9.189
entropy_ratio	0.9683
alert: long_tail	606 singleton categories

Fig 28.

Top values for bbl.

Top values for bbl (20 unique shown, of 719 total).
value	count	share
1006080026	31	3.1%
3045950215	13	1.3%
1018230033	8	0.8%
1005420048	8	0.8%
2029020036	8	0.8%
2054190122	7	0.7%
4017067501	7	0.7%
1005040011	5	0.5%
3003960004	5	0.5%
4088380065	5	0.5%
3017400001	5	0.5%
1016180001	4	0.4%
1022170047	4	0.4%
4101460051	4	0.4%
1021700112	4	0.4%
4034270030	4	0.4%
4034300001	3	0.3%
2043230014	3	0.3%
2039437501	3	0.3%
2024090096	3	0.3%

borough categorical label

This column represents the five New York City boroughs, functioning as a geographic label with complete coverage (null_rate 0.0) across all 1,000 rows. Four boroughs — Queens (274), Manhattan (258), Brooklyn (254), and Bronx (203) — are distributed with reasonable balance, but Staten Island is strikingly underrepresented at just 11 occurrences (~1.1%), compared to 20% expected under uniform distribution. The high entropy_ratio of 0.886 reflects the near-even spread among the four dominant boroughs, masking Staten Island's severe underrepresentation.

Treatment: One-hot encode or target-encode for modelling; note Staten Island's class imbalance (n=11) may require stratified sampling or grouping.

anthropic:default · confidence high

Out[78]:

saturn.columns["borough"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	5
top_value	QUEENS
top_rate	0.274
cardinality	5
entropy	2.057
entropy_ratio	0.8858

Fig 29.

Top values for borough.

Top values for borough (5 unique shown, of 5 total).
value	count	share
QUEENS	274	27.4%
MANHATTAN	258	25.8%
BROOKLYN	254	25.4%
BRONX	203	20.3%
STATEN ISLAND	11	1.1%

x_coordinate_state_plane categorical feature

This column contains X-coordinates in a State Plane coordinate system, stored as categorical strings rather than numerics — values like '984721' and '1004501' are typical State Plane Easting values in feet. With 763 unique values out of 1000 rows (entropy ratio 0.97) and a long-tail alert, the distribution is nearly unique per record, which is expected for spatial point coordinates. The top value '984721' appearing 31 times (3.1% of rows) is surprisingly frequent for a coordinate and may indicate a default, imputed, or snapped location worth investigating.

Treatment: Cast to numeric, pair with Y-coordinate for spatial analysis, and investigate the 31 rows sharing coordinate '984721' for potential data quality issues.

anthropic:default · confidence high

Out[81]:

saturn.columns["x_coordinate_state_plane"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	763
top_value	984721
top_rate	0.03122
cardinality	763
entropy	9.292
entropy_ratio	0.9704
alert: long_tail	643 singleton categories

Fig 30.

Top values for x_coordinate_state_plane.

Top values for x_coordinate_state_plane (20 unique shown, of 763 total).
value	count	share
984721	31	3.1%
1004501	13	1.3%
997870	8	0.8%
984032	8	0.8%
1010885	8	0.8%
1031969	7	0.7%
983238	5	0.5%
985877	5	0.5%
1020825	5	0.5%
999077	5	0.5%
998703	4	0.4%
1023829	4	0.4%
1005224	4	0.4%
1041463	4	0.4%
1008285	4	0.4%
1008323	3	0.3%
995971	3	0.3%
1022177	3	0.3%
1007992	3	0.3%
1001276	3	0.3%

y_coordinate_state_plane categorical feature

This column represents Y-coordinates in a State Plane coordinate system, likely a geographic reference for spatial data (e.g., NYC or similar municipal dataset). Despite being numeric in nature, it is stored as a categorical type, which is unexpected and likely a data pipeline issue. With 768 unique values out of 1000 rows (entropy ratio 0.97) and a long-tail alert, the distribution is highly dispersed, though the top value '207809' appears 31 times — a modest but notable cluster suggesting a frequently referenced location. The near-unique cardinality makes this unsuitable for direct use as a categorical feature.

Treatment: Cast to numeric, then use as a spatial coordinate feature or pair with x-coordinate for geospatial analysis.

anthropic:default · confidence high
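
Casting both State Plane columns and pairing them makes repeated points easy to find. This is a minimal sketch that assumes the values are integer strings, as in the tables for these two columns; the helper names and the toy rows are hypothetical.

```python
from collections import Counter

def to_int_or_none(value):
    """Cast a coordinate string to int; None and blank strings become None."""
    if value is None or str(value).strip() == "":
        return None
    return int(value)

def repeated_points(xs, ys, min_count=2):
    """Return ((x, y), count) pairs for points seen at least min_count times."""
    points = [(to_int_or_none(x), to_int_or_none(y)) for x, y in zip(xs, ys)]
    complete = [p for p in points if None not in p]  # drop rows missing either part
    return [(p, n) for p, n in Counter(complete).most_common() if n >= min_count]

# Toy rows using the top x and y values from the tables.
xs = ["984721", "984721", "1004501", None]
ys = ["207809", "207809", "180649", None]
print(repeated_points(xs, ys))
```

Running this on the full columns would show whether the 31 rows sharing x = 984721 also share y = 207809, which would point to a single location.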

Out[84]:

saturn.columns["y_coordinate_state_plane"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	768
top_value	207809
top_rate	0.03122
cardinality	768
entropy	9.304
entropy_ratio	0.9707
alert: long_tail	651 singleton categories

Fig 31.

Top values for y_coordinate_state_plane.

Top values for y_coordinate_state_plane (20 unique shown, of 768 total).
value	count	share
207809	31	3.1%
180649	13	1.3%
230926	8	0.8%
205111	8	0.8%
244212	8	0.8%
242900	7	0.7%
203929	5	0.5%
189205	5	0.5%
191901	5	0.5%
193110	5	0.5%
230311	4	0.4%
215511	4	0.4%
253267	4	0.4%
192497	4	0.4%
197077	4	0.4%
196526	3	0.3%
250862	3	0.3%
240061	3	0.3%
211978	3	0.3%
207649	3	0.3%

open_data_channel_type categorical feature

This column captures the channel through which a report or request was submitted, with four distinct values: ONLINE, MOBILE, PHONE, and UNKNOWN. ONLINE leads at 46.6% of records, followed by MOBILE (28.8%) and PHONE (23.6%), leaving only 10 records (1%) tagged as UNKNOWN. All three known channels are well represented, though not evenly: ONLINE carries roughly twice the volume of PHONE (entropy ratio 0.794). The UNKNOWN category, while small, may warrant imputation or flagging depending on downstream use.

Treatment: One-hot encode the 3 known channels (they have no natural order, so an ordinal mapping would impose one); consider grouping or flagging the 10 UNKNOWN values before modelling.

anthropic:default · confidence high
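
A one-hot encoding that keeps UNKNOWN as an explicit flag could be sketched as follows. The output column names are illustrative.

```python
KNOWN_CHANNELS = ("ONLINE", "MOBILE", "PHONE")

def encode_channel(value):
    """One-hot encode a channel; anything outside the known set sets channel_unknown."""
    row = {f"channel_{c.lower()}": int(value == c) for c in KNOWN_CHANNELS}
    row["channel_unknown"] = int(value not in KNOWN_CHANNELS)
    return row

print(encode_channel("MOBILE"))
print(encode_channel("UNKNOWN"))
```

With this layout a None or an unexpected channel string also sets the unknown flag, so such rows are not silently encoded as all zeros.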

Out[87]:

saturn.columns["open_data_channel_type"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	4
top_value	ONLINE
top_rate	0.466
cardinality	4
entropy	1.589
entropy_ratio	0.7943

Fig 32.

Top values for open_data_channel_type.

Top values for open_data_channel_type (4 unique shown, of 4 total).
value	count	share
ONLINE	466	46.6%
MOBILE	288	28.8%
PHONE	236	23.6%
UNKNOWN	10	1.0%

park_facility_name categorical label

This column captures the named park facility associated with each record, but it is almost entirely uninformative: 999 of 1000 rows (99.9%) carry the placeholder value 'Unspecified', with only a single record attributed to Flushing Meadows Corona Park. The near-zero entropy (0.011) confirms the column is maximally imbalanced and conveys virtually no discriminative signal. This is likely a poorly populated administrative field rather than a reliable feature.

Treatment: Drop from modelling; if facility-level analysis is ever needed, this field requires back-filling from a source system before use.

anthropic:default · confidence high

Out[90]:

saturn.columns["park_facility_name"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	2
top_value	Unspecified
top_rate	0.999
cardinality	2
entropy	0.01141
entropy_ratio	0.01141
alert: imbalance	top value is 99.9% of rows

Fig 33.

Top values for park_facility_name.

Top values for park_facility_name (2 unique shown, of 2 total).
value	count	share
Unspecified	999	99.9%
Flushing Meadows Corona Park	1	0.1%

park_borough categorical label

This column encodes the borough associated with park locations, a low-cardinality geographic label with zero nulls across 1,000 rows. Its value counts and statistics are identical to those of the borough column (274/258/254/203/11, entropy ratio 0.886), so in this sample it appears to carry no additional information. Four boroughs (Queens, Manhattan, Brooklyn, Bronx) are fairly evenly represented (203 to 274 records each), but Staten Island is severely underrepresented at just 11 occurrences (1.1%). The entropy ratio reflects the spread across four classes and masks the sharp imbalance on the fifth.

Treatment: One-hot encode for modelling; note the severe class imbalance for STATEN ISLAND (11 of 1000) and consider stratified sampling or weighting.

anthropic:default · confidence high

Out[93]:

saturn.columns["park_borough"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	5
top_value	QUEENS
top_rate	0.274
cardinality	5
entropy	2.057
entropy_ratio	0.8858

Fig 34.

Top values for park_borough.

Top values for park_borough (5 unique shown, of 5 total).
value	count	share
QUEENS	274	27.4%
MANHATTAN	258	25.8%
BROOKLYN	254	25.4%
BRONX	203	20.3%
STATEN ISLAND	11	1.1%

latitude categorical feature

This column contains geographic latitude coordinates stored as strings, with all top values clustering tightly around 40.6–40.8°N, consistent with New York City's latitude range. Despite being numeric in nature, it was ingested as categorical, yielding 770 unique values across 1,000 rows (entropy ratio 0.97), which is near-unique. The long-tail alert and the top value ('40.73706433046593'), which appears 31 times (3.1% of non-null rows), suggest either a default/fallback coordinate or a high-traffic location being logged repeatedly.

Treatment: Cast to float64, pair with a longitude column for geospatial features, and investigate the 31-row duplicate coordinate for data quality issues before modelling.

anthropic:default · confidence high

Out[96]:

saturn.columns["latitude"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	770
top_value	40.73706433046593
top_rate	0.03122
cardinality	770
entropy	9.308
entropy_ratio	0.9707
alert: long_tail	655 singleton categories

Fig 35.

Top values for latitude.

Top values for latitude (20 unique shown, of 770 total).
value	count	share
40.73706433046593	31	3.1%
40.662493333971206	13	1.3%
40.800504000346805	8	0.8%
40.72965899371875	8	0.8%
40.83694066292417	8	0.8%
40.83325085108711	7	0.7%
40.726414636772915	5	0.5%
40.68600063896156	5	0.5%
40.693325110944265	5	0.5%
40.696706691928384	5	0.5%
40.79881467021513	4	0.4%
40.75811580959723	4	0.4%
40.861809211135316	4	0.4%
40.694851633900804	4	0.4%
40.70757495380374	4	0.4%
40.7060624863299	3	0.3%
40.85515165955753	3	0.3%
40.82555561649529	3	0.3%
40.74849081688869	3	0.3%
40.73652935637317	3	0.3%

longitude categorical feature

This column contains geographic longitude values, stored as strings (hence the categorical classification), with the top values falling between roughly -73.79 and -74.00, consistent with New York City. There are 770 unique values across 1,000 rows (entropy ratio 0.97), so most coordinates occur only once, yet the top value '-73.99830041620608' appears 31 times (3.1% of non-null rows), which is high for near-continuous coordinate data. The repeated exact values suggest either snapping to a fixed point (for example, geocoding to a building or address) or multiple requests filed for the same location.

Treatment: Parse to float, verify duplicate coordinates for data quality issues, then use directly as a numeric geospatial feature or pair with latitude for distance/clustering models.

anthropic:default · confidence high
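
For distance-based features, the parsed latitude and longitude can be fed to a great-circle distance. The sketch below uses the haversine formula with a mean Earth radius of 6371 km, which is an approximation. Pairing the most frequent latitude with the most frequent longitude, and likewise for the second most frequent values, is an assumption made for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = (math.radians(float(v)) for v in (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# String inputs are accepted because the columns are stored as strings.
top = ("40.73706433046593", "-73.99830041620608")
second = ("40.662493333971206", "-73.9270067970956")
print(round(haversine_km(*top, *second), 2), "km")
```

Over city-scale distances the spherical approximation is usually adequate; the State Plane columns offer a planar alternative in feet.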

Out[99]:

saturn.columns["longitude"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	770
top_value	-73.99830041620608
top_rate	0.03122
cardinality	770
entropy	9.308
entropy_ratio	0.9707
alert: long_tail	655 singleton categories

Fig 36.

Top values for longitude.

Top values for longitude (20 unique shown, of 770 total).
value	count	share
-73.99830041620608	31	3.1%
-73.9270067970956	13	1.3%
-73.95080595870716	8	0.8%
-74.00078655646078	8	0.8%
-73.90374441298103	8	0.8%
-73.82755893872277	7	0.7%
-74.00365117608368	5	0.5%
-73.994133534454	5	0.5%
-73.86810717317749	5	0.5%
-73.94652977407227	5	0.5%
-73.94779857268328	4	0.4%
-73.85713563386481	4	0.4%
-73.92417422718809	4	0.4%
-73.79367980321378	4	0.4%
-73.91330906170826	4	0.4%
-73.91317397090971	3	0.3%
-73.86289896839915	3	0.3%
-73.91421404014056	3	0.3%
-73.93855184851218	3	0.3%
-73.85143729064244	3	0.3%

location unknown other

This column is named 'location' and contains 1,000 non-null values with a null rate of 0.0, but the profiler skipped it entirely, yielding no stats, no uniqueness count, and no type inference. Without distribution, cardinality, or sample values, it is impossible to determine whether this is a structured geographic field (e.g., city, coordinates, country code) or free-text. The 'skipped' alert is the dominant signal and warrants direct inspection of raw values before any downstream use.

Treatment: Manually inspect raw values to determine structure (free-text, categorical, coordinate, etc.) before deciding on encoding or embedding strategy.

anthropic:default · confidence low

Out[102]:

saturn.columns["location"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

:@computed_region_f5dn_yrer categorical foreign_key

This column is a Socrata-generated computed region identifier, assigning each row to one of 62 geographic zones (e.g., neighborhood, district, or census tract polygons). With an entropy ratio of 0.939 and 62 unique integer-like codes across 1,000 rows, values are distributed fairly evenly — the most frequent region ('57') accounts for only 5.6% of rows, suggesting no strong spatial concentration. The null rate is negligible at 0.7%. The numeric-looking codes are categorical labels, not ordinal or continuous values.

Treatment: Treat as a categorical geographic key; left-join to a region lookup table or one-hot encode for spatial feature engineering.

anthropic:default · confidence high

Out[104]:

saturn.columns[":@computed_region_f5dn_yrer"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	62
top_value	57
top_rate	0.05639
cardinality	62
entropy	5.591
entropy_ratio	0.9389

Fig 37.

Top values for :@computed_region_f5dn_yrer.

Top values for :@computed_region_f5dn_yrer (20 unique shown, of 62 total).
value	count	share
57	56	5.6%
46	48	4.8%
70	38	3.8%
54	37	3.7%
41	30	3.0%
34	29	2.9%
47	29	2.9%
18	28	2.8%
48	28	2.8%
37	28	2.8%
61	27	2.7%
62	27	2.7%
65	25	2.5%
36	24	2.4%
1	23	2.3%
42	22	2.2%
58	21	2.1%
40	21	2.1%
66	20	2.0%
43	20	2.0%

:@computed_region_yeji_bk3q categorical foreign_key

This column is a Socrata-generated computed region identifier (geo-zone lookup key), indicated by the ':@computed_region_' prefix; it maps each row to one of 5 predefined geographic regions. Zones 3, 4, and 2 hold 254 to 269 rows each and zone 5 holds 193, but zone '1' is a stark outlier with only 11 occurrences (~1.1% of rows). This pattern closely mirrors the borough column, where Staten Island also has 11 rows, so the regions are probably borough boundaries. The null rate is negligible at 0.7%.

Treatment: Use as a categorical grouping key or left-join to a region lookup table; investigate zone '1' underrepresentation before using as a stratification variable.

anthropic:default · confidence high

Out[107]:

saturn.columns[":@computed_region_yeji_bk3q"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	5
top_value	3
top_rate	0.2709
cardinality	5
entropy	2.054
entropy_ratio	0.8846

Fig 38.

Top values for :@computed_region_yeji_bk3q.

Top values for :@computed_region_yeji_bk3q (5 unique shown, of 5 total).
value	count	share
3	269	26.9%
4	266	26.6%
2	254	25.4%
5	193	19.3%
1	11	1.1%

:@computed_region_sbqj_enih categorical foreign_key

This column is a Socrata computed region identifier, automatically generated by the platform to assign rows to pre-defined geographic boundary zones (e.g., neighbourhoods, council districts, or census tracts). With 75 distinct integer-like codes across 1,000 rows and an entropy ratio of 0.94, the distribution is remarkably flat — no single zone dominates, with even the top value '3' appearing in only 5% of rows. The near-uniform spread across 75 regions and very low null rate (0.7%) suggest broad geographic coverage with no strong spatial concentration.

Treatment: Left-join on this region code to a geographic boundaries table to enrich with spatial attributes; do not treat as a numeric feature.

anthropic:default · confidence high

Out[110]:

saturn.columns[":@computed_region_sbqj_enih"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	75
top_value	3
top_rate	0.05035
cardinality	75
entropy	5.846
entropy_ratio	0.9385

Fig 39.

Top values for :@computed_region_sbqj_enih.

Top values for :@computed_region_sbqj_enih (20 unique shown, of 75 total).
value	count	share
3	50	5.0%
60	48	4.8%
62	37	3.7%
25	29	2.9%
33	28	2.8%
40	27	2.7%
64	27	2.7%
19	26	2.6%
68	23	2.3%
37	23	2.3%
73	23	2.3%
53	22	2.2%
26	21	2.1%
70	21	2.1%
61	20	2.0%
28	20	2.0%
32	19	1.9%
5	19	1.9%
22	19	1.9%
20	18	1.8%

:@computed_region_92fq_4b7q categorical foreign_key

This column is a Socrata-generated computed region identifier, typically encoding a geographic zone or district (e.g., census tract, neighbourhood, or administrative boundary) as an opaque integer-like label. With 51 unique values across 1,000 rows and an entropy ratio of 0.948, the distribution is near-uniform — no single region dominates heavily, though region '10' leads with 6.4% of records (64 occurrences). The null rate of 0.7% is negligible, suggesting reliable spatial assignment.

Treatment: Left-join on this computed region ID to a Socrata boundary dataset to retrieve human-readable geographic names; do not treat the numeric strings as ordinal or cardinal values.

anthropic:default · confidence high

Out[113]:

saturn.columns[":@computed_region_92fq_4b7q"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	51
top_value	10
top_rate	0.06445
cardinality	51
entropy	5.376
entropy_ratio	0.9477

Fig 40.

Top values for :@computed_region_92fq_4b7q.

Top values for :@computed_region_92fq_4b7q (20 unique shown, of 51 total).
value	count	share
10	64	6.4%
34	54	5.4%
30	40	4.0%
36	36	3.6%
32	36	3.6%
12	35	3.5%
39	35	3.5%
46	34	3.4%
23	33	3.3%
42	28	2.8%
50	27	2.7%
21	27	2.7%
43	27	2.7%
40	25	2.5%
28	25	2.5%
48	24	2.4%
45	23	2.3%
29	22	2.2%
17	22	2.2%
35	21	2.1%

descriptor_2 categorical label

This column is a secondary complaint descriptor, likely a sub-category or detail field attached to a service request or complaint record (consistent with NYC 311-style data). It is almost entirely empty — null_rate is 0.882, meaning only 118 of 1,000 rows carry a value. Among the 43 unique non-null values, 'NO HEAT' dominates at 27.97% of non-null entries (33 occurrences), but the spread across housing, noise, sanitation, and transportation categories (e.g., 'Cannabis Smoking or Vaping', 'Unsafe Driving - Non-Passenger') signals this field is populated inconsistently across complaint types, not a uniform taxonomy. The high entropy_ratio of 0.807 confirms the long-tail alert: values are broadly dispersed despite the sparse fill rate.

Treatment: Filter to non-null rows before use; group rare categories (below a frequency threshold) into 'OTHER' and one-hot encode or target-encode the remainder.

anthropic:default · confidence high
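
The filter-then-collapse step from the treatment could be sketched as below. The threshold of 2 is an arbitrary illustrative choice, and the function name is hypothetical. Note that the column already contains a literal 'OTHER' value (and an 'N/A' placeholder), so collapsed categories will merge with existing 'OTHER' rows unless a different label is passed.

```python
from collections import Counter

def collapse_rare(values, min_count=2, other="OTHER"):
    """Drop None values, then replace categories seen fewer than min_count times."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return [v if counts[v] >= min_count else other for v in present]

# Toy rows using values that appear in the descriptor_2 table.
rows = ["NO HEAT"] * 3 + ["ROACHES"] * 2 + ["Waste", None, "Droppings"]
print(collapse_rare(rows))
```

The returned list is shorter than the input because nulls are filtered out, so it should be aligned with the filtered rows and not with the original index.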

Out[116]:

saturn.columns["descriptor_2"].stats

stat	value
n	1,000
nulls	882 (88.2%)
unique	43
top_value	NO HEAT
top_rate	0.2797
cardinality	43
entropy	4.376
entropy_ratio	0.8065
alert: long_tail	27 singleton categories
alert: null_rate	88.2% null

Fig 41.

Top values for descriptor_2.

Top values for descriptor_2 (20 unique shown, of 43 total).
value	count	share
NO HEAT	33	3.3%
NO HEAT AND NO HOT WATER	10	1.0%
N/A	9	0.9%
NO HOT WATER	7	0.7%
Cannabis Smoking or Vaping	5	0.5%
ROACHES	4	0.4%
Unsafe Driving - Non-Passenger	3	0.3%
Dog	3	0.3%
OTHER	3	0.3%
Not Cleaned by Property Owner	2	0.2%
Parked In Front Of Fire Hydrant	2	0.2%
Operating Improperly	2	0.2%
AT WALL OR CEILING	2	0.2%
FLIES	2	0.2%
BROKEN OR MISSING	2	0.2%
MISSING OR INADEQUATE CANS/LID	2	0.2%
Loose or Improperly Stored Garbage or Food	1	0.1%
Droppings	1	0.1%
Fare/Tip Complaint	1	0.1%
Waste	1	0.1%

resolution_description categorical label

This column contains standardized resolution outcome descriptions from NYC 311 service requests, drawn from a fixed vocabulary of only 16 distinct boilerplate phrases. Despite being a text column, it behaves as a low-cardinality categorical: the top value (NYPD 'no criminal violation, condition corrected') accounts for 35.4% of non-null rows. A 30.7% null rate is flagged as an alert, likely reflecting complaints that have not yet been resolved or closed.

Treatment: One-hot or target-encode as a categorical (16 levels); the phrases have no natural order, so avoid ordinal encoding. Treat nulls as a distinct 'unresolved' category rather than imputing.

anthropic:default · confidence high
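
Because the boilerplate phrases are long, it can be convenient to replace them with short integer identifiers while keeping nulls as their own category. This is a minimal sketch; the 'unresolved' label follows the treatment note, and the short strings in the example stand in for the full phrases.

```python
def encode_resolution(descriptions, null_label="unresolved"):
    """Return (codes, mapping); None becomes its own category.

    Codes are assigned in order of first appearance and are identifiers,
    not ranks, so they should be one-hot or target-encoded downstream.
    """
    labels = [null_label if d is None else d for d in descriptions]
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return [mapping[label] for label in labels], mapping

# Short stand-ins for the full resolution phrases.
codes, mapping = encode_resolution(["no violation", None, "summons issued", "no violation"])
print(codes)
print(mapping)
```

Keeping the mapping alongside the codes allows the original phrase to be recovered for reporting.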

Out[119]:

saturn.columns["resolution_description"].stats

stat	value
n	1,000
nulls	307 (30.7%)
unique	16
top_value	The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.
top_rate	0.3535
cardinality	16
entropy	2.777
entropy_ratio	0.6943
alert: null_rate	30.7% null

Fig 42.

Top values for resolution_description.

Top values for resolution_description (16 unique shown, of 16 total).
value	count	share
The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	245	24.5%
The New York City Police Department responded to the complaint and with the information available observed no evidence of a criminal violation at that time. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	161	16.1%
The New York City Police Department responded to the complaint and their investigation determined that a violation of law occurred. Police issued a summons in response to the complaint. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	68	6.8%
The New York City Police Department responded to the complaint but officers were unable to gain entry into the premises. If the problem persists, please contact 311 to create another complaint and ensure that contact information (e.g., buzzer number, phone number, etc.) is available to assist the responding officers in gaining entry to properly investigate the complaint. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	50	5.0%
The following complaint conditions are still open. HPD has already attempted to notify the property owner that the condition exists; the tenant should provide access for the owner to make the repair. HPD may attempt to contact the tenant by phone to verify the correction of the condition or an HPD Inspector may attempt to conduct an inspection.	44	4.4%
The New York City Police Department responded to the complaint and their investigation determined that police action was not necessary. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	40	4.0%
The New York City Police Department responded to the complaint and observed no criminal violation upon their arrival. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	31	3.1%
This complaint is a duplicate of a building-wide condition already reported by another tenant. The original complaint is still open, and HPD may only need to confirm that the condition exists by inspecting one apartment. If we cannot contact the tenant from the original complaint or get access to that apartment, HPD may attempt to contact the person who filed this complaint to verify the correction of the condition or may conduct an inspection of your unit. You can check HPDONLINE to see if a	29	2.9%
The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.	6	0.6%
Your complaint has been received but does not fall under the jurisdiction of the New York City Police Department. Please contact your local precinct for more information, including which City agency your request was referred to. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	5	0.5%
The Police Department reviewed your complaint and provided additional information below.	4	0.4%
The New York City Police Department responded to the complaint and a report was prepared as part of their investigation. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	3	0.3%
The Department of Health and Mental Hygiene has sent official written notification to the Owner/Landlord warning them of potential violations and instructing them to correct the situation. If the situation persists 21 days after your initial complaint, please make a new complaint.	3	0.3%
The New York City Police Department responded to the complaint and observed an encampment at the noted location. The complaint has been referred to the Department of Homeless Services (DHS) for further action. DHS will inspect the condition and update your service request with more information. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	2	0.2%
Your complaint has been received by the New York City Police Department and assigned to a unit at your local precinct. The responding officers will conduct an assessment of the complaint and may contact you for further information, if you left contact information. Our team is committed to resolving this matter promptly, and we appreciate your contribution to maintaining the well-being of our community.	1	0.1%
The New York City Police Department responded to the complaint and their investigation determined another specific tow is required. Please contact the local precinct for more information. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	1	0.1%

resolution_action_updated_date categorical timestamp

This column records when a resolution action was last updated, stored as a categorical string in ISO 8601 format. The 30.6% null rate is a significant concern: nearly one-third of records have no update date logged. One value, '2026-01-17T00:00:00.000', accounts for 10.7% of non-null records (74 occurrences, or 7.4% of all rows) and falls exactly at midnight, suggesting a bulk update or default-date assignment rather than a genuine event timestamp. All remaining top values appear only twice and 548 values appear once, confirming a severe long-tail distribution (entropy ratio 0.94) consistent with mostly unique timestamps.

Treatment: Parse to datetime, flag the 74 midnight '2026-01-17' records as probable synthetic/default dates, and impute or exclude nulls (30.6%) before any time-based feature engineering.

anthropic:default · confidence high
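
The treatment above can be sketched with the standard library alone. This is a minimal illustration, not part of Saturn: the function name `prepare_updated_date` and the three-value return shape are hypothetical, and the format string assumes every non-null value matches the millisecond pattern seen in the top values.

```python
from datetime import datetime

# Format inferred from the top values shown for this column (assumption).
FMT = "%Y-%m-%dT%H:%M:%S.%f"
# The midnight value that accounts for 74 records in the stats.
SUSPECT = datetime(2026, 1, 17)

def prepare_updated_date(raw):
    """Return (parsed, is_missing, is_suspect_default) for one raw cell."""
    if raw is None or raw == "":
        return None, True, False
    parsed = datetime.strptime(raw, FMT)
    return parsed, False, parsed == SUSPECT

# Hypothetical sample: two values from the top-values table plus one null.
sample = ["2026-01-17T00:00:00.000", "2026-01-18T02:03:44.000", None]
rows = [prepare_updated_date(v) for v in sample]
```

Keeping the missing and suspect flags as separate fields leaves the choice between imputing and excluding to the downstream step, rather than baking it into the parse.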

Out[122]:

saturn.columns["resolution_action_updated_date"].stats

stat	value
n	1,000
nulls	306 (30.6%)
unique	585
top_value	2026-01-17T00:00:00.000
top_rate	0.1066
cardinality	585
entropy	8.673
entropy_ratio	0.9435
alert: long_tail	548 singleton categories
alert: null_rate	30.6% null

Fig 43.

Top values for resolution_action_updated_date.

Top values for resolution_action_updated_date (20 unique shown, of 585 total).
value	count	share
2026-01-17T00:00:00.000	74	7.4%
2026-01-18T02:03:44.000	2	0.2%
2026-01-18T02:04:03.000	2	0.2%
2026-01-18T02:01:54.000	2	0.2%
2026-01-18T01:35:08.000	2	0.2%
2026-01-18T01:41:49.000	2	0.2%
2026-01-18T01:26:35.000	2	0.2%
2026-01-18T01:32:28.000	2	0.2%
2026-01-18T01:25:53.000	2	0.2%
2026-01-18T01:54:39.000	2	0.2%
2026-01-18T01:19:59.000	2	0.2%
2026-01-18T01:09:47.000	2	0.2%
2026-01-18T01:00:57.000	2	0.2%
2026-01-18T01:04:36.000	2	0.2%
2026-01-18T00:58:56.000	2	0.2%
2026-01-18T01:07:50.000	2	0.2%
2026-01-18T01:27:34.000	2	0.2%
2026-01-18T00:46:07.000	2	0.2%
2026-01-18T00:55:02.000	2	0.2%
2026-01-18T00:40:32.000	2	0.2%
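
The `share` column in this table and the `top_rate` stat use different denominators, which is why the same 74 records show up as 7.4% here and 0.1066 in the stats. The arithmetic below reconciles them using only figures reported in this section; the variable names are illustrative.

```python
# Figures copied from the stats and top-values table for this column.
n_rows = 1000
nulls = 306
unique = 585
top_count = 74
singletons = 548

non_null = n_rows - nulls            # rows that carry a timestamp
top_rate = top_count / non_null      # denominator: non-null rows
share = top_count / n_rows           # denominator: all rows

# Values that are neither the top value nor singletons, and the rows they cover.
other_values = unique - 1 - singletons
other_rows = non_null - top_count - singletons
```

Here `other_rows` is exactly twice `other_values`, which matches the observation that every remaining repeated value appears twice.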

closed_date categorical timestamp

This column is the request's closure timestamp, stored as a string rather than a parsed datetime type, with second-level precision. Two signals demand attention. First, 39% of rows are null, indicating a large share of requests that have not yet been closed. Second, with 585 unique values across the 610 non-null rows and an entropy ratio of 0.998, timestamps are nearly all distinct, which is expected for event times but confirms the column carries no categorical signal. All top values fall on 2026-01-18, suggesting the snapshot or batch was captured on or around that date.

Treatment: Parse to datetime, use nulls as an 'is_open' binary flag, and engineer elapsed-time features (e.g. days-to-close) rather than using raw values.

anthropic:default · confidence high
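
A minimal sketch of the treatment above, using only the standard library. The function name `closure_features` is hypothetical, the `created_date` value in the example is invented for illustration, and the sketch assumes `created_date` shares the millisecond format seen in `closed_date`.

```python
from datetime import datetime

# Format inferred from the closed_date top values (assumption).
FMT = "%Y-%m-%dT%H:%M:%S.%f"

def closure_features(created_raw, closed_raw):
    """Derive an is_open flag and elapsed hours from two raw timestamp strings."""
    created = datetime.strptime(created_raw, FMT)
    if not closed_raw:
        # A null closed_date is treated as "still open" (assumption).
        return {"is_open": True, "hours_to_close": None}
    closed = datetime.strptime(closed_raw, FMT)
    hours = (closed - created).total_seconds() / 3600
    return {"is_open": False, "hours_to_close": hours}

# Hypothetical created_date paired with the top closed_date value.
closed_example = closure_features("2026-01-17T22:41:26.000", "2026-01-18T00:41:26.000")
open_example = closure_features("2026-01-17T22:41:26.000", None)
```

Before relying on the null-derived flag, it is worth checking it against the `status` column, and checking for negative durations, which would point to data-entry problems rather than real turnaround times.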

Out[125]:

saturn.columns["closed_date"].stats

stat	value
n	1,000
nulls	390 (39.0%)
unique	585
top_value	2026-01-18T00:41:26.000
top_rate	0.004918
cardinality	585
entropy	9.169
entropy_ratio	0.9975
alert: long_tail	561 singleton categories
alert: null_rate	39.0% null

Fig 44.

Top values for closed_date.

Top values for closed_date (20 unique shown, of 585 total).
value	count	share
2026-01-18T00:41:26.000	3	0.3%
2026-01-18T02:02:35.000	2	0.2%
2026-01-18T02:03:41.000	2	0.2%
2026-01-18T01:41:47.000	2	0.2%
2026-01-18T01:26:28.000	2	0.2%
2026-01-18T01:45:29.000	2	0.2%
2026-01-18T01:54:33.000	2	0.2%
2026-01-18T01:07:46.000	2	0.2%
2026-01-18T01:27:29.000	2	0.2%
2026-01-18T00:46:04.000	2	0.2%
2026-01-18T00:40:28.000	2	0.2%
2026-01-18T00:55:27.000	2	0.2%
2026-01-18T00:41:08.000	2	0.2%
2026-01-18T00:20:14.000	2	0.2%
2026-01-18T00:37:27.000	2	0.2%
2026-01-18T00:22:36.000	2	0.2%
2026-01-18T00:36:52.000	2	0.2%
2026-01-18T00:17:49.000	2	0.2%
2026-01-18T00:28:52.000	2	0.2%
2026-01-18T00:20:55.000	2	0.2%

taxi_pick_up_location categorical

Out[128]:

saturn.columns["taxi_pick_up_location"].stats

stat	value
n	1,000
nulls	994 (99.4%)
unique	6
top_value	55 LITTLE WEST 12 STREET, MANHATTAN (NEW YORK), NY, 10014
top_rate	0.1667
cardinality	6
entropy	2.585
entropy_ratio	1
alert: long_tail	6 singleton categories
alert: null_rate	99.4% null

Fig 45.

Top values for taxi_pick_up_location.

Top values for taxi_pick_up_location (6 unique shown, of 6 total).
value	count	share
55 LITTLE WEST 12 STREET, MANHATTAN (NEW YORK), NY, 10014	1	0.1%
JOHN F KENNEDY AIRPORT, QUEENS (JAMAICA) ,NY, 11430	1	0.1%
11 WATER STREET, BROOKLYN, NY, 11201	1	0.1%
9 AVENUE AND WEST 43 STREET, MANHATTAN, NY, 10036	1	0.1%
30 AVENUE AND 33 STREET, QUEENS, NY, 11102	1	0.1%
30-08 33 STREET, QUEENS (ASTORIA), NY, 11102	1	0.1%

vehicle_type categorical

Out[131]:

saturn.columns["vehicle_type"].stats

stat	value
n	1,000
nulls	965 (96.5%)
unique	4
top_value	Car
top_rate	0.7714
cardinality	4
entropy	1.097
entropy_ratio	0.5484
alert: null_rate	96.5% null

Fig 46.

Top values for vehicle_type.

Top values for vehicle_type (4 unique shown, of 4 total).
value	count	share
Car	27	2.7%
Other	4	0.4%
SUV	3	0.3%
Van	1	0.1%
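
The `entropy` and `entropy_ratio` figures reported for this column are consistent with Shannon entropy in bits over the non-null values, divided by log2 of the cardinality. That reading is inferred from the numbers in this report, not taken from Saturn's documentation, and the helper names below are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    if k < 2:
        # Assumption: a single category is reported as ratio 0.
        return 0.0
    return entropy_bits(counts) / math.log2(k)

# Counts from the vehicle_type table: Car, Other, SUV, Van.
vehicle_counts = [27, 4, 3, 1]
h = entropy_bits(vehicle_counts)
ratio = entropy_ratio(vehicle_counts)
```

Rounded to the precision shown in the stats, `h` and `ratio` reproduce the reported 1.097 and 0.5484. Six equally frequent categories, as in taxi_pick_up_location, give a ratio of 1 under the same formula.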

facility_type categorical

Out[134]:

saturn.columns["facility_type"].stats

stat	value
n	1,000
nulls	990 (99.0%)
unique	1
top_value	N/A
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	99.0% null
alert: imbalance	top value is 100.0% of rows

Fig 47.

Top values for facility_type.

Top values for facility_type (1 unique shown, of 1 total).
value	count	share
N/A	10	1.0%

taxi_company_borough categorical

Out[137]:

saturn.columns["taxi_company_borough"].stats

stat	value
n	1,000
nulls	999 (99.9%)
unique	1
top_value	BROOKLYN
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: long_tail	1 singleton categories
alert: null_rate	99.9% null
alert: imbalance	top value is 100.0% of rows

Fig 48.

Top values for taxi_company_borough.

Top values for taxi_company_borough (1 unique shown, of 1 total).
value	count	share
BROOKLYN	1	0.1%

bridge_highway_name categorical

Out[140]:

saturn.columns["bridge_highway_name"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	E
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 49.

Top values for bridge_highway_name.

Top values for bridge_highway_name (2 unique shown, of 2 total).
value	count	share
E	2	0.2%
Kosciuszko Br - BQE	1	0.1%

bridge_highway_direction categorical

Out[143]:

saturn.columns["bridge_highway_direction"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	C E Local Downtown & Brooklyn
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 50.

Top values for bridge_highway_direction.

Top values for bridge_highway_direction (2 unique shown, of 2 total).
value	count	share
C E Local Downtown & Brooklyn	2	0.2%
Queens Bound	1	0.1%

road_ramp categorical

Out[146]:

saturn.columns["road_ramp"].stats

stat	value
n	1,000
nulls	999 (99.9%)
unique	1
top_value	Ramp
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: long_tail	1 singleton categories
alert: null_rate	99.9% null
alert: imbalance	top value is 100.0% of rows

Fig 51.

Top values for road_ramp.

Top values for road_ramp (1 unique shown, of 1 total).
value	count	share
Ramp	1	0.1%

bridge_highway_segment categorical

Out[149]:

saturn.columns["bridge_highway_segment"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	Platform
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 52.

Top values for bridge_highway_segment.

Top values for bridge_highway_segment (2 unique shown, of 2 total).
value	count	share
Platform	2	0.2%
Ramp	1	0.1%
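
The last eight columns all carry a `null_rate` alert. One way to act on those alerts is to shortlist columns above a null-rate cutoff for review before modeling. The sketch below uses the null counts reported in this section; the 0.99 cutoff and the function name `sparse_columns` are hypothetical choices, not Saturn defaults.

```python
# Null counts copied from the stats tables in this section (n = 1,000 each).
null_counts = {
    "taxi_pick_up_location": 994,
    "vehicle_type": 965,
    "facility_type": 990,
    "taxi_company_borough": 999,
    "bridge_highway_name": 997,
    "bridge_highway_direction": 997,
    "road_ramp": 999,
    "bridge_highway_segment": 997,
}
N_ROWS = 1000
THRESHOLD = 0.99  # hypothetical cutoff for "near-empty"

def sparse_columns(counts, n_rows, threshold):
    """Return the column names whose null rate meets or exceeds the cutoff."""
    return sorted(col for col, nulls in counts.items() if nulls / n_rows >= threshold)

to_review = sparse_columns(null_counts, N_ROWS, THRESHOLD)
```

With this cutoff, `vehicle_type` (96.5% null) is kept and the other seven columns are shortlisted. The only non-null value in `facility_type` is the literal string N/A, so that column may carry no usable information at all.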

How to cite