wild-nyc_311_sample · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample.json

Saturn profiled 1,000 rows across 47 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

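# Notebook shell escape: install the profiler into the current environment.
!pip install saturn-dissect
import subprocess

# Run the profiler CLI on the source file; check=True raises if the command fails.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample.json",
    "--findings", "wild-nyc_311_sample.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)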

Summary confidence: high

This is a 1,000-row sample of NYC 311 service requests with 47 columns covering complaint metadata, location, agency routing, and resolution status. The bulk of activity is noise and parking complaints routed to NYPD (88% of all tickets), with 'Noise - Residential' alone accounting for 393 of 1,000 rows and 'Loud Music/Party' the dominant descriptor at 427. Geographically the load is spread fairly evenly across Queens, Manhattan, Brooklyn and the Bronx (20% to 27% each), while Staten Island barely registers (11 rows). Worth a closer look: the resolution funnel, where 61% of tickets are Closed but 30.5% are still In Progress, and the channel mix, where ONLINE (466) is the largest single channel but still trails MOBILE and PHONE combined (524). Note that many specialty fields (road_ramp, taxi_*, bridge_highway_*, facility_type) are at least 99% null and should be ignored for analysis.

citing: complaint_type · descriptor · agency · borough · status · open_data_channel_type · city · location_type · address_type · road_ramp · taxi_company_borough · facility_type

Out[4]:

saturn.schema() · 47 columns

column	kind	n	null%	unique	alerts
unique_key	categorical	1,000	0.0%	1,000	long_tail
created_date	categorical	1,000	0.0%	939	long_tail
agency	categorical	1,000	0.0%	11
agency_name	categorical	1,000	0.0%	11
complaint_type	categorical	1,000	0.0%	45
descriptor	categorical	1,000	0.0%	71
location_type	categorical	1,000	1.4%	20
incident_zip	categorical	1,000	0.3%	146
incident_address	categorical	1,000	0.5%	776	long_tail
street_name	categorical	1,000	0.5%	569	long_tail
cross_street_1	categorical	1,000	8.0%	511	long_tail
cross_street_2	categorical	1,000	7.9%	504	long_tail
intersection_street_1	categorical	1,000	8.5%	506	long_tail
intersection_street_2	categorical	1,000	8.4%	499	long_tail
address_type	categorical	1,000	0.2%	4	imbalance
city	categorical	1,000	2.1%	36
landmark	categorical	1,000	10.5%	515	long_tail
status	categorical	1,000	0.0%	3
community_board	categorical	1,000	0.0%	65
council_district	categorical	1,000	0.9%	51
police_precinct	categorical	1,000	0.0%	76
bbl	categorical	1,000	5.5%	719	long_tail
borough	categorical	1,000	0.0%	5
x_coordinate_state_plane	categorical	1,000	0.7%	763	long_tail
y_coordinate_state_plane	categorical	1,000	0.7%	768	long_tail
open_data_channel_type	categorical	1,000	0.0%	4
park_facility_name	categorical	1,000	0.0%	2	imbalance
park_borough	categorical	1,000	0.0%	5
latitude	categorical	1,000	0.7%	770	long_tail
longitude	categorical	1,000	0.7%	770	long_tail
location	unknown	1,000	0.0%	—	skipped
:@computed_region_f5dn_yrer	categorical	1,000	0.7%	62
:@computed_region_yeji_bk3q	categorical	1,000	0.7%	5
:@computed_region_sbqj_enih	categorical	1,000	0.7%	75
:@computed_region_92fq_4b7q	categorical	1,000	0.7%	51
descriptor_2	categorical	1,000	88.2%	43	long_tail null_rate
resolution_description	categorical	1,000	30.7%	16	null_rate
resolution_action_updated_date	categorical	1,000	30.6%	585	long_tail null_rate
closed_date	categorical	1,000	39.0%	585	long_tail null_rate
taxi_pick_up_location	categorical	1,000	99.4%	6	long_tail null_rate
vehicle_type	categorical	1,000	96.5%	4	null_rate
facility_type	categorical	1,000	99.0%	1	null_rate imbalance
taxi_company_borough	categorical	1,000	99.9%	1	long_tail null_rate imbalance
bridge_highway_name	categorical	1,000	99.7%	2	null_rate
bridge_highway_direction	categorical	1,000	99.7%	2	null_rate
road_ramp	categorical	1,000	99.9%	1	long_tail null_rate imbalance
bridge_highway_segment	categorical	1,000	99.7%	2	null_rate

Fig 1.

complaint_type · Noise and parking complaints dominate; check how steeply the long tail drops after the top three categories.

Top values for complaint_type (20 unique shown, of 45 total).
value	count	share
Noise - Residential	393	39.3%
Illegal Parking	197	19.7%
Noise - Commercial	148	14.8%
Blocked Driveway	55	5.5%
HEAT/HOT WATER	49	4.9%
Noise - Street/Sidewalk	44	4.4%
Noise - Vehicle	17	1.7%
UNSANITARY CONDITION	12	1.2%
Street Condition	7	0.7%
Non-Emergency Police Matter	7	0.7%
Smoking or Vaping	5	0.5%
Abandoned Vehicle	5	0.5%
Animal-Abuse	5	0.5%
Rodent	4	0.4%
Encampment	4	0.4%
Taxi Complaint	3	0.3%
Dirty Condition	3	0.3%
PAINT/PLASTER	3	0.3%
GENERAL	3	0.3%
Drinking	2	0.2%

Fig 2.

agency · NYPD handles 88% of tickets — see how thin the slice is for every other agency.

Top values for agency (11 unique shown, of 11 total).
value	count	share
NYPD	880	88.0%
HPD	74	7.4%
DOHMH	13	1.3%
DOT	11	1.1%
TLC	6	0.6%
DSNY	6	0.6%
DHS	2	0.2%
OOS	2	0.2%
DCWP	2	0.2%
DPR	2	0.2%
DEP	2	0.2%

Fig 3.

borough · Queens, Manhattan, Brooklyn and Bronx are roughly balanced while Staten Island is barely represented.

Top values for borough (5 unique shown, of 5 total).
value	count	share
QUEENS	274	27.4%
MANHATTAN	258	25.8%
BROOKLYN	254	25.4%
BRONX	203	20.3%
STATEN ISLAND	11	1.1%

Fig 4.

status · Closed vs. In Progress vs. Open — useful for gauging the open backlog in this snapshot.

Top values for status (3 unique shown, of 3 total).
value	count	share
Closed	610	61.0%
In Progress	305	30.5%
Open	85	8.5%

Fig 5.

open_data_channel_type · Compare intake channels to see how reporting splits across online, mobile, and phone in this snapshot.

Top values for open_data_channel_type (4 unique shown, of 4 total).
value	count	share
ONLINE	466	46.6%
MOBILE	288	28.8%
PHONE	236	23.6%
UNKNOWN	10	1.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

column	kind	null %
unique_key	categorical	0.0%
created_date	categorical	0.0%
agency	categorical	0.0%
agency_name	categorical	0.0%
complaint_type	categorical	0.0%
descriptor	categorical	0.0%
location_type	categorical	1.4%
incident_zip	categorical	0.3%
incident_address	categorical	0.5%
street_name	categorical	0.5%
cross_street_1	categorical	8.0%
cross_street_2	categorical	7.9%
intersection_street_1	categorical	8.5%
intersection_street_2	categorical	8.4%
address_type	categorical	0.2%
city	categorical	2.1%
landmark	categorical	10.5%
status	categorical	0.0%
community_board	categorical	0.0%
council_district	categorical	0.9%
police_precinct	categorical	0.0%
bbl	categorical	5.5%
borough	categorical	0.0%
x_coordinate_state_plane	categorical	0.7%
y_coordinate_state_plane	categorical	0.7%
open_data_channel_type	categorical	0.0%
park_facility_name	categorical	0.0%
park_borough	categorical	0.0%
latitude	categorical	0.7%
longitude	categorical	0.7%
location	unknown	0.0%
:@computed_region_f5dn_yrer	categorical	0.7%
:@computed_region_yeji_bk3q	categorical	0.7%
:@computed_region_sbqj_enih	categorical	0.7%
:@computed_region_92fq_4b7q	categorical	0.7%
descriptor_2	categorical	88.2%
resolution_description	categorical	30.7%
resolution_action_updated_date	categorical	30.6%
closed_date	categorical	39.0%
taxi_pick_up_location	categorical	99.4%
vehicle_type	categorical	96.5%
facility_type	categorical	99.0%
taxi_company_borough	categorical	99.9%
bridge_highway_name	categorical	99.7%
bridge_highway_direction	categorical	99.7%
road_ramp	categorical	99.9%
bridge_highway_segment	categorical	99.7%
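
The null-rate table can be turned into a drop list directly. A minimal sketch, assuming the rates for the sparse columns are copied by hand from the table above into a plain dict; the `null_pct` name and the 99% cutoff are illustrative choices, not part of Saturn's output:

```python
# Null rates (percent) copied from the table above; only the sparse columns are listed.
null_pct = {
    "descriptor_2": 88.2,
    "vehicle_type": 96.5,
    "taxi_pick_up_location": 99.4,
    "facility_type": 99.0,
    "taxi_company_borough": 99.9,
    "bridge_highway_name": 99.7,
    "bridge_highway_direction": 99.7,
    "road_ramp": 99.9,
    "bridge_highway_segment": 99.7,
}

CUTOFF = 99.0  # illustrative threshold, not a Saturn setting

# Columns at or above the cutoff are candidates to exclude from analysis.
drop_candidates = sorted(col for col, pct in null_pct.items() if pct >= CUTOFF)
print(drop_candidates)
```

With this cutoff, vehicle_type (96.5%) and descriptor_2 (88.2%) stay in and would need a separate decision.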

unique_key categorical identifier

This column is a unique row identifier: all 1000 values are distinct (n_unique=1000, entropy_ratio=1.0) with no nulls and a top_rate of just 0.001. Sample values like '67519092' are 8-digit numeric strings, consistent with a primary key from a source system. There is no predictive signal here; the long_tail alert simply reflects perfect uniqueness.

Treatment: Drop from modelling; retain only as a join key or row reference.

anthropic:claude-opus-4-7 · confidence high

Out[12]:

saturn.columns["unique_key"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	1,000
top_value	67519092
top_rate	0.001
cardinality	1,000
entropy	9.966
entropy_ratio	1
alert: long_tail	1000 singleton categories

Fig 7.

Top values for unique_key.

Top values for unique_key (20 unique shown, of 1000 total).
value	count	share
67519092	1	0.1%
67525266	1	0.1%
67525239	1	0.1%
67521094	1	0.1%
67519046	1	0.1%
67519060	1	0.1%
67524218	1	0.1%
67524221	1	0.1%
67520253	1	0.1%
67519078	1	0.1%
67522157	1	0.1%
67521098	1	0.1%
67524234	1	0.1%
67523183	1	0.1%
67519091	1	0.1%
67524231	1	0.1%
67520104	1	0.1%
67521095	1	0.1%
67522127	1	0.1%
67523186	1	0.1%

created_date categorical timestamp

This is a record creation timestamp stored as an ISO-8601 string, with 939 unique values across 1000 rows and no nulls. Entropy ratio of 0.995 confirms it is essentially a per-row timestamp; the most frequent value appears only 9 times (0.9%). All visible top values fall within a narrow window from 2026-01-17 to 2026-01-18, suggesting the sample covers only a short ingestion period rather than a broad history.

Treatment: Parse to datetime and derive features (hour, day, delta) rather than using as a categorical.

anthropic:claude-opus-4-7 · confidence high
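
The treatment above can be sketched with the standard library alone. This is a minimal illustration, assuming the timestamps are naive local times in the format shown in the table; the feature names (`hour`, `weekday`, `seconds_since_first`) are hypothetical:

```python
from datetime import datetime

# Three values taken from the created_date table.
raw = [
    "2026-01-17T23:49:56.000",
    "2026-01-17T23:43:55.000",
    "2026-01-18T01:11:10.000",
]

# fromisoformat accepts the millisecond suffix used in this column.
parsed = [datetime.fromisoformat(s) for s in raw]
earliest = min(parsed)

features = [
    {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday == 0
        "seconds_since_first": (ts - earliest).total_seconds(),
    }
    for ts in parsed
]

for row in features:
    print(row)
```

Because the sample spans only a day or two, day-of-week will have almost no variation here; hour of day is likely the more useful derived feature.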

Out[15]:

saturn.columns["created_date"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	939
top_value	2026-01-17T23:49:56.000
top_rate	0.009
cardinality	939
entropy	9.828
entropy_ratio	0.9952
alert: long_tail	889 singleton categories

Fig 8.

Top values for created_date.

Top values for created_date (20 unique shown, of 939 total).
value	count	share
2026-01-17T23:49:56.000	9	0.9%
2026-01-17T23:43:55.000	4	0.4%
2026-01-18T01:11:10.000	3	0.3%
2026-01-17T23:04:06.000	3	0.3%
2026-01-18T01:25:57.000	2	0.2%
2026-01-18T01:25:16.000	2	0.2%
2026-01-18T01:18:23.000	2	0.2%
2026-01-18T01:12:27.000	2	0.2%
2026-01-18T01:08:57.000	2	0.2%
2026-01-18T01:07:55.000	2	0.2%
2026-01-18T01:04:36.000	2	0.2%
2026-01-18T01:02:54.000	2	0.2%
2026-01-18T00:53:20.000	2	0.2%
2026-01-18T00:47:20.000	2	0.2%
2026-01-18T00:38:56.000	2	0.2%
2026-01-18T00:35:43.000	2	0.2%
2026-01-18T00:35:09.000	2	0.2%
2026-01-18T00:31:54.000	2	0.2%
2026-01-18T00:28:06.000	2	0.2%
2026-01-18T00:26:25.000	2	0.2%

agency categorical feature

This is the responding NYC agency code for each record, with 11 distinct values and no nulls. The distribution is severely concentrated: NYPD accounts for 880 of 1000 rows (top_rate 0.88), followed by HPD at 74, leaving the remaining 9 agencies with at most 13 rows each. Entropy ratio of 0.22 confirms the column carries little information beyond 'NYPD vs. not'.

Treatment: Collapse rare agencies into an 'Other' bucket before encoding.

anthropic:claude-opus-4-7 · confidence high
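
The entropy figures for this column can be reproduced from the counts. A minimal sketch, assuming `entropy` is Shannon entropy in bits and `entropy_ratio` is that value divided by log2 of the cardinality; this reading is inferred from the reported numbers, not taken from Saturn's documentation:

```python
import math

# Counts copied from the agency table.
agency_counts = {
    "NYPD": 880, "HPD": 74, "DOHMH": 13, "DOT": 11, "TLC": 6, "DSNY": 6,
    "DHS": 2, "OOS": 2, "DCWP": 2, "DPR": 2, "DEP": 2,
}


def entropy_bits(counts):
    """Shannon entropy of the value distribution, in bits."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum for this cardinality (0 = one value, 1 = uniform)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


h = entropy_bits(agency_counts)
ratio = entropy_ratio(agency_counts)
print(round(h, 4), round(ratio, 3))
```

Under that assumption the result agrees with the reported entropy of 0.7715 and entropy_ratio of 0.223.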

Out[18]:

saturn.columns["agency"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	11
top_value	NYPD
top_rate	0.88
cardinality	11
entropy	0.7715
entropy_ratio	0.223

Fig 9.

Top values for agency.

Top values for agency (11 unique shown, of 11 total).
value	count	share
NYPD	880	88.0%
HPD	74	7.4%
DOHMH	13	1.3%
DOT	11	1.1%
TLC	6	0.6%
DSNY	6	0.6%
DHS	2	0.2%
OOS	2	0.2%
DCWP	2	0.2%
DPR	2	0.2%
DEP	2	0.2%

agency_name categorical feature

This column names the NYC agency handling each record, with 11 distinct agencies and no nulls. The distribution is heavily concentrated: the New York City Police Department accounts for 880 of 1000 rows (88%), and entropy ratio is just 0.22, meaning the field carries little information beyond 'NYPD or not'. Secondary agencies like Housing Preservation and Development (74) and Health and Mental Hygiene (13) trail far behind.

Treatment: Collapse rare agencies into an 'Other' bucket or binarize as NYPD vs non-NYPD before modelling.

anthropic:claude-opus-4-7 · confidence high
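
One way to collapse the rare agencies, sketched on the counts rather than the raw rows. The `collapse_rare` helper and the cutoff of 10 are illustrative assumptions, not Saturn settings:

```python
from collections import Counter

# Counts copied from the agency_name table.
agency_name_counts = {
    "New York City Police Department": 880,
    "Department of Housing Preservation and Development": 74,
    "Department of Health and Mental Hygiene": 13,
    "Department of Transportation": 11,
    "Taxi and Limousine Commission": 6,
    "Department of Sanitation": 6,
    "Department of Homeless Services": 2,
    "Office of the Sheriff": 2,
    "Department of Consumer and Worker Protection": 2,
    "Department of Parks and Recreation": 2,
    "Department of Environmental Protection": 2,
}

MIN_COUNT = 10  # illustrative cutoff


def collapse_rare(counts, min_count, other_label="Other"):
    """Fold every level seen fewer than min_count times into one bucket."""
    collapsed = Counter()
    for label, n in counts.items():
        collapsed[label if n >= min_count else other_label] += n
    return collapsed


collapsed = collapse_rare(agency_name_counts, MIN_COUNT)
print(collapsed.most_common())
```

With this cutoff, seven agencies fold into a single 'Other' level of 22 rows, leaving five levels to encode.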

Out[21]:

saturn.columns["agency_name"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	11
top_value	New York City Police Department
top_rate	0.88
cardinality	11
entropy	0.7715
entropy_ratio	0.223

Fig 10.

Top values for agency_name.

Top values for agency_name (11 unique shown, of 11 total).
value	count	share
New York City Police Department	880	88.0%
Department of Housing Preservation and Development	74	7.4%
Department of Health and Mental Hygiene	13	1.3%
Department of Transportation	11	1.1%
Taxi and Limousine Commission	6	0.6%
Department of Sanitation	6	0.6%
Department of Homeless Services	2	0.2%
Office of the Sheriff	2	0.2%
Department of Consumer and Worker Protection	2	0.2%
Department of Parks and Recreation	2	0.2%
Department of Environmental Protection	2	0.2%

complaint_type categorical label

This is a categorical complaint-type label, almost certainly from a 311-style service request feed, with 45 distinct categories across 1000 rows and no nulls. The distribution is heavily concentrated: "Noise - Residential" alone accounts for 39.3% of records, and the top three values (all noise or parking complaints) cover roughly 74% of the column. Casing is inconsistent across categories (e.g., "HEAT/HOT WATER" and "UNSANITARY CONDITION" in upper case versus title-cased noise variants), which suggests the source merges multiple agency taxonomies.

Treatment: Normalise casing and group the long tail before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
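
The concentration claim is easy to check with a running total over the top values. A small sketch using the first six rows of the complaint_type table; the variable names are illustrative:

```python
from itertools import accumulate

# First six rows of the complaint_type table, in descending order of count.
top_counts = [
    ("Noise - Residential", 393),
    ("Illegal Parking", 197),
    ("Noise - Commercial", 148),
    ("Blocked Driveway", 55),
    ("HEAT/HOT WATER", 49),
    ("Noise - Street/Sidewalk", 44),
]
N_ROWS = 1000  # the column has no nulls, so every row counts

# Cumulative share of rows covered by the top k categories.
cumulative = [running / N_ROWS for running in accumulate(n for _, n in top_counts)]

for (label, _), share in zip(top_counts, cumulative):
    print(f"{label:<26} {share:.1%}")
```

The third entry gives the top-three coverage (73.8%), and the first six categories together cover 88.6% of rows, which is one way to pick where the long-tail bucket should start.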

Out[24]:

saturn.columns["complaint_type"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	45
top_value	Noise - Residential
top_rate	0.393
cardinality	45
entropy	2.935
entropy_ratio	0.5345

Fig 11.

Top values for complaint_type.

Top values for complaint_type (20 unique shown, of 45 total).
value	count	share
Noise - Residential	393	39.3%
Illegal Parking	197	19.7%
Noise - Commercial	148	14.8%
Blocked Driveway	55	5.5%
HEAT/HOT WATER	49	4.9%
Noise - Street/Sidewalk	44	4.4%
Noise - Vehicle	17	1.7%
UNSANITARY CONDITION	12	1.2%
Street Condition	7	0.7%
Non-Emergency Police Matter	7	0.7%
Smoking or Vaping	5	0.5%
Abandoned Vehicle	5	0.5%
Animal-Abuse	5	0.5%
Rodent	4	0.4%
Encampment	4	0.4%
Taxi Complaint	3	0.3%
Dirty Condition	3	0.3%
PAINT/PLASTER	3	0.3%
GENERAL	3	0.3%
Drinking	2	0.2%

descriptor categorical feature

Categorical descriptor field enumerating specific complaint subtypes (e.g., 'Loud Music/Party', 'Blocked Hydrant', 'Posted Parking Sign Violation'), likely a secondary classification under a parent complaint type. Distribution is heavily skewed: 'Loud Music/Party' alone covers 42.7% of 1000 rows, and the top 3 values together account for 62%, leaving the remaining 68 of 71 distinct categories as a long thin tail. Note the inconsistent casing ('ENTIRE BUILDING', 'APARTMENT ONLY' vs. title-case entries), suggesting upstream data entry from multiple systems.

Treatment: Normalize casing, then group rare levels into 'Other' before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["descriptor"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	71
top_value	Loud Music/Party
top_rate	0.427
cardinality	71
entropy	3.54
entropy_ratio	0.5756

Fig 12.

Top values for descriptor.

Top values for descriptor (20 unique shown, of 71 total).
value	count	share
Loud Music/Party	427	42.7%
Banging/Pounding	114	11.4%
Blocked Hydrant	79	7.9%
No Access	45	4.5%
Posted Parking Sign Violation	42	4.2%
Loud Talking	36	3.6%
ENTIRE BUILDING	35	3.5%
Blocked Sidewalk	21	2.1%
Commercial Overnight Parking	20	2.0%
APARTMENT ONLY	14	1.4%
Car/Truck Music	13	1.3%
Double Parked Blocking Traffic	13	1.3%
Loud Television	10	1.0%
Partial Access	10	1.0%
PESTS	10	1.0%
Blocked Crosswalk	9	0.9%
Pothole	6	0.6%
Allowed in Smoke Free Area	5	0.5%
With License Plate	5	0.5%
No Shelter	5	0.5%

location_type categorical feature

Categorical location descriptor with 20 distinct values and only 1.4% nulls; 'Residential Building/House' dominates at 40.2% of the 986 non-null rows (396 records), followed by 'Street/Sidewalk' with 324 records (32.9% of non-null rows). The vocabulary is inconsistent: 'RESIDENTIAL BUILDING' (74), 'Residential Building' (7), and 'Residential Building/House' (396) are clearly the same concept stored three ways, and 'Street' (8) overlaps with 'Street/Sidewalk' (324). Entropy ratio of 0.52 confirms heavy concentration in the top two buckets.

Treatment: Normalize casing and merge synonym variants (e.g., collapse the three 'Residential Building' forms) before encoding.

anthropic:claude-opus-4-7 · confidence high
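
A sketch of the merge described in the treatment, applied to the counts from the table. The `SYNONYMS` map is an assumption: it covers only the three 'Residential Building' forms named above, and any other merge (for example 'Street' into 'Street/Sidewalk') would be a judgment call to add separately:

```python
from collections import Counter

# A subset of the location_type table.
counts = {
    "Residential Building/House": 396,
    "Street/Sidewalk": 324,
    "Club/Bar/Restaurant": 91,
    "RESIDENTIAL BUILDING": 74,
    "Store/Commercial": 60,
    "Street": 8,
    "Residential Building": 7,
}

# Case-folded variant -> canonical label (illustrative mapping).
SYNONYMS = {"residential building": "residential building/house"}


def canonical(label):
    """Collapse whitespace, fold case, then apply the synonym map."""
    key = " ".join(label.split()).casefold()
    return SYNONYMS.get(key, key)


merged = Counter()
for label, n in counts.items():
    merged[canonical(label)] += n

print(merged.most_common())
```

After the merge the residential bucket holds 477 rows (396 + 74 + 7).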

Out[30]:

saturn.columns["location_type"].stats

stat	value
n	1,000
nulls	14 (1.4%)
unique	20
top_value	Residential Building/House
top_rate	0.4016
cardinality	20
entropy	2.237
entropy_ratio	0.5177

Fig 13.

Top values for location_type.

Top values for location_type (20 unique shown, of 20 total).
value	count	share
Residential Building/House	396	39.6%
Street/Sidewalk	324	32.4%
Club/Bar/Restaurant	91	9.1%
RESIDENTIAL BUILDING	74	7.4%
Store/Commercial	60	6.0%
Street	8	0.8%
Residential Building	7	0.7%
Sidewalk	5	0.5%
3+ Family Apartment Building	3	0.3%
House and Store	3	0.3%
3+ Family Apt. Building	2	0.2%
Business	2	0.2%
Subway	2	0.2%
Other	2	0.2%
Park/Playground	2	0.2%
1-2 Family Mixed Use Building	1	0.1%
Restaurant/Bar/Deli/Bakery	1	0.1%
Bridge	1	0.1%
Taxi	1	0.1%
1-2 Family Dwelling	1	0.1%

incident_zip categorical feature

Holds NYC ZIP codes tied to incidents, with 146 distinct values across 1000 rows and only 0.3% missing. Distribution is diffuse: the entropy ratio is 0.93 and the top ZIP '10011' captures just 4.2%, so no single neighborhood dominates. The top 20 ZIPs span Manhattan, Brooklyn, Queens, and the Bronx, indicating broad multi-borough coverage; Staten Island, with only 11 rows in the sample, does not appear among them.

Treatment: Treat as categorical; group by borough or target-encode before modelling rather than one-hot across 146 levels.

anthropic:claude-opus-4-7 · confidence high

Out[33]:

saturn.columns["incident_zip"].stats

stat	value
n	1,000
nulls	3 (0.3%)
unique	146
top_value	10011
top_rate	0.04213
cardinality	146
entropy	6.696
entropy_ratio	0.9313

Fig 14.

Top values for incident_zip.

Top values for incident_zip (20 unique shown, of 146 total).
value	count	share
10011	42	4.2%
11421	35	3.5%
11385	29	2.9%
10463	22	2.2%
10031	22	2.2%
11368	21	2.1%
10456	20	2.0%
10009	19	1.9%
10462	17	1.7%
10002	16	1.6%
11206	15	1.5%
11212	15	1.5%
11226	14	1.4%
11375	14	1.4%
10012	14	1.4%
11373	14	1.4%
10461	13	1.3%
10452	12	1.2%
10468	12	1.2%
10034	12	1.2%

incident_address categorical metadata

Street addresses for incident locations, with 776 unique values across 1000 rows and only 0.5% nulls. Entropy ratio of 0.97 confirms a long tail (664 addresses appear exactly once), but a few hotspots stand out, notably '126 WEST 13 STREET' with 31 occurrences (3.1% of rows), far above the next most frequent at 13. Note inconsistent whitespace formatting (e.g., '60 EAST 93 STREET', where any repeated spaces are collapsed in this rendering), suggesting unnormalized input.

Treatment: Normalize whitespace/casing and geocode rather than using raw strings as a feature.

anthropic:claude-opus-4-7 · confidence high

Out[36]:

saturn.columns["incident_address"].stats

stat	value
n	1,000
nulls	5 (0.5%)
unique	776
top_value	126 WEST 13 STREET
top_rate	0.03116
cardinality	776
entropy	9.321
entropy_ratio	0.9709
alert: long_tail	664 singleton categories

Fig 15.

Top values for incident_address.

Top values for incident_address (20 unique shown, of 776 total).
value	count	share
126 WEST 13 STREET	31	3.1%
60 EAST 93 STREET	13	1.3%
71 LENOX AVENUE	8	0.8%
105 MACDOUGAL STREET	8	0.8%
1465 WASHINGTON AVENUE	8	0.8%
2918 BRUCKNER BOULEVARD	7	0.7%
190 AVENUE OF THE AMERICAS	5	0.5%
235 COURT STREET	5	0.5%
74-03 85 DRIVE	5	0.5%
85 TOMPKINS AVENUE	5	0.5%
1365 5 AVENUE	4	0.4%
112-17 NORTHERN BOULEVARD	4	0.4%
182 NAGLE AVENUE	4	0.4%
108-26 159 STREET	4	0.4%
402 ONDERDONK AVENUE	4	0.4%
465 SENECA AVENUE	3	0.3%
2140 MATTHEWS AVENUE	3	0.3%
28-10 JACKSON AVENUE	3	0.3%
62-11 108 STREET	3	0.3%
499 MYRTLE AVENUE	3	0.3%

street_name categorical metadata

Street name field, almost certainly a NYC address component given entries like WEST 13 STREET, JAMAICA AVENUE, and BRUCKNER BOULEVARD. Extremely high-cardinality (569 unique across 1000 rows, entropy ratio 0.955) with a long tail and no dominant value — the top entry only covers 3.1% of rows. Watch for inconsistent whitespace (e.g. "EAST 93 STREET" has multiple spaces) which will fragment otherwise-equal values.

Treatment: Normalise whitespace and casing, then either group by borough/zip or drop for modelling — too high-cardinality to one-hot.

anthropic:claude-opus-4-7 · confidence high
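
Whitespace and casing normalization needs only a few lines. In this sketch the raw strings are hypothetical variants of one street name, written out to show the kind of fragmentation described above; they are not copied from the source rows:

```python
def normalize_street(value):
    """Collapse runs of whitespace, trim the ends, and upper-case the name."""
    if value is None:
        return None
    return " ".join(value.split()).upper()


# Hypothetical raw variants of the same street.
raw = ["EAST   93 STREET", "EAST 93 STREET", " east 93  street ", None]

normalized = [normalize_street(v) for v in raw]
distinct = {v for v in normalized if v is not None}
print(distinct)
```

All three non-null variants collapse to a single value, so they would count as one category instead of three.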

Out[39]:

saturn.columns["street_name"].stats

stat	value
n	1,000
nulls	5 (0.5%)
unique	569
top_value	WEST 13 STREET
top_rate	0.03116
cardinality	569
entropy	8.742
entropy_ratio	0.9551
alert: long_tail	371 singleton categories

Fig 16.

Top values for street_name.

Top values for street_name (20 unique shown, of 569 total).
value	count	share
WEST 13 STREET	31	3.1%
EAST 93 STREET	13	1.3%
JAMAICA AVENUE	11	1.1%
WASHINGTON AVENUE	11	1.1%
LENOX AVENUE	10	1.0%
MACDOUGAL STREET	10	1.0%
BROADWAY	9	0.9%
BRUCKNER BOULEVARD	8	0.8%
BROOKLYN AVENUE	7	0.7%
NORTHERN BOULEVARD	7	0.7%
76 STREET	7	0.7%
COURT STREET	6	0.6%
EAST 74 STREET	6	0.6%
GRAND STREET	5	0.5%
BUSHWICK AVENUE	5	0.5%
EAST 3 STREET	5	0.5%
SENECA AVENUE	5	0.5%
111 AVENUE	5	0.5%
AVENUE OF THE AMERICAS	5	0.5%
AMSTERDAM AVENUE	5	0.5%

cross_street_1 categorical metadata

This column captures the name of one cross street, almost certainly part of a NYC location reference (entries like 'AVENUE OF THE AMERICAS', 'BLEECKER STREET', and 'BROADWAY' are giveaways). Cardinality is extreme: 511 unique values across 1000 rows with entropy ratio 0.95, and the most common street accounts for only 3.8% of the 920 non-null rows (35 records). 8% of values are null, and the long_tail alert flags 324 values that appear exactly once.

Treatment: Standardize casing and group rare streets into an 'OTHER' bucket before any encoding; do not one-hot raw.

anthropic:claude-opus-4-7 · confidence high
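
The stats block and the top-values table for this column report slightly different percentages for the same value. A short sketch of why, assuming `top_rate` is computed over non-null rows while the table's share column is computed over all rows; that reading is inferred from the numbers, not documented here:

```python
def rates(top_count, n_rows, n_nulls):
    """Return (rate over non-null rows, share over all rows) for the top value."""
    non_null = n_rows - n_nulls
    return top_count / non_null, top_count / n_rows


# cross_street_1: 'AVENUE OF THE AMERICAS' appears 35 times; 80 of 1,000 rows are null.
top_rate, share = rates(top_count=35, n_rows=1000, n_nulls=80)
print(round(top_rate, 5), round(share, 3))
```

The first figure matches the reported top_rate of 0.03804 and the second matches the 3.5% share in the table, so the two numbers differ only in their denominator.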

Out[42]:

saturn.columns["cross_street_1"].stats

stat	value
n	1,000
nulls	80 (8.0%)
unique	511
top_value	AVENUE OF THE AMERICAS
top_rate	0.03804
cardinality	511
entropy	8.574
entropy_ratio	0.953
alert: long_tail	324 singleton categories

Fig 17.

Top values for cross_street_1.

Top values for cross_street_1 (20 unique shown, of 511 total).
value	count	share
AVENUE OF THE AMERICAS	35	3.5%
BLEECKER STREET	11	1.1%
AMSTERDAM AVENUE	11	1.1%
WEST 113 STREET	8	0.8%
HIMROD STREET	8	0.8%
112 STREET	8	0.8%
ST PAULS PLACE	8	0.8%
DEXTER COURT	8	0.8%
EAST TREMONT AVENUE	7	0.7%
BROADWAY	7	0.7%
BEND	6	0.6%
107 AVENUE	6	0.6%
80 STREET	6	0.6%
AVENUE B	6	0.6%
3 AVENUE	6	0.6%
EAST 182 STREET	5	0.5%
5 AVENUE	5	0.5%
4 AVENUE	5	0.5%
VANDAM STREET	5	0.5%
DEAD END	5	0.5%

cross_street_2 categorical feature

Free-form NYC street names serving as the second cross street in a location pair, with 504 distinct values across 921 non-null rows and a 7.9% null rate. Distribution is long-tailed and high-entropy (entropy ratio 0.948); the most common value '7 AVENUE' covers only 3.9% of non-null rows (36 records), and 'DEAD END' appearing 10 times suggests non-street sentinel values mixed in.

Treatment: Treat as high-cardinality categorical: hash or target-encode rather than one-hot, and normalise sentinels like 'DEAD END' before modelling.

anthropic:claude-opus-4-7 · confidence high
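
Sentinel handling can be sketched as a small cleaning function. The `SENTINELS` set below contains only 'DEAD END', the one value flagged in the text; whether other entries such as 'BEND' belong in it is an open question this sketch does not settle:

```python
# Values that stand in for "no cross street" rather than naming one (assumed set).
SENTINELS = {"DEAD END"}


def clean_cross_street(value):
    """Normalize whitespace and casing; map sentinels and nulls to None."""
    if value is None:
        return None
    name = " ".join(value.split()).upper()
    return None if name in SENTINELS else name


sample = ["7 AVENUE", "DEAD END", "dead  end", "MINETTA LANE", None]
cleaned = [clean_cross_street(v) for v in sample]
print(cleaned)
```

Mapping sentinels to None keeps them out of joins on street name and lets the encoder treat them like any other missing value.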

Out[45]:

saturn.columns["cross_street_2"].stats

stat	value
n	1,000
nulls	79 (7.9%)
unique	504
top_value	7 AVENUE
top_rate	0.03909
cardinality	504
entropy	8.511
entropy_ratio	0.9481
alert: long_tail	321 singleton categories

Fig 18.

Top values for cross_street_2.

Top values for cross_street_2 (20 unique shown, of 504 total).
value	count	share
7 AVENUE	36	3.6%
BROADWAY	19	1.9%
MINETTA LANE	10	1.0%
DEAD END	10	1.0%
109 AVENUE	9	0.9%
EAST 171 STREET	9	0.9%
WEST 114 STREET	8	0.8%
HARMAN STREET	8	0.8%
75 STREET	8	0.8%
EDISON AVENUE	7	0.7%
112 PLACE	7	0.7%
BEND	7	0.7%
AVENUE D	6	0.6%
DITMAS AVENUE	6	0.6%
EAST 174 STREET	6	0.6%
10 AVENUE	6	0.6%
3 AVENUE	6	0.6%
1 AVENUE	6	0.6%
EAST 170 STREET	6	0.6%
BALTIC STREET	6	0.6%

intersection_street_1 categorical metadata

Street-name field, almost certainly the first cross street of an NYC intersection given values like AVENUE OF THE AMERICAS, BROADWAY, and AMSTERDAM AVENUE. Cardinality is extreme: 506 distinct names across 1000 rows with entropy ratio 0.95, and the most frequent value covers only 3.8% of the 915 non-null records (35 rows). Null rate is 8.5%, and the long_tail alert flags 319 names that appear exactly once.

Treatment: Group rare streets into an 'OTHER' bucket or geocode to borough/zone before using as a feature.

anthropic:claude-opus-4-7 · confidence high

Out[48]:

saturn.columns["intersection_street_1"].stats

stat	value
n	1,000
nulls	85 (8.5%)
unique	506
top_value	AVENUE OF THE AMERICAS
top_rate	0.03825
cardinality	506
entropy	8.558
entropy_ratio	0.9527
alert: long_tail	319 singleton categories

Fig 19.

Top values for intersection_street_1.

Top values for intersection_street_1 (20 unique shown, of 506 total).
value	count	share
AVENUE OF THE AMERICAS	35	3.5%
BLEECKER STREET	11	1.1%
AMSTERDAM AVENUE	11	1.1%
WEST 113 STREET	8	0.8%
HIMROD STREET	8	0.8%
112 STREET	8	0.8%
ST PAULS PLACE	8	0.8%
DEXTER COURT	8	0.8%
EAST TREMONT AVENUE	7	0.7%
BROADWAY	7	0.7%
BEND	6	0.6%
5 AVENUE	6	0.6%
107 AVENUE	6	0.6%
80 STREET	6	0.6%
AVENUE B	6	0.6%
3 AVENUE	6	0.6%
EAST 182 STREET	5	0.5%
4 AVENUE	5	0.5%
VANDAM STREET	5	0.5%
DEAD END	5	0.5%

intersection_street_2 categorical feature

This is the second street name of an intersection, almost certainly NYC street geography given entries like '7 AVENUE', 'BROADWAY', and 'EAST 171 STREET'. Cardinality is extreme (499 unique values across 1000 rows, entropy ratio 0.95) with no dominant value: the top entry '7 AVENUE' covers only 3.9% of the 916 non-null rows (36 records), and 8.4% of rows are null. Notably, 'DEAD END' appears 10 times as a sentinel rather than a real street, which will distort any join on street name.

Treatment: Treat as high-cardinality location text: normalize sentinels like 'DEAD END' and target-encode or geocode rather than one-hot.

anthropic:claude-opus-4-7 · confidence high

Out[51]:

saturn.columns["intersection_street_2"].stats

stat	value
n	1,000
nulls	84 (8.4%)
unique	499
top_value	7 AVENUE
top_rate	0.0393
cardinality	499
entropy	8.495
entropy_ratio	0.9478
alert: long_tail	316 singleton categories

Fig 20.

Top values for intersection_street_2.

Top values for intersection_street_2 (20 unique shown, of 499 total).
value	count	share
7 AVENUE	36	3.6%
BROADWAY	19	1.9%
MINETTA LANE	10	1.0%
DEAD END	10	1.0%
109 AVENUE	9	0.9%
EAST 171 STREET	9	0.9%
WEST 114 STREET	8	0.8%
HARMAN STREET	8	0.8%
75 STREET	8	0.8%
EDISON AVENUE	7	0.7%
112 PLACE	7	0.7%
BEND	7	0.7%
AVENUE D	6	0.6%
DITMAS AVENUE	6	0.6%
EAST 174 STREET	6	0.6%
10 AVENUE	6	0.6%
3 AVENUE	6	0.6%
1 AVENUE	6	0.6%
EAST 170 STREET	6	0.6%
BALTIC STREET	6	0.6%

address_type categorical feature

Categorical descriptor of how a location is specified, with four values: ADDRESS, INTERSECTION, BLOCKFACE, and PLACE. The distribution is severely imbalanced: ADDRESS accounts for 97.2% of the 998 non-null rows (970 records), leaving only 28 non-ADDRESS records and an entropy ratio of 0.108. Null rate is negligible at 0.2%.

Treatment: Collapse to a binary ADDRESS-vs-other flag or drop, since the minority levels carry little signal.

anthropic:claude-opus-4-7 · confidence high

Out[54]:

saturn.columns["address_type"].stats

stat	value
n	1,000
nulls	2 (0.2%)
unique	4
top_value	ADDRESS
top_rate	0.9719
cardinality	4
entropy	0.2169
entropy_ratio	0.1084
alert: imbalance	top value is 97.2% of rows

Fig 21.

Top values for address_type.

Top values for address_type (4 unique shown, of 4 total).
value	count	share
ADDRESS	970	97.0%
INTERSECTION	19	1.9%
BLOCKFACE	7	0.7%
PLACE	2	0.2%

city categorical feature

This is a NYC-centric city field with 36 distinct uppercase values and a 2.1% null rate. The distribution is dominated by Brooklyn (25.5% of the 979 non-null rows), New York, and Bronx, which together account for roughly 70% of the 1000 rows (697 records); the entropy ratio of 0.63 reflects this concentration on a few boroughs, while the long tail covers Queens neighborhoods like Woodhaven, Jamaica, and Astoria. Note the values mix borough names with neighborhood names rather than a consistent geographic level.

Treatment: Group rare neighborhoods into an 'other' bucket and one-hot or target-encode the top categories.

anthropic:claude-opus-4-7 · confidence high

Out[57]:

saturn.columns["city"].stats

stat	value
n	1,000
nulls	21 (2.1%)
unique	36
top_value	BROOKLYN
top_rate	0.2554
cardinality	36
entropy	3.282
entropy_ratio	0.6349

Fig 22.

Top values for city.

Top values for city (20 unique shown, of 36 total).
value	count	share
BROOKLYN	250	25.0%
NEW YORK	249	24.9%
BRONX	198	19.8%
WOODHAVEN	35	3.5%
JAMAICA	29	2.9%
RIDGEWOOD	29	2.9%
CORONA	20	2.0%
ASTORIA	14	1.4%
ELMHURST	14	1.4%
FOREST HILLS	11	1.1%
STATEN ISLAND	11	1.1%
SOUTH OZONE PARK	11	1.1%
OZONE PARK	9	0.9%
SOUTH RICHMOND HILL	9	0.9%
REGO PARK	9	0.9%
JACKSON HEIGHTS	9	0.9%
RICHMOND HILL	7	0.7%
COLLEGE POINT	6	0.6%
QUEENS	6	0.6%
WOODSIDE	6	0.6%

landmark categorical metadata

This column holds NYC street names used as landmark references, with 515 unique values across 895 non-null rows and a 10.5% null rate. The distribution is extremely flat (entropy ratio 0.955): the most common value, 'WEST 13 STREET', accounts for only 3.5% of non-null records (31 rows), and a long_tail alert is raised. No single landmark dominates, so this behaves more like a high-cardinality location descriptor than a categorical feature.

Treatment: Treat as high-cardinality text; group rare values or geocode rather than one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[60]:

saturn.columns["landmark"].stats

stat	value
n	1,000
nulls	105 (10.5%)
unique	515
top_value	WEST 13 STREET
top_rate	0.03464
cardinality	515
entropy	8.603
entropy_ratio	0.955
alert: long_tail	335 singleton categories

Fig 23.

Top values for landmark.

Top values for landmark (20 unique shown, of 515 total).
value	count	share
WEST 13 STREET	31	3.1%
JAMAICA AVENUE	11	1.1%
WASHINGTON AVENUE	11	1.1%
LENOX AVENUE	10	1.0%
MAC DOUGAL STREET	10	1.0%
BRUCKNER BOULEVARD	8	0.8%
BROADWAY	8	0.8%
BROOKLYN AVENUE	7	0.7%
NORTHERN BOULEVARD	7	0.7%
76 STREET	7	0.7%
COURT STREET	6	0.6%
EAST 74 STREET	6	0.6%
GRAND STREET	5	0.5%
BUSHWICK AVENUE	5	0.5%
EAST 3 STREET	5	0.5%
SENECA AVENUE	5	0.5%
111 AVENUE	5	0.5%
AVENUE OF THE AMERICAS	5	0.5%
MYRTLE AVENUE	5	0.5%
90 STREET	5	0.5%

status categorical feature

This is a ticket/case lifecycle status field with three states: Closed, In Progress, and Open. The distribution is heavily weighted toward Closed (610 of 1000, 61%), with Open being notably rare at just 85 records, suggesting most cases reach resolution. No nulls and clean cardinality of 3 make this a tidy categorical signal.

Treatment: One-hot or ordinal encode (Open < In Progress < Closed) before modelling.

anthropic:claude-opus-4-7 · confidence high
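
The ordinal encoding suggested in the treatment can be written as a plain lookup. A minimal sketch; the integer codes are arbitrary as long as they preserve the stated order, and the strict lookup is a design choice so that an unexpected status fails loudly instead of being silently encoded:

```python
# Lifecycle order from the treatment: Open < In Progress < Closed.
STATUS_ORDER = {"Open": 0, "In Progress": 1, "Closed": 2}


def encode_status(value):
    """Map a status label to its ordinal code; raises KeyError on unknown labels."""
    return STATUS_ORDER[value]


sample = ["Closed", "In Progress", "Open", "Closed"]
encoded = [encode_status(s) for s in sample]
print(encoded)
```

Ordinal codes suit tree-based models; for linear models, one-hot encoding avoids implying that the gap between Open and In Progress equals the gap between In Progress and Closed.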

Out[63]:

saturn.columns["status"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	3
top_value	Closed
top_rate	0.61
cardinality	3
entropy	1.26
entropy_ratio	0.7948

Fig 24.

Top values for status.

Top values for status (3 unique shown, of 3 total).
value	count	share
Closed	610	61.0%
In Progress	305	30.5%
Open	85	8.5%

community_board categorical metadata

This column encodes NYC community board identifiers as a numeric prefix plus borough name (e.g., '02 MANHATTAN'), with 65 distinct values across 1000 rows and no nulls. The distribution is fairly flat — entropy ratio is 0.93 and the top value accounts for only 5.6% — suggesting wide geographic coverage rather than concentration in one district. Manhattan and Queens boards dominate the top of the leaderboard, with '02 MANHATTAN' (56) and '09 QUEENS' (48) leading.

Treatment: Split into borough and board-number components, then one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[66]:

saturn.columns["community_board"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	65
top_value	02 MANHATTAN
top_rate	0.056
cardinality	65
entropy	5.61
entropy_ratio	0.9316

Fig 25.

Top values for community_board.

Top values for community_board (20 unique shown, of 65 total).
value	count	share
02 MANHATTAN	56	5.6%
09 QUEENS	48	4.8%
03 MANHATTAN	38	3.8%
05 QUEENS	38	3.8%
12 QUEENS	30	3.0%
03 BRONX	29	2.9%
12 MANHATTAN	29	2.9%
10 MANHATTAN	28	2.8%
08 BRONX	28	2.8%
09 MANHATTAN	28	2.8%
17 BROOKLYN	27	2.7%
10 QUEENS	27	2.7%
03 QUEENS	25	2.5%
01 BROOKLYN	24	2.4%
06 QUEENS	24	2.4%
11 BROOKLYN	23	2.3%
04 BROOKLYN	22	2.2%
09 BRONX	21	2.1%
04 QUEENS	20	2.0%
10 BRONX	20	2.0%
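
The split suggested in the community_board treatment might look like the sketch below. It assumes every value follows the 'NN BOROUGH' pattern seen in the top-20 table; values with a non-numeric prefix (hypothetical here, none are shown) are returned with a board number of None.

```python
def split_community_board(value):
    """Split '02 MANHATTAN' into (2, 'MANHATTAN').

    Returns (None, value) when the first token is not numeric, so that
    unexpected formats are kept visible instead of being dropped.
    """
    prefix, _, borough = value.partition(" ")
    if not prefix.isdigit() or not borough:
        return None, value
    return int(prefix), borough

board, borough = split_community_board("02 MANHATTAN")
```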

council_district categorical feature

Categorical codes representing council districts, stored as zero-padded two-digit strings across 51 distinct values with a 0.9% null rate. The distribution is nearly uniform (entropy ratio 0.954), with the top district '03' accounting for only 5.65% of non-null rows (56 of 991, or 5.6% of all 1,000 rows), suggesting broad geographic coverage rather than concentration. No single district dominates, which is consistent with a city-wide sample.

Treatment: One-hot or target-encode for modelling; impute the small null share with an explicit 'unknown' category.

anthropic:claude-opus-4-7 · confidence high

Out[69]:

saturn.columns["council_district"].stats

stat	value
n	1,000
nulls	9 (0.9%)
unique	51
top_value	03
top_rate	0.05651
cardinality	51
entropy	5.412
entropy_ratio	0.9541

Fig 26.

Top values for council_district.

Top values for council_district (20 unique shown, of 51 total).
value	count	share
03	56	5.6%
32	48	4.8%
02	39	3.9%
34	38	3.8%
13	36	3.6%
10	36	3.6%
07	33	3.3%
16	32	3.2%
09	32	3.2%
28	30	3.0%
01	28	2.8%
11	26	2.6%
30	25	2.5%
21	25	2.5%
17	24	2.4%
14	24	2.4%
47	24	2.4%
41	23	2.3%
35	22	2.2%
18	22	2.2%
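
The top_rate in the stats table (0.05651) and the share in the top-values table (5.6%) differ because they appear to use different denominators. The arithmetic below assumes top_rate divides by non-null rows while share divides by all rows. That reading is inferred from the numbers for council_district, not from saturn's documentation.

```python
def top_rate_and_share(top_count, n_rows, n_nulls):
    """Return (count / non-null rows, count / all rows)."""
    non_null = n_rows - n_nulls
    return top_count / non_null, top_count / n_rows

# council_district: top value '03' has 56 rows; 9 of 1,000 rows are null.
rate, share = top_rate_and_share(top_count=56, n_rows=1000, n_nulls=9)
print(f"{rate:.5f} {share:.3f}")
```

The same two-denominator reading also accounts for the gap between top_rate and share in the sparser columns of this notebook, such as bbl and descriptor_2.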

police_precinct categorical feature

This is a categorical column tagging each record with one of 76 NYPD-style police precincts, with no nulls across 1000 rows. The distribution is remarkably flat — entropy ratio 0.9358 and the modal value 'Precinct 6' covering just 5% — meaning no precinct dominates and counts decay slowly from 50 down to the mid-20s among the top 10.

Treatment: Treat as a high-cardinality categorical: target- or frequency-encode rather than one-hot before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[72]:

saturn.columns["police_precinct"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	76
top_value	Precinct 6
top_rate	0.05
cardinality	76
entropy	5.847
entropy_ratio	0.9358

Fig 27.

Top values for police_precinct.

Top values for police_precinct (20 unique shown, of 76 total).
value	count	share
Precinct 6	50	5.0%
Precinct 102	48	4.8%
Precinct 104	38	3.8%
Precinct 42	29	2.9%
Precinct 50	28	2.8%
Precinct 67	27	2.7%
Precinct 106	27	2.7%
Precinct 30	26	2.6%
Precinct 110	24	2.4%
Precinct 112	24	2.4%
Precinct 62	23	2.3%
Precinct 115	23	2.3%
Precinct 83	22	2.2%
Precinct 43	21	2.1%
Precinct 9	20	2.0%
Precinct 103	20	2.0%
Precinct 45	20	2.0%
Precinct 49	19	1.9%
Precinct 34	19	1.9%
Precinct 32	18	1.8%
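
Frequency encoding, one of the two options named in the police_precinct treatment, replaces each category with its relative frequency. This is a minimal standard-library sketch; the sample rows are illustrative, not taken from the data.

```python
from collections import Counter

def frequency_encode(values):
    """Map each category to its relative frequency among non-null values.

    Nulls stay None so the caller can decide how to impute them.
    """
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    return [counts[v] / total if v is not None else None for v in values]

sample = ["Precinct 6", "Precinct 6", "Precinct 102", None]  # illustrative rows
encoded = frequency_encode(sample)
```

With 76 categories this yields one numeric column instead of 76 indicator columns. In a modelling setting the frequencies would normally be computed on the training split only.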

bbl categorical foreign_key

This is the NYC BBL (Borough-Block-Lot) parcel identifier, stored as a 10-digit string where the leading digit encodes the borough (1-5). Cardinality is very high (719 unique across 945 non-null rows, entropy ratio 0.968), but a long tail is flagged: one parcel '1006080026' accounts for 31 rows (3.3% of non-null rows, 3.1% of all rows) and several others recur 5-13 times, so rows are not one-per-parcel. Null rate is 5.5%.

Treatment: Treat as a parcel key for left-joining to PLUTO/property tables; do not one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[75]:

saturn.columns["bbl"].stats

stat	value
n	1,000
nulls	55 (5.5%)
unique	719
top_value	1006080026
top_rate	0.0328
cardinality	719
entropy	9.189
entropy_ratio	0.9683
alert: long_tail	606 singleton categories

Fig 28.

Top values for bbl.

Top values for bbl (20 unique shown, of 719 total).
value	count	share
1006080026	31	3.1%
3045950215	13	1.3%
1018230033	8	0.8%
1005420048	8	0.8%
2029020036	8	0.8%
2054190122	7	0.7%
4017067501	7	0.7%
1005040011	5	0.5%
3003960004	5	0.5%
4088380065	5	0.5%
3017400001	5	0.5%
1016180001	4	0.4%
1022170047	4	0.4%
4101460051	4	0.4%
1021700112	4	0.4%
4034270030	4	0.4%
4034300001	3	0.3%
2043230014	3	0.3%
2039437501	3	0.3%
2024090096	3	0.3%
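
Because the BBL is a fixed-width key, it can be decomposed before joining. The sketch below assumes the standard layout of 1 borough digit, 5 block digits, and 4 lot digits, with the usual borough codes; the function name is hypothetical.

```python
BOROUGH_BY_CODE = {
    "1": "MANHATTAN",
    "2": "BRONX",
    "3": "BROOKLYN",
    "4": "QUEENS",
    "5": "STATEN ISLAND",
}

def parse_bbl(bbl):
    """Split a 10-digit BBL string into (borough, block, lot), or None if malformed."""
    if bbl is None or len(bbl) != 10 or not bbl.isdigit():
        return None
    borough = BOROUGH_BY_CODE.get(bbl[0])
    if borough is None:
        return None
    return borough, int(bbl[1:6]), int(bbl[6:])

top_parcel = parse_bbl("1006080026")  # most frequent BBL in this sample
```

Keeping the original string as the join key and using the parsed parts only as derived features avoids losing the zero padding.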

borough categorical feature

This column lists one of the five NYC boroughs for each of the 1000 rows, with no nulls and only 5 unique values. The distribution is fairly even across QUEENS (274), MANHATTAN (258), BROOKLYN (254), and BRONX (203), but STATEN ISLAND is sharply underrepresented at just 11 rows. High entropy ratio (0.886) confirms the top four categories are well balanced.

Treatment: One-hot encode; consider grouping or monitoring STATEN ISLAND given its low support.

anthropic:claude-opus-4-7 · confidence high

Out[78]:

saturn.columns["borough"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	5
top_value	QUEENS
top_rate	0.274
cardinality	5
entropy	2.057
entropy_ratio	0.8858

Fig 29.

Top values for borough.

Top values for borough (5 unique shown, of 5 total).
value	count	share
QUEENS	274	27.4%
MANHATTAN	258	25.8%
BROOKLYN	254	25.4%
BRONX	203	20.3%
STATEN ISLAND	11	1.1%
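
One way to act on the low support for STATEN ISLAND is to relabel any category below a minimum share. The 5% threshold and the OTHER label are illustrative assumptions, not values suggested by saturn.

```python
def group_rare(counts, min_share=0.05, other_label="OTHER"):
    """Map each category to itself, or to other_label if its share is below min_share."""
    total = sum(counts.values())
    return {
        value: (value if count / total >= min_share else other_label)
        for value, count in counts.items()
    }

borough_counts = {
    "QUEENS": 274,
    "MANHATTAN": 258,
    "BROOKLYN": 254,
    "BRONX": 203,
    "STATEN ISLAND": 11,
}
mapping = group_rare(borough_counts)
```

With only one rare category this is effectively a rename, so monitoring its 11 rows separately may be just as useful as grouping.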

x_coordinate_state_plane categorical feature

This column holds X-coordinates in a state plane projection, stored as strings rather than numbers. With 763 unique values across 993 non-null rows and entropy ratio 0.97, it is near-unique; the most frequent value '984721' appears only 31 times (3.1%). The long_tail alert is consistent with a near-continuous spatial measurement that has been categorically encoded.

Treatment: Cast to numeric and pair with the Y-coordinate as a 2D spatial feature.

anthropic:claude-opus-4-7 · confidence high

Out[81]:

saturn.columns["x_coordinate_state_plane"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	763
top_value	984721
top_rate	0.03122
cardinality	763
entropy	9.292
entropy_ratio	0.9704
alert: long_tail	643 singleton categories

Fig 30.

Top values for x_coordinate_state_plane.

Top values for x_coordinate_state_plane (20 unique shown, of 763 total).
value	count	share
984721	31	3.1%
1004501	13	1.3%
997870	8	0.8%
984032	8	0.8%
1010885	8	0.8%
1031969	7	0.7%
983238	5	0.5%
985877	5	0.5%
1020825	5	0.5%
999077	5	0.5%
998703	4	0.4%
1023829	4	0.4%
1005224	4	0.4%
1041463	4	0.4%
1008285	4	0.4%
1008323	3	0.3%
995971	3	0.3%
1022177	3	0.3%
1007992	3	0.3%
1001276	3	0.3%

y_coordinate_state_plane categorical feature

Column holds State Plane y-coordinates (northings) stored as strings rather than numerics, with 768 unique values across 1000 rows and a 0.7% null rate. Entropy ratio of 0.97 and a top_rate of just 3.1% confirm a near-uniform long tail typical of geographic coordinates. The modal value '207809' repeats 31 times, the same count as the top x-coordinate, latitude, longitude, BBL, and landmark values, so it may be one heavily reported address rather than a default or geocoder fallback location; either way it is worth investigating.

Treatment: Cast to numeric and treat as a continuous spatial coordinate; inspect the repeated 207809 value for geocoding artifacts.

anthropic:claude-opus-4-7 · confidence high

Out[84]:

saturn.columns["y_coordinate_state_plane"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	768
top_value	207809
top_rate	0.03122
cardinality	768
entropy	9.304
entropy_ratio	0.9707
alert: long_tail	651 singleton categories

Fig 31.

Top values for y_coordinate_state_plane.

Top values for y_coordinate_state_plane (20 unique shown, of 768 total).
value	count	share
207809	31	3.1%
180649	13	1.3%
230926	8	0.8%
205111	8	0.8%
244212	8	0.8%
242900	7	0.7%
203929	5	0.5%
189205	5	0.5%
191901	5	0.5%
193110	5	0.5%
230311	4	0.4%
215511	4	0.4%
253267	4	0.4%
192497	4	0.4%
197077	4	0.4%
196526	3	0.3%
250862	3	0.3%
240061	3	0.3%
211978	3	0.3%
207649	3	0.3%
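
The cast-and-pair treatment for the two State Plane columns can be sketched as below. The function name is hypothetical, and pairing the two modal values in one call is an assumption: both occur 31 times, but the tables do not confirm they share rows.

```python
def to_point(x_raw, y_raw):
    """Convert string coordinates to an (x, y) float pair.

    Returns None when either value is missing or not numeric, so that
    half-populated rows never yield a partial point.
    """
    if x_raw is None or y_raw is None:
        return None
    try:
        return float(x_raw), float(y_raw)
    except ValueError:
        return None

modal_point = to_point("984721", "207809")  # modal x and modal y, assumed to co-occur
```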

open_data_channel_type categorical feature

Categorical channel indicating how a record was opened, with 4 distinct values and no nulls across 1000 rows. ONLINE leads at 466, followed by MOBILE (288) and PHONE (236), while UNKNOWN appears only 10 times. The distribution is fairly balanced (entropy ratio 0.79): the largest class, ONLINE, holds 46.6% of rows and no class has a majority.

Treatment: one-hot encode; consider folding UNKNOWN into a missing/other bucket.

anthropic:claude-opus-4-7 · confidence high

Out[87]:

saturn.columns["open_data_channel_type"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	4
top_value	ONLINE
top_rate	0.466
cardinality	4
entropy	1.589
entropy_ratio	0.7943

Fig 32.

Top values for open_data_channel_type.

Top values for open_data_channel_type (4 unique shown, of 4 total).
value	count	share
ONLINE	466	46.6%
MOBILE	288	28.8%
PHONE	236	23.6%
UNKNOWN	10	1.0%
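
A one-hot encoding that folds UNKNOWN into an explicit other bucket, as the treatment suggests, might look like this. The OTHER column name is an illustrative choice, and routing nulls and unseen values to it is an assumption.

```python
CHANNEL_COLUMNS = ["ONLINE", "MOBILE", "PHONE", "OTHER"]

def one_hot_channel(value):
    """Return a 0/1 vector over CHANNEL_COLUMNS; UNKNOWN, None, and unseen values map to OTHER."""
    key = value if value in CHANNEL_COLUMNS[:-1] else "OTHER"
    return [1 if key == column else 0 for column in CHANNEL_COLUMNS]

mobile_vec = one_hot_channel("MOBILE")
unknown_vec = one_hot_channel("UNKNOWN")
```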

park_facility_name categorical metadata

This column nominally records a park facility name but is effectively a constant: 999 of 1000 rows are "Unspecified" and only one row names "Flushing Meadows Corona Park". Entropy ratio is 0.011, confirming there is virtually no information content here. The column adds no discriminative signal as-is.

Treatment: Drop; single dominant value carries no signal.

anthropic:claude-opus-4-7 · confidence high

Out[90]:

saturn.columns["park_facility_name"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	2
top_value	Unspecified
top_rate	0.999
cardinality	2
entropy	0.01141
entropy_ratio	0.01141
alert: imbalance	top value is 99.9% of rows

Fig 33.

Top values for park_facility_name.

Top values for park_facility_name (2 unique shown, of 2 total).
value	count	share
Unspecified	999	99.9%
Flushing Meadows Corona Park	1	0.1%

park_borough categorical feature

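Categorical column listing the NYC borough associated with a park, with all five boroughs present and no nulls across 1000 rows. Distribution is fairly balanced (entropy ratio 0.8858) among Queens (274), Manhattan (258), Brooklyn (254), and Bronx (203), but Staten Island is severely under-represented at just 11 records. These counts and the entropy figures are identical to those of the borough column, so the two columns may be duplicates; that is worth checking row by row before using both.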

Treatment: One-hot encode; consider grouping or stratifying given the Staten Island sparsity.

anthropic:claude-opus-4-7 · confidence high

Out[93]:

saturn.columns["park_borough"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	5
top_value	QUEENS
top_rate	0.274
cardinality	5
entropy	2.057
entropy_ratio	0.8858

Fig 34.

Top values for park_borough.

Top values for park_borough (5 unique shown, of 5 total).
value	count	share
QUEENS	274	27.4%
MANHATTAN	258	25.8%
BROOKLYN	254	25.4%
BRONX	203	20.3%
STATEN ISLAND	11	1.1%

latitude categorical feature

Latitude values stored as strings, with 770 unique points across 1000 rows clustered around 40.6-40.8 (NYC range). Entropy ratio of 0.97 and top_rate of just 0.031 confirm a long tail of near-unique coordinates, though the modal value 40.73706433046593 repeating 31 times suggests some snapped or default-pinned locations. Null rate is low at 0.7%.

Treatment: cast to float and pair with longitude for geospatial features rather than treating as categorical.

anthropic:claude-opus-4-7 · confidence high

Out[96]:

saturn.columns["latitude"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	770
top_value	40.73706433046593
top_rate	0.03122
cardinality	770
entropy	9.308
entropy_ratio	0.9707
alert: long_tail	655 singleton categories

Fig 35.

Top values for latitude.

Top values for latitude (20 unique shown, of 770 total).
value	count	share
40.73706433046593	31	3.1%
40.662493333971206	13	1.3%
40.800504000346805	8	0.8%
40.72965899371875	8	0.8%
40.83694066292417	8	0.8%
40.83325085108711	7	0.7%
40.726414636772915	5	0.5%
40.68600063896156	5	0.5%
40.693325110944265	5	0.5%
40.696706691928384	5	0.5%
40.79881467021513	4	0.4%
40.75811580959723	4	0.4%
40.861809211135316	4	0.4%
40.694851633900804	4	0.4%
40.70757495380374	4	0.4%
40.7060624863299	3	0.3%
40.85515165955753	3	0.3%
40.82555561649529	3	0.3%
40.74849081688869	3	0.3%
40.73652935637317	3	0.3%

longitude categorical feature

Geographic longitude coordinates, almost certainly NYC-area given the -73.8 to -74.0 range. The column was profiled as categorical despite being continuous numeric: 770 unique values across 1000 rows with entropy ratio 0.97 and a top value repeating only 31 times (3.1%). Null rate is low at 0.7%.

Treatment: Cast to float and pair with latitude as a geospatial feature rather than treating as categorical.

anthropic:claude-opus-4-7 · confidence high

Out[99]:

saturn.columns["longitude"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	770
top_value	-73.99830041620608
top_rate	0.03122
cardinality	770
entropy	9.308
entropy_ratio	0.9707
alert: long_tail	655 singleton categories

Fig 36.

Top values for longitude.

Top values for longitude (20 unique shown, of 770 total).
value	count	share
-73.99830041620608	31	3.1%
-73.9270067970956	13	1.3%
-73.95080595870716	8	0.8%
-74.00078655646078	8	0.8%
-73.90374441298103	8	0.8%
-73.82755893872277	7	0.7%
-74.00365117608368	5	0.5%
-73.994133534454	5	0.5%
-73.86810717317749	5	0.5%
-73.94652977407227	5	0.5%
-73.94779857268328	4	0.4%
-73.85713563386481	4	0.4%
-73.92417422718809	4	0.4%
-73.79367980321378	4	0.4%
-73.91330906170826	4	0.4%
-73.91317397090971	3	0.3%
-73.86289896839915	3	0.3%
-73.91421404014056	3	0.3%
-73.93855184851218	3	0.3%
-73.85143729064244	3	0.3%
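
Once latitude and longitude are cast to float and paired, a simple geospatial feature is the great-circle distance to a reference point. The haversine sketch below is one possible feature, not something saturn proposes. Pairing each modal latitude with the longitude of the same rank is an assumption based on their matching counts (31 and 13).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Top two latitude values paired with the top two longitude values (assumed pairing).
top = (float("40.73706433046593"), float("-73.99830041620608"))
second = (float("40.662493333971206"), float("-73.9270067970956"))
distance = haversine_km(*top, *second)
```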

location unknown other

The column is named "location" but saturn skipped profiling, so no type, cardinality, or value statistics are available. The only confirmed facts are 1000 rows with a 0.0 null rate; uniqueness and content remain unknown.

Treatment: Re-profile or manually inspect before use; saturn skipped this column.

anthropic:claude-opus-4-7 · confidence low

Out[102]:

saturn.columns["location"].stats

stat	value
n	1,000
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

:@computed_region_f5dn_yrer categorical metadata

This is a Socrata-style computed region column (`:@computed_region_f5dn_yrer`), almost certainly a numeric region/zone code attached automatically by a spatial join. It has 62 unique values across 1000 rows with a near-uniform spread (entropy ratio 0.94, top value '57' at only 5.6%) and a 0.7% null rate. No single region dominates, suggesting the underlying records are well-distributed across the joined geography.

Treatment: Treat as a categorical region code; one-hot or target-encode if used as a feature, otherwise drop as join artifact.

anthropic:claude-opus-4-7 · confidence high

Out[104]:

saturn.columns[":@computed_region_f5dn_yrer"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	62
top_value	57
top_rate	0.05639
cardinality	62
entropy	5.591
entropy_ratio	0.9389

Fig 37.

Top values for :@computed_region_f5dn_yrer.

Top values for :@computed_region_f5dn_yrer (20 unique shown, of 62 total).
value	count	share
57	56	5.6%
46	48	4.8%
70	38	3.8%
54	37	3.7%
41	30	3.0%
34	29	2.9%
47	29	2.9%
18	28	2.8%
48	28	2.8%
37	28	2.8%
61	27	2.7%
62	27	2.7%
65	25	2.5%
36	24	2.4%
1	23	2.3%
42	22	2.2%
58	21	2.1%
40	21	2.1%
66	20	2.0%
43	20	2.0%

:@computed_region_yeji_bk3q categorical metadata

This is a Socrata-style computed region column (`:@computed_region_yeji_bk3q`), almost certainly a spatial-join lookup mapping each row to one of 5 region codes. The distribution is fairly balanced across codes 2-5 (193-269 each) but code 1 appears only 11 times, and 0.7% of rows are null. Entropy ratio 0.88 confirms near-uniform spread except for the sparse '1' bucket.

Treatment: Treat as a categorical region key; one-hot encode or drop if the spatial join isn't relevant.

anthropic:claude-opus-4-7 · confidence high

Out[107]:

saturn.columns[":@computed_region_yeji_bk3q"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	5
top_value	3
top_rate	0.2709
cardinality	5
entropy	2.054
entropy_ratio	0.8846

Fig 38.

Top values for :@computed_region_yeji_bk3q.

Top values for :@computed_region_yeji_bk3q (5 unique shown, of 5 total).
value	count	share
3	269	26.9%
4	266	26.6%
2	254	25.4%
5	193	19.3%
1	11	1.1%

:@computed_region_sbqj_enih categorical foreign_key

This is a Socrata-style computed region column (`:@computed_region_sbqj_enih`), holding a small integer code that geocodes each row into one of 75 regions. The distribution is nearly uniform across categories — entropy ratio is 0.94 and the top value '3' covers only 5.0% of rows — so no single region dominates. Nulls are negligible at 0.7%.

Treatment: Treat as a categorical region id; left-join to the corresponding Socrata boundary lookup or one-hot encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[110]:

saturn.columns[":@computed_region_sbqj_enih"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	75
top_value	3
top_rate	0.05035
cardinality	75
entropy	5.846
entropy_ratio	0.9385

Fig 39.

Top values for :@computed_region_sbqj_enih.

Top values for :@computed_region_sbqj_enih (20 unique shown, of 75 total).
value	count	share
3	50	5.0%
60	48	4.8%
62	37	3.7%
25	29	2.9%
33	28	2.8%
40	27	2.7%
64	27	2.7%
19	26	2.6%
68	23	2.3%
37	23	2.3%
73	23	2.3%
53	22	2.2%
26	21	2.1%
70	21	2.1%
61	20	2.0%
28	20	2.0%
32	19	1.9%
5	19	1.9%
22	19	1.9%
20	18	1.8%

:@computed_region_92fq_4b7q categorical foreign_key

This is a Socrata-style computed region column (`:@computed_region_92fq_4b7q`) holding 51 distinct integer codes, almost certainly a spatial join key (e.g., a precinct, district, or zone id) appended by the platform. Distribution is fairly flat — entropy ratio 0.9477 and the top value '10' covers only 6.4% of rows — with very few nulls (0.7%). No single region dominates, suggesting broad geographic coverage rather than a hot spot.

Treatment: Treat as a categorical region id; left-join to the region lookup or one-hot encode if used as a feature.

anthropic:claude-opus-4-7 · confidence high

Out[113]:

saturn.columns[":@computed_region_92fq_4b7q"].stats

stat	value
n	1,000
nulls	7 (0.7%)
unique	51
top_value	10
top_rate	0.06445
cardinality	51
entropy	5.376
entropy_ratio	0.9477

Fig 40.

Top values for :@computed_region_92fq_4b7q.

Top values for :@computed_region_92fq_4b7q (20 unique shown, of 51 total).
value	count	share
10	64	6.4%
34	54	5.4%
30	40	4.0%
36	36	3.6%
32	36	3.6%
12	35	3.5%
39	35	3.5%
46	34	3.4%
23	33	3.3%
42	28	2.8%
50	27	2.7%
21	27	2.7%
43	27	2.7%
40	25	2.5%
28	25	2.5%
48	24	2.4%
45	23	2.3%
29	22	2.2%
17	22	2.2%
35	21	2.1%

descriptor_2 categorical feature

Secondary complaint descriptor that further qualifies a primary complaint type, with values like 'NO HEAT', 'NO HEAT AND NO HOT WATER', and 'ROACHES'. It is null 88.2% of the time, so only 118 rows carry a value, and among those 'NO HEAT' alone covers 27.97%. Cardinality is 43 with entropy ratio 0.807, indicating a long tail across the populated rows. The values mix casing conventions ('ROACHES' vs 'Cannabis Smoking or Vaping'), which suggests inconsistent upstream sources, and the literal string 'N/A' appears 9 times as a placeholder.

Treatment: Normalize casing and treat missing as its own category before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high

Out[116]:

saturn.columns["descriptor_2"].stats

stat	value
n	1,000
nulls	882 (88.2%)
unique	43
top_value	NO HEAT
top_rate	0.2797
cardinality	43
entropy	4.376
entropy_ratio	0.8065
alert: long_tail	27 singleton categories
alert: null_rate	88.2% null

Fig 41.

Top values for descriptor_2.

Top values for descriptor_2 (20 unique shown, of 43 total).
value	count	share
NO HEAT	33	3.3%
NO HEAT AND NO HOT WATER	10	1.0%
N/A	9	0.9%
NO HOT WATER	7	0.7%
Cannabis Smoking or Vaping	5	0.5%
ROACHES	4	0.4%
Unsafe Driving - Non-Passenger	3	0.3%
Dog	3	0.3%
OTHER	3	0.3%
Not Cleaned by Property Owner	2	0.2%
Parked In Front Of Fire Hydrant	2	0.2%
Operating Improperly	2	0.2%
AT WALL OR CEILING	2	0.2%
FLIES	2	0.2%
BROKEN OR MISSING	2	0.2%
MISSING OR INADEQUATE CANS/LID	2	0.2%
Loose or Improperly Stored Garbage or Food	1	0.1%
Droppings	1	0.1%
Fare/Tip Complaint	1	0.1%
Waste	1	0.1%
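
The casing normalisation and explicit missing category from the descriptor_2 treatment can be sketched as follows. Upper-casing is one possible convention, and treating the literal 'N/A' as missing is an assumption that goes beyond the stated treatment. The MISSING label is hypothetical.

```python
MISSING = "MISSING"

def normalize_descriptor(value, placeholders=("N/A",)):
    """Upper-case and collapse whitespace; map nulls and placeholder strings to MISSING."""
    if value is None:
        return MISSING
    cleaned = " ".join(value.split()).upper()
    if cleaned == "" or cleaned in placeholders:
        return MISSING
    return cleaned

examples = [
    normalize_descriptor("Cannabis Smoking or Vaping"),
    normalize_descriptor("ROACHES"),
    normalize_descriptor("N/A"),
    normalize_descriptor(None),
]
```

Passing `placeholders=()` keeps 'N/A' as its own category if that distinction turns out to matter.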

resolution_description categorical label

This is the canned 311 resolution narrative attached to each complaint, describing the agency's disposition (NYPD found no violation, summons issued, HPD inspection pending, etc.). Cardinality is just 16 templates over 1000 rows. The top NYPD 'no criminal violation' boilerplate covers 35.4% of non-null rows (245 rows, 24.5% of all rows), and 30.7% of rows are null, likely complaints still open. Entropy ratio of 0.69 shows the distribution is skewed toward a few NYPD templates; the two HPD templates account for 73 rows, while the DOT and DOHMH templates appear only in the long tail.

Treatment: Map the 16 templates to a categorical disposition code (e.g., 'no_violation', 'summons_issued', 'no_entry', 'pending') rather than treating as free text.

anthropic:claude-opus-4-7 · confidence high

Out[119]:

saturn.columns["resolution_description"].stats

stat	value
n	1,000
nulls	307 (30.7%)
unique	16
top_value	The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.
top_rate	0.3535
cardinality	16
entropy	2.777
entropy_ratio	0.6943
alert: null_rate	30.7% null

Fig 42.

Top values for resolution_description.

Top values for resolution_description (16 unique shown, of 16 total).
value	count	share
The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	245	24.5%
The New York City Police Department responded to the complaint and with the information available observed no evidence of a criminal violation at that time. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	161	16.1%
The New York City Police Department responded to the complaint and their investigation determined that a violation of law occurred. Police issued a summons in response to the complaint. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	68	6.8%
The New York City Police Department responded to the complaint but officers were unable to gain entry into the premises. If the problem persists, please contact 311 to create another complaint and ensure that contact information (e.g., buzzer number, phone number, etc.) is available to assist the responding officers in gaining entry to properly investigate the complaint. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	50	5.0%
The following complaint conditions are still open. HPD has already attempted to notify the property owner that the condition exists; the tenant should provide access for the owner to make the repair. HPD may attempt to contact the tenant by phone to verify the correction of the condition or an HPD Inspector may attempt to conduct an inspection.	44	4.4%
The New York City Police Department responded to the complaint and their investigation determined that police action was not necessary. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	40	4.0%
The New York City Police Department responded to the complaint and observed no criminal violation upon their arrival. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	31	3.1%
This complaint is a duplicate of a building-wide condition already reported by another tenant. The original complaint is still open, and HPD may only need to confirm that the condition exists by inspecting one apartment. If we cannot contact the tenant from the original complaint or get access to that apartment, HPD may attempt to contact the person who filed this complaint to verify the correction of the condition or may conduct an inspection of your unit. You can check HPDONLINE to see if a	29	2.9%
The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.	6	0.6%
Your complaint has been received but does not fall under the jurisdiction of the New York City Police Department. Please contact your local precinct for more information, including which City agency your request was referred to. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	5	0.5%
The Police Department reviewed your complaint and provided additional information below.	4	0.4%
The New York City Police Department responded to the complaint and a report was prepared as part of their investigation. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	3	0.3%
The Department of Health and Mental Hygiene has sent official written notification to the Owner/Landlord warning them of potential violations and instructing them to correct the situation. If the situation persists 21 days after your initial complaint, please make a new complaint.	3	0.3%
The New York City Police Department responded to the complaint and observed an encampment at the noted location. The complaint has been referred to the Department of Homeless Services (DHS) for further action. DHS will inspect the condition and update your service request with more information. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	2	0.2%
Your complaint has been received by the New York City Police Department and assigned to a unit at your local precinct. The responding officers will conduct an assessment of the complaint and may contact you for further information, if you left contact information. Our team is committed to resolving this matter promptly, and we appreciate your contribution to maintaining the well-being of our community.	1	0.1%
The New York City Police Department responded to the complaint and their investigation determined another specific tow is required. Please contact the local precinct for more information. Thank you for attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.	1	0.1%
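
Mapping the templates to disposition codes can be done with phrase rules. The sketch below uses the four example codes from the treatment plus a catch-all 'other'. The trigger phrases are taken from the template text above, but the grouping of templates under each code is an illustrative assumption.

```python
# (code, phrases) pairs are checked in order; the first match wins.
RULES = [
    ("no_violation", (
        "no criminal violation",
        "no evidence of a criminal violation",
        "police action was not necessary",
    )),
    ("summons_issued", ("issued a summons",)),
    ("no_entry", ("unable to gain entry",)),
    ("pending", ("still open",)),
]

def disposition_code(text):
    """Return a short disposition code for a resolution template, or None for nulls."""
    if text is None:
        return None
    lowered = text.lower()
    for code, phrases in RULES:
        if any(phrase in lowered for phrase in phrases):
            return code
    return "other"

code = disposition_code("Police issued a summons in response to the complaint.")
```

With only 16 templates, an exact-match lookup table reviewed by hand would be a stricter alternative to substring rules.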

resolution_action_updated_date categorical timestamp

ISO-8601 timestamps recording when a resolution action was last updated, stored as strings rather than a native datetime. 30.6% of rows are null, and a single midnight value '2026-01-17T00:00:00.000' accounts for 74 of the non-null entries (top_rate 10.66%). The remaining values carry a time of day at second precision (the millisecond field is .000 in every value shown) and are nearly all unique (585 distinct over 694 non-null, entropy_ratio 0.94). The mix of one date-only spike with otherwise second-level timestamps suggests two ingestion paths or a backfill default colliding with live event logging.

Treatment: Parse to datetime, treat the 2026-01-17 midnight spike as a sentinel/backfill, and derive recency or duration features rather than using raw values.

anthropic:claude-opus-4-7 · confidence high

Out[122]:

saturn.columns["resolution_action_updated_date"].stats

stat	value
n	1,000
nulls	306 (30.6%)
unique	585
top_value	2026-01-17T00:00:00.000
top_rate	0.1066
cardinality	585
entropy	8.673
entropy_ratio	0.9435
alert: long_tail	548 singleton categories
alert: null_rate	30.6% null

Fig 43.

Top values for resolution_action_updated_date.

Top values for resolution_action_updated_date (20 unique shown, of 585 total).
value	count	share
2026-01-17T00:00:00.000	74	7.4%
2026-01-18T02:03:44.000	2	0.2%
2026-01-18T02:04:03.000	2	0.2%
2026-01-18T02:01:54.000	2	0.2%
2026-01-18T01:35:08.000	2	0.2%
2026-01-18T01:41:49.000	2	0.2%
2026-01-18T01:26:35.000	2	0.2%
2026-01-18T01:32:28.000	2	0.2%
2026-01-18T01:25:53.000	2	0.2%
2026-01-18T01:54:39.000	2	0.2%
2026-01-18T01:19:59.000	2	0.2%
2026-01-18T01:09:47.000	2	0.2%
2026-01-18T01:00:57.000	2	0.2%
2026-01-18T01:04:36.000	2	0.2%
2026-01-18T00:58:56.000	2	0.2%
2026-01-18T01:07:50.000	2	0.2%
2026-01-18T01:27:34.000	2	0.2%
2026-01-18T00:46:07.000	2	0.2%
2026-01-18T00:55:02.000	2	0.2%
2026-01-18T00:40:32.000	2	0.2%
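
Parsing the strings and flagging the midnight spike, as the treatment for resolution_action_updated_date suggests, might look like this. The format string matches the values shown in the table. Treating exactly 2026-01-17T00:00:00 as the sentinel is specific to this sample.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"
SENTINEL = datetime(2026, 1, 17, 0, 0, 0)  # the repeated midnight value in this sample

def parse_updated(raw):
    """Return (timestamp, is_sentinel); nulls come back as (None, False)."""
    if raw is None:
        return None, False
    ts = datetime.strptime(raw, TS_FORMAT)
    return ts, ts == SENTINEL

spike = parse_updated("2026-01-17T00:00:00.000")
regular = parse_updated("2026-01-18T02:03:44.000")
```

Keeping the flag as a separate boolean feature lets a model treat backfilled rows differently without discarding them.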

closed_date categorical timestamp

This is a closure timestamp captured at second-level precision, stored as strings rather than parsed datetimes. 39% of rows are null, consistent with records that have not yet been closed: the 390 nulls equal the 305 In Progress plus 85 Open rows in status. Among the 610 populated rows there are 585 unique values, with the most common appearing only 3 times (top_rate 0.0049, entropy_ratio 0.998). All visible top values fall on 2026-01-18, suggesting the populated closures concentrate in a narrow window.

Treatment: Parse to datetime, treat nulls as 'still open', and derive duration features rather than using raw values.

anthropic:claude-opus-4-7 · confidence high

Out[125]:

saturn.columns["closed_date"].stats

stat	value
n	1,000
nulls	390 (39.0%)
unique	585
top_value	2026-01-18T00:41:26.000
top_rate	0.004918
cardinality	585
entropy	9.169
entropy_ratio	0.9975
alert: long_tail	561 singleton categories
alert: null_rate	39.0% null

Fig 44.

Top values for closed_date.

Top values for closed_date (20 unique shown, of 585 total).
value	count	share
2026-01-18T00:41:26.000	3	0.3%
2026-01-18T02:02:35.000	2	0.2%
2026-01-18T02:03:41.000	2	0.2%
2026-01-18T01:41:47.000	2	0.2%
2026-01-18T01:26:28.000	2	0.2%
2026-01-18T01:45:29.000	2	0.2%
2026-01-18T01:54:33.000	2	0.2%
2026-01-18T01:07:46.000	2	0.2%
2026-01-18T01:27:29.000	2	0.2%
2026-01-18T00:46:04.000	2	0.2%
2026-01-18T00:40:28.000	2	0.2%
2026-01-18T00:55:27.000	2	0.2%
2026-01-18T00:41:08.000	2	0.2%
2026-01-18T00:20:14.000	2	0.2%
2026-01-18T00:37:27.000	2	0.2%
2026-01-18T00:22:36.000	2	0.2%
2026-01-18T00:36:52.000	2	0.2%
2026-01-18T00:17:49.000	2	0.2%
2026-01-18T00:28:52.000	2	0.2%
2026-01-18T00:20:55.000	2	0.2%
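
A duration feature of the kind the closed_date treatment mentions could be time-to-close in hours. The sketch assumes created_date uses the same string format as closed_date, which this section does not show. The created value in the example is hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

def hours_to_close(created_raw, closed_raw):
    """Hours between creation and closure; None while the ticket is still open."""
    if created_raw is None or closed_raw is None:
        return None
    created = datetime.strptime(created_raw, TS_FORMAT)
    closed = datetime.strptime(closed_raw, TS_FORMAT)
    return (closed - created).total_seconds() / 3600.0

# Hypothetical created_date paired with the most common closed_date in the sample.
elapsed = hours_to_close("2026-01-17T23:41:26.000", "2026-01-18T00:41:26.000")
still_open = hours_to_close("2026-01-17T23:41:26.000", None)
```

Returning None for open tickets keeps them distinguishable from tickets closed instantly, which would otherwise both read as zero.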

taxi_pick_up_location categorical free_text

Free-text taxi pickup addresses (street, borough, state, ZIP) for NYC trips. The column is almost entirely empty at a 99.4% null rate, leaving only 6 non-null rows that are all unique, so cardinality equals the populated count and entropy is maximal (entropy_ratio 1.0). With so little signal, this column carries effectively no analytical value as-is.

Treatment: Drop the column or geocode the rare populated addresses if pickup location is needed.
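
If the populated addresses are kept for geocoding, they would first need to be split into consistent parts. The sketch below assumes the four-part `STREET, BOROUGH (NEIGHBORHOOD), STATE, ZIP` layout seen in the six values reported for this column; `split_pickup_address` is a hypothetical helper, and the geocoding call itself is left out because it depends on whichever service is chosen.

```python
import re
from typing import Dict, Optional

ZIP_RE = re.compile(r"^\d{5}$")


def split_pickup_address(raw: str) -> Dict[str, Optional[str]]:
    """Split 'STREET, BOROUGH (NEIGHBORHOOD), STATE, ZIP' into named parts.

    Strings that do not have exactly four comma-separated parts are returned
    with only the street field set, so they can be reviewed by hand.
    """
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        return {"street": raw.strip(), "borough": None, "state": None, "zip": None}
    street, area, state, zip_code = parts
    # Drop a trailing parenthesised neighborhood, e.g. 'QUEENS (JAMAICA)' -> 'QUEENS'.
    borough = re.sub(r"\s*\(.*\)\s*$", "", area)
    return {
        "street": street,
        "borough": borough,
        "state": state,
        "zip": zip_code if ZIP_RE.match(zip_code) else None,
    }


# Two of the values reported for taxi_pick_up_location.
examples = [
    "JOHN F KENNEDY AIRPORT, QUEENS (JAMAICA) ,NY, 11430",
    "11 WATER STREET, BROOKLYN, NY, 11201",
]
parsed = [split_pickup_address(value) for value in examples]
```

Stripping whitespace around each part absorbs the irregular comma spacing in the JFK entry.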

anthropic:claude-opus-4-7 · confidence high

Out[128]:

saturn.columns["taxi_pick_up_location"].stats

stat	value
n	1,000
nulls	994 (99.4%)
unique	6
top_value	55 LITTLE WEST 12 STREET, MANHATTAN (NEW YORK), NY, 10014
top_rate	0.1667
cardinality	6
entropy	2.585
entropy_ratio	1
alert: long_tail	6 singleton categories
alert: null_rate	99.4% null

Fig 45.

Top values for taxi_pick_up_location.

Top values for taxi_pick_up_location (6 unique shown, of 6 total).
value	count	share
55 LITTLE WEST 12 STREET, MANHATTAN (NEW YORK), NY, 10014	1	0.1%
JOHN F KENNEDY AIRPORT, QUEENS (JAMAICA) ,NY, 11430	1	0.1%
11 WATER STREET, BROOKLYN, NY, 11201	1	0.1%
9 AVENUE AND WEST 43 STREET, MANHATTAN, NY, 10036	1	0.1%
30 AVENUE AND 33 STREET, QUEENS, NY, 11102	1	0.1%
30-08 33 STREET, QUEENS (ASTORIA), NY, 11102	1	0.1%

vehicle_type categorical feature

Categorical vehicle classification with just 4 levels (Car, Other, SUV, Van), but it is almost entirely missing: 96.5% null, leaving only 35 populated rows out of 1,000. Among those 35, Car dominates at 77.1% (27 rows), giving an entropy ratio of 0.55; the share column in the table below is instead computed over all 1,000 rows. With so few observations, the distribution is unreliable as a feature.

Treatment: Drop or treat missingness as its own category; too sparse to model directly.
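
Treating missingness as its own category could look like the following sketch. The `"Missing"` label and the `with_missing_level` helper are assumptions made for illustration; the column is rebuilt from the counts reported for `vehicle_type` rather than read from the raw file.

```python
from collections import Counter
from typing import List, Optional

MISSING = "Missing"  # hypothetical label for the explicit missing level


def with_missing_level(values: List[Optional[str]]) -> List[str]:
    """Replace nulls with an explicit category so no row is dropped."""
    return [MISSING if v is None else v for v in values]


# Rebuilt from the reported counts: 35 populated rows and 965 nulls.
vehicle_type = ["Car"] * 27 + ["Other"] * 4 + ["SUV"] * 3 + ["Van"] * 1 + [None] * 965

levels = Counter(with_missing_level(vehicle_type))
```

The result has five levels, with the missing level covering 965 of 1,000 rows, which makes the sparsity visible to any downstream model instead of hiding it.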

anthropic:claude-opus-4-7 · confidence high

Out[131]:

saturn.columns["vehicle_type"].stats

stat	value
n	1,000
nulls	965 (96.5%)
unique	4
top_value	Car
top_rate	0.7714
cardinality	4
entropy	1.097
entropy_ratio	0.5484
alert: null_rate	96.5% null

Fig 46.

Top values for vehicle_type.

Top values for vehicle_type (4 unique shown, of 4 total).
value	count	share
Car	27	2.7%
Other	4	0.4%
SUV	3	0.3%
Van	1	0.1%
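
The reported `top_rate`, `entropy` and `entropy_ratio` can be reproduced from the four counts above. The sketch assumes Shannon entropy in bits over the populated rows and `entropy_ratio = entropy / log2(cardinality)`; that definition is inferred from the numbers in this notebook, not taken from Saturn's documentation, and `categorical_stats` is a hypothetical helper.

```python
import math
from typing import Dict


def categorical_stats(counts: Dict[str, int], n_rows: int) -> Dict[str, float]:
    """Recompute summary stats for a categorical column from its value counts."""
    populated = sum(counts.values())
    probs = [c / populated for c in counts.values()]
    # Shannon entropy in bits, over non-null rows only.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    # Maximum entropy for this many levels; zero when there is a single level.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    top = max(counts.values())
    return {
        "top_rate": top / populated,        # share of populated rows
        "top_share_of_rows": top / n_rows,  # share of all rows, as in the table
        "entropy": entropy,
        "entropy_ratio": entropy / max_entropy if max_entropy else 0.0,
    }


stats = categorical_stats({"Car": 27, "Other": 4, "SUV": 3, "Van": 1}, n_rows=1000)
```

The two rates differ only in their denominator: 27 of 35 populated rows gives the 0.7714 `top_rate`, while 27 of 1,000 rows gives the 2.7% share shown in the table.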

facility_type categorical metadata

This column is a categorical facility_type field that is effectively empty: 99% of the 1000 rows are null, and the only 10 non-null values are all the placeholder 'N/A'. With cardinality of 1 and entropy of 0, it carries no information.

Treatment: Drop the column; it is 99% null with a single placeholder value.
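
Because the only populated value is itself a placeholder, the effective missingness is higher than the reported null rate. A minimal sketch of that check follows, assuming `'N/A'` is the only placeholder string in use (the `PLACEHOLDERS` set and both helpers are hypothetical):

```python
from typing import List, Optional

# Assumption: 'N/A' is the only placeholder; extend the set if others turn up.
PLACEHOLDERS = {"N/A"}


def normalize_placeholders(values: List[Optional[str]]) -> List[Optional[str]]:
    """Map placeholder strings to None so they count as missing."""
    return [None if v is None or v.strip() in PLACEHOLDERS else v for v in values]


def null_rate(values: List[Optional[str]]) -> float:
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


# Rebuilt from the reported counts: 10 'N/A' values and 990 nulls.
facility_type = ["N/A"] * 10 + [None] * 990
cleaned = normalize_placeholders(facility_type)
```

After normalization the column is entirely null, which supports dropping it outright.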

anthropic:claude-opus-4-7 · confidence high

Out[134]:

saturn.columns["facility_type"].stats

stat	value
n	1,000
nulls	990 (99.0%)
unique	1
top_value	N/A
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	99.0% null
alert: imbalance	top value is 100.0% of rows

Fig 47.

Top values for facility_type.

Top values for facility_type (1 unique shown, of 1 total).
value	count	share
N/A	10	1.0%

taxi_company_borough categorical metadata

Categorical column intended to record the borough of a taxi company, but it is effectively empty: 99.9% of rows are null and the single non-null value is 'BROOKLYN'. With cardinality 1 and entropy 0, the column carries no information as-is.

Treatment: Drop; near-entirely null with only one observed level.

anthropic:claude-opus-4-7 · confidence high

Out[137]:

saturn.columns["taxi_company_borough"].stats

stat	value
n	1,000
nulls	999 (99.9%)
unique	1
top_value	BROOKLYN
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: long_tail	1 singleton categories
alert: null_rate	99.9% null
alert: imbalance	top value is 100.0% of rows

Fig 48.

Top values for taxi_company_borough.

Top values for taxi_company_borough (1 unique shown, of 1 total).
value	count	share
BROOKLYN	1	0.1%

bridge_highway_name categorical metadata

A categorical field recording bridge or highway names, but it is effectively empty: 99.7% null, with only 3 non-null records across 1,000 rows. The two distinct values observed ('E' and 'Kosciuszko Br - BQE') are inconsistent in form, suggesting either a free-text field or an abbreviation code mixed with full names.

Treatment: Drop; null rate of 99.7% leaves no usable signal.

anthropic:claude-opus-4-7 · confidence high

Out[140]:

saturn.columns["bridge_highway_name"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	E
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 49.

Top values for bridge_highway_name.

Top values for bridge_highway_name (2 unique shown, of 2 total).
value	count	share
E	2	0.2%
Kosciuszko Br - BQE	1	0.1%

bridge_highway_direction categorical metadata

This is a near-empty categorical field describing bridge or highway direction, with a 99.7% null rate leaving only 3 non-null observations across 2 distinct values ('C E Local Downtown & Brooklyn' and 'Queens Bound'). The values resemble transit route/direction labels rather than simple compass directions, suggesting inconsistent coding. With only 3 populated rows out of 1000, no reliable distribution can be inferred.

Treatment: Drop; null rate of 99.7% leaves too little signal to use.

anthropic:claude-opus-4-7 · confidence high

Out[143]:

saturn.columns["bridge_highway_direction"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	C E Local Downtown & Brooklyn
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 50.

Top values for bridge_highway_direction.

Top values for bridge_highway_direction (2 unique shown, of 2 total).
value	count	share
C E Local Downtown & Brooklyn	2	0.2%
Queens Bound	1	0.1%

road_ramp categorical metadata

This column appears to be a binary-style flag indicating whether a road segment is a ramp, but it carries virtually no information here. With 99.9% of rows null and only 1 non-null value ("Ramp") across 1,000 rows, cardinality is 1 and entropy is 0.

Treatment: Drop; effectively constant with 99.9% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[146]:

saturn.columns["road_ramp"].stats

stat	value
n	1,000
nulls	999 (99.9%)
unique	1
top_value	Ramp
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: long_tail	1 singleton categories
alert: null_rate	99.9% null
alert: imbalance	top value is 100.0% of rows

Fig 51.

Top values for road_ramp.

Top values for road_ramp (1 unique shown, of 1 total).
value	count	share
Ramp	1	0.1%

bridge_highway_segment categorical metadata

A categorical flag describing a bridge/highway segment type, with only two observed values ('Platform' and 'Ramp'). It is almost entirely missing (99.7% null, leaving just 3 non-null rows out of 1,000), so the apparent top_rate of 0.667 reflects 2 of 3 observations and is not meaningful.

Treatment: Drop or treat as a rare-event indicator; too sparse (3/1000 populated) to model directly.

anthropic:claude-opus-4-7 · confidence high

Out[149]:

saturn.columns["bridge_highway_segment"].stats

stat	value
n	1,000
nulls	997 (99.7%)
unique	2
top_value	Platform
top_rate	0.6667
cardinality	2
entropy	0.9183
entropy_ratio	0.9183
alert: null_rate	99.7% null

Fig 52.

Top values for bridge_highway_segment.

Top values for bridge_highway_segment (2 unique shown, of 2 total).
value	count	share
Platform	2	0.2%
Ramp	1	0.1%

How to cite