wild nyc 311 sample 20260121
Reading
This is a 1,000-row sample of NYC 311 service requests (47 columns), almost entirely categorical, capturing complaints by agency, location, and resolution status. NYPD dominates routing at 59.4% of requests, followed by HPD (23.2%) and DSNY (8.2%), and the top complaint types are Noise - Residential (23.4%), Illegal Parking (18.7%), and HEAT/HOT WATER (17.0%); agency and complaint_type are a good first place to look. Geographically, Brooklyn (31.2%), Queens (26.1%), and the Bronx (23.0%) account for most cases, while Staten Island is just 2.2%. Status is split across In Progress (38.8%), Closed (33.6%), and Open (27.6%), so 66.4% of requests remain unresolved. Note that many specialized fields (taxi, bridge/highway, vehicle, facility_type) are >95% null and not informative, and `location` was skipped during profiling.
citing: agency · complaint_type · borough · status · open_data_channel_type · descriptor · facility_type · taxi_company_borough · vehicle_type
Charts the summary said to look at first
agency
| value | count | share |
|---|---|---|
| NYPD | 594 | 59.4% |
| HPD | 232 | 23.2% |
| DSNY | 82 | 8.2% |
| DOT | 34 | 3.4% |
| DEP | 24 | 2.4% |
| DOHMH | 16 | 1.6% |
| TLC | 12 | 1.2% |
| DPR | 4 | 0.4% |
| DCWP | 1 | 0.1% |
| OOS | 1 | 0.1% |
complaint_type (top 20 of 61 values)
| value | count | share |
|---|---|---|
| Noise - Residential | 234 | 23.4% |
| Illegal Parking | 187 | 18.7% |
| HEAT/HOT WATER | 170 | 17.0% |
| Snow or Ice | 69 | 6.9% |
| Blocked Driveway | 64 | 6.4% |
| Noise - Commercial | 38 | 3.8% |
| Noise - Vehicle | 22 | 2.2% |
| UNSANITARY CONDITION | 21 | 2.1% |
| Noise - Street/Sidewalk | 20 | 2.0% |
| Street Condition | 14 | 1.4% |
| Noise | 14 | 1.4% |
| Traffic Signal Condition | 9 | 0.9% |
| Abandoned Vehicle | 8 | 0.8% |
| Taxi Complaint | 8 | 0.8% |
| DOOR/WINDOW | 8 | 0.8% |
| Dirty Condition | 7 | 0.7% |
| PLUMBING | 7 | 0.7% |
| Non-Emergency Police Matter | 6 | 0.6% |
| FLOORING/STAIRS | 6 | 0.6% |
| Indoor Air Quality | 5 | 0.5% |
borough
| value | count | share |
|---|---|---|
| BROOKLYN | 312 | 31.2% |
| QUEENS | 261 | 26.1% |
| BRONX | 230 | 23.0% |
| MANHATTAN | 175 | 17.5% |
| STATEN ISLAND | 22 | 2.2% |
status
| value | count | share |
|---|---|---|
| In Progress | 388 | 38.8% |
| Closed | 336 | 33.6% |
| Open | 276 | 27.6% |
open_data_channel_type
| value | count | share |
|---|---|---|
| ONLINE | 527 | 52.7% |
| MOBILE | 234 | 23.4% |
| PHONE | 214 | 21.4% |
| UNKNOWN | 25 | 2.5% |
Schema
47 columns

| column | type | null rate | unique | alerts |
|---|---|---|---|---|
| unique_key | categorical | 0.0% | 1,000 | long_tail |
| created_date | categorical | 0.0% | 910 | long_tail |
| agency | categorical | 0.0% | 10 | |
| agency_name | categorical | 0.0% | 10 | |
| complaint_type | categorical | 0.0% | 61 | |
| descriptor | categorical | 0.0% | 105 | |
| location_type | categorical | 4.8% | 19 | |
| incident_zip | categorical | 0.3% | 156 | |
| incident_address | categorical | 1.6% | 787 | long_tail |
| street_name | categorical | 1.6% | 592 | long_tail |
| cross_street_1 | categorical | 25.0% | 457 | long_tail, null_rate |
| cross_street_2 | categorical | 25.0% | 458 | long_tail, null_rate |
| intersection_street_1 | categorical | 26.7% | 436 | long_tail, null_rate |
| intersection_street_2 | categorical | 26.7% | 443 | long_tail, null_rate |
| address_type | categorical | 0.3% | 4 | |
| city | categorical | 2.9% | 40 | |
| landmark | categorical | 30.8% | 445 | long_tail, null_rate |
| status | categorical | 0.0% | 3 | |
| community_board | categorical | 0.0% | 64 | |
| council_district | categorical | 1.3% | 51 | |
| police_precinct | categorical | 0.0% | 76 | |
| bbl | categorical | 5.9% | 744 | long_tail |
| borough | categorical | 0.0% | 5 | |
| x_coordinate_state_plane | categorical | 1.2% | 788 | long_tail |
| y_coordinate_state_plane | categorical | 1.2% | 788 | long_tail |
| open_data_channel_type | categorical | 0.0% | 4 | |
| park_facility_name | categorical | 0.0% | 3 | long_tail, imbalance |
| park_borough | categorical | 0.0% | 5 | |
| latitude | categorical | 1.2% | 794 | long_tail |
| longitude | categorical | 1.2% | 794 | long_tail |
| location | unknown | 0.0% | — | skipped |
| :@computed_region_f5dn_yrer | categorical | 1.2% | 62 | |
| :@computed_region_yeji_bk3q | categorical | 1.2% | 5 | |
| :@computed_region_sbqj_enih | categorical | 1.2% | 75 | |
| :@computed_region_92fq_4b7q | categorical | 1.2% | 51 | |
| descriptor_2 | categorical | 59.6% | 75 | long_tail, null_rate |
| resolution_description | categorical | 41.2% | 19 | null_rate |
| resolution_action_updated_date | categorical | 40.9% | 354 | long_tail, null_rate |
| taxi_pick_up_location | categorical | 98.8% | 3 | null_rate |
| vehicle_type | categorical | 95.7% | 5 | null_rate |
| closed_date | categorical | 66.4% | 334 | long_tail, null_rate |
| bridge_highway_name | categorical | 99.6% | 4 | long_tail, null_rate |
| bridge_highway_segment | categorical | 99.6% | 4 | long_tail, null_rate |
| facility_type | categorical | 95.9% | 1 | null_rate, imbalance |
| bridge_highway_direction | categorical | 99.7% | 3 | long_tail, null_rate |
| road_ramp | categorical | 99.7% | 2 | null_rate |
| taxi_company_borough | categorical | 99.9% | 1 | long_tail, null_rate, imbalance |
unique_key
categorical · identifier · long_tail
This is a row-level identifier: every one of the 1000 values is unique (n_unique=1000, entropy_ratio=1.0) and no nulls are present. The values are 8-digit numeric strings clustered around 6753xxxx–6754xxxx, consistent with a sequential record key. There is no predictive signal here, only joinability.
Treatment: Drop from modelling; retain only as a join key.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 1,000
- top_value: 67534607
- top_rate: 0.001
- cardinality: 1,000
- entropy: 9.966
- entropy_ratio: 1
created_date
categorical · timestamp · long_tail
ISO-8601 datetime stamps stored as strings, with 910 unique values across 1000 rows and zero nulls. The top 10 timestamps all fall on 2026-01-19 between 22:01 and 23:57, suggesting the sample is concentrated in a roughly two-hour window rather than spread over time. Entropy ratio 0.9915 confirms near-unique values, but the column is typed as categorical instead of a proper timestamp.
Treatment: Parse to datetime and derive features (hour, day, delta); do not use raw string as a category.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 910
- top_value: 2026-01-19T22:01:09.000
- top_rate: 0.009
- cardinality: 910
- entropy: 9.746
- entropy_ratio: 0.9915
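The parse-and-derive treatment can be sketched as follows. This is a minimal illustration that assumes every timestamp matches the format of the sample value `2026-01-19T22:01:09.000`. The helper names `parse_ts` and `date_features` are hypothetical, and the created/closed pairing uses the two columns' reported top values, which do not necessarily belong to the same request.

```python
from datetime import datetime
from typing import Optional

# Format inferred from the sample value "2026-01-19T22:01:09.000" (assumption).
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_ts(raw: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp string; return None for missing or empty values."""
    if not raw:
        return None
    return datetime.strptime(raw, TS_FORMAT)


def date_features(created_raw: str, closed_raw: Optional[str] = None) -> dict:
    """Derive hour, weekday, and (when a close time exists) seconds to close."""
    created = parse_ts(created_raw)
    closed = parse_ts(closed_raw)
    return {
        "hour": created.hour,
        "weekday": created.weekday(),  # Monday == 0
        "seconds_to_close": (
            (closed - created).total_seconds() if closed else None
        ),
    }


# Illustrative pairing only: top_value of created_date and of closed_date.
features = date_features("2026-01-19T22:01:09.000", "2026-01-20T01:00:32.000")
still_open = date_features("2026-01-19T22:01:09.000")
```

Leaving `seconds_to_close` as `None` for open requests keeps the "still open" state explicit instead of imputing a close time.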
agency
categorical · feature
This column records the NYC agency handling each record, drawn from 10 distinct codes with no nulls across 1000 rows. Distribution is heavily concentrated: NYPD alone accounts for 594 rows (top_rate 0.594) and HPD another 232, while DCWP and OOS appear just once each. Entropy ratio of 0.527 confirms the long tail is thin, which may starve models of signal for rare agencies.
Treatment: One-hot encode the top few agencies and bucket the rare codes (DPR, DCWP, OOS) into an 'other' category.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 10
- top_value: NYPD
- top_rate: 0.594
- cardinality: 10
- entropy: 1.75
- entropy_ratio: 0.5268
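The report does not define `entropy` and `entropy_ratio`, but the figures are consistent with Shannon entropy in bits and that entropy divided by log2 of the number of distinct values. The sketch below works under that assumption and recomputes both from the agency counts in the table above; the function names are illustrative, not the profiler's API.

```python
import math
from typing import Dict


def shannon_entropy(counts: Dict[str, int]) -> float:
    """Shannon entropy in bits over the non-zero category counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def entropy_ratio(counts: Dict[str, int]) -> float:
    """Entropy divided by its maximum, log2(number of observed categories)."""
    k = sum(1 for c in counts.values() if c > 0)
    if k <= 1:
        return 0.0  # assumption: how a constant column is reported is unknown
    return shannon_entropy(counts) / math.log2(k)


# Counts copied from the agency table in this report.
agency_counts = {
    "NYPD": 594, "HPD": 232, "DSNY": 82, "DOT": 34, "DEP": 24,
    "DOHMH": 16, "TLC": 12, "DPR": 4, "DCWP": 1, "OOS": 1,
}
agency_entropy = shannon_entropy(agency_counts)
agency_ratio = entropy_ratio(agency_counts)
```

Under this reading, a ratio near 1 means the values are spread almost evenly, while a ratio near 0 means one value dominates.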
agency_name
categorical · feature
This column names the NYC agency handling each record, with 10 distinct agencies and no nulls across 1000 rows. The distribution is highly concentrated: NYPD alone accounts for 59.4% of records, followed by Housing Preservation and Development at 232 and Sanitation at 82, while two agencies appear just once. The entropy ratio of 0.53 confirms the heavy skew toward a single dominant category.
Treatment: One-hot encode or group rare agencies into an 'Other' bucket before modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 10
- top_value: New York City Police Department
- top_rate: 0.594
- cardinality: 10
- entropy: 1.75
- entropy_ratio: 0.5268
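One way to form the 'Other' bucket is a simple count threshold. The sketch below uses the agency codes and counts from this report. The threshold of 10 and the label `OTHER` are assumptions, chosen here so that DPR, DCWP, and OOS are the codes that get merged.

```python
from collections import Counter
from typing import Dict


def bucket_rare(counts: Dict[str, int], min_count: int,
                other: str = "OTHER") -> Dict[str, int]:
    """Merge every category with fewer than min_count rows into one bucket."""
    bucketed: Counter = Counter()
    for value, count in counts.items():
        key = value if count >= min_count else other
        bucketed[key] += count
    return dict(bucketed)


# Counts copied from the agency table in this report.
agency_counts = {
    "NYPD": 594, "HPD": 232, "DSNY": 82, "DOT": 34, "DEP": 24,
    "DOHMH": 16, "TLC": 12, "DPR": 4, "DCWP": 1, "OOS": 1,
}
bucketed = bucket_rare(agency_counts, min_count=10)
```

Because `agency` and `agency_name` have identical counts per category, the same mapping should be applied to both so they stay consistent, or one of the two can be dropped as redundant.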
complaint_type
categorical · label
This is a categorical complaint-type field, almost certainly from a 311-style service request log, with 61 distinct categories across 1000 rows and no nulls. The distribution is moderately concentrated: 'Noise - Residential' leads at 23.4%, followed by 'Illegal Parking' (187) and 'HEAT/HOT WATER' (170), with entropy ratio 0.637 indicating a long tail. Worth noting the inconsistent casing (e.g., 'HEAT/HOT WATER' and 'UNSANITARY CONDITION' in caps vs. mixed-case neighbours), and that 'Noise' fragments across five categories in the top 20 alone (Residential, Commercial, Vehicle, Street/Sidewalk, and plain 'Noise').
Treatment: Normalise casing and consider grouping rare tail categories before one-hot or target encoding.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 61
- top_value: Noise - Residential
- top_rate: 0.234
- cardinality: 61
- entropy: 3.776
- entropy_ratio: 0.6366
descriptor
categorical · feature
Categorical descriptor field detailing the specific nature of a complaint or issue, with values like 'Banging/Pounding', 'Loud Music/Party', and 'Blocked Hydrant' suggesting NYC 311-style service requests. Cardinality is moderate at 105 unique values across 1000 rows, with the top value covering 13.3% and entropy ratio of 0.71 indicating a reasonably spread distribution. Notably, casing is inconsistent ('ENTIRE BUILDING' and 'APARTMENT ONLY' in caps versus title-case elsewhere), hinting at multiple upstream sources or schemas merged together.
Treatment: Normalize casing and group rare levels before one-hot or target encoding.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 105
- top_value: Banging/Pounding
- top_rate: 0.133
- cardinality: 105
- entropy: 4.739
- entropy_ratio: 0.7059
location_type
categorical · feature
Categorical descriptor of where an incident occurred, with 19 distinct values across 1000 rows and a 4.8% null rate. The top value 'Street/Sidewalk' covers 32.1% of non-null records, but the vocabulary is clearly inconsistent: 'Residential Building/House' (245), 'RESIDENTIAL BUILDING' (232), and 'Residential Building' (5) appear as separate categories, as do 'Sidewalk' (80) vs 'Street/Sidewalk', and '3+ Family Apartment Building' vs '3+ Family Apt. Building'. Entropy ratio of 0.57 reflects this fragmentation rather than genuine diversity.
Treatment: Normalize casing and merge synonymous labels into a controlled vocabulary before encoding.
- n: 1,000
- nulls: 48 (4.8%)
- unique: 19
- top_value: Street/Sidewalk
- top_rate: 0.3214
- cardinality: 19
- entropy: 2.437
- entropy_ratio: 0.5737
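A controlled vocabulary for this column could be applied with a lookup keyed on the lower-cased label. In the sketch below, only the variant spellings and their counts come from the report. The choice of canonical labels and the names `CANONICAL`, `normalize_location_type`, and `merge_counts` are assumptions.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical controlled vocabulary: keys are lower-cased variants seen in
# the report, values are canonical labels chosen for illustration.
CANONICAL = {
    "residential building/house": "Residential Building",
    "residential building": "Residential Building",
    "3+ family apt. building": "3+ Family Apartment Building",
    "3+ family apartment building": "3+ Family Apartment Building",
}


def normalize_location_type(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace, then map known variants to a canonical label."""
    if raw is None or not raw.strip():
        return None
    cleaned = " ".join(raw.split())
    return CANONICAL.get(cleaned.lower(), cleaned)


def merge_counts(counts: Dict[str, int]) -> Dict[str, int]:
    """Re-aggregate value counts after normalization."""
    merged: Counter = Counter()
    for value, count in counts.items():
        merged[normalize_location_type(value)] += count
    return dict(merged)


# Counts quoted in the column description above.
observed = {
    "Residential Building/House": 245,
    "RESIDENTIAL BUILDING": 232,
    "Residential Building": 5,
    "Sidewalk": 80,
}
merged = merge_counts(observed)
```

'Sidewalk' is deliberately left unmapped here: whether it should fold into 'Street/Sidewalk' is a domain decision the report does not settle.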
incident_zip
categorical · feature
This is a US ZIP code field for incident locations, with 156 distinct codes across 1000 rows and only 0.3% nulls. The distribution is highly diffuse (entropy ratio 0.94): the most frequent ZIP, 10461, accounts for just 2.9% of rows, and the top values (10461, 10462, 11226, 11214, 10034) are all NYC codes, suggesting a NYC-scoped dataset.
Treatment: Treat as high-cardinality categorical: group by borough/region or target-encode rather than one-hot.
- n: 1,000
- nulls: 3 (0.3%)
- unique: 156
- top_value: 10461
- top_rate: 0.02909
- cardinality: 156
- entropy: 6.826
- entropy_ratio: 0.937
incident_address
categorical · metadata · long_tail
Street-level incident addresses, almost certainly NYC given entries like JOHN F KENNEDY AIRPORT and LA GUARDIA AIRPORT. Cardinality is extreme: 787 unique values across 1000 rows with entropy ratio 0.98, and the modal address '31 HARRISON AVENUE' appears just 9 times (0.9%). Null rate is low at 1.6%, but the long tail makes this unusable as a categorical feature without aggregation.
Treatment: Geocode to borough/zip or coordinates rather than using raw strings.
- n: 1,000
- nulls: 16 (1.6%)
- unique: 787
- top_value: 31 HARRISON AVENUE
- top_rate: 0.009146
- cardinality: 787
- entropy: 9.431
- entropy_ratio: 0.9803
street_name
categorical · feature · long_tail
Street-name strings, almost certainly NYC thoroughfares given entries like OCEAN AVENUE, BROADWAY, and BAY PARKWAY. Cardinality is extreme: 592 unique values across 1000 rows with the top value covering only 1.0% and entropy_ratio of 0.96, so the distribution is essentially flat with a long tail. Null rate is low at 1.6%.
Treatment: Group rare streets into an 'other' bucket or target/frequency-encode before modelling.
- n: 1,000
- nulls: 16 (1.6%)
- unique: 592
- top_value: OCEAN AVENUE
- top_rate: 0.01016
- cardinality: 592
- entropy: 8.886
- entropy_ratio: 0.9648
cross_street_1
categorical · metadata · long_tail · null_rate
Street-name field used as a cross-street reference, likely from NYC service or incident records given entries like 'WYTHE AVENUE', 'ADAM CLAYTON POWELL JR BOULEVARD', and numbered avenues. Cardinality is extreme (457 unique across 1000 rows, entropy ratio 0.97) and the top value 'BEND' covers only 1.3%, so no value dominates. A 25% null rate and presence of placeholder-like 'DEAD END' suggest inconsistent capture that should be reviewed before use.
Treatment: Normalise street strings and treat as high-cardinality location metadata; do not one-hot encode directly.
- n: 1,000
- nulls: 250 (25.0%)
- unique: 457
- top_value: BEND
- top_rate: 0.01333
- cardinality: 457
- entropy: 8.534
- entropy_ratio: 0.9658
cross_street_2
categorical · feature · long_tail · null_rate
This is a free-text street name field, almost certainly the second cross-street bounding an incident or location record. Cardinality is extreme (458 unique across 750 non-null rows, entropy ratio 0.97) and the modal value 'BERRY STREET' covers just 1.07% of non-null rows, so no street dominates. A quarter of values are null and oddities like 'DEAD END' and 'AIRTRAIN-HOWARD BCH/JAMAICA LINE' appear alongside normal street names, suggesting inconsistent free-text entry.
Treatment: Normalise casing/synonyms and bucket rare values, or drop in favour of geocoded coordinates.
- n: 1,000
- nulls: 250 (25.0%)
- unique: 458
- top_value: BERRY STREET
- top_rate: 0.01067
- cardinality: 458
- entropy: 8.536
- entropy_ratio: 0.9657
intersection_street_1
categorical · feature · long_tail · null_rate
Cross-street name for an incident location, judging from values like 'WYTHE AVENUE', 'ADAM CLAYTON POWELL JR BOULEVARD', and 'DEAD END'. The field is sparse and highly diverse: 26.7% null and 436 distinct values across 1000 rows, with the most common ('BEND') appearing only 9 times and entropy ratio 0.965 indicating an almost-flat long tail. Note the presence of non-street tokens like 'BEND' and 'DEAD END', which suggest mixed semantics rather than clean street names.
Treatment: Normalize street names and bucket the long tail; treat nulls as a separate category before encoding.
- n: 1,000
- nulls: 267 (26.7%)
- unique: 436
- top_value: BEND
- top_rate: 0.01228
- cardinality: 436
- entropy: 8.465
- entropy_ratio: 0.9654
intersection_street_2
categorical · feature · long_tail · null_rate
This column holds the second cross-street of an intersection, drawn from NYC street names like BERRY STREET, FREDERICK DOUGLASS BOULEVARD, and DITMAS AVENUE. It is extremely high-cardinality (443 unique values across 1000 rows, entropy ratio 0.97) with a long flat tail: the top value covers only 1.1% of rows. Notably, 26.7% of rows are null, and oddities like 'DEAD END' and 'AIRTRAIN-HOWARD BCH/JAMAICA LINE' appear alongside conventional street names.
Treatment: Normalize street strings and group rare values or geocode to coordinates before modelling; impute or flag the 26.7% nulls.
- n: 1,000
- nulls: 267 (26.7%)
- unique: 443
- top_value: BERRY STREET
- top_rate: 0.01091
- cardinality: 443
- entropy: 8.489
- entropy_ratio: 0.9657
address_type
categorical · feature
Categorical tag describing the kind of geolocation reference, with four levels: ADDRESS, INTERSECTION, PLACE, and BLOCKFACE. The distribution is highly imbalanced: ADDRESS covers 94.2% of the 997 non-null rows (939 rows), leaving the other three categories with 36, 12, and 10 occurrences respectively. Entropy ratio is just 0.199, and 0.3% of rows are null.
Treatment: One-hot encode, but expect the non-ADDRESS levels to contribute little signal given the severe imbalance.
- n: 1,000
- nulls: 3 (0.3%)
- unique: 4
- top_value: ADDRESS
- top_rate: 0.9418
- cardinality: 4
- entropy: 0.3978
- entropy_ratio: 0.1989
city
categorical · feature
Categorical column listing NYC-area city/neighborhood names, dominated by the five boroughs with BROOKLYN top at 307 of 971 non-null rows (31.6%), followed by BRONX (227) and NEW YORK (170). Cardinality is modest (40 unique) and null rate is low (2.9%), but the mix conflates borough names (BROOKLYN, BRONX, QUEENS) with neighborhood names (ASTORIA, ELMHURST, RIDGEWOOD), so granularity is inconsistent. Entropy ratio of 0.61 confirms the heavy concentration in a few labels.
Treatment: Normalize to a consistent geographic level (e.g., map neighborhoods to boroughs) and one-hot or target-encode.
- n: 1,000
- nulls: 29 (2.9%)
- unique: 40
- top_value: BROOKLYN
- top_rate: 0.3162
- cardinality: 40
- entropy: 3.234
- entropy_ratio: 0.6077
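The city figures only reconcile if `top_rate` is a share of non-null rows rather than of all rows. That reading is inferred from the numbers in this report and is not documented by the profiler. A small check, with the hypothetical helper `top_count`:

```python
def top_count(n: int, nulls: int, top_rate: float) -> int:
    """Recover the modal count, assuming top_rate is a share of non-null rows."""
    return round(top_rate * (n - nulls))


# city: 1,000 rows, 29 nulls, top_rate 0.3162 (BROOKLYN)
city_top = top_count(1000, 29, 0.3162)
# address_type: 1,000 rows, 3 nulls, top_rate 0.9418 (ADDRESS)
address_top = top_count(1000, 3, 0.9418)

# Share of ALL rows, which is lower than top_rate whenever nulls exist.
city_share_of_all_rows = city_top / 1000
```

For columns with many nulls the two denominators diverge sharply, so percentages quoted from `top_rate` should be read as "of populated rows".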
landmark
categorical · metadata · long_tail · null_rate
This column holds landmark or street-name references, dominated by NYC thoroughfares like EAST 21 STREET, OCEAN AVENUE, and JOHN F KENNEDY AIRPORT. It's sparsely populated (30.8% null) and extremely long-tailed: 445 unique values across only 692 non-null rows, with the top value covering just 1.3% and entropy ratio at 0.969. No single landmark carries signal on its own.
Treatment: Treat as high-cardinality free text: drop or bucket into broad categories rather than one-hot encode.
- n: 1,000
- nulls: 308 (30.8%)
- unique: 445
- top_value: EAST 21 STREET
- top_rate: 0.01301
- cardinality: 445
- entropy: 8.522
- entropy_ratio: 0.9686
status
categorical · label
A 3-level categorical status field (In Progress, Closed, Open) with no nulls across 1000 rows. The distribution is nearly uniform (entropy_ratio of 0.991 and a modest top_rate of 0.388), so no class dominates, which is unusual for status fields that often skew heavily to one state.
Treatment: One-hot or ordinal encode for modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 3
- top_value: In Progress
- top_rate: 0.388
- cardinality: 3
- entropy: 1.571
- entropy_ratio: 0.9913
community_board
categorical · metadata
This column encodes NYC community board assignments, formatted as a zero-padded board number plus borough name (e.g., '12 MANHATTAN'). With 64 unique values across 1000 rows, no nulls, and a near-uniform distribution (entropy ratio 0.955, top value only 4.4%), no single board dominates. The composite format mixes two facts (board id + borough) into one string, which is worth splitting before analysis.
Treatment: Split into borough and board-number fields, then treat as a categorical geographic key.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 64
- top_value: 12 MANHATTAN
- top_rate: 0.044
- cardinality: 64
- entropy: 5.73
- entropy_ratio: 0.955
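The split can be sketched by cutting at the first space, which keeps multi-word borough names intact. This assumes every value follows the '12 MANHATTAN' pattern. The second test value, '01 STATEN ISLAND', is a hypothetical example of that pattern and is not taken from the data.

```python
from typing import Tuple


def split_community_board(raw: str) -> Tuple[str, str]:
    """Split '<board number> <borough>' at the first space.

    The board number is kept as a string so zero padding survives.
    """
    board, _, borough = raw.strip().partition(" ")
    return board, borough


board, borough = split_community_board("12 MANHATTAN")
```

Values that do not contain a space come back with an empty borough, which makes malformed entries easy to find after the split.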
council_district
categorical · feature
Categorical column holding council district codes as zero-padded strings (e.g. '09', '13'), with 51 distinct values across 1000 rows and a 1.3% null rate. The distribution is nearly uniform: entropy ratio is 0.968 and the top district '13' accounts for only 4.7% of rows, so no single district dominates. Treat the codes as discrete labels rather than numbers since they are stored as strings with leading zeros.
Treatment: Keep as string categorical and one-hot or target-encode before modelling.
- n: 1,000
- nulls: 13 (1.3%)
- unique: 51
- top_value: 13
- top_rate: 0.04661
- cardinality: 51
- entropy: 5.492
- entropy_ratio: 0.9682
police_precinct
categorical · feature
This column identifies the police precinct associated with each record, using labels like 'Precinct 62' across 76 distinct values with no nulls. The distribution is nearly uniform: the top value appears in only 4% of rows and entropy ratio is 0.94, so no precinct dominates. Treat it as a high-cardinality categorical feature rather than a meaningful ranking.
Treatment: Target- or frequency-encode before modelling; avoid one-hot given 76 levels.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 76
- top_value: Precinct 62
- top_rate: 0.04
- cardinality: 76
- entropy: 5.883
- entropy_ratio: 0.9416
bbl
categorical · foreign_key · long_tail
This column holds NYC Borough-Block-Lot (BBL) parcel identifiers: 10-digit codes where the leading digit encodes the borough (values 1–5 appear in the top entries). With 744 unique values across 1000 rows and entropy ratio 0.979, it is near-unique but has mild repetition (top BBL '5010780006' appears 9 times), suggesting multiple records per parcel rather than a primary key. Null rate is 5.9%, and the long_tail alert confirms most BBLs occur only once or twice.
Treatment: left-join on this id to a parcel/property reference table; do not use as a model feature directly.
- n: 1,000
- nulls: 59 (5.9%)
- unique: 744
- top_value: 5010780006
- top_rate: 0.009564
- cardinality: 744
- entropy: 9.342
- entropy_ratio: 0.9793
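Before joining, a BBL can be decomposed into its parts. The sketch below assumes the standard NYC layout of one borough digit, five block digits, and four lot digits, with borough codes 1 to 5 for Manhattan, Bronx, Brooklyn, Queens, and Staten Island. The names `Bbl` and `parse_bbl` are illustrative.

```python
from typing import NamedTuple, Optional

BOROUGH_BY_CODE = {
    "1": "MANHATTAN",
    "2": "BRONX",
    "3": "BROOKLYN",
    "4": "QUEENS",
    "5": "STATEN ISLAND",
}


class Bbl(NamedTuple):
    borough: str
    block: str  # kept as a string to preserve leading zeros
    lot: str


def parse_bbl(raw: Optional[str]) -> Optional[Bbl]:
    """Split a 10-digit BBL; return None for nulls or malformed values."""
    if raw is None:
        return None
    digits = raw.strip()
    if len(digits) != 10 or not digits.isdigit():
        return None
    if digits[0] not in BOROUGH_BY_CODE:
        return None
    return Bbl(BOROUGH_BY_CODE[digits[0]], digits[1:6], digits[6:])


top_bbl = parse_bbl("5010780006")  # top_value reported for this column
```

Comparing the decoded borough against the `borough` column is a cheap consistency check before any join.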
borough
categorical · feature
This is a NYC borough categorical with all 5 expected values present and no nulls across 1000 rows. Distribution is fairly balanced (entropy ratio 0.895) with BROOKLYN leading at 31.2% and QUEENS at 26.1%; STATEN ISLAND is notably underrepresented at just 22 rows.
Treatment: one-hot encode for modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 5
- top_value: BROOKLYN
- top_rate: 0.312
- cardinality: 5
- entropy: 2.079
- entropy_ratio: 0.8953
x_coordinate_state_plane
categorical · feature · long_tail
This column holds X coordinates in a state plane projection, stored as strings rather than numerics; values like "946638" and "1016981" are typical NYC-area easting values. Cardinality is extremely high (788 unique across 1000 rows, entropy ratio 0.98) with the modal value appearing only 9 times (0.9%), so it behaves as a near-continuous spatial measurement. Null rate is low at 1.2%.
Treatment: Cast to numeric and pair with the Y coordinate for spatial features rather than treating as categorical.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 788
- top_value: 946638
- top_rate: 0.009109
- cardinality: 788
- entropy: 9.433
- entropy_ratio: 0.9803
y_coordinate_state_plane
categorical · feature · long_tail
This is a State Plane Y-coordinate (northing), stored as strings rather than numerics: 788 unique values across 1000 rows with only a 1.2% null rate. The distribution is essentially flat (entropy ratio 0.98, top value '171301' appearing just 9 times), consistent with continuous spatial coordinates rather than a true category. The categorical typing is the surprise here; it should be a numeric geo-feature.
Treatment: Cast to numeric and pair with the X-coordinate as a geospatial feature; do not one-hot encode.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 788
- top_value: 171301
- top_rate: 0.009109
- cardinality: 788
- entropy: 9.437
- entropy_ratio: 0.9808
open_data_channel_type
categorical · feature
This is a low-cardinality categorical recording the intake channel for a request, with only 4 distinct values and no nulls. ONLINE dominates at 52.7% (527/1000), followed by MOBILE (234) and PHONE (214), while UNKNOWN appears 25 times and may warrant treatment as missing. Entropy ratio of 0.79 indicates a fairly balanced spread across the top three channels.
Treatment: One-hot encode and consider mapping UNKNOWN to null.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 4
- top_value: ONLINE
- top_rate: 0.527
- cardinality: 4
- entropy: 1.586
- entropy_ratio: 0.7932
park_facility_name
categorical · feature · long_tail · imbalance
Categorical column for a park or facility name, but it is effectively a constant: 998 of 1000 rows are 'Unspecified', with single occurrences of 'Marcus Garvey Park' and 'Forest Park'. Entropy ratio is 0.014 and top_rate is 0.998, so this column carries virtually no signal despite having no nulls.
Treatment: Drop; near-constant with 99.8% 'Unspecified'.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 3
- top_value: Unspecified
- top_rate: 0.998
- cardinality: 3
- entropy: 0.02281
- entropy_ratio: 0.01439
park_borough
categorical · feature
This column records one of New York City's five boroughs (likely the borough of an associated park), with no missing values across 1000 rows. Distribution is fairly even (entropy ratio 0.8953), but Staten Island is sharply underrepresented at just 22 rows versus Brooklyn's 312.
Treatment: One-hot encode the five borough levels before modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 5
- top_value: BROOKLYN
- top_rate: 0.312
- cardinality: 5
- entropy: 2.079
- entropy_ratio: 0.8953
latitude
categorical · feature · long_tail
Latitude coordinates stored as strings rather than floats, with 794 unique values across 988 non-null rows (1.2% nulls). All top values cluster in the 40.6-40.87 range, consistent with the New York City area. The near-maximum entropy ratio (0.98) and 0.9% top rate confirm this is effectively continuous geospatial data miscast as categorical.
Treatment: cast to float and pair with longitude for geospatial features rather than treating as categorical.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 794
- top_value: 40.63677840515416
- top_rate: 0.009109
- cardinality: 794
- entropy: 9.449
- entropy_ratio: 0.9809
longitude
categorical · feature · long_tail
This is a longitude coordinate column, almost certainly geographic points in the New York City area given values clustered around -73.8 to -74.1. It has been ingested as categorical text rather than numeric, with 794 unique values across 1000 rows and a top value frequency of just 0.9%, indicating a long tail of near-unique floats. Null rate is low at 1.2%, and entropy ratio of 0.98 confirms values are spread very evenly.
Treatment: Cast to float and use as a numeric geospatial feature, optionally paired with latitude for distance or grid encoding.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 794
- top_value: -74.13551741527912
- top_rate: 0.009109
- cardinality: 794
- entropy: 9.449
- entropy_ratio: 0.9809
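Casting the coordinate strings and pairing latitude with longitude might look like the sketch below. It uses the haversine great-circle formula with a mean Earth radius of 6371 km, one common choice for a distance feature that the report does not prescribe. `to_float` and `haversine_km` are hypothetical helper names.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def to_float(raw: Optional[str]) -> Optional[float]:
    """Cast a coordinate string to float; None for nulls or unparseable text."""
    if raw is None or not raw.strip():
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Top values reported for latitude and longitude in this profile.
lat = to_float("40.63677840515416")
lon = to_float("-74.13551741527912")
```

Rows where either cast returns `None` (the 1.2% nulls) should be flagged rather than silently given a distance of zero.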
location
unknown · metadata · skipped
The column is named "location" but saturn skipped detailed profiling, so its kind is unknown and no value statistics are available. We can only confirm there are 1000 rows with a 0.0 null rate; uniqueness, cardinality, and value distribution are all missing. The name suggests geographic or place data, but without evidence we cannot verify format (string, coordinates, codes) or quality.
Treatment: Re-profile with an appropriate parser before deciding; do not use as-is.
- n: 1,000
- nulls: 0 (0.0%)
- unique: —
:@computed_region_f5dn_yrer
categorical · foreign_key
This is a Socrata-style computed region column (`:@computed_region_f5dn_yrer`), almost certainly a spatial join key mapping each row to one of 62 geographic regions. Distribution is remarkably flat (entropy ratio 0.957, and the top value '47' covers only 4.5% of rows), so no single region dominates. Null rate is low at 1.2%.
Treatment: Treat as a categorical region id; one-hot or target-encode, or left-join to a region lookup.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 62
- top_value: 47
- top_rate: 0.04453
- cardinality: 62
- entropy: 5.698
- entropy_ratio: 0.957
:@computed_region_yeji_bk3q
categorical · metadata
This appears to be an auto-generated Socrata computed region column (`:@computed_region_yeji_bk3q`), holding a small geographic or administrative bucket id encoded as strings. Cardinality is just 5 with a fairly balanced spread (top value '2' at 31.6%, entropy ratio 0.896), though category '1' is rare at only 22 occurrences and 1.2% of rows are null.
Treatment: Treat as a low-cardinality categorical region code; one-hot encode or drop if the region mapping is not needed.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 5
- top_value: 2
- top_rate: 0.3158
- cardinality: 5
- entropy: 2.08
- entropy_ratio: 0.8959
:@computed_region_sbqj_enih
categorical · foreign_key
This is a Socrata-style computed region column (`:@computed_region_sbqj_enih`), almost certainly a spatial-join key assigning each row to one of 75 region polygons. The distribution is very flat (entropy ratio 0.94 of the maximum, and the top value '37' covers only 4.05% of rows), so no single region dominates. Null rate is low at 1.2%, consistent with rows that fell outside any polygon.
Treatment: Treat as a categorical region id; left-join to the region lookup or drop if spatial context isn't needed.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 75
- top_value: 37
- top_rate: 0.04049
- cardinality: 75
- entropy: 5.867
- entropy_ratio: 0.942
:@computed_region_92fq_4b7q
categorical · foreign_key
This is a Socrata-style computed region column (`:@computed_region_92fq_4b7q`) holding 51 distinct integer-coded region IDs across 1000 rows, with only 1.2% nulls. The distribution is remarkably flat (entropy ratio 0.969, and the top value '12' covers just 4.66%), suggesting a near-uniform spread across regions rather than a dominant one. No single region drives the data, so this behaves like a geographic foreign key into an external boundary lookup.
Treatment: Left-join on this id to a region lookup, or treat as a high-cardinality categorical (target-encode) for modelling.
- n: 1,000
- nulls: 12 (1.2%)
- unique: 51
- top_value: 12
- top_rate: 0.04656
- cardinality: 51
- entropy: 5.497
- entropy_ratio: 0.9691
descriptor_2
categorical · feature · long_tail · null_rate
Secondary descriptor for service complaints, dominated by heating issues: "NO HEAT" leads at 29.0% of non-null rows (top_rate 0.2896), with "NO HOT WATER" variants and pest/structural codes filling the tail across 75 distinct values. Nearly 60% of rows are null (null_rate 0.596) and "N/A" appears as a literal value 98 times, so missingness is encoded inconsistently. Entropy ratio 0.64 confirms a long tail beyond the heat-related top categories.
Treatment: Normalize "N/A" to null, then group rare levels before one-hot or target encoding.
- n: 1,000
- nulls: 596 (59.6%)
- unique: 75
- top_value: NO HEAT
- top_rate: 0.2896
- cardinality: 75
- entropy: 3.995
- entropy_ratio: 0.6413
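Folding the literal "N/A" into true nulls changes the effective null rate noticeably. The sketch below rebuilds a synthetic column with the counts reported above (596 nulls, 98 literal "N/A", and 306 other values represented by a placeholder string) and measures the rate before and after. The helper names and the placeholder are assumptions.

```python
from typing import Iterable, Optional

# Assumption: only the literal "N/A" reported above is treated as missing.
MISSING_TOKENS = {"N/A"}


def normalize_missing(raw: Optional[str]) -> Optional[str]:
    """Map empty strings and known missing-value tokens to None."""
    if raw is None:
        return None
    cleaned = raw.strip()
    if not cleaned or cleaned.upper() in MISSING_TOKENS:
        return None
    return cleaned


def null_rate(values: Iterable[Optional[str]]) -> float:
    """Fraction of entries that are None."""
    items = list(values)
    return sum(v is None for v in items) / len(items)


# Synthetic stand-in for descriptor_2 with the reported counts.
column = [None] * 596 + ["N/A"] * 98 + ["<other value>"] * 306
rate_before = null_rate(column)
rate_after = null_rate(normalize_missing(v) for v in column)
```

With these counts the null rate rises from 59.6% to 69.4%, which is worth knowing before deciding whether the column is still usable.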
resolution_description
categorical · label · null_rate
Canned resolution narratives attached to 311-style complaints, drawn from a fixed template library of 19 distinct strings covering HPD, NYPD, and DOT outcomes. 41.2% of rows are null (likely still-open cases), and the top template alone covers 24.1% of populated rows, with entropy ratio 0.72 indicating moderate concentration across the small vocabulary. The text is boilerplate agency response language rather than free-form notes, so it behaves as a low-cardinality categorical disposition code.
Treatment: Map the 19 templates to short disposition codes (agency + outcome) and treat nulls as an explicit 'open/unresolved' category.
- n: 1,000
- nulls: 412 (41.2%)
- unique: 19
- top_value: The following complaint conditions are still open. HPD has already attempted to notify the property owner that the condition exists; the tenant should provide access for the owner to make the repair. HPD may attempt to contact the tenant by phone to verify the correction of the condition or an HPD Inspector may attempt to conduct an inspection.
- top_rate: 0.2415
- cardinality: 19
- entropy: 3.071
- entropy_ratio: 0.7228
resolution_action_updated_date
categorical · timestamp · long_tail · null_rate
ISO-8601 datetimes recording when a resolution action was last updated, stored as strings rather than parsed timestamps. 40.9% of rows are null, and a single value, 2026-01-19T00:00:00.000, accounts for 39.3% of non-null entries (232 of 591). This is likely a default or batch-stamped backfill at midnight, since all other timestamps cluster on 2026-01-20 with second-level precision. The remaining 353 unique values appear at most twice, giving high entropy (ratio 0.72) among the long tail.
Treatment: Parse to datetime, treat the midnight 2026-01-19 spike as a sentinel/default, and engineer null-flag plus recency features rather than using the raw string.
- n: 1,000
- nulls: 409 (40.9%)
- unique: 354
- top_value: 2026-01-19T00:00:00.000
- top_rate: 0.3926
- cardinality: 354
- entropy: 6.102
- entropy_ratio: 0.7206
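The null flag and the midnight-sentinel flag can be derived in one pass. This sketch assumes the timestamp format of the reported top value and treats any exact-midnight time as a possible default. The function name is hypothetical, and the second example timestamp is invented for illustration.

```python
from datetime import datetime, time
from typing import Optional

# Format inferred from "2026-01-19T00:00:00.000" (assumption).
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def resolution_update_features(raw: Optional[str]) -> dict:
    """Return a null flag, a midnight-sentinel flag, and the parsed value."""
    if not raw:
        return {"is_null": True, "is_midnight": False, "parsed": None}
    parsed = datetime.strptime(raw, TS_FORMAT)
    return {
        "is_null": False,
        # Exact midnight suggests a date-only or batch-stamped value.
        "is_midnight": parsed.time() == time(0, 0),
        "parsed": parsed,
    }


sentinel = resolution_update_features("2026-01-19T00:00:00.000")
# Hypothetical non-sentinel value, not taken from the data.
regular = resolution_update_features("2026-01-20T09:15:42.000")
missing = resolution_update_features(None)
```

Recency features should then be computed only on rows where `is_midnight` is false, since the sentinel carries a date but no real time of day.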
taxi_pick_up_location
categorical metadata null_rate
This is a free-text taxi pickup address, populated for only 12 of 1000 rows (null_rate 0.988). Among the 12 non-null entries, just 3 distinct addresses appear, dominated by JFK Airport (6) and LaGuardia Airport (5), suggesting the field is only filled for airport-related trips. With near-total nullity, this column carries almost no signal as-is. Treatment: Drop or collapse to a binary 'pickup_address_present' flag given 98.8% nulls.
- n: 1,000
- nulls: 988 (98.8%)
- unique: 3
- top_value: JOHN F KENNEDY AIRPORT, QUEENS (JAMAICA) ,NY, 11430
- top_rate: 0.5
- cardinality: 3
- entropy: 1.325
- entropy_ratio: 0.836
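The entropy figures for this column can be reproduced from the counts quoted in the description (6, 5, and the 1 remaining row). The sketch below assumes entropy is Shannon entropy in bits and that entropy_ratio is entropy divided by log2 of the cardinality; that definition is inferred from the reported numbers, not stated by the profiler.

```python
import math


def entropy_stats(counts):
    """Return (entropy in bits, entropy ratio) for a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = sum(p * math.log2(1 / p) for p in probs)
    # Normalize by the maximum possible entropy, log2(k); taken as 0 when k <= 1.
    k = len(probs)
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# JFK Airport (6), LaGuardia Airport (5), and the one remaining address (1)
entropy, ratio = entropy_stats([6, 5, 1])
print(round(entropy, 3), round(ratio, 3))  # prints: 1.325 0.836
```

With only 12 observations behind it, the ratio of 0.836 describes how evenly three addresses split a dozen rows, not a stable property of the field.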
vehicle_type
categorical feature null_rate
Categorical descriptor of vehicle class with five levels (Car, SUV, Other, Van, Truck), almost certainly a feature describing involved vehicles. The column is essentially empty: 95.7% of rows are null, leaving only 43 populated values, of which 65% are 'Car'. With such severe missingness, any modelling signal is fragile despite a reasonable entropy ratio (0.69) across the observed sample. Treatment: Drop or encode missingness as its own level; do not rely on this column given 95.7% nulls.
- n: 1,000
- nulls: 957 (95.7%)
- unique: 5
- top_value: Car
- top_rate: 0.6512
- cardinality: 5
- entropy: 1.599
- entropy_ratio: 0.6886
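If the column is kept, encoding missingness as its own level might look like the sketch below. The `__missing__` label and the output key format are assumptions made for illustration; the five levels are the ones listed in the description.

```python
LEVELS = ["Car", "SUV", "Other", "Van", "Truck"]
MISSING = "__missing__"  # hypothetical label for null rows


def one_hot_vehicle_type(value):
    """One-hot encode vehicle_type, with nulls mapped to a dedicated level."""
    level = MISSING if value is None or value == "" else value
    if level != MISSING and level not in LEVELS:
        raise ValueError(f"unexpected vehicle_type level: {value!r}")
    return {f"vehicle_type={name}": int(level == name) for name in LEVELS + [MISSING]}
```

Because 957 of 1,000 rows would land in the missing level, the encoded feature is close to a presence flag in practice.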
closed_date
categorical timestamp long_tail null_rate
ISO-8601 timestamps recording when a record was closed, stored as strings rather than parsed datetimes. 66.4% of rows are null, consistent with most records still being open. Among the 336 non-null entries there are 334 unique values, and the visible top values all fall within a narrow window on 2026-01-20, suggesting closures cluster tightly in time rather than spreading across the dataset. Treatment: Parse to datetime and derive features (e.g., time-to-close); treat nulls as 'still open' rather than imputing.
- n: 1,000
- nulls: 664 (66.4%)
- unique: 334
- top_value: 2026-01-20T01:00:32.000
- top_rate: 0.005952
- cardinality: 334
- entropy: 8.38
- entropy_ratio: 0.9996
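A time-to-close feature could be derived roughly as below. This assumes a companion `created_date` string in the same ISO-8601 layout, which is not profiled in this section, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Optional

FMT = "%Y-%m-%dT%H:%M:%S.%f"


def time_to_close_hours(created: str, closed: Optional[str]) -> Optional[float]:
    """Hours between creation and closure, or None when the record is still open."""
    if not closed:
        return None  # null closed_date is kept as 'still open', not imputed
    delta = datetime.strptime(closed, FMT) - datetime.strptime(created, FMT)
    return delta.total_seconds() / 3600


def is_still_open(closed: Optional[str]) -> bool:
    """Explicit open flag so downstream models see missingness as a state."""
    return not closed
```

Returning `None` for open records keeps the 664 null rows out of any average or distribution of closure times.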
bridge_highway_name
categorical metadata long_tail null_rate
This column appears to record the bridge or highway name associated with each row, but it is essentially empty: 99.6% of values are null and only 4 distinct strings appear across 1000 rows. The four observed values ("J", "Cross Island Pkwy", "Long Island Expwy", "Nassau Expwy") each occur exactly once, with one entry ("J") looking like a stray code or data-entry artifact rather than a real road name. Treatment: Drop or retain only as a sparse flag; too null-heavy to model directly.
- n: 1,000
- nulls: 996 (99.6%)
- unique: 4
- top_value: J
- top_rate: 0.25
- cardinality: 4
- entropy: 2
- entropy_ratio: 1
bridge_highway_segment
categorical metadata long_tail null_rate
Categorical field naming a bridge or highway segment, but 99.6% of rows are null with only 4 non-null values observed across 1000 records. Each of the 4 surviving entries is unique (Mezzanine, Hempstead Ave, Van Dam St, Belt Pkwy), suggesting this attribute applies to a rare subset of incidents tied to specific NYC roadway infrastructure. Treatment: Drop or convert to a binary 'has_segment' flag given the 99.6% null rate.
- n: 1,000
- nulls: 996 (99.6%)
- unique: 4
- top_value: Mezzanine
- top_rate: 0.25
- cardinality: 4
- entropy: 2
- entropy_ratio: 1
facility_type
categorical metadata null_rate imbalance
This column is meant to record facility_type but is effectively empty: 95.9% of rows are null and the only non-null value across all 1000 rows is the literal string "N/A" (41 occurrences). Cardinality is 1 and entropy is 0, so the field carries no information as-is. Treatment: Drop; zero entropy and 95.9% nulls leave nothing to model.
- n: 1,000
- nulls: 959 (95.9%)
- unique: 1
- top_value: N/A
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
bridge_highway_direction
categorical metadata long_tail null_rate
Almost certainly a directional descriptor for bridge or highway traffic, with values like "South/Long Island Bound", "West/Manhattan Bound", and "Eastbound". The column is effectively empty: 99.7% of rows are null, leaving only 3 populated rows, each with a distinct value. With such sparse coverage, the cardinality and entropy stats describe three observations rather than a meaningful distribution. Treatment: Drop or retain only as a contextual flag; too sparse (99.7% null) for modelling.
- n: 1,000
- nulls: 997 (99.7%)
- unique: 3
- top_value: South/Long Island Bound
- top_rate: 0.3333
- cardinality: 3
- entropy: 1.585
- entropy_ratio: 1
road_ramp
categorical feature null_rate
A categorical flag distinguishing 'Ramp' vs 'Roadway' segments, but it is effectively empty: 997 of 1000 rows are null, leaving only 3 observed values (2 'Ramp', 1 'Roadway'). With such sparse signal, the entropy and top-rate stats are computed on just 3 records and carry no real information. Treatment: Drop; null rate of 0.997 leaves no usable signal.
- n: 1,000
- nulls: 997 (99.7%)
- unique: 2
- top_value: Ramp
- top_rate: 0.6667
- cardinality: 2
- entropy: 0.9183
- entropy_ratio: 0.9183
taxi_company_borough
categorical metadata long_tail null_rate imbalance
This column appears to be the borough associated with a taxi company, but it is effectively empty: 99.9% of rows are null and the single non-null value is 'BRONX'. With cardinality of 1 and entropy of 0, the field carries no information in this sample. Treatment: Drop; the column is 99.9% null with only one observed category.
- n: 1,000
- nulls: 999 (99.9%)
- unique: 1
- top_value: BRONX
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
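The drop-or-flag recommendations for the sparse columns above can be summarized as a simple triage rule. The sketch below is one possible rule, not the profiler's logic: the 0.95 cutoff echoes the ">95% null" remark in the summary, and the decision labels are hypothetical. The null and unique counts are copied from the stat blocks in this section.

```python
# (nulls, unique) per column, copied from the stat blocks above; n = 1,000.
PROFILE = {
    "taxi_pick_up_location": (988, 3),
    "vehicle_type": (957, 5),
    "bridge_highway_name": (996, 4),
    "bridge_highway_segment": (996, 4),
    "facility_type": (959, 1),
    "bridge_highway_direction": (997, 3),
    "road_ramp": (997, 2),
    "taxi_company_borough": (999, 1),
}
N_ROWS = 1000
NULL_THRESHOLD = 0.95  # assumed cutoff, not a profiler setting


def triage(nulls, unique, n=N_ROWS, threshold=NULL_THRESHOLD):
    """Suggest an action from null rate and cardinality alone."""
    if unique <= 1:
        return "drop"  # a single observed value carries zero entropy
    if nulls / n >= threshold:
        return "drop_or_flag"  # too sparse to model; keep at most a presence flag
    return "keep"


decisions = {col: triage(*stats) for col, stats in PROFILE.items()}
```

Under this rule `facility_type` and `taxi_company_borough` come out as `drop`, and the other six columns as `drop_or_flag`, which matches the treatments written on each card.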