wild nyc 311 sample
Reading
This is a 1,000-row sample of NYC 311 service requests with 47 columns covering complaint metadata, location, agency routing, and resolution status. The bulk of activity is noise and parking complaints routed to NYPD (88% of all tickets), with 'Noise - Residential' alone accounting for 393 of 1,000 rows and 'Loud Music/Party' the dominant descriptor at 427. Geographically, the load is spread fairly evenly across Queens, Manhattan, Brooklyn, and the Bronx, while Staten Island barely registers (11 rows). Worth a closer look: the resolution funnel, where 61% of tickets are Closed but 30.5% are still In Progress and 8.5% are Open, and the channel mix, where ONLINE (466) is the largest single channel but still trails MOBILE (288) and PHONE (236) combined (524). Note that many specialty fields (road_ramp, taxi_*, bridge_highway_*, facility_type) are at least 99% null and should be excluded from analysis.
citing: complaint_type · descriptor · agency · borough · status · open_data_channel_type · city · location_type · address_type · road_ramp · taxi_company_borough · facility_type
Charts the summary said to look at first
complaint_type (top 20 values)
| value | count | share |
|---|---|---|
| Noise - Residential | 393 | 39.3% |
| Illegal Parking | 197 | 19.7% |
| Noise - Commercial | 148 | 14.8% |
| Blocked Driveway | 55 | 5.5% |
| HEAT/HOT WATER | 49 | 4.9% |
| Noise - Street/Sidewalk | 44 | 4.4% |
| Noise - Vehicle | 17 | 1.7% |
| UNSANITARY CONDITION | 12 | 1.2% |
| Street Condition | 7 | 0.7% |
| Non-Emergency Police Matter | 7 | 0.7% |
| Smoking or Vaping | 5 | 0.5% |
| Abandoned Vehicle | 5 | 0.5% |
| Animal-Abuse | 5 | 0.5% |
| Rodent | 4 | 0.4% |
| Encampment | 4 | 0.4% |
| Taxi Complaint | 3 | 0.3% |
| Dirty Condition | 3 | 0.3% |
| PAINT/PLASTER | 3 | 0.3% |
| GENERAL | 3 | 0.3% |
| Drinking | 2 | 0.2% |
agency
| value | count | share |
|---|---|---|
| NYPD | 880 | 88.0% |
| HPD | 74 | 7.4% |
| DOHMH | 13 | 1.3% |
| DOT | 11 | 1.1% |
| TLC | 6 | 0.6% |
| DSNY | 6 | 0.6% |
| DHS | 2 | 0.2% |
| OOS | 2 | 0.2% |
| DCWP | 2 | 0.2% |
| DPR | 2 | 0.2% |
| DEP | 2 | 0.2% |
borough
| value | count | share |
|---|---|---|
| QUEENS | 274 | 27.4% |
| MANHATTAN | 258 | 25.8% |
| BROOKLYN | 254 | 25.4% |
| BRONX | 203 | 20.3% |
| STATEN ISLAND | 11 | 1.1% |
status
| value | count | share |
|---|---|---|
| Closed | 610 | 61.0% |
| In Progress | 305 | 30.5% |
| Open | 85 | 8.5% |
open_data_channel_type
| value | count | share |
|---|---|---|
| ONLINE | 466 | 46.6% |
| MOBILE | 288 | 28.8% |
| PHONE | 236 | 23.6% |
| UNKNOWN | 10 | 1.0% |
Schema
47 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| unique_key | categorical | 0.0% | 1,000 | long_tail |
| created_date | categorical | 0.0% | 939 | long_tail |
| agency | categorical | 0.0% | 11 | |
| agency_name | categorical | 0.0% | 11 | |
| complaint_type | categorical | 0.0% | 45 | |
| descriptor | categorical | 0.0% | 71 | |
| location_type | categorical | 1.4% | 20 | |
| incident_zip | categorical | 0.3% | 146 | |
| incident_address | categorical | 0.5% | 776 | long_tail |
| street_name | categorical | 0.5% | 569 | long_tail |
| cross_street_1 | categorical | 8.0% | 511 | long_tail |
| cross_street_2 | categorical | 7.9% | 504 | long_tail |
| intersection_street_1 | categorical | 8.5% | 506 | long_tail |
| intersection_street_2 | categorical | 8.4% | 499 | long_tail |
| address_type | categorical | 0.2% | 4 | imbalance |
| city | categorical | 2.1% | 36 | |
| landmark | categorical | 10.5% | 515 | long_tail |
| status | categorical | 0.0% | 3 | |
| community_board | categorical | 0.0% | 65 | |
| council_district | categorical | 0.9% | 51 | |
| police_precinct | categorical | 0.0% | 76 | |
| bbl | categorical | 5.5% | 719 | long_tail |
| borough | categorical | 0.0% | 5 | |
| x_coordinate_state_plane | categorical | 0.7% | 763 | long_tail |
| y_coordinate_state_plane | categorical | 0.7% | 768 | long_tail |
| open_data_channel_type | categorical | 0.0% | 4 | |
| park_facility_name | categorical | 0.0% | 2 | imbalance |
| park_borough | categorical | 0.0% | 5 | |
| latitude | categorical | 0.7% | 770 | long_tail |
| longitude | categorical | 0.7% | 770 | long_tail |
| location | unknown | 0.0% | — | skipped |
| :@computed_region_f5dn_yrer | categorical | 0.7% | 62 | |
| :@computed_region_yeji_bk3q | categorical | 0.7% | 5 | |
| :@computed_region_sbqj_enih | categorical | 0.7% | 75 | |
| :@computed_region_92fq_4b7q | categorical | 0.7% | 51 | |
| descriptor_2 | categorical | 88.2% | 43 | long_tail, null_rate |
| resolution_description | categorical | 30.7% | 16 | null_rate |
| resolution_action_updated_date | categorical | 30.6% | 585 | long_tail, null_rate |
| closed_date | categorical | 39.0% | 585 | long_tail, null_rate |
| taxi_pick_up_location | categorical | 99.4% | 6 | long_tail, null_rate |
| vehicle_type | categorical | 96.5% | 4 | null_rate |
| facility_type | categorical | 99.0% | 1 | null_rate, imbalance |
| taxi_company_borough | categorical | 99.9% | 1 | long_tail, null_rate, imbalance |
| bridge_highway_name | categorical | 99.7% | 2 | null_rate |
| bridge_highway_direction | categorical | 99.7% | 2 | null_rate |
| road_ramp | categorical | 99.9% | 1 | long_tail, null_rate, imbalance |
| bridge_highway_segment | categorical | 99.7% | 2 | null_rate |
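The near-empty specialty fields can be picked out mechanically from the null rates in the schema. The sketch below is illustrative only: the `NULL_PCT` dictionary copies the sparse rows of the schema by hand, and both the `sparse_columns` helper and the 99% cut-off are assumptions rather than anything the profiler prescribes.

```python
# Null rates (percent) copied from the schema; only the sparse tail is listed.
NULL_PCT = {
    "descriptor_2": 88.2,
    "vehicle_type": 96.5,
    "facility_type": 99.0,
    "taxi_pick_up_location": 99.4,
    "bridge_highway_name": 99.7,
    "bridge_highway_direction": 99.7,
    "bridge_highway_segment": 99.7,
    "road_ramp": 99.9,
    "taxi_company_borough": 99.9,
}


def sparse_columns(null_pct, threshold=99.0):
    """Return the column names whose null rate is at or above `threshold` percent.

    Hypothetical helper; the threshold is a judgment call, not a profiler setting.
    """
    return sorted(name for name, pct in null_pct.items() if pct >= threshold)


to_drop = sparse_columns(NULL_PCT)
print(to_drop)
```

With the 99% cut-off, vehicle_type (96.5%) and descriptor_2 (88.2%) survive, which matches the summary's advice to set aside only the specialty fields.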
unique_key
categorical · identifier · long_tail
This column is a unique row identifier: all 1,000 values are distinct (n_unique=1000, entropy_ratio=1.0) with no nulls and a top_rate of just 0.001. Sample values like '67519092' are 8-digit numeric strings, consistent with a primary key from a source system. There is no predictive signal here; the long_tail alert simply reflects perfect uniqueness. Treatment: Drop from modelling; retain only as a join key or row reference.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 1,000
- top_value
- 67519092
- top_rate
- 0.001
- cardinality
- 1,000
- entropy
- 9.966
- entropy_ratio
- 1
created_date
categorical · timestamp · long_tail
This is a record creation timestamp stored as an ISO-8601 string, with 939 unique values across 1,000 rows and no nulls. An entropy ratio of 0.995 confirms it is essentially a per-row timestamp; the most frequent value appears only 9 times (0.9%). All visible top values fall within a narrow window on 2026-01-17 to 2026-01-18, suggesting the sample covers only a short ingestion period rather than a broad history. Treatment: Parse to datetime and derive features (hour, day, delta) rather than using as a categorical.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 939
- top_value
- 2026-01-17T23:49:56.000
- top_rate
- 0.009
- cardinality
- 939
- entropy
- 9.828
- entropy_ratio
- 0.9952
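The parsing step in the treatment can be sketched with the standard library alone. The `timestamp_features` helper and its output keys are hypothetical names chosen for illustration; the input is the top value reported above.

```python
from datetime import datetime


def timestamp_features(raw):
    """Parse an ISO-8601 string and derive simple calendar features.

    Hypothetical helper: the feature names are illustrative only.
    """
    ts = datetime.fromisoformat(raw)
    return {
        "date": ts.date().isoformat(),
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
    }


features = timestamp_features("2026-01-17T23:49:56.000")
print(features)
```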
agency
categorical · feature
This is the responding NYC agency code for each record, with 11 distinct values and no nulls. The distribution is severely concentrated: NYPD accounts for 880 of 1,000 rows (top_rate 0.88), followed by HPD at 74, leaving the remaining 9 agencies with at most 13 rows each. An entropy ratio of 0.22 confirms the column carries little information beyond 'NYPD vs. not'. Treatment: Collapse rare agencies into an 'Other' bucket before encoding.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- NYPD
- top_rate
- 0.88
- cardinality
- 11
- entropy
- 0.7715
- entropy_ratio
- 0.223
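Collapsing rare agencies can be sketched as follows, using the counts from the agency table. The `collapse_rare` helper and the `min_count=50` cut-off are assumptions made for this example.

```python
from collections import Counter

# Counts copied from the agency table in this report.
AGENCY_COUNTS = {
    "NYPD": 880, "HPD": 74, "DOHMH": 13, "DOT": 11, "TLC": 6, "DSNY": 6,
    "DHS": 2, "OOS": 2, "DCWP": 2, "DPR": 2, "DEP": 2,
}


def collapse_rare(counts, min_count=50, other_label="Other"):
    """Fold every level with fewer than `min_count` rows into one bucket.

    Hypothetical helper; the threshold is an assumption, not a recommendation.
    """
    collapsed = Counter()
    for level, n in counts.items():
        key = level if n >= min_count else other_label
        collapsed[key] += n
    return dict(collapsed)


collapsed = collapse_rare(AGENCY_COUNTS)
print(collapsed)
```

At this threshold only NYPD and HPD keep their own level, and the nine remaining agencies contribute 46 rows to 'Other'.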
agency_name
categorical · feature
This column names the NYC agency handling each record, with 11 distinct agencies and no nulls. The distribution is heavily concentrated: the New York City Police Department accounts for 880 of 1,000 rows (88%), and the entropy ratio is just 0.22, meaning the field carries little information beyond 'NYPD or not'. Secondary agencies like Housing Preservation and Development (74) and Health and Mental Hygiene (13) trail far behind. Treatment: Collapse rare agencies into an 'Other' bucket or binarize as NYPD vs non-NYPD before modelling.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- New York City Police Department
- top_rate
- 0.88
- cardinality
- 11
- entropy
- 0.7715
- entropy_ratio
- 0.223
complaint_type
categorical · label
This is a categorical complaint-type label, almost certainly from a 311-style service request feed, with 45 distinct categories across 1,000 rows and no nulls. The distribution is heavily concentrated: "Noise - Residential" alone accounts for 39.3% of records, and the top three values (all noise or parking complaints) cover roughly 74% of the column. Casing is inconsistent across categories (e.g., "HEAT/HOT WATER" and "UNSANITARY CONDITION" in upper case versus title-cased noise variants), which suggests the source merges multiple agency taxonomies. Treatment: Normalise casing and group the long tail before one-hot or target encoding.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 45
- top_value
- Noise - Residential
- top_rate
- 0.393
- cardinality
- 45
- entropy
- 2.935
- entropy_ratio
- 0.5345
descriptor
categorical · feature
Categorical descriptor field enumerating specific complaint subtypes (e.g., 'Loud Music/Party', 'Blocked Hydrant', 'Posted Parking Sign Violation'), likely a secondary classification under a parent complaint type. The distribution is heavily skewed: 'Loud Music/Party' alone covers 42.7% of 1,000 rows and the top 3 values account for the majority, so most of the 71 distinct categories sit in a long thin tail. Note the inconsistent casing ('ENTIRE BUILDING', 'APARTMENT ONLY' vs. title-case entries), suggesting upstream data entry from multiple systems. Treatment: Normalize casing, then group rare levels into 'Other' before one-hot or target encoding.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 71
- top_value
- Loud Music/Party
- top_rate
- 0.427
- cardinality
- 71
- entropy
- 3.54
- entropy_ratio
- 0.5756
location_type
categorical · feature
Categorical location descriptor with 20 distinct values and only 1.4% nulls; 'Residential Building/House' dominates with 396 rows (40.2% of the 986 non-null values), followed by 'Street/Sidewalk' with 324 rows (32.9%). The vocabulary is inconsistent: 'RESIDENTIAL BUILDING' (74), 'Residential Building' (7), and 'Residential Building/House' (396) are clearly the same concept stored three ways, and 'Street' (8) overlaps with 'Street/Sidewalk' (324). An entropy ratio of 0.52 confirms heavy concentration in the top two buckets. Treatment: Normalize casing and merge synonym variants (e.g., collapse the three 'Residential Building' forms) before encoding.
- n
- 1,000
- nulls
- 14 (1.4%)
- unique
- 20
- top_value
- Residential Building/House
- top_rate
- 0.4016
- cardinality
- 20
- entropy
- 2.237
- entropy_ratio
- 0.5177
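The synonym merge can be sketched with a small lookup keyed on a case-folded value. The `SYNONYMS` mapping below covers only the variants named in this section and is an assumption; a real mapping would need review against all 20 levels.

```python
from collections import Counter

# Assumed mapping: lowercase variant -> canonical label. Only the variants
# called out in this section are covered.
SYNONYMS = {
    "residential building": "Residential Building/House",
    "residential building/house": "Residential Building/House",
    "street": "Street/Sidewalk",
    "street/sidewalk": "Street/Sidewalk",
}


def canonical_location_type(value):
    """Map a raw location_type to its canonical label (hypothetical helper)."""
    if value is None:
        return None
    key = " ".join(value.split()).lower()
    return SYNONYMS.get(key, value.strip())


# Counts quoted in this section.
observed = {
    "RESIDENTIAL BUILDING": 74,
    "Residential Building": 7,
    "Residential Building/House": 396,
    "Street": 8,
    "Street/Sidewalk": 324,
}

merged = Counter()
for raw, n in observed.items():
    merged[canonical_location_type(raw)] += n
print(dict(merged))
```

After merging, the residential bucket grows from 396 to 477 rows and the street bucket from 324 to 332.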
incident_zip
categorical · feature
Holds NYC ZIP codes tied to incidents, with 146 distinct values across 1,000 rows and only 0.3% missing. The distribution is diffuse: the entropy ratio is 0.93 and the top ZIP '10011' captures just 4.2%, so no single neighborhood dominates. The top 10 ZIPs span Manhattan, Brooklyn, Queens, and the Bronx, indicating broad coverage across those four boroughs. Treatment: Treat as categorical; group by borough or target-encode before modelling rather than one-hot across 146 levels.
- n
- 1,000
- nulls
- 3 (0.3%)
- unique
- 146
- top_value
- 10011
- top_rate
- 0.04213
- cardinality
- 146
- entropy
- 6.696
- entropy_ratio
- 0.9313
incident_address
categorical · metadata · long_tail
Street addresses for incident locations, with 776 unique values across 1,000 rows and only 0.5% nulls. An entropy ratio of 0.97 confirms a long tail in which most addresses appear once, but a few hotspots stand out, notably '126 WEST 13 STREET' with 31 occurrences (3.1% of rows), far above the next most frequent. Note the inconsistent whitespace formatting (e.g., '60 EAST 93 STREET' is stored with repeated internal spaces), suggesting unnormalized input. Treatment: Normalize whitespace/casing and geocode rather than using raw strings as a feature.
- n
- 1,000
- nulls
- 5 (0.5%)
- unique
- 776
- top_value
- 126 WEST 13 STREET
- top_rate
- 0.03116
- cardinality
- 776
- entropy
- 9.321
- entropy_ratio
- 0.9709
street_name
categorical · metadata · long_tail
Street name field, almost certainly a NYC address component given entries like WEST 13 STREET, JAMAICA AVENUE, and BRUCKNER BOULEVARD. Extremely high-cardinality (569 unique across 1,000 rows, entropy ratio 0.955) with a long tail and no dominant value: the top entry covers only 3.1% of rows. Watch for inconsistent whitespace (e.g. "EAST 93 STREET" has multiple spaces), which will fragment otherwise-equal values. Treatment: Normalise whitespace and casing, then either group by borough/zip or drop for modelling, since the column is too high-cardinality to one-hot.
- n
- 1,000
- nulls
- 5 (0.5%)
- unique
- 569
- top_value
- WEST 13 STREET
- top_rate
- 0.03116
- cardinality
- 569
- entropy
- 8.742
- entropy_ratio
- 0.9551
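Whitespace and casing normalisation is a one-line transform. A minimal sketch, assuming upper case as the canonical form (the street values shown in this report are already upper case); `normalize_street` is a hypothetical name.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def normalize_street(value):
    """Collapse runs of whitespace, trim the ends, and upper-case the result.

    Hypothetical helper; None is passed through so nulls stay null.
    """
    if value is None:
        return None
    return _WHITESPACE.sub(" ", value).strip().upper()


print(normalize_street("EAST   93  STREET"))
```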
cross_street_1
categorical · metadata · long_tail
This column captures the name of one cross street, almost certainly part of a NYC location reference (entries like 'AVENUE OF THE AMERICAS', 'BLEECKER STREET', and 'BROADWAY' are giveaways). Cardinality is extreme: 511 unique values across 1,000 rows with entropy ratio 0.95, and the most common street accounts for only 3.8% of rows. 8% of values are null, and the long tail alert means most streets appear only once or twice. Treatment: Standardize casing and group rare streets into an 'OTHER' bucket before any encoding; do not one-hot raw.
- n
- 1,000
- nulls
- 80 (8.0%)
- unique
- 511
- top_value
- AVENUE OF THE AMERICAS
- top_rate
- 0.03804
- cardinality
- 511
- entropy
- 8.574
- entropy_ratio
- 0.953
cross_street_2
categorical · feature · long_tail
Free-form NYC street names serving as the second cross street in a location pair, with 504 distinct values across 921 non-null rows and a 7.9% null rate. The distribution is long-tailed and high-entropy (entropy ratio 0.948); the most common value '7 AVENUE' covers only 3.9%, and 'DEAD END' appearing 10 times suggests non-street sentinel values mixed in. Treatment: Treat as high-cardinality categorical: hash or target-encode rather than one-hot, and normalise sentinels like 'DEAD END' before modelling.
- n
- 1,000
- nulls
- 79 (7.9%)
- unique
- 504
- top_value
- 7 AVENUE
- top_rate
- 0.03909
- cardinality
- 504
- entropy
- 8.511
- entropy_ratio
- 0.9481
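One way to combine the two treatments is to null out sentinels and then hash the cleaned name into a fixed number of buckets. This is a sketch under stated assumptions: the `SENTINELS` set contains only the one sentinel named above, the bucket count of 64 is arbitrary, and MD5 is used purely as a stable hash (Python's built-in `hash` is salted per process), not for security.

```python
import hashlib

# Assumption: only the sentinel named in this section; extend after inspection.
SENTINELS = {"DEAD END"}


def hash_bucket(value, n_buckets=64):
    """Return a stable bucket index for a street name, or None for nulls/sentinels.

    Hypothetical helper illustrating hash encoding of a high-cardinality column.
    """
    if value is None:
        return None
    cleaned = " ".join(value.split()).upper()
    if cleaned in SENTINELS:
        return None
    digest = hashlib.md5(cleaned.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


print(hash_bucket("7 AVENUE"), hash_bucket("DEAD END"))
```

Because the name is cleaned before hashing, variants that differ only in spacing or case land in the same bucket.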
intersection_street_1
categorical · metadata · long_tail
Street-name field, almost certainly the first cross street of an NYC intersection given values like AVENUE OF THE AMERICAS, BROADWAY, and AMSTERDAM AVENUE. Cardinality is extreme: 506 distinct names across 1,000 rows with entropy ratio 0.95, and the most frequent value covers only 3.8% of records. The null rate is 8.5%, and the long tail means most names appear only once or twice. Treatment: Group rare streets into an 'OTHER' bucket or geocode to borough/zone before using as a feature.
- n
- 1,000
- nulls
- 85 (8.5%)
- unique
- 506
- top_value
- AVENUE OF THE AMERICAS
- top_rate
- 0.03825
- cardinality
- 506
- entropy
- 8.558
- entropy_ratio
- 0.9527
intersection_street_2
categorical · feature · long_tail
This is the second street name of an intersection, almost certainly NYC street geography given entries like '7 AVENUE', 'BROADWAY', and 'EAST 171 STREET'. Cardinality is extreme (499 unique values across 1,000 rows, entropy ratio 0.95) with no dominant value: the top entry '7 AVENUE' covers only 3.9%, and 8.4% of rows are null. Notably, 'DEAD END' appears 10 times as a sentinel rather than a real street, which will distort any join on street name. Treatment: Treat as high-cardinality location text: normalize sentinels like 'DEAD END' and target-encode or geocode rather than one-hot.
- n
- 1,000
- nulls
- 84 (8.4%)
- unique
- 499
- top_value
- 7 AVENUE
- top_rate
- 0.0393
- cardinality
- 499
- entropy
- 8.495
- entropy_ratio
- 0.9478
address_type
categorical · feature · imbalance
Categorical descriptor of how a location is specified, with four values: ADDRESS, INTERSECTION, BLOCKFACE, and PLACE. The distribution is severely imbalanced: ADDRESS accounts for 97.2% of the 998 non-null rows, leaving only 28 non-ADDRESS records and an entropy ratio of 0.108. The null rate is negligible at 0.2%. Treatment: Collapse to a binary ADDRESS-vs-other flag or drop, since the minority levels carry little signal.
- n
- 1,000
- nulls
- 2 (0.2%)
- unique
- 4
- top_value
- ADDRESS
- top_rate
- 0.9719
- cardinality
- 4
- entropy
- 0.2169
- entropy_ratio
- 0.1084
city
categorical · feature
This is a NYC-centric city field with 36 distinct uppercase values and a 2.1% null rate. The distribution is dominated by Brooklyn (25.5%), New York, and Bronx, which together account for roughly 70% of the 1,000 rows; the entropy ratio of 0.63 reflects this concentration on a few boroughs, while the long tail covers Queens neighborhoods like Woodhaven, Jamaica, and Astoria. Note that the values mix borough names with neighborhood names rather than using a consistent geographic level. Treatment: Group rare neighborhoods into an 'other' bucket and one-hot or target-encode the top categories.
- n
- 1,000
- nulls
- 21 (2.1%)
- unique
- 36
- top_value
- BROOKLYN
- top_rate
- 0.2554
- cardinality
- 36
- entropy
- 3.282
- entropy_ratio
- 0.6349
landmark
categorical · metadata · long_tail
This column holds NYC street names used as landmark references, with 515 unique values across 895 non-null rows and a 10.5% null rate. The distribution is extremely flat (entropy ratio 0.955): the most common value, 'WEST 13 STREET', accounts for only 3.5% of non-null records, and a long tail alert is raised. No single landmark dominates, so this behaves more like a high-cardinality location descriptor than a low-cardinality categorical feature. Treatment: Treat as high-cardinality text; group rare values or geocode rather than one-hot encode.
- n
- 1,000
- nulls
- 105 (10.5%)
- unique
- 515
- top_value
- WEST 13 STREET
- top_rate
- 0.03464
- cardinality
- 515
- entropy
- 8.603
- entropy_ratio
- 0.955
status
categorical · feature
This is a ticket/case lifecycle status field with three states: Closed, In Progress, and Open. The distribution is weighted toward Closed (610 of 1,000, 61%), with In Progress at 305 and Open notably rare at just 85 records, so roughly two in five cases were still unresolved when the sample was taken. No nulls and a clean cardinality of 3 make this a tidy categorical signal. Treatment: One-hot or ordinal encode (Open < In Progress < Closed) before modelling.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- Closed
- top_rate
- 0.61
- cardinality
- 3
- entropy
- 1.26
- entropy_ratio
- 0.7948
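The entropy and entropy_ratio figures in this report are consistent with Shannon entropy in bits, divided by log2 of the cardinality for the ratio. That reading is inferred from the reported numbers rather than taken from profiler documentation; the sketch below reproduces the status and borough figures from their published counts.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed levels.

    Assumed definition, inferred from the figures in this report.
    """
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


status_counts = [610, 305, 85]             # Closed, In Progress, Open
borough_counts = [274, 258, 254, 203, 11]  # from the borough table

print(round(entropy_bits(status_counts), 3), round(entropy_ratio(status_counts), 4))
print(round(entropy_bits(borough_counts), 3), round(entropy_ratio(borough_counts), 4))
```

A ratio near 1 means the levels are close to equally likely; a ratio near 0 means one level dominates, as with agency (0.223).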
community_board
categorical · metadata
This column encodes NYC community board identifiers as a numeric prefix plus borough name (e.g., '02 MANHATTAN'), with 65 distinct values across 1,000 rows and no nulls. The distribution is fairly flat: the entropy ratio is 0.93 and the top value accounts for only 5.6%, suggesting wide geographic coverage rather than concentration in one district. Manhattan and Queens boards lead the ranking, with '02 MANHATTAN' (56) and '09 QUEENS' (48) at the top. Treatment: Split into borough and board-number components, then one-hot or target-encode for modelling.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 65
- top_value
- 02 MANHATTAN
- top_rate
- 0.056
- cardinality
- 65
- entropy
- 5.61
- entropy_ratio
- 0.9316
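The split into board number and borough can be sketched by breaking on the first space only, so that a two-word borough such as STATEN ISLAND stays intact. `split_community_board` is a hypothetical helper, and the second input below is a made-up example of the same pattern, not a value quoted in this report.

```python
def split_community_board(value):
    """Split a value like '02 MANHATTAN' into board number and borough.

    Hypothetical helper; it assumes the 'prefix + space + borough' pattern
    described above and does not validate the prefix.
    """
    number, _, borough = value.strip().partition(" ")
    return {"board_number": number, "borough": borough.strip()}


print(split_community_board("02 MANHATTAN"))
print(split_community_board("01 STATEN ISLAND"))  # illustrative input
```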
council_district
categorical · feature
Categorical codes representing council districts, stored as zero-padded two-digit strings across 51 distinct values with a 0.9% null rate. The distribution is nearly uniform (entropy ratio 0.954), with the top district '03' accounting for only 5.65% of rows, suggesting broad geographic coverage rather than concentration. No single district dominates, which is consistent with a city/region-wide sample. Treatment: One-hot or target-encode for modelling; impute the small null share with an explicit 'unknown' category.
- n
- 1,000
- nulls
- 9 (0.9%)
- unique
- 51
- top_value
- 03
- top_rate
- 0.05651
- cardinality
- 51
- entropy
- 5.412
- entropy_ratio
- 0.9541
police_precinct
categorical · feature
This is a categorical column tagging each record with one of 76 NYPD-style police precincts, with no nulls across 1,000 rows. The distribution is remarkably flat, with an entropy ratio of 0.9358 and the modal value 'Precinct 6' covering just 5%, meaning no precinct dominates and counts decay slowly from 50 down to the mid-20s among the top 10. Treatment: Treat as a high-cardinality categorical: target- or frequency-encode rather than one-hot before modelling.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 76
- top_value
- Precinct 6
- top_rate
- 0.05
- cardinality
- 76
- entropy
- 5.847
- entropy_ratio
- 0.9358
bbl
categorical · foreign_key · long_tail
This is the NYC BBL (Borough-Block-Lot) parcel identifier, stored as a 10-digit string where the leading digit encodes the borough (1-5). Cardinality is very high (719 unique across 945 non-null rows, entropy ratio 0.968), but a long tail is flagged: one parcel '1006080026' accounts for 31 rows (3.3% of non-null rows) and several others recur 5-13 times, so rows are not one-per-parcel. The null rate is 5.5%. Treatment: Treat as a parcel key for left-joining to PLUTO/property tables; do not one-hot encode.
- n
- 1,000
- nulls
- 55 (5.5%)
- unique
- 719
- top_value
- 1006080026
- top_rate
- 0.0328
- cardinality
- 719
- entropy
- 9.189
- entropy_ratio
- 0.9683
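A 10-digit BBL decomposes into a 1-digit borough code, a 5-digit block, and a 4-digit lot. The sketch below shows that decomposition on the top value; `parse_bbl` is a hypothetical helper and the borough names follow the standard NYC borough codes.

```python
# Standard NYC borough codes used as the leading BBL digit.
BOROUGH_BY_CODE = {
    "1": "MANHATTAN",
    "2": "BRONX",
    "3": "BROOKLYN",
    "4": "QUEENS",
    "5": "STATEN ISLAND",
}


def parse_bbl(bbl):
    """Split a 10-digit BBL string into borough, block, and lot.

    Hypothetical helper; returns None for nulls or malformed values.
    """
    if bbl is None or len(bbl) != 10 or not bbl.isdigit():
        return None
    borough = BOROUGH_BY_CODE.get(bbl[0])
    if borough is None:
        return None
    return {"borough": borough, "block": int(bbl[1:6]), "lot": int(bbl[6:])}


print(parse_bbl("1006080026"))
```

The leading digit offers a cheap consistency check against the borough column before any join.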
borough
categorical · feature
This column lists one of the five NYC boroughs for each of the 1,000 rows, with no nulls and only 5 unique values. The distribution is fairly even across QUEENS (274), MANHATTAN (258), BROOKLYN (254), and BRONX (203), but STATEN ISLAND is sharply underrepresented at just 11 rows. The high entropy ratio (0.886) confirms the top four categories are well balanced. Treatment: One-hot encode; consider grouping or monitoring STATEN ISLAND given its low support.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- QUEENS
- top_rate
- 0.274
- cardinality
- 5
- entropy
- 2.057
- entropy_ratio
- 0.8858
x_coordinate_state_plane
categorical · feature · long_tail
This column holds X-coordinates in a state plane projection, stored as strings rather than numbers. With 763 unique values across 993 non-null rows and entropy ratio 0.97, it is near-unique; the most frequent value '984721' appears only 31 times (3.1%). The long_tail alert is consistent with a near-continuous spatial measurement that has been categorically encoded. Treatment: Cast to numeric and pair with the Y-coordinate as a 2D spatial feature.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 763
- top_value
- 984721
- top_rate
- 0.03122
- cardinality
- 763
- entropy
- 9.292
- entropy_ratio
- 0.9704
y_coordinate_state_plane
categorical · feature · long_tail
Column holds State Plane y-coordinates (northings) stored as strings rather than numerics, with 768 unique values across 1,000 rows and a 0.7% null rate. An entropy ratio of 0.97 and a top_rate of just 3.1% confirm a near-uniform long tail typical of geographic coordinates. The modal value '207809' repeats 31 times, which may indicate a default or geocoder fallback location, although the count matches the 31 rows at '126 WEST 13 STREET' in incident_address and could simply reflect a repeat-complaint address. Treatment: Cast to numeric and treat as a continuous spatial coordinate; inspect the repeated 207809 value for geocoding artifacts.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 768
- top_value
- 207809
- top_rate
- 0.03122
- cardinality
- 768
- entropy
- 9.304
- entropy_ratio
- 0.9707
open_data_channel_type
categorical · feature
Categorical channel indicating how a record was opened, with 4 distinct values and no nulls across 1,000 rows. ONLINE leads at 466, followed by MOBILE (288) and PHONE (236), while UNKNOWN appears only 10 times. The distribution is fairly balanced (entropy ratio 0.79), with no single class exceeding 46.6%. Treatment: One-hot encode; consider folding UNKNOWN into a missing/other bucket.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- ONLINE
- top_rate
- 0.466
- cardinality
- 4
- entropy
- 1.589
- entropy_ratio
- 0.7943
park_facility_name
categorical · metadata · imbalance
This column nominally records a park facility name but is effectively a constant: 999 of 1,000 rows are "Unspecified" and only one row names "Flushing Meadows Corona Park". The entropy ratio is 0.011, confirming there is virtually no information content here. The column adds no discriminative signal as-is. Treatment: Drop; single dominant value carries no signal.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Unspecified
- top_rate
- 0.999
- cardinality
- 2
- entropy
- 0.01141
- entropy_ratio
- 0.01141
park_borough
categorical · feature
Categorical column listing the NYC borough associated with a park, with all five boroughs present and no nulls across 1,000 rows. The distribution is fairly balanced (entropy ratio 0.8858) among Queens (274), Manhattan (258), Brooklyn (254), and Bronx (203), but Staten Island is severely under-represented at just 11 records. These counts are identical to those of borough, so the column appears to duplicate it in this sample. Treatment: One-hot encode; consider grouping or stratifying given the Staten Island sparsity.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- QUEENS
- top_rate
- 0.274
- cardinality
- 5
- entropy
- 2.057
- entropy_ratio
- 0.8858
latitude
categorical · feature · long_tail
Latitude values stored as strings, with 770 unique points across 1,000 rows clustered around 40.6-40.8 (NYC range). An entropy ratio of 0.97 and a top_rate of just 0.031 confirm a long tail of near-unique coordinates. The modal value 40.73706433046593 repeats 31 times, which may point to snapped or default-pinned locations, although the count matches the 31 rows at '126 WEST 13 STREET' in incident_address and may simply mark that one address. The null rate is low at 0.7%. Treatment: Cast to float and pair with longitude for geospatial features rather than treating as categorical.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 770
- top_value
- 40.73706433046593
- top_rate
- 0.03122
- cardinality
- 770
- entropy
- 9.308
- entropy_ratio
- 0.9707
longitude
categorical · feature · long_tail
Geographic longitude coordinates, almost certainly NYC-area given the -73.8 to -74.0 range. The column was profiled as categorical despite being continuous numeric: 770 unique values across 1,000 rows with entropy ratio 0.97 and a top value repeating only 31 times (3.1%). The null rate is low at 0.7%. Treatment: Cast to float and pair with latitude as a geospatial feature rather than treating as categorical.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 770
- top_value
- -73.99830041620608
- top_rate
- 0.03122
- cardinality
- 770
- entropy
- 9.308
- entropy_ratio
- 0.9707
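Casting and pairing the two coordinate columns can be sketched as below. `parse_point` is a hypothetical helper, and the bounding box is a deliberately loose assumption around New York City used only to reject obviously wrong values such as (0, 0).

```python
# Loose, assumed bounding box around NYC; not an official boundary.
LAT_RANGE = (40.4, 41.0)
LON_RANGE = (-74.3, -73.6)


def parse_point(lat_raw, lon_raw):
    """Convert string latitude/longitude to a (lat, lon) float pair.

    Hypothetical helper; returns None for nulls, unparseable strings,
    or points outside the assumed bounding box.
    """
    if lat_raw is None or lon_raw is None:
        return None
    try:
        lat, lon = float(lat_raw), float(lon_raw)
    except ValueError:
        return None
    in_box = LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]
    return (lat, lon) if in_box else None


print(parse_point("40.73706433046593", "-73.99830041620608"))
```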
location
unknown · other · skipped
The column is named "location" but saturn skipped profiling, so no type, cardinality, or value statistics are available. The only confirmed facts are 1,000 rows with a 0.0 null rate; uniqueness and content remain unknown. Treatment: Re-profile or manually inspect before use; saturn skipped this column.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- —
:@computed_region_f5dn_yrer
categorical · metadata
This is a Socrata-style computed region column (`:@computed_region_f5dn_yrer`), almost certainly a numeric region/zone code attached automatically by a spatial join. It has 62 unique values across 1,000 rows with a near-uniform spread (entropy ratio 0.94, top value '57' at only 5.6%) and a 0.7% null rate. No single region dominates, suggesting the underlying records are well-distributed across the joined geography. Treatment: Treat as a categorical region code; one-hot or target-encode if used as a feature, otherwise drop as join artifact.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 62
- top_value
- 57
- top_rate
- 0.05639
- cardinality
- 62
- entropy
- 5.591
- entropy_ratio
- 0.9389
:@computed_region_yeji_bk3q
categorical · metadata
This is a Socrata-style computed region column (`:@computed_region_yeji_bk3q`), almost certainly a spatial-join lookup mapping each row to one of 5 region codes. The distribution is fairly balanced across codes 2-5 (193-269 each) but code 1 appears only 11 times, and 0.7% of rows are null. An entropy ratio of 0.88 confirms near-uniform spread except for the sparse '1' bucket. Treatment: Treat as a categorical region key; one-hot encode or drop if the spatial join isn't relevant.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 5
- top_value
- 3
- top_rate
- 0.2709
- cardinality
- 5
- entropy
- 2.054
- entropy_ratio
- 0.8846
:@computed_region_sbqj_enih
categorical · foreign_key
This is a Socrata-style computed region column (`:@computed_region_sbqj_enih`), holding a small integer code that geocodes each row into one of 75 regions. The distribution is nearly uniform across categories: the entropy ratio is 0.94 and the top value '3' covers only 5.0% of rows, so no single region dominates. Nulls are negligible at 0.7%. Treatment: Treat as a categorical region id; left-join to the corresponding Socrata boundary lookup or one-hot encode for modelling.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 75
- top_value
- 3
- top_rate
- 0.05035
- cardinality
- 75
- entropy
- 5.846
- entropy_ratio
- 0.9385
:@computed_region_92fq_4b7q
categorical · foreign_key
This is a Socrata-style computed region column (`:@computed_region_92fq_4b7q`) holding 51 distinct integer codes, almost certainly a spatial join key (e.g., a precinct, district, or zone id) appended by the platform. The distribution is fairly flat, with an entropy ratio of 0.9477 and the top value '10' covering only 6.4% of rows, and there are very few nulls (0.7%). No single region dominates, suggesting broad geographic coverage rather than a hot spot. Treatment: Treat as a categorical region id; left-join to the region lookup or one-hot encode if used as a feature.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 51
- top_value
- 10
- top_rate
- 0.06445
- cardinality
- 51
- entropy
- 5.376
- entropy_ratio
- 0.9477
descriptor_2
categorical · feature · long_tail · null_rate
Secondary complaint descriptor that further qualifies a primary complaint type, with values like 'NO HEAT', 'NO HEAT AND NO HOT WATER', and 'ROACHES'. It is null 88.2% of the time, so only 118 rows carry a value, and among those 'NO HEAT' alone covers 27.97%. Cardinality is 43 with entropy ratio 0.807, indicating a long tail across the populated rows and a mix of casing conventions ('ROACHES' vs 'Cannabis Smoking or Vaping') that suggests inconsistent upstream sources. Treatment: Normalize casing and treat missing as its own category before one-hot or target encoding.
- n
- 1,000
- nulls
- 882 (88.2%)
- unique
- 43
- top_value
- NO HEAT
- top_rate
- 0.2797
- cardinality
- 43
- entropy
- 4.376
- entropy_ratio
- 0.8065
resolution_description
categorical · label · null_rate
This is the canned 311 resolution narrative attached to each complaint, describing the agency's disposition (NYPD found no violation, summons issued, HPD inspection pending, etc.). Cardinality is just 16 templates over 1,000 rows, with the top NYPD 'no criminal violation' boilerplate covering 35.4% of non-null rows and 30.7% of rows null, likely complaints still open. An entropy ratio of 0.69 shows the distribution is skewed toward a few NYPD templates, with DOT and HPD templates appearing only in the long tail. Treatment: Map the 16 templates to a categorical disposition code (e.g., 'no_violation', 'summons_issued', 'no_entry', 'pending') rather than treating as free text.
- n
- 1,000
- nulls
- 307 (30.7%)
- unique
- 16
- top_value
- The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.
- top_rate
- 0.3535
- cardinality
- 16
- entropy
- 2.777
- entropy_ratio
- 0.6943
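The template-to-code mapping can be sketched with ordered keyword rules. Only the first rule is grounded in text shown in this report (the top template); the other keywords and the choice to label nulls as 'pending' are assumptions that would need checking against the actual 16 templates. Rule order matters, because the top template also mentions a summons.

```python
# Ordered (keyword, code) rules. Only "no criminal violation" is taken from a
# template quoted in this report; the remaining keywords are assumptions.
RULES = [
    ("no criminal violation", "no_violation"),
    ("summons", "summons_issued"),
    ("gain entry", "no_entry"),
]


def disposition_code(description):
    """Map a resolution narrative to a short disposition code.

    Hypothetical helper: nulls become 'pending', unmatched text becomes 'other'.
    """
    if description is None:
        return "pending"
    text = description.lower()
    for keyword, code in RULES:
        if keyword in text:
            return code
    return "other"


top_template = (
    "The New York City Police Department responded to the complaint and their "
    "investigation determined that no criminal violation existed. The condition "
    "was corrected without the need to issue a summons or effect an arrest."
)
print(disposition_code(top_template))
```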
resolution_action_updated_date
categorical · timestamp · long_tail · null_rate
ISO-8601 timestamps recording when a resolution action was last updated, stored as strings rather than a native datetime. 30.6% of rows are null, and a single midnight value '2026-01-17T00:00:00.000' accounts for 74 of the non-null entries (top_rate 10.66%), while the remaining values carry a full time of day and are nearly all unique (585 distinct over 694 non-null, entropy_ratio 0.94). The mix of one date-only spike with otherwise fully timed values suggests two ingestion paths or a backfill default colliding with live event logging. Treatment: Parse to datetime, treat the 2026-01-17 midnight spike as a sentinel/backfill, and derive recency or duration features rather than using raw values.
- n: 1,000
- nulls: 306 (30.6%)
- unique: 585
- top_value: 2026-01-17T00:00:00.000
- top_rate: 0.1066
- cardinality: 585
- entropy: 8.673
- entropy_ratio: 0.9435
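The parse-and-flag step in the treatment could look roughly like the following. It is a sketch under stated assumptions: the helper names `parse_timestamp` and `is_midnight_default` are hypothetical, the format string is inferred from the two timestamp values shown in this report, and treating every exact-midnight value as a possible backfill default is a heuristic rather than a confirmed rule.

```python
from datetime import datetime, time
from typing import Optional

# Format inferred from values such as '2026-01-17T00:00:00.000'.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_timestamp(raw: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp string; nulls and empty strings stay None."""
    if not raw:
        return None
    return datetime.strptime(raw, TIMESTAMP_FORMAT)


def is_midnight_default(ts: Optional[datetime]) -> bool:
    """True when the time part is exactly 00:00:00.000."""
    return ts is not None and ts.time() == time(0, 0)


# Illustrative values in the report's format, plus one null.
values = ["2026-01-17T00:00:00.000", "2026-01-18T00:41:26.000", None]
parsed = [parse_timestamp(v) for v in values]
flags = [is_midnight_default(ts) for ts in parsed]
print(flags)  # [True, False, False]
```

Keeping the flag as a separate boolean, instead of nulling the midnight values, leaves the choice of how to handle the spike to the downstream analysis.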
closed_date
categorical timestamp long_tail null_rate
This is a closure timestamp captured at second-level precision, stored as strings rather than parsed datetimes. 39% of rows are null, consistent with records that have not yet been closed, and among the 610 populated rows there are 585 unique values with the most common appearing only 3 times (top_rate 0.0049, entropy_ratio 0.998). All visible top values cluster on 2026-01-18, suggesting the populated closures concentrate in a narrow window. Treatment: Parse to datetime, treat nulls as 'still open', and derive duration features rather than using raw values.
- n: 1,000
- nulls: 390 (39.0%)
- unique: 585
- top_value: 2026-01-18T00:41:26.000
- top_rate: 0.004918
- cardinality: 585
- entropy: 9.169
- entropy_ratio: 0.9975
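A duration feature of the kind the treatment describes might be derived as below. This is a hedged sketch: it assumes a companion creation timestamp (here a hypothetical `created` argument, as would come from a `created_date` column) in the same string format, and the function name `hours_to_close` is invented for the example.

```python
from datetime import datetime
from typing import Optional

FMT = "%Y-%m-%dT%H:%M:%S.%f"  # format inferred from the closed_date values


def hours_to_close(created: str, closed: Optional[str]) -> Optional[float]:
    """Hours from creation to closure, or None when the ticket is still open."""
    if not closed:
        return None  # null closed_date is read as 'still open'
    delta = datetime.strptime(closed, FMT) - datetime.strptime(created, FMT)
    return delta.total_seconds() / 3600.0


# Hypothetical creation time paired with the report's top closed_date value.
created = "2026-01-17T22:41:26.000"
print(hours_to_close(created, "2026-01-18T00:41:26.000"))  # 2.0
print(hours_to_close(created, None))                       # None
```

Returning None for open tickets keeps them distinguishable from tickets closed in zero hours; a separate boolean `is_open` feature would be an equivalent choice.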
taxi_pick_up_location
categorical free_text long_tail null_rate
Free-text taxi pickup addresses (street, borough, state, ZIP) for NYC trips. The column is almost entirely empty at a 99.4% null rate, leaving only 6 non-null rows that are all unique, so cardinality equals the populated count and entropy is maximal (entropy_ratio 1.0). With so little signal, this column carries effectively no analytical value as-is. Treatment: Drop the column or geocode the rare populated addresses if pickup location is needed.
- n: 1,000
- nulls: 994 (99.4%)
- unique: 6
- top_value: 55 LITTLE WEST 12 STREET, MANHATTAN (NEW YORK), NY, 10014
- top_rate: 0.1667
- cardinality: 6
- entropy: 2.585
- entropy_ratio: 1
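The entropy figures in these stat blocks can be reproduced with a short calculation. The sketch below assumes that entropy is Shannon entropy in bits over the non-null values and that entropy_ratio is entropy divided by log2(cardinality), with 0 reported when there is a single level. The report does not state these definitions; they are inferred because they match the reported numbers. The function name `entropy_stats` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def entropy_stats(values: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (entropy in bits, entropy / log2(cardinality)) over non-null values."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) <= 1:
        return 0.0, 0.0  # empty or constant column carries no information
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy, entropy / math.log2(len(counts))


# Six distinct values seen once each, as in this column.
entropy, ratio = entropy_stats(["a", "b", "c", "d", "e", "f"])
print(round(entropy, 3), round(ratio, 3))  # 2.585 1.0
```

Under the same assumed definitions, a 2-to-1 split over two levels gives about 0.918 bits.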
vehicle_type
categorical feature null_rate
Categorical vehicle classification with just 4 levels (Car, Other, SUV, Van), but it is almost entirely missing: 96.5% null, leaving only 35 populated rows out of 1000. Among those present, Car dominates at 77.1%, giving an entropy ratio of 0.55. With so few observations, the distribution is unreliable as a feature. Treatment: Drop or treat missingness as its own category; too sparse to model directly.
- n: 1,000
- nulls: 965 (96.5%)
- unique: 4
- top_value: Car
- top_rate: 0.7714
- cardinality: 4
- entropy: 1.097
- entropy_ratio: 0.5484
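Treating missingness as its own category could be as simple as the sketch below. The `Missing` label and the function name `with_missing_level` are assumptions, and the input is toy data for illustration, not rows from the sample.

```python
from collections import Counter
from typing import List, Optional

MISSING = "Missing"  # hypothetical label for a null vehicle_type


def with_missing_level(values: List[Optional[str]]) -> List[str]:
    """Replace nulls with an explicit category so they survive encoding."""
    return [MISSING if v is None else v for v in values]


# Toy values only; the real column is 96.5% null.
toy = ["Car", None, None, "SUV", None, "Car"]
counts = Counter(with_missing_level(toy))
print(counts["Missing"], counts["Car"], counts["SUV"])  # 3 2 1
```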
facility_type
categorical metadata null_rate imbalance
This column is a categorical facility_type field that is effectively empty: 99% of the 1000 rows are null, and the only 10 non-null values are all the placeholder 'N/A'. With cardinality of 1 and entropy of 0, it carries no information. Treatment: Drop the column; it is 99% null with a single placeholder value.
- n: 1,000
- nulls: 990 (99.0%)
- unique: 1
- top_value: N/A
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
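Because the only populated value is a placeholder, counting it as missing shows how empty the column really is. The following sketch rebuilds the reported shape (990 nulls plus 10 'N/A' strings); the `PLACEHOLDERS` set and the function name `effective_null_rate` are assumptions for illustration.

```python
from typing import List, Optional, Set

# Only 'N/A' is observed in this column; extend the set with care.
PLACEHOLDERS: Set[str] = {"N/A"}


def effective_null_rate(values: List[Optional[str]]) -> float:
    """Null rate after counting placeholder strings as missing."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v.strip() in PLACEHOLDERS)
    return missing / len(values)


# Reported shape of facility_type: 990 nulls and 10 'N/A' values.
column = [None] * 990 + ["N/A"] * 10
print(effective_null_rate(column))  # 1.0
```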
taxi_company_borough
categorical metadata long_tail null_rate imbalance
Categorical column intended to record the borough of a taxi company, but it is effectively empty: 99.9% of rows are null and the single non-null value is 'BROOKLYN'. With cardinality 1 and entropy 0, the column carries no information as-is. Treatment: Drop; near-entirely null with only one observed level.
- n: 1,000
- nulls: 999 (99.9%)
- unique: 1
- top_value: BROOKLYN
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
bridge_highway_name
categorical metadata null_rate
A categorical field recording bridge or highway names, but it is effectively empty: 99.7% null with only 3 non-null records across 1000 rows. The two distinct values observed ('E' and 'Kosciuszko Br - BQE') are inconsistent in form, suggesting either a free-text field or abbreviation codes mixed with full names. Treatment: Drop; a null rate of 99.7% leaves no usable signal.
- n: 1,000
- nulls: 997 (99.7%)
- unique: 2
- top_value: E
- top_rate: 0.6667
- cardinality: 2
- entropy: 0.9183
- entropy_ratio: 0.9183
bridge_highway_direction
categorical metadata null_rate
This is a near-empty categorical field describing bridge or highway direction, with a 99.7% null rate leaving only 3 non-null observations across 2 distinct values ('C E Local Downtown & Brooklyn' and 'Queens Bound'). The values resemble transit route/direction labels rather than simple compass directions, suggesting inconsistent coding. With only 3 populated rows out of 1000, no reliable distribution can be inferred. Treatment: Drop; a null rate of 99.7% leaves too little signal to use.
- n: 1,000
- nulls: 997 (99.7%)
- unique: 2
- top_value: C E Local Downtown & Brooklyn
- top_rate: 0.6667
- cardinality: 2
- entropy: 0.9183
- entropy_ratio: 0.9183
road_ramp
categorical metadata long_tail null_rate imbalance
This column is a binary-style flag indicating whether a road segment is a ramp, but it carries virtually no information here. With a null_rate of 0.999 and only 1 non-null value ("Ramp") across 1000 rows, cardinality is 1 and entropy is 0. Treatment: Drop; effectively constant with 99.9% nulls.
- n: 1,000
- nulls: 999 (99.9%)
- unique: 1
- top_value: Ramp
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
bridge_highway_segment
categorical metadata null_rate
A categorical flag describing a bridge/highway segment type, with only two observed values ('Platform' and 'Ramp'). It is almost entirely missing: 99.7% null, leaving just 3 non-null rows out of 1000, so the apparent top_rate of 0.667 reflects 2 of those 3 observations and is not meaningful. Treatment: Drop or treat as a rare-event indicator; too sparse (3/1000 populated) to model directly.
- n: 1,000
- nulls: 997 (99.7%)
- unique: 2
- top_value: Platform
- top_rate: 0.6667
- cardinality: 2
- entropy: 0.9183
- entropy_ratio: 0.9183
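The repeated "Drop" treatments for the sparse columns could be applied in one pass with a null-rate threshold. The sketch below uses the null counts reported in the stat blocks for the named columns in this section; the 0.95 cut-off and the function name `columns_to_drop` are assumptions, not values taken from the report.

```python
from typing import Dict, List

N_ROWS = 1000

# Null counts as reported for the named columns in this section.
NULL_COUNTS: Dict[str, int] = {
    "resolution_action_updated_date": 306,
    "closed_date": 390,
    "taxi_pick_up_location": 994,
    "vehicle_type": 965,
    "facility_type": 990,
    "taxi_company_borough": 999,
    "bridge_highway_name": 997,
    "bridge_highway_direction": 997,
    "road_ramp": 999,
    "bridge_highway_segment": 997,
}


def columns_to_drop(
    null_counts: Dict[str, int], n_rows: int, max_null_rate: float
) -> List[str]:
    """Names of columns whose null rate exceeds the threshold."""
    return [
        name
        for name, nulls in null_counts.items()
        if nulls / n_rows > max_null_rate
    ]


# 0.95 is an assumed cut-off; it keeps the two timestamp columns.
drop = columns_to_drop(NULL_COUNTS, N_ROWS, max_null_rate=0.95)
print(len(drop))  # 8
```

A threshold keeps the decision auditable: lowering or raising `max_null_rate` changes the drop list without editing per-column rules.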