data trove nyc 311 service requests
Reading
This dataset is a sample of 1,000 NYC 311 service requests, capturing complaints logged across the five boroughs with details on complaint type, location, agency, and resolution status. The dominant signal is complaint type: 'Noise - Residential' alone accounts for 39.3% of records, followed by 'Illegal Parking' (19.7%) and 'Noise - Commercial' (14.8%), pointing to a dataset heavily skewed toward NYPD-handled quality-of-life complaints (NYPD handles 88% of cases). A second area worth examining is resolution status — 61% of complaints are closed, 30.5% are still in progress, and 8.5% remain open, which raises questions about agency workload and response time. Many specialty columns (road_ramp, taxi_company_borough, bridge_highway fields) are nearly entirely null (99%+), indicating they apply only to rare complaint subtypes and can largely be ignored.
citing: complaint_type.top_value · complaint_type.top_rate · agency.top_value · agency.top_rate · status.top_values · borough.top_values · open_data_channel_type.top_values · descriptor.top_value · road_ramp.null_rate · taxi_company_borough.null_rate
Charts the summary said to look at first
complaint_type (top 20 of 45 values)
| value | count | share |
|---|---|---|
| Noise - Residential | 393 | 39.3% |
| Illegal Parking | 197 | 19.7% |
| Noise - Commercial | 148 | 14.8% |
| Blocked Driveway | 55 | 5.5% |
| HEAT/HOT WATER | 49 | 4.9% |
| Noise - Street/Sidewalk | 44 | 4.4% |
| Noise - Vehicle | 17 | 1.7% |
| UNSANITARY CONDITION | 12 | 1.2% |
| Street Condition | 7 | 0.7% |
| Non-Emergency Police Matter | 7 | 0.7% |
| Smoking or Vaping | 5 | 0.5% |
| Abandoned Vehicle | 5 | 0.5% |
| Animal-Abuse | 5 | 0.5% |
| Rodent | 4 | 0.4% |
| Encampment | 4 | 0.4% |
| Taxi Complaint | 3 | 0.3% |
| Dirty Condition | 3 | 0.3% |
| PAINT/PLASTER | 3 | 0.3% |
| GENERAL | 3 | 0.3% |
| Drinking | 2 | 0.2% |
status
| value | count | share |
|---|---|---|
| Closed | 610 | 61.0% |
| In Progress | 305 | 30.5% |
| Open | 85 | 8.5% |
borough
| value | count | share |
|---|---|---|
| QUEENS | 274 | 27.4% |
| MANHATTAN | 258 | 25.8% |
| BROOKLYN | 254 | 25.4% |
| BRONX | 203 | 20.3% |
| STATEN ISLAND | 11 | 1.1% |
open_data_channel_type
| value | count | share |
|---|---|---|
| ONLINE | 466 | 46.6% |
| MOBILE | 288 | 28.8% |
| PHONE | 236 | 23.6% |
| UNKNOWN | 10 | 1.0% |
descriptor (top 20 of 71 values)
| value | count | share |
|---|---|---|
| Loud Music/Party | 427 | 42.7% |
| Banging/Pounding | 114 | 11.4% |
| Blocked Hydrant | 79 | 7.9% |
| No Access | 45 | 4.5% |
| Posted Parking Sign Violation | 42 | 4.2% |
| Loud Talking | 36 | 3.6% |
| ENTIRE BUILDING | 35 | 3.5% |
| Blocked Sidewalk | 21 | 2.1% |
| Commercial Overnight Parking | 20 | 2.0% |
| APARTMENT ONLY | 14 | 1.4% |
| Car/Truck Music | 13 | 1.3% |
| Double Parked Blocking Traffic | 13 | 1.3% |
| Loud Television | 10 | 1.0% |
| Partial Access | 10 | 1.0% |
| PESTS | 10 | 1.0% |
| Blocked Crosswalk | 9 | 0.9% |
| Pothole | 6 | 0.6% |
| Allowed in Smoke Free Area | 5 | 0.5% |
| With License Plate | 5 | 0.5% |
| No Shelter | 5 | 0.5% |
Schema
47 columns

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| unique_key | categorical | 0.0% | 1,000 | long_tail |
| created_date | categorical | 0.0% | 939 | long_tail |
| agency | categorical | 0.0% | 11 | |
| agency_name | categorical | 0.0% | 11 | |
| complaint_type | categorical | 0.0% | 45 | |
| descriptor | categorical | 0.0% | 71 | |
| location_type | categorical | 1.4% | 20 | |
| incident_zip | categorical | 0.3% | 146 | |
| incident_address | categorical | 0.5% | 776 | long_tail |
| street_name | categorical | 0.5% | 569 | long_tail |
| cross_street_1 | categorical | 8.0% | 511 | long_tail |
| cross_street_2 | categorical | 7.9% | 504 | long_tail |
| intersection_street_1 | categorical | 8.5% | 506 | long_tail |
| intersection_street_2 | categorical | 8.4% | 499 | long_tail |
| address_type | categorical | 0.2% | 4 | imbalance |
| city | categorical | 2.1% | 36 | |
| landmark | categorical | 10.5% | 515 | long_tail |
| status | categorical | 0.0% | 3 | |
| community_board | categorical | 0.0% | 65 | |
| council_district | categorical | 0.9% | 51 | |
| police_precinct | categorical | 0.0% | 76 | |
| bbl | categorical | 5.5% | 719 | long_tail |
| borough | categorical | 0.0% | 5 | |
| x_coordinate_state_plane | categorical | 0.7% | 763 | long_tail |
| y_coordinate_state_plane | categorical | 0.7% | 768 | long_tail |
| open_data_channel_type | categorical | 0.0% | 4 | |
| park_facility_name | categorical | 0.0% | 2 | imbalance |
| park_borough | categorical | 0.0% | 5 | |
| latitude | categorical | 0.7% | 770 | long_tail |
| longitude | categorical | 0.7% | 770 | long_tail |
| location | unknown | 0.0% | — | skipped |
| :@computed_region_f5dn_yrer | categorical | 0.7% | 62 | |
| :@computed_region_yeji_bk3q | categorical | 0.7% | 5 | |
| :@computed_region_sbqj_enih | categorical | 0.7% | 75 | |
| :@computed_region_92fq_4b7q | categorical | 0.7% | 51 | |
| descriptor_2 | categorical | 88.2% | 43 | long_tail, null_rate |
| resolution_description | categorical | 30.7% | 16 | null_rate |
| resolution_action_updated_date | categorical | 30.6% | 585 | long_tail, null_rate |
| closed_date | categorical | 39.0% | 585 | long_tail, null_rate |
| taxi_pick_up_location | categorical | 99.4% | 6 | long_tail, null_rate |
| vehicle_type | categorical | 96.5% | 4 | null_rate |
| facility_type | categorical | 99.0% | 1 | null_rate, imbalance |
| taxi_company_borough | categorical | 99.9% | 1 | long_tail, null_rate, imbalance |
| bridge_highway_name | categorical | 99.7% | 2 | null_rate |
| bridge_highway_direction | categorical | 99.7% | 2 | null_rate |
| road_ramp | categorical | 99.9% | 1 | long_tail, null_rate, imbalance |
| bridge_highway_segment | categorical | 99.7% | 2 | null_rate |
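The summary notes that several specialty columns are almost entirely null and can largely be ignored. A minimal sketch of that filter, using null rates copied from the schema table; the 99% cut-off and the names `null_rates`, `THRESHOLD` and `near_empty` are assumptions for illustration, not part of the profile:

```python
# Null rates (percent) copied from the schema table for the sparse columns.
null_rates = {
    "descriptor_2": 88.2,
    "closed_date": 39.0,
    "taxi_pick_up_location": 99.4,
    "vehicle_type": 96.5,
    "facility_type": 99.0,
    "taxi_company_borough": 99.9,
    "bridge_highway_name": 99.7,
    "bridge_highway_direction": 99.7,
    "road_ramp": 99.9,
    "bridge_highway_segment": 99.7,
}

THRESHOLD = 99.0  # hypothetical cut-off for "nearly entirely null"

# Columns at or above the cut-off are candidates to drop before analysis.
near_empty = sorted(col for col, rate in null_rates.items() if rate >= THRESHOLD)
print(near_empty)
```

With this cut-off, vehicle_type (96.5%) and descriptor_2 (88.2%) are kept; whether they are worth keeping depends on the analysis.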
unique_key
categorical · identifier · long_tail
This column is a unique row identifier, likely a primary key or transaction/record ID: every one of the 1,000 rows has a distinct value, giving it perfect cardinality and a maximum entropy ratio of 1.0. Values appear to be numeric strings in a narrow range (~67519046–67525266), suggesting sequential or near-sequential ID assignment. There is nothing analytically surprising here beyond the expected long-tail alert, which is a trivial consequence of full uniqueness. The null rate is 0.0. Treatment: Drop before modelling; retain only for row tracing or joins.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 1,000
- top_value: 67519092
- top_rate: 0.001
- cardinality: 1,000
- entropy: 9.966
- entropy_ratio: 1
created_date
categorical · timestamp · long_tail
This column is a creation timestamp, stored as a categorical string in ISO-8601 millisecond format (e.g., '2026-01-17T23:49:56.000'). With 939 unique values out of 1,000 rows and an entropy ratio of 0.995, it behaves almost like a unique identifier. The long-tail alert and the fact that the most frequent value appears only 9 times (0.9% of rows) suggest occasional duplicate timestamps, likely batch inserts or rapid successive record creation within the same second. Treatment: Parse to datetime, then engineer time-based features (hour, day-of-week, recency); do not use the raw string for modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 939
- top_value: 2026-01-17T23:49:56.000
- top_rate: 0.009
- cardinality: 939
- entropy: 9.828
- entropy_ratio: 0.9952
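The parsing step recommended for created_date could look roughly like this. It is a sketch that assumes every value follows the format of the sample shown in the profile; the function name `time_features` and the chosen features are illustrative:

```python
from datetime import datetime

# Format assumed from the sample value '2026-01-17T23:49:56.000'.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

def time_features(raw: str) -> dict:
    """Parse a created_date string and derive simple calendar features."""
    ts = datetime.strptime(raw, TIMESTAMP_FORMAT)
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }

features = time_features("2026-01-17T23:49:56.000")
print(features)
```

Values that do not match the assumed format raise `ValueError`, which makes format drift visible instead of silently producing nulls.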
agency
categorical · label
This column identifies the New York City municipal agency associated with each record, with 11 distinct agency codes and no nulls across 1,000 rows. The distribution is severely skewed: NYPD alone accounts for 88% of all records (880 of 1,000), while the remaining 10 agencies collectively cover only 120 rows; several (DHS, OOS, DCWP, DPR) appear just twice each. The low entropy ratio of 0.223 confirms this near-monolithic concentration, which could bias any agency-level analysis or model trained on this sample. Treatment: One-hot encode for modelling, but note severe class imbalance; consider oversampling or stratified splitting to ensure minority agencies are represented.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 11
- top_value: NYPD
- top_rate: 0.88
- cardinality: 11
- entropy: 0.7715
- entropy_ratio: 0.223
agency_name
categorical · label
This column identifies the New York City municipal agency responsible for each record, with 11 distinct agencies across 1,000 rows and no nulls. It is severely dominated by the New York City Police Department, which accounts for 88% of all records (880 of 1,000), producing a very low entropy ratio of 0.223. The remaining 10 agencies collectively cover only 120 records, with several (e.g., Department of Homeless Services, Office of the Sheriff) appearing just twice. This extreme class imbalance will distort any agency-level aggregation or model that uses this column as a feature. Treatment: One-hot encode with caution given severe class imbalance; consider grouping rare agencies (≤6 occurrences) into an 'Other' category before modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 11
- top_value: New York City Police Department
- top_rate: 0.88
- cardinality: 11
- entropy: 0.7715
- entropy_ratio: 0.223
complaint_type
categorical · label
This column contains the categorized type of civic complaint (likely NYC 311 service requests), with 45 distinct complaint categories across 1,000 records and zero nulls. The distribution is heavily skewed: 'Noise - Residential' alone accounts for 39.3% of all records, and the three largest noise categories (Residential, Commercial, Street/Sidewalk) together represent 58.5% of the dataset. The entropy ratio of 0.53 indicates moderate concentration, far from uniform, with a long tail of rare complaint types below the top 10. Treatment: One-hot encode or target-encode for modelling; consider grouping rare categories (below ~7 occurrences) into an 'Other' bucket to reduce sparsity.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 45
- top_value: Noise - Residential
- top_rate: 0.393
- cardinality: 45
- entropy: 2.935
- entropy_ratio: 0.5345
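The rare-category grouping suggested for complaint_type can be sketched as follows. Counts are copied from the complaint_type table earlier in the report (only some of the listed values are included here); the threshold of 7 and the helper name `bucket_rare` are assumptions for illustration:

```python
from collections import Counter

# Counts copied from the complaint_type table; the profile lists only the
# top 20 of 45 categories, and this sketch uses a subset of those.
counts = Counter({
    "Noise - Residential": 393,
    "Illegal Parking": 197,
    "Noise - Commercial": 148,
    "Blocked Driveway": 55,
    "HEAT/HOT WATER": 49,
    "Noise - Street/Sidewalk": 44,
    "Noise - Vehicle": 17,
    "UNSANITARY CONDITION": 12,
    "Street Condition": 7,
    "Non-Emergency Police Matter": 7,
    "Smoking or Vaping": 5,
    "Rodent": 4,
    "Taxi Complaint": 3,
    "Drinking": 2,
})

MIN_COUNT = 7  # hypothetical threshold: categories seen fewer times become 'Other'

def bucket_rare(label: str, counts: Counter, min_count: int = MIN_COUNT) -> str:
    """Return the label unchanged if it is frequent enough, else 'Other'."""
    return label if counts.get(label, 0) >= min_count else "Other"

kept = sorted(label for label, c in counts.items() if c >= MIN_COUNT)
print(len(kept), bucket_rare("Rodent", counts))
```

Labels never seen when the counts were built also fall into 'Other', which is usually the desired behaviour when new complaint types appear later.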
descriptor
categorical · label
This column contains descriptive sub-type labels for service requests or complaints, likely from a system such as NYC 311, further specifying the nature of each complaint beyond a top-level category. 'Loud Music/Party' dominates with 42.7% of all 1,000 records, creating notable class imbalance; the top two values alone account for 54.1% of rows. With 71 unique values, entropy ratio of 0.576, and zero nulls, the column is moderately concentrated but still covers a meaningful range of complaint types including noise, parking, and building issues. Treatment: One-hot encode for modelling, but consider grouping rare categories (below ~1% frequency) into an 'Other' bucket to manage the 71-level cardinality.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 71
- top_value: Loud Music/Party
- top_rate: 0.427
- cardinality: 71
- entropy: 3.54
- entropy_ratio: 0.5756
location_type
categorical · label
This column encodes the type of location where an incident or event occurred, with 20 distinct values across 1,000 rows. The dominant category is 'Residential Building/House' at 40.2% of non-null rows (396 occurrences), followed by 'Street/Sidewalk' at 324. A significant data quality issue is immediately apparent: the same real-world concept appears under multiple inconsistent labels. 'Residential Building/House' (396), 'RESIDENTIAL BUILDING' (74), and 'Residential Building' (7) are clearly duplicates, as are 'Street/Sidewalk' (324), 'Street' (8), and 'Sidewalk' (5), so true cardinality is substantially lower than 20 and category frequencies are understated. Treatment: Normalize case and consolidate synonymous labels (e.g. merge 'RESIDENTIAL BUILDING', 'Residential Building', 'Residential Building/House') before encoding as a categorical feature.
- n: 1,000
- nulls: 14 (1.4%)
- unique: 20
- top_value: Residential Building/House
- top_rate: 0.4016
- cardinality: 20
- entropy: 2.237
- entropy_ratio: 0.5177
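The consolidation described for location_type might be sketched like this. The mapping covers only the six labels named in the profile text, and treating 'Street' and 'Sidewalk' as synonyms of 'Street/Sidewalk' follows the profile's reading; the names `CANONICAL` and `normalize_location_type` are illustrative:

```python
from collections import Counter
from typing import Optional

# Lower-cased raw label -> canonical label (assumed mapping, see lead-in).
CANONICAL = {
    "residential building/house": "Residential Building/House",
    "residential building": "Residential Building/House",
    "street/sidewalk": "Street/Sidewalk",
    "street": "Street/Sidewalk",
    "sidewalk": "Street/Sidewalk",
}

def normalize_location_type(value: Optional[str]) -> Optional[str]:
    """Collapse whitespace, ignore case, and map known synonyms."""
    if value is None:
        return None
    key = " ".join(value.split()).lower()
    return CANONICAL.get(key, value.strip())

# Counts as reported in the location_type description.
raw_counts = {
    "Residential Building/House": 396,
    "RESIDENTIAL BUILDING": 74,
    "Residential Building": 7,
    "Street/Sidewalk": 324,
    "Street": 8,
    "Sidewalk": 5,
}

merged = Counter()
for label, n in raw_counts.items():
    merged[normalize_location_type(label)] += n
print(merged)
```

After merging, the residential category grows from 396 to 477 rows and the street category from 324 to 337, which shows how much the raw frequencies understate both.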
incident_zip
categorical · feature
This column contains US ZIP codes associated with incidents, all values appearing to be New York City ZIP codes: the 10xxx series covers Manhattan, Staten Island, and the Bronx, and the 11xxx series covers Brooklyn and Queens. With 146 unique values across 1,000 rows and a high entropy ratio of 0.931, the distribution is remarkably spread. The most frequent ZIP '10011' appears only 42 times (4.2% of non-null rows), indicating incidents are distributed broadly across many neighbourhoods rather than concentrated in a few. Null rate is negligible at 0.3%. Treatment: Encode as geographic feature; consider grouping by borough or joining to a ZIP-code reference table for lat/lon or demographic enrichment.
- n: 1,000
- nulls: 3 (0.3%)
- unique: 146
- top_value: 10011
- top_rate: 0.04213
- cardinality: 146
- entropy: 6.696
- entropy_ratio: 0.9313
incident_address
categorical · feature · long_tail
This column contains free-text street addresses of incident locations, likely from a New York City incident or complaints dataset given recognizable street names (Lenox Avenue, Avenue of the Americas, Bruckner Boulevard). With 776 unique values across 1,000 rows and an entropy ratio of 0.97, the distribution is highly dispersed, a classic long-tail pattern. Notably, '126 WEST 13 STREET' appears 31 times (3.1% of rows), a disproportionate spike that may indicate a shelter, institution, or high-incident venue worth investigating. Inconsistent spacing in values like '60 EAST 93 STREET' suggests formatting irregularities that will need normalization. Treatment: Normalize whitespace and casing, then geocode or extract street/borough components for spatial modelling.
- n: 1,000
- nulls: 5 (0.5%)
- unique: 776
- top_value: 126 WEST 13 STREET
- top_rate: 0.03116
- cardinality: 776
- entropy: 9.321
- entropy_ratio: 0.9709
street_name
categorical · feature · long_tail
This column contains street names, almost certainly from a New York City dataset given entries like 'WEST 13 STREET', 'LENOX AVENUE', 'BRUCKNER BOULEVARD', and 'JAMAICA AVENUE'. With 569 unique values across 1,000 rows and an entropy ratio of 0.955, the distribution is highly spread: the top value 'WEST 13 STREET' appears only 31 times (3.1% of rows), confirming the long-tail alert. The near-zero null rate (0.5%) is clean, but the high cardinality and long tail mean most street names are rare, which limits direct one-hot encoding utility. Treatment: Frequency-encode or embed as a categorical feature; avoid one-hot encoding due to 569-level cardinality and long-tail distribution.
- n: 1,000
- nulls: 5 (0.5%)
- unique: 569
- top_value: WEST 13 STREET
- top_rate: 0.03116
- cardinality: 569
- entropy: 8.742
- entropy_ratio: 0.9551
cross_street_1
categorical · feature · long_tail
This column captures the first cross street in a NYC street-address intersection, most likely from a traffic incident, 311, or similar geospatial events dataset. With 511 unique values across 1,000 rows and an entropy ratio of 0.953, the distribution is nearly flat, a strong long-tail signal, meaning the vast majority of street names appear only once or twice. The top value 'AVENUE OF THE AMERICAS' appears just 35 times (3.8% of non-null rows), and the 8% null rate suggests some records lack cross-street data entirely. Treatment: Normalize street name strings (abbreviations, casing), then use as a categorical geographic feature or join to a street reference table; high cardinality warrants frequency encoding or embedding rather than one-hot encoding.
- n: 1,000
- nulls: 80 (8.0%)
- unique: 511
- top_value: AVENUE OF THE AMERICAS
- top_rate: 0.03804
- cardinality: 511
- entropy: 8.574
- entropy_ratio: 0.953
cross_street_2
categorical · feature · long_tail
This column contains the secondary cross-street name associated with a location record, likely from a NYC incident or address dataset. With 504 unique values across 1,000 rows and an entropy ratio of 0.948, the distribution is nearly flat, a strong long-tail signal where most street names appear very rarely. The top value '7 AVENUE' appears only 36 times (3.9% of non-null rows), and 'DEAD END' appearing 10 times is a notable data-quality flag indicating unresolved or terminus locations. Treatment: Standardize street name abbreviations, flag 'DEAD END' as a sentinel value, and consider encoding via frequency bucketing or embedding rather than one-hot due to high cardinality.
- n: 1,000
- nulls: 79 (7.9%)
- unique: 504
- top_value: 7 AVENUE
- top_rate: 0.03909
- cardinality: 504
- entropy: 8.511
- entropy_ratio: 0.9481
intersection_street_1
categorical · feature · long_tail
This column captures the name of the first cross-street at an intersection, consistent with NYC street-incident or infrastructure data. With 506 unique values across 1,000 rows (entropy ratio 0.95), the distribution is nearly flat: the long-tail alert confirms that the vast majority of street names appear only once or twice, while even the top value ('AVENUE OF THE AMERICAS') accounts for just 3.8% of non-null records. An 8.5% null rate suggests intersections are not always fully recorded, which may indicate missing location data rather than true absence of a street. Treatment: Standardize street name strings (abbreviations, spacing), then encode as a high-cardinality categorical using target encoding or embedding before modelling; impute or flag nulls separately.
- n: 1,000
- nulls: 85 (8.5%)
- unique: 506
- top_value: AVENUE OF THE AMERICAS
- top_rate: 0.03825
- cardinality: 506
- entropy: 8.558
- entropy_ratio: 0.9527
intersection_street_2
categorical · feature · long_tail
This column captures the secondary (cross) street name at an intersection, likely from a New York City incident or traffic dataset. With 499 unique values across 1,000 rows and an entropy ratio of 0.948, the distribution is nearly flat, a strong long-tail pattern where '7 AVENUE' is the modal value at only 3.9% frequency. The presence of 'DEAD END' (10 occurrences) as a top value is notable, indicating it is used as a literal street descriptor rather than a true street name, which may require cleaning. An 8.4% null rate suggests some incidents occurred at non-intersection locations (e.g., mid-block). Treatment: Standardize 'DEAD END' and similar non-street tokens, impute or flag nulls, then encode as a categorical feature or use for geospatial joining.
- n: 1,000
- nulls: 84 (8.4%)
- unique: 499
- top_value: 7 AVENUE
- top_rate: 0.0393
- cardinality: 499
- entropy: 8.495
- entropy_ratio: 0.9478
address_type
categorical · label · imbalance
This column classifies the type of geographic address entry, with four categories: ADDRESS, INTERSECTION, BLOCKFACE, and PLACE. It is severely imbalanced: 'ADDRESS' dominates at 97.2% of valid records (970 of 998 non-null rows), while the remaining three types collectively account for only 28 records. The entropy ratio of 0.108 confirms near-minimal informational diversity, meaning this field will contribute little discriminative power in most models without special handling. Treatment: One-hot encode with caution; minority classes (INTERSECTION=19, BLOCKFACE=7, PLACE=2) may need oversampling or collapsing into an 'OTHER' bucket before modelling.
- n: 1,000
- nulls: 2 (0.2%)
- unique: 4
- top_value: ADDRESS
- top_rate: 0.9719
- cardinality: 4
- entropy: 0.2169
- entropy_ratio: 0.1084
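The profile does not say which denominator top_rate uses, but the reported values are consistent with dividing by non-null rows rather than all rows. A small check under that assumption, with counts taken from the column descriptions (the function name `top_rate` is illustrative):

```python
def top_rate(top_count: int, n_rows: int, n_nulls: int) -> float:
    """Share of the most frequent value among non-null rows (assumed definition)."""
    return top_count / (n_rows - n_nulls)

# address_type: 970 'ADDRESS' rows, 2 nulls out of 1,000 rows.
address_type_rate = top_rate(970, 1000, 2)
# cross_street_1: 35 'AVENUE OF THE AMERICAS' rows, 80 nulls out of 1,000 rows.
cross_street_1_rate = top_rate(35, 1000, 80)

print(round(address_type_rate, 4), round(cross_street_1_rate, 5))
```

This matters when quoting percentages: for columns with many nulls, "share of rows" and "share of non-null rows" differ noticeably.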
city
categorical · feature
This column contains NYC borough and neighborhood names used as postal cities, almost certainly a mailing-address city field from a dataset heavily concentrated in New York City. The top three values alone, BROOKLYN (250), NEW YORK (249), and BRONX (198), account for roughly 70% of the 1,000 rows, while the remaining 33 values (e.g., WOODHAVEN, JAMAICA, RIDGEWOOD) appear to be mostly Queens neighborhoods used as postal city names. The distribution is notably skewed (top_rate 0.255), but the entropy_ratio of 0.635 across 36 unique values suggests moderate spread beyond the top cluster; the 2.1% null rate is minor. Treatment: One-hot encode or target-encode for modelling; consider grouping neighborhood-level postal cities (e.g., WOODHAVEN, JAMAICA) into borough-level categories to reduce sparsity.
- n: 1,000
- nulls: 21 (2.1%)
- unique: 36
- top_value: BROOKLYN
- top_rate: 0.2554
- cardinality: 36
- entropy: 3.282
- entropy_ratio: 0.6349
landmark
categorical · label · long_tail
This column contains street or landmark names, likely representing the nearest notable street or geographic reference point for each record in a New York City–area dataset. With 515 unique values across 1,000 rows (51.5% uniqueness) and a 10.5% null rate, coverage is incomplete. The distribution is heavily long-tailed: the top value 'WEST 13 STREET' appears only 31 times (3.46% of non-null rows), and the entropy ratio of 0.955 indicates near-maximum disorder, meaning most landmark names are rare or unique. That makes this column unsuitable as a reliable grouping key without significant consolidation. Treatment: Impute or flag nulls (10.5%), then either group by street type/prefix for coarser features or embed as a high-cardinality categorical; avoid one-hot encoding given 515 levels.
- n: 1,000
- nulls: 105 (10.5%)
- unique: 515
- top_value: WEST 13 STREET
- top_rate: 0.03464
- cardinality: 515
- entropy: 8.603
- entropy_ratio: 0.955
status
categorical · label
This column is a workflow/lifecycle status field with exactly 3 states: Closed, In Progress, and Open. 'Closed' dominates at 61% (610/1,000) and 'Open' is rare at only 8.5% (85/1,000), but a further 30.5% (305/1,000) are 'In Progress', so 39% of records are still unresolved. That share matches the 39.0% null rate of closed_date, and it means the sample is only moderately skewed toward resolved records. No nulls and perfect coverage across all 1,000 rows. Treatment: One-hot encode or ordinal-encode (Open→In Progress→Closed) depending on whether a natural progression is to be modelled; consider class imbalance if used as a target.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 3
- top_value: Closed
- top_rate: 0.61
- cardinality: 3
- entropy: 1.26
- entropy_ratio: 0.7948
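The profile does not define entropy or entropy_ratio, but the reported figures are consistent with Shannon entropy in bits and with that entropy divided by log2 of the cardinality. A sketch under that assumption, checked against the status and borough counts given in the report (function names are illustrative):

```python
import math
from typing import Sequence

def entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy in bits of a distribution given as value counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts: Sequence[int]) -> float:
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    if k < 2:
        return 0.0  # convention chosen here; the profile's handling is unknown
    return entropy_bits(counts) / math.log2(k)

status_counts = [610, 305, 85]             # Closed, In Progress, Open
borough_counts = [274, 258, 254, 203, 11]  # Queens ... Staten Island

print(round(entropy_bits(status_counts), 3), round(entropy_ratio(status_counts), 4))
print(round(entropy_bits(borough_counts), 3), round(entropy_ratio(borough_counts), 4))
```

A ratio near 1 means values are spread almost evenly across categories; a ratio near 0 means one value dominates, as with agency (0.223) or address_type (0.108).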
community_board
categorical · label
This column represents NYC Community Board designations, combining a numeric district ID with a borough name (e.g., '02 MANHATTAN'). With 65 unique values across 1,000 rows and zero nulls, coverage is complete. The distribution is notably flat: an entropy ratio of 0.93 indicates near-uniform spread, with the most frequent value ('02 MANHATTAN') appearing only 5.6% of the time, suggesting no single board dominates the dataset. Manhattan and Queens boards appear disproportionately among the top 10, which may reflect a geographic sampling bias worth investigating. Treatment: One-hot encode or target-encode for modelling; consider grouping by the borough suffix to reduce cardinality from 65 to 5.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 65
- top_value: 02 MANHATTAN
- top_rate: 0.056
- cardinality: 65
- entropy: 5.61
- entropy_ratio: 0.9316
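Grouping community boards by borough requires splitting the combined label. A minimal sketch, assuming every value follows the '<district> <BOROUGH>' pattern of the sample shown in the profile; values with a different shape are not handled, and the function name is illustrative:

```python
from typing import Tuple

def split_community_board(value: str) -> Tuple[str, str]:
    """Split '02 MANHATTAN' into ('02', 'MANHATTAN') at the first space."""
    district, _, borough = value.strip().partition(" ")
    return district, borough

print(split_community_board("02 MANHATTAN"))
print(split_community_board("01 STATEN ISLAND"))
```

Splitting at the first space only keeps two-word borough names such as STATEN ISLAND intact.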
council_district
categorical · label
This column represents a council district code, a zero-padded numeric string identifier (e.g., '03', '32') used to assign records to geographic administrative units. With 51 distinct values across 1,000 rows and an entropy ratio of 0.954, the distribution is remarkably flat and near-uniform, meaning no single district dominates heavily; the most frequent value '03' appears only 56 times (5.65% of non-null rows). This near-maximum entropy suggests either broad geographic coverage or deliberate sampling across all districts. Null rate is negligible at 0.9%. Treatment: Use as a categorical grouping variable; one-hot or target-encode for modelling, or retain as-is for geographic aggregation and joins.
- n: 1,000
- nulls: 9 (0.9%)
- unique: 51
- top_value: 03
- top_rate: 0.05651
- cardinality: 51
- entropy: 5.412
- entropy_ratio: 0.9541
police_precinct
categorical · label
This column represents the police precinct assignment for each record, with 76 distinct precinct labels across 1,000 rows and zero nulls. The distribution is remarkably flat: the most common value, 'Precinct 6', appears only 50 times (5% of rows), and the entropy ratio is 0.936, nearly as uniform as a perfectly even distribution across all 76 precincts. No dominant precinct stands out, suggesting either a geographically broad dataset or deliberate sampling across jurisdictions. Treatment: One-hot encode or target-encode for modelling; high cardinality (76 levels) warrants regularised encoding rather than naive dummies.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 76
- top_value: Precinct 6
- top_rate: 0.05
- cardinality: 76
- entropy: 5.847
- entropy_ratio: 0.9358
bbl
categorical · foreign_key · long_tail
This column contains New York City Borough-Block-Lot (BBL) codes, a standard 10-digit property identifier encoding borough (leading digit 1–5), block, and lot numbers. With 719 unique values across 1,000 rows and an entropy ratio of 0.97, it is near-unique, but the top value '1006080026' appears 31 times (3.3% of non-null rows), indicating repeated records tied to a single property, which is what the long-tail alert flags. The 5.5% null rate and clustering of repeats suggest this may be a foreign key referencing a property registry rather than a row-level unique identifier. Treatment: Left-join on this BBL to enrich with property-level attributes from a NYC PLUTO or similar property reference table.
- n: 1,000
- nulls: 55 (5.5%)
- unique: 719
- top_value: 1006080026
- top_rate: 0.0328
- cardinality: 719
- entropy: 9.189
- entropy_ratio: 0.9683
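Before joining on BBL it can help to split the code into its parts and validate it. A sketch assuming the usual 10-digit layout of 1 borough digit, 5 block digits, and 4 lot digits; the helper name `parse_bbl` and the borough-code table are illustrative and not taken from the profile:

```python
from typing import Tuple

# Standard NYC borough codes used as the leading BBL digit.
BOROUGH_CODES = {
    1: "MANHATTAN",
    2: "BRONX",
    3: "BROOKLYN",
    4: "QUEENS",
    5: "STATEN ISLAND",
}

def parse_bbl(bbl: str) -> Tuple[int, int, int]:
    """Split a 10-digit BBL string into (borough, block, lot) integers."""
    if len(bbl) != 10 or not bbl.isdigit():
        raise ValueError(f"not a 10-digit BBL: {bbl!r}")
    borough, block, lot = int(bbl[0]), int(bbl[1:6]), int(bbl[6:])
    if borough not in BOROUGH_CODES:
        raise ValueError(f"unknown borough code in BBL: {bbl!r}")
    return borough, block, lot

borough, block, lot = parse_bbl("1006080026")
print(BOROUGH_CODES[borough], block, lot)
```

For the top value this yields Manhattan, block 608, lot 26. Keeping the original string as the join key avoids losing the zero padding.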
borough
categorical · label
This column represents the five New York City boroughs, functioning as a geographic label with complete coverage (null_rate 0.0) across all 1,000 rows. Four boroughs, Queens (274), Manhattan (258), Brooklyn (254), and Bronx (203), are distributed with reasonable balance, but Staten Island is strikingly underrepresented at just 11 occurrences (~1.1%), compared to 20% expected under uniform distribution. The high entropy_ratio of 0.886 reflects the near-even spread among the four dominant boroughs, masking Staten Island's severe underrepresentation. Treatment: One-hot encode or target-encode for modelling; note Staten Island's class imbalance (n=11) may require stratified sampling or grouping.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 5
- top_value: QUEENS
- top_rate: 0.274
- cardinality: 5
- entropy: 2.057
- entropy_ratio: 0.8858
x_coordinate_state_plane
categorical · feature · long_tail
This column contains X-coordinates in a State Plane coordinate system, stored as categorical strings rather than numerics; values like '984721' and '1004501' are typical State Plane Easting values in feet. With 763 unique values out of 1,000 rows (entropy ratio 0.97) and a long-tail alert, the distribution is nearly unique per record, which is expected for spatial point coordinates. The top value '984721' appearing 31 times (3.1% of rows) is surprisingly frequent for a coordinate and may indicate a default, imputed, or snapped location worth investigating. The count equals the 31 occurrences of '126 WEST 13 STREET' in incident_address, so a single repeatedly reported address is a plausible alternative explanation. Treatment: Cast to numeric, pair with Y-coordinate for spatial analysis, and investigate the 31 rows sharing coordinate '984721' for potential data quality issues.
- n: 1,000
- nulls: 7 (0.7%)
- unique: 763
- top_value: 984721
- top_rate: 0.03122
- cardinality: 763
- entropy: 9.292
- entropy_ratio: 0.9704
y_coordinate_state_plane
categorical · feature · long_tail
This column represents Y-coordinates in a State Plane coordinate system, likely a geographic reference for spatial data (e.g., NYC or similar municipal dataset). Despite being numeric in nature, it is stored as a categorical type, which is unexpected and likely a data pipeline issue. With 768 unique values out of 1,000 rows (entropy ratio 0.97) and a long-tail alert, the distribution is highly dispersed, though the top value '207809' appears 31 times, a modest but notable cluster suggesting a frequently referenced location. The near-unique cardinality makes this unsuitable for direct use as a categorical feature. Treatment: Cast to numeric, then use as a spatial coordinate feature or pair with x-coordinate for geospatial analysis.
- n: 1,000
- nulls: 7 (0.7%)
- unique: 768
- top_value: 207809
- top_rate: 0.03122
- cardinality: 768
- entropy: 9.304
- entropy_ratio: 0.9707
open_data_channel_type
categorical · feature
This column captures the channel through which a report or request was submitted, with four distinct values: ONLINE, MOBILE, PHONE, and UNKNOWN. ONLINE leads at 46.6% of records, followed by MOBILE (28.8%) and PHONE (23.6%), leaving only 10 records (1%) tagged as UNKNOWN. The three known channels are moderately balanced (entropy ratio 0.794), although ONLINE is about twice as common as PHONE. The UNKNOWN category, while small, may warrant imputation or flagging depending on downstream use. Treatment: One-hot encode the 3 known channels (they have no natural order, so an ordinal mapping would impose one); consider grouping or flagging the 10 UNKNOWN values before modelling.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 4
- top_value: ONLINE
- top_rate: 0.466
- cardinality: 4
- entropy: 1.589
- entropy_ratio: 0.7943
park_facility_name
categorical · label · imbalance
This column captures the named park facility associated with each record, but it is almost entirely uninformative: 999 of 1,000 rows (99.9%) carry the placeholder value 'Unspecified', with only a single record attributed to Flushing Meadows Corona Park. The near-zero entropy (0.011) confirms the column is maximally imbalanced and conveys virtually no discriminative signal. This is likely a poorly populated administrative field rather than a reliable feature. Treatment: Drop from modelling; if facility-level analysis is ever needed, this field requires back-filling from a source system before use.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 2
- top_value: Unspecified
- top_rate: 0.999
- cardinality: 2
- entropy: 0.01141
- entropy_ratio: 0.01141
park_borough
categorical · label
This column encodes the five NYC borough names associated with park locations, making it a low-cardinality geographic label with zero nulls across 1,000 rows. Four boroughs (Queens, Manhattan, Brooklyn, Bronx) are nearly evenly distributed (203–274 records each), but Staten Island is severely underrepresented at just 11 occurrences (1.1%), which would surprise an analyst expecting proportional borough coverage. Entropy ratio of 0.886 reflects the near-uniform spread across four classes, masking the sharp imbalance on the fifth. Treatment: One-hot encode for modelling; note the severe class imbalance for STATEN ISLAND (11 of 1,000) and consider stratified sampling or weighting.
- n: 1,000
- nulls: 0 (0.0%)
- unique: 5
- top_value: QUEENS
- top_rate: 0.274
- cardinality: 5
- entropy: 2.057
- entropy_ratio: 0.8858
latitude
categorical · feature · long_tail
This column contains geographic latitude coordinates stored as strings, with all top values clustering tightly around 40.6–40.8°N, consistent with New York City's latitude range. Despite being numeric in nature, it was ingested as categorical, yielding 770 unique values across 1,000 rows (entropy ratio 0.97), which is near-unique. The long-tail alert and the top value ('40.73706433046593') appearing 31 times (3.1% of rows) suggest either a default/fallback coordinate or a high-traffic location being logged repeatedly. Treatment: Cast to float64, pair with a longitude column for geospatial features, and investigate the 31-row duplicate coordinate for data quality issues before modelling.
- n: 1,000
- nulls: 7 (0.7%)
- unique: 770
- top_value: 40.73706433046593
- top_rate: 0.03122
- cardinality: 770
- entropy: 9.308
- entropy_ratio: 0.9707
longitude
categorical feature long_tailThis column contains geographic longitude values, stored as strings (hence the categorical classification), representing locations clustered tightly around -73.8 to -74.0 — consistent with New York City. Despite 1,000 rows, there are 770 unique values, yet the top value '-73.99830041620608' appears 31 times (3.1% of rows), which is surprisingly high given the near-continuous nature of coordinate data and flags a long-tail concentration. Entropy ratio of 0.97 confirms very high diversity overall, but the repeated exact coordinate values suggest either data binning, snapping to a fixed point, or duplicated records at specific locations. Treatment: Parse to float, verify duplicate coordinates for data quality issues, then use directly as a numeric geospatial feature or pair with latitude for distance/clustering models.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 770
- top_value
- -73.99830041620608
- top_rate
- 0.03122
- cardinality
- 770
- entropy
- 9.308
- entropy_ratio
- 0.9707
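The latitude and longitude treatments (parse to float, then check repeated coordinates) might be sketched as follows. The row layout as a list of dicts is an assumption, as is pairing the two top values into a single point; the third row is made up for illustration.

```python
from collections import Counter

def to_float(text):
    """Parse a coordinate string; return None for nulls or unparseable text."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def repeated_points(rows, min_count=2):
    """Return exact (latitude, longitude) pairs seen at least min_count times."""
    points = Counter()
    for row in rows:
        lat = to_float(row.get("latitude"))
        lon = to_float(row.get("longitude"))
        if lat is not None and lon is not None:
            points[(lat, lon)] += 1
    return {pt: n for pt, n in points.items() if n >= min_count}

# Hypothetical rows: the first two reuse the reported top values,
# assuming (unverified) that they belong to the same records.
rows = [
    {"latitude": "40.73706433046593", "longitude": "-73.99830041620608"},
    {"latitude": "40.73706433046593", "longitude": "-73.99830041620608"},
    {"latitude": "40.6", "longitude": "-73.9"},  # made-up point
    {"latitude": None, "longitude": None},       # null row is skipped
]
duplicates = repeated_points(rows)
```

Whether a repeated point is a fallback coordinate or a genuinely busy address still needs a look at the matching address fields.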
location
unknown other skipped
This column is named 'location' and contains 1,000 non-null values with a null rate of 0.0, but the profiler skipped it entirely, yielding no stats, no uniqueness count, and no type inference. Without distribution, cardinality, or sample values, it is impossible to determine whether this is a structured geographic field (e.g., city, coordinates, country code) or free-text. The 'skipped' alert is the dominant signal and warrants direct inspection of raw values before any downstream use. Treatment: Manually inspect raw values to determine structure (free-text, categorical, coordinate, etc.) before deciding on encoding or embedding strategy.
- n
- 1,000
- nulls
- 0 (0.0%)
- unique
- —
:@computed_region_f5dn_yrer
categorical foreign_key
This column is a Socrata-generated computed region identifier, assigning each row to one of 62 geographic zones (e.g., neighborhood, district, or census tract polygons). With an entropy ratio of 0.939 and 62 unique integer-like codes across 1,000 rows, values are distributed fairly evenly: the most frequent region ('57') accounts for only 5.6% of rows, suggesting no strong spatial concentration. The null rate is negligible at 0.7%. The numeric-looking codes are categorical labels, not ordinal or continuous values. Treatment: Treat as a categorical geographic key; left-join to a region lookup table or one-hot encode for spatial feature engineering.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 62
- top_value
- 57
- top_rate
- 0.05639
- cardinality
- 62
- entropy
- 5.591
- entropy_ratio
- 0.9389
:@computed_region_yeji_bk3q
categorical foreign_key
This column is a Socrata-generated computed region identifier (geo-zone lookup key), indicated by the ':@computed_region_' prefix; it maps each row to one of 5 predefined geographic regions. Values are nearly uniformly distributed across zones 2–4 (~254–269 rows each), but zone '1' is a stark outlier with only 11 occurrences (~1.1% of rows), which may signal a very small or edge-case geographic area. Null rate is negligible at 0.7%. Treatment: Use as a categorical grouping key or left-join to a region lookup table; investigate zone '1' underrepresentation before using as a stratification variable.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 5
- top_value
- 3
- top_rate
- 0.2709
- cardinality
- 5
- entropy
- 2.054
- entropy_ratio
- 0.8846
:@computed_region_sbqj_enih
categorical foreign_key
This column is a Socrata computed region identifier, automatically generated by the platform to assign rows to pre-defined geographic boundary zones (e.g., neighbourhoods, council districts, or census tracts). With 75 distinct integer-like codes across 1,000 rows and an entropy ratio of 0.94, the distribution is remarkably flat: no single zone dominates, with even the top value '3' appearing in only 5% of rows. The near-uniform spread across 75 regions and very low null rate (0.7%) suggest broad geographic coverage with no strong spatial concentration. Treatment: Left-join on this region code to a geographic boundaries table to enrich with spatial attributes; do not treat as a numeric feature.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 75
- top_value
- 3
- top_rate
- 0.05035
- cardinality
- 75
- entropy
- 5.846
- entropy_ratio
- 0.9385
:@computed_region_92fq_4b7q
categorical foreign_key
This column is a Socrata-generated computed region identifier, typically encoding a geographic zone or district (e.g., census tract, neighbourhood, or administrative boundary) as an opaque integer-like label. With 51 unique values across 1,000 rows and an entropy ratio of 0.948, the distribution is near-uniform: no single region dominates heavily, though region '10' leads with 6.4% of records (64 occurrences). The null rate of 0.7% is negligible, suggesting reliable spatial assignment. Treatment: Left-join on this computed region ID to a Socrata boundary dataset to retrieve human-readable geographic names; do not treat the numeric strings as ordinal or cardinal values.
- n
- 1,000
- nulls
- 7 (0.7%)
- unique
- 51
- top_value
- 10
- top_rate
- 0.06445
- cardinality
- 51
- entropy
- 5.376
- entropy_ratio
- 0.9477
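The left-join recommended for the computed region columns can be sketched with plain dictionaries. The lookup table and its region name below are hypothetical; the report does not say which boundary dataset each code refers to. Codes are kept as strings so they are never treated as numbers.

```python
REGION_KEY = ":@computed_region_92fq_4b7q"

def left_join_region(rows, lookup, key=REGION_KEY):
    """Attach a region_name to every row; unmatched or null codes get None."""
    return [{**row, "region_name": lookup.get(row.get(key))} for row in rows]

# Hypothetical lookup: code -> human-readable name (not taken from the dataset).
region_lookup = {"10": "Hypothetical District A"}

rows = [
    {REGION_KEY: "10"},   # matches the lookup
    {REGION_KEY: "99"},   # code missing from the lookup
    {REGION_KEY: None},   # null code, as in the 0.7% null rows
]
joined = left_join_region(rows, region_lookup)
```

A left join keeps every request row, so unmatched codes surface as None rather than silently dropping records.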
descriptor_2
categorical label long_tail null_rate
This column is a secondary complaint descriptor, likely a sub-category or detail field attached to a service request or complaint record (consistent with NYC 311-style data). It is almost entirely empty: null_rate is 0.882, meaning only 118 of 1,000 rows carry a value. Among the 43 unique non-null values, 'NO HEAT' dominates at 27.97% of non-null entries (33 occurrences), but the spread across housing, noise, sanitation, and transportation categories (e.g., 'Cannabis Smoking or Vaping', 'Unsafe Driving - Non-Passenger') signals that this field is populated inconsistently across complaint types rather than following a uniform taxonomy. The high entropy_ratio of 0.807 confirms the long-tail alert: values are broadly dispersed despite the sparse fill rate. Treatment: Filter to non-null rows before use; group rare categories (below a frequency threshold) into 'OTHER' and one-hot encode or target-encode the remainder.
- n
- 1,000
- nulls
- 882 (88.2%)
- unique
- 43
- top_value
- NO HEAT
- top_rate
- 0.2797
- cardinality
- 43
- entropy
- 4.376
- entropy_ratio
- 0.8065
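The descriptor_2 treatment (filter nulls, then fold rare categories into 'OTHER') could look like the sketch below. The threshold of 5 is an arbitrary assumption, and the sample counts other than the 33 'NO HEAT' rows are invented for illustration.

```python
from collections import Counter

def collapse_rare(values, min_count=5, other="OTHER"):
    """Drop nulls, then replace categories seen fewer than min_count times."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return [v if counts[v] >= min_count else other for v in non_null]

# 'NO HEAT' count is from the profile; the other counts are made up.
sample = (
    ["NO HEAT"] * 33
    + ["Cannabis Smoking or Vaping"] * 2
    + ["Unsafe Driving - Non-Passenger"]
    + [None] * 4
)
collapsed = collapse_rare(sample)
collapsed_counts = Counter(collapsed)
```

The threshold is worth tuning against the real counts, since with only 118 non-null rows most of the 43 categories will be rare.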
resolution_description
categorical label null_rate
This column contains standardized resolution outcome descriptions from NYC 311 service requests, drawn from a fixed vocabulary of only 16 distinct boilerplate phrases. Despite being a text column, it behaves as a low-cardinality categorical: the top value (NYPD 'no criminal violation, condition corrected') accounts for 35.4% of non-null rows. A 30.7% null rate is flagged as an alert, likely reflecting complaints that have not yet been resolved or closed. Treatment: Encode as ordinal or one-hot categorical (16 levels); treat nulls as a distinct 'unresolved' category rather than imputing.
- n
- 1,000
- nulls
- 307 (30.7%)
- unique
- 16
- top_value
- The New York City Police Department responded to the complaint and their investigation determined that no criminal violation existed. The condition was corrected without the need to issue a summons or effect an arrest. If the problem persists, please contact 311 to create another complaint. If possible, provide contact information so responding officers may reach out to you for more details. If necessary, your complaint may be referred to your local precinct's special operations units (Quality of Life, etc.). Thank you for your attention to this matter. We count on New Yorkers like yourself to maintain a safe City, so please let us know if you see other conditions that require our attention.
- top_rate
- 0.3535
- cardinality
- 16
- entropy
- 2.777
- entropy_ratio
- 0.6943
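Treating nulls as their own 'unresolved' level before one-hot encoding might be sketched like this. The short labels stand in for the long boilerplate phrases and are hypothetical.

```python
def one_hot_with_null_level(values, null_label="unresolved"):
    """Map nulls to null_label, then one-hot encode against sorted levels."""
    filled = [null_label if v is None else v for v in values]
    levels = sorted(set(filled))
    encoded = [[int(v == level) for level in levels] for v in filled]
    return levels, encoded

# Hypothetical short stand-ins for the 16 boilerplate phrases.
sample = ["nypd_no_violation", None, "nypd_no_violation", "hpd_inspected"]
levels, encoded = one_hot_with_null_level(sample)
```

Keeping the null level explicit preserves the signal that a request has no recorded outcome yet, which imputation would erase.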
resolution_action_updated_date
categorical timestamp long_tail null_rate
This column is a timestamp recording when a resolution action was last updated, stored as a categorical string in ISO 8601 format. The 30.6% null rate is a significant concern, indicating roughly one-third of records have no update date logged. Strikingly, one value, '2026-01-17T00:00:00.000', accounts for 10.7% of all non-null records (74 occurrences) with an exact-midnight timestamp, suggesting a bulk update or default-date assignment rather than genuine event timestamps. All remaining top values appear only twice, confirming a severe long-tail distribution (entropy ratio 0.94) consistent with mostly unique timestamps. Treatment: Parse to datetime, flag the 74 midnight '2026-01-17' records as probable synthetic/default dates, and impute or exclude nulls (30.6%) before any time-based feature engineering.
- n
- 1,000
- nulls
- 306 (30.6%)
- unique
- 585
- top_value
- 2026-01-17T00:00:00.000
- top_rate
- 0.1066
- cardinality
- 585
- entropy
- 8.673
- entropy_ratio
- 0.9435
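Parsing these strings and flagging exact-midnight values can be sketched as below. The format string is inferred from the sample values shown in the profile; the helper names are illustrative.

```python
from datetime import datetime, time

# Format inferred from values such as '2026-01-17T00:00:00.000'.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

def parse_ts(text):
    """Parse a timestamp string; nulls and empty strings become None."""
    if not text:
        return None
    return datetime.strptime(text, TS_FORMAT)

def is_exact_midnight(ts):
    """True when the time part is 00:00:00.000, a hint of a default date."""
    return ts is not None and ts.time() == time(0, 0)

values = ["2026-01-17T00:00:00.000", "2026-01-18T00:41:26.000", None]
flags = [is_exact_midnight(parse_ts(v)) for v in values]
```

A midnight flag is only a heuristic: a genuine update can land on 00:00:00, so the flagged rows deserve a manual check before exclusion.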
closed_date
categorical timestamp long_tail null_rate
This column is a ticket or record closure timestamp, stored as a string rather than a parsed datetime type, with second-level precision. Two signals demand attention: 39% of rows are null, indicating a large share of records that have not yet been closed (open items); and with 585 unique values across the 610 non-null rows and an entropy ratio of 0.998, timestamps are nearly all distinct, which is expected for event times but confirms the column carries no categorical signal. All top values cluster on 2026-01-18, suggesting the snapshot or batch was captured on that single date. Treatment: Parse to datetime, use nulls as an 'is_open' binary flag, and engineer elapsed-time features (e.g. days-to-close) rather than using raw values.
- n
- 1,000
- nulls
- 390 (39.0%)
- unique
- 585
- top_value
- 2026-01-18T00:41:26.000
- top_rate
- 0.004918
- cardinality
- 585
- entropy
- 9.169
- entropy_ratio
- 0.9975
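The 'is_open' flag and an elapsed-time feature might be derived as in this sketch. It assumes each record also carries a creation timestamp in the same string format; that field is not profiled in this section, and the creation time in the example is invented.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

def closure_features(created_text, closed_text):
    """Return an is_open flag and hours from creation to closure (None if open)."""
    created = datetime.strptime(created_text, TS_FORMAT)
    if not closed_text:
        return {"is_open": True, "hours_to_close": None}
    closed = datetime.strptime(closed_text, TS_FORMAT)
    hours = (closed - created).total_seconds() / 3600
    return {"is_open": False, "hours_to_close": hours}

# The creation time here is hypothetical; the closure time is the reported top value.
closed_example = closure_features("2026-01-17T22:41:26.000", "2026-01-18T00:41:26.000")
open_example = closure_features("2026-01-17T22:41:26.000", None)
```

Hours are used here instead of days because every closure in the sample falls within a narrow window around 2026-01-18.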
taxi_pick_up_location
categorical long_tail null_rate
- n
- 1,000
- nulls
- 994 (99.4%)
- unique
- 6
- top_value
- 55 LITTLE WEST 12 STREET, MANHATTAN (NEW YORK), NY, 10014
- top_rate
- 0.1667
- cardinality
- 6
- entropy
- 2.585
- entropy_ratio
- 1
vehicle_type
categorical null_rate
- n
- 1,000
- nulls
- 965 (96.5%)
- unique
- 4
- top_value
- Car
- top_rate
- 0.7714
- cardinality
- 4
- entropy
- 1.097
- entropy_ratio
- 0.5484
facility_type
categorical null_rate imbalance
- n
- 1,000
- nulls
- 990 (99.0%)
- unique
- 1
- top_value
- N/A
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
taxi_company_borough
categorical long_tail null_rate imbalance
- n
- 1,000
- nulls
- 999 (99.9%)
- unique
- 1
- top_value
- BROOKLYN
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
bridge_highway_name
categorical null_rate
- n
- 1,000
- nulls
- 997 (99.7%)
- unique
- 2
- top_value
- E
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
bridge_highway_direction
categorical null_rate
- n
- 1,000
- nulls
- 997 (99.7%)
- unique
- 2
- top_value
- C E Local Downtown & Brooklyn
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
road_ramp
categorical long_tail null_rate imbalance
- n
- 1,000
- nulls
- 999 (99.9%)
- unique
- 1
- top_value
- Ramp
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
bridge_highway_segment
categorical null_rate
- n
- 1,000
- nulls
- 997 (99.7%)
- unique
- 2
- top_value
- Platform
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
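The specialty columns above are candidates for exclusion on null rate alone. A minimal sketch, using the null counts reported in this section and an assumed cutoff of 99% (the cutoff is a judgment call, not something the profiler prescribes):

```python
N_ROWS = 1000

# Null counts copied from the stat lists above.
null_counts = {
    "taxi_pick_up_location": 994,
    "vehicle_type": 965,
    "facility_type": 990,
    "taxi_company_borough": 999,
    "bridge_highway_name": 997,
    "bridge_highway_direction": 997,
    "road_ramp": 999,
    "bridge_highway_segment": 997,
}

def sparse_columns(counts, n_rows, threshold=0.99):
    """Return column names whose null rate is at or above threshold, sorted."""
    return sorted(col for col, n in counts.items() if n / n_rows >= threshold)

to_drop = sparse_columns(null_counts, N_ROWS)
```

With this cutoff vehicle_type (96.5% null) is kept, so the threshold decides whether it survives for complaint subtypes that actually use it.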