data trove large meteorites 10kg
Reading
This dataset is a NASA meteorite landings catalogue covering 45,716 unique meteorite records with attributes including mass, classification, discovery year, and geographic coordinates. The most striking feature is the mass distribution: the median mass is just 32.6 g but the maximum reaches 60,000,000 g, producing extreme skew (skew = 76.9) and over 7,000 statistical outliers, with a handful of enormous meteorites pulling the mean up to 13,278 g. A second key finding is that 97.6% of records are classified as 'Found' rather than 'Fell': nearly all entries were discovered on the ground rather than witnessed falling, which implies strong geographic and temporal collection bias in the data. The meteorite classification column (recclass) spans 466 types, dominated by ordinary chondrites (L6, H5, L5). Year of discovery is concentrated in the late 1990s and 2000s, with further peaks in 1979 and 1988, likely tied to Antarctic collection campaigns.
citing: mass (g).stats.median · mass (g).stats.max · mass (g).stats.skew · mass (g).stats.mean · mass (g).stats.n_outliers · fall.top_values · fall.stats.top_rate · recclass.n_unique · recclass.top_values · year.top_values · row_count
Charts the summary said to look at first
mass (g): histogram
| bin | count |
|---|---|
| 0 – 1.5e+06 | 45544 |
| 1.5e+06 – 3e+06 | 16 |
| 3e+06 – 4.5e+06 | 8 |
| 4.5e+06 – 6e+06 | 1 |
| 6e+06 – 7.5e+06 | 1 |
| 7.5e+06 – 9e+06 | 1 |
| 9e+06 – 1.05e+07 | 2 |
| 1.05e+07 – 1.2e+07 | 0 |
| 1.2e+07 – 1.35e+07 | 0 |
| 1.35e+07 – 1.5e+07 | 0 |
| 1.5e+07 – 1.65e+07 | 2 |
| 1.65e+07 – 1.8e+07 | 0 |
| 1.8e+07 – 1.95e+07 | 0 |
| 1.95e+07 – 2.1e+07 | 0 |
| 2.1e+07 – 2.25e+07 | 1 |
| 2.25e+07 – 2.4e+07 | 1 |
| 2.4e+07 – 2.55e+07 | 2 |
| 2.55e+07 – 2.7e+07 | 1 |
| 2.7e+07 – 2.85e+07 | 1 |
| 2.85e+07 – 3e+07 | 0 |
| 3e+07 – 3.15e+07 | 1 |
| 3.15e+07 – 3.3e+07 | 0 |
| 3.3e+07 – 3.45e+07 | 0 |
| 3.45e+07 – 3.6e+07 | 0 |
| 3.6e+07 – 3.75e+07 | 0 |
| 3.75e+07 – 3.9e+07 | 0 |
| 3.9e+07 – 4.05e+07 | 0 |
| 4.05e+07 – 4.2e+07 | 0 |
| 4.2e+07 – 4.35e+07 | 0 |
| 4.35e+07 – 4.5e+07 | 0 |
| 4.5e+07 – 4.65e+07 | 0 |
| 4.65e+07 – 4.8e+07 | 0 |
| 4.8e+07 – 4.95e+07 | 0 |
| 4.95e+07 – 5.1e+07 | 1 |
| 5.1e+07 – 5.25e+07 | 0 |
| 5.25e+07 – 5.4e+07 | 0 |
| 5.4e+07 – 5.55e+07 | 0 |
| 5.55e+07 – 5.7e+07 | 0 |
| 5.7e+07 – 5.85e+07 | 1 |
| 5.85e+07 – 6e+07 | 1 |
fall: value counts
| value | count | share |
|---|---|---|
| Found | 44609 | 97.6% |
| Fell | 1107 | 2.4% |
recclass: top 20 values
| value | count | share |
|---|---|---|
| L6 | 8285 | 18.1% |
| H5 | 7142 | 15.6% |
| L5 | 4796 | 10.5% |
| H6 | 4528 | 9.9% |
| H4 | 4211 | 9.2% |
| LL5 | 2766 | 6.1% |
| LL6 | 2043 | 4.5% |
| L4 | 1253 | 2.7% |
| H4/5 | 428 | 0.9% |
| CM2 | 416 | 0.9% |
| H3 | 386 | 0.8% |
| L3 | 365 | 0.8% |
| CO3 | 335 | 0.7% |
| Ureilite | 300 | 0.7% |
| Iron, IIIAB | 285 | 0.6% |
| LL4 | 268 | 0.6% |
| CV3 | 256 | 0.6% |
| Diogenite | 241 | 0.5% |
| Howardite | 240 | 0.5% |
| LL | 225 | 0.5% |
year: top 20 values
| value | count | share |
|---|---|---|
| 2003-01-01T00:00:00 | 3323 | 7.3% |
| 1979-01-01T00:00:00 | 3046 | 6.7% |
| 1998-01-01T00:00:00 | 2697 | 5.9% |
| 2006-01-01T00:00:00 | 2456 | 5.4% |
| 1988-01-01T00:00:00 | 2296 | 5.0% |
| 2002-01-01T00:00:00 | 2078 | 4.5% |
| 2004-01-01T00:00:00 | 1940 | 4.2% |
| 2000-01-01T00:00:00 | 1792 | 3.9% |
| 1997-01-01T00:00:00 | 1696 | 3.7% |
| 1999-01-01T00:00:00 | 1691 | 3.7% |
| 2001-01-01T00:00:00 | 1650 | 3.6% |
| 1990-01-01T00:00:00 | 1518 | 3.3% |
| 2009-01-01T00:00:00 | 1497 | 3.3% |
| 1986-01-01T00:00:00 | 1375 | 3.0% |
| 2007-01-01T00:00:00 | 1189 | 2.6% |
| 2010-01-01T00:00:00 | 1005 | 2.2% |
| 1993-01-01T00:00:00 | 979 | 2.1% |
| 2008-01-01T00:00:00 | 957 | 2.1% |
| 1987-01-01T00:00:00 | 916 | 2.0% |
| 1991-01-01T00:00:00 | 877 | 1.9% |
reclat: histogram
| bin | count |
|---|---|
| -87.37 – -83.15 | 7090 |
| -83.15 – -78.94 | 1218 |
| -78.94 – -74.73 | 4083 |
| -74.73 – -70.51 | 9707 |
| -70.51 – -66.3 | 1 |
| -66.3 – -62.09 | 0 |
| -62.09 – -57.87 | 0 |
| -57.87 – -53.66 | 1 |
| -53.66 – -49.45 | 0 |
| -49.45 – -45.23 | 3 |
| -45.23 – -41.02 | 11 |
| -41.02 – -36.81 | 27 |
| -36.81 – -32.59 | 91 |
| -32.59 – -28.38 | 550 |
| -28.38 – -24.17 | 436 |
| -24.17 – -19.95 | 93 |
| -19.95 – -15.74 | 35 |
| -15.74 – -11.53 | 18 |
| -11.53 – -7.313 | 19 |
| -7.313 – -3.1 | 24 |
| -3.1 – 1.113 | 6448 |
| 1.113 – 5.327 | 15 |
| 5.327 – 9.54 | 19 |
| 9.54 – 13.75 | 55 |
| 13.75 – 17.97 | 40 |
| 17.97 – 22.18 | 3197 |
| 22.18 – 26.39 | 315 |
| 26.39 – 30.61 | 2239 |
| 30.61 – 34.82 | 859 |
| 34.82 – 39.03 | 649 |
| 39.03 – 43.25 | 403 |
| 43.25 – 47.46 | 230 |
| 47.46 – 51.67 | 196 |
| 51.67 – 55.89 | 155 |
| 55.89 – 60.1 | 119 |
| 60.1 – 64.31 | 30 |
| 64.31 – 68.53 | 17 |
| 68.53 – 72.74 | 4 |
| 72.74 – 76.95 | 3 |
| 76.95 – 81.17 | 1 |
Schema
20 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| sid | text | 0.0% | 45,716 | near_unique, one_word, short_text |
| id | text | 0.0% | 45,716 | near_unique, one_word, allcaps |
| position | numeric | 0.0% | 1 | constant |
| created_at | numeric | 0.0% | 1 | constant |
| created_meta | unknown | 0.0% | n/a | skipped |
| updated_at | numeric | 0.0% | 1 | constant |
| updated_meta | unknown | 0.0% | n/a | skipped |
| meta | categorical | 0.0% | 1 | imbalance |
| name | text | 0.0% | 45,716 | near_unique |
| id_1 | numeric | 0.0% | 45,716 | |
| nametype | categorical | 0.0% | 2 | imbalance |
| recclass | categorical | 0.0% | 466 | |
| mass (g) | numeric | 0.3% | 12,576 | high_skew, outliers |
| fall | categorical | 0.0% | 2 | imbalance |
| year | categorical | 0.6% | 266 | |
| reclat | numeric | 16.0% | 12,738 | |
| reclong | numeric | 16.0% | 14,640 | |
| GeoLocation | text | 16.0% | 17,100 | duplicates |
| States | numeric | 96.4% | 45 | null_rate |
| Counties | numeric | 96.4% | 662 | null_rate |
sid
text · identifier · near_unique · one_word · short_text
This column is a Socrata-style row identifier, recognizable from the 'row-XXXX.XXXX-XXXX' format visible in all sampled values. Every value is exactly 18 characters long (len_min = len_max = len_mean = 18.0), and all 45,716 rows are unique with a duplicate_rate of 0.0 and a null_rate of 0.0, making it a perfect surrogate key. No surprises in the data; it is entirely consistent with an auto-generated system identifier from a Socrata open-data platform.
Treatment: Drop before modelling; retain only if row-level traceability back to the source platform is needed.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- len_min: 18
- len_max: 18
- len_mean: 18
- len_median: 18
- len_p95: 18
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: -5.68
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
id
text · identifier · near_unique · one_word · allcaps
This column contains UUIDs (universally unique identifiers) serving as a primary key, with all 45,716 values being exactly 36 characters long and fully unique: zero duplicates, zero nulls. Notably, all sampled top values share the prefix '00000000-0000-0000-', which means the version field (carried in the third group) is zeroed out. That is atypical of standard UUID v4 generation and may indicate a custom or synthetic ID scheme. The allcaps_rate of 1.0 indicates that the hex digits are stored in upper case, which matters if downstream systems compare identifiers case-sensitively.
Treatment: Use as a primary key for joins; do not encode or transform it for modelling (drop it or pass it through as-is).
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- len_min: 36
- len_max: 36
- len_mean: 36
- len_median: 36
- len_p95: 36
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 65.38
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
position
numeric · other · constant
This column, named 'position', is a numeric field that is entirely constant: every one of its 45,716 non-null values is exactly 0.0 (zero_rate = 1.0, n_unique = 1). It carries zero information and would contribute nothing to any model or analysis. The constant alert confirms it is safe to drop.
Treatment: Drop immediately; a constant value across all 45,716 rows provides no signal.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- min: 0
- max: 0
- mean: 0
- median: 0
- std: 0
- q1: 0
- q3: 0
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 1
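The constant alert on position (and likewise on created_at, updated_at, and meta) can be reproduced with a distinct-value count. The following is a minimal sketch using only the standard library; the helper name `constant_columns` and the toy sample are illustrative assumptions, not part of the profiler.

```python
from typing import Dict, List, Sequence


def constant_columns(columns: Dict[str, Sequence]) -> List[str]:
    """Return the names of columns whose non-null values are all identical."""
    constant = []
    for name, values in columns.items():
        distinct = {v for v in values if v is not None}
        if len(distinct) <= 1:
            constant.append(name)
    return constant


# Toy sample shaped like the profile; the rows are illustrative, not real data.
sample = {
    "position": [0, 0, 0],
    "created_at": [1446143734, 1446143734, 1446143734],
    "meta": ["{ }", "{ }", "{ }"],
    "fall": ["Found", "Fell", "Found"],
}

to_drop = constant_columns(sample)
kept = {name: values for name, values in sample.items() if name not in to_drop}
print(to_drop)
```

On a real load, the same check would run once over every column before any feature engineering.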
created_at
numeric · timestamp · constant
This column is a Unix timestamp named 'created_at', but every one of its 45,716 non-null rows holds the identical value 1446143734 (2015-10-29 18:35:34 UTC), triggering a 'constant' alert. With n_unique of 1, std of 0.0, and IQR of 0.0, the column has no variance at all, which strongly suggests a bulk-load default, a data pipeline bug, or a one-time snapshot import in which per-row timestamps were not captured. Whatever the cause, the column is unusable as a temporal signal.
Treatment: Drop, or flag as uninformative; check the ETL pipeline for the source of the constant value before relying on it as a temporal feature.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- min: 1.446e+09
- max: 1.446e+09
- mean: 1.446e+09
- median: 1.446e+09
- std: 0
- q1: 1.446e+09
- q3: 1.446e+09
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
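To check the date quoted for created_at, the epoch value can be decoded directly. This short sketch assumes the value is in seconds since 1970-01-01 UTC, which is what the profile's interpretation implies.

```python
from datetime import datetime, timezone

# The single value the profile reports for every row of created_at.
created_at = 1446143734

# Interpret it as seconds since the Unix epoch, in UTC.
as_utc = datetime.fromtimestamp(created_at, tz=timezone.utc)
print(as_utc.isoformat())  # 2015-10-29T18:35:34+00:00
```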
created_meta
unknown · metadata · skipped
The column 'created_meta' is likely a creation timestamp or metadata field associated with record provenance, but saturn classified it as 'unknown' kind and skipped all profiling, yielding zero stats and no uniqueness count. With 45,716 rows, zero nulls, and no further signal available, its actual dtype, distribution, and content cannot be assessed from this evidence alone.
Treatment: Inspect raw values to determine dtype (timestamp, JSON blob, string) before deciding on parsing, dropping, or feature extraction.
- n: 45,716
- nulls: 0 (0.0%)
- unique: n/a
updated_at
numeric · timestamp · constant
This column is a Unix epoch timestamp named `updated_at`, representing a last-modified datetime for each row. Every one of the 45,716 non-null records holds the identical value 1446143734 (approximately 2015-10-29 UTC), so the column is constant and has no variance. This strongly suggests a bulk data load or migration event where all rows were stamped with the same timestamp rather than tracking real update times.
Treatment: Drop from modelling; flag to the data owner as a likely ETL artefact, since all 45,716 rows share the single value 1446143734.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- min: 1.446e+09
- max: 1.446e+09
- mean: 1.446e+09
- median: 1.446e+09
- std: 0
- q1: 1.446e+09
- q3: 1.446e+09
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
updated_meta
unknown · other · skipped
The column 'updated_meta' was skipped by the profiler, yielding no stats, no uniqueness count, and no type resolution beyond 'unknown'. With 45,716 non-null rows and a null rate of 0.0, the column is fully populated, but its content, structure, and distribution are entirely opaque from the available evidence. The name suggests it may hold metadata update timestamps or serialized metadata objects (e.g., JSON blobs), but nothing in the evidence confirms this.
Treatment: Manually inspect raw values to determine type (timestamp, JSON, free text); re-profile after parsing or casting appropriately.
- n: 45,716
- nulls: 0 (0.0%)
- unique: n/a
meta
categorical · metadata · imbalance
This column is a metadata field that contains exclusively the empty object literal '{ }' across all 45,716 rows, with zero nulls and a cardinality of 1. It carries no information whatsoever: entropy is 0.0 and top_rate is 1.0, meaning every record is identical. This is almost certainly an unfilled placeholder or a defaulted JSON field that was never populated.
Treatment: Drop before modelling; zero-variance column with no predictive or descriptive value.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- top_value: { }
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
name
text · label · near_unique
This column contains meteorite names, which are derived from the place of recovery: the top words ('range', 'hills', 'mountains', 'northwest', 'africa', 'grove', 'yamato', 'Queen Alexandra') are all typical components of named landforms or place names. Every one of the 45,716 rows has a distinct value (duplicate_rate 0.0, n_unique = 45,716) with zero nulls, making it a perfect natural identifier. The mean word count of 2.77 and median of 3.0 confirm multi-token names rather than single labels. 'yamato' appears 3,317 times as the top individual word, which points to a large sub-corpus of specimens from the Yamato Mountains in Antarctica and explains the partial lexical repetition even though full names are unique.
Treatment: Use as a display label or natural key; tokenize on whitespace for gazetteer/NLP tasks, but drop before building any ML feature matrix.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- len_min: 2
- len_max: 28
- len_mean: 17.78
- len_median: 19
- len_p95: 27
- word_mean: 2.772
- word_median: 3
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 17,917
- readability_flesch_mean: 63.74
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.04749
- allcaps_rate: 0
- boilerplate_rate: 0
id_1
numeric · identifier
This column is almost certainly a row or entity identifier: it has 45,716 unique values across 45,716 rows with zero nulls and zero duplicates, indicating a perfect 1-to-1 mapping. Values run from 1 to 57,458, so 11,742 integers in that range are unused, suggesting either a sparse sequential ID or a pre-filtered subset of a larger table. The near-uniform distribution (kurtosis −1.16, skew 0.27, zero outliers) is consistent with a sequential or pseudo-random integer key rather than a meaningful numeric feature.
Treatment: Retain as a join/lookup key; exclude from any modelling feature set.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- min: 1
- max: 57,458
- mean: 2.689e+04
- median: 2.426e+04
- std: 1.686e+04
- q1: 1.269e+04
- q3: 4.066e+04
- iqr: 27,968
- skew: 0.2665
- kurtosis: -1.16
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
nametype
categorical · label · imbalance
This column is a meteorite name-type flag, distinguishing ordinary entries ('Valid') from 'Relict' ones, a label generally used for objects of meteoritic origin that have since been heavily altered on Earth. The distribution is extremely imbalanced: 45,641 of 45,716 records (99.84%) are 'Valid', with only 75 'Relict' entries. The near-zero entropy (0.018) confirms that the column carries almost no information, which triggered the imbalance alert.
Treatment: Exclude from predictive features due to near-zero variance; retain only as a filter if the analysis should exclude relict entries.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 2
- top_value: Valid
- top_rate: 0.9984
- cardinality: 2
- entropy: 0.01754
- entropy_ratio: 0.01754
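The entropy figures for nametype can be checked by hand from the two class counts. The sketch below assumes the profiler reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the cardinality; that definition is an assumption, although it reproduces the figures reported here and for fall.

```python
import math
from typing import Sequence


def shannon_entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a categorical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: Sequence[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = len(counts)
    return shannon_entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


nametype_counts = [45641, 75]   # Valid, Relict (from the profile)
fall_counts = [44609, 1107]     # Found, Fell (from the profile)

print(round(shannon_entropy_bits(nametype_counts), 5))
print(round(entropy_ratio(fall_counts), 4))
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for nametype and fall.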
recclass
categorical · label
This column contains meteorite classification codes, identifying the mineralogical and petrologic type of each recovered specimen. The top 7 values (L6, H5, L5, H6, H4, LL5, LL6) are all ordinary chondrite classes and together account for about 74% of records (33,771 of 45,716), with L6 alone representing 18.1%. Despite 466 unique classes, the entropy ratio of 0.51 indicates moderate concentration. The long tail of rare classes (even CM2, the tenth most common, has only 416 occurrences) will create sparse dummy variables if one-hot encoded naively.
Treatment: Group rare classes (below a frequency threshold) into an 'Other' bucket before encoding, or use target/frequency encoding to handle the 466-cardinality tail.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 466
- top_value: L6
- top_rate: 0.1812
- cardinality: 466
- entropy: 4.548
- entropy_ratio: 0.5131
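The recclass treatment (bucketing rare classes before encoding) can be sketched as follows. The helper `group_rare`, the threshold, and the toy sample are all hypothetical; a sensible threshold for the real column would have to be chosen from its actual frequency table.

```python
from collections import Counter
from typing import Iterable, List


def group_rare(values: Iterable[str], min_count: int, other: str = "Other") -> List[str]:
    """Replace every label seen fewer than min_count times with a shared bucket."""
    values = list(values)
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Toy sample: class names are real recclass codes, frequencies are made up.
sample = ["L6"] * 5 + ["H5"] * 4 + ["CM2"] * 2 + ["Howardite"]
grouped = group_rare(sample, min_count=3)
print(Counter(grouped))
```

Frequency or target encoding, also mentioned in the treatment, avoids the need for a threshold but is not shown here.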
mass (g)
numeric · feature · high_skew · outliers
This column records the mass of each meteorite specimen in grams. The median is just 32.6 g while the mean is 13,278 g and the maximum reaches 60,000,000 g (60 tonnes): a tiny fraction of massive specimens drags the mean far above the median, and a skew of 76.9 and kurtosis of 6,796 confirm an extreme long right tail. The profiler flags 7,086 values (15.5% of non-null rows) as outliers. Most of these are merely large relative to the 195.4 g IQR; it is a much smaller group of giants that dominates the aggregate statistics.
Treatment: Log-transform (log1p) before modelling to compress the extreme right tail; consider capping or flagging the ~7,086 outliers separately.
- n: 45,716
- nulls: 131 (0.3%)
- unique: 12,576
- min: 0
- max: 6e+07
- mean: 1.328e+04
- median: 32.6
- std: 5.75e+05
- q1: 7.2
- q3: 202.6
- iqr: 195.4
- skew: 76.91
- kurtosis: 6796
- n_outliers: 7,086
- outlier_rate: 0.1554
- zero_rate: 0.0004168
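A worked version of the mass (g) treatment, using the quartiles reported above. The profile does not state which outlier rule the profiler applies; Tukey's 1.5 × IQR fence is assumed here purely for illustration, and `log_mass` is a hypothetical helper name.

```python
import math
from typing import Optional

# Quartiles reported in the profile for mass (g).
Q1, Q3 = 7.2, 202.6
iqr = Q3 - Q1

# Assumed rule: Tukey's upper fence, Q3 + 1.5 * IQR (not confirmed by the source).
upper_fence = Q3 + 1.5 * iqr


def log_mass(mass_g: Optional[float]) -> Optional[float]:
    """log1p keeps zero-mass rows defined (log1p(0) == 0) and passes nulls through."""
    return None if mass_g is None else math.log1p(mass_g)


print(round(upper_fence, 1))        # fence in grams under the assumed rule
print(round(log_mass(32.6), 3))     # median specimen
print(round(log_mass(6e7), 3))      # largest specimen
```

Under that assumed rule any specimen heavier than roughly half a kilogram counts as an outlier, which is why the flag covers thousands of rows rather than only the multi-tonne specimens.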
fall
categorical · label · imbalance
This column captures whether a meteorite was discovered on the ground ('Found') or observed falling ('Fell'), making it a binary label for recovery type. The distribution is severely imbalanced: 'Found' accounts for 97.6% of 45,716 records (44,609), while 'Fell' represents only 2.4% (1,107). The entropy ratio of 0.164 reflects this low uncertainty and is flagged explicitly as an imbalance alert. A model that uses this column as a target will usually need some form of class balancing, otherwise predicting 'Found' for every row already scores 97.6% accuracy.
Treatment: Apply class balancing (e.g., SMOTE or class weights) before using as a classification target.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 2
- top_value: Found
- top_rate: 0.9758
- cardinality: 2
- entropy: 0.1645
- entropy_ratio: 0.1645
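One of the options named in the fall treatment, class weights, can be derived directly from the reported counts. The sketch below uses the common inverse-frequency heuristic, n_samples / (n_classes × class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`; choosing this heuristic is an assumption, not something the profile prescribes.

```python
# Class counts for fall, taken from the profile.
counts = {"Found": 44609, "Fell": 1107}

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: rarer classes receive proportionally larger weights.
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(f"{label}: {w:.4f}")
```

The resulting weights make each class contribute equally to a weighted loss: 44,609 × 0.5124 and 1,107 × 20.65 both come to about 22,858.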
year
categorical · timestamp
This column represents a calendar year, stored as full ISO-8601 timestamps normalised to January 1st of each year (e.g. '2003-01-01T00:00:00'), so the month, day, and time components carry no information. Despite being profiled as categorical, it is effectively an annual time dimension. The top 20 values span 1979–2010, yet the column has 266 distinct values, far more than that 32-year span can hold. This indicates either a much wider date range (older records outside the top values) or some malformed entries worth inspecting. The top year (2003) accounts for just 7.3% of rows, indicating a reasonably spread distribution rather than heavy concentration.
Treatment: Parse to a date or integer year, inspect the full set of 266 distinct values for anomalies, then use as a time feature or group-by dimension.
- n: 45,716
- nulls: 291 (0.6%)
- unique: 266
- top_value: 2003-01-01T00:00:00
- top_rate: 0.07315
- cardinality: 266
- entropy: 5.299
- entropy_ratio: 0.6578
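For the year treatment, a minimal parsing sketch follows. It takes the first four characters rather than building a full datetime, on the assumption that every non-null value follows the 'YYYY-01-01T00:00:00' pattern seen in the top values; `year_from_iso` is a hypothetical helper. String slicing also sidesteps range limits in some datetime types (pandas nanosecond timestamps, for example, only cover roughly 1677 to 2262), which would matter if the 266 distinct values include much older years.

```python
from typing import Optional


def year_from_iso(value: Optional[str]) -> Optional[int]:
    """Extract the integer year from 'YYYY-MM-DDTHH:MM:SS'; None if absent or malformed."""
    if not value:
        return None
    head = value[:4]
    return int(head) if len(head) == 4 and head.isdigit() else None


# Illustrative inputs: one value from the profile, one null, one malformed string.
samples = ["2003-01-01T00:00:00", None, "n/a"]
years = [year_from_iso(s) for s in samples]
print(years)  # [2003, None, None]
```

Listing the distinct parsed years in sorted order is then a quick way to spot implausible entries.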
reclat
numeric · feature
This column represents the recorded latitude of meteorite find/fall locations, with values ranging from -87.37° to +81.17°, consistent with geographic latitude bounds. The median of -71.5° shows that the majority of records are concentrated in high southern latitudes (likely Antarctic recovery sites), yet Q3 is exactly 0.0°, pointing to a large cluster at the equator or a placeholder zero. The zero_rate of 16.8% (roughly 6,440 of the 38,401 non-null rows) is separate from the 16.0% null rate, so these zeros are additional suspect coordinates rather than a restatement of the nulls. The GeoLocation column shows 6,214 rows with both coordinates equal to 0.0, which is far better explained by a missing-value placeholder than by genuine finds at the equator. Kurtosis of -1.48 is consistent with a flat, bimodal-like distribution rather than a normal one.
Treatment: Treat rows where both reclat and reclong are zero as missing (mask alongside existing nulls before modelling); otherwise use as-is for geospatial analysis or pair with longitude for coordinate-based features.
- n: 45,716
- nulls: 7,315 (16.0%)
- unique: 12,738
- min: -87.37
- max: 81.17
- mean: -39.12
- median: -71.5
- std: 46.38
- q1: -76.71
- q3: 0
- iqr: 76.71
- skew: 0.4916
- kurtosis: -1.477
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.1677
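The masking step in the reclat treatment might look like the sketch below. It treats a coordinate pair as missing only when both values are exactly zero, on the assumption that (0.0, 0.0) is a placeholder; a lone zero latitude is kept, since it could be a genuine equatorial location. The function name `mask_placeholder` is hypothetical.

```python
from typing import Optional, Tuple

Coord = Tuple[Optional[float], Optional[float]]


def mask_placeholder(lat: Optional[float], lon: Optional[float]) -> Coord:
    """Return (None, None) for null pairs and for the assumed (0.0, 0.0) placeholder."""
    if lat is None or lon is None:
        return (None, None)
    if lat == 0.0 and lon == 0.0:
        return (None, None)
    return (lat, lon)


# Illustrative pairs, not real rows: placeholder, null, equatorial, Antarctic.
pairs = [(0.0, 0.0), (None, 12.5), (0.0, 35.67), (-71.5, 35.67)]
cleaned = [mask_placeholder(lat, lon) for lat, lon in pairs]
print(cleaned)
```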
reclong
numeric · feature
This column represents the recorded longitude of meteorite landing or find locations, covering a range from -165.43° to 354.47°. The maximum of 354.47° is surprising: longitude in the standard -180° to 180° convention cannot exceed 180°, so at least some records appear to use a 0–360° convention, which will cause mapping errors if not normalised. The zero_rate of 16.2% (about 6,214 of the 38,401 non-null rows) is separate from the 16.0% null rate and matches the 6,214 GeoLocation rows whose coordinates are both 0.0, strongly implying that these zeros are placeholder or missing entries rather than genuine locations on the prime meridian. The distribution is near-symmetric (skew -0.17, kurtosis -0.73) with a large IQR of 157.2, consistent with a globally spread geographic variable.
Treatment: Normalise values above 180 into the -180 to 180 range (subtract 360), then treat zero-coordinate placeholders as missing and impute or exclude them before spatial analysis.
- n: 45,716
- nulls: 7,315 (16.0%)
- unique: 14,640
- min: -165.4
- max: 354.5
- mean: 61.07
- median: 35.67
- std: 80.65
- q1: 0
- q3: 157.2
- iqr: 157.2
- skew: -0.1745
- kurtosis: -0.7312
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.1618
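The normalisation in the reclong treatment can be written as a single wrap-around expression. This is a minimal sketch that assumes values above 180 really are 0–360° longitudes rather than data-entry errors; `normalise_longitude` is a hypothetical name.

```python
def normalise_longitude(lon_deg: float) -> float:
    """Wrap any longitude in degrees into the half-open interval [-180, 180)."""
    return ((lon_deg + 180.0) % 360.0) - 180.0


# The reported maximum and minimum of reclong, plus an in-range value.
for lon in (354.47, -165.43, 35.67):
    print(lon, "->", round(normalise_longitude(lon), 2))
```

For values between 180 and 360 this is equivalent to subtracting 360; values already in range are returned unchanged up to floating-point rounding. Note that an input of exactly 180 maps to -180, which denotes the same meridian.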
GeoLocation
text · feature · duplicates
GeoLocation stores geographic coordinates as serialized Python list strings of the form [None, '<coordinate>', '<coordinate>', None, False], which appears to be a structured geo-point object flattened to text. The most common value, "[None, '0.0', '0.0', None, False]", appears 6,214 times and is almost certainly a null/unknown sentinel rather than a genuine location at 0°, 0°. It therefore masks missingness beyond the 16% null rate: nulls and sentinels together cover 13,529 rows (29.6%). The duplicate rate is high at 55.5% (21,301 duplicates across 17,100 unique values), consistent with many records sharing the same coordinates. The column should be parsed into numeric fields rather than used as raw text.
Treatment: Parse the serialized list string and extract the two coordinate strings (indices 1 and 2) as separate numeric columns, confirming which is latitude and which is longitude by comparison with reclat and reclong; treat the '0.0', '0.0' pair as missing.
- n: 45,716
- nulls: 7,315 (16.0%)
- unique: 17,100
- len_min: 33
- len_max: 47
- len_mean: 40.3
- len_median: 41
- len_p95: 45
- word_mean: 5
- word_median: 5
- n_empty: 0
- n_duplicates: 21,301
- duplicate_rate: 0.5547
- vocab_size: 15,461
- readability_flesch_mean: 117.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
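A sketch of the GeoLocation parsing step, using `ast.literal_eval` to read the serialized list safely. The function name and the second sample string are illustrative assumptions; only the '0.0', '0.0' sentinel is taken from the profile. The function returns the two coordinates in stored order and leaves the latitude/longitude assignment to a later check against reclat and reclong.

```python
import ast
from typing import Optional, Tuple


def parse_geolocation(raw: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return the two coordinates at indices 1 and 2, or None if missing or unparseable."""
    if raw is None:
        return None
    try:
        parts = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
        first, second = float(parts[1]), float(parts[2])
    except (ValueError, SyntaxError, TypeError, IndexError):
        return None
    if first == 0.0 and second == 0.0:
        return None  # assumed sentinel for unknown location
    return (first, second)


print(parse_geolocation("[None, '0.0', '0.0', None, False]"))        # None
print(parse_geolocation("[None, '50.775', '6.08333', None, False]"))  # hypothetical row
```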
States
numeric · null_rate
- n: 45,716
- nulls: 44,057 (96.4%)
- unique: 45
- min: 1
- max: 51
- mean: 17.34
- median: 15
- std: 10.41
- q1: 9
- q3: 23
- iqr: 14
- skew: 1.115
- kurtosis: 0.6891
- n_outliers: 40
- outlier_rate: 0.02411
- zero_rate: 0
Counties
numeric · null_rate
- n: 45,716
- nulls: 44,057 (96.4%)
- unique: 662
- min: 5
- max: 3,210
- mean: 1353
- median: 1,195
- std: 994.1
- q1: 482
- q3: 2,113
- iqr: 1,631
- skew: 0.2374
- kurtosis: -1.19
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0