data-trove-country-centroids · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/geographic/country_centroids.json

Saturn profiled 7,124 rows across 10 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

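# Install the profiler (IPython shell escape; run this cell inside a notebook).
!pip install saturn-dissect
import subprocess

# Run the Saturn CLI against the source file with the flags used for this notebook.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/geographic/country_centroids.json",
    "--findings", "data-trove-country-centroids.json",
    "--llm", "anthropic:default",
], check=True)  # raise CalledProcessError if the command exits non-zero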

Summary confidence: high

This dataset contains 7,124 geographic coordinate records, likely representing country or administrative label points; the source column identifies them as Natural Earth 1:10m Admin 0 Label Points. The most striking issue is that all seven descriptive columns (iso_a2, iso_a3, name, name_long, continent, region_un, and subregion) contain only empty strings, so the dataset is stripped of its descriptive metadata and only the raw coordinates and the source tag remain usable. Latitude ranges from -83.1° to 83.2° with a mean around 22.9°, indicating a northern hemisphere bias, while longitude spans essentially the full global range (-180° to 180°) with no flagged outliers. Before any analysis, the empty descriptive fields need to be investigated and repopulated, because the dataset in its current form cannot answer any country- or region-level question.

citing: row_count · column_count · columns[latitude].stats.min · columns[latitude].stats.max · columns[latitude].stats.mean · columns[latitude].stats.skew · columns[longitude].stats.min · columns[longitude].stats.max · columns[longitude].stats.outlier_rate · columns[source].stats.top_value · columns[name].stats.top_rate · columns[continent].stats.cardinality
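
The `citing:` line lists paths into the findings file. As a minimal sketch of how such a path could be followed, assume the findings JSON is a nested object in which `columns` is keyed by column name. That layout is an assumption for illustration, not Saturn's documented schema, and `resolve` is a hypothetical helper, not part of Saturn.

```python
import re

# Hypothetical findings layout (assumed, not Saturn's documented schema):
# nested dicts, with "columns" keyed by column name. Values are taken from
# the stats tables in this notebook.
findings = {
    "row_count": 7124,
    "column_count": 10,
    "columns": {
        "latitude": {"stats": {"min": -83.05, "max": 83.24, "mean": 22.92}},
    },
}

# A path token is either a bare name or a bracketed key such as [latitude].
_TOKEN = re.compile(r"([^.\[\]]+)|\[([^\]]+)\]")


def resolve(path, data):
    """Follow a path such as 'columns[latitude].stats.min' through nested dicts."""
    node = data
    for name, bracket in _TOKEN.findall(path):
        node = node[name or bracket]
    return node


print(resolve("row_count", findings))
print(resolve("columns[latitude].stats.min", findings))
```

A lookup that raises `KeyError` would indicate a cited stat that is absent from the findings, which is one way to check that the prose only cites values that exist.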

Out[4]:

saturn.schema() · 10 columns

column	kind	n	null%	unique	alerts
iso_a2	categorical	7,124	0.0%	1	imbalance
iso_a3	categorical	7,124	0.0%	1	imbalance
name	categorical	7,124	0.0%	1	imbalance
name_long	categorical	7,124	0.0%	1	imbalance
continent	categorical	7,124	0.0%	1	imbalance
region_un	categorical	7,124	0.0%	1	imbalance
subregion	categorical	7,124	0.0%	1	imbalance
longitude	numeric	7,124	0.0%	7,124
latitude	numeric	7,124	0.0%	7,124
source	categorical	7,124	0.0%	1	imbalance
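
The schema shows 0.0% nulls next to a single unique value for seven columns, which is the signature of empty strings standing in for missing data. The sketch below shows one way to detect that pattern before trusting a null rate. It assumes the records have been loaded as a list of dicts; `blank_string_columns` is a hypothetical helper and the sample coordinates are made up.

```python
def blank_string_columns(rows):
    """Return the columns whose every value is an empty or whitespace-only string."""
    if not rows:
        return []
    blank = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        if all(isinstance(v, str) and v.strip() == "" for v in values):
            blank.append(column)
    return blank


# Illustrative rows shaped like this dataset (coordinates are invented).
SOURCE = "Natural Earth 1:10m Admin 0 Label Points"
sample = [
    {"iso_a2": "", "name": "", "longitude": 10.5, "latitude": 51.1, "source": SOURCE},
    {"iso_a2": "", "name": "", "longitude": -3.2, "latitude": 40.4, "source": SOURCE},
]

print(blank_string_columns(sample))
```

Columns returned by this check can then be treated as missing, either by converting the blanks to nulls or by re-sourcing the values upstream.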

Fig 1.

latitude · Look for the northern hemisphere skew: bins starting at 0.09° or higher hold 5,472 of the 7,124 points (about 77%), with a thin tail toward the south pole.

Histogram bins for latitude (median: 25.195900167159152).
bin	count
-83.05 – -78.9	19
-78.9 – -74.74	26
-74.74 – -70.58	38
-70.58 – -66.43	42
-66.43 – -62.27	51
-62.27 – -58.11	10
-58.11 – -53.95	45
-53.95 – -49.8	77
-49.8 – -45.64	72
-45.64 – -41.48	62
-41.48 – -37.32	30
-37.32 – -33.17	19
-33.17 – -29.01	10
-29.01 – -24.85	23
-24.85 – -20.7	85
-20.7 – -16.54	140
-16.54 – -12.38	169
-12.38 – -8.224	238
-8.224 – -4.067	246
-4.067 – 0.09051	250
0.09051 – 4.248	293
4.248 – 8.405	321
8.405 – 12.56	381
12.56 – 16.72	302
16.72 – 20.88	179
20.88 – 25.03	400
25.03 – 29.19	489
29.19 – 33.35	214
33.35 – 37.51	388
37.51 – 41.66	263
41.66 – 45.82	163
45.82 – 49.98	142
49.98 – 54.13	225
54.13 – 58.29	275
58.29 – 62.45	502
62.45 – 66.61	441
66.61 – 70.76	222
70.76 – 74.92	88
74.92 – 79.08	105
79.08 – 83.24	79
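
The hemisphere split can be read straight off the bin table. The sketch below sums bin counts at or above a threshold; it assumes the table is available as tab-separated text with bin edges separated by " – ", as printed above, and `share_above` is a hypothetical helper. Applied to the full table with a threshold of 0, bins starting at 0.09051 or higher hold 5,472 of 7,124 rows, about 77%.

```python
def share_above(table_text, threshold):
    """Share of rows in histogram bins whose lower edge is at or above threshold."""
    total = 0
    above = 0
    for line in table_text.strip().splitlines():
        label, count = line.split("\t")
        lower = float(label.split(" – ")[0])  # lower edge of the bin
        n = int(count.replace(",", ""))
        total += n
        if lower >= threshold:
            above += n
    return above / total


# Four bins copied from the latitude table, on either side of the equator.
excerpt = (
    "-8.224 – -4.067\t246\n"
    "-4.067 – 0.09051\t250\n"
    "0.09051 – 4.248\t293\n"
    "4.248 – 8.405\t321"
)

print(round(share_above(excerpt, 0.0), 3))
```

Because the bin edge nearest the equator is 0.09051 and not exactly 0, the result is a close approximation of the true hemisphere share, not an exact count.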

Fig 2.

longitude · Longitude is broadly spread across the full -180 to 180 range with no outliers, reflecting global coverage.

Histogram bins for longitude (median: 23.477727818184434).
bin	count
-180 – -171	112
-171 – -162	105
-162 – -153	87
-153 – -144	109
-144 – -135	55
-135 – -126	113
-126 – -117	95
-117 – -108	91
-108 – -98.98	36
-98.98 – -89.98	82
-89.98 – -80.98	350
-80.98 – -71.98	554
-71.98 – -62.98	253
-62.98 – -53.98	202
-53.98 – -44.99	83
-44.99 – -35.99	49
-35.99 – -26.99	30
-26.99 – -17.99	66
-17.99 – -8.989	75
-8.989 – 0.01045	126
0.01045 – 9.01	130
9.01 – 18.01	299
18.01 – 27.01	632
27.01 – 36.01	131
36.01 – 45.01	88
45.01 – 54.01	90
54.01 – 63	306
63 – 72	48
72 – 81	226
81 – 90	28
90 – 99	168
99 – 108	256
108 – 117	190
117 – 126	604
126 – 135	535
135 – 144	170
144 – 153	185
153 – 162	115
162 – 171	149
171 – 180	101

Fig 3.

source · All 7,124 rows share a single source value, confirming this is a uniform single-origin dataset with no provenance variety.

Top values for source (1 unique shown, of 1 total).
value	count	share
Natural Earth 1:10m Admin 0 Label Points	7124	100.0%

Fig 4.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
iso_a2	categorical	0.0%
iso_a3	categorical	0.0%
name	categorical	0.0%
name_long	categorical	0.0%
continent	categorical	0.0%
region_un	categorical	0.0%
subregion	categorical	0.0%
longitude	numeric	0.0%
latitude	numeric	0.0%
source	categorical	0.0%

Fig 5.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	longitude	latitude
longitude	+1.00	-0.10
latitude	-0.10	+1.00
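
A coefficient of -0.10 means longitude and latitude share about 1% of their linear variance (r² = 0.01), so neither coordinate predicts the other. For reference, the sketch below computes a Pearson coefficient from scratch. It is a plain implementation of the textbook formula, not Saturn's sampled and bounded version, and the coordinate pairs are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented coordinate pairs, used only to exercise the function.
lons = [-120.0, -60.0, 0.0, 60.0, 120.0]
lats = [50.0, 10.0, 30.0, -20.0, 40.0]

print(round(pearson(lons, lats), 2))
```

Pearson correlation treats longitude as a linear quantity, so it ignores the wrap-around at ±180°; that is one more reason to read the value here as a rough diagnostic only.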

iso_a2 categorical other

This column is an ISO 3166-1 alpha-2 country code field, but it contains exactly one distinct value across all 7,124 rows: an empty string. Every single record has a blank code, making the column entirely uninformative. With entropy of 0.0 and top_rate of 1.0, this column carries zero signal and is effectively a dead field in this dataset.

Treatment: Drop entirely; the column is a constant empty string across all 7,124 rows and provides no analytical value.

anthropic:default · confidence high

Out[11]:

saturn.columns["iso_a2"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows
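
The stats above follow directly from a column with one repeated value. The sketch below recomputes them for a constant empty-string column. It assumes Shannon entropy in bits and an `entropy_ratio` defined as entropy divided by its maximum, log2(cardinality); both are assumptions about Saturn's definitions, although a constant column gives 0 under any log base. `categorical_stats` is a hypothetical helper.

```python
import math
from collections import Counter


def categorical_stats(values):
    """Top value, top rate, cardinality, and Shannon entropy (bits) of a column."""
    counts = Counter(values)
    n = len(values)
    cardinality = len(counts)
    top_value, top_count = counts.most_common(1)[0]
    # Sum of p * log2(1/p); a single value with p = 1 contributes exactly 0.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Assumed definition: entropy relative to the maximum for this cardinality.
    max_entropy = math.log2(cardinality) if cardinality > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    return {
        "top_value": top_value,
        "top_rate": top_count / n,
        "cardinality": cardinality,
        "entropy": entropy,
        "entropy_ratio": ratio,
    }


# A constant empty-string column like iso_a2.
print(categorical_stats([""] * 7124))
```

The same function applied to a populated column would show non-zero entropy, which makes it a quick regression check once the field has been re-sourced.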

Fig 6.

Top values for iso_a2.

Top values for iso_a2 (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

iso_a3 categorical feature

This column is intended to hold ISO 3166-1 alpha-3 country codes but contains only empty strings across all 7,124 rows (cardinality of 1, top_rate of 1.0, and zero nulls). Because the values are empty strings rather than NULL, the column is informationally void even though it reports no missing values. This points to a pipeline or extraction failure: the field exists but was never filled.

Treatment: Drop this column; it carries zero information (entropy = 0.0) and would need to be re-sourced from upstream before any use.

anthropic:default · confidence high

Out[14]:

saturn.columns["iso_a3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 7.

Top values for iso_a3.

Top values for iso_a3 (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

name categorical other

This column, labelled 'name', is a categorical field that is entirely empty strings across all 7,124 rows — a single unique value of '' with a top_rate of 1.0 and null_rate of 0.0. No actual name data is present; the column has been populated with blank strings rather than nulls, masking what would otherwise appear as 100% missing. With entropy of 0.0 and cardinality of 1, it carries zero information.

Treatment: Drop entirely — zero variance, zero information content; investigate upstream pipeline for why nulls were coerced to empty strings.

anthropic:default · confidence high

Out[17]:

saturn.columns["name"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 8.

Top values for name.

Top values for name (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

name_long categorical other

This column, ostensibly a long-form name field, contains exactly one unique value across all 7,124 rows: an empty string. With a null_rate of 0.0 and top_rate of 1.0, every single record holds an empty string rather than a true null, meaning the field was populated with blanks rather than left absent. The column carries zero informational content (entropy = 0.0) and is entirely useless for analysis in its current state.

Treatment: Drop column; it is a constant empty-string field with no variance or analytical value.

anthropic:default · confidence high

Out[20]:

saturn.columns["name_long"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 9.

Top values for name_long.

Top values for name_long (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

continent categorical label

This column is intended to represent a geographic continent label, but it contains exactly one distinct value, an empty string, across all 7,124 rows with no nulls. It therefore carries zero entropy (entropy = 0.0, top_rate = 1.0). This is a data quality failure: the field appears to have been filled with empty strings where it should hold either real values or proper nulls.

Treatment: Drop or remediate before modelling — the column is a constant empty string and provides no signal; investigate ETL pipeline for the source of blank-string imputation.

anthropic:default · confidence high

Out[23]:

saturn.columns["continent"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 10.

Top values for continent.

Top values for continent (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

region_un categorical label

This column is intended to store a UN macro-region label but contains only empty strings across all 7,124 rows — cardinality is 1, top_rate is 1.0, and entropy is 0.0. It carries zero informational value in its current state. This is almost certainly an unpopulated or failed data extraction field rather than a legitimately uniform dataset.

Treatment: Drop this column entirely; it is constant-empty and contributes no signal to any downstream task.

anthropic:default · confidence high

Out[26]:

saturn.columns["region_un"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 11.

Top values for region_un.

Top values for region_un (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

subregion categorical other

This column represents a geographic subregion field, but it contains exactly one distinct value across all 7,124 rows: an empty string. With a top_rate of 1.0 and entropy of 0.0, the column is entirely unpopulated — every record is blank, not null. This is a completely degenerate column with zero informational content.

Treatment: Drop entirely — zero variance, all values are empty strings with no predictive or descriptive value.

anthropic:default · confidence high

Out[29]:

saturn.columns["subregion"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 12.

Top values for subregion.

Top values for subregion (1 unique shown, of 1 total).
value	count	share
(empty string)	7124	100.0%

longitude numeric feature

This column contains geographic longitude coordinates, with values spanning the full valid range from approximately -179.97° to 179.99°, indicating near-global coverage. All 7,124 rows are unique and non-null, consistent with distinct point locations. The IQR of 191.8° is notably wide (Q1 at -72.3°, Q3 at 119.5°), confirming that records are spread across both the Western and Eastern hemispheres and are not concentrated in any single region. The mild skew (-0.27) and platykurtic shape (kurtosis -1.13) indicate a broad spread with no heavy tails, although the histogram is lumpy, not flat, with peaks around -81° to -72°, 18° to 27°, and 117° to 135°.

Treatment: Pair with latitude for spatial analysis; consider geohash or H3 encoding for ML features, or use directly in distance calculations.

anthropic:default · confidence high
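
For the distance calculations mentioned in the treatment, a great-circle formula is the usual starting point. The sketch below uses the haversine formula on a spherical Earth; the radius constant and the function name `haversine_km` are choices made for this illustration, and a spherical model can be off by a fraction of a percent compared with an ellipsoidal one.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Two points on the equator that straddle the antimeridian, 0.2 degrees apart.
print(round(haversine_km(0.0, 179.9, 0.0, -179.9), 2))
```

The antimeridian example matters for this column: values near -180° and 180° are neighbours on the globe, so a plain difference of longitudes would badly overstate their separation.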

Out[32]:

saturn.columns["longitude"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7,124
min	-180
max	180
mean	21.9
median	23.48
std	97.72
q1	-72.33
q3	119.5
iqr	191.8
skew	-0.267
kurtosis	-1.131
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 13.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: 23.477727818184434).
bin	count
-180 – -171	112
-171 – -162	105
-162 – -153	87
-153 – -144	109
-144 – -135	55
-135 – -126	113
-126 – -117	95
-117 – -108	91
-108 – -98.98	36
-98.98 – -89.98	82
-89.98 – -80.98	350
-80.98 – -71.98	554
-71.98 – -62.98	253
-62.98 – -53.98	202
-53.98 – -44.99	83
-44.99 – -35.99	49
-35.99 – -26.99	30
-26.99 – -17.99	66
-17.99 – -8.989	75
-8.989 – 0.01045	126
0.01045 – 9.01	130
9.01 – 18.01	299
18.01 – 27.01	632
27.01 – 36.01	131
36.01 – 45.01	88
45.01 – 54.01	90
54.01 – 63	306
63 – 72	48
72 – 81	226
81 – 90	28
90 – 99	168
99 – 108	256
108 – 117	190
117 – 126	604
126 – 135	535
135 – 144	170
144 – 153	185
153 – 162	115
162 – 171	149
171 – 180	101

latitude numeric feature

This column contains geographic latitude values, with every one of 7,124 rows being unique and no nulls, consistent with precise coordinate data. The range spans -83.05 to 83.24 degrees, covering nearly the full global latitude range, with a mean of 22.92 and median of 25.20 suggesting a modest concentration in the Northern Hemisphere tropics/subtropics. The IQR of 51.96 (Q1 ≈ 1.15, Q3 ≈ 53.11) confirms wide global spread, and the mild negative skew (-0.60) indicates a slight tail toward southern latitudes. Only 35 outliers (0.49%) were flagged. If they were flagged with the common 1.5×IQR rule (an assumption about Saturn's method), the fences sit near -76.8° and 131.1°; the upper fence exceeds any possible latitude, so every flagged point must lie at the far southern end of the range, not near both poles.

Treatment: Use as-is or pair with longitude for geospatial modelling; consider binning into latitude bands or projecting to spatial features.

anthropic:default · confidence high
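
The 35 flagged latitude outliers can be located from the quartiles alone. The sketch below computes Tukey fences from the Q1 and Q3 values in the stats table. The 1.5×IQR multiplier is an assumption about how Saturn flags outliers, and `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the given quartiles and multiplier k."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for latitude as reported in the stats table.
low, high = tukey_fences(1.149, 53.11)

print(round(low, 2), round(high, 2))
# Latitude cannot exceed 90, so only the lower fence can ever be crossed.
print(high > 90.0)
```

Under that assumption the lower fence falls inside the second histogram bin (-78.9 to -74.74), which is consistent with a flagged count of 35 drawn from the two southernmost bins (19 and 26 rows).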

Out[35]:

saturn.columns["latitude"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7,124
min	-83.05
max	83.24
mean	22.92
median	25.2
std	34.23
q1	1.149
q3	53.11
iqr	51.96
skew	-0.6007
kurtosis	0.1113
n_outliers	35
outlier_rate	0.004913
zero_rate	0

Fig 14.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 25.195900167159152).
bin	count
-83.05 – -78.9	19
-78.9 – -74.74	26
-74.74 – -70.58	38
-70.58 – -66.43	42
-66.43 – -62.27	51
-62.27 – -58.11	10
-58.11 – -53.95	45
-53.95 – -49.8	77
-49.8 – -45.64	72
-45.64 – -41.48	62
-41.48 – -37.32	30
-37.32 – -33.17	19
-33.17 – -29.01	10
-29.01 – -24.85	23
-24.85 – -20.7	85
-20.7 – -16.54	140
-16.54 – -12.38	169
-12.38 – -8.224	238
-8.224 – -4.067	246
-4.067 – 0.09051	250
0.09051 – 4.248	293
4.248 – 8.405	321
8.405 – 12.56	381
12.56 – 16.72	302
16.72 – 20.88	179
20.88 – 25.03	400
25.03 – 29.19	489
29.19 – 33.35	214
33.35 – 37.51	388
37.51 – 41.66	263
41.66 – 45.82	163
45.82 – 49.98	142
49.98 – 54.13	225
54.13 – 58.29	275
58.29 – 62.45	502
62.45 – 66.61	441
66.61 – 70.76	222
70.76 – 74.92	88
74.92 – 79.08	105
79.08 – 83.24	79

source categorical metadata

This column records the data source attribution for every row, and all 7,124 records carry the identical value 'Natural Earth 1:10m Admin 0 Label Points'. With cardinality of 1, entropy of 0.0, and a top_rate of 1.0, the column carries zero information variance — it is purely a provenance/metadata tag indicating the dataset was sourced entirely from a single Natural Earth layer.

Treatment: Drop before modelling; constant column adds no predictive signal and wastes memory.

anthropic:default · confidence high

Out[38]:

saturn.columns["source"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	Natural Earth 1:10m Admin 0 Label Points
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for source.

Top values for source (1 unique shown, of 1 total).
value	count	share
Natural Earth 1:10m Admin 0 Label Points	7124	100.0%
