
data trove witch trials

source: /home/coolhand/html/datavis/data_trove/data/quirky/witch_trials.json · 10,940 rows · 6 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset records historical witch trials across Europe: 10,940 records with location, time period, and outcomes (people tried and deaths). Two things stand out immediately. First, both 'deaths' and 'tried' are extremely skewed: the median record shows zero deaths and a single person tried (75.5% of records have zero deaths), yet both columns reach a maximum of 500, suggesting that a small number of mass trials account for a disproportionate share of the totals. Second, activity clusters between 1597 and 1660 (the interquartile range of 'year', median 1630), consistent with a well-known peak persecution era, with a long tail back to 1300 worth examining. Geographically, the United Kingdom and Germany together account for roughly two-thirds of all records (about 65.5%), while Geneva is the most common city even though nearly half of city values are missing.

citing: deaths.stats.median · deaths.stats.mean · deaths.stats.max · deaths.stats.zero_rate · deaths.stats.n_outliers · tried.stats.median · tried.stats.max · tried.stats.outlier_rate · year.stats.q1 · year.stats.q3 · year.stats.median · year.stats.min · country.top_values · city.top_values · city.null_rate

Schema

6 columns
Per-column summary.

column   type         nulls  unique  alerts
year     numeric      8.5%   430     outliers
decade   numeric      0.0%   53      outliers
city     categorical  47.7%  906     null_rate
country  categorical  0.0%   19      (none)
tried    numeric      0.0%   111     high_skew, outliers
deaths   numeric      0.0%   74      high_skew, outliers

year

numeric timestamp outliers
This column records the year of each trial, spanning 1300 to 1850. The distribution is strongly left-skewed (skew = -1.59) with a heavy left tail: the middle 50% of records fall between 1597 and 1660 (IQR = 63 years), but the minimum reaches back to 1300, and 714 values (7.1% of non-null rows) are flagged as outliers, presumably mostly at the early extreme. The high kurtosis (4.32) is consistent with a sharp central peak around the median of 1630 and fat tails. The 8.5% null rate (931 rows) warrants attention before temporal analysis. Treatment: Treat as an ordinal/temporal feature; check records before roughly 1500 to decide whether they are data-quality issues or genuine early trials before binning into periods for modelling. high · anthropic:default
n             10,940
nulls         931 (8.5%)
unique        430
min           1300
max           1850
mean          1621
median        1630
std           66.25
q1            1597
q3            1660
iqr           63
skew          -1.586
kurtosis      4.319
n_outliers    714
outlier_rate  0.07134
zero_rate     0
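
The outlier count above can be sanity-checked from the reported quartiles. The sketch below assumes the profiler uses the conventional 1.5 × IQR (Tukey) rule; the report does not state its rule, so `tukey_fences` and the multiplier `k` are illustrative rather than the tool's actual implementation.

```python
# Quartiles and counts as reported in the year profile above.
Q1, Q3 = 1597, 1660
N_ROWS, N_NULLS, N_OUTLIERS = 10_940, 931, 714


def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them count as outliers.

    k=1.5 is the conventional multiplier and an assumption here.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(Q1, Q3)
print(lower, upper)  # 1502.5 1754.5

# The reported outlier_rate matches outliers divided by non-null rows.
outlier_rate = N_OUTLIERS / (N_ROWS - N_NULLS)
print(round(outlier_rate, 5))  # 0.07134
```

Under that assumption, years before 1503 or after 1754 would be flagged, so some of the 714 outliers may sit at the late end of the range (up to 1850) rather than the early one.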

decade

numeric feature outliers
This column records the decade of each trial, spanning 1300 to 1850, with only 53 distinct values across 10,940 rows and no nulls. The distribution is heavily left-skewed (skew = -1.48) with high kurtosis (3.85): the middle half of records falls between 1590 and 1650 (IQR = 60 years), while a long tail stretches back to 1300. Notably, 848 rows (about 7.75%) are flagged as outliers, likely early-period records far from the central mass near the median of 1620. Because 'decade' is fully populated while 'year' is 8.5% null, it is the more complete of the two temporal fields. Treatment: Treat as an ordinal/temporal feature; consider binning into broader periods or applying a robust scaler given the heavy left tail and outlier concentration. high · anthropic:default
n             10,940
nulls         0 (0.0%)
unique        53
min           1300
max           1850
mean          1615
median        1620
std           66.68
q1            1590
q3            1650
iqr           60
skew          -1.478
kurtosis      3.85
n_outliers    848
outlier_rate  0.07751
zero_rate     0
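
The binning suggested in the treatment note could look like the following minimal sketch. The period edges and labels are hypothetical, chosen only to bracket the reported interquartile range; they are not part of the profile.

```python
from bisect import bisect_right

# Hypothetical period edges (lower bounds of each later period), for illustration only.
EDGES = [1500, 1590, 1660]
LABELS = ["pre-1500", "1500-1589", "1590-1659", "1660+"]


def period(decade):
    """Map a decade value such as 1620 to a coarse period label."""
    return LABELS[bisect_right(EDGES, decade)]


for d in (1300, 1590, 1650, 1660, 1850):
    print(d, period(d))
```

Using `bisect_right` keeps each edge inclusive on the left, so 1590 lands in "1590-1659" and 1660 lands in "1660+".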

city

categorical feature null_rate
This column contains city names, presumably the location of each trial. The null rate of 47.65% is a significant concern: 5,213 of the 10,940 rows have no city value, which triggers the null_rate alert. With 906 unique cities and an entropy ratio of 0.856, the distribution is broad; the most frequent city, Geneva, has 320 occurrences, which is only 5.59% of non-null rows. The top cities (Geneva, Budingen, Bruges, Munich, Augsburg, Venice) are consistent with the European scope of the dataset. Treatment: Impute or flag nulls (47.65% missing); encode as a categorical feature, potentially grouping rare cities below a frequency threshold given 906 unique values. high · anthropic:default
n              10,940
nulls          5,213 (47.7%)
unique         906
top_value      Geneva
top_rate       0.05588
cardinality    906
entropy        8.406
entropy_ratio  0.8557
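
The entropy figures above are consistent with Shannon entropy in bits, normalized by log2 of the number of distinct values. The sketch below works under that assumption; the function names are illustrative and the sample list is toy data, not rows from the dataset.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (base 2) of the non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    distinct = len({v for v in values if v is not None})
    if distinct < 2:
        return 0.0  # zero or one category carries no uncertainty
    return entropy_bits(values) / math.log2(distinct)


# Toy data, not taken from the dataset.
sample = ["Geneva", "Geneva", "Bruges", None, "Munich", None]
print(round(entropy_ratio(sample), 4))

# Consistency check against the reported city figures.
print(round(8.406 / math.log2(906), 4))  # 0.8557
```

A ratio near 1 means occurrences are spread almost evenly across categories; 0.856 over 906 cities indicates a broad spread with a modest concentration in the top few.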

country

categorical feature
This column records the country associated with each trial, covering 19 distinct countries across 10,940 rows with no nulls. The distribution is heavily concentrated in Western Europe: the United Kingdom (3,750 rows, 34.3%) and Germany (3,417 rows, 31.2%) together account for about 65.5% of all records, roughly two-thirds. Switzerland (1,272) and France (807) are the next largest groups, while the remaining 15 countries collectively form a small tail; Spain, for example, appears only 29 times. The entropy ratio of 0.59 reflects this moderate-to-high imbalance, which could bias any model trained on country as a feature. Treatment: One-hot encode or target-encode; consider grouping low-frequency countries (e.g., Spain with 29 rows) into an 'Other' bucket to reduce sparsity. high · anthropic:default
n              10,940
nulls          0 (0.0%)
unique         19
top_value      United Kingdom
top_rate       0.3428
cardinality    19
entropy        2.502
entropy_ratio  0.5889
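
Grouping low-frequency countries into an 'Other' bucket, as the treatment note suggests, can be sketched as follows. Only the five counts quoted above are used; the other 14 countries are omitted because the report does not list them, and the threshold of 100 is a hypothetical choice.

```python
from collections import Counter


def group_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Counts reported above for five of the 19 countries; the rest are omitted.
reported = {
    "United Kingdom": 3750,
    "Germany": 3417,
    "Switzerland": 1272,
    "France": 807,
    "Spain": 29,
}
rows = [country for country, n in reported.items() for _ in range(n)]

grouped = Counter(group_rare(rows, min_count=100))  # 100 is a hypothetical threshold
print(grouped["Other"])  # 29

# Share of all 10,940 records held by the two largest countries.
top_two_share = (reported["United Kingdom"] + reported["Germany"]) / 10_940
print(round(top_two_share, 3))  # 0.655
```

The right threshold depends on the downstream model; a count-based cutoff is only one option, and a share-based cutoff would work the same way.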

tried

numeric feature high_skew outliers
This column records the number of people tried in each record, an integer count starting at 1. The distribution is extremely concentrated: Q1, median, and Q3 are all 1, yet the mean is about 3.95 and the max is 500, so a small fraction of records drives nearly all the variance. With an IQR of 0, an IQR-based outlier rule (which the figures are consistent with) flags every value other than 1, which would explain the 22.5% outlier rate (2,457 rows). With a kurtosis of 316, the tail is extraordinarily heavy, and the bulk of records involve exactly one person. Treatment: Log-transform (log1p) before modelling, or cap at a meaningful percentile threshold to reduce outlier influence; consider binning into ordinal buckets (1, 2–5, 6+). high · anthropic:default
n             10,940
nulls         0 (0.0%)
unique        111
min           1
max           500
mean          3.952
median        1
std           19.26
q1            1
q3            1
iqr           0
skew          15.61
kurtosis      316
n_outliers    2,457
outlier_rate  0.2246
zero_rate     0
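
The three treatments named above (log1p, capping, ordinal buckets) can be compared side by side. This is a minimal sketch: `tried_bucket` and `cap` are illustrative helpers, and the ceiling of 50 is a hypothetical stand-in for a percentile that would normally be computed from the data.

```python
import math


def tried_bucket(n):
    """Map a count of people tried to the ordinal buckets 1, 2-5, 6+."""
    if n < 1:
        raise ValueError("tried is expected to be at least 1")
    if n == 1:
        return "1"
    return "2-5" if n <= 5 else "6+"


def cap(n, ceiling):
    """Clip a count at a fixed ceiling."""
    return min(n, ceiling)


for n in (1, 3, 500):
    print(n, tried_bucket(n), round(math.log1p(n), 3), cap(n, 50))
```

Bucketing discards the most information but is the most robust to the tail; log1p keeps the ordering of mass trials while compressing their scale.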

deaths

numeric numeric_target high_skew outliers
This column records death counts per record, with the vast majority of rows (75.5%) being exactly zero and a median of 0. The distribution is extraordinarily right-skewed (skew = 28.52, kurtosis = 991.6), driven by rare but extreme values reaching a maximum of 500. The 24.5% of rows flagged as outliers correspond exactly to the rows with a non-zero death count (1 - 0.7547 = 0.2453): because the IQR is 0, any death at all counts as an outlier, so the flag here marks non-zero records rather than anomalies. The 74 unique values across 10,940 rows confirm the heavily zero-inflated, discrete nature of the data. Treatment: Model with zero-inflated or negative-binomial regression; apply a log1p transform if used as a feature in standard ML pipelines. high · anthropic:default
n             10,940
nulls         0 (0.0%)
unique        74
min           0
max           500
mean          1.493
median        0
std           13.19
q1            0
q3            0
iqr           0
skew          28.52
kurtosis      991.6
n_outliers    2,684
outlier_rate  0.2453
zero_rate     0.7547
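
The zero-inflated structure can be made concrete with a two-part (hurdle-style) decomposition of the reported mean: mean = P(deaths > 0) × E[deaths | deaths > 0]. The sketch below uses only the rounded figures from the panel above, so the conditional mean is approximate, and it assumes the flagged outliers are exactly the non-zero rows.

```python
# Figures as reported in the deaths profile above (rounded by the profiler).
N_ROWS = 10_940
N_OUTLIERS = 2_684   # rows flagged as outliers
ZERO_RATE = 0.7547
MEAN_DEATHS = 1.493

# With IQR = 0, an IQR-based rule puts both fences at 0,
# so "outlier" reduces to "non-zero".
nonzero_rate = N_OUTLIERS / N_ROWS
print(round(nonzero_rate, 4), round(1 - ZERO_RATE, 4))  # 0.2453 0.2453

# Hurdle decomposition: total deaths spread over the non-zero rows only.
mean_given_nonzero = MEAN_DEATHS * N_ROWS / N_OUTLIERS
print(round(mean_given_nonzero, 1))  # 6.1
```

Read this way, roughly one record in four involves any death, and those records average about six deaths each, which is the split a zero-inflated or hurdle model would estimate separately.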