saturn·

data trove noaa atmospheric weather alerts

source: /home/coolhand/html/datavis/data_trove/data/quirky/atmospheric_real.json · 571 rows · 19 columns · profiled 2026-06-22

Reading

dataset summary · medium confidence anthropic:default

This dataset contains 571 weather alert and atmospheric event records, combining operational NWS advisory data with a small number of rare or quirky atmospheric phenomena entries. The bulk of the dataset is well-populated NWS alerts, dominated by Small Craft Advisories (149), Winter Weather Advisories (95), and Winter Storm Warnings (60), with certainty skewed heavily toward 'Likely' (89% of non-null records). A key anomaly worth investigating is that country, event_type, magnitude, source, and state each have a 98.6% null rate (563 of 571 rows), so they are populated for only 8 rare-event records. The core alert columns (event, severity, certainty, urgency, areaDesc) each have exactly 8 nulls, which is consistent with the dataset being a hybrid merge of two very different sources. Severity is fairly well distributed across Minor, Moderate, and Severe, making it a useful dimension for filtering operational alerts.

citing: event.top_values · event.n_unique · certainty.top_rate · certainty.top_value · severity.top_values · urgency.top_value · urgency.top_rate · country.null_rate · event_type.null_rate · row_count
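
The hybrid-merge hypothesis above can be checked by partitioning rows on their null pattern. The following is a minimal sketch, assuming the JSON file loads as a list of flat dicts keyed by the column names in this report (the file layout is not confirmed here); `split_by_source` and `is_blank` are hypothetical helper names, and the sample rows are illustrative only.

```python
# Columns the profile reports as 98.6% null (populated for 8 rows only).
RARE_EVENT_COLUMNS = ("country", "event_type", "magnitude", "source", "state")


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def split_by_source(records):
    """Partition records into (alerts, rare_events, ambiguous) by null pattern."""
    alerts, rare_events, ambiguous = [], [], []
    for record in records:
        has_rare = any(not is_blank(record.get(c)) for c in RARE_EVENT_COLUMNS)
        has_alert = not is_blank(record.get("event"))
        if has_alert and not has_rare:
            alerts.append(record)
        elif has_rare and not has_alert:
            rare_events.append(record)
        else:
            # Both or neither populated: needs manual inspection.
            ambiguous.append(record)
    return alerts, rare_events, ambiguous


# Illustrative rows only; these are not taken from the dataset.
sample = [
    {"event": "Small Craft Advisory", "severity": "Minor"},
    {"event_type": "Ball Lightning", "country": "USA", "event": None},
    {"description": "row with neither marker"},
]
alerts, rare_events, ambiguous = split_by_source(sample)
```

If the merge hypothesis holds, the full file should yield 563 alerts, 8 rare events, and an empty ambiguous list; any other split is worth investigating before modelling.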

Schema

19 columns
Per-column summary.

column       kind         nulls   unique  alerts
event        categorical  1.4%    35
headline     categorical  1.6%    305     long_tail
description  categorical  0.0%    527     long_tail
severity     categorical  1.4%    5
certainty    categorical  1.4%    4
urgency      categorical  1.4%    5
areaDesc     categorical  1.4%    441     long_tail
sent         unknown      0.0%    -       skipped
effective    unknown      0.0%    -       skipped
onset        unknown      0.0%    -       skipped
expires      unknown      0.0%    -       skipped
latitude     numeric      97.2%   16      null_rate outliers
longitude    numeric      97.2%   16      null_rate outliers
event_type   categorical  98.6%   8       long_tail null_rate
date         unknown      0.0%    -       skipped
state        categorical  98.6%   6       long_tail null_rate
country      categorical  98.6%   4       long_tail null_rate
magnitude    categorical  98.6%   5       long_tail null_rate
source       categorical  98.6%   8       long_tail null_rate

event

categorical label
This column contains National Weather Service alert/advisory event type names, serving as a categorical label for meteorological warning events. With 35 unique values across 563 non-null rows, it covers a meaningful range of weather phenomena. 'Small Craft Advisory' leads at 26.5% of non-null records (149 of 563; top_rate is computed over non-null values, not all 571 rows). The top three event types alone (Small Craft Advisory 149, Winter Weather Advisory 95, Winter Storm Warning 60) cover 304 of 563 records, about 54%, so the distribution is skewed toward marine and winter events. An entropy ratio of 0.72 indicates moderate-to-high diversity, but the heavy concentration in a few categories matters for class-imbalance handling in any classification task. Treatment: One-hot encode or target-encode for modelling; be aware of class imbalance with 'Small Craft Advisory' at 26.5% and many tail categories. high · anthropic:default
n
571
nulls
8 (1.4%)
unique
35
top_value
Small Craft Advisory
top_rate
0.2647
cardinality
35
entropy
3.699
entropy_ratio
0.7212
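
The entropy and entropy_ratio figures can be reproduced from value counts. This is a minimal sketch, assuming saturn reports Shannon entropy in bits over non-null values and divides by log2 of the number of distinct values; the profiler's exact method is not documented here, but the reported numbers are consistent with it (3.699 / log2(35) ≈ 0.7212). `entropy_stats` is a hypothetical helper name and the sample values are illustrative.

```python
import math
from collections import Counter


def entropy_stats(values):
    """Return (entropy_bits, entropy_ratio) over the non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    distinct = len(counts)
    # A single distinct value has zero entropy; avoid dividing by log2(1) == 0.
    ratio = entropy / math.log2(distinct) if distinct > 1 else 0.0
    return entropy, ratio


# Eight distinct values seen once each: the shape reported for event_type.
uniform_entropy, uniform_ratio = entropy_stats(list("abcdefgh"))

# One value covering half of 8 rows plus four singletons: the shape
# reported for magnitude (top_rate 0.5, 5 unique).
skewed_entropy, skewed_ratio = entropy_stats(["N/A"] * 4 + ["w", "x", "y", "z"])
```

The two sample shapes give 3 bits with ratio 1, and 2 bits with ratio about 0.8614, matching the stats listed for event_type and magnitude.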

headline

categorical free_text long_tail
This column contains NWS (National Weather Service) alert headlines: structured text strings describing weather advisory type, issuance timestamp, expiry, and issuing office. Despite being typed as categorical, the entropy ratio of 0.943 and 305 unique values across 562 non-null records signal near free-text behaviour, and the long-tail alert confirms that most headlines appear only once or a handful of times. The most frequent value ('Small Craft Advisory issued February 17…') appears only 19 times (19 of 562 non-null records, top_rate ≈ 3.4%), indicating very little repetition across the dataset. Treatment: Parse structured subfields (alert type, issue time, expiry time, NWS office) via regex before modelling; do not use raw string as a categorical feature. high · anthropic:default
n
571
nulls
9 (1.6%)
unique
305
top_value
Small Craft Advisory issued February 17 at 4:12AM AKST until February 18 at 5:00PM AKST by NWS Anchorage AK
top_rate
0.03381
cardinality
305
entropy
7.782
entropy_ratio
0.943
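
The regex parsing suggested in the treatment could look like the sketch below. It assumes headlines follow the "<event> issued <time> until <time> by <office>" shape seen in the top value; other shapes may exist among the 305 unique headlines, so the function returns None instead of guessing. `HEADLINE_RE` and `parse_headline` are hypothetical names.

```python
import re

# "until ..." is optional in case some headlines omit an expiry.
HEADLINE_RE = re.compile(
    r"^(?P<event>.+?) issued (?P<issued>.+?)"
    r"(?: until (?P<until>.+?))? by (?P<office>.+)$"
)


def parse_headline(headline):
    """Split a headline into event, issued, until, office; None if no match."""
    if not headline:
        return None
    match = HEADLINE_RE.match(headline.strip())
    return match.groupdict() if match else None


# The top_value reported for the headline column.
parsed = parse_headline(
    "Small Craft Advisory issued February 17 at 4:12AM AKST "
    "until February 18 at 5:00PM AKST by NWS Anchorage AK"
)
```

The issued and until fields stay as raw strings because the headline carries no year; resolving them to datetimes would need a year from another column.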

description

categorical free_text long_tail
This column contains full-text NWS (National Weather Service) alert and forecast descriptions — multi-line, structured prose covering marine, fire weather, wind, and winter storm warnings across various US regions. With 527 unique values out of 571 rows and an entropy ratio of 0.9945, nearly every entry is distinct, making this essentially free text. The long-tail alert and the presence of duplicate entries (e.g., the same Southeast Alaska marine forecast appearing 4 times) suggest periodic reissue of templated advisories rather than purely unique records, which may indicate time-series duplication worth investigating. Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) before modelling; consider deduplicating or grouping by alert template for frequency analysis. high · anthropic:default
n
571
nulls
0 (0.0%)
unique
527
top_value
Southeast Alaska Inside Waters from Dixon Entrance to Skagway Wind forecasts reflect the predominant speed and direction expected. Sea forecasts represent the average of the highest one-third of the combined windwave and swell height. .TONIGHT...N wind 25 kt. Seas 5 ft. Heavy freezing spray. .WED...N wind 15 kt diminishing to 10 kt in the morning. Seas 3 ft in the morning then 2 ft or less. Light freezing spray in the early morning. .WED NIGHT...N wind 10 kt. Seas 2 ft or less. .THU...N wind 10 kt. Seas 2 ft or less. Light freezing spray. .THU NIGHT...N wind 15 kt. Seas 2 ft or less. Snow. .FRI...N gale to 35 kt. Seas 5 ft. .SAT...N gale to 45 kt. Seas 2 ft or less. .SUN...N gale to 35 kt. Seas 2 ft or less.
top_rate
0.007005
cardinality
527
entropy
8.992
entropy_ratio
0.9945
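
Before embedding the text, reissued descriptions can be surfaced with a simple normalise-and-count pass. This sketch assumes that collapsing whitespace and case is enough to match reissues; advisories that differ by a timestamp or wind value would need fuzzier matching. `normalize` and `repeated_descriptions` are hypothetical names and the sample strings are illustrative.

```python
import re
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace and lowercase, so reflowed copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def repeated_descriptions(descriptions, min_count=2):
    """Return (normalized_text, count) pairs seen at least min_count times."""
    counts = Counter(normalize(d) for d in descriptions if d)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# Illustrative strings only; these are not taken from the dataset.
sample_descriptions = [
    "N wind 25 kt.  Seas 5 ft.",
    "n wind 25 kt. seas 5 ft.",
    "S wind 10 kt. Seas 2 ft or less.",
    None,
]
repeats = repeated_descriptions(sample_descriptions)
```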

severity

categorical label
This column is an ordinal severity classification with 5 levels (Moderate, Minor, Severe, Unknown, and Extreme) describing the intensity of each alert; the value set matches the severity field of the Common Alerting Protocol (CAP) used for NWS alerts. 'Moderate' leads at 36.9% of non-null records (208 of 563; top_rate excludes the 8 nulls), and 'Extreme' is strikingly rare at only 5 occurrences, so the most severe alerts form a thin tail. The 19 'Unknown' values represent a data-quality concern distinct from the 1.4% null rate, effectively adding a second form of missingness. An entropy ratio of 0.77 indicates a reasonably spread distribution, though the imbalance at the tail warrants attention for any classification task. Treatment: Encode as ordinal (Minor < Moderate < Severe < Extreme); treat 'Unknown' as a separate missing indicator before modelling. high · anthropic:default
n
571
nulls
8 (1.4%)
unique
5
top_value
Moderate
top_rate
0.3694
cardinality
5
entropy
1.788
entropy_ratio
0.7698
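
The ordinal treatment above can be sketched as a small encoder that keeps the two kinds of missingness apart. The integer ranks are an illustrative choice and `encode_severity` is a hypothetical name; only the ordering Minor < Moderate < Severe < Extreme comes from the profile.

```python
# Ordering taken from the treatment note; the integer values are arbitrary.
SEVERITY_RANK = {"Minor": 1, "Moderate": 2, "Severe": 3, "Extreme": 4}


def encode_severity(value):
    """Return the rank plus separate flags for 'Unknown' and true nulls."""
    return {
        "severity_rank": SEVERITY_RANK.get(value),  # None when not rankable
        "severity_unknown": value == "Unknown",     # explicit 'Unknown' label
        "severity_null": value is None,             # value absent from the row
    }


encoded = [encode_severity(v) for v in ["Minor", "Extreme", "Unknown", None]]
```

Keeping two flags lets a model distinguish the 19 alerts labelled 'Unknown' from the 8 rows where the field is absent altogether.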

certainty

categorical label
This column encodes the certainty attached to each alert, with four categories present: Likely, Observed, Unknown, and Possible. The labels match the certainty field of the Common Alerting Protocol (CAP) used for NWS alerts, so they most plausibly come from the issuing office rather than from a downstream analyst. The distribution is severely skewed: 'Likely' dominates at 89.3% of non-null records (503 of 563), while 'Observed', 'Unknown', and 'Possible' each account for only 18–24 records. The low entropy ratio of 0.328 confirms near-constant behaviour, and the 1.4% null rate is minor. The near-total dominance of a single category limits this column's discriminative power as a feature. Treatment: Ordinal-encode with awareness of severe class imbalance; consider collapsing minority categories or using as a stratification variable rather than a predictive feature. high · anthropic:default
n
571
nulls
8 (1.4%)
unique
4
top_value
Likely
top_rate
0.8934
cardinality
4
entropy
0.6569
entropy_ratio
0.3285

urgency

categorical feature
This column is a categorical urgency classification with 5 distinct levels; in this dataset it describes weather alerts, and the labels match the urgency field of the Common Alerting Protocol (CAP) used for NWS alerts. It is severely dominated by 'Expected' (510 of 563 non-null rows, 90.6%), leaving the remaining 4 categories ('Unknown', 'Future', 'Past', and 'Immediate') collectively accounting for fewer than 10% of records. The very low entropy ratio of 0.27 confirms extreme class imbalance, which will limit this column's discriminative power in most models. The 'Immediate' category, presumably the most critical, appears only 6 times, making it near-invisible to any classifier trained on this distribution. Treatment: One-hot encode but flag severe class imbalance; consider oversampling minority classes (especially 'Immediate' with n=6) or collapsing into binary 'Expected vs. Other' before modelling. high · anthropic:default
n
571
nulls
8 (1.4%)
unique
5
top_value
Expected
top_rate
0.9059
cardinality
5
entropy
0.6276
entropy_ratio
0.2703
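
Collapsing minority categories, as suggested for both certainty and urgency, can be sketched as follows. The 5% threshold and the "Other" label are illustrative assumptions, not values from the profile, and `collapse_rare` is a hypothetical name. Nulls are passed through unchanged so they can be handled separately.

```python
from collections import Counter


def collapse_rare(values, min_share=0.05, other_label="Other"):
    """Replace categories below min_share of non-null values with other_label."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    keep = {v for v, n in counts.items() if n / total >= min_share}
    return [v if v is None or v in keep else other_label for v in values]


# Illustrative sample: one dominant label, two rare labels, one null.
sample_urgency = ["Expected"] * 48 + ["Immediate", "Future", None]
collapsed = collapse_rare(sample_urgency)
collapsed_counts = Counter(collapsed)
```

Collapsing discards the distinction between 'Immediate' and the other rare labels, so it suits coarse stratification better than any task where the rare class is the target.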

areaDesc

categorical label long_tail
This column contains geographic area descriptions used in weather or emergency alerts, predominantly covering coastal and inland waterways of Southeast Alaska (e.g., 'Glacier Bay', 'Stephens Passage', 'Northern Lynn Canal') with some continental US zones also present (e.g., 'Greater Lake Tahoe Area', 'Sacramento Mountains'). With 441 unique values across only 571 rows and an entropy ratio of 0.98, cardinality is extremely high — nearly every row is a distinct area. The long-tail alert confirms that most areas appear only once or twice, with even the top value ('Glacier Bay') appearing just 9 times (1.6% of rows). The multi-zone concatenated entries (e.g., 'Greater Lake Tahoe Area; Greater Lake Tahoe Area') suggest some records bundle multiple zones into a single string, which may cause deduplication or parsing issues. Treatment: Parse semicolon-delimited multi-zone entries into separate records, then use as a grouping/filter dimension rather than a model feature; too high-cardinality for direct encoding without aggregation. high · anthropic:default
n
571
nulls
8 (1.4%)
unique
441
top_value
Glacier Bay
top_rate
0.01599
cardinality
441
entropy
8.608
entropy_ratio
0.9799
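
Splitting the semicolon-delimited entries could be done as in the sketch below, which also drops repeated zones within one string such as 'Greater Lake Tahoe Area; Greater Lake Tahoe Area'. It assumes ';' is the only delimiter in use, which the profile suggests but does not confirm. `split_zones` and `explode_by_zone` are hypothetical names.

```python
def split_zones(area_desc):
    """Split an areaDesc string on ';', trimming and dropping repeats in order."""
    if not area_desc:
        return []
    zones = []
    for part in area_desc.split(";"):
        zone = part.strip()
        if zone and zone not in zones:
            zones.append(zone)
    return zones


def explode_by_zone(records):
    """Yield one copy of each record per zone, with the zone under 'zone'."""
    for record in records:
        for zone in split_zones(record.get("areaDesc")):
            yield {**record, "zone": zone}


tahoe = split_zones("Greater Lake Tahoe Area; Greater Lake Tahoe Area")
# Illustrative record combining two area names that appear in the profile.
exploded = list(explode_by_zone([{"event": "x", "areaDesc": "Glacier Bay; Stephens Passage"}]))
```

Exploding changes the row count, so per-alert statistics should be computed before this step or grouped back by an alert identifier.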

sent

unknown other skipped
The column 'sent' has 571 non-null rows but was skipped by the profiler, leaving its type and value distribution entirely unknown. No stats, uniqueness counts, or distribution metrics are available. Taken alone, the name could suggest a boolean flag, but alongside 'effective', 'onset', and 'expires' in an NWS alert dataset it is most plausibly the timestamp at which the alert was sent; this cannot be confirmed from the evidence. An analyst should inspect raw values directly before any downstream use. Treatment: Inspect raw values to determine dtype, then re-profile before any modelling or filtering. low · anthropic:default
n
571
nulls
0 (0.0%)
unique

effective

unknown other skipped
The column 'effective' contains 571 non-null values but was skipped by the profiler, leaving its type and distribution entirely unknown. No stats, uniqueness count, or value samples are available, so its semantic role cannot be determined from this evidence alone. Given the neighbouring 'sent', 'onset', and 'expires' columns, the name most plausibly denotes the time at which the alert takes effect rather than a boolean is-effective flag, but this is speculation beyond the evidence. Treatment: Manually inspect raw values to determine type, then re-profile before any modelling use. low · anthropic:default
n
571
nulls
0 (0.0%)
unique

onset

unknown timestamp skipped
The column 'onset' most plausibly records the expected start time of the weather event described by each alert, given that the dataset consists of NWS alerts rather than clinical records. The profiler emitted a 'skipped' alert and returned no stats, leaving its true type and distribution entirely uncharacterised. With 571 non-null rows and zero null rate, data is present, but nothing about format, uniqueness, or value range can be confirmed from this evidence alone. An analyst should inspect raw values to determine whether it is a date string, an epoch number, or some other encoding before any downstream use. Treatment: Inspect raw values to confirm type (date vs. numeric vs. categorical), then parse or encode accordingly before modelling. low · anthropic:default
n
571
nulls
0 (0.0%)
unique

expires

unknown timestamp skipped
The column 'expires' likely represents an expiration date or timestamp field, but the profiler skipped analysis entirely, yielding no stats, no uniqueness count, and a kind of 'unknown'. With 571 non-null rows and zero null rate, data is present but its structure or encoding prevented saturn from classifying it. No further distributional signals are available from the evidence. Treatment: Inspect raw values to confirm encoding (ISO string, epoch int, or other format), parse to datetime, then use as a feature or filter boundary. low · anthropic:default
n
571
nulls
0 (0.0%)
unique
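
The inspect-then-parse treatment for the skipped time columns (sent, effective, onset, expires, date) can be sketched as a tolerant parser. The actual encoding is unknown, so this assumes either ISO 8601 strings or epoch seconds and returns None for anything else; `parse_timestamp` is a hypothetical name and the sample value is illustrative, not taken from the file.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 string or epoch seconds into a datetime, else None."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        # Assumption: numeric values are seconds since the Unix epoch, in UTC.
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        text = value.strip()
        if text.endswith("Z"):
            # Older Python versions do not accept a trailing 'Z' in fromisoformat.
            text = text[:-1] + "+00:00"
        try:
            return datetime.fromisoformat(text)
        except ValueError:
            return None
    return None


parsed_iso = parse_timestamp("2026-02-17T04:12:00-09:00")  # illustrative value
parsed_epoch = parse_timestamp(0)
unparsed = parse_timestamp("not a date")
```

Counting how many rows return None per column is a quick way to learn whether the assumed encodings cover the data before re-profiling.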

latitude

numeric null_rate outliers
n
571
nulls
555 (97.2%)
unique
16
min
25.76
max
69.65
mean
40.46
median
35.45
std
11.58
q1
34.96
q3
41
iqr
6.043
skew
1.508
kurtosis
1.297
n_outliers
4
outlier_rate
0.25
zero_rate
0

longitude

numeric null_rate outliers
n
571
nulls
555 (97.2%)
unique
16
min
-120.6
max
18.96
mean
-84.42
median
-95.31
std
45.23
q1
-120.2
q3
-78.65
iqr
41.52
skew
1.283
kurtosis
0.3248
n_outliers
2
outlier_rate
0.125
zero_rate
0
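
The outliers alert on latitude and longitude can be sanity-checked with Tukey fences. This sketch assumes saturn uses the common 1.5 × IQR rule, which the report does not state, and it works from the rounded quartiles shown above, so it only confirms that the extremes fall outside the fences and does not reproduce the exact n_outliers counts. `tukey_fences` is a hypothetical name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the latitude and longitude stats (rounded).
lat_low, lat_high = tukey_fences(34.96, 41.0)
lon_low, lon_high = tukey_fences(-120.2, -78.65)

# Reported extremes compared against the fences.
lat_min_is_outlier = 25.76 < lat_low
lat_max_is_outlier = 69.65 > lat_high
lon_min_is_outlier = -120.6 < lon_low
lon_max_is_outlier = 18.96 > lon_high
```

Under that assumption both latitude extremes and the longitude maximum fall outside the fences, while the longitude minimum does not. With only 16 non-null coordinates, these points are more likely genuine international events than measurement errors.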

event_type

categorical long_tail null_rate
n
571
nulls
563 (98.6%)
unique
8
top_value
Ball Lightning
top_rate
0.125
cardinality
8
entropy
3
entropy_ratio
1

date

unknown timestamp skipped
This column is named 'date' and contains 571 non-null values with a 0.0% null rate, suggesting it is a timestamp or date field. However, saturn skipped profiling it (kind: 'unknown', no stats, no uniqueness count), so no distribution, range, or format details are available. The absence of any computed statistics prevents assessment of cardinality, temporal range, or potential drift. Treat with caution until the parsing issue causing the skip is resolved. Treatment: Investigate why saturn skipped this column, parse to a proper datetime type, then extract temporal features or use as an index. low · anthropic:default
n
571
nulls
0 (0.0%)
unique

state

categorical long_tail null_rate
n
571
nulls
563 (98.6%)
unique
6
top_value
International
top_rate
0.375
cardinality
6
entropy
2.406
entropy_ratio
0.9306

country

categorical long_tail null_rate
n
571
nulls
563 (98.6%)
unique
4
top_value
USA
top_rate
0.625
cardinality
4
entropy
1.549
entropy_ratio
0.7744

magnitude

categorical long_tail null_rate
n
571
nulls
563 (98.6%)
unique
5
top_value
N/A
top_rate
0.5
cardinality
5
entropy
2
entropy_ratio
0.8614

source

categorical long_tail null_rate
n
571
nulls
563 (98.6%)
unique
8
top_value
Journal of Geophysical Research
top_rate
0.125
cardinality
8
entropy
3
entropy_ratio
1