saturn·

dataset dissector · v0.2.0

Point saturn at a dataset. Get a plain-language reading and every number behind it.

Drop a spreadsheet, CSV, JSONL, Parquet, SQLite, or Arrow file, or paste a HuggingFace repo id. Saturn profiles the whole corpus, then a language model writes up what changed and what's worth a closer look. All statistics are deterministic and machine-readable.

UPLOAD 01

Local file

By default this demo runs Opus on the server's account. Paste your own API key below to use a different provider; it's used once for this run and never stored. Pick "ollama (local)" to call a self-hosted model (only reachable when saturn runs on your own machine).

API key (used once, not stored)

50 MB max. The summary uses anthropic:claude-opus-4-7; uncheck "stats only" or paste your own key above.

HUGGINGFACE 02

Remote repo

API key (used once, not stored)

Loads every split, concatenated. Big repos may take minutes; you'll get a job page to watch.

Readings

233 files
001 / 233
profile reading

data trove strange places v5 2

/home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json · 354,770 rows × 48 cols

15× feature 6× label 1× timestamp 1× free_text 1× foreign_key 1× metadata

This is a 354,770-row mashup of 14 heterogeneous 'strange places' datasets — spanning tornadoes, UFO sightings, cave entrances, meteorites, ghost towns, earthquakes, shipwrecks, and more — unified under a single 'category' column. The most important thing to examine first is the category distribution, which reveals that no single source dominates but tornadoes (~71K), caves (~70K), and UFO sightings (~61K) each make up roughly 17–20% of records. A second key signal is the pervasive sparsity: most domain-specific columns (depth_km, duration_seconds, shape, damage_property) carry null rates of 80–99%, meaning each column is only meaningful for the subset of rows belonging to its originating dataset. UFO sighting durations show extreme right-skew (median 180 s, max 66 million s) and earthquake depths are similarly skewed, both worth closer inspection within their respective subsets.
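The sparsity point above can be checked by computing null rates per category rather than over the whole file. The following is a minimal sketch using only the standard library; the helper name `null_rate_by_category` and the toy rows are invented for illustration and are not taken from saturn or from the file itself.

```python
from collections import defaultdict

def null_rate_by_category(rows, column, category_key="category"):
    """Return {category: fraction of rows where `column` is missing or None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        cat = row.get(category_key)
        totals[cat] += 1
        # A key that is absent and a key set to None both count as null.
        if row.get(column) is None:
            missing[cat] += 1
    return {cat: missing[cat] / totals[cat] for cat in totals}

# Toy rows, invented for illustration (not real records from the dataset).
rows = [
    {"category": "earthquake", "depth_km": 10.0},
    {"category": "earthquake", "depth_km": 35.2},
    {"category": "ufo", "depth_km": None},
    {"category": "ufo"},
]
rates = null_rate_by_category(rows, "depth_km")
print(rates)
```

A column that looks 80–99% empty overall may turn out to be fully populated inside its originating category, which is the pattern the reading describes.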

Open
002 / 233
profile reading

data trove us county state boundaries geojson

/home/coolhand/html/datavis/data_trove/geographic/counties_simplified.geojson · 3,234 rows × 18 cols

7× label 6× feature 2× identifier 2× foreign_key

This dataset is a US county-level geographic reference file containing 3,234 county (and county-equivalent) records across 56 state FIPS codes, with spatial attributes, area measurements, and metropolitan area classifications. The most notable pattern is that roughly 41% of counties share a name with at least one other county ('Washington' alone appears 31 times), reflecting the historic reuse of patriotic and presidential names across states. Two numeric columns, ALAND (land area) and AWATER (water area), show extreme right skew, with 11–14% of values flagged as outliers, meaning a small number of counties are vastly larger or wetter than the median, which warrants attention in any area-weighted analysis. Additionally, over 40% of counties have no CBSAFP code (no core-based statistical area assignment), indicating a large group of rural, non-metro counties that could easily be overlooked in urban-focused analyses.

Open
003 / 233
profile reading

data trove us geological features

/home/coolhand/html/datavis/data_trove/geographic/geology/geological_regions.geojson · 14 rows × 7 cols

4× label 1× metadata 1× free_text 1× feature

This dataset is a small geospatial catalogue of 14 named geological regions across the United States, each described as a polygon with attributes covering geology type, geological age, primary resources, and a short description. The most notable pattern is the dominance of Sedimentary Basins (4 of 14 regions) as the leading geology type, which aligns with the prevalence of oil and natural gas as primary resources. The geological ages span a wide range from Precambrian to Tertiary, suggesting this catalogue captures regions of very different formation histories — worth examining alongside resource type to spot any age-resource relationships.

Open
004 / 233
profile reading

data trove us disasters mashup

/home/coolhand/html/datavis/data_trove/data/wild/disasters/disasters_mashup.json · 54,575 rows × 16 cols

10× feature 4× label 1× timestamp 1× foreign_key

This dataset is a multi-hazard disaster event mashup of 54,575 records spanning aviation accidents, storms, earthquakes, and shipwrecks, each geolocated with latitude and longitude. Aviation accidents dominate heavily at nearly 59% of all records, with Cessna models being the most frequently involved aircraft — worth examining whether this reflects true prevalence or a reporting/sourcing bias. A second area of interest is the severity data: fatalities, injuries, and damage all carry a ~73% null rate, meaning consequence analysis is limited to roughly a quarter of the dataset and skewed toward zero-casualty events. The storm subcategory breakdown (Tornadoes, Flash Floods, Thunderstorm Wind) also deserves a closer look for geographic and seasonal clustering given the strong US state representation.

Open
005 / 233
profile reading

data trove global sentiment survey pew research

/home/coolhand/html/datavis/data_trove/data/quirky/global_sentiment.json · 1 rows × 5 cols

2× metadata 2× other 1× label

This dataset is a single-record metadata document titled 'How the World Sees America,' aggregating global favorability and presidential confidence data from Pew Research Center surveys covering 2020–2025. With only 1 row and 5 columns, it functions as a descriptor or wrapper rather than a traditional tabular dataset — the real analytical content is almost certainly nested inside the 'countries' and 'metrics' columns, which were skipped during profiling. Those two unknown-type columns are the most important things to examine next, as they likely contain the multi-country time-series data needed for any meaningful analysis. Until those nested structures are unpacked, no statistical patterns can be assessed from the surface-level profile alone.

Open
006 / 233
profile reading

data trove geopolitical actions timeline

/home/coolhand/html/datavis/data_trove/data/policy/geopolitical_timeline.json · 1 rows × 6 cols

3× metadata 3× other

This dataset is a single-record metadata stub describing a geopolitical timeline of military and diplomatic actions during Trump's second term. With only 1 row and 6 columns — the majority of which are unresolved 'unknown' types (_fields, _key_events, _sources, data) — there is virtually no statistical signal available at this level. The real analytical content is almost certainly nested inside the 'data' and '_key_events' columns, which likely contain structured arrays or objects that need to be unpacked before any meaningful analysis can begin. The '_stub' flag being 'True' confirms this is a placeholder or index record, not the full dataset.

Open
007 / 233
profile reading

data trove market sector performance

/home/coolhand/html/datavis/data_trove/data/policy/sector_indices.json · 1 rows × 7 cols

4× other 3× metadata

This dataset is a single-record JSON stub describing S&P 500 sector ETF performance since the January 20, 2025 inauguration. With only 1 row and 7 columns — most of which are 'unknown' kind and skipped during profiling — there is very little structured, analyzable data surface available here. The two readable columns (_description and _stub) are fully constant and carry no variation. The meaningful content almost certainly lives inside the nested or unparsed fields (_etf_map, _fields, _key_findings, data), which would need to be unpacked before any real analysis can begin.

Open
008 / 233
profile reading

data trove doge workforce cuts

/home/coolhand/html/datavis/data_trove/data/policy/doge_cuts_by_agency.json · 1 rows × 6 cols

3× metadata 3× other

This dataset is a single-record metadata wrapper describing a DOGE (Department of Government Efficiency) federal workforce cuts dataset, organized by agency and paired with spending paradox data. The file contains only 1 row and 6 columns, with most columns ('_fields', '_key_numbers', '_sources', 'data') skipped during profiling — meaning the substantive data is likely nested inside those unparsed structures. The only readable signals are the description label and a stub flag set to 'True', suggesting this is a placeholder or index record pointing to richer nested content. Analysts should prioritize unpacking the 'data' and '_key_numbers' columns to access the actual agency-level workforce and spending figures.

Open
009 / 233
profile reading

data trove tariff timeline

/home/coolhand/html/datavis/data_trove/data/policy/tariff_timeline.json · 1 rows × 6 cols

3× metadata 3× other

This dataset is a single-record metadata stub describing a curated knowledge object about tariff announcements, rates, and market reactions covering 2025–2026. With only 1 row and 6 columns — most of which are unresolved 'unknown' types — there is virtually no tabular data to analyse at this stage. The meaningful content almost certainly lives inside the nested or complex fields (_fields, _key_numbers, _sources, data) that the profiler skipped. Before any analysis can proceed, those nested structures need to be unpacked and flattened into a proper tabular format.
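Several of the single-record readings above recommend unpacking nested fields before analysis. A minimal sketch of that step follows, assuming the wrapper keeps its records as a list of objects under a `data` key; that layout is a guess, since the readings do not show the real file structure, and `flatten` and `unpack_wrapper` are hypothetical helper names.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def unpack_wrapper(doc, records_key="data"):
    """Turn a one-record wrapper into a list of flat row dicts."""
    records = doc.get(records_key, [])
    return [flatten(r) for r in records if isinstance(r, dict)]

# Hypothetical document shape, invented for illustration.
raw = """{"_description": "example", "_stub": true,
 "data": [{"date": "2025-02-01", "rate": {"value": 10, "unit": "%"}},
          {"date": "2025-03-04", "rate": {"value": 20, "unit": "%"}}]}"""
rows = unpack_wrapper(json.loads(raw))
print(rows[0])
```

Once the rows are flat, the file could be re-profiled as an ordinary table with one row per nested record.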

Open
010 / 233
profile reading

data trove executive orders database

/home/coolhand/html/datavis/data_trove/data/policy/executive_orders.json · 1 rows × 7 cols

4× other 3× metadata

This dataset is a metadata wrapper — a single-row JSON manifest describing a collection of Trump second-term executive orders (EO 14147 through EO 14371) sourced from the Federal Register API. With only 1 row and 7 columns, the file itself is a stub or index record rather than the full underlying data; the actual executive order records are nested inside the 'data', '_sample', '_stats', and '_fields' columns which were skipped during profiling. The most important next step is to unpack those nested columns — particularly 'data' — to access the real executive order records. Until that extraction is done, no meaningful analysis of the orders themselves (topics, signing dates, frequency, etc.) is possible.

Open
011 / 233
profile reading

data trove witch trials

/home/coolhand/html/datavis/data_trove/data/quirky/witch_trials.json · 10,940 rows × 6 cols

4× feature 1× numeric_target 1× timestamp

This dataset records historical witch trials across Europe, covering 10,940 cases with information on location, time period, and outcomes (people tried and deaths). The first thing that stands out is the extreme skew in both 'deaths' and 'tried': the vast majority of records show zero deaths and just one person tried, yet outliers reach as high as 500, suggesting a small number of mass trials drove most of the carnage. Temporally, activity clusters heavily between roughly 1590 and 1660 (the IQR), pointing to a well-known peak persecution era, with a long tail back to 1300 worth examining. Geographically, the United Kingdom and Germany together account for over two-thirds of all records, while Geneva dominates city-level entries despite nearly half of city values being missing.

Open
012 / 233
profile reading

data trove bfro bigfoot sightings full scrape

/home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json · 5,411 rows × 9 cols

3× feature 2× label 2× identifier 1× free_text 1× timestamp

This dataset contains 5,411 Bigfoot sighting reports sourced from the Bigfoot Field Researchers Organization (BFRO), covering sightings across 53 U.S. states and territories with attributes including location, date, classification, and a short description. Washington state dominates with 631 reports (about 12% of all sightings), followed by California and Ohio, suggesting strong geographic clustering worth examining. The temporal distribution is skewed toward more recent decades — median year is 2001 with records stretching back to 1870 — raising questions about whether sightings are truly increasing or simply better reported. Sighting classifications split almost evenly between Class A (direct sightings, 2,655) and Class B (indirect evidence, 2,722), with Class C being rare at just 34 reports.

Open
013 / 233
profile reading

data trove ufo sightings analysis

/home/coolhand/html/datavis/data_trove/data/quirky/ufo_by_state.json · 58 rows × 2 cols

1× feature 1× identifier

This dataset contains UFO sighting counts aggregated by U.S. state: 58 rows with no missing values. The count distribution is heavily right-skewed (skew ~2.93) with high kurtosis and 4 outlier states that far exceed the norm. The max of 16,197 sightings dwarfs the median of 1,510, suggesting a handful of states dominate UFO reports. The state column has one entry per state, so the interesting story is entirely in how unevenly sightings are distributed across states. Look closely at the top states to see which ones are driving the bulk of reported sightings.
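Skew and outlier counts like the ones quoted above can be reproduced with the standard library. This sketch assumes population (Fisher-Pearson) skewness and the common 1.5 × IQR fence; the definitions saturn actually uses are not stated on this page, and the counts below are invented.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented per-state counts, not the real ufo_by_state.json values.
counts = [120, 300, 450, 500, 700, 900, 1500, 1600, 2100, 16000]
print(round(skewness(counts), 2), iqr_outliers(counts))
```

A single very large count is enough to produce strongly positive skew and to be flagged by the IQR fence, which matches the shape described in the reading.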

Open
014 / 233
profile reading

data trove ghost sightings database

/home/coolhand/html/datavis/data_trove/data/quirky/ghosts.json · 1 rows × 1 cols

1× other

This dataset contains a single column called 'ghosts' with just one row, sourced from a file named ghosts.json. The column type could not be determined and was skipped during profiling, meaning there is effectively no statistical signal to report. With only 1 row and 1 column yielding no computable stats, there is nothing analytically actionable here. The dataset may be malformed, empty in substance, or structured in a way the profiler could not parse.

Open
015 / 233
profile reading

data trove carnivorous plants gbif

/home/coolhand/html/datavis/data_trove/data/quirky/carnivorous_plants_real.json · 610 rows × 14 cols

7× feature 5× label 1× identifier 1× free_text

This is a GBIF biodiversity occurrence dataset with 610 records spanning 14 columns, covering observations and preserved specimens of organisms across 35 countries, primarily recorded between 2021 and 2026. Despite the filename suggesting carnivorous plants, the dataset actually mixes three distinct taxonomic families, Hesperiidae (skippers/butterflies), Canellaceae (spice plants), and Araceae, contributing roughly 300, 300, and 10 records respectively, which is a notable data-quality curiosity worth investigating. The dominant species is Canella winterana with 174 records (28.5%), and the US, Mexico, Brazil, and Guadeloupe together account for nearly half of all country-level records. Coordinate uncertainty is severely skewed and problematic: the median is just 35 metres but the max reaches 766,917 metres, with 91 outliers and a 23% null rate, meaning spatial analyses should treat location precision with caution.

Open
016 / 233
profile reading

data trove noaa atmospheric weather alerts

/home/coolhand/html/datavis/data_trove/data/quirky/atmospheric_real.json · 571 rows × 19 cols

4× label 3× timestamp 2× free_text 2× other 1× feature

This dataset contains 571 weather alert and atmospheric event records, combining operational NWS advisory data with a small number of rare/quirky atmospheric phenomena entries. The bulk of the dataset is well-populated NWS alerts — dominated by Small Craft Advisories (149), Winter Weather Advisories (95), and Winter Storm Warnings (60) — with certainty skewed heavily toward 'Likely' (89% of records). A key anomaly worth investigating is that columns like country, event_type, magnitude, source, and state have a ~98.6% null rate, meaning they are only populated for roughly 8 rare-event records, suggesting the dataset is a hybrid merge of two very different sources. Severity is fairly well distributed across Minor, Moderate, and Severe, making it a useful dimension for filtering operational alerts.

Open
017 / 233
profile reading

data trove tornadoes noaa spc

/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows × 13 cols

10× feature 2× timestamp 1× numeric_target

This dataset contains 70,022 tornado records across the United States, with attributes covering location, timing, magnitude, path dimensions, and human impact. Texas dominates with 9,345 events, and the classic 'Tornado Alley' states (TX, KS, OK, NE, IA) together account for a large share of all records. Magnitude is worth close inspection: nearly half of all tornadoes are rated 0 (the weakest rating on the EF/F scale), and only 59 reach magnitude 5, suggesting a steep severity distribution. Human cost is highly skewed: 97.7% of events report zero fatalities, but the long tail of deadly events (including multi-fatality outbreaks) and the April 27, 2011 date appearing most frequently (207 records) point to a handful of catastrophic outbreak days that deserve focused analysis.

Open
018 / 233
profile reading

data trove bioluminescence

/home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json · 43,060 rows × 14 cols

7× label 5× feature 1× timestamp 1× metadata

This dataset contains 43,060 occurrence records of bioluminescent marine organisms, covering 26 named groups across 7 phyla — from dinoflagellates and jellyfish to krill and bacteria — with geographic coordinates, taxonomy, and sampling depth. The most notable issue is that depth has a 24.75% null rate, extreme skew (max 10,000 m vs. median 52.5 m), and over 10% outliers, meaning depth-based analysis needs careful filtering before any conclusions are drawn. A second area to investigate is geographic bias: over 63% of country values are blank, yet Australia, the United States, Peru, and Canada dominate the named entries, suggesting strong regional over-representation in the sourced datasets. The year column also carries a 42% null rate, which limits time-trend analysis despite records spanning from at least 1962 to 2017.

Open
019 / 233
profile reading

data trove deep sea specimens

/home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json · 200,000 rows × 12 cols

7× label 5× feature

This dataset contains 200,000 deep-sea biodiversity occurrence records spanning taxonomic classification, geographic coordinates, ocean depth, and collection year. The most striking feature is the dominance of blank values across taxonomy columns — 55% of genus, 40% of family, and 73% of species entries are empty strings, suggesting many records are identified only at higher taxonomic levels. Proteobacteria, Cnidaria, and Chordata are the best-represented phyla, while Australia accounts for the vast majority of records with a named country (~79k of ~96k non-blank entries). Depth ranges from 1,000 to 11,000 metres with a mean around 2,400 m, and the year column is heavily left-skewed with over 12,000 outlier records dating back as far as 1875, versus a median of 2016.

Open
020 / 233
profile reading

data trove fossils pbdb

/home/coolhand/html/datavis/data_trove/data/quirky/fossils.json · 22,043 rows × 21 cols

9× feature 8× label 2× other 1× foreign_key 1× identifier

This is a fossil occurrence dataset containing 22,043 records spanning taxonomic classifications, geographic coordinates, and geological time ranges for paleontological finds. The taxonomic breakdown is dominated by Chordata (81.6%) with Mammalia, Saurischia, and Ornithischia as the leading classes, while over half of all occurrences (50.9%) come from the United States — worth examining for geographic bias. The geological age columns (early_age_mya and late_age_mya) span from near-present to over 500 million years ago with high spread and outliers, suggesting the dataset mixes very different eras of life. Taxonomic rank is split between species (41%) and genus (33%), meaning precision of identification varies considerably across records and may affect comparative analyses.

Open
021 / 233
profile reading

data trove megaliths

/home/coolhand/html/datavis/data_trove/data/quirky/megaliths.json · 15,464 rows × 14 cols

5× label 4× feature 2× metadata 1× foreign_key 1× free_text 1× identifier

This dataset catalogues 15,464 megalithic structures (dolmens, menhirs, stone circles, nuraghes, and more) drawn from OpenStreetMap, with geographic coordinates, heritage classification, and typology fields. The most striking pattern is extreme sparsity in descriptive metadata: over 95% of records have no description, 98.5% have no material recorded, and roughly 70% lack a Wikidata link, suggesting the dataset is geographically rich but editorially thin. The megalith_type column is the most informative categorical field, splitting meaningfully across menhirs (5,231), dolmens (4,501), nuraghes (1,080), and stone circles (1,011). Geographically, the bulk of sites cluster in Western Europe (median latitude ~47.6°N, median longitude ~-1.6°), but high skew and outliers in both lat and lon indicate a long tail of sites in places like Sardinia, Iberia, Ireland, and beyond — worth mapping.

Open
022 / 233
profile reading

data trove caves worldwide

/home/coolhand/html/datavis/data_trove/data/quirky/caves.json · 69,716 rows × 12 cols

7× feature 2× metadata 1× free_text 1× label 1× identifier

This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: the vast majority of records have empty descriptions (93%), websites (96%), wikipedia links (97%), depth (99.6%), and length (99.1%), meaning most caves are little more than a name and a pin on a map. Nearly 28% of entries are simply called 'Unnamed Cave' (19,527 records), pointing to a significant data completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe (latitude median ~44°N with a tight interquartile range), but longitude outliers suggest a global but uneven spread. Among the minority of caves with access tags, the split between open ('yes'), restricted ('no'), and 'private' is worth exploring for any public-access analysis.

Open
023 / 233
profile reading

data trove lighthouses worldwide

/home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json · 14,585 rows × 13 cols

6× feature 3× label 1× metadata 1× foreign_key

This dataset contains 14,585 lighthouse and seamark records sourced from OpenStreetMap, covering navigational lights and related structures worldwide. The most immediately striking feature is that many descriptive columns (country, operator, year_built, height, and heritage) have null rates of 90% or higher, meaning the richest analysis must focus on the minority of well-filled records. Two columns worth close inspection are seamark_type, which splits cleanly into light_minor (3,496), light_major (3,051), and landmark (716) although 48% of rows have no value, and light_character, where 'Fl' (flashing) dominates at 74.7% of non-null values across 19 pattern types. Geographically, all 14,585 records carry latitude and longitude, revealing a notable left skew in latitude (mean 34.5°, median 40.8°) with 1,295 outliers, suggesting a clustering of records in the Northern Hemisphere with some Southern outliers worth mapping.

Open
024 / 233
profile reading

data trove noaa ovation aurora forecast

/home/coolhand/html/datavis/data_trove/data/quirky/ovation.json · 71 rows × 5 cols

3× feature 1× timestamp 1× label

This dataset captures 71 five-minute snapshots of auroral activity on January 20, 2026, with each row recording a timestamp, activity classification, intensity, and power readings for the northern and southern hemispheres. The most striking feature is that 'Storm' conditions dominate 55% of the observations, with 'Active' and 'Quiet' states making up the remainder — suggesting this day saw sustained geomagnetic disturbance. Both north and south power readings show wide, roughly uniform distributions (IQR ~110 GW) with medians well below their means, hinting that storm periods drive the upper range of power values. Intensity is similarly spread across most of its 0–1 range with near-zero skew, making the relationship between activity class and intensity worth exploring closely.

Open
025 / 233
profile reading

data trove natural satellites moons

/home/coolhand/html/datavis/data_trove/data/quirky/moons.json · 6 rows × 18 cols

12× feature 3× metadata 3× label

This dataset is a small orbital and physical reference catalogue of 6 notable moons in the solar system — Earth's Moon, Jupiter's four Galilean moons (Io, Europa, Ganymede, Callisto), and Saturn's Titan — sourced entirely from NASA JPL Horizons on 2026-01-19. The most interesting structural feature is the parent planet distribution: four of the six moons belong to Jupiter, making it the dominant host. Two orbital parameters flag outliers worth examining — eccentricity and inclination, where one moon (likely the Earth's Moon or Titan) sits clearly apart from the tightly clustered Galilean group. Physical size and mass are remarkably similar across all six, with diameters ranging from 3,122 to 5,262 km and masses between 0.008 and 0.025 Earth masses, suggesting this selection skews toward the largest moons in the solar system.

Open
026 / 233
profile reading

data trove solar system planets

/home/coolhand/html/datavis/data_trove/data/quirky/planets.json · 8 rows × 20 cols

15× feature 2× metadata 2× label 1× identifier

This dataset contains orbital and physical characteristics of all 8 planets in the Solar System, sourced from NASA JPL Horizons on 2026-01-19. The most striking feature is the extreme spread in planetary mass: values range from 0.0553 to 317.8 Earth masses, with 2 outliers (25% outlier rate) pulling the mean far above the median of 7.75 — a clear sign that Jupiter and Saturn dominate. Rotation period is equally dramatic, with a mean of -22.7 days and a minimum of -243.025 days, reflecting both retrograde rotation (Venus) and the very slow spin of some planets — worth examining closely. The dataset splits cleanly into 4 Inner Planets and 4 Outer Planets, and ring data (has_rings, ring radii) is only populated for 1 planet (87.5% null rate), consistent with Saturn being the sole ringed entry recorded.
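The gap between mean and median mass described above is easy to reproduce. The sketch below uses rounded textbook planet masses in Earth masses; these are assumed values for illustration, since only the minimum (0.0553) and maximum (317.8) are quoted in the reading and the file's exact figures are not shown.

```python
import statistics

# Planet masses in Earth masses. Rounded textbook values, assumed for
# illustration; the file's exact figures may differ.
masses = {
    "Mercury": 0.0553,
    "Venus": 0.815,
    "Earth": 1.0,
    "Mars": 0.107,
    "Jupiter": 317.8,
    "Saturn": 95.2,
    "Uranus": 14.5,
    "Neptune": 17.1,
}

mean_mass = statistics.fmean(masses.values())
median_mass = statistics.median(masses.values())
print(f"mean={mean_mass:.2f} median={median_mass:.2f}")
```

With these values the median falls midway between Earth (1.0) and Uranus (14.5), giving 7.75, while Jupiter and Saturn pull the mean up to roughly 55.8.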

Open
027 / 233
profile reading

data trove global volcanism program

/home/coolhand/html/datavis/data_trove/data/quirky/volcanoes.json · 200 rows × 9 cols

6× feature 3× label

This dataset captures 200 volcanic eruption records across 33 countries, covering events from 1900 to 1999, with 9 attributes including eruption intensity, volcano type, elevation, and geographic coordinates. The most striking feature is the heavy geographic concentration — Indonesia alone accounts for 28.5% of all records (57 out of 200), with Semeru appearing 13 times as the single most frequent volcano. Volcano type is strongly skewed toward stratovolcanoes, which make up 69.5% of all records, so the 'type' breakdown is worth examining to understand how rare other forms like calderas or shield volcanoes are by comparison. The Volcanic Explosivity Index (VEI) flags 15 outliers at the high end, with a maximum of 6.0 against a mean of 2.6, suggesting a small number of exceptionally powerful eruptions that deserve individual attention.

Open
028 / 233
profile reading

data trove waterfalls worldwide

/home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json · 80,678 rows × 9 cols

4× feature 2× label 2× metadata 1× other

This dataset is a global catalogue of 80,678 waterfalls sourced entirely from OpenStreetMap, covering geographic coordinates and basic descriptive attributes. The most striking finding is how sparse the data is: 89.9% of records carry only the generic description 'Waterfall' with no height recorded, and 59.7% of entries are named 'Unnamed Waterfall', suggesting the dataset is geographically broad but informationally thin. Height data is worth a closer look: where it does exist, values cluster at small measurements (2–10 metres), hinting at a possible recording bias toward easily measured falls. The geographic spread is genuinely global (latitude ranges from -77.7 to 78.7), but the country field is empty for 99.97% of records, so spatial analysis should rely on the raw coordinates rather than the country column.

Open
029 / 233
profile reading

data trove usgs significant earthquakes

/home/coolhand/html/datavis/data_trove/data/wild/usgs_significant_earthquakes.json · 3,742 rows × 11 cols

5× label 4× feature 1× timestamp 1× metadata

This dataset contains 3,742 records of significant earthquakes catalogued by the USGS, each describing a seismic event with location, magnitude, depth, and type. The vast majority (99.9%) are classified as earthquakes, with just 2 explosions and 1 landslide, so event type is not a useful differentiator. Two things stand out for closer inspection: first, depth_km is heavily right-skewed (median 10 km, mean 23.7 km, max 248.7 km) with 314 outliers, suggesting a small but important subset of unusually deep earthquakes worth isolating. Second, geographic concentration is striking — Alaska dominates the place names (appearing in roughly 1,991 records) and 'off the coast of Oregon' is the single most repeated location (151 times), pointing to a strong Pacific Northwest and Alaskan bias in this 'significant' events catalog. Magnitude ranges from 4.5 to 8.2 with a median of 4.8 and a long upper tail, meaning truly destructive events are rare outliers worth flagging.

Open
030 / 233
profile reading

data trove large meteorites 10kg

/home/coolhand/html/datavis/data_trove/data/wild/nasa_meteorites.csv · 45,716 rows × 20 cols

4× feature 4× label 3× identifier 3× timestamp 2× metadata 2× other

This dataset is a NASA meteorite landings catalogue covering 45,716 unique meteorite records with attributes including mass, classification, discovery year, and geographic coordinates. The most striking feature is the mass distribution: the median mass is just 32.6 g but the maximum reaches 60,000,000 g, producing extreme skew (skew=76.9) and over 7,000 statistical outliers — a handful of enormous meteorites are pulling the mean to 13,278 g. A second key finding is that 97.6% of records are classified as 'Found' rather than 'Fell', meaning nearly all entries are meteorites discovered on the ground rather than witnessed falling, which has strong implications for geographic and temporal bias in the data. The meteorite classification column (recclass) spans 466 types, dominated by ordinary chondrites (L6, H5, L5), and year of discovery shows a clear spike in the late 1990s–2000s likely tied to Antarctic collection campaigns.

Open
031 / 233
profile reading

data trove iso 639 3 language codes

/home/coolhand/html/datavis/data_trove/data/linguistic/language-families/iso-639-3-aliases.json · 1 rows × 2 cols

1× other 1× metadata

This dataset is a single-record JSON file related to ISO 639-3 language aliases, likely containing structured linguistic metadata about language family classifications. With only 1 row and 2 columns — both flagged as 'unknown' kind and skipped during profiling — there is essentially no statistical signal available for analysis. The file likely contains deeply nested or complex JSON structures (aliases and metadata) that a standard profiler cannot flatten automatically. The most valuable next step is to manually inspect the raw JSON structure to understand nesting depth and extract usable fields before any meaningful analysis can proceed.

Open
032 / 233
profile reading

data trove glottolog languoids

/home/coolhand/html/datavis/data_trove/data/linguistic/glottolog_languoid.csv · 23,740 rows × 16 cols

6× feature 4× label 3× foreign_key 2× free_text 1× identifier

This dataset is a comprehensive catalogue of the world's languoids from Glottolog, covering 23,740 entries that span dialects (10,920), languages (8,481), and language families (4,339). The most striking pattern is in endangerment status: while the majority (18,965) are marked 'safe', nearly 4,800 entries are endangered, extinct, or vulnerable — worth examining closely against family and geographic distribution. A second area of interest is the highly skewed child-count columns: most languoids have zero children (74% for dialects, 82% for languages), but a handful of nodes have hundreds or even thousands of descendants, suggesting a very uneven tree structure. Geographic coverage is also notably incomplete, with latitude and longitude missing for 66% of rows, limiting spatial analysis to a subset of the data.

Open
033 / 233
profile reading

data trove phoible phonetics database

/home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv · 105,484 rows × 49 cols

27× feature 11× label 2× foreign_key

This dataset is PHOIBLE, a cross-linguistic phonological inventory database containing 105,484 phoneme-level records spanning roughly 2,177 languages and dialects, each row describing a single phoneme and its distinctive feature values. The most immediate thing to examine is the breakdown by SegmentClass: consonants dominate (~68.5%), followed by vowels (~29.5%) and tones (~2%), which shapes how almost every other feature distributes. A second focus is the Source column, which reveals that data comes from eight different linguistic databases ('ph' alone accounts for 34%), meaning coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.

Open
034 / 233
profile reading

data trove wals world atlas of language structures

/home/coolhand/html/datavis/data_trove/data/linguistic/wals_parameters.csv · 192 rows × 5 cols

1× identifier 1× label 1× other 1× free_text 1× foreign_key

This dataset is a catalogue of 192 linguistic parameters from the World Atlas of Language Structures (WALS), each representing a distinct typological feature of human languages (e.g., 'Consonant Inventories', 'Vowel Quality Inventories'). Every row is uniquely identified by both an ID (e.g., '1A', '2A') and a Name, meaning there are no duplicates or groupings to aggregate within those columns. The most analytically interesting column is Chapter_ID, which groups these 192 parameters into 144 chapters — indicating that some chapters contain multiple parameters worth investigating. The Chapter_ID distribution is fairly uniform (mean ~84.5, median ~89.5, near-symmetric with slight left skew), suggesting chapters are spread across the full range with no heavy clustering.

Open
035 / 233
profile reading

data trove world languages endangerment silence

/home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json · 6,998 rows × 6 cols

4× feature 2× label

This dataset catalogues approximately 7,000 world languages, each with a name, geographic coordinates, speaker population, and endangerment status. The most striking finding is the extreme inequality in speaker populations: the median language has only 11,000 speakers while the maximum reaches nearly 1 billion, with over 16% of languages flagged as outliers. This is a classic long-tail distribution in which a handful of dominant languages account for a disproportionate share of speakers. Equally notable is the endangerment picture: while 44% of languages are classified as 'safe', a substantial share face real risk, with 1,753 'definitely endangered', 327 'critically endangered', and 219 already extinct. Top words in language names include 'sign', 'zapotec', 'mixtec', and directional qualifiers like 'southern' and 'northern', hinting at rich dialect clustering worth exploring geographically.

Open
036 / 233
profile reading

data trove world languages integrated

/home/coolhand/html/datavis/data_trove/data/linguistic/world_languages_integrated.json · 7,130 rows × 8 cols

2× label 2× other 2× feature 1× identifier 1× foreign_key

This dataset is a reference catalogue of 7,130 world languages, each identified by a unique ISO 639-3 three-letter code alongside its name and several linked data sources (Glottolog, Joshua Project, speaker counts, and more). Every row is distinct — no duplicate language codes or names — making this primarily a lookup/reference table. Two things stand out for closer inspection: first, the name column reveals notable clusters around directional qualifiers (Southern, Northern, Western, Eastern) and language families like Zapotec (58), Mixtec (52), and Naga (49), suggesting rich geographic and genealogical structure worth exploring. Second, 'sign' appears 152 times in language names, indicating a surprisingly large representation of sign languages across the world's documented tongues.

Open
037 / 233
profile reading

data trove healthcare deserts

/home/coolhand/html/datavis/data_trove/data/healthcare/healthcare_desert_merged.csv · 3,222 rows × 10 cols

6× feature 3× label 1× identifier

This dataset covers healthcare access indicators for 3,222 U.S. counties, combining population size, uninsured rates, poverty rates, and hospital closure risk scores. The most striking pattern is the extreme skew in both total population and uninsured population — the median county has just 25,328 residents and 36 uninsured individuals, yet outliers push the max to nearly 10 million people and over 20,000 uninsured, meaning a small number of large counties dominate the raw counts. Two things warrant a closer look: first, 84% of counties are rated 'Low' hospital closure risk, but nearly 29% score exactly zero on the closure risk score, suggesting the scoring may be coarser than it appears (only 3 unique values exist); second, 69% of counties are classified as Rural, yet uninsured rates range from 0% to 370% of expected norms with heavy right skew, pointing to pockets of severe coverage gaps worth isolating geographically.

Open
038 / 233
profile reading

data trove cms hospital database

/home/coolhand/html/datavis/data_trove/data/healthcare/cms_hospitals_2025.csv · 5,421 rows × 38 cols

24× feature 9× label 4× metadata 1× identifier

This dataset is a 2025 CMS (Centers for Medicare & Medicaid Services) registry of 5,421 U.S. hospitals, covering identity, location, ownership type, and performance ratings across mortality, readmission, safety, patient experience, and timely care measures. The most striking feature is that 47% of hospitals lack an overall star rating ('Not Available'), which severely limits any headline quality comparison and warrants investigation into which hospital types or states are disproportionately unrated. A second area worth scrutiny is ownership structure: voluntary non-profit private hospitals dominate at 42%, yet proprietary and government-run facilities make up a substantial share — cross-referencing ownership against star ratings could reveal systematic quality differences.

Open
039 / 233
profile reading

data trove veteran suicide rates

/home/coolhand/html/datavis/data_trove/demographic/veterans/military_firearm_suicide.csv · 50 rows × 4 cols

2× numeric_target 1× label 1× feature

This dataset contains state-level suicide rate statistics for all 50 U.S. states, comparing civilian and veteran populations along with a veteran risk ratio. The most striking signal is the scale of the veteran suicide burden: the mean veteran suicide rate (36.1 per 100k) is roughly double the civilian mean (17.6 per 100k), and the veteran risk ratio ranges from 1.8 to 3.23, meaning veterans are at minimum nearly twice as likely to die by suicide as civilians in every single state. The right-skewed distribution of the veteran risk ratio deserves closer attention — a handful of states show ratios above 2.4, suggesting particularly acute disparities worth investigating.

Open
040 / 233
profile reading

data trove veteran homelessness

/home/coolhand/html/datavis/data_trove/demographic/veterans/military_firearm_va_healthcare.csv · 50 rows × 2 cols

1× label 1× numeric_target

This dataset contains one row per U.S. state (all 50, no nulls) with a single metric: the percentage of veterans utilizing VA healthcare. The utilization rate ranges from 13.8% to 42.3%, with a mean and median both near 27%, suggesting a roughly symmetric distribution across states. The wide spread (an IQR of about 13 percentage points and a standard deviation of ~8 points) means the highest-uptake state has about three times the VA uptake of the lowest (42.3% versus 13.8%), which is worth investigating. Identifying which states cluster at the high and low ends could reveal geographic, demographic, or access-related patterns driving VA healthcare engagement.

Open
041 / 233
profile reading

data trove veteran employment statistics

/home/coolhand/html/datavis/data_trove/demographic/veterans/military_firearm_spouse_employment.csv · 15 rows × 3 cols

2× feature 1× label

This dataset captures spouse employment indicators — unemployment rate and labor force participation — across 15 U.S. states, likely in the context of military or veteran households. The most notable signal is in spouse unemployment rate, which ranges widely from 7.35% to 16.28% with a right skew and one flagged outlier at the high end, suggesting at least one state has a notably worse outcome for spouses. By contrast, spouse labor force participation is tightly clustered between 66.8% and 73.4% with no outliers, meaning most states see similar engagement levels even when unemployment varies — worth investigating whether high-unemployment states are simply retaining more job-seekers in the labor force.

Open
042 / 233
profile reading

data trove veteran population by county

/home/coolhand/html/datavis/data_trove/demographic/veterans/military_firearm_veterans.csv · 49 rows × 5 cols

3× feature 2× label

This dataset contains U.S. state-level veteran population statistics for 49 states, including total population, veteran counts, and the percentage of the population that are veterans. The most important signal is in veteran_percentage, which is extremely right-skewed (skew: 5.79) with 8 outliers and a max of 277.05, far above the median of 5.08. A share of the population cannot exceed 100%, so the largest values point to a calculation or unit error in at least some rows rather than genuinely elevated veteran shares, and those rows should be checked first. Total population and veteran population both distribute relatively evenly across states with no outliers, meaning the percentage anomalies are not simply a function of small population size.

Open
043 / 233
profile reading

data trove us military veteran analysis

/home/coolhand/html/datavis/data_trove/demographic/veterans/military_firearm_merged_analysis.csv · 54 rows × 23 cols

18× feature 3× label 1× metadata 1× numeric_target

This is a 54-row, state-level dataset merging U.S. military and veteran demographics with firearm licensing, suicide rates, and installation-level economic data. The most striking signal is the veteran suicide rate (mean 35.6, range 24.9–52.3), which is roughly double the civilian suicide rate (mean 17.2, range 7.7–28.9), and the veteran_risk_ratio column directly quantifies this gap (mean 2.2x) across states. A second area worth scrutiny is the extreme right-skew in active_duty_per_100k (median 92, max 5,544) and ffl_per_100k (median 12, max 342), suggesting a handful of states, likely those hosting large installations, are pulling these distributions hard. The installation-level columns (county, installation, economic impact) are also heavily null: they appear to be populated for only about 22% of rows, meaning the installation-linked data covers only ~12 records. Analysts should examine how firearm density and military concentration interact with veteran mental health outcomes across states.

Open
044 / 233
profile reading

data trove submarine cable map

/home/coolhand/html/datavis/data_trove/tools/fetchers/cache/submarine_cables.json · 1 rows × 4 cols

2× metadata 2× other

This dataset is a GeoJSON FeatureCollection describing global submarine cables, stored as a single-row JSON file where the entire dataset is packed into nested fields. The top-level metadata columns ('name' and 'type') each contain exactly one value — 'submarine_cables' and 'FeatureCollection' respectively — confirming this is a wrapper structure rather than a flat table. The real analytical content is buried inside the 'features' and 'crs' columns, which were skipped during profiling because they contain complex nested objects (geometries, cable properties, coordinates). To get meaningful insight, the 'features' array needs to be unpacked into individual cable records before any visualization or analysis can begin.
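
The unpacking step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not saturn's own loader: the sample document, the cable names, and the `flatten_features` helper are all hypothetical, and the real file's property names may differ.

```python
import json


def flatten_features(collection):
    """Turn a GeoJSON FeatureCollection dict into one flat dict per feature."""
    rows = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        row = dict(feature.get("properties") or {})  # copy the feature's attributes
        row["geometry_type"] = geometry.get("type")
        # Top-level length of the coordinates array
        # (the number of line parts for a MultiLineString).
        row["n_parts"] = len(geometry.get("coordinates") or [])
        rows.append(row)
    return rows


# Hypothetical two-cable sample standing in for the real file.
sample = json.loads("""
{
  "type": "FeatureCollection",
  "name": "submarine_cables",
  "features": [
    {"type": "Feature",
     "properties": {"id": "cable-a", "name": "Cable A"},
     "geometry": {"type": "MultiLineString",
                  "coordinates": [[[0.0, 0.0], [1.0, 1.0]]]}},
    {"type": "Feature",
     "properties": {"id": "cable-b", "name": "Cable B"},
     "geometry": {"type": "MultiLineString",
                  "coordinates": [[[2.0, 2.0], [3.0, 3.0]],
                                  [[3.0, 3.0], [4.0, 4.0]]]}}
  ]
}
""")

rows = flatten_features(sample)
for row in rows:
    print(row)
```

Each resulting row is an ordinary flat record, so the output can be written to CSV or JSONL and profiled like any other table.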

Open
045 / 233
profile reading

data trove us presidential election results by county

/home/coolhand/html/datavis/data_trove/geographic/election/2016_election.csv · 3,141 rows × 11 cols

5× feature 2× label 2× identifier 2× numeric_target

This dataset captures 2016 US presidential election results at the county level, covering all 3,141 counties across 51 state/territory abbreviations. The most striking pattern is the strong Republican lean in the median county: the median GOP vote share is 66.5% versus 28.6% for Democrats, though total votes are heavily right-skewed — a small number of large urban counties (max 2.65 million votes) dominate raw vote totals while most counties are small. The per-point difference column shows values ranging widely (e.g., 63% margins appear in the top values), suggesting many counties were not competitive at all. Texas leads with 254 counties, making state-level aggregation worth examining to see which states drive the most records and volume.

Open
046 / 233
profile reading

data trove country centroids

/home/coolhand/html/datavis/data_trove/geographic/country_centroids.json · 7,124 rows × 10 cols

4× other 3× feature 2× label 1× metadata

This dataset contains 7,124 geographic coordinate records, likely representing country or administrative centroid points sourced from Natural Earth 1:10m Admin 0 Label Points. The most striking issue is that all categorical attribute columns — including name, continent, iso_a2, iso_a3, region_un, and subregion — contain only empty strings, meaning the dataset is essentially stripped of its descriptive metadata and only the raw coordinates remain usable. The latitude values range from -83.1 to 83.2 with a mean around 22.9°, suggesting a moderate northern hemisphere bias, while longitude spans nearly the full global range (-180 to 180) with no outliers. Before any analysis, the empty categorical fields need to be investigated and repopulated, as the dataset in its current form cannot be used to answer any country- or region-level questions.

Open
047 / 233
profile reading

data trove scars standardized county analysis research system

/home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv · 3,221 rows × 20 cols

16× feature 2× foreign_key 1× label 1× identifier

This dataset covers 3,221 U.S. counties with demographic, economic, and electoral variables for the 2016 and 2020 presidential elections. The most striking finding is that Republican candidates dominated the majority of counties in both cycles — the median Republican share was roughly 67% in 2016 and 68% in 2020, while the Democratic median hovered near 29–30%, reflecting the well-known rural-county skew in U.S. politics. A data quality issue worth flagging immediately is the median_household_income column, which contains a minimum value of -666,666,666 — almost certainly a sentinel/error value — dragging the column mean to -$152,820 despite a plausible median of $52,380. Poverty rate averages about 15% across counties but reaches as high as 66%, and racial composition variables (pct_white, pct_black, pct_hispanic) are highly skewed, suggesting a small number of majority-minority counties sit at the extremes.
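
A short sketch of how a sentinel like this distorts the mean while leaving the median plausible. Only the sentinel value comes from the reading; the other income figures and the `clean` helper are made up for illustration.

```python
from statistics import mean, median

SENTINEL = -666_666_666  # minimum reported for median_household_income


def clean(values, sentinel=SENTINEL):
    """Drop sentinel-coded entries, keeping only real observations."""
    return [v for v in values if v != sentinel]


# Hypothetical county incomes; one row carries the sentinel.
incomes = [41_200, 52_380, 67_900, SENTINEL, 48_750]

raw_mean = mean(incomes)
cleaned = clean(incomes)

print(f"raw mean:     {raw_mean:,.0f}")
print(f"cleaned mean: {mean(cleaned):,.0f}")
print(f"raw median:   {median(incomes):,.0f}")
print(f"dropped:      {len(incomes) - len(cleaned)} row(s)")
```

Dropping the rows is the simplest choice; recoding the sentinel as a missing value instead keeps the row available for the other columns.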

Open
048 / 233
profile reading

data trove social norms graph

/home/coolhand/html/datavis/data_trove/data/quirky/social_graph.json · 1 rows × 2 cols

2× other

This dataset is a social graph stored as a JSON file with just two columns — 'links' and 'edges' — and a single row, suggesting the entire graph is encoded as nested or complex objects rather than a flat tabular structure. Because both columns were skipped during profiling, no statistical breakdown is available, meaning the real content (nodes and their connections) is buried inside those nested structures. The most important next step is to manually inspect the raw JSON to understand how nodes and links are structured before any analysis or visualization can proceed. Without unpacking the nested data, no meaningful patterns, counts, or distributions can be assessed.

Open
049 / 233
profile reading

data trove hot pepper varieties pepperscale

/home/coolhand/html/datavis/data_trove/data/quirky/peppers.json · 175 rows × 11 cols

6× feature 4× label 1× identifier

This dataset catalogs 175 pepper varieties with attributes covering heat (Scoville scale min/median/max), flavor profile, botanical type, geographic origin, and culinary use. The most striking feature is the extreme right-skew in all three Scoville columns: while the median pepper sits around 15,000–30,000 SHU, outliers push to 15–16 million SHU, meaning a small cluster of 'Super Hot' varieties dwarfs the rest of the dataset. A secondary angle worth exploring is the geographic and botanical spread — the United States leads origin with 46 entries, 'annuum' dominates species type at 104 of 175, and flavor descriptions cluster heavily around 'Sweet' and 'Sweet, Fruity', suggesting most peppers in this catalog are mild and food-friendly despite the headline-grabbing extremes.

Open
050 / 233
profile reading

data trove chocolate origins

/home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json · 2,530 rows × 10 cols

6× feature 1× other 1× free_text 1× timestamp 1× foreign_key

This dataset contains 2,530 chocolate bar reviews covering bean origins, cocoa percentages, ingredients, and expert ratings across reviews dated 2006–2021. Two things stand out: first, cocoa percent clusters tightly between 70–74% but has 235 outliers (9.3%) stretching up to 100%, suggesting a small but notable group of ultra-dark bars worth investigating. Second, ratings are slightly left-skewed, with a mean of 3.20 just below the median of 3.25 out of 4.0, indicating most bars are rated good-to-very-good, but the distribution of scores by bean origin (Venezuela, Peru, Dominican Republic, and Ecuador dominate) could reveal whether provenance drives quality. The 'company' column is entirely blank and should be ignored.

Open
051 / 233
profile reading

data trove openfoodfacts database

/home/coolhand/html/datavis/data_trove/cache/wild/openfoodfacts_sample.json · 50 rows × 545 cols

16× other 10× metadata 6× feature 4× label 3× free_text 1× identifier

This is a 50-product sample from the Open Food Facts database, an open crowdsourced food product catalogue with 545 columns spanning multilingual product names, ingredient texts, allergen data, nutritional scores, packaging details, and community metadata. The most striking structural issue is extreme sparsity: the vast majority of language-specific columns (e.g. product_name_dz, ingredients_text_ja) have null rates of 96–98%, meaning content is concentrated in French and English fields. Two things most deserve a closer look: first, the Nutri-Score distribution is heavily skewed toward grade 'e' (54% of products), suggesting the sample leans toward nutritionally poor items; second, scan counts (scans_n, mean 578, max 2523) show a strong right-skewed tail with a few highly popular products dominating community attention.

Open
052 / 233
profile reading

data trove grocery store density

/home/coolhand/html/datavis/data_trove/data/urban/food_deserts/urban_rural_comparison.json · 1 rows × 2 cols

2× other

This dataset appears to be a single-row comparison file contrasting metro and rural dimensions, likely related to food desert conditions given its source path. Unfortunately, the dataset contains only 1 row and both columns were skipped during profiling, meaning no statistical detail is available for either field. With just two columns and a single record, there is virtually nothing to analyze at this stage — the file may be a summary or aggregation stub rather than a full dataset. A closer look at the raw JSON source file is recommended to understand the actual structure and whether nested data exists within the metro and rural fields.

Open
053 / 233
profile reading

data trove snap participation benefits

/home/coolhand/html/datavis/data_trove/data/urban/food_deserts/snap_gap_states.json · 20 rows × 6 cols

4× feature 1× metadata 1× label

This dataset tracks SNAP (food stamp) program enrollment across 20 U.S. states, capturing estimated eligible populations, actual participants, and the resulting coverage gap. Two things stand out immediately: first, the enrollment rate and gap percentage are constant across all states (67% enrolled, 33% gap), suggesting these are summary-level figures rather than state-specific calculations — they should not be used for cross-state comparison. Second, the three population-count columns (eligible, participants, gap) are all heavily right-skewed with 2 outliers each, pointing to a small number of very large states — likely California and/or Florida — that dwarf the rest and will dominate any totals-based analysis.
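
One way to check a suspicious rate column is to flag columns that hold a single value and then recompute the rate from the counts. The sketch below uses made-up rows and placeholder column names. Whether the real file's counts disagree with the stated 67% is not known from the reading.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    columns = rows[0].keys()
    return [c for c in columns if len({r[c] for r in rows}) == 1]


# Hypothetical rows; state labels and counts are placeholders.
rows = [
    {"state": "A", "eligible": 1000, "participants": 640, "enrollment_rate": 67},
    {"state": "B", "eligible": 5000, "participants": 3600, "enrollment_rate": 67},
    {"state": "C", "eligible": 200, "participants": 150, "enrollment_rate": 67},
]

print("constant columns:", constant_columns(rows))

# Recompute a per-state rate from the counts for comparison.
for r in rows:
    r["computed_rate"] = round(100 * r["participants"] / r["eligible"], 1)
    print(r["state"], r["computed_rate"])
```

If the recomputed rates vary while the stored rate does not, the stored column is a summary figure and the recomputed one is the safer basis for cross-state comparison.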

Open
054 / 233
profile reading

data trove food desert states summary

/home/coolhand/html/datavis/data_trove/data/quirky/food_desert_states.json · 51 rows × 11 cols

9× feature 1× identifier 1× label

This dataset contains one row per U.S. state (plus D.C., 51 rows total) with figures on food desert populations, vehicle access, and poverty. The most striking feature is the extreme right-skew in desert-exposed population counts: the median desertPop is just 21,000 but the max reaches 449,000, with 6 outlier states driving the distribution far above the norm — a pattern mirrored almost identically in noVehicle counts. Poverty rate, by contrast, is far more normally distributed (mean 12.4%, std 2.6%), suggesting that food desert exposure is more strongly shaped by state size and car dependency than by poverty alone — worth cross-examining. The noVehiclePct column (max 17.37% vs. median 2.45%) flags a small handful of states with dramatically higher car-free household rates that likely align with the desertPop outliers.

Open
055 / 233
profile reading

data trove edible insects database

/home/coolhand/html/datavis/data_trove/data/quirky/insects_by_form.json · 7 rows × 2 cols

1× feature 1× label

This tiny dataset categorises insect-based food products into 7 form types and records how many products fall into each category. With only 7 rows, the big story is the extreme skew in counts: most form types have just 1–6 products, but one outlier category reaches 57, pulling the mean (10.7) far above the median (3.0). That dominant category is worth identifying immediately, as it likely represents the most commercially developed segment of the edible-insect market. The high standard deviation (20.6) confirms the distribution is anything but uniform.
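
The gap between mean and median can be reproduced on a small example. The seven counts below are invented to roughly match the summary figures (median 3, max 57, mean near 10.7); they are not the file's actual values. The Tukey fence rule used here is an assumption, since the reading does not say how saturn flags outliers.

```python
from statistics import mean, median, quantiles


def tukey_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical product counts per form type (7 rows).
counts = [1, 2, 2, 3, 4, 6, 57]

print("mean:    ", round(mean(counts), 1))
print("median:  ", median(counts))
print("outliers:", tukey_outliers(counts))
```

A single large category is enough to pull the mean well above the median, which is why the median is the better summary for a table this small.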

Open
056 / 233
profile reading

data trove wine varieties regions

/home/coolhand/html/datavis/data_trove/data/quirky/wine_by_country.json · 62 rows × 2 cols

1× feature 1× label

This dataset lists wine production (or a related wine metric) aggregated by country, covering 62 countries each with an associated count. The count distribution is extremely skewed: the median is just 2, yet the mean is nearly 19 and the maximum reaches 476, with 10 flagged outliers — suggesting a small handful of countries dominate the wine landscape entirely. France tops the list and is worth examining alongside the other high-count outliers to understand which countries drive the bulk of the totals.

Open
057 / 233
profile reading

data trove world of cheese

/home/coolhand/html/datavis/data_trove/data/quirky/cheese_list.json · 7,146 rows × 4 cols

2× label 1× other 1× feature

This dataset is a multilingual catalogue of 7,146 cheese products spanning 32 categories and 111 countries of origin. The most immediately striking pattern is the geographic concentration: France alone accounts for 26% of all entries (1,853), followed by Germany and the United States, suggesting the dataset skews heavily toward Western European dairy traditions. On the category side, Cream Cheese dominates with 1,187 entries (17%), and the top 5 categories together cover over half the dataset — worth examining for potential over-representation. The 'value' column is entirely constant at 1.0 and can be safely ignored. Note also that the product names are highly multilingual (30 languages detected) with an 11% duplicate rate, indicating some cheese types are listed under multiple language variants.

Open
058 / 233
profile reading

data trove onion headlines

/home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv · 2,103 rows × 3 cols

1× metadata 1× free_text 1× identifier

This dataset is an index of 2,103 articles from The Onion, a satirical news outlet, containing article headlines, thumbnail image URLs, and a sequential row ID. The headline column is the most analytically interesting field, with a vocabulary of 7,613 unique words and mean headline length of about 9 words and 61 characters — worth exploring for length distribution and common vocabulary patterns. There are also 209 duplicate image URLs (~10% of rows), suggesting some thumbnails are reused across multiple articles, with one image appearing 11 times.

Open
059 / 233
profile reading

data trove boy bands

/home/coolhand/html/datavis/data_trove/entertainment/pop_culture/Boy Band.csv · 15 rows × 4 cols

2× identifier 1× label 1× feature

This dataset is a small reference list of famous boy bands (15 rows), capturing each band's name and its years active. The most immediately interesting angle is the band frequency distribution: four bands (Westlife, Jonas Brothers, Take That, and Blue) each appear twice, suggesting possible duplicate rows or multiple entries per group worth investigating. The Years Active column is entirely unique across all 15 rows, spanning acts from 1958 (The Osmonds) to present-day groups, hinting at a wide generational spread that could reward closer reading.
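
A quick way to surface repeated names before deciding whether they are true duplicates is a frequency count. The band names below are taken from the reading; the row order and the `repeated` helper are hypothetical.

```python
from collections import Counter


def repeated(values):
    """Return {value: count} for values that appear more than once."""
    return {v: n for v, n in Counter(values).items() if n > 1}


# Hypothetical excerpt of the band column.
bands = ["Westlife", "Take That", "Blue", "Westlife", "The Osmonds",
         "Jonas Brothers", "Blue", "Take That", "Jonas Brothers"]

dupes = repeated(bands)
print(dupes)
```

Because Years Active is unique on every row, a repeated band name with two different year ranges is more likely a second active period than an accidental duplicate, so the rows should be compared before any are dropped.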

Open
060 / 233
profile reading

data trove bond girls

/home/coolhand/html/datavis/data_trove/entertainment/film/bond_girls.csv · 71 rows × 11 cols

6× feature 4× label 1× numeric_target

This dataset covers 71 records of Bond Girls across 25 James Bond films, tracking each actress alongside financial performance, ages, directors, and Bond actors. The most striking signal is the age disparity: Bond girls average 28.9 years old while Bond actors average 43.1 years — a gap of roughly 14 years that is remarkably consistent across the franchise. Box office figures show strong right skew with outliers (actual revenue ranges from $59.5M to $1,108.6M, with a mean well above the median), suggesting a handful of films massively outperformed the rest. Sean Connery dominates the Bond actor column with 23 appearances, and Guy Hamilton is the most prolific director with 14 entries, both worth examining against box office performance.

Open
061 / 233
profile reading

data trove steam users

/home/coolhand/html/datavis/data_trove/entertainment/gaming/users_metadata.json · 1 rows × 6 cols

4× metadata 1× timestamp 1× other

This is a single-row metadata record describing the Steam Users dataset, a collection of 14,306,064 Steam user profiles sourced from Steam Store data (likely via Kaggle or SteamSpy) and last updated on 2025-01-20. Rather than being an analytical dataset itself, it serves as a data catalogue entry pointing analysts toward the actual user data file (185 MB) which links to a recommendations.csv via user_id. The most important thing to note is the scale: over 14 million user profiles covering library size and review activity represent a substantial analytical resource. Before diving in, analysts should locate and join the referenced recommendations.csv to unlock the full relational value of this dataset.

Open
062 / 233
profile reading

data trove steam user reviews

/home/coolhand/html/datavis/data_trove/entertainment/gaming/recommendations_metadata.json · 1 rows × 6 cols

5× metadata 1× other

This is a single-row metadata descriptor for the 'Steam Game Recommendations' dataset, last updated 2025-01-20 — it is a catalog entry rather than the underlying data itself. The key takeaway is the scale of what it describes: 41.1 million Steam user reviews stored in a 1.9 GB file, sourced likely via Kaggle or SteamSpy. The metadata notes that the full dataset links to companion files (games.csv and users.csv) via app_id and user_id, and includes playtime and helpfulness metrics — making those join keys the most important fields to validate before any analysis. Analysts should treat this file as a data dictionary and move quickly to the referenced source files for substantive exploration.

Open
063 / 233
profile reading

data trove steam games catalog

/home/coolhand/html/datavis/data_trove/entertainment/gaming/enriched/games.csv · 122,611 rows × 40 cols

21× feature 6× label 4× metadata 3× free_text 1× foreign_key 1× timestamp 1× other 1× identifier

This is a Steam games catalogue with 122,611 rows and 40 columns, covering titles, publishers, developers, genres, pricing, review counts, and associated URLs. The most important thing to examine first is the extreme skew across nearly all numeric engagement columns (column23, column24, column27, column29, column31): medians sit at 0–5 while means run into the hundreds or thousands, meaning a tiny fraction of blockbuster titles account for the vast majority of reviews and activity. A second area worth attention is genre distribution (column36), where just a handful of Casual/Indie/Action combinations account for the bulk of the catalogue, and the estimated owner-count banding (column03) shows over 61% of games have fewer than 20,000 owners — pointing to a long-tail market dominated by low-visibility titles.

Open
064 / 233
profile reading

data trove nyc housing analysis

/home/coolhand/html/datavis/data_trove/economic/housing/nyc/nyc_housing_metrics_merged.csv · 2,327 rows × 23 cols

18× feature 2× label 2× foreign_key 1× numeric_target

This dataset covers housing affordability metrics for 2,327 census tracts across New York City's five boroughs, with variables spanning rent burden, household income, gross rent, and tenure type. The most urgent data quality issue is that both `median_gross_rent` and `median_household_income` contain extreme negative sentinel values (min of -666,666,666), which wildly distort their means and standard deviations — these columns must be filtered or recoded before any analysis. Substantively, rent burden is the headline story: the median tract has 50% of renter households paying more than 30% of income on rent (`pct_rent_burdened` median = 50.0), and severe burden (≥50% of income) affects a median of 26.2% of renters per tract. Brooklyn leads in tract count (805 tracts, 34.6% of the dataset), followed by Queens (725) and the Bronx (361), so borough-level comparisons are feasible but uneven in sample size.

Open
065 / 233
profile reading

data trove temperature anomalies nasa giss

/home/coolhand/html/datavis/data_trove/environmental/temperature_anomalies/temperature_anomalies_1880_2015.csv · 146 rows × 19 cols

18× feature 1× timestamp

This dataset contains 146 years of global temperature anomaly records (1880–2025), with monthly, seasonal, and annual mean anomaly values expressed in degrees relative to a baseline. The most important pattern to look for is the right-skewed distribution present across virtually every time column: medians sit near or below zero while means are positive and maximums reach 1.2–1.48°C, strongly suggesting a warming trend concentrated in recent decades. October stands out with the highest outlier rate (5.5%, 8 outliers) and a mean of 0.107°C — worth examining for unusual warm spikes. The annual J-D and D-N summary columns provide the clearest single-column view of the long-run warming signal across the full 146-year span.

Open
066 / 233
profile reading

data trove noaa significant storms

/home/coolhand/html/datavis/data_trove/data/wild/weather/noaa_significant_storms.json · 14,770 rows × 14 cols

6× feature 4× label 3× metadata 1× timestamp

This dataset contains 14,770 records of significant US storms sourced from the NOAA Storm Events Database, covering events across all 50+ states with dates, locations, event types, casualties, and property damage estimates. The most striking pattern is the dominance of tornadoes (6,334 events, 43% of all records), far outnumbering the next categories of Flash Flood and Thunderstorm Wind. Two dates worth flagging immediately are 1974-04-03 (126 events, the Super Outbreak) and 2011-04-27 (105 events, the 2011 Super Outbreak), suggesting this dataset captures landmark multi-tornado outbreaks disproportionately. Property damage skews heavily toward million-dollar figures, with '2.5M' being the single most common damage value (2,278 occurrences), hinting at possible rounding or a threshold-based inclusion criterion. Texas leads all states with 1,450 events, more than double the next state (Missouri at 648), reflecting both its geographic size and exposure to severe weather corridors.

067 / 233
profile reading

data trove noaa lightning strikes 2018

/home/coolhand/html/datavis/data_trove/data/wild/weather/monthly_heatmap.json · 59,070 rows × 4 cols

4× feature

This dataset contains 59,070 records of lightning strike activity, each described by geographic coordinates (latitude and longitude), a month, and a strike count. The strikes column is highly right-skewed (skew ~2.0, max 531 vs. median 34), meaning a small number of locations experience dramatically more lightning than typical — these ~2,900 outlier records are worth investigating. Latitude also shows ~9.5% outlier rate with a northward skew, suggesting strike activity is concentrated in a core geographic band but with notable events at higher latitudes.

068 / 233
profile reading

data trove global shark attack file gsaf

/home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv · 6,462 rows × 24 cols

7× label 5× metadata 5× feature 4× identifier 1× foreign_key

This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning centuries of documented cases. The most important thing to examine first is the attack outcome: roughly 75% of incidents are non-fatal ('N'), but 1,400 are recorded as fatal ('Y'), while the 'Injury' column shows only 823 entries simply marked 'FATAL', a mismatch worth cross-checking for data consistency. A second priority is the geographic and activity breakdown: the USA dominates with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are by far the most frequently recorded activities. The 'Year' column carries a data quality warning: a maximum value of 3019 and high kurtosis signal outliers that should be cleaned before any time-series analysis.

069 / 233
profile reading

data trove airplane crashes fatalities 1908 2009

/home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv · 5,268 rows × 13 cols

5× label 4× feature 1× identifier 1× timestamp 1× numeric_target 1× free_text

This dataset catalogues 5,268 aviation accidents spanning roughly a century, recording details such as date, operator, aircraft type, location, passengers aboard, fatalities, and ground casualties. Two numeric columns stand out immediately: Fatalities (mean 20, max 583) and Aboard (mean 28, max 644) are both highly right-skewed with significant outliers, suggesting a small number of catastrophic mass-casualty events dominate the tail. The Operator column reveals that Aeroflot (179 incidents) and U.S. military branches collectively account for a large share of recorded crashes, worth examining for era-specific clustering. Ground fatalities are near-zero in 95% of cases but spike dramatically in rare events (max 2,750), likely reflecting high-profile urban crashes.

070 / 233
profile reading

data trove wikipedia pageviews

/home/coolhand/html/datavis/data_trove/data/attention/wikipedia_pageviews.json · 1 rows × 2 cols

1× other 1× metadata

This dataset is a single-row JSON file containing Wikipedia pageview data, with two columns — 'countries' and 'metadata' — both of which were skipped during profiling and returned no usable statistics. With only 1 row and no resolved data types or value distributions, there is virtually nothing to analyze at this stage. The file likely contains nested or complex JSON structures that require unpacking before any meaningful analysis can begin. The immediate priority should be to inspect the raw file contents and flatten or parse the nested fields.
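
One way to do the unpacking step is to flatten the nested record into dotted column names before profiling. The sketch below uses only the standard library; the sample record is a hypothetical stand-in, since the real nesting inside 'countries' and 'metadata' is not known from the reading.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated path.
        flat[prefix[:-1]] = obj
    return flat


# Hypothetical record shape; only the two top-level keys come from the reading.
raw = '{"countries": {"US": {"views": [10, 20]}}, "metadata": {"source": "example"}}'
flat = flatten(json.loads(raw))
print(flat)
```

Each leaf becomes its own column, so a profiler that skips nested objects can then resolve types and distributions for the flattened fields.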

071 / 233
profile reading

data trove gdelt global events

/home/coolhand/html/datavis/data_trove/data/attention/gdelt_timeline.json · 1 rows × 2 cols

1× other 1× metadata

This dataset is a single-row JSON file (gdelt_timeline.json) containing two columns — 'countries' and 'metadata' — likely representing a GDELT event timeline snapshot. With only 1 row and both columns flagged as 'unknown' kind with no computable statistics, there is effectively no statistical signal to analyse at this stage. The file may contain nested or complex JSON structures (arrays, objects) that require unpacking before any meaningful exploration can begin. The immediate priority should be inspecting the raw structure of both columns to understand what nested data they contain.

072 / 233
profile reading

data trove nyc 311 service requests

/home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample.json · 1,000 rows × 47 cols

16× label 13× feature 5× foreign_key 3× timestamp 1× identifier 1× other

This dataset is a sample of 1,000 NYC 311 service requests, capturing complaints logged across the five boroughs with details on complaint type, location, agency, and resolution status. The dominant signal is complaint type: 'Noise - Residential' alone accounts for 39.3% of records, followed by 'Illegal Parking' (19.7%) and 'Noise - Commercial' (14.8%), pointing to a dataset heavily skewed toward NYPD-handled quality-of-life complaints (NYPD handles 88% of cases). A second area worth examining is resolution status — 61% of complaints are closed, 30.5% are still in progress, and 8.5% remain open, which raises questions about agency workload and response time. Many specialty columns (road_ramp, taxi_company_borough, bridge_highway fields) are nearly entirely null (99%+), indicating they apply only to rare complaint subtypes and can largely be ignored.

073 / 233
profile reading

data trove accessibility audit tools

/home/coolhand/html/datavis/data_trove/accessibility_audit/web_accessibility_data_top100.csv · 92 rows × 6 cols

4× feature 1× identifier 1× label

This dataset is a web accessibility audit of approximately 100 top websites, covering 92 rows with metrics on error counts, error density, popularity rank, and automated WAVE tool scores. The most striking finding is that both 'errors' and 'error_density' are heavily right-skewed with extreme outliers — the median error count is just 5, but the max reaches 364, suggesting a small cluster of sites are dramatically worse than the rest. A second angle worth exploring is the 'notes' column, where 'Low contrast text' dominates as the most common accessibility issue (12 occurrences), pointing to a systemic problem across high-traffic sites. The near-uniform distribution of 'popularity_rank' suggests the sample spans the full top-100 range evenly, making comparisons across popularity tiers feasible.

074 / 233
profile reading

data trove acs disability statistics

/home/coolhand/html/datavis/data_trove/cache/accessibility/census_disability_states_2021.json · 1 rows × 2 cols

1× other 1× metadata

This dataset appears to be a JSON file containing 2021 US Census disability data by state, but it contains only 1 row with two opaque columns ('data' and 'metadata') that were skipped during profiling. The profiler was unable to parse the internal structure of either column, meaning the actual disability statistics are likely nested within those fields. Before any analysis can proceed, the dataset will need to be unpacked or flattened so that state-level disability metrics become accessible as individual columns.

075 / 233
profile reading

data trove wlasl word level american sign language

/home/coolhand/html/datavis/data_trove/cache/accessibility/wlasl_index.json · 2,000 rows × 2 cols

1× label 1× other

This dataset appears to be a sign language lexicon index (WLASL — Word-Level American Sign Language), containing 2,000 entries each pairing a gloss (a written word label for a sign) with associated instances, likely video or image examples. Every gloss is unique, confirming this is a vocabulary index rather than a repeated-observation log. The gloss labels are almost entirely single words (97.75% one-word rate) and are short, averaging just 6 characters, covering everyday vocabulary like 'up', 'hearing', 'dog', and 'hot'. The most interesting angle to explore is the 'instances' column, which is currently unanalysed — the number of example instances per sign likely varies considerably and would reveal which signs are well-represented versus data-sparse.

076 / 233
profile reading

data trove census communication disability data

/home/coolhand/html/datavis/data_trove/cache/accessibility/census_communication_states_2021.json · 1 rows × 2 cols

1× other 1× metadata

This dataset appears to be a single-record JSON file containing US Census communication and accessibility data by state for 2021, with only two columns — 'data' and 'metadata' — both of unknown/unparsed type. The file has just 1 row and no statistics could be extracted, meaning the profiler was unable to parse the nested structure of the JSON. Before any analysis is possible, the file likely needs to be flattened or unpacked to expose the underlying state-level records. No meaningful patterns or outliers can be identified at this stage.

077 / 233
profile reading

data trove shipwrecks

/home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json · 6,914 rows × 14 cols

6× feature 3× label 2× metadata 1× foreign_key 1× free_text 1× identifier

This dataset is an OpenStreetMap-derived catalogue of 6,914 shipwrecks and related maritime hazards mapped globally. The most important thing to explore first is the `type` and `seamark_type` columns, which reveal that the overwhelming majority (~73-78%) of entries are labelled simply 'shipwreck' or 'wreck', with a long tail of submarines, aircraft, barges, and other vessels worth examining. A secondary point of interest is the high null rates across many descriptive fields — `heritage` (99.8% null), `year_sunk` (99.5% null), and `wikipedia` (95.5% null) — meaning rich contextual data exists for only a tiny fraction of wrecks, and the dataset is far more useful as a spatial inventory than a historical record. The `access` column, where populated, shows most accessible wrecks are open ('yes'), but a meaningful share require permits or are private, which could interest dive-site analysts.

078 / 233
profile reading

create a nsfw visual novel outline

/tmp/saturn-uploads/fbb13d662265/create-a-nsfw-visual-novel-outline.json · 1 rows × 8 cols

5× other 2× metadata 1× foreign_key

This dataset is a single-record JSON file representing a structured interactive story or visual novel project built with the Storyblocks Studio schema (version 'storyblocks-studio/v1'). The file contains just one row and 8 columns, meaning it is a configuration or project document rather than a tabular dataset in the traditional sense. The most analytically meaningful columns — acts, characters, locations, scenes, variables, and metadata — were all skipped during profiling, likely because they contain nested or complex object structures that require deeper parsing. To get real value from this file, the nested columns (especially 'scenes', 'characters', and 'acts') should be unpacked and analysed individually.

079 / 233
profile reading

quirky ufo sightings 20260121

/home/coolhand/html/datavis/data_trove/cache/quirky/ufo_sightings_20260121.parquet · 147,890 rows × 13 cols

4× feature 3× timestamp 3× free_text 1× identifier 1× metadata 1× label

This dataset contains 147,890 UFO sighting reports across 13 columns, mixing free-text descriptions (Summary, Text, Location details), structured categoricals (Shape, Explanation), timestamps (Occurred, Reported, Posted), and a numeric witness count. The Shape field is a clean place to start: 39 categories with 'Light' leading at ~27,494 sightings, followed by Circle and Triangle. Two things deserve a closer look. First, 'No of observers' is extremely skewed — values run from -10 to 20,000 with a median of 2 and over 18,000 outliers, suggesting data-entry errors that need cleaning before any aggregation. Second, the Explanation column is 99.46% null, so claims about 'what UFOs really were' rest on under 800 labelled rows, dominated by Starlink and rocket attributions. Location is dense and US-heavy (Phoenix, Seattle, Las Vegas top the list), and the Characteristics field collapses to ~43 vocabulary tokens dominated by 'Lights on object'.

080 / 233
profile reading

meg c middle english texts

/home/coolhand/servers/diachronica/corpus/historical-corpora/meg-c/middle_english_texts.jsonl · 2 rows × 6 cols

3× metadata 1× label 1× feature

This column holds the work title for each row, with both of the 2 records carrying a distinct literary name ("The Canterbury Tales (Prologue)" and "Sir Gawain and the Green Knight"). Cardinality equals row count, so every value is unique and entropy_ratio is 1.0 — the long_tail alert simply reflects this tiny, fully-distinct sample. There is nothing to aggregate on here yet.

081 / 233
profile reading

exoplanets exoplanets

/home/coolhand/data/celestial/exoplanets/exoplanets.csv · 6,150 rows × 11 cols

8× feature 1× identifier 1× foreign_key 1× timestamp

This dataset catalogs 6,150 exoplanets across 11 columns, mixing identifiers (pl_name, hostname), discovery metadata (discoverymethod, disc_year), sky coordinates (ra, dec), and physical measurements (pl_bmassj, pl_orbsmax, pl_rade, pl_orbper, sy_dist). Discovery is heavily dominated by the Transit method at 73.4% of records, with Radial Velocity a distant second, which is worth noting because it shapes which kinds of planets are represented. The physical measurement columns are all extremely skewed with heavy outliers: pl_orbper has a skew of ~43.8 and a max of 8,040,000 days, and pl_orbsmax similarly stretches to 19,000 AU, so any analysis should use log scales or trimming. Also flag that pl_bmassj is missing for 50.3% of rows and pl_orbsmax for 37.4%, which limits joint mass/orbit analyses. Discovery year has a median of 2016 and ranges from 1992 to 2026, giving a clear timeline of the field's growth.

082 / 233
profile reading

witnessed meteorite falls witnessed meteorite falls

/home/coolhand/datasets/witnessed-meteorite-falls/witnessed_meteorite_falls.json · 1,097 rows × 10 cols

3× metadata 2× feature 1× identifier 1× free_text 1× timestamp 1× other 1× label

This dataset catalogs 1,097 witnessed meteorite falls, with each row identified by a unique name and described by date, geographic coordinates, meteorite class, and a short description. Two columns (category and fall_type) are constants ('witnessed_meteorite_falls' and 'Fell') and offer no analytical value. The most informative dimensions are meteorite_class — heavily dominated by L6 (260 falls, ~24%) followed by H5 (163) and H6 (91) — and the latitude/longitude pair, where latitude skews north (median 36.1) with about 8% outliers and longitude spans the full globe. The date column covers 231 distinct years with 1933 as the most frequent (17 falls), suggesting room for a time-trend exploration.

083 / 233
profile reading

wild openfoodfacts sample

/home/coolhand/html/datavis/data_trove/cache/wild/openfoodfacts_sample.json · 50 rows × 545 cols

176× free_text 152× metadata 103× other 82× feature 11× identifier 9× label 7× timestamp 5× foreign_key

This is a 50-row sample from Open Food Facts with 545 columns, dominated by per-language localized fields (product names, generic names, ingredient texts, packaging texts, origin) plus nutrition, scoring, and provenance metadata. The shape is extremely sparse: the vast majority of localized columns have null rates of 0.92–0.98, so most analytical signal lives in a small core of fields. Worth a closer look first: the Nutri-Score and NOVA distributions (the catalog skews heavily to grade 'e' and NOVA group 4), the Eco-Score grade mix, and the food_groups / pnns_groups_2 breakdown showing this sample is concentrated in chocolate and biscuit products. Also note the heavy imbalance in `lang` (70% French) and `countries_lc`, which biases any text or origin analysis. Treat the hundreds of `*_xx` / `ingredients_text_` columns as effectively empty rather than as features.

084 / 233
profile reading

wild nyc 311 sample 20260121

/home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample_20260121.json · 1,000 rows × 47 cols

24× feature 12× metadata 4× foreign_key 3× timestamp 3× label 1× identifier

This is a 1,000-row sample of NYC 311 service requests (47 columns), almost entirely categorical, capturing complaints by agency, location, and resolution status. NYPD dominates routing at 59.4% of requests, followed by HPD (23.2%) and DSNY (8.2%), and the top complaint types are Noise - Residential (23.4%), Illegal Parking (18.7%), and HEAT/HOT WATER (17.0%) — a good first place to look. Geographically, Brooklyn (31.2%), Queens (26.1%), and the Bronx (23.0%) account for most cases, while Staten Island is just 2.2%. Status is split across In Progress (38.8%), Closed (33.6%), and Open (27.6%), so a sizable share remains unresolved. Note that many specialized fields (taxi, bridge/highway, vehicle, facility_type) are >95% null and not informative, and `location` was skipped during profiling.

085 / 233
profile reading

wild nyc 311 sample

/home/coolhand/html/datavis/data_trove/cache/wild/nyc_311_sample.json · 1,000 rows × 47 cols

21× feature 15× metadata 3× timestamp 3× foreign_key 2× label 1× identifier 1× other 1× free_text

This is a 1,000-row sample of NYC 311 service requests with 47 columns covering complaint metadata, location, agency routing, and resolution status. The bulk of activity is noise and parking complaints routed to NYPD (88% of all tickets), with 'Noise - Residential' alone accounting for 393 of 1,000 rows and 'Loud Music/Party' the dominant descriptor at 427. Geographically the load is spread across Queens, Manhattan, Brooklyn and the Bronx fairly evenly, while Staten Island barely registers (11 rows). Worth a closer look: the resolution funnel (61% of tickets are Closed but 30.5% are still In Progress) and the channel mix, where ONLINE (466) roughly rivals MOBILE and PHONE combined. Note that many specialty fields (road_ramp, taxi_*, bridge_highway_*, facility_type) are >99% null and should be ignored for analysis.

086 / 233
profile reading

wild ghost sightings

/home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv · 10,992 rows × 12 cols

9× feature 2× free_text 1× metadata

This dataset catalogs 10,992 reported ghost sightings across the United States, with each record describing a location (city, state, latitude/longitude) and a free-text description of the sighting. Every record is in the United States (country has only 1 unique value), and California, Texas, and Pennsylvania lead the state counts — California alone holds about 9.7% of all sightings. The location column is rich with thematic words like 'school', 'cemetery', 'high', and 'house', hinting at the kinds of places people report hauntings. Worth a closer look: the geographic skew toward a few populous states, and the description field which averages ~70 words per entry and is nearly all unique — a good candidate for text mining. Note that latitude/longitude have ~11.5% nulls and a handful of outliers (including a min latitude of -45 that falls outside the US).

087 / 233
profile reading

wild bigfoot sightings

/home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json · 5,411 rows × 9 cols

4× feature 2× identifier 1× timestamp 1× label 1× free_text

This dataset catalogs 5,411 Bigfoot sighting reports from the BFRO database, with fields covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Geographically, sightings concentrate heavily in Washington (631), California (431), and Ohio (317), and the most common county is Pierce — worth a closer look as the data skews toward the Pacific Northwest. Temporally, the year distribution is left-skewed (mean 1997, median 2001, range 1870–2025), so most reports come from the late 1990s onward, and August/October/July dominate the month field, hinting at a warm-season reporting pattern. Classification is nearly a coin-flip between Class A (2,655) and Class B (2,722), with Class C almost absent (34) — that imbalance is something to flag before any modeling. Note also that 338 county values are empty even though state coverage is complete.

088 / 233
profile reading

web accessibility data top100

/tmp/saturn-uploads/f55be675ada5/web_accessibility_data_top100.csv · 92 rows × 6 cols

4× feature 1× identifier 1× free_text

This dataset profiles 92 popular websites with accessibility metrics, including error counts, error density, and two ranking signals (popularity and wave rank). The error metrics are highly skewed: errors range from 0 to 364 with a median of just 5, and 8 sites (about 9%) qualify as outliers, worth flagging as the worst offenders. The notes field is the richest qualitative signal, with 'Low contrast text' (12 sites) and 'Missing form input label' (9 sites) dominating the issue mix. Popularity_rank is evenly spread across 1–100, so it works well as a control axis when comparing error patterns across the popularity spectrum.

089 / 233
profile reading

vizwiz val annotations

/home/coolhand/html/datavis/data_trove/cache/vizwiz_val_annotations.json · 4,319 rows × 5 cols

2× label 1× identifier 1× free_text 1× other

This dataset contains 4,319 rows from the VizWiz validation annotations, structured around image filenames, the questions asked about each image, the answers, an answer_type label, and an answerable flag. The questions column is the most interesting: about 35% are duplicates, with 'What is this?' alone appearing 523 times, suggesting a heavy concentration of generic identification queries. Answer_type is dominated by 'other' (62%) and 'unanswerable' (32%), and the answerable flag confirms that roughly 32% of items are flagged as not answerable — a key signal for any downstream modeling. The image column is uniquely identifying per row and not worth deeper analysis, while the answers column was skipped by the profiler.

090 / 233
profile reading

veterans merged county analysis

/home/coolhand/html/datavis/data_trove/data/policy/veterans/merged_county_analysis.csv · 3,144 rows × 18 cols

8× feature 5× identifier 2× foreign_key 2× label 1× metadata

This dataset contains 3,144 rows, one per U.S. county, combining Census geographic identifiers (GEOID, STATE_NAME, NAMELSAD, ALAND, AWATER) with veteran and active-duty military estimates and rate-normalized fields. The raw count columns (total_pop, active_duty_est, veterans_est, ALAND) are extremely right-skewed with skew values above 8 and hundreds of outliers each, so any analysis on them should use logs or per-capita versions. The rate columns tell a cleaner story: active_duty_per_10k is roughly symmetric (skew -0.38, mean ~4,694 per 10k), although a mean that high would imply nearly 47% of residents on active duty, so the column's units or scaling should be verified before use; veterans_per_100 is mildly right-skewed (mean 6.19, max 18.09) and is the better candidate for ranking counties. State coverage is uneven (Texas alone supplies 254 counties, 8.1%, followed by Georgia and Virginia), which matters when aggregating. Note also that LSAD is heavily imbalanced (95% code '06') and GEOID and fips are duplicates of each other.

091 / 233
profile reading

usgs significant earthquakes usgs significant earthquakes

/home/coolhand/datasets/usgs-significant-earthquakes/usgs_significant_earthquakes.json · 3,742 rows × 11 cols

5× feature 4× metadata 1× free_text 1× identifier

This dataset contains 3,742 records of significant earthquakes from USGS, with 11 columns covering location (latitude, longitude, place/name), magnitude, depth, and event metadata. Magnitude is tightly clustered between 4.5 and 5.1 (median 4.8) but has a long right tail reaching 8.2, with 184 outliers worth examining for the rare large events. Depth_km is highly skewed (skew 3.07) with a median of 10 km but a max of 248.7 km and 314 outliers, suggesting a mix of shallow and deep-focus quakes. Geographically, the data is heavily concentrated around Alaska — 'alaska' appears in 1,991 place names and 'off the coast of Oregon' alone accounts for 151 records — so this is effectively a North Pacific / Alaska-dominated sample rather than a global one. Note that the `category` column is constant ('significant_earthquakes') and `earthquake_type` is 99.9% 'earthquake', so neither will be useful for segmentation.

092 / 233
profile reading

us housing affordability crisis housing crisis merged

/home/coolhand/datasets/us-housing-affordability-crisis/housing_crisis_merged.csv · 3,222 rows × 16 cols

13× feature 2× identifier 1× label

This dataset covers 3,222 U.S. counties with 16 columns describing rental affordability — rents, incomes, renter shares, and burden percentages — keyed by FIPS and county name. Several numeric fields (annual_rent, median_gross_rent, median_household_income, rent_to_income_ratio) carry impossible negative sentinel values like -666666666 and -7999999992, which drag means deeply negative and produce skew around -17 to -57; these need cleaning before any analysis. The affordability_category field is also extremely imbalanced — 3,192 of 3,222 counties are labeled 'Affordable' (top_rate 0.99), so it offers little discriminatory signal as-is. The cleaner fields to start with are pct_rent_burdened_30plus (median 37.36%), pct_rent_burdened_50plus (median 17.62%), and pct_renter (median 26.07%), which look well-behaved and tell the real affordability story.
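
The cleaning step described above can be sketched as a filter that treats sentinel codes as missing before any statistic is computed. The rent values below are hypothetical, and treating every negative value as a sentinel is an assumption that happens to fit rents, incomes, and ratios.

```python
from statistics import mean


def drop_sentinels(values):
    """Keep only non-negative readings; negatives such as -666666666 are
    assumed to be missing-value codes rather than real measurements."""
    return [v for v in values if v is not None and v >= 0]


# Hypothetical median_gross_rent values with one sentinel mixed in.
rents = [850, 1200, -666666666, 975]
clean = drop_sentinels(rents)

print(mean(rents))  # dragged deeply negative by the single sentinel
print(mean(clean))  # a plausible rent level once the sentinel is removed
```

A single sentinel is enough to flip the sign of the mean, which is why the skew figures for these columns are so extreme.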

093 / 233
profile reading

urban parking violations sample

/home/coolhand/html/datavis/data_trove/data/urban/parking_violations_sample.csv · 10,000 rows × 9 cols

6× feature 2× timestamp 1× identifier

This is a 10,000-row sample of NYC-style parking violations with 9 columns covering summons IDs, issue dates and times, locations, violation codes/descriptions, issuing agencies, and vehicle make/color. Two things jump out: issue_date is heavily concentrated on a single day (2025-12-28 accounts for 65% of rows), and violation_description is dominated by 'PHTO SCHOOL ZN SPEED VIOLATION' at 52% of non-null values, paired with issuing_agency 'V' at 44% — suggesting the sample is skewed toward automated school-zone camera tickets. Vehicle_color also shows clear data-quality issues, with the same color appearing under multiple codes (e.g., WH/WHITE, BLK/BLACK/BK, GREY/GRY) that would need normalization before analysis. Violation_code is numeric with a ~10% outlier rate and right-skew, worth a look alongside the categorical description. Street_name is messy free text with 77% all-caps and many directional prefixes (SB, NB, WB, EB).
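
The color normalization mentioned above could be handled with a small alias table. The variant codes come from the reading; the canonical spellings chosen here and the `normalize_color` helper are illustrative assumptions.

```python
# Variant codes quoted in the reading, mapped to one canonical spelling each.
# Which spelling counts as canonical is an arbitrary choice for this sketch.
COLOR_ALIASES = {
    "WH": "WHITE", "WHITE": "WHITE",
    "BK": "BLACK", "BLK": "BLACK", "BLACK": "BLACK",
    "GRY": "GREY", "GREY": "GREY",
}


def normalize_color(raw):
    """Trim and upper-case a color code, then collapse known aliases.
    Unknown codes pass through unchanged so nothing is silently dropped."""
    if raw is None:
        return None
    key = raw.strip().upper()
    return COLOR_ALIASES.get(key, key)


print([normalize_color(c) for c in ["WH", " blk ", "GRY", "RED", None]])
```

Passing unknown codes through keeps the mapping safe to extend incrementally as more variants turn up in the column.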

094 / 233
profile reading

scars master dataset

/home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv · 3,221 rows × 20 cols

17× feature 2× identifier 1× foreign_key

This is a county-level US dataset of 3,221 rows and 20 columns combining demographics (population by race, poverty, income), 2016 and 2020 presidential vote shares, and geographic identifiers (FIPS, state, county). Two data-quality issues stand out and should be addressed first: median_household_income contains sentinel/error values that pull its minimum to -666,666,666 and yield a negative mean, and margin_2016 is stored as text percentages (e.g. '15.17%') while margin_2020 is numeric, so the two election cycles aren't directly comparable without cleaning. The political columns themselves are well-formed and show a Republican-leaning county distribution (mean republican_pct_2020 ≈ 0.65 vs democratic_pct_2020 ≈ 0.33). Population and demographic counts are heavily right-skewed with many outliers, as expected when mixing rural counties with metros up to ~10M people, so log scales or per-capita rates (already provided as pct_white, pct_black, pct_hispanic) will be more informative than raw counts.
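
To make the two election margins comparable, the text percentages in margin_2016 need to be parsed into numbers. A minimal sketch follows, assuming values look like '15.17%'; whether margin_2020 is stored in percentage points or as a fraction is not stated, so a further division by 100 may be needed.

```python
def parse_margin(value):
    """Convert a string such as '15.17%' to the float 15.17.
    Numbers pass through as floats; anything unparseable becomes None."""
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return None
    text = value.strip().rstrip("%").replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


print([parse_margin(v) for v in ["15.17%", " -3.5% ", "n/a", 4]])
```

Returning None for unparseable entries keeps bad rows visible as nulls instead of raising partway through a column.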

095 / 233
profile reading

satire theonion index to dataset

/home/coolhand/html/datavis/data_trove/entertainment/satire/theonion_index_to_dataset.csv · 2,103 rows × 3 cols

1× identifier 1× free_text 1× metadata

This dataset contains 2,103 rows and three columns scraped from The Onion: a numeric index, a satirical headline, and an associated image URL. The headlines column is the most substantive — it has 7,613 unique vocabulary tokens, a median of 9 words, and an average Flesch readability score of about 46.9, suggesting typical news-headline phrasing. The image URL column is uniform in structure (every value is a single URL averaging ~108 characters) but contains a roughly 9.9% duplicate rate, with one image reused 11 times — worth a look if you're checking scrape integrity. The numeric index column is a clean 2 → 2104 sequence with no outliers and is essentially just a row identifier.

096 / 233
profile reading

quirky witch trials

/home/coolhand/html/datavis/data_trove/data/quirky/witch_trials.json · 10,940 rows × 6 cols

3× feature 2× timestamp 1× numeric_target

This dataset catalogs 10,940 historical witch trial records across 6 columns, covering when and where trials occurred, how many people were tried, and how many died. Trials span from 1300 to 1850, with the bulk concentrated around the early 1600s (median year 1630), and they are heavily dominated by the United Kingdom (3,750 records) and Germany (3,417), which together account for roughly two-thirds of the data. The 'deaths' and 'tried' columns are extremely skewed: 75% of records report zero deaths, yet a small set of outlier events reach up to 500, so any aggregate analysis should treat these tails carefully. Also worth flagging: the 'city' field is 47.6% null and spans 906 unique values, so geographic analysis below the country level will be patchy.

097 / 233
profile reading

quirky social norms

/home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet · 12,383 rows × 6 cols

3× label 1× feature 1× free_text 1× other

This dataset contains 12,383 multiple-choice questions tagged by subject, grade, and skill, likely from an educational platform. The content is heavily skewed toward language arts (10,068 rows) over social studies (2,315), and grade-5 is the single largest grade bucket at 2,537 rows. The question text shows a notable 24.3% duplicate rate with 3,008 repeats, so deduplication is worth considering before any modeling. Answer indices range 0-3 but are concentrated at 0 and 1 (43% are zero), suggesting possible position bias in the correct-answer distribution. Skill coverage is broad with 402 distinct skills, none dominating (top skill is only 1.9% of rows).

098 / 233
profile reading

quirky silence data

/home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json · 6,998 rows × 6 cols

4× feature 1× identifier 1× label

This dataset catalogs 6,998 world languages, each with a name (n), speaker population (p), geographic coordinates (lat/lng), a numeric score (s), and an endangerment status (ss). The most striking feature is the extreme skew in speaker population: the median language has just 11,000 speakers but the mean is over 1.1 million, with a max near 965 million and roughly 13% of entries showing zero speakers — a tell-tale signature of dying or dormant languages. The endangerment status field is also worth a close look: only about 44% of languages are 'safe', while the remaining categories span from 'vulnerable' all the way to 'extinct' (219 cases). Geography is broadly distributed (latitude centered near the tropics, longitude spanning the globe), so the dataset supports both statistical and map-based exploration.

099 / 233
profile reading

quirky shipwrecks

/home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json · 5,569 rows × 14 cols

5× feature 4× metadata 2× identifier 1× label 1× foreign_key 1× free_text

This dataset catalogues 5,569 shipwrecks (and a handful of related features) sourced from OpenStreetMap, with 14 columns covering geography (lat/lon), OSM identifiers, type classifications, and optional metadata like depth, year sunk, and Wikipedia links. The collection is overwhelmingly homogeneous in category: 'wreck' accounts for 98.4% of seamark_type and 'shipwreck' for 91.2% of type, so the interesting variation lives elsewhere. Geographic spread is global — longitude ranges from -179.28 to 179.45 and latitude from -77.42 to 82.17 — making the lat/lon distribution the most informative view. Be aware that descriptive fields are largely empty: heritage is 99.8% null, year_sunk 99.3% null, depth 96.3% null, and Wikipedia/Wikidata links are missing for ~94% of records, so any analysis beyond location and basic typing will be working with a small subset.

Open
100 / 233
profile reading

quirky openfoodfacts cheese 20260121

/home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet · 77,145 rows × 10 cols

3× feature 2× other 2× metadata 2× free_text 1× label

This dataset contains 77,145 product records from Open Food Facts, focused on cheese products, with 10 columns covering names, ingredients, quantities, image URLs, and several tag fields (brands, categories, countries, labels, nutrition grades, origins). The text fields are highly multilingual — product_name spans 30+ languages with English (3,820) and French (315) dominating, and ingredients_text shows the same pattern. Two things deserve a closer look first: the heavy null rates on quantity (57.7%) and ingredients_text (41.9%), which will limit any analysis depending on those fields, and the strong duplication in quantity (90.5% duplicate rate) where values like '1 serving(s)', '8 oz', and '200 g' recur thousands of times. Product names also duplicate substantially (30.9%), with 'Cottage Cheese', 'Cheese', and 'Mozzarella' appearing as common generic labels. Note that the six tag-style columns were skipped during profiling, so their structure is not yet characterized.

Open
101 / 233
profile reading

quirky megaliths

/home/coolhand/html/datavis/data_trove/data/quirky/megaliths.json · 15,464 rows × 14 cols

5× feature 4× metadata 2× label 1× identifier 1× free_text 1× foreign_key

This dataset catalogues 15,464 megalithic sites with 14 fields covering geographic coordinates, classification (type, megalith_type, material), heritage status, and external references (wikidata, wikipedia, name). Coverage is uneven: many descriptive fields are mostly empty (description is blank in 14,814 rows, material in 15,223, start_date in 15,430), so analysis should lean on the well-populated columns. The most informative categorical is megalith_type, where menhir (5,231) and dolmen (4,501) dominate but 73 distinct subtypes appear, while the broader type field is overwhelmingly 'megalith' (97.7%). Geographically, lat/lon are highly skewed with heavy clustering in Europe (median lat 47.6, lon -1.6) and a long tail of outliers stretching as far as 144°E and 51°S. Start with megalith_type and the lat/lon distributions to understand what kinds of sites exist and where they cluster.

Open
102 / 233
profile reading

quirky lighthouses

/home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json · 14,585 rows × 13 cols

7× feature 4× metadata 2× identifier

This dataset catalogues 14,585 lighthouses and related navigational landmarks sourced from OpenStreetMap, with 13 columns covering location (lat/lon), OSM identifiers, names, operators, build years, light characteristics, and heritage status. Coverage is very uneven: descriptive fields like country (99.6% null), heritage (96.9% null), year_built (93.2% null) and operator (92.7% null) are mostly empty, so any analysis on those needs to acknowledge the small annotated subsample. The most reliable signals are geographic (lat/lon, fully populated) and the OSM-derived fields osm_type and seamark_type; the latter shows the dataset is dominated by light_minor (3,496) and light_major (3,051), confirming its lighthouse focus. The light_character field is also worth examining: where recorded, 'Fl' (flashing) overwhelmingly dominates at 75% of entries. Latitude is heavily skewed toward the northern hemisphere (median 40.8°) with 1,295 outliers flagged in the southern extremes.

Open
103 / 233
profile reading

quirky geothermal

/home/coolhand/html/datavis/data_trove/data/quirky/geothermal.json · 8,776 rows × 13 cols

6× feature 3× metadata 2× label 1× free_text 1× identifier

This dataset catalogs 8,776 geothermal features (hot springs and geysers) sourced from OpenStreetMap, with 13 columns covering location, type, and optional metadata like temperature and tourism use. The core signal is in the `type` and `osm_type` fields: roughly 80% are hot springs and 20% geysers, and most entries are point nodes rather than ways. Geographic coverage is global but skewed — latitude leans heavily toward the northern hemisphere with a long southern tail flagged as outliers, while longitude spans the full range. Be aware that nearly all the descriptive fields (`country`, `wikipedia`, `temperature`, `description`, `access`, `tourism`, `intermittent`) have null rates above 97%, so they're only useful for the small annotated subset. Within that subset, `tourism` is dominated by 'attraction' and `intermittent` is overwhelmingly 'no', which limits their analytic value.

Open
104 / 233
profile reading

quirky chocolate origins

/home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json · 2,530 rows × 10 cols

6× feature 1× foreign_key 1× metadata 1× timestamp 1× free_text

This dataset catalogs 2,530 chocolate bar reviews with 10 columns covering bean origins, cocoa percentages, ingredients, ratings, and review metadata. Ratings cluster tightly (median 3.25, IQR 0.5) on a 1–4 scale, while cocoa percent is similarly concentrated around 70% but carries 235 outliers worth investigating. Geographic skew is notable: U.S.A. dominates company locations at 44.9% of records, whereas bean origins are more diverse, led by Venezuela, Peru, and the Dominican Republic. Heads up that the `company` column is entirely empty (single blank value across all rows), so it should be excluded from analysis.

Open
105 / 233
profile reading

quirky cheese list

/home/coolhand/html/datavis/data_trove/data/quirky/cheese_list.json · 7,146 rows × 4 cols

2× feature 1× label 1× other

This dataset is a catalogue of 7,146 cheese product entries with a name, a category, a country of origin, and a constant value field. Cheeses span 32 categories and 111 countries, with France alone accounting for 25.9% of rows and Germany and the United States rounding out the top three. Category is led by Cream Cheese (1,187 rows, 16.6%), followed by Mozzarella and Soft Cheese, suggesting some categories are far more populated than others. The name column is multilingual (predominantly English and French, with notable German, Spanish, and Italian presence) and has an 11.3% duplicate rate worth investigating before any de-duplicated analysis. Note that the value column is constant at 1.0 across all rows and carries no analytical signal.

Open
106 / 233
profile reading

quirky atmospheric real

/home/coolhand/html/datavis/data_trove/data/quirky/atmospheric_real.json · 571 rows × 19 cols

4× label 4× feature 4× metadata 4× other 2× free_text 1× timestamp

This dataset contains 571 weather alert records with 19 columns mixing NWS-style alert metadata (event, severity, urgency, certainty, areaDesc, headline) with sparse atmospheric event annotations (country, event_type, magnitude, source, state). The alert fields are well-populated and dominated by 'Small Craft Advisory' (149 of 571) and 'Winter Weather Advisory' (95), while certainty is overwhelmingly 'Likely' (89.3%) and urgency is 'Expected' (90.6%), suggesting limited variation in those risk dimensions. Severity is the most balanced operational field, split across Moderate (208), Minor (187), and Severe (144). Note that the curiosity-style columns (country, event_type, magnitude, source, state) are ~98.6% null and only describe a handful of rows, so treat them as a separate mini-dataset rather than primary signal.

Open
107 / 233
profile reading

quirky asteroids

/home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json · 40,827 rows × 11 cols

7× feature 2× label 1× identifier 1× metadata

This dataset catalogs 40,827 Near-Earth Objects (asteroids) across 11 columns mixing orbital parameters (H, a, e, i, per), physical properties (diameter, albedo), and classification flags (neo, pha, class). Every record has neo='Y', so that column carries no information and can be ignored. The most analytically interesting fields are 'class', where APO dominates at 56.8% followed by AMO at 35.1%, and 'pha' (potentially hazardous), which flags 2,534 objects (about 6.2%) as 'Y'. Note that 'diameter' and 'albedo' are ~97% null, so any size/reflectivity analysis will be limited to roughly 1,200 rows. The orbital-parameter columns are stored as short text rather than numbers — they will need to be cast to floats before any quantitative work.
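
The cast to floats mentioned above could look something like the following. This is a minimal sketch using only the standard library; the column names (H, a, e, i, per) come from the reading, but the sample values and the `to_float` helper are hypothetical, and treating blank or non-numeric text as missing is an assumption about how the file encodes gaps.

```python
from typing import Optional

ORBITAL_COLUMNS = ("H", "a", "e", "i", "per")  # names taken from the reading


def to_float(text: Optional[str]) -> Optional[float]:
    """Convert short text to float; return None for blanks or non-numeric text."""
    if text is None:
        return None
    cleaned = text.strip()
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical rows: the values are made up for illustration only.
rows = [
    {"H": "22.1", "a": "1.458", "e": "0.223", "i": "10.83", "per": "643.1"},
    {"H": "", "a": "2.01", "e": "n/a", "i": "5.2", "per": "1040"},
]

cast_rows = [{col: to_float(row.get(col)) for col in ORBITAL_COLUMNS} for row in rows]
```

Returning None rather than raising keeps unparseable cells visible as missing values, so they can be counted before any quantitative work.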

Open
108 / 233
profile reading

processed word forms

/home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv · 25,731 rows × 8 cols

4× foreign_key 3× feature 1× metadata

This dataset contains 25,731 word forms drawn from a single source ('iecor'), each tagged with a concept, language, and cognate identifier — essentially a comparative wordlist across 160 languages and 170 concepts. The 'form' column is mostly single-word entries (94.6% one-word, mean length ~5 characters) with about 24.9% duplicates, suggesting many shared or repeated forms across languages. The language coverage is broad and well-balanced (entropy ratio ~0.99 across 142 ISO codes), led by Greek (ell), Slovenian (slv), and Macedonian (mkd). Worth a closer look: the concept distribution is remarkably even (~160-170 forms per concept), and the language_name distribution shows which languages are most densely sampled (Bakhtiari, Nepali, Italiot Greek). The 'source_dataset' column is constant and can be ignored.

Open
109 / 233
profile reading

processed lexibank references

/home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json · 11,359 rows × 9 cols

7× metadata 1× identifier 1× feature

This dataset is a bibliographic reference list with 11,359 rows and 9 columns (key, author, citation, title, year, plus mostly-empty editor/publisher/journal/url fields). The most informative columns are author, citation, title, and year — the rest are either unique IDs or near-empty categoricals. Note that author has a 66% duplicate rate and 1,277 empty values, while citation and title both show heavy duplication (54% and 50%) driven by a handful of large source collections like Koelle's Polyglotta africana and the Africa Museum and Austronesian web archives. The year column spans 271 distinct values with reasonable spread (entropy ratio 0.75), though about 11% of rows have no year and another 574 are marked 'n.d.'. Author and title are also multilingual, with English dominant but meaningful German, French, Spanish, and Chinese subsets.

Open
110 / 233
profile reading

processed cognate sets

/home/coolhand/servers/diachronica/etymology_atlas/processed/cognate_sets.json · 4,981 rows × 8 cols

3× other 2× metadata 2× feature 1× identifier

This dataset contains 4,981 cognate sets sourced entirely from the 'iecor' source_dataset, each identified by a unique cognate_id. The two main numeric signals are language_count and word_count, which are nearly identical in distribution: both have a median of 2 and mean around 5.17, but stretch out to a maximum of 157 with skew above 6.8 and roughly 13% of rows flagged as outliers. That long tail is the most interesting story — most cognate sets are small, but a minority span very many languages/words and deserve a closer look. Note that concept is empty for every row, confidence is constant at 1.0, and source_dataset has only one value, so those columns carry no analytic signal.

Open
111 / 233
profile reading

parquet languages

/home/coolhand/servers/diachronica/etymology_atlas/parquet/languages.parquet · 19,401 rows × 11 cols

4× feature 2× identifier 2× metadata 2× other 1× foreign_key

This dataset catalogues 19,401 world languages, each identified by a unique Glottocode and name, with attributes like geographic coordinates, macroarea, language family, ISO code, and phoneme count. Two things stand out for closer inspection: phoneme_count is missing for 88.8% of rows and is heavily right-skewed (mean ~38, max 231), so any analysis of phonological inventories will rely on a small subsample with notable outliers. Latitude and longitude are also null for 59.1% of rows, which will limit mapping coverage. On the categorical side, macroarea is well-distributed across six regions but dominated by Africa (32%), while the status column is uninformative since every language is labelled 'living'.

Open
112 / 233
profile reading

parquet cognate sets

/home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet · 4,981 rows × 7 cols

3× metadata 2× feature 1× identifier 1× free_text

This dataset catalogs 4,981 cognate sets from the IECoR source, with each row identified by a unique cognate_id and accompanied by a JSON-like 'words' payload listing language entries. The numeric columns language_count and word_count are nearly identical twins, both highly skewed (skew ~6.84) with a median of 2 but a max of 157 and ~13% outliers — a small set of cognate groups is dramatically larger than the rest. Three columns (concept, confidence, source_dataset) are constant or empty and carry no analytic signal. Start by examining the distribution of language_count to understand the long tail of cross-linguistic coverage, and inspect the longest 'words' entries (len_max ~14,956) to see which cognate sets dominate.

Open
113 / 233
profile reading

noaa significant storms noaa significant storms

/home/coolhand/datasets/noaa-significant-storms/noaa_significant_storms.json · 14,770 rows × 14 cols

6× feature 4× metadata 2× label 1× timestamp 1× numeric_target

This dataset contains 14,770 significant US storm events from the NOAA Storm Events Database, with 14 columns covering event type, location, date, magnitude, casualties, and property damage. Tornadoes dominate at 6,334 records (about 43% of rows), followed by Flash Flood, Thunderstorm Wind, and Flood — worth focusing on first since event_type drives most other fields. Geographically the events skew heavily to the central/southern US, with Texas alone accounting for 1,450 records and a long tail across 65 state values. Fatalities and injuries are highly zero-inflated (around 69% and 68% zeros respectively), so any casualty analysis should treat the non-zero tail separately. Note also that magnitude is missing for 51.8% of rows and damage_property is stored as text codes like '2.5M' and '1.00M' rather than numbers, which will need parsing before quantitative use.
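
One way the damage_property parsing might be approached is sketched below. Only the 'M' suffix appears in the reading ('2.5M', '1.00M'); the 'K' and 'B' multipliers, the handling of bare numbers, and the `parse_damage` name are assumptions for illustration, not something the profile confirms.

```python
import re
from typing import Optional

# Assumed multipliers: the reading only shows 'M'; 'K', 'B', and no suffix are guesses.
MULTIPLIERS = {"": 1.0, "K": 1e3, "M": 1e6, "B": 1e9}
_DAMAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMB]?)\s*$", re.IGNORECASE)


def parse_damage(code: Optional[str]) -> Optional[float]:
    """Turn a text code such as '2.5M' into dollars; return None if it does not parse."""
    if not code:
        return None
    match = _DAMAGE_RE.match(code)
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix.upper()]


examples = {code: parse_damage(code) for code in ("2.5M", "1.00M", "", "unknown")}
```

Codes that do not match the pattern come back as None, which makes it easy to count how many rows would need manual review.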

Open
114 / 233
profile reading

natural hazards storms

/home/coolhand/html/datavis/data_trove/data/natural_hazards/storms.json · 14,770 rows × 14 cols

6× feature 4× metadata 2× label 1× timestamp 1× numeric_target

This dataset contains 14,770 records of significant U.S. storm events sourced entirely from the NOAA Storm Events Database, with each row describing a weather incident's location, type, magnitude, and damages. Tornadoes dominate the event mix at roughly 43% of records, followed by Flash Floods and Thunderstorm Winds, so the event_type distribution is the first thing to inspect. Geographically the data skews toward Texas (1,450 events) and other tornado-belt states like Missouri, Arkansas, and Mississippi, which is worth confirming on the latitude/longitude spread. Two caveats deserve attention: the magnitude field is missing for 51.8% of rows, and category/country/source are constants (single value) so they carry no analytical signal. Fatalities and injuries are heavily zero-inflated (about 69% and 68% zeros respectively), meaning summary stats will be driven by a small tail of severe events.
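
The zero-inflation caveat can be handled by summarizing the zero share and the non-zero tail separately. The sketch below assumes casualty counts arrive as integers or None; the `split_zero_inflated` helper and the sample values are hypothetical.

```python
import statistics


def split_zero_inflated(counts):
    """Return (share of zeros among usable values, list of non-zero values)."""
    usable = [c for c in counts if c is not None]
    nonzero = [c for c in usable if c > 0]
    zero_share = (len(usable) - len(nonzero)) / len(usable) if usable else 0.0
    return zero_share, nonzero


# Made-up fatality counts, not taken from the dataset.
fatalities = [0, 0, 0, 1, 0, 0, 12, 0, 3, 0]
zero_share, tail = split_zero_inflated(fatalities)
tail_median = statistics.median(tail)
```

Reporting the zero share alongside the median of the tail avoids a single mean that mostly reflects a few severe events.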

Open
115 / 233
profile reading

natural hazards meteorites

/home/coolhand/html/datavis/data_trove/data/natural_hazards/meteorites.json · 1,097 rows × 10 cols

3× feature 3× metadata 1× identifier 1× timestamp 1× other 1× label

This is a 1,097-row catalogue of witnessed meteorite falls, with each record carrying a name, description, date, lat/long coordinates and a meteorite class. Two columns (category and fall_type) are constant — every record is a 'witnessed_meteorite_falls' event with fall_type 'Fell' — so the analytic interest sits elsewhere. Meteorite class is the most informative categorical: 125 distinct classes but heavily concentrated, with L6 alone accounting for ~24% of falls and H5 the next largest at ~15%. Latitude is skewed toward the northern hemisphere (median 36.1, mean 30.0) with ~8% flagged as outliers, while longitude spreads broadly across the globe (-157.9 to 174.4). Start with meteorite_class to understand the dominant compositions, then look at the lat/long pair to see geographic coverage.

Open
116 / 233
profile reading

natural hazards earthquakes

/home/coolhand/html/datavis/data_trove/data/natural_hazards/earthquakes.json · 3,742 rows × 11 cols

5× feature 3× metadata 1× free_text 1× identifier 1× label

This dataset contains 3,742 records of significant earthquakes, with numeric measurements (magnitude, depth, latitude, longitude), location text fields, and a date column. Magnitude is tightly clustered between 4.5 and 5.1 (median 4.8) but reaches up to 8.2, producing 184 high-end outliers worth a closer look. Depth is highly skewed (skew 3.07) with a median of 10 km but a max of 248.7 km and 314 outliers, suggesting a mix of shallow and deep events. Geographically, the data is heavily concentrated around Alaska — 'alaska' appears in 1,991 place names, with Canada and Mexico trailing far behind — and longitudes sit firmly in the western hemisphere (median -144.2). Note that 'category' is a single constant value and 'earthquake_type' is 99.9% 'earthquake', so neither adds analytic signal.

Open
117 / 233
profile reading

ml 32m movies

/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/movies.csv · 87,585 rows × 3 cols

1× identifier 1× free_text 1× feature

This dataset is a movie catalogue of 87,585 rows with three columns: a unique movieId, a title, and a pipe-delimited genres string. The genres column is the most analytically interesting: only 1,798 unique combinations exist, and Drama, Documentary, and Comedy dominate, while 7,080 rows are tagged '(no genres listed)' — a sizeable gap worth flagging. Titles are nearly unique (87,382 distinct of 87,585), and the frequent '(2014)'–'(2019)' tokens in titles suggest the catalogue skews toward recent years. movieId spans 1 to 292,757 with no outliers, indicating a sparse identifier range rather than a clean sequence. Start with the genre distribution and the missing-genre share before any deeper modelling.
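
As a rough illustration of the suggested first step, the pipe-delimited genres string can be split and counted per genre, with the '(no genres listed)' rows tallied separately. The sample strings and the `genre_counts` helper below are hypothetical; only the delimiter and the placeholder label come from the reading.

```python
from collections import Counter

NO_GENRE = "(no genres listed)"  # placeholder value quoted in the reading


def genre_counts(genre_strings):
    """Count individual genres and the number of rows with no genre listed."""
    counts = Counter()
    missing = 0
    for raw in genre_strings:
        if raw == NO_GENRE:
            missing += 1
            continue
        counts.update(part for part in raw.split("|") if part)
    return counts, missing


# Made-up sample rows.
sample = ["Comedy|Drama", "Drama", NO_GENRE, "Documentary"]
counts, missing = genre_counts(sample)
missing_share = missing / len(sample)
```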

Open
118 / 233
profile reading

ml 32m links

/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/links.csv · 87,585 rows × 3 cols

2× identifier 1× foreign_key

This is a movie ID linkage table with 87,585 rows and 3 numeric identifier columns: imdbId, tmdbId, and movieId. As expected for ID columns, movieId and imdbId are unique per row, while tmdbId is nearly unique with a tiny null rate (0.0014). The most notable shape is imdbId, which is heavily right-skewed (skew 2.24) with about 8.6% flagged as outliers — reflecting how IMDb IDs span a huge range from 1 up to ~29M. tmdbId is also right-skewed but more moderately, while movieId is distributed fairly evenly up to 292,757. There's little analytical signal here beyond confirming the file is a clean ID crosswalk.

Open
119 / 233
profile reading

merged inequality master

/home/coolhand/datasets/us-inequality-atlas/merged/inequality_master.csv · 3,222 rows × 28 cols

26× feature 2× identifier

This dataset profiles 3,222 U.S. counties across 28 columns of socioeconomic indicators, including poverty, rent burden, education, healthcare, and a composite inequality index. Two things stand out for closer inspection: the rent_to_income_ratio shows extreme skew (53.98) with a max of 1200 against a median of 17.06, suggesting either data-entry anomalies or a handful of severe outliers worth investigating. Total population is also highly skewed (skew 13.36, max ~9.78M vs median 25,174), so any per-county aggregation should be population-weighted. The composite_index and the *_score columns are well-behaved and centered near 50, making them good candidates for cross-county comparison. Texas (254 counties), Georgia, and Virginia dominate the state distribution.
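
To make the population-weighting advice concrete, here is a small sketch of a weighted mean. The `weighted_mean` helper, the two-county example, and the choice to skip rows with missing or non-positive weights are all assumptions for illustration.

```python
def weighted_mean(values, weights):
    """Mean of values weighted by weights, skipping unusable rows."""
    total_weight = 0.0
    weighted_sum = 0.0
    for value, weight in zip(values, weights):
        if value is None or weight is None or weight <= 0:
            continue
        weighted_sum += value * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no rows with a usable value and positive weight")
    return weighted_sum / total_weight


# Made-up example: one large county and one small county.
poverty_rate = [10.0, 30.0]
population = [9_000, 1_000]

unweighted = sum(poverty_rate) / len(poverty_rate)
weighted = weighted_mean(poverty_rate, population)
```

In this toy case the unweighted mean is 20.0 while the weighted mean is 12.0, because the larger county contributes nine times the weight.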

Open
120 / 233
profile reading

lightning monthly heatmap

/home/coolhand/html/datavis/data_trove/data/natural_hazards/lightning/monthly_heatmap.json · 59,070 rows × 4 cols

4× feature

This dataset contains 59,070 rows of monthly lightning strike observations, each tagged with latitude, longitude, month, and strike count. Geographically the points sit between roughly 25.35°N and 35.46°N and between 96.74°W and 79.03°W, suggesting coverage of the southeastern United States. The strikes column is highly skewed (skew ≈ 2.02, max 531 vs. median 34) with a long right tail worth investigating to identify hotspots. Latitude also flags an outlier cluster (about 9.5% of rows), so it is worth checking whether those represent edge regions or data quality issues. Month is evenly bounded 1–12 with a mean near 6.9, hinting at a mild summer concentration in the records.

Open
121 / 233
profile reading

letterboxd users export

/home/coolhand/html/datavis/data_trove/entertainment/movies/letterboxd/users_export.csv · 8,139 rows × 5 cols

2× identifier 2× feature 1× free_text

This dataset contains 8,139 Letterboxd user profiles with 5 columns covering identifiers (username, _id), a display name, and two activity metrics (num_reviews, num_ratings_pages). The activity metrics are the most interesting signal: num_reviews is heavily right-skewed with a mean of 868 but a median of 588 and a max of 17,184, and num_ratings_pages shows similar skew along with a 41.7% null rate that warrants investigation. Display names are also worth a look — about 60% are one-word, 12.3% are duplicates, and 'null' literally appears 307 times as a value, suggesting some data quality issues. The username and _id columns are fully unique identifiers and can largely be ignored for analytical purposes.
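
The literal 'null' strings in display names could be normalized to real missing values before counting duplicates. In this sketch only the 'null' token comes from the reading; the `normalize_missing` helper and the case-insensitive matching are assumptions, and a user who really chose "null" as a display name would be lost by this rule.

```python
from typing import Optional

NULL_TOKENS = {"null"}  # only 'null' is reported in the reading; extend with care


def normalize_missing(value: Optional[str]) -> Optional[str]:
    """Map empty strings and null-like tokens to None; otherwise return trimmed text."""
    if value is None:
        return None
    text = value.strip()
    if not text or text.lower() in NULL_TOKENS:
        return None
    return text


# Made-up display names.
names = ["null", " Ana ", "", "NULL", "Sam"]
cleaned = [normalize_missing(name) for name in names]
```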

Open
122 / 233
profile reading

large meteorites large meteorites

/home/coolhand/datasets/large-meteorites/large_meteorites.json · 4,871 rows × 10 cols

6× feature 1× identifier 1× timestamp 1× metadata 1× free_text

This dataset catalogues 4,871 large meteorites with 10 columns covering mass, location (latitude/longitude), discovery year, fall type, and classification. Mass is extremely skewed (skew ≈ 25, max 60,000,000 g vs median 3,600 g) with 712 outliers, so any mass-based analysis should use a log scale or trim extremes. The fall_type field is heavily imbalanced (84.8% are 'Found' versus 'Fell'), and meteorite_class is dominated by ordinary chondrites like L6 (16.9%) and H5. Year is left-skewed toward recent decades (median 1990, q1 1945), reflecting a modern collection bias. Note that the 'category' column is constant ('large_meteorites') and adds no signal.
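
A log scale for mass might be applied as in the sketch below. The two masses are the median and maximum quoted in the reading; the `log10_mass` helper and the decision to return None for missing or non-positive masses are assumptions.

```python
import math
from typing import Optional


def log10_mass(mass_g: Optional[float]) -> Optional[float]:
    """Base-10 log of a mass in grams; None when the mass is missing or not positive."""
    if mass_g is None or mass_g <= 0:
        return None
    return math.log10(mass_g)


median_log = log10_mass(3_600)        # median mass quoted in the reading
max_log = log10_mass(60_000_000)      # maximum mass quoted in the reading
spread_in_decades = max_log - median_log
```

On this scale the maximum sits a little over four orders of magnitude above the median, which is far easier to plot than the raw gram values.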

Open
123 / 233
profile reading

language data wals values

/home/coolhand/datasets/language-data/wals_values.csv · 76,475 rows × 8 cols

4× foreign_key 1× identifier 1× feature 1× free_text 1× metadata

This dataset is a 76,475-row export of WALS (World Atlas of Language Structures) values, with 8 columns covering language identifiers, typological parameters, coded values, sources, and comments. The core analytical fields are Parameter_ID (192 typological features, top being '83A' at ~2% of rows) and Value (a small numeric coding scheme with 28 distinct values, median 2 and a long right tail up to 28). Two things deserve a closer look first: the Value column is highly skewed (skew 3.49, kurtosis 16.4, ~3.2% outliers), which matters if you plan to aggregate it numerically; and Comment is 96.9% null and, where present, is mostly HTML markup in mixed languages (English and Chinese dominate), so it needs cleaning before any text analysis. Language_ID spans 2,660 unique codes fairly evenly (top 'eng' has just 159 rows), confirming broad cross-linguistic coverage.
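
Cleaning the HTML markup out of Comment could be done with the standard library's `html.parser`, as in this sketch. The sample comment is invented and the `strip_html` helper is hypothetical; real WALS comments may contain markup this does not handle gracefully.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of markup with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up comment value.
plain = strip_html("<p>See <i>Smith</i> &amp; Jones</p>")
```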

Open
124 / 233
profile reading

language data wals languages

/home/coolhand/datasets/language-data/wals_languages.csv · 3,573 rows × 17 cols

8× feature 4× foreign_key 3× metadata 1× identifier 1× label

This dataset catalogs 3,573 world languages (WALS) across 17 columns combining identifiers (ISO codes, Glottocode), classifications (Family, Genus, Subfamily), geography (Latitude, Longitude, Macroarea, Country_ID), and sampling flags. The Family and Macroarea distributions are the most informative starting point: Niger-Congo and Austronesian dominate at 324 languages each, and Eurasia (659) and Africa (606) lead the macroareas out of just six categories. Note that roughly a quarter of rows (null_rate ~0.255) are missing geographic and family fields in lockstep, suggesting a shared set of unclassified entries worth investigating. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 'True' values respectively), reflecting curated WALS sub-samples. Subfamily is sparsely populated (74.5% null) so treat it as supplementary rather than primary.

Open
125 / 233
profile reading

language data glottolog languoid

/home/coolhand/datasets/language-data/glottolog_languoid.csv · 23,740 rows × 16 cols

8× feature 3× identifier 2× foreign_key 2× free_text 1× label

This dataset is a Glottolog languoid catalog with 23,740 rows and 16 columns describing languages, dialects, and families along with geographic and endangerment metadata. The `level` field splits the records into three classes — dialect (10,920), language (8,481), and family (4,339) — making it the natural primary lens. Endangerment `status` is dominated by 'safe' (~79.9%), but the remaining categories flag thousands of vulnerable to extinct languages worth investigating. Geography is concentrated: `country_ids` is led by PG (874), ID (695), and NG (480), and `family_id` is heavily skewed toward atla1278 (4,663) and aust1307 (3,850). Note that `iso639P3code`, `latitude`, and `longitude` are ~66% null, so spatial analysis will only cover about a third of rows.

Open
126 / 233
profile reading

joshua project joshua project unreached

/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet · 7,124 rows × 109 cols

64× feature 20× metadata 6× free_text 6× other 5× foreign_key 5× label 3× identifier

This dataset is a Joshua Project catalogue of 7,124 unreached people groups described across 109 fields covering geography, language, religion, population, and outreach status. Every row is flagged as 'Unreached' (JPScaleText is constant) and 'LeastReached' is uniformly Y, so the analytical interest sits in the breakdown by region, religion, and population rather than in reach status itself. The data is heavily skewed toward Asia (5,351 of 7,124) and especially South Asian Peoples (3,681), with India alone accounting for 2,032 groups; Islam (3,279) and Hinduism (2,142) dominate PrimaryReligion. Population is extremely long-tailed (median 30,000 vs. max 135.5M, skew ~21), so any size-based analysis should use log scales or medians. Worth a closer look first: the Continent/Region/Country concentration, the religion mix, and the population distribution — these three together explain most of the dataset's shape.

Open
127 / 233
profile reading

joshua project joshua project languages

/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json · 7,134 rows × 26 cols

17× feature 4× identifier 2× metadata 1× label 1× other 1× free_text

This is a Joshua Project languages dataset with 7,134 rows and 26 columns, profiling world languages alongside Bible translation status, audio/film resource availability, primary religion, and host-country distribution. The headline signal is religious-engagement coverage: PrimaryReligion is dominated by Christianity (3,328) followed by Ethnic Religions (1,472) and Islam (945), and JPScale skews toward the more-reached end with category 5 the largest bucket (2,050). Resource availability is uneven — HasAudioRecordings is roughly 59% Yes / 41% No, while HasJesusFilm is only ~28% Yes, suggesting the Jesus Film coverage gap is worth a closer look. Geographic concentration is also notable: HubCountry is led by Papua New Guinea (837), Indonesia (686), and Nigeria (494), together accounting for a large share of entries. Finally, NbrPGICs is extremely skewed (max 1,804, median 1) so any per-language counts should be inspected with that long tail in mind.

Open
128 / 233
profile reading

joshua project joshua project enriched

/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet · 16,382 rows × 109 cols

65× feature 18× metadata 6× foreign_key 6× free_text 6× other 5× label 3× identifier

This is the Joshua Project people-groups dataset: 16,382 rows and 109 columns describing ethnic groups by country, with demographics, language, religion mix, and Christian-engagement indicators. The shape is dominated by categorical and text fields — Continent, AffinityBloc, PrimaryReligion, and the JPScale 'reachedness' rating give the cleanest first read on who is in the file. Population is extremely long-tailed (median 20,000 but max ~913M and skew ~91), so any size analysis should use logs or quantiles rather than means. Religion-share columns like PCIslam, PCHinduism, and PCBuddhism are mostly zero with a minority of groups at very high percentages, which tells you religion is effectively single-dominant per group. Watch out for several columns with very high null rates (RLG4 96%, NomadicTypeDescription 98%, PrimaryLanguageDialect 92%, NTOnline 29%) and many near-duplicate URL/ID fields that won't add analytic value.

Open
129 / 233
profile reading

housing housing crisis merged

/home/coolhand/datasets/us-inequality-atlas/housing/housing_crisis_merged.csv · 3,222 rows × 16 cols

13× feature 2× identifier 1× label

This dataset covers 3,222 U.S. counties (one row per county, identified by FIPS code) with 16 columns spanning housing stock, rent burden, income, and affordability metrics. The headline finding is that the affordability_category field is overwhelmingly imbalanced: 'Affordable' covers 3,192 of 3,222 counties (top_rate 0.99), with only 29 'Moderately Burdened' and 1 'Extremely Burdened', so this label likely needs reworking before it's useful. The rent-burden percentages tell a richer story: pct_rent_burdened_30plus has a mean of 36.4% and pct_rent_burdened_50plus a mean of 17.4%, suggesting real stress that the categorical label hides. Housing-count columns (owner_occupied, renter_occupied, total_housing_units) are extremely right-skewed (skew 9.5–15.8) with hundreds of outliers, reflecting a few very large urban counties, so log scales are recommended for them. Also note rent_to_income_ratio has an extreme max of 1200 with skew ~54, hinting at data-quality issues worth checking.

Open
130 / 233
profile reading

housing housing crisis counties

/home/coolhand/html/datavis/data_trove/demographic/housing/housing_crisis_counties.csv · 3,222 rows × 16 cols

13× feature 2× identifier 1× label

This dataset covers 3,222 US counties with 16 columns describing housing affordability — rents, incomes, renter shares, and rent-burden percentages. Several core numeric fields (annual_rent, median_gross_rent, median_household_income, rent_to_income_ratio) contain extreme negative sentinel values like -666666666 and -7999999992 that are dragging means deeply negative and producing skew of -17 to -56; these need to be cleaned or filtered before any analysis. The affordability_category field is heavily imbalanced, with 'Affordable' covering 99.1% of counties and only 1 county labeled 'Extremely Burdened', which suggests the categorization rule may be miscalibrated. Once the sentinel values are removed, the rent-burden percentage columns (pct_rent_burdened_30plus around a median of 37.4%, pct_rent_burdened_50plus around 17.6%) look like the cleanest signals to start with.
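
Filtering the sentinel values might look like the sketch below. It assumes that legitimate rents and incomes are never negative, so anything below zero is treated as missing; that threshold, the `drop_sentinels` helper, and the sample rents are assumptions, while the two sentinel values are the ones quoted in the reading.

```python
import statistics


def drop_sentinels(values, floor=0.0):
    """Keep values at or above floor; drop None and negative sentinel codes."""
    return [v for v in values if v is not None and v >= floor]


# Made-up rents mixed with the sentinel values quoted in the reading.
rents = [850.0, -666666666.0, 1200.0, 990.0, -7999999992.0]

raw_mean = statistics.mean(rents)          # dragged far below zero by the sentinels
clean = drop_sentinels(rents)
clean_median = statistics.median(clean)
```

Counting how many rows are dropped per column is a useful check before trusting any summary statistic on the cleaned data.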

Open
131 / 233
profile reading

healthcare cms hospitals 2025

/home/coolhand/datasets/us-inequality-atlas/healthcare/cms_hospitals_2025.csv · 5,421 rows × 38 cols

24× feature 8× metadata 5× identifier 1× label

This dataset is a CMS hospital directory covering 5,421 U.S. hospitals across 56 state/territory codes, with 38 columns mixing facility identifiers (Facility ID, Name, Address, Phone), location fields, and a battery of CMS quality-measure summaries (Mortality, Readmission, Safety, Patient Experience, Timely & Effective care). Two things are worth a closer look first: the Hospital overall rating is 'Not Available' for 47% of facilities, and the 'Meets criteria for birthing friendly designation' field is 58% null with only 'Y' as a value, so any rating- or designation-based analysis will be heavily gated by missingness. Beyond that, the mix is dominated by Acute Care Hospitals (3,120) and Voluntary non-profit – Private ownership (2,291 / ~42%), with Texas, California, and Florida holding the largest state shares. The 'Count of … Measures Worse/Better' fields are highly skewed toward 0, suggesting most hospitals look 'no different than national average' on CMS comparisons — a useful framing before drilling into outliers.

132 / 233
profile reading

glottolog languages

/home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet · 27,037 rows × 15 cols

6× feature 3× identifier 3× foreign_key 1× label 1× metadata 1× timestamp

This dataset is a Glottolog catalogue of 27,037 language entries with 15 columns covering identifiers (Glottocode, ISO codes), geographic info (Latitude, Longitude, Countries, Macroarea), classification (Family_ID, Level, Is_Isolate), and documentation years. The Level column shows the catalogue is split across dialects (about 50%), languages, and families, while Macroarea is dominated by Eurasia and Africa with Papunesia close behind. The Family_ID distribution is heavily concentrated in a few large families (atla1278, aust1307, indo1319) out of 297 total. Note that documentation-year fields are almost entirely null (Last_Year ~96%, First_Year ~99%) and Is_Isolate is missing for ~68% of rows, so those columns are unreliable for analysis. The geographic coordinates are nearly complete and would support mapping work.

133 / 233
profile reading

geographic country centroids

/home/coolhand/html/datavis/data_trove/data/geographic/country_centroids.json · 7,124 rows × 10 cols

7× metadata 2× feature 1× other

This dataset contains 7,124 records of country centroid points sourced from Natural Earth 1:10m Admin 0 Label Points, with 10 columns covering geographic identifiers and coordinates. In practice, only the latitude and longitude columns carry usable signal — all eight categorical fields (continent, iso_a2, iso_a3, name, name_long, region_un, source, subregion) are effectively empty or constant, with a single value covering 100% of rows. Start by examining the spatial distribution: longitude spans the full globe (-179.97 to 179.99) while latitude is skewed toward the northern hemisphere (mean 22.9, median 25.2, skew -0.60). The 35 latitude outliers (~0.5%) likely correspond to extreme polar points worth a quick sanity check.

134 / 233
profile reading

extracted cppi jp cppi cross reference

/home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv · 19,375 rows × 24 cols

19× feature 2× identifier 1× foreign_key 1× metadata 1× label

This dataset is a Joshua Project cross-reference of 19,375 people groups across 240 countries, combining CPPI and JP fields covering language, religion, population, and evangelical engagement. India dominates the country distribution at 17.2% of rows, followed by Papua New Guinea, Pakistan, and Indonesia, so any global view should account for that skew. JPPrimaryReligion is heavily weighted toward Christianity (41%) with Islam, Ethnic Religions, and Hinduism trailing, while CPPIEvangelicalEngagement is split roughly 59% Engaged vs 41% Unengaged among the non-null rows. Watch the population and evangelical-percentage fields: JPPopulation is extremely long-tailed (max ~919M, median 16,000, skew ~95) and JP%Evangelical is right-skewed with ~27% zeros and ~10% outliers. Also note the high null rates on the CPPI-prefixed columns (~37%) and JPLeastReached (63%), which constrain joinability.

135 / 233
profile reading

environmental desert data

/home/coolhand/html/datavis/data_trove/data/environmental/desert_data.json · 52,037 rows × 8 cols

7× feature 1× identifier

This dataset contains 52,037 records describing US Census-tract-level demographics, with an 11-character ID, county and state labels, and five numeric measures: distance/share, income, population, poverty rate, and SNAP counts. State coverage spans all 51 entries (50 states plus DC), led by Texas (4,010), California (3,727), and Florida (3,018), and counties are dominated by common names like Jefferson and Montgomery. The income distribution is right-skewed (mean $78,215 vs median $70,455, max $250,001) with about 4% flagged as outliers, and poverty rate shows a similar skew (mean 13.7%, median 10.8%, max 99.5%). Worth a closer look: the strong skew and outlier rates in inc, pov, and snap, plus how dist_share spreads almost uniformly from 0 to 10,000 (kurtosis -1.5), suggesting it may be a percentile-style metric rather than a raw count.

136 / 233
profile reading

emoji unicode emoji list 20260119

/home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json · 5,225 rows × 6 cols

2× identifier 2× label 2× feature

This dataset is a catalog of 5,225 Unicode emoji, with each row carrying the emoji glyph, its codepoint sequence, an English-leaning name, and three classification fields (group, subgroup, status). The collection is heavily skewed toward people: the 'group' field shows 'People & Body' accounts for 3,468 of 5,225 rows (about 66%), so most subsequent breakdowns will be dominated by human figures. The 'status' field is similarly lopsided, with 'fully-qualified' covering 3,944 rows versus much smaller minimally-qualified, unqualified, and component buckets. The 'subgroup' column gives a finer 100-way split worth exploring, led by person-activity (697) and person-role (635). Name-level duplication (1,272 duplicate names, ~24%) reflects skin-tone and gender variants of the same base concept, which is the other thing to keep in mind when counting.

137 / 233
profile reading

education education by county

/home/coolhand/datasets/us-inequality-atlas/education/education_by_county.csv · 3,222 rows × 6 cols

4× feature 1× identifier 1× metadata

This dataset contains 3,222 US county-level education records with 6 columns covering county identifiers (county_name, fips, state) and educational attainment metrics (pct_hs_or_higher, pct_bachelors_or_higher, total_25_plus). The bachelor's degree rate averages 23.5% but ranges from 0% to nearly 79%, suggesting wide regional disparities worth investigating. The total_25_plus population column is heavily skewed (skew=13.5) with 440 outliers and a max of nearly 6.9 million, so any analysis using it should consider log transforms or per-capita normalization. State coverage is fairly even across 52 entries, with TX, GA, and VA contributing the most counties.

138 / 233
profile reading

disasters airplane crashes

/home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv · 5,268 rows × 13 cols

6× feature 2× timestamp 2× identifier 2× free_text 1× numeric_target

This dataset records 5,268 airplane crashes across 13 columns, mixing dates and times with operator, aircraft type, route, location, and casualty counts (Aboard, Fatalities, Ground). Casualty figures are highly skewed: Aboard averages 27.5 with a median of 13 and a maximum of 644, while Fatalities averages 20.1 with a median of 9 and a max of 583, and Ground deaths are zero in roughly 96% of rows but spike to 2,750 — clear outliers worth investigating. Operator and Type are dominated by a few heavy hitters (Aeroflot and U.S. military operators; Douglas DC-3 alone appears 334 times), suggesting concentration that could bias any aggregate analysis. Note also that Flight # is missing in nearly 80% of rows and Time in 42%, so those fields are weak for filtering. Start by looking at the Fatalities distribution and the top operators and aircraft types.

139 / 233
profile reading

deepsky ngc

/home/coolhand/data/celestial/deepsky/NGC.csv · 13,969 rows × 32 cols

19× feature 5× identifier 3× metadata 2× label 1× other 1× foreign_key 1× free_text

This is an astronomical catalog of 13,969 deep-sky objects (NGC.csv) with 32 columns covering identifiers, sky coordinates, magnitudes across multiple bands, morphological classifications, and kinematic measurements like radial velocity and redshift. The catalog is dominated by galaxies — 75% of entries are type 'G' — with smaller populations of open clusters, duplicates, stars, and planetary nebulae. Object morphology (Hubble type) and constellation distribution are the most informative descriptive fields, while RadVel and Redshift give a clean view of the cosmological distance distribution skewing toward nearby objects (median z ≈ 0.016). Be aware that many columns are very sparsely populated: parallax (Pax), proper motions, and central-star magnitudes are >92% null, so any analysis on those fields will be limited to a small subset. Size measurements (MajAx, MinAx) are extremely skewed with heavy outliers, suggesting a few very large objects dominate the tails.

140 / 233
profile reading

data raw wals language

/home/coolhand/servers/diachronica/data_raw/wals_language.csv · 3,573 rows × 17 cols

7× feature 5× foreign_key 2× identifier 2× label 1× metadata

This dataset is a catalogue of 3,573 world languages from WALS, with identifiers (ISO codes, Glottocode), names, geographic coordinates, and classification fields (Family, Genus, Subfamily, Macroarea) plus reference sources and sampling flags. The geographic and genealogical breakdowns are the most informative starting point: Macroarea splits cleanly across six regions led by Eurasia (659) and Africa (606), while Family is dominated by Niger-Congo and Austronesian (324 each). Worth a closer look: roughly a quarter of rows are missing core fields like Family, Genus, Macroarea, and coordinates (null rate ~0.255), and Subfamily is 74.5% null, which will limit any subfamily-level analysis. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 True values respectively), reflecting their role as curated sub-samples rather than balanced categories.

141 / 233
profile reading

data raw glottolog languoid

/home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv · 19,401 rows × 7 cols

3× identifier 3× feature 1× label

This dataset is a Glottolog languoid catalogue with 19,401 rows and 7 columns covering identifiers (glottocode, isocodes, name), geographic coordinates (latitude, longitude), and classification fields (macroarea, level). The most striking feature is missingness: roughly 59% of rows lack ISO codes and coordinates, so any geographic or ISO-based analysis will only cover about 40% of entries. Worth a closer look first: the macroarea distribution (Africa leads at 32%, followed by Eurasia and Papunesia) and the level split between dialect (56%) and language (44%). The name field is mostly single words but contains recurring qualifiers like 'nuclear', 'sign', 'central', and 'southern' that hint at naming conventions worth exploring.

142 / 233
profile reading

cms cms hospitals 20260121

/home/coolhand/html/datavis/data_trove/cache/cms/cms_hospitals_20260121.parquet · 5,421 rows × 38 cols

24× feature 8× metadata 5× identifier 1× label

This dataset catalogs 5,421 U.S. hospitals with 38 columns covering location (city, county, state, ZIP), facility identity, ownership and type, and CMS quality-measure rollups (mortality, readmission, safety, patient experience, timely & effective care). The most interesting structural story is the quality-rating coverage: 'Hospital overall rating' is 'Not Available' for 47% of hospitals, and the various footnote columns are null for 53–83% of rows, so any analysis of star ratings has to handle a large missing slice. On the categorical side, the mix is dominated by Acute Care Hospitals (~58%) and Voluntary non-profit – Private ownership (~42%), with Texas and California leading state counts. The 'Meets criteria for birthing friendly designation' field only ever takes the value 'Y' (58% null, no 'N'), so it is effectively a flag rather than a comparator.

143 / 233
profile reading

quirky nuforc sightings

/home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet · 147,890 rows × 13 cols

5× feature 3× timestamp 3× free_text 1× identifier 1× label

This dataset contains 147,890 UFO sighting reports (likely from NUFORC) with 13 columns covering location, shape, duration, witness counts, and free-text descriptions. The Shape field is a clean categorical with 39 values dominated by 'Light' (27,494), 'Circle', and 'Triangle' — a natural starting point for understanding what people report. Duration is text-based but highly repetitive, with '5 minutes' and '2 minutes' as the most common values, suggesting witnesses anchor on round numbers. Watch out for 'No of observers': it is extremely skewed (max 20,000, min -10, skew 109) with ~13% outliers, so it needs cleaning before any quantitative use. Also note that 'Explanation' is 99.5% null — only a tiny fraction of sightings have an official label, with 'Starlink' explanations leading the small set that do.
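Because Duration is stored as text, a small parser is needed before it can be treated as a number. The following is a rough sketch: the function name, the unit table, and the decision to return `None` for anything unparseable are all assumptions, and real values in the column are likely messier than the patterns handled here.

```python
import re

# Seconds per unit, keyed by the singular form (hypothetical unit table).
UNIT_SECONDS = {
    "second": 1, "sec": 1,
    "minute": 60, "min": 60,
    "hour": 3600, "hr": 3600,
}

# A leading number followed by a recognised unit, e.g. "5 minutes", "10 sec".
PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(seconds?|secs?|minutes?|mins?|hours?|hrs?)\b",
    re.IGNORECASE,
)

def duration_to_seconds(text):
    """Return the duration in seconds, or None when the text does not parse."""
    match = PATTERN.match(text or "")
    if not match:
        return None
    number = float(match.group(1))
    unit = match.group(2).lower().rstrip("s")  # "minutes" -> "minute"
    return number * UNIT_SECONDS[unit]

for raw in ("5 minutes", "2 minutes", "10 sec", "1.5 hrs", "about an hour"):
    print(raw, duration_to_seconds(raw))
```

Returning `None` rather than guessing keeps vague entries such as "about an hour" visible as missing, so the share of unparseable rows can be measured before deciding how to treat them.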

144 / 233
profile reading

celestrak active satellites

/home/coolhand/html/datavis/data_trove/cache/celestrak_active_satellites.json · 13,983 rows × 17 cols

10× feature 3× identifier 3× metadata 1× timestamp

This is a snapshot of 13,983 active satellites from CelesTrak, with 17 columns of two-line element (TLE) orbital parameters such as inclination, eccentricity, mean motion, and BSTAR drag term. Three columns (ELEMENT_SET_NO, EPHEMERIS_TYPE, MEAN_MOTION_DDOT) are constant and CLASSIFICATION_TYPE is uniformly 'U' (unclassified), so they can be ignored. The most informative shapes are in MEAN_MOTION and INCLINATION, which together reveal the orbital regimes (LEO vs higher orbits) populating the catalog: MEAN_MOTION is heavily concentrated near 15 rev/day (LEO) with a long tail down to 0.28, and INCLINATION spans 0–148° with a median around 53°. ECCENTRICITY is extremely skewed (median 0.00015, max 0.88), flagging a small number of highly elliptical orbits worth inspecting. OBJECT_NAME also offers a quick look at operator/constellation prevalence, with [DTC], Flock, Cosmos, and Iridium dominating.
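To read MEAN_MOTION as an orbital regime, it helps to convert revolutions per day into an approximate altitude using Kepler's third law. This is a sketch under simplifying assumptions: a two-body model, standard values for Earth's gravitational parameter and equatorial radius, and "altitude" taken as semi-major axis minus Earth radius, which is only meaningful for near-circular orbits. The function names are illustrative.

```python
import math

MU_EARTH = 398600.4418       # km^3/s^2, Earth's standard gravitational parameter
EARTH_RADIUS_KM = 6378.137   # equatorial radius
SECONDS_PER_DAY = 86400.0

def semi_major_axis_km(mean_motion_rev_per_day):
    """Kepler's third law: a = (mu / n^2)^(1/3), with n in rad/s."""
    n = mean_motion_rev_per_day * 2.0 * math.pi / SECONDS_PER_DAY
    return (MU_EARTH / n ** 2) ** (1.0 / 3.0)

def approx_altitude_km(mean_motion_rev_per_day):
    """Semi-major axis minus Earth radius; a mean altitude for circular orbits."""
    return semi_major_axis_km(mean_motion_rev_per_day) - EARTH_RADIUS_KM

# 15 rev/day is the LEO peak and 0.28 the tail end quoted in the reading;
# 1.0027 rev/day is added here as a geostationary reference point.
for mean_motion in (15.0, 1.0027, 0.28):
    print(mean_motion, round(approx_altitude_km(mean_motion)))
```

Under these assumptions 15 rev/day lands at a few hundred kilometres of altitude and roughly one revolution per day lands near the geostationary belt, which is why the MEAN_MOTION histogram separates the regimes so cleanly.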

145 / 233
profile reading

blissapi

/home/coolhand/data/blissapi.db · 6,181 rows × 2 cols

1× identifier 1× metadata

This dataset contains 6,181 rows and 2 columns drawn from blissapi.db, pairing a free-text 'keyword' field with a categorical 'symbol_count' field. Every keyword is unique (6,181 distinct values across 6,181 rows) and is exactly one word, with lengths ranging from 2 to 72 characters and a median of 12. The 'symbol_count' column is fully constant at the value '1', so it carries no information for analysis. The most useful first look is the distribution of keyword lengths, since that is essentially the only varying signal in the data.

146 / 233
profile reading

source food access atlas 2019

/home/coolhand/datasets/us-inequality-atlas/source/food_access_atlas_2019.xlsx#Read Me · 7 rows × 1 cols

1× metadata

This is the 'Read Me' sheet from the USDA Food Access Research Atlas 2019 download, not an analytical dataset. It contains just 7 rows in a single column of free-text notes describing the workbook's contents, source URL, and release history. Each row is unique, so there is nothing to aggregate or model here. Treat this sheet as documentation and pivot to the sibling sheets (the variable lookup and the main data table) for actual analysis.

147 / 233
profile reading

bigfoot listings 20260210

/home/coolhand/html/datavis/data_trove/cache/bigfoot/listings_20260210.json · 5,411 rows × 9 cols

4× feature 2× identifier 1× timestamp 1× label 1× free_text

This dataset contains 5,411 Bigfoot sighting reports from BFRO, with 9 columns covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Sightings are concentrated in Washington, California, Ohio and Florida, and cluster heavily in summer and early-fall months (August, October, July). Classification is dominated by Class B (2,722) and Class A (2,655), with Class C barely represented (34), which is worth flagging if you plan to filter by report quality. The year distribution is left-skewed with a median of 2001 and a long tail back to 1870, so most activity is recent. Note that the county field has 338 empty values and an 81% duplicate rate (expected, since counties repeat across reports).

148 / 233
profile reading

ml 32m tags

/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/tags.csv · 2,000,072 rows × 4 cols

1× identifier 1× foreign_key 1× label 1× timestamp

This dataset is a 2,000,072-row movie tag log from MovieLens (ml-32m/tags.csv) with four columns: a free-text tag, a timestamp, a userId, and a movieId. The tag column is the most interesting feature — it has only 140,981 unique values across 2M rows (a 92.95% duplicate rate) and 52.47% of tags are a single word, with 'sci-fi', 'atmospheric', and 'action' leading the list. The timestamp column is left-skewed (skew −1.22) toward more recent activity, suggesting tagging picked up in later years, and userId shows that tagging is concentrated among a subset of users (only 15,848 distinct userIds for 2M rows). Start by looking at the top tags and the timestamp distribution to understand what users tag and when.
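The duplicate rate quoted for the tag column follows from the row and unique-value counts. A minimal sketch, assuming the rate is defined as one minus the ratio of unique values to rows (an assumption that happens to reproduce the figure above):

```python
def duplicate_rate(n_rows, n_unique):
    """Fraction of rows whose value has already appeared elsewhere (assumed definition)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return 1.0 - n_unique / n_rows

# Counts as quoted in the reading above.
rate = duplicate_rate(2_000_072, 140_981)
print(f"{rate:.2%}")
```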

149 / 233
profile reading

quirky deep sea

/home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json · 200,000 rows × 12 cols

8× feature 4× label

This is a 200,000-row deep-sea biodiversity dataset with 12 columns covering taxonomy (phylum, class, order, family, genus, species, scientificName), geography (country, latitude, longitude), depth, and observation year. Two things stand out. First, the taxonomic hierarchy is heavily incomplete at lower ranks: species is blank in 73.2% of rows and genus in 54.9%, so most records can only be analyzed at higher ranks like phylum (top: Proteobacteria at 17.7%) or class (top: Alphaproteobacteria at 11.4%). Second, country is mostly missing (51.9% blank), with Australia dominating the populated entries at 79,320 records, suggesting a strong sampling bias. Year is left-skewed (skew -3.57) toward recent records with a long tail back to 1875, while depth ranges from 1,000 to 11,000 m with a median near 1,962 m. Start by checking the missingness in species/country and the geographic concentration before any biodiversity analysis.

150 / 233
profile reading

animal attacks shark attacks gsaf

/home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv · 6,462 rows × 257 cols

234× other 8× metadata 5× feature 4× identifier 3× free_text 2× label 1× timestamp

This is the GSAF shark attack file: 6,462 incident records described across 257 columns, though the vast majority (around 230 'Unnamed' columns) are empty padding that can be ignored. The substantive content sits in roughly two dozen fields covering case metadata (Case Number, Date, Year), context (Country, Area, Location, Activity, Time), and outcome (Type, Fatal (Y/N), Injury, Species). Two things are worth a closer look first: the dataset is heavily geographically skewed — USA accounts for 36% of cases and Australia another 21% — and incident type is dominated by 'Unprovoked' attacks (4,716 of ~6,400 typed rows, or 73%), with Fatal=N at 75% versus Fatal=Y at ~22%. Activity is also revealing: surfing (1,025) and swimming (932) together explain a large share of incidents. Watch out for data-quality issues: the Year column has a max of 3019 and 266 outliers, several key fields (Species, Time, Age) are 44–53% null, and Fatal (Y/N) contains stray values like 'M', 'F', and '2017'.
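The data-quality issues flagged above could be handled with two small normalisers before any counting. This is only a sketch: the function names are hypothetical, and the rules (keep only 'Y' or 'N', reject years that are non-positive or in the future) are assumptions rather than anything GSAF or saturn prescribes.

```python
from datetime import date

def clean_fatal(value):
    """Keep only 'Y' or 'N'; stray codes such as 'M', 'F', or '2017' become None."""
    if value is None:
        return None
    normalised = str(value).strip().upper()
    return normalised if normalised in {"Y", "N"} else None

def clean_year(value, max_year=None):
    """Return the year as an int, or None when it is unparseable or implausible."""
    max_year = date.today().year if max_year is None else max_year
    try:
        year = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None
    return year if 0 < year <= max_year else None

for raw in ("Y", " n ", "M", "2017"):
    print(repr(raw), clean_fatal(raw))
for raw in (2015, "1998", 3019, "unknown"):
    print(repr(raw), clean_year(raw))
```

Mapping bad values to `None` instead of dropping rows keeps the incident count intact while making the cleaned share of each column easy to report.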

151 / 233
profile reading

steam game network steam network

/home/coolhand/datasets/steam-game-network/steam_network.json · 1 rows × 3 cols

3× other

This dataset appears to be a Steam game network stored as a single JSON document with three top-level fields: links, meta, and nodes. Because the file was loaded as one row with three nested object columns, the profiler could not introspect any of them and skipped each. To get useful insights, the analyst should explode or normalize the nested structures — most likely treating `nodes` as the list of games and `links` as the edges between them — and re-profile from there. Until that flattening happens, there are no column-level distributions to chart.
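The flattening step might look like the sketch below. It assumes `nodes` and `links` are lists of objects and that each link carries `source` and `target` keys; those key names are a common graph-JSON convention and are hypothetical here, since the profiler never saw inside the fields.

```python
import json
from collections import Counter

def flatten_network(doc):
    """Split a {nodes, links, meta} document into two row lists."""
    nodes = list(doc.get("nodes", []))
    links = list(doc.get("links", []))
    return nodes, links

def degree_counts(links, source_key="source", target_key="target"):
    """Count how many links touch each node id (key names are assumptions)."""
    degrees = Counter()
    for link in links:
        degrees[link[source_key]] += 1
        degrees[link[target_key]] += 1
    return degrees

# Tiny stand-in document; the real file would be read with json.load().
raw = json.loads(
    '{"meta": {}, "nodes": [{"id": "a"}, {"id": "b"}, {"id": "c"}],'
    ' "links": [{"source": "a", "target": "b"}, {"source": "a", "target": "c"}]}'
)
nodes, links = flatten_network(raw)
degrees = degree_counts(links)
print(len(nodes), len(links), degrees.most_common(1))
```

Once split this way, `nodes` and `links` are ordinary row-per-record tables that can each be profiled on their own.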

152 / 233
profile reading

quirky social norms 20260121

/home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet · 355,922 rows × 25 cols

10× feature 7× label 3× free_text 3× foreign_key 1× metadata 1× identifier

This is a social-norms annotation dataset of 355,922 rows and 25 columns, where each entry pairs a real-life 'situation' (mostly from Reddit confessions, AmItheAsshole, Dear Abby, and ROCStories) with an 'action', a rule-of-thumb ('rot'), and a battery of moral judgments by crowd workers. The most striking shape feature is heavy duplication in the text fields: 'rot-judgment' is 97% duplicated and 'characters' 91%, because they collapse to short controlled vocabularies, while 'situation' and 'rot' themselves repeat ~71% and ~27% of the time across annotators. Worth a closer look first: the moral-foundation distribution, which is dominated by 'care-harm' (~39% of non-null), and the 'action-legal' field where 93% of actions are tagged 'legal' — both suggest class imbalance that will matter for any modeling. Also note 'area' is reasonably balanced across the four source corpora, but 'split' is heavily skewed toward 'train' (66%).

153 / 233
profile reading

hyg hygdata v41

/home/coolhand/data/celestial/hyg/hygdata_v41.csv · 119,626 rows × 37 cols

28× feature 4× identifier 4× metadata 1× foreign_key

This is the HYG star catalog (hygdata_v41.csv) with 119,626 stars and 37 columns covering positions (ra/dec, x/y/z), motion (pmra, pmdec, vx/vy/vz, rv), brightness (mag, absmag, lum, ci), and identifiers/classifications (hd, hip, spect, con, proper). The most informative single field is the spectral type 'spect': it has 4,310 distinct values but is dominated by a handful of classes (K0 ~8.6k, G5 ~6.0k, A0 ~4.9k), giving a clean view of stellar populations. Distance and luminosity are extremely right-skewed (lum skew ≈49, dist max 100,000 pc) with 10–15% outliers, so any analysis on those should use log scales. Radial velocity 'rv' is 81% zeros — effectively a 'measured vs not' flag rather than a continuous variable. Constellation 'con' is the most evenly distributed categorical (89 values, entropy ratio 0.95) led by Cen, UMa, and Her, making it a good grouping key.
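The entropy ratio cited for 'con' is presumably a normalised Shannon entropy: the observed entropy divided by the maximum possible for that many distinct values. The sketch below implements that reading; the exact formula saturn uses is an assumption.

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Shannon entropy of the value counts divided by log(k) for k distinct values.

    1.0 means perfectly even categories; values near 0 mean one category dominates.
    """
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)

print(entropy_ratio(["a", "b", "c", "d"]))   # perfectly even
print(entropy_ratio(["a"] * 9 + ["b"]))      # heavily lopsided
```

Because the ratio divides by log(k), it is comparable across columns with different numbers of categories, which is what makes it useful for picking a grouping key.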

154 / 233
profile reading

bsky firehose anonymized dec 2025 bluesky posts

/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv · 101,040 rows × 19 cols

9× feature 4× foreign_key 2× timestamp 1× free_text 1× identifier 1× label 1× metadata

This is an anonymized Bluesky firehose snapshot of 101,040 posts from late December 2025, with 19 columns covering hashed identifiers, post text, timestamps, embed metadata, language, and sentiment. The content is heavily multilingual: English dominates at roughly 61% of posts, but Japanese, Korean, German, and Portuguese also have meaningful presence, alongside an 'unknown' bucket worth investigating. Sentiment skews neutral (~48%) with positive outweighing negative roughly 2:1, and post length is right-skewed (median 68 chars, max 525). Engagement features are sparse — about 18% of posts carry a link, 14% have images, and only 1.3% have video — and the embed_type field is null for ~61% of rows, which is the biggest data-quality flag to check first. Reply hashes are null for ~58% of rows, suggesting most posts are top-level rather than replies.

155 / 233
profile reading

accessibility atlas who disability prevalence

/home/coolhand/datasets/accessibility-atlas/who_disability_prevalence.json · 1 rows × 5 cols

4× other 1× metadata

This dataset is a single-row JSON document at who_disability_prevalence.json, containing 5 top-level fields that appear to be nested containers rather than tabular columns: gho_data, metadata, sdg_disability_indicators, who_reference_statistics, and world_bank_health_indicators. Because the file holds only one record and every column was skipped as 'unknown' kind, there is no distributional signal to chart at this level. The structure suggests this is a reference compilation aggregating WHO GHO data, SDG disability indicators, WHO reference statistics, and World Bank health indicators under a metadata wrapper. To make this analysable, the next step is to expand each nested field into its own table and profile those individually rather than treating the file as flat.

156 / 233
profile reading

linguistic

/home/coolhand/data/linguistic.db · 105,484 rows × 6 cols

2× feature 2× metadata 1× identifier 1× foreign_key

This dataset contains 105,484 rows of phoneme records linked to languages by glottocode, drawn from 8 different sources. Each row pairs a language identifier (2,177 unique glottocodes) with a phoneme (3,142 unique values, mostly 1-character IPA-like symbols) and a segment class. The segment_class breakdown is the most informative summary: consonants dominate at 72,282 rows, vowels account for 31,052, and tones only 2,150. Source coverage is uneven — 'ph' alone supplies about 34% of records, while the long tail (ra, spa, aa) is much smaller, which matters if you compare across sources. Glottocode frequency is also skewed: kham1282 and osse1243 each appear hundreds of times, suggesting some languages have far richer phoneme inventories recorded than others.

157 / 233
profile reading

language data phoible

/home/coolhand/datasets/language-data/phoible.csv · 105,484 rows × 49 cols

39× feature 4× foreign_key 4× label 2× metadata

This is the PHOIBLE phonological inventory dataset: 105,484 rows describing phoneme segments across roughly 2,716 language names and 2,177 Glottocodes, with each row carrying a Phoneme/GlyphID plus 40+ binary phonological features (e.g. consonantal, nasal, sonorant, dorsal). The dataset is dominated by consonants — SegmentClass shows 72,282 consonants vs 31,052 vowels and 2,150 tones — and pulls from 8 sources, with 'ph' (36,274) and 'ea' (16,883) accounting for over half the rows. Most feature columns are heavily imbalanced toward '-' or '0', but a handful (consonantal, sonorant, continuant, dorsal, high, front, back) are fairly balanced and carry the real phonological signal worth exploring. Top language names like Iron Ossetic (444), Dutch (395), and Chechen (309) point to the densest inventories in the corpus.

158 / 233
profile reading

quirky phenomena

/home/coolhand/html/datavis/data_trove/data/quirky/phenomena.json · 1 rows × 7 cols

6× other 1× metadata

This dataset is a single-row JSON document at phenomena.json containing 7 top-level fields: bigfoot, crashes, earthquakes, ghosts, metadata, meteorites, and ufos. Every column was flagged as 'unknown' kind and 'skipped' during profiling, which strongly suggests each field holds a nested structure (array or object) rather than a scalar — the file is a container of sub-datasets, not a flat table. The first thing to look at is the shape of each nested field individually; treat phenomena.json as an index and profile each key as its own dataset. Start with 'metadata' to understand provenance, then expand the phenomena collections (ufos, bigfoot, ghosts, etc.) one by one.
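A first pass at treating the file as an index could simply report the container type and size of each top-level key. This is a sketch with a made-up stand-in document; the real contents of each key are unknown, and `describe_top_level` is a hypothetical helper name.

```python
import json

def describe_top_level(doc):
    """Map each top-level key to (container type, number of items)."""
    summary = {}
    for key, value in doc.items():
        if isinstance(value, (list, dict)):
            summary[key] = (type(value).__name__, len(value))
        else:
            summary[key] = (type(value).__name__, 1)
    return summary

# Stand-in with the same key names as the reading; the values are invented.
doc = json.loads(
    '{"metadata": {"source": "example"},'
    ' "ufos": [{"id": 1}, {"id": 2}],'
    ' "bigfoot": [{"id": 1}]}'
)
for key, (kind, size) in describe_top_level(doc).items():
    print(key, kind, size)
```

Keys that come back as lists of objects are the natural candidates to write out and profile as separate datasets.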

159 / 233
profile reading

waterfalls waterfalls worldwide

/home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json · 80,678 rows × 9 cols

4× feature 4× metadata 1× label

This dataset catalogues 80,678 waterfalls worldwide, sourced entirely from OpenStreetMap with latitude/longitude coordinates and minimal descriptive metadata. The most striking feature is how sparse the descriptive fields are: 'category' and 'source' are constant, 'date' and 'country' are essentially empty (country is blank for 80,650 of 80,678 rows), and 89.9% of 'description' entries are simply 'Waterfall'. The 'name' field is similarly thin — 'Unnamed Waterfall' accounts for 48,168 rows and the duplicate rate is 65.7%. The real analytical signal lives in the geographic coordinates, where latitude skews toward the northern hemisphere (median 40.3) and longitude spans the full globe, making this primarily a spatial dataset rather than an attribute-rich one.

160 / 233
profile reading

wild nasa meteorites 20260121

/home/coolhand/html/datavis/data_trove/cache/wild/nasa_meteorites_20260121.json · 1 rows × 2 cols

2× other

This dataset is a single-row JSON cache file from NASA meteorites with just two top-level fields, `data` and `meta`. Both fields were detected as unknown-kind structures and skipped by the profiler, so no column-level statistics are available. The file appears to be a raw API envelope rather than a tabular dataset — you will likely need to unnest `data` (probably an array of meteorite records) before any meaningful analysis. Start by inspecting the structure of `data` and `meta` directly to determine the true record schema.

161 / 233
profile reading

quirky caves

/home/coolhand/html/datavis/data_trove/data/quirky/caves.json · 69,716 rows × 12 cols

5× feature 4× metadata 1× identifier 1× label 1× free_text

This dataset catalogs 69,716 caves with 12 columns covering names, geocoordinates, country, tourism/access tags, and optional metadata like description, website, and Wikipedia links. The headline issue is sparsity in the descriptive fields: 'description' is empty in 65,189 rows, 'website' in 67,082, and 'wikipedia' in 67,531, so most analytical signal sits in name and coordinates. Worth a closer look first: the 'name' column, where 19,527 entries are literally 'Unnamed Cave' and overall duplicate rate is 35%, and the geographic spread, where 'lat' is heavily left-skewed (skew -3.16) with ~12.9% outliers and 'lon' has ~16.2% outliers, suggesting a Northern-Hemisphere/European concentration with scattered global entries. The 'country' field is almost entirely blank (99.95%), so country-level analysis will need to be derived from coordinates rather than read off directly. 'Access' is the most usable categorical, with meaningful splits across yes/no/private/permit when present.

162 / 233
profile reading

quirky bioluminescence

/home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json · 43,060 rows × 14 cols

6× feature 5× label 2× timestamp 1× metadata

This dataset catalogues 43,060 records of bioluminescent marine organisms, with taxonomic fields (phylum, class, order, family, genus, scientificName), a bioluminescence_group label, geographic coordinates, depth, country, source dataset, and date/year. The taxonomy is dominated by Arthropoda (12,297) and Cnidaria (8,874) within 7 phyla, while bioluminescence_group is fairly evenly distributed across 26 categories led by Dinoflagellate (4,000). Two things deserve a closer look first: the depth column is highly skewed (skew 4.72, max 10,000m vs median 52.5m) with a 24.75% null rate and ~10.6% outliers, and the country field is 63.7% empty, limiting any geographic breakdown by nation. The year field is also 42% null, so temporal analysis will be partial.

163 / 233
profile reading

aif 2022

/home/coolhand/servers/diachronica/corpus/historical-corpora/pceec/data/aif_2022.csv · 4,970 rows × 13 cols

6× feature 4× metadata 1× identifier 1× label 1× timestamp

This dataset catalogues 4,970 historical letters (the PCEEC corpus metadata), with 13 columns describing each letter's reference code, author, recipient, their genders, dates of birth, social roles (API), and kinship relations. The social skew is striking: authors are 83% male versus 17% female, and recipients are 82% male versus 18% female, so any analysis of women's correspondence will work from a much smaller base. Roles and relations are heavily concentrated too — 'SIR' tops both author and recipient API fields, and 'FRIEND', 'BROTHER', and 'SON' dominate the kinship columns — though both API fields have long tails of 250+ distinct values worth scanning. Note also that 'Order of Gardiner letters in file' is 98.8% null (only relevant to a 58-letter subset) and 'Change from 2006?' is 95% 'ok', so neither carries much analytic signal.

Open
164 / 233
profile reading

rotten tomatoes rotten tomatoes movies

/home/coolhand/html/datavis/data_trove/entertainment/movies/rotten_tomatoes/rotten_tomatoes_movies.csv · 143,258 rows × 16 cols

11× feature 2× label 2× timestamp 1× identifier

This dataset catalogs 143,258 movies from Rotten Tomatoes across 16 columns covering metadata (title, director, writer, distributor), release info, runtime, genre, language, ratings, and critic/audience scores. Coverage is highly uneven — fields like boxOffice (89.7% null), rating (90.2% null), tomatoMeter (76.4% null), and releaseDateTheaters (78.5% null) are sparse, while audienceScore is missing in roughly half the rows. Worth a closer look first: the genre distribution, which is dominated by Drama (27,860), Documentary (15,162), and Comedy (11,514), and runtimeMinutes, which is heavily right-skewed (skew 7.6, max 2,700 minutes) with ~11.4% flagged as outliers despite a tight IQR of 84–103 minutes. The tomatoMeter and audienceScore distributions also tell a clear story — critics skew positive (median 73) while audiences are more middling (median 57). English dominates originalLanguage at 65.7% of titles, so any language-based analysis will be lopsided.

Open
165 / 233
profile reading

d3 celestial stars.14

/home/coolhand/data/celestial/d3-celestial/stars.14.json · 1 rows × 2 cols

1× metadata 1× other

This dataset is a single GeoJSON-style record loaded from stars.14.json, with just 2 columns and 1 row. The 'type' column holds the constant value 'FeatureCollection', and the 'features' column was skipped because it is a nested/unknown structure. There is essentially no tabular signal here — the file is a JSON document that needs to be unpacked (likely by exploding the 'features' array) before meaningful analysis is possible.
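One way the unpacking step could look, as a minimal sketch using only the standard library. The sample document, the `mag` property, and the helper name `explode_features` are assumptions for illustration; the real property names in stars.14.json are not known from this reading.

```python
import json

# Hypothetical stand-in for the file; values and property names are made up.
doc = json.loads("""
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "id": 1,
    "properties": {"mag": -1.46},
    "geometry": {"type": "Point", "coordinates": [101.29, -16.72]}},
   {"type": "Feature", "id": 2,
    "properties": {"mag": 0.03},
    "geometry": {"type": "Point", "coordinates": [-80.77, 38.78]}}
 ]}
""")

def explode_features(collection):
    """Turn each GeoJSON feature into one flat dict (one row per feature)."""
    rows = []
    for feature in collection.get("features", []):
        row = {"id": feature.get("id")}
        # Properties become ordinary columns.
        row.update(feature.get("properties") or {})
        geometry = feature.get("geometry") or {}
        row["geometry_type"] = geometry.get("type")
        coords = geometry.get("coordinates")
        # GeoJSON points store coordinates as [longitude, latitude].
        if geometry.get("type") == "Point" and coords:
            row["lon"], row["lat"] = coords[0], coords[1]
        rows.append(row)
    return rows

rows = explode_features(doc)
print(len(rows), sorted(rows[0]))
```

Once the features are rows, the profiler has real columns to measure instead of a single skipped 'features' cell.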

Open
166 / 233
profile reading

quirky tornadoes

/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows × 13 cols

9× feature 2× timestamp 2× numeric_target

This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start/end coordinates, magnitudes, widths, fatalities, injuries, and U.S. state. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced — fatalities are 0 in 97.7% of events and injuries are 0 in 88.8% — so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%). Note that the end-coordinate columns (elat, elon) are null in ~37.7% of rows, which matters if you plan to draw tornado tracks rather than just start points.

Open
167 / 233
profile reading

accessibility wlasl index

/home/coolhand/html/datavis/data_trove/data/accessibility/wlasl_index.csv · 2,000 rows × 2 cols

1× label 1× free_text

This dataset contains 2,000 rows with two text columns: 'gloss' and 'instances'. The 'gloss' column is essentially a vocabulary list — every entry is unique, 97.75% are single words, and the mean length is just 6 characters, suggesting these are sign-language word labels. The 'instances' column is dramatically different: every value is unique, URL-heavy (100% url_rate), and averages 2,731 characters with a max of 9,982, indicating each row holds a JSON-like or URL-laden payload of video references. The most useful first look is the length distribution of 'instances' to understand payload variability, alongside confirming that 'gloss' behaves like a clean lexicon.

Open
168 / 233
profile reading

.cache who daly country 2021

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/daly_country_2021.xlsx#Notes · 1 rows × 1 cols

1× metadata

This dataset is a single-row, single-column extract from the WHO Global Health Estimates 2021 Summary Tables (June 2024), specifically the 'Notes' sheet of the country-level DALY workbook. It contains essentially no analytical content — just one cell pointing to the WHO mortality and global health estimates portal. There is nothing to analyse here; the file is a metadata/notes sheet rather than a data table. To get usable data, load a different sheet from the same workbook (the country DALY estimates), not the Notes tab.

Open
169 / 233
profile reading

accessibility atlas vizwiz val annotations

/home/coolhand/datasets/accessibility-atlas/vizwiz_val_annotations.csv · 4,319 rows × 5 cols

2× free_text 2× label 1× identifier

This dataset is the VizWiz validation annotations file with 4,319 rows and 5 columns: an image filename, a question, a set of crowd answers, an answer_type label, and a binary answerable flag. The questions are dominated by a small number of generic openers: 'What is this?' alone accounts for 523 rows, and questions have a 35% duplicate rate, so visual variety hides behind repeated prompts. Answer_type is heavily skewed: 'other' covers 62% of rows and 'unanswerable' another 1,385 (about 32%), while 'yes/no' and 'number' are rare. Consistent with that, the answerable flag has a mean of 0.68, meaning roughly 32% of items are flagged unanswerable, a notable share to inspect before modeling. The answers column is a serialized list of dicts (long strings averaging ~560 characters) and will need parsing rather than direct text analysis.
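A rough sketch of parsing such a column follows. The reading does not say whether the cells are JSON or Python-literal strings, so the helper tries both; the key name `answer`, the sample cells, and the function names are assumptions.

```python
import ast
import json
from collections import Counter

def parse_answers(cell):
    """Parse one serialized list-of-dicts cell: JSON first, then Python literal syntax."""
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        return ast.literal_eval(cell)

def majority_answer(cell, key="answer"):
    """Return the most common value of `key` among the parsed answers, or None if empty."""
    answers = [a[key] for a in parse_answers(cell) if key in a]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Hypothetical cells in the two plausible encodings.
json_cell = '[{"answer": "soda"}, {"answer": "soda"}, {"answer": "can"}]'
repr_cell = "[{'answer': 'unanswerable'}, {'answer': 'unanswerable'}]"
print(majority_answer(json_cell), majority_answer(repr_cell))
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for single-quoted cells.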

Open
170 / 233
profile reading

.cache who daly country 2010

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/daly_country_2010.xlsx#Notes · 1 rows × 1 cols

1× metadata

This 'dataset' is a single-row, single-column extract from the Notes sheet of a WHO Global Health Estimates 2021 file (DALYs by country, 2010). The lone column header is the report title block, and its only value is a URL pointing to the WHO mortality and global health estimates portal. There is essentially no analyzable content here — it is metadata, not data. Before any analysis, point the pipeline at one of the actual data sheets in the workbook rather than the Notes tab.

Open
171 / 233
profile reading

.cache who daly country 2019

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/daly_country_2019.xlsx#Notes · 1 rows × 1 cols

1× metadata

This dataset is essentially a single-row metadata note from the WHO Global Health Estimates 2021 summary tables (June 2024), extracted from the 'Notes' sheet of the DALY country-level workbook. It contains just one column and one value — a URL pointing to the WHO mortality and global health estimates portal. There is no analytical content here to chart or model; the file is a pointer/citation rather than a data table. The first thing to do is open a different sheet in the source workbook to find the actual estimates, since this Notes tab carries no observations.

Open
172 / 233
profile reading

.cache who daly country 2000

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/daly_country_2000.xlsx#Notes · 1 rows × 1 cols

1× metadata

This dataset appears to be a metadata or notes sheet from the WHO Global Health Estimates 2021 Summary Tables (June 2024). It contains only 1 row and 1 column, with the single value being a URL pointing to the WHO mortality and global health estimates portal. There is essentially no analytical content here — this looks like a header/notes tab rather than a data table. To do any meaningful analysis, you should load one of the sibling sheets in the source workbook that contains the actual DALY country estimates.

Open
173 / 233
profile reading

accessibility atlas cms medicaid enrollment 2026

/home/coolhand/datasets/accessibility-atlas/cms_medicaid_enrollment_2026.csv · 10,302 rows × 44 cols

21× feature 20× metadata 1× timestamp 1× label

This dataset contains 10,302 monthly state-level records (51 states across 102 reporting periods from 201309 to 202510) tracking Medicaid and CHIP enrollment, application processing, eligibility determinations, and call center performance. The headline metric, Total Medicaid and CHIP Enrollment, is nearly complete and ranges from 0 to about 14.46M with a median of roughly 1.03M, while most other operational metrics are heavily right-skewed with substantial outliers. Two things deserve a closer look first: missingness is very uneven — Total Adult Medicaid Enrollment is 84.7% null and the call center metrics are ~69% null, while core enrollment fields are essentially complete — and the 'State Expanded Medicaid' flag splits the panel roughly 73%/27% (Y/N), which is a natural cut for comparison. The Final Report and Preliminary/Updated flags are exactly 50/50, suggesting each record appears in both a preliminary and final form, so deduplication may be needed before aggregation.
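If the 50/50 flag split does mean each state-period appears twice, a deduplication pass might look like the sketch below. The field names (`state`, `period`, `final`, `enrollment`) and the sample rows are placeholders, not the file's real headers, and the duplication itself is only suggested by the reading.

```python
def prefer_final(records):
    """Keep one record per (state, period); a final report replaces a preliminary one."""
    best = {}
    for rec in records:
        key = (rec["state"], rec["period"])
        current = best.get(key)
        # Take the first record seen, then upgrade to a final report if one appears.
        if current is None or (rec["final"] and not current["final"]):
            best[key] = rec
    return list(best.values())

# Hypothetical rows for illustration only.
records = [
    {"state": "OH", "period": "202501", "final": False, "enrollment": 100},
    {"state": "OH", "period": "202501", "final": True, "enrollment": 104},
    {"state": "OH", "period": "202502", "final": False, "enrollment": 107},
]
deduped = prefer_final(records)
total = sum(r["enrollment"] for r in deduped)
print(len(deduped), total)
```

Summing the raw rows here would count the first period twice; summing `deduped` counts each period once, using the final figure where it exists.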

Open
174 / 233
profile reading

quirky fossils

/home/coolhand/html/datavis/data_trove/data/quirky/fossils.json · 22,043 rows × 21 cols

14× feature 3× label 1× other 1× metadata 1× foreign_key 1× identifier

This dataset contains 22,043 fossil occurrence records with 21 columns spanning taxonomy (phylum, class, order, family, genus, name, rank), geography (country, state, lat/lon, paleolat/paleolng), and geologic age (early_age_mya, late_age_mya, period, late_interval). Taxonomy is dominated by Chordata (about 82% of rows) with Mammalia as the leading class (~32%) followed by Saurischia and Ornithischia, suggesting a strong vertebrate and dinosaur emphasis worth examining first. Geographically the data skews heavily to the US (~51%), with Wyoming, Montana, and New Mexico topping the state list, so any spatial analysis should account for this North American concentration. Age columns (early_age_mya, late_age_mya) are right-skewed with medians around 100 Mya and ~11% flagged as outliers, hinting at a long tail of very old records. Note that 'collection' and 'formation' are entirely empty and should be ignored.

Open
175 / 233
profile reading

wlasl index

/home/coolhand/html/datavis/data_trove/cache/wlasl_index.json · 2,000 rows × 2 cols

1× label 1× other

This dataset is a 2,000-row index from a WLASL (Word-Level American Sign Language) source, with two columns: 'gloss' (text labels) and 'instances' (an unparsed/unknown field, likely nested data). The 'gloss' column is essentially a vocabulary list: every one of the 2,000 rows is unique, 97.75% are single words, and the mean length is just 6 characters. The 'instances' column was skipped by the profiler and warrants manual inspection, since it likely contains the actual sign-language sample records keyed to each gloss. Start by looking at the gloss length distribution to confirm the single-word pattern, then dig into the structure of 'instances' separately.

Open
176 / 233
profile reading

parking parking violations sample 20260119

/home/coolhand/html/datavis/data_trove/cache/parking/parking_violations_sample_20260119.json · 10,000 rows × 40 cols

28× feature 4× timestamp 3× identifier 2× foreign_key 2× metadata

This is a 10,000-row sample of NYC parking violations with 40 fields covering ticket metadata, vehicle attributes, and location/precinct codes. The violation mix is dominated by one category — 'PHTO SCHOOL ZN SPEED VIOLATION' accounts for 4,416 of the issued tickets (about 52% of non-null descriptions) — which also drives the issuing_agency skew toward 'V' and law_section '408'. Geographically, registration_state is heavily NY (6,935) with NJ and PA trailing, and violation_county splits across Queens, Manhattan, Brooklyn, and the Bronx but with inconsistent codes (e.g., 'QN' vs 'Qns', 'BK' vs 'Kings'). Watch out for heavy nulls and placeholder zeros: meter_number, unregistered_vehicle, time_first_observed, and violation_post_code are >70% null, while issuer_precinct, feet_from_curb, and street_code* are dominated by '0' sentinel values. Vehicle_color also has unnormalized variants ('BK'/'BLK'/'BLACK', 'GY'/'GREY'/'GRY') that will need cleanup before any analysis.
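The cleanup described above can be sketched with per-column lookup tables. Only the variants named in the reading are mapped; the helper name `normalize` is hypothetical, and a real pass would need the full list of codes found in the data.

```python
# Variant spellings named in the reading, mapped to one canonical label each.
COLOR_MAP = {"BK": "BLACK", "BLK": "BLACK", "BLACK": "BLACK",
             "GY": "GREY", "GRY": "GREY", "GREY": "GREY"}
# Kings County is Brooklyn, so both spellings collapse to one label.
COUNTY_MAP = {"QN": "QUEENS", "QNS": "QUEENS",
              "BK": "BROOKLYN", "KINGS": "BROOKLYN"}

def normalize(value, mapping):
    """Trim, upper-case, and map a raw code; unknown codes pass through unchanged."""
    if value is None:
        return None
    key = value.strip().upper()
    return mapping.get(key, key)

print(normalize(" blk ", COLOR_MAP), normalize("Qns", COUNTY_MAP))
```

The maps are kept separate per column because the same code can mean different things: 'BK' is a color in vehicle_color and a borough in violation_county.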

Open
177 / 233
profile reading

accessibility atlas cdc dhds disability prevalence

/home/coolhand/datasets/accessibility-atlas/cdc_dhds_disability_prevalence.csv · 3,592 rows × 30 cols

15× metadata 8× feature 3× foreign_key 2× other 1× timestamp 1× label

This dataset contains 3,592 BRFSS-derived records of age-adjusted disability prevalence among U.S. adults 18+, broken out by state/territory (65 locations), year (2016-2022), and 8 disability response types. The core measure is Data_Value (percent prevalence), which ranges from 1.8% to 81.3% with a median of 9.1% and a heavily right-skewed distribution flagged for outliers. Most metadata columns (Category, Indicator, DataSource, Stratification1, etc.) are constant single-value fields and can be ignored as filters. The two things worth a closer look are the distribution of Data_Value across the 8 disability types in Response, and the geographic spread via LocationDesc — both are perfectly balanced in row counts, so any variation will come from the prevalence values themselves.

Open
178 / 233
profile reading

wild nasa meteorites

/home/coolhand/html/datavis/data_trove/data/wild/nasa_meteorites.csv · 45,716 rows × 20 cols

8× feature 4× identifier 3× timestamp 3× metadata 1× other 1× label

This is a NASA meteorites dataset with 45,716 records and 20 columns covering each meteorite's name, classification, mass, fall type, year, and geographic coordinates. The most interesting signals are physical and categorical: mass (g) is extremely skewed (mean ~13,278g vs median 32.6g, max 60,000,000g) with ~15.5% flagged as outliers, and recclass is dominated by ordinary chondrites (L6 at 18.1%, followed by H5, L5, H6, H4). The fall column is heavily imbalanced — 97.6% 'Found' vs 2.4% 'Fell' — and year shows a clear concentration in recent decades, peaking at 2003 (3,323 records). Note that Counties and States are 96% null, several columns (created_at, updated_at, position, meta) are constant and can be ignored, and GeoLocation has 55% duplicate values driven by a few repeated Antarctic coordinates.

Open
179 / 233
profile reading

.cache who yld region

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/yld_region.xlsx#Notes · 196 rows × 2 cols

2× metadata

This is the 'Notes' sheet from a WHO Global Health Estimates 2021 workbook on Years Lost due to Disability (YLDs) by region, with 196 rows and just 2 columns. The first column is almost entirely empty (96.94% null) and contains only six narrative blurbs, while the second column carries 190 unique short strings, mostly country names plus a handful of header/citation lines. In other words, this isn't analytical data: it's a metadata/documentation sheet listing WHO member states and citation text. Before doing anything analytical, turn to the workbook's other sheets; the meaningful YLD numbers live elsewhere.

Open
180 / 233
profile reading

.cache who yld global

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/yld_global.xlsx#Notes · 196 rows × 2 cols

1× metadata 1× identifier

This is a small 'Notes' sheet (196 rows, 2 columns) extracted from a WHO Global Health Estimates 2021 workbook on years lost due to disability (YLDs). It is essentially metadata and a country list rather than a tabular dataset: the unnamed first column is 96.94% null with only 6 distinct header/note strings, while the second column holds 190 nearly-unique values dominated by country names. The most useful thing to look at is the second column's values to confirm it is the WHO Member State list. Treat this sheet as documentation; the real burden-of-disease numbers live on other sheets of the source workbook.

Open
181 / 233
profile reading

.cache who daly region

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/daly_region.xlsx#Notes · 196 rows × 2 cols

1× metadata 1× label

This is the 'Notes' sheet from the WHO Global Health Estimates 2021 workbook on DALYs by cause, age and sex, by WHO region, 2000-2021. It is essentially a metadata and country-listing tab rather than analytical data: 196 rows across just two columns. The first column (__UNNAMED__0) is 96.94% null and only carries six header/citation strings, while the second column holds 190 mostly unique entries — predominantly the list of WHO Member States plus a few title and source lines. Treat this sheet as documentation; the real DALY figures live on other sheets of the workbook.

Open
182 / 233
profile reading

.cache who daly global

/home/coolhand/html/datavis/data_trove/data/accessibility/.cache_who/daly_global.xlsx#Notes · 196 rows × 2 cols

2× metadata

This is a 196-row, 2-column slice extracted from the 'Notes' sheet of the WHO Global Health Estimates 2021 DALY workbook, not the burden-of-disease data itself. The first column (__UNNAMED__0) is almost entirely empty (96.94% null) with just 6 distinct free-text notes, while the second column holds 190 near-unique entries that look like a list of countries plus a few header lines. The structure suggests the file was parsed starting on the wrong sheet; the substantive DALY estimates live elsewhere in the workbook. Before any analysis, repoint the loader at the data sheets; this 'Notes' tab is essentially metadata and a country roster.

Open
183 / 233
profile reading

language data world languages integrated

/home/coolhand/datasets/language-data/world_languages_integrated.json · 7,130 rows × 8 cols

5× other 3× identifier

This dataset catalogs 7,130 world languages, each row keyed by a unique ISO 639-3 code paired with a language name. Only two of the eight columns (iso_639_3, name) parsed cleanly as text; the other six — including data_sources, glottolog, speaker_count, and us_indigenous — were skipped as nested or non-scalar structures and will need flattening before they yield insight. The name column is the most interesting surface feature: it averages about 9 characters, is single-word ~73% of the time, and its top tokens reveal heavy use of regional qualifiers like 'southern', 'northern', 'eastern', and 'central', plus 152–153 entries containing 'language' or 'sign'. Start by unpacking the skipped object columns (especially speaker_count and glottolog) and look at name-token frequencies to understand naming conventions and language-family clustering.
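Flattening the skipped object columns could be done along these lines. This is a sketch under assumptions: the inner keys (`l1`, `l2`, `code`), the sample record, and the dotted-name convention are all invented for illustration, since the reading does not show what the nested columns contain.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; join lists into one string."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse, carrying the parent key as a prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = "; ".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat

# Hypothetical record; every value here is a placeholder.
record = {
    "iso_639_3": "xxa",
    "name": "Example Language",
    "speaker_count": {"l1": 1000, "l2": 250},
    "glottolog": {"code": "exam1234"},
    "data_sources": ["source_a", "source_b"],
}
flat = flatten(record)
print(sorted(flat))
```

After flattening, fields such as `speaker_count.l1` are plain scalars that a profiler can treat as numeric columns.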

Open
184 / 233
profile reading

nationwide 2020 election

/home/coolhand/html/datavis/data_trove/data/geographic/nationwide/2020_election.csv · 3,152 rows × 10 cols

This dataset covers 3,152 U.S. counties from a 2020 election results file, with 10 columns spanning vote totals, party percentages, and geographic identifiers. Vote-count fields (total_votes, votes_dem, votes_gop, diff) are extremely right-skewed with high kurtosis and 12-16% outlier rates, reflecting a few massive-population counties dominating the raw counts — worth inspecting on a log scale. The percentage fields tell a cleaner story: per_gop has a median of 0.68 versus per_dem's 0.30, and per_point_diff is negatively skewed, indicating Republican margins in most counties despite Democrats winning the national popular vote. State coverage is led by Texas (254 counties), Georgia (159), and Virginia (133), so any state-level aggregation should account for that imbalance.
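To illustrate why a log scale helps with counts like these, here is a small sketch. The vote totals are made up, and `log10_counts` is a hypothetical helper; it uses log10(1 + v) so that a zero count stays defined.

```python
import math

def log10_counts(values):
    """Return log10(1 + v) for each count, so zeros map to 0.0 instead of failing."""
    return [math.log10(1 + v) for v in values]

# Made-up county vote totals spanning several orders of magnitude.
votes = [90, 1_200, 15_000, 4_200_000]
logged = log10_counts(votes)

raw_ratio = max(votes) / min(votes)    # spread on the raw scale
log_ratio = max(logged) / min(logged)  # spread after the transform
print(round(raw_ratio), round(log_ratio, 2))
```

On the raw scale the largest county is tens of thousands of times the smallest, so a histogram collapses into one bar; after the transform the values sit within a small multiple of each other.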

Open
185 / 233
profile reading

archive api data sample

/home/coolhand/html/datavis/data_trove/joshua-project/archive/api_data_sample.json · 50 rows × 107 cols

57× feature 26× metadata 7× free_text 6× other 5× foreign_key 4× label 2× identifier

This is a 50-row, 107-column sample from the Joshua Project API describing Arab and related Muslim people groups across 41 countries. The dataset is dominated by one affinity bloc ('Arab World', 100%) and one religion ('Islam', 98%), so the interesting variation lies in geography, population size, and reachedness rather than in identity fields. Look first at Population and PopulationPGAC, which are heavily right-skewed (max 2.22M and 7.56M respectively, with multiple outliers) and at PCIslam, which is high but varies from 25% to 100%. JPScaleText shows that 76% of groups are classified 'Unreached', making that the most actionable signal alongside Continent/RegionName for where these groups sit. Note also the high null rates on language, Bible-translation, and nomadic descriptors (86–98% missing), which limits any analysis of those attributes.

Open
186 / 233
profile reading

nyc housing nyc housing metrics merged

/home/coolhand/html/datavis/data_trove/data/urban/nyc_housing/nyc_housing_metrics_merged.csv · 2,327 rows × 23 cols

20× feature 2× identifier 1× metadata

This dataset covers 2,327 NYC census tracts with 23 columns describing housing tenure, rent burden, income, and rent levels across the five boroughs. The most urgent issue is data hygiene: median_gross_rent and median_household_income both contain a sentinel value of -666666666, which drags their means to roughly -41.5M and -36M respectively despite sensible medians (~$1,735 rent, ~$76,833 income); these rows need to be filtered out before any analysis. Beyond that, the substantive story is rent burden: pct_rent_burdened has a median of 50% with an IQR of 40.9–58.8, meaning that in half of NYC tracts at least half of renters spend 30% or more of income on rent. Brooklyn (Kings) dominates the tract count at 35%, followed by Queens (31%) and the Bronx (15%), so any borough-level comparison should weight accordingly. The state column is constant (all 36, New York) and can be dropped.
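A minimal sketch of the sentinel filter follows. The sentinel value comes from the reading; the rent figures and the helper name `drop_sentinel` are made up for illustration.

```python
import statistics

SENTINEL = -666666666  # placeholder for missing values, as reported in the reading

def drop_sentinel(values, sentinel=SENTINEL):
    """Remove missing entries: both None and the numeric sentinel."""
    return [v for v in values if v is not None and v != sentinel]

# Made-up tract rents with two sentinel entries mixed in.
rents = [1500, 1735, SENTINEL, 2100, SENTINEL]

raw_mean = statistics.mean(rents)      # dragged far below zero by the sentinels
clean = drop_sentinel(rents)
clean_mean = statistics.mean(clean)    # back in a plausible range
clean_median = statistics.median(clean)
print(raw_mean, clean_mean, clean_median)
```

Filtering before aggregating matters more for the mean than the median, which matches the pattern in the reading: absurd means next to sensible medians.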

Open
187 / 233
profile reading

disability census disability by county 2022

/home/coolhand/datasets/us-inequality-atlas/disability/census_disability_by_county_2022.csv · 3,222 rows × 16 cols

12× feature 2× identifier 2× foreign_key

This dataset contains 2022 US Census disability counts for 3,222 counties, broken out by disability type (ambulatory, cognitive, hearing, vision, self-care, independent living) along with totals, a derived disability rate, and FIPS identifiers. Nearly every count column is heavily right-skewed (skew above 10) with substantial outliers — total_population alone ranges from 47 to 9.87M with a mean of ~102K but a median of just 25,328, so a handful of large counties dominate the raw counts. The disability_rate field is the most analyst-friendly view: it's bounded, less skewed (skew 2.17), and centers around a median of 1.07 with an IQR of 0.77–1.42. Start with disability_rate to compare counties on equal footing, then look at total_population to understand the size distribution before interpreting any raw disability counts.

Open
188 / 233
profile reading

nationwide census counties nationwide

/home/coolhand/html/datavis/data_trove/data/geographic/nationwide/census_counties_nationwide.csv · 3,144 rows × 8 cols

5× feature 2× identifier 1× foreign_key

This dataset covers 3,144 U.S. counties with demographic and socioeconomic indicators including population, median income, college attainment rate, and poverty rate, identified by FIPS codes and state. The most urgent issue is median_income: it has a minimum of -666,666,666 and a mean of -148,752, which are clearly sentinel values for missing data masquerading as numbers and must be cleaned before any analysis. Population is also extremely right-skewed (skew ~13, max ~9.9M vs median ~25,785), so log-scaling will be necessary for any visualization or modeling. State coverage is uneven, with Texas (254 counties), Georgia (159), and Virginia (133) dominating the row counts. College and poverty rates are the cleanest fields and behave roughly as expected for county-level distributions.

Open
189 / 233
profile reading

nationwide 2016 election

/home/coolhand/html/datavis/data_trove/data/geographic/nationwide/2016_election.csv · 3,141 rows × 11 cols

8× feature 2× identifier 1× foreign_key

This dataset contains 3,141 rows and 11 columns covering 2016 U.S. presidential election results at the county level, including total votes, Democratic and Republican vote counts and shares, and county/state identifiers. Vote-count columns (total_votes, votes_dem, votes_gop) are extremely right-skewed with high kurtosis and many outliers, reflecting a few very populous counties dominating the totals, so they are worth a log-scale or filtered view. The per_gop and per_dem share columns tell a clearer story: per_gop has a mean of about 0.64 versus per_dem at 0.32, indicating that the average county leaned strongly Republican. State coverage is broad (51 categories) with Texas (254 counties) and Georgia (159) most represented, so any state-level aggregation should account for that imbalance.

Open
190 / 233
profile reading

joshua project joshua project countries

/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_countries.json · 238 rows × 39 cols

33× feature 5× identifier 1× label

This dataset profiles 238 countries from the Joshua Project, combining demographic data (population, people groups, languages) with religious composition percentages and Bible translation/evangelization status. Christianity dominates as the primary religion in 159 of 238 countries, while the JPScaleText field shows 89 countries are 'Significantly Reached' versus 43 'Unreached' — a useful starting lens for mission analysis. Population and people-group counts are extremely right-skewed (skew >9, with outliers like 1.46B population), so log-scale views or per-capita ratios will be more informative than raw totals. Religion percentage columns also have very high zero-rates (e.g., Hinduism 56%, Buddhism 48%), reflecting that most countries have negligible presence of any given non-dominant religion. Note also that PoplPeoplesFPG and CntPeoplesFPG have substantial null rates (32% and 29%), so any analysis of frontier/unreached people groups should account for missing coverage.

Open
191 / 233
profile reading

quirky carnivorous plants real

/home/coolhand/html/datavis/data_trove/data/quirky/carnivorous_plants_real.json · 610 rows × 14 cols

7× feature 3× label 1× free_text 1× metadata 1× timestamp 1× identifier

This dataset holds 610 GBIF biodiversity occurrence records across 14 columns, mixing taxonomy (family, genus, species), geography (country, stateProvince, latitude/longitude), and observation metadata (basisOfRecord, year, month, coordinateUncertainty). Despite the 'carnivorous_plants' filename, the taxonomy is dominated by two unrelated families — Hesperiidae (skipper butterflies) and Canellaceae — each with 300 records, plus a small Araceae tail; this taxonomic split is the first thing worth investigating. Geographically, records skew to the Americas (USA 130, Mexico 73, Brazil 51) but span 35 countries, and 90% are HUMAN_OBSERVATION rather than preserved specimens. Watch coordinateUncertainty closely: it is highly skewed (skew 17.3) with a max of 766,917 m and 22.6% nulls, so any spatial analysis needs filtering. Years are tightly clustered in 2021–2026, indicating a recent-only snapshot.

Open
192 / 233
profile reading

fips county geology counties

/home/coolhand/html/datavis/data_trove/geographic/fips_county/geology_counties.csv · 3,235 rows × 9 cols

6× feature 1× identifier 1× metadata 1× label

This dataset links 3,235 U.S. counties (by FIPS code) to their nearest geological mineral or fuel deposit, including the deposit's type, era, state, and distance. Coal dominates deposit_type at roughly 42% of rows, with Copper, Iron, and Oil rounding out the major categories — worth checking whether this reflects true geological prevalence or sampling bias. The distance_to_deposit column is heavily right-skewed (skew ~7.5, max 5652 vs. median 152), so a small number of remote counties pull the mean far above typical values and deserve a closer look. Deposit eras span nine geological periods led by Pennsylvanian (~23%), and deposit_state concentrates in Missouri, Ohio, and Alabama even though counties themselves are spread across all 56 state codes.

Open
193 / 233
profile reading

nyc housing nyc rent burden by tract

/home/coolhand/html/datavis/data_trove/data/urban/nyc_housing/nyc_rent_burden_by_tract.csv · 2,327 rows × 16 cols

12× feature 3× identifier 1× metadata

This dataset covers 2,327 NYC census tracts with 16 columns describing renter households and rent burden levels across the five boroughs. All tracts are in New York State (state is constant at 36) and split across five counties, with Brooklyn (Kings) the largest share at about 34.6% of tracts and Staten Island the smallest at 126 tracts. The headline housing-affordability metric, pct_rent_burdened, is roughly symmetric around a median of 50% with an IQR of 40.9 to 58.8, indicating that in a typical tract about half of renters spend 30%+ of income on rent. The raw count columns (rent_burdened, rent_50_pct_or_more, total_renter_households) are right-skewed with notable outliers, so look at the burden percentages first for cross-tract comparison and reserve the count fields for identifying the highest-volume tracts.

Open
194 / 233
profile reading

us attention data wikipedia trending

/home/coolhand/datasets/us-attention-data/wikipedia_trending.json · 500 rows × 5 cols

3× feature 1× identifier 1× other

This dataset captures 500 trending Wikipedia articles, with each row identified by a unique title and described by days_in_top_100, peak_views, total_views, and a daily_views series. All three numeric columns are heavily right-skewed with significant outliers — total_views skew is 10.4 with a max of ~23.9M against a median of ~213K, and peak_views shows similar behavior. Most articles spend only a few days in the top 100 (median 3, max 30), but a long tail extends well beyond. Start by examining the distribution of total_views and days_in_top_100 to understand how concentrated attention is on a few breakout articles.

Open
195 / 233
profile reading

food deserts food desert merged

/home/coolhand/datasets/us-inequality-atlas/food_deserts/food_desert_merged.csv · 3,222 rows × 11 cols

7× feature 3× identifier 1× foreign_key

This dataset contains 3,222 rows and 11 columns of US county-level indicators on poverty, SNAP eligibility and participation, vehicle access, and total population, keyed by FIPS and county/state codes. The population and program-count columns (total_pop, poverty_pop, snap_eligible_est, snap_participants_est, no_vehicle_total) are extremely right-skewed, with skew values from 13 to 20 and around 11-14% of rows flagged as outliers — a handful of very large counties dominate the raw totals. Note that snap_eligible_est and poverty_pop have identical statistics, suggesting one is a direct copy of the other and worth verifying before analysis. The rate-based columns are more tractable: poverty_rate has a moderate skew of 2.1 with a median of 13.55%, and no_vehicle_pct has a median of 5.41% but a long tail reaching 85.94%. Start with the rate columns for cross-county comparison and reserve the totals for absolute-magnitude questions.
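Identical summary statistics do not prove the two columns match row by row, so a direct comparison is a reasonable first check. The sketch below uses made-up rows with placeholder FIPS codes; the helper names are hypothetical.

```python
def columns_identical(rows, a, b):
    """True when columns a and b hold the same value in every row."""
    return all(row[a] == row[b] for row in rows)

def mismatches(rows, a, b, limit=5):
    """Return up to `limit` (row index, a value, b value) tuples where the columns differ."""
    found = []
    for i, row in enumerate(rows):
        if row[a] != row[b]:
            found.append((i, row[a], row[b]))
            if len(found) == limit:
                break
    return found

# Made-up rows for illustration only.
rows = [
    {"fips": "00001", "poverty_pop": 1200, "snap_eligible_est": 1200},
    {"fips": "00002", "poverty_pop": 560, "snap_eligible_est": 560},
]
same = columns_identical(rows, "poverty_pop", "snap_eligible_est")
print(same, mismatches(rows, "poverty_pop", "snap_eligible_est"))
```

If the check returns True on the real file, one of the two columns can be dropped or its derivation questioned; if not, the mismatch list shows where they diverge.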

Open
196 / 233
profile reading

parquet linguistic features

/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet · 76,475 rows × 6 cols

4× feature 1× foreign_key 1× metadata

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

197 / 233
profile reading

olympics olympic medals data

/home/coolhand/html/datavis/data_trove/data/cultural/olympics/olympic_medals_data.json · 1,433 rows × 8 cols

7× feature 1× timestamp

This dataset contains 1,433 rows of Olympic medal counts by country and year, spanning 1896 to 2024 across 165 countries. Medal columns (gold, silver, bronze, total) are heavily right-skewed with high kurtosis and many outliers — a small number of dominant nations pull the means well above the medians (e.g. total has a median of 5 but a max of 234). Zero-rates are notable too: 33.9% of rows have zero gold medals and 25.3% zero silver, reflecting how often countries leave a Games empty-handed in a category. Country participation is fairly even at the top, with France and Great Britain tied as most-frequent entries (30 appearances each). Start by examining the shape of `total` and `gold` distributions and the `year` coverage to understand era effects.

198 / 233
profile reading

healthcare healthcare desert merged

/home/coolhand/datasets/us-inequality-atlas/healthcare/healthcare_desert_merged.csv · 3,222 rows × 10 cols

7× feature 2× identifier 1× label

This dataset profiles 3,222 U.S. counties (one row per county, keyed by FIPS) with population, uninsured counts and rates, poverty rate, a hospital closure risk score, and rural/urban flags. Population and uninsured figures are extremely right-skewed (total_pop skew 13.4, uninsured_pop skew 17.8), so a handful of large counties will dominate any raw totals — analysis should likely use rates or log scales. The hospital_closure_risk_score collapses to just 3 distinct values (with ~29% scoring 0), and risk_category is heavily imbalanced with 84% of counties labeled 'Low' and the rest 'Moderate', which is worth examining first. About 69% of counties are flagged Rural, so rural/urban comparisons of uninsured and poverty rates should be a productive next cut.

199 / 233
profile reading

api auth

/home/coolhand/data/api_auth.db · 502 rows × 11 cols

4× metadata 4× feature 2× identifier 1× timestamp

This dataset contains 502 API request logs across 11 columns, capturing usage telemetry like response time, status code, endpoint, and user agent. Traffic is dominated by a single API ('linguistic-api' at 99.6%) and a single method (GET), with all requests coming from one IP (127.0.0.1), so the interesting variation lives in endpoint, response_time_ms, status_code, and user_agent. Response times are heavily skewed: the median is just 3ms but the mean is 163ms with a max of 1238ms and 78 outliers (~24%), plus a 34% null rate worth investigating. Status codes split between 200 and 429, hinting at rate-limiting behavior. The endpoint column has a long tail of 209 distinct paths, with /api/languages and /api/search leading.

200 / 233
profile reading

quirky social actions

/home/coolhand/html/datavis/data_trove/data/quirky/social_actions.json · 2,000 rows × 3 cols

2× free_text 1× feature

This dataset has 2,000 rows and 3 columns: a numeric `count` and two near-identical text fields (`name` and `full`) that look like short phrases about social behavior. The `count` column is extremely right-skewed (skew 6.26, kurtosis 76.6) with a median of 14 but a max of 461 and 85 outliers — worth investigating before any averaging. The two text columns are essentially twins: same length profile (mean ~28 chars, ~4.5 words), same top words (`your`, `being`, `to`, `a`), and overlapping vocab sizes (1628 vs 1626), suggesting `full` may be a near-duplicate or light reformat of `name`. Start by inspecting the `count` distribution on a log scale and spot-checking a few rows to see how `name` and `full` actually differ.

201 / 233
profile reading

food deserts snap participation

/home/coolhand/html/datavis/data_trove/data/urban/food_deserts/snap_participation.csv · 3,222 rows × 9 cols

5× feature 4× identifier

This dataset covers 3,222 U.S. counties with population, poverty, and SNAP participation estimates alongside FIPS and state identifiers. Population and SNAP-related counts are extremely right-skewed: total_pop has a skew of 13.4 and a max of 9.78M against a median of just 25,174, with similar long tails in poverty_pop and snap_participants_est. The poverty_rate column is better behaved (median 13.55%, max 66.32%) and is probably the most useful field for cross-county comparison without log-scaling. Note that snap_eligible_est appears to be an exact duplicate of poverty_pop (identical stats), which is worth verifying before using either as an independent variable. State coverage spans 52 distinct values, which is consistent with the 50 states plus DC and one territory.
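
Identical summary statistics do not by themselves prove that two columns match row by row, so the duplicate suspicion is worth a direct check. The following is a minimal sketch, assuming the file has been loaded into a list of dicts; the rows and the helper name `mismatched_rows` are hypothetical, and only the column names come from the reading.

```python
# Hypothetical sample rows; only the column names come from the reading.
rows = [
    {"fips": "01001", "poverty_pop": 6900, "snap_eligible_est": 6900},
    {"fips": "01003", "poverty_pop": 23100, "snap_eligible_est": 23100},
    {"fips": "01005", "poverty_pop": 5400, "snap_eligible_est": 5400},
]


def mismatched_rows(rows, col_a, col_b):
    """Return every row where the two columns disagree."""
    return [r for r in rows if r[col_a] != r[col_b]]


diffs = mismatched_rows(rows, "poverty_pop", "snap_eligible_est")
is_exact_copy = len(diffs) == 0
print(f"{len(diffs)} mismatched rows; exact copy: {is_exact_copy}")
```

If the list of mismatches is empty on the real file, one of the two columns can probably be dropped; if not, the mismatched rows are the ones to inspect.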

202 / 233
profile reading

us attention data wikipedia event articles

/home/coolhand/datasets/us-attention-data/wikipedia_event_articles.json · 10 rows × 5 cols

3× feature 1× identifier 1× other

This is a small dataset of 10 Wikipedia articles tracking US public attention, with view metrics (peak_views, avg_daily_views, total_views) plus an article name and a timeline field. The view metrics are heavily right-skewed — peak_views has a skew of 2.61 and a max of 739,258 against a median of just 22,111, suggesting one or two articles dominate attention. Each numeric column flags one outlier (10% outlier rate), so it's worth identifying which article is pulling the distribution. The article column has 10 unique values for 10 rows, so it functions as an identifier rather than a category to aggregate on.

203 / 233
profile reading

nyc housing nyc tenure by tract

/home/coolhand/html/datavis/data_trove/data/urban/nyc_housing/nyc_tenure_by_tract.csv · 2,327 rows × 10 cols

7× feature 2× identifier 1× metadata

This dataset contains 2,327 New York City census tracts with housing tenure breakdowns across 10 columns, covering owner- and renter-occupied household counts and percentages by county. Brooklyn (Kings) leads with 805 tracts (34.6% of rows), followed by Queens (725) and Bronx (361), while Staten Island has just 126. Renting dominates citywide: the mean share of renter-occupied households is 62.5% versus 37.5% owner-occupied, and renter counts are right-skewed with a long tail up to 8,209 per tract. Worth a closer look: the strong skew in raw household counts (owner_occupied skew 1.76, renter_occupied skew 1.59) and the ~4% null rate in the percentage columns. Note that 'state' is constant (all 36) and can be ignored.

204 / 233
profile reading

food deserts vehicle access

/home/coolhand/html/datavis/data_trove/data/urban/food_deserts/vehicle_access.csv · 3,222 rows × 9 cols

6× feature 3× identifier

This dataset covers vehicle access for 3,222 US counties (one row per county, identified by FIPS code and name) across 9 columns, with no missing values. The household and no-vehicle counts are extremely right-skewed — `no_vehicle_total` has a median of 580 but a max of 601,621, and `total_households` ranges from 32 up to roughly 3.36 million — so a handful of large urban counties dominate the absolute totals. The more comparable signal is `no_vehicle_pct`, which has a median of 5.41% but stretches up to 85.94%, flagging a small set of counties with extreme transit dependence worth investigating first. State coverage looks complete (52 distinct state codes), so geographic breakdowns should be straightforward.

205 / 233
profile reading

accessibility ssa sa fywl

/home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv · 1,093 rows × 30 cols

14× identifier 11× feature 3× metadata 1× other 1× timestamp

This appears to be the SSA-SA-FYWL dataset (Social Security Administration state/area fiscal-year workload data) with 1,093 rows and 30 columns, but the headers were not parsed correctly — most columns carry placeholder names like `_duplicated_*` and several columns hold metadata constants (file name, update date 3/13/2023, date type 'FY'). The most informative real fields are the geographic and time dimensions: `_duplicated_2` holds 53 US state codes (each appearing 21 times), `_duplicated_1` holds 11 region codes dominated by ATL (168 rows), and `_duplicated_4` holds 22 fiscal years from 2001 onward in a balanced panel. Many numeric measures (e.g. `_duplicated_22`, `_duplicated_12`, `_duplicated_10`) were ingested as text/categorical strings of decimal numbers, so they should be retyped before analysis. Start by fixing headers and dtypes, then look at the region/state/year structure to confirm the panel layout.
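
Retyping the text-encoded measures might look like the sketch below. It assumes the values are plain decimal strings that may carry stray whitespace or thousands separators; the sample strings and the helper name `to_float` are hypothetical, not taken from the file.

```python
def to_float(text):
    """Parse a decimal string; return None for blanks or unparseable text."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")  # assumes "," is a thousands separator
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical cell values standing in for a column such as `_duplicated_22`.
raw_values = ["1234.5", " 87.0", "", "2,310.25", "n/a"]
parsed = [to_float(v) for v in raw_values]
n_failed = sum(p is None for p in parsed)
print(parsed, n_failed)
```

Counting the values that fail to parse gives a quick sense of whether a column is really numeric or only mostly numeric.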

206 / 233
profile reading

parquet phonemes

/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet · 105,484 rows × 8 cols

4× feature 2× label 1× foreign_key 1× metadata

This dataset is a phoneme inventory table with 105,484 rows and 8 columns, indexing phonemes by language (via iso_639_3 and glottocode) along with phonological features like segment_class, syllabic, stress, and tone, plus a source attribution. Coverage spans roughly 2,094 ISO languages and 2,176 Glottolog codes, with 'mis' (828 rows) and 'kham1282' (622 rows) being the most represented. Worth a closer look first: the segment_class and source distributions, since segment_class shows a clear consonant-heavy mix (72,282 consonants vs 31,052 vowels vs 2,150 tones) and source is dominated by 'ph' at 34% but spreads across 8 datasets, hinting at where data density comes from. The phoneme column itself is also informative — common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value) and largely redundant with the 'tone' segment_class.

207 / 233
profile reading

nyc housing nyc median income by tract

/home/coolhand/html/datavis/data_trove/data/urban/nyc_housing/nyc_median_income_by_tract.csv · 2,327 rows × 6 cols

3× feature 2× identifier 1× metadata

This dataset contains 2,327 New York City census tracts with median household income, geographic identifiers (state, county, tract), and tract names. The headline issue is median_household_income: it has a minimum of -666,666,666 and a mean of about -36 million, indicating sentinel/missing-value codes that must be filtered before any analysis — the median of $76,833 is the more trustworthy central value. County coverage is uneven, with Brooklyn (Kings) holding 34.6% of tracts and Staten Island only 126, so per-borough comparisons should be normalized. The state column is constant (36 = New York) and can be dropped.
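
A minimal sketch of the sentinel filter, assuming -666,666,666 is the only placeholder in the column. The income values below are hypothetical; only the sentinel itself comes from the reading.

```python
from statistics import mean, median

SENTINEL = -666_666_666  # reported as the column minimum in the reading

# Hypothetical tract incomes with two sentinel entries mixed in.
incomes = [52_000, 76_833, SENTINEL, 101_500, 88_250, SENTINEL]

# Keep only real observations before computing any summary statistic.
valid = [v for v in incomes if v != SENTINEL]
n_dropped = len(incomes) - len(valid)

raw_mean = mean(incomes)      # dominated by the sentinel, not meaningful
clean_mean = mean(valid)
clean_median = median(valid)
print(n_dropped, raw_mean, clean_mean, clean_median)
```

Reporting how many rows were dropped alongside the cleaned statistics keeps the filter auditable.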

208 / 233
profile reading

nyc housing nyc median rent by tract

/home/coolhand/html/datavis/data_trove/data/urban/nyc_housing/nyc_median_rent_by_tract.csv · 2,327 rows × 6 cols

3× identifier 2× feature 1× metadata

This dataset contains 2,327 New York City census tracts with median gross rent values across the five boroughs. The most important issue to investigate is median_gross_rent: it has a minimum of -666,666,666 and a mean of about -41.5 million, indicating sentinel values for missing data that must be filtered before any analysis — once cleaned, the median rent of $1,735 and IQR of $1,441–$2,049 are the realistic figures. The county_name field is well-distributed across five boroughs, with Brooklyn (Kings) the largest at 805 tracts (34.6%) and Staten Island the smallest at 126. Note that 'state' is constant (all 36, New York) and can be ignored, and 'NAME' is a unique tract label rather than an analytical field.

209 / 233
profile reading

economic unemployment by county

/home/coolhand/datasets/us-inequality-atlas/economic/unemployment_by_county.csv · 3,222 rows × 8 cols

6× feature 1× identifier 1× metadata

This dataset contains 3,222 US county-level labor market records with 8 columns covering county identifiers (FIPS, name, state) and workforce statistics (labor force, total 16+, unemployed, unemployment rate, participation rate). The unemployment rate averages 5.13% with a median of 4.69%, ranging up to 31.99%, so the right tail is worth inspecting for distressed counties. Population-based counts (labor_force, total_16_plus, unemployed) are extremely right-skewed (skew >13) with hundreds of outliers — expected when a few large metros sit alongside small rural counties, but it means you should likely log-transform before modeling. Texas (254), Georgia (159), and Virginia (133) contribute the most counties, reflecting state geography rather than any sampling bias. County names show a 39% duplicate rate driven by repeated names like Washington, Jefferson, and Franklin Counties across states — join on FIPS, not name.
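
The advice to join on FIPS rather than name can be illustrated with a small sketch. All records below are hypothetical placeholders, and the second table stands in for any other county-level file keyed by FIPS. Codes are kept as strings so leading zeros survive.

```python
# Hypothetical records: two different counties that share a name.
unemployment = [
    {"fips": "01129", "county": "Washington County", "state": "AL", "unemployment_rate": 4.1},
    {"fips": "05143", "county": "Washington County", "state": "AR", "unemployment_rate": 3.2},
]
gini = [
    {"fips": "01129", "gini_index": 0.46},
    {"fips": "05143", "gini_index": 0.47},
]

# Join on FIPS: every county keeps its own row.
gini_by_fips = {row["fips"]: row for row in gini}
joined = [
    {**u, "gini_index": gini_by_fips[u["fips"]]["gini_index"]}
    for u in unemployment
    if u["fips"] in gini_by_fips
]

# A name-keyed lookup silently collapses the two counties into one entry.
by_name = {u["county"]: u for u in unemployment}
print(len(joined), len(by_name))
```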

210 / 233
profile reading

quirky ufo shapes aggregated

/home/coolhand/html/datavis/data_trove/data/quirky/ufo_shapes_aggregated.json · 28 rows × 5 cols

2× feature 2× other 1× identifier

This dataset aggregates UFO sightings by shape, with 28 rows and 5 columns covering shape categories, sighting counts, average durations, and nested sightings/yearly trend data. The numeric fields are highly skewed: avgDuration ranges from 30 to 37,800 with a mean of about 3,749 and skew near 3.95, while count ranges from 1 to 12,877 with a median of just 993.5. Both fields flag outliers worth inspecting — likely a few dominant shape categories pulling the distribution. The shape column has 28 unique values (one row per shape), so it functions as an identifier rather than a grouping variable. Start by looking at which shapes drive the count and duration extremes.

211 / 233
profile reading

economic poverty depth by county

/home/coolhand/datasets/us-inequality-atlas/economic/poverty_depth_by_county.csv · 3,222 rows × 7 cols

6× feature 1× identifier

This dataset contains 3,222 rows of US county-level poverty statistics, with each row identified by a FIPS code, county name, and state abbreviation, plus three poverty rate measures and a population total. The poverty measures are all right-skewed: pct_poverty ranges from 1.6% to 66.32% with a median of 13.55%, while pct_deep_poverty has a median of 5.82% but reaches as high as 34.7%. The total population column is extremely skewed (skew of 13.4, kurtosis ~297) with a median of 25,174 but a max near 9.8 million, so any aggregate analysis should account for this. Texas (254 counties), Georgia (159), and Virginia (133) dominate the state distribution, which matters for any state-level rollups.

212 / 233
profile reading

quirky hot sauces

/home/coolhand/html/datavis/data_trove/data/quirky/hot_sauces.json · 258 rows × 9 cols

4× feature 2× metadata 1× label 1× free_text 1× identifier

This dataset catalogs 258 hot sauce products sourced entirely from OpenFoodFacts, with 9 categorical columns covering brand, category, country, ingredients, labels, name, and URL. Brands are highly fragmented across 158 unique values, with Tabasco (12) and McIlhenny Company, Tabasco (11) leading but no dominant player — and 37 records have a blank brand worth investigating. Geographically, the United States (54) and France (28) account for the largest shares of the 123 country values, though inconsistent encoding (e.g., 'en:us' vs 'United States') suggests a data-cleaning task. The labels column is sparse: 145 of 258 rows are blank, so dietary tags like 'No gluten' or 'Non GMO project' apply to only a small minority. Note that source and type are constant (OpenFoodFacts / hot_sauce_product) and carry no analytical signal.

213 / 233
profile reading

housing units

/home/coolhand/html/datavis/data_trove/cache/housing_units.parquet · 3,222 rows × 6 cols

4× feature 2× identifier

This dataset covers 3,222 U.S. counties with housing-unit counts (owner-occupied, renter-occupied, total) plus a FIPS code, county name, and the percent of renters. The three count columns are extremely right-skewed (skew between 9.5 and 15.8, kurtosis above 140) with 13–14% of rows flagged as outliers — a handful of huge urban counties (max total_housing_units of about 3.36M vs a median of roughly 10,021) dominate the distribution. The pct_renter field is far better behaved, centered near 26% with a much tighter spread, making it the most useful comparable metric across counties. Start by inspecting the long tail of total_housing_units, then use pct_renter to compare counties on a normalized basis.

214 / 233
profile reading

acs 2022 county

/home/coolhand/html/datavis/data_trove/demographic/veterans/cache/acs_2022_county.parquet · 3,144 rows × 7 cols

3× identifier 3× feature 1× foreign_key

This dataset covers 3,144 U.S. counties from the 2022 American Community Survey, with each row identified by FIPS, state, county code, and name, plus three Census table values: total population (B01003_001E), male veteran population (B21001_002E), and civilian labor force (B23025_002E). All three demographic measures are extremely right-skewed (skew of 13.2, 8.0, and 13.1) with hundreds of outlier counties — for example, total population ranges from 50 up to 9.94 million while the median is just 25,784. About 13-14% of counties register as outliers on each measure, reflecting the handful of very large metro counties dominating the tails. Start by looking at population and labor-force distributions on a log scale, and use the state field (51 unique values) to see how counties cluster geographically.
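
To see why a log scale helps here, the sketch below compares the mean-to-median ratio before and after a base-10 log transform. The minimum, median, and maximum echo the reading; the other two values are hypothetical fillers.

```python
import math
from statistics import mean, median

# 50, 25,784 and 9.94M come from the reading; 4,200 and 61,000 are made up.
populations = [50, 4_200, 25_784, 61_000, 9_940_000]

log_pops = [math.log10(p) for p in populations]

# On the raw scale one huge county drags the mean far above the median.
raw_ratio = mean(populations) / median(populations)

# On the log scale the mean and median sit close together.
log_ratio = mean(log_pops) / median(log_pops)
print(round(raw_ratio, 1), round(log_ratio, 3))
```

`math.log10` requires strictly positive inputs, so columns that contain zeros would need a different treatment, such as `math.log1p`.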

215 / 233
profile reading

accessibility atlas who hale long

/home/coolhand/datasets/accessibility-atlas/who_hale_long.csv · 4,070 rows × 4 cols

2× feature 1× foreign_key 1× timestamp

This dataset contains 4,070 rows of WHO Healthy Life Expectancy (HALE) data spanning 185 countries, 6 regions, and 22 years from 2000 to 2021. The panel is balanced — each country contributes 22 yearly observations — so the country_code distribution is essentially uniform and not informative on its own. The most interesting variable is hale_years, which ranges from 35.3 to 73.8 with a mean of 61.0 and a left-skewed distribution (skew = -0.82), indicating a long tail of countries with notably lower healthy life expectancy. Regional coverage is uneven, with Europe (1,100 rows) and Africa (1,034) dominating while South-East Asia contributes only 220 rows. Start by examining the hale_years distribution and how it breaks down by region.

216 / 233
profile reading

economic gini by county

/home/coolhand/datasets/us-inequality-atlas/economic/gini_by_county.csv · 3,222 rows × 4 cols

2× feature 1× identifier 1× metadata

This dataset contains 3,222 US county-level records with four fields: county name, FIPS code, Gini index, and state. The Gini index is the most analytically interesting column, with a mean of 0.448 and a max of 0.721, plus 56 outliers worth investigating for unusually high local inequality. The state distribution is broad (52 unique values), led by Texas (254 counties) and Georgia (159), so any state-level comparison should account for that imbalance. County names show a 39% duplicate rate, reflecting common names like Washington, Jefferson, and Franklin County that recur across states.

217 / 233
profile reading

rent burden

/home/coolhand/html/datavis/data_trove/cache/rent_burden.parquet · 3,222 rows × 5 cols

3× feature 2× identifier

This dataset contains 3,222 rows of U.S. county-level rent burden statistics, with each row identified by a county name and FIPS code and described by total renters and the share of renters paying 30%+ or 50%+ of income on rent. Total renters is extremely skewed (skew 15.8, max 1,810,929 vs. median 2,579.5), so a handful of large urban counties dominate the distribution and warrant separate treatment. Rent-burden percentages are better behaved: on average about 36.4% of renters per county are cost-burdened at the 30%+ threshold and 17.4% at the 50%+ threshold, and both distributions are fairly symmetric. The most useful first look is comparing the two rent-burden distributions and isolating the outlier counties on total_renters.

218 / 233
profile reading

animal attacks shark attacks analysis

/home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_analysis.csv · 8 rows × 9 cols

6× metadata 1× identifier 1× feature 1× other

This is a tiny 8-row table that appears to hold Vega-Lite chart specifications rather than tidy observational data, with 9 columns mixing JSON-like fields (data, encoding, config, mark, $schema, autosize) and chart dimensions. The most informative columns are the chart metadata: every row uses the same `$schema` (vega-lite v2) and `mark` value of 'bar', while `height` (719) and `width` (761) are constants — so the dataset really describes 8 near-identical bar-chart specs. Worth a closer look: the `encoding` and `autosize`/`config` columns, which carry the only real variation (3 distinct encodings across `year`, `sex`, and a count aggregate), and the high null rates (62–87%) on those spec fields, which suggest each row only fills in one facet of the chart definition. Treat this as a chart-spec catalog, not a data table — joins or aggregations on `data` won't be meaningful since that column embeds long serialized arrays.

219 / 233
profile reading

county health rankings

/home/coolhand/html/datavis/data_trove/cache/county_health_rankings.parquet · 3,222 rows × 5 cols

3× feature 2× identifier

This dataset contains 3,222 rows of US county-level health data, with each row identified by a unique county name and FIPS code, plus three numeric measures: total population, uninsured population, and uninsured rate. The population fields are extremely right-skewed — total_pop ranges from 47 to nearly 9.87 million with a median of 25,328, and uninsured_pop shows similar skew (median 36, max 20,915), so a few large counties dominate. The uninsured_rate is the most analytically interesting field: it has a median of 0.12 but stretches up to 3.7, with about 17% of counties reporting zero, suggesting either small/edge cases or data quality issues worth investigating. Start by examining the distribution of uninsured_rate and how it relates to total_pop.

220 / 233
profile reading

healthcare data county health rankings 20260121

/home/coolhand/html/datavis/data_trove/cache/healthcare_data/county_health_rankings_20260121.parquet · 3,222 rows × 5 cols

3× feature 2× identifier

This dataset covers 3,222 U.S. counties (one row per FIPS code) with population totals and uninsured counts and rates. Both total_pop and uninsured_pop are extremely right-skewed (skew 13.4 and 17.8) with hundreds of outliers, indicating a handful of very large counties dominate the raw counts — analysts should work in per-capita or log space. The uninsured_rate is the more comparable metric: median 0.12 with about 17.5% of counties reporting zero, and a long tail reaching 3.7 that warrants a data-quality check. The county_name field shows Texas, Virginia, and Georgia contributing the most counties, useful context for any state-level rollups.

221 / 233
profile reading

quirky aurora

/home/coolhand/html/datavis/data_trove/data/quirky/aurora.json · 300 rows × 8 cols

5× feature 2× timestamp 1× label

This dataset captures 300 minute-by-minute aurora and solar wind observations starting 2026-01-20, with 8 columns covering geomagnetic indices (kp_index, estimated_kp, intensity), solar wind conditions (speed, density), and a categorical activity label. The activity field is heavily skewed toward 'Moderate Storm' (172 of 300, ~57%), with only 8 'Quiet' readings — worth a closer look since this dominates the storyline. The kp_index and intensity columns are left-skewed and pile up at their max values (median equals max), with ~15% flagged as low-side outliers, suggesting the sample is a sustained storm period rather than a balanced range. Solar wind speed is also unusually elevated (min 881, max 1051 km/s), reinforcing that this is a storm-window snapshot rather than typical conditions.

222 / 233
profile reading

median income

/home/coolhand/html/datavis/data_trove/cache/median_income.parquet · 3,222 rows × 3 cols

2× identifier 1× feature

This dataset contains 3,222 rows covering U.S. counties, with three columns: a county name, a FIPS code, and median household income. The income column is the headline issue — it has a minimum of -666,666,666 and a mean of roughly -144,603 against a median of 60,458, indicating a sentinel value (likely a missing-data placeholder) that is dragging the distribution into nonsense. About 5.8% of records (188 rows) are flagged as outliers and skew is extreme (-56.7), so any analysis should filter these sentinels before computing summary stats. County names are essentially unique row labels, while FIPS codes look clean and well-distributed across the expected national range.

223 / 233
profile reading

poverty data

/home/coolhand/html/datavis/data_trove/cache/poverty_data.parquet · 3,222 rows × 3 cols

2× identifier 1× feature

This dataset contains 3,222 U.S. counties with three columns: a county name, a FIPS code identifier, and a poverty rate. Each row is unique by county_name and fips, so the analytical signal lives almost entirely in poverty_rate. Poverty rate ranges from 1.6% to 66.32% with a mean of 15.1% and median of 13.55%, and it is right-skewed (skew 2.10) with 137 high-end outliers (~4.25% of counties). That long upper tail is the first thing worth a closer look, since a small number of counties have poverty rates several times the national median.

224 / 233
profile reading

healthcare data poverty data 20260121

/home/coolhand/html/datavis/data_trove/cache/healthcare_data/poverty_data_20260121.parquet · 3,222 rows × 3 cols

2× identifier 1× feature

This dataset contains 3,222 rows describing U.S. county-level poverty, with three columns: a FIPS code, a county name, and a poverty rate. Each row is a unique county (3,222 unique FIPS codes and county names), so the analytical signal lives in the poverty_rate column. Poverty rates range from 1.6% to 66.32% with a mean of 15.1% and median of 13.55%, and the distribution is right-skewed (skew ≈ 2.10) with 137 outliers on the high end. The county_name field also reveals geographic concentration, with Texas (256), Virginia (189), and Georgia (159) contributing the most counties. Start by examining the shape of poverty_rate and which states the high-poverty outliers cluster in.

225 / 233
profile reading

median rents

/home/coolhand/html/datavis/data_trove/cache/median_rents.parquet · 3,222 rows × 3 cols

2× identifier 1× feature

This dataset contains 3,222 rows of U.S. county-level median gross rent figures, keyed by county name and FIPS code. The standout issue is the median_gross_rent column: while the median is a plausible $817.50 and the IQR runs $718 to $978, the minimum is -666,666,666, dragging the mean to roughly -2.07M and producing extreme skew (-17.87) and kurtosis (317.2). That sentinel-style negative value and the 235 flagged outliers (7.3%) should be cleaned or filtered before any analysis. The fips column is well-behaved and unique per row, and county_name is essentially an identifier (3,222 unique values), so neither needs deep inspection beyond confirming coverage.

226 / 233
profile reading

rural urban

/home/coolhand/html/datavis/data_trove/cache/rural_urban.parquet · 3,222 rows × 4 cols

2× identifier 2× feature

This dataset is a county-level reference table covering 3,222 U.S. counties, with each row uniquely identified by a county name and FIPS code and labeled as either rural or urban/suburban. The headline finding is the rural skew: 2,212 counties (about 68.7%) are flagged Rural versus 1,010 Urban/Suburban, and the `rural` and `rural_category` columns are perfectly redundant duplicates of each other. County names are dominated by Texas (256), Virginia (189), and Georgia (159), reflecting how many counties those states contain rather than any data quality issue.

227 / 233
profile reading

quirky peppers

/home/coolhand/html/datavis/data_trove/data/quirky/peppers.json · 175 rows × 11 cols

8× feature 2× identifier 1× label

This dataset catalogs 175 pepper varieties with 11 fields covering name, origin, flavor, heat category, biological type, intended use, and Scoville heat measurements (min, median, max, plus a jalapeño-relative score). The Scoville and jalRP numeric columns are extremely right-skewed (skew ~9-10, kurtosis >100) with max scoville_max reaching 16,000,000 versus a median of just 30,000 — a handful of super-hot peppers dominate the tail and 24% of rows flag as outliers. On the categorical side, 'Medium' heat accounts for 40% of peppers and 'Culinary' use covers 80%, while origin leans heavily toward the United States (26%) and Mexico (15%). Worth a closer look first: the Scoville distribution (consider a log scale) and the type column, which has casing inconsistencies ('annuum' vs 'Annuum', 'chinense' vs 'Chinense') that should be cleaned before any grouping.
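
The casing cleanup for the type column can be sketched as follows. The sample list is hypothetical apart from the four spellings quoted in the reading, and lowercasing is only one reasonable normalization choice.

```python
from collections import Counter

# Hypothetical sample of the type column, including the mixed-case variants.
types = ["annuum", "Annuum", "chinense", "Chinense", "annuum", "baccatum"]

# Grouping on raw strings treats each casing as a separate category.
raw_counts = Counter(types)

# Normalizing whitespace and case merges the variants before grouping.
clean_counts = Counter(t.strip().lower() for t in types)
print(len(raw_counts), dict(clean_counts))
```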

228 / 233
profile reading

healthcare data rural urban classification 20260121

/home/coolhand/html/datavis/data_trove/cache/healthcare_data/rural_urban_classification_20260121.parquet · 3,222 rows × 4 cols

2× identifier 2× feature

This dataset catalogs 3,222 U.S. counties, each identified by a unique 5-character FIPS code and county name, and classified as either rural or urban/suburban. The two classification columns (`rural` and `rural_category`) are perfectly redundant, both showing 2,212 counties (about 68.7%) flagged as Rural versus 1,010 as Urban/Suburban. The most useful angle here is the rural/urban split, since FIPS and county_name are unique identifiers with no aggregate signal. Top words in `county_name` hint at geographic concentration, with Texas (256), Virginia (189), and Georgia (159) contributing the most counties.

229 / 233
profile reading

cms medicaid

/home/coolhand/datasets/accessibility-atlas/cms_medicaid_enrollment_2026.csv · 10,302 rows × 44 cols

23× feature 19× metadata 1× timestamp 1× label

This dataset captures monthly state-level Medicaid and CHIP performance reports (10,302 rows × 44 columns) covering enrollment counts, application volumes, eligibility determinations by processing-time bucket, and call-center metrics across all 51 state jurisdictions. The reporting structure is clean and balanced — each state contributes 202 rows, and the Final Report and Preliminary/Updated flags split exactly 50/50 — but most numeric metrics are heavily right-skewed and riddled with outliers, since large states like California dwarf smaller ones (e.g., Total Medicaid Enrollment ranges from 0 to 13.2M with skew 3.6). Two things deserve a closer look first: the very high null rates on operational metrics (Total Adult Medicaid Enrollment is 85% null; call-center fields ~70% null), which suggests many states simply don't report these, and the Medicaid-expansion split (73% Y vs 27% N) which is a natural lens for comparing enrollment and processing outcomes. The free-text 'footnotes' columns are also worth scanning — they reveal systematic data-quality caveats (e.g., 'Incorrectly reporting processing time at application level') that should temper any cross-state comparison.

230 / 233
profile reading

bsky firehose dec 2025

/home/coolhand/datasets/bsky-firehose-anonymized-dec-2025/bluesky_posts.csv · 101,040 rows × 19 cols

9× feature 4× foreign_key 2× timestamp 1× free_text 1× identifier 1× label 1× metadata

This dataset captures 101,040 anonymized Bluesky firehose posts from late December 2025, with 19 columns covering post hashes, authorship, timestamps, content text, embeds, hashtags, mentions, links, language, and sentiment. The text column is richly multilingual — English dominates at ~61% of posts, followed by Japanese (~12.6k) and a sizable 'unknown' bucket (~11.5k) — and sentiment skews neutral (48.5%) with positive outweighing negative roughly 2:1. Engagement-style features are heavily zero-inflated: only ~13.6% of posts include images, ~18% include links, and just ~1.3% include video, so most posts are plain text. About 58% of posts have no reply_root_hash, suggesting top-level posts dominate over threaded replies. The most useful first cuts are language mix, sentiment distribution, embed_type composition, and post-length shape via char_count.

231 / 233
profile reading

bluesky alt text

hf://lukeslp/bluesky-alt-text:[train] · 404,841 rows × 21 cols

5× feature 5× foreign_key 4× metadata 3× free_text 3× timestamp 1× identifier

This is a 404,841-row Bluesky image-post dataset (lukeslp/bluesky-alt-text) capturing posts with attached images, their alt text, author identifiers, and raw AT-Protocol records across 21 columns. Two things stand out for follow-up. First, alt-text length is extremely skewed (mean 227 chars but max 65,192, with ~12% outliers), suggesting a small number of very long descriptions are pulling the mean upward. Second, authorship is highly concentrated, with one DID accounting for 32,558 posts and the top handle 'firefaerie81.bsky.social' contributing 6,828 posts, which is worth checking for bot or scraper bias. Content is overwhelmingly English (~72% of langs_json) but spans 214 language tags, and images are predominantly JPEG (93.5%) with PNG a distant second. Note also that ~31% of image URLs and author handles are null, which likely reflects the split between 'author_feed' (69%) and 'jetstream' (31%) source modes.
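The skew described above (a mean far from the typical value, plus a tail of outliers) can be illustrated with a short sketch. It assumes the common 1.5×IQR Tukey-fence rule for flagging outliers; the reading does not say which rule saturn applies, and the function name and toy lengths below are hypothetical.

```python
import statistics


def outlier_share(values, k=1.5):
    """Share of values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(v < low or v > high for v in data) / len(data)


# Toy alt-text lengths, invented for illustration: one very long description.
lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 5000]

mean_len = statistics.mean(lengths)      # pulled up by the single long value
median_len = statistics.median(lengths)  # stays near the typical length
share = outlier_share(lengths)
```

Comparing the mean against the median is a quick way to see how much a few long descriptions distort the average.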

232 / 233
compare reading

bluesky alt text curated vs firehose

curated (279,196) vs firehose (125,645) · 21 cols

The curated dataset (279,196 rows) is roughly 2.2× the size of firehose (125,645 rows), and several columns differ structurally rather than just statistically. Four fields (`query`, `image_fullsize_url`, `image_thumb_url`, and `author_handle`) show a +100% null shift with zero top-value overlap, indicating they are populated on one side and entirely absent on the other. Text fields also diverge in shape: `alt_text` shows a mean-length difference of +79 characters, low language overlap (Jaccard 0.35), and almost no shared top values (0.03), while `raw_record_json` shows a mean-length difference of +154 characters with Jaccard 0.25. These patterns suggest the two sources capture different schemas or enrichments rather than differing only by sampling noise. Confidence is high that the divergence is structural, though the evidence does not say which side carries the nulls.
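The overlap and null-shift figures above can be read with a small sketch. It assumes, without confirmation from saturn's output, that "jaccard" means intersection over union of two value sets and that "null shift" is the difference in null rate expressed in percentage points. All function names and toy values are hypothetical.

```python
def jaccard(a, b):
    """Intersection over union of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def null_rate(values):
    """Fraction of entries that are None."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_shift_points(left, right):
    """Null-rate difference (right minus left) in percentage points."""
    return 100.0 * (null_rate(right) - null_rate(left))


# Toy columns, invented for illustration only.
populated = ["a.bsky.social", "b.bsky.social", "c.bsky.social"]
absent = [None, None, None]

shift = null_shift_points(populated, absent)
overlap = jaccard(["en", "ja", "de"], ["en", "pt", "es", "ja"])
```

Under these assumptions, a field that is fully populated on one side and entirely null on the other gives a shift of 100 points, which matches the pattern reported for the four fields.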

233 / 233
profile reading

vizwiz

/home/coolhand/datasets/accessibility-atlas/vizwiz_val_annotations.csv · 4,319 rows × 5 cols

2× label 1× identifier 1× free_text 1× feature

This is the VizWiz validation annotation set: 4,319 rows linking an image filename to a question, a bundle of crowd answers, an answer_type label, and a binary 'answerable' flag. The question column is where the dataset's character lives: it has only 2,798 unique values (a 35% duplicate rate), dominated by short generic prompts like 'What is this?' (523 occurrences). Worth a closer look: the answer_type distribution is heavily skewed toward 'other' (62%) with 'unanswerable' a strong second, and the 'answerable' flag confirms that ~32% of items are flagged unanswerable, a meaningful portion to account for in any downstream evaluation.
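The duplicate rate quoted above is consistent with one minus the ratio of unique values to rows. A minimal sketch under that assumption (saturn's exact definition is not stated here; the helper name and toy questions are hypothetical):

```python
from collections import Counter


def duplicate_rate(values):
    """1 - unique/total: the share of rows that repeat an earlier value."""
    values = list(values)
    if not values:
        return 0.0
    return 1.0 - len(set(values)) / len(values)


# Figures from the reading: 2,798 unique questions across 4,319 rows.
reported = 1.0 - 2798 / 4319

# Toy question column, invented for illustration only.
questions = ["What is this?", "What is this?", "What color is this?", "What is this?"]
rate = duplicate_rate(questions)
top_question, top_count = Counter(questions).most_common(1)[0]
```

With the reported counts, `reported` rounds to 0.35, in line with the 35% in the reading.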
