saturn

/home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv 1,093 rows sample n=1,093 seed 42 2026-05-01T17:12:33+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/accessibility/ssa_sa_fywl.csv
Total rows: 1,093
Profiled sample: 1,093
Columns: 30
Generated: 2026-05-01T17:12:33+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset medium anthropic:claude-opus-4-7

This appears to be the SSA-SA-FYWL dataset (Social Security Administration state/area fiscal-year workload data) with 1,093 rows and 30 columns, but the headers were not parsed correctly: most columns carry placeholder names like `_duplicated_*`, the real header row seems to have been read as a data row (literal labels such as 'File Name', 'Update Date' and 'Region Code' each appear exactly once), and several columns hold metadata constants (file name, update date 3/13/2023, date type 'FY'). The most informative real fields are the geographic and time dimensions. `_duplicated_2` has 53 distinct values, and its entropy (5.706) matches 52 state/territory codes at 21 rows each plus one singleton. `_duplicated_1` has 11 distinct values: 10 region codes led by ATL (168 rows) plus the literal 'Region Code'. `_duplicated_4` has 22 distinct values, and its entropy (4.399) matches 21 fiscal years from 2001 onward at 52 rows each plus one singleton. Since 52 × 21 = 1,092, the data rows are consistent with a balanced state-by-year panel. Many numeric measures (e.g. `_duplicated_22`, `_duplicated_12`, `_duplicated_10`) were ingested as text/categorical strings of decimal numbers, so they should be retyped before analysis. Start by fixing headers and dtypes, then look at the region/state/year structure to confirm the panel layout.
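The header and dtype fix recommended here could look roughly like the following. This is a minimal sketch using only the standard library, and it assumes the file opens with one note row followed by the real header row, which the report suggests but does not confirm. The sample text, the labels `State Code`, `Fiscal Year` and `Example Rate`, and both helper names are hypothetical.

```python
import csv
import io


def to_number(value):
    """Return an int or float when the string parses as one, else the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def load_with_real_header(text, note_rows=1):
    """Skip leading note rows, use the next row as the header, retype values."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [name.strip() for name in rows[note_rows]]
    records = []
    for raw in rows[note_rows + 1:]:
        # strip() removes padding such as the trailing space in 'AK '
        records.append({name: to_number(value.strip())
                        for name, value in zip(header, raw)})
    return header, records


# Made-up sample in the assumed layout: one note row, then the header row.
SAMPLE = (
    "**Please note** sample note row,,,,\n"
    "File Name,Region Code,State Code,Fiscal Year,Example Rate\n"
    "SSA-SA-FYWL.csv,BOS,CT ,2001,5.50\n"
    "SSA-SA-FYWL.csv,SEA,AK ,2001,4.90\n"
)

header, records = load_with_real_header(SAMPLE)
print(header)
print(records[0])
```

With the header taken from the right row, the label strings no longer show up as data values, and the decimal strings become floats that a profiler can treat as numeric.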

**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. high anthropic:claude-opus-4-7

This column is effectively a constant file-name tag ("SSA-SA-FYWL.csv" appears 1092 of 1093 times, top_rate 0.999) with a single stray "File Name" value that looks like a header row leaked into the data. The column header itself is a free-text note about 2021 data being backfilled with 2020 data, suggesting this is provenance metadata rather than a feature. Entropy is essentially zero (0.0106), so it carries no discriminative signal.

(unnamed column) high anthropic:claude-opus-4-7

Binary categorical column with 1093 rows and only 2 distinct values, but it is effectively a constant: "2" appears 1092 times (top_rate 0.999) while "File Version" appears once. The lone "File Version" string alongside numeric "2" suggests a stray header row leaked into the data. Entropy of 0.0106 confirms there is virtually no information here.

_duplicated_0 high anthropic:claude-opus-4-7

This appears to be a duplicated date column where 1092 of 1093 rows hold the single value '3/13/2023', with the lone other entry being the literal string 'Update Date' — almost certainly a header row that leaked into the data. Entropy is effectively zero (0.0106) and the top rate is 0.999, so the column carries no discriminative signal. The 'Update Date' value also confirms a parsing/ingest issue worth fixing upstream.

_duplicated_1 high anthropic:claude-opus-4-7

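Three-letter region codes (ATL, DEN, BOS, PHL, CHI, DAL, SEA, SFO, KCM, NYC) across 1093 rows with no nulls. Of the 11 unique values, 10 are real codes; the eleventh is the literal string 'Region Code', which appears once and looks like a header row that leaked into the data. Distribution is fairly even (entropy ratio 0.947, top value ATL only 15.4%), and every real count is a multiple of 21 (168, 126, 105, 84, 63), consistent with each region covering a fixed set of states across 21 fiscal years. The column name `_duplicated_1` is more likely a placeholder assigned because the header cell was blank or repeated than evidence that the column duplicates another one.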

_duplicated_2 high anthropic:claude-opus-4-7

This column holds two-letter US state/territory abbreviations with a trailing space (e.g. 'AK ', 'AL ', 'AR '), with 53 distinct values across 1093 rows and no nulls. The distribution is almost perfectly uniform: entropy_ratio is 0.996 and the top value appears just 21 times (1.92%). The entropy of 5.706 matches 52 codes at 21 rows each plus one value that occurs once, probably a leaked header label like those seen in neighbouring columns, so the data looks like a regular grid of 52 codes repeated 21 times each. The 52 codes exceed the 50 states, consistent with DC (which appears in the top values) plus territory entries, and the trailing whitespace in every value is a data-hygiene flag.

_duplicated_3 high anthropic:claude-opus-4-7

A binary categorical column completely dominated by the value 'FY' (1092 of 1093 rows, top_rate 0.999), with a single stray 'Date Type' entry. Entropy is effectively zero (0.0106), and the name '_duplicated_3' suggests this is a residual from a duplicated header or pivot artifact rather than a real feature. The lone 'Date Type' value looks like a header row that leaked into the data.

_duplicated_4 high anthropic:claude-opus-4-7

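This column holds 22 distinct values across 1,093 rows with zero nulls: fiscal years from 2001 onward, with each of the 20 years listed appearing exactly 52 times. The entropy of 4.399 matches 21 years at 52 rows each plus one value that occurs once, probably a leaked header label. The near-uniform distribution (entropy ratio 0.986, top rate just 0.0476) and the count of 52 point to one row per state/territory code per fiscal year, not to weekly observations: `_duplicated_3` marks the date type as 'FY', and 52 codes × 21 years = 1,092 data rows. The '_duplicated_4' name is more likely a placeholder for a blank or repeated header cell than a sign that this column duplicates another one.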
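Whether the 52 rows per year reflect 52 reporting units observed once per fiscal year can be checked by counting rows per (state, year) cell. The sketch below uses a hypothetical helper, `panel_report`, and toy data; it is not part of saturn.

```python
from collections import Counter


def panel_report(pairs):
    """Summarise a list of (state, year) pairs and flag gaps or repeats."""
    pairs = list(pairs)
    states = sorted({state for state, _ in pairs})
    years = sorted({year for _, year in pairs})
    cell_counts = Counter(pairs)
    # A balanced panel has exactly one row in every state-by-year cell.
    missing = [(s, y) for s in states for y in years if cell_counts[(s, y)] == 0]
    repeated = sorted(cell for cell, n in cell_counts.items() if n > 1)
    return {
        "states": len(states),
        "years": len(years),
        "rows": len(pairs),
        "missing": missing,
        "repeated": repeated,
        "balanced": not missing and not repeated,
    }


# Toy data: 3 codes x 2 years, then the same panel with one cell dropped.
toy = [(s, y) for s in ("AK", "AL", "AR") for y in (2001, 2002)]
print(panel_report(toy)["balanced"])
print(panel_report(toy[:-1])["missing"])

# The report's own counts fit a full grid: 52 codes x 21 years.
print(52 * 21)
```

On the real file, the pairs would come from the state and fiscal-year columns after the leaked header row is removed and the trailing spaces are stripped.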

_duplicated_5 high anthropic:claude-opus-4-7

Stored as text but the values are short numeric tokens (length 6-21, mean 6.85, one word in 99.9% of rows), almost certainly some kind of numeric ID. Cardinality is near-unique (1037 distinct out of 1093) yet 56 rows duplicate (5.1% duplicate rate), which is unexpected for an identifier and worth checking. The column name '_duplicated_5' also suggests this was auto-generated from a collision during ingest.

_duplicated_6 high anthropic:claude-opus-4-7

Almost every value is a single all-caps token of 5-6 characters (len_mean 5.68, one_word_rate 0.999), with 1090 unique values across 1093 rows and only 3 duplicates. Top tokens are mostly numeric strings like '91371', '18795', '158314', suggesting this is an identifier or numeric code column rather than natural text — though a stray header-like fragment ('ssa', 'disability', 'beneficiaries', 'age', '18-64*') hints the source file had embedded header rows mixed into the data.

_duplicated_7 medium anthropic:claude-opus-4-7

Column is typed categorical but holds 511 distinct numeric strings like "5.50", "5.07", "4.90" across 1093 rows, suggesting a continuous measurement (price, rating, or similar) stored as text. Distribution is nearly flat: entropy ratio is 0.968 and the most common value covers only 1.01% of rows. The "_duplicated_7" name implies this is a redundant copy of another column produced during a join or pivot.

_duplicated_8 high anthropic:claude-opus-4-7

Single-token, all-caps short strings (length 6-26, mean 6.84, ~1 word each) that are overwhelmingly numeric — top values like '468802', '2702811', '1646445' are integers stored as text. With 1041 unique values across 1093 rows and only 52 duplicates, this looks like a near-unique numeric identifier rather than a feature. The 'allcaps' and Flesch=121.22 signals are artifacts of digit-only tokens; no URLs, emojis, or boilerplate appear.

_duplicated_9 high anthropic:claude-opus-4-7

Almost certainly an identifier-like code column: 1081 unique values across 1093 rows, single-token entries averaging 4.85 characters, and the top repeated values are short numeric strings like '4190' and '8630'. The 99.9% allcaps and one_word rates plus max length of 14 suggest compact alphanumeric codes rather than prose. The 12 duplicates (1.1%) are minor but worth checking given the column is otherwise near-unique.

_duplicated_10 medium anthropic:claude-opus-4-7

Stored as a categorical but the values are numeric strings clustered tightly around 1.0 (top values include '0.97', '1.11', '1.01', '1.04', '0.92'), suggesting a ratio, multiplier, or normalised index. Distribution is highly diffuse with 199 distinct values across 1093 rows and an entropy ratio of 0.929, so no single bucket dominates (top_rate just 0.023). The '_duplicated_10' name implies this column is a redundant copy from an upstream join.

_duplicated_11 high anthropic:claude-opus-4-7

Almost certainly a short alphanumeric code column: 1062 distinct values across 1093 rows, 99.9% one-word and 99.9% all-caps, lengths between 3 and 30 characters with a median of 4. Top tokens are bare numeric strings like '6632' and '1573', each appearing only 2-3 times, suggesting ID-like codes rather than categories. The '_duplicated_11' name and 31 duplicates (2.8%) hint this is a copy of another column with minor collisions.

_duplicated_12 medium anthropic:claude-opus-4-7

This column holds 69 distinct numeric-looking strings (e.g. '0.38', '0.34', '0.32') across 1093 rows with no nulls, suggesting a decimal ratio or rate stored as text. The distribution is fairly flat — top value '0.38' covers only 5.0% and entropy ratio is 0.905 — so no single value dominates. The '_duplicated_12' name signals it is a duplicate of another column, which is the main thing to flag.

_duplicated_13 high anthropic:claude-opus-4-7

This column holds short, single-token uppercase strings that are almost entirely unique (1079 unique out of 1093), with lengths between 4 and 24 characters and a median of 5. The top-frequency tokens are all numeric strings ('17955', '5808', etc.) appearing only twice each, suggesting this is a near-unique identifier code rather than natural text. The 'allcaps' and 'one_word' rates near 99.9% confirm a structured code format, and the column name '_duplicated_13' hints it was auto-generated during a join or pivot.

_duplicated_14 high anthropic:claude-opus-4-7

This column, labelled `_duplicated_14`, holds 1093 numeric-looking strings (e.g. "31.13", "44.89") with 883 unique values and no nulls — almost certainly a continuous measurement that was ingested as categorical. Entropy ratio of 0.99 and a top frequency of just 4 (0.37%) confirm near-uniqueness; the `long_tail` alert and the `_duplicated_` prefix suggest it is a redundant copy of another numeric column.

_duplicated_15 medium anthropic:claude-opus-4-7

This column holds short single-token numeric strings (one_word_rate 0.999, len_mean 6.4, max 24) stored as text rather than integers, with 1019 unique values across 1093 rows. The value '0' appears 21 times while every other top value occurs only twice, suggesting '0' is a sentinel or default. The name '_duplicated_15' and the 6.8% duplicate rate hint this is a redundant copy of a numeric identifier column from an upstream join.

_duplicated_16 high anthropic:claude-opus-4-7

Despite being typed as text, this column is dominated by short single-token numeric strings (one_word_rate 0.999, len_mean 4.54, max 38) with 1057 unique values across 1093 rows. The top tokens are bare integers like "0" (21 occurrences), "1358", "840", suggesting an ID or numeric code stored as text rather than natural language. The allcaps_rate of 0.98 is an artifact of digits/non-letter content, and the column name `_duplicated_16` implies it was auto-generated during a column-name collision.

_duplicated_17 medium anthropic:claude-opus-4-7

Stored as categorical strings but the values are numeric ('0.00', '1.68', '0.58', '1.07'), suggesting a small-magnitude continuous measurement that was read as text. Cardinality is high (272 unique across 1093 rows) with very flat distribution: top value '0.00' covers only 1.92% and entropy ratio is 0.949. The '_duplicated_17' name implies this is a duplicate of another column produced during a join or concat.

_duplicated_18 high anthropic:claude-opus-4-7

Despite being typed as text, this column holds single-token numeric strings (one_word_rate 0.999, word_mean 1.00, len_mean 6.4) with 1021 unique values across 1093 rows — effectively a high-cardinality numeric ID stored as text. The value '0' appears 20 times while every other top value occurs at most twice, hinting at '0' as a sentinel/placeholder amid otherwise near-unique IDs. The 'allcaps' alert is a quirk of digit-only strings rather than meaningful casing.

_duplicated_19 high anthropic:claude-opus-4-7

Despite being typed as text, this column is essentially short numeric tokens — 99.9% are single words with mean length 4.05 characters and a max of 32. With 1018 unique values across 1093 rows and the most common entry '0' appearing only 21 times, it behaves like a high-cardinality numeric identifier stored as strings. The 'allcaps' alert (97.99%) is an artifact of digits having no lowercase form rather than a meaningful signal.

_duplicated_20 medium anthropic:claude-opus-4-7

Despite being typed categorical, the values listed are two-decimal numeric strings (the top 20 span 0.00 to 0.71), suggesting a proportion or rate that was stored as text. The distribution is nearly flat (entropy ratio 0.907), with the modal value '0.30' covering only 2.6% of 1093 rows and no nulls. The column name '_duplicated_20' is more likely a placeholder for a blank or repeated header cell than a sign that the column copies another one.

_duplicated_21 high anthropic:claude-opus-4-7

This column is labelled `_duplicated_21`, suggesting saturn detected it as a duplicate of another field; values appear to be numeric strings stored as categorical. With 957 unique values across 1093 rows and an entropy ratio of 0.9885, it is nearly an identifier — the only meaningful concentration is `"0"` at 21 occurrences (1.92%), likely a sentinel or default. The long_tail alert and near-unique cardinality mean it carries almost no categorical signal as-is.

_duplicated_22 medium anthropic:claude-opus-4-7

This column holds 70 distinct short decimal strings clustered tightly around 0.16–0.25, suggesting a numeric ratio or rate (perhaps a proportion or probability) that has been stored as text. Distribution is fairly even with the top value '0.18' taking only 7.0% of rows and entropy ratio 0.84, so no single bucket dominates. The 'categorical' kind plus the '_duplicated_22' name hint that saturn detected this as a duplicate of another column and parsed it as strings rather than floats.

_duplicated_23 high anthropic:claude-opus-4-7

Despite the text kind, every value is a single short token (word_mean 1.004, len_mean 4.05, len_max 37) and the top values are all numeric strings like "0", "406", "404". With 1028 unique values across 1093 rows and a 5.9% duplicate_rate dominated by "0" (21 occurrences), this looks like a numeric identifier or count stored as text. The allcaps_rate of 0.98 is a quirk of digit-only strings being flagged as uppercase.

_duplicated_24 high anthropic:claude-opus-4-7

Despite being typed categorical, the values are numeric strings (e.g. '0.00', '47.52', '51.82'), suggesting a monetary or measurement field that was read as text. With 900 unique values across 1093 rows and entropy ratio 0.9874, it is nearly unique; the only meaningful concentration is '0.00' at 1.38% (15 rows). The '_duplicated_24' name implies this is a repeated copy of another column in the source.

_duplicated_25 high anthropic:claude-opus-4-7

Almost every value is a single short ALLCAPS token (one_word_rate 0.999, allcaps_rate 0.999, len_mean 4.9, word_mean 1.0), and 1088 of 1093 rows are unique with only 5 duplicates. The top tokens are mostly numeric strings like '3584' or '14860', suggesting this is a near-unique short code rather than natural text. The column name '_duplicated_25' hints it was auto-generated from a duplicated source column during profiling.

_duplicated_26 high anthropic:claude-opus-4-7

Single-token, all-caps strings averaging 4.57 characters with 1069 unique values across 1093 rows — almost certainly an identifier or short code column. The top values are all numeric strings (e.g., '2280', '2086') appearing 2-3 times each, suggesting these are numeric IDs stored as text rather than meaningful tokens. The 99.9% one-word and all-caps rates plus near-unique cardinality rule out free text.

_duplicated_27 high anthropic:claude-opus-4-7

Stored as categorical strings but every observed value parses as a two-decimal number (e.g. '37.60', '41.85'), so this is almost certainly a numeric measurement — possibly a price, rate or score — that was ingested as text. With 873 unique values across 1093 rows and entropy ratio 0.989, it is near-unique; the most frequent value '37.60' appears just 4 times (top rate 0.37%). The '_duplicated_27' name suggests it is a duplicate of another column produced upstream.

**Please note** 2021 data in columns H, K, R, and U are populated with 2020 data until current data is released. categorical

top value is 99.9% of rows
rows: 1,093
null: 0 (0.0%)
unique: 2
top_value: SSA-SA-FYWL.csv
top_rate: 0.999
cardinality: 2
entropy: 0.011
entropy_ratio: 0.011
Top values (rank 1–20)
  1. SSA-SA-FYWL.csv — 1,092
  2. File Name — 1
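The `entropy` and `entropy_ratio` figures in these blocks are consistent with base-2 Shannon entropy and with that entropy divided by log2 of the cardinality. That reading is inferred from the numbers and is not confirmed by saturn itself. A minimal sketch under that assumption, with hypothetical function names:

```python
import math


def shannon_entropy(counts):
    """Base-2 Shannon entropy of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return shannon_entropy(counts) / math.log2(k)


# Value counts copied from the report.
constant_col = [1092, 1]  # 'SSA-SA-FYWL.csv' and 'File Name'
region_col = [168, 126, 126, 126, 126, 105, 84, 84, 84, 63, 1]  # _duplicated_1

print(round(shannon_entropy(constant_col), 3), round(entropy_ratio(constant_col), 3))
# prints 0.011 0.011
print(round(shannon_entropy(region_col), 3), round(entropy_ratio(region_col), 3))
# prints 3.277 0.947
```

Both results match the reported values for the file-name column and for `_duplicated_1`, which supports the assumed definitions without proving them.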

(unnamed column) categorical

top value is 99.9% of rows
rows: 1,093
null: 0 (0.0%)
unique: 2
top_value: 2
top_rate: 0.999
cardinality: 2
entropy: 0.011
entropy_ratio: 0.011
Top values (rank 1–20)
  1. 2 — 1,092
  2. File Version — 1

_duplicated_0 categorical

top value is 99.9% of rows
rows: 1,093
null: 0 (0.0%)
unique: 2
top_value: 3/13/2023
top_rate: 0.999
cardinality: 2
entropy: 0.011
entropy_ratio: 0.011
Top values (rank 1–20)
  1. 3/13/2023 — 1,092
  2. Update Date — 1

_duplicated_1 categorical

rows: 1,093
null: 0 (0.0%)
unique: 11
top_value: ATL
top_rate: 0.154
cardinality: 11
entropy: 3.277
entropy_ratio: 0.947
Top values (rank 1–20)
  1. ATL — 168
  2. DEN — 126
  3. BOS — 126
  4. PHL — 126
  5. CHI — 126
  6. DAL — 105
  7. SEA — 84
  8. SFO — 84
  9. KCM — 84
  10. NYC — 63
  11. Region Code — 1

_duplicated_2 categorical

rows: 1,093
null: 0 (0.0%)
unique: 53
top_value: AK
top_rate: 0.019
cardinality: 53
entropy: 5.706
entropy_ratio: 0.996
Top values (rank 1–20)
  1. AK — 21
  2. AL — 21
  3. AR — 21
  4. AZ — 21
  5. CA — 21
  6. CO — 21
  7. CT — 21
  8. DC — 21
  9. DE — 21
  10. FL — 21
  11. GA — 21
  12. HI — 21
  13. IA — 21
  14. ID — 21
  15. IL — 21
  16. IN — 21
  17. KS — 21
  18. KY — 21
  19. LA — 21
  20. MA — 21

_duplicated_3 categorical

top value is 99.9% of rows
rows: 1,093
null: 0 (0.0%)
unique: 2
top_value: FY
top_rate: 0.999
cardinality: 2
entropy: 0.011
entropy_ratio: 0.011
Top values (rank 1–20)
  1. FY — 1,092
  2. Date Type — 1

_duplicated_4 categorical

rows: 1,093
null: 0 (0.0%)
unique: 22
top_value: 2001
top_rate: 0.048
cardinality: 22
entropy: 4.399
entropy_ratio: 0.986
Top values (rank 1–20)
  1. 2001 — 52
  2. 2002 — 52
  3. 2003 — 52
  4. 2004 — 52
  5. 2005 — 52
  6. 2006 — 52
  7. 2007 — 52
  8. 2008 — 52
  9. 2009 — 52
  10. 2010 — 52
  11. 2011 — 52
  12. 2012 — 52
  13. 2013 — 52
  14. 2014 — 52
  15. 2015 — 52
  16. 2016 — 52
  17. 2017 — 52
  18. 2018 — 52
  19. 2019 — 52
  20. 2020 — 52

_duplicated_5 text

99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,037
len_min: 6
len_max: 21
len_mean: 6.846
len_median: 7.000
len_p95: 8.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 56
duplicate_rate: 0.051
vocab_size: 1,039
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 407208
  2. 5197780
  3. 802274
  4. 4470992
  5. 5075318
  6. 3483629
  7. 862241
  8. 384373
  9. 830302
  10. 4629213
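The duplicate figures for this column fit a simple relationship: `n_duplicates` equals non-null rows minus `unique` (1,093 − 1,037 = 56), and `duplicate_rate` equals `n_duplicates` divided by rows (56 / 1,093 ≈ 0.051). This is inferred from the reported numbers, not taken from saturn's documentation. A sketch with a hypothetical helper:

```python
def duplicate_metrics(values):
    """Count repeated non-null values, using the report's field names."""
    non_null = [v for v in values if v is not None]
    unique = len(set(non_null))
    # Every occurrence beyond the first of a value counts as a duplicate.
    n_duplicates = len(non_null) - unique
    rate = n_duplicates / len(values) if values else 0.0
    return {
        "rows": len(values),
        "unique": unique,
        "n_duplicates": n_duplicates,
        "duplicate_rate": rate,
    }


# Toy column: one count repeated once and a '0' repeated twice.
toy_col = ["407208", "802274", "407208", "0", "0", "0"]
print(duplicate_metrics(toy_col))

# Arithmetic check against the figures reported for _duplicated_5.
print(1093 - 1037, round(56 / 1093, 3))
```

Under this definition, a measure column with a few coinciding counts and a repeated '0' produces a duplicate rate of a few percent without being an identifier.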

_duplicated_6 text

99.7% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,090
len_min: 5
len_max: 40
len_mean: 5.683
len_median: 6.000
len_p95: 6.000
word_mean: 1.005
word_median: 1.000
n_empty: 0
n_duplicates: 3
duplicate_rate: 2.74e-03
vocab_size: 1,094
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 12791
  2. 293982
  3. 31643
  4. 288035
  5. 262109
  6. 159132
  7. 35481
  8. 29628
  9. 77327
  10. 203305

_duplicated_7 categorical

rows: 1,093
null: 0 (0.0%)
unique: 511
top_value: 5.50
top_rate: 0.010
cardinality: 511
entropy: 8.710
entropy_ratio: 0.968
Top values (rank 1–20)
  1. 5.50 — 11
  2. 5.07 — 9
  3. 4.90 — 9
  4. 4.19 — 8
  5. 5.08 — 8
  6. 4.70 — 8
  7. 4.96 — 7
  8. 5.29 — 7
  9. 4.11 — 6
  10. 4.55 — 6
  11. 5.18 — 6
  12. 4.45 — 6
  13. 6.18 — 6
  14. 4.98 — 6
  15. 5.63 — 6
  16. 7.16 — 6
  17. 5.33 — 5
  18. 5.15 — 5
  19. 5.45 — 5
  20. 4.71 — 5

_duplicated_8 text

95.2% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,041
len_min: 6
len_max: 26
len_mean: 6.835
len_median: 7.000
len_p95: 8.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 52
duplicate_rate: 0.048
vocab_size: 1,043
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 394417
  2. 4903798
  3. 770616
  4. 4182957
  5. 4813209
  6. 3318849
  7. 826760
  8. 354780
  9. 752975
  10. 4425908

_duplicated_9 text

98.9% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,081
len_min: 3
len_max: 14
len_mean: 4.854
len_median: 5.000
len_p95: 6.000
word_mean: 1.001
word_median: 1.000
n_empty: 0
n_duplicates: 12
duplicate_rate: 0.011
vocab_size: 1,082
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 3487
  2. 42558
  3. 6257
  4. 37868
  5. 53646
  6. 22580
  7. 7902
  8. 4431
  9. 8934
  10. 41298

_duplicated_10 categorical

rows: 1,093
null: 0 (0.0%)
unique: 199
top_value: 0.97
top_rate: 0.023
cardinality: 199
entropy: 7.097
entropy_ratio: 0.929
Top values (rank 1–20)
  1. 0.97 — 25
  2. 1.11 — 24
  3. 1.01 — 23
  4. 1.04 — 19
  5. 0.92 — 19
  6. 1.08 — 18
  7. 1.02 — 17
  8. 1.12 — 16
  9. 1.07 — 16
  10. 1.15 — 16
  11. 0.96 — 15
  12. 1.00 — 14
  13. 1.13 — 14
  14. 1.10 — 14
  15. 0.89 — 13
  16. 1.23 — 13
  17. 0.94 — 13
  18. 0.90 — 13
  19. 1.05 — 13
  20. 0.85 — 13

_duplicated_11 text

97.2% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,062
len_min: 3
len_max: 30
len_mean: 4.498
len_median: 4.000
len_p95: 5.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 31
duplicate_rate: 0.028
vocab_size: 1,064
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 1573
  2. 15940
  3. 2107
  4. 14243
  5. 18698
  6. 8951
  7. 3441
  8. 1938
  9. 2851
  10. 16121

_duplicated_12 categorical

rows: 1,093
null: 0 (0.0%)
unique: 69
top_value: 0.38
top_rate: 0.050
cardinality: 69
entropy: 5.527
entropy_ratio: 0.905
Top values (rank 1–20)
  1. 0.38 — 55
  2. 0.34 — 45
  3. 0.32 — 43
  4. 0.40 — 41
  5. 0.35 — 39
  6. 0.44 — 36
  7. 0.37 — 35
  8. 0.31 — 35
  9. 0.36 — 33
  10. 0.33 — 33
  11. 0.39 — 33
  12. 0.46 — 32
  13. 0.43 — 31
  14. 0.48 — 30
  15. 0.45 — 30
  16. 0.42 — 29
  17. 0.30 — 29
  18. 0.41 — 27
  19. 0.52 — 26
  20. 0.54 — 25

_duplicated_13 text

98.7% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,079
len_min: 4
len_max: 24
len_mean: 4.849
len_median: 5.000
len_p95: 6.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 14
duplicate_rate: 0.013
vocab_size: 1,081
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 3369
  2. 41577
  3. 6096
  4. 37132
  5. 46098
  6. 22283
  7. 7360
  8. 4370
  9. 8916
  10. 39609

_duplicated_14 categorical

707 singleton categories
rows: 1,093
null: 0 (0.0%)
unique: 883
top_value: 31.13
top_rate: 3.66e-03
cardinality: 883
entropy: 9.686
entropy_ratio: 0.990
Top values (rank 1–20)
  1. 31.13 — 4
  2. 44.89 — 3
  3. 33.20 — 3
  4. 47.46 — 3
  5. 30.73 — 3
  6. 35.51 — 3
  7. 41.78 — 3
  8. 40.12 — 3
  9. 36.06 — 3
  10. 29.74 — 3
  11. 36.98 — 3
  12. 37.02 — 3
  13. 38.32 — 3
  14. 29.63 — 3
  15. 36.17 — 3
  16. 30.34 — 3
  17. 32.50 — 3
  18. 36.14 — 3
  19. 32.47 — 3
  20. 31.93 — 3

_duplicated_15 text

99.9% of rows are a single word; 98.0% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,019
len_min: 1
len_max: 24
len_mean: 6.415
len_median: 6.000
len_p95: 7.000
word_mean: 1.003
word_median: 1.000
n_empty: 0
n_duplicates: 74
duplicate_rate: 0.068
vocab_size: 1,022
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.980
boilerplate_rate: 0.000
Sample values (first 10)
  1. 188453
  2. 1870106
  3. 299867
  4. 1366857
  5. 1847182
  6. 1301219
  7. 304573
  8. 1860793
  9. 250404
  10. 1751532

_duplicated_16 text

96.7% of rows are unique strings; 99.9% of rows are a single word; 98.0% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,057
len_min: 1
len_max: 38
len_mean: 4.535
len_median: 5.000
len_p95: 5.000
word_mean: 1.004
word_median: 1.000
n_empty: 0
n_duplicates: 36
duplicate_rate: 0.033
vocab_size: 1,061
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.980
boilerplate_rate: 0.000
Sample values (first 10)
  1. 970
  2. 22782
  3. 1214
  4. 21199
  5. 23598
  6. 10390
  7. 1730
  8. 1356
  9. 3878
  10. 19821

_duplicated_17 categorical

rows: 1,093
null: 0 (0.0%)
unique: 272
top_value: 0.00
top_rate: 0.019
cardinality: 272
entropy: 7.671
entropy_ratio: 0.949
Top values (rank 1–20)
  1. 0.00 — 21
  2. 1.68 — 13
  3. 0.58 — 12
  4. 1.07 — 12
  5. 1.08 — 12
  6. 1.24 — 12
  7. 1.15 — 12
  8. 0.64 — 12
  9. 1.52 — 11
  10. 1.42 — 11
  11. 1.18 — 11
  12. 1.70 — 11
  13. 1.81 — 11
  14. 1.20 — 10
  15. 1.09 — 10
  16. 1.44 — 10
  17. 1.11 — 10
  18. 0.94 — 10
  19. 1.78 — 10
  20. 1.56 — 10

_duplicated_18 text

99.9% of rows are a single word; 98.1% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 1 (0.1%)
unique: 1,021
len_min: 1
len_max: 26
len_mean: 6.407
len_median: 6.000
len_p95: 7.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 71
duplicate_rate: 0.065
vocab_size: 1,023
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.981
boilerplate_rate: 0.000
Sample values (first 10)
  1. 187483
  2. 1847324
  3. 298653
  4. 1345658
  5. 863746
  6. 1289462
  7. 302843
  8. 1838723
  9. 246526
  10. 1731711

_duplicated_19 text

99.9% of rows are a single word; 98.0% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,018
len_min: 1
len_max: 32
len_mean: 4.047
len_median: 4.000
len_p95: 5.000
word_mean: 1.004
word_median: 1.000
n_empty: 0
n_duplicates: 75
duplicate_rate: 0.069
vocab_size: 1,022
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.980
boilerplate_rate: 0.000
Sample values (first 10)
  1. 416
  2. 6917
  3. 343
  4. 5908
  5. 9872
  6. 2045
  7. 541
  8. 337
  9. 914
  10. 8064

_duplicated_20 categorical

rows: 1,093
null: 0 (0.0%)
unique: 156
top_value: 0.30
top_rate: 0.026
cardinality: 156
entropy: 6.609
entropy_ratio: 0.907
Top values (rank 1–20)
  1. 0.30 — 28
  2. 0.35 — 26
  3. 0.33 — 26
  4. 0.37 — 24
  5. 0.45 — 24
  6. 0.36 — 23
  7. 0.61 — 23
  8. 0.42 — 22
  9. 0.00 — 21
  10. 0.43 — 21
  11. 0.38 — 20
  12. 0.40 — 19
  13. 0.48 — 19
  14. 0.32 — 19
  15. 0.41 — 18
  16. 0.39 — 18
  17. 0.58 — 18
  18. 0.18 — 18
  19. 0.57 — 17
  20. 0.71 — 17

_duplicated_21 categorical

852 singleton categories
rows: 1,093
null: 0 (0.0%)
unique: 957
top_value: 0
top_rate: 0.019
cardinality: 957
entropy: 9.789
entropy_ratio: 0.989
Top values (rank 1–20)
  1. 0 — 21
  2. 1321 — 4
  3. 352 — 3
  4. 597 — 3
  5. 777 — 3
  6. 580 — 3
  7. 1184 — 3
  8. 1353 — 3
  9. 710 — 3
  10. 3128 — 3
  11. 463 — 3
  12. 227 — 3
  13. 1043 — 2
  14. 421 — 2
  15. 2891 — 2
  16. 5079 — 2
  17. 3228 — 2
  18. 299 — 2
  19. 3337 — 2
  20. 238 — 2

_duplicated_22 categorical

rows: 1,093
null: 0 (0.0%)
unique: 70
top_value: 0.18
top_rate: 0.070
cardinality: 70
entropy: 5.138
entropy_ratio: 0.838
Top values (rank 1–20)
  1. 0.18 — 77
  2. 0.20 — 75
  3. 0.21 — 64
  4. 0.22 — 62
  5. 0.25 — 61
  6. 0.17 — 61
  7. 0.23 — 49
  8. 0.19 — 46
  9. 0.24 — 45
  10. 0.16 — 43
  11. 0.15 — 41
  12. 0.26 — 36
  13. 0.14 — 27
  14. 0.12 — 26
  15. 0.27 — 26
  16. 0.13 — 25
  17. 0.11 — 25
  18. 0.10 — 22
  19. 0.00 — 21
  20. 0.29 — 20

_duplicated_23 text

99.9% of rows are a single word; 98.0% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,028
len_min: 1
len_max: 37
len_mean: 4.053
len_median: 4.000
len_p95: 5.000
word_mean: 1.004
word_median: 1.000
n_empty: 0
n_duplicates: 65
duplicate_rate: 0.059
vocab_size: 1,032
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.980
boilerplate_rate: 0.000
Sample values (first 10)
  1. 404
  2. 6979
  3. 365
  4. 5876
  5. 8430
  6. 2164
  7. 510
  8. 314
  9. 939
  10. 8006

_duplicated_24 categorical

756 singleton categories
rows: 1,093
null: 6 (0.5%)
unique: 900
top_value: 0.00
top_rate: 0.014
cardinality: 900
entropy: 9.690
entropy_ratio: 0.987
Top values (rank 1–20)
  1. 0.00 — 15
  2. 47.52 — 5
  3. 51.82 — 4
  4. 47.04 — 4
  5. 54.24 — 4
  6. 48.91 — 4
  7. 51.90 — 3
  8. 48.89 — 3
  9. 37.97 — 3
  10. 51.35 — 3
  11. 44.18 — 3
  12. 63.06 — 3
  13. 40.64 — 3
  14. 38.66 — 3
  15. 57.98 — 3
  16. 30.92 — 3
  17. 53.94 — 3
  18. 60.20 — 3
  19. 39.15 — 3
  20. 30.05 — 3

_duplicated_25 text

99.5% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,088
len_min: 4
len_max: 18
len_mean: 4.906
len_median: 5.000
len_p95: 6.000
word_mean: 1.001
word_median: 1.000
n_empty: 0
n_duplicates: 5
duplicate_rate: 4.57e-03
vocab_size: 1,089
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 3773
  2. 48556
  3. 6461
  4. 43008
  5. 54528
  6. 24447
  7. 7870
  8. 4684
  9. 9855
  10. 47615

_duplicated_26 text

97.8% of rows are unique strings; 99.9% of rows are a single word; 99.9% of rows are all-caps; 95th-percentile length under 20 chars
rows: 1,093
null: 0 (0.0%)
unique: 1,069
len_min: 3
len_max: 28
len_mean: 4.574
len_median: 5.000
len_p95: 5.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 24
duplicate_rate: 0.022
vocab_size: 1,071
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 1856
  2. 18872
  3. 2295
  4. 17664
  5. 22008
  6. 10304
  7. 3748
  8. 2163
  9. 3393
  10. 19458

_duplicated_27 categorical

693 singleton categories
rows: 1,093
null: 0 (0.0%)
unique: 873
top_value: 37.60
top_rate: 3.66e-03
cardinality: 873
entropy: 9.662
entropy_ratio: 0.989
Top values (rank 1–20)
  1. 37.60 — 4
  2. 36.60 — 4
  3. 41.85 — 4
  4. 38.47 — 4
  5. 49.19 — 3
  6. 32.63 — 3
  7. 42.28 — 3
  8. 29.96 — 3
  9. 42.14 — 3
  10. 38.12 — 3
  11. 33.04 — 3
  12. 40.70 — 3
  13. 40.45 — 3
  14. 33.84 — 3
  15. 30.27 — 3
  16. 31.35 — 3
  17. 39.43 — 3
  18. 33.77 — 3
  19. 30.69 — 3
  20. 31.39 — 3