saturn

emoji unicode emoji list 20260119

source: /home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json · 5,225 rows · 6 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a catalog of 5,225 Unicode emoji. Each row carries the emoji glyph, its codepoint sequence, an English name with an 'E#.#' version prefix, and three classification fields (group, subgroup, status). The collection is heavily skewed toward people: in the 'group' field, 'People & Body' accounts for 3,468 of 5,225 rows (about 66%), so most breakdowns will be dominated by human figures. The 'status' field is similarly lopsided, with 'fully-qualified' covering 3,944 rows against 1,029 minimally-qualified, 243 unqualified, and 9 component rows. The 'subgroup' column gives a finer 100-way split, led by person-activity (697) and person-role (635). The 'name' column has 1,272 duplicate values (~24%). That count equals the combined minimally-qualified and unqualified rows (1,029 + 243), which suggests the duplicates are mostly alternate encodings that repeat the name of their fully-qualified form rather than skin-tone or gender variants, whose names carry their own suffix. Keep both the people skew and this duplication in mind when counting.

citing: row_count · column_count · columns.group.top_values · columns.group.stats · columns.status.top_values · columns.status.stats · columns.subgroup.top_values · columns.subgroup.stats · columns.name.stats · columns.name.top_words · columns.codepoints.stats

Schema

6 columns
Per-column summary.

column      type         nulls  unique  alerts
emoji       text         0.0%   5,225   near_unique, one_word, allcaps, short_text
codepoints  text         0.0%   5,225   near_unique, one_word, allcaps
status      categorical  0.0%   4       -
name        text         0.0%   3,953   multilingual, duplicates
group       categorical  0.0%   10      -
subgroup    categorical  0.0%   100     -

emoji

text identifier near_unique one_word allcaps short_text
This column is the emoji glyph itself and acts as a unique identifier: all 5,225 values are distinct (n_unique equals n), and the emoji_rate is 97.7%. Each entry is a single token (one_word_rate 1.0) of 1 to 10 characters with a median length of 3. Length appears to be counted in code points rather than rendered glyphs, since len_mean (3.416) equals the mean token count of the codepoints column. The 51.3% allcaps_rate is unusual for an emoji column. Regional indicator symbols are not ASCII letters, so the rate may reflect how the profiler scores strings that contain no cased letters rather than real uppercase text. Treatment: Treat as a unique key; drop from modelling or use only as a join/lookup field. high · anthropic:claude-opus-4-7
n: 5,225
nulls: 0 (0.0%)
unique: 5,225
len_min: 1
len_max: 10
len_mean: 3.416
len_median: 3
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 5,225
readability_flesch_mean: 1.818
emoji_rate: 0.9774
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.5133
boilerplate_rate: 0
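Since the glyph is mainly useful as a join or lookup key, one quick consistency check is to derive a code point string from each glyph and compare it with the 'codepoints' column. The following is a minimal sketch, assuming that column uses space-separated uppercase hex with no 'U+' prefix; the helper name to_codepoints is hypothetical and not part of the profiler.

```python
def to_codepoints(glyph: str) -> str:
    """Render each code point of the glyph as uppercase hex, space-separated."""
    # Python strings are sequences of code points, so iterating yields one
    # code point at a time, including invisible joiners and selectors.
    return " ".join(f"{ord(ch):04X}" for ch in glyph)


# Man technologist: MAN + ZERO WIDTH JOINER + PERSONAL COMPUTER.
glyph = "\U0001F468\u200D\U0001F4BB"
print(len(glyph), to_codepoints(glyph))  # prints: 3 1F468 200D 1F4BB
```

The example also shows why a single rendered emoji can have a length of 3 or more in this column's length statistics.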

codepoints

text identifier near_unique one_word allcaps
This column holds each emoji's Unicode code point sequence as space-separated hexadecimal tokens. All 5,225 values are unique, and allcaps_rate is 1.0, meaning every value is uppercase hex. The most frequent tokens (quoted here in lowercase) are '200d' (zero-width joiner, 3,747 occurrences), 'fe0f' (variation selector-16, 1,318), the skin-tone modifiers '1f3fb'-'1f3ff' (703 each), and the man/woman bases '1f468'/'1f469' (676 each). Strings average 18 characters (max 54) with a median of 3 tokens, and only 26.8% of values are a single code point (one_word_rate 0.2679), consistent with multi-codepoint ZWJ sequences. Treatment: Treat as a unique key per emoji; split on whitespace into codepoint tokens if structural features are needed. high · anthropic:claude-opus-4-7
n: 5,225
nulls: 0 (0.0%)
unique: 5,225
len_min: 4
len_max: 54
len_mean: 18.04
len_median: 15
len_p95: 43
word_mean: 3.416
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,451
readability_flesch_mean: 118.8
emoji_rate: 0
url_rate: 0
one_word_rate: 0.2679
allcaps_rate: 1
boilerplate_rate: 0
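The suggested treatment, splitting on whitespace to obtain structural features, could look roughly like the sketch below. The function name codepoint_features and the choice of features are illustrative assumptions; only the token meanings (200D as zero-width joiner, FE0F as variation selector-16, 1F3FB through 1F3FF as skin-tone modifiers) come from the column description.

```python
# Skin-tone modifiers U+1F3FB through U+1F3FF, as uppercase hex tokens.
SKIN_TONES = {f"{cp:X}" for cp in range(0x1F3FB, 0x1F400)}


def codepoint_features(sequence: str) -> dict:
    """Derive simple structural features from a space-separated hex string."""
    # Normalise case so the lookup works for either upper- or lowercase hex.
    tokens = sequence.upper().split()
    return {
        "n_codepoints": len(tokens),
        "has_zwj": "200D" in tokens,    # zero-width joiner present
        "has_vs16": "FE0F" in tokens,   # variation selector-16 present
        "n_skin_tones": sum(token in SKIN_TONES for token in tokens),
    }


print(codepoint_features("1F468 1F3FD 200D 1F4BB"))
```

Features like these keep the column usable for modelling even though the raw string is a unique key.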

status

categorical label
Categorical qualification status with 4 levels and no nulls across 5,225 rows. 'fully-qualified' dominates at 75.5% (3,944 rows), followed by 'minimally-qualified' (1,029) and 'unqualified' (243), with 'component' a rare tail at just 9 rows. The entropy of 0.9896 is consistent with a measurement in bits: dividing by the 2-bit maximum for 4 levels gives the reported entropy ratio of 0.495, which confirms the imbalance. Treatment: One-hot encode; consider collapsing the rare 'component' class or stratifying splits to preserve it. high · anthropic:claude-opus-4-7
n: 5,225
nulls: 0 (0.0%)
unique: 4
top_value: fully-qualified
top_rate: 0.7548
cardinality: 4
entropy: 0.9896
entropy_ratio: 0.4948
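The entropy figures can be reproduced from the four status counts reported in this profile. The sketch below assumes Shannon entropy in base 2 and a ratio defined as entropy divided by log2 of the number of levels; the profiler does not state its formula, so both are inferences from the reported numbers, and the function name entropy_bits is hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts taken from this profile; they sum to 5,225 rows.
status_counts = {
    "fully-qualified": 3944,
    "minimally-qualified": 1029,
    "unqualified": 243,
    "component": 9,
}

h = entropy_bits(list(status_counts.values()))
# Assumed normalisation: divide by the maximum entropy for k levels, log2(k).
ratio = h / math.log2(len(status_counts))
print(h, ratio)
```

Under those assumptions the two values should land close to the reported 0.9896 and 0.4948.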

name

text label multilingual duplicates
This column holds short descriptive labels for emoji (e.g. 'E4.0 man detective', 'E15.1 woman walking facing right: dark skin tone'), each with a versioned 'E#.#' prefix, averaging 5.5 words and 33.8 characters. Duplicates are heavy at 24.3% (1,272 rows). Skin-tone variants carry their own suffix, so they are unlikely to be the source; the count matches the 1,029 minimally-qualified plus 243 unqualified rows in 'status', which suggests alternate encodings that reuse the name of their fully-qualified form. Skin-tone wording still dominates the vocabulary of 1,912 tokens, led by 'skin' (3,450) and 'tone' (2,800). Although 4,680 rows are tagged English, the language detector also flags 29 other languages including German (36), Persian (33), and Polish (27), likely false positives on very short labels rather than true multilingual content. Treatment: Treat as the canonical emoji label; strip the 'E#.#' version prefix and the skin-tone suffix if you need a base-concept key, and consider filtering on 'status' if you need one row per emoji. high · anthropic:claude-opus-4-7
n: 5,225
nulls: 0 (0.0%)
unique: 3,953
len_min: 7
len_max: 86
len_mean: 33.81
len_median: 34
len_p95: 67
word_mean: 5.521
word_median: 6
n_empty: 0
n_duplicates: 1,272
duplicate_rate: 0.2434
vocab_size: 1,912
readability_flesch_mean: 78.3
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.0001914
boilerplate_rate: 0
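Stripping the version prefix and skin-tone suffix, as the treatment suggests, might be sketched as follows. The patterns are assumptions based on the two example labels quoted above: a leading 'E#.#' token and a trailing ': <tone> skin tone' clause, possibly listing several tones. Names that mix a skin tone with other qualifiers are not handled, and the helper name base_name is hypothetical.

```python
import re

# Leading version tag such as "E4.0 " or "E15.1 ".
VERSION_PREFIX = re.compile(r"^E\d+\.\d+\s+")

# One skin-tone phrase; the five tone names are an assumption about the data.
TONE = r"(?:light|medium-light|medium|medium-dark|dark) skin tone"

# Trailing ": <tone>" clause, optionally listing several comma-separated tones.
SKIN_TONE_SUFFIX = re.compile(rf":\s*{TONE}(?:,\s*{TONE})*$")


def base_name(name: str) -> str:
    """Return the label without its version prefix and skin-tone suffix."""
    name = VERSION_PREFIX.sub("", name.strip())
    return SKIN_TONE_SUFFIX.sub("", name)


print(base_name("E4.0 man detective"))
print(base_name("E15.1 woman walking facing right: dark skin tone"))
```

Grouping rows by this key would collapse skin-tone variants onto one base concept, which is a different deduplication from keeping one row per emoji.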

group

categorical feature
This is a categorical grouping column with 10 distinct values matching the standard Unicode emoji category taxonomy (e.g., "People & Body", "Smileys & Emotion", "Flags"). The distribution is heavily imbalanced: "People & Body" alone covers 66.4% of the 5,225 rows, while "Component" appears just 9 times. No nulls, and entropy ratio of 0.57 confirms the skew toward one dominant class. Treatment: One-hot or target-encode; consider grouping the rare "Component" class given its tiny support. high · anthropic:claude-opus-4-7
n: 5,225
nulls: 0 (0.0%)
unique: 10
top_value: People & Body
top_rate: 0.6637
cardinality: 10
entropy: 1.908
entropy_ratio: 0.5743
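With only 10 levels, one-hot encoding is cheap. The sketch below uses plain Python to show the idea; the function name one_hot is hypothetical, and in practice a library encoder would usually be preferred.

```python
def one_hot(values, levels=None):
    """Return (levels, rows) where each row is a 0/1 indicator list."""
    # Fix the level order up front so every row uses the same column layout.
    levels = sorted(set(values)) if levels is None else list(levels)
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[index[value]] = 1
        rows.append(row)
    return levels, rows


levels, rows = one_hot(["People & Body", "Flags", "People & Body"])
print(levels)
print(rows)
```

Passing an explicit levels list keeps the column order stable between training and scoring data, which matters when a rare class such as "Component" is missing from one split.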

subgroup

categorical feature
Categorical taxonomy label with 100 distinct subgroups across 5,225 rows and no nulls; it is the finer level of the emoji classification and is dominated by people-related categories. The top value 'person-activity' covers 13.3% of rows (697), and the top eight values are all person, family, or flag related. The remaining 92 subgroups form a long tail that accounts for most of the diversity (entropy ratio 0.75). No single subgroup dominates, but the person-centric concentration is notable. Treatment: Group-encode or target-encode given the 100 levels, or collapse rare subgroups into an 'other' bucket before modelling. high · anthropic:claude-opus-4-7
n: 5,225
nulls: 0 (0.0%)
unique: 100
top_value: person-activity
top_rate: 0.1334
cardinality: 100
entropy: 4.982
entropy_ratio: 0.7498
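Collapsing rare subgroups into an 'other' bucket can be sketched in a few lines. The threshold is a modelling choice that this profile does not specify, so min_count below is an assumed parameter, and the helper name collapse_rare is hypothetical.

```python
from collections import Counter


def collapse_rare(values, min_count, other="other"):
    """Replace levels seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other for value in values]


sample = ["person-activity"] * 3 + ["person-role"] * 2 + ["keycap"]
print(collapse_rare(sample, min_count=2))
```

Computing the counts on the training split only, then applying the same mapping elsewhere, avoids leaking information about which levels are rare.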