emoji unicode emoji list 20260119

source /home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json · 5,225 rows · 6 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a catalog of 5,225 Unicode emoji entries. Each row carries the emoji glyph, its codepoint sequence, an English name, and three classification fields (group, subgroup, status). The collection is heavily skewed toward people: 'People & Body' accounts for 3,468 of 5,225 rows (about 66%) in the 'group' field, so most breakdowns will be dominated by human figures. The 'status' field is similarly lopsided, with 'fully-qualified' covering 3,944 rows against much smaller minimally-qualified (1,029), unqualified (243), and component (9) buckets. The 'subgroup' column gives a finer 100-way split worth exploring, led by person-activity (697) and person-role (635). Names repeat on 1,272 rows (~24%). That count equals the minimally-qualified plus unqualified rows (1,029 + 243), and the 3,953 unique names equal the fully-qualified plus component rows (3,944 + 9), so the repeats most likely come from qualification variants that reuse the name of their fully-qualified form rather than from skin-tone or gender variants, whose names spell out the variant. Keep this in mind when counting distinct emoji.

citing: row_count · column_count · columns.group.top_values · columns.group.stats · columns.status.top_values · columns.status.stats · columns.subgroup.top_values · columns.subgroup.stats · columns.name.stats · columns.name.top_words · columns.codepoints.stats
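
The headline figures in the summary can be re-derived from the raw rows with a couple of counters. The sketch below assumes the JSON file loads as a list of objects keyed by the six column names; that layout is not shown in this report, and `headline_counts` is a hypothetical helper name. The sample rows are a tiny illustration, not an excerpt of the file.

```python
from collections import Counter

def headline_counts(rows):
    """Re-derive row count, group shares and the duplicate-name count.

    `rows` is assumed to be a list of dicts with at least the keys
    'name' and 'group' (assumed layout, not confirmed by the report).
    """
    n = len(rows)
    groups = Counter(row["group"] for row in rows)
    names = Counter(row["name"] for row in rows)
    return {
        "row_count": n,
        "group_share": {g: c / n for g, c in groups.most_common()},
        # Duplicates are rows beyond the first use of each name, which is
        # how the report's figures relate: 5,225 rows - 3,953 unique = 1,272.
        "n_duplicate_names": n - len(names),
    }

# Illustrative rows only; names reuse the examples quoted in the report.
sample = [
    {"name": "E4.0 man detective", "group": "People & Body"},
    {"name": "E4.0 man detective", "group": "People & Body"},
    {"name": "E15.1 woman walking facing right: dark skin tone", "group": "People & Body"},
    {"name": "placeholder flag name", "group": "Flags"},
]
result = headline_counts(sample)
```

With the real file, `rows` would come from `json.load` on the source path, provided the file really is a flat list of records.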

Charts the summary said to look at first

group · Shows how dominant 'People & Body' is relative to all other emoji groups.

Top values for group (10 unique shown, of 10 total).
value	count	share
People & Body	3468	66.4%
Objects	316	6.0%
Symbols	305	5.8%
Flags	276	5.3%
Travel & Places	268	5.1%
Smileys & Emotion	187	3.6%
Animals & Nature	167	3.2%
Food & Drink	133	2.5%
Activities	96	1.8%
Component	9	0.2%

status · Highlights that roughly three quarters of entries are fully-qualified, with smaller minimally-qualified and unqualified slices.

Top values for status (4 unique shown, of 4 total).
value	count	share
fully-qualified	3944	75.5%
minimally-qualified	1029	19.7%
unqualified	243	4.7%
component	9	0.2%

subgroup · Top subgroups (person-activity, person-role, family) reveal where the catalog packs the most variants.

Top values for subgroup (20 unique shown, of 100 total).
value	count	share
person-activity	697	13.3%
person-role	635	12.2%
family	533	10.2%
person-sport	480	9.2%
person-gesture	300	5.7%
country-flag	259	5.0%
person-fantasy	246	4.7%
person	192	3.7%
animal-mammal	68	1.3%
hand-fingers-open	67	1.3%
sky & weather	65	1.2%
hands	62	1.2%
hand-fingers-partial	55	1.1%
transport-ground	55	1.1%
clothing	50	1.0%
body-parts	49	0.9%
alphanum	49	0.9%
hand-single-finger	43	0.8%
person-resting	42	0.8%
tool	38	0.7%

name · Distribution of name lengths (median 34 chars) hints at how many descriptors carry skin-tone or gender qualifiers.

Character-length distribution for name (mean: 33.81).
chars	count
7 – 9	36
9 – 11	154
11 – 13	216
13 – 15	252
15 – 17	319
17 – 19	374
19 – 21	249
21 – 23	189
23 – 25	150
25 – 27	135
27 – 29	125
29 – 31	122
31 – 33	162
33 – 35	252
35 – 37	256
37 – 39	245
39 – 41	275
41 – 43	273
43 – 45	195
45 – 46	162
46 – 48	153
48 – 50	120
50 – 52	63
52 – 54	64
54 – 56	76
56 – 58	60
58 – 60	61
60 – 62	75
62 – 64	66
64 – 66	70
66 – 68	64
68 – 70	40
70 – 72	32
72 – 74	46
74 – 76	28
76 – 78	24
78 – 80	26
80 – 82	8
82 – 84	4
84 – 86	4

codepoints · Codepoint-string length spans 4 to 54 characters, signalling how many emoji are multi-codepoint ZWJ sequences.

Character-length distribution for codepoints (mean: 18.04).
chars	count
4 – 5	1400
5 – 6	0
6 – 8	0
8 – 9	0
9 – 10	249
10 – 12	894
12 – 13	0
13 – 14	0
14 – 15	136
15 – 16	79
16 – 18	0
18 – 19	0
19 – 20	141
20 – 22	544
22 – 23	325
23 – 24	0
24 – 25	25
25 – 26	553
26 – 28	15
28 – 29	20
29 – 30	12
30 – 32	42
32 – 33	45
33 – 34	0
34 – 35	6
35 – 36	60
36 – 38	48
38 – 39	105
39 – 40	205
40 – 42	33
42 – 43	3
43 – 44	95
44 – 45	0
45 – 46	0
46 – 48	0
48 – 49	0
49 – 50	95
50 – 52	0
52 – 53	0
53 – 54	95

Schema

6 columns

Per-column summary. Click column name to jump to its detail.
column	type	nulls	unique	alerts
emoji	text	0.0%	5,225	near_unique one_word allcaps short_text
codepoints	text	0.0%	5,225	near_unique one_word allcaps
status	categorical	0.0%	4
name	text	0.0%	3,953	multilingual duplicates
group	categorical	0.0%	10
subgroup	categorical	0.0%	100

emoji

text identifier near_unique one_word allcaps short_text

This column holds the emoji glyph itself and is unique per row: all 5,225 values are distinct (n_unique equals n), with a 97.7% emoji_rate. Each entry is a single token (one_word_rate 1.0) of 1 to 10 code points with a median of 3; the mean of 3.416 equals the mean token count of the codepoints column, so length here counts code points, not rendered glyphs. The 51.3% allcaps_rate is hard to interpret for emoji, which have no letter case, and is more likely an artifact of the case check than evidence of letter content. Treatment: Treat as a unique key; drop from modelling or use only as a join/lookup field. high · anthropic:claude-opus-4-7

n: 5,225
nulls: 0 (0.0%)
unique: 5,225
len_min: 1
len_max: 10
len_mean: 3.416
len_median: 3
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 5,225
readability_flesch_mean: 1.818
emoji_rate: 0.9774
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.5133
boilerplate_rate: 0
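
Since the length statistics count code points, the emoji column can be cross-checked against the codepoints column before either is used as a join key. This is a minimal sketch, assuming codepoints are stored as whitespace-separated hex tokens; `to_codepoints` and `matches` are hypothetical helper names, and the comparison ignores case because the stored hex case is not confirmed here.

```python
def to_codepoints(emoji):
    """Render a glyph string as space-separated hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in emoji)

def matches(emoji, codepoints):
    """True when the glyph decodes to the given codepoint string.

    Case-insensitive, and tolerant of repeated whitespace between tokens.
    """
    expected = " ".join(codepoints.split()).lower()
    return to_codepoints(emoji).lower() == expected

# Man technologist: man + zero-width joiner + laptop, i.e. 3 code points
# even though it renders as a single glyph.
glyph = "\U0001F468\u200D\U0001F4BB"
n_code_points = len(glyph)
as_hex = to_codepoints(glyph)
```

Rows where `matches` returns False would point at an encoding or extraction problem rather than a modelling concern.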

codepoints

text identifier near_unique one_word allcaps

This column holds Unicode codepoint sequences as whitespace-separated hexadecimal tokens, with every one of the 5,225 rows unique and every row flagged all-caps (allcaps_rate 1.0); the tokens quoted here appear lowercased. Tokens like '200d' (zero-width joiner, 3,747 occurrences), 'fe0f' (variation selector-16, 1,318), and the skin-tone modifiers '1f3fb' to '1f3ff' (703 each) dominate, alongside the man/woman bases '1f468'/'1f469' (676 each). Strings average 18 characters (max 54) with a median of 3 tokens, and only 26.8% of rows are a single token (one_word_rate 0.2679), so most entries are multi-codepoint sequences, many of them ZWJ sequences. Treatment: Treat as a unique key per emoji; split on whitespace into codepoint tokens if structural features are needed. high · anthropic:claude-opus-4-7

n: 5,225
nulls: 0 (0.0%)
unique: 5,225
len_min: 4
len_max: 54
len_mean: 18.04
len_median: 15
len_p95: 43
word_mean: 3.416
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,451
readability_flesch_mean: 118.8
emoji_rate: 0
url_rate: 0
one_word_rate: 0.2679
allcaps_rate: 1
boilerplate_rate: 0
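
The suggested treatment, splitting on whitespace to get structural features, could look like the sketch below. The feature names and the `codepoint_features` helper are illustrative assumptions; the code point values for the zero-width joiner (U+200D), variation selector-16 (U+FE0F), and skin-tone modifiers (U+1F3FB to U+1F3FF) are the ones named in this section.

```python
ZWJ = 0x200D            # zero-width joiner
VS16 = 0xFE0F           # variation selector-16
SKIN_TONES = range(0x1F3FB, 0x1F3FF + 1)

def codepoint_features(codepoints):
    """Split a codepoint string into tokens and derive simple features.

    int(token, 16) accepts upper- and lowercase hex, so the stored case
    does not matter.
    """
    values = [int(token, 16) for token in codepoints.split()]
    return {
        "n_codepoints": len(values),
        "n_zwj": values.count(ZWJ),
        "has_vs16": VS16 in values,
        "n_skin_tones": sum(v in SKIN_TONES for v in values),
        "glyph": "".join(chr(v) for v in values),
    }

# Man technologist with a light skin tone modifier (4 code points).
features = codepoint_features("1F468 1F3FB 200D 1F4BB")
```

Counting joiners rather than only flagging them keeps longer sequences (several people joined together) distinguishable from two-part ones.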

status

categorical label

Categorical qualification status with 4 levels and no nulls across 5225 rows. Heavily dominated by 'fully-qualified' at 75.5%, with 'minimally-qualified' (1029) and 'unqualified' (243) trailing, and 'component' a rare tail at just 9 occurrences. Entropy ratio of 0.495 confirms the imbalance. Treatment: One-hot encode; consider collapsing the rare 'component' class or stratifying splits to preserve it. high · anthropic:claude-opus-4-7

n: 5,225
nulls: 0 (0.0%)
unique: 4
top_value: fully-qualified
top_rate: 0.7548
cardinality: 4
entropy: 0.9896
entropy_ratio: 0.4948
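
The report does not state how entropy and entropy_ratio are computed, but the values are consistent with Shannon entropy in bits, H = -Σ p·log2(p), divided by its maximum log2(k) for k levels. The sketch below checks that reading against the status counts from the table above; the function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by the maximum possible for this many levels."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# fully-qualified, minimally-qualified, unqualified, component
status_counts = [3944, 1029, 243, 9]
h = entropy_bits(status_counts)
ratio = entropy_ratio(status_counts)
```

Under this definition `h` and `ratio` land close to the reported 0.9896 and 0.4948. A ratio of 1.0 would mean four equally common levels, so a value near 0.5 reflects how much one level dominates.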

name

text label multilingual duplicates

This column holds short descriptive labels for emoji (e.g. 'E4.0 man detective', 'E15.1 woman walking facing right: dark skin tone'), averaging 5.5 words and 33.8 characters, each with a versioned 'E#.#' prefix. Names repeat on 24.3% of rows (1,272). Skin-tone variants do not explain this, since each tone is spelled out in the name; the count instead matches the 1,029 minimally-qualified plus 243 unqualified rows in status, which suggests those rows reuse the name of their fully-qualified form. Skin-tone wording still dominates the vocabulary of 1,912 tokens, led by 'skin' (3,450) and 'tone' (2,800). Although 4,680 rows are tagged English, the language detector also flags 29 other languages including German (36), Persian (33), and Polish (27); these are likely false positives on short names rather than true multilingual content. Treatment: Treat as the canonical emoji label; strip the 'E#.#' version prefix and skin-tone suffix if you need a key for the base concept. high · anthropic:claude-opus-4-7

n: 5,225
nulls: 0 (0.0%)
unique: 3,953
len_min: 7
len_max: 86
len_mean: 33.81
len_median: 34
len_p95: 67
word_mean: 5.521
word_median: 6
n_empty: 0
n_duplicates: 1,272
duplicate_rate: 0.2434
vocab_size: 1,912
readability_flesch_mean: 78.3
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.0001914
boilerplate_rate: 0
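
Stripping the version prefix and skin-tone wording, as the treatment suggests, can be sketched with two regular expressions. The patterns are assumptions based on the two example names quoted in this section and on the five standard skin-tone labels; `base_name` is a hypothetical helper, and names with other qualifier layouts may need extra handling.

```python
import re

# Leading version tag such as "E4.0 " or "E15.1 ".
VERSION_PREFIX = re.compile(r"^E\d+\.\d+\s+")
# A skin-tone qualifier together with the ':' or ',' that introduces it.
# Hyphenated labels come first so "medium" does not match part of them.
SKIN_TONE = re.compile(
    r"[:,]\s*(?:medium-light|medium-dark|light|medium|dark) skin tone"
)

def base_name(name):
    """Return the name without its version prefix and skin-tone qualifiers."""
    name = VERSION_PREFIX.sub("", name)
    name = SKIN_TONE.sub("", name)
    return name.strip()

examples = [
    "E4.0 man detective",
    "E15.1 woman walking facing right: dark skin tone",
]
keys = [base_name(n) for n in examples]
```

One known limitation of this sketch: if another qualifier follows the skin tone, the separator before it stays a comma where a colon would read better.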

group

categorical feature

This is a categorical grouping column with 10 distinct values matching the standard Unicode emoji category taxonomy (e.g., "People & Body", "Smileys & Emotion", "Flags"). The distribution is heavily imbalanced: "People & Body" alone covers 66.4% of the 5,225 rows, while "Component" appears just 9 times. No nulls, and entropy ratio of 0.57 confirms the skew toward one dominant class. Treatment: One-hot or target-encode; consider grouping the rare "Component" class given its tiny support. high · anthropic:claude-opus-4-7

n: 5,225
nulls: 0 (0.0%)
unique: 10
top_value: People & Body
top_rate: 0.6637
cardinality: 10
entropy: 1.908
entropy_ratio: 0.5743

subgroup

categorical feature

Categorical taxonomy label with 100 distinct subgroups across 5,225 rows and no nulls, dominated by people-related categories. The top value 'person-activity' covers 13.3% of rows, and the top eight values are all person, family, or flag related; together they cover 3,342 rows (64.0%), leaving a long tail of 92 subgroups that share the remaining 36% (entropy ratio 0.75). No single category dominates overwhelmingly, but the person-centric concentration is notable. Treatment: Group-encode or target-encode given the 100 levels, or collapse rare subgroups into an 'other' bucket before modelling. high · anthropic:claude-opus-4-7

n: 5,225
nulls: 0 (0.0%)
unique: 100
top_value: person-activity
top_rate: 0.1334
cardinality: 100
entropy: 4.982
entropy_ratio: 0.7498
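
Collapsing rare subgroups into an 'other' bucket, as the treatment suggests, could be done as follows. The threshold is an arbitrary illustration rather than a recommendation from the report, and `collapse_rare` is a hypothetical helper name.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace levels seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Small made-up sample using subgroup names from the table above.
subgroups = ["person-activity"] * 3 + ["family"] * 2 + ["tool", "clothing"]
collapsed = collapse_rare(subgroups, min_count=2)
remaining_levels = sorted(set(collapsed))
```

On the full column, the threshold decides how many of the 100 levels survive, so it is worth checking the level count after collapsing before encoding.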