emoji unicode emoji list 20260119

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json

Saturn profiled 5,225 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

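# Install the Saturn CLI (IPython shell escape; works in a notebook cell, not plain Python).
!pip install saturn-dissect
import subprocess

# Profile the source JSON. Arguments are passed as a list, so no shell quoting is needed.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json",
    "--findings", "emoji-unicode_emoji_list_20260119.json",
    "--llm", "anthropic:claude-opus-4-7",
])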

Summary confidence: high

This dataset is a catalog of 5,225 Unicode emoji entries, with each row carrying the emoji glyph, its codepoint sequence, an English-leaning name, and three classification fields (group, subgroup, status). The collection is heavily skewed toward people: in the 'group' field, 'People & Body' accounts for 3,468 of 5,225 rows (about 66%), so most subsequent breakdowns will be dominated by human figures. The 'status' field is similarly lopsided, with 'fully-qualified' covering 3,944 rows (75.5%) versus 1,029 minimally-qualified, 243 unqualified, and 9 component rows. The 'subgroup' column gives a finer 100-way split worth exploring, led by person-activity (697) and person-role (635). Names repeat on 1,272 rows (~24%). That count equals the minimally-qualified and unqualified rows combined (1,029 + 243), which suggests the repeats are mostly alternate encodings sharing a name with their fully-qualified form, rather than skin-tone or gender variants, whose names differ. Both the status variants and the skin-tone and gender variants are worth keeping in mind when counting.

citing: row_count · column_count · columns.group.top_values · columns.group.stats · columns.status.top_values · columns.status.stats · columns.subgroup.top_values · columns.subgroup.stats · columns.name.stats · columns.name.top_words · columns.codepoints.stats

Out[4]:

saturn.schema() · 6 columns

column	kind	n	null%	unique	alerts
emoji	text	5,225	0.0%	5,225	near_unique one_word allcaps short_text
codepoints	text	5,225	0.0%	5,225	near_unique one_word allcaps
status	categorical	5,225	0.0%	4
name	text	5,225	0.0%	3,953	multilingual duplicates
group	categorical	5,225	0.0%	10
subgroup	categorical	5,225	0.0%	100
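
The n, null% and unique columns of the schema table can be reproduced with a few lines of standard-library Python. The sketch below is an illustration only, not Saturn's implementation. It assumes the source JSON decodes to a list of row objects keyed by the six column names (the file layout is not shown in this notebook); `profile_columns` and the sample rows are hypothetical.

```python
from collections import defaultdict

def profile_columns(rows):
    """Return {column: {"n", "null_pct", "unique"}} for a list of dict rows."""
    values = defaultdict(list)
    for row in rows:
        for column, value in row.items():
            values[column].append(value)
    summary = {}
    for column, column_values in values.items():
        present = [v for v in column_values if v is not None]
        n = len(column_values)
        summary[column] = {
            "n": n,
            "null_pct": 100.0 * (n - len(present)) / n,
            "unique": len(set(present)),
        }
    return summary

# Hypothetical rows for illustration; the real file layout is an assumption.
sample_rows = [
    {"emoji": "\U0001F600", "codepoints": "1F600", "status": "fully-qualified",
     "name": "E1.0 grinning face", "group": "Smileys & Emotion", "subgroup": "face-smiling"},
    {"emoji": "\u263A\uFE0F", "codepoints": "263A FE0F", "status": "fully-qualified",
     "name": "E0.6 smiling face", "group": "Smileys & Emotion", "subgroup": "face-affection"},
    {"emoji": "\u263A", "codepoints": "263A", "status": "unqualified",
     "name": "E0.6 smiling face", "group": "Smileys & Emotion", "subgroup": "face-affection"},
]

for column, stats in profile_columns(sample_rows).items():
    print(column, stats)
```

In the sample, the unqualified row repeats the name of its fully-qualified form, so `name` reports fewer unique values than rows while `emoji` and `codepoints` stay unique.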

Fig 1.

group · Shows how dominant 'People & Body' is relative to all other emoji groups.

Top values for group (10 unique shown, of 10 total).
value	count	share
People & Body	3468	66.4%
Objects	316	6.0%
Symbols	305	5.8%
Flags	276	5.3%
Travel & Places	268	5.1%
Smileys & Emotion	187	3.6%
Animals & Nature	167	3.2%
Food & Drink	133	2.5%
Activities	96	1.8%
Component	9	0.2%

Fig 2.

status · Highlights that roughly three quarters of entries are fully-qualified, with smaller minimally-qualified and unqualified slices.

Top values for status (4 unique shown, of 4 total).
value	count	share
fully-qualified	3944	75.5%
minimally-qualified	1029	19.7%
unqualified	243	4.7%
component	9	0.2%

Fig 3.

subgroup · Top subgroups (person-activity, person-role, family) reveal where the catalog packs the most variants.

Top values for subgroup (20 unique shown, of 100 total).
value	count	share
person-activity	697	13.3%
person-role	635	12.2%
family	533	10.2%
person-sport	480	9.2%
person-gesture	300	5.7%
country-flag	259	5.0%
person-fantasy	246	4.7%
person	192	3.7%
animal-mammal	68	1.3%
hand-fingers-open	67	1.3%
sky & weather	65	1.2%
hands	62	1.2%
hand-fingers-partial	55	1.1%
transport-ground	55	1.1%
clothing	50	1.0%
body-parts	49	0.9%
alphanum	49	0.9%
hand-single-finger	43	0.8%
person-resting	42	0.8%
tool	38	0.7%

Fig 4.

name · Distribution of name lengths (median 34 chars) hints at how many descriptors carry skin-tone or gender qualifiers.

Character-length distribution for name (mean: 33.810143540669856).
chars	count
7 – 9	36
9 – 11	154
11 – 13	216
13 – 15	252
15 – 17	319
17 – 19	374
19 – 21	249
21 – 23	189
23 – 25	150
25 – 27	135
27 – 29	125
29 – 31	122
31 – 33	162
33 – 35	252
35 – 37	256
37 – 39	245
39 – 41	275
41 – 43	273
43 – 45	195
45 – 46	162
46 – 48	153
48 – 50	120
50 – 52	63
52 – 54	64
54 – 56	76
56 – 58	60
58 – 60	61
60 – 62	75
62 – 64	66
64 – 66	70
66 – 68	64
68 – 70	40
70 – 72	32
72 – 74	46
74 – 76	28
76 – 78	24
78 – 80	26
80 – 82	8
82 – 84	4
84 – 86	4

Fig 5.

codepoints · Codepoint-string length spans 4 to 54 characters. Only the 1,400 strings in the shortest bin are single code points; the rest are multi-codepoint sequences, many of them ZWJ sequences.

Character-length distribution for codepoints (mean: 18.04019138755981).
chars	count
4 – 5	1400
5 – 6	0
6 – 8	0
8 – 9	0
9 – 10	249
10 – 12	894
12 – 13	0
13 – 14	0
14 – 15	136
15 – 16	79
16 – 18	0
18 – 19	0
19 – 20	141
20 – 22	544
22 – 23	325
23 – 24	0
24 – 25	25
25 – 26	553
26 – 28	15
28 – 29	20
29 – 30	12
30 – 32	42
32 – 33	45
33 – 34	0
34 – 35	6
35 – 36	60
36 – 38	48
38 – 39	105
39 – 40	205
40 – 42	33
42 – 43	3
43 – 44	95
44 – 45	0
45 – 46	0
46 – 48	0
48 – 49	0
49 – 50	95
50 – 52	0
52 – 53	0
53 – 54	95
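
The uneven-looking labels in this table (for example '6 – 8' next to '8 – 9') are consistent with 40 equal-width bins over the observed range of 4 to 54 characters, with edges rounded to integers for display. That binning rule is inferred from the table and is not documented in this notebook. The sketch below only checks that the inference reproduces the labels; `bin_labels` is a hypothetical helper.

```python
def bin_labels(lo, hi, n_bins):
    """Equal-width bin labels with edges rounded to integers for display."""
    width = (hi - lo) / n_bins
    edges = [lo + width * i for i in range(n_bins + 1)]
    # Python's round() sends exact halves to the nearest even integer,
    # so 6.5 -> 6 and 11.5 -> 12.
    return [f"{round(a)} – {round(b)}" for a, b in zip(edges, edges[1:])]

labels = bin_labels(4, 54, 40)
print(labels[:4])   # ['4 – 5', '5 – 6', '6 – 8', '8 – 9']
print(labels[-1])   # 53 – 54
```

With a bin width of 1.25 characters, many bins fall between the lengths that codepoint strings actually take, which would explain the runs of zero counts.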

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
emoji	text	0.0%
codepoints	text	0.0%
status	categorical	0.0%
name	text	0.0%
group	categorical	0.0%
subgroup	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,988 detected strings).
lang	count	share
en	4680	93.8%
de	36	0.7%
fa	33	0.7%
nn	28	0.6%
pl	27	0.5%
es	21	0.4%
ar	19	0.4%
nl	19	0.4%
ru	19	0.4%
fr	17	0.3%
ja	16	0.3%
uk	9	0.2%
id	8	0.2%
it	7	0.1%
th	7	0.1%
ca	5	0.1%
sv	4	0.1%
ta	4	0.1%
pt	4	0.1%
ur	4	0.1%
zh	3	0.1%
ms	3	0.1%
hu	3	0.1%
war	2	0.0%
lt	2	0.0%
fi	2	0.0%
eo	2	0.0%
gl	2	0.0%
la	1	0.0%
ceb	1	0.0%

emoji text identifier

This column holds the emoji glyph itself and is unique per row: all 5,225 values are distinct (n_unique equals n), with a 97.7% emoji_rate. Each entry is a single token (one_word_rate 1.0) of 1 to 10 characters, with a median length of 3. These lengths count code points, since len_mean here (3.416) matches word_mean for the codepoints column. The 51.3% allcaps_rate is unusual for an emoji column; it may reflect entries with letter-like components (e.g., regional indicators or text-based glyphs) rather than pure pictographs, but the stats here cannot confirm that.

Treatment: Treat as a unique key; drop from modelling or use only as a join/lookup field.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["emoji"].stats

stat	value
n	5,225
nulls	0 (0.0%)
unique	5,225
len_min	1
len_max	10
len_mean	3.416
len_median	3
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	5,225
readability_flesch_mean	1.818
emoji_rate	0.9774
url_rate	0
one_word_rate	1
allcaps_rate	0.5133
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	51.3% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
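
The len_* stats above appear to count Unicode code points rather than rendered glyphs: len_mean (3.416) equals word_mean for the codepoints column. A minimal sketch of that relationship follows, using a family sequence chosen for illustration; it is not quoted from this notebook.

```python
# man + ZWJ + woman + ZWJ + girl: one rendered glyph, five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Space-separated uppercase hex, one token per code point.
codepoints = " ".join(f"{ord(ch):04X}" for ch in family)

print(len(family))              # 5
print(codepoints)               # 1F468 200D 1F469 200D 1F467
print(len(codepoints.split()))  # 5
```

Python's `len` on a `str` counts code points, so it lines up with the token count of the hex representation.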

Fig 8.

Character-length distribution for emoji.

Character-length distribution for emoji (mean: 3.415885167464115).
chars	count
1	1400
2	1143
3	215
4	1010
5	613
6	99
7	427
8	128
9	95
10	95

codepoints text identifier

This column holds Unicode codepoint sequences as whitespace-separated hexadecimal tokens. Every one of the 5,225 rows is unique, and the column is flagged all-caps (allcaps_rate 1), although the tokens are quoted here in lower case. Tokens like '200d' (zero-width joiner, 3,747 occurrences), 'fe0f' (variation selector-16, 1,318), and the skin-tone modifiers '1f3fb' through '1f3ff' (703 each) dominate, alongside the man/woman bases '1f468'/'1f469' (676 each). String length averages 18 characters (median 15, max 54) with a median of 3 tokens, consistent with multi-codepoint sequences; only 26.8% of rows are a single code point.

Treatment: Treat as a unique key per emoji; split on whitespace into codepoint tokens if structural features are needed.

anthropic:claude-opus-4-7 · confidence high
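
The whitespace split suggested in the treatment can be sketched as follows. The helper name `codepoint_features` and the choice of features are hypothetical, and the sample string is illustrative, not taken from the file. Parsing with `int(token, 16)` accepts either letter case, which matters because the prose quotes tokens in lower case while the stats flag the column as all-caps.

```python
ZWJ = 0x200D                          # zero-width joiner
VS16 = 0xFE0F                         # variation selector-16
SKIN_TONES = range(0x1F3FB, 0x1F400)  # U+1F3FB through U+1F3FF

def codepoint_features(codepoints: str) -> dict:
    """Split a hex codepoint string into tokens and derive simple features."""
    values = [int(token, 16) for token in codepoints.split()]
    return {
        "emoji": "".join(chr(v) for v in values),
        "n_codepoints": len(values),
        "n_zwj": values.count(ZWJ),
        "has_vs16": VS16 in values,
        "n_skin_tones": sum(v in SKIN_TONES for v in values),
    }

# man + medium skin tone + ZWJ + laptop
features = codepoint_features("1F468 1F3FD 200D 1F4BB")
print(features["n_codepoints"], features["n_zwj"], features["n_skin_tones"])  # 4 1 1
```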

Out[16]:

saturn.columns["codepoints"].stats

stat	value
n	5,225
nulls	0 (0.0%)
unique	5,225
len_min	4
len_max	54
len_mean	18.04
len_median	15
len_p95	43
word_mean	3.416
word_median	3
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	1,451
readability_flesch_mean	118.8
emoji_rate	0
url_rate	0
one_word_rate	0.2679
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	26.8% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 9.

Character-length distribution for codepoints.

Data table: identical to the one shown under Fig 5.

status categorical label

Categorical qualification status with 4 levels and no nulls across 5,225 rows. Heavily dominated by 'fully-qualified' at 75.5% (3,944), with 'minimally-qualified' (1,029; 19.7%) and 'unqualified' (243; 4.7%) trailing, and 'component' a rare tail at just 9 occurrences. An entropy ratio of 0.495 (entropy 0.99 against a maximum of 2 bits for four levels) confirms the imbalance.

Treatment: One-hot encode; consider collapsing the rare 'component' class or stratifying splits to preserve it.

anthropic:claude-opus-4-7 · confidence high
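
A stratified split of the kind mentioned in the treatment can be sketched with the standard library alone. This is a minimal illustration under assumptions: `stratified_split`, the 20% test fraction, and the seed are all hypothetical choices, and the labels are rebuilt from the status counts reported in this notebook, not read from the file.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        # Keep at least one row on each side when the class has 2+ rows.
        if len(indices) > 1:
            n_test = min(max(n_test, 1), len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Label counts taken from the status table in this notebook.
labels = (["fully-qualified"] * 3944 + ["minimally-qualified"] * 1029
          + ["unqualified"] * 243 + ["component"] * 9)
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] == "component" for i in test_idx))  # 2
```

Sampling per class guarantees that the 9 'component' rows appear on both sides of the split, which a plain random split does not.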

Out[19]:

saturn.columns["status"].stats

stat	value
n	5,225
nulls	0 (0.0%)
unique	4
top_value	fully-qualified
top_rate	0.7548
cardinality	4
entropy	0.9896
entropy_ratio	0.4948
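
The reported entropy and entropy_ratio are consistent with Shannon entropy in bits, normalized by log2 of the cardinality. That definition is inferred from the numbers and is not documented in this notebook. A short check against the status and group counts, with hypothetical function names:

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    return entropy_bits(counts) / math.log2(len(counts))

status_counts = [3944, 1029, 243, 9]
group_counts = [3468, 316, 305, 276, 268, 187, 167, 133, 96, 9]

print(round(entropy_bits(status_counts), 3), round(entropy_ratio(status_counts), 3))
print(round(entropy_bits(group_counts), 3), round(entropy_ratio(group_counts), 3))
```

A ratio of 1 would mean all categories are equally frequent; values near 0.5, as for status, indicate that one category carries most of the mass. The ratio is undefined for a single-category column, which this sketch does not guard against.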

Fig 10.

Top values for status.

Data table: identical to the one shown under Fig 2.

name text label

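This column holds short descriptive labels for emoji (e.g. 'E4.0 man detective', 'E15.1 woman walking facing right: dark skin tone'), averaging 5.5 words and 33.8 characters, each with a versioned 'E#.#' prefix. Duplicates are heavy at 24.3% (1,272 rows). Skin-tone variants are unlikely to be the cause, since the tone is part of the name; the count instead equals the minimally-qualified and unqualified rows combined (1,029 + 243), which points to alternate encodings repeating the name of their fully-qualified form. Skin-tone wording still dominates the 1,912-token vocabulary: 'skin' (3,450) and 'tone' (2,800). The sampled language mix in Fig 7, computed across all text columns, tags 4,680 strings as English and 29 other languages, including German (36), Persian (33), and Polish (27); these are likely false positives on short strings rather than true multilingual content. (The multilingual alert for this column reports 31 languages in its sample.)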

Treatment: Treat as the canonical emoji label; strip the 'E#.#' version prefix and skin-tone suffix if you need a deduplicated key.

anthropic:claude-opus-4-7 · confidence high
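
The prefix and suffix stripping suggested in the treatment might look like the sketch below. It assumes names follow the pattern seen in the two examples quoted above (an 'E#.#' prefix, then a base name, then optional comma-separated qualifiers after a colon); `base_name` is a hypothetical helper, and other name shapes in the file may need extra rules.

```python
import re

VERSION_PREFIX = re.compile(r"^E\d+\.\d+\s+")

def base_name(name: str) -> str:
    """Drop the 'E#.#' prefix and any skin-tone qualifiers after the colon."""
    name = VERSION_PREFIX.sub("", name)
    head, sep, tail = name.partition(": ")
    if not sep:
        return name
    # Keep qualifiers that are not skin tones (e.g. gender or hair qualifiers).
    kept = [q for q in tail.split(", ") if not q.endswith("skin tone")]
    return f"{head}: {', '.join(kept)}" if kept else head

print(base_name("E4.0 man detective"))
# man detective
print(base_name("E15.1 woman walking facing right: dark skin tone"))
# woman walking facing right
```

Because many rows share a base name once tones are removed, the result works as a grouping key, not as a unique row identifier.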

Out[22]:

saturn.columns["name"].stats

stat	value
n	5,225
nulls	0 (0.0%)
unique	3,953
len_min	7
len_max	86
len_mean	33.81
len_median	34
len_p95	67
word_mean	5.521
word_median	6
n_empty	0
n_duplicates	1,272
duplicate_rate	0.2434
vocab_size	1,912
readability_flesch_mean	78.3
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0.0001914
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	24.3% duplicate strings
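
The duplicate count can be cross-checked against the status counts using only figures reported in this notebook. The match below is suggestive and does not prove a row-level correspondence; confirming that would require the raw rows.

```python
n_rows = 5225
n_unique_names = 3953
status_counts = {
    "fully-qualified": 3944,
    "minimally-qualified": 1029,
    "unqualified": 243,
    "component": 9,
}

n_duplicates = n_rows - n_unique_names
non_fully_qualified = status_counts["minimally-qualified"] + status_counts["unqualified"]

print(n_duplicates, non_fully_qualified)   # 1272 1272
print(round(n_duplicates / n_rows, 4))     # 0.2434
# Fully-qualified plus component rows equal the number of unique names.
print(status_counts["fully-qualified"] + status_counts["component"])  # 3953
```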

Fig 11.

Character-length distribution for name.

Data table: identical to the one shown under Fig 4.

group categorical feature

This is a categorical grouping column with 10 distinct values matching the standard Unicode emoji category taxonomy (e.g., "People & Body", "Smileys & Emotion", "Flags"). The distribution is heavily imbalanced: "People & Body" alone covers 66.4% of the 5,225 rows, while "Component" appears just 9 times. No nulls, and entropy ratio of 0.57 confirms the skew toward one dominant class.

Treatment: One-hot or target-encode; consider grouping the rare "Component" class given its tiny support.

anthropic:claude-opus-4-7 · confidence high
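
One-hot encoding with a rare-class fallback can be sketched as below. The counts come from the group table in this notebook; `MIN_SUPPORT`, the 'Other' label, and `encode_group` are hypothetical choices for illustration, not Saturn settings.

```python
# Counts from the group table in this notebook.
group_counts = {
    "People & Body": 3468, "Objects": 316, "Symbols": 305, "Flags": 276,
    "Travel & Places": 268, "Smileys & Emotion": 187, "Animals & Nature": 167,
    "Food & Drink": 133, "Activities": 96, "Component": 9,
}
MIN_SUPPORT = 50  # hypothetical threshold

kept = [group for group, count in group_counts.items() if count >= MIN_SUPPORT]
categories = kept + ["Other"]

def encode_group(group: str) -> list[int]:
    """One-hot vector over `categories`; rare or unseen groups map to 'Other'."""
    label = group if group in kept else "Other"
    return [int(label == category) for category in categories]

print(encode_group("Flags"))      # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(encode_group("Component"))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

With this threshold only 'Component' falls into 'Other', so the fallback mainly protects against unseen values at inference time.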

Out[25]:

saturn.columns["group"].stats

stat	value
n	5,225
nulls	0 (0.0%)
unique	10
top_value	People & Body
top_rate	0.6637
cardinality	10
entropy	1.908
entropy_ratio	0.5743

Fig 12.

Top values for group.

Data table: identical to the one shown under Fig 1.

subgroup categorical feature

Categorical taxonomy label with 100 distinct subgroups across 5,225 rows and no nulls, dominated by people-related categories. The top value 'person-activity' covers 13.3% of rows, and the top eight values are all person, family, or flag related. Together they cover 3,342 rows (64.0%), leaving the remaining 92 subgroups to share 1,883 rows in a long tail (entropy ratio 0.75). No single category dominates overwhelmingly, but the person-centric concentration is notable.

Treatment: Group-encode or target-encode given the 100 levels, or collapse rare subgroups into an 'other' bucket before modelling.

anthropic:claude-opus-4-7 · confidence high
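
The concentration in the top subgroups, and the 'other' bucket suggested in the treatment, can be sketched from the counts in the subgroup table. Keeping exactly the top eight is a hypothetical cut-off chosen for illustration, and `collapse` is a hypothetical helper.

```python
N_ROWS = 5225

# Top eight subgroup counts from the subgroup table in this notebook.
top_subgroups = {
    "person-activity": 697, "person-role": 635, "family": 533,
    "person-sport": 480, "person-gesture": 300, "country-flag": 259,
    "person-fantasy": 246, "person": 192,
}

covered = sum(top_subgroups.values())
print(covered, round(100 * covered / N_ROWS, 1))  # 3342 64.0
print(N_ROWS - covered)                           # 1883

def collapse(subgroup: str, keep=frozenset(top_subgroups)) -> str:
    """Map any subgroup outside `keep` to a single 'other' bucket."""
    return subgroup if subgroup in keep else "other"

print(collapse("family"), collapse("tool"))  # family other
```

Collapsing this way reduces 100 levels to 9, at the cost of merging 92 unrelated subgroups; a count-based threshold is an alternative when more of the tail should stay distinct.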

Out[28]:

saturn.columns["subgroup"].stats

stat	value
n	5,225
nulls	0 (0.0%)
unique	100
top_value	person-activity
top_rate	0.1334
cardinality	100
entropy	4.982
entropy_ratio	0.7498

Fig 13.

Top values for subgroup.

Data table: identical to the one shown under Fig 3.

How to cite

BibTeX

@misc{saturn-emoji-unicode-emoji-list-20260119-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: emoji unicode emoji list 20260119},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/emoji-unicode_emoji_list_20260119}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: emoji unicode emoji list 20260119. Source: /home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/emoji-unicode_emoji_list_20260119