saturn·

emoji unicode emoji list 20260119

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json

Saturn profiled 5,225 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

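[2]:
# Install the profiler (IPython shell escape; run this line in a notebook cell).
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json",
    "--findings", "emoji-unicode_emoji_list_20260119.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # check=True raises CalledProcessError if the command exits non-zero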

Summary confidence: high

This dataset is a catalog of 5,225 Unicode emoji, with each row carrying the emoji glyph, its codepoint sequence, an English-leaning name, and three classification fields (group, subgroup, status). The collection is heavily skewed toward people: the 'group' field shows 'People & Body' accounts for 3,468 of 5,225 rows (about 66%), so most subsequent breakdowns will be dominated by human figures. The 'status' field is similarly lopsided, with 'fully-qualified' covering 3,944 rows versus much smaller minimally-qualified (1,029), unqualified (243), and component (9) buckets. The 'subgroup' column gives a finer 100-way split worth exploring, led by person-activity (697) and person-role (635). Name-level duplication (1,272 duplicate names, ~24%) is the other thing to keep in mind when counting. That count equals the minimally-qualified and unqualified rows combined (1,029 + 243), so it most plausibly reflects status variants repeating the name of their fully-qualified form rather than skin-tone or gender variants, whose names differ by qualifier.

citing: row_count · column_count · columns.group.top_values · columns.group.stats · columns.status.top_values · columns.status.stats · columns.subgroup.top_values · columns.subgroup.stats · columns.name.stats · columns.name.top_words · columns.codepoints.stats
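
The headline ratios in the summary can be rechecked from the reported counts alone. The sketch below is an illustration, not saturn's own code: it assumes duplicates are counted as rows minus distinct values, and the variable names are made up for this example.

```python
# Counts copied from the schema and top-value tables in this report.
n_rows = 5225
unique_names = 3953
people_body_rows = 3468
minimally_qualified = 1029
unqualified = 243

# Assumption: a "duplicate" is any row beyond the first with a given name.
n_duplicates = n_rows - unique_names
duplicate_rate = n_duplicates / n_rows
people_share = people_body_rows / n_rows

# Rows whose status is neither fully-qualified nor component.
non_fully_qualified = minimally_qualified + unqualified

print(n_duplicates, round(duplicate_rate, 4), round(people_share, 4))
print(non_fully_qualified == n_duplicates)
```

The last comparison shows the duplicate-name count lining up exactly with the minimally-qualified plus unqualified rows, which is worth keeping in mind before attributing duplicates to any one cause.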

Out[4]:

saturn.schema() · 6 columns

column      kind         n      null%  unique  alerts
emoji       text         5,225  0.0%   5,225   near_unique, one_word, allcaps, short_text
codepoints  text         5,225  0.0%   5,225   near_unique, one_word, allcaps
status      categorical  5,225  0.0%   4
name        text         5,225  0.0%   3,953   multilingual, duplicates
group       categorical  5,225  0.0%   10
subgroup    categorical  5,225  0.0%   100
Fig 1.
group · Shows how dominant 'People & Body' is relative to all other emoji groups.
Top values for group (10 unique shown, of 10 total).
value              count  share
People & Body      3468   66.4%
Objects            316    6.0%
Symbols            305    5.8%
Flags              276    5.3%
Travel & Places    268    5.1%
Smileys & Emotion  187    3.6%
Animals & Nature   167    3.2%
Food & Drink       133    2.5%
Activities         96     1.8%
Component          9      0.2%
Fig 2.
status · Highlights that roughly three quarters of entries are fully-qualified, with smaller minimally-qualified and unqualified slices.
Top values for status (4 unique shown, of 4 total).
value                count  share
fully-qualified      3944   75.5%
minimally-qualified  1029   19.7%
unqualified          243    4.7%
component            9      0.2%
Fig 3.
subgroup · Top subgroups (person-activity, person-role, family) reveal where the catalog packs the most variants.
Top values for subgroup (20 unique shown, of 100 total).
value                 count  share
person-activity       697    13.3%
person-role           635    12.2%
family                533    10.2%
person-sport          480    9.2%
person-gesture        300    5.7%
country-flag          259    5.0%
person-fantasy        246    4.7%
person                192    3.7%
animal-mammal         68     1.3%
hand-fingers-open     67     1.3%
sky & weather         65     1.2%
hands                 62     1.2%
hand-fingers-partial  55     1.1%
transport-ground      55     1.1%
clothing              50     1.0%
body-parts            49     0.9%
alphanum              49     0.9%
hand-single-finger    43     0.8%
person-resting        42     0.8%
tool                  38     0.7%
Fig 4.
name · Distribution of name lengths (median 34 chars) hints at how many descriptors carry skin-tone or gender qualifiers.
Character-length distribution for name (mean: 33.81).
chars    count
7 – 9    36
9 – 11   154
11 – 13  216
13 – 15  252
15 – 17  319
17 – 19  374
19 – 21  249
21 – 23  189
23 – 25  150
25 – 27  135
27 – 29  125
29 – 31  122
31 – 33  162
33 – 35  252
35 – 37  256
37 – 39  245
39 – 41  275
41 – 43  273
43 – 45  195
45 – 46  162
46 – 48  153
48 – 50  120
50 – 52  63
52 – 54  64
54 – 56  76
56 – 58  60
58 – 60  61
60 – 62  75
62 – 64  66
64 – 66  70
66 – 68  64
68 – 70  40
70 – 72  32
72 – 74  46
74 – 76  28
76 – 78  24
78 – 80  26
80 – 82  8
82 – 84  4
84 – 86  4
Fig 5.
codepoints · Codepoint-string length spans 4 to 54 characters, signalling how many emoji are multi-codepoint ZWJ sequences.
Character-length distribution for codepoints (mean: 18.04).
chars    count
4 – 5    1400
5 – 6    0
6 – 8    0
8 – 9    0
9 – 10   249
10 – 12  894
12 – 13  0
13 – 14  0
14 – 15  136
15 – 16  79
16 – 18  0
18 – 19  0
19 – 20  141
20 – 22  544
22 – 23  325
23 – 24  0
24 – 25  25
25 – 26  553
26 – 28  15
28 – 29  20
29 – 30  12
30 – 32  42
32 – 33  45
33 – 34  0
34 – 35  6
35 – 36  60
36 – 38  48
38 – 39  105
39 – 40  205
40 – 42  33
42 – 43  3
43 – 44  95
44 – 45  0
45 – 46  0
46 – 48  0
48 – 49  0
49 – 50  95
50 – 52  0
52 – 53  0
53 – 54  95
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column      kind         null %
emoji       text         0.0%
codepoints  text         0.0%
status      categorical  0.0%
name        text         0.0%
group       categorical  0.0%
subgroup    categorical  0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 4,988 detected strings).
lang  count  share
en    4680   93.8%
de    36     0.7%
fa    33     0.7%
nn    28     0.6%
pl    27     0.5%
es    21     0.4%
ar    19     0.4%
nl    19     0.4%
ru    19     0.4%
fr    17     0.3%
ja    16     0.3%
uk    9      0.2%
id    8      0.2%
it    7      0.1%
th    7      0.1%
ca    5      0.1%
sv    4      0.1%
ta    4      0.1%
pt    4      0.1%
ur    4      0.1%
zh    3      0.1%
ms    3      0.1%
hu    3      0.1%
war   2      0.0%
lt    2      0.0%
fi    2      0.0%
eo    2      0.0%
gl    2      0.0%
la    1      0.0%
ceb   1      0.0%

emoji text identifier

This column appears to be a unique emoji identifier or catalog entry, with all 5,225 values distinct (n_unique equals n) and a 97.7% emoji_rate. Each entry is a single token (one_word_rate 1.0) ranging from 1 to 10 characters with a median length of 3. Because len_mean (3.416) matches the word_mean of the codepoints column, these lengths appear to count code points rather than rendered glyphs. The 51.3% allcaps_rate is unusual for an emoji column and may reflect letter-like components (e.g., regional-indicator letters or text-style glyphs) rather than pure pictographs.
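
Why a single visible emoji can have a length above 1 is easiest to see in code. The following is a minimal sketch, assuming lengths are counted the way Python counts string length (one unit per code point); `codepoint_string` is a hypothetical helper, not part of saturn.

```python
def codepoint_string(emoji: str) -> str:
    """Render each code point as uppercase hex, separated by spaces."""
    return " ".join(f"{ord(ch):04X}" for ch in emoji)

# A family emoji written with escapes: man, ZWJ, woman, ZWJ, girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

n_chars = len(family)                  # code points, not rendered glyphs
hex_form = codepoint_string(family)    # space-separated hex tokens
n_tokens = len(hex_form.split())

print(n_chars, hex_form, n_tokens)
```

Under that assumption, the character length of an `emoji` value and the token count of its `codepoints` value are the same number, which is consistent with the matching means in the two stats tables.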

Treatment: Treat as a unique key; drop from modelling or use only as a join/lookup field.

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["emoji"].stats

stat                     value
n                        5,225
nulls                    0 (0.0%)
unique                   5,225
len_min                  1
len_max                  10
len_mean                 3.416
len_median               3
len_p95                  8
word_mean                1
word_median              1
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               5,225
readability_flesch_mean  1.818
emoji_rate               0.9774
url_rate                 0
one_word_rate            1
allcaps_rate             0.5133
boilerplate_rate         0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: allcaps · 51.3% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for emoji (mean: 3.416).
Bin edges are rounded to whole characters, so each non-empty bin is a single length; empty bins are omitted.
chars  count
1      1400
2      1143
3      215
4      1010
5      613
6      99
7      427
8      128
9      95
10     95

codepoints text identifier

This column holds the Unicode codepoint sequence for each emoji, with every one of the 5,225 rows unique and flagged all-caps (allcaps_rate 1.0), consistent with hexadecimal notation. Tokens like '200d' (zero-width joiner, 3,747 occurrences), 'fe0f' (variation selector-16, 1,318), and skin-tone modifiers '1f3fb'-'1f3ff' (703 each) dominate, alongside the man/woman bases '1f468'/'1f469' (676 each). String length averages 18 characters (max 54) with a median of 3 tokens. Only 26.8% of rows are a single codepoint, so most entries are multi-codepoint sequences, many of them ZWJ sequences.

Treatment: Treat as a unique key per emoji; split on whitespace into codepoint tokens if structural features are needed.
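
One way to act on this treatment is to split each string on whitespace and derive a few structural flags. This is a sketch only: the feature names and the `codepoint_features` helper are illustrative choices, and the input is assumed to be space-separated hex as described above.

```python
ZWJ = "200D"    # zero-width joiner
VS16 = "FE0F"   # variation selector-16
SKIN_TONES = {"1F3FB", "1F3FC", "1F3FD", "1F3FE", "1F3FF"}

def codepoint_features(codepoints: str) -> dict:
    """Split a space-separated hex string and summarize its structure."""
    tokens = codepoints.upper().split()
    return {
        "n_codepoints": len(tokens),
        "has_zwj": ZWJ in tokens,
        "has_vs16": VS16 in tokens,
        "n_skin_tones": sum(t in SKIN_TONES for t in tokens),
    }

# Man + medium skin tone + ZWJ + laptop.
features = codepoint_features("1F468 1F3FD 200D 1F4BB")
print(features)
```

Upper-casing before the comparison keeps the helper indifferent to whether the stored hex is upper or lower case.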

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["codepoints"].stats

stat                     value
n                        5,225
nulls                    0 (0.0%)
unique                   5,225
len_min                  4
len_max                  54
len_mean                 18.04
len_median               15
len_p95                  43
word_mean                3.416
word_median              3
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               1,451
readability_flesch_mean  118.8
emoji_rate               0
url_rate                 0
one_word_rate            0.2679
allcaps_rate             1
boilerplate_rate         0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 26.8% rows are a single word
alert: allcaps · 100.0% rows are all-caps
Fig 9.
Character-length distribution for codepoints.
Data table: identical to the one under Fig 5 (mean: 18.04).

status categorical label

Categorical qualification status with 4 levels and no nulls across 5225 rows. Heavily dominated by 'fully-qualified' at 75.5%, with 'minimally-qualified' (1029) and 'unqualified' (243) trailing, and 'component' a rare tail at just 9 occurrences. Entropy ratio of 0.495 confirms the imbalance.
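
The entropy figures can be reproduced from the four counts. The sketch below assumes Shannon entropy in bits, normalized by log2 of the number of levels; that convention is an inference from the reported numbers, not something the report states, and the function names are illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty levels."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

# fully-qualified, minimally-qualified, unqualified, component
status_counts = [3944, 1029, 243, 9]
h = entropy_bits(status_counts)
ratio = entropy_ratio(status_counts)
print(round(h, 4), round(ratio, 4))
```

A ratio of 1.0 would mean four equally common levels, so a value near 0.49 is a compact way of saying one level carries most of the rows.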

Treatment: One-hot encode; consider collapsing the rare 'component' class or stratifying splits to preserve it.

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["status"].stats

stat           value
n              5,225
nulls          0 (0.0%)
unique         4
top_value      fully-qualified
top_rate       0.7548
cardinality    4
entropy        0.9896
entropy_ratio  0.4948
Fig 10.
Top values for status.
Data table: identical to the one under Fig 2.

name text label

This column holds short descriptive labels for emoji (e.g. 'E4.0 man detective', 'E15.1 woman walking facing right: dark skin tone'), averaging 5.5 words and 33.8 characters, each with a versioned 'E#.#' prefix. Duplicates are heavy at 24.3% (1,272 rows). Skin-tone variants carry their tone in the name, so they are unlikely to be the source; the count instead equals the minimally-qualified and unqualified rows combined (1,029 + 243), which suggests those status variants repeat the name of their fully-qualified form. 'skin' (3,450) and 'tone' (2,800) dominate the vocabulary of 1,912 tokens. Although 4,680 detected strings are tagged English, the language detector also flags 29 other languages including German (36), Persian (33), and Polish (27). These are likely false positives on short labels rather than true multilingual content.

Treatment: Treat as the canonical emoji label; strip the 'E#.#' version prefix and skin-tone suffix if you need a deduplicated key.
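
A possible shape for that cleanup is sketched below. It assumes the prefix is always 'E', digits, a dot, digits, and that skin-tone qualifiers appear after a colon as comma-separated items ending in "skin tone"; the `base_name` helper and the third input are made up for illustration.

```python
import re

# Assumed prefix shape: "E", digits, ".", digits, then whitespace.
_VERSION_PREFIX = re.compile(r"^E\d+\.\d+\s+")

def base_name(name: str) -> str:
    """Drop the version prefix and any skin-tone qualifiers from a name."""
    stripped = _VERSION_PREFIX.sub("", name.strip())
    head, sep, tail = stripped.partition(": ")
    if not sep:
        return stripped
    # Keep qualifiers that are not skin tones (e.g. a hair or beard qualifier).
    kept = [q for q in tail.split(", ") if not q.endswith("skin tone")]
    return f"{head}: {', '.join(kept)}" if kept else head

print(base_name("E4.0 man detective"))
print(base_name("E15.1 woman walking facing right: dark skin tone"))
print(base_name("E0.0 example: dark skin tone, beard"))  # made-up input
```

Gendered names such as "man detective" and "woman detective" stay distinct under this sketch; folding them together would need an additional, more opinionated rule.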

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["name"].stats

stat                     value
n                        5,225
nulls                    0 (0.0%)
unique                   3,953
len_min                  7
len_max                  86
len_mean                 33.81
len_median               34
len_p95                  67
word_mean                5.521
word_median              6
n_empty                  0
n_duplicates             1,272
duplicate_rate           0.2434
vocab_size               1,912
readability_flesch_mean  78.3
emoji_rate               0
url_rate                 0
one_word_rate            0
allcaps_rate             0.0001914
boilerplate_rate         0
alert: multilingual · 31 languages detected in sample
alert: duplicates · 24.3% duplicate strings
Fig 11.
Character-length distribution for name.
Data table: identical to the one under Fig 4 (mean: 33.81).

group categorical feature

This is a categorical grouping column with 10 distinct values matching the standard Unicode emoji category taxonomy (e.g., "People & Body", "Smileys & Emotion", "Flags"). The distribution is heavily imbalanced: "People & Body" alone covers 66.4% of the 5,225 rows, while "Component" appears just 9 times. No nulls, and entropy ratio of 0.57 confirms the skew toward one dominant class.

Treatment: One-hot or target-encode; consider grouping the rare "Component" class given its tiny support.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["group"].stats

stat           value
n              5,225
nulls          0 (0.0%)
unique         10
top_value      People & Body
top_rate       0.6637
cardinality    10
entropy        1.908
entropy_ratio  0.5743
Fig 12.
Top values for group.
Data table: identical to the one under Fig 1.

subgroup categorical feature

Categorical taxonomy label with 100 distinct subgroups across 5,225 rows and no nulls, forming the fine-grained level of the emoji classification scheme. The top value 'person-activity' covers 13.3% of rows, and the top eight values are all person, family, or flag related; together they cover 3,342 rows (64.0%), leaving a long tail of 92 subgroups to share the remaining 36%. No single category dominates overwhelmingly (entropy ratio 0.75), but the person-centric concentration is notable.

Treatment: Group-encode or target-encode given the 100 levels, or collapse rare subgroups into an 'other' bucket before modelling.
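
Collapsing rare subgroups can be done with a frequency count before any encoding. The sketch below is one minimal way to do it; the `collapse_rare` helper, the threshold, and the toy sample are all illustrative assumptions, and a real threshold would need to be chosen against the modelling task.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace labels seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Toy sample using subgroup labels from the table above (counts are made up).
sample = ["family"] * 3 + ["person-role"] * 2 + ["tool", "clothing"]
collapsed = collapse_rare(sample, min_count=2)
print(Counter(collapsed))
```

Counting on the training split only, then applying the same mapping to held-out rows, avoids leaking label frequencies across splits.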

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["subgroup"].stats

stat           value
n              5,225
nulls          0 (0.0%)
unique         100
top_value      person-activity
top_rate       0.1334
cardinality    100
entropy        4.982
entropy_ratio  0.7498
Fig 13.
Top values for subgroup.
Data table: identical to the one under Fig 3.

How to cite

BibTeX
@misc{saturn-emoji-unicode-emoji-list-20260119-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: emoji unicode emoji list 20260119},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/emoji-unicode_emoji_list_20260119}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: emoji unicode emoji list 20260119. Source: /home/coolhand/html/datavis/data_trove/cache/emoji/unicode_emoji_list_20260119.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/emoji-unicode_emoji_list_20260119