Unicode: Behind the Curtain
The Unicode Consortium celebrated its 25th anniversary last year. The truth is that despite all the work Unicode does to ensure text from languages around the world works, most of us know Unicode as the group that approves new emojis.
What might not be so clear is why a large consortium is required, or how much hidden complexity there is in Unicode. Or how the vomit emojis shown in the XKCD cartoon above are already considered "valid (but not recommended)".
Above: Many think of Unicode in terms of emoji support. We do.
Mark Davis, co-founder and current-day president of Unicode, has sought to clarify how emoji fits into Unicode in this high-level overview that looks at what Unicode is, and how the Unicode Emoji Subcommittee ("Emoji SC")[1] fits into it.
Davis notes that emojis make up just a fraction of the total number of characters in the Unicode Standard. You can barely make them out in this chart:
All images that follow are from this presentation.
Characters alone don't tell half the story. A number of glyphs need to combine when displayed in certain orders or combinations.
A combination that will be familiar to many is how emoji skin tones are implemented.
These work by detecting when a modifier character is displayed after a human emoji, such as 👧 Girl. These combine on supported platforms to show a single emoji:
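As a rough illustration of the modifier mechanism described above, this Python sketch appends a skin tone modifier to 👧 Girl. The variable names are made up for the example; the code points are the standard ones. Whether the pair renders as a single glyph depends on the platform's emoji support.

```python
# Hypothetical names for illustration; the code points are standard.
GIRL = "\U0001F467"               # 👧 Girl
MEDIUM_SKIN_TONE = "\U0001F3FD"   # 🏽 Emoji Modifier Fitzpatrick Type-4

# A modifier sequence is the base emoji followed directly by the modifier.
girl_medium = GIRL + MEDIUM_SKIN_TONE

# Shown as one emoji where supported, but stored as two code points.
print([f"U+{ord(ch):04X}" for ch in girl_medium])  # ['U+1F467', 'U+1F3FD']
```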
A more complicated implementation involves joining two or more emojis together into what is called an Emoji ZWJ Sequence.
These are used to create professions such as the 👩‍⚖️ Woman Judge. This emoji is created using the 👩 Woman and ⚖️ Balance Scale emojis in sequence.
A "ZWJ" (Zero Width Joiner) character stands between these two emojis, and is an invisible glue that joins multiple emojis into one (where supported).
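To make the "invisible glue" concrete, here is a minimal Python sketch that builds the Woman Judge sequence by hand and lists the code points inside it. The constant names are assumptions for this example. Note that the standard library reports official character names, which can differ from the emoji names used in this article.

```python
import unicodedata

WOMAN = "\U0001F469"             # 👩 Woman
ZWJ = "\u200D"                   # Zero Width Joiner (invisible)
BALANCE_SCALE = "\u2696\uFE0F"   # ⚖️ Balance Scale plus VS-16 (emoji presentation)

# Woman + ZWJ + Balance Scale: rendered as one emoji where supported.
woman_judge = WOMAN + ZWJ + BALANCE_SCALE

# List every code point in the sequence with its official character name.
for ch in woman_judge:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

What looks like one emoji on screen is four code points in the underlying text.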
Other types of ZWJ Sequences list an existing emoji such as 🕵️ Detective with a gender symbol ♀️ Female Sign[2] added after it.
This type of ZWJ Sequence is generally used if an emoji already exists. For example: runner, surfer, or many of the gestures.
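A small sketch of this pattern in Python, assuming a hypothetical helper named `with_female_sign` (not part of any library). It appends a ZWJ and the emoji-presentation Female Sign to an existing emoji:

```python
ZWJ = "\u200D"                   # Zero Width Joiner
VS16 = "\uFE0F"                  # Variation Selector-16: request emoji presentation
FEMALE_SIGN = "\u2640" + VS16    # ♀️ defaults to text, so VS-16 is appended
DETECTIVE = "\U0001F575" + VS16  # 🕵️ also defaults to text presentation
RUNNER = "\U0001F3C3"            # 🏃 already defaults to emoji presentation

def with_female_sign(base: str) -> str:
    """Hypothetical helper: existing emoji + ZWJ + Female Sign."""
    return base + ZWJ + FEMALE_SIGN

woman_detective = with_female_sign(DETECTIVE)
woman_running = with_female_sign(RUNNER)

print(" ".join(f"{ord(ch):04X}" for ch in woman_detective))  # 1F575 FE0F 200D 2640 FE0F
print(" ".join(f"{ord(ch):04X}" for ch in woman_running))    # 1F3C3 200D 2640 FE0F
```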
And yes, you can combine modifiers and ZWJs to create a longer sequence.
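As a sketch of such a longer sequence, the Python below combines a skin tone modifier with the Woman Judge ZWJ Sequence. The names are illustrative only. The modifier sits directly after the emoji it modifies, and the ZWJ follows the modifier.

```python
ZWJ = "\u200D"                    # Zero Width Joiner
WOMAN = "\U0001F469"              # 👩 Woman
MEDIUM_SKIN_TONE = "\U0001F3FD"   # 🏽 skin tone modifier
BALANCE_SCALE = "\u2696\uFE0F"    # ⚖️ Balance Scale plus VS-16

# Base emoji, then its modifier, then ZWJ, then the second emoji.
woman_judge_medium = WOMAN + MEDIUM_SKIN_TONE + ZWJ + BALANCE_SCALE

print(len(woman_judge_medium))  # 5 code points for a single displayed emoji
```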
Unicode doesn't control ZWJ Sequences in the same way as new emojis that require their own code point.
Unicode does recommend sequences that should be supported for cross-platform consistency. However, vendors are free to combine any emoji with any other, as they see fit.
Microsoft has six Ninja Cats available in Windows which aren't part of Unicode's recommended list. 🐱 Cat Face and 🚀 Rocket are combined on Windows 10 to show an emoji for 🐱‍🚀 Astro Cat.
Astro Cat is valid (as it uses a correct sequence structure) but not recommended like other professions and genders are.
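A brief Python sketch of what that means in practice. The `fallback_parts` function is a hypothetical helper written for this example: it shows the pieces a platform is left with when it has no combined glyph for a sequence, which are typically displayed side by side.

```python
ZWJ = "\u200D"             # Zero Width Joiner
CAT_FACE = "\U0001F431"    # 🐱 Cat Face
ROCKET = "\U0001F680"      # 🚀 Rocket

# Same structure as a recommended ZWJ Sequence: emoji + ZWJ + emoji.
astro_cat = CAT_FACE + ZWJ + ROCKET

def fallback_parts(sequence: str) -> list:
    """Hypothetical helper: the individual emojis left when the join is unsupported."""
    return sequence.split(ZWJ)

print([f"U+{ord(part):04X}" for part in fallback_parts(astro_cat)])  # ['U+1F431', 'U+1F680']
```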
XKCD suggested that vomit should be a modifier character to make a "Vomiting Cowboy".
Davis points out that 🤠 Cowboy Hat Face could already be combined with 🤮 Face Vomiting to create a valid ZWJ Sequence:
Above: Vomiting suggestions from XKCD. Davis notes that no modifier is needed for these.
Other sequence types exist for emoji, including flag sequences, tag sequences and keycap sequences. You should check out the entire set of slides to see these in more detail.
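For a quick feel of how these three sequence types are put together, here is a Python sketch. The function names are invented for this example, and the code does no validation: it will happily build sequences that no platform displays.

```python
def flag(region_code: str) -> str:
    """Flag sequence: two Regional Indicator symbols, one per letter (e.g. 'JP')."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region_code.upper())

def keycap(key: str) -> str:
    """Keycap sequence: the key, VS-16, then Combining Enclosing Keycap."""
    return key + "\uFE0F\u20E3"

def subdivision_flag(tag: str) -> str:
    """Tag sequence: Black Flag, one tag character per letter, then Cancel Tag."""
    black_flag = "\U0001F3F4"
    cancel_tag = "\U000E007F"
    return black_flag + "".join(chr(0xE0000 + ord(c)) for c in tag.lower()) + cancel_tag

print([f"U+{ord(ch):04X}" for ch in flag("JP")])      # ['U+1F1EF', 'U+1F1F5']
print([f"U+{ord(ch):04X}" for ch in keycap("1")])     # ['U+0031', 'U+FE0F', 'U+20E3']
print(len(subdivision_flag("gbeng")))                 # 7 code points for the England flag
```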
Finally, a look at the (current, 2017) timeline for how a new emoji is born:
🚨 Update April 2020: the current timeline for how a new emoji is created has been significantly impacted by the COVID-19 pandemic. You can read more about the revised schedule for 2020 and beyond here.
Of course Unicode still has plenty to do outside of emoji support:
"There are approximately 7,000 living human languages, with varying levels of vitality. Less than 100 of these languages are well-supported on computers, mobile phones, and other devices, while all the rest risk being digitally disadvantaged"
Unicode has an Adopt a Character program. Funds raised from adoptions go toward research to support these digitally disadvantaged languages.
[1] Disclaimer: I am a member of the Unicode Emoji Subcommittee.
[2] VS-16 is an invisible character that tells the previous character to use emoji presentation. In this example, ♀️ Female Sign has a text and emoji version. ZWJ Sequences should specify the emoji version if both exist and the default is text.