runes – Unicode-aware JS string splitting with full Emoji support

At Small Town Heroes I’m currently working on a newsreader app built using React Native. On Android (even 7.1.1) we noticed a weird issue where some emoji would render incorrectly when we applied styling to them using index-based ranges: the range seemed to be off by one, splitting the emoji into its separate UTF-16 code units. What made this issue even weirder is that the behaviour stopped when we connected the app to a debugging session.

Figure: Result of applying a specific style on this 41 symbol counting sentence.

As you might be aware, emoji can be multibyte strings and thus comprise two (or even more) UTF-16 code units. When asked for the length of a string, JavaScript counts the number of code units, not the number of symbols. Technically correct, but not so good for us:

// I count 7, what about you my dear JavaScript?
>> 'Emoji 🤖'.length
8
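
To see where that extra count comes from, it helps to put the UTF-16 code units next to the code point. A small sketch (the variable names are only there for illustration, and the emoji is written as an escape so the file encoding doesn't matter):

```javascript
const robot = '\u{1F916}'; // the robot emoji used in the sentence above

// A code point outside the Basic Multilingual Plane is stored as
// two UTF-16 code units (a surrogate pair), and .length counts those.
const codeUnits = robot.length;                          // 2
const highSurrogate = robot.charCodeAt(0).toString(16);  // 'd83e'
const lowSurrogate = robot.charCodeAt(1).toString(16);   // 'dd16'
const codePoint = robot.codePointAt(0).toString(16);     // '1f916'

console.log(codeUnits, highSurrogate, lowSurrogate, codePoint);
```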


To get the correct symbol count, you can use Array.from() or the spread operator (*):

>> Array.from('Emoji 🤖');
["E", "m", "o", "j", "i", " ", "🤖"]

>> Array.from('Emoji 🤖').length
7

>> [...'Emoji 🤖'].length
7

(*) Do note that this technique is not 100% bulletproof though. It has “problems” with skin tone modifiers and other emoji combinations – which in itself yields some fun results – but let’s ignore that for now.
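
To make that caveat a bit more concrete, here is a small sketch of what happens with a skin tone modifier and with a family emoji. The two sequences are written as escapes, and the variable names are just mine:

```javascript
// Thumbs up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD)
const thumbsUp = '\u{1F44D}\u{1F3FD}';

// Four people emoji glued together with zero-width joiners (U+200D)
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(thumbsUp.length);      // 4 code units
console.log([...thumbsUp].length); // 2 code points, yet only one visible symbol
console.log(family.length);        // 11 code units
console.log([...family].length);   // 7 code points, yet only one visible symbol
```

So the spread operator gets you from code units to code points, but one visible symbol can still consist of several code points.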

Knowing how to get the correct count, it’s possible to extract proper substrings from that sentence, to apply your styling on (*):

// Wrong way to do it (not multibyte-aware)
>> 'Emoji 🤖'.substr(0,7);
"Emoji �"

// Correct way to do it (multibyte-aware)
>> [...'Emoji 🤖'].slice(0,7).join('');
"Emoji 🤖"

(*) Why not just use String#length and String#split all throughout our code (thus bypassing the whole thing) you might wonder? Well, the editor used to input the article *is* multibyte aware, so it would return 7 as the length of that sentence 😉
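
If you need this in more than one place, the snippet above could be wrapped in a tiny helper. A minimal sketch, where sliceSymbols is a made-up name and not something from our codebase or from a library:

```javascript
// Hypothetical helper: slice by code points instead of UTF-16 code units.
// Like the Array.from() technique, it does not keep emoji combinations together.
function sliceSymbols(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

const sentence = 'Emoji \u{1F916} rocks';

console.log(sliceSymbols(sentence, 0, 7)); // the first 7 symbols, robot included
console.log(sliceSymbols(sentence, 6, 7)); // just the robot
console.log(sentence.slice(6, 7));         // a lone surrogate: half a robot
```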

Now, even though we were using Array.from() to get the correct substrings, we ran into issues on Android whilst doing so: it would always yield "Emoji �", no matter which technique we used. Long story short: we found out that the runtime on the Android phone – somehow – was using a non-multibyte aware Array.from(), explaining the wrong result.

// Android 7.1.1
>> Array.from('Emoji 🤖');
["E", "m", "o", "j", "i", " ",  "�", "�"] // <-- Wait, wut?

With this, we also found out that the JavaScript runtime used during a debugging session is different from the one used in standalone mode. For the debug session, the one contained in node (i.e. V8) would be used.
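
One way to spot such a runtime is to check whether Array.from() keeps a surrogate pair together, and to fall back to pairing the surrogates by hand when it doesn't. This is only a sketch of the idea, not what we ended up shipping; all three names below are made up:

```javascript
// Hypothetical check: does Array.from() return the robot emoji as one item?
function hasCodePointAwareArrayFrom() {
  return Array.from('\u{1F916}').length === 1;
}

// Fallback that walks the code units and pairs up surrogates itself
function toCodePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const next = str.charCodeAt(i + 1); // NaN at the end of the string
    const isPair =
      code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      result.push(str[i] + str[i + 1]);
      i++; // skip the low surrogate, it was just consumed
    } else {
      result.push(str[i]);
    }
  }
  return result;
}

// Pick whichever implementation behaves on the current runtime
const splitSymbols = hasCodePointAwareArrayFrom()
  ? (str) => Array.from(str)
  : toCodePoints;

console.log(splitSymbols('Emoji \u{1F916}').length);
```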

The solution to bypassing this mysterious problem was to use runes, a library that's Unicode-aware. On top of that, it also plays nicely with skin tone modifiers and other emoji combinations, making it superior to the Array.from() technique.

const runes = require('runes');
 
// Standard String.split 
'♥️'.split('') => ['♥', '️']
'Emoji 🤖'.split('') => ['E', 'm', 'o', 'j', 'i', ' ', '�', '�']
'👩‍👩‍👧‍👦'.split('') => ['�', '�', '‍', '�', '�', '‍', '�', '�', '‍', '�', '�']
 
// ES6 string iterator 
[...'♥️'] => [ '♥', '️' ]
[...'Emoji 🤖'] => [ 'E', 'm', 'o', 'j', 'i', ' ', '🤖' ]
[...'👩‍👩‍👧‍👦'] => [ '👩', '', '👩', '', '👧', '', '👦' ]
 
// Runes 
runes('♥️') => ['♥️']
runes('Emoji 🤖') => ['E', 'm', 'o', 'j', 'i', ' ', '🤖']
runes('👩‍👩‍👧‍👦') => ['👩‍👩‍👧‍👦']
 
// String.substring 
'👨‍👨‍👧‍👧a'.substring(1) => '�‍👨‍👧‍👧a'
 
// Runes 
runes.substring('👨‍👨‍👧‍👧a', 1) => 'a'
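
As a side note: runtimes that ship Intl.Segmenter can do a comparable split into visible symbols without a library. This is not what we used in the app, and I'm assuming here that the API is present at all (the sketch returns null when it isn't). splitGraphemes is a made-up name:

```javascript
// Assumption: Intl.Segmenter may or may not exist on the current runtime
function splitGraphemes(str) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null; // not supported here, use a library such as runes instead
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each item yielded by segment() carries the matched text in .segment
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

// The family emoji: four people joined by zero-width joiners (U+200D)
const familyEmoji = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(splitGraphemes(familyEmoji));
console.log(splitGraphemes('Emoji \u{1F916}'));
```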

runes – Unicode-aware JS string splitting with full Emoji support →

Related: The presentation Javascript ❤️ Unicode and accompanying blogpost JavaScript has a Unicode Problem by Mathias Bynens are pure gold when it comes to JavaScript and Unicode 😉

Published by Bramus!

Bramus is a frontend web developer from Belgium, working as a Chrome Developer Relations Engineer at Google. From the moment he discovered view-source at the age of 14 (way back in 1997), he fell in love with the web and has been tinkering with it ever since (more …)

Unless noted otherwise, the contents of this post are licensed under the Creative Commons Attribution 4.0 License and code samples are licensed under the MIT License
