runes – Unicode-aware JS string splitting with full Emoji support

At Small Town Heroes I'm currently working on a newsreader app built using React Native. On Android (even 7.1.1) we noticed a weird issue where some emoji would render incorrectly when we applied styling to them using index-based ranges: the range seemed to be off by one, splitting the emoji into its separate code units. What made this issue even weirder is that the behaviour stopped when we connected the app to a debugging session.

Figure: Result of applying a specific style to this 41-symbol sentence.

As you might be aware, emoji can be multibyte and thus comprise two (or even more) UTF-16 code units. When asked for the length of a string, JavaScript counts the number of UTF-16 code units, not the number of symbols. Technically correct, but not so good for us:

// I count 7, what about you my dear JavaScript?
>> 'Emoji 🤖'.length
8


To get the correct symbol count, you can use Array.from() or the spread operator (*):

>> Array.from('Emoji 🤖');
["E", "m", "o", "j", "i", " ", "🤖"]

>> Array.from('Emoji 🤖').length
7

>> [...'Emoji 🤖'].length
7

(*) Do note that this technique is not 100% bulletproof though. It has “problems” with skin tone modifiers and other emoji combinations – which in itself yields some fun results – but let's ignore that for now.
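As a quick illustration of those “problems”, the sketch below counts two combined emoji by code point. The emoji are written as escapes so the individual parts are visible; the variable names are only illustrative.

```javascript
// Thumbs up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD):
// rendered as a single symbol, 👍🏽.
const thumbsUp = '\u{1F44D}\u{1F3FD}';

// A family: four people joined by three zero-width joiners (U+200D),
// rendered as a single symbol, 👩‍👩‍👧‍👦.
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

// Array.from() and the spread operator split on code points,
// so each part of a combination is counted separately.
const thumbsUpParts = Array.from(thumbsUp).length; // 2, not 1
const familyParts = [...family].length;            // 7, not 1

// String#length is even further off, as it counts UTF-16 code units.
const familyCodeUnits = family.length;             // 11

console.log({ thumbsUpParts, familyParts, familyCodeUnits });
```

So one visible symbol can still be counted as several items, which matters as soon as your text contains these combinations.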

Knowing how to get the correct count, it's possible to extract proper substrings from that sentence, to apply your styling on (*):

// Wrong way to do it (not multibyte-aware)
>> 'Emoji πŸ€–'.substr(0,7);
"Emoji οΏ½"

// Correct way to do it (multibyte-aware)
>> [...'Emoji πŸ€–'].slice(0,7).join('');
"Emoji πŸ€–"

(*) You might wonder: why not just use String#length and String#split throughout our code (thus bypassing the whole thing)? Well, the editor used to input the article *is* multibyte-aware, so it would return 7 as the length of that sentence 😉
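If you need this in several places, the slicing technique can be wrapped in a small helper. This is a minimal sketch: `sliceSymbols` is a hypothetical name of my own, not something from our codebase or from a library, and it assumes a runtime with a code-point-aware Array.from().

```javascript
// Hypothetical helper: slice a string by symbol (code point) index
// instead of by UTF-16 code unit index.
function sliceSymbols(text, start, end) {
  return Array.from(text).slice(start, end).join('');
}

const sentence = 'Emoji 🤖 rocks';

// The editor reports the robot at symbol index 6.
const styled = sliceSymbols(sentence, 6, 7);   // '🤖'
const prefix = sliceSymbols(sentence, 0, 7);   // 'Emoji 🤖'

// String#slice with the same indices cuts the surrogate pair in half
// and leaves a lone high surrogate.
const broken = sentence.slice(6, 7);           // '\uD83E'

console.log(styled, prefix, broken.length);
```

Like Array.from() itself, this helper still splits skin tone modifiers and other emoji combinations into their parts.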

Now, even though we were using Array.from() to get the correct substrings, we ran into issues on Android whilst doing so: it would aways yield "Emoji οΏ½", no matter which technique we used. Long story short: we found out that the runtime on the Android phone – somehow – was using a non-multibyte aware Array.from(), explaining the wrong result.

// Android 7.1.1
>> Array.from('Emoji πŸ€–');
["E", "m", "o", "j", "i", " ",  "οΏ½", "οΏ½"] // <-- Wait, wut?

With this, we also found out that the JavaScript runtime used during a debugging session is different from the one used in standalone mode. For the debug session, the one contained in Node (i.e. V8) would be used.
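A check along these lines could have told us which kind of runtime we were on. It is only a sketch: `isCodePointAware` and `splitByCodePoint` are hypothetical names, and the manual fallback pairs up surrogates itself, so it does not depend on Array.from() at all.

```javascript
// Hypothetical check: does the given splitter keep a surrogate pair together?
// By default it tests the runtime's own Array.from().
function isCodePointAware(split = (text) => Array.from(text)) {
  return split('\uD83E\uDD16').length === 1; // '🤖'
}

// Hypothetical fallback: split by code point without relying on Array.from().
function splitByCodePoint(text) {
  const symbols = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const next = text.charCodeAt(i + 1); // NaN past the end, so the checks fail
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && // high surrogate
      next >= 0xdc00 && next <= 0xdfff;   // low surrogate
    if (isPair) {
      symbols.push(text[i] + text[i + 1]);
      i++; // skip the low surrogate we just consumed
    } else {
      symbols.push(text[i]);
    }
  }
  return symbols;
}

// Simulate the broken runtime with a code-unit-based splitter.
console.log(isCodePointAware((text) => text.split(''))); // false
console.log(splitByCodePoint('Emoji 🤖'));
```

The fallback only fixes surrogate pairs. It does not handle skin tone modifiers or other emoji combinations.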

The solution to this mysterious problem was to use runes, a Unicode-aware library. On top of that, it also plays nice with skin tone modifiers and other emoji combinations, making it superior to the Array.from() technique.

const runes = require('runes');
 
// Standard String.split 
'β™₯️'.split('') => ['β™₯', '️']
'Emoji πŸ€–'.split('') => ['E', 'm', 'o', 'j', 'i', ' ', 'οΏ½', 'οΏ½']
'πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦'.split('') => ['οΏ½', 'οΏ½', '‍', 'οΏ½', 'οΏ½', '‍', 'οΏ½', 'οΏ½', '‍', 'οΏ½', 'οΏ½']
 
// ES6 string iterator 
[...'β™₯️'] => [ 'β™₯', '️' ]
[...'Emoji πŸ€–'] => [ 'E', 'm', 'o', 'j', 'i', ' ', 'πŸ€–' ]
[...'πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦'] => [ 'πŸ‘©', '', 'πŸ‘©', '', 'πŸ‘§', '', 'πŸ‘¦' ]
 
// Runes 
runes('β™₯️') => ['β™₯️']
runes('Emoji πŸ€–') => ['E', 'm', 'o', 'j', 'i', ' ', 'πŸ€–']
runes('πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦') => ['πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦']
const runes = require('runes')
 
// String.substring 
'πŸ‘¨β€πŸ‘¨β€πŸ‘§β€πŸ‘§a'.substring(1) => 'οΏ½β€πŸ‘¨β€πŸ‘§β€πŸ‘§a'
 
// Runes 
runes.substring('πŸ‘¨β€πŸ‘¨β€πŸ‘§β€πŸ‘§a', 1) => 'a'

runes – Unicode-aware JS string splitting with full Emoji support →

Related: The presentation Javascript ❤️ Unicode and the accompanying blog post JavaScript has a Unicode Problem by Mathias Bynens are pure gold when it comes to JavaScript and Unicode 😉
