Can I Unicode‽ Unicode support across JavaScript engines

Small site by @mathias (who else?) on the Unicode support in browsers and their JavaScript engines.

Changes in the Unicode Standard can affect the JavaScript language. This page gives concrete examples and tracks which exact Unicode version each browser supports for each feature.

What surprises me is that some engines support different Unicode versions per feature. Safari, for example, has Unicode v12 support for identifiers but only Unicode v11 for RegExp property escapes.
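To get a feel for how support can differ per feature, you can probe each feature separately. What follows is only a rough sketch: the helper names supportsPropertyEscape and supportsIdentifier are made up for this example, and the probes are ones I picked myself, not the tests the site runs.

```javascript
// Hypothetical helper: does this engine's RegExp know the given property escape?
function supportsPropertyEscape(property) {
  try {
    new RegExp('\\p{' + property + '}', 'u');
    return true;
  } catch (error) {
    return false; // SyntaxError: unknown property name or value
  }
}

// Hypothetical helper: can the given string be used as an identifier name?
function supportsIdentifier(name) {
  try {
    new Function('var ' + name + ';');
    return true;
  } catch (error) {
    return false; // SyntaxError: not a valid identifier in this engine
  }
}

console.log(supportsPropertyEscape('Script=Greek'));
console.log(supportsIdentifier('\u03B1')); // Greek small letter alpha
```

Swapping in a script or character that was added in a recent Unicode version should tell you, per feature, whether the engine's data is recent enough.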

Can I Unicode‽ →

Unicode Patterns with <css-doodle />

<css-doodle /> is a web component for drawing patterns with CSS.

The component will generate a grid of divs by the rules (plain CSS) inside it. You can easily manipulate those cells using CSS to come up with a graphic pattern or an animated graph. The limit is the limit of CSS itself.

Combine <css-doodle /> with generated content and Unicode characters, and you can create nice decorative patterns as shown above.

<css-doodle>
  :doodle {
    @grid: 21 / 35em 20em;
    overflow: hidden;
  }

  --c: @pick(
    #11CBD7, #C6F1E7, #FA4659
  );

  :after {
    content: '\@hex(@rand(0x1401, 0x141b))';
    font-size: 2em; 
    background: linear-gradient(
      @rand(360deg),
      transparent 50%, var(--c) 50%
    );
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    -webkit-text-stroke: var(--c);
    -webkit-text-stroke-width: .5px;
    color: transparent;
  } 
</css-doodle>

Unicode Patterns →
Unicode Patterns demos →
<css-doodle />

runes – Unicode-aware JS string splitting with full Emoji support

At Small Town Heroes I’m currently working on a newsreader app built using React Native. On Android (even 7.1.1) we noticed a weird issue where some emoji would render incorrectly when we applied styling to them using index-based ranges: the range seemed to be off by one, splitting the emoji into separate parts. What made this issue even weirder is that the behaviour stopped when we connected the app to a debugging session.

Figure: Result of applying a specific style to this 41-symbol sentence.

As you might be aware, emoji can be multibyte strings and thus comprise two (or even more) UTF-16 code units. When asked for the length of a string, JavaScript counts those code units, not the number of symbols. Technically correct, but not so good for us:

// I count 7, what about you my dear JavaScript?
>> 'Emoji 🤖'.length
8
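Strictly speaking, what String#length counts are UTF-16 code units. A small sketch to make the difference visible (I've written the robot emoji as an escape sequence so the snippet survives copy-pasting):

```javascript
const sentence = 'Emoji \u{1F916}'; // U+1F916 is the robot face emoji

// .length counts UTF-16 code units; the emoji needs two (a surrogate pair).
const codeUnits = sentence.length;

// The string iterator walks code points, so the emoji counts once.
const codePoints = [...sentence].length;

// The two halves of the surrogate pair, as hexadecimal code units.
const high = sentence.charCodeAt(6).toString(16);
const low = sentence.charCodeAt(7).toString(16);

console.log(codeUnits, codePoints, high, low);
```

The two code units at the end are the halves you end up with when a range cuts the emoji in two.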


To get the correct symbol count, you can use Array.from() or the spread operator (*):

>> Array.from('Emoji 🤖');
["E", "m", "o", "j", "i", " ", "🤖"]

>> Array.from('Emoji 🤖').length
7

>> [...'Emoji 🤖'].length
7

(*) Do note that this technique is not 100% bulletproof though. It has “problems” with skin tone modifiers and other emoji combinations – which in itself yields some fun results – but let’s ignore that for now.
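To illustrate that caveat: a thumbs up with a skin tone modifier is one symbol on screen, but two code points. The sketch below compares the three ways of counting. It assumes the runtime ships Intl.Segmenter (not every engine does), and countGraphemes is a name I made up for the example.

```javascript
// Thumbs up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD).
const thumbsUp = '\u{1F44D}\u{1F3FD}';

// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in this runtime.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

console.log(thumbsUp.length);          // UTF-16 code units
console.log([...thumbsUp].length);     // code points
console.log(countGraphemes(thumbsUp)); // grapheme clusters
```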

Knowing how to get the correct count, it’s possible to extract proper substrings from that sentence, to apply your styling on (*):

// Wrong way to do it (not multibyte-aware)
>> 'Emoji 🤖'.substr(0,7);
"Emoji �"

// Correct way to do it (multibyte-aware)
>> [...'Emoji 🤖'].slice(0,7).join('');
"Emoji 🤖"

(*) Why not just use String#length and String#split all throughout our code (thus bypassing the whole thing) you might wonder? Well, the editor used to input the article *is* multibyte aware, so it would return 7 as the length of that sentence 😉
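If the ranges you receive are counted in symbols but the API you hand them to expects String#length-style offsets, you can also translate the indexes instead of splitting the string. A minimal sketch, assuming the ranges are counted in code points; codePointIndexToOffset is a hypothetical helper, not something from our codebase:

```javascript
// Hypothetical helper: convert an index counted in code points into the
// matching offset in UTF-16 code units.
function codePointIndexToOffset(text, index) {
  let offset = 0;
  let seen = 0;
  for (const symbol of text) {
    if (seen === index) break;
    offset += symbol.length; // 1 for BMP characters, 2 for surrogate pairs
    seen += 1;
  }
  return offset;
}

const text = '\u{1F916} says hi'; // robot face emoji followed by plain text
const start = codePointIndexToOffset(text, 1);
const end = codePointIndexToOffset(text, 6);
console.log(text.slice(start, end));
```

Like the spread operator, this relies on the string iterator, so it has the same blind spot for skin tone modifiers and other emoji combinations.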

Now, even though we were using Array.from() to get the correct substrings, we ran into issues on Android whilst doing so: it would always yield "Emoji �", no matter which technique we used. Long story short: we found out that the runtime on the Android phone – somehow – was using a non-multibyte aware Array.from(), explaining the wrong result.

// Android 7.1.1
>> Array.from('Emoji 🤖');
["E", "m", "o", "j", "i", " ",  "�", "�"] // <-- Wait, wut?
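In hindsight, a quick runtime check would have surfaced this a lot sooner. Here's a sketch of such a check, with a regex-based fallback that keeps surrogate pairs together. Both arrayFromIsCodePointAware and toCodePoints are names made up for this example, and the fallback still knows nothing about skin tone modifiers or other emoji combinations.

```javascript
// True when Array.from() keeps a surrogate pair together as one element.
// The emoji is written as two escapes so even older parsers accept it.
const arrayFromIsCodePointAware = Array.from('\uD83E\uDD16').length === 1;

// Hypothetical fallback: split on code points without the string iterator.
function toCodePoints(text) {
  return text.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g) || [];
}

const symbols = arrayFromIsCodePointAware
  ? Array.from('Emoji \uD83E\uDD16')
  : toCodePoints('Emoji \uD83E\uDD16');

console.log(arrayFromIsCodePointAware, symbols.length);
```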

With this, we also found out that the JavaScript runtime used during a debugging session is different from the one used in standalone mode. For the debug session, the one contained in node (i.e. V8) would be used.

The solution to this mysterious problem was to use runes, a library that’s Unicode-aware. On top of that, it also plays nicely with skin tone modifiers and other emoji combinations, making it superior to the Array.from() technique.

const runes = require('runes');
 
// Standard String.split 
'♥️'.split('') => ['♥', '️']
'Emoji 🤖'.split('') => ['E', 'm', 'o', 'j', 'i', ' ', '�', '�']
'👩‍👩‍👧‍👦'.split('') => ['�', '�', '‍', '�', '�', '‍', '�', '�', '‍', '�', '�']
 
// ES6 string iterator 
[...'♥️'] => [ '♥', '️' ]
[...'Emoji 🤖'] => [ 'E', 'm', 'o', 'j', 'i', ' ', '🤖' ]
[...'👩‍👩‍👧‍👦'] => [ '👩', '', '👩', '', '👧', '', '👦' ]
 
// Runes 
runes('♥️') => ['♥️']
runes('Emoji 🤖') => ['E', 'm', 'o', 'j', 'i', ' ', '🤖']
runes('👩‍👩‍👧‍👦') => ['👩‍👩‍👧‍👦']
// Separate example: substrings
const runes = require('runes');
 
// String.substring 
'👨‍👨‍👧‍👧a'.substring(1) => '�‍👨‍👧‍👧a'
 
// Runes 
runes.substring('👨‍👨‍👧‍👧a', 1) => 'a'

runes – Unicode-aware JS string splitting with full Emoji support →

Related: The presentation JavaScript ❤️ Unicode and the accompanying blog post JavaScript has a Unicode Problem by Mathias Bynens are pure gold when it comes to JavaScript and Unicode 😉


Phishing with Unicode Domains

When visiting a domain name containing a Unicode character that visually resembles an ASCII character, your browser will transform the Unicode characters to Punycode in the address bar to prevent homograph attacks.

For example: the Cyrillic а (codepoint U+0430) looks just like the Latin a (codepoint U+0061). When visiting brаm.us (with the Cyrillic а in place of the Latin a), your browser will transform the URL to xn--brm-7cd.us.
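You can reproduce that transformation with the URL parser that ships in browsers and Node. A minimal sketch, assuming a runtime with the standard URL class; the Cyrillic а is written as an escape so it can't be confused with its Latin twin:

```javascript
// "bram.us" with the Latin a replaced by the Cyrillic а (U+0430).
const lookalike = 'br\u0430m.us';

// The URL parser converts internationalized hostnames to Punycode.
const hostname = new URL('https://' + lookalike).hostname;

console.log(hostname);
console.log(hostname === 'bram.us'); // the lookalike is a different domain
```

The logged hostname should be the same xn-- form quoted above.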

Turns out this is not always the case though:

Chrome’s (and Firefox’s) homograph protection mechanism unfortunately fails if every characters is replaced with a similar character from a single foreign language. The domain “аррӏе.com”, registered as “xn--80ak6aa92e.com”, bypasses the filter by only using Cyrillic characters.
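To see why a mixed-script check lets this one through, compare the scripts used in both names. This is only an illustration and not how Chrome or Firefox implement their filter; scriptsIn is a hypothetical helper, it only knows about three scripts, and it needs an engine with RegExp property escapes.

```javascript
// Hypothetical helper: list which of a few known scripts occur in a label.
function scriptsIn(label) {
  const scripts = {
    Latin: /\p{Script=Latin}/u,
    Cyrillic: /\p{Script=Cyrillic}/u,
    Greek: /\p{Script=Greek}/u,
  };
  return Object.keys(scripts).filter((name) => scripts[name].test(label));
}

const mixed = 'br\u0430m'; // Latin b, r, m plus the Cyrillic а
const wholeScript = '\u0430\u0440\u0440\u04CF\u0435'; // the all-Cyrillic lookalike

console.log(scriptsIn(mixed));
console.log(scriptsIn(wholeScript));
```

The first label mixes two scripts, which is easy to flag. The second one uses a single script throughout, so there is nothing mixed to detect.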

TIP: Whenever you’re in doubt when receiving a mail from a “well known party” containing a link, I recommend manually typing the URL into the address bar.

Phishing with Unicode Domains →

(via Jeremy)

Sidenote: Worth digging up is this tweet from 2010 by my pal Manuel: