
What is Unicode or UTF-8

Cristi Minica

September 16th, 2020

Intro

Unicode is a universal character set. This means that the standard defines, in one place, all the characters needed for writing the majority of living languages, along with punctuation and other symbols such as emoji 😇.

[Image: Coding Horror tweet]

In the past, a common character set was ASCII, which was very limited: it uses only 7 bits of information per character, so it can represent just 128 characters (even 8-bit extensions top out at 256). Not enough for a World Wide Web.

A number associated with a character is called a code point. The actual image used to display that character is called a glyph. 'A' has the code point 65 in ASCII, but its representation on the screen differs depending on the font used.
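A minimal sketch of the code point side of this, using standard JavaScript string methods:

 console.log('A'.codePointAt(0)); // 65 — the code point of 'A'
 console.log(String.fromCodePoint(65)); // 'A' — from code point back to character
 console.log(String.fromCodePoint(0x1f607)); // '😇'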

[Image: ASCII code point table]

Unicode character set

The first 65,536 code point positions in the Unicode character set constitute the Basic Multilingual Plane (BMP). It contains only a small part of the full set, but that part covers the most commonly used characters. The remaining, roughly one million, code point positions are referred to as supplementary characters.
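In JavaScript, codePointAt makes the split easy to see: code points up to U+FFFF are in the BMP, anything above is supplementary. A minimal sketch:

 console.log('€'.codePointAt(0).toString(16)); // '20ac' — inside the BMP
 console.log('😇'.codePointAt(0).toString(16)); // '1f607' — a supplementary character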

Unicode has three different encoding forms: UTF-8, UTF-16, and UTF-32. UTF-8 uses one byte for the ASCII set, two bytes for several more alphabets, three bytes for the rest of the BMP, and four bytes for supplementary characters.
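The variable width is easy to observe in Node.js, where Buffer.byteLength counts the bytes a string needs in a given encoding. A minimal sketch:

 console.log(Buffer.byteLength('A', 'utf8')); // 1 — ASCII range
 console.log(Buffer.byteLength('é', 'utf8')); // 2 — one of the 'several more alphabets'
 console.log(Buffer.byteLength('€', 'utf8')); // 3 — rest of the BMP
 console.log(Buffer.byteLength('😇', 'utf8')); // 4 — supplementary character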

UTF-16 uses two bytes for the BMP and four bytes for supplementary characters.

UTF-32 uses four bytes for every code point, in the BMP and supplementary planes alike.
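The UTF-16 sizes can be checked with the same Buffer.byteLength call; Node.js has no built-in UTF-32 encoding, so that size is computed by hand in this sketch:

 console.log(Buffer.byteLength('A', 'utf16le')); // 2 — a BMP character
 console.log(Buffer.byteLength('😇', 'utf16le')); // 4 — a supplementary character
 // UTF-32: every code point takes exactly four bytes
 console.log([...'A😇'].length * 4); // 8 — two code points, four bytes each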

Endianness refers to the ordering of bytes in a word. A big-endian system stores the most significant byte of the word at the smallest memory address; a little-endian system stores the least significant byte there. This matters here because UTF-16 and UTF-32 exist in both variants (UTF-16LE, for example, appears in the code below).
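A minimal sketch with Node.js buffers, storing the same 16-bit word in both byte orders:

 const word = Buffer.alloc(2);
 word.writeUInt16BE(0x6162); // big-endian: most significant byte first
 console.log(word); // <Buffer 61 62>
 word.writeUInt16LE(0x6162); // little-endian: least significant byte first
 console.log(word); // <Buffer 62 61>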

When in doubt, use UTF-8. It is simpler this way.

In order to know how to interpret a file, a program must know the file's encoding. Character encoding detection, guessing the encoding from the bytes themselves, is not a reliable process.
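To see why guessing is fragile, decode the same bytes under two different assumed encodings (a minimal sketch using Node.js buffers):

 const bytes = Buffer.from([0xc3, 0xa9]); // the UTF-8 encoding of 'é'
 console.log(bytes.toString('utf8')); // 'é' — the right guess
 console.log(bytes.toString('latin1')); // 'Ã©' — the wrong guess, classic mojibake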

JavaScript source files can have any kind of encoding, but the code will be converted to UTF-16 internally.
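Because strings are UTF-16 internally, a supplementary character occupies two 16-bit code units, a surrogate pair. A minimal sketch:

 console.log('😇'.length); // 2 — two UTF-16 code units, not two characters
 console.log('😇'.charCodeAt(0).toString(16)); // 'd83d' — high surrogate
 console.log('😇'.charCodeAt(1).toString(16)); // 'de07' — low surrogate

For data structures such as Node.js buffers, UTF-8 is usually the default encoding. The example below writes the same string into a 12-byte buffer with three different encodings: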

 const bt = Buffer.alloc(12); // 12 zero-filled bytes

 bt.write('abcdef'); // encoding defaults to 'utf8'
 console.log(bt); // <Buffer 61 62 63 64 65 66 00 00 00 00 00 00>
 console.log(bt.toString()); // 'abcdef' plus six NUL characters

 bt.write('abcdef', 0, 6, 'ascii'); // same bytes: ASCII is a subset of UTF-8
 console.log(bt);
 console.log(bt.toString());

 bt.write('abcdef', 0, 12, 'utf16le'); // two bytes per character fills the buffer
 console.log(bt); // <Buffer 61 00 62 00 63 00 64 00 65 00 66 00>
 console.log(bt.toString()); // decoded as UTF-8 by default, NUL bytes included
 console.log(bt.toString('utf8')); // explicit 'utf8' gives the same result
 console.log(bt.toJSON()); // { type: 'Buffer', data: [ 97, 0, 98, 0, ... ] }

For more details, you can see Nick Gammon's take on the subject.