Saturday, November 9, 2024

JavaScript: Relationship Between Character, Character Code, and Code Unit

In JavaScript, understanding the relationship between character, character code, and code unit is essential for working with strings, especially for handling Unicode text. Here's a breakdown of each:

Character:

  • A character is a single textual symbol, like 'A', 'B', 'あ', or an emoji like '😊'.
  • Characters are abstract representations of text. In Unicode, each character has a unique code point, which is an abstract number that represents that character.

Character Code (Code Point):

  • A character code, or code point, is the unique number assigned to each character in the Unicode standard. For example, the character 'A' has a Unicode code point of U+0041, and the emoji '😊' has a code point of U+1F60A.
  • By convention, Unicode code points are written in hexadecimal with a U+ prefix. That notation belongs to Unicode, not to JavaScript syntax. In JavaScript source you can write a code point with the \u{...} escape (for example '\u{1F60A}'), build a string from a numeric code point with String.fromCodePoint, and read a code point back with String.prototype.codePointAt.
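To make the code point functions concrete, here is a minimal sketch using the two examples from the text ('A' and the smiley). The helper name toUPlus is hypothetical and only exists to print a number in the conventional U+XXXX form.

```javascript
// 'A' is U+0041 and the smiley is U+1F60A (values taken from the text above).
const letterA = String.fromCodePoint(0x41);
const smileyFromNumber = String.fromCodePoint(0x1F60A);
const smileyFromEscape = '\u{1F60A}'; // ES2015 code point escape

const codePointOfA = letterA.codePointAt(0);               // 65
const codePointOfSmiley = smileyFromNumber.codePointAt(0); // 128522

// Hypothetical helper: format a code point in U+XXXX notation.
function toUPlus(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(letterA, toUPlus(codePointOfA));               // A U+0041
console.log(smileyFromEscape, toUPlus(codePointOfSmiley)); // 😊 U+1F60A
```

Both ways of writing the smiley produce the same string, so smileyFromNumber === smileyFromEscape is true.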

Code Unit:

  • A code unit is a 16-bit value, the unit that JavaScript strings are built from. JavaScript strings use UTF-16 encoding, which means each code point is encoded as one or two code units.
  • Characters in the Basic Multilingual Plane (BMP), with code points from U+0000 to U+FFFF, are represented by a single 16-bit code unit.
  • Characters outside the BMP, such as many emoji and rare characters (code points from U+10000 to U+10FFFF), require two code units in UTF-16, called a surrogate pair. For example, the emoji '😊' (U+1F60A) is represented by the surrogate pair 0xD83D and 0xDE0A.
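The surrogate pair for a supplementary code point can be computed by hand with the standard UTF-16 arithmetic. The sketch below does that for U+1F60A; the helper name toSurrogatePair is hypothetical and is not a built-in.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 surrogate pair [high, low].
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a code point from U+10000 to U+10FFFF');
  }
  const offset = codePoint - 0x10000;    // a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F60A);
console.log(high.toString(16), low.toString(16)); // d83d de0a

// The two code units together form the same string as the code point.
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1F60A)); // true

// length counts code units, not characters.
console.log('A'.length);         // 1
console.log('\u{1F60A}'.length); // 2
```

In everyday code you rarely need this arithmetic, since String.fromCodePoint does it for you, but it shows where 0xD83D and 0xDE0A come from.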

Summary of Relationships:

  • A character is an abstract symbol.
  • A character code (code point) is a unique number assigned to each character.
  • A code unit is the actual 16-bit data chunk in JavaScript's UTF-16 encoding. Each code unit represents either a full character (for BMP characters) or half of a surrogate pair (for non-BMP characters).

Practical Example:

// Character: 😊
let smiley = '😊';

// Code point of 😊
console.log(smiley.codePointAt(0).toString(16)); // "1f60a"

// Code units of 😊 in UTF-16 (surrogate pair)
console.log(smiley.charCodeAt(0).toString(16)); // "d83d"
console.log(smiley.charCodeAt(1).toString(16)); // "de0a"

Here:

The code point for '😊' is U+1F60A.
In JavaScript's UTF-16 encoding, this character is represented by the code units 0xD83D and 0xDE0A.
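
The same distinction shows up when you walk through a string. As a small sketch (the variable names are just for illustration): index-based access with charCodeAt sees code units, while the string iterator used by for...of sees code points.

```javascript
const text = 'A\u{1F60A}'; // 'A' followed by the smiley U+1F60A

// Indexing and length work on code units.
const codeUnits = [];
for (let i = 0; i < text.length; i++) {
  codeUnits.push(text.charCodeAt(i).toString(16));
}
console.log(codeUnits); // [ '41', 'd83d', 'de0a' ]

// The string iterator (for...of, spread) walks code points.
const codePoints = [];
for (const ch of text) {
  codePoints.push(ch.codePointAt(0).toString(16));
}
console.log(codePoints); // [ '41', '1f60a' ]
```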

Recap:

In JavaScript, strings are represented in UTF-16, meaning that:
  • Code Points (each representing a unique character) are encoded using one or two Code Units.
  • Basic Multilingual Plane (BMP) characters (code points from U+0000 to U+FFFF) fit within a single 16-bit code unit.
  • Supplementary characters (code points from U+10000 to U+10FFFF) require two code units because they lie outside the BMP and need a surrogate pair.
So the relationship is:
  • Character ➔ Represented by a Code Point in Unicode.
  • Code Point ➔ Encoded by one or two Code Units in JavaScript UTF-16.
  • Code Units ➔ Contain the actual 16-bit data used to represent the character in memory.
This setup allows JavaScript to represent all Unicode characters, though handling surrogate pairs correctly is necessary for characters outside the BMP.
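
As an illustration of what "handling surrogate pairs correctly" can look like, here is a minimal sketch of counting and reversing a string by code point instead of by code unit. The helper names countCodePoints and reverseByCodePoint are hypothetical.

```javascript
// Hypothetical helpers that work on code points rather than code units.
function countCodePoints(str) {
  return [...str].length; // spread uses the code point iterator
}

function reverseByCodePoint(str) {
  return [...str].reverse().join('');
}

const greeting = 'hi\u{1F60A}'; // "hi" followed by the smiley

console.log(greeting.length);           // 4 (code units)
console.log(countCodePoints(greeting)); // 3 (code points)

// Reversing code unit by code unit swaps the two surrogates and breaks the pair.
const naive = greeting.split('').reverse().join('');
console.log(naive === '\uDE0A\uD83Dih'); // true

// Reversing by code point keeps the pair intact.
const safe = reverseByCodePoint(greeting);
console.log(safe === '\u{1F60A}ih'); // true
```

Note that a code point is still not always what a reader perceives as one character; some symbols are built from several code points, and this sketch does not try to handle that.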
