Commit ed6e52a

Create article.md
1 parent 0ba846e commit ed6e52a

1 file changed

+172
-0
lines changed

@@ -0,0 +1,172 @@
# Unicode, String internals

```warn header="Advanced knowledge"
This section goes deeper into string internals. This knowledge will be useful if you plan to work with emoji, rare mathematical symbols, hieroglyphs, and so on.
```

As we already know, JavaScript strings are based on [Unicode](https://ru.wikipedia.org/wiki/Юникод): each character is represented by a byte sequence of 1-4 bytes, depending on the encoding.

JavaScript allows us to insert a character into a string by specifying its hexadecimal Unicode code with one of these three notations:

- `\xXX`

`XX` must be two hexadecimal digits with a value between `00` and `FF`. Then `\xXX` is the character whose Unicode code is `XX`.

Because the `\xXX` notation supports only two hexadecimal digits, it can be used only for the first 256 Unicode characters.

These first 256 characters include the Latin alphabet, most basic syntax characters, and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).

```js run
alert( "\x7A" ); // z
alert( "\xA9" ); // ©, the copyright symbol
```

- `\uXXXX`

`XXXX` must be exactly 4 hexadecimal digits with a value between `0000` and `FFFF`. Then `\uXXXX` is the character whose Unicode code is `XXXX`.

Characters with Unicode values greater than `U+FFFF` can also be represented with this notation, but in that case we need to use a so-called surrogate pair (we will talk about surrogate pairs later in this chapter).

```js run
alert( "\u00A9" ); // ©, the same as \xA9, using the 4-digit hex notation
alert( "\u044F" ); // я, a Cyrillic alphabet letter
alert( "\u2191" ); // ↑, the arrow up symbol
```

- `\u{X…XXXXXX}`

`X…XXXXXX` must be a hexadecimal value of 1 to 6 digits between `0` and `10FFFF` (the highest code point defined by the Unicode standard). This notation allows us to easily represent all existing Unicode characters.

```js run
alert( "\u{20331}" ); // a rare Chinese character (long Unicode, beyond U+FFFF)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
```
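As a rough illustration of how the three notations relate, the sketch below picks the shortest escape that can express a given character. The helper name `toEscape` is hypothetical (it is not a built-in); it relies only on the standard `codePointAt`, `toString(16)` and `padStart` methods, and it uses `console.log` instead of `alert` so that it also runs outside the browser.

```js
// Hypothetical helper: returns the shortest escape notation
// for the first character (code point) of `ch`.
function toEscape(ch) {
  const code = ch.codePointAt(0);
  const hex = code.toString(16).toUpperCase();

  if (code <= 0xFF) return "\\x" + hex.padStart(2, "0");   // \xXX
  if (code <= 0xFFFF) return "\\u" + hex.padStart(4, "0"); // \uXXXX
  return "\\u{" + hex + "}";                               // \u{X…XXXXXX}
}

console.log( toEscape("z") );         // \x7A
console.log( toEscape("\u044F") );    // \u044F
console.log( toEscape("\u{1F60D}") ); // \u{1F60D}

// All three notations can name the same character:
console.log( "\x7A" === "\u007A" && "\u007A" === "\u{7A}" ); // true
```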

## Surrogate pairs

All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideographic sets (CJK -- from Chinese, Japanese, and Korean writing systems) have a 2-byte representation.

Initially, JavaScript was based on a 16-bit encoding (UCS-2, the predecessor of UTF-16) that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations, and that's not enough for every possible Unicode symbol.

So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte characters called "a surrogate pair".

As a side effect, the length of such symbols is `2`:

```js run
alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
alert( '𩷶'.length ); // 2, a rare Chinese character
```

That's because surrogate pairs did not exist when JavaScript was created, so the language does not process them correctly: `length` counts 2-byte units, not whole symbols.

We actually have a single symbol in each of the strings above, but the `length` property shows a length of `2`.
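One way to count whole symbols rather than 2-byte units is to rely on string iteration, which walks a string by code points. The following is a minimal sketch, not the only approach; the helper name `countSymbols` is made up for this example.

```js
// Hypothetical helper: counts code points instead of 16-bit units.
// Array.from uses the string iterator, which keeps surrogate pairs together.
function countSymbols(str) {
  return Array.from(str).length;
}

console.log( '𝒳'.length );            // 2
console.log( countSymbols('𝒳') );     // 1
console.log( countSymbols('hi 😂') ); // 4
```

Note that this counts code points only: a base letter followed by separate mark characters is still counted as several symbols.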

Getting a symbol can also be tricky, because most language features treat surrogate pairs as two characters.

For example, here we can see two odd characters in the output:

```js run
alert( '𝒳'[0] ); // shows strange symbols...
alert( '𝒳'[1] ); // ...pieces of the surrogate pair
```

Pieces of a surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.

Technically, surrogate pairs are also detectable by their codes: if a character has a code in the interval `0xd800..0xdbff`, then it is the first part of a surrogate pair. The next character (the second part) must have a code in the interval `0xdc00..0xdfff`. These intervals are reserved exclusively for surrogate pairs by the standard.
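To make the two intervals concrete, here is a small sketch of the UTF-16 arithmetic that splits a code above `0xFFFF` into its two halves. The function names are hypothetical and exist only for this illustration.

```js
// Splits a code point above 0xFFFF into its two surrogate halves (UTF-16 rule).
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits    -> 0xd800..0xdbff
  const low  = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> 0xdc00..0xdfff
  return [high, low];
}

// Range checks matching the intervals from the text
const isHighSurrogate = code => code >= 0xD800 && code <= 0xDBFF;
const isLowSurrogate  = code => code >= 0xDC00 && code <= 0xDFFF;

const [high, low] = toSurrogatePair(0x1D4B3); // the code of 𝒳
console.log( high.toString(16), low.toString(16) ); // d835 dcb3
console.log( isHighSurrogate('𝒳'.charCodeAt(0)) );  // true
console.log( isLowSurrogate('𝒳'.charCodeAt(1)) );   // true
```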

So the methods `String.fromCodePoint` and `str.codePointAt` were added to JavaScript to deal with surrogate pairs.

They are essentially the same as [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt), but they treat surrogate pairs correctly.

One can see the difference here:

```js run
// charCodeAt is not surrogate-pair aware, so it gives the code for the 1st part of 𝒳:

alert( '𝒳'.charCodeAt(0).toString(16) ); // d835

// codePointAt is surrogate-pair aware
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair
```

That said, if we read from position 1 (which is rather incorrect here), then they both return only the 2nd part of the pair:

```js run
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3
// meaningless 2nd half of the pair
```
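Because reading from the middle of a pair gives a meaningless half, a loop over `codePointAt` has to step over both positions of a pair at once. A minimal sketch of that idea, with a hypothetical helper named `codePoints`:

```js
// Hypothetical helper: collects the code points of a string,
// stepping over both halves of a surrogate pair at once.
function codePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const code = str.codePointAt(i);
    result.push(code);
    i += code > 0xFFFF ? 2 : 1; // a pair occupies two positions
  }
  return result;
}

const codes = codePoints('a𝒳b');
console.log( codes.map(c => c.toString(16)).join(' ') ); // 61 1d4b3 62

// String.fromCodePoint rebuilds the original string from the full codes
console.log( String.fromCodePoint(...codes) === 'a𝒳b' ); // true

// String.fromCharCode keeps only the lower 16 bits, so it can't do the same
console.log( String.fromCharCode(0x1D4B3) === '𝒳' ); // false
```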

You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.

````warn header="Takeaway: splitting strings at an arbitrary point is dangerous"
We can't just split a string at an arbitrary position, e.g. take `str.slice(0, 4)`, and expect the result to be a valid string:

```js run
alert( 'hi 😂'.slice(0, 4) ); // hi [?]
```

Here we can see a garbage character (the first half of the smile surrogate pair) in the output.

Just be aware of this if you intend to work reliably with surrogate pairs. It may not be a big problem, but at least you should understand what happens.
````
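If a cut position may land inside a surrogate pair, one possible guard is to check the codes on both sides of the cut and step back by one when they form a pair. This is only a sketch under that assumption; `safePrefix` is a made-up name, not a standard method.

```js
// Hypothetical helper: like str.slice(0, end), but never cuts a surrogate
// pair in half. Assumes 0 <= end <= str.length.
function safePrefix(str, end) {
  const last = str.charCodeAt(end - 1); // NaN when out of range
  const next = str.charCodeAt(end);     // NaN when out of range
  const cutsPair =
    last >= 0xD800 && last <= 0xDBFF && // first half just before the cut
    next >= 0xDC00 && next <= 0xDFFF;   // second half just after it
  return str.slice(0, cutsPair ? end - 1 : end);
}

console.log( safePrefix('hi 😂', 4) === 'hi ' );   // true, the emoji is dropped whole
console.log( safePrefix('hi 😂', 5) === 'hi 😂' ); // true, the pair stays intact
```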

## Diacritical marks and normalization

In many languages, there are symbols that are composed of a base character with a mark above or below it.

For instance, the letter `a` can be the base character for these characters: `àáâäãåā`.

The most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.

To support arbitrary compositions, the Unicode standard allows us to use several Unicode characters: the base character followed by one or more "mark" characters that "decorate" it.

For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.

```js run
alert( 'S\u0307' ); // Ṡ
```

If we need an additional mark above the letter (or below it) -- no problem, just add the necessary mark character.

For instance, if we append the "dot below" character (code `\u0323`), then we'll have "S with dots above and below": `Ṩ`.

```js run
alert( 'S\u0307\u0323' ); // Ṩ
```

This provides great flexibility, but also an interesting problem: two characters may look the same visually, but be represented with different Unicode compositions.

For instance:

```js run
let s1 = 'S\u0307\u0323'; // Ṩ, S + dot above + dot below
let s2 = 'S\u0323\u0307'; // Ṩ, S + dot below + dot above

alert( `s1: ${s1}, s2: ${s2}` );

alert( s1 == s2 ); // false though the characters look identical (?!)
```

To solve this, there exists a "Unicode normalization" algorithm that brings each string to a single "normal" form.

It is implemented by [str.normalize()](mdn:js/String/normalize).

```js run
alert( "S\u0307\u0323".normalize() == "S\u0323\u0307".normalize() ); // true
```

It's funny that in our situation `normalize()` actually brings together a sequence of 3 characters into one: `\u1e68` (S with two dots).

```js run
alert( "S\u0307\u0323".normalize().length ); // 1

alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true
```

In reality, this is not always the case. It works here because the symbol `Ṩ` is "common enough", so the Unicode creators included it in the main table and gave it its own code.
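The composed and decomposed variants can be requested explicitly by passing a form name to `normalize`. The sketch below uses the standard `"NFC"` and `"NFD"` forms; the comparison helper `sameText` is a hypothetical name introduced only for this example.

```js
const composed = "\u1e68";                    // Ṩ as a single character
const decomposed = composed.normalize("NFD"); // base letter + two marks

console.log( decomposed.length );                        // 3
console.log( decomposed === "S\u0323\u0307" );           // true, marks in canonical order
console.log( decomposed.normalize("NFC") === composed ); // true

// Hypothetical helper: compares strings by their normalized form
function sameText(a, b) {
  return a.normalize() === b.normalize(); // normalize() defaults to "NFC"
}

console.log( sameText("S\u0307\u0323", "S\u0323\u0307") ); // true
```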

If you want to learn more about normalization rules and variants -- they are described in the appendix of the Unicode standard: [Unicode Normalization Forms](https://www.unicode.org/reports/tr15/), but for most practical purposes the information from this section is enough.
