We already know a similar thing -- square brackets. They allow choosing between multiple characters, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
22
22
23
-
Alternation works not on a character level, but on expression level. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
23
+
Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
24
24
25
25
For instance:
26
26
27
27
- `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
28
-
-`pattern:gra|ey` means "gra" or "ey".
28
+
- `pattern:gra|ey` means `match:gra` or `match:ey`.
29
29
30
30
To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
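As a quick illustration of the grouped and ungrouped forms above, here is a small sketch. The sample string and the variable names are made up for the example:

```js
let str = "gray grey gra ey";

// Parentheses limit the alternation to a|e, same as gr[ae]y
let grouped = str.match(/gr(a|e)y/g);
let brackets = str.match(/gr[ae]y/g);

// Without parentheses the alternatives are the whole "gra" and the whole "ey"
let ungrouped = str.match(/gra|ey/g);

console.log(grouped);   // ["gray", "grey"]
console.log(brackets);  // ["gray", "grey"]
console.log(ungrouped); // ["gra", "ey", "gra", "ey"]
```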
31
31
32
32
## Regexp for time
33
33
34
-
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time.
34
+
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (25 hours and 99 minutes match the pattern, but they shouldn't).
35
35
36
36
How can we make a better one?
37
37
38
-
We can apply more careful matching:
38
+
We can apply more careful matching. First, the hours:
39
39
40
-
-The first digit must be `0` or `1` followed by any digit.
41
-
- Or`2` followed by `pattern:[0-3]`
40
+
- If the first digit is `0` or `1`, then the next digit can be anything.
41
+
- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`.
42
42
43
43
As a regexp: `pattern:[01]\d|2[0-3]`.
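To check the hours pattern on its own, one option is to anchor it so that the whole string must be an hour. The anchors `^` and `$`, the non-capturing group `(?:...)` and the helper name `isHour` are additions for this sketch, not part of the pattern above:

```js
// ^...$ force the whole string to match; (?:...) keeps | inside the group
let hours = /^(?:[01]\d|2[0-3])$/;

// Hypothetical helper, used only for this illustration
function isHour(str) {
  return hours.test(str);
}

console.log(isHour("00")); // true
console.log(isHour("19")); // true
console.log(isHour("23")); // true
console.log(isHour("24")); // false
console.log(isHour("7"));  // false, two digits are required
```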
44
44
45
-
Then we can add a colon and the minutes part.
46
-
47
-
The minutes must be from `0` to `59`, in the regexp language that means the first digit `pattern:[0-5]` followed by any other digit `\d`.
45
+
Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
48
46
49
47
Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
50
48
51
-
We're almost done, but there's a problem. The alternation `|` is between the `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. That's wrong, because it will match either the left or the right pattern:
52
-
53
-
54
-
```js run
55
-
let reg =/[01]\d|2[0-3]:[0-5]\d/g;
56
-
57
-
alert("12".match(reg)); // 12 (matched [01]\d)
58
-
```
59
-
60
-
That's rather obvious, but still an often mistake when starting to work with regular expressions.
49
+
We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
61
50
62
-
We need to add parentheses to apply alternation exactly to hours:`[01]\d` OR `2[0-3]`.
51
+
That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions.
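A small sketch of the difference, assuming the fix is to wrap the hours alternation in parentheses. The sample string and the variable names are made up for the example:

```js
let str = "12:30 23:59 25:99";

// Without parentheses: "[01]\d" alone OR "2[0-3]:[0-5]\d"
let wrong = /[01]\d|2[0-3]:[0-5]\d/g;

// With parentheses: the alternation covers only the hours
let right = /([01]\d|2[0-3]):[0-5]\d/g;

let wrongMatches = str.match(wrong);
let rightMatches = str.match(right);

console.log(wrongMatches); // ["12", "23:59"]
console.log(rightMatches); // ["12:30", "23:59"]
```

In the first result `12` is matched without its minutes, because the left alternative `[01]\d` is a complete pattern by itself.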
5-regular-expressions/15-regexp-infinite-backtracking-problem/article.md (16 additions & 15 deletions)
@@ -10,7 +10,7 @@ That may even be a vulnerability. For instance, if JavaScript is on the server,
10
10
11
11
So the problem is definitely worth dealing with.
12
12
13
-
## Example
13
+
## Introduction
14
14
15
15
The plan will be like this:
16
16
@@ -24,23 +24,22 @@ We want to find all tags, with or without attributes -- like `subject:<a href=".
24
24
25
25
In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
26
26
27
-
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.
27
+
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if it is inside an attribute.
28
28
29
29
```js run
30
30
// the match doesn't reach the end of the tag - wrong!
To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
37
35
38
-
In the regexp language that is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`:
36
+
1. For the `tag` name: `pattern:\w+`,
37
+
2. For the `key` name: `pattern:\w+`,
38
+
3. And the `value` can be a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
39
39
40
-
1.`pattern:<\w+` -- is the tag start,
41
-
2.`pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
40
+
If we substitute these into the pattern above, the full regexp is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`.
42
41
43
-
That doesn't yet support few details of HTML grammar, for instance strings in 'single' quotes, but they can be added later, so that's somewhat close to real life. For now we want the regexp to be simple.
42
+
That doesn't yet support all details of HTML, for instance strings in 'single' quotes. They could be added easily, but let's keep the regexp simple for now.
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
94
94
95
-
In regular expressions these can be set by `\p{…}`.
95
+
In regular expressions these can be matched with `\p{…}`. The regexp must have the flag `u` for that to work.
96
96
97
97
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property.
98
98
@@ -121,13 +121,24 @@ You could also explore properties at [Character Property Index](http://unicode.o
121
121
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
122
122
```
123
123
124
-
There are also other derived categories, like `Alphabetic` (`Alpha`), that includes Letters `L`, plus letter numbers `Nl`, plus some other symbols `Other_Alphabetic` (`OAltpa`).
124
+
There are also other derived categories, like:
125
+
- `Alphabetic` (`Alpha`) includes letters `L`, plus letter numbers `Nl` (e.g. the Roman numeral Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
126
+
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
127
+
- ...Unicode is a big beast, it includes a lot of properties.
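To make the difference between `L` and the derived `Alphabetic` category concrete, here is a minimal sketch. It assumes an engine that supports Unicode property escapes (the flag `u` is required); the sample character is the Roman numeral Ⅻ from the list above, and the variable name is made up for the example:

```js
let roman = "\u216B"; // Ⅻ, a "letter number" (Nl)

console.log(/\p{L}/u.test(roman));          // false, Ⅻ is not in the Letter category
console.log(/\p{Nl}/u.test(roman));         // true
console.log(/\p{Alphabetic}/u.test(roman)); // true
console.log(/\p{Alpha}/u.test(roman));      // true, Alpha is the short alias
```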
125
128
126
-
Unicode is a big beast, it includes a lot of properties.
129
+
For instance, let's look for a 6-digit hex number:
127
130
128
-
One of properties is `Script` (`sc`), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
131
+
```js run
132
+
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
133
+
134
+
alert("color: #123ABC".match(reg)); // 123ABC
135
+
```
136
+
137
+
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese), etc. The [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
138
+
139
+
To search for certain scripts, we should supply `Script=<value>`, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
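For example, a sketch that picks out runs of Cyrillic letters and Han characters from a mixed string. The sample text and the variable names are made up for the illustration, and the flag `u` is required:

```js
let str = "Hello Привет 你好";

// + collects consecutive characters of the same script into one match
let cyrillic = str.match(/\p{sc=Cyrillic}+/gu);
let han = str.match(/\p{sc=Han}+/gu);

console.log(cyrillic); // ["Привет"]
console.log(han);      // ["你好"]
```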
129
140
130
-
The `Script` property needs a value, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`.
141
+
### Universal \w
131
142
132
143
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with Unicode-aware regexps, e.g. Perl.