Commit 7888439

regexp draft
1 parent 65184ed commit 7888439

4 files changed

+42
-41
lines changed

5-regular-expressions/11-regexp-alternation/article.md

Lines changed: 9 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -20,46 +20,35 @@ alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
2020

2121
We already know a similar thing -- square brackets. They allow to choose between multiple character, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
2222

23-
Alternation works not on a character level, but on expression level. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
23+
Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
2424

2525
For instance:
2626

2727
- `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
28-
- `pattern:gra|ey` means "gra" or "ey".
28+
- `pattern:gra|ey` means `match:gra` or `match:ey`.
2929

3030
To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
3131

3232
## Regexp for time
3333

34-
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time.
34+
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (hour 25 and 99 minutes match the pattern, but that time is invalid).
3535

3636
How can we make a better one?
3737

38-
We can apply more careful matching:
38+
We can apply more careful matching. First, the hours:
3939

40-
- The first digit must be `0` or `1` followed by any digit.
41-
- Or `2` followed by `pattern:[0-3]`
40+
- If the first digit is `0` or `1`, then the next digit can be anything.
41+
- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`.
4242

4343
As a regexp: `pattern:[01]\d|2[0-3]`.
4444

45-
Then we can add a colon and the minutes part.
46-
47-
The minutes must be from `0` to `59`, in the regexp language that means the first digit `pattern:[0-5]` followed by any other digit `\d`.
45+
Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
4846

4947
Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
5048

51-
We're almost done, but there's a problem. The alternation `|` is between the `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. That's wrong, because it will match either the left or the right pattern:
52-
53-
54-
```js run
55-
let reg = /[01]\d|2[0-3]:[0-5]\d/g;
56-
57-
alert("12".match(reg)); // 12 (matched [01]\d)
58-
```
59-
60-
That's rather obvious, but still an often mistake when starting to work with regular expressions.
49+
We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
6150

62-
We need to add parentheses to apply alternation exactly to hours: `[01]\d` OR `2[0-3]`.
51+
That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions.
6352

6453
The correct variant:
6554
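
The hunk is cut before the corrected pattern. As a minimal sketch of the precedence problem described above, assuming the fix is to wrap the hours alternation in parentheses as the text suggests (the sample string and the names `loose` and `strict` are illustrative, not taken from the article):

```js
// Sample input (illustrative): three valid times and one invalid one.
let str = "00:00 10:10 23:59 25:99";

// Without parentheses, | splits the whole pattern:
// either [01]\d alone, or 2[0-3]:[0-5]\d.
let loose = /[01]\d|2[0-3]:[0-5]\d/g;

// With parentheses, | applies only to the hours part.
let strict = /([01]\d|2[0-3]):[0-5]\d/g;

console.log( str.match(loose) );  // 00, 00, 10, 10, 23:59
console.log( str.match(strict) ); // 00:00, 10:10, 23:59
```

The first pattern finds bare hours such as `00` because the left alternative is complete on its own; the second one only accepts a full `hh:mm`.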

5-regular-expressions/12-regexp-anchors/article.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ The pattern `pattern:^Mary` means: "the string start and then Mary".
1818

1919
Now let's test whether the text ends with an email.
2020

21-
To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`. It's not perfect, but mostly works.
21+
To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
2222

2323
To test whether the string ends with the email, let's add `pattern:$` to the pattern:
2424
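
The hunk ends before the example. A minimal sketch of the anchored pattern, assuming `pattern:$` is simply appended to the email regexp from the text (the test strings are made up for illustration):

```js
// The email pattern from the text, with $ appended to require the end of the string.
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/;

console.log( reg.test("write to my@mail.com") );      // true
console.log( reg.test("my@mail.com is my address") ); // false
```

In the second string the email is present, but it is not at the end, so the anchored pattern does not match.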

5-regular-expressions/15-regexp-infinite-backtracking-problem/article.md

Lines changed: 16 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ That may even be a vulnerability. For instance, if JavaScript is on the server,
1010

1111
So the problem is definitely worth to deal with.
1212

13-
## Example
13+
## Introduction
1414

1515
The plan will be like this:
1616

@@ -24,23 +24,22 @@ We want to find all tags, with or without attributes -- like `subject:<a href=".
2424

2525
In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
2626

27-
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.
27+
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute.
2828

2929
```js run
3030
// the match doesn't reach the end of the tag - wrong!
3131
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
3232
```
3333

34-
We need the whole tag.
35-
3634
To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
3735

38-
In the regexp language that is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`:
36+
1. For the `tag` name: `pattern:\w+`,
37+
2. For the `key` name: `pattern:\w+`,
38+
3. And the `value` can be a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
3939

40-
1. `pattern:<\w+` -- is the tag start,
41-
2. `pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
40+
If we substitute these into the pattern above, the full regexp is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`.
4241

43-
That doesn't yet support few details of HTML grammar, for instance strings in 'single' quotes, but they can be added later, so that's somewhat close to real life. For now we want the regexp to be simple.
42+
That doesn't yet support all details of HTML, for instance strings in 'single' quotes. They could be added easily, but let's keep the regexp simple for now.
4443

4544
Let's try it in action:
4645

@@ -54,9 +53,11 @@ alert( str.match(reg) ); // <a test="<>" href="#">, <b>
5453

5554
Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
5655

57-
Now let's see the problem.
56+
Now that we've got a seemingly working solution, let's get to the infinite backtracking itself.
57+
58+
## Infinite backtracking
5859

59-
If you run the example below, it may hang the browser (or whatever JavaScript engine runs):
60+
If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
6061

6162
```js run
6263
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;
@@ -65,18 +66,18 @@ let str = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
6566
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
6667

6768
*!*
68-
// The search will take a long long time
69+
// The search will take a long, long time
6970
alert( str.match(reg) );
7071
*/!*
7172
```
7273
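
To observe the slowdown without hanging the host, the same pattern can be timed on shorter inputs. This is only a sketch: the attribute counts `6`, `12`, `18` are arbitrary choices, the `g` flag is dropped for simplicity, and the measured times depend on the engine and machine, so no timings are shown.

```js
// Same pattern as above, without the g flag.
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/;

for (let n of [6, 12, 18]) {
  // A tag with n attributes and no closing ">", so the match must fail.
  let str = "<tag" + " a=b".repeat(n);

  let start = Date.now();
  let result = str.match(reg); // null: there is no ">" to finish the match
  let elapsed = Date.now() - start;

  console.log(`${n} attributes: result=${result}, ${elapsed}ms`);
}
```

On a backtracking engine the elapsed time is expected to grow quickly as `n` increases, so larger values should be tried with care.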

73-
Some regexp engines can handle that search, but most of them don't.
74+
Some regexp engines can handle that search, but most of them can't.
7475

75-
What's the matter? Why a simple regular expression on such a small string "hangs"?
76+
What's the matter? Why does a simple regular expression "hang" on such a small string?
7677

77-
Let's simplify the situation by removing the tag and quoted strings.
78+
Let's simplify the situation by looking only for attributes.
7879

79-
Here we look only for attributes:
80+
Here we removed the tag and quoted strings from the regexp.
8081

8182
```js run
8283
// only search for space-delimited attributes

5-regular-expressions/20-regexp-unicode/article.md

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -92,7 +92,7 @@ alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
9292

9393
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by Javascript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details.
9494

95-
In regular expressions these can be set by `\p{…}`.
95+
In regular expressions these can be set by `\p{…}`. The flag `'u'` is required.
9696

9797
For instance, `\p{Letter}` denotes a letter in any of language. We can also use `\p{L}`, as `L` is an alias of `Letter`, there are shorter aliases for almost every property.
9898

@@ -121,13 +121,24 @@ You could also explore properties at [Character Property Index](http://unicode.o
121121
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
122122
```
123123

124-
There are also other derived categories, like `Alphabetic` (`Alpha`), that includes Letters `L`, plus letter numbers `Nl`, plus some other symbols `Other_Alphabetic` (`OAltpa`).
124+
There are also other derived categories, like:
125+
- `Alphabetic` (`Alpha`) includes Letters `L`, plus letter numbers `Nl` (e.g. Roman numerals such as Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
126+
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`.
127+
- ...Unicode is a big beast, it includes a lot of properties.
125128

126-
Unicode is a big beast, it includes a lot of properties.
129+
For instance, let's look for a 6-digit hex number:
127130

128-
One of properties is `Script` (`sc`), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
131+
```js run
132+
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
133+
134+
alert("color: #123ABC".match(reg)); // 123ABC
135+
```
136+
137+
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
138+
139+
To search for certain scripts, we should supply `Script=<value>`, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
129140
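
A short sketch of the two script properties mentioned above, run against a mixed-language string (the sample string is illustrative):

```js
let str = "Привет, 你好, hello";

// The flag 'u' is required for \p{…}.
console.log( str.match(/\p{sc=Cyrillic}+/gu) ); // Привет
console.log( str.match(/\p{sc=Han}+/gu) );      // 你好
```

Each pattern picks out only the run of characters that belongs to the given script and skips the Latin word.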

130-
The `Script` property needs a value, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`.
141+
### Universal \w
131142

132143
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
133144
