142 lines
4.1 KiB
Typst
142 lines
4.1 KiB
Typst
#import "@local/handout:0.1.0": *
|
|
|
|
#show: doc => handout(
|
|
doc,
|
|
quarter: link(
|
|
"https://betalupi.com/handouts",
|
|
"betalupi.com/handouts",
|
|
),
|
|
|
|
title: [The Regex Warm-Up],
|
|
by: "Mark",
|
|
)
|
|
|
|
|
|
Last time, we discussed Deterministic Finite Automata. One interesting application of these mathematical objects is found in computer science: Regular Expressions. \
|
|
This is often abbreviated "regex," which is pronounced like "gif."
|
|
|
|
#v(2mm)
|
|
|
|
Regex is a language used to specify patterns in a string. You can think of it as a concise way to define a DFA, using text instead of a huge graph. \
|
|
|
|
Often enough, a clever regex pattern can do the work of a few hundred lines of code.
|
|
|
|
#v(2mm)
|
|
|
|
Like the DFAs we've studied, a regex pattern _accepts_ or _rejects_ a string. However, we don't usually use this terminology with regex, and instead say that a string _matches_ or _doesn't match_ a pattern.
|
|
|
|
#v(5mm)
|
|
|
|
Regex strings consist of characters, quantifiers, sets, and groups.
|
|
|
|
#v(5mm)
|
|
|
|
|
|
|
|
*Quantifiers* \
|
|
Quantifiers specify how many of a character to match. \
|
|
There are four of these: `+`, `*`, `?`, and `{ }`.
|
|
|
|
#v(4mm)
|
|
|
|
`+` means "match one or more of the preceding token" \
|
|
`*` means "match zero or more of the preceding token"
|
|
|
|
For example, the pattern `ca+t` will match the following strings:
|
|
- `cat`
|
|
- `caat`
|
|
- `caaaaaaaat`
|
|
`ca+t` will *not* match the string `ct`. \
|
|
The pattern `ca*t` will match all the strings above, including `ct`.
|
|
|
|
|
|
#v(4mm)
|
|
|
|
|
|
`?` means "match one or none of the preceding token" \
|
|
The pattern `linea?r` will match only `linear` and `liner`.
|
|
|
|
#v(4mm)
|
|
|
|
Brackets `{min, max}` are the most flexible quantifier. \
|
|
They specify exactly how many tokens to match: \
|
|
`ab{2}a` will match only `abba`. \
|
|
`ab{1,3}a` will match only `aba`, `abba`, and `abbba`. \
|
|
`ab{2,}a` will match any `ab...ba` with at least two `b`s. // spell:disable-line
|
|
|
|
#problem()
|
|
Write the patterns `a*` and `a+` using only `{ }`.
|
|
#v(1fr)
|
|
|
|
#problem()
|
|
Draw a DFA equivalent to the regex pattern `01*0`.
|
|
#v(1fr)
|
|
|
|
#pagebreak()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
*Characters, Sets, and Groups* \
|
|
In the previous section, we saw how we can specify characters literally: \
|
|
`a+` means "one or more `a` characters" \
|
|
There are, of course, other ways we can specify characters.
|
|
|
|
#v(4mm)
|
|
|
|
The first such way is the _set_, denoted `[ ]`. A set can pretend to be any character inside it. \
|
|
For example, `m[aoy]th` will match `math`, `moth`, or `myth`. \
|
|
`a[01]+b` will match `a0b`, `a111b`, `a1100110b`, and any other similar string. \
|
|
|
|
#v(4mm)
|
|
|
|
We can negate a set with a `^`. \
|
|
`[^abc]` will match any single character except `a`, `b`, or `c`, including symbols and spaces.
|
|
|
|
#v(4mm)
|
|
|
|
If we want to keep characters together, we can use the _group_, denoted `( )`. \
|
|
|
|
Groups work exactly as you'd expect, representing an atomic#footnote([In other words, "unbreakable"]) group of characters. \
|
|
`a(01)+b` will match `a01b` and `a010101b`, but will *not* match `a0b`, `a1b`, or `a1100110b`.
|
|
|
|
#problem()
|
|
You are now familiar with most of the tools regex has to offer. \
|
|
Write patterns that match the following strings:
|
|
|
|
- An ISO-8601 date, like `2022-10-29`. \
|
|
#hint([Invalid dates like `2022-13-29` should also be matched.])
|
|
|
|
- An email address. \
|
|
#hint([Don't forget about subdomains, like `math.ucla.edu`.])
|
|
|
|
- A UCLA room number, like `MS 5118` or `Kinsey 1220B`.
|
|
|
|
- Any ISBN-10 of the form `0-316-00395-7`. \
|
|
#hint([Remember that the check digit may be an `X`. Dashes are optional.])
|
|
|
|
- A word of even length. \
|
|
#hint([
|
|
The set `[A-z]` contains every english letter, capitalized and lowercase. \
|
|
`[a-z]` will only match lowercase letters.
|
|
])
|
|
|
|
- A word with exactly 3 vowels. \
|
|
#hint([
|
|
The special token `\w` will match any word character. \
|
|
It is equivalent to `[A-z0-9_]`. `_` represents a literal underscore.
|
|
])
|
|
|
|
- A word that has even length and exactly 3 vowels.
|
|
|
|
- A sentence that does not start with a capital letter.
|
|
#v(1fr)
|
|
|
|
#problem()
|
|
If you'd like to know more, check out `https://regexr.com`.
|
|
It offers an interactive regex prompt,
|
|
as well as a cheatsheet that explains every other regex token there is. \
|
|
You can find a nice set of challenges at `https://alf.nu/RegexGolf`.
|