#import "@local/handout:0.1.0": *

#show: doc => handout(
  doc,
  quarter: link(
    "https://betalupi.com/handouts",
    "betalupi.com/handouts",
  ),
  title: [The Regex Warm-Up],
  by: "Mark",
)

Last time, we discussed Deterministic Finite Automata.
One interesting application of these mathematical objects is found in computer science: Regular Expressions. \
This is often abbreviated "regex," which is pronounced like "gif."

#v(2mm)

Regex is a language used to specify patterns in a string.
You can think of it as a concise way to define a DFA, using text instead of a huge graph. \
Often enough, a clever regex pattern can do the work of a few hundred lines of code.

#v(2mm)

Like the DFAs we've studied, a regex pattern _accepts_ or _rejects_ a string.
However, we don't usually use this terminology with regex, and instead say that a string _matches_ or _doesn't match_ a pattern.

#v(5mm)

Regex patterns consist of characters, quantifiers, sets, and groups.

#v(5mm)

*Quantifiers* \
Quantifiers specify how many copies of the preceding token to match. \
There are four of these: `+`, `*`, `?`, and `{ }`.

#v(4mm)

`+` means "match one or more of the preceding token" \
`*` means "match zero or more of the preceding token"

For example, the pattern `ca+t` will match the following strings:
- `cat`
- `caat`
- `caaaaaaaat`

`ca+t` will *not* match the string `ct`. \
The pattern `ca*t` will match all the strings above, as well as `ct`.

#v(4mm)

`?` means "match one or none of the preceding token" \
The pattern `linea?r` will match only `linear` and `liner`.

#v(4mm)

Braces `{min,max}` are the most flexible quantifier. \
They specify exactly how many tokens to match: \
`ab{2}a` will match only `abba`. \
`ab{1,3}a` will match only `aba`, `abba`, and `abbba`. \
`ab{2,}a` will match any `ab...ba` with at least two `b`s. // spell:disable-line

#problem()
Write the patterns `a*` and `a+` using only `{ }`.

#v(1fr)

#problem()
Draw a DFA equivalent to the regex pattern `01*0`.
#v(1fr)
#pagebreak()

*Characters, Sets, and Groups* \
In the previous section, we saw how we can specify characters literally: \
`a+` means "one or more `a` characters" \
There are, of course, other ways we can specify characters.

#v(4mm)

The first such way is the _set_, denoted `[ ]`. A set can stand in for any one of the characters inside it. \
For example, `m[aoy]th` will match `math`, `moth`, or `myth`. \
`a[01]+b` will match `a0b`, `a111b`, `a1100110b`, and any other similar string. \

#v(4mm)

We can negate a set with a leading `^`. \
`[^abc]` will match any single character except `a`, `b`, or `c`, including symbols and spaces.

#v(4mm)

If we want to keep characters together, we can use the _group_, denoted `( )`. \
Groups work exactly as you'd expect, representing an atomic#footnote([In other words, "unbreakable"]) group of characters. \
`a(01)+b` will match `a01b` and `a010101b`, but will *not* match `a0b`, `a1b`, or `a1100110b`.

#problem()
You are now familiar with most of the tools regex has to offer. \
Write patterns that match the following strings:

- An ISO-8601 date, like `2022-10-29`. \
  #hint([Invalid dates like `2022-13-29` should also be matched.])

- An email address. \
  #hint([Don't forget about subdomains, like `math.ucla.edu`.])

- A UCLA room number, like `MS 5118` or `Kinsey 1220B`.

- Any ISBN-10 of the form `0-316-00395-7`. \
  #hint([Remember that the check digit may be an `X`. Dashes are optional.])

- A word of even length. \
  #hint([
    The set `[A-Za-z]` contains every English letter, capitalized and lowercase. \
    `[a-z]` will only match lowercase letters. \
    Avoid `[A-z]`, which also includes the six ASCII symbols between `Z` and `a`.
  ])

- A word with exactly 3 vowels. \
  #hint([
    The special token `\w` will match any word character. \
    For ASCII text, it is equivalent to `[A-Za-z0-9_]`. `_` represents a literal underscore.
  ])

- A word that has even length and exactly 3 vowels.

- A sentence that does not start with a capital letter.

#v(1fr)

#problem()
If you'd like to know more, check out `https://regexr.com`.
It offers an interactive regex prompt, as well as a cheatsheet that explains every other regex token there is. \ You can find a nice set of challenges at `https://alf.nu/RegexGolf`.
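#v(4mm)

If you'd rather experiment offline, the examples from this handout can also be checked with a few lines of Python. The sketch below is only one way to do it: it assumes Python's built-in `re` module, and the helper name `matches` is made up for this handout. It relies on `re.fullmatch`, so a pattern has to match the whole string, which mirrors the accept-or-reject behavior of a DFA.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, string, expected result), based on the examples in this handout.
examples = [
    ("ca+t", "caat", True),
    ("ca+t", "ct", False),            # + needs at least one a
    ("ca*t", "ct", True),             # * allows zero a's
    ("linea?r", "liner", True),
    ("ab{2}a", "abba", True),
    ("ab{1,3}a", "abbbba", False),    # four b's is one too many
    ("m[aoy]th", "moth", True),
    ("a[01]+b", "a1100110b", True),
    ("[^abc]", " ", True),            # a space is not a, b, or c
    ("a(01)+b", "a010101b", True),
    ("a(01)+b", "a1100110b", False),  # the group must repeat as a unit
]

for pattern, text, expected in examples:
    # Fail loudly if an example does not behave as described.
    assert matches(pattern, text) == expected
    verdict = "matches" if expected else "does not match"
    print(f"{pattern} {verdict} {text!r}")
```

Swapping `re.fullmatch` for `re.search` would instead ask whether the pattern occurs anywhere inside the string, so it is worth trying both when testing your answers.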