\documentclass[
	solutions,
	hidewarning,
]{../../resources/ormc_handout}
\usepackage{xcolor}
\usepackage{soul}
\usepackage{hyperref}
\usepackage[T1]{fontenc} % Fixes texttt braces
\definecolor{Light}{gray}{.90}
\sethlcolor{Light}
\newcommand{\htexttt}[1]{\texttt{\hl{#1}}}
\title{The Regex Warm-Up}
\subtitle{Prepared by Mark on \today}

\begin{document}
\maketitle
Last time, we discussed Deterministic Finite Automata. One interesting application of these mathematical objects is found in computer science: Regular Expressions. \par
This is often abbreviated \say{regex}, which is pronounced like \say{gif.}

\vspace{2mm}
Regex is a language used to specify patterns in a string. You can think of it as a concise way to define a DFA, using text instead of a huge graph. \par
Often enough, a clever regex pattern can do the work of a few hundred lines of code.

\vspace{2mm}
Like the DFAs we've studied, a regex pattern \textit{accepts} or \textit{rejects} a string. However, we don't usually use this terminology with regex, and instead say that a string \textit{matches} or \textit{doesn't match} a pattern.

\vspace{5mm}

Regex strings consist of characters, quantifiers, sets, and groups.

\vspace{5mm}
\textbf{Quantifiers} \par
Quantifiers specify how many times the preceding token may repeat. \par
There are four of these: \htexttt{+}, \htexttt{*}, \htexttt{?}, and \htexttt{\{ \}}

\vspace{2mm}

\htexttt{+} means \say{match one or more of the preceding token} \par
\htexttt{*} means \say{match zero or more of the preceding token} \par
For example, the pattern \htexttt{ca+t} will match the following strings:
\begin{itemize}
	\item \texttt{cat}
	\item \texttt{caat}
	\item \texttt{caaaaaaaat}
\end{itemize}
\htexttt{ca+t} will \textbf{not} match the string \texttt{ct}. \par
The pattern \htexttt{ca*t} will match all the strings above, as well as \texttt{ct}.

\vspace{2mm}
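If you want to check the \htexttt{+} and \htexttt{*} examples yourself, here is a minimal Python sketch. It assumes Python's built-in \texttt{re} module, which this handout does not otherwise use. \texttt{re.fullmatch} succeeds only when the whole string matches, which is the sense of \say{match} used here. The variable names are only for illustration.

```python
import re

# Candidate strings taken from the examples above.
words = ["ct", "cat", "caat", "caaaaaaaat"]

# fullmatch returns a match object only if the ENTIRE string fits the pattern.
plus_matches = [w for w in words if re.fullmatch(r"ca+t", w)]
star_matches = [w for w in words if re.fullmatch(r"ca*t", w)]

print(plus_matches)  # ['cat', 'caat', 'caaaaaaaat']
print(star_matches)  # ['ct', 'cat', 'caat', 'caaaaaaaat']
```

Note that \texttt{re.search} would behave differently: it looks for the pattern anywhere inside the string, so it is not the right tool for an accept/reject test.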
\htexttt{?} means \say{match one or none of the preceding token} \par
The pattern \htexttt{linea?r} will match only \texttt{linear} and \texttt{liner}.

\vspace{2mm}
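The \htexttt{?} example can be checked the same way. This is a small sketch assuming Python's \texttt{re} module; the extra strings \texttt{lineaar} and \texttt{linr} are made up here to show what the pattern rejects.

```python
import re

# Two strings the pattern should accept, and two hypothetical ones it should reject.
candidates = ["linear", "liner", "lineaar", "linr"]

# "a?" allows zero or one "a" between "line" and "r".
optional_matches = [w for w in candidates if re.fullmatch(r"linea?r", w)]

print(optional_matches)  # ['linear', 'liner']
```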
Braces \htexttt{\{min,max\}} are the most flexible quantifier. \par
They specify exactly how many times the preceding token may repeat: \par
\htexttt{ab\{2\}a} will match only \texttt{abba}. \par
\htexttt{ab\{1,3\}a} will match only \texttt{aba}, \texttt{abba}, and \texttt{abbba}. \par
\htexttt{ab\{2,\}a} will match any \texttt{ab...ba} with at least two \texttt{b}s.

\vspace{5mm}
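The three brace forms above can be compared side by side. The sketch below assumes Python's \texttt{re} module; the helper \texttt{matching} and the list \texttt{candidates} are hypothetical names introduced only for this illustration.

```python
import re

# Strings with zero to four b's between the two a's.
candidates = ["aa", "aba", "abba", "abbba", "abbbba"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

exactly_two = matching(r"ab{2}a")     # exactly two b's
one_to_three = matching(r"ab{1,3}a")  # between one and three b's
two_or_more = matching(r"ab{2,}a")    # at least two b's

print(exactly_two)   # ['abba']
print(one_to_three)  # ['aba', 'abba', 'abbba']
print(two_or_more)   # ['abba', 'abbba', 'abbbba']
```

Write the bounds without spaces, as in \texttt{\{1,3\}}, since not every regex engine accepts a space after the comma.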
\problem{}
Write the patterns \htexttt{a*} and \htexttt{a+} using only \htexttt{\{ \}}.
\vfill

\problem{}
Draw a DFA equivalent to the regex pattern \htexttt{01*0}.
\vfill
\pagebreak
\textbf{Characters, Sets, and Groups} \par
In the previous section, we saw how we can specify characters literally: \par
\texttt{a+} means \say{one or more \texttt{a} characters}

\vspace{2mm}

There are, of course, other ways we can specify characters.

\vspace{2mm}

The first such way is the \textit{set}, denoted \htexttt{[ ]}. A set can stand in for any one of the characters inside it. \par
For example, \htexttt{m[aoy]th} will match \texttt{math}, \texttt{moth}, or \texttt{myth}. \par
\htexttt{a[01]+b} will match \texttt{a0b}, \texttt{a111b}, \texttt{a1100110b}, and any other similar string. \par
You may negate a set by placing a \htexttt{\textasciicircum} right after the opening bracket. \par
\htexttt{[\textasciicircum abc]} will match any single character except \texttt{a}, \texttt{b}, or \texttt{c}, including symbols and spaces.

\vspace{2mm}
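Here is a short sketch of the set examples above, again assuming Python's \texttt{re} module. The rejected strings (\texttt{meth}, \texttt{ab}, \texttt{a2b}) are hypothetical additions, included to show where each pattern fails.

```python
import re

# A set matches exactly one character out of those listed.
set_matches = [w for w in ["math", "moth", "myth", "meth"]
               if re.fullmatch(r"m[aoy]th", w)]

# A set followed by + matches one or more characters from the set.
binary_matches = [w for w in ["a0b", "a111b", "a1100110b", "ab", "a2b"]
                  if re.fullmatch(r"a[01]+b", w)]

# A leading ^ negates the set: any single character EXCEPT a, b, or c.
negated_matches = [c for c in ["a", "b", "c", "d", " ", "#"]
                   if re.fullmatch(r"[^abc]", c)]

print(set_matches)      # ['math', 'moth', 'myth']
print(binary_matches)   # ['a0b', 'a111b', 'a1100110b']
print(negated_matches)  # ['d', ' ', '#']
```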
If we want to keep characters together, we can use the \textit{group}, denoted \htexttt{( )}. \par
Groups work exactly as you'd expect, representing an atomic\footnotemark{} group of characters. \par
\htexttt{a(01)+b} will match \texttt{a01b} and \texttt{a010101b}, but will \textbf{not} match \texttt{a0b}, \texttt{a1b}, or \texttt{a1100110b}.

\footnotetext{In other words, \say{unbreakable}}
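To see how a group differs from a set, the sketch below runs the group example above through Python's \texttt{re} module (an assumption of this sketch, not something the handout requires). The quantifier \htexttt{+} applies to the whole group \texttt{01}, not to each character separately.

```python
import re

# The five strings discussed in the group example.
candidates = ["a01b", "a010101b", "a0b", "a1b", "a1100110b"]

# (01)+ repeats the two-character unit "01" as a whole.
group_matches = [w for w in candidates if re.fullmatch(r"a(01)+b", w)]

print(group_matches)  # ['a01b', 'a010101b']
```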
\problem{}<regex>
You are now familiar with most of the tools regex has to offer. \par
Write patterns that match the following strings:
\begin{enumerate}[itemsep=1mm]
	\item An ISO-8601 date, like \texttt{2022-10-29}. \par
	\hint{Invalid dates like \texttt{2022-13-29} should also be matched.}
	\item An email address. \par
	\hint{Don't forget about subdomains, like \texttt{math.ucla.edu}.}
	\item A UCLA room number, like \texttt{MS 5118} or \texttt{Kinsey 1220B}.
	\item Any ISBN-10 of the form \texttt{0-316-00395-7}. \par
	\hint{Remember that the check digit may be an \texttt{X}. Dashes are optional.}
	\item A word of even length. \par
	\hint{The set \texttt{[A-Za-z]} contains every English letter, capitalized and lowercase. \\
	\texttt{[a-z]} will only match lowercase letters.}
	\item A word with exactly 3 vowels. \par
	\hint{The special token \texttt{\textbackslash w} will match any word character. It is equivalent to \texttt{[A-Za-z0-9\_]}. \\ \texttt{\_} stands for a literal underscore.}
	\item A word that has even length and exactly 3 vowels.
	\item A sentence that does not start with a capital letter.
\end{enumerate}
\vfill

\problem{}
If you'd like to know more, check out \url{https://regexr.com}. It offers an interactive regex prompt, as well as a cheatsheet that explains every other regex token there is. \par
You will find a nice set of challenges at \url{https://alf.nu/RegexGolf}.
I especially encourage you to look into this if you are interested in computer science.
\end{document}