\documentclass[
solutions,
hidewarning,
]{../../resources/ormc_handout}

\usepackage{xcolor}
\usepackage{soul}

\definecolor{Light}{gray}{.90}
\sethlcolor{Light}
\newcommand{\htexttt}[1]{\texttt{\hl{#1}}}


\title{The Regex Warm-Up}
\subtitle{Prepared by Mark on \today}

\begin{document}

\maketitle

Last time, we discussed Deterministic Finite Automata. One interesting application of these mathematical objects is found in computer science: Regular Expressions. \\
(abbreviated \say{regex}, which is pronounced like \say{gif})

\vspace{2mm}

Regex is a language used to specify patterns in a string. You can think of it as a concise way to define a DFA, using text instead of a huge graph. \\

Often enough, a clever regex pattern can do the work of a few hundred lines of code. \\

\vspace{2mm}

Like the DFAs we have studied, a regex pattern \textit{accepts} or \textit{rejects} a string. However, we don't usually use this terminology when discussing regex, instead opting to say a pattern \textit{matches} or \textit{doesn't match} a string. \\

\vspace{5mm}

\textbf{Quantifiers} \\
Quantifiers tell us how many copies of the preceding token to match. \\
There are four of them:
\htexttt{+}, \htexttt{*}, \htexttt{?}, and \htexttt{\{ \}}

\vspace{2mm}

\htexttt{+} means \say{match one or more of the preceding token} \\
\htexttt{*} means \say{match zero or more of the preceding token} \\

For example, the pattern \htexttt{ca+t} will match the following strings:
\begin{itemize}
\item \texttt{cat}
\item \texttt{caat}
\item \texttt{caaaaaaaat}
\end{itemize}
\htexttt{ca+t} will \textbf{not} match the string \texttt{ct}. \\
The pattern \htexttt{ca*t} will match all the strings above, and will also match \texttt{ct}.
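\vspace{2mm}

To see where these answers come from, we can read a pattern one token at a time. The trace below is only an informal sketch: it assumes the pattern must account for the whole string, and real regex software may work differently internally.

\begin{center}
\begin{tabular}{l l l}
Token & Reading \texttt{caat} & Reading \texttt{ct} \\
\hline
\htexttt{c} & uses \texttt{c} & uses \texttt{c} \\
\htexttt{a+} & uses \texttt{aa} & fails: needs at least one \texttt{a} \\
\htexttt{t} & uses \texttt{t} & never reached \\
\end{tabular}
\end{center}

With \htexttt{ca*t}, the middle token is allowed to use zero characters, so the second trace succeeds.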
\vspace{2mm}


\htexttt{?} means \say{match one or none of the preceding token} \\
The pattern \htexttt{linea?r} will match only \texttt{linear} and \texttt{liner}. \\
\vspace{2mm}

Braces \htexttt{\{min,max\}} are the most flexible quantifier. \\
They specify how many copies of the preceding token to match, either exactly or as a range: \\
\htexttt{ab\{2\}a} will match only \texttt{abba}. \\
\htexttt{ab\{1,3\}a} will match only \texttt{aba}, \texttt{abba}, and \texttt{abbba}. \\
\htexttt{ab\{2,\}a} will match any \texttt{ab...ba} with at least two \texttt{b}s.
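\vspace{2mm}

As with the DFAs from last time, each of these patterns describes a set of accepted strings. The table below is a sketch of those sets. It writes $\texttt{b}^n$ for $n$ copies of \texttt{b}, which is a shorthand assumed here for illustration and is not regex syntax.

\begin{center}
\begin{tabular}{l l}
Pattern & Strings matched \\
\hline
\htexttt{ab\{2\}a} & $\{\, \texttt{a}\texttt{b}^n\texttt{a} : n = 2 \,\}$ \\
\htexttt{ab\{1,3\}a} & $\{\, \texttt{a}\texttt{b}^n\texttt{a} : 1 \leq n \leq 3 \,\}$ \\
\htexttt{ab\{2,\}a} & $\{\, \texttt{a}\texttt{b}^n\texttt{a} : n \geq 2 \,\}$ \\
\end{tabular}
\end{center}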

\vspace{5mm}

\problem{}
Write the patterns \htexttt{a*} and \htexttt{a+} using only \htexttt{\{ \}}.
\vfill

\problem{}
Draw a DFA equivalent to the regex pattern \htexttt{01*0}.
\vfill

\pagebreak

\textbf{Characters, Sets, and Groups} \\
We specify characters literally, as shown above: \\
\htexttt{a+} means \say{one or more \texttt{a} characters} \\

\vspace{2mm}

There are, however, other ways we can specify characters. \\

\vspace{2mm}

The first such way is the \textit{set}, denoted \htexttt{[ ]}. A set matches any single character listed inside it. \\
For example, \htexttt{m[aoy]th} will match \texttt{math}, \texttt{moth}, or \texttt{myth}. \\
\htexttt{a[01]+b} will match \texttt{a0b}, \texttt{a111b}, \texttt{a1100110b}, and any other similar string. \\
You may negate a set by placing a \htexttt{\textasciicircum} right after the opening bracket. \\
\htexttt{[\textasciicircum abc]} will match any character except \texttt{a}, \texttt{b}, or \texttt{c}, including symbols and spaces.

\vspace{2mm}

If we want to keep characters together, we can use the \textit{group}, denoted \htexttt{( )}. \\

Groups work exactly as you'd expect, representing an atomic\footnotemark{} group of characters. \\
\htexttt{a(01)+b} will match \texttt{a01b} and \texttt{a010101b}, but will \textbf{not} match \texttt{a0b}, \texttt{a1b}, or \texttt{a1100110b}. \\

\footnotetext{In other words, \say{unbreakable}}
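\vspace{2mm}

The difference between a set and a group is what gets repeated. As an informal sketch, here are two small patterns with their matches written out by hand. Both reuse the \htexttt{\{2\}} quantifier from the previous page.

\begin{center}
\begin{tabular}{l l}
Pattern & Strings matched \\
\hline
\htexttt{a[01]\{2\}b} & \texttt{a00b}, \texttt{a01b}, \texttt{a10b}, \texttt{a11b} \\
\htexttt{a(01)\{2\}b} & \texttt{a0101b} only \\
\end{tabular}
\end{center}

A set makes a fresh choice on every repetition, while a group repeats its entire contents in order.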


\problem{}<regex>
You are now familiar with most of the tools regex has to offer. \\
Write patterns that match the following strings:
\begin{enumerate}[itemsep=1mm]
\item An ISO-8601 date, like \texttt{2022-10-29}. \\
\hint{Invalid dates like \texttt{2022-13-29} should also be matched.}

\item An email address. \\
\hint{Don't forget about subdomains, like \texttt{math.ucla.edu}.}

\item A UCLA room number, like \texttt{MS 5118} or \texttt{Kinsey 1220B}.

\item Any ISBN-10 of the form \texttt{0-316-00395-7}. \\
\hint{Remember that the check digit may be an \texttt{X}. Dashes are optional.}

\item A word of even length. \\
\hint{The set \texttt{[A-Za-z]} contains every English letter, uppercase and lowercase. \\
\texttt{[a-z]} will only match lowercase letters.}

\item A word with exactly 3 vowels. \\
\hint{The special token \texttt{\textbackslash w} will match any word character. It is equivalent to \texttt{[A-Za-z0-9\_]}. \\ \texttt{\_} stands for a literal underscore.}

\item A word that has even length and exactly 3 vowels.

\item A sentence that does not start with a capital letter.
\end{enumerate}


\vfill


\problem{}
If you'd like to know more, check out \texttt{regexr.com}. It offers an interactive regex prompt, as well as a cheatsheet that explains every other regex token there is. You will find a nice set of challenges at \texttt{http://regex.alf.nu}. \\
I especially encourage you to look into this if you are interested in computer science.

\end{document}