
\documentclass[
	solutions,
	hidewarning,
]{../../resources/ormc_handout}


\usepackage{xcolor}
\usepackage{soul}
\usepackage{hyperref}
\usepackage[T1]{fontenc} % Fixes texttt braces

\definecolor{Light}{gray}{.90}
\sethlcolor{Light}
\newcommand{\htexttt}[1]{\texttt{\hl{#1}}}


\title{The Regex Warm-Up}
\subtitle{Prepared by Mark on \today}

\begin{document}

	\maketitle


	Last time, we discussed Deterministic Finite Automata. One interesting application of these mathematical objects is found in computer science: Regular Expressions. \par
	This is often abbreviated \say{regex}, which is pronounced like \say{gif.}

	\vspace{2mm}

	Regex is a language used to specify patterns in a string. You can think of it as a concise way to define a DFA, using text instead of a huge graph. \par

	Often enough, a clever regex pattern can do the work of a few hundred lines of code.

	\vspace{2mm}

	Like the DFAs we've studied, a regex pattern \textit{accepts} or \textit{rejects} a string. However, we don't usually use this terminology with regex, and instead say that a string \textit{matches} or \textit{doesn't match} a pattern.
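
	\vspace{2mm}

	To make the comparison with DFAs concrete, here is a minimal sketch in Python. It is only an illustration: the pattern \texttt{ab*} (an \texttt{a} followed by any number of \texttt{b}s; the \texttt{*} is explained in the next section), the state names, and the helpers \texttt{dfa\_accepts} and \texttt{regex\_accepts} are all made up for this example. The one library call it relies on is \texttt{re.fullmatch} from Python's standard \texttt{re} module, which succeeds only when the entire string matches the pattern.

\begin{verbatim}
import re

# Hypothetical DFA for "an a followed by any number of b's".
# States: "start" (nothing read yet), "ok" (accepting),
# and "dead" (the string can no longer be accepted).
TRANSITIONS = {
    ("start", "a"): "ok",
    ("ok", "b"): "ok",
}

def dfa_accepts(s):
    state = "start"
    for ch in s:
        # Any transition not listed above leads to the dead state.
        state = TRANSITIONS.get((state, ch), "dead")
    return state == "ok"

def regex_accepts(s):
    # fullmatch returns a match object or None;
    # bool() turns that into True or False.
    return bool(re.fullmatch(r"ab*", s))

for s in ["a", "abbb", "", "ba", "abab"]:
    print(repr(s), dfa_accepts(s), regex_accepts(s))
\end{verbatim}

	On each of these strings the two functions should agree. The transition table takes several lines to write out, while the pattern takes three characters.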

	\vspace{5mm}

	Regex strings consist of characters, quantifiers, sets, and groups.

	\vspace{5mm}

	\textbf{Quantifiers} \par
	Quantifiers specify how many times the preceding token may repeat. \par
	There are four of these: \htexttt{+}, \htexttt{*}, \htexttt{?}, and \htexttt{\{ \}}

	\vspace{2mm}

	\htexttt{+} means \say{match one or more of the preceding token} \par
	\htexttt{*} means \say{match zero or more of the preceding token}

	For example, the pattern \htexttt{ca+t} will match the following strings:
	\begin{itemize}
		\item \texttt{cat}
		\item \texttt{caat}
		\item \texttt{caaaaaaaat}
	\end{itemize}
	\htexttt{ca+t} will \textbf{not} match the string \texttt{ct}. \par
	The pattern \htexttt{ca*t} will match all the strings above, as well as \texttt{ct}.
	\vspace{2mm}
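
	If you have Python available, the examples above can be checked with a short sketch like the following. It assumes Python's standard \texttt{re} module; the variable names are made up for illustration. \texttt{re.fullmatch} is used so that the whole string must match, which mirrors how a DFA accepts or rejects an entire input.

\begin{verbatim}
import re

strings = ["ct", "cat", "caat", "caaaaaaaat"]

# Keep only the strings that each pattern matches in full.
plus_matches = [s for s in strings if re.fullmatch(r"ca+t", s)]
star_matches = [s for s in strings if re.fullmatch(r"ca*t", s)]

print(plus_matches)  # every string except "ct"
print(star_matches)  # all four strings
\end{verbatim}

	\vspace{2mm}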


	\htexttt{?} means \say{match one or none of the preceding token} \par
	The pattern \htexttt{linea?r} will match only \texttt{linear} and \texttt{liner}.
	\vspace{2mm}

	Braces \htexttt{\{min,max\}} are the most flexible quantifier. \par
	They specify how many copies of the preceding token to match: \par
	\htexttt{ab\{2\}a} will match only \texttt{abba}. \par
	\htexttt{ab\{1,3\}a} will match only \texttt{aba}, \texttt{abba}, and \texttt{abbba}. \par
	\htexttt{ab\{2,\}a} will match any \texttt{ab...ba} with at least two \texttt{b}s.
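
	\vspace{2mm}

	The \htexttt{?} and \htexttt{\{ \}} examples can be tried out in the same way. The sketch below assumes Python's standard \texttt{re} module; \texttt{matching} is a hypothetical helper written for this handout, not a library function.

\begin{verbatim}
import re

def matching(pattern, candidates):
    # Hypothetical helper: keep only the candidates
    # that the pattern matches in full.
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matching(r"linea?r", ["liner", "linear", "lineaar"]))
print(matching(r"ab{2}a", ["aa", "aba", "abba", "abbba"]))
print(matching(r"ab{1,3}a", ["aa", "aba", "abba", "abbba", "abbbba"]))
print(matching(r"ab{2,}a", ["aba", "abba", "abbbbbba"]))
\end{verbatim}

	Each line prints the candidates that survive, so you can compare the output against the claims above.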

	\vspace{5mm}

	\problem{}
	Write the patterns \htexttt{a*} and \htexttt{a+} using only \htexttt{\{ \}}.
	\vfill

	\problem{}
	Draw a DFA equivalent to the regex pattern \htexttt{01*0}.
	\vfill

	\pagebreak


	\textbf{Characters, Sets, and Groups} \par
	In the previous section, we saw how we can specify characters literally: \par
	\htexttt{a+} means \say{one or more \texttt{a} characters}

	\vspace{2mm}

	There are, of course, other ways we can specify characters.

	\vspace{2mm}

	The first such way is the \textit{set}, denoted \htexttt{[ ]}. A set can pretend to be any character inside it. \par
	For example, \htexttt{m[aoy]th} will match \texttt{math}, \texttt{moth}, or \texttt{myth}. \par
	\htexttt{a[01]+b} will match \texttt{a0b}, \texttt{a111b}, \texttt{a1100110b}, and any other similar string. \par
	You may negate a set with a \htexttt{\textasciicircum}. \par
	\htexttt{[\textasciicircum abc]} will match any character except \texttt{a}, \texttt{b}, or \texttt{c}, including symbols and spaces.
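
	\vspace{2mm}

	Here is a minimal sketch for experimenting with sets, again assuming Python's standard \texttt{re} module. The helper \texttt{matching} and the candidate strings are made up for illustration.

\begin{verbatim}
import re

def matching(pattern, candidates):
    # Hypothetical helper: keep only the candidates
    # that the pattern matches in full.
    return [s for s in candidates if re.fullmatch(pattern, s)]

# A set stands in for exactly one character.
print(matching(r"m[aoy]th", ["math", "moth", "myth", "meth", "maoyth"]))

# A quantifier after a set repeats the whole set.
print(matching(r"a[01]+b", ["a0b", "a111b", "a1100110b", "ab", "a2b"]))

# A negated set matches any single character not listed.
print(matching(r"[^abc]", ["a", "d", " ", "!", "dd"]))
\end{verbatim}

	Note that \texttt{dd} is rejected by the last pattern: a set without a quantifier matches one character, not two.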

	\vspace{2mm}

	If we want to keep characters together, we can use the \textit{group}, denoted \htexttt{( )}. \par

	Groups work exactly as you'd expect, representing an atomic\footnotemark{} group of characters. \par
	\htexttt{a(01)+b} will match \texttt{a01b} and \texttt{a010101b}, but will \textbf{not} match \texttt{a0b}, \texttt{a1b}, or \texttt{a1100110b}.

	\footnotetext{In other words, \say{unbreakable}}
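
	\vspace{2mm}

	The difference between a group and a set is easiest to see side by side. The sketch below assumes Python's standard \texttt{re} module; \texttt{matching} and \texttt{candidates} are hypothetical names chosen for this example.

\begin{verbatim}
import re

def matching(pattern, candidates):
    # Hypothetical helper: keep only the candidates
    # that the pattern matches in full.
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a01b", "a010101b", "a0b", "a1b", "a1100110b"]

# Group: the two characters "01" repeat together as one unit.
print(matching(r"a(01)+b", candidates))

# Set: each repetition may be either "0" or "1".
print(matching(r"a[01]+b", candidates))
\end{verbatim}

	The group version keeps only \texttt{a01b} and \texttt{a010101b}, while the set version keeps all five candidates.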


	\problem{}<regex>
	You are now familiar with most of the tools regex has to offer. \par
	Write patterns that match the following strings:
	\begin{enumerate}[itemsep=1mm]
		\item An ISO-8601 date, like \texttt{2022-10-29}. \par
		\hint{Invalid dates like \texttt{2022-13-29} should also be matched.}

		\item An email address. \par
		\hint{Don't forget about subdomains, like \texttt{math.ucla.edu}.}

		\item A UCLA room number, like \texttt{MS 5118} or \texttt{Kinsey 1220B}.

		\item Any ISBN-10 of the form \texttt{0-316-00395-7}. \par
		\hint{Remember that the check digit may be an \texttt{X}. Dashes are optional.}

		\item A word of even length. \par
		\hint{The set \texttt{[A-Za-z]} contains every English letter, uppercase and lowercase. \\
		\texttt{[a-z]} will only match lowercase letters.}

		\item A word with exactly 3 vowels. \par
		\hint{The special token \texttt{\textbackslash w} will match any word character. It is equivalent to \texttt{[A-Za-z0-9\_]}. \\ \texttt{\_} stands for a literal underscore.}

		\item A word that has even length and exactly 3 vowels.

		\item A sentence that does not start with a capital letter.
	\end{enumerate}


	\vfill


	\problem{}
	If you'd like to know more, check out \url{https://regexr.com}. It offers an interactive regex prompt, as well as a cheatsheet that explains every other regex token there is. \par
	You will find a nice set of challenges at \url{https://alf.nu/RegexGolf}.
	I especially encourage you to look into this if you are interested in computer science.
\end{document}