\section{Methods}
% Detailed description of methods used or developed.
Our solution to \textit{Celeste Classic} consists of two major parts: the \textit{interface} and the \textit{agent}. The former provides a high-level interface to the game; the latter uses deep Q-learning techniques to control the player.
\subsection{Interface}
The interface component contains no machine-learning logic. Its primary job is to send input to and receive game state from \textit{Celeste Classic}. We send input by emulating keypresses with the standard X11 utility \texttt{xdotool}. A minor consequence is that our agent can only be run in a Linux environment, though this could be remedied with a bit of extra code.
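\vspace{2mm}
As an illustration, the keypress emulation can be sketched in a few lines of Python; the action names and key bindings below are assumptions, not the exact ones used in our interface.
\begin{verbatim}
import subprocess

# Hypothetical mapping from agent actions to key names.
ACTION_KEYS = {"left": "Left", "right": "Right", "jump": "c", "dash": "x"}

def press(action):
    """Emulate a keypress for the given action via xdotool."""
    key = ACTION_KEYS[action]
    subprocess.run(["xdotool", "keydown", key], check=True)
    subprocess.run(["xdotool", "keyup", key], check=True)
\end{verbatim}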
\vspace{2mm}
We receive game state by abusing PICO-8's debugging features. Since PICO-8 games are plain text files, we were able to modify the code of \textit{Celeste Classic} by inserting a few well-placed debug-print statements. The interface captures this text, parses it, and feeds it to our model.
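\vspace{2mm}
The parsing step might resemble the sketch below; the exact format of the debug output (here a prefixed, comma-separated $x,y$ pair per frame) is an assumption on our part.
\begin{verbatim}
def parse_state(line):
    """Parse one debug-print line into a game state.

    Assumes the modified cart prints lines like "state:12,64"
    (player x and y); the real format may differ.
    """
    if not line.startswith("state:"):
        return None
    x, y = line[len("state:"):].split(",")
    return {"x": int(x), "y": int(y)}
\end{verbatim}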
\vspace{2mm}
The final component of the interface is timing. First, we modified \textit{Celeste Classic} to advance a frame only when a key is pressed. This lets the agent operate on in-game time rather than real time, which would not be possible otherwise: \textit{Celeste} normally runs at 30 fps, and the hardware we used to train our model cannot compute gradients that quickly.
\vspace{2mm}
Second, we added a \say{frame skip} mechanism to the interface, which tells the game to run a certain number of frames (often many more than one) after the agent selects an action. The benefit is twofold: first, it prevents our model from training on redundant information, since the game's state does not change significantly between consecutive frames. Second, frame skipping allows transitions to more directly reflect the consequences of an action.
\vspace{2mm}
For example, say the agent chooses to dash upwards. Due to the way \textit{Celeste} is designed, the player cannot take any other action until that dash is complete. Our frame-skip mechanism runs the game until the dash finishes, returning a new state only once a new action can be taken.
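\vspace{2mm}
A rough sketch of the resulting step loop is shown below; \texttt{advance\_frame} is a hypothetical helper around the modified cart, and the skip count of four is an arbitrary example value.
\begin{verbatim}
def step(action, frame_skip=4):
    """Apply an action, then run the game for several frames."""
    press(action)
    for _ in range(frame_skip):
        state = advance_frame()  # runs exactly one game frame
    # Keep advancing until the player can act again (e.g. a dash ends).
    while not state.get("can_act", True):
        state = advance_frame()
    return state
\end{verbatim}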
\subsection{Agent}
The agent we trained to solve \textit{Celeste Classic} is a plain deep Q-learning agent. A neural network estimates the expected reward of taking each possible action in a given state, and the agent selects the action with the highest estimate. This network is a four-layer fully-connected network with 128 nodes in each hidden layer and a ReLU activation on each hidden node. It has two input nodes that track the player's X- and Y-positions, and nine output nodes, each corresponding to an action the agent can take.
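\vspace{2mm}
One plausible reading of this architecture, written as a PyTorch sketch (the framework and exact layer arrangement are our assumptions based on the description above):
\begin{verbatim}
import torch.nn as nn

# Four fully-connected layers with 128 hidden units and ReLU
# activations; 2 inputs (player x, y), 9 outputs (one per action).
q_network = nn.Sequential(
    nn.Linear(2, 128),   nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 9),
)
\end{verbatim}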
\subsubsection{Reward}
\noindent
\begin{minipage}{0.58\textwidth}
During training, the agent receives a reward of 10 whenever it reaches a checkpoint (shown at right) or completes the stage. If the agent skips a checkpoint, it also collects the reward for each checkpoint it skipped: for example, jumping from point 1 to point 3 would give the agent a reward of 20. (A short sketch of this scheme follows the figure.)
\vspace{2mm}
These checkpoints are spaced closely enough to keep the agent progressing, but far enough apart to give it a challenge. Points 4 and 5 are particularly interesting in this respect: when we trained an agent without point 4, it would often reach the ledge and fall off, receiving no reward.
\vspace{2mm}
Despite many thousands of epochs, that training process was unable to finish the stage. Although the ledge under point 4 is fairly easy to reach from either point 2 or point 3, it is highly unlikely that an untrained agent would make it from point 2 to point 5 without the extra reward at point 4.
\end{minipage}\hfill
\begin{minipage}{0.4\textwidth}
\begin{center}
\includegraphics[width=0.9\textwidth]{points}
\vspace{1mm}
\begin{minipage}{0.8\textwidth}
Locations of non-final checkpoints
\end{minipage}
\end{center}
\end{minipage}
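\vspace{2mm}
For concreteness, the checkpoint reward rule can be sketched as follows. Indexing checkpoints by integer (with stage completion as the final index) is our own illustrative framing rather than a detail taken from the code.
\begin{verbatim}
CHECKPOINT_REWARD = 10

def reward(prev_checkpoint, new_checkpoint):
    """Reward for moving from one checkpoint index to another.

    Skipped checkpoints still pay out, so jumping from point 1
    straight to point 3 yields 2 * 10 = 20 reward.
    """
    if new_checkpoint <= prev_checkpoint:
        return 0
    return CHECKPOINT_REWARD * (new_checkpoint - prev_checkpoint)
\end{verbatim}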
\vfill
\pagebreak
\subsubsection{Exploration Probability}
At every step, we use the Q network to predict the expected reward for taking each of the nine actions. Naturally, the best action is the one with the highest predicted reward. To encourage exploration, however, we instead take a random action with a probability given by
$$
P(c) = \epsilon_1 + (\epsilon_0 - \epsilon_1) e^{-c / d}
$$
where $\epsilon_0$ is the initial exploration probability, $\epsilon_1$ is the final exploration probability, and $d$ controls how quickly $P(c)$ decays toward $\epsilon_1$. Here, $c$ is a rather unusual \say{time} parameter: it counts the number of times the agent has reached the next checkpoint.
\vspace{2mm}
Usually, such $\epsilon$ policies depend on the number of training steps completed. For many applications this makes sense: as a model trains over many iterations, it begins to perform better and thus has less need to explore. In our case, that does not work: we need to explore until we find a way to reach a checkpoint, and then rely on the model's predictions once we have found one. Therefore, instead of computing $P$ with respect to a simple iteration counter, we compute it with respect to $c$.
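\vspace{2mm}
The schedule itself is simple to implement; the sketch below uses illustrative values for $\epsilon_0$, $\epsilon_1$, and $d$ rather than our actual hyperparameters.
\begin{verbatim}
import math
import random

EPS_START, EPS_END, DECAY = 0.9, 0.05, 50.0  # illustrative values

def explore_probability(c):
    """P(c) = eps1 + (eps0 - eps1) * exp(-c / d), where c counts
    how many times the agent has reached the next checkpoint."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-c / DECAY)

def choose_action(q_values, c):
    """Take a random action with probability P(c), else the greedy one."""
    if random.random() < explore_probability(c):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
\end{verbatim}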
\subsubsection{Target Network, Replay Memory}
To keep training stable, we use a \textit{target network} as described in \textit{Human-Level Control} \cite{humanlevel}. However, instead of periodically hard-resetting the target network to the Q network, we apply a soft update defined by the following rule, where $W_Q$ and $W_T$ are the weights of the Q network and the target network, respectively:
$$
W_T \leftarrow 0.05\,W_Q + 0.95\,W_T
$$
\vspace{2mm}
We also use \textit{replay memory} from the same paper, with a batch size of 100 and a capacity of 50,000 transitions. Our model is optimized using Adam with a learning rate of 0.001.
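\vspace{2mm}
In a PyTorch-style implementation (again an assumption about the framework), the soft update amounts to blending the two parameter sets after each training step:
\begin{verbatim}
import torch

TAU = 0.05  # blend factor from the update rule above

@torch.no_grad()
def soft_update(q_net, target_net, tau=TAU):
    """W_T <- tau * W_Q + (1 - tau) * W_T, parameter by parameter."""
    for w_q, w_t in zip(q_net.parameters(), target_net.parameters()):
        w_t.mul_(1.0 - tau).add_(tau * w_q)
\end{verbatim}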
\subsubsection{Bellman Equation}
Our goal is to train our model to approximate the value function $Q(s, a)$, which tells us the value of taking action $a$ in state $s$. This approximation can then be used to choose the best action at each state. We define $Q$ using the Bellman equation:
$$
Q(s, a) = r(s) + \gamma \max_{a'} Q(s', a')
$$
where $r(s)$ is the reward at state $s$, $s'$ is the state we reach by taking action $a$ in state $s$, and $\gamma$ is a discount factor that makes present reward more valuable than future reward. In our model, we set $\gamma$ to $0.9$.
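\vspace{2mm}
Putting the pieces together, one training step on a replay batch might look like the sketch below (PyTorch and a squared-error loss are assumptions; variable names are illustrative). The Q network is fit to the Bellman target computed with the target network.
\begin{verbatim}
import torch
import torch.nn.functional as F

GAMMA = 0.9

def training_step(q_net, target_net, optimizer, batch):
    """One gradient step toward the Bellman target.

    `batch` is assumed to hold tensors: states (N,2), actions (N,),
    rewards (N,), next_states (N,2), and a terminal mask done (N,).
    """
    states, actions, rewards, next_states, done = batch

    # Q(s, a) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target r + gamma * max_a' Q_target(s', a'),
    # without bootstrapping past terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + GAMMA * next_q * (1.0 - done.float())

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
\end{verbatim}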
\vfill
\pagebreak