In reinforcement learning, an agent takes actions within an environment. When an agent acts, a reward function assigns feedback to the agent that either positively or negatively reinforces the agent’s action. The agent’s goal is to use this feedback to learn an optimal policy.

For any given state of the environment, a policy tells the agent which action to take in that state. An optimal policy tells the agent the best action to take. Of course, there are several ways to define "best." Here, let's say that an optimal policy maximizes the sum of discounted rewards over time.
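The "sum of discounted rewards" criterion can be sketched in a few lines. The function name and the discount factor gamma below are illustrative, not from the post:

```python
# A minimal sketch of the discounted-reward criterion.
# Each reward t steps in the future is scaled by gamma ** t.

def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per time step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.5, rewards [1, 1, 1] are worth 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```

A discount factor below 1 makes near-term rewards count for more than distant ones, and keeps the sum finite over an infinite horizon.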

Say that Q(s, a) is the reward received from taking action a in state s, plus the discounted value of following the optimal policy thereafter. If the agent knows Q, then it can execute the optimal policy: whenever the agent is in state s, take the action a that maximizes Q(s, a). So, the agent's task is to learn Q.
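Acting greedily with respect to a known Q is a one-liner. The nested-dict Q table below (Q[s][a] -> value) and its entries are hypothetical:

```python
# Sketch: the greedy policy induced by a Q table.
# Q is assumed to be a nested dict mapping state -> {action: value}.

def greedy_action(Q, s):
    """Return the action a that maximizes Q[s][a]."""
    return max(Q[s], key=Q[s].get)

Q = {"s0": {"left": 0.2, "right": 0.9}}
print(greedy_action(Q, "s0"))  # right
```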

To learn Q, create a table Q[s][a] and initialize its entries to zero. Then, over and over again, have the agent take an action a from its current state s, observe the reward r and the resulting state s', and update the table: Q[s][a] <- r + gamma * max over a' of Q[s'][a'] (this is the rule for a deterministic environment; stochastic environments add a learning rate). Q[s][a] converges to the true value of Q provided that every state-action pair is visited infinitely often.
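The loop above can be sketched end to end. Everything here is an illustrative assumption: a made-up deterministic environment (states 0 through 3 on a line, actions -1 and +1, reward 1 for reaching state 3), uniform random exploration, and gamma = 0.9:

```python
import random
from collections import defaultdict

# Sketch of tabular Q-learning on a toy deterministic chain environment.
# States 0..3; actions -1 (left) and +1 (right); reward 1 at the goal.

GOAL, GAMMA = 3, 0.9

def step(s, a):
    """Deterministic transition: move along the line, clipped to [0, GOAL]."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0)

random.seed(0)  # reproducibility
Q = defaultdict(lambda: {-1: 0.0, +1: 0.0})  # table initialized to zero

for _ in range(500):  # episodes
    s = 0
    while s != GOAL:
        a = random.choice([-1, +1])  # explore uniformly at random
        s2, r = step(s, a)
        # Deterministic-environment update: reward plus discounted
        # value of the best action from the next state.
        Q[s][a] = r + GAMMA * max(Q[s2].values())
        s = s2

# The greedy policy learned: move right (+1) from every non-goal state.
print({s: max(Q[s], key=Q[s].get) for s in range(GOAL)})  # {0: 1, 1: 1, 2: 1}
```

Because the environment is deterministic, each update overwrites the entry directly; in the stochastic case the new estimate is blended in with a learning rate instead.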

I implemented Q-learning and applied it to a simple environment. The code is up at my GitHub.


Mitchell, Tom. Machine Learning. McGraw-Hill, 1997.

