#### Markov Decision Processes

#### State, action, reward function, transition function

The state transition function probabilistically specifies the next state of the environment as a function of its current state and the agent’s action. The reward function specifies expected instantaneous reward as a function of the current state and action. The model is *Markov* if the state transitions are independent of any previous environment states or agent actions.
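As a concrete illustration, a small MDP can be represented with plain dictionaries; this is only a sketch, and all names here (the states, actions, `T`, `R`, `step`) are hypothetical, not from the text:

```python
import random

# A toy two-state MDP (all names illustrative).
# T[(s, a)] maps each next state to its probability; R[(s, a)] is the
# expected instantaneous reward for taking action a in state s.
T = {
    ("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "stay"): {"sunny": 1.0},
    ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "stay"): {"rainy": 1.0},
}
R = {
    ("sunny", "walk"): 1.0,
    ("sunny", "stay"): 0.5,
    ("rainy", "walk"): -1.0,
    ("rainy", "stay"): 0.0,
}

def step(state, action):
    """Sample a next state; the distribution depends only on the current
    state and action (the Markov property), never on earlier history."""
    dist = T[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return R[(state, action)], next_state
```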

**Reinforcement learning**

RL methods are based on the notion of value functions. Value functions are either state values (the value of a state) or action values (the value of taking an action in a given state). Action-value functions are usually denoted Q(s,a), where a ∈ A. In control applications, the goal of RL is to learn an action-value function that allows the agent to use a policy that is as close to optimal as possible. However, since the action values are initially unknown, the agent first has to explore the environment in order to learn them.

**Finding a Policy Given a Model**

Before we consider algorithms for learning to behave in MDP environments, we will explore techniques for determining the optimal policy given a correct model. These dynamic programming techniques will serve as the foundation and inspiration for the learning algorithms to follow.

We will speak of the **optimal value of a state: it is the expected infinite discounted sum of reward that the agent will gain if it starts in that state and executes the optimal policy.** Using π* as a complete decision policy, it is written

V*(s) = max_π E[ Σ_{t=0..∞} γ^t r_t ]

This optimal value function is unique and can be defined as the solution to the simultaneous equations

V*(s) = max_a [ R(s,a) + γ Σ_{s'} T(s,a,s') V*(s') ],  for all s ∈ S,

which assert that the value of a state *s* is the expected instantaneous reward plus the expected discounted value of the next state, using the best available action. Given the optimal value function, we can specify the optimal policy as

π*(s) = arg max_a [ R(s,a) + γ Σ_{s'} T(s,a,s') V*(s') ]

Suppose we *know* the state transition function P and the reward function R, and we wish to calculate the policy that maximizes the expected discounted reward.

The standard family of algorithms to calculate this optimal policy requires storage for two arrays indexed by state: *value*, which contains real values, and *policy*, which contains actions. At the end of the algorithm, *policy* will contain the solution and *value* will contain the discounted sum of the rewards to be earned (on average) by following that solution from each state *s*.

The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place:

policy[s] := arg max_a [ R(s,a) + γ Σ_{s'} T(s,a,s') value[s'] ]

value[s] := R(s, policy[s]) + γ Σ_{s'} T(s, policy[s], s') value[s']
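A value-iteration variant of these repeated backups can be sketched in Python. This is a minimal sketch under the assumption that states, actions, `T`, and `R` are supplied as plain containers; `value_iteration`, `gamma`, and `theta` are names chosen here for illustration:

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-8):
    """Compute value[s] and policy[s] by repeated backups.
    T[(s, a)] maps next states to probabilities; R[(s, a)] is the
    expected instantaneous reward."""
    value = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up: expected reward plus discounted value of next state,
            # using the best available action.
            q = {a: R[(s, a)] + gamma * sum(p * value[s2]
                                            for s2, p in T[(s, a)].items())
                 for a in actions}
            best = max(q.values())
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < theta:  # no further (significant) changes take place
            break
    # Read the greedy policy off the converged value function.
    policy = {s: max(actions,
                     key=lambda a: R[(s, a)] + gamma * sum(
                         p * value[s2] for s2, p in T[(s, a)].items()))
              for s in states}
    return value, policy
```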

**The Value Function**

The issue of how the agent learns to choose "good" actions, or even how we might measure the utility of an action, has not yet been addressed. First, two terms are defined. A policy determines which action should be performed in each state; that is, a policy is a mapping from states to actions. The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state. The optimal policy is therefore the mapping from states to actions that maximizes the sum of the reinforcements when starting in an arbitrary state and performing actions until a terminal state is reached.

Under this definition the value of a state is dependent upon the policy. The value function is a mapping from states to state values and can be approximated using any type of function approximator (e.g., multilayered perceptron, memory based system, radial basis functions, look-up table, etc.).

**Model-free**:

Learn a controller without learning a model.

**What is reinforcement learning?**

We reviewed methods for obtaining an optimal policy for an MDP assuming that we already had a model. The model consists **of knowledge of the state transition probability function T(s,a,s')** and the reinforcement function *R*(*s*,*a*).

**Reinforcement learning is primarily concerned with how to obtain the optimal policy when such a model is not known in advance. The agent must interact with its environment directly to obtain information** which, by means of an appropriate algorithm, can be processed to produce an optimal policy.

**Temporal difference methods:**

How do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to wait until the "end" and then reward the actions taken if the result was good and punish them if the result was bad. In ongoing tasks, it is difficult to know what the "end" is, and this might require a great deal of memory. Instead, **we will use insights from value iteration to adjust the estimated value of a state based on the immediate reward and the estimated value of the next state.** This class of algorithms is known as temporal difference (TD) methods; Q-learning is one example.

Key idea: use insight from dynamic programming to adjust the value of a state based on the immediate reward and the value of the next state.
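This key idea can be written as a one-line TD(0) update; a minimal sketch, in which `V`, `alpha`, and `gamma` are assumed names for the value table, learning rate, and discount factor:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V[s] a small step toward the one-step target:
    immediate reward plus discounted estimated value of the next state."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V[s]
```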

**Q-learning**

Q-learning can be applied to **discounted infinite-horizon MDPs**. It can also be applied to undiscounted problems as long as the optimal policy is guaranteed to reach a reward-free absorbing state and the state is periodically reset.

It is well suited to repeated games against an unknown opponent (recall the Iterated Prisoner's Dilemma). Q-learning works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. After being initialized to arbitrary numbers, Q-values are estimated on the basis of experience as follows:

- From the current state *s*, select an action *a*. This will cause a receipt of an immediate payoff *r*, and arrival at a next state *s'*.
- Update Q(*s,a*) based upon this experience as follows:

Q(*s,a*) := Q(*s,a*) + *x*[*r* + *y* max_b Q(*s',b*) − Q(*s,a*)]

where *x* is the learning rate and 0 < *y* < 1 is the discount factor.

- Go to the first step.
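The update rule above can be sketched directly; this is a minimal sketch in which `Q` is assumed to be a dictionary of state-action values, with the learning rate *x* and discount factor *y* renamed `alpha` and `gamma` for readability:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q[(s, a)] toward the target
    r + gamma * max_b Q[(s_next, b)]."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```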

This algorithm is guaranteed to converge to the correct Q-values with probability one if the environment is stationary and Markovian (the next state depends only on the current state and the action taken in it), a lookup table is used to store the Q-values, every state-action pair continues to be visited, and the learning rate is decreased appropriately over time. The algorithm itself does not specify which action to select at each step. In practice, an action-selection method is usually chosen that ensures sufficient exploration while still favouring actions with higher value estimates; the *Boltzmann distribution* strategy is a common choice.

Experiments with Q-learning agents have produced favourable results. Bradtke (1993) described encouraging results in applying multiagent reinforcement learning to efficiently damp out disturbances of a flexible beam. Littman and Boyan (1994) describe a distributed RL algorithm for packet routing. Littman (1994) also describes experiments with Q-learning agents that try to learn a mixed strategy that is optimal against the worst possible opponent in a zero-sum two-player game. Sufficient work has been done, and more is underway, toward the goal of "learning" artificial intelligence.
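The Boltzmann (softmax) strategy can be sketched as follows; a minimal sketch in which the temperature parameter `tau` is an assumed name, with higher `tau` giving more exploration and lower `tau` approaching greedy selection:

```python
import math
import random

def boltzmann_action(Q, s, actions, tau=1.0):
    """Select an action with probability proportional to exp(Q(s,a)/tau),
    favouring higher value estimates while still exploring."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```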