# Monte Carlo Methods


Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. They require no prior knowledge of the environment's dynamics.

To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.

The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns (as opposed to methods that learn from partial returns, considered in the next chapter).

As in the DP chapter, first we consider the prediction problem (the computation of $v_{\pi}$ and $q_{\pi}$ for a fixed arbitrary policy $\pi$), then policy improvement, and, finally, the control problem and its solution by GPI. Each of these ideas taken from DP is extended to the Monte Carlo case in which only sample experience is available.

## 1. Monte Carlo Prediction

The idea underlying all Monte Carlo methods is to average the returns observed after visits to a state.
The first-visit MC method and the every-visit MC method are very similar but have slightly different theoretical properties. Backup diagrams can also be used to present Monte Carlo methods.
The difference between DP and MC:
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap as we defined it in the previous chapter.
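First-visit MC prediction can be sketched as follows. This is a minimal sketch, not the book's exact pseudocode: `generate_episode` is a hypothetical callback that runs the fixed policy for one episode and returns `(state, reward)` pairs, with each reward being the one received after leaving that state.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC estimate of v_pi from sample episodes.

    `generate_episode` is an assumed callback: it follows the fixed policy pi
    for one episode and returns a list of (state, reward) pairs.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so G accumulates the discounted return from time t on.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:  # count only the first visit to s
                returns_sum[s] += G
                returns_count[s] += 1
    # The value estimate is simply the average of the observed returns.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Note that each state's estimate is an independent average of its own returns; no estimate is built from another, which is exactly the no-bootstrapping property described above.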

## 2. Monte Carlo Estimation of Action Values

The Monte Carlo methods for action value estimation are essentially the same as just presented for state values, except now we talk about visits to a state-action pair rather than to a state.

The only complication is that many state-action pairs may never be visited.
One way to address this is by specifying that episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. We call this the assumption of exploring starts. It is sometimes useful.
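Under that assumption, action-value estimation looks almost identical to state-value prediction, just keyed by state-action pairs. A sketch, assuming a hypothetical `generate_episode(s0, a0)` callback that starts the episode from the given pair and follows the policy thereafter:

```python
import random
from collections import defaultdict

def mc_q_exploring_starts(generate_episode, start_pairs, num_episodes, gamma=1.0):
    """First-visit MC estimate of q_pi under the exploring-starts assumption.

    `generate_episode(s0, a0)` is an assumed callback: the episode begins with
    the given state-action pair, follows pi afterwards, and is returned as a
    list of (state, action, reward) triples.  `start_pairs` lists every
    state-action pair, so each has a nonzero probability of starting an episode.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        s0, a0 = random.choice(start_pairs)  # exploring start
        episode = generate_episode(s0, a0)
        pairs = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:  # first visit to (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
    return {p: returns_sum[p] / returns_count[p] for p in returns_sum}
```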

## 3. Monte Carlo Control

The overall idea is to proceed according to the same pattern as in the DP chapter, that is, according to the idea of generalized policy iteration (GPI). We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.

## 4. Monte Carlo Control without Exploring Starts

The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them.
There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods.
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

On-policy method:
In on-policy control methods, the policy is generally soft, meaning that $\pi(a\mid s)>0$ for all $s\in S$ and all $a\in A(s)$.
The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft policies, defined as policies for which $\pi(a\mid s)\geq \frac{\varepsilon}{\left| A(s) \right|}$ for all states and actions, for some $\varepsilon>0$. That any $\varepsilon$-greedy policy with respect to $q_{\pi}$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem.
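Sampling from an $\varepsilon$-greedy policy can be sketched as below. The dict-of-pairs layout for `Q` is an illustrative assumption; any action-value lookup would do.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Sample an action from the epsilon-greedy policy derived from Q.

    With probability epsilon the action is chosen uniformly at random, so every
    action keeps probability at least epsilon / |A(s)|; the policy is therefore
    epsilon-soft.  `Q` is an assumed dict keyed by (state, action) pairs.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniform over A(s)
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: greedy action
```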

## 5. Off-policy Prediction via Importance Sampling

The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
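The key quantity in off-policy prediction is the importance-sampling ratio, the relative probability of the trajectory under the target and behavior policies. A minimal sketch, assuming hypothetical callables `target_policy(a, s)` and `behavior_policy(a, s)` that return $\pi(a\mid s)$ and $b(a\mid s)$:

```python
def importance_sampling_ratio(episode, target_policy, behavior_policy):
    """Importance-sampling ratio for a trajectory generated by the behavior policy.

    `episode` is a list of (state, action) pairs.  Multiplying a return
    observed under the behavior policy b by this ratio reweights it so that
    its expected value matches the one under the target policy pi.
    """
    rho = 1.0
    for s, a in episode:
        rho *= target_policy(a, s) / behavior_policy(a, s)
    return rho
```

This requires coverage: every action the target policy might take must have nonzero probability under the behavior policy, or the ratio is undefined.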