Monte Carlo Methods
Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. They require no prior knowledge of the environment's dynamics.
To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
(An episodic task is one in which, no matter what policy is followed, a terminal state is reached in finite time and a return is obtained.)
The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns (as opposed to methods that learn from partial returns, considered in the next chapter).
As in the DP chapter, first we consider the prediction problem (the computation of $v_\pi$ and $q_\pi$ for a fixed arbitrary policy $\pi$), then policy improvement, and, finally, the control problem and its solution by GPI. Each of these ideas taken from DP is extended to the Monte Carlo case in which only sample experience is available.
1. Monte Carlo Prediction
The idea underlying all Monte Carlo methods is to average the returns observed after visits to a state.
The first-visit MC method averages the returns following the first visit to a state in each episode, whereas the every-visit MC method averages the returns following every visit. The two methods are very similar but have slightly different theoretical properties.
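To make this concrete, here is a minimal first-visit MC prediction sketch in Python (not from the original notes). The generate_episode function, assumed to return one episode as a list of (state, reward) pairs produced by the policy being evaluated, and the discount factor gamma are illustrative assumptions.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging the returns that follow the first visit
    to each state in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # One episode is assumed to be a list of (state, reward) pairs,
        # where reward is the reward received after leaving that state.
        episode = generate_episode()
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)   # earliest time step of each state

        G = 0.0
        # Walk backwards so G is the discounted return from each time step.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:        # only the first visit contributes
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

For the every-visit variant, the `first_visit[state] == t` check would simply be dropped so that every occurrence of a state contributes to its average.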
Backup diagrams can also be used to present Monte Carlo methods.
The difference between DP and MC:
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap as we defined it in the previous chapter.
2. Monte Carlo Estimation of Action Values
The Monte Carlo methods for action value estimation are essentially the same as just presented for state values, except now we talk about visits to a state-action pair rather than to a state.
The only complication is that many state-action pairs may never be visited.
One way to ensure this is by specifying that the episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. We call this the assumption of exploring starts. It is sometimes useful; see the sketch below.
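As a rough illustration of the exploring-starts assumption, the sketch below estimates action values by first-visit averaging. The states and actions lists and the generate_episode_from(start_state, start_action) interface are hypothetical, introduced only for this example.

```python
import random
from collections import defaultdict

def mc_action_values_exploring_starts(states, actions, generate_episode_from,
                                      num_episodes, gamma=1.0):
    """Estimate q_pi(s, a) by first-visit averaging, using exploring starts:
    each episode begins from a state-action pair chosen uniformly at random,
    so every pair has a nonzero probability of being selected as the start."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = defaultdict(float)

    for _ in range(num_episodes):
        s0, a0 = random.choice(states), random.choice(actions)   # exploring start
        # One episode is assumed to be a list of (state, action, reward) triples.
        episode = generate_episode_from(s0, a0)
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```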
3. Monte Carlo Control
The overall idea is to proceed according to the same pattern as in the DP chapter, that is, according to the idea of generalized policy iteration (GPI).

We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.
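A hedged sketch of the resulting Monte Carlo ES control loop follows; it approximates GPI by alternating evaluation and improvement episode by episode rather than running policy evaluation for an infinite number of episodes. The environment interface is the same hypothetical one as in the previous sketch, except that generate_episode_from is assumed to follow the current policy after the exploring start.

```python
import random
from collections import defaultdict

def monte_carlo_es(states, actions, generate_episode_from, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts: after each episode the
    action-value estimates are updated and the policy is made greedy with
    respect to them (generalized policy iteration)."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}        # arbitrary initial policy

    for _ in range(num_episodes):
        s0, a0 = random.choice(states), random.choice(actions)  # exploring start
        # Assumed to follow `policy` after the first step and to return a
        # list of (state, action, reward) triples.
        episode = generate_episode_from(s0, a0, policy)

        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                policy[s] = max(actions, key=lambda b: Q[(s, b)])  # greedy improvement
    return policy, Q
```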
4. Monte Carlo Control without Exploring Starts
The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them.
There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods.
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
On-policy method:
In on-policy control methods, the policy is generally soft, meaning that $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$.
The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft policies, defined as policies for which $\pi(a \mid s) \ge \frac{\varepsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\varepsilon > 0$.
That any $\varepsilon$-greedy policy with respect to $q_\pi$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem.
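As a small illustration (not from the original text), the following sketch computes the $\varepsilon$-greedy action probabilities used in on-policy control: every action receives at least $\varepsilon / |\mathcal{A}(s)|$, and the greedy action receives the rest of the probability mass.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for an epsilon-greedy policy over a list of action values.
    Every action gets at least epsilon / |A(s)|, so the policy is epsilon-soft;
    the greedy action gets the remaining 1 - epsilon of the probability mass."""
    n = len(q_values)
    probs = [epsilon / n] * n                          # the epsilon-soft floor
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon                     # extra mass on the greedy action
    return probs

# With epsilon = 0.1 and three actions, each non-greedy action gets
# 0.1 / 3 ≈ 0.033 and the greedy action gets about 0.933.
print(epsilon_greedy_probs([1.0, 2.5, 0.3], epsilon=0.1))
```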
5. Off-policy Prediction via Importance Sampling
The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
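A minimal sketch of ordinary and weighted importance sampling for off-policy prediction, under the assumption that all episodes start in the state of interest and that target_policy(a, s) and behavior_policy(a, s) return $\pi(a \mid s)$ and $b(a \mid s)$; these interfaces are illustrative, not from the original.

```python
def off_policy_value_estimates(episodes, target_policy, behavior_policy, gamma=1.0):
    """Estimate v_pi(s) for a state s of interest from episodes that start in s
    but were generated by the behavior policy b.

    Each episode is a list of (state, action, reward) triples. rho is the
    product over the episode of pi(a|s) / b(a|s); ordinary importance sampling
    averages rho * G over episodes, while weighted importance sampling divides
    the same sum by the sum of the rhos instead of by the number of episodes."""
    scaled_returns, ratios = [], []
    for episode in episodes:
        G, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode:
            G += discount * r
            discount *= gamma
            rho *= target_policy(a, s) / behavior_policy(a, s)
        scaled_returns.append(rho * G)
        ratios.append(rho)

    ordinary = sum(scaled_returns) / len(episodes)
    weighted = sum(scaled_returns) / sum(ratios) if sum(ratios) > 0 else 0.0
    return ordinary, weighted
```

The weighted estimator is biased (the bias converges to zero as more episodes are seen) but typically has much lower variance than the ordinary estimator, which is why it is usually preferred in practice.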