How is the policy gradient calculated in REINFORCE? Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern) Announcing the arrival of Valued Associate #679: Cesar Manara Unicorn Meta Zoo #1: Why another podcast?Implementing experience replay in reinforcement learningWhy does the discount rate in the REINFORCE algorithm appears twice?Questions about n-step tree backup algorithmImportance Sampling Ratio ProbabilityWhat is the meaning of Model(s, a) in the prioritized sweeping algorithm?Impact of Varying Length Trajectories on Policy Gradient OptimizationWhy is the state value function sufficient to determine the policy if a model is available?Is REINFORCE the same as 'vanilla policy gradient'?Difficulty understanding Monte Carlo policy evaluation (state-value) for gridworldPolicy gradient loss for neural network training

Where and when has Thucydides been studied?

Why not use the yoke to control yaw, as well as pitch and roll?

Does the main washing effect of soap come from foam?

Russian equivalents of おしゃれは足元から (Every good outfit starts with the shoes)

What is the proper term for etching or digging of wall to hide conduit of cables

Did John Wesley plagiarize Matthew Henry...?

What should one know about term logic before studying propositional and predicate logic?

How do you cope with tons of web fonts when copying and pasting from web pages?

How to achieve cat-like agility?

Keep at all times, the minus sign above aligned with minus sign below

Is there a verb for listening stealthily?

Is there night in Alpha Complex?

Is a copyright notice with a non-existent name be invalid?

How can I list files in reverse time order by a command and pass them as arguments to another command?

"Destructive power" carried by a B-52?

Dinosaur Word Search, Letter Solve, and Unscramble

How to infer difference of population proportion between two groups when proportion is small?

French equivalents of おしゃれは足元から (Every good outfit starts with the shoes)

New Order #6: Easter Egg

Vertical ranges of Column Plots in 12

How many time has Arya actually used Needle?

How do Java 8 default methods hеlp with lambdas?

The test team as an enemy of development? And how can this be avoided?

How to ask rejected full-time candidates to apply to teach individual courses?



How is the policy gradient calculated in REINFORCE?



Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern)
Announcing the arrival of Valued Associate #679: Cesar Manara
Unicorn Meta Zoo #1: Why another podcast?Implementing experience replay in reinforcement learningWhy does the discount rate in the REINFORCE algorithm appears twice?Questions about n-step tree backup algorithmImportance Sampling Ratio ProbabilityWhat is the meaning of Model(s, a) in the prioritized sweeping algorithm?Impact of Varying Length Trajectories on Policy Gradient OptimizationWhy is the state value function sufficient to determine the policy if a model is available?Is REINFORCE the same as 'vanilla policy gradient'?Difficulty understanding Monte Carlo policy evaluation (state-value) for gridworldPolicy gradient loss for neural network training



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








4












$begingroup$


Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.










share|improve this question











$endgroup$











  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    9 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago

















4












$begingroup$


Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.










share|improve this question











$endgroup$











  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    9 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago













4












4








4


1



$begingroup$


Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.










share|improve this question











$endgroup$




Reading Sutton and Barto, I see the following in describing policy gradients:



policy grad



How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the gradient is computed, since we need some loss function to compute the gradient.



I've seen a good PyTorch article, but I still don't understand the meaning of this gradient conceptually, and I don't know what I'm looking to implement. Any intuition that you could provide would be helpful.







reinforcement-learning policy-gradients rl-an-introduction notation reinforce






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 1 hour ago









Philip Raeisghasem

1,169121




1,169121










asked 10 hours ago









HanzyHanzy

1516




1516











  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    9 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago
















  • $begingroup$
    Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
    $endgroup$
    – nbro
    9 hours ago











  • $begingroup$
    @nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
    $endgroup$
    – Hanzy
    9 hours ago










  • $begingroup$
    @nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
    $endgroup$
    – Hanzy
    9 hours ago






  • 1




    $begingroup$
    I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
    $endgroup$
    – nbro
    8 hours ago











  • $begingroup$
    @nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
    $endgroup$
    – Hanzy
    4 hours ago















$begingroup$
Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
$endgroup$
– nbro
9 hours ago





$begingroup$
Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $pi(A_t mid S_t, theta_t)$.
$endgroup$
– nbro
9 hours ago













$begingroup$
@nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
$endgroup$
– Hanzy
9 hours ago




$begingroup$
@nbro Conceptually I understand that, but the action is the output of the policy. So what I don’t understand is how conceptually we are determining the gradient of the policy with respect to the action it output? Further if the action is making a choice, that wouldn’t be differentiable. I looked at the implementation that used an action probability instead. But I’m still unsure what is the gradient of the policy with respect to the output of the policy. It doesn’t make sense to me yet.
$endgroup$
– Hanzy
9 hours ago












$begingroup$
@nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
$endgroup$
– Hanzy
9 hours ago




$begingroup$
@nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight.
$endgroup$
– Hanzy
9 hours ago




1




1




$begingroup$
I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
$endgroup$
– nbro
8 hours ago





$begingroup$
I think this is just a notation or terminology issue. To train your neural network representing $pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: github.com/pytorch/examples/blob/master/reinforcement_learning/….
$endgroup$
– nbro
8 hours ago













$begingroup$
@nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
$endgroup$
– Hanzy
4 hours ago




$begingroup$
@nbro thanks for the link; I actually think I’m almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it’s not dependent on $theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards wrt to theta (which, in turn, effects the choice of action). Is this correct? (I’m going back to read it again now).
$endgroup$
– Hanzy
4 hours ago










1 Answer
1






active

oldest

votes


















2












$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracS_t,theta_t)S_t,theta_t)\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago











Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "658"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f11929%2fhow-is-the-policy-gradient-calculated-in-reinforce%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









2












$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracS_t,theta_t)S_t,theta_t)\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago















2












$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracS_t,theta_t)S_t,theta_t)\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago













2












2








2





$begingroup$

The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracS_t,theta_t)S_t,theta_t)\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.






share|improve this answer











$endgroup$



The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.




The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.



Usually, we take a derivative/gradient of some loss function $mathcalL$ because we want to minimize that loss. So we update our parameters in the direction opposite the direction of the gradient.



$$theta_t+1 = theta_t - alphanabla_theta_t mathcalL tag1$$



In policy gradient methods, we're not trying to minimize a loss function. Actually, we're trying to maximize some measure $J$ of the performance of our agent. So now we want to update parameters in the same direction as the gradient.



$$theta_t+1 = theta_t + alphanabla_theta_t J tag2$$



In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that



$$beginalign
nabla_theta_tJ(theta_t) &propto sum_s mu(s)sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)\
&=mathbbE_mu left[ sum_a q_pi (s,a) nabla_theta_t pi (a|s,theta_t)right].
endaligntag3
$$



The rest of the derivation is in your question, so let's skip to the end.



$$beginalign
theta_t+1 &= theta_t + alpha G_t fracS_t,theta_t)S_t,theta_t)\
&= theta_t + alpha G_t nabla_theta_t ln pi(A_t|S_t,theta_t)
endaligntag4$$



Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)




Alright, but how do we actually get this gradient? Well, you use the chain rule of derivatives (backpropagation). Practically, though, both Tensorflow and PyTorch can take all the derivatives for you.



Tensorflow, for example, has a minimize() method in its Optimizer class that takes a loss function as an input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize. We want to maximize! So just include a negative sign.



In our case, the function we want to minimize is
$$-G_tln pi(A_t|S_t,theta_t).$$



This corresponds to stochastic gradient descent ($G_t$ is not a function of $theta_t$).



You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $nabla_theta_t J$. If so, you would instead minimize
$$-sum_t G_tln pi(A_t|S_t,theta_t),$$
where $theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.







share|improve this answer














share|improve this answer



share|improve this answer








edited 24 mins ago

























answered 2 hours ago









Philip RaeisghasemPhilip Raeisghasem

1,169121




1,169121











  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago
















  • $begingroup$
    Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
    $endgroup$
    – Hanzy
    2 hours ago















$begingroup$
Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
$endgroup$
– Hanzy
2 hours ago




$begingroup$
Thanks you answered one of my questions about an implementation summing over the products of the (negative) log probabilities and their associated returns (I hadn’t considered it as mini batch implementation). I really appreciate the detailed answer. I guess I was trying to figure out how to represent it in a framework but I got bogged down in the details and forgot that Sutton / Barto mentioned that $J$ is a representation of the value of the start state. With the two answers provided I’m starting to find my way.
$endgroup$
– Hanzy
2 hours ago

















draft saved

draft discarded
















































Thanks for contributing an answer to Artificial Intelligence Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f11929%2fhow-is-the-policy-gradient-calculated-in-reinforce%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

یوتیوب محتویات پیشینه[ویرایش] فناوری‌های ویدئویی[ویرایش] شوخی‌های آوریل[ویرایش] سانسور و فیلترینگ[ویرایش] آمار و ارقامی از یوتیوب[ویرایش] تأثیر اجتماعی[ویرایش] سیاست اجتماعی[ویرایش] نمودارها[ویرایش] یادداشت‌ها[ویرایش] پانویس[ویرایش] پیوند به بیرون[ویرایش] منوی ناوبریبررسی شده‌استYouTube.com[بروزرسانی]"Youtube.com Site Info""زبان‌های یوتیوب""Surprise! There's a third YouTube co-founder"سایت یوتیوب برای چندمین بار در ایران فیلتر شدنسخهٔ اصلیسالار کمانگر جوان آمریکایی ایرانی الاصل مدیر سایت یوتیوب شدنسخهٔ اصلیVideo websites pop up, invite postingsthe originalthe originalYouTube: Overnight success has sparked a backlashthe original"Me at the zoo"YouTube serves up 100 million videos a day onlinethe originalcomScore Releases May 2010 U.S. Online Video Rankingsthe originalYouTube hits 4 billion daily video viewsthe originalYouTube users uploading two days of video every minutethe originalEric Schmidt, Princeton Colloquium on Public & Int'l Affairsthe original«Streaming Dreams»نسخهٔ اصلیAlexa Traffic Rank for YouTube (three month average)the originalHelp! YouTube is killing my business!the originalUtube sues YouTubethe originalGoogle closes $A2b YouTube dealthe originalFlash moves on to smart phonesthe originalYouTube HTML5 Video Playerنسخهٔ اصلیYouTube HTML5 Video Playerthe originalGoogle tries freeing Web video with WebMthe originalVideo length for uploadingthe originalYouTube caps video lengths to reduce infringementthe originalAccount Types: Longer videosthe originalYouTube bumps video limit to 15 minutesthe originalUploading large files and resumable uploadingthe originalVideo Formats: File formatsthe originalGetting Started: File formatsthe originalThe quest for a new video codec in Flash 8the originalAdobe Flash Video File Format Specification Version 10.1the originalYouTube Mobile goes livethe originalYouTube videos go HD with a simple hackthe originalYouTube now supports 4k-resolution videosthe originalYouTube to get high-def 1080p playerthe original«Approximate YouTube Bitrates»نسخهٔ اصلی«Bigger and Better: Encoding for YouTube 720p HD»نسخهٔ اصلی«YouTube's 1080p – Failure Depends on How You Look At It»نسخهٔ اصلیYouTube in 3Dthe originalYouTube in 3D?the originalYouTube 3D Videosthe originalYouTube adds a dimension, 3D goggles not includedthe originalYouTube Adds Stereoscopic 3D Video Support (And 3D Vision Support, Too)the original«Sharing YouTube Videos»نسخهٔ اصلی«Downloading videos from YouTube is not supported, except for one instance when it is permitted.»نسخهٔ اصلی«Terms of Use, 5.B»نسخهٔ اصلی«Some YouTube videos get download option»نسخهٔ اصلی«YouTube looks out for content owners, disables video ripping»«Downloading videos from YouTube is not supported, except for one instance when it is permitted.»نسخهٔ اصلی«YouTube Hopes To Boost Revenue With Video Downloads»نسخهٔ اصلی«YouTube Mobile»نسخهٔ اصلی«YouTube Live on Apple TV Today; Coming to iPhone on June 29»نسخهٔ اصلی«Goodbye Flash: YouTube mobile goes HTML5 on iPhone and Android»نسخهٔ اصلی«YouTube Mobile Goes HTML5, Video Quality Beats Native Apps Hands Down»نسخهٔ اصلی«TiVo Getting YouTube Streaming Today»نسخهٔ اصلی«YouTube video comes to Wii and PlayStation 3 game consoles»نسخهٔ اصلی«Coming Up Next... YouTube on Your TV»نسخهٔ اصلی«Experience YouTube XL on the Big Screen»نسخهٔ اصلی«Xbox Live Getting Live TV, YouTube & Bing Voice Search»نسخهٔ اصلی«YouTube content locations»نسخهٔ اصلی«April fools: YouTube turns the world up-side-down»نسخهٔ اصلی«YouTube goes back to 1911 for April Fools' Day»نسخهٔ اصلی«Simon Cowell's bromance, the self-driving Nascar and Hungry Hippos for iPad... the best April Fools' gags»نسخهٔ اصلی"YouTube Announces It Will Shut Down""YouTube Adds Darude 'Sandstorm' Button To Its Videos For April Fools' Day"«Censorship fears rise as Iran blocks access to top websites»نسخهٔ اصلی«China 'blocks YouTube video site'»نسخهٔ اصلی«YouTube shut down in Morocco»نسخهٔ اصلی«Thailand blocks access to YouTube»نسخهٔ اصلی«Ban on YouTube lifted after deal»نسخهٔ اصلی«Google's Gatekeepers»نسخهٔ اصلی«Turkey goes into battle with Google»نسخهٔ اصلی«Turkey lifts two-year ban on YouTube»نسخهٔ اصلیسانسور در ترکیه به یوتیوب رسیدلغو فیلترینگ یوتیوب در ترکیه«Pakistan blocks YouTube website»نسخهٔ اصلی«Pakistan lifts the ban on YouTube»نسخهٔ اصلی«Pakistan blocks access to YouTube in internet crackdown»نسخهٔ اصلی«Watchdog urges Libya to stop blocking websites»نسخهٔ اصلی«YouTube»نسخهٔ اصلی«Due to abuses of religion, customs Emirates, YouTube is blocked in the UAE»نسخهٔ اصلی«Google Conquered The Web - An Ultimate Winner»نسخهٔ اصلی«100 million videos are viewed daily on YouTube»نسخهٔ اصلی«Harry and Charlie Davies-Carr: Web gets taste for biting baby»نسخهٔ اصلی«Meet YouTube's 224 million girl, Natalie Tran»نسخهٔ اصلی«YouTube to Double Down on Its 'Channel' Experiment»نسخهٔ اصلی«13 Some Media Companies Choose to Profit From Pirated YouTube Clips»نسخهٔ اصلی«Irate HK man unlikely Web hero»نسخهٔ اصلی«Web Guitar Wizard Revealed at Last»نسخهٔ اصلی«Charlie bit my finger – again!»نسخهٔ اصلی«Lowered Expectations: Web Redefines 'Quality'»نسخهٔ اصلی«YouTube's 50 Greatest Viral Videos»نسخهٔ اصلیYouTube Community Guidelinesthe original«Why did my YouTube account get closed down?»نسخهٔ اصلی«Why do I have a sanction on my account?»نسخهٔ اصلی«Is YouTube's three-strike rule fair to users?»نسخهٔ اصلی«Viacom will sue YouTube for $1bn»نسخهٔ اصلی«Mediaset Files EUR500 Million Suit Vs Google's YouTube»نسخهٔ اصلی«Premier League to take action against YouTube»نسخهٔ اصلی«YouTube law fight 'threatens net'»نسخهٔ اصلی«Google must divulge YouTube log»نسخهٔ اصلی«Google Told to Turn Over User Data of YouTube»نسخهٔ اصلی«US judge tosses out Viacom copyright suit against YouTube»نسخهٔ اصلی«Google and Viacom: YouTube copyright lawsuit back on»نسخهٔ اصلی«Woman can sue over YouTube clip de-posting»نسخهٔ اصلی«YouTube loses court battle over music clips»نسخهٔ اصلیYouTube to Test Software To Ease Licensing Fightsthe original«Press Statistics»نسخهٔ اصلی«Testing YouTube's Audio Content ID System»نسخهٔ اصلی«Content ID disputes»نسخهٔ اصلیYouTube Community Guidelinesthe originalYouTube criticized in Germany over anti-Semitic Nazi videosthe originalFury as YouTube carries sick Hillsboro video insultthe originalYouTube attacked by MPs over sex and violence footagethe originalAl-Awlaki's YouTube Videos Targeted by Rep. Weinerthe originalYouTube Withdraws Cleric's Videosthe originalYouTube is letting users decide on terrorism-related videosthe original«Time's Person of the Year: You»نسخهٔ اصلی«Our top 10 funniest YouTube comments – what are yours?»نسخهٔ اصلی«YouTube's worst comments blocked by filter»نسخهٔ اصلی«Site Info YouTube»نسخهٔ اصلیوبگاه YouTubeوبگاه موبایل YouTubeوووووو

Magento 2 - Auto login with specific URL Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern) Announcing the arrival of Valued Associate #679: Cesar Manara Unicorn Meta Zoo #1: Why another podcast?Customer can't login - Page refreshes but nothing happensCustom Login page redirectURL to login with redirect URL after completionCustomer login is case sensitiveLogin with phone number or email address - Magento 1.9Magento 2: Set Customer Account Confirmation StatusCustomer auto connect from URLHow to call customer login form in the custom module action magento 2?Change of customer login error message magento2Referrer URL in modal login form

Rest API with Magento using PHP with example. Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern) Announcing the arrival of Valued Associate #679: Cesar Manara Unicorn Meta Zoo #1: Why another podcast?How to update product using magento client library for PHP?Oauth Error while extending Magento Rest APINot showing my custom api in wsdl(url) and web service list?Using Magento API(REST) via IXMLHTTPRequest COM ObjectHow to login in Magento website using REST APIREST api call for Guest userMagento API calling using HTML and javascriptUse API rest media management by storeView code (admin)Magento REST API Example ErrorsHow to log all rest api calls in magento2?How to update product using magento client library for PHP?