Question on deriving algorithm for Maximum Entropy Inverse Reinforcement Learning

by minch   Last Updated September 12, 2019 01:19 AM

I am trying to make sense of Algorithm 1 in Ziebart et al.'s classic "Maximum Entropy Inverse Reinforcement Learning." I can't figure out why the "Backward pass" pseudocode is a sensible way to compute the action probabilities; the paper gives no motivation for it. If anyone understands this part of the paper, please provide some guidance on how to make sense of it.
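For concreteness, here is my reading of the backward pass in code. The toy MDP (a 3-state chain with a rewarding terminal state), the variable names, and the horizon are all my own illustration, not from the paper; the recursion follows the Z_a / Z_s updates as I understand them from Algorithm 1.

```python
import numpy as np

# Toy MDP for illustration (my own, not from the paper):
# states 0..2 on a chain, state 2 is terminal and rewarding;
# action 0 = stay, action 1 = move right (capped at the last state).
n_states, n_actions, horizon = 3, 2, 5
terminal = 2
reward = np.array([0.0, 0.0, 1.0])  # state rewards, reward(s | theta)

def next_state(s, a):
    return s if a == 0 else min(s + 1, n_states - 1)

# Backward pass: Z_s accumulates soft "trajectory mass" flowing backward
# from the terminal state over `horizon` recursion steps.
Z_s = np.zeros(n_states)
Z_s[terminal] = 1.0
Z_a = np.zeros((n_states, n_actions))
for _ in range(horizon):
    for s in range(n_states):
        for a in range(n_actions):
            # deterministic transitions, so no sum over successor states
            Z_a[s, a] = np.exp(reward[s]) * Z_s[next_state(s, a)]
    Z_s = Z_a.sum(axis=1)
    Z_s[terminal] += 1.0  # terminal state keeps absorbing mass

# Local action probability computation: P(a | s) = Z_a / Z_s
P = Z_a / Z_s[:, None]
```

On this toy chain, each non-terminal state's action probabilities sum to one, and the action moving toward the rewarding terminal state gets the larger probability, which is at least consistent with the algorithm computing a soft value recursion.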

I am also confused about the section "Stochastic Policies." Why should it be true that the probability of an action is proportional to the exponentiated-reward mass summed over all trajectories that begin with that action? Again, there is no derivation of this fact.
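To check the claim I brute-forced it on a tiny example. The 2-state toggle MDP, horizon, and names below are my own illustration: it enumerates every fixed-length action sequence from a start state, sums exp(trajectory reward) per first action, and normalizes, which is my reading of what the "Stochastic Policies" section asserts P(a | s) should be proportional to.

```python
import numpy as np
from itertools import product

# Toy 2-state MDP (my own, for illustration): action 0 stays in the
# current state, action 1 toggles it; state 1 carries the reward.
n_states, horizon = 2, 4
reward = np.array([0.0, 1.0])  # state rewards

def step(s, a):
    return s if a == 0 else 1 - s

start = 0
totals = np.zeros(2)  # exp(reward) mass, grouped by the first action taken
for actions in product((0, 1), repeat=horizon):
    s, r = start, 0.0
    for a in actions:
        s = step(s, a)
        r += reward[s]
    totals[actions[0]] += np.exp(r)

# The claimed action distribution at `start`: normalize the per-first-action
# trajectory mass.
P = totals / totals.sum()
```

Here the first action that toggles into the rewarding state collects strictly more trajectory mass, so P favors it, matching the intuition that the maximum-entropy policy weights actions by the soft value of what can follow them.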

Paper here:
