I am trying to make sense of Algorithm 1 in Ziebart et al.'s classic "Maximum Entropy Inverse Reinforcement Learning." I can't figure out why the "Backward pass" pseudocode is a sensible way to compute the action probabilities; the paper seems to give no motivation for it. If anyone understands this part of the paper, could you provide some guidance on how to make sense of it?
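To make my confusion concrete, here is my reading of the backward pass on a tiny deterministic MDP of my own invention (the states, transitions, and rewards are made up for illustration, not taken from the paper). I am interpreting the pseudocode as: seed terminal states with Z = 1, then repeatedly set Z_{a_ij} = exp(reward(s_i)) * sum_k P(s_k | s_i, a_ij) Z_{s_k} and Z_{s_i} = sum_j Z_{a_ij}, and finally take P(a_ij | s_i) = Z_{a_ij} / Z_{s_i}:

```python
import numpy as np

# Toy deterministic MDP (my own example): states 0..3, state 3 terminal.
# transitions[s][a] = successor state; reward is collected at the state acted in.
transitions = {0: [1, 2], 1: [3], 2: [3]}
reward = {0: 0.0, 1: 1.0, 2: -1.0, 3: 0.0}
horizon = 4  # number of backward recursion steps

# Backward pass, as I read Algorithm 1:
Z_s = {s: 0.0 for s in range(4)}
Z_s[3] = 1.0  # Z_{terminal} = 1
for _ in range(horizon):
    # Z_{a_ij} = exp(reward(s_i)) * sum_k P(s_k | s_i, a_ij) * Z_{s_k}
    # (transitions are deterministic here, so the sum is a single term)
    Z_a = {s: [np.exp(reward[s]) * Z_s[nxt] for nxt in succ]
           for s, succ in transitions.items()}
    # Z_{s_i} = sum_j Z_{a_ij}; terminal state keeps Z = 1
    Z_s = {s: sum(Z_a[s]) for s in transitions}
    Z_s[3] = 1.0

# Local action probabilities: P(a_ij | s_i) = Z_{a_ij} / Z_{s_i}
policy = {s: [z / Z_s[s] for z in Z_a[s]] for s in transitions}
print(policy[0])  # [e / (e + e**-1), e**-1 / (e + e**-1)] ≈ [0.881, 0.119]
```

Mechanically this runs fine and produces a valid distribution over actions; what I don't see is *why* this recursion is the right way to obtain the maximum-entropy action probabilities.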
I am also confused about the section "Stochastic Policies." Why should it be true that the probability of an action is proportional to the sum of the exponentiated rewards of all trajectories that begin with that action? Again, the paper gives no derivation of this fact.
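For what it's worth, I can verify the claim numerically by brute force on a small deterministic MDP of my own (states, transitions, and rewards invented for illustration): enumerate every complete trajectory, weight each by exp of its total reward, and group the weights by first action. But a numerical coincidence is not a derivation, and I'd like to understand why it holds in general:

```python
import math

# Toy deterministic MDP (my construction): states 0..3, state 3 terminal,
# reward collected at the state acted in.
transitions = {0: [1, 2], 1: [3], 2: [3]}
reward = {0: 0.0, 1: 1.0, 2: -1.0, 3: 0.0}

def trajectories(state):
    """Enumerate all complete state sequences from `state` to the terminal."""
    if state not in transitions:  # terminal state
        yield [state]
        return
    for nxt in transitions[state]:
        for tail in trajectories(nxt):
            yield [state] + tail

# Claim in "Stochastic Policies": P(a | s) is proportional to the sum,
# over all trajectories beginning with action a, of exp(total reward).
weights = []
for nxt in transitions[0]:  # one weight per action available in state 0
    w = sum(math.exp(reward[0] + sum(reward[s] for s in tail[:-1]))
            for tail in trajectories(nxt))
    weights.append(w)
probs = [w / sum(weights) for w in weights]
print(probs)  # ≈ [0.881, 0.119]
```

On this example the normalized trajectory sums agree with the local action probabilities that the backward pass of Algorithm 1 produces, which suggests the two sections are describing the same quantity, but I can't see how to show it.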