A Distributional Perspective on Reinforcement Learning
Understanding distributional RL
- distributional RL: instead of treating quantities like the value of a state or state-action pair as single scalar estimates, you treat them as distributions over returns.
- you learn the value distribution and then use that to update your parameters.
- specifically, instead of learning a function that approximates the q-value of a given state-action pair, you learn a distribution over returns for that state-action pair.
- so the q-update is no longer dropping the bootstrapped target value in place of the old estimate; instead you apply the Bellman operator to the target distribution, project the result back onto the fixed support, and minimize the distance (a cross-entropy / KL term, sketched at the end of these notes) between that projection and the currently predicted distribution.
- the value distribution is modelled as a discrete distribution whose support is a fixed set of atoms (51 evenly spaced returns between V_MIN and V_MAX in the paper, hence "C51"). the atoms are the canonical returns of the distribution; a small sketch of the support appears after this list.
- training details are very similar to DQN's, except that instead of outputting a single q-value per action, the network outputs a probability for each atom, per action (a softmax over atoms for every action).
- the q-value is the expectation of the distribution: the probability-weighted sum of the atoms.
- the update is done by
    - applying the Bellman operator to each atom (shift by the reward, shrink by gamma), which generally lands between atoms of the fixed support
    - projecting back onto the support by distributing each shifted atom's probability to its two nearest neighbouring atoms (see the projection sketch after this list)
- in practice, they take the mean of the distribution as the q-value and pick actions greedily with respect to it
- however, this yields improvements even though action selection still uses only the mean, the same quantity DQN acts on; exactly why learning the full distribution helps so much is not entirely clear.
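
A minimal NumPy sketch of the fixed support and of collapsing atom probabilities into q-values for action selection. The 51 atoms and the [-10, 10] range follow the paper's Atari settings, but the function names and array shapes here are my own illustrative assumptions, not the authors' code.

```python
import numpy as np

# C51-style fixed support: 51 atoms evenly spaced on [V_MIN, V_MAX].
N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)           # shape (N_ATOMS,)

def q_values(atom_probs):
    """Collapse per-action atom probabilities into scalar q-values.

    atom_probs: (num_actions, N_ATOMS), each row a distribution over atoms.
    Returns the expected return of each action.
    """
    return (atom_probs * atoms).sum(axis=1)           # shape (num_actions,)

def greedy_action(atom_probs):
    """Act greedily with respect to the mean of each action's distribution."""
    return int(np.argmax(q_values(atom_probs)))
```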
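A sketch of the projection step under the same assumptions: the Bellman operator maps each atom z_j to r + gamma * z_j, which generally falls between atoms of the fixed support, so its probability mass is split between the two nearest atoms in proportion to how close it lands to each. The batching and the helper's signature are illustrative, not the paper's implementation.

```python
import numpy as np

def project_distribution(next_probs, rewards, dones, gamma, atoms, v_min, v_max):
    """Project the Bellman-updated distribution back onto the fixed support.

    next_probs: (batch, N) atom probabilities of the next state's chosen action
    rewards:    (batch,)   immediate rewards
    dones:      (batch,)   1.0 where the episode terminated, else 0.0
    """
    n_atoms = atoms.shape[0]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Shifted/shrunk support: T z_j = r + gamma * z_j, clipped to [v_min, v_max].
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * atoms[None, :]
    tz = np.clip(tz, v_min, v_max)                     # (batch, N)

    # Fractional position of each shifted atom on the original support.
    b = (tz - v_min) / delta_z                         # values in [0, n_atoms - 1]
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros_like(next_probs)
    rows = np.arange(next_probs.shape[0])[:, None]

    # Split each atom's probability between its two neighbouring atoms.
    np.add.at(projected, (rows, lower), next_probs * (upper - b))
    np.add.at(projected, (rows, upper), next_probs * (b - lower))
    # When b lands exactly on an atom (lower == upper), both weights above are
    # zero, so give that atom the full probability mass directly.
    np.add.at(projected, (rows, lower), next_probs * (lower == upper))
    return projected
```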
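The "distance" that gets minimized is then the cross-entropy between the projected target distribution and the network's predicted distribution (equivalently, the KL divergence up to a constant). A one-function sketch, again with illustrative names:

```python
import numpy as np

def distributional_loss(projected_target, predicted_probs, eps=1e-8):
    """Per-sample cross-entropy between the projected Bellman target and the
    predicted atom probabilities; this is the quantity minimized in training."""
    return -(projected_target * np.log(predicted_probs + eps)).sum(axis=1)
```

In practice, implementations typically compute this from the network's logits with a log-softmax for numerical stability rather than taking the log of probabilities directly.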