Dora Q-Learning - making better use of explorations
Abstract
Q-learning algorithms with eligibility traces (hereafter referred to as Q(λ)) record the stack of (state, action) pairs enacted during a learning episode, enabling any rewards observed to be back-propagated down the stack, thus speeding up learning. In standard Q(λ), the eligibility trace is cut (reset to an empty stack) after an explore action, so any good results found further on can take a long time to percolate back to the initial state. We present Dora, an adaptation of Q(λ) that makes better use of results found when exploring, and therefore learns consistently faster. In Dora, our aim is to avoid cutting the trace on an explore whenever possible. The idea is simple and natural, but to the best of our knowledge it has not been developed in this way before. The principle of Dora could be argued to resemble that of experience replay [Long-Ji Lin, 1991], but Dora is not model-based, has fewer parameters, and consumes less memory, whilst still giving excellent results.
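To make the trace-cutting behaviour concrete, below is a minimal sketch of tabular Watkins's Q(λ) with ε-greedy exploration. The `cut_trace_on_explore` flag contrasts the standard rule (clear the trace after a non-greedy action) with a Dora-style variant that keeps it. This is illustrative only: the abstract says Dora avoids cutting "if possible" without stating the exact criterion, so the sketch simply never cuts when the flag is off. The `env.reset()` / `env.step()` / `env.actions` interface and the names used are assumptions, not the paper's code.

```python
# Illustrative sketch only, not the authors' implementation.
# Assumed environment interface: env.reset() -> state,
# env.step(action) -> (next_state, reward, done), env.actions -> list.
import random
from collections import defaultdict

def q_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.9,
                     epsilon=0.1, cut_trace_on_explore=True):
    """Run one learning episode, mutating Q (a defaultdict(float)) in place."""
    trace = defaultdict(float)              # eligibility trace e(s, a)
    state = env.reset()
    done = False
    while not done:
        greedy = max(env.actions, key=lambda a: Q[(state, a)])
        action = random.choice(env.actions) if random.random() < epsilon else greedy

        next_state, reward, done = env.step(action)
        best_next = max(Q[(next_state, a)] for a in env.actions)
        delta = reward + gamma * best_next - Q[(state, action)]

        trace[(state, action)] += 1.0       # mark the pair just enacted
        for sa in list(trace):              # propagate the TD error down the trace
            Q[sa] += alpha * delta * trace[sa]
            trace[sa] *= gamma * lam        # decay eligibility

        if cut_trace_on_explore and action != greedy:
            trace.clear()                   # standard Watkins cut; Dora avoids this
        state = next_state

# Usage: Q = defaultdict(float); then call with
# cut_trace_on_explore=False for the Dora-style variant.
```

With the cut disabled, a reward found after an exploratory step updates every (state, action) pair still on the trace in the same episode, which is the mechanism behind the faster learning claimed above.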