This paper presented a framework for multi-policy inner-loop
MORL training. The framework is modular: any of its parts
can be changed without modifying the others. The framework
also revealed the possibility of applying state-of-the-art
optimization methods to MORL.
Three new exploration strategies were presented
to address the exploration problem in this setting: Repulsive
Pheromones-based, Count-based and Tabu-based. They are
inspired by existing work in metaheuristics. All three proposed
strategies perform better than the current state-of-the-art
ε-greedy based methods on the studied environments. The
results also show that behavioral policies which do not
converge to a greedy policy perform better than converging
ones during the learning phase.
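To make the contrast with ε-greedy concrete, the following is a minimal sketch of how count-based and tabu-based action selection could look. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the class names, the `bonus_weight` and `tenure` parameters, and the bonus formula are hypothetical, and `action_values` stands in for whatever scalar quality estimate a multi-policy method assigns to each action.

```python
import math
from collections import defaultdict, deque


class CountBasedSelector:
    """Hypothetical count-based exploration: favor rarely tried actions."""

    def __init__(self, bonus_weight=1.0):
        self.bonus_weight = bonus_weight
        self.counts = defaultdict(int)  # (state, action) -> number of selections

    def select(self, state, action_values):
        # action_values: dict mapping action -> scalar quality estimate (assumed).
        def score(action):
            n = self.counts[(state, action)]
            # The exploration bonus shrinks as the pair is selected more often.
            return action_values[action] + self.bonus_weight / math.sqrt(n + 1)

        best = max(sorted(action_values), key=score)
        self.counts[(state, best)] += 1
        return best


class TabuSelector:
    """Hypothetical tabu-based exploration: forbid recently taken actions."""

    def __init__(self, tenure=2):
        # Per-state list of the last `tenure` actions taken in that state.
        self.tabu = defaultdict(lambda: deque(maxlen=tenure))

    def select(self, state, action_values):
        recent = self.tabu[state]
        allowed = [a for a in sorted(action_values) if a not in recent]
        if not allowed:
            # Every action is tabu: fall back to the full action set.
            allowed = sorted(action_values)
        best = max(allowed, key=lambda a: action_values[a])
        recent.append(best)
        return best


values = {"left": 1.0, "right": 0.9}
count_based = CountBasedSelector(bonus_weight=1.0)
print([count_based.select("s0", values) for _ in range(4)])
tabu_based = TabuSelector(tenure=1)
print([tabu_based.select("s0", values) for _ in range(3)])
```

Neither selector decays toward a purely greedy choice in this sketch: the tabu list keeps forcing alternatives for as long as it is used, and the count bonus keeps steering selection toward less tried actions whenever estimated values are close.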
Because benchmark environments are lacking,
this paper also proposes a new benchmark called the
Mirrored Deep Sea Treasure. It is a harder version of
the well-known Deep Sea Treasure.
As future work, we plan to introduce new
(stochastic) benchmarks, to use MORL algorithms to
control actual robots, and to continue studying the
application of traditional optimization techniques to
MORL.

This work is funded by the Fonds National de la
Recherche Luxembourg (FNR), CORE program under
the ADARS Project, ref. C20/IS/14762457.
