6 CONCLUSION
In this paper, a framework for multi-policy inner-loop
MORL training has been presented. The framework is
modular: individual components can be replaced without
modifying the others. From this framework, the possibility
of applying state-of-the-art optimization methods to
MORL was identified.
Three new exploration strategies were presented
to address the exploration problem in this setting:
Repulsive Pheromones-based, Count-based, and Tabu-based.
These strategies are inspired by existing work in
metaheuristics. All three outperform the current
state-of-the-art ε-greedy-based methods on the studied
environments. Moreover, it was shown that behavioral
policies that do not converge towards greediness perform
better during the learning phase than those that do.
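To illustrate the general flavor of one of these strategies, the sketch below shows a generic count-based action selection rule: an exploration bonus that decays with the visit count of each state-action pair is added to the (scalarized) action values. This is only a minimal illustration of the count-based idea under assumed names (CountBasedExplorer, beta), not the exact algorithm evaluated in this paper.

```python
import numpy as np
from collections import defaultdict


class CountBasedExplorer:
    """Generic count-based exploration sketch: favors actions
    whose (state, action) pair has been visited less often."""

    def __init__(self, n_actions, beta=1.0):
        self.n_actions = n_actions
        self.beta = beta                  # weight of the exploration bonus (assumed parameter)
        self.counts = defaultdict(int)    # visit count per (state, action)

    def select_action(self, state, q_values):
        # q_values: array-like of scalarized action values for `state`
        bonus = np.array([
            self.beta / np.sqrt(1 + self.counts[(state, a)])
            for a in range(self.n_actions)
        ])
        action = int(np.argmax(np.asarray(q_values, dtype=float) + bonus))
        self.counts[(state, action)] += 1
        return action
```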
To address the lack of benchmark environments,
this paper also proposes a new benchmark, the
Mirrored Deep Sea Treasure, a harder variant of
the well-known Deep Sea Treasure.
As future work, we plan to introduce new
(stochastic) benchmarks, apply MORL algorithms to
the control of physical robots, and continue studying
the application of traditional optimization techniques
to MORL.
ACKNOWLEDGEMENTS
This work is funded by the Fonds National de la
Recherche Luxembourg (FNR), CORE program un-
der the ADARS Project, ref. C20/IS/14762457.