We show that the algorithms proposed in Nadjahi
et al. (2019) are not provably safe and propose a new
version that is provably safe. We also adapt their ideas
to derive a heuristic algorithm that achieves, among
the entire SPIBB class on two different benchmarks,
both the best mean performance and the best 1%-CVaR
performance, the latter being important for
safety-critical applications. Furthermore, it is
competitive in mean performance with other
state-of-the-art uncertainty-incorporating algorithms
and, in particular, outperforms them in 1%-CVaR
performance. Additionally, we have shown that
the theoretically supported Adv-Approx-Soft-SPIBB
performs almost as well as its predecessor
Approx-Soft-SPIBB, only falling slightly behind in the mean
performance.
The experiments also demonstrate different properties
of the two classes of SPI algorithms in Figure 1:
algorithms penalizing the action-value functions tend
to perform better in the mean, but fall short in the
1%-CVaR, especially when the available data is scarce.
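To make the two evaluation criteria concrete, the following is a minimal sketch of how the mean and the 1%-CVaR of a set of per-run performances could be computed. It assumes the common empirical definition of the alpha-CVaR as the average over the worst alpha fraction of runs; the function name `cvar`, the rounding rule (at least one run is always kept), and the sample values are illustrative assumptions, not taken from the experiments.

```python
import math


def cvar(returns, alpha=0.01):
    """Empirical alpha-CVaR: mean of the worst alpha fraction of returns.

    Assumption: higher returns are better, so the "worst" runs are the
    smallest values. At least one run is always included.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    # Number of runs in the lower tail, rounded up so the tail is never empty.
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


def mean(returns):
    """Plain average performance over all runs."""
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # Hypothetical per-run performances of one algorithm (made-up numbers).
    runs = [0.9, 1.1, 1.0, -2.0, 1.2, 0.8, 1.0, 1.1]
    print("mean:", mean(runs))
    print("25%-CVaR:", cvar(runs, alpha=0.25))
```

An algorithm with a high mean but a few very poor runs would score well on the first measure and badly on the second, which is the trade-off described above.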
Perhaps the most relevant direction of future work
is how to apply this framework to continuous MDPs,
which has so far been explored by Nadjahi et al.
(2019) without theoretical safety guarantees. Apart
from theory, we hope that our observations of the two
classes of SPI algorithms can contribute to the choice
of algorithms for the continuous case.
Safe Policy Improvement Approaches on Discrete Markov Decision Processes