there are many options. Essentially all of them try to keep a high exploration ratio at the beginning of the learning procedure and, as time goes by, increasingly select the actions that are considered the best (according to the Q-values).
A good example is the exploration strategy based on the Boltzmann distribution. According to this strategy, the probability of executing action $a^*$ in state $s(t)$ is given by the following expression:
\[
\mathrm{Prob}(s(t), a^*) = \frac{e^{Q(s(t),\,a^*)/T}}{\sum_{\forall a} e^{Q(s(t),\,a)/T}}, \qquad (1)
\]
where the value T > 0 is a temperature that controls the randomness of the selection. Very low temperatures cause nearly deterministic action selection, while high values result in nearly random behaviour.
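For illustration, a minimal Python sketch of this selection rule could look as follows; the Q-table layout (a dictionary keyed by state-action pairs) and the example values are assumptions made only for this sketch:

import math
import random

def boltzmann_select(q_values, state, actions, temperature):
    # Probability of each action is proportional to exp(Q(s,a)/T), as in (1).
    # Subtracting the maximum Q-value does not change the distribution but
    # avoids numerical overflow in exp().
    max_q = max(q_values[(state, a)] for a in actions)
    weights = [math.exp((q_values[(state, a)] - max_q) / temperature)
               for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Example: with a low temperature, 'left' (the higher Q-value) is chosen
# almost always; with a high temperature the choice becomes nearly uniform.
q = {(0, 'left'): 1.0, (0, 'right'): 0.2}
chosen = boltzmann_select(q, state=0, actions=['left', 'right'], temperature=0.1)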
3 COMBINING GENETIC ALGORITHMS WITH REINFORCEMENT LEARNING: GA+RL
A genetic algorithm (GA) is a biologically inspired search technique well suited to finding approximate solutions to optimization and search problems (Davidor, 1991). This algorithm is based on the evolutionary ideas of natural selection and genetics. In general, genetic algorithms are considered to be useful when the search space is large, complex or poorly understood, the domain knowledge is scarce or expert knowledge is difficult to encode, and a mathematical analysis of the problem is difficult to carry out.
To solve a search problem the GA begins with a population of solutions (called chromosomes). Each chromosome of this population is evaluated using an objective function, called the fitness function, which measures how close the chromosome is to the desired solution. A new population is then obtained from the first one according to the fitness values: the better a chromosome seems to be, the higher the probability that it is selected for the new population of solutions. To maintain a certain degree of randomness in the algorithm, which prevents it from being trapped in a local minimum, genetic operators such as chromosome mutation (random changes in the proposed solution) or chromosome crossover (combination of two chromosomes to produce two new solutions) are applied whenever a new set of solutions is obtained from a previous one.
The evaluation of the population of solutions, and
its evolution to a new and improved population, are
two steps which are repeated continuously until the
desired behaviour is reached.
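A minimal sketch of this evaluate-and-evolve loop in Python is given below. The chromosome encoding (a list of actions, one per state), the fitness function and all parameter values are assumptions made for the example, and the fitness values are assumed to be positive so that roulette-wheel selection is well defined:

import random

def evolve(population, fitness_fn, generations=50,
           crossover_prob=0.7, mutation_prob=0.05, n_actions=4):
    # Generic evaluate-and-evolve loop: fitness evaluation, fitness-proportional
    # selection, one-point crossover and random mutation, repeated each generation.
    for _ in range(generations):
        fitness = [fitness_fn(chrom) for chrom in population]
        new_population = []
        while len(new_population) < len(population):
            # Roulette-wheel selection: better chromosomes are chosen more often.
            p1, p2 = random.choices(population, weights=fitness, k=2)
            child1, child2 = p1[:], p2[:]
            if random.random() < crossover_prob:
                # One-point crossover: exchange the tails of the two parents.
                point = random.randrange(1, len(child1))
                child1[point:], child2[point:] = child2[point:], child1[point:]
            for child in (child1, child2):
                for i in range(len(child)):
                    if random.random() < mutation_prob:
                        # Mutation: replace the action proposed for this state.
                        child[i] = random.randrange(n_actions)
                new_population.append(child)
        population = new_population[:len(population)]
    return population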
In our case we wanted to combine the potential of a GA with RL (figure 1), in such a way that the drawbacks of each paradigm are offset by the benefits of the other.
3.1 How a GA Improves RL
Genetic algorithms help RL in generalization and selectivity (David E. Moriarty, 1999). A GA learns a mapping from observed states to recommended actions, usually eliminating explicit information concerning less desirable actions. In a GA the knowledge about bad decisions is not explicitly preserved, since policies that make such decisions are not selected by the evolutionary algorithm and are eventually eliminated from the population. Thus, because attention is focused only on profitable actions and less information has to be learnt, the system reaches globally good solutions faster than RL.
As can be seen in figure 1, RL and the GA share the same finite set of states, S, which represents the different situations the system may detect. Thus, a chromosome and a policy are one and the same, as a chromosome also has to establish the action a ∈ A to be carried out at every state s ∈ S (this coincides with the definition of policy given in section 2). According to figure 1, the fitness is a function of the reward values: the longer a chromosome/policy is able to control the system properly, the higher its fitness value.
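Under this interpretation, a chromosome is simply a table mapping each state to an action, and its fitness can be accumulated from the rewards obtained while it controls the system. The sketch below assumes a hypothetical environment interface (reset/step returning state, reward and a termination flag) and integer-coded states; both are assumptions made only for the example:

def evaluate_policy(chromosome, env, max_steps=1000):
    # Fitness of a chromosome/policy: reward accumulated while the policy keeps
    # the system under control; the longer it does, the higher the fitness.
    state = env.reset()
    fitness = 0.0
    for _ in range(max_steps):
        action = chromosome[state]          # the chromosome directly encodes pi(s)
        state, reward, done = env.step(action)
        fitness += reward
        if done:                            # the system left the admissible region
            break
    return fitness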
When a set of N policies is being evaluated, if one of them makes the system behave much better than the others, there is a high probability that similar sequences of actions are explored in the next generated population of chromosomes. In RL this is not possible, as it would require a high learning coefficient and a fast decrease of the temperature coefficient (towards the greedy policy). Such a high learning coefficient would cause significant instabilities during the learning process, as the execution of correct actions in the middle of wrong ones would quickly decrease their Q-values.
3.2 How RL Improves GA
As pointed out previously, RL keeps more information than a GA. The Q-values reflect the consequences of executing every possible action in each state. This information might be useful to bias the genetic operators so as to reduce the number of chromosomes required to find a solution. Thus, the mutation probability could be higher for those policies and states where the selected action is poor according to the Q-values, the crossover between two policies could be performed in such a way that the exchanged actions have similar Q-values, etc.
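One possible form of such a Q-value-biased mutation is sketched below. The way the bias probability is computed here, and the Q-table layout, are assumptions made for illustration only; they are not the operator actually used in our proposal, which is described next:

import math
import random

def q_biased_mutation(chromosome, q_values, actions, base_prob=0.05):
    # Mutation biased by the Q-values: genes (states) whose current action looks
    # poor compared with the best-known action are mutated more often, and the
    # replacement action is drawn with probability increasing with its Q-value.
    for state in range(len(chromosome)):
        q_best = max(q_values[(state, a)] for a in actions)
        gap = q_best - q_values[(state, chromosome[state])]
        prob = min(1.0, base_prob * (1.0 + gap))   # larger gap -> more mutation
        if random.random() < prob:
            weights = [math.exp(q_values[(state, a)]) for a in actions]
            chromosome[state] = random.choices(actions, weights=weights, k=1)[0]
    return chromosome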
In our proposal, the first genetic operator whose behaviour was modified was the mutation