properties. The performance has been measured by
mean categorical cross-entropy on training and test
sets (further referred to as loss). The x-axis of figs. 1
to 6 is the determination ratio Q of eq. (1), on a logarithmic scale (so that the value $10^0$ corresponds to Q = 1). This presentation makes the dependence of
the generalization performance (as seen in the conver-
gence of the training and the test set cross-entropy)
on the determination ratio clear. This ratio grows
with the decreasing number of parameters, that is,
with the decreasing number of heads if the number of transformer-encoders is kept at four, and with the decreasing number of transformer-encoders if the number of heads is kept at four. The rightmost configuration is thus the one with a single head or a single transformer-encoder, respectively, followed to the left by two heads or two transformer-encoders, and so on.
The optimization was done exclusively with sin-
gle precision (float32) over a fixed number of 100
epochs by AdamW (Loshchilov and Hutter, 2019)
with a learning rate of $1 \times 10^{-3}$ and a weight decay of $1 \times 10^{-4}$. For consistency, the batch size was set to
256 for all experiments.
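For concreteness, the following is a minimal sketch of this training configuration, assuming a Keras-style workflow (TensorFlow ≥ 2.11 for tf.keras.optimizers.AdamW); the model and dataset objects are placeholders, and the choice of a sparse cross-entropy loss is an added assumption, not the authors' implementation.

```python
import tensorflow as tf

def compile_and_train(model: tf.keras.Model,
                      train_ds: tf.data.Dataset,
                      test_ds: tf.data.Dataset) -> tf.keras.callbacks.History:
    """Train with the settings stated above: AdamW, learning rate 1e-3,
    weight decay 1e-4, batch size 256, 100 epochs, single precision."""
    tf.keras.mixed_precision.set_global_policy("float32")  # single precision (the default)
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
        # mean categorical cross-entropy; integer class labels assumed here
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    return model.fit(
        train_ds.batch(256),
        validation_data=test_ds.batch(256),
        epochs=100,
    )
```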
As a simple regularization during training, stan-
dard image augmentation techniques were applied:
random translation by a factor of (0.1,0.1), random
rotation by a factor of 0.2, and random cropping to
80 %.
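The "factor" wording above matches the conventions of the Keras preprocessing layers; under that assumption, the pipeline might look roughly as follows. This is a sketch, not the authors' code: the crop size uses the 32 × 32 MNIST setting of section 4.1, and the resize back to the input resolution is an added assumption.

```python
import tensorflow as tf

# Sketch of the augmentation described above, using Keras preprocessing
# layers (an assumed API). RandomRotation(0.2) rotates by up to
# +/- 0.2 * 2*pi; RandomCrop keeps 80 % of the 32 x 32 input.
image_size = 32  # MNIST setting of section 4.1; illustrative only

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
    tf.keras.layers.RandomRotation(factor=0.2),
    tf.keras.layers.RandomCrop(int(0.8 * image_size), int(0.8 * image_size)),
    tf.keras.layers.Resizing(image_size, image_size),  # assumption: resize back
])

# Apply to a dummy batch of gray-scale images (training=True enables the
# random transformations).
dummy_batch = tf.random.uniform((4, image_size, image_size, 1))
augmented = augmentation(dummy_batch, training=True)
```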
The patches are flattened and their absolute posi-
tion is added in embedded form to each patch before
entering the first encoder.
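A sketch of this patch preparation step is given below, assuming a learned absolute position embedding added to a linear projection of each flattened patch; the layer names, the Dense projection, and the use of tf.image.extract_patches are illustrative assumptions rather than the authors' implementation.

```python
import tensorflow as tf

class PatchEmbedding(tf.keras.layers.Layer):
    """Flatten non-overlapping image patches, project them to the model
    dimension, and add a learned absolute position embedding (sketch)."""

    def __init__(self, image_size: int, patch_size: int, d_model: int, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        self.projection = tf.keras.layers.Dense(d_model)
        self.position_embedding = tf.keras.layers.Embedding(
            input_dim=self.num_patches, output_dim=d_model)

    def call(self, images):
        # Cut the images into non-overlapping patches and flatten each one.
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patches = tf.reshape(patches, (tf.shape(images)[0], self.num_patches, -1))
        positions = tf.range(self.num_patches)
        return self.projection(patches) + self.position_embedding(positions)

# MNIST setting of section 4.1: 32 x 32 inputs, patch size 2, model size 64,
# giving (32 // 2) ** 2 = 256 patches of 2 * 2 * 1 = 4 values each.
tokens = PatchEmbedding(32, 2, 64)(tf.zeros((1, 32, 32, 1)))  # shape (1, 256, 64)
```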
All experiments were conducted individually on one Tesla V100-SXM3-32GB GPU, for a total of 60 GPU days.
4.1 Dataset MNIST
The MNIST dataset (LeCun et al., 1998) consists of gray-scale images of handwritten digits. All pairs (h, t) with number of heads h ∈ {1, 2, 4, 8} and number of transformer-encoders t ∈ {1, 2, 4, 8} have been optimized. The results, in the form of the loss as a function of the determination ratio Q, are given in fig. 2.
The gray-scale images were resized to 32 × 32 and
the patch size was set to 2. All internal dimensions
(keys, queries, values, feedforward, and model size)
are set to 64.
Figure 2: Training and test set losses of model variants for dataset MNIST (x-axis: determination ratio Q = KM/P, logarithmic scale; y-axis: loss; curves: TF = 4 and HD = 4, train and test).
The cross-entropies for the training and the test sets are fairly consistent, due to the determination ratio Q > 1. The results are substantially more sensitive to the lack of transformer-encoders: the rightmost configurations with four heads but one or two transformer-encoders have a poor performance. By contrast, using only one or two heads leads only to a moderate performance loss. In other words, it is more
productive to stack more transformer-encoders than to
use many heads. This is not surprising for simple im-
ages such as those of digits. The context-dependency
of image patches can be expected to be rather low and
to require only a simple attention mechanism with a
moderate number of heads.
4.2 Dataset CIFAR-100
The dataset CIFAR-100 (Krizhevsky, 2009) is a col-
lection of images of various object categories such
as animals, household objects, buildings, people, and
others. The objects are labeled into 100 classes. The
training set consists of 50,000, the test set of 10,000
samples. With M = 100 and K = 50,000, the determination ratio Q is equal to unity ($10^0$ on the plot x-axis) for 5 million free parameters (M × K).
The results are given in fig. 3. All pairs (h, t) with number of heads h ∈ {1, 2, 4, 8, 16, 32} and number of transformer-encoders t ∈ {1, 2, 4, 8, 16, 32} have been optimized.
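As a quick sanity check of the axis scaling, the determination ratio of eq. (1) can be evaluated directly with these figures; this trivial sketch only restates the arithmetic above, and the per-configuration parameter counts are not reproduced here.

```python
# Determination ratio Q = K * M / P (eq. (1)) with the CIFAR-100 figures:
# K = 50,000 training samples, M = 100 classes.
K, M = 50_000, 100

def determination_ratio(num_parameters: int) -> float:
    return K * M / num_parameters

print(determination_ratio(5_000_000))  # 1.0, i.e. 10**0 on the plot x-axis
print(determination_ratio(1_250_000))  # 4.0, roughly where train and test losses converge
```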
The color images were up-scaled to 64 × 64 and
the patch size was set to 8. All internal dimensions
(keys, queries, values, feedforward, and model size)
are set to 128.
The cross-entropies for the training and the test
sets converge to each other for about Q > 4, with a
considerable generalization gap for Q < 1. This is to be expected in view of the theoretical considerations mentioned in section 3. The results are more sensitive to the lack of transformer-encoders than to that of heads. To what extent a higher number of transformer-encoders would be helpful cannot be assessed, because such configurations fall into the region of Q < 1. With
this training set size, a reduction of some transformer