Table 3: Vision Transformer confusion matrix for the
binary model (tumor / no tumor).

Tumor Detection    no tumor    tumor
no tumor           99          7
tumor              2           497
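As a quick check, the overall detection accuracy can be recomputed from the four counts in Table 3. The sketch below is illustrative only. The variable names are hypothetical, and the per-class rates assume that rows hold the true labels and columns the predicted labels, which the table does not state. The overall accuracy does not depend on that assumption.

```python
# Counts copied from Table 3 (binary ViT model).
# Assumption: rows are true labels, columns are predicted labels.
# The table itself does not say which axis is which.
matrix = [
    [99, 7],    # no tumor
    [2, 497],   # tumor
]

total = sum(sum(row) for row in matrix)      # all evaluated images
correct = matrix[0][0] + matrix[1][1]        # diagonal entries
accuracy = correct / total                   # independent of axis orientation

# Per-class rates, valid only under the row = true label assumption.
no_tumor_rate = matrix[0][0] / sum(matrix[0])
tumor_rate = matrix[1][1] / sum(matrix[1])

print(f"images evaluated: {total}")                  # 605
print(f"overall accuracy: {accuracy:.2%}")           # 98.51%
print(f"no-tumor row rate: {no_tumor_rate:.2%}")     # 93.40%
print(f"tumor row rate: {tumor_rate:.2%}")           # 99.60%
```

The 98.51 % figure agrees with the "over 98 %" detection accuracy quoted in the text.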
our custom-built CNN trained under the same exact
conditions. Although the model was not trained on a
large amount of data and used an unbalanced dataset,
it still achieved 96.5 % classification accuracy and
over 98 % detection accuracy. We compared it to CNNs,
which are used in the SoA for such tasks, and
demonstrated that the ViT can still achieve better
accuracy despite lacking translational invariance.
Some modifications could improve the efficiency of
the model, such as optimizing the hyper-parameters.
Adding another regularization technique and
appropriate data augmentation could also help prevent
the model from overfitting the data. These solutions
entail another tradeoff, as they are likely to
significantly increase the training time needed to
reach good accuracy. Finally, future work includes
investigating the recently introduced Convolutional
Vision Transformers (CvT), which are reported to
attain better results than standard Vision
Transformers.
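
To make the data augmentation suggestion concrete, here is a minimal sketch of one possible scheme: a random horizontal flip followed by a small zero-padded horizontal shift. It is not the pipeline used in this work. The function name `augment`, the `max_shift` parameter, and the choice of transforms are all assumptions made for illustration, and whether flips are appropriate for a given imaging modality would need to be checked.

```python
import random


def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2D image.

    `image` is a list of equal-length rows of pixel values.
    Hypothetical helper for illustration, not the authors' method.
    """
    out = [list(row) for row in image]  # copy so the input is not mutated
    width = len(out[0])
    max_shift = min(max_shift, width)   # keep the shift within the image

    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]

    # Random horizontal shift of up to max_shift pixels, padded with zeros.
    shift = rng.randint(-max_shift, max_shift)
    if shift > 0:
        out = [[0] * shift + row[:width - shift] for row in out]
    elif shift < 0:
        out = [row[-shift:] + [0] * (-shift) for row in out]
    return out


# Usage: generate a few augmented variants of a tiny dummy image.
rng = random.Random(0)
dummy = [[1, 2, 3, 4], [5, 6, 7, 8]]
variants = [augment(dummy, rng) for _ in range(3)]
```

Each call produces a new variant of the same shape, so the transform could be applied on the fly during training. As the text notes, this kind of augmentation is likely to lengthen training.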
