Table 3: Vision Transformer confusion matrix for the
binary model (tumor/no tumor).

Tumor Detection    no tumor    tumor
no tumor                 99        7
tumor                     2      497
our custom-built CNN trained under exactly the same
conditions. Although the model was not trained on a
large amount of data and used an unbalanced dataset,
it still managed to achieve 96.5% classification
accuracy and over 98% detection accuracy. We compared
it to CNNs, which are used in the state of the art
(SoA) for such tasks, and demonstrated that the ViT
can still achieve better accuracy, despite lacking
translational invariance. Some modifications could
improve the efficiency of the model, such as
optimizing the hyper-parameters. Adding another
regularization technique and appropriate data
augmentation could also help prevent the model from
overfitting the data. These solutions entail another
tradeoff, as they are likely to significantly increase
the training time needed to achieve good accuracy.
Finally, future work includes investigating the use of
the recently introduced Convolutional Vision
Transformers (CvT), which attain better results than
standard Vision Transformers.
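
As a quick check, the detection accuracy quoted above can be recomputed from the counts in Table 3. The following is a minimal sketch in Python, not the authors' evaluation code; the variable names are illustrative. Because the table does not state which axis holds the true labels and which the predictions, only the overall accuracy is computed, since it does not depend on that orientation.

```python
# Counts copied from Table 3 (binary tumor / no tumor model).
# Assumption: the axis orientation (true vs. predicted) is not stated in the
# table, so only an orientation-independent quantity is computed here.
confusion = [
    [99, 7],    # row "no tumor"
    [2, 497],   # row "tumor"
]

# Total number of test scans and the number on the diagonal (correct calls).
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

print(f"{correct}/{total} = {accuracy:.4f}")  # prints "596/605 = 0.9851"
```

The diagonal holds 596 of the 605 scans, about 98.5%, which is consistent with the "over 98%" detection accuracy reported in the text.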
BIOIMAGING 2022 - 9th International Conference on Bioimaging