Towards Reducing the Need for Annotations in Digital Dermatology with Self-supervised Learning

Fabian Gröger¹, Philippe Gottfrois², Ludovic Amruthalingam², Alvaro Gonzalez-Jimenez², Simone Lionetti¹, Alexander A. Navarini²,³ and Marc Pouly¹

¹Lucerne University of Applied Sciences and Arts, Rotkreuz, Switzerland
²Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland
³Department of Dermatology, University Hospital of Basel, Switzerland
Keywords:
Self-supervised Learning, Pre-training, Transfer Learning, Dermatology, Medical Imaging.
Abstract:
Training supervised models requires large amounts of labelled data, whose creation is often expensive and
time-consuming, especially in the medical domain. The standard practice to mitigate the lack of annotated
clinical images is to use transfer learning and fine-tune pre-trained ImageNet weights on a downstream task.
While this approach achieves satisfactory performance, it still requires a sufficiently large dataset to adjust
the global features for a specific task. We report on an ongoing investigation to determine whether self-
supervised learning methods applied to unlabelled domain-specific images can provide better representations
for digital dermatology compared to ImageNet. We consider ColorMe, SimCLR, BYOL, DINO, and iBOT,
and present preliminary results on the evaluation of pre-trained initialization for three different medical tasks
with mixed imaging modalities. Our intermediate findings indicate a benefit in using features learned by iBOT
on dermatology datasets compared to conventional transfer learning from ImageNet classification.
1 INTRODUCTION
Current artificial intelligence applications to medical imaging often rely on Convolutional Neural Networks (CNNs) trained in a supervised way. These models have shown remarkable performance across many tasks due to their ability to deal with raw image data (Brinker et al., 2019). CNNs typically require vast amounts of annotated data to achieve high levels of performance and robustness. However, medical images can be hard to obtain: the acquisition process must comply with strict regulations, collection biases should be appropriately mitigated (Groh et al., 2021), and examples of rare conditions are challenging to find.
Dermatology is a field of medicine where image analysis can have a significant impact, since many pathologies are visible to the naked eye and can easily be photographed. Two main types of images are
usually considered: dermoscopy pictures taken with a dedicated device called a “dermatoscope”, which is almost in direct contact with the skin and offers magnification, and ordinary clinical images, which are instead collected with cameras that are not specifically designed for dermatology.
Although a diagnosis is usually reported in patient files, obtaining more detailed information typically requires recruiting and training several dermatologists for a tedious labelling job. Moreover, for several medical conditions, it is hard to get experts to agree on a common answer (Jacob et al., 2021). This explains why the most popular general-purpose image dataset, ImageNet (Deng et al., 2009), has 200 times more pictures than the biggest public dermoscopy dataset and 2'000 times more than the largest public clinical image collection (ISIC, 2016).
Transfer learning can leverage knowledge from a source task with plenty of data to perform a target task where data is limited. The most common approach to transfer learning is pre-training large models such as residual neural networks (ResNets) on huge labelled image datasets. Models pre-trained on ImageNet are also widely used in the medical domain,
which is controversial since features extracted from natural images may not be ideal representations in some medical contexts (Matsoukas et al., 2022).
Self-supervised learning (SSL) is a hybrid approach that strives to combine supervised and unsupervised learning and is also often used through pre-training followed by adaptation (Shurrab and Duwairi, 2021). It aims at learning semantically useful features by creating a supervised objective, often called the pretext task, from a pool of unlabelled data without the need for human annotation. Learnt features can then be used in downstream tasks where annotated data is scarce. In recent years, SSL has also become popular in the medical domain, as large volumes of unlabelled data are easier to obtain than their annotated counterparts. Several works have demonstrated the effectiveness of this approach for detection and classification (Li et al., 2021), for detection and localization (Sowrirajan et al., 2021), and for segmentation (Xie et al., 2020).
In this paper, we report intermediate results from an ongoing investigation on using SSL to mitigate the need for annotated samples in dermatology. We train several prominent SSL algorithms on a mixed dataset of roughly 250'000 skin images, comprising both dermoscopy and clinical pictures. We then study the impact of using these representations as a starting point for different medical imaging tasks in dermatology.
2 SELF-SUPERVISED LEARNING TECHNIQUES
In this section, we briefly review the different SSL
techniques considered in this work.
2.1 ColorMe
ColorMe (Li et al., 2020) was specifically designed for applications in the medical domain and embeds inductive bias in its two pretext tasks. It uses an encoder to map the green channel of an image to a vector space. This is followed both by a decoder, which learns to reconstruct the pixel values of the red and blue channels, and by a fully-connected layer, which learns to predict the overall color distribution of those channels.
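To make the two pretext tasks concrete, the following is a minimal PyTorch sketch of a ColorMe-style model; the module names, the toy decoder, and the loss weighting are our assumptions rather than the original implementation (which, per Table 1, uses a ResNet50 encoder).

```python
import torch.nn as nn
import torch.nn.functional as F

class ColorMeSketch(nn.Module):
    """Toy version of the two ColorMe pretext heads (names are hypothetical)."""

    def __init__(self, encoder: nn.Module, feat_dim: int, n_bins: int = 64):
        super().__init__()
        self.encoder = encoder  # maps the 1-channel green input to (B, feat_dim, H/32, W/32)
        self.decoder = nn.Sequential(  # toy decoder reconstructing red & blue pixels
            nn.ConvTranspose2d(feat_dim, 64, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 2, kernel_size=8, stride=8),
        )
        self.hist_head = nn.Linear(feat_dim, 2 * n_bins)  # red & blue color histograms

    def forward(self, green):
        feats = self.encoder(green)          # (B, feat_dim, h, w)
        recon = self.decoder(feats)          # (B, 2, H, W): predicted red & blue channels
        pooled = feats.mean(dim=(2, 3))      # global average pooling
        return recon, self.hist_head(pooled)

def colorme_loss(recon, hist_logits, target_rb, target_hist):
    # Pixel reconstruction (MSE) plus distribution matching (KL divergence);
    # the exact weighting in the original paper may differ.
    rec = F.mse_loss(recon, target_rb)
    kl = F.kl_div(F.log_softmax(hist_logits, dim=-1), target_hist,
                  reduction="batchmean")
    return rec + kl
```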
2.2 SimCLR
A simple framework for contrastive learning of visual representations (SimCLR) (Chen et al., 2020) obtains different views of the same image through heavy data augmentation, called positive samples, and minimizes the distance between their representations. At the same time, the distance between views of different images in the same batch, the negative samples, is maximized. The encoder is typically a ResNet architecture followed by a projection head that is discarded after training.
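As an illustration, here is a compact sketch of the NT-Xent contrastive loss at the heart of SimCLR; the temperature value is an assumption, and the reference implementation differs in details.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss; z1, z2 are (B, D) projections of two views of the same batch."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D), unit vectors
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # a sample is not its own pair
    # The positive for row i is the other view of the same image, at (i + B) mod 2B;
    # all remaining entries in the row act as negatives.
    targets = (torch.arange(2 * B, device=z.device) + B) % (2 * B)
    return F.cross_entropy(sim, targets)
```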
2.3 BYOL
Bootstrap your own latent (BYOL) (Grill et al., 2020) compares embeddings of image views obtained from two networks, removing the need for negative samples. The first network, termed online, consists of an encoder, a projection head, and a prediction head. The second network, known as the target, has the same architecture as the online network except for the prediction head, and its weights are an exponential moving average of the online network's weights. The training objective is to match the online network's output with the target network's output using mean squared error.
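The asymmetry between the two networks can be summarized in a few lines; the sketch below shows the exponential-moving-average update and the normalized mean-squared-error objective, with the momentum value taken from the BYOL paper's default.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(online_net, target_net, tau=0.996):
    # Target weights follow an exponential moving average of the online weights.
    for p_o, p_t in zip(online_net.parameters(), target_net.parameters()):
        p_t.data.mul_(tau).add_(p_o.data, alpha=1.0 - tau)

def byol_loss(online_pred, target_proj):
    # MSE between L2-normalized vectors, equivalent up to a constant to
    # 2 - 2 * cosine similarity; the target branch receives no gradient.
    p = F.normalize(online_pred, dim=-1)
    t = F.normalize(target_proj.detach(), dim=-1)
    return (2 - 2 * (p * t).sum(dim=-1)).mean()
```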
2.4 DINO
Self-distillation with no labels (DINO) (Caron et al., 2021) uses a similar principle to BYOL, but it passes transformations of an image into two separate encoders, respectively the student and the teacher networks. Unlike the previous SSL approaches, the encoder is a vision transformer (ViT) (Dosovitskiy et al., 2020). The loss compares the probability outputs of both networks using cross-entropy. Only the weights of the student are updated via backpropagation, while the parameters of the teacher are an exponential moving average of the student's.
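A minimal sketch of the DINO objective follows; the temperatures and centering momentum are typical defaults from the paper, and the multi-crop machinery is omitted.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """student_out, teacher_out: (B, K) raw outputs; center: (K,) running mean."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # sharpen + stop-grad
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    # Centering the teacher's outputs helps prevent collapse.
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)
```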
2.5 iBOT
Image BERT pre-training with online tokenizer (iBOT) (Zhou et al., 2022) exploits inherent properties of ViTs to learn representations capturing both local and global information. Similarly to DINO, iBOT uses a student network that is trained and a teacher network whose weights are an exponential moving average of the student's. Its loss function is the sum of two cross-entropy terms: the first compares the outputs of the two networks when they are given different views of the same image; the second passes the same view to both networks, but masks some patches for the student, and matches the outputs corresponding to those masked patches.
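Schematically, the two cross-entropy terms can be written as below; the tensor layout and the omission of the teacher's centering and multi-crop details are simplifications on our part.

```python
import torch
import torch.nn.functional as F

def soft_ce(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    # Cross-entropy between the sharpened teacher distribution and the student.
    t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    return -(t * F.log_softmax(student_logits / tau_s, dim=-1)).sum(dim=-1)

def ibot_loss(s_cls_v1, t_cls_v2, s_patch, t_patch, mask):
    """s_cls_v1, t_cls_v2: (B, K) [CLS] outputs for two different views.
    s_patch, t_patch: (B, N, K) patch-token outputs for the same view, where the
    student saw a partially masked input; mask: (B, N) booleans for masked patches."""
    cls_term = soft_ce(s_cls_v1, t_cls_v2).mean()          # cross-view [CLS] term
    patch_ce = soft_ce(s_patch, t_patch)                   # (B, N) per-patch terms
    patch_term = (patch_ce * mask).sum() / mask.sum().clamp(min=1)
    return cls_term + patch_term
```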
Table 1: Hyperparameters of the different self-supervised learning techniques.

Algorithm  Backbone  # Params  Optimizer  Batch size  Learning rate  Scheduler
ColorMe    ResNet50  23 Mio.   SGD        50          1 × 10⁻³       -
SimCLR     ResNet50  23 Mio.   Adam       160         3 × 10⁻⁵       cosine
BYOL       ResNet50  23 Mio.   Adam       150         3 × 10⁻³       cosine
DINO       ViT-tiny  5.4 Mio.  AdamW      56          5 × 10⁻⁴       cosine
iBOT       ViT-tiny  5.4 Mio.  AdamW      56          5 × 10⁻⁴       cosine
3 EXPERIMENTAL SETUP
3.1 Pre-training Data
The training data includes both dermoscopy and clinical images, with the hypothesis that there are common patterns in skin pictures taken at different magnifications. In total, we use 242'039 images from both public and private datasets, as listed below.
- derm7pt (Kawahara et al., 2019), featuring 2'020 dermoscopy and clinical images of various pigmented skin lesions.
- ISIC (ISIC, 2016), which consists of 107'208 dermoscopy images with a wide spectrum of pigmented skin lesions, mostly on low-pigmentation skin. We exclude pictures overlapping with HAM10000 (Tschandl et al., 2018) as these are used in a downstream task.
- MED-NODE (Giotis et al., 2015), which includes 170 clinical images of pigmented skin lesions.
- SD-260 (Sun et al., 2016), containing 12'583 clinical images of 260 different skin conditions.
- PH2 (Mendonça et al., 2013), containing 200 dermoscopy images of pigmented skin lesions.
- A private dataset of 119'858 clinical images, reflecting the data distribution encountered in a Swiss hospital. Pictures were taken by a trained photographer using diverse reflex cameras, were anonymized, and were used with approval (EKNZ-2018-01074) from an ethics committee in accordance with Swiss regulations.
3.2 Downstream Tasks
To evaluate the performance of the pre-training meth-
ods, we use three different downstream tasks, which
were not present in the pre-training data.
- Fitzpatrick17k (Groh et al., 2021) is a public benchmark dataset containing 16'577 clinical images with skin condition annotations and skin type labels based on the Fitzpatrick scoring system. The dataset contains labels at different levels of granularity; we use the coarsest level, which splits skin conditions into three main categories.
- PAD-UFES-20 (Pacheco et al., 2020) is a public benchmark dataset composed of clinical images collected with smartphone devices and patient metadata. The dataset covers 1'373 patients, 1'641 skin lesions, and 2'298 images for six different diagnoses: three skin diseases and three skin cancers.
- HAM10000 (Tschandl et al., 2018) is a public benchmark dataset consisting of 10'015 dermoscopic images gathered from different cohorts. The collected cases include a representative sample of seven diagnostic categories of pigmented lesions.
All non-test data for the downstream tasks were randomly split into training and validation sets containing 85% and 15% of the samples, respectively. Each downstream task was evaluated using the test set defined by the dataset authors.
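For reference, such a split can be reproduced with a single scikit-learn call; the following sketch uses hypothetical placeholders for the file lists, and the fixed seed is our choice.

```python
from sklearn.model_selection import train_test_split

files = [f"img_{i}.jpg" for i in range(100)]   # hypothetical placeholder data
labels = [i % 3 for i in range(100)]

# 85/15 random split into training and validation sets.
train_files, val_files, train_labels, val_labels = train_test_split(
    files, labels, test_size=0.15, random_state=42)  # seed is an assumption
```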
3.3 Architectures
The SSL algorithms considered can be split into two groups based on the backbone architecture they work with, which can be a CNN or a ViT. To ensure a fair comparison, at least within the same group, we used the same model whenever possible. To further promote the correspondence between the two groups, we selected architectures that perform similarly on ImageNet. For the CNN-based models we chose ResNet-50, and for the ViT-based ones a tiny vision transformer (Dosovitskiy et al., 2020) with a patch size of 16 × 16. The number of trainable parameters is therefore roughly 23 Mio. for ResNet-50 and 5.4 Mio. for ViT-tiny.
3.4 Hyperparameters
Table 1 gives some details about the hyperparameters for pre-training. All images are resized to 224 × 224 pixels and normalized. Further, the models are trained until the validation loss fails to improve for five consecutive epochs.
Table 2: Macro-averaged F1 scores of various models and a baseline on the hold-out test set of the three open-source dermatology downstream tasks. After adding a linear layer, both freezing and fine-tuning the backbone are considered.

Evaluation    Pre-training         Fitzpatrick17k  PAD-UFES-20  HAM10000
Linear Eval.  Stratified sampling  33.4 %          18.0 %       14.0 %
              ImageNet             51.0 %          49.7 %       54.1 %
              ColorMe              44.8 %          42.2 %       47.0 %
              SimCLR               37.0 %          32.7 %       28.2 %
              BYOL                 48.1 %          34.4 %       44.4 %
              DINO                 46.7 %          44.2 %       57.2 %
              iBOT                 53.0 %          58.2 %       72.0 %
Fine-tuned    ImageNet             72.1 %          61.5 %       79.0 %
              ColorMe              71.0 %          61.7 %       73.1 %
              iBOT                 73.9 %          62.3 %       82.0 %
A full hyperparameter search is not performed, as this exceeds the scope of this paper and our available computational resources.
However, we ensure that all models converge to a suitable solution by manually choosing an appropriate optimizer and tuning the learning rate. We use the same data augmentation policies described in the original paper introducing each pretext task.
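As an illustration of the shared setup, a sketch of the input preprocessing and a patience-based stopping criterion follows; the normalization statistics (here, the ImageNet ones) and the helper's exact form are assumptions.

```python
from torchvision import transforms

# Shared preprocessing: resize to 224 x 224 and normalize; the normalization
# statistics used here are the ImageNet ones, which is an assumption.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def should_stop(val_losses, patience=5):
    # Stop once the best validation loss is older than `patience` epochs,
    # i.e. none of the last `patience` epochs improved on it.
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    return min(val_losses[-patience:]) > best
```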
3.5 Evaluation
To compare the representations learned by different pre-training strategies, we test them in two ways: using linear evaluation on frozen embeddings and fine-tuning the whole model. Both experiments are performed on all downstream tasks. For models using a ResNet backbone, we add the linear layer after the last average pooling layer. For ViT-based models, we follow (Caron et al., 2021) and add a linear layer on the concatenation of the class tokens from the last four blocks of the model. In the fine-tuning experiment, the backbone and the linear classification head are trained together.
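For the ViT probe, a sketch of the feature extraction is given below; `get_intermediate_layers` mirrors the helper exposed by the DINO/iBOT reference code, and its availability on other ViT implementations is an assumption.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def vit_probe_features(vit, images, n_last_blocks=4):
    # Collect the outputs of the last `n_last_blocks` transformer blocks and
    # keep only the class token (index 0) of each, then concatenate.
    outs = vit.get_intermediate_layers(images, n_last_blocks)  # list of (B, N+1, D)
    return torch.cat([o[:, 0] for o in outs], dim=-1)          # (B, n_last_blocks * D)

# Linear head on the frozen features, e.g. for ViT-tiny (embedding dim 192):
num_classes = 3  # e.g. the coarse Fitzpatrick17k categories
head = nn.Linear(4 * 192, num_classes)
```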
Finally, to better understand the utility of self-supervised pre-training in different low-data regimes, we also train a simple k-nearest neighbor (kNN) classifier that acts as a few-shot learner for the downstream tasks. This classifier learns from a random subset of the downstream task's labelled data and is evaluated on the same hold-out test set as the linear evaluation and the fine-tuning.
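A sketch of this few-shot probe using scikit-learn is shown below; the number of neighbors and the sampling details are our assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def knn_probe(train_emb, train_y, test_emb, test_y, per_class=10, k=5, seed=0):
    """Few-shot probe: fit a kNN on `per_class` random samples per class of the
    frozen embeddings and report macro-averaged F1 on the hold-out test set."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.where(train_y == c)[0],
                   size=min(per_class, int((train_y == c).sum())),
                   replace=False)
        for c in np.unique(train_y)
    ])
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_emb[idx], train_y[idx])
    return f1_score(test_y, knn.predict(test_emb), average="macro")
```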
4 RESULTS
Scores that probe the ability of self-supervised pre-trained features to generalize to the three downstream tasks are reported in Table 2. The upper half shows the performance of frozen pre-trained embeddings upon linear evaluation. All models show better results than the random stratified sampling baseline, which samples the prediction independently of the input data with empirical class probabilities determined from the training set. However, we also observe that the results of ColorMe, SimCLR, and BYOL are worse than the performance achieved with ImageNet features. This indicates that models with a CNN backbone do not profit from domain-specific pre-training in our setting and are better off using general features such as the ones from ImageNet. A possible explanation is that such models have a strong inductive bias which benefits more from initialization with general pre-trained weights (Matsoukas et al., 2022). Features from a ViT backbone yield less uniform patterns in comparison with ImageNet: DINO outperforms the general features in only one of three tasks, and only by 3%. On the other hand, the results clearly show that iBOT performs well across all tasks and improves over ImageNet initialization by 2%, 8.5%, and 17.9%, respectively.
The lower half of Table 2 summarizes the performance upon fine-tuning the pre-trained features. Here we only report results for the reference ImageNet model and the two models which obtained the highest average scores in the CNN and ViT classes of SSL approaches, i.e. ColorMe and iBOT. Similar to the results from the linear evaluation, we observe that using iBOT as initialization yields the best performance for all downstream tasks. Furthermore, we notice that the gap between ColorMe and iBOT is smaller than under linear evaluation, indicating that with enough training data and trainable parameters, similar features can be learned. Comparing the performance in the linear and fine-tuned evaluations, we can also see that the improvement for iBOT on PAD-UFES-20 and HAM10000 is noticeably smaller with respect to the other methods.
Figure 1: Results of a kNN classifier on pre-trained representations when varying the number of samples per class for all three downstream tasks: (a) Fitzpatrick17k, (b) PAD-UFES-20, (c) HAM10000.
This suggests that the inherent structure of the dataset labels was already learned in the pre-training phase, and only minor adaptations to the existing features were needed.
Finally, Figure 1 shows the results of adding a kNN classifier to the pre-trained ColorMe, iBOT, and ImageNet features when varying the training dataset size. The results achieved by iBOT outperform, on average, the ones from ImageNet over all three downstream tasks, indicating that its features are also highly competitive in low-data regimes.
5 CONCLUSION
In this paper, we set out to investigate whether
features from domain-specific self-supervised pre-
training yield a benefit over general-purpose ones
such as ImageNet weights, which are currently the
de facto standard in the medical domain. The re-
sults achieved so far indicate that there might be an
advantage in SSL initialization, especially when us-
ing iBOT. However, we currently cannot conclude
whether this benefit can be traced back to the pre-
training strategy or the difference in model architec-
ture. An indication in favor of the former is that
DINO, which is also based on ViTs, did not outper-
form ImageNet initialization. In the future, we plan ablation experiments to determine whether the performance gain is really due to the pre-training task or is influenced by the different architecture.
REFERENCES
Brinker, T. J., Hekler, A., Enk, A. H., Berking, C., Hafer-
kamp, S., Hauschild, A., Weichenthal, M., Klode,
J., Schadendorf, D., Holland-Letz, T., von Kalle, C.,
Fröhling, S., Schilling, B., and Utikal, J. S. (2019).
Deep neural networks are superior to dermatologists
in melanoma image classification. European Journal
of Cancer, 119:11–17.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020).
A Simple Framework for Contrastive Learning of Vi-
sual Representations. In Proceedings of the 37th In-
ternational Conference on Machine Learning, pages
1597–1607. PMLR.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 248–255.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2020). An Image is Worth 16x16 Words: Trans-
formers for Image Recognition at Scale.
Giotis, I., Molders, N., Land, S., Biehl, M., Jonkman, M., and Petkov, N. (2015). MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Systems with Applications, 42:6578–6585.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.,
Buchatskaya, E., Doersch, C., Avila Pires, B., Guo,
Z., Gheshlaghi Azar, M., Piot, B., kavukcuoglu, k.,
Munos, R., and Valko, M. (2020). Bootstrap Your
Own Latent - A New Approach to Self-Supervised
Learning. In Advances in Neural Information Process-
ing Systems, volume 33, pages 21271–21284. Curran
Associates, Inc.
Groh, M., Harris, C., Soenksen, L., Lau, F., Han, R., Kim,
A., Koochek, A., and Badri, O. (2021). Evaluating
Deep Neural Networks Trained on Clinical Images in
Dermatology with the Fitzpatrick 17k Dataset. pages
1820–1828. IEEE Computer Society.
ISIC (2016). ISIC Archive. https://www.isic-archive.com/.
Accessed: 2022-05-20.
Jacob, J., Ciccarelli, O., Barkhof, F., and Alexander, D. C.
(2021). Disentangling human error from the ground
truth in segmentation of medical images. ACL.
Kawahara, J., Daneshvar, S., Argenziano, G., and
Hamarneh, G. (2019). Seven-point checklist and skin
lesion classification using multitask multimodal neu-
ral nets. IEEE Journal of Biomedical and Health In-
formatics, 23(2):538–546.
Li, X., Hu, X., Qi, X., Yu, L., Zhao, W., Heng, P.-A.,
and Xing, L. (2021). Rotation-Oriented Collabora-
tive Self-Supervised Learning for Retinal Disease Di-
agnosis. IEEE Transactions on Medical Imaging,
40(9):2284–2294.
Li, Y., Chen, J., and Zheng, Y. (2020). A Multi-Task Self-
Supervised Learning Framework for Scopy Images. In
2020 IEEE 17th International Symposium on Biomed-
ical Imaging (ISBI), pages 2005–2009. ISSN: 1945-
8452.
Matsoukas, C., Haslum, J. F., Sorkhei, M., Söderberg, M.,
and Smith, K. (2022). What Makes Transfer Learning
Work For Medical Images: Feature Reuse & Other
Factors. Technical Report arXiv:2203.01825, arXiv.
Mendonça, T., Ferreira, P. M., Marques, J. S., Marcal,
A. R. S., and Rozeira, J. (2013). PH2 - A dermo-
scopic image database for research and benchmark-
ing. In 2013 35th Annual International Conference of
the IEEE Engineering in Medicine and Biology Soci-
ety.
Pacheco, A. G. C., Lima, G. R., Salomão, A. S., Krohling,
B., Biral, I. P., de Angelo, G. G., Alves Jr, F. C. R.,
Esgario, J. G. M., Simora, A. C., Castro, P. B. C.,
Rodrigues, F. B., Frasson, P. H. L., Krohling, R. A.,
Knidel, H., Santos, M. C. S., do Espírito Santo, R. B.,
Macedo, T. L. S. G., Canuto, T. R. P., and de Barros,
L. F. S. (2020). PAD-UFES-20: A skin lesion dataset
composed of patient data and clinical images collected
from smartphones. Data in Brief, 32:106221.
Shurrab, S. and Duwairi, R. (2021). Self-supervised learn-
ing methods and applications in medical imaging
analysis: A survey. arXiv:2109.08685 [cs, eess].
arXiv: 2109.08685.
Sowrirajan, H., Yang, J., Ng, A. Y., and Rajpurkar, P.
(2021). MoCo-CXR: MoCo Pretraining Improves
Representation and Transferability of Chest X-ray
Models. arXiv:2010.05352 [cs].
Sun, X., Yang, J., Sun, M., and Wang, K. (2016). A
benchmark for automatic visual classification of clin-
ical skin disease images. In Leibe, B., Matas, J.,
Sebe, N., and Welling, M., editors, Computer Vision
– ECCV 2016, pages 206–222, Cham. Springer Inter-
national Publishing.
Tschandl, P., Rosendahl, C., and Kittler, H. (2018). The
HAM10000 dataset, a large collection of multi-source
dermatoscopic images of common pigmented skin le-
sions. Scientific Data, 5(1):180161.
Xie, Y., Zhang, J., Liao, Z., Xia, Y., and Shen, C.
(2020). PGL: Prior-Guided Local Self-supervised
Learning for 3D Medical Image Segmentation.
arXiv:2011.12640 [cs].
Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., and Kong, T. (2022). iBOT: Image BERT Pre-Training with Online Tokenizer. In International Conference on Learning Representations (ICLR).