In this paper, we are interested in automatic seg-
mentation in order to quickly extract regions of in-
terest (tumours for example) to make a more precise
analysis of these areas only. However, only a few approaches to fully unsupervised segmentation of WSI
have been proposed. The first attempt to segment regions of interest from WSI without any prior information or examples was made in (Khan
et al., 2013). The authors highlight tissue morphol-
ogy in breast cancer histology images by calculating
a set of Gabor filters to discriminate different regions.
In (Fouad et al., 2017), the authors use mathematical morphology to extract ‘virtual-cells’ (i.e. superpixels), compute morphological and colour features for each, and then apply a consensus clustering algorithm to identify the different tissues in the image.
More recently, a similar approach was presented in (Landini et al., 2019), adding a semi-supervised
self-training classifier to the previous techniques that
enhances the results at the cost of partial supervision.
All these approaches propose to cluster the image
based on predefined features. However, deep learning approaches, particularly autoencoding architectures, avoid the manual definition of features by computing, through convolutional filters, a condensed representation of the image in a latent space. Unfortunately, as stated in (Raza and
Singh, 2018), most applications of autoencoders in
digital pathology were developed to perform cell seg-
mentation or nuclei detection (Xu et al., 2015; Hou
et al., 2019), or stain normalisation (Janowczyk et al.,
2017). Therefore, we propose here to study the potential of these approaches for WSI tissue segmentation. The aim is to automatically identify clusters corresponding to each type of tissue in the WSI, which could then be labelled by pathologists.
In this paper, we present a study on how convolu-
tional autoencoders perform on WSI segmentation by
comparing different approaches. First, different autoencoder architectures are compared to quantify the
importance of hyperparameters of interest (number of
convolutional layers, number of convolutions per layer
and size of the latent space). Then, a multi-resolution
approach using an ensemble clustering framework is
evaluated, to see if such ensemble techniques could
provide more accurate results.
2 METHODS
2.1 Convolutional Autoencoders
In this section, we explore the use of convolutional autoencoders to cluster WSI histopathological
images. For this, we present several experiments to
evaluate the importance of each hyperparameter.
As shown in Figure 1, a Convolutional AutoEn-
coder (CAE) is a deep convolutional neural network
composed of two parts: an encoder and a decoder.
The main purpose of the CAE is to minimise a loss
function L, evaluating the difference between the in-
put and the output of the CAE (usually Mean Squared
Error). Once this function is minimised, we can as-
sume that the encoder part builds up a suitable sum-
mary of the input data, in the latent space, as the de-
coder part is capable of reconstructing an accurate
copy of it from this encoded representation.
The encoder first consists of the input layer
(having the size of the input image) which is con-
nected to N convolutional layers of diminishing size,
up to an information bottleneck of size Z, called the
latent space. The bottleneck is connected to a se-
ries of N convolutional layers of increasing size, un-
til reaching the size of the input. This second part
is called the decoder. Each convolution layer is com-
posed of C convolutions and is followed by three other
layers: a batch normalisation, an activation function
(ReLU) and a max pooling of size (2,2).
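As an illustration, such an architecture can be sketched as follows, using PyTorch and the hyperparameter values fixed later in the paper (N = 2, C = 10, Z = 250). The kernel size (3×3), patch size (32×32) and nearest-neighbour upsampling in the decoder are illustrative assumptions, as the paper does not specify them:

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """Minimal CAE sketch: N = 2 convolutional layers of C = 10
    convolutions each, and a latent space of size Z = 250."""
    def __init__(self, in_ch=3, C=10, Z=250, patch=32):
        super().__init__()
        s = patch // 4  # spatial size after two (2,2) max poolings
        # Each conv layer is followed by batch norm, ReLU and (2,2) max pooling.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, C, 3, padding=1), nn.BatchNorm2d(C),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(C, C, 3, padding=1), nn.BatchNorm2d(C),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(C * s * s, Z),   # bottleneck of size Z
        )
        self.decoder = nn.Sequential(
            nn.Linear(Z, C * s * s), nn.Unflatten(1, (C, s, s)),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(C, C, 3, padding=1), nn.BatchNorm2d(C), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(C, in_ch, 3, padding=1),       # back to the input size
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CAE()
x = torch.rand(4, 3, 32, 32)                 # a batch of 4 RGB patches
loss = nn.functional.mse_loss(model(x), x)   # reconstruction loss L (MSE)
```

Minimising `loss` over the training patches drives the encoder to produce a latent summary from which the decoder can reconstruct the input.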
To perform the clustering, a trained CAE is used to
encode each patch of the whole image. Then, this en-
coded representation of the patch (in the latent space)
is given as the input of a clustering algorithm and a
cluster is assigned to the patch.
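This clustering step can be sketched as follows. The paper does not name the clustering algorithm at this point, so scikit-learn's k-means is used as an assumption, and random vectors stand in for the real latent codes produced by a trained encoder:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the encoder output: 200 hypothetical patches,
# each encoded into a Z = 250 dimensional latent vector.
rng = np.random.default_rng(0)
codes = rng.normal(size=(200, 250))

# Cluster the latent codes; each patch inherits the label of its
# encoding, which tiles back into a segmentation map over the WSI.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(codes)
```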
We decided to evaluate the influence of the three
hyperparameters N, Z and C. For each one, different
values were tested while fixing the other two (N = 2, Z = 250, C = 10). To evaluate the quality of the
results, the Adjusted Rand Index (ARI) is calculated
to compare the obtained clustering to the annotations
of the expert. The Rand Index computes a similarity
measure between two clusterings by considering all
pairs of samples and counting pairs that are assigned to the same or to different clusters in the predicted and
true clusterings. The score is then normalised into the
ARI score by:
ARI = (RI − Expected RI) / (max(RI) − Expected RI) (1)
Values of the ARI are close to 0 for random la-
belling independently of the number of clusters and
samples, and exactly 1 when the clusterings are iden-
tical (up to a permutation).
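For instance, using scikit-learn's implementation, two identical partitions score 1 even when the cluster labels are permuted:

```python
from sklearn.metrics import adjusted_rand_score

expert    = [0, 0, 1, 1, 2, 2]   # ground-truth tissue labels
predicted = [1, 1, 0, 0, 2, 2]   # same partition, permuted label names

score = adjusted_rand_score(expert, predicted)
print(score)  # 1.0: the clusterings are identical up to a permutation
```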
Each CAE was trained on a set of 10,000 randomly selected patches. As the results of both the clustering and the training of the CAE are non-deterministic, due to high sensitivity to the initial
conditions, 10 autoencoders were trained and the re-
sults averaged for each hyperparameter value.
VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications