ditive distortions. Some methods model the problem
as an intrinsic image decomposition, that is, an image
is decomposed into a reflectance image (albedo) and
a shading image. Reflectance describes the amount
of light an object reflects; it is an intrinsic value that
depends only on the object’s material. Shading is a
varying property that depends on the lighting condi-
tions and the position of objects relative to the light
sources. Background subtraction methods deal with
a similar problem. Given a series of images with a
dynamic foreground, the background has to be ex-
tracted, which is assumed to be constant.
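The intrinsic image decomposition described above is commonly written as a pixel-wise product, image = reflectance × shading. The following minimal sketch illustrates that model on a tiny grayscale example. The function names `compose` and `recover_reflectance` are hypothetical, and the shading is assumed to be known here, which is an idealization: real methods have to estimate both factors from the observed image.

```python
def compose(reflectance, shading):
    """Pixel-wise product I = R * S for two equally sized 2D lists of floats."""
    return [[r * s for r, s in zip(r_row, s_row)]
            for r_row, s_row in zip(reflectance, shading)]


def recover_reflectance(image, shading, eps=1e-8):
    """Invert the model when the shading is known (idealized assumption)."""
    return [[i / max(s, eps) for i, s in zip(i_row, s_row)]
            for i_row, s_row in zip(image, shading)]


# Material-dependent albedo and a lighting-dependent shading map.
reflectance = [[0.8, 0.2], [0.5, 0.5]]
shading = [[1.0, 1.0], [0.5, 0.25]]

image = compose(reflectance, shading)            # what the camera observes
recovered = recover_reflectance(image, shading)  # albedo with lighting removed
```

Two pixels with the same reflectance (bottom row) appear with different intensities in `image` because their shading differs; dividing by the shading restores the common value.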
One can differentiate between single-image and
multi-image approaches for artifact removal. Single-
image approaches use prior knowledge to identify
specularities (Artusi et al., 2011) and shadows (Fin-
layson et al., 2009), relying heavily on assumptions
about the appearance of said artifacts. Multi-image
approaches can use statistical properties (Weiss,
2001) or optimization (Yu, 2016) to combine infor-
mation from all images for reconstruction.
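As a simplified illustration of how statistics over several aligned images can suppress artifacts, the sketch below takes the per-pixel median of a short sequence. It is a stand-in for the general idea only and does not reproduce the method of any cited work; the name `temporal_median` and the toy data are assumptions made for this example.

```python
from statistics import median


def temporal_median(images):
    """Per-pixel median over a list of aligned, equally sized 2D images."""
    height, width = len(images[0]), len(images[0][0])
    return [[median(img[y][x] for img in images) for x in range(width)]
            for y in range(height)]


# Three aligned 1x3 frames of the same surface.
frames = [
    [[0.6, 0.4, 0.9]],  # clean frame
    [[0.2, 0.4, 0.9]],  # shadow darkens the first pixel
    [[0.6, 0.4, 1.0]],  # specular highlight saturates the last pixel
]
restored = temporal_median(frames)
```

The median ignores an artifact as long as it affects a given pixel in fewer than half of the frames, which is why such approaches benefit from additional images.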
Deep learning approaches that do not rely on
rigid assumptions are used in many state-of-the-art
image processing tasks. Convolutional Neural Net-
works (CNNs) have been successfully used on sin-
gle images for shadow removal (Qu et al., 2017;
Hu et al., 2019) and specularity removal (Lin et al.,
2019). They have also been applied to intrinsic im-
age decomposition (Lettry et al., 2018). However,
none of these methods gives a general solution for
artifact removal. Moreover, there are few methods
that use multi-image approaches, even though addi-
tional images could provide more information for the
reconstruction. Furthermore, some objects, such as
paintings, photographs, or posters, can contain shad-
ows, specularities, and various objects as stylistic el-
ements. Single-image methods could distort parts of
the content by mistake. For example, without using
additional images, it can be impossible to differentiate
between a shadow that has been cast onto a book
cover and a shadow that is part of the book cover’s
content.
Our use case combines all of the previous problems:
varying illumination, shadows, specularities, and
occlusions. This work proposes a universal deep learning
approach that uses input sequences to solve this
more complex problem.
There are deep learning models that learn to trans-
form image sequences into single images (Chang and
Luo, 2019; Wang et al., 2018; Xingjian et al., 2015).
However, many methods that rely on RNNs, LSTMs,
transformers, or 3D convolutions do not enforce per-
mutation invariance or cannot handle dynamic se-
quence length. Moreover, many models are not very
memory efficient. They either require low resolution
images or they only process each image individually,
discarding a lot of information. Permutation invari-
ant CNNs have also been successfully used for image
deblurring (Aittala and Durand, 2018). However, the
architecture proposed there can only handle low-resolution
images.
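Permutation invariance and support for variable sequence lengths can both be obtained by applying a shared encoder to every image and combining the results with a symmetric operation such as an element-wise maximum. The sketch below shows this pattern on toy data. The `encode` function is a hypothetical placeholder for a learned per-image network, and the choice of max pooling is an assumption for illustration, not a description of the architecture proposed in this work.

```python
import random


def encode(image):
    """Placeholder for a shared per-image encoder (here a fixed pixel-wise map)."""
    return [[2.0 * p + 0.1 for p in row] for row in image]


def pool_max(feature_maps):
    """Symmetric pooling: element-wise maximum across the sequence."""
    return [[max(values) for values in zip(*rows)]
            for rows in zip(*feature_maps)]


def aggregate(images):
    """Encode each image independently, then pool over the whole set."""
    return pool_max([encode(img) for img in images])


sequence = [
    [[0.1, 0.5], [0.3, 0.2]],
    [[0.4, 0.1], [0.0, 0.9]],
    [[0.2, 0.2], [0.6, 0.1]],
]
shuffled = sequence[:]
random.Random(0).shuffle(shuffled)

same_result = aggregate(sequence) == aggregate(shuffled)  # order does not matter
shorter = aggregate(sequence[:2])                         # any length is accepted
```

Because the maximum is commutative and associative, the pooled result depends only on the set of inputs, not on their order or count.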
Our work provides the following main contribu-
tions:
1. Our architecture removes shadows, occlusions,
and specularities simultaneously.
2. A synthetic dataset is created, using a 3D pipeline
to generate artificial image distortions. The
dataset can be used for pre-training machine
learning models.
3. A dataset with real distortions is created, using
commodity hardware.
4. We provide a general purpose deep learning archi-
tecture for image reconstruction from image se-
quences. The architecture is permutation invari-
ant, robust to varying sequence lengths, and ro-
bust to varying resolutions.
5. We show for our use case that one can overcome
data scarcity using pre-training on synthetic data.
6. The architecture was optimized to process image
sequences of at least 4K resolution. We provide a
simple online algorithm for processing arbitrarily
long image sequences using a constant memory
consumption.
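One way to process an arbitrarily long sequence with constant memory is to fold each new element into a single running accumulator instead of storing the whole sequence. The sketch below does this with an incremental mean. It is an assumption-laden illustration: the name `online_mean` is hypothetical, and the pooling operation and the details of the online algorithm used in this work are not specified in this section.

```python
def online_mean(feature_stream):
    """Fold a stream of equally sized 2D maps into their mean, one at a time.

    Only one accumulator map is kept, so memory use does not grow
    with the length of the stream.
    """
    mean, count = None, 0
    for fmap in feature_stream:
        count += 1
        if mean is None:
            mean = [row[:] for row in fmap]  # copy the first map
        else:
            for y, row in enumerate(fmap):
                for x, value in enumerate(row):
                    # Incremental update: m_n = m_{n-1} + (v - m_{n-1}) / n
                    mean[y][x] += (value - mean[y][x]) / count
    return mean


def frame_generator(n):
    """Yield n tiny 1x2 maps lazily, so the full sequence never sits in memory."""
    for k in range(n):
        yield [[float(k), 1.0]]


result = online_mean(frame_generator(1000))
```

The same pattern works for any pooling operation that can be updated incrementally, for example a running maximum.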
3 DATASET
3.1 Synthetic Data
To the best of our knowledge, there is no la-
beled dataset containing aligned images with shad-
ows, specularities, and occlusions, together with
corresponding ground truth. We therefore use a
dataset consisting of 207,572 images of book cov-
ers taken from Amazon (Iwana et al., 2016). We add
artificially-generated artifacts to these images and use
the original book cover as ground truth. The dataset
contains varying illuminations, occlusions and shad-
ows. Figure 1 shows how we create a 3D scene: we
position a plane such that it perfectly covers the im-
age plane when projected and apply one of our book
covers to this plane as a texture. We then generate
multiple point light sources of varying position, in-
tensity, and color. Afterwards, we position a random
object between the image plane and the book plane.
Specularity, Shadow, and Occlusion Removal from Image Sequences using Deep Residual Sets