ditive distortions. Some methods model the problem
as an intrinsic image decomposition, that is, an image
is decomposed into a reflectance image (albedo) and
a shading image. Reflectance describes the amount
of light an object reflects; it is an intrinsic value that
depends only on the object’s material. Shading is a
varying property that depends on the lighting condi-
tions and the position of objects relative to the light
sources. Background subtraction methods deal with
a similar problem. Given a series of images with a
dynamic foreground, the background, which is assumed
to be constant, has to be extracted.
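The decomposition described above is commonly written as a pixel-wise product, I = R * S, where R is the reflectance and S the shading. The text does not state this formula, so the multiplicative model is an assumption here. The following minimal sketch uses it on a tiny grayscale image stored as nested lists; the function names and pixel values are hypothetical.

```python
def compose(reflectance, shading):
    """Recombine an image pixel-wise as I = R * S (assumed multiplicative model)."""
    return [[r * s for r, s in zip(r_row, s_row)]
            for r_row, s_row in zip(reflectance, shading)]


def recover_reflectance(image, shading, eps=1e-8):
    """Invert the model when the shading is known: R = I / S."""
    return [[i / max(s, eps) for i, s in zip(i_row, s_row)]
            for i_row, s_row in zip(image, shading)]


# Hypothetical 2x2 example: the lower-left pixel lies in shadow (shading 0.25).
reflectance = [[0.8, 0.2], [0.5, 0.5]]
shading = [[1.0, 1.0], [0.25, 1.0]]

image = compose(reflectance, shading)
recovered = recover_reflectance(image, shading)
```

In practice the shading is unknown, which is what makes the decomposition an ill-posed estimation problem rather than a simple division.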
One can differentiate between single-image and
multi-image approaches for artifact removal. Single-
image approaches use prior knowledge to identify
specularities (Artusi et al., 2011) and shadows (Fin-
layson et al., 2009), relying heavily on assumptions
about the appearance of said artifacts. Multi-image
approaches can use statistical properties (Weiss,
2001) or optimization (Yu, 2016) to combine infor-
mation from all images for reconstruction.
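To illustrate why several aligned images help, the sketch below combines a short sequence with a per-pixel median. This is a deliberately simple stand-in chosen for illustration; it is not the method of Weiss (2001) or Yu (2016). It assumes that the images are aligned, equally sized, and grayscale, and that each artifact affects a given pixel in fewer than half of the images.

```python
from statistics import median


def median_composite(images):
    """Per-pixel median over aligned, equally sized grayscale images."""
    height, width = len(images[0]), len(images[0][0])
    return [[median(img[y][x] for img in images) for x in range(width)]
            for y in range(height)]


# Hypothetical 1x3 images of the same surface (true value 0.6 everywhere).
clean = [[0.6, 0.6, 0.6]]
shadowed = [[0.1, 0.6, 0.6]]   # shadow darkens the first pixel
specular = [[0.6, 0.6, 1.0]]   # highlight saturates the last pixel

composite = median_composite([clean, shadowed, specular])
```

The median is unaffected by the order of the images, but it fails as soon as an artifact covers the same pixel in most of the sequence.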
Deep learning approaches that do not rely on
rigid assumptions are used in many state-of-the-art
image processing tasks. Convolutional Neural Net-
works (CNNs) have been successfully used on sin-
gle images for shadow removal (Qu et al., 2017;
Hu et al., 2019) and specularity removal (Lin et al.,
2019). They have also been applied to intrinsic im-
age decomposition (Lettry et al., 2018). However,
none of these methods gives a general solution for
artifact removal. Moreover, there are few methods
that use multi-image approaches, even though addi-
tional images could provide more information for the
reconstruction. Furthermore, some objects, such as
paintings, photographs, or posters, can contain shad-
ows, specularities, and various objects as stylistic el-
ements. Single-image methods could distort parts of
the content by mistake. For example, without using
additional images, it can be impossible to differentiate
between a shadow that has been cast onto a book
cover and a shadow that is part of the book cover's
design.
Our use case combines all of the previous problems:
varying illumination, shadows, specularities, and
occlusions. This work proposes a universal deep
learning approach that uses input sequences to solve
this more complex problem.
There are deep learning models that learn to trans-
form image sequences into single images (Chang and
Luo, 2019; Wang et al., 2018; Xingjian et al., 2015).
However, many methods that rely on RNNs, LSTMs,
transformers, or 3D convolutions do not enforce per-
mutation invariance or cannot handle dynamic se-
quence length. Moreover, many models are not very
memory efficient. They either require low resolution
images or they only process each image individually,
discarding a lot of information. Permutation invari-
ant CNNs have also been successfully used for image
deblurring (Aittala and Durand, 2018). However, the
architecture proposed there can only handle
low resolution images.
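One common way to obtain permutation invariance and support for variable sequence lengths is to apply the same function to every image and merge the results with a symmetric operation such as an element-wise maximum. The sketch below illustrates only this general idea; the `encode` function is a hypothetical placeholder for a learned network, and nothing here should be read as the architecture of the cited works or of this paper.

```python
def encode(image):
    """Hypothetical shared per-image encoder: flatten and square each value."""
    return [p * p for row in image for p in row]


def pool_set(images):
    """Encode every image with the shared encoder, then take an element-wise max."""
    features = [encode(img) for img in images]
    return [max(values) for values in zip(*features)]


# Hypothetical 2x2 grayscale inputs.
a = [[0.1, 0.9], [0.4, 0.4]]
b = [[0.8, 0.2], [0.4, 0.6]]
c = [[0.3, 0.3], [0.7, 0.1]]

pooled_abc = pool_set([a, b, c])   # sequence of length 3
pooled_cab = pool_set([c, a, b])   # same set, different order
pooled_ab = pool_set([a, b])       # shorter sequence, same output size
```

Because the maximum ignores ordering and accepts any number of operands, the pooled output has a fixed size regardless of how many images are supplied or in which order.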
Our work provides the following main contributions:
1. Our architecture removes shadows, occlusions,
and specularities simultaneously.
2. A synthetic dataset is created, using a 3D pipeline
to generate artificial image distortions. The
dataset can be used for pre-training machine
learning models.
3. A dataset with real distortions is created, using
commodity hardware.
4. We provide a general purpose deep learning archi-
tecture for image reconstruction from image se-
quences. The architecture is permutation invari-
ant, robust to varying sequence lengths, and ro-
bust to varying resolutions.
5. We show for our use case that one can overcome
data scarcity using pre-training on synthetic data.
6. The architecture was optimized to process image
sequences of at least 4K resolution. We provide a
simple online algorithm for processing arbitrarily
long image sequences using constant memory.
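The online algorithm is not specified at this point in the text. As a hedged illustration of how constant memory can be achieved, the sketch below assumes that per-image features are merged with an element-wise maximum, which can be updated one image at a time. The function and variable names are hypothetical.

```python
def streaming_max(feature_stream):
    """Fold a stream of feature vectors into a running element-wise maximum.

    Only the current running result is kept, so memory use does not grow
    with the number of images in the stream.
    """
    pooled = None
    for features in feature_stream:
        if pooled is None:
            pooled = list(features)
        else:
            pooled = [max(p, f) for p, f in zip(pooled, features)]
    return pooled


def frames():
    """Hypothetical generator standing in for features computed image by image."""
    yield [0.2, 0.9, 0.1]
    yield [0.5, 0.3, 0.1]
    yield [0.4, 0.8, 0.7]


result = streaming_max(frames())
```

The same pattern works for any associative, commutative merge (for example a running sum with a counter for the mean), provided the model's pooling step has that form.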
3.1 Synthetic Data
To the best of our knowledge, there is no la-
beled dataset containing aligned images with shad-
ows, specularities, and occlusions, together with
corresponding ground-truth. We therefore use a
dataset consisting of 207,572 images of book cov-
ers taken from Amazon (Iwana et al., 2016). We add
artificially generated artifacts to these images and use
the original book cover as ground truth. The dataset
contains varying illuminations, occlusions and shad-
ows. Figure 1 shows how we create a 3D scene: we
position a plane such that it perfectly covers the im-
age plane when projected and apply one of our book
covers to this plane as a texture. We then generate
multiple point light sources of varying position, in-
tensity, and color. Afterwards, we position a random
object between the image plane and the book plane.
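The lighting step can be sketched as follows. This is not the 3D pipeline used for the dataset: it assumes a textured plane at z = 0 facing the camera, simple Lambertian shading with inverse-square falloff, and made-up sampling ranges for light position, intensity, and color. All names and ranges are hypothetical.

```python
import math
import random


def random_point_light(rng):
    """Sample one point light; all ranges are assumptions for illustration."""
    return {
        "position": (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0),
                     rng.uniform(0.5, 2.0)),          # always above the plane
        "intensity": rng.uniform(0.5, 1.5),
        "color": tuple(rng.uniform(0.7, 1.0) for _ in range(3)),
    }


def shade_plane_point(x, y, albedo, lights):
    """Lambertian RGB color of point (x, y, 0) on a plane with normal (0, 0, 1)."""
    rgb = [0.0, 0.0, 0.0]
    for light in lights:
        lx, ly, lz = light["position"]
        dx, dy, dz = lx - x, ly - y, lz
        dist2 = dx * dx + dy * dy + dz * dz
        cos_theta = dz / math.sqrt(dist2)              # normal . light direction
        irradiance = light["intensity"] * cos_theta / dist2
        for ch in range(3):
            rgb[ch] += albedo[ch] * light["color"][ch] * irradiance
    return tuple(min(1.0, v) for v in rgb)             # clip to displayable range


rng = random.Random(0)                                  # fixed seed for repeatability
lights = [random_point_light(rng) for _ in range(3)]
pixel = shade_plane_point(0.0, 0.0, (0.5, 0.5, 0.5), lights)
```

Shadows and occlusions from the object placed in front of the plane would additionally require visibility tests along each light and camera ray, which a rendering engine handles and this sketch omits.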
Specularity, Shadow, and Occlusion Removal from Image Sequences using Deep Residual Sets