Specularity, Shadow, and Occlusion Removal from Image Sequences using Deep Residual Sets

Monika Kwiatkowski and Olaf Hellwich

Computer Vision & Remote Sensing, Technische Universität Berlin, Marchstr. 23, Berlin, Germany

Keywords:

Deep Sets, Deep Learning, Image Reconstruction, Background Reconstruction, Artifact Removal.

Abstract:

When taking images of planar objects, the images are often subject to unwanted artifacts such as speculari-

ties, shadows, and occlusions. While there are some methods that specialize in the removal of each type of

artifact individually, we offer a generalized solution. We implement an end-to-end deep learning approach

that removes artifacts from a series of images using a fully convolutional residual architecture and Deep Sets.

Our architecture can be used as a general approach for many image restoration tasks and is robust to varying

sequence lengths and varying image resolutions. Furthermore, it enforces permutation invariance on the input

sequence. The architecture is optimized to process high resolution images. We also provide a simple online

algorithm that allows the processing of arbitrarily long image sequences without increasing the memory con-

sumption. We created a synthetic dataset as an initial proof-of-concept. Additionally, we created a smaller

dataset of real image sequences. To overcome the data scarcity of our real dataset, we use the synthetic data for pre-training our model. Our evaluations show that our model outperforms many state-of-the-art

methods that are used in related problems such as background subtraction and intrinsic image decomposition.

1 INTRODUCTION

When taking images of planar objects such as mag-

azines, paintings, posters, books, or facades, one is

confronted with many possible obstructions. Some of

these, such as specularities or shadows, may be due

to illumination, while others may be due to occlu-

sions. These effects can lead to information loss and

significantly reduce the image quality. It is often not possible to capture a single flawless image; however, obstructions such as specularities or occlusions usually vary with the viewpoint of the camera or move over time. Our proposed method aims to provide a

practical solution for reconstructing the content using

multiple partially-obstructed images.

We introduce a novel approach using deep learn-

ing that learns an end-to-end image transformation to

remove artifacts from a sequence of distorted images.

Due to the lack of an existing dataset, we generate

a synthetic dataset and a real dataset. Our synthetic

data creation process follows a realistic image forma-

tion model and creates complex artifacts containing

occlusions, specularities, shadows, and varying illu-

mination.

https://orcid.org/0000-0001-9808-1133

https://orcid.org/0000-0002-2871-9266

Training a deep neural network requires a large

amount of training data in order not to overﬁt. Our

real dataset is not large enough; however, we can

create an arbitrarily large amount of artiﬁcial data.

Therefore, we use a combined approach of pre-

training on artiﬁcial data (200,000 image sequences)

and only ﬁne-tuning on the signiﬁcantly smaller real

dataset (100 image sequences).

The proposed architecture uses the concept of

Deep Sets (Zaheer et al., 2017), making it robust to

varying lengths of input sequences and invariant to

permutation. We provide a memory-efﬁcient algo-

rithm for processing arbitrarily long input sequences.

Furthermore, the architecture is fully-convolutional,

i.e. it can handle images of varying resolution. We

compare our deep learning method to unsupervised

methods that are commonly used for outlier removal,

background subtraction, and intrinsic image decom-

position.

Kwiatkowski, M. and Hellwich, O. Specularity, Shadow, and Occlusion Removal from Image Sequences using Deep Residual Sets. DOI: 10.5220/0010822300003124. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 4: VISAPP, pages 118-125. ISBN: 978-989-758-555-5; ISSN: 2184-4321

2 RELATED WORK

Many methods deal with each problem individually, typically modeling shadows as multiplicative distortions of the original content, and specularities as additive distortions. Some methods model the problem

as an intrinsic image decomposition, that is, an image

is decomposed into a reﬂectance image (albedo) and

a shading image. Reﬂectance describes the amount

of light an object reﬂects; it is an intrinsic value that

depends only on the object’s material. Shading is a

varying property that depends on the lighting condi-

tions and the position of objects relative to the light

sources. Background subtraction methods deal with

a similar problem. Given a series of images with a

dynamic foreground, the background has to be ex-

tracted, which is assumed to be constant.
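The modeling assumptions described above can be illustrated with a small NumPy sketch. It is only a toy 2D composition and not a method from any of the cited works: it assumes a reflectance image in [0, 1], a multiplicative shading map, an additive specular term, and a binary occlusion mask. The function name `compose_observation` and all array values are hypothetical.

```python
import numpy as np

def compose_observation(reflectance, shading, specular, occluder, mask):
    """Toy image formation: multiplicative shading, additive specularity, occlusion."""
    lit = reflectance * shading + specular           # shadows scale, highlights add
    out = np.where(mask[..., None], occluder, lit)   # occluded pixels are replaced
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
h, w = 8, 8
reflectance = rng.uniform(0.2, 0.8, size=(h, w, 3))  # the content to recover

shading = np.ones((h, w, 1))
shading[:, : w // 2] = 0.5                           # left half lies in shadow

specular = np.zeros((h, w, 1))
specular[2:4, 5:7] = 0.6                             # small highlight

mask = np.zeros((h, w), dtype=bool)
mask[6:, 6:] = True                                  # occluder in one corner
occluder = np.zeros((h, w, 3))                       # black occluding object

observed = compose_observation(reflectance, shading, specular, occluder, mask)
```

In this toy model, a pixel that is neither shadowed, highlighted, nor occluded equals the reflectance, which is the quantity that intrinsic image decomposition and background subtraction both try to recover.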

One can differentiate between single-image and

multi-image approaches for artifact removal. Single-

image approaches use prior knowledge to identify

specularities (Artusi et al., 2011) and shadows (Fin-

layson et al., 2009), relying heavily on assumptions

about the appearance of said artifacts. Multi-image

approaches can use statistical properties (Weiss,

2001) or optimization (Yu, 2016) to combine infor-

mation from all images for reconstruction.

Deep learning approaches that do not rely on

rigid assumptions are used in many state-of-the-art

image processing tasks. Convolutional Neural Net-

works (CNNs) have been successfully used on sin-

gle images for shadow removal (Qu et al., 2017;

Hu et al., 2019) and specularity removal (Lin et al.,

2019). They have also been applied to intrinsic im-

age decomposition (Lettry et al., 2018). However,

none of these methods gives a general solution for

artifact removal. Moreover, there are few methods

that use multi-image approaches, even though addi-

tional images could provide more information for the

reconstruction. Furthermore, some objects, such as

paintings, photographs, or posters, can contain shad-

ows, specularities, and various objects as stylistic el-

ements. Single-image methods could distort parts of

the content by mistake. For example, without using

additional images, it can be impossible to differentiate between a shadow that has been cast onto a book

cover and a shadow that is part of the book cover’s

content.

Our use case contains a combination of all previous problems: varying illumination, shadows, specularities, and occlusions. This work proposes a universal deep learning approach that uses input sequences to solve this more complex problem.

There are deep learning models that learn to trans-

form image sequences into single images (Chang and

Luo, 2019; Wang et al., 2018; Xingjian et al., 2015).

However, many methods that rely on RNNs, LSTMs,

transformers, or 3D convolutions do not enforce per-

mutation invariance or cannot handle dynamic se-

quence length. Moreover, many models are not very

memory efﬁcient. They either require low resolution

images or they only process each image individually,

discarding a lot of information. Permutation invari-

ant CNNs have also been successfully used for image

deblurring (Aittala and Durand, 2018). However, the

proposed architecture can only handle low resolution

images.

Our work provides the following main contribu-

tions:

1. Our architecture removes shadows, occlusions,

and specularities simultaneously.

2. A synthetic dataset is created, using a 3D pipeline

to generate artiﬁcial image distortions. The

dataset can be used for pre-training machine

learning models.

3. A dataset with real distortions is created, using

commodity hardware.

4. We provide a general purpose deep learning archi-

tecture for image reconstruction from image se-

quences. The architecture is permutation invari-

ant, robust to varying sequence lengths, and ro-

bust to varying resolutions.

5. We show for our use case that one can overcome

data scarcity using pre-training on synthetic data.

6. The architecture was optimized to process image sequences of at least 4K resolution. We provide a simple online algorithm for processing arbitrarily long image sequences with constant memory consumption.

3 DATASET

3.1 Synthetic Data

To the best of our knowledge, there is no la-

beled dataset containing aligned images with shad-

ows, specularities, and occlusions, together with

corresponding ground-truth. We therefore use a

dataset consisting of 207,572 images of book cov-

ers taken from Amazon (Iwana et al., 2016). We add

artiﬁcially-generated artifacts to these images and use

the original book cover as ground truth. The dataset

contains varying illuminations, occlusions and shad-

ows. Figure 1 shows how we create a 3D scene: we

position a plane such that it perfectly covers the im-

age plane when projected and apply one of our book

covers to this plane as a texture. We then generate

multiple point light sources of varying position, in-

tensity, and color. Afterwards, we position a random

object between the image plane and the book plane.


Figure 1: Illustration of a random scene and the resulting

image. The white octahedron illustrates a point light source.

The pyramid shows the frustum of the perspective camera

and the image plane.

We create random reﬂectivity and roughness, which

affects the shininess of the texture and the brightness

of the book cover. These physical properties affect

the appearance of specularities and the effect of light-

ing on the underlying image. For occlusions, we use

a set of predeﬁned geometries such as spheres, cones,

planes, etc. and we randomly sample a shape for each

scene, setting the orientation, size, texture, and posi-

tion of the object at random. Although we use a ﬁnite

number of shapes as occlusions, there are inﬁnitely

many ways to position, scale and texturize them.

The dataset is not a perfect representation of a

realistic use-case, but it contains a broad variety of

image distortions, which makes it suitable for pre-

training our model. The pre-trained model can then

be ﬁne-tuned on the real dataset.

3.2 Real Data

We create an additional dataset containing real image

sequences of planar objects, such as book covers and

movie covers. The dataset contains 100 image se-

quences, each containing 11 images. One image is

free of distortions, while the other 10 contain shad-

ows, specularities, and occlusions. The images were

taken indoors. The occlusions were created by plac-

ing various objects on top of the planar object. Specu-

larities were created with lamps and ﬂashlights. Shad-

ows were cast onto the planar objects. The ground-

truth images were made by taking images of the pla-

nar objects under ambient illumination.

Each sequence was aligned using a feature based

method. We used a SIFT feature detector and

computed element-wise homographies between the

ground-truth image and all distorted images. In order

to further improve the alignment, we used a method

described by Schroeder et al. (2011). We then cropped the images so that they only

contain the content of the planar object.
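The homography estimation step can be sketched with the standard direct linear transform (DLT). This is a minimal illustration that assumes point correspondences are already available (for example from matched SIFT keypoints); it omits coordinate normalization, outlier rejection, and the multi-view refinement of Schroeder et al. (2011). The function names and the example matrix `h_true` are hypothetical.

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: find H (up to scale) with dst ~ H @ src from at least 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)          # null-space vector holds the 9 entries of H
    return h / h[2, 2]                # fix the scale so that H[2, 2] = 1

def apply_homography(h, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

# Synthetic check: recover a known homography from five exact correspondences.
h_true = np.array([[1.02, 0.01, 5.0],
                   [-0.02, 0.98, -3.0],
                   [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100], [50, 30]], dtype=float)
dst = apply_homography(h_true, src)
h_est = estimate_homography(src, dst)
```

With real feature matches the correspondences are noisy and contain outliers, so a robust estimator would normally wrap this step.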

4 DEEP LEARNING

4.1 Architecture

Our use-case requires the architecture to handle a dy-

namic number of input images. One possibility to im-

plement this would be to use models for sequential

data, such as recurrent neural networks or transform-

ers. However, RNNs and transformers are not permu-

tation invariant. Moreover, both architectures require

a lot of memory, limiting the maximal resolution of

the input images.

Therefore, we decided to use a Residual Network

for our architecture. Residual Networks (ResNets)

are used for many image restoration tasks; however,

they usually require a fixed number of input channels. To apply ResNets to dynamic input sequences, we adapted the concept of Deep Sets (Zaheer et al., 2017). Deep Sets enforce permutation invariance. Given a finite set $X = \{x_1, x_2, \cdots, x_N\}$, a function $f$ is permutation invariant if it can be decomposed as follows:

$$f(x_1, x_2, \cdots, x_N) = \rho\left(\sum_{i=1}^{N} \phi(x_i)\right) \tag{1}$$

ρ and φ describe deep learning models. Note that al-

though f takes an ordered sequence as input, the or-

der is irrelevant due to the commutative property of

the sum. Besides, neither ρ nor φ are dependent on N;

therefore, we can apply f to arbitrarily large image

sets.
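The decomposition in (1) can be checked numerically with a toy example. The sketch below assumes a single linear layer with ReLU for φ and a linear map for ρ, with random weights; these stand-ins and the names `W_phi`, `W_rho`, and `deep_set` are hypothetical and far simpler than the convolutional encoder and decoder used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(4, 8))   # hypothetical encoder weights
W_rho = rng.normal(size=(8, 4))   # hypothetical decoder weights

def phi(x):
    """Per-element encoder: linear layer followed by ReLU."""
    return np.maximum(x @ W_phi, 0.0)

def rho(e):
    """Decoder applied to the pooled embedding."""
    return e @ W_rho

def deep_set(xs):
    """f(x_1, ..., x_N) = rho(sum_i phi(x_i)), as in equation (1)."""
    return rho(sum(phi(x) for x in xs))

xs = [rng.normal(size=4) for _ in range(5)]
out = deep_set(xs)
out_shuffled = deep_set(xs[::-1])   # same set, different order
out_short = deep_set(xs[:3])        # fewer elements, same functions
```

The outputs for the original and the reversed order agree up to floating-point rounding, and the same `phi` and `rho` accept a set of any size.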

We created an architecture that follows this de-

composition. The architecture consists of an encoder

φ and a decoder ρ, which both make use of resid-

ual blocks. However, we replace the summation by

a mean:

$$f(x_1, x_2, \cdots, x_N) = \rho\left(\frac{1}{N}\sum_{i=1}^{N} \phi(x_i)\right) \tag{2}$$

The mean normalizes the embedding space and en-

forces a scale invariance. It has been shown that

ResNets are more robust to train than regular CNNs,

especially on data that is close to an identity mapping

(He et al., 2016). To increase the receptive ﬁeld of

our architecture, we use dilation. Yu et al. (2017) showed that residual networks with

dilation have an increased receptive ﬁeld and outper-

form most non-dilated models without increasing the

model complexity. Additionally, downsampling lay-

ers are used to reduce the dimensionality of the em-

bedding and to further increase the receptive ﬁeld.

The downsampling layers are particularly necessary

to reduce memory consumption in order to apply the

architecture on high resolution image sequences.
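The effect of dilation and downsampling on the receptive field can be made concrete with the usual receptive-field recurrence for stacked convolutions. The layer configurations below are hypothetical examples chosen for illustration; they are not the kernel sizes, dilation rates, or strides of the architecture in this paper.

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers: list of (kernel_size, dilation, stride) tuples, ordered input to output.
    """
    rf, jump = 1, 1                       # jump = spacing of output pixels in input pixels
    for kernel, dilation, stride in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

plain = receptive_field([(3, 1, 1)] * 4)                        # four 3x3 convolutions
dilated = receptive_field([(3, 2, 1)] * 4)                      # same layers, dilation 2
strided = receptive_field([(3, 1, 2), (3, 1, 2), (3, 1, 1)])    # two stride-2 layers first

print(plain, dilated, strided)  # 9 17 15
```

Dilation widens the receptive field without adding parameters, and strided layers multiply the contribution of every later layer, which is consistent with the motivation given above.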

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications


Figure 2 illustrates the encoder and decoder archi-

tecture and how they are combined. We use trans-

posed convolutions to upsample our feature maps.

Residual blocks decode the feature maps and generate

the resulting image. Every convolutional layer in the encoder and decoder is followed by a ReLU non-linearity. Our architecture is fully convolutional and can be applied to any image resolution.

[Figure 2 diagram. Encoder panel: Image, Conv 2D, two Strided Conv 2D layers, Residual Blocks 1 to 10, Embedding. Decoder panel: Embedding, Residual Blocks 1 to 5, two Transposed Conv 2D layers, Residual Blocks 1 to 6, two Conv 2D layers, Restored Image. Overall panel: Image 1, Image 2, ..., Image n are each passed through the Encoder, the embeddings are averaged (AVG), and the Decoder produces the Restored Image.]

Figure 2: The ﬁrst two images illustrate the encoder and

decoder architecture. The third image shows how the en-

coder and decoder are used as building blocks in the overall

architecture.

4.2 Training

All our models are trained on an NVIDIA GeForce RTX 2070 with 8 GB of memory. First, we train our

model on the synthetic data. We use 100,000 syn-

thetic image sequences for our training. The data

is split into 75% training data and 25% test data.

We use the Adam optimizer with default parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ and a learning rate $\lambda = 0.0001$. We use a decaying learning rate that is

reduced every 5 epochs by a factor of 10. The mean

squared error is used as optimization criterion. We

use varying sequence lengths of up to 9 images. We

use a batch size of 10 image sequences. All images

have a resolution of 256 × 256.
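The optimizer settings above can be written out as a short sketch. It assumes that the learning rate is divided by 10 at epochs 5, 10, and so on (counting from zero), which is one reading of "every 5 epochs", and it shows a single textbook Adam update with the stated default parameters. The function names are hypothetical, and no particular deep learning framework is implied.

```python
import numpy as np

def learning_rate(epoch, base_lr=1e-4, step=5, factor=0.1):
    """Step decay: multiply the learning rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for step counter t >= 1; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# First update of a scalar parameter with gradient 2.0 during epoch 0.
new_param, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1, lr=learning_rate(0))
```

On the first step the bias-corrected moments reduce the update to roughly the learning rate times the sign of the gradient, so `new_param` is close to 1.0 minus 0.0001.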

In order to apply our architecture to real images,

we ﬁne-tune our previous model on real data. We split

the data into 60% training data and 40% test data. We

train the model using the same parameters as before.

The images have a resolution of 1024 × 1024. The

higher resolution increases memory consumption significantly, so the model cannot process a batch of 10 image sequences simultaneously. Instead, we

emulate the batch size by processing individual image

sequences and aggregating the gradients calculated by

backpropagation. After every 10th image sequence

we perform the optimization step.
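The batch emulation described above is commonly called gradient accumulation. The sketch below illustrates the idea on a toy linear model with a mean squared error loss instead of the actual network; it assumes the accumulated gradients are averaged over the 10 sequences, which the text does not state explicitly. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))     # 10 toy samples standing in for 10 image sequences
y = rng.normal(size=10)
w = np.zeros(3)                  # toy model parameters

def grad_mse(w, X, y):
    """Gradient of the mean squared error of a linear model with respect to w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

# Gradient of one full batch of 10 samples.
batch_grad = grad_mse(w, X, y)

# Same gradient obtained by processing one sample at a time and accumulating.
accumulated = np.zeros_like(w)
for i in range(len(y)):
    accumulated += grad_mse(w, X[i:i + 1], y[i:i + 1])
accumulated /= len(y)            # average so the step matches a batch of 10

w_next = w - 1e-4 * accumulated  # a single optimization step after 10 samples
```

Only one sample needs to be held in memory at a time, at the cost of ten forward and backward passes per optimization step.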

4.3 Online Inference

The standard implementation of Deep Sets has a high

memory consumption, since an embedding $\phi(x_i)$ has

to be computed and stored in memory for each input

image before summation. One can optimize this by

replacing the mean in (2) with an iterative computa-

tion:

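$$e_0 := 0 \tag{3}$$

$$e_N = \frac{1}{N}\sum_{i=1}^{N} \phi(x_i) \tag{4}$$

$$e_N = e_{N-1} + \frac{\phi(x_N) - e_{N-1}}{N} \tag{5}$$

$$\Rightarrow f(x_1, \cdots, x_N) = \rho(e_N) \tag{6}$$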

A derivation of formula (5) can be found in Finch (2009). $e_N$ is the accumulated average of all encoder outputs up to the $N$-th input image, and it can be computed iteratively using the online update (5). Only the last accumulated result $e_{N-1}$ and the new encoding $\phi(x_N)$ have to be stored in memory, instead of all embeddings $\phi(x_1), \cdots, \phi(x_N)$.

With this method, deep sets can be efﬁciently ap-

plied on arbitrarily long sequences. Formula (6) also

describes the applicability of online inference to real-

time data. Using this method, we are able to pro-

cess image sequences with 4K resolution on our GPU.

Note that the model requires much more memory dur-

ing training for gradient computations.
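The online inference scheme of this section can be sketched in a few lines. The example assumes toy stand-ins for the encoder and decoder (`np.tanh` and a scaling function) instead of the trained networks, and the name `online_deep_set` is hypothetical. It compares the incremental running mean with the direct mean over all stored embeddings.

```python
import numpy as np

def online_deep_set(images, phi, rho):
    """Decode the running mean of phi(x_i), holding one embedding at a time."""
    e = 0.0                                  # e_0 := 0, equation (3)
    for n, x in enumerate(images, start=1):
        e = e + (phi(x) - e) / n             # running-mean update, equation (5)
    return rho(e)                            # equation (6)

# Toy stand-ins for the encoder and decoder (hypothetical, not the trained networks).
phi = np.tanh
rho = lambda e: 2.0 * e

rng = np.random.default_rng(0)
images = [rng.normal(size=(4, 4)) for _ in range(7)]

# The iterator mimics a stream: each image is seen once and never stored by the function.
online = online_deep_set(iter(images), phi, rho)

# Reference: compute and store all embeddings, then average, as in equation (4).
batch = rho(np.mean([phi(x) for x in images], axis=0))
```

Both results agree up to floating-point rounding, while the online variant keeps only the accumulated embedding and the current one in memory, independent of the sequence length.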

5 EVALUATION

5.1 Evaluation on Synthetic Data

We evaluate Residual Deep Set and several other

methods on our synthetic dataset using varying

lengths of input sequences n. From each image se-

quence, we randomly sample n images, which are

then used for reconstruction and evaluation; we repeat

this process 10 times for each image sequence. We

then compare the results of our architecture to those

of common approaches for outlier removal, back-

ground subtraction, and intrinsic image decomposi-

tion. Firstly, we use a pixel-wise median of the RGB

intensities for reference. Secondly, we use an intrin-

sic image decomposition method that uses a Maxi-

mum Likelihood Estimation (MLE) of the reﬂectance


(Weiss, 2001). Thirdly, Robust PCA (RPCA) is used. RPCA uses optimization to decompose an image into a low-rank image containing the content and a sparse image containing the artifacts (Bouwmans et al., 2018). RPCA is a state-of-the-art method that

has been used both in background subtraction and in-

trinsic image decomposition (Yu, 2016). In addition,

the pixel-wise mean of the input sequence is used for

comparison as a worst-case solution that only attenu-

ates artifacts.
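The behavior of the two pixel-wise baselines can be illustrated on a tiny grayscale sequence. The data below is a hypothetical toy example (the evaluation itself uses RGB images): the median recovers a pixel as long as the artifact affects a minority of the images, the mean only attenuates the artifact, and the median fails once the same pixel is corrupted in most images.

```python
import numpy as np

truth = np.full((4, 4), 0.5)                      # constant toy content
seq = np.stack([truth.copy() for _ in range(5)])  # five aligned observations
seq[0, :2, :2] = 1.0                              # bright artifact in image 0 only
seq[1, 2:, 2:] = 0.0                              # dark occlusion in image 1 only

median_img = np.median(seq, axis=0)               # pixel-wise median baseline
mean_img = seq.mean(axis=0)                       # pixel-wise mean baseline

# Failure case: the same pixel is occluded in 3 of the 5 images.
overlap = seq.copy()
overlap[2:5, 0, 3] = 0.0
median_overlap = np.median(overlap, axis=0)
```

Here `median_img` equals the ground truth everywhere, `mean_img` is shifted toward the artifact values in the affected regions, and `median_overlap[0, 3]` takes the occluder value because the clean observations are outvoted.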

We use the mean squared error (MSE), the struc-

tural similarity index (SSIM), and the peak signal-to-

noise ratio (PSNR) as quality measures. 1,000 image

sequences, each containing 9 images from a valida-

tion set, are used for evaluation. Figures 3, 4 and

5 show the average error for each model on varying

lengths of input sequences.
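Two of the three quality measures have simple closed forms, sketched below for images with intensities in [0, 1]. The functions `mse` and `psnr` and the `max_value` parameter are illustrative names; SSIM is omitted because it requires local window statistics and is usually taken from an image processing library.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

reference = np.zeros((8, 8, 3))
reconstruction = np.full((8, 8, 3), 0.1)   # constant error of 0.1 per pixel

error = mse(reference, reconstruction)
quality = psnr(reference, reconstruction)
```

For this example the MSE is 0.01 and the PSNR is 20 dB, up to floating-point rounding. Lower MSE and higher PSNR both indicate a reconstruction closer to the ground truth.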

Figure 3: MSE for each method applied on varying image

sequences of synthetic data.

Figure 4: SSIM for each method applied on varying image

sequences of synthetic data.

Figure 5: PSNR for each method applied on varying image

sequences of synthetic data.

Figure 6: Sequence of four distorted images from the syn-

thetic dataset with resulting reconstructions and error met-

rics.

Figure 7: Sequence of eight distorted images from the syn-

thetic dataset with resulting reconstructions and error met-

rics.

Our evaluation shows that the Deep Set architecture performs consistently better than all of the unsupervised methods across all metrics.

As expected, the pixel-wise mean gives the worst re-

sults. Figures 6 and 7 illustrate the difﬁculty of the

reconstruction for classical outlier removal methods.

The artifacts overlap frequently and the underlying

content is rarely seen uncorrupted. The ﬁgures both

contain very complex illumination and specularities.

Even when parts of the image do not contain artifacts

such as shadows, occlusions, or specularities, the re-

construction is still ill-posed due to varying illumina-

tion. In ﬁgure 6, the statistical methods are unable to

remove the overlapping occlusions, while Deep Set is

able to extract the relevant content.

Since the encoder is applied on each image inde-

pendently, it is reasonable to assume that the averaged

embedding space of the encoders contains attenuated

features of artifacts, similar to the average in the RGB

color space. However, the result in ﬁgure 6 indicates

that the decoder is able to extract the real content from

the corrupted embedding. Moreover, the occlusions

are removed, despite their overlapping in three out of


four images. This implies that the architecture does

not solely rely on a pixel-wise consensus, but also

uses contextual information in each image.

Note that RPCA performs poorly on our synthetic dataset. This is likely because RPCA assumes sparse distortions. Since our image sequences are relatively short compared to those in other background subtraction tasks, many distortions are not considered sparse by RPCA.

Figure 7 shows a sequence in which the illumination distorts the homogeneous color of the book. Although it is impossible to extract the exact color, Deep

Set is the only method able to generate an image

with a homogeneous color. This requires a high-

level understanding of the image. We assume that the

large receptive ﬁeld enables Deep Set to understand

the broader context of each image and makes it less

prone to errors inherent in methods that are based on pixel-wise statistics, e.g. mean, median, or MLE

(Weiss, 2001).

5.2 Evaluation on Real Data

We use the 40 image sequences from our test set for

evaluation. All images have a resolution of 1024 ×

1024. We follow the same evaluation steps as for the

synthetic data. Figures 8, 9 and 10 show the results of

our evaluation.

Figure 8: MSE for each method applied on varying image

sequences of real data.

Figure 9: SSIM for each method applied on varying image

sequences of real data.

Figure 10: PSNR for each method applied on varying image sequences of real data.

The evaluation on the real input images shows that Deep Sets have the best performance with regard to MSE and PSNR. MLE gives slightly better results with regard to SSIM on longer input sequences. However, the examples in figures 11 and 12 show that no

metric fully captures the quality of the reconstruction.

In ﬁgure 11, the result of Deep Set has a much lower

MSE error than all of the other methods, but has the

same SSIM as MLE. In ﬁgure 12, RPCA has the re-

sult with the lowest MSE, although it contains more

ghosting artifacts than the results of MLE and Deep

Sets. Although the results of Deep Sets look promis-

ing, this suggests that MSE is not the most suitable

metric for optimizing our architecture. The examples also show that outlier removal methods such as RPCA or median filtering, which are also used in background subtraction, cannot remove varying illumination. The evaluation also confirms that one can compensate for the lack of training data (with only 60 image sequences available for training) using a synthetic dataset.

Figure 11: Sequence of four distorted images from the real dataset with resulting reconstructions, ground truth image, and error metrics.

Figure 12: Sequence of eight distorted images from the real dataset with reconstructed images, ground truth image, and error metrics.

6 ABLATION STUDY

In our ablation study, we tested various depths for

the encoder and decoder architecture. It is a trade-

off between reconstruction quality and memory-

consumption. The conﬁguration shown in ﬁgure 2 is

best suited for our use case. Further increasing the

number of residual blocks for either component did

not signiﬁcantly improve the quality of the resulting

model. However, the memory consumption increases

signiﬁcantly with the depth of the encoder, because

the feature maps for each image in the sequence have

to be computed. Reducing the number of residual

blocks results in a worse reconstruction quality. Fur-

thermore, we evaluated the effect of dilation in our

residual blocks. Dilation signiﬁcantly improves the

quality of the reconstruction, without changing the

number of parameters of the model, by increasing the

receptive ﬁeld. Our Deep Residual Sets therefore an-

alyze a larger context of the images, which allows

them to better distinguish between a distorted image

patch and an undistorted one. Additionally, we tried

adding and removing downsampling layers (adjusting

upsampling layers accordingly). Increasing the num-

ber of downsampling layers reduces the quality of the

reconstruction due to information loss. However, not

using downsampling layers had no noticeable effect

on the reconstruction quality; it only increased the

memory consumption of the architecture. Memory

consumption is a limiting factor for our architecture.

It limits the batch size and maximum sequence length

of our model during training. Additionally, it limits

the maximum resolution our architecture can handle.

7 CONCLUSION AND FUTURE

WORK

In this paper, we introduced a deep learning architec-

ture, which can successfully learn to remove shadows,

specularities, and occlusions from image sequences.

The architecture uses residual blocks and the concept

of Deep Sets (Zaheer et al., 2017). The architecture

enforces permutation invariance and can be applied to

dynamic input sequences and high resolution images.

In section 4.3, we showed a memory-efficient method for applying our architecture to arbitrarily long image sequences, including streaming data and high-resolution images, without an increase in memory consumption.

We created a synthetic dataset containing complex

illumination, occlusions, shadows, and specularities

with corresponding ground truth data. The synthetic

dataset was initially created to establish a proof-of-

concept for our architecture and was later used for

pre-training the model. Our evaluation shows that

a supervised method can outperform unsupervised

methods for outlier removal, background subtraction,

and intrinsic image decomposition. Although the re-

construction is ambiguous and ill-posed, the model

was still able to generate images that were visually

consistent, see ﬁgure 7.

Furthermore, we evaluated our model on a real

dataset. Deep Set was able to compete with the ex-

isting methods. We showed that one can compensate

for the lack of a large dataset using synthetic data.

The model was pre-trained on synthetic data and ﬁne-

tuned on the real data, resulting in a superior recon-

struction compared to unsupervised methods. This

method of pre-training on a large augmented dataset

combined with ﬁne-tuning on a small real dataset is

especially helpful for use cases where it is hard or

impossible to obtain large datasets. We have shown

that Deep Sets are a simple and efﬁcient method to

improve on existing deep learning models in image

restoration.

In future work, we are interested in applying Deep

Sets to other computer vision tasks utilizing multiple

images, such as super-resolution, background extrac-

tion, or panorama stitching. Introducing adversarial

loss could further enforce visually coherent results,

rather than exact reconstructions.

REFERENCES

Aittala, M. and Durand, F. (2018). Burst image deblurring

using permutation invariant convolutional neural net-

works. In Proceedings of the European Conference on

Computer Vision (ECCV), pages 731–747.

Artusi, A., Banterle, F., and Chetverikov, D. (2011). A

survey of specularity removal methods. In Computer

Graphics Forum, volume 30, pages 2208–2230. Wiley

Online Library.

Bouwmans, T., Javed, S., Zhang, H., Lin, Z., and Otazo,

R. (2018). On the applications of robust pca in im-

age and video processing. Proceedings of the IEEE,

106(8):1427–1457.

Chang, Y. and Luo, B. (2019). Bidirectional convolutional


lstm neural network for remote sensing image super-

resolution. Remote Sensing, 11(20):2333.

Finch, T. (2009). Incremental calculation of weighted mean

and variance.

Finlayson, G. D., Drew, M. S., and Lu, C. (2009). Entropy

minimization for shadow removal. International Jour-

nal of Computer Vision, 85(1):35–57.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition. In Proceedings of

the IEEE conference on computer vision and pattern

recognition, pages 770–778.

Hu, X., Jiang, Y., Fu, C.-W., and Heng, P.-A. (2019). Mask-

shadowgan: Learning to remove shadows from un-

paired data. arXiv preprint arXiv:1903.10683.

Iwana, B. K., Rizvi, S. T. R., Ahmed, S., Dengel, A., and

Uchida, S. (2016). Judging a book by its cover. arXiv

preprint arXiv:1610.09204.

Lettry, L., Vanhoey, K., and Van Gool, L. (2018). Darn:

a deep adversarial residual network for intrinsic im-

age decomposition. In 2018 IEEE Winter Conference

on Applications of Computer Vision (WACV), pages

1359–1367. IEEE.

Lin, J., Seddik, M. E. A., Tamaazousti, M., Tamaazousti, Y.,

and Bartoli, A. (2019). Deep multi-class adversarial

specularity removal. In Scandinavian Conference on

Image Analysis, pages 3–15. Springer.

Qu, L., Tian, J., He, S., Tang, Y., and Lau, R. W. (2017). De-

shadownet: A multi-context embedding deep network

for shadow removal. In Proceedings of the IEEE Con-

ference on Computer Vision and Pattern Recognition,

pages 4067–4075.

Schroeder, P., Bartoli, A., Georgel, P., and Navab, N.

(2011). Closed-form solutions to multiple-view ho-

mography estimation. In 2011 IEEE Workshop on

Applications of Computer Vision (WACV), pages 650–

657.

Wang, W., Shen, J., Guo, F., Cheng, M.-M., and Borji,

A. (2018). Revisiting video saliency: A large-scale

benchmark and a new model. In Proceedings of the

IEEE Conference on Computer Vision and Pattern

Recognition, pages 4894–4903.

Weiss, Y. (2001). Deriving intrinsic images from image se-

quences. In Proceedings Eighth IEEE International

Conference on Computer Vision. ICCV 2001, vol-

ume 2, pages 68–75. IEEE.

Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-

K., and Woo, W.-c. (2015). Convolutional lstm net-

work: A machine learning approach for precipitation

nowcasting. In Advances in neural information pro-

cessing systems, pages 802–810.

Yu, F., Koltun, V., and Funkhouser, T. (2017). Dilated

residual networks. In Proceedings of the IEEE con-

ference on computer vision and pattern recognition,

pages 472–480.

Yu, J. (2016). Rank-constrained pca for intrinsic images de-

composition. In 2016 IEEE International Conference

on Image Processing (ICIP), pages 3578–3582. IEEE.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B.,

Salakhutdinov, R. R., and Smola, A. J. (2017). Deep

sets. In Advances in neural information processing

systems, pages 3391–3401.
