Evaluation of RGB and LiDAR Combination for Robust Place

Recognition

Farid Alijani

1 a

, Jukka Peltom

aki

1 b

, Jussi Puura

, Heikki Huttunen

3 c

Joni-Kristian K

ainen

1 d

and Esa Rahtu

1 e

Tampere University, Finland

Sandvik Mining and Construction Ltd, Finland

Visy Oy, Finland

Keywords:

Visual Place Recognition, Image Retrieval, Deep Convolutional Neural Network, Deep Learning for Visual

Understanding.

Abstract:

Place recognition is one of the main challenges in localization, mapping and navigation tasks of self-driving

vehicles under various perceptual conditions, including appearance and viewpoint variations. In this paper,

we provide a comprehensive study on the utility of ﬁne-tuned Deep Convolutional Neural Network (DCNN)

with three MAC, SpoC and GeM pooling layers to learn global image representation for place recognition in

an end-to-end manner using three different sensor data modalities: (1) only RGB images; (2) only intensity

or only depth 3D LiDAR point clouds projected into 2D images and (3) early fusion of RGB images and

LiDAR point clouds (both intensity and depth) to form a uniﬁed global descriptor to leverage robust features

of both modalities. The experimental results on a diverse and large long-term Oxford Radar RobotCar dataset

illustrate an achievement of 5 m outdoor place recognition accuracy with high recall rate of 90 % using early

fusion of RGB and LiDAR sensor data modalities when ﬁne-tuned network with GeM pooling layer is utilized.

1 INTRODUCTION

Place recognition is a fundamental component in

the long-term navigation stack of the robotic sys-

tems in real-world applications, ranging from au-

tonomous vehicles to drones and computer vision sys-

tems (Lowry et al., 2016). Used for variety of appli-

cations such as localization, image retrieval and loop

closure detection in GPS denied environments, it is

the process of recognizing a previously visited place

using visual content, often under varying appearance

conditions and viewpoint changes with certain com-

putational constraints.

A common practice to obtain a precise location of

an agent in an unknown environment is to ﬁrst collect

a database of images with different sensor modalities

such as camera or LiDAR, stamped with their pre-

cise GNSS/INS or odometry positions. Then given a

https://orcid.org/0000-0003-3928-7291

https://orcid.org/0000-0002-9779-6804

https://orcid.org/0000-0002-6571-0797

https://orcid.org/0000-0002-5801-4371

https://orcid.org/0000-0001-8767-0864

query image or LiDAR scan of a place, we search the

stored database to retrieve the best match which re-

veals the exact pose of that query with respect to the

reference database (gallery).

Robust and efﬁcient feature representation meth-

ods with powerful discriminatory performance are

crucial to solve the place recognition task. These

can mainly be categorized as image-based and point

cloud-based approaches (Liu et al., 2019). Compared

to the conventional approaches (Cummins and New-

man, 2008; Milford and Wyeth, 2012) and the deep

learning approaches (Arandjelovic et al., 2018) ap-

plied to RGB images, learning local or global repre-

sentations of raw LiDAR point clouds for place recog-

nition is very challenging and still an open research

question (Zou et al., 2019) due to its irregular un-

ordered structure and lack of robust descriptors.

Although raw LiDAR point clouds suffer from

lacking the detailed texture information compared to

RGB images and similar corner or edge features may

easily lead to false positives in the LiDAR-based

place recognition, the availability of precise depth

information enables more accurate place recognition

using LiDAR point clouds compared to RGB images.

650

Alijani, F., Peltomäki, J., Puura, J., Huttunen, H., Kämäräinen, J. and Rahtu, E.

Evaluation of RGB and LiDAR Combination for Robust Place Recognition.

DOI: 10.5220/0010909100003124

In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 5: VISAPP, pages

650-658

ISBN: 978-989-758-555-5; ISSN: 2184-4321

Furthermore, the geometric information of the Li-

DAR point clouds is invariant to enormous illumina-

tion changes which leads to more robust place recog-

nition over different times, days and seasons in an en-

tire year.

Following (Shi et al., 2015; Su et al., 2015), we

project 3D LiDAR point clouds into 2D images and

apply 2-dimensional DCNN to obtain an improved

global feature retrieval performance. In this paper,

our contribution is to provide a comprehensive study

on the utility of the ﬁne-tuned DCNNs with three

MAC, SpoC and GeM pooling layers to learn global

image representations for place recognition in an end-

to-end manner for three different sensor data modali-

ties: (1) only RGB images; (2) only intensity or depth

3D point clouds projected into 2D images and (3)

early fusion of RGB images and LiDAR point clouds

(both intensity and depth) to form a uniﬁed global de-

scriptor to leverage robust features of both modalities.

The rest of this paper is organized as follows. Sec-

tion 2 brieﬂy discusses the related work. In section 3,

we provide the baseline method along with section 4

which explains the real-world outdoor datasets to ad-

dress the visual place recognition task. In section 5,

we show the experimental results and conclude the

paper in section 6.

2 RELATED WORK

Image-based Place Recognition. One of the most

crucial parts of a place recognition system is the im-

age representation, similar to the most visual recog-

nition tasks, including image retrieval, image clas-

siﬁcation and object detection (Lowry et al., 2016;

Zhang et al., 2021). Conventional methods extract

handcrafted local invariant features (Lowe, 2004; Bay

et al., 2008) and aggregate them into global descrip-

tors (Filliat, 2007; J

egou et al., 2010; J

egou et al.,

2010; Torii et al., 2015) as image representation.

The main problem with traditional handcrafted fea-

tures is that they are not robust enough with respect

to the environmental variations such as lighting con-

ditions, scales and viewpoints (Masone and Caputo,

2021). With the rise of deep learning methods, DC-

NNs outperformed conventional approaches to learn

deep and compact visual representation (Radenovi

et al., 2016; Radenovi

c et al., 2018; Radenovi

c et al.,

2019; Kalantidis et al., 2016; Tolias et al., 2016b;

Arandjelovic et al., 2018).

Pointcloud-based Place Recognition. LiDAR-based

place recognition has become a compelling research

topic, over the past few years, thanks to its irre-

placeable data structure. It contains informative 3D

structural information of the environment and is more

robust against illumination and seasonal variations.

Compared to RGBD cameras, laser sensors provide

longer working range which makes them suitable, es-

pecially, for perception of outdoor scenes. Few of

the recent work which concentrated on learning deep

descriptors of 3D point clouds are (He et al., 2016;

Dewan et al., 2018; Klokov and Lempitsky, 2017).

Dub

e et al. (Dub

e et al., 2017) propose SegMatch as a

technique for enabling autonomous vehicles to recog-

nize previously visited areas based on the extraction

and matching of 3D segments of LiDAR point clouds.

SegMatch can recognize places at object-level even

though there is no intact object.

Uy and Lee (Uy and Lee, 2018) integrate Point-

Net (Charles et al., 2017) and NetVLAD (Arand-

jelovic et al., 2018), to obtain PointNetVLAD in order

to tackle place recognition in large-scale scenes. It

extracts discriminative global representations of raw

LiDAR data. The authors formulate place recognition

as a metric learning problem and present a lazy triplet

and quadruplet loss function to train the proposed

network end-to-end. It, however, does not consider

the local structure information and ignores the spa-

tial distribution of local features. PCAN (Zhang and

Xiao, 2019) improves PointNetVLAD by learning an

attention map for aggregation, using an architecture

inspired by PointNet++ (Qi et al., 2017). Liu et

al. (Liu et al., 2019) present LPD-Net to learn global

descriptors from 3D point clouds. Compared with

PointNetVLAD (Uy and Lee, 2018), LPD-Net con-

siders the spatial distribution of similar local struc-

tures, which is capable of improving the recognition

performance and gaining more robustness with re-

spect to weather or illumination changes.

Compared to the image-based place recognition,

the LiDAR-based approaches are still growing. Al-

though handcrafted 3D descriptors have been used for

recognition tasks (Rusu et al., 2009), using classical

global pooling techniques such as GeM (Radenovi

et al., 2019) pooling layer applied to LiDAR-based or

image-LiDAR-based approaches is still relatively un-

touched (Martinez et al., 2020).

3 PROCESSING PIPELINE

In this work, we concentrate on the deep image repre-

sentation obtained by DCNN in which given an input

an image, it produces a global descriptor to describe

the visual content of the image. For training, Raden-

ovic et al. (Radenovi

c et al., 2019) adopt the Siamese

neural network architecture. The Siamese architec-

ture is trained using positive and negative image pairs

Evaluation of RGB and LiDAR Combination for Robust Place Recognition

651

Figure 1: Overview of the network architecture with RGB image and LiDAR point clouds as sensor data modalities. Each

individual RGB image from the dataset is concatenated with its corresponding LiDAR point clouds which then input to the

DCNN to calculate the feature vector.

and the loss function enforces large distances between

negative pairs (images from two distant places) and

small distances between positive pairs (images from

the same place). Radenovic et al. (Radenovi

c et al.,

2019) use the contrastive loss (Hadsell et al., 2006)

that acts on matching (positive) and non-matching

(negative) pairs and is deﬁned as follows:

L =

(

) for matching images

max



0,M − l(

)



otherwise

(1)

where l is the pair-wise distance term (Euclidean

distance) and M is the enforced minimum margin be-

tween the negative pairs.

and

denote the deep

feature vectors of images I

and I

computed using

the convolutional head of a backbone network such

as AlexNet, VGGNet or ResNet. The typical feature

vector lengths K are 256, 512 or 2048, depending on

the backbone. Feature vectors are global descriptors

of the input images and pooled over the spatial di-

mensions. The feature responses are computed from

K convolutional layers X

following with max pool-

ing layers that select the maximum spatial feature re-

sponse from each layer (MAC vector)

f = [ f

... f

], f

= max

x∈X

x . (2)

Radenovic et al. originally used the MAC vec-

tors (Radenovi

c et al., 2016), but in their more recent

paper (Radenovi

c et al., 2019) compared MAC vec-

tors to average pooling (SPoC vector) and generalized

mean pooling (GeM vector) and found that GeM vec-

tors provide the best average retrieval accuracy.

The Radenovic et al. main pipeline is shared by

the most deep metric learning approaches for image

retrieval, but the unique components are the proposed

supervised whitening post-processing and effective

positive and negative sample mining. More details are

described in (Radenovi

c et al., 2016) and (Radenovi

et al., 2018) and available in the code provided by the

original authors.

Radenovic et al. (Radenovi

c et al., 2019) propose

GeM pooling layer to modify Maximum activation

of convolutions (MAC) (Azizpour et al., 2015; To-

lias et al., 2016a) and sum-pooled convolutional fea-

tures (SPoC) (Yandex and Lempitsky, 2015). This is

a pooling layer which takes χ as an input and pro-

duces a vector f = [ f

, f

,..., f

]

as an output of

the pooling process which results in:

|χ

∑

x∈χ

(3)

MAC and SPoC pooling methods are special cases

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

652

of GeM depending on how pooling parameter p

derived in which p

→ ∞ and p

= 1 correspond to

max-pooling and average pooling, respectively. The

GeM feature vector is a single value per feature map

and its dimension varies depending on different net-

works, i.e. K = [256, 512, 2048]. It also adopts a

Siamese architecture to train the networks for image

matching.

4 EVALUATION PIPELINE

We evaluated our pipeline on a publicly available and

versatile outdoor dataset, e.g., Oxford Radar Robot-

Car (Barnes et al., 2020). We assigned three differ-

ent test sets, as query sequences, according to their

difﬁculty levels: (1) Test 01 with similar condition

to the gallery set (cloudy) but acquired at different

time; (2) Test 02 with moderately changed condi-

tions (sunny) and (3) Test 03 with different condition

from the gallery set (rainy). In the following, we de-

scribe the dataset and selection of training, gallery and

the three distinct query sequences along with training

process of our DCNN architecture.

4.1 Oxford Radar RobotCar Dataset

The Oxford Radar RobotCar dataset (Barnes et al.,

2020) is a radar extension to the Oxford RobotCar

dataset (Maddern et al., 2017). This dataset pro-

vides an optimized ground-truth using radar odome-

try data which is obtained from a Navtech CTS350-X

Millimetre-Wave FMCW. Data acquisition was per-

formed in January 2019 over 32 traversals in central

Oxford with a total route of 280 km urban driving.

This dataset addresses a variety of challenging con-

ditions including weather, trafﬁc, and lighting alter-

ations. It also contains several sensor modalities to

perceive an environment and localize an agent accu-

rately (Figure 2).

The combination of one Point Grey Bumblebee

XB3 trinocular stereo and three Point Grey Grasshop-

per2 monocular cameras provide a 360° visual cov-

erage of scenes around the vehicle platform. Along

with the image sensor modality, this dataset also com-

prises a pair of high resolution real-time 3D Velodyne

HDL-32E LiDARs for 3D scene understanding. 6D

poses are acquired by NovAtel SPAN-CPT inertial

and GPS navigation system at 50 Hz and generated

by performing a large-scale optimization with ceres

solver incorporating visual odometry, visual loop clo-

sures, and GPS/INS constraints with the resulting tra-

jectories shown in Figure 2 (a).

The three monocular Grasshopper2 cameras with

ﬁsheye lenses mounted on the rare side of the vehicle

are synchronized and logged 1024 × 1024 images at

an average frame rate of 11.1 Hz with 180° HFoV.

The Velodynes also provide 360° HFoV, 41.3°VFoV

with 100 m range and 2 cm range resolution for full

coverage around the vehicle.

In our evaluation pipeline, we are interested in

UNIX timestamp synchronized measured data of the

left-view camera images and left-side 3D LiDAR

point clouds, given their precise GPS/INS positions.

Therefore, in order to simplify our experiments, we

only obtain raw images from the left Point Grey

Grasshopper2 monocular camera along with raw 3D

point clouds from the left Velodyne LiDAR. The se-

lected sensor modalities points substantially towards

the left side of the road to encode the stable urban

environment characteristics, including buildings, cy-

clists, pedestrian trafﬁc, trafﬁc lights and passing-by

and/or parked cars.

4.2 Network Training

We ﬁne-tune our DCNN architecture using the MAC,

SPoC and GeM pooling layers and weights of the

ResNet50 backbone which is initially pre-trained

on ImageNet (Russakovsky et al., 2015) dataset

for image-based, point cloud-based and image-point

cloud-based models. The idea of this procedure is to

use a pre-trained model, which to some extent is capa-

ble of recognizing locations, and adapt it to the place

recognition problem after ﬁne-tuning.

We ﬁne-tune our DCNN using contrastive loss

with hard matching (positive) and hard non-matching

(negative) pairs to improve the obtained image repre-

sentation, taking advantage of variability in the train-

ing data. The number of positive matches is the same

as the number of images in the pool of queries from

which they are selected randomly, whereas the num-

ber of negative matches is always ﬁxed in the training.

Following (Radenovi

c et al., 2019), we learn

whitening through the same training data for two rea-

sons: (1) it mutually complement ﬁne-tuning to boost

the performance, (2) applying whitening as a post-

processing step expedites training compared to learn-

ing it in an end-to-end manner (Radenovi

c et al.,

2016). We also utilize trainable GeM pooling layer

which signiﬁcantly outperforms the retrieval perfor-

mance while preserving the dimension of the descrip-

tor. The comprehensive comparison results of MAC,

SPoC and GeM pooling layers are provided in Sec-

tion 5 for three test sets.

From the Oxford Radar RobotCar dataset, we

specify sequences for a training set for ﬁne-tuning,

Evaluation of RGB and LiDAR Combination for Robust Place Recognition

653

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

Figure 2: Three different sample tests, e.g., Test 01, Test 02 and Test 03 from Oxford Radard RobotCar outdoor dataset.

(a) Satellite-view with approximated GPS positions. (b), (e) and (h) Left-view of test images obtained from Grasshopper2:

cloudy, sunny, rainy, respectively. (c), (f) and (i) 3D LiDAR point clouds (intensity) obtained from Velodyne left. (d), (g) and

(j) 3D LiDAR point clouds (depth) obtained from Velodyne left.

a gallery set against which the query images from test

sequences are matched and three distinct test sets as

follows: (1) same day but different time, (2) differ-

ent day but approximately at the same time and (3)

different day and different time which also contains a

different weather condition. Table 1 summarizes dif-

ferent sets used for training, gallery and testing se-

quences. All experiments are conducted on a GPU

cluster with a single NVIDIA Quadro RTX 8000 GPU

with 32 GB memory using PyTorch 1.7 (Paszke et al.,

2019) deep learning framework.

Table 1: The Oxford Radar RobotCar sequences used in

our experiments. 1024 × 1024 Images obtained from Point

Grey Grasshopper2 monocular camera at 11.1 Hz mounted

on the rare and left side of the car. Point clouds obtained

from Velodyne HDL-32E 3D LIDAR at 20 Hz mounted on

the left side of the car.

Sequence Images Point Clouds Date Start [GMT] Condition

Train 37,724 44,414 Jan. 10 2019 11:46 Sunny

Gallery 36,660 43,143 Jan. 10 2019 12:32 Cloudy

Test 01 29,406 34,622 Jan. 10 2019 14:50 Cloudy

Test 02 32,625 38,411 Jan. 11 2019 12:26 Sunny

Test 03 28,633 33,714 Jan. 16 2019 14:15 Rainy

5 EXPERIMENTS

The conducted experiments in this paper address the

following research questions: (1) how precise the

ﬁne-tuned DCNN architecture can recognize a scene,

given its primary contribution for the image retrieval

task? and whether or not its robustness changes over

challenging conditions, including the viewpoint and

the appearance variations, (2) which of the sensor

data modalities performs the best for the place recog-

nition problem in an outdoor environment? and (3)

how early fusion of RGB images and point clouds

(intensity and depth) could potentially boost the place

recognition performance?

Data Pre-processing. Given unrectiﬁed 8-bit raw

Bayer images, obtained from the left-side Point Grey

Grasshopper2 monocular camera, we ﬁrst demosaic

images using RGGB Bayer pattern and then undistort

them using the look-up table for undistortion of im-

ages and mapping pixels in an undistorted image to

pixels in the distorted image. The resulting images are

shown in Figure 2 (b), (e) and (h). In our experiments,

we report the results of training with both raw Bayer

images and undistorted images to investigate the im-

pact of the RGB color space array, converted using

bilinear demosaicing algorithm (Losson et al., 2010).

The image dimensions are ﬁxed at 1024 × 1024, sim-

ilar to original images. We also decode the raw Velo-

dyne samples into range, as depicted in Figure 2 (c),

(f) and (i), and intensity, as depicted in Figure 2 (d),

(g) and (j), which are then interpolated to consistent

azimuth angles between scans. Considering the early

fusion of RGB images and LiDAR point clouds, we

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

654

resize the intensity and range measurements such that

the aspect ratio is always ﬁxed similar to the image

width, e.g., 1024, and height will be adjusted ac-

cordingly, thus the ﬁnal dimensions of LiDAR point

clouds are 1024 × 46.

Performance Metric. After obtaining the image rep-

resentation of a given query image ( f (q)) using the

pipeline described in Section 4, calculating similar-

ity score indicates how well two images belong to the

same location in order to measure the performance.

In this way, the feature vector is matched to all image

representations of the gallery set f (G

),i = 1, 2,..., M

using Euclidean distance d

q,G

= || f (q)− f (G

)||

and

the smallest distance is selected as best match. The

best match position within the given distance thresh-

old (d

q,G

≤ τ) is identiﬁed as true positive and false

positive, otherwise.

Table 2: Recall@1 for Test 01 (different time but same day)

of the outdoor Oxford Radar RobotCar dataset.

Method τ = 25 m τ = 10 m τ = 5 m τ = 2 m

Only RGB (Bayer)

MAC (Tolias et al., 2016a) 71.07 69.24 64.55 60.63

SPoC (Yandex and Lempitsky, 2015) 73.42 70.02 65.85 61.77

GeM (Radenovi

c et al., 2019) 87.11 84.96 76.19 69.26

Only RGB (undistorted)

MAC (Tolias et al., 2016a) 76.55 69.56 66.42 62.14

SPoC (Yandex and Lempitsky, 2015) 79.23 71.41 67.00 64.99

GeM (Radenovi

c et al., 2019) 88.21 85.68 77.31 70.46

Only LiDAR (intensity)

MAC (Tolias et al., 2016a) 79.05 72.15 69.51 66.40

SPoC (Yandex and Lempitsky, 2015) 84.41 78.23 75.85 71.20

GeM (Radenovi

c et al., 2019) 95.57 94.22 86.72 77.34

Only LiDAR (depth)

MAC (Tolias et al., 2016a) 82.74 77.80 73.26 68.59

SPoC (Yandex and Lempitsky, 2015) 89.32 81.73 79.09 77.05

GeM (Radenovi

c et al., 2019) 97.71 96.82 88.13 80.06

RGB (undistorted) + LiDAR (intensity)

MAC (Tolias et al., 2016a) 79.11 71.38 68.71 64.02

SPoC (Yandex and Lempitsky, 2015) 83.52 77.12 74.71 71.05

GeM (Radenovi

c et al., 2019) 92.34 87.48 77.44 71.40

RGB (undistorted) + LiDAR (depth)

MAC (Tolias et al., 2016a) 78.75 71.98 69.02 64.88

SPoC (Yandex and Lempitsky, 2015) 83.01 76.25 75.38 71.00

GeM (Radenovi

c et al., 2019) 92.49 88.38 78.08 73.96

Following (Arandjelovic et al., 2018) and (Chen

et al., 2011), we measure the place recognition perfor-

mance by the fraction of correctly matched queries,

given the gallery dataset. We denote the fraction of

the top-N shortlisted correctly recognized candidates

as recall@N. Given the available ground-truth anno-

tations and τ for outdoor datasets, recall@N varies

accordingly. To evaluate the place recognition per-

formance using different sensor data modalities, we

report only the fraction of top-1 matches, (recall@1)

for multiple thresholds.

We provide the comparison results of ﬁne-tuned

DCNN with three MAC, SpoC and GeM pooling lay-

ers in Table 2-4, for three test sets from the Oxford

Radar RobotCar dataset. GeM pooling layer consis-

tently outperforms MAC and SPoC with a clear mar-

gin given different sensor modalitiy inputs. Results of

Table 2, 3 and 4 also highlight a small outperformance

Table 3: Recall@1 for Test 02 (different day but same time)

of the outdoor Oxford Radar RobotCar dataset.

Method τ = 25 m τ = 10 m τ = 5 m τ = 2 m

Only RGB (Bayer)

MAC (Tolias et al., 2016a) 68.12 63.95 59.17 53.88

SPoC (Yandex and Lempitsky, 2015) 69.75 66.02 59.93 54.11

GeM (Radenovi

c et al., 2019) 71.75 68.41 61.30 56.08

Only RGB (undistorted)

MAC (Tolias et al., 2016a) 60.35 57.11 48.52 47.03

SPoC (Yandex and Lempitsky, 2015) 62.07 57.25 48.30 47.61

GeM (Radenovi

c et al., 2019) 65.06 58.94 51.40 49.82

Only LiDAR (intensity)

MAC (Tolias et al., 2016a) 84.32 81.25 78.25 69.01

SPoC (Yandex and Lempitsky, 2015) 85.93 83.02 80.01 70.02

GeM (Radenovi

c et al., 2019) 88.99 86.75 81.41 68.35

Only LiDAR (depth)

MAC (Tolias et al., 2016a) 96.15 94.21 92.71 85.41

SPoC (Yandex and Lempitsky, 2015) 95.95 94.09 93.41 86.16

GeM (Radenovi

c et al., 2019) 99.58 99.28 98.02 86.35

RGB (undistorted) + LiDAR (intensity)

MAC (Tolias et al., 2016a) 70.67 69.25 59.08 53.81

SPoC (Yandex and Lempitsky, 2015) 71.56 69.39 60.01 54.36

GeM (Radenovi

c et al., 2019) 77.51 70.58 60.17 55.71

RGB (undistorted) + LiDAR (depth)

MAC (Tolias et al., 2016a) 77.23 71.23 61.79 58.17

SPoC (Yandex and Lempitsky, 2015) 78.01 72.63 63.98 58.04

GeM (Radenovi

c et al., 2019) 81.20 74.19 64.89 60.21

when undistorted RGB images are used, compared to

Bayer images for all test sets. The reason is that the

DCNN mostly learns from the image center and four

dark corners of the raw Bayer images do not have a

signiﬁcant effect on the ﬁne-tuning stage. Test 02,

however, depicted a different results when converted

to undistorted images. One possible explanation is

the sunny condition of this test set in which major-

ity of the scenes are either blurred or occluded with

sunlight.

There is also a clear enhancement on the place

recognition performance results, given only LiDAR

point clouds as the primary sensory input for training

and ﬁne-tuning the DCNN, compared to the RGB im-

ages. LiDAR point clouds remain largely invariant to

the lighting and seasonal changes which makes it a ro-

bust option in place recognition. The depth measure-

ments provides approximately 6 −10 % better perfor-

mance as the intensity data which are signiﬁcant dif-

ferences in Test 02 and Test 03, corresponding to dif-

ferent conditions, compared to gallery set. In Test 01,

we obtain an average performance boost of 2.5 %.

According to the extensive results of Figure 3, we

observe that ﬁne-tuning of DCNN with GeM pooling

layers using early fusion of the RGB images and Li-

DAR point clouds outperform the case in which solely

RGB images are used. However, it fails to boost

the performance when compared to the case in which

only LiDAR point clouds is used as the primary sen-

sory input. A possible reason is the fusion approach

used in data pre-processing. In pre-processing, RGB

images (1024 × 1024) are still dominant part of the

learning process, compared to LiDAR point clouds

(1024 × 46).

Rank Analysis. We evaluated the early fusion per-

formance of RGB and LiDAR (depth) point clouds

Evaluation of RGB and LiDAR Combination for Robust Place Recognition

655

(a) (b)

Figure 3: Place recognition performance (Recalls and Tops) for given threshold (τ = 5.0 m), ﬁne-tuned DCNN with GeM

pooling layer. (a) Early fusion of RGB images and LiDAR point clouds (intensity). (b) Early fusion of RGB images and

LiDAR point clouds (depth).

Table 4: Recall@1 for Test 03 (different day and different

time) of the outdoor Oxford Radar RobotCar dataset.

Method τ = 25 m τ = 10 m τ = 5 m τ = 2 m

Only RGB (Bayer)

MAC (Tolias et al., 2016a) 54.92 50.74 43.06 37.41

SPoC (Yandex and Lempitsky, 2015) 54.17 52.36 41.55 38.00

GeM (Radenovi

c et al., 2019) 56.84 52.25 43.76 38.14

Only RGB (undistorted)

MAC (Tolias et al., 2016a) 59.16 54.77 46.88 39.00

SPoC (Yandex and Lempitsky, 2015) 61.28 54.88 45.88 40.21

GeM (Radenovi

c et al., 2019) 62.16 56.25 46.58 40.26

Only LiDAR (intensity)

MAC (Tolias et al., 2016a) 64.92 58.03 49.77 44.01

SPoC (Yandex and Lempitsky, 2015) 65.41 59.11 51.55 45.14

GeM (Radenovi

c et al., 2019) 68.91 59.48 54.85 47.65

Only LiDAR (depth)

MAC (Tolias et al., 2016a) 72.24 68.89 57.47 47.56

SPoC (Yandex and Lempitsky, 2015) 73.88 69.06 56.69 49.63

GeM (Radenovi

c et al., 2019) 75.23 69.26 59.63 51.01

RGB (undistorted) + LiDAR (intensity)

MAC (Tolias et al., 2016a) 66.52 60.22 49.94 48.66

SPoC (Yandex and Lempitsky, 2015) 69.23 62.99 51.21 49.09

GeM (Radenovi

c et al., 2019) 70.36 63.77 51.52 49.56

RGB (undistorted) + LiDAR (depth)

MAC (Tolias et al., 2016a) 71.36 66.28 52.39 49.21

SPoC (Yandex and Lempitsky, 2015) 73.65 68.91 54.69 50.33

GeM (Radenovi

c et al., 2019) 76.46 69.21 56.34 51.21

with ranks and where exactly the failure occurs in the

Oxford Radar RobotCar dataset. The purpose is to

provide a visualization of the failure analysis on the

map, given three tests with different conditions, e.g.,

sunny, cloudy, rainy weather conditions.

Figure 4 illustrates our investigation of Rank5+

for Test 01, Test 02 and Test 03 with the number

of samples and their estimate positions on the map.

In our analysis, we refer to Rank5+ as a parameter

which identiﬁes the most difﬁcult test case has the

most number of hard failing samples, e.g., 9540 for

Test 01 compared to other tests. According to the re-

sults of Figure 4, we can generalize about the con-

ditions of the scene in which the more challenging

illumination leads to the higher ranks.

6 CONCLUSIONS

In this paper, we evaluated the place recognition per-

formance of DCNN using MAC, SpoC and GeM

pooling layers when ﬁne-tuned with three different

sensor data modalities, including only RGB images,

only LiDAR point clouds (intensity and depth) and

early fusion of the RGB images with LiDAR point

clouds (intensity and depth). Our comprehensive

studies indicate that GeM pooling layers outperforms

MAC and SpoC pooling layers with margin. It also

demonstrates that LiDAR-based place recognition ap-

proach leads to more robust performance, given dif-

ferent appearance, and viewpoint variations, due to

longer range capability of the LiDAR compared to the

RGB-based approach.

Our experiments on three query tests with differ-

ent illumination conditions in the outdoor dataset il-

lustrated that using only LiDAR-based (depth) sen-

sor data outperforms the ﬁne-tuning with LiDAR-

based (intensity) sensor data, especially in Test 03

with more challenging rainy conditions. Our eval-

uation has also shown that integrating early sensor-

fusion process with place recognition is challenging

and less robust compared to using only LiDAR point

cloud sensor data modality although it still obtains su-

perior results compared to only image-based sensor

data modality. This can be taken to the future studies

in which considering the idea of using one sensor data

to supervise the data of other sensors.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

656

Figure 4: Rank Analysis for Test 01, Test 02 and Test 03 with different illuminations using early fusion of RGB and LiDAR

point clouds (depth) and τ = 5.0 m, ﬁne-tuned DCNN with GeM pooling layer.

REFERENCES

Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., and Sivic,

J. (2018). NetVLAD: Cnn architecture for weakly su-

pervised place recognition. TPAMI.

Azizpour, H., Razavian, A. S., Sullivan, J., Maki, A., and

Carlsson, S. (2015). From generic to speciﬁc deep

representations for visual recognition. In 2015 IEEE

Conference on Computer Vision and Pattern Recogni-

tion Workshops (CVPRW), pages 36–45.

Barnes, D., Gadd, M., Murcutt, P., Newman, P., and Posner,

I. (2020). The oxford radar robotcar dataset: A radar

extension to the oxford robotcar dataset. In 2020 IEEE

International Conference on Robotics and Automation

(ICRA), pages 6433–6438.

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008).

Speeded-up robust features (surf). Computer Vision

and Image Understanding, 110(3):346–359. Similar-

ity Matching in Computer Vision and Multimedia.

Charles, R. Q., Su, H., Kaichun, M., and Guibas, L. J.

(2017). Pointnet: Deep learning on point sets for 3d

classiﬁcation and segmentation. In 2017 IEEE Con-

ference on Computer Vision and Pattern Recognition

(CVPR), pages 77–85.

Chen, D. M., Baatz, G., K

oser, K., Tsai, S. S., Vedantham,

R., Pylv

ainen, T., Roimela, K., Chen, X., Bach, J.,

Pollefeys, M., Girod, B., and Grzeszczuk, R. (2011).

City-scale landmark identiﬁcation on mobile devices.

In CVPR 2011, pages 737–744.

Cummins, M. and Newman, P. (2008). Fab-map: Proba-

bilistic localization and mapping in the space of ap-

pearance. The International Journal of Robotics Re-

search, 27(6):647–665.

Dewan, A., Caselitz, T., and Burgard, W. (2018). Learning

a local feature descriptor for 3d lidar scans. In Proc. of

the IEEE/RSJ International Conference on Intelligent

Robots and Systems (IROS), Madrid, Spain.

Dub

e, R., Dugas, D., Stumm, E., Nieto, J., Siegwart, R.,

and Cadena, C. (2017). Segmatch: Segment based

place recognition in 3d point clouds. In 2017 IEEE

International Conference on Robotics and Automation

(ICRA), pages 5266–5272.

Filliat, D. (2007). A visual bag of words method for interac-

tive qualitative localization and mapping. In Proceed-

ings 2007 IEEE International Conference on Robotics

and Automation, pages 3921–3926.

Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimen-

sionality reduction by learning an invariant mapping.

In 2006 IEEE Computer Society Conference on Com-

puter Vision and Pattern Recognition (CVPR’06), vol-

ume 2, pages 1735–1742.

He, L., Wang, X., and Zhang, H. (2016). M2dp: A novel

3d point cloud descriptor and its application in loop

closure detection. In 2016 IEEE/RSJ International

Conference on Intelligent Robots and Systems (IROS),

pages 231–237.

egou, H., Douze, M., Schmid, C., and P

erez, P. (2010).

Aggregating local descriptors into a compact image

representation. In 2010 IEEE Computer Society Con-

ference on Computer Vision and Pattern Recognition,

pages 3304–3311.

egou, H., Douze, M., Schmid, C., and P

erez, P. (2010).

Evaluation of RGB and LiDAR Combination for Robust Place Recognition

657

Aggregating local descriptors into a compact image

representation. In 2010 IEEE Computer Society Con-

ference on Computer Vision and Pattern Recognition,

pages 3304–3311.

Kalantidis, Y., Mellina, C., and Osindero, S. (2016). Cross-

Dimensional Weighting for Aggregated Deep Convo-

lutional Features. In Computer Vision – ECCV 2016

Workshops, pages 685–701. Springer, Cham, Switzer-

land.

Klokov, R. and Lempitsky, V. (2017). Escape from cells:

Deep kd-networks for the recognition of 3d point

cloud models. In 2017 IEEE International Conference

on Computer Vision (ICCV), pages 863–872.

Liu, Z., Zhou, S., Suo, C., Yin, P., Chen, W., Wang, H., Li,

H., and Liu, Y. (2019). Lpd-net: 3d point cloud learn-

ing for large-scale place recognition and environment

analysis. In 2019 IEEE/CVF International Conference

on Computer Vision (ICCV), pages 2831–2840.

Losson, O., Macaire, L., and Yang, Y. (2010). Comparison

of color demosaicing methods. In Advances in Imag-

ing and Electron Physics, volume 162, pages 173–

265. Elsevier.

Lowe, D. G. (2004). Distinctive Image Features from Scale-

Invariant Keypoints. Int. J. Comput. Vision, 60(2):91–

110.

Lowry, S., S

underhauf, N., Newman, P., Leonard, J. J.,

Cox, D., Corke, P., and Milford, M. J. (2016). Vi-

sual place recognition: A survey. IEEE Transactions

on Robotics, 32(1):1–19.

Maddern, W., Pascoe, G., Linegar, C., and Newman, P.

(2017). 1 year, 1000 km: The oxford robotcar

dataset. The International Journal of Robotics Re-

search, 36(1):3–15.

Martinez, J., Doubov, S., Fan, J., B

arsan, l. A., Wang, S.,

attyus, G., and Urtasun, R. (2020). Pit30m: A

benchmark for global localization in the age of self-

driving cars. In 2020 IEEE/RSJ International Confer-

ence on Intelligent Robots and Systems (IROS), pages

4477–4484.

Masone, C. and Caputo, B. (2021). A survey on deep visual

place recognition. IEEE Access, 9:19516–19547.

Milford, M. J. and Wyeth, G. F. (2012). Seqslam: Vi-

sual route-based navigation for sunny summer days

and stormy winter nights. In 2012 IEEE International

Conference on Robotics and Automation, pages 1643–

1649.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,

Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,

Antiga, L., Desmaison, A., K

opf, A., Yang, E., De-

Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S.,

Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019).

Pytorch: An imperative style, high-performance deep

learning library. ArXiv, abs/1912.01703.

Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017). Point-

net++: Deep hierarchical feature learning on point sets

in a metric space. arXiv preprint arXiv:1706.02413.

Radenovi

c, F., Tolias, G., and Chum, O. (2019). Fine-tuning

cnn image retrieval with no human annotation. IEEE

Transactions on Pattern Analysis and Machine Intel-

ligence, 41(7):1655–1668.

Radenovi

c, F., Tolias, G., and Chum, O. (2016). CNN

image retrieval learns from BoW: Unsupervised ﬁne-

tuning with hard examples. In ECCV.

Radenovi

c, F., Tolias, G., and Chum, O. (2018). Deep shape

matching. In ECCV.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh,

S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,

Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015).

ImageNet Large Scale Visual Recognition Challenge.

International Journal of Computer Vision (IJCV),

115(3):211–252.

Rusu, R. B., Blodow, N., and Beetz, M. (2009). Fast point

feature histograms (fpfh) for 3d registration. In 2009

IEEE International Conference on Robotics and Au-

tomation, pages 3212–3217.

Shi, B., Bai, S., Zhou, Z., and Bai, X. (2015). Deeppano:

Deep panoramic representation for 3-d shape recogni-

tion. IEEE Signal Processing Letters, 22(12):2339–

2343.

Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E.

(2015). Multi-view convolutional neural networks for

3d shape recognition. In 2015 IEEE International

Conference on Computer Vision (ICCV), pages 945–

953.

Tolias, G., Sicre, R., and J

egou, H. (2016a). Particular

Object Retrieval With Integral Max-Pooling of CNN

Activations. In ICL 2016 - RInternational Confer-

ence on Learning Representations, International Con-

ference on Learning Representations, pages 1–12, San

Juan, Puerto Rico.

Tolias, G., Sicre, R., and J

egou, H. (2016b). Particular ob-

ject retrieval with integral max-pooling of cnn activa-

tions.

Torii, A., Arandjelovi

c, R., Sivic, J., Okutomi, M., and Pa-

jdla, T. (2015). 24/7 place recognition by view syn-

thesis. In 2015 IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), pages 1808–1817.

Uy, M. A. and Lee, G. H. (2018). Pointnetvlad: Deep point

cloud based retrieval for large-scale place recognition.

In 2018 IEEE/CVF Conference on Computer Vision

and Pattern Recognition, pages 4470–4479.

Yandex, A. B. and Lempitsky, V. (2015). Aggregating local

deep features for image retrieval. In 2015 IEEE In-

ternational Conference on Computer Vision (ICCV),

pages 1269–1277.

Zhang, W. and Xiao, C. (2019). Pcan: 3d attention map

learning using contextual information for point cloud

based retrieval. In 2019 IEEE/CVF Conference on

Computer Vision and Pattern Recognition (CVPR),

pages 12428–12437.

Zhang, X., Wang, L., and Su, Y. (2021). Visual place recog-

nition: A survey from deep learning perspective. Pat-

tern Recognition, 113:107760.

Zou, C., He, B., Zhu, M., Zhang, L., and Zhang, J. (2019).

Learning motion ﬁeld of lidar point cloud with con-

volutional networks. Pattern Recognition Letters,

125:514–520.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

658