be mitigated by using a camera setting with a lower
meter-pixel ratio focusing on a single lane.
Recent works have proposed the use of LSTM-
type recurrent architectures (Parimi and Jiang, 2021; Martinez et al., 2021), but as demonstrated in (Mar-
tinez et al., 2021) with synthetic data, they seem
to perform worse than other non-recurrent methods
such as 3D CNNs.
In (Revaud and Humenberger, 2021), camera calibration, scene geometry and traf-
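To make the contrast concrete, the following is a minimal sketch (in PyTorch, with illustrative layer sizes that are our own assumption, not any of the cited architectures) of a non-recurrent 3D-CNN regressor that convolves a short clip jointly over space and time to predict a single speed value:

```python
import torch
import torch.nn as nn

class Speed3DCNN(nn.Module):
    """Toy non-recurrent spatio-temporal speed regressor."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))   # pool over time and space jointly
        self.head = nn.Linear(32, 1)   # single scalar: estimated speed

    def forward(self, clip):           # clip: (B, 3, T, H, W)
        return self.head(self.features(clip).flatten(1))

# Dummy clip of 8 RGB frames at 112x112 pixels.
speed = Speed3DCNN()(torch.randn(1, 3, 8, 112, 112))
```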
fic speed detection are performed using a transformer
network in a learning process from synthetic training
data, considering that cars have similar and known 3D
shapes with normalized dimensions. In (Rodríguez-Rangel et al., 2022), after vehicle detection and track-
ing using YOLOv3 and Kalman filter, respectively,
a linear regression model is used to estimate the ve-
hicle speed. Other statistical and machine learning-
based methods are also compared.
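As an illustration of this kind of pipeline, the sketch below (a hypothetical simplification, not the cited authors' exact model) fits a linear regression from the mean pixel speed of a Kalman-tracked centroid to ground-truth speed in km/h; `calib_tracks` and `calib_speeds_kmh` stand in for a calibration set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mean_pixel_speed(track_px, fps):
    """Mean centroid displacement per second, in pixels."""
    steps = np.diff(np.asarray(track_px, dtype=float), axis=0)
    return np.linalg.norm(steps, axis=1).mean() * fps

# Placeholder calibration data: per-vehicle centroid tracks (x, y per
# frame) from a detector + tracker, with measured ground-truth speeds.
calib_tracks = [[(100, 400), (110, 395), (121, 389)],
                [(90, 420), (95, 417), (100, 414)]]
calib_speeds_kmh = [62.0, 31.0]

X = np.array([[mean_pixel_speed(t, fps=30)] for t in calib_tracks])
model = LinearRegression().fit(X, calib_speeds_kmh)

new_track = [(80, 430), (92, 424), (105, 417)]
est_kmh = model.predict([[mean_pixel_speed(new_track, fps=30)]])[0]
```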
Vehicle detection and tracking are performed in (Barros and Oliveira,
2021) using Faster R-CNN and DeepSORT, respec-
tively. Next, dense optical flow is extracted using
FlowNet2, and finally, a modified VGG-16 deep net-
work is used to perform speed regression.
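A rough sketch of that final regression stage is given below, assuming the dense-flow crop is fed as a two-channel input (our assumption; the paper's exact modification of VGG-16 may differ):

```python
import torch
import torchvision.models as models

# VGG-16 backbone adapted for speed regression from optical flow:
# rebuild the first conv for 2 flow channels (u, v) instead of RGB,
# and replace the final classifier layer with a single output.
net = models.vgg16(weights=None)
net.features[0] = torch.nn.Conv2d(2, 64, kernel_size=3, padding=1)
net.classifier[6] = torch.nn.Linear(4096, 1)

flow_crop = torch.randn(1, 2, 224, 224)   # dummy dense-flow patch
speed = net(flow_crop)                     # scalar speed estimate
```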
2.2 Datasets for Speed Detection
So far, only two datasets with real sequences and
ground truth values are publicly available. First, the
BrnoCompSpeed dataset (Sochor et al., 2017), which
contains 21 sequences (∼1 hour per sequence) with
1920 × 1080 pixels resolution images at 50 fps in
highway scenarios. The speed ground truth is ob-
tained using a laser-based light barrier system. Sec-
ond, the UTFPR dataset (Luvizon et al., 2017), which
includes 5 hours of sequences with 1920 × 1080 pix-
els resolution images at 30 fps in an urban scenario.
The ground truth speeds are obtained using inductive
loop detectors.
Therefore, as described above, the use of synthetic
datasets is becoming increasingly prevalent for this
problem. For example, in (Lee et al., 2019) a CNN
model to estimate the average speed of traffic from
top-view images is trained using synthesized images,
which are generated using a cycle-consistent adver-
sarial network (Cycle-GAN). Synthetic scenes with
a resolution of 1024 × 768 pixels covering multiple
lanes with vehicles randomly placed on the road are
used in (Revaud and Humenberger, 2021) to train and
test the method used to jointly deal with camera cal-
ibration and speed detection. In our previous work
(Martinez et al., 2021), a publicly available synthetic
dataset was generated using the CARLA simulator, using one fixed camera at 80 fps with Full HD format (1920 × 1080), with variability corresponding to mul-
tiple speeds, different vehicle types and colors, and
lighting and weather conditions. A similar approach
was presented in (Barros and Oliveira, 2021) includ-
ing multiple cameras and generating more than 12K
instances of vehicle speeds. Preliminary results suggest that multi-camera variability does not negatively affect the results. However, this effect has not
been sufficiently studied.
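For reference, generating such sequences with the CARLA Python API reduces to spawning a fixed RGB camera with the desired resolution and frame period; the sketch below mirrors the single-camera setup (spawn location and output path are placeholders, not the dataset's actual configuration):

```python
import carla

client = carla.Client("localhost", 2000)
world = client.get_world()

bp = world.get_blueprint_library().find("sensor.camera.rgb")
bp.set_attribute("image_size_x", "1920")        # Full HD
bp.set_attribute("image_size_y", "1080")
bp.set_attribute("sensor_tick", str(1.0 / 80))  # 80 fps

# Fixed roadside pose; CARLA uses a negative pitch to look down at the road.
pose = carla.Transform(carla.Location(x=0.0, y=0.0, z=3.0),
                       carla.Rotation(pitch=-45.0))
camera = world.spawn_actor(bp, pose)
camera.listen(lambda image: image.save_to_disk("out/%06d.png" % image.frame))
```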
Figure 2: Overview of the different datasets. Upper row: 4 m camera height. Lower row: 3 m camera height. The columns show 45°, 50° and 60° camera pitch from left to right.
3 METHOD
This section describes the methodology used for the
construction of the new dataset, based on the one
presented in our previous work (Martinez et al., 2021),
including new camera poses, and increasing the com-
plexity of the speed detection problem.
3.1 Multi-view Synthetic Dataset
Starting from the single-view synthetic dataset gen-
erated in our previous work (Martinez et al.,
2021) using the CARLA simulator (Dosovitskiy et al.,
2017), the extrinsic parameters of the camera are
modified, specifically the pitch angle and the height
with respect to the road, leaving the rest of the extrin-
sic and intrinsic parameters fixed.
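In practice this amounts to varying only the spawn transform of the camera while the sensor blueprint (and thus the intrinsics) stays untouched; a minimal helper (pose naming and roadside coordinates are ours) could look like:

```python
import carla

# The six extrinsic poses: (height in metres, downward pitch in degrees).
POSES = {"3m45": (3.0, 45), "3m50": (3.0, 50), "3m60": (3.0, 60),
         "4m45": (4.0, 45), "4m50": (4.0, 50), "4m60": (4.0, 60)}

def pose_transform(height_m, pitch_deg, x=0.0, y=0.0):
    # Only the extrinsics change; intrinsics stay with the blueprint.
    return carla.Transform(carla.Location(x=x, y=y, z=height_m),
                           carla.Rotation(pitch=-pitch_deg))
```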
As a first step to increase variability in a controlled
manner and to be able to carry out a comprehensive
analysis of results, we included six different poses,
using two different camera heights and three different
pitch angles (extrinsic parameters): 3m45 (i.e., 3 m height and 45° pitch, the original one), 3m50, 3m60, 4m45, 4m50 and 4m60. On
the one hand, a holistic model is generated, trained,
validated and tested using all the sequences in the
dataset. On the other hand, pose-specific models are
trained, validated and tested only with the sequences
corresponding to each pose. Fig. 3 depicts how the
dataset is used.
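Schematically, the two training regimes differ only in how the sequences are grouped; a hedged sketch (the sequence bookkeeping is hypothetical):

```python
from collections import defaultdict

def make_splits(sequences):
    """sequences: iterable of (pose_id, sequence) pairs, e.g. ("3m45", seq)."""
    # Holistic model: pool every pose into one train/val/test partition.
    holistic = [seq for _, seq in sequences]
    # Pose-specific models: one partition per camera pose.
    per_pose = defaultdict(list)
    for pose_id, seq in sequences:
        per_pose[pose_id].append(seq)
    return holistic, dict(per_pose)
```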
For the selected camera heights (3 m and 4 m), the three pitch angles have been defined taking into account the road section captured by the camera field of view. For pitch angles greater than 60° the road section is too short, while for angles less than 45°
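The trade-off can be made explicit with flat-road pinhole geometry; the sketch below (the vertical field of view is an illustrative assumption, since the fixed intrinsics are not restated here) computes the visible road section as the difference between the far and near ground intersections of the view frustum:

```python
import math

def road_section_length(height_m, pitch_deg, vfov_deg=30.0):
    """Length of flat road visible to a camera looking down the lane.

    pitch_deg is measured downward from the horizontal; vfov_deg is an
    assumed vertical field of view, not the dataset's actual intrinsics.
    """
    half = math.radians(vfov_deg / 2.0)
    pitch = math.radians(pitch_deg)
    near = height_m / math.tan(pitch + half)             # closest visible point
    far = height_m / math.tan(max(pitch - half, 1e-6))   # farthest visible point
    return far - near

# With the assumed 30° vFOV at 3 m height: ~2.2 m of road at 60° pitch
# versus ~3.5 m at 45°, consistent with steeper pitches covering less road.
print(road_section_length(3.0, 60), road_section_length(3.0, 45))
```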