Analyzing Airplane Detection Performance with YOLOv4 by using Synthetic Data Domain Randomization

Houssem-Eddine Benseddik (1), Ariane Herbulot (1,2) and Michel Devy (1)

(1) LAAS, CNRS, Toulouse, France
(2) Univ. de Toulouse, UPS, LAAS, F-31400 Toulouse, France
Keywords:
Airplane Detection, YOLOv4, Deep Learning, Domain Randomization, Object Detection, Synthetic Data.
Abstract:
This paper proposes a novel approach that generates a synthetic dataset through domain randomization, to address the problem of accurate, real-time airplane detection in airport zones. Most existing solutions have been developed for satellite imagery using deep learning techniques. Our approach specifically targets airplane detection in complex airport environments with a deep learning detector, YOLOv4. Good detection performance requires a large amount of annotated training data. To address this issue, this study proposes the use of synthetic training data. There is, however, a large performance gap between methods trained on real and on synthetic data. This paper introduces a new method that bridges this gap using Domain Randomization. The approach is evaluated on bounding-box detection of airplanes on the FGVC-Aircraft dataset.
1 INTRODUCTION
Since the advent of deep convolutional networks, the
capabilities of computer vision systems to solve vari-
ous problems such as object detection, have advanced
considerably in the last few years. The major success
of deep learning-based approaches can be attributed
to the availability of required computational power to
perform the huge amount of calculations, and to the
availability of large datasets to enable models to be
trained (Agarwal et al., 2019).
Training a deep neural network is a time-consuming and expensive task which typically involves collecting and manually annotating a large amount of data for supervised learning. Obtaining such data in the real world is tedious and error-prone. Furthermore, collecting a large amount of training data with sufficient variety in the real world is often expensive or not feasible (Hinterstoisser et al., 2019). As such, researchers have developed various techniques to overcome this issue and reduce the cost of building a high-quality dataset. One of the most promising solutions investigated for this purpose is Domain Randomization (DR) (Nowruzi et al., 2019).
DR is a simple yet powerful technique for generating training data for machine-learning algorithms. The goal is to generate, or synthetically augment, the data so as to introduce random variations in the properties of the training environment. These properties are present during the learning task but not necessarily during the test task, unlike traditional training data, which is collected in the same domain as the test data (Valtchev and Wu, 2020). A promising method in this direction is to create simulated training data that imitates the statistics of the real test data.
In this paper, we explore multiple ways to extend DR to the task of real-world airplane detection. By building various synthetic datasets and training the convolutional neural network entirely on synthetic data, our goal is to identify a procedure that addresses the domain shift, so that the trained network generalizes well to real test data.
In particular, we are interested in answering the following questions:
1. Can DR on synthetic data achieve strong airplane detection results on real-world data?
2. Can DR on synthetic data reach results competitive with those obtained with real data?
3. Can fine-tuning with real data during training further improve accuracy?
4. How do the parameters of DR affect results?
By analysing these questions, this work contributes the following:
- Extension of DR to detect different airplane models in front of real-world complex backgrounds;
- Introduction of a new DR component, namely object motion (rotation and translation), which improves detection accuracy;
- Investigation of the parameters of DR to evaluate their importance for airplane detection;
- A comprehensive metric study comparing training on synthetic and on real data;
- Fine-tuning of models trained on a large synthetic dataset with a small set of real data.
2 PREVIOUS WORKS
Our work is related to airplane detection, synthetic
data for machine learning, domain adaptation and do-
main randomization.
The performance of computer vision algorithms increases logarithmically with the amount of training data. However, obtaining a large amount of annotated data in the real world is a bottleneck for computer vision tasks. One way of dealing with this issue is to use synthetic data as a cheap and efficient way to assemble such large datasets (Bousmalis et al., 2018).
The use of synthetic data, however, introduces what is known as the reality gap: the inability of synthetic environments to fully reproduce real-world data, for numerous reasons including textures, physics of materials, lighting and domain distributions (Nowruzi et al., 2019). In an attempt to narrow the reality gap, DR is introduced to simulate a sufficiently large amount of variation that real-world data appears as simply another domain variation. This can include randomization of view angles, textures, shapes, camera localization, object positions and many other parameters (Valtchev and Wu, 2020). The underlying principle of DR is to create enough variance in the training data to force the model to learn only the features relevant to the task. DR thus provides a solution to narrow the reality gap by enriching the data generation phase.
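To make the idea concrete, the following minimal Python sketch shows how such a randomization space could be described and sampled. The attribute names and ranges are hypothetical, chosen only to mirror the kinds of parameters listed above (view angles, positions, lighting); they are not taken from any particular tool.

    import random

    # Hypothetical DR parameter space; names and ranges are illustrative only.
    DR_SPACE = {
        "camera_distance_m": (50.0, 400.0),    # near and far views
        "camera_yaw_deg": (0.0, 360.0),        # view angle around the object
        "object_yaw_deg": (0.0, 360.0),        # object orientation
        "time_of_day_h": (6.0, 20.0),          # drives the lighting conditions
        "location": ["airstrip", "terminal"],  # background variation
    }

    def sample_scene(space):
        """Draw one random scene configuration from the DR space."""
        return {k: random.uniform(*v) if isinstance(v, tuple) else random.choice(v)
                for k, v in space.items()}

    print(sample_scene(DR_SPACE))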
To address the reality gap, DR techniques have
been explored, including most notably the work of
(Tobin et al., 2017), where they synthesized images
of basic geometric objects on a table, in an attempt
to estimate their 3D positions, such that a robotic
arm could pick them up. Their accuracy varied de-
pending on domain parameters, achieving errors as
low as 1.5cm on average in terms of object location,
showing promise for synthetic data training. Notably, they found that the number of images and the number of unique textures used in the images were the most influential parameters for model accuracy. Camera positioning and occlusion also made meaningful contributions, while the addition of random noise to the images did not. (Tremblay et al., 2018) used DR for car detection, effectively abandoning photorealism in the creation of the synthetic dataset. This forced the network to learn only the features essential to the task of car detection. Results of this approach are comparable to those obtained with the Virtual KITTI dataset (Gaidon et al., 2016). One issue with the Virtual KITTI dataset is its limited sample size of 2500 images, which could result in worse performance than larger datasets.
(Loquercio et al., 2019) used domain randomiza-
tion to bridge the gap between the artificial world and
the real one, in the task of autonomous drone flight.
In their work, they synthesized arbitrary race courses
for the drone to learn to fly in, and then tested their
controller in arbitrary track configurations in the real
world. They achieved near-perfect course completion scores for many variations, including maximum speed constraints up to 10 m/s and lap totals under 3.
In (Barisic et al., 2021), a network is trained with a texture-invariant object representation for aerial object detection. By randomly assigning atypical textures to UAV models, the obtained results confirm that shape plays a greater role in aerial object detection. The authors also showed that training on the synthetic dataset outperforms the baseline, and even real-world data, in situations with difficult lighting and distant objects.
On the other hand, supervised-learning-based approaches for airplane detection, an important object detection task in both military and civil aviation, often require a large amount of training data. Manual annotation of an object such as an airplane in large image sets is generally expensive and sometimes unreliable: airplane models show significant appearance variations, the airport background is often complex and cluttered, and airplanes can appear at multiple scales in the images. As a result, it is difficult to achieve accurate detection when training on a small amount of annotated real data. Besides, most research has used satellites as the data source. Accordingly, this work targets real-time airplane detection applications in airport areas, using synthetic images generated by domain randomization and processed by a deep learning model.
3 METHOD OVERVIEW
Our goal is to detect airplanes in a specific scene
and particularly in airport areas. In order to obtain synthetic images appropriate to our case, the SE-SCENARIO tool was used; more details about this tool are given below. The most important challenge in this work, as mentioned earlier, is to ensure that our network, trained on synthetic data, generalizes to unseen real data. To address this challenge, we employ domain randomization in the data generation step to create synthetic data with enough variation that the reality gap between training and test data can be bridged. In this section, we first describe our scene-specific data generation methodology, followed by details about domain randomization.
3.1 Synthetic Data Generation
To generate our training images, we make use of the SE-WORKBENCH environment proposed by OKTAL-SE, a software editor company expert in the development of electro-optics, RADAR and GNSS rendering simulation tools. This solution uses physics-based sensor simulation software to generate the synthetic dataset. The SE-WORKBENCH environment is a set of renderers of physical properties. Associated with a materials database, SE-WORKBENCH can provide all the information needed to render, or to perform physical computation on, the SE-WORKBENCH database polygons in a given area. A data generation scenario was created, in the visible domain, for Toulouse-Blagnac Airport near Toulouse in southwest France. SE-WORKBENCH thus allows us to produce a large amount of visual synthetic data, together with the annotations corresponding to these renderings. Moreover, other parameters can be controlled from this environment, such as atmospheric conditions, time of day, the intrinsic parameters of the camera and its position in the scene. This flexibility is paramount to our domain randomization setup. Our DR synthetic dataset contains about 6500 images at a resolution of 1280x960 pixels.
3.2 Domain Randomization
Domain Randomization attempts to generate a rich
distribution of training images by introducing ran-
domness into the data. When the synthetic data con-
tains enough variability, the real world may appear to
the model as yet another variant of what it has seen
during training. The key idea here is to force a model
to generalize to real-world data. Within a domain,
various attributes can be parametrically randomized
to produce data samples from a large variation of the
possible image space. The resulting images, with
automatically generated ground truth airplane labels
(e.g., bounding boxes), are used for neural network
training.
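As a side note, such automatically generated boxes can be written directly in the label format that YOLO-style detectors consume. The sketch below assumes the usual Darknet convention of one normalized "class x_center y_center width height" line per object, together with our 1280x960 render size; the function name is ours.

    def to_yolo_label(box, img_w=1280, img_h=960, class_id=0):
        """Convert a pixel-space (x_min, y_min, x_max, y_max) box into the
        normalized 'class xc yc w h' line used by Darknet/YOLO training."""
        x_min, y_min, x_max, y_max = box
        xc = (x_min + x_max) / 2.0 / img_w
        yc = (y_min + y_max) / 2.0 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

    # Example: an airplane occupying a 300x120-pixel region of the render.
    print(to_yolo_label((400, 500, 700, 620)))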
To conduct the training, the synthetic data were
generated by randomly varying the following aspects
of the scene:
3.2.1 Object Variation
The airplane model is placed at a number of random positions in different areas of the Toulouse-Blagnac airport. The airplane model used for the whole data generation is an Airbus A320. To achieve airplane shape and size variations, we created two kinds of scenarios. In the first one, the airplane is placed on the airport tarmac and the camera points at the airplane while it moves straight ahead, as shown in Figs. 1-(a) and 1-(b). Eight image collection points (CPs) were established every 50 meters, up to a distance of 400 meters between the airplane and the camera. At each CP, rotations from 0° to 360° around the Z axis, with an angular spacing of 10°, were applied to the airplane. In the second one, the airplane is positioned 70 m above the tarmac and 3-D rotations around the X, Y and Z axes are considered. We took a single collection point, and the airplane was rotated from 0° to 360° around each axis with an angular step of 10° to obtain all distinct 3-D orientation configurations, as shown in Figs. 1-(f), 1-(g) and 1-(h). Since the environment is synthetic, these rotations are not limited by any mechanical constraints of the airplane.
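For illustration, the two pose grids just described can be enumerated as follows. This is our own sketch of the sampling pattern, not the SE-WORKBENCH interface; the tuple layout is an assumption.

    def tarmac_poses():
        """Scenario 1: eight collection points every 50 m up to 400 m,
        with Z-axis rotations from 0 to 360 degrees in 10-degree steps."""
        for dist_m in range(50, 401, 50):          # 8 CPs
            for yaw in range(0, 360, 10):          # 36 headings per CP
                yield ("tarmac", dist_m, (0, 0, yaw))

    def airborne_poses():
        """Scenario 2: one collection point, airplane 70 m above the tarmac,
        rotated around each of X, Y and Z in 10-degree steps. (Whether the
        three axes are swept jointly or one at a time is not fully explicit
        in the text; this sketch sweeps one axis at a time.)"""
        for axis in range(3):
            for angle in range(0, 360, 10):
                rot = [0, 0, 0]
                rot[axis] = angle
                yield ("airborne_70m", 70, tuple(rot))

    print(sum(1 for _ in tarmac_poses()))  # 8 x 36 = 288 tarmac poses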
3.2.2 Background Variation
The background variation is achieved by varying the CP locations at the airport. Two different locations were considered for the acquisitions. The first location is on the airstrip, in an open area (Fig. 1-(a)). The second location is near the airport terminal, where building textures are ubiquitous in the rendered images (Fig. 1-(c)). Furthermore, we also vary the lighting conditions by changing the time of day.
3.2.3 Viewpoint Variation
As a third variation, we randomize the position of the camera in the 3D environment in such a way that it is always placed around the airplane. For each position, the optical axis of the camera is pointed towards the airplane, and a translation along this axis is applied to the camera to obtain near and far views, as illustrated in Figs. 1-(c), 1-(d) and 1-(e). With this variation, we intend to make the trained network robust to changes of perspective. Furthermore, to avoid having to collect new training data containing only airplane parts, a change in the orientation of the camera around its Z axis was introduced. This forces the network to learn to deal with partial occlusion of the object of interest, as shown in Figs. 1-(f), 1-(g) and 1-(h).

Figure 1: Sample images generated by the DR approach. Note that the DR images contain enough variety to force the deep neural network to focus on the structure of the objects of interest.
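A minimal sketch of the look-at construction implied by pointing the optical axis at the airplane is given below. The axis conventions (camera Z forward, world Z up) and the example positions are our assumptions, not values from the paper.

    import numpy as np

    def look_at(cam_pos, target, up=(0.0, 0.0, 1.0)):
        """Build a 3x3 rotation whose third row (camera Z, the optical axis)
        points from cam_pos towards target; conventions are assumed."""
        fwd = np.asarray(target, float) - np.asarray(cam_pos, float)
        fwd /= np.linalg.norm(fwd)
        right = np.cross(fwd, up)
        right /= np.linalg.norm(right)
        true_up = np.cross(right, fwd)
        return np.stack([right, true_up, fwd])

    # Near and far views: translate the camera along its optical axis.
    cam = np.array([120.0, -80.0, 15.0])   # hypothetical camera position
    plane = np.array([0.0, 0.0, 5.0])      # hypothetical airplane position
    R = look_at(cam, plane)
    near_cam = cam + 0.5 * (plane - cam)   # halve the camera-airplane distance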
3.3 Detector Model Architecture and
Training
We parametrize our object detector with a deep con-
volutional neural network. The object detector used
in this work comes from a well-established fam-
ily of one-stage detectors called You Only Look
Once (YOLO) (Redmon et al., 2016), (Redmon and
Farhadi, 2017) and (Redmon and Farhadi, 2018).
YOLO detectors treat object detection as a regres-
sion problem and use features from the entire image
to detect objects. In this study, to efficiently and accurately assess DR for airplane detection at real-time speed, the recently developed one-stage object detection framework YOLO-v4 (Bochkovskiy et al., 2020) is selected. YOLO-v4 is a high-precision, real-time, regression-based object detection algorithm proposed in 2020, which integrates characteristics of YOLO-v1, YOLO-v2 and YOLO-v3.
The YOLO-v4 object detection algorithm consists of three structures: the backbone, the neck, and the head. The main function of the backbone block is the
feature extraction process. Selection of the backbone
is one of the most important steps to increase perfor-
mance of the object detection algorithm. The purpose
of the neck block is to add extra layers between the
backbone and the head so that feature maps from dif-
ferent stages can be combined. Usually, a neck con-
sists of several bottom-up paths and several top-down
paths. The final stage, the head block in single-stage
detectors, performs the final prediction. This predic-
tion consists of a vector containing the coordinates
of the predicted bounding box, the confidence score,
and the label of the prediction (Pacal and Karaboga,
2021).
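To make the head's output concrete, here is an illustrative decoding of one prediction vector into the (box, confidence, label) triple described above. The exact memory layout and anchor handling of YOLO-v4 differ, so treat this as a schematic, not the detector's real decode path.

    import numpy as np

    def decode_prediction(vec, conf_thresh=0.25):
        """Schematic decode: vec = [xc, yc, w, h, objectness, class scores...],
        all normalized to [0, 1]. Returns (box, confidence, class) or None."""
        xc, yc, w, h, obj = vec[:5]
        scores = np.asarray(vec[5:])
        cls = int(scores.argmax())
        conf = obj * scores[cls]          # objectness times best class score
        if conf < conf_thresh:
            return None
        box = (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)
        return box, float(conf), cls

    # One-class (airplane) example with a single class score.
    print(decode_prediction([0.5, 0.5, 0.2, 0.1, 0.9, 0.8]))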
For the experiments, we used the pretrained model yolov4.conv.137 and trained for 2000 epochs. The detector was modified to accommodate training and testing for a single class. Each input image is resized to 512x512 before being passed to the YOLOv4 architecture; all experiments are carried out with this input size. The 512-pixel input size was chosen to keep computations simple, and the number of images per batch is limited to 64. The learning rate and the momentum are set to 0.0001 and 0.9, respectively. The weights are saved every 1000 epochs so that we can compute mAP results and verify that the model is learning well.
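For reproducibility, these hyperparameters map roughly onto the Darknet configuration as sketched below, assuming the AlexeyAB Darknet fork commonly used with yolov4.conv.137. Values not stated in the text are left at the fork's defaults, and the single-class change also touches each [yolo] layer and the convolutional layer feeding it; this mapping is our assumption.

    [net]
    batch=64                 # images per batch, as stated above
    width=512                # network input resized to 512x512
    height=512
    channels=3
    momentum=0.9
    learning_rate=0.0001
    max_batches=2000         # the text's "2000 epochs"; Darknet counts batch
                             # iterations, so this equivalence is assumed

    [convolutional]          # the conv layer feeding each [yolo] head
    filters=18               # (classes + 5) * 3 with classes=1

    [yolo]
    classes=1                # single "airplane" class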
4 EVALUATION
We benchmark our methodology on the FGVC-Aircraft dataset (Maji et al., 2013). This dataset contains 100 real example images for each of 100 aircraft model variants. The dataset therefore contains 10,000 annotated images, with resolutions of about 1-2 Mpixels. Images are divided almost equally into training, validation and test subsets, giving the FGVC-Aircraft dataset a 34:33:33 split. We perform evaluations on the test subset; its 3333 images thus constitute our real data for evaluation. The bounding-box annotations are used to validate our airplane detection algorithm.

Figure 2: True Positive (TP), False Positive (FP) and False Negative (FN) counts of YOLOv4 on the real FGVC-Aircraft test dataset, at (a) AP@0.5 and (b) AP@0.75. Models trained on the COCO dataset, our DR data, fine-tuning and FGVC-Aircraft real images are compared.
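The Precision, Recall, F1-score and average-IoU numbers reported below follow the standard definitions. A small self-contained sketch of our own, matching the IoU thresholds of Table 1:

    def iou(a, b):
        """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def precision_recall_f1(tp, fp, fn):
        """Metrics from raw counts; a detection counts as a TP when its IoU
        with a ground-truth box reaches the threshold (0.5 or 0.75 here)."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25/175, about 0.143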
For all our experiments, we compare four models:
- COCO: YOLOv4 trained on the COCO dataset (Veit et al., 2016).
- DR: initialized with random weights and trained on our domain-randomized synthetic data.
- DR+finetune: initialized with the DR weights and fine-tuned on 10% of the FGVC-Aircraft real training data, about 333 images (a sketch of this subset selection follows the list).
- Real: initialized with random weights and trained on the entire FGVC-Aircraft training data.
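The paper does not specify how the 10% fine-tuning subset is drawn, so the following sketch assumes a uniform random draw with a fixed seed; the file names are hypothetical.

    import random

    def finetune_subset(train_images, fraction=0.10, seed=0):
        """Uniformly sample a fraction of the real training images; the
        sampling scheme is assumed, the text only states '10%, about 333'."""
        rng = random.Random(seed)
        k = round(len(train_images) * fraction)
        return rng.sample(train_images, k)

    # Hypothetical usage with the FGVC-Aircraft training list (~3334 images).
    train_images = [f"fgvc_{i:04d}.jpg" for i in range(3334)]
    print(len(finetune_subset(train_images)))   # -> 333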
Quantitative Analysis: Table 1 compares the performance of YOLOv4 when trained on the COCO dataset and the FGVC-Aircraft dataset versus our DR dataset. Our DR and DR+finetune models achieve the highest scores compared to the COCO baseline at AP@0.5. We hypothesize this is due to overfitting on limited real data, which is avoided by design in domain randomization. Besides, the individual counts of True Positives (TP), False Positives (FP) and False Negatives (FN), which directly determine the Precision, Recall and F1-score metrics, are shown in Fig. 2; they help explain the metric results of Table 1. The main conclusion to draw from these results is that our model trained only on synthetic data is competitive with the model trained on the entire real dataset at IoU=0.5. DR+finetune outperforms the off-the-shelf COCO baseline and DR, which illustrates the benefit of training on real images after first training on synthetic images. Furthermore, even though our DR network has never seen a real image, it successfully detects the different airplane models present in the FGVC-Aircraft dataset. This surprising result illustrates the power of a technique as simple as DR for bridging the reality gap. We also found that the bounding boxes of DR, and especially of DR+finetune, are more accurate than those of the COCO model. We argue that this is because DR mitigates this problem by randomizing the dimensions and shapes of the airplane during data generation, thus directly providing enough variance in the bounding-box distribution.
Consequently, by using randomization techniques, we achieve an improvement in real airplane detection, outperforming the results obtained without DR. The weights of the network pretrained on the COCO dataset are at a local minimum at the beginning of training. The larger variance in the randomized images may help force the network out of this local minimum, whereas the non-randomized network is likely to stay close to it. As a result, randomization techniques are a fundamental tool for achieving better accuracy when applied to pretrained networks.

Table 1: Detection results tested on the FGVC-Aircraft dataset.

              AP@0.5                                   AP@0.75
Approaches    Precision  Recall  F1-score  aIoU(%)     Precision  Recall  F1-score  aIoU(%)
COCO          0.40       0.97    0.57      36.53       0.41       0.98    0.58      37.12
DR            0.86       0.99    0.92      69.29       0.65       0.74    0.69      54.20
Fine-tuning   0.99       0.99    0.99      87.96       0.98       0.97    0.97      88.90
Real          0.99       0.99    0.99      93.04       0.99       0.98    0.98      92.50

Figure 3: Examples of detection results. We compare COCO, DR, DR+finetune and the detector trained on real FGVC-Aircraft images, on three different views; each row corresponds to one model (COCO, DR, Fine-tuning, Real).
Qualitative Analysis: A qualitative comparison between the purely DR-trained detector, the DR+finetune detector, and the detectors trained on COCO and on FGVC-Aircraft real images is shown in Fig. 3. On the presented images, taken from a free video of an airplane landing, the detector trained on real images fails to detect one airplane due to the challenging conditions of distant objects and difficult illumination. The COCO model produces false positives and false negatives involving other object classes. In contrast, with purely synthetic data, and also with fine-tuning, the detector successfully and accurately detects the airplane in the three different views. These results show that DR can bridge the reality gap and that the accuracy obtained with real data can even be surpassed. DR is thus a fundamental tool for achieving high accuracy with synthetic training data. The models benefit from randomization in particular because the increased variance in the images forces the networks to escape local minima.
5 CONCLUSIONS
This work was motivated by real-time airplane detection in airport zones, in a context where acquiring a large amount of annotated images is not easy. To tackle this challenge, we propose a solution that uses Domain Randomization to train the YOLO-v4 architecture on synthetic images. The experimental evaluation of the synthetically generated dataset confirms that DR is an effective technique to bridge the reality gap for the task of airplane detection. We also demonstrated that, with fine-tuning on a small amount of real images, DR can outperform training on real datasets: our method achieves higher detection accuracy than state-of-the-art baselines, and even than real-world data, in situations with difficult lighting and distant objects. Using DR to train deep neural networks is thus a promising approach to bridge the reality gap between simulation and the real world, and other directions can be explored to generalize this technique to other computer vision problems.
Future directions that should be explored include
using several types of airplanes, different weather and
light conditions, applying the technique to motion
and/or 3D depth estimation, and further investigating
the mixing of synthetic and real data to leverage the
benefits of both.
ACKNOWLEDGEMENTS
We would like to thank OKTAL-SE for providing us with the SE-SCENARIO and SE-TOOLKIT tools used to generate the synthetic data.
REFERENCES
Agarwal, S., Terrail, J. O. D., and Jurie, F. (2019). Recent
advances in object detection in the age of deep convo-
lutional neural networks.
Barisic, A., Petric, F., and Bogdan, S. (2021). Sim2air -
synthetic aerial dataset for uav monitoring. arXiv
preprint arXiv:2110.05145.
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion. arXiv preprint arXiv:2004.10934.
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey,
M., Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P.,
Konolige, K., Levine, S., and Vanhoucke, V. (2018).
Using simulation and domain adaptation to improve
efficiency of deep robotic grasping. In 2018 IEEE
international conference on robotics and automation
(ICRA), pages 4243–4250.
Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. (2016). Vir-
tual worlds as proxy for multi-object tracking analy-
sis. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 4340–
4349.
Hinterstoisser, S., Pauly, O., Heibel, H., Marek, M., and
Bokeloh, M. (2019). An annotation saved is an anno-
tation earned: Using fully synthetic training for object
instance detection.
Loquercio, A., Kaufmann, E., Ranftl, R., Dosovitskiy, A.,
Koltun, V., and Scaramuzza, D. (2019). Deep drone
racing: From simulation to reality with domain ran-
domization. IEEE Transactions on Robotics, 36(1):1–
14.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi,
A. (2013). Fine-grained visual classification of air-
craft. arXiv preprint arXiv:1306.5151.
Nowruzi, F. E., Kapoor, P., Kolhatkar, D., Hassanat, F. A.,
Laganiere, R., and Rebut, J. (2019). How much real
data do we actually need: Analyzing object detection
performance using synthetic and real data.
Pacal, I. and Karaboga, D. (2021). A robust real-time
deep learning based automatic polyp detection sys-
tem. Computers in Biology and Medicine, page
104519.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster,
stronger. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7263–
7271.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and
Abbeel, P. (2017). Domain randomization for transfer-
ring deep neural networks from simulation to the real
world. In 2017 IEEE/RSJ international conference on
intelligent robots and systems (IROS), pages 23–30.
IEEE.
Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jam-
pani, V., Anil, C., To, T., Cameracci, E., Boochoon,
S., and Birchfield, S. (2018). Training deep networks
with synthetic data: Bridging the reality gap by do-
main randomization. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition
workshops, pages 969–977.
Valtchev, S. Z. and Wu, J. (2020). Domain randomization
for neural network classification. Journal of Big Data.
Veit, A., Matera, T., Neumann, L., Matas, J., and Belongie,
S. (2016). Coco-text: Dataset and benchmark for text
detection and recognition in natural images. arXiv
preprint arXiv:1601.07140.