Deep Learning based Object Detection and Tracking for Maritime
Situational Awareness
Rihab Lahouli^a, Geert De Cubber^b, Benoît Pairet^c, Charles Hamesse^d, Timothée Fréville and Rob Haelterman^e
Royal Military Academy, Mathematics Department, Renaissancelaan 30, 1000 Brussels, Belgium

^a https://orcid.org/0000-0002-0883-8225
^b https://orcid.org/0000-0001-7772-0258
^c https://orcid.org/0000-0002-8731-033X
^d https://orcid.org/0000-0002-2321-0620
^e https://orcid.org/0000-0002-1610-2218
Keywords:
Object Detection, Tracking, Situational Awareness, Maritime Dataset.
Abstract:
Improving real-time situational awareness using deep-learning-based video processing is of great interest in maritime and inland waterway environments. For instance, automating the visual analysis for the classification and interpretation of the objects surrounding a vessel remains a critical challenge towards more autonomous navigation systems. The complexity increases dramatically in waterway environments, which present denser traffic than the open sea as well as navigation marks that need to be detected and correctly understood to take correct decisions. In this paper, we therefore propose a new training dataset tailored to navigation and mooring in waterway environments. The dataset contains 827 representative images gathered in various Belgian waterways. The images were captured on board a navigating barge and from a camera mounted on a drone, and cover a range of realistic traffic and weather conditions. In the current study, we investigate the training of the YOLOv5 model for the detection of seven different classes corresponding to vessels, obstacles and different navigation marks. The detector is combined with a pretrained Deep Sort tracker. The trained YOLOv5 model reaches an overall mean average precision of 0.891 at an intersection over union threshold of 0.5.
1 INTRODUCTION
Deep learning algorithms, and more precisely convolutional neural networks (CNNs), have proven to be a powerful tool for visual analysis and understanding tasks such as classification, object detection and tracking (A. Khan and Qureshi, 2020; G. Ciaparrone and Troiano, 2020). The great progress in these models promises to computationally generate advanced intelligence for the detection of collision threats and the recognition of navigation marks without significant need for human intervention (Z. Chen and Cheng, 2020; B. Iancu and Lilius, 2021).
In this paper, we report our work in the scope of
the SSAVE project (Shared Situational Awareness be-
tween Vessels), funded by the Flemish Agency for
Innovation and Entrepreneurship. The goal of the
project is to build a shared dynamic semantic map for
surveillance purposes. Several sensors are used in the project, and we are in charge of the visual camera processing and of extracting useful information to consolidate the real-time situational awareness system.
Our work consists of implementing a CNN-based solution able to generate safety alerts in order to avoid collisions; it can also support decision making in a maritime environment. The proposed solution intends to enhance monitoring and safety in waterway environments. In recent years, interesting open-source maritime datasets have been released and used for the training of CNN models, providing an important contribution to image segmentation as well as ship and obstacle detection in the maritime environment. We can cite the Singapore Maritime dataset (DK. Prasad and Rajabally, 2017), created around Singapore waters, the MaSTr1325 dataset with 1325 diverse images, and the MODD2 dataset, both captured in the Gulf of Koper, Slovenia (B. Bovcon and Kristan, 2019; Borja and Kristan, 2020).
These datasets were mostly recorded in uncrowded open-sea conditions, which makes them unsuitable for our application. In fact, in our project we address
waterway navigation zones with a high risk of collision and with several maritime marks required for mooring assistance and priority indication. To address this issue, we conducted measurement campaigns to collect our custom SSAVE dataset, tailored to the training of the relevant classes that we need to detect and keep track of from aboard a vessel. Our main contribution is a new, diverse training dataset intended for waterway maritime environments, captured by cameras mounted on a drone and on a moving barge in realistic weather and traffic conditions. We also investigate the training of the YOLOv5 model for the detection of seven different classes (ship, barge, cutter, yellow mark, red mark, line mark and other obstacle) for situational awareness purposes. This paper is organized as follows: Sect. 2 describes the collected SSAVE dataset, the classes that we considered for the semantic annotation and the distortion correction of a part of the dataset; Sect. 3 presents the object detection and tracking methods and the evaluation metrics. In Sect. 4, the results of the YOLOv5 detector training are presented and evaluated. Finally, Sect. 5 presents inference results of combining the detection with a pretrained Deep Sort tracker; it also gives conclusions and discusses possible further research perspectives.
2 MARITIME WATERWAY
DATASET: SSAVE
Maritime image datasets are essential for the training
of neural networks performing object detection and
tracking. They should present a wide variety of weather and lighting conditions, and the images should be taken from different angles of view.
2.1 SSAVE Dataset
The SSAVE dataset was created in collaboration with our industrial project partners (Deme and Tresco) and with the Belgian naval base at Zeebrugge. We collected thousands of high-definition images (1080×1920 pixels) in realistic traffic conditions and in different weather conditions such as sunny, cloudy and rainy days. We used AXIS Q3515-LV network IP cameras mounted on a navigating barge, in addition to GoPro Hero8 cameras placed on a drone and also attached to a navigating barge. Recording
images from a drone was essential to detect the nav-
igating barge itself and also to have a better view of
the line marks. We hand-picked 827 representative
images out of the gathered footage to construct our
dataset. Variety in luminosity, in the distances separating the barge from the obstacles, and the presence of objects of interest with similar proportions were considered when selecting the images composing the dataset. The dataset is
composed of 78 images taken from a GoPro Hero 8
camera mounted on a drone, 175 images taken from
a similar camera mounted on a navigating barge and
574 images taken from an AXIS Q3515-LV Network
IP camera attached to a navigating barge. The dataset is tailored to the scenarios addressed by the study, corresponding to the navigation of a barge in a waterway and its mooring to a fixed platform called a cutter. We manually annotated each image in the dataset using the CVAT image annotator. Seven classes were considered in the annotation (ship, barge, cutter, yellow mark, red mark, line mark and other obstacle). Figure 1 presents an example of annotated images. In particular, we need to detect and track all ships surrounding the navigating barge, in addition to all types of obstacles that can present collision threats. Navigation marks need to be detected and classified in order to respect priorities and interdiction areas while navigating and mooring; the navigation marks can also be subject to collisions. Finally, the cutter should be detected and tracked in order to assist the barge in mooring properly. We did not apply augmentation techniques to increase the dataset size.
2.2 Preprocessing of Distorted Images
Some of the collected images (about 100) presented radial distortion. To correct this, we first calibrated the camera to identify its parameters using the MATLAB camera calibration tool and a checkerboard pattern. Then, we used the undistortImage function of MATLAB to rectify the lens distortion and thereby convert the distorted images into undistorted ones. Figure 2 presents an example of an image before and after distortion correction.
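The same correction step can also be reproduced outside MATLAB. The sketch below is a minimal OpenCV equivalent, assuming the camera matrix and radial distortion coefficients have already been estimated from checkerboard images; the intrinsic values and file names are placeholders, not the actual calibration results of the project.

```python
import cv2
import numpy as np

# Assumed intrinsics, as would be estimated from checkerboard images
# (placeholder values, not the project's actual calibration).
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
dist = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

img = cv2.imread('distorted.jpg')            # hypothetical input image
undistorted = cv2.undistort(img, K, dist)    # rectify the radial lens distortion
cv2.imwrite('undistorted.jpg', undistorted)
```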
3 OBJECT DETECTION AND
TRACKING METHODS
This section presents a brief description of the deep
learning models used for the detection and tracking
tasks. Also, the evaluation metrics used for perfor-
mance measurement are defined.
3.1 YOLOv5 Detection
Object detection is one of the fundamental computer
vision problems that enables semantic video analy-
sis and image understanding. Its concept consists in
identifying precisely in an image the presence and the location of objects defined in a pre-established list (PF. Felzenszwalb and Ramanan, 2009).

Figure 1: Examples of annotated images.

Figure 2: Image with radial distortion (left) and after correction (right).
The "You Only Look Once" (YOLO) detector was proposed by (J. Redmon and Farhadi, 2016) as a novel approach for object detection that uses a unified framework to predict both class confidences and bounding box coordinates at the same time. The basic idea of YOLO is to divide the input image into a grid and to compute bounding boxes and probabilities for each grid cell. The predicted confidence scores are used to weigh these bounding boxes. The algorithm performs only one forward propagation pass through the neural network to make predictions, so it "only looks once" at the input image. YOLOv5 is a recent update of the YOLO object detection family. It was released in 2020 and developed with the PyTorch framework. The YOLOv5 detector is very fast, permitting real-time object detection with high accuracy (Jocher, 2021). YOLO object detection algorithms have been used for vehicle detection tasks (Fachrie, 2020; M. Kasper-Eulaers and Sebulonsen, 2021) and have proven to outperform other state-of-the-art CNN algorithms for object detection, such as Faster R-CNN, in sensitivity and processing time with comparable precision (B. Benjdira and Ammar, 2019). These attributes led us to select the YOLOv5 object detection algorithm for training on the SSAVE dataset to detect the classes of interest in the waterway navigation environment. Low processing time is of significant importance for real-time detection and traffic monitoring.
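As a concrete illustration, a trained YOLOv5 model can be loaded and run in a few lines through the torch.hub interface exposed by the ultralytics/yolov5 repository (Jocher, 2021). This is a minimal sketch: the weights file 'best.pt' and the input image are illustrative placeholders, not the project's actual artifacts.

```python
import torch

# Load a custom YOLOv5 model via torch.hub from the ultralytics/yolov5 repository.
# 'best.pt' stands in for weights trained on the SSAVE dataset (illustrative path).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.conf = 0.25  # confidence threshold below which detections are discarded

# Single forward pass over one image: YOLOv5 "only looks once".
results = model('frame.jpg')

# results.xyxy[0] is an (N, 6) tensor: x1, y1, x2, y2, confidence, class index.
for *box, conf, cls in results.xyxy[0].tolist():
    print(f'class {int(cls)} at {box} with confidence {conf:.2f}')
```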
3.2 Deep Sort Tracker
Tracking objects in a video at real-time speeds is a powerful tool for monitoring applications. Deep Sort is a tracking algorithm, built around a CNN appearance descriptor and developed using the PyTorch framework, that tracks objects along video frames given the bounding boxes first predicted by a detector. It was originally used for pedestrian tracking (N. Wojke and Paulus, 2017). In our study, the detections generated by YOLOv5 are passed to the Deep Sort algorithm to keep track of the detected classes across the video frames.
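A minimal sketch of this detector-tracker coupling is given below. The paper does not specify which Deep Sort implementation was used; the sketch assumes the deep-sort-realtime Python package, whose tracker consumes ([left, top, width, height], confidence, class) tuples, and all file paths are illustrative.

```python
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort

model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')  # illustrative weights
tracker = DeepSort(max_age=30)  # drop tracks that go unseen for 30 frames

cap = cv2.VideoCapture('barge_video.mp4')  # illustrative video path
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[..., ::-1])  # BGR (OpenCV) to RGB for YOLOv5
    # Convert YOLOv5 output to ([left, top, width, height], confidence, class) tuples.
    detections = [([x1, y1, x2 - x1, y2 - y1], conf, int(cls))
                  for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist()]
    # Deep Sort associates detections across frames and assigns persistent IDs.
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            print(f'track {track.track_id}: box {track.to_ltrb()}')
cap.release()
```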
3.3 Performance Evaluation
The performance of an object detector is typically assessed by three metrics that we explain briefly in this subsection (I. Goodfellow and Courville, 2016).
Intersection over Union (IoU): which measures the degree of overlap between the predicted bounding boxes and the ground-truth ones. It is the ratio of the area of overlap between the two bounding boxes to the area of their union. For a specific IoU threshold, we can measure the accuracy of the predictions by computing the precision and recall metrics.
Precision-Recall: which measures the trade-off between precision, i.e. the percentage of predictions that are correct, and recall, which measures how well all positive cases are found among the predictions, as given by the following formulas:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

Here, TP denotes the true positives, FP the false positives and FN the false negatives, all measured over the whole set of predictions of the CNN model. The relationship between precision and recall can be observed using a plot.

Figure 3: Plots of box loss, objectness loss, classification loss, precision, recall and mean average precision (mAP) over 200 training epochs for the training and validation sets.
Mean Average Precision (mAP): given a fixed IoU threshold, the average precision of each predicted class is calculated and then averaged over all classes to return the mean average precision for the dataset.
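As an illustration of these definitions, the sketch below implements the IoU measure and Equations (1) and (2) directly; it is a didactic helper written for this explanation, not code from the project.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall from TP/FP/FN counts, following Eqs. (1) and (2)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A prediction counts as a true positive when its IoU with a ground-truth
# box reaches the chosen threshold (0.5 in this paper).
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...
```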
4 YOLOV5 OBJECT DETECTION
TRAINING USING THE SSAVE
DATASET
In this section, we investigate the performance of the
YOLOv5 detector using the SSAVE dataset.
4.1 Implementation Details

All implementations are performed with the Ultralytics YOLOv5 deep learning framework in PyTorch. We use the Docker implementation on an NVIDIA DGX Station A100 with four fully interconnected NVIDIA A100 Tensor Core GPUs and up to 128 GB of total GPU memory. We use the YOLOv5 model pretrained on the Common Objects in Context (COCO) dataset (TY. Lin and Hays, 2017) to initialize the network and extract features from the input images. We then prepare our custom SSAVE dataset of 827 images by splitting it into training, validation and test data with proportions of 70%, 20% and 10%, respectively. Training the model for 200 epochs takes about 89 minutes.
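A minimal sketch of such a 70/20/10 split is shown below; the directory names are hypothetical, as the paper does not specify the actual file layout used in the project.

```python
import random
import shutil
from pathlib import Path

# Illustrative sketch: split the SSAVE images into train/val/test (70/20/10).
# 'images/' and the 'dataset/' output layout are assumptions, not project paths.
random.seed(0)  # fixed seed so the split is reproducible
images = sorted(Path('images').glob('*.jpg'))
random.shuffle(images)

n_train = int(0.7 * len(images))
n_val = int(0.2 * len(images))
splits = {
    'train': images[:n_train],
    'val': images[n_train:n_train + n_val],
    'test': images[n_train + n_val:],
}

for split, files in splits.items():
    out = Path('dataset') / split
    out.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy(f, out / f.name)  # matching label .txt files go alongside
```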
Table 1: Detection precision per class.
Classes Precision
ship 0.877
barge 0.995
cutter 0.940
yellow mark 0.986
red mark 0.887
line mark 0.842
other obstacle 0.713
all classes 0.891 (mAP@0.5)
4.2 Experimental Analysis
Figure 3 shows the evolution of different performance
metrics over the 200 training epochs for both the
training and validation sets. There are three losses
that the model should minimize to achieve convergence in detection, localization and classification. The box
loss represents how well the predicted bounding box
covers an object. The objectness loss refers to the probability that an object exists in a proposed region of interest. The classification loss represents how well the algorithm can predict the correct class of a given object. We observe a rapid decline of the box, objectness and classification losses until around epoch 150.
The model also showed a rapid improvement in terms of precision, recall and mean average precision. It achieved an mAP of 0.891 at 0.5 IoU, as shown in Figure 4. We also see a high precision, between 0.842 and 0.995, for all classes except the "other obstacle" class, for which the precision reaches only 0.713, as summarized in Table 1. This is explained by the fact that this class includes objects of varied forms, ranging from poles to small obstacles that cannot be assigned to the other categories but still need to be detected in order to warn the navigating vessel of a collision threat. Figure 9 presents the qualitative prediction performance of the newly trained YOLOv5 model applied to a test batch for detecting the seven classes (ship, barge, cutter, yellow mark, red mark, line mark and other obstacle).
Figure 4: Precision-recall curve.
4.3 Discussion
Figure 5 shows the distribution of the seven annotated classes in the training dataset. The "ship" category is the most represented, with more than 770 bounding boxes. On the other hand, the "barge" category is the least represented. This is due to the fact that most images were captured by the camera attached to the barge itself, so the barge appears only in the images captured by the drone. The classes are therefore unbalanced in the dataset, and this problem can be considered in future work to further enhance the detection of under-represented classes. The low precision of the "other obstacle" class led to some false negative detections, as shown in Figure 6, where some obstacles present in the image were not detected even though a neighbouring obstacle was detected with a confidence score of 0.55. The vegetation background in the waterways also led to some false positive detections, as presented in Figure 7, where the detector mistook vegetation for a ship. On the other hand, Figure 8 presents example results of correct detection of all the objects present in the images.

Figure 5: Proportion of each annotated class in the training dataset.

Figure 6: False negative detection for the "other obstacle" class.

Figure 7: False positive detection for the "ship" class.

Data augmentation techniques are commonly used to increase the size and variance of a dataset by generating geometric perturbations, Gaussian blur and noise (B. Zoph and Ghiasi, 2020). We suggest investigating the capability of such an augmentation strategy to enhance the detection performance.
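As an illustration, such a pipeline could be built with the albumentations library (an assumption, since the paper does not prescribe a tool), combining the geometric perturbations, blur and noise mentioned above while keeping YOLO-format annotations consistent.

```python
import albumentations as A
import cv2

# Hypothetical augmentation pipeline mirroring the perturbations cited above.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=5, p=0.5),
        A.GaussianBlur(p=0.3),
        A.GaussNoise(p=0.3),
    ],
    # Keep YOLO-format bounding boxes consistent with the transformed image.
    bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']),
)

image = cv2.imread('example.jpg')  # illustrative input image
augmented = transform(image=image,
                      bboxes=[(0.5, 0.5, 0.2, 0.1)],  # (x_c, y_c, w, h), normalized
                      class_labels=['ship'])
cv2.imwrite('example_aug.jpg', augmented['image'])
```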
Figure 8: Correct detection results.
5 CONCLUSIONS
In this paper, we introduced a new dataset tailored to the training of object detection CNN models for situational awareness purposes in maritime waterway environments. A total of 827 representative images with high variety have been selected manually from video recordings of real navigation and mooring scenarios. IP cameras attached to a navigating barge, in addition to a camera attached to a drone, have been used to ensure variety in the viewing angles of the objects to detect. Seven different classes have been annotated manually: ship, barge, cutter, yellow mark, red mark, line mark and other obstacle. The main concern was the implementation of a CNN-based object detection algorithm able to alert vessels to the presence of collision threats in their neighborhood, and to detect and correctly interpret the navigation marks present in the waterway in order to help the vessel take the right decisions. We chose to train the YOLOv5 object detection model because of its better processing time at a precision similar to other state-of-the-art object detection models. We trained YOLOv5 using our dataset and obtained successful results, achieving a mean average precision of 0.891 at an intersection over union of 0.5. The per-class precision was also high, reaching up to 0.995 and never falling below 0.713. Qualitative prediction results on test images have been presented. The detections generated by YOLOv5 have been passed to the pretrained Deep Sort algorithm and tested on real videos of a navigating barge, with satisfying results. For future work, we suggest increasing the presence of the least-represented classes in the dataset in order to deal with class imbalance. We also suggest testing common augmentation techniques on the SSAVE dataset and analysing whether better results can be achieved.
ACKNOWLEDGEMENTS
The research presented in this paper has been funded
by the Flemish Agency for Innovation and En-
trepreneurship (VLAIO), project SSAVE.
REFERENCES
A. Khan, A. Sohail, U. Z. and Qureshi, A. (2020). A sur-
vey of the recent architectures of deep convolutional
neural networks review. Artificial Intelligence Review,
53:5455–5516.
B. Benjdira, T. Khursheed, A. K. and Ammar, A. (2019).
Car detection using unmanned aerial vehicles: Com-
parison between faster r-cnn and yolov3. In 2019
1st International Conference on Unmanned Vehicle
Systems-Oman (UVS). IEEE.
B. Bovcon, J. Muhović, J. P. and Kristan, M. (2019). The MaSTr1325 dataset for training deep USV obstacle detection models. In IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE.
B. Iancu, V. Soloviev, L. Z. and Lilius, J. (2021). Aboships-
an inshore and offshore maritime vessel detection
dataset with precise annotations. Remote Sensing,
13(5):988.
B. Zoph, E. C. and Ghiasi, G. (2020). Learning data aug-
mentation strategies for object detection. In ECCV.
Borja, B. and Kristan, M. (2020). A water-obstacle sepa-
ration and refinement network for unmanned surface
vehicles. In 2020 IEEE International Conference on
Robotics and Automation (ICRA). Computer Science.
DK. Prasad, D. Rajan, L. R. and Rajabally, E. (2017). Video
processing from electro-optical sensors for object de-
tection and tracking in a maritime environment: a survey. IEEE Transactions on Intelligent Transportation Systems.

Figure 9: Images from the test dataset showing the performance for predicting the seven classes: ship, barge, cutter, yellow mark, red mark, line mark and other obstacle.
Fachrie, M. (2020). A simple vehicle counting system using deep learning with yolov3 model. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 4:462–468.
G. Ciaparrone, FL. Sánchez, S. T. and Troiano, L. (2020).
Deep learning in video multi-object tracking: A sur-
vey. Neurocomputing, 381:61–88.
I. Goodfellow, Y. B. and Courville, A. (2016). Deep Learn-
ing. MIT Press. http://www.deeplearningbook.org.
J. Redmon, S. Divvala, R. G. and Farhadi, A. (2016). You
only look once: Unified, real-time object detection.
Computer Vision and Pattern Recognition.
Jocher, G. (2021). Yolov5 in pytorch.
https://github.com/ultralytics/yolov5.
M. Kasper-Eulaers, N. Hahn, S. B. and Sebulonsen, T.
(2021). Short communication: Detecting heavy
goods vehicles in rest areas in winter conditions us-
ing yolov5. Algorithms, 14:114.
N. Wojke, A. B. and Paulus, D. (2017). Simple online and
realtime tracking with a deep association metric. In
Computer Vision and Pattern Recognition. Computer
Science.
PF. Felzenszwalb, RB. Girshick, D. M. and Ramanan, D.
(2009). Object detection with discriminatively trained
part-based models. IEEE transactions on pattern
analysis and machine intelligence, 32(9):1627–1645.
TY. Lin, M. Maire, S. B. and Hays, J. (2017). Microsoft
coco: Common objects in context. In European con-
ference on computer vision. Springer.
Z. Chen, D. Chen, Y. Z. and Cheng, X. (2020). Deep learn-
ing for autonomous ship-oriented small ship detec-
tion. Safety Science, 130:104812.
APPENDIX
The SSAVE dataset has been made publicly available
at:
https://mecatron.rma.ac.be/index.php/publications/datasets/