Tracking Data Visual Representations for Sports Broadcasting

Enrichment

Murilo Couceiro

1,3 a

, In

es Rito Lima

3 b

, Alexandre Ulisses

3 c

, Tiago Mendes-Neves

1,2 d

and Jo

ao Mendes-Moreira

1,2 e

Faculdade de Engenharia, Universidade do Porto, Porto, Portugal

LIAAD - INESC TEC, Porto, Portugal

MOG Technologies, Department of Innovation, Maia, Portugal

Keywords:

Metadata Enhancement, Player Tracking, Computer Vision, Sports Broadcasting.

Abstract:

The broadcast of audio-video sports content is a ﬁeld with increasingly larger audiences demanding higher

quality content and involvement. This growth creates the necessity to develop more content to engage the

users and keep this trend. Otherwise, it may stall or even diminish. Therefore, enhancing the user experience,

engagement, and involvement during live sports event broadcasts is of utmost importance. This paper pro-

poses a solution to extract event’s information from video, resorting to Computer Vision techniques and Deep

Learning algorithms. More speciﬁcally, the project encompassed the deﬁnition and implementation of ﬁeld

registration, object detection and tracking tasks. Focusing on football sports events, a novel dataset combining

several video sources was created and used for analysis and metadata extraction. In particular, the proposed

solution can detect and track players with acceptable precision using state-of-the-art methods, like YOLOv5

and DeepSORT. Furthermore, resorting to unsupervised learning techniques, the system provides team seg-

mentation based on the colour of the players’ kits. A series of visual representations regarding the players’

movements on the ﬁeld enables broadcast enrichment and increased user experience. The presented solution

is framed in the H2020 DataCloud project and will be deployed in a cloud environment simplifying its access

and utilisation.

1 INTRODUCTION

As the most popular sport in our society, with millions

of fans, football has very high market value and has a

growing amount of visual data generated, leading to

a growth in the research of this topic. Football clubs,

coaches and fans demand customised football match

analysis services, creating the need within the broad-

casting industry for automated video annotation and

metadata extraction tools.

The proposed work is framed in the DataCloud

project

, particularly on the MOGSports business

case. MOGSports focuses on the media and entertain-

ment business and aims to enhance the user experi-

https://orcid.org/0000-0002-8451-9680

https://orcid.org/0000-0002-9681-4740

https://orcid.org/0000-0003-4802-1508

https://orcid.org/0000-0002-4802-7558

https://orcid.org/0000-0002-2471-2833

Horizon2020 grant agreement nº 101016835

ence, engagement, and involvement during live sports

event broadcasts.

This paper explores the development of a system

that assesses multi-camera video content and provides

visual representations to enhance the user experience

on live sports broadcasting. Additionally, the pre-

sented system extracts tracking data as a ﬁrst step

towards event detection. The proposed data acquisi-

tion setup includes two cameras to extract informa-

tion from the entire ﬁeld at all times. The system

extract metadata by a detection and tracking algo-

rithm for each camera. This uni-camera information

is then fused, creating a ﬁnal multi-camera global re-

sult. The solution must be deployed in a Cloud en-

vironment along with the project objectives. More-

over, the project relates to complementing a broad-

cast, so real-time or near-real-time extraction is nec-

essary. Thus, the size and velocity of the methods

used are critical aspects.

Football matches video contains prominent and

diverse information that can be extracted to enrich the

Couceiro, M., Rito Lima, I., Ulisses, A., Mendes-Neves, T. and Mendes-Moreira, J.

Tracking Data Visual Representations for Sports Broadcasting Enrichment.

DOI: 10.5220/0011551900003321

In Proceedings of the 10th International Conference on Sport Sciences Research and Technology Support (icSPORTS 2022), pages 125-131

ISBN: 978-989-758-610-1; ISSN: 2184-3201

125

content of a broadcast. This paper presents a solu-

tion for extracting player position and team informa-

tion, in near real-time, throughout the match, provid-

ing the equivalent positional representations as out-

put. The proposed methodology is crucial to enable

further complex analysis such as goals, passes, player

velocity and other statistics extraction. Furthermore,

the applications of the retrieved positional data are

manifold since it is the basis for most sport-related

Machine Learning work, making it an excellent con-

tribution to the Sports Data Mining Community.

Following the context provided in the current sec-

tion, a literature overview on sports analytics, com-

puter vision and machine learning is given in Section

2. Section 3 presents the pipeline of the project at

hand and detailed information about its modules. Sec-

tion 4 presents the results obtained with the proposed

solution, and Section 5 describes its positive and neg-

atives sides as well as recommendations and ideas for

future work. Finally, Section 6 summarises the work

done for this paper.

2 RELATED WORK

Investigation into sports-related areas has been get-

ting much attention. Many studies have been con-

ducted in Computer Vision and Deep Learning to

extract football matches information through video.

This paper covers the areas of Field Registration, Ob-

ject Detection and Object Tracking. All to extract in-

formation from a video of a football match.

The task of sports ﬁeld registration aims at deter-

mining a homography matrix H for the transforma-

tion of an image into two-dimensional sports ﬁeld co-

ordinates (Theiner et al., 2022). The homography is

deﬁned by a 3x3 matrix with eight degrees of free-

dom that performs mapping between two images of a

planar surface from different perspectives. There are

two traditional homography estimation approaches,

direct and feature-based methods, which can also be

used together to increase robustness (Nie et al., 2021;

Nguyen et al., 2018). The direct methods use pixel-

to-pixel matching by shifting or warping the images

relative to each other and comparing the pixel inten-

sity values using an error metric. The feature-based

approach ﬁrst extracts keypoints in each image using

local feature detectors to establish correspondences

between the two sets. Also, inspired by the data-

driven Convolution Neural Networks (CNN) in CV,

approaches for homography estimation have emerged

using supervised methods. For this extent, (Nie et al.,

2021) regresses a set of dense features through a deep

network, to estimate the homography. Another ap-

proach is to use unsupervised methods (Nguyen et al.,

2018). Generative Adversarial Networks (GAN) and

synthetic data have also been proposed (Chen and Lit-

tle, 2019)

Player detection is a case of object detection but

with additional challenges. When performing player

detection, several associated challenges must be faced

to obtain successful results: the occlusions between

players, the abrupt movements of the camera, the

changes in lighting, the lack of resolution in very

distant players, the blurring of the players that are

moving, or the players remaining long static periods

(Cuevas et al., 2020). When (Sah and Direkoglu,

2021) compares object detection models such as

Faster R-CNN, YOLO and SSD using different back-

bones and other parameters for a ﬁeld hockey dataset,

it concludes that these complex neural networks fail

to detect occluded players and players with low reso-

lution (far from the camera).

Player Tracking is a Multi-Object-Tracking

(MOT) problem. The MOT methods can be grouped

regarding the initialisation method into two sets:

Detection-Based Tracking (DBT) and Detection-Free

Tracking (DFT). Brieﬂy, DBT, commonly named

tracking-by-detection, happens when objects are ﬁrst

detected and then linked into trajectories. Thanks to

improvements in the performance of object detectors

and the ability to deal with challenges such as clut-

tered scenes or dynamics of tracked objects, tracking-

by-detection has become a leading paradigm for MOT

(Ivasic-Kos et al., 2021). This approach is followed

by methods like SORT, DeepSORT, TRACKTOR++,

FairMOT, TransMOT, and StrongSORT. These meth-

ods make the tracking step an association step for the

detections received by the detector.

The game analysis applications have a variety of

topics: action spotting or event detection, player and

ball tracking, tactics analysis, statistics collection,

pass feasibility, talent scouting, game phase analysis,

or highlights generation (Cioppa et al., 2021). Fur-

thermore, there is work on the enhancement and ex-

perience area, like (Rematas et al., 2018) that uses

augmented reality to present a match in a 3D model

on a ﬂat surface.

3 PROPOSED SOLUTION

This section describes the MOGSports modular

pipeline shown in Figure 1, focusing on the AI Event

Detection and Metadata Fusion modules.

The pipeline was designed to enable the use of

N cameras, each undertaking an AI Event Detection

module. This module includes three steps: Field De-

icSPORTS 2022 - 10th International Conference on Sport Sciences Research and Technology Support

126

Figure 1: MOGSports pipeline.

tection, Player Tracking and Event Detection. This

paper presents the ﬁrst two modules.

Field Detection encapsulates the task of Field

Registration. Resorting to a homography matrix,

Field Registration allows the transformation from the

game image to a top view perspective of the ﬁeld.

With the players’ detection representations, this top

model allows a better understanding of the game.

The Metadata Fusion module combines the infor-

mation retrieved from each camera video processing.

In this sense, the fusion module must adjust accord-

ingly, accompanying the camera scalability.

3.1 Camera Setup

Whilst using more cameras enables detailed ﬁeld cov-

erage, it also increases the computation effort when

matching the extracted data. Hence, the counterbal-

ance between the number of cameras and the fusion

algorithm complexity must be evaluated.

In this sense, the proposed setup comprises two

static cameras placed along one side-line, as shown in

Figure 2, guaranteeing the coverage of the entire ﬁeld

with signiﬁcant detail. Furthermore, this setup allows

the design of innovative strategies to ﬁght occlusion,

one of the problematic aspects of payer detection and

tracking task

This setup was tested at Est

adio Municipal de

Aveiro stadium, and the cameras were around 20 me-

ters high and had around 15

◦

tilt.

Figure 2: Proposed Camera Setup.

3.2 Field Detection

The Field Detection problem relates to the homogra-

phy matrix computation between the captured image

and a top-view ﬁeld image.

Since the setup cameras are static, the homogra-

phy computation only needs to happen once at the

match’s beginning. The premise of a constant ho-

mography makes the registration process exponen-

tially cheaper when compared to a moving camera,

where it can change several times per second.

Furthermore, this premise also enables a

lightweight manual registration. Thus, the proposed

solution integrates a manual registration process, re-

quiring the user to select four or more corresponding

points in one video frame and on the top view ﬁeld

image.

3.3 Player Tracking

Player Tracking allows the identiﬁcation and re-

identiﬁcation of the detected players throughout the

frames, as shown in Figure 5. Tracking can create

and represent more valuable information that aug-

ments the user experience, such as other visual repre-

sentation options or individual player statistics. The

chosen tracking algorithm is based on a tracking-by-

detection methodology, so player detection is a fun-

damental step. Without a robust detection model, the

tracking algorithm will perform poorly.

3.3.1 Player Detection

The detection algorithm must be as accurate and fast

as possible, enabling the pipeline to run in real-time.

Also, it must be precise to strengthen the tracking

step. These requirements drove the choice of the one-

stage algorithm YOLOv5 (Jocher et al., 2022). This

object detector is chosen as it has the best speed and

accuracy counterbalance, as well as an easy integra-

tion. As very fast inference is needed, only one-stage

detectors were considered.

Tracking Data Visual Representations for Sports Broadcasting Enrichment

127

The players’ position in the ﬁeld is set where they

stand, at their feet. Foreseeing that the player is in

the centre of the retrieved bounding box, its posi-

tion is deﬁned with the bounding box’s bottom centre

coordinates. Once mapped by the homography, this

point is the player’s position in the top-view repre-

sentation. The top-view ﬁeld image is predetermined,

with known ﬁeld lines coordinates. Thus, the players

projected coordinates can be easily ﬁltered around the

ﬁeld area. Hence, the off-ﬁeld detections are removed

before being forwarded to the tracking algorithm.

3.3.2 DeepSORT

The tracking algorithm integrated with the solution is

DeepSORT (Wojke et al., 2017). Despite sporadic

accuracy issues, it is one of the most suitable track-

ing algorithm for the time-quality balance among the

state-of-the-art.

This algorithm contains a motion and a deep asso-

ciation metric. The proposed solution uses an OSNet,

trained for person re-identiﬁcation, to extract the fea-

tures needed for the deep association metric.

3.4 Metadata Fusion

This module is responsible for receiving the cameras’

metadata and merging it into a single, complete out-

put. It is also responsible for assigning the tracked

players to their teams, creating visual representations

and storing the information created for other possible

objectives.

To fuse both cameras’ information, the detections

are ﬁltered after mapping the players’ position into

the top view according to their ﬁeld side - the left cam-

era detects the players on the left side and right on the

right side. This strategy requires the deﬁnition of the

midﬁeld line pixels, in the top view representations,

as a separator of the cameras’ input.

3.4.1 Team Labelling

The players’ positional representations must illustrate

their respective teams, which are recognised through

their outﬁt’s colour. Thus, the objective is to anal-

yse the bounding boxes’ colours and distribute them

to the teams. For this, a K-means clustering algo-

rithm allows the aggregation of team players based

on their outﬁt colours. As the ﬁeld players’ colours

are the most meaningful to differentiate, the number

of clusters is two. Regularly, as the players are in the

centre of the bounding boxes, the value used to rep-

resent them is the average of the pixels’ colour of a

5x5 centred area. The model is built with the colour

values of the bounding boxes from both cameras’ ﬁrst

frame. After ﬁtting the model, the following frames’

bounding boxes’ colour values are fed to the model to

predict its cluster and, consequently, its team. After

predicted, the colour is the same throughout the track-

ing ID lifetime. The team labelling algorithm oper-

ates within the fusion module as the k-means model

should be the same for both cameras.

3.4.2 Visual Representations

Three types of visual representations are deﬁned for

the proposed solution based on the players’ positional

information, as shown in Figure 3. The ﬁrst two pro-

vide a spatial-temporal representation of the players’

position at a given moment. However, the second also

represents the past positions of that player, giving it a

notion of direction. The third is a heat map that in-

dicates the ﬁeld areas with more presence throughout

the game.

These visual representations can be used by the

media companies as screen overlays to enrich the

broadcasting of a football match.

4 RESULTS

The results of the proposed solution are promising,

while unveiling technical and scientiﬁc challenges

that need to be addressed in the future. The pro-

posed solution can perform player detection and ac-

cording representation, namely the positional plot and

heatmap, in real-time. However, due to hardware lim-

itations, the system can not perform online tracking.

To evaluate the overall system behaviour, Figure 4

illustrates the number of tracked players per frame

for the right, left and fused outputs, during a signif-

icant time span video (

5 minutes) using the setup de-

scribed in Figure 2. The setup employed two cam-

eras with different lens exposures, which changes the

image’s brightness. This will prove signiﬁcant in the

further presented results. The usage of two different

cameras enabled testing the system’s behaviour fac-

ing various image qualities and delimiting a minimum

quality threshold. The cameras used were CANON’s

250D and 700D, where the former presents better im-

age quality leading to better performance. Figure 4

shows many variations in the detection and tracking

models. Overall, the number of tracks is around 22

persons, which is a good indicator. However, when

the left camera (CANON 250D) has fewer detections,

the right camera (CANON 700D) can not cope with

the players on its side, decreasing the total number

players tracked. This is a consequence of a video

quality discrepancy between cameras, since the right

icSPORTS 2022 - 10th International Conference on Sport Sciences Research and Technology Support

128

Figure 3: Visual Representations.

camera quality is signiﬁcantly lower than the left (see

Figure 5 and 6). In fact, all results are affected by this

limitation, which will be taken into account for future

data acquisition experiments (with CANON’s 250D

image quality as the minimum threshold).

The main consequence of these variations is the

number of ID switches. DeepSORT was selected

based on its ability to ﬁght this problem. In light of

these results, the deep metric used may be less rel-

evant than expected. The auto-encoder used for this

metric is explicitly trained for re-identiﬁcation, so this

is presumably not the faulty step. The reason may be

the distance between the players and the camera. De-

spite the camera placement advantages relative to the

ﬁeld coverage, the players become small, leading to

less signiﬁcant features, which may affect the deep

metric precision. Another aspect affecting the num-

ber of ID switches is the players switching midﬁelds.

Another cause for variations are occlusions, illus-

trated in the top sequence of Figure 6. These are hard

to oppose with a single view. However, as shown in

the bottom sequence of Figure 6, the opposite cam-

era can detect the previously occluded players, which

is an advantage of the proposed setup. Therefore, fu-

ture work may include the usage of both cameras to

overcome this phenomenon.

Regarding the current fusion approach, even

though the detection works suitably, some loss of in-

formation may occur. As Figure 5 shows, if one cam-

era detects the players that the other does not, if on

the opposite midﬁeld, these detections are discarded.

Team labelling has satisfactory results. It assigns

the expected colours for a high number of tracks,

with occasional errors, specially when noise is in the

bounding box.

5 DISCUSSION AND FUTURE

WORK

As previously mentioned, the presented solution is

a work-in-progress in the scope of MOGSports and

DataCloud. In this section are discussed some of the

positive and negative aspects of the implemented so-

lution and of the next steps. The current solution

encompasses modules that perform ﬁeld registration,

player detection and tracking, team labelling and fu-

sion. Manual registration is a straightforward method

for computing the homography matrix, which bene-

ﬁts from static cameras. Despite being user-friendly

for few cameras, automation is a project goal and

requirement for scalability. Accordingly, an auto-

mated approach is being implemented and tested. The

Two-GAN model (Chen and Little, 2019) is used for

ﬁeld line segmentation. Then, the solution attempts a

feature-based approach, comparing the top-view and

the segmented image. Along with the tests conducted

until now, the Two-GAN line segmentation is ex-

pected to improve with more training data, primarily

similar to the ones acquired in the proposed setup.

The Player Detection model presents promising

results, still requiring further ﬁne-tuning. Players’ un-

usual positions or fast changing movements increase

the difﬁculty of accurate detections. Furthermore, the

many occlusions during a match lead to unstable po-

sitional information retrieval and consequent unstable

visual representations.

Future work is expected to improve the used

YOLOv5 model performance to detect the players

and add the capacity to detect the ball resorting to the

acquired dataset.

As for the Player Tracking module, due to hard-

ware constraints, DeepSORT is not yet thoroughly

tested for real-time processing. Nonetheless, an un-

wanted ID switching issue has been identiﬁed. Conse-

Tracking Data Visual Representations for Sports Broadcasting Enrichment

129

Figure 4: Number of tracks per frame, fused tracks (blue), left camera (orange), right camera (green). Results obtained with

reduced sample (around 3 minutes) of a custom dataset with proposed camera setup.

Figure 5: Player tracking example for a frame inside the interval [1400-1600] in each camera (cropped image). Blue: Predicted

tracks. Red: Miss detections manually annotated.

Figure 6: Camera Setup advantages for occlusion handling.

Frame sequences for both cameras representing the same

moment, Top: Left camera sequence. Bottom: Right cam-

era sequence.

quently, the algorithm can suffer some modiﬁcations.

As presented in Section 4, the deep metric may have a

lower impact than initially thought in the association

step. Team information could be a valuable metric

for this step, neglecting possible matches with players

from the opposite team. This strategy would require a

very robust team labelling model.

Alternatively, tracking can occur after data fusion,

resorting to algorithms like graph-based tracking, us-

ing the transformed top-view detection coordinates.

This strategy would demand a robust fusion algorithm

as a previous step.

Despite being accurate for numerous players,

team labelling also has limitations. As players move

rapidly, their bounding box may temporarily not ﬁt its

body. Hence, background or other artefacts can inﬂu-

ence the colour value calculation. In this sense, the

method could be improved by assigning each player

to the team associated with the mode of a periodic

check prediction or of the N initial frames predictions.

Furthermore, to counter the cluster issue, as the goal-

keepers have a generally ﬁxed position, their colour

could be assigned by it.

Fusion is critical, and a scalable solution is nec-

essary for this module. The cameras provide a good

overlap area that can be used for occlusion handling

and to complete miss detections from one another.

Incorporating a matching algorithm into this module

could leverage the full potential of the proposed cam-

era setup. One possible strategy would be associat-

ing the players from each camera in each frame. This

method could consider all players, calculate the eu-

clidean distance between the transformed points, and

then associate them optimally. Being a computation-

ally demanding task, other parameters, such as the

team label, could be used to reduce the distances cal-

culations and consecutive matching. The quality dis-

crepancy between cameras used for the data acquisi-

tion affects the results negatively. Consequently, the

debated problems may be softened when considering

a dataset with equally high-quality images.

Event Detection is a comprehensive problem that

MOGSports intends to solve, which reﬂects the cul-

mination of the developments from the previous mod-

ules and its valuable metadata extracted. It may com-

prise match highlights (e.g. goals), match events (e.g.

passes, offsides), as well as statistic insights.

icSPORTS 2022 - 10th International Conference on Sport Sciences Research and Technology Support

130

6 CONCLUSION

In response to audience demand, user experience

enhancement is an imperative action for the sports

broadcasting industry. This paper presented a step

forward to accomplish this enhancement, based on

players’ positional data extraction and representation.

Extracting this data has complex associated problems

and the results reﬂect this complexity. However, there

are one or more proposed alternatives for each prob-

lem encountered. The proposed solution reﬂects on-

going work. The current solution can be integrated

into sports broadcasting. Future work includes the

overlay of speciﬁc game-related events.The results

achieved so far create good expectations for the up-

coming developments and for the potential of the

MOGSports business case within the media and en-

tertainment industries.

ACKNOWLEDGEMENTS

This work has been supported by the H2020 DATA-

CLOUD project, funded by European Union’s Hori-

zon 2020 research and innovation programme under

grant agreement No 101016835, as well as by the Na-

tional Funds through the Portuguese funding agency,

FCT - Fundac¸

ao para a Ci

encia e a Tecnologia, within

project UIDB/50014/2020.

REFERENCES

Chen, J. and Little, J. J. (2019). Sports camera calibration

via synthetic data. In 2019 IEEE/CVF Conference on

Computer Vision and Pattern Recognition Workshops

(CVPRW), pages 2497–2504.

Cioppa, A., Deliege, A., Magera, F., Giancola, S., Bar-

nich, O., Ghanem, B., and Van Droogenbroeck, M.

(2021). Camera calibration and player localization in

soccernet-v2 and investigation of their representations

for action spotting. In Proceedings of the IEEE/CVF

Conference on Computer Vision and Pattern Recogni-

tion, pages 4537–4546.

Cuevas, C., Quilon Gonzalez, D., and Garc

ıa, N. (2020).

Techniques and applications for soccer video analysis:

A survey. Multimedia Tools and Applications, 79.

Ivasic-Kos, M., Host, K., and Pobar, M. (2021). Application

of deep learning methods for detection and tracking of

players. In Deep Learning Applications. IntechOpen.

Jocher, G., Chaurasia, A., Stoken, A., Borovec, J.,

NanoCode012, Kwon, Y., TaoXie, Fang, J., imyhxy,

Michael, K., Lorna, V, A., Montes, D., Nadar, J.,

Laughing, tkianai, yxNONG, Skalski, P., Wang, Z.,

Hogan, A., Fati, C., Mammana, L., AlexWang1900,

Patel, D., Yiwei, D., You, F., Hajek, J., Diaconu, L.,

and Minh, M. T. (2022). ultralytics/yolov5: v6.1 -

TensorRT, TensorFlow Edge TPU and OpenVINO Ex-

port and Inference.

Nguyen, T., Chen, S. W., Shivakumar, S. S., Taylor, C. J.,

and Kumar, V. (2018). Unsupervised deep homogra-

phy: A fast and robust homography estimation model.

IEEE Robotics and Automation Letters, 3(3):2346–

2353.

Nie, X., Chen, S., and Hamid, R. (2021). A robust and efﬁ-

cient framework for sports-ﬁeld registration. In 2021

IEEE Winter Conference on Applications of Computer

Vision (WACV), pages 1935–1943.

Rematas, K., Kemelmacher-Shlizerman, I., Curless, B., and

Seitz, S. (2018). Soccer on your tabletop. In Proceed-

ings of the IEEE conference on computer vision and

pattern recognition, pages 4738–4747.

Sah, M. and Direkoglu, C. (2021). Review and evaluation of

player detection methods in ﬁeld sports: Comparing

conventional and deep learning based methods. Mul-

timedia Tools and Applications.

Theiner, J., Gritz, W., M

uller-Budack, E., Rein, R., Mem-

mert, D., and Ewerth, R. (2022). Extraction of po-

sitional player data from broadcast soccer videos. In

Proceedings of the IEEE/CVF Winter Conference on

Applications of Computer Vision, pages 823–833.

Wojke, N., Bewley, A., and Paulus, D. (2017). Simple on-

line and realtime tracking with a deep association met-

ric. In 2017 IEEE international conference on image

processing (ICIP), pages 3645–3649. IEEE.

Tracking Data Visual Representations for Sports Broadcasting Enrichment

131