feature extraction techniques relying on
common-sense visual observation?
3. If so, can we develop a multi-stage video
processing and modelling framework that is
transferable to other sports?
1.2 Background and Prior Work
Advancements in motion pattern indexing can be
evaluated not only by improved classification
performance for a specific task, low-cost real-time
computing and an extended number of labelled
events of interest, but also by their universal
applicability to various sources such as 3D motion
data (Bačić & Hume, 2018), video (Bloom &
Bradley, 2003; D. Connaghan, Conaire, Kelly, &
Connor, 2010; Martin, Benois-Pineau, Peteri, &
Morlier, 2018; Ramasinghe, Chathuramali, &
Rodrigo, 2014; Shah, Chockalingam, Paluri, Pradeep,
& Raman, 2007), and sensor signal processing
(Anand, Sharma, Srivastava, Kaligounder, &
Prakash, 2017; Damien Connaghan et al., 2011; Kos,
Ženko, Vlaj, & Kramberger, 2016; Taghavi, Davari,
Tabatabaee Malazi, & Abin, 2019; Xia et al., 2020).
To our knowledge, computer-vision-based
recognition of tennis shots or strokes started in
2001, combining computer vision and hidden
Markov model (HMM) approaches, before HD TV-
broadcast resolution became available (Petkovic,
Jonker, & Zivkovic, 2001). Since Sepp Hochreiter
and Jürgen Schmidhuber introduced Long Short-Term
Memory (LSTM) networks in 1997, LSTMs have been used in
action recognition (Cai & Tang, 2018; Liu,
Shahroudy, Xu, Kot, & Wang, 2018; Zhao, Yang,
Chevalier, Xu, & Zhang, 2018). In 2017, inertial
sensors with Convolutional Neural Networks (CNN)
and bi-directional LSTM networks were used to
recognise actions in multiple sports (Anand et al.,
2017). In 2018, an LSTM with Inception v3 was used
to recognise actions in tennis videos, achieving 74%
classification accuracy (Cai & Tang, 2018).
For prototyping explainable AI in next-generation
augmented coaching software, which is expected to
capture experts’ assessments and continue to provide
comprehensive coaching diagnostic feedback (Bačić
& Hume, 2018), we can rely on multiple data sources
including those operating beyond human vision.
Prior work on 3D motion data is categorised as:
(1) traditional feature-based swing indexing based on
sliding windows and thresholding (Bacic, 2004), and
an expert-driven algorithmic approach to tennis shot
and stance classification (Bačić, 2016c); (2) a
featureless approach to accurate swing indexing
using an Echo State Network (ESN) (Bačić, 2016b);
and (3) further sub-event processing, i.e., phasing
analysis via a produced ensemble of ESNs (Bačić, 2016a).
Prior work on video analysis applied Histograms
of Oriented Gradients (HOG), Local Binary Pattern
(LBP) and Scale Invariant Local Ternary Pattern
(SILTP) for human activity recognition (HAR) in
surveillance (Lu, Shen, Yan, & Bačić, 2018). A pilot
case study on cricket batting balance (Bandara &
Bačić, 2020) used recurrent neural networks (RNNs)
and pose estimation to classify batting balance
(rear or front foot). This prior
work on privacy-preservation filtering is aligned with
privacy-preserving elderly care monitoring systems
and with extracting diagnostic information for
silhouette-based augmented coaching (Bačić, Meng,
& Chan, 2017; Chan & Bačić, 2018). It is also
generally applicable to usability and safety of spaces
where human activity occurs such as smart cities,
future environments and traffic safety (Bačić, Rathee,
& Pears, 2021).
2 METHODOLOGY
Considering past research, our objective is to produce
a relatively simple and generalised initial solution and
a human motion modelling (HMMA) framework for
video indexing applicable to tennis. The tennis
dataset was created from both amateur and
professional players’ videos. It is also expected that
the produced framework may be easily transferable to
other sport disciplines and related contexts such as
rehabilitation and improving safety and usability of
spaces where human movement occurs. As part of
movement pattern analysis, we focused on expressing
features as spatiotemporal human movement patterns
from faster moving segments (e.g., dominant hand
holding a racquet) relative to the more static trunk
segment.
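As a minimal sketch of this idea, the speed of a fast-moving segment (e.g., the dominant wrist) can be expressed relative to a trunk anchor such as the neck, yielding a trunk-invariant spatiotemporal pattern. The joint indices below follow the common 25-key-point (BODY_25) layout, but the exact indices and array shape are assumptions, not the paper's implementation:

```python
import numpy as np

# Assumed BODY_25-style indices: 4 = right wrist, 1 = neck (trunk proxy).
RIGHT_WRIST, NECK = 4, 1

def relative_speed(keypoints, segment=RIGHT_WRIST, anchor=NECK):
    """Per-frame speed of a fast-moving segment relative to the trunk.

    keypoints: array of shape (frames, joints, 2) holding (x, y) per joint.
    Returns an array of shape (frames - 1,) of frame-to-frame displacements.
    """
    # Express the segment position relative to the (more static) trunk anchor
    rel = keypoints[:, segment, :] - keypoints[:, anchor, :]
    # Frame-to-frame displacement of the trunk-relative position
    disp = np.diff(rel, axis=0)
    # Euclidean magnitude gives a 1-D spatiotemporal speed pattern
    return np.linalg.norm(disp, axis=1)
```

Such a relative representation removes gross whole-body translation (e.g., the player running) and retains the swing dynamics of the hand holding the racquet.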
2.1 Stick Figure Overlays as Initial
Data Preprocessing
To retrieve a player’s motion data from video, we
generated stick figure overlays using OpenPose
(https://github.com/CMU-Perceptual-Computing-
Lab/openpose) with its 25-key-point COCO+Foot
estimator (Figure 1 and Figure 2).
Figure 2 shows an example of the data format
representing the key-point coordinates of a player in
each video frame, recorded as multiple time series.
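Assembling such multi-time-series data can be sketched as follows. OpenPose writes one JSON file per frame, with each detected person's key points in a flat `pose_keypoints_2d` list of (x, y, confidence) triples; the directory layout and first-person-only selection below are illustrative assumptions:

```python
import json
from pathlib import Path

import numpy as np

def load_keypoint_series(json_dir):
    """Stack per-frame OpenPose JSON output into a (frames, 25, 3)
    array of (x, y, confidence) rows for the first detected person."""
    frames = []
    for path in sorted(Path(json_dir).glob("*_keypoints.json")):
        with open(path) as f:
            data = json.load(f)
        people = data.get("people", [])
        if people:
            # Flat [x0, y0, c0, x1, y1, c1, ...] list -> (25, 3) rows
            kp = np.array(people[0]["pose_keypoints_2d"]).reshape(-1, 3)
        else:
            kp = np.full((25, 3), np.nan)  # no person detected this frame
        frames.append(kp)
    return np.stack(frames)
```

Stacking the per-frame key points this way yields the multi-time-series representation on which further motion-pattern analysis can operate.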
As video overlays, the animated stick figure topology of
generated key points (Figure 3) represents a way of
extracting information from video to facilitate human