Feasibility Study of an Image-based Supporting System for Sprint Training

Shiho Hanashiro¹, Motoki Takematsu¹ and Ryusuke Miyamoto²

¹Department of Computer Science, Graduate School of Science and Technology, Japan
²Department of Computer Science, School of Science and Technology, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Japan
Keywords:
Stride Length, Stride Frequency, Velocity, Color Processing, Location Estimation.
Abstract:
This study focuses on developing a novel system to improve performance in short-distance races, where
stride length, stride frequency, and maximum velocity are important factors. To estimate stride length and
stride frequency, color-based image processing is adopted to extract the feet of a runner based on cosine
similarity in the RGB color space. The experimental results indicate that the stride length and stride frequency
could be estimated with negligible errors. To estimate the running velocity, visual object detection and pose
estimation based on state-of-the-art deep learning schemes were applied: RetinaNet for visual object detection
and OpenPose for pose estimation. The experimental results using the real image dataset indicated that the
estimation error of the velocity by the proposed scheme was negligible.
1 INTRODUCTION
Recently, video-based analysis has become popular in several kinds of sports: tactical analysis for team sports and form analysis for individual sports. These analyses enable not only player performance improvement but also novel presentations to audiences who are not experts. The significant merits of video-based analysis can be summarized as follows: lower sensing-system cost and more sources available for analysis. Video-based analysis requires only visible-light cameras as sensors, which are much cheaper than other sensing devices, and it even enables analysis of the visual information used for TV broadcasting.
This study focuses on performance improvement for short-distance races, in which stride length, stride frequency, and maximum velocity are important factors, using only visual information obtained by RGB cameras. In particular, maximum velocity has a powerful effect on the finishing time of short-distance races (Matsuo et al., 2016). The aim of this study is to realize a supporting system that can be used in daily training to improve the performance of short-distance races.
To measure the running velocity of a human, two types of sensors are widely used: photoelectric sensors and Doppler radar. When photoelectric sensors are used to measure the running velocity, several sensors must be set along the running course. A Doppler radar can measure the running velocity with a single device, but it is expensive. Accordingly, the most significant problem with existing devices for measuring the running velocity is the price of the system. For example, the OptoJump (MICROGATE, 2011) can measure stride frequency and stride length, but it is too expensive for general use by amateur athletes.
The authors are trying to construct a more cost-effective system to measure stride frequency, stride length, and running velocity, in order to make computer-assisted training accessible to more people. In our approach, only visible imaging sensors are utilized to obtain information about the target humans. With this approach, the system is expected to cost much less than existing systems that use expensive sensors.
In the proposed scheme, after the acquisition of an image sequence including a target human, visual object detection and pose estimation based on deep neural networks are applied to estimate the running velocity. RetinaNet (Lin et al., 2017) and OpenPose (Cao et al., 2017) were adopted for detection and pose estimation, respectively. To measure stride frequency and stride length, the feasibility of color-based image processing was evaluated using actual images.
2 RELATED WORK
This section explains OpenPose (Cao et al., 2021) and RetinaNet (Lin et al., 2017), which are adopted in our research.
2.1 OpenPose
The bottom-up approach, which extracts the body parts of humans from an input image and then finds the most plausible connections among the extracted parts, shows good performance for human pose estimation. However, the global relations among local parts are sometimes ignored, and the computational cost becomes too large owing to the large number of possible combinations of body parts to be connected (Pishchulin et al., 2016; Insafutdinov et al., 2016).
To solve the global connection problem, Cao et al. (Cao et al., 2017) proposed a novel convolutional neural network (CNN) architecture. The architecture detects body parts using confidence maps and estimates the connections between two body parts using part affinity fields (PAFs). The computational problem is solved by transforming a complete graph into several bipartite graphs and applying a greedy algorithm to the matching. The greedy algorithm may worsen the accuracy, but the global context provided by the confidence maps and PAFs helps to maintain it. Accordingly, real-time processing with high accuracy is achieved: OpenPose shows good performance on the MPII multi-person dataset (Andriluka et al., 2014) and the COCO keypoints challenge (Lin et al., 2014).
2.2 RetinaNet
RetinaNet (Lin et al., 2017) comprises a backbone network that extracts features from an input image and two subnetworks that localize target objects and estimate their classes. This method attempts to solve the class imbalance problem between foreground and background examples using a novel loss function called focal loss, which reduces the influence of easy negatives in the training process. Consequently, the detection accuracy of RetinaNet outperforms that of two-stage detectors. The following equations represent the focal loss adopted in RetinaNet and the widely used cross-entropy loss:

FocalLoss(p_t) = -(1 - p_t)^γ · log(p_t),
CrossEntropyLoss(p_t) = -log(p_t).
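To make the behavior of the focal loss concrete, the following minimal Python sketch (not taken from the RetinaNet implementation; the value γ = 2 below is only an example) compares it with the cross-entropy loss.

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t assigned to the true class.
    With gamma = 0 it reduces to the ordinary cross-entropy loss."""
    p_t = np.clip(p_t, 1e-7, 1.0)              # avoid log(0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def cross_entropy_loss(p_t):
    p_t = np.clip(p_t, 1e-7, 1.0)
    return -np.log(p_t)

# An easy example (p_t = 0.9) is strongly down-weighted by the focal loss,
# while a hard example (p_t = 0.1) keeps almost the same loss value.
print(focal_loss(0.9), cross_entropy_loss(0.9))   # ~0.001 vs ~0.105
print(focal_loss(0.1), cross_entropy_loss(0.1))   # ~1.866 vs ~2.303
```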
3 ESTIMATION OF STRIDE
FREQUENCY AND STRIDE
LENGTH BASED ON COLOR
IMAGE PROCESSING
In this section, we explain the estimation of the stride frequency and stride length. The stride frequency is defined as the number of steps taken in a given amount of time. Strictly speaking, the stride length refers to the moving distance of the center of mass during one running stride. In this study, the stride length is approximated by the length between the landing points of the foot. Fig. 1 illustrates an example of the landing points of the foot.
The proposed scheme implemented and tested in this article comprises the following processes:
1. Detection of the landing point,
2. Determination of the landing moment,
3. Computation of stride frequency, and
4. Computation of stride length.
The rest of this section details these processes.
3.1 Detection of the Landing Point
First, the foot of the target human is extracted based on the color of the shoes. In this process, the similarity of colors is measured by the cosine similarity in the RGB color space. After the similarity computation, simple thresholding is applied to extract the pixels corresponding to the shoes. To determine the frame in which the foot lands, the scheme searches for the frame in which the foot is located at its lowest position over several frames. In the real images of the sprint, the vertical location of the foot rises and falls, as illustrated in Fig. 2. The movement of the foot can be plotted on a graph, as illustrated in Fig. 3.
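The following Python sketch illustrates this color-based foot extraction; the reference color, the threshold value, and the function names are our own assumptions, not taken from the authors' implementation.

```python
import numpy as np

def extract_shoe_pixels(image_rgb, ref_color, threshold=0.98):
    """Binary mask of pixels whose RGB direction is close to the shoe color.

    image_rgb : H x W x 3 image (uint8 or float)
    ref_color : length-3 reference color of the shoes (e.g. pink)
    threshold : minimum cosine similarity to accept a pixel (assumed value)
    """
    img = image_rgb.astype(np.float64)
    ref = np.asarray(ref_color, dtype=np.float64)
    # Cosine similarity between every pixel vector and the reference color.
    similarity = (img @ ref) / (np.linalg.norm(img, axis=2) * np.linalg.norm(ref) + 1e-9)
    return similarity > threshold

def lowest_foot_row(mask):
    """Vertical (row) coordinate of the lowest extracted shoe pixel, or None."""
    rows = np.where(mask.any(axis=1))[0]
    return int(rows.max()) if rows.size else None
```

Applying lowest_foot_row to every frame yields the per-frame vertical coordinate of the foot analyzed in the next subsection.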
3.2 Determining the Landing Moment
As described in the previous subsection, Fig. 3 illustrates the movement of the foot, where the vertical and horizontal axes represent the vertical coordinate of the foot and the frame number, respectively. In Fig. 3, the flat intervals between two peaks indicate that the foot is in contact with the ground plane. Each interval starts when the foot has just landed on the ground and ends when the foot has just left the ground.

Figure 1: Landing points of the foot.
Figure 2: Movement of the foot location in the real image.

The proposed scheme determines the beginning point of contact between the foot and the ground according to the vertical coordinate of the foot: the frame just before the vertical coordinate becomes approximately constant. Similarly, the end point of the contact is obtained as the frame in which the vertical coordinate begins to differ from that of the beginning point. Through these processes, the landing moment of the foot can be determined.
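A minimal sketch of this landing-moment detection is shown below, assuming the per-frame vertical coordinate of the foot is already available; the tolerance and the minimum interval length are hypothetical parameters.

```python
import numpy as np

def contact_intervals(foot_y, tol=2.0, min_len=3):
    """Find frame intervals in which the vertical foot coordinate stays
    approximately constant, i.e. the foot is in contact with the ground.

    foot_y  : 1-D array of the vertical coordinate of the foot per frame
    tol     : maximum frame-to-frame change regarded as constant (pixels, assumed)
    min_len : minimum number of frames for a valid contact phase (assumed)
    Returns a list of (landing_frame, takeoff_frame) tuples.
    """
    flat = np.abs(np.diff(foot_y)) < tol
    intervals, start = [], None
    for i, is_flat in enumerate(flat):
        if is_flat and start is None:
            start = i                              # candidate landing moment
        elif not is_flat and start is not None:
            if i - start + 1 >= min_len:
                intervals.append((start, i))       # the foot leaves the ground
            start = None
    if start is not None and len(foot_y) - start >= min_len:
        intervals.append((start, len(foot_y) - 1))
    return intervals
```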
3.3 Computation of Stride Frequency
The stride frequency (SF) can be calculated by counting the number of frames between the landing moments determined by the previous operation. Fig. 4 illustrates examples of the computation of the stride frequency.

SF (step/s) = FPS (frame/s) / number of frames per step (frame/step).   (1)
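As a simple illustration of Eq. (1), assuming the 60 fps frame rate used for the real dataset; the step duration below is a hypothetical value.

```python
def stride_frequency(frames_per_step, fps=60.0):
    """Eq. (1): steps per second from the number of frames spent on one step."""
    return fps / frames_per_step

# Hypothetical example: a step lasting 30 frames at 60 fps.
print(stride_frequency(30))   # 2.0 step/s
```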
3.4 Computation of Stride Length
In the proposed scheme, the stride length (SL) was computed using the two end points of foot landing. To estimate lengths in the actual space, white lines were drawn at one-meter intervals, as illustrated in Fig. 1. These white lines provide the relation between pixels and the actual length between them. Finally, we can compute SL based on the number of pixels between the two end points of foot landing.

SL (cm) = Distance between landing points (pixel) × Length (cm/pixel).   (2)
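As a simple illustration of Eq. (2) with hypothetical numbers:

```python
def stride_length_cm(x_start_px, x_end_px, cm_per_pixel):
    """Eq. (2): stride length from the horizontal pixel coordinates of two
    consecutive landing points and the calibration factor (cm per pixel)."""
    return abs(x_end_px - x_start_px) * cm_per_pixel

# Hypothetical example: landings at x = 410 px and x = 880 px, 0.4 cm per pixel.
print(stride_length_cm(410, 880, 0.4))   # 188.0 cm
```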
4 VELOCITY ESTIMATION BY
OBJECT DETECTION AND
POSE ESTIMATION
The proposed scheme estimates the velocity of a target human using object detection by RetinaNet (Lin et al., 2017) and pose estimation by OpenPose (Cao et al., 2021). The RetinaNet-based object detector extracts a bounding box surrounding a target runner, and OpenPose estimates the location of the waist in the extracted bounding box.
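As an illustrative sketch of the detection stage only, the following Python code uses the pretrained RetinaNet available in torchvision to obtain the runner's bounding box; the authors' exact models, weights, and framework are not specified here, so this is an assumption, and the OpenPose-based waist estimation is left as a placeholder.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pretrained RetinaNet from torchvision as a stand-in detector; depending on the
# torchvision version, weights="DEFAULT" may be needed instead of pretrained=True.
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True).eval()

PERSON_CLASS_ID = 1  # 'person' in the COCO label map used by torchvision

def detect_runner(image):
    """Return the highest-scoring 'person' box (x1, y1, x2, y2) or None.
    `image` is a PIL image or an H x W x 3 uint8 array."""
    with torch.no_grad():
        outputs = model([to_tensor(image)])[0]
    person = outputs["labels"] == PERSON_CLASS_ID
    if not person.any():
        return None
    best = outputs["scores"][person].argmax()
    return outputs["boxes"][person][best].tolist()

# The waist location inside the returned box would then be estimated with
# OpenPose (e.g., from the hip keypoints); that step is omitted here.
```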
Once the locations of the waist are obtained in two arbitrary frames, the velocity of the target runner can be computed using the following equation:
v_pred [cm/s] = (l_pred [pixel] · α [cm/pixel]) / (t [frame] / 120 [fps]),   (3)
where l_pred, α, and t represent the moving length of the waist, the distance per pixel, and the number of frames adopted in the velocity computation, respectively.
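A minimal sketch of Eq. (3) is given below; the waist coordinates and calibration factor are hypothetical values, and the 120 fps of Eq. (3) is kept as the default.

```python
def velocity_cm_per_s(waist_x1_px, waist_x2_px, cm_per_pixel, n_frames, fps=120.0):
    """Eq. (3): running velocity from the waist positions in two key frames.

    waist_x1_px, waist_x2_px : horizontal waist coordinates (pixels)
    cm_per_pixel             : calibration factor alpha (cm/pixel)
    n_frames                 : number of frames t between the two key frames
    fps                      : frame rate used in Eq. (3)
    """
    moving_length_px = abs(waist_x2_px - waist_x1_px)
    return (moving_length_px * cm_per_pixel) / (n_frames / fps)

# Hypothetical example: the waist moves 500 pixels in 60 frames at 0.4 cm/pixel.
print(velocity_cm_per_s(100, 600, 0.4, 60))   # 400.0 cm/s
```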
Fig. 5 illustrates the parameters adopted in the velocity estimation: the red line represents the motion of the waist, the numerical value with “cm” at the bottom right represents the moving distance, and the numerical value with “frame” indicates the number of frames used in the estimation.
5 EVALUATION
This section describes the dataset utilized in the evaluation and how the performance of estimating the stride frequency (SF), stride length (SL), and velocity was evaluated.

Figure 3: Movement of the location plotted as a graph.
Figure 4: How to compute the stride frequency. The number of frames corresponding to the same pose is counted. In this example, 34 and 33 frames are spent for the first and second cycles, respectively.
5.1 Dataset
A dataset was created using an actual video sequence to evaluate the performance of the proposed scheme. In the video, the runner wore pink shoes for the color processing. Because it is difficult to make a correspondence between the actual length and pixels in the captured images, white lines were drawn at one-meter intervals on the ground. The camera used to record the video was located approximately ten meters from the running course, at a height of 1.5 m. The resolution and frame rate were 1920 × 1080 pixels and 60 frames per second, respectively. Fig 6 illustrates an example shot of the dataset using actual images. For the evaluation of SF and SL only, a dataset based on synthetic images was also created: a CG-based dataset was generated using Unreal Engine 4 (UE4), as illustrated in Fig. 7.
5.2 Image Calibration before
Evaluation
When measuring the moving length in the actual space from a captured image, image calibration is performed to obtain the relationship between lengths in the actual three-dimensional space and in the image plane. Accordingly, the white lines drawn at one-meter intervals were used: the number of pixels between two adjacent lines was measured manually. Once the number of pixels corresponding to 1 m in the actual space was obtained, the calibration could be easily performed.
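A minimal sketch of this calibration step is shown below, assuming the manually measured pixel coordinates of two adjacent lines; the numbers are hypothetical.

```python
def cm_per_pixel(line1_x_px, line2_x_px, interval_cm=100.0):
    """Calibration factor from two adjacent white lines drawn 1 m apart.
    The pixel coordinates are assumed to be measured manually."""
    return interval_cm / abs(line2_x_px - line1_x_px)

# Hypothetical example: adjacent lines appear 250 pixels apart in the image.
print(cm_per_pixel(700, 950))   # 0.4 cm/pixel
```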
5.3 Estimation of SF and SL
Tables 1 and 2 present the errors in SF for the CG-based and real datasets, respectively. For the CG-based dataset, the ground truth was obtained from the locations of the human model utilized in the data generation. For the real dataset, the ground truth was created manually. The error values sometimes became larger in the CG-based dataset, but they were not as large in the real dataset. Table 3 presents the estimation errors of SL in the real dataset. The result seems good because the largest error was only 1.25 cm.

Figure 5: Example of output in velocity estimation.
Figure 6: CG-based dataset generated using UE4.
5.4 Velocity Estimation
To evaluate the estimation performance of the velocity, the following two criteria were used.
5.4.1 Estimation Error of Distance
Before evaluating the velocity estimation, the estimation error of the distance between the two key frames used for the velocity evaluation was examined. The estimation error is computed using the following equation:
error_l = |l_true − l_pred|,   (4)

where l_true and l_pred represent the distance between the key frames in the ground truth and that computed from the estimated locations of the waist, respectively.
Ideally, the ground truth should have been created using highly accurate sensors, but it was created manually. We attempted to use Perception Neuron (NOITOM, 2018), one of the most widely used motion capture systems, but it did not work well.

Figure 7: Dataset comprising actual images.
Table 4 presents the distance error while changing the number of frames adopted to compute this error. The results show that the proposed scheme can estimate the distance between two key frames with small error values.
5.4.2 Estimation Error of Velocity
To evaluate the estimation error of velocity, the distance errors presented in Table 4 were divided by the elapsed time corresponding to the number of frames between the two key frames, as indicated in the following equation:
error_v = |l_true − l_pred| / (t / 120).   (5)
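As a small worked example of Eqs. (4) and (5), consistent with the 120-frame entries in Tables 4 and 5; the distance values below are hypothetical.

```python
def distance_error_cm(l_true_cm, l_pred_cm):
    """Eq. (4): absolute error of the estimated moving distance."""
    return abs(l_true_cm - l_pred_cm)

def velocity_error_cm_per_s(l_true_cm, l_pred_cm, n_frames, fps=120.0):
    """Eq. (5): the distance error divided by the elapsed time t / fps."""
    return distance_error_cm(l_true_cm, l_pred_cm) / (n_frames / fps)

# Hypothetical distances giving a 1.9 cm error over 120 frames, which
# corresponds to the 1.9 cm/s velocity error reported for 120 frames.
print(velocity_error_cm_per_s(500.0, 498.1, 120))   # about 1.9 cm/s
```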
Table 5 presents the estimation errors of velocity while changing the number of frames between the two key frames used for measuring the velocity. The estimation error seems acceptable because the error value was approximately 1.9 cm/s when 120 frames were used. However, the smaller the number of frames, the larger the error. If we want to estimate the velocity over short sections, the estimation error of the distance must become negligible to obtain practical values for velocity estimation.

Table 1: Stride frequency (UE4).
steps          1     2     3     4     5     6     7     8     9     10    11
error (step/s) 0.21  0.01  0.16  0.01  0.21  0.01  0.01  0.01  0.43  0.32  0.21

Table 2: Stride frequency (real image).
steps          1       2       3       4       5       6
error (step/s) 0.1478  0.1961  0.1137  0.1009  0.1209  0.0000

Table 3: Stride length (real image).
steps          1     2     3     4     5     6
error (cm)     1.25  0.62  0.55  0.08  0.62  0.55

Table 4: Estimation errors of distance.
frame          2     10    30    60    120
error [cm]     1.14  1.85  2.50  2.10  1.90
Table 5: Speed error.
frame          2     10    30    60    120
error [cm/s]   68.5  22.2  10.0  4.2   1.9
6 CONCLUSION
This paper proposes a novel scheme for supporting sprint training using image processing alone. The proposed scheme estimates the stride frequency (SF) and stride length (SL) using color processing based on cosine similarity in the RGB color space. Experimental results indicated that SF and SL could be estimated with negligible errors. To estimate the running velocity, visual object detection and pose estimation based on state-of-the-art deep learning schemes were applied: RetinaNet for visual object detection and OpenPose for pose estimation. The experimental results using the real image dataset indicated that the distance error of the proposed scheme was negligible. However, it may be insufficient for measuring velocity over very short sections. To further improve the estimation accuracy, the accuracy of image-based localization should be improved.
REFERENCES
Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B.
(2014). 2D human pose estimation: New benchmark
and state of the art analysis. In Proc. IEEE Conf. Com-
put. Vis. Pattern Recognit.
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., and Sheikh, Y.
(2021). Openpose: Realtime multi-person 2D pose es-
timation using part affinity fields. IEEE Trans. Pattern
Anal. Mach. Intell., 43(1):172–186.
Cao, Z., Simon, T., Wei, S., and Sheikh, Y. (2017). Real-
time multi-person 2d pose estimation using part affin-
ity fields. In Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., pages 1302–1310.
Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M.,
and Schiele, B. (2016). Deepercut: A deeper, stronger,
and faster multi-person pose estimation model. In
Proc. European Conference on Computer Vision,
pages 34–50.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017). Focal loss for dense object detection. In Proc.
IEEE Int. Conf. Comput. Vis., pages 2980–2988.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft COCO: Common objects in context. In
Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T.,
editors, Proc. European Conference on Computer Vi-
sion, pages 740–755, Cham. Springer International
Publishing.
Matsuo, A., Hirokawa, R., Yanagiya, T., Matsubayashi, T., Takahashi, K., Kobayashi, K., and Sugita, M. (2016). Speed and pitch-stride analysis for men's and women's 100 m in the 2016 season and all seasons (in Japanese). 12:74–83. jaaf.or.jp/pdf/about/publish/2016/2016-074-83pdf.pdf.
MICROGATE (2011). OptoJump Next. https://training.microgate.it/en/products/optojump-next.
NOITOM (2018). Perception neuron 2.0.
https://neuronmocap.com/.
Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., An-
driluka, M., Gehler, P. V., and Schiele, B. (2016).
Deepcut: Joint subset partition and labeling for multi
person pose estimation. In Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., pages 4929–4937.