tight ROIs and 0.861 for context ROIs. Training on
high-res videos with context ROIs did not improve
performance, yielding an AP of 0.848.
The context ROI of a player who is the object of
a foul often includes, completely or partially, the sub-
ject who is fouling him or her—and vice versa (e.g.,
the 3rd and 4th columns of Fig. 4). Therefore, a
binary subject-object label is ill-suited and may slow training. We instead propose a multi-label task in which each video clip is labeled with two floating-point values between 0 and 1 indicating subjectness and objectness, computed as the median IoU between the ROI and the subject and object bounding boxes over every frame of the sequence. In this case the AP rises
to 0.980.
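For concreteness, the label computation amounts to the following minimal sketch; the (x1, y1, x2, y2) box format and helper names are illustrative, and it assumes subject and object boxes are annotated in every frame of the clip:

    from statistics import median

    def iou(a, b):
        # Intersection over union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def soft_labels(roi_track, subj_track, obj_track):
        # Per-clip (subjectness, objectness) targets: the median
        # per-frame IoU between the ROI track and each ground-truth track.
        subjectness = median(iou(r, s) for r, s in zip(roi_track, subj_track))
        objectness = median(iou(r, o) for r, o in zip(roi_track, obj_track))
        return subjectness, objectness

A fully overlapping track yields a target near 1, a distant bystander near 0, and partial overlap (the common case for context ROIs) falls in between.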
The context variant of SF PvB successfully detected 64.24% of foul participants at a 0.5 IoU threshold at the foul moment over a test set of 167 clips (vs. 52.51% for the tight variant with the same tracks).
Fig. 1 shows three examples of such detections. The
second row demonstrates the detector’s ability to pick
out one anomalous motion in a crowd (in this case
the foul object sinking to the ground). Subjects and
objects were detected at the same IoU threshold with
30.15% and 45.21% accuracy, respectively (16.39%
and 30.06% for tight). The detection accuracy is con-
siderably higher at lower IoU thresholds (e.g., 84.34% at 0.1 IoU), indicating that this approach locates the
rough foul area quite robustly.
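These per-threshold accuracies reduce to a simple matching check over the test clips; a minimal sketch, reusing iou() from the listing above and assuming one predicted and one ground-truth box per clip at the foul moment:

    def accuracy_at_iou(pred_boxes, gt_boxes, thresh=0.5):
        # Fraction of clips whose predicted box overlaps the ground-truth
        # box with IoU >= thresh; iou() as defined in the previous sketch.
        hits = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
        return hits / len(pred_boxes)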
6 CONCLUSION AND FUTURE WORK
We report strong performance on a sports spatiotem-
poral video activity recognition task. There are a
number of directions to take before removing the
foul oracle assumption and working on the scale
of entire games, including extending the system to
near-view clips with more training examples, deal-
ing with shot boundaries automatically, and incor-
porating foul-relevant information outside of sub-
ject/object bounding boxes. Filtering subject/object
hypotheses by making sure candidate pairs are on op-
posite teams could boost performance, but would require per-game learning of jersey colors and patterns using,
for example, deep image clustering (Li et al., 2021).
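As a rough sketch of such a filter, k-means over jersey color histograms can stand in for deep clustering; all names here are hypothetical, and goalkeepers and referees are ignored:

    import numpy as np
    from sklearn.cluster import KMeans

    def team_labels(jersey_crops, n_teams=2):
        # Cluster per-channel color histograms of player crops
        # (H x W x 3 uint8 arrays) into two teams.
        feats = [np.concatenate([np.histogram(c[..., ch], bins=8,
                                              range=(0, 256))[0]
                                 for ch in range(3)]) / c[..., 0].size
                 for c in jersey_crops]
        return KMeans(n_clusters=n_teams, n_init=10).fit_predict(np.array(feats))

    def opposite_team_pairs(track_ids, labels):
        # Keep only subject/object candidate pairs from opposite teams.
        return [(a, b) for i, a in enumerate(track_ids)
                for j, b in enumerate(track_ids) if labels[i] != labels[j]]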
Using high-res versions of the game videos would en-
able further analysis such as ball tracking and reading
player names/jersey numbers to correlate with roster
data and/or commentary. Finally, camera pose esti-
mation and parsing of field line features (Cuevas et al.,
2020) would help filter out off-field person detections and
recognize foul-relevant game situations.
REFERENCES
Assfalg, J., Bertini, M., Colombo, C., Bimbo, A. D., and
Nunziati, W. (2003). Semantic annotation of soccer
videos: automatic highlights detection. Computer Vi-
sion and Image Understanding, 92(2):285–305.
Canales, F. (2013). Automated Semantic Annotation of
Football Games from TV Broadcast. PhD thesis, De-
partment of Informatics, TUM Munich.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? A new model and the Kinetics dataset.
In IEEE Conference on Computer Vision and Pattern
Recognition.
Cuevas, C., Quilón, D., and García, N. (2020). Automatic soccer field of play registration. Pattern Recognition, 103.
Fédération Internationale de Football Association (FIFA) (2015). Laws of the game. https://img.fifa.com/image/upload/datdz0pms85gbnqy4j3k.pdf.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019). SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 6202–6211.
FIFA.com (2019). Video assistant referees (VAR). https://football-technology.fifa.com/en/media-tiles/video-assistant-referee-var.
Gerke, S., Müller, K., and Schäfer, R. (2015). Soccer jersey
number recognition using convolutional neural net-
works. In IEEE International Conference on Com-
puter Vision Workshop.
Giancola, S., Amine, M., Dghaily, T., and Ghanem, B.
(2018). SoccerNet: A scalable dataset for action spot-
ting in soccer videos. In CVPR Workshop on Com-
puter Vision in Sports.
Gkioxari, G., Girshick, R., Dollár, P., and He, K. (2018). Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8359–8367.
Hasan, I., Liao, S., Li, J., Akram, S. U., and Shao, L. (2020).
Generalizable pedestrian detection: The elephant in
the room. arXiv preprint.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969.
Hu, G., Cui, B., He, Y., and Yu, S. (2020). Progressive re-
lation learning for group activity recognition. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 980–989.
Huda, N., Jensen, K., Gade, R., and Moeslund, T.
(2018). Estimating the number of soccer players using
simulation-based occlusion handling. In CVPR Work-
shop on Computer Vision in Sports.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Suk-
thankar, R., and Fei-Fei, L. (2014). Large-scale video
classification with convolutional neural networks. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 1725–1732.
Kazemi, V. and Sullivan, J. (2012). Using richer models for
articulated pose estimation of footballers. In British
Machine Vision Conference.