The Elusive Features of Success in Soccer Passes: A Machine Learning

Perspective

Hugo Muacho

1 a

and Ricardo Ribeiro

1,2 b

and Rui J. Lopes

1,3 c

Iscte - Instituto Universit

ario de Lisboa, Portugal

INESC-ID Lisboa, Portugal

Instituto de Telecomunicac¸

oes, Lisboa, Portugal

Keywords:

Machine Learning, Decision Trees, Soccer, Performance Analysis, Pass Success.

Abstract:

Machine learning has in recent years been increasingly used in the soccer realm. This paper focuses on inves-

tigating the factors inﬂuencing pass success, a chief element in team performance. Decision tree techniques

are used aiming to identify which features are the most important in pass success. This process is applied to

a data set of 13 matches of the men’s French “Ligue 1”. Two experiments are conducted using different fea-

ture sets: one containing the positional data and Voronoi area off all players, the second considering only the

ball carrier and closest teammates and opponents. The results obtained with the ﬁrst feature set indicate that

the relative importance of features is match dependent and somehow related to teams’ formation and players’

tactical mission. The second feature set, being more directly related to the passing process, provided a more

consistent ranking of features. Features related to the interaction with the opponent standout. Low precision

and recall values show that the features and factors leading to pass success are in fact elusive.

1 INTRODUCTION

We currently live in the Digital Age (Techopedia,

2017), where new technologies emerge and take their

place in society. Artiﬁcial Intelligence (AI) is part of

this contribution to the society and became one of the

biggest trends in the world today. The recent popu-

larity of AI can be attributed to the following three

factors: the growth of Big Data, easy access to com-

puting power, and the development of new AI tech-

niques (Obschonka and Audretsch, 2020).

In sports, soccer being no exception, AI has been

applied in several areas. Examples of its use are the

evaluation of player performance, team coordination

and prediction of the expected goals (estimating the

quantity and quality of goal opportunities a team has

created in a match). With the introduction of AI in

soccer, teams are able to discover new potentials and

achieve new and ambitious goals, especially in in-

creasing team competitiveness, decision making, and

performance assessment. Albeit this interest and po-

tential, the technology is still immature and needs im-

https://orcid.org/0000-0001-9343-7073

https://orcid.org/0000-0002-2058-693X

https://orcid.org/0000-0002-8943-0415

provement (Ks, 2020).

In the game of soccer, team’s performance is

strongly related to the way the ball carrier interacts

with his teammates via passes. There is a multiplicity

of indicators of passing performance that have been

identiﬁed, notably: the zone of the ﬁeld, the trajec-

tory of the receiver, and the space where the ball is

received (Cordon et al., 2020).

This paper aims to contribute to this enquiry on

the indicators of pass success using AI techniques.

Notably, it uses a computational approach based on

AI techniques (decision tree) that processes real data

in order to quantitatively assess the importance of fea-

tures (e.g., location of players, available area) for pass

success. The novelty on the paper is that the focus

is not on investigating techniques for a better predic-

tion of pass success but rather to understand, using ex-

isting techniques, what are the factors that contribute

more strongly for pass success.

This paper is structured as follows: next section

describes related work on AI and soccer match anno-

tation; Section 3 presents the data sources used; Sec-

tion 4 addresses the methods used to process the data

and compute the importance of features; the results

obtained are presented in Section 5; Final Remarks

and Future Work close the paper.

110

Muacho, H., Ribeiro, R. and Lopes, R.

The Elusive Features of Success in Soccer Passes: A Machine Learning Perspective.

DOI: 10.5220/0011541700003321

In Proceedings of the 10th International Conference on Sport Sciences Research and Technology Support (icSPORTS 2022), pages 110-116

ISBN: 978-989-758-610-1; ISSN: 2184-3201

2 LITERATURE REVIEW

Soccer analytic companies have only recently started

to analyse so-called big data (e.g., high-resolution

video, tracking player movement and possession in-

formation). At the same time, only recently have

major advances been made in machine learning, pro-

ducing techniques that can handle these new high-

dimensional data sets. The amount of data available in

soccer has increased with different techniques to col-

lect a large amount of data such as sensors, GPS, and

computer vision algorithms. This helps the use of ma-

chine learning in soccer in its various areas such as in

recruiting and analysing the performance of players,

in selling tickets bringing fans closer to their club, and

also in helping decision making that affects an entire

area of a club.

2.1 Machine Learning in Soccer

Machine learning is the ﬁeld of study that focus on

how computers learn to perform a task without be-

ing explicitly programmed to do it. It can be deﬁned

as a set of methods that can automatically detect pat-

terns in data to predict future data or to perform other

types of decision making (Murphy, 2012). Machine

learning is beginning to play an essential role within

the following branches of computing: data migra-

tion, hard-to-program applications, and custom soft-

ware applications (Mitchell, 1997). Machine learn-

ing algorithms generally fall into two paradigms: su-

pervised learning and unsupervised learning (Stimp-

son and Cummings, 2014). In supervised learning a

“teacher” is assumed to be present, where the correct

answers are provided for each situation. Supervised

learning techniques build predictive models that learn

from a large number of training examples, where each

training example has a label that indicates its truth

output (Zhou, 2017) – a pair consisting of the input

object and an output label value that belongs to a class

or is a continuous value.

Machine learning, and AI in general, have been

more and more used in the world of soccer not only in

performance or tactical analysis, but also in the med-

ical and marketing departments.

One such example outside the tactical ﬁeld is in-

jury prevention. For example, the study conducted by

Rommers et al. (Rommers et al., 2020) who during

one season tried to predict the injuries of 734 players

aged between 10 and 15 years old from seven Bel-

gian academies. At the beginning of the season a bat-

tery of tests were performed to evaluate motor coor-

dination and physical ﬁtness and characteristics (e.g.,

height, weight, strength, and ﬂexibility). Based on

these characteristics, the machine learning algorithm

was able to predict injuries and distinguish between

serious and light injuries with high accuracy. The ap-

plication of this type of algorithms also helps coaches

in decision making during the game, such as knowing

the physical condition of a player and whether or not

he should be substituted.

Another example of the application of machine

learning in soccer is in analysing player performance.

Jamil et al. (Jamil et al., 2021) applied several ma-

chine learning algorithms (Logistic Regression, Gra-

dient Boosting, and Random Forest) to classify the

performance of professional goalkeepers aiming to

distinguish an elite goalkeeper from a sub-elite goal-

keeper. The conclusions drawn in this study where

that all elite goalkeepers shared the same common

characteristics: short distribution, successfully pass-

ing and receiving the ball, and not conceding goals.

This study suggested that it is the goalkeeper’s skill

with his feet that distinguishes elite goalkeepers from

the sub-elite.

Another example in the area of performance anal-

ysis is the work of Pappalardo et al. (Pappalardo et al.,

2019) through a simulator recommendation. The

work implemented PlayeRank, a data-driven frame-

work that offers a principled multi-dimensional and

role-aware evaluation of the performance of soccer

players.

2.2 Match Data and Annotation

Annotations in soccer are an important tool to obtain

data from a match. The analysis of soccer matches re-

lies on the annotation of both individual player’s ac-

tions (e.g., passes and shoots), athletic performance

and team events (e.g., substitutions). Consequently,

annotating soccer events at a ﬁne-grained level is an

expensive and error-prone task (Barra et al., 2021).

On the other hand, positional data is usually ob-

tained using automated or semi-automated tools that

rely on devices such as GPS receivers, cameras and

computer vision. One of the more interesting oppor-

tunities provided by the availability of position track-

ing data in soccer is the analysis of tactical behaviour.

Tactical behaviour is an important determinant of per-

formance in team sports like soccer, and refers to how

a team manages its spatial positioning over time to

achieve a shared goal.

3 MATERIALS

The material used in this paper is a database cor-

responding to annotation and positional data of 13

The Elusive Features of Success in Soccer Passes: A Machine Learning Perspective

111

matches in the French Premier League (Ligue 1).

This database contains 563 067 entries and 11 vari-

ables. Each entry corresponds to a technical action

performed in the match; the variables correspond to

player’s positioning and other attributes describing

the technical action (e.g., a pass). These include the

following:

Match (integer): unique game identiﬁer;

Period (1,2): ﬁrst (1) or second part (2) of the match

Time (decimal, seconds): match time

Team (f, o): team identiﬁer, (f)ocus or (o)pponent

Tactical Mission (class): player’s tactical mission

(e.g., GK - Goal Keeper, LB - Left back)

x, y (decimal): player’s longitudinal (x) and lateral

(y) position on the pitch (in meters). The centre

of the pitch corresponds to coordinate (0,0)

Voronoi Area (decimal, m

): player’s Voronoi cell

area

Event (class): technical action performed (e.g., Pass,

Shot)

Distance (decimal): distance from the player hold-

ing the ball to the opponent’s goal (in meters)

Ball Zone (O, I): if the ball zone is in an (O)utside or

(I)nside area

Continuation (0, 1): the ball remains in the team’s

possession (0) or changes to the opponent (1)

Angle (decimal): angle to offensive goal

In addition to the information pertaining to each

event, other annotation information was also used, no-

tably the predominant tactical formation adopted by

each team (in Figure 1 a 3-5-2 for the focus team, red,

and 4-4-1-1 for the opponent team, blue).

Using these features two experiments were per-

formed. In both experiments all passes (10 332) made

in the ﬁrst half of the 13 games were considered. Of

these, only 1 405 (13.6%) were unsuccessful (this

unbalanced data represents a challenge for machine

learning techniques).

4 METHODS

This section presents the methods used in the experi-

ments: a decision tree to model the outcome of a pass;

how the importance of the different features for that

outcome is estimated; the use of Voronoi cells as fea-

tures; the use of cosine similarity to compare matches.

4.1 Decision Tree

Decision tree is a supervised learning method used in

classiﬁcation and regression tasks. The goal is to cre-

ate a model that predicts the value of a target variable,

the output, by learning simple decision rules inferred

from data, the features (Pedregosa et al., 2011). Since

the decision tree follows a supervised approach, the

algorithm is fed a collection of pre-processed data that

is used to train the algorithm.

In a decision tree, the top level is called the root,

the root gives rise to links to other elements called

nodes. A node that has no link is called an end node,

otherwise is a decision node (see Figure 3). Decision

nodes in the tree correspond to questions that are pre-

sented to the data (if a feature variable is larger or

smaller than a threshold value). Each edge of the tree

corresponds to an outcome of the question and lead

to another decision node or to an end node represent-

ing a class distribution (i.e., a value of the output).

This method is based on algorithms that divide the

initial data set into more homogeneous subsets which

in turn can be divided into even more homogeneous

subsets (de Ville, 2006). The decision tree algorithm

works through several aligned if-else statements in

which successive conditions are checked unless the

model reaches a conclusion on the output or a prede-

ﬁned depth of the tree is reached.the cases in the tree

concern passes made by players. The Gini impurity

metric indicates how well a tree splits the data.

4.2 Importance of Features

The chief assumption in the paper is that the impor-

tance of a feature (e.g., x coordinate) can be quantiﬁed

by its importance in the decision tree. Feature im-

portance is calculated as the decrease in node impu-

rity weighted by the probability of reaching that node

(the number of samples that reach the node, divided

by the total number of samples). The importance for

each feature in a decision tree is then calculated using

Eq. 1 (Stacey, 2018).

= w

− w

left( j)

− w

right( j)

(1)

the importance of node j

probability of reaching node j

the impurity value of node j

left( j) child node from left split on node j

right( j) child node from right split on node j

4.3 Voronoi Area

A Voronoi diagram is a partition of a plane into re-

gions close to each of a given set of objects. In the

icSPORTS 2022 - 10th International Conference on Sport Sciences Research and Technology Support

112

Figure 1: Players’ Voronoi diagram.

simplest case, these objects are points in the plane

(called seeds, sites, or generators). For each object

corresponds one region deﬁned by the points in the

plane that closer to that object than to any other.

In invasion sports, the spatial distribution of play-

ers on the ﬁeld is determined by the interaction behav-

ior established at both player and team levels (Fon-

seca et al., 2013) making Voronoi diagrams a useful

tool to analyze matches. In this context, Voronoi di-

agrams are computed using players’ position in the

pitch (see Figure 1). Voronoi diagrams may help

coaches to see how well the players use space, ﬁnd

new spaces in which to attack, and identify areas

in defense that the team leaves open. In this paper

player’s Voronoi cell area will be used as a feature

inﬂuencing the success of the pass.

4.4 Comparing Matches

Passes occur within a context: a match. As our

study involves different matches, it is thus important

to assess quantitatively how (di)similar two matches

are. Cosine similarity is a well know method that

can be used for this. In order to assess how similar

two matches are each match is described by a vec-

tor of attributes (say a and b) and the similarity value

sim

cos

(a, b) computed using Equation 2.

sim

cos

(a, b) =

a · b

kakkbk

∑

i=1

∑

i=1

∑

i=1

(2)

In this paper, three different types of attribute vec-

tors are used to characterise each match:

• Team formation: a match is described by the typ-

ical formation adopted by each team. The match

from Figure 1 is described by vector

[3, 5, 2, 0,

| {z }

Focus

4, 4, 1, 1]

| {z }

Opponent

• Tactical mission: a match is described by the tac-

tical mission (e.g., GK, ST) of the players in the

pitch

[GK, LB, LCB, . . . , ST, SS,

| {z }

Focus

GK, LB, LCB, . . . , ST, SS]

| {z }

Opponent

Value 1 is used if a player with that tactical mis-

sion is on the pitch, 0 otherwise. The match from

Figure 1 is described by vector

[1, 0, 0, . . . , 0, 0,

| {z }

Focus

1, 1, 0, . . . , 1, 1]

| {z }

Opponent

• Features importance: a match is described by the

importance value, σ

computed for feature i of the

decision tree.

[σ

, σ

, . . . , σ

]

| {z }

N f eatures

5 EXPERIMENTAL RESULTS

5.1 Match Similarity: Team Formation

and Tactical Mission

The similarity of pair of matches according to the

teams’ formation (tactical mission of the teams’ line-

up) was computed for all possible pairs of matches

as represented in Figure 2. As expected, similarity is

typically higher in teams’ formation than in tactical

mission of the teams’ line-up. Tactical mission of the

teams’ line-up similarity presents a higher variability.

For both criteria, matches 101 and 102 present a high

similarity between them but low similarity to all other

matches. Considering the more discriminatory crite-

ria based on players’ line-up tactical mission one ﬁnds

as the most similar the following pairs of matches:

(101, 102); (103 − 102); (104 − 109); (106 − 108);

(106 − 111) and (108 − 111).

5.2 Experiment with Features from All

Players

In order to pinpoint the more relevant features for pass

success a ﬁrst experiment was conducted using fea-

tures (x, y, and Voronoi Area) of all 22 players on

the ﬁeld (66 features in total) and as output value the

pass outcome (success or not, Continuation in Sec-

tion 3). Two decision trees were created, one with 6

levels (represented in Figure 3) another with 20 levels.

The Elusive Features of Success in Soccer Passes: A Machine Learning Perspective

113

(a) Teams’ Formation. (b) Teams’ Player Line-up Tactical Mission.

Figure 2: Cosine similarity between matches.

Figure 3: Decision Tree with 6 levels.

Table 1: 1

experiment cross validation results.

Tree Accuracy Precision Recall

6 levels 0.82 0.158 0.083

20 levels 0.79 0.235 0.500

Table 1 presents the cross validation assessment

values for the two trees.

Analyzing the results we can see that although the

decision tree with 6 levels has a higher accuracy the

other parameters are very low.

Using the method described in Section 4.2, the im-

portance of each feature in the 13 matches was com-

puted. Features were ordered according to their im-

portance, using maximum and average values across

matches in which it was present. Table 2 presents the

top 5 features on both criteria. Figure 4 shows the im-

portance of the 66 features on each match, ordered by

their average.

Figure 4: Features’ importance across matches (by avg.).

Table 2: Top 5 features.

Top features by max. Top features by avg.

Feature Value Feature Value Matches

f GK x 0.29 o CM2 area 0.11 2

o RCB x 0.27 f RM x 0.09 1

f RB x 0.24 f CAM x 0.06 4

o LCB x 0.23 f CAM y 0.06 4

o CAM y 0.21 f LM y 0.06 1

Analysing the ranking by average, in the ﬁrst

places appear features that do not appear in many

games (between 1 and 4). These values are also low,

which may indicate that there is no common feature

that stands out in all matches. Notably, none of the

different classes of features, x, y, and Voronoi Area

can be considered as dominant. On the other hand,

all top 5 features are associated to mid-ﬁeld tactical

missions (4 belonging to the focus team).

Considering the maximum value for the impor-

tance of features none of them has a very high value.

icSPORTS 2022 - 10th International Conference on Sport Sciences Research and Technology Support

114

This indicates that the importance of features in all

games is very disperse among the features and teams

(2 focus, 3 opponent). Nonetheless, the x feature class

stands out as well as defensive tactical missions.

Figure 4 conﬁrms this dispersion of importance

across the different features and matches. Disper-

sion across matches was investigated by computing

the similarity between the importance of features for

all matches pairs using the cosine similarity metric

(as in Section 4.4). The values represented in Fig-

ure 5 indicate a low similarity between all pairs of

matches. However, it is of note that most pairs pre-

senting higher similarity (e.g., 106 − 108, 101 − 102

and 108 − 111) correspond to matches that have also

high tactical mission similarity.

Figure 5: Features’ importance Cosine Similarity.

Albeit the interesting relation between features’

importance and players’ tactical mission, this experi-

ment does not identify a consistent set of features to

be considered as the most relevant across matches for

deciding pass outcomes. This was somehow to be ex-

pected as the features considered where not strongly

associated to the process under observation: passing.

5.3 Experiment with Ball Carrier and

Closest Players

In order to overcome the limitations of the previ-

ous experiment we used features of the player per-

forming the pass (ball carrier) combined with fea-

tures of the closest player from the same and oppo-

nent teams. In addition to the longitudinal and lat-

eral coordinates, features related to ”available” space

(Voronoi area) and ”support/pressure” (distance to

teammate/opponent), were considered:

p x(y) (decimal): longitudinal (lateral) position of

ball carrier.

p area (decimal): Voronoi cell area of ball carrier.

p dist (decimal): distance to opponent’s goal from

ball carrier.

f(o) sep (decimal): distance between ball carrier and

closest teammate/opponent (f/o).

f(o) area (decimal): Voronoi cell area of closest

teammate/opponent (f/o).

f(o) dist (decimal): distance to goal from closest

teammate/opponent (f/o).

Table 3: 2

experiment cross validation results.

Tree Accuracy Precision Recall

6 levels 0.82 0.1 0.15

20 levels 0.78 0.31 0.32

Analysing the Table 3 we can see that although the

6-level decision tree has a higher accuracy the other

parameters are lower.

Figure 6 shows the importance of each feature

sorted by increasing average value. The most impor-

tant feature is the distance between ball carrier and

closest opponent. This makes sense, as the opponent

is applying pressure, the difﬁculty of a successful pass

increases. Actually, all three opponent related fea-

tures are found in the Top 5, reinforcing the hypoth-

esis that the interaction with the closest opponent is

of chief importance on pass success. Conversely, fea-

tures concerning the teammate are amongst the least

important, especially distance to goal.

Figure 6: Ball carrier, Teammate and Opponent features’.

6 CONCLUSIONS AND FUTURE

RESEARCH

From this exploratory work the following main con-

clusion can be obtained: identifying and quantifying

the factors of passing success is in fact a difﬁcult task.

This is conﬁrmed by the fact that, albeit the high ac-

curacy, precision and recall scores are low in all ex-

periments. Additional, more detailed, conclusions are

The Elusive Features of Success in Soccer Passes: A Machine Learning Perspective

115

the following:

• The pass success/insuccess imbalance impairs the

assessment of the decision mechanism.

• The relative importance of features is somehow

related to the match teams’ formation and players’

tactical missions.

• Using more features does not guarantee an in-

crease on accuracy;

• Having a set of features that are more directly re-

lated to the process (passing) enabled a more con-

sistent ranking of features across matches.

• Interaction with the closest opponent appears to

be of key importance for pass success.

Concerning future work, we suggest:

• Explore techniques to mitigate data imbalance.

• Inquire other features related with the interaction

with opponents.

• Investigate why Voronoi areas are not as relevant

as expected. An hint is that the complete Voronoi

area may not be considered as “usefull”.

Albeit its limitations, notably low precision and

recall, the results of the paper may be useful to prac-

titioners. For example, they may help designing con-

strained pass practice tasks (e.g., with representative

distances to opponent and team mates).

ACKNOWLEDGEMENTS

Rui J. Lopes was partly supported by the Fundac¸

para a Ci

encia e Tecnologia, under Grant Num-

ber UIDB/50008/2020 attributed to Instituto de

Telecomunicac¸

oes. Ricardo Ribeiro was partly sup-

ported by national funds through Fundac¸

ao para

a Ci

encia e a Tecnologia (FCT) with reference

UIDB/50021/2020. The authors would like to thank

elson Caldeira for his valuable comments and sug-

gestions.

REFERENCES

Barra, S., Carta, S. M., Giuliani, A., Pisu, A., Podda,

A. S., and Riboni, D. (2021). Footapp: an ai-

powered system for football match annotation. CoRR,

abs/2103.02938.

Cordon, A., Garcia, A., Marquina Nieto, M., Calvo, J.,

Mon, D., and Rom

an, I. (2020). What is the rele-

vance in the passing action between the passer and the

receiver in soccer? Int. Journal of Environmental Re-

search and Public Health, 17.

de Ville, B. (2006). Decision Trees for Business Intelligence

and Data Mining: Using SAS Enterprise Miner. SAS

Institute.

Fonseca, S., Milho, J., Travassos, B., Araujo, D., and

Lopes, A. (2013). Measuring spatial interaction be-

havior in team sports using superimposed Voronoi di-

agrams. International Journal of Performance Analy-

sis in Sport, 13:179–189.

Jamil, M., Phatak, A., Mehta, S., Beato, M., Memmert,

D., and Connor, M. (2021). Using multiple machine

learning algorithms to classify elite and sub-elite goal-

keepers in professional men’s football. Scientiﬁc Re-

ports, 11(1):22703.

Ks, M. (2020). Applications of artiﬁcial intelligence in the

game of football: The global perspective. Journal of

Arts Science & Commerce, 11:18–29.

Mitchell, T. M. (1997). Does machine learning really work?

AI Magazine, 18(3):11.

Murphy, K. P. (2012). Machine learning: a probabilistic

perspective. MIT press.

Obschonka, M. and Audretsch, D. (2020). Artiﬁcial

Intelligence and Big Data in entrepreneurship: A

new era has begun. Small Business Economics,

55(3):529–539.

Pappalardo, L. et al. (2019). PlayeRank: Data-Driven Per-

formance Evaluation and Player Ranking in Soccer

via a Machine Learning Approach. ACM Trans. In-

tell. Syst. Technol., 10(5).

Pedregosa, F. et al. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

Rommers, N. et al. (2020). A Machine Learning Approach

to Assess Injury Risk in Elite Youth Football Players.

Medicine & Science in Sports & Exercise, 52:1745–

1751.

Stacey, R. (2018). Towards Data Science: The

Mathematics of Decision Trees, Random Forest

and Feature Importance in Scikit-learn and Spark.

https://tinyurl.com/3kca6xp8. Accessed: 2022-06-01.

Stimpson, A. J. and Cummings, M. L. (2014). Assessing in-

tervention timing in computer-based education using

machine learning algorithms. IEEE Access, 2:78–87.

Techopedia (2017). What does digital revolution mean?

tinyurl.com/4xkk4mkr. Accessed: 2021-12-22.

Zhou, Z.-H. (2017). A brief introduction to weakly super-

vised learning. National Science Review, 5(1):44–53.

icSPORTS 2022 - 10th International Conference on Sport Sciences Research and Technology Support

116