Granularity. Pruning granularity is usually divided into two groups. In unstructured pruning, weights are removed individually, without any intent to preserve structure in the weight matrices (LeCun et al., 1989; Hassibi et al., 1993). This method can reach high sparsity levels, but the resulting matrices are difficult to speed up due to the lack of regularity in the pruning patterns. For this reason, structured pruning, i.e. removing groups of weights, has been introduced (He et al., 2017). Such groups can be vectors of weights, or even kernels or entire filters when pruning convolutional architectures.
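As an illustration of the two granularities, the following sketch (not taken from the paper, assuming a PyTorch convolutional layer) contrasts an element-wise mask with a filter-wise mask at 50% sparsity:

```python
# Illustrative sketch: element-wise vs. filter-wise pruning masks on a
# convolutional layer. The layer and the 50% sparsity target are assumptions.
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3)
w = conv.weight.data  # shape: (32, 16, 3, 3)

# Unstructured pruning: zero out the individual weights of smallest magnitude.
k = int(0.5 * w.numel())
threshold = w.abs().flatten().kthvalue(k).values
unstructured_mask = (w.abs() > threshold).float()

# Structured pruning: zero out the entire filters of smallest l1 norm.
filter_norms = w.abs().sum(dim=(1, 2, 3))   # one l1 norm per output filter
kept = filter_norms.topk(int(0.5 * w.shape[0])).indices
structured_mask = torch.zeros_like(w)
structured_mask[kept] = 1.0

conv.weight.data = w * unstructured_mask    # or: w * structured_mask
```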
Criteria. Early methods proposed to use a second-order approximation of the loss surface to find the least useful parameters (LeCun et al., 1989; Hassibi et al., 1993). Later work also explored the use of variational dropout (Molchanov et al., 2017a) or l0 regularization for parameter removal (Louizos et al., 2018). However, it has recently been shown that, even though those heuristics may provide good results under particular conditions, they are more complex and less generalizable than simply computing the l1 norm of the weights, using that value as a measure of their importance, and removing the ones with the lowest norm (Gale et al., 2019). Moreover, the importance of weights can be compared locally, i.e. only among weights belonging to the same layer, which yields a pruned model with the same sparsity in each layer. It can also be compared globally, i.e. across the weights of the whole model, resulting in different sparsity levels for each layer. Comparing the weights globally usually provides better results but may be more expensive to compute as the model grows larger.
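The following minimal sketch (assuming a small fully connected model, not the paper's network) makes the local vs. global distinction concrete for l1-magnitude pruning:

```python
# Minimal sketch of local vs. global l1-magnitude pruning on an assumed MLP.
import torch
import torch.nn as nn

def prune_local(model, sparsity):
    # Weights are compared only within their own layer: every layer ends up
    # with the same sparsity level.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            w = m.weight.data
            thr = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
            m.weight.data = w * (w.abs() > thr).float()

def prune_global(model, sparsity):
    # Weights are compared across the whole model: per-layer sparsity may
    # differ, but the overall parameter budget is respected.
    all_w = torch.cat([m.weight.data.abs().flatten()
                       for m in model.modules() if isinstance(m, nn.Linear)])
    thr = all_w.kthvalue(int(sparsity * all_w.numel())).values
    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.weight.data *= (m.weight.data.abs() > thr).float()

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
prune_global(model, sparsity=0.8)
```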
Scheduling. There exist many ways to schedule network pruning. Early methods proposed to remove weights of a trained network in a single step, the so-called one-shot pruning (Li et al., 2017). Such a strategy typically requires further fine-tuning of the pruned model to recover the lost performance. Performing the pruning in several steps, i.e. iterative pruning, provides better results and reaches higher sparsity levels (Han et al., 2015; Molchanov et al., 2017b), but such methods are usually time-consuming because they alternate many iterations of pruning and fine-tuning (Li et al., 2017). Recently, another family of schedules has emerged that intertwines pruning more closely with the training process, making it possible to obtain a pruned network in a more reasonable time (Zhu and Gupta, 2017; Hubens et al., 2021).
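As an example of this last family, the cubic sparsity ramp of Zhu and Gupta (2017) increases the target sparsity gradually during training instead of applying it in one shot; the default values below are illustrative:

```python
# Sketch of the gradual (cubic) sparsity schedule of Zhu and Gupta (2017):
# sparsity grows from s_i to s_f between steps t0 and t0 + n*dt while the
# network keeps training. Default arguments are assumptions, not paper values.
def gradual_sparsity(t, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1):
    if t < t0:
        return s_i
    if t >= t0 + n * dt:
        return s_f
    progress = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

# e.g. re-prune the network to gradual_sparsity(step) every dt training steps.
```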
3 EXPERIMENTS
The experimental setup consists in applying unstructured global pruning to the model proposed by (Zhang et al., 2018), with a hidden layer size h_size = 512 and 8 experts. As explained in Section 2.1, it is composed of two fully connected subnetworks: the gating network, which extracts the coefficients ωi, and the motion prediction network, whose weights are the result of the MoE process. The chosen pruning schedule is one-cycle pruning (Hubens et al., 2021) with the l1 norm of the weights as criterion (Gale et al., 2019). We use the same training data as (Zhang et al., 2018), composed of motion capture recordings of dog locomotion with various gaits.
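For readers who do not use FasterAI, the same global unstructured l1 pruning step can be sketched with PyTorch's built-in pruning utility; the module selection below is a placeholder for the two fully connected subnetworks, and the one-cycle schedule itself is not reproduced here:

```python
# Illustrative alternative to the FasterAI setup: global unstructured pruning
# with the l1 criterion using PyTorch's pruning utility.
import torch.nn as nn
import torch.nn.utils.prune as prune

def apply_global_l1_pruning(model, sparsity):
    # Placeholder selection: prune all fully connected layers of the model.
    params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(params,
                              pruning_method=prune.L1Unstructured,
                              amount=sparsity)

# With one-cycle pruning, the target `sparsity` would itself follow the
# schedule of Hubens et al. (2021) over the course of training.
```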
First of all, we increase the network sparsity step by step from 10% to 90% of the total amount of parameters to extract the relationship between network sparsity and motion quality. To quantitatively assess the performance of each model, we measure the foot skating artifact on the generated motion. Foot skating occurs when the character's foot slides while the corresponding joint of the skeleton is considered in contact with the ground, which in practice is the case when the foot joint height h is below a height threshold H. This artifact degrades motion naturalness and is induced by the mean regression occurring during training, as explained by (Zhang et al., 2018). We use the equation from (Zhang et al., 2018) to quantify the foot skating s, where v stands for the horizontal velocity of the foot. We fix the threshold H at 2.5 cm.
s = v(2 − 2^(h/H))    (2)
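A direct transcription of Equation (2) is given below; the unit of the velocity v is an assumption, and the contact condition h < H follows the definition above:

```python
# Per-frame foot skating s from Equation (2), given the foot's horizontal
# velocity v (unit assumed consistent with h) and height h, with H = 2.5 cm.
def foot_skating(v, h, H=2.5):
    if h >= H:
        return 0.0   # foot not considered in contact with the ground
    return v * (2.0 - 2.0 ** (h / H))
```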
Then, we compare the quality of the generated motion between dense (unpruned) and sparse (pruned) models of the same size. Next, the qualitative contribution of each expert to the generated movement is established through an ablation study, in order to visualize possible differences between their roles once the initial model is pruned. Finally, the dynamic behavior of the gating network output vector ω is extracted and subjectively compared between the dense and sparse networks.
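One simple way to extract ω during generation is a forward hook on the gating subnetwork; the sketch below is a hedged illustration where `model.gating` and `generate_motion` are placeholder names, not identifiers from the paper's code:

```python
# Hedged sketch: record the gating output ω for every generated frame.
import torch

def collect_gating_trajectory(model, generate_motion):
    # `model.gating` and `generate_motion` are placeholders; the hook stores
    # ω each time the gating subnetwork runs during motion generation.
    trajectory = []
    hook = model.gating.register_forward_hook(
        lambda module, inputs, output: trajectory.append(output.detach().cpu()))
    generate_motion(model)
    hook.remove()
    return torch.stack(trajectory)  # one recorded ω per generated frame
```

The resulting trajectories for the dense and sparse models can then be plotted side by side for the subjective comparison described above.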
For these experiments, we use the Fasterai framework (Hubens, 2020), built on top of Fastai (Howard et al., 2018), to implement the pruning methods. We train for 150 epochs on an Nvidia GTX 1080 GPU with a batch size of 32, a learning rate η = 1e-4, and a weight decay rate λ = 2.5e-3, using the AdamW algorithm with warm restarts (AdamWR) (Loshchilov and Hutter, 2018) and the same parameters as (Zhang et al., 2018).
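The optimization setup can be sketched as follows with PyTorch's AdamW and cosine-annealing warm restarts; `model`, `dataloader`, `training_step`, and the restart period T_0 are placeholders or assumptions, not values reported here:

```python
# Sketch of the training loop described above (assumptions noted in comments).
import torch

def train(model, dataloader, training_step, epochs=150):
    # `training_step` is a placeholder returning the loss for one batch (size 32).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=2.5e-3)
    # Warm restarts (AdamWR); T_0 is an assumed restart period.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
    for epoch in range(epochs):
        for batch in dataloader:
            loss = training_step(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```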