Reduction of Variance-related Error through Ensembling:
Deep Double Descent and Out-of-Distribution Generalization
Pavlos Rath-Manakidis, Hlynur Davíð Hlynsson and Laurenz Wiskott
Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
Keywords:
Deep Double Descent, Ensemble Model, Error Decomposition.
Abstract:
Prediction variance on unseen data harms the generalization performance of deep neural network classifiers.
We assess the utility of forming ensembles of deep neural networks in the context of double descent (DD) on
image classification tasks to mitigate the effects of model variance. To that end, we propose a method for using
geometric-mean based ensembling as an approximate bias-variance decomposition of a training procedure’s
test error. In ensembling equivalent models we observe that ensemble formation is more beneficial the more the
models are correlated with each other. Our results show that small models afford ensembles that outperform
single large models while requiring considerably fewer parameters and computational steps. We offer an
explanation for this phenomenon in terms of model-internal correlations. We also find that deep DD that
depends on the existence of label noise can be mitigated by using ensembles of models subject to identical
label noise almost as thoroughly as by ensembles of networks each trained subject to i.i.d. noise. In the context
of data drift, we find that out-of-distribution performance of ensembles can be assessed by their in-distribution
performance. This aids in ascertaining the utility of ensembling for generalization.
1 INTRODUCTION
In the field of deep learning the process of deriving
models is normally subject to multiple internal and
external sources of randomness and noise. Their in-
fluence is such that the resulting predictors, even if their performance scores on the test set are almost equal, realize measurably different mappings.
It is therefore beneficial to analyse such influences
and try to reduce their adverse effects. We consider
the process of ensemble formation (ensembling) as
a means to distinguish the sources of generalization
error in classification problems and to mitigate the
stochasticity-related error of the models. By the term
learning procedure we refer to the entirety of the
process of model initialization, data acquisition and
preparation, and derivation of model parameters.
We claim that the expected error rate of simple
models (i.e. models that are not ensembles) can be
decomposed into the generalization error of the un-
derlying learning procedure as such, which quantifies
the procedure’s inductive bias (Belkin et al., 2019),
and the variance-error, which quantifies the expected
underspecification-related error of its induced mod-
els. D’Amour et al. (2020) characterize an ML
pipeline as underspecified if it can yield multiple dis-
tinct predictors with similar (in-distribution) test set
performance. We demonstrate this for the single-label
classification datasets CIFAR-10 and CIFAR-100 by
approximating the above decomposition with geomet-
ric mean-based deep ensembles (Lakshminarayanan
et al., 2017) and present results that suggest that
the out-of-distribution accuracy improvement through
ensembling can be estimated through its effects on
test set performance.
In this paper we aim to help better understand the
difference between the over- and under-parameterized
learning regimes in which machine learning takes
place and to explore the potential of ensembling to
help us understand and overcome the high error rates
that can occur at the boundary of these two learn-
ing regimes. The transition between the two regimes
takes place when the model is just barely able to en-
code all training data. This is the case when the model
capacity marginally suffices to almost perfectly fit the
training data and when the model receives sufficient
computational resources (e.g. training epochs) to do
so. Because it marks the point where the model first becomes able to interpolate the training data, this transition is called the interpolation threshold. At around that
threshold, under certain conditions, models exhibit
markedly reduced generalization performance com-
pared to similar models that have either an increased
or even a decreased effective capacity to learn input
regularities. This error hike at the border of the learn-
ing regimes gives rise to the double descent (DD) in
the generalization error curve. In the case of deep
neural networks (cf. Nakkiran et al., 2019) DD can be
elicited in the absence of regularization through suf-
ficiently strong label noise. We choose to approach
this phenomenon using ensemble construction from
the perspective of a tentative error decomposition in
order to empirically approach the idea that any given
stochastic learning procedure (and thus ML pipeline
in general) can be seen as inducing a distribution over
mappings (i.e. in the form of models) whose statistical
properties can be studied.
2 RELATED WORK
Although first observed by Opper et al. (1990), ex-
plicit theoretical inquiry into DD started as late as 2018
with Belkin et al. (2019), who first coined the term.
Belkin et al. (2019) have demonstrated model-wise
DD in random Fourier Feature models, two layer neu-
ral networks and Random Forests. By initializing
neural networks with small initial weights and by re-
quiring minimum functional norm solutions for over-
specified two-layer feature models they have demon-
strated DD across models and datasets, although they
remained focused on models that are shallow by ma-
chine learning standards.
Most theoretical research around DD has up to
now dealt with wide, shallow one- or two-layer net-
works measuring network capacity based on layer
widths (most notably Mei and Montanari, 2019; Ad-
vani et al., 2020; d’Ascoli et al., 2020). In most stud-
ies the first layer is random and fixed and weights are
only learned for the linear second layer. Interpolation
often occurs exactly at the point where the number
of model parameters P equals the number of train-
ing samples N. A common feature of these works is
that they deploy asymptotic analysis with P and N di-
verging to infinity and constant P/N by using random
matrix theory to obtain expected values for the gen-
eralization error using randomly distributed artificial
training data.
Deep DD refers to DD on highly non-linear net-
works with a large number of hidden layers. Up to
now, there are, to our knowledge, no analytical tech-
niques for demarcating the possible generality of this
phenomenon. A promising direction is the study of
Gaussian processes (Rasmussen, 2003) as they have
been shown to behave like deep networks (Lee et al.,
2017) in the limit of infinite layer width and to be
mathematically well-characterized to the point that
closed-form descriptions of the distribution of model
predictions can be derived. A related promising field
is kernel learning, because it is integral in the for-
mal characterization of the aforementioned Gaussian
processes w.r.t. their generalization error (Jacot et al.,
2018) and because DD (see Belkin et al., 2018) occurs
in this setting too.
Nakkiran et al. (2019) have demonstrated DD
on image classification tasks using preactivation
ResNet18 models and simpler stacked convolu-
tional architectures on the CIFAR-10 and CIFAR-100
datasets. Yang et al. (2020) have expanded on the
study of deep DD by decomposing the test loss into its
bias and variance terms and have observed that under
some settings the loss variance first monotonically in-
creases until around the test error DD hike and mono-
tonically decreases thereafter.
The paper by Geiger et al. (2020) is the work most
closely related to this paper. The authors consider
variance between classifiers derived from the same
learning procedure. The learning scenarios they con-
sider lack label noise and they attribute the variance
to the variability in model parameter initialization.
Among other setups, they consider linear and sim-
ple convolutional models on MNIST and CIFAR-10.
They also conjecture that for over-parameterized net-
works the second descent is related to the weakening
of fluctuations in the test error as the model is iter-
ated. Geiger et al. (2020) also consider the trade-off
between ensembling and using larger networks. We, on the other hand, consider scenarios with more realistic and larger neural networks where label noise is added to the training data, deal with multiple dimensions of model complexity (training duration and model size), and compare the generalization performance of ensembles under distribution shift.
Adlam and Pennington (2020a) demonstrate in a
realistic setting (and Chen et al. (2020) under more
artificial conditions) that DD can be part of a more
complicated test error curve resulting from an elab-
orate interplay between the specifics of the dataset
structure and a learning procedure family’s disposi-
tions to capture it. Adlam and Pennington (2020a), in
particular, find that a test error hike can appear when
the number of parameters equals the training set size
and, additionally, when it is equal to about the square
of the number of training data points. In another work
Adlam and Pennington (2020b) also demonstrate that
learning procedure variance can and should be de-
composed further in order to make better sense of DD.
3 METHODS
The test error approximately quantifies the deficiency
of a model’s ability to capture the generalization dis-
tribution of the data. Yet, a learning procedure’s de-
ficiency need not be accurately reflected in the de-
ficiency of any one particular model generated by
it. Central tendencies of the distribution of the map-
pings of the models the learning procedure induces
may capture the structure of the data better than any
one particular model. The error of a stochastic learn-
ing procedure defined on some data distribution can
therefore be thought of as the structure in the data
which cannot be inferred using the procedure itself
without additional knowledge about the problem do-
main, even when combining the outputs of infinitely
many models. We choose ensembling as an approxi-
mation for the bias-variance decomposition of the test
error for classification tasks. In particular, we find that
the properties of geometric-mean based ensembling
(following Yang et al., 2020) allow for a theoretically
motivated combination of individual predictions that,
in a meaningful and domain-agnostic way, utilizes the
degree of certainty of accepting or rejecting the dif-
ferent categorical outcomes of each predictor.
For a given x drawn from the generalization distribution, with corresponding correct classification y and single-predictor output f, the cross-entropy loss of learning procedure T decomposes as:
$$
\mathbb{E}_{f \sim T}\, H(y, f) \;=\; \underbrace{H(y)}_{\text{Irred. Error}} \;+\; \underbrace{D_{\mathrm{KL}}\!\left(y \,\|\, \bar{f}\right)}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}_{f \sim T}\, D_{\mathrm{KL}}\!\left(\bar{f} \,\|\, f\right)}_{\text{Variance}} \tag{1}
$$

where $\bar{f}$ is precisely the geometric mean-based mixture of estimators defined as:

$$
\bar{f} \;\propto\; \exp\!\left( \mathbb{E}_{f \sim T} \log f \right) \tag{2}
$$
This mapping is approximated by the geometric-mean
ensemble of i.i.d. models. By analogy, we consider the difference between the mean error rate of the constituent models and the error rate of the ensemble to be indicative of the distribution-specific underspecification of
the ML pipeline realized by the learning procedure.
This is a different quantity from the systematic error,
which is manifested in the bias of the central tendency
of the distribution of predictors.
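To make the approximation concrete, the following NumPy sketch (our own illustration, not the experimental code of this study) forms the geometric-mean ensemble of several sub-models' softmax outputs and reports the difference between the mean sub-model error rate and the ensemble error rate, which is the quantity we read as variance-related, underspecification-induced error; the function names and toy inputs are illustrative.

```python
import numpy as np

def geometric_mean_ensemble(probs):
    """Normalized geometric mean of per-model class probabilities,
    cf. Eq. (2); probs has shape (models, samples, classes)."""
    log_p = np.log(np.clip(probs, 1e-5, None))  # small floor avoids log(0), cf. Section 3
    mixed = np.exp(log_p.mean(axis=0))
    return mixed / mixed.sum(axis=-1, keepdims=True)

def variance_error_estimate(probs, labels):
    """Mean sub-model error rate minus ensemble error rate, read here as an
    estimate of the variance-related (underspecification) error."""
    sub_errors = [np.mean(p.argmax(-1) != labels) for p in probs]
    ens_error = np.mean(geometric_mean_ensemble(probs).argmax(-1) != labels)
    return np.mean(sub_errors) - ens_error

# Toy usage with random outputs of 4 hypothetical sub-models on 1000 samples.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
print(variance_error_estimate(probs, labels))
```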
The exact definition of the learning procedure un-
der label noise depends on whether the introduction
of noise is considered to be a part of the learning
procedure. If it is, the approximation of the loss decomposition has to be carried out across sub-learners each trained under an independently drawn noise profile; if it is not, it has to be conducted using sets of sub-learners, each set trained under one fixed noise profile (the profiles themselves drawn i.i.d. across sets). We propose a
method (see Figure 1) for re-using one set of sub-
models for both configurations. In this way we re-
duce the overall number of models we need to train.
Furthermore, by combining the same models in differ-
ent ways, the difference between the same-noise con-
dition and the cross-noise condition is less affected
by model variability and mainly results from the en-
semble construction strategy. We combine the sub-
models for the cross-noise profile setting following
the cyclical pattern depicted in Figure 1. We do so in
order to minimize the effect of individual noise pro-
files on the average of measures over the ensembles.
If we choose the number of distinct noise-profiles to
be equal to the number of trials per noise-profile and
sufficiently large, then we obtain an empirical fine-
grained decomposition of the variance-error based on
the different sources of randomness, i.e. data noise
and model-derivation randomness.
Figure 1: Configuration Design for Ensembling Strategy
Comparison. Cross-noise profile ensembles are constructed
following the horizontal arrows and same-noise profile en-
sembles following the vertical arrows.
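The grouping of Figure 1 can be sketched as follows: with a square grid of models indexed by noise profile and trial, a same-noise ensemble pools the trials of one noise profile, while a cross-noise ensemble picks one model per noise profile along a cyclically shifted diagonal so that every trained model is reused exactly once. The exact shift pattern and the function names below are our illustrative assumptions.

```python
def same_noise_ensembles(model_grid):
    """model_grid[p][t]: model of trial t trained under noise profile p.
    Each same-noise ensemble pools all trials of one noise profile."""
    return [list(trials) for trials in model_grid]

def cross_noise_ensembles(model_grid):
    """Each cross-noise ensemble takes one model per noise profile, shifting
    the trial index cyclically so that every model is reused exactly once."""
    n_profiles, n_trials = len(model_grid), len(model_grid[0])
    assert n_profiles == n_trials, "the cyclic pattern assumes a square grid"
    return [[model_grid[p][(p + shift) % n_trials] for p in range(n_profiles)]
            for shift in range(n_trials)]

# Placeholder identifiers stand in for trained networks.
grid = [[f"noise{p}_trial{t}" for t in range(4)] for p in range(4)]
print(same_noise_ensembles(grid)[0])   # four models sharing noise profile 0
print(cross_noise_ensembles(grid)[1])  # one model from each noise profile
```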
The effective complexity of deep learning mod-
els depends on multiple parameters. Accordingly, we
visualize the DD on the test error along the model
width (model-wise DD) and the training epoch count
(epoch-wise DD) in Figure 2. The model width is a
factor in the number of convolutional channels used in
the hidden layers of the VGG11 (Simonyan and Zis-
serman, 2014) and ResNet18 (He et al., 2016) visual
classification models we consider in this paper. The
ensembles considered in Figure 2 are cross-noise pro-
file ensembles.
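For orientation, the sketch below shows what such a width parameter typically controls, using the [k, 2k, 4k, 8k] channel convention of Nakkiran et al. (2019) for ResNet18 as an assumed example; the exact scaling used in our experiments is documented in Rath-Manakidis (2021).

```python
def resnet18_group_channels(width):
    """Channel counts of the four ResNet18 block groups for a width parameter,
    assuming the [k, 2k, 4k, 8k] convention of Nakkiran et al. (2019);
    the unscaled network corresponds to width 64."""
    return [width, 2 * width, 4 * width, 8 * width]

# Width values that appear in our experiments (cf. Table 1).
for w in (6, 10, 12, 26, 64):
    print(w, resnet18_group_channels(w))
```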
We compare different ensembling strategies and
ensemble mixing functions as well as comparable
non-ensemble architectures in order to be able to dis-
entangle the effects of design components on general-
ization and on DD mitigation. We compare:
1. Single sub-learners trained with and without label
noise.
2. Same-noise profile ensembles: Ensembles over
models all trained under the same label noise pro-
file. We use the following mixing techniques for
these ensembles:
(a) Learnable fully-connected bias-free linear en-
sembling with Softmax normalization. The
outputs of the sub-learners are fed through a
linear layer whose parameters are learned dur-
ing an additional training process. This addi-
tional training is subject to the same noise pro-
file that the sub-learners were subjected to.
(b) Arithmetic mean-based ensembling: The outputs for each category over all learners are mixed according to $f_e = \frac{1}{N}\sum_{n=1}^{N} f_n$.
(c) Geometric mean-based ensembling: The outputs for each category over all learners are mixed according to $f_e \propto \sqrt[N]{\prod_{n=1}^{N} f_n}$.
(d) Plurality vote-based ensembling: The category that is the first choice of the largest number of sub-learners is selected by the ensemble. (A code sketch of mixing rules (b), (c) and (d) follows this list.)
3. Cross-noise profile ensembles: Ensembles over
models each trained under a different noise pro-
file by using mixing techniques (b), (c) and (d).
4. Mono-Ensembles: Geometric mean-based ensemble architectures trained as single models under the same training configurations as the sub-learners. (During training, the sub-network outputs passed on to the ensembling layer were saturated to be greater than or equal to $10^{-5}$; this prevents not-a-number errors.)
5. Large Networks: Networks of the same architec-
ture type as the sub-learners, but with a parameter
count as close as possible to multiples of the pa-
rameter count of the sub-learners and all training
configurations left equal.
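For concreteness, the static mixing rules (b), (c) and (d) can be written as in the NumPy sketch below; this is an illustration rather than our training code, and the small probability floor mirrors the saturation described for setup 4.

```python
import numpy as np

def arithmetic_mean_mix(probs):
    """(b) Arithmetic mean of the sub-learner class probabilities;
    probs has shape (models, samples, classes)."""
    return probs.mean(axis=0)

def geometric_mean_mix(probs, floor=1e-5):
    """(c) Normalized geometric mean; the floor avoids log(0)."""
    g = np.exp(np.log(np.clip(probs, floor, None)).mean(axis=0))
    return g / g.sum(axis=-1, keepdims=True)

def plurality_vote_mix(probs):
    """(d) Each sub-learner votes for its top class; returns the class index
    with the most votes per sample (ties broken by lowest class index)."""
    votes = probs.argmax(axis=-1)                 # shape (models, samples)
    n_classes = probs.shape[-1]
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)
```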
We are also interested in the implications of the
underspecification-related performance penalty out-
of-sample. To that end, we also study the effect of
approximating the central tendency of learning pro-
cedures through ensembling on alternative datasets
for the same classification problem. In particular,
we test small ResNet18 classifiers trained on CIFAR-10, with widths that lie close to (width 12) or slightly beyond (width 26) the interpolation threshold, on CIFAR-10.1 (Recht et al., 2018), CIFAR-10.2
(Lu et al., 2020), and the set of images contained
in CINIC-10 (Darlow et al., 2018) that were adapted
from ImageNet (Russakovsky et al., 2015). CIFAR-
10.1 was designed to present almost no distribution
shift w.r.t. standard CIFAR-10, whereas both CIFAR-
10.2 and CINIC-10 were constructed to realize differ-
ent data distributions, e.g. regarding low-level image
statistics and the sub-categories (i.e. synonym sets)
used for populating the classes.
Additional information on model training and experimental design, along with further results, is available in Rath-Manakidis (2021).
4 RESULTS
In this section we present the results of our experi-
ments and describe their implications.
4.1 Findings on Double Descent
We observed no DD without label noise on CIFAR-10. With almost no explicit regularization and without label noise, we obtained no epoch-wise DD on CIFAR-10 for either ResNet18 or VGG11 models. It is therefore not possible in this case to empirically separate distinct learning regimes, because the test error exhibits a globally monotonic trend.
Under most configurations tested, we obtained epoch-wise DD on the test set only w.r.t. the error and not the loss (see Plots 2b and 2c). This is surprising given that it is the loss that the networks were trained to minimize, and that the loss is merely intended as a back-propagation-compatible proxy for the error (e.g. the loss criterion is such that even correctly classified training inputs keep contributing to the optimization signal to the network parameters), whereas in most other settings (e.g. in regression) the measure used in optimization itself exhibits DD on the test set. The test error is the metric we ultimately intend to minimize, yet the network is never confronted with it directly. For example, in Plot 2c we see that the loss does not display epoch-wise DD for this configuration and that it generally remains high when compared to the values of its first descent. Nevertheless, the error performs a combined epoch-wise and model-wise DD. The loss does perform a weak model-wise DD for large epoch values, but it does not again reach the values at the bottom of the first valley of its trajectory.
The valley in Plot 2b for the test error DD curve
coincides with the valley of the noise-free train set
error in 2a. This implies that the test error ascent is partly due to learning to fit label noise and is not a phenomenon tied strictly to generalization.
We find that ensembling of deep networks, es-
pecially if realized through a geometric mean-based
approach to model output mixing, significantly im-
proves model performance (see Figures 3 and 4). We
attribute this to the capacity of the geometric-mean
ensemble to differentiate between strong and weak
preference for or against each category by the sub-
models: if some models weakly favor one category
Figure 2: Test error of ResNet18 on CIFAR-10. For each model-width and epoch combination rendered, we ensemble 4 models with i.i.d. 10% label noise. Panels, plotted over model width and training epoch: (a) error on the noise-free training set, (b) test error, (c) test loss, (d) ensemble test error, (e) test error difference, (f) relative test error reduction (%), (g) output correlation, (h) accuracy increase (test error difference) against average correlation, colored by epoch.
but a smaller number of models strongly prefer some
other category, the latter option may still be preferred
by the ensemble.
We find that those models that fall into the region
of the interpolation threshold of the DD phenomenon
profit most strongly from ensembling (see Plot 2e
here and Figure 28 in Nakkiran et al., 2019). They do
so in absolute and not in relative terms (cf. Plot 2f) in
accordance with the interpretation that the benefits of
ensembling hinge on the variability between the mod-
els. The observed performance improvement, at least
for the case where it depends on the existence of la-
bel noise, can be realized by using same-noise pro-
file ensembles almost as thoroughly as by cross-noise
profile ensembles (see e.g. Figure 5). It also turns out
that the generalization performance benefit of using
cross-noise profile ensembles over same-noise profile
ensembles tends to be statistically significant but quite
small in magnitude. This means that around the DD
hike most error variance of the learning procedure is
accounted for by the network weight initialization and
training-time randomness and only a small fraction
of the variance can be additionally explained away in
terms of the variation in the training noise profile. For
ML deployment, this implies that artifacts in the train-
ing data that enter during data acquisition (i.e. that
give rise to errors that constitute a noise profile) re-
sult, to a limited extent, in artifacts that cancel each
other out in the learned function mappings.
We proceed to present some observations from the
model type comparison in Figures 3 and 4. The ge-
ometric mean-based ensembler consistently outper-
forms all other ensembling methods, both with same-
noise profile and cross-noise profile sub-learners for
all ensemble sizes tested. We presume this is because
geometric mean-based ensembling utilizes the avail-
able information better than the other two static ap-
proaches we tested: arithmetic mean-based ensem-
bling fails to sufficiently reduce the impact of the
votes of models that weakly prefer some output cat-
Figure 3: Test error of model trials based on VGG11, plotted against the number of sub-networks per ensemble. Panels: (a) CIFAR-10, sub-model width 9; (b) CIFAR-100, sub-model width 11. Continuous lines represent same-noise profile ensembles and dashed lines cross-noise profile ensembles. Noise strength is 10% for CIFAR-10 and 5% for CIFAR-100. The setups are explained in Section 3.
Figure 4: Test error of model trials based on ResNet18, plotted against the number of sub-networks per ensemble. Panels: (a) CIFAR-10, sub-model width 10; (b) CIFAR-100, sub-model width 11. Continuous lines represent same-noise profile ensembles and dashed lines cross-noise profile ensembles. Noise strength is 10%.
Figure 5: Test error of ResNet18 on CIFAR-10, model width 10, plotted against the number of sub-networks per ensemble. Continuous lines represent same-noise profile ensembles and dashed lines cross-noise profile ensembles. The colors code for noise strength (0%, 10%, 20%, 30%).
egory over the others and plurality vote-based en-
sembling abstracts from any information regarding
weak preference or strong rejection of categories. The additional training of ensembles that use a fully-connected linear layer as their ensembling layer is less capable of facilitating generalization. For the mono-ensemble architecture, error back-propagation is conducted based on the error of the ensemble prediction rather than on the errors of the predictions of the constituent modules. This architecture performs mostly worse than architecturally identical ensembles and better than monolithic architectures with an equivalent parameter count.
With our experimental results we support the
suggestion made by Geiger et al. (2020) that “. . . , given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond N*, and averaging their outputs”, under slightly more realistic conditions, whereby N* refers to the interpolation threshold model parameterization. In Table 1 we show that a given computational envelope can be utilized more effectively if, for a given type of model architecture, we train many same-noise profile networks around the interpolation threshold in parallel and combine them for deployment, instead of training a single large network. In this way we require fewer computational
steps for training and fewer parameters in total.
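As a quick check of this budget argument, the snippet below combines the parameter counts reported in Table 1; the counts are the measured values from the table, only the arithmetic is added here.

```python
# Parameter counts from Table 1 (ResNet18 on CIFAR-10).
width10_submodel_params = 1_649_880        # interpolation-threshold sub-model
full_width_single_net_params = 11_173_962  # single network of width 64
ensemble_size = 6

ensemble_params = ensemble_size * width10_submodel_params
print(ensemble_params)                                 # 9,899,280 parameters
print(ensemble_params < full_width_single_net_params)  # True: fewer parameters overall,
# while Table 1 reports a lower test error for the ensemble (0.1162 vs. 0.1307).
```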
Table 1: Relative performance of ensembles and full-width networks for ResNet18 on CIFAR-10. The models were trained on 10% label noise and the ensembles are same-noise ensembles of 6 sub-learners. Models with widths 10 and 12 are interpolation threshold models. The mean performance of the ensembles of the models of width 10 and 12 is statistically significantly different from that of the full-width networks, judging by the standard error of the mean.

              Width   #Parameters   Test Error (Mean ± SEM)
Sub-model         6        597696          0.1304 ± 0.00082
Sub-model        10       1649880          0.1162 ± 0.00084
Sub-model        12       2372100          0.1123 ± 0.00048
Sub-model        26      11088696          0.0919 ± 0.00058
Single net       64      11173962          0.1307 ± 0.00180
4.2 Interpretation of Relative Model
Performance
In those experiments where we have estimated the av-
erage performance of geometric mean-based ensem-
bles, mono-ensembles and large networks (Figures 3
and 4), we see the general pattern that large networks perform worse than mono-ensembles, which in turn tend to perform worse than geometric mean ensembles of the same-noise profile type (with the exception of the experiment reported in Figure 3b, where mono-ensembles outperform all regular ensembles). We conjecture
that this trend is due to internal correlations between
the components (e.g. layers and channels) in some of
the models. Table 1 shows that ensembles of models
at the interpolation threshold can perform well while
using relatively little memory. We attribute this to
the good accuracy of their mean prediction and to the
low degree of output correlation between these mod-
els (see Figures 2g and 2e) which permits extensive
variance reduction through model averaging. We plot
the accuracy increase due to ensembling against the
strength of the correlations between the models con-
stituting the ensemble in Plot 2h.
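The correlation measure used in Plots 2g and 2h is detailed in Rath-Manakidis (2021); as a simple illustrative variant, one can take the mean pairwise Pearson correlation of the flattened softmax outputs, as sketched below, and pair it with the error difference computed in the sketch of Section 3 to obtain the two axes of Plot 2h.

```python
import numpy as np

def mean_pairwise_output_correlation(probs):
    """Average Pearson correlation between the flattened softmax outputs of all
    model pairs; probs has shape (models, samples, classes)."""
    flat = probs.reshape(probs.shape[0], -1)
    corr = np.corrcoef(flat)                       # models x models matrix
    return corr[np.triu_indices_from(corr, k=1)].mean()
```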
Geometric mean-based ensembles tend to perform
better than comparable mono-ensembles (see e.g. Fig-
ure 3a). The latter differ from the former in that there
is a shared last layer and that they are being trained
through back-propagation on the loss of their overall
predictions. Additionally, mono-ensembles face less diverse data shuffling and augmentation during training: the input to each sub-module in the mono-ensemble was identical for each training batch and was not subject to independent selection and augmentation, as was the case for the regular ensembles; this was necessary to reliably train all parameters using one loss signal.
Because of the two above-mentioned factors, dur-
ing training there are identical influences on the
sub-modules and the loss signal each sub-module is
trained on is not specific to the individual module’s
divergence from the ground truth. This, in turn, may
introduce correlations between the sub-modules, in
the sense that it couples their training trajectories and
causes the model as a whole to learn a less diverse and
differentiated set of hidden features.
The worst performance in our comparison among
models with an almost identical parameter count was
that of large single networks. This can be attributed
to the fact that the intermediary results of the compu-
tations in these models are used more often than in
the other models. This leads to internal correlations
between the parameters within the hidden layers of
these models during back-propagation: In the large
networks the number of parameters corresponding to
each computational element (e.g. convolution chan-
nel) is larger than in the ensemble architectures and
the internal state of the networks is more compact,
i.e. fewer variables are used for storing intermediate
results during forward-propagation. This impairs iso-
lating relevant structure in the hidden features.
4.3 In-Distribution and
Out-of-Distribution Generalization
For a given architecture, e.g. ResNet18 models of
width 12 (Figure 6) or width 26 (Figure 7), the loss
of the ensembles correlates linearly between differ-
ent test sets as the number of sub-networks in the en-
sembles changes (cf. Miller et al., 2021). Likewise,
a linear correlation also exists between the accuracy
values on the different test sets. We also observe,
as expected, that with increasing ensemble size the
generalization performance increases and converges
toward a limit. Simultaneously, the variance in the
performance between distinct ensembles diminishes.
We observe these trends for all transfer test sets we
experimented on, regardless of the extent of distribu-
tion shift. This implies that the in-distribution bene-
fit of ensembling predicts its out-of-distribution utility
well. The increase in test-set performance by forming
small ensembles can be used to estimate whether or
not training further sub-models in order to form even
larger ensembles is a worthwhile strategy for repre-
senting the meaningful structure of the problem at
hand.
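The way we use this relation can be sketched as follows: for ensembles of increasing size we record accuracy on the in-distribution test set and on a shifted test set, fit a line to the paired values, and use the fit to judge how much further out-of-distribution accuracy can be expected to improve. The accuracy arrays below are placeholders, not our measurements.

```python
import numpy as np

# Placeholder accuracies for ensemble sizes 1..7 (in-distribution, shifted set).
in_dist_acc = np.array([0.880, 0.900, 0.910, 0.915, 0.918, 0.920, 0.921])
ood_acc     = np.array([0.780, 0.810, 0.820, 0.826, 0.830, 0.832, 0.833])

# Linear relation between in- and out-of-distribution accuracy across
# ensemble sizes (cf. Miller et al., 2021).
slope, intercept = np.polyfit(in_dist_acc, ood_acc, deg=1)

# Projected OOD accuracy if a larger ensemble saturates at 0.923 in-distribution.
print(slope * 0.923 + intercept)
```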
Figure 6: ResNet18 on CIFAR-10, model width 12. Panels: (a) loss, (b) accuracy. Blue points represent single models and red points ensembles of 7 models. Intermediary colors represent ensembles of sizes 2 to 6. We do not use label noise.
5 DISCUSSION
In this paper we present further examples of the
DD generalization curve of the error on classifica-
tion problems. Although we demonstrate a non-monotonic trend of the error for a range of previously untested conditions and also approximate a bias-variance decomposition of the error, the question of whether this is the same phenomenon that has been analysed in shallow models (e.g. Belkin et al., 2019; Advani et al., 2020; Mei and Montanari, 2019) remains open, because of the way loss and error relate and because the trend of the error often contradicts that of its corresponding loss. This divergence is especially strong at and beyond the interpolation threshold, and it raises the question of whether one could use alternative learning objectives to reliably mitigate the DD hike
(cf. Ishida et al., 2020). We illustrate this divergence
in Section 4.1 where we present results for deep learn-
ing settings for which certain types of DD occur only
for the error and not for the loss.
In Table 1 we present the finding that small mod-
els that individually are impaired by DD can afford
ensembles that outperform single large models while
requiring considerably fewer overall parameters and
computational steps. This has practical implications
because it shows that allowing for more intricate
models with more complicated hidden features is not
always the best approach to augment model perfor-
mance if the model design suffers from underspecifi-
cation. In particular, with a sufficiently general and
reliable ensemble definition (geometric-mean ensem-
bling) and with model and training procedure defi-
nitions that contain impactful sources of randomness
we demonstrate how multiple predictors can be com-
bined through ensembling to produce a considerably
better model. This is the case even if we do not com-
bine multiple learner types or strategies and without
access to noise-free training data. The performance
improvement through same-noise profile ensembling
was on par with the improvement gained when also
redrawing the noise on the training data (see Figure
5).
In Section 4.2 we propose an interpretation of the
model architecture comparison in terms of coupled
Figure 7: ResNet18 on CIFAR-10, model width 26. Panels: (a) loss, (b) accuracy. Same legend as for Figure 6.
training dynamics and ensuing model correlation that
we hope will be scrutinized in future research. Fu-
ture research is also needed to verify the degree of
validity of the notion of error decomposition for clas-
sification problems and to mathematically ground the
notion of a distribution of mappings and its evolution
in the course of training.
With our out-of-distribution generalization exper-
iments in Section 4.3 we show that the general trend
of in-distribution generalization of increasingly large
ensembles applies also to out-of-distribution settings
and that the beneficial effects of ensembling general-
ize beyond the training distribution in a regular man-
ner.
REFERENCES
Adlam, B. and Pennington, J. (2020a). The neural tangent
kernel in high dimensions: Triple descent and a multi-
scale theory of generalization. In International Con-
ference on Machine Learning, pages 74–84. PMLR.
Adlam, B. and Pennington, J. (2020b). Understanding dou-
ble descent requires a fine-grained bias-variance de-
composition.
Advani, M. S., Saxe, A. M., and Sompolinsky, H. (2020).
High-dimensional dynamics of generalization error in
neural networks. Neural Networks, 132:428 – 446.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Recon-
ciling modern machine-learning practice and the clas-
sical bias–variance trade-off. Proceedings of the Na-
tional Academy of Sciences, 116(32):15849–15854.
Belkin, M., Ma, S., and Mandal, S. (2018). To understand
deep learning we need to understand kernel learn-
ing. In International Conference on Machine Learn-
ing, pages 541–549. PMLR.
Chen, L., Min, Y., Belkin, M., and Karbasi, A. (2020). Mul-
tiple descent: Design your own generalization curve.
arXiv preprint arXiv:2008.01036.
D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Ali-
panahi, B., Beutel, A., Chen, C., Deaton, J., Eisen-
stein, J., Hoffman, M. D., Hormozdiari, F., Houlsby,
N., Hou, S., Jerfel, G., Karthikesalingam, A., Lucic,
M., Ma, Y., McLean, C., Mincu, D., Mitani, A., Mon-
tanari, A., Nado, Z., Natarajan, V., Nielson, C., Os-
borne, T. F., Raman, R., Ramasamy, K., Sayres, R.,
Schrouff, J., Seneviratne, M., Sequeira, S., Suresh,
H., Veitch, V., Vladymyrov, M., Wang, X., Webster,
K., Yadlowsky, S., Yun, T., Zhai, X., and Sculley,
D. (2020). Underspecification presents challenges
for credibility in modern machine learning. arXiv
preprint arXiv:2011.03395.
Darlow, L. N., Crowley, E. J., Antoniou, A., and Storkey,
A. J. (2018). CINIC-10 is not ImageNet or CIFAR-
10. arXiv preprint arXiv:1810.03505.
d’Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F.
(2020). Double trouble in double descent: Bias and
variance (s) in the lazy regime. In International
Conference on Machine Learning, pages 2280–2290.
PMLR.
Geiger, M., Jacot, A., Spigler, S., Gabriel, F., Sagun, L.,
d’Ascoli, S., Biroli, G., Hongler, C., and Wyart, M.
(2020). Scaling description of generalization with
number of parameters in deep learning. Journal
of Statistical Mechanics: Theory and Experiment,
2020(2):023401.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Ishida, T., Yamane, I., Sakai, T., Niu, G., and Sugiyama, M.
(2020). Do we need zero training loss after achieving
zero training error? In International Conference on
Machine Learning, pages 4604–4614. PMLR.
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tan-
gent kernel: convergence and generalization in neu-
ral networks. In Proceedings of the 32nd Interna-
tional Conference on Neural Information Processing
Systems, pages 8580–8589.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017).
Simple and scalable predictive uncertainty estimation
using deep ensembles. Advances in Neural Informa-
tion Processing Systems, pages 6402–6413.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Penning-
ton, J., and Sohl-Dickstein, J. (2017). Deep neu-
ral networks as gaussian processes. arXiv preprint
arXiv:1711.00165.
Lu, S., Nott, B., Olson, A., Todeschini, A., Vahabi, H., Car-
mon, Y., and Schmidt, L. (2020). Harder or different?
a closer look at distribution shift in dataset reproduc-
tion. In ICML Workshop on Uncertainty and Robust-
ness in Deep Learning.
Mei, S. and Montanari, A. (2019). The generalization er-
ror of random features regression: Precise asymp-
totics and double descent curve. arXiv preprint
arXiv:1908.05355.
Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S.,
Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and
Schmidt, L. (2021). Accuracy on the line: on the
strong correlation between out-of-distribution and in-
distribution generalization. In International Confer-
ence on Machine Learning, pages 7721–7735. PMLR.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B.,
and Sutskever, I. (2019). Deep double descent: Where
bigger models and more data hurt. arXiv preprint
arXiv:1912.02292.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. (1990).
On the ability of the optimal perceptron to gener-
alise. Journal of Physics A: Mathematical and Gen-
eral, 23(11):L581.
Rasmussen, C. E. (2003). Gaussian processes in machine
learning. In Summer school on machine learning,
pages 63–71. Springer.
Rath-Manakidis, P. (2021). Interaction of ensembling and
double descent in deep neural networks. Master’s the-
sis, Cognitive Science, Ruhr University Bochum, Ger-
many.
Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2018).
Do CIFAR-10 classifiers generalize to CIFAR-10?
arXiv preprint arXiv:1806.00451.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). Imagenet large scale visual
recognition challenge. International journal of com-
puter vision, 115(3):211–252.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. (2020).
Rethinking bias-variance trade-off for generalization
of neural networks. In International Conference on
Machine Learning, pages 10767–10777. PMLR.