Safe Screening for Logistic Regression with $\ell_0$-$\ell_2$ Regularization

Anna Deza and Alper Atamtürk
Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA, U.S.A.
ORCID: https://orcid.org/0000-0002-4849-683X (Deza), https://orcid.org/0000-0003-1220-808X (Atamtürk)
Keywords:
Screening Rules, Sparse Logistic Regression.
Abstract: In logistic regression, it is often desirable to utilize regularization to promote sparse solutions, particularly for problems with a large number of features compared to available labels. In this paper, we present screening rules that safely remove features from logistic regression with $\ell_0$-$\ell_2$ regularization before solving the problem. The proposed safe screening rules are based on lower bounds from the Fenchel dual of strong conic relaxations of the logistic regression problem. Numerical experiments with real and synthetic data suggest that a high percentage of the features can be effectively and safely removed a priori, leading to substantial speed-up in the computations.
1 INTRODUCTION
Logistic regression is a classification model used to predict the probability of a binary outcome from a set of features. Its use is prevalent in a large variety of domains, from diagnostics in healthcare (Gramfort et al., 2013; Shevade and Keerthi, 2003; Cawley and Talbot, 2006) to sentiment analysis in natural language processing (Wang and Park, 2017; Yen et al., 2011) and consumer choice modeling in economics (Kuswanto et al., 2015).
Given a data matrix $A \in \mathbb{R}^{m \times n}$ of $m$ observations, each with $n$ features, and binary labels $y \in \{-1,1\}^m$, the logistic regression model seeks regression coefficients $x \in \mathbb{R}^n$ that minimize the convex loss function
\[
L(x) = \frac{1}{m} \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i A_i x)\bigr).
\]
We use $A_i$ to denote the $i$-th row of matrix $A$ and $A^j$ to denote the $j$-th column of $A$. When the number of available features is large compared to the number of observations (labels), i.e., $m \ll n$, logistic regression models are prone to overfitting. Such cases require pruning the features to mitigate the risk of overfitting. Regularization is a natural approach for this purpose. Convex $\ell_2$-regularization (ridge) (Hoerl and Kennard, 1970) imposes bias by shrinking the regression coefficients $x_i$, $i \in [n]$, toward zero. The $\ell_1$-regularization (lasso) (Tibshirani, 1996) and $\ell_1$-$\ell_2$-regularization (elastic net) (Zou and Hastie, 2005) perform shrinkage of the coefficients
and selection of the features simultaneously. Recently, there has been a growing interest in utilizing the exact $\ell_0$-regularization (Miller, 2002; Bertsimas et al., 2016) for selecting features in linear regression. Although $\ell_0$-regularization introduces non-convexity to regression models, significant progress has recently been made in developing strong models and specialized algorithms for solving medium- to large-scale instances (e.g., Bertsimas and Van Parys, 2017; Atamtürk and Gómez, 2019; Hazimeh and Mazumder, 2020; Han et al., 2020).
We consider logistic regression with $\ell_0$-$\ell_2$ regularization:
\[
\min_{x \in \mathbb{R}^n} \; L(x) + \frac{1}{\gamma}\|x\|_2^2 + \mu \|x\|_0, \qquad \text{(REG)}
\]
and
\[
\min_{x \in \mathbb{R}^n} \; L(x) + \frac{1}{\gamma}\|x\|_2^2 \quad \text{s.t.} \quad \|x\|_0 \le k. \qquad \text{(CARD)}
\]
Whereas the $\ell_2$-regularization penalty term above encourages shrinking the coefficients, which helps counter the effects of noise present in the data matrix $A$, the $\ell_0$-regularization penalty term in (REG) encourages sparsity, selecting a small number of key features to be used for prediction; this is modeled as an explicit cardinality constraint in (CARD). Due to the $\ell_0$-regularization terms, (REG) and (CARD) are non-convex optimization problems.
Screening rules refer to preprocessing procedures that discard certain features, leading to a reduction in the dimension of the problem, which, in turn, improves the solution times of the employed algorithms. For $\ell_1$-regularized linear regression, El Ghaoui et al. (2010) introduce safe screening rules that are guaranteed to remove only features that are not selected in the solution. Strong rules (Tibshirani, 2011), on the other
hand, are heuristics with no guarantee, but are able to prune a large number of features fast. A large body of work exists on screening rules for $\ell_1$-regularized regression (Wang et al., 2013; Liu et al., 2014; Fercoq et al., 2015; Ndiaye et al., 2017; Dantas et al., 2021), including some for logistic regression (Wang et al., 2014). However, little attention has been given to the $\ell_0$-regularized regression problem, where dimension reduction by screening rules can have a substantially larger impact due to the higher computational burden of solving the non-convex regression problems. Bounds from strong conic relaxations of $\ell_0$-regularized problems (Atamtürk et al., 2021; Atamtürk and Gómez, 2019) substantially reduce the computational burden with effective pruning strategies. Recently, Atamtürk and Gómez (2020) proposed safe screening rules for the $\ell_0$-regularized linear regression problem from perspective relaxations. To the best of our knowledge, no screening rule exists in the literature for the logistic regression problems (REG) and (CARD) with $\ell_0$-$\ell_2$ regularization, studied in this paper.
Outline. In Section 2, we give strong conic mixed 0-1 formulations for the logistic regression problems (REG) and (CARD) with $\ell_0$-$\ell_2$ regularization. In Section 3, we derive the safe screening rules for them based on bounds from the Fenchel duals of their conic relaxations, and in Section 4, we summarize the computational experiments performed to test the effectiveness of the proposed screening rules for $\ell_0$-$\ell_2$ logistic regression problems with synthetic as well as real data. Finally, we conclude with a few final remarks in Section 5.
2 CONIC REFORMULATIONS
In this section, we present strong conic formulations for (REG) and (CARD). First, we state convex logistic regression loss minimization as a conic optimization problem. Writing the epigraph of the softplus function, $\log(1 + \exp(w)) \le s$, equivalently as a bound on the sum of two exponential functions, $\exp(w - s) + \exp(-s) \le 1$, it follows that the logistic regression loss minimization problem can be formulated as an exponential cone optimization problem
\[
\begin{aligned}
\min_{x \in \mathbb{R}^n} L(x) \;=\; \min_{x,s,u,v}\;\; & \frac{1}{m}\sum_{i=1}^{m} s_i\\
\text{s.t.}\;\; & u_i \ge \exp(-y_i A_i x - s_i), \quad i \in [m]\\
& v_i \ge \exp(-s_i), \quad i \in [m]\\
& u_i + v_i \le 1, \quad i \in [m]\\
& x \in \mathbb{R}^n,\; s,u,v \in \mathbb{R}^m,
\end{aligned}
\]
which is readily solvable by modern conic optimization solvers.
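For illustration, this exponential-cone model of the softplus epigraph can be assembled with a conic modeling layer. The sketch below uses CVXPY (with MOSEK or any exponential-cone-capable solver behind it); it is our own minimal rendering of the formulation above, not the authors' implementation, and all names are ours.

```python
import cvxpy as cp

def logistic_loss_conic(A, y):
    """Minimal sketch: minimize L(x) via the exponential-cone model above.
    A: (m, n) data matrix; y: (m,) labels in {-1, +1}."""
    m, n = A.shape
    x, s = cp.Variable(n), cp.Variable(m)
    u, v = cp.Variable(m), cp.Variable(m)
    margins = cp.multiply(y, A @ x)                # y_i * A_i x
    constraints = [u >= cp.exp(-margins - s),      # u_i >= exp(-y_i A_i x - s_i)
                   v >= cp.exp(-s),                # v_i >= exp(-s_i)
                   u + v <= 1]
    prob = cp.Problem(cp.Minimize(cp.sum(s) / m), constraints)
    prob.solve()                                   # e.g., solver=cp.MOSEK if installed
    return x.value, prob.value
```

CVXPY's built-in `cp.logistic` atom models the same softplus epigraph directly; we use it in the relaxation sketch given later.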
Introducing binary indicator variables $z \in \{0,1\}^n$ to model the $\ell_0$-regularization terms, (REG) can be formulated as a mixed-integer conic optimization problem:
\[
\begin{aligned}
\eta_R = \min\;\; & L(x) + \frac{1}{\gamma}\sum_{j=1}^{n}\frac{x_j^2}{z_j} + \mu\sum_{j=1}^{n} z_j && \text{(2a)}\\
\text{s.t.}\;\; & x_j(1 - z_j) = 0, \quad j \in [n] && \text{(2b)}\\
& x \in \mathbb{R}^n,\; z \in \{0,1\}^n. && \text{(2c)}
\end{aligned}
\]
Here, we adopt the convention $x_j^2/z_j = 0$ if $z_j = 0,\, x_j = 0$, and $x_j^2/z_j = \infty$ if $z_j = 0,\, x_j \ne 0$. Constraint (2b) ensures that $x_j = 0$ whenever $z_j = 0$. This constraint can be linearized using the "big-M" technique by replacing it with $-M z_j \le x_j \le M z_j$, where $M$ is a large enough positive scalar. However, such big-M constraints lead to very weak convex relaxations, as we show in the computational experiments in Section 4.
Instead, we use the conic formulation of the perspective function of $x_j^2$ to model them more effectively. Replacing $x_j^2$ in the objective with its perspective function $x_j^2/z_j$ significantly strengthens the convex relaxation when $0 < z_j < 1$, and, introducing $t_j \ge 0$, the perspective can be stated as a rotated second-order cone constraint $x_j^2 \le z_j t_j$ (Aktürk et al., 2009). Dropping the complementarity constraints (2b) as well as the integrality constraints on $z$, we arrive at the respective conic (convex) relaxation for (REG):
\[
\begin{aligned}
\eta_{CR} = \min\;\; & L(x) + \frac{1}{\gamma}\sum_{j=1}^{n} t_j + \mu\sum_{j=1}^{n} z_j && \text{(3a)}\\
\text{s.t.}\;\; & x_j^2 \le z_j t_j, \quad j \in [n] && \text{(3b)}\\
& x \in \mathbb{R}^n,\; t \in \mathbb{R}_+^n,\; z \in [0,1]^n. && \text{(3c)}
\end{aligned}
\]
Note that constraint (3b) is valid for $z \in \{0,1\}^n$: $z_j = 0$ implies $x_j = 0$, and $z_j = 1$ simply implies $x_j^2 \le t_j$, $j \in [n]$.
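To illustrate how relaxation (3) can be put together in practice, the sketch below again uses CVXPY (our choice, not the authors' code): `cp.logistic` models the softplus loss, and `cp.quad_over_lin` imposes the perspective constraints $x_j^2/z_j \le t_j$, which are equivalent to the rotated cones (3b). The analogous relaxation of (CARD), given below as (5), is obtained by dropping the $\mu$-term and adding the constraint `cp.sum(z) <= k`.

```python
import cvxpy as cp

def reg_conic_relaxation(A, y, gamma, mu):
    """Sketch of the convex relaxation (3) of (REG); names are ours."""
    m, n = A.shape
    x = cp.Variable(n)
    t = cp.Variable(n, nonneg=True)
    z = cp.Variable(n)
    loss = cp.sum(cp.logistic(-cp.multiply(y, A @ x))) / m            # L(x)
    persp = [cp.quad_over_lin(x[j], z[j]) <= t[j] for j in range(n)]  # (3b)
    prob = cp.Problem(cp.Minimize(loss + cp.sum(t) / gamma + mu * cp.sum(z)),  # (3a)
                      persp + [z >= 0, z <= 1])                       # (3c)
    prob.solve()                                                      # e.g., solver=cp.MOSEK
    return x.value, z.value, prob.value                               # prob.value = eta_CR
```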
Similarly, one can write the cardinality-constrained version (CARD) as a mixed-integer non-linear model with the perspective reformulation:
\[
\begin{aligned}
\eta_C = \min\;\; & L(x) + \frac{1}{\gamma}\sum_{j=1}^{n}\frac{x_j^2}{z_j} && \text{(4a)}\\
\text{s.t.}\;\; & \sum_{j=1}^{n} z_j \le k && \text{(4b)}\\
& x_j(1 - z_j) = 0, \quad j \in [n] && \text{(4c)}\\
& x \in \mathbb{R}^n,\; z \in \{0,1\}^n. && \text{(4d)}
\end{aligned}
\]
Dropping (4c) and the integrality constraints, and stating the perspectives as rotated cone constraints, we arrive at the conic relaxation for (CARD):
\[
\begin{aligned}
\eta_{CC} = \min\;\; & L(x) + \frac{1}{\gamma}\sum_{j=1}^{n} t_j && \text{(5a)}\\
\text{s.t.}\;\; & \sum_{j=1}^{n} z_j \le k && \text{(5b)}\\
& x_j^2 \le z_j t_j, \quad j \in [n] && \text{(5c)}\\
& x \in \mathbb{R}^n,\; t \in \mathbb{R}_+^n,\; z \in [0,1]^n. && \text{(5d)}
\end{aligned}
\]
3 SAFE SCREENING RULES
In this section, we first present the safe screening rules for logistic regression with $\ell_0$-$\ell_2$ regularization and then discuss their derivation.
Proposition 1 (Safe Screening Rule for Regularized Logistic Regression (REG)). Let $x^*$ be an optimal solution to (3) with objective value $\eta_{CR}$, let $\alpha_i = y_i / \bigl(1 + \exp(y_i A_i x^*)\bigr)$, $i \in [m]$, $\delta_j = \frac{1}{4}\bigl(\alpha^\top A^j\bigr)^2$, $j \in [n]$, and let $\bar{\eta}_R$ be an upper bound on $\eta_R$. Then any optimal solution to (2) satisfies
\[
z_j = \begin{cases} 0, & \text{if } \eta_{CR} + \mu - \gamma\delta_j > \bar{\eta}_R\\ 1, & \text{if } \eta_{CR} - \mu + \gamma\delta_j > \bar{\eta}_R. \end{cases}
\]
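As a concrete illustration, Proposition 1 can be applied with a few lines of NumPy once an optimal solution `x_star` of (3), its objective value `eta_CR`, and an upper bound `eta_bar_R` are available; the sketch and its names are ours.

```python
import numpy as np

def screen_reg(A, y, x_star, eta_CR, eta_bar_R, gamma, mu):
    """Proposition 1: indices of features safely fixed to zero / to one."""
    alpha = y / (1.0 + np.exp(y * (A @ x_star)))    # alpha_i = y_i / (1 + exp(y_i A_i x*))
    delta = 0.25 * (A.T @ alpha) ** 2               # delta_j = (alpha^T A^j)^2 / 4
    fix_zero = np.where(eta_CR + mu - gamma * delta > eta_bar_R)[0]
    fix_one = np.where(eta_CR - mu + gamma * delta > eta_bar_R)[0]
    return fix_zero, fix_one
```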
Proposition 2 (Safe Screening Rule for Cardinality-constrained Logistic Regression (CARD)). Let $x^*$ be an optimal solution to (5) with objective value $\eta_{CC}$, $\alpha_i = y_i / \bigl(1 + \exp(y_i A_i x^*)\bigr)$, $i \in [m]$, $\delta_j = \frac{1}{4}\bigl(\alpha^\top A^j\bigr)^2$, $j \in [n]$, let $\delta_{[k]}$ denote the $k$-th largest value of $\delta$, and let $\bar{\eta}_C$ be an upper bound on $\eta_C$. Then any optimal solution to (4) satisfies
\[
z_j = \begin{cases} 0, & \text{if } \delta_j \le \delta_{[k+1]} \text{ and } \eta_{CC} - \gamma(\delta_j - \delta_{[k]}) > \bar{\eta}_C\\ 1, & \text{if } \delta_j \ge \delta_{[k]} \text{ and } \eta_{CC} + \gamma(\delta_j - \delta_{[k+1]}) > \bar{\eta}_C. \end{cases}
\]
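Similarly, a NumPy sketch of Proposition 2 (again with hypothetical names), given an optimal solution `x_star` of (5) with objective value `eta_CC` and an upper bound `eta_bar_C`:

```python
import numpy as np

def screen_card(A, y, x_star, eta_CC, eta_bar_C, gamma, k):
    """Proposition 2: indices of features safely fixed to zero / to one."""
    alpha = y / (1.0 + np.exp(y * (A @ x_star)))
    delta = 0.25 * (A.T @ alpha) ** 2
    ordered = np.sort(delta)[::-1]                  # delta_[1] >= delta_[2] >= ...
    d_k, d_k1 = ordered[k - 1], ordered[k]          # k-th and (k+1)-th largest values
    fix_zero = np.where((delta <= d_k1) & (eta_CC - gamma * (delta - d_k) > eta_bar_C))[0]
    fix_one = np.where((delta >= d_k) & (eta_CC + gamma * (delta - d_k1) > eta_bar_C))[0]
    return fix_zero, fix_one
```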
3.1 Derivation of Proposition 1

In this section, we present the derivation of the screening rule for (REG) via Fenchel duality. Similar to Atamtürk and Gómez (2020), we utilize the dual of the perspective terms. In particular, for $p, q \in \mathbb{R}$, consider the convex conjugate $h^*(p,q)$ of the perspective function $h(x,z) = x^2/z$:
\[
h^*(p,q) = \max_{x,z}\; px + qz - \frac{x^2}{z}. \qquad \text{(6)}
\]
By Fenchel's inequality, we have $px + qz - h^*(p,q) \le \frac{x^2}{z}$. Therefore, for any $p, q \in \mathbb{R}^n$, we can replace the perspective terms in the objective of (3) to derive a lower bound on $\eta_{CR}$. Then, the Fenchel dual of (3) is obtained by maximizing the lower bound:
\[
\max_{p,q}\; \min_{x,\, z \in [0,1]^n}\; L(x) + \mu\sum_{j=1}^{n} z_j + \frac{1}{\gamma}\Bigl(p^\top x + q^\top z - \sum_{j=1}^{n} h^*(p_j, q_j)\Bigr). \qquad \text{(7)}
\]
Observing that $px + qz - \frac{x^2}{z}$ is concave in $x$ and $z$ allows one to obtain a closed-form solution for (6). Indeed, by simply setting the partial derivatives to zero, we obtain
\[
h^*(p,q) = \begin{cases} 0, & q \le -p^2/4\\ \infty, & \text{otherwise.} \end{cases}
\]
Then, replacing $q_j$ with $-p_j^2/4$ and using the closed-form solution for $h^*$, we obtain from (7) the simplified form of the Fenchel dual:
\[
\eta_{FR} = \max_{p}\; \min_{x,\, z \in [0,1]^n}\; L(x) + \sum_{j=1}^{n}\Bigl(\mu z_j + \frac{p_j}{\gamma}x_j - \frac{p_j^2}{4\gamma}z_j\Bigr). \qquad \text{(8)}
\]
Note that (8) is concave in $p$. Taking the derivative of (8) with respect to $p_j$, we obtain the optimal $p_j^* = 2x_j/z_j$, $j \in [n]$. Plugging $p^*$ into (8), we see that it is equivalent to (3), implying that the dual is tight, i.e., $\eta_{CR} = \eta_{FR}$.
For the inner minimization problem, taking the derivative with respect to $z_j$, we find the optimality conditions
\[
z_j = \begin{cases} 0, & \mu - \frac{p_j^2}{4\gamma} > 0\\ 1, & \mu - \frac{p_j^2}{4\gamma} < 0. \end{cases}
\]
If $\mu - \frac{p_j^2}{4\gamma} = 0$, then $z_j \in [0,1]$. On the other hand, taking the derivative with respect to $x_j$, we derive the following optimality condition:
\[
\frac{p_j}{\gamma} = \sum_{i=1}^{m}\frac{y_i A_{ij}}{1 + \exp(y_i A_i x)}.
\]
Let $x^*$ be the optimal solution and, for $i \in [m]$, define
\[
\alpha_i := y_i / \bigl(1 + \exp(y_i A_i x^*)\bigr), \quad i \in [m],
\]
and
\[
\delta_j := \frac{1}{4}\bigl(\alpha^\top A^j\bigr)^2, \quad j \in [n].
\]
Then, $p^* = \gamma A^\top \alpha$. Furthermore,
\[
\mu - \frac{(p_j^*)^2}{4\gamma} = \mu - \frac{\gamma\bigl(\alpha^\top A^j\bigr)^2}{4} = \mu - \gamma\delta_j.
\]
Using this closed-form solution, we can obtain $p^*$ for (8) from the optimal solution of (3) via $\alpha$, which in turn can be used to recover $z_j$, $j \in [n]$.
Proof of Proposition 1. Suppose $\mu - \gamma\delta_j > 0$. Then $z_j = 0$ in (8), and furthermore $\eta_{CR} - (\mu - \gamma\delta_j) < \bar{\eta}_R$. Suppose we add the constraint $z_j = 1$ to (8), and let the optimal objective value of this problem be $\eta_{FR}(z_j = 1)$. Since $\eta_{FR} + \mu - \gamma\delta_j \le \eta_{FR}(z_j = 1)$, if $\eta_{FR} + \mu - \gamma\delta_j > \bar{\eta}_R$, then there exists no feasible solution of (3) with $z_j = 1$ that has a lower objective value than $\bar{\eta}_R$. But this implies that no optimal solution of (2) has $z_j = 1$, and thus it must be that $z_j = 0$.

The same argument is used for the case $\mu - \gamma\delta_j < 0$, in which $z_j = 1$ in an optimal solution to (8). Since $\eta_{FR} - (\mu - \gamma\delta_j) = \eta_{FR} - \mu + \gamma\delta_j \le \eta_{FR}(z_j = 0)$, if $\eta_{FR} - \mu + \gamma\delta_j > \bar{\eta}_R$, then any optimal solution of (2) must have $z_j = 1$.
3.2 Derivation of Proposition 2

Using steps similar to those in Section 3.1, we derive the Fenchel dual of (5):
\[
\eta_{FC} = \max_{p}\; \min_{x,\, z \in [0,1]^n}\; L(x) + \frac{1}{\gamma}\sum_{j=1}^{n}\Bigl(p_j x_j - \frac{p_j^2}{4}z_j\Bigr) \quad \text{s.t.}\;\; \sum_{j=1}^{n} z_j \le k. \qquad \text{(9)}
\]
Similarly, it can be shown that $p_j^* = 2x_j/z_j$, $j \in [n]$, and thus there is no duality gap and $\eta_{CC} = \eta_{FC}$. Again, taking the derivative, we see that the optimal solution of the inner minimization problem in (9) has $z_j = 1$ for the $k$ most negative values of $-\frac{p_j^2}{4\gamma}$, which simply translates to setting $z_j = 1$ for the $k$ largest values of $\frac{p_j^2}{4\gamma}$, with the rest of the indicator variables equal to zero. If there is no tie between the $k$-th and $(k+1)$-th largest values, then there is a unique optimal solution to (9), which is integer in $z$ and is therefore the unique optimal solution to (4). Again, we can recover $p^* = \gamma A^\top \alpha$ and find that $\frac{(p_j^*)^2}{4\gamma} = \gamma\delta_j$.
Proof of Proposition 2. Suppose $\delta_j \le \delta_{[k+1]}$. Then $z_j = 0$ in an optimal solution of (9). Adding the constraint $z_j = 1$, one obtains a solution in which the $(k-1)$ indicators with the largest values of $\delta$ are set to one, as well as $z_j$, implying that the indicator corresponding to $\delta_{[k]}$ is set to zero by the cardinality constraint. But since $\eta_{FC} - \gamma\delta_j + \gamma\delta_{[k]} \le \eta_{FC}(z_j = 1)$, there exists no optimal solution of (4) with $z_j = 1$ if $\eta_{FC} - \gamma(\delta_j - \delta_{[k]}) > \bar{\eta}_C$.

By the same argument, if $\delta_j \ge \delta_{[k]}$, then $z_j = 1$ in an optimal solution of (9). Adding the constraint $z_j = 0$, we obtain a solution that sets the indicator corresponding to $\delta_{[k+1]}$, the next largest value of $\delta$, to one. Therefore, $\eta_{FC} + \gamma\delta_j - \gamma\delta_{[k+1]} \le \eta_{FC}(z_j = 0)$, and thus there exists no optimal solution of (4) with $z_j = 0$ if $\eta_{FC} + \gamma(\delta_j - \delta_{[k+1]}) > \bar{\eta}_C$.
4 COMPUTATIONAL RESULTS
In this section, we present the computational experiments performed to test the effectiveness of the safe screening rules described in Section 3 for the $\ell_0$-$\ell_2$ regularized and cardinality-constrained logistic regression problems. We test the proposed screening methods on synthetic datasets as well as on real datasets.
4.1 Experimental Setup
The real data instances of varying sizes are obtained from the UCI Machine Learning Repository (Dua et al., 2017) as well as genomics data from the Gene Expression Omnibus Database (Edgar et al., 2002). Synthetic datasets are generated using the methodology described in Dedieu et al. (2021). Given a number of features $n$ and a number of observations $m$, we generate a data matrix $A \sim N_n(0,\Sigma)$ and a sparse binary vector $\tilde{x}$, representing the "true" features, which has $k$ equi-spaced entries equal to one and the remaining entries equal to zero. For each observation $i \in [m]$, we generate a binary label $y_i$, where $\Pr(y_i = 1 \mid A_i) = \bigl(1 + \exp(-s A_i \tilde{x})\bigr)^{-1}$. The covariance matrix $\Sigma$ controls the correlations between features, and $s$ can be viewed as the signal-to-noise ratio. For each experimental setting, we generate ten random instances and report the average of the results over these ten instances for experiments with synthetic data.
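For reference, a sketch of this generation procedure (our own rendering of the setup above; the function and parameter names are hypothetical):

```python
import numpy as np

def generate_instance(m, n, k, s, Sigma=None, rng=None):
    """Synthetic data in the style of Dedieu et al. (2021)."""
    rng = np.random.default_rng() if rng is None else rng
    Sigma = np.eye(n) if Sigma is None else Sigma
    A = rng.multivariate_normal(np.zeros(n), Sigma, size=m)      # rows A_i ~ N(0, Sigma)
    x_tilde = np.zeros(n)
    x_tilde[np.linspace(0, n - 1, k, dtype=int)] = 1.0           # k equi-spaced "true" features
    prob_pos = 1.0 / (1.0 + np.exp(-s * (A @ x_tilde)))          # Pr(y_i = 1 | A_i)
    y = np.where(rng.random(m) < prob_pos, 1.0, -1.0)            # labels in {-1, +1}
    return A, y, x_tilde
```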
We compare the performance of solving (REG) and (CARD) using the MOSEK (MOSEK ApS, 2021) mixed-integer conic branch-and-bound algorithm with and without screening. For consistency across runs, we fix the solver options as follows: the branching strategy is set to the pseudocost method, node selection is set to the best-bound method, and presolve and heuristics, which add random factors to the experiments, are turned off. Upper bounds used in the screening rules are obtained by simply rounding the conic relaxation solution to a nearest feasible integer solution.
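One simple realization of such a rounding step is sketched below for (REG): round the relaxation indicators to the nearest integer, zero out the corresponding coefficients, and evaluate the (REG) objective at the resulting feasible point. This is our own illustration; the authors' exact rounding may differ (for (CARD), one would instead keep the $k$ largest indicators).

```python
import numpy as np

def softplus_loss(A, y, x):
    """L(x) = (1/m) * sum_i log(1 + exp(-y_i A_i x)), computed stably."""
    return np.mean(np.logaddexp(0.0, -y * (A @ x)))

def rounded_upper_bound_reg(A, y, x_rel, z_rel, gamma, mu):
    """Round the relaxation solution (x_rel, z_rel) of (3) to a feasible point of (2)
    and evaluate the (REG) objective, yielding an upper bound eta_bar_R."""
    z = (z_rel >= 0.5).astype(float)       # nearest integer indicators
    x = np.where(z > 0, x_rel, 0.0)        # enforce x_j = 0 whenever z_j = 0
    return softplus_loss(A, y, x) + x @ x / gamma + mu * z.sum()
```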
4.2 Results on Synthetic Data
We first present the experimental results with the screening procedure applied to the synthetic datasets. We test the regularized logistic regression (REG) with $n = 500$, $s = 1000$, $k = 50$ as a function of the number of observations, $m \in \{200, 500, 1000\}$, the strength of the $\ell_2$ regularization, $\gamma \in \{1, 1.5, 1.8\}$, and the $\ell_0$ regularization, $\mu \in \{5\times 10^{-4}, 10^{-3}\}$. For the cardinality-constrained model (CARD), we use the same setting and vary $\gamma$ in the same way, while changing the ratio $k/n \in \{0.25, 0.05, 0.017\}$ by fixing $k = 50$ and varying $n$. In both experiments, $\Sigma = I$, which corresponds to generating features that are independent of one another.
Figure 1: Percentage of features screened as a function of
the number of observations in the dataset and regularization
strength for (REG).
Figure 2: Percentage of features screened as a function of
the number of observations in the dataset and regularization
strength for (CARD).
Figures 1 and 2 show the percentage of features eliminated from the regression by the screening procedure for different regularization strengths for (REG) and (CARD), respectively. As the number of observations increases, the number of screened features increases as well. We observe the same trend as the strength of the regularization increases, i.e., higher values of $\mu$ and lower values of $\gamma$ lead to better screening. The improved screening with a larger number of observations and stronger regularization can be explained by the smaller integrality gap of the conic relaxations, as shown in Tables 1 and 2. The integrality gap of a relaxation is the relative gap between the optimal objective value of the mixed-integer problem and that of the relaxation. Smaller integrality gaps lead to the satisfaction of a larger number of the screening rules in Propositions 1 and 2.
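For concreteness, the integrality gaps reported in Tables 1 and 2 measure the relative difference between the optimal value of the mixed-integer problem and that of its relaxation; for (REG) this can be written as below (the exact percentage normalization is our reading of the tables), with the analogous expression in terms of $\eta_C$ for (CARD):
\[
\text{gap} = 100 \cdot \frac{\eta_R - \eta_{\text{relax}}}{\eta_R}\,\%,
\]
where $\eta_{\text{relax}}$ denotes the optimal value of the big-M or conic relaxation.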
Table 1: Integrality gap (%) of big-M and conic formulations for (REG).

                      Big-M relaxation          Conic relaxation
  mu       m       gamma=1  gamma=1.5  gamma=1.8   gamma=1  gamma=1.5  gamma=1.8
  5e-4     200      12.91     15.06     16.05       0.01      0.02      0.04
           500       8.43     10.18     10.97       7e-3      0.02      0.03
         1,000       6.00      7.15      7.69       7e-3      9e-3      0.01
  1e-3     200      15.81     18.94     20.34       0.02      0.04      0.06
           500       9.88     12.38     13.51       0.01      0.03      0.04
         1,000       7.39      9.23     10.01       0.01      0.03      0.03
  Average           10.07     12.16     13.10       0.01      0.03      0.04
Table 2: Integrality gap (%) of big-M and conic formulations for (CARD).

                      Big-M relaxation          Conic relaxation
  k/n      m       gamma=1  gamma=1.5  gamma=1.8   gamma=1  gamma=1.5  gamma=1.8
  0.250    200      10.23     13.32     14.86       0.02      0.05      0.07
           500       5.27      7.12      8.08       0.01      0.03      0.04
         1,000       2.87      3.91      4.46       4e-3      0.01      0.02
  0.050    200      19.49     24.33     26.61       0.03      0.07      0.10
           500      10.48     13.82     15.51       0.01      0.03      0.05
         1,000       6.14      8.25      9.33       6e-3      0.02      0.02
  0.017    200      41.74     47.70     50.19       0.05      0.10      0.11
           500      26.58     32.73       -         0.02      0.06       -
         1,000      16.67     21.47     23.78       0.01      0.03      0.04
  Average           15.50     19.18     19.10       0.02      0.04      0.06
In Tables 1 and 2, we also compare the strength of the conic formulation with the big-M formulation. Observe that the integrality gaps produced by the conic relaxation are very small, on average 0.03% for the regularized model and 0.04% for the cardinality-constrained model. On the other hand, the big-M formulation has a much weaker gap, 12% and 18% for the regularized and constrained models, respectively. The tighter gaps with the conic formulation significantly help speed up the solution time of the branch-and-bound algorithm, as well as lead to the elimination of more variables with the screening rules, further
Table 3: Solution times for the regularized logistic regression (REG) with and without screening rules.

                              Time (sec.)                                Speed-up
                       BnB                  BnB + Screening
  mu       m      g=1   g=1.5  g=1.8     g=1   g=1.5  g=1.8        g=1   g=1.5  g=1.8
  5e-4     200     16    136    264        5     58    127         2.9    2.4    2.1
           500     25     69    174        6     19     57         4.3    3.7    3.2
         1,000     30     35     49        5      6      9         5.3    5.7    5.5
  1e-3     200     10     31     69        3     10     25         3.4    3.0    2.9
           500      9     29     38        2      6      8         4.2    4.7    4.8
         1,000     39     66     71        6      9     10         6.3    7.1    6.7
  Average          21     61    111        5     18     39         4.4    4.4    4.2
Table 4: Solution times for the cardinality-constrained logistic regression (CARD) with and without screening rules.

                              Time (sec.)                                Speed-up
                       BnB                  BnB + Screening
  k/n      m      g=1    g=1.5   g=1.8     g=1    g=1.5   g=1.8     g=1   g=1.5  g=1.8
  0.250    200     16      40      69        4      11      20      4.2    3.6    3.5
           500     41     110     256        7      23      68      5.8    4.9    4.2
         1,000     30      47      52        4       7       7      6.7    7.0    7.1
  0.050    200     73     200     410       12      38      92      6.2    5.5    4.6
           500    102     407   1,056       13      61     234      8.1    6.8    5.3
         1,000    159     242     287       14      23      28     10.8   10.3   10.0
  0.017    200    912   2,267   1,457       92   1,313   1,703     10.2    8.0    6.4
           500  1,267   3,548       -      167   1,144   1,971     12.6    9.3     -
         1,000  1,166   1,806   2,327       57     153     368     19.9   15.3   14.4
  Average         418     963     740       41     308     499      9.4    7.9    6.9
speeding up the optimization.

In order to see the impact of the screening procedure on the overall solution times, we solve the logistic regression problem using the branch-and-bound algorithm with and without screening, and compare the solution times and the speed-up due to screening variables. The branch-and-bound algorithm for solving the big-M formulation exceeds our time limit of 12 hours for the larger instances; therefore, we report results for the perspective formulation only. These results are shown in Tables 3 and 4. The computation time for the screening procedure is included when reporting the solution times for branch-and-bound with screening. The reported times are rounded to the nearest second. On average, we observe a 4.3x and 8.1x speed-up in computations due to the proposed screening procedure for (REG) and (CARD), respectively. The improvement in solution times increases with the number of observations. Again, we see a trend of increased speed-up as the strength of the regularization penalty increases, since more features are eliminated a priori.
4.3 Results on Real Data
In order to test the effectiveness of the proposed screening procedures on real data, we solve problems from the UCI Machine Learning Repository (Dua et al., 2017) (arcene and newsgroups) and genomic data from the Gene Expression Omnibus Database (Edgar et al., 2002) (genomic). In particular, we focus on the larger instances of the repository with a high ratio of features to observations, for which regularization is more important to avoid overfitting. We solve these instances using the regularized logistic regression model (REG), varying the strength of the regularization. As before, a time limit of 12 hours is set for each run.
The results are summarized in Table 5. For each instance, at least 92% of the features are screened; in particular, for the genomic dataset, 99.9% of the features are screened for each parameter setting. Over all instances, on average, 98% of the features are eliminated by the screening procedure before the branch-and-bound algorithm. Seven out of the 18 runs did not complete in 12 hours without screening. On the other hand, with screening, all but one run completed within the time limit, and always much faster. For the instances where branch-and-bound with and without screening both terminate within the time limit, screening leads to a 13.8x speed-up on average, with larger speed-ups (up to 25.6x) for the more difficult instances. For the instances where only the branch-and-bound without screening does not terminate, there is an average of
Table 5: Results for screening on real datasets using regularized logistic regression (REG).

  Dataset            mu     gamma   % Screened   BnB (sec.)   BnB + Screening (sec.)   Speed-up
  genomic           5e-4     0.5       99.9           104              19                 5.5
  (n = 22,883,               1.0       99.9           182              17                11.0
   m = 107)                  1.5       99.9           184              33                 5.5
                    1e-3     0.5       99.9           152              14                11.0
                             1.0       99.9           445              32                13.8
                             1.5       99.9           384              54                 7.1
  arcene            5e-4     0.5       97          25,963           1,013                25.6
  (n = 10,000,               1.0       97           6,999             336                20.8
   m = 100)                  1.5       92           >12hr          10,925                >4.0
                    1e-3     0.5       99             477              32                14.8
                             1.0       96          10,044             467                21.5
                             1.5       95          22,466           1,425                15.7
  newsgroups        5e-4     0.5       99.9         >12hr           1,135               >38.1
  (n = 28,467,               0.7       99.9         >12hr           8,701                >5.0
   m = 1,977)                1.0       99            >12hr          >12hr                 n.a.
                    1e-3     0.5       99.9         >12hr             401              >107.7
                             0.7       99.9         >12hr             522               >82.8
                             1.0       99.7         >12hr           7,439                >5.8
at least 40.5x speed-up. These experimental results clearly indicate that the proposed screening rules are very effective in pruning a large number of features and result in substantial savings in computational effort for the real datasets as well.
5 CONCLUSION
In this work, we present safe screening rules for $\ell_0$-$\ell_2$ regularized and cardinality-constrained logistic regression. Our numerical experiments show that a large percentage of the features can be eliminated efficiently and safely via this preprocessing step before employing branch-and-bound algorithms, particularly when regularization is strong, leading to significant computational speed-up. The strength of the conic relaxations contributes significantly to the effectiveness of the screening rules in pruning a large number of features. We show that the conic formulation provides much smaller integrality gaps than the big-M formulation, making it more suitable both for solving $\ell_0$-$\ell_2$-regularized logistic regression with a branch-and-bound algorithm and for deriving the screening rules.
REFERENCES
Aktürk, M. S., Atamtürk, A., and Gürel, S. (2009). A strong conic quadratic reformulation for machine-job assignment with controllable processing times. Operations Research Letters, 37:187–191.

Atamtürk, A. and Gómez, A. (2019). Rank-one convexification for sparse regression. arXiv preprint arXiv:1901.10334.

Atamtürk, A. and Gómez, A. (2020). Safe screening rules for $\ell_0$-regression from perspective relaxations. In International Conference on Machine Learning, pages 421–430. PMLR.

Atamtürk, A., Gómez, A., and Han, S. (2021). Sparse and smooth signal estimation: Convexification of $\ell_0$-formulations. Journal of Machine Learning Research, 22(52):1–43.

Bertsimas, D., King, A., Mazumder, R., et al. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44:813–852.

Bertsimas, D. and Van Parys, B. (2017). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. arXiv preprint arXiv:1709.10029.

Cawley, G. C. and Talbot, N. L. (2006). Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348–2355.

Dantas, C., Soubies, E., and Févotte, C. (2021). Expanding boundaries of gap safe screening. arXiv preprint arXiv:2102.10846.

Dedieu, A., Hazimeh, H., and Mazumder, R. (2021). Learning sparse classifiers: Continuous and mixed integer optimization perspectives. Journal of Machine Learning Research, 22(135):1–47.

Dua, D., Graff, C., et al. (2017). UCI machine learning repository.

Edgar, R., Domrachev, M., and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207–210.

El Ghaoui, L., Viallon, V., and Rabbani, T. (2010). Safe feature elimination for the lasso and sparse supervised learning problems. arXiv preprint arXiv:1009.4219.

Fercoq, O., Gramfort, A., and Salmon, J. (2015). Mind the duality gap: safer rules for the lasso. In International Conference on Machine Learning, pages 333–342. PMLR.

Gramfort, A., Strohmeier, D., Haueisen, J., Hämäläinen, M. S., and Kowalski, M. (2013). Time-frequency mixed-norm estimates: Sparse M/EEG imaging with non-stationary source activations. NeuroImage, 70:410–422.

Han, S., Gómez, A., and Atamtürk, A. (2020). 2x2-convexifications for convex quadratic optimization with indicator variables. arXiv preprint arXiv:2004.07448.

Hazimeh, H. and Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5):1517–1537.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67.

Kuswanto, H., Asfihani, A., Sarumaha, Y., and Ohwada, H. (2015). Logistic regression ensemble for predicting customer defection with very large sample size. Procedia Computer Science, 72:86–93.

Liu, J., Zhao, Z., Wang, J., and Ye, J. (2014). Safe screening with variational inequalities and its application to lasso. In International Conference on Machine Learning, pages 289–297. PMLR.

Miller, A. (2002). Subset Selection in Regression. CRC Press.

MOSEK ApS (2021). MOSEK Optimizer API for Python. Release 9.3.13.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2017). Gap safe screening rules for sparsity enforcing penalties. The Journal of Machine Learning Research, 18(1):4671–4703.

Shevade, S. K. and Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.

Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73:273–282.

Wang, J. and Park, E. (2017). Active learning for penalized logistic regression via sequential experimental design. Neurocomputing, 222:183–190.

Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. Advances in Neural Information Processing Systems, 27:1053–1061.

Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pages 1070–1078. Citeseer.

Yen, S.-J., Lee, Y.-S., Ying, J.-C., and Wu, Y.-C. (2011). A logistic regression-based smoothing method for Chinese text categorization. Expert Systems with Applications, 38(9):11581–11590.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320.