The Max-Cut Decision Tree: Improving on the Accuracy and Running
Time of Decision Trees
Jonathan Bodine and Dorit S. Hochbaum
University of California, Berkeley, U.S.A.
Keywords:
Decision Tree, Principal Component Analysis, Maximum Cut, Classification.
Abstract:
Decision trees are a widely used method for classification, both alone and as the building blocks of multiple
different ensemble learning methods. The Max-Cut decision tree introduces novel modifications to a standard baseline classification decision tree model, namely CART with the Gini Impurity criterion (CART Gini). One modification involves an alternative
splitting metric, Maximum Cut, which is based on maximizing the distance between all pairs of observations
that belong to separate classes and separate sides of the threshold value. The other modification is to select the
decision feature from a linear combination of the input features constructed using Principal Component Anal-
ysis (PCA) locally at each node. Our experiments show that this node-based, localized PCA with the novel
splitting modification can dramatically improve classification, while also significantly decreasing computa-
tional time compared to the baseline decision tree. Moreover, our results are most significant when evaluated
on data sets with higher dimensions, or more classes. For the example data set CIFAR-100, the modifications
enabled a 49% improvement in accuracy, relative to CART Gini, while reducing CPU time by 94% for compa-
rable implementations. These introduced modifications will dramatically advance the capabilities of decision
trees for difficult classification tasks.
1 INTRODUCTION
Decision trees are a widely used method for classifi-
cation, both alone and as the building blocks of multi-
ple different ensemble learning methods. A standard
approach for the construction of decision trees uti-
lizes the Classification and Regression Trees (CART)
method (Breiman et al., 1984). This method con-
structs decision trees by selecting a threshold value
on some feature to separate, or split, the observations
into two branches, based on some evaluation method.
This process continues recursively until a preset stop-
ping criterion is reached. Throughout our analysis,
we define the stopping rule as either the node having
only one class present, or no split existing such that
both sides of the branch contain at least one observa-
tion.
One of the standard methods for evaluating the
different splits is Gini Impurity, which was provided
in the original CART methodology (Breiman et al.,
1984). Gini Impurity is defined in equation 1, where $p_c$ is the fraction of observations at the node that belong to class $c$:

$$\sum_{c} p_c \, (1 - p_c) \qquad (1)$$
Define a threshold as a value within the range of one of the features such that the observations are partitioned into two subsets: the samples whose values for that feature are less than the threshold, and those whose values are greater than the threshold. Then, a
threshold is selected such that the average Gini Im-
purity of the resulting two subsets, weighted by their
cardinality, is minimized. We will use this method-
ology as the baseline to compare our novel splitting
procedure against.
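To make the baseline splitting rule concrete, the following Python/NumPy sketch selects the threshold on a single feature that minimizes the cardinality-weighted average Gini Impurity of the two resulting subsets. The function names are ours and purely illustrative, and the inner loop is left quadratic for readability; our actual implementation uses the vectorized O(n log n) approach described in section 2.

import numpy as np

def gini_impurity(y):
    # Gini Impurity of a label vector: sum_c p_c (1 - p_c), as in equation 1.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

def best_gini_split(x, y):
    # Return the threshold on feature values x that minimizes the weighted
    # average Gini Impurity of the two subsets it induces.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    best_theta, best_score = None, np.inf
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate equal feature values
        score = (i * gini_impurity(y_sorted[:i])
                 + (n - i) * gini_impurity(y_sorted[i:])) / n
        if score < best_score:
            best_score = score
            # place the threshold halfway between the two neighboring points
            best_theta = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_theta, best_score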
Our splitting procedure is motivated by a short-
coming of Gini Impurity. The partitions generated by Gini Impurity depend only on a threshold value, ignoring the fact that some partitions produce two sets that are very dissimilar in terms of the distances between their feature vectors, whereas others produce sets that are close in this distance. The more dissimilar the partition, the more meaningful it is. Given two op-
tions for a threshold that both partition the observa-
tions into the same ratios and same sizes, Gini Im-
purity would show no preference, regardless of the
distance between these classes. We believe that this
distance information is meaningful and preferable, as
it produces thresholds that separate the classes further
apart in space, resulting in decreased susceptibility to
overfitting.
Our splitting procedure takes inspiration from the
Max Cut problem, which in general is NP-complete
(Karp, 1972). Our use of Max Cut is in one dimen-
sion, based on distances with respect to a single fea-
ture, and restricted to a single threshold. It is there-
fore polynomial-time solvable. Indeed, in the ap-
pendix, we show a linear-time implementation given
the sorted observations. The Max Cut criterion is to
find $\theta$ maximizing equation 2, where $x_{i,j}$ represents the value of feature $j$ for observation $i$, $y_i$ represents the class of observation $i$, and $\theta$ represents the threshold value:

$$\sum_{\{i \,|\, x_{i,j} \le \theta\}} \; \sum_{\{k \,|\, x_{k,j} > \theta\}} \mathbb{1}_{y_i \ne y_k} \, \big| x_{i,j} - x_{k,j} \big| \qquad (2)$$
Given the fact that the observations can be sorted
based on any feature in O (n logn) and the linear-time
implementation of Max Cut, we have an O (n log n)
algorithm for solving this problem on any given fea-
ture of the feature vector, where n is the number of
observations. It is important to note that this is the
same asymptotic time complexity achieved when im-
plementing the Gini Impurity method.
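For concreteness, the sketch below evaluates equation 2 directly for a given threshold and then scans all candidate thresholds. The function names are ours, and the pairwise evaluation is deliberately naive (quadratic per threshold), so it only illustrates the objective rather than the O(n log n) implementation derived in the appendix.

import numpy as np

def max_cut_objective(x, y, theta):
    # Direct evaluation of equation 2: sum of |x_i - x_k| over all pairs
    # (i, k) on opposite sides of theta whose classes differ.
    left, right = x <= theta, x > theta
    total = 0.0
    for xi, yi in zip(x[left], y[left]):
        other = y[right] != yi
        total += np.sum(np.abs(x[right][other] - xi))
    return total

def best_max_cut_split_naive(x, y):
    # Try each midpoint between consecutive distinct sorted values and keep
    # the threshold maximizing the objective (assumes >= 2 distinct values).
    xs = np.unique(x)
    candidates = 0.5 * (xs[:-1] + xs[1:])
    scores = [max_cut_objective(x, y, theta) for theta in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]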
Modifying how the features are represented can
have a significant impact on the performance of a de-
cision tree. If the feature vector x
i
is modified through
a change of basis, the splits of the decision tree can be
changed. This paper seeks to find a good basis to rep-
resent the feature vector. One common way to find a
natural basis to represent the feature vector in is Prin-
cipal Component Analysis (PCA), first proposed by
Karl Pearson (F.R.S., 1901). PCA iteratively
finds the direction that accounts for the most variance
in the data, given that it is orthogonal to all the di-
rections previously found. The use of PCA as a pre-
processing step when utilizing decision trees is not
new. However, we propose that it should instead be
used throughout the construction of the decision tree,
at every node. This is motivated by the idea that dis-
tributions could vary in different subspaces of the fea-
ture space. Therefore, a meaningful direction overall
might not be significant in specific sub-spaces. To ac-
count for this, we introduce the following two meth-
ods for finding locally meaningful directions Node
Features PCA and Node Means PCA (see section 2
for details).
To understand how these changes affect a deci-
sion tree’s performance, we performed extensive ex-
perimental analysis on 8 different decision tree algo-
rithms listed in Table 2, each of which uses differ-
ent combinations of the methods discussed in section
2. Moreover, we implemented all of the algorithms
in python to understand the relative performance be-
tween the algorithms without having different levels
of code optimization affecting the results. This does
mean our baseline code is not as fast as commercially
available versions; however, by choosing our own im-
plementation for the baseline we are able to minimize
the differences between the code for the different im-
plementations. This highlights differences in perfor-
mance caused by the changes to the algorithm and not
different implementation choices.
An analysis was conducted on over 20,000 syn-
thetic data sets (section 3.1) with training set sizes
ranging between 100 and 300,000 and testing sets
fixed at 100,000 observations. We first considered bi-
nary classification problems (section 3.1.2), followed
by multiclass classification problems (section 3.1.3).
On the synthetic data we validated that the modifica-
tion produced statistically significant improvements
to the baseline Gini Impurity CART model. Simi-
larly, these results were then validated on real-world
data sets (section 3.2). These modifications were
found to provide dramatic improvements in both ac-
curacy and computational time, and these advantages
further increased with the dimensionality of the data
or the number of classes (section 4). The signifi-
cance of these results is demonstrated here with the
CIFAR-100 dataset (Krizhevsky, 2009), which has
100 classes, 3,072 dimensions, and 60,000 total ob-
servations (of which 48,000 are used for training). In
this case, compared to the baseline CART Gini model,
our Max Cut Node Means PCA algorithm (see section
2) resulted in a 49.4% increase in accuracy relative to
the baseline while simultaneously reducing the CPU
time required by 94%.
2 DESCRIPTION OF
ALGORITHMS
We introduce two methods for finding locally mean-
ingful bases. The first method that we consider is
Node Features PCA. At every decision node, PCA is
performed on the original feature vectors using only
the observations that have reached that node in order
to find the local principal components. These local
principal components are then used as the input fea-
tures to find the optimal threshold, rather than using
the original features.
The second method, referred to as Node Means
PCA, is derived from the idea that it is not the directions that best describe the data that are important, but rather those that best describe the differences between the classes. Like
Node Features PCA, this algorithm is used at every
decision node; the difference is in what points get
fed into the PCA algorithm. Specifically, the first
step is to think of the data set in a one-vs-rest con-
text, locally calculating the mean position of each of
the possible ‘rest’ collections. For example, if there
are 3 local points (1,0,0), (0, 1, 0), and (0,0,1), each
belonging to classes A, B, and C respectively, then
(0,0.5,0.5), (0.5,0,0.5), and (0.5,0.5,0) would be
the resulting means, calculated by excluding the first,
second, and third points respectively. These newly
generated means vectors would be passed into the
PCA algorithm to find a new basis, and the original
features are then transformed into this new basis. It is
important to note that this transformation will neces-
sarily result in a dimensionality reduction in the case
where the number of classes locally present is less
than the dimension of the original feature vector. The
features in this new basis are then used as the input
features to find the optimal threshold, instead of us-
ing the original features.
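The following sketch shows one way to realize this construction with the scikit-learn PCA interface used in our implementation; the function name is ours, and the handling of edge cases (e.g., a node containing a single class) is omitted.

import numpy as np
from sklearn.decomposition import PCA

def node_means_pca_transform(X, y):
    # For each class c present at the node, compute the mean of all
    # observations NOT in class c (the one-vs-rest "rest" mean), fit PCA to
    # these mean vectors, and project the node's observations onto the
    # resulting components.
    classes = np.unique(y)
    rest_means = np.vstack([X[y != c].mean(axis=0) for c in classes])
    # The number of components is bounded by the number of local classes,
    # giving a dimensionality reduction whenever |classes| < d.
    pca = PCA(n_components=min(len(classes), X.shape[1]))
    pca.fit(rest_means)
    return pca.transform(X)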
For a high-level description of a generic decision tree, and of how the specification of the split method and the feature type (basis) generates the proposed algorithms, see Figure 1. The notation used in Figure 1 is: $U$ for the set of all observations, and $X, y$ for the feature vectors and classes, respectively. Let the "split" parameter take the value of either Gini or MaxCut, which are functions that return the two subsets implied by the optimal split of their respective objective functions. Let featureType be 1, 2, 3, or 4 for Original, Global PCA, Local PCA, and Local Means PCA, respectively. Let PCA(X) and PCAMapping(X) be functions that return the principal components of $X$ or the function that maps $X$ to its principal components, respectively. Let $C(y)$ be the set of class values of vector $y$, $\{i \mid \exists j: y_j = i\}$, and let $U_i = \{u \mid u \in U,\ y_u \ne i\}$.
In our analysis, we consider a total of 8 different
algorithms, which we implemented in the Python pro-
gramming language (Van Rossum and Drake, 2009).
In order to improve the run time of these algorithms,
both the Gini Impurity and Max Cut optimal thresh-
old calculation for a given feature were implemented
as NumPy (Oliphant, 2006) vector operations to avoid
the use of slower Python for loops. Moreover, both
utilize efficient O (n log n) implementations. We re-
mark that both the Gini Impurity and Max Cut met-
ric were implemented so that in the event of a tie,
they favored more balanced splits, and the split was
chosen to be the mean value between the two closest
points that should be separated. Other than these sub-
routines, in order to make time-based comparisons as
consistent as possible, no difference existed between
the implementations of Gini Impurity and Max Cut
based trees. The scikit-learn (Pedregosa et al., 2011)
package was used to perform PCA.
As a result of Gini Impurity and Max Cut having
the same time complexity, the following results on
the time complexity of each algorithm hold regard-
less of the choice of evaluation criterion. The time complexities for computing the optimal split at each node are given in Table 1, where n is the number of observations, d is the number of input features, and p is the number of principal components considered.
procedure decisionTree(U, (X, y), split, featureType):
begin
    if featureType = 2 do X ← PCA(X);
    list ← {(U, X, y)};
    while list ≠ ∅ do
        pop (U, X, y) from list;
        if featureType = 1 or 2 do X' ← X;
        else if featureType = 3 do X' ← PCA(X);
        else do X' ← meansPCA(U, X, y);
        P1, P2 ← split(U, X', y);
        if |C(y(P1))| ≠ 1 do list ← list ∪ {(P1, X(P1), y(P1))};
        if |C(y(P2))| ≠ 1 do list ← list ∪ {(P2, X(P2), y(P2))};
end

procedure meansPCA(U, X, y):
begin
    means ← {};
    for each i ∈ C(y) do
        means ← means ∪ { (1 / |U_i|) Σ_{u ∈ U_i} x_u };
    T ← PCAMapping(means);
    X' ← T(X);
    return X';
end

Figure 1: High-level description of the decision tree fitting and means PCA algorithms.
Table 1: Time Complexity of Node Splitting.

Algorithm               Time Complexity
Baseline                O(d n log n)
Node Features PCA       O(n^2 p)
Node Means PCA          O(d^2 + n log n)
In order to be able to evaluate the relative per-
formance of the algorithms we ran them all with the
same implementation. Our baseline implementation
is obviously slower than that of open source pack-
ages (such as scikit-learn (Pedregosa et al., 2011)),
which have been optimized and are not implemented
in Python, yet it provides a cleaner basis for compar-
ison. The code that we used has not been made pub-
licly available as it does not represent the levels of
optimization needed for a production grade package.
Our code, however, is made available upon request.
Each of the algorithms is determined by:
1. Split criterion, which is Gini or Max Cut.
2. Which feature vector type to choose: Original,
Global PCA, Local PCA, or Local Means PCA.
Table 2 provides the nomenclature we use to refer to
each of these different combinations. Each of these
algorithms was put through the same rigorous exper-
imentation, using both synthetic and real-world data
sets.
3 A COMPARATIVE STUDY
3.1 Synthetic Data Sets
We first consider the performance of the 8 algorithms
on synthetic data sets, which provide key advantages
compared to real-world data sets. Foremost, synthetic
datasets allow for generating an arbitrary number of
independent datasets. This allowed us to draw sta-
tistically significant conclusions from the 20,000 plus
datasets we generated. Moreover, we were able to
generate an arbitrarily large number of training and
test examples with datasets ranging from 100,100
to 400,000 observations. This gave us greater in-
sights into the algorithms’ performances on larger
datasets than we could have achieved using only real-
world data. Lastly, we had fine-grained control of the
datasets’ different characteristics.
3.1.1 Experimental Design
To generate the synthetic data sets, we used the
datasets.make_classification tool found within the scikit-learn package (Pedregosa et al., 2011). The generator was configured so that each class had exactly one cluster. These clusters
were then centered around the vertices of a hyper-
cube of a specified dimension (one of the character-
istics that we would modify). Random points would
then be generated from a normal distribution around
each cluster’s center, and these points would be as-
signed to that cluster’s corresponding class, except in
1% of cases where the class is randomly assigned.
The features were then randomly linearly combined
within each cluster to generate covariance. The op-
tion of having redundant or repeated features was not
used in our experiments. We use the option of adding
an arbitrary number of random features to the feature
vector as a second characteristic that we control. Fi-
nally, the features were randomly shifted and scaled.
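An illustrative call matching the generation procedure just described is shown below; the specific parameter values here are hypothetical placeholders, since the actual values varied across experiments.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=110_000,       # e.g., 10,000 training plus 100,000 testing points
    n_classes=2,
    n_clusters_per_class=1,  # one cluster per class
    n_informative=10,        # "Dimension of Data"
    n_redundant=0,           # redundant/repeated features were not used
    n_repeated=0,
    n_features=15,           # informative features plus 5 noise features
    flip_y=0.01,             # 1% of labels assigned randomly
    shift=None,              # random shift of each feature
    scale=None,              # random scaling of each feature
)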
Throughout our experiments, we observed how
the following four factors changed the accuracy, wall
clock time, and the number of leaves in the tree. The
factors that we controlled were:
1. The number of training examples. Note that since
we were able to generate arbitrarily many data
points per data set, we were able to have an arbitrarily large testing set. We chose to fix the size of the
testing set at 100,000 data points. For instance,
if the training set consisted of 10,000 points, we
would generate a data set of 110,000 data points
where 10,000 points would be randomly selected
for training and the remaining 100,000 would be
for testing.
2. The number of classes.
3. The number of informative dimensions, referred
to as the Dimension of Data in our results.
4. The number of meaningless dimensions, referred
to as the Noise Dimension in our results.
Table 2: Algorithms' nomenclature.

Feature Type           Gini (split criterion)     Max Cut (split criterion)
(0) Original           Gini Features              Max Cut Features
(1) Global PCA         Gini Pre PCA Features      Max Cut Pre PCA Features
(2) Local PCA          Gini Node Features PCA     Max Cut Node Features PCA
(3) Local Means PCA    Gini Node Means PCA        Max Cut Node Means PCA
For each combination of parameter settings ex-
plored, we evaluated the models’ results on 30 in-
dependently generated datasets. For each of these
datasets, the models were presented with the same
training/testing examples, which were standardized
(zero mean and unit variance) using the scikit-learn
StandardScaler (Pedregosa et al., 2011) fit to the train-
ing examples. Standardization was used to make the
distances in each feature more comparable, which
will be important in the PCA analysis as well as find-
ing the Max Cut. Note that both the Gini Impurity
method and the orientation of the original feature vec-
tor are invariant under this transformation, indicating
that the baseline algorithm’s performance should also
be invariant.
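A minimal sketch of this preprocessing step, with random arrays standing in for a generated training and testing split:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test feature matrices standing in for a generated data set.
X_train = np.random.randn(10_000, 10)
X_test = np.random.randn(100_000, 10)

# Fit the scaler on the training examples only, then apply the same
# transformation (zero mean, unit variance per feature) to the test examples.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)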
Relying on multiple independent data sets, we can calculate the significance of our results, for each data generation setting, using a one-tailed paired-sample t-test. For each algorithm $j$ and data generation setting, we consider 14 alternative hypotheses $\{A_i\}_{i=1}^{14}$ and corresponding null hypotheses $\{N_i\}_{i=1}^{14}$. Specifically, let $\{A_i\}_{i=1}^{7}$ represent the alternative hypotheses that algorithm $j$ outperformed, in a pairwise comparison, each of the other 7 algorithms. Conversely, let $\{A_i\}_{i=8}^{14}$ represent the alternative hypotheses that algorithm $j$ underperformed, in a pairwise comparison, each of the other 7 algorithms. We then used the Holm-Bonferroni method (Holm, 1979) to correct for multiple hypothesis testing, testing for significance at the 5% level. In the case where all of $\{N_i\}_{i=1}^{7}$ were rejected, we labeled that data setting on the accuracy significance graph as "Best". In the case where any of $\{N_i\}_{i=8}^{14}$ were rejected, we labeled that data setting as "Not Best". Otherwise, we labeled the region as "Undecidable".
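A rough sketch of this testing procedure is given below, with hypothetical accuracy arrays standing in for the 30 paired results and a hand-rolled Holm-Bonferroni step-down in place of a library routine:

import numpy as np
from scipy.stats import ttest_rel

def holm_bonferroni(p_values, alpha=0.05):
    # Holm-Bonferroni step-down: reject the smallest p-value at alpha/m,
    # the next smallest at alpha/(m-1), ..., stopping at the first failure.
    p = np.asarray(p_values)
    m = len(p)
    rejected = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break
    return rejected

# Hypothetical paired accuracies of algorithm j and one competitor over 30 data sets.
rng = np.random.default_rng(0)
acc_j, acc_other = rng.random(30), rng.random(30)
t_stat, p_two_sided = ttest_rel(acc_j, acc_other)
# One-tailed p-value for the alternative "algorithm j outperforms the competitor".
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2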
All of these experiments were run on server nodes
that each contained two Intel Xeon E5-2670 v2 CPUs
for a total of 20 cores (no hyper-threading was used).
Each decision tree was constructed using a single core
and thread, such that up to 20 experiments were run-
ning simultaneously on a single node.
3.1.2 Binary Classification
The first set of experiments served to evaluate how the
accuracy and run time of each algorithm varied with
the size of the training set as well as the number of
informative dimensions. For this experiment, we set
the number of noise dimensions to zero. The results
of these experiments are summarized in Figure 2 and
Figure 3.
Examining Figures 2 and 3, it is apparent that localized PCA dramatically improves the model accuracy compared to the baseline, as well as compared to the model with PCA applied in preprocessing. Figure 3 makes it clear, with statistical significance, that the baseline algorithm is not the best-performing algorithm on average. On the other hand, Figure 3 indicates that the most promising algorithms are Gini Node Features PCA, Gini Node Means PCA, and Max Cut Node Means PCA.
We first consider Gini Node Means PCA vs. Max
Cut Node Means PCA. When looking at Figure 2 and
Figure 3, there are no significant trends that can be
seen when comparing the two, except that Max Cut
Node Means PCA seems to perform better more often
for smaller training sets. We then examine the accu-
racy ratio of Max Cut Node Means PCA / Gini Node
Means PCA for each of the 5,640 data sets in this ex-
periment. From this analysis, we found the mean ratio
to be 1.0003 and the sample standard deviation to be
0.0087, implying a 95% confidence interval for the
mean ratio of (1.00008, 1.00053). This implies that
Max Cut Node Means PCA had a slight advantage
over the Gini Node Means PCA when considering the
average ratio. Therefore, we decided to compare Gini
Node Features PCA vs. Max Cut Node Means PCA.
When considering Gini Node Features PCA vs.
Max Cut Node Means PCA the most prominent pat-
tern that emerges is that Gini Node Features PCA per-
forms better for large training sets (based on accu-
racy) and Max Cut Node Means PCA performs bet-
ter for smaller training sets. This pattern of Max
Cut Node Means PCA performing better for harder
datasets can be seen in Figure 4. It is important to note that this outperformance is almost entirely attributable to the choice of feature vector type, as the same pattern emerges if Gini Node Means PCA is used instead of Max Cut Node Means PCA. In the cases where 1,000
training samples were used, Node Means PCA, after a
small dip for low dimensional data, increasingly out-
performs Node Features PCA as the dimensionality
increases. One reason for this is that Node Means
PCA uses averaging, thereby reducing noise. As a re-
sult of the dimensionality reduction Node Means PCA
searches fewer basis vectors than Node Features PCA.
In fact, in the binary case there are only two classes,
hence the Node Means PCA algorithm searches only
two basis vectors to select a threshold.
Figure 5 is the histogram of all 5,640 ratios of Max Cut Node Means PCA accuracy divided by Gini Node Features PCA accuracy. It is evident from the right skew in Figure 5 that when Node Means PCA provides better accuracy, it does so considerably; when it provides less accuracy, the loss is only minor. To further demon-
strate this point, we consider the mean and max ratios.
Figure 2: Mean accuracy of binary classification on synthetic data, evaluated on test sets of 100,000. (Heat maps of mean accuracy for each of the eight algorithms, plotted over Dimension of Data and Size of Training Data.)
Figure 3: Significance of mean accuracy of binary classification on synthetic data, evaluated on test sets of 100,000. (Each Dimension of Data / Size of Training Data setting is marked Best, Not Best, or Undecidable for each of the eight algorithms.)
Figure 4: Mean Max Cut Node Means PCA accuracy minus mean Gini Node Features PCA accuracy, with 95% confidence interval, using the 1,000-sample training sets. (Plotted against Dimension of Data.)
Conditioned on the ratio being greater than 1 (which happens 49.6% of the time), the mean is 1.075 and the max is 1.430. When the ratio is conditioned to be less than 1 (which happens 50.4% of the time), the mean is 0.989 and the minimum is 0.915. This shows that Node Means PCA has the potential for a significant improvement in accuracy with only a minor risk of accuracy loss.
When examining the experimental results of Node
Means PCA vs. Node Features PCA, we see that
Node Means PCA has significantly better proper-
ties in regard to computational cost. Max Cut Node
Means PCA took between 27% and 98% less time
compared to Gini Node Features PCA to fit; and be-
tween 21% and 98% less time compared to the base-
line model. The computational efficiency of Node Means PCA is partially explained by the computational efficiency of the split, as provided in Table 1.
Figure 5: Histogram of Max Cut Node Means PCA Accuracy / Gini Node Features PCA Accuracy. (x-axis: accuracy ratio; y-axis: number of data sets.)
However, another factor is the size of the decision tree, for which the number of leaves is a proxy. The mean number of leaves across all data sets is provided in Table 3.
Table 3: Mean Number of Leaves in Decision Tree.

Algorithm                 Mean # of Leaves
Baseline                              1129
Gini Node Features PCA                 418
Gini Node Means PCA                    429
Max Cut Node Means PCA                 448
Table 3 shows that the average number of leaves
is significantly less for Gini Node Features PCA, Gini
Node Means PCA, and Max Cut Node Means PCA
when compared to the Baseline model. This reduction
can help account for the significant observed com-
putational efficiency of the Node Means PCA vs the
Baseline. Therefore, taking into account the costs and
benefits related to computational time, accuracy, and
robustness to higher dimensional problems, the Node
Means PCA algorithm has the best performance, with
a slight preference towards the Max Cut metric over
Gini Impurity.
The second experiment that was performed con-
sidered the addition of noise features in order to see
if the Localized PCA algorithms would be negatively
affected. For this experiment, the training set sizes
were fixed at 10,000, and then for each combination of meaningful dimensions in {5, 10, 15, ..., 50} and noise dimensions in {0, 5, 10, ..., 50}, we generated and analyzed 30 datasets. The major result of this experiment was that localized PCA only underperforms in extreme cases. When the real dimension was five and the noise dimension was greater than or equal to 15, our modifications showed no improvement upon the baseline, with the baseline actually outperforming our algorithm. However, this represents the cases where
noise features accounted for 75% or more of the avail-
able features and the number of informative features
was small. Therefore, we do not regard this as a fun-
damental issue with our algorithm.
3.1.3 Multiclass Classification
The next experiment investigates the impact of using
data with ten classes rather than just two. Referring
to the same steps as in section 3.1.2, one can see how
different training set sizes, informative dimensions,
and noise dimensions can affect the algorithm’s per-
formance. The main results derived from varying the
training data size and the underlying dimension of the
data can be seen in Figure 6 and Figure 7. An impor-
tant note is that these plots do not use a perfect logarithmic scale for the y-axis. The values for each horizontal bar are 10^2, 10^3, 10^4, 10^5, 2 × 10^5, and 3 × 10^5 (note that the last two are not 10^6 and 10^7 as one might initially suspect).
As seen in Figure 7, the baseline Gini Features al-
gorithm is ”Not Best” for all tested synthetic dataset
parameters. The prevalence of these ”Not Best”
markings indicates with statistical significance that
the baseline is not the best algorithm on average,
much like in the binary case (section 3.1.2). The "Best" markings indicate, from an accuracy standpoint, that Max Cut Node Means PCA was the best option in most cases. Moreover, Max Cut Node Means PCA provides significant computational ad-
vantages for high dimensional data (taking up to 93%
less time). It can, however, take up to 22% longer
when looking at low dimensional problems, but we do
not view this as a fundamental issue. In general, lower
dimensional problems are computationally faster, so
the increased relative time does not translate to sig-
nificantly more realized computational time.
Upon close inspection of Figure 7, the markings indicate that the Gini Node Features PCA algorithm may be the best option in terms of accuracy for larger training sets and fewer dimensions. This is the same pattern that emerged in the binary case, and Figure 8 allows for a closer comparison of these two algorithms, focusing specifically on the 300,000-sample case. It demonstrates that while Gini Node Features PCA may outperform in the lower-dimensional, larger-training-set case, Max Cut Node Means PCA rapidly and dramatically outperforms Gini Node Features PCA as the number of dimensions increases.
When the comparison is made between the Max Cut Node Means PCA algorithm and the Gini Node Means PCA algorithm, Figure 9 helps demonstrate the marginal but statistically significant improvement gained by utilizing the Max Cut metric.
Figure 6: Mean accuracy of multiclass classification on synthetic data, evaluated on test sets of 100,000. (Heat maps of mean accuracy for each of the eight algorithms, plotted over Dimension of Data and Size of Training Data.)
Figure 7: Significance of mean accuracy of multiclass classification on synthetic data, evaluated on test sets of 100,000. (Each Dimension of Data / Size of Training Data setting is marked Best, Not Best, or Undecidable for each of the eight algorithms.)
Figure 8: Mean Max Cut Node Means PCA accuracy minus mean Gini Node Features PCA accuracy, with 95% confidence interval, using the 300,000-sample training sets. (Plotted against Dimension of Data.)
Specifically, the 95% confidence interval for the mean ratio between the accuracy of the Max Cut Node Means PCA algorithm and the Gini Node Means PCA algorithm is (1.0060, 1.0071). Thus, we can conclude that the use of Max Cut does appear to improve accuracy on average.
When analyzing how the performance of the al-
gorithms were affected by the inclusion of noise fea-
tures, the same results as in the binary analysis are
seen. That is, for low dimensional data (informative
dimensions of 5 and 10) and high amounts of noise
(noise dimensions greater than or equal to 5 and 25
respectively), the baseline model has better perfor-
mance. However, these cases have very high percent-
ages of noise features (greater than 50%) and few di-
mensions. Since the improved accuracy of Max Cut Node Means PCA still held for higher dimensions, even when the percentage of noise features exceeded 50%, we believe that our Max Cut Node Means PCA algorithm is broadly applicable.
Figure 9: Histogram of Max Cut Node Means PCA Accuracy / Gini Node Means PCA Accuracy. (x-axis: accuracy ratio; y-axis: number of data sets.)
3.1.4 Synthetic Dataset Results Summary
After extensive testing, it is apparent that our local-
ized PCA algorithms provide dramatic improvements
over the baseline model in all but the most extreme
noise conditions. A thorough analysis of these lo-
calized PCA algorithms found that from an accuracy
standpoint, Node Means PCA should be preferred
for increasingly difficult classification tasks, partic-
ularly in cases where there are more classes, higher
dimensions, and/or fewer training examples. More-
over, when Node Means PCA was not the best op-
tion, it was typically only outperformed by a small
margin, and when it was the best option, it could end
up dramatically improving results. Empirically, Node
Means PCA results in significantly faster computa-
tion times compared to the baseline method and the
Node Features method. Thus, Node Means PCA pro-
vides dramatic improvements in both accuracy and
run time.
We further compared the Max Cut and Gini Impu-
rity metrics and found that when the Max Cut met-
ric was used in conjunction with the Node Means
PCA algorithm, on average, it can increase a decision
tree’s performance. Based on our experimentation on
more than 20,000 diverse datasets, we generally rec-
ommend that Max Cut Node Means PCA be consid-
ered as an alternative to the traditional baseline model.
We have thus shown that for synthetic datasets Max
Cut Node Means PCA yields a significant and dra-
matic improvement over the baseline model in both
accuracy and computational efficiency. We present
next the performance of our algorithm for real-world
datasets.
3.2 Real-world Data Sets
We proceeded to explore the effects of the eight dif-
ferent algorithms on real-world datasets, considering
a total of 5 different datasets. We used the Iris Data
Set and Wine Quality Data Set (Cortez et al., 2009),
both taken from the UCI Machine Learning Reposi-
tory (Dua and Graff, 2017). The Wine Quality Data
Set is separated into 2 sets, red and white wines, and
in our experimentation, we consider each of these
sets as a single combined dataset and as individual
sets. In the combined dataset, the wine’s classifica-
tion was included as another feature, one if red, zero
otherwise. We also used the MNIST dataset (LeCun
et al., 2010), as well as the CIFAR-10 and CIFAR-100
datasets (Krizhevsky, 2009). Table 4 provides a brief
overview of each of these datasets.
Table 4: Data Set Characteristics.

Data Set      Samples   Features   Classes
Iris              150          4         3
Wine-Red        1,599         11         6
Wine-White      4,898         11         7
Wine-Both       6,497         12         7
MNIST          70,000        784        10
CIFAR-10       60,000      3,072        10
CIFAR-100      60,000      3,072       100
All of these experiments on real-world data were
run on server nodes that each contained two Intel
Xeon E5-2670 v2 CPUs, for a total of 20 cores (no
hyper-threading was used). Each decision tree was
constructed using a full server node utilizing paral-
lel processing due to the NumPy vector operations.
Rather than recording wall clock time as in the previ-
ous synthetic data set experiments, the total CPU time
was used for these experiments.
We considered each algorithm with modified fea-
tures as well as with standardized features in the ini-
tial input and reported the best result. To evaluate
the performance of each algorithm, we used one of two methods: either 10 × 10 cross-validation (as in the case of Iris, Wine-Red, Wine-White, and Wine-Both), or an 80%-20% train-test data split (as in the case of MNIST, CIFAR-10, and CIFAR-100). The decision not to use 10 × 10 cross-validation for MNIST, CIFAR-10, and CIFAR-100 was a result of the significant computational requirements for computing these trees. Moreover, due to the large size of these datasets, the variance in accuracy should be lower than that in smaller datasets, implying that the 10 × 10 cross-validation would not be as beneficial.
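A rough sketch of the first protocol, using scikit-learn's repeated stratified folds and its stock decision tree purely as a stand-in for the trees studied here:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for the trees studied here

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(scores))  # mean accuracy over the 10 x 10 = 100 folds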
The results of these experiments are reported in Table 5, with the mean value provided for the Iris, Wine-Red, Wine-White, and Wine-Both datasets.
Table 5: Results of Real-World Data Set Experiments.

                               Iris                         Wine-Red
                           Accuracy  Rel. Time  Scaled  Accuracy  Rel. Time  Scaled
Gini Features                 0.945      1.000   False     0.626      1.000   False
Gini Pre PCA Features         0.941      1.000   True      0.636      3.229   True
Gini Node Features PCA        0.944      1.309   True      0.619     11.690   True
Gini Node Means PCA           0.950      0.894   False     0.638      6.707   True
Max Cut Features              0.947      1.032   True      0.629      0.799   True
Max Cut Pre PCA Features      0.950      0.915   False     0.632      3.308   True
Max Cut Node Features PCA     0.941      1.213   False     0.636     12.479   True
Max Cut Node Means PCA        0.960      0.798   False     0.631      5.539   True

                               Wine-White                   Wine-Both
                           Accuracy  Rel. Time  Scaled  Accuracy  Rel. Time  Scaled
Gini Features                 0.623      1.000   False     0.624      1.000   False
Gini Pre PCA Features         0.615      1.216   True      0.617      1.214   False
Gini Node Features PCA        0.615      8.301   True      0.608      9.122   True
Gini Node Means PCA           0.635      3.788   True      0.633      4.231   True
Max Cut Features              0.627      0.548   True      0.623      0.648   True
Max Cut Pre PCA Features      0.628      1.076   True      0.626      1.046   True
Max Cut Node Features PCA     0.634      8.740   True      0.633      9.755   True
Max Cut Node Means PCA        0.635      4.449   True      0.636      4.251   True

                               MNIST                        CIFAR-10
                           Accuracy  Rel. Time  Scaled  Accuracy  Rel. Time  Scaled
Gini Features                 0.869      1.000   False     0.262      1.000   False
Gini Pre PCA Features         0.823      1.558   True      0.246      1.162   False
Gini Node Features PCA        0.863      2.612   False     0.290      1.173   False
Gini Node Means PCA           0.922      0.186   False     0.342      0.085   False
Max Cut Features              0.847      0.744   False     0.243      1.313   False
Max Cut Pre PCA Features      0.866      1.091   False     0.271      1.503   False
Max Cut Node Features PCA     0.897      1.552   False     0.309      1.034   True
Max Cut Node Means PCA        0.924      0.093   False     0.349      0.075   False

                               CIFAR-100
                           Accuracy  Rel. Time  Scaled
Gini Features                 0.083      1.000   False
Gini Pre PCA Features         0.073      1.259   False
Gini Node Features PCA        0.090      1.010   True
Gini Node Means PCA           0.116      0.132   False
Max Cut Features              0.077      0.626   False
Max Cut Pre PCA Features      0.090      0.637   False
Max Cut Node Features PCA     0.109      0.491   False
Max Cut Node Means PCA        0.124      0.061   True
The results of the real-world analysis verify that
the improvements in accuracy are similar to those in
the synthetic datasets. In the case of the Node Means
PCA modification, we found that this was the best
option to use for all the problems analyzed. Defin-
ing performance as the percentage improvement in
accuracy relative to the baseline model (utilizing the
better-performing metric between Max Cut and Gini),
this modification had improvements of 1.6%, 1.9%,
1.9%, 1.9%, 6.3%, 33.3%, and 49.4% for each of the
data sets. Moreover, Max Cut was the best choice
in 5 out of 7 problems. Defining performance as
the percentage improvement of the best Max Cut al-
gorithm compared to the best performing Gini algo-
rithm, Max Cut had improvements of 1.1%, -0.3%, 0.0%, 0.5%, 0.2%, 2.0%, and 6.9%. Finally, Max
Cut Node Means PCA was the best choice in 5 out
of 7 problems and was tied for the best in 1 problem.
Defining performance as the percentage improvement
of accuracy compared to the baseline model, Max
Cut Node Means PCA represented an improvement of
1.6%, 1.0%, 1.9%, 1.9%, 6.3%, 33.3%, and 49.4%.
From Table 5, it can be seen that Max Cut Node
Means PCA is invariably the fastest algorithm for
larger problems. Although Max Cut Node Means
PCA took more time for small datasets, we do not be-
lieve this discounts its overall performance. Regard-
less of the algorithm chosen, the decision trees are fit
in a negligible amount of time for the smaller datasets
relative to the larger datasets. Moreover, Max Cut
Node Means PCA still provides the previously men-
tioned accuracy benefits. For the larger datasets, such
as CIFAR-100, the Max Cut Node Means PCA algo-
rithm reduced the required CPU time by 94% com-
pared to the baseline algorithm. Thus, Max Cut Node
Means PCA provides significant computational and
accuracy advantages.
4 CONCLUSIONS
In this paper, we propose two important modifications
to the CART methodology for constructing classifica-
tion decision trees. We first introduced a Max Cut
based metric for determining the optimal binary split
as an alternative to the Gini Impurity method (see ap-
pendix for an O(n log n) implementation of Max Cut along a single feature). We then introduced the use of
localized PCA to determine locally viable directions
for considering splits, and we further modified the traditional PCA algorithm into Means PCA to find suitable directions for discriminating between classes. We
follow this study with a theoretical commentary on
how these modifications are reflected in the asymp-
totic bounds of node splitting and demonstrate that
Means PCA improves upon the traditional method.
Our extensive experimental analysis included
more than 20,000 synthetically generated data sets
with training sizes ranging from 100 to 300,000 and informative dimensions ranging between 4 and 50. We further considered both binary and multiclass
classification tasks. These experiments demonstrate
the significant improvements in increased accuracy
and decreased computational time provided by uti-
lizing localized Means PCA and Max Cut. Further-
more, we show that the accuracy improvements be-
come even more substantial as the dimension of the
data and/or the number of classes increases and that
the runtime improves with the dimension and size of
the datasets.
Analysis was also done on real-world datasets.
Our results indicate that the Max Cut Node Means
PCA algorithm remains advantageous even when us-
ing real-world data. For example, we show that in
the case of CIFAR-100 (100 classes, 3,072 dimen-
sions, and 48,000 training points out of 60,000 total)
our algorithm results in a 49.4% increase in accuracy
performance compared to the baseline CART model,
while simultaneously reducing the CPU time required
by 94%. This novel algorithm helps bring decision
trees into the world of big, high-dimensional data.
Our experiments demonstrate the significant improve-
ments that Max Cut Node Means PCA has on con-
structing classification decision trees for these types
of datasets. Further research on how these novel deci-
sion trees affect the performance of ensemble meth-
ods may lead to even greater advancements in the
area.
ACKNOWLEDGEMENTS
This research used the Savio computational cluster re-
source provided by the Berkeley Research Comput-
ing program at the University of California, Berke-
ley (supported by the UC Berkeley Chancellor, Vice
Chancellor for Research, and Chief Information Of-
ficer). This research was supported in part by NSF
award No. CMMI-1760102.
REFERENCES
Breiman, L., Friedman, J., Stone, C., and Olshen, R. (1984).
Classification and Regression Trees. The Wadsworth
and Brooks-Cole statistics-probability series. Taylor
& Francis.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis,
J. (2009). Modeling wine preferences by data min-
ing from physicochemical properties. Decis. Support
Syst., 47:547–553.
Dua, D. and Graff, C. (2017). UCI machine learning repos-
itory.
F.R.S., K. P. (1901). LIII. On lines and planes of closest fit to
systems of points in space. The London, Edinburgh,
and Dublin Philosophical Magazine and Journal of
Science, 2(11):559–572.
Holm, S. (1979). A simple sequentially rejective multi-
ple test procedure. Scandinavian Journal of Statistics,
6(2):65–70.
Karp, R. M. (1972). Reducibility among combinatorial
problems. In Miller, R. E., Thatcher, J. W., and
Bohlinger, J. D., editors, Complexity of Computer
Computations, pages 85–103. Plenum Press, New
York.
Krizhevsky, A. (2009). Learning multiple layers of features
from tiny images. Technical report.
LeCun, Y., Cortes, C., and Burges, C. (2010). MNIST hand-
written digit database. ATT Labs [Online]. Available:
http://yann.lecun.com/exdb/mnist, 2.
Oliphant, T. E. (2006). A guide to NumPy, volume 1. Trel-
gol Publishing USA.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Van Rossum, G. and Drake, F. L. (2009). Python 3 Refer-
ence Manual. CreateSpace, Scotts Valley, CA.
APPENDIX
Implementation of the Max Cut Metric in O(n log n):

The first step is to sort the observations in ascending order such that $x_{i,j} < x_{k,j}$ for all $i < k$, which is known to be possible in $O(n \log n)$ time. There are a total of $n$ possible splits to consider; let the value achieved by the split placed between $x_{i,j}$ and $x_{i+1,j}$ be referred to as $\theta_i$. We will now show that given $\theta_{i-1}$, $\theta_i$ can be computed in constant time. It is evident that the following equality holds:

$$\theta_i = \theta_{i-1} - \sum_{t<i} \mathbb{1}_{y_i \ne y_t} \left| x_{i,j} - x_{t,j} \right| + \sum_{t>i} \mathbb{1}_{y_i \ne y_t} \left| x_{i,j} - x_{t,j} \right|$$

$$= \theta_{i-1} - \sum_{t<i} \mathbb{1}_{y_i \ne y_t} \left( x_{i,j} - x_{t,j} \right) + \sum_{t>i} \mathbb{1}_{y_i \ne y_t} \left( x_{t,j} - x_{i,j} \right)$$

$$= \theta_{i-1} + \sum_{t<i} \left( \mathbb{1}_{y_i \ne y_t} x_{t,j} - \mathbb{1}_{y_i \ne y_t} x_{i,j} \right) + \sum_{t>i} \left( \mathbb{1}_{y_i \ne y_t} x_{t,j} - \mathbb{1}_{y_i \ne y_t} x_{i,j} \right)$$

Since $\mathbb{1}_{y_i \ne y_i} = 0$ for all $i$:

$$\theta_i = \theta_{i-1} + \sum_{t} \left( \mathbb{1}_{y_i \ne y_t} x_{t,j} - \mathbb{1}_{y_i \ne y_t} x_{i,j} \right) = \theta_{i-1} + \sum_{t} \mathbb{1}_{y_i \ne y_t} x_{t,j} - x_{i,j} \sum_{t} \mathbb{1}_{y_i \ne y_t}$$

Since $\sum_{t} \mathbb{1}_{y_i \ne y_t} x_{t,j}$ and $\sum_{t} \mathbb{1}_{y_i \ne y_t}$ depend only on the class of observation $i$, they can be calculated once for each class $c$ and recorded as $S_c$ and $N_c$, respectively. Then,

$$\theta_i = \theta_{i-1} + S_{y_i} - x_{i,j} \, N_{y_i}$$

Therefore, the complexity of the split per feature is $O(n \log n)$.
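Following this recurrence, a sketch of the full threshold search along one feature is given below; the function name is ours, ties between equal feature values are simply skipped, and the tie-breaking rule of section 2 (favoring balanced splits) is omitted for brevity.

import numpy as np

def best_max_cut_split(x, y):
    # O(n log n) Max Cut threshold search along one feature, using the
    # recurrence theta_i = theta_{i-1} + S_{y_i} - x_i * N_{y_i} derived above.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    classes, inv = np.unique(ys, return_inverse=True)
    # S_c: sum of feature values over observations NOT in class c
    # N_c: number of observations NOT in class c
    class_sums = np.bincount(inv, weights=xs)
    class_counts = np.bincount(inv)
    S = xs.sum() - class_sums
    N = n - class_counts
    best_theta, best_value, theta = None, -np.inf, 0.0
    for i in range(n - 1):              # cut placed between xs[i] and xs[i + 1]
        theta += S[inv[i]] - xs[i] * N[inv[i]]
        if xs[i] == xs[i + 1]:
            continue                    # no threshold can separate equal values
        if theta > best_value:
            best_value = theta
            best_theta = 0.5 * (xs[i] + xs[i + 1])
    return best_theta, best_value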