cell as compared to the cancer cell. However, a yellow
spot suggests that the gene is expressed equally
in both cells and is therefore not relevant as
a cause of the disease, because its activity does not
change when the healthy cell becomes cancerous.
Using DNA microarrays, we can analyze
a large number of genes at the same time, find which
genes are being expressed, and decide on a better
prognosis based on the previous analyses.
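The comparison of expression levels described above is commonly summarized as a log ratio of the two measured intensities. The following is a minimal sketch of that idea, not the procedure used in this work: the function name, the intensity arguments, and the two-fold `threshold` are all hypothetical choices made for illustration.

```python
import math


def classify_spot(cancer_intensity, healthy_intensity, threshold=1.0):
    """Label one microarray spot by its log2 expression ratio.

    `threshold` is a hypothetical cut-off on |log2 ratio|
    (1.0 corresponds to a two-fold change in expression).
    """
    log_ratio = math.log2(cancer_intensity / healthy_intensity)
    if log_ratio >= threshold:
        return "higher in cancer cell"
    if log_ratio <= -threshold:
        return "higher in healthy cell"
    # Near-equal expression: the yellow-spot case described in the text.
    return "similar in both cells"
```

Under these assumptions, `classify_spot(500.0, 520.0)` falls into the near-equal case, whereas a four-fold difference in either direction is flagged as differential expression.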
Figure 3 depicts the process of generating a
dataset with the DNA microarray technique
shown in Figure 2. The datasets considered in this
work are obtained with this process.
Figure 3: Dataset generation from DNA Microarray.
2.2 Related Approaches
In the last decades, there has been considerable re-
search on microarray data classification for cancer
diagnosis (Alonso-Betanzos et al., 2019; Yip et al.,
2011; Statnikov et al., 2005b). Many unsupervised
and supervised FD and FS techniques have been
employed on this type of data before classification takes
place. Since microarray datasets are typically
labelled, supervised techniques are usually preferred to
unsupervised ones. In this section, we briefly review
some of the existing related work using FD, FS, and
classification techniques.
Surveys of common classification techniques and
related methods to increase their accuracy for
microarray analysis are presented by Alonso-Betanzos
et al. (2019) and Yip et al. (2011). The experimental
evaluation is carried out on publicly available datasets.
Saeys et al. (2007) surveyed FS techniques used for
this type of data, showing their adequacy.
It has been found that unsupervised FD performs
well when combined with several classifiers. For
instance, the equal frequency binning (EFB) technique
with the Naïve Bayes (NB) classifier produces very
good results (Witten et al., 2016). It has also been
reported that applying equal interval binning (EIB)
and EFB with microarray data, together with support
vector machines (SVM) classifiers, yields good re-
sults (Meyer et al., 2008). The work of Statnikov et al.
(2005a) shows that FS significantly improves the clas-
sification accuracy of multi-class SVM classifiers and
other classification algorithms.
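To make the two unsupervised discretization schemes concrete, the sketch below implements EIB and EFB for a single feature in plain Python. It is an illustration only: the function names are hypothetical, and the EFB variant assigns bins by rank, which is a simplification that may place tied values in different bins.

```python
def equal_interval_bins(values, n_bins):
    """EIB: split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # constant feature: a single bin
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """EFB: assign bins by rank so each bin holds roughly the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


values = [0.1, 0.2, 0.3, 0.4, 9.0, 10.0]
print(equal_interval_bins(values, 2))   # [0, 0, 0, 0, 1, 1]
print(equal_frequency_bins(values, 2))  # [0, 0, 0, 1, 1, 1]
```

The toy feature shows the practical difference: with skewed values, EIB leaves most instances in one bin, while EFB balances the bin counts.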
An FS filter for microarray data, with an
information-theoretic criterion named double input
symmetrical relevance (DISR), which measures fea-
ture complementarity, was proposed by Meyer et al.
(2008). The reported experimental results on one syn-
thetic dataset and 11 microarray datasets show that the
DISR criterion is competitive with existing FS filters.
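A rough sketch of how such a criterion can drive a filter is given below. It assumes the usual definition of symmetrical relevance, SR(X; Y) = I(X; Y) / H(X, Y), and scores each candidate by summing the SR of the pair (candidate, already selected feature) with respect to the class. The function names, the seeding rule (start from the feature with the highest mutual information), and the toy data are assumptions made for illustration, and features are assumed to be already discretized.

```python
import math
from collections import Counter


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in Counter(rows).values())


def symmetrical_relevance(xi, xj, y):
    """SR of the pair (xi, xj) with respect to y: I(X_ij; Y) / H(X_ij, Y)."""
    joint = entropy(xi, xj, y)
    if joint == 0:
        return 0.0
    mutual_info = entropy(xi, xj) + entropy(y) - joint
    return mutual_info / joint


def disr_select(features, y, k):
    """Greedy forward selection of k feature indices with a DISR-style score."""
    def mi(x):
        return entropy(x) + entropy(y) - entropy(x, y)

    # Assumption: seed with the single most informative feature.
    selected = [max(range(len(features)), key=lambda i: mi(features[i]))]
    while len(selected) < k:
        candidates = [i for i in range(len(features)) if i not in selected]
        best = max(candidates, key=lambda i: sum(
            symmetrical_relevance(features[i], features[j], y)
            for j in selected))
        selected.append(best)
    return selected


y = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 0, 0, 1, 1, 1, 1, 1]   # informative, wrong on one instance
f1 = list(f0)                   # redundant copy of f0
f2 = [0, 0, 0, 1, 0, 0, 0, 0]   # weak alone, flags the instance f0 gets wrong
print(disr_select([f0, f1, f2], y, 2))  # [0, 2]
```

In the toy data, the second pick is the weak but complementary feature rather than the redundant copy, which is the behaviour a complementarity-aware criterion is meant to capture.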
Diaz-Uriarte and Andres (2006) explored FS by
backward elimination of features, as well as
classification, both using random forests (RF).
The authors applied the chosen method to one
simulated and nine real microarray datasets and found that
RF has better performance than other classification
methods, such as diagonal linear discriminant
analysis (DLDA), K-nearest neighbors (KNN), and SVM.
They also showed that the FS technique used led to a
smaller subset of features than alternative techniques,
namely Nearest Shrunken Centroids and a combined
method of filter and nearest neighbor classifier.
The work by Li et al. (2018) introduced the use of
large-scale linear support vector machine (LLSVM)
and recursive feature elimination with variable step
size (RFEVSS) as an enhancement to the traditional
FS technique based on SVM with recursive feature
elimination (SVMRFE), which is considered one of
the best methods in the literature but has a high
computational cost. The improved approach upgrades
RFE by varying the step size, with
the goal of reducing the number of iterations (the step
size is kept larger in the initial stages of the process,
where non-relevant features are discarded). In addition,
the standard SVM is replaced by a large-scale
linear SVM, which accelerates the assignment of
weights. The authors compare their approach
to FS with SVM and RF, and use the SVM, NB, KNN
and logistic regression (LR) classifiers. These tech-
niques are applied on six microarray datasets and the
approach provides better performance with compara-
ble levels of accuracy, showing that SVM and LR out-
perform the other two classifiers.
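The effect of a variable step size on the number of RFE iterations can be sketched as follows. This is not the schedule of Li et al. (2018): the halving rule controlled by `drop_fraction`, the function name, and the `weight_fn` callback (standing in for refitting a linear SVM and reading its feature weights) are all hypothetical.

```python
def rfe_variable_step(n_features, weight_fn, n_keep, drop_fraction=0.5):
    """Recursive feature elimination with a shrinking step size.

    `weight_fn(active)` is a hypothetical callback that returns one
    importance weight per index in `active`, e.g. the absolute weights
    of a linear SVM refit on those features.
    """
    active = list(range(n_features))
    iterations = 0
    while len(active) > n_keep:
        weights = weight_fn(active)
        # Large steps early on; at least 1 and never past n_keep.
        step = max(1, int((len(active) - n_keep) * drop_fraction))
        ranked = sorted(zip(weights, active))  # lowest weight first
        dropped = {idx for _, idx in ranked[:step]}
        active = [idx for idx in active if idx not in dropped]
        iterations += 1
    return active, iterations


# Toy weights: the importance of a feature is simply its index.
kept, iterations = rfe_variable_step(
    1000, lambda active: [float(i) for i in active], n_keep=10)
print(kept)        # [990, 991, ..., 999]
print(iterations)  # 11
```

With a fixed step of one feature per iteration, the same toy run would need 990 refits, which illustrates why a larger step in the early stages reduces the cost.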
Recently, in the context of cancer explainability,
Consiglio et al. (2021) considered the problem of
finding a small subset of features capable of discern-
ing among six classes of instances. These classes may
be healthy or cancerous. The goal was to define a
comprehensive set of rules based on the most relevant
features (selected by their technique) that can distin-
guish classes based on their gene expressions. The
proposed method combines a genetic algorithm (GA)
to conduct FS and a fuzzy rule-based system to perform
classification on a dataset with 21 instances,
more than 45 thousand features, and 6 classes. Ten
rules were devised, each one of them taking into ac-
count specific features, which make them crucial in
ICPRAM 2022 - 11th International Conference on Pattern Recognition Applications and Methods