Predicting the Malignant Breast Cancer using Tumor Tissue Features 
Wenrui Zhao 
College of Art and Science, the Ohio State University, Columbus, OH, 43210, U.S.A. 
Keywords:  Breast Cancer, Breast Cancer Dataset, Feature Selection, FNA, Cancer Diagnosis.
Abstract:  Breast cancer is one of the most common cancers in women and is the second leading cause of cancer
death among women, after lung cancer. In clinical diagnosis, fine-needle aspiration (FNA) cytology is often used
for tumor diagnosis because it is safe, accurate, and easy to perform. Pathologists judge whether a patient's
tumor tissue is malignant by observing the cell population. The accuracy of fine-needle biopsy depends largely
on the doctors who perform the sampling and analysis. Therefore, it is crucial to study which characteristics of
cells can serve as a solid basis for discrimination. This article constructs univariate and multivariate logistic
regression models to analyze the predictive value of 9 cell features for breast cancer. By evaluating the ROC
curve, the article shows that the constructed model accurately predicts malignant tumor tissue. The 9
characteristics of tumor tissue quantified by FNA are of great value in predicting malignant breast cancer.
1  INTRODUCTION 
Breast cancer is one of the most common cancers in
women and is the second leading cause of cancer death
among women, after lung cancer (Nguyen, 1970)
(Mangasarian, 1990). In 2020, over 2.3 million women
were diagnosed with breast cancer worldwide, and
685,000 died. Owing to population growth, aging, and
the increasing prevalence of known cancer risk factors
(such as smoking and unhealthy eating), the WHO
estimates that, if global incidence rates remain at
2020 levels, there will be around 28.4 million new
cancer cases worldwide in 2040. Women in every country
face the risk of developing breast cancer at any age
after puberty, but the incidence rate increases with
age (Piro, 2021). Existing diagnostic techniques,
including nuclear magnetic resonance imaging,
ultrasound, CT (computed tomography), and PET
(positron emission tomography), are very effective in
tumor detection (World Health Organization, 2021).
However, when doctors find suspicious tumor tissue, 
they still hope to obtain tissue samples for analysis. 
Biopsy is an essential technique for the clinical
diagnosis of cancer. Because fine-needle biopsy
requires neither advance preparation nor special
dietary restrictions, fine-needle aspiration (FNA)
has become the preliminary diagnostic basis for
judging whether breast tissue is cancerous. A large
body of data shows that, although FNA has many
advantages, a few cases may be misdiagnosed.
Therefore, it is vital to study which characteristics
of cells can serve as a solid basis for discrimination.
From 1989 to 1991, Dr. Wolberg, Dr. Mangasarian,
and two graduate students constructed a classifier
using the pattern separation multi-surface method
(MSM) on nine cell features and successfully
diagnosed 97% of new cases (Nguyen, 1970)
(Wolberg, 1989). This work led to the Wisconsin breast
cancer dataset. This article constructs univariate and
multivariate logistic regression models to analyze the
predictive value of these 9 cell features for breast
cancer. It uses biometric methods for exploratory data
analysis, focusing more narrowly on checking the
goodness of fit of the model (Chatfield, 2021). By
studying the relative importance of the 9 features,
the article helps establish a more standardized method
for judging whether tumor tissue is malignant.
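The modelling approach named above (univariate and multivariate logistic regression, judged by the ROC curve) can be illustrated with a minimal sketch. It is not the article's code or data: it assumes randomly generated stand-in data with two hypothetical features instead of the Wisconsin dataset, fits each model by plain gradient descent, and reports in-sample AUC only. The function names fit_logistic, predict_proba, roc_auc, and make_synthetic are introduced here purely for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of the positive (malignant) class per row."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative one (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n=200, seed=0):
    """Hypothetical stand-in data: feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = malignant, 0 = benign (assumed coding)
        informative = rng.gauss(1.5 if label else -1.5, 1.0)
        noise = rng.gauss(0.0, 1.0)
        X.append([informative, noise])
        y.append(label)
    return X, y


X, y = make_synthetic()

# Univariate analysis: one single-feature model per feature.
univariate_auc = []
for j in range(len(X[0])):
    Xj = [[row[j]] for row in X]
    w, b = fit_logistic(Xj, y)
    univariate_auc.append(roc_auc(y, predict_proba(Xj, w, b)))

# Multivariate analysis: one model using all features together.
w, b = fit_logistic(X, y)
multivariate_auc = roc_auc(y, predict_proba(X, w, b))

print("univariate AUC per feature:", univariate_auc)
print("multivariate AUC:", multivariate_auc)
```

Comparing the per-feature AUC values with the multivariate AUC is one simple way to see which features carry predictive value on their own; a held-out test set would be needed for an honest estimate of out-of-sample accuracy.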
2  ANALYSES 
FNA uses a thin needle of about 20-27 gauge (G),
similar to or smaller than the needles used for
routine blood tests; in general, the higher the gauge
number, the thinner the needle (CancerQuest, 2021).
Because only a small amount of tissue and cellular
material is collected, pathologists pay more
attention to observing cell populations. The
study  used  the  Wisconsin  Breast  Cancer  Dataset