2 DATA PROCESSING AND
SAMPLE EXPANSION
As early as 2016, China had formulated data standards for registering patients with various diseases. The standard document WS-375 specifies the metadata attributes and data element attributes of the basic data set for stroke registration reports, so that disease prevention and control institutions, medical institutions providing relevant services, and the relevant health administrative departments can collect, transmit, and store the corresponding business data. Because the accumulation of medical data in China started late, early records are few and non-standard, and no stroke data set of appropriate size yet exists. In this situation, machine learning methods based on small-sample learning become a viable option for building prediction models.
2.1 Data Sources
The data in this paper come mainly from two sources:
1. The original screening data of 300 cases, obtained by a local medical institution by screening the target population in the community, with more than 200 effective attributes. Of the 300 cases, 3 have been diagnosed with stroke; the others show no stroke characteristics.
2. A data set of 300 confirmed patients accumulated over the years by a local hospital, all showing obvious stroke symptoms; some of the cases have died. These registration forms are organized according to the WS-375 standard.
Because these data contain many attribute fields, and many fields have missing values, the data can be used reliably only after they are cleaned, formatted, and normalized.
2.2 Data Processing
Data processing mainly deals with incomplete
values, incorrect values and attribute types of data to
make the data conform to the format required by the
model. These data are processed using the following
rules:
1. If all values of an attribute are missing, the attribute is treated as invalid and deleted.
2. For categorical variables: ordered categorical variables are assigned integer values directly; unordered categorical variables are one-hot encoded.
3. Attributes with partially missing data are handled case by case. If only a few instances of an attribute have values (less than 10%) and none of those instances are stroke patients, the attribute is deleted; otherwise, the missing values are imputed with the mean of the same class.
4. For out-of-range or incorrect data: numerical values are replaced with the closest valid value, while for nominal attributes distinct negative integers are used to mark them.
In addition to the above processing rules, some attribute data are also standardized and normalized.
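A minimal pandas sketch of rules 1–3 above follows; the label column name `stroke`, the function name, and the imputation details are illustrative assumptions, and rule 4's value correction is omitted because the valid range is attribute-specific:

```python
import numpy as np
import pandas as pd

def clean_screening_data(df: pd.DataFrame, label_col: str = "stroke") -> pd.DataFrame:
    """Sketch of the cleaning rules; column names are assumptions."""
    df = df.copy()
    # Rule 1: drop attributes whose values are all missing.
    df = df.dropna(axis=1, how="all")

    for col in list(df.columns):
        if col == label_col:
            continue
        filled = df[col].notna()
        # Rule 3: if fewer than 10% of instances have a value and none
        # of those instances are stroke patients, drop the attribute...
        if filled.mean() < 0.10 and df.loc[filled, label_col].sum() == 0:
            df = df.drop(columns=col)
        # ...otherwise impute missing numeric values with the mean of
        # the same class (same label value).
        elif df[col].dtype.kind in "if" and df[col].isna().any():
            df[col] = df.groupby(label_col)[col].transform(
                lambda s: s.fillna(s.mean())
            )

    # Rule 2: one-hot encode unordered categorical variables (ordered
    # ones are assumed to be mapped to integers beforehand).
    object_cols = [c for c in df.columns if df[c].dtype == object]
    return pd.get_dummies(df, columns=object_cols)
```

Standardization and normalization of the remaining numeric attributes can then be applied column-wise, e.g. z-score scaling.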
2.3 Establishing the Training and Test
Data Sets
The screening data contain only 3 confirmed cases. A machine learning algorithm would find it difficult to learn the typical characteristics of patients from such data, so no reasonable prediction model could be built. On the other hand, simply merging the 300 accumulated confirmed cases into the screening data set would distort the entire learning process. Therefore, this paper randomly mixes the confirmed data with the screening data to build a more reasonable data set of 400 instances, 30% of which are confirmed cases, so that a more reasonable and accurate prediction model can be established.
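The mixing step described above can be sketched as follows; the 400-instance size and 30% ratio come from the text, while the function name, column names, and sampling seed are illustrative assumptions:

```python
import pandas as pd

def build_dataset(screening: pd.DataFrame, confirmed: pd.DataFrame,
                  positive_ratio: float = 0.30, size: int = 400,
                  seed: int = 42) -> pd.DataFrame:
    """Randomly mix confirmed cases into the screening data so that
    roughly `positive_ratio` of the final `size` instances are patients."""
    n_pos = int(size * positive_ratio)   # 120 confirmed cases
    n_neg = size - n_pos                 # 280 screening cases
    pos = confirmed.sample(n=n_pos, random_state=seed)
    neg = screening.sample(n=n_neg, random_state=seed)
    # Shuffle so confirmed and screening records are interleaved.
    mixed = pd.concat([pos, neg]).sample(frac=1, random_state=seed)
    return mixed.reset_index(drop=True)
```

The shuffle matters because many learners (and any sequential train/test split) are sensitive to the ordering of positive and negative examples.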
After attribute processing, the number of available attributes is reduced from more than 300 to fewer than 100, and the data set contains about 400 records, including more than 130 patient records. See Table 1 for the data attributes.
Table 1: Number of attributes per category.
Categor