committed, the Channel removes the event from its own internal buffer. Sink components include hdfs, logger, avro, file, null, HBase, message queues, and so on. An Event is the smallest unit of streaming data defined by Flume.
2.2 Big Data Import/Preprocessing
Although many databases already exist on the collection side, effectively analyzing these massive data requires importing them into a centralized, large-scale distributed database or distributed storage cluster, and completing data cleaning and preprocessing as part of the import. Some users employ Storm from Twitter to process the data as a stream during import, in order to meet the real-time computing needs of certain businesses.
In the real world, data are generally incomplete, inconsistent "dirty" data that cannot be mined directly, or that yield unsatisfactory mining results; data preprocessing technology was developed to improve the quality of data mining. Data preprocessing mainly includes steps such as data cleaning, data integration, data transformation, and data reduction.
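To make these steps concrete, the sketch below shows what they might look like in Python with pandas; the file names, column names, and fill strategy are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Load two hypothetical source tables collected from different systems.
orders = pd.read_csv("orders.csv")   # e.g. order_id, user_id, amount
users = pd.read_csv("users.csv")     # e.g. user_id, age, region

# Data cleaning: drop duplicates, fill or remove missing values.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(orders["amount"].median())
users = users.dropna(subset=["user_id"])

# Data integration: merge the two sources on a common key.
df = orders.merge(users, on="user_id", how="inner")

# Data transformation: normalize a numeric column to the [0, 1] range.
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)

# Data reduction: keep only the columns needed for the subsequent analysis.
df = df[["user_id", "age", "region", "amount_norm"]]
```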
Figure 3: Big data preprocessing process.
Usually, the data we obtain rarely meet our expectations: values may be missing, accuracy may be poor, there may be too many indicators, and so on. Only after a series of analyses and data manipulations can we obtain the data we actually want. Therefore, an important step, data preprocessing, becomes particularly important. Big data preprocessing is critical: data quality is the life of data, and data preprocessing is the key to data quality. The big data preprocessing flow chart (Figure 3) shows that data preprocessing is mainly divided into five steps: data exploration, data cleaning, data integration, data reduction, and data transformation (Wang, 2022).
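The data exploration step usually comes first; the following minimal pandas sketch (with a hypothetical input file and columns) summarizes shape, distributions, missing values, and correlations before any cleaning decision is made.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical raw export from the collection side

# Data exploration: basic shape, types, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Missing values per column guide the later cleaning step.
print(df.isnull().sum())

# Highly correlated numeric indicators are candidates for data reduction.
print(df.select_dtypes("number").corr())
```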
2.3 Big Data Statistics/Analysis
Statistics and analysis mainly use distributed databases or distributed computing clusters to perform general analysis and subtotals of the massive data stored in them, meeting most common analysis needs. For real-time requirements, systems such as EMC's Greenplum, Oracle's Exadata, and the MySQL-based columnar store Infobright can be used, while batch-oriented or semi-structured-data requirements can be handled with Hadoop. The main feature and challenge of statistics and analysis is that the amount of data involved is large, which heavily consumes system resources, especially I/O.
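As an illustration of this kind of aggregation over cluster-resident data, here is a minimal sketch using PySpark, which is one common engine for such jobs on Hadoop/HDFS but is not named in the text; the path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("subtotal-analysis").getOrCreate()

# Read a hypothetical event table stored on HDFS.
events = spark.read.parquet("hdfs:///warehouse/events")

# General analysis and subtotals: record counts and average value per region.
subtotals = (
    events.groupBy("region")
          .agg(F.count("*").alias("record_count"),
               F.avg("value").alias("avg_value"))
)

subtotals.show()
spark.stop()
```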
Figure 4: Student's t-test.
For example, consider the hypothesis test commonly used for evaluation, the Student's t-test (Figure 4). It assumes that the two populations are normally distributed and have equal but unknown variances. In this case, the t-statistic T follows a t-distribution with n1 + n2 - 2 degrees of freedom (df). The farther T is from zero, to the point that observing such a T-value would be highly unlikely, the greater the difference between the groups; if |T| is too large, the null hypothesis is rejected. According to the formula, the greater the difference in the means, the larger T is, and the larger the population variance, the smaller T is.
Significance level α: the likelihood that the null hypothesis is rejected when it is actually true. For a small probability such as α = 0.05, we look for the value T* such that P(|T| ≥ T*) = 0.05. After sampling and computing the observed value according to the formula, if |T| ≥ T*, the null hypothesis (μ1 = μ2) is rejected. A Student's t-test that considers both μ1 > μ2 and μ1 < μ2 is called a two-tailed (bilateral) hypothesis test, and the sum of the tail probabilities in the two tails of the t-distribution should equal the significance level. In most cases, the significance level is divided evenly between the two tails (0.05 / 2 = 0.025), where −t and t are both observed values of the t-statistic.
p-value: the sum of P(T ≤ −t) and P(T ≥ t). If the null hypothesis is true, the p-value gives the likelihood of observing |T| ≥ t, that is, the probability under the null hypothesis of obtaining a sampling result at least as extreme as the one observed. Therefore, when the p-value is smaller than the significance level, the null hypothesis can be rejected; otherwise it cannot. A confidence interval is an interval estimate of a population parameter based on the sample data.
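To make the procedure concrete, the following minimal Python sketch (with hypothetical sample values) computes the pooled-variance t-statistic, its degrees of freedom, and the two-tailed p-value described above, both directly from the formula and with scipy.stats.ttest_ind.

```python
import numpy as np
from scipy import stats

# Two independent samples (hypothetical data).
group1 = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4])
group2 = np.array([4.6, 4.4, 4.9, 4.7, 4.5, 4.8])

n1, n2 = len(group1), len(group2)

# Pooled-variance t-statistic, assuming equal but unknown population variances.
sp2 = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
t_obs = (group1.mean() - group2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

# Two-tailed p-value: P(|T| >= |t|) under a t-distribution with df degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_obs), df)

# The same test via SciPy.
t_scipy, p_scipy = stats.ttest_ind(group1, group2, equal_var=True)

alpha = 0.05
print(f"t = {t_obs:.3f}, df = {df}, p = {p_value:.4f}")
print(f"reject H0 (mu1 == mu2) at alpha = {alpha}: {p_value < alpha}")
```

With equal_var=True, scipy.stats.ttest_ind performs the pooled (Student's) version of the test assumed here; unequal variances would call for Welch's variant instead.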