results appear in Sections 4 and 5. Finally,
Section 6 concludes the paper.
2 RELATED WORK
Recently, there has been increasing interest in
data stream management systems (DSMS) and their
related algorithms. Good overviews can be found in
(B. Babcock et al., 2002) and (L. Golab and M.T.
Ozsu, 2003). A number of academic projects have
also arisen, such as STREAM (B. Babcock et al., 2002),
Telegraph (Sirish Chandrasekaran and Michael J.
Franklin, 2002), Aurora (D. J. Abadi et al., 2003),
StatStream (Zhu Y and Shasha D, 2002), and
Gigascope (C. Cranor et al., 2002). The landmark
window model is one of the most popular window
models in data stream processing, and several data
stream algorithms over landmark windows have been
presented (S. Guha et al., 2001)(Guha N. and
Koudas K, 2002).
Random sampling has been proposed and used in
many different contexts of DSMS. A number of
specific sampling algorithms have been designed for
computing quantiles (M. Greenwald and S. Khanna,
2001), heavy hitters (G. Manku and R. Motwani,
2002), distinct counts (P. Gibbons, 2001), adaptive
sampling for convex hulls (S. Guha et al., 2001), and
the construction of synopsis structures (S. Guha et al.,
2001)(M. Datar et al., 2002). Many DSMSs
under development support random sampling,
including the DROP operator of Aurora (D. J. Abadi
et al., 2003), the SAMPLE keyword in STREAM
(B. Babcock et al., 2002), and the sampling functions in
Gigascope (C. Cranor et al., 2002). The classic
algorithm for maintaining an online random sample
is known as reservoir sampling (Vitter JS., 1985). It
makes one pass over the data set and is well suited to
the data stream model, but it has some drawbacks
when used directly for sampling from landmark
windows over data streams.
3 THE CLASSIC RESERVOIR
SAMPLING
Reservoir sampling (Vitter JS., 1985) solves the
problem of maintaining an online random sample of
size k from a pool of N data items, where the value
of N may be unknown. It makes only one sequential
pass over the data set and suits the data stream model
(B. Babcock et al., 2002)(S. Guha et al., 2001)(C.
Jermaine et al., 2004). Let k be the number of data
items in the sample R, and let n denote the number of
data items processed so far. The basic idea of reservoir
sampling can be described as follows (Vitter JS.,
1985)(T. Johnson et al., 2005):
Algorithm 1: The Classic Reservoir Sampling
Input: Data Stream S, k
Output: Sample R
1. Make the first k data items candidates for the sample R.
2. Process the rest of the data items in the following manner:
3.   At each iteration, generate an independent random variable ζ(k, n).
4.   Skip over the next ζ data items.
5.   Make the next data item a candidate by replacing one at random.
6. If the current number of candidates exceeds k, randomly choose a sample out of the reservoir of candidates.
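As a concrete illustration, the following is a minimal Python sketch of the simple variant of reservoir sampling (Algorithm R, which examines every arriving item instead of generating the skip variable ζ as in Algorithm 1; the function name and the `rng` parameter are choices of this sketch, not from the paper):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random sample of size k over a one-pass stream.

    Simple variant without the skip optimization: every item is examined,
    and the n-th item replaces a random candidate with probability k/n.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # first k items become candidates
        else:
            j = rng.randrange(n)        # uniform in 0..n-1
            if j < k:                   # happens with probability k/n
                reservoir[j] = item     # replace one candidate at random
    return reservoir
```

Each item in the stream ends up in the sample with equal probability k/n after n items have been processed, which is the invariant that makes the reservoir a uniform random sample at every point of the one-pass scan.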
The classic reservoir sampling can be used over
data streams to select a random sample of size k, but
it has serious drawbacks when applied directly to a
landmark window. First, reservoir sampling works
well when the incoming data contains only inserts
and updates but runs into difficulties if the data
contains deletions (S. Guha et al., 2001); deleting
data items in a landmark window is therefore
inefficient. Second, when the number of data items in
the landmark window exceeds the limited memory, a
data item is randomly selected for deletion. Older and
newer data items are treated equally, so a newer data
item may be deleted too early.
4 A STRATIFIED SAMPLING
ALGORITHM FOR
LANDMARK WINDOW
To overcome the drawbacks of reservoir sampling,
we use the basic window (BW) technique in
conjunction with reservoir sampling to present a
BW-based stratified multistage sampling algorithm
for landmark window (SMS Algorithm). Let T be the
temporal span of the landmark window W, and let
T_c be the time interval of W's updating cycle. We
divide the data items in W into k strata (or groups),
where S[i] denotes stratum i (i = 1, 2, …, k); the
temporal span of each stratum is equal to T_c/m (m
is a positive integer). f_0 denotes the sampling
fraction in the beginning, and f_r denotes the
re-sampling fraction.
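Under these definitions, the mapping from an item's arrival time to its stratum can be sketched as follows (the function name, the `window_start` parameter, and the clamping of out-of-range timestamps into the last stratum are assumptions of this sketch, not part of the paper's definition):

```python
def stratum_index(timestamp, window_start, T_c, m, k):
    """Map an item's timestamp to a stratum index in 0..k-1.

    Each stratum spans T_c/m time units, per the SMS setup above.
    Clamping late items into the last stratum is an assumption of
    this sketch.
    """
    span = T_c / m                                  # temporal span of one stratum
    idx = int((timestamp - window_start) // span)   # which stratum the item falls in
    return min(idx, k - 1)                          # clamp to the last stratum
```

Routing each arriving item to a stratum this way lets a separate per-stratum reservoir be maintained, so that older strata can later be discarded or re-sampled (at fraction f_r) without disturbing newer ones.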
ICEIS 2006 - DATABASES AND INFORMATION SYSTEMS INTEGRATION