
 
results appear in sections 4 and 5. Finally,
section 6 concludes the paper.
2 RELATED WORK 
Recently, there has been growing interest in
data stream management systems (DSMS) and their
related algorithms. Good overviews can be found in
(B. Babcock et al., 2002) and (L. Golab and M.T.
Ozsu, 2003). A number of academic projects have
also emerged, such as STREAM (B. Babcock et al.,
2002), Telegraph (Sirish Chandrasekaran and
Michael J. Franklin, 2002), Aurora (D. J. Abadi et
al., 2003), StatStream (Zhu Y and Shasha D, 2002),
Gigascope (C. Cranor et al., 2002), etc. The
landmark window model is one of the most popular
window models in data stream processing, and
several data stream algorithms over landmark
windows have been presented (S. Guha et al.,
2001)(Guha N. and Koudas K, 2002).
Random sampling has been proposed and used in
many different contexts of DSMS. A number of
specific sampling algorithms have been designed for
computing quantiles (M. Greenwald and S. Khanna,
2001), heavy hitters (G. Manku and R. Motwani,
2002), distinct counts (P. Gibbons, 2001), adaptive
sampling for convex hulls (S. Guha et al., 2001), and
the construction of synopsis structures (S. Guha et
al., 2001)(M. Datar et al., 2002), etc. Many DSMSs
under development support random sampling,
including the DROP operator of Aurora (D. J. Abadi
et al., 2003), the SAMPLE keyword in STREAM
(B. Babcock et al., 2002), and the sampling functions
in Gigascope (C. Cranor et al., 2002). The classic
algorithm for maintaining an online random sample
is known as reservoir sampling (Vitter JS., 1985). It
makes one pass over the data set and is well suited
to the data stream model, but it has some drawbacks
when used directly for sampling from landmark
windows over data streams.
3  THE CLASSIC RESERVOIR 
SAMPLING 
Reservoir sampling (Vitter JS., 1985) solves the
problem of maintaining an online random sample of
size k from a pool of N data items, where the value
of N may be unknown. It makes only one sequential
pass over the data set and suits the data stream model
(B. Babcock et al., 2002)(S. Guha et al., 2001)(C.
Jermaine et al., 2004). Let k be the number of data
items in sample R, and let n denote the number of
data items processed so far. The basic idea of
reservoir sampling can be described as follows
(Vitter JS., 1985)(T. Johnson et al., 2005):
 
Algorithm 1: The Classic Reservoir Sampling
Input: Data Stream S, k
Output: Sample R
1. Make the first k data items candidates for the sample R;
2. Process the rest of the data items in the following manner:
3. At each iteration, generate an independent random variable ζ(k, n);
4. Skip over the next ζ data items;
5. Make the next data item a candidate by replacing one at random;
6. If the current number of candidates exceeds k, randomly choose a sample out of the reservoir of candidates.
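The steps above follow Vitter's skip-based formulation. As a minimal sketch, the simpler per-item variant (Vitter's Algorithm R) yields the same uniform sample of size k without the skip counter ζ; the function name below is illustrative, not from the paper:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a one-pass stream."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # the first k items become candidates
        else:
            j = random.randrange(n)     # uniform index in [0, n-1]
            if j < k:                   # current item survives with probability k/n
                reservoir[j] = item
    return reservoir
```

Each of the N items ends up in the final sample with probability k/N, matching the guarantee of Algorithm 1; the skip variable ζ only reduces the number of random-number generations, not the sampling distribution.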
The classic reservoir sampling algorithm can be
used over data streams to select a random sample of
size k, but it has serious drawbacks when applied
directly to a landmark window. First, reservoir
sampling works well when the incoming data
contains only inserts and updates, but runs into
difficulties if the data contains deletions (S. Guha et
al., 2001); deleting data items from a landmark
window is therefore inefficient. Second, when the
number of data items in the landmark window
exceeds the limited memory, a data item is randomly
selected for deletion: older and newer data items are
treated equally, so a newer data item may be deleted
too early.
4 A STRATIFIED SAMPLING 
ALGORITHM FOR 
LANDMARK WINDOW 
To overcome the drawbacks of reservoir sampling,
we use the basic window (BW) technique in
conjunction with reservoir sampling to present a
BW-based stratified multistage sampling algorithm
for the landmark window (the SMS algorithm). Let T
be the temporal span of the landmark window W,
and let T_c be the time interval of W's updating
cycle. We divide the data items in W into k strata (or
groups); S[i] denotes stratum i (i = 1, 2, ..., k), and
the temporal span of each stratum is equal to T_c/m
(m is a nonnegative integer). f_0 denotes the
sampling fraction in the beginning, and f_r denotes
the re-sampling fraction.
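As an illustration of this stratified layout only, the sketch below routes arriving items into time-based strata and keeps an independent reservoir per stratum; the class name, method names, and per-stratum reservoir size are assumptions for illustration, not the paper's definitions of the SMS algorithm:

```python
import random

class StratifiedWindowSampler:
    """Hypothetical sketch: one small reservoir per time stratum of a landmark window.

    stratum_span plays the role of T_c/m from the text; all names here are
    illustrative assumptions rather than the authors' definitions.
    """
    def __init__(self, stratum_span, per_stratum_size):
        self.span = stratum_span
        self.size = per_stratum_size
        self.strata = {}                      # stratum index -> (count, reservoir)

    def insert(self, timestamp, item):
        i = int(timestamp // self.span)       # assign the item to its time stratum
        count, res = self.strata.get(i, (0, []))
        count += 1
        if len(res) < self.size:
            res.append(item)                  # stratum reservoir not yet full
        else:
            j = random.randrange(count)
            if j < self.size:                 # keep with probability size/count
                res[j] = item
        self.strata[i] = (count, res)
```

Because each stratum is sampled independently, dropping an expired stratum or re-sampling a stratum at a different fraction touches only that stratum's reservoir, avoiding the whole-window deletions that make classic reservoir sampling inefficient here.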
ICEIS 2006 - DATABASES AND INFORMATION SYSTEMS INTEGRATION