create new sequences. They found frames
representing ‘transition points’ in the original
sequence. By selecting frames that did not end up at
‘dead-ends’, that is, places in the sequence from
which there are no graceful exits, they created very
realistic output sequences. However, they only
replayed already existing frames and had to rely on
morphing and blending to compensate for visual
discontinuities. Sequences that lacked similar,
temporally well-spaced frames were very difficult to
synthesize; many natural processes, such as fluids,
were therefore hard to reproduce. Kwatra et
al. introduced a new seam-finding and patch-fitting
technique for video sequences. They represented
video sequences as 3D spatio-temporal textures; two
such textures could be merged by computing a 2D
surface that acts as the optimal seam.
However, as in (Schödl et al., 2000),
they first found transition points by comparing the
frames of the input sequence. Then, within a window
of a few frames around each transition, they found an
optimal seam joining the two sequences, represented
as 3D textures. Since the method relies on
transitions, it sometimes needs very long input sequences to find
similar frames and a good seam. Both (Schödl et al.,
2000) and the approach of Kwatra et al. offered little
editability: the only parameter that could be
controlled was the length of the output sequence, and
even simple adjustments such as slowing down or
speeding up the motion could not be achieved. Only
techniques based on a model of the visual signal in
the input images provided the opportunity to control various aspects
of the output. Szummer and Picard (Szummer et al.,
1996) suggested a STAR model for generating
temporal textures using an Auto Regressive Process
(ARP). Fitzgibbon (Fitzgibbon et al., 2001)
introduced a model-based technique for creating
video textures by projecting the images into a low-
dimensional eigenspace, and modeling them using a
moving average ARP. Here, some of the initial
eigenvector responses (depicting non-periodic
motions, like panning) had to be removed manually.
Soatto et al. (Soatto et al., 2001) produced similar
work. They modeled dynamic textures as a Linear
Dynamic System (LDS) using either a set of
principal components or a wavelet filter bank. They
could model complex visual phenomena such as
smoke and water waves with a relatively low
dimensional representation. The use of a model not
only allowed for greater editing power but the
output sequences also included images that were
never a part of the original sequence. However, the
outputs were blurry compared to those from non-
procedural techniques, and for a few sequences the
signal would decay rapidly and the intensity would
saturate. Yuan et al. (Yuan et al., 2004) extended
this work by introducing feedback control and
modeling the system as a closed-loop LDS. The
feedback loop corrected the problem of signal
decay, but the generated output was still blurry,
because these models assume a linear correlation
between the input frames.
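As a concrete illustration, the kind of LDS identification used in this line of work can be sketched with a standard suboptimal recipe: take the top principal components of the frames as the output matrix C, and fit the system matrix A by least squares on consecutive hidden states. This is a minimal sketch under our own simplifications (the function name `fit_lds` and the plain SVD/least-squares procedure are ours, not the exact method of the cited papers):

```python
import numpy as np

def fit_lds(Y, r):
    """Fit y_t = C x_t + w_t, x_t = A x_{t-1} + v_t to a frame sequence.

    Y : (n, T) matrix whose columns are vectorized frames
    r : hidden-state dimension (number of principal components), r << n
    """
    # Output matrix C: top-r principal directions of the frame data.
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :r]                        # (n, r), orthonormal columns
    X = np.diag(S[:r]) @ Vt[:r, :]      # (r, T) hidden-state trajectory
    # System matrix A: least-squares fit of X[:, t] from X[:, t-1].
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, X
```

Because both C and A are obtained by linear projections and linear regression, the model can only capture linear correlations between frames, which is the source of the blur discussed above.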
In this paper, we propose a new modeling
framework that captures the non-linear
characteristics of the input. This yields crisp
output sequences comparable to those of the
procedural techniques, while offering better control
over the output through model parameters. The
organization of the paper is as follows: Section 2
describes the mathematical framework that forms
the basis of our model. Section 3 provides a
brief overview of Non-Linear Dimensionality
Reduction (NLDR). Section 4 describes the
technique used to model the dynamics of the
sequence in the embedded space. Section 5
describes the methodology for transforming the model
from the low-dimensional embedding space to the
observed image space. Finally, Section 6 presents the
results of using our framework on a diverse set of
input image sequences.
2 MODEL FRAMEWORK
In this section we summarize the mathematical
framework that we use for modeling the dynamic
texture. The existing image-based techniques for
creating dynamic textures ((Soatto et al., 2001),
(Fitzgibbon et al., 2001), (Szummer et al., 1996))
model the input visual signal using a linear dynamic
system of the following form:
\[
\begin{aligned}
x_t &= A x_{t-1} + v_t, & v_t &\sim \mathcal{N}(0, \Sigma_v)\\
y_t &= C x_t + w_t, & w_t &\sim \mathcal{N}(0, \Sigma_w)
\end{aligned}
\]
Here, $y_t \in \mathbb{R}^n$ is the observation vector;
$x_t \in \mathbb{R}^r$, with $r \ll n$, is the hidden state
vector; $A$ is the system matrix; $C$ is the output matrix;
and $v_t$, $w_t$ are the Gaussian white noises driving the
system. In such a system, the observation is a linear
function of the state. The limitation of this system is
that it captures only the linear correlation between
subsequent images; the lack of non-linear characteristics
leads to an output sequence that is not as crisp and
detailed as the input.
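For concreteness, generating new frames from such an LDS amounts to iterating the state equation from an initial state and emitting an observation at each step. The following is a minimal sketch under our own simplifications (the function name `synthesize_lds` is ours, and scalar noise levels `sigma_v`, `sigma_w` stand in for the full covariances $\Sigma_v$, $\Sigma_w$):

```python
import numpy as np

def synthesize_lds(A, C, x0, sigma_v, sigma_w, T, seed=0):
    """Roll out x_t = A x_{t-1} + v_t, y_t = C x_t + w_t for T steps.

    Returns an (n, T) matrix whose columns are synthesized frames.
    """
    rng = np.random.default_rng(seed)
    r, n = A.shape[0], C.shape[0]
    x = np.asarray(x0, dtype=float)
    frames = np.empty((n, T))
    for t in range(T):
        x = A @ x + sigma_v * rng.standard_normal(r)          # state update
        frames[:, t] = C @ x + sigma_w * rng.standard_normal(n)  # observation
    return frames
```

Note that every synthesized frame is a linear function of the state plus noise; this is exactly the linearity assumption whose consequences (blur, lack of detail) motivate the non-linear framework proposed here.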
VISAPP 2006 - IMAGE ANALYSIS