0.5 and 5. The tool produces excellent audio output quality, but it introduces a delay that is intolerable for real-time processing. Sound Forge is used for comparison with this paper's proposal in Section 7.
Windows Media Player 10 (Microsoft, 2006) supports many audiovisual formats, such as MP3 and WMA. The tool allows the user to choose the adjustment factor applied during playback and to change this factor at presentation time. Although the adjustment factor can assume values between 0.06 and 16, the program specification suggests a range between 0.5 and 2.0 to keep media quality high. Although not documented, it is almost certain that the time-scaling processing takes place after decoding, for two reasons: it is not possible to save the processed audio in a compressed format; and, since the algorithm is built for a specific player, it is simpler to apply time scaling just before presentation.
The MPEG-4 audio specification defines a presentation tool called PICOLA (Pointer Interval Controlled OverLap Add), which performs time-based adjustments after decompression on mono audio sampled at 8 kHz or 16 kHz (ISO, 2001).
FastMPEG (Covell, 2001) is a time-scaling proposal that explores a partial decoding/encoding strategy. It describes three time-based algorithms for on-the-fly adjustment of MP2 streams. The algorithms are performed after a partial decoding of the audio stream and are followed by a partial re-encoding. The adjustment factor varies between 2/3 and 2.0.
None of the aforementioned time-scaling algorithms is applied directly to the compressed stream. Instead, streams are decoded (at least partially), processed, and possibly encoded again. The solutions are complex and decoder dependent. The algorithms allow a large range for the tuning factor f, which is perhaps one of the reasons that guided their implementation. However, this range is achieved at the cost of depending on the presentation tool or of using non-real-time computation.
Unlike all the works mentioned above, this paper proposes a time-scaling algorithm for compressed audio streams that is simple enough to be executed at presentation time. The algorithm is performed directly on the compressed data, supports tuning-factor variation, and is independent of the decoder (and thus of the player) implementation. Due to the intentional simplicity of the proposed algorithm, its tuning factor is limited to the range [0.90, 1.10].
Indeed, this paper proposes a framework for a class of algorithms, that is, a meta-algorithm. The framework is instantiated as a set of format-dependent algorithms, described in this paper, and implemented as a library, called the HyperAudioScaling tool, which can be easily integrated into third-party applications. The media formats handled by the library are MPEG-1 audio (ISO, 1993), MPEG-2 systems (ISO, 2000) and audio (ISO, 1998; ISO, 1997), MPEG-4 AAC audio (ISO, 2001), and AC-3 (ATSC, 1995). These standards were chosen because they have been widely used in commercial applications, such as those for digital and interactive TV, and also in different audiovisual formats, such as VCD, SVCD and DVD.
3 AUDIO TIME-SCALING ALGORITHM
Many high-quality audio formats deal with audio streams as a sequence of frames (or segments). Every frame has a header and a data field, and is associated with a logical data unit (LDU). A logical data unit is composed of a set of coded audio samples, gathered during a small time interval (typically about 30 ms), concatenated with auxiliary bits (called PAD). The number of PAD bits is not limited, and these bits are generally used to carry metadata.
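As an illustration only, the structure just described might be modeled as in the Python sketch below; the type and field names are hypothetical and are not part of the HyperAudioScaling library.

from dataclasses import dataclass

@dataclass
class LogicalDataUnit:
    """Coded audio samples for a short interval (~30 ms) plus PAD bits."""
    coded_samples: bytes
    pad_bits: bytes = b""  # unlimited in size; generally carries metadata

@dataclass
class Frame:
    """A frame: a header followed by a data field."""
    header: bytes      # located at the beginning of the frame
    data_field: bytes  # may hold one LDU, several LDUs, or part of an LDU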
Although associated with a specific frame, an LDU does not need to be carried in the data field of that frame. Alternatively, the LDU can borrow bits from the data fields of previous frames (the bit reservoir, in MPEG nomenclature) and be transported partially or entirely in previous frames. The maximum size of the bit reservoir is limited. Thus, a data field can contain one LDU, several LDUs, or part of an LDU. Figure 1 shows an audio stream with frames separated by vertical lines. In each frame, the header bytes, shown striped, are located at the beginning of the frame and are followed by the data field. The figure also depicts the LDU of each frame.
Figure 1: Frame representation of a compressed audio stream.
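To make the bit-reservoir mechanism concrete, the following sketch locates the start of a frame's LDU inside the data fields of previous frames. The function and parameter names are hypothetical; `main_data_begin` mimics the backward byte offset that MPEG audio uses to address the bit reservoir, and the sketch simplifies by counting back through data fields only.

def locate_ldu_start(data_fields, i, main_data_begin):
    """Return (frame_index, byte_offset) where the LDU of frame i begins.

    data_fields: list holding the data field (bytes) of each frame.
    main_data_begin: backward offset, in bytes, from the start of frame
    i's data field through the bit reservoir."""
    back = main_data_begin
    j = i
    while back > 0:
        j -= 1                          # LDU borrows bits from earlier frames
        if back <= len(data_fields[j]):
            return j, len(data_fields[j]) - back
        back -= len(data_fields[j])
    return i, 0                         # no borrowing: LDU starts in frame i

# Example: frame 2's LDU begins 10 bytes back, inside frame 1's data field.
print(locate_ldu_start([b"\x00" * 20, b"\x00" * 30, b"\x00" * 30], 2, 10))  # (1, 20)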
The HyperAudioScaling algorithm is based on a well-known time-based algorithm called granular synthesis (Gabor, 1946). However, an important difference must be pointed out. Since time scaling must be executed without decoding the compressed audio (to recover the audio samples), and since an LDU may be coded in the frequency domain, the chosen adjustment unit is the LDU (and not samples, as in Gabor's proposal). Thus, HyperAudioScaling