Introduction to bitrate reduction of sound data
In its most directly encoded form (voltage levels from a microphone digitized and stored without any further encoding), digital audio is a high-bandwidth medium. Many thousands of times a second, a voltage value (or two for stereo) is converted (sampled) to a digital representation of its distance (negative or positive) from zero voltage. This is known as PCM (Pulse Code Modulation) digital audio. Red Book audio CDs (the standard format for CDs) sample voltage values 44.1 thousand times a second, each time encoding a pair of voltage values (one for the left channel and one for the right channel) into two 16-bit values (Erickson, 1994). Therefore, CD audio consumes 44,100 * 16 * 2 = 1,411,200 bits (about 172 kilobytes) per second. One minute of CD-quality PCM requires over 10 megabytes of storage.
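The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name pcm_bitrate is my own, and kilobytes and megabytes are taken here as 1,024 and 1,048,576 bytes, which is the assumption that matches the 172 kilobyte figure.

```python
def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Return the raw PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# CD audio: 44,100 samples per second, 16 bits per sample, 2 channels.
cd_bits_per_second = pcm_bitrate(44_100, 16, 2)
cd_bytes_per_second = cd_bits_per_second // 8

# Assumption: 1 kilobyte = 1,024 bytes and 1 megabyte = 1,048,576 bytes.
cd_kib_per_second = cd_bytes_per_second / 1024
cd_mib_per_minute = cd_bytes_per_second * 60 / (1024 * 1024)

print(cd_bits_per_second)           # bits per second
print(round(cd_kib_per_second, 1))  # kilobytes per second
print(round(cd_mib_per_minute, 2))  # megabytes per minute
```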
For some purposes PCM is a good choice. Because of its simplicity, very low cost hardware can both encode and decode PCM in real time. Of all encoding schemes, PCM is the most direct representation of the digitized source, and it discards the least auditory data from the signal. This does not mean that all PCM encoded audio is high fidelity. A lot of noise can be introduced into the signal (and higher frequencies lost) when it is digitized at low sampling rates with few bits used to encode each voltage value. However, at bitrates at or above that of CD audio, the quality rivals and can exceed that of analog technologies. PCM is ideal for professional uses because encoding audio at high bitrates introduces very little noise. In digital audio editing, every time a modification is made to the signal, the same amount of redigitizing noise is added to the result. A more compact representation might sound as good as PCM on the first encoding, but with each modification the encoding noise is compounded and may significantly reduce the overall quality of the signal after several digitization cycles. When storage space and bandwidth are plentiful, PCM is clearly the best format for storing audio for further editing. When audio is only needed for playback, however, PCM becomes less appealing because of its storage requirements.
One alternative is to utilize the mathematical properties of the PCM bitstream, applying a transform that produces an exact representation of the same bits in a more compact form. Compression software (PKZIP, StuffIt) designed for compressing text files and other computer data files can also be used on PCM audio. When used on text these systems reduce file size by 60% to 80%. For CD quality PCM, however, the resulting files are only reduced by 10% (Gilchrist, 1999a & 1999b).
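As a small illustration of what "exact representation" means, the sketch below runs a general-purpose compressor over PCM bytes and confirms the round trip is bit-for-bit identical. It uses Python's zlib module as a stand-in for the archivers named above, and a synthetic sine tone as the input (the helper sine_pcm16 is my own). A pure tone is far more regular than real music, so the ratio it prints says nothing about the 10% figure.

```python
import math
import struct
import zlib


def sine_pcm16(freq_hz: float, seconds: float, sample_rate: int = 44_100) -> bytes:
    """Generate mono 16-bit little-endian PCM for a sine tone (illustrative input only)."""
    count = int(seconds * sample_rate)
    samples = (
        int(20_000 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(count)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


pcm = sine_pcm16(440.0, 0.25)
packed = zlib.compress(pcm, 9)       # general-purpose lossless compression
restored = zlib.decompress(packed)

# Lossless means the decoded bytes are identical to the original bytes.
is_exact = restored == pcm
ratio = len(packed) / len(pcm)
print(is_exact, round(ratio, 3))
```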
When
tuned to the characteristics of PCM audio, lossless compression improves by a
significant margin. In one cross comparison of several audio sources and
compression technologies (Whittle, 1998), some sources were reduced by as much
as 66%. Other sources did not fare
as well, only reducing by 25%, with the average across encoders and sources
tending towards 50% reduction in file size.
While 86 kilobytes per second is an improvement over 172, it is still
too large for many applications, most notably transferring audio over the
Internet. To enable further
compression, current solutions rely on encoding the PCM bitstream in a
representation which cannot exactly reproduce the original bitstream.
Such technology is termed lossy compression because between the encoding
and the decoding cycle, some of the audio data is lost.
In its simplest form, this is done by dropping out some of the audio
samples in the bit stream, which removes the higher frequency spectral content
while maintaining the lower frequencies. For
example, halving the number of samples will cut off the top half of the
frequencies a PCM stream can represent. Depending
on the audio in question the higher frequencies can play a small or large role;
indeed if the original has no frequencies in the range cut out by lowering the
sample rate, no loss of audio quality is perceived.
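A minimal sketch of this sample-dropping idea follows. The function names are my own. It only shows the naive step described here; a practical converter would low-pass filter the signal before dropping samples so that the removed frequencies do not fold back into the result as aliasing.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a PCM stream at this sample rate can represent."""
    return sample_rate_hz / 2


def drop_every_other(samples: list) -> list:
    """Naive 2:1 reduction: keep samples 0, 2, 4, ... and discard the rest."""
    return samples[::2]


original_rate = 44_100
halved_rate = original_rate // 2

print(nyquist_hz(original_rate))  # top frequency before halving
print(nyquist_hz(halved_rate))    # top frequency after halving
print(drop_every_other([10, 11, 12, 13, 14]))
```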
Another simple solution is encoding each sample with less resolution (a process known as quantization). Whenever digital audio is encoded, it has to be quantized to a certain level, since an analog voltage amplitude can take an infinite number of distances from 0, whereas 16-bit audio can only encode 65,536 discrete voltage values. While it does reduce the audio quality, dropping to 8 bits per sample (256 unique voltage values) still maintains a recognizable signal and reduces the total bitrate by half.
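The reduction from 16 to 8 bits can be sketched as below. This is the crudest possible version (simply discarding the low 8 bits of each signed sample); the function names are my own, and real converters usually add dither, which is outside what is described here.

```python
def requantize_16_to_8(sample: int) -> int:
    """Map a signed 16-bit sample (-32768..32767) to signed 8 bits (-128..127)."""
    if not -32768 <= sample <= 32767:
        raise ValueError("sample is outside the signed 16-bit range")
    return sample >> 8  # drop the 8 least significant bits


def expand_8_to_16(sample: int) -> int:
    """Scale an 8-bit sample back to the 16-bit range; the dropped bits stay lost."""
    return sample << 8


levels_16 = 2 ** 16  # distinct values a 16-bit sample can take
levels_8 = 2 ** 8    # distinct values an 8-bit sample can take

for original in (32767, 1000, -1, -32768):
    coarse = requantize_16_to_8(original)
    restored = expand_8_to_16(coarse)
    print(original, coarse, restored, original - restored)
```

The last column is the quantization error: with this truncating scheme it always lies between 0 and 255.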
Neither of the above bitrate reduction methods is at all intelligent about which parts of the audio are discarded; what is lost is a byproduct of how it was easiest to store digitized voltage values. Better lossy encoding schemes rely on removing parts of the audio which the mind/brain does not and/or cannot attend to.
Primarily, these lossy encoding schemes rely on frequency masking (sound at one frequency and volume masking another at a different frequency and volume). Some systems also exploit the reduced sensitivity to stereo detail for sounds at higher frequencies. Based on these two theories of audio perception, popular compression algorithms achieve 10 to 1 bitrate reductions with minimal perceived quality loss.
The psychoacoustic theories of bitrate reduction, in depth
By far the most important psychoacoustic theory used by perceptual audio coding is that of masking. Most simply, this is the result of hearing two sounds at different energy levels at nearby frequencies, whereby one sound obscures the perception of the other. Most current theories state that the incoming spectrum is split (filtered) into separate bands (known as critical bands), the number and size of which vary by theory, although all agree that the bands are arranged logarithmically, with most bands occurring at lower frequencies. This logarithmic arrangement corresponds to physical locations on the basilar membrane that each respond to limited ranges of frequencies, with more space on the basilar membrane devoted to the lower frequencies (Moore, 1997).
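To make the uneven spacing of critical bands concrete, the sketch below converts frequency in Hz to a position on the Bark scale, one common numbering of critical bands. The formula is a widely used analytic approximation usually attributed to Zwicker and Terhardt; it does not come from the sources cited in this paper, so treat it as an assumption for illustration only.

```python
import math


def hz_to_bark(freq_hz: float) -> float:
    """Approximate critical-band position (in Bark) for a frequency in Hz."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


# The same 100 Hz step covers far more of the scale at low frequencies.
low_span = hz_to_bark(200.0) - hz_to_bark(100.0)
high_span = hz_to_bark(10_100.0) - hz_to_bark(10_000.0)

print(round(hz_to_bark(1000.0), 2))
print(round(low_span, 3), round(high_span, 3))
```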
Frequencies within a critical band do not interact with sounds in other critical bands, and the mind/brain can selectively attend to all the bands at once, or just the ones of interest (say, where the frequencies of speech are concentrated). Within a band, however, there is less ability to differentiate between co-occurring sounds. In fact, frequencies of significantly higher volume will partially, or even completely, occlude (mask) the perception of other frequencies within the same band. In addition, lower frequency sounds mask higher frequency sounds better than the reverse. Finally, the masking ability of a given sound also varies depending on its tonality and noisiness (Brandenburg, 1996).
Tonal sounds are simple, repeating sounds, which occupy a small range of frequencies at any one time. Conversely, noisy sounds are much more complex, the most extreme example being a random distribution of energy spread over a wide range of frequencies. As a rule, noisy sounds are much better at masking than tonal sounds. There are two theories for this. From a neural viewpoint, the noisy sound swamps the neurons that react to that critical band, the patterns of activation changing so frequently that they hide the static activation caused by any tonal sounds (Moore, 1997). From a physical standpoint, sounds interfere with each other on the basilar membrane; noisy sounds contain more frequencies than tonal sounds, and therefore stimulate the same limited area on the membrane more than a tonal sound does (note: this is a theory put together from what I’ve read in many sources; I do not have a single source which explicitly states this).
Another psychoacoustic characteristic which is used in bitrate reduction coding is stereo imaging. Due to the nature of most stereo sound recordings, the sound in the left channel is correlated with the sound in the right channel. This relationship is not crucial to the sound; when the two channels are downmixed into one center channel, the original content is still recognizable. However, the information about the spatial locations of individual sounds is lost. Not all the differences between the two channels are necessary, however, to maintain our sense of location. In particular, the human auditory system is less sensitive to the details of the higher frequency critical bands (above 2 kHz), deriving most of the localization information from time delay between signal onset and volume (Pan, 1995).
In depth discussion of technologies of bitrate reduction audio encoding
MP3 technology in depth
In
its traditional form MP3 encoding works on one channel at a time, so encoding a
stereo file is identical to encoding two mono files and storing the encoded
result in one file. Not only does
encoding stereo take twice as long, but it also results in a file twice as
large. The following explains how
one channel of audio is encoded.
At its heart, MP3 encoding consists of two major steps. First, the uncompressed PCM bitstream is filtered into 32 equal-width, overlapping bands by a polyphase filter; these only roughly approximate the critical bandwidths (CB). Then each band is further subdivided into 18 sub-bands (for a total of 576 frequency bands) by a Modified Discrete Cosine Transform filter (Brandenburg & Stoll, 1994). Note that this combination of filters is used because it is both reasonably fast and gives enough detail to determine masking accurately, not because it has any direct correlation with how we think auditory perception works.
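The second filtering stage can be sketched with a direct, unoptimized Modified Discrete Cosine Transform. This is the textbook definition of the transform (2N input samples in, N coefficients out), not the windowed, fast implementation an actual encoder would use; the function name and the test signal are my own.

```python
import math


def mdct(block: list) -> list:
    """Direct O(N^2) MDCT: takes 2N samples and returns N frequency coefficients."""
    if len(block) % 2:
        raise ValueError("block length must be even")
    n = len(block) // 2
    return [
        sum(
            block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for t in range(2 * n)
        )
        for k in range(n)
    ]


SUBBANDS = 32           # outputs of the polyphase filter
LINES_PER_SUBBAND = 18  # MDCT coefficients produced per subband

# 36 samples from one subband (an arbitrary test tone) give 18 coefficients.
subband_block = [math.sin(2 * math.pi * 3 * t / 36) for t in range(36)]
coefficients = mdct(subband_block)

print(len(coefficients), SUBBANDS * LINES_PER_SUBBAND)
```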
In
the second step, the bands are categorized as either noisy or tonal, and based
on the perceptual coding model, judged for their ability to mask other bands
within the same CB. The bands which
are totally masked are discarded and the bands which are only partially masked
are set aside for further processing.
For each of these remaining bands, the perceptual model calculates how much noise can safely be inserted into the band, such that the noise will still remain masked by other signals. The noise comes from encoding each sample within the band with a lower resolution (i.e., quantization).
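One rough way to picture this trade is the rule of thumb that each bit of uniform quantization buys about 6 dB of signal-to-noise ratio. The sketch below uses that rule to turn a band's signal-to-mask ratio into a bit count. The rule, the function name, and the example values are my own assumptions; a real MP3 encoder reaches its allocation through a more involved iterative process.

```python
import math

DB_PER_BIT = 6.02  # assumed signal-to-noise gain per bit of uniform quantization


def bits_needed(signal_to_mask_db: float) -> int:
    """Fewest bits that keep quantization noise at or below the masking threshold."""
    if signal_to_mask_db <= 0:
        return 0  # the band is fully masked and can be discarded
    return math.ceil(signal_to_mask_db / DB_PER_BIT)


# Hypothetical signal-to-mask ratios (dB) for four bands.
for smr_db in (-3.0, 6.02, 12.0, 30.0):
    print(smr_db, bits_needed(smr_db))
```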
One problem with this is that filtering audio into frequency bands always reduces the time resolution of each band. While this cannot be completely solved, the MP3 standard addresses it by using four types of filter windows. The normal window is 1024 samples long, with ½ overlap between successive windows, and is shaped like a bell curve. The other windows are another bell curve 1/3 as large, a bell curve skewed towards the beginning of the window, and another skewed towards the end. Whenever the encoder determines that the normal sized window is nearing a section of sound with a lot of transient energy, the encoding switches to a window shape that captures as small a part of the transient energy at a time as possible (Brandenburg & Stoll, 1994).
Without this, whenever the encoded bitstream nears a transient moment, part of that signal starts to be heard before the actual event (this is known as pre-echo). The problem with switching window sizes, however, is that smaller windows require more encoded bits per second. To keep the total bits per second constant, the encoding system normally uses slightly less than the full bandwidth allowed, and then in times of transience spends the reserve bits on smaller window sizes.
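The idea of saving bits during quiet passages and spending them on transients can be sketched as a simple running balance. Everything here (the function name, the per-frame budget, the cap on the reserve, and the example demands) is hypothetical and chosen only to show the bookkeeping, not the actual MP3 frame format.

```python
def allocate_with_reservoir(demands: list, frame_budget: int, reservoir_cap: int) -> list:
    """Grant each frame its demand, limited by the frame budget plus saved bits."""
    reservoir = 0
    granted = []
    for demand in demands:
        available = frame_budget + reservoir
        used = min(demand, available)
        granted.append(used)
        # Unused bits are carried forward, up to the cap.
        reservoir = min(available - used, reservoir_cap)
    return granted


# Two steady frames save bits; the third (a transient) spends the reserve.
demands = [800, 800, 1400, 1000]
granted = allocate_with_reservoir(demands, frame_budget=1000, reservoir_cap=500)

print(granted, sum(granted))
```

The total granted never exceeds the number of frames times the per-frame budget, so the average bitrate stays constant even though individual frames vary.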
When music is not highly transitory, but rather filled with silence or constant tones, traditional run length encoding and Huffman encoding can be used to reduce the bitrate. MP3 layers these two lossless compression algorithms on top of the lossy compression; however, they only decrease the resulting file size by about 10 percent.
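Run length encoding is simple enough to show in full. The sketch below is a generic version that stores each value together with how many times it repeats; it is not the specific scheme used inside MP3, and the function names are my own.

```python
def rle_encode(values: list) -> list:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def rle_decode(runs: list) -> list:
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


quantized = [0, 0, 0, 0, 5, 5, 0]
encoded = rle_encode(quantized)

print(encoded)
print(rle_decode(encoded) == quantized)
```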
All
of the above applies to mono MP3 files and files stored in standard stereo mode.
Although more computationally demanding to encode, two additional systems
exist that can improve audio quality for a given bitrate.
The first is very simple. Usually the audio in both channels of a stereo file is highly related. One way to take advantage of this is to encode three channels: Right, Left, and Center. The center channel contains the left and right channels from the original file, mixed together. The encoded left channel specifies how the original left channel differs from the encoded center channel. To retrieve playback data for the left channel, subtract the encoded left channel from the encoded center channel. The process for the right channel is analogous. As long as the two original channels are highly correlated, the amount of difference that must be coded for the left and right channels is minimal, and can be encoded with fewer bits than one normal channel would require, leaving more bits to represent the middle channel (Brandenburg, 1996).
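A compact way to see why correlated channels are cheap to code is the common sum and difference (mid/side) formulation sketched below. Note that it stores two channels rather than the three described above, so it is a related variant that I am assuming for illustration, not a transcription of the scheme in the text; the function names are my own.

```python
def ms_encode(left: list, right: list) -> tuple:
    """Convert left/right samples into mid (average) and side (half difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid: list, side: list) -> tuple:
    """Recover left/right samples from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Highly correlated channels: the side signal stays tiny next to the mid signal.
left = [100, 102, -50]
right = [98, 100, -52]
mid, side = ms_encode(left, right)
decoded_left, decoded_right = ms_decode(mid, side)

print(mid, side)
print(decoded_left, decoded_right)
```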
The other method is based more on psychoacoustics. As I have stated before, the human auditory system is less sensitive to the stereo content of the higher frequency critical bands, deriving most of the localization information from time delay between signal onset and volume. For bands that fall into this category, the left and right channel frequency values are mixed together. For later retrieval of the lost stereo information, the volume envelopes of the two channels are saved. Then on playback the mixed band is inserted back into both the left and right channels, with its volume over time controlled by the respective volume envelope (Brandenburg, 1996; Chen et al., 1998).
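The mix-and-rescale idea can be sketched for a single band and a single block of time as follows. Using one RMS level per channel is my own simplification standing in for the volume envelope, and the function names are hypothetical. In the example the two channels differ only in loudness, which is the case where this reconstruction is exact; any other difference between the channels is lost.

```python
import math


def rms(values: list) -> float:
    """Root-mean-square level of a block of band samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def intensity_encode(left_band: list, right_band: list) -> tuple:
    """Mix the two channels and keep one level per channel for later rescaling."""
    mixed = [l + r for l, r in zip(left_band, right_band)]
    return mixed, rms(left_band), rms(right_band)


def intensity_decode(mixed: list, left_level: float, right_level: float) -> tuple:
    """Rebuild both channels from the mixed band, scaled to the saved levels."""
    mixed_level = rms(mixed)
    if mixed_level == 0.0:
        silent = [0.0] * len(mixed)
        return silent, list(silent)
    left = [v * left_level / mixed_level for v in mixed]
    right = [v * right_level / mixed_level for v in mixed]
    return left, right


left_band = [1.0, -1.0, 1.0, -1.0]
right_band = [0.5, -0.5, 0.5, -0.5]  # same content, half as loud
mixed, left_level, right_level = intensity_encode(left_band, right_band)
out_left, out_right = intensity_decode(mixed, left_level, right_level)

print(mixed, left_level, right_level)
print(out_left, out_right)
```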
Conclusions and thoughts about compression technology
The MP3 standard was completed in 1992. Eight years is a long time for computer technology. If nothing else, much more processing power exists on the desktop. Already, new technologies are starting to gain recognition, such as AAC and TwinVQ. Still, MP3s are widely used, and their penetration seems to increase by the week. It will be interesting to see how much improvement a new technology will have to bring in terms of bitrate and quality to usurp the current standard.