Introduction to the empirical test of compressed audio fidelity.
Since
audio compression is based on perceptual quirks of the human auditory system,
quirks which are only partially understood, there is no mathematical proof which
describes the fidelity of a compressed audio stream.
Only by compressing a bitstream and then subjectively comparing it to the
original can a sense of fidelity be achieved.
Such evaluation is necessary during the development of the algorithms,
and frequently the developers of the system are the ones who evaluate it.
Given the subjective and not entirely uniform hearing abilities of human subjects,
using just a few people to judge the fidelity of a bitrate reduction system is not
necessarily enough to determine how the results compare to the common baseline
standard of RedBook CD audio.
In the case of MP3s, the technology is clearly good enough that, on sufficiently
low-quality reproduction equipment, there is no discernible quality difference.
Because the technology does a fairly good job, suggestibility comes into
play; knowing that the audio quality may be compromised may result in more
careful attention to defects in the sound, be they from the reproduction
equipment, the original source, or even from the
lossy nature of the compression algorithm used.
Only with blind testing can the effect of the lossy compression be
isolated and tested without being confounded by these other factors.
Fidelity is somewhat hard to judge, and its measure will vary not only with the type of
sound, but also with the subject’s taste for that kind of sound.
If the subject does not care for a type of music, their rating of its
fidelity will very likely be less accurate for cases where the amount of
difference is minimal. Therefore,
coming up with a single test procedure which makes efficient use of all subjects
is difficult.
Methods – Experiment #1
Five
subjects were asked to judge fidelity of 24 sample pairs. Each pair contained
the exact same music sample, but one was compressed and the other was not.
Five-second-long samples of music were chosen as test data from the following
sources:
1. Peter Gabriel; In Your Eyes (drums, synthesizers, and vocals)
2. Peter Gabriel; In Your Eyes (drum solo)
3. David Lanz; Improvisations, adapted from Pachelbel’s Canon in D Major (solo piano)
4. Bizet; Carmen, Aragonaise (horns, drums, long, drawn-out cymbal crash)
5. Doug Coulter; Stereo Sample (drums, electric guitar, and bass)
6. Robert Palmer; Simply Irresistible (distorted electric guitars, drums with dramatic silence between beats)
Each
sample was encoded at the following bit rates: 96kbps, 112kbps, 128kbps, and
160kbps. The L.A.M.E MP3 encoding engine V3.1.4 (retrieved from http://www.sulaco.org/mp3)
was used with the following settings: High Quality, Joint Stereo.
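For a sense of scale, the bit rates above can be set against the bit rate of the uncompressed source. The sketch below is only an illustration: it assumes the PCM source is RedBook CD audio (44.1 kHz, 16-bit, stereo), and the function and constant names are made up for this example rather than taken from the test setup.

```python
# Compare each MP3 bit rate used in the test with uncompressed CD audio.
# Assumption: the PCM source is RedBook CD audio (44.1 kHz, 16-bit, stereo).

SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

MP3_BITRATES_KBPS = [96, 112, 128, 160]  # bit rates listed in the text


def pcm_bitrate_kbps() -> float:
    """Bit rate of the uncompressed stream in kbps (1 kbps = 1000 bit/s)."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000


def compression_ratio(mp3_kbps: float) -> float:
    """How many times smaller the MP3 stream is than the PCM stream."""
    return pcm_bitrate_kbps() / mp3_kbps


if __name__ == "__main__":
    print(f"PCM: {pcm_bitrate_kbps():.1f} kbps")
    for rate in MP3_BITRATES_KBPS:
        print(f"{rate} kbps MP3 is about {compression_ratio(rate):.1f}:1")
```

Under that assumption, every bit rate in the test discards well over four fifths of the original data, which is why any audible difference has to be checked by listening rather than inferred from the numbers.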
A fresh install of WinAMP 2.5C was used to play back all samples; no equalization
or output-modifying plugins were installed.
An Ensoniq AudioPCI (fully 16-bit, 44.1 kHz capable) sound card converted the
output from digital to analog, and a Sony STR D315 stereo receiver provided
amplification to a pair of Beyerdynamic DT 831 closed (isolating) circumaural
headphones (claimed frequency response of 5 Hz to 32,000 Hz).
At the beginning of the study, the concept of fidelity was described to the
subject: “Fidelity refers to the overall quality of a sound, where a higher
fidelity source will have less noise and distortion.
For example, FM radio has considerably higher fidelity than AM radio.”
Once primed with a concept of fidelity, each subject was run through six sets of
four sample pairs. For every pair, two five-second clips were played one after the
other, with one second of white noise inserted between them.
After each trial, the subject was given 3 seconds to mark on a two-column
table which sample had the better fidelity. The
white noise was inserted to distract the subject from directly comparing the two
samples, in an attempt to make them focus instead on their subjective feeling of
quality for each sample.
For every pair, one of the clips was the original sample in 16-bit, 44.1 kHz stereo PCM
format, while the other clip was the same sample encoded at one of the MP3 bit
rates listed above. Within each set
of four pairs, the order of compressed and uncompressed audio was randomized,
with as many pairs starting with the compressed sample as with the
uncompressed one. The order of the
pairs was also randomized, subject to the requirement that all four MP3
bitrates had been tested by the end of the set. The order
of the sets was also randomized, except that sample 1 (table 1) was always
presented first, once to show the subject how the test worked, and a second time
to acclimate them to making judgments.
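The randomization rules described above can be sketched in a few lines. This is a reconstruction for illustration, not the procedure actually used: the names `build_set` and `build_schedule` are hypothetical, the extra demonstration run of sample 1 is not modeled, and the "equal number" rule is assumed to mean exactly two MP3-first and two PCM-first pairs per set.

```python
import random

BITRATES_KBPS = [96, 112, 128, 160]
NUM_SAMPLES = 6  # the six music clips listed in the methods


def build_set(sample_id, rng):
    """Return the four pairs for one music sample, in random order.

    Each pair is (sample_id, bitrate_kbps, first_clip), where first_clip is
    "mp3" or "pcm". Exactly two of the four pairs start with the MP3 clip.
    """
    bitrates = BITRATES_KBPS[:]
    rng.shuffle(bitrates)                 # random order of the four bit rates
    first = ["mp3", "mp3", "pcm", "pcm"]
    rng.shuffle(first)                    # random but balanced clip order
    return [(sample_id, b, f) for b, f in zip(bitrates, first)]


def build_schedule(seed=None):
    """Return all 24 pairs: sample 1 first, the other sets shuffled."""
    rng = random.Random(seed)
    rest = list(range(2, NUM_SAMPLES + 1))
    rng.shuffle(rest)                     # random order for samples 2 to 6
    schedule = []
    for sample_id in [1] + rest:
        schedule.extend(build_set(sample_id, rng))
    return schedule


if __name__ == "__main__":
    for pair in build_schedule(seed=1):
        print(pair)
```

Passing a seed makes a subject's schedule reproducible, which helps when the answer sheet has to be scored later against the order that was actually played.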
Results – Experiment #1
Note: as described in the methods section, the In Your Eyes vocal sample was
presented twice to the listeners, unlike the other
samples. The data collected, however, was roughly similar to that for the other samples,
so I have included it in the results.
Table 2: Individual Performance
Percent correct (uncompressed audio marked as sounding the best):
Experimenter (not included in any other calculations): 83%
Subject 1: 71%
Subject 2: 66%
Subject 3: 54%
Subject 4: 62%
Subject 5: 45%
Average correct among subjects: 69/120 = 57%
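One way to judge how far these scores sit from pure guessing is a one-sided binomial test, sketched below. It is not part of the original analysis. It treats every answer as an independent coin flip, which pooling five subjects only approximates, and it assumes the experimenter's 83% corresponds to 20 correct out of 24 pairs. The helper name `p_at_least` is made up for this example.

```python
from math import comb


def p_at_least(successes: int, trials: int, p: float = 0.5) -> float:
    """One-sided binomial tail: P(X >= successes) when each answer is a guess."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Pooled subject result reported in Table 2: 69 correct out of 120.
p_subjects = p_at_least(69, 120)

# Assumption: the experimenter's 83% corresponds to 20 correct out of 24.
p_experimenter = p_at_least(20, 24)

print(f"subjects:     p = {p_subjects:.3f}")
print(f"experimenter: p = {p_experimenter:.5f}")
```

The smaller the printed value, the less plausible it is that the score came from guessing alone; comparing the two values gives a rough numerical version of the contrast between the experimenter and the subjects.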
Discussion of results
There
are several interpretations of these results.
Over 120 sample pairs, subjects answered correctly only 57% of the time,
not much better than chance. The most positive interpretation is that people
cannot, in fact, judge the difference between MP3s encoded at 96kbps (or higher)
and PCM. When I ran the test
upon myself, I easily identified 83% of the compressed samples.
Most of my subjects, however, complained of not being able to tell any
difference, and of having to guess on almost all choices.
If they were picking up on any quality difference at all, it would make
most sense if they were more accurate on samples where severe compression was
used. When
graphed, such results would show up as stair-step lines, with the top
line of each group the shortest. If
such a pattern exists in the data, I cannot find it. [Graph not included in HTML
version of this report].
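The stair-step check described above amounts to tallying accuracy separately for each bit rate. A minimal sketch follows; the per-trial answers in it are hypothetical and exist only to show the shape of the calculation, since the study's raw answer sheets are not reproduced in this report. The function name `accuracy_by_bitrate` is likewise made up.

```python
from collections import defaultdict


def accuracy_by_bitrate(trials):
    """Map each bit rate to the fraction of pairs answered correctly.

    `trials` is an iterable of (bitrate_kbps, correct) tuples, where
    `correct` is True when the uncompressed clip was picked.
    """
    right = defaultdict(int)
    total = defaultdict(int)
    for bitrate, correct in trials:
        total[bitrate] += 1
        right[bitrate] += int(correct)
    return {b: right[b] / total[b] for b in sorted(total)}


# Hypothetical answers for illustration only; these are NOT the study's data.
example_trials = [
    (96, True), (96, True), (96, True), (96, False),
    (112, True), (112, True), (112, False), (112, False),
    (128, True), (128, False), (128, True), (128, False),
    (160, False), (160, True), (160, False), (160, False),
]

for bitrate, acc in accuracy_by_bitrate(example_trials).items():
    print(f"{bitrate:>3} kbps: {acc:.0%} correct")
```

If subjects were hearing the compression, accuracy would be expected to fall as the bit rate rises, as it does in this invented example.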
The
data seems to indicate no significant perception of quality difference. Perhaps,
however, MP3s do sound worse in general, but my test just doesn’t expose it to
the unaccustomed. Or, maybe I’ve
just learned what the compressed samples sound like as I created them, and
although the difference is small, I can pick up on it, whereas the average
listener cannot. It might not be a
question of one sample truly sounding better, just that I can identify how it
sounds different, and associate different with worse.
The
important question, however, is not whether the experimenter can figure out how
to crack the experiment, but whether the average subject can determine a
noticeable quality difference. If
nothing else, this first study further underlines the necessity for testing MP3
quality on subjects, rather than just running blind tests on the person who
designed the test, as running it on myself strongly indicates that there is a
difference, and running it on subjects suggests that there is not.
Because of the disparity between my own judgment and that of my subjects, I decided
to run a second experiment, which would give subjects more time and more chances to
listen to each sample. After all, I listened
to each sample many times during its creation; maybe that is the most
significant cause of my improved accuracy.
Methods – Experiment #2
Subjects
auditioned 16 pairs of audio samples, where one sample was always compressed,
and the other uncompressed. To make
sure subjects understood their task, the first two samples were compressed at
56kbps and 80kbps, both of which had highly audible noise and compression
artifacts. If they did not
correctly identify the higher fidelity audio on these samples, they were asked
to listen again.
After
the first two sample pairs, the procedure followed was always the same.
A random pair of samples was chosen. One of the samples was PCM, and the
other, 128kbps MP3. The samples
were played in succession with a few seconds of space between the samples. Then the
subjects were allowed to listen to the samples as many times as they wanted to
aid in deciding which had the higher fidelity.
Because so many subjects in the first experiment had been disturbed by
having to make guesses when they felt like they had no idea which sample sounded
best, I decided to let all the subjects in the second experiment answer a/b (that
is, “don’t know”) if they couldn’t tell after running the samples several times.
In
addition to all the samples used in the first experiment, I also added the
following new audio clips:
· Moody Blues; Nights in White Satin (Live From the Red Rocks version, at a point heavy with applause).
· Moody Blues; Nights in White Satin (Live From the Red Rocks version, at a point with lots of cymbal hits).
· Vangelis; Chariots of Fire – theme (piano, synthesizers, and light drums).
· Vangelis; Direct (layered, high tempo synthesizers).
· R.E.M; Drive (guitars, light drumming, and vocals).
· R.E.M; Drive (guitars and silence).
· R.E.M; Drive (guitars – distorted, vocals, drumming).
· Enya; Orinoco Flow (slow tempo layered synthesizers, vocals).
· Pink Floyd; Comfortably Numb (strings, bass, vocals).
Note that while there are several samples from the same song in this set, all of those samples are very different in character as far as number of instruments, and the intensity of playing (fast, loud, slow, melodic, noisy).
Results – Experiment # 2
Since
the subjects were corrected when they guessed incorrectly on the first two
samples, that data is not included in the following table.
| Subject | # Correct | Correct/Guessed | Correct/Total | # Incorrect | # Unsure | MP3s not detected |
| 1 | 6 | 54% | 42% | 5 | 3 | 58% |
| 2 | 8 | 73% | 57% | 3 | 3 | 43% |
| 3 | 10 | 77% | 71% | 3 | 1 | 29% |
| 4 | 9 | 64% | 64% | 5 | 0 | 36% |
| 5 | 2 | 40% | 14% | 3 | 9 | 86% |
| 6 | 10 | 77% | 71% | 3 | 1 | 29% |
| 7 | 5 | 42% | 36% | 7 | 2 | 64% |
| Av: | 7.5 | 51% | 50% | 4 | 3 | 50% |
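The percentage columns in the table can be derived from the three count columns. The sketch below shows one reading of how: the column definitions are inferred from the numbers, not stated in the report, so they should be treated as assumptions. Printed values may differ from the table by a point because of rounding.

```python
# (subject, correct, incorrect, unsure) counts copied from the table above.
COUNTS = [
    (1, 6, 5, 3),
    (2, 8, 3, 3),
    (3, 10, 3, 1),
    (4, 9, 5, 0),
    (5, 2, 3, 9),
    (6, 10, 3, 1),
    (7, 5, 7, 2),
]


def derive(correct, incorrect, unsure):
    """Return (correct/guessed, correct/total, not detected) as fractions.

    Assumed definitions: "guessed" counts pairs where the subject picked a
    clip, "total" adds the unsure pairs, and "not detected" is the share of
    pairs that were either answered wrongly or marked unsure.
    """
    guessed = correct + incorrect
    total = guessed + unsure
    return correct / guessed, correct / total, (incorrect + unsure) / total


for subject, c, i, u in COUNTS:
    cg, ct, nd = derive(c, i, u)
    print(f"subject {subject}: {cg:.0%} of guesses, {ct:.0%} of all pairs, "
          f"{nd:.0%} not detected")
```

Each subject's counts sum to 14, which matches the 16 pairs auditioned minus the two practice pairs left out of the table.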
Discussion of Experiment #2
This
time I tried to answer a smaller question: did people judge PCM as sounding
better than MP3 compressed at one fixed bitrate (128kbps)?
Since I knew the sample size would be small, I decided to accept
“Don’t Know” as an answer to the question, “Which sounded
better?” Averaged across all
subjects, the number of MP3 clips undetected (sample pairs marked
either “don’t know”, or with the compressed sample marked as sounding best)
falls at 50% (exactly!), the same percentage as if people were guessing randomly.
Not knowing statistics, I don’t know how statistically valid this is;
however, it suggests fairly strongly that people cannot, at least for
five-second clips, tell the difference between MP3s at 128kbps and PCM.
The big unanswered question, of course, is how well that represents their
ability to judge quality for longer clips (or the full length of a song), for
music they know and love. I’m not
really sure how to address that, currently.
At least, however, these results show that MP3 compression works well
enough that much more intensive levels of testing will be needed to show if it
really causes a significant drop in perceived quality or not.