
Should we care about ABX test results?

Policy no. 8 in the Terms of Service of the respected audiophile community Hydrogenaudio states:

8. All members that put forth a statement concerning subjective sound quality, must — to the best of their ability — provide objective support for their claims. Acceptable means of support are double blind listening tests (ABX or ABC/HR) demonstrating that the member can discern a difference perceptually, together with a test sample to allow others to reproduce their findings.

What a breath of fresh air. Other audio forums are full of snake-oil-peddling and Kool-Aid-drinking evangelists who go on and on about how replacing $200 speaker wires with $400 speaker wires “really opened up the soundstage and made the upper-midrange come alive”. The people at Hydrogenaudio know that such claims demand proper scientific evidence. How nice to see that they dismiss subjective nonsense and rely instead on the ultimate authority of ABX tests, which really tell us what makes a difference and what doesn’t.

Except that ABX tests don’t measure what really matters to us. ABX tests tell us whether we can hear a difference between A and B. What we really want to know, however, is whether A is as good as B.

1.

“Wait a second!”, I hear you exclaim. “Surely if I cannot tell A from B, then for all intents and purposes, A is as good as B and vice versa. If you can’t see the difference, why pay more?”

Actually, there could be tons of reasons. To take a somewhat contrived example, suppose I magically replaced the body of your car with one that was less resistant to corrosion, leaving all the other features of your vehicle intact. Looking at the car and driving it, you would not notice any difference. Even if I gave you a chance to choose between your original car and the doctored one, they would seem identical to you and you could choose either of them. However, if you were to choose the one I tampered with, five years later your vehicle’s body would be covered in spots of rust.

The obvious lesson here is that “not seeing a difference” does not guarantee that A is as good as B. Choosing one thing over another can have consequences that are hard to detect in a test because they are delayed, subtle, or so odd-ball that no one even thinks to record them during the test.

But how is this relevant to listening tests? Assuming that music affects us through our hearing, how could we be affected by differences that we cannot hear?

In his fascinating book Burning House: Unlocking the Mysteries of the Brain, Jay Ingram describes the case of a 49-year-old woman suffering from a condition called hemispatial neglect (the case was researched by neuropsychologists John Marshall and Peter Halligan). Patients with hemispatial neglect are unable to perceive one (usually the left) side of the objects they see. When asked to copy drawings, they draw only one side; when reading out words, they read them only in half (e.g. they read simile as mile).

In Marshall and Halligan’s experiment, the woman was given two simple drawings showing two houses. In one of the drawings, the left side of the house was covered in flames and smoke; the houses looked the same otherwise. Since the flames were located on the left side, the patient was unable to see them and claimed to see no difference between the drawings. When Marshall and Halligan asked her which of the houses she would rather live in, she replied — rather unsurprisingly — that it was a silly question, given that the houses were identical.

However, when the experimenters persuaded her to make a choice anyway, she picked the flameless house 14 out of 17 times, all the time insisting that both houses looked the same.
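How unlikely is a 14-out-of-17 score for someone who is simply guessing? The figure below is a rough check and not a number taken from the original study. It assumes that, under the guessing hypothesis, each of the 17 choices is an independent 50/50 coin flip; the function name `binomial_tail` is made up for this sketch.

```python
from math import comb


def binomial_tail(successes: int, trials: int) -> float:
    """Probability of at least `successes` hits in `trials` fair coin flips."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2 ** trials


# Assumption: pure guessing means a 50% chance of picking the flameless house.
p_value = binomial_tail(14, 17)
print(f"P(at least 14 of 17 by guessing) = {p_value:.4f}")  # prints 0.0064
```

Under these assumptions, a result this lopsided would turn up by chance less than once in a hundred attempts, which is why the patient’s choices are hard to dismiss as luck.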

Marshall and Halligan’s experiment shows (as do other well-known psychological experiments, including those pertaining to subliminal messages) that it is possible for information to be in a part of the brain where it is inaccessible to conscious processes. This information can influence one’s state of mind and even take part in decision-making processes without one realizing it.

If people can be affected by information that they don’t even know is there, then who says they cannot be affected by inaudible differences between an MP3 and a CD? Failing an ABX test tells you that you are unable to consciously tell the difference between two music samples. It does not mean that the information isn’t in your brain somewhere — it just means that your conscious processes cannot access it.
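To make “failing an ABX test” concrete: a run is typically scored by counting how often the listener correctly identified X and asking how likely that count would be under pure guessing. The sketch below assumes a 16-trial run and a 5% significance threshold. Both values are illustrative choices, not something prescribed by this post or by the Hydrogenaudio policy, and `min_correct_to_pass` is a hypothetical helper name.

```python
from math import comb


def guessing_probability(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` answers right by guessing alone."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def min_correct_to_pass(trials: int, alpha: float = 0.05) -> int:
    """Smallest score whose guessing probability is at or below `alpha`."""
    for correct in range(trials + 1):
        if guessing_probability(correct, trials) <= alpha:
            return correct
    raise ValueError("no score reaches the requested significance level")


# Assumed run length and threshold, chosen only for illustration.
print(min_correct_to_pass(16))  # prints 12
```

Under these assumptions, a score of 11 out of 16 counts as a failure. That is a statement about the strength of the evidence, not proof that the two samples are identical.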

So the fact that you cannot tell the difference between an MP3 and a CD in an ABX test does not mean that an MP3 is as good as a CD. Who knows? Maybe listening to MP3s causes more fatigue in the long run. Maybe it makes you get bored with your music more quickly. Or maybe the opposite is true and MP3s are actually better. We can formulate and test all sorts of plausible hypotheses — the point is, an ABX test which shows no audible difference is not the end of the discussion.

2.

I have shown that the lack of audible differences between A and B in an ABX test does not imply that A is as good as B. Before you read this post as an apology for lossless audio formats, here is a statement that will surely upset hard-core audiophiles:

The fact that you can tell the difference between an MP3 and a CD in an ABX test does not mean that the MP3 is worse than a CD.

First of all, the differences between MP3s encoded at mainstream bitrates (128 kbps and 192 kbps) and original recordings are really subtle and can be detected only under special conditions (quiet environment, good equipment, full listener concentration, direct comparisons of short samples). Because the differences are so tiny, we cannot automatically assume that it is the uncompressed version that sounds better. Subtle compression artifacts such as slightly reduced sharpness of attacks on short, loud sounds may in fact be preferred by some listeners in a direct comparison.

Secondly, even if we found that the uncompressed version is preferred by listeners, that wouldn’t necessarily mean that it is better. People prefer sitting in front of the TV to exercising, but the latter might make them feel much better overall. If it were discovered, for example, that compressed music is less tiring to listen to (this is of course pure speculation), then that fact might outweigh any preference for uncompressed sound in blind tests.

Summary

The relevance of ABX tests to the lives of music lovers is questionable. Neither does the absence of audible differences imply equal quality, nor does the presence of audible differences imply that the compressed version is inferior. Rather than being the argument to end all debate, the results of ABX tests are just one data point and the relative strengths of various audio formats may well be put in a new light by further research.

21 Comments so far

  • Andrew

    1. But what you just demonstrated is that there *is* a bias in favor of the better thing, and that people will demonstrate that bias given enough samples, even if they insist that they don’t recognize a difference!

    2. I think you’re being far too kind to MP3. There are some perfectly ordinary tracks (for instance, almost any track on Neil Young’s “Live Rust”) that when encoded as 128 MP3 go “chirp” and “warble” and “sizzle” badly enough that you can hear the artifacts on a cheap car stereo. In a car moving at freeway speed.

    • tszynalski

      1. I’m not sure ABX tests would expose this sort of unconscious bias, as they are the equivalent of asking “are A and X the same?”. Anyway, as I wrote, there could be other ways (other than decision bias) in which compressed audio could affect your mental state.

      2. I think you may have had some bad experiences with obsolete encoders. Try the latest version of LAME. Audible differences at 128 kbps on a cheap car stereo? No way.

  • Western Infidels

    “Failing an ABX test tells you that you are unable to consciously tell the difference between two music samples.”

    How can you justify the “consciously” part? Particularly in light of the hemispatial neglect example, which would seem to contradict this line of thinking quite directly?

    • tszynalski

      An ABX test is not designed to capture subconscious effects. It would be hard to design a test that could tell you “There is no difference, not even in your subconscious.”.

      • Western Infidels

        Hard? Medical, psychological, and even marketing and advertising studies are designed to capture subconscious effects all the time, and they tend to be pretty simple blind trials.

        What makes ABX testing special or different from other types of blind tests that makes it insensitive to subconscious effects?

  • tszynalski

    Western Infidels:
    There are tests that capture specific subconscious effects, but no test can tell you that there are no subconscious effects whatsoever. Fatigue effects require a different testing scheme than memory effects or habituation effects. (ABX tests do not have the potential to detect any of these.) No matter how many tests you do, there could always be a subconscious effect that nobody thought to test for.

  • just a dude

    Hello,

    This might be an interesting addendum to the conversation:

    Inaudible High-Frequency Sounds Affect Brain Activity: Hypersonic Effect
    http://jn.physiology.org/cgi/content/full/83/6/3548

    • tszynalski

      Very interesting paper. Thanks for posting it.

      Brief summary: In a blind test, their test subjects described music with (inaudible) high-frequency sounds as more comfortable to the ears and more nuanced than music stripped of those frequencies. At the same time, they could not reliably tell A from B if they were asked which one they like more. So the ability to tell A from B depended on the exact question that was asked.

      The experiment also found differences in EEG and PET scans between the two samples.

  • audio-skeptic

    Hi there – interesting article.

    There are a couple of points I take issue with, however. Firstly, pertaining to the car body analogy. An equivalent property in high end audio gear to the resilience of the original car body would be some sort of comparable indirect/pragmatic benefit, like robustness or longevity. In other words, high end audio gear might be “better” in a way which ABX testing cannot reveal – in some fashion other than SQ. That’s all well and good, and perfectly plausible, but the truth is that hardcore audiophiles do not claim to buy a £1000 cable over a £30 cable because the former will last longer. Instead, they claim it’s because the more expensive of the two demonstrates a discernible difference in SQ. ABX testing would reveal this if it were true.

    To my mind, the burning house example is a bit of a red herring. Firstly, the woman’s inability to perceive a difference between the two houses would only impinge on her quality of life were she to actually choose to enter one of the hypothetical houses. In other words, her other senses would alert her to a difference her vision (or her visual cortex) was incapable of detecting.

    That said, I realise this misses the underlying point you were trying to get at. That is: differences which aren’t audible, of which we aren’t consciously aware, can have some indirect effect on our state of mind more broadly. I think the first problem with this reasoning is that we’re often not aware of the influence effects like subliminal messages are having on us, much less of their original source. In other words, it is highly unlikely that this is the explanation for why audiophiles seem to prefer gear which they can’t discern from cheaper kit in ABX testing, because the hypothetical “effect” is completely disconnected from the “cause”. Moreover, this misrepresents the actual position of the audiophile community, which is that they CAN hear differences between different gear (and they can, when they know they’re listening to it). They do not, conversely, claim some intangible benefit to their quality of life. Their claims are often spurious, and ABX testing demonstrates this.

    Finally, if indeed these effects exist, their source should be scientifically measurable. Quite the opposite is true. In RMAA tests, “burned in” vs non-burned in headphones and speakers produce identical response/distortion/noise curves, as do cheap vs expensive cables. I imagine things like individual DACS and amps vary slightly depending on their design and sound colouration, though the truth is they’re all striving for the same thing – low noise, low distortion, flat frequency response. These things are all easily achievable in modern electrical engineering. It’s possible such minor differences could account for slightly different “subconscious” effects, but they certainly cannot explain nonsense like “really opened up the soundstage and made the upper-midrange come alive” as you so eloquently put it 😉

    These sorts of comments are borne out of the subjectivity of human hearing. Like our other senses, it is vastly coloured by our expectations, our mood, how tired we are, what we had for breakfast etc etc etc.

    • tszynalski

      You are correct that audiophiles typically claim audible differences in favor of more expensive equipment. They do not say that A and B sound the same, but have obscure and hidden effects on our mental worlds. They say that A sounds better than B. Therefore, their statements can be (and should be) evaluated with ABX tests. If A sounds better than B, they should be able to hear it even if they are not told which sample they are listening to.

      I did not mean to suggest subconscious effects are an explanation for the audiophiles’ stated preferences. The most likely explanation (in a non-blinded situation) is that their perception is affected by the conscious information that they have (for example, the fact that they are listening to $1000 cables).

      I don’t understand why you wrote that it is a “problem with my reasoning” that it does not align with the arguments made by the audiophile community. What I wrote was not supposed to support or explain the audiophiles’ claims. It was supposed to point out that ABX tests may be too crude a tool to measure the relevant differences between CDs and MP3s.

      I agree that the source of any hidden effects must be scientifically accounted for. In the case of headphones and cables, as you have pointed out, no such source has been identified. But CDs and MP3s produce different waveforms, so the underlying difference is there.

      The burning house example may be a bit confusing in that the subconscious differences were successfully detected in the ABX test that the subject was given, while my article is about the possibility of subconscious differences that do not manifest themselves in an ABX test.

  • Peder

    I have strong objections about the reasoning in 2)
    The only goal for a lossy encoder is to provide a result that is as similar to the original as possible.

    Sure, someone might find the smearing or pre-echo artifacts enjoyable, but why should the majority have to suffer because of them? And if they /are/ the majority, the problem should be addressed in the original recording.
    If you find an altered signal more pleasant you can always remix it before encoding.

    And, as with the “laws” of physics, you have to use the best alternatives available now when making these decisions. Sure, someone might find in 5 years that the sonic equivalent of F=ma isn’t accurate in some specific cases, but until then ABX is the best tool we’ve got.

    • tszynalski

      Which part of the following statement do you have strong objections about?

      “The fact that you can tell the difference between an MP3 and a CD in an ABX test does not mean that the MP3 is worse than a CD.”

      By “worse”, I mean “less enjoyable”, since enjoyment is the ultimate goal of music and it is what people (should) really care about. Of course, if by “worse” you simply mean “different” (as shown in an ABX test), then my statement becomes tautologically false.

      There has been a lot of misunderstanding in the Hydrogenaudio thread concerning my post. I’ll try to avoid it by presenting a list of possible situations in order of my personal preference:

      1. (best) Recordings are published in several versions for people with different tastes/equipment.
      2. (second best) Codecs have optional filters that allow the user to customize the sound.
      3. Codecs always filter the input signal so that 90% of the population finds it better. (Let’s assume this is possible.)
      4. Codecs simply preserve the original sound. (This is worse than 3 because [by assumption] 90% of the population finds the “transparent” sound worse than the improved sound. You could argue that people could remix tracks or use postprocessing to get the same effect without reducing choice, but this is not a realistic alternative for most people.)

  • fung0

    I’d say the reverse: “The fact that you CAN’T tell the difference between an MP3 and a CD in an ABX test does not mean that the MP3 is EQUALLY GOOD as the CD.” The enjoyment of music is a far more delicate thing.

    After listening to a LOT of MP3 files, it struck me that I simply wasn’t enjoying them as much as the original CDs. I started comparing and listening for the compression losses, and immediately found them quite significant and easy to identify. But they were not sufficient to explain the overall loss of ‘presence’ that I had experienced.

    Yes, the difference between MP3 and WAV or CD is subtle… but so is the difference between a *great* performance and an average one, or a great recording and a mediocre one. (Or even the most fabulous coffee-table reproduction and an original painting.) Art is all ABOUT subtlety.

    My suspicion is that we’re asking the wrong questions. If we took a large audience and asked them which recording they *enjoyed* more (on good equipment, naturally), they’d pick the lossless version more often. (In any case, I know I would – or did – which is what matters to me.)

    Even if you’re unconvinced, the question should be moot. Considering today’s infinitesimal costs of both storage and bandwidth, I can’t see any reason for entrusting great music to a brutally lossy format such as MP3.

    • Tomasz

      I said what you call “the reverse” in section 1 of my post. 😉 A favorable result of an ABX test does not mean that an MP3 is as good as a CD — it could be worse or it could be better.

      I suggest you do a blind test yourself. Instead of asking yourself “is A=X”, ask yourself “which of the two samples is more enjoyable?”. You can see for yourself whether you find CDs more enjoyable because you know they are CDs (placebo effect), or because they are truly different to your brain. My experience is that artifacts seem obvious when you KNOW that you are listening to an MP3 track, but they mysteriously disappear once this information is withheld. It could be an interesting experience for you.

      And in my personal case, the compression question is far from moot. I have an 80 GB iPod 50% full of MP3 music and podcasts. It simply wouldn’t fit if it were encoded with FLAC.

  • Chef

    At first I was quite offended by the idea that ABX tests are not conclusive, but I’m glad I read on. You make many good points in this piece.

    I think what I gather from it, is that ABX tests are useful, but they are not a guarantee we are listening to what we might arbitrarily call ‘the best’ version of a track. I think I will continue to rely on ABX tests to make decisions about what I put on my DAP, but I won’t reject the idea of their possible inferiority. Basically an unsatisfying sense of ‘I’m using the best track to the best of my knowledge… but my knowledge isn’t infallible.’

    Thanks for this though, I think it will help me to be more open minded. Could have done without the car analogy though, which was what almost scared me away 🙂

  • Daniel

    If one sample of music is more enjoyable than another, then there WILL be a “decision bias”.
    For each thing you want to know is influenced, you need to do a different test/study. For example, if you want to know how the subject’s general quality of life is influenced in the long term, you will need to do epidemiological studies. But even then, the bias will be far more important than the actual data. For example, one group probably uses computers and the other maybe not. That is a bigger factor than the format of the music itself.
    Anyway, I WANT to hear what the people who made the music were hearing, except when I VOLUNTARILY change the sound. Because of that, for me on one side of the balance is the precision of the reproduction, and on the other side is the price that I need to pay for it (more physical or logical space, maybe?).

  • lossy test conclusions? at blog.humaneguitarist.org

    […] writer of this blog entry makes an interesting point regarding the double blind ABX listening […]

  • Marshall and Halligan’s experiment on hemispatial neglect patient « Coalescent Architecture

    […] There is an interesting experiment done by John Marshall and Peter Halligan on how we consciously and unconsciously recognize world. I quote the explanation from this blog. […]

  • Alice Wonder

    She was unable to describe a difference yet she still picked the better house a statistically significant majority of the time, indicating the “ABX” test worked.

    Logic, what do they teach in schools these days? 😉

    • Tomasz

      An ABX test asks you to say if two things are equal. The test in the house experiment asked the subject to express preference. Audio compression algorithms are not typically tested by asking “which one do you prefer?”. (There are tests like that, but they are rarely used.)

  • Al

    The degree to which cables will make a difference depends not only on the intrinsic characteristics and quality of the cable, and on the quality and musical resolution of the system, but perhaps just as significantly or even more so on interactions between the technical characteristics of the cable and those of what it is connecting. Impedances, for instance, among many other dependencies that could be cited which have no direct relation to the sonic quality of the system.

    What drives the cost of expensive cables (aside from markups) is the use of exotic materials and construction techniques. I think that the fundamental problem here is that there is little or no established science supporting a correlation between those materials and techniques and better sound.

    And muddling the picture further is that many exotic cables are non-neutral by design, incorporating in some cases outlandish values of capacitance or inductance, or “network boxes” whose function is defined primarily by nonsensical techno-hype.

    Keep in mind, also, that more expensive does not necessarily equal better. What can be expected to matter most is how the cable interacts with what it is connecting.
