Parente's Mindtrove

Spatial PulseAudio

January 01, 2008

In his interview about PulseAudio in Fedora 8, Lennart Poettering mentions support for spatial sound as one of his future goals:

Spatial event sounds: click on a button on the left side of your screen, and the event sound comes out of your left speaker. Click on one on the right side of your screen, and the event sound comes out of the right speaker. It's earcandy, but I think this could actually be quite useful, but only if we get better quality event sounds than we have right now.
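To make the 'click on the left, hear it on the left' idea concrete, here's a minimal sketch of how the pan for an event sound might be derived from the triggering widget's horizontal screen position. This is not PulseAudio code; the pan_gains helper is invented for illustration, and a real implementation would hand the position to the sound server rather than compute raw channel gains itself.

import math

def pan_gains(widget_x, screen_width):
    """Return (left_gain, right_gain) using constant-power panning.

    A widget at the far left yields (1.0, 0.0), the far right yields
    roughly (0.0, 1.0), and the screen center yields about (0.707, 0.707).
    """
    # Normalize the widget position to the range [0, 1].
    pan = min(max(widget_x / float(screen_width), 0.0), 1.0)
    # A constant-power law keeps perceived loudness steady across positions.
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

print(pan_gains(0, 1280))     # far left
print(pan_gains(640, 1280))   # center
print(pan_gains(1280, 1280))  # far right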

While spatialized event sounds may be 'earcandy', as Lennart admits, there are other benefits of using 3D audio over mono sounds in certain applications. One interesting use concerns the separation of concurrent sound streams such that a user can distinguish and 'pick out' one of many. The theory of auditory scene analysis (Bregman, 1990) says (among many other things) that we humans can better segregate different sound sources and select one for attentive processing if the acoustic and semantic properties of streams from distinct sources differ along certain dimensions while certain properties within a stream remain constant over time.

For instance, say I make two audio recordings of the same person speaking two different utterances. In one recording, the person says 'What a lovely bunch of coconuts.' In the other, 'That dog certainly has fleas.' If I mix these two recordings into a mono track with the two utterances starting at exactly the same time, you will have a hard time separating and understanding the two independent phrases. If I create a stereo sound, with one phrase played in the left speaker and another in the right, you'll have a much easier time identifying the original phrases. (But you'll likely have to listen to the sound more than once before you can repeat both phrases: another tenet of auditory scene analysis.) Better still, if I apply a head related transfer function (HRTF) to each recording such that the two utterances appear to come from the left and right side of your head in a 3D space, your task becomes even easier.
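As a rough illustration of the difference (assuming the two utterances are already loaded as equal-length mono NumPy arrays; a real HRTF rendering is far more involved than this), the mono mix and the simple stereo separation look something like the following.

import numpy as np

def mix_mono(utterance_a, utterance_b):
    """Sum both utterances into one channel, where they mask each other."""
    return (utterance_a + utterance_b) / 2.0

def mix_stereo(utterance_a, utterance_b):
    """Put one utterance in each channel so segregation becomes easier."""
    return np.column_stack((utterance_a, utterance_b))

# Two seconds of placeholder audio at 44.1 kHz standing in for the recordings.
rate = 44100
a = np.random.uniform(-1, 1, 2 * rate)   # stand-in for the 'coconuts' phrase
b = np.random.uniform(-1, 1, 2 * rate)   # stand-in for the 'fleas' phrase

mono = mix_mono(a, b)        # shape (88200,): one crowded channel
stereo = mix_stereo(a, b)    # shape (88200, 2): one utterance per ear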

In other words, spatialized sounds aid segregation and selection of independent streams of speech and sound. In fact, research (McGookin, 2004) suggests that spatialization alone is sufficient to aid recognition of information encoded in properties of concurrent musical sounds (earcons).

Applying the concept of distinguishable sound streams to screen readers is an interesting endeavor. (Or, at least, I think so.) Screen readers currently rely on a single, serial stream of speech and sound to describe the multitasking, high-bandwidth graphical desktop. In a single-stream design, reports of peripheral information outside the application focus are either non-existent, delayed, or delivered as interruptions, and can be easily missed. For instance, if a screen reader is busy reading an email when the user receives an instant message in another application, the screen reader has to decide whether to keep reading the email or announce the new message in some manner. If the screen reader interjects, the user might confuse the instant message content with that of the email or become annoyed with the interruption. If the screen reader decides to wait for the email reading to finish, the late announcement about the chat message runs the risk of being stale.

Worse yet, any single-stream announcement of the new message can be inadvertently interrupted at any time by the next user command. In such a situation, unless the user tabs around looking for the new instant message or the chat program is set to play a sound every time a message is received (which still doesn't indicate which of potentially many chats has the new message), the user may never learn of the existence of the new message.

Concurrent streams provide an answer to this peripheral awareness problem, but only if the screen reader can present them in a way that avoids masking other simultaneous streams. And this is exactly where the ability to spatialize sound helps. Without interrupting or modifying the stream of speech reading the email, another stream can pipe up and announce the new chat message with a sound, speech, or both according to the verbosity settings of the user (Zing! or 'Message from Harvey' or 'Harvey says "Hey! Stop reading your email and answer me! This is important!"'). As long as these streams are spatially separated according to some simple rules, the user will be able to effectively distinguish them, ignore one, listen to one, and switch attention back and forth between them.
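One of those 'simple rules' might be as straightforward as assigning each stream a fixed position on the horizontal plane around the listener. Here's a toy sketch of that idea; the stream names and angles are invented for illustration, and a real screen reader would pass these positions to a spatializing mixer rather than compute raw stereo gains itself.

import math

# Hypothetical placements: focused reading straight ahead, alerts to the sides.
STREAM_AZIMUTHS = {
    'focus_speech': 0.0,      # the email being read: straight ahead
    'chat_alert': 60.0,       # the new instant message: off to the right
    'system_status': -60.0,   # background monitors: off to the left
}

def gains_for(stream):
    """Map a stream's azimuth in degrees (-90 left to +90 right) to gains."""
    azimuth = STREAM_AZIMUTHS[stream]
    pan = (azimuth + 90.0) / 180.0           # normalize to [0, 1]
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)  # (left_gain, right_gain)

for name in STREAM_AZIMUTHS:
    print(name, gains_for(name))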

Instant messaging is just one example of a modern desktop application that begs for concurrent streams in screen readers. Just from looking at my GNOME desktop I see a system monitor, the clock applet, my network status, a popup balloon, and a log monitor all updating in the background while I write this post. Of course, a user can't cope with all of these event sources reporting at once. But that's where the interesting design problems start: how do we construct a usable multi-stream auditory display?

An open source library supporting spatial sound is a fundamental building block for this investigation (and, I'm certain, others). I hope Lennart pursues it.
