Spaceship!
When Gary announced Outfox back in 2008, all manner of ideas for using speech and sound in the browser popped into my head. I've always had the boring demos (i.e., for adults) at Maze Day, so I decided to work first on a fun, somewhat educational, self-voicing browser game for the 2009 rendition. After all, keeping the mostly under-13, soda-drinking, pizza-eating, game-playing clientele happy is always priority #1 at Maze Day.
The result is Spaceship!, a JavaScript game for Firefox built using Creative Commons licensed music, sound, speech, and graphics; the Dojo toolkit; and the Outfox add-on. In the primary portion of the game, the player fires shots at a grid of tiles trying to hit enemy ships. When the player runs out of ammo, he or she plays a set of minigames in an attempt to earn more shots. Of course, hazards and bonuses abound to keep things interesting.
A text description is nice, but you're better off watching the gameplay video below to really understand what I'm jabbering about. Or, better yet, grab Outfox and Firefox 3 and play it yourself online at http://spaceship.mindtrove.info.
What a great exercise this turned out to be! The payoff has been manifold:
- I learned a ton more about Dojo and writing custom widgets.
- I developed some interesting MVC techniques for aural+visual event driven apps in Dojo. I hope to blog about these.
- I built some nice, reusable Dojo components for future browser games.
- I got to show off client-side music, sound, and speech in Firefox with pure JS. Maybe this will spur development of other audio apps?
- I drummed up some interest in extending Spaceship! with new minigames. Hopefully more coming soon.
- My wife was entertained. Yes, she will actually ask to play the game if she sees me working on it.
- At Maze Day, lots of teachers asked when the game would be online. Well, here it is, a month later.
- And, most importantly, a steady stream of kids (and adults) got to play it at Maze Day. Hopefully even more can enjoy it now online.
If you try it out, leave a comment. It's new, there are bugs, and there is room for improvement. But anything you report will help in making the game better.
I owe many thanks to the artists who made their wonderful images, songs, and sounds available under open licenses. Their names appear in the Credits section off the main game menu. Be sure to check them out.
Oh, and of course the game code itself is BSD-licensed. Grab the code from http://svn.mindtrove.info/spaceship http://github.com/parente/spaceship if you're feeling adventurous.
iPod Shuffle with TTS
It's about time! Now when can I expect it on the other iPods? Just because they have screens doesn't mean I'm always looking at them.
Seadragon and Photosynth
I saw this demo video posted on Digg: http://www.metacafe.com/watch/637132/this_technology_will_blow_your_mind/
The second half of the video struck me as novel. Image-based reconstruction of 3D environments is nothing new. Doing it using images mined from Flickr or other large databases is new, as far as I know.
I Googled for the project home page for Photosynth, intending to play with their demo, but didn't bother when I saw it was Windows only.
Spatial PulseAudio
In his interview about PulseAudio in Fedora 8, Lennart Poettering mentions support for spatial sound as one of his future goals:
Spatial event sounds: click on a button on the left side of your screen, and the event sound comes out of your left speaker. Click on one on the right side of your screen, and the event sound comes out of the right speaker. It’s earcandy, but I think this could actually be quite useful, but only if we get better quality event sounds, than we have right now.
While spatialized event sounds may be “earcandy” as Lennart admits, there are other benefits of using 3D audio over mono sounds in certain applications. One interesting use concerns the separation of concurrent sound streams such that a user can distinguish and “pick out” one of many. The theory of auditory scene analysis (Bregman, 1990) says (among many other things) that we humans can better segregate different sound sources and select one for attentive processing if the acoustic and semantic properties of streams from distinct sources differ along certain dimensions while certain properties within a stream remain constant over time.
For instance, say I make two audio recordings of the same person speaking two different utterances. In one recording, the person says “What a lovely bunch of coconuts.” In the other, “That dog certainly has fleas.” If I mix these two recordings into a mono track with the two utterances starting at exactly the same time, you will have a hard time separating and understanding the two independent phrases. If I create a stereo sound, with one phrase played in the left speaker and the other in the right, you’ll have a much easier time identifying the original phrases. (But you’ll likely have to listen to the sound more than once before you can repeat both phrases: another tenet of auditory scene analysis.) Better still, if I apply a head-related transfer function (HRTF) to each recording such that the two utterances appear to come from the left and right sides of your head in a 3D space, your task becomes even easier.
In other words, spatialized sounds aid segregation and selection of independent streams of speech and sound. In fact, research (McGookin, 2004) suggests that spatialization alone is sufficient to aid recognition of information encoded in properties of concurrent musical sounds (earcons).
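The mono-versus-stereo comparison from the coconuts and fleas example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two "utterances" are stand-in sine tones rather than real recordings, the function names and sample rate are made up for the sketch, and the stereo case is a plain hard pan, not an HRTF.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary value chosen for this sketch)


def tone(freq_hz, seconds=0.01):
    """Return a sine tone as a list of floats in [-1, 1].

    Hypothetical stand-in for a recorded utterance.
    """
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def mix_mono(a, b):
    """Average two equal-length signals into one channel, so the sources overlap."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def mix_stereo(a, b):
    """Hard-pan signal a to the left and b to the right as (left, right) frames."""
    return [(x, y) for x, y in zip(a, b)]


coconuts = tone(220)  # stand-in for "What a lovely bunch of coconuts."
fleas = tone(330)     # stand-in for "That dog certainly has fleas."

mono = mix_mono(coconuts, fleas)      # one channel, both sources summed together
stereo = mix_stereo(coconuts, fleas)  # two channels, one source in each
```

In the mono mix every sample already contains both sources, so the listener has to do all of the separating. In the stereo mix each channel still carries exactly one source, which is the property that makes the segregation task easier.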
Applying the concept of distinguishable sound streams to screen readers is an interesting endeavor. (Or, at least, I think so.) Screen readers currently rely on a single, serial stream of speech and sound to describe the multitasking, high-bandwidth graphical desktop. In a single-stream design, reports of peripheral information outside the application focus are either non-existent, delayed, or disruptive, and can be easily missed. For instance, if a screen reader is busy reading an email when the user receives an instant message in another application, the screen reader has to decide whether to keep reading the email or announce the new message in some manner. If the screen reader interjects, the user might confuse the instant message content with that of the email or become annoyed with the interruption. If the screen reader decides to wait for the email reading to finish, the late announcement about the chat message runs the risk of being stale.
Worse yet, any single-stream announcement of the new message can be inadvertently interrupted at any time by the next user command. In such a situation, unless the user tabs around looking for the new instant message or the chat program is set to play a sound every time a message is received (which still doesn’t indicate which of potentially many chats has the new message), the user may never learn of the existence of the new message.
Concurrent streams provide an answer to this peripheral awareness problem, but only if the screen reader can present them in a way that avoids masking other simultaneous streams. And this is exactly where the ability to spatialize sound helps. Without interrupting or modifying the stream of speech reading the email, another stream can pipe up and announce the new chat message with a sound, speech, or both according to the verbosity settings of the user (Zing! or “Message from Harvey” or “Harvey says ‘Hey! Stop reading your email and answer me! This is important!’”). As long as these streams are spatially separated according to some simple rules, the user will be able to effectively distinguish them, ignore one, listen to one, and switch attention back and forth between them.
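As a rough sketch of what "some simple rules" might look like, the Python below spreads a handful of named streams evenly across the stereo field and computes constant-power pan gains for each one. Everything here is an assumption made for illustration: the even-spacing rule, the function names, and the stream names are hypothetical, and plain left/right panning is a much cruder stand-in for true 3D spatialization.

```python
import math


def pan_gains(position):
    """Constant-power pan law.

    position runs from -1.0 (full left) to 1.0 (full right).
    Returns (left_gain, right_gain) with left^2 + right^2 == 1.
    """
    angle = (position + 1.0) * math.pi / 4  # maps [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)


def assign_positions(stream_names):
    """Hypothetical rule: spread streams evenly from far left to far right."""
    n = len(stream_names)
    if n == 1:
        return {stream_names[0]: 0.0}  # a lone stream sits in the center
    return {name: -1.0 + 2.0 * i / (n - 1) for i, name in enumerate(stream_names)}


# Illustrative stream names only; a real screen reader would choose its own.
positions = assign_positions(["email", "chat", "system"])
gains = {name: pan_gains(p) for name, p in positions.items()}
```

With these three streams, the email reading sits at the far left, the chat announcement in the center, and system events at the far right. Scaling each stream's samples by its two gains before summing the channels keeps perceived loudness roughly constant wherever a stream is placed.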
Instant messaging is just one example of a modern desktop application that begs for concurrent streams in screen readers. Just from looking at my GNOME desktop I see a system monitor, the clock applet, my network status, a popup balloon, and a log monitor all updating in the background while I write this post. Of course, a user can’t cope with all of these event sources reporting at once. But that’s where the interesting design problems start: how do we construct a usable multi-stream auditory display?
An open source library supporting spatial sound is a fundamental building block for this investigation (and, I’m certain, others). I hope Lennart pursues it.
Android speech synth (where are you?)
I took a peek at the Google Android class hierarchy today. As far as UI goes, it looks like there’s great support for 2D/3D visuals. There are some APIs for doing MIDI and sampled sound output. There’s even a class for doing speech recognition.
What I don’t see is anything supporting synthesized speech output. That’s a bit depressing. It would be a huge boon to have an open environment for developing mobile audio apps. Talking cell phones can be a bit pricey because they’re primarily intended as assistive technologies (i.e., small market). But I can imagine a ton of applications with speech-displays that could be useful to sighted and blind users alike: listening to your email while you walk instead of reading it on a tiny screen, announcements about upcoming meetings in your calendar, voice-jockey-like naming of songs about to come up on your MP3 playlist, spoken caller ID, …
Perhaps it’s possible to add custom classes to support FreeTTS or some other Java-accessible engine. However, it would be much nicer to have the speech API in the platform itself so it’s available everywhere. Maybe they left it out because all the free engines are too resource hungry? Somehow, I can’t imagine something like espeak being too bulky for a mobile platform.