HTML5 audio caching
One of my latest coding endeavors is a text-to-speech interface for JavaScript using HTML5 <audio> elements to output synthesized speech from a server. To reduce the latency between a speech request and actual speech output, I'm using various levels of caching. One of these is the regular browser disk cache based on HTTP headers.
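To make "levels of caching" a little more concrete, here is a minimal sketch of what an in-memory level sitting in front of the browser disk cache could look like. It is an assumption for illustration, not the code from the project: the name `createSpeechCache` and the idea of keying entries by utterance text are hypothetical.

```javascript
// Hypothetical in-memory cache level: maps utterance text to a value
// (for example, the URL of already-synthesized audio) and evicts the
// least recently used entry when full. Assumes maxEntries >= 1.
function createSpeechCache(maxEntries) {
    // a Map iterates its keys in insertion order, oldest first
    var entries = new Map();
    return {
        get: function(text) {
            if(!entries.has(text)) {
                return undefined;
            }
            // re-insert the entry to mark it as most recently used
            var value = entries.get(text);
            entries.delete(text);
            entries.set(text, value);
            return value;
        },
        set: function(text, value) {
            if(entries.has(text)) {
                // replacing an entry also refreshes its position
                entries.delete(text);
            } else if(entries.size >= maxEntries) {
                // the first key in iteration order is the least recently used
                entries.delete(entries.keys().next().value);
            }
            entries.set(text, value);
        },
        size: function() {
            return entries.size;
        }
    };
}
```

A lookup that misses here would fall through to an `<audio>` request, where the HTTP cache headers decide whether the server gets contacted.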
It turns out that browser caching behavior for <audio> data varies wildly among browsers. The following table shows the HTML5 <audio> caching behavior of various browsers. I tested all of them on OS X 10.6 with the standard Mac Apache server hosting all of the tested audio files.
| Browser | <audio> Behavior |
|---|---|
| Firefox 3.6 | Respects cache headers for the sound data. Only contacts the server when the cache item expires. <audio> elements pointing to the same src reuse the cache data. |
| Chrome 5.0.322.2 | Contacts the server on every load(). When it receives a 304 response, does not refetch content.* |
| Safari 4.0.4 | Contacts the server to fetch first two bytes of the audio file on every load(). Receives a 206 response with partial content. Fetches the additional bytes from the file. Receives another 206 response with the partial content. Performs another fetch and receives a 304 response with no data. Continues to alternate between fetches that receive 206 partial data responses and 304 not modified responses. Nothing appears to get cached. |
| Webkit r54921 | Same behavior as Safari 4.0.4. |
* Though not cache related, audio output in Chrome is often clipped before the end of the actual audio data. When this occurs, Chrome fires the onended event even before the audible output finishes.
Except for Firefox 3.6, all of these browsers seem to exhibit pretty terrible caching behavior when it comes to audio. I've reported bugs where I thought appropriate, but maybe I'm missing something. Am I supposed to include additional headers in the server-side response? Or maybe I'm glossing over some key part of the <audio> API? If so, please let me know. If not, yikes: <audio> support has definite room for improvement.
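For reference, the request/response exchange in the table can be sketched as a small function that decides between 200, 206, 304, and 416 for an audio file. This is a simplified illustration under stated assumptions, not a fix for the behavior above: `planAudioResponse` and the `file` fields are hypothetical names, only an exact `If-None-Match` comparison and a single `bytes=start-end` range are handled, and I make no claim that these headers change what Chrome or Safari do.

```javascript
// Hypothetical helper: decide the status and headers for a GET of an audio file.
// requestHeaders: object with lowercase header names, e.g. {'range': 'bytes=0-1'}
// file: {size: bytes, etag: quoted string, maxAge: seconds} (assumed shape)
function planAudioResponse(requestHeaders, file) {
    var headers = {
        'ETag': file.etag,
        'Cache-Control': 'public, max-age=' + file.maxAge,
        'Accept-Ranges': 'bytes'
    };
    // conditional requests are evaluated before Range; simplified to an
    // exact ETag match (no lists, no weak validators, no '*')
    if(requestHeaders['if-none-match'] === file.etag) {
        return {status: 304, headers: headers};
    }
    // only a single "bytes=start-end" or "bytes=start-" range is understood;
    // anything else is ignored and answered with the whole file
    var match = /^bytes=(\d+)-(\d*)$/.exec(requestHeaders['range'] || '');
    if(match) {
        var start = parseInt(match[1], 10);
        var last = match[2] === '' ? Infinity : parseInt(match[2], 10);
        // a range whose end precedes its start is invalid and ignored
        if(last >= start) {
            if(start >= file.size) {
                headers['Content-Range'] = 'bytes */' + file.size;
                return {status: 416, headers: headers};
            }
            var end = Math.min(last, file.size - 1);
            headers['Content-Range'] = 'bytes ' + start + '-' + end + '/' + file.size;
            headers['Content-Length'] = String(end - start + 1);
            return {status: 206, headers: headers};
        }
    }
    headers['Content-Length'] = String(file.size);
    return {status: 200, headers: headers};
}
```

Under these assumptions, Safari's initial two-byte probe (`Range: bytes=0-1`) gets a 206 with a two-byte body, and a revalidation that carries a matching `If-None-Match` gets a 304 with no body.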
pyttsx
pyttsx is a cross-platform text-to-speech package for Python. It has a simple API for producing speech, setting some basic engine properties, and getting start/stop/word callbacks. pyttsx currently supports SAPI5, NSSpeechSynthesizer, and espeak, but it can be extended to support other engines and libraries.
The project is BSD licensed and hosted on Launchpad. PyPI tracks downloads for the latest stable version and documentation.
Outfox in Greasemonkey revisited
There was some traffic in the Outfox group about my GMail announcer userscript failing in Outfox 0.3.x. The Outfox API has improved quite a bit since 0.1.0, so it's no surprise my script no longer works.
Here's a new example script that does work with the latest Outfox 0.3.5 release. Instead of polluting the example with all the complications of navigating the GMail DOM, I've picked a much simpler target. This script simply speaks the number of major sections (level 2 headings) in a Wikipedia article when the page loads. It's not as sexy, but the code is much easier to understand.
To try this script, make sure you have the Greasemonkey 0.8 and Outfox 0.3.5 extensions installed on Firefox 3.0 or 3.5. Then visit the following link to have GM install the script: citation_announcer.user.js.
    // ==UserScript==
    // @name Sections count
    // @namespace http://www.mindtrove.info/
    // @description Speaks the number of h2 sections in a Wikipedia article
    // @include http://*.wikipedia.org/wiki/*
    // @require http://www.json.org/json2.js
    // @require http://outfox.googlecode.com/svn/trunk/js/outfox.js
    // ==/UserScript==

    // number of major sections
    var sections = 0;

    function onOutfoxAudioInit(response) {
        // say the number of main sections
        outfox.audio.say(sections + ' main sections');
        // return the parameter for other outfox deferred callbacks
        return response;
    }

    function onOutfoxInit(version) {
        var content = document.getElementById('bodyContent');
        // count the number of main sections
        sections = content.getElementsByTagName('h2').length;
        // take one back for the TOC heading if it's present
        if(document.getElementById('toc')) {
            --sections;
        }
        // start the outfox audio service
        var def = outfox.startService('audio');
        def.addCallback(onOutfoxAudioInit);
        // return the parameter for other outfox deferred callbacks
        return version;
    }

    function onDOMContentLoaded() {
        // create a node for outfox use
        var div = document.createElement('div');
        document.body.appendChild(div);
        // initialize outfox
        var def = outfox.init(div, JSON.stringify, JSON.parse);
        def.addCallback(onOutfoxInit);
    }

    // this event triggers execution of the GM script
    onDOMContentLoaded();
Outfoxing Gmail with Greasemonkey
NOTE: The code in this post is out-of-date and does not work with recent versions of Outfox. See http://mindtrove.info/outfox-in-greasemonkey-revisited/ for a simpler, more compatible example. If you do update the GMail announcer code so it works with Outfox again, drop me a line and I'll link to your script.
Can you remember a time when the title of this blog post might have landed me in a straitjacket? Can you believe that was just a few short years ago? Yeah, I can't either.
Anyway, Gary's post Outfox: speech, sound, and more for Firefox talks about a new Firefox extension. He's using it to create cross-platform, self-voicing Web apps for kids with disabilities using a pure JS API. He hopes to extend his work to support alternative input devices such as game pads and switches as the Outfox extension matures and grows more flexible.
One of the other potential uses listed on the Outfox homepage is Adding new I/O to web sites with Greasemonkey. Interesting. It's one thing to include Outfox explicitly in a page, but can it possibly work when injected by GM? What about for a complex app like Gmail with multiple iframes, dynamic changes, refreshing, etc.?
To learn about Outfox (and for fun), I decided to write a quick GM script for Gmail that announces the senders and times of new messages (bold items) in the inbox. (I would have done subject and summary too, but Outfox 0.1.0 appears to have some unicode issues and balked at some of the Gmail separator characters. Less is more at this point.) The script makes the announcement when the Gmail interface first loads, any time Gmail automatically refreshes its inbox view, or when the user clicks the refresh link to check for new mail. It is smart enough to announce a given message only once, however, so you don't hear the same message over and over again on each refresh.
Yes. It does actually work.
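The announce-once behavior boils down to keeping a set of message ids that have already been spoken. Here is a simplified, DOM-free sketch of that bookkeeping; the function name `findNewMessages` and the `{id, text, unread}` row shape are assumptions made for illustration, since the real script reads this information out of Gmail's table rows.

```javascript
// Hypothetical helper: split the unread rows into ones never announced
// before (fresh) and the full set of currently unread rows (current).
// rows: array of {id, text, unread}
// announced: object used as a set of ids that were already spoken
function findNewMessages(rows, announced) {
    var fresh = {};
    var current = {};
    rows.forEach(function(row) {
        if(!row.unread) {
            return;
        }
        if(!Object.prototype.hasOwnProperty.call(announced, row.id)) {
            // never announced, so it should be spoken on this pass
            fresh[row.id] = row.text;
        }
        // current is everything unread: already announced plus fresh
        current[row.id] = row.text;
    });
    return {fresh: fresh, current: current};
}
```

After each pass that actually speaks, `current` replaces the announced set, so a message that stays unread across refreshes is heard only once, and a message that has been read drops out of the set.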
To try this script, make sure you have the Greasemonkey 0.8 and Outfox 0.1 extensions installed on Firefox 3. (Or use the latest available version of each.) Then visit the following link to have GM install the script: gmail_announcer.user.js.
For reference, the entire script is listed below:
    // ==UserScript==
    // @name Gmail Announcer
    // @namespace http://www.mindtrove.info/
    // @description Speaks new Gmail inbox messages using Outfox
    // @include https://mail.google.com/mail/*
    // @include http://mail.google.com/mail/*
    // @require http://outfox.googlecode.com/svn/trunk/js/outfox.js
    // ==/UserScript==

    var need_say = null;
    var ids = {};

    function sayMessages(msgs) {
        if(!outfox.defaults.config) {
            // outfox really needs a better way to detect ready ...
            need_say = msgs;
            return false;
        }
        var header = 'New messages';
        for(var id in msgs) {
            // say all messages
            var msg = msgs[id];
            var segs = msg.split('»');
            var sender = segs[0];
            var time = segs[1].slice(segs[1].search('…')+2);
            if(header) {
                outfox.say(header);
                header = null;
            }
            outfox.say(sender + ' at ' + time);
        }
        return true;
    }

    function onOutfoxReady() {
        if(need_say) {
            // say anything already queued
            sayMessages(need_say);
            ids = need_say;
            need_say = null;
        }
    }

    function onTableChange(event) {
        var div = event.target;
        var trs = div.getElementsByTagName('tr');
        var count = 0;
        var new_ids = {};
        var curr_ids = {};
        for(var i=0; i < trs.length; i++) {
            var tr = trs[i];
            if(tr.innerHTML.search('<b>') != -1) {
                // marked as a new message
                if(ids[tr.id] == undefined) {
                    // never announced
                    new_ids[tr.id] = tr.textContent;
                    ++count;
                }
                // curr is announced + new
                curr_ids[tr.id] = tr.textContent;
            }
        }
        // report if we can
        if(sayMessages(new_ids)) {
            ids = curr_ids;
        }
    }

    function onDocumentChange(event) {
        if(event.target.tagName == 'DIV') {
            var div = event.target;
            var tables = div.getElementsByTagName('table');
            for(var i in tables) {
                var table = tables[i];
                if(table.id != '' && !table.getAttribute('role')) {
                    // watch just table changes from now on
                    var div = table.parentNode.parentNode;
                    div.addEventListener('DOMNodeInserted', onTableChange, false);
                    document.removeEventListener('DOMNodeInserted', onDocumentChange, false);
                    // start outfox
                    var div = document.createElement('div');
                    document.body.appendChild(div);
                    outfox.init(div, onOutfoxReady);
                    // kick off initial read manually
                    onTableChange({'target' : table.parentNode});
                }
            }
        }
    }

    document.addEventListener('DOMNodeInserted', onDocumentChange, false);
Android speech synth (where are you?)
I took a peek at the Google Android class hierarchy today. As far as UI goes, it looks like there’s great support for 2D/3D visuals. There are some APIs for MIDI and sampled sound output. There’s even a class for speech recognition.
What I don’t see is anything supporting synthesized speech output. That’s a bit depressing. It would be a huge boon to have an open environment for developing mobile audio apps. Talking cell phones can be a bit pricey because they’re primarily intended as assistive technologies (i.e., small market). But I can imagine a ton of applications with speech-displays that could be useful to sighted and blind users alike: listening to your email while you walk instead of reading it on a tiny screen, announcements about upcoming meetings in your calendar, voice-jockey-like naming of songs about to come up on your MP3 playlist, spoken caller ID, …
Perhaps it’s possible to add custom classes to support FreeTTS or some other Java-accessible engine. However, it would be much nicer to have the speech API in the platform itself so it’s available everywhere. Maybe they left it out because all the free engines are too resource hungry? Somehow, I can’t imagine something like espeak being too bulky for a mobile platform.