E4X and the DOM

Reading through tonyg’s recent post I came across something I haven’t yet seen in use - inline XML within JavaScript code. E4X, it seems, has landed. It is now available by default in Firefox and Rhino - other implementations will surely follow.

E4X, shorthand for ECMAScript for XML, is a nice language extension to JavaScript that adds native XML support. It adds XML types, a notation for literal XML and some basic operations. Previously, if you wanted to use XML in your JavaScript code, you had two choices. Since XML has a textual representation, you could work with strings. This approach, however, is extremely error-prone, and is of limited use if you intend to do anything more sophisticated than just generating XML. The other approach is to use the XML DOM, which exposes the full power of XML through a consistent model, but is verbose and so rather unpleasant to use.

Example: XML using strings / innerHTML

// Short, but notice how I forgot to close the paragraph
// Also, this is non-standard, and only works in HTML
myElement.innerHTML = '<p><b>Hello</b> <i>World</i>';

Example: XML using the DOM

// That must be one of the longest hello world
// examples I've ever written
var paragraph = document.createElement('p');
var bold = document.createElement('b');
var hello = document.createTextNode('Hello');
bold.appendChild(hello);
var italic = document.createElement('i');
var world = document.createTextNode('World');
italic.appendChild(world);
var space = document.createTextNode(' ');
paragraph.appendChild(bold);
paragraph.appendChild(space);
paragraph.appendChild(italic);
myElement.appendChild(paragraph);
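
Much of that verbosity can be hidden behind a small helper function. The following is only a sketch: el is a hypothetical name, not part of the DOM API, and it assumes that doc offers createElement and createTextNode the way the browser's document does.

```javascript
// Hypothetical helper (not part of the DOM API): builds an element and
// appends its children in one call. `doc` is any object that offers
// createElement and createTextNode, such as the browser's `document`.
function el(doc, tag) {
  var node = doc.createElement(tag);
  for (var i = 2; i < arguments.length; i++) {
    var child = arguments[i];
    // Plain strings become text nodes; anything else is assumed to be a node.
    if (typeof child === 'string') {
      child = doc.createTextNode(child);
    }
    node.appendChild(child);
  }
  return node;
}

// Usage in a browser, where `myElement` is an existing element:
// myElement.appendChild(
//   el(document, 'p', el(document, 'b', 'Hello'), ' ', el(document, 'i', 'World')));
```

This is shorter to call, but the structure is still spelled out as nested function calls rather than written as markup.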

As it happens, I am working on something that requires quite a lot of DOM manipulation within the browser, and, tired of constructing XML using the DOM API, I set out to give the new E4X capabilities of Firefox 1.5 a try. The disappointing reality, I soon found out, is that while E4X is very much present, it can’t be used for accessing or creating DOM elements. So if you plan on parsing some XML data, or generating XML from your program, you can use E4X, but DOM manipulation, arguably the most important activity involving XML in a browser, is not served by this new extension at all.

Example: How E4X could be used with the DOM

// This is structured XML, notice how there are no quotes
var p_xml = <p><b>Hello</b> <i>World</i></p>;
// But unfortunately you can't do that
var p_element = document.createElement(p_xml);
myElement.appendChild(p_element);

JavaScript is a complete, general-purpose language, but in practice it is used almost exclusively as an extension language for host environments. In Firefox, for example, it is used for adding program logic to the browser’s display formats - HTML, XUL and SVG. These formats can be expressed in text, but in order to manipulate them you need to access them using the DOM. For HTML, Firefox adopted the nasty non-standard innerHTML extension, which allows the user to access the contents of a node as text. Fortunately, this extension doesn’t work with non-HTML elements. E4X could have been the perfect replacement - a compromise between using the dumb textual representation and the structured, but counter-intuitive DOM.

Why doesn’t Firefox provide a way to construct and manipulate DOM elements using E4X? It’s hard to blame the Mozilla developers, given that the ECMA standard does not include any mention of the DOM or how to interact with it. Any extension they came up with would end up being the next non-standard innerHTML.

This failure of the E4X standard, together with tonyg’s previous critique of E4X and other rumours from the JavaScript development arena, has me wondering whether the standardisation efforts by ECMA have really benefited the language and its active community.

Estimating the number of blog subscriptions

Estimating the number of readers of plain web pages is relatively straightforward. It can be done either with tools like Webalizer and Analog, which analyse the web server’s access log, or with counters, ranging from simplistic ‘number of visitors’ dynamic images to the sophisticated Google Analytics, which use cross-domain resource loading to gather information about the readers accessing the site in a central database.

Unlike traditional website visitors, most readers of a blog use a news aggregator to periodically pull new items from the blog’s syndication feed. As a result, the correlation between the number of requests and the number of times an item is read is broken. To confuse things even more, many readers use a public aggregator service (like Bloglines and LiveJournal), which saves the feed to a central repository and serves the saved entries to many readers. For such services, growth in the number of subscribers is not reflected by an increase in the number of requests made.

To get a rough estimate of the number of subscribers to a feed we need to distinguish between requests made by public services on behalf of more than one user and requests made by individual news aggregators. Fortunately, a de facto convention has evolved which allows public aggregators to identify themselves as such and report the number of subscribers to the feed. They do this by including the number of subscribers in the user-agent request header (unfortunately no real standard exists yet, and every aggregator uses a slightly different format). All other requests are from individuals, and the number of unique IP addresses making them roughly corresponds to the number of subscribers. In my analysis I decided to restrict this number to those addresses from which at least three requests were made during a twenty-four hour period. That way we don’t take into account users who stumbled upon the XML feed without actually intending to subscribe to it (because they wanted to copy the feed’s URL from their browser’s address bar, for example, or because they use a preemptive caching mechanism like the Google Toolbar).
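
The two rules above can be sketched roughly as follows. This is not the Blogalizer code, just an illustration: the "<number> subscribers" pattern is an assumption (real aggregators each use their own format), and the function names and the { ip, userAgent } record shape are made up for the example.

```javascript
// Returns the subscriber count reported in a user-agent string, or null
// if none is found. The "<number> subscriber(s)" pattern is an assumption;
// real aggregators each use their own format.
function reportedSubscribers(userAgent) {
  var match = /(\d+)\s+subscribers?/i.exec(userAgent);
  return match ? parseInt(match[1], 10) : null;
}

// Counts the addresses that made at least `minRequests` requests.
// `requests` is an array of { ip, userAgent } records (a hypothetical
// shape) that the caller has already limited to one 24-hour period.
function countIndividualSubscribers(requests, minRequests) {
  var hits = Object.create(null); // ip -> number of requests
  requests.forEach(function (req) {
    if (reportedSubscribers(req.userAgent) !== null) {
      return; // public aggregator, counted separately
    }
    hits[req.ip] = (hits[req.ip] || 0) + 1;
  });
  return Object.keys(hits).filter(function (ip) {
    return hits[ip] >= minRequests;
  }).length;
}
```

Calling countIndividualSubscribers(requests, 3) applies the three-requests threshold described above.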

Finally, for most blogs there’s more than one way to get a feed. There are several popular feed formats (RSS, Atom and a few others, each having several different versions), and the blogging software we use may have more than one URL for getting a feed (for example, one using query parameters and one using a static resource), and an aggregator may subscribe to more than one of these. For public aggregators it is sensible to add up the number of subscribers for each version. For private aggregators we can ignore redundant requests from the same address (they are probably being read by the same person anyway).
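
Putting the pieces together for several feed URLs might look something like this. Again, it is only a sketch under assumptions: the { ip, userAgent, url } record shape, the "<number> subscribers" pattern and the function name are hypothetical, and keeping the highest count an aggregator reported for each URL is my own simplification.

```javascript
// Assumed user-agent format: "... <number> subscribers ...".
var SUBSCRIBERS = /(\d+)\s+subscribers?/i;

// Estimates the total number of subscribers from one day of requests.
// `requests` is an array of { ip, userAgent, url } records (hypothetical).
function estimateSubscribers(requests, minRequests) {
  var publicCounts = Object.create(null); // "aggregator|url" -> reported count
  var privateHits = Object.create(null);  // ip -> requests across all feed URLs
  requests.forEach(function (req) {
    var match = SUBSCRIBERS.exec(req.userAgent);
    if (match) {
      // Public aggregator: keep the highest count it reported for this URL.
      var key = req.userAgent.replace(SUBSCRIBERS, '') + '|' + req.url;
      var count = parseInt(match[1], 10);
      publicCounts[key] = Math.max(publicCounts[key] || 0, count);
    } else {
      // Individual reader: one address counts once, whatever URLs it fetches.
      privateHits[req.ip] = (privateHits[req.ip] || 0) + 1;
    }
  });
  var total = 0;
  Object.keys(publicCounts).forEach(function (key) {
    total += publicCounts[key];
  });
  Object.keys(privateHits).forEach(function (ip) {
    if (privateHits[ip] >= minRequests) {
      total += 1;
    }
  });
  return total;
}
```

Subscriber counts are summed per feed URL for public aggregators, while an individual address is counted at most once across all URLs.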

The estimate we get is still very inaccurate (and probably too low). First, not all public aggregators bother reporting the number of subscribers they serve. Google, Yahoo and MSN, for example, have a very large user base and most definitely access our feed on a regular basis, but we simply don’t know how many users hide behind them (and to make things worse, some of them may access our feeds from different addresses on different occasions, causing us to record them more than once even if they don’t have more than one subscriber). Likewise, subscribers not using a fixed address (dial-up and mobile phone subscribers, subscribers behind anonymising proxies) may cause slightly inflated figures too. Finally, some readers don’t use feed aggregators at all, instead reading the blog by occasionally visiting its HTML version in a browser.

There is another metric I could be extracting from the logs which I did not, so far, bother with. Following the logs over time, it would be nice to identify the relationship between requests to the HTML version of the blog and the number of subscriptions to the feed. Some blog entries must act as conversion points - people read them and then decide to subscribe to the feed. It would be interesting to know which entries are successful at recruiting new subscribers, because for a heterogeneous blog (with many writers, styles and categories) it is often difficult to know what readers are most interested in. I may try to add this in the future.

If you too are curious about the number of subscribers to your blog (and have access to the HTTP access log of the server hosting it) you can give my little script, Blogalizer, a try. Your questions, suggestions and improvements are, naturally, very welcome.

Console DJ

pytone is a terminal jukebox … sort of the curses equivalent of iTunes. found out about it earlier today from the uncle, and i’ll never be the same again. if you’re on the lookout for a decent player, give pytone a try - you’d be surprised!