Sunday, December 28, 2008

YQL and Greasemonkey

YQL - Background information

Among other things, YQL can convert any flavor of HTML into well-formed XML. The conversion takes place on Yahoo!'s servers and therefore puts no extra load on the browser.

To this end, the URL of the page to be converted is embedded in a new URL that queries the YQL engine, which in turn fetches the original HTML, performs the conversion, and outputs well-formed XML (including the opening <?xml... declaration). The XML response consists of a <query> root node containing a <diagnostics> element with some metadata about the query and a <results> element with the XML subtrees under the nodes that match the XPath selector. This will become clearer once you try the examples below.

The (most basic) structure of a YQL query for such a conversion results from the concatenation of the following elements:
  1. http://query.yahooapis.com/v1/public/yql?q=
  2. select * from html where url = "http://www.example.com"
  3. & format=xml

Spaces, quotes and other special characters must be URL-encoded, so the final URL looks like this:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.example.com%22&format=xml
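
The concatenation and encoding steps above can be sketched in a few lines of JavaScript. This is only an illustration: buildYqlUrl and YQL_ENDPOINT are hypothetical names, and the sketch assumes that the page URL itself contains no double quotes, which would break the query.

```javascript
// Base endpoint of the public YQL service, as given in the text.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql?q=";

// Hypothetical helper: build the URL that asks YQL to convert a page to XML.
// Assumes pageUrl contains no double quotes.
function buildYqlUrl(pageUrl) {
  var query = 'select * from html where url="' + pageUrl + '"';
  // encodeURIComponent escapes spaces, double quotes, "=", ":" and "/".
  return YQL_ENDPOINT + encodeURIComponent(query) + "&format=xml";
}

var exampleUrl = buildYqlUrl("http://www.example.com");
```

Here exampleUrl should be the same string as the encoded URL shown above.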

Example you can try:

Source page: http://www.google.com
YQL query url: http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.google.com%22&format=xml

Furthermore, the output of a YQL query can be restricted to some portion of the document provided that this portion can be unambiguously identified by means of an XPath selector. To this end, it suffices to extend the query part ('2' in the list above) of the url in the following way:
  1. http://query.yahooapis.com/v1/public/yql?q=
  2. select * from html where url = "http://www.example.com"
    and xpath = "//div[@class='foo']/a"
  3. & format=xml
Example you can try:
  1. http://query.yahooapis.com/v1/public/yql?q=
  2. select * from html where url = "http://www.google.com"
    and xpath = "//a[@class='gb2']"
  3. & format=xml
which results in the following url:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.google.com%22%20and%20xpath%3D%22%2F%2Fa%5B%40class%3D%27gb2%27%5D%22&format=xml

This fetches all DOM nodes corresponding to anchors of class 'gb2'.
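
A query restricted by an XPath selector can be assembled in the same way. The following is a minimal sketch: buildYqlXpathUrl is a hypothetical helper name, and it assumes that neither argument contains a double quote.

```javascript
// Hypothetical helper: build a YQL URL whose results are restricted to the
// nodes matching an XPath selector. Assumes no double quotes in the arguments.
function buildYqlXpathUrl(pageUrl, xpath) {
  var query = 'select * from html where url="' + pageUrl + '"' +
              ' and xpath="' + xpath + '"';
  return "http://query.yahooapis.com/v1/public/yql?q=" +
         encodeURIComponent(query) + "&format=xml";
}

var gb2Url = buildYqlXpathUrl("http://www.google.com", "//a[@class='gb2']");
```

Note that encodeURIComponent leaves apostrophes unescaped, so gb2Url contains 'gb2' rather than %27gb2%27. Both forms decode to the same query.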

Keep in mind that the YQL engine alters the original HTML in ways that cannot always be anticipated. Some examples:

* Replacement of the HTML <b> tag with the <strong> tag.
Example:
Source: <b>Some text</b>
Output: <strong>Some text</strong>

* Replacement of a single whitespace character in the HTML code with a run of whitespace characters, possibly containing newlines.
Example:
Source: <span>Some text</span>
Output: <span>Some
text
</span>


This can be tricky if you test text nodes for equality against a string, as in the XPath "//span[text()='Some text']". For this reason it is advisable to first run a query against the whole document without an XPath, and then construct the XPath based on the output of that query. XPath also provides some basic string functions that help with such problems, for example normalize-space(), which can be used instead of text(), as in "//span[normalize-space()='Some text']".
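
The same whitespace problem shows up when the fetched text is compared inside a script rather than in the XPath. As a rough sketch, the hypothetical helper normalizeSpace below imitates what the XPath function normalize-space() does to a string. The sample value of fromYql is made up for illustration and is not actual YQL output.

```javascript
// Imitates XPath normalize-space(): collapses each run of spaces, tabs,
// carriage returns and newlines into one space, then trims both ends.
function normalizeSpace(text) {
  return text.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// Made-up sample of a text node after YQL has reflowed its whitespace.
var fromYql = "Some\n   text\n";

// A direct comparison fails, the normalized comparison succeeds.
var matchesRaw = (fromYql === "Some text");
var matchesNormalized = (normalizeSpace(fromYql) === "Some text");
```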

YQL and Greasemonkey

In the context of Greasemonkey scripts, YQL can reduce both the computation on the client side and the bandwidth used, since a query can fetch only the portion of the data that is relevant to the script.

A known limitation of the Greasemonkey API function GM_xmlhttpRequest is that, unlike its model XMLHttpRequest, the response object passed to the onload handler lacks the responseXML property. The usual workaround is to use the responseText property, which holds the fetched document as a single string. This string can be converted into a DOM tree by means of the DOMParser object:

var parser = new DOMParser();
var doc = parser.parseFromString(your_xml_string, "text/xml");

or, alternatively, into an XML object through the XML() constructor (an E4X feature introduced in JavaScript 1.6 and also available to Greasemonkey scripts):

var doc = new XML(your_xml_string);

depending on what you intend to do with it. See also https://developer.mozilla.org/en/Parsing_and_serializing_XML.
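
Putting the pieces together, a Greasemonkey script might fetch a YQL URL and parse the response roughly as follows. This is a sketch under stated assumptions rather than a finished script: getResultElements and fetchYqlElements are hypothetical names, error handling is omitted, and the code only runs where GM_xmlhttpRequest and DOMParser are available (that is, inside a user script).

```javascript
// Hypothetical helper: return the element children of <results>
// in a parsed YQL response document.
function getResultElements(doc) {
  var results = doc.getElementsByTagName("results")[0];
  var elements = [];
  if (!results) { return elements; }
  for (var i = 0; i < results.childNodes.length; i++) {
    var node = results.childNodes[i];
    if (node.nodeType === 1) { elements.push(node); } // 1 = ELEMENT_NODE
  }
  return elements;
}

// Hypothetical helper: fetch a YQL URL and pass the matching elements
// to a callback.
function fetchYqlElements(yqlUrl, callback) {
  GM_xmlhttpRequest({
    method: "GET",
    url: yqlUrl,
    onload: function (response) {
      // responseXML is not available, so parse responseText instead.
      var parser = new DOMParser();
      var doc = parser.parseFromString(response.responseText, "text/xml");
      callback(getResultElements(doc));
    }
  });
}
```

A script would call fetchYqlElements with a YQL URL and a function that processes the returned array of elements.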

I've written a small Greasemonkey script that illustrates the use of YQL. Check it out!