YQL can, among other things, convert any flavor of html into well-formed xml. The conversion takes place server side, on the Yahoo! servers, so it places no extra load on the browser client.
To this end, the url of the page to be converted is embedded in a new url that queries the YQL engine. The engine in turn fetches the original html code, performs the conversion and outputs well-formed xml (including the opening <?xml... declaration). The xml response consists of a <query> root node that holds a <diagnostics> element containing some metadata about the query, and a <results> element containing the xml subtrees under the nodes that match the xpath selector. This will become clearer after you try the examples below.
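The response layout described above can be walked with a few lines of DOM code. The following is only a sketch: the function name getYqlResultNodes is made up for illustration, and it assumes the response has already been parsed into an XML document whose root is <query>.

```javascript
// Hypothetical helper: returns the element children of <results>
// from a parsed YQL response. `doc` is assumed to be an XML Document.
function getYqlResultNodes(doc) {
  var root = doc.documentElement;            // expected to be <query>
  if (!root || root.nodeName !== "query") {
    return [];
  }
  var matches = [];
  for (var i = 0; i < root.childNodes.length; i++) {
    var child = root.childNodes[i];
    // Skip <diagnostics> and whitespace text nodes, keep only <results>.
    if (child.nodeType === 1 && child.nodeName === "results") {
      for (var j = 0; j < child.childNodes.length; j++) {
        var node = child.childNodes[j];
        if (node.nodeType === 1) {           // keep element nodes only
          matches.push(node);
        }
      }
    }
  }
  return matches;
}
```

Looking only at direct children of the root avoids picking up an unrelated element that happens to be called "results" somewhere inside the converted page.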
The (most basic) structure of a YQL query for such a conversion results from the concatenation of the following elements:
- http://query.yahooapis.com/v1/public/yql?q=
- select * from html where url = "http://www.example.com"
- &format=xml
After url-encoding the query part, this yields:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.example.com%22&format=xml
Example you can try:
Source page: http://www.google.com
YQL query url: http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.google.com%22&format=xml
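The concatenation can also be done in a script instead of by hand. This is a minimal sketch, not an official API: buildYqlUrl and YQL_ENDPOINT are names invented here, and the page url is assumed to contain no double quotes.

```javascript
// Endpoint as given in the text above.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql?q=";

// Hypothetical helper: builds the YQL url that converts `pageUrl` to xml.
function buildYqlUrl(pageUrl) {
  // The YQL statement in plain text; assumes pageUrl has no double quotes.
  var query = 'select * from html where url="' + pageUrl + '"';
  // Only the query part is url-encoded; the format parameter is appended as is.
  return YQL_ENDPOINT + encodeURIComponent(query) + "&format=xml";
}
```

Calling buildYqlUrl("http://www.google.com") produces the query url shown in the example above.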
Furthermore, the output of a YQL query can be restricted to a portion of the document, provided that this portion can be unambiguously identified by an XPath selector. To do so, it suffices to extend the query part of the url (the second element in the list above) in the following way:
- http://query.yahooapis.com/v1/public/yql?q=
- select * from html where url = "http://www.example.com" and xpath = "//div[@class='foo']/a"
- &format=xml
Example you can try:
- http://query.yahooapis.com/v1/public/yql?q=
- select * from html where url = "http://www.google.com" and xpath = "//a[@class='gb2']"
- &format=xml
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.google.com%22%20and%20xpath%3D%22%2F%2Fa%5B%40class%3D%27gb2%27%5D%22&format=xml
This fetches all DOM nodes corresponding to anchors of class 'gb2'.
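The XPath-restricted url can be assembled in a script as well. The sketch below uses an invented name (buildYqlXPathUrl) and assumes that neither the page url nor the xpath contains double quotes.

```javascript
// Hypothetical helper: builds a YQL url that returns only the nodes of
// `pageUrl` matching `xpath`. Assumes neither argument has double quotes.
function buildYqlXPathUrl(pageUrl, xpath) {
  var query = 'select * from html where url="' + pageUrl + '"' +
              ' and xpath="' + xpath + '"';
  return "http://query.yahooapis.com/v1/public/yql?q=" +
         encodeURIComponent(query) + "&format=xml";
}

// Usage: all anchors of class 'gb2' on the Google front page.
var gb2Url = buildYqlXPathUrl("http://www.google.com", "//a[@class='gb2']");
```

Note that encodeURIComponent leaves single quotes unescaped, so the generated url contains ' where a hand-written url might contain %27; both forms decode to the same query.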
It must be taken into account that the YQL engine performs some manipulations on the original html code that cannot be anticipated in all cases. Some examples include:
* Substitution of the HTML <b> tag by the <strong> tag.
Example:
Source: <b>Some text</b>
Output: <strong>Some text</strong>
* Substitution of a single white space character in the HTML code by a chain of white space characters, possibly containing newlines.
Example:
Source: <span>Some text</span>
Output: <span>Some
text</span>
This can be tricky if you test text nodes for equality against a string, as in the xpath "//span[text()='Some text']". For this reason it is advisable to first run a query against the whole document without an xpath, and then construct the xpath based on the output of that first query. XPath also has some basic text manipulation functions built in that help to cope with such problems, for example normalize-space() in place of text(), as in "//span[normalize-space()='Some text']".
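The whitespace problem can be handled on either side of the request. The sketch below shows both: a helper that builds a whitespace-tolerant xpath, and a client-side function that imitates what XPath's normalize-space() does so that fetched text can be compared safely. Both function names are invented for illustration, and the text to match is assumed to contain no single quotes.

```javascript
// Hypothetical helper: xpath matching <tagName> elements whose text equals
// `text` once whitespace is normalized. Assumes `text` has no single quotes.
function textMatchXPath(tagName, text) {
  return "//" + tagName + "[normalize-space()='" + text + "']";
}

// Client-side imitation of XPath normalize-space(): collapses runs of
// space, tab, CR and LF into one space and trims both ends.
function normalizeSpace(text) {
  return text.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// Usage: compare text that YQL may have re-wrapped across lines.
var isMatch = normalizeSpace("Some\n      text") === "Some text";
```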
YQL and Greasemonkey
In the context of Greasemonkey scripts, this can be used to reduce both the computation on the client side and the bandwidth used, since a YQL query can fetch only the portion of the data that is relevant to the script.
A known limitation of the Greasemonkey API function GM_xmlhttpRequest is that, unlike the XMLHttpRequest object it is modeled on, the response object passed to its onload handler has no responseXML property. The usual way to work around this is to use the responseText property, which holds the fetched document as a single string. This can be converted into a DOM tree by means of the DOMParser object:
var parser = new DOMParser();
var doc = parser.parseFromString(your_xml_string, "text/xml");
or alternatively into an XML object through the XML() constructor (an E4X feature introduced in JavaScript 1.6, also available to Greasemonkey scripts):
var doc = new XML(your_xml_string);
depending on what you intend to do with it, and how. See also https://developer.mozilla.org/en/Parsing_and_serializing_XML.
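Put together, a request through GM_xmlhttpRequest followed by DOMParser might look like the sketch below. The function name fetchYqlDocument and the convention of passing null on failure are assumptions made for this example; the code expects to run inside a Greasemonkey script, where GM_xmlhttpRequest and DOMParser are available.

```javascript
// Hypothetical helper: fetches a YQL url and hands the parsed XML
// document to `callback`. Must run where GM_xmlhttpRequest exists.
function fetchYqlDocument(yqlUrl, callback) {
  GM_xmlhttpRequest({
    method: "GET",
    url: yqlUrl,
    onload: function (response) {
      // No responseXML here, so parse responseText by hand.
      var parser = new DOMParser();
      var doc = parser.parseFromString(response.responseText, "text/xml");
      callback(doc);
    },
    onerror: function () {
      callback(null); // assumed convention: null signals a failed request
    }
  });
}
```

A script would call it with a YQL query url and work on the document inside the callback, for example by reading the nodes under <results>.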
I've written a small Greasemonkey script that illustrates the use of YQL, check it out!