While XQuery was designed for querying large document bases, it serves as a fine tool for transforming simple documents as well: simplifying complex pages for display on small screens, extracting elements from multiple pages to aggregate them on a home-grown portal, or simply pulling data out of Web pages when there is no other programmatic way to get it. This article shows how XQuery offers a fast and easy way to scrape HTML pages for the data you need.
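To make that concrete, here is a minimal sketch of the kind of query involved, run from Java through Saxon’s s9api interface. Saxon itself, the file name page.xhtml, and the link-listing query are assumptions for illustration, not the article’s actual listing; the input is assumed to be well-formed XHTML already.

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class ScrapeLinks {
        public static void main(String[] args) throws SaxonApiException {
            Processor proc = new Processor(false);  // Saxon-HE, non-schema-aware
            XdmNode page = proc.newDocumentBuilder()
                    .build(new StreamSource(new File("page.xhtml")));

            // List every link target and its text from the XHTML page.
            XQueryExecutable exec = proc.newXQueryCompiler().compile(
                    "declare default element namespace 'http://www.w3.org/1999/xhtml';\n"
                  + "for $a in //a[@href]\n"
                  + "return concat($a/@href, ' -> ', normalize-space($a))");
            XQueryEvaluator qe = exec.load();
            qe.setContextItem(page);
            for (XdmItem item : qe.evaluate())
                System.out.println(item.getStringValue());
        }
    }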
If your page is not valid XHTML, or is malformed in general, then tools like XQuery have a hard time getting at the content. They rely on a well-formed DOM to work.
Thought as much. Badly constructed web pages are rather common.
I’ll stick to Perl’s HTML::TreeBuilder, thanks.
Perl’s HTML::TreeBuilder isn’t as good at dealing with broken HTML as HTML Tidy.
Typically, for screen scraping, people run the page through HTML Tidy to get XHTML, then extract the nodes they need with XPath, or just reformat the whole thing with XSLT.
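For what it’s worth, that whole pipeline is only a few lines in Java with JTidy and the standard JAXP XPath API. This is a rough sketch: the file name page.html and the //h2 path are made up for illustration.

    import org.w3c.tidy.Tidy;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import java.io.FileInputStream;

    public class TidyThenXPath {
        public static void main(String[] args) throws Exception {
            // Round 1: let JTidy repair the tag soup into a well-formed DOM.
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);          // produce XHTML-style output
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(new FileInputStream("page.html"), null);

            // Round 2: pull out the nodes you want with a plain XPath expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList headings =
                    (NodeList) xpath.evaluate("//h2", doc, XPathConstants.NODESET);
            for (int i = 0; i < headings.getLength(); i++) {
                Node text = headings.item(i).getFirstChild();
                System.out.println(text == null ? "" : text.getNodeValue());
            }
        }
    }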
As people have mentioned, if the HTML is not valid these parsers fail. But I have had good luck with HotSAX and good old regexes.
http://hotsax.sourceforge.net/
-best
-greg
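For illustration, the regex half of that approach might look like the following in Java. A crude sketch: the pattern and the file name page.html are invented, and a regex will of course miss edge cases that a real parser handles.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexScrape {
        public static void main(String[] args) throws Exception {
            String html = Files.readString(Path.of("page.html"));
            // Crude but tolerant of invalid markup: grab href targets directly.
            Pattern link = Pattern.compile(
                    "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = link.matcher(html);
            while (m.find())
                System.out.println(m.group(1));
        }
    }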
Bah… Perl’s LWP is like 1000x better than this, and it’s been around forever.
Plus you can even do stuff like log in to a remote website, share cookies back and forth, keep track of state, etc. You can’t do that with this.
And it’s super easy to do in Perl.
http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
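LWP aside, that kind of stateful fetching (log in, keep the session cookie, request pages behind the login) is not unique to Perl. Here is a sketch with Java’s built-in HttpClient and CookieManager; the URLs and form fields are invented for the example.

    import java.net.CookieManager;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class StatefulFetch {
        public static void main(String[] args) throws Exception {
            // A shared cookie jar keeps session state across requests,
            // much like LWP::UserAgent's cookie_jar does in Perl.
            HttpClient client = HttpClient.newBuilder()
                    .cookieHandler(new CookieManager())
                    .build();

            // Log in with a form POST; the Set-Cookie header is retained.
            HttpRequest login = HttpRequest
                    .newBuilder(URI.create("https://example.com/login"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString("user=me&pass=secret"))
                    .build();
            client.send(login, HttpResponse.BodyHandlers.discarding());

            // Later requests on the same client send the session cookie back.
            HttpRequest page = HttpRequest
                    .newBuilder(URI.create("https://example.com/members"))
                    .build();
            String html = client.send(page, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(html);
        }
    }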
LWP has nothing to do with this. This article is about parsing HTML documents and extracting data from them, not about retrieving HTML documents from a website.
In the article they run the output through JTidy before looking at it with XQuery, so malformed HTML is not an issue here. However, with two rounds of processing, speed may be a problem, depending on how efficient the tools are.
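If the two rounds are a speed worry, they can at least be chained in memory rather than through a temporary file. A sketch of that plumbing, assuming JTidy plus a Saxon-style XQuery engine (the //title query and the file name page.html are just examples):

    import net.sf.saxon.s9api.*;
    import org.w3c.tidy.Tidy;
    import javax.xml.transform.stream.StreamSource;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;

    public class TidyThenXQuery {
        public static void main(String[] args) throws Exception {
            // Round 1: tidy the raw HTML into an in-memory XHTML buffer.
            ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            tidy.parse(new FileInputStream("page.html"), cleaned);

            // Round 2: hand the buffer straight to the XQuery engine, no temp file.
            Processor proc = new Processor(false);
            XdmNode doc = proc.newDocumentBuilder().build(new StreamSource(
                    new ByteArrayInputStream(cleaned.toByteArray())));
            XQueryEvaluator qe = proc.newXQueryCompiler().compile(
                    "declare default element namespace 'http://www.w3.org/1999/xhtml';"
                  + " string(//title)").load();
            qe.setContextItem(doc);
            System.out.println(qe.evaluateSingle().getStringValue());
        }
    }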