While XQuery was designed for querying large document bases, it serves as a fine tool for transforming simple documents as well: simplifying complex pages for display on small screens, extracting elements from multiple pages to aggregate them on a home-grown portal, or simply pulling data out of Web pages when there is no other programmatic way to get it. This article shows how XQuery offers a fast and easy way to scrape HTML pages for the data you need.
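To make that concrete, here is a minimal sketch of the kind of query involved, run from Java through Saxon’s s9api interface. Saxon itself, the file name page.xhtml, and the link-listing query are assumptions for illustration, not the article’s actual listing; the input is assumed to be well-formed XHTML already.

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class ScrapeLinks {
        public static void main(String[] args) throws SaxonApiException {
            Processor proc = new Processor(false);  // Saxon-HE, non-schema-aware
            XdmNode page = proc.newDocumentBuilder()
                    .build(new StreamSource(new File("page.xhtml")));

            // List every link target and its text from the XHTML page.
            XQueryExecutable exec = proc.newXQueryCompiler().compile(
                    "declare default element namespace 'http://www.w3.org/1999/xhtml';\n"
                  + "for $a in //a[@href]\n"
                  + "return concat($a/@href, ' -> ', normalize-space($a))");
            XQueryEvaluator qe = exec.load();
            qe.setContextItem(page);
            for (XdmItem item : qe.evaluate())
                System.out.println(item.getStringValue());
        }
    }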
If your page is not valid XHTML, or is malformed in general, then tools like XQuery have a hard time getting at the content. They rely on a well-formed DOM to work.
Thought as much. Badly constructed web pages are rather common.
I’ll stick to Perl’s HTML::TreeBuilder, thanks.
Perl’s HTML::TreeBuilder isn’t as good at dealing with broken HTML as HTML Tidy.
Typically, for screen scraping, people run the page through HTML Tidy to get XHTML, then extract the nodes they need with XPath, or just reformat the whole thing with XSLT.
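For what it’s worth, that whole pipeline is only a few lines in Java with JTidy and the standard JAXP XPath API. This is a rough sketch: the file name page.html and the //h2 path are made up for illustration.

    import org.w3c.tidy.Tidy;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import java.io.FileInputStream;

    public class TidyThenXPath {
        public static void main(String[] args) throws Exception {
            // Round 1: let JTidy repair the tag soup into a well-formed DOM.
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);          // produce XHTML-style output
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(new FileInputStream("page.html"), null);

            // Round 2: pull out the nodes you want with a plain XPath expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList headings =
                    (NodeList) xpath.evaluate("//h2", doc, XPathConstants.NODESET);
            for (int i = 0; i < headings.getLength(); i++) {
                Node text = headings.item(i).getFirstChild();
                System.out.println(text == null ? "" : text.getNodeValue());
            }
        }
    }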
As people have mentioned, if the HTML is not valid these parsers fail. But I have had good luck with HotSAX and good old regexes.
http://hotsax.sourceforge.net/
-best
-greg
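For illustration, the regex half of that approach might look like the following in Java. A crude sketch: the pattern and the file name page.html are invented, and a regex will of course miss edge cases that a real parser handles.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexScrape {
        public static void main(String[] args) throws Exception {
            String html = Files.readString(Path.of("page.html"));
            // Crude but tolerant of invalid markup: grab href targets directly.
            Pattern link = Pattern.compile(
                    "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = link.matcher(html);
            while (m.find())
                System.out.println(m.group(1));
        }
    }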
Bah… Perl’s LWP is like 1000x better than this, and it’s been around forever.
Plus you can even do stuff like log in to a remote website, share cookies back and forth, keep track of state, etc. You can’t do that with this.
And it’s super easy to do in Perl.
http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
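LWP aside, that kind of stateful fetching (log in, keep the session cookie, request pages behind the login) is not unique to Perl. Here is a sketch with Java’s built-in HttpClient and CookieManager; the URLs and form fields are invented for the example.

    import java.net.CookieManager;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class StatefulFetch {
        public static void main(String[] args) throws Exception {
            // A shared cookie jar keeps session state across requests,
            // much like LWP::UserAgent's cookie_jar does in Perl.
            HttpClient client = HttpClient.newBuilder()
                    .cookieHandler(new CookieManager())
                    .build();

            // Log in with a form POST; the Set-Cookie header is retained.
            HttpRequest login = HttpRequest
                    .newBuilder(URI.create("https://example.com/login"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString("user=me&pass=secret"))
                    .build();
            client.send(login, HttpResponse.BodyHandlers.discarding());

            // Later requests on the same client send the session cookie back.
            HttpRequest page = HttpRequest
                    .newBuilder(URI.create("https://example.com/members"))
                    .build();
            String html = client.send(page, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(html);
        }
    }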
LWP has nothing to do with this. This article is about parsing HTML documents and extracting data from them, not about retrieving HTML documents from a website.
In the article they run the output through JTidy before looking at it with XQuery, so malformed HTML is not an issue here. However, with two rounds of processing, speed may be a problem, depending on how efficient the tools are.
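If the two rounds are a speed worry, they can at least be chained in memory rather than through a temporary file. A sketch of that plumbing, assuming JTidy plus a Saxon-style XQuery engine (the //title query and the file name page.html are just examples):

    import net.sf.saxon.s9api.*;
    import org.w3c.tidy.Tidy;
    import javax.xml.transform.stream.StreamSource;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;

    public class TidyThenXQuery {
        public static void main(String[] args) throws Exception {
            // Round 1: tidy the raw HTML into an in-memory XHTML buffer.
            ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            tidy.parse(new FileInputStream("page.html"), cleaned);

            // Round 2: hand the buffer straight to the XQuery engine, no temp file.
            Processor proc = new Processor(false);
            XdmNode doc = proc.newDocumentBuilder().build(new StreamSource(
                    new ByteArrayInputStream(cleaned.toByteArray())));
            XQueryEvaluator qe = proc.newXQueryCompiler().compile(
                    "declare default element namespace 'http://www.w3.org/1999/xhtml';"
                  + " string(//title)").load();
            qe.setContextItem(doc);
            System.out.println(qe.evaluateSingle().getStringValue());
        }
    }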