Go to the navigation

elblogg

Posts Tagged ‘parsing’

the right tool for the job

So, you’re going to parse a webpage, to extract some information. For instance if you want to get the tracking information for your last online order, and you want to display the tracking information changes using growl, dbus notifications or xosd.
You know regular expressions, so you go to the job with your long range missiles ready. But wait a minute, you’ll probably solve the problem but is regular expressions really the right tool?
The pro for regular expressions is that you can use the same tool you always use for parsing jobs, but then again you doesn’t learn anything new out of this. You might fortify your position as regex wizard even more, but how about something completely different?

Now. Most webpages is written in HTML, and some even in XHTML, for HTML documents languages like Python has a built-in parser, after the model of the SAX-parser. (It’s probably the other way around, the SAX parser is built on the base of the HTML parser…) Most programming languages has good support for XML, so for XHTML documents, you can use the HTML-parser, a SAX-parser or even the XML-DOM parsers.

The benefit of doing it this way is that your parser will probably be more robust to minor changes in the webpage. You don’t reinvent the wheel (The best way I’ve found to parse HTML documents using regular expressions is to make a specialized SAX-like parser anyway). Your code will probably be readable in a year, and others might even be able to understand your code. And finally, you learn something new, which might give you a fresh view on a lot of problems.

Now back to the original issue, to make a parser for the parcel tracking of your postal service. Here’s an example parsing the shipment tracking page of posten, the norwegian postal service.

Read the rest of this entry »

Bloggurat Twingly BlogRank Blogglisten