the right tool for the job

So, you’re going to parse a webpage to extract some information. Say you want to get the tracking information for your last online order and display every change to it using Growl, D-Bus notifications or xosd.

You know regular expressions, so you head into the job with your long-range missiles ready. But wait a minute: you’ll probably solve the problem, but are regular expressions really the right tool?

The argument for regular expressions is that you get to use the same tool you always use for parsing jobs, but then again you don’t learn anything new from it. You might fortify your position as a regex wizard even more, but how about something completely different?

Now, most webpages are written in HTML, and some even in XHTML. For HTML documents, languages like Python have a built-in parser modeled after the SAX parser. (It’s probably the other way around: the SAX parser is modeled on the HTML parser…) Most programming languages have good support for XML, so for XHTML documents you can use the HTML parser, a SAX parser or even an XML DOM parser.
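To make the event-driven model concrete, here is a minimal sketch of an HTMLParser subclass that just collects the target of every link it sees (the class name and sample markup are made up for illustration). The parser calls your handle_starttag method once per opening tag, SAX-style; the try/except import is only there because newer Pythons moved the module to html.parser:

```python
try:
    from HTMLParser import HTMLParser   # Python 2, as in the script below
except ImportError:
    from html.parser import HTMLParser  # Python 3 moved it here

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p>See <a href="/docs">the docs</a> and <a href="/faq">the FAQ</a>.</p>')
# parser.links is now ["/docs", "/faq"]
```

Instead of you walking the document, the parser walks it for you and fires callbacks; your job is reduced to deciding what to do at each tag — which is exactly the shape of the tracking-page parser below.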

The benefit of doing it this way is that your parser will probably be more robust to minor changes in the webpage, and you don’t reinvent the wheel (the best way I’ve found to parse HTML documents with regular expressions is to build a specialized SAX-like parser anyway). Your code will probably still be readable in a year, and others might even be able to understand it. And finally, you learn something new, which might give you a fresh view on a lot of problems.

Now back to the original issue: making a parser for your postal service’s parcel tracking. Here’s an example that parses the shipment tracking page of Posten, the Norwegian postal service.

And yes, it’s still a bit short on comments.

download posten.py

import urllib2
import time
from HTMLParser import HTMLParser
import sys
 
class PostTrack:
    def __init__(self):
        self.events=[]
    def append(self,item):
        # prepend, reversing the order the events appear in on the page
        self.events.insert(0,PostTrackEvent(item))

    def __getitem__(self,i):
        return self.events[i]
    def __str__(self):
        return "\n".join([str(i) for i in self])
 
class PostTrackEvent:
    def __init__(self,data):
        # data is one table row: [date, time, event, location, note]
        self.date=data[0]
        self.time=data[1]
        self.event=data[2].lower()
        self.location=data[3]
        self.note=data[4]
        self.ts=time.strptime("%s %s" % (self.date,self.time), "%d.%m.%y %H.%M")
        self.isotime=self.timestamp("%Y-%m-%d %H:%M:00")
    def timestamp(self,format):
        return time.strftime(format, self.ts)
    def __getitem__(self,key):
        return self.__dict__[key]
    def __str__(self):
        return "%(isotime)-20s%(event)-30s%(location)s %(note)s" % self
 
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # state machine: 0 = searching, 1 = inside a <th>,
        # 2 = inside the tracking table, 3 = inside a data cell
        self.state=0
        self.hendelser=PostTrack()
 
    def handle_starttag(self, tag, attrs):
        if tag == "th" and self.state==0:
            self.state=1
        if tag == "td" and self.state==2:
            self.state=3
        if tag == "tr" and self.state==2:
            # start collecting a new event row
            self.hendelse=[]
 
    def handle_endtag(self, tag):
        if tag == "th" and self.state==1:
            self.state=0
        if tag == "td" and self.state==3:
            self.state=2
        if tag == "table":
            self.state=0
        if tag == "tr" and self.state==2 and hasattr(self,"hendelse"):
            self.hendelser.append(self.hendelse)
    def handle_data(self,data):
        if self.state==1:
            # "Hendelse" (Norwegian for "event") marks the tracking table's header
            if data.strip() == "Hendelse":
                self.state=2
        if self.state==3:
            self.hendelse.append(data.strip())
 
    def getit(self):
        return self.hendelser
 
def goFetchAndParse(shipmentid):
    url=("http://sporing.posten.no/Sporing/"
        "KMSporingInternett.aspx?shipmentNumber=%s" % shipmentid)
    data=urllib2.urlopen(url).read()
    h=MyHTMLParser()
    h.feed(data)
    return h.getit()
 
print goFetchAndParse(sys.argv[1])