Handling Unclosed HTML tags with BeautifulSoup4

Feb 08, 2020

by James

A side project of mine is to archive the air pollution data for the state of Texas from the Texas Commission on Environmental Quality (TCEQ). My archiver then tweets via the @Kuukihouston account when levels of certain compounds rise above thresholds that the EPA has deemed a health risk.

Recently I added support for automatically updating the list of locations that it collects data from, rather than relying on a fixed list. Doing so is very straightforward: download the webpage, look for the <select> box that contains the sites, and scrape the value and text for each <option>.
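As a rough sketch of those steps (not the archiver's actual code), the scraping part might look like the following. The function name extract_sites, the sample markup, and the site values are all made up for illustration, and the download step is left out so the snippet runs offline.

```python
from bs4 import BeautifulSoup


def extract_sites(html, parser="html.parser"):
    """Return (value, text) pairs for every <option> in the first <select>.

    `extract_sites` is a hypothetical helper, not part of the archiver.
    """
    soup = BeautifulSoup(html, parser)
    select = soup.find("select")
    if select is None:
        return []
    return [
        (option.get("value"), option.get_text(strip=True))
        for option in select.find_all("option")
    ]


# Made-up, well-formed sample markup standing in for the downloaded page.
sample = """
<select name="site">
  <option value="a1">Site One</option>
  <option value="b2">Site Two</option>
</select>
"""

sites = extract_sites(sample)
print(sites)
```

With properly closed tags like these, each option comes back as its own (value, text) pair.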

There was only a single hiccup during development of this feature: the site's developers don’t close their option tags and instead rely on web browsers “to do the right thing”.

That is, their code looks like this:

<option>Oyster Creek [29]
<option>Channelview [R]

When it should look like this:

<option>Oyster Creek [29]</option>
<option>Channelview [R]</option>

Luckily, web browsers excel at guessing and fixing incorrect HTML. But I do not rely on a web browser to parse the HTML; I’m using BeautifulSoup. With the html.parser parser, BeautifulSoup only closes the open tags after the last option, i.e. just before the </select> tag. As a result, when I try to get the text for the first option in the list, I get the text for the first option plus every following option.
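To make the problem concrete, here is a small reproduction using made-up markup that mimics the unclosed tags (the option names and values are placeholders, not real TCEQ sites). It is only a sketch of the behavior described above.

```python
from bs4 import BeautifulSoup

# Made-up markup with unclosed <option> tags, as on the page described above.
html = "<select><option value='1'>Alpha<option value='2'>Beta</select>"

soup = BeautifulSoup(html, "html.parser")
first = soup.find("option")

# With html.parser, the second <option> ends up nested inside the first,
# so the first option's text also contains the second option's text.
first_text = first.get_text()
nested = first.find("option")

print(repr(first_text))      # 'AlphaBeta'
print(nested is not None)    # True
```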

The simple fix is to switch from the html.parser parser to the lxml parser, which will close the open <option> tags at the beginning of the next <option> tag, allowing me to get the text for each individual item.
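The same kind of made-up markup can be used to check the fix. This sketch assumes the lxml package is installed alongside BeautifulSoup; the option names and values are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup with unclosed <option> tags.
html = "<select><option value='1'>Alpha<option value='2'>Beta</select>"

# Requires the third-party lxml package.
soup = BeautifulSoup(html, "lxml")

# lxml closes each <option> when the next one begins, so the options
# are siblings and each one carries only its own text.
options = [
    (option.get("value"), option.get_text(strip=True))
    for option in soup.find_all("option")
]
print(options)
```

Each option now yields its own value and text instead of swallowing everything that follows it.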

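from bs4 import BeautifulSoup

# Bad: html.parser leaves every <option> open until </select>
soup = BeautifulSoup(response.text, 'html.parser')
# Good: lxml closes each <option> when the next one starts
soup = BeautifulSoup(response.text, 'lxml')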