Handling Unclosed HTML Tags with BeautifulSoup4
A side project of mine is to archive the air pollution data for the state of Texas from the Texas Commission on Environmental Quality (TCEQ). My archiver then tweets via the @Kuukihouston account whenever the level of certain compounds rises above the thresholds that the EPA has deemed a health risk.
Recently I added support for automatically updating the list of locations that it collects data from, rather than relying on a fixed list. Doing so is very straightforward: download the webpage, look for the <select> box that contains the sites, and scrape the value and text for each <option>.
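The scraping step described above can be sketched roughly as follows. This is a minimal illustration, not the archiver's actual code: the sample markup, the select name "site", the option values, and the helper extract_sites are all made up for the example, and the options in this sample are properly closed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real TCEQ page differs.
SAMPLE_HTML = """
<select name="site">
    <option value="1">Oyster Creek [29]</option>
    <option value="2">Channelview [R]</option>
</select>
"""


def extract_sites(html, select_name="site"):
    """Return (value, text) pairs for each <option> in the named <select>."""
    soup = BeautifulSoup(html, "html.parser")
    # "name" clashes with find()'s own first argument, so pass it via attrs.
    select = soup.find("select", attrs={"name": select_name})
    if select is None:
        return []
    return [
        (option.get("value"), option.get_text(strip=True))
        for option in select.find_all("option")
    ]


if __name__ == "__main__":
    for value, text in extract_sites(SAMPLE_HTML):
        print(value, text)
```

Returning plain (value, text) tuples keeps the scraping separate from whatever the archiver later does with the site list.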
There was only a single hiccup during development of this feature: the developers don't close their option tags and instead rely on web browsers "to do the right thing".
That is, their code looks like this:
    <option value="...">Oyster Creek [29]
    <option value="...">Channelview [R]
When it should look like this:
    <option value="...">Oyster Creek [29]</option>
    <option value="...">Channelview [R]</option>
Luckily, web browsers excel at guessing and fixing incorrect HTML. But I do not rely on a web browser to parse the HTML; I'm using BeautifulSoup. The BeautifulSoup html.parser parser leaves every option open until the end of the list and only closes the tags just before the </select> tag, so each option ends up nested inside the previous one. As a result, when I try to get the text for the first option in the list, I get the text for the first option plus every following option.
The simple fix is to switch from the html.parser parser to the lxml parser, which closes each open <option> tag at the beginning of the next <option> tag, allowing me to get the text for each individual item.
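from bs4 import BeautifulSoup

# Bad: html.parser leaves each <option> open until </select>
soup = BeautifulSoup(response.text, 'html.parser')
# Good: lxml closes each <option> when the next one starts
soup = BeautifulSoup(response.text, 'lxml')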
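To see the difference for yourself, the following sketch parses the same unclosed markup with both parsers. The sample HTML, the option values, and the helper name option_texts are assumptions made for illustration, and the lxml parser needs the separate lxml package to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the unclosed <option> tags described above.
UNCLOSED_HTML = """
<select name="site">
    <option value="1">Oyster Creek [29]
    <option value="2">Channelview [R]
</select>
"""


def option_texts(html, parser):
    """Return the text of every <option>, as seen by the given parser."""
    soup = BeautifulSoup(html, parser)
    return [
        option.get_text(" ", strip=True)
        for option in soup.find_all("option")
    ]


if __name__ == "__main__":
    # html.parser nests the second option inside the first, so the first
    # option's text also contains the second option's text.
    print(option_texts(UNCLOSED_HTML, "html.parser"))
    # lxml treats the options as siblings, so each text stands alone.
    print(option_texts(UNCLOSED_HTML, "lxml"))
```

With html.parser the first entry swallows the text of the option that follows it, while lxml yields one clean entry per site.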