I might get in trouble, but it’s worth it
If you’re wondering why I’m doing this, part 1 covers most of the explanation. In the last article we covered how to make requests to get access to the site, working around the pain of getting a session ID and authorization token.
Next is to parse the page itself to get the information we need. I’m pretty new to parsing websites, but the first thing you need to do is inspect the page with something like Chrome DevTools to see possible avenues for accessing the information.
So after right clicking and inspecting the HTML element that holds my data, I have an idea of how to access the field. One handy thing in HTML is that the class attribute can be searched by parsers. There’s no guarantee of uniqueness, but at minimum this gets us a list containing the element we want.
<td class="border-0">Assessment of $XXX.XX is due on XX/XX/XXXX</td>
Otherwise we could start at the first id, which wraps the entire contents of the page outside of the header images and some other pieces of information. I favor the first method since it’s quicker and I can leverage string operations to get the information I want.
Note: I did a preliminary check and this is the only instance of that class name.
Alright, now how do we take HTML packed with a bunch of other crap we don’t want and get our value? One Python library I’m a huge fan of is BeautifulSoup. If you haven’t seen it, this library is primarily for scraping websites and is very easy to set up.
All we have to do is take the result variable, which contains the last request made with proper authentication, and feed its content into the parser.
soup = BeautifulSoup(result.content, 'html.parser')
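To make this concrete, here’s a minimal, self-contained sketch. The HTML string below stands in for result.content from the authenticated request in part 1 (the markup and dollar amount are made up for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for result.content from the authenticated request in part 1.
html = b"""
<html><body>
  <table><tr>
    <td class="border-0">Assessment of $123.45 is due on 01/15/2025</td>
  </tr></table>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")
print(soup.td.get_text())
```

In the real script you’d pass result.content instead of the hardcoded string; everything else stays the same.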
Now one handy function is find, which hunts down elements matching our specific conditions. In our case we want to search for the class border-0 and see whether we get anything back beyond what we expect.
data = soup.find(class_="border-0")
print(data.get_text())
# prints: Assessment of $XXX.XX is due on XX/XX/XXXX
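Since the plan is to lean on string operations from here, here’s a sketch of how that last step could look: confirming the class really is unique on the page, then slicing the amount and due date out of the text. The dollar amount and date are placeholder values, not real data:

```python
from bs4 import BeautifulSoup

# Placeholder markup with made-up values, standing in for the real page.
html = '<td class="border-0">Assessment of $123.45 is due on 01/15/2025</td>'
soup = BeautifulSoup(html, "html.parser")

# find_all lets us double-check that the class only appears once.
matches = soup.find_all(class_="border-0")
assert len(matches) == 1

# Plain string operations pull out the amount and the due date.
text = matches[0].get_text()
amount = text.split("Assessment of ")[1].split(" is due")[0]
due_date = text.rsplit(" on ", 1)[1]
print(amount, due_date)  # $123.45 01/15/2025
```

This splitting approach assumes the sentence format stays fixed; if the site reworded it, a regular expression would be a sturdier choice.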
Great, we grabbed the information pretty fast with only a couple of extra lines of code and some understanding of the page. Next is to…