Scraping Internet Outages with Selenium and Python

Mike Gouline
9 min read · Jun 1, 2021

Imagine you have an internet provider. You probably do. This provider sometimes performs scheduled maintenance, resulting in your internet temporarily disconnecting. Now imagine there’s no way of getting notified about future maintenance windows, unless you check their outages page every day. Annoying, but we can do something about that with a bit of code.

This was not a hypothetical situation for me and I figured I could automate the process with a script, like any developer would. Unfortunately, internet providers are seldom developer friendly, so there was no API that I could tap into. There was only a customer dashboard with a list of current and future outages, which I would have to scrape.

While I realise not everyone is facing this exact problem, I’m sure people find themselves needing to automate something on archaic websites from time to time, so hopefully this article is useful to them. To avoid making a target out of any real website, I built a simple HTML mock that we will be scraping instead.

Research

Step one is researching the problem space. There are three major parts:

  1. Obtaining the outages
  2. Serving them
  3. Hosting the code away from your computer

I hadn’t scraped before, but five minutes of research showed Selenium to be a common solution. There’s plenty of language choice here, including Java (and anything interoperable with it, like Kotlin), C#, JavaScript, Python and Ruby. Since I spend most of my day job writing Python (for data/ML), I picked that, but you can follow along in any language you prefer.

My first idea for serving the outages was sending notifications via email or SMS, but how do you avoid forgetting right after receiving them? You create an event in your calendar! So why not skip the intermediate step and create a calendar that you can just subscribe to, like public holiday or sporting calendars? All you have to do is generate a static iCalendar file and host it somewhere.

This brings me to the final piece: hosting. There’s no need for a request-response type application here, because the outages are unlikely to change frequently; you really just need a scheduled job that fetches the outages, regenerates the iCal file, and hosts it somewhere your calendar app or service can synchronise from. The simplest approach I came up with is to run it as a scheduled GitHub Actions workflow and host the static file as a Gist. If you have access to AWS, Google Cloud or Azure, you can just as easily use a scheduled serverless function and blob storage, e.g. a Lambda that writes to a public S3 bucket, triggered via CloudWatch Events.

Code

Let’s look at the code. This section only covers some snippets to give you a feel for how things work; you can find the complete working project at https://github.com/gouline/outages.

Website

You will obviously skip this part when scraping a real website, but to avoid pissing off some webmasters, I used the power of GitHub Pages to create a mock login page that redirects to a static list of outages (regardless of what username and password you entered):

  1. Create two static HTML pages: login.html and outages.html
  2. Put them under /docs in the GitHub repository
  3. Go to the repository ‘Settings’, choose the ‘Pages’ tab, and enable GitHub Pages for the primary branch and the /docs directory
  4. The two pages are now hosted under https://[USERNAME].github.io/[REPOSITORY]/ (if you have a custom domain associated with the main GitHub Pages repository, this will redirect and still work)

Now we have something to scrape.

Scraping

For those of you who have never scraped websites before, the process looks roughly like writing HTML files in reverse: you inspect the page in your browser, find the information that you want to grab, and think about how to traverse down the DOM, by tags, IDs and classes, to get to it.

First, download the ChromeDriver and put it somewhere in your executable path. Then install the Selenium binding for Python:

pip install selenium

Here’s how you instantiate a simple headless driver:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

When troubleshooting unexpected behaviour, you can temporarily comment out the --headless line to see what the code sees in a separate Chrome window (you can even inspect the page).
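
For instance, while debugging, the options block might look like this (the window-size flag is just an optional extra I sometimes add for a consistent viewport, not something the original setup requires):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument("--headless")  # temporarily disabled so the browser window is visible
chrome_options.add_argument("--window-size=1280,800")  # optional: consistent viewport while debugging
driver = webdriver.Chrome(options=chrome_options)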

The driver throws errors whenever something cannot be found, so it’s a good idea to surround whatever you are doing with a try-finally block to make sure the driver gets closed, even if something goes wrong.

try:
    outages = get_outages(driver)
finally:
    driver.close()

To load a page, you just get it:

driver.get("https://gouline.github.io/internet-outages/provider/login.html")

Now let’s fill out the credentials and submit the form:

import os

username = os.getenv("PROVIDER_USERNAME")
password = os.getenv("PROVIDER_PASSWORD")

driver.find_element_by_id("username").send_keys(username)
driver.find_element_by_id("password").send_keys(password)
driver.find_element_by_tag_name("form").submit()

Most of the time you will be using find_element_by_id, find_element_by_tag_name or find_element_by_class_name for simple parsing. All these return the first matching element and throw an error if they cannot find it. You can also use their plural variants (e.g. find_elements_by_tag_name) to return a list of matching elements that you can loop through.

After a page redirect, you can use this to wait until the new page title contains the text “Outages”, to avoid errors just because the page hasn’t loaded yet:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.title_contains("Outages"))

Now we loop through the elements with the class list-group-item, which we know contain each outage on the mock page, and then find the container elements inside, each holding one attribute of that outage:

outages = []
for i in driver.find_elements_by_class_name("list-group-item"):
    outage = Outage()
    for c in i.find_elements_by_class_name("container"):
        title = c.find_element_by_tag_name("strong").text
        value = c.find_element_by_tag_name("p").text
        outage.put_attribute(title, value)
    outages.append(outage)

Corresponding HTML for reference:

<div class="list-group-item">
  <div class="container">
    <strong>Start</strong>
    <p>Thu 10 Jun 2021 12:00AM PST</p>
  </div>
  <div class="container">
    <strong>End</strong>
    <p>Thu 10 Jun 2021 07:00AM PST</p>
  </div>
  <div class="container">
    <strong>Severity</strong>
    <p>High</p>
  </div>
</div>
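
The Outage class and its put_attribute method come from the full project rather than Selenium; the real implementation in the linked repository also exposes the name, uid and description() used later when generating the calendar. As a rough sketch of just the parsing part (the field names and date handling here are my own illustration):

from datetime import datetime


class Outage:
    """Hypothetical container for one scraped outage."""

    # Matches the mock page's dates, e.g. "Thu 10 Jun 2021 12:00AM" (timezone suffix stripped below)
    DATE_FORMAT = "%a %d %b %Y %I:%M%p"

    def __init__(self):
        self.start = None
        self.end = None
        self.severity = None

    def put_attribute(self, title, value):
        # Map the <strong> label to the corresponding field
        if title == "Start":
            self.start = self._parse_date(value)
        elif title == "End":
            self.end = self._parse_date(value)
        elif title == "Severity":
            self.severity = value

    @staticmethod
    def _parse_date(value):
        # Drop the trailing timezone abbreviation (e.g. "PST") before parsing
        return datetime.strptime(value.rsplit(" ", 1)[0], Outage.DATE_FORMAT)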

Notice how you can call the find_* functions on the driver for the whole page or on any returned element to search within it. Be careful with the find_element_* functions: they throw NoSuchElementException when they cannot find what you are looking for, so if a missing element is expected, surround the call with a try-except block.
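
For example, if a section of the page only appears sometimes, the lookup could be wrapped like this (the element ID is purely illustrative):

from selenium.common.exceptions import NoSuchElementException

try:
    # Hypothetical optional element that is not always present
    banner = driver.find_element_by_id("maintenance-banner")
    print(banner.text)
except NoSuchElementException:
    # A missing banner is expected, so carry on without it
    pass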

That about covers the overall parsing principle. Unfortunately, in most real-world use cases, HTML gets complex quickly and you will have to resort to find_element_by_xpath, but XPath syntax is way too broad to cover in this article, so I will leave it up to the reader to explore the documentation.
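
As a small taste, the list-group-item lookup from before could also be expressed as an XPath query, and XPath additionally lets you match on text, attributes or position when IDs and classes aren't enough:

# Equivalent to find_elements_by_class_name("list-group-item")
items = driver.find_elements_by_xpath("//div[contains(@class, 'list-group-item')]")

# Or grab the <p> that immediately follows the "Severity" label
severity = driver.find_element_by_xpath("//strong[text()='Severity']/following-sibling::p").text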

Generating Calendar

Now that we have some outages, we need to generate the iCal file. If you’ve never seen one, here is what a minimal example looks like:

BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
DESCRIPTION:Scheduled maintenance
DTSTART:20210101T000000Z
SUMMARY:Event Name
UID:TEST-UID-1
END:VEVENT
END:VCALENDAR

As you can see, it wouldn’t take much to generate it manually, but fortunately, ics.py exists so we don’t have to. You can install it like so:

pip install ics

Given a list of outages that we scraped before, we add them as events:

import ics

cal = ics.Calendar()
for outage in outages:
    event = ics.Event(
        name=outage.name,
        begin=outage.start.timestamp(),
        end=outage.end.timestamp(),
        uid=outage.uid,
        description=outage.description(),
    )
    cal.events.add(event)

While most arguments are self-explanatory, uid (the unique identifier for each event) deserves a note: it can be omitted, but it is then generated randomly every time the file is updated. Depending on how your calendar app or service handles synchronisation, this may cause undesirable behaviour, since technically every event becomes brand new at each refresh.
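
One way to keep UIDs stable between refreshes (just one possible approach, not necessarily what the linked project does) is to derive them from the outage data itself, for example by hashing the start and end times:

import hashlib


def outage_uid(outage):
    # The same outage always hashes to the same UID, so calendar apps see it as one event
    raw = f"{outage.start.isoformat()}|{outage.end.isoformat()}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()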

Finally, we write the calendar to a plaintext file:

with open(filename, "w") as f:
    f.writelines(cal)

GitHub Actions

Now that we have some Python code that we can run locally, we want to run it somewhere on a schedule, say daily. Here’s how you can configure GitHub Actions to do that.

By default, Actions are enabled for all GitHub repositories, so all you have to do is create a workflow configuration. Let’s create .github/workflows/deploy.yml and start filling it in.

First, we specify what triggers this workflow. We want it to run on a schedule:

on:
  schedule:
    - cron: "0 5,17 * * *"

This example triggers the workflow at 5:00 and 17:00 (UTC) every day. For more options, refer to the documentation.

Next, we define what steps need to be executed and on what platform:

jobs:
  run:
    runs-on: ubuntu-latest
    name: run
    steps:
      - uses: actions/checkout@v2

Remember how you had to install the ChromeDriver before? This needs to be done on the CI runner as well. Thankfully, there’s the setup-chromedriver action on the Marketplace for that:

      - uses: nanasess/setup-chromedriver@master

We need to install Python dependencies:

      - name: Requirements
        run: pip3 install selenium ics

And run the Python scraper we wrote earlier:

      - name: Run
        run: python3 outages.py
        env:
          PROVIDER_USERNAME: ${{ secrets.PROVIDER_USERNAME }}
          PROVIDER_PASSWORD: ${{ secrets.PROVIDER_PASSWORD }}

You will presumably need to save the credentials we used to log into the website we were scraping. Never store credentials in plaintext in your repository! Store them as encrypted secrets instead: go to ‘Settings’ in your repository, then the ‘Secrets’ tab, and create two secrets called PROVIDER_USERNAME and PROVIDER_PASSWORD, as referenced above.

Finally, we need a way to upload the resulting calendar file to GitHub Gist. We could do it manually via the API, but we’re all busy people, so there’s another Marketplace action called deploy-to-gist to do it for us:

      - name: Deploy
        uses: exuanbo/actions-deploy-gist@v1
        with:
          token: ${{ secrets.GIST_TOKEN }}
          gist_id: YOUR_GIST_ID
          gist_file_name: outages.ics
          file_path: ./dist/outages.ics

The field gist_file_name controls the name of the file in the target gist and file_path is where to find the file to upload after executing the Python script. Two more configuration steps before we’re done though:

  1. Go to https://gist.github.com/ and create an empty gist — the gist_id will be in the URL, i.e. https://gist.github.com/[USERNAME]/[GIST_ID]
  2. Create a personal access token with the gist scope (described as “Create gists”) and save it as a repository secret called GIST_TOKEN (same as the website credentials before); this gives the workflow permission to edit your gist

Done! If you followed the instructions correctly, your workflow will now run on schedule and upload your calendar file to GitHub Gist.

To synchronise your calendar app or service with your generated calendar, point it to this URL (with your values) https://gist.githubusercontent.com/[USERNAME]/[GIST_ID]/raw — this exposes your gist as a raw file that can be downloaded.
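
If you want to sanity-check the URL before pointing your calendar app at it, a quick throwaway Python snippet (with your own username and gist ID substituted) can fetch it:

import urllib.request

# Substitute your own values; this should print the BEGIN:VCALENDAR ... END:VCALENDAR contents
url = "https://gist.githubusercontent.com/[USERNAME]/[GIST_ID]/raw"
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))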

Optimisations

If you looked at the GitHub repository I linked to in the beginning, you would have noticed a few additional things not discussed in this article that you may want to consider doing in your own repository, especially if you are new to Python:

  • Create a build script, such as a Makefile and/or setup.py, and call its targets from the CI workflow, instead of explicit commands
  • Extract your dependencies into requirements.txt and install them all at once with pip install -r requirements.txt
  • Separate your Python code into multiple files as needed and put them all in a directory called outages (with an __init__.py file), so that you can call it as a module with python -m outages (more on that here)
  • Write tests! It may be a small throwaway project, but it’s still a good idea to write at least some basic unit and integration tests (a tiny example follows below)
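
For instance, a couple of unit tests against the hypothetical Outage sketch from earlier (assuming it lives in a module called outages) might look like this:

import unittest
from datetime import datetime

from outages import Outage  # assuming the sketched class lives in the outages module


class TestOutage(unittest.TestCase):
    def test_put_attribute_parses_start(self):
        outage = Outage()
        outage.put_attribute("Start", "Thu 10 Jun 2021 12:00AM PST")
        self.assertEqual(outage.start, datetime(2021, 6, 10, 0, 0))

    def test_put_attribute_keeps_severity(self):
        outage = Outage()
        outage.put_attribute("Severity", "High")
        self.assertEqual(outage.severity, "High")


if __name__ == "__main__":
    unittest.main()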

Having said that, I purposely only focused on the parts relevant to scraping websites and generating the calendar file. How to structure Python projects is out of scope for what I wanted to address, and your implementation will differ anyway, depending on what website you are scraping and how you want to approach it.

Closing Thoughts

Hopefully, this was helpful for somebody facing a similar task. Web scraping is a massive topic that’s impossible to cover in one article and that was not my intention. This was more of a walk-through of a weekend project where you approach a real-world problem with a can-do attitude and some basic programming skills.

As I mentioned in the research section, GitHub Actions was picked purely because it’s simple and free, but it’s by no means the only, or even the nicest, option. Hosting the Python script on a cloud platform, such as AWS, Azure or Google Cloud, is left as an exercise for the reader. Feel free to contribute your setup in the comments and I will happily include it in the footnotes.

Thank you for reading!
