Despite its flaws, Python holds a special place in my heart. While most of my professional development is in C#, there are still plenty of opportunities where Python is my go-to.
My favourite aspect of Python is how quickly you can get to coding. When I'm building a pipeline or doing some ad hoc analysis, all I need is a notebook and I can start writing.
I was recently in a position where I needed to do some ad hoc analysis using StatsCan data. To do this, I needed to build a small ETL pipeline in a Databricks environment to process ~5 GB of data.
The initial problem was simple: StatsCan's API requires a table ID when querying data. The popular Python module `statscan` acts as a wrapper for the API and returns data as pandas DataFrames. When using the module, the table ID needs to be cleaned first (i.e., the hyphens removed and the last two digits dropped), which was annoying. Users (myself in this case) had to search for the table ID they wanted, then keep track of which IDs are associated with which tables. Another (small) issue was the pandas aspect; I was working in Databricks and therefore wanted to leverage PySpark.
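To give a sense of what that cleanup looks like, here's a quick sketch of the transformation (the table ID used here is just a placeholder, not one from my analysis):

```python
def clean_table_id(table_id: str) -> str:
    """Strip hyphens and drop the trailing two digits from a StatsCan table ID."""
    return table_id.replace("-", "")[:-2]

# e.g. "36-10-0434-03" -> "3610043403" -> "36100434"
print(clean_table_id("36-10-0434-03"))  # 36100434
```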
So with these two issues in mind, I went to work building my own module (with the very original name `StatsCanPy`). Most of the package isn't anything special; it is a basic wrapper around the aforementioned StatsCan API. The interesting part is the `get_table_from_name(arg)` method. This too is quite simple: it sends a basic GET request to StatsCan's website search tab using the argument. Then, using regular expressions, the method parses the HTML returned by the GET request to construct a 2D array of table names and IDs. Assuming at least one match is found, the method uses the table ID in position 0 to query the API; if no matches are found, it raises an exception. There is an additional method to list datasets based on a table name. Similar to the previous method, it leverages StatsCan's search feature and returns a parsed list of table names and their associated table IDs, which is useful if you don't know exactly which dataset you want to query. A rough sketch of the idea follows.
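Here's a minimal sketch of that search-and-parse approach; this is not `StatsCanPy`'s actual code, and the search URL and regex pattern are assumptions for illustration only:

```python
import re
import requests

# Assumed search endpoint for illustration purposes.
SEARCH_URL = "https://www150.statcan.gc.ca/n1/en/type/data"

def get_table_ids_from_name(name: str) -> list:
    """Search StatsCan for `name` and return a list of (table_name, table_id) pairs."""
    response = requests.get(SEARCH_URL, params={"text": name}, timeout=30)
    response.raise_for_status()

    # Hypothetical pattern: grab each result title and its hyphenated table ID
    # (formatted like 12-34-5678-90) from the returned HTML.
    pattern = re.compile(
        r"<a[^>]*>(?P<title>[^<]+)</a>.*?(?P<table_id>\d{2}-\d{2}-\d{4}-\d{2})",
        re.DOTALL,
    )
    matches = [
        (m.group("title").strip(), m.group("table_id"))
        for m in pattern.finditer(response.text)
    ]

    if not matches:
        raise ValueError(f"No StatsCan tables found matching '{name}'")
    return matches

# The first (0th) match's ID is then cleaned and used to query the API:
# table_name, table_id = get_table_ids_from_name("consumer price index")[0]
```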
Deploying the package to the Python Package Index (PyPI) was delightfully easy. The bulk of the work is handled by the `setuptools` module, which is where you define your package's metadata. Once that is done, the final step is creating your deployment pipeline. For `StatsCanPy`, I used GitHub Actions for CI/CD. The deployment pipeline installs the required dependencies, runs the unit tests and, assuming they pass, builds the package for deployment. The total runtime of the pipeline is under 2 minutes. Once the pipeline run is complete, the package can be installed using `pip`.
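For reference, the metadata definition is only a handful of lines. Here's a minimal `setup.py` sketch; the name, version, and dependencies shown are illustrative assumptions, not `StatsCanPy`'s actual configuration:

```python
# Minimal setuptools-based setup.py sketch (values are placeholders).
from setuptools import setup, find_packages

setup(
    name="statscanpy",
    version="1.0.0",
    description="A wrapper around the StatsCan API with PySpark support",
    packages=find_packages(),
    install_requires=[
        "requests",  # for querying StatsCan's search page
        "pyspark",   # for returning Spark DataFrames in Databricks
    ],
    python_requires=">=3.8",
)
```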
All in, I was able to build and deploy v1 of `StatsCanPy` in ~4 hours. Circling back to my original point about ease of use, by the afternoon I was able to begin my analysis, installing my own module from pip and importing it (pretty cool IMO).