So, time for yet another blog post that starts with the obligatory admission that it’s been a long time, I should get back into the habit of blogging, etc., etc.
With that out of the way: one of the open source side projects I’ve been hacking on lately is a Python library of utilities that have saved me a bit of time and annoyance. Lacking a better name, I decided to call it pg-utils. How has it been useful? Well, here are a few examples:
I often had to point myself back to the documentation to make sure I got the arguments/DSN right (“was that the right argument for the username? And wait… wasn’t it spelled differently?”).
So, one of the first things I did was build what would become the pg-utils Connection class. As long as you have the relevant environment variables set, you can now just do:
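The idea is a zero-argument constructor that pulls everything from the standard PG* environment variables (these are the same variables libpq itself reads). A minimal sketch of that pattern — this toy class only illustrates the environment-variable idea, it is not pg-utils’s actual implementation:

```python
import os

class Connection:
    """Sketch of an environment-driven connection object: every
    parameter comes from the standard PG* environment variables,
    so there is no DSN string or argument order to remember."""

    def __init__(self):
        self.host = os.environ.get("PGHOST", "localhost")
        self.port = int(os.environ.get("PGPORT", "5432"))
        self.user = os.environ.get("PGUSER", "")
        self.database = os.environ.get("PGDATABASE", "")

    @property
    def dsn(self):
        # Render the parameters as a libpq-style keyword DSN.
        return (f"host={self.host} port={self.port} "
                f"user={self.user} dbname={self.database}")

conn = Connection()  # no arguments to remember
```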
Easy to remember!
While Pandas is great at manipulating datasets that fit on one machine, some of those datasets are too large to fit into memory, and concerns over performance and data security can make analysis in the database more convenient.
Besides Connection, the other main class is Table, which acts as a metadata wrapper and performs (some) calculations lazily. For example, this creates a table in the database with one million rows and two columns:
x, which is uniformly distributed on the interval [0, 1), and y, which is drawn from the standard normal distribution.
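For reference, here is the same dataset built in plain pandas — my illustration of what the table contains, not the pg-utils call that created it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000

# Mirror the database table: x ~ Uniform[0, 1), y ~ Normal(0, 1).
df = pd.DataFrame({
    "x": rng.uniform(0.0, 1.0, size=n),
    "y": rng.standard_normal(n),
})
```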
Here, t is a metadata object: it doesn’t hold any of the actual data in the table. However, a limited subset of the Pandas API works through the database, such as computing summary statistics.
All of these calculations are done in the database, and not in Pandas.
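To illustrate what “done in the database” means, here is the general pattern, using SQLite as a stand-in for Postgres since it ships with Python — the aggregation runs inside the database engine, and Python only ever sees the one-row summary, never the full table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i / 10.0, float(i)) for i in range(10)],
)

# The COUNT/AVG/MIN/MAX work happens in the database; only the
# aggregated result crosses the wire back to Python.
count, mean_x, min_y, max_y = conn.execute(
    "SELECT COUNT(*), AVG(x), MIN(y), MAX(y) FROM t"
).fetchone()
print(count, mean_x, min_y, max_y)
```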
I found this useful, and started hacking away, adding various bits of Pandas-esque functionality.
However, it turns out there’s an easier way…
We can replicate most of this in pyspark:
In particular, this allows for an API similar to that of Pandas, and lazy evaluation is a built-in feature of Spark.
In the end, I don’t regret building pg-utils at all. I learned quite a bit from it, including how to read through some of the Pandas source code.
Tags: code, python, postgresql, spark, lessons learned