Lessons Learned: When A Side-Project Is (Mostly) Deprecated

2016-07-25

So, time for yet another blog post that opens with the obligatory admission that it’s been a long time, that I should get back into the habit of blogging again, etc., etc.

With that out of the way: one of the open-source side projects I’ve been hacking on lately is a Python library of utilities that has saved me a bit of time and annoyance. For lack of a better name, I decided to call it pg-utils. How has it been useful? Well, here are a few examples:

  • Easy Connections:

Of course, psycopg2 is the go-to Python library for connecting to PostgreSQL databases. Maybe it’s just my laziness, but I’ve found the arguments to psycopg2.connect to be a bit annoying:

import psycopg2

# keyword arguments...
conn = psycopg2.connect(database="test", user="postgres", password="secret")

# ...or an equivalent libpq-style connection string (DSN)
conn = psycopg2.connect("dbname=test user=postgres password=secret")

I often had to point myself back to the documentation to make sure I got the arguments/DSN right (“was that user or username? And wait…wasn’t it dbname?” etc).

So, one of the first things I did was build what would become pg-utils’ Connection class. As long as you have the relevant environment variables set up, you can now just do:

from pg_utils import connection

conn = connection.Connection()

Easy to remember!
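For reference, a wrapper like this needs very little machinery. Here’s a minimal sketch of the idea (an illustration, not the actual pg-utils implementation), assuming the usual libpq-style environment variables PGHOST, PGPORT, PGUSER, PGPASSWORD, and PGDATABASE:

import os

import psycopg2


class Connection(object):
    """Sketch of an environment-driven connection wrapper.

    This is an illustration, not the real pg_utils code: it simply pulls
    connection settings from environment variables so the caller never
    has to remember the psycopg2 argument names.
    """

    def __init__(self):
        self._conn = psycopg2.connect(
            host=os.environ.get("PGHOST", "localhost"),
            port=int(os.environ.get("PGPORT", 5432)),
            user=os.environ.get("PGUSER", "postgres"),
            password=os.environ.get("PGPASSWORD", ""),
            database=os.environ.get("PGDATABASE", "postgres"),
        )

    def cursor(self):
        return self._conn.cursor()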

  • Bits and Pieces of the Pandas API, With Calculations Done In The Database

Pandas is great at manipulating datasets that fit in memory on one machine. But when the data lives on one machine yet is too large to load into memory, or when there are concerns over performance and data security, it can be more convenient to do the analysis in the database itself.

Other than Connection, the main class is Table, which acts as a metadata wrapper and performs (some) calculations lazily. For example, the snippet below creates a table in the database with one million rows and two columns: x, which is uniformly distributed on the interval [0, 1), and y, which is drawn from the standard normal distribution (via the Box-Muller transform).

In [1]: from pg_utils import table

In [2]: t = table.Table.create("pg_utils_test",
   ...:         """create table pg_utils_test as
   ...:             select random() as x,
   ...:             sqrt(-2 * ln(u1)) * cos(2 * pi() * u2) as y
   ...:             from (
   ...:                 select random() as u1, random() as u2
   ...:                 from generate_series(1, 1000000)
   ...:             ) a""")

Now, t is a metadata object: it doesn’t hold any of the table’s actual data. However, a limited subset of the Pandas API works against the database. For example, there is a describe method:

In [3]: t.describe()
Out[3]:
                      x               y
count    1000000.000000  1000000.000000
mean           0.499628       -0.000075
std_dev        0.288671        0.999630
minimum        0.000001       -4.589661
25%            0.249420       -0.672603
50%            0.499709        0.000695
75%            0.749733        0.673413
maximum        0.999999        4.867347

All of these calculations are done in the database, and not in Pandas.
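Under the hood, a method like describe only needs to ship a handful of aggregate queries to PostgreSQL and read back a few numbers. A rough sketch of the kind of query involved (not necessarily the exact SQL that pg-utils generates) looks like this:

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres password=secret")

# Push the summary statistics down to PostgreSQL; only the eight
# aggregate values per column come back over the wire.
with conn.cursor() as cur:
    cur.execute("""
        select count(x),
               avg(x),
               stddev_samp(x),
               min(x),
               percentile_cont(array[0.25, 0.5, 0.75])
                   within group (order by x),
               max(x)
        from pg_utils_test
    """)
    count_, mean, std_dev, minimum, quartiles, maximum = cur.fetchone()

The percentile_cont ordered-set aggregate (PostgreSQL 9.4+) handles all three quartiles in a single pass.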

I found this useful, and kept hacking away, adding various bits of Pandas-esque functionality.

However, it turns out there’s an easier way…

  • It turns out that Apache Spark does most of this…

We can replicate most of this in pyspark:

    In [1]: import pyspark

    In [2]: import os

    In [3]: jdbc_url = "jdbc:postgresql://[hostname]:[port]/[database_name]?user=[username]&password=[password]"

    In [4]: sc = pyspark.SparkContext(appName="spark_fiddle")

    In [5]: sql_context = pyspark.SQLContext(sc)

    In [6]: data = sql_context.read.format("jdbc").options(url=jdbc_url,
    ...:     dbtable="pg_utils_test").load()

    In [7]: data.describe().show()
    +-------+--------------------+--------------------+
    |summary|                   x|                   y|
    +-------+--------------------+--------------------+
    |  count|               10000|               10000|
    |   mean| 0.49732450409224255|0.002857195614962924|
    | stddev|   0.290124075930788|  0.9932281865323003|
    |    min|1.405193470418453...| -3.6793861540325414|
    |    max|  0.9999190648086369|  3.6616916084243227|
    +-------+--------------------+--------------------+

In particular, this gives an API similar to that of Pandas, and lazy evaluation is a built-in feature of Spark.
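To make the lazy evaluation point concrete: transformations on a Spark DataFrame only build up a query plan, and nothing is actually computed until an action forces it. A small sketch continuing the session above (the particular filter is just an arbitrary example):

    # Transformations such as filter/select are lazy: they only extend
    # the logical plan, and no query hits PostgreSQL yet.
    positive_y = data.filter(data.y > 0).select("x", "y")

    # Only an action (count, show, collect, ...) triggers the actual
    # work, parts of which Spark can push down through the JDBC source.
    print(positive_y.count())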

In the end, I don’t regret building pg-utils at all. I learned quite a bit from it, not least from digging into some of the Pandas source code.

Tags: code, python, postgresql, spark, lessons learned