Client-side instrumentation for under $1 per month. No servers necessary.

In a world where the importance of data is steadily increasing yet the cost of computing power is steadily decreasing, there are fewer and fewer excuses to not have control of your own data. To explore that point I instrumented this site as inexpensively as I possibly could, without sacrificing reliability or functionality. I have full control of all data that is generated, the instrumentation is highly customizable, the output is simple to use, and I don’t have to be available at all hours to keep it working. »

Built to Scale: Running Highly-Concurrent ETL with Apache Airflow (part 1)

Apache Airflow has seemingly taken the data engineering world by storm. It was originally created and maintained by Airbnb, and has been part of the Apache Software Foundation for several years now. After heavily leveraging it for about a year (almost 2 million ¡idempotent! ETL tasks later) and seeing both its full potential and its numerous drawbacks, I was tasked with streamlining the deployment and operation of the system. The obvious first step? »

Why Your Company Should Own Its Own Data

When considering software and related infrastructure, the business of today is caught in a never-ending cycle of “build vs. buy”. Many third-party companies solve serious challenges such as managing sales pipelines, accounting automation, payment processing, and internal communication. These alternatives to “building it yourself” empower companies to operate faster or more efficiently, and the overall benefit to the customer is often net-positive. When weighing these alternatives, however, there is one critical component of your business that you should strongly reconsider leaving in the hands of third parties: your data and supporting data infrastructure. »

How to Populate Fillable PDFs with Python

I was recently working on a small Python project, and one requirement was to populate a PDF form based on some set of data. Easy, right? Not so much. After browsing the internet for a bit, I came across posts like these: “Filling PDF Forms In Python - The Right Way”; “How can I auto populate a PDF form in Django/Python?”; “Using Python to fill PDF form fields?” … but was left wholly unsatisfied. »

Data Pipeline Design Considerations

There are many factors to consider when designing data pipelines, including disparate data sources, dependency management, interprocess monitoring, quality control, maintainability, and timeliness. Toolset choices for each step are incredibly important, and early decisions have tremendous implications for future success. The following post is meant to be a reference for asking the right questions from the start of the design process, instead of halfway through. In terms of the V-Model of systems engineering, it is intended to fall between the “high level design” and “detailed design” steps. »