Without freshness, why should anyone trust you or your systems? Why should they pay your company money or take your recommendations and "science" seriously? Why should they spend their time building on your systems or data warehouse?
Two years ago we set up Open-Source Snowplow at CarGurus to fulfill a need for self-managed client-side instrumentation. Since that time it has become incredibly impactful for the entire company and has scaled significantly beyond what was originally envisioned. The following is an overview of why we set up the system, our experience with it, what we have learned, and where we see it continuing to go.
This post is quite long, so before getting too far into the details....
SnowSQL is the command-line interface for accessing your Snowflake instance.
The following is a quick "how to" guide for setting it up.
GDPR was approved by EU parliament on April 14, 2016, went into effect May 25, 2018, and impacts any business handling any personal data of any EU resident.
At a high level, GDPR is a directive on the protection of personal data and can be scoped twofold. First, the law protects persons concerned by processing of personal data. Second, it enforces additional accountability on businesses involved in the processing of personal data.
What does GDPR do exactly? Let's dive in.
We all know that GDPR (also known as RGPD in France) has brought data policy into the spotlight for many technical organizations. As of May 25, 2018, if your systems (both automated and otherwise!) handle PII of individuals residing in the EU, you must comply with regulation. While this enforcement date makes the topic seem like old news, many US-based companies are unclear of the specifics and vastly underprepared to deal with the implications.
Before diving too far into the deep end of implementation detail, one must first understand the basics. The only way to conform to this regulation (which many US-based companies still don't) is to thoroughly understand what data needs to be handled with care.
So... what is personal data?
In a world where the importance of data is steadily increasing yet the cost of computing power is steadily decreasing, there are fewer and fewer excuses to not have control of your own data. To explore that point I instrumented this site as inexpensively as I possibly could, without sacrificing reliability or functionality. I have full control of all data that is generated, the instrumentation is highly customizable, the output is simple to use, and I don't have to be available at all hours to keep it working.
And It costs less than $1 per month.
When considering software and related infrastructure, the business of today is caught in a never-ending cycle of "build vs. buy". Many third-party companies solve serious challenges such as managing sales pipelines, accounting automation, payment processing, and internal communication. These alternatives to "building it yourself" empower companies to operate faster or more efficiently, and overall benefit to the customer is often net-positive. When considering various alternatives, there is one critical component of your business that you should strongly reconsider leaving in the hands of third parties, however: your data and supporting data infrastructure.
There are many factors to consider when designing data pipelines, which include disparate data sources, dependency management, interprocess monitoring, quality control, maintainability, and timeliness. Toolset choices for each step are incredibly important, and early decisions have tremendous implications on future successes. The following post is meant to be a reference to ask the right questions from the start of the design process, instead of halfway through. In terms of the V-Model of systems engineering, it is intended to fall between the “high level design” and “detailed design” steps: