Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square
Monday, March 24, 2014 at 8:56AM
Pete Soderling in BigData, Example

This is a guest repost by Pete Soderling, Founder at Hakka Labs, creating a community where software engineers come to grow.

In response to a recent post from MongoHQ entitled “You don’t have big data," I would generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data - we are all in a position to admit that *more* data is here to stay - and that’s due to many different factors.

Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)

But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward thinking companies to start to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data. The main reason for this is to be able to tee up the data in a consistent way for the seemingly-magical quant-like operations that infer relationships between the data that would have otherwise surely gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”

But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?

As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.

How does Stripe do it?

We spoke to Avi Bryant at Stripe who gave us a nice description of the way Stripe has approached the building of their data pipeline.

Stripe feeds data to HDFS from various sources, many of them 
unstructured or semi-structured -  server logs, for example, or JSON 
and BSON documents. In every case, the first step is to translate 
these into a structured format. We've standardized on using Thrift to 
define the logical structure, and Parquet as the on-disk storage 

We chose Parquet because it's an efficient columnar format 
which is natively understood by Cloudera's Impala query engine, which 
gives us very fast relational access to our data for ad-hoc reporting. 
The combination of Parquet and Thrift can also be used efficiently and 
idiomatically from Twitter's Scalding framework, which is our tool of 
choice for complex batch processing.

The next stage is "denormalization": to keep our analytical jobs and 
queries fast, we do the most common joins ahead of time, in Scalding, 
writing to new set of Thrift schemas. At the same time, we do a bunch 
of enhancing and annotating of the data: for example, geocoding IP 
addresses, parsing user agent strings, or cleaning up missing values.

In many cases, this results in schemas with nested structure, which 
works well from Scalding and which Parquet is happy to store, but 
which Impala cannot currently query. We've developed a simple tool 
which will transform arbitrarily nested Parquet data into an 
equivalent flattened schema, and where necessary we use this to 
maintain a parallel copy of each data source for use from Impala. 
We're looking forward to future versions of Impala which might remove 
this extra step.

Tapad’s Data Pipeline

Tapad is an ad-tech business in NYC that’s experienced lots of growth in both traffic and data over the past several years. So I reached out to their CTO, Dag Liodden, to find out how they’ve built their data pipeline, and some of the strategies and tools they use. In Dag’s own words, here’s how they do it:

He notes that the last point allows them to retroactively change their streaming computation at-will and then backfill the other data stores with updated projections.

Dag also explains the “why” behind their use of multiple types of data technologies on the storage side and explains that each of them has its own particular “sweet-spot” which makes it attractive to them:

How Etsy Handles Data

For another example, we reached Rafe Colburn, the engineering manager of Etsy's data team and asked how they handled their pipeline. Here's the scoop from Rafe:

Etsy's analytics pipeline isn't especially linear. It starts with our instrumentation, which consists of an event logger that runs in the browser and another that can be called from the back end. Both ping some internal "beacon" servers.

We actually use the good old logrotate program to ship the generated Apache access logs to HDFS when they reach a certain size, and process them using Hadoop. We also snapshot our production data (which resides in MySQL) nightly and copy it to HDFS as well, so that we can join our clickstream data to our transactional data.

We usually send the output of our Hadoop jobs to our Vertica data warehouse, where we replicate our production data as well, for further analysis. We use that data to feed our homemade reporting and analytics tools.

For features on that use data generated from Hadoop jobs, we have a custom tool that takes the output of a job and stores it on our sharded MySQL cluster where it can be accessed at scale. This year we're looking at integrating Kafka into our pipeline to move data both from our instruments to Hadoop (and to streaming analytics tools), and also to send data back from our analytics platforms to the public site.

Square’s Approach

Another example from a company that has a sophisticated data pipeline is Square. We reached one of their engineering managers, Pascal-Louis Perez, who gave us a strategic view of their pipeline architecture.

Because of the importance of payments flowing through its system, Square has extended the concept of ‘reconciliation’ throughout its entire data pipeline; with each transformation data must be able to be audited and verified. The main issue with this approach, according to Pascal, is that it can be challenging to scale. For every payment received there are "roughly 10 to 15 accounting entries required and the reconciliation system must therefore scale at one order of magnitude above that of processing, which already is very large.”

Square’s approach to solving this problem leverages stream processing, which allows them to map a corresponding data domain to a different stream. In Pascal’s words, "The streams represent the first level abstraction distancing the data-pipeline from the diversity of sources of data. The next level are operators combining one of multiple streams, and producing themselves one or multiple streams. One example operator is a "matcher" which takes two streams, extracts similarly kinded keys from those, and produces two streams separated based on matching criterias.”

Pascal notes that  the system of stream processing and stream based operators is similar to relational algebra and its operators, but in this case it’s in real-time and on infinite relations.


It’s pretty obvious that cramming data into Hadoop won’t give you these capabilities! 

Article originally appeared on (
See website for complete article licensing information.