How TripleLift Built an Adtech Data Pipeline Processing Billions of Events Per Day
This is a guest post by Eunice Do, Data Engineer at TripleLift, a technology company leading the next generation of programmatic advertising.
What is the name of your system and where can we find out more about it?
The system is the data pipeline at TripleLift. TripleLift is an adtech company, and like most companies in this industry, we deal with high volumes of data on a daily basis. Over the last 3 years, the TripleLift data pipeline scaled from processing millions of events per day to processing billions. This processing can be summed up as the continuous aggregation and delivery of reporting data to users in a cost-efficient manner. In this article, we'll mostly be focusing on the current state of this multi-billion event pipeline.
To follow the 5-year journey leading up to the current state, check out this talk on the evolution of the TripleLift pipeline by our VP of Engineering.
Why did you decide to build this system?
We needed a system that could:
- scale with rapid data volume growth
- aggregate log-level data in a cost-efficient manner
- have a clear, manageable job dependency chain
- automate idempotent job runs and retries
- deliver data to BI and reporting tools within expected SLAs
- handle query load on those reporting tools
How big is your system? Try to give a feel for how much work your system does.
Approximate system stats (circa April 2020):
- The largest event log receives 30 billion events per day
- 5 daily aggregation jobs, with the largest outputting 7.5 GB
- 25 hourly aggregation jobs, with the largest outputting 2.5 GB
- 15 hourly jobs ingest aggregate data into BI tools
- The highest-cardinality normalized aggregation has 75 dimensions and 55 metrics
What is your in/out bandwidth usage?
The data pipeline aggregates event data collected by our Kafka cluster, which has an approximate I/O of 2.5GB/hr.
How fast are you growing?
We’ve been fortunate to see rapid year-over-year growth for the past few years. The amount of data processed in our pipeline grew 4.75x from 2016 to 2017, 2.5x from 2017 to 2018, and 3.75x from 2018 to 2019. Compounded, that’s roughly a 45x increase in 3 years!
How is your system architected?
At a high level, our data pipeline runs batch processes with a flow that consists of:
- raw event collection and persistence
- denormalization and normalization of data via multiple levels of aggregation
- persistence or ingestion of aggregated data into various datastores
- exposure of ingested data in UIs and reporting tools for querying
We begin with the data collection process, in which raw event data is published to any of our 50+ Kafka topics. These events are consumed by Secor (an open-source Kafka consumer created by Pinterest) and written out to AWS S3 in Parquet format.
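For a concrete feel of step 1, here is a minimal sketch of how a service might publish a raw event to one of those Kafka topics using the confluent-kafka client. The broker address, topic name, and event fields are hypothetical, not TripleLift's actual schema; in production, Secor then picks these events up and lands them in S3 as Parquet.

```python
# Hypothetical sketch of raw event collection: publish one auction event.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka-broker-1:9092"})

event = {
    "auction_id": "auction-12345",      # hypothetical identifiers
    "placement_id": 678,
    "timestamp_ms": int(time.time() * 1000),
}

# Keying by auction_id keeps all events for one auction on the same partition.
producer.produce(
    topic="auction_events",
    key=event["auction_id"],
    value=json.dumps(event).encode("utf-8"),
)
producer.flush()
```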
We use Apache Airflow to facilitate the scheduling and dependency management necessary in both steps 2 and 3.
Aggregation tasks are kicked off by Airflow via job submit requests to the Databricks API. The aggregations are run with Apache Spark on Databricks clusters. Data is first denormalized into wide tables by joining a multitude of raw event logs, in order to paint a full picture of what occurred before, during, and after the auction for an ad slot. Denormalized logs are persisted to S3.
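This isn't our production DAG, but a minimal sketch of how that job submission and dependency wiring might look in Airflow using the Databricks submit-run operator (the import path varies by Airflow version). The cluster configuration, class names, JAR location, and schedule are all hypothetical.

```python
# Sketch only: an hourly DAG that submits aggregation jobs to Databricks
# and expresses the denormalization -> normalization dependency.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

default_args = {
    "retries": 2,                          # automated retries on failure
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="hourly_aggregations",
    start_date=datetime(2020, 4, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
) as dag:

    denormalize = DatabricksSubmitRunOperator(
        task_id="denormalize_auction_events",
        json={
            "run_name": "denormalize-{{ ts_nodash }}",
            "new_cluster": {
                "spark_version": "6.4.x-scala2.11",
                "node_type_id": "i3.2xlarge",
                "num_workers": 8,
            },
            "spark_jar_task": {
                "main_class_name": "com.example.agg.DenormalizeJob",
                # Passing the logical execution hour keeps reruns idempotent.
                "parameters": ["--hour", "{{ ts }}"],
            },
            "libraries": [{"jar": "s3://example-bucket/jars/aggregations.jar"}],
        },
    )

    normalize_report = DatabricksSubmitRunOperator(
        task_id="normalize_report_rollup",
        json={
            "run_name": "normalize-report-{{ ts_nodash }}",
            "new_cluster": {
                "spark_version": "6.4.x-scala2.11",
                "node_type_id": "i3.2xlarge",
                "num_workers": 4,
            },
            "spark_jar_task": {
                "main_class_name": "com.example.agg.ReportRollupJob",
                "parameters": ["--hour", "{{ ts }}"],
            },
        },
    )

    # The normalization task (described next) only runs once its upstream
    # denormalization succeeds.
    denormalize >> normalize_report
```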
After the denormalization tasks enter a success state, the Airflow scheduler kicks off their downstream normalization tasks. Each of these aggregations rolls the denormalized data up into a more narrowly scoped set of dimensions and metrics that fits the business context of a specific report. These final aggregations are persisted to S3 as well.
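Our aggregations themselves are written in Spark Scala and run on Databricks clusters, but purely as an illustration, a PySpark-flavored sketch of the two levels just described might look like the following. The S3 paths, join keys, dimensions, and metrics are hypothetical and heavily simplified.

```python
# Illustrative sketch: denormalize raw event logs into a wide table,
# then roll it up into a report-specific aggregate.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-aggregation-sketch").getOrCreate()

hour_path = "dt=2020-04-01/hour=12"   # hypothetical partition layout

# Level 1: denormalization -- join raw event logs keyed on the auction to
# capture what happened before, during, and after the auction.
auctions = spark.read.parquet(f"s3://example-raw/auctions/{hour_path}")
impressions = spark.read.parquet(f"s3://example-raw/impressions/{hour_path}")
clicks = spark.read.parquet(f"s3://example-raw/clicks/{hour_path}")

wide = (
    auctions
    .join(impressions, "auction_id", "left")
    .join(clicks, "auction_id", "left")
)
wide.write.mode("overwrite").parquet(
    f"s3://example-denorm/auction_wide/{hour_path}"
)

# Level 2: normalization -- roll the wide table up to just the dimensions
# and metrics a specific report needs.
report = (
    wide.groupBy("publisher_id", "placement_id", "advertiser_id")
        .agg(
            F.count("auction_id").alias("auctions"),
            F.count("impression_id").alias("impressions"),
            F.count("click_id").alias("clicks"),
            F.sum("revenue").alias("revenue"),
        )
)
report.write.mode("overwrite").parquet(
    f"s3://example-agg/publisher_report/{hour_path}"
)
```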
Upon the success of each final aggregation task, the Airflow scheduler kicks off its various downstream persistence or ingestion tasks. One such task copies aggregated data into Snowflake, a data analytics platform that serves as the backend for our business intelligence tools. Another task ingests data into Imply Druid, a managed cloud solution built around a time-optimized, columnar datastore that supports ad-hoc analytics queries over large datasets.
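As a sketch of what the Snowflake copy task involves (not our actual implementation), loading an hourly aggregate from S3 could be done with the Snowflake Python connector and a COPY INTO statement; the account, stage, table, and path names below are hypothetical.

```python
# Hypothetical sketch: copy an hourly Parquet aggregate from an S3 stage
# into a Snowflake table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="REPORTING",
    schema="AGGREGATES",
)

copy_sql = """
    COPY INTO publisher_report
    FROM @agg_stage/publisher_report/dt=2020-04-01/hour=12/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""

cur = conn.cursor()
try:
    cur.execute(copy_sql)
finally:
    cur.close()
    conn.close()
```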
Finally, step 4 is a joint effort between our business intelligence and data engineering teams. The primary places where aggregated data can be queried are our internal reporting APIs, Looker (which is backed by Snowflake), and Imply Pivot (a drag and drop analytics UI bundled into the Imply Druid solution).
What lessons have you learned?
Data decisions tend to have far reaching repercussions. For example:
- It is not straightforward to change the way an existing field is derived once it is defined, because historical continuity for that field usually has to be maintained.
- If a mistake in any of the multiple levels of aggregation runs unnoticed for some time, it can be time-consuming and expensive to backfill that time period with corrected data.
- If incorrect or incomplete data was made available for querying, then there is no telling where that data has already ended up.
What do you wish you would have done differently?
For a long time we didn’t have a clear approach to the scope of data we made accessible, or where we made it accessible.
- A retention policy didn’t exist, and a lot of data was stored indefinitely.
- Users were allowed to query for that indefinitely stored data, which degraded query performance for everyone else.
- All reporting tools supported all types of queries. In other words, we failed to properly distinguish internal vs. external reporting and reporting vs. analytics use cases. This lack of discipline in how we made our data accessible ended up shaping our data itself. For example, we had no hot vs. cold tiering, no separation of high- vs. low-granularity reporting data, and no reasonable data retention policy.
We’ve since given more thought to our approach and taken measures such as implementing AWS S3 lifecycle rules, defining retention policies per queryable datasource, and designating reporting tools to handle either quick, investigative queries, or large reports over long date ranges - but not both.
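As an example of the first of those measures, an S3 lifecycle rule of the kind we mean can be applied with boto3 roughly as follows; the bucket name, prefix, storage tiers, and retention windows are hypothetical rather than our actual policy.

```python
# Sketch: tier older aggregate output to colder storage, then expire it.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-aggregates",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-hourly-aggregates",
                "Filter": {"Prefix": "hourly/"},
                "Status": "Enabled",
                # Move to infrequent access after 30 days, Glacier after 90...
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # ...and delete after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```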
How are you thinking of changing your architecture in the future?
We plan to supplement our batch pipeline with a real-time streaming application built on Kafka Streams. We chose Kafka Streams after running proofs of concept with Spark Structured Streaming, Kafka Streams, and KSQL.
How do you graph network and server statistics and trends?
We use Prometheus to store application metrics. We then graph those metrics in Grafana dashboards and set alerts on them.
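For instance, here is a minimal sketch of how an application can expose metrics for Prometheus to scrape, using the official Python client; the metric names and port are hypothetical.

```python
# Sketch: expose a counter and a histogram on /metrics for Prometheus.
import time

from prometheus_client import Counter, Histogram, start_http_server

EVENTS_PROCESSED = Counter(
    "pipeline_events_processed_total",
    "Number of events processed",
    ["topic"],
)
BATCH_SECONDS = Histogram(
    "pipeline_batch_duration_seconds",
    "Time spent processing a batch",
)

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://host:8000/metrics
    while True:
        with BATCH_SECONDS.time():
            time.sleep(0.2)   # stand-in for real batch work
        EVENTS_PROCESSED.labels(topic="auction_events").inc(1000)
```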
How many people are in your team?
There are a total of 4 data engineers on the team, and we make up about a tenth of the entire engineering organization. The teams we often collaborate with are infrastructure, solutions, and data science. For example, we recently onboarded the data science team with Airflow, and their model runs are now automated.
Which languages do you use to develop your system?
Python for Airflow, Scala for the Spark aggregations, and Java for some of the reporting tools.