How we built our analytics pipeline

Introduction

At Prizma, analytics are our lifeblood. We collect up to 50M events per day, which we use to display contextually relevant and personalized video content. These events let us track performance in real time, continuously improve our recommendations, power personalization, and provide critical metrics to our partners via the Prizma dashboard.

We needed a way to store this data that let us query it in real time while keeping costs under control (after all, we’re a startup). We explored a number of options before finding one that fit. This post walks through our process and shares our conclusions. The intended audience is other engineers and data scientists, although we won’t get too far into the technical weeds.

Choosing a data warehouse

The most important decision in designing our analytics infrastructure was choosing a data warehouse. We had been using Keen.io, a managed solution for storing and aggregating event data. However, we found ourselves running up against the limits of what we could query. Anything more complex than a single, unnested SELECT statement required custom code to orchestrate the execution, and queries that joined our event data with other data sources were infeasible.

Another sticking point was pricing. We were being charged by the number of events ingested, and our event volume was pushing us to the top of our pricing tier. We didn’t want decisions about what data to collect to be driven by cost, and we knew the underlying storage and bandwidth were cheap enough that a more cost-effective solution had to exist.

Having had positive experiences with columnar data stores before, I knew they were the way to go for Prizma’s data warehouse. Since we’re a small team and didn’t want to manage our own infrastructure, that left us choosing between Amazon Redshift and Google BigQuery, the two most popular managed columnar data stores.

Redshift vs. BigQuery

Redshift is Amazon’s product in this space. It runs on virtual machines that Amazon provisions on your behalf. BigQuery, on the other hand, is a fully managed service: you don’t worry about virtual machines, you just give BigQuery your data and tell it what queries to run. We are heavy AWS users, which made Redshift seem like the more natural option, but its pricing concerned us. To model the total cost, you need to know how many instances you’ll need, and Amazon’s documentation is of little help here; all it says is that the number and type of instances depend on the queries you will run. In other words, to determine our pricing, we would have had to build a Redshift cluster and test real queries on real data. BigQuery, by contrast, is priced on the amount of data scanned by your queries, which is straightforward to estimate if you know roughly how big your data sets are and what queries you’ll run, without having to build anything out.
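To make that concrete, here is a rough back-of-the-envelope sketch of how such an estimate can be put together. Every number below (data scanned per query, query volume, and the roughly $5-per-TB on-demand rate BigQuery charged at the time) is an illustrative assumption, not an actual Prizma figure.

```python
# Back-of-the-envelope BigQuery cost estimate; every number here is an
# illustrative assumption, not an actual Prizma figure.
PRICE_PER_TB_SCANNED = 5.00   # BigQuery on-demand pricing, USD per TB scanned
TB_SCANNED_PER_QUERY = 0.02   # assume a typical dashboard query scans ~20 GB
QUERIES_PER_DAY = 500         # assume dashboard refreshes plus ad hoc analysis

daily_cost = PRICE_PER_TB_SCANNED * TB_SCANNED_PER_QUERY * QUERIES_PER_DAY
print(f"Estimated query cost: ${daily_cost:.2f}/day (~${daily_cost * 30:,.0f}/month)")
```

Doing the same exercise for Redshift would have required guessing a cluster size up front, which is exactly the uncertainty that pushed us away from it.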

Since we couldn’t do an accurate price comparison without investing engineering resources in Redshift, and since two of our engineers already had experience with BigQuery, BigQuery was the clear choice. We also liked that its billing model meant we wouldn’t pay for compute time when no queries were running. A few other BigQuery features helped sway us, like support for streaming inserts and for nested and repeated record types.
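As an illustration of the nested and repeated record types mentioned above, here is a minimal sketch of what an event table schema could look like, using the google-cloud-bigquery Python client. The project, dataset, and field names are hypothetical, not our actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical event schema: each event row can carry a repeated, nested list
# of recommendations, so no separate join table is needed to store them.
schema = [
    bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("video_id", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    bigquery.SchemaField(
        "recommendations", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("video_id", "STRING"),
            bigquery.SchemaField("position", "INTEGER"),
        ],
    ),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
client.create_table(table)  # one-time setup; raises if the table already exists
```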

Event pipeline

Now that we had settled on a data warehouse, we needed a way to get our events into it. We were already using Fluentd as our event collector, which meant that changing our data store was a simple configuration change. We had a choice between BigQuery’s streaming inserts and regular load jobs. With streaming inserts, you can add records as often as you’d like, with or without batching. Load jobs, on the other hand, are free, but require batching since you’re limited in the number of jobs you can run per day. In the end, we decided that even though we could batch inserts with Fluentd, streaming inserts were cheap enough that it wasn’t worth worrying about hitting load-job limits.
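For reference, a streaming insert looks roughly like this with the google-cloud-bigquery Python client. Our pipeline goes through Fluentd rather than calling the API directly, so this is only an illustration, and the table and field names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"  # hypothetical table

# Streaming inserts: rows become queryable within seconds, no batching required.
rows = [
    {"event_id": "e-001", "event_type": "video_play", "timestamp": "2017-06-01T12:00:00Z"},
    {"event_id": "e-002", "event_type": "video_click", "timestamp": "2017-06-01T12:00:03Z"},
]
errors = client.insert_rows_json(table_id, rows)
if errors:
    # Per-row failures come back in the response rather than as an exception.
    print("Failed rows:", errors)
```

A load job would instead batch rows into files and load them on a schedule, which is free but makes new data available more slowly.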

Fluentd

Fluentd is an open-source daemon that sits between data sources, like event streams or application logs, and data stores, like S3 or MongoDB. It decouples the concerns of data collection and storage, while handling details that don’t fit nicely into the request-oriented nature of web applications, like batching. It’s also blazingly fast, with an advertised throughput of around 13K events/second/core. Since Fluentd already had a plugin for BigQuery, we were able to make our configuration change and have events written to BigQuery with only a few hours’ work (mostly setting up access credentials). We also used Fluentd to stream events to our backup storage on S3.
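On the application side, emitting an event into Fluentd is essentially a one-liner. Here is a minimal sketch using the fluent-logger Python package; the tag and event fields are hypothetical, and a production setup would add buffering and error handling.

```python
from fluent import sender

# Connect to the local Fluentd daemon, which forwards events to BigQuery
# and to our S3 backup storage.
logger = sender.FluentSender("prizma", host="localhost", port=24224)

# The full tag becomes "prizma.events"; Fluentd routes it with its <match> rules.
logger.emit("events", {
    "event_type": "video_play",
    "video_id": "abc123",
    "placement": "article_page",
})
logger.close()
```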

The pipeline

[Diagram: the event pipeline, with events flowing through Fluentd into BigQuery and backed up to S3]

Improvements

In building this pipeline, we optimized for simplicity and flexibility. That let us get an event aggregation solution off the ground that could collect a large volume of data and process it in real time while keeping costs under control. However, since we don’t pre-aggregate any data, our queries end up repeating some calculations. If we did pre-aggregate, we would have to choose between aggregating in real time or in batches, each with its own downside: with real-time aggregation, new metrics have to be backfilled, while batched aggregation means forgoing real-time metrics. In the future, we may explore tools like Google Cloud Dataflow, whose computational model can be used for both real-time and batch processing, potentially offering the best of both worlds.
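To make the pre-aggregation idea concrete, here is a sketch of the kind of batched rollup we could run, again with the Python BigQuery client. The table and column names are hypothetical, and a real job would be scheduled and would only process new data.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Batched pre-aggregation: roll raw events up into daily per-video play counts
# so that dashboard queries scan a small summary table instead of raw events.
rollup_sql = """
    SELECT
      DATE(timestamp) AS day,
      video_id,
      COUNT(*) AS plays
    FROM `my-project.analytics.events`
    WHERE event_type = 'video_play'
    GROUP BY day, video_id
"""

job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string(
    "my-project.analytics.daily_video_plays"
)
job_config.write_disposition = "WRITE_TRUNCATE"  # rebuild the rollup on each run

client.query(rollup_sql, job_config=job_config).result()
```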
