Modeling Long-Term Drivers of User Engagement

Video engagement is driven by many factors. On the one hand, topics can increase in popularity based on events and social traction in ways that are highly variable, but on the other hand, users also have fairly stable preferences over the kinds of content they engage with in general. At Prizma, in addition to using highly responsive adaptive learning systems to continuously respond to what your users are engaging with right now, we also utilize advanced machine learning algorithms to predict the longer-term, deeper and more abstract drivers of your users' interests.

We track a variety of user interactions with our videos, and use this data over relatively longer periods of time to train models to detect persistent and more general patterns in what your users find interesting and enjoy watching. These models provide us a priori estimates of video performance even before any user data is collected. This enables us to ensure high user engagement as as soon as new content is available

The feature space used in these models consist of a variety of textual features extracted from video metadata, including keywords, sequences of words, closely related words and important people, places and things to understand the topic and substance of a video. We include detailed standardized relationships between key entities which enable us to understand more deeply their more abstract characteristics that are interesting to your users. We also use these features to help go beyond and infer other psychographic dimensions which include "motivations", i.e. the reasons why someone might be watching a video.  When we build our models , these broader, more abstract features are often some of the most important pieces of information for predicting viewer engagement.

Below is a simple visualization of the relative contributions of thousands of video features including keywords, named entities and Prizma’s psychographic features to video engagement on one of our partners over the last six weeks.  As you can see the vast majority of these features are relatively neutral, with a handful of salient features showing real positive or negative predictive value over time.

In this model, among the top features contributing to video engagement were: 1- motivations, including “wanting to laugh”, “wanting to take care of yourself”, and “wanting to know other people's opinions”  and 2 - details about the key celebrities, e.g. whether they are comedians or politicians. With respect to predictive power, these more abstract features often carry more weight than the more specific counterparts in the traditional metadata column, exceeding even those that are highly correlated to these features.

The ability to estimate video’s performance a priori significantly reduces the time and data required to maximize user engagement. This is especially important for environments where the specific popular topics vary rapidly, for example news sites, or where less data is available due to low traffic, or for partners with large and rapidly growing video libraries. This is also helpful when the usual context or personalization based signals are weaker, for example on a site’s home page.

These estimates have resulted in significant improvements in video engagement for our partners. In one A/B test, we compared the performance of our recommendations with and without predicted scores on tens of thousands of users. We found that using these long-term performance predictors increased number of initiated views by about 20%, but they had an even larger impact on the completion rate for those views which increased by >40%. Since these models emphasize the deeper interests of your users, while also improving click rates, we are able to create much larger improvements in further downstream signals of user retention. The results are summarized below.

The success of these algorithms is determined partly by our extensive, in-depth human-centered feature space. Because these dimensions are interpretable and usable, these algorithms can provide deep actionable insights about the wider interests and motivations of your users. We will not only continue to use these models to ensure the best experience for your users, but we also hope to provide insights from these models to help you understand your users better.

Subscribe to Prizma Blog Updates

How we built our analytics pipeline


At Prizma, analytics are our lifeblood.  We collect up to 50M events per day, which is used to display contextually relevant and personalized video content. They let us track performance in real-time, continuously improve our recommendations, enable personalization  as well as provide critical metrics to our partners via the Prizma dashboard.

We needed a solution for storing this data that allowed us to query it in real time while managing costs (after all, we’re a startup).  We explored a number of different solutions before we found one that was a fit for us.  This blog post will explore our process and share our final conclusions. The intended audience is other engineers and data scientists, although we won’t get too far into the weeds technically.

Choosing a data warehouse

The most important decision to make when designing our analytics infrastructure was choosing a data warehouse.  We had been using, a managed solution for storing and aggregating event data.  However, we found ourselves approaching the limit of queries we could run over our data. .  Anything more complex than a single unnested SELECT statement would require custom code to orchestrate the execution.  Queries that joined our event data with other sources of data were infeasible.

Another sticking point was pricing. We were being charged by the number of events ingested and our event volume was putting us at the limit of our pricing bucket.  We didn’t want our decisions about what data to collect to be driven by cost, and furthermore, we knew that the underlying storage and bandwidth were cheap enough that there had to be another cost effective solution.

Having had positive experiences with columnar data stores previously, I knew that they were the way to go for Prizma’s data warehouse.  Since we’re a small team and didn’t want to manage our own infrastructure, this left us deciding between Amazon RedShift and Google BigQuery, the two most popular managed columnar data stores.

RedShift vs BigQuery

RedShift is Amazon’s product in this space.  It runs on virtual machines that Amazon provisions on your behalf.   BigQuery, on the other hand, is a fully managed service.  You don’t have to worry about virtual machines, you just give BigQuery your data and tell it what queries to run.  We are heavy AWS users, which would make RedShift seem to be a more attractive option, however the pricing concerned us.  In order to model the total cost, you need to know how many instances you’ll need, but Amazon’s documentation is of little help with this. All they tell you is that the number and type of instances you need depends on the queries you will run.  In other words, to determine our pricing, we’d have to build a RedShift cluster and test out real queries on real data.  BigQuery, on the other hand, is priced on the amount of data accessed by your queries.  This is straightforward to estimate if you know roughly the size of your data sets and what queries you’ll be running, without actually having to build anything out.

Since we wouldn’t be able to do an accurate price comparison without investing engineering resources in RedShift, and since two of our engineers already had experience with BigQuery, BigQuery was the clear choice.  We also liked that the billing model meant that we wouldn’t be paying for compute time when no queries were being run.   There were some other BigQuery features that helped sway us, like support for streaming inserts and nested and repeated record types.

Event pipeline

Now that we had settled on a data warehouse, we needed a way to get our events into it.  We were already using fluentd as our event collector, which meant that changing our data store  was just a simple configuration change.  We had a choice here between using BigQuery’s streaming inserts feature or regular load jobs.  With streaming inserts, you can add records as often as you’d like, with or without batching.  On the other hand, load jobs are free, but require batching since you are limited in the number of jobs you can run per day.  In the end, we decided that even though we could batch inserts with fluentd, streaming inserts were cheap enough that it wasn’t worth worrying about hitting any limits with load jobs.


Fluentd is an open-source daemon that sits between data sources, like event streams or application logs, and data stores, like S3 or MongoDB.  It decouples the concerns of data collection and storage, while handling details that don’t fit nicely into the request-oriented nature of web applications, like batching.  It’s also blazingly fast, with an advertised throughput of around 13K events/second/core. Since Fluentd already had a plugin for BigQuery, we were able to set up our configuration change and have events written to BigQuery with only a few hours work (mostly setting up access credentials).  We also used fluentd to stream events to our backup storage on S3.

The pipeline

The pipeline



Building this pipeline, we ended up optimizing for simplicity and flexibility. This allowed us to get an event aggregation solution off the ground that allowed us to collect a large amount of data and process it in real time while managing costs. However, since we don’t pre-aggregate any data, our queries end up performing some redundant calculations.  If we did pre-aggregate, we would have to choose between aggregating in real-time or in batches, each with its own downside.  With real time aggregations, new metrics will have to be backfilled, and batched aggregations mean forgoing real time metrics.  In the future, we may explore using tools like Google Cloud Dataflow  which has a novel computational model that can be used for both real time and batch processing, potentially offering the best of both worlds.

Subscribe to Prizma Blog Updates