Analytics

Video and User Insights for Media: AI and Big Data Analytics

As data becomes increasingly ubiquitous for media companies, broadcasters are looking for more effective and creative ways to use that data to optimize every aspect of their business. At Prizma, we have been innovating around collecting and using data to optimize video discovery since we launched, but along the way we realized that the same systems could generate actionable analytics for product, editorial, and marketing teams, which led us to begin developing our new Media Intelligence Layer. Prizma was invited to co-present the new offering on stage at NAB alongside the Google Cloud product team.

The Prizma Media Engagement and Intelligence Solution is powered by artificial intelligence (AI) and enables premium media brands and digital platforms to engage their target users with content across a variety of digital environments while developing a deep understanding of their audience.

Our solution can be leveraged for a variety of business use cases: predictive behavioral analytics that inform editorial teams about what content to create, guidance on how and where to market video content, optimization of how media assets are distributed, and automated, personalized content discovery experiences on owned and operated digital networks.

The Prizma Offering:

Many media clients are interested in using a combination of content metadata, user data (demographics, usage behavior, psychographics), video views, web traffic data, and monetization data to answer critical business questions, such as:

  • Which kinds of videos resonate with different audience segments? Are there specific categories, topics, or personalities that seem to generate better (or worse) engagement with different target segments?
  • How can we compare performance across a variety of distribution channels to extract generalized, usable insights for various teams and uses?
  • What kinds of videos should my content teams be producing (enabling rapid editorial response to viewer demand through predictive performance analytics)?
  • Which videos should I distribute on which platforms?
  • Which traffic sources give me the most engaged users?
  • How can we predict video engagement to help inform editorial, distribution and marketing decisions (especially by user segment)? 

Google Cloud Media Offering:

The Google Cloud Media and Prizma teams demonstrated how media customers can easily extract business insights using the Google Cloud stack and data analytics pipeline (especially data from Google services such as YouTube, Google Analytics, or DoubleClick), layering Prizma’s Media Intelligence Solution on top of the user and video data Prizma collects. Using BigQuery to compile data from multiple sources and Data Studio for rapid, flexible visualization, the Prizma team showed how, over a particular time period, different stories, celebrities, and topics resonated with audiences on YouTube versus owned-and-operated properties, and offered insights on how to create higher levels of engagement for different audience segments on each platform.
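To make this concrete, here is a minimal sketch of the kind of cross-platform comparison query that can be run against BigQuery from Python. The dataset, table, and column names (`media_demo.video_events`, `platform`, `topic`, `watch_time_sec`) are hypothetical placeholders, not the schema used in the demo.

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  topic,
  platform,                      -- e.g. 'youtube' vs. 'owned_and_operated'
  COUNT(*) AS views,
  AVG(watch_time_sec) AS avg_watch_time_sec
FROM `media_demo.video_events`
WHERE event_date BETWEEN '2018-03-01' AND '2018-04-01'
GROUP BY topic, platform
ORDER BY views DESC
"""

# Iterating the query job waits for completion and fetches the result rows,
# which can then be fed into Data Studio or any other visualization tool.
for row in client.query(query):
    print(row.topic, row.platform, row.views, row.avg_watch_time_sec)
```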

The two teams also demonstrated how seamlessly these components integrate, and how the combined offering can quickly deliver better business results.


Modeling Long-Term Drivers of User Engagement

Video engagement is driven by many factors. On the one hand, topics rise and fall in popularity with events and social traction in ways that are highly variable; on the other hand, users have fairly stable preferences for the kinds of content they generally engage with. At Prizma, in addition to using highly responsive adaptive learning systems that continuously react to what your users are engaging with right now, we also use machine learning models to capture the longer-term, deeper, and more abstract drivers of your users’ interests.

We track a variety of user interactions with our videos and use this data over longer periods of time to train models that detect persistent, more general patterns in what your users find interesting and enjoy watching. These models give us a priori estimates of video performance before any user data has been collected, enabling us to ensure high user engagement as soon as new content is available.
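As a rough illustration of the idea (not our actual models or features), an a priori engagement model can be trained on metadata alone. The file name, column names, and model choice below are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical training data: one row per video, with its metadata text and a
# historical engagement label (e.g. completion rate).
videos = pd.read_csv("videos.csv")  # columns: metadata_text, engagement

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(videos["metadata_text"])
y = videos["engagement"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))

# A brand-new video can be scored before it has collected any user data:
score = model.predict(vectorizer.transform(["comedian roasts senator in late-night monologue"]))
```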

The feature space used in these models consists of a variety of textual features extracted from video metadata, including keywords, sequences of words, closely related words, and the important people, places, and things that define the topic and substance of a video. We include detailed, standardized relationships between key entities, which let us understand the more abstract characteristics that interest your users. We also use these features to infer further psychographic dimensions, including “motivations”, i.e. the reasons why someone might be watching a video. When we build our models, these broader, more abstract features are often among the most important pieces of information for predicting viewer engagement.
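As one example of the kind of extraction involved, an off-the-shelf NLP library such as spaCy can pull the named entities out of metadata text. Our actual extraction stack is more elaborate, and the example text here is invented.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

metadata = "Jimmy Fallon and Senator Warren trade jokes about Washington on The Tonight Show"
doc = nlp(metadata)

# Named entities surface the important people, places, and things in a video.
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Jimmy Fallon', 'PERSON'), ('Warren', 'PERSON'), ('Washington', 'GPE'), ...]
```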

Below is a simple visualization of the relative contributions of thousands of video features, including keywords, named entities, and Prizma’s psychographic features, to video engagement on one of our partners’ sites over the last six weeks. As you can see, the vast majority of these features are relatively neutral, with a handful of salient features showing real positive or negative predictive value over time.
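The original figure is not reproduced here, but its shape is easy to sketch with synthetic data: most feature contributions cluster near zero, with a few strong positives and negatives in the tails.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
contributions = np.concatenate([
    rng.normal(0.0, 0.01, 5000),  # the vast majority of features: near-neutral
    rng.normal(0.4, 0.1, 20),     # a handful of strongly positive features
    rng.normal(-0.4, 0.1, 20),    # a handful of strongly negative features
])

plt.hist(contributions, bins=200)
plt.xlabel("contribution to predicted engagement")
plt.ylabel("number of features")
plt.show()
```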

In this model, the top features contributing to video engagement included (1) motivations, such as “wanting to laugh”, “wanting to take care of yourself”, and “wanting to know other people’s opinions”, and (2) details about the key celebrities, e.g. whether they are comedians or politicians. In terms of predictive power, these more abstract features often carry more weight than their more specific counterparts in the traditional metadata, exceeding even those that are highly correlated with them.

The ability to estimate a video’s performance a priori significantly reduces the time and data required to maximize user engagement. This is especially important in environments where popular topics change rapidly (for example, news sites), where less data is available due to low traffic, or for partners with large and rapidly growing video libraries. It is also helpful when the usual contextual or personalization signals are weaker, for example on a site’s home page.

These estimates have produced significant improvements in video engagement for our partners. In one A/B test involving tens of thousands of users, we compared the performance of our recommendations with and without the predicted scores. Using the long-term performance predictors increased the number of initiated views by about 20%, and had an even larger impact on the completion rate of those views, which increased by more than 40%. Because these models target the deeper interests of your users while also improving click rates, they drive much larger improvements in downstream signals of user retention. The results are summarized below.
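To make the two headline numbers concrete, here is a back-of-the-envelope lift calculation; the raw counts are illustrative stand-ins chosen to match the reported lifts, not our actual data.

```python
# Illustrative counts only (chosen to match the reported ~20% and >40% lifts).
views_control, views_treatment = 10_000, 12_000
completions_control, completions_treatment = 3_000, 5_100

def lift(treatment, control):
    """Relative improvement of treatment over control."""
    return (treatment - control) / control

print(f"initiated views: {lift(views_treatment, views_control):+.0%}")  # +20%
rate_control = completions_control / views_control        # 30.0% completion rate
rate_treatment = completions_treatment / views_treatment  # 42.5% completion rate
print(f"completion rate: {lift(rate_treatment, rate_control):+.1%}")    # +41.7%
```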

The success of these algorithms stems in part from our extensive, in-depth, human-centered feature space. Because these dimensions are interpretable and usable, the algorithms can provide deep, actionable insights into the wider interests and motivations of your users. We will continue to use these models to ensure the best experience for your users, and we also hope to surface insights from them to help you understand your users better.


How we built our analytics pipeline

Introduction

At Prizma, analytics are our lifeblood. We collect up to 50M events per day, data we use to display contextually relevant and personalized video content. These events let us track performance in real time, continuously improve our recommendations, and enable personalization, as well as provide critical metrics to our partners via the Prizma dashboard.

We needed a solution for storing this data that allowed us to query it in real time while managing costs (after all, we’re a startup). We explored a number of different solutions before we found one that fit. This blog post walks through our process and shares our conclusions. The intended audience is other engineers and data scientists, although we won’t get too far into the technical weeds.

Choosing a data warehouse

The most important decision in designing our analytics infrastructure was choosing a data warehouse. We had been using Keen.io, a managed solution for storing and aggregating event data. However, we found ourselves approaching the limit of the queries we could run over our data. Anything more complex than a single unnested SELECT statement required custom code to orchestrate the execution, and queries that joined our event data with other sources of data were infeasible.

Another sticking point was pricing. We were charged by the number of events ingested, and our event volume was pushing us to the top of our pricing tier. We didn’t want cost to drive our decisions about what data to collect, and we knew that the underlying storage and bandwidth were cheap enough that there had to be a more cost-effective solution.

Having had positive experiences with columnar data stores previously, I knew they were the way to go for Prizma’s data warehouse. Since we’re a small team and didn’t want to manage our own infrastructure, that left us deciding between Amazon Redshift and Google BigQuery, the two most popular managed columnar data stores.

Redshift vs. BigQuery

Redshift is Amazon’s product in this space. It runs on virtual machines that Amazon provisions on your behalf. BigQuery, on the other hand, is a fully managed service: you don’t have to worry about virtual machines, you just give BigQuery your data and tell it what queries to run. We are heavy AWS users, which would seem to make Redshift the more attractive option, but the pricing concerned us. To model the total cost, you need to know how many instances you’ll need, and Amazon’s documentation is of little help here; all it tells you is that the number and type of instances you need depend on the queries you will run. In other words, to determine our pricing, we’d have to build a Redshift cluster and test real queries on real data. BigQuery, by contrast, is priced on the amount of data accessed by your queries. That is straightforward to estimate if you roughly know the size of your data sets and the queries you’ll be running, without having to build anything out.
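BigQuery even lets you price a query without running it. The sketch below uses the Python client’s dry-run mode, which reports bytes scanned while processing no data; the table name is a placeholder, and the $5/TB figure reflects BigQuery’s on-demand pricing at the time.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports the bytes it would scan,
# without executing it (and therefore without incurring any cost).
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT video_id, COUNT(*) AS views FROM `prizma_demo.events` GROUP BY video_id",
    job_config=job_config,
)

tb = job.total_bytes_processed / 1e12
print(f"would scan ~{tb:.4f} TB, roughly ${tb * 5:.2f} at a $5/TB on-demand rate")
```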

Since we couldn’t do an accurate price comparison without investing engineering resources in Redshift, and since two of our engineers already had experience with BigQuery, BigQuery was the clear choice. We also liked that the billing model meant we wouldn’t be paying for compute time when no queries were running. A few other BigQuery features helped sway us, like support for streaming inserts and for nested and repeated record types.

Event pipeline

Now that we had settled on a data warehouse, we needed a way to get our events into it. We were already using fluentd as our event collector, which meant that changing our data store was just a simple configuration change. We had a choice between BigQuery’s streaming inserts and regular load jobs. With streaming inserts, you can add records as often as you’d like, with or without batching. Load jobs, on the other hand, are free, but require batching since you are limited in the number of jobs you can run per day. In the end, we decided that even though we could batch inserts with fluentd, streaming inserts were cheap enough that it wasn’t worth worrying about hitting load-job limits.
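For reference, a streaming insert is a single client call. This sketch uses the google-cloud-bigquery Python client directly rather than fluentd; the project, dataset, and field names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"event": "video_start",    "video_id": "abc123", "ts": "2017-05-01T12:00:00Z"},
    {"event": "video_complete", "video_id": "abc123", "ts": "2017-05-01T12:02:30Z"},
]

# Streamed rows become queryable within seconds; errors come back per row.
errors = client.insert_rows_json("my-project.analytics.events", rows)
if errors:
    print("failed inserts:", errors)
```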

Fluentd

Fluentd is an open-source daemon that sits between data sources, like event streams or application logs, and data stores, like S3 or MongoDB. It decouples the concerns of data collection and storage while handling details that don’t fit nicely into the request-oriented nature of web applications, like batching. It’s also blazingly fast, with an advertised throughput of around 13K events/second/core. Since fluentd already had a plugin for BigQuery, we were able to make our configuration change and have events written to BigQuery with only a few hours’ work (mostly setting up access credentials). We also used fluentd to stream events to our backup storage on S3.

The pipeline

Improvements

In building this pipeline, we optimized for simplicity and flexibility, which let us get off the ground with an event aggregation solution that collects a large amount of data and processes it in real time while managing costs. However, since we don’t pre-aggregate any data, our queries end up performing some redundant calculations. If we did pre-aggregate, we would have to choose between aggregating in real time or in batches, each with its own downside: with real-time aggregations, new metrics have to be backfilled, while batched aggregations mean forgoing real-time metrics. In the future, we may explore tools like Google Cloud Dataflow, which has a novel computational model that can be used for both real-time and batch processing, potentially offering the best of both worlds.
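As a sketch of what that might look like, here is a toy per-video count written with Apache Beam, the programming model behind Cloud Dataflow; the input path and event schema are hypothetical. Swapping the bounded text source for a Pub/Sub source plus windowing turns the same logic into a streaming job, which is the appeal of the unified model.

```python
import json

import apache_beam as beam

# Toy batch pipeline: count views per video from newline-delimited JSON events.
with beam.Pipeline() as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromText("gs://demo-bucket/events/*.json")
        | "ParseVideoId" >> beam.Map(lambda line: json.loads(line)["video_id"])
        | "CountPerVideo" >> beam.combiners.Count.PerElement()
        | "WriteCounts" >> beam.io.WriteToText("gs://demo-bucket/view_counts")
    )
```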


Why should you care about psychographics? Taking a human-centered approach to engagement

At Prizma we are always trying to understand the “why” behind what people are watching. It’s one thing to know that people are watching a lot of videos about politics; it’s another to know whether that viewing is driven by a desire for information, by outrage around the news, or by the desire to empathize with other people. Understanding that “why” is part of how Prizma drives consistently high performance while surprising and delighting users.


To do this, we use machine learning to generate a “psychographic” feature space that covers some of the underlying reasons why users engage with content. This enhanced feature space informs our entire approach to both recommendations and optimization, and allows us to pinpoint not only which videos are doing well, but why.
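One simple way to picture this (our production system is considerably richer) is as multi-label text classification over video metadata; the tag names and training examples below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented training examples: video titles with psychographic tags.
titles = [
    "Late-night host roasts the latest political scandal",
    "Five-minute guided meditation for a calmer morning",
    "Panel debates the outrage over the new policy",
]
tags = [["wanting to laugh"], ["wanting to take care of yourself"], ["outrage"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)  # one binary column per psychographic tag

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(titles, y)

# Score a new video's metadata against every tag at once.
probs = clf.predict_proba(["Comedians mock the outrage over the latest scandal"])
print(dict(zip(mlb.classes_, probs[0].round(2))))
```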


For example, when we think about people’s preferences, we often point to topics, or perhaps favorite celebrities (including the ones they love to hate), as what drives their engagement. However, these kinds of tags are often incapable of capturing the emotions or driving forces behind that engagement. When we looked at data from the last several weeks on one of our partners’ sites, we found increased performance from videos that covered politics and the incoming President: both did a little shy of 50% better than videos that didn’t cover those subjects. But when we considered our tag of “outrage”, we were able to identify videos that drove nearly 3× the engagement of other videos. Clearly, while people have a renewed interest in politics, “outrage” is one of the key emotions that gets users to watch videos in the current climate.
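The comparison itself is straightforward once videos carry both topic and psychographic tags; this pandas sketch (with invented file and column names) shows the shape of the analysis.

```python
import pandas as pd

# Hypothetical per-video export: an engagement score plus a tag string.
df = pd.read_csv("video_performance.csv")  # columns: video_id, engagement, tags

for tag in ["politics", "president", "outrage"]:
    tagged = df["tags"].str.contains(tag, case=False, na=False)
    rel = df.loc[tagged, "engagement"].mean() / df.loc[~tagged, "engagement"].mean() - 1
    print(f"{tag}: {rel:+.0%} vs. videos without the tag")
```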


In practice, our ability to infer these more general features translates into tangible value for our customers. In one A/B test, we compared the performance of our recommendation pipeline with and without our extended feature space. Utilizing psychographic features increased the number of initiated views by 10% and the number of completed views by more than 30%, indicating that while these features matter for the “click”, they matter even more for generating sustained user attention and engagement.

We have calibrated and honed our psychographic dimensions based on how well they describe and distinguish our partners’ content, aided by intuitive psychological models that draw on the language users and creators alike use to describe content. This human-centered approach can provide actionable insights for our partners. We train these models on highly diverse data and use the resulting dimensions throughout our pipeline. Our psychographic dimensions reduce the resources and data points required to build more abstract representations of both videos and user preferences, allowing us to create high-quality video discovery experiences for any publisher and environment.

We believe that this human-centered approach to understanding content is the key to driving deeper video engagement. As we expand our offering, we hope to go beyond using these psychographics to inform our own optimizations and to provide deeper insights to content creators and advertisers, helping them understand their users better and create the most engaging content for their audiences.