
Video and User Insights for Media: AI and Big Data Analytics

As data becomes increasingly ubiquitous for media companies, broadcasters are looking for more effective and creative ways to use that data to optimize all aspects of their business. At Prizma we have been innovating around collecting and using data to optimize video discovery since we launched, but along the way we realized that we could use the same systems to generate actionable analytics for product, editorial, and marketing teams, which led us to begin developing our new Media Intelligence Layer. Prizma was invited to co-present our new offering on stage at NAB alongside the Google Cloud product team.

The Prizma Media Engagement and Intelligence Solution is powered by artificial intelligence (AI) and enables premium media brands and digital platforms to effectively engage their target users with content across various digital environments and to develop a deep understanding of their audience.

Our solution can be leveraged for a variety of business use cases: predictive behavioral analytics that inform editorial teams on what content to create, guidance on how and where to market video content, optimization of how media assets are distributed, and automated, personalized content discovery experiences on owned and operated digital networks.

The Prizma Offering:

Many media clients are interested in using a combination of content metadata, user data (demographics, usage behavior, psychographics), video views, web traffic data, and monetization data to answer critical business questions, such as:

  • Which kinds of videos resonate with different audience segments? Are there specific categories, topics, or personalities that seem to generate better (or worse) engagement with different target segments?
  • How can we compare performance across a variety of distribution channels to extract generalized, usable insights for various teams and uses?
  • What kinds of videos should my content teams be producing (enabling rapid editorial response to viewer demand and predictive performance analytics)?
  • Which videos should I distribute on which platforms?
  • Which traffic sources give me the most engaged users?
  • How can we predict video engagement to help inform editorial, distribution and marketing decisions (especially by user segment)? 

Google Cloud Media Offering:

The Google Cloud Media and Prizma teams demonstrated how media customers can easily extract business insights using the Google Cloud stack and data analytics pipeline (especially data from Google services such as YouTube, Google Analytics, or DoubleClick), while layering Prizma’s Media Intelligence Solution on top of user and video data collected by Prizma. Using BigQuery to compile data from multiple sources and Data Studio for rapid, flexible visualization, the Prizma team demonstrated how, over a particular time period, different stories, celebrities, and topics resonated with audiences on YouTube versus owned-and-operated (O&O) properties, and offered insights on how to drive higher engagement for different audience segments on each platform.
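
To make that kind of cross-platform comparison concrete, a query along the following lines could surface engagement by topic and platform. This is only a minimal sketch, not the query used in the demo: the project, dataset, table, and column names (platform, topic, watch_time_seconds, event_date) are hypothetical placeholders, and it assumes the google-cloud-bigquery Python client.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hypothetical schema: one row per video view, with the platform it occurred on,
# the video's topic tag, and how long the viewer watched.
query = """
    SELECT
      platform,                        -- e.g. 'youtube' or 'o_and_o'
      topic,
      COUNT(*)                AS views,
      AVG(watch_time_seconds) AS avg_watch_time
    FROM `my_project.analytics.video_events`   -- placeholder table
    WHERE event_date BETWEEN '2018-04-01' AND '2018-04-30'
    GROUP BY platform, topic
    ORDER BY avg_watch_time DESC
"""

for row in client.query(query).result():
    print(row.platform, row.topic, row.views, row.avg_watch_time)
```

A result set shaped like this drops straight into Data Studio as a BigQuery data source for the kind of visual comparison shown on stage.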

The two teams also demonstrated how seamlessly these components integrate, and how the combined offering can deliver better business results quickly.


    How we built our analytics pipeline

    Introduction

    At Prizma, analytics are our lifeblood.  We collect up to 50M events per day, which we use to display contextually relevant and personalized video content.  These events let us track performance in real time, continuously improve our recommendations, and enable personalization, as well as provide critical metrics to our partners via the Prizma dashboard.

    We needed a solution for storing this data that allowed us to query it in real time while managing costs (after all, we’re a startup).  We evaluated a number of different options before we found one that was a fit for us.  This blog post will walk through our process and share our conclusions.  The intended audience is other engineers and data scientists, although we won’t get too far into the weeds technically.

    Choosing a data warehouse

    The most important decision to make when designing our analytics infrastructure was choosing a data warehouse.  We had been using Keen.io, a managed solution for storing and aggregating event data.  However, we found ourselves approaching the limits of the queries we could run over our data.  Anything more complex than a single unnested SELECT statement would require custom code to orchestrate the execution.  Queries that joined our event data with other sources of data were infeasible.

    Another sticking point was pricing.  We were being charged by the number of events ingested, and our event volume was putting us at the limit of our pricing bucket.  We didn’t want our decisions about what data to collect to be driven by cost, and furthermore, we knew that the underlying storage and bandwidth were cheap enough that there had to be a more cost-effective solution.

    Having had positive experiences with columnar data stores previously, I knew that they were the way to go for Prizma’s data warehouse.  Since we’re a small team and didn’t want to manage our own infrastructure, we were left deciding between Amazon Redshift and Google BigQuery, the two most popular managed columnar data stores.

    Redshift vs. BigQuery

    Redshift is Amazon’s product in this space.  It runs on virtual machines that Amazon provisions on your behalf.  BigQuery, on the other hand, is a fully managed service: you don’t have to worry about virtual machines, you just give BigQuery your data and tell it what queries to run.  We are heavy AWS users, which would seem to make Redshift the more attractive option; however, its pricing concerned us.  In order to model the total cost, you need to know how many instances you’ll need, but Amazon’s documentation is of little help here.  All it tells you is that the number and type of instances you need depend on the queries you will run.  In other words, to determine our pricing, we’d have to build a Redshift cluster and test out real queries on real data.  BigQuery, by contrast, is priced on the amount of data scanned by your queries.  This is straightforward to estimate if you know roughly the size of your data sets and what queries you’ll be running, without actually having to build anything out.
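
    As an illustration, that back-of-the-envelope estimate looks something like the sketch below.  Only the method reflects the reasoning above; the row size, query volume, and the $5/TB on-demand rate are assumptions for the example (rates change, so check current BigQuery pricing).

```python
# Rough BigQuery cost estimate: on-demand pricing charges per byte scanned.
# All numbers here are illustrative assumptions, not Prizma's actual figures.

EVENTS_PER_DAY = 50_000_000     # ~50M events/day, as noted above
BYTES_PER_EVENT = 200           # assumed size of the columns a typical query touches
DAYS_SCANNED_PER_QUERY = 30     # assume a typical query scans a month of data
QUERIES_PER_DAY = 100           # assumed dashboard + ad hoc query volume
PRICE_PER_TB = 5.00             # illustrative on-demand rate (USD per TB scanned)

bytes_per_query = EVENTS_PER_DAY * DAYS_SCANNED_PER_QUERY * BYTES_PER_EVENT
tb_per_query = bytes_per_query / 1e12
monthly_cost = tb_per_query * QUERIES_PER_DAY * 30 * PRICE_PER_TB

print(f"~{tb_per_query:.2f} TB scanned per query, roughly ${monthly_cost:,.0f}/month")
```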

    Since we wouldn’t be able to do an accurate price comparison without investing engineering resources in Redshift, and since two of our engineers already had experience with BigQuery, BigQuery was the clear choice.  We also liked that the billing model meant that we wouldn’t be paying for compute time when no queries were being run.  There were some other BigQuery features that helped sway us, like support for streaming inserts and nested and repeated record types.
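
    To give a flavor of the nested and repeated record support, here is a minimal sketch of how such a schema might be declared with the google-cloud-bigquery Python client.  The field names (video_id, tags, user) and the table are hypothetical, not our production schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A repeated field ("tags") and a nested record ("user") in one table schema.
schema = [
    bigquery.SchemaField("event_time", "TIMESTAMP"),
    bigquery.SchemaField("video_id", "STRING"),
    bigquery.SchemaField("tags", "STRING", mode="REPEATED"),
    bigquery.SchemaField(
        "user",
        "RECORD",
        fields=[
            bigquery.SchemaField("id", "STRING"),
            bigquery.SchemaField("country", "STRING"),
        ],
    ),
]

table = bigquery.Table("my_project.analytics.video_events_demo", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists
```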

    Event pipeline

    Now that we had settled on a data warehouse, we needed a way to get our events into it.  We were already using fluentd as our event collector, which meant that changing our data store was a simple configuration change.  We had a choice here between using BigQuery’s streaming inserts feature or regular load jobs.  With streaming inserts, you can add records as often as you’d like, with or without batching.  On the other hand, load jobs are free, but require batching since you are limited in the number of jobs you can run per day.  In the end, we decided that even though we could batch inserts with fluentd, streaming inserts were cheap enough that it wasn’t worth worrying about hitting any limits with load jobs.
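
    For reference, a streaming insert through the Python client looks roughly like the sketch below.  Our production path goes through fluentd rather than this client, and the table and fields are placeholders carried over from the schema sketch above.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.analytics.video_events_demo"  # placeholder table

# Streaming inserts: rows become queryable within seconds, with no load job to manage.
rows = [
    {"event_time": "2018-04-01T12:00:00", "video_id": "abc123", "tags": ["news"]},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)
```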

    fluentd

    Fluentd is an open-source daemon that sits between data sources, like event streams or application logs, and data stores, like S3 or MongoDB.  It decouples the concerns of data collection and storage, while handling details that don’t fit nicely into the request-oriented nature of web applications, like batching.  It’s also blazingly fast, with an advertised throughput of around 13K events/second/core.  Since Fluentd already had a plugin for BigQuery, we were able to make our configuration change and have events written to BigQuery with only a few hours’ work (mostly spent setting up access credentials).  We also used fluentd to stream events to our backup storage on S3.
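
    The relevant piece of the configuration is a match section that routes events to the BigQuery output plugin.  The snippet below is a minimal sketch assuming the fluent-plugin-bigquery plugin; directive names vary between plugin versions, and the tag pattern, project, dataset, table, and key path are placeholders, not our production config.

```
# Minimal sketch of a fluentd match section for fluent-plugin-bigquery
# (v2-style directives; adjust for the plugin version you run).
<match prizma.events.**>
  @type bigquery_insert

  # Service-account credentials (placeholder path)
  auth_method json_key
  json_key /path/to/service-account-key.json

  # Destination (placeholder project/dataset/table)
  project my_project
  dataset analytics
  table video_events

  # Pull the table schema from BigQuery instead of declaring it inline
  fetch_schema true
</match>
```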

    The pipeline

    [Diagram: the Prizma event pipeline]

    Improvements

    Building this pipeline, we ended up optimizing for simplicity and flexibility.  This let us get an event aggregation solution off the ground that collects a large amount of data and processes it in real time while keeping costs under control.  However, since we don’t pre-aggregate any data, our queries end up performing some redundant calculations.  If we did pre-aggregate, we would have to choose between aggregating in real time or in batches, each with its own downside: with real-time aggregation, new metrics have to be backfilled, while batched aggregation means forgoing real-time metrics.  In the future, we may explore tools like Google Cloud Dataflow, which has a novel computational model that can be used for both real-time and batch processing, potentially offering the best of both worlds.
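
    For a sense of what that might look like, here is a toy Apache Beam pipeline (the SDK that Dataflow runs) computing per-video counts in one-minute windows.  It is only a sketch of the windowed-aggregation idea, run on a tiny in-memory input rather than our real event stream; the same code shape applies to both batch and streaming sources.

```python
import apache_beam as beam
from apache_beam.transforms import window

# Toy stand-in for the real event stream: (video_id, event_time_in_seconds).
events = [("video_a", 10), ("video_a", 70), ("video_b", 15)]

with beam.Pipeline() as p:
    (
        p
        | "CreateEvents" >> beam.Create(events)
        # Attach event-time timestamps so windowing has something to work with.
        | "AddTimestamps" >> beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
        # One-minute fixed windows over event time.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        # Sum the per-event counts within each key and window.
        | "CountViews" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```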
