How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data
This is a guest post by Eric Czech, Chief Architect at Next Big Sound, talks about some unique approaches taken to solving scalability challenges in music analytics.
Tracking online activity is hardly a new idea, but doing it for the entire music industry isn't easy. Half a billion music video streams, track downloads, and artist page likes occur each day and measuring all of this activity across platforms such as Spotify, iTunes, YouTube, Facebook, and more, poses some interesting scalability challenges. Next Big Sound collects this type of data from over a hundred sources, standardizes everything, and offers that information to record labels, band managers, and artists through a web-based analytics platform.
While many of our applications use open-source systems like Hadoop, HBase, Cassandra, Mongo, RabbitMQ, and MySQL, our usage is fairly standard, but there is one aspect of what we do that is pretty unique. We collect or receive information from 100+ sources and we struggled early on to find a way to deal with how data from those sources changed over time, and we ultimately decided that we needed a data storage solution that could represent those changes. Basically, we needed to be able to "version" or "branch" the data from those sources in much the same way that we use revision control (via Git) to control the code that creates it. We did this by adding a logical layer to a Cloudera distribution and after integrating that layer within Apache Pig, HBase, Hive, and HDFS, we now have a basic version control framework for large amounts of data in Hadoop clusters.
As a sort of "Moneyball for Music," Next Big Sound has grown from a single server LAMP site tracking plays on MySpace (it was cool when we started) for a handful of artists to building industry-wide popularity charts for Billboard and ingesting records of every song streamed on Spotify. The growth rate of data has been close to exponential and early adoption of distributed systems has been crucial in keeping up. With over 100 sources tracked coming from both public and proprietary providers, dealing with the heterogenous nature of music analytics has required some novel solutions that go beyond the features that come for free with modern distributed databases.
Next Big Sound has also transitioned between full cloud providers (Slicehost), hybrid providers (Rackspace), and colocation (Zcolo) all while running with a small engineering staff using nothing but open source systems. There was no shortage of lessons learned in this process and we hope others can take something away from our successes and failures.
How does Next Big Sound make beautiful data out of all that music? Step inside and see...