serendip.me is a social music service that helps people discover great music shared by their friends, and also introduces them to their “music soulmates” - people outside their immediate social circle that shares a similar taste in music.
Choosing the stack
One of the challenges of building serendip was the need to handle a large amount of data from day one, since a main feature of serendip is that it collects every piece of music being shared on Twitter from public music services. So when we approached the question of choosing the language and technologies to use, an important consideration was the ability to scale.
The JVM seemed the right basis for our system as for its proven performance and tooling. It's also the language of choice for a lot of open source system (like Elasticsearch) which enables using their native clients - a big plus.
When we looked at the JVM ecosystem, scala stood out as an interesting language option that allowed a modern approach to writing code, while keeping full interoperability with Java. Another argument in favour of scala was the akka actor framework which seemed to be a good fit for a stream processing infrastructure (and indeed it was!). The Play web framework was just starting to get some adoption and looked promising. Back when we started, at the very beginning of 2011, these were still kind of bleeding edge technologies. So of course we were very pleased that by the end of 2011 scala and akka consolidated to become Typesafe, with Play joining in shortly after.
MongoDB was chosen for its combination of developer friendliness, ease of use, feature set and possible scalability (using auto-sharding). We learned very soon that the way we wanted to use and query our data will require creating a lot of big indexes on MongoDB, which will cause us to be hitting performance and memory issues pretty fast. So we kept using MongoDB mainly as a key-value document store, also relying on its atomic increments for several features that required counters.
With this type of usage MongoDB turned out to be pretty solid. It is also rather easy to operate, but mainly because we managed to avoid using sharding and went with a single replica-set (the sharding architecture of MongoDB is pretty complex).
For querying our data we needed a system with full blown search capabilities. Out of the possible open source search solutions, Elasticsearch came as the most scalable and cloud oriented system. Its dynamic indexing schema and the many search and faceting possibilities it provides allowed us to build many features on top of it, making it a central component in our architecture.
We chose to manage both MongoDB and Elasticsearch ourselves and not use a hosted solution for two main reasons. First, we wanted full control over both systems. We did not want to depend on another element for software upgrades/downgrades. And second, the amount of data we process meant that a hosted solution was more expensive than managing it directly on EC2 ourselves.