Wordnik - 10 million API Requests a Day on MongoDB and Scala
Wordnik is an online dictionary and language resource that has both a website and an API component. Their goal is to show you as much information as possible, as fast as they can find it, for every word in English, and to give you a place where you can make your own opinions about words known. As cool as that is, what is really cool is the information they share on their blog about their experiences building a web service. They've written an excellent series of articles and presentations you may find useful:
- What has technology done for words lately?
- Eventual consistency. Using an eventually consistent model, they can do work in parallel: they count as many words as possible when they can, and add them all up when there's a lag. The count is always in the ballpark, and they never have to stop.
- Document-oriented storage. Dictionary entries are more naturally modeled as hierarchical documents, and using that model has made it quicker to find data and easier to develop against.
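The eventually consistent counting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Wordnik's implementation: the sample batches and the names `count_batch`, `partials`, and `total` are hypothetical, and threads stand in for whatever workers they actually run.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_batch(lines):
    """Count words in one batch using only local state, so no locks are needed."""
    return Counter(word.lower() for line in lines for word in line.split())

# Hypothetical sample batches standing in for incoming text.
batches = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
    ["quick quick fox"],
]

# Workers count their batches in parallel, independently of each other.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_batch, batches))

# Merging can happen later and in any order. Until every partial is
# folded in, readers see a total that lags behind but is in the ballpark.
total = Counter()
total.update(partials[0])
total.update(partials[1])
ballpark_quick = total["quick"]  # the last batch has not been merged yet

total.update(partials[2])
final_quick = total["quick"]     # all partial counts are now included

print(f"quick: ballpark={ballpark_quick}, final={final_quick}")
```

Because addition is commutative, the merged total is the same whatever order the partial counts arrive in, which is what lets the counting continue without ever stopping for a global lock.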
- 12 Months with MongoDB
- The primary driver for migrating to MongoDB was performance. MySQL didn't work for them.
- Mongo serves an average of 500k requests/hour. Peak traffic is 4x that.
- More than 12 billion documents in Mongo; storage is ~3TB per node
- Can easily sustain an insert speed of 8k documents/second, often bursting to 50k/second
- A single Java client can sustain 10MB/sec read over the backend (gigabit) network to one mongod. Four readers from the same client pull 40MB/sec over the same pipe
- Every type of retrieval has become significantly faster than the MySQL implementation:
- example fetch time reduced from 400ms to 60ms
- dictionary entries from 20ms to 1ms
- document metadata from 30ms to 0.1ms
- spelling suggestions from 10ms to 1.2ms
- Mongo's built-in caching allowed them to remove the memcached layer and speed up calls by 1-2ms.
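To put the figures from these notes in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above; the variable names are illustrative.

```python
# Request volume quoted in the notes above.
avg_requests_per_hour = 500_000
peak_multiplier = 4

avg_per_sec = avg_requests_per_hour / 3600
peak_per_sec = avg_per_sec * peak_multiplier
per_day = avg_requests_per_hour * 24

print(f"average: {avg_per_sec:.0f} requests/sec")
print(f"peak:    {peak_per_sec:.0f} requests/sec")
print(f"per day: {per_day:,} requests at the average rate")

# (before_ms, after_ms) for each retrieval type: MySQL, then MongoDB.
latencies_ms = {
    "example fetch": (400, 60),
    "dictionary entries": (20, 1),
    "document metadata": (30, 0.1),
    "spelling suggestions": (10, 1.2),
}

# Speedup factor is simply the old latency divided by the new one.
speedups = {name: before / after for name, (before, after) in latencies_ms.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

That works out to roughly 139 Mongo requests per second on average and about 556 at peak, with per-query speedups ranging from about 6.7x for example fetches to about 300x for document metadata.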
- From MySQL to MongoDB - Migrating to a Live Application by Tony Tam
- An explanation of their experiences moving from MySQL to MongoDB.
- Wordnik stores a corpus of words, hierarchical data, and user data. The MySQL design was far more complex and required a complex caching layer to perform well. With MongoDB the system is 20x faster. There are no joins and no need for a caching layer, and the whole system is simpler.
- Wordnik is primarily a read-only system and performance is limited mainly by disk speed.
- They use dual quad-core 2.4GHz Intel CPUs with 72GB RAM. The servers are physical machines running in master-slave mode and use 5.3TB LUNs on the DAS. They found virtual servers didn't have the IO performance they needed.
- Keeping the Lights On with MongoDB by Tony Tam
- A presentation on how they use and manage MongoDB.
- Wordnik API
- They've rewritten their REST API in Scala. Scala has helped them remove a lot of code and standardize "traits" throughout the API.
- MongoDB Admin Tools
- Wordnik has built some tools to manage large deployments of MongoDB and has open sourced them.
- Wordnik Bypasses Processing Bottleneck with Hadoop
- They add 8,000 words per second to their corpus of words.
- Map-reduce jobs are run on Hadoop to take load off MongoDB and prevent any blocking queries. Data is append-only, so there's no reason to hit MongoDB.
- Incremental updates to their data are stored in flat files, which are periodically imported into HDFS.
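The flat-file pipeline described above can be illustrated with a small in-process sketch. It is not Hadoop code and not Wordnik's actual job: the `date<TAB>word` record format, the file names, the sample records, and the functions `import_new_files`, `map_phase`, and `reduce_phase` are all assumptions made for the example. The point is that append-only files can be imported periodically and aggregated without touching the live database.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def import_new_files(incoming_dir, imported):
    """Return lines from flat files not seen before and mark them as imported."""
    lines = []
    for path in sorted(Path(incoming_dir).glob("*.tsv")):
        if path.name in imported:
            continue  # append-only data: old files never change
        lines.extend(path.read_text().splitlines())
        imported.add(path.name)
    return lines

def map_phase(lines):
    """Emit ((date, word), 1) for each 'date<TAB>word' record."""
    for line in lines:
        date, word = line.split("\t")
        yield (date, word), 1

def reduce_phase(pairs, totals):
    """Sum the emitted counts per (date, word) key into the running totals."""
    for key, count in pairs:
        totals[key] += count
    return totals

imported, totals = set(), defaultdict(int)
with tempfile.TemporaryDirectory() as incoming:
    # First periodic import: one flat file with made-up records.
    Path(incoming, "batch-001.tsv").write_text(
        "2011-02-01\tzeitgeist\n2011-02-01\tzeitgeist\n")
    reduce_phase(map_phase(import_new_files(incoming, imported)), totals)

    # Second import: only the newly written file is read.
    Path(incoming, "batch-002.tsv").write_text(
        "2011-02-02\tzeitgeist\n2011-02-01\tschadenfreude\n")
    reduce_phase(map_phase(import_new_files(incoming, imported)), totals)

    # Third import finds nothing new, so the totals are unchanged.
    reduce_phase(map_phase(import_new_files(incoming, imported)), totals)

for (date, word), count in sorted(totals.items()):
    print(date, word, count)
```

Keying the reduce step on `(date, word)` gives a simple time-based aggregation, and tracking imported file names keeps each run incremental.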
Overall impressions:
- Wordnik had a very specific problem to solve and set out to find the best tool that would help them solve that problem. They were willing to code around any faults they found and figure out how to make MongoDB work best for them. Performance requirements drove everything.
- After performance, the naturalness of the document data model seemed to be the biggest win for them. The ability to easily model complex data hierarchically and have that perform well reverberated across the system.
- Code is now faster, more flexible, and dramatically smaller.
- They have settled on specialized tools for the job. MongoDB is responsible for document storage for runtime data. Hadoop is responsible for analytics, data processing, reporting, and time-based aggregation.
Definitely an experience report worth keeping an eye on.