Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Ever wonder what powers Google's world spirit sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream processing (think streaming MapReduce on steroids) system called MillWheel, described in this very well written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google, it's also used for streaming joins for a variety of Ads customers, generalized anomaly-detection service, and network switch and cluster health monitoring.
Abstract:
This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
Related Articles
- On Hacker News
- Prismatic Architecture - Using Machine Learning On Social Networks To Figure Out What You Should Read On The Web
- DataSift Architecture: Realtime Datamining At 120,000 Tweets Per Second
- When All The Program's A Graph - Prismatic's Plumbing Library
- Twitter: Streaming MapReduce with Summingbird
- Apache: What is Samza?: a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
- Storm: a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.