« Stuff The Internet Says On Scalability For June 21, 2013 | Main | Scaling Mailbox - From 0 to One Million Users in 6 Weeks and 100 Million Messages Per Day »

Paper: MegaPipe: A New Programming Interface for Scalable Network I/O

The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster you need a better car design, not just a better driver. So that's why the authors started with a clean-slate and designed a network API from the ground up with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, etc.  What they created is MegaPipe, "a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API."

The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

What's this most excellent and interesting paper about?

Message-oriented network workloads, where connections are short and/or message sizes are small, are CPU intensive and scale poorly on multi-core systems with the BSD Socket API. We present MegaPipe, a new API for efficient, scalable network I/O for message-oriented workloads. The design of MegaPipe centers around the abstraction of a channel a per-core, bidirectional pipe between the kernel and user space, used to exchange both I/O requests and event notifications. On top of the channel abstraction, we introduce three key concepts of MegaPipe: partitioning, lightweight socket (lwsocket), and batching.


We implement MegaPipe in Linux and adapt memcached and nginx. Our results show that, by embracing a clean-slate design approach, MegaPipe is able to exploit new opportunities for improved performance and ease of programmability. In microbenchmarks on an 8-core server with 64 B messages, MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

Performance with Small Messages:

Small messages result in greater relative network I/O overhead in comparison to larger messages. In fact, the per-message overhead remains roughly constant and thus, independent of message size; in comparison with a 64 B message, a 1 KiB message adds only about 2% overhead due to the copying between user and kernel on our system, despite the large size difference.

Partitioned listening sockets:

Instead of a single listening socket shared across cores, MegaPipe allows applications to clone a listening socket and partition its associated queue across cores. Such partitioning improves performance with multiple cores while giving applications control over their use of parallelism.

Lightweight sockets:

Sockets are represented by file descriptors and hence inherit some unnecessary filerelated overheads. MegaPipe instead introduces lwsocket, a lightweight socket abstraction that is not wrapped in filerelated data structures and thus is free from system-wide synchronization.

System Call Batching:

MegaPipe amortizes system call overheads by batching asynchronous I/O requests and completion notifications within a channel.

Related Articles

Reader Comments (3)

This sounds pretty great. So, I guess what I'm not seeing are (1) what are the downsides and (2) why isn't this already the way we do things?

June 19, 2013 | Unregistered CommenterG Gordon Worley III

I only read the slides, but the downside might be increased latency. The benefits they advertise are increased throughput (proven with benchmarks) and a more intuitive API (subjective, but I am inclined to agree). The throughput increase is achieved by reducing the number of system calls per transaction from 1 to 1/(number of operations in a batch). For use cases which are not sensitive to a moderate latency increase such as streaming video, there is no disadvantage.

The reason this isn't the way we do things already is that it is a non-obvious innovation.

June 20, 2013 | Unregistered CommenterRobert de Forest

very nice talk, however, I wonder why it's just a talk, and not source code. Not even mentioned anywhere, nor in the google search results. I would have loved to look at it, and see if I could integrated it into my own projects and see how performance scales then.

OTOH, I am pretty sure the Linux kernel devs would have implemented it already if it would be really better? no?

June 30, 2013 | Unregistered CommenterChristian Parpart

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>