A Scalable Alternative to RESTful Communication: Mimicking Google’s Search Autocomplete with a Single MigratoryData Server

This is a guest post by Mihai Rotaru, CTO of MigratoryData.

The RESTful HTTP request-response approach can become very inefficient for websites that require real-time communication. We propose an alternative and illustrate it with a well-known real-time feature found on most websites: search box autocomplete.

Google, one of the most demanding web search environments, handles about 40,000 searches per second, according to an estimate by Internet Live Stats. Assuming six autocomplete requests per search, we show that MigratoryData can handle this load using a single 1U server.

More precisely, we show that a single MigratoryData server running on a 1U machine can handle 240,000 autocomplete requests per second from 1 million concurrent users with a mean round-trip latency of 11.82 milliseconds.

The Current Approach and Its Limitations

Autocomplete provides users with search suggestions as they type their queries into the search box. The current approach is based on the HTTP request-response model: for every single character typed by the user, an HTTP request is sent to a web server and the search suggestions are returned in the HTTP response. This approach has two limitations: bandwidth overhead and latency.

On the one hand, for each autocomplete request, which contains only a few bytes of useful payload (i.e. the characters typed into the search box by the user), the browser automatically adds hundreds of bytes of HTTP headers. For websites with significant search activity, this overhead translates into substantial wasted bandwidth, as well as wasted CPU cycles spent processing these unnecessary HTTP headers.

On the other hand, for each HTTP request, a new TCP connection is established between the user and the web server, possibly preceded by a TLS/SSL handshake. Paying this cost for each character a user types adds latency (i.e. the time from when the user types a character until the search suggestions arrive). The workaround for this limitation is to use HTTP keep-alive connections, which allow multiple HTTP requests to be sent over the same TCP connection within a timeout window. Even in this case, however, once the timeout expires, a new connection must be established.

A New Approach

The WebSocket protocol is an alternative that overcomes the limitations of the RESTful HTTP approach discussed above. Its framing overhead is only a few bytes per message; compared with the hundreds of bytes of HTTP headers, the improvement is substantial. Moreover, WebSocket connections are persistent by design, which creates the premise for low-latency communication, as no periodic reconnection is required.
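
To make the contrast concrete, here is a minimal sketch of an autocomplete client that keeps a single persistent WebSocket connection open, written against Java's built-in WebSocket API (java.net.http, available since Java 11). The endpoint URL and message format are illustrative assumptions, not part of the benchmark described below:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class AutocompleteClient {
    public static void main(String[] args) throws Exception {
        // One persistent WebSocket connection replaces a possible
        // TCP/TLS handshake per typed character. The URL is hypothetical.
        WebSocket ws = HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(URI.create("wss://example.com/autocomplete"),
                new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket webSocket,
                            CharSequence data, boolean last) {
                        // Each reply frame carries the suggestions for the
                        // latest prefix, with only a few bytes of WebSocket
                        // framing overhead added by the protocol.
                        System.out.println("suggestions: " + data);
                        return WebSocket.Listener.super.onText(webSocket, data, last);
                    }
                })
            .join();

        // Simulate a user typing "goo": one tiny frame per keystroke,
        // all sent over the same connection.
        for (String prefix : new String[] {"g", "go", "goo"}) {
            ws.sendText(prefix, true);
        }
        Thread.sleep(2000); // wait briefly for the suggestion frames
    }
}
```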

Many WebSocket server implementations exist. However, while any of them will normally reduce bandwidth compared with the RESTful HTTP approach, not all of them provide the same level of low latency and scalability.

Note – The WebSocket protocol by itself does not guarantee better scalability or lower latency than web servers - it only creates the premise for them. The degree to which better scalability and lower latency are achieved depends on each WebSocket server implementation.

MigratoryData Server is one such WebSocket server implementation. It is known as the first server implementation to address the C10M problem: 10 million concurrent users on a single server.

MigratoryData provides a common API with libraries for the most popular programming environments, including web applications. It exposes a subject-based publish/subscribe communication paradigm. Built on this pub/sub model, it also exposes an asynchronous request/response interaction, which works as follows (a code sketch follows the list):

A producer subscribes to a subject X

A consumer sends a request message with the subject X to which it attaches a reply subject Y (the consumer automatically subscribes to the subject Y if it is not already subscribed)

Once the producer receives the request message, it extracts the reply subject Y and sends back a reply message with the subject Y
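
A minimal sketch of this request/response pattern, using a hypothetical pub/sub client interface (the types and method names below are invented for illustration and are not MigratoryData's actual API):

```java
// Hypothetical pub/sub client interface, used only for illustration;
// these names do not reflect MigratoryData's actual API.
interface PubSubClient {
    void subscribe(String subject, MessageHandler handler);
    void publish(String subject, String replySubject, byte[] payload);
}

interface MessageHandler {
    void onMessage(String subject, String replySubject, byte[] payload);
}

class RequestReplyExample {
    static void run(PubSubClient producer, PubSubClient consumer) {
        // 1. The producer subscribes to subject X.
        producer.subscribe("X", (subject, replySubject, payload) -> {
            // 3. On receiving a request, the producer extracts the reply
            //    subject Y and publishes the reply there.
            producer.publish(replySubject, null, "reply".getBytes());
        });

        // 2. The consumer subscribes to its reply subject Y, then sends
        //    a request on X carrying Y as the reply subject.
        consumer.subscribe("Y", (subject, replySubject, payload) ->
            System.out.println("got reply: " + new String(payload)));
        consumer.publish("X", "Y", "request".getBytes());
    }
}
```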

In the following sections we show that this request/response interaction over WebSockets can be used as a scalable alternative to the RESTful HTTP approach.

Benchmark Setup

We used four identical machines, each with 2 x Intel Xeon E5-2670 CPUs @ 2.60 GHz and 64 GB RAM, as follows:

Machine A ran one instance of MigratoryData Server 5.0.20

Machine B and Machine C each ran one instance of the Requestor tool, each opening 500,000 concurrent WebSocket connections; the autocomplete requests were sent over these connections

Machine D ran 16 instances of the Provider tool, used to reply with search suggestions based on the autocomplete requests.

All four machines ran CentOS Linux 7.2 with the default kernel 3.10.0-327.28.3.el7.x86_64 with no kernel tuning.

In order to send an autocomplete request on behalf of a user N (of the one million users), the Requestor tool randomly selects one of the subjects subscribed to by the sixteen Providers, say /s/M. In addition, the Requestor tool subscribes the user N to the subject /c/N (if not already subscribed) and publishes a request message to the MigratoryData server with the following attributes:

Subject: /s/M

Reply Subject: /c/N

Payload: a 32-byte random string representing the search query

The Provider M, which is subscribed to the subject /s/M, will receive the message above and will respond by publishing a reply message to the MigratoryData Server with the following attributes:

Subject: /c/N

Payload: a 256-byte random string representing the search suggestions

Because the user N is subscribed to the subject /c/N, it will receive the reply message above. The round-trip latency is computed as the time difference from when the request message was created until the reply message was received by the user.

Note – The round-trip latency of a request-reply communication includes the time it takes for the request message to travel from Requestor to MigratoryData Server, and then to Provider, plus the time it takes for the reply message to travel from Provider to MigratoryData Server, and finally to Requestor.
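
Below is a sketch of the Requestor side of this interaction, reusing the hypothetical PubSubClient interface from the earlier sketch. The subject names follow the benchmark setup; the timing and bookkeeping logic are illustrative:

```java
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Requestor-side sketch, reusing the hypothetical PubSubClient interface
// from the earlier example. Subject names follow the benchmark setup.
class RequestorSketch {
    static final int PROVIDERS = 16;
    static final Random random = new Random();
    // Request creation times, keyed by user id, for latency measurement.
    static final ConcurrentHashMap<Integer, Long> sentAt = new ConcurrentHashMap<>();

    static void sendRequest(PubSubClient client, int userN) {
        // Subscribe user N to its reply subject /c/N (a real client would
        // do this only if not already subscribed).
        client.subscribe("/c/" + userN, (subject, replySubject, payload) -> {
            // Round-trip latency: from request creation to reply receipt.
            long latencyMs = System.currentTimeMillis() - sentAt.remove(userN);
            System.out.println("latency: " + latencyMs + " ms");
        });

        // Pick a random Provider subject /s/M and publish a 32-byte query,
        // attaching /c/N as the reply subject.
        int m = random.nextInt(PROVIDERS);
        byte[] query = new byte[32]; // stands in for the random search query
        sentAt.put(userN, System.currentTimeMillis());
        client.publish("/s/" + m, "/c/" + userN, query);
    }
}
```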

Finally, it is worth noting that, in the setup above, requests are balanced among the Provider instances - which represent the search services. This architecture allows the search services (including their search caches) to be scaled horizontally, mimicking the RESTful HTTP approach, where requests are also balanced among multiple search services.

Summary of Results

Every second, the two Requestor instances made 240,000 autocomplete requests for 240,000 users randomly selected from the one million concurrent users and received search suggestions with a mean round-trip latency of 11.82 milliseconds, a 95th percentile latency of 20 milliseconds, and a 99th percentile latency of 130 milliseconds (computed over more than 4 billion requests).

Number of Concurrent WebSocket Connections: 1,000,016
Number of Subscribed Subjects: 1,000,016
Number of Requests Per Second: 240,000 request messages per second
Total Message Throughput (from and to Requestors & Providers): 960,000 messages per second
Mean Latency: 11.82 milliseconds
Standard Deviation of Latency: 26.28 milliseconds
95th Percentile Latency: 20 milliseconds
99th Percentile Latency: 130 milliseconds
Max Latency: 1783 milliseconds
Total Number of Requests: 4,084,890,291
Hardware: One 1U server with 2 x Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM, and an Intel X520-DA1 10 GbE network adapter
Operating System: CentOS Linux 7.2 with default kernel 3.10.0-327.28.3.el7.x86_64 (no kernel tuning)
Java Runtime Environment: Oracle 1.8.0_40-b25
Incoming Network Utilization (from both Providers & Requestors): 1.06 Gigabits per second
Outgoing Network Utilization (to both Providers & Requestors): 1.17 Gigabits per second
CPU Utilization: 65%

Results

MigratoryData Server provides monitoring via JMX and other protocols. We used the jconsole tool (part of the Java Development Kit) for JMX monitoring. The screenshots in the results below were taken during this JMX monitoring.
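
The same attributes can also be read programmatically over JMX with the standard javax.management API. In the sketch below, the JMX service URL and the MBean object name are assumptions made for illustration; check the server's configuration and documentation for the actual values:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPoller {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX service URL; use the host/port configured
        // for the MigratoryData server's JMX monitoring.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:3000/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // The object name below is an illustrative assumption;
            // check the actual MBean name exposed by the server.
            ObjectName name = new ObjectName("com.migratorydata:type=Statistics");
            Object sessions = conn.getAttribute(name, "ConnectedSessions");
            System.out.println("ConnectedSessions = " + sessions);
        }
    }
}
```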

Connections and Messages

As depicted in the Benchmark Setup diagram, two Requestor instances opened 1,000,000 concurrent WebSocket connections to the MigratoryData server, simulating one million users. Each of the one million users subscribed to a distinct subject, which was then used to receive replies with search suggestions. In addition, 16 Provider instances opened 16 connections to the MigratoryData server, simulating search suggestion services. Each of the 16 services subscribed to a distinct subject, which was then used to receive the autocomplete requests. This total of 1,000,016 concurrent connections is confirmed by the JMX attribute ConnectedSessions in the screenshot below.

In this benchmark setup, users sent 240,000 request messages per second. Therefore, the total number of incoming messages per second to the MigratoryData server is the sum of the 240,000 request messages per second from the Requestors and the 240,000 reply messages per second from the Providers.

Similarly, the number of outgoing messages per second from the MigratoryData server is the sum of the 240,000 reply messages per second delivered to the Requestors and the 240,000 request messages per second delivered to the Providers.

These total numbers (480,000 outgoing messages per second and 480,000 incoming messages per second) are corroborated by the JMX attributes OutPublishedMessagesPerSecond and InPublishedMessagesPerSecond in the screenshot below.

Therefore, the total throughput handled by the MigratoryData server, counting both incoming and outgoing messages, is close to 1 million messages per second.

Finally, it’s also worth noting that the benchmark test ran over a period of almost 5 hours. At a rate of 240,000 requests per second, MigratoryData handled more than 4 billion total requests!

CPU and Memory Utilization

As indicated by the screenshot, CPU usage stayed under 70% during the benchmark, and the maximum memory allocated to the JVM was 30 GB. Over the almost 5-hour run, both memory and CPU show a predictable usage pattern.

Latency

As defined in the Benchmark Setup section, the round-trip latency is a sum of the time needed for a request to travel from a Requestor to MigratoryData server, and then to a Provider, plus the time needed for the reply message to travel from the Provider to MigratoryData server, and finally to the Requestor.

For this benchmark, we computed the round-trip latency of every request/reply interaction - more than 4 billion latency values in total. From these values, the mean, standard deviation, and maximum latency were computed incrementally, updated with each new request/reply interaction (a sketch of one such incremental method follows the list below). The results are summarized as follows:

Mean Latency: 11.82 milliseconds

Standard Deviation Latency: 26.28 milliseconds

Maximum Latency: 1783 milliseconds
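
The benchmark's exact implementation is not shown here, but a standard way to maintain the mean and standard deviation incrementally, in constant memory, is Welford's online algorithm. A minimal sketch:

```java
// Welford's online algorithm: updates mean and variance one sample at a
// time, in O(1) memory, which suits billions of latency values.
class OnlineStats {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // sum of squared deviations from the mean
    private double max = Double.NEGATIVE_INFINITY;

    void add(double latencyMs) {
        count++;
        double delta = latencyMs - mean;
        mean += delta / count;
        m2 += delta * (latencyMs - mean);
        if (latencyMs > max) max = latencyMs;
    }

    double mean() { return mean; }

    // Population variance; with billions of samples the distinction
    // between dividing by count and count - 1 is negligible.
    double stdDev() { return count > 1 ? Math.sqrt(m2 / count) : 0.0; }

    double max() { return max; }
}
```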

In addition, we used the HdrHistogram library to compute the percentiles of latency. In the chart below you can see both the number of requests (in millions) and the round-trip latency (in milliseconds) by percentile distribution.

For example, in the chart above we can see that the 95th percentile is 20 milliseconds and the 99th percentile is 130 milliseconds. Thus, for 3.8 billion requests out of the total of 4 billion, the round-trip latency is under 20 milliseconds, and for 3.96 billion requests out of the total of 4 billion, the round-trip latency is under 130 milliseconds.
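
For reference, recording latency values and extracting percentiles with HdrHistogram looks roughly like this; the value range, precision, and sample values below are illustrative choices, not the benchmark's actual settings:

```java
import org.HdrHistogram.Histogram;

public class LatencyPercentiles {
    public static void main(String[] args) {
        // Track latencies from 1 ms up to 1 hour with 3 significant digits;
        // these bounds are illustrative assumptions.
        Histogram histogram = new Histogram(3_600_000L, 3);

        // In the benchmark, one value per request/reply interaction would
        // be recorded here; we record a few sample values instead.
        for (long latencyMs : new long[] {9, 12, 15, 20, 130}) {
            histogram.recordValue(latencyMs);
        }

        System.out.println("95th percentile: "
            + histogram.getValueAtPercentile(95.0) + " ms");
        System.out.println("99th percentile: "
            + histogram.getValueAtPercentile(99.0) + " ms");
        System.out.println("max: " + histogram.getMaxValue() + " ms");
    }
}
```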

Note – Further optimization can be done to reduce the 99th percentile and higher latencies. These values are typically impacted by the JVM's garbage collection. In a previous benchmark for another scenario, we showed that by using the Zing JVM from Azul Systems, which is optimized for garbage collection, we were able to reduce the 99th percentile latency from 585 milliseconds to 25 milliseconds, and the maximum latency from 1700 milliseconds to 126 milliseconds.

Conclusion

In this article we proposed a new communication architecture for websites with a large number of users, a high frequency of requests, and/or a need for low-latency communication, and illustrated it with the particular use case of search box autocomplete.

We showed that a scalable WebSocket server providing an easy-to-use programming model, such as publish/subscribe, represents a good alternative to the current RESTful HTTP architecture, both in terms of latency and bandwidth, while keeping the programming complexity comparable.