Example

How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct.

Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text file stored on each machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential.

Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single machine MySQL version. Inserts quickly became the bottleneck as the huge torrents of data flooding caused a lot of index churn. Perdiodic bulk loading was the remedy to this problem, but the shear size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data this solution broke down with a combination of load and operational problems.

Facing exponential growth they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time until and a more scalable solution would need to be created in the future anyway. The future came a little early this year.

The advantage of their new system is that they can now look at their data in anyway they want:

Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.

When they wanted to find out which part of the the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.

This switch has changed how they run their business. Stu Hood nicely sums up the impact: "Now whenever we think of complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff."

In the rest of this post Bill describes the evolution of their system and the forces that caused them to move from a relational database solution to a MapReduce system.

Before getting started, I'd really like to thank Bill Boebel for spending so much time and effort in creating this very valuable experience report.

Information Sources

MapReduce at Rackspace

A document sent to me by Bill Boebel, CTO of Mailtrust (Rackspace's mail division). This post is a little different than normal because most all the content past this point is by Bill, I've just organized it a little differently.

The Platform

Hadoop

Hadoop Distributed File System (HDFS)

Lucene

Solr

Tomcat

The Stats

Rackspace has more than 50K devices and 7 data centers.

The mail system and logging servers are currently in 3 of the Rackspace data centers.

The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.

Several hundred gigabytes of email log data is generated each day.

Background on Mailtrust

Email hosting company

Founded in 1999, merged with Rackspace in 2007, previous name: Webmail.us

80K business customers, 700K mailboxes.

2 hosted mail products: Noteworthy, MS Exchange

The Noteworthy System:
* Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync.
* ~600 servers, commodity hardware, designed to work around frequent failures.

The MS Exchange System:
* MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
* ~100 servers, higher-end hardware, SAN & DAS storage.

The Architecture

The way the current Hadoop based system works is:

Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System (”HDFS”) in real time.

MapReduce jobs are scheduled run to index the new data using Apache Lucene and Solr.

Once the indexes have been built, they are compressed and stored away in HDFS.

Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.

The System Evolution

The Problem

Mailtrust is a very customer service focused company. It is extremely important for our support techs to be able to examine mail logs in order to troubleshoot problems for our customers. Our support techs need to search the logs hundreds of times per day, so the tools that provide this functionality must be fast and accurate. With over 600 mail servers, and hundreds of gigabytes of raw log data produced each day, this can be tricky to manage. Here is a brief history of the Mailtrust logging architecture, problems we faced, how we over came them, and what the system looks like today...

Logging v1.0

Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days. Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog.

Problems: Once we grew much past a dozen servers, this manual process of logging into each server become too time consuming for our engineers.

Logging v1.1

Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server. An engineer could tell the script what type of mail server to search (inbound smtp, outbound smtp, backend mailbox). The script would look at /etc/hosts for a list of servers of that type, and then iterate through each server, ssh in, perform the grep and then output the results. The script could also search in the past via "gunzip -c /var/log/maillog.* | grep"

Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.

Logging v2.0

We released a log search tool that the support techs could use directly, without involving the engineers. The support team was given a web-based tool where they could search the logs. It allowed searching by the sender or recipient's email address, domain name or IP address. All of these were indexed fields in a MySQL database. Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the data set was very large and these queries would be horribly slow. Each day's logs were stored in a separate table, so that we could cleanup old data by simply dropping and recreating MySQL tables. This made cleanup really fast compared to running a conditional DELETE command on a large table. Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size.

To get the logs into the database, each mail server initially wrote its log data to a local 16MB tempfs partition. Logrotate was called via cron every 60 seconds to rotate the temporary log file and then preprocess the data before sending it on to the centralized log server. This preprocessing step reduced the volume of data that had to be transmitted over the network to the log server, and this also distributed the processing workload to avoid creating bottleneck on the log server. After the data was processed locally, the script would send comma delimited log data back to syslog-ng on the local server, and syslog-ng would then send it over the network to the centralized log server. The log server was configured to receive data on 6 different ports, one for each type of log data... inbound smtp, outbound smtp, backend smtp, spam/virus filtering, POP3 and IMAP. As log data was received, the records were inserted one by one into the database via MySQL INSERT commands.

Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.

Logging v2.1

Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database. As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files. Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database. This was magnitudes faster than inserting the log data one record at a time.

Problems: The LOADs would get progressively slower as the database grew because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.

Logging v2.2

Introduced Merge Tables in order to speed up loading the log data into the database. With this version, every 10 minutes our script would create a new database table and then load the text logs into the empty table. This made the LOAD command extremely fast because there were no existing database indexes that could negatively affect performance. After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together. The web search tool was modified to allow searching within the following time ranges: all day, past 12-hours, past 6-hours, past 2-hours. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created.

Problems: This version of the logging system worked reliably for about one year. But we began having problems with it as our support team, customer base and server count grew. When we reached about 100 servers the database LOAD operations would take 2-3 minutes to run, which was acceptable, but the server was now always under a heavy cpu and disk IO load. Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy.

Also, there were several occasions where our engineers performed a software upgrade to a particular application, which changed log format in such a way that broke the preprocessing script. Since our raw logs were deleted from the local mail servers every 60 seconds, we'd have no way to recover the missing logs when this occurred. Additionally, the log search tool was becoming ever more critical to our support team's daily operations; however, the logging system had no redundancy. There was no RAID, no backups, no failover system. We also do not have a good plan for scaling the log system beyond a single monolithic server. Incrementally patching problems and tweaking performance with the log system was taking up a lot of time and we needed something better. We needed a new solution that would be fast, reliable and could scale indefinitely with our growth. We needed something truly scalable.

Logging v3.0

While designing v3.0, we looked at several commercial log processing applications. Splunk stood out, and did just about everything we wanted; however, we worried that using a vendor product like this might limit our abilities to build new features down the road. For example, we wanted to build a tool that would allow our customers to search their logs directly. We had been keeping an eye on the Apache Hadoop project since its inception, and were extremely impressed with its progress and direction. Hadoop is an open-source implementation of Google File System and MapReduce... a system that is designed specifically for large scale distributed data processing. It scales out it's workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers. Other companies were already using it for their own log processing. So chose to go with Hadoop. In about 3 months we build a fresh new log processing system using Hadoop, Lucene and Solr. The system is described here: http://blog.racklabs.com/?p=66

We believe this new system will be able to scale with us as our company grows. And there is a lot of momentum behind the Hadoop project, which gives us a lot of confidence that its scalability will continue to improve. Yahoo is one of the major contributors to the project and has built Hadoop clusters that contain thousands of servers, and they are aggressively working to get Hadoop to support tens of thousands of servers.

Problems: To date, the only problems we have found have been our own bugs; and we fix those as we find them.

We are actively running v3.0 today, but we're not going to stop here. We have a lot of plans for new features...

The Future

Version 3.1 is being coded currently. It includes new MapReduce jobs that support Microsoft Exchange log processing. (currently we only process Noteworthy logs with this system). We plan to go live in March.

In version 4.0 we plan to put the log search tool in the hands of our customers so that they can have the same troubleshooting power that our support team has. This will most likely require reorganizing the way we store log index shards so that they are grouped by user, rather than letting Solr randomly group them. Our resellers seem to be excited about this, because it should allow them to better support their customers. Who knows what we'll build after v4.0...

Google Architecture

Database People Hating on MapReduce

Product: Hadoop

Running Hadoop MapReduce on Amazon EC2 and Amazon S3

Solr