Looking for good business examples of compaines using Hadoop

I have read the blog about Mailtrust/Rackspace as well the interesting things with Google and Yahoo. Who else is using Hadoop/MapReduce to solve business problems. TIA

Click to read more ...


SLA monitoring

Hi, We're running a enterprise SaaS solution that currently holds about 700 customers with up to 50.000 users per customer (growing quickly). Our customers have SLA agreements with us that contains guaranteed uptimes, response times and other performance counters. With an increasing number of customers and traffic we find it difficult to provide our customer with actual SLA data. We could set up external probes that monitors certain parts of the application, but this is time consuming with 700 customers (we do it today for our biggest clients). We can also extract data from web logs but they are now approaching about 30-40 GB a day. What we really need is monitoring software that not only focuses on the internal performance counters but also lets us see the application from the customers viewpoint and allows us to aggregate data in different ways. Would the best approach be to develop a custom solution (for instance a distributed app that aggregates data from different logs every night and store them in a data warehouse) or are there products out there that are suitable for a high scalability environment? Any input is greatly appreciated!

Click to read more ...


Handling of Session for a site running from more than 1 data center

If using a DB to store session(used by some app server, ex.. websphere), how would an enterprise class site that is housed in 2 different data centers(that are live/live) maintain the session between both data centers. The problem as I see it is that since each data center has their own session database, if I was to flip the users to only access Data Center 1(by changing the DNS records for the site or some other Load balancing technique) then that would cause all previous Data Center 2 users to lose their session. What would be some pure hardware based solutions to this that are being used now? That way the applications supporting the web site can be abstracted from this. As I see now, a solution is to possibly have the session databases in both centers some how replicate the data to each other. I just don't see the best way to even accomplish this you are not even guraunteed that the session ID's will be unique since it's 2 different Application Server tiers(again websphere). Not to mention if the 2 data centers are some distance apart this could be difficult to accomplish as well.

Click to read more ...


IPS/IDS for heavy content site

All, My site would have heavy content (video/pictures). I'm looking for an efficient IPS/IDS solution which would not introduce much of latency. I'm more familiar with Cisco ASA and also familiar with Juniper, Foundry and others. I also came across snort but haven't used it before. I'm more of looking for an appliance (for the ease of configuration,support etc...) Could any one share their thoughts on performane of IPS/IDS from this vendors? Thanks! Janakan Rajendran

Click to read more ...


Streaming Video on Amazon EC2?

An Amazon EC2 Flash Video Streaming solution has been announced by Wowza Media. What do you think about the future of similar solutions? Is Amazon EC2 and S3 ready for video streaming? I have found threads on their forums related to the performance, scalability and high availability of the hosted streaming solution. How would you make it scalable? Is it really cheaper than traditional hosting? Looking forward to your thoughts!

Click to read more ...


Product: Collectl - Performance Data Collector

From their website: There are a number of times in which you find yourself needing performance data. These can include benchmarking, monitoring a system's general heath or trying to determine what your system was doing at some time in the past. Sometimes you just want to know what the system is doing right now. Depending on what you're doing, you often end up using different tools, each designed to for that specific situation. Features include:

  • You are be able to run with non-integral sampling intervals.
  • Collectl uses very little CPU. In fact it has been measured to use <0.1% when run as a daemon using the default sampling interval of 60 seconds for process and slab data and 10 seconds for everything else.
  • Brief, verbose, and plot formats are supported.
  • You can report aggregated performance numbers on many devices such as CPUs, Disks, interconnects such as Infiniband or Quadrics, Networks or even Lustre file systems.
  • Collectl will align its sampling on integral second boundaries.
  • Supports process and slab monitoring.
  • New to the 2.4.0 release is the monitoring of process i/o statistics. Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp. The following is an example of simply running the collectl command with no arguments and using its default settings. Below we see what the cpu, network and disk were doing while writing a large file: #<--------CPU--------><-----------Disks-----------><-----------Network----------> #cpu sys inter ctxsw KBRead Reads KBWrit Writes netKBi pkt-in netKBo pkt-out 37 37 382 188 0 0 27144 254 45 68 3 21 25 25 366 180 20 4 31280 296 0 1 0 0 25 25 368 183 0 0 31720 275 2 20 0 1 Output can also be saved in a rolling set of logs for later playback or displayed interactively in a variety of formats. If all that isn't enough there are additional mechanisms for supplying data to external tools via a socket interface or by generating its output as s-expressions, a format of choice for some tools such as supermon. You can even create files in space-separated formats for plotting with external packages like the one below which was done with gnuplot using 1 second samples.

    Click to read more ...

  • Sunday

    Ideas on how to scale a shared inventory database???

    We have a database today that holds all of our shared inventory. How do we scale out ? We run into concurrency issues today as mutliple users may want to access the same inventory,etc. Im sure its a common problem.. So how do folks implement this while also having faster response to available inventory and also ensuring no downtime Thanks

    Click to read more ...


    The case against ORM Frameworks in High Scalability Architectures

    Let me begin by saying that I have used and continue to use various ORM frameworks such as hibernate, ibatis, propel and activerecord in applications and websites that have a user base ranging from a couple hundred to 500k users. Especially for projects that have to be up and running in a short duration of time, ORM frameworks significantly reduce the effort required to manipulate and persist OOP objects by providing time saving facilities such as automatically generated model objects, integrated unit testing, secure variable substitution, etc. Hibernate even supports horizontal data partitioning via Hibernate Shards. However, the lay of the land is significantly different in the rarefied space occupied by applications needing to support millions of users. Profiling an application at this level and paying particular attention to the operations needed to move data to and from the database, it becomes evident that a significant portion of the operations are API related, whereby the ORM framework is traversing the abstraction layer built between the application logic and the native methods that ultimately interact with the database. I see a couple of problems with this level of abstraction and for the purpose of this discussion, I will purposely ignore caching for the sake of keeping the scope succinct. 1. The process of optimizing database queries is as much an art as it is a science and I am yet to see an ORM framework that does this well. In the case of mysql, optimization involves using facilities such as explain, benchmark, analyze table, show index, and the slow queries log to identify non-performing queries and tweak them to extract the leanest performance. These optimizations necessarily work best when applied as close as possible to the bare metal, so to speak, and the abstraction of an ORM framework negates to an extent the benefits of optimization. The devil remains in the details and the further away you are from the details, the lesser a chance you have to find and square with the devil. 2. At the end of the day, an ORM framework is essentially middleware. My reading of some of the real life architectures presented on this sites seems to reinforce the assessment that middleware will only take you so far, beyond which you have to roll your own. This makes perfect sense. ORM frameworks are built to serve as wide an audience as possible and while their success is unquestionable in the commodity/middle market, they are not and cannot possibly be tooled to accommodate the atypical demands of high scalability architecture. That would be akin to running with hares and hunting with the hounds. Building a framework for hight scalability would also require that the builders have a front and center seat in an enterprise where they are exposed to the machinery and day to day operations of a high scalability site. A situation for which you would be hard pressed to find another installation bearing similar characteristics or with similar requirements. Additionally, and without putting down the developers who contribute to these frameworks, a majority of them would not have the exposure to a bona fide high scalability architecture to be able to bring their experience to bear on the framework code base. 3. Just as with kernel developers, I have a significant amount of faith in the folks that spend their every waking hour coding database engines such as MySQL, Postgres, Oracle, MS SQL etc. Consequently, when the main goal is ultimate performance and scalability, I generally frown upon efforts to introduce a middle man between the wicked fast database and the application logic. And having invested the time and effort over many years to learn the intricacies of a database engine, I am more apt to cast my lot with the devil that I know than abdicate control to a framework, however versatile. One could argue that it makes sense to start off with an ORM framework and as the demands for the site begin to eclipse what the framework can provide, gradually transition to a custom built solution. In my experience, refactoring on the database tier for a site that has a significant amount of data and needs to be operational 24x7 is pure hell. So much so that a more feasible option would be to build a parallel site then migrate and switch over. Of course this could be mitigated by using a service oriented architecture and thereby giving yourself some degree of maneuverability, but at the end of the day, there will be thousands of operations trying to read and write to the db every second. You are had, whichever which way you turn. Taking a look at the mediawiki source code that powers the Wikimedia sites including Wikipedia, there are two classes, DatabaseMySQL and DatabasePostgress which encapsulate the native PHP functions that talk to MySQL or PostgreSQL respectively. The other main classes such as the Article class then use these database classes to interact with the db. Simple and straight forward and in my opinion, the best way to get maximum performance and throughput.

    Click to read more ...


    The AOL XMPP scalability challenge

    Large scale distributed instant messaging, presence based protocol are a real challenge. With big players adopting the standard, the XMPP (eXtensible Messaging and Presence Protocol) community is facing the need to validate protocol and implementations to even larger scale.

    Click to read more ...


    How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

    How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text file stored on each machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single machine MySQL version. Inserts quickly became the bottleneck as the huge torrents of data flooding caused a lot of index churn. Perdiodic bulk loading was the remedy to this problem, but the shear size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data this solution broke down with a combination of load and operational problems. Facing exponential growth they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time until and a more scalable solution would need to be created in the future anyway. The future came a little early this year. The advantage of their new system is that they can now look at their data in anyway they want:

  • Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
  • When they wanted to find out which part of the the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system. This switch has changed how they run their business. Stu Hood nicely sums up the impact: "Now whenever we think of complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff." In the rest of this post Bill describes the evolution of their system and the forces that caused them to move from a relational database solution to a MapReduce system. Before getting started, I'd really like to thank Bill Boebel for spending so much time and effort in creating this very valuable experience report.

    Information Sources

  • MapReduce at Rackspace
  • A document sent to me by Bill Boebel, CTO of Mailtrust (Rackspace's mail division). This post is a little different than normal because most all the content past this point is by Bill, I've just organized it a little differently.

    The Platform

  • Hadoop
  • Hadoop Distributed File System (HDFS)
  • Lucene
  • Solr
  • Tomcat

    The Stats

  • Rackspace has more than 50K devices and 7 data centers.
  • The mail system and logging servers are currently in 3 of the Rackspace data centers.
  • The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.
  • Several hundred gigabytes of email log data is generated each day.

    Background on Mailtrust

  • Email hosting company
  • Founded in 1999, merged with Rackspace in 2007, previous name:
  • 80K business customers, 700K mailboxes.
  • 2 hosted mail products: Noteworthy, MS Exchange
  • The Noteworthy System: * Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync. * ~600 servers, commodity hardware, designed to work around frequent failures.
  • The MS Exchange System: * MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync. * ~100 servers, higher-end hardware, SAN & DAS storage.

    The Architecture

    The way the current Hadoop based system works is:
  • Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System (”HDFS”) in real time.
  • MapReduce jobs are scheduled run to index the new data using Apache Lucene and Solr.
  • Once the indexes have been built, they are compressed and stored away in HDFS.
  • Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.

    The System Evolution

    The Problem

    Mailtrust is a very customer service focused company. It is extremely important for our support techs to be able to examine mail logs in order to troubleshoot problems for our customers. Our support techs need to search the logs hundreds of times per day, so the tools that provide this functionality must be fast and accurate. With over 600 mail servers, and hundreds of gigabytes of raw log data produced each day, this can be tricky to manage. Here is a brief history of the Mailtrust logging architecture, problems we faced, how we over came them, and what the system looks like today...

    Logging v1.0

    Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days. Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog. Problems: Once we grew much past a dozen servers, this manual process of logging into each server become too time consuming for our engineers.

    Logging v1.1

    Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server. An engineer could tell the script what type of mail server to search (inbound smtp, outbound smtp, backend mailbox). The script would look at /etc/hosts for a list of servers of that type, and then iterate through each server, ssh in, perform the grep and then output the results. The script could also search in the past via "gunzip -c /var/log/maillog.* | grep" Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.

    Logging v2.0

    We released a log search tool that the support techs could use directly, without involving the engineers. The support team was given a web-based tool where they could search the logs. It allowed searching by the sender or recipient's email address, domain name or IP address. All of these were indexed fields in a MySQL database. Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the data set was very large and these queries would be horribly slow. Each day's logs were stored in a separate table, so that we could cleanup old data by simply dropping and recreating MySQL tables. This made cleanup really fast compared to running a conditional DELETE command on a large table. Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size. To get the logs into the database, each mail server initially wrote its log data to a local 16MB tempfs partition. Logrotate was called via cron every 60 seconds to rotate the temporary log file and then preprocess the data before sending it on to the centralized log server. This preprocessing step reduced the volume of data that had to be transmitted over the network to the log server, and this also distributed the processing workload to avoid creating bottleneck on the log server. After the data was processed locally, the script would send comma delimited log data back to syslog-ng on the local server, and syslog-ng would then send it over the network to the centralized log server. The log server was configured to receive data on 6 different ports, one for each type of log data... inbound smtp, outbound smtp, backend smtp, spam/virus filtering, POP3 and IMAP. As log data was received, the records were inserted one by one into the database via MySQL INSERT commands. Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.

    Logging v2.1

    Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database. As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files. Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database. This was magnitudes faster than inserting the log data one record at a time. Problems: The LOADs would get progressively slower as the database grew because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.

    Logging v2.2

    Introduced Merge Tables in order to speed up loading the log data into the database. With this version, every 10 minutes our script would create a new database table and then load the text logs into the empty table. This made the LOAD command extremely fast because there were no existing database indexes that could negatively affect performance. After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together. The web search tool was modified to allow searching within the following time ranges: all day, past 12-hours, past 6-hours, past 2-hours. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created. Problems: This version of the logging system worked reliably for about one year. But we began having problems with it as our support team, customer base and server count grew. When we reached about 100 servers the database LOAD operations would take 2-3 minutes to run, which was acceptable, but the server was now always under a heavy cpu and disk IO load. Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy. Also, there were several occasions where our engineers performed a software upgrade to a particular application, which changed log format in such a way that broke the preprocessing script. Since our raw logs were deleted from the local mail servers every 60 seconds, we'd have no way to recover the missing logs when this occurred. Additionally, the log search tool was becoming ever more critical to our support team's daily operations; however, the logging system had no redundancy. There was no RAID, no backups, no failover system. We also do not have a good plan for scaling the log system beyond a single monolithic server. Incrementally patching problems and tweaking performance with the log system was taking up a lot of time and we needed something better. We needed a new solution that would be fast, reliable and could scale indefinitely with our growth. We needed something truly scalable.

    Logging v3.0

    While designing v3.0, we looked at several commercial log processing applications. Splunk stood out, and did just about everything we wanted; however, we worried that using a vendor product like this might limit our abilities to build new features down the road. For example, we wanted to build a tool that would allow our customers to search their logs directly. We had been keeping an eye on the Apache Hadoop project since its inception, and were extremely impressed with its progress and direction. Hadoop is an open-source implementation of Google File System and MapReduce... a system that is designed specifically for large scale distributed data processing. It scales out it's workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers. Other companies were already using it for their own log processing. So chose to go with Hadoop. In about 3 months we build a fresh new log processing system using Hadoop, Lucene and Solr. The system is described here: We believe this new system will be able to scale with us as our company grows. And there is a lot of momentum behind the Hadoop project, which gives us a lot of confidence that its scalability will continue to improve. Yahoo is one of the major contributors to the project and has built Hadoop clusters that contain thousands of servers, and they are aggressively working to get Hadoop to support tens of thousands of servers. Problems: To date, the only problems we have found have been our own bugs; and we fix those as we find them. We are actively running v3.0 today, but we're not going to stop here. We have a lot of plans for new features...

    The Future

    Version 3.1 is being coded currently. It includes new MapReduce jobs that support Microsoft Exchange log processing. (currently we only process Noteworthy logs with this system). We plan to go live in March. In version 4.0 we plan to put the log search tool in the hands of our customers so that they can have the same troubleshooting power that our support team has. This will most likely require reorganizing the way we store log index shards so that they are grouped by user, rather than letting Solr randomly group them. Our resellers seem to be excited about this, because it should allow them to better support their customers. Who knows what we'll build after v4.0...

    Related Articles

  • Google Architecture
  • Database People Hating on MapReduce
  • Product: Hadoop
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3
  • Solr

    Click to read more ...