This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application.
During the tutorial, we will perform the following data processing steps.... read more on Cloudera website
This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application.
These are from Laura Thomson of OmniTi:
- Profile early, profile often. Pick a profiling tool and learn it in and out.
- Dev-ops cooperation is essential. The most critical difference in organizations that handles crises well.
- Test on production data. Code behavior (especially performance) is often data driven.
- Track and trend. Understanding your historical performance characteristics is essential for spotting emerging problems.
- Assumptions will burn you. Systems are complex and often break in unexpected ways.
- Decouple. Isolate performance failures.
- Cache. Caching is the core of most optimizations.
- Federate. Data federation is taking a single data set and spreading it across multiple database/application servers.
- Replicate. Replication is making synchronized copies of data available in more than one place.
- Avoid straining hard-to-scale resources. Some resources are inherently hard to scale: Uncacheable’ data, Data with a very high read+write rate, Non-federatable data, Data in a black-box
- Use a compiler cache. A compiler cache sits inside the engine and caches the parsed optrees.
- Be mindful of using external data sources. External data (RDBMS, App Server, 3rd Party data feeds) are the number one cause of application bottlenecks.
- Avoid recursive or heavy looping code. Deeply recursive code is expensive in PHP.
- Don’t Outsmart Yourself . Don’t try to work around perceived inefficiencies in PHP (at least not in userspace code!)
- Build with caching in mind. Caching is the most important tool in your tool box.
Another fun one... Should I be Worried About Scaling?
What happens when the comfortable SQL world of yesterday changes out from under you? You feel like this (it uses some language picante so be advised):
Here's a link to the source: http://browsertoolkit.com/fault-tolerance.png
A great way to communicate the dislocation one feels when technology changes. It also gets you to wonder if those changes at the user level are completely necessary for carrying out the task?
Lest you think this just an old guard reactionary sentiment, Boxed Ice chose MongoDB over CouchDB partly because "CouchDB requires map/reduce style queries which is much more complicated to work with."
Aaron Kimball of Cloudera gives a wonderful 23 minute presentation titled Cloudera Hadoop Training: Thinking at Scale Cloudera which talks about "common challenges and general best practices for scaling with your data." As a company Cloudera offers "enterprise-level support to users of Apache Hadoop." Part of that offering is a really useful series of tutorial videos on the Hadoop ecosystem.
Like TV lawyer Perry Mason (or is it Harmon Rabb?), Aaron gradually builds his case. He opens with the problem of storing lots of data. Then a blistering cross examination of the problem of building distributed systems to analyze that data sets up a powerful closing argument. With so much testimony behind him, on closing Aaron really brings it home with why shared nothing systems like map-reduce are the right solution on how to query lots of data. They jury loved it.
Here's the video Thinking at Scale. And here's a summary of some of the lessons learned from the talk:
* Say 32 GB of RAM available to a machine. You can get 1-2 TB of data on disk. The amount of data a machine can store is greater than the amount it can manipulate in memory so you have to swap out RAM.
* With an average job size of 180 GB it would take 45 minutes to read that data off of disk sequentially. Random access would be much slower.
* An individual SATA drive can read at 75 MB/sec. To process 180 GB you would have to read at 75 MB/sec which leaves the CPU doing very little with the data.
* Grids moved data to computation. Data was typically stored on a large filer/SAN.
* The new large scale computing approach is to move computation to where the data is already stored. A file has limited processing power relative to storage size so not useful.
* So move processing to individual nodes that store only a small amount of the data at a time.
* This gets around implementation complexity and bandwidth limitations of a centralized filer. Distributed systems can drown themselves if they start sharing data.
* Failure with large systems is inevitable so partial progress must be kept for long jobs and jobs must be restarted when a failure is detected. Complex distributed systems make job restarting difficult because of the state that must be maintained.
* Processing should be close to linear with the number of nodes. Losing 5% of nodes should not end up with a 50% loss in throughput. Doubling the size of the cluster should double the number of jobs that can be processed. No job should be able to nuke the system.
* Workload should be transferred as new nodes are added and failures occur.
* Node changes (failures, additions, new hard drives, more memory, etc) should be transparent to jobs. Users shouldn't have to deal with changes, the system should handle them transparently.
* To get around the limits faced byMPI (message passing interface) based systems nothing is shared.
* In map-reduce (MR) systems data is read locally and processed locally. Results are written back locally.
* Nodes do not talk to each other.
* Data is paritioned onto machines in advance and computations happen where data is stored.
* In MPI communication is explicit. Programs know who they are talking to and what they are talking about. In MR communication is implicit. It is taken care of by the system. Data is routed where it needs to go. This simplifies programs by removing the complexity of explicit coordination. Allows developers to concentrate on solving their problem without know level network stack and programming details.
* On multi-core computers each core would be treated as a separate node.
* Goal is to have locality of reference. Tasks are processed on the same node as where the data is stored or at least on the same rack. This removes a load step. The data isn't loaded onto a filer. It's not then loaded onto a processing machines. It's already where spreat around the cluster where it needs to be used upfront.
* In standard MR it's processing large files of data, typically 1 GB or more. This allows streaming reads from this disk. Typical file system block sizes are 4K, for MR they are 64MB to 256MB, which allows writing large linear chunks which reduces seeks on reading.
* Tasks are restarted transparently on failure because tasks are independent of each other.
* Data is replicated across nodes for fault tolerance purposes.
* Task independence allows speculative task execution. The same task can be started on difference nodes and the fastest result can be used. This allows problems like broken disk controllers to be worked around.
* If necessary inputs can be processed on another machine. There's a big penalty for going off node and off rack.
* Nodes have no identity to the programmer. Nodes can run multiple jobs.
Virtualization offers a lot of benefits, but it also comes with a cost (memory, CPU, network, IO, licensing). If you are in or running a cloud then some form of virtualization may not even be an option. But if you are running your own string of servers you can choose to go without. Free will and all that. Should you or shouldn't you?
In a detailed comparison the folks at 37signals found that running their Rails application servers without virtualization resulted in A 66% reduction in the response time while handling multiples of the traffic is beyond what I expected.
As is common 37signals runs their big database servers without virtualization. They use a scale-up approach at the database tier so extracting every bit of performance out of those servers is key. Application servers typically use a scale-out approach for scalability, which is virtualization friendly, but that says nothing about performance.
Finding performance increases, especially when you are running on a dynamic language is a challenge. The improvements found by 37signals were significant. Something to consider when you are casting around wondering how you can get more bang for your buck.
Update 7: Basecamp, now with more vroom. Basecamp application servers running Ruby code were upgraded and virtualization was removed. The result: A 66 % reduction in the response time while handling multiples of the traffic is beyond what I expected. They still use virtualization (Linux KVM), just less of it now.
Update 6: Things We’ve Learned at 37Signals. Themes: less is more; don't worry be happy.
Update 5: Nuts & Bolts: HAproxy . Nice explanation (post, screencast) by Mark Imbriaco of why HAProxy (load balancing proxy server) is their favorite (fast, efficient, graceful configuration, queues requests when Mongrels are busy) for spreading dynamic content between Apache web servers and Mongrel application servers.
Update 4: O'Rielly's Tim O'Brien interviews David Hansson, Rails creator and 37signals partner. Says BaseCamp scales horizontally on the application and web tier. Scales up for the database, using one "big ass" 128GB machine. Says: As technology moves on, hardware gets cheaper and cheaper. In my mind, you don't want to shard unless you positively have to, sort of a last resort approach.
Update 3: The need for speed: Making Basecamp faster. Pages now load twice as fast, cut CPU usage by a third and database time by about half. Results achieved by: Analysis, Caching, MySQL optimizations, Hardware upgrades.
Update 2: customer support is handled in real-time using Campfire.
Update: highly useful information on creating a customer billing system.
In the giving spirit of Christmas the folks at 37signals have shared a bit about how their system works. 37signals is most famous for loosing Ruby on Rails into the world and they've use RoR to make their very popular Basecamp, Highrise, Backpack, and Campfire products. RoR takes a lot of heat for being a performance dog, but 37signals seems to handle a lot of traffic with relatively normal sounding resources. This is just an initial data dump, they promise to add more details later. As they add more I'll update it here.
* 2,000,000 people with accounts
* 1,340,000 projects
* 13,200,000 to-do items
* 9,200,000 messages
* 12,200,000 comments
* 5,500,000 time tracking entries
* 4,000,000 milestones
* Just under 1,000,000 pages
* 6,800,000 to-do items
* 1,500,000 notes
* 829,000 photos
* 370,000 files
* 5.9 terabytes of customer-uploaded files
* 888 GB files uploaded (900,000 requests)
* 2 TB files downloaded (8,500,000 requests)
Credit Card Processing Process
- Apache is able to deliver roughly 700% more requests per second with Squid when serving 1KB and 100KB images.
- Server load is reduced using Squid because the server does not have to create a bunch of Apache processes to handle the requests.
- APC Cache took a system that could barely handle 10-20 requests per second to handling 50-60 requests per second. A 400% increase.
- APC allowed the load times to remain under 5 seconds even with 200 concurrent threads slamming on the server.
- These two caches are easy to setup and install and allow you to get a lot more performance out of them.
Update 7: How do you know when you need more memcache servers?. Dathan Pattishall talks about using memcache not to scale, but to reduce latency and reduce I/O spikes, and how to use stats to know when more servers are needed.
Update 6: Stock Traders Find Speed Pays, in Milliseconds. Goldman Sachs is making record profits off a 500 millisecond trading advantage. Yes, latency matters. As an interesting aside, Libet found 500 msecs is about the time it takes the brain to weave together an experience of consciousness from all our sensor inputs.
Update 5: Shopzilla's Site Redo - You Get What You Measure. At the Velocity conference Phil Dixon, from Shopzilla, presented data showing a 5 second speed up resulted in a 25% increase in page views, a 10% increase in revenue, a 50% reduction in hardware, and a 120% increase traffic from Google. Built a new service oriented Java based stack. Keep it simple. Quality is a design decision. Obsessively easure everything. Used agile and built the site one page at a time to get feedback. Use proxies to incrementally expose users to new pages for A/B testing. Oracle Coherence Grid for caching. 1.5 second page load SLA. 650ms server side SLA. Make 30 parallel calls on server. 100 million requests a day. SLAs measure 95th percentile, averages not useful. Little things make a big difference.
Update 4: Slow Pages Lose Users. At the Velocity Conference Jake Brutlag (Google Search) and Eric Schurman (Microsoft Bing) presented study data showing delays under half a second impact business metrics and delay costs increase over time and persist. Page weight not key. Progressive rendering helps a lot.
Update 3: Nati Shalom's Take on this article. Lots of good stuff on designing architectures for latency minimization.
Update 2: Why Latency Lags Bandwidth, and What it Means to Computing by David Patterson. Reasons: Moore's Law helps BW more than latency; Distance limits latency; Bandwidth easier to sell; Latency help BW, but not vice versa; Bandwidth hurts latency; OS overhead hurts latency more than BW. Three ways to cope: Caching, Replication, Prediction. We haven't talked about prediction. Games use prediction, i.e, project where a character will go, but it's not a strategy much used in websites.
Update: Efficient data transfer through zero copy. Copying data kills. This excellent article explains the path data takes through the OS and how to reduce the number of copies to the big zero.
Latency matters. Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%. A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 milliseconds behind the competition.
The Amazon results were reported by Greg Linden in his presentation Make Data Useful. In one of Greg's slides Google VP Marissa Mayer, in reference to the Google results, is quoted as saying "Users really respond to speed." And everyone wants responsive users. Ka-ching! People hate waiting and they're repulsed by seemingly small delays.
The less interactive a site becomes the more likely users are to click away and do something else. Latency is the mother of interactivity. Though it's possible through various UI techniques to make pages subjectively feel faster, slow sites generally lead to higher customer defection rates, which lead to lower conversation rates, which results in lower sales. Yet for some reason latency isn't a topic talked a lot about for web apps. We talk a lot about about building high-capacity sites, but very little about how to build low-latency sites. We apparently do so at the expense of our immortal bottom line.
I wondered if latency went to zero if sales would be infinite? But alas, as Dan Pritchett says, Latency Exists, Cope!. So we can't hide the "latency problem" by appointing a Latency Czar to conduct a nice little war on latency. Instead, we need to learn how to minimize and manage latency. It turns out a lot of problems are better solved that way.
How do we recover that which is most meaningful--sales--and build low-latency systems?
I'm excited that the topic of latency came up. There are a few good presentations on this topic I've been dying for a chance to reference. And latency is one of those quantifiable qualities that takes real engineering to create. A lot of what we do is bolt together other people's toys. Building high-capacity low-latency system takes mad skills. Which is fun. And which may also account for why we see latency a core design skill in real-time and market trading type systems, but not web systems. We certainly want our nuclear power plant plutonium fuel rod lowering hardware to respond to interrupts with sufficient alacrity. While less serious, trading companies are always in a technological arms race to create lower latency systems. He with the fastest system creates a sort of private wire for receiving and acting on information faster than everyone else. Knowing who has the bestest price the firstest is a huge advantage. But if our little shopping cart takes an extra 500 milliseconds to display, the world won't end. Or will it?
My unsophisticated definition of latency is that it is the elapsed time between A and B where A and B are something you care about. Low-latency and high-latency are relative terms. The latency requirements for a femptosecond laser are far different than for mail delivery via the pony express, yet both systems can be characterized by latency. A system has low-latency if it's low enough to meet requirements, otherwise it's a high-latency system.
The best explanation of latency I've ever read is still It's the Latency, Stupid by admitted network wizard Stuart Cheshire. A wonderful and detailed rant explaining latency as it relates to network communication, but the ideas are applicable everywhere.
Stuart's major point: If you have a network link with low bandwidth then it's an easy matter of putting several in parallel to make a combined link with higher bandwidth, but if you have a network link with bad latency then no amount of money can turn any number of them into a link with good latency.
I like the parallel with sharding in this observation. We put shards in parallel to increase capacity, but request latency through the system remains the same. So if we want to increase interactivity we have to address every component in the system that introduces latency and minimize or remove it's contribution. There's no "easy" scale-out strategy for fixing latency problems.
Sources of Latency
My parents told me latency was brought by Santa Clause in the dead of night, but that turns out not to be true! So where does latency come from?
Draw out the list of every hop a client request takes and the potential number of latency gremlins is quite impressive.
The Downsides of LatencyLower sales may be the terminal condition of latency problems, but the differential diagnosis is made of many and varied ailments. As latency increases work stays queued at all levels of the system which puts stress everywhere. It's like dementia, the system forgets how to do anything. Some of the problems you may see are: Queues grow; Memory grows; Timeouts cascade; Memory grows; Paging increases; Retries cascade; State machines reset; Locks are held longer; Threads block; Deadlock occurs; Predictability declines; Throughput declines; Messages drop; Quality plummets.
For a better list take a look at The Many Flavors of System Latency.. along the Critical Path of Peak Performance by Todd Jobson. A great analysis of the subject.
Managing LatencyThe general algorithm for managing latency is:
Hardly a revelation, but it's actually rare for applications to view their work flow in terms of latency. This is part of the Log Everything All the Time mantra. Time stamp every part of your system. Look at mean latency, standard deviation, and outliers. See if you can't make the mean a little nicer, pinch in that standard deviation, and chop off some of those spikes. With latency variability is the name of the game, but that doesn't mean that variability can't be better controlled and managed. Target your latency slimming efforts where it matters the most and you get the most bang for your buck.
Next we will talk about various ideas for what you can do about latency once you've found it.
Dan Pritchett's Lessons for Managing LatencyDan Pritchett is one of the few who has openly written on architecting for latency. Here are some of Dan's suggestions for structuring systems to manage latency:
Clearly each of these principles is a major topic all on their own. For more details please read: Dan Pritchett has written a few excellent papers on managing latency: The Challenges of Latency, Architecting for Latency, Latency Exists, Cope!.
GigaSpaces Lessons for Lowering LatencyGigsSpaces is an in-memory grid vendor and as such is on the vanguard of the RAM is the New Disk style of application building. In this approach disk is pushed aside for keeping all data in RAM. Following this line of logic GigaSpaces came up with these low latency architecture principles:
The thinking is the primary source of latency in a system centers around accessing disk. So skip the disk and keep everything in memory. Very logical. As memory is an order of magnitude faster than disk it's hard to argue that latency in such a system wouldn't plummet.
Latency is minimized because objects are in kept memory and work requests are directed directly to the machine containing the already in-memory object. The object implements the request behavior on the same machine. There's no pulling data from a disk. There isn't even the hit of accessing a cache server. And since all other object requests are also served from in-memory objects we've minimized the Service Dependency Latency problem as well.
GigaSpaces isn't the only player in this market. You might want to also take a look at: Scaleout Software, Grid Gain, Teracotta, GemStone, and Coherence. We'll have more on some of these products later.
Miscellaneous Latency Reduction Ideas
Why don't most users experience high data rates? pinpoints poor network design as one major source of latency: On a single high performance network today, measured latencies are typically ~1.5x - 3x that expected from the speed of light in fiber. This is mostly due to taking longer than line-of-site paths. Between different networks (via NAPs) latency is usually much worse. Some extra distance is required, based on the availability of fiber routes and interconnects, but much more attention should be given to minimizing latency as we design our network topologies and routing.
Application Server Architecture Matters AgainWith the general move over the past few years to a standard shared nothing two-tierish architecture, discussion of application server architectures has become a neglected topic, mainly because there weren't application servers anymore. Web requests came in, data was retrieved from the database, and results were calculated and returned to the user. No application server. The web server became the application server. This was quite a change from previous architectures which were more application server oriented. Though they weren't called application servers, they were call daemons or even just servers (as in client-server).
Let's say we buy into RAM is the New Disk. This means we'll have many persistent processes filled with many objects spread over many boxes. A stream of requests are directed at each process and those requests must be executed in each process. How should those processes be designed?
Sure, having objects in memory reduces latency, but it's very easy through poor programming practice to lose all of that advantage. And then some. Fortunately we have a ton of literature on how to structure servers. I have a more thorough discussion here in Architecture Discussion. Also take a look at SEDA, an architecture for highly concurrent servers and ACE, an OO network programming toolkit in C++.
A few general suggestions:
ColocateLocating applications together reduces latency by reducing data hops. The number and location of network hops a message has to travel through is a big part of the end-to-end latency of a system.
For example, from New York to the London Stock Exchange a round trip message takes 84 milliseconds to send, from Frankfurt it take 18 milliseconds, and from Tokyo it takes 208 milliseconds. If you want to minimize latency then the clear strategy is to colocate your service in the London Stock Exchange. Distance is minimized and you can probably use a faster network too.
Virtualization technology makes it easier than ever to compose separate systems together. Add a cloud infrastructure to that and it becomes almost easy to dramatically lower latencies by colocating applications.
Minimize the Number of HopsLatency increases with each hop in a system. The fewer hops the less latency. So put those hops on a diet. Some hop reducing ideas are:
Build Your own Field-programmable Gate Array (FPGA)This one may seem a little off the wall, but creating your own custom FPGA may be a killer option for some problems. A FPGA is a semiconductor device containing programmable logic. Typical computer programs are a series of instructions that are loaded and interpreted by a general purpose microprocessor, like the one in your desk top computer. Using a FPGA it's possible to bypass the overhead of a general purpose microprocessor and code your application directly into silicon. For some classes of problems the performance increases can be dramatic.
FPGAs are programmed with your task specific algorithm. Usually something compute intensive like medical imaging, modeling bond yields, cryptography, and matching patterns for deep packet inspections. I/O heavy operations probably won't benefit from FPGAs. Sure, the same algorithm could be run on a standard platform, but the advantage FPGAs have is even though they may run at a relatively low clock rates, FPGAs can perform many calculations in parallel. So perhaps orders-of-magnitude more work is being performed each clock cycle. Also, FPGAs often use content addressable memory which provides a significant speedup for indexing, searching, and matching operations. We also may see a move to FPGAs because they use less power. Stay lean and green.
In embedded projects FPGAs and ASICS (application-specific integrated circuit) are avoided like the plague. If you can get by with an off-the-shelf microprocessors (Intel, AMD, ARM, PPC, etc) you do it. It's a time-to-market issue. Standard microprocessors are, well, standard, so that makes them easy to work with. Operating systems will already have board support packages for standard processors, which makes building a system faster and cheaper. Once custom hardware is involved it becomes a lot of work to support the new chip in hardware and software. Creating a software only solution is much more flexible in a world where constant change rules. Hardware resists change. So does software, but since people think it doesn't we have to act like software is infinitely malleable.
Sometimes hardware is the way to go. If you are building a NIC that has to process packets at line speed the chances are an off-the-shelf processor won't be cost effective and may not be fast enough. Your typical high end graphics card, for example, is a marvel of engineering. Graphics cards are so powerful these days distributed computation projects like Folding@home get a substantial amount of their processing power from graphics cards. Traditional CPUs are creamed by NVIDIA GeForce GPUs which perform protein-folding simulations up to 140 times faster. The downside is GPUs require very specialized programming, so it's easier to write for a standard CPU and be done with it.
That same protein folding power can be available to your own applications. ACTIV Financial, for example, uses a custom FGPA for low latency processing of high speed financial data flows. ACTIV's competitors use a traditional commodity box approach where financial data is processed by a large number of commodity servers. Let's say an application takes 12 servers. Using a FPGA the number of servers can be collapsed down to one because more instructions are performed simultaneously which means fewer machines ar needed. Using the FPGA architecture they process 20 times more messages than they did before and have reduced latency from one millisecond down to less than 100 microseconds.
Part of the performance improvement comes from the high speed main memory and network IO access FPGAs enjoy with the processor. Both Intel and AMD make it relatively easy to connect FPGAs to their chips. Using these mechanisms data moves back and forth between your processing engine and the main processor with minimal latency. In a standard architecture all this communication and manipulation would happen over a network.
FPGAs are programmed using hardware description languages like Verilog and VHDL. You can't get away from the hardware when programming FPGAs, which is a major bridge to cross for us software types. Many moons ago I took a Verilog FPGA programming class. It's not easy, nothing is ever easy, but it is possible. And for the right problem it might even be worth it.
There have been reports that software engineering is dead. Maybe, like the future, software engineering is simply not evenly distributed? When you read this paper I think you'll agree there is some real engineering going on, it's just that most of the things we need to build do not require real engineering. Much like my old childhood tree fort could be patched together and was "good enough." This brings to mind the old joke: If a software tree falls in the woods would anyone hear it fall? Only if it tweeted on the way down...
What this paper really showed me is we need not only to change programming practices and constructs, but we also need to design solutions that allow for deep parallelism to begin with. Grafting parallelism on later is difficult. Parallel execution requires knowing precisely how components are dependent on each other and that level of precision tends to go far beyond the human attention span.
In particular this paper deals with how to parallelize the browser on cell phones. We are entering a multi-core smartphone dominated world. As network connections become faster, applications, like the browser, become CPU bound:
What's clear though is their job would have been a heck of a lot easier if the stack would have been designed with parallelization in mind from the beginning.
Leo Meyerovich, one of the authors of the paper, talks about the need for a more rigorous underpinning in blog postThe Point of Semantics:
As part of the preparation for a paper submission, I'm finishing up my formalization of a subset of CSS 2.1 (blocks, inlines, inline-blocks, and floats) from last year. My first two, direct formalization approaches failed the smell test so Ras and I created a more orthogonal kernel language. It's small, and as the CSS spec is a scattered hodge-podge of prose and visual examples riddled with ambiguities, we phrase it as a total and deterministic attribute grammar that is easy to evaluate in parallel.
I asked Leo what rules we could follow to create more parallelizable constructs from the beginning and he said that's what he'll be working on for the next couple years :-) Some advice he had was:
Some things Leo will be working on are:
I've been enjoying higher-order data flow models (Flapjax) and task parallelism (Cilk++) for awhile now and have been thinking about this, including support for controlled sharing (e.g., SharC for type qualifiers and I'm still trying to figure out implicitly transactional flows for FRP). For a browser, I think it will remain as specialized libraries written in privileged languages where good engineers can rock and put together and be exposed in higher-level languages. Hopefully gradually typing will extend into lower levels to support this. The above hints at a layered framework with the bulk in the high-level -- think parallel scripting. However, as a community, we don't know how to include performance guides in large software, so parallelism is a challenge. I prototyped one of my algorithms in a parallel python variant: the sequential C was magnitudes faster than then 20-core python. Of course, the parallel C++ was even faster :)
Browsing Web 3.0 on 3 Watts