This article is a primer: a broad introduction intended to shine some much-needed light on the logical, process-oriented implementations of database scalability strategies. More specifically, the intent is to illustrate most of these implementations by example.
The SHOP.COM Cache System is now available at http://code.google.com/p/sccache/

The SHOP.COM Cache System is an object cache system that:

- is an in-process cache and external, shared cache
- is horizontally scalable
- stores cached objects to disk
- supports associative keys
- is non-transactional
- can have any size key and any size data
- does auto-GC based on TTL
- is container and platform neutral

It was built in-house at SHOP.COM (by me) and has powered our website for years. We are open-sourcing it in the hope that it will be useful to others and to get some help in its maintenance. This is our first open source attempt and we'd appreciate any help and comments.
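As a rough illustration of that first point, a two-tier read path checks the fast in-process tier before falling back to the shared, external tier. This is a minimal sketch of the pattern only; TwoTierCache, SharedCache, and their methods are hypothetical stand-ins, not sccache's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a two-tier cache: a fast in-process map backed
// by a shared, external cache tier (e.g. one reachable over the network).
public class TwoTierCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final SharedCache<K, V> shared;

    public TwoTierCache(SharedCache<K, V> shared) { this.shared = shared; }

    public V get(K key) {
        V value = local.get(key);                      // 1. in-process hit?
        if (value == null) {
            value = shared.get(key);                   // 2. shared-tier hit?
            if (value != null) local.put(key, value);  // promote to local tier
        }
        return value;
    }

    public void put(K key, V value) {
        shared.put(key, value);   // write through to the shared tier
        local.put(key, value);
    }

    // Stand-in for whatever the external, shared tier actually is.
    public interface SharedCache<K, V> {
        V get(K key);
        void put(K key, V value);
    }
}
```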
I thought with the job situation these days that people might be interested in some open jobs at Facebook. Here's what's available:
Facebook is hiring! We are looking for a Systems Engineer/Architect and Site Reliability Engineer. I have attached the job descriptions below. If you are interested, please contact Michelle Bostock mbostock-at-facebook.com. Thanks and Happy Holidays!

Systems Architect, Palo Alto, CA

Description: Facebook is seeking a seasoned Systems Architect to join the Operations team. The position is full-time, is based in our main office in downtown Palo Alto, and will report to the Manager of Systems Operations.

Responsibilities:
- Analyze application flow and infrastructure design to improve performance and scalability of the site
- Collaborate on design of services infrastructure from servers to networking
- Monitor, analyze, and make recommendations as appropriate to improve site stability and availability
- Evaluate hardware and software technologies to improve site efficiency and performance
- Troubleshoot and solve issues with hardware, applications, and network components
- Lead team efforts from design to implementation, prioritize tasks and resources while interacting with Engineering and Operations
- Document current and future configuration processes and policies
- Participate in 24x7 on-call support

Requirements:
- B.S. in Computer Science or equivalent experience
- 4+ years of experience in Operations with large web farms
- Extensive knowledge of web architecture and technologies, including Linux, Apache, MySQL, PHP, TCP/IP, security, HTTP, LDAP and MTAs
- Strong background/interest in application and infrastructure design
- Scripting and programming skills
- Excellent verbal and written communication skills
Site Reliability Engineer, Palo Alto, CA

Description: Facebook is seeking talented operations engineers to join the Site Reliability Engineering team. The ideal candidate will have strong communication skills, a passion for tinkering with Linux, and an almost insane fondness for fast-paced, seat-of-your-pants troubleshooting and crisis management. The position is full-time and is based in our main office in downtown Palo Alto. This position reports to the Manager of Site Reliability Engineering.

Responsibilities:
- Monitor the stability and performance of the website
- Remotely troubleshoot and diagnose hardware problems
- Debug issues with Linux software, applications and network
- Resolve technical challenges encountered in LAMP technologies
- Develop and maintain monitoring tools and automation systems
- Predict and respond to utilization variances across multiple datacenters
- Identify and triage all outage-related events
- Facilitate communication, coordinate escalation, and work with subject matter experts to implement critical fixes
- Automate and streamline processes
- Track issues and run reports

Requirements:
- 2-3+ years of Linux support/sysadmin experience in an Internet operations environment
- BA/BS in Computer Science or a related field, or equivalent experience
- Working knowledge of Linux, Cisco, TCP/IP, Apache and MySQL
- Experience working with network management systems and monitoring tools, such as Nagios, Ganglia and Cacti
- Competency in Shell, PHP, Perl or Python; C is a plus
- Solid understanding of web services architecture and commonly employed technologies
- A sense of urgency in responding to and resolving critical issues that relate to the performance of the site and/or core infrastructure
- Excellent verbal and written communication skills
- Participation in a shifted coverage schedule, including working nights and on-call rotations
How do you scale MySQL on a 32-core system with 256 threads? Diagonal scalability in a box. An impressive benchmark that achieved more than 79,000 SQL queries per second on a single 4 RU server! Is this real? If so, what is the role of good old horizontal scalability? The goals of the benchmark:
- Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440
- Do it 21st-century style, i.e. with MySQL and ZFS, not 20th-century style, i.e. with OraSybInf... and VxFS
- Do it with minimal tuning, i.e. as close to out-of-the-box as possible
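As a back-of-the-envelope sketch of how a queries-per-second figure like that is typically measured, a multithreaded JDBC driver along these lines would do. The URL, credentials, table, and thread count below are hypothetical placeholders, not the benchmark's actual harness.

```java
import java.sql.*;
import java.util.Random;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical QPS measurement: N threads hammer simple point SELECTs
// for a fixed interval, then total queries are divided by elapsed seconds.
// Requires a MySQL JDBC driver on the classpath.
public class QueryDriver {
    public static void main(String[] args) throws Exception {
        final int threads = 256;                 // one per hardware thread
        final AtomicLong queries = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try (Connection c = DriverManager.getConnection(
                         "jdbc:mysql://localhost/bench", "bench", "bench");
                     PreparedStatement ps = c.prepareStatement(
                         "SELECT val FROM t WHERE id = ?")) {
                    Random rnd = new Random();
                    while (!Thread.currentThread().isInterrupted()) {
                        ps.setInt(1, rnd.nextInt(1_000_000));
                        try (ResultSet rs = ps.executeQuery()) { rs.next(); }
                        queries.incrementAndGet();
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            });
        }
        Thread.sleep(60_000);                    // measure for one minute
        pool.shutdownNow();
        System.out.println("QPS ~ " + queries.get() / 60);
    }
}
```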
Our latest strategy is taken from a great post by Paul Saab of Facebook, detailing how, with the changes Facebook has made to memcached, they have:
...been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.
To scale, Facebook has hundreds of thousands of TCP connections open to their memcached processes. First, this is still amazing. It's not so long ago that you could never have done this. Optimizing connection use was always a priority because the OS simply couldn't handle large numbers of connections, large numbers of threads, or large numbers of CPUs. To get to this point is a big accomplishment. Still, at that scale a new class of problems has to be solved.
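For a concrete sense of what one of those UDP requests looks like on the wire, here is a minimal, hypothetical Java sketch of a single memcached get over UDP. The protocol prefixes each datagram with an 8-byte frame header (request id, sequence number, total datagram count, reserved field), followed by the ordinary text command; the host, port, and key are placeholders.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Minimal sketch: one memcached "get" over UDP. Each UDP datagram starts
// with an 8-byte frame header, then the normal memcached text protocol.
public class UdpMemcachedGet {
    public static void main(String[] args) throws Exception {
        byte[] cmd = "get somekey\r\n".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(8 + cmd.length);
        buf.putShort((short) 0x1234); // request id, echoed back in the reply
        buf.putShort((short) 0);      // sequence number of this datagram
        buf.putShort((short) 1);      // total datagrams in this message
        buf.putShort((short) 0);      // reserved, must be zero
        buf.put(cmd);

        DatagramSocket sock = new DatagramSocket();
        sock.setSoTimeout(1000);
        InetAddress host = InetAddress.getByName("127.0.0.1");
        sock.send(new DatagramPacket(buf.array(), buf.position(), host, 11211));

        byte[] reply = new byte[1400];
        DatagramPacket pkt = new DatagramPacket(reply, reply.length);
        sock.receive(pkt);
        // Strip the 8-byte frame header; the rest is the text reply.
        System.out.write(reply, 8, pkt.getLength() - 8);
        System.out.flush();
        sock.close();
    }
}
```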
Some of the problems Facebook faced and fixed:
When you read Paul's article, keep in mind the incredible number of man-hours that went into profiling the system, not just their application but the entire software and hardware stack. Then add in the research, planning, and trying different solutions to see if anything changed for the better. It's a lot of work. Notice that using a nifty new parallel language or moving to a cloud wouldn't have made a bit of difference. It's complete mastery of their system that made the difference.
A summary of potential strategies:
You can find their changes on GitHub, the hub that says "git."
This is an interesting and still relevant research paper by Jim Gray and Prashant Shenoy of Microsoft Research that examines the rules of thumb for the design of data storage systems. It looks at storage, processing, and networking costs, ratios, and trends, with a particular focus on performance and price/performance. Jim Gray has an updated presentation on this interesting topic: Long Term Storage Trends and You. Robin Harris has a great post that reflects on the Rules of Thumb whitepaper on his StorageMojo blog: Architecting the Internet Data Center - Parts I-IV.
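The best-known of these rules is probably Gray's "five-minute rule" for trading memory against disk I/O. As a rough restatement from the earlier Gray-Graefe work (the constants below are 1997-era illustrations, not figures from this particular paper), a page is worth caching in RAM if it is re-referenced within the break-even interval:

\[
\text{BreakEvenInterval} \approx \frac{\text{PagesPerMBofRAM}}{\text{AccessesPerSecondPerDisk}} \times \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofRAM}}
\]

With 1997-era numbers (128 eight-KB pages per MB, roughly 64 accesses/s per disk, roughly $2000 per drive and $15 per MB of RAM) this works out to about (128/64) x (2000/15), or 266 seconds: about five minutes.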
An excellent article by Bryan Cantrill and Jeff Bonwick on how to write multi-threaded code. With more processors and no magic bullet solution for how to use them, knowing how to write multiprocessor code that doesn't screw up your system is still a valuable skill. Some topics:
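To give a flavor of the discipline involved, here is a minimal, hypothetical Java sketch of one classic technique from this genre: acquiring locks in a single global order (here, by account id) so that two transfers running in opposite directions can never deadlock. The Account class is illustrative only.

```java
// One classic multi-threading discipline: impose a total order on lock
// acquisition. If every thread locks accounts in ascending id order,
// transfer(a, b) and transfer(b, a) can never deadlock each other.
public class Transfers {
    static final class Account {
        final long id;
        long balance;
        Account(long id, long balance) { this.id = id; this.balance = balance; }
    }

    static void transfer(Account from, Account to, long amount) {
        Account first  = from.id < to.id ? from : to;  // lower id locks first
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance   += amount;
            }
        }
    }
}
```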
Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind. Warning #2: this post is wild.
Kevin Kelly

Kevin Kelly is Senior Maverick at Wired magazine. He helped launch Wired in 1993, and served as its Executive Editor until January 1999. He co-founded the ongoing Hackers' Conference, and was involved with the launch of the WELL, a pioneering online service started in 1985. He authored the best-selling New Rules for the New Economy and the classic book on decentralized emergent systems, Out of Control.
One Machine

There is only one time in the history of each planet when its inhabitants first wire up its innumerable parts to make one large Machine. Later that Machine may run faster, but there is only one time when it is born. You and I are alive at this moment. Is this global web of computers, servers and trunk lines a mere mechanical circuit, a very large tool, or does it reach a threshold where something, well, different happens? Kevin Kelly's hypothesis is this: The rapidly increasing sum of all computational devices in the world connected online, including wirelessly, forms a superorganism of computation with its own emergent behaviors. I define the One Machine as the emerging superorganism of computers. It is a megasupercomputer composed of billions of sub computers. The sub computers can compute individually on their own, and from most perspectives these units are distinct complete pieces of gear. But there is an emerging smartness in their collective that is smarter than any individual computer. We could say learning (or smartness) occurs at the level of the superorganism.
The Next 6500 Days of the Web

Kevin Kelly recently gave a short talk on the upcoming Web 10.0 at the Web 2.0 Summit in San Francisco. It is like an update to his previous TED talk, Predicting the next 5000 days of the web. He makes us realize that the Web is only around 6500 days old and argues that the next 6500 days will be something entirely different.
Dimensions of the One Machine

Kevin Kelly's post on his blog The Technium, back from 2007, shows us the dimensions of the One Machine: The next stage in human technological evolution is a single thinking/web/computer that is planetary in dimensions. This planetary computer will be the largest, most complex and most dependable machine we have ever built. It will also be the platform that most business and culture will run on.

Today it contains approximately 1.2 billion personal computers, 2.7 billion cell phones, 1.3 billion land phones, 27 million data servers, and 80 million wireless PDAs. The processor chips of all these parts are feeding the computation of the internet/web/telecommunications system. A very rough estimate of the computing power of this Machine then is that it contains a billion times a billion, or one quintillion (10^18), transistors. There are about 100 billion neurons in the human brain. Today the Machine has 5 orders of magnitude more transistors than you have neurons in your head. And the Machine, unlike your brain, is doubling in power every couple of years at the minimum.

If the Machine has 100 quadrillion transistors, how fast is it running? If we include spam, there are 196 billion emails sent every day. That's 2.2 million per second, or 2 megahertz. Every year 1 trillion text messages are sent. That works out to 31,000 per second, or 31 kilohertz. Each day 14 billion instant messages are sent, at 162 kilohertz. The number of searches runs at 14 kilohertz. Links are clicked at the rate of 520,000 per second, or .5 megahertz.

There are 20 billion visible, searchable web pages and another 900 billion dark, unsearchable, or deep web pages. The average number of links found on each searchable web page is 62. Assuming the same count for dynamic pages, that means there are 55 trillion links in the full web. We could think of each link as a synapse: a potential connection waiting to be made. There are roughly between 100 billion and 100 trillion synapses in the human brain, which puts the Machine in the same neighborhood as our brains.

We could start by saying the Machine currently has 1 HB (Human Brain) equivalent. That measure might hold up for a decade or so, but after it gets to 100 HB, or 10,000 HB, it begins to feel like using inches to measure galactic space. Check out Kevin Kelly's blog for the conclusions and more (wild?) ideas. How do you see the future of the Web?
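The "frequency" figures above are plain rate conversions: events per day divided by the 86,400 seconds in a day. For email, for instance:

\[
\frac{196 \times 10^{9}\ \text{emails/day}}{86{,}400\ \text{s/day}} \approx 2.27 \times 10^{6}\ \text{emails/s} \approx 2\ \text{MHz}
\]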
- One Machine and its dimensions on Kevin Kelly's blog The Technium
- Wired: We are the Web
- Kevin Kelly on the Future of the Web
- Kevin Kelly @ TED: Predicting the next 5000 days of the web
- John H. Holland: Emergence: From Chaos To Order
At 37 Signals, Joshua Sierles describes how they use Sprinkle to configure their servers within EC2. Sprinkle defines a domain-specific meta-language for describing and processing the installation of software. You can find an interesting discussion of Sprinkle's creation story by the creator himself, Marcus Crafter, in Sprinkle Some Powder!. Marcus divides provisioning tools into two categories:
OK, this interview is with me on Java scalability issues. I sound like a bigger idiot than I would like, but I suppose it could have been worse. The JavaWorld folks were very nice and did a good job, so there's no blame on them :-) The interview went in an interesting direction, but there's more I'd like to add, and I will do so here. Two major rules regarding Java and scalability have popped out at me:
How Java affects both Performance and Scalability

Peter Williams, in this very informative blog post, discusses how Java affects both performance and scalability. The main points are:
The Top 10 Ways to Botch Enterprise Java Application Scalability and Reliability

This is a wonderful presentation by Oracle's Cameron Purdy. Here's a PDF. Cameron was CEO of Tangosol before Oracle bought them out. Tangosol made Coherence, a distributed cache. Cameron is a long-time, prolific contributor to the Java community. His presentation is a must-see. He's both entertaining and technically excellent. The main points he makes in the presentation are:
Azul

Azul is a Java compute appliance and the ultimate scale-up play for Java. It does for the JVM at the hardware level roughly what Google App Engine does at the framework level. Current standard practice is to deploy a Java application across a cluster of commodity servers. Azul does the opposite: it goes big. The most recent release can contain up to 864 processor cores and 768 GB of memory. That's big. Azul transparently runs unmodified Java applications on their specialized hardware platform, which allows even the most mild-mannered of Java apps to scale. Their hardware-assisted garbage collector dramatically reduces application pauses and gives access to hundreds of gigabytes of RAM. Some very impressive performance improvements are possible. In one case study, Breakthrough Scalability of an Application Constrained to an x86 Server, an application was given access to 384 cores and 128 GB of memory on an Azul compute appliance. The result was a 45x improvement in scalability. Scalability was increased along a number of dimensions (quoted from their article):
X10

X10 is not just an inexpensive home automation control system. X10 is also:
A type-safe, modern, parallel, distributed object-oriented language intended to be very easily accessible to Java(TM) programmers. It is targeted to future low-end and high-end systems with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in scalable cluster configurations. A member of the Partitioned Global Address Space (PGAS) family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; constructs for termination detection (finish) and phased computation (clocks); the use of lock-free synchronization (atomic blocks); and the manipulation of global arrays and data structures. An Eclipse-based Integrated Development Environment (IDE) has been developed at IBM for X10 to help further increase programmer productivity by providing state-of-the-art functionality for viewing, editing, navigating, executing, and manipulating X10 programs.

X10 is built on top of Java. X10 adds:
Clojure

Clojure is a dynamic programming language that targets the Java Virtual Machine. It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection. Clojure is a dialect of Lisp, and shares with Lisp the code-as-data philosophy and a powerful macro system. Clojure is predominantly a functional programming language, and features a rich set of immutable, persistent data structures. When mutable state is needed, Clojure offers a software transactional memory system and reactive Agent system that ensure clean, correct, multithreaded designs.

Clojure doesn't fit my aging mental model. The message-passing actor model of Erlang is more my style. Interestingly, the difference between Erlang and Clojure is quite purposeful. Clojure wants to be efficient while operating in the same process rather than taking a message-passing hit for every operation. Clojure requires specifying an agent as the receiver of a message, where I prefer a more publish-subscribe approach in which message senders and consumers are independent. Clojure's use of Java threads makes latency difficult to control. And I'm not sure an S-expression-based language can ever become popular. But these are relatively minor issues compared to the task of making Java safe for parallelism. Java does OOP well enough, but sucks at concurrency. Clojure is a nice middle ground that may be able to make concurrency-oriented programming by real humans in Java a reality.

OSGi

OSGi is a solution that should make dynamic and high-availability deployment of Java web services a reality. OSGi is a dynamic module system for Java. Class loading done right. OSGi defines an architecture for developing and deploying modular applications and libraries by creating a microkernel-style architecture. There's a core set of modules that make up a basic platform, and new functionality is dynamically layered in with plugins. Using OSGi, these plugins are isolated, secured and controlled from the rest of the code. The unit of deployment is an OSGi bundle, which is simply a JAR file with an OSGi manifest. This approach allows loosely-coupled application modules to be developed by a team of developers. Everything is kept in sync using version numbers and module dependency ranges. If you've ever worked with Linux this should sound familiar. It's basically how packages are installed on Linux.
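As a minimal sketch of what a bundle looks like in practice (the class below is hypothetical, though BundleActivator and BundleContext are OSGi's actual lifecycle API), a bundle is a JAR whose manifest names an activator, and the framework calls start/stop as the bundle is dynamically installed, started, stopped, or updated:

```java
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

// Hypothetical bundle activator. The framework invokes these lifecycle
// hooks at runtime, which is what makes hot deployment of modules possible.
// The bundle's MANIFEST.MF would declare, among other headers:
//   Bundle-SymbolicName: com.example.hello
//   Bundle-Activator: com.example.hello.HelloActivator
public class HelloActivator implements BundleActivator {
    @Override
    public void start(BundleContext context) {
        System.out.println("started: " + context.getBundle().getSymbolicName());
    }

    @Override
    public void stop(BundleContext context) {
        System.out.println("stopped: " + context.getBundle().getSymbolicName());
    }
}
```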