By John Engales CTO, Rackspace. Good presentation of the stages a typical successful website goes through: * Stage 1 - The Beginning: Simple architecture, low complexity. no redundancy. Firewall, load balancer, a pair of web servers, database server, and internal storage. * Stage 2 - More of the same, just bigger. * Stage 3 - The Pain Begins: publicity hits. Use reverse proxy, cache static content, load balancers, more databases, re-coding. * Stage 4 - The Pain Intensifies: caching with memcached, writes overload and replication takes too long, start database partitioning, shared storage makes sense for content, significant re-architecting for DB. * Stage 5 - This Really Hurts!: rethink entire application, partition on geography user ID, etc, create user clusters, using hashing scheme for locating which user belongs to which cluster. * Stage 6 - Getting a little less painful: scalable application and database architecture, acceptable performance, starting to add ne features again, optimizing some code, still growing but manageable. * Stage 7 - Entering the unknown: where are the remaining bottlenecks (power, space, bandwidth, CDN, firewall, load balancer, storage, people, process, database), all eggs in one basked (single datacenter, single instance of data).
This article on scaling cookie baking recipes showed up in one my key word alerts. Lots of weird things show up in alerts, but I really like cookies and the parallels were just so delicious. Scaling in the cookie baking world is: the process of multiplying your recipe by many times to produce much more dough for many more cookies. It’s the difference between making enough dough in one batch to make two dozen cookies, or 2000 cookies. Hey, pretty close to the website notion. Yet as any good cook knows any scaled up recipe must be tweaked a little as things change at scale. Let's see what else we're supposed to do (quoted from the article):
CloudCamp is an interesting unconference where early adapters of Cloud Computing technologies exchange ideas. With the rapid change occurring in the industry, we need a place we can meet to share our experiences, challenges and solutions. At CloudCamp, you are encouraged you to share your thoughts in several open discussions, as we strive for the advancement of Cloud Computing. End users, IT professionals and vendors are all encouraged to participate. CloudCamp Silicon Valley 08 is scheduled for Tuesday, September 30, 2008 from 06:00 PM - 10:00 PM in Sun Microsystems' EBC Briefing Center 15 Network Circle Menlo Park, CA 94025 CloudCamp follows an interactive, unscripted unconference format. You can propose your own session or you can attend a session proposed by someone else. Either way, you are encouraged to engage in the discussion and “Vote with your feet”, which means … “find another session if you don’t find the session helpful”. Pick and choose from the conversations; rant and rave, or sit back and watch. At CloudCamp, we tend to discuss the following topics: * Infrastructure as a service (Joyent, Amazon Ec2, Nirvanix, etc) * Platform as a service (BungeeLabs, AppEngine, etc) * Software as a service (salesforce.com) * Application / Data / Storage (development in the cloud)
You want to have a scalable website. You want a website which can handle traffic spikes (think if you are getting on Digg, Slahsdot, Reddit, Techcrunch or other very popular websites frontpage). Regular hosting companies (especially shared hosting) can offer only so much. The servers usually get crushed under the load in short time. But there is hope. A new breed of hosting companies emerged recently. A new breed which can offer you the scalability you need at a fraction of the cost. Welcome to the world of “cloud computing!” (or “grid computing” or “utility computing”, which are terms for the same thing). Here's a website which compiled a list of cloud computing hosting companies (with short descriptions, prices and customer lists for each of them). Read the entire article about Cloud computing, grid computing, utility computing list at MyTestBox.com - web software reviews, news, tips & tricks.
How do we scale datacenters? Should we build a few mammoth million machine datacenters or many smaller micro datacenters? Intuitively we usually go with a bigger is better economies of scale type argument, but it may not be so. What works for Walmart may not work for White Box World. Mega datacenters may actually exhibit diseconomies of scale. It may be better to run applications over many distributed micro datacenters instead of one large one. This paper by Ken Church, Albert Greenberg, and James Hamilton, all from Microsoft, takes a look at the different issues and concludes:
Putting it all together, the micro model offers a design point with attractive performance, reliability, scale and cost. Given how much the industry is currently investing in the mega model, the industry would do well to consider the micro alternative.
Hi, I am very glad that this site exists, as I have learned more about clustering on this site than for quite some time reading stuff elsewhere. Oftentimes, one can find lots of material about clustering, but the practical real-life information is missing. Not so wih this site. I am currently planning the development of an application which has a lot of enterprise features and requirements. On the other side (if the tiny chance of success might strike us), this application would not be an in-house application of a financial institution, or something like that, but some kind of communit/web 2.0 web site. Thus it is an enterprise application with (hopefully, but surely unlikely) the user numbers of a social networking site. Each user initiated transaction involves huge resssources business logic wise (including insane amounts of encryption oprations). Of course, I do not intend to induldge into premature scaling, but to invest every minute I have into the implementation of business logic features. Nevertheless, I do not want to make some extremely bad choices which would force a complete reimplementation straight after the first tiny success - i.e. I want to start with the right technology and architecture, but wait with the implementation of the scalability and high availyability features. Because of the enterprise aspects of this software, my first thought was to use Java SE 6 and Java EE 5 technologies only in order to get all the JEE features and to be vendor independent at the same time. For implementation and testing purposes I thought of Glassfish v2UR2, Postgresql 8.3 and Solaris 10. As all of the major JEE-Appserver vendors advertise the clustering capabilities, I thought that this could not be a bad move. Hopefully, Glassfish would provide HA and scalability, if not there would always be Geronimo, JBoss, Weblogic, or Websphere. Now it seems that there are vast differences between different products: - JEE-Application servers are scaling only to some degree(?). It seems that JEE is almost exclusively used for enterprise applications like SAP ERP or applications at financial institutions? Therefore, there is no need for extreme scalability. - Terracotta seems to be very nice, as one do not have to learn the insanely huge JEE-technology stack, but can just write a mostly Java-SE-only threaded application(?). But Terracotta does not seem to scale very well either (bottleneck with write-operations caused by the master-worker architecture?) and we would be dependend on the future of the Terracotta Corporation. JEE on the other side is vendor neutral. - Oracle Coherence. This product seems to be the best distributed caching product and the holy grail of scalability(?). But it is oracle-expensive. Absolutely nothing for a tiny start-up with no financing. JEE is vendor neutral and thus possibly much cheaper. Do you think that it is possible that one could produce a JEE-Architecture which could provide massive scalability (many hundreds of AppServer) using only the Glassfish clustering features? Or am I on a completely wrong track? Do we have to plan for Oracle Coherence usage? Are there other possibilities? Thanks a lot for any opinions or hints! regards, mike
Func is used to manage a large network using bash or Python scripts. It targets easy and simple remote scripting and one-off tasks over SSH by creating a secure (SSL certifications) XMLRPC API for communication. Any kind of application can be written on top of it. Other configuration management tools specialize in mass configuration. They say here's what the machine should look like and keep it that way. Func allows you to program your cluster. If you've ever tried to securely remote script a gang of machines using SSH keys you know what a total nightmare that can be. Some example commands:
Using the command line: func "*.example.org" call yumcmd update Using the Pthon API: import func.overlord.client as fc client = fc.Client("*.example.org;*.example.com") client.yumcmd.update() client.service.start("acme-server") print client.hardware.info()Func may certainly overlap in functionality with other tools like Puppet and cfengine, but as programmers we always need more than one way to do it and definitely see how I could have used Func on a few projects.
Hello everyone, I'm designing a website/widget that my business partner and I expect to serve millions of hits daily. As such we must shard our database (and we're designing with shards in mind right from the beginning). However, the one thing I haven't been able to figure out from Googling is the best hardware to go with for shards. I'm using exclusively InnoDB tables. We'll (eventually) be running 3 groups of database servers: a) Session servers for php sessions. These will have a very high write volume. b) ID servers. These will match a couple primary indices (such as user ID) to a given shard. These will have an intense read load, plus a moderate amount of writes. c) Shard servers. These will hold the bulk of the data. These will have a high read load and a lowish write load. Group A is done as a database instead of using memcached so users aren't logged out if a memcached server goes down. As the write load is high, a pair of high performance master-master servers seems obvious. What's the ideal hardware setup for machines with this role? Maxed RAM and fast disks seem reasonable. Should I bother with RAID > 0 if I have a live backup on the other master? I hear 4 cores is optimal for InnoDB -- recommendations? Group B. Again, it looks like maxed RAM is recommended here. What about disks? Should I go for 10K or will regular SATA2 drives be okay? RAID 0, 5, 10? Cores? Should I think about slaves to a master-master setup? Group C. It seems to me these machines can be of any capacity because the data they hold is easily spread between shards. What is the query-per-second per dollar sweet spot when it comes to cores and number of disks? Should I beef these machines up, or stick with low end hardware? Should I still max the RAM? I have some other thoughts on system setup, too. As the data stored in the PHP sessions won't change frequently (it'll likely remain static for a user's entire visit -- all variable data can be stored in Group C shard servers), I'm thinking of using a memcached setup in front of the database and only pushing writes through to the database when necessary. Your thoughts? We're also starting this on a minimal budget (of course), so where in the above is it best spent? Keep in mind that I can recycle machines used in Group A & B in Group C as times goes on. Anyway, I'd love to hear from the expertise of the forum. I've been reading for a long time, and I'll be writing as our project evolves :) --Mark
We build web applications…and there are plenty of them around. Now, if we hit the jackpot and our application becomes very popular, traffic goes up, and our servers are brought down by the hordes of people coming to our website. What do we do in that situation? Of course, I am not talking here about the kind of traffic Digg, Yahoo Buzz or other social media sites can bring to a website, which is temporary overnight traffic, or a website which uses cloud computing like Amazon EC2 service, MediaTemple Grid Service or Mosso Hosting Cloud service. I am talking about traffic that consistently increases over time as the service achieves success. Google.com, Yahoo.com, Myspace.com, Facebook.com, Plentyoffish.com, Linkedin.com, Youtube.com and others are examples of services which have constant high traffic. Knowing that users want speed from their applications, these services will always use a Content Delivery Network (CDN) to deliver that speed. What is a Content Delivery Network? A Content Delivery Network (CDN) is a collection of web servers distributed across multiple locations to deliver content more efficiently to users. The server selected for delivering content to a specific user is typically based on a measure of network proximity. For example, the server with the fewest network hops or the server with the quickest response time is chosen. This will help scaling a web application by taking a part of the load from the service servers. Read the entire article about Content Delivery Networks (CDN) list of providers at MyTestBox.com - web software reviews, news, tips & tricks.
In the era of Web 2.0 traditional approaches to capacity planning are often difficult to implement. Guerrilla Capacity Planning facilitates rapid forecasting of capacity requirements based on the opportunistic use of whatever performance data and tools are available. One unique Guerrilla tool is Virtual Load Testing, based on Dr. Gunther's "Universal Law of Computational Scaling", which provides a highly cost-effective method for assessing application scalability. Neil Gunther, M.Sc., Ph.D. is an internationally recognized computer system performance consultant who founded Performance Dynamics Company in 1994. Some reasons why you should understand this law: 1. A lot of people use the term "scalability" without clearly defining it, let alone defining it quantitatively. Computer system scalability must be quantified. If you can't quantify it, you can't guarantee it. The universal law of computational scaling provides that quantification. 2. One the greatest impediments to applying queueing theory models (whether analytic or simulation) is the inscrutibility of service times within an application. Every queueing facility in a performance model requires a service time as an input parameter. No service time, no queue. Without the appropriate queues in the model, system performance metrics like throughtput and response time, cannot be predicted. The universal law of computational scaling leapfrogs this entire problem by NOT requiring ANY low-level service time measurements as inputs. The universal scalability model is a single equation expressed in terms of two parameters α and β. The relative capacity C(N) is a normalized throughput given by: C(N) = N / ( 1 + αN + βN (N − 1) ) where N represents either: 1. (Software Scalability) the number of users or load generators on a fixed hardware configuration. In this case, the number of users acts as the independent variable while the CPU configuration remains constant for the range of user load measurements. 2. (Hardware Scalability) the number of physical processors or nodes in the hardware configuration. In this case, the number of user processes executing per CPU (say 10) is assumed to be the same for every added CPU. Therefore, on a 4 CPU platform you would run 40 virtual users. with `α' (alpha) the contention parameter, and `β' (beta) the coherency-delay parameter. This model has wide-spread applicability, including:
- Accounts for such effects as VM thrashing, and cache-miss latencies.
- Can also be used to model disk arrays, SANs, and multicore processors.
- Can also be used to model certain types of network I/O
- The user-load form is the most common application of eqn.
- Can be used in combination with measurement tools like LoadRunner, Benchmark Factory, etc.