We build web applications…and there are plenty of them around. If we hit the jackpot and our application becomes very popular, traffic goes up, and our servers are brought down by the hordes of people arriving at our website. What do we do in that situation? Of course, I am not talking here about the kind of traffic Digg, Yahoo Buzz, or other social media sites can bring to a website, which is temporary overnight traffic, nor about a website that uses cloud computing like the Amazon EC2 service, MediaTemple Grid Service, or Mosso Hosting Cloud service. I am talking about traffic that consistently increases over time as the service achieves success. Google.com, Yahoo.com, Myspace.com, Facebook.com, Plentyoffish.com, Linkedin.com, Youtube.com, and others are examples of services with constant high traffic. Knowing that users want speed from their applications, these services will always use a Content Delivery Network (CDN) to deliver that speed. What is a Content Delivery Network? A Content Delivery Network (CDN) is a collection of web servers distributed across multiple locations to deliver content more efficiently to users. The server selected for delivering content to a specific user is typically chosen based on a measure of network proximity: for example, the server with the fewest network hops or the quickest response time. This helps scale a web application by taking part of the load off the service's own servers. Read the entire article, including a list of CDN providers, at MyTestBox.com - web software reviews, news, tips & tricks.
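To make the "network proximity" idea concrete, here is a toy Python sketch that picks the edge server with the quickest response time. Real CDNs make this decision via DNS resolution or anycast routing rather than client-side probing, and the hostnames below are hypothetical:

```python
import socket
import time

# Hypothetical list of CDN edge servers; a real CDN resolves the
# nearest edge via DNS or anycast, not by probing from the client.
EDGE_SERVERS = ["edge-us.example-cdn.com",
                "edge-eu.example-cdn.com",
                "edge-asia.example-cdn.com"]

def fastest_edge(servers, port=80, timeout=1.0):
    """Return the server with the quickest TCP connect time --
    a crude stand-in for a CDN's network-proximity measure."""
    best, best_rtt = None, float("inf")
    for host in servers:
        try:
            start = time.monotonic()
            socket.create_connection((host, port), timeout=timeout).close()
            rtt = time.monotonic() - start
            if rtt < best_rtt:
                best, best_rtt = host, rtt
        except OSError:
            continue  # skip unreachable or slow edges
    return best
```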
In the era of Web 2.0, traditional approaches to capacity planning are often difficult to implement. Guerrilla Capacity Planning facilitates rapid forecasting of capacity requirements based on the opportunistic use of whatever performance data and tools are available. One unique Guerrilla tool is Virtual Load Testing, based on Dr. Gunther's "Universal Law of Computational Scaling", which provides a highly cost-effective method for assessing application scalability. Neil Gunther, M.Sc., Ph.D. is an internationally recognized computer system performance consultant who founded Performance Dynamics Company in 1994.

Some reasons why you should understand this law:

1. A lot of people use the term "scalability" without clearly defining it, let alone defining it quantitatively. Computer system scalability must be quantified. If you can't quantify it, you can't guarantee it. The universal law of computational scaling provides that quantification.

2. One of the greatest impediments to applying queueing theory models (whether analytic or simulation) is the inscrutability of service times within an application. Every queueing facility in a performance model requires a service time as an input parameter. No service time, no queue. Without the appropriate queues in the model, system performance metrics like throughput and response time cannot be predicted. The universal law of computational scaling leapfrogs this entire problem by NOT requiring ANY low-level service time measurements as inputs.

The universal scalability model is a single equation expressed in terms of two parameters, α and β. The relative capacity C(N) is a normalized throughput given by:

C(N) = N / ( 1 + α(N − 1) + βN(N − 1) )

where N represents either:

1. (Software Scalability) the number of users or load generators on a fixed hardware configuration. In this case, the number of users acts as the independent variable while the CPU configuration remains constant for the range of user-load measurements.

2. (Hardware Scalability) the number of physical processors or nodes in the hardware configuration. In this case, the number of user processes executing per CPU (say 10) is assumed to be the same for every added CPU. Therefore, on a 4-CPU platform you would run 40 virtual users.

Here α (alpha) is the contention parameter and β (beta) is the coherency-delay parameter. This model has widespread applicability (a small numerical sketch follows the list below), including:
- Accounts for such effects as VM thrashing and cache-miss latencies.
- Can also be used to model disk arrays, SANs, and multicore processors.
- Can also be used to model certain types of network I/O.
- The user-load form is the most common application of the equation.
- Can be used in combination with measurement tools like LoadRunner, Benchmark Factory, etc.
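As a small numerical sketch of the law, here is the equation in Python with illustrative, made-up parameter values; in practice α and β would be fitted by regression against measured throughput data:

```python
# Universal Scalability Law: relative capacity C(N) given a contention
# parameter alpha and a coherency-delay parameter beta.
def usl_capacity(n, alpha, beta):
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Illustrative parameters (not from any measured system):
# 2% contention and a small coherency delay.
for n in (1, 8, 16, 32, 64, 128):
    print(f"N={n:4d}  C(N)={usl_capacity(n, 0.02, 0.0001):6.2f}")
```

Note how capacity stops growing linearly and, with these values, peaks near N = sqrt((1 − α)/β) ≈ 99 before retrograding as the coherency term βN(N − 1) comes to dominate.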
With Tungsten Replicator, Continuent is trying to deliver a better master/slave replication system. Their goals: scalability, reliability with seamless failover, and no performance loss. From their website:

The Tungsten Replicator implements open source, database-neutral master/slave replication. Master/slave replication is a highly flexible technology that can solve a wide variety of problems, including the following:

* Availability - Failing over to a slave database if your master database dies
* Performance Scaling - Spreading reads across many copies of data
* Cross-Site Clustering - Maintaining active database replicas across WANs
* Change Data Capture - Extracting changes to load data warehouses or update other systems
* Zero Downtime Upgrade - Performing upgrades on a slave server which then becomes the master

The Tungsten Replicator architecture is flexible and designed to support the easy addition of new databases. It includes pluggable extractor and applier modules to help transfer data from master to slave. The Replicator also includes a number of specialized features designed to improve its usefulness for particular problems like availability:

* Replicated changes have transaction IDs and are stored in a transaction history log that is identical for each server. This feature allows masters and slaves to exchange roles easily.
* Smooth procedures for planned and unplanned failover.
* Built-in consistency check tables and events allow users to check consistency between tables without stopping replication or applications.
* Support for statement as well as row replication.
* Hooks to allow data transformations when replicating between different database types.

Tungsten Replicator is not a toy. It is designed to allow commercial construction of robust database clusters.
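The pluggable extractor/applier design is easy to picture in code. Below is a minimal, hypothetical Python sketch of that flow; the class and method names are mine for illustration, not Tungsten's actual API:

```python
# Sketch of the extractor -> transaction history log -> applier pipeline
# described above. Names are illustrative, not Tungsten's real interfaces.
class Extractor:
    """Pulls committed changes from the master, tagged with transaction IDs."""
    def extract(self):
        raise NotImplementedError  # e.g. read the MySQL binlog

class Applier:
    """Applies logged changes to a slave, possibly a different database type."""
    def apply(self, event):
        raise NotImplementedError  # e.g. execute statements on PostgreSQL

def replicate(extractor, history_log, applier):
    # Every event carries a transaction ID and lands in a history log that
    # is identical on every server, which is what lets a slave be promoted
    # to master during failover.
    for event in extractor.extract():
        history_log.append(event)
        applier.apply(event)
```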
We will be developing an RIA that will have a lot of database access. Think something like QuickBooks, but with about 50 transactions entered per hour per user. Users will be in the system for 7 to 9 hours a day and there will be around 20,000 users, all logged in at the same time. Reporting will be done just like in a QuickBooks-style app, plus a lot of extra things you don't do in QuickBooks. Our operations team is familiar with Windows Server 2003 and MS SQL Server, so they are recommending we stick with that. I originally requested Linux and PostgreSQL. How far can a single database server get me? If we have a 4-processor, 8-core, 128 GB server, how far am I going to get before I need to shard or do something else? I know there are a lot of factors involved, but in general for a site of this size, what should the strategy be? I've read almost all the articles on this website, but most of the applications are not RIA-type apps with this kind of usage, or they are architectures for sites with millions of users, which we also won't have.
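For what it's worth, a quick back-of-envelope on my own numbers:

```python
# Rough steady-state load implied by the figures above (my arithmetic
# only, not a capacity estimate for any particular server).
users = 20_000
tx_per_user_per_hour = 50
tx_per_sec = users * tx_per_user_per_hour / 3600
print(f"~{tx_per_sec:.0f} transactions/second sustained")  # ~278
```

So the question is really: at what sustained write rate (plus the reporting reads on top) does a single box of that size run out of headroom?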
Disco is an open-source implementation of the MapReduce framework for distributed computing. It was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. The Disco core is written in Erlang. The MapReduce jobs in Disco are natively described as Python programs, which makes it possible to express complex algorithmic and data processing tasks, often in only tens of lines of code.
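As an illustration of how compact these jobs are, here is a word-count job closely modeled on Disco's canonical example (exact API details may vary between Disco versions):

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Group the sorted pairs by word and sum the counts.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```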
I came across an interesting study about who the leaders in the open source content management systems market were in 2008. The study was just released to the public, and it was conducted by Ric Shreves of the web development company Water & Stone. At 50 pages, there is a significant amount of data in this study that should be of use to developers or to anyone who is looking to commit to a web publishing system (also known as a Content Management System). Read the entire article about the 2008 open source content management systems market leaders at MyTestBox.com - web software reviews, news, tips & tricks.
Kim Nash, in an interview with Jonathan Heiliger, Facebook's VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently, as it is usually an "ontogeny recapitulates phylogeny" situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers, you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?
It sounds like a relatively fun environment for pushing software live. Getting software moved into production is often harder than the original coding and testing. Now I know what you are thinking. You somehow managed to procure the ssh login. So just log in remotely and do the install yourself! Nobody will know. Oh so tempting. But it's not really good corporate citizenship. And you just might screw up, and then there will be some esplaining to do.
Emphasizing frequent releases and gutsy release policies makes it actually seem like someone is supporting developers instead of treating them like their software carries the plague. Data centers are often treated like quarantine stations, and developers are treated like asymptomatic carriers of some unknown virulent disease. To be safe, nothing should ever change, but that's not an attitude that makes things better. Nice to see that recognized.
To set up or not to set up a separate operations group? Facebook says "to be" and creates a separate group. Amazon says "not to be" and has developers support their own software. Secretly I think Amazon gets better results by requiring developers to support their own software. Knowing it may be you getting the "It's Down!" call gives one proper perspective. But I like not being on call, and I think most developers agree. Plus, the idea of "following the sun" to get 24-hour support is a smart one.
Hi everyone, I'm researching scalability for a college paper, and I've found this site great, but it has so many tips, articles, and the like that I can't see a hierarchical organization of subjects. I would need something like a checklist of things, fields, or technologies to take into account when assessing scalability. So far I've identified these:

- Hardware scalability:
  - scale out
  - scale up
- Cache (what types of cache are there? app-level, OS-level, network-level, I/O-level?)
- Load Balancing
- DB Clustering

Am I missing something important? (I'm sure I am.) I don't expect you to give a lecture here, but maybe point some things out or give me some useful links... Thanks!
I found the discussion of the available bandwidth of tree vs. higher-dimensional virtual network topologies quite, to quote Spock, fascinating: A mathematical analysis by Ritter (2002) (one of the original developers of Napster) presented a detailed numerical argument demonstrating that the Gnutella network could not scale to the capacity of its competitor, the Napster network. Essentially, that model showed that the Gnutella network is severely bandwidth-limited long before the P2P population reaches a million peers. In each of these previous studies, the conclusions have overlooked the intrinsic bandwidth limits of the underlying topology in the Gnutella network: a Cayley tree (Rains and Sloane 1999) (see Sect. 9.4 for the definition). Trees are known to have lower aggregate bandwidth than higher-dimensional topologies, e.g., hypercubes and hypertori. Studies of interconnection topologies in the literature have tended to focus on hardware implementations (see, e.g., Culler et al. 1996; Buyya 1999), which are generally limited by the cost of the chips and wires to a few thousand nodes. P2P networks, on the other hand, are intended to support from hundreds of thousands to millions of simultaneous peers, and since they are implemented in software, hyper-topologies are relatively unfettered by the economics of hardware. In this chapter, we analyze the scalability of several alternative topologies and compare their throughput up to 2-3 million peers. The virtual hypercube and the virtual hypertorus offer near-linear scalable bandwidth subject to the number of peer TCP/IP connections that can be simultaneously kept open.
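One way to see the tree's limitation is bisection width: the minimum number of links you must cut to split a network into two equal halves, which bounds aggregate cross-network bandwidth. Here is a toy Python sketch of that comparison under the standard interconnection-network definitions; it's illustrative only, and much cruder than the book's analysis:

```python
# Bisection width bounds how much traffic can flow between the two
# halves of a network at once.
def tree_bisection(peers):
    # Any balanced tree splits by cutting a single link near the root,
    # so all traffic between the halves funnels through one edge.
    return 1

def hypercube_bisection(peers):
    # A d-dimensional hypercube (2**d peers) splits into two (d-1)-cubes
    # joined by 2**(d-1) parallel links, i.e. half the peer count.
    return peers // 2

for n in (2**10, 2**20):  # roughly a thousand and a million peers
    print(f"{n:>8} peers: tree bisection = {tree_bisection(n)}, "
          f"hypercube bisection = {hypercube_bisection(n)}")
```

The tree's cross-network bandwidth stays constant no matter how many peers join, while the hypercube's grows with the population, which is the intuition behind the near-linear scalability claimed for the virtual hyper-topologies.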
ScaleOut StateServer is an in-memory distributed cache across a server farm or compute grid. Unlike middleware vendors' products, StateServer aims at being a very good data cache; it doesn't try to handle job scheduling as well. StateServer is what you might get when you take Memcached and merge in all the value-added distributed caching features you've ever dreamed of. True, Memcached is free and ScaleOut StateServer is very far from free, but for those looking for a satisfying out-of-the-box experience, StateServer may be just the caching solution you are looking for. Yes, "solution" is one of those "oh my God I'm going to pay through the nose" indicator words, but it really applies here. Memcached is a framework, whereas StateServer has already prepackaged most features you would otherwise need to add through your own programming efforts.

Why use a distributed cache? Because it combines the holy quadrinity of computing: better performance, linear scalability, high availability, and fast application development. Performance is better because data is accessed from memory instead of through a database to a disk. Scalability is linear because as more servers are added, data is transparently load balanced across the servers, so there is automated in-memory sharding. Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure. Application development is faster because there's only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer, which means all a developer has to do is get and put data.

StateServer follows the "RAM is the new disk" credo. Memory is assumed to be the system of record, not the database. If you want data to be stored in a database and have the two kept in sync, then you'll have to add that layer yourself. All the standard Memcached techniques should work just as well for StateServer. Consider, however, that a database layer may not be needed. Reliability is handled by StateServer because it keeps multiple copies of data, reroutes on failure, and has an option for geographical distribution for another layer of added safety. Storing to disk wouldn't make you any safer.

Via email I asked them a few questions. The key question was how they stacked up against Memcached. As that is surely one of the more popular challenges they would get in any sales cycle, I was very curious about their answer. And they did a great job differentiating themselves. What did they say? First, for an in-depth discussion of their technology take a look at ScaleOut Software Technology, but here are a few of the highlights: