Why doesn't anyone use j2ee?

From a reader:

> Was reading through your very interesting/useful site. >Most of the architectures are non j2ee-Does that mean that >there aren't enough websites that are scalable(with youtube > like userbase) built with j2ee tech-would like to know if there > are any and their architecture as >well.
eBay uses Java, but in a very pragmatic way. They use servlets, an application server, the JDK, and they do the rest themselves. They skip JSP, entity beans, and JMS. When you need to scale putting all your eggs in one basket is a risky strategy. Why use JSP when you can do better? When use entity beans when you can do better? Use servlets because they are a very effective way of handling http requests. Use Java because it is fast, runs everywhere, and has a boat load of libraries you can use to build your build your custom system. Probably the major reason J2EE is absentee is simply LAMP. LAMP is just so incredibly functional for most 2-tier shared nothing sites they don't need a better infrastructure for writing an application tier. Personally, I pretty excited about GWT which uses Java and servlets. We'll see if that starts to take off a little bit more.

Click to read more ...


2 tier switch selection for colocation

Hi, I am interested in some experienced advice for choosing switches for a colocated 2-tier architecture. I have the hardware chosen for the webservers, app servers, and db servers, but need some advice on the network switch in between: colocation port -> firewall(load balancer) -> 2+ web servers (app servers) -> gigabit switch -> DB server(possibly cluster for future expansion) the question is that I am just starting out, i wonder which rackmount gigabit switch to select for the private LAN between the app server -> DB servers. Do I need managed for that? Cisco switches are the best, but they are the most expensive...I am looking at possibly using Dell/Netgear gigabit switches. Thanks for any input

Click to read more ...


On-Demand Infinitely Scalable Database Seed the Amazon EC2 Cloud

Amazon's EC2 sounds good, but how do you make use of all that throbbing CPU power? A few companies are stepping up to fill the how-to gap. Elastra provides unlimited on-demand creation of MySQL and PostgresSQL instances for $.50/server/hour. They contend their clusters perform "nearly" as well as a local database deployed using local storage. RightScale says they "enable you to run your entire web business on Amazon Web Services with reliability, scalability and performance – and pushbutton control of complex system administration tasks." This includes web servers, DNS, and MySQL services. Prices start at $500 a month. Later I'll write more about these and other related services like 3tera, but these services are the canary in the coal mine, the face of change, the bellwether of the new data center. How we build scalable web sites is about to change.

Click to read more ...


Log Everything All the Time

This JoelOnSoftware thread asks the age old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done. To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging deathknell: it's an intermittent problem and customers are pissed. Fix it. Now. So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few networks switches and routers along the way. Where did the fault happen? What went wrong? All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all. What you need to be able to do is trace though all relevant logs, pull together a time line of all relevant operations, and see what happened. And this is where trace/info etc is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function. Along with anything else interesting it did. So there are really no logging levels. You need to log everything that will help you diagnose any future problem. What you really need is a time machine, but you don't have one. But if you log enough state you can mimic a time machine. This is what will allow you to follow a request from start to finish and see if what you expect to be happening is actually happening. Did an interface drop a packet? Did a reply timeout? Is a mutex on perma-lock? So many things can go wrong. Over time systems usually evolve to the point of logging everything. They start with little or no logging. Then problem by problem they add more and more logging. But the problem is the logging isn't systematic or well thought out, which leads to poor coverage and poor performance. Logs are where you find anomalies. An anomaly is something unexpected, like operations happening that you didn't expect, in a different order than expected, or taking longer than expected. Anomalies have always driven science forward. Finding and fixing them will help make your system better too. They expose flaws you might not otherwise see. They broaden you understanding of how your system really responds to the world. So step back and take a look at what you need to debug problems in the field. Don't be afraid to add what you need to see how your system actually works. For example, every request needs to have assigned to it a globally unique sequence number that is passed with every operation related to the request so all work for a request can be tied together. This will allow you to trace the comment add from the client all the way through the system. Usually when looking at log data you have no idea what work maps to which request. Once you know that debugging becomes a lot easier. Every hop a request takes should log meta information about how long the request took to process, how big the request was, what the status of the request was. This will help you pinpoint latency issues and any outliers that happen with big messages. If you do this correctly you can simulate the running of system completely from this log data. I am not being completely honest when I say there are no debugging levels. There are two levels: system and developer. System is logging everything you need to log to debug the system. It is never turned off. There is no need. System logging is always on. Developers can add more detailed log levels for their code that can be turned on and off on a module by module basis. For example, if you have a routing algorithm you may only want to see the detailed logging for that on occasion. The trick is there are no generic info type debug levels. You create a named module in your software with a debug level for tracing the routing algorithm. You can turn that on when you want and only that feature is impacted. I usually have a configuration file with initial debug levels. But then I make each process have a command port hosting a simple embedded web server and telnet processor so you can change debug levels and other setting on the fly through the web or telnet interface. This is pretty handy in the field and during development. I can hear many of you saying this is too inefficient. We could never log all that data! That's crazy! No true. I've worked on very sensitive high performance real-time embedded systems where every nanosecond was dear and they still had very high levels of logging, even in driver land. It's in how you do it. You would be right if you logged everything within the same thread directly to disk. Then you are toast. It won't ever work. So don't do that. There are lots of tricks you can use to make logging fast enough that you can do it all the time:

  • Make logging efficient from the start so you aren't afraid to use it.
  • Create a dead simple to use log library that makes logging trivial for developers. Document it. Provide example code. Check for it during code reviews.
  • Log to a separate task and let the task push out log data when it can.
  • Use a preallocated buffer pool for log messages so memory allocation is just pop and push.
  • Log integer values for very time sensitive code.
  • For less time sensitive code sprintf'ing into a preallocated buffer is usually quite fast. When it's not you can use reference counted data structures and do the formatting in the logging thread.
  • Triggering a log message should take exactly one table lookup. Then the performance hit is minimal.
  • Don't do any formatting before it is determined the log is needed. This removes constant overhead for each log message.
  • Allow fancy stream based formatting so developers feel free to dump all the data they wish in any format they wish.
  • In an ISR context do not take locks or you'll introduce unbounded variable latency into the system.
  • Directly format data into fixed size buffers in the log message. This way there is no unavoidable overhead.
  • Make the log message directly queueable to the log task so queuing doesn't take more memory allocations. Memory allocation is a primary source of arbitrary latency and dead lock because of the locking. Avoid memory allocation in the log path.
  • Make the logging thread a lower priority so it won't starve the main application thread.
  • Store log messages in a circular queue to limit resource usage.
  • Write log messages to disk in big sequential blocks for efficiency.
  • Every object in your system should be dumpable to a log message. This makes logging trivial for developers.
  • Tie your logging system into your monitoring system so all the logging data from every process on every host winds its way to your centralized monitoring system. At the same time you can send all your SLA related metrics and other stats. This can all be collected in the back ground so it doesn't impact performance.
  • Add meta data throughout the request handling process that makes it easy to diagnose problems and alert on future potential problems.
  • Map software components to subsystems that are individually controllable, cross application trace levels aren't useful.
  • Add a command ports to processes that make it easy to set program behaviors at run-time and view important statistics and logging information.
  • Log information like task switch counts and times, queue depths and high and low watermarks, free memory, drop counts, mutex wait times, CPU usage, disk and network IO, and anything else that may give a full picture of how your software is behaving in the real world. In large scale distributed systems logging data is all you have to debug most problems. So log everything all the time and you may still get that call at 2AM, but at least you'll know you'll have a fighting chance to fix any problems that do come up.

    Click to read more ...

  • Wednesday

    Skype Failed the Boot Scalability Test: Is P2P fundamentally flawed?

    Skype's 220 millions users lost service for a stunning two days. The primary cause for Skype's nightmare (can you imagine the beeper storm that went off?) was a massive global roll-out of a Window's patch triggering the simultaneous reboot of millions of machines across the globe. The secondary cause was a bug in Skype's software that prevented "self-healing" in the face of such attacks. The flood of log-in requests and a lack of "peer-to-peer resources" melted their system. Who's fault is it? Is Skype to blame? Is Microsoft to blame? Or is the peer-to-peer model itself fundamentally flawed in some way? Let's be real, how could Skype possibly test booting 220 million servers over a random configuration of resources? Answer: they can't. Yes, it's Skype's responsibility, but they are in a bit of a pickle on this one. The boot scenario is one of the most basic and one of the most difficult scalability scenarios to plan for and test. You can't simulate the viciousness of real-life conditions in a lab because only real-life has the variety of configurations and the massive resources needed to simulate itself. It's like simulating the universe. How do you simulate the universe if the computational matrix you need is the universe itself? You can't. You end up building smaller models and those models sometimes fail. I worked at set-top company for a while and our big boot scenario was the restart of entire neighbor hoods after a power failure. To make an easy upgrade path, each set-top downloaded their image from the head-end on boot, only a boot image was in EEPROM. This is a very stressful scenario for the system. How do you test it? How do you test thousands of booting set-tops when they don't even exist yet? How do you test the network characteristics of a cable system in the lab? How do you design a system not to croak under the load? Cleverness. One part of the solution was really cool. The boot images were continually broadcast over the network so each set-top would pick up blocks of the boot image. The image would be stitched together from blocks rather than having thousands of boxes individually download images, which would never work. This massively reduced the traffic over the network. Clever tricks like this can get you a long ways. Work. Great pools of workstations were used simulate set-tops and software was made to insert drops and simulate asymmetric network communications. But how could we ever simulate 220 million different users? Then, no way. Maybe now you could use grid services like Amazon's EC2. Help from your friends. Microsoft is not being a good neighbor. They should roll out updates at a much more gradual rate so these problems don't happen. Booting loads networks, taxes CPUs, fills queues, drops connections, stresses services, increases process switching, drops packets, encourages dead lock, steals RAM and file descriptors and other resources. So it would be nice if MS was smarter about their updates. But since you can't rely on such consideration, you always have to handle the load. I assume they used exponential backoff algorithms to limit login attempts, but with so many people this probably didn't matter. Perhaps they could insert a random wait to smooth out login traffic. But again, with so many people it probably won't matter. Perhaps they could stop automatic logins on boot? That would solve the problem at the expense of user convenience. No go. Perhaps their servers could be tuned to accept connections at a fast rate yet condition how quickly they respond to the rest of the login process? Not good enough I suppose. So how did Skype fix their problem? They explain it here : The parameters of the P2P network have been tuned to be smarter about how similar situations should be handled. Once we found the algorithmic fix to ensure continued operation in the face of high numbers of client reboots, the efforts focused squarely on stabilizing the P2P core. The fix means that we’ve tuned Skype’s P2P core so that it can cope with simultaneous P2P network load and core size changes similar to those that occurred on August 16. Whenever I see the word "tune" I get the premonition shivers. Tuning means you are just one unexpected problem away from being out of tune and your perfectly functioning symphony sounding like a band of percussion happy monkeys. Tuned things break under change. Tweak the cosmological constant just a little and wham, there's no human life. It needs to work by design. Or it needs to be self-adaptive and not finessed by human hands for each new disaster scenario. And this is where we get into the nature of P2P. Would the same problem have happened in a centralized architecture with resources spread strategically throughout the globe and automatic load balancing between different data centers? In a centralized model would it have been easier to bring more resources on line to handle the load? Would the outage have been easier to diagnose and last a much shorter amount of time? There are of course no definitive answers to these questions. But many of the web's most successful systems like YouTube, Amazon, Ebay, Google, GoogleTalk, and Flickr use a centralized model. They handle millions of users and massive amounts of content and have pretty good reliability records. Does P2P bring enough to the architecture that you should build a system around it? That to me is the interesting question that arises out of this incident.

    Related Articles

  • Vanilla Skype Part 2. This document gives a detailed explanation of Skype's supernode architecture and details the weakness of using your end users as your redundancy strategy.

    Click to read more ...

  • Tuesday

    Google Utilities : An online google guide,tools and Utilities.

    An online google guide,tools and Utilities to use with the Google search engine.Free google tools and utilities.

    Click to read more ...


    Product: Varnish

    Varnish is a state-of-the-art, high-performance HTTP accelerator. Varnish is targeted primarily at the FreeBSD 6 and Linux 2.6 platforms, and will take full advantage of the virtual memory system and advanced I/O features offered by these operating systems. Varnish was written from the ground up to be a high performance caching reverse proxy. Squid is a forward proxy that can be configured as a reverse proxy. Besides - Squid is rather old and designed like computer programs where supposed to be designed in 1980. Varnish is reported to be 10x-20x faster than Squid on the same hardware.

    Click to read more ...


    Postgresql on high availability websites?

    I was looking at the pingdom infrastructure matrix ( and I saw that no sites are using Postgresql, and then I searched through and saw very few mentions of postgresql. Are there any examples of high-traffic sites that use postgresql? Does anyone have any experience with it? I'm having trouble finding good, recent studies of postgres (and postgres compared w/ mysql) online.

    Click to read more ...


    Profiling WEB applications

    Hi, Some of the articles of the site claims profiling is essential. Is there any established approach to profiling WEB apps? Or it too much depends on technologies used?

    Click to read more ...


    Wikimedia architecture

    Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built on. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site:

    Information Sources

  • Wikimedia architecture
  • scale-out vs scale-up in the from Oracle to MySQL blog.


  • Apache
  • Linux
  • MySQL
  • PHP
  • Squid
  • LVS
  • Lucene for Search
  • Memcached for Distributed Object Cache
  • Lighttpd Image Server

    The Stats

  • 8 million articles spread over hundreds of language projects (english, dutch, ...)
  • 10th busiest site in the world (source: Alexa)
  • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
  • 30 000 HTTP requests/s during peak-time
  • 3 Gbit/s of data traffic
  • 3 data centers: Tampa, Amsterdam, Seoul
  • 350 servers, ranging between 1x P4 to 2x Xeon Quad-Core, 0.5 - 16 GB of memory
  • managed by ~ 6 people
  • 3 clusters on 3 different continents

    The Architecture

  • Geographic Load Balancing, based on source IP of client resolver, directs clients to the nearest server cluster. Statically mapping IP addresses to countries to clusters
  • HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
  • 55 Squid servers currently, plus 20 waiting for setup.
  • 1,000 HTTP requests/s per server, up to 2,500 under stress
  • ~ 100 - 250 Mbit/s per server
  • ~ 14 000 - 32 000 open connections per server
  • Up to 40 GB of disk caches per Squid server
  • Up to 4 disks per server (1U rack servers)
  • 8 GB of memory, half of that used by Squid
  • Hit rates: 85% for Text, 98% for Media, since the use of CARP.
  • PowerDNS provides geographical distribution.
  • In their primary and regional data center they build text and media clusters built on LVS, CARP Squid, Cache Squid. In the primary datacenter they have the media storage.
  • To make sure the latest revision of all pages are served invalidation requests are sent to all Squid caches.
  • One centrally managed & synchronized software installation for hundreds of wikis.
  • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
  • Hardware shared with External Storage and Memcached tasks
  • Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
  • Actual revision text is stored as blobs in External storage
  • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
  • Separate database per wiki (not separate server!)
  • One master, many replicated slaves
  • Read operations are load balanced over the slaves, write operations go to the master
  • The master is used for some read operations in case the slaves are not yet up to date (lagged)
  • External Storage - Article text is stored on separate data storage clusters, simple append-only blob storage. Saves space on expensive and busy core databases for largely unused data - Allows use of spare resources on application servers (2x 250-500 GB per server) - Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability

    Lessons Learned

  • Focus on architecture, not so much on operations or nontechnical stuff.
  • Sometimes caching costs more than recalculating or looking up at the data source...profiling!
  • Avoid expensive algorithms, database queries, etc.
  • Cache every result that is expensive and has temporal locality of reference.
  • Focus on the hot spots in the code (profiling!).
  • Scale by separating: - Read and write operations (master/slave) - Expensive operations from cheap and more frequent operations (query groups) - Big, popular wikis from smaller wikis
  • Improve caching: temporal and spatial locality of reference and reduces the data set size per server
  • Text is compressed and only revisions between articles are stored.
  • Simple seeming library calls like using stat to check for a file's existence can take too long when loaded.
  • Disk seek I/O limited, the more disk spindles, the better!
  • Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
  • It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
  • Anyone who wants to design their database architecture so that it'll allow them to inexpensively grow from one box rank nothing to the top ten or hundred sites on the net should start out by designing it to handle slightly out of date data from replication slaves, know how to load balance to slaves for all read queries and if at all possible to design it so that chunks of data (batches of users, accounts, whatever) can go on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!

    Click to read more ...