Wednesday
Aug 22, 2007

Wikimedia architecture

Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/

Information Sources

  • Wikimedia architecture
  • http://meta.wikimedia.org/wiki/Wikimedia_servers
  • Scale-out vs scale-up, in the From Oracle to MySQL blog.

    Platform

  • Apache
  • Linux
  • MySQL
  • PHP
  • Squid
  • LVS
  • Lucene for Search
  • Memcached for Distributed Object Cache
  • Lighttpd Image Server

    The Stats

  • 8 million articles spread over hundreds of language projects (English, Dutch, ...)
  • 10th busiest site in the world (source: Alexa)
  • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
  • 30 000 HTTP requests/s during peak-time
  • 3 Gbit/s of data traffic
  • 3 data centers: Tampa, Amsterdam, Seoul
  • 350 servers, ranging from 1x P4 to 2x Xeon Quad-Core, with 0.5 - 16 GB of memory
  • managed by ~ 6 people
  • 3 clusters on 3 different continents

    The Architecture

  • Geographic Load Balancing, based on the source IP of the client's resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters.
  • HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
  • 55 Squid servers currently, plus 20 waiting for setup.
  • 1,000 HTTP requests/s per server, up to 2,500 under stress
  • ~ 100 - 250 Mbit/s per server
  • ~ 14 000 - 32 000 open connections per server
  • Up to 40 GB of disk caches per Squid server
  • Up to 4 disks per server (1U rack servers)
  • 8 GB of memory, half of that used by Squid
  • Hit rates: 85% for Text, 98% for Media, since the use of CARP.
  • PowerDNS provides geographical distribution.
  • In their primary and regional data centers they run text and media clusters built on LVS, CARP Squid, and Cache Squid. The media storage is in the primary data center.
  • To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches.
  • One centrally managed & synchronized software installation for hundreds of wikis.
  • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
  • Hardware shared with External Storage and Memcached tasks
  • Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
  • Actual revision text is stored as blobs in External storage
  • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
  • Separate database per wiki (not separate server!)
  • One master, many replicated slaves
  • Read operations are load balanced over the slaves, write operations go to the master
  • The master is used for some read operations in case the slaves are not yet up to date (lagged)
  • External Storage: article text is stored on separate data storage clusters, which are simple append-only blob storage. This saves space on the expensive and busy core databases for largely unused data, and allows use of spare resources on application servers (2x 250-500 GB per server). Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability.
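
The master/slave bullets above describe a routing rule: writes go to the master, reads are spread over the slaves, and the master takes reads when the slaves are lagged. Below is a minimal Python sketch of that rule, not MediaWiki's actual load balancer. The class names, the server names, and the 5-second lag threshold are all assumptions made for illustration, and replication lag is assumed to be measured elsewhere.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_seconds: float  # how far this slave trails the master (assumed to be monitored elsewhere)


class ReplicatedPool:
    """Route writes to the master and reads to slaves that are fresh enough."""

    def __init__(self, master, replicas, max_lag_seconds=5.0):
        self.master = master
        self.replicas = list(replicas)
        self.max_lag_seconds = max_lag_seconds  # hypothetical threshold, not from the source

    def server_for_write(self):
        # All writes go to the single master.
        return self.master

    def server_for_read(self, rng=random):
        # Only consider slaves whose lag is within the allowed window.
        fresh = [r for r in self.replicas if r.lag_seconds <= self.max_lag_seconds]
        if not fresh:
            # Every slave is lagged: fall back to the master for this read.
            return self.master
        return rng.choice(fresh).name


pool = ReplicatedPool(
    master="db-master",
    replicas=[Replica("db-replica-1", 0.4), Replica("db-replica-2", 42.0)],
)
print(pool.server_for_write())
print(pool.server_for_read())
```

With one fresh slave and one lagged slave, reads land on the fresh one; if every slave is lagged the master absorbs the reads, which is why lag is worth watching.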

    Lessons Learned

  • Focus on architecture, not so much on operations or nontechnical stuff.
  • Sometimes caching costs more than recalculating or looking up at the data source...profiling!
  • Avoid expensive algorithms, database queries, etc.
  • Cache every result that is expensive and has temporal locality of reference.
  • Focus on the hot spots in the code (profiling!).
  • Scale by separating: read and write operations (master/slave); expensive operations from cheap and more frequent operations (query groups); big, popular wikis from smaller wikis.
  • Separating this way improves caching: it gives better temporal and spatial locality of reference and reduces the data set size per server.
  • Text is compressed and only the differences between revisions are stored.
  • Simple-seeming library calls, like using stat to check for a file's existence, can take too long under load.
  • Disk I/O is seek-limited: the more disk spindles, the better!
  • Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
  • It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
  • Anyone who wants to design their database architecture so that it can grow inexpensively from a single box ranked nowhere to one of the top ten or hundred sites on the net should start out by designing it to handle slightly out-of-date data from replication slaves, know how to load balance all read queries across slaves, and, if at all possible, design it so that chunks of data (batches of users, accounts, whatever) can go on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!
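
The caching lessons above (cache expensive results that have temporal locality, and profile to check that the cache pays for itself) can be sketched as a small time-to-live cache. This is an in-process Python stand-in for illustration only, not Memcached and not Wikimedia's code. `TTLCache`, `render_page`, and the 60-second lifetime are hypothetical names and values.

```python
import time


class TTLCache:
    """Cache computed values for a fixed lifetime and count hits and misses."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: recompute and store with a fresh expiry time.
        self.misses += 1
        value = compute()
        self._store[key] = (now + self.ttl_seconds, value)
        return value


def render_page(title):
    # Stand-in for an expensive parse/render step.
    return f"<h1>{title}</h1>"


cache = TTLCache(ttl_seconds=60)
for _ in range(3):
    html = cache.get_or_compute("Main_Page", lambda: render_page("Main_Page"))
print(html, cache.hits, cache.misses)
```

The hit and miss counters are the profiling hook: if a key is rarely hit, or computing it is cheaper than the cache round trip, caching it costs more than it saves.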


    Wednesday
    Aug 22, 2007

    How many machines do you need to run your site?

    Amazingly, TechCrunch runs their website on one web server and one database server, according to the fascinating survey What the Web’s most popular sites are running on by Pingdom, a provider of uptime and response time monitoring. Earlier we learned PlentyOfFish catches and releases many millions of hits a day on just 1 web server and three database servers. Google runs a Dalek army full of servers. YouSendIt, a company making it easy to send and receive large files, has 24 web servers, 3 database servers, 170 storage servers, and a few miscellaneous servers. Vimeo, a video sharing company, has 100 servers for streaming video, 4 web servers, and 2 database servers. Meebo, an AJAX based instant messaging company, uses 40 servers to handle messaging, over 40 web servers, and 10 servers for forums, jabber, testing, and so on. FeedBurner, a news feed management company, has 70 web servers, 15 database servers, and 10 miscellaneous servers. Now multiply FeedBurner's server count by two because they maintain two geographically separate sites, in an active-passive configuration, for high availability purposes. How many servers will you need and how can you trick yourself into using fewer?
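
As a quick worked example of the FeedBurner arithmetic above, using only the counts quoted in the paragraph (the variable names are just for illustration):

```python
# Per-site server counts quoted for FeedBurner.
feedburner_site = {"web": 70, "database": 15, "miscellaneous": 10}
sites = 2  # two geographically separate sites, active-passive

per_site = sum(feedburner_site.values())
total = per_site * sites
print(per_site, total)  # 95 190
```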

    Find Someone Like You and Base Your Resource Estimates Off Them

    We see quite a disparity in the number of servers needed for popular web sites. It ranges from just a few servers to many hundreds. Where do you fit? The easiest approach to figuring out how many servers you'll need is to find a company similar to yours and look at how many they need. You won't need that many right away, but as you grow it's something to think about. Can your data center handle your growth? Do they have enough affordable bandwidth and rack space? How will you install and manage all the machines? Who will do the work? And a million other similar questions that might be better handled if you had some idea where you are going.

    Get Someone Else to Do it

    Clearly content sites end up needing a lot of servers. Videos, music, pictures, blogs, and attachments all eat up space and since that's your business you have no alternative but to find a way to store all that data. This is unstructured data that can be stored outside the database in a SAN or NAS. Or, rather than building your own storage infrastructure, you can follow the golden rule of laziness: get someone else to do it. That's what SmugMug, an image sharing company, did. They use S3 to store many hundreds of terabytes of data. This drops the expense of creating a large highly available storage infrastructure so much that it creates a whole new level of competition for content rich sites. At one time expertise in creating massive storage farms would have been enough to keep competition away, but no more. These sorts of abilities are becoming commoditized, affordable, and open. PlentyOfFish and YouTube make use of CDNs to reduce the amount of infrastructure they need to create for themselves. If you need to stream video why not let a CDN do it instead of building out your own expensive infrastructure? You can take a "let other people do it" approach for services like email, DNS, backup, forums, and blogs too. These are all now outsourceable. Does it make sense to put these services in your data center if you don't need to? If you have compute intensive tasks you can use Amazon services without needing to perform your own build out. And an approach I am really excited to investigate in the future is a new breed of grid based virtual private data centers like 3tera and mediatemple. Their claim to fame is that you can componentize your infrastructure in such a way that you can scale automatically and transparently using their grid as demand fluctuates. I don't have any experience with this approach yet, but it's interesting and probably where the world is heading.
    If your web site is a relatively simple blog with mostly static content then you can get away with far fewer servers. Even a popular site like Digg has only 30GB of data to store.

    How do your resources scale with the number of users?

    Another question you have to ask is whether your resources scale linearly, exponentially, or not much at all with the number of users. A blog site may not scale much with the number of users. Some sites scale linearly as users are added. And other sites that rely on social interaction, like Google Talk, may scale exponentially as users are added. Getting a feel for the type of site you have can help more realistic numbers pop up on your magic server eight-ball.
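
One rough way to see why interaction-heavy sites grow so much faster than linear ones is to count possible user-to-user pairs. The sketch below is an illustration under stated assumptions, not a capacity model: `linear_load` assumes a fixed cost per user, and `pairwise_load` assumes the cost follows the number of possible pairs, n(n-1)/2, which strictly speaking is quadratic rather than exponential growth.

```python
def linear_load(users, per_user=1):
    # Cost grows in step with the user count.
    return users * per_user


def pairwise_load(users):
    # Number of distinct user pairs that could interact: n(n-1)/2.
    return users * (users - 1) // 2


for n in (10, 100, 1000):
    print(n, linear_load(n), pairwise_load(n))
```

Going from 100 to 1,000 users multiplies the linear figure by 10 but the pairwise figure by roughly 100, which is the kind of gap that changes a server estimate.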

    What's your caching strategy?

    A lot of sites use Memcached and Squid for caching. You can fill up a few racks with caching servers. How many servers will you need for caching? Or can you get away with just beefing up the database server cache?

    Do you need servers for application specific tasks?

    Servers aren't just for storage, database, and the web servers. You may have a bit of computation going on. YouTube offloads tag calculations to a server farm. Google Talk has to have servers for handling presence calculations. PlentyOfFish has servers to handle geographical searches because they are so resource intensive. GigaVox needs servers to transcode podcasts into different formats and include fresh commercial content. If you are a calendar service you may need servers to calculate more complicated schedule availability schemes and to sync address books. So depending on your site, you may have to budget for many application related servers like these. The Pingdom folks also created a sweet table on what technologies the companies profiled on this site are using. You can find it at What nine of the world’s largest websites are running on. I'm very jealous of their masterful colorful graphics-fu style. Someday I hope to rise to that level of presentation skill.


    Tuesday
    Aug 21, 2007

    What does the next generation data center look like?

    That's what people at the NGDC Conference get together and talk about. A lot of interesting subjects: data center virtualization HPC & grid; advanced facilities management and planning; advanced network and services; applications; data center optimization and security; managing and protecting information. The Grid – Distributed Computing at Scale presentation is an interesting one.


    Monday
    Aug 20, 2007

    TypePad Architecture

    TypePad is considered the largest paid blogging service in the world. After experiencing problems because of their meteoric growth, they eventually transitioned to an architecture patterned after their sister company, LiveJournal. Site: http://www.typepad.com/

    The Platform

  • MySQL
  • Memcached
  • Perl
  • MogileFS
  • Apache
  • Linux

    The Stats

  • As of 2005, TypePad sent 250 Mbps of traffic over multiple network pipes, for 3 TB of traffic a day. They were growing by 10-20% each month. I was unable to find more recent statistics.

    The Architecture

  • Original Architecture: - Single server running Linux, Apache, Postgres, Perl, mod_perl - Storage was NFS on a filer.
  • A Devastating Crash Caused a New Direction - A RAID controller failed and spewed data across all RAID disks. - The database was corrupted and the backups were corrupted. - Their redundant filers suffered from "split brain" syndrome.
  • They moved to a LiveJournal-style architecture, which isn't surprising since TypePad and LiveJournal are both owned by Six Apart. - Replicated MySQL clusters partitioned by ID. - A global DB generated globally unique sequence numbers and mapped users to partitions. - Other data was mapped by role.
  • Highly Available Database Configuration: - A master-master MySQL replication model is used. - The Linux clustering heartbeat was used to failover using virtual IP addresses.
  • MogileFS is used to serve images.
  • Perlbal is used as reverse proxy and to load balance requests.
  • A reliable, asynchronous job dispatch system called TheSchwartz is used to support moblogging, adding comments, future publishing, cache invalidation, and publishing.
  • Memcached is used to store counts, sets, stats, and heavyweight data.
  • Migration from the old architecture to the new architecture was tricky: - All users were migrated over without service interruption. - Postgres was removed. - During the migration images were served from NFS and MogileFS.
  • Benefits of their new architecture: - Can easily add new machines and adjust workload. - More highly available and cheaply scalable.
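
The partitioning scheme described above, where a global DB hands out unique sequence numbers and maps users to partitions, can be sketched in a few lines of Python. This is a toy in-memory stand-in, not TypePad's implementation. The `GlobalDirectory` class, the cluster names, and the "least-loaded partition" placement rule are assumptions made for illustration.

```python
import itertools


class GlobalDirectory:
    """Stand-in for the global DB: issues unique IDs and maps users to partitions."""

    def __init__(self, partition_names):
        self._ids = itertools.count(1)  # globally unique sequence numbers
        self._partitions = list(partition_names)
        self._users = {}  # user_id -> (username, partition)

    def next_id(self):
        return next(self._ids)

    def register_user(self, username):
        user_id = self.next_id()
        # Assumption: place each new user on the partition holding the fewest users.
        counts = {p: 0 for p in self._partitions}
        for _, partition in self._users.values():
            counts[partition] += 1
        target = min(self._partitions, key=lambda p: counts[p])
        self._users[user_id] = (username, target)
        return user_id

    def partition_for(self, user_id):
        # Every request for a user's data starts with this lookup.
        return self._users[user_id][1]


directory = GlobalDirectory(["cluster-a", "cluster-b"])
ids = [directory.register_user(name) for name in ("ann", "bob", "cy")]
print([(i, directory.partition_for(i)) for i in ids])
```

Because the mapping lives in one directory rather than in a hash function, users can be moved between partitions by updating a row, which fits the "easily add new machines and adjust workload" benefit listed above.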

    Lessons Learned

  • Small details are important.
  • Every mistake is a learning experience.
  • Success requires coordination and cooperation.

    Related Articles

  • LiveJournal Architecture.
  • Linux High Availability.


    Friday
    Aug 17, 2007

    What is the best hosting option?

    This question was extracted from: http://highscalability.com/plentyoffish-architecture#comment-126 For a startup like Markus's, what is the best hosting option (with room to grow later)? Host your own server or use an ISP co-location option? He still has to pay huge money for bandwidth with that payload, right?


    Thursday
    Aug 16, 2007

    What tech is used to build your favorite site?

    Find out with Builtwith.com. It scans a site and guesses how the site is built. I ran it on this site and it said: Apache, Windows, PHP, Adsense, RSS, CSS, Javascript, and UTF-8 encoding. Correct, yet I think it should have guessed Drupal was the CMS and it should have been able to determine which AJAX library is used. Though it's kind of cool to see which sites use PHP and other technologies.


    Thursday
    Aug 16, 2007

    Scaling Secret #2: Denormalizing Your Way to Speed and Profit

    Alan Watts once observed how after we accepted Descartes' separation of the mind and body we've been trying to smash them back together again ever since, when really they were never separate to begin with. The database normalization-denormalization dualism has the same Möbius-shaped reverberations as Descartes' error. We separate data into a million jagged little pieces and then spend all our time stooping over, picking them up, and joining them back together again. Normalization has been standard practice now for decades. But times are changing. Many mega-website architects are concluding Watts was right: the data was never separate to begin with. And even more radical, we may even need to store multiple copies of data.

    Information Sources

  • Normalization Is for Sissies by Pat Helland
  • Data normalization, is it really that good? by Arnon Rotem-Gal-Oz
  • When Not to Normalize your SQL Database by Dare Obasanjo
  • MegaData by Joe Gregorio
  • Audio of talk by Adam Bosworth at the MySQL Users Conference 2005

    We normalize data to prevent anomalies. Anomalies are bad things like forgetting to update someone's address in all the places it's been stored when they move. This anomaly happens because the address has been duplicated. So to prevent the anomaly we don't duplicate data. We split everything up so it is stored once and exactly once. Bad things are far less likely to happen if we follow this strategy. And that's a good thing. The process of getting rid of all potential bad things is called normalization and we have a bunch of rules to follow to normalize our data. The price of normalization is that when we want a person's address we have to go find the person and their address in separate operations and bring the data together again. This is called a join.

    The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again. Flickr decided to denormalize because it took 13 Selects to each Insert, Delete or Update.

    If you say your database is the bottleneck then the finger is pointed back at you and you are asked what you are doing wrong. Have you created proper indexes? Is your schema design good? Is your database efficient? Are you tuning your queries? Have you cached in the database? Have you used views? Have you cached complicated queries in memcached? Can you get more parallel IO out of your database? And all these are valid and good questions. For your typical transactional database these would be your normal paths of attack. But we aren't talking about your normal database. We are talking about web scale services that have to process loads higher than any database can scale to. At some point you need a different approach.

    What many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme, and some are even advocating denormalization as the best strategy for architecting the data tier. We see this with Ebay, who moved all significant functionality out of the database and into applications. Flickr shards and replicates their data to reach high performance levels. For Flickr this moves transaction logic back into their application layer, but the win is higher scalability. Joe Gregorio has identified some common themes across these new mega-data systems:
  • Distributed - The data has to be distributed across multiple machines.
  • Joinless - No joins, and no referential integrity, at least at the data store level.
  • De-Normalized - De-normalization is needed if you are avoiding joins.
  • Transactionless - No transactions.

    It's the web model pushed to the data tier. Ironically, it may take a web model on the back-end to support a web model on the front-end.
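
The trade-off between a normalized join and a denormalized copy can be made concrete with a small Python sketch using the standard library's sqlite3 module. The table and column names are hypothetical, and an in-memory SQLite database is only a stand-in for the large sharded stores discussed here. It shows the mechanics, not the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE address (person_id INTEGER REFERENCES person(id), city TEXT);
CREATE TABLE person_denorm (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO person VALUES (1, 'Ada');
INSERT INTO address VALUES (1, 'London');
INSERT INTO person_denorm VALUES (1, 'Ada', 'London');
""")

# Normalized: the address is stored once, so reading it back needs a join.
normalized = conn.execute(
    "SELECT p.name, a.city FROM person p "
    "JOIN address a ON a.person_id = p.id WHERE p.id = ?",
    (1,),
).fetchone()

# Denormalized: the city is copied next to the name, so one lookup is enough.
denormalized = conn.execute(
    "SELECT name, city FROM person_denorm WHERE id = ?", (1,)
).fetchone()

# The price of the copy: updating only one place leaves the other copy stale.
conn.execute("UPDATE address SET city = 'Paris' WHERE person_id = 1")
stale = conn.execute("SELECT city FROM person_denorm WHERE id = 1").fetchone()

print(normalized, denormalized, stale)
```

The last query is the update anomaly that normalization exists to prevent: once data is duplicated, keeping the copies in step becomes the application's job.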

    The Great Data Ownership Wars: The Database vs. The Application

    A not so subtle clue as to who won the data wars is to look at the words used. Data that are split up are considered "normal." Those who keep their data whole are considered "de-normal." All right, that's not what those words mean, but it was too good to pass up. :-)

    Traditionally the database owns the data. Referential integrity, triggers, stored procedures, and everything else that keeps the data safe and whole is in the database. Applications are prevented from screwing up the data. And this makes sense until you scale. Centralizing all behavior in the database won't mega-scale as the web does, which is why Ebay went completely the other way. Ebay maintains data integrity through a service layer that encapsulates all data access. The service layer handles referential integrity, managing replicated copies, doing joins, and so on. It's more error prone than having the database do all this work, but you are able to scale past what even the highest end databases can handle.

    All this sharding and denormalization and duplicating at one level feels so wrong because it's so different from what we were all taught. And unless you are a really large website you probably don't need to worry about this level of complexity. But it's a really fascinating and unexpected evolution in design. Scaling to handle the world wide web requires techniques and strategies that are often at odds with our years of experience. It will be fun to see where it all leads.

    Related Articles

  • Flickr both denormalizes and duplicates data. Horror!
  • Ebay is the most radical in moving almost all functionality out of the database and into the application.
  • Plenty of Fish also advocates denormalization as a key strategy.
  • Hadoop - a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce.


    Friday
    Aug 10, 2007

    How do we make a large real-time search engine?

    We're implementing a content-oriented website that will get massive public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes. The solution we found is to implement a separate service that re-indexes the contents of the database at regular intervals. However, this is a complex and suboptimal solution, since we would like content to be indexed and searchable in real time. Could you point me to some examples or articles I could review to design a solution for this context?


    Thursday
    Aug 09, 2007

    Lots of questions for high scalability / high availability

    Hey, I have a website that I would like to scale. Right now we have 10 servers but this does not scale well. I know how to deal with my Apache web servers but have problems with the SQL servers. I would like to use the "scale out" approach and add servers when we need them. We have over 100 GB of data in MySQL and we tried to keep around 20 GB per server. It works well, except that if a server goes down then 1/5 of the users can't access the website. We could use replication, but we would need to at least double the SQL servers to replicate each server. And maybe in the future that's not going to be enough and we would need maybe 3 slaves per master ... well, I don't really like this idea. I would prefer to have 8 servers that all deal with data from the 5 servers we have right now, and then we could add new servers when we need them. I looked at NFS but that does not seem to be a good idea for SQL servers. Can you confirm?


    Wednesday
    Aug 08, 2007

    Partial String Matching

    Is there any alternative to LIKE '%...%' OR LIKE '%...%' in MySQL if you have to offer partial string matching on a large dataset?
