Any Suggestions for the Architecture Template?

Here's my template for describing the architecture of a system. The idea is to have people fill out this template and that then becomes the basis for a profile. This is how the Friends for Sale post was created and I think that turned out well. People always want more detail, but realistically you can only expect so much. The template is definitely too long, but it's more just a series of questions to jog people's memories and then they can answer whatever they think is important. What I want to ask is if you can think of any things to add/delete/change in the template? What do you want to know about the systems people are building? So if you have the time, please take a look and tell me what you think.

Getting to Know You

* What is the name of your system and where can we find out more about it?
* What is your system for?
* Why did you decide to build this system?
* How is your project financed?
* What is your revenue model?
* How do you market your product?
* How long have you been working on it?
* How big is your system? Try to give a feel for how much work your system does.
* Number of unique visitors?
* Number of monthly page views?
* What is your in/out bandwidth usage?
* How many documents do you serve? How many images? How much data?
* How fast are you growing?
* What is your ratio of free to paying users?
* What is your user churn?
* How many accounts have been active in the past month?

How is your system architected?

* What is the architecture of your system? Talk about how your system works in as much detail as you feel comfortable with.
* What particular design/architecture/implementation challenges does your system have?
* What did you do to meet these challenges?
* How does your system evolve to meet new scaling challenges?
* Do you use any particularly cool technologies or algorithms?
* What do you do that is unique and different that people could best learn from?
* What lessons have you learned?
* Why have you succeeded?
* What do you wish you had done differently?
* What wouldn't you change?
* How much up-front design should you do?
* How are you thinking of changing your architecture in the future?

How is your team set up?

* How many people are on your team?
* Where are they located?
* Who performs what roles?
* Do you have a particular management philosophy?
* If you have a distributed team, how do you make that work?
* What skill sets does your team possess?
* What is your development environment?
* What is your development process?
* Is there anything you would do differently or that you have found surprising?

What infrastructure do you use?

* Which languages do you use to develop your system?
* How many servers do you have?
* How is functionality allocated to the servers?
* How are the servers provisioned?
* What operating systems do you use?
* Which web server do you use?
* Which database do you use?
* Do you use a reverse proxy?
* Do you colocate, use a grid service, use a hosting service, etc.?
* What is your storage strategy? DAS/SAN/NAS/SCSI/SATA/other?
* How much capacity do you have?
* How do you grow capacity?
* Do you use a storage service?
* Do you use storage virtualization?
* How do you handle session management?
* How is your database architected? Master/slave? Shard? Other?
* How do you handle load balancing?
* Which web framework/AJAX library do you use?
* Which real-time messaging frameworks do you use?
* Which distributed job management system do you use?
* How do you handle ad serving?
* Do you have a standard API to your website? If so, how do you implement it?
* If you use a dynamic language, which instruction caching product do you use?
* What is your object and content caching strategy?
* What is your client-side caching strategy?
* Which third-party services do you use to help build your system?

How do you manage your system?

* How do you check global availability and simulate end-user performance?
* How do you health check your servers and networks?
* How do you graph network and server statistics and trends?
* How do you test your system?
* How do you analyze performance?
* How do you handle security?

How do you handle customer support?

How do you decide what features to add/keep?

* Do you implement web analytics?
* Do you do A/B testing?

How is your data center setup?

* How many data centers do you run in?
* How is your system deployed in data centers?
* Are your data centers active/active or active/passive?
* How do you handle syncing between data centers, failover, and load balancing?
* Which firewall product do you use?
* Which DNS service do you use?
* Which routers do you use?
* Which switches do you use?
* Which email system do you use?
* How do you handle spam?
* How do you handle virus checking of email and uploads?
* How do you back up and restore your system?
* How are software and hardware upgrades rolled out?
* How do you handle major changes in database schemas on upgrades?
* What is your fault tolerance and business continuity plan?
* Do you have a separate operations team managing your website?
* Do you use a content delivery network? If so, which one and what for?
* How much do you pay monthly for your setup?

Miscellaneous

* Who do you admire?
* Have you patterned your company/approach on someone else?
* Are there any questions you would add/remove/change in this list?



Yandex Architecture

Update: Anatomy of a crash in a new part of Yandex written in Django. Writing to a magic session variable caused an unexpected write into an InnoDB database on every request. Writes took 6-7 seconds because of index rebuilding. Lots of useful details on the sizing of their system, what went wrong, and how they fixed it.

Yandex is a Russian search engine with 3.5 billion pages in their search index. We only know a few fun facts about how they do things, nothing at a detailed architecture level. Hopefully we'll learn more later, but I thought it would still be interesting. From Allen Stern's interview with Yandex's CTO Ilya Segalovich, we learn:

  • 3.5 billion pages in the search index.
  • Over several thousand servers.
  • 35 million searches a day.
  • Several data centers around Russia.
  • Two-layer architecture.
  • The database is split in pieces and when a search is requested, it pulls the bits from the different database servers and brings it together for the user.
  • Languages used: C++, Perl, some Java.
  • FreeBSD is used as their server OS.
  • $72 million in revenue in 2006.
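The "split database, pull the bits together" bullet above is the classic scatter-gather pattern. Here's a hypothetical sketch of it in Ruby, not Yandex's code; the shard contents, scoring, and names are all invented for illustration:

```ruby
# Hypothetical scatter-gather sketch: the index is split across shards, a
# query fans out to every shard, and the partial results are merged and
# ranked before being returned to the user.
Shard = Struct.new(:docs) do
  # Each shard matches only against the documents it holds.
  def search(term)
    docs.select { |doc| doc[:text].include?(term) }
  end
end

def scatter_gather(shards, term, limit: 10)
  shards.flat_map { |shard| shard.search(term) }  # scatter to every shard
        .sort_by { |doc| -doc[:score] }           # gather and rank
        .first(limit)
end

shards = [
  Shard.new([{ id: 1, text: "ruby scaling",  score: 0.9 }]),
  Shard.new([{ id: 2, text: "ruby sharding", score: 0.7 },
             { id: 3, text: "perl hacking",  score: 0.5 }])
]

scatter_gather(shards, "ruby").map { |d| d[:id] }  # => [1, 2]
```

The appeal of the pattern is that each database server only has to search its own slice, so query latency is bounded by the slowest shard rather than the full index size.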



    Kevin's Great Adventures in SSDland

    Update: Final Thoughts on SSD and MySQL AKA Battleship Spinn3r. Tips on how to make your database 10x faster using solid state drives, with the potential for a 100x speedup.

    Solid-state drives (SSDs) are the holy grail of storage. The promise of RAM speeds with hard-disk-like persistence has for years driven us crazy with power-user lust, but they've stayed tantalizingly just out of reach: always too expensive, too small, and oddly too slow. Has that changed? Can you now miraculously have your cake and eat it too? Can you now have it both ways? Is balancing work with family life now as easy as tripping over a terabyte drive?

    In a pioneering series of blog articles, Kevin Burton conducts original research on next generation SSD drives in real world configurations. For an experience report on his great adventure you can turn to: Could SSD Mean a Rise in MyISAM Usage?, Serverbeach, MySQL and Mtron SSDs, Prediction: SSD Blades in 2008, Zeus IOPS - Another High Performance SSD, Thoughts on SSD and MySQL 5.1, Thoughts on Maria and SSD, 24 Hours with an SSD and MySQL, Random Write Performance in SSDs, SSD + PBXT = Crazy Suspicious!, More SSD vs HDD vs InnoDB vs MyISAM Numbers.

    A lot of fascinating findings so far. Unfortunately Goldilocks may still find the porridge too slow. SSDs turn out to be fast, but not as fast as you might hope in out-of-the-box configurations. MySQL on SSD is fast for sequential reads and writes, but random reads and writes are relatively slow. Kevin speculates "Log structured filesystems can come into play here and seriously increase performance by turning all random writes into sequential writes" and that "Bigtable and append only databases would FLY on flash." The upshot for me is that we need a storage engine designed specifically for SSDs, as they present a very different design space from hard disks. Kevin has a lot of excellent details and observations on his site, and he'll no doubt be coming up with a lot more.
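To make the "turn random writes into sequential writes" idea concrete, here's a toy append-only store of my own devising (not Kevin's code, and far simpler than any real log-structured engine): every write, however random its key, becomes a sequential append to the log, with an in-memory index mapping keys to offsets.

```ruby
require "stringio"

# Toy append-only store: writes always go to the end of the log (sequential),
# reads look up an in-memory index of key => [offset, length].
class AppendOnlyStore
  def initialize(io = StringIO.new)
    @log   = io
    @index = {}   # key => [offset, length]
  end

  def put(key, value)
    @log.seek(0, IO::SEEK_END)        # always append, never seek to overwrite
    offset = @log.pos
    @log.write(value)
    @index[key] = [offset, value.bytesize]
  end

  def get(key)
    offset, length = @index[key]
    return nil unless offset
    @log.seek(offset)
    @log.read(length)
  end
end

store = AppendOnlyStore.new
store.put("a", "first")
store.put("b", "second")
store.put("a", "updated")   # an overwrite is just another append
store.get("a")              # => "updated"
```

A real engine would also need log compaction to reclaim the space the stale "first" record occupies, which is exactly the kind of work an SSD-aware storage engine would take on.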



    Product: Capistrano - Automate Remote Tasks Via SSH

    Update: Deployment with Capistrano  by Charles Max Wood.  Nice simple step-by-step for using Capistrano for deployment.

    From their website:
    Simply put, Capistrano is a tool for automating tasks on one or more remote servers. It executes commands in parallel on all targeted machines, and provides a mechanism for rolling back changes across multiple machines. It is ideal for anyone doing any kind of system administration, either professionally or incidentally.

    * Great for automating tasks via SSH on remote servers, like software installation, application deployment, configuration management, ad hoc server monitoring, and more.
    * Ideal for system administrators, whether professional or incidental.
    * Easy to customize. Its configuration files use the Ruby programming language syntax, but you don't need to know Ruby to do most things with Capistrano.
    * Easy to extend. Capistrano is written in the Ruby programming language, and may be extended easily by writing additional Ruby modules.

    One of the original use cases for Capistrano was for deploying web applications. (This is still by far its most popular use case.) In order to make deploying these applications reliable, Capistrano needed to ensure that if something went wrong during the deployment, changes made to that point on the other servers could be rolled back, leaving each server in its original state.

    If you ever need similar functionality in your own recipes, you can introduce a transaction:

    task :deploy do
      transaction do
        update_code
        symlink
      end
    end

    task :update_code do
      on_rollback { run "rm -rf #{release_path}" }
      # ... check the new code out into release_path here ...
    end

    task :symlink do
      on_rollback { run "rm #{current_path}; ln -s #{previous_release} #{current_path}" }
      run "rm #{current_path}; ln -s #{release_path} #{current_path}"
    end
    The first task, “deploy” wraps a transaction around its invocations of “update_code” and “symlink”. If an error happens within that transaction, the “on_rollback” handlers that have been declared are all executed, in reverse order.

    This does mean that transactions aren't magical. They don't really automatically track and revert your changes. You need to do that yourself, registering on_rollback handlers that take the necessary steps to undo the changes the task has made. Still, even as lo-fi as Capistrano transactions are, they can be quite powerful when used properly.

    From the Ruby on Rails manual:

    Ultimately, Capistrano is a utility that can execute commands in parallel on multiple servers. It allows you to define tasks, which can include commands that are executed on the servers. You can also define roles for your servers, and then specify that certain tasks apply only to certain roles.

    Capistrano is very configurable. The default configuration includes a set of basic tasks applicable to web deployment. (More on these tasks will be said later.)

    Capistrano can do just about anything you can write shell script for. You just run those snippets of shell script on remote servers, possibly interacting with them based on their output. You can also upload files, and Capistrano includes some basic templating to allow you to dynamically create and deploy things like maintenance screens, configuration files, shell scripts, and more.
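The roles mechanism described above looks roughly like this in a recipe. This is a hypothetical deploy.rb fragment in Capistrano 2 syntax; the hostnames and the task itself are invented for illustration:

```ruby
# Hypothetical deploy.rb fragment; hostnames are placeholders.
role :web, "web1.example.com", "web2.example.com"
role :db,  "db1.example.com", :primary => true

# This task runs only on the servers assigned the :web role.
task :restart_web, :roles => :web do
  run "touch #{current_path}/tmp/restart.txt"
end
```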

    Related Articles


  • Friends for Sale uses Capistrano for deployment.


    Tracking usage of public resources - throttling accesses per hour

    Hi. We have an application that allows the user to define a publicly available resource with an ID. The resource can then be accessed via an HTTP call, passing the ID. While we're not a picture site, thinking of a resource like a picture may help understand what is going on. We need to be able to stop access to the resource if it is accessed 'x' times in an hour, regardless of who is requesting it.

    We see two options:

    * Go to the database for each request to see if the number returned in the last hour is within the limit.
    * Keep a counter in each of the application servers and sync the counters every few minutes (or every so many requests) to determine if we've passed the limit. The sync point would be the database.

    Going to the database (and updating it!) each time we get a request isn't very attractive. We also have a load-balanced farm of servers, so we know 'x' is going to have to be a soft limit if we count in the app servers. (We know there will be a period of time between syncing the counts in the app servers where we'll overshoot the limit. That is okay, since we'll catch the limit violation and stop the requests.) Any other thoughts on how to do this? Thanks, Chris
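A minimal sketch of the per-app-server counter approach Chris describes, assuming a fixed one-hour window; the periodic sync back to the database is deliberately left out, and the class and names are mine, not from the question:

```ruby
# Each app server keeps its own per-resource counter for the current hour
# window and refuses requests past a soft limit. Because every server counts
# independently between syncs, the limit is soft, exactly as described above.
class HourlyThrottle
  def initialize(limit)
    @limit  = limit
    @counts = Hash.new(0)
    @window = Time.now.to_i / 3600
  end

  # Returns true if the request is allowed, false once the limit is hit.
  def allow?(resource_id, now = Time.now)
    window = now.to_i / 3600
    if window != @window       # a new hour has started: reset local counts
      @counts.clear
      @window = window
    end
    @counts[resource_id] += 1
    @counts[resource_id] <= @limit
  end
end

throttle = HourlyThrottle.new(3)
t = Time.at(0)
4.times.map { throttle.allow?("pic-42", t) }  # => [true, true, true, false]
```

With N servers and a limit of x, each server could locally enforce roughly x/N plus whatever slack the sync interval allows, which keeps the database out of the per-request path.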



    Building a email communication system

    Hi, the website I work for is looking to build an email system that can handle a fair few emails (up to a hundred thousand a day). These comprise emails like registration emails, newsletters, lots of user-triggered emails, and overnight emails. At present we queue them in SQL and feed them into an SMTP server on one of our web servers when the queue drops below a certain level. This has caused our mail system to crash as well as hammer our DB server (shared!!!). We have an architecture in mind for what we want to build, but thought there might be something we could buy off the shelf that allowed us to keep templated emails, lists of recipients, schedule sends, etc., and report on it. We can't find anything. What do big websites like Amazon use, or people a little smaller who still send loads of mail (Flickr, Ebuyer, or other ecommerce sites)? Cheers, tarqs
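Off-the-shelf products aside, the queue-and-drain scheme described above can be sketched like this; the delivery lambda is a stub standing in for a real SMTP handoff, and the class and batch size are invented for illustration:

```ruby
# Producers enqueue messages; a worker feeds them to the mailer in small
# batches, so the SMTP server is never handed the whole backlog at once.
class MailQueue
  def initialize(batch_size: 100, deliver: ->(msg) {})
    @queue      = []
    @batch_size = batch_size
    @deliver    = deliver   # stub for the real SMTP handoff
  end

  def enqueue(message)
    @queue << message
  end

  # Drain up to batch_size messages; returns how many were sent.
  def drain_batch
    sent = 0
    while sent < @batch_size && !@queue.empty?
      @deliver.call(@queue.shift)
      sent += 1
    end
    sent
  end
end

sent_log = []
mq = MailQueue.new(batch_size: 2, deliver: ->(m) { sent_log << m })
%w[welcome newsletter receipt].each { |m| mq.enqueue(m) }
mq.drain_batch  # => 2 ("welcome" and "newsletter" are sent)
```

The key design point is that the queue, not the SMTP server, absorbs the bursts: a batch size caps how hard the mailer is hit, which is exactly the failure mode described in the question.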



    Hadoop Getting Closer to 1.0 Release

    Update: Yahoo! Launches World's Largest Hadoop Production Application. A 10,000 core Hadoop cluster produces data used in every Yahoo! Web search query. Raw disk is at 5 petabytes. Their previous 1 petabyte database couldn't handle the load and couldn't grow larger. Greg Linden thinks the Google cluster has way over 133,000 machines.

    From an InfoQ interview with project lead Doug Cutting, it appears Hadoop, an open source distributed computing platform, is making good progress toward its 1.0 release. They've successfully reached a 1,000 node cluster size, improved file system integrity, and jacked performance up by 20x in the last year. How they are making progress could be a good model for anyone:

    The speedup has been an aggregation of our work in the past few years, and has been accomplished mostly by trial-and-error. We get things running smoothly on a cluster of a given size, then double the size of the cluster and see what breaks. We aim for performance to scale linearly as you increase the cluster size. We learn from this process and then increase the cluster size again. Each time you increase the cluster size reliability becomes a bigger challenge since the number and kind of failures increase.
    It's tempting to say to just jump to the end game and skip all those errors and trials, but there's a lot of learning and experience that must be earned on the way to scaling anything.



    How to deal with an I/O bottleneck to disk?

    A site I'm working with has an I/O bottleneck. They're using a static server to deliver all of the pictures, video content, zip downloads, etcetera, but now that the bandwidth out of that server is approaching 50 Mbit/second, the latency on serving small files has become unacceptable. I'm curious how other people have dealt with this situation. Separating into two different servers would require a significant change to the site's architecture (because the premise is that all uploads go onto one server, all subdirectories are created in one directory, etc.) and may not really solve the problem.
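One common way past the "everything on one server" premise, sketched below under invented server names: hash each file's path to pick a static server deterministically, so the upload path and the read path agree on a location without any central lookup table (at the cost of moving files around if servers are later added or removed):

```ruby
require "digest"

# Deterministically map a file path to one of several static servers.
# Server names are placeholders, not from the question.
SERVERS = %w[static1.example.com static2.example.com static3.example.com].freeze

def server_for(path)
  # Hash the path and take it modulo the server count; the same path
  # always lands on the same server, with roughly even distribution.
  SERVERS[Digest::MD5.hexdigest(path).to_i(16) % SERVERS.size]
end

server_for("/uploads/cat.jpg")   # the same path always maps to the same server
```

Consistent hashing would reduce the reshuffling when the server set changes, but even this simple modulo scheme splits the bandwidth across boxes without any per-request coordination.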



    limit on the number of databases open

    I have a few doubts; here are the questions. 1) Is there any limit on the number of databases that can be accessed simultaneously (MySQL)? 2) Will it be a problem to scale in the future if there are a large number of small databases (2-5 MB each)?



    Web Accelerators - snake oil or miracle remedy?

    Perhaps this question is borderline off-topic, but since high scalability solutions often have a global aspect I will give it a try... Has anybody had any experience with different techniques for speeding up their application for places that have a problem with poor ping response time? Ideally I would love to be running only one data center worldwide, but one day I know that our sales department will sign up a customer with an unacceptable response time... Could installing a web accelerator in front of our application extend the reach of our current data center, or will we just add complexity and another source of potential errors?
