Sunday
Jan 27, 2008

Scalability vs. Performance vs. Availability vs. Reliability... Also scale up vs. scale out?

Where do you draw the line between scalability, performance, high availability, and reliability? I guess at the end of the day we all want to be highly available, deliver great performance, and always be reliable. So is it safe to say that scalability is the answer? Also, when do you start to think scale out vs. scale up?


Friday
Jan 25, 2008

Google: Introduction to Distributed System Design

Update: Google added videos on Cluster Computing and MapReduce. There are five lectures: Introduction, MapReduce, Distributed File Systems, Clustering Algorithms, and Graph Algorithms. Advanced website design depends on deep distributed system design knowledge. Where do you get this knowledge? Try Google. They have a whole Code for Educators program with tutorials and lectures on AJAX programming, distributed systems, and web security. Looks pretty nice.


Friday
Jan 25, 2008

Application Database and DAL Architecture

Hi gurus, I'm totally new to this high scalability thing. I'm trying to create a website with scalability in mind (a personal project). In my application I'll have forums for different groups of people: each group will have its own forums, and members of a group can still post in other groups' forums, but will mainly use their own. I'm going to start with about 2,000 groups, with the potential of reaching up to 10,000 groups (the maximum, due to the nature of my application).

I was thinking that having all posts in one table will be way too much for one table, especially since some groups are expected to post hundreds or even thousands of times per day (say about 500 of the groups; the rest won't be that active). I'll have to index PostID, ParentPostID, GroupID, and PostDate, which can produce large indexes (and consequently slow inserts) if everything is in one table. So I'm thinking of a way to divide the posts among many tables. Here are some of the things I thought of:

1. Create a separate table for every group, e.g. ForumsPosts_x, where x is the GroupID. This has its own pros and cons. Among the pros: I can have small indexes and use identity columns, and I assume it should be easy to move tables to other databases should the application grow. But I posted this idea on some other forums and most people told me that having thousands of tables in a database is a sign of bad design. I was also concerned about how to design my DAL if I do this: should I use stored procedures with dynamic SQL, or use SQL text directly in my DAL code? And what about query plan caching with such a large number of tables? So many problems here!

2. Put everything in one table, and if the site grows, move some of the groups to another database. I'm concerned, though, about having many databases on the same machine: will it affect performance? Of course I won't have hundreds of databases on one machine, but maybe 5 or even 10. (A hypothetical sketch of this group-to-database routing appears below.)

I also have some other questions. I'm going to use ASP.NET for this project. I was planning initially to use SQL Server as the database, but I'm worried about the cost of growth. Should I consider an alternative like MySQL? And how will it perform with ASP.NET in a high scalability scenario? Any suggestions are highly appreciated...
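
Not an ASP.NET answer, but here is a hypothetical Java sketch of the usual alternative to per-group tables (option 2): keep one posts table per database and route each GroupID to a database through a small lookup layer. ShardRouter and its methods are invented for illustration, not a recommendation of any specific library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: route a group's posts to one of a few databases
// instead of creating one table per group. Each database holds a single
// ForumsPosts table indexed on (GroupID, PostDate).
public class ShardRouter {
    // GroupID -> database connection string. In a real system this map
    // would live in a small lookup table so groups can be moved later.
    private final Map<Integer, String> groupToShard = new ConcurrentHashMap<>();
    private final String[] shards;

    public ShardRouter(String... shardConnectionStrings) {
        this.shards = shardConnectionStrings;
    }

    // Default placement: hash the group id over the available shards.
    public String connectionStringFor(int groupId) {
        return groupToShard.computeIfAbsent(
            groupId, id -> shards[Math.floorMod(id, shards.length)]);
    }

    // Explicit override lets you migrate a hot group to its own database.
    public void pinGroup(int groupId, String shard) {
        groupToShard.put(groupId, shard);
    }
}
```

Because every query goes through the router, a busy group can later be pinned to a dedicated database without touching application queries.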


Thursday
Jan 24, 2008

Mailinator Architecture

Update: A fun exploration of applied searching in How to search for the word "pen1s" in 185 emails every second. When indexOf doesn't cut it, you just trie harder.

Has a drunken friend ever inspired you to create a first-of-its-kind internet service that is loved by millions, deemed subversive by thousands, all while handling over 1.2 billion emails a year on one rickety old server? That's how Paul Tyma came to build Mailinator. Mailinator is a free, no-setup web service for thwarting evil spammers by creating throw-away registration email addresses. If you don't give web sites your real email address they can't spam you. They spam Mailinator instead :-)

I love design with a point of view, and Mailinator has a big giant hairy one: performance first, second, and last. Why? Because Mailinator is free, and that allows Paul to showcase his different perspective on design. While competitors buy big iron to handle load, Paul uses a big idea instead: pick the right problem and create a design to fit the problem. No more. No less. The result is a perfect system architecture sonnet, beauty within the constraints of form. How does Mailinator carry out its work as a spam-busting super hero?

Site: http://mailinator.com/

Information Sources

  • The Architecture of Mailinator
  • Mailinator's 2006 Stats

    The Platform

  • Linux
  • Tomcat
  • Java

    The Stats

  • Will process an estimated 1.29 BILLION emails for 2007. 450.74 million in 2006. 280.68 million in 2005.
  • Peak rate of 6.5 million emails/day or 4513/min or 75/sec.
  • Mailinator runs on a very modest machine: a 2GHz AMD Athlon processor, 1GB of RAM (much less is used), and a low-performance 80GB IDE hard drive. And the machine is not very busy at all.
  • Mailinator runs for months unattended and very few emails are lost, even under constant spam attacks and high peak loads.

    The Architecture

  • Having a free system means the system doesn't have to be perfect. So the design goals are:
    - Design a system that values survival above all else, even users. Survival is key because Mailinator must fight off attacks on a daily basis.
    - Provide 99.99% uptime and accuracy for users. Higher uptime goals would be impractical and costly, and since the service is free this is just part of the rules of the game for users.
    - Support the following service model: a user signs up for something, goes to Mailinator, clicks on the subscription link, and forgets about it. This means email doesn't have to be stored persistently on disk. Email can reside in RAM because it is temporary (3-4 hours). If you want a real mailbox, use another service.
  • The original flow of email handling was:
    - Sendmail received email into a single on-disk mailbox.
    - The Java-based Mailinator grabbed emails using IMAP and/or POP (it changed over time) and deleted them.
    - The system then loaded all emails into memory and let them sit there.
    - The oldest email was pushed out once the 20,000-email in-memory limit was reached.
  • The original architecture worked well:
    - It was stable and stayed up for months at a time.
    - It used almost all of the 1GB of RAM.
    - Problems started when the incoming email rate surpassed 800,000 a day. The system broke down because of disk contention between Mailinator and the email subsystem.
  • The New Architecture:
    - The idea was to remove the path through the disk, which was accomplished with a complete system rewrite.
    - The web application, the email server, and all email storage run in one JVM.
    - Sendmail was replaced with a custom-built SMTP server. Because of the nature of Mailinator a full SMTP server was not necessary. Mailinator does not need to send email, and its primary duty is to accept or reject email as fast as possible. This is the downside of layering. Layering is very often given as a key strategy in scaling, but it can kill performance because crucial decisions are best handled at the highest levels of the stack. So work flows through the system only to be dumped at the lower layers after many RAM- and cycle-stealing operations have already been performed. So the decision to go with a custom SMTP server is an interesting and brave one. Most people at this point would just add more hardware. And they wouldn't be wrong, but it's interesting to see this path taken as well. Maybe with more DOM and AOP like architectures we can flatten the stack and get better performance when needed.
    - Now Mailinator receives an email directly, parses it, and stores it into memory. The disk is bypassed completely and remains fairly idle.
    - Emails are written to disk only when the system is coming down, so they can be reloaded on startup.
    - Logging was shut off to remove the risk of subpoenas. When logging was performed, log data was written in batches so several thousand log lines would be written in one disk write. This minimized disk contention at the risk of losing helpful diagnostic information.
    - The system uses under 300 threads. More aren't needed.
    - On arrival each email passes through a filter system and is stored in RAM if all filters pass.
    - Every inbox is limited to only 10 emails so popular inboxes, like joe@mailinator.com, can't blow up the system.
    - No incoming email can be over 100k and all attachments are immediately discarded. This saves RAM.
  • Emails are compressed in RAM:
    - Since 99% of emails are never looked at, compressing them saves RAM. They are only decompressed when someone looks at them.
    - Mailinator can store about 80,000 emails in under 300MB of RAM, compared to the 20,000 emails that took 1GB of RAM in the original design.
    - With this pool the average email lifespan is about 3-4 hours.
    - It's likely 200,000 emails could fit in memory, but there hasn't been a real need.
    - This is one of the design details I love because it's based on real application usage patterns. RAM is precious and CPU is not, so use compression to save RAM at the expense of CPU, knowing you won't have to take the CPU hit twice most of the time. (A compression sketch appears after this list.)
  • Mailinator does not guarantee anonymity and privacy:
    - There is no privacy. Anyone can read any inbox at any time.
    - Relaxing these constraints, while shocking, makes the design much simpler.
    - For the user it is simple because there is no sign-up needed. When a web site asks you for an email address you can just enter a mailinator address. You don't need to create a separate account. Typing in the email address effectively creates the mailinator account. Simple.
    - In practice users still get a high level of privacy.
  • The goal of survivability leads to aggressive SPAM filtering:
    - Mailinator doesn't have anything against SPAM, but because it gets so much SPAM, it must be filtered out when it threatens the uptime of the system.
    - Which leads to this rule: if you do anything (spammer or not) that starts affecting the system, your emails will be refused and you may be locked out.
  • To be accepted an email must pass the following filter chain:
    - Bounce: all bounced emails are dropped.
    - IP: too much email from a single IP is dropped.
    - Subject: too much email on the same subject is dropped.
    - Potty: subjects containing words that indicate hate or crimes or just downright nastiness are dropped.
  • Surviving email floods from a single IP address:
    - An AgingHashmap is used to filter out spammers from a particular IP address. When an email arrives from an IP address, the IP is put in the map and a counter is increased for every subsequent email.
    - After a certain period of time with no emails, the counter is cleared.
    - When a sender reaches a threshold email count, the sender is blocked. This prevents a sender from flooding the system.
    - Many systems use this sort of logic to protect all sorts of resources, like comments. You can use memcached for the same purpose in a distributed system. (A sketch of the idea appears after this list.)
  • Protecting against zombie attacks:
    - Spam can be sent from a large coordinated set of different IP addresses, called a zombie network. The same message is sent from thousands of different IP addresses, so the techniques for stopping email from a single IP address are not sufficient.
    - This filtering is a little more complex than IP blocking because you have to parse enough of the email to get the subject line, and matching subject strings is a little more resource intensive.
    - When something like 20 emails with the same subject arrive within 2 minutes, all emails with that subject are banned for 1 hour.
    - Interestingly, subjects are not banned forever, because that would mean Mailinator would have to track subjects forever, and the system design is inherently transient. This is pretty clever, I think. At the cost of a few "bad" emails getting through, the system is much simpler because no persistent list must be managed, and that list surely would become a bottleneck. A system with more stringent SPAM filtering goals would need a much more complex and less robust architecture.
    - Nearly 9% of emails are blocked with this filter.
    - From my reading, Mailinator filters only on IP and subject, so it doesn't have to read the email body to accept or reject the email. This minimizes resource usage when most email will be rejected.
  • To lessen the danger from DOS attacks:
    - All connections that are silent for a specific period of time are dropped.
    - Mailinator sends replies to email senders very slowly, like 10 or 20 or 30 seconds, even for a very small amount of data. This slows down spammers who are trying to send out spam as fast as possible and may make them rethink sending email to that address again. The wait period is reduced during busy periods so email isn't dropped.
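
The article describes the AgingHashmap's behavior but not its code, so here is a minimal Java sketch of my reading of it: count emails per sender IP, reset the count after a quiet period, and refuse senders past a threshold. The class name, threshold, and window are assumptions, not Mailinator's actual values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an AgingHashmap-style rate limiter (my reading of the idea,
// not Mailinator's actual code). Counts emails per IP; the count ages
// out after a quiet period; past a threshold the IP is refused.
public class AgingIpFilter {
    private static final int  MAX_EMAILS = 100;     // assumed threshold
    private static final long QUIET_MS   = 60_000;  // assumed aging window

    private static final class Entry {
        int count;
        long lastSeen;
    }

    private final Map<String, Entry> byIp = new ConcurrentHashMap<>();

    // Returns true if the email should be accepted.
    public boolean accept(String ip, long nowMs) {
        Entry e = byIp.computeIfAbsent(ip, k -> new Entry());
        synchronized (e) {
            if (nowMs - e.lastSeen > QUIET_MS) {
                e.count = 0;                 // sender went quiet: forget history
            }
            e.lastSeen = nowMs;
            return ++e.count <= MAX_EMAILS;  // block floods from this IP
        }
    }
}
```

The same shape plausibly works for the subject filter: key on the subject string, count within a window, and ban the subject for a fixed period once the count trips.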
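
Likewise, a minimal sketch of the compress-on-arrival idea: keep each email gzipped in RAM and pay the decompression cost only on the rare read. GZIP is my assumption; the article doesn't name the codec Mailinator uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Store emails compressed in RAM; decompress only when someone reads one.
// GZIP is an assumption -- the article doesn't name the codec.
public final class CompressedEmail {
    private final byte[] gzipped;

    public CompressedEmail(String rawEmail) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(rawEmail.getBytes(StandardCharsets.UTF_8));
        }
        this.gzipped = buf.toByteArray();  // typically a fraction of the original
    }

    // Paid only on the ~1% of emails that are ever opened.
    public String read() throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```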

    Lessons Learned

  • Perfection is a trap. How many systems are made much more complicated by the drive to be 100% everything? If you've been in those meetings you know what they're like: oh, we can't do it this way or that way because there's a .01% chance of something going wrong. Instead ask: how imperfect can you be and still be good enough?
  • What you throw out is as important as what you keep in. We have many preconceptions about how to design systems. We may take for granted that you need to scale out, that you need email accessible days later, and that you must provide private accounts for everyone. But do you really need these things? What can you toss?
  • Know the purpose of your system and design accordingly. Being everything to everyone means you are nothing to nobody. Keeping emails for a short period of time, allowing some SPAM to get through, and accepting less than 100% uptime create a strong vision for the system that helps drive the design in all areas. You would only build your own SMTP server if you had a very strong idea of what your system was about and what you needed. I know this would never have occurred to me as an idea. I would have added more hardware.
  • Fail fast for the common case before committing resources. A high percentage of email is rejected, so it makes sense to reject it as early as possible in the stack to minimize the resources spent on it. Figure out how to short-circuit frequently failed items as fast as possible. This is an important and often overlooked scaling strategy.
  • Efficiency often means build it yourself. Off-the-shelf tools tend to do the whole job. If you only need part of the job done, you may be able to write a custom component that runs much faster.
  • Adaptively forget. A little failure is OK. All the blocked IP addresses don't need to be remembered forever. Let the block decisions build up from local data rather than global state. This is a powerfully simple and robust architecture.
  • Java doesn't have to be slow. Enough said.
  • Avoid the disk. Many applications need to hit the disk, but the disk is always a bottleneck. Can you design around the disk using other creative strategies?
  • Constrain resource usage. Put in constraints, like inbox size, that will keep your system from spiking uncontrollably. Unconstrained resource usage must be avoided when resources are limited.
  • Compress data. Compression can be a major win when trying to conserve RAM. I've seen memory usage drop by more than half when using compression, with very little overhead. If you are communicating locally, just have the client encode the data and keep it encoded. Build APIs to access the data without having to decode the full message.
  • Use fixed size resource pools to handle load. Many applications don't control resource usage, like memory, and they crash when too much is used. To create a really robust system, fix your resources and drop work when those resources are full. You can age resources, give priority access, give fair access, or use any other logic to arbitrate resource access, but because the resource is limited you will stay up under load. (A sketch of a bounded pool appears after this list.)
  • If you don't keep data it can't be subpoenaed. Because Mailinator doesn't store email or logs on disk, nothing can be subpoenaed.
  • Use what you know. We've seen this lesson a few times. Paul knew Java better than anything else, so he used it, made it work, and got the job done.
  • Find your own Mailinators. Sure, Mailinator is a small system. In a large system it would just be a small feature, but your system is composed of many Mailinator sized projects. What if you developed some of those like Mailinator?
  • KISS exists, though it's rare. Keeping it simple is always talked about, but rarely are we shown real examples. It's mostly just "your way is complex and my way is simple because it's my way." Mailinator is a good example of simple design.
  • Robustness is a function of architecture. To create a design that efficiently uses memory and survives massive spam attacks required an architectural approach that looked at the entire stack.
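
As a concrete instance of the fixed-size-pool and constrained-inbox lessons, here is a hypothetical bounded inbox built on Java's LinkedHashMap eviction hook. The 10-email cap comes from the article; the class itself and its key/value types are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a fixed-size, aging resource pool: a bounded inbox that
// silently evicts its oldest email when full, so memory use is capped
// no matter how hard an address is hammered.
public class BoundedInbox extends LinkedHashMap<String, byte[]> {
    private static final int MAX_EMAILS = 10;  // per-inbox cap from the article

    public BoundedInbox() {
        // accessOrder=false: evict by insertion order (oldest email first)
        super(16, 0.75f, false);
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > MAX_EMAILS;  // drop work instead of growing unbounded
    }
}
```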

    Related Articles

  • PlentyOfFish champions straightforward bare-bones simplicity.
  • Varnish smartly uses OS features to find incredible performance.
  • ThemBid gracefully pieces together open source components.


Tuesday
Jan 22, 2008

The high scalability community

Hi. First of all, thanks for creating a GREAT resource on high scalability architecture. For us building high scalability solutions from the west coast of (tiny) Norway, good input on the subject isn't always abundant. Which leads me to my next question: are there any events or conferences on high scalability / SaaS in the US or internationally that any of you would recommend architects or data center managers attend?


Monday
Jan 21, 2008

Product: Hyperic

From Wikipedia: Hyperic HQ is a popular open source IT operations system and network monitoring application. It auto-discovers all system resources and their metrics, including hardware, operating systems, virtualization, databases, middleware, applications, and services. It watches hosts and services that you specify, alerting you when things go bad and again when they get better. It also provides historical charting and event correlation for faster problem identification. The Hyperic HQ server is a distributed J2EE application that runs on top of the open source JBoss Application Server. It is written in Java and portable C code and runs on Linux, Windows, Solaris, HP-UX, and Mac OS X. The Hyperic HQ Portal is a Java and AJAX user interface that includes:

  • Inventory/application model and host hierarchy
  • Monitoring of network services (SMTP, POP3, HTTP, NNTP, ICMP, SNMP)
  • Monitoring of host resources (processor load, disk usage, system logs)
  • Remote monitoring supported through SSH or SSL encrypted tunnels
  • Continuous auto-discovery of system resources including hardware, software, and services
  • Tracking of log and configuration data
  • Remote resource control for corrective actions such as starting and stopping services, vacuuming a database table, or snapshotting a VM
  • Ability to define event handlers to be run during service or host events for proactive problem resolution
  • Problem resource identification and root cause analysis
  • Event correlation
  • Alerting via email, pager, text messaging, or RSS when service or host problems occur or get resolved
  • Security/access control
  • A simple plug-in design that lets users easily develop their own service checks using the tools of their choice (XML, J2EE, Bash, C++, Perl, Ruby, Python, PHP, C#, etc.)

I met Javier Soltero, the CEO of Hyperic, at the Velocity Web Performance and Operations dinner. I hit him with my best stuff and he didn't flinch a bit. Javier showed a deep understanding of the issues, a real passion for his product and the space, and the knowing good humor of someone who has been through a few wars and learned a little something along the way. I don't know if that translates to an excellent product, but it would at least make me take a look.


Thursday
Jan 17, 2008

Load Balancing of web server traffic

How do you detect the occurrence of congestion in the network? And what parameters should a load balancer use?


Thursday
Jan 17, 2008

Database People Hating on MapReduce

Update: Typical Programmer tackles the technical issues in Relational Database Experts Jump The MapReduce Shark. The culture clash is still what fascinates me. David DeWitt writes in the Database Column that MapReduce is a major step backwards:

  • A giant step backward in the programming paradigm for large-scale data intensive applications
  • A sub-optimal implementation, in that it uses brute force instead of indexing
  • Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  • Missing most of the features that are routinely included in current DBMS
  • Incompatible with all of the tools DBMS users have come to depend on

Listening to databasers and map reducers talk is like eavesdropping on your average family holiday mashup. Every holiday, people who have virtually nothing in common are thrown together because they incidentally share a little DNA or are married to the shared DNA. In desperation everyone gravitates to some shared enemy they can all confidently bash. But after that moment of relief, awkward silence once again looms, and nothing is left but more drinking and tackling sensitive topics you just know will end badly.

Database folks love their schemas, relational purity, and their swiss army knife indexes. You soon learn that map reduce is really just another form of index, and that indexes really can scale to any heights with just a little tweaking. Map reducers love their pure functional models, their self-healing, cluster-filled ecosystems, and the sheer joy of letting 10,000 CPUs simultaneously bloom.

I for one stand firmly by the relish tray. Transactions have a place and so does structured data. That's why Google contributes heavily to MySQL. Yet I too like my map reduce engine and distributed file system combo platter. With map reduce I can implement any complex behavior over any data set. With enough machines that work can be performed in a predictable amount of time. You aren't limited to set logic, SQL types, and tweaked indexes. That's pretty good stuff too.

Much like a staunchly conservative nail-crunching father and his too-soft pansy liberal son, these two camps will never understand each other. Every sign of beauty in one person's eyes is just another confirmation to the other side of impending senility. Why even try? Just hug in a manly way and agree to meet again next year.


Thursday
Jan 17, 2008

Moving old to new. Do not be afraid of the re-write -- but take some help

Recently I had to help users on one of my open source projects, ISPMan (http://ispman.net). This project started in 2001 because I was unwilling to take care of the DNS and virtual hosting stuff; it was a side-thing at the company I worked for, so I wrote software that took care of all those little details.

Summary: a large project that needs a rewrite can be done in a matter of days. I will not give you a full case study about a project that went through a rewrite, but a case study about how easy it is to rewrite something.

Details: My boss was cool enough to let me open-source the project, and obviously I got a lot of cool-cred out of it. Later on I also did some support and implementation and earned quite some money with it. Eventually I had to let the project go out of my hands to the community, as I had only built it to facilitate a job I wasn't willing to do (set up DNS zones on multiple servers, find out which host should host the website and put a VirtualHost section there, find out which mail server should handle the mailbox and create it, etc.).

The project was quite successful and there are a number of users using it. One of the project members took it upon himself to be the project manager and has been running the project since. I have been out of this project almost 5 years (not much free time: I have two kids, 4.5 years and 8 months old, and my job was very demanding). The stress from my job has eased a bit now (it took me a good 3.5 years to bring them to a "stable" state -- actually an oxymoron when we are talking about high scalability).

Back to the topic. We have had complaints about this feature or that missing from this open source product. I tried to put in a feature, thinking "I was the main author; how complicated could it be for me to add this new feature?", and eventually went "What the ...". Yup, I hated my code. I hated everything about it (as a programmer, as a sysadmin, as an op). Yes, it was me who coded it 7 years ago, and me who insisted it be like that. But times have changed. It's not 2001 anymore.

So I want to rewrite it. It's not the first time the idea of a rewrite has come up. A couple of years ago the senior members (anyone who had to deal with the code) wanted a rewrite. It's not extensible, it's not pluggable, blah, blah, blah. Yup, it was not written to be. It was just written so I didn't have to do work I didn't want to do, and it served me well. So if you want more, extend it, rewrite it, fork it; you get my point (from the guy who wrote an app and open-sourced it). I understand the point of view of the developers too: "Why do I have to know how your LDAP schema works, or how the Cyrus mail server works, etc.?"

The rewrite part: using a PHP framework (http://framework.zend.com -- something that almost did not exist in 2001), I was able to get something up and running in a couple of hours: http://ispdirector.net. This whole thing would not have been possible without:

  • PHP5 (we moved from Perl to PHP; in 2001 PHP was really "Pretty Home Pages")
  • The Zend Framework (open source frameworks were scarce in 2001)
  • Some experience that LDAP, as great as it is, should not be where you put all your eggs

This particular example does not say that Perl (or any other language) is bad and PHP is good, or that LDAP (or any other directory) is bad and MySQL is good; it's just how we did it for this particular project.

Oh, and I forgot about the "take some help" part. Telling your colleagues that they have to move to a new API can be more difficult than hearing a few blahs from the secretaries moved from XP to Vista. This was really about moving from Perl to PHP, or from PHP to Java, from Java to Perl, from Perl to Ruby. The whole point is that it does not matter. If you can do X faster than Y, then I'll take you (in a compute-intensive scenario). If you do all your calculations this way, it might go somewhere.


Wednesday
Jan 16, 2008

Strategy: Asynchronous Queued Virus Scanning

Atif Ghaffar has a nice strategy for dealing with virus scanning of uploads:

  • Upload item into a safe area. If necessary, the uploader blocks waiting for a result.
  • Queue a work order into a job system so all the work can be distributed throughout your cluster.
  • A service in your cluster performs the virus scan and informs the uploader of the result.
  • Move the vetted item into your system.

This removes the CPU bottleneck from your web servers and distributes it through your cluster. Keep your web servers providing prompt service to users. Let your cluster do the heavy lifting. This minimizes response time and maximizes throughput. A similar system can be used for creating thumbnails, transcoding, copyright checks, updating indexes, event notification, or any other kind of intensive work. A minimal sketch of the pattern appears below.
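
Here is a minimal Java sketch of the queued-scanning pattern, assuming an in-process queue and a single worker thread. In a real deployment the queue would be a clustered job system (a database table, Gearman, or similar) and scan() is a hypothetical stand-in for your actual virus scanner.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of queued virus scanning: the web tier enqueues a work order and
// returns immediately; a worker (here a thread, in production a separate
// cluster node) scans the file and moves or rejects it.
public class ScanQueue {
    private final BlockingQueue<Path> jobs = new LinkedBlockingQueue<>();

    // Called by the upload handler: file is already in a quarantine area.
    public void submit(Path quarantinedFile) {
        jobs.add(quarantinedFile);
    }

    public void startWorker(Path vettedDir) {
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    Path file = jobs.take();  // blocks until work arrives
                    if (scan(file)) {         // stand-in for a real scanner
                        Files.move(file, vettedDir.resolve(file.getFileName()));
                    } else {
                        Files.delete(file);   // infected: reject
                    }
                } catch (Exception e) {
                    e.printStackTrace();      // real code: retry/alert
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Hypothetical hook -- wire up clamd or whatever scanner you run here.
    private boolean scan(Path file) {
        return true;
    }
}
```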
