Update: Google added videos on Cluster Computing and MapReduce. There are five lectures: Introduction, MapReduce, Distributed File Systems, Clustering Algorithms, and Graph Algorithms. Advanced website design depends on deep distributed system design knowledge. Where do you get this knowledge? Try Google. They have a a whole Code for Educators program with tutorials and lectures on AJAX programming, distributed systems, and web security. Looks pretty nice.
Hi gurus, I'm totally new to this high scalability thing. I'm trying to create a website with scalability in mind (personal project). In my application I'll have forums for different groups of people (each group will have their own forums, members of groups can still post in other groups' forums but each group will mainly be using their forums most of the time). Now, I'm going to start with about 2000 groups with the potential of reaching up to 10000 groups (this is the maximum due to the nature of my application). I was thinking that having all posts in one table will be way too much for one table (esp. that some groups are expected to post hundreds or even thousands times per day, let's say about 500 of the groups, the rest of the groups won't be that active though) as I'll have to index the PostID, ParentPostID, GroupID and PostDate which can produce large indexes (consequentially causing slow inserts) if having everything in one table. So, I'm thinking of a way to divide the posts in many tables, here are some of the things I thought of: 1. Creating a separate table for every group e.g. ForumsPosts_x, where x is the GroupID (which has its own pros and cons, some of the pros that I can have small indexes and also use identity columns, I also assume it should be easy to move the tables to other databases should the application grow. Well, I posted this idea on some other forums and most people told me it's a sign of bad design if I have thousands of tables in my database. I was also concerned how to design my DAL if I do this. Should I use sprocs with dynamic SQL or use SQL text directly in my DAL code and what about the query plan caching if having a large number of tables .. so many problems here!) 2. Put everything in one table and if the site grows move some of the groups to another database (I'm concerned though about having many databases on the same machine, will it affect performance? of course I won't have hundreds of databases on the same machine but may be about 5 or even 10 databases on the same machine) I also have some other questions: I'm going to use ASP.NET for this project, I was planning initially to use SQL Server as a database but I'm worried about the SQL Server part and the cost of growth, should I consider an alternative like MySQL? But how will it perform with ASP.NET though in a high scalability scenario? Any suggestions are highly appreciated...
Update: A fun exploration of applied searching in How to search for the word "pen1s" in 185 emails every second. When indexOf doesn't cut it you just trie harder. Has a drunken friend ever inspired you to create a first of its kind internet service that is loved by millions, deemed subversive by thousands, all while handling over 1.2 billion emails a year on one rickity old server? That's how Paul Tyma came to build Mailinator. Mailinator is a free no-setup web service for thwarting evil spammers by creating throw-away registration email addresses. If you don't give web sites you real email address they can't spam you. They spam Mailinator instead :-) I love design with a point-of-view and Mailinator has a big giant harry one: performance first, second, and last. Why? Because Mailinator is free and that allows Paul to showcase his different perspective on design. While competitors buy big Iron to handle load, Paul uses a big idea instead: pick the right problem and create a design to fit the problem. No more. No less. The result is a perfect system architecture sonnet, beauty within the constraints of form. How does Mailinator carry out its work as a spam busting super hero? Site: http://mailinator.com/
Hi, First of all; thanks for a creating a GREAT resource on high scalability architecture. For us building high scalability solutions from the west coast of (tiny) Norway good input on the subject isn't always abundant. Which leads me to my next question; Are there any events or conferences on high scalability / SaaS in the US or internationally that any of you would recommend architects or data center managers to attend?
From Wikipedia: Hyperic HQ is a popular open source IT Operations computer system and network monitoring application software. It auto-discovers all system resources and their metrics, including hardware, operating systems, virtualization, databases, middleware, applications, and services. It watches hosts and services that you specify, alerting you when things go bad and again when they get better. It also provides historical charting and event correlation for faster problem identification. The Hyperic HQ server is a distributed J2EE application that runs on top of the open source JBoss Application Server. It is written in Java and portable C code and runs on Linux, Windows, Solaris, HP-UX and Mac OS X. Hyperic HQ Portal is a Java and AJAX User Interface that includes: * Inventory/Application Model & host hierarchy * Monitoring of network services (SMTP, POP3, HTTP, NNTP, ICMP, SNMP) * Monitoring of host resources (processor load, disk usage, system logs) * Remote monitoring supported through SSH or SSL encrypted tunnels. * Continuous Auto-Discovery of system resources including hardware, software and services * Track log & configuration data * Remote resource control for corrective actions such as starting and stopping services, vacuum database table, or snapshotting a VM * Ability to define event handlers to be run during service or host events for proactive problem resolution * Problem Resource Identification & Root Cause Analysis * Event Correlation * Alerting when service or host problems occur or get resolved via email, pager, Text messaging, RSS * Security/Access Control * Simple plug-in design that allows users to easily develop their own service checks depending on needs, by using the tools of choice (XML, J2EE, Bash, C++, Perl, Ruby, Python, PHP, C#, etc.) I met Javier Soltero, the CEO of Hyperic at the Velocity Web Performance and Operations dinner. I hit him with my best stuff and he didn't flinch a bit. Javier showed a deep understanding of the issues, a real passion for his product and the space, and the knowing good humor of someone who has been through a few wars and learned a little something along the way. I don' know if that translates to an excellent product, but it would at least make me take a look.
Update: Typical Programmer tackles the technical issues in Relational Database Experts Jump The MapReduce Shark. The culture clash is still what fascinates me. David DeWitt writes in the Database Column that MapReduce is a major step backwards:
Recently I had to help users on one of my opensource project ISPMan. http://ispman.net This project started in 2001 as I was too unwilling to take care of the DNS and VitualHosting stuff as it was a side-thing to the company I worked for (so i wrote a software that took care of all these little details) Summary: A large project that needs a rewrite can be done in a matter of day. I will not give you a full case study about a project that went through a re-write but a case study about how easy it is to re-write something. Details: My boss was cool enough to let me open-source the project and obviously, I got a lot of cool-cred out of it. Later on I also did some support and implementation and earned quiet some money with it. Eventually I had to let the project go out of my hand to the community as I only did it to facilitate a job that wasnt williing to do. (Setup DNS zones of multiple servers, find out which host should host the website and put VirtualHost section there, find out which mail server should take care of the mailbox and create a mailbox, etc) The project was quiet successful and there are a number of users who are using it. One of the project members took himself to be the project manager and has been running the project since. I have been out of this project almost 5 years (not much time, I have had 2 kids 4.5 years and 8 months old, and my job was very demanding). The stress from my job has weakened a bit now (It took me really 3.5 years to bring them to a stable "actually an oxymoron when we are talking about high scalability" state). Back to the topic. We have had having complains about not having this feature and that for this opensource product. I tried to put in this feature... "I was the main authour. How complicated would it be for me to add this new feature.." and eventuall I went "What the ...". Yup, I hated my code, I hated everything about it. (as a programmer, as a sysadmin, as an op). Yes it was me coded it 7 years ago, it was me who insisted it to be like that... etc. But times have changes.. Its not 2001 any more. So I want to re-write it. Its not the first time that the idea of re-write was in place. Couple of years the senior members (any one who had to deal with the code) wanted to re-write. Its not extensible, its not pluggable, blah, blah, blah. Yup it was not written to do so. I was just written so I dont have to do the work which I did not wanted to and it served it welll... So if you want more extend it, re-write it, fork-it, you get my point (from the guy who wrote an app and opensourced it). I understand that point of view of the developers too "Why do I have to know how you LDAP scheme works, or how does Cyrus mail server works, etc" The re-write part: Using a PHP framework ( http://framework.zend.com ) ( something that almost did not exist in 2001 ) I was able to get something up and running in a couple of hours. http://ispdirector.net This whole thing was not possible without. * PHP5 (we moved from perl to php, in 2001 PHP was really Pretty Home Pages) * Zend framework ( opensource frameworks were scarce in 2001) * Some experience that LDAP as great it is, should not be where you put all your eggs This particular example does not say that perl(or any other language ) is bad and php is good or ldap(or any other directory) is bad and mysql is good , its just how we did it for this particuliar project. Oh and forgot about the "Get Some Help" Tellling your colleagues that they have to move to a new API can be more difficult then to hear a few blah and blah by the secreteries moved from XP to Vista This was about really moving from Perl to PHP, from PHP to Java, from Java to Perl, from Perl to Ruby. The whole point is that it does not matter. If you can do X faster than Y than I take you (In compute intensive scenario) If you do all your calculations this way, it might go somewhere.
Atif Ghaffar has a nice strategy to deal with virus checking uploads:
Sun is buying MySQL for $1 billion. The MySQL team has worked long and hard so I don't begrudge them their pay day. Strike while the iron is offering a lot of cash I say. And I have nothing against Sun. Yet I can't help but think this changes the mental calculation of what database to use. When Oracle acquired Innobase a new independent storage engine was needed for MySQL. How is this different? Does this change your thinking any? Would Martha say it's a good thing? Like Luke I've searched my feelings, but the force is not with me and I don't really know how I feel about it.