How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text file stored on each machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single machine MySQL version. Inserts quickly became the bottleneck as the huge torrents of data flooding caused a lot of index churn. Perdiodic bulk loading was the remedy to this problem, but the shear size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data this solution broke down with a combination of load and operational problems. Facing exponential growth they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time until and a more scalable solution would need to be created in the future anyway. The future came a little early this year. The advantage of their new system is that they can now look at their data in anyway they want:
Hi, I am using drupal for my clients website, and was thinking is it possible to host all ( about 500) of them on the same server(maybe VPS or dedicated). Here is the situation..... Each clients website has a database with about 50 tables each, all the databases are small in size about 2-5 MB .... and the websites are low traffic websites with say.. 50 hits/day on avg.... that means about 2000 queries/db/day ..... (avg 40 queries per hit).... Wanted to know if it is possible to have so many databases about 500 on the same server? what are the things that i should look into if i should make this happen?
I am planning the scaling of a hosted service, similar to typepad etc. and would appreciate feedback on my plan so far. Looking into scaling storage, I have come accross MogileFS and OpenAFS. My concern with these is I am not at all experienced with them and as the sole tech guy I don't want to build something into this hosting service that proves complex to update and adminster. So, I'm thinking of building replication and scalability right into the application, in a similar but simplified way to how MogileFS works (I think). So, for our database table of uploaded files, here's how it currently looks (simplified): fileid (pkey) filename ownerid For adding the replication and scalability, I would add a few more columns: serveroneid servertwoid serverthreeid s3 At the time the user uploads a file, it will go to a specific server (managed by the application) and the id of that server will be placed in the "serverone" column. Then hourly or so, a cron job will run through the "files" table, and copy any files that haven't been replicated (where servertwo and serverthree are null) to other servers. Another cron will copy files to Amazon's s3 for an extra backup (if null then copy to s3). Now at the client level, when the page to display the file is loaded, it will know which of the three servers it can pull the file from. If one server goes down, the application will know and use one of the other servers. When storage capacity runs low, another server is added with a big drive, perhaps not even having raid on it. These servers will also be used for php serving through load balancing. I'm probably missing some big drawbacks of this approach but it appeals to me that it should be quite simple to implement and be less complex to adminster than systems like MogileFS which would present a lot more unknowns.
One of the most interesting new features of Oracle 11 is the new function result caching mechanism. Until now, making sure that a PL/SQL function gets executed only as many times as necessary was a black art. The new caching system makes that quite easy -- here is how it works.
OK, I know this site is for scalable web site design. But as there aren't any sites I can find for graceful failure under "slashdotted" like pressure I'll ask here. Does anyone have a sensible way, once you have a "web application" that either won't scale, or can't scale, that you can give some users a good consistent experience and bounce other users to a busy site page. I have seen sites do this to varying degrees, some of which work better than others, but no explanations beyond simply bouncing requests to a "we're busy page server" when you have more than a given number of connections. This is obviously useless as a web page likely requires multiple connection (ignoring keep-alive, pipelining etc) multiple connection to completely render properly. The normal problem is users getting a page and not the "furniture" for that page like images or css. Other problems are having to wait ages to get the busy page or the site being slow even if you do "get in". And some site let a user "in" and then as they browse around they get bounced out suddenly to the busy page. Obviously not being the developer for sites I deal with (I am an infrastructure bod) I can't solve the problem where it should have been pre-emptively solved. That is to say I can't write the code to be scalable or re-write the code to do some simple session filtering or the like (and not being a developer I get dirty looks when I point developers at information like your site ... I can hear them thinking "how dare you suggest I don't know how to code a web site you lowly infrastructure cretin"). Before developer on-line lynch me I should point out that sometimes the cause of not being able to scale a site is that I can't get in new hardware quick enough, but then who knows when you will get slashdotted right ?. So my question applies even when a developer of genius level brilliance has built a unsurpasibly scalable web site for me to run the infrastructure for. My best guess so far is using something like HAProxy to load balance sessions, and then use it's more advanced total session count, and cookie issuing abilities to track users and bounce some at a given "heavy load" point. This isn't ideal as the heavy load point would have to be based on connection counts not server load or server response times, but it's the best I can come up with so far. Also, having mentioned brilliant developers writing great sites not always making my question redundant, could I ask, do people normally think about coping with overload when designing scalable solution - surely they should but I don't see much talk about it. Couldn't a simple Java filter or the equivalent for other things be built into applications ? It'd be nice to have a site that not only scales, but "is nice" when waiting for the infrastructure it runs on to be scaled, which could be several days when you have to purchase new hardware.
Before you proceed make sure you have physical volume(something like /dev/sda1, /dev/sda4, etc) with no data. This is going to be the gfs volume which you will export to other nodes. It should be on the node which is going to be your gnbd server. If you dont have such volume create one using fdisk. I used mounted gfs volume as a DOCUMENT ROOT for my Apache server nodes(Load Balanced). I tried it on FC4 64-bit. If you plan to try it on any other distribution or 32-bit arch.. still the procedure remains same. Since I built it from source but not RPMs, you may have to simply supply config options with a different CFLAGS. Full details at http://linuxsutra.chakravaka.com/redhat-cluster/2006/11/01/howto-gfs-gnbd
All, I'm looking for a faster/reliable solution for DR/BC as well as for sclability for my web/db servers. I came across VMWare Infrastructure and other products. The I/O performance concerns me to go with virtual servers. I'm also looking into imaging software such as Acrnois. Could anyone share their thoughts on how it's being done with bigger names such as google/youtube etc..? Thank you, Regards, Janakan Rajendran.
From FRESH Ports and their website:
ISPman is an ISP management software written in perl, using an LDAP
backend to manage virtual hosts for an ISP. It can be used to manage,
DNS, virtual hosts for apache config, postfix configuration, cyrus
mail boxes, proftpd etc.
ISPMan was written as a management tool for the network at 4unet where
between 30 to 50 domains are hosted and the number is crazily growing.
Managing these domains and their users was a little time consuming,
and needed an Administrator who knows linux and these daemons
fluently. Now the help-desk can easily manage the domains and users.
LDAP data can be easily replicated site wide, and mail box server can
be scaled from 1 to n as required. An LDAP entry called maildrop
tells the SMTP server (postfix) where to deliver the mail. The SMTP
servers can be loadbalanced with one of many load balancing
techniques. The program is written with scalability and High
availability in mind.
This may not be the right software for you if you want to run a small
ISP on a single box or if you want to use this software as an LDAP
editor or a DNS management software by itself.
ISPMan is written mostly in Perl and is based on four major components. All these components are based on open standards and are easily customizable.
Windows and SQL Server : Receive so much negativity in terms of the Highly Available, Scalable Platform..
I remain neutral, but time and again, when people talk Windows or SQL Server, they seem to consider them unreliable with limits around scalability, performance and availability. And then you start looking at some of the big boys you have listed here in the architectural section and most of them are on Linux, MySQL,Oracle platforms that we dont see Windows and SQL Server in there.. What are your thoughts ?
Where do you draw the line between scalability vs Performance vs High Availability vs Reliability? I guess at the end of the day, we all want to be highly available, great performance and always reliable. So is it safe to say that scalability is the answer ? Also when do you start to think scale out vs scale up ?