Hi, the website I work for is looking to build an email system that can handle a fair few emails (up to a hundred thousand a day). These include registration emails, newsletters, lots of user-triggered emails and overnight emails. At present we queue them in SQL and feed them into an SMTP server on one of our web servers when the queue drops below a certain level. This has caused our mail system to crash as well as hammer our DB server (shared!!!). We have an architecture for what we want to build, but thought there might be something we could buy off the shelf that lets us keep templated emails and lists of recipients, schedule sends, etc., and report on it all. We can't find anything. What do big websites like Amazon use, or people a little smaller who still send loads of mail (Flickr, Ebuyer, or other ecommerce sites)? Cheers, tarqs
Update: Yahoo! Launches World's Largest Hadoop Production Application. A 10,000 core Hadoop cluster produces data used in every Yahoo! Web search query. Raw disk is at 5 Petabytes. Their previous 1 petabyte database couldn't handle the load and couldn't grow larger. Greg Linden thinks the Google cluster has way over 133,000 machines. From an InfoQ interview with project lead Doug Cutting, it appears Hadoop, an open source distributed computing platform, is making good progress towards their 1.0 release. They've successfully reached a 1000 node cluster size, improved file system integrity, and jacked performance by 20x in the last year. How they are making progress could be a good model for anyone:
The speedup has been an aggregation of our work in the past few years, and has been accomplished mostly by trial-and-error. We get things running smoothly on a cluster of a given size, then double the size of the cluster and see what breaks. We aim for performance to scale linearly as you increase the cluster size. We learn from this process and then increase the cluster size again. Each time you increase the cluster size reliability becomes a bigger challenge since the number and kind of failures increase.
It's tempting to say just jump to the end game and don't bother with all those errors and trials, but there's a lot of learning and experience that must be earned on the way to scaling anything.
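One way to make "scale linearly" concrete when doubling a cluster is to compare the measured speedup against the ideal one. The sketch below is only an illustration: the function name and the throughput figures are hypothetical, not Hadoop measurements.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Return measured speedup divided by ideal (linear) speedup.

    1.0 means perfectly linear scaling; lower values mean that adding
    nodes bought less than a proportional gain.
    """
    ideal_speedup = nodes / base_nodes
    actual_speedup = throughput / base_throughput
    return actual_speedup / ideal_speedup


# Hypothetical measurements: (cluster size, jobs per hour).
measurements = [(125, 1000.0), (250, 1900.0), (500, 3400.0), (1000, 5600.0)]

base_nodes, base_throughput = measurements[0]
for nodes, throughput in measurements:
    eff = scaling_efficiency(base_nodes, base_throughput, nodes, throughput)
    print(f"{nodes:5d} nodes: {eff:.0%} of linear")
```

A falling percentage after a doubling is the cue to stop and find what broke before growing the cluster again.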
A site I'm working with has an I/O bottleneck. They're using a static server to deliver all of the pictures, video content, zip downloads, et cetera, but now that the bandwidth out of that server is approaching 50 Mbit/second, the latency on serving small files has become unacceptable. I'm curious how other people have dealt with this situation. Separating into two different servers would require a significant change to the site's architecture (because the premise is that all uploads go into one server, all subdirectories are created in one directory, etc.) and may not really solve the problem.
Have a few doubts. Here are the questions: 1) Is there any limit on the number of databases that can be accessed simultaneously (MySQL)? 2) Will it be a problem to scale in the future if there are a large number of small databases (2-5 MB each)?
Perhaps this question is borderline off-topic, but since high scalability solutions often have a global aspect I will give it a try... Has anybody had any experience with different techniques for speeding up their application for users in places with poor ping response times? Ideally I would love to be running only one data center worldwide, but I know that one day our sales department will sign up a customer with an unacceptable response time... Could installing a web accelerator in front of our application extend the reach of our current data center, or will we just add complexity and another source of potential errors?
Being an authentic human being is difficult and apparently authenticating all those S3 requests can be a bit overwhelming as well. Amazon fingered a lot of processor-heavy authentication requests as the reason for their downtime:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types. Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.

Interesting problem. The same thing happens with sites using a lot of SSL. They need to purchase specialized SSL concentrators to handle the load, which makes capacity planning a lot trickier and more expensive. In the comments Allen conjectured:

What caused the problem however was a sudden unexpected surge in a particular type of usage (PUT's and GET's of private files which require cryptographic credentials, rather than GET's of public files that require no credentials). As I understand what Kathrin said, the surge was caused by several large customers suddenly and unexpectedly increasing their usage. Perhaps they all decided to go live with a new service at around the same time, although this is not clear.

We see these kinds of bring-up problems all the time.
The Skype failure was blamed on software updates which caused all nodes to re-login at the same time. Bring up a new disk storage filer and, if you aren't load balancing requests, all new storage requests will go to that new filer and you'll be down lickety-split. Booting is one of the most stressful times on large networks. Bandwidth and CPU both become restricted, which causes a cascade of failures. ARP packets can get dropped or lost and machines never get their IP addresses. Packets drop, which causes retransmissions, which chew up bandwidth, which uses CPU and causes more drops. CPUs spike, which causes timeouts and reconnects, which again spirals everything out of control. When I worked at a set-top company we had the scenario of a neighborhood rebooting after a power outage: lots of houses needing to download large boot images over asymmetric low-bandwidth cable connections. As a fix we broadcast boot image blocks to all set-tops. No set-top performed your typical boot image download. Worked like a charm. Amazon's problem was a subtle one in a very obscure corner of their system. It's not surprising they found a weakness. But I'm sure Amazon will be back even bigger and better once they get their improvements online.
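The broadcast fix above is specific to boot images. A different and widely used way to keep a crowd of clients from reconnecting in lockstep is exponential backoff with random jitter. The sketch below illustrates that swapped-in technique only; it is not what Skype, Amazon, or the set-top system did, and the names `backoff_delay` and `reconnect` and the default values are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, ceiling] so that clients which
    failed at the same moment do not all retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def reconnect(connect, max_attempts=8, sleep=time.sleep, rng=random):
    """Call connect() until it succeeds, sleeping a jittered delay between tries.

    `connect` is any caller-supplied function that raises ConnectionError
    on failure. The last failure is re-raised once attempts run out.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

The randomness is the important part: doubling the delay alone still leaves every client retrying at the same instants, just further apart.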
How do you plan to scale your system as you reach predictable milestones? This topic came up in another venue and it reminded me of a great comment an Anonymous wrote a while ago, and I wanted to make sure that comment didn't get lost.
The Anonymous scaling plan was relatively simple and direct:
My two cents on what I'm using to start a website from scratch using a single server for now. Later, I'll scale out horizontally when the need arises.
Let's suppose I have a table which stores tags. A user can enter keywords, and I have to search through all the records in the table and find the posts which contain the tags entered by the user. The user can enter more than one keyword. What strategy or technique should I use to search fast? There may be millions of records and many users firing the same query. Thanks
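One common approach to this kind of lookup is an inverted index: map each tag to the set of posts carrying it, so a query touches only the sets for the keywords entered instead of every record. The sketch below is in-memory and only illustrative; the `TagIndex` class and its methods are hypothetical names, and in a database the same idea would usually be an indexed (tag, post_id) table.

```python
from collections import defaultdict


class TagIndex:
    """Inverted index from normalized tag to the set of post ids using it."""

    def __init__(self):
        self._posts_by_tag = defaultdict(set)

    @staticmethod
    def _normalize(tag):
        # Treat "MySQL", " mysql " and "mysql" as the same tag.
        return tag.strip().lower()

    def add(self, post_id, tags):
        """Record that `post_id` carries each tag in `tags`."""
        for tag in tags:
            self._posts_by_tag[self._normalize(tag)].add(post_id)

    def search(self, keywords, match_all=True):
        """Return post ids matching all keywords, or any if match_all is False."""
        sets = [self._posts_by_tag.get(self._normalize(k), set()) for k in keywords]
        if not sets:
            return set()
        if match_all:
            # Start from the smallest set so the intersection shrinks quickly.
            sets.sort(key=len)
            result = set(sets[0])
            for s in sets[1:]:
                result &= s
            return result
        return set().union(*sets)


index = TagIndex()
index.add(1, ["MySQL", "scaling"])
index.add(2, ["mysql", "cache"])
print(index.search(["mysql", "cache"]))
```

Since many users fire the same query, the result of `search` for a given keyword set is also a natural thing to cache.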
We have a lot of dependencies on our SQL databases and we have heard that caching helps a lot with scaling and performance. So the question is: what are some reliable software products out there that we could consider in this space? We want to put a lot of frequently made database calls, for data that does not change frequently, into this caching layer. Also, what would be an easy way to move only the database changes into the cache, as opposed to reloading or pulling everything into the cache every few minutes or hours? We need something smart that would just push changes to the caching layer as they happen. I guess we could build our own, but are there any good reliable products out there? Please also mention pricing, 'cos that would be a determining factor as well. Thanks
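The "push changes as they happen" idea is often called a write-through cache: every write goes to the database and updates the cached copy in the same step, so nothing needs to be reloaded on a timer. Below is a minimal sketch under loud assumptions: a plain dict stands in for the cache server, `load` and `store` are hypothetical caller-supplied functions wrapping the real database calls, and writes that bypass this class are not handled.

```python
class WriteThroughCache:
    """Cache that is updated on every write instead of being reloaded periodically."""

    def __init__(self, load, store):
        self._load = load      # load(key) -> value, reads from the database
        self._store = store    # store(key, value), writes to the database
        self._cache = {}       # stand-in for a shared cache server

    def get(self, key):
        """Serve from the cache, reading the database only on a miss."""
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]

    def put(self, key, value):
        """Write to the database first, then refresh just this cached entry."""
        self._store(key, value)
        self._cache[key] = value


# Usage with a dict standing in for the database.
db = {"greeting": "hello"}
cache = WriteThroughCache(load=db.__getitem__, store=db.__setitem__)
print(cache.get("greeting"))
cache.put("greeting", "hi")
print(cache.get("greeting"))
```

Writing the database before the cache means a failed write never leaves the cache holding a value the database does not have.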
Update: GIGAOM on rPath Burns EC2 Appliances in a Web Portal. rBuilder adds a portal that lets users turn software into virtual appliances. rPath demoed their virtual appliance management system at Monday's AWS Meetup. What they do is help you build a generic virtual machine image deployable on Amazon, VMware, Xen and other targets. The idea is to build your software application independent of the underlying operating system and deploy it in your own or someone else's datacenter without worrying about all the details. To put their service in context, think of rPath as how you build, deploy, and upgrade images and someone like RightScale as how you run and manage a cluster of deployed images. To build a Virtual Appliance you pull together all your packages through their web interface or through a Python-based "recipe" system, select a VM target, and "cook" it all into a VM image you can immediately deploy and run. To make this magic happen they use the Conary package management system and they have their own RedHat-compatible OS. One of their major features is a very fine-grained package management system which allows them to perform minimal in-place upgrades of deployed images. The downside is you must use their packaging system and their OS for this to work. Any code you want to install must be installable using their packaging system. There's a free community version available on their website for Open Sourcers. They make their money from people buying a Virtual Appliance of their build and packaging system and deploying it internally. So you can integrate their Virtual Appliance system as part of your build and deployment infrastructure. As part of your nightly build, create appliances and have them automatically deployed to your test jigs. Once testing is complete you can deploy into your datacenter. Their smart upgrade features are very nice for a datacenter. Usually package management during upgrades is a complete nightmare.
For cloud deployment I think this feature is less useful, as I would simply create a new image, fire up a new instance using the new image, and bring down my old images without the cost of a software upgrade. Of course you still have to worry about protocol and data compatibilities. rPath's Virtual Appliance is kind of a hard idea to really understand because it is still ahead of the curve of what most people are doing. But I think as we move into a world of multiple clouds we must seed with our images, a layer above the clouds is necessary to manage the whole process. rPath is saying we've already built that layer so you don't have to.