Hi, I am an owner of an large community website and currently we are having problems with our database architecture. We are using 2 database servers and spread tables across them to divide read/writes. We have about 90% reads and 10% writes. We use Memcached on all our webservers to cache as much as we can, traffic is load balanced between webservers. We have 2 extra servers ready to put to use! We have looked into a couple of solution so far: Continuent Uni/Cluster aka Sequoia -> Commercial version way too expensive and Java isn't as fast as it suppose to be. MySQL Proxy -> We couldn't find any good example on how to create a master - master with failover scenario. MySQL Clustering -> Seems to be not mature enough, had a lot of performance issues when we tried to go online with it. MySQL DRDB HA -> Only good for failover, cannot be scaled! MySQL Replication -> Well don't get me started ;) So now I turn to you guys to help me out, I am with my hands in my hair and see the site still growning and performance slowly getting to its limit. Really need your help!! HELP!
I haven't developed an AppEngine application yet, I'm just taking a look around their documentation and seeing what stands out for me. It's not the much speculated super cluster VM. AppEngine is solidly grounded in code and structure. It reminds me a little of the guy who ran a website out of S3 with a splash of Heroku thrown in as a chaser. The idea is clearly to take advantage of our massive multi-core future by creating a shared nothing infrastructure based firmly on a core set of infinitely scalable database, storage and CPU services. Don't forget Google also has a few other services to leverage: email, login, blogs, video, search, ads, metrics, and apps. A shared nothing request is a simple beast. By its very nature shared nothing architectures must be composed of services which are themselves already scalable and Google is signing up to supply that scalable infrastructure. Google has been busy creating a platform of out-of-the-box scalable services to build on. Now they have their scripting engine to bind it all together. Everything that could have tied you to a machine is tossed. No disk access, no threads, no sockets, no root, no system calls, no nothing but service based access. Services are king because they are easily made scalable by load balancing and other tricks of the trade that are easily turned behind the scenes, without any application awareness or involvement. Using the CGI interface was not a mistake. CGI is the perfect metaphor for our brave new app container world: get a request, process the request, die, repeat. Using AppEngine you have no choice but to write an app that can be splayed across a pointy well sharpened CPU grid. CGI was devalued because a new process had to be started for every request. It was too slow, too resource intensive. Ironic that in the cloud that's exactly what you want because that's exactly how you cause yourself fewer problems and buy yourself more flexibility. The model is pure abstraction. The implementation is pure pragmatism. Your application exists in the cloud and is in no way tied to any single machine or cluster of machines. CPUs run parallel through your application like a swarm of busy bees while wizards safely hidden in a pocket of space-time can bend reality as much as they desire without the muggles taking notice. Yet the abstraction is implemented in a very specific dynamic language that they already have experience with and have confidence they can make work. It's a pretty smart approach. No surprise I guess. One might ask: is LAMP dead? Certainly not in the way Microsoft was hoping. AppEngine is so much easier to use than the AWS environment of EC2, S3, SQS, and SDB. Creating an app in AWS takes real expertise. That's why I made the comparison of AppEngine to Heroku. Heroku is a load and go approach for RoR whereas AppEngine uses Python. You basically make a Python app using services and it scales. Simple. So simple you can't do much beyond making a web app. Nobody is going to make a super scalable transcoding service out of AppEngine. You simply can't load the needed software because you don't have your own servers. This is where Amazon wins big. But AppEngine does hit a sweet spot in the market: website builders who might have previously went with LAMP. What isn't scalable about AppEngine is the scalability of the complexity of the applications you can build. It's a simple request response system. I didn't notice a cron service, for example. Since you can't write your own services a cron service would give you an opportunity to get a little CPU time of your own to do work. To extend this notion a bit what I would like to see as an event driven state machine service that could drive web services. If email needs to be sent every hour, for example, who will invoke your service every hour so you can get the CPU to send the email? If you have a long running seven step asynchronous event driven algorithm to follow, how will you get the CPU to implement the steps? This may be Google's intent. Or somewhere in the development cycle we may get more features of this sort. But for now it's a serious weakness. Here's are a quick tour of a few interesting points. Please note I'm copying large chunks of their documentation in this post as that seems the quickest way to the finish line...
import wsgiref.handlers from google.appengine.ext import webapp class MainPage(webapp.RequestHandler): def get(self): self.response.headers['Content-Type'] = 'text/plain' self.response.out.write('Hello, webapp World!') def main(): application = webapp.WSGIApplication( [('/', MainPage)], debug=True) wsgiref.handlers.CGIHandler().run(application) if __name__ == "__main__": main()This code defines one request handler, MainPage, mapped to the root URL (/). When webapp receives an HTTP GET request to the URL /, it instantiates the MainPage class and calls the instance's get method. Inside the method, information about the request is available using self.request. Typically, the method sets properties on self.response to prepare the response, then exits. webapp sends a response based on the final state of the MainPage instance. The application itself is represented by a webapp.WSGIApplication instance. The parameter debug=true passed to its constructor tells webapp to print stack traces to the browser output if a handler encounters an error or raises an uncaught exception. You may wish to remove this option from the final version of your application.
Example of creation: from google.appengine.ext import db from google.appengine.api import users class Pet(db.Model): name = db.StringProperty(required=True) type = db.StringProperty(required=True, choices=set("cat", "dog", "bird")) birthdate = db.DateProperty() weight_in_pounds = db.IntegerProperty() spayed_or_neutered = db.BooleanProperty() owner = db.UserProperty() pet = Pet(name="Fluffy", type="cat", owner=users.get_current_user()) pet.weight_in_pounds = 24 pet.put() Example of get, modify, save: if users.get_current_user(): user_pets = db.GqlQuery("SELECT * FROM Pet WHERE pet.owner = :1", users.get_current_user()) for pet in user_pets: pet.spayed_or_neutered = True db.put(user_pets)Looks like your normal overly complex data access. Me, I appreciate the simplicity of a string based property interface.
Scalr is a fully redundant, self-curing and self-scaling hosting environment utilizing Amazon's EC2. It has been recently open sourced on Google Code. Scalr allows you to create server farms through a web-based interface using prebuilt AMI's for load balancers (pound or nginx), app servers (apache, others), databases (mysql master-slave, others), and a generic AMI to build on top of. Scalr promises automatic high-availability and scaling for developers by health and load monitoring. The health of the farm is continuously monitored and maintained. When the Load Average on a type of node goes above a configurable threshold a new node is inserted into the farm to spread the load and the cluster is reconfigured. When a node crashes a new machine of that type is inserted into the farm to replace it. 4 AMI's are provided for load balancers, mysql databases, application servers, and a generic base image to customize. Scalr allows you to further customize each image, bundle the image and use that for future nodes that are inserted into the farm. You can make changes to one machine and use that for a specific type of node. New machines of this type will be brought online to meet current levels and the old machines are terminated one by one. The open source scalr platform with the combination of the static EC2 IP addresses makes elastic computing easier to implement. Check out the blog announcement by Intridea for more info. As AWS conquers the scalable web application hosting space it is time to check out the new Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB (Programming) book on amazon.com. What do you think of the opportunities of using scalr for automatic scalability?
It is fairly obvious that web site performance can be increased by making the code run faster and optimising the response time. But that only scales up to a point. To really take our web sites to the next level, we need to look at the performance problem from a different angle.
Skype uses PostgreSQL as their backend database. PostgreSQL doesn't get enough run in the database world so I was excited to see how PostgreSQL is used "as the main DB for most of [Skype's] business needs." Their approach is to use a traditional stored procedure interface for accessing data and on top of that layer proxy servers which hash SQL requests to a set of database servers that actually carry out queries. The result is a horizontally partitioned system that they think will scale to handle 1 billion users.
First, code to insert a user in a database: CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS text AS $$ BEGIN PERFORM 1 FROM users WHERE username = i_username; IF NOT FOUND THEN INSERT INTO users (username) VALUES (i_username); RETURN 'user created'; ELSE RETURN 'user already exists'; END IF; END; $$ LANGUAGE plpgsql SECURITY DEFINER; Heres the proxy code to distribute the user insert to the correct partition: queries=# CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS TEXT AS $$ CLUSTER 'queries'; RUN ON hashtext(i_username); $$ LANGUAGE plproxy; Your SQL query looks normal: SELECT insert_user("username");- The result of a query is exactly that same as if was executed on the remote database. - Currently they can route 1000-2000 requests/sec on Dual Opteron servers to a 16 parition cluster.
Not sure if this is the right place to post this but here goes anyway. We are looking to hire an outside firm to help with development of a scalable and potentially high-traffic web site. We are not looking for an individual but rather a firm with enough well rounded expertise to help us with various aspects of this. Basic requirements: LAMP stack or other open source solution Very proficient in cross-browser web development Flex/AIR development for RIA Java/C/C++ proficiency Expertise with Comet and push server technology Experience with development of high-traffic web sites Use of Amazon Web Services infrastructure a plus If anyone knows of consulting firms that can take on such a project, I would appreciate your feedback. TIA
It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all.
This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit.
Adapted from their website:
Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.
Hi, I am building a video-sharing site and I'm looking for an efficient way to update video views count. The easiest way would be to perform an SQL update to increase the "views" counter every time a video is viewed, but naturally I want to avoid DB write access as much as possible. I am looking for an efficient temporary storage to which I could connect and say "increment views of video X". Every so often I would save the changes to my main database, and remove the counter from this temporary storage. I am having a hard time finding such temporary storage, however. My first thought was memcache, but it's not ideal as I wouldn't like to lose the data if memcache goes down. Also, memcache's increment command requires that the key is already present - that means that every time a video is viewed, I would have to check if the key already exists in memcache, before I can actually send the increment command. What do people use to solve this kind of issues? Kind regards, Tomasz
Jean-Paul de Vooght of our Switzerland contingent created a nifty little WidSets widget that lets you better read HighScalability from your mobile phone. I thought untethered readers might like to give it a try. Thanks to Jean-Paul for making it available! WidSets is: a simple service that brings you information normally accessed via the Internet by sending it directly to your mobile phone . Using mini-applications called widgets, it sends you the latest updates to your favorite websites. The system uses RSS feeds to push information from these websites directly to your mobile phone the minute they’re updated.