Development of highly scalable web site

Not sure if this is the right place to post this but here goes anyway. We are looking to hire an outside firm to help with development of a scalable and potentially high-traffic web site. We are not looking for an individual but rather a firm with enough well rounded expertise to help us with various aspects of this. Basic requirements: LAMP stack or other open source solution Very proficient in cross-browser web development Flex/AIR development for RIA Java/C/C++ proficiency Expertise with Comet and push server technology Experience with development of high-traffic web sites Use of Amazon Web Services infrastructure a plus If anyone knows of consulting firms that can take on such a project, I would appreciate your feedback. TIA

Click to read more ...


Product: Supervisor - Monitor and Control Your Processes

It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all. This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit. Adapted from their website: Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.

  • It is often inconvenient to need to write "rc.d" scripts for every single process instance. rc.d scripts are a great lowest-common-denominator form of process initialization/autostart/management, but they can be painful to write and maintain. Additionally, rc.d scripts cannot automatically restart a crashed process and many programs do not restart themselves properly on a crash. Supervisord starts processes as its subprocesses, and can be configured to automatically restart them on a crash. It can also automatically be configured to start processes on its own invocation.
  • It's often difficult to get accurate up/down status on processes on UNIX. Pidfiles often lie. Supervisord starts processes as subprocesses, so it always knows the true up/down status of its children and can be queried conveniently for this data.
  • Users who need to control process state often need only to do that. They don't want or need full-blown shell access to the machine on which the processes are running. Supervisorctl allows a very limited form of access to the machine, essentially allowing users to see process status and control supervisord-controlled subprocesses by emitting "stop", "start", and "restart" commands from a simple shell or web UI.
  • Users often need to control processes on many machines. Supervisor provides a simple, secure, and uniform mechanism for interactively and automatically controlling processes on groups of machines.
  • Processes which listen on "low" TCP ports often need to be started and restarted as the root user (a UNIX misfeature). It's usually the case that it's perfectly fine to allow "normal" people to stop or restart such a process, but providing them with shell access is often impractical, and providing them with root access or sudo access is often impossible. It's also (rightly) difficult to explain to them why this problem exists. If supervisord is started as root, it is possible to allow "normal" users to control such processes without needing to explain the intricacies of the problem to them.
  • Processes often need to be started and stopped in groups, sometimes even in a "priority order". It's often difficult to explain to people how to do this. Supervisor allows you to assign priorities to processes, and allows user to emit commands via the supervisorctl client like "start all", and "restart all", which starts them in the preassigned priority order. Additionally, processes can be grouped into "process groups" and a set of logically related processes can be stopped and started as a unit. Supervisor also has a web interface and an XMP-RPC interface:
  • A (sparse) web user interface with functionality comparable to supervisorctl may be accessed via a browser if you start supervisord against an internet socket. Visit the server URL (e.g. http://localhost:9001/) to view and control process status through the web interface after activating the configuration file's [inet_http_server] section. XML-RPC Interface
  • The same HTTP server which serves the web UI serves up an XML-RPC interface that can be used to interrogate and control supervisor and the programs it runs. To use the XML-RPC interface, connect to supervisor's http port with any XML-RPC client library and run commands against it. An example of doing this using Python's xmlrpclib client library is as follows.

    Related Articles

  • PyCon Presentation: Supervisor as a Platform
  • Monitor Pylons application with supervisord
  • Supervisor Manual

    Click to read more ...

  • Tuesday

    How to update video views count effectively?

    Hi, I am building a video-sharing site and I'm looking for an efficient way to update video views count. The easiest way would be to perform an SQL update to increase the "views" counter every time a video is viewed, but naturally I want to avoid DB write access as much as possible. I am looking for an efficient temporary storage to which I could connect and say "increment views of video X". Every so often I would save the changes to my main database, and remove the counter from this temporary storage. I am having a hard time finding such temporary storage, however. My first thought was memcache, but it's not ideal as I wouldn't like to lose the data if memcache goes down. Also, memcache's increment command requires that the key is already present - that means that every time a video is viewed, I would have to check if the key already exists in memcache, before I can actually send the increment command. What do people use to solve this kind of issues? Kind regards, Tomasz

    Click to read more ...


    Read HighScalability on Your Mobile Phone Using WidSets Widgets

    Jean-Paul de Vooght of our Switzerland contingent created a nifty little WidSets widget that lets you better read HighScalability from your mobile phone. I thought untethered readers might like to give it a try. Thanks to Jean-Paul for making it available! WidSets is: a simple service that brings you information normally accessed via the Internet by sending it directly to your mobile phone . Using mini-applications called widgets, it sends you the latest updates to your favorite websites. The system uses RSS feeds to push information from these websites directly to your mobile phone the minute they’re updated.

    Click to read more ...


    Scaling Out MySQL

    This post covers two main options for scaling-out MySql and compare between them. The first is based on data-base clustering and the second is based on In Memory clustering a.k.a Data Grid. A special emphasis is given to a pattern which shows how to scale our existing data base without changing it through a combination of Data Grid and data base as a background service. This pattern is referred to as Persistency as a Service (PaaS). It also address many of the fequently asked question related to how performance, reliability and scalability is achieved with this pattern.

    Click to read more ...


    20 New Rules for Faster Web Pages

    Update: Nice explanation in The importance of bandwidth versus latency of how long latencies cause cascading delays in resource loading. Doloto tries to optimize how resources are loaded. Twenty new rules have been added to the original 14 rules for sizzling web performance. Part of scalability is worrying about performance too. The front-end is where 80-90% of end-user response time is spent and following these best practices improved the performance of Yahoo! properties by 25-50%. The rules are divided into server, content, cookie, JavaScript, CSS, images, and mobile categories. The new rules are:

  • Flush the buffer early [server]
  • Use GET for AJAX requests [server]
  • Post-load components [content]
  • Preload components [content]
  • Reduce the number of DOM elements [content]
  • Split components across domains [content]
  • Minimize the number of iframes [content]
  • No 404s [content]
  • Reduce cookie size [cookie]
  • Use cookie-free domains for components [cookie]
  • Minimize DOM access [javascript]
  • Develop smart event handlers [javascript]
  • Choose <link> over @import [css]
  • Avoid filters [css]
  • Optimize images [images]
  • Optimize CSS sprites [images]
  • Don't scale images in HTML [images]
  • Make favicon.ico small and cacheable [images]
  • Keep components under 25K [mobile]
  • Pack components into a multipart document [mobile] Thanks to Simon Willison for the link.

    Click to read more ...

  • Friday

    How to Get DNS Names of a Web Server

    For some special reason, I'm trying to make a web server able to get all the DNS names mapped to its IP. Let me explain more, I'm creating a website that will run in a web farm, every web server in the farm will have some subdomains mapped to its ip, what I want is that whenever my application starts on a web server is to be able to get all the subdomains mapped/assigned to that server, e.g., I understand that I have to use reverse dns lookup (i.e. give the IP get the domain name), but I also want to get all the subdomains not just the first one that maps to that IP. I've been reading about DNS on the internet but I don't seem to find any information on how to achieve what I want, normally you use dns to get the ip of a domain but I'm not sure that all servers enable reverse lookup. The problem is that I'm still not sure whether I'll host my own DNS server or use the services of some company (many companies offer DNS hosting services), so, my question is: - If I host my own DNS server, will it be possible to get all the subdomains using reverse lookup? Another question here, if I enable reverse lookup on my DNS server, can this have any negative side effects? As to security .. etc .. is there any way I can enable only my web servers to do reverse lookup while preventing anybody else on the internet from using reverse lookup? - If use the DNS hosting services of some company, will I be able to do what I want? ie. get the subdomains mapped to the IP address of a web server? Unfortunately I don't have much experience with working with web farms, so I would like also to ask whether every web server in the web farm gets its own static IP or how does it work? I mean you have the firewall ... etc .. so I don't know how IP assignments works in a web farm scenario .. Thanks a million in advance and sorry for my really long post .. Wal

    Click to read more ...


    Amazon Announces Static IP Addresses and Multiple Datacenter Operation

    Amazon is fixing two of their major problems: no static IP addresses and single datacenter operation. By adding these two new features developers can finally build a no apology system on Amazon. Before you always had to throw in an apology or two. No, we don't have low failover times because of the silly DNS games and unexceptionable DNS update and propagation times and no, we don't operate in more than one datacenter. No more. Now Amazon is adding Elastic IP Addresses and Availability Zones. Elastic IP addresses are far better than normal IP addresses because they are both in tight with Jessica Alba and they are: Static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account, not a particular instance, and you control that address until you choose to explicitly release it. Unlike traditional static IP addresses, however, Elastic IP addresses allow you to mask instance or availability zone failures by programmatically remapping your public IP addresses to any instance associated with your account. Rather than waiting on a data technician to reconfigure or replace your host, or waiting for DNS to propagate to all of your customers, Amazon EC2 enables you to engineer around problems with your instance or software by programmatically remapping your Elastic IP address to a replacement instance. About the new feature RightScale says: Amazon did a very nice job in creating something much more powerful than simply adding “static IPs” to their offering. They are giving us dynamically remappable IP addresses that fit well into the overall cloud computing paradigm that we can use to manage servers better than with traditional hosting solutions. Mostly good news. It's not great news because RightScale also says "Assigning or reassigning an IP to an instance takes a couple of minutes..." So it's not as speedy as one would hope, but at least you don't have to wait for TTL to kick in and everyone up and down the stack to get new IP addresses. Cached static IP addresses will always be valid, which simplifies and speeds things up considerably, especially when using redundant load balancers as the entry point into your system. The other power feature added was the ability to specify which datacenter your instances run in. Amazon calls this feature Availability Zones: Availability Zones provide the ability to place instances in multiple locations. Amazon EC2 locations are composed of regions and availability zones. Regions are geographically dispersed and will be in separate geographic areas or countries. Currently, Amazon EC2 exposes only a single region. Availability zones are distinct locations that are engineered to be insulated from failures in other availability zones and provide inexpensive, low latency network connectivity to other availability zones in the same region. Regions consist of one or more availability zones. By launching instances in separate availability zones, you can protect your applications from failure of a single location. You might also be wondering how fast connections are between datacenters. They are said to be: "inexpensive, low latency network connectivity to other availability zones in the same region." I tend to believe this. I've been surprised before how fast datacenter links can be in that you didn't have code specially for these configurations. How this will impact S3 and SimpleDB latencies is an interesting question. And how system design will need to change once datacenters are in different regions of the world is another interesting question. Other services still have more datacenters, more geographically dispersed datacenters than Amazon, and better content migration capabilities, but this is a great first step that allows developers to add another layer of reliability to their systems. Update: I thought this was a good explanation of using static IPs in Some more about EC2: Slashdot hasn't run many stories on EC2 (none that I know of) because until now it's been a niche service. Without a way to guarantee that you can have a static IP, there had been a single point of failure: if your outward-facing VMs all went down, your only recourse was to start up more VMs on new, dynamically-assigned IPs, point your DNS to them, and wait hours for your users' DNS caches to expire. That meant that while it may have been a good service for sites that needed to do massive private computation, it was an unacceptable hosting service. Now with static IPs, you basically set up your service to have several VMs which provide the outward-facing service (maybe running a webserver, or a reverse proxy for your internal webservers), and you point your public, static IPs at those. If one or more of them goes down, you start up new copies of those VMs and repoint the IPs to them. No DNS changes required.

    Related Articles

  • Jesse Robbins links to a good explanation by RightScale of Availability Zones: Setting up a fault-tolerant site using Amazon’s Availability Zones.
  • RightScale also writes DNS, Elastic IPs (EIP) and how things fit together when upgrading a server

    Click to read more ...

  • Tuesday

    Paper: On Designing and Deploying Internet-Scale Services

    Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.

    The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

    In the recommendations we see some of our old favorites:
  • Expect failure and design for failure.
  • Implement redundancy and fault recovery.
  • Depend upon a commodity hardware slice.
  • Keep things simple and robust.
  • Automate everything.

    Personally, I'm still trying to figure out how to make something simple.

    Next are some good thoughts on how to design operations friendly software:
  • Quick service health check. This is the services version of a build verification test.
  • Develop in the full environment.
  • Zero trust of underlying components.
  • Do not build the same functionality in multiple components.
  • One pod or cluster should not affect another pod or cluster.
  • Allow (rare) emergency human intervention.
  • Enforce admission control at all levels.
  • Partition services.
  • Understand the network design.
  • Analyze throughput and latency.
  • Treat operations utilities as part of the service.
  • Understand access patterns.
  • Version everything.
  • Keep the unit/functional tests from the last release.
  • Avoid single points of failure.
  • Support single-version software. Have all your customers run the same version.
  • Implement multi-tenancy. Apparently a lot of software requires cloning hardware installations to support multiple customers. Don't do that. Have your software work for multiple customers all on the same hardware.

    And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.

    You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can failover to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.

    The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.

    My favorite part of the whole paper is:
    We have long believed that 80% of operations issues originate in design and development, so this section
    on overall service design is the largest and most important. When systems fail, there is a natural tendency
    to look first to operations since that is where the problem actually took place. Most operations issues,
    however, either have their genesis in design and development are best solved there.

    Understand this and I think much of the rest follows naturally.
  • Monday


    If you would like to advertise on this site please contact me at We can work out the details over email. Thanks

    Click to read more ...