advertise
Monday
Jul302007

Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

Can you really create an infinitely scalable infrastructure for less than $100 using Amazon's storage, grid, and queuing services platform? It appears so, at least for the right application. Amazon beams a spot light on the future battle of the roll-your-own versus the connect-the-dots approach to building next generation websites using core external services. Their argument is strong. Using Amazon's platform you can quickly build an infrastructure that would otherwise take an eternity to make, a pile of money to create, and an unbounded mass of people to implement and maintain. Yet Amazon doesn't provide SLAs, so you can you really trust them with your crown jewels? Facebook recently leap frogged Amazon's vision with an even more comprehensive set of services. The battle for the future is on. Site: http://aws.amazon.com/

Information Sources

  • Slides: Building Highly Scalable Web Applications
  • Podcast: Technometria: Amazon Web Services
  • Amazon Services Home.

    Platform

  • Amazon ECS (E-Commerce Service)
  • Amazon S3 (simple storage service)
  • Amazon SQS (simple queuing service)
  • Amazon EC2 (grid service)
  • Amazon Web Search Service
  • Amazon Flexible Payments Service (Amazon FPS)
  • REST and SOAP Service Interfaces

    What's Inside?

    Why use external services?

  • Amazon's services replace the boxes, wires, and disk drives part of the application stack.
  • Amazon has spent ten years and over $1 billion developing a world-class web service that millions of customers use every day. Maybe you can leverage that experience for your site?
  • Focus on the customer. 70% if Web Development isn't about providing customer value. It's about building and managing data centers. Your efforts would be better spent on your customers and not plumbing.
  • Quicker to market. Scaling is hard. Let someone else worry about that while you concentrate on adding user value.
  • Designing for peak load is expensive. So turn fixed costs into variable costs. Say you want to handle high traffic flows from slashdot or digg, or you have high seasonal demand, having the infrastructure in place to handle those loads is a high fixed cost. You could use that money better elsewhere. It make sense to create an infrastructure where you can automatically and temporarily scale resources to handle peak demand.
  • High reliability and availability. A dedicated service may be more reliable than a service you could create. It say "may" because Amazon doesn't provide an SLA, so you wont get any guarantees. The idea is that Amazon is cheap enough and reliable enough that the few failures will be acceptable. Besides, SLAs usually just refund some money when things go wrong, they don't really guarantee anything.
  • It's a cheap CDN. Amazon's storage network could serve a relatively inexpensive content delivery network. This option is discussed in Reducing Your Website's Bandwidth Usage. The idea is that just the frequent downloading of a simple favicon.ico file can use a significant portion of your bandwidth. Using S3 for $2/month to offload 90% of your bandwidth to an external host is a good deal. However, without an SLA S3 can't be thought of as a proper CDN.

    Amazon ECS (E-Commerce Service)

  • This service exposes Amazon's product data and e-commerce functionality: Detailed Product Information on all Amazon.com Products, Access to Product Images, All Customer Reviews associated with a Product, etc.
  • Amazon products are aggressively priced.
  • I found this service disappointing. If you want to build a store on top of Amazon it seems great, but I didn't see a way to add your own products to the store, so I don't think it's generally useful.

    Amazon S3 (simple storage service)

  • This service stores data in Amazon's storage network.
  • $.15 per GPB per month storage
  • $.01 for 1000 to 10000 requests.
  • $.10 - $.17 per GB data transfer.
  • The service is: fast, relaible, scalable, redundant, dispersed.
  • You can have per object URLs. This means you can reference an image or other file directly with a URL, so it's usable in a web page.
  • Typical use: CDN and backup storage.
  • Storage is distributed to multiple locations so you get a level of geographical distribution.

    Amazon SQS (simple queuing service)

  • This service provides an internet scale queuing service for storing messages. Distributed actors put work on the queue and take work off the queue.
  • $.10 per 1000 messages.
  • $.10 - $.18 per GB data transfer.
  • This service is: scalable, elastic, reliable, simple, secure.
  • Typical use: a centralized work queue. You put jobs on the queue and different actors can pop work of the queue and process them when they get CPU time.
  • Expected message latency, as of 2007, was 2-10 seconds. This is horrible for many applications, not bad for many others.
  • Part of scalability. Have any number of producers and consumers. You don't worry about it.
  • Queues are spread across multiple machines and multiple data centers.

    Amazon EC2 (grid service)

  • This service provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
  • Basically you create a Xen image for your Linux distro and upload it into their "elastic compute cloud." Using an API you can then start as many instances as you like.
  • Typical use: transcoding, audio work, load testing.
  • Root level access to the server and full control over the machine.
  • Can scale up and scale down on a minute-by-minute basis.
  • For real-time processing one criticism has been slow CPUs (1.75 Ghz Xeon). This probably won't be a problem if your application is written to linearly scale.
  • An EC2 instance is not persistent so you can't store a database there. You have some local storage, but it goes away when the instance goes away.
  • Takes a few minutes to start and stop images, so it's not really on demand.
  • You can add anything you want to an image. If you want a database you can add it in.

    GigaVox Media Example Web-Scale Architecture

  • You can start to see how Amazon's services can work together. Let's say you have a large batch of MP2s you would like transcode to MP3s. You would store the original media into S3, queue the work request into SQS, and have instances running in EC2 to take work of the queue and perform the transcoding, storing the results back into S3. And this is exactly what GigaVox does.
  • GigaVox is a podcasting company. - They take original recordings and transcode them say from MP2 to MP3. Many other transcodings are also performed. - Then these chunks of media are assembled together into a delivery format based on building a show. For example, old podcasts can be reassembled each night with up to date advertisements. - To do this at scale would take a lot of costly resources.
  • Using Amazon's services GigaVox gets geographically redundancy and failover for relatively inexpensive CPU, bandwidth, and storage charges, and bandwidth costs. You have no boxes or wires. No data center to manage. And you can grow with small fixed costs.
  • Messages are time stamped on the queue. If the message waited in the queue for too long then they can start more EC2 images. You can balance costs. You could also layer in a customer based priority mechanism.
  • They have each instance have its own messaging queue for command and control.
  • For security reasons they upload files through ftp to instances rather than going through S3.
  • All bandwidth withing the Amazon cloud is free. This is an important business consideration for making the services work together.
  • Another set of instances and queues handles assembling the delivered media.
  • Allows GigaVox to deliver value to their customer at a low startup cost.

    Lessons Learned

  • Build or buy is always a difficult decision. If a service doesn't work then you may lose your customers and there's nothing more you can do other send yet another urgent email to nobody in particular. This is a horrible feeling. Yet, if it does work you could be way ahead of the game. How to choose? That would be telling :-)
  • Build a layer of virtualization so you can switch to another provider when they become available or so you can replace it with your own service. This lessens your dependency on Amazon in the event they get tired of offering services or their performance deteriorates.
  • As a startup using Amazon services isn't a big risk because you are already in a risky situation. And any risk is moderated by the very low cost of starting up and money is always an issue for startups.
  • For many use cases buying your own dedicated servers may still be a better approach as you get more control, lower latency, and the same hardware is usable for multiple purposes.
  • Software as a service is a powerful and practical idea. It changes how you build software. It forces you to layer your software around interfaces. And once your software is composed of interfaces you have loosely coupled components that can be easily replaced. You also have the basis for a platform API should you ever want to provide an API you your customers. The highest level of development would to use the same API you give your customers to build your service.
  • Loosely coupled, message based architectures combined with service interfaces allow you to think several levels up the abstraction layer. You don't have to wallow in the muck, which frees you how to structure your application using large scale blocks of behavior.
  • Designing a UI for an asynchronous interactive interface poses some challenges. It may take a while to perform an operation, so how do you interact with the user to handle that?
  • Instinctively I doubted Amazon could deliver. But if you have the right type of problem, you really can do a lot of work cheaply using Amazon services.

    See Also

  • Flickr and YouTube also deal with service level APIs.
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3

    Click to read more ...

  • Monday
    Jul302007

    allowed contributed

    Saturday
    Jul282007

    Product: Web Log Storming

    Web Log Storming is an interactive, desktop-based Web Log Analyzer for Windows. The whole new concept of log analysis makes it clearly different from any other web log analyzer. Browse through statistics to get into details - down to individual visitor's session. Check individual visitor behavior pattern and how it fits into your desired scenario. Web Log Storming does far more than just generate common reports - it displays detailed web site statistics with interactive graphs and reports. Very complete detailed log analysis of activity from every visitor to your web site is only a mouse-click away. In other words, analyze your web logs like never before! It's easy to track sessions, hits, page views, downloads, or whatever metric is most important to each user. You can look at referring pages and see which search engines and keywords were used to bring visitors to the site. Web site behavior, from the top entry and exit pages, to the paths that users follow, can be analyzed. You can learn which countries your visitors come from, and which operating systems and browsers they use. You'll learn how your bandwidth is being used, and how much time users spend on your site. You can tell how popular your files, images, directories, and queries are.

    Click to read more ...

    Saturday
    Jul282007

    Product: FastStats Log Analyzer 

    FastStats Log Analyzer enables you to: * Determine whether your CPC advertising is profitable: Are you spending $0.75 per click on Google or Overture, but only receiving $0.56 per click in revenue? * Tune site traffic patterns: FastStats's Hyperlink Tree View feature lets you visually see how traffic flows through your web site. * High-performance solution for even the busiest web sites: Our software has been clocked at over 1000 MB/min. Other popular log file analysis tools (we won't name names), run at 1/40th the speed. We've been in the business for over 6 years, delivering value, quality, and good customer service to our clients. Our products are used for data mining at some of the world's busiest web sites -- why not give FastStats a try at your web site? FastStats log file analysis supports a wide variety of web server log files, including Apache logs and Microsoft IIS logs.

    Click to read more ...

    Saturday
    Jul282007

    Product: Web Log Expert

    WebLog Expert is a fast and powerful access log analyzer. It will give you information about your site's visitors: activity statistics, accessed files, paths through the site, information about referring pages, search engines, browsers, operating systems, and more. The program produces easy-to-read HTML reports that include both text information (tables) and charts. View the WebLog Expert sample report to get the general idea of the variety of information about your site's usage it can provide. WebLog Expert can analyze logs of Apache and IIS web servers. It can even read GZ and ZIP compressed logs so you won't need to unpack them manually. The log analyzer features intuitive interface. Built-in wizards will help you quickly and easily create a profile for your site and analyze it.

    Click to read more ...

    Friday
    Jul272007

    Product: Munin Monitoriting Tool

    Munin the monitoring tool surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug and play capabilities. After completing a installation a high number of monitoring plugins will be playing with no more effort. Using Munin you can easily monitor the performance of your computers, networks, SANs, applications, weather measurements and whatever comes to mind. It makes it easy to determine "what's different today" when a performance problem crops up. It makes it easy to see how you're doing capacity-wise on any resources.

    Click to read more ...

    Thursday
    Jul262007

    Product: eAccelerator a PHP Accelerator

    eAccelerator is a free open-source PHP accelerator, optimizer, and dynamic content cache. It increases the performance of PHP scripts by caching them in their compiled state, so that the overhead of compiling is almost completely eliminated. It also optimizes scripts to speed up their execution. eAccelerator typically reduces server load and increases the speed of your PHP code by 1-10 times.

    Click to read more ...

    Thursday
    Jul262007

    Product: Symfony a Web Framework

    symfony is an open-source PHP web framework based on the best practices of web development, thoroughly tried on several active websites, symfony aims to speed up the creation and maintenance of web applications, and to replace the repetitive coding tasks by power, control and pleasure. Symfony provides a lot of features seamlessly integrated together, such as: * simple templating and helpers * cache management * smart URLs * scaffolding * multilingualism and I18N support * object model and MVC separation * Ajax support * enterprise ready

    Click to read more ...

    Thursday
    Jul262007

    ThemBid Architecture

    ThemBid provides a market where people needing work done broadcast their request and accept bids from people competing for the job. Unlike many of the sites profiled at HighScalability, ThemBid is not in the popular press as often as Paris Hilton. It's not a media darling or a giant of the industry. But what I like is they have a strategy, a point-of-view for building websites and were gracious enough to share very detailed instructions on how to go about building a website. They even delve into actual installation details of the various software packages they use. Anyone can benefit by taking a look at their work. Site: http://www.thembid.com/

    Information Sources

  • Build Scalable Web 2.0 Sites with Ubuntu, Symfony, and Lighttpd

    Platform

  • Linux (Ubuntu)
  • Symfony
  • Lighttpd
  • PHP
  • eAccelerator
  • Eclipse
  • Munin
  • AWStats

    What's Inside?

    The Stats

  • Started work in December of 2006 and had a full demo by March 2007.
  • One developer/sys admin worked with a part-time graphics designer.
  • Targeted a few thousand users after launch.

    The Architecture

  • Hardware. Dual core server with 2GB RAM
  • Storage. 2 x 36SCSI 10K RPM on RAID1.
  • Data Center. They went with with Layeredtech for the managed server because of past positive experiences.
  • Development Environment. Ubuntu and Eclipse.
  • OS. They chose the server distribution of Ubuntu because that's what they use on the client side and Ubuntu supports "simpler installation and easier maintenance than typical IT deployments."
  • Web Server. Lighttpd is used to handle static content and forward the dynamic PHP page requests to FastCGI.
  • Database. MySQL. When growth is necessary the idea is to move to a master-slave arrangement and them maybe MySQL cluster.
  • Web Framework. Went with PHP because they knew it and other successful sites like Digg and Yahoo successfully deploy PHP. They chose Symfony as there framework because of its nice documentation and active development community. And Yahoo also uses Symfony. It's a decision that has worked well for them.
  • PHP Cache. eAccelerator is used to compile and cache PHP scripts.
  • Object and Content Cache. The plan is to cache a lot of content. For a bid site like theirs this makes sense. Many of the pieces are used over and over again so putting them in memory will speed up the entire system and take pressure off the database and the IO system. Initially the used a SQLite cache on top of of a memory based file system. This choice was because it was supported by Symfony. When a memcached plugin is available they'll try that.
  • Client Side Cache. Lighttp's mod_expire module is used to prevent Javascript, style sheets, and images that rarely change from being uncessarily redownloaded by the browser.
  • Monitoring. Munin is used to monitor their resource usage. It's as simple as visiting "yoursite.com/status" to see what's going on.
  • Log Analysis. AWStats is used to track hits and types of requests. This information can be used to target bottlenecks.
  • Scalability Plan. - Use Munin to tell when to think about upgrading. When your growth trend will soon cross your resources trend, it's time to do something. - Move MySQL to a separate server. This frees up resources (CPU, disk, memory). What you want to run on this server depend on its capabilities. Maybe run a memcached server on it. - Move to a distributed memory cache using memcached. - Add a MySQL master/slave configuration. - If more webservers are needed us LVS on the front end as a load balancer.
  • Future Directions. Work on fault tolerance.

    Lessons Learned

  • It's possible to create a nice site fairly quickly with just a few people using commonly available low cost tools. And your system will be solid and powerful. No cut corners.
  • Use feedback from your system to know what needs optimizing and when it's time to scale.
  • Good documentation and an active community draw people. These are very attractive qualities for people making decisions about what to use. It's hard to go with a tool chain when it looks like you may get stuck in the future with no way out and no help. If you make tools make them dead easy to understand, learn, use, and deploy.
  • Stick with the familiar. It may not be optimal, it may not be the best, but it's more important that you get started and make progress. You don't want to delay releasing your site so you can learn a completely different tool chain that may make your life somewhat easier and in some projected future. The future is now.
  • Use what works for other people. The fact that Yahoo and Digg use PHP is a good recommendation. Certainly PHP is not the only way to build a site, but it does cut your risk level and help you sleep at night. It also means there's an active community that can help you when you have problems.

    Click to read more ...

  • Thursday
    Jul262007

    Product: AWStats a Log Analyzer

    AWStats is a free powerful and featureful tool that generates advanced web, streaming, ftp or mail server statistics, graphically. This log analyzer works as a CGI or from command line and shows you all possible information your log contains, in few graphical web pages. It uses a partial information file to be able to process large log files, often and quickly. It can analyze log files from all major server tools like Apache log files (NCSA combined/XLF/ELF log format or common/CLF log format), WebStar, IIS (W3C log format) and a lot of other web, proxy, wap, streaming servers, mail servers and some ftp servers.

    Click to read more ...