Monday, August 22, 2011

Strategy: Run a Scalable, Available, and Cheap Static Site on S3 or GitHub

One of the best projects I've ever worked on was creating a large scale web site publishing system that was almost entirely static. A large team of very talented creatives made the artwork, writers penned the content, and designers generated templates. All assets were controlled in a database. Then all of that was extracted, after applying many different filters, to a static site that was uploaded via FTP to dozens of web servers. It worked great: reliable, fast, cheap, and simple. Updates were a bit of a pain because they required pushing a lot of files to a lot of servers, which took time, but otherwise it was a solid system.

Alas, this elegant system was replaced with a newfangled dynamic, database-backed system. Content was pulled from a database through a front-end generated by a dynamic language. With a recent series of posts from Amazon's Werner Vogels chronicling his experience of transforming his All Things Distributed blog into a static site using S3's ability to serve web pages, I get the pleasure of asking: are we back to static sites again?

It's a pleasure to ask the question because in many ways a completely static site is the holy grail of content-heavy sites. A static site is one in which files (HTML, images, sound, movies, etc.) sit in a filesystem, and a web server translates a URL to a file, reads the file from the file system, and sends it to the browser in response to an HTTP request. Not much can go wrong in this path. Not much going wrong is a virtue. It means you don't need to worry about things. It will just work. And it will just keep on working over time; bit rot hits program- and service-heavy sites a lot harder than static sites.
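That URL-to-file step is short enough to sketch. The following is an illustrative Python sketch of it, assuming nothing about any particular server's implementation; the function name and details are invented for the example:

```python
import mimetypes
import os

def resolve(doc_root, url_path):
    """Map a URL path to a file under doc_root, refusing path traversal.

    A sketch of the lookup a static web server performs before reading
    the file and writing the HTTP response. Returns a tuple of
    (absolute_path, content_type), or (None, None) when the URL would
    escape the document root.
    """
    doc_root = os.path.abspath(doc_root)
    # Treat a trailing "/" as a request for the directory's index page.
    if url_path.endswith("/"):
        url_path += "index.html"
    candidate = os.path.normpath(os.path.join(doc_root, url_path.lstrip("/")))
    # Refuse anything (e.g. "/../etc/passwd") that normalizes to a path
    # outside the document root.
    if not candidate.startswith(doc_root + os.sep):
        return None, None
    ctype, _ = mimetypes.guess_type(candidate)
    return candidate, ctype or "application/octet-stream"
```

Everything after this lookup is just reading bytes off disk, which is exactly why so little can go wrong.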

Here's how Werner makes his site static:

  • S3 - stores the files and serves the website, creating a website without servers. S3 is not your only option, but it's the obvious one for him. We'll talk more about using GitHub and Google App Engine too.
  • Disqus - for comments. 
  • Bing - site search. Google wants $100 a year for a site search feature. I remember when Google was free...
  • DropBox - syncs website files to whatever computer he is on so they can be edited locally. A static site generator is then run on the files, and the output is copied to S3, which makes it available on the Internet.
  • Jekyll - static website generator. Written in Ruby, it uses YAML for metadata management and the Liquid template engine to manipulate the content.
  • s3cmd - push files to S3.
  • http://wwwizer.com - free service that gets around S3's requirement that your website include www in the domain name. It redirects a naked domain name to www.domain so everything works as expected. Joseph Barillari has a good discussion of this issue.
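The publish step in this toolchain is essentially "regenerate, then upload whatever changed." Here's a hedged Python sketch of the checksum comparison that s3cmd-style sync tools use to skip unchanged files; it's an illustration of the idea, not s3cmd's actual code:

```python
import hashlib

def files_to_push(local_files, remote_etags):
    """Return the local paths that need uploading.

    local_files maps path -> file bytes for the freshly generated site;
    remote_etags maps path -> MD5 hex digest of what is already on S3
    (S3 ETags are plain MD5s for simple uploads). A file is pushed only
    when it is new or its content checksum differs.
    """
    changed = []
    for path, body in sorted(local_files.items()):
        digest = hashlib.md5(body).hexdigest()
        if remote_etags.get(path) != digest:
            changed.append(path)
    return changed
```

Because only changed files move over the wire, a one-line edit republishes in seconds even on a site with thousands of pages.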

The articles describing his journey are: New AWS feature: Run your website from Amazon S3, Free at Last - A Fully Self-Sustained Blog Running in Amazon S3, and No Server Required - Jekyll & Amazon S3.

Using DropBox is the clever bit here for me. DropBox makes the files follow you, so you can edit them on any of your machines. This is also the downside of the approach: you need a local machine with a complete tool set, which is a pain. Ironically, that is why I prefer cloud-based approaches. I want to run a blog from any web-enabled device, like an iPhone or an iPad; I don't want to mess with programs.

Static sites are scalable sites. The web server or OS can easily cache popular pages in memory and serve them like mint juleps at the Kentucky Derby. If a single server is overwhelmed, a static site can easily be served out of a CDN or replicated into a load balanced configuration. For this reason static sites are fast. If you use a distributed file system underneath, you can even avoid the problem of a slow disk becoming a hot spot.
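To make the caching point concrete, here's a minimal Python sketch of keeping hot pages in memory so repeat requests never touch the disk. It's purely illustrative; in practice the OS page cache and servers like nginx do this for you with static files:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def read_page(path):
    """Serve a page from memory after the first disk read.

    With static files the bytes never change between publishes, so a
    simple LRU cache keyed on path is safe; a republish would just
    call read_page.cache_clear().
    """
    with open(path, "rb") as f:
        return f.read()
```

The popular pages stay resident, the long tail gets evicted, and the disk only sees first touches and cache misses.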

Content is editable with your favorite text editor. Nice and simple.

Filesystems tend to be reliable. Using S3 is even more reliable, and cheap too. If there is a failure, you can just republish or restore your site with a simple command. Databases tend to spike memory, fill up tables, hit slow queries, and have a myriad of other annoying failure modes. All of that can be skipped in a static site. Web servers that go much beyond simple file serving can also act the diva.

The problem with static sites is that they are, well, static. Once, the Internet was entirely static, except for the blink tag and animated GIFs of course. Then CGI changed all that, and the web has never sat still since.

So what we do is outsource all the dynamic bits to services and leave the content static. Comments can be handled by a service like Disqus. Search can be handled by Bing. Ad serving is already a service. Like buttons are all bits of includible JavaScript. And security concerns (hacking, SQL injection, etc.) are blissfully minimized. This is the mashup culture.

And it mostly works. I seriously considered this approach when I had to move HighScalability off shared hosting.

Some of the downsides: 

  • You can't do a lot with .htaccess. If you have a lot of security checks and URL mapping magic, you can't do that with S3.
  • No PHP or any other language-induced dynamism via a language invoked by the web server. You are of course perfectly free to create a service and mash it into your site. Google App Engine is still a great platform for this kind of mini service layer.
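A mini service layer of that kind can be tiny. As an illustration, here's a hedged sketch of a one-function WSGI service a static page could call from JavaScript and render client-side; the /hit-count endpoint and its payload are invented for the example:

```python
import json

# In-memory state for the invented endpoint; a real service would use a
# datastore (on Google App Engine, for instance).
HITS = {"count": 0}

def app(environ, start_response):
    """A minimal WSGI endpoint for a static site's dynamic bits.

    The static pages stay on S3; only this little JSON API needs a
    place to run. Anything that isn't the known endpoint gets a 404.
    """
    if environ.get("PATH_INFO") == "/hit-count":
        HITS["count"] += 1
        body = json.dumps(HITS).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

The point is the separation: the heavy traffic hits static files, and only the genuinely dynamic sliver hits a server.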

The big downside for me was:

  • Not multi-user. This limitation hits all aspects of the site. I want multiple people to be able to add content to HighScalability. I want to give special privileges to users. I want to assign roles. I want to control what certain users can see. SquareSpace has that capability as do other content management systems. A site generated by a static site generator does not have these capabilities.
  • Engagers. These are tools to get users engaged with your site so hopefully they'll stick around longer. Features like the top posts of all time, read counts on articles, most recent post lists, and tag clouds. These are harder to do with static generators.
  • Monetizers. These are features that help you make money. They often include engagers, but can also include features like email list signup, related content recommendations, white paper matching, sign-ups for consulting services, sponsored text links, that sort of thing. Difficult to implement on a static system. An obvious solution is to have a common CMS metadata service that all mashup services can work across, but that probably won't fly.

For building a static website, S3 is not the only game in town. GitHub can also be used to host a static website. A blog can be generated and updated with a simple git push, which reduces the required set of installed programs to a more manageable level. Now you don't even need git: you can edit files directly using GitHub's web interface, and GitHub automatically builds your website every time you push changes to the repository. More details here: GitHub Pages, Publishing a Blog with GitHub Pages and Jekyll, and GitHub as a CDN.

Google App Engine is also an alternative for a static site. More details at: DryDrop, manage static web site with GAE and Github.

Now there's a bit of a push to move blogs over to social networking sites like Google+. The advantages are a built-in mechanism for attracting more readers, great discussion features, increased possibility of engagement, no cost, excellent device availability, and no maintenance worries. For blogs that don't need to monetize this is an excellent option. Though I do worry about what happens when you want to hop to the next trendy social network and all the old content is simply dust.

Wrapping up:

  • If your blog is strictly about content then a static web site approach is scalable, fast, cheap, flexible, and reliable. We have a rich set of tools now for making static websites a reality.
  • If your blog isn't your presence on the web, then invest time in social network sites (including StackExchange, Quora, etc.) instead of a blog.
  • If you want to increase user engagement or otherwise creatively monetize your blog, then a CMS is a better option.
  • If you want to have multiple users and content creators on your blog then a CMS is a better option.

More links on creating static web sites:

Reader Comments (14)

Hi Todd, nice write up. I absolutely agree with the downsides you mention. This was for me really an exercise in seeing what I would need to do to make the no-server approach work. And to understand what we could do better on the AWS side.

A full-blown multi-user CMS with a rich plugin ecosystem like Wordpress is just years ahead in functionality of static generators like Jekyll or Cactus. But since my last post people have been sending me references to other static generators, and there is a wide variety of tools evolving. Make no mistake, Jekyll is really "Blogging like a Hacker" and thus comes with all the rough edges you would expect. For example, a site with many postings (like yours) would need some serious organizing to make it manageable in Jekyll.

I do like the decentralized nature of the setup. I can write from anywhere and update the site. Given that I am the only writer, there is a natural lack of concurrency :-) But being able to write articles in places where you may not have a local install of Jekyll is certainly something I'd like to get to as well. I like what Ted Kulp did in Automating Jekyll Builds, where he basically has a process (on a server) watching a Dropbox folder; when he drops a posting into it, the site gets regenerated and pushed to S3. It still requires a server somewhere, but I am pretty sure I can outsource that to Heroku so I don't have to run something myself. I am just having fun seeing how far I can push this...

August 22, 2011 | Unregistered CommenterWerner

That's what I enjoy about your articles Werner, you are obviously getting a kick from working this whole thing out. It's fun to ride along!

August 22, 2011 | Registered CommenterTodd Hoff

I did a little test: I looked through the last week of my browser's history and tried to reverse engineer how much of the content could be served through static generators (more advanced ones than Werner's; I used my imagination a little). Once I got past the search engine results and factored out my personal sites (email, twitter, facebook, etc...), basically ALL of the pages could be statified (I made up that word, needed something short).

Statification is hardly ever mentioned in conversations about scale or system simplification. Once a site is statified, it is fundamentally different in terms of how it can be served: comparing a page served from a PHP-MySQL stack to a page served from an S3 bucket or an Akamai edge server equates to about a 1-10K FOLD difference in system resources (C10K can be done with a single core nowadays).

Statification can be applied to a lot of webpages (not just blogs), but AFAIK there is no good cookbook for doing it (anyone know of any?) and it does not seem to be a commonly used tool for most developers (or maybe it's just not sexy enough to get a lot of media coverage).

Personally, I am a huge fan of pushing any and all functionality (within reason) as far from the backend (i.e. the database) and as far INTO the frontend (i.e. the browser) as possible. This is the key to scaling and respecting data locality, so thanks to both of you for reminding people of this technique: more brains on it means a multi-user CMS static generator is that much closer, so I will get my webpages that much quicker :).

August 22, 2011 | Unregistered CommenterRussell Sullivan

The Bing site search widget was pulled in February 2011.

http://www.bing.com/community/site_blogs/b/webmaster/archive/2011/04/19/bing-com-siteowner-shut-down-options-for-search-results.aspx

August 22, 2011 | Unregistered CommenterJonathan Prior

With regards to being able to allow users to add/edit content to HighScalability: if your static site files are kept in a {git,hg,bzr} repository, you can allow users to clone your local repo, make changes, and then push them back to you, where you can review and push to S3. Doing this would probably even save you a little space on your Dropbox, because you could just keep a bare {git,hg,bzr} repo on your Dropbox and push and pull to that on your local machine(s). I do that for a lot of my code, and I love that even if I don't have an internet connection (and therefore can't push to github), I can just do a git push dropbox master and know that the next time I am connected, my repo will be backed up to my Dropbox.

August 22, 2011 | Unregistered CommenterPaul

In regards to @Jonathan Prior's statement, I saw this on HN a while back: http://jeffkreeftmeijer.com/2011/introducing-tapir-simple-search-for-static-sites/

August 22, 2011 | Unregistered CommenterPaul

Paul, that's hard core, but interesting. My first thought is that approach would scare people off, but maybe not. I don't get the contributions I was hoping for; maybe a git approach would be more attractive?

Jonathan, thanks for that. When I couldn't find the site search for Bing I thought it was just me. Tapir looks interesting.

August 22, 2011 | Registered CommenterTodd Hoff

In my mind, the best approach is still static-with-AJAX-where-needed. IOW, serve up the html/css/javascript files statically, and then let javascript create the "dynamic" page for you. This helps with scaling since the client machine is doing the GUI processing work, not the server.

Of course, there still need to be "dynamic" REST endpoints, etc. This separation is also great for easily developing multiple user interfaces that draw on the same data source.

I guess that's exactly in line with the "mashup" culture you mentioned. :)

August 23, 2011 | Unregistered CommenterSteve

This is a great post that highlights the advantages of serving mostly static content. The reality is that most sites can get by with a minimum of dynamic content and also that many, many sites are really just static anyway. We've had great success using Bricolage to publish our site at www.groupcomplete.com. Bricolage is not new or fancy but does a terrific job of separating the content from the templates, pushing the end-result content up to our servers and sure beats the trouble of worrying whether a dynamic CMS has crashed or been subjected to the latest security flaw.

August 23, 2011 | Unregistered CommenterMatt

"...The web server or OS can easily cache popular pages in memory and serve them like mint juleps at the Kentucky Derby..."

That approaches literature. Well done.

August 24, 2011 | Unregistered CommenterDon McArthur

It's interesting that most corporate websites maintained by large enterprises rely on this approach, and most enterprise-grade CMSs are actually 'offline'. There are benefits to this approach in terms of IT operations and security; however, the predominant reason is that these CMSs are legacies of the early web days, when dynamic websites were exotic. So things have come full circle.

August 26, 2011 | Unregistered CommenterIgor Lobanov

Couldn't you use a private CMS somewhere for editing and maintaining page text (granted, CSS would have to be maintained through Jekyll or whatever generator you wanted)? That way, if you edited the page in your CMS, you could easily update your static site by just regenerating from the database on the fly.

This could allow WordPress installations to become private installations on your desktop, which then push out the static output to S3. The best of both worlds: ease of editing in the CMS, fast site serving, and the elimination of database calls for Internet use.

August 29, 2011 | Unregistered Commentercreeva

Barebones CMS is a static file generating CMS. The built-in cache is sufficient for most purposes but if you throw a server like nginx in front and some special rules that look for the cache files, you could simply serve up the cached content it generates directly, thus completely bypassing PHP. Then you would obviously have no limit on system performance other than how fast the hardware can serve static content. It isn't exactly ready for use as a blog, but it doesn't seem like it would be too difficult to turn Barebones CMS into one.

August 29, 2011 | Unregistered CommenterTom Tom

High Scalability rocks! The best of its kind.

Coming to static sites: the blog engine MovableType supports static publishing for high-traffic blogs - http://www.movabletype.org/documentation/administrator/publishing/static-and-dynamic-publishing.html. Static publishing works great if we handle the dynamic content through JavaScript.

Thanks

September 29, 2011 | Unregistered CommenterArun Avanathan
