Using the Ambient Cloud as an Application Runtime

This is an excerpt from my article Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud.

The future looks many, big, complex, and adaptive:

Many clouds.
Many servers.
Many operating systems.
Many languages.
Many storage services.
Many database services.
Many software services.
Many adjunct human networks (like Mechanical Turk).
Many fast interconnects.
Many CDNs.
Many cache memory pools.
Many application profiles (simple request-response, live streaming, computationally complex, sensor driven, memory intensive, storage intensive, monolithic, decomposable, etc).
Many legal jurisdictions. Don't want to perform a function on Patriot Act "protected" systems then move the function elsewhere.
Many SLAs.
Many data driven pricing policies that like airplane pricing algorithms will price "seats" to maximize profit using multi-variate time sensitive pricing models.
Many competitive products. The need to defend your territory never seems to go away. Though what will map to scent-marking I'm not sure.
Many and evolving resource gradients.
Big concurrency. Everyone and everything is a potential source of real-time data that needs to processed in parallel to be processed at all within tolerable latencies.
Big redundancy. Redundant nodes in an unpredictable world will provide cover for component failures and workers to take over when another fails.
Big crushing transient traffic spikes as new mega worldwide social networks rapidly shift their collective attention from new shiny thing to new shiny thing.
Big increases in application complexity to keep streams synchronized acrosss networks. Event handling will go off the charts as networks grow larger and denser and intelligent behaviour attaches to billions of events generated per second.
Big data. Sources and amounts of historical and real-time data are increasing at increasing rates.

This challenging, energetic, ever changing world is a very different looking world than today. It's as if Bambi was dropped into the middle of a Velociraptor pack.

How can we design software systems to take advantage of the new world? Adaptation is the powerful unifying concept tying together how we can respond to its challenges. It's at the heart of the black box. Today's standard architectures do not deal well adaptation. Even something as simple as adapting elastically to varying CPU demands is considered a major architectural change.

The future will be more biologically inspired. Crazy numbers of resources. Crazy numbers of options. Crazy activity. Crazy speeds. Crazy change. Redundancy. Independent action without a centralized controller. Failure is an option and is key for creating fitter applications. Transient alliances form to carry out a task and disband. Independent evolution. And an eye constantly focused on more competitive organisms. It's an approach with less predictably and order than will be comfortable.

Potential Ideas and Technologies for the Ambient Cloud

While there's no clear blueprint for how to create this future, there are deep and compelling hints of what technologies and ideas could be part of the solution. We are going to cover some of these next.

Compute Grid

The Ambient Cloud excels as a compute grid, where a compute grid is a framework that supports parallel execution. As we've already talked about this potential of the Ambient Cloud in our discussion of Plura Processing, I won't go into more detail here.

The Ambient Cloud is a Runtime, Not a Platform

Imagine you are chunk of running code. When you look out, what do you see? The answer, for me, defines the difference between a runtime and a platform.

On a platform you see only what the platform provider allows you to see. In a runtime you see whatever you built the program to see and whatever is out there to talk to in whatever language the things want to talk in. It seems a subtle difference, but it's actually quite major.

The power of a platform like Google App Engine (GAE) comes as much from what you can't do as much as what you can do using Google's baked in services. In GAE you don't have access to the machine you run on, you can't spawn threads, you can't access many built-in APIs, there are quotas on CPU use, and you can only access the external world through a limited set of APIs or through HTTP, as long you can do it in less than 30 seconds. You live in a virtualized world. I'm not in anyway saying this is bad. There are very good and valid reasons for all these restrictions. The problem comes when you need to do something outside what is permitted.

In a runtime environment you would not have restrictions by design. You would have access to whatever is allowed on whatever you are running on. If that's an iPhone, for example, then you are limited only by the iPhone's restrictions. If you want a native look and feel iPhone app then get down and dirty with native APIs and build one. You are able to build an application that can best exploit whatever resources are available..

The Ambient Cloud should not be an abstraction layer that tries to make the multiplicity of the world appear the same no matter how different the underlying things they really are. As seductive as this approach is, it eventually doesn't work. You are limited by the vision of the platform provider and that doesn't work if you are trying to build something unlimited.

There are, have been, and always will be attempts to make a generic APIs, but innovation cycles usually invalidate these attempts in short order. We often see this cycle: new service, service specific APIs, generic APIs, generic frameworks, and generators from specifications. Then an innovation happens and it all falls apart and the cycle starts over again.

At certain points in the cycle technology becomes inverted. Applications play second fiddle to frameworks until the frameworks no longer look and work as well as the new fine tuned applications. In the Ambient Cloud applications always play first violin because it's only the application that can carry out their objectives while tracking the ever changing generations of technologies and services.

Applications can't wait for standards bodies. Opportunity barely knocks at all, but it certainly hardly every knocks twice. If Facebook games are the new hotness then make that happen. If Facebook is old hat and iPhone social games are the new hotness then jump on that. It's the application's job to make the user not care how it's done. For this reason the dream of utility computing may never be fully achieved. The wide variety of different clouds and different compute resources will make it difficult to come up with a true standardization layer.

There's a parallel here to military strategy. During the Vietnam War Colonel John Boyd wanted to figure out why pilots were dying at a faster rate over the skies of Vietnam than they did in Korea and WWII. After prodigious research Boyd found that the key was the agility of the figher plane, not the plane's speed, service ceiling, rate of climb, or even turning ability as was conventionally thought.

Agility: The ability to change its energy state rapidly. To turn, or climb, or accelerate faster than its opponent. And most importantly, to keep up that high energy state in the grueling, high-G turns that rapidly bled out speed and options. The ability to inflict damage and get out of the way of the counterstroke.

Conventional wisdom doesn't value agility highly. It likes the straightforward application of massive force. But Boyd found the key to victory was often: agility, speed, precision, and effectiveness. Winners don't always make the best decisions, they instead make the fastest decisions.

The Ambient Cloud values agility. You are able to do whatever is needed to carry out your objectives without being limited by a platform.

Autonomic Computing

Jeffrey Kephart and David Chess of IBM in 2003 published a stunning vision one possible future for computing their article The Vision of Autonomic Computing. Their vision is:

Systems manage themselves according to an administrator’s goals. New components integrate as effortlessly as a new cell establishes itself in the human body. These ideas are not science fiction, but elements of the grand challenge to create self-managing computing systems.

The inspiration for their approach is the human autonomic nervous system that unconsciously controls key bodily functions like respiration, heart rate, and blood pressure. Since computing systems don't breath yet, IBM has defined the following four infrastructure areas distributed autonomic nervous system could control:

Self-Configuration: Automatic configuration of components;
Self-Healing: Automatic discovery, and correction of faults;
Self-Optimization: Automatic monitoring and control of resources to ensure the optimal functioning with respect to the defined requirements;
Self-Protection: Proactive identification and protection from arbitrary attacks.

This is a compelling vision because at a certain level of complexity humans have to give way to automated solutions. But after a number of years it doesn't seem like we are that much closer to realizing their vision. Why not?

A lot of these features are being subsumed by the cloud. Each cloud vendor creates something like this for their own infrastructure. So we see Autonomic Computing in some shape or form within particular platforms. The question for applications is how far will cloud interoperability go? Will clouds interoperate at such a low level that applications can depends on these features everywhere they go? It seems unlikely as these features are key product differentiators between clouds. That means the responsibility of how to span all the different domains in the Ambient Cloud is up to the application.

One reason Autonomic Computing may not have taken off is because it doesn't contain enough of the Worse is Better spirit. It's a top down formulation of how all components should interact. Top down standards are always a hard sell in growth industries with lots of competition looking to differentiate themselves. Allowing features to bubble up from real world experience, warts and all, seems a more internet compatible model.

Another tactic that may have hindered Autonomic Computing is their emphasis on infrastructure instead of applications. Autonomic Computing hasn't provided a platform on which applications can be built. Without applications it's difficult for enterprises, datacenters, and equipment vendors to rationalize investing in such a complicated infrastructure.

With all the obvious power inherent in Autonomic Computing, it still needs to be more concerned with helping developers build complete applications end-to-end. That's what the Ambient Cloud will need to do too.

FAWN - Distributed Key-Value Database

It's tempting to think all the processors in sensor networks and cell phones can't do any real work. Not so. For a different view on the potential power of all these computer resources take a look at the amazing FAWN (Fast Array of Wimpy Nodes) project out of Carnegie Mellon University. What is FAWN?

We have designed and implemented a clustered key-value storage system, FAWN-KV, that runs atop these node. Nodes in FAWN-KV use a specialized log-like back-end hash-based datastore (FAWN-DS) to ensure that the system can absorb the large write workload imposed by frequent node arrivals and departures. FAWN-KV uses a two-level cache hierarchy to ensure that imbalanced workloads cannot create hot-spots on one or a few wimpy nodes that impair the system's ability to service queries at its guaranteed rate.

Our evaluation of a small-scale FAWN cluster and several candidate FAWN node systems suggest that FAWN can be a practical approach to building large-scale storage for seek-intensive workloads. Our further analysis indicates that a FAWN cluster is cost-competitive with other approaches (e.g., DRAM, multitudes of magnetic disks, solid-state disk) to providing high query rates, while consuming 3-10x less power.

FAWN is being positioned as a way to create low power clusters. Given the impact of electricity costs on TCO and the restricted availability of power at datacenters, this makes a lot of sense. But I think FAWN has a more intriguing role: opening up huge slices of the Ambient Cloud for greater use. Much of the Ambient Cloud looks suspiciously like what FAWN is already optimized to run on. The impact could be immense if FAWN or something like it could be deployed in the Ambient Cloud.

Although the 500MHz Atom processors being used are affectionately called wimpy, they are at least somewhat comparable to Amazon's small instance type which uses a 1 GHz Opteron. FAWN's 256 MB of main memory is comparable to low end VPS configurations. So the individual nodes are not powerhouses, but they are not without capability either. There is also a 1.6 GHz version of the Atom processor, which is a quite capable processor slice.

FAWN is not alone in embracing the small is beautiful school of low cost, power-efficient server design. It's part of new approach called physicalization. Physicalization is an approach to server design that packs multiple, cheap, low-power systems into a single rack space.

The "physicalization" name is a play on virtualization. Virtualization takes very powerful machines and runs as many virtual machines on them as they can. Machine utilization goes up which reduces total cost of ownership and everyone is happy. Well not everyone, which explains the physicalization strategy.

Physicalization runs in the opposite direction. Virtualization pays a large overhead penalty, especially for I/O intensive applications. Given that data intensive applications are I/O sensitive it makes sense not to share. Instead, make small, efficient Microslice Servers that can do more work for the same energy dollar.

This is the key rational for FAWN's hardware design. It's also the key to understanding why FAWN built a key-value store at its software layer. What FAWN supports is lots of hardware nodes that need to be kept busy. So the key to FAWN is exploiting its massive parallelism using partitioned data sets. This is exactly what key-value stores like Amazon's Dynamo excel at because keys can be hashed and distributed across available nodes.

In essence FAWN is a highly power-efficient cluster for performing key-value lookups, like those used at Amazon and Facebook for fetching pictures, Facebook pages, etc. So one way to think of FAWN is as a memcached replacement that has been architected to handle 100s and 1000s of terabytes of data. While each node isn't exactly a super computer, the aggregate performance and capacity is astounding. Just imagine what that performance will be like when in the not too distant future when devices like smart phones will have 64 cores and 1 PB of flash?

When you start to think about FAWN as a key-value store/memcached replacement it gets exciting because many websites are buying into the NoSQL movement, a large contigent of which are architecting their sites around key-value stores. The important question to ask is where will those key-value stores be hosted? On expensive EC2 instances? Or might key-value stores run on inexpensive clusters of low power machines in the Ambient Cloud, like Plura?

FAWN is just one part of a solution, it clearly doesn't solve all problems. It uses some old tech and is not currently suitable for CPU intensive tasks like video processing, but that will change once newer more powerful processors are used. Some workloads can't be easily partitioned, for example, relational databases, where the scale-down and multiply approach of FAWN isn't a win. But some hybrid form of a NoSQL store will become important in the future. The combination of horizontally scalability, joinless access, schemaless data models, fast writes and reads, and forgiving transactional semantics makes NoSQL a perfect match for much of the web. Once flash densities increase even the per byte cost advantage that disks have now for large data sets and streaming media may be breached, the Ambient Cloud itself will be a formidable storage device.

The Compound Materials approach of building systems is valuable here. We don't need one backend infrastructure that can do it all. Streaming and static media can still be served from a CDN or S3, but for the highly scalable, key-value component of your system FAWN would be a wonderful option. As a generalized key-value store FAWN can be used to store any kind of data. There's no reason an indexing system couldn't be built on top and there's no reason map-reduce couldn't run on top of FAWN either.

One huge tension we have now when designing systems centers on the great disk vs RAM vs flash memory debate. Currently there's a sort of RAM/disk hybrid as the dominant model. For large datasets RAM is used as a cache and disk as the storage of record. A company like Facebook, for example, keeps all their important warm data in 28 terabytes of memory cache. Google no doubt has astounding amounts of cache also, but they have designed their infrastructure to scale on top of a distributed file system spread across countless hard drives. In their quest to store all the world's information, Google has become the master of disk. Competing with Google based on disk technology is going to be very difficult. A better strategy may be to skip ahead to next generation technologies and architectures.

While new wonder storage technologies are always just over the horizon, our current options are: RAM, disk, and flash. RAM is fast, expensive, low density, and power hungry, which means the amount of RAM that can be allocated per server is relatively small, especially if you need to store many petabytes of data. Disk is cheap, high density, and slow, which means it is the only game in town for storing and querying big data. Flash/SSD has a larger capacity than RAM, but less than disk, is slower than RAM and faster than disk, is power efficient, and is becoming cheaper than RAM. As flash capacity and cost per unit is hitching a ride on Moore’s Law, we can speculate that in a few years that flash and flash/RAM hybrids will start replacing disk based architectures. In the Ambient Cloud that speculation makes a lot of sense as it leverages a huge amount of available unused flash storage.

Flash based systems require a different architecture, which complicates their adoption. Today the entire software stack is architected around optimizing around the deficiencies of a spinning disk. SSD is very different in character. SSD as the "solid state" part in the name implies has no seek time/head settle time, so all the optimization is at best a waste, and at worst completely counter productive. Flash is also comparatively slow to write, which is why FAWN uses a log structured file system. Using flash means the entire software stack needs to change for flash to be the best it can be. FAWN is one of the first large scale, distributed key-value stores optimized for flash, and as such it's a good model for a next generation storage platform.

Another interesting system that is comparable to FAWN is Gordon. Gordon is a system architecture for data-centric applications that combines low-power processors, flash memory, and datacentric programming systems to improve performance for data-centric applications while reducing power consumption.

To summarize, FAWN is a good template for a super scalable key-value data storage because:

It operates efficiently on the hardware profile common in the Ambient Cloud: lower power and flash storage.
It is capable of efficiently storing and accessing large amounts of data.
New nodes can dynamically leave and join the system.

Cleversafe - Space and Bandwidth Efficient High Availability

The Ambient Cloud differs from the traditional cloud in one very important way: nodes are unreliable. Take a smart phone as an example. A phone becomes unavailable when it goes in and out of signal range. People lose their phones. Or perhaps they buy a new phone. Or perhaps someone actually uses all the storage on their phone and has to dump their Ambient Cloud partition. Or perhaps sensor nodes may die at a high rate because they live out in the harsh world. There's lots of churn in the Ambient cloud and that's a problem because nobody wants to lose their data.

Reliability is commonly setup by creating a replica or two or three in different geographical locations and that pretty much gives you your high availability. With replicas there are always issues with transactions, syncronization, fail-over, recovery, etc, but since can be fairly confident that your replicas are reliable, there's no need to make dozens of copies.

In the Ambient Cloud nodes are less reliable so you need more copies to be safe. The problem is each copy uses storage and bandwidth in proportion to the number of copies. And transactions and failover become more complicated as replicas are added.

Making copies may seem like the only option for ensuring availability, but there's another little known method called information dispersal, that is used by a company called Cleversafe in their storage product. Here's their explanation:

Using information dispersal --slicing the data and spreading it out among servers, only to be reassembled when you need it--that very same file has a 60% storage overhead in a standard configuration. That’s it. It doesn’t increase. And, if a number of servers where slices are dispersed happen to fail, you can get your entire, uncorrupted and undamaged file back, as long as you have access to a minimum threshold for retrieval. That means, for example, if six of 16 slices are down, you can still get the entire file back. The entire file. At a lower cost. With less complexity. Without replication. Information dispersal does it without the stress, risk, hardware and bandwidth associated with the old way.

This approach is based on a fascinating field of study called Reed-Solomon coding. For an overview take a look at A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems. Reed-Solomon encoding is not new, it is already used in CDs, DVDs and Blu-ray Discs, in data transmission technologies such as DSL & WiMAX, in broadcast systems such as DVB and ATSC, and in computer applications such as RAID 6 systems.

The takeaways for the Ambient Cloud are:

Data can be sliced up into multiple slices and spread across the Ambient Cloud. As parallelism is the strength of the Ambient Cloud this is a powerful synergy.
Efficient use of bandwidth and storage by not making complete copies.
Not all the slices are needed to reassemble data. It's fault tolerant. If even a substantial number of nodes go down the data can still be reassembled.
All updates are versioned so old and obsolete data will not be used in reassembling objects. This takes care of the scenario when a node comes back and joins the cloud again. Background rebuild processes take care of rebuilding data that is out of sync.
Data is secure because given any one slice you can't reassemble the whole object.

I had a long email thread with a Cleversafe representative and they were very helpful in helping me understand how their system works. While Cleversafe is promising, because it does provide reliability over unreliable elements, there are some issues:

Cleversafe, quite understandably, uses an appliance model. To use their system you need to buy their appliances. That won't really work over the Ambient Cloud as the Ambient Cloud is effectively the appliance.
Their target market are large files, generally several gigabytes files in length, with one writer at a time. The overhead of dealing with small objects with millions of readers and writers is way too inefficient.

The architecture of their system is based around disk storage managed by sliceservers. It's possible using flash based storage might make a more parallelized design possible, especially if techniques like eventual consistency and client based read repair were employed.

While I haven't been able to work out the how, I can't help but think there is something essential in this approach, building reliability on top of unreliable components, that meshes well with the Ambient Cloud.

OceanStore - Global Data Store Designed to Scale to Billions of Users

There have been several attempts at large scale distributed storage systems. One of the most interesting is OceanStore, which was designed for 1 billion users, each storing 10,000 objects, for a total of 10 trillion objects. What is OceanStore?

OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through pro-active movement of data.

Here's a quick overview of OceanStore's main features:

Infrastructure is comprised of untrusted servers
Data protected through redundancy and cryptographic techniques
Uniform and highly-available access to information; separation of information from location
Servers geographically distributed and exploit caching close to clients
Data can be cached anywhere, anytime, and can flow freely
Nomadic data (an extreme consequence of separating information from location)
Promiscuous caching: trade off consistency for availability; continuous introspective monitoring to discover data closer to users
Named by a globally unique id, GUID
Objects are replicated and stored on multiple servers
Provide ways to locate a replica for an object
Objects modified through updates
Messages route directly to destination, avoiding the round-trips that a separate data location and routing process would incur

What is attractive about OceanStore for the Ambient Cloud is that it:

Is a generalized object store and not a pure file system.
Assumes from the start that the infrastructure is constantly changing and can handle nodes dropping in and out.
Provides truly nomadic data that is free to migrate and be replicated anywhere in the world.
Built-in continuous testing and repair of information. Automated preventive maintenance.

All these are necessary attributes of a planet-scaling system. Of particular interest is the ability to keep numerous replicas. Imagine millions of users accessing data. If there's only one copy of data then that CPU will be swamped serving that data. So copies need to made in order to let massively parallel streams of users access data in parallel. Not many systems have this feature. Keeping replicas to a minimum is the usual rule because they are hard to maintain. A side benefit is geographic information can also be used to route to the nearest replica.

A prototype called Pond was built around 2003 and at that time Pond, in the wide area, out performed NFS by up to a factor of 4.6 on read intensive phases of the Andrew benchmark, but underperforms NFS by as much as a factor of 7.3 on write intensive phases. The slowness of the writes is attributed largely to erasure coding, which is similar to how Cleversafe works. One possible fix is to use Low-Density Parity-Check (LDPC) codes instead.

In the paper Erasure Coding vs. Replication: A Quantitative Comparison, they show show that a replication based system requires an "order of magnitude more bandwidth, storage, and disk seeks as an erasure encoded system of the same size" to achieve the same mean time to failure (MTTF). So clearly this technology has advantages for the Ambient Cloud. The problem is it also has some drawbacks. Opening connections to a large number of machines in order to reassemble small objects will be slow. So just using erasure encoding to store objects probably won't work.

The innovative solution recommended by the paper is to separate the mechanisms for durability from the mechanisms of latency reduction. Erasure encoding can be used for durability and a caching layer can be used for latency reduction. This is exactly the approach Pond takes in their implementation.

It's not clear what the current status of OceanStore is, it hasn't been updated for a while so it may be dead. For the project publications see The OceanStore Project.

NoSQL - Thinking Different

NoSQL has been mentioned a few times already. In this section what I mostly mean by NoSQL is the spirit of tackling problems differently, not relying on how things have always been done and being willing to overcome the inertia of the past. That's the same sort of adventuring spirit that will be needed to planet-scale.

An example of thinking differently at scale is how Google has you code using Google App Engine (GAE). GAE is built on a distributed file system so getting each entity is really retrieving a different disk block from a different cluster. This is already very different than a relational database where the assumption is that lookups are very fast because of how the data is stored. Individual gets in GAE are actually slow, but they are highly scalable because they are distributed across a sea of disks. Doing a table scan becomes very slow because each entity could be anywhere in any cluster. The implication is that as data becomes large aggregate operators like ave() become impossible. For the programmer what this means is that the average has to be calculated on the write operation, not on the read. In other words, precompute on writes so your reads are very fast. And if the write volume is high the sums will have to be sharded. Another difficult concession to scalability concerns.

All these type of techniques, and there are many more, are completely alien to what people are used to with SQL. But what Google did is create a system that could scale while still being usable enough for developers. Amazon's SimpleDB also asks you to make a series of compromises in order to scale. Developers trying to figure out how to structure a system using key-value stores in order to have fast scalable lookups are dealing with similar issues. They are asking, what needs to change so we can scale our applications?

So we have to be willing to step outside our comfort zones. Play with different models of consistency. Play with moving responsibility to the programmer (and hopefully back again). Play with different programming patterns. Play with how data is stored. Play with different ways of making queries using approaches like Pig Latin, MapReduce, HIVE, and Dryad. The idea being that we may have to write applications a lot differently in the Ambient Cloud than when using a centralized SQL database.

Market Based Resource Allocation

The Internet is expanding to include a large number of competing services: CPU, storage, database, search engine, discovery, payment processing, application specific, security, privacy, memory, pub/sub, email filtering, blacklisting, messaging, job scheduler, and other best of breed services. Both general purpose and specialized clouds will coexist.

Competition between services and freely available resources in the Ambient Cloud creates a continuously variable supply at different price/performance curves. Adaptive applications can reconfigure and redeploy themselves to take advantage of available opportunities when they occur.

These different price/performance curves are to software like chemical gradients are to single-cell organisms. Single-cell organisms move in the right direction by detecting chemicals in their surroundings. If they sense something pleasurable, like food, they follow in the direction of the highest concentration of nutrients (the gradient). If they sense something painful, like a toxic chemical, they make a random move and start over again. Foraging paths follow regions of high food quality. Here we see the beginnings of our nervous system and brain.

Applications in the Ambient Cloud can similarly follow a complex multi-dimensional gradient in order to optimize its goals. Goals can include: minimize cost, maximize throughput, maximize performance, maximize availability, and so on. Cost will become a major part of architecture evolution.

We've already seen this kind of thinking in our discussion of algorithmic trading. But we see it in a lot of other areas of our lives too. eBay is one example of a trading market that most of us have experience with.

If you've flown then you probably know the airline industry uses dynamic pricing in order to increase revenue by selling goods to buyers “at the right time, at the right price.” Airlines now let software make automatic pricing decisions on over 90 percent of tickets sold.

Maybe you've used Priceline.com to find a hotel, airline ticket, or rental car? Priceline use a value-based pricing approach where buyers make an offer on a product and they see if there's a seller willing to make a deal at that price.

For ages we've had commodity markets where raw commodities like food, metals, electricity, and oil are traded. Aren't compute resources are just another commodity to trade?

What markets are good at is allocating supply to demand. What automated market mechanisms are good at is doing this quickly, efficiently, intelligently, without human intervention. All of which are required if applications are going to trade resources. We've already gone over several examples of how this works in other markets, there's not reason in shouldn't work with compute resources as well.

There are a couple problems with this vision:

There is no market currently for compute resources.
Application architectures are not flexible enough to integrate market mediated resources even if they were available.

So we are in a bit of a chicken and egg situation here. Both angles need to be worked. With the service model becoming dominant, command and control through APIs is possible, which should enable a market to be made. And traditional architectures are relatively static, rigid, and require a lot of hand holding. More work will have to be done in this area as well.

The reason why I think a market is so critical to the Ambient Cloud is because I think it's the only way to make the dynamic pool of exponentially growing compute resources available to applications. Applications can only consume resources if they are packaged up and made available. Since the resources are dynamic there needs to be some sort of exchange set up to allow applications to know when new resources become available and search for existing resources. Once resources have been found a contract needs to entered, payment made, and then the resource need to be integrated into the application and released when done. Given the complexity, size, and rate of change of this operation, only a market mechanism will work to match supply and demand.

The first few parts of this section have been on possible directions to take on designing software for the Ambient Cloud. There is inspiration in FAWN, Cleversafe, OceanStore, and NoSQL. But that's not enough. What developers need to write applications is a stable, guaranteed environment for that software to run in. And I think a market enabled Ambient Cloud can be that platform.

If you would like to read the rest of the article please take a look at Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud.