In the more cool stuff I've never heard of before department is something called Self Cleansing Intrusion Tolerance (SCIT). Botnets are created when vulnerable computers live long enough to become infected with the will to do the evil bidding of their evil masters. Security is almost always about removing vulnerabilities (a process which to outside observers often looks like a dog chasing its tail). SCIT takes a different approach, it works on the availability angle. Something I never thought of before, but which makes a great deal of sense once I thought about it. With SCIT you stop and restart VM instances every minute (or whatever depending in your desired window vulnerability).... This short exposure window means worms and viri do not have long enough to fully infect a machine and carry out a coordinated attack. A machine is up for a while. Does work. And then is torn down again only to be reborn as a clean VM with no possibility of infection (unless of course the VM mechanisms become infected). It's like curing cancer by constantly moving your consciousness to new blemish free bodies. Hmmm... SCIT is really a genius approach to scalable (I have to work in scalability somewhere) security and and fits perfectly with cloud computing and swarm (cloud of clouds) computing. Clouds provide plenty of VMs so there is a constant ready supply of new hosts. From a software design perspective EC2 has been training us to expect failures and build Crash Only Software. We've gone stateless where we can so load balancing to a new VM is not problem. Where we can't go stateless we use work queues and clusters so again, reincarnating to new VMs is not a problem. So purposefully restarting VMs to starve zombie networks was born for cloud computing. If a wider move could be made to cloud backed thin clients the internet might be a safer place to live, play, and work. Imagine being free(er) from spam blasts and DDOS attacks. Oh what a wonderful world it would be...
Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.
Framework Fixation SolutionsHow can you avoid the framework fixation crash?
Hi, Generating unique ids is a common requirements in many projects. Generally, this responsibility is given to Database layer. By using sequences or some other technique. This is a problem for horizontal scalability. What are the Guid generation schemes used in high scalable web sites generally? I have seen use java's SecureRandom class to generate Guid. What are the other methods generally used? Thanks Unmesh
I've been interested in sharding concepts since first hearing the term "shard" a few years back. My interest had been piqued earlier, the first time I read about Google's original approach to distributed search. It was described as a hashtable-like system in which independent physical machines play the role of the buckets. More recently, I needed the capacity and performance of a Sharded system, but did not find helpful libraries or toolkits which would assist with the configuration for my language of preference these days, which is Python. And, since I had a few weeks on my hands, I decided I would begin the work of creating these tools. The result of my initial work the Pyshards project, a still-incomplete python and MySQL based horizontal partitioning and sharding toolkit. HighScalability.com readers will already know that horizontal partitioning is a data segmenting pattern in which distinct groups of physical row-based datasets are distributed across multiple partitions. When the partitions exist as independent databases and when they exist within a shared-nothing architecture they are known as shards. (Google apparently coined the term shard for such database partitions, and pyshards has adopted it.) The goal is to provide big opportunities for database scalability while maintaining good performance. Sharded datasets can be queried individually (one shard) or collectively (aggregate of all shards). In the spirit of The Zen of Python, Pyshards focuses on one obvious way to accomplish horizontal partitioning, and that is by using a hash/modulo based algorithm. Pyshards provides the ability to reasonably add polynomial capacity (number of original shards squared) without re-balancing (re-sharding). Pyshards is designed with re-sharding in mind (because the time will come when you must re-balance) and provides re-sharding algorithms and tools. Finally, Pyshards aspires to provide a web-based shard monitoring tool so that you can keep an eye on resource capacity. So why publish an incomplete open source project? I'd really prefer to work with others who are interested in this topic instead of working in a vacuum. If you are curious, or think you might want to get involved, come visit the project page, join a mailing list, or add a comment on the WIKI. http://code.google.com/p/pyshards/wiki/Pyshards Devin
Update 2: Hank Williams says iPhone Background Processing: Not Fixed But Halfway There. Excellent analysis of all the reasons you need real background processing. Hey, you can't even build an alarm clock! Hard to believe some commenters say it's not so.. Update: Josh Lowensohn of Webware tells us Why users should be scared of Apple's new notification system. A big item on the iPhone developer iWishlist has been background processing. If you can't write an app to poll for new data in the background how will you keep your even more important non-foreground app in sync? Live from the Apple developer conference we learn the solution is a centralized push based architecture. Here's the relevant MacRumorsLive transcript:
I have a table .This table has many columns but search performed based on 1 columns ,this table can have more than million rows. The data in these columns is something like funny,new york,hollywood User can search with parameters as funny hollywood .I need to take this 2 words and then search on column whether that column contain this words and how many times .It is not possible to index here .If the results return say 1200 results then without comparing each and every column i can't determine no of results.I need to compare for each and every column.This query is very frequent .How can i approach for this problem.What type of architecture,tools is helpful. I just know that this can be accomplished with distributed system but how can i make this system. I also see in this website that LinkedIn uses Lucene for search .Is Lucene is helpful in my case.My table has also lots of insertion ,however updation in not very frequent.
If you just can't get enough high scalability talk you might want to take a look GigaOm's Structure 08 conference. The slate of speakers looks appropriately interesting and San Francisco is truly magical this time of year. High Scalability readers even get a price break is you use the HIGHSCALE discount code! I'll be on vacation so I won't see you there, but it looks like a good time. For a nice change of pace consider visiting MoMA next door. Here's a blurb on the conference:
A reminder to our readers about Structure 08, GigaOm's upcoming conference dedicated to web infrastructure. In addition to keynotes from leaders like Jim Crowe, chairman and CEO of Level 3 Communications and Werner Vogels, CTO of Amazon, the event will feature workshops from Google App Engine, Microsoft and a special workshop from Fenwick and West who will cover how to raise money for an infrastructure start up. Learn from the guru's at Amazon, Google, Microsoft, Sun, VMWare and more about what the future of internet infrastructure is at Structure 08, which will be held on June 25 in San Francisco.
Scalability forces us to think differently. What worked on a small scale doesn't always work on a large scale -- and costs are no different. If 90% of our application is free of contention, and only 10% is spent on a shared resources, we will need to grow our compute resources by a factor of 100 to scale by a factor of 10! Another important thing to note is that 10x, in this case, is the limit of our ability to scale, even if more resources are added. 1. The cost of non-linearly scalable applications grows exponentially with the demand for more scale. 2. Non-linearly scalable applications have an absolute limit of scalability. According to Amdhal's Law, with 10% contention, the maximum scaling limit is 10. With 40% contention, our maximum scaling limit is 2.5 - no matter how many hardware resources we will throw at the problem. This post discuss in further details how to measure the true cost of non linearly scalable systems and suggest a model for reducing that cost significantly.