Why is there something rather than nothing? That's the kind of question the Large Hadron Collider at CERN is hopefully poised to answer. And what is the output of this beautiful 17-mile-long, six-billion-dollar, wabi-sabish proton-smashing machine? Data. Great heaping torrents of Grand Canyon sized data. 15 million gigabytes every year. That's 1,000 times the information printed in books every year. It's so much data that 10,000 scientists will use a grid of 80,000+ computers in 300 computer centers across 50 different countries just to make sense of it all.
How will all this data be collected, transported, stored, and analyzed? It turns out, by using what amounts to a sort of Internet of Particles instead of an Internet of Things.
As the kings of scaling, when Google changes its search infrastructure to do something completely different, it's news. In Google search index splits with MapReduce, an exclusive interview by Cade Metz with Eisar Lipkovitz, a senior director of engineering at Google, we learn a bit more about the secret scaling sauce behind Google Instant, Google's new, faster, real-time search system.
The challenge for Google has been how to support a real-time world when the core of their search technology, the famous MapReduce, is batch oriented. Simple: they got rid of MapReduce. At least, they got rid of MapReduce as the backbone for calculating search indexes. MapReduce still excels as a general query mechanism against masses of data, but real-time search requires a very specialized tool, and Google built one. Internally, the successor to Google's famed Google File System was code-named Colossus.
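MapReduce's batch orientation is easy to see in miniature: every map output must be collected and grouped before any reduce can run, so no result is visible until the whole pass finishes. Here is a minimal, purely illustrative word-count sketch (the function names are mine, not Google's internal API):

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group all values by key. This barrier between map and reduce
    is what makes the model batch-oriented."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum counts per word; can only run after the full shuffle finishes."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["real time search", "batch search"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["search"])  # 2
```

For a continuously updating index, that shuffle barrier means re-running the whole job on every change, which is exactly the cost a real-time system can't pay.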
Details are slowly coming out about their new goals and approach:
We don't have a lot of details on how Google pulled off their technically very impressive Google Instant release, but in Google Instant behind the scenes, they did share some interesting facts:
- Google was serving more than a billion searches per day.
- With Google Instant they served 5-7X more results pages than previously.
- Typical search results were returned in less than a quarter of a second.
- A team of 50+ worked on the project for an extended period of time.
Although Google is associated with muscular data centers, they didn't just throw more server capacity at the problem; they worked smarter too. What were their general strategies?
- Lesson #1. Put Smarty compile and template caches on an active-active DRBD cluster with high load and your servers will DIE!
- Lesson #2. Don't use out-of-the-box configurations.
- Lesson #3. Single points of contention will eventually become a bottleneck.
- Lesson #4. Plan in advance.
- Lesson #5. Offload your databases as much as possible.
- Lesson #6. File systems matter and can run out of space / inodes.
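Lesson #5 in particular lends itself to a small sketch: a read-through cache in front of the database means repeated queries are served from memory, and only misses or expired entries ever reach the database. This is a minimal illustration with a hypothetical `QueryCache` class, not any particular caching product:

```python
import time

class QueryCache:
    """Tiny read-through cache: serve repeated queries from memory so
    the database only sees misses and expired entries."""
    def __init__(self, fetch, ttl_seconds=60):
        self.fetch = fetch          # the expensive database call
        self.ttl = ttl_seconds      # how long a cached result stays fresh
        self.store = {}             # query -> (expires_at, result)

    def get(self, query):
        entry = self.store.get(query)
        now = time.monotonic()
        if entry and entry[0] > now:
            return entry[1]         # cache hit: database untouched
        result = self.fetch(query)  # cache miss: hit the database once
        self.store[query] = (now + self.ttl, result)
        return result

calls = []
def fake_db(query):
    calls.append(query)             # record every real database hit
    return query.upper()

cache = QueryCache(fake_db)
cache.get("select 1")
cache.get("select 1")
print(len(calls))  # 1 -- the database was queried only once
```

In production the same idea is usually delegated to memcached or Redis, but the offloading principle is identical.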
For more details and explanations see the original post.
Jesper Söderlund put together an excellent list of four general scalability patterns and four subpatterns in his post Scalability patterns and an interesting story:
- Load distribution - Spread the system load across multiple processing units
- Load balancing / load sharing - Spreading the load across many components with equal properties for handling the request
- Partitioning - Spreading the load across many components by routing an individual request to the component that owns that specific data
- Vertical partitioning - Spreading the load across the functional boundaries of a problem space, separate functions being handled by different processing units
- Horizontal partitioning - Spreading a single type of data element across many instances, according to some partitioning key, e.g. hashing the player id and doing a modulus operation, etc. Quite often referred to as sharding.
- Queuing and batch - Achieve efficiencies of scale by processing batches of data, usually because the overhead of an operation is amortized across multiple requests
- Relaxing of data constraints - Many different techniques and trade-offs regarding the immediacy of processing / storing / accessing data fall under this strategy
- Parallelization - Work on the same task in parallel on multiple processing units
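The horizontal partitioning pattern from the list above can be sketched in a few lines: hash the partitioning key and take a modulus, so each player's data always routes to the same shard. The shard count and function names here are illustrative assumptions:

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(player_id: str) -> int:
    """Horizontal partitioning (sharding): hash the key and take a
    modulus, so the same id always lands on the same shard."""
    digest = hashlib.md5(player_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Routing is stable and bounded by the shard count.
print(shard_for("player42") == shard_for("player42"))  # True
print(0 <= shard_for("player42") < NUM_SHARDS)         # True
```

Note the trade-off a plain modulus implies: changing `NUM_SHARDS` remaps most keys, which is why systems that expect to reshard often reach for consistent hashing instead.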
For more details and explanations see the original post.
- deviantART is Hiring a Senior Software Engineer.
- Okta is hiring! Okta provides a ground-breaking cloud adoption and management solution and they are looking for people in many different areas.
Cool Products and Services
- CloudSigma. Instantly scalable European cloud servers.
- ManageEngine Applications Manager: Application performance monitoring and virtualization monitoring.
- www.site24x7.com : Website Monitoring Service from a global monitoring network.
This is so funny I laughed until I cried! Definitely NSFW. OMG it's hilarious, but it's also not a bad overview of the issues. Especially loved: You read the latest post on HighScalability.com and think you are a f*cking Google architect and parrot slogans like Web Scale and Sharding but you have no idea what the f*ck you are talking about. There are so many more gems like that.
Thanks to Alex Popescu for posting this on MongoDB is Web Scale. Whoever made this deserves a Webby.
The need for IT consolidation is most evident in two types of organizations. In the first group, IT grew organically with the business over the decades, and survived changes of strategy, management, staff, and vendor orientation. The second group, capital groups, is characterized by rapid growth through acquisitions (followed by attempts to integrate radically different IT environments). In both groups, their IT infrastructures have typically been pieced together over the past 20 (or more) years.
Read more on BigDataMatters.com