Wednesday
Dec 8, 2010

How To Get Experience Working With Large Datasets


I think I have been lucky that several of the projects I have worked on have exposed me to managing large volumes of data. The largest dataset was probably at MailChannels, though Livedoor.com also had some sizeable data for their book store and department store. Most of the pain with Livedoor's data came from it being in Japanese. Other than that, it was pretty static. This was similar to the data I worked with at the BBC. You would be surprised at how much data can be involved with a single episode of a TV show. With any in-house generated data the update size and frequency are much less dramatic, even if the data is being regularly pumped in from 3rd parties.


Tuesday
Dec 7, 2010

Sponsored Post: Joyent, Membase, Appirio, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

Cool Products and Services


Monday
Dec 6, 2010

What the heck are you actually using NoSQL for?

It's a truism that we should choose the right tool for the job. Everyone says that. And who can disagree? The problem is this is not helpful advice without being able to answer more specific questions like: What jobs are the tools good at? Will they work on jobs like mine? Is it worth the risk to try something new when all my people know something else and we have a deadline to meet? How can I make all the tools work together?

In the NoSQL space this kind of real-world data is still a bit vague. When asked, vendors tend to give very general answers, like NoSQL is good for BigData or key-value access. What does that mean for the developer in the trenches, faced with the task of solving a specific problem amid a dozen confusing choices and no obvious winner? Not a lot. It's often hard to take that next step and imagine how their specific problems could be solved in a way that's worth the trouble and risk.

Let's change that. What problems are you using NoSQL to solve? Which product are you using? How is it helping you? Yes, this is part of the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real, specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

Here's a list of use cases I came up with after some trolling of the interwebs. The sources are so varied I can't attribute every one, so I'll put a list at the end of the post. Please feel free to add your own. I separated out the use cases for a few specific products simply because I had a lot of use cases for them and they were clearer on their own. This is not meant as an endorsement of any sort. Here's a master list of all the NoSQL products. If you would like to provide a specific set of use cases for a product I'd be more than happy to add that in.
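
To make "key-value access" concrete, here's a minimal sketch of a session store, one of the most commonly cited use cases. It's plain Python using the redis-py client as an illustrative choice; the key naming and TTL are my own assumptions, not from any particular vendor:

    import json
    import redis  # illustrative client; assumes a Redis server on localhost

    r = redis.Redis(host="localhost", port=6379)

    def save_session(session_id, data, ttl_seconds=1800):
        # Store the whole session blob under one key and expire it
        # automatically, sidestepping relational schema and joins.
        r.setex("session:" + session_id, ttl_seconds, json.dumps(data))

    def load_session(session_id):
        raw = r.get("session:" + session_id)
        return json.loads(raw) if raw is not None else None

    save_session("abc123", {"user_id": 42, "cart": ["sku-1", "sku-9"]})
    print(load_session("abc123"))

The appeal is exactly that lookup-by-key shape: no joins, no schema migrations, and trivially easy to shard.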


Friday
Dec 3, 2010

GPU vs CPU Smackdown: The Rise of Throughput-Oriented Architectures

In some ways the original Amazon cloud, the one most of us still live in, was like that really cool house where, when you stepped inside and saw the old green shag carpet in the living room, you knew the house hadn't been updated in a while. The network is a little slow, the processors are a bit dated, and virtualization made the house just feel smaller. It has been difficult to run high bandwidth or low latency workloads in the cloud. Bottlenecks everywhere. Not a big deal for most applications, but for many high performance computing (HPC) applications it was a killer.

In a typical house you might just do a remodel. Upgrade a few rooms. Swap out builder quality appliances for gleaming stainless steel monsters. But Amazon has a big lot, so instead of remodeling they simply keep adding entire new wings, kind of like the Winchester Mystery House of computing.

The first new wing added was a CPU based HPC system featuring blazingly fast Nehalem chips, virtualization replaced by a close-to-metal Hardware Virtual Machine (HVM) architecture, and a monster 10 gigabit network with the ability to specify placement groups to carve out a low-latency, high bandwidth cluster. Bottlenecks removed. Most people probably don't even know this part of the house exists.

The newest addition is a beauty: a graphics processing unit (GPU) cluster, as described by Werner Vogels in Expanding the Cloud - Adding the Incredible Power of the Amazon EC2 Cluster GPU Instances. It's completely modern and contemporary. The shag carpet is out. In are Nvidia M2050 GPU based clusters, which make short work of applications in the sciences, finance, oil & gas, movie studios, and graphics.
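
To give a feel for what "throughput-oriented" means, here's a toy sketch in Python, with NumPy's bulk array operations standing in for a GPU's wide data-parallel hardware (a real run on the new instances would use something like CUDA); the sizes and workload are illustrative only:

    import time
    import numpy as np

    n = 2_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)

    # Latency-oriented style: one element at a time, as on a single core.
    t0 = time.time()
    out = [a[i] * b[i] + 1.0 for i in range(n)]
    serial = time.time() - t0

    # Throughput-oriented style: one wide, data-parallel operation over
    # the whole array, the shape of work GPUs are built to make fast.
    t0 = time.time()
    out_bulk = a * b + 1.0
    bulk = time.time() - t0

    print("element-by-element: %.3fs, bulk: %.4fs" % (serial, bulk))

The point isn't the exact numbers; it's that the second style exposes millions of independent operations at once, which is exactly what a throughput-oriented architecture can exploit.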


Wednesday
Dec 1, 2010

8 Commonly Used Scalable System Design Patterns

Ricky Ho in Scalable System Design Patterns has created a great list of scalability patterns along with very well done explanatory graphics. A summary of the patterns:

  1. Load Balancer - a dispatcher determines which worker instance will handle a request based on different policies.
  2. Scatter and Gather - a dispatcher multicasts a request to all workers in a pool. Each worker computes a local result and sends it back to the dispatcher, which consolidates them into a single response and sends it back to the client. (A minimal sketch of this pattern follows the list.)
  3. Result Cache - a dispatcher first looks up whether the request has been made before and returns the previous result if found, saving the actual execution.
  4. Shared Space - all workers monitor information in the shared space and contribute partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.
  5. Pipe and Filter - workers are connected by pipes across which data flows.
  6. MapReduce - targets batch jobs where disk I/O is the major bottleneck. It uses a distributed file system so that disk I/O can be done in parallel.
  7. Bulk Synchronous Parallel - a lock-step execution across all workers, coordinated by a master.
  8. Execution Orchestrator - an intelligent scheduler / orchestrator schedules ready-to-run tasks (based on a dependency graph) across a cluster of dumb workers.
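
As an illustration of the second pattern, here's a minimal Scatter and Gather sketch in plain Python; the shard contents and query are made up for the example:

    import concurrent.futures

    # Hypothetical shards: each worker owns one slice of the data.
    shards = [
        ["apple pie", "banana bread"],
        ["apple tart", "cherry cake"],
        ["plum jam", "apple butter"],
    ]

    def worker(shard, query):
        # Each worker computes a local result over its own shard.
        return [doc for doc in shard if query in doc]

    def scatter_gather(query):
        # Scatter: the dispatcher multicasts the request to every worker.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            partials = pool.map(worker, shards, [query] * len(shards))
        # Gather: consolidate partial results into a single response.
        return [hit for partial in partials for hit in partial]

    print(scatter_gather("apple"))
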
Wednesday
Dec 1, 2010

Sponsored Post: Cloudkick, Strata, Undertone, Joyent, Appirio, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

Cool Products and Services


Tuesday
Nov 30, 2010

NoCAP – Part III – GigaSpaces clustering explained

In many of the recent discussions on the design of large scale systems (a.k.a. web scale), it has been argued that the right set of tradeoffs for building large scale systems is to give away Consistency in exchange for Availability and Partition tolerance. Those arguments rely on the CAP theorem, developed between 2000 and 2002. One of the core principles behind the CAP theorem is that you must choose two out of the three CAP properties. In many transactional systems, giving away consistency is either impossible or adds huge complexity to the design.

In this series of posts, I've tried to suggest a different set of tradeoffs in which we could achieve scalability without compromising on consistency. I also argued that rather than choosing only two of the three CAP properties, we could choose varying degrees of all three, with the degrees determined by the most likely availability and partition tolerance scenarios in our specific application. The suggested model is based on the experience we have gained at GigaSpaces over the past years; it has been successfully deployed in many mission critical systems in the finance, telco, and ecommerce businesses. I hope that by sharing this experience we can come up with a broader set of patterns for building large scale systems that also fit mission critical transactional systems.
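
To picture "degrees" rather than a binary choice, here's a toy quorum sketch in plain Python (my own illustration, not GigaSpaces code): with N replicas, tuning the write quorum W and read quorum R so that R + W > N preserves consistency, while smaller quorums tolerate more failed replicas at the cost of weakening that guarantee:

    N = 3
    replicas = [{} for _ in range(N)]
    alive = [True, True, False]  # one replica is partitioned away

    def write(key, value, w):
        # Apply the write to every reachable replica, then check quorum.
        acks = sum(1 for i in range(N) if alive[i])
        for i in range(N):
            if alive[i]:
                replicas[i][key] = value
        if acks < w:
            raise RuntimeError("write quorum not met")

    def read(key, r):
        values = [replicas[i].get(key) for i in range(N) if alive[i]]
        if len(values) < r:
            raise RuntimeError("read quorum not met")
        return values[0]

    write("balance", 100, w=2)   # succeeds: 2 of 3 replicas acknowledge
    print(read("balance", r=2))  # R + W = 4 > N = 3: the read sees the write

The knob is the degree: w=3 demands full consistency but fails when any replica is down, while w=1 stays available through partitions but can serve stale reads.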


Monday
Nov 29, 2010

Stuff the Internet Says on Scalability For November 29th, 2010

Eating turkey all weekend and wondering what you might have missed?

Wednesday
Nov 24, 2010

Great Introductory Video on Scalability from Harvard Computer Science

Professor David Malan gives a very good lecture on scalability for dynamic websites. It's not highly technical (it's an extension course), but it's a great introduction to a wide variety of topics. I really like his teaching style. He continually asks questions, prompts for input, and gives accessible explanations. Some of the topics covered: vertical scaling; horizontal scaling; PHP acceleration; load balancing: DNS, L7, sticky sessions, load balancers; caching; MySQL: replication, load balancing, partitioning, high availability.
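
As a taste of one topic from that list, here's a toy sticky-session sketch in Python (my own illustration, not from the lecture): hash each session id so the same session always lands on the same backend:

    import hashlib

    backends = ["app1:8000", "app2:8000", "app3:8000"]

    def pick_backend(session_id):
        # The same session id always hashes to the same backend, so
        # per-session state stays local while sessions spread evenly.
        digest = hashlib.sha1(session_id.encode()).hexdigest()
        return backends[int(digest, 16) % len(backends)]

    for sid in ("alice", "bob", "carol"):
        print(sid, "->", pick_backend(sid))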

Watch it on Academic Earth

This is one lecture in a series of 13 lectures on building dynamic websites. Students learn how to: build dynamic websites with Ajax and with Linux, Apache, MySQL, and PHP (LAMP); set up domain names with DNS; structure pages with XHTML and CSS; program in JavaScript and PHP; configure Apache and MySQL; design and query databases with SQL; use Ajax with both XML and JSON; and build mashups.

For the lecture notes go to the OpenCourseWare site.


Tuesday
Nov 23, 2010

Sponsored Post: Imo, Undertone, Joyent, Appirio, Tuenti, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

Cool Products and Services
