
Strategy: Sample to Reduce Data Set

Update: Arjen links to the video Supporting Scalable Online Statistical Processing, which shows how,
"rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result."

When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need, then you've made the problem smaller: you'll need fewer resources and you'll get more timely results.

Sampling is not useful when you need a complete list of everything that matches specific criteria. If you need to know the exact set of people who bought a car in the last week, then sampling won't help.

But if you want to know how many people bought a car, then you could take a sample and use it to create an estimate for the full data set. The difference is you won't know the exact car count. Instead you'll have an estimate with a confidence interval: a range that, at a chosen confidence level, is expected to contain the true count.
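
To make the car example concrete, here is a minimal sketch in Python. It is not taken from the video; it assumes a simple random sample drawn without replacement and a normal-approximation interval at roughly 95% confidence (z = 1.96). The function name `estimate_count`, the `predicate` argument, and the synthetic customer list are all made up for illustration.

```python
import math
import random


def estimate_count(population, predicate, sample_size, z=1.96, rng=None):
    """Estimate how many items in `population` satisfy `predicate`.

    Looks at only `sample_size` randomly chosen items and returns
    (estimate, low, high), where low..high is an approximate
    confidence interval for the true count.
    """
    rng = rng or random.Random()
    n = len(population)
    # random.sample draws without replacement
    sample = rng.sample(population, sample_size)
    hits = sum(1 for item in sample if predicate(item))
    p = hits / sample_size  # sample proportion

    # Standard error of a proportion (normal approximation)
    se = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction, since we sample without replacement
    fpc = math.sqrt((n - sample_size) / (n - 1)) if n > 1 else 0.0
    margin = z * se * fpc

    low = max(0.0, p - margin) * n
    high = min(1.0, p + margin) * n
    return p * n, low, high


# Hypothetical data: 200,000 customers, True means "bought a car"
customers = [i % 100 < 3 for i in range(200_000)]
estimate, low, high = estimate_count(customers, lambda bought: bought, 10_000)
print(f"estimated buyers: {estimate:.0f} (between {low:.0f} and {high:.0f})")
```

The report only touches 10,000 of the 200,000 records. A larger sample narrows the interval and a smaller one widens it, so the sample size is the knob for trading accuracy against run time.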

We generally like exact numbers. But if running a report takes an entire day because the data set is so large, then taking a sample is an excellent way to scale.

Reader Comments (3)

Great and interesting post on confidence building. Agree with you fully. However we have our own view on confidence building too. You can find out more at

November 29, 1990 | Unregistered Commenter confi

Since this is funny spam I won't delete it. Google alert is a dangerous tool for the undiscriminating. But if I were you I wouldn't fall backwards into my arms...

November 29, 1990 | Unregistered Commenter Todd Hoff

Arjen, stop spamming my browser, lol. Everywhere I turn with regards to scalability, your name pops up.

November 29, 1990 | Unregistered Commenter Coen Hyde
