Strategy: Sample to Reduce Data Set

Todd Hoff's picture

Update: Arjen links to the video Supporting Scalable Online Statistical Processing, which shows
"rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result."

When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need, then you've made the problem smaller: you'll need fewer resources and you'll get more timely results.

Sampling is not useful when you need a complete list that matches specific criteria. If you need to know the exact set of people who bought a car in the last week, then sampling won't help.

But if you want to know how many people bought a car, then you could take a sample and create an estimate for the full data set. The difference is you won't know the exact car count. Instead you'll have a confidence interval: a range that, at a chosen confidence level such as 95%, is expected to contain the true count.
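To make the car count example concrete, here is a minimal Python sketch of one way to do it, not the only way. It assumes a simple random sample and the normal approximation for a proportion, with a finite population correction. The function name `estimate_count`, the `bought_car` field, and the made-up customer records are all hypothetical and exist only for illustration.

```python
import math
import random


def estimate_count(population, predicate, sample_size, z=1.96, seed=None):
    """Estimate how many items in `population` satisfy `predicate`.

    Returns (estimate, low, high), where low..high is an approximate
    confidence interval (z=1.96 corresponds to roughly 95%).
    Assumption: simple random sampling, normal approximation.
    """
    rng = random.Random(seed)
    n_total = len(population)

    # Draw the sample without replacement and count the matches in it.
    sample = rng.sample(population, sample_size)
    hits = sum(1 for item in sample if predicate(item))
    p_hat = hits / sample_size

    # Standard error of the sample proportion.
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size)

    # Finite population correction: the interval shrinks to zero width
    # when the sample is the whole population.
    fpc = math.sqrt((n_total - sample_size) / (n_total - 1)) if n_total > 1 else 0.0
    margin = z * se * fpc

    # Scale the proportion back up to a count over the full data set.
    estimate = hits * n_total / sample_size
    low = max(0.0, p_hat - margin) * n_total
    high = min(1.0, p_hat + margin) * n_total
    return estimate, low, high


if __name__ == "__main__":
    # Hypothetical data: 100,000 customers, 3,000 of whom bought a car.
    customers = [{"bought_car": i < 3000} for i in range(100_000)]
    est, low, high = estimate_count(
        customers, lambda c: c["bought_car"], sample_size=5000, seed=1
    )
    print(f"estimated buyers: {est:.0f} (approx. 95% CI {low:.0f} to {high:.0f})")
```

The sample touches 5% of the records here, and the price is an interval instead of a single exact number. A larger sample narrows the interval, so you can trade processing time against precision.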

We generally like exact numbers. But if running a report takes an entire day because the data set is so large, then taking a sample is an excellent way to scale.

Comments

Re: Strategy: Sample to Reduce Data Set

Great and interesting post on confidence building. Agree with you fully. However we have our own view on confidence building too. You can find out more at http://www.confidencebuildingcourses.com

Todd Hoff's picture

Re: Strategy: Sample to Reduce Data Set

Since this is funny spam I won't delete it. Google alert is a dangerous tool for the undiscriminating. But if I were you I wouldn't fall backwards into my arms...

Re: Strategy: Sample to Reduce Data Set

Arjen, stop spamming my browser, lol. Everywhere I turn with regards to scalability your name pops up.
