Wednesday, December 8, 2010

How To Get Experience Working With Large Datasets

The Giant Twins

I think I have been lucky that several of the projects I have worked on have exposed me to managing large volumes of data. The largest dataset was probably at MailChannels, though Livedoor.com also had some sizeable data for their bookstore and department store. Most of the pain with Livedoor’s data came from it being in Japanese. Other than that, it was pretty static. This was similar to the data I worked with at the BBC. You would be surprised at how much data can be involved with a single episode of a TV show. With any in-house generated data, the update size and frequency are much less dramatic, even if the data is being regularly pumped in from 3rd parties.

Those Humans
The real fun comes when the public (that’s you guys) are generating data to be pumped into the system. MailChannels works with email, which is human generated (lies! 95% is actually from spambots). Humans are unpredictable. They suddenly all get excited about the same thing at the same time, they are demanding, impatient, and smell funny. The latter will not be so much of a problem to you, but the other points will if you intend to open your doors to user-generated information.

Need More Humans
Opening your doors will not necessarily bring you the large volumes of data you require. In the beginning it will be a bit draughty, and you might consider closing those doors. What you need is more of those human eyeballs looking through their glowing screens at your data collectors. Making that a reality is the hard part. If you do get to that point, then you will probably not have “getting experience with large datasets” at the forefront of your mind. Like sex, working with large datasets is most important to those not working with large datasets. So we will look elsewhere.

Data Hunting
The Internet is really just one giant bucket of data soup. Sure, we all just see the blonde, brunette or redhead, but it’s actually just green squiggly codes raining down in uniform vertical lines behind the scenes. What you need to figure out is how to get those streams of data that are flying around in the tubes of the Internet to be directed through your tube, into your MongoDB NoSQL database, Solr search engine, Hadoop distributed file-system or Cassandra cluster.

“All I see now is blonde, brunette, redhead”
– Cypher, The Matrix

Data Sources
The trouble with most sources of data is that they are owned and the data is copyrighted or proprietary. You can scrape websites and, if you fly under the radar, you will get a dataset. If you want a large dataset, though, it will take a lot of scraping. Instead, you should look for data that you can acquire more efficiently and, hopefully, legally. If nothing else, collecting the data in a legal manner will help you sleep better at night, and you will have a chance of going on to use that data to build something useful.

Here’s a list of places that have data available, provided by my good friend, Geoff Webb.

Some others I think are worth looking at are Wikipedia, Freebase and DBpedia. Freebase pulls its data from Wikipedia on a regular basis, as well as from TVRage, Metacritic, Bloomberg and CorpWatch. DBpedia also pulls data from Wikipedia, as well as YAGO, WordNet and other sources.

You can download a dump of the Freebase dataset here. More information on these dump files can be found here.
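
Once you have a dump file on disk, the first lesson of large datasets is not to load the whole thing into memory. Here is a minimal Python sketch of reading a dump one line at a time. It assumes the file is gzip-compressed, tab-separated text, which may not match the actual format of the dump you download, and the function names and the file name in the usage note are made up for illustration.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def iter_rows(path: str, delimiter: str = "\t") -> Iterator[List[str]]:
    """Yield one row at a time from a gzip-compressed, delimited text file.

    Assumption: the dump is gzip-compressed, UTF-8, one record per line.
    Only the current line is held in memory, so file size does not matter.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                yield line.split(delimiter)


def count_first_column(rows: Iterable[List[str]], top: int = 10) -> List[Tuple[str, int]]:
    """Return the most frequent values found in the first column."""
    counts = Counter(row[0] for row in rows)
    return counts.most_common(top)
```

With a hypothetical file called dump.tsv.gz, calling count_first_column(iter_rows("dump.tsv.gz")) would give you the ten most common values in the first column. The same generator can just as easily feed rows into whichever database or search engine you are trying to get experience with.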

 

Click here to read more...

 

Reader Comments (4)

The Physics arXiv Blog from Technology Review lists 70 online database sources:
https://www.technologyreview.com/blog/arxiv/26097/

Stanford's SNAP library also offers a large collection of network datasets useful for a variety of purposes:
http://snap.stanford.edu/data/index.html

December 9, 2010 | Unregistered Commenter Pavan Yara

Phil, don't forget you can just generate random data too! /dev/random and /var/lib/dict have been very useful in tons of testing and learning sessions I have been a part of.

Cheers.

December 9, 2010 | Unregistered Commenter Eric Wamsley

Thanks Eric.

Actually, this article does continue on my own blog and I do discuss generating your own data in almost an identical way, as well as a hybrid approach. Great to know that you unwittingly agreed with me :)

http://www.philwhln.com/?p=394

Cheers,
Phil

December 10, 2010 | Unregistered Commenter Phil Whelan

There are dozens and dozens of life-sciences databases listed here, most of which can be downloaded for free:

http://en.wikipedia.org/wiki/List_of_biological_databases

Gene sequences, protein structures, interactions, functional annotations etc.

If biology isn't your thing, try:

http://infochimps.com/

http://mldata.org/

http://www.kaggle.com/

December 20, 2010 | Unregistered Commenter Andrew Clegg
