I think I have been lucky that several of the projects I been worked on have exposed me to having to manage large volumes of data. The largest dataset was probably at MailChannels, though Livedoor.com also had some sizeable data for their books store and department store. Most of the pain with Livedoor’s data was from it being in Japanese. Other than that, it was pretty static. This was similar to the data I worked with at the BBC. You would be surprised at how much data can be involved with a single episode of a TV show. With any in-house generated data the update size and frequency is much less dramatic, even if the data is being regularly pumped in from 3rd parties.
The real fun comes when the public (that’s you guys) are generating data to be pumped into the system. MailChannels’ work with email, which is human generated (lies! 95% is actually from spambots). Humans are unpredictable. They suddenly all get excited about the same thing at the same time, they are demanding, impatient, and smell funny. The latter will not be so much of problem to you, but the other points will if you intend to open your doors to user generated information.
Need More Humans
Opening your doors will not necessarily bring you the large volumes of data you require. In the beginning it will be a bit draughty, and you might consider closing those doors. What you need is more of those human eyeballs looking through their glowing screens at your data collectors. Making that a reality is the hard part. If you do get to that point, then you will probably not have “getting experience with large datasets” at the fore-front of your mind. Like sex, working with large datasets is most important to those not working with large datasets. So we will look elsewhere.
The Internet is really just one giant bucket of data soup. Sure, we all just see the blonde, brunette or redhead, but it’s actually just green squiggly codes raining down in uniform vertical lines behind the scenes. What you need to figure out is how to get those streams of data that are flying around in the tubes of the Internet to be directed through your tube, into your MongoDB NoSQL database, Solr search engine, Hadoop distributed file-system or Cassandra cluster.
“All I see now is blonde, brunette, redhead”
– Cypher, The Matrix
The trouble with most sources of data is that they are owned and the data is copyrighted or proprietary. You can scrape websites and, if you fly under the radar, you will get a dataset. Although, if you want a large dataset, then it will take a lot of scraping. Instead, you should look for data that you can acquire more efficiently and, hopefully, legally. If nothing else, collecting the data in a legal manner will help you sleep better at night and you have a chance at going on to use that data to build something useful.
Here’s a list of places that have data available, provided by my good friend, Geoff Webb.
Some others I think are worth looking at are Wikipedia, Freebase and DBpedia. Freebase pulls its data from Wikipedia on a regular basis, as well as from TVRage, Metacritic, Bloomberg and CorpWatch. DBpedia also pulls data from Wikipedia, as well as YAGO, Wordnet and other sources.