Im sure most are familiar with Facebooks 'news feed'. If not, the 'news feed' basically lists recent activity of all of your friends. I dont see how you can get this information efficiently from a DB: * Im assuming all user activity is inserted in a "actions" table. * first get a list of all your friends * then query the actions table to return recent activity where the activity belongs to someone on your friends list This can't be efficient especially considering some people have 200+ friends. So what am I missing? How do you think Facebook is implementing their "news feed". Im not asking for any specific details, just a general point in the right direction, as I cant see how they are implementing the 'news feed efficiently. Thanks.
All the cool kids advocate scaling out as the secret sauce of scaling. And it is, but don't forget to serve some tasty "scaling up" as a side dish. Scaling up doesn't have to mean buying a jet propelled, liquid cooled, 128 core monster super computer. Scaling up can just mean buying at the high end of the commodity buffet by buying more cores, more memory and using a shared nothing architecture to take advantage of all that power without adding complexity. Scale out when you need to, but big beefy boxes can absorb a lot of load before it's necessary to hit up your data center for more rack space. Here are a few examples of scaling out and up:
We’re seeing machines with eight cores and 32G of memory. If we were to buy eight disks for these boxes it’s really like buying 8 machines with 4G each and one disk. This partially goes into the horizontal vs vertical scale discussion. Is it better to buy one $10k box or 10 $1k boxes? I think it’s neither. Buy 4 $2.5k boxes. The new multicore stuff is super cheap.
Scaling out doesn’t mean using crappy hardware. I think people take the “scale out” model (that they’ve often only read about from outdated conference presentations) to quite an extreme. They think scaling out means using desktop-class, bad hardware, and just buying a ton of them. That model doesn’t work, and it’s hell to maintain in the long term. Use commodity hardware. You often hear the term “commodity hardware” in reference to scale out. While crappy hardware is also commodity, what this means is that instead of getting stuck on the low-end $40k machine, with thoughts of upgrading to the $250k machine, and maybe later the $1M machine, you use data partitioning and any number of let’s say $5k machines. That doesn’t mean a $1k single-disk crappy machine as said above. What does it mean for the machine to be “commodity”? It means that the components are standardized, common, and the price is set by the market, not by a single corporation. Use commodity machines configured with a good balance of price vs. performance.
WordPress.com hosts 300 servers in 5 different data centers. It's always useful to learn how large installations manage all their unruly children: Currently we Nagios for server health monitoring, Munin for graphing various server metrics, and a wiki to keep track of all the server hardware specs, IPs, vendor IDs, etc. All of these tools have suited us well up until now, but there have been some scaling issues. The post covers how these different tools are working for them and the comment section has some interesting discussions too.
Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated. Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues. Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of a RDBMS, it's an important technology to understand. From the abstract you can get a feel for what the paper is about:
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.My first impressions after reading the paper:
A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville and like most dreamers he was a little short on coin. So he took refuge in the home of a cheap hosting provider and Beau realized his dream, creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his one feeds safe within the cradle of affordable CPU cycles. Site: http://feedblendr.com/
Mark Maunder of No VC Required--who advocates not taking VC money lest you be turned into a frog instead of the prince (or princess) you were dreaming of--has an excellent slide deck on how to scale an early stage startup. His blog also has some good SEO tips and a very spooky widget showing the geographical location of his readers. Perfect for Halloween! What is Mark's other worldly scaling strategies for startups? Site: http://novcrequired.com/
Am I mad to consider using .Net2 and AJAX for a high-scalability application? In case you wonder why, it's the legacy of a website built on IIS and .Net 1.1, and we're looking for ways to make the content more attractive and interactive. In this case, it's a medical image library being shared by a few Wikis and online coursework for medical students ( < 15K users) and doctors ( < 150K users) But I'm worried about the performance overhead. We already have a performance problem because of personalising the content for users according to their type (student or doctor), and for doctors, their grade and speciality.
Automattic recently purchase Gravatar and have switched the server onto their hosting platform. WordPress.com host over 1.7 million blogs with well over 60'000 new posts submitted each day generating 10 - 12 million page views per day. Barry on WordPress.com has a great post on the changes they've introduced to help Gravatar scale.
Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.