This post I provided a summary of recent discussions outlining the main challenges that developers face today when deploying their existing JEE application to the cloud such as complexity, database integration, security, standard JEE support etc. In this post i also provided summary of how we managed to handle those challenges with our new Cloud Computing Framework by pointing to an existing production reference of a leading Telco provider.
For those interested in building scalable systems, today I will speak about the Facebook Char architecture. Starting keynote:
''When your feature’s userbase will go from 0 to 70 million practically overnight, scalability has to be baked in from the start.''Eugene Lutuchy, lead engineer on Facebook Chat
Facebook's engg. director aditya talks about facebook architecture. How they use mysql, php and memcache. How they have modified the above to suit their requirements.
I'm seeking for a design pattern or advice or directions. I need to count views/downloads of a set of resources, let them to be identified by their respective URLs. This is not a big problem. I also need to keep a list of viewed/downloaded resources in the last X days. This list needs to be updated every now and then to reflect real last X days of usage. So resources that were requested prior to X days get evicted from it. So it's sort of a black box, you feed messages (download request) in and it gives you that list of URLs with counters on the other end. How would you go about designing it?
Hibernate and iBATIS and other similar tools have documentation with recommendations for avoiding the "N+1 select" problem. The problem being that if you wanted to retrieve a set of widgets from a table, one query would be used to to retrieve all the ids of the matching widgets (select widget_id from widget where ...) and then for each id, another select is used to retrieve the details of that widget (select * from widget where widget_id = ?). If you have 100 widgets, it requires 101 queries to get the details of them all. I can see why this is bad, but what if you're doing entity caching? i.e. If you run the first query to get your list of ids, and then for each widget you retrive it from the cache. Surely in that case, N+1(+caching) is good? Assuming of course that there is a high probability of all of the matching entities being in the cache. I may be asking a daft question here - one whose answer is obviously implied by the large scalable mechanisms for storing data that are in use these days.
Learned lessons from the largest player (Flickr, YouTube, Google, etc) I would like to write today about some learned lessons from the biggest player in the high Scalable Web application. I will divide the lessons into 4 points: * Start slow, and small, and measuring the right thing. * Vertical Scalability vs. Horizontal Scalability. * Every problem has its own solution. * General learned lesson Read more
Lessons learned from OpenX's large-scale deployment to Amazon EC2:
I had posted a note the other day about collectl and its ganglia interface but perhaps I wasn't provocative enough to get any responses so let me ask it a different way, specifically how do people monitor their clusters and more importantly how often? Do you monitor to get a general sense of what the system is doing OR do you monitor with the expectation that when something goes wrong you'll have enough data to diagnose the problem? Or both? I suspect both... Many cluster-based monitoring tools tend to have a data collection daemon running on each target node which periodically sends data to some central management station. That machine typically writes the data to some database from which it can then extract historical plots. Some even put up graphics in real-time. From my experience working with large clusters - and I'm talking either many hundreds or even 1000s of nodes, most have to limit both the amount of data they manage centrally as well as the frequency that they collect it, otherwise they'll overwhelm their management station because most DBs can't write hundreds of counters many times/minute from thousands of nodes. As a related example, how many of you run sar at the default monitoring interval of 10 minutes? Do you really think you're getting useful information? What happens if you have a 2 minutes burst of 100% network load and you're idle the other 8 minutes? Sar will happily tell you the network load was 20% and you'll never know your network is tanking. The point of all this is I do think there's a place for central monitoring, though I'm personally not a fan because of the inaccuracy of infrequent data samples, but I also appreciate some data is better than none, as long as you realize the inherent accuracy problems. And that's where collectl comes in and my previous comment about ganglia. When I wrote collectl my overarching design goal, from which I haven't wavered, was to provide highly accurate local data with minimal overhead so you will take samples in the 1-10 second range without fear of impacting the rest of the system. You can literally sample just about everything going on every 10 seconds and use <0.1% of the CPU. If you're willing to give up a few more tenths of a percent you can even monitor processes and slab activity, though you should only sample them at a 60 second frequency because it IS expensive to monitor them. However I also realize this doesn't do any good if do have 1Ks of machine you want to watch and so that's where the socket interface comes in over which collectl can send data to a central manager at that same frequency OR if you prefer have is send its remote data at a different rate, giving you the best of both worlds. Collectl can provide it's data to a central management station while at the same time providing local logging for accuracy, which will let you do a deep dive into the data if a problem does arrive for which there is not enough data stored centrally. My point about the ganglia interface was my response to the fact that a lot of of people running large (as well as smaller) clusters do use ganglia but like most central monitoring stations have to give up the accuracy of finer-grained data and I was just wondering if anyone looking at this forum use ganglia and if they might be interested in trying out the collectl interface to it. -mark