This presentation illustrates how one can scale EXISTING JEE application and deploy it on Amazon cloud using GigaSpaces as the scale-out application server while: * Not having to re-write your application * Preventing lock-in to specific cloud provider * Enabling seamless portability between your local environment to cloud environment o No code or configuration change is required between the two environments o Develop local - test on the cloud o Built for iterative development
Here's a short list of some great resources that I've found very inspirational and thought provoking. I've broken these resources up into two lists: Blogs and Presentations.
The upshot of the paper is Oracle rules and MySQL sucks for sharding. Which is technically probable, if you don't throw in minor points like cost and ease of use. The points where they think Oracle wins: online schema changes, more robust replication, higher availability, better corruption handling, better use of large RAM and multiple cores, better and better tested partitioning features, better monitoring, and better gas mileage.
I have two servers connected via Internet (NOT IN THE SAME LAN) serving the same website (http://www.ourexample.com).The problem is files uploaded on serverA and serverB cannot see each other immediately,thus rsync with certain intervals is not a good solution. Can anybody give me some advice on the following options? 1.NFS over Internet for file sharing 2.sshfs 3.inotify(our system's kernel does not support this and we donot want to risk upgrading our kernel as well) 4.drbd in active-active mode 5 or any other solutions Any suggestions will be welcomed. Thank you in advance.
Datacenterknowledge.com: In an effort to boost its refocused cloud computing initiative, Sun Microsystems (JAVA) has acquired Q-layer, a Belgian provider that automates the deployment of both public and private clouds. Sun says Q-layer’s technology will help users instantly provision servers, storage, bandwidth and applications. Do you have experience with Q-layers technology like its Virtual Private DataCenter and NephOS?
It seems that HTTP calls have become a default way to think about distributed systems. HTTP and Web services definitely have a lot to offer, but they are not the only way to do things and there are definitely cases where web is not the right choice. Unfortunately, lots of people just stick with web services and hack on, trying to fit a square peg in a round hole. In cases such as these, a different distribution paradigm can save us quite a lot of time and effort both in development and later in maintenance. One of those different paradigms is messaging.
How do we debug and profile a cloud full of processors and threads? It's a problem more will be seeing as we code big scary programs that run on even bigger scarier clouds. Logging gets you far, but sometimes finding the root cause of problem requires delving deep into a program's execution. I don't know about you, but setting up 200,000+ gdb instances doesn't sound all that appealing. Tools like STAT (Stack Trace Analysis Tool) are being developed to help with this huge task. STAT "gathers and merges stack traces from a parallel application’s processes." So STAT isn't a low level debugger, but it will help you find the needle in a million haystacks. Abstract:
Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
Lessons LearnedAt the end of the paper they identify several insights they had about developing petascale tools:
While working with Memcache the other night, it dawned on me that it’s usage as a distributed caching mechanism was really just one of many ways to use it. That there are in fact many alternative usages that one could find for Memcache if they could just realize what Memcache really is at its core – a simple distributed hash-table – is an important point worthy of further discussion. To be clear, when I say “simple”, by no means am I implying that Memcache’s implementation is simple, just that the ideas behind it are such. Think about that for a minute. What else could we use a simple distributed hash-table for, besides caching? How about using it as an alternative to the traditional shard lookup method we used in our Master Index Lookup scalability strategy, discussed previously here.
Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster. From the abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day. Thanks to Kevin Burton for linking to the complete article.
In article Building Super-Scalable Web Systems with REST Udi Dahan tells an interesting story of how they made a weather reporting system scale for over 10 million users. So many users hitting their weather database didn't scale. Caching in a straightforward way wouldn't work because weather is obviously local. Caching all local reports would bring the entire database into memory, which would work for some companies, but wasn't cost efficient for them. So in typical REST fashion they turned locations into URIs. For example: http://weather.myclient.com/UK/London. This allows the weather information to be cached by intermediaries instead of hitting their servers. Hopefully for each location their servers will be hit a few times and then the caches will be hit until expiry. In order to send users directly to the correct location an IP location check is performed on login and stored in a cookie. The lookup is done once and from then on out a GET is performed directly on the resource. There's no need to hit their servers and do a lookup on the user to get the location. That's all bypassed. I like Udi's summary of the approach and is why I think this is a good strategy : This isn’t a “cheap trick”. While being straight forward for something like weather, understanding the nature of your data and intelligently mapping that to a URI space is critical to building a scalable system, and reaping the benefits of REST.