Strategy: Consider When a Service Starts Billing in Your Algorithm Cost

At Monday's Cloud Computing Meetup, Paco Nathan gave an excellent Getting Started on Hadoop talk (slides). I found one of Paco's strategies particularly interesting: consider when a service starts charging in cost calculations. Depending on your use case it may be cheaper to go with a more expensive service that charges only for work accomplished rather than charging for both work + startup time.
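To make the trade-off concrete, here is a minimal sketch of the arithmetic. The `Pricing` class, the `job_cost` function, and every rate and duration below are hypothetical numbers chosen for illustration; they are not taken from Paco's talk or from any real provider's price list, and the sketch ignores details such as rounding up to whole billing hours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical pricing model; all numbers are illustrative only."""
    rate_per_hour: float   # price per node-hour
    bills_startup: bool    # True if the meter runs while the cluster starts up


def job_cost(pricing: Pricing, nodes: int, startup_hours: float, work_hours: float) -> float:
    """Cost of one job: nodes * rate * billed hours."""
    billed_hours = work_hours + (startup_hours if pricing.bills_startup else 0.0)
    return pricing.rate_per_hour * nodes * billed_hours


# Assumed services: a cheaper one that bills startup, a pricier one that does not.
cheap_bills_startup = Pricing(rate_per_hour=0.10, bills_startup=True)
pricier_work_only = Pricing(rate_per_hour=0.12, bills_startup=False)

# Short job: 10 nodes, 15 minutes of startup, 30 minutes of real work.
short_cheap = job_cost(cheap_bills_startup, nodes=10, startup_hours=0.25, work_hours=0.5)
short_pricier = job_cost(pricier_work_only, nodes=10, startup_hours=0.25, work_hours=0.5)

# Long job: same cluster and startup time, 5 hours of real work.
long_cheap = job_cost(cheap_bills_startup, nodes=10, startup_hours=0.25, work_hours=5.0)
long_pricier = job_cost(pricier_work_only, nodes=10, startup_hours=0.25, work_hours=5.0)

print(f"short job: {short_cheap:.2f} vs {short_pricier:.2f}")
print(f"long job:  {long_cheap:.2f} vs {long_pricier:.2f}")
```

With these assumed numbers the pricier service wins on the short job, where startup is a third of the billed time, and loses on the long job, where startup is noise. Where the break-even falls depends entirely on your own ratio of startup time to work time.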

Click to read more ...


Sponsored Post: ezRez, VoltDB and Digg are Hiring


Hot Scalability Links for July 17, 2010

And by hot I also mean temperature. Summer has arrived. It's sizzling here in Silicon Valley. Thank you air conditioning!

  • Scale the web by appointing a Crawler Czar? Tom Foremski has the idea that Google should open up their index so sites wouldn't have to endure the constant pounding by ravenous crawler bots. Don MacAskill of SmugMug estimates that 50% of SmugMug's web server CPU resources are spent serving crawlers. What a waste. How this would all work with real-time feeds, paid feeds (Twitter, movies, ...), etc. is unknown, but does it make sense for all that money to be spent on extracting the same data over and over again?
  • Tweets of Gold:
    • jamesurquhart: Key to applications is architecture. Key for infrastructure supporting archs is configurability. Configurability==features.
    • tjake: People who choose their datastore based on hearsay and not their own evaluation are doomed.
    • b6n: No global lock ever goes unpunished
    • MichaelSurtees: scalability, systems & process feed each other right?
    • jamesgolick: Statements like: "NoSQL database systems are designed for scalability." make me sad.
    • agastiya: Focus on stability and features first, scalability and manageability second, per-unit performance last of all. This is a quote from Jeff Darcy

    Click to read more ...


DynaTrace's Top 10 Performance Problems taken from Zappos, Monster, Thomson and Co

In Top 10 Performance Problems taken from Zappos, Monster, Thomson and Co, DynaTrace has provided a useful compilation of performance problems, with potential solutions, that they found while working with their clients.

  1. Too Many Database Calls - too many database queries per request/transaction.
  2. Synchronized to Death - in a high-load or production environment over-synchronization results in severe performance and scalability problems.
  3. Too chatty on the remoting channels - too many calls cross remoting boundaries, which in the end causes performance and scalability problems.
  4. Wrong usage of O/R-Mappers - incorrect usage of the framework itself too often results in unexpected performance and scalability problems within these frameworks.
  5. Memory Leaks - GC does not prevent memory leaks; it is important to release object references as soon as they are no longer needed.
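
As an illustration of the first item, here is a small sketch of the "one query per row" pattern next to a single joined query. It is not taken from the DynaTrace article: the `orders` and `items` tables, the sample rows, and both function names are hypothetical, and an in-memory SQLite database stands in for a real server where every extra query would also cost a network round trip.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE items (order_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 7), (2, 7), (3, 7);
    INSERT INTO items VALUES (1, 'book'), (2, 'pen'), (2, 'ink'), (3, 'lamp');
""")


def items_query_per_order(conn, customer_id):
    """Chatty version: one query for the orders, then one more per order."""
    queries = 1
    orders = conn.execute(
        "SELECT id FROM orders WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    ).fetchall()
    result = {}
    for (order_id,) in orders:
        rows = conn.execute(
            "SELECT name FROM items WHERE order_id = ? ORDER BY name",
            (order_id,),
        ).fetchall()
        queries += 1
        result[order_id] = [name for (name,) in rows]
    return result, queries


def items_single_join(conn, customer_id):
    """Batched version: one joined query, grouped in application code."""
    rows = conn.execute(
        "SELECT o.id, i.name FROM orders o "
        "LEFT JOIN items i ON i.order_id = o.id "
        "WHERE o.customer_id = ? ORDER BY o.id, i.name",
        (customer_id,),
    ).fetchall()
    result = {}
    for order_id, name in rows:
        result.setdefault(order_id, [])
        if name is not None:  # orders without items yield a NULL name
            result[order_id].append(name)
    return result, 1


chatty_result, chatty_queries = items_query_per_order(conn, 7)
joined_result, joined_queries = items_single_join(conn, 7)
print(chatty_queries, joined_queries)
```

Both functions return the same data, but the chatty version issues one query per order plus one, so its query count grows with the size of the result while the joined version stays at a single query.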

Click to read more ...


Sponsored Post: VoltDB and Digg are Hiring

Who's Hiring?

VoltDB Field/Community Engineer

VoltDB is attracting more and more users every day. If you have a strong technical background in SQL and Linux, are experienced with production database deployments, and have a passion for customers and community, you could be just the person we are looking for.  Are you excited about the prospect of working with users to develop and deploy VoltDB applications, and about helping users participate in the thriving VoltDB community? If so, read on at their job page.

Get Your High Scalability Fix at Digg 

Interested in working on cutting-edge high-scale infrastructure at Digg? We're making a big investment in scaling and have committed to the NoSQL (Not only SQL) path with Cassandra. We're using other open-source infrastructure to help us scale including Hadoop, RabbitMQ, Zookeeper, Thrift, HDFS and Lucene. We're rewriting Digg from the ground up and we need amazing developers to join our world-class team. If you think you are up for the challenge, or you know someone who might be, take a look at our jobs page for more information.


DbShards Part Deux - The Internals

This is a follow up article by Cory Isaacson to the first article on DbShards, Product: dbShards - Share Nothing. Shard Everything, describing some of the details about how DbShards works on the inside.

The dbShards architecture is a true “shared nothing” implementation of Database Sharding. The high-level view of dbShards is shown here:

The above diagram shows how dbShards works for achieving massive database scalability across multiple database servers, using native DBMS engines and our dbShards components. The important components are:

Click to read more ...


Creating Scalable Digital Libraries

Like many other media content providers, libraries and museums are increasingly moving their content onto the Web.  While the move itself is no easy process (with digitization, web development, and training costs), being able to successfully deliver content to a wide audience is an ongoing concern, particularly for large libraries.

Much of the concern is financial, as most libraries do not have the internal budget or outside investors that for-profit businesses enjoy.  Even large university libraries face serious budget constraints that other university departments, such as science and technology, would not.

Creating a scalable infrastructure that can handle multiple simultaneous requests, and distributing a large digital collection over it, requires planning that many librarians have not even imagined.  They must stop thinking in terms of "one-item-per-customer" and start thinking in terms of numerous users accessing the same information simultaneously.

Click to read more ...


So, Why is Twitter Really Not Using Cassandra to Store Tweets?

A firestorm of accusations circled around recently saying that Cassandra, the elected-by-major-adopters emperor of the NoSQL movement, has no clothes. It was said Twitter was dumping Cassandra; Reddit outages were linked to Cassandra; and even Facebook, Cassandra's cradle of birth, was said to have abandoned Cassandra. Shouts of NoSQL Fail! were heard in the streets. Much gloating followed. Is the emperor really naked? Casually dressed maybe, but not naked.

(Note: after this point the article contains a flow chart that is NSFW. Some people are very sensitive about cussing, so if that's you, please go back, don't read on. Danger! There are no nude pictures or anything, just some strong language. But this is my most favorite flow chart of all time, so it's worth it :-)

Is Twitter really abandoning Cassandra?

Click to read more ...


Hot Scalability Links for July 9, 2010

  • Facebook serves 3 billion Like buttons a day says VentureBeat.
  • CloudScaling reports: Rumor Mill: Google EC2 Competitor Coming in 2010? It looks like GAE for PaaS and an EC2 clone for IaaS.
  • Tweets of gold:
    • alandipert: scalability is a drug
    • seldo: Scalability lesson #23: if any part of your system involves a list that gets bigger over time, eventually that list will become too big.
    • obfuscurity: Her: "Go look at the pictures on the database." Me: "You mean our fileserver?" Her: "Whatever."
    • luiscab: Ouch, I just read on an Info Mgmt rag that Hadoop could easily be an acronym for "Heck, Another Darn Obscure Open-source Project."
    • sanity: Depressed about how much time I've had to spend searching for the right database solution for a new project. Each has its flaws
    • ioshints: You cannot take a car, grow it 10 times and expect to get a mining truck. 

Click to read more ...


Cloud AWS Infrastructure vs. Physical Infrastructure

This is a guest post by Frédéric Faure (architect at Ysance) on the differences between using a cloud infrastructure and building your own. Frédéric was kind enough to translate the original French version of this article into English.

I’ve been noticing many questions about the differences inherent in choosing between a Cloud infrastructure such as AWS (Amazon Web Services) and a traditional physical infrastructure. Firstly, there are a certain number of preconceived notions on this subject that I will attempt to decode for you. Then, it must be understood that each infrastructure has its advantages and disadvantages: a Cloud-type infrastructure does not necessarily fulfill your requirements in every case; however, it can satisfy some of them by optimizing or facilitating the features offered by a traditional physical infrastructure. I will therefore demonstrate the differences between the two that I have noticed, in order to help you make up your own mind.

Click to read more ...