Monday
Oct192009

Drupal's Scalability Makeover - You give up some control and you get back scalability

Drupal 7 is having a scalability makeover. Karoly Negyesi, Drupal Core Developer and Public Development Team Lead, explains the process in this video: Drupal 7 APIs, scalability mindset. Karoly states the general theme of the changes as: You give up some control and you get back scalability. An interesting comment on the politics of scalability?

Makeover may not be quite the right word though. A makeover implies a cosmetic change, looking better by changing the surface. Drupal's changes will go deeper than that, right to Drupal's core. It's a genuine and authentic change that will hopefully allow one of the Internet's most venerable Content Management Systems (CMSs) to compete with a constant stream of younger and sexier models.

Drupal is based on an older LAMP stack approach where PHP modules are scooped up and merged together each time a request is made to Drupal. Drupal's most intriguing idea is how it is built, expands, and changes by weaving together a single system out of individual components called modules. Built-in modules include comments, RSS, contact forms, forums, and Clean URLs. Add-on modules include CSE, which adds Google's Custom Search Engine, as well as modules for AdSense, CAPTCHA, and Sitemaps. Drupal establishes AOP extension points that allow modules to work remarkably well together, creating a site that feels like one single site even though it has been constructed from dozens of modules hunted and gathered from all over the digital world.

The problem is that the PHP code can directly access the database and directly render to the UI; very little layering is required. Part of Drupal's amazing configurability and extensibility has been how easy it is for everything to work together by changing the database. But when there's no layering it's almost impossible to optimize the system. If you have 20 different modules, each can make its own separate call to the database, so a single request triggers 20 calls when what we really want is one. And because of the direct SQL access, when the number of writes increases there's no systematic way to distribute the writes across multiple servers. So as Drupal sites grow in the number of modules and the number of users, both performance and scalability tank.
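To make the "20 calls versus one call" point concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not Drupal code: the `node` table and the function names `load_per_item`, `load_batched`, and `demo` are hypothetical, chosen only to show how a shared data-access layer can collapse many round trips into one.

```python
import sqlite3


def load_per_item(conn, node_ids):
    """One query per id: what happens when every caller talks to the database directly."""
    titles = {}
    queries = 0
    for node_id in node_ids:
        row = conn.execute(
            "SELECT title FROM node WHERE id = ?", (node_id,)
        ).fetchone()
        queries += 1
        if row:
            titles[node_id] = row[0]
    return titles, queries


def load_batched(conn, node_ids):
    """One query for all ids: possible only when callers go through a shared API."""
    if not node_ids:
        return {}, 0
    placeholders = ",".join("?" for _ in node_ids)
    rows = conn.execute(
        f"SELECT id, title FROM node WHERE id IN ({placeholders})",
        list(node_ids),
    ).fetchall()
    return dict(rows), 1


def demo():
    # Hypothetical schema, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(i, f"node {i}") for i in range(1, 21)],
    )
    ids = list(range(1, 21))
    return load_per_item(conn, ids), load_batched(conn, ids)
```

Both functions return the same data; the difference is the number of round trips, which is the kind of leverage point a layered API creates.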

The younger models architect their systems differently. Sites like Google, Amazon, and Facebook are written in terms of an API and a framework, a service-based approach. Using a service-based approach, the web tier can be programmed in terms of services that are themselves scalable, so the entire system is scalable. When the API is skipped there are no leverage points that can be made to scale. It becomes a big ball of mud.

More layering and more APIs is exactly the direction Drupal is taking. Exactly how is Drupal changing?

Click to read more ...

Friday
Oct162009

Paper: Scaling Online Social Networks without Pains

We saw in "Why are Facebook, Digg, and Twitter so hard to scale?" that scaling social networks is a lot harder than you might think. This paper, Scaling Online Social Networks without Pains, from a team at Telefonica Research in Spain hopes to meet the challenge of status distribution, user generated content distribution, and managing the social graph through a technique they call One-Hop Replication (OHR). OHR abstracts and delegates the complexity of scaling up from the social network application. The abstract:
Online Social Networks (OSN) face serious scalability challenges due to their rapid growth and popularity. To address this issue we present a novel approach to scale up OSN called One Hop Replication (OHR). Our system combines partitioning and replication in a middleware to transparently scale up a centralized OSN design, and therefore, avoid the OSN application to undergo the costly transition to a fully distributed system to meet its scalability needs. OHR exploits some of the structural characteristics of Social Networks: 1) most of the information is one-hop away, and 2) the topology of the network of connections among people displays a strong community structure. We evaluate our system and its potential benefits and overheads using data from real OSNs: Twitter and Orkut. We show that OHR has the potential to provide out-of-the-box transparent scalability while maintaining the replication overhead costs in check.
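The idea in the abstract can be illustrated with a small Python sketch. This is not the paper's algorithm or middleware; it only shows the one-hop principle under the assumption that users have already been partitioned across servers. Each server masters its own users and additionally holds a replica of every friend who lives on a different server, so all one-hop data is local. The function names and the toy graph in the usage note are hypothetical.

```python
from collections import defaultdict


def one_hop_replication(edges, partition):
    """Compute masters and one-hop replicas per server.

    edges: iterable of (user_a, user_b) undirected friendships.
    partition: dict mapping each user to the server that masters it.
    """
    masters = defaultdict(set)
    replicas = defaultdict(set)
    for user, server in partition.items():
        masters[server].add(user)
    for a, b in edges:
        # Consider the edge from both endpoints.
        for user, friend in ((a, b), (b, a)):
            server = partition[user]
            if partition[friend] != server:
                # The friend lives elsewhere, so this server needs a copy.
                replicas[server].add(friend)
    return dict(masters), dict(replicas)


def replication_overhead(masters, replicas):
    """Replica copies per master copy; 0.0 means no replication was needed."""
    total_masters = sum(len(users) for users in masters.values())
    total_replicas = sum(len(users) for users in replicas.values())
    return total_replicas / total_masters if total_masters else 0.0
```

With two tight communities, say `a, b, c` on one server and `d, e, f` on another, joined by the single friendship `c`-`d`, only `c` and `d` need to be replicated. That is the intuition behind the abstract's claim that strong community structure keeps replication overhead in check.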
Thursday
Oct152009

Hot Scalability Links for Oct 15 2009 

Update: Social networks in the database: using a graph database. Anders Nawroth puts graphs through their paces by representing, traversing, and performing other common social network operations using a graph database.

Update: Deployment with Capistrano by Charles Max Wood. Simple step-by-step for using Capistrano for deployment.

Log-structured file systems: There's one in every SSD by Valerie Aurora. SSDs have totally changed the performance characteristics of storage! Disks are dead! Long live flash!

An Engineer's Guide to Bandwidth by DGentry. It's a rough world out there, and we need to do a better job of thinking about and testing under realistic network conditions.

Analyzing air traffic performance with InfoBright and MonetDB by Vadim of the MySQL Performance Blog.

Scalable Delivery of Stream Query Result by Zhou, Y ; Salehi, A ; Aberer, K. In this paper, we leverage Distributed Publish/Subscribe System (DPSS), a scalable data dissemination infrastructure, for efficient stream query result delivery.

Tuesday
Oct132009

Why are Facebook, Digg, and Twitter so hard to scale?

Real-time social graphs (connectivity between people, places, and things). That's why scaling Facebook is hard says Jeff Rothschild, Vice President of Technology at Facebook. Social networking sites like Facebook, Digg, and Twitter are simply harder than traditional websites to scale. Why is that? Why would social networking sites be any more difficult to scale than traditional web sites? Let's find out.

Traditional websites are easier to scale than social networking sites for two reasons:

Click to read more ...

Monday
Oct122009

High Performance at Massive Scale – Lessons learned at Facebook

Jeff Rothschild, Vice President of Technology at Facebook gave a great presentation at UC San Diego on our favorite subject: "High Performance at Massive Scale –  Lessons learned at Facebook". The abstract for the talk is:

Facebook has grown into one of the largest sites on the Internet today serving over 200 billion pages per month. The nature of social data makes engineering a site for this level of scale a particularly challenging proposition. In this presentation, I will discuss the aspects of social data that present challenges for scalability and will describe the core architectural components and design principles that Facebook has used to address these challenges. In addition, I will discuss emerging technologies that offer new opportunities for building cost-effective high performance web architectures.

There's a lot that's interesting about this talk that we'll get into later, but I thought you might want a head start on learning how Facebook handles 30K+ machines, 300 million active users, 20 billion photos, and 25TB per day of logging data.

Click to read more ...

Friday
Oct092009

Have you collectl'd yet? If not, maybe collectl-utils will make it easier to do so

I'm not sure how many people who follow this have even tried collectl, but I wanted to let you all know that I just released a set of utilities called, strangely enough, collectl-utils, which you can get at http://collectl-utils.sourceforge.net. One web-based utility called colplot lets you plot data from multiple systems in a way that makes correlating them over time very easy.

Click to read more ...

Thursday
Oct082009

Riak - web-shaped data storage system

Update: Short presentation in NYC by Bryan Fink demonstrating the Riak web-shaped data storage engine.

Riak is another new and interesting key-value store entrant. Some of the features it offers are:

  • Document-oriented
  • Scalable, decentralized key-value store
  • Standard get, put, and delete operations.
  • Distributed, fault-tolerant storage solution.
  • Configurable levels of consistency, availability, and partition tolerance
  • Support for Erlang, Ruby, PHP, Javascript, Java, Python, HTTP
  • Open source and NoSQL
  • Pluggable backends
  • Eventing system
  • Monitoring
  • Inter-cluster replication
  • Links between records that can be traversed.
  • Map/Reduce. Functions are executed on the data node. One interesting difference is that a list of keys is required to specify which values are operated on, as opposed to running calculations on all values.
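The last point, map/reduce over an explicit list of keys rather than over every value, can be sketched in a few lines of Python. This is not Riak's API: the plain dict standing in for the store and the `map_reduce` function are hypothetical, and everything runs in one process rather than on data nodes.

```python
from functools import reduce


def map_reduce(store, keys, map_fn, reduce_fn, initial):
    """Run map_fn over only the listed keys, then fold the results with reduce_fn.

    store: a dict standing in for the key-value store (an assumption).
    keys: the explicit list of keys to operate on; other values are never touched.
    """
    mapped = (map_fn(key, store[key]) for key in keys if key in store)
    return reduce(reduce_fn, mapped, initial)


# Usage: sum the "views" field for two chosen documents only.
documents = {
    "page:1": {"views": 3},
    "page:2": {"views": 5},
    "page:3": {"views": 100},
}
total = map_reduce(
    documents,
    ["page:1", "page:2"],
    lambda key, doc: doc["views"],
    lambda acc, views: acc + views,
    0,
)
```

The design trade-off is that the caller must already know which keys matter, in exchange for never scanning the whole data set.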

Related Articles

  • Hacker News Thread. More juicy details on how Riak compares to Cassandra, mongodb, couchdb, etc. 

 

Wednesday
Oct072009

How to Avoid the Top 5 Scale-Out Pitfalls

Scale-Out is incrementally adding servers as needed to scale rather than buying larger servers. Here's the MySQL idea of what a scale-out architecture looks like:


This MySQL article lists 5 problems to avoid when scaling out:
  1. Don't Think Synchronously. Introduce asynchronous communication, parallelization, and strategies to deal with approximate or slightly outdated data.
  2. Don't Think Vertically. Scaling by buying bigger machines won't work. Plan on horizontal scaling and asynchronous architectures from the start, which makes it easy to add capacity on demand.
  3. Don't Mix Transactions with Business Intelligence. Transactions and analytics are inherently different. Separate out different types of data onto different databases.
  4. Avoid Mixing Hot and Cold Data. Static and fast changing data are inherently different. Separate out different types of data onto different databases.
  5. Don't Forget the Power of Memory.  Make data accessible in RAM by smartly partitioning data across servers.
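As a rough illustration of point 5, the following Python sketch partitions keys across several servers by hashing so that each server only has to hold its own slice in RAM. It is not taken from the MySQL article: the `PartitionedCache` class and the server names are hypothetical, and the per-server dicts stand in for real machines.

```python
import hashlib


class PartitionedCache:
    """Toy hash-partitioned in-memory store; each dict stands in for one server's RAM."""

    def __init__(self, server_names):
        if not server_names:
            raise ValueError("at least one server is required")
        self.server_names = list(server_names)
        self.memory = {name: {} for name in self.server_names}

    def server_for(self, key):
        # A stable hash, so the same key always routes to the same server.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.server_names)
        return self.server_names[index]

    def put(self, key, value):
        self.memory[self.server_for(key)][key] = value

    def get(self, key, default=None):
        return self.memory[self.server_for(key)].get(key, default)
```

Simple modulo hashing remaps most keys when a server is added, so a system that grows on demand would more likely use consistent hashing or a lookup directory; the sketch only shows the basic routing idea.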

More information at Scale-Out & Replication Best Practices for High-Growth Businesses.

Tuesday
Oct062009

Building a Unique Data Warehouse

There are many reasons to roll your own data storage solution on top of existing technologies. We've seen stories on HighScalability about custom databases for very large sets of individual data (like Twitter) and large amounts of binary data (like Facebook pictures). However, I recently ran into a unique type of problem. I was tasked with recording and storing bandwidth information for more than 20,000 servers and their associated networking equipment. This data needed to be accessed in real time, with less than a 5 minute delay between the data being recorded and the data showing up on customer bandwidth graphs on our customer portal.

After numerous false starts with off the shelf components and existing database clustering technology, we decided we must roll our own system. The real key to our problem (literally) was the ratio of the size of the key to the size of the actual data. Because the tracked metric was so small (a 64-bit counter) compared to the unique identifier (32-bit network component ID, 32-bit timestamp, 16-bit data type identifier) existing database technologies would choke on the key sizes.
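The key-to-data ratio is easy to check with Python's `struct` module. The field widths below come from the paragraph above; the field order, byte order, and the names `KEY_FORMAT`, `VALUE_FORMAT`, and `pack_sample` are assumptions made for this sketch, not the author's actual storage format.

```python
import struct

# ">" selects big-endian with no padding between fields.
KEY_FORMAT = ">IIH"   # 32-bit component ID, 32-bit timestamp, 16-bit data type
VALUE_FORMAT = ">Q"   # 64-bit counter

KEY_SIZE = struct.calcsize(KEY_FORMAT)
VALUE_SIZE = struct.calcsize(VALUE_FORMAT)


def pack_sample(component_id, timestamp, data_type, counter):
    """Pack one bandwidth sample into (key_bytes, value_bytes)."""
    key = struct.pack(KEY_FORMAT, component_id, timestamp, data_type)
    value = struct.pack(VALUE_FORMAT, counter)
    return key, value


def unpack_sample(key, value):
    """Inverse of pack_sample."""
    component_id, timestamp, data_type = struct.unpack(KEY_FORMAT, key)
    (counter,) = struct.unpack(VALUE_FORMAT, value)
    return component_id, timestamp, data_type, counter
```

Under these assumptions each row carries a 10-byte key for an 8-byte value, so the identifier is larger than the metric it identifies, before any index overhead is counted.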

Eventually it was decided that the best solution was to write our own wrapper for standard MySQL databases. No fancy features, no clustering, no merge tables or partitioning, no extra indexes, just hundreds of thousands of flat tables on as many physical machines as was necessary. I chronicled the whole decision making process in the full article, located here, on our developers' blog.

Tuesday
Oct062009

10 Ways to Take your Site from One to One Million Users by Kevin Rose  

At the Future of Web Apps conference Kevin Rose (Digg, Pownce, Wefollow) gave a cool presentation on the top 10 down and dirty ways you can grow your web app. He took the questions he's most often asked and turned them into a very informative talk.

This isn't the typical kind of scalability we cover on this site. There aren't any infrastructure and operations tips. But the reason we care about scalability is to support users and Kevin has a lot of good techniques to help your user base bloom.

Here's a summary of the 10 ways to grow your consumer web application:

1. Ego. Ask: does this feature increase the user's self-worth or stroke their ego? What emotional and visible rewards will a user receive for contributing to your site? Are they gaining reputation or badges, or getting to showcase what they've done in the community? Sites that have done it well:

Twitter.com followers. Followers turn every single celebrity into a spokesperson for your service. Celebrities continually pimp your service in the hopes of getting more followers. It's an amazing self-reinforcing traffic generator. Why do followers work? Twitter communication is one way. It's simple. Followers don't have to be approved and there aren't complicated permission schemes about who can see what. It means something for people to increase their follower count. It becomes a contest to see who can have more. So even spam followers are valuable to users as they help them win the game.

Digg.com leader boards. Leader Boards show the score for a user activity. In digg it was based on the number of articles submitted. Encourage people to have a competition and do work inside the digg ecosystem. Everyone wants to see their name in lights. 

Digg.com highlight users. Users who submitted stories were rewarded by having their name in a larger font and a friending icon put beside their story submission. Users liked this.

2. Simplicity. Simplicity is the key. A lot of people overbuild features. Don't overbuild features. Release something and see what users are going to do. Pick 2-3 things on your site and do them extremely well. Focus on those 2-3 things. Always ask if there's anything you can take out of a feature. Make it lighter, cleaner, and easy to understand and use.

3. Build and Release. Stop thinking you understand your users. You think users will love this or that and you'll probably be wrong. So don't spend 6 months building features users may not love or will only use 20% of. Learn from what users actually do on your site. Avoid analysis paralysis, especially as you get larger. Decide, build, release, get feedback, iterate.

4. Hack the Press. There are techniques you can use that will get you more publicity.

Invite-only system. Get press by creating an invite-only system. Have a limited number of invites and seed them with bloggers. Get the buzz going. Give each user a limited number of invites (4 or 5). It gets bloggers talking about your service. The mainstream press calls and you say you are not ready. This amps the hype cycle. Make new features login-only, accessible only if you log in, but make them visible and marked beta on the site. This increases the number of registered users.

Talk to junior bloggers. On Tech Crunch, for example, find the most junior blogger and pitch them. It's more likely you'll get covered.

Attend parties for events you can't afford.  You can go to the after parties for events you can't afford. Figure out who you want to talk to. Follow their twitter accounts and see where they are going. 

Have a demo in hand. People won't understand your great vision without a demo. Bring an iPhone or laptop to showcase the demo. Keep the demo short, 30-60 seconds. Say: Hey, I just need 30 seconds of your time, it's really cool, and here's why I think you'll like it. Slant it towards what they do or what they cover.

5. Connect with your community.

Start a podcast. A big driver in the early days of Digg. Influencers will listen and they are the heart of your ecosystem. 

Throw a launch party, then yearly and quarterly events. Personally invite influencers and their friends. Just have a party at a bar. Schedule the events around conferences, since people are already there.

Engage and interact with your community.

Don't visually punish users. Often users don't understand bad behaviour yet, as they think they are just playing the game your system sets up. Walk through the positive behaviours you want to reinforce on the site.

6. Advisors. Have a strong group of advisors. Think about which technical, marketing and other problems you'll have and seek out people to help you. Give them stock compensation. A strong advisory team helps with VCs.

7. Leverage your user base to spread the word.

FarmVille. Tells users when other players have helped them and asks the player to repay the favor. This gets players back into the system by using a social obligation hack. They also require having a certain number of friends before you can expand your farm. They give away rare prizes.

Wefollow. Tweets hashtags when people follow someone else. This further publicizes the system. They also ask new users when they hit the system if they want to be added to the directory, telling the user that X hundred thousand of your closest friends have already added themselves. This is the number one way they get new users.

8. Provide value for third party sites. The Wall Street Journal, for example, puts FriendFeed, Twitter, and similar links on every page because they think it adds value to their site. Is there some way you can provide value like that?

9. Analyze your traffic. Install Google Analytics. See where people are entering from, where they are going, where they are exiting from, and how you can improve those pages.

10. The entire picture. Step back and look at the entire picture. Look at users who are creating quality content. Quality content drives more traffic to your site. Traffic going out of your site encourages other sites to add your buttons to their pages, which brings more users and more traffic into your site. It's a circle of life. Look at how your whole ecosystem is doing.

Related Articles