advertise
Wednesday
Feb052014

Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do?

This is a guest repost by Ron Pressler, the founder and CEO of Parallel Universe, a Y Combinator company building advanced middleware for real-time applications.

Little’s Law helps us determine the maximum request rate a server can handle. When we apply it, we find that the dominating factor limiting a server’s capacity is not the hardware but the OS. Should we buy more hardware if software is the problem? If not, how can we remove that software limitation in a way that does not make the code much harder to write and understand?

Many modern web applications are composed of multiple (often many) HTTP services (this is often called a micro-service architecture). This architecture has many advantages in terms of code reuse and maintainability, scalability and fault tolerance. In this post I’d like to examine one particular bottleneck in the approach, which hinders scalability as well as fault tolerance, and various ways to deal with it (I am using the term “scalability” very loosely in this post to refer to software’s ability to extract the most performance out of the available resources). We will begin with a trivial example, analyze its problems, and explore solutions offered by various languages, frameworks and libraries.

Our Little Service

Let’s suppose we have an HTTP service accessed directly by the client (say, web browser or mobile app), which calls various other HTTP services to complete its task. This is how such code might look in Java:

Tuesday
Feb042014

Sponsored Post: Logentries, Booking, Apple, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7  

Who's Hiring?

  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.
    • Senior Server Side Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies.  You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data, including: Develop scalable, robust systems that will gather, process, store large amount of data Define/develop Big Data technologies for Apple internal and customer facing applications. Please apply here.
    • Senior Server Side Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies.  You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data, including: Develop scalable, robust systems that will gather, process, store large amount of data Define/develop Big Data technologies for Apple internal and customer facing applications. Please apply here.
    • Senior Engineer: Emerging Technology. Apple’s Emerging Technology group is looking for a senior engineer passionate about exploring emerging technologies to create paradigm shifting cloud based solutions. Please apply here
    • Sr Software Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies. You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data. Please apply here.
    • C++ Senior Developer and Architect- Maps. The Maps Team is looking for a senior developer and architect to support and grow some of the core backend services that support Apple Map's Front End Services. Please apply here.  

  • We need awesome people @ Booking.com - We want YOU! Come design next
    generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html

  • UI EngineerAppDynamics, founded in 2008 and lead by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop our their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big DataAppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for a Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for backend component of software that manages application architectures. Apply here.

Fun and Informative Events

  • Aerospike Webinar: “Getting the Most out of Your Flash/SSDs”. Tune in to Aerospike's latest webinar, “Getting the Most Out of your Flash/SSDs” at 10am PST Tuesday, Feb. 18 to learn how to select, test and prepare your drives for maximum database performance. Register now. 

Cool Products and Services

  • Log management made easy with Logentries Billions of log events analyzed every day to unlock insights from the log data the matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events. 

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • MongoDB Backup Free Usage Tier Announced. We're pleased to introduce the free usage tier to MongoDB Management Service (MMS). MMS Backup provides point-in-time recovery for replica sets and consistent snapshots for sharded systems with minimal performance impact. Start backing up today at mms.mongodb.com.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Click to read more ...

Monday
Feb032014

How Google Backs Up the Internet Along With Exabytes of Other Data

Raymond Blum leads a team of Site Reliability Engineers charged with keeping Google's data secret and keeping it safe. Of course Google would never say how much data this actually is, but from comments it seems that it is not yet a yottabyte, but is many exabytes in size. GMail alone is approaching low exabytes of data.

Mr. Blum, in the video How Google Backs Up the Internet, explained common backup strategies don’t work for Google for a very googly sounding reason: typically they scale effort with capacity. If backing up twice as much data requires twice as much stuff to do it, where stuff is time, energy, space, etc., it won’t work, it doesn’t scale.  You have to find efficiencies so that capacity can scale faster than the effort needed to support that capacity. A different plan is needed when making the jump from backing up one exabyte to backing up two exabytes. And the talk is largely about how Google makes that happen.

Some major themes of the talk:

  • No data loss, ever. Even the infamous GMail outage did not lose data, but the story is more complicated than just a lot of tape backup. Data was retrieved from across the stack, which requires engineering at every level, including the human.

  • Backups are useless. It’s the restore you care about. It’s a restore system not a backup system. Backups are a tax you pay for the luxury of a restore. Shift work to backups and make them as complicated as needed to make restores so simple a cat could do it.

  • You can’t scale linearly. You can’t have 100 times as much data require 100 times the people or machine resources. Look for force multipliers. Automation is the major way of improving utilization and efficiency.

  • Redundancy in everything. Google stuff fails all the time. It’s crap. The same way cells in our body die. Google doesn’t dream that things don’t die. It plans for it.

  • Diversity in everything. If you are worried about site locality put data in multiple sites. If you are worried about user error have levels of isolation from user interaction. If you want production from a software bug put it on different software. Store stuff on different vendor gear to reduce large vendor bug effects.

  • Take humans out of the loop. How many copies of an email are kept by GMail? It’s not something a human should care about. Some parameters are configured by GMail and the system take care of it. This is a constant theme. High level policies are set and systems make it so. Only bother a human if something outside the norm occurs.

  • Prove it. If you don’t try it it doesn’t work. Backups and restores are continually tested to verify they work.

There’s a lot to learn here for any organization, big or small. Mr. Blum’s talk is entertaining, informative, and well worth watching. He does really seem to love the challenge of his job.

Here’s my gloss on this very interesting talk where we learn many secrets from inside the beast:

Click to read more ...

Friday
Jan312014

Stuff The Internet Says On Scalability For January 31st, 2014

Hey, it's HighScalability time:


Largest battle ever on Eve Online. 2,000 players. $200K in damage. Awesome pics.

 

  • teaspoon of soil: hosts up to a billion bacteria spread among a million species.

  • Quotable Quotes: 
    • Vivek Prakash: The problem of scaling always takes a toll on you.
    • @jcsalterego: See This One Weird Trick Hypervisors Don't Want You To Know

  • Upgrades are the great killer of software systems. Do you really want a pill that would supply materials with instructions for nanobots to form new neurons and place them near existing cells to be replaced so you have a new brain within six months? Scary as hell. But there's an nanoapp for that.

  • Ted Nelson has a fascinating series of Computers for Cynics vidcasts on YouTube. I'd ony really known of Mr. Nelson from his writings on hypertext, but he has a broad and penetrating insight into the early days of the computer industry. He's not really cynical, but I've always had a hard time differentiation realism from cynicism. Suffice it to say he thinks there have been many wrong paths chosen by our industry and not everything is as told by various industry hagiographies. You may like: The Nightmare of Files and Directories, The Database Mess, The Dance of Apple and Microsoft, and The Real Story of the World Wide Web.

  • From small beginnings. Where it all started: "the internet" in 1969: His idea for the project was the "spirit of community" and was interested in "having computers help people communicate with other people" (Licklider, Licklider, and Robert Taylor) as opposed to using the computer to communicate for us.... By the end of 1969, ARPANET was able to connect to four locations: UCLA, UC Santa Barbara, SRI, and Utah.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...

Wednesday
Jan292014

10 Things Bitly Should Have Monitored

Monitor, monitor, monitor. That's the advice every startup gives once they reach a certain size. But can you ever monitor enough? If you are Bitly and everyone will complain when you are down, probably not.

Here are 10 Things We Forgot to Monitor from Bitly, along with good stories and copious amounts of code snippets. Well worth reading, especially after you've already started monitoring the lower hanging fruit.

An interesting revelation from the article is that:

We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2.  

  1. Fork Rate. A strange configuration issue caused processes to be created at a rate of several hundred a second rather than the expected 1-10/second. 
  2. Flow control packets.  A network configuration that honors flow control packets and isn’t configured to disable them, can temporarily cause dropped traffic.
  3. Swap In/Out Rate. Measure the right thing. It's the rate memory is swapped in/out that can impact performance, not the quantity. 
  4. Server Boot Notification. Use an init script to capture when servers are dying. Servers do die, but are they dying too often? 
  5. NTP Clock Offset. If you are not checking one of you servers is probably not properly time synced. 
  6. DNS Resolutions. This is a key part of your infrastructure that often goes unchecked. It can be the source of a lot of latency and availability problems. On the Internal DNS check quantity, latency, and availability. Also verify External DNS servers give the correct answers and are available. 
  7. SSL Expiration. Don't let those certificates expire. Set up an expiration check.
  8. DELL OpenManage Server Administrator (OMSA).  Monitor the outputs from OMSA to know when failures have occurred. 
  9. Connection Limits. Do you know how close you are to your connection limits?
  10. Load Balancer Status. It's important to have visibility into your load balancer status by making the health stats visible.  
Tuesday
Jan282014

How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data

This is a guest post by Eric Czech, Chief Architect at Next Big Sound, talks about some unique approaches taken to solving scalability challenges in music analytics.

Tracking online activity is hardly a new idea, but doing it for the entire music industry isn't easy. Half a billion music video streams, track downloads, and artist page likes occur each day and measuring all of this activity across platforms such as Spotify, iTunes, YouTube, Facebook, and more, poses some interesting scalability challenges. Next Big Sound collects this type of data from over a hundred sources, standardizes everything, and offers that information to record labels, band managers, and artists through a web-based analytics platform.

While many of our applications use open-source systems like Hadoop, HBase, Cassandra, Mongo, RabbitMQ, and MySQL, our usage is fairly standard, but there is one aspect of what we do that is pretty unique. We collect or receive information from 100+ sources and we struggled early on to find a way to deal with how data from those sources changed over time, and we ultimately decided that we needed a data storage solution that could represent those changes.  Basically, we needed to be able to "version" or "branch" the data from those sources in much the same way that we use revision control (via Git) to control the code that creates it.  We did this by adding a logical layer to a Cloudera distribution and after integrating that layer within Apache Pig, HBase, Hive, and HDFS, we now have a basic version control framework for large amounts of data in Hadoop clusters.

As a sort of "Moneyball for Music," Next Big Sound has grown from a single server LAMP site tracking plays on MySpace (it was cool when we started) for a handful of artists to building industry-wide popularity charts for Billboard and ingesting records of every song streamed on Spotify. The growth rate of data has been close to exponential and early adoption of distributed systems has been crucial in keeping up. With over 100 sources tracked coming from both public and proprietary providers, dealing with the heterogenous nature of music analytics has required some novel solutions that go beyond the features that come for free with modern distributed databases.

Next Big Sound has also transitioned between full cloud providers (Slicehost), hybrid providers (Rackspace), and colocation (Zcolo) all while running with a small engineering staff using nothing but open source systems. There was no shortage of lessons learned in this process and we hope others can take something away from our successes and failures.

How does Next Big Sound make beautiful data out of all that music? Step inside and see...

Click to read more ...

Friday
Jan242014

Stuff The Internet Says On Scalability For January 24th, 2014

Hey, it's HighScalability time:


Gorgeous image from Scientific American's Your Brain by the Numbers
  • Quotable Quotes: 
    • @jezhumble: Google does everything off trunk despite 10k devs across 40 offices. 
    • @KentLangley: "in 2016. When it goes online, the SKA is expected to produce 700 terabytes of data each day" 
    • Jonathan Marks: It's actually a talk about how NOT to be creative. And what he [John Cleese] describes is the way most international broadcasters operated for most of their existence. They were content factories, slave to an artificial transmission schedule. Because they didn't take time to be creative, then ended up sounding like a tape machine. They were run by a computer algorithm. Not a human soul. There was never room for a creative pause. Routine was the solution. And that's creativities biggest enemy.

  • 40% better single-threaded performance in MariaDB. Using perf, cache misses were found and the fix was using the right gcc flags. But the big hairy key idea is: modern high-performance CPUs, it is necessary to do detailed measurements using the built-in performance counters in order to get any kind of understanding of how an application performs and what the bottlenecks are. Forget about looking at the code and counting instructions or cycles as we did in the old days. It no longer works, not even to within an order of magnitude.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...

Wednesday
Jan222014

How would you build the next Internet? Loons, Drones, Copters, Satellites, or Something Else?

If you were going to design a next generation Internet at the physical layer that routes around the current Internet, what would it look like? What should it do? How should it work? Who should own it? How should it be paid for? How would you access it?

It has long been said the Internet routes around obstacles. Snowden has revealed some major obstacles. The beauty of the current current app and web system is the physical network doesn't matter. We can just replace it with something else. Something that doesn't flow through choke points like backhaul networks, under sea cables, and cell towers. What might that something else look like?

Google's Loon Project

Project Loon was so named because the idea was thought to be loony. Maybe not...

Click to read more ...

Tuesday
Jan212014

Sponsored Post: Netflix, Logentries, Host Color, Booking, Apple, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7

Who's Hiring?

  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.
    • Sr. Security Software Developer. We are looking for an excellent programmer who's done extensive security programming. This individual will participate in various security projects from the start to the end. In addition to security concepts, it's important to have intricate knowledge of different flavors of Unix operating systems to develop code that's compact and optimal. Familiarity with key exchange protocols, authentication protocols and crypto analysis is a plus. Please apply here.
    • Senior Server Side Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies.  You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data, including: Develop scalable, robust systems that will gather, process, store large amount of data Define/develop Big Data technologies for Apple internal and customer facing applications. Please apply here.
    • Senior Server Side Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies.  You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data, including: Develop scalable, robust systems that will gather, process, store large amount of data Define/develop Big Data technologies for Apple internal and customer facing applications. Please apply here.
    • Senior Engineer: Emerging Technology. Apple’s Emerging Technology group is looking for a senior engineer passionate about exploring emerging technologies to create paradigm shifting cloud based solutions. Please apply here
    • Sr Software Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies. You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data. Please apply here.

  • The Netflix Cloud Performance Team is hiring. Help tackle the more complex scalability challenges emerging on the cloud today, wrangling tens of thousands of instances, handling billions of requests a day. We are searching for a Senior Performance Architect and a Senior Cloud Performance Tools Engineer

  • We need awesome people @ Booking.com - We want YOU! Come design next
    generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html

  • UI EngineerAppDynamics, founded in 2008 and lead by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop our their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big DataAppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for a Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for backend component of software that manages application architectures. Apply here.

Fun and Informative Events

  • Your amazing event here.

Cool Products and Services

  • Log management made easy with Logentries Billions of log events analyzed every day to unlock insights from the log data the matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events. 

  • HostColorCloud Servers based on OpenNebula Cloud automation IaaS. Cloud Start features: 256 MB RAM; 1000 GB Premium Bandwidth; 10 GB fast, RAID 10 protected Storage; 1 CPU Core;  /30 IPv4 (4 IPs) and /48 IPv6 subnet and costs only $9.95/mo. Client could choose an OS (CentOS is default). 

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • MongoDB Backup Free Usage Tier Announced. We're pleased to introduce the free usage tier to MongoDB Management Service (MMS). MMS Backup provides point-in-time recovery for replica sets and consistent snapshots for sharded systems with minimal performance impact. Start backing up today at mms.mongodb.com.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • Intel-Aerospike AppNexus Case Study. This new Intel case study explains how the world's largest independent advertising platform, AppNexus, increased their performance, reliability and serviceability while reducing each cluster size from about 50 machines to 8 by integrating Intel SSDs in a wide SATA configuration with the Aerospike NoSQL database. Download the case study to learn more.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Click to read more ...

Monday
Jan202014

8 Ways Stardog Made its Database Insanely Scalable

Stardog makes a commercial graph database that is a great example of what can be accomplished with a scale-up strategy on BigIron. In a recent article StarDog described how they made their new 2.1 release insanely scalable, improving query scalability by about 3 orders of magnitude and it can now handle 50 billion triples on a $10,000 server with 32 cores and 256 GB RAM. It can also load 20B datasets at 300,000 triples per second. 

What did they do that you can also do?

Click to read more ...