Entries in Distributed Computing (5)


The Story of Batching to Streaming Analytics at Optimizely

Our mission at Optimizely is to help decision makers turn data into action. This requires us to move data with speed and reliability. We track billions of user events, such as page views, clicks and custom events, on a daily basis. Providing our customers with immediate access to key business insights about their users has always been our top priority. Because of this, we are constantly innovating on our data ingestion pipeline.

In this article we describe how we transformed our data ingestion pipeline from batch to streaming to provide our customers with real-time session metrics.


Unification. Previously, we maintained two data stores for different use cases: HBase was used for computing Experimentation metrics, whereas Druid was used for calculating Personalization results. These two systems were developed with distinct requirements in mind:



Experimentation (HBase)        Personalization (Druid)
Instant event ingestion        Delayed event ingestion ok
Query latency in seconds       Query latency in subseconds
Visitor level metrics          Session level metrics

As our business requirements evolved, however, things quickly became difficult to scale. Maintaining a Druid + HBase Lambda architecture (see below) to satisfy these business needs became a technical burden for the engineering team. We needed a solution that reduces backend complexity and increases development productivity. More importantly, a unified counting infrastructure creates a generic platform for many of our future product needs.

Consistency. As mentioned above, the two counting infrastructures provide different metrics and computational guarantees. For example, Experimentation results show you the number of visitors who visited your landing page, whereas Personalization shows you the number of sessions instead. We want to bring consistent metrics to our customers and support both types of statistics across our products.
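To make the visitor versus session distinction concrete, here is a minimal sketch that derives both counts from the same events. The 30-minute inactivity timeout and the (visitor_id, timestamp) event shape are assumptions made for illustration; the article does not say how Optimizely defines a session.

```python
from collections import defaultdict

# Assumed inactivity gap that closes a session; not taken from the article.
SESSION_TIMEOUT_SECONDS = 30 * 60


def count_visitors_and_sessions(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Return (unique_visitors, sessions) for (visitor_id, unix_ts) pairs."""
    timestamps_by_visitor = defaultdict(list)
    for visitor_id, ts in events:
        timestamps_by_visitor[visitor_id].append(ts)

    sessions = 0
    for timestamps in timestamps_by_visitor.values():
        timestamps.sort()  # tolerate out-of-order input
        sessions += 1      # a visitor's first event always opens a session
        for prev, curr in zip(timestamps, timestamps[1:]):
            if curr - prev > timeout:
                sessions += 1  # gap longer than the timeout starts a new session
    return len(timestamps_by_visitor), sessions


events = [
    ("alice", 0),
    ("alice", 600),    # 10 minutes later: same session
    ("alice", 7200),   # well past the timeout: new session
    ("bob", 100),
]
visitors, sessions = count_visitors_and_sessions(events)
print(visitors, sessions)  # 2 3
```

The same four events yield two visitors but three sessions, which is why the two products could report different numbers for the same traffic.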

Real-time results. Our session-based results are computed using MR jobs, which can be delayed by up to hours after the events are received. A real-time solution will provide our customers with a more up-to-date view of their data.

Druid + HBase

In our earlier posts, we introduced our backend ingestion pipeline and how we use Druid and MR to store transactional stats based on user sessions. The biggest benefit we get from Druid is low-latency results at query time. However, it comes with its own set of drawbacks. For example, since segment files are immutable, it is impossible to incrementally update the indexes. As a result, we are forced to reprocess user events within a given time window whenever we need to fix data issues such as out-of-order events. In addition, we had difficulty scaling the number of dimensions and dimension cardinality, and queries spanning long periods of time became expensive.

On the other hand, we use HBase for our visitor-based computation. We write each event into an HBase cell, which gives us maximum flexibility in the kinds of queries we can run. When a customer needs to find out “how many unique visitors have triggered an add-to-cart conversion”, for example, we scan the range of the dataset for that experiment. Since events are pushed into HBase (through Kafka) in near real-time, the data generally reflects the current state of the world. However, our current table schema does not aggregate any of the metadata associated with each event. This metadata includes generic information such as browser types and geolocation details, as well as customer-specific tags used for customized data segmentation. The redundancy of this data prevents us from supporting a large number of custom segmentations, as it increases our storage cost and query scan time.
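The scan-based unique visitor query can be sketched with an in-memory stand-in for a table whose row keys are kept in sorted order. This is not the HBase client API, and the row-key layout, class and function names are all hypothetical; the article does not describe the actual schema.

```python
import bisect


def make_key(experiment_id, event_type, ts, visitor_id):
    # Hypothetical row-key layout: experiment#event_type#timestamp#visitor.
    # Assumes the ids themselves never contain "#".
    return f"{experiment_id}#{event_type}#{ts:013d}#{visitor_id}"


class SortedEventTable:
    """In-memory stand-in for a table with lexicographically sorted row keys."""

    def __init__(self):
        self._keys = []
        self._cells = {}

    def put(self, key, cell):
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = cell

    def scan(self, start_key, stop_key):
        """Yield (key, cell) for start_key <= key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._cells[key]


def unique_visitors(table, experiment_id, event_type):
    start = f"{experiment_id}#{event_type}#"
    # "$" is the character right after "#", so this bounds exactly the prefix.
    stop = f"{experiment_id}#{event_type}$"
    return len({cell["visitor_id"] for _, cell in table.scan(start, stop)})


table = SortedEventTable()
rows = [
    ("exp1", "add_to_cart", 1000, "alice"),
    ("exp1", "add_to_cart", 2000, "alice"),
    ("exp1", "add_to_cart", 1500, "bob"),
    ("exp1", "page_view", 1200, "carol"),
    ("exp10", "add_to_cart", 1300, "dave"),
]
for exp, event_type, ts, visitor in rows:
    # Every cell repeats the same metadata, mirroring the redundancy described above.
    cell = {"visitor_id": visitor, "browser": "chrome", "country": "US"}
    table.put(make_key(exp, event_type, ts, visitor), cell)

print(unique_visitors(table, "exp1", "add_to_cart"))  # 2
```

Because every event is its own cell and carries its full metadata, the scan touches one row per event; the more segmentation attributes each row repeats, the more bytes each query reads.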




22 Recommendations for Building Effective High Traffic Web Software

This is a guest post by Ashwanth Fernando, Software Engineer from the trenches at large scale internet companies.

Inspired by the book "Effective Java" by Joshua Bloch, I wanted to share my holistic recommendations on building high traffic web software (i.e. web applications/services that serve high traffic loads). Some of these items may not be just about software design but also around surrounding areas such as the engineering organization, culture etc.

Two disclaimers up front:

1) This is my opinion.
2) As with all things "software", there will be real-world situations where the principles below are wrong. Please use common sense at all times.

Consider using more than one datacenter

There have been numerous horror stories about businesses, ahem, going out of business because they had just a single datacenter. It's really important to have more than one datacenter if you want to protect yourself from natural disasters or electrical supply failures. Run all your datacenters in an active-active configuration. It may cost extra money, but it's well worth it compared to running an active-passive configuration and then finding out too late that, for some pieces of data, your passive hardware was not consistent with the active one.

Consider a sparse datacenter deployment



10 Things You Should Know About AWS

Authored by Chris Fregly:  Former Netflix Streaming Platform Engineer, AWS Certified Solution Architect and Purveyor of

Ahead of the upcoming 2nd annual re:Invent conference, inspired by Simone Brunozzi’s recent presentation at an AWS Meetup in San Francisco, and collected from a few of my recent consulting engagements, I’ve compiled a list of 10 useful time and clock-tick saving tips about AWS.

1) Query AWS resource metadata


Can’t remember the EBS-Optimized IO throughput of your c1.xlarge cluster?  How about the size limit of an S3 object on a single PUT? is the answer to all of your AWS-resource metadata questions.  Interested in integrating with your application?  You’re in luck.  There’s now a REST API, as well!

Note:  These are default soft limits and will vary by account.

2) Tame your S3 buckets


Delete an entire S3 bucket with a single CLI command:  

aws s3 rb s3://<bucket-name> --force

Recursively copy a local directory to S3:

aws s3 cp <local-dir-name> s3://<bucket-name> --region <region-name> --recursive

3) Understand AWS cross-region dependencies



100 Node Hazelcast cluster on Amazon EC2

Deploying, running and monitoring an application on a big cluster is a challenging task. Recently the Hazelcast team deployed a demo application on the Amazon EC2 platform to show how a Hazelcast p2p cluster scales, and screen-recorded the entire process from deployment to monitoring.

Hazelcast is an open source (Apache License), transactional, distributed caching solution for Java. It is a little more than a cache, though, as it provides distributed implementations of map, multimap, queue, topic, lock and executor service.

Details of running a 100 node Hazelcast cluster on Amazon EC2 can be found here. Make sure to watch the screencast!


Big Data on Grids or on Clouds? 

 Contributed by Wolfgang Gentzsch:

Now that we have a new computing paradigm, Cloud Computing, how can Clouds help our data? Replace our internal data vaults as we hoped Grids would? Are Grids dead now that we have Clouds? Despite all the promising developments in the Grid and Cloud computing space, and the avalanche of publications and talks on this subject, many people still seem to be confused about internal data and compute resources, versus Grids versus Clouds, and they are hesitant to take the next step. I think there are a number of issues driving this uncertainty.
