The Dollar Shave Club Architecture Unilever Bought for $1 Billion

This is a guest post by Jason Bosco, Dollar Shave Club’s Director of Engineering, Core Platform & Infrastructure, on the infrastructure behind its ecommerce technology.

With more than 3 million members, Dollar Shave Club will do over $200 million in revenue this year. Most people know the company for its marketing, but this rapid growth in the few years since launch is largely due to its team of 45 engineers.

Dollar Shave Club engineering by the numbers:

Core Stats

Super Bowl Ads served with no downtime: 1

Monthly Traffic Bandwidth: 9 TB

Orders processed via Arm: 38 million

Total Bugs Found: 4,566

Automation Tests Run: 312,000

Emails sent via Voice: 195 million

Analytics data points processed and stored in Hippocampus: 534 Million

Size of dataset in Hippocampus: 1.5 TB

Currently Deployed Apps / Services: 22

Number of servers: 325

Technology Stack

Ember for a front-end framework

Primarily Ruby on Rails on the backend

Node.js for high-throughput background processing needs (e.g., in Voice)

Golang for infrastructure software

Python for infrastructure & data science

Elixir for 1 internal app

Ruby for Test Automation

Swift and Objective-C for Native iOS App

Infrastructure

Fully Hosted on AWS

Ubuntu & CoreOS

Ansible & Terraform for Configuration Management

Transitioning to Docker-based deployments

Jenkins for deployment coordination

Nginx & Varnish

Fastly for application delivery

Sumologic for log aggregation

CloudPassage for security monitoring

Vault by HashiCorp for secrets storage & provisioning

Data Stores

Primarily MySQL hosted on RDS

Memcached hosted on ElastiCache for caching

Self-hosted Redis servers primarily for queuing

A dash of Kinesis for handling orders from spiky traffic

Amazon Redshift for a data warehouse
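One plausible reading of the Kinesis item above is that a stream sits between order intake and order processing, so that bursts pile up in a buffer instead of overwhelming downstream systems. The sketch below illustrates that idea with a plain in-memory queue. It does not use Kinesis, and the function name, tick model and numbers are all hypothetical.

```python
from collections import deque
from typing import Deque, List


def simulate_backlog(arrivals_per_tick: List[int], capacity_per_tick: int) -> List[int]:
    """Return the backlog size after each tick.

    Orders arrive in bursts, but the consumer only ever handles up to
    capacity_per_tick orders per tick. The queue absorbs the difference.
    """
    backlog: Deque[int] = deque()  # hypothetical stand-in for the stream
    backlog_sizes: List[int] = []
    next_order_id = 0

    for arrivals in arrivals_per_tick:
        # Producers append as fast as traffic arrives.
        for _ in range(arrivals):
            backlog.append(next_order_id)
            next_order_id += 1

        # The consumer drains at a fixed, sustainable rate.
        for _ in range(min(capacity_per_tick, len(backlog))):
            backlog.popleft()

        backlog_sizes.append(len(backlog))

    return backlog_sizes


if __name__ == "__main__":
    # A spike of 10 orders in the second tick is worked off over later ticks.
    print(simulate_backlog([2, 10, 2, 0, 0], capacity_per_tick=4))
```

The spike shows up as a temporary backlog rather than as load on the order-processing side, which only ever sees its fixed per-tick capacity.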

Messaging & Queuing

Resque and Sidekiq for async job processing & messaging

RabbitMQ for messaging

Kafka for stream processing

Analytics & Business Intelligence

Snowplow & Adobe Analytics for web/mobile analytics

AWS Elastic MapReduce

FlyData to ETL data from MySQL into Redshift

Databricks (Hosted Spark)

Looker as the BI front-end

Near-realtime data availability for reporting

Monitoring

Rollbar, Sentry & Crashlytics for exception tracking

DataDog for custom application metrics & metrics aggregation

SysDig for infrastructure metrics & monitoring

NewRelic for application performance monitoring

Site24x7 for availability monitoring

PagerDuty for on-call alerting

QA and Test Automation

CircleCI for running unit tests

Jenkins + TestUnit + Selenium + SauceLabs for browser-based Automated tests

Jenkins + TestUnit + Selenium + SauceLabs for Brain Automated tests

Jenkins + TestUnit for API Functional Tests

Jenkins + TestUnit + Appium + SauceLabs for Native Android Automated Tests

Jenkins + TestUnit + Appium + SauceLabs for Native iOS Automated Tests

Jenkins + TestUnit + Selenium + SauceLabs + Proxy Server for BI Test Automation

SOASTA + Regex Scripts for Stress, Soak, Load and Performance Testing.

Engineering Workflow

Slack for cross-team communication

Trello for task tracking

Hubot with custom plugins as our chat bot

GitHub as our code repository

ReviewNinja integrated with GitHub Status API for code reviews

Continuous deployment - multiple deployments per day typically

Moving to continuous delivery

On-the-fly sandbox environments for feature development

Currently, single-button push deployment using Jenkins, moving towards continuous delivery

Vagrant box running Docker containers gives new engineers a fully functioning development environment on their first day

Architecture

Event-driven architecture

Moving from a monolithic architecture to “medium” services interacting through a common message bus

VCL-based edge-routing on the CDN edges, deployed just like any other app.

Web and Mobile frontends talk to an API layer

API layer talks to services, aggregates data and formats it for clients

Services talk to the data stores and message bus

Scheduled tasks run as one master job that breaks itself up into smaller jobs in Resque/Sidekiq

Technology components include internal tools for customer service (Brain), marketing automation platform (Voice), fulfillment system (Arm), subscription billing system (Baby Boy) and our data infrastructure (Hippocampus).
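The master-job pattern mentioned above (one scheduled job that splits itself into smaller jobs) can be sketched roughly as follows. This is a minimal illustration, not DSC's code: it uses an in-memory deque as a stand-in for a Resque/Sidekiq queue, and every name in it (enqueue, master_job, process_batch, the batch size) is hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List, Tuple

# Hypothetical in-memory stand-in for a Resque/Sidekiq queue.
Job = Tuple[Callable[..., None], tuple]
job_queue: Deque[Job] = deque()

# Records what the child jobs handled, so the result can be inspected.
processed: List[int] = []


def enqueue(func: Callable[..., None], *args) -> None:
    """Put a job on the queue instead of running it inline."""
    job_queue.append((func, args))


def chunked(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(member_ids: List[int]) -> None:
    """Child job: does the real work for one small slice."""
    processed.extend(member_ids)


def master_job(member_ids: List[int], batch_size: int = 100) -> None:
    """Scheduled job: does no real work itself, only fans out child jobs."""
    for batch in chunked(member_ids, batch_size):
        enqueue(process_batch, batch)


def run_worker() -> None:
    """Drain the queue, including jobs enqueued while draining."""
    while job_queue:
        func, args = job_queue.popleft()
        func(*args)


if __name__ == "__main__":
    enqueue(master_job, list(range(250)), 100)
    run_worker()
    print(len(processed))
```

With a real queue, the child jobs would be picked up by many workers in parallel, and a failed batch could be retried without rerunning the whole scheduled task.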

Team

45 top-notch, entrepreneurial and highly skilled engineers working out of our Marina del Rey, CA headquarters

Engineers participate in cross-functional teams called squads along with product managers, designers, UX and stakeholders to deliver end-to-end features.

Teams are vertically divided based on domains into Frontend, Backend, QA & IT.

Front-end team owns Web UI for DSC.com & internal tools and our iOS & Android apps.

Backend team owns web backends for DSC.com & internal tools, Internal Services (Billing and Fulfillment), Data Platform & Infrastructure.

QA team owns testing and automation infrastructure for all digital products.

IT team owns Office & Warehouse IT.

Engineers get to attend one company-sponsored conference every year.

Engineers get to buy as many books / learning resources as they need.

Standing desks for all. One treadmill desk currently available as a pilot.

Weekly engineering team lunches.

Tech Belly events every other week where engineers present talks on technology topics over lunch.

Engineers are encouraged to experiment with bleeding-edge technology and create proposals through Requests for Comments (RFCs).

Engineers are encouraged to open source tools and libraries where it makes sense

Every engineer gets a standard issue of a 15” MacBook Pro, 27” Mac Display and a 24” monitor.

One 3D printer available to print props and more 3D printers.

Lessons Learned

Scaling becomes an easier challenge when components you’re trying to scale are composed of simple and small services.

Documentation & knowledge sharing are important for fast-growing teams.

A well-nurtured test-suite is critical to fast-evolving systems.

Redis uses an approximate LRU algorithm, so it’s not suitable if your cache needs precise LRU eviction.

Web performance is critical, especially on mobile - every millisecond costs us revenue

Usability & User Experience are important even for internal tools: efficient tools = more productive teams
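To make the Redis lesson above concrete: rather than tracking a full recency ordering, Redis samples a handful of keys (controlled by the maxmemory-samples setting) and evicts the least recently used key among that sample. The toy cache below imitates that sampling idea. It is a simplified illustration and not Redis's actual implementation, and the class and method names are hypothetical.

```python
import random
from typing import Any, Dict, Hashable, Optional


class SampledLRUCache:
    """Toy cache that evicts the oldest key among a random sample."""

    def __init__(self, capacity: int, samples: int = 5, seed: Optional[int] = None):
        if capacity < 1 or samples < 1:
            raise ValueError("capacity and samples must be at least 1")
        self.capacity = capacity
        self.samples = samples
        self._values: Dict[Hashable, Any] = {}
        self._last_used: Dict[Hashable, int] = {}
        self._clock = 0  # logical clock, bumped on every access
        self._rng = random.Random(seed)

    def _touch(self, key: Hashable) -> None:
        self._clock += 1
        self._last_used[key] = self._clock

    def get(self, key: Hashable) -> Any:
        """Return the cached value, or None if the key is absent."""
        if key not in self._values:
            return None
        self._touch(key)
        return self._values[key]

    def set(self, key: Hashable, value: Any) -> None:
        if key not in self._values and len(self._values) >= self.capacity:
            self._evict_one()
        self._values[key] = value
        self._touch(key)

    def _evict_one(self) -> Hashable:
        # Look at only a few random keys, not the whole keyspace.
        sample_size = min(self.samples, len(self._values))
        candidates = self._rng.sample(list(self._values), sample_size)
        # Evict the least recently used key among the sampled ones.
        victim = min(candidates, key=self._last_used.__getitem__)
        del self._values[victim]
        del self._last_used[victim]
        return victim
```

When the sample is smaller than the keyspace, the evicted key is only the oldest among the sampled candidates, so a recently used key can occasionally be evicted while an older one survives. A workload that depends on strict LRU order would notice the difference.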
