Vinted Architecture: Keeping a busy portal stable by deploying several hundred times per day
This is a guest post by Nerijus Bendžiūnas and Tomas Varaneckas of Vinted.
Vinted is a peer-to-peer marketplace to sell, buy and swap clothes. It allows members to communicate directly and has the features of a social networking service.
Started in 2008 as a small community for Lithuanian girls, it has developed into a worldwide project that serves over 7 million users in 8 different countries and keeps growing non-stop, handling over 200 million requests per day.
Stats
- 7 million users and growing
- 2.5 million monthly active users
- 200 million requests / day (pageviews + API calls)
- 930 million photos
- 80 TB raw space for image storage
- 60 TB HDFS raw space for analytics data
- 70% of traffic comes from mobile apps (API requests)
- 3+ hours / week spent on the site per member
- 25 million listed items
- Over 220 servers:
- 47 for internal tools (vpn, chef, monitoring, development, graphing, build, backups, etc.)
- 38 for the Hadoop ecosystem
- 34 for image processing and storage
- 30 for Unicorn and microservices
- 28 for MySQL databases, including replicas
- 19 for Sphinx search
- 10 for Resque background jobs
- 10 for load balancing with Nginx
- 6 for Kafka
- 4 for Redis
- 4 for email delivery
Technology Stack
Third party services
- GitHub for code, issues and discussions
- NewRelic for monitoring app response times
- CloudFlare for DDoS protection and DNS
- Amazon Web Services (SES, Glacier) for notifications and long term backups
- Slack for work conversations and "chat ops"
- Trello for tasks
- Pingdom for uptime monitoring
Core app and services
- Ruby for services, scripts and the main app
- Rails for the main app and internal apps
- Unicorn to serve the main app
- Percona MySQL Server as main database
- Sphinx Search for full-text search and for reducing load on MySQL (testing ElasticSearch)
- Capistrano for deployments
- Jenkins for running tests, deployments and various other tasks
- Memcached for caching
- Resque for background jobs
- Redis for Resque and Feed
- RabbitMQ for passing messages to services
- HAProxy for high availability and load balancing
- GlusterFS for distributed storage
- Nginx to serve web requests
- Apache ZooKeeper for distributed locking
- Flashcache for increased I/O throughput
Data warehouse stack
- Clojure for data ingestion services
- Apache Kafka for storing in-flight data
- Camus for offloading data from Kafka to HDFS
- Apache Hive as SQL-on-Hadoop solution
- Cloudera CDH Hadoop distribution
- Cloudera Impala as low latency SQL-on-Hadoop solution
- Apache Spark in experimentation phase
- Apache Oozie as workflow scheduler (investigating alternatives)
- Scalding for data transformations
- Avro for data serialization
- Apache Parquet for data serialization
Servers and Provisioning
- Mostly bare metal
- Multiple data centers
- CentOS
- Chef for provisioning nearly everything
- Berkshelf for managing Chef dependencies
- Test Kitchen for running infrastructure tests on VMs
- Ansible for rolling upgrades
Monitoring
- Nagios for usual ops monitoring
- Cacti for graphs and capacity planning
- Graylog2 for app logs
- Logstash for server logs
- Kibana for querying Logstash
- Statsd for collecting real-time metrics from the app
- Graphite for storing real-time metrics from the app
- Grafana for creating beautiful dashboards with app metrics
- Hubot for chat-based monitoring
Architecture overview
- Bare metal servers are used as a cheaper and more powerful alternative to cloud providers.
- Servers hosted in 3 different data centers across Europe and the US.
- Nginx frontends route HTTP requests to Unicorn workers and do SSL termination.
- Pacemaker is used to ensure high availability of Nginx servers.
- Each country's Vinted portal has its own separate deployment and set of resources.
- To increase machine utilization, services belonging to different portals may run alongside each other on a single bare metal server. Chef recipes take care of unique port allocation and other separation concerns.
- MySQL sharding is avoided by having a separate database for each portal.
- In our largest portal, functional sharding is already happening, and the point where the largest single table will no longer fit on one server is only months away.
- Caching in the Rails app uses a custom mechanism with an L2 cache that prevents spikes when the main cache expires. Several caching strategies are used (see the sketch after this list):
- remote (in memcached)
- local (in Unicorn worker memory)
- semi-local (in Unicorn worker memory, falling back to memcached when the local cache expires).
- Several microservices are built around the core Rails app, each with a clear purpose, such as sending iOS push notifications or storing and serving brand names and hashtags.
- Microservices can be per-portal, per-datacenter or global.
- Multiple instances of each microservice are running for high availability.
- HTTP-based microservices are balanced by Nginx.
- Custom feeds for each member are implemented using Redis.
- Catalog pages are loaded from the Sphinx index rather than MySQL when filters are used (custom size, color, etc.).
- Images are processed by a separate Rails app.
- Processed images are stored in GlusterFS.
- Images are cached after the 3rd hit, since the first few hits usually come from the uploader and an image may still be rotated.
- Memcached instances are sharded using twemproxy.
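To make the semi-local strategy concrete, here is a minimal sketch assuming the Dalli memcached client; the class name, TTL and key handling are invented for illustration and are not Vinted's actual implementation.

```ruby
require 'dalli'

# Semi-local cache sketch: a small in-memory layer inside each Unicorn
# worker, backed by memcached. All names and TTLs are illustrative.
class SemiLocalCache
  LOCAL_TTL = 30 # seconds a value may be served from worker memory

  def initialize(memcached = Dalli::Client.new('localhost:11211'))
    @remote = memcached
    @local  = {} # key => [value, expires_at], private to this worker
  end

  def fetch(key)
    value, expires_at = @local[key]
    return value if expires_at && Time.now < expires_at

    # Local copy missing or stale: fall back to memcached, computing and
    # storing the value remotely if it is absent there as well.
    value = @remote.get(key)
    if value.nil?
      value = yield
      @remote.set(key, value)
    end
    @local[key] = [value, Time.now + LOCAL_TTL]
    value
  end
end

cache = SemiLocalCache.new
popular_brands = cache.fetch('brands:popular') { %w[adidas nike zara] }
```

The in-worker layer absorbs reads of hot keys while memcached remains the shared source of truth; a second, longer-lived cache level of this kind is also what keeps request spikes from piling onto the database the moment a popular key expires.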
Team
- over 150 full-time employees
- 30 developers (backend, frontend, mobile)
- 5 site reliability engineers
- Main HQ in Vilnius, Lithuania
- Offices in the USA, Germany, France, the Czech Republic and Poland
How The Company Works
- Nearly all information is open to all employees
- Using GitHub for almost everything (including discussing company issues)
- Real-time discussions in Slack
- Everyone is free to participate in things that interest them
- Autonomous teams
- No seniority levels
- Cross-functional development teams
- No enforced process; teams decide how to manage themselves
- Teams work on high-level problems and decide how to tackle them
- Each team decides how, when and where they work
- Teams hire for themselves when a need arises
Development Cycle
- Developers branch from master.
- Changes become pull requests in GitHub.
- Jenkins runs Pronto for static code and style checks, runs all tests on pull request branches and updates the GitHub pull request with a status (an example Pronto invocation follows this list).
- Other developers review the change, add comments.
- Pull requests may be updated multiple times with various fixes.
- Jenkins runs all tests after each update.
- Git history is cleaned up before merging with master to keep master history concise.
- Accepted pull requests are merged into master.
- Jenkins runs master build with all tests and triggers deployment jobs to roll out the new version.
- Several minutes after the code reaches the master branch, it is running in production.
- Pull requests are usually small.
- Deployments happen ~300 times per day.
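For illustration, a Pronto run of the kind Jenkins might execute on a pull request branch looks roughly like this; the post does not specify the exact formatter or flags Vinted uses:

```sh
# Lint only the lines changed relative to master and report the
# result back to GitHub as a commit status on the pull request.
pronto run -f github_status -c origin/master
```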
Avoiding Failures
Deploying hundreds of times per day does not mean everything always has to be broken, but keeping the site stable does require some discipline.
- Deployments do not happen if tests are broken. There is no way to override this: master has to be green in order to deploy.
- Automatic deployments are locked every night and during weekends.
- Anyone can lock deployments manually from a Slack chat.
- When deployments are locked, it is possible to "force" a deployment manually, from the Slack chat.
- The team is very picky when it comes to reviewing code. The quality bar is set high. Tests are mandatory.
- Before Unicorn reloads code, a warmup is done in production. It includes several requests to various critical parts of the portal (a sketch of such a check follows this list).
- During warmup, if any single request does not return 200 OK, the old code remains in place and the deployment fails.
- Occasionally a problem is not caught by tests / warmup, and broken code is released to production.
- Errors are streamed to the Graylog server.
- Alarms are triggered if the error rate exceeds a threshold.
- Error rate alarms and failed build notifications are reported instantly to a Slack chat.
- All errors contain extra metadata: Unicorn worker host and pid, HTTP request details, git revision of the code that caused the problem, error backtrace.
- Some types of "Fatal" errors are also reported directly to a Slack chat.
- Each deployment log contains a GitHub diff URL with all the changes.
- If something breaks in production, it is easy to pinpoint the problem quickly due to a small changeset and instant error feedback.
- Deployment details are reported to NewRelic, which makes it easy to spot newly introduced performance bottlenecks.
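A minimal sketch of the warmup check described above; the paths, local port and script shape are invented for illustration:

```ruby
#!/usr/bin/env ruby
require 'net/http'

# Hit a few critical pages on the freshly started workers before Unicorn
# swaps in the new code; any non-200 response aborts the deployment.
WARMUP_PATHS = ['/', '/catalog', '/member/login'] # illustrative paths

failed = WARMUP_PATHS.reject do |path|
  response = Net::HTTP.get_response(URI("http://127.0.0.1:8080#{path}"))
  response.is_a?(Net::HTTPOK)
end

unless failed.empty?
  warn "Warmup failed for: #{failed.join(', ')} -- keeping the old code"
  exit 1 # a non-zero exit marks the deployment as failed
end
```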
Reducing Time To Production
There is a big focus on both fast and stable releases; the team is always working to keep build and deployment times as short as possible.
- Full test suite contains >7000 tests and runs in ~3 minutes.
- New tests are added constantly.
- Jenkins runs on a bare metal machine with 32 CPU cores, 128 GB RAM and SSD drives.
- Jenkins splits all tests into multiple chunks, runs them in parallel and aggregates the final results (see the splitter sketch after this list).
- Without parallelisation, the tests would take over an hour on an average machine.
- Rails assets are pre-built in Jenkins for every commit on the master branch.
- During deployment, prebuilt assets are downloaded from Jenkins and uploaded to all target servers.
- BitTorrent is used when uploading a release to target servers.
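The chunking idea can be illustrated with a naive splitter like the one below; a real splitter would likely balance chunks by historical runtime rather than round-robin, and the environment variable names are made up:

```ruby
# Run one of CHUNKS slices of the test suite on this executor.
chunk_count = Integer(ENV.fetch('CHUNKS', '8'))
chunk_index = Integer(ENV.fetch('CHUNK_INDEX', '0'))

specs = Dir.glob('spec/**/*_spec.rb').sort
mine  = specs.select.with_index { |_, i| i % chunk_count == chunk_index }

exit 0 if mine.empty? # nothing assigned to this chunk
exec('bundle', 'exec', 'rspec', *mine)
```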
Running live database migrations
We can change the structure of our production MySQL databases without downtime, in most cases even at peak hours.
- Database migrations can happen during any deployment
- Percona Toolkit's pt-online-schema-change is used to run ALTERs on a copy of the table while keeping it in sync with the original, and to swap the copy with the original when the change is done (an example invocation follows)
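An illustrative invocation; the database and table names are made up:

```sh
# Add a column to a large table without locking it: pt-online-schema-change
# builds a shadow copy, backfills it, keeps it in sync with triggers,
# and atomically swaps it in at the end.
pt-online-schema-change \
  --alter "ADD COLUMN discount_percent TINYINT UNSIGNED NOT NULL DEFAULT 0" \
  --execute \
  D=vinted_production,t=items
```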
Operations
Server Provisioning
- Chef is used to provision nearly every aspect of our infrastructure (a toy recipe fragment follows this list).
- The SRE team makes pull requests for infrastructure changes, just like development teams do for code changes.
- Jenkins verifies all Chef pull requests, running cross-dependency checks, Foodcritic, etc.
- Chef code changes are reviewed in GitHub by the team.
- Developers can make infrastructure change pull requests to Chef repository.
- When pull requests are merged, Jenkins uploads changed Chef cookbooks and applies them in production.
- There are plans to spawn DigitalOcean droplets to run Test Kitchen tests with Jenkins.
- Jenkins itself is configured with Chef, not using the web UI.
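As a toy illustration of how a Chef recipe can handle the per-portal port allocation mentioned in the architecture overview, a fragment along these lines would stamp out one Unicorn config and service per portal; the attribute names, template and port scheme are invented:

```ruby
# Derive a unique Unicorn port per portal so services belonging to
# different portals can share one bare metal host without clashing.
base_port = 3000

node['vinted']['portals'].each_with_index do |portal, index|
  port = base_port + index

  template "/etc/unicorn/#{portal}.rb" do
    source 'unicorn.rb.erb'
    variables(portal: portal, port: port)
    notifies :restart, "service[unicorn-#{portal}]"
  end

  service "unicorn-#{portal}" do
    action [:enable, :start]
  end
end
```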
Monitoring
- Cacti graphs server side metrics
- Nagios checks for everything that can be up or down
- Nagios health checks for some of our apps and services
- Statsd / Graphite for tracking app metrics, like member logins, signups and database object creation (an example follows this list)
- Grafana used for nice dashboards
- Grafana dashboards can be described and created dynamically using Chef
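For example, counters and timers like these might be emitted from the app; the sketch assumes the statsd-ruby gem, and the metric names are made up:

```ruby
require 'statsd' # from the statsd-ruby gem; other clients work similarly

statsd = Statsd.new('localhost', 8125)

statsd.increment('member.login')      # count a login event
statsd.increment('member.signup')     # count a signup event
statsd.timing('item.create_ms', 12)   # report a measured duration in ms
```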
Chat Ops
- Hubot sits in Slack.
- Slack rooms are subscribed to various event notifications via hubot-pubsub.
- #deploy room shows information about deployments, with links to GitHub diffs, Jenkins console outputs, etc. Deployments can be triggered, locked and unlocked right from this room with simple chat commands.
- #deploy-fail room shows only deployment failures.
- #failboat room shows announcements about error rates and individual errors in production.
- There are multiple #failboat-* rooms that give information about cron failures, stuck Resque jobs, Nagios warnings and NewRelic alerts.
- Twice a day, Graylog errors are crunched and cross-checked with GitHub, and reports are generated and presented to the development team.
- If someone mentions you in GitHub, Hubot gives you a link to the issue in a private message on Slack.
- When an issue or pull request is created in GitHub, teams that are mentioned receive pings in their appropriate Slack rooms.
- Graphite graphs can be queried and displayed in Slack chat via Hubot.
- You can ask Hubot questions about infrastructure, e.g. about the machines where a certain service is deployed. Hubot queries the Chef server to get the data.
- Developers use Hubot notifications for certain events that happen in our app. For example, the customer support chatroom is automatically notified about possible fraud cases.
- Hubot is integrated with the office Netatmo module and can tell everyone what the CO2, noise, temperature and humidity levels are.
Data warehouse stack
- We ingest data from two sources:
- Website event tracking data: impressions, clicks etc.
- Database (MySQL) changes.
- Most event tracking data is published to an HTTP endpoint from JavaScript/Android/iOS front-end apps. Some events are published to a local UDP socket by our core Ruby on Rails application. We chose UDP in order to avoid blocking the request (a sketch follows this list).
- A service listens on the UDP socket for new events, buffers them and periodically emits them to a raw-events topic in Kafka.
- A stream processing service consumes the raw events topic, validates the events, encodes them in Avro format and emits to event-specific topics.
- We keep our Avro schemas of events in a separate centralized schema registry service.
- Invalid events are placed into a separate Kafka topic for possible correction.
- We use LinkedIn's Camus job to offload events from event-specific topics to HDFS incrementally. Time to Hadoop for an event is usually up to an hour.
- Using Hive and Impala for ad-hoc queries and data analysis.
- Working on a Scalding-based solution for data processing.
- Reports run from our home-grown OLAP-like reporting system written in Ruby.
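A fire-and-forget publisher of the kind the Rails app might use could look like the sketch below; the port, payload shape and helper name are invented, and the point is simply that a UDP send never blocks the web request:

```ruby
require 'socket'
require 'json'

EVENT_SOCKET = UDPSocket.new
EVENT_PORT   = 5565 # illustrative; the local listener buffers events for Kafka

def track_event(name, properties = {})
  payload = JSON.generate(event: name, properties: properties, at: Time.now.to_i)
  EVENT_SOCKET.send(payload, 0, '127.0.0.1', EVENT_PORT)
rescue SystemCallError
  # Dropping a tracking event is acceptable; failing the request is not.
end

track_event('item.impression', item_id: 123, member_id: 456)
```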
Lessons Learned
Product development
- Invest in code quality.
- Cover absolutely everything with tests.
- Release small changes.
- Use feature switches to deploy unfinished features to production (a minimal sketch follows this list).
- Do not keep long running branches of code when developing something.
- Invest in a fast release cycle.
- Build the product mobile-first.
- Design the API very well from the very beginning; it is difficult to change later.
- Including sender hostname and app name in RabbitMQ messages can be a life saver.
- Do not guess or make assumptions, do AB testing and check the data.
- If you plan on using Redis for something big, think about sharding from the very beginning.
- Never use forked and slightly modified Ruby gems if you plan on keeping your Rails version up to date.
- Make sharing your content through social networks as easy as possible.
- Search engine optimization is a full-time job.
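A minimal feature-switch sketch; real apps typically back the switches with a database or Redis so they can be flipped without a deploy, and everything here is invented for illustration:

```ruby
# Toy in-memory switch registry; a real one would live in Redis or MySQL
# so that switches can be flipped at runtime.
FEATURE_SWITCHES = { 'new_feed' => true, 'bundle_discounts' => false }

def feature_enabled?(name)
  FEATURE_SWITCHES.fetch(name, false)
end

if feature_enabled?('new_feed')
  puts 'serving the new, still-unfinished feed code path'
else
  puts 'serving the stable feed'
end
```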
Infrastructure and Operations
- Deploy often to increase system stability. It may sound counter-intuitive at first.
- Automate everything you can.
- Monitor everything that can be down.
- Network switch buffers matter.
- Capacity planning is hard.
- Think about HA from the start.
- A RabbitMQ cluster is not worth the effort.
- Measure and graph everything you can: HTTP response times, Resque queue sizes, model persistence rates, Redis response times. You cannot improve what you cannot measure.
Data warehouse stack
- There's a myriad of tools in the Hadoop ecosystem. It's hard to pick the right one.
- If you are writing a user-defined function (UDF) for Hive, it's time to consider a non-SQL solution for data transformations.
- The "Hadoop application" notion is vague. Oftentimes we need to glue our application components together manually.
- Everyone writes their own workflow manager. And for a good reason.
- Distributed system monitoring is tough. Anomaly detection and graphing helps.
- The core Hadoop infrastructure (HDFS, YARN) is rock solid but the newer tools usually have unpleasant nuances.
- Kafka is amazing.