9ish Low Latency Strategies for SaaS Companies
Achieving very low latencies takes specialized engineering, but if you are a SaaS company, latencies of a few hundred milliseconds are possible even for complex business logic, using standard technologies like load balancers, queues, JVMs, and REST APIs.
Itai Frenkel, a software engineer at Forter, which provides a Fraud Prevention Decision as a Service, shows how in an excellent article: 9.5 Low Latency Decision as a Service Design Patterns.
While any article on latency will have some familiar suggestions, Itai goes into some new territory you can really learn from. The full article is rich with detail, so you'll want to read it, but here's a short gloss:
- Measure your API Latency as a Function of Probability. Report latency as percentiles, not averages. Distinguish real traffic from stress-test tools that generate synthetic requests, because synthetic traffic simulates a best-case scenario where the caches are already hot. (A minimal percentile sketch appears after this list.)
- Define How the Client Should React in Case of a Timeout. Networks break, and since what customers care about is end-to-end service availability, APIs need to time out requests, retry on failure, and make requests idempotent so that duplicate requests don't create inconsistencies. A reconciliation process is still necessary to handle any inconsistencies that slip through. (A retry sketch appears after this list.)
- All subsystems must be able to make a decision. Run with two production systems. If problems with the new production system get bad enough, trigger a graceful-degradation alert that causes a reversion to the old production release. Switching can be done with the Amazon ELB API. This lets you both move fast and have time to fix any problems that pop up.
- Handle Excess Traffic. Divert excess customer traffic to a different set of machines rather than pushing back on the client.
- Use Dynamic Timeouts for I/O Operations. In most cases, use a maximum timeout period: the time within which the I/O operation succeeds 99.9% of the time. When a garbage collection pause or a network hiccup occurs, speed up the rest of the processing by switching to a minimum timeout: the time within which the operation succeeds 80% of the time. (A sketch of this budget logic appears after this list.)
- Use Auto-Healing (Automatic Failover). Fail fast and crash on unhandled exceptions. Use an auto-scaling group to detect the failure and spin up a replacement instance. Use load balancing to divert traffic to healthy instances. Use HTTP fencing to fail over to a healthy service when an entire service stops responding. (A fail-fast sketch appears after this list.)
- Plot Latencies, it’s a Big Time Saver. To detect latency problems, timestamp before and after every sub-component of the system. Plot the latency of each transaction; unexpected patterns may emerge, such as a system needing time to warm up before it performs well. If problems are found, look for recent code changes that may have introduced them. (A per-stage timing sketch appears after this list.)
- Know Your (Latency) Enemy. There are several potential sources of latency problems: flaky EC2 instances; overhead from logging/eventing frameworks; slow regular expressions; TCP's Nagle algorithm and delayed ACKs; JVM garbage collection. Don't optimize prematurely, but do compare latencies on different EC2 instances and measure how your logging system and other frameworks impact latency. (A TCP_NODELAY sketch appears after this list.)
- Low-Latency Tweaks for Data Stores. When comparing databases, make sure you are comparing similar features and capabilities. Are indexes and data in memory or on disk? Are there enough CPUs to handle the load? Do you have too many or too few client threads? Are your threads running at the right priority? Are you in the sweet spot of your database's read/write trade-offs?
- Offload Tasks Before/After the Low Latency Critical Path. To further improve latency for their fraud detection system, Forter split their workload into three subsystems: event stream processing; transaction processing on the critical path; and batch processing. They include an interesting table that, for each component, indicates the limiting design constraint, whether data loss is acceptable, and whether there is a hard business requirement.
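To make the percentile point concrete, here's a minimal sketch; the nearest-rank method and the sample latencies are illustrative, not from the article:

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile over raw latency samples, in milliseconds.
    static long percentile(long[] samplesMs, double pct) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical samples: one slow outlier dominates p99 but not p50.
        long[] latencies = {12, 15, 14, 230, 16, 13, 18, 900, 17, 14};
        System.out.printf("p50=%dms p95=%dms p99=%dms%n",
                percentile(latencies, 50),
                percentile(latencies, 95),
                percentile(latencies, 99));
    }
}
```

This is why averages mislead: the mean of those samples is about 125ms, yet half the requests finish in 15ms or less.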
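For the timeout pattern, here's a sketch of a client that times out, retries, and stays idempotent by reusing one Idempotency-Key header across retries. The endpoint, header name, and timeout values are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;
import java.util.UUID;

public class IdempotentRetryClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // The same key is sent on every retry so the server can deduplicate.
        String idempotencyKey = UUID.randomUUID().toString();

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://api.example.com/decisions"))
                .timeout(Duration.ofMillis(500))           // fail fast on slow responses
                .header("Idempotency-Key", idempotencyKey) // hypothetical dedup header
                .POST(HttpRequest.BodyPublishers.ofString("{\"txn\":\"123\"}"))
                .build();

        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("status=" + response.statusCode());
                return;
            } catch (HttpTimeoutException e) {
                System.err.println("attempt " + attempt + " timed out, retrying");
            }
        }
        // All retries failed; reconciliation later repairs any partial state.
        System.err.println("giving up after 3 attempts");
    }
}
```

Because the key is stable across retries, a server that processed the first attempt can return the original decision instead of handling the transaction twice.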
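The dynamic-timeout pattern can be sketched as a function of how much of the request budget remains; all of the numbers here are assumed for illustration:

```java
public class DynamicTimeout {
    static final long MAX_TIMEOUT_MS = 120;    // covers the I/O call 99.9% of the time (assumed)
    static final long MIN_TIMEOUT_MS = 25;     // covers the I/O call 80% of the time (assumed)
    static final long REQUEST_BUDGET_MS = 400; // end-to-end deadline (assumed)

    // Normally wait generously; once a GC pause or network hiccup has eaten
    // into the budget, shrink the timeout and accept a higher miss rate.
    static long timeoutFor(long elapsedMs) {
        long remaining = REQUEST_BUDGET_MS - elapsedMs;
        return remaining >= 2 * MAX_TIMEOUT_MS ? MAX_TIMEOUT_MS : MIN_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        System.out.println(timeoutFor(50));  // 120ms: plenty of budget left
        System.out.println(timeoutFor(350)); // 25ms: deadline is close, hurry
    }
}
```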
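For auto-healing, the key in-process move is to crash rather than limp along; here's a minimal fail-fast sketch, with the scaling group and load balancer doing the actual healing:

```java
public class FailFast {
    public static void main(String[] args) {
        // Crash the JVM on any unhandled exception so health checks fail,
        // the load balancer stops routing here, and the auto-scaling group
        // spins up a fresh replacement instance.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.err.println("fatal on " + thread.getName() + ": " + error);
            Runtime.getRuntime().halt(1); // exit immediately, skip shutdown hooks
        });

        throw new IllegalStateException("simulated unrecoverable condition");
    }
}
```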
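For latency plotting, the instrumentation half is just timestamping each sub-component per transaction. This sketch (the stage names are invented) emits one record per transaction to feed a plotting pipeline:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class StageTimer {
    private final Map<String, Long> stageMillis = new LinkedHashMap<>();

    // Timestamp before and after a stage, recording its duration by name.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            stageMillis.put(stage, (System.nanoTime() - start) / 1_000_000);
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        timer.time("parse", () -> "parsed");
        timer.time("enrich", () -> "enriched");
        timer.time("score", () -> "scored");
        // One record per transaction; plot these to spot warm-up curves and outliers.
        System.out.println(timer.stageMillis);
    }
}
```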
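Of the enemies listed, Nagle's algorithm is the easiest to neutralize in code: disable it with TCP_NODELAY so small writes go out immediately instead of waiting to coalesce with later writes, which interacts badly with delayed ACKs on request/response traffic. The host here is a placeholder:

```java
import java.net.Socket;

public class NoDelaySocket {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            // TCP_NODELAY trades a little bandwidth efficiency for latency.
            socket.setTcpNoDelay(true);
            socket.getOutputStream().write("HEAD / HTTP/1.0\r\n\r\n".getBytes());
            System.out.println("first response byte: " + socket.getInputStream().read());
        }
    }
}
```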