
Stuff The Internet Says On Scalability For September 9th, 2016

High Scalability

08 Sep 2016 — 13 min read

Hey, it's HighScalability time:

An alternate universe where Zeppelins rule the sky. 1929. (@AeroDork)

If you like this sort of Stuff then please support me on Patreon.

15%: Facebook's reduction in latency using HTTP2's server push; 1.9x: nanotube transistors outperform silicon; 200: projectors used to film a "hologram"; 50%: of people fall for phishing attacks (it's OK to click); 5x: increased engagement using Google's Progressive Web Apps; 115,000+: Cassandra nodes at Apple; $500 million: Pokémon Go; $150M: Delta's cost for datacenter outage;

Quotable Quotes:
- Dan Lyons: I wanted to write a book about what it’s like to be 50 and trying to reinvent yourself – that struggle. There are all these books and inspirational speakers talking about being a lifelong learner and it’s so great to reinvent yourself, the brand of you. And I wanted to say, you know, it’s not like that. It’s actually really painful.
- Engineers & Coffee: In modern application development everything is a stream now versus historically everything was a transaction. Make a request and then you're done. It's easier to write analytics on top of streams versus using Hive. It's cool that Kinesis is all real-time and has the power of SQL.
- David Smith: The [iOS] market has been pulling me along towards advertising based apps, and I’ve found that the less I fight back with anachronistic ideas about how software “should” be sold, the more sustainable a business I have.
- @tef_ebooks: (how do you keep a lisp user in suspense
- @bodil: Use tests to verify your assumptions. Use a type checker to verify your implementations. Always.
- tostitos1979: Here is a factoid for the youngins ... the Internet/Arpanet was created BEFORE the first microprocessor! In fact, Intel was originally founded to make RAM ICs. They only later created the first microprocessor (the 4004)!
- gsubes: Our tests showed that even with larger messages (100k price ticks per request) pipes were still a magnitude slower [than Memory Mapping].
- Quincy Larson: Did you know the average developer only gets two hours of uninterrupted work done a day? They spend the other 6 hours in varying states of distraction.
- StorageMojo: Achieving lower-than-DRAM pricing requires volume, and that’s where NRAM has a competitive advantage over, say, 3D XPoint. Processing can be done on today’s flash, DRAM or logic lines. NRAM processing only needs spin coating and patterning – as well as carbon nanotubes – which modern fabs all support.
- Xiao Mina: We’ve seen this story before: as cost of production and distribution go down, the range of creativity goes up.
- @clarkkaren: Give humans a system and they'll game it. The End.
- Jim Starkey: AmorphousDB is my modest effort to question everything database.
  
  The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata.
- @jdub: On Reddit: “What is the use of Elastic IPs, if I can use ELB or an Auto Scaling Group instead?” STUDENT, YOU HAVE ACHIEVED ZEN OF CLOUD.
- @BenedictEvans: A key premise for the next decade: it's easier for software to enter other industries than for other industries to hire software people
- @jasongorman: To clarify, "dependency injection" literally just means passing an object's collaborators as constructor/method params. That's all it is.
- jackpeterfletch: Grand solution to world hunger, available on Kindle!
- @swardley: Optimise flow. Often when you examine flows then you’ll find bottlenecks, inefficiencies and profitless flows. There will be things that you’re doing that you just don’t need to. Be very careful here to consider not only efficiency but effectiveness.
- @PatrickMcFadin: #uber is fully replicated and active-active to make sure you never get stranded. #cassandrasummit
- @FSVO: A monk named Chaitin found an algorithm for expressing the complexity of sutras. His master commented, “This monk could be shorter.”
- Dotzler: We [Firefox] can learn from the competition [Chrome]. The way they implemented multi-process is RAM-intensive, it can get out of hand. We are learning from them and building an architecture that doesn’t eat all your RAM.
- @hichaelmart: Although CPU bound calculations [on OpenWhisk] seem about 4x slower than Lambda, so not too bad. Lambda still the winner so far though.
- Shel Kaphan: Okay, I’m going to be building this website to run a bookstore [Amazon] and I haven’t done that before but it doesn’t sound so hard. When I’m done with that I’m not sure what I’ll do.
- sixhobbits: "Our logger failed silently" "Shouldn't that have been recorded somewhere?" "I guess it's turtles all the way down"
- @xmal: Trying to explain that CRDT causal contexts are a natural evolution of TCP sequence numbering and vector clocks in reliable causal broadcast
- Joi Ito: Just like it is impossible to make another Silicon Valley somewhere else, although everyone tries—after spending four days in Shenzhen, I’m convinced that it’s impossible to reproduce this ecosystem anywhere else.
- @adriancolyer: "My claim is that it is possible to write grand programs, noble programs, truly magnificent ones..." Knuth 1974
- @Excellion: According to legend, if you say Blockchain three times fast, your databases will magically become immutable & your company a fintech leader.
- bec0: The world has changed. Dennard scaling has mostly been replaced. The economic Moore's Law has morphed. It had to...we have all gotten used to its benefits.
- @cloud_opinion: 5 stages of Cloud Grief: It's not secure / It's someone's computer / We do private cloud / Hybrid cloud / Lambda is full of servers anyway
- @DDD_Borat: "Why you not like framework annotations in your code?" - "Would you put bumper sticker on a Ferrari?" Rofl
- @robert_winslow: Slow software is your fault. These are the real speed limits: billions of CPU instructions, GBs of RAM access, 100k+ SSD I/Os... per second.
- Walter Bentley: I am proud to say, OpenStack held up to the torment. Did not experience not one single API request failure throughout my numerous load tests — yet another proof point that OpenStack is ready for enterprise/production use.
- @xaprb: Let's fork it, say the people who have never put their heart and 5 years of their life into a product only to watch someone else fork it.
- @adrianco: People asking Docker to slow down is like OpenStack folks asking AWS to standardize and slow down.
- @amcafee: "In 1974, it was illegal for an airline to charge < $1,442 for a flight between New York City and Los Angeles."
- Fairly Nerdy: For most real world scenarios, where you are betting against the house which has a house edge, f* becomes negative, which means that you shouldn’t be playing that game. Truthfully it means that you should take the other side of the wager, become the house, and make them bet against you!
- Judd Kaiser: Experience shows that good scalability can be achieved on 10 GigE networking provided that you stay above about 50,000 cells per core. That means, for example, that a 20 M cell problem shows good scaling up to about 400 cores; beyond that, interprocess communication latency begins to dominate and scaling degrades.
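Judd Kaiser's rule of thumb above is easy to turn into arithmetic. A minimal sketch, assuming the only input is the 50,000 cells-per-core floor from the quote; the function name is made up for illustration:

```python
def max_cores_for_good_scaling(total_cells, min_cells_per_core=50_000):
    """Largest core count that still leaves min_cells_per_core cells on each core."""
    return total_cells // min_cells_per_core


# The 20 M cell problem from the quote.
print(max_cores_for_good_scaling(20_000_000))  # 400
```

It is only a rule of thumb for 10 GigE; faster interconnects or different solvers would move the floor.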

Maybe the real reason Uber wants driverless cars is that hiring, er...onboarding, drivers from across the globe is a really tough problem to solve. Each location has its own processes, and that kills scalability. Screening processes and regulations vary, some countries have a very long list of required documents, and onboarding flows vary. So you can't use the same app everywhere. Here's the story: How Uber Engineering Massively Scaled Global Driver Onboarding. The solution was, as it often is, to go meta and dynamic: the onboarding state machine (OSM) lets Uber easily configure a set of steps for each onboarding process in each country, state, city, or any other level of granularity, coupled with an event system that switches users from one step to another depending on their actions or input. The onboarding API can then easily query the OSM to know at which step in the process a user is. Clients are now stateless, responsible only for their UI, with 100% of the business logic in the shared back end. They went from Flask to Tornado and a lighter version of their initial JSON schema architecture, where only data is passed to the client, not UI definitions.
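To make the "meta and dynamic" idea concrete, here is a minimal sketch of a configuration-driven onboarding state machine. It is not Uber's code: the flow names, step names, events, and the `OnboardingStateMachine` class are all hypothetical, chosen only to show per-location step lists plus an event-driven transition.

```python
# Hypothetical per-location configuration: an ordered list of onboarding steps.
ONBOARDING_FLOWS = {
    "default": ["signup", "document_upload", "background_check", "activated"],
    "city_a": ["signup", "document_upload", "vehicle_inspection",
               "background_check", "activated"],
}

# Hypothetical mapping from a step to the event that completes it.
STEP_COMPLETED_BY = {
    "signup": "account_created",
    "document_upload": "documents_approved",
    "vehicle_inspection": "inspection_passed",
    "background_check": "check_cleared",
}


class OnboardingStateMachine:
    def __init__(self, flows, completed_by):
        self.flows = flows
        self.completed_by = completed_by
        self.position = {}  # user_id -> (flow key, index into that flow)

    def start(self, user_id, location):
        # Locations without their own flow fall back to the default one.
        flow_key = location if location in self.flows else "default"
        self.position[user_id] = (flow_key, 0)

    def current_step(self, user_id):
        # This is all a stateless client needs to ask before rendering its UI.
        flow_key, index = self.position[user_id]
        return self.flows[flow_key][index]

    def handle_event(self, user_id, event):
        # Advance only when the event is the one that completes the current step.
        flow_key, index = self.position[user_id]
        steps = self.flows[flow_key]
        if self.completed_by.get(steps[index]) == event and index + 1 < len(steps):
            self.position[user_id] = (flow_key, index + 1)
        return self.current_step(user_id)
```

Adding a new city then means adding one entry to the configuration rather than shipping a new client.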

We've learned the best security practice is to close everything down and allow only what you want to pass through, like ports on a router. Browsers do the opposite with Trusted Root Certificate Authorities. Browsers keep a massive list of potentially dodgy CAs and they let them all through. Security Now 576. Google publishes a list of Certificate Authorities it doesn't trust.

The vertical hardware advantage. Rene Ritchie: Far more importantly, at least to me, is the Apple A10 Fusion chip inside. It's quad core now... kinda. In a twist on big.LITTLE design, Apple has two high-performance cores, which are over 40% faster than last year and over 120x faster than the original iPhone. They're matched with two high-efficiency cores that provide significant power savings. Apple's using a custom-designed performance controller to manage it all, with real-time processor assignment for power/efficiency. It's got six graphics cores as well, which are 50% faster than last year but with only two-thirds the power draw. And for those keeping hockey-curve score at home, it's over 240x faster than the original. All told, it's enough to give iPhone 7 over two more hours of average battery life than the iPhone 6s, and iPhone 7 Plus over an hour more than iPhone 6s Plus.

More rough Algorithmic Justice. How Google obliterated my 4 year old Chrome extension featuring 24k+ users: I’m not playing the victim here, and I don’t expect this post to solve anything for my extension. I just want you to think twice before creating a revenue stream based on the Chrome Web Store. Good discussions on reddit and on HackerNews.

Regexes are so useful they are hard not to use, but their use has surprising implications at the system level. Solutions to slow regexes: quarantine or use a different regex engine. Loggly shares details in Lessons Learned from Using Regexes At Scale: Were we to process this slow regex along with all the other regexes, we increase overall parsing latency for all customers by about 1%. If the customer upgrades her account, however, and now her regex applies to 2% of our overall volume, parsing becomes twice as slow for every customer. Unless we have that much capacity sitting around or we are willing to spin up twice as many parsers, we simply cannot support such a slow-running regex. Therefore, the line between a regex that is too slow to support and not slow enough to make a difference is quite small. The less expressive Thompson NFA algorithm runs in linear time regardless of the regex or input.
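Why does the Thompson approach stay linear? Instead of backtracking, it tracks the set of all states the pattern could be in and advances that set once per input character. The sketch below is a toy, not Loggly's engine and not a real regex library: it assumes a tiny syntax of single characters and `.` with optional `?`, `*`, or `+` suffixes, no groups or alternation, and it matches the whole input.

```python
def parse(pattern):
    """Split a pattern into (symbol, quantifier) tokens; quantifier is '', '?', '*' or '+'."""
    tokens = []
    i = 0
    while i < len(pattern):
        symbol = pattern[i]
        quantifier = ""
        if i + 1 < len(pattern) and pattern[i + 1] in "?*+":
            quantifier = pattern[i + 1]
            i += 1
        tokens.append((symbol, quantifier))
        i += 1
    return tokens


def closure(states, tokens):
    """Add every state reachable without consuming input (skipping '?' and '*' tokens)."""
    seen = set()
    stack = list(states)
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        if state < len(tokens) and tokens[state][1] in ("?", "*"):
            stack.append(state + 1)
    return seen


def nfa_match(pattern, text):
    """Full-match text against pattern in O(len(pattern) * len(text)) steps."""
    tokens = parse(pattern)
    current = closure({0}, tokens)
    for ch in text:
        following = set()
        for state in current:
            if state == len(tokens):
                continue  # already past the last token, cannot consume more
            symbol, quantifier = tokens[state]
            if symbol == "." or symbol == ch:
                if quantifier in ("*", "+"):
                    following.add(state)  # may consume this token again
                following.add(state + 1)
        current = closure(following, tokens)
        if not current:
            return False
    return len(tokens) in current


# A pattern shape that is notoriously slow for backtracking engines.
n = 30
print(nfa_match("a?" * n + "a" * n, "a" * n))  # True
```

The state set can never hold more than one entry per token, so the work per character is bounded by the pattern length no matter how ambiguous the pattern is.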

What would be the top 125 software systems? Here's a list for buildings: RECORD’s Top 125 Buildings. My favorite is The Glass House. For software, I don't know, and why I don't know is interesting. I guess I don't think of software on the same terms as buildings, which seems wrong on the surface. They are analogous. I can't think of any software that has the same clarity, simplicity, and beauty. Ideas reified are always messy in code.

Tab hoarders rejoice. A multiprocess architecture for Firefox yields a 400 percent improvement in responsiveness and a 700 percent improvement in responsiveness for loading large web pages. Why? LuminescentMoon: Firefox handles multiple tabs better than Chrome. Firefox doesn't use process-per-site-instance so it doesn't have the overhead of an entire JS VM stack per site instance. Also, it aggressively swaps the entire state of a tab's site content to disk when running low on RAM. DrDichotomous: But if you don't know what code you're running, and can't ensure it's well-behaved, then all bets are off. One piece of crappy JS, UI interaction flaw, or weird browser event loop quirk could leave all the other threads stuttering or even hanging entirely.

Videos from ICML (International Conference on Machine Learning) 2016 are now available. You might like the tutorial on Deep Reinforcement Learning.

Animats: System V messages weren't bad, but they aren't used much. The real trick is tight IPC and CPU scheduling integration. You want a send from process A to process B to result in an immediate transfer of control from process A to process B, preferably on the same CPU. The data you just sent is in the CPU's cache. QNX is one of the few OSs where somebody thought about this. With unidirectional or pipe-like IPC, the sender sends, which unblocks the receiver, but the sender doesn't block. So the OS can't just toss control to the receiver. The receiver goes on the ready-to-run list and, quite likely, another CPU starts running it. Meanwhile, the sending process runs for a short while longer and then typically blocks reading from some reply pipe/queue. It takes two extra trips through the scheduler that way. Worse, if the CPU is busy, sending a message can put you at the end of the line for CPU time, which makes for awful IPC latency under load. It's one of the classic mistakes in microkernel design.

For HPC is it InfiniBand or bust? Not so much. Getting Faster, Cost-effective Simulation on the Cloud: The data below shows results for a standard ANSYS CFD benchmark problem which simulates the flow around a Formula 1 race car with 140 million cells. The results may alter your opinion about whether or not you can do “real HPC” on AWS. The results show near-ideal scalability well past 1000 cores and a reduced overall solution time even beyond 2000 cores.

Not sure about this. The latency still sucks and disrupts flow. The death of localhost and the rise of cloud development.

State machines don't get the play they should. Here's a great example of their use from eBay for generating and validating a one-time code: Finite-State Machine for Single-Use Code Authentication. You get robustness, security, and readability, and the structure allows developers to manage changes and configure values effectively and almost bug-free. State machines are one of the few ways thought is directly expressed in code. eBay uses squirrel-foundation as its state machine library.
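For a feel of what a single-use code looks like as a state machine, here is a minimal sketch. It is not eBay's design and does not use squirrel-foundation: the states (`ISSUED`, `USED`, `EXPIRED`, `LOCKED`), the three-attempt limit, and the five-minute lifetime are all assumptions picked for illustration.

```python
import hmac
import secrets
import time

MAX_ATTEMPTS = 3    # assumed limit on wrong guesses
TTL_SECONDS = 300   # assumed lifetime of a code


class SingleUseCode:
    def __init__(self, now=None):
        self.code = f"{secrets.randbelow(10**6):06d}"  # random six-digit code
        self.issued_at = time.time() if now is None else now
        self.attempts = 0
        self.state = "ISSUED"

    def submit(self, candidate, now=None):
        """Return True only for the single transition ISSUED -> USED."""
        now = time.time() if now is None else now
        if self.state != "ISSUED":
            return False  # USED, EXPIRED and LOCKED are terminal states
        if now - self.issued_at > TTL_SECONDS:
            self.state = "EXPIRED"
            return False
        # Constant-time comparison so timing does not leak how close a guess was.
        if hmac.compare_digest(candidate.encode(), self.code.encode()):
            self.state = "USED"
            return True
        self.attempts += 1
        if self.attempts >= MAX_ATTEMPTS:
            self.state = "LOCKED"
        return False
```

Because every path goes through one `state` field, questions like "can a code be replayed?" or "what happens after too many guesses?" are answered by reading the transitions rather than hunting through flags.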

Lessons Learned From Software Rewrites. Collect metrics to figure out how your system is really being used. You might be surprised. And Beware of zombies: To our surprise we found out that's because it was querying the database 31 times within a single execution, mostly doing work related to dead features. Once potential candidates were identified, they were put into a code quarantine for a few weeks to make sure no one was using them, and then deleted completely. We spent a few months doing that and ended up reducing the number of features by ~40%, which was a good return on investment.

You can still be IO bound using SSDs. At least if you are searching hundreds of billions of documents like Dropbox. Improving the performance of full-text search: latency was caused by an increase in I/O operations per sec (IOPS) on the index servers. This increase caused the index servers to be I/O bound...to make the encoding more compact, we made two changes — we switched to using delta encoding for the list of document IDs, and using run-length encoding for the list of attributes. Results: reduced the total size of the encoded search index by 33%...significant improvements in the 95th percentile latency of our system.
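The two encodings Dropbox mentions are simple to sketch. This is an illustration of the general techniques, not Dropbox's index format; the function names and sample data are made up. Delta encoding stores the gaps between sorted document IDs, and run-length encoding stores each repeated attribute once with a count.

```python
def delta_encode(sorted_ids):
    """Store each ID as its difference from the previous one."""
    previous = 0
    deltas = []
    for doc_id in sorted_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original IDs with a running sum."""
    total = 0
    ids = []
    for delta in deltas:
        total += delta
        ids.append(total)
    return ids


def rle_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    values = []
    for value, count in runs:
        values.extend([value] * count)
    return values


print(delta_encode([1000, 1003, 1004, 1010]))                # [1000, 3, 1, 6]
print(rle_encode(["title", "title", "body", "body", "body"]))  # [('title', 2), ('body', 3)]
```

The small gaps only save space once they are written with a variable-length integer format, which is an assumption here since the excerpt doesn't say how the bytes are laid out.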

To slay the Java performance dragon you could do no better than join Martin Thompson's quest for predictable latency with Java concurrency, where you will discover algorithms and data structures that provide very high throughput while keeping latency low and predictable.

When MapReduce is not enough there are the Next generation tools for data science: Spark’s sweet spot is quickly developing exploratory/interactive analysis and iterative algorithms, e.g., gradient descent and MCMC, whereas Dataflow’s sweet spot is processing streaming data and highly-optimized, robust, fixed pipelines...Spark’s sweet-spot: iterative algorithms...Beam/Dataflow’s sweet spot: streaming processing.

Another good example of how at scale you really need to worry about IO, and that requires looking at your encoding scheme to reduce the amount of data being written. Facebook's MyRocks: A space- and write-optimized MySQL database is heavily optimized for space efficiency. The result: MyRocks writes orders of magnitude less than InnoDB and is more than six times faster than InnoDB.

Very nice example with code of GraphQL with the Serverless Framework. Maybe a bit overboard as a solution to remember when to water your plants :)

Alpha Centauri here we come. What's it like to be on a project that will only come to full fruition 30 years hence? Breakthrough Starshot Report 2: Drilling Down to the Basics explains that the mission is to “…ensure that Starshot engineering activities can and will result in a 0.2c mission to Alpha Centauri.”

I don't feel as safe as I used to. PEGASUS iOS Kernel Vulnerability Explained. Deserializing XML *in* the iOS Kernel. Complex macros that are over 10 lines long?

There is another. Chinese chip maker Phytium Technology. Details Emerge On China’s 64-Core ARM Chip: The Mars FT-2000/64 chip is based on the FTC661 generation of Xiaomi cores, and has the same 512 KB L2 cache per core, but delivers a whopping 128 MB of L3 cache across the 64 cores on the die. The cores on the Mars chip have a design frequency of between 1.5 GHz and 2 GHz, just like the cores in the Earth chips. The Mars processor has sixteen DDR3 memory controllers, which deliver a total of 204.8 GB/sec of memory bandwidth running at 1.6 GHz. The whole shebang needs 2,892 pins and has a maximum power draw of 100 watts.

StorageMojo with a good set of Notes on VMworld 2016. Also a good overview in Network Break 102: VMworld, Huawei Connect & A Bit Of Docker. It appears the strategy for VMware is to keep customers close by being everywhere cloud: hybrid, public, private, SDN, hyper-converged, OpenStack, containers.

If you must share state, and we know you want to, then Advanced synchronization methods can boost the performance of multicore software: If a thread holding a lock is preempted, a lock-based algorithm cannot make progress until it runs again. Indeed, as figure 8b shows, when the number of threads exceeds 20, the throughput of the lock-based delegation algorithm plummets by 15 times, whereas both LCRQ and the MS queue maintain their peak throughput.

Nice exploration and tying together of ideas, but I wish they would have delivered on the promise of solving the Facebook notification problem with open-source off-the-shelf components. Event sourcing, CQRS, stream processing and Apache Kafka: What’s the connection? Event sourcing provides an efficient means for applications to log their inherent, and inevitable changes in state, using a zero loss protocol. This means recovery is simple and efficient, as it is based entirely on a journal, or an ordered log like Kafka. CQRS goes a step further, turning raw events into a queryable view; a view that is carefully formed to be relevant to other business processes. Kafka Streams provides both the declarative functions required to create these views in a streaming fashion, as well as a scalable query layer, so users can interact with this view directly. The result is an event-sourcing and CQRS based application architecture, wherever applicable, built on Apache Kafka; allowing such applications to also leverage the core competency of Kafka.
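The split between the journal and the view can be shown without any Kafka at all. In this minimal sketch a plain Python list stands in for the ordered log, and a fold over it stands in for the streaming job that builds the read-side view; the account events and function names are invented for illustration and none of this is the Kafka Streams API.

```python
from collections import defaultdict


def append_event(log, event_type, account, amount):
    """Write side: record what happened, never update in place."""
    event = {"offset": len(log), "type": event_type,
             "account": account, "amount": amount}
    log.append(event)
    return event


def build_balance_view(log):
    """Read side: fold the events into a queryable view of account -> balance."""
    balances = defaultdict(int)
    for event in log:
        if event["type"] == "deposited":
            balances[event["account"]] += event["amount"]
        elif event["type"] == "withdrawn":
            balances[event["account"]] -= event["amount"]
    return dict(balances)


log = []
append_event(log, "deposited", "alice", 100)
append_event(log, "withdrawn", "alice", 30)
append_event(log, "deposited", "bob", 50)
print(build_balance_view(log))  # {'alice': 70, 'bob': 50}
```

Recovery is the same fold run again from offset zero, which is why the journal, not the view, is the source of truth.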

Good explanation of Merkle Trees: a data structure where every non-leaf node contains the hash of the labels of its child nodes, and the leaves have their own values hashed. Because of this characteristic, Merkle Trees are used to verify that two or more parties have the same data without exchanging the entire data collection.
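That definition translates almost directly into code. A minimal sketch, assuming SHA-256 and a binary tree, and using one common convention of duplicating the last hash when a level has an odd number of nodes; real systems differ on both choices.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Hash the leaves, then hash pairs of child hashes upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: pair the odd node with itself
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


ours = merkle_root([b"block-1", b"block-2", b"block-3"])
theirs = merkle_root([b"block-1", b"block-2", b"block-3"])
print(ours == theirs)  # True: same data, so only the roots need to be compared
```

Two parties compare 32-byte roots; only if the roots differ do they need to walk down the tree to find which blocks disagree.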

yahoo/pulsar: a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API. Horizontally scalable (Millions of independent topics and millions of messages published per second). Strong ordering and consistency guarantees. Low latency durable storage. Designed for being deployed as a hosted service.

Netflix/ndbench (more): pluggable cloud-enabled benchmarking tool that can be used across any data store system. NDBench provides plugin support for the major data store systems that we use -- Cassandra (Thrift and CQL), Dynomite (Redis), and Elasticsearch. It can also be extended to other client APIs.

Another installment of Geeking with Greg's Quick Links, your Xkcd and SMBC leader.
