Stuff The Internet Says On Scalability For March 24th, 2017

High Scalability

24 Mar 2017 — 18 min read

Hey, it's HighScalability time:

This is real and oh so eerie. Custom microscope takes a 33 hour time lapse of a tadpole egg dividing.
If you like this sort of Stuff then please support me on Patreon.

40Gbit/s: indoor optical wireless networks; 15%: energy produced by wind in Europe; 5: new tasty particles; 2000: Qubits are easy; 30 minutes: flight time for electric helicopter; 42.9%: of heathen StackOverflowers prefer tabs;

Quotable Quotes:
- @RichRogersIoT: "Did you know? The collective noun for a group of programmers is a merge-conflict." - @omervk
- @tjholowaychuk: reviewed my dad's company AWS expenses, devs love over-provisioning, by like 90% too, guess that's where "serverless" cost savings come in
- @karpathy: Nature is evolving ~7 billion ~10 PetaFLOP NI agents in parallel, and has been for ~10M+s of years, in a very realistic simulator. Not fair.
- @rbranson: This is funny, but legit. Production software tends to be ugly because production is ugly. The ugliness outpaces our ability to abstract it.
- @joeweinman: @harrietgreen1 : Watson IoT center opened in Munich... $200 million dollar investment; 1000 engineers #ibminterconnect
- David Gerard: This [IBM Blockchain Service] is bollocks all the way down.
- digi_owl: Sometimes it seems that the diff between a CPU and a cluster is the suffix put on the latency times.
- Scott Aaronson: I’m at an It from Qubit meeting at Stanford, where everyone is talking about how to map quantum theories of gravity to quantum circuits acting on finite sets of qubits, and the questions in quantum circuit complexity that are thereby raised.
- Founder Collective: Firebase didn’t try to do everything at once. Instead, they focused on a few core problems and executed brilliantly. “We built a nice syntax with sugar on top,” says Tamplin. “We made real-time possible and delightful.” It is a reminder that entrepreneurs can rapidly add value to the ecosystem if they really focus.
- Elizabeth Kolbert: Reason developed not to enable us to solve abstract, logical problems or even to help us draw conclusions from unfamiliar data; rather, it developed to resolve the problems posed by living in collaborative groups.
- Western Union: the ‘telephone’ has too many shortcomings to be seriously considered as a means of communication.
- Arthur Doskow: being fair, being humane may cost money. And this is the real issue with many algorithms. In economists’ terms, the inhumanity associated with an algorithm could be referred to as an externality.
- Francis: The point is that even if GPUs will support lower precision data types exclusively for AI, ML and DNN, they will still carry the big overhead of the graphics pipeline, hence lower efficiency than an FPGA (in terms of FLOPS/WATT). The winner? Dedicated AI processors, e.g. Google TPU
- James Glasnapp: When we move out of the physical space to a technological one, how is the concept of a “line” assessed by the customer who can’t actually see the line?
- Frank: On the other hand, if institutionalized slavery still existed, factories would be looking at around $7,500 in annual costs for housing, food and healthcare per “worker”.
- Baron Schwartz: If anyone thought that NoSQL was just a flare-up and it’s died down now, they were wrong...In my opinion, three important areas where markets aren’t being satisfied by relational technologies are relational and SQL backwardness, time series, and streaming data.
- CJefferson: The problem is, people tell me that if I just learn Haskell, Idris, Closure, Coffescript, Rust, C++17, C#, F#, Swift, D, Lua, Scala, Ruby, Python, Lisp, Scheme, Julia, Emacs Lisp, Vimscript, Smalltalk, Tcl, Verilog, Perl, Go... then I'll finally find 'programming nirvana'.
- @spectatorindex: Scientists had to delete Urban Dictionary's data from the memory of IBM's Watson, because it was learning to swear in its answers.
- Animats: [Homomorphically Encrypted Deep Learning] is a way for someone to run a trained network on their own machine without being able to extract the parameters of the network. That's DRM.
- Dino Dai Zovi: Attackers will take the least cost path through an attack graph from their start node to their goal node.
- @hshaban: JUST IN: Senate votes to repeal web privacy rules, allowing broadband providers to sell customer data w/o consent including browsing history
- KBZX5000: The biggest problem you face, as a student, when taking a programming course at a University level, is that the commercially applicable part of it is very limited in scope.
  
  You tend to become decent at writing algorithms. A somewhat dubious skill, unless you are extremely gifted in mathematics and / or somehow have access to current or unique hardware IP's (IP as in Intellectual Property).
- Brian Bailey: The increase in complexity of the power delivery network (PDN) is starting to outpace increases in functional complexity, adding to the already escalating costs of modern chips. With no signs of slowdown, designers have to ensure that overdesign and margining do not eat up all of the profit margin.
- rbanffy: Those old enough will remember the AS/400 (now called iSeries) computers map all storage to a single address space. You had no disk - you had just an address space that encompassed everything and an OS that dealt with that.
- @disruptivedean: Biggest source of latency in mobile networks isn't milliseconds in core, it's months or years to get new cell sites / coverage installed
- Greg Ferro: Why Is 40G Ethernet Obsolete? Short Answer: COST. The primary issue is that 40G Ethernet uses 4x10G signalling lanes. On UTP, 40G uses 4 pairs at 10G each.
- @adriaanm: "We chose Scala as the language because we wanted the latest features of Spark, as well as [...] types, closures, immutability [...]"
- ajamesm: There's a difference between (A) locking (waiting, really) on access to a critical section (where you spinlock, yield your thread, etc.) and (B) locking the processor to safely execute a synchronization primitive (mutexes/semaphores).
- @evan2645: "Chaos doesn't cause problems, it reveals them" - @nora_js #SREcon17Americas #SRECon17
- chrissnell: We've been running large ES clusters here at Revinate for about four years now. I've found the sweet spot to be about 14-16 data nodes, plus three master-only nodes. Right now, we're running them under OpenStack on top of our own bare metal with SAS disks. It works well but I have been working on a plan to migrate them to live under Kubernetes like the rest of our infrastructure. I think the answer is to put them in StatefulSets with local hostPath volumes on SSD.
- @beaucronin: Major recurring theme of deep learning twitter is how even those 100% dedicated to the field can't keep up with progress.
- Chris McNab: VPN certificates and keys are often found within and lifted from email, ticketing, and chat services.
- @bodil: And it took two hours where the Rust version has taken three days and I'm still not sure it works.
- azirbel: One thing that's generalizable (though maybe obvious) is to explicitly define the SLAs for each microservice. There were a few weeks where we gave ourselves paging errors every time a smaller service had a deploy or went down due to unimportant errors.
- bigzen: I'm worn out on articles dissing the performance of SQL databases without quoting any hard numbers and then proceeding to replace the systems with no thanks of development in the latest and great tech. I have nothing against spark, but I find it very hard to believe that alarm code is now readable than SQL. In fact, my experience is just the opposite.
- jhgg: We are experimenting with webworkers to power a very complicated autocomplete and scoring system in our client. So far so good. We're able to keep the UI running at 60fps while we match, score and sort results in a web-worker.
- DoubleGlazing: NoSQL doesn't reduce development effort. What you gain from not having to worry about modifying schemas and enforcing referential integrity, you lose from having to add more code to your app to check that a DB document has a certain value. In essence you are moving responsibility for data integrity away from the DB and in to your app, something I think is quite dangerous.
- Const-me: Too bad many computer scientists who write books about those algorithms prefer to view RAM in an old-fashioned way, as fast and byte-addressable.
- Azur: It always annoys me a bit when tardigrades are described as extremely hardy: they are not. It is ONLY in the desiccated, cryptobiotic, form they are resistant to adverse conditions.
- rebootthesystem: Hardware engineers can design FPGA-based hardware optimized for ML. A second set of engineers then uses these boards/FPGA's just as they would GPU's. They write code in whatever language to use them as ML co-processors. This second group doesn't have to be composed of hardware engineers. Today someone using a GPU doesn't have to be a hardware engineer who knows how to design a GPU. Same thing.
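
Dino Dai Zovi's quote above about attackers taking the least-cost path through an attack graph maps directly onto a shortest-path search. Here is a minimal sketch using Dijkstra's algorithm. The graph, node names, and costs are all invented for illustration and don't come from any real assessment.

```python
import heapq

def cheapest_attack_path(graph, start, goal):
    """Dijkstra's algorithm over a dict of {node: {neighbor: cost}}.

    Returns (total_cost, path) for the cheapest route, or None if the
    goal cannot be reached from the start node.
    """
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, step in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + step, nxt, path + [nxt]))
    return None

# Hypothetical attack graph: edge weights are made-up "attacker effort" scores.
ATTACK_GRAPH = {
    "internet": {"phish_employee": 2, "exploit_vpn": 8},
    "phish_employee": {"wiki_credentials": 1},
    "exploit_vpn": {"database": 1},
    "wiki_credentials": {"database": 2},
}
```

Calling `cheapest_attack_path(ATTACK_GRAPH, "internet", "database")` shows which chain of steps a defender should make more expensive first.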

There should be some sort of Metcalfe's law for events. Maybe: the value of a platform is proportional to the square of the number of scriptable events emitted by unconnected services in the system. CloudWatch Events Now Supports AWS Step Functions as a Target. @ben11kehoe: This is *really* useful: Automate your incident response processes with bulletproof state machines #aws

Cute faux O'Reilly book cover. Solving Imaginary Scaling Issues.

Intel's Optane SSD is finally out. Though it doesn't quite meet its initial "this will change everything" promise, it still might change a lot of things. Intel’s first Optane SSD: 375GB that you can also use as RAM. 10x DRAM latency. 1/1000 NAND latency. 2400MB/s read, 2000MB/s write. 30 full-drive writes per day. 2.5x better density. $4/GB (1/2 RAM cost). 1.5TB capacity. 500k mixed random IOPS. Great random write response. Targeted at power users with big files, like databases. NDAs are still in place so there's more to learn later. PCPerspective: comparing a server with 768GB of DRAM to one with 128GB of DRAM combined with a pair of P4800X's, 80% of the transactions per second were possible (with 1/6th of the DRAM). More impressive was that matrix multiplication of the data saw a 1.1x *increase* in performance. This seems impossible, as Optane is still slower than DRAM, but the key here was that in the case of the DRAM-only configuration, half of the database was hanging off of the 'wrong' CPU. foboz1: For anyone think that this a solution looking for a problem, think about two things: Big Data and mobile/embedded. Big Data has an endless appetite for large quantities for memory and fast storage; 3D XPoint plays into the memory hierarchy nicely. At the extreme other end of the scale, it may be fast enough to obviate the need for having DRAM+NAND in some applications. raxx7: And 3D XPoint isn't free of limitations yet. RAM has 50-100 ns latency, 50 GB/s bandwidth (128 bit interface) and unlimited write endurance. If 3D XPoint NVDIMM can't deliver this, we'll still need to manage the difference between RAM and 3D XPoint NVDIMM. zogus: The real breakthrough will come, I think, when the OS and applications are re-written so that they no longer assume that a computer's memory consists of a small, fast RAM bank and a huge, slow persistent set of storage--a model that had held true since just about forever.
VertexMaster: Given that DRAM is currently an order of magnitude faster (and several orders vs this real-world x-point product) I really have a hard time seeing where this fits in. sologoub: we built a system using Druid as the primary store of reporting data. The setup worked amazingly well with the size/cardinality of the data we had, but was constantly bottlenecked at paging segments in and out of RAM. Economically, we just couldn't justify a system with RAM big enough to hold the primary dataset...I don't have access to the original planning calculations anymore, but 375GB at $1520 would definitely have been a game changer in terms of performance/$, and I suspect be good enough to make the end user feel like the entire dataset was in memory.

Perhaps the biggest problem Firebase has is developers getting over the hurt and betrayal from their last software relationship. valuearb: I finished the site on time, on budget, the web service demoed great at its launch show, and the client signed up a ton of customers and was just waiting for our e-commerce implementation to launch their important new business. Then Parse was canceled.

In an adversarial environment if you can't control your algorithms should you code them? Google Search Lead: I Can't Go After Individual Problems In Search Results. You make yourself the mechanism for amplification attacks. All a group need do is influence weightings in Google's algorithm and it impacts billions of people. Not a bad day's work.

Videos from Strata + Hadoop World 2017 are now available.

Pinterest: Driving user growth with performance improvements. Improving mobile web home landing page performance by 60 percent increased the mobile signup conversion rate by 40 percent. How? Lazy loading of code and assets in smaller chunks; switching to React yielded substantial performance gains from its rendering model; streaming responses instead of buffering gets bytes to the browser faster; introduced multiple layers of caching in our CDN setup, enabled IPv6, switched to higher tiers of service (CDN) and introduced SSL edge termination (DSA) globally; on the backend, parallelize calls that are unnecessarily sequential; only return data that's needed to the browser. Result: 40 percent decrease in Pinner wait time, a 15 percent increase in SEO traffic and a 15 percent increase in conversion rate to signup.
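
The "parallelize calls that are unnecessarily sequential" item is easy to picture in code. Below is a minimal sketch using Python's `asyncio`. The `fetch` helper and the three call names are hypothetical stand-ins for independent backend requests, not anything from Pinterest's stack, and the sleep simulates network latency.

```python
import asyncio
import time

async def fetch(name, delay):
    """Hypothetical backend call; sleeps instead of doing real network I/O."""
    await asyncio.sleep(delay)
    return name + "-data"

async def load_sequential():
    # Each call waits for the previous one, so latencies add up.
    pins = await fetch("pins", 0.1)
    boards = await fetch("boards", 0.1)
    user = await fetch("user", 0.1)
    return [pins, boards, user]

async def load_parallel():
    # The calls are independent, so issue them together;
    # total time is roughly that of the slowest call.
    results = await asyncio.gather(
        fetch("pins", 0.1), fetch("boards", 0.1), fetch("user", 0.1)
    )
    return list(results)

def timed(coro_fn):
    """Run a coroutine function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

Both versions return the same data in the same order. Only the waiting overlaps, which is why this kind of change tends to be low risk when the calls really are independent.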

Farmers hate DRM for much the same reasons as the digerati: they want to fix their stuff when it breaks and they don't want some faceless megacorp shutting down their stuff. Why American Farmers Are Hacking Their Tractors With Ukrainian Firmware: "To avoid the draconian locks that John Deere puts on the tractors they buy, farmers throughout America's heartland have started hacking their equipment with firmware that's cracked in Eastern Europe and traded on invite-only, paid online forums." The question is: whose stuff is it? We increasingly do not own the things we buy. Glenn Reynolds: Do we even own things anymore? Also, Five States Are Considering Bills to Legalize the 'Right to Repair' Electronics.

Isn't this a little like saying the future is in biology, not farming? The future is algorithms, not code. What is one without the other?

So many fascinating details on the Tactics, Techniques, and Procedures of a Russian-speaking hacker called “M4g”. You'll find a mesmerizing list of attacks. Who is vulnerable? Organizations embracing cloud services but not 2FA were soft targets. All those wonderful helpful knowledge bases made available over the internet are actually instruction booklets on how to hack your system. What to do? Segregate risky Internet-exposed servers; Do not leave secrets on your wiki, etc; Don’t re-use cryptographic seeds, keys, or credentials across production and non-production; Software certificates are an insufficient second factor. Use a real 2FA solution such as Duo and consider hardware tokens to protect private keys via FIPS 201–1 and other standards, as supported by the YubiKey 4 environments; Compromised servers made odd DNS requests (e.g. resolving m4g.ru). Use a monitoring tool such as DNS Analytics for Splunk to flag anomalies.

Videos from Facebook's Video @Scale 2017 are now available. Topics include: Transitioning Codecs on Mobile; Streaming 360 Video; Next-Generation Transport for Live, Glitch-Free, High-Quality Video Delivery Over Commodity Internet; Scaling Low-Latency Live Streams; Video Understanding @Scale.

Building Safe A.I.: "In this blogpost, we're going to train a neural network that is fully encrypted during training (trained on unencrypted data). The result will be a neural network with two beneficial properties. First, the neural network's intelligence is protected from those who might want to steal it, allowing valuable AIs to be trained in insecure environments without risking theft of their intelligence. Secondly, the network can only make encrypted predictions (which presumably have no impact on the outside world because the outside world cannot understand the predictions without a secret key)." Good discussion on HackerNews of the implications.

The Stack That Helped Opendoor Buy and Sell Over $1B in Homes: We use Puma as our webserver, and Postgres for our database — one big benefit is the PostGIS extension for location data. Sidekiq runs our asynchronous jobs with support from Redis. Elasticsearch shows up everywhere in our internal tools... Webpack to build our frontend apps, and serve them using the Rails Asset Pipeline... Imgix to store photos of our homes...we try to break isolated logic out into microservices...uses a version-history-aware computation graph to calculate and back-test our internal costs...To let these services authenticate to one another, we use an Elixir app called Paladin...Authentication is based on JWTs provided by Warden and Guardian...For data ingestion, we pull from a variety of sources (like tax record and assessor data). We dump most of this data into an RDS Postgres database...For our machine learning model, we use Python with building blocks from SqlAlchemy, scikit-learn, and Pandas. We use Flask for routing/handling requests. We use Docker to build images and Kubernetes for deployment and scaling...We recently migrated to Google’s BigQuery...we added Twilio so we could automatically send unlock codes over SMS...We built our app in React Native...We manage our repositories and do code reviews on GitHub...We use Heroku for hosting, and run automated tests on CircleCI. Slack bots report what’s being deployed...Help Scout and Dyn for emails...Talkdesk and Twilio for calls and customer service...HelloSign for online contract signing...New Relic and Papertrail for system monitoring...Sentry for error reporting...For analytics, we’ve used a lot of tools: Mixpanel for the web, Amplitude for mobile, Heap for retroactive event tracking...Looker for digging into that data and making dashboards.

Is the Internet about content or protocols? A lot of people think it's all about content. A Better Way to Organize the Internet: Content-Centric Networking: we’ve developed a new architecture based on how information is organized within the network, rather than the IP addresses of hosts. That’s why it’s called content-centric networking—it’s based on how content is named and stored, instead of where it is located. We’ve designed new protocols that can find and retrieve content from wherever it happens to be in the network at a given time and also perform many additional tasks that could make networks faster, more resilient, and more secure.

The Cloudcast: There is a general understanding you have to make strategic investments if you are going to be successful...for some organizations it goes as far as saying for this particular initiative you are answering simple questions: does this make me more efficient? Does this make me more secure? Does this make me more agile? Is this part of a new strategic business? If the answer to all those questions is no then you go back and ask why are we even spending money on this?

Did you know there's something called Data-Center TCP? Great deep dive on TCP with Thomas Graf, Linux Core Team. TCP in the Data Center and Beyond. Though for east-west traffic why are we still using TCP?

How to Scale PostgreSQL on AWS–Learnings from Citus Cloud. Highly Available: A common way to do that is through integrating PostgreSQL streaming replication with a distributed synchronization solution such as etcd or ZooKeeper. In fact, open source solutions such as Governor and Patroni do just that. Failovers: You can therefore start by attaching an elastic IP to the primary node. When your monitoring agent promotes a new secondary, you can detach this elastic IP from the old primary and attach it to the newly promoted machine. This way, the application doesn’t need to change the endpoint it talks to – it happens magically. Scale vertically: AWS’ Elastic Block Storage (EBS) enables you to quickly detach from one EC2 instance and attach to another instance that has more resources. Scale horizontally: Running your database on the cloud removes significant operational burden associated with deploying a distributed RDBMS. Disaster recovery: PostgreSQL has a rich ecosystem and comes with open source technologies for automatic backups.

A better blockchain? Scaling Consensus? This Turing Winner Thinks He's Found a Way: By employing cryptographic sortition, the theory is that algorand can scale on demand. Other benefits include security and speed...In algorand, a small subset of players run Byzantine consensus on behalf of the entire system. That allows the protocol to be run at higher speeds, and as more players are replaced in each step, the idea is it makes the system secure in an adversarial environment...Once agreement is reached, and the block is certified by the signatures of a sufficient number of players in the last step of byzantine agreement, that block is then gossiped through the network so all users in the system can add it to the blockchain.
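
To get a feel for how "a small subset of players" can be chosen without a coordinator, here is a heavily simplified sketch. It is not Algorand's actual mechanism: a plain SHA-256 hash stands in for the verifiable random functions, stake weighting, and proofs a real protocol needs, and the `seed`, `round_no`, and `fraction` parameters are assumptions made for illustration.

```python
import hashlib

def is_selected(seed, round_no, player_id, fraction):
    """Return True if this player's per-round draw falls under the threshold.

    Assumption: a public hash replaces the private, verifiable randomness
    a real protocol would use, so this is illustrative only.
    """
    material = seed + round_no.to_bytes(8, "big") + player_id.encode("utf-8")
    draw = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    # Compare as integers: draw is in [0, 2**64), threshold scales with fraction.
    return draw < int(fraction * 2**64)

def committee(seed, round_no, players, fraction):
    """Every player runs the same check locally; no coordinator picks the set."""
    return [p for p in players if is_selected(seed, round_no, p, fraction)]
```

Because the draw depends on the round number, the committee changes from round to round, which is the property the article points to when it says players are replaced at each step.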

NoSQL versus RDBMS wars still going strong. Choose SQL: choose an SQL database for your web application. Why? It's not really easier for programmers. Scaling is not and most probably won’t be your problem. Unless you are running a logging service do you really need faster writes?

Researchers are using Darwin’s theories to evolve AI, so only the strongest algorithms survive: the OpenAI team set 1,440 worker algorithms to the task of playing Atari. The workers played until they reached Game Over, at which point they reported their scores to the master. The algorithms that garnered the best scores were copied, as in the Google research, and the copies were randomly mutated. The mutated workers then went back into rotation and the process repeated itself, with advantageous mutations being rewarded and bad ones killed.
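
The copy-the-best-and-mutate loop described above fits in a few lines. This is a toy sketch, not OpenAI's implementation: the `score` function is a made-up stand-in for an Atari game score, and keeping the top performers unmutated between rounds is an assumption added here so the best score never regresses.

```python
import random

def score(params):
    """Toy stand-in for a game score: higher is better, peak at (3, 3, 3)."""
    return -sum((p - 3.0) ** 2 for p in params)

def evolve(generations=50, workers=40, keep=10, sigma=0.3, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(workers)]
    history = []
    for _ in range(generations):
        # Workers "report their scores to the master", which ranks them.
        ranked = sorted(population, key=score, reverse=True)
        best = ranked[:keep]
        history.append(score(best[0]))
        # Copy randomly chosen top performers and mutate each copy.
        children = [
            [gene + rng.gauss(0, sigma) for gene in rng.choice(best)]
            for _ in range(workers - keep)
        ]
        population = best + children
    return max(population, key=score), history
```

`evolve()` returns the best parameter set found plus the best score per generation, which makes it easy to plot how quickly advantageous mutations accumulate.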

The video isn't available yet, but Ben Stopford has put together a deck that makes sense on its own: QCon 2017 Slides: The Power of the Log. Good stuff.

A Small EC2 Instance Can Handle HN Front Page: I was running a small EC2 instance with an 8gb volume attached. 1 vCPU and 2GB memory. That’s it. No load balancers. No elastic beanstalk or auto scaling. No “kubernetes cluster”. Nope. Just a tiny little server meant to entertain a thousand guests. I was stunned at the results. Throughout a barrage of what would in cumulative turn out to be over 120,000 visits and thousands of simultaneous requests, the server did not so much as stutter. The sites loaded as fast as ever, and note syncing requests completed almost instantaneously.

Good advice on What makes a successful eSports game: Once it's made sure the basics are easy enough, the game has to have qualities that gives it a very high skill cap to master. If anyone could master the game, then where would the competition be? All of the top 4 eSport games mentioned have an extremely high skill cap. Not only does a good strategy come into part where you have to outsmart your opponent, but you also have to be able to adapt quickly to different situations and have the skill to know what actually is the right decision to make.

We Failed at Publishing Competitive Games so You Don't Have to. Lesson: launch earlier with fewer characters. It's cheaper. Gives you a longer sales window as you add more characters over time. Really understand why people like the game they like at a detailed level. Reproduce that.

A spider web is a prey sensing signal-processing computer based on vibration and flow. An interesting idea researchers are exploring. Computing with spiders’ webs: It is known that spiders have sensitive mechanoreceptors on their legs that allow them to measure vibrations. Spiders also probe their webs by sending out vibrations and observing how it responds in form of vibration patterns. This suggests that the web is not only a static, passive structure to catch prey, but can rather actively contribute to the pattern recognition task to locate and categorize everything that is happening in the web.

Our universe as a simulation is just the Singularity version of responding "God did it" to every question you don't know the answer to.

Scaling Financial Reporting at Airbnb. ransom1538: tldr; They built an event/messaging system. When an event occurs they broadcast that event with meta data. (EG. event: reservation_booked, meta {..}). Before they tightly coupled reporting sql inside the business logic.
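
The decoupling ransom1538 describes can be sketched as a tiny in-process publish/subscribe bus. Everything here is hypothetical: the `EventBus` and `RevenueReport` classes and the `amount_cents` field are invented for illustration, and a production system would presumably use a durable message queue rather than direct function calls.

```python
class EventBus:
    """Minimal in-process publish/subscribe dispatcher."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_name, handler):
        self._handlers.setdefault(event_name, []).append(handler)

    def publish(self, event_name, meta):
        for handler in self._handlers.get(event_name, []):
            handler(meta)

class RevenueReport:
    """Reporting consumer: knows about events, not about booking internals."""

    def __init__(self):
        self.total_cents = 0

    def on_reservation_booked(self, meta):
        self.total_cents += meta["amount_cents"]

def book_reservation(bus, reservation_id, amount_cents):
    # Business logic would run here; it only announces what happened
    # and contains no reporting SQL.
    bus.publish(
        "reservation_booked",
        {"reservation_id": reservation_id, "amount_cents": amount_cents},
    )
```

New reports can be added by subscribing another handler, without touching `book_reservation`.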

Nice collection. Best online video courses for Data Structures And Algorithms.

Excellent explanation. Notes on Lock Free Programming (Part 1): In this post, I will summarize my understanding of a non-blocking (lock free) design of linked list data structure as presented in paper “A Pragmatic Implementation of Non-Blocking Linked-Lists” by Tim Harris.

rigtorp/MPMCQueue: A bounded multi-producer multi-consumer lock-free queue written in C++11. Also, Ode to a Vyukov Queue.

google/guetzli [post, paper]: "Guetzli is a JPEG encoder that aims for excellent compression density at high visual quality. Guetzli-generated images are typically 20-30% smaller than images of equivalent quality generated by libjpeg." But it's not fast: I get about 0.5 MPixel/min on an 8350. But it does seem to work well at avoiding ringing artifacts. It does also shift colors, but I can't really perceive that at 100% zoom in my tests so far.

brendangregg/perf-tools: A miscellaneous collection of in-development and unsupported performance analysis tools for Linux ftrace and perf_events (aka the "perf" command). Both ftrace and perf are core Linux tracing tools, included in the kernel source.

RedisLabsModules/rejson: is a Redis module that implements ECMA-404 The JSON Data Interchange Standard as a native data type. It allows storing, updating and fetching JSON values from Redis keys (documents). The JSON values are managed as binary objects, thus allowing Redis-blazing performance.

alexellis/faas: a framework for building serverless functions on Docker Swarm with first class support for metrics. Any UNIX process can be packaged as a function enabling you to consume a range of web events without repetitive boiler-plate coding.

bitwrap/bitwrap-io: Solving State Explosion with Petri-Nets and Vector Clocks.

distill.pub: a new interactive, visual journal for machine learning research. Dedicated to clear explanations of machine learning. Distill is a modern medium for presenting research. That medium will enable new ways of thinking that enable new discoveries.

weld-project.github.io: a runtime for improving the performance of data-intensive applications. It optimizes across libraries and functions by expressing the core computations in libraries using a small common intermediate representation, similar to CUDA and OpenCL.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [article]: In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks.
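
The core trick of a sparsely-gated layer is that only a few experts run per input. Here is a minimal pure-Python sketch of top-k gating. It leaves out the noise term, the learned gating network, and the load-balancing losses from the paper, and the scalar "experts" are toy functions invented for illustration.

```python
import math

def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other expert gets no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, gate_logits, experts, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gates(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Toy experts: each is just a scalar function standing in for a sub-network.
EXPERTS = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x, lambda x: -x]
```

With thousands of experts and a small `k`, compute per input stays roughly constant while total parameter count grows, which is the capacity-versus-cost trade the abstract describes.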

ALGORAND: The Efficient and Democratic Ledger: Algorand is a truly decentralized, new, and secure way to manage a shared ledger. Unlike prior approaches based on proof of work, it requires a negligible amount of computation, and generates a transaction history that does not fork with overwhelmingly high probability. This approach cryptographically selects, in a way that is provably immune from manipulations, unpredictable until the last minute, but ultimately universally clear, a set of verifiers in charge of constructing a block of valid transactions. This approach applies to any way of implementing a shared ledger via a tamper-proof sequence of blocks, including traditional blockchains. This paper also presents more efficient alternatives to blockchains, which may be of independent interest. Algorand significantly enhances all applications based on a public ledger: payments, smart contracts, stock settlement, etc. But, for concreteness, we shall describe it only as a money platform.
