AWS

How do you explain the unreasonable effectiveness of cloud security?

With the enormous attack surface of cloud providers like AWS, Azure, and GCP, why aren't there more security problems? Data breaches and cyber attacks occur daily. How do you explain the unreasonable effectiveness of cloud security?

Google has an ebook on their security approach; Microsoft has some web pages. Both are the equivalent of that person who is disgustingly healthy and you ask them how they do it and they say "I don't know. I just eat right, exercise, and get plenty of sleep." Not all that useful. Most of us want a hack, a trick to good health. Who wants to eat right?

I'm sure Amazon also eats right, exercises, and gets plenty of sleep (probably not the people who work there), but AWS also has a secret that when that disgustingly healthy person starts talking about at a party, you just can't help leaning in and listening.

What's the trick to 6-pack security? Proving systems correct. Does your datacenter do that? I didn't think so. AWS does.

Dr. Byron Cook gave an enthusiastic talk on Formal Reasoning about the Security of Amazon Web Service. He's clearly excited about finally applying his research in a real-world setting. This is the trojan horse the FloC (Federated Logic Conference) community has been waiting for. It's almost as if he's a FLoC guy working at AWS rather than an AWS guy giving a FLoC talk.

The main take-aways for me were:

AWS has made a big bet on Automated Reasoning to increase cloud security.
Provable cloud security is an accelerator for AWS adoption. Customers want and like it.
Testing more is meaningless—soundness is key. Soundness means they understand the semantics of a policy language and then use constraint solving, which is sound, to prove it.
This is the future of security and most of us have no idea how to live in it. As good as all this work is, it's still a band-aid on a 20th century software production process. For us to achieve reliable security, human driven programming should not be long for this brave new world.

Here's a gloss of some of the main points of this enlightening talk:

Introduction

Bio: Dr. Byron Cook is Director of Automated Reasoning at Amazon Web Services and Professor of Computer Science at University College London. Prior to joining Amazon, Byron was known for his contributions to the Terminator/T2 termination prover, BioModelAnalyzer, SLAM, and Microsoft’s Static Driver Verifier.
Dr. Cook left academia and took a job at AWS with the idea of creating a formal methods revolution in cloud security. He wanted to create a major step change, to really change the game. Now it has been four years. **spoilers** He succeeded. Formal methods have arrived and are no longer irrelevant.
AWS YoY revenue growth is ~45%. Planned Availability Zone growth is 55% YoY. Planned growth in geographic regions is 22% YoY. Services, features, developers, and customers are growing exponentially, so the cloud and security are interesting and important areas of research.
The keys to the AWS success story are agility and security based on rapid feature development from customer feedback and data. To increase cloud security AWS has made a big bet on Automated Reasoning.
There's a difference between security in the cloud and of the cloud. In the cloud customers need to securely setup identity configurations, access policies, virtual networks, code, etc. Of the cloud means making virtualization, encryption, protocols, hardware, compliance, etc. all secure. It's called a Shared Responsibiltity Model.

Security Of the Cloud

AWS using formal methods started quietly in 2014 with a pilot project. The game plan was:
- Find security problems where we can help.
- Apply new or existing tools.
- Collaborate share ideas, code, data with the community
- Deliver value to AWS and our customers when things go well
- Add problems to "research questions" bucket when they fail
- Repeat
AWS is the perfect place to merge theory with practice. Overly theoretical people don't know what to work on and overly practical people don't know what they are doing.
Pre-launch protocol proof engagements are now growing exponentially. After initial developer skepticism, it's gaining acceptance internally. Quote from an influential AWS employee: "The world thought this formal verification stuff was snake oil and would never be practical, but you have proved them wrong."
The AWS security team performs formal security reviews of all features/services, e.g. 1,430 services/features in 2017, a 41% year-over-year increase from 2016. Mitigations to security risks that are developed during these security reviews are documented as a part of the security review process. Another important activity within AWS is ensuring that the cloud infrastructure stays secure after launch, especially as the system is modified incrementally by developers.
In 2017 alone the security team used deductive theorem provers or model checking tools to reason about cryptographic protocols/systems, hypervisors, boot-loaders/BIOS/firmware, garbage collec- tors, and network designs. Overall, formal verification engagements within the AWS security team increased 76% year-over-year in 2017, and found 45% more pre-launch security findings year-over-year in 2017.
Formal reasoning about the security of the cloud has two activities: secure design and secure operation.
Secure design: AppSec (application security) review and compliance certification plan/automation. For an AppSec review you build a threat model, decide on mitigation strategies for threats, worry about compliance (ITAR, PCI, FIPs, FedRAMP, etc).
Secure operation: you need to maintain security in the field while it's running. Things happen, for example, a researcher finds a problem in open source code that you're using, so you have to decide if this is a problem for you, ask if you've built enough mitigations.
Every aspect of formal reasoning has a formal constraint solving component.
Security protocols. In AWS large fleets of machines share secrets with each other and try to keep their state up to date using security protocols that are hard to get right. In AWS they try to prove those protocols correct via reduction of the protocol to proofs in VCC.
Customers love that they are doing formal verification. Customers actually say this why they've moved to AWS. Proofs are an accelerator for AWS adoption.
Soundness is key. A logical system has the soundness property if and only if every formula that can be proved in the system is logically valid with respect to the semantics of the system. You don't want to tell a customer we use proof and logic and then say "oh yah, we forgot a case so it doesn't matter." Customers don't want to hear that you are doing 10% more testing. Testing more is almost meaningless. Testing isn't good enough. There are always cases missed in testing. Proofs are meaningful. Soundness means they understand the semantics of the policy language and then use constraint solving, which is sound, to prove it.
Other things they are trying to prove correct:
- Correct virtualization
- Memory safety
- Protocol conformance
- Compliance
- Code
- Properties from AppSec/Compliance review
- Using tools like Boogie, Coq, CBMC, Dafny, HOL-light, Infer, OpenJML, SAW, and SMACK.
For the TLS code used in S3 they've proved HMAC, random bit generator, and the hand shake protocol correct.
One issue with proofs is that code changes continuously, how can the proofs keep up? For the TLS code they rerun the proofs on check-in. A future goal is to be able to this for everything.

Security In the Cloud

Customers want to design systems that are secure. They have their own threat models and mitigations. They are typically regulated (GDPR, credit card data, etc). How do you establish compliance and security while building on top of AWS? When your system is deployed, how do you operate it securely? How do you maintain security compliance? When things go wrong, what's the blast radius?
AWS builds services on top of AWS so if they can use proof for their stuff then customers get the benefit too.
If you want to know why IAM (Identity and Access Management) is so complicated and hard to understand, it's because IAM is like FOL or first-order logic.
IAM is all that stands between data and hackers. One little formula in logic that's saying who can access the data, when they can access it, and under what conditions they can access it. Get it wrong and it's a disaster. So it's important to get IAM right.
He gives an example of an IAM policy that's wrong for reasons that hard to understand, because, well, IAM is hard to understand. Even he had a hard time explaining IAM.
Imagine you have a whole set of IAM policies and you want to prevent bad policies from being installed. There are enterprise-wide guardrails that allow customers to define the space of policies they are comfortable with. Whenever there's a change to the policies it's checked to see if it's allowed. Using a internal cloud service call Zelkova, what they are doing is reducing the semantics of the policy language into SMT and calling SMT solvers.
Macie is a tool that looks for ways data might be leaving your organization and warns you about it. It uses Zelkova to search for exfiltration routes and then uses machine learning to categorize data at risk, alerting customers. Constraint solving is used to identify routes out.
This approach will be used by many more services. SMT solvers Z3 and CVC4 are called many 10s of millions of times a day. Forecast is for exponential growth in the use of these services.
VPC (Virtual Private Cloud) is an important security control. Customers build huge networks in VPCs. There are question about these networks. Does this network meet customer enterprise-wide security and compliance requirements? How do you enforce rules like certain services should only communicate via certain machines? How do you enforce data only going through certain countries?
A lot of the rules for PCI can be encoded in logic. Then a constraint solver is used to prove the network correctly implements the specification. The tool that's doing the checking must understand the abstract concepts and the interactions between them. Abstract concepts like AZs, DirectConnect, Instances, Internal Gateways, Load Balancers, NAT, Tags, etc.
The argument might be that this is all too complex and there should be simpler ways of doing things. The problem is every time you try to simplify something there's going to be billion dollar businesses saying they can't use your service. Customers need these complex kinds of capabilities. Customers like this stuff.
Datalog solver Souffle and SMT solver Monostat were called may 10s of thousands of times a day last year.
Provable cloud security is the new buzzword. Soundness is key for customers. If you want to prove a packet can't arrive where it's not supposed to, you need these kinds of systems.
Future work: build proof tools (model checkers) that really use the elasticity of the cloud in all its glory; make threat modeling more formal; repair proofs for continuous integration; proofs make a lot of assumptions on preconditions that need to be checked; Security is about risk management, just because you found a bug doesn't mean anything, it's the risk associated with that bug that counts.