The Technology Behind Apple Photos and the Future of Deep Learning and Privacy

There’s a war between two visions of how the ubiquitous AI assisted future will be rendered: on the cloud or on the device. And as with any great drama it helps the story along if we have two archetypal antagonists. On the cloud side we have Google. On the device side we have Apple. Who will win? Both? Neither? Or do we all win?

If you had asked me a week ago, I would have said the cloud would win. Definitely. If you read an article like Jeff Dean On Large-Scale Deep Learning At Google you can’t help but be amazed at what Google is accomplishing. Impressive. Wide ranging. Smart. Systematic. Dominant.

Apple has been largely absent from the trend of sprinkling deep learning fairy dust on their products. This should not be all that surprising. Apple moves at their own pace. Apple doesn’t reach for early adopters; they release a technology when it’s a win for the mass consumer market.

There’s an idea that because Apple is so secretive, they might have vast deep learning chops hidden away that we don’t even know about yet. We, of course, have no way of knowing.

What may prove more true is that Apple is going about deep learning in a different way: differential privacy + powerful on device processors + offline training with downloadable models + a commitment to really really not knowing anything personal about you + the deep learning equivalent of perfect forward secrecy.

Photos vs Photos

In the WWDC 2016 Keynote Apple introduced their new photos application, announcing that deep learning will be used to help search for images, group relevant photos into albums, and gather photos, videos and locations into mini snapshots.

If this sounds a lot like Google Photos, it should. The Google Photos team implemented the ability to search photos without tagging them. You can find pictures of statues, Yoda, drawings, water, etc., without the pictures ever being tagged.

What’s different is how they go about their respective jobs.

What approach is Apple taking? We get some visibility into Apple’s approach from a recent illuminating episode of The Talk Show, recorded live at WWDC 2016, hosted by John Gruber with Apple’s dynamic duo of Phil Schiller and Craig Federighi as guests.

When does deep learning happen?

Gruber asked exactly the question I wanted answered: when does deep learning happen?

It turns out there are several answers to that question:

  • The deep learning happens in Apple’s datacenters.

    • This process builds a model that is downloaded to the device.

    • Training does not occur on user data. External data sets are used to build the models.

  • The model is applied on the device when a picture is taken.

    • The analysis is performed instantaneously as the photo is going into your photo library.

    • It takes something like eleven billion calculations to categorize a picture with tags like “that's a horse” or “that's a mountain.”

    • The GPUs on iOS devices these days really cook, so the whole process is essentially instantaneous. Apparently, on an amortized basis, the power draw is nothing to worry about.

  • All your existing photos are analyzed in the background. Because there’s a considerable amount of computation involved, the analysis occurs overnight, while the device is plugged into AC power (see the sketch after this list).

  • The analysis results are not shared between all your devices.

    • Every device goes through the same process; each device does its own processing.

    • In the future this might change and the results might be shared. Clearly, developing a secure system to share this kind of data is a huge task, so it’s understandable why it would come later.
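
Pulling the on-device half of that together, here’s a minimal sketch in Python of the flow as I understand it from the interview. Every name here is hypothetical; this is a reading of the described behavior, not Apple’s actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Photo:
    pixels: bytes
    tags: list = field(default_factory=list)

@dataclass
class Device:
    on_ac_power: bool
    is_overnight: bool

def classify(pixels):
    # Stand-in for the downloaded model: trained in Apple's
    # datacenters on external data sets, never on user photos.
    # The real thing runs ~11 billion operations per photo on the GPU.
    return ["horse", "mountain"] if pixels else []

def on_photo_captured(photo):
    # Applied essentially instantaneously as the photo
    # goes into the photo library.
    photo.tags = classify(photo.pixels)

def process_backlog(library, device):
    # The backlog of existing photos is heavy to analyze, so it
    # only runs overnight on AC power. Each device does this
    # independently; results aren't shared between devices.
    if device.on_ac_power and device.is_overnight:
        for photo in library:
            if not photo.tags:
                photo.tags = classify(photo.pixels)
```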

Privacy is Where the Difference Is

While Apple doesn’t talk about how the training occurs, it’s likely to be some variant of what Google has pioneered with deep learning.

What’s really different is in how privacy is handled. Google stores all your data in the cloud and trains their models on your data along with everyone else's. Google knows exactly who you are. In fact, I often have this dystopian vision of Google creating a neural network simulation of my brain and continually probing it to see how I might react to candidate advertising regimes. *shiver*

Apple takes a completely different approach. Apple never actually knows the results of the analysis that occurs on your phone. Apple never actually sees your data. This is mentioned numerous times during the interview. Apple really wants you to know your data is private and that Apple is out of the loop.

Craig Federighi:

Yeah. To be clear, the photos themselves are, the architecture sets are encrypted in the cloud, and the metadata — any metadata about the photos that you create or that we create through deep learning classification is encrypted in a way that Apple's not reading it.

How can Apple compete if they aren’t uploading your data and learning everything about you? Through a little mathematical wizardry called differential privacy (DP). I’d never heard of it before either, so we’re all playing catch-up.

Matthew Green has an excellent introduction to DP in his article What is Differential Privacy? Essentially, DP is a big data play that uses statistics to hide user identities, putting provable mathematical limits on what can ever be learned about any one individual.
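
For the mathematically inclined, the textbook guarantee (this is the standard definition, not anything Apple has published) is that a randomized mechanism $M$ is $\varepsilon$-differentially private if, for any two datasets $D$ and $D'$ differing in a single person’s record, and any set of outputs $S$:

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

The smaller $\varepsilon$ is, the less any one person’s data can shift what an observer sees, no matter what auxiliary information that observer has.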

Does it work? Matthew Green:

But maybe this is all too "inside baseball". At the end of the day, it sure looks like Apple is honestly trying to do something to improve user privacy, and given the alternatives, maybe that's more important than anything else.

Craig Federighi goes through a DP example (slightly edited):

The idea is that if we wanted to know what word, y'know, a new word that everyone was, that lots of people were typing, that we didn't know so that we would stop marking it as a spelling error. Or maybe we'd even suggest it on the keyboard.
Yeah, like now it's just, it's trending, it's hot, we want all our customers to be able to know that word, but we don't want to know you and Phil in particular are typing it. We want to have no way to have any knowledge of that.
You can imagine if what we're essentially assembling is a picture of little pieces of data, y'know, of the forest, but all we're getting is a little piece. And when we get that little piece, even each device will statistically, much of the time, even lie about its little piece. Right?
But those lies will all cancel out with enough data and the picture will suddenly resolve, with enough data points, will resolve itself. And so, and yet, literally, if we were trying to learn a word, we would send one bit — we'd send a position and a single — we'd hash the word, we'd send a single bit from the hash, we'd say at position 23, Phil saw a 1. But Phil's phone would flip a coin and actually say, "Actually, I'm going to lie about that. I'm going to say zero even though I saw a one."
And that's the data that goes to Apple. And Apple, with enough of that data, can build a composite picture and say, "Holy smokes, we have a word here. And this many people roughly are seeing it." And that's typically what you want to know. You want to know what's happening at large, but we have no desire to know what, specifically, who is doing what.
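
That coin-flip trick is a classic technique called randomized response. Here’s a minimal sketch of it in Python, assuming (my simplification, not Apple’s published design) that each device reports one randomly chosen bit of a hash of the word:

```python
import hashlib
import random

NUM_BITS = 256  # assumed width of the hashed bit vector

def client_report(word):
    """One device's contribution: a single (position, bit) pair.
    Half the time the device reports a random coin flip instead of
    the truth, so no individual report can be trusted."""
    digest = hashlib.sha256(word.encode()).digest()
    position = random.randrange(NUM_BITS)
    true_bit = (digest[position // 8] >> (position % 8)) & 1
    if random.random() < 0.5:
        return position, random.randrange(2)  # the "lie"
    return position, true_bit

def server_estimate(reports):
    """Aggregate many noisy reports and debias. If a fraction f of
    devices truly saw a 1 at a position, the observed rate of 1s is
    1/4 + f/2, so f = 2p - 1/2: the lies cancel out and the picture
    resolves with enough data points."""
    ones = [0] * NUM_BITS
    totals = [0] * NUM_BITS
    for position, bit in reports:
        totals[position] += 1
        ones[position] += bit
    return [2 * (ones[i] / totals[i]) - 0.5 if totals[i] else 0.0
            for i in range(NUM_BITS)]

# Simulate a million devices all typing the same trending word.
reports = [client_report("yolo") for _ in range(1_000_000)]
estimate = server_estimate(reports)
# The word's hash bits now stand out statistically, yet any single
# device's report is deniable.
```

With one report you learn essentially nothing about Phil; with a billion phones the aggregate signal is sharp. That’s the trade Federighi is describing.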

Apple is leveraging the fact that they have a billion phones out in the wild to their advantage.

A key point that Gruber brought up is how DP ensures the deep learning equivalent of forward secrecy. Because the data is randomized at the source rather than merely anonymized after the fact, it will be impossible to figure out at some later time who belongs to what data. Even if some court orders Apple to match data to a person, Apple will not be able to do so. Even if some later management team at Apple changes direction and wants to exchange dollars for privacy, it won’t be able to.

Google also develops models with impressive functionality that are small enough to run on smartphones. One example is a combined vision-plus-translation app that uses computer vision to recognize text in a viewfinder, translates it, and then superimposes the translated text on the image itself. The models are small enough that it all runs on the device. Google knows that for technology to disappear, intelligence must move to the edge; it can’t depend on a network umbilical cord connected to a remote cloud brain. Since TensorFlow models can run on a phone, we can expect hybrid approaches to be the rule, combining cloud training with on-device models, though it seems unlikely Google would adopt differential privacy.
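
As a sketch of that hybrid pattern, here’s roughly what running a cloud-trained, frozen TensorFlow model locally looked like in the TensorFlow 1.x era. The file name and tensor names are placeholders of mine, not from any real app:

```python
import numpy as np
import tensorflow as tf

# Load a model that was trained in the cloud and shipped to the
# device as a frozen GraphDef.
graph_def = tf.GraphDef()
with open("frozen_vision_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    image = np.zeros((1, 224, 224, 3), dtype=np.float32)  # camera frame
    with tf.Session(graph=graph) as sess:
        # Inference happens entirely on the device; no network call.
        labels = sess.run("output:0", feed_dict={"input:0": image})
```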

What Capabilities is Apple Losing with Differential Privacy?

It seems like Apple is giving up the ability to learn deeply about you as an individual, though I’m quite sure I don’t understand all of this nearly well enough to say for certain.

Take Google’s Smart Reply feature as an example. On a phone you want to be able to respond quickly to email, and typing is a pain. So Google developed a system to predict likely replies to a message.

The first step was to train a small model to predict if a message is the kind of message that would get a short reply. If so, a bigger, more computationally expensive model is activated that takes the message in as a sequence and tries to predict the sequence of words in the response.

For example, to an email asking about a Thanksgiving invitation the three predicted replies are: Count us in; We’ll be there; Sorry we won’t be able to make it.
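A toy sketch of that two-stage gating idea, with stub models standing in for the real neural networks (nothing here is Google’s actual code):

```python
class TriageModel:
    """Cheap stand-in for the small model that decides whether a
    message merits a short reply at all."""
    def probability_short_reply(self, message):
        # Toy heuristic: questions and invitations tend to get short
        # replies; the real thing is a trained neural classifier.
        return 0.9 if "?" in message else 0.1

class ReplyModel:
    """Stand-in for the big sequence-to-sequence model that reads the
    message and generates candidate responses word by word."""
    def generate(self, message, num_candidates=3):
        canned = ["Count us in!", "We'll be there!",
                  "Sorry, we won't be able to make it."]
        return canned[:num_candidates]

def suggest_replies(message, triage, reply_model, threshold=0.5):
    # Only pay for the expensive model when the cheap one
    # says a short reply is likely.
    if triage.probability_short_reply(message) < threshold:
        return []
    return reply_model.generate(message)

print(suggest_replies("Want to join us for Thanksgiving?",
                      TriageModel(), ReplyModel()))
```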

This seems like something Apple could possibly create.

Let’s take it one step further by creating a model that will predict my likely response. How would I likely respond to a message? I don’t think Apple can do that sort of personalization. Apple doesn’t have my identity in the cloud; it just has an aggregated view of all the data. When it comes to personalization, Apple is limiting itself to training that can only occur on the device, with only the data available on that device.

So there's a data poverty issue. Is there enough accessible data on a device to learn about the real me? Will Apple only know me from iMessage or Siri? Or does Apple hijack access to Twitter, email, Facebook, Google search, etc?

Then there's a computation issue. Listening to Jeff Dean, my impression is that these neural networks are composed of hundreds of millions of parameters, not something that could run on a device. (At four bytes per parameter, a few hundred million parameters is on the order of a gigabyte of memory before any computation even starts.)

Then there's the multiple personality problem. Siri, for example, would appear to have multiple personalities. I interact differently with my phone, my iPad, and my desktop, so if training is per device, I would see a different Siri on each device. Apple would have to develop some sort of meta-training layer where all the devices cooperate to form a single unified view of their user. That sounds a lot more challenging than shipping everything back to the cloud.

Is this lack of personalization a killer? It would be for Google. Google recently had their own impressive developer conference, Google I/O 2016, where they doubled down on a machine-learning-everywhere strategy. One example is Google Assistant, a new personal AI that looks like it’s going to obliterate Robert Scoble’s infamous freaky line.

Does Apple care? Google seems to be interested in exploring the full flowering of deep learning as an end in itself. Apple seems more focused on how deep learning can make a better product, which is a much different goal, a very Applish goal. As long as Apple is at least notionally on par with Google, Google would have to be far superior and provide an even more compelling ecosystem to win over converts. We will see.

Every team will have to decide how they want to build and deploy future deep learning systems. It’s as much a technological question as an ethical one. Until now we’ve only had one example of how to build deep learning systems. Apple has provided a different model.

Unfortunately, however wonderful Apple’s privacy model is, it will be hard to spread. Apple will likely keep their technology closed. Google on the other hand is busy terraforming the world with their vision of deep learning. This is no doubt fine with Apple because it locks them in as the privacy platform of choice, but it kind of sucks for the rest of us.