Deep learning for the layperson

Warning: deep learning experts will probably cringe at the simplification below. I'd be very interested to hear how others would explain deep learning to a person who has minimal background in math or computer science!

Artificial intelligence has bugged us for a really long time, and a lot of smart people have thought about it in many different ways. These days, a method called 'deep learning' is all the rage and, when wielded properly, has shown some Pretty Impressive Results™ (i.e., comparable to human performance) on certain intelligence tasks like distinguishing dog breeds. But a sentence like the previous one is so confusing. What's 'human performance'? Are we doing anything besides telling one breed from another? And what is deep learning? Sounds so mysterious.

A brief (and revisionist-for-simplicity) history lesson might clear things up. {Sentence about philosophy and stuff people thought about before computers came along goes here}. Way back in the day, people taught the first machines (e.g., the Jacquard loom or the Babbage-Lovelace Analytical Engine) via formal, mathematical rules. For example, you could think of an adding machine as having some level of 'intelligence'. People made these rules more and more complicated. But as anyone with a passing familiarity with English will know, sometimes one set of rules conflicts with another, and the meta-rules that tell us which set to use in which case conflict with yet other meta-rules. What a cluster eff.

At some point in this not-linear-or-even-dependent-history, people started making predictions using data. If you've had high school statistics, you may have predicted a house's price based on the number of rooms it has. You started with a table of data containing house prices and the number of rooms each house has, did some matrix math, and came up with an equation of the form:

$$ \text{Price} = a \cdot \text{Rooms} + b
$$

This is machine learning! Our model encodes information from the data it's seen to make predictions based on inputs it hasn't seen before. All the learning algorithms you hear about are just more complex flavors of the above (and of course I've now offended everyone in the field). Our story doesn't end here, of course. How do we know that the number of rooms is the best predictor of housing prices? What if a house's number of bathrooms or its neighborhood also plays a role? In the above example, we're learning the relationship between a feature (rooms) and our output (price). The effectiveness of our model depends heavily on how we choose to represent our problem.
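If you're curious what that looks like in practice, here's a minimal sketch in Python. The house data is made up, and I'm leaning on `numpy` to do the line-fitting:

```python
import numpy as np

# Made-up training data: number of rooms and sale price for five houses.
rooms = np.array([2, 3, 3, 4, 5], dtype=float)
prices = np.array([180_000, 240_000, 250_000, 310_000, 400_000], dtype=float)

# Least squares finds the a and b in: Price = a * Rooms + b.
a, b = np.polyfit(rooms, prices, deg=1)

# Use the learned equation on an input we haven't seen before.
print(f"Predicted price of a 6-room house: {a * 6 + b:,.0f}")
```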

And what if we don't have structured, labeled tables? In the image of Lincoln below, we can see that an image really is just a bunch of numbers. We don't have any labels to tell us that there's a beard or two eyes, or even that a beard and two eyes might be important for telling whether a picture contains Lincoln.
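If that's hard to picture, here's a tiny sketch of what an image looks like to a computer (random values stand in for a real photo):

```python
import numpy as np

# A 16x12 grayscale image is just a grid of brightness values
# (0 = black, 255 = white). A real photo of Lincoln would be loaded
# from a file; random values stand in for it here.
image = np.random.randint(0, 256, size=(16, 12))

print(image.shape)            # (16, 12): rows and columns of numbers
print(image.flatten().shape)  # (192,): the same numbers as one long list
```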

Images of Lincoln can vary in so many ways (e.g., a strong shadow falls across his face, or we're seeing him from the side) that it's hard to know which sets of pixels matter for recognizing Lincoln in an image we haven't seen before. Luckily for us, there is a field called representation learning that uses machine learning to automatically figure out which features, or representation, matter to the question at hand. For example, an autoencoder learns to encode (and often compress) some input and then decode it back:

$$ \text{input} \approx \text{decode}(\text{encode}(\text{input}))
$$

To be more concrete, we choose a model for the autoencoder. Let's say our autoencoder treats each pixel in an image as a separate input (e.g., 192 'features' for the 16x12 Lincoln image). In the encoding step, it might map these 192 pixels to 20 values in some way (in a sense, compressing 192 pixels down to 20). In the decoding step, we try to recover the original 192 pixels from those 20 values.
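Here's a sketch of that shape in Python. The single matrix in each direction and the random, untrained weights are my own simplifications; a real autoencoder has learned weights and usually more structure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, untrained weights: one matrix squashes 192 pixel values down
# to 20, and another expands those 20 values back out to 192.
W_enc = rng.normal(scale=0.1, size=(20, 192))
W_dec = rng.normal(scale=0.1, size=(192, 20))

def encode(pixels):  # 192 numbers in, 20 numbers out
    return W_enc @ pixels

def decode(code):    # 20 numbers in, 192 numbers out
    return W_dec @ code

image = rng.random(192)  # stand-in for the flattened 16x12 Lincoln image
reconstruction = decode(encode(image))
print(reconstruction.shape)  # (192,): same shape as what we put in
```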

For each image we use to train our model, we compute some error, e.g.,

$$ \text{error} = \text{input} - \text{decode}(\text{encode}(\text{input}))
$$

to adjust how we map the 192 pixels to the 20 values. After seeing many examples, our autoencoder learns which of the 192 pixels matter most to each of the 20 values. You can see why people call this 'learning': it sounds a lot like how we learn. If something works well, we'll do more of that. If something doesn't, we'll do less.
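Putting the pieces together, here's a toy training loop: a minimal sketch assuming the simplest possible model (one linear map in each direction), plain gradient descent, and random stand-ins for real images:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(20, 192))  # 192 pixels -> 20 values
W_dec = rng.normal(scale=0.1, size=(192, 20))  # 20 values -> 192 pixels

images = rng.random((100, 192))  # made-up stand-ins for training images
lr = 0.005                       # how big a nudge we give the weights

def mean_error():
    reconstructions = images @ W_enc.T @ W_dec.T
    return np.mean((reconstructions - images) ** 2)

print("error before training:", mean_error())
for epoch in range(30):
    for x in images:
        code = W_enc @ x      # encode: 192 -> 20
        recon = W_dec @ code  # decode: 20 -> 192
        error = recon - x     # how far off the reconstruction is

        # Nudge both mappings in the direction that shrinks the error.
        grad_dec = np.outer(error, code)
        grad_enc = np.outer(W_dec.T @ error, x)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
print("error after training:", mean_error())
```

The error after training should come out noticeably smaller than before, which is all 'learning' means here.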

You might ask how this is different from the house price example: we're still defining a model (192 pixels get mapped to 20 intermediate values), and aren't we just using pixels as our features instead of the number of rooms? As far as I can tell, there isn't a hard and fast rule, but pixels are considered fungible and unstructured enough that they aren't features in and of themselves. Such are the murky waters of artificial intelligence.

So if we can tell what parts of our data are most important to answering our questions and how to weight information we've seen before to come up with the best answer, why haven't we solved artificial intelligence?

One problem is efficiency. If I were to tell you how to get to the local 7-Eleven, I could assume that you know how to recognize road markers, street names, the 7-Eleven logo, etc. But early approaches to representation learning were very shallow. Recognizing road markers involves recognizing differences in color, but so does recognizing street names. In a shallow representation, we'd have to repeat these 'lower-level' computations for each 'higher-level' concept. To use programming parlance, it'd be like writing code without subroutines.

Deep learning seeks to resolve this by learning a hierarchy of representations. In order to recognize faces, my model may have learned to recognize edges and corners at the lowest level, noses and eyes at the next, and faces at the highest. I need to recognize edges in order to recognize both noses and eyes, but now they can both use the same edge detector instead of each requiring its own.
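In code, such a hierarchy might look like the sketch below. The layer sizes are made up and the weights are random and untrained; the point is just that the 'edges' layer is computed once and shared by everything above it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, untrained weights for a three-level hierarchy.
W_edges = rng.normal(scale=0.1, size=(64, 192))  # pixels -> edge-ish features
W_parts = rng.normal(scale=0.1, size=(16, 64))   # edges -> parts (noses, eyes)
W_face = rng.normal(scale=0.1, size=(1, 16))     # parts -> a 'face?' score

def relu(x):
    return np.maximum(x, 0)  # keep a feature only where it's detected

def face_score(pixels):
    edges = relu(W_edges @ pixels)  # lowest level: computed once...
    parts = relu(W_parts @ edges)   # ...and reused by noses, eyes, etc.
    return W_face @ parts           # highest level

print(face_score(rng.random(192)))
```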

But again, we tread in murky waters: what counts as deep? Some people say more than three layers counts as deep, but the field hasn't even really agreed on what counts as a 'layer'. Moving on...

I'll concede this article is getting more abstract, but remember that I'm 'teaching' my deep learning model in much the same way as before: feeding in some data, comparing the result to my expected output, and adjusting the model accordingly.

If you've come this far and found this note unsatisfying, I don't blame you. Conceptually, deep learning really is that simple. Of course, the 'magic' (or devil) is in the details. How do we choose an initial model? How do we adjust the model based on results?

Today, researchers are using deep learning methods for everything from recognizing speech to helping robots find objects. And while this approach is much more powerful than anything we've seen before, there's still a long way to go. {Insert all the poetry about the current state and future of deep learning}.

(By the way, people are still working on each of the other strategies -- formal rules, machine learning, and representation learning -- and more for artificial intelligence. Some people feel that deep learning is a new paradigm that's here to stay; others feel that deep learning is just having its day in the sun. I guess we'll see.)