In recent years the concept of deep learning has been gaining widespread attention. The media frequently reports on talent acquisitions in the field, such as those by Google and Facebook, and startups that claim to employ deep learning are met with enthusiasm. Gratuitous comparisons with the human brain are frequent. But is this just a trendy buzzword? What exactly is deep learning, and how is it relevant to developments in machine intelligence?
For many researchers, deep learning is simply a continuation of the multi-decade advancement in our ability to make use of large scale neural networks. Let’s first take a quick tour of the problems that neural networks and related technologies are trying to solve, and later we will examine the deep learning architectures in greater detail.
Machine learning generally breaks down into two application areas which are closely related: classification and regression.
In the classification task, you are trying to perform automatic recognition. You create a training data set for which you have known labels, for example, images of different types of vegetables, where you have manually assigned the correct class label, such as yam, carrot, or potato, to each one. The images are the input to the algorithm, and the class labels are the required output.
During the training phase you show the images to the algorithm and automatically adjust its many parameters so that the generated output best matches the known labels. Once training is complete, you hope that the algorithm will be able to generalize its classification ability and yield correct labels for new images that were not in the training set. You can evaluate this capability on a separate hand-labeled data set reserved for testing. This is the classic framework used for face, object, or voice recognition, or for flagging the presence of particular patterns, e.g. in face detection.
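As a toy illustration of this train-then-test loop (using made-up two-dimensional "vegetable" features and a deliberately simple nearest-centroid rule rather than a neural network), the whole pipeline fits in a few lines of Python:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "vegetable" features: two classes, two measurements each.
train_X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),   # class 0
                     rng.normal(3.0, 0.5, (20, 2))])  # class 1
train_y = np.array([0] * 20 + [1] * 20)

# "Training": here, just compute one centroid per class label.
centroids = np.array([train_X[train_y == c].mean(axis=0) for c in (0, 1)])

def classify(x):
    """Assign the label of the nearest class centroid."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Evaluate generalization on held-out test points not seen during training.
test_X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                    rng.normal(3.0, 0.5, (10, 2))])
test_y = np.array([0] * 10 + [1] * 10)
accuracy = np.mean([classify(x) for x in test_X] == test_y)
print(accuracy)
```

A real recognition system replaces the centroid rule with a trained network, but the structure — fit parameters on labeled data, then measure accuracy on a separate test set — is the same.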
Regression, on the other hand, deals with continuous outputs, where the problem is to predict a numerical quantity from any particular input vector.
For example, I might supply a list of latitudes and longitudes and want the algorithm to predict the height of the terrain at each location. The algorithm would be trained by giving it a limited set of spot heights at various coordinates, and it would have to generalize in order to fill in the missing regions when queried about any location that was not in the training set.
The simplest example of regression is fitting a line through some data points, but more complex applications include evaluating the 3D pose of a face, or computing human joint angles, as is done by the Microsoft Kinect. Regression learning can be used as part of a control system for robotics, e.g. in following a path through an environment, or in computing motion in a scene.
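The line-fitting case is simple enough to sketch concretely. Here is a minimal least-squares fit with NumPy, using synthetic data with a known slope and intercept:

```python
import numpy as np

# The simplest regression: fit y = a*x + b to noisy points by least squares.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, 50)   # true slope 2, intercept 1, plus noise

# Solve the least-squares problem with a degree-1 polynomial fit.
a, b = np.polyfit(x, y, 1)

# The fitted line generalizes: predict the output at an x not in the training data.
print(a, b, a * 3.7 + b)
```

The more complex regression applications mentioned above (pose, joint angles) follow the same pattern, just with far higher-dimensional inputs and nonlinear models in place of the straight line.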
The techniques used for regression and classification are very similar, because both involve multidimensional generalization beyond the training data; the difference is just whether the output is discrete or continuous.
In order to actually make an algorithm that can learn, one tries to come up with a super general black box which has many parameters (think of control knobs) that can be twiddled in order to change the mapping from its input to its output. Then the age-old techniques of gradient descent are used to automatically tune all these parameters to get the best results possible on the training set. Much of the art of machine learning research lies in inventing the contents of the black box and in designing the optimization algorithm so that training converges to good solutions in a reasonable time.
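To make the "knob twiddling" concrete, here is a minimal sketch of gradient descent tuning the two parameters of a toy linear black box (the data, starting point, and learning rate are invented for illustration):

```python
import numpy as np

# A tiny "black box": a linear model with two knobs (w, b), tuned by gradient
# descent to minimize squared error on a training set.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])      # targets generated by w=2, b=1

w, b = 0.0, 0.0                         # start with arbitrary knob settings
lr = 0.05                               # learning rate (step size)
for _ in range(2000):
    y = w * x + b                       # forward pass through the box
    err = y - t
    w -= lr * 2 * np.mean(err * x)      # gradient of mean squared error w.r.t. w
    b -= lr * 2 * np.mean(err)          # ... and w.r.t. b
print(w, b)
```

A deep network is the same idea with millions of knobs instead of two, which is exactly why the design of the box and the optimizer matters so much.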
Inspired by a simplified 1940s understanding of the brain, neural networks have their roots in the work of McCulloch and Pitts and in Rosenblatt's perceptron, but they only became practical to train when Werbos introduced the back-propagation learning algorithm in 1974. This algorithm carries out gradient descent by propagating gradient updates backwards towards the input of the network using the chain rule.
Typically a neural network consists of a set of layers with nodes that take input from the preceding layer, multiply these numbers by weights, sum up different inputs, and pass the result through a nonlinearity such as the logistic function, hyperbolic tangent, or rectifying function. Multiple layers can be stacked one after the next, and each layer can have different numbers of nodes in order to implement dimension reduction or expansion.
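A bare-bones sketch of such a stack, with a logistic nonlinearity and random illustrative weights, might look like this:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer: multiply inputs by weights, sum, pass through a nonlinearity.
def layer(x, W, b):
    return logistic(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # 4 input nodes
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # hidden layer shrinks 4 -> 3
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)  # output layer shrinks 3 -> 2
out = layer(layer(x, W1, b1), W2, b2)          # stack layers one after the next
print(out.shape)
```

Swapping the row counts of the weight matrices implements the dimension expansion or reduction mentioned above; training is then a matter of adjusting W1, b1, W2, b2 by back-propagation.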
In simple neural networks there is an input layer, a hidden layer, and an output layer. The outputs are continuous functions of the inputs, so the network naturally performs regression. To obtain a classification result, each node in the output layer might be thresholded or trained to represent the probability that a particular class is present. Alternatively, neural networks can be used for intelligent dimension reduction, where the output is classified by another machine learning technique, such as a support vector machine.
But what is deep learning? Central to this new area are two concepts: using new kinds of networks with many layers, and unsupervised or semi-supervised training.
Mathematical analysis shows that a conventional neural network with even a single hidden layer can approximate any functional mapping from input to output. However, for many such functions the number of hidden units in a shallow network might have to be exponentially large. Adding more layers allows complex functions to be represented with a more reasonable number of neurons. But back-propagation then runs into what is known as the problem of vanishing gradients: the rate of learning for each layer drops as the gradient is propagated back towards the input layer, so adding new layers yields diminishing improvements in the network's performance, and it becomes impossible to train the network to find a solution that you know exists.
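The effect can be illustrated numerically: the derivative of the logistic function never exceeds 0.25, so the chain rule multiplies the gradient by a small factor at every layer. This rough sketch (not a full back-propagation pass) tracks that shrinking product across ten layers:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0
trace = [grad]
for _ in range(10):                       # ten stacked logistic layers
    s = logistic(rng.normal(size=1000))   # typical unit activations in a layer
    grad *= np.mean(s * (1.0 - s))        # average derivative factor, <= 0.25
    trace.append(grad)
print(trace[1], trace[-1])
```

After ten layers the surviving gradient is many orders of magnitude smaller than at the output, which is why the layers nearest the input barely learn at all.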
A related obstacle to deep learning is the tendency for units to become stuck in a saturated mode during training, particularly when using sigmoidal output functions. These fundamental issues held up the use of multi-level neural networks during most of the last two decades.
New progress in deep learning has resulted because methods have recently been found to circumvent these problems, especially when combined with unsupervised pre-training.
Earlier we described the situation where a classifier was trained by giving it a completely labeled data set. This is the fully-supervised training scenario. Another approach is to train a network on data without any labels at all.
In this scenario you simply show the system many examples of the type of input that you would typically expect it to see in the application scenario. For example, if you were classifying images of vegetables on a conveyor belt, you might show it many thousands of typical images of vegetables.
What unsupervised training tries to do is to get the network to learn statistical regularities in the input space. To achieve this you need to define an objective function to maximize, which typically revolves around some kind of sparseness or information-preservation criterion within the context of dimension reduction. You try to get the network to learn the best way to maintain as much information about the input as possible despite a data bottleneck. In this way the network hopefully learns a sort of conceptual vocabulary that best describes the typical data.
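A minimal example of this bottleneck idea is a linear autoencoder: unlabeled data that secretly lies on a 2-D subspace of a 4-D space is squeezed through a 2-node bottleneck, and gradient descent on the reconstruction error learns a compact description of it. (The data, learning rate, and tied-weight design below are illustrative choices, not a canonical recipe.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled data that secretly lies on a 2-D subspace of 4-D space.
basis = rng.normal(size=(2, 4))
X = rng.normal(size=(200, 2)) @ basis

# Tied-weight linear autoencoder: encode 4 -> 2, decode 2 -> 4 with W transposed.
W = rng.normal(scale=0.1, size=(4, 2))
lr = 0.005
for _ in range(3000):
    H = X @ W                      # encode through the 2-node bottleneck
    R = H @ W.T                    # decode: attempt to reconstruct the input
    E = R - X                      # reconstruction error
    grad = 2.0 * (X.T @ E @ W + E.T @ X @ W) / len(X)
    W -= lr * grad                 # gradient descent on mean squared error

mse = np.mean((X @ W @ W.T - X) ** 2)
print(mse)
```

No labels appear anywhere: the objective is purely to preserve information through the bottleneck, and the learned columns of W end up spanning the subspace the data actually occupies.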
This became part of the solution for deep learning. Unsupervised learning algorithms work well for these networks, where they can even be trained in a greedy layer-wise fashion: train the first layer, and then train the second layer on inputs that are pre-processed by the first, and so on. Once you have a network that has been trained on unlabeled data, it is possible to build a classifier by passing the output of the network through a support vector machine, or by using other techniques.
The final layer that generates the classifications is trained using a fully-labeled data set, which can have fewer examples than the data set used for the unsupervised pass. This is an example of semi-supervised learning. This classifier ends up working very well because its input has already been trained to represent the important statistical features of the input space.
Alternatively, the pre-trained deep network weights can be used as the starting point for further training on labeled data using the same architecture, so the whole network is updated to achieve the desired output, without adding a new classifier. The unsupervised pre-training pass tends to place the network weights in a much better state than random initialization would allow, leading to impressive gains in performance.
This approach also plays well into the developing field of big data analysis. Unsupervised learning can be made to run on millions or billions of examples without costly labeling operations. Huge potential internet-wide training sets, such as images, video, music, or web pages, are readily available.
It can be argued that human learning proceeds in a semi-supervised fashion because children are exposed to vast amounts of sensory data and their brains must build statistical models or vocabularies to represent the world, even without specific input from adults. They learn from very few examples because adults only occasionally provide class labels, such as the names of objects.
However, there are some limitations to this view: learning in the brain is a complex phenomenon involving local adaptation to sensory statistics by networks which are also constantly influenced by top-down attention and reward signals, such as those that come from tactile experiences during active exploration, or from internal goal acquisition. In humans these top-down influences are only occasionally a result of adult supervision.
Another trend which has enabled greater use of deep learning is the huge improvement in hardware resources over the last few years, where many high performance GPUs can be combined in relatively inexpensive machines.
The training phase of learning is very time consuming but maps well onto modern number-crunching platforms, enabling larger networks to be trained in a reasonable time. Once deep networks have been trained, a simple feed-forward pass of data through these networks can be done quickly, e.g. allowing for real-time object recognition from video on mobile devices.
Restricted Boltzmann machines (RBMs) are a commonly used architecture for layers in deep learning. They are a type of undirected probabilistic graphical model containing visible units and hidden units. Any particular choice of RBM weights defines a specific probability distribution over the input space, and learning adjusts the weights to match the model's predicted distribution to the actual distribution of the data.
Multiple levels of RBMs can be stacked together and trained greedily in an unsupervised manner. For a long time it was intractable to train these networks in a reasonable time, but a recent advance called the contrastive divergence algorithm, which makes use of a clever form of Gibbs sampling, was found to be effective. It gives good results, despite using a somewhat mysterious approximation to the correct objective function.
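A rough sketch of one CD-1 update for a tiny binary RBM looks like this (biases, mini-batching, and momentum, which real implementations use, are omitted for brevity, and the "data" here is random):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # RBM weights (no biases here)

def cd1_step(v0, W, lr=0.1):
    # Up pass: hidden probabilities given the data, then a binary sample.
    h_prob = logistic(v0 @ W)
    h0 = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Down pass: one Gibbs step reconstructs the visible units (mean-field),
    # then the hidden probabilities are recomputed from the reconstruction.
    v1 = logistic(h0 @ W.T)
    h1 = logistic(v1 @ W)
    # Nudge the model towards the data: positive phase minus negative phase.
    return W + lr * (v0.T @ h_prob - v1.T @ h1) / len(v0)

# Toy binary "data" (random here; real data would have structure to learn).
V = (rng.random((50, n_visible)) < 0.5).astype(float)
for _ in range(10):
    W = cd1_step(V, W)
print(W.shape)
```

The trick is that one Gibbs step stands in for the full (intractable) sampling from the model distribution, which is the "somewhat mysterious approximation" mentioned above.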
For training of networks on image data, where the types of statistical regularities are expected to be translation invariant, convolutional neural networks have found much favor because they greatly reduce the number of parameters that need to be learned. This type of network involves sets of weights that are applied simultaneously to many locations in an image, and so the algorithm does not need to learn that the statistics of the left of the image are the same as those at the right.
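The parameter saving is easy to see in a sketch: a single small filter slides over the whole image, so the layer learns 9 shared weights instead of the hundreds of thousands a fully connected layer with the same output size would need. (The direct loop below is for clarity; real implementations use optimized routines.)

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Apply one shared filter at every image location ('valid' convolution)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(28, 28))
kernel = np.ones((3, 3)) / 9.0          # a 3x3 averaging filter: 9 shared weights
out = conv2d_valid(image, kernel)

conv_params = kernel.size               # 9 learned weights, shared everywhere
fc_params = image.size * out.size       # a dense layer to the same output size
print(out.shape, conv_params, fc_params)
```

Because the same weights are applied everywhere, statistics learned on the left of the image automatically apply on the right, exactly as described above.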
Some of the best results in image recognition have been obtained using convolutional neural networks that were trained with sparseness priors, which reduce the number of units that can be simultaneously active and encourage the network to learn important non-Gaussian statistics from the example images.
Other example deep learning architectures include sparse or de-noising auto-encoders which attempt to preserve information content by always being able to reconstruct their own inputs. These can be stacked to obtain multiple layers and are very effective at recognition tasks.
An important distinction in deep learning is between algorithms that use probabilistic graphical models and those that use energy models. Probabilistic models (such as those used by RBMs) associate a probability with each joint configuration of the input and hidden node states. The problem here is that the sum of all these probabilities must add up to one, which causes a complex normalization term to appear in the equations and creates tremendous problems for optimization. Energy models on the other hand associate an energy with each configuration, without the requirement for normalization, and optimization proceeds to attempt to locate the configuration with lowest energy. There are various trade-offs between these two approaches.
Many great researchers are involved in deep learning, but Yann LeCun (Director of AI Research, Facebook), Geoffrey Hinton (University of Toronto, Google), Yoshua Bengio (University of Montreal), and Andrew Ng (Stanford University, Baidu), are some important contributors to the area.
In 2012, Google trained a large scale unsupervised 9-layer auto-encoder network on 10 million images sampled from YouTube videos. They found that some of the neurons became specialized for face representation or body representation, and they also found neurons that could discriminate for cat faces, which are common in YouTube videos. These neurons were obtained without any class labeling and so demonstrate that the system is able to learn representational modes based on statistical clusters in the training set. You can read their paper here, and also much more about the breadth of deep learning at Google here.
Each year, the ImageNet organization runs an image recognition challenge. Recently the winners have all been deep-network-based algorithms, which are approaching human performance on some tasks. In his keynote, Jeff Dean of Google claimed that anything humans can recognize in 100 ms, the right deep network can already pretty much recognize at comparable error rates. That is, pre-attentive vision is almost solved.
Deep learning is showing great promise in natural language understanding. It has become possible to learn embeddings from words or paragraphs into some specific multi-dimensional space that encodes certain concepts. For example, words that have similar meanings end up being close together in this space, and often the distances between words like “man” and “woman” are similar to the distances between “uncle” and “aunt” or “king” and “queen”. This has relevance to machine translation. Google has made public a piece of code called word2vec which can generate these word embeddings in English. You can download it here.
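The famous analogy arithmetic can be illustrated with hand-made toy vectors (these are NOT real word2vec embeddings, just a 3-D sketch of the idea that consistent offsets encode relations):

```python
import numpy as np

# Hand-made 3-D vectors (not real word2vec output) where one axis loosely
# encodes "gender" and another "royalty", so offsets between pairs line up.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def nearest(v, exclude):
    """Return the vocabulary word whose embedding is closest to v."""
    return min((w for w in emb if w not in exclude),
               key=lambda w: np.linalg.norm(emb[w] - v))

v = emb["king"] - emb["man"] + emb["woman"]
result = nearest(v, exclude={"king", "man", "woman"})
print(result)  # → queen
```

In real embeddings the spaces have hundreds of dimensions and the offsets only hold approximately, but the same nearest-neighbor query is how the analogies are actually evaluated.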
In 2014, Microsoft unveiled their real-time language translator for Skype, which appears to make use of deep learning. You can read more about the history of its development here. Deep learning can be used for both the audio recognition part of the task, and also for the translation of semantic language structure.
Other areas where deep learning shows promise are in financial analysis, e.g. predicting future stock prices (where good results are not typically published); user sentiment analysis, such as extracting information about the emotions people display when they write product reviews or when they post on Twitter; or product prediction, where the system learns what user preferences are likely to be, based on past product purchases. Facebook has started a big deep learning initiative to tease out all kinds of useful information from user generated content, presumably for marketing purposes.
My own interests in deep learning relate to creating whole new classes of products that benefit from richer interactions with humans. These vary from responsive apps to "internet of things" devices and robots. In these scenarios, the deep learning approach offers a way to integrate information across sensory modalities, that is, combining information about light, acceleration, touch, sound, or any other sensor data, to detect events in a device's environment that can drive more meaningful and convenient user interaction.
If you believe the hype about deep learning you would think we almost have conscious machines and a computer can recognize anything that a human can discern. This is not yet so. There are a number of reasons why we are barely scratching the surface of machine intelligence.
In every era of computer science, people draw comparisons between the exciting new field and the way the brain works. Brain comparisons are the stuff of pop-science reporting. Back in the early 1990s, when Kalman filters were becoming popular in computer vision, there were plenty of comparisons made to the neuroscience of eye tracking and human vision, and people claimed that the human brain carries out a form of Kalman filtering. It is now clear that some brain areas do perform processing that loosely resembles a Kalman filter, but the hyperbole at the time was unwarranted.
Now we are seeing various startups in machine learning, and especially in neurally inspired hardware, making lots of comparisons to the brain. The impression given by the publicity is that once we build these neuromorphic chips we will have intelligent artificial brains. But there is much more to understand about actual intelligence, and the theoretical developments are still very young.
Machine learning classification errors alone show that we are still some way from capturing the kinds of visual scene representations used by humans.
At the moment, deep networks largely make use of feedforward processing. Given an input, the network computes a new representation at its output. Biological neural networks are far more dynamic: they involve recurrent connections that feed information forward from the senses to areas that carry out abstract processing, laterally within representational layers, and backwards from these areas towards the input layers. This information chatter forms a complex dynamic system that oscillates and modulates over time — it is not merely a static inference like that of a graphical model.
The balance between feedforward and feedback information flow in the brain is always being adjusted dynamically depending on task context as prior expectations from higher levels of processing influence the detail of processing at lower levels. In a similar way, there are complex interactions related to saliency (recognizing what is important) and attention (the process by which regions of the sensory input are dynamically selected for preferential processing at later levels.)
Time series analysis is also an integral part of the brain, with audition, language, motion perception, body movement awareness, and temporal narratives of experience forming a large part of our structure. None of these areas of functionality have really been integrated in a meaningful way in any current neural network.
One very immature area of theoretical study is reinforcement learning, yet it is of central importance in the brain and is likely to be essential to the creation of future robotics. Inference, task planning, memory, and mapping are other areas that would need to be integrated within a heterogeneous system in order to build more intelligent machines.
Even though there is a long way to go, huge momentum is building for future development in deep learning and related disciplines. We can certainly expect much more work, and many more exciting results, in this area.