Here you can find my solutions to the exercises in Michael Nielsen’s free online book Neural Networks and Deep Learning.
I decided to do those exercises out in the open to force myself to, well, do them but also prove to myself that I actually understood the material. I have a personal and professional interest in neural networks and I want to get it right. That being said, there is no guarantee that I will succeed in this endeavour and I by no means claim to have all the right answers. If you find any mistakes, please consider raising an issue on the GitHub repository of this blog.
Perceptrons and Sigmoid Neurons
Sigmoid Neurons Simulating Perceptrons, Part I
ExerciseSuppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, . Show that the behaviour of the network doesn't change.
Before I solve this exercise, I want to recap what we have learned so far. A perceptron is a type of an artificial neuron that looks like this:
A perceptron takes an input vector , a weight vector , and a bias and produces a single binary where
To show the above, I will show that the behavior of one single perceptron will not change if we multiply its weight vector and bias with a positive constant .
The output of the perceptron would be:
I first used the distributive law and then the fact that a positive number does not change the sign of a multiplication. Thus we arrived at the definition of a perceptron’s output. Since the behavior of a single perceptron does not change by the introduction of , the behavior of a neural network as a whole would not change.
Sigmoid Neurons Simulating Perceptrons, Part II
ExerciseSuppose we have the same setup as the last problem—a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that for the input to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant . Show that in the limit as the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when for one of the perceptrons?
A sigmoid neuron looks exactly like a perceptron but the output now computes as where is called the sigmoid function:
In particular that means that the output can be any value between and instead of either or .
As in the exercise above, let’s assess the behavior of a single sigmoid neuron when weight and bias are multiplied with a positive :
As approaches infinity we get:
That is exactly the same behavior a perceptron would have for . Since this is true for one single sigmoid neuron, it also holds true for a full network of those neurons.
For , on the other hand, we get:
A perceptron would output here.
A Simple Network to Classify Handwritten Digits
ExerciseThere is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least , and incorrect outputs have activation less than .
I quickly came up with the weight vectors for each of the four output neurons where is for the neuron representing the most significant bit (MSB) and is for the neuron representing the least significant bit (LSB).
Consider for example the numbers 8 and 9 which are 1000 and 1001 in binary notation. The yellow bits are the MSBs and hence vector needs a in position and to activate this bit. The weighted input is then To be precise we have for or and for other digits. for and and approximately for other digits. The other digit inputs and weights work accordingly. That is a great start.
Now we have to figure out which bias to choose for each neuron. Let’s have a look at the sigmoid function for that:
The sigmoid function centers around and with we shift the activation for an output that “sets a bit” (where ) to . Similarly, the activation of a neuron that signals an “unset bit” (where ) would output .
We could now classify a neuron output greater than as “this bit is set” and an output less than as “this bit is not set” and be done with it. But let’s try to find the factor that gives us the same precision that the neurons in the hidden layers have.
We are looking for a factor such that an output of “sets a bit” and at the same time an output of signals an “unset bit”.
Let’s consider the case of a set bit first. We can set since this is the minimum value possible based on the precision of the old output layer and a bigger value would just increase the value of .
Now I will look at the output for an unset bit where I we want to achieve an activation of less than . We can set since across our four weight vectors we have a maximum of five contributing s in (and each has an activation of less than according to the precision of the old output layer in the exercise text) and:
That gives us a which:
With no loss of generality, I choose (because it looks better than some crooked number). That gives us a bias of and those weight vectors:
Learning with Gradient Descent
If you need a bit more intuition and visuals around gradient decent I can recommend Gradient descent, how neural networks learn by 3Blue1Brown and the lectures 2.5, 2.6, and 2.7 of Andrew Ng’s Machine Learning course.
Optimality of Gradient Decent
ExerciseProve the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.
Let’s recap what this chapter is all about. Please note that I will tackle this topic in a general manner, i.e. for the moment we can forget about neural networks, weights, and biases. All we need to know is that we have a smooth cost function where is the input vector. We want to minimize and use gradient decent to change step by step such that it approaches a local minimum of .
Fortunately, smart people have shown that a function’s gradient gives you the direction of steepest ascent and the change in a function can be approximated by the inner product of the change in the input vector and that I won’t go into the nature of this gradient or how it can be computed. All that matters for me right now is that there exists a way to compute .:
If we choose with we get:
This is great! Since and this guarantees that , i.e. will always decrease.
To put that more visually, we take the direction that increases most quickly and make sure we go exactly the opposite way to decrease most quickly. The exercise now is to show that the direction of the negative gradient actually gives us the “biggest” decrease possible.
Let’s revisit the equation for the approximate change in with :
Now we consider another vector such that and . In particular, that means that . We can use that in the equation above:
The last line is explained by the Cauchy-Schwarz inequation that states that which implies that , or .
And there we have it: We have shown that . That means the change gives us a “bigger” decrease in than any other vector .
To be continued…