Derivatives of Deep Neural Network Loss Functions
This post is inspired by a blog post by Brandon Da Silva, which works through all of the derivations.
That post goes through several neural network architectures, activation functions, and loss functions, and derives the gradient of the loss with respect to the linear output \(z\). The summary is below:
Activation | Loss (\(\mathcal{L}\)) | \(\frac{\partial \mathcal{L}}{\partial z}\) |
---|---|---|
None | MSE | \(\hat{y} - y\) |
Sigmoid | Binary Cross Entropy | \(\hat{y} - y\) |
Softmax | NLL | \(\hat{y} - y\) |
As you can see, all of these activation/loss combinations have the same derivative.
Another thing mentioned in the blog is that if you combine a sigmoid activation with MSE, you get a different derivative (derived in the L2 loss with sigmoid section below). This might be the reason why such networks don't learn well.
Anyway, below are the derivations for the different loss functions.
Linear Regression
\[ \begin{align*} &\text{Linear Equation}: &&z = Xw + b \\[1.5ex] &\text{Activation Function}: &&\text{None} \\[1.5ex] &\text{Prediction}: &&\hat{y} = z \\[0.5ex] &\text{Loss Function}: &&\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2 \end{align*} \]
We are interested in calculating the derivative of the loss with respect to \(z\). Throughout this post, we will do this by applying the chain rule: \[ \frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \] First we will calculate the partial derivative of the loss with respect to our prediction: \[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y \]
Next, although silly, we calculate the partial derivative of our prediction with respect to the linear equation. Of course since the linear equation is our prediction (since we’re doing linear regression), the partial derivative is just 1: \[ \frac{\partial \hat{y}}{\partial z} = 1 \] When we combine them together, the derivative of the loss with respect to the linear equation is: \[ \frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y \]
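As a quick sanity check, here is the same calculation in sympy (a minimal sketch; the symbols mirror the ones used in the Sympy Coding section at the end):

```python
import sympy as sym

y, y_hat = sym.symbols('y, y_hat')

# MSE loss; with no activation the prediction is the linear output itself,
# so d(loss)/d(y_hat) is also d(loss)/dz
loss = sym.Rational(1, 2) * (y_hat - y)**2
sym.diff(loss, y_hat)  # expected: y_hat - y
```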
Logistic Regression / Binary cross entropy with logits
\[ \begin{align*} &\text{Linear Equation}: &&z = Xw + b \\[0.5ex] &\text{Activation Function}: &&\sigma(z) = \frac{1}{1 + e^{-z}} \\[0.5ex] &\text{Prediction}: &&\hat{y} = \sigma(z) \\[1.5ex] &\text{Loss Function}: &&\mathcal{L} = -(y\log\hat{y} + (1-y)\log(1-\hat{y})) \end{align*} \]
The partial derivative of the loss with respect to our prediction is pretty simple to calculate:
\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \]
Next we will calculate the derivative of our prediction with respect to the linear equation. We can use a little algebra to move things around and get a nice expression for the derivative: \[ \begin{align*} \frac{\partial \hat{y}}{\partial z} &= \frac{\partial}{\partial z}\left[\frac{1}{1 + e^{-z}}\right] \\[0.75ex] &= \frac{e^{-z}}{(1 + e^{-z})^2} \\[0.75ex] &= \frac{1 + e^{-z} - 1}{(1 + e^{-z})^2} \\[0.75ex] &= \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \frac{1}{(1 + e^{-z})^2} \\[0.75ex] &= \frac{1}{1 + e^{-z}} - \frac{1}{(1 + e^{-z})^2} \\[0.75ex] &= \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right) \\[0.75ex] &= \hat{y}(1 - \hat{y}) \end{align*} \]
Isn’t that awesome?! Anyways, enough of my love for math, let’s move on. Now we’ll combine the two partial derivatives to get our final expression for the derivative of the loss with respect to the linear equation.
\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial z} &= \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1 - \hat{y}) \\[0.75ex] &= -\frac{y}{\hat{y}}\hat{y}(1 - \hat{y}) + \frac{1-y}{1-\hat{y}}\hat{y}(1 - \hat{y}) \\[0.75ex] &= -y(1 - \hat{y}) + (1-y)\hat{y} \\[0.75ex] &= -y + y\hat{y} + \hat{y} - y\hat{y} \\[0.75ex] &= \hat{y} - y \end{align*} \]
Softmax NLL
\[ \begin{align*} &\text{Linear Equation}: &&z = Xw + b \\[0.5ex] &\text{Activation Function}: &&\varphi(z_i) = \frac{e^{z_i}}{\sum_n e^{z_n}} \\[0.5ex] &\text{Prediction}: &&\hat{y_i} = \varphi(z_i) \\[1.5ex] &\text{Loss Function}: &&\mathcal{L} = -\sum_i y_i\log\hat{y_i} \end{align*} \]
Let’s calculate the first partial derivative of the loss with respect to our prediction. Only the \(i\)-th term of the sum depends on \(\hat{y_i}\), so: \[ \frac{\partial \mathcal{L}}{\partial \hat{y_i}} = -\frac{y_i}{\hat{y_i}} \]
That was pretty easy! Now let’s tackle the monster… the partial derivative of our prediction with respect to the linear equation: \[ \frac{\partial \hat{y_i}}{\partial z_j} = \frac{\sum_n e^{z_n} \frac{\partial}{\partial z_j}[e^{z_i}] - e^{z_i} \frac{\partial}{\partial z_j}\left[\sum_n e^{z_n}\right]}{\left(\sum_n e^{z_n}\right)^2} \]
It is important to realize that we need to break this down into two parts: the first is when \(i = j\) and the second is when \(i \neq j\). When \(i = j\): \[ \begin{align*} \frac{\partial \hat{y_i}}{\partial z_j} &= \frac{e^{z_j}\sum_n e^{z_n} - e^{z_j}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex] &= \frac{e^{z_j}\sum_n e^{z_n}}{\left(\sum_n e^{z_n}\right)^2} - \frac{e^{z_j}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex] &= \frac{e^{z_j}}{\sum_n e^{z_n}} - \frac{e^{z_j}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex] &= \frac{e^{z_j}}{\sum_n e^{z_n}} - \frac{e^{z_j}}{\sum_n e^{z_n}} \frac{e^{z_j}}{\sum_n e^{z_n}} \\[0.75ex] &= \frac{e^{z_j}}{\sum_n e^{z_n}} \left(1 - \frac{e^{z_j}}{\sum_n e^{z_n}}\right) \\[0.75ex] &= \hat{y_j}(1 - \hat{y_j}) \end{align*} \]
When \(i \neq j\): \[ \begin{align*} \frac{\partial \hat{y_i}}{\partial z_j} &= \frac{0 - e^{z_i}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex] &= - \frac{e^{z_i}}{\sum_n e^{z_n}} \frac{e^{z_j}}{\sum_n e^{z_n}} \\[0.75ex] &= - \hat{y_i}\hat{y_j} \end{align*} \]
We can therefore combine them using the multivariate chain rule, \(\frac{\partial \mathcal{L}}{\partial z_j} = \sum_i \frac{\partial \mathcal{L}}{\partial \hat{y_i}} \frac{\partial \hat{y_i}}{\partial z_j}\), splitting the sum into the \(i = j\) term and the \(i \neq j\) terms: \[ \frac{\partial \mathcal{L}}{\partial z_j} = - \hat{y_j}(1 - \hat{y_j})\frac{y_j}{\hat{y_j}} - \sum_{i \neq j} \frac{y_i}{\hat{y_i}}(-\hat{y}_i\hat{y_j}) \]
The first term is the \(i = j\) case, while the sum covers the \(i \neq j\) cases. You will notice that we can cancel out a few terms, so the equation becomes: \[ \frac{\partial \mathcal{L}}{\partial z_j} = - y_j(1 - \hat{y_j}) + \sum_{i \neq j} y_i\hat{y_j} \]
These next few steps trip some people up, so pay close attention. Note that the first term carries the subscript \(j\) rather than \(i\) precisely because it comes from the \(i = j\) case.
Next, we are going to multiply out the first term to get:
\[ \frac{\partial \mathcal{L}}{\partial z_j} = - y_j + y_j\hat{y_j} + \sum_{i \neq j} y_i\hat{y_j} \]
We will then factor out \(\hat{y_j}\) to get: \[ \frac{\partial \mathcal{L}}{\partial z_j} = - y_j + \hat{y_j}\left(y_j + \sum_{i \neq j} y_i\right) \]
This is where the magic happens. The expression inside the bracket is just the sum of all the entries of \(y\): \(y_j + \sum_{i \neq j} y_i = \sum_i y_i\). Since \(y\) is a one-hot encoded vector, \(\sum_i y_i = 1\).
So our final partial derivative equals: \[ \frac{\partial \mathcal{L}}{\partial z_j} = -y_j + \hat{y_j} = \hat{y_j} - y_j \] or, written as a vector over all classes, \(\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y\).
L2 loss with sigmoid
\[ \begin{align*} &\text{Linear Equation}: &&z = Xw + b \\[0.5ex] &\text{Activation Function}: &&\sigma(z) = \frac{1}{1 + e^{-z}} \\[0.5ex] &\text{Prediction}: &&\hat{y} = \sigma(z) \\[1.5ex] &\text{Loss Function}: &&\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2 \end{align*} \]
First we will calculate the partial derivative of the loss with respect to our prediction: \[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y \]
Next we will calculate the derivative of our prediction with respect to the linear equation. We can use a little algebra to move things around and get a nice expression for the derivative: \[ \begin{align*} \frac{\partial \hat{y}}{\partial z} &= \frac{\partial}{\partial z}\left[\frac{1}{1 + e^{-z}}\right] \\[0.75ex] &= \hat{y}(1 - \hat{y}) \end{align*} \]
Now we’ll combine the two partial derivatives to get our final expression for the derivative of the loss with respect to the linear equation.
\[ \frac{\partial \mathcal{L}}{\partial z} = \left(\hat{y} - y \right) \hat{y}(1 - \hat{y}) \] Unlike the other combinations, this does not reduce to \(\hat{y} - y\): the extra \(\hat{y}(1 - \hat{y})\) factor goes to zero when the sigmoid saturates, which is why this pairing can learn very slowly.
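A quick sympy check of this combination (a minimal sketch, writing the loss directly in terms of \(z\)):

```python
import sympy as sym

y, z = sym.symbols('y, z')
sigmoid = 1 / (1 + sym.exp(-z))

# MSE loss applied on top of a sigmoid activation
loss = sym.Rational(1, 2) * (sigmoid - y)**2

# dL/dz keeps the extra sigmoid*(1 - sigmoid) factor instead of collapsing to y_hat - y
sym.simplify(sym.diff(loss, z) - (sigmoid - y) * sigmoid * (1 - sigmoid))  # expected: 0
```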
Log Softmax NLL
\[ \begin{align*} &\text{Linear Equation}: &&z = Xw + b \\[0.5ex] &\text{Activation Function}: &&\varphi(z_i) = \log\left(\frac{e^{z_i}}{\sum_n e^{z_n}}\right) = z_i - \log\sum_n e^{z_n} \\[0.5ex] &\text{Prediction}: &&\hat{y_i} = \varphi(z_i) \\[1.5ex] &\text{Loss Function}: &&\mathcal{L} = -\sum_i y_i\hat{y_i} \end{align*} \]
Two things are worth being careful about here. First, \(\log\frac{e^{z_i}}{\sum_n e^{z_n}} = z_i - \log\sum_n e^{z_n}\), not \(z_i - \sum_n z_n\). Second, since \(\hat{y_i}\) is already a log-probability, NLL simply multiplies it by the label; there is no extra \(\log\) in the loss.
The partial derivative of the loss with respect to our prediction is simply: \[ \frac{\partial \mathcal{L}}{\partial \hat{y_i}} = -y_i \]
Next, the partial derivative of our prediction with respect to the linear equation, writing \(s_j = \frac{e^{z_j}}{\sum_n e^{z_n}}\) for the ordinary softmax: \[ \frac{\partial \hat{y_i}}{\partial z_j} = \frac{\partial z_i}{\partial z_j} - \frac{\partial}{\partial z_j}\log\sum_n e^{z_n} = \frac{\partial z_i}{\partial z_j} - s_j \]
When \(i = j\): \[ \frac{\partial \hat{y_i}}{\partial z_j} = 1 - s_j \] When \(i \neq j\): \[ \frac{\partial \hat{y_i}}{\partial z_j} = 0 - s_j = -s_j \]
We can therefore combine them with the chain rule: \[ \begin{align*} \frac{\partial \mathcal{L}}{\partial z_j} &= -y_j(1 - s_j) - \sum_{i \neq j} y_i(-s_j) \\[0.75ex] &= -y_j + y_j s_j + \sum_{i \neq j} y_i s_j \\[0.75ex] &= -y_j + s_j\sum_i y_i \\[0.75ex] &= s_j - y_j \end{align*} \] since \(y\) is one-hot. So log-softmax with NLL gives the same gradient as softmax with NLL: the softmax probability minus the target.
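As a sanity check, here is a small sympy sketch of the log-softmax case (three classes and a one-hot target on class 0; the symbol names are just for illustration):

```python
import sympy as sym

# three logits
z = sym.symbols('z0 z1 z2')
denom = sum(sym.exp(zi) for zi in z)

# log-softmax predictions (these are already log-probabilities)
log_probs = [zi - sym.log(denom) for zi in z]

# NLL with a one-hot target on class 0: L = -log_probs[0]
loss = -log_probs[0]

# dL/dz0 should be softmax(z0) - 1, and dL/dz1 should be softmax(z1) - 0
softmax = [sym.exp(zi) / denom for zi in z]
(sym.simplify(sym.diff(loss, z[0]) - (softmax[0] - 1)),
 sym.simplify(sym.diff(loss, z[1]) - softmax[1]))  # expected: (0, 0)
```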
Sympy Coding
```python
import sympy as sym

y, y_hat, z = sym.symbols('y, y_hat, z')
```
Sigmoid BCE
```python
sigmoid = 1 / (1 + sym.exp(-z))
sigmoid
```
```
1/(1 + exp(-z))
```
```python
loss = -(y * sym.log(y_hat) + (1 - y) * sym.log(1 - y_hat))
loss
```
```
-y*log(y_hat) - (1 - y)*log(1 - y_hat)
```
```python
sym.diff(loss, y_hat)
```
```
-y/y_hat - (y - 1)/(1 - y_hat)
```
```python
sym.diff(sigmoid, z)
```
```
exp(-z)/(1 + exp(-z))**2
```
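Multiplying the two pieces together and substituting \(\hat{y} = \sigma(z)\) should collapse everything to \(\hat{y} - y\). A quick check, continuing with the symbols defined above:

```python
# chain rule: dL/dz = dL/dy_hat * dy_hat/dz, with y_hat = sigmoid(z)
dL_dz = (sym.diff(loss, y_hat) * sym.diff(sigmoid, z)).subs(y_hat, sigmoid)

sym.simplify(dL_dz - (sigmoid - y))  # expected: 0
```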
Softmax
```python
softmax = sym.exp(z)  # numerator only -- z is still a single scalar symbol here
```
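A single scalar \(z\) is not enough to exercise the softmax, so here is a sketch with three logits that checks the two Jacobian cases (\(i = j\) and \(i \neq j\)) derived earlier; the symbol names are just for illustration:

```python
z0, z1, z2 = sym.symbols('z0 z1 z2')
denom = sym.exp(z0) + sym.exp(z1) + sym.exp(z2)

# softmax probabilities for the first two classes
y_hat0 = sym.exp(z0) / denom
y_hat1 = sym.exp(z1) / denom

# Jacobian entries: d(y_hat_i)/d(z_j) = y_hat_i*(1 - y_hat_j) when i = j,
# and -y_hat_i*y_hat_j when i != j
(sym.simplify(sym.diff(y_hat0, z0) - y_hat0 * (1 - y_hat0)),
 sym.simplify(sym.diff(y_hat0, z1) + y_hat0 * y_hat1))  # expected: (0, 0)
```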