## Introduction

I will assume that you have seen some calculus, including multivariable calculus. That is you know how to differentiate a differentiable function $f\colon \Bbb R \to \Bbb R$, to obtain a new function $$\frac{\partial f}{\partial x} \colon \Bbb R \to \Bbb R.$$ You also know how to differentiate a multivariable function $f\colon \Bbb R^m \to \Bbb R^n$, to obtain a map $$Jf\colon \Bbb R^m \to \mathrm{Hom}(\Bbb R^m, \Bbb R^n),$$ which takes a point $x\in \Bbb R^m$ to the Jacobian at $x$, which is an $(m,n)$-matrix; alternatively you can think of the Jacobian as a matrix of functions with the $(i,j)$-entry being $\frac{\partial f_i}{\partial x_j}\colon \Bbb R^m \to \Bbb R.$

The Jacobian is a convenient form for packaging the derivative of a vector valued function with respect to a list of variables arranged into a vector. Many of us learn the basic results (the product rule, the sum rule, the chain rule) very well for single variable calculus and, hopefully, pretty well for multivariable calculus. We may have less experience using these compact forms in practice.

While this knowledge is very useful for considering vector valued functions indexed by vectors of variables, it can be a bit tricky to differentiate matrix valued functions with respect to matrices or vectors of variables and get a useful compact form.

### Potential objection

A bright student might observe that an element of a matrix is an element of $\Bbb R^m\otimes \Bbb R^n\cong \Bbb R^{mn}$ and hence we can always regard an $(m,n)$-matrix as an $mn$-vector^{1}. This is true, but we usually organize data into matrices because that is the most natural structure for them to have, e.g., the Jacobian above. Here are some similar naturally occurring structures:

- We have a table with 100 entries. On each row is data from some sample, say temperature and pressure. This naturally fits into a $(100,2)$-matrix, whose rows are the samples and whose columns are the respective measurements. So we can recover each datum, by using two numbers: the row index and the column index.
- We have a list of 100 grayscale images 640×480 images. To recover a datum (the darkness of a pixel in a sample) in this table, we would use
*three*integers:- the image number,
- the $x$-coordinate, and
- the $y$-coordinate.

This is encoded as a*tensor*of shape $(100,640,480)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}$. While we could encode this as a $307200000$-vector, it would make it much more difficult to understand and meaningfully manipulate.

- We have a list of 100 color (i.e., RGB) images 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample) in this table, we would use
*four*integers:- the image number,
- the $x$-coordinate,
- the $y$-coordinate, and
- which color we are interested in (e.g., red = 0, green = 1, blue = 2).

This is encoded as a tensor of shape $(100,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$.

- We have a list of 100 color (i.e., RGB) videos each consisting of 18000 frames of 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample at a particular frame in the table) in this table, we would use
*five*integers:- the image number,
- the frame number,
- the $x$-coordinate,
- the $y$-coordinate, and
- which color we are interested in (e.g., red = 0, green = 1, blue = 2).

This is encoded as a tensor of shape $(100,18000,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{18000}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$. Again, it would be very unnatural to work with this as a single vector.

### Definition

Suppose we have a function $F$ taking values in tensors of shape $(n_1,\cdots, n_\ell)$ which depends on variables indexed by a tensor $X$ of shape $(m_1,\cdots, m_k)$. The derivative $\frac{\partial F}{\partial X}$ is a function from $\Bbb R^{m_1}\otimes \cdots \otimes \Bbb R^{m_k}$ to tensors of shape $(n_1,\cdots,n_\ell,m_1,\cdots,m_k)$ whose $(i_1,\cdots, i_\ell, j_1, \cdots, j_k)$-entry is $\frac{\partial F_{i_1,\cdots, i_\ell}}{\partial X_{j_1,\cdots, j_k}}(-)$.

### Some scandalous abuses of notation

We will typically abuse notation and let $n$-vectors also denote column vectors, i.e., tensors of shape $(n,1)$.

For convenience, many authors (including this one), will sometimes take tensors of the form $(\cdots, 1, \cdots)$ and regard them as tensors of the form $(\cdots,\cdots)$ (i.e., we omit the 1 term by ‘squeezing the tensor’. This abuse of notation is both convenient and confusing. Avoiding this abuse tends to cause ones to accumulate in the shape of our tensors and these one’s do not add any value to our interpretation. The best way to follow what is going on is to check it for yourself.

A less scandalous abuse of notation is to regularly add transposes to tensors of $(1,1)$ when convenient.

## Some results

Since the only way to get used to tensor differentiation is to do some work with it yourself, we leave the following results as exercises.

### Exercise

- Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables. Show that $\frac{\partial f}{\partial X}=Jf$.
- Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=Ax$ for an $(n,m)$-matrix $A$. Show that $\frac{\partial f}{\partial X}=A$ (a constant function).
- Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA$ for an $(m,n)$-matrix $A$. Show that $\frac{\partial f}{\partial X}$ is a function valued in $(1,n,m)$-tensors whose $(1,i,j)$ entry is $(A^T)_{i,j}$.
- Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA x$ for an $(m,m)$-matrix $A$. Show that $\frac{\partial

f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose $(1,i)$-entry is $((A+A^T)x)_i$. - Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=(Ax-y)^T(Ax-y)$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Show that $\frac{\partial

f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$ - Suppose that $f\colon \Bbb R^m\to \Bbb R^m$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=2(Ax-y)^TA$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Here we regard the function as taking values in $(m)$-tensors. Show that $\frac{\partial f}{\partial X}$ is the $(m,m)$-tensor $2A^TA$.

f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$

- To simplify our notation, $\otimes := \otimes_{\Bbb R}$. ↩