NMT Tutorial 3 Extension a. Matrix Calculus Basics for Deep Learning

Preface: Matrix calculus is one of the mathematical foundations of deep learning, yet it is barely covered in undergraduate computer science (and other non-math) programs, so anyone who wants to learn it is largely left to self-study. The three best resources I had read before are the Wikipedia entry on Matrix Calculus, The Matrix Cookbook, and Towser's 《机器学习中的矩阵/向量求导》 (Matrix/Vector Derivatives in Machine Learning). The first two, however, are English references that focus mainly on results; they work well as dictionaries, but after reading them I was still left "knowing the what without the why" (the Wikipedia entry does sketch some derivations, though rather briefly). Towser's article is very well written, but my math is weak, and my head still spins once it reaches the tensor-related material.

A few days ago, while sorting through my Weibo bookmarks, I stumbled upon an article once recommended by Professor Chen Guang of BUPT (爱可可爱生活 on Weibo): The Matrix Calculus You Need For Deep Learning by Terence Parr and Jeremy Howard. It fits my needs well (proving once again that bookmarks you never revisit are bookmarks wasted) and is somewhat more elementary than Towser's article. This post is a set of notes from my reading of that paper.

Preliminaries

For derivatives of single-variable functions, the following rules hold (throughout, \(x\) is taken to be the independent variable); a quick symbolic check of these rules follows the list:

  • Derivative of a constant is 0: \(f(x) = c \rightarrow df/dx = 0\)
  • Constant multiple rule: \((cf(x))' = c\frac{df}{dx}\)
  • Power rule: \(f(x) = x^n \rightarrow \frac{df}{dx} = nx^{n-1}\)
  • Sum rule: \(\frac{d}{dx}(f(x) + g(x)) = \frac{df}{dx} + \frac{dg}{dx}\)
  • Difference rule: \(\frac{d}{dx}(f(x) - g(x)) = \frac{df}{dx} - \frac{dg}{dx}\)
  • Product rule: \(\frac{d}{dx}(f(x)\cdot g(x)) = f(x)\cdot \frac{dg}{dx} + \frac{df}{dx}\cdot g(x)\)
  • Chain rule: \(\frac{d}{dx}(f(g(x))) = \frac{df(u)}{du}\cdot \frac{du}{dx}\), where \(u=g(x)\)
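
As promised above, here is a quick symbolic sanity check of the product and chain rules. The sympy snippet is my own sketch (sympy is not used in the paper); the specific functions \(x^2\) and \(\sin x\) are arbitrary illustrations:

```python
# Symbolic check of the product rule and the chain rule with sympy.
import sympy as sp

x, u = sp.symbols('x u')
f = x**2
g = sp.sin(x)

# Product rule: (f*g)' == f*g' + f'*g
assert sp.simplify(sp.diff(f * g, x) - (f * sp.diff(g, x) + sp.diff(f, x) * g)) == 0

# Chain rule: d/dx f(g(x)) == f'(u)|_{u=g(x)} * g'(x), with f(u) = u**2
fu = u**2
assert sp.simplify(sp.diff(fu.subs(u, g), x) - sp.diff(fu, u).subs(u, g) * sp.diff(g, x)) == 0
print("product rule and chain rule verified")
```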

For functions of two variables we need the notion of a partial derivative. Given a function \(f(x,y)\), to take its partial derivative with respect to \(x\) or \(y\), treat the other variable as a constant (for a function of many variables, when differentiating with respect to one variable, treat all the other variables as constants). The partial derivatives can be collected into the gradient, written \(\nabla f(x,y)\). That is, \[ \nabla f(x,y) = \left[\begin{matrix}\frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y}\end{matrix}\right] \] (Note: the original paper writes the gradient as a row vector and states that it uses the numerator layout. That convention, however, makes the derivative of a scalar with respect to a vector differ in shape from the vector itself, and it also runs against the prevailing notation. These notes therefore convert everything to the more common denominator layout.)
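
As a small concrete example (the function below is a toy example of my own, not one from this section), the gradient can be computed symbolically and laid out as a column vector, matching the denominator-layout convention adopted above:

```python
# Gradient of a two-variable function, written as a column vector (denominator layout).
import sympy as sp

x, y = sp.symbols('x y')
f = 3 * x**2 * y                                       # toy example function
grad = sp.Matrix([[sp.diff(f, x)], [sp.diff(f, y)]])   # [∂f/∂x; ∂f/∂y]
print(grad)                                            # Matrix([[6*x*y], [3*x**2]])
```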

Matrix Calculus

The partial derivatives of a single function with respect to its different variables form the gradient; the partial derivatives of several functions with respect to those variables can be assembled into a matrix called the Jacobian matrix. For example, given two functions \(f\) and \(g\), their gradients can be stacked as \[ J = \left[\begin{matrix}\nabla^\mathsf{T} f(x,y) \\ \nabla^\mathsf{T} g(x, y)\end{matrix}\right] \]

Generalizing the Jacobian

Multiple variables can be packed into a vector: \(f(x_1, x_2, \ldots, x_n) = f(\boldsymbol{x})\) (throughout this post, all vectors are taken to be \(n \times 1\), i.e. \[ \boldsymbol{x} = \left[\begin{matrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{matrix}\right] \] ). Suppose there are \(m\) functions, each of which maps the vector \(\boldsymbol{x}\) to a scalar: \[ \begin{align*} y_1 &= f_1(\boldsymbol{x}) \\ y_2 &= f_2(\boldsymbol{x}) \\ &\vdots \\ y_m &= f_m(\boldsymbol{x}) \end{align*} \] which can be abbreviated as \[ \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}) \] Differentiating \(\boldsymbol{y}\) with respect to \(\boldsymbol{x}\) stacks the derivatives of the individual functions with respect to \(\boldsymbol{x}\) into the Jacobian matrix: \[ \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \left[\begin{matrix}\frac{\partial}{\partial x_1}f_1(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_1(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_1(\boldsymbol{x})\\ \frac{\partial}{\partial x_1}f_2(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_2(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_2(\boldsymbol{x})\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1}f_m(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_m(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_m(\boldsymbol{x})\\ \end{matrix}\right] \]
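
To make the \(m \times n\) layout concrete, the sketch below (my own, with numpy; the specific functions are made up for illustration) builds the Jacobian of a small vector function analytically and checks it against finite differences:

```python
# Jacobian of a vector function y = f(x): row i indexes f_i, column j indexes x_j.
import numpy as np

def f(x):
    # m = 2 functions of n = 3 variables (a made-up example)
    return np.array([x[0] * x[1], x[1] + np.sin(x[2])])

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: perturb one input coordinate at a time."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - y0) / eps
    return J

x = np.array([1.0, 2.0, 0.5])
J_analytic = np.array([[x[1], x[0], 0.0],
                       [0.0,  1.0,  np.cos(x[2])]])
print(np.allclose(jacobian_fd(f, x), J_analytic, atol=1e-4))  # True
```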

Derivatives of Element-wise Operations on Two Vectors

Let \(\bigcirc\) be an operator that acts element-wise on two vectors (for example, \(\bigoplus\) denotes vector addition, i.e. adding two vectors element by element). For \(\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{w}) \bigcirc \boldsymbol{g}(\boldsymbol{x})\), assuming \(n = m = |y| = |w| = |x|\), we can expand \[ \left[\begin{matrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{matrix}\right] = \left[\begin{matrix} f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x}) \\ f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x}) \\ \vdots \\ f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})\end{matrix}\right] \] Differentiating \(\boldsymbol{y}\) with respect to \(\boldsymbol{w}\) and with respect to \(\boldsymbol{x}\) gives two square matrices \[ J_{\boldsymbol{w}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{w}} = \left[\begin{matrix}\frac{\partial }{\partial w_1}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) \\ \frac{\partial }{\partial w_1}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial }{\partial w_1}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) \\ \end{matrix}\right] \]

\[ J_{\boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \left[\begin{matrix}\frac{\partial }{\partial x_1}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \frac{\partial }{\partial x_2}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial x_n}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) \\ \frac{\partial }{\partial x_1}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \frac{\partial }{\partial x_2}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial x_n}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial }{\partial x_1}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \frac{\partial }{\partial x_2}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial x_n}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) \\ \end{matrix}\right] \]

Since \(\bigcirc\) operates element-wise, \(f_i\) is a function of \(w_i\) only, and likewise \(g_i\) is a function of \(x_i\) only. Hence for \(j \not= i\), \(\frac{\partial }{\partial w_j}f_i(w_i) = \frac{\partial }{\partial w_j}g_i(x_i) = 0\), and \(0 \bigcirc 0 = 0\), so both Jacobians above are diagonal matrices, which can be abbreviated as \[ \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{w}} = {\rm diag}\left(\frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)), \frac{\partial}{\partial w_2}(f_2(w_2) \bigcirc g_2(x_2)), \ldots, \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n))\right) \]

\[ \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = {\rm diag}\left(\frac{\partial}{\partial x_1}(f_1(w_1) \bigcirc g_1(x_1)), \frac{\partial}{\partial x_2}(f_2(w_2) \bigcirc g_2(x_2)), \ldots, \frac{\partial}{\partial x_n}(f_n(w_n) \bigcirc g_n(x_n))\right) \]

If we consider only the simplest vector computations and set \(\boldsymbol{f}(\boldsymbol{w}) = \boldsymbol{w}, \boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{x}\), then \(f_i(w_i) = w_i\) and \(g_i(x_i) = x_i\). Taking \(\bigcirc\) to be element-wise addition, we have \[ \frac{\partial }{\partial w_i}(f_i(w_i) + g_i(x_i)) = 1 = \frac{\partial}{\partial x_i}(f_i(w_i) + g_i(x_i)) \] Subtraction, multiplication, and division work the same way, yielding the formulas below (since the rule of element-wise vector addition \(\oplus\) is exactly ordinary \(+\), we simply write \(\oplus\) as \(+\); the same goes for subtraction): \[ \begin{align*} \frac{\partial (\boldsymbol{w} + \boldsymbol{x})}{\partial \boldsymbol{w}} &= \boldsymbol{I} = \frac{\partial (\boldsymbol{w} + \boldsymbol{x})}{\partial \boldsymbol{x}}\\ \frac{\partial (\boldsymbol{w} - \boldsymbol{x})}{\partial \boldsymbol{w}} &= \boldsymbol{I} \\ \frac{\partial (\boldsymbol{w} - \boldsymbol{x})}{\partial \boldsymbol{x}} &= -\boldsymbol{I} \\ \frac{\partial (\boldsymbol{w} \otimes \boldsymbol{x})}{\partial \boldsymbol{w}} &= {\rm diag}(\boldsymbol{x}) \\ \frac{\partial (\boldsymbol{w} \otimes \boldsymbol{x})}{\partial \boldsymbol{x}} &= {\rm diag}(\boldsymbol{w}) \\ \frac{\partial (\boldsymbol{w} \oslash \boldsymbol{x})}{\partial \boldsymbol{w}} &= {\rm diag}\left(\cdots \frac{1}{x_i}\cdots\right)\\ \frac{\partial (\boldsymbol{w} \oslash \boldsymbol{x})}{\partial \boldsymbol{x}} &= {\rm diag}\left(\cdots -\frac{w_i}{x_i^2}\cdots\right) \end{align*} \]
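
The finite-difference check below (my own sketch with numpy) confirms one of these rules, \(\partial (\boldsymbol{w} \otimes \boldsymbol{x})/\partial \boldsymbol{w} = {\rm diag}(\boldsymbol{x})\); the remaining rules can be checked the same way:

```python
# Check that the Jacobian of the element-wise product w ⊗ x with respect to w is diag(x).
import numpy as np

def jacobian_fd(f, v, eps=1e-6):
    y0 = f(v)
    J = np.zeros((y0.size, v.size))
    for j in range(v.size):
        vp = v.copy()
        vp[j] += eps
        J[:, j] = (f(vp) - y0) / eps
    return J

w = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 4.0, -1.0])
J_w = jacobian_fd(lambda w_: w_ * x, w)          # differentiate w.r.t. w, holding x fixed
print(np.allclose(J_w, np.diag(x), atol=1e-4))   # True: the Jacobian is diag(x)
```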

Derivatives of Vector-Scalar Operations

For an operation between a vector and a scalar, the scalar can be expanded into a vector of the same dimension, after which the two vectors are combined element-wise. Expanding a scalar usually means multiplying it by a vector of all ones; for example, computing \(\boldsymbol{y} = \boldsymbol{x} + z\) is really computing \(\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}) + \boldsymbol{g}(z)\), where \(\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{x}, \boldsymbol{g}(z) = \boldsymbol{1}z\).

From this we obtain \[ \begin{align*} \frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{x} + z) &= \boldsymbol{I} \\ \frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{x}z) &= \boldsymbol{I}z \\ \frac{\partial}{\partial z}(\boldsymbol{x} + z) &= \boldsymbol{1} \\ \frac{\partial}{\partial z}(\boldsymbol{x}z) &= \boldsymbol{x} \end{align*} \]
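
As a quick numerical spot check (again my own sketch with numpy), the rule \(\frac{\partial}{\partial z}(\boldsymbol{x}z) = \boldsymbol{x}\) can be verified by perturbing the scalar:

```python
# Check that the derivative of x*z with respect to the scalar z is the vector x.
import numpy as np

x = np.array([2.0, -1.0, 0.5])
z, eps = 3.0, 1e-6
dz = (x * (z + eps) - x * z) / eps       # finite-difference derivative w.r.t. z
print(np.allclose(dz, x, atol=1e-4))     # True
```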

Sum Reduction over a Vector

Sum reduction ("sum reduce" in the original paper) simply means summing all the elements of a vector.

\(y = {\rm sum}(\boldsymbol{f}(\boldsymbol{x})) = \sum_{i=1}^n f_i(\boldsymbol{x})\),展开可得 \[ \begin{align*} \frac{\partial y}{\partial \boldsymbol{x}} &= \left[\begin{matrix}\frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{matrix}\right]^\mathsf{T} \\ &= \left[\begin{matrix}\frac{\partial }{\partial x_1}\sum_i f_i(\boldsymbol{x}) & \frac{\partial }{\partial x_2}\sum_i f_i(\boldsymbol{x}) & \cdots & \frac{\partial }{\partial x_n}\sum_i f_i(\boldsymbol{x}) \end{matrix}\right]^\mathsf{T} \\ &= \left[\begin{matrix}\sum_i \frac{\partial f_i(\boldsymbol{x})}{\partial x_1} & \sum_i \frac{\partial f_i(\boldsymbol{x})}{\partial x_2} & \cdots & \sum_i \frac{\partial f_i(\boldsymbol{x})}{\partial x_n}\end{matrix}\right]^\mathsf{T} \end{align*} \] 讨论一个最简单的情况,就是\(y={\rm sum}(\boldsymbol{x})\)。由之前的讨论,有\(f_i(\boldsymbol{x}) = x_i\)。将定义代入上式,并考虑对\(i \not= j\)\(\frac{\partial }{\partial x_j}x_i = 0\),易得 \[ y = {\rm sum}(\boldsymbol{x}) \rightarrow \nabla y = \boldsymbol{1} \]

The Chain Rule

Complex expressions (expressions built by composing and nesting other expressions) require the chain rule to differentiate. This post splits the chain rule into three cases.

Single-Variable Chain Rule

This is exactly the rule from introductory calculus. Suppose \(y = f(g(x))\) and let \(u = g(x)\); the single-variable chain rule is then \[ \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \]

Single-Variable Total-Derivative Chain Rule

In the simplest single-variable chain rule, every intermediate variable is a function of a single variable. For a function like \(y = f(x) = x+x^2\), things get more complicated. Introducing intermediate variables \(u_1\) and \(u_2\), we have \[ \begin{align*} &u_1(x) &&= x^2 \\ &u_2(x, u_1) &&= x + u_1 && (y=f(x)=u_2(x, u_1)) \end{align*} \] If we only used the single-variable chain rule above, we would get \(du_2/du_1 = 1, du_1/dx = 2x\), hence \(dy/dx = du_2/dx = du_2/du_1 \cdot du_1/dx = 2x\), which does not match the correct result (\(2x+1\)). If, because \(u_2\) has two arguments, we switch to partial derivatives, then \[ \begin{align*} \frac{\partial u_1(x)}{\partial x} &= 2x \\ \frac{\partial u_2(x, u_1)}{\partial u_1} &= 1 \end{align*} \] Note that at this point we cannot simply claim \(\partial u_2 / \partial x = 1\), because taking a partial derivative with respect to \(x\) assumes the other variables do not change as \(x\) changes, whereas \(u_1\) is in fact still a function of \(x\). We should therefore use the single-variable total-derivative chain rule: assuming \(u_1, \ldots, u_n\) all (possibly) depend on \(x\), \[ \frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^n \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x} \] Plugging the example above into this formula, \[ \frac{\partial f(x, u_1)}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 2x \] Note that the total-derivative formula is always a sum of partial-derivative terms; this is not because the example happens to contain an addition, but because it adds up the contribution each path from \(x\) makes to the change in \(y\). Consider \(y = f(x) = x \cdot x^2\); then \[ \begin{align*} &u_1(x) && = x^2 \\ &u_2(x, u_1) && = xu_1 \\ &\frac{\partial u_1}{\partial x} && = 2x \\ &\frac{\partial u_2}{\partial x} && = u_1 \\ &\frac{\partial u_2}{\partial u_1} && = x \end{align*} \] Using the total-derivative formula, \[ \frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = u_1 + x \cdot 2x = x^2 + 2x^2 = 3x^2 \] If we introduce a new variable \(u_{n+1} = x\) into the total-derivative chain rule above, we obtain an even more compact formula \[ \frac{\partial f(u_1, \ldots, u_{n+1})}{\partial x} = \sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x} \] which looks very much like the inner product of two vectors: \[ \frac{\partial f}{\partial \boldsymbol{u}} \frac{\partial \boldsymbol{u}}{\partial x} \]
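
The sympy sketch below (mine, not the paper's) replays the \(y = x \cdot x^2\) example: differentiating through the intermediate variables with the total-derivative formula gives the same \(3x^2\) as differentiating directly.

```python
# Total-derivative chain rule on y = x * x**2, with u1 = x**2 and u2 = x*u1.
import sympy as sp

x, u1 = sp.symbols('x u1')
u1_expr = x**2
u2 = x * u1                      # u2(x, u1), with u1 kept as its own symbol

# dy/dx = ∂u2/∂x + ∂u2/∂u1 * du1/dx
dy_dx = sp.diff(u2, x) + sp.diff(u2, u1) * sp.diff(u1_expr, x)
dy_dx = dy_dx.subs(u1, u1_expr)

print(sp.simplify(dy_dx))                 # 3*x**2
print(sp.simplify(sp.diff(x * x**2, x)))  # 3*x**2, direct differentiation agrees
```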

Vector Chain Rule

To motivate the vector chain rule, start with an example. Suppose \(\boldsymbol{y} = \boldsymbol{f}(x)\), where \[ \left[ \begin{matrix}y_1(x) \\ y_2(x)\end{matrix} \right] = \left[ \begin{matrix}f_1(x) \\ f_2(x)\end{matrix} \right] = \left[ \begin{matrix}\ln(x^2) \\ \sin (3x)\end{matrix} \right] \] Introduce two intermediate variables \(g_1\) and \(g_2\) so that \(\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{g}(x))\), where \[ \begin{align*} \left[ \begin{matrix}g_1(x) \\ g_2(x)\end{matrix} \right] &= \left[ \begin{matrix}x^2 \\ 3x\end{matrix} \right] \\ \left[ \begin{matrix}f_1(\boldsymbol{g}) \\ f_2(\boldsymbol{g})\end{matrix} \right] &= \left[ \begin{matrix}\ln (g_1) \\ \sin (g_2)\end{matrix} \right] \end{align*} \] Then \(\partial \boldsymbol{y}/\partial x\) can be computed with the total-derivative chain rule: \[ \frac{\partial \boldsymbol{y}}{\partial x} = \left[ \begin{matrix}\frac{\partial f_1(\boldsymbol{g})}{\partial x} \\ \frac{\partial f_2(\boldsymbol{g})}{\partial x}\end{matrix} \right] = \left[\begin{matrix}\frac{\partial f_1}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x}\end{matrix}\right] \] This vector can be written as a matrix-vector product \[ \left[\begin{matrix}\frac{\partial f_1}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x}\end{matrix}\right] = \left[\begin{matrix}\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}\end{matrix}\right]\left[\begin{matrix}\frac{\partial g_1}{\partial x} \\ \frac{\partial g_2}{\partial x}\end{matrix}\right] \] The matrix here is exactly a Jacobian, and likewise for the vector, i.e. \[ \left[\begin{matrix}\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}\end{matrix}\right]\left[\begin{matrix}\frac{\partial g_1}{\partial x} \\ \frac{\partial g_2}{\partial x}\end{matrix}\right] = \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}}\frac{\partial \boldsymbol{g}}{\partial x} \] The same rule still holds when \(x\) is not a scalar but a vector \(\boldsymbol{x}\), which gives us a vector chain rule \[ \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{g}(\boldsymbol{x})) \rightarrow \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}}\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} \] In fact this formula can be made even simpler. In many applications the Jacobians are diagonal square matrices; this happens when \(f_i\) depends only on \(g_i\) and \(g_i\) depends only on \(x_i\). That is, \[ \begin{align*} \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}} &= {\rm diag}\left(\frac{\partial f_i}{\partial g_i}\right) \\ \frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} &= {\rm diag}\left(\frac{\partial g_i}{\partial x_i}\right) \end{align*} \] In that case the vector chain rule reduces to \[ \frac{\partial }{\partial \boldsymbol{x}}\boldsymbol{f}(\boldsymbol{g}(\boldsymbol{x})) = {\rm diag}\left(\frac{\partial f_i}{\partial g_i}\right){\rm diag}\left(\frac{\partial g_i}{\partial x_i}\right) = {\rm diag}\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right) \]
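
Here is a small numerical check of the worked example (my own sketch with numpy): the Jacobian product \(\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}}\frac{\partial \boldsymbol{g}}{\partial x}\) matches direct differentiation of \(\boldsymbol{y} = [\ln(x^2), \sin(3x)]^\mathsf{T}\).

```python
# Vector chain rule on y = [ln(x^2), sin(3x)]^T with g = [x^2, 3x]^T.
import numpy as np

x = 0.7
g = np.array([x**2, 3 * x])

df_dg = np.array([[1.0 / g[0], 0.0],            # ∂f/∂g is diagonal here
                  [0.0,        np.cos(g[1])]])
dg_dx = np.array([[2 * x],
                  [3.0]])

dy_dx_chain  = df_dg @ dg_dx                    # Jacobian product from the chain rule
dy_dx_direct = np.array([[2.0 / x],             # d/dx ln(x^2)
                         [3 * np.cos(3 * x)]])  # d/dx sin(3x)
print(np.allclose(dy_dx_chain, dy_dx_direct))   # True
```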

Gradient of the Activation Function

Assume a fully connected feed-forward network (so convolutions, RNNs, and the like are not considered here) with ReLU activations, so that \[ {\rm activation}(\boldsymbol{x}) = \max(0, \boldsymbol{w} \cdot \boldsymbol{x} + b) \] We want to compute \(\frac{\partial }{\partial \boldsymbol{w}}(\boldsymbol{w} \cdot \boldsymbol{x} + b)\) and \(\frac{\partial }{\partial b}(\boldsymbol{w} \cdot \boldsymbol{x} + b)\). Although we have not yet discussed the derivative of a vector dot product with respect to a vector, note that \[ \boldsymbol{w} \cdot \boldsymbol{x} = \sum_{i}^n (w_ix_i) = {\rm sum}(\boldsymbol{w} \otimes \boldsymbol{x}) \] and we have already worked out the derivatives of \({\rm sum}(\boldsymbol{x})\) and \(\boldsymbol{w} \otimes \boldsymbol{x}\) as well as the vector chain rule. Introducing the intermediate variables \[ \begin{align*} \boldsymbol{u} &= \boldsymbol{w} \otimes \boldsymbol{x} \\ y &= {\rm sum}(\boldsymbol{u}) \end{align*} \] the earlier derivations give \[ \begin{align*} \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{w}} &= {\rm diag}(\boldsymbol{x}) \\ \frac{\partial y}{\partial \boldsymbol{u}} &= \boldsymbol{1} \end{align*} \] Therefore \[ \frac{\partial y}{\partial \boldsymbol{w}} = \frac{\partial y}{\partial \boldsymbol{u}}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{w}} = \boldsymbol{1}\cdot {\rm diag}(\boldsymbol{x}) = \boldsymbol{x} \] Now letting \(y = \boldsymbol{w} \cdot \boldsymbol{x} + b\), it is easy to see that \[ \frac{\partial y}{\partial \boldsymbol{w}} = \boldsymbol{x},\ \ \ \frac{\partial y}{\partial b} = 1 \] The activation function is a piecewise function that is not differentiable at 0. Differentiating piece by piece gives \[ \begin{align*} \frac{\partial }{\partial z}\max(0, z) = \begin{cases}0 & z \le 0 \\ \frac{dz}{dz} = 1 & z > 0\end{cases} \end{align*} \] and hence \[ \frac{\partial}{\partial \boldsymbol{x}}\max(0, \boldsymbol{x}) = \left[\begin{matrix}\frac{\partial }{\partial x_1}\max(0, x_1) \\ \frac{\partial }{\partial x_2}\max(0, x_2) \\ \vdots \\ \frac{\partial }{\partial x_n}\max(0, x_n)\end{matrix}\right] \] Putting it all together, for the activation function introduce an intermediate variable \(z\) for the affine transformation: \[ \begin{align*} z(\boldsymbol{w}, b, \boldsymbol{x}) &= \boldsymbol{w} \cdot \boldsymbol{x} + b \\ {\rm activation}(z) &= \max(0, z) \end{align*} \] By the chain rule \[ \frac{\partial {\rm activation}}{\partial \boldsymbol{w}} = \frac{\partial {\rm activation}}{\partial z} \frac{\partial z}{\partial \boldsymbol{w}} \] and substituting the earlier results, \[ \begin{align*} \frac{\partial {\rm activation}}{\partial \boldsymbol{w}} &= \begin{cases}0\frac{\partial z}{\partial \boldsymbol{w}} = \boldsymbol{0} & z \le 0\\ 1\frac{\partial z}{\partial \boldsymbol{w}} = \boldsymbol{x} & z > 0\end{cases} \\ \frac{\partial {\rm activation}}{\partial b} &= \begin{cases}0 & z \le 0 \\ 1 & z > 0 \end{cases} \end{align*} \]
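
The numpy sketch below (my own, not from the paper) checks the two cases of \(\partial\,{\rm activation}/\partial \boldsymbol{w}\) against finite differences: the gradient is \(\boldsymbol{x}\) when the unit is active and \(\boldsymbol{0}\) when it is not (the boundary \(z = 0\) is left aside).

```python
# Gradient of activation(w, b, x) = max(0, w·x + b) with respect to w.
import numpy as np

def activation(w, b, x):
    return max(0.0, np.dot(w, x) + b)

def grad_w(w, b, x):
    # Closed form derived above: x if the unit is active, 0 otherwise.
    return x if np.dot(w, x) + b > 0 else np.zeros_like(w)

def grad_w_fd(w, b, x, eps=1e-6):
    return np.array([(activation(w + eps * e, b, x) - activation(w, b, x)) / eps
                     for e in np.eye(w.size)])

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.2])
for b in (5.0, -5.0):   # one clearly active case, one clearly inactive case
    print(np.allclose(grad_w(w, b, x), grad_w_fd(w, b, x), atol=1e-4))  # True, True
```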

Gradient of the Neural Network Loss Function

Finally, consider a complete example. Suppose the model input is \(\boldsymbol{X}\), with \[ \boldsymbol{X} = [\boldsymbol{x}_1,\boldsymbol{x}_2, \ldots, \boldsymbol{x}_N]^\mathsf{T} \] The number of samples is \(N = |\boldsymbol{X}|\); let the target vector be \[ \boldsymbol{y} = [{\rm target}(\boldsymbol{x}_1), {\rm target}(\boldsymbol{x}_2), \ldots, {\rm target}(\boldsymbol{x}_N)]^\mathsf{T} \] where each \(y_i\) is a scalar. Using the squared-error loss, the cost function \(C\) is \[ \begin{align*} C(\boldsymbol{w}, b, \boldsymbol{X}, \boldsymbol{y}) &= \frac{1}{N}\sum_{i=1}^N(y_i - {\rm activation}(\boldsymbol{x}_i))^2 \\ &= \frac{1}{N}\sum_{i=1}^N(y_i - \max(0, \boldsymbol{w} \cdot \boldsymbol{x}_i + b))^2 \end{align*} \] Introducing intermediate variables, \[ \begin{align*} u(\boldsymbol{w}, b, \boldsymbol{x}) &= \max(0, \boldsymbol{w} \cdot \boldsymbol{x} + b) \\ v(y, u) &= y -u\\ C(v) &= \frac{1}{N}\sum_{i=1}^N v^2 \end{align*} \] Here we focus on the gradient with respect to the weights; the bias \(b\) is handled in the same way. From the earlier derivations, \[ \frac{\partial }{\partial \boldsymbol{w}} u(\boldsymbol{w}, b, \boldsymbol{x}) = \begin{cases}\boldsymbol{0} & \boldsymbol{w} \cdot \boldsymbol{x} + b \le 0 \\ \boldsymbol{x} & \boldsymbol{w}\cdot \boldsymbol{x} + b > 0\end{cases} \] and \[ \frac{\partial v(y, u)}{\partial \boldsymbol{w}} = \frac{\partial }{\partial \boldsymbol{w}}(y - u) = \boldsymbol{0} - \frac{\partial u}{\partial \boldsymbol{w}} = -\frac{\partial u}{\partial \boldsymbol{w}} = \begin{cases}\boldsymbol{0} & \boldsymbol{w} \cdot \boldsymbol{x} + b \le 0 \\ -\boldsymbol{x} & \boldsymbol{w}\cdot \boldsymbol{x} + b > 0\end{cases} \] The full gradient computation is therefore \[ \begin{align*} \frac{\partial C(v)}{\partial \boldsymbol{w}} &= \frac{\partial}{\partial \boldsymbol{w}}\frac{1}{N}\sum_{i=1}^N v^2 \\ &= \frac{1}{N}\sum_{i=1}^N\frac{\partial v^2}{\partial v}\frac{\partial v}{\partial \boldsymbol{w}} \\ &= \frac{1}{N}\sum_{i=1}^N2v\frac{\partial v}{\partial \boldsymbol{w}} \end{align*} \] Skipping the intermediate expansion, the final result is \[ \frac{\partial C(v)}{\partial \boldsymbol{w}} = \begin{cases}\boldsymbol{0} & \boldsymbol{w}\cdot \boldsymbol{x}_i + b \le 0 \\ \frac{2}{N}\sum_{i=1}^N (\boldsymbol{w}\cdot \boldsymbol{x}_i + b - y_i)\boldsymbol{x}_i & \boldsymbol{w}\cdot \boldsymbol{x}_i + b > 0\end{cases} \] The subsequent discussion of gradient descent is omitted.
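
To close the loop, here is a numpy sketch (my own, not from the paper) comparing the derived expression for \(\partial C/\partial \boldsymbol{w}\) with a finite-difference gradient of the loss, on data where every sample lies in the active regime (\(\boldsymbol{w}\cdot \boldsymbol{x}_i + b > 0\)):

```python
# Compare the derived gradient of the squared-error loss with finite differences.
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.uniform(-1.0, 1.0, size=(N, d))
y = rng.uniform(-1.0, 1.0, size=N)
w = rng.uniform(-1.0, 1.0, size=d)
b = 5.0   # |x_i · w| <= 3 here, so every sample stays in the active regime

def loss(w):
    pred = np.maximum(0.0, X @ w + b)   # ReLU unit applied to each sample
    return np.mean((y - pred) ** 2)

# Derived closed form (active regime): (2/N) * sum_i (w·x_i + b - y_i) x_i
grad_closed = (2.0 / N) * X.T @ (X @ w + b - y)

eps = 1e-6
grad_fd = np.array([(loss(w + eps * e) - loss(w)) / eps for e in np.eye(d)])
print(np.allclose(grad_closed, grad_fd, atol=1e-4))  # True
```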
