Machine Learning HW4 - Neural Networks


You are encouraged to watch the related videos from the Bilibili channel 数之道 to deepen your understanding of neural networks.

In this assignment, the task is to build a neural network that recognizes images of handwritten digits, which is essentially a 10-class image classification problem. We will first build a simple neural network, and then implement the AdaGrad, RMSprop, Momentum, Nesterov Momentum and Adam optimizers to speed up convergence.
The code you need to modify is mainly in Part 3:

  • Build a three-layer MLP
  • Implement the various optimizers
  • Train the model

1 Import the required Python libraries

# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt
from utils import load_mnist

from collections import OrderedDict

2 Load and visualize the data

This experiment uses the MNIST dataset, a handwritten-digit dataset with 60000 training images and 10000 test images. Each image is a 28×28 grayscale picture of a single digit from 0 to 9.
A grayscale image is read in as a 2-D tensor, but the fully connected layers of our network take 1-D inputs, so each 2-D image must be flattened into a 1-D tensor (i.e. a vector), as in Figure 1:

Flattening a 2-D image into a vector

After this step, every sample is a vector of length 784 (28×28).
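The flattening step can be sketched with toy arrays (stand-ins, not the actual dataset):

```python
import numpy as np

# Flatten one 28x28 grayscale image into a length-784 vector,
# and a whole batch of images into shape (batch, 784).
img = np.arange(28 * 28).reshape(28, 28)    # stand-in for a grayscale image
vec = img.reshape(-1)
print(vec.shape)        # (784,)

batch = np.zeros((100, 28, 28))             # stand-in for 100 images
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)       # (100, 784)
```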

Handwritten-digit recognition is a multi-class task with 10 classes, labeled 0-9. To represent a class as a vector, one-hot encoding is commonly used. For example, in handwritten-digit recognition the classes 0 and 1 correspond to the one-hot vectors:
$$\mathrm{one\text{-}hot}(0)=\begin{bmatrix}1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0\end{bmatrix},\quad \mathrm{one\text{-}hot}(1)=\begin{bmatrix}0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0\end{bmatrix}$$
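One-hot vectors like these can be built by indexing into an identity matrix; presumably load_mnist does something equivalent when one_hot_label=True:

```python
import numpy as np

# Build one-hot rows by indexing into a 10x10 identity matrix.
labels = np.array([0, 1, 9])
one_hot = np.eye(10)[labels]
print(one_hot[0])     # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot.shape)  # (3, 10)
```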

2.1 Load the data

In the training set, the sample labels $Y$ need to be converted into one-hot vectors.

# Load the training and test sets
path = './MNIST Data' # directory containing the dataset files
# normalize=True scales the pixel values into [0,1]
# one_hot_label=True converts the labels into one-hot vectors
(x_train, y_train), (x_test, y_test) = load_mnist(path, normalize=True, one_hot_label=True)
print('The shape of X_train is:',x_train.shape)
print('The shape of Y_train is:',y_train.shape)
print('The shape of X_test is:',x_test.shape)
print('The shape of Y_test is:',y_test.shape)
The shape of X_train is: (60000, 784)
The shape of Y_train is: (60000, 10)
The shape of X_test is: (10000, 784)
The shape of Y_test is: (10000, 10)

2.2 Visualize the dataset

Let's look at a few images from the training set.

fig = plt.figure()

ax1 = fig.add_subplot(141)
ax1.imshow(x_train[1,:].reshape(28, 28), cmap='Greys')
ax2 = fig.add_subplot(142)
ax2.imshow(x_train[2,:].reshape(28,28), cmap='Greys')
ax3 = fig.add_subplot(143)
ax3.imshow(x_train[3,:].reshape(28,28), cmap='Greys')
ax4 = fig.add_subplot(144)
ax4.imshow(x_train[4,:].reshape(28,28), cmap='Greys')
plt.show()
print('one-hot labels:',y_train[1,:],y_train[2,:],y_train[3,:],y_train[4,:])
print('actual labels:',np.argmax(y_train[1,:]),np.argmax(y_train[2,:]),np.argmax(y_train[3,:]),np.argmax(y_train[4,:]))


Four sample digits from the training set

one-hot labels: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
actual labels: 0 4 1 9

3 Fully connected neural network

Next, we build a three-layer fully connected neural network. The structure is shown in Figure 2:

Three-layer fully connected network

Notation used in the formulas:
By convention the input layer is not counted when counting layers, so Figure 2 shows a 3-layer network. A bracketed superscript denotes the layer: the weights $W$ of layer 1 are $W^{[1]}$, the bias $b$ of layer 1 is $b^{[1]}$ (not shown in the figure), and the activations $A$ of layer 1 are $A^{[1]}$.

The first two layers use the LeakyRelu activation; the last layer uses Softmax for classification.

3.1 Initialize the network parameters

We set the number of neurons to 300 in layer 1, 300 in layer 2, and 10 in the last layer. The input vector $X$ has dimension 784, which fixes the shapes of all the network's parameters:
$W^{[1]}$ has shape $(784,300)$, $b^{[1]}$ has shape $(300,)$
$W^{[2]}$ has shape $(300,300)$, $b^{[2]}$ has shape $(300,)$
$W^{[3]}$ has shape $(300,10)$, $b^{[3]}$ has shape $(10,)$
We initialize each $W$ from a standard normal distribution scaled by the factor 0.01, and initialize each $b$ to zeros.
Hint: use np.random.randn() and np.zeros()

def initialize_parameters(input_size, hidden_size, output_size, weight_init_std):
    """
    @param input_size: dimension of the input vector
    @param hidden_size: number of neurons in the hidden layers
    @param output_size: number of neurons in the output layer
    @param weight_init_std: scale factor for weight initialization
    """
    np.random.seed(1)
    params = {}

    params['W1'] = np.random.randn(input_size,hidden_size) * weight_init_std
    params['b1'] = np.zeros((hidden_size,)) # complete the code below following this example
    ### START CODE HERE ### 
    params['W2'] = np.random.randn(hidden_size,hidden_size) * weight_init_std
    params['b2'] = np.zeros((hidden_size,))
    params['W3'] = np.random.randn(hidden_size,output_size) * weight_init_std
    params['b3'] = np.zeros((output_size,))
    
    ### END CODE HERE ### 

    print("W1's shape:",params['W1'].shape)
    print("b1's shape:",params['b1'].shape)
    print("W2's shape:",params['W2'].shape)
    print("b2's shape:",params['b2'].shape)
    print("W3's shape:",params['W3'].shape)
    print("b3's shape:",params['b3'].shape) # where this function is called, check that every parameter shape matches the expectation
    
    return params

3.2 Define each layer of the network

We implement the various network layers as separate classes:

  • LeakyRelu
  • Affine
  • Softmax-with-loss

3.2.1 The LeakyRelu layer

The LeakyRelu activation function is
$$
y=\begin{cases}
x & (x>0)\\
\alpha x & (x \leq 0)
\end{cases}
$$
from which the derivative of $y$ with respect to $x$ follows:
$$
\frac{\partial y}{\partial x} = \begin{cases}
1 & (x>0)\\
\alpha & (x \leq 0)
\end{cases}
$$

If the input $x$ of the forward pass is greater than 0, backpropagation passes the incoming gradient through unchanged. Conversely, if $x$ is less than or equal to 0, the gradient passed on during backpropagation is multiplied by a small constant $\alpha$, which keeps the neuron active.

LeakyRelu

class LeakyRelu:
    def __init__(self):
        self.mask = None 
        self.alpha = 0.1
        
    def forward(self, x):
        self.mask = (x <= 0) # mask marks the entries of x that are less than or equal to 0
        out = x.copy()
        ### START CODE HERE ###  # implement the forward pass following the LeakyRelu formula
        out[self.mask] = self.alpha*out[self.mask]
        ### END CODE HERE ### 
        return out

    def backward(self, dout):
        ### START CODE HERE ###  # implement the backward pass following the derivative of y with respect to x
        dout[self.mask] = self.alpha*dout[self.mask]
        ### END CODE HERE ### 
        dx = dout
        return dx

Test the layer

leakyRelu = LeakyRelu()
x = np.array( [[1.0, -0.5], [-2.0, 3.0]] )
leakyRelu.forward(x), leakyRelu.backward(x)
(array([[ 1.  , -0.05],
        [-0.2 ,  3.  ]]),
 array([[ 1.  , -0.05],
        [-0.2 ,  3.  ]]))

3.2.2 The Affine layer

In the forward pass of a neural network, the weighted sum of the inputs is computed with a matrix product. Recall the formula:
$$
Y = XW + B
$$
In NumPy this is Y = np.dot(X, W) + B. Suppose X, W and B have shapes (2,), (2,3) and (3,) respectively, as in Figure 4. We can draw the matrix product plus bias as a computational graph; representing the product with a "dot" node, np.dot(X, W) + B looks like Figure 5:

Forward pass for a single vector

Affine

For backpropagation with matrices, computing element by element follows the same steps as the scalar computational graph. In formulas:
$$
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^T
$$

$$
\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Y}
$$

From these equations we obtain the backward computational graph for a single vector (see Figure 6).

Backward pass for a single vector

Now consider forward-propagating N samples together (see Figure 7). The difference from before is that the input X now has shape (N, 2); afterwards the matrix computations on the graph proceed as before. During backpropagation, pay attention to the matrix shapes; deriving $\frac{\partial L}{\partial X}$ and $\frac{\partial L}{\partial W}$ works the same way as in the single-vector case.

Backward pass for N vectors

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        
        self.x = None
        self.original_x_shape = None
        # gradients of the weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)
        self.x = x
        ### START CODE HERE ### 
        out = np.dot(x,self.W)+self.b
        ### END CODE HERE ### 
        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        ### START CODE HERE ### 
        self.dW = np.dot(self.x.T,dout)
        self.db = np.sum(dout,axis=0)
        ### END CODE HERE ### 
        
        dx = dx.reshape(*self.original_x_shape)  # restore the original input shape (for tensor inputs)
        return dx

Test the layer

w = np.ones([2, 3])
b = np.ones(3)
affine = Affine(w, b)
x = np.ones([1, 2])
# suppose l = sum(y)
l = np.sum(affine.forward(x))
dout = np.ones([1,3])
l, affine.backward(dout)
(9.0, array([[3., 3.]]))

3.2.3 The Softmax-with-Loss layer

The softmax function normalizes its input values, as shown in the figure:

Softmax layer

Let $x_i$ denote the $i$-th component of the vector $x$. Then
$$softmax(x_i)=\frac{e^{x_i}}{\sum_{j=1}^{C}{e^{x_j}}}$$

When the last layer is a softmax, the loss function is usually the cross-entropy ($C$ is the number of classes, 10 in this experiment):
$$L(\hat{y},y)=-\sum_{j=1}^{C}{y_j \log \hat{y}_j}$$
The cost function is ($m$ is the mini-batch size):
$$J(W^{[1]},b^{[1]},\dots)=\frac{1}{m}\sum_{i=1}^{m}{L(\hat{y}^{(i)},y^{(i)})}$$
Mini-batch: the training set has 60000 images, and we cannot feed a $(60000,784)$ tensor into the network all at once. Instead, each forward pass selects $m$ images, so the input tensor has shape $(m,784)$.

Hint: the final loss is a scalar

def softmax(x):
    x = x.T
    x = x - np.max(x, axis=0)
    y = np.exp(x) / np.sum(np.exp(x), axis=0)
    return y.T 

def cross_entropy_error(pred, y):
    if pred.ndim == 1:
        y = y.reshape(1, y.size)
        pred = pred.reshape(1, pred.size)
        
    # if the labels are one-hot vectors, convert them to class-index form
    if y.size == pred.size:
        y = y.argmax(axis=1)
             
    batch_size = pred.shape[0]
    
    # pick out the predicted probability of the correct class for each sample
    res = -np.sum(np.log(pred[np.arange(batch_size), y] + 1e-7)) / batch_size
    
    return res

Implement the Softmax-with-Loss layer (already completed)

Because this layer also contains the cross-entropy error used as the loss function, it is called the "Softmax-with-Loss layer". Its computational graph (softmax function plus cross-entropy error) is shown in Figure 9.

Softmax-with-Loss layer

As you can see, the Softmax-with-Loss layer is somewhat involved, so only the final result is given here.

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.pred = None # output of softmax
        self.y = None # ground-truth labels

    def forward(self, x, y):
        self.y = y
        self.pred = softmax(x)
        self.loss = cross_entropy_error(self.pred, self.y)
        
        return self.loss

    def backward(self, dout=1):
        batch_size = self.y.shape[0]
        if self.y.size == self.pred.size: # labels are one-hot vectors
            dx = (self.pred - self.y) / batch_size
        else:
            dx = self.pred.copy()
            dx[np.arange(batch_size), self.y] -= 1
            dx = dx / batch_size
        
        return dx

3.3 Putting the pieces together

Combining the layers above, we can now assemble our own three-layer neural network.
Training a neural network takes roughly 5 steps:

  1. Initialize the weights
  2. Randomly select a batch of data
  3. Compute the gradients
  4. Update the parameters
  5. Repeat steps 2, 3 and 4
class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std = 0.01):
        
        # initialize the weights
        self.params = initialize_parameters(input_size, hidden_size, output_size, weight_init_std)
        # count training iterations (used by Adam)
        self.t = 0

        # build the layers
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        self.layers['LeakyRelu1'] = LeakyRelu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
        self.layers['LeakyRelu2'] = LeakyRelu()
        self.layers['Affine3'] = Affine(self.params['W3'], self.params['b3'])
    
        self.lastLayer = SoftmaxWithLoss()
        
    def predict(self, x):
        # forward pass
        pred = x.copy()
        for layer in self.layers.values():
            # propagate forward through each layer's forward function
            ### START CODE HERE ###
            pred = layer.forward(pred)
            ### END CODE HERE ###
            
        return pred
        
    def loss(self, x, y):
        # compute the cross-entropy loss
        ### START CODE HERE ### 
        pred = self.predict(x) # compute the network's prediction for x
        loss = self.lastLayer.forward(pred,y) # cross-entropy between the prediction and y via the SoftmaxWithLoss layer
        ### END CODE HERE ### 
        return loss
    
    def accuracy(self, x, y):
        # given inputs x and labels y, return the network's current prediction accuracy
        accuracy = None
        pred = self.predict(x)
        pred = np.argmax(pred, axis=1)
        if y.ndim != 1:
            y = np.argmax(y, axis=1)
        
        accuracy = np.sum(pred == y) / float(x.shape[0])

        return accuracy
        
    def gradient(self, x, y):
        # forward pass
        self.loss(x, y)

        # backward pass
        dout = 1
        dout = self.lastLayer.backward(dout)
        
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # collect the gradients
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W2'], grads['b2'] = self.layers['Affine2'].dW, self.layers['Affine2'].db
        grads['W3'], grads['b3'] = self.layers['Affine3'].dW, self.layers['Affine3'].db
        
        return grads

3.4 Training

3.4.1 Define the parameter-update function

def update_parameters(network, grads, learning_rate=0.001):
    """
    Update the network's parameters with gradient descent
    """

    # This is the most basic gradient-descent update; use it as a reference when completing the other optimizers
    
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        network.params[key] -= learning_rate * grads[key]  # subtract learning_rate * gradient from the current parameters
    
    return 

3.4.2 Define the training function

def train_network(network, update_params_method, iters_num, train_size, batch_size, learning_rate):
    train_loss_list = []
    train_acc_list = []
    test_acc_list = []

    iter_per_epoch = max(train_size / batch_size, 1)

    for i in range(iters_num):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = y_train[batch_mask]
        network.t += 1

        # compute the gradients
        grad = network.gradient(x_batch, t_batch)

        # update the parameters
        update_params_method(network, grad, learning_rate)

        loss = network.loss(x_batch, t_batch)
        train_loss_list.append(loss)

        if i % iter_per_epoch == 0:
            train_acc = network.accuracy(x_train, y_train)
            test_acc = network.accuracy(x_test, y_test)
            train_acc_list.append(train_acc)
            test_acc_list.append(test_acc)
            print("Train acc:{:<.6f}\tTest acc:{:<.6f}".format(train_acc, test_acc))
# load the data
(x_train, y_train), (x_test, y_test) = load_mnist(path, normalize=True, one_hot_label=True)
# define the neural network
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
iters_num = 10000 # number of iterations
train_size = x_train.shape[0] # number of training samples
batch_size = 100 # batch size
learning_rate = 0.1 # learning rate
train_network(network, update_parameters, iters_num, train_size, batch_size, learning_rate) # start training
Train acc:0.085400	Test acc:0.084700
Train acc:0.882817	Test acc:0.887600
Train acc:0.921217	Test acc:0.918800
Train acc:0.941400	Test acc:0.940300
Train acc:0.954433	Test acc:0.953100
Train acc:0.966600	Test acc:0.961500
Train acc:0.972000	Test acc:0.965700
Train acc:0.975467	Test acc:0.967000
Train acc:0.978200	Test acc:0.970800
Train acc:0.982183	Test acc:0.972900
Train acc:0.985733	Test acc:0.977900
Train acc:0.986983	Test acc:0.976200
Train acc:0.986350	Test acc:0.974700
Train acc:0.989633	Test acc:0.977500
Train acc:0.991583	Test acc:0.978200
Train acc:0.992133	Test acc:0.978600
Train acc:0.990667	Test acc:0.977800

3.5 Gradient descent optimization algorithms

Below we review several optimization algorithms covered in class; note the differences and connections between them.

3.5.1 AdaGrad

When $J(\theta)$ approaches its minimum, a fixed step size $\alpha$ makes $J(\theta)$ oscillate around the minimum instead of reaching it. AdaGrad's idea is therefore to decay the learning rate $\alpha$ as the iterations proceed:
$$
\alpha^t = \frac{\alpha}{\sqrt{t+1}}
$$
where $t$ is the iteration index.

When the gradient is small, $J(\theta)$ takes small steps and descends slowly in flat regions. AdaGrad uses the root mean square of the gradients to speed up the descent in those regions:
$$
\sigma^t = \sqrt{\frac{1}{t+1} \sum_{i=0}^{t}(g^i)^2}
$$
where the $g^i$ are the historical gradients. The AdaGrad parameter update is
$$
\begin{aligned}
\theta^{t+1} & := \theta^t - \frac{\alpha^t}{\sigma^t} g^t \\
& := \theta^t - \frac{\alpha}{\sqrt{\sum_{i=0}^t (g^i)^2}} g^t
\end{aligned}
$$
In flat regions the root mean square is small, and dividing the gradient by a small number enlarges the step, which speeds up gradient descent. The drawback of AdaGrad is that as the iteration count grows, the accumulated sum keeps increasing, the update shrinks toward 0, and training effectively stops early. To keep the denominator from being 0 we add a small constant $\epsilon = 10^{-7}$:
$$
\theta^{t+1} := \theta^t - \frac{\alpha}{\sqrt{\sum_{i=0}^t (g^i)^2} + \epsilon} g^t
$$

Note that the denominator only involves the sum of all historical squared gradients, so the implementation does not need to store every past gradient; a single running sum of squared gradients per parameter is enough, initialized to 0.
In the code, epsilon stands for $\epsilon$.
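Before wiring AdaGrad into the network, a minimal scalar sketch (on a toy objective $J(\theta)=\theta^2/2$ of our choosing, not part of the assignment) shows the effective step size shrinking as squared gradients accumulate:

```python
import numpy as np

# AdaGrad on a single parameter of J(theta) = theta^2 / 2 (gradient = theta).
alpha, epsilon = 0.5, 1e-7
theta = 4.0
g_squared_sum = 0.0          # running sum of squared gradients
steps = []
for t in range(5):
    g = theta
    g_squared_sum += g * g
    step = alpha / (np.sqrt(g_squared_sum) + epsilon)  # effective step size
    theta -= step * g
    steps.append(step)
print([round(s, 4) for s in steps])  # the effective step sizes shrink monotonically
```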

def initialize_grads_squared(network):
    """
    Initialize the running sum of squared gradients for each parameter
    """
    grads_squared = {}
    
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        
        grads_squared[key] = np.zeros(network.params[key].shape)
        
    return grads_squared
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
grads_squared = initialize_grads_squared(network)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
def update_parameters_with_adagrad(network, grads, learning_rate=0.001, epsilon = 1e-7):
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ### 
        
        # accumulate the sum of squared gradients
        grads_squared[key] += grads[key]*grads[key]
        network.params[key] -= learning_rate*grads[key]/(np.sqrt(grads_squared[key])+epsilon)
        ### END CODE HERE ###
        
    return 
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.001
train_network(network, update_parameters_with_adagrad, iters_num, train_size, batch_size, learning_rate)
Train acc:0.224533	Test acc:0.221600
Train acc:0.868767	Test acc:0.870600
Train acc:0.886033	Test acc:0.887800
Train acc:0.894200	Test acc:0.895500
Train acc:0.899167	Test acc:0.899200
Train acc:0.902833	Test acc:0.902300
Train acc:0.905533	Test acc:0.905000
Train acc:0.908300	Test acc:0.907900
Train acc:0.909200	Test acc:0.910200
Train acc:0.912150	Test acc:0.911400
Train acc:0.914283	Test acc:0.913000
Train acc:0.916233	Test acc:0.915000
Train acc:0.916717	Test acc:0.916500
Train acc:0.918017	Test acc:0.915700
Train acc:0.919483	Test acc:0.918000
Train acc:0.920233	Test acc:0.919800
Train acc:0.921217	Test acc:0.920200

3.5.2 RMSprop

As the AdaGrad formula shows, gradients from every past step influence the current update. If the early gradients pointed in a poor direction, those stale gradients still affect the update now. RMSprop therefore keeps, in effect, only a recent window of squared-gradient history rather than the entire history.
The RMSprop update is:
$$
u^0 = 0 \\
u^{t+1} = \rho u^t + (1-\rho) [\nabla J(\theta^t)]^2 \\
\theta^{t+1} = \theta^t - \frac{\alpha}{\sqrt{u^{t+1}}+\epsilon}\nabla J(\theta^t)
$$
Here $\rho$ is a hyperparameter, usually set to 0.999 and rarely tuned, and $\epsilon$ prevents division by zero. Since we integrate several algorithms here, and Adam combines ideas from all of them, the hyperparameter names follow Adam's: the $\rho$ in the formula is the parameter beta below. These algorithms all maintain some state variables, initialized in essentially the same way as for AdaGrad, so the initialization code is not repeated.

def update_parameters_with_rmsprop(network, grads, learning_rate=0.001, epsilon = 1e-7, beta=0.999):
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ### 
        
        # the u in the formula is grads_squared here
        grads_squared[key] = beta*grads_squared[key]+(1-beta)*grads[key]*grads[key]
        network.params[key] -= learning_rate/(np.sqrt(grads_squared[key])+epsilon)*grads[key]
        
        ### END CODE HERE ###
    return 
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
grads_squared = initialize_grads_squared(network)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.001
train_network(network, update_parameters_with_rmsprop, iters_num, train_size, batch_size, learning_rate)
Train acc:0.154500	Test acc:0.145600
Train acc:0.964933	Test acc:0.959300
Train acc:0.975267	Test acc:0.968700
Train acc:0.984133	Test acc:0.973400
Train acc:0.987083	Test acc:0.975700
Train acc:0.988967	Test acc:0.975400
Train acc:0.992717	Test acc:0.977400
Train acc:0.994050	Test acc:0.977900
Train acc:0.993700	Test acc:0.978300
Train acc:0.993617	Test acc:0.976400
Train acc:0.993650	Test acc:0.976300
Train acc:0.993333	Test acc:0.976200
Train acc:0.995350	Test acc:0.980400
Train acc:0.996850	Test acc:0.979400
Train acc:0.995117	Test acc:0.977200
Train acc:0.994083	Test acc:0.975900
Train acc:0.998200	Test acc:0.980000

3.5.3 Momentum

Gradient descent with Momentum computes an exponentially weighted moving average of the gradients and uses that average to update the weights. Concretely:
$$
v^0 = 0 \\
v^{t+1} = \rho v^t +\alpha \nabla J(\theta^t) \\
\theta^{t+1} = \theta^t - v^{t+1}
$$

Here $\rho$ is usually taken to be 0.9.
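A quick sanity check on why momentum accelerates (a toy calculation of our own, not part of the assignment): if the gradient stays constant, the velocity converges to $\alpha g/(1-\rho)$, i.e. an effective step about 10× larger when $\rho=0.9$:

```python
# With a constant gradient g, the velocity v converges to alpha*g/(1-rho).
rho, alpha, g = 0.9, 0.001, 1.0
v = 0.0
for _ in range(200):
    v = rho * v + alpha * g
print(round(v / (alpha * g), 2))   # -> 10.0, i.e. 1/(1-rho)
```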

def initialize_velocity(network):
    v = {}
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        v[key] = np.zeros((network.params[key]).shape) 
    return v
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
v = initialize_velocity(network)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
def update_parameters_with_momentum(network, grads, learning_rate=0.001, beta=0.9):
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ### 
        
        # the v in the formula is the velocity dictionary v here
        v[key] = beta*v[key]+learning_rate*grads[key]
        network.params[key] -= v[key]
        
        ### END CODE HERE ###
    return 
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.001
train_network(network, update_parameters_with_momentum, iters_num, train_size, batch_size, learning_rate)
Train acc:0.085050	Test acc:0.084000
Train acc:0.112367	Test acc:0.113500
Train acc:0.282417	Test acc:0.284400
Train acc:0.437133	Test acc:0.447100
Train acc:0.702383	Test acc:0.715900
Train acc:0.789133	Test acc:0.797000
Train acc:0.828717	Test acc:0.835100
Train acc:0.852583	Test acc:0.856400
Train acc:0.869817	Test acc:0.873100
Train acc:0.882033	Test acc:0.884900
Train acc:0.890717	Test acc:0.890700
Train acc:0.896317	Test acc:0.894500
Train acc:0.899783	Test acc:0.899300
Train acc:0.904583	Test acc:0.902500
Train acc:0.908617	Test acc:0.906200
Train acc:0.911867	Test acc:0.909400
Train acc:0.914083	Test acc:0.912900

3.5.4 Nesterov Momentum

Nesterov Momentum differs from Momentum in that it evaluates the gradient at a lookahead point, after the velocity has been applied:
$$
v^{t+1} = \rho v^t + \alpha \nabla J(\theta^t - \rho v^t) \\
\theta^{t+1} = \theta^t - v^{t+1}
$$
In the implementation, however, we do not evaluate $J(\theta^t)$ once and then $\nabla J(\theta^t - \rho v^t)$ again. For programming purposes the update above is equivalent to:
$$
v^{t+1} = \rho v^t + \alpha \nabla J(\theta^t) \\
\theta^{t+1} = \theta^t - \rho v^{t+1} - \alpha \nabla J(\theta^t)
$$
Here $\rho$ is usually taken to be 0.9.
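One way to convince yourself of the equivalence is numerically (a sketch on a toy objective $J(\theta)=\theta^2/2$ of our choosing): the implementation form can be read as tracking the standard form's lookahead point $\theta^t - \rho v^t$.

```python
# Compare standard Nesterov momentum with the implementation form on
# J(theta) = theta^2 / 2, whose gradient is simply theta.
grad = lambda th: th
rho, alpha = 0.9, 0.1

theta, v = 5.0, 0.0   # standard form
phi, w = 5.0, 0.0     # implementation form
for _ in range(100):
    # standard form: gradient evaluated at the lookahead point
    v = rho * v + alpha * grad(theta - rho * v)
    theta = theta - v
    # implementation form: gradient evaluated at the current point
    g = grad(phi)
    w = rho * w + alpha * g
    phi = phi - rho * w - alpha * g
    # phi stays equal to the standard form's lookahead point
    assert abs(phi - (theta - rho * v)) < 1e-6
print(round(theta, 4), round(phi, 4))  # both end up near the minimum at 0
```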

def update_parameters_with_nesterov_momentum(network, grads, learning_rate=0.001, beta=0.9):
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ### 
                
        v[key] = beta*v[key]+learning_rate*grads[key]
        network.params[key] -= beta*v[key]+learning_rate*grads[key]
        
        ### END CODE HERE ###
    return 
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
v = initialize_velocity(network)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.001
train_network(network, update_parameters_with_nesterov_momentum, iters_num, train_size, batch_size, learning_rate)
Train acc:0.084783	Test acc:0.084100
Train acc:0.112367	Test acc:0.113500
Train acc:0.283333	Test acc:0.285600
Train acc:0.440450	Test acc:0.450000
Train acc:0.703683	Test acc:0.717300
Train acc:0.789483	Test acc:0.796600
Train acc:0.829000	Test acc:0.835400
Train acc:0.853417	Test acc:0.856900
Train acc:0.869817	Test acc:0.873300
Train acc:0.882167	Test acc:0.884300
Train acc:0.890650	Test acc:0.890700
Train acc:0.896400	Test acc:0.894400
Train acc:0.899950	Test acc:0.899100
Train acc:0.904733	Test acc:0.902300
Train acc:0.908700	Test acc:0.906000
Train acc:0.911867	Test acc:0.909300
Train acc:0.914117	Test acc:0.912800

3.5.5 Adam

Adam (Adaptive Moment Estimation) combines Momentum and RMSprop. The procedure is:
$$
u^0 = 0 \\
v^0 = 0 \\
u^{t+1} = \rho_2 u^t +(1-\rho_2) [\nabla J(\theta^t)]^2 \\
v^{t+1} = \rho_1 v^t + (1-\rho_1)\nabla J(\theta^t) \\
\theta^{t+1} = \theta^t - \frac{\alpha}{\sqrt{u^{t+1}}+\epsilon}v^{t+1}
$$
As the equations show, $u^{t}$ and $v^{t}$ are very small during the first updates, so the early updates need a bias correction. The full formulas are
$$
u^0 = 0 \\
v^0 = 0 \\
u^{t+1} = \rho_2 u^t +(1-\rho_2) [\nabla J(\theta^t)]^2 \\
u^{t+1}_{corrected} = \frac{u^{t+1}}{1-\rho_2^t} \\
v^{t+1} = \rho_1 v^t + (1-\rho_1)\nabla J(\theta^t) \\
v^{t+1}_{corrected} = \frac{v^{t+1}}{1-\rho_1^t} \\
\theta^{t+1} = \theta^t - \frac{\alpha}{\sqrt{u^{t+1}_{corrected}}+\epsilon}v^{t+1}_{corrected}
$$

Typically $\rho_1=0.9$ and $\rho_2=0.999$; $\epsilon$ again keeps the denominator from being too small or zero.
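A quick sketch of why the bias correction matters (toy numbers of our choosing): with $v^0=0$ and a constant gradient, the raw EMA underestimates the gradient in the early steps, while dividing by $1-\rho_1^t$ restores the true scale at every step.

```python
# EMA of a constant gradient g = 2.0: the raw v underestimates g early on,
# but v / (1 - rho1**t) recovers g at every step.
rho1, g = 0.9, 2.0
v = 0.0
for t in range(1, 6):
    v = rho1 * v + (1 - rho1) * g
    v_corrected = v / (1 - rho1 ** t)
    print(t, round(v, 4), round(v_corrected, 4))
```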

def initialize_adam(network) :
    v = {}
    u = {}

    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        v[key] = np.zeros(np.shape(network.params[key]))
        u[key] = np.zeros(np.shape(network.params[key]))
            
    return v, u
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
v, u = initialize_adam(network)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
def update_parameters_with_adam(network, grads, learning_rate=0.001, epsilon=1e-7, beta1=0.9, beta2=0.999):
    v_corrected = {}
    u_corrected = {} 
    t = network.t # current iteration count
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ### 
                
        v[key] = beta1*v[key]+(1-beta1)*grads[key]
        v_corrected[key] = v[key]/(1-beta1**t)
        
        u[key] = beta2*u[key]+(1-beta2)*grads[key]*grads[key]
        u_corrected[key] = u[key]/(1-beta2**t)
        
        network.params[key] -= learning_rate/(np.sqrt(u_corrected[key])+epsilon)*v_corrected[key]
        
        ### END CODE HERE ###
    return 
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.001
train_network(network, update_parameters_with_adam, iters_num, train_size, batch_size, learning_rate)
Train acc:0.143983	Test acc:0.144800
Train acc:0.958583	Test acc:0.954000
Train acc:0.971600	Test acc:0.966900
Train acc:0.978850	Test acc:0.967800
Train acc:0.983017	Test acc:0.972800
Train acc:0.985367	Test acc:0.972400
Train acc:0.986133	Test acc:0.974500
Train acc:0.991683	Test acc:0.978100
Train acc:0.986633	Test acc:0.971500
Train acc:0.995467	Test acc:0.981400
Train acc:0.993783	Test acc:0.980900
Train acc:0.994167	Test acc:0.980400
Train acc:0.994417	Test acc:0.978400
Train acc:0.995717	Test acc:0.979500
Train acc:0.997383	Test acc:0.980900
Train acc:0.996250	Test acc:0.978900
Train acc:0.996833	Test acc:0.980800

3.5.6 AdaBelief

In Adam, the update direction is $\frac{v^{t+1}}{\sqrt{u^{t+1}}}$, where $u^{t+1}$ is an exponential moving average (EMA) of $[\nabla J(\theta^t)]^2$.

AdaBelief changes the direction to $\frac{v^{t+1}}{\sqrt{s^{t+1}}}$, where $s^{t+1}$ is an EMA of $[\nabla J(\theta^t)-v^{t+1}]^2$. The AdaBelief procedure is:
$$
s^0 = 0 \\
v^0 = 0 \\
v^{t+1} = \rho_1 v^t + (1-\rho_1)\nabla J(\theta^t) \\
s^{t+1} = \rho_2 s^t +(1-\rho_2) [\nabla J(\theta^t)-v^{t+1}]^2 \\
\theta^{t+1} = \theta^t - \frac{\alpha}{\sqrt{s^{t+1}}+\epsilon}v^{t+1}
$$

As with Adam, $s^{t}$ and $v^{t}$ are very small during the first updates, so a bias correction is applied early on. The full formulas are
$$
s^0 = 0 \\
v^0 = 0 \\
v^{t+1} = \rho_1 v^t + (1-\rho_1)\nabla J(\theta^t) \\
v^{t+1}_{corrected} = \frac{v^{t+1}}{1-\rho_1^t} \\
s^{t+1} = \rho_2 s^t +(1-\rho_2) [\nabla J(\theta^t)-v^{t+1}]^2 \\
s^{t+1}_{corrected} = \frac{s^{t+1}}{1-\rho_2^t} \\
\theta^{t+1} = \theta^t - \frac{\alpha}{\sqrt{s^{t+1}_{corrected}}+\epsilon}v^{t+1}_{corrected}
$$

Typically $\rho_1=0.9$ and $\rho_2=0.999$; $\epsilon$ again keeps the denominator from being too small or zero.
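To see the difference from Adam, consider a perfectly consistent gradient (toy numbers of our choosing, not part of the assignment): Adam's $u$ approaches $g^2$, while AdaBelief's $s$, which averages $(g-v)^2$, decays toward 0, so AdaBelief takes larger steps when successive gradients agree:

```python
# Compare the second-moment state of Adam (u) and AdaBelief (s)
# under a constant gradient g.
rho1, rho2, g = 0.9, 0.999, 2.0
v = u = s = 0.0
for _ in range(5000):
    v = rho1 * v + (1 - rho1) * g             # shared first moment
    u = rho2 * u + (1 - rho2) * g * g         # Adam: EMA of g^2
    s = rho2 * s + (1 - rho2) * (g - v) ** 2  # AdaBelief: EMA of (g - v)^2
print(round(u, 3), round(s, 6))  # u approaches g**2 = 4, s stays near 0
```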

def initialize_adambelief(network) :
    v = {}
    s = {}

    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ###  # initialize v and s
        v[key] = np.zeros(np.shape(network.params[key]))
        s[key] = np.zeros(np.shape(network.params[key]))
        ### END CODE HERE ###
            
    return v, s
network = TwoLayerNet(input_size=784, hidden_size=300, output_size=10)
v, s = initialize_adambelief(network)
W1's shape: (784, 300)
b1's shape: (300,)
W2's shape: (300, 300)
b2's shape: (300,)
W3's shape: (300, 10)
b3's shape: (10,)
def update_parameters_with_adambelief(network, grads, learning_rate=0.001, epsilon=1e-7, beta1=0.9, beta2=0.999):
    v_corrected = {}
    s_corrected = {} 
    t = network.t # current iteration count
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        ### START CODE HERE ### 
                
        v[key] = beta1*v[key]+(1-beta1)*grads[key]
        v_corrected[key] = v[key]/(1-beta1**t)
        
        s[key] = beta2*s[key]+(1-beta2)*(grads[key]-v[key])*(grads[key]-v[key])
        s_corrected[key] = s[key]/(1-beta2**t)
        
        network.params[key] -= learning_rate/(np.sqrt(s_corrected[key])+epsilon)*v_corrected[key]
        ### END CODE HERE ###
    return 
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.001
train_network(network, update_parameters_with_adambelief, iters_num, train_size, batch_size, learning_rate)
Train acc:0.229883	Test acc:0.226300
Train acc:0.960717	Test acc:0.955500
Train acc:0.970600	Test acc:0.965000
Train acc:0.979550	Test acc:0.968900
Train acc:0.985500	Test acc:0.974000
Train acc:0.987800	Test acc:0.976200
Train acc:0.989600	Test acc:0.974900
Train acc:0.991533	Test acc:0.977100
Train acc:0.990117	Test acc:0.976700
Train acc:0.993717	Test acc:0.979300
Train acc:0.993467	Test acc:0.979100
Train acc:0.994500	Test acc:0.979400
Train acc:0.992783	Test acc:0.974000
Train acc:0.996283	Test acc:0.979000
Train acc:0.995833	Test acc:0.978000
Train acc:0.995383	Test acc:0.978100
Train acc:0.996683	Test acc:0.977900

4 Summary

In this lab we built a complete three-layer fully connected network and trained it on the MNIST dataset with a variety of gradient-based optimization algorithms.
Perhaps you can push the accuracy a bit higher by tuning the network parameters and hyperparameters? (Remember to record the results of each training run.)


Author: J&Ocean
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit J&Ocean when reposting!