Recognizing Handwritten Digits with a Neural Network (Part 2)

You can read the previous article here.

This series is based on the English original, an online book that I found very well suited to beginners while I was learning machine learning. In the spirit of sharing knowledge, I want to translate it and make it available to everyone who wants to learn about machine learning.


Using what we learned in the previous article, let's implement our handwritten digit recognition program with stochastic gradient descent and the MNIST data. If you haven't covered the prerequisites yet, please go back to the previous article first; you can follow my collection to get further updates. We will use Python (2.7), and the program is only 74 lines long. One caution, though: if the goal is to understand the ideas of machine learning and apply them to other fields, I suggest not fixating on the code, and certainly not trying to memorize it, because that would be pointless.

The first thing we need is the MNIST data. If you are a git user (I won't explain what git is; every programmer and researcher should already know it), you can fetch the data by cloning the code repository:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git, you can also download the data and code here.

A note: in the previous article I said that MNIST has 60,000 training images and 10,000 test images, which is how the official MNIST data is described. Here we will split things slightly differently. We leave the 10,000-image test set as it is, but split the 60,000 official training images into two parts: 50,000 images that form our training set, and the remaining 10,000 images that form a separate validation set.

We will also use the Python library Numpy for the linear algebra operations. If you don't have Numpy installed, you can get it here.

Let me first explain the design of the code. At its heart is the Network class, which represents a neural network. Here is the code that initializes a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

Here sizes is a list containing the number of neurons in each layer of the network. For example, to create a network with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the last layer, we would write:

net = Network([2, 3, 1])

The biases and weights are initialized to random values, generated with Numpy's np.random.randn function as Gaussian-distributed numbers with mean 0 and variance 1. This random initialization gives our stochastic gradient descent a place to start from. In later chapters we'll see better ways of initializing the weights and biases, but this will do for now. Note that the first layer of the network is the input layer, and we set no biases for it, since biases are only ever used in computing the outputs from later layers.
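
For the net = Network([2, 3, 1]) we just created, the shapes of the parameters come out like this (a quick check, assuming numpy has been imported as np, as in the full listing further down):

print([b.shape for b in net.biases])   # [(3, 1), (1, 1)] -- no biases for the input layer
print([w.shape for w in net.weights])  # [(3, 2), (1, 3)] -- weights[0] connects layer 1 to layer 2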

The biases and weights are stored as lists of Numpy matrices. For example, net.weights[1] is the Numpy matrix storing the weights that connect the second and third layers of neurons (not the first and second, because Python lists are indexed from 0). Since net.weights[1] is rather verbose, let's simply call that matrix w, where w_jk is the weight of the connection between the k-th neuron in the second layer and the j-th neuron in the third layer. In this notation, applying the σ function in vectorized form, one layer at a time, reads

a' = σ(w a + b)    (22)

where a is the vector of activations of the second layer of neurons and a' is the vector of activations of the third. It is easy to check that equation (22) gives the same result as equation (4).

We define the sigmoid function as follows:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when z is a vector or a Numpy array, Numpy automatically applies the sigmoid function to every element, that is, in vectorized form.
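
To make the vectorized form concrete, here is a tiny worked example of equation (22) for a single layer, with made-up numbers and using the sigmoid just defined (the weight matrix maps a 2-neuron layer into a 3-neuron layer):

import numpy as np

w = np.array([[ 0.2, -0.4],
              [ 0.7,  0.1],
              [-0.5,  0.3]])          # made-up weights from a 2-neuron layer into a 3-neuron layer
b = np.array([[0.1], [-0.2], [0.0]])  # one bias per neuron in the receiving layer
a = np.array([[0.5], [0.9]])          # activations of the previous layer
a_next = sigmoid(np.dot(w, a) + b)    # equation (22): a' = sigmoid(w a + b), applied elementwise
print(a_next.shape)                   # (3, 1)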

We then add a feedforward method: given an input a for the network, it returns the corresponding output. All it does is apply equation (22), once for each layer:

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
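
As a quick sanity check (assuming the pieces above have been collected into one module, with numpy imported as np), feeding a random input through the [2, 3, 1] network from earlier looks like this:

net = Network([2, 3, 1])
x = np.random.randn(2, 1)   # a (2, 1) column vector as input
print(net.feedforward(x))   # a (1, 1) array: the activation of the single output neuron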

Of course, the main thing we want our Network objects to do is learn. For that we give them an SGD method, which implements stochastic gradient descent. Here is the code:

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and the desired outputs. epochs is the number of epochs to train for, and mini_batch_size is the size of the mini-batches to use when sampling. eta is the learning rate η. If the optional argument test_data is supplied, the program evaluates the network against the test data after each epoch and prints partial progress, which is useful for tracking progress but slows things down considerably.

In each epoch the code starts by randomly shuffling the training data and then partitions it into mini-batches of the specified size (mini_batches), which is an easy way of sampling randomly from the training data. Then, for each mini_batch, we apply a single step of gradient descent; this is done by the line self.update_mini_batch(mini_batch, eta), which updates the network's weights and biases according to a single iteration of gradient descent using just the data in that mini-batch. Here is the code for update_mini_batch:

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by this line:

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

The method it calls is the backpropagation algorithm, a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch and then updating self.weights and self.biases accordingly.

I'm not going to show the code for self.backprop just yet; we will study the backpropagation algorithm and its implementation in a later article. For now, just assume that it returns the appropriate gradient for the cost associated with the training example x.
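
If you want something you can actually run before we get to backpropagation, here is a very slow stand-in that estimates the same gradients numerically. It is my own helper, not part of network.py, and it assumes the quadratic cost used in this book; it is only useful as a sanity check on tiny networks:

import numpy as np

def numerical_backprop(net, x, y, eps=1e-5):
    """Estimate the gradient of the quadratic cost
    C_x = 0.5*||net.feedforward(x) - y||^2 by central differences."""
    def cost():
        return 0.5 * np.sum((net.feedforward(x) - y)**2)
    nabla_b = [np.zeros(b.shape) for b in net.biases]
    nabla_w = [np.zeros(w.shape) for w in net.weights]
    for params, grads in zip(net.biases + net.weights, nabla_b + nabla_w):
        it = np.nditer(params, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            original = params[idx]
            params[idx] = original + eps
            c_plus = cost()
            params[idx] = original - eps
            c_minus = cost()
            params[idx] = original                   # restore the parameter
            grads[idx] = (c_plus - c_minus) / (2*eps)
            it.iternext()
    return (nabla_b, nabla_w)

It returns (nabla_b, nabla_w) in the same layer-by-layer format that update_mini_batch expects.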

Now let's look at the full program, including the documentation strings and the parts I omitted above. Inside self.backprop, the helper sigmoid_prime computes the derivative of the σ function, which is needed when computing the gradient. You can get the gist of self.cost_derivative just by looking at the code and its comments; we'll explain it in detail in the next chapter. All of the code can be downloaded here:

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

So how well does the program do? Well, let's start by loading the MNIST data. We'll use a small helper module, mnist_loader.py, to do that; run the following in a Python shell:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Next we set up a Network with 30 hidden neurons:

>>> import network
>>> net = network.Network([784, 30, 10])

Then we tell it to learn with stochastic gradient descent for 30 epochs (epochs=30), with a mini-batch size of 10 (mini_batch_size=10) and a learning rate of 3.0 (η=3.0):

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

If you run the code as you read along, it will take a little while to finish. My suggestion is to set it running, keep reading, and check the output from time to time. If you are in a hurry, you can speed things up by decreasing the number of epochs, decreasing the number of hidden neurons, or using only part of the training data. Keep in mind that this code is meant to help you understand how neural networks work; it is not high-performance code. Of course, once we have trained a good network, it can be ported to run very quickly almost anywhere, for example in a web page (in Javascript) or in a native app. As you can see, after just a single epoch of training the network already recognizes 9,129 of the 10,000 test images correctly:

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

However, because the weights and biases are initialized randomly, your results won't be exactly the same as mine.
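
On the porting remark above: once training has finished, the learned parameters are just lists of Numpy arrays, so they are easy to dump to a portable format. Here is a minimal sketch (the helper and the file name are my own, not part of network.py):

import json

def save_network(net, filename):
    """Dump a trained Network's sizes, weights and biases to JSON so they
    can be reloaded elsewhere, e.g. by a small Javascript feedforward."""
    data = {"sizes": net.sizes,
            "weights": [w.tolist() for w in net.weights],
            "biases": [b.tolist() for b in net.biases]}
    f = open(filename, "w")
    json.dump(data, f)
    f.close()

# save_network(net, "trained_net.json")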

Let's change the number of hidden neurons to 100 and see what happens:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

We find that the accuracy improves; at least in this case, using more hidden neurons helps us get better results.

What if we lower the learning rate to η = 0.001?

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging:

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

Changing the learning rate to 0.01, we find that the results improve a little again. In general, when you find that changing a parameter in some direction makes things better, try pushing further in that direction a few more times, until you settle on the value that works best for you.
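
One hypothetical way to organize that kind of experiment is a quick sweep over a few learning rates on a shortened run. The specific values and the 5-epoch budget below are just for illustration, and training_data and test_data are assumed to be loaded as above:

import network

for eta in [0.01, 0.1, 1.0, 3.0]:
    print("trying eta = {0}".format(eta))
    net = network.Network([784, 30, 10])
    net.SGD(training_data, 5, 10, eta, test_data=test_data)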

In general, debugging a neural network is hard. This is especially true when the initial choice of parameters produces results no better than random guessing. Suppose we again use 30 hidden neurons, but with a learning rate of η = 100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

This time the learning rate turns out to be far too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

At this point we would sensibly turn the learning rate back down, which would improve the accuracy. But suppose this were our first attempt: we probably wouldn't immediately suspect that the learning rate is too large. Instead we might suspect a problem with the network itself. Did our initialization of the weights and biases break it? Is the training data at fault? Haven't we trained for enough epochs? Or should we change the learning algorithm altogether? With so many possible explanations, the first time you run into this situation you can't be sure what is actually causing the bad results. I won't resolve this here; these questions will be discussed in later articles. For now I'm only presenting the source code.

Now let's look at the details of how the MNIST data is loaded, which I mentioned earlier; the source is below. The data structures used are described in the docstrings: it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays).

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarry containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
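
A quick check of what load_data_wrapper returns looks like this (it assumes mnist.pkl.gz sits in ../data/, as the loader expects):

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
print(len(training_data))          # 50000
print(training_data[0][0].shape)   # (784, 1) -- the input image x
print(training_data[0][1].shape)   # (10, 1)  -- the one-hot desired output y
print(len(validation_data))        # 10000
print(len(test_data))              # 10000
print(test_data[0][1])             # an integer label such as 7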

We know that an image of a 2 is usually a bit darker than an image of a 1, simply because more of its pixels are blackened.

This suggests a baseline: compute the average darkness of each digit 0 through 9 from the training data. Given a new image to classify, compute its darkness and guess whichever digit has the closest average darkness. It is not hard to implement, so I won't write out the code here; it lives in the GitHub repository. This approach noticeably improves our accuracy over random guessing.
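
Here is a rough sketch of that idea (my own variable names, not the script from the repository), using the raw format returned by load_data:

import numpy as np
import mnist_loader

training_data, _, test_data = mnist_loader.load_data()

# Mean darkness (sum of pixel intensities) of the training images for each digit.
avg_darkness = {}
for digit in range(10):
    images = training_data[0][training_data[1] == digit]
    avg_darkness[digit] = np.mean([np.sum(img) for img in images])

def guess_digit(image):
    """Guess the digit whose average training darkness is closest to this image's."""
    darkness = np.sum(image)
    return min(avg_darkness, key=lambda d: abs(avg_darkness[d] - darkness))

num_correct = sum(int(guess_digit(img) == label)
                  for img, label in zip(test_data[0], test_data[1]))
print("Baseline: {0} / {1}".format(num_correct, len(test_data[1])))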

But if you want to push the accuracy much higher than that, you can use a well-established algorithm such as the support vector machine, or SVM. Don't worry, for now we don't need to understand the details of SVMs; we can simply use the scikit-learn library, which gives us a convenient Python interface to a fast C-based SVM implementation. The code is here. It turns out the SVM is a serious rival to our algorithm, which is a little awkward, so in later articles we will improve our algorithm until its accuracy surpasses the SVM's.
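
For reference, a minimal version of that experiment looks roughly like this (along the lines of the mnist_svm.py script in the repository; training an SVM on all 50,000 images is slow, and the default SVC parameters are not tuned):

from sklearn import svm
import mnist_loader

training_data, _, test_data = mnist_loader.load_data()

clf = svm.SVC()                              # default parameters; tuning C and gamma helps
clf.fit(training_data[0], training_data[1])  # this can take quite a while
predictions = clf.predict(test_data[0])
num_correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print("SVM baseline: {0} / {1}".format(num_correct, len(test_data[1])))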

SVMs also have many parameters that can be tuned. If you're interested, you can learn more from this blog post by Andreas Mueller.

Moving toward deep learning


We can carry the same techniques over to analyze another problem: deciding whether or not an image shows a human face.

One way to attack the problem is to break it down into sub-problems, each handled by its own sub-network, and then combine their answers.

And each of those sub-problems can, in turn, be broken down into even smaller questions.

Carrying on this way, the network becomes a deep neural network. People now routinely train networks with 5 to 10 hidden layers, and it turns out that on many problems these perform far better than shallow networks, that is, networks with only a single hidden layer. The reason is the ability of deep networks to build up a complex hierarchy of concepts.

If you have any questions, feel free to contact me at space-x@qq.com and I will reply as soon as I can.
