We are going to implement the core NN functionality described in Mathematical Foundation using Python 3 and NumPy.
Personal Notes from Debugging
- Don't confuse the definition of softmax with that of sigmoid; sigmoid has a minus sign.
$$ \mathrm{softmax}(x)_i=\frac{\exp(x_i)}{\sum_j \exp(x_j)} \\ \mathrm{sigmoid}(x)=\frac{1}{1+\exp({\color{red}-}x)} $$
- Make sure that the initial weight matrices contain both positive and negative values. Otherwise the network gets stuck on a plateau after just a few iterations:
np.random.uniform(size=(n_out, 1))-0.5
#                                 ^^^^ HERE!
- Initialisation does matter. Initialisation with a uniform distribution seems better than with a normal distribution. (The scale of the initial values also seems to matter, but I am not sure about it.)
- Of course, we should be using Glorot initialisation; a sketch follows right after these notes.
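A minimal sketch of what Glorot (Xavier) uniform initialisation could look like here. The helper name, the zero-initialised biases and the layer sizes are my own placeholders for illustration, not part of the original code; the limit follows Glorot & Bengio (2010).

import numpy as np

def glorot_uniform(fan_out, fan_in):
    # U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out)):
    # zero-centred weights whose scale adapts to the layer sizes.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

# Hypothetical layer sizes for illustration
n_in, n_hid, n_out = 784, 20, 10
W1 = glorot_uniform(n_hid, n_in)
W2 = glorot_uniform(n_out, n_hid)
B1 = np.zeros((n_hid, 1))  # biases are commonly started at zero
B2 = np.zeros((n_out, 1))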
In the following code snippets, x indicates an input to a layer, and y indicates an output from a layer.
Code: Forward Pass Part
### Forward pass
# Output from the hidden layer
x_out = sigmoid(B1+np.dot(W1, x_hid))
# Output from the output layer
y_out = softmax(B2+np.dot(W2, x_out))
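One caveat with the softmax used above (and defined in the full listing below): np.exp overflows for large logits. A common variant, which is not in the original code, subtracts the maximum before exponentiating, which leaves the result unchanged:

def softmax(x):
    # Subtracting the max leaves softmax unchanged but keeps np.exp from overflowing.
    y = np.exp(x - np.max(x))
    return y / np.sum(y)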
Code: Backpropagation Part
### Backward pass
error = y_out - t
dEdW2 += np.dot(error, x_out.T) / n_train
dEdB2 += error / n_train
error = np.dot(W2.T, error) * x_out * (1-x_out)
dEdW1 += np.dot(error, x_hid.T) / n_train
dEdB1 += error / n_train
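For reference, these are the per-sample gradients the snippet accumulates (each divided by n_train so that the sums average over the training set), following the usual softmax-plus-cross-entropy derivation from the Mathematical Foundation page:

$$\delta_2 = y_{out} - t,\quad \frac{\partial E}{\partial W_2} = \delta_2\, x_{out}^T,\quad \frac{\partial E}{\partial B_2} = \delta_2 \\ \delta_1 = (W_2^T \delta_2) \odot x_{out} \odot (1-x_{out}),\quad \frac{\partial E}{\partial W_1} = \delta_1\, x_{hid}^T,\quad \frac{\partial E}{\partial B_1} = \delta_1$$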
Code: Parameter Update Part
### Update Weights
dW1, dB1, dW2, dB2 = -lr*dEdW1, -lr*dEdB1, -lr*dEdW2, -lr*dEdB2
W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2
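This is plain full-batch gradient descent: every parameter moves against its accumulated gradient, scaled by the learning rate (lr in the code, $\alpha$ below):

$$\theta \leftarrow \theta - \alpha \frac{\partial E}{\partial \theta}, \quad \theta \in \{W_1, B_1, W_2, B_2\}$$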
Code: Entire Training
import numpy as np


def sigmoid(x):
    return 1 / (1+np.exp(-x))


def softmax(x):
    y = np.exp(x)
    return y / np.sum(y)


def train_network(X_train, T_train, X_test, T_test,
                  W1, B1, W2, B2, lr, n_epochs):
    """Multiclass classification with NN

    Args:
        X_train(NumPy Array) : Data matrix   (shape: n_train, n_input)
        T_train(NumPy Array) : Label matrix  (shape: n_train, n_output)
        X_test(NumPy Array)  : Data matrix   (shape: n_test, n_input)
        T_test(NumPy Array)  : Label matrix  (shape: n_test, n_output)
        W1(NumPy Array)      : Weight matrix in layer1 (shape: n_hidden, n_input)
        B1(NumPy Array)      : Bias vector in layer1   (shape: n_hidden, 1)
        W2(NumPy Array)      : Weight matrix in layer2 (shape: n_output, n_hidden)
        B2(NumPy Array)      : Bias vector in layer2   (shape: n_output, 1)
        lr(float)            : Learning rate
        n_epochs(int)        : Maximum epochs to update the parameters
    where n_train, n_test are the numbers of training/test samples respectively,
    n_input, n_output are the numbers of input/output features respectively,
    and n_hidden is the number of units in the hidden layer.

    Returns:
        W1, B1, W2, B2 (NumPy Array): Updated weight matrices and bias vectors
        ce_train(list of float)  : The history of cross entropy on training data.
        acc_train(list of float) : The history of accuracy on training data.
        ce_test(list of float)   : The history of cross entropy on test data.
        acc_test(list of float)  : The history of accuracy on test data.
    """
    ### Get the dimension of input variables
    n_train, n_test = X_train.shape[0], X_test.shape[0]
    n_input, n_output = X_train.shape[1], T_train.shape[1]

    ### Train the network
    ce_train, acc_train, ce_test, acc_test = [], [], [], []
    for epoch in range(n_epochs):
        ### Train on training data
        ce, acc = 0.0, 0.0
        ### Initialize the gradient matrices
        dEdW1, dEdB1 = np.zeros(W1.shape), np.zeros(B1.shape)
        dEdW2, dEdB2 = np.zeros(W2.shape), np.zeros(B2.shape)
        for x, t in zip(X_train, T_train):
            x_hid = x.reshape((-1, 1))
            ### Forward pass
            # Output from the hidden layer
            x_out = sigmoid(B1+np.dot(W1, x_hid))
            # Output from the output layer
            y_out = softmax(B2+np.dot(W2, x_out))

            ### Error Check
            t = t.reshape((-1, 1))
            # Cross-entropy
            ce += -np.sum(t*np.log(y_out)) / n_train
            # Accuracy
            if np.max(y_out) == np.max(y_out*t):
                acc += 1 / n_train

            ### Backward pass
            # Output layer
            error = y_out - t
            dEdW2 += np.dot(error, x_out.T) / n_train
            dEdB2 += error / n_train
            # Hidden layer
            error = np.dot(W2.T, error) * x_out * (1-x_out)
            dEdW1 += np.dot(error, x_hid.T) / n_train
            dEdB1 += error / n_train

        ### Store cross entropy and accuracy
        ce_train.append(ce)
        acc_train.append(acc)

        ### Update Weights
        dW1, dB1, dW2, dB2 = -lr*dEdW1, -lr*dEdB1, -lr*dEdW2, -lr*dEdB2
        W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2

        ### Evaluate error on test data
        ce, acc = 0.0, 0.0
        for x, t in zip(X_test, T_test):
            x_hid = x.reshape((-1, 1))
            ### Forward pass
            # Output from the hidden layer
            x_out = sigmoid(B1+np.dot(W1, x_hid))
            # Output from the output layer
            y_out = softmax(B2+np.dot(W2, x_out))

            ### Error Check
            t = t.reshape((-1, 1))
            # Cross-entropy
            ce += -np.sum(t*np.log(y_out)) / n_test
            # Accuracy
            if np.max(y_out) == np.max(y_out*t):
                acc += 1 / n_test

        ### Store cross entropy and accuracy
        ce_test.append(ce)
        acc_test.append(acc)

        print(("Epoch {:4} "
               "CE_train {:12.5e} CE_test {:12.5e} "
               "ACC_train {:6.3} ACC_test {:6.3}").format(
                   epoch, ce_train[-1], ce_test[-1],
                   acc_train[-1], acc_test[-1]))

    return W1, B1, W2, B2, ce_train, ce_test, acc_train, acc_test
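A usage sketch, assuming the inputs are already flattened feature vectors and the labels are one-hot encoded; the array shapes, random data and hyperparameter values below are placeholders for illustration, not the setup used for the results that follow.

import numpy as np

# Placeholder data: 1000 training / 200 test samples, 784 features, 10 classes
n_in, n_hid, n_out = 784, 20, 10
X_train = np.random.rand(1000, n_in)
T_train = np.eye(n_out)[np.random.randint(0, n_out, 1000)]  # one-hot labels
X_test = np.random.rand(200, n_in)
T_test = np.eye(n_out)[np.random.randint(0, n_out, 200)]

# Uniform initialisation shifted so both signs occur, as noted above
W1 = np.random.uniform(size=(n_hid, n_in)) - 0.5
B1 = np.random.uniform(size=(n_hid, 1)) - 0.5
W2 = np.random.uniform(size=(n_out, n_hid)) - 0.5
B2 = np.random.uniform(size=(n_out, 1)) - 0.5

W1, B1, W2, B2, ce_train, ce_test, acc_train, acc_test = train_network(
    X_train, T_train, X_test, T_test, W1, B1, W2, B2, lr=0.3, n_epochs=50)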
Result
I trained the network varying the learning rate $\alpha = 0.1, 0.3, 0.6$ and the number of hidden units $N_{hid} = 10, 20, 30$. The following figures show how cross-entropy and accuracy change during training. As the learning rate increases, the network is optimised faster. In none of the configurations did the weight parameters reach a local or global optimum, but even so, we can infer that better performance is achieved with more hidden units (up to a certain point).
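For reference, the sweep over learning rates and hidden-layer sizes can be written as a loop like the one below; it reuses the placeholder data and the train_network call from the usage sketch above, and the epoch count is an assumption rather than a record of the exact runs behind the figures.

results = {}
for lr in (0.1, 0.3, 0.6):
    for n_hid in (10, 20, 30):
        # Fresh uniform initialisation, shifted to contain both signs, for each run
        W1 = np.random.uniform(size=(n_hid, n_in)) - 0.5
        B1 = np.random.uniform(size=(n_hid, 1)) - 0.5
        W2 = np.random.uniform(size=(n_out, n_hid)) - 0.5
        B2 = np.random.uniform(size=(n_out, 1)) - 0.5
        results[(lr, n_hid)] = train_network(
            X_train, T_train, X_test, T_test, W1, B1, W2, B2, lr, n_epochs=50)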
Initialisation Matters
When initialising the weight matrices, the optimisation behaved differently depending on which random distribution was used, so I leave some notes on it here. The following graphs show the optimisation of the network with the same parameter values but with different random distributions for initialisation: a uniform distribution and a normal distribution. (The uniform-distribution data is the same as in the graphs above.) The initial cross-entropy seems to be smaller with the uniform distribution, and the network is optimised faster.
# Initialisation with uniform distribution
W1 = np.random.uniform(size=(n_hid, n_in))-0.5
B1 = np.random.uniform(size=(n_hid, 1))-0.5
W2 = np.random.uniform(size=(n_out, n_hid))-0.5
B2 = np.random.uniform(size=(n_out, 1))-0.5
# Initialisation with normal distribution
W1 = np.random.normal(size=(n_hid, n_in))
B1 = np.random.normal(size=(n_hid, 1))
W2 = np.random.normal(size=(n_out, n_hid))
B2 = np.random.normal(size=(n_out, 1))
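One possible reading, consistent with the earlier note that the scale of the initial values may matter: the two initialisations differ far more in spread than in shape.

$$\sigma_{uniform} = \frac{0.5-(-0.5)}{\sqrt{12}} = \frac{1}{\sqrt{12}} \approx 0.29, \qquad \sigma_{normal} = 1$$

The smaller weights keep the pre-activations closer to the linear region of the sigmoid, where gradients are larger, which could explain the lower initial cross-entropy and the faster optimisation, though I have not verified this.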