Études in Programming Intelligence: February 2015

Saturday, February 28, 2015

Step by Step Neural Network: Hyperbolic Tangent

So far we used sigmoid function as the activation function in the hidden layer. I am going to try other functions also such as, hyperbolic tangent and rectified linear unit (ReLU). Firstly we look into hyperbolic tangent. The definition of hyperbolic tangent function and its derivative is as follow.

ここまで、隠れ層の活性化関数にはシグモイド関数を使って来た。ここでその他の関数、双曲正接関数やrectified linear unit (ReLU) 等も使ってみようと思う。まず双曲正接関数についてみていく。双曲正接関数の定義と微分は以下のようになる。

$$\begin{align*}\tanh(x) &= \frac {exp(x) - \exp(-x)} {exp(x) + exp(-x)} \\ \\ \frac{d}{dx}\tanh(x) &= 1-\tanh^2(x) \end{align*}$$

Hyperbolic tangent function is a linear transformation of sigmoid function and their relationship can be derived as follow.

双曲正接関数はシグモイド関数を線形変換したものになっており、その関係は以下のように導出できる。

$$\begin{align*} \tanh(x) &= \frac {exp(x) - exp(-x)} {exp(x) + exp(-x)} \\ \\ &= \frac{1 - exp(-2x)}{1 + exp(-2x)} \\ \\ &= \frac{1+exp(-2x)-2exp(-2x)}{1 + exp(-2x)} \\ &= 1 - \frac{2exp(-2x)}{1+exp(-2x)} \\ \\ &= 1 -2(\frac{1+exp(-2x)-1}{1+exp(-2x)})\\ \\ &= 1 -2 (1 - \frac{1}{1+exp(-2x)}) \\ \\ &= 1 -2 (1 - \sigma(2x)) \\ \\ &= 2\sigma(2x) -1 \end{align*}$$

where $\sigma(x)=\frac{1}{1+exp(x)}$ is sigmoid function.
ここで、$\sigma(x)=\frac{1}{1+exp(x)}$はシシグモイド関数

As $tanh$ has greater gradient than sigmoid function, applying $tanh$ should fasten the network optimisation. Also, the range of output of $tanh$ is $[-1, 1]$ and this is better suited for optimisation. The reason is described in "Efficient Backprop" by Le Cunn et al. Simply said, when all the components of an input vector have the same sign (plus or minus), then the weight vector of an activation unit is updated in either all decreased or all increased at the same time. Therefor, in order a weight vector to get optimised to have weight values with both positive and negative values, the network has to repeat the iteration more. This slows down the optimisation.

$tanh$ はシグモイド関数よりも急な勾配を持っているので、ネットワークの最適化がより高速になる。また、$tanh$の出力の値域が $[-1, 1]$ であり、"Efficient Backprop" by Le Cunn et al で述べられる理由により、ネットワークの最適化促進に一役買うことになる。簡単に言うと、レイヤーへの入力が全て同じ符号（正・負）を持っていたとすると、ある活性化ユニットの重みベクトルが更新される際には全ての重みの値が同時に増えるか、全ての重みの値が同時に減る、のどちらかにしかならない。これによって、重みが正負両方の値を持つ最適値に収束するためにはネットワークは最適化を何度も繰り返すことになり、全体の最適化が遅くなるのである。

Code (Modified Part)

from numpy import tanh 

### Forward pass
# Output from the hidden layer
x_out = tanh(B1+np.dot(W1, x_hid))
# Output from the output layer
y_out = softmax(B2+np.dot(W2, x_out))

### Backward pass
t = t.reshape((-1, 1))
# Output layer
error = y_out - t
dEdW2 += np.dot(error, x_out.T) / n_train
dEdB2 += error / n_train
# Hidden layer
error = np.dot(W2.T, error) * (1-x_out*x_out)
dEdW1 += np.dot(error, x_hid.T) / n_train
dEdB1 += error / n_train

Result

Following figures illustrates the effect of $tanh$ activation function compared against sigmoid function. The experiment is conducted using 10 mini-batch updates and Nesterov's accelerated gradient. The optimisation is accelerated effectively.

以下の表は活性化関数を $tanh$ にした時のネットワークの変化を、シグモイド関数の場合と比較して示している。実験には１０分割のミニバッチ更新と Nesterov's accelerated gradient を用いた。最適化が促進されているのが分かる。

Reference

Friday, February 27, 2015

Step by Step Neural Network: Mini-batch Update

Another approach to speed-up the network optimisation is to update parameters more frequently. Mini-batch process first divide training set into small batches, then compute gradient, and updates the parameter for each batch. So if training data is divided into 10 mini-batches, then parameters are updated 10 times par epoch. The size of batch and samples must be carefully chosen so that each mini-batch preserves the data distribution of the entire original training set.

As the network is modified during sweeping the training data, we need to measure the network performance with another set of data other than training dataset. So we divide the training data into two sets of data, validation set and training set. Training set is used for parameter optimisation and validation set is used to measure the network performance for each epoch.

The following figures show the effect of mini-batching for various optimisation configuration, compared against batch updating.

ネットワーク最適化を高速にする他の手段は、パラメーターの更新頻度を上げることである。ミニバッチは訓練データをより小さなバッチに分割し、それぞれのバッチを用いて勾配を計算、パラメーターを更新する。訓練データを10個のバッチに分割したならば、パラメーターは 1 Epoch で10回更新される。訓練データを分割する際は、元の訓練データの分布が保持されるように分割数と選出するサンプルを選ばなければならない。

訓練データを一巡する間にパラメーターが随時更新されてしまうので、ネットワークの性能を測るために別のデータセットが必要になる。そこで、元々の訓練データをバリデーションセットと訓練セットの二つに分割する。パラメーターの更新には訓練セットを（ミニバッチに分割して）用い、Epoch 毎の性能の評価にはバリデーションセットを用いる。

以下の図はミニバッチによるネットワーク最適化の変化を示す。

Step by Step Neural Network: Nesterov's Accelerated Gradient

To apply Nesterov's accelerated gradient, first, store the current weight parameters, next update the weights with the previous gradient, compute the gradient, then update the weight parameters from the stored position. using the same update rule as classic momentum.

Nesterov's accelerated gradient を適応するためには、まず、勾配を計算する前に現在の重みを保存しておき、次に重みを前回の勾配を使って更新、更新された重みを使って勾配を計算する。そして、保存しておいた重みから、新たに計算された勾配を使って重みを更新する。

Code

# Store the current weight, and jump with current gradient
W1_0, B1_0, W2_0, B2_0 = W1, B1, W2, B2
W1, B1, W2, B2 = W1+mc*dW1, B1+mc*dB1, W2+mc*dW2, B2+mc*dB2

# Compute gradient using W1, B1, W2, B2
# ...

# Update weights from the stored position W1_0, B1_0, W2_0, B2_0
dW1 = - lr * dEdW1 + mc * dW1
dB1 = - lr * dEdB1 + mc * dB1
dW2 = - lr * dEdW2 + mc * dW2
dB2 = - lr * dEdB2 + mc * dB2
W1, B1, W2, B2 = W1_0+dW1, B1_0+dB1, W2_0+dW2, B2_0+dB2

Result

The following figures show the effect of Nesterov's accelerated gradient (NAG), compared against classic momentum (CM) and optimisation without momentum (plain gradient descent) for different optimisation parameter values. Network is optimised faster with NAG than with CM, and the oscillation seen in the case of CM with high momentum value does not occur with NAG.

以下の図は Nesterov's accelerated gradient (NAG) によるネットワークの学習効果の変化を、様々な最適化パラメーターの値に対して、通常のモーメンタム(CM), モーメンタムなしの場合と比較して示している。NAG の場合 CM よりも改善が速く、CM でモーメンタム係数が大きい場合に発生した振動も起こらなくなっている。

Wednesday, February 25, 2015

Step by Step Neural Network: Classic Momentum

Using the update rule described in Momentum, I modified the prototype as follow.
モーメンタムの記述に従って実装１を以下のように修正する。

Code: Parameter Update Part

# Update Weights
dW1 = - lr * dEdW1 + mc * dW1
dB1 = - lr * dEdB1 + mc * dB1
dW2 = - lr * dEdW2 + mc * dW2
dB2 = - lr * dEdB2 + mc * dB2
W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2

Result

The following figures show the change of cross-entropy and accuracy of the network over the iteration. Momentum coefficient $\eta=0.3, 0.6, 0.9$ were used and the other hyper parameters were set same as the configuration in Prototyping1 (Result). We can see that the network gets optimised faster by introducing momentum. Also, when $\eta=0.9$, both cross-entropy and accuracy is oscillating.

以下の図は試作１（結果）での実験と同じパラメーター設定でモーメンタム $\eta=0.3, 0.6, 0.9$ と変化させたときのクロスエントロピーと正解率の推移である。モーメンタムを導入したことで、最適化が促進されたことが分かる。また $\eta=0.9$ では、エントロピー及び正答率の変化に、振動が見られる。

Sunday, February 22, 2015

Step by Step Neural Network: Momentum

In the previous posts, I implemented minimum functionality of neural network, but its computing speed is very slow, so I try to improve it by incorporating momentum term. With momentum term, the parameter update becomes as following.

前回で初歩的なネットワークの実装は出来たが、計算速度がとても遅いので、モーメンタム項を導入して改善を試みる。モーメンタムを加えるとパラメーター更新は以下のようになる。

$$\begin{align*} dW &\leftarrow - \alpha \frac{\partial E}{\partial W} \ + \ \eta \ dW \\ \\ dB &\leftarrow - \alpha \frac{\partial E}{\partial B} \ + \ \eta \ dB \\ \\ W &\leftarrow W + dW \\ \\ B &\leftarrow B + dB \end{align*} $$

where $\eta$ is momentum decay rate and $0 \leq \eta < 1$
ただし$\eta$ はモーメンタム減衰率で $0 \leq \eta < 1$ を満たす。

If $\frac{\partial E}{\partial W}$ is evaluated at $W+\eta dW$, instead of at $W$, then this formula turns into Nestrov's Accelerated Gradient.
ここで、$\frac{\partial E}{\partial W}$ の評価を $W$ の位置ではなく $W+\eta dW$ で行うと Nestrov's Accelerated Gradient になる。

$$\begin{align*} dW &\leftarrow - \alpha \frac{\partial E}{\partial W}\bigg|_{W=W+dW} \ + \ \eta \ dW \\ \\ dB &\leftarrow - \alpha \frac{\partial E}{\partial B}\bigg|_{B=B+dB} \ + \ \eta \ dB \\ \\ W &\leftarrow W + dW \\ \\ B &\leftarrow B + dB \end{align*} $$

Following figures show the effect of momentum term with different optimisation parameters. Similar figure can be found here
モーメンタム項の有無による、エラー改善の様子を下図に示す。こちらと同様の傾向が伺える。

Reference

Friday, February 6, 2015

Step by Step Neural Network: Prototyping

We are going to implement the core NN functionality described in Mathematical Foundation using Python 3 and NumPy.
数学的基礎の頁で述べたニューラルネットワークの中核部分を実装する。実装には Python 3, NumPy を使う。

Personal Notes from debugging / 個人的実装メモ

Don't confuse the definition of softmax and that of sigmoid. Sigmoid has minus sign.
ソフトマックス関数とシグモイド関数の定義を混同しない。シグモイドはマイナス記号。
$$ softmax=\{\frac{exp(x_i)}{\sum_i exp(x_i)} \}\\ sigmoid=\frac{1}{1+exp({\color{red}-}x)}$$
Make sure that initial weight matrices contain both positive and negative values. Otherwise the network get stuck at a plateau just after a few iteration.
重みの初期化では正・負両方の重みが含まれるようにすること。そうでないと、へんなところで停滞する。
```
np.random.uniform(size=(n_out, 1))-0.5 
                                 #^^^^ HERE!
```
Initialization does matter. Initialisation with uniform distribution seems better than normal distribution. (Also, the value itself seems matter, but not sure about it.)
初期化のやり方は重要。正規分布で初期化するより、一様分布を使って初期化する方がよさそう。（更に値の大小も影響するように見えるが、詳しくは不明）
Of course, we should be using Glorot initialization.
もちろん Glorot 初期化を使うべき。

In the following code snippets, x indicates an input to a layer, and y indicates an output from a layer.
以下のコードでは x はレイヤーへの入力、y はレイヤーからの出力を示す。

Code: Forward Pass Part

### Forward pass
# Output from the hidden layer
x_out = sigmoid(B1+np.dot(W1, x_hid))
# Output from the output layer
y_out = softmax(B2+np.dot(W2, x_out))

Code: Back propagation Part

### Backward pass
error = y_out - t
dEdW2 += np.dot(error, x_out.T) / n_train
dEdB2 += error / n_train
error = np.dot(W2.T, error) * x_out * (1-x_out)
dEdW1 += np.dot(error, x_hid.T) / n_train
dEdB1 += error / n_train

Code: Parameter Update Part

### Update Weights
dW1, dB1, dW2, dB2 = -lr*dEdW1, -lr*dEdB1, -lr*dEdW2, -lr*dEdB2
W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2

Code: Entire Training

import numpy as np


def sigmoid(x):
    return 1 / (1+np.exp(-x))


def softmax(x):
    y = np.exp(x)
    return y / np.sum(y)


def train_network(X_train, T_train, X_test, T_test, 
                  W1, B1, W2, B2, lr, n_epochs):
    """
    Multiclass classification with NN
    Args:
      X_train(NumPy Array) : Data matrix (shape: n_train, n_input)
      T_train(NumPy Array) : Label matrix (shape: n_train, n_output)

      X_test(NumPy Array) : Data matrix (shape: n_test, n_input)
      T_test(NumPy Array) : Label matrix (shape: n_test, n_output)

      W1(NumPy Array) : Weight matrix in layer1 (shape: n_hidden, n_input)
      B1(NumPy Array) : Bias vector in layer1 (shape: n_hidden, 1)
      W2(NumPy Array) : Weight matrix in layer2 (shape: n_output, n_hidden)
      B2(NumPy Array) : Bias vector in layer2 (shape: n_output, 1)

      lr(float)     : Learning rate
      n_epochs(int) : Maximum epochs to update the parameters

      where 
        n_train, n_test, are the number of training/test samples respertively. 
        n_input, n_output are the numbere of input/output features respectively
        n_hidden is the number of unit in the hidden layer

    Returns:
      W1, B1, W2, B2 (NumPy Array): Updated weight matrices and vectors
      ce_train(list of float)  : The history of cross entropy on training data.
      acc_train(list of float) : The history of cross entropy on training data.
      ce_test(list of float)   : The history of accuracy on training data.
      acc_test(list of float)  : The history of accuracy on training data.
    """
    ### Get the dimension of input variables
    n_train, n_test = X_train.shape[0], X_test.shape[0]
    n_input, n_output = X_train.shape[1], T_train.shape[1]

    ### Train the network
    ce_train, acc_train, ce_test, acc_test = [], [], [], []
    for epoch in range(n_epochs):
        ### Train on training data
        ce, acc = 0.0, 0.0
        ### Initialize the gradient matrices
        dEdW1, dEdB1 = np.zeros(W1.shape), np.zeros(B1.shape)
        dEdW2, dEdB2 = np.zeros(W2.shape), np.zeros(B2.shape)
        for x, t in zip(X_train, T_train):
            x_hid = x.reshape((-1, 1))
            ### Forward pass
            # Output from the hidden layer
            x_out = sigmoid(B1+np.dot(W1, x_hid))
            # Output from the output layer
            y_out = softmax(B2+np.dot(W2, x_out))

            ### Error Check
            t = t.reshape((-1, 1))
            # Cross-entropy
            ce += -np.sum(t*np.log(y_out)) / n_train
            # Accuracy
            if np.max(y_out)==np.max(y_out*t):
                acc += 1 / n_train

            ### Backward pass
            # Output layer
            error = y_out - t
            dEdW2 += np.dot(error, x_out.T) / n_train
            dEdB2 += error / n_train
            # Hidden layer
            error = np.dot(W2.T, error) * x_out * (1-x_out)
            dEdW1 += np.dot(error, x_hid.T) / n_train
            dEdB1 += error / n_train

        ### Store cross entropy and accuracy
        ce_train.append(ce)
        acc_train.append(acc)

        ### Update Weights
        dW1, dB1, dW2, dB2 = -lr*dEdW1, -lr*dEdB1, -lr*dEdW2, -lr*dEdB2
        W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2 

        ### Evaluate error on test data
        ce, acc = 0.0, 0.0
        for x, t in zip(X_test, T_test):
            x_hid = x.reshape((-1, 1))
            ### Forward pass
            # Output from the hidden layer
            x_out = sigmoid(B1+np.dot(W1, x_hid))
            # Output from the output layer
            y_out = softmax(B2+np.dot(W2, x_out))

            ### Error Check
            t = t.reshape((-1, 1))
            # Cross-entropy
            ce += -np.sum(t*np.log(y_out)) / n_test
            # Accuracy
            if np.max(y_out)==np.max(y_out*t):
                acc += 1 / n_test

        ### Store cross entropy and accuracy
        ce_test.append(ce)
        acc_test.append(acc)

        print(("Epoch {:4}   " 
               "CE_train {:12.5e}   CE_test {:12.5e}   "
               "ACC_train {:6.3}   ACC_test {:6.3}").format( \
                epoch, ce_train[-1], ce_test[-1], 
                acc_train[-1], acc_test[-1]))
    return W1, B1, W2, B2, ce_train, ce_test, acc_train, acc_test

Result

I trained the network changing the learning rate $\alpha = 0.1, 0.3, 0.6$ and the number of hidden units $N_{hid} = 10, 20, 30$.
The following figures show the change of cross-entropy and accuracy during the training. As the value of the learning late increases, faster the network gets optimised. In any of the configuration, weight parameter has not reached to its local or global optimum, but even so, we can infer that better performance is archived with more hidden units (until certain point.).

学習率 $\alpha = 0.1, 0.3, 0.6$ 隠れ層のユニットの数を $N_{hid} = 10, 20, 30$ と変化させてネットワークを訓練した。
以下の図は訓練中のクロスエントロピーと正解率の推移を示す。学習率の値を大きくするごとに、ネットワークがより速く最適化されるのが分かる。どの実験の設定に於いても、重みパラメーターは局所最適値には達していないが、それでも隠れユニットの数を増やすことでネットワークの性能が上がることが分かる。（おそらく上限はあることだろう。）

Initialisation Matters

When initialising weight matrices, depending on which random distribution is used, the network optimisation appeared different, so I leave some notes on it. The following graphs shows the network optimisation with same parameter value but with different random distribution for initialisation, which are uniform distribution and normal distribution. (The data of uniform distribution is the one used on the above graphs.) It seems that the initial entropy is smaller in the case of uniform distribution and the network is optimised faster.

重み行列をランダムに初期化する際、分布に何を使うかに依って、最適化の具合に変化が見られたので記録としてここに記しておく。以下の図はパラメーターを同じ条件にして、重み行列の初期化に一様乱数と正規乱数を用いた場合のネットワーク性能の変化である。(一様乱数の方は上に既に掲載したデータである。) 一様乱数の方が初期エントロピーが低く、最適化も早いようである。

# Initialisation with uniform distribution 
W1 = np.random.uniform(size=(n_hid, n_in))-0.5
B1 = np.random.uniform(size=(n_hid, 1))-0.5
W2 = np.random.uniform(size=(n_out, n_hid))-0.5
B2 = np.random.uniform(size=(n_out, 1))-0.5

# Initialisation with normal distribution 
W1 = np.random.normal(size=(n_hid, n_in))
B1 = np.random.normal(size=(n_hid, 1))
W2 = np.random.normal(size=(n_out, n_hid))
B2 = np.random.normal(size=(n_out, 1))