We are going to implement the core NN functionality described in Mathematical Foundation using Python 3 and NumPy.
Personal Notes from Debugging
- Don't confuse the definition of softmax with that of sigmoid; sigmoid has a minus sign.
$$ \mathrm{softmax}(x)_i=\frac{\exp(x_i)}{\sum_j \exp(x_j)} \\ \mathrm{sigmoid}(x)=\frac{1}{1+\exp({\color{red}-}x)} $$
- Make sure that the initial weight matrices contain both positive and negative values. Otherwise the network gets stuck on a plateau after just a few iterations:
np.random.uniform(size=(n_out, 1))-0.5
#                                 ^^^^ HERE!
- Initialisation does matter. Initialisation with a uniform distribution seems better than with a normal distribution. (The scale of the initial values also seems to matter, but I am not sure about it.)
- Of course, we should be using Glorot initialisation; a sketch follows right after these notes.
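A minimal sketch of what Glorot (Xavier) uniform initialisation could look like here. The helper name, the zero-initialised biases and the layer sizes are my own placeholders for illustration, not part of the original code; the limit follows Glorot & Bengio (2010).

import numpy as np

def glorot_uniform(fan_out, fan_in):
    # U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out)):
    # zero-centred weights whose scale adapts to the layer sizes.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

# Hypothetical layer sizes for illustration
n_in, n_hid, n_out = 784, 20, 10
W1 = glorot_uniform(n_hid, n_in)
W2 = glorot_uniform(n_out, n_hid)
B1 = np.zeros((n_hid, 1))  # biases are commonly started at zero
B2 = np.zeros((n_out, 1))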
In the following code snippets, x indicates an input to a layer, and y indicates an output from a layer.
Code: Forward Pass Part
### Forward pass
# Output from the hidden layer
x_out = sigmoid(B1+np.dot(W1, x_hid))
# Output from the output layer
y_out = softmax(B2+np.dot(W2, x_out))
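One caveat with the softmax used above (and defined in the full listing below): np.exp overflows for large logits. A common variant, which is not in the original code, subtracts the maximum before exponentiating, which leaves the result unchanged:

def softmax(x):
    # Subtracting the max leaves softmax unchanged but keeps np.exp from overflowing.
    y = np.exp(x - np.max(x))
    return y / np.sum(y)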
Code: Backpropagation Part
### Backward pass
error = y_out - t
dEdW2 += np.dot(error, x_out.T) / n_train
dEdB2 += error / n_train
error = np.dot(W2.T, error) * x_out * (1-x_out)
dEdW1 += np.dot(error, x_hid.T) / n_train
dEdB1 += error / n_train
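For reference, these are the per-sample gradients the snippet accumulates (each divided by n_train so that the sums average over the training set), following the usual softmax-plus-cross-entropy derivation from the Mathematical Foundation page:

$$\delta_2 = y_{out} - t,\quad \frac{\partial E}{\partial W_2} = \delta_2\, x_{out}^T,\quad \frac{\partial E}{\partial B_2} = \delta_2 \\ \delta_1 = (W_2^T \delta_2) \odot x_{out} \odot (1-x_{out}),\quad \frac{\partial E}{\partial W_1} = \delta_1\, x_{hid}^T,\quad \frac{\partial E}{\partial B_1} = \delta_1$$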
Code: Parameter Update Part
### Update Weights
dW1, dB1, dW2, dB2 = -lr*dEdW1, -lr*dEdB1, -lr*dEdW2, -lr*dEdB2
W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2
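This is plain full-batch gradient descent: every parameter moves against its accumulated gradient, scaled by the learning rate (lr in the code, $\alpha$ below):

$$\theta \leftarrow \theta - \alpha \frac{\partial E}{\partial \theta}, \quad \theta \in \{W_1, B_1, W_2, B_2\}$$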
Code: Entire Training
import numpy as np


def sigmoid(x):
    return 1 / (1+np.exp(-x))


def softmax(x):
    y = np.exp(x)
    return y / np.sum(y)


def train_network(X_train, T_train, X_test, T_test,
                  W1, B1, W2, B2, lr, n_epochs):
    """Multiclass classification with NN

    Args:
        X_train(NumPy Array) : Data matrix   (shape: n_train, n_input)
        T_train(NumPy Array) : Label matrix  (shape: n_train, n_output)
        X_test(NumPy Array)  : Data matrix   (shape: n_test, n_input)
        T_test(NumPy Array)  : Label matrix  (shape: n_test, n_output)
        W1(NumPy Array)      : Weight matrix in layer1 (shape: n_hidden, n_input)
        B1(NumPy Array)      : Bias vector in layer1   (shape: n_hidden, 1)
        W2(NumPy Array)      : Weight matrix in layer2 (shape: n_output, n_hidden)
        B2(NumPy Array)      : Bias vector in layer2   (shape: n_output, 1)
        lr(float)            : Learning rate
        n_epochs(int)        : Maximum epochs to update the parameters
    where n_train, n_test are the numbers of training/test samples respectively,
    n_input, n_output are the numbers of input/output features respectively,
    and n_hidden is the number of units in the hidden layer.

    Returns:
        W1, B1, W2, B2 (NumPy Array): Updated weight matrices and bias vectors
        ce_train(list of float)  : The history of cross entropy on training data.
        acc_train(list of float) : The history of accuracy on training data.
        ce_test(list of float)   : The history of cross entropy on test data.
        acc_test(list of float)  : The history of accuracy on test data.
    """
    ### Get the dimension of input variables
    n_train, n_test = X_train.shape[0], X_test.shape[0]
    n_input, n_output = X_train.shape[1], T_train.shape[1]

    ### Train the network
    ce_train, acc_train, ce_test, acc_test = [], [], [], []
    for epoch in range(n_epochs):
        ### Train on training data
        ce, acc = 0.0, 0.0
        ### Initialize the gradient matrices
        dEdW1, dEdB1 = np.zeros(W1.shape), np.zeros(B1.shape)
        dEdW2, dEdB2 = np.zeros(W2.shape), np.zeros(B2.shape)
        for x, t in zip(X_train, T_train):
            x_hid = x.reshape((-1, 1))
            ### Forward pass
            # Output from the hidden layer
            x_out = sigmoid(B1+np.dot(W1, x_hid))
            # Output from the output layer
            y_out = softmax(B2+np.dot(W2, x_out))

            ### Error Check
            t = t.reshape((-1, 1))
            # Cross-entropy
            ce += -np.sum(t*np.log(y_out)) / n_train
            # Accuracy
            if np.max(y_out) == np.max(y_out*t):
                acc += 1 / n_train

            ### Backward pass
            # Output layer
            error = y_out - t
            dEdW2 += np.dot(error, x_out.T) / n_train
            dEdB2 += error / n_train
            # Hidden layer
            error = np.dot(W2.T, error) * x_out * (1-x_out)
            dEdW1 += np.dot(error, x_hid.T) / n_train
            dEdB1 += error / n_train

        ### Store cross entropy and accuracy
        ce_train.append(ce)
        acc_train.append(acc)

        ### Update Weights
        dW1, dB1, dW2, dB2 = -lr*dEdW1, -lr*dEdB1, -lr*dEdW2, -lr*dEdB2
        W1, B1, W2, B2 = W1+dW1, B1+dB1, W2+dW2, B2+dB2

        ### Evaluate error on test data
        ce, acc = 0.0, 0.0
        for x, t in zip(X_test, T_test):
            x_hid = x.reshape((-1, 1))
            ### Forward pass
            # Output from the hidden layer
            x_out = sigmoid(B1+np.dot(W1, x_hid))
            # Output from the output layer
            y_out = softmax(B2+np.dot(W2, x_out))

            ### Error Check
            t = t.reshape((-1, 1))
            # Cross-entropy
            ce += -np.sum(t*np.log(y_out)) / n_test
            # Accuracy
            if np.max(y_out) == np.max(y_out*t):
                acc += 1 / n_test

        ### Store cross entropy and accuracy
        ce_test.append(ce)
        acc_test.append(acc)

        print(("Epoch {:4} "
               "CE_train {:12.5e} CE_test {:12.5e} "
               "ACC_train {:6.3} ACC_test {:6.3}").format(
                   epoch, ce_train[-1], ce_test[-1],
                   acc_train[-1], acc_test[-1]))

    return W1, B1, W2, B2, ce_train, ce_test, acc_train, acc_test
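A usage sketch, assuming the inputs are already flattened feature vectors and the labels are one-hot encoded; the array shapes, random data and hyperparameter values below are placeholders for illustration, not the setup used for the results that follow.

import numpy as np

# Placeholder data: 1000 training / 200 test samples, 784 features, 10 classes
n_in, n_hid, n_out = 784, 20, 10
X_train = np.random.rand(1000, n_in)
T_train = np.eye(n_out)[np.random.randint(0, n_out, 1000)]  # one-hot labels
X_test = np.random.rand(200, n_in)
T_test = np.eye(n_out)[np.random.randint(0, n_out, 200)]

# Uniform initialisation shifted so both signs occur, as noted above
W1 = np.random.uniform(size=(n_hid, n_in)) - 0.5
B1 = np.random.uniform(size=(n_hid, 1)) - 0.5
W2 = np.random.uniform(size=(n_out, n_hid)) - 0.5
B2 = np.random.uniform(size=(n_out, 1)) - 0.5

W1, B1, W2, B2, ce_train, ce_test, acc_train, acc_test = train_network(
    X_train, T_train, X_test, T_test, W1, B1, W2, B2, lr=0.3, n_epochs=50)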
Result
I trained the network varying the learning rate $\alpha = 0.1, 0.3, 0.6$ and the number of hidden units $N_{hid} = 10, 20, 30$. The following figures show how cross-entropy and accuracy change during training. As the learning rate increases, the network is optimised faster. In none of the configurations did the weight parameters reach a local or global optimum, but even so, we can infer that better performance is achieved with more hidden units (up to a certain point).
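For reference, the sweep over learning rates and hidden-layer sizes can be written as a loop like the one below; it reuses the placeholder data and the train_network call from the usage sketch above, and the epoch count is an assumption rather than a record of the exact runs behind the figures.

results = {}
for lr in (0.1, 0.3, 0.6):
    for n_hid in (10, 20, 30):
        # Fresh uniform initialisation, shifted to contain both signs, for each run
        W1 = np.random.uniform(size=(n_hid, n_in)) - 0.5
        B1 = np.random.uniform(size=(n_hid, 1)) - 0.5
        W2 = np.random.uniform(size=(n_out, n_hid)) - 0.5
        B2 = np.random.uniform(size=(n_out, 1)) - 0.5
        results[(lr, n_hid)] = train_network(
            X_train, T_train, X_test, T_test, W1, B1, W2, B2, lr, n_epochs=50)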
Initialisation Matters
When initialising the weight matrices, the optimisation behaved differently depending on which random distribution was used, so I leave some notes on it here. The following graphs show the optimisation of the network with the same parameter values but with different random distributions for initialisation: a uniform distribution and a normal distribution. (The uniform-distribution data is the same as in the graphs above.) The initial cross-entropy seems to be smaller with the uniform distribution, and the network is optimised faster.
# Initialisation with uniform distribution
W1 = np.random.uniform(size=(n_hid, n_in))-0.5
B1 = np.random.uniform(size=(n_hid, 1))-0.5
W2 = np.random.uniform(size=(n_out, n_hid))-0.5
B2 = np.random.uniform(size=(n_out, 1))-0.5
# Initialisation with normal distribution
W1 = np.random.normal(size=(n_hid, n_in))
B1 = np.random.normal(size=(n_hid, 1))
W2 = np.random.normal(size=(n_out, n_hid))
B2 = np.random.normal(size=(n_out, 1))
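One possible reading, consistent with the earlier note that the scale of the initial values may matter: the two initialisations differ far more in spread than in shape.

$$\sigma_{uniform} = \frac{0.5-(-0.5)}{\sqrt{12}} = \frac{1}{\sqrt{12}} \approx 0.29, \qquad \sigma_{normal} = 1$$

The smaller weights keep the pre-activations closer to the linear region of the sigmoid, where gradients are larger, which could explain the lower initial cross-entropy and the faster optimisation, though I have not verified this.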