Sunday, February 22, 2015

Step by Step Neural Network: Momentum

In the previous posts, I implemented the minimum functionality of a neural network, but it learns very slowly, so I try to improve it by incorporating a momentum term. With the momentum term, the parameter update becomes the following.

$$\begin{align*} dW &\leftarrow - \alpha \frac{\partial E}{\partial W} \ + \ \eta \ dW \\ \\ dB &\leftarrow - \alpha \frac{\partial E}{\partial B} \ + \ \eta \ dB \\ \\ W &\leftarrow W + dW \\ \\ B &\leftarrow B + dB \end{align*} $$

where $\eta$ is the momentum decay rate and $0 \leq \eta < 1$.
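
To make this concrete, here is a minimal NumPy sketch of the update above. The function momentum_update, its gradient argument, and the values of alpha and eta are placeholders of mine, not code from the previous posts.

    import numpy as np

    def momentum_update(W, dW, grad_W, alpha=0.1, eta=0.9):
        """One momentum step: dW <- -alpha * dE/dW + eta * dW; W <- W + dW."""
        dW = -alpha * grad_W + eta * dW  # accumulate velocity
        W = W + dW                       # apply the update
        return W, dW

    # Toy usage: minimise E(W) = 0.5 * ||W||^2, whose gradient is W itself.
    W = np.array([1.0, -2.0])
    dW = np.zeros_like(W)
    for _ in range(100):
        W, dW = momentum_update(W, dW, grad_W=W)

The same function applies unchanged to the bias update, since $B$ follows the identical rule.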

If $\frac{\partial E}{\partial W}$ is evaluated at $W+\eta \, dW$ instead of at $W$, this formula turns into Nesterov's Accelerated Gradient.
$$\begin{align*} dW &\leftarrow - \alpha \frac{\partial E}{\partial W}\bigg|_{W=W+\eta \, dW} \ + \ \eta \ dW \\ \\ dB &\leftarrow - \alpha \frac{\partial E}{\partial B}\bigg|_{B=B+\eta \, dB} \ + \ \eta \ dB \\ \\ W &\leftarrow W + dW \\ \\ B &\leftarrow B + dB \end{align*} $$
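
A corresponding sketch, under the same assumed interface: because the gradient must now be evaluated at the look-ahead point $W + \eta \, dW$, the update takes a hypothetical grad_fn callable instead of a precomputed gradient.

    import numpy as np

    def nesterov_update(W, dW, grad_fn, alpha=0.1, eta=0.9):
        """One NAG step: the gradient is taken at the look-ahead point W + eta*dW."""
        grad_ahead = grad_fn(W + eta * dW)   # dE/dW at W + eta*dW, not at W
        dW = -alpha * grad_ahead + eta * dW  # same velocity update as before
        W = W + dW
        return W, dW

    # Same toy problem: E(W) = 0.5 * ||W||^2, so grad_fn is the identity.
    W = np.array([1.0, -2.0])
    dW = np.zeros_like(W)
    for _ in range(100):
        W, dW = nesterov_update(W, dW, grad_fn=lambda w: w)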

The following figures show how the training error improves with and without the momentum term, for different optimisation parameters. A similar figure can be found here.



References

  1. CSC321 Lecture Slides
  2. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton: "On the importance of initialization and momentum in deep learning," Proceedings of the 30th International Conference on Machine Learning (ICML-13).
