RNN(Recurrent Neural Networks)公式推导和实现

发布时间：2023-09-06 01:06责任编辑：彭小芳关键词：暂无标签

RNN(Recurrent Neural Networks)公式推导和实现

http://x-algo.cn/index.php/2016/04/25/rnn-recurrent-neural-networks-derivation-and-implementation/

2016-04-25分类：Deep Learning/NLP/RNN阅读(6997) 评论(7)

本文主要参考wildml的博客所写，所有的代码都是python实现。没有使用任何深度学习的工具，公式推导虽然枯燥，但是推导一遍之后对RNN的理解会更加的深入。看本文之前建议对传统的神经网络的基本知识已经了解，如果不了解的可以看此文：『神经网络（Neural Network）实现』。

所有可执行代码：Code to follow along is on Github.

文章目录[展开]

语言模型

熟悉NLP的应该会比较熟悉，就是将自然语言的一句话『概率化』。具体的，如果一个句子有m个词，那么这个句子生成的概率就是：

P(w1,...,wm)=∏mi=1P(wi∣w1,...,wi?1)P(w1,...,wm)=∏i=1mP(wi∣w1,...,wi?1)

其实就是假设下一次词生成的概率和只和句子前面的词有关，例如句子『He went to buy some chocolate』生成的概率可以表示为：P(他喜欢吃巧克力)=P(他喜欢吃)*P(巧克力|他喜欢吃)。

数据预处理

训练模型总需要语料，这里语料是来自google big query的reddit的评论数据，语料预处理会去掉一些低频词从而控制词典大小，低频词使用一个统一标识替换（这里是UNKNOWN_TOKEN），预处理之后每一个词都会使用一个唯一的编号替换；为了学出来哪些词常常作为句子开始和句子结束，引入SENTENCE_START和SENTENCE_END两个特殊字符。具体就看代码吧：

点击展开代码

网络结构

和传统的nn不同，但是也很好理解，rnn的网络结构如下图：

A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

不同之处就在于rnn是一个『循环网络』，并且有『状态』的概念。

如上图，t表示的是状态，xtxt表示的状态t的输入，stst表示状态t时隐层的输出，otot表示输出。特别的地方在于，隐层的输入有两个来源，一个是当前的xtxt输入、一个是上一个状态隐层的输出st?1st?1，W,U,VW,U,V为参数。使用公式可以将上面结构表示为：

sty^t=tanh(Uxt+Wst?1)=softmax(Vst)st=tanh?(Uxt+Wst?1)y^t=softmax(Vst)

如果隐层节点个数为100，字典大小C=8000，参数的维度信息为：

xtotstUVW∈R8000∈R8000∈R100∈R100×8000∈R8000×100∈R100×100xt∈R8000ot∈R8000st∈R100U∈R100×8000V∈R8000×100W∈R100×100

初始化

参数的初始化有很多种方法，都初始化为0将会导致『symmetric calculations 』（我也不懂），如何初始化其实是和具体的激活函数有关系，我们这里使用的是tanh，一种推荐的方式是初始化为[?1n√,1n√][?1n,1n]，其中n是前一层接入的链接数。更多信息请点击查看更多。

class RNNNumpy:

def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):

# Assign instance variables

self.word_dim = word_dim

self.hidden_dim = hidden_dim

self.bptt_truncate = bptt_truncate

# Randomly initialize the network parameters

self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))

self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))

self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

前向传播

类似传统的nn的方法，计算几个矩阵乘法即可：

def forward_propagation(self, x):

# The total number of time steps

T = len(x)

# During forward propagation we save all hidden states in s because need them later.

# We add one additional element for the initial hidden, which we set to 0

s = np.zeros((T + 1, self.hidden_dim))

s[-1] = np.zeros(self.hidden_dim)

# The outputs at each time step. Again, we save them for later.

o = np.zeros((T, self.word_dim))

# For each time step...

for t in np.arange(T):

# Note that we are indxing U by x[t]. This is the same as multiplying U with a one-hot vector.

s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))

o[t] = softmax(self.V.dot(s[t]))

return [o, s]

预测函数可以写为：

def predict(self, x):

# Perform forward propagation and return index of the highest score

o, s = self.forward_propagation(x)

return np.argmax(o, axis=1)

损失函数

类似nn方法，使用交叉熵作为损失函数，如果有N个样本，损失函数可以写为：

L(y,o)=?1N∑n∈NynlogonL(y,o)=?1N∑n∈Nynlog?on

下面两个函数用来计算损失：

def calculate_total_loss(self, x, y):

L = 0

# For each sentence...

for i in np.arange(len(y)):

o, s = self.forward_propagation(x[i])

# We only care about our prediction of the "correct" words

correct_word_predictions = o[np.arange(len(y[i])), y[i]]

# Add to the loss based on how off we were

L += -1 * np.sum(np.log(correct_word_predictions))

return L

def calculate_loss(self, x, y):

# Divide the total loss by the number of training examples

N = np.sum((len(y_i) for y_i in y))

return self.calculate_total_loss(x,y)/N

BPTT学习参数

BPTT（Backpropagation Through Time）是一种非常直观的方法，和传统的BP类似，只不过传播的路径是个『循环』，并且路径上的参数是共享的。

损失是交叉熵，损失可以表示为：

Et(yt,y^t)E(y,y^)=?ytlogy^t=∑tEt(yt,y^t)=?∑tytlogy^tEt(yt,y^t)=?ytlog?y^tE(y,y^)=∑tEt(yt,y^t)=?∑tytlog?y^t

其中ytyt是真实值，(^yt)(^yt)是预估值，将误差展开可以用图表示为：

所以对所有误差求W的偏导数为：

?E?W=∑t?Et?W?E?W=∑t?Et?W

进一步可以将EtEt表示为：

?E3?V=?E3?y^3?y^3?V=?E3?y^3?y^3?z3?z3?V=(y^3?y3)?s3?E3?V=?E3?y^3?y^3?V=?E3?y^3?y^3?z3?z3?V=(y^3?y3)?s3

根据链式法则和RNN中W权值共享，可以得到：

?E3?W=∑k=03?E3?y^3?y^3?s3?s3?sk?sk?W?E3?W=∑k=03?E3?y^3?y^3?s3?s3?sk?sk?W

下图将这个过程表示的比较形象

BPTT更新梯度的代码：

def bptt(self, x, y):

T = len(y)

# Perform forward propagation

o, s = self.forward_propagation(x)

# We accumulate the gradients in these variables

dLdU = np.zeros(self.U.shape)

dLdV = np.zeros(self.V.shape)

dLdW = np.zeros(self.W.shape)

delta_o = o

delta_o[np.arange(len(y)), y] -= 1.

# For each output backwards...

for t in np.arange(T)[::-1]:

dLdV += np.outer(delta_o[t], s[t].T)

# Initial delta calculation: dL/dz

delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))

# Backpropagation through time (for at most self.bptt_truncate steps)

for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:

# print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)

# Add to gradients at each previous step

dLdW += np.outer(delta_t, s[bptt_step-1])

dLdU[:,x[bptt_step]] += delta_t

# Update delta for next step dL/dz at t-1

delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)

return [dLdU, dLdV, dLdW]

梯度弥散现象

tanh和sigmoid函数和导数的取值返回如下图，可以看到导数取值是[0-1]，用几次链式法则就会将梯度指数级别缩小，所以传播不了几层就会出现梯度非常弱。克服这个问题的LSTM是一种最近比较流行的解决方案。

Gradient Checking

梯度检验是非常有用的，检查的原理是一个点的『梯度』等于这个点的『斜率』，估算一个点的斜率可以通过求极限的方式：

?L?θ≈limh→0J(θ+h)?J(θ?h)2h?L?θ≈limh→0J(θ+h)?J(θ?h)2h

通过比较『斜率』和『梯度』的值，我们就可以判断梯度计算的是否有问题。需要注意的是这个检验成本还是很高的，因为我们的参数个数是百万量级的。

梯度检验的代码：

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):

# Calculate the gradients using backpropagation. We want to checker if these are correct.

bptt_gradients = self.bptt(x, y)

# List of all parameters we want to check.

model_parameters = [‘U‘, ‘V‘, ‘W‘]

# Gradient check for each parameter

for pidx, pname in enumerate(model_parameters):

# Get the actual parameter value from the mode, e.g. model.W

parameter = operator.attrgetter(pname)(self)

print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))

# Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...

it = np.nditer(parameter, flags=[‘multi_index‘], op_flags=[‘readwrite‘])

while not it.finished:

ix = it.multi_index

# Save the original value so we can reset it later

original_value = parameter[ix]

# Estimate the gradient using (f(x+h) - f(x-h))/(2*h)

parameter[ix] = original_value + h

gradplus = self.calculate_total_loss([x],[y])

parameter[ix] = original_value - h

gradminus = self.calculate_total_loss([x],[y])

estimated_gradient = (gradplus - gradminus)/(2*h)

# Reset parameter to original value

parameter[ix] = original_value

# The gradient for this parameter calculated using backpropagation

backprop_gradient = bptt_gradients[pidx][ix]

# calculate The relative error: (|x - y|/(|x| + |y|))

relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))

# If the error is to large fail the gradient check

if relative_error > error_threshold:

print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)

print "+h Loss: %f" % gradplus

print "-h Loss: %f" % gradminus

print "Estimated_gradient: %f" % estimated_gradient

&nbs

知识推荐

我的编程学习网——分享web前端后端开发技术知识。垃圾信息处理邮箱 tousu563@163.com 网站地图