CS224N-NLP Assignment individual solution

Assignment 1:

Q1: Softmax

1. Prove that $softmax(x)=softmax(x+c)$, where $x$ is a vector and $c$ is a scalar constant added to every component.

Proof:
Let $x=(x_1,x_2,\dots,x_n)$.
Then $softmax(x)=\left(\frac{e^{x_1}}{\sum_j e^{x_j}},\frac{e^{x_2}}{\sum_j e^{x_j}},\dots,\frac{e^{x_n}}{\sum_j e^{x_j}}\right)$
and $softmax(x+c)=\left(\frac{e^{x_1+c}}{\sum_j e^{x_j+c}},\frac{e^{x_2+c}}{\sum_j e^{x_j+c}},\dots,\frac{e^{x_n+c}}{\sum_j e^{x_j+c}}\right)=\left(\frac{e^{x_1}e^c}{e^c\sum_j e^{x_j}},\dots,\frac{e^{x_n}e^c}{e^c\sum_j e^{x_j}}\right)=\left(\frac{e^{x_1}}{\sum_j e^{x_j}},\dots,\frac{e^{x_n}}{\sum_j e^{x_j}}\right)=softmax(x),$
which completes the proof. This invariance is what allows subtracting $\max_i x_i$ from every component for numerical stability, as the implementation below does.
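A quick numerical sanity check of this invariance (a minimal standalone sketch, not part of the assignment code):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
c = 5.0
s = np.exp(x) / np.sum(np.exp(x))
s_shifted = np.exp(x + c) / np.sum(np.exp(x + c))
print(np.allclose(s, s_shifted))   # True: the constant shift cancels out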

2.softmax

import numpy as np

def softmax_inrow(m):
    """Row-wise softmax. Also accepts a 1-D vector, which is treated as a single row."""
    if m.ndim == 1:
        e = np.exp(m - np.max(m))          # subtract the max for numerical stability
        return e / np.sum(e)
    row_max = np.max(m, axis=1).reshape(-1, 1)   # per-row max, subtracted for numerical stability
    e = np.exp(m - row_max)
    row_sum = np.sum(e, axis=1).reshape(-1, 1)
    return e / row_sum

def softmax_incolumn(m):
    """Column-wise softmax: each column of m is normalized independently."""
    col_max = np.max(m, axis=0)            # per-column max, subtracted for numerical stability
    e = np.exp(m - col_max)
    col_sum = np.sum(e, axis=0)
    return e / col_sum

N = int(input())
D = int(input())
matrix = np.random.rand(N, D)
print(softmax_incolumn(matrix))

Q2:
2.1: Differentiate the sigmoid $\sigma$ and express its derivative in terms of $\sigma$ itself.

Since $\sigma(x)=\frac{1}{1+e^{-x}}$,
$\sigma'(x)=\frac{e^{-x}}{(1+e^{-x})^2}$. Noting that $e^{-x}=\frac{1}{\sigma(x)}-1$ and $\frac{1}{1+e^{-x}}=\sigma(x)$,
we get $\sigma'(x)=\sigma(x)(1-\sigma(x))$.
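A quick finite-difference check of this identity (a standalone sketch, independent of the assignment code):

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central differences
analytic = sigmoid(x) * (1.0 - sigmoid(x))
print(np.allclose(numeric, analytic))   # True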

2.2: Derive the gradient, with respect to $\theta$, of the cross-entropy between a one-hot label $y$ and the prediction $\hat y=softmax(\theta)$.

Solution:
We know that $CE(y,\hat y)=-\sum_i y_i\log(\hat y_i)$.
Write $\hat y=softmax(\theta)=(s_1,s_2,\dots,s_n)$ and $\theta=(d_1,d_2,\dots,d_n)$.
By the chain rule, $\frac{\partial CE}{\partial d_i}=\sum_k\frac{\partial CE}{\partial s_k}\frac{\partial s_k}{\partial d_i}$.
First consider $\frac{\partial CE}{\partial s_k}$:
$\frac{\partial CE}{\partial s_k}=\frac{\partial\left(-\sum_j y_j\log s_j\right)}{\partial s_k}=-\frac{y_k}{s_k}$.
Next consider $\frac{\partial s_k}{\partial d_i}$, where $s_k=\frac{e^{d_k}}{\sum_j e^{d_j}}$. There are two cases:
when $k=i$: $\frac{\partial s_k}{\partial d_i}=s_i-s_i^2=s_i(1-s_i)$;
when $k\neq i$: $\frac{\partial s_k}{\partial d_i}=-s_ks_i$.
So $\frac{\partial CE}{\partial d_i}=-\frac{y_i}{s_i}s_i(1-s_i)+\sum_{k\neq i}\frac{y_k}{s_k}s_ks_i=-y_i(1-s_i)+s_i\sum_{k\neq i}y_k=-y_i+s_i\sum_k y_k=s_i-y_i$,
since $y$ is one-hot and therefore $\sum_k y_k=1$. In vector form, $\frac{\partial CE}{\partial\theta}=\hat y-y$.

That is, take $softmax(\theta)$ and subtract 1 from the element at index $i$ (the index of the correct label); the result is the gradient.
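A small numerical check of this gradient (a standalone sketch; theta and the one-hot label below are arbitrary):

import numpy as np

theta = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])            # one-hot label, correct class at index 2

def ce(t):
    s = np.exp(t - np.max(t)) / np.sum(np.exp(t - np.max(t)))
    return -np.sum(y * np.log(s))

s = np.exp(theta - np.max(theta)) / np.sum(np.exp(theta - np.max(theta)))
analytic = s - y                         # the result derived above: softmax(theta) - y
h = 1e-6
numeric = np.array([(ce(theta + h * e) - ce(theta - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True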

2.3
(network diagram omitted: a one-hidden-layer network $x \to h \to \hat y$)
The loss function is the cross-entropy $CE(y,\hat y)$ defined above;
the $h\to\hat y$ activation is softmax and the $x\to h$ activation is the sigmoid.
Derive the gradient of the loss with respect to the input $x$, where (using row vectors, consistent with the shapes in 2.4)
$\hat y=softmax(hW_2+b_2)$
$h=\sigma(xW_1+b_1)$

Solution:
Similar to 2.2. Let $z_1=xW_1+b_1$ and $z_2=hW_2+b_2$, so that $h=\sigma(z_1)$ and $\hat y=softmax(z_2)$.
From 2.2, $\frac{\partial CE}{\partial z_2}=\hat y-y$.
By the chain rule,
$\frac{\partial CE}{\partial h}=(\hat y-y)W_2^T$,
$\frac{\partial CE}{\partial z_1}=\frac{\partial CE}{\partial h}\circ\sigma'(z_1)=\big((\hat y-y)W_2^T\big)\circ\sigma'(z_1)$,
and finally
$\frac{\partial CE}{\partial x}=\frac{\partial CE}{\partial z_1}W_1^T=\Big(\big((\hat y-y)W_2^T\big)\circ\sigma'(z_1)\Big)W_1^T$,
where $\circ$ denotes element-wise multiplication. This is exactly what the backpropagation code below computes (d3, dh, grad_h).
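As a numerical check of this formula on a tiny network (a standalone sketch; all sizes, weights, and the label are arbitrary):

import numpy as np

rng = np.random.RandomState(0)
Dx, H, Dy = 4, 3, 5                               # arbitrary small sizes
x = rng.randn(1, Dx)
y = np.zeros((1, Dy))
y[0, 2] = 1.0                                     # an arbitrary one-hot label
W1, b1 = rng.randn(Dx, H), rng.randn(1, H)
W2, b2 = rng.randn(H, Dy), rng.randn(1, Dy)

sig = lambda a: 1.0 / (1.0 + np.exp(-a))
sm = lambda a: np.exp(a - np.max(a)) / np.sum(np.exp(a - np.max(a)))

def ce(xv):
    h = sig(xv.dot(W1) + b1)
    yhat = sm(h.dot(W2) + b2)
    return -np.sum(y * np.log(yhat))

h = sig(x.dot(W1) + b1)
yhat = sm(h.dot(W2) + b2)
analytic = ((yhat - y).dot(W2.T) * h * (1 - h)).dot(W1.T)   # the formula above, with sigma'(z1) = h*(1-h)

eps = 1e-6
numeric = np.zeros_like(x)
for i in range(Dx):
    e = np.zeros_like(x)
    e[0, i] = eps
    numeric[0, i] = (ce(x + e) - ce(x - e)) / (2 * eps)     # central differences
print(np.allclose(analytic, numeric, atol=1e-6))            # True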

2.4 (it helps to do this one before 2.3, to get the dimensions straight)
Assuming the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units:

$x$'s shape is $(M,D_x)$, $W_1$'s shape is $(D_x,H)$, $W_2$'s shape is $(H,D_y)$, $b_1$'s shape is $(1,H)$, and $b_2$'s shape is $(1,D_y)$,
so there are $H(D_x+1)+D_y(H+1)$ parameters to train.
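A quick check of this count for concrete sizes (a minimal sketch with arbitrary dimensions):

Dx, H, Dy = 10, 5, 3                              # arbitrary example sizes
n_params = Dx * H + H + H * Dy + Dy               # W1 + b1 + W2 + b2
print(n_params == H * (Dx + 1) + Dy * (H + 1))    # True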

2.5

import numpy as np

def sigmoid(x):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_gradient(f):
    """Gradient of the sigmoid given f = sigmoid(x); by 2.1, sigma'(x) = f * (1 - f)."""
    return f * (1.0 - f)

2.6

import numpy as np
import random
def gradcheck_naive(f, x):
    """ Gradient check for a function f.
        Arguments:
        f -- a function that takes a single argument and outputs the
             cost and its gradients
        x -- the point (numpy array) to check the gradient at
    """

    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x)  # Evaluate the function value and gradient at the original point
    h = 1e-4  # Do not change this!

    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])  # iterate over x; multi_index yields the (row, col, ...) index, readwrite lets us modify x in place
    while not it.finished:  # loop over every element of x
        ix = it.multi_index   # the current multi-dimensional index, e.g. (a, b)

        # Try modifying x[ix] with h defined above to compute
        # numerical gradients. Make sure you call random.setstate(rndstate)
        # before calling f(x) each time. This will make it possible
        # to test cost functions with built in randomness later.

        ### YOUR CODE HERE:
        x[ix] += h

        random.setstate(rndstate)
        new_f1 = f(x)[0]

        x[ix] -= 2 * h

        random.setstate(rndstate)
        new_f2 = f(x)[0]

        x[ix] += h

        numgrad = (new_f1 - new_f2) / (2 * h)
        ### END YOUR CODE


        # Compare the numerical and analytic gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print("Gradient check failed.")
            print("First gradient error found at index %s" % str(ix))
            print("Your gradient: %f \t Numerical gradient: %f" % (grad[ix], numgrad))
            return

        it.iternext()  # Step to next dimension

    print("Gradient check passed!")

Backpropagation gradient computation (derivation figure omitted; it follows the chain-rule steps from 2.3).

Backpropagation:
def forward_back_prop(data, labels, params, dimensions):
    """
    Forward pass and backpropagation for a neural network with one hidden layer.
    """
    if len(data.shape) >= 2:
        (N, _) = data.shape

    ### Unpack the parameters of each layer from the flat params vector
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])  # the three layer sizes

    # Slice out all parameters
    W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))  # W1, of size Dx * H
    ofs += Dx * H                                        # advance the offset
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))



    ### Forward pass
    h = sigmoid(np.dot(data, W1) + b1)
    yhat = softmax_inrow(np.dot(h, W2) + b2)
    ### END

    ### Backward pass
    cost = np.sum(-np.log(yhat[labels == 1])) / data.shape[0]  # -y*log(yhat), keeping only the labeled entries, averaged over the batch

    d3 = (yhat - labels) / data.shape[0]   # gradient w.r.t. the softmax input (question 2.2)
    gradW2 = np.dot(h.T, d3)               # gradient w.r.t. W2
    gradb2 = np.sum(d3, 0, keepdims=True)

    dh = np.dot(d3, W2.T)                  # gradient w.r.t. h
    grad_h = sigmoid_gradient(h) * dh      # h is the sigmoid output, so sigmoid_gradient(h) = h * (1 - h) = sigma'(z1)

    gradW1 = np.dot(data.T, grad_h)
    gradb1 = np.sum(grad_h, 0)
    ### END

    ### Stack gradients (do not modify)
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
                           gradW2.flatten(), gradb2.flatten()))
    return cost,grad
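To check these gradients end to end, forward_back_prop can be wrapped for gradcheck_naive from 2.6; a sketch with arbitrary sizes and random one-hot labels, along the lines of the assignment's sanity check:

N, dimensions = 20, [10, 5, 10]
data = np.random.randn(N, dimensions[0])
labels = np.zeros((N, dimensions[2]))
for i in range(N):
    labels[i, np.random.randint(0, dimensions[2])] = 1   # one random one-hot label per row

params = np.random.randn((dimensions[0] + 1) * dimensions[1] +
                         (dimensions[1] + 1) * dimensions[2])
gradcheck_naive(lambda p: forward_back_prop(data, labels, p, dimensions),
                params)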

3.word2vec

(a) The gradient of the softmax cross-entropy with respect to its input is $\hat y-y$ (question 2.2). By the chain rule, with $\hat y=softmax(U^Tv_c)$ as in the figure ($v_c$ the center word vector, the output vectors $u_w$ as the columns of $U$), the gradient with respect to $v_c$ is
$\frac{\partial J}{\partial v_c}=U(\hat y-y).$
A case analysis over the correct output word $o$ and the other words gives the component form. (The derivation figures from the original write-up are omitted.)

(b) Similar to (a). Differentiating with respect to each output vector $u_w$ and splitting into the cases $w=o$ and $w\neq o$ gives
$\frac{\partial J}{\partial u_w}=(\hat y_w-y_w)v_c$, i.e. $\frac{\partial J}{\partial U}=v_c(\hat y-y)^T.$
(The remaining derivation figures are omitted; the corresponding gradients are implemented in the code below.)

import numpy as np
import random

from q1_softmax import softmax_inrow
from q2_gradcheck import gradcheck_naive
from q2_sigmoid import sigmoid, sigmoid_gradient

# normalizeRows: divide every element of a row by that row's L2 norm (the square root of the row's sum of squares).
def normalizeRows(x):
    """ Row normalization function
    Implement a function that normalizes each row of a matrix to have
    unit length.
    """

    ### YOUR CODE HERE
    norm = np.sqrt(np.sum(np.square(x), axis=1, keepdims=True))  # per-row L2 norm, shape (N, 1), so it broadcasts across columns
    x = x / norm
    ### END YOUR CODE
    return x

#target是目标的位置
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ Softmax cost function for word2vec models
        Implement the cost and gradients for one predicted word vector
        and one target word vector as a building block for word2vec
        models, assuming the softmax prediction function and cross
        entropy loss.
        Arguments:
        predicted -- numpy ndarray, predicted word vector (\hat{v} in
                     the written component)
        target -- integer, the index of the target word
        outputVectors -- "output" vectors (as rows) for all tokens
        dataset -- needed for negative sampling, unused here.
        Return:
        cost -- cross entropy cost for the softmax word prediction
        gradPred -- the gradient with respect to the predicted word
               vector
        grad -- the gradient with respect to all the other word
               vectors
        We will not provide starter code for this function, but feel
        free to reference the code you previously wrote for this
        assignment!
        """
    ### YOUR CODE HERE

    # Forward pass: score every output word against the center word vector
    v_hat = predicted                      # the predicted (center) word vector v_c
    z = np.dot(outputVectors, v_hat)       # scores: dot product of each output vector with v_c
    preds = softmax_inrow(z)               # softmax over the scores, i.e. \hat{y}

    cost = -np.log(preds[target])          # cross-entropy loss for the target word

    # Backward pass
    z = preds.copy()                       # \hat{y}
    z[target] -= 1.0                       # \hat{y} - y (y is one-hot at the target index)
    grad = np.outer(z, v_hat)              # dJ/dU: row w is (\hat{y}_w - y_w) * v_c
    gradPred = np.dot(outputVectors.T, z)  # dJ/dv_c = U^T (\hat{y} - y)
    ### END YOUR CODE


    return cost, gradPred, grad
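gradPred can be sanity-checked by gradient-checking the cost as a function of the predicted vector alone (a sketch; the sizes, vectors, and target index below are arbitrary, and it reuses gradcheck_naive and normalizeRows from above):

V, d = 5, 3                                   # vocabulary size and embedding dimension
outputVectors = normalizeRows(np.random.randn(V, d))
target = 2

gradcheck_naive(lambda vec: softmaxCostAndGradient(vec, target, outputVectors, None)[:2],
                np.random.randn(d))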

def getNegativeSamples(target, dataset, K):
    """ Samples K indexes which are not the target """

    indices = [None] * K
    for k in range(K):
        newidx = dataset.sampleTokenIdx()
        while newidx == target:
            newidx = dataset.sampleTokenIdx()
        indices[k] = newidx
    return indices
def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
                               K=10):
    """ Negative sampling cost function for word2vec models

    Implement the cost and gradients for one predicted word vector
    and one target word vector as a building block for word2vec
    models, using the negative sampling technique. K is the sample
    size.

    Note: See test_word2vec below for dataset's initialization.

    Arguments/Return Specifications: same as softmaxCostAndGradient
    """
    # Sampling of indices is done for you. Do not modify this if you
    # wish to match the autograder and receive points!
    # Sample K negative examples for this window (indices[0] is the true target)
    indices = [target]
    indices.extend(getNegativeSamples(target, dataset, K))

    # Initialization
    grad = np.zeros(outputVectors.shape)
    gradPred = np.zeros(predicted.shape)
    cost = 0

    z = sigmoid(np.dot(outputVectors[target],predicted))

    cost -= np.log(z)
    grad[target] += predicted*(z-1.0)
    gradPred = outputVectors[target] * (z-1.0)

    # Negative samples: minimize the probability of these words appearing near the center word
    for k in range(K):
        sample = indices[k + 1]
        z = sigmoid(np.dot(outputVectors[sample],predicted))
        cost -= np.log(1.0-z)
        grad[sample] += predicted*z
        gradPred += outputVectors[sample] * z
    return cost, gradPred, grad
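For reference, the objective this function implements (with $u_o$ the target's output vector, $u_k$ the $K$ sampled negative output vectors, and $v_c$ the predicted center vector) is
$J=-\log\sigma(u_o^Tv_c)-\sum_{k=1}^K\log\big(1-\sigma(u_k^Tv_c)\big)$,
whose gradients are $\frac{\partial J}{\partial v_c}=(\sigma(u_o^Tv_c)-1)u_o+\sum_k\sigma(u_k^Tv_c)u_k$, $\frac{\partial J}{\partial u_o}=(\sigma(u_o^Tv_c)-1)v_c$, and $\frac{\partial J}{\partial u_k}=\sigma(u_k^Tv_c)v_c$; these match the accumulation in the loop above.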


def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ Skip-gram model in word2vec

    Implement the skip-gram model in this function.

    Arguments:
    currentWord -- a string of the current center word
    C -- integer, context size
    contextWords -- list of no more than 2*C strings, the context words
    tokens -- a dictionary that maps words to their indices in
              the word vector list
    inputVectors -- "input" word vectors (as rows) for all tokens
    outputVectors -- "output" word vectors (as rows) for all tokens
    word2vecCostAndGradient -- the cost and gradient function for
                               a prediction vector given the target
                               word vectors, could be one of the two
                               cost functions you implemented above.

    Return:
    cost -- the cost function value for the skip-gram model
    grad -- the gradient with respect to the word vectors
    """
    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    cword_idx = tokens[currentWord]

    v_hat = inputVectors[cword_idx]   # the center word's input vector
    # Skip-gram predicts each context word within the window from the current center word.
    for i in contextWords:            # for every word in the context window
        idx = tokens[i]               # index of the context word to predict (the target)
        c_cost, c_grad_in, c_grad_out = word2vecCostAndGradient(v_hat, idx, outputVectors, dataset)
        # Accumulate the cost and gradients over all context words
        cost += c_cost
        gradOut += c_grad_out
        gradIn[cword_idx] += c_grad_in
    return cost, gradIn, gradOut
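A minimal usage sketch (the token map, vectors, and dummy dataset below are made up for illustration and reuse the numpy/random imports and normalizeRows defined above; the assignment's own test harness does something similar):

class DummyDataset:
    def sampleTokenIdx(self):
        return random.randint(0, 4)           # only needed by negative sampling

tokens = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
inputVectors = normalizeRows(np.random.randn(5, 3))
outputVectors = normalizeRows(np.random.randn(5, 3))

cost, gradIn, gradOut = skipgram("c", 1, ["a", "b"], tokens,
                                 inputVectors, outputVectors, DummyDataset())
print(cost, gradIn.shape, gradOut.shape)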