Tag archive: word-embedding

What does the tf.nn.embedding_lookup function do?

Question: What does the tf.nn.embedding_lookup function do?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)

I don't understand what this function does. Is it like a lookup table? That is, does it return the parameters corresponding to each id (in ids)?

For instance, in the skip-gram model, if we use tf.nn.embedding_lookup(embeddings, train_inputs), does it then find the corresponding embedding for each train_input?


Answer 0


The embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy, e.g.

import numpy as np

matrix = np.random.random([1024, 64])  # 64-dimensional embeddings
ids = np.array([0, 5, 17, 33])
print(matrix[ids])  # prints a matrix of shape [4, 64]

The params argument can also be a list of tensors, in which case the ids will be distributed among the tensors. For example, given a list of 3 tensors of shape [2, 64], the default behavior is that they will represent the ids [0, 3], [1, 4], and [2, 5] respectively.

partition_strategy controls how the ids are distributed among the list. The partitioning is useful for larger-scale problems, when the matrix might be too large to keep in one piece.
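To make the default 'mod' assignment concrete, here is a small plain-Python sketch (not calling TensorFlow): with 3 shards, id i lives in shard i % 3 at row i // 3, which is exactly the [0, 3], [1, 4], [2, 5] grouping mentioned above.

num_shards = 3  # the 3 tensors in the params list
for i in range(6):
    print(i, '-> shard', i % num_shards, ', row', i // num_shards)
# 0 -> shard 0 , row 0
# 1 -> shard 1 , row 0
# 2 -> shard 2 , row 0
# 3 -> shard 0 , row 1
# 4 -> shard 1 , row 1
# 5 -> shard 2 , row 1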


Answer 1


Yes, this function is hard to understand, until you get the point.

In its simplest form, it is similar to tf.gather. It returns the elements of params according to the indexes specified by ids.

For example (assuming you are inside a tf.InteractiveSession()):

params = tf.constant([10,20,30,40])
ids = tf.constant([0,1,2,3])
print(tf.nn.embedding_lookup(params, ids).eval())

would return [10 20 30 40], because the first element (index 0) of params is 10, the second element of params (index 1) is 20, etc.

Similarly,

params = tf.constant([10,20,30,40])
ids = tf.constant([1,1,3])
print(tf.nn.embedding_lookup(params, ids).eval())

would return [20 20 40].

But embedding_lookup is more than that. The params argument can be a list of tensors, rather than a single tensor.

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

In such a case, the indexes specified in ids correspond to elements of the tensors according to a partition strategy, where the default partition strategy is 'mod'.

In the 'mod' strategy, index 0 corresponds to the first element of the first tensor in the list. Index 1 corresponds to the first element of the second tensor. Index 2 corresponds to the first element of the third tensor, and so on. Simply put, index i corresponds to the first element of the (i+1)th tensor, for all indexes 0..(n-1), assuming params is a list of n tensors.

Now, index n cannot correspond to tensor n+1, because the list params contains only n tensors. So index n corresponds to the second element of the first tensor. Similarly, index n+1 corresponds to the second element of the second tensor, etc.

So, in the code

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

index 0 corresponds to the first element of the first tensor: 1

index 1 corresponds to the first element of the second tensor: 10

index 2 corresponds to the second element of the first tensor: 2

index 3 corresponds to the second element of the second tensor: 20

Thus, the result would be:

[ 2  1  2 10  2 20]
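For contrast, here is a hedged sketch of the same lookup under the 'div' strategy, where ids are assigned to the tensors contiguously instead (ids 0 and 1 to params1, ids 2 and 3 to params2); the comment shows the output I would expect under that rule.

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids, partition_strategy='div')
print(result.eval())  # expected: [10  1 10  2 10 20]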

Answer 2


Yes, the purpose of the tf.nn.embedding_lookup() function is to perform a lookup in the embedding matrix and return the embeddings (or, in simple terms, the vector representations) of words.

A simple embedding matrix (of shape vocabulary_size x embedding_dimension) would look like the one below (i.e. each word is represented by a vector of numbers; hence the name word2vec).


Embedding Matrix

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
like 0.36808 0.20834 -0.22319 0.046283 0.20098 0.27515 -0.77127 -0.76804
between 0.7503 0.71623 -0.27033 0.20059 -0.17008 0.68568 -0.061672 -0.054638
did 0.042523 -0.21172 0.044739 -0.19248 0.26224 0.0043991 -0.88195 0.55184
just 0.17698 0.065221 0.28548 -0.4243 0.7499 -0.14892 -0.66786 0.11788
national -1.1105 0.94945 -0.17078 0.93037 -0.2477 -0.70633 -0.8649 -0.56118
day 0.11626 0.53897 -0.39514 -0.26027 0.57706 -0.79198 -0.88374 0.30119
country -0.13531 0.15485 -0.07309 0.034013 -0.054457 -0.20541 -0.60086 -0.22407
under 0.13721 -0.295 -0.05916 -0.59235 0.02301 0.21884 -0.34254 -0.70213
such 0.61012 0.33512 -0.53499 0.36139 -0.39866 0.70627 -0.18699 -0.77246
second -0.29809 0.28069 0.087102 0.54455 0.70003 0.44778 -0.72565 0.62309 

I split the above embedding matrix and loaded only the words into vocab, which will be our vocabulary, and the corresponding vectors into the emb array.

import numpy as np

vocab = ['the','like','between','did','just','national','day','country','under','such','second']

emb = np.array([[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862],
   [0.36808, 0.20834, -0.22319, 0.046283, 0.20098, 0.27515, -0.77127, -0.76804],
   [0.7503, 0.71623, -0.27033, 0.20059, -0.17008, 0.68568, -0.061672, -0.054638],
   [0.042523, -0.21172, 0.044739, -0.19248, 0.26224, 0.0043991, -0.88195, 0.55184],
   [0.17698, 0.065221, 0.28548, -0.4243, 0.7499, -0.14892, -0.66786, 0.11788],
   [-1.1105, 0.94945, -0.17078, 0.93037, -0.2477, -0.70633, -0.8649, -0.56118],
   [0.11626, 0.53897, -0.39514, -0.26027, 0.57706, -0.79198, -0.88374, 0.30119],
   [-0.13531, 0.15485, -0.07309, 0.034013, -0.054457, -0.20541, -0.60086, -0.22407],
   [ 0.13721, -0.295, -0.05916, -0.59235, 0.02301, 0.21884, -0.34254, -0.70213],
   [ 0.61012, 0.33512, -0.53499, 0.36139, -0.39866, 0.70627, -0.18699, -0.77246 ],
   [ -0.29809, 0.28069, 0.087102, 0.54455, 0.70003, 0.44778, -0.72565, 0.62309 ]])


emb.shape
# (11, 8)

Embedding Lookup in TensorFlow

Now we will see how we can perform an embedding lookup for some arbitrary input sentence. The snippet below assumes an active tf.InteractiveSession(), so that .eval() works.

In [54]: from collections import OrderedDict

# embedding as TF tensor (for now constant; could be tf.Variable() during training)
In [55]: tf_embedding = tf.constant(emb, dtype=tf.float32)

# input for which we need the embedding
In [56]: input_str = "like the country"

# build index based on our `vocabulary`
In [57]: word_to_idx = OrderedDict({w:vocab.index(w) for w in input_str.split() if w in vocab})

# lookup in embedding matrix & return the vectors for the input words
In [58]: tf.nn.embedding_lookup(tf_embedding, list(word_to_idx.values())).eval()
Out[58]: 
array([[ 0.36807999,  0.20834   , -0.22318999,  0.046283  ,  0.20097999,
         0.27515   , -0.77126998, -0.76804   ],
       [ 0.41800001,  0.24968   , -0.41242   ,  0.1217    ,  0.34527001,
        -0.044457  , -0.49687999, -0.17862   ],
       [-0.13530999,  0.15485001, -0.07309   ,  0.034013  , -0.054457  ,
        -0.20541   , -0.60086   , -0.22407   ]], dtype=float32)

Observe how we got the embeddings from our original embedding matrix (with words) using the indices of words in our vocabulary.

Usually, such an embedding lookup is performed by the first layer (called the embedding layer), which then passes these embeddings to RNN/LSTM/GRU layers for further processing.


Side Note: Usually the vocabulary will also have a special unk token. So, if a token from our input sentence is not present in our vocabulary, then the index corresponding to unk will be looked up in the embedding matrix.
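As an illustration of that side note, a minimal sketch of unk handling (assuming, hypothetically, that one extra row has been appended to the embedding matrix for unk, so unk gets index len(vocab)):

unk_idx = len(vocab)  # hypothetical index of the extra `unk` row

def to_ids(sentence):
    # in-vocabulary words map to their index, everything else to `unk`
    return [vocab.index(w) if w in vocab else unk_idx for w in sentence.split()]

print(to_ids("like the wonderland"))  # "wonderland" is OOV, so -> [1, 0, 11]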


P.S. Note that embedding_dimension is a hyperparameter that one has to tune for their application, but popular models like Word2Vec and GloVe use 300-dimensional vectors to represent each word.

Bonus reading: the word2vec skip-gram model


Answer 3


Here's an image depicting the process of embedding lookup (figure not reproduced here).

Concisely, it gets the corresponding rows of an embedding layer, specified by a list of IDs, and provides them as a tensor. It is achieved through the following process (a runnable sketch follows the list).

  1. Define a placeholder: lookup_ids = tf.placeholder(tf.int32, shape=[None])
  2. Define an embedding layer: embeddings = tf.Variable(tf.random_uniform([100, 10]))
  3. Define the TensorFlow operation: embed_lookup = tf.nn.embedding_lookup(embeddings, lookup_ids)
  4. Get the results by running it: lookup = session.run(embed_lookup, feed_dict={lookup_ids: [95, 4, 14]})
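Putting those four steps together, a minimal runnable sketch (TF 1.x style, with a hypothetical 100 x 10 embedding) could look like this:

import tensorflow as tf

# 1. placeholder for the ids we want to look up
lookup_ids = tf.placeholder(tf.int32, shape=[None])

# 2. the embedding layer: 100 ids, 10 dimensions each (hypothetical sizes)
embeddings = tf.Variable(tf.random_uniform([100, 10], -1.0, 1.0))

# 3. the lookup op
embed_lookup = tf.nn.embedding_lookup(embeddings, lookup_ids)

# 4. run it
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    lookup = session.run(embed_lookup, feed_dict={lookup_ids: [95, 4, 14]})
    print(lookup.shape)  # (3, 10)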

Answer 4


When the params tensor has more than two dimensions, the ids only refer to its first (top) dimension. Maybe it's obvious to most people, but I had to run the following code to understand that:

embeddings = tf.constant([[[1,1],[2,2],[3,3],[4,4]],[[11,11],[12,12],[13,13],[14,14]],
                          [[21,21],[22,22],[23,23],[24,24]]])
ids=tf.constant([0,2,1])
embed = tf.nn.embedding_lookup(embeddings, ids, partition_strategy='div')

with tf.Session() as session:
    result = session.run(embed)
    print(result)

I just tried the 'div' strategy here; for a single tensor, it makes no difference.

Here is the output:

[[[ 1  1]
  [ 2  2]
  [ 3  3]
  [ 4  4]]

 [[21 21]
  [22 22]
  [23 23]
  [24 24]]

 [[11 11]
  [12 12]
  [13 13]
  [14 14]]]
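Continuing from the snippet above, a small sketch of the related point that, for a single params tensor, embedding_lookup behaves like indexing the first dimension (i.e. like tf.gather):

gathered = tf.gather(embeddings, ids)

with tf.Session() as session:
    looked_up, g = session.run([embed, gathered])
    print((looked_up == g).all())  # True: identical results for a single tensor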

Answer 5


Another way to look at it: assume that you flatten out the tensors into a one-dimensional array, and then perform a lookup on it.

(e.g.) Tensor0 = [1,2,3], Tensor1 = [4,5,6], Tensor2 = [7,8,9]

The flattened-out tensor will be as follows: [1,4,7,2,5,8,3,6,9]

Now when you do a lookup of [0,3,4,1,7] it will yield [1,2,5,4,6]

(i.e.) if the lookup value is 7, for example, and we have 3 tensors (or a tensor with 3 rows), then:

7 / 3: (remainder is 1, quotient is 2), so the element at index 2 of Tensor1 will be shown, which is 6.
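A small sketch (TF 1.x) that should reproduce the flattened-view result above under the default 'mod' strategy:

import tensorflow as tf

t0 = tf.constant([1, 2, 3])
t1 = tf.constant([4, 5, 6])
t2 = tf.constant([7, 8, 9])

ids = tf.constant([0, 3, 4, 1, 7])
result = tf.nn.embedding_lookup([t0, t1, t2], ids)  # default partition_strategy='mod'

with tf.Session() as sess:
    print(sess.run(result))  # expected: [1 2 5 4 6]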


Answer 6


Since I was also intrigued by this function, I’ll give my two cents.

The way I see it in the 2D case is just as a matrix multiplication (it’s easy to generalize to other dimensions).

Consider a vocabulary with N symbols. Then, you can represent a symbol x as a vector of dimensions Nx1, one-hot-encoded.

But you want a representation of this symbol not as a vector of Nx1, but as one with dimensions Mx1, called y.

So, to transform x into y, you can use an embedding matrix E, with dimensions MxN:

y = E x.

This is essentially what tf.nn.embedding_lookup(params, ids, ...) is doing, with the nuance that each id is just a number that represents the position of the 1 in the one-hot-encoded vector x.
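A minimal NumPy sketch of that matrix-multiplication view (hypothetical sizes N = 5, M = 3); note that with params stored the usual way as an N x M matrix (vocabulary_size x embedding_dimension), selecting column idx of E is the same as selecting row idx of params = E transposed.

import numpy as np

N, M = 5, 3                        # vocabulary size, embedding dimension (hypothetical)
E = np.random.random([M, N])       # embedding matrix E, with dimensions M x N

idx = 2                            # the "id" passed to embedding_lookup
x = np.zeros([N, 1]); x[idx] = 1   # one-hot-encoded vector for that id

y = E.dot(x)                       # the matrix-multiplication view: y = E x
print(np.allclose(y.ravel(), E[:, idx]))  # True: same as just selecting column idx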


Answer 7


Adding to Asher Stern's answer, params is interpreted as a partitioning of a large embedding tensor. It can be a single tensor representing the complete embedding tensor, or a list of tensors, all of the same shape except for the first dimension, representing sharded embedding tensors.

The function tf.nn.embedding_lookup is written with the expectation that the embedding (params) will be large; that is why we need a partition_strategy.
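As a hedged sketch of that sharding idea (TF 1.x, hypothetical sizes): a full 6 x 4 embedding split into two variables, with embedding_lookup resolving ids across the shards using the default 'mod' strategy.

import tensorflow as tf

shard_a = tf.Variable(tf.random_uniform([3, 4]))  # holds ids 0, 2, 4 under 'mod'
shard_b = tf.Variable(tf.random_uniform([3, 4]))  # holds ids 1, 3, 5 under 'mod'

ids = tf.constant([5, 0, 3])
looked_up = tf.nn.embedding_lookup([shard_a, shard_b], ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(looked_up).shape)  # (3, 4)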