在TensorFlow中使用预训练的单词嵌入（word2vec或Glove）

Question 1

I’ve recently reviewed an interesting implementation for convolutional text classification. However all TensorFlow code I’ve reviewed uses a random (not pre-trained) embedding vectors like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?

Question 2

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let’s say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

Simply create W as a tf.constant() that takes embedding as its value:
```
W = tf.constant(embedding, name="W")
```
This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.
Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():
```
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")

embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

# ...
sess = tf.Session()

sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
```
This avoid storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I’ve assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.
If the embedding was trained as part of another TensorFlow model, you can use a to load the value from the other model’s checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:
```
W = tf.Variable(...)

embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})

# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
```

Question 3

I use this method to load and share embedding.

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

Question 4

The answer of @mrry is not right because it provoques the overwriting of the embeddings weights each the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, on my point of view the right way to pre-trained embeddings is:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix))

Question 5

2.0 Compatible Answer: There are many Pre-Trained Embeddings, which are developed by Google and which have been Open Sourced.

Some of them are Universal Sentence Encoder (USE), ELMO, BERT, etc.. and it is very easy to reuse them in your code.

Code to reuse the Pre-Trained Embedding, Universal Sentence Encoder is shown below:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  #(3,128)

For more information the Pre-Trained Embeddings developed and open-sourced by Google, refer TF Hub Link.

Question 6

With tensorflow version 2 its quite easy if you use the Embedding layer

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
                            embeddings_initializer=matrix_of_pretrained_weights
                            )(ur_inp)

Question 7

I was also facing embedding issue, So i wrote detailed tutorial with dataset. Here I would like to add what I tried You can also try this method,

import tensorflow as tf

tf.reset_default_graph()

input_x=tf.placeholder(tf.int32,shape=[None,None])

#you have to edit shape according to your embedding size


Word_embedding = tf.get_variable(name="W", shape=[400000,100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_loopup= tf.nn.embedding_lookup(Word_embedding,input_x)

with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for ii in final_:
            print(sess.run(embedding_loopup,feed_dict={input_x:[ii]}))

Here is working detailed Tutorial Ipython example if you want to understand from scratch , take a look .

在TensorFlow中使用预训练的单词嵌入（word2vec或Glove）

问题：在TensorFlow中使用预训练的单词嵌入（word2vec或Glove）

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

Python 流程图 — 一键转化代码为流程图

7行代码 Python热力图可视化分析缺失数据处理

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

一行命令实现功能？Google工程师是这样做的

如何验证一个列表是否是另一个列表的子集？

在Mac上安装python模块的最兼容方法是什么？

在pytest中，conftest.py文件有什么用？

列表推导中的Lambda函数

Python-如何在python中验证网址？（格式是否正确）

在TensorFlow中使用预训练的单词嵌入（word2vec或Glove）

问题：在TensorFlow中使用预训练的单词嵌入（word2vec或Glove）

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

相关文章

排行榜展示

文章展示