How to calculate sentence similarity using gensim's word2vec model and Python

Question: How to calculate sentence similarity using gensim's word2vec model and Python

According to the Gensim Word2Vec documentation, I can use the word2vec model in the gensim package to calculate the similarity between two words.

e.g.

trained_model.similarity('woman', 'man') 
0.73723527

However, the word2vec model cannot predict sentence similarity. I found the LSI model in gensim, which supports sentence similarity, but it does not seem that it can be combined with the word2vec model. Each sentence in my corpus is short (fewer than 10 words). So, is there a simple way to achieve this?


Answer 0

This is actually a pretty challenging problem that you are asking about. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. “he walked to the store yesterday” and “yesterday, he walked to the store”), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurrences / relationships in lots of real textual examples, etc.

The simplest thing you could try — though I don’t know how well this would perform and it would certainly not give you the optimal results — would be to first remove all “stop” words (words like “the”, “an”, etc. that don’t add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you’ll at least not be subject to word order. That being said, this will fail in lots of ways and isn’t a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
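The approach just described can be sketched roughly as follows. This is only an illustration: the three-dimensional vectors, the stop-word set, and the function names are all made up here so that the snippet runs without a trained model, and the "difference" is taken to be the Euclidean length of the difference vector, which is one possible reading of the description.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "he": np.array([0.1, 0.3, 0.5]),
    "walked": np.array([0.7, 0.2, 0.1]),
    "store": np.array([0.2, 0.9, 0.4]),
    "yesterday": np.array([0.5, 0.5, 0.0]),
}
STOP_WORDS = {"the", "to", "a", "an"}  # illustrative, not a complete list


def sentence_sum(sentence, vectors, stop_words):
    """Sum the vectors of all known, non-stop words in the sentence."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    for token in sentence.lower().replace(",", " ").split():
        if token not in stop_words and token in vectors:
            total += vectors[token]
    return total


def sum_difference(s1, s2, vectors=TOY_VECTORS, stop_words=STOP_WORDS):
    """Euclidean length of the difference between the two summed vectors."""
    diff = sentence_sum(s1, vectors, stop_words) - sentence_sum(s2, vectors, stop_words)
    return float(np.linalg.norm(diff))


# Same words in a different order: the difference is zero up to rounding.
print(sum_difference("he walked to the store yesterday",
                     "yesterday, he walked to the store"))
```

Because the vectors are summed, the two reordered sentences come out as practically identical, which is the word-order independence mentioned above.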

So, short answer is, no, there’s no easy way to do this (at least not to do it well).


Answer 1

Since you’re using gensim, you should probably use its doc2vec implementation. doc2vec is an extension of word2vec to the phrase, sentence, and document level. It’s a pretty simple extension, described here:

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it’s intuitive, fast, and flexible. What’s great is that you can grab the pretrained word embeddings from the official word2vec page and the syn0 layer of gensim’s Doc2Vec model is exposed so that you can seed the word embeddings with these high quality vectors!

GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)

I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.

There exist other sentence-to-vector techniques besides the one proposed in Le & Mikolov’s paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionality, in which the semantics of a sentence come from:

1. semantics of the words

2. rules for how these words interact and combine into phrases

They’ve proposed a few such models (getting increasingly complex) for how to use compositionality to build sentence-level representations.

2011 – unfolding recursive autoencoder (comparatively simple; start here if interested)

2012 – matrix-vector neural network

2013 – neural tensor network

2015 – Tree LSTM

Socher’s papers are all available at socher.org. Some of these models are available, but I’d still recommend gensim’s doc2vec. For one, the 2011 URAE isn’t particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can’t swap in different word vectors, so you’re stuck with 2011’s pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec’s or GloVe’s.

I haven’t worked with the Tree LSTM yet, but it seems very promising!

tl;dr Yeah, use gensim’s doc2vec. But other methods do exist!


Answer 2

If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:

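import numpy as np
from scipy import spatial

# Vocabulary as a set for fast membership tests.
# (gensim >= 4.0 renamed index2word to index_to_key.)
index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    # Average the vectors of all in-vocabulary words in the sentence.
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            # model.wv[word] rather than model[word], which gensim 4.0 removed
            feature_vec = np.add(feature_vec, model.wv[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec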

Calculate similarity:

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

> 0.915479828613

Answer 3

You can use the Word Mover’s Distance (WMD) algorithm. Here is an easy description of WMD.

import gensim

# Load a word2vec model; the GoogleNews vectors are used here
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

# Two sample sentences, tokenized: wmdistance expects lists of tokens,
# and a plain string would be treated as a sequence of characters
s1 = 'the first sentence'.split()
s2 = 'the second text'.split()

# Calculate the distance between the two sentences using the WMD algorithm
distance = model.wmdistance(s1, s2)

print('distance = %.3f' % distance)

P.S.: If you get an error about importing the pyemd library, you can install it with the following command:

pip install pyemd

Answer 4

Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors normalized. Thus, the word count is not a factor.
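To make the point about word count concrete, here is a small sketch with made-up summed vectors (the numbers and the helper name `cosine_similarity` are illustrative only). Scaling one sum, as a longer sentence might, leaves the cosine unchanged while the plain difference grows.

```python
import numpy as np


def cosine_similarity(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for sentences with no known words
    return float(np.dot(u, v) / (norm_u * norm_v))


# Hypothetical summed word vectors for two sentences.
sum_short = np.array([1.0, 2.0, 3.0])
sum_long = 4 * sum_short  # same direction, as if the sentence had 4x the words

print(cosine_similarity(sum_short, sum_long))   # close to 1.0
print(np.linalg.norm(sum_short - sum_long))     # clearly non-zero
```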


Answer 5

There is a function in the documentation that takes two lists of words and compares their similarity.

s1 = 'This room is dirty'
s2 = 'dirty and disgusting room'

# n_similarity returns a similarity (higher means more alike), not a distance
similarity = model.wv.n_similarity(s1.lower().split(), s2.lower().split())

Answer 6

I would like to update the existing solutions to help people who want to calculate the semantic similarity of sentences.

Step 1:

Load a suitable model using gensim, calculate the word vectors for the words in the sentence, and store them as a word list.

Step 2: Computing the sentence vector

Calculating the semantic similarity between sentences used to be difficult, but the paper “A Simple but Tough-to-Beat Baseline for Sentence Embeddings” suggests a simple approach: compute the weighted average of the word vectors in each sentence, then remove the projections of the average vectors on their first principal component. Here the weight of a word w is a/(a + p(w)), where a is a parameter and p(w) is the (estimated) word frequency; this weighting is called smooth inverse frequency (SIF). This method performs significantly better.

Simple code to calculate the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here.

Step 3: Using sklearn’s cosine_similarity, load the two sentence vectors and compute the similarity.

This is the simplest and most efficient method to compute sentence similarity.
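The SIF procedure in Step 2 can be sketched along these lines. This is not the code the answer links to. It is a minimal illustration in which the toy word vectors, the word probabilities, the value of a, and the function name `sif_embeddings` are all assumptions. Averaging over the known words only, rather than over the full sentence length, is also a simplification made here.

```python
import numpy as np


def sif_embeddings(sentences, vectors, word_prob, a=1e-3):
    """SIF sentence vectors with the first principal component removed.

    `sentences` is a list of token lists, `vectors` maps word -> vector,
    and `word_prob` maps word -> estimated unigram probability p(w).
    """
    dim = len(next(iter(vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        known = [t for t in tokens if t in vectors]
        if not known:
            continue  # leave an all-zero row for sentences with no known words
        weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in known])
        vecs = np.array([vectors[t] for t in known])
        emb[i] = weights @ vecs / len(known)  # weighted average
    # First right singular vector of the sentence matrix (not mean-centered).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    # Subtract each sentence vector's projection onto that direction.
    return emb - np.outer(emb @ pc, pc)


# Hypothetical toy data standing in for a real model and corpus statistics.
toy_vectors = {
    "this": np.array([0.4, 0.4, 0.6]),
    "is": np.array([0.5, 0.5, 0.5]),
    "room": np.array([0.9, 0.1, 0.3]),
    "dirty": np.array([0.2, 0.8, 0.5]),
    "disgusting": np.array([0.3, 0.7, 0.6]),
    "clean": np.array([0.1, -0.6, 0.4]),
}
toy_prob = {"this": 0.01, "is": 0.02, "room": 0.001,
            "dirty": 0.0005, "disgusting": 0.0002, "clean": 0.0005}
sentences = [
    "this room is dirty".split(),
    "dirty and disgusting room".split(),
    "this room is clean".split(),
]

embeddings = sif_embeddings(sentences, toy_vectors, toy_prob)
print(embeddings.shape)  # (3, 3)
```

The rows of `embeddings` can then be compared with any cosine similarity function, as in Step 3. Frequent words such as "is" get a small weight because p(w) is large relative to a.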


Answer 7

I am using the following method and it works well. You first need to run a POS tagger and then filter your sentence to get rid of the stop words (determiners, conjunctions, …). I recommend the TextBlob APTagger. Then you build a sentence vector by taking the mean of the word vectors in the sentence. The n_similarity method in gensim’s word2vec does exactly that by letting you pass two sets of words to compare.


Answer 8

There are extensions of Word2Vec intended to solve the problem of comparing longer pieces of text like phrases or sentences. One of them is paragraph2vec or doc2vec.

“Distributed Representations of Sentences and Documents” http://cs.stanford.edu/~quocle/paragraph_vector.pdf

http://rare-technologies.com/doc2vec-tutorial/


Answer 9

Gensim implements a model called Doc2Vec for paragraph embedding.

There are different tutorials presented as IPython notebooks:

Another method would rely on Word2Vec and Word Mover’s Distance (WMD), as shown in this tutorial:

An alternative solution would be to rely on average vectors:

from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess

def tidy_sentence(sentence, vocabulary):
    # Tokenize and lowercase, keeping only in-vocabulary words
    return [word for word in simple_preprocess(sentence) if word in vocabulary]

def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
    # index2word is the gensim 3.x name; gensim >= 4.0 calls it index_to_key
    vocabulary = set(model_wv.index2word)
    tokens_1 = tidy_sentence(sentence_1, vocabulary)
    tokens_2 = tidy_sentence(sentence_2, vocabulary)
    # n_similarity returns the cosine similarity between the two token sets
    return model_wv.n_similarity(tokens_1, tokens_2)

wv = KeyedVectors.load('model.wv', mmap='r')
sim = compute_sentence_similarity('this is a sentence', 'this is also a sentence', wv)
print(sim)

Finally, if you can run Tensorflow, you may try: https://tfhub.dev/google/universal-sentence-encoder/2


Answer 10

I have tried the methods provided by the previous answers. They work, but the main drawback is that the longer the sentences, the larger the similarity will be (to calculate the similarity I use the cosine score of the two mean embeddings of any two sentences), since the more words there are, the more positive semantic effects are added to the sentence.

I thought I should change my mind and use sentence embeddings instead, as studied in this paper and this one.


Answer 11

The Facebook Research group released a new solution called InferSent. Results and code are published on GitHub; check their repo. It is pretty awesome, and I am planning to use it. https://github.com/facebookresearch/InferSent

Their paper: https://arxiv.org/abs/1705.02364

Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.


Answer 12

If you are not using Word2Vec, there are other models that use BERT to produce sentence embeddings. Reference link: https://github.com/UKPLab/sentence-transformers

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import scipy.spatial

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']
query_embeddings = embedder.encode(queries)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))

Another link to follow: https://github.com/hanxiao/bert-as-service