Python Tutorial: A Simple Implementation of N-Gram, tf-idf, and Cosine Similarity in Python - Python实用宝典


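I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity.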


Is there any program that can do this, or should I start writing it from scratch?

Answer

Check out the NLTK package: http://www.nltk.org. It has everything you need.

cosine_similarity:

    import math

    import numpy


    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is
        equal to u.v / |u||v|.
        """
        # Despite the name, the returned value is the cosine similarity.
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
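Documents turned into n-gram weights are usually sparse, so the same formula can also be applied to dictionaries instead of dense NumPy arrays. The following is a minimal sketch using only the standard library; the name `cosine_similarity_sparse` and the choice to return 0.0 for an empty vector are assumptions made for illustration, not part of NLTK.

```python
import math


def cosine_similarity_sparse(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity_sparse({"a": 1.0, "b": 2.0}, {"a": 1.0, "b": 2.0})
disjoint = cosine_similarity_sparse({"a": 1.0}, {"b": 1.0})
```

With non-negative weights such as term counts or tf-idf values, the result stays between 0 and 1, which matches the score range asked for in the question.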

ngrams:

    from itertools import chain


    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:

        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

        Use ingram for an iterator version of this function. Set pad_left
        or pad_right to true in order to get additional ngrams:

        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
        # Optionally pad either end with (n-1) copies of pad_symbol.
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)

        # Slide a window of width n across the sequence.
        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)]
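To get from n-grams to something that can be compared, each document is typically reduced to a table of n-gram counts. Here is a small sketch of that step; `simple_ngrams` is a hypothetical, stripped-down stand-in for the function above (no padding), and splitting on whitespace is a naive tokenization assumed only for the example.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    count = max(0, len(tokens) - n + 1)
    return [tuple(tokens[i:i + n]) for i in range(count)]


text = "the quick brown fox jumps over the quick dog"
tokens = text.lower().split()  # naive whitespace tokenization (assumption)

# Map each bigram to the number of times it occurs in the document.
bigram_counts = Counter(simple_ngrams(tokens, 2))
```

Changing the second argument selects how many grams to use; a value larger than the number of tokens simply yields an empty list.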

For tf-idf, you have to compute the distribution first. I use Lucene to do that, but you can do something similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
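If neither Lucene nor NLTK is available, the whole pipeline can be sketched with the standard library alone. The code below is one possible from-scratch version, not the Lucene or NLTK computation: the helper names are hypothetical, tokenization is a plain whitespace split, and the idf uses a smoothed form, log((1 + N) / (1 + df)) + 1, chosen here so that no term gets a zero or undefined weight.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count the word n-grams of a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    count = max(0, len(tokens) - n + 1)
    return Counter(tuple(tokens[i:i + n]) for i in range(count))


def tfidf_vectors(docs, n=1):
    """Return one {ngram: tf-idf weight} dict per document."""
    counts = [ngram_counts(doc, n) for doc in docs]
    num_docs = len(docs)
    # Document frequency: in how many documents each n-gram appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            # tf is the relative frequency; idf is smoothed (assumption).
            term: (freq / total) * (math.log((1 + num_docs) / (1 + df[term])) + 1)
            for term, freq in c.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs, n=1)
identical = cosine(vecs[0], vecs[1])   # same text, so close to 1.0
partial = cosine(vecs[0], vecs[2])     # some shared words
unrelated = cosine(vecs[0], vecs[3])   # no shared words
```

Because every weight is non-negative, the score always falls between 0 and 1. Passing `n=2` or higher switches the comparison from single words to longer n-grams.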

If you like PyLucene, this will show you how to compute tf.idf:

    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t,f,x) in zip(terms,frequencies,xrange(maxtokensperdoc)):
                df= searcher.docFreq(Term(fieldname, t)) # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc) #compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k,v in rec.items():
                    rec[k] = v / denom
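The cosine normalization at the end of that snippet divides every weight by the vector's Euclidean length. Once two vectors have been normalized this way, their cosine similarity is just their dot product. A standalone sketch of that idea follows; `l2_normalize` and `dot` are hypothetical helper names and the sample weights are made up for illustration.

```python
import math


def l2_normalize(weights):
    """Scale a {term: weight} dict to unit Euclidean length."""
    denom = math.sqrt(sum(v * v for v in weights.values()))
    if denom == 0:
        return dict(weights)  # nothing to scale in an all-zero vector
    return {k: v / denom for k, v in weights.items()}


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


# Made-up weights standing in for the rec dict built above.
rec_a = l2_normalize({"cat": 3.0, "mat": 4.0})
rec_b = l2_normalize({"cat": 1.0, "dog": 1.0})

# For unit-length vectors the dot product equals the cosine similarity.
score = dot(rec_a, rec_b)
```

Normalizing once at indexing time means each later comparison needs only the dot product.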
