# Python Tutorial: A Simple Implementation of N-Gram, tf-idf, and Cosine Similarity in Python

cosine_similarity:

```
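import math

import numpy


def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|. Despite the name, the result is a similarity (1.0 means the
    vectors point the same way), not a distance.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))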
```
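tf-idf records are often kept as sparse `{term: weight}` dictionaries rather than dense arrays. The following is a minimal sketch of the same formula for that case, using only the standard library. The function name `cosine_similarity_sparse`, the sample documents, and the choice to return 0.0 for an all-zero vector are assumptions made for illustration; they are not part of the original snippet.

```python
import math


def cosine_similarity_sparse(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical term weights for two small documents.
doc_a = {"python": 2.0, "ngram": 1.0}
doc_b = {"python": 1.0, "cosine": 3.0}
similarity = cosine_similarity_sparse(doc_a, doc_b)
```

Here only `"python"` is shared, so the dot product is 2.0 and the result is 2 / (sqrt(5) * sqrt(10)).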

ngrams:

```
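from itertools import chain


# Signature inferred from the docstring (pad_left, pad_right, pad_symbol).
def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    # Pad with n-1 symbols only on the side(s) that were requested.
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    # Number of complete windows of length n (0 if the sequence is too short).
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]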
```
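The docstring above mentions an iterator version of the function. As a rough sketch of what a lazy variant could look like, the generator below slides a fixed-size window over any iterable. The name `iter_ngrams` is hypothetical (it is not the library's `ingram`), padding is left out for brevity, and `n >= 1` is assumed.

```python
from collections import deque
from itertools import islice


def iter_ngrams(sequence, n):
    """Lazily yield n-grams (as tuples) from any iterable, without padding."""
    it = iter(sequence)
    # Fill the first window; maxlen makes the deque drop old items on append.
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)


trigrams = list(iter_ngrams([1, 2, 3, 4, 5], 3))
char_bigrams = list(iter_ngrams("text", 2))
```

Because it never builds the full list, this form also works on generators and other one-pass inputs; a sequence shorter than `n` simply yields nothing.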

<http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term>

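```
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
# This fragment relies on PyLucene objects and settings defined elsewhere
# (not shown here): searcher, sim, Term, fieldname, docs, max_doc,
# maxtokensperdoc, tmap, cosine_normalization, and tfv (the term frequency
# vector of document i).
for i in range(docs):
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        # Zipping with range(maxtokensperdoc) caps the number of terms per document.
        for (t, f, x) in zip(terms, frequencies, range(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))  # assign each new term the next integer id
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([w**2 for w in rec.values()])**0.5
            if denom:  # skip all-zero records to avoid dividing by zero
                for k, v in rec.items():
                    rec[k] = v / denom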
```
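The fragment above depends on a Lucene index, so it cannot run on its own. As a self-contained sketch of the same two steps (weight each term by tf times idf, then apply cosine normalization), the version below works on plain token lists. The function name `tfidf_vectors`, the toy corpus, and the weighting `count * (log(N / df) + 1)` are assumptions chosen for illustration; they are not the formulas behind `sim.tf` and `sim.idf`.

```python
import math
from collections import Counter


def tfidf_vectors(docs, cosine_normalization=True):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        # Assumed weighting: raw count times (log(N / df) + 1).
        rec = {t: f * (math.log(n_docs / df[t]) + 1.0) for t, f in tf.items()}
        if cosine_normalization:
            # Scale the record to unit Euclidean length.
            denom = math.sqrt(sum(w * w for w in rec.values()))
            if denom:
                rec = {t: w / denom for t, w in rec.items()}
        vectors.append(rec)
    return vectors


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["a", "bird", "flew"],
]
vectors = tfidf_vectors(corpus)
```

After normalization every record has length 1, so the cosine similarity of two documents reduces to the dot product of their records. In the first document, `"cat"` (found in one document) ends up with a larger weight than `"the"` (found in two).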

​Python实用宝典 (pythondict.com)