Tag archive: nltk

n-grams in Python, four, five, six grams?

Question: n-grams in Python, four, five, six grams?

I’m looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams?

Thanks!


Answer 0

Great native Python-based answers have been given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module that people seldom use in nltk. It's not because it's hard to read ngrams, but training a model based on ngrams where n > 3 will result in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print grams
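
If you also want windows that run over the sentence boundaries, ngrams accepts padding keywords in NLTK 3.x. The sketch below assumes that signature; the pad symbols '<s>' and '</s>' are arbitrary choices of mine, not anything NLTK requires.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

# pad both ends so edge words also appear in full-width windows
padded_sixgrams = ngrams(sentence.split(), 6,
                         pad_left=True, pad_right=True,
                         left_pad_symbol='<s>', right_pad_symbol='</s>')
for grams in padded_sixgrams:
    print(grams)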

Answer 1

I’m surprised that this hasn’t shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]

In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
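
The transcript above is Python 2 (xrange, print statement); the same idea in Python 3, as a sketch:

sentence = "I really like python, it's pretty awesome.".split()
N = 4
grams = [sentence[i:i + N] for i in range(len(sentence) - N + 1)]
for gram in grams:
    print(gram)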

Answer 2

Using only nltk tools

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

Example output

get_ngrams('This is the simplest text i could think of', 3 )

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams as tuples rather than joined strings, just remove the ' '.join.


Answer 3

Here is another simple way to do n-grams:

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Answer 4

People have already answered pretty nicely for the scenario where you need bigrams or trigrams, but if you need every n-gram of the sentence, you can use nltk.util.everygrams:

>>> from nltk.util import everygrams

>>> message = "who let the dogs out"

>>> msg_split = message.split()

>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

In case you have a limit, like for trigrams where the maximum length should be 3, you can use the max_len parameter to specify it.

>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can just modify the max_len parameter to achieve whatever gram you need, i.e. four-gram, five-gram, six-gram or even hundred-gram.

The previously mentioned solutions can be modified to implement this, but this solution is much more straightforward.

For further reading click here

And when you just need a specific gram, like bigram or trigram, you can use nltk.util.ngrams as mentioned in M.A.Hassan's answer.
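
For completeness, here is a small sketch that asks only for a band of sizes, assuming a recent NLTK where everygrams takes min_len alongside max_len:

from nltk.util import everygrams

msg_split = "who let the dogs out".split()

# only the 2-grams, 3-grams and 4-grams
print(list(everygrams(msg_split, min_len=2, max_len=4)))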


Answer 5

You can easily whip up your own function to do this using itertools:

from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
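
The snippet is Python 2 (itertools.izip was removed in Python 3); a sketch of the same recipe for Python 3:

from itertools import islice, tee

s = 'spam and eggs'
N = 3
# zip is already lazy in Python 3, so itertools.izip is no longer needed
trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
print(list(trigrams))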

Answer 6

A more elegant approach is to build bigrams with Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once normally and once offset by one element.

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
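
As a usage sketch for the general find_ngrams helper above (not part of the original answer), the first few trigrams of the same string would be:

# uses find_ngrams defined above; list() is needed on Python 3 because zip is lazy there
print(list(find_ngrams(string, 3)))
# [('I', 'really', 'like'), ('really', 'like', 'python,'), ...]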

Answer 7

I have never dealt with nltk but did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D will give you the histogram of your N-grams.

D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts) - N + 1): # N-grams (+1 so the final n-gram is included)
    try:
        D[tuple(strparts[i:i+N])] += 1
    except KeyError:
        D[tuple(strparts[i:i+N])] = 1
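
A more idiomatic sketch of the same frequency count uses collections.Counter (the example string and N here are made up for illustration):

from collections import Counter

N = 3
string = 'whatever string whatever string whatever'
strparts = string.split()
D = Counter(tuple(strparts[i:i + N]) for i in range(len(strparts) - N + 1))
print(D.most_common())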

Answer 8

For four-grams it is already in NLTK; here is a piece of code that can help you with this:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq

I hope it helps.


Answer 9

You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set ngram_size to any positive integer, i.e. you can split a text into four-grams, five-grams or even hundred-grams.
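
On newer scikit-learn releases the feature-name accessor has been renamed; a sketch of the same example under that assumption:

from sklearn.feature_extraction.text import CountVectorizer

ngram_size = 4
docs = ["I really like python, it's pretty awesome."]
vect = CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(docs)
# newer scikit-learn replaces get_feature_names() with get_feature_names_out()
print('{1}-grams: {0}'.format(vect.get_feature_names_out(), ngram_size))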


Answer 10

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred, as you say), but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an itirator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.


Answer 11

NLTK is great, but sometimes it is overhead for some projects:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

Example use:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

Answer 12

You can get all 4- to 6-grams using the code below, without any other packages:

from itertools import chain

def get_m_2_ngrams(input_list, min, max):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

The output is below:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

You can find more detail on this blog.


Answer 13

After about seven years, here’s a more elegant answer using collections.deque:

import collections
import itertools

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

Output:

In [15]: ngrams(words, 3)                                                                                                                                                                                                                     
I am become
am become death,
become death, the
death, the destroyer
the destroyer of

In [16]: ngrams(words, 4)                                                                                                                                                                                                                     
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of

In [17]: ngrams(words, 1)                                                                                                                                                                                                                     
I
am
become
death,
the
destroyer
of

In [18]: ngrams(words, 2)                                                                                                                                                                                                                     
I am
am become
become death,
death, the
the destroyer
destroyer of

Answer 14

If you want a pure iterator solution for large strings with constant memory usage:

from typing import Iterable
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']

How to check if a word is an English word with Python?

Question: How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary.

I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?


Answer 0

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There’s a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I’ve no idea whether it’s any good.
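
To cover the plural-form part of the question, here is a hedged sketch that combines enchant with inflect's singular_noun; both the combination and the expected output are my own assumptions, not something from the answer:

import enchant
import inflect

d = enchant.Dict("en_US")
p = inflect.engine()

def is_english_word(word):
    if d.check(word):
        return True
    # singular_noun() returns False when the word is not a recognised plural
    singular = p.singular_noun(word)
    return bool(singular) and d.check(singular)

print(is_english_word("properties"))  # expected True, via "property"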


Answer 1

It won’t work well with WordNet, because WordNet does not contain all English words. Another possibility based on NLTK, without enchant, is NLTK’s words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True

Answer 2

Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # not an English word
else:
    pass  # an English word

You should refer to this article if you have trouble installing wordnet or want to try other approaches.
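
A sketch wrapping the check above in the is_english_word function from the question (requires the WordNet corpus to be downloaded; as noted in the previous answer, WordNet does not cover every English word, e.g. many function words):

from nltk.corpus import wordnet

def is_english_word(word):
    return len(wordnet.synsets(word)) > 0

print(is_english_word("ham"))     # True
print(is_english_word("qwzxjk"))  # False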


Answer 3

Using a set to store the word list because looking them up will be faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print is_english_word("ham")  # should be true if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I’d just include the plurals in the word list to begin with.

As to where to find English word lists, I found several just by Googling “English word list”. Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.


Answer 4

For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words
def is_english_word(word):
    # creation of this dictionary would be done outside of 
    #     the function because you only need to do it once.
    dictionary = dict.fromkeys(nltk_words.words(), None)
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False
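
A minimal sketch of the same idea with the lookup structure built once at module level, as the comment suggests (assumes nltk.download('words') has been run; a plain set gives the same O(1) lookup as dict.fromkeys):

from nltk.corpus import words as nltk_words

# built once, at import time
english_words = set(nltk_words.words())

def is_english_word(word):
    return word in english_words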

Answer 5

I find that there are 3 package-based solutions to this problem: pyenchant, wordnet and a corpus (self-defined or from nltk). Pyenchant couldn’t be installed easily on win64 with py3. Wordnet doesn’t work very well because its corpus isn’t complete. So for me, I chose the solution answered by @Sadik, and used set(words.words()) to speed it up.

First:

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
>>True

Answer 6

With pyEnchant.checker SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

Answer 7

For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and return the results in JSON format, then parse them using Python’s json module. If it’s not an English word you’ll get no results.

As another idea, you could query Wiktionary’s API.
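
A rough sketch of the Wiktionary idea via the generic MediaWiki query API (the endpoint, parameters and response shape are assumptions based on that API, not something given in the answer; a real client should also set a descriptive User-Agent):

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    params = urllib.parse.urlencode({"action": "query", "titles": word, "format": "json"})
    url = "https://en.wiktionary.org/w/api.php?" + params
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # the MediaWiki API marks non-existent pages with a "missing" key
    pages = data["query"]["pages"]
    return not any("missing" in page for page in pages.values())

print(in_wiktionary("hello"))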


Answer 8

对于所有Linux / Unix用户

如果您的操作系统使用Linux内核,则有一种简单的方法可以从英语/美国词典中获取所有单词。在目录中,/usr/share/dict您有一个words文件。还有一个更具体american-englishbritish-english文件。这些包含该特定语言的所有单词。您可以通过每种编程语言来访问它,这就是为什么我认为您可能想了解这一点的原因。

现在,对于特定于python的用户,下面的python代码应该将列表单词分配为具有每个单词的值:

import re
file = open("/usr/share/dict/words", "r")
words = re.sub("[^\w]", " ",  file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts") ## Returns true
is_word("jwiefjiojrfiorj") ## Returns False

希望这可以帮助!!!

For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from every programming language, which is why I thought you might want to know about it.

Now, for python specific users, the python code below should assign the list words to have the value of every single word:

import re
file = open("/usr/share/dict/words", "r")
words = re.sub("[^\w]", " ",  file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts") ## Returns true
is_word("jwiefjiojrfiorj") ## Returns False

Hope this helps!!!


What is the difference between lemmatization and stemming?

Question: What is the difference between lemmatization and stemming?

When do I use each?

Also…is the NLTK lemmatization dependent upon Parts of Speech? Wouldn’t it be more accurate if it was?


Answer 0

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

From the NLTK docs:

Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.


Answer 1

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

  1. The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

  2. The word “walk” is the base form for word “walking”, and hence this is matched in both stemming and lemmatisation.

  3. The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context, e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.

Source: https://en.wikipedia.org/wiki/Lemmatisation


Answer 2

There are two aspects to show their differences:

  1. A stemmer will return the stem of a word, which needn’t be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root, while lemmatisation will return the dictionary form of a word, which must be a valid word.

  2. In lemmatisation, the part of speech of a word should first be determined, and the normalisation rules are different for different parts of speech, while the stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech.

Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization


Answer 3

The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general “term conflation” procedures, which may also address lexico-semantic, syntactic, or orthographic variations.

The real difference between stemming and lemmatization is threefold:

  1. Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;

  2. Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;

  3. In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexicon. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.

Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.


Answer 4

As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.

As for when you would use one or the other, it’s a matter of how much your application depends on getting the meaning of a word in context correct. If you’re doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you’re doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.

As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it “dove” and “v” yields “dive” while “dove” and “n” yields “dove”.
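
A minimal sketch of that call (the expected outputs are the ones stated in this answer; the WordNet corpus must be downloaded first):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dove", pos="v"))  # 'dive', per the answer above
print(lemmatizer.lemmatize("dove", pos="n"))  # 'dove'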


Answer 5

An example-driven explanation of the differences between lemmatization and stemming:

Lemmatization handles matching “car” to “cars” along with matching “car” to “automobile”.

Stemming handles matching “car” to “cars” .

Lemmatization implies a broader scope of fuzzy word matching that is still handled by the same subsystems. It implies certain techniques for low level processing within the engine, and may also reflect an engineering preference for terminology.

[…] Taking FAST as an example, their lemmatization engine handles not only basic word variations like singular vs. plural, but also thesaurus operators like having “hot” match “warm”.

This is not to say that other engines don’t handle synonyms, of course they do, but the low level implementation may be in a different subsystem than those that handle base stemming.

http://www.ideaeng.com/stemming-lemmatization-0601


Answer 6

IANACL,
but I think stemming is a rough hack people use to get all the different forms of the same word down to a base form, which need not be a legit word on its own.
Something like the Porter stemmer can use simple regexes to eliminate common word suffixes.

Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word.
Something like Morpha, which uses FSTs, brings nouns and verbs to their base form.


Answer 7

Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:

  1. If you lemmatize the word ‘Caring‘, it would return ‘Care‘. If you stem, it would return ‘Car‘ and this is erroneous.
  2. If you lemmatize the word ‘Stripes‘ in verb context, it would return ‘Strip‘. If you lemmatize it in noun context, it would return ‘Stripe‘. If you just stem it, it would just return ‘Strip‘.
  3. You would get the same results whether you lemmatize or stem words such as walking, running, swimming… to walk, run, swim, etc.
  4. Lemmatization is computationally expensive since it involves look-up tables and what not. If you have a large dataset and performance is an issue, go with stemming. Remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn’t humongous, go with lemmatization.

Answer 8

Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn’t have any meaning.

Examples,

"beautiful" -> "beauti"
"corpora" -> "corpora"

Stemming can be done very quickly.

Lemmatization on the other hand, is the process of converting the given word into it’s base form according to the dictionary meaning of the word.

Examples,

"beautiful" -> "beauty"
"corpora" -> "corpus"

Lemmatization takes more time than stemming.


How to get rid of punctuation using the NLTK tokenizer?

Question: How to get rid of punctuation using the NLTK tokenizer?

I’m just starting to use NLTK and I don’t quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn’t work with multiple sentences: dots are added to the last word.


Answer 0

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

Answer 1

You do not really need NLTK to remove punctuation. You can remove it with simple python. For strings:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. The string module has some other sets of elements that can be removed (like digits).
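
Note that on Python 3, str.translate no longer takes a separate deletechars argument; a sketch of the equivalent using str.maketrans:

import string

s = '... some string with punctuation ...'
s = s.translate(str.maketrans('', '', string.punctuation))
print(s)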


Answer 2

The code below will remove all punctuation marks as well as non-alphabetic tokens. Copied from their book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]

print(words)

Output:

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

Answer 3

As noticed in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not ‘str’ encoded with some encoding like ‘utf-8’).

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)

Answer 4

I just used the following code, which removed all the punctuation:

tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

Answer 5

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

Should work well in most cases since it removes punctuation while preserving tokens like “n’t”, which can’t be obtained from regex tokenizers such as wordpunct_tokenize.


Answer 6

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

…and then if you wish, you can replace certain tokens such as 'm with am.


Answer 7

I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print tokens
    print words

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And If you want to check whether a token is a valid English word or not, you may need PyEnchant

Tutorial:

 import enchant
 d = enchant.Dict("en_US")
 d.check("Hello")
 d.check("Helo")
 d.suggest("Helo")

回答 8

删除标点符号（使用下面的代码，它也会把句点 . 作为标点的一部分一并删除）

        import sys, unicodedata
        from nltk.tokenize import word_tokenize  # imports added so the snippet runs on its own
        tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
        text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
        w = word_tokenize(text_string)  # now tokenize the string

样本输入/输出:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']

Remove punctuation (the code below will also remove . as part of the punctuation handling):

        import sys, unicodedata
        from nltk.tokenize import word_tokenize  # imports added so the snippet runs on its own
        tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
        text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
        w = word_tokenize(text_string)  # now tokenize the string

Sample Input/Output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']


回答 9

作为对@rmalouf解决方案的补充，下面的写法不会包含任何数字，因为 \w+ 等效于 [a-zA-Z0-9_]：

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')  # note the +, so tokens are whole words rather than single letters
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Just adding to the solution by @rmalouf, this will not include any numbers because \w+ is equivalent to [a-zA-Z0-9_]

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')  # note the +, so tokens are whole words rather than single letters
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')
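For reference, a quick check of what the letters-only pattern should produce (RegexpTokenizer simply applies the regular expression to the text, so the expected output is shown below as a comment):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
print(tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!'))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']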

回答 10

您可以在没有nltk(python 3.x)的情况下一行完成此操作。

import string
string_text= string_text.translate(str.maketrans('','',string.punctuation))

You can do it in one line without nltk (python 3.x).

import string
string_text= string_text.translate(str.maketrans('','',string.punctuation))
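A quick usage sketch of the one-liner above; note that it also strips apostrophes, so contractions lose their quote character:

import string

string_text = "Hello, world! It's a test."
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
print(string_text)  # Hello world Its a test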

找不到资源 u'tokenizers/punkt/english.pickle'

问题：找不到资源 u'tokenizers/punkt/english.pickle'

我的代码:

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

错误信息:

[ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py
Traceback (most recent call last):
File "mapper_local_v1.0.py", line 16, in <module>

    tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

File "/usr/lib/python2.6/site-packages/nltk/data.py", line 774, in load

    opened_resource = _open(resource_url)

File "/usr/lib/python2.6/site-packages/nltk/data.py", line 888, in _open

    return find(path_, path + ['']).open()

File "/usr/lib/python2.6/site-packages/nltk/data.py", line 618, in find

    raise LookupError(resource_not_found)

LookupError:

Resource u'tokenizers/punkt/english.pickle' not found.  Please
use the NLTK Downloader to obtain the resource:

    >>>nltk.download()

Searched in:
- '/home/ec2-user/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''

我正在尝试在Unix机器上运行此程序:

根据错误消息,我从unix机器登录python shell,然后使用以下命令:

import nltk
nltk.download()

然后我使用 d（下载）和 l（列表）选项下载了所有可用的内容，但问题仍然存在。

我尽力在Internet中找到解决方案,但得到的解决方案与上述步骤中提到的解决方案相同。

My Code:

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

ERROR Message:

[ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py
Traceback (most recent call last):
File "mapper_local_v1.0.py", line 16, in <module>

    tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

File "/usr/lib/python2.6/site-packages/nltk/data.py", line 774, in load

    opened_resource = _open(resource_url)

File "/usr/lib/python2.6/site-packages/nltk/data.py", line 888, in _open

    return find(path_, path + ['']).open()

File "/usr/lib/python2.6/site-packages/nltk/data.py", line 618, in find

    raise LookupError(resource_not_found)

LookupError:

Resource u'tokenizers/punkt/english.pickle' not found.  Please
use the NLTK Downloader to obtain the resource:

    >>>nltk.download()

Searched in:
- '/home/ec2-user/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''

I’m trying to run this program in Unix machine:

As per the error message, I logged into python shell from my unix machine then I used the below commands:

import nltk
nltk.download()

and then I downloaded all the available things using the d (download) and l (list) options, but the problem still persists.

I tried my best to find a solution on the internet, but everything I found was the same as the steps I already described above.


回答 0

要添加到alvas的答案中,您只能下载punkt语料库:

nltk.download('punkt')

对我来说，下载全部内容（all）有点过头了，除非那正是您想要的。

To add to alvas’ answer, you can download only the punkt corpus:

nltk.download('punkt')

Downloading all sounds like overkill to me. Unless that’s what you want.


回答 1

如果您只想下载punkt模型:

import nltk
nltk.download('punkt')

如果不确定所需的数据/模型,可以从NLTK 安装流行的数据集,模型和标记器:

import nltk
nltk.download('popular')

使用上面的命令,无需使用GUI下载数据集。

If you’re looking to only download the punkt model:

import nltk
nltk.download('punkt')

If you’re unsure which data/model you need, you can install the popular datasets, models and taggers from NLTK:

import nltk
nltk.download('popular')

With the above command, there is no need to use the GUI to download the datasets.


回答 2

我得到了解决方案:

import nltk
nltk.download()

NLTK下载器启动后

d)下载l)列表u)更新c)配置h)帮助q)退出

下载器> d

下载哪个软件包(l = list; x = cancel)?标识符> punkt

I got the solution:

import nltk
nltk.download()

once the NLTK Downloader starts

d) Download l) List u) Update c) Config h) Help q) Quit

Downloader> d

Download which package (l=list; x=cancel)? Identifier> punkt


回答 3

您可以从外壳执行:

sudo python -m nltk.downloader punkt 

如果要安装流行的NLTK语料库/模型:

sudo python -m nltk.downloader popular

如果要安装所有 NLTK语料库/模型:

sudo python -m nltk.downloader all

列出您已下载的资源:

python -c 'import os; import nltk; print os.listdir(nltk.data.find("corpora"))'
python -c 'import os; import nltk; print os.listdir(nltk.data.find("tokenizers"))'

From the shell you can execute:

sudo python -m nltk.downloader punkt 

If you want to install the popular NLTK corpora/models:

sudo python -m nltk.downloader popular

If you want to install all NLTK corpora/models:

sudo python -m nltk.downloader all

To list the resources you have downloaded:

python -c 'import os; import nltk; print os.listdir(nltk.data.find("corpora"))'
python -c 'import os; import nltk; print os.listdir(nltk.data.find("tokenizers"))'

回答 4

import nltk
nltk.download('punkt')

打开Python提示符并运行以上语句。

sent_tokenize 函数使用 nltk.tokenize.punkt 模块中 PunktSentenceTokenizer 的一个实例。该实例已经过训练，适用于许多欧洲语言，因此它知道哪些标点和字符标记着句子的结尾和新句子的开头。

import nltk
nltk.download('punkt')

Open the Python prompt and run the above statements.

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages. So it knows what punctuation and characters mark the end of a sentence and the beginning of a new sentence.
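A small illustration of what the pretrained punkt model buys you; exact splits can vary slightly between punkt versions, but the abbreviation "Mr." should not be treated as a sentence boundary:

import nltk
nltk.download('punkt')  # only needed once

from nltk.tokenize import sent_tokenize

text = "Mr. Green is not a very nice fellow. He killed Colonel Mustard."
print(sent_tokenize(text))
# expected: ['Mr. Green is not a very nice fellow.', 'He killed Colonel Mustard.']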


回答 5

最近我也发生了同样的事情,您只需要下载“ punkt”软件包,它就可以工作。

在“下载所有可用内容”之后执行“列表”(l)时,所有内容是否都标记为以下行?:

[*] punkt............... Punkt Tokenizer Models

如果看到带有星星的这条线,则表示您已经拥有它,并且nltk应该能够加载它。

The same thing happened to me recently, you just need to download the “punkt” package and it should work.

When you execute “list” (l) after having “downloaded all the available things”, is everything marked like the following line?:

[*] punkt............... Punkt Tokenizer Models

If you see this line with the star, it means you have it, and nltk should be able to load it.


回答 6

通过键入转到python控制台

$ python

在您的终端中。然后,在python shell中键入以下2条命令以安装相应的软件包:

>> import nltk
>> nltk.download('punkt')
>> nltk.download('averaged_perceptron_tagger')

这为我解决了这个问题。

Go to python console by typing

$ python

in your terminal. Then, type the following 2 commands in your python shell to install the respective packages:

>> import nltk
>> nltk.download('punkt')
>> nltk.download('averaged_perceptron_tagger')

This solved the issue for me.


回答 7

我的问题是我以 root 用户身份调用了 nltk.download('all')，但最终使用 nltk 的进程属于另一个用户，该用户无权访问下载内容所在的 /root/nltk_data。

因此,我只是以递归方式将所有内容从下载位置复制到NLTK希望找到的路径之一,如下所示:

cp -R /root/nltk_data/ /home/ubuntu/nltk_data

My issue was that I called nltk.download('all') as the root user, but the process that eventually used nltk was another user who didn’t have access to /root/nltk_data where the content was downloaded.

So I simply recursively copied everything from the download location to one of the paths where NLTK was looking to find it like this:

cp -R /root/nltk_data/ /home/ubuntu/nltk_data
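As an alternative to copying the files (my own suggestion, not part of the original answer), you can also tell NLTK where to look, either by setting the NLTK_DATA environment variable or by extending the search path at runtime:

import nltk

nltk.data.path.append('/home/ubuntu/nltk_data')  # example path, taken from the answer above
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')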

回答 8

  1. 执行以下代码:

    import nltk
    nltk.download()
  2. 此后,将弹出NLTK下载器。

  3. 选择所有软件包。
  4. 下载punkt。
  1. Execute the following code:

    import nltk
    nltk.download()
    
  2. After this, NLTK downloader will pop out.

  3. Select All packages.
  4. Download punkt.

回答 9

尽管导入了以下内容,但还是出现错误,

import nltk
nltk.download()

但是对于谷歌colab,这解决了我的问题。

   !python3 -c "import nltk; nltk.download('all')"

I was getting an error despite importing the following,

import nltk
nltk.download()

but for google colab this solved my issue.

   !python3 -c "import nltk; nltk.download('all')"

回答 10

简单的nltk.download()无法解决此问题。我尝试了以下方法,它对我有用:

在nltk文件夹中,创建一个tokenizers文件夹,然后将您的punkt文件夹复制到tokenizers文件夹中。

这将起作用！文件夹结构必须与原回答图片中所示一致（图片未在此处保留）。

Simple nltk.download() will not solve this issue. I tried the below and it worked for me:

in the nltk folder create a tokenizers folder and copy your punkt folder into tokenizers folder.

This will work! The folder structure needs to be as shown in the picture (the image is not reproduced here; see the layout sketch below).
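Since the referenced picture is not reproduced here, the layout the answer describes typically looks like the sketch below (treat this as an approximation of the usual punkt package contents, not an exact copy of the missing screenshot):

nltk_data/
    tokenizers/
        punkt/
            english.pickle
            ...
            PY3/
                english.pickle
                ...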


回答 11

您需要重新排列文件夹：把 tokenizers 文件夹直接移到 nltk_data 文件夹中。如果您的 nltk_data 文件夹里是 corpora 文件夹，而 tokenizers 文件夹又在 corpora 里面，则无法正常工作。

You need to rearrange your folders: move your tokenizers folder directly into the nltk_data folder. It doesn't work if your nltk_data folder contains a corpora folder which in turn contains the tokenizers folder.


回答 12

对我而言，上述方法均无效，因此我只是从网站 http://www.nltk.org/nltk_data/ 手动下载了所有文件，并把它们手动放进“nltk_data”文件夹里名为“tokenizers”的文件夹中。不是一个漂亮的解决方案，但终归是一个解决方案。

For me none of the above worked, so I just downloaded all the files by hand from the web site http://www.nltk.org/nltk_data/ and put them, also by hand, into a folder called "tokenizers" inside the "nltk_data" folder. Not a pretty solution, but still a solution.


回答 13

添加此行代码后,该问题将得到解决:

nltk.download('punkt')

After adding this line of code, the issue will be fixed:

nltk.download('punkt')

回答 14

我遇到了同样的问题。下载所有内容后，仍然出现“punkt”错误。我在Windows机器上的 C:\Users\vaibhav\AppData\Roaming\nltk_data\tokenizers 中查找，可以看到“punkt.zip”就在那里。我意识到该zip不知为何没有被解压到 C:\Users\vaibhav\AppData\Roaming\nltk_data\tokenizers\punkt 中。解压之后，一切就正常了。

I faced the same issue. After downloading everything, the 'punkt' error was still there. I searched for the package on my Windows machine at C:\Users\vaibhav\AppData\Roaming\nltk_data\tokenizers and could see 'punkt.zip' there. I realized that somehow the zip had not been extracted into C:\Users\vaibhav\AppData\Roaming\nltk_data\tokenizers\punkt. Once I extracted the zip, it worked like a charm.


回答 15

只要确保您使用的是JupyterNotebook,然后在笔记本中执行以下操作:

import nltk

nltk.download()

然后将出现一个弹出窗口(显示信息https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml),您必须从中下载所有内容。

然后重新运行您的代码。

Just make sure you are using Jupyter Notebook and in a notebook, do the following:

import nltk

nltk.download()

Then one popup window will appear (showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml) From that you have to download everything.

Then rerun your code.


回答 16

对我来说,它可以通过使用“ nltk”来解决:

http://www.nltk.org/howto/data.html

使用nltk.data.load加载english.pickle失败

sent_tokenizer=nltk.data.load('nltk:tokenizers/punkt/english.pickle')

For me it got solved by using “nltk:”

http://www.nltk.org/howto/data.html

Failed loading english.pickle with nltk.data.load

sent_tokenizer=nltk.data.load('nltk:tokenizers/punkt/english.pickle')

如何使用nltk或python删除停用词

问题:如何使用nltk或python删除停用词

所以我有一个数据集,我想从中删除停用词

stopwords.words('english')

我在如何在我的代码中使用它以简单地取出这些单词的过程中苦苦挣扎。我已经有了这个数据集中的单词列表,我正在努力的部分是与此列表进行比较并删除停用词。任何帮助表示赞赏。

So I have a dataset that I would like to remove stop words from using

stopwords.words('english')

I’m struggling how to use this within my code to just simply take out these words. I have a list of the words from this dataset already, the part i’m struggling with is comparing to this list and removing the stop words. Any help is appreciated.


回答 0

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

回答 1

您还可以设置差异,例如:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))

You could also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))

回答 2

我想您有一个要删除停用词的单词列表(word_list)。您可以执行以下操作:

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword
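A side note (my addition, not from the answer): stopwords.words('english') is re-evaluated on every loop iteration above, so for longer texts it is worth building a set once and testing membership against it:

from nltk.corpus import stopwords

word_list = ['this', 'is', 'a', 'sample', 'sentence']   # example input
stop_set = set(stopwords.words('english'))              # build the set once
filtered_word_list = [w for w in word_list if w not in stop_set]
print(filtered_word_list)  # ['sample', 'sentence']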

回答 3

要排除所有类型的停用词,包括nltk停用词,您可以执行以下操作:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if not w in stop_words]

To exclude all type of stop-words including nltk stop-words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if not w in stop_words]

回答 4

为此，有一个非常简单的轻量级Python软件包 stop-words。

首先使用以下命令安装该软件包： pip install stop-words

然后,您可以使用列表理解功能将一行中的单词删除:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

该软件包的下载量非常轻(不同于nltk),适用于Python 2Python 3,并且具有许多其他语言的停用词,例如:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

There’s a very simple light-weight python package stop-words just for this sake.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3 ,and it has stop words for many other languages like:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

回答 5

使用textcleaner库从数据中删除停用词。

单击此链接：https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

请按照以下步骤操作以使用此库。

pip install textcleaner

安装后:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

使用上面的代码删除停用词。

Use textcleaner library to remove stopwords from your data.

Follow this link:https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use above code to remove the stop-words.


回答 6

您可以使用这个函数；请注意，您需要先把所有单词转换为小写。

from nltk.corpus import stopwords

def remove_stopwords(word_list):
        processed_word_list = []
        for word in word_list:
            word = word.lower() # in case they aren't all lower cased
            if word not in stopwords.words("english"):
                processed_word_list.append(word)
        return processed_word_list

You can use this function; note that you need to lowercase all the words first.

from nltk.corpus import stopwords

def remove_stopwords(word_list):
        processed_word_list = []
        for word in word_list:
            word = word.lower() # in case they aren't all lower cased
            if word not in stopwords.words("english"):
                processed_word_list.append(word)
        return processed_word_list

回答 7

使用过滤器

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

回答 8

这是我的看法,以防万一您想立即将答案放入字符串中(而不是过滤单词的列表):

STOPWORDS = set(stopwords.words('english'))
text =  ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

STOPWORDS = set(stopwords.words('english'))
text =  ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
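A usage sketch of the string-in, string-out variant above (assuming the NLTK stopword corpus has already been downloaded):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = 'this is a sample sentence showing off the stop words filtration'
text = ' '.join([word for word in text.split() if word not in STOPWORDS])
print(text)  # sample sentence showing stop words filtration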

回答 9

如果您的数据存储为 Pandas DataFrame，则可以使用 texthero 中的 remove_stopwords，它默认使用NLTK的停用词列表。

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

回答 10

from nltk.corpus import stopwords 

from nltk.tokenize import word_tokenize 

example_sent = "This is a sample sentence, showing off the stop words filtration."

  
stop_words = set(stopwords.words('english')) 
  
word_tokens = word_tokenize(example_sent) 
  
filtered_sentence = [w for w in word_tokens if not w in stop_words] 
  
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 
  
print(word_tokens) 
print(filtered_sentence) 

回答 11

# 1) build a new list containing only the words that are not stop words
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
another_list = []
for x in userstring:
    if x not in stop_list:      # keep the word only if it is not in the stop list
        another_list.append(x)  # it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:         # iterate over a copy so .remove does not skip items
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')

如何检查安装了哪个版本的nltk和scikit-learn？

问题：如何检查安装了哪个版本的nltk和scikit-learn？

在shell脚本中,我正在检查是否已安装此软件包,如果未安装,请先安装它。因此,使用shell脚本:

import nltk
echo nltk.__version__

但它会让shell脚本在 import 这一行停止。

在Linux终端中尝试以这种方式查看:

which nltk

但什么也没有输出，尽管我认为它已经安装了。

还有没有其他方法可以在shell脚本中验证此软件包的安装,如果未安装,请同时安装。

In a shell script I am checking whether these packages are installed or not, and if not installed, installing them. So within the shell script:

import nltk
echo nltk.__version__

but it stops the shell script at the import line.

In a Linux terminal I tried to check it this way:

which nltk

which gives nothing, though it is installed.

Is there any other way to verify this package installation in a shell script and, if it is not installed, install it as well?


回答 0

import nltk 是Python语法,因此无法在Shell脚本中使用。

为了检测 nltk 和 scikit-learn 的版本，你可以写一个Python脚本并运行它。这样的脚本可能看起来像：

import nltk
import sklearn

print('The nltk version is {}.'.format(nltk.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

# The nltk version is 3.0.0.
# The scikit-learn version is 0.15.2.

请注意,并非所有Python软件包都保证具有__version__属性,因此对于某些其他软件包可能会失败,但是对于nltk和scikit-learn至少它会起作用。

import nltk is Python syntax, and as such won’t work in a shell script.

To test the version of nltk and scikit_learn, you can write a Python script and run it. Such a script may look like

import nltk
import sklearn

print('The nltk version is {}.'.format(nltk.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

# The nltk version is 3.0.0.
# The scikit-learn version is 0.15.2.

Note that not all Python packages are guaranteed to have a __version__ attribute, so for some others it may fail, but for nltk and scikit-learn at least it will work.
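Since the question was really about checking and installing from a script, here is a hedged sketch of doing the whole check-and-install from Python instead of from the shell (the module/package name pairs below are just examples):

import importlib
import subprocess
import sys

for module_name, pip_name in [('nltk', 'nltk'), ('sklearn', 'scikit-learn')]:
    try:
        mod = importlib.import_module(module_name)
        print('{0} {1} is already installed.'.format(pip_name, getattr(mod, '__version__', 'unknown')))
    except ImportError:
        # install the missing package with the same interpreter's pip
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pip_name])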


回答 1

试试这个:

$ python -c "import nltk; print nltk.__version__"

Try this:

$ python -c "import nltk; print nltk.__version__"

回答 2

在Windows®系统中,您可以尝试

pip3 list | findstr scikit

scikit-learn                  0.22.1

如果您在Anaconda上,请尝试

conda list scikit

scikit-learn              0.22.1           py37h6288b17_0

这可以用来找出您已安装的任何软件包的版本。例如

pip3 list | findstr numpy

numpy                         1.17.4
numpydoc                      0.9.2

或者，如果您想一次查找多个软件包

pip3 list | findstr "scikit numpy"

numpy                         1.17.4
numpydoc                      0.9.2
scikit-learn                  0.22.1

请注意,当搜索多个单词时,必须使用引号字符。

照顾自己。

In Windows® systems you can simply try

pip3 list | findstr scikit

scikit-learn                  0.22.1

If you are on Anaconda try

conda list scikit

scikit-learn              0.22.1           py37h6288b17_0

And this can be used to find out the version of any package you have installed. For example

pip3 list | findstr numpy

numpy                         1.17.4
numpydoc                      0.9.2

Or if you want to look for more than one package at a time

pip3 list | findstr "scikit numpy"

numpy                         1.17.4
numpydoc                      0.9.2
scikit-learn                  0.22.1

Note the quote characters are required when searching for more than one word.

Take care.


回答 3

要检查shell脚本中scikit-learn的版本,如果已安装pip,则可以尝试以下命令

pip freeze | grep scikit-learn
scikit-learn==0.17.1

希望能帮助到你!

For checking the version of scikit-learn in shell script, if you have pip installed, you can try this command

pip freeze | grep scikit-learn
scikit-learn==0.17.1

Hope it helps!


回答 4

您只需执行以下操作即可找到NLTK版本:

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.5'

对于scikit-learn,

In [3]: import sklearn

In [4]: sklearn.__version__
Out[4]: '0.19.0'

我在这里使用python3。

You can find NLTK version simply by doing:

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.5'

And similarly for scikit-learn,

In [3]: import sklearn

In [4]: sklearn.__version__
Out[4]: '0.19.0'

I’m using python3 here.


回答 5

您可以按照以下方式从python笔记本单元格中进行检查

!pip install --upgrade nltk     # needed if nltk is not already installed
import nltk      
print('The nltk version is {}.'.format(nltk.__version__))
print('The nltk version is '+ str(nltk.__version__))

#!pip install --upgrade sklearn      # needed if sklearn is not already installed
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
print('The scikit-learn version is '+ str(sklearn.__version__))

you may check from a python notebook cell as follows

!pip install --upgrade nltk     # needed if nltk is not already installed
import nltk      
print('The nltk version is {}.'.format(nltk.__version__))
print('The nltk version is '+ str(nltk.__version__))

and

#!pip install --upgrade sklearn      # needed if sklearn is not already installed
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
print('The scikit-learn version is '+ str(sklearn.__version__))

回答 6

在我的安装了python 2.7的ubuntu 14.04机器中,如果我去这里,

/usr/local/lib/python2.7/dist-packages/nltk/

有一个名为

VERSION

如果我执行 cat VERSION，它将打印3.1，这就是已安装的NLTK版本。

In my machine which is ubuntu 14.04 with python 2.7 installed, if I go here,

/usr/local/lib/python2.7/dist-packages/nltk/

there is a file called

VERSION.

If I do a cat VERSION it prints 3.1, which is the NLTK version installed.
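If you want to do the same thing without hard-coding the dist-packages path, here is a small sketch (assuming the VERSION file is shipped alongside your installed nltk package, as the answer describes):

import os
import nltk

version_file = os.path.join(os.path.dirname(nltk.__file__), 'VERSION')
with open(version_file) as fh:
    print(fh.read().strip())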


安装几乎所有库的pip问题

问题:安装几乎所有库的pip问题

我很难用pip安装几乎所有东西。我是编程新手，所以我想可能是我哪里做错了，于是改用easy_install来完成我需要的大部分工作，这通常是有效的。但是，现在我正在尝试下载nltk库，两者都没能完成任务。

我尝试进入

sudo pip install nltk

但得到以下回应:

/Library/Frameworks/Python.framework/Versions/2.7/bin/pip run on Sat May  4 00:15:38 2013
Downloading/unpacking nltk

  Getting page https://pypi.python.org/simple/nltk/
  Could not fetch URL [need more reputation to post link]: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>

  Will skip URL [need more reputation to post link]/simple/nltk/ when looking for download links for nltk

  Getting page [need more reputation to post link]/simple/
  Could not fetch URL https://pypi.python. org/simple/: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>

  Will skip URL [need more reputation to post link] when looking for download links for nltk

  Cannot fetch index base URL [need more reputation to post link]

  URLs to search for versions for nltk:
  * [need more reputation to post link]
  Getting page [need more reputation to post link]
  Could not fetch URL [need more reputation to post link]: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>

  Will skip URL [need more reputation to post link] when looking for download links for nltk

  Could not find any downloads that satisfy the requirement nltk

No distributions at all found for nltk

Exception information:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/basecommand.py", line 139, in main
    status = self.run(options, args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/commands/install.py", line 266, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/req.py", line 1026, in prepare_files
    url = finder.find_requirement(req_to_install, upgrade=self.upgrade)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/index.py", line 171, in find_requirement
    raise DistributionNotFound('No distributions at all found for %s' % req)
DistributionNotFound: No distributions at all found for nltk

--easy_install installed fragments of the library and the code ran into trouble very quickly upon trying to run it.

对这个问题有什么想法吗？非常感谢您提供一些反馈，帮助我让pip正常工作，或者先绕过这个问题。

I have a difficult time using pip to install almost anything. I’m new to coding, so I thought maybe this is something I’ve been doing wrong and have opted out to easy_install to get most of what I needed done, which has generally worked. However, now I’m trying to download the nltk library, and neither is getting the job done.

I tried entering

sudo pip install nltk

but got the following response:

/Library/Frameworks/Python.framework/Versions/2.7/bin/pip run on Sat May  4 00:15:38 2013
Downloading/unpacking nltk

  Getting page https://pypi.python.org/simple/nltk/
  Could not fetch URL [need more reputation to post link]: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>

  Will skip URL [need more reputation to post link]/simple/nltk/ when looking for download links for nltk

  Getting page [need more reputation to post link]/simple/
  Could not fetch URL https://pypi.python. org/simple/: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>

  Will skip URL [need more reputation to post link] when looking for download links for nltk

  Cannot fetch index base URL [need more reputation to post link]

  URLs to search for versions for nltk:
  * [need more reputation to post link]
  Getting page [need more reputation to post link]
  Could not fetch URL [need more reputation to post link]: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>

  Will skip URL [need more reputation to post link] when looking for download links for nltk

  Could not find any downloads that satisfy the requirement nltk

No distributions at all found for nltk

Exception information:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/basecommand.py", line 139, in main
    status = self.run(options, args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/commands/install.py", line 266, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/req.py", line 1026, in prepare_files
    url = finder.find_requirement(req_to_install, upgrade=self.upgrade)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/index.py", line 171, in find_requirement
    raise DistributionNotFound('No distributions at all found for %s' % req)
DistributionNotFound: No distributions at all found for nltk

--easy_install installed fragments of the library and the code ran into trouble very quickly upon trying to run it.

Any thoughts on this issue? I’d really appreciate some feedback on how I can either get pip working or something to get around the issue in the meantime.


回答 0

我发现将pypi主机指定为受信任就足够了。例:

pip install --trusted-host pypi.python.org pytest-xdist
pip install --trusted-host pypi.python.org --upgrade pip

这解决了以下错误:

  Could not fetch URL https://pypi.python.org/simple/pytest-cov/: There was a problem confirming the ssl certificate: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600) - skipping
  Could not find a version that satisfies the requirement pytest-cov (from versions: )
No matching distribution found for pytest-cov

2018年4月更新：对于收到TLSV1_ALERT_PROTOCOL_VERSION错误的人：它与OP的受信任主机/证书验证问题或此答案无关。TLSV1错误是因为您的解释器不支持TLS v1.2，必须升级解释器。例如参见 https://news.ycombinator.com/item?id=13539034 、 http://pyfound.blogspot.ca/2017/01/time-to-upgrade-your-python-tls-v12.html 和 https://bugs.python.org/issue17128 。

2019年2月更新：对某些人来说，升级pip可能就足够了。如果上面的错误让您无法升级，请使用get-pip.py。例如在Linux上，

curl https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

有关更多详细信息,参见https://pip.pypa.io/en/stable/installing/

I found it sufficient to specify the pypi host as trusted. Example:

pip install --trusted-host pypi.python.org pytest-xdist
pip install --trusted-host pypi.python.org --upgrade pip

This solved the following error:

  Could not fetch URL https://pypi.python.org/simple/pytest-cov/: There was a problem confirming the ssl certificate: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600) - skipping
  Could not find a version that satisfies the requirement pytest-cov (from versions: )
No matching distribution found for pytest-cov

Update April 2018: To anyone getting the TLSV1_ALERT_PROTOCOL_VERSION error: it has nothing to do with trusted-host/verification issue of the OP or this answer. Rather the TLSV1 error is because your interpreter does not support TLS v1.2, you must upgrade your interpreter. See for example https://news.ycombinator.com/item?id=13539034, http://pyfound.blogspot.ca/2017/01/time-to-upgrade-your-python-tls-v12.html and https://bugs.python.org/issue17128.

Update Feb 2019: For some it may be sufficient to upgrade pip. If the above error prevents you from doing this, use get-pip.py. E.g. on Linux,

curl https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

More details at https://pip.pypa.io/en/stable/installing/.


回答 1

我使用的是pip 9.0.1版本，遇到了相同的问题。以上所有答案都无法解决，而且由于其他原因，我无法通过brew安装python/pip。

把pip升级到9.0.3即可解决问题。由于我无法用pip来升级pip，所以下载了源代码并手动安装。

  1. 从 https://pypi.org/simple/pip/ 下载正确版本的pip（因为无法用pip来安装它）。
  2. sudo python3 pip-9.0.3.tar.gz - 安装pip

或者您可以用以下方法安装较新的pip：

curl https://bootstrap.pypa.io/get-pip.py | python

I used pip version 9.0.1 and had the same issue, all the answers above didn’t solve the problem, and I couldn’t install python / pip with brew for other reasons.

Upgrading pip to 9.0.3 solved the problem. And because I couldn’t upgrade pip with pip I downloaded the source and installed it manually.

  1. Download the correct version of pip from – https://pypi.org/simple/pip/
  2. sudo python3 pip-9.0.3.tar.gz – Install pip

Or you can install newer pip with:

curl https://bootstrap.pypa.io/get-pip.py | python

回答 2

Pypi删除了对小于1.2的TLS版本的支持

您需要重新安装Pip,然后执行

curl https://bootstrap.pypa.io/get-pip.py | python

或对于全局Python:

curl https://bootstrap.pypa.io/get-pip.py | sudo python

Pypi removed support for TLS versions less than 1.2

You need to re-install Pip, do

curl https://bootstrap.pypa.io/get-pip.py | python

or for global Python:

curl https://bootstrap.pypa.io/get-pip.py | sudo python

回答 3

我使用的是pip3 9.0.1版本，最近无法通过 pip3 install 命令安装任何软件包。

Mac OS版本:EI Captain 10.11.5

python版本: 3.5

我尝试了命令:

curl https://bootstrap.pypa.io/get-pip.py | python

它对我不起作用。

因此，我通过输入以下命令卸载了较旧的pip并安装了最新版本10.0.0：

python3 -m pip uninstall pip setuptools
curl https://bootstrap.pypa.io/get-pip.py | python3

现在我的问题解决了。如果您使用的是python2,则可以将python3替换为python。我希望它也对您有用。

顺便说一下,对于像我这样的新秀,您必须输入代码: sudo -i

以获得root权限 :) 祝你好运！

I used pip3 version 9.0.1 and was unable to install any packages recently via the commandpip3 install.

Mac os version: EI Captain 10.11.5.

python version: 3.5

I tried the command:

curl https://bootstrap.pypa.io/get-pip.py | python

It didn’t work for me.

So I uninstalled the older pip and installed the newest version10.0.0 by entering this:

python3 -m pip uninstall pip setuptools
curl https://bootstrap.pypa.io/get-pip.py | python3

Now my problem was solved. If you are using the python2, you can substitute the python3 with python. I hope it also works for you.

By the way, for some rookies like me, you have to enter the code: sudo -i

to gain the root right :) Good luck!


回答 4

您可能会看到此错误;也可以在这里看到。

最简单的解决方法是把pip降级到一个不使用SSL的版本：easy_install pip==1.2.1。这会让您失去SSL带来的安全优势。真正的解决方案是使用链接到较新SSL库的Python发行版。

You’re probably seeing this bug; see also here.

The easiest workaround is to downgrade pip to one that doesn’t use SSL: easy_install pip==1.2.1. This loses you the security benefit of using SSL. The real solution is to use a Python distribution linked to a more recent SSL library.


回答 5

SSL错误的另一个原因可能是糟糕的系统时间–如果证书与当前时间相距太远,证书将无法验证。

Another cause of SSL errors can be a bad system time – certificates won’t validate if it’s too far off from the present.


回答 6

对我有用的唯一解决方案是:

sudo curl https://bootstrap.pypa.io/get-pip.py | sudo python

The only solution that worked for me is:

sudo curl https://bootstrap.pypa.io/get-pip.py | sudo python


回答 7

我通过添加--trusted-host pypi.python.org选项解决了类似的问题

I solved a similar problem by adding the --trusted-host pypi.python.org option


回答 8

要安装任何其他软件包，我必须使用最新版本的pip，因为9.0.1存在此SSL问题。而要用pip本身来升级pip，又必须先解决此SSL问题。为了跳出这个死循环，我发现只有下面这个办法对我有用。

  1. 在此页面中找到最新版本的pip：https://pypi.org/simple/pip/
  2. 下载最新版本的 .whl 文件。
  3. 使用pip安装最新的pip。（这里换成您自己下载的版本）

sudo pip install pip-10.0.1-py2.py3-none-any.whl

现在pip是最新版本,可以安装任何东西。

To install any other package I have to use the latest version of pip, since the 9.0.1 has this SSL problem. To upgrade the pip by pip itself, I have to solve this SSL problem first. To jump out of this endless loop, I find this only way that works for me.

  1. Find the latest version of pip in this page: https://pypi.org/simple/pip/
  2. Download the .whl file of the latest version.
  3. Use pip to install the latest pip. (Use your own latest version here)

sudo pip install pip-10.0.1-py2.py3-none-any.whl

Now the pip is the latest version and can install anything.


回答 9

解决方案 -通过标记以下受信任的主机来安装任何软件包

  • pypi.python.org
  • pypi.org
  • files.pythonhosted.org

临时解决方案

pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org {package name}

永久解决方案 -将您的PIP(9.0.1版本的问题)更新为最新版本。

pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org pytest-xdist

python -m pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org --upgrade pip

Solution – Install any package by marking below hosts trusted

  • pypi.python.org
  • pypi.org
  • files.pythonhosted.org

Temporary solution

pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org {package name}

Permanent solution – Update your PIP(problem with version 9.0.1) to latest.

pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org pytest-xdist

python -m pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org --upgrade pip

回答 10

macOS Sierra 10.12.6。无法通过pip安装任何东西(通过homebrew安装的python)。以上所有答案均无效。

最终,从python 3.5升级到3.6可行。

brew update
brew doctor #(in case you see such suggestion by brew)

然后按照brew的任何其他建议,即覆盖python的链接。

macOS Sierra 10.12.6. Wasn’t able to install anything through pip (python installed through homebrew). All the answers above didn’t work.

Eventually, upgrade from python 3.5 to 3.6 worked.

brew update
brew doctor #(in case you see such suggestion by brew)

then follow any additional suggestions by brew, i.e. overwrite link to python.


回答 11

我有同样的问题。我刚刚将python从2.7.0更新为2.7.15。它解决了这个问题。

您可以在此处下载。

I had the same problem. I just updated the python from 2.7.0 to 2.7.15. It solves the problem.

You can download here.


回答 12

正如blackjar在上面发布的,以下几行对我有用

pip --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --trusted-host pypi.org install xxx

您需要同时给出全部三个--trusted-host options。看完答案后,我只尝试了第一个,但那样对我来说并不起作用。

As posted above by blackjar, the below lines worked for me

pip --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --trusted-host pypi.org install xxx

You need to give all three --trusted-host options. I was trying with only the first one after looking at the answers but it didn’t work for me like that.


回答 13

您还可以使用conda安装软件包:请参见http://conda.pydata.org

conda install nltk

使用conda的最佳方法是下载Miniconda,但您也可以尝试

pip install conda
conda init
conda install nltk

You can also use conda to install packages: See http://conda.pydata.org

conda install nltk

The best way to use conda is to download Miniconda, but you can also try

pip install conda
conda init
conda install nltk

回答 14

对我来说,最新的pip(1.5.6)可以与不安全的nltk软件包一起使用,如果您只是告诉它不要对安全性如此挑剔的话:

pip install --upgrade --force-reinstall --allow-all-external --allow-unverified ntlk nltk

For me, the latest pip (1.5.6) works fine with the insecure nltk package if you just tell it not to be so picky about security:

pip install --upgrade --force-reinstall --allow-all-external --allow-unverified ntlk nltk

回答 15

试过了

pip --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --trusted-host pypi.org install xxx 

最终成功了，但不太明白为什么pypi.python.org这个域名变了。

tried

pip --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --trusted-host pypi.org install xxx 

and it finally worked, though I don't quite understand why the pypi.python.org domain was changed.


回答 16

如果通过代理连接,请执行export https_proxy=<your_proxy>(在Unix或Git Bash上),然后重试安装。

如果您使用的是Windows cmd,则更改为set https_proxy=<your_proxy>

If you’re connecting through a proxy, execute export https_proxy=<your_proxy> (on Unix or Git Bash) and then retry installation.

If you’re using Windows cmd, this changes to set https_proxy=<your_proxy>.


回答 17

为了解决此问题,我在Windows 7上执行了以下操作。

c:\Program Files\Python36\Scripts> pip install beautifulsoup4 --trusted-host *

--trusted-host 似乎可以解决SSL问题，* 表示每个主机。

当然这是行不通的,因为您会遇到其他错误,因为没有版本可以满足beautifulsoup4的要求,但是我认为该问题与一般性问题无关。

I did the following on Windows 7 to solve this problem.

c:\Program Files\Python36\Scripts> pip install beautifulsoup4 --trusted-host *

The --trusted-host seems to fix the SSL issue and * means every host.

Of course this does not work because you get other errors since there is no version that satisfies the requirement beautifulsoup4, but I don’t think that issue is related to the general question.


回答 18

只需卸载并重新安装pip软件包，问题就能解决。

Mac OS版本:高Sierra 10.13.6

python版本:3.7

因此,我通过输入以下命令卸载了较旧的pip并安装了最新的版本10.0.0:

python3 -m pip uninstall pip setuptools

curl https://bootstrap.pypa.io/get-pip.py | python3

现在我的问题解决了。如果您使用的是python2,则可以将python3替换为python。我希望它也对您有用。

Just uninstall and reinstall pip packages it will workout for you guys.

Mac os version: high Sierra 10.13.6

python version: 3.7

So I uninstalled the older pip and installed the newest version10.0.0 by entering this:

python3 -m pip uninstall pip setuptools

curl https://bootstrap.pypa.io/get-pip.py | python3

Now my problem was solved. If you are using the python2, you can substitute the python3 with python. I hope it also works for you.


回答 19

如果只是关于nltk,我曾经遇到过类似的问题。尝试按照以下安装指南进行操作。 安装NLTK

如果确定它不能与任何其他模块一起使用,则可能是您安装了不同版本的Python时遇到了问题。

或尝试一下,看是否已安装pip 。:

sudo apt-get install python-pip python-dev build-essential 

看看是否可行。

If it is only about nltk, I once faced similar problem. Try following guide for installation. Install NLTK

If you are sure it doesn’t work with any other module, you may have problem with different versions of Python installed.

Or give it a try to see if it says pip is already installed:

sudo apt-get install python-pip python-dev build-essential 

and see if it works.


回答 20

我通过以下步骤解决了此问题（在SLES 11 SP2上）

zypper remove pip
easy_install pip==1.2.1
pip install --upgrade scons

这是puppet中的相同步骤(适用于所有发行版)

  package { 'python-pip':
    ensure => absent,
  }
  exec { 'python-pip':
    command  => '/usr/bin/easy_install pip==1.2.1',
    require  => Package['python-pip'],
  }
  package { 'scons': 
    ensure   => latest,
    provider => pip,
    require  => Exec['python-pip'],
  }

I solved this issue with the following steps (on sles 11sp2)

zypper remove pip
easy_install pip==1.2.1
pip install --upgrade scons

Here are the same steps in puppet (which should work on all distros)

  package { 'python-pip':
    ensure => absent,
  }
  exec { 'python-pip':
    command  => '/usr/bin/easy_install pip==1.2.1',
    require  => Package['python-pip'],
  }
  package { 'scons': 
    ensure   => latest,
    provider => pip,
    require  => Exec['python-pip'],
  }

回答 21

在Mac Python 2.7.15rc1上使用python的最新版本 https://bugs.python.org/issue17128

Use Latest version of python on mac Python 2.7.15rc1 https://bugs.python.org/issue17128


回答 22

我在PyCharm中遇到了这个问题：把pip升级到10.0.1后，pip报出“'main' not found in module”错误。

我可以通过安装pip 9.0.3来解决此问题,如在其他一些线程中所见。这些是我执行的步骤:

  1. https://pypi.org/simple/pip/下载了9.0.3版本的pip (因为无法使用pip进行安装)。
  2. 从tar.gz安装pip 9.0.3：python -m pip install pip-9.0.3.tar.gz

此后,一切开始起作用。

I had this with PyCharm and upgrading pip to 10.0.1 broke pip with “‘main’ not found in module” error.

I could solve this problem by installing pip 9.0.3 as seen in some other thread. These are the steps I did:

  1. Downloaded 9.0.3 version of pip from https://pypi.org/simple/pip/ (since pip couldn’t be used to install it).
  2. Install pip 9.0.3 from tar.gz python -m pip install pip-9.0.3.tar.gz

Everything started to work after that.


回答 23

该视频教程对我有用:

$ curl https://bootstrap.pypa.io/get-pip.py | python

This video tutorial worked for me:

$ curl https://bootstrap.pypa.io/get-pip.py | python

回答 24

我通过在Mac上更新Python3 Virtualenv解决了此问题。我引用了网站https://gist.github.com/pandafulmanda/730a9355e088a9970b18275cb9eadef3
brew install python3
pip3 install virtualenv

I solved this issue by update Python3 Virtualenv on my mac. I reference the site https://gist.github.com/pandafulmanda/730a9355e088a9970b18275cb9eadef3
brew install python3
pip3 install virtualenv


回答 25

我尝试了一些热门答案，但仍然无法使用 pip install 安装任何库/软件包。

我的具体错误是 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain ，使用的是Windows版Miniconda（安装程序Miniconda3-py37_4.8.3-Windows-x86.exe）。

当我这样做时,它终于可以工作了: pip install -r requirements.txt --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org

具体来说,我添加了它以使其工作: --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org

I tried some of the popular answers, but still could not install any libraries/packages using pip install.

My specific error was 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain using Miniconda for Windows (installer Miniconda3-py37_4.8.3-Windows-x86.exe).

It finally works when I did this: pip install -r requirements.txt --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org

Specifically, I added this to make it work: --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org


Nltk-NLTK源

自然语言工具包(NLTK)


NLTK–自然语言工具包–是一套开源的Python模块、数据集和教程,支持自然语言处理方面的研究和开发。NLTK需要Python版本3.5、3.6、3.7、3.8或3.9

有关文档,请访问nltk.org

贡献

你想为NLTK的发展做贡献吗？太棒了！更多详细信息请阅读CONTRIBUTING.md。

另请参阅how to contribute to NLTK

捐赠

你觉得这个工具包有帮助吗?请使用NLTK主页上的链接通过PayPal向项目捐款来支持NLTK开发

引用

如果您发表使用NLTK的作品,请引用NLTK书,如下所示:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

版权所有

版权所有(C)2001-2021年NLTK项目

有关许可证信息,请参阅LICENSE.txt

AUTHORS.md包含对NLTK做出贡献的每个人的列表

重新分布

  • NLTK源代码是在Apache2.0许可下分发的
  • NLTK文档在知识共享署名-非商业性-无衍生作品3.0美国许可下分发
  • NLTK语料库是根据自述文件中给出的条款为每个语料库提供的;所有语料库都可以再分发,并可用于非商业用途
  • 根据这些许可证的规定,NLTK可以自由再分发

使用nltk.data.load加载english.pickle失败

问题:使用nltk.data.load加载english.pickle失败

尝试加载punkt令牌生成器时…

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

结果引发了LookupError：

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

When trying to load the punkt tokenizer…

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

…a LookupError was raised:

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

回答 0

我有同样的问题。进入python shell并输入:

>>> import nltk
>>> nltk.download()

然后出现安装窗口。转到“模型”标签,然后从“标识符”列下选择“ punkt”。然后单击下载,它将安装必要的文件。然后它应该工作!

I had this same problem. Go into a python shell and type:

>>> import nltk
>>> nltk.download()

Then an installation window appears. Go to the ‘Models’ tab and select ‘punkt’ from under the ‘Identifier’ column. Then click Download and it will install the necessary files. Then it should work!


回答 1

您可以这样做。

import nltk
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize

您可以把 punkt 作为参数传给 download 函数来下载令牌生成器。之后就可以在 nltk 中使用单词和句子标记器了。

如果你想下载全部内容，即 chunkers、grammars、misc、sentiment、taggers、corpora、help、models、stemmers、tokenizers，则不要传入任何参数：

nltk.download()

请参阅此以获得更多见解。https://www.nltk.org/data.html

You can do that like this.

import nltk
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize

You can download the tokenizers by passing punkt as an argument to the download function. The word and sentence tokenizers are then available on nltk.

If you want to download everything i.e chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers, do not pass any arguments like this.

nltk.download()

See this for more insights. https://www.nltk.org/data.html


回答 2

这是对我来说刚刚起作用的:

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

sentences_tokenized 是一个由令牌列表组成的列表：

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

这些句子摘自《Mining the Social Web, 2nd Edition》一书附带的示例ipython笔记本。

This is what worked for me just now:

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

sentences_tokenized is a list of a list of tokens:

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

The sentences were taken from the example ipython notebook accompanying the book “Mining the Social Web, 2nd Edition”


回答 3

从bash命令行运行:

$ python -c "import nltk; nltk.download('punkt')"

From bash command line, run:

$ python -c "import nltk; nltk.download('punkt')"

回答 4

这对我有用:

>>> import nltk
>>> nltk.download()

在Windows中,您还将获得nltk下载器

This Works for me:

>>> import nltk
>>> nltk.download()

In windows you will also get nltk downloader


回答 5

简单nltk.download()不会解决这个问题。我尝试了以下方法,它对我有用:

nltk文件夹中创建一个tokenizers文件夹,然后将您的punkt文件tokenizers夹复制到该文件夹中。

这将起作用！文件夹结构必须与原回答图片中所示一致。

Simple nltk.download() will not solve this issue. I tried the below and it worked for me:

in the nltk folder create a tokenizers folder and copy your punkt folder into tokenizers folder.

This will work! The folder structure needs to be as shown in the picture (the image is not reproduced here).


回答 6

nltk具有其预训练的令牌生成器模型。模型是从内部预定义的Web资源下载的,并在执行以下可能的函数调用时存储在已安装的nltk软件包的路径中。

例如 1：tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

例如 2：nltk.download('punkt')

如果您在代码中调用上述句子,请确保您具有Internet连接且没有任何防火墙保护。

我想分享一些更好的替代网络方法,以更深刻的理解来解决上述问题。

请按照以下步骤操作,并使用nltk享受英语单词标记化。

步骤1:首先按照网络路径下载“ english.pickle”模型。

转到链接“ http://www.nltk.org/nltk_data/ ”,然后在选项“ 107. Punkt Tokenizer Models”上单击“下载”。

步骤2:解压缩下载的“ punkt.zip”文件,并从中找到“ english.pickle”文件,并将其放置在C驱动器中。

步骤3:复制并粘贴以下代码并执行。

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend([token for token in treebank_word_tokenize(subSent)])

    wordToken.append(subSentToken)

for token in wordToken:
    print token

如果您遇到任何问题,请告诉我。

nltk ships with pre-trained tokenizer models. The model is downloaded from internally predefined web sources and stored in the installed nltk package's data path when you execute one of the following function calls.

E.g. 1: tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

E.g. 2: nltk.download('punkt')

If you call either of the above statements in your code, make sure you have an internet connection and that no firewall is blocking the download.

I would like to share a better alternative way to resolve the above issue, with a deeper understanding of what is going on.

Please follow the steps below to perform English word tokenization using nltk.

Step 1: First download the "english.pickle" model from the web path below.

Go to the link "http://www.nltk.org/nltk_data/" and click "download" at the entry "107. Punkt Tokenizer Models".

Step 2: Extract the downloaded "punkt.zip" file, find the "english.pickle" file inside it, and place it in the C drive.

Step 3: Copy and paste the following code and execute it.

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend([token for token in treebank_word_tokenize(subSent)])

    wordToken.append(subSentToken)

for token in wordToken:
    print token

Let me know if you face any problems.
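
As a side note, if the punkt data is installed somewhere on nltk's search path instead of being loaded from an absolute C:/ path, the same sentence-then-word tokenization can be written more compactly — a minimal sketch, assuming the punkt model is already available:

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study."
]

wordToken = []
for sent_text in sentences:
    tokens = []
    # sent_tokenize loads the punkt model from nltk's search path
    for sub_sent in sent_tokenize(sent_text):
        tokens.extend(word_tokenize(sub_sent))
    wordToken.append(tokens)

for token_list in wordToken:
    print(token_list)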


回答 7

在Jenkins上,可以通过在Build选项卡下的Virtualenv Builder中添加以下代码行来解决此问题:

python -m nltk.downloader punkt

On Jenkins this can be fixed by adding the following line of code to the Virtualenv Builder under the Build tab:

python -m nltk.downloader punkt


回答 8

当我尝试在nltk中进行词性(POS)标注时遇到了这个问题。我的解决方法是:在corpora目录旁边新建一个名为"taggers"的目录,然后将max_pos_tagger复制到该taggers目录中。
希望它也对您有用。祝你好运!

I came across this problem when I was trying to do POS tagging in nltk. The way I fixed it was by making a new directory named "taggers" alongside the corpora directory and copying max_pos_tagger into that taggers directory.
Hope it works for you too. Best of luck with it!
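
For what it's worth, a less manual route is usually to let nltk fetch the tagger data itself — a minimal sketch; the exact resource name depends on your NLTK version ('averaged_perceptron_tagger' in recent releases, 'maxent_treebank_pos_tagger' in older ones), and it assumes the punkt data from the answers above is already installed:

import nltk

# downloads the tagger model data into the default nltk_data location
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Professor Plum has a green plant in his study.")
print(nltk.pos_tag(tokens))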


回答 9

在Spyder中,进入您的活动shell,使用以下两条命令下载nltk:import nltk 和 nltk.download()。随后应会弹出NLTK下载器窗口;切换到该窗口中的"Models"选项卡,单击"punkt"并下载它。

In Spyder, go to your active shell and download nltk using the two commands below: import nltk and nltk.download(). You should then see the NLTK downloader window open; go to the 'Models' tab in this window, click on 'punkt', and download it.


回答 10

检查是否具有所有NLTK库。

Check whether you have all the NLTK libraries and data packages your code needs installed.
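
One way to check is to probe for each resource explicitly and download whatever is missing — a minimal sketch; the two resources listed are just examples, add whichever ones your code needs:

import nltk

# map of download ids to the paths nltk.data.find() expects
required = {
    'punkt': 'tokenizers/punkt',
    'stopwords': 'corpora/stopwords',
}

for download_id, resource_path in required.items():
    try:
        nltk.data.find(resource_path)
        print(resource_path + ' is installed')
    except LookupError:
        print(resource_path + ' is missing, downloading...')
        nltk.download(download_id)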


回答 11

punkt令牌生成器数据非常大,超过35 MB,如果像我一样,您在lambda这样的资源有限的环境中运行nltk,这可能是一件大事。

如果只需要一种或几种语言标记器,则可以通过仅包含那些语言.pickle文件来大大减少数据的大小。

如果您只需要支持英语,那么您的nltk数据大小可以减少到407 KB(对于python 3版本)。

步骤

  1. 下载nltk punkt数据:https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
  2. 在您的环境中的某个位置创建文件夹 nltk_data/tokenizers/punkt;如果使用python 3,请再添加一个PY3文件夹,使新目录结构为 nltk_data/tokenizers/punkt/PY3。就我而言,我在项目的根目录下创建了这些文件夹。
  3. 解压缩该zip文件,然后将要支持的语言的.pickle文件移动到刚创建的punkt文件夹中。注意:Python 3用户应使用PY3文件夹中的pickle文件。放好语言文件后,目录结构应与原回答中的示例文件夹结构图类似。
  4. 现在,假设您的数据不在预定义的搜索路径之一中,只需将该nltk_data文件夹添加到搜索路径即可。您可以通过环境变量 NLTK_DATA='path/to/your/nltk_data' 添加数据,也可以在运行时在python中添加自定义路径,方法是:
from nltk import data
data.path += ['/path/to/your/nltk_data']

注意:如果您不需要在运行时加载数据,也不需要将数据与代码捆绑在一起,则最好在nltk默认查找数据的内置位置创建nltk_data文件夹。

The punkt tokenizer data is quite large at over 35 MB; this can be a big deal if, like me, you are running nltk in an environment such as lambda that has limited resources.

If you only need one or a few language tokenizers, you can drastically reduce the size of the data by including only those languages' .pickle files.

If all you need to support is English, then your nltk data size can be reduced to 407 KB (for the python 3 version).

Steps

  1. Download the nltk punkt data: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
  2. Somewhere in your environment create the folders nltk_data/tokenizers/punkt. If using python 3, add another folder PY3 so that your new directory structure looks like nltk_data/tokenizers/punkt/PY3. In my case I created these folders at the root of my project.
  3. Extract the zip and move the .pickle files for the languages you want to support into the punkt folder you just created. Note: Python 3 users should use the pickles from the PY3 folder. With your language files in place, the layout should look like the example folder structure shown in the original answer.
  4. Now you just need to add your nltk_data folder to the search paths, assuming your data is not in one of the pre-defined search paths. You can do this either with the environment variable NLTK_DATA='path/to/your/nltk_data', or by adding a custom path at runtime in python:
from nltk import data
data.path += ['/path/to/your/nltk_data']

NOTE: If you don't need to load in the data at runtime or bundle the data with your code, it would be best to create your nltk_data folders at one of the built-in locations that nltk searches.
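
To confirm that the trimmed-down data is actually being picked up, append the folder and run something that needs punkt — a minimal sketch, assuming the layout described above ('/path/to/your/nltk_data' is the placeholder from the steps):

from nltk import data
from nltk.tokenize import word_tokenize

# make the project-local nltk_data folder visible to nltk
data.path += ['/path/to/your/nltk_data']

# word_tokenize needs the punkt data; it raises LookupError if it is not found
print(word_tokenize("Mr. Green is not a very nice fellow."))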


回答 12

nltk.download()不会解决这个问题。我尝试了以下方法,它对我有用:

在'...AppData\Roaming\nltk_data\tokenizers'文件夹中,将下载的punkt.zip文件解压缩到同一位置。

nltk.download() will not solve this issue. I tried the below and it worked for me:

In the '...AppData\Roaming\nltk_data\tokenizers' folder, extract the downloaded punkt.zip file at that same location.
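
If you would rather script that step than do it by hand, the standard library can extract the archive for you — a minimal sketch; the AppData path is the Windows example from the answer above and will differ per user:

import os
import zipfile

# example Windows location from the answer above
tokenizers_dir = os.path.expandvars(r'%APPDATA%\nltk_data\tokenizers')

# unpack punkt.zip in place so that tokenizers\punkt\... exists afterwards
with zipfile.ZipFile(os.path.join(tokenizers_dir, 'punkt.zip')) as archive:
    archive.extractall(tokenizers_dir)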


回答 13

在Python-3.6中,我可以在回溯(traceback)信息里看到相关建议,这很有帮助。因此我想说:请注意您得到的错误信息,大多数时候答案就在错误信息之中 ;)。

然后,像其他人所建议的那样,可以使用python终端,或使用类似 python -c "import nltk; nltk.download('wordnet')" 的命令即时安装。您只需运行一次该命令,它就会把数据保存在您主目录下的本地位置。

In Python-3.6 I can see the suggestion in the traceback, which is quite helpful. Hence I would say: pay attention to the error you got; most of the time the answer is within that message ;).

And then, as suggested by other folks here, we can install the data on the fly, either from the python terminal or with a command like python -c "import nltk; nltk.download('wordnet')". You just need to run that command once; it will then save the data locally in your home directory.
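
If you do not want to remember to run that command up front, a common pattern is to download lazily from inside the script itself — a minimal sketch using the same wordnet example:

import nltk

try:
    # succeeds if the data is already present in one of nltk's search paths
    nltk.data.find('corpora/wordnet')
except LookupError:
    # only hits the network the first time; later runs use the local copy
    nltk.download('wordnet')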


回答 14

使用分配的文件夹进行多次下载时,我遇到了类似的问题,我不得不手动添加数据路径:

单次下载可以按如下方式完成(有效):

import os as _os
from nltk.corpus import stopwords
from nltk import download as nltk_download

nltk_download('stopwords', download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

stop_words: list = stopwords.words('english')

该代码有效,这意味着nltk会记住在download函数中传递的下载路径。另一方面,如果我接着下载后续软件包,就会得到与提问者所描述的类似的错误:

多次下载会引发错误:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download

nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))

错误:

找不到资源punkt。请使用NLTK下载器获取该资源:

import nltk; nltk.download('punkt')

现在,如果我将我的下载路径追加到nltk数据路径中,它就可以正常工作:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download
from nltk.data import path as nltk_path


nltk_path.append( _os.path.join(get_project_root_path(), 'temp'))


nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))

这行得通……不确定为什么一种情况可行而另一种情况不行,但错误消息似乎暗示它第二次没有去检查下载文件夹。注意:使用Windows8.1/python3.7/nltk3.5。

I had similar issue when using an assigned folder for multiple downloads, and I had to append the data path manually:

single download, can be achived as followed (works)

import os as _os
from nltk.corpus import stopwords
from nltk import download as nltk_download

nltk_download('stopwords', download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

stop_words: list = stopwords.words('english')

This code works, meaning that nltk remembers the download path passed in the download function. On the other hand, if I download a subsequent package I get a similar error to the one described by the user:

Multiple downloads raise an error:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download

nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))


Error:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

import nltk; nltk.download('punkt')

Now if I append my download path to the nltk data path, it works:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download
from nltk.data import path as nltk_path


nltk_path.append( _os.path.join(get_project_root_path(), 'temp'))


nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))

This works… Not sure why it works in one case but not the other, but the error message seems to imply that it doesn't check the download folder the second time. NB: using Windows 8.1 / python3.7 / nltk3.5.
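
An alternative to appending nltk.data.path in code is to point the NLTK_DATA environment variable at the shared folder before nltk is imported, since nltk reads that variable when it builds its search path — a minimal sketch of the same setup (get_project_root_path() is the helper used in the snippets above):

import os as _os

# must be set before importing nltk, which reads NLTK_DATA at import time
_os.environ['NLTK_DATA'] = _os.path.join(get_project_root_path(), 'temp')

from nltk import download as nltk_download
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk_download(['stopwords', 'punkt'], download_dir=_os.environ['NLTK_DATA'], raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))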