Failed to load english.pickle with nltk.data.load

Question


When trying to load the punkt tokenizer…

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

…a LookupError was raised:

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

Answer 0


I had this same problem. Go into a Python shell and type:

>>> import nltk
>>> nltk.download()

Then an installation window appears. Go to the ‘Models’ tab, select ‘punkt’ under the ‘Identifier’ column, and click Download; it will install the necessary files. Then it should work!


Answer 1


You can do it like this:

import nltk
nltk.download('punkt')

from nltk import word_tokenize, sent_tokenize

You can download the tokenizer data by passing punkt as an argument to the download function. The word and sentence tokenizers are then available in nltk, as in the sketch below.
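
For example, a minimal usage sketch (the sample sentence is my own, not from the answer):

import nltk
nltk.download('punkt')  # fetch the Punkt tokenizer models once

from nltk import word_tokenize, sent_tokenize

text = "Hello world. NLTK is ready."
print(sent_tokenize(text))  # ['Hello world.', 'NLTK is ready.']
print(word_tokenize(text))  # ['Hello', 'world', '.', 'NLTK', 'is', 'ready', '.']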

If you want to download everything, i.e. chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, and tokenizers, do not pass any argument:

nltk.download()

See https://www.nltk.org/data.html for more details.


Answer 2


This is what worked for me just now:

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

sentences_tokenized is a list of lists of tokens:

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

The sentences were taken from the example IPython notebook accompanying the book “Mining the Social Web, 2nd Edition”.
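
Punkt itself is a sentence tokenizer; a companion sketch (my own addition, reusing the sentences list above) splits each text into sentences first:

from nltk.tokenize import sent_tokenize

# each entry of `sentences` may contain more than one sentence; split them
for s in sentences:
    print(sent_tokenize(s))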


Answer 3


From the bash command line, run:

$ python -c "import nltk; nltk.download('punkt')"
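
If you need the data somewhere other than the default home-directory location, nltk.download also accepts a download_dir argument; the path below is only a placeholder and must be one of the directories NLTK searches (or be added to nltk.data.path):

$ python -c "import nltk; nltk.download('punkt', download_dir='/usr/local/share/nltk_data')"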

Answer 4


This works for me:

>>> import nltk
>>> nltk.download()

On Windows this will also open the NLTK downloader window.


Answer 5


A plain nltk.download() did not solve this issue for me. I tried the following and it worked:

In the nltk_data folder, create a tokenizers folder and copy your punkt folder into it.

This will work; the folder structure must match the layout shown in the original answer's screenshot (not reproduced here).
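
To see where NLTK actually looks for that folder structure, you can print its search paths (a quick check, not part of the original answer):

import nltk

# the directories nltk.data.load searches, in order
print(nltk.data.path)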


Answer 6


NLTK ships pre-trained tokenizer models. These models are downloaded from predefined web sources and stored under an nltk_data path, and calls such as the following rely on them being present:

E.g. 1: tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

E.g. 2: nltk.download('punkt')

If you call either of the above in your code, make sure you have an internet connection and no firewall blocking the download.

I would like to share an alternative, offline way to resolve the issue.

Follow these steps to get English word tokenization working with NLTK.

Step 1: Download the “english.pickle” model from the web.

Go to http://www.nltk.org/nltk_data/ and click “download” at the entry “107. Punkt Tokenizer Models”.

Step 2: Extract the downloaded punkt.zip, locate the english.pickle file inside, and place it on the C: drive.

Step 3: Copy and paste the following code and execute it.

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

# load the sentence tokenizer directly from the pickle placed on the C: drive
tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    # split into sentences first, then word-tokenize each sentence
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend([token for token in treebank_word_tokenize(subSent)])

    wordToken.append(subSentToken)

for token in wordToken:
    print(token)

Let me know if you face any problems.


Answer 7


On Jenkins this can be fixed by adding the following line to the Virtualenv Builder under the Build tab:

python -m nltk.downloader punkt
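
If the data should live inside the workspace rather than the home directory, the downloader module also takes a -d flag for the target directory (the path below is a placeholder; you would then need to point NLTK_DATA or nltk.data.path at it):

python -m nltk.downloader -d ./nltk_data punkt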


Answer 8


I came across this problem when I was trying to do POS tagging in NLTK. I fixed it by creating a new directory named “taggers” alongside the corpora directory and copying max_pos_tagger into that taggers directory.
Hope it works for you too. Best of luck!
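
For reference, the downloader-based equivalent, assuming the tagger meant here is the maxent treebank POS tagger package, would be something like:

import nltk

# fetches the tagger models into nltk_data/taggers/ instead of copying by hand
nltk.download('maxent_treebank_pos_tagger')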


Answer 9


In Spyder, go to your active shell and download the NLTK data using the two commands below:

import nltk
nltk.download()

The NLTK downloader window should then open; go to the ‘Models’ tab in that window, click ‘punkt’, and download it.


Answer 10


Check whether you have all the required NLTK data packages installed.
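
One way to make that check concrete for punkt, and fetch it only when it is missing (a common idiom, not from the original answer):

import nltk

try:
    # raises LookupError if punkt is not in any nltk_data directory
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')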


Answer 11


The punkt tokenizer data is quite large, at over 35 MB; this can be a big deal if, like me, you are running nltk in a resource-limited environment such as AWS Lambda.

If you only need one or a few language tokenizers, you can drastically reduce the size of the data by including only those languages' .pickle files.

If you only need to support English, your nltk data size can be reduced to 407 KB (for the Python 3 version).

Steps

  1. Download the nltk punkt data: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
  2. Somewhere in your environment, create the folders nltk_data/tokenizers/punkt; if you are using Python 3, add another folder PY3 so that your new directory structure looks like nltk_data/tokenizers/punkt/PY3. In my case I created these folders at the root of my project.
  3. Extract the zip and move the .pickle files for the languages you want to support into the punkt folder you just created. Note: Python 3 users should use the pickles from the PY3 folder. (The original answer illustrated the resulting folder structure with a screenshot, not reproduced here.)
  4. Now you just need to add your nltk_data folder to the search paths, assuming your data is not in one of the pre-defined search paths. You can do this with the environment variable NLTK_DATA='path/to/your/nltk_data', or add a custom path at runtime in Python:
from nltk import data
data.path += ['/path/to/your/nltk_data']

NOTE: If you don't need to load the data at runtime or bundle the data with your code, it would be best to create your nltk_data folder at one of the built-in locations that nltk looks in.
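
A quick sanity check that the trimmed data is found (the path is a placeholder for wherever you created nltk_data):

from nltk import data
from nltk.tokenize import sent_tokenize

data.path += ['/path/to/your/nltk_data']

# should print ['It works.', 'Punkt was found.'] instead of raising LookupError
print(sent_tokenize("It works. Punkt was found."))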


Answer 12


nltk.download() alone did not solve this issue for me. I tried the following and it worked:

In the '...AppData\Roaming\nltk_data\tokenizers' folder, extract the downloaded punkt.zip into that same location.
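
The same extraction can be scripted with the standard library; the paths below are placeholders for your actual download location and nltk_data folder (Windows-style, matching the answer):

import os
import zipfile

# adjust both paths to your machine
zip_path = os.path.expanduser('~/Downloads/punkt.zip')
target_dir = os.path.expandvars(r'%APPDATA%\nltk_data\tokenizers')

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target_dir)  # creates the punkt/ folder inside tokenizers/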


Answer 13


In Python 3.6 the suggestion is visible right in the traceback, which is quite helpful. So pay attention to the error you get; most of the time the answer lies within the problem itself ;).

Then, as suggested by other folks here, either use the Python terminal or a command like python -c "import nltk; nltk.download('wordnet')" to install the data on the fly. You only need to run that command once; it saves the data locally in your home directory.


Answer 14


I had a similar issue when using an assigned folder for multiple downloads, and I had to append the data path manually:

A single download can be achieved as follows (this works):

import os as _os
from nltk.corpus import stopwords
from nltk import download as nltk_download

nltk_download('stopwords', download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

stop_words: list = stopwords.words('english')

This code works, meaning that nltk remembers the download path passed in the download function. On the other hand, if I download a subsequent package, I get a similar error to the one described by the OP.

Multiple downloads raise an error:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download

nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))


Error:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Now if I append my download path to the nltk data path, it works:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download
from nltk.data import path as nltk_path


nltk_path.append( _os.path.join(get_project_root_path(), 'temp'))


nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))

This works... Not sure why it works in one case but not the other, but the error message seems to imply that it doesn't check the download folder the second time. NB: using Windows 8.1 / Python 3.7 / nltk 3.5.