Question: Failed to load english.pickle with nltk.data.load

When trying to load the punkt tokenizer…

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

…a LookupError was raised:

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

Answer 0

I had this same problem. Go into a Python shell and type:

>>> import nltk
>>> nltk.download()

An installation window then appears. Go to the 'Models' tab and select 'punkt' under the 'Identifier' column, then click Download to install the necessary files. After that, it should work!
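
Once the download completes, the load call from the question should succeed:

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')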


Answer 1

You can do it like this:

import nltk
nltk.download('punkt')

from nltk import word_tokenize, sent_tokenize

You can download the tokenizer models by passing punkt as an argument to the download function. The word and sentence tokenizers are then available from nltk.

If you want to download everything, i.e. chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, and tokenizers, do not pass any argument:

nltk.download()

See https://www.nltk.org/data.html for more details.
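
For example, after downloading punkt once, both tokenizers work directly (a minimal sketch):

import nltk
nltk.download('punkt')  # only needed once per environment

from nltk import word_tokenize, sent_tokenize

text = "Hello world. NLTK makes tokenization easy."
print(sent_tokenize(text))  # ['Hello world.', 'NLTK makes tokenization easy.']
print(word_tokenize(text))  # ['Hello', 'world', '.', 'NLTK', 'makes', ...]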


Answer 2

This is what worked for me just now:

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

sentences_tokenized is a list of lists of tokens:

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

The sentences were taken from the example IPython notebook accompanying the book “Mining the Social Web, 2nd Edition”.
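
Note that the first string holds two sentences; if you also need sentence splitting, sent_tokenize relies on the same punkt data (a minimal sketch reusing the sentences list above):

from nltk.tokenize import sent_tokenize

print(sent_tokenize(sentences[0]))
# ['Mr. Green killed Colonel Mustard in the study with the candlestick.',
#  'Mr. Green is not a very nice fellow.']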


Answer 3

From the bash command line, run:

$ python -c "import nltk; nltk.download('punkt')"
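
If you need the data in a non-default location, the downloader CLI also accepts a target directory via -d (the path below is illustrative):

$ python -m nltk.downloader -d /usr/local/share/nltk_data punkt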

Answer 4

This works for me:

>>> import nltk
>>> nltk.download()

On Windows, this will also open the NLTK Downloader window.


Answer 5

A plain nltk.download() did not solve this issue for me. I tried the following and it worked:

In the nltk folder, create a tokenizers folder and copy your punkt folder into it.

This will work; the folder structure needs to match the layout shown in the sketch below.
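
A sketch of the expected layout, based on the resource path in the error message (tokenizers/punkt/english.pickle); the root directories actually searched on your machine are listed in nltk.data.path:

nltk_data/
    tokenizers/
        punkt/
            english.pickle
            ...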


Answer 6

nltk ships with pre-trained tokenizer models. The models are downloaded from predefined web sources and stored under the installed nltk package's data path when you execute function calls such as:

E.g. 1: tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

E.g. 2: nltk.download('punkt')

If you call either of the above in your code, make sure you have an internet connection and that no firewall is blocking the download.
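
If a proxy is what blocks the download, nltk exposes set_proxy for the downloader (the proxy URL below is a hypothetical placeholder):

import nltk
nltk.set_proxy('http://proxy.example.com:3128')  # hypothetical proxy address
nltk.download('punkt')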

I would like to share an alternative, offline way to resolve the above issue, which may also give a deeper understanding of what is going on.

Follow these steps to tokenize English text with nltk:

Step 1: Download the “english.pickle” model over the web.

Go to “http://www.nltk.org/nltk_data/” and click “download” at the entry “107. Punkt Tokenizer Models”.

Step 2: Extract the downloaded “punkt.zip” file, find the “english.pickle” file inside it, and place it on the C drive.

Step 3: Copy and paste the following code and execute it.

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend(treebank_word_tokenize(subSent))

    wordToken.append(subSentToken)

for token in wordToken:
    print(token)  # parentheses so this runs on both Python 2 and 3

Let me know if you face any problems.


Answer 7

On Jenkins, this can be fixed by adding the following line of code to the Virtualenv Builder under the Build tab:

python -m nltk.downloader punkt



Answer 8

I came across this problem when I was trying to do POS tagging in nltk. The way I got it working was by creating a new directory named “taggers” alongside the corpora directory and copying max_pos_tagger into it.
Hope it works for you too. Best of luck!


Answer 9

In Spyder, go to your active shell and download nltk using the two commands below:

>>> import nltk
>>> nltk.download()

You should then see the NLTK Downloader window open. Go to the ‘Models’ tab in this window, click on ‘punkt’, and download it.


Answer 10

Check if you have all NLTK libraries.
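
One way to check for, and repair, a missing resource programmatically (a common pattern; nltk.data.find raises LookupError when a resource is absent):

import nltk

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')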


Answer 11

The punkt tokenizer data is quite large, at over 35 MB. This can be a big deal if, like me, you are running nltk in an environment with limited resources, such as Lambda.

If you only need one or a few language tokenizers, you can drastically reduce the size of the data by including only those languages' .pickle files.

If all you need to support is English, your nltk data size can be reduced to 407 KB (for the Python 3 version).

Steps

  1. Download the nltk punkt data: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
  2. Somewhere in your environment, create the folders nltk_data/tokenizers/punkt; if using Python 3, add another folder PY3 so that your new directory structure looks like nltk_data/tokenizers/punkt/PY3. In my case, I created these folders at the root of my project.
  3. Extract the zip and move the .pickle files for the languages you want to support into the punkt folder you just created. Note: Python 3 users should use the pickles from the PY3 folder. With your language files in place, the layout should look like nltk_data/tokenizers/punkt/PY3/english.pickle.
  4. Now you just need to add your nltk_data folder to the search paths, assuming your data is not in one of the pre-defined search paths. You can add your data using the environment variable NLTK_DATA='path/to/your/nltk_data', or add a custom path at runtime in Python by doing:
from nltk import data
data.path += ['/path/to/your/nltk_data']

NOTE: If you don't need to load the data at runtime or bundle it with your code, it is best to create your nltk_data folders at one of the built-in locations that nltk searches.
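
To confirm which directories nltk actually searches (the same list shown in the LookupError), print the path list:

from nltk import data
print(data.path)  # ordered list of directories searched for nltk_data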


Answer 12

nltk.download() alone did not solve this issue for me. I tried the following and it worked:

In the '...AppData\Roaming\nltk_data\tokenizers' folder, extract the downloaded punkt.zip at that same location.
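
If you prefer to script the extraction, Python's standard zipfile module can do it (a sketch; the path is illustrative and should match your own nltk_data location):

import os
import zipfile

tokenizers_dir = os.path.expanduser(r'~\AppData\Roaming\nltk_data\tokenizers')
with zipfile.ZipFile(os.path.join(tokenizers_dir, 'punkt.zip')) as zf:
    zf.extractall(tokenizers_dir)  # produces tokenizers\punkt\*.pickle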


Answer 13

In Python 3.6 I can see the suggestion right in the traceback. That's quite helpful, so I would say: pay attention to the error you get; most of the time, the answer lies within the problem itself ;).


And then, as suggested by others here, either use the Python terminal or a command like python -c "import nltk; nltk.download('wordnet')" to install it on the fly. You only need to run that command once; it saves the data locally in your home directory.


Answer 14

I had a similar issue when using an assigned folder for multiple downloads, and I had to append the data path manually:

A single download can be achieved as follows (this works):

import os as _os
from nltk.corpus import stopwords
from nltk import download as nltk_download

# note: get_project_root_path() is a project-specific helper, not part of nltk
nltk_download('stopwords', download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

stop_words: list = stopwords.words('english')

This code works, meaning that nltk remembers the download path passed to the download function. On the other hand, if I download a subsequent package, I get an error similar to the one the asker described:

Multiple downloads raise an error:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download

nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))


Error:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Now if I append my download path to the nltk data path, it works:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download
from nltk.data import path as nltk_path


nltk_path.append( _os.path.join(get_project_root_path(), 'temp'))


nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))

This works… I am not sure why it works in one case but not the other, but the error message seems to imply that it doesn't check the download folder the second time. NB: using Windows 8.1 / Python 3.7 / nltk 3.5.
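
An alternative to appending to nltk.data.path in code is the NLTK_DATA environment variable mentioned in Answer 11; it has to be set before nltk is first imported (a sketch with an illustrative path):

import os
os.environ['NLTK_DATA'] = '/path/to/your/nltk_data'  # must be set before importing nltk

from nltk.tokenize import word_tokenize
print(word_tokenize("Now punkt can be found."))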

