Question: How to remove stop words using nltk or python
So I have a dataset that I would like to remove stop words from using
stopwords.words('english')
I'm struggling with how to use this within my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words.
Any help is appreciated.
Answer 0
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
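One note on this one-liner: stopwords.words('english') is evaluated on every iteration of the comprehension, and membership tests against a list are linear, so for longer word lists it can help to build the stopword set once. A minimal sketch, assuming word_list is your list of tokens:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the lookup set once
filtered_words = [word for word in word_list if word not in stop_words]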
Answer 1
You could also do a set diff, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
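The snippet assumes that sentence and pattern already exist, and note that the set difference discards word order and duplicate tokens. A minimal sketch with an illustrative whitespace pattern (both values are placeholders, not part of the original answer):
import nltk
from nltk.corpus import stopwords

sentence = "this is a simple example sentence with a few stop words"
pattern = r'\s+'  # with gaps=True the pattern describes the separators, so tokens are the non-space runs
tokens = nltk.regexp_tokenize(sentence, pattern, gaps=True)
filtered = list(set(tokens) - set(stopwords.words('english')))
print(filtered)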
Answer 2
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy, since you should not remove from a list while iterating over it
for word in word_list:  # iterate over the original word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove the word from filtered_word_list if it is a stopword
Answer 3
To exclude all types of stop words, including the nltk stop words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en')) #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
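Since the combined list ends up with roughly a thousand entries and each membership test scans it linearly, converting it to a set keeps the filtering fast. A sketch under the same assumptions (word_list as in the question):
from stop_words import get_stop_words
from nltk.corpus import stopwords

combined_stop_words = set(get_stop_words('en')) | set(stopwords.words('english'))  # union of both sources
output = [w for w in word_list if w not in combined_stop_words]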
Answer 4
There's a very simple, light-weight Python package stop-words made just for this.
First install the package using:
pip install stop-words
Then you can remove your words in one line using a list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages (a short usage sketch follows this list), such as:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
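For one of the other languages, the call should take the language identifier in the same way; a sketch, assuming the package accepts the full language names from the list above (this is an assumption; the two-letter code such as 'de' may be needed instead):
from stop_words import get_stop_words

# Assumption: the package accepts the language names listed above; the code 'de' may also work.
german_stop_words = get_stop_words('german')
print(len(german_stop_words))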
Answer 5
Use the textcleaner library to remove stopwords from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do so with this library.
pip install textcleaner
After installing:
import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default
Use the above code to remove the stop words.
Answer 6
You can use this function; note that you need to lowercase all the words:
from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stop_words = set(stopwords.words("english"))  # build the stopword set once
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower-cased
        if word not in stop_words:
            processed_word_list.append(word)
    return processed_word_list
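For example, applied to a small hypothetical token list:
print(remove_stopwords(["This", "is", "A", "Sample", "Sentence"]))
# ['sample', 'sentence']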
Answer 7
Using filter:
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
Answer 8
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text
Answer 9
In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopword list by default.
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
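A minimal usage sketch, assuming a small hypothetical DataFrame with a text column:
import pandas as pd
import texthero as hero

df = pd.DataFrame({'text': ['this is a sample sentence', 'remove the stop words from this one too']})
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
print(df['text_without_stopwords'])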
Answer 10
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Either a list comprehension ...
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# ... or the equivalent explicit loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
Answer 11
# 1) Build a second list that keeps only the non-stop words.
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]  # renamed from `list` to avoid shadowing the built-in
another_list = []
for x in userstring:
    if x not in stop_list:  # compare against the stop list
        another_list.append(x)
for x in another_list:
    print(x, end=' ')

# 2) If you prefer to use .remove, the code will be like this
# (iterate over a copy, because removing from a list while iterating over it skips elements):
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:  # iterate over a copy of userstring
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')