Tag Archives: tokenize

How to get rid of punctuation using the NLTK tokenizer?

Question: How to get rid of punctuation using the NLTK tokenizer?

I’m just starting to use NLTK and I don’t quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn’t work with multiple sentences: dots are added to the last word.


Answer 0

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
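
Note that \w+ splits hyphenated words such as "Eighty-seven". If you would rather keep them together, here is a hedged variant of the pattern (a sketch, not part of the original answer):

from nltk.tokenize import RegexpTokenizer

# Allow internal hyphens and apostrophes so "Eighty-seven" stays one token.
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*")
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']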

Answer 1

You do not really need NLTK to remove punctuation; you can remove it with plain Python. For strings (Python 2 str):

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)  # two-argument str.translate (Python 2 only)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. The string module has some other sets of characters that can be removed as well (such as string.digits).
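
On Python 3, str.translate no longer accepts the two-argument form used above. A minimal Python 3 sketch of the same idea, also dropping digits as mentioned in the P.S.:

import string

# Build a translation table that maps punctuation and digits to None.
table = str.maketrans('', '', string.punctuation + string.digits)

s = '... some string with punctuation and 123 digits ...'
print(s.translate(table))  # ' some string with punctuation and  digits '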


Answer 2

The code below will remove all punctuation marks as well as non-alphabetic tokens. Adapted from the NLTK book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)
words = [word.lower() for word in words if word.isalpha()]

print(words)

Output:

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

Note that word_tokenize splits "can't" into "ca" and "n't"; isalpha() then drops "n't" but keeps the fragment "ca", which is why it appears in the output.

Answer 3

As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not 'str' objects encoded with some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# list() is needed on Python 3, where filter() returns an iterator.
print(list(filter(lambda word: word not in ',-', tokens)))

Answer 4

I just used the following code, which removed all the punctuation:

import nltk

raw = "I can't do this now, because I'm so tired."  # any raw input text

tokens = nltk.wordpunct_tokenize(raw)
type(tokens)  # list

text = nltk.Text(tokens)
type(text)  # nltk.text.Text

words = [w.lower() for w in text if w.isalpha()]

Answer 5

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
# Drop tokens that consist entirely of punctuation characters.
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

This should work well in most cases, since it removes punctuation while preserving tokens like “n’t”, which cannot be obtained from regex tokenizers such as wordpunct_tokenize.
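
For contrast, a quick sketch of what wordpunct_tokenize does with contractions (expected output shown as a comment):

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("I can't do this now."))
# ['I', 'can', "'", 't', 'do', 'this', 'now', '.']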


Answer 6

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

…and then if you wish, you can replace certain tokens such as 'm with am.
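
For example, a minimal sketch of such a replacement step (the CONTRACTIONS mapping here is a hypothetical helper, not part of NLTK):

CONTRACTIONS = {"'m": 'am', "n't": 'not', "'re": 'are'}  # hypothetical mapping

tokens = ['I', "'m", 'a', 'southern', 'salesman']
tokens = [CONTRACTIONS.get(tok, tok) for tok in tokens]
# ['I', 'am', 'a', 'southern', 'salesman']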


Answer 7

I use this code to remove punctuation:

import nltk

def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And if you want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:

import enchant

d = enchant.Dict("en_US")
d.check("Hello")   # True
d.check("Helo")    # False
d.suggest("Helo")  # a list of suggestions, e.g. including 'Hello'
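
Combining the two steps, a hedged sketch that keeps only tokens PyEnchant recognizes as English words (assumes the pyenchant package and its dictionaries are installed):

import enchant
import nltk

d = enchant.Dict("en_US")
tokens = nltk.word_tokenize("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
# Keep alphabetic tokens that pass the dictionary check.
english_words = [w for w in tokens if w.isalpha() and d.check(w)]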

Answer 8

Remove punctuation (this removes '.' as well, as part of the punctuation handling) using the code below:

import sys
import unicodedata
from nltk.tokenize import word_tokenize

# text_string is assumed to hold the input text (see the sample below).
# Map every Unicode punctuation code point (category 'P*') to None.
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer has punctuation
w = word_tokenize(text_string)  # now tokenize the string

Sample Input/Output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']


Answer 9

Just adding to the solution by @rmalouf: this will not include any numbers, because \w+ is equivalent to [a-zA-Z0-9_]:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[a-zA-Z]+')  # '+' is needed to match whole words, not single letters
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

Answer 10

You can do it in one line without NLTK (Python 3.x):

import string

string_text = string_text.translate(str.maketrans('', '', string.punctuation))
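
A quick demonstration:

import string

string_text = "Hello, world! It's me."
print(string_text.translate(str.maketrans('', '', string.punctuation)))
# Hello world Its me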

Can a line of Python code know its indentation nesting level?

Question: Can a line of Python code know its indentation nesting level?

From something like this:

print(get_indentation_level())

    print(get_indentation_level())

        print(get_indentation_level())

I would like to get something like this:

1
2
3

Can the code read itself in this way?

All I want is the output from the more nested parts of the code to be more nested. In the same way that this makes code easier to read, it would make the output easier to read.

Of course I could implement this manually, using e.g. .format(), but what I had in mind was a custom print function which would print(i*' ' + string) where i is the indentation level. This would be a quick way to make readable output on my terminal.

Is there a better way to do this which avoids painstaking manual formatting?


Answer 0

If you want indentation in terms of nesting level rather than spaces and tabs, things get tricky. For example, in the following code:

if True:
    print(
get_nesting_level())

the call to get_nesting_level is actually nested one level deep, despite the fact that there is no leading whitespace on the line of the get_nesting_level call. Meanwhile, in the following code:

print(1,
      2,
      get_nesting_level())

the call to get_nesting_level is nested zero levels deep, despite the presence of leading whitespace on its line.

In the following code:

if True:
  if True:
    print(get_nesting_level())

if True:
    print(get_nesting_level())

the two calls to get_nesting_level are at different nesting levels, despite the fact that the leading whitespace is identical.

In the following code:

if True: print(get_nesting_level())

is that nested zero levels, or one? In terms of INDENT and DEDENT tokens in the formal grammar, it’s zero levels deep, but you might not feel the same way.


If you want to do this, you’re going to have to tokenize the whole file up to the point of the call and count INDENT and DEDENT tokens. The tokenize module would be very useful for such a function:

import inspect
import tokenize

def get_nesting_level():
    caller_frame = inspect.currentframe().f_back
    filename, caller_lineno, _, _, _ = inspect.getframeinfo(caller_frame)
    with open(filename) as f:
        indentation_level = 0
        for token_record in tokenize.generate_tokens(f.readline):
            token_type, _, (token_lineno, _), _, _ = token_record
            if token_lineno > caller_lineno:
                break
            elif token_type == tokenize.INDENT:
                indentation_level += 1
            elif token_type == tokenize.DEDENT:
                indentation_level -= 1
        return indentation_level
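
A hedged usage sketch, assuming the function above is defined in the same script as the calls (it re-reads and tokenizes the caller's source file):

if True:
    print(get_nesting_level())      # 1
    if True:
        print(get_nesting_level())  # 2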

Answer 1

Yeah, that’s definitely possible, here’s a working example:

import inspect

def get_indentation_level():
    callerframerecord = inspect.stack()[1]
    frame = callerframerecord[0]
    info = inspect.getframeinfo(frame)
    cc = info.code_context[0]
    # Number of leading whitespace characters on the calling line.
    return len(cc) - len(cc.lstrip())

if 1:
    print(get_indentation_level())
    if 1:
        print(get_indentation_level())
        if 1:
            print(get_indentation_level())
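
Note that this measures leading whitespace, so with 4-space indents the calls above print 4, 8, and 12, not 1, 2, and 3. A hedged tweak to turn that into a level count, assuming consistent 4-space indentation:

import inspect

def get_indentation_level(indent_width=4):
    cc = inspect.stack()[1].code_context[0]
    return (len(cc) - len(cc.lstrip())) // indent_width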

Answer 2

You can use sys._getframe() to get a frame object whose f_lineno attribute gives the currently executing line number. Then, to find the nesting level, find the most recent preceding line with zero indentation; subtracting that line's number from the current line number gives the indentation count:

import sys
current_frame = sys._getframe(0)

def get_ind_num():
    with open(__file__) as f:
        lines = f.readlines()
    current_line_no = current_frame.f_lineno
    to_current = lines[:current_line_no]
    previous_zero_ind = len(to_current) - next(
        i for i, line in enumerate(to_current[::-1]) if not line[0].isspace())
    return current_line_no - previous_zero_ind

Demo:

if True:
    print(get_ind_num())
    if True:
        print(get_ind_num())
        if True:
            print(get_ind_num())
            if True: print(get_ind_num())
# Output
1
3
5
6

If you want the indentation level based on how many of the preceding lines end with ':', you can do it with a little change:

def get_ind_num():
    with open(__file__) as f:
        lines = f.readlines()

    current_line_no = current_frame.f_lineno
    to_current = lines[:current_line_no]
    previous_zero_ind = len(to_current) - next(
        i for i, line in enumerate(to_current[::-1]) if not line[0].isspace())
    return sum(1 for line in lines[previous_zero_ind - 1:current_line_no]
               if line.strip().endswith(':'))

Demo:

if True:
    print(get_ind_num())
    if True:
        print(get_ind_num())
        if True:
            print(get_ind_num())
            if True: print(get_ind_num())
# Output
1
2
3
3

And as an alternative, here is a function for getting the amount of indentation (the number of leading whitespace characters):

import sys
from itertools import takewhile
current_frame = sys._getframe(0)

def get_ind_num():
    with open(__file__) as f:
        lines = f.readlines()
    return sum(1 for _ in takewhile(str.isspace, lines[current_frame.f_lineno - 1]))

Answer 3

To solve the ”real” problem that led to your question, you could implement a context manager which keeps track of the indentation level and make the with block structure in the code correspond to the indentation levels of the output. This way the code indentation still reflects the output indentation, without coupling the two too much. It is still possible to refactor the code into different functions and have other indentation based on code structure without messing with the output indentation.

#!/usr/bin/env python
# coding: utf8
from __future__ import absolute_import, division, print_function


class IndentedPrinter(object):

    def __init__(self, level=0, indent_with='  '):
        self.level = level
        self.indent_with = indent_with

    def __enter__(self):
        self.level += 1
        return self

    def __exit__(self, *_args):
        self.level -= 1

    def print(self, arg='', *args, **kwargs):
        print(self.indent_with * self.level + str(arg), *args, **kwargs)


def main():
    indented = IndentedPrinter()
    indented.print(indented.level)
    with indented:
        indented.print(indented.level)
        with indented:
            indented.print('Hallo', indented.level)
            with indented:
                indented.print(indented.level)
            indented.print('and back one level', indented.level)


if __name__ == '__main__':
    main()

Output:

0
  1
    Hallo 2
      3
    and back one level 2

Answer 4

>>> import inspect
>>> help(inspect.indentsize)
Help on function indentsize in module inspect:

indentsize(line)
    Return the indent size, in spaces, at the start of a line of text.
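
A quick usage sketch:

>>> inspect.indentsize('        print(x)')
8
>>> inspect.indentsize('no leading whitespace')
0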