Tag archive: string

Display a float with two decimal places in Python

Question: Display a float with two decimal places in Python

I have a function taking float arguments (generally integers or decimals with one significant digit), and I need to output the values in a string with two decimal places (5 -> 5.00, 5.5 -> 5.50, etc). How can I do this in Python?


Answer 0

You could use the string formatting operator for that:

>>> '%.2f' % 1.234
'1.23'
>>> '%.2f' % 5.0
'5.00'

The result of the operator is a string, so you can store it in a variable, print it, and so on.


Answer 1

Since this post might be here for a while, let's also point out the Python 3 syntax:

"{:.2f}".format(5)

Answer 2

f-string formatting:

This was new in Python 3.6: the string is placed in quotation marks as usual and prefixed with f, in the same way you would prefix a raw string with r. Then you place whatever you want to interpolate (variables, numbers, expressions) inside braces, as in f'some string text with a {variable} or {number} within that text', and Python evaluates it as with the previous string formatting methods, except that this method is much more readable.

>>> foobar = 3.141592
>>> print(f'My number is {foobar:.2f} - look at the nice rounding!')

My number is 3.14 - look at the nice rounding!

You can see in this example we format with decimal places in similar fashion to previous string formatting methods.

NB: foobar can be a number, a variable, or even an expression, e.g. f'{3*my_func(3.14):.2f}'.

Going forward, with new code I prefer f-strings over common %s or str.format() methods as f-strings can be far more readable, and are often much faster.
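
For comparison, here is a minimal sketch that formats the same value with the three approaches mentioned in this thread. The variable names are only illustrative, and 5.5 is one of the sample inputs from the question.

```python
# Three equivalent ways to render a number with two decimal places.
value = 5.5  # sample input taken from the question

percent_style = '%.2f' % value          # printf-style % operator
format_style = '{:.2f}'.format(value)   # str.format
fstring_style = f'{value:.2f}'          # f-string, Python 3.6+

print(percent_style, format_style, fstring_style)
```

All three use the same .2f format specification, so the choice is mostly about readability and which Python versions you need to support.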


Answer 3

String formatting:

print("%.2f" % 5)

Answer 4

Using Python string formatting:

>>> "%0.2f" % 3
'3.00'

Answer 5

String Formatting:

a = 6.789809823
print('%.2f' %a)

OR

print ("{0:.2f}".format(a)) 

The round() function can also be used:

print(round(a, 2))

A nice property of round() is that it returns a number, so we can store the result in another variable and use it in further calculations.

b = round(a, 2)
print(b)
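
One caveat worth illustrating: round() returns a float, so it does not keep trailing zeros the way the question asks (5 -> 5.00). A small sketch of the difference, using illustrative variable names:

```python
a = 5.0

rounded = round(a, 2)     # still a float; trailing zeros are not preserved
text = format(a, '.2f')   # a string with exactly two decimal places

print(rounded)
print(text)
```

So round() is the better fit when you need a number for further arithmetic, and a format specification is the better fit when you need the displayed text.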

Answer 6

Shortest Python 3 syntax:

n = 5
print(f'{n:.2f}')

Answer 7

If you want to keep the formatted value rather than only display it, use the built-in format(). Note that it returns a string; the number itself is unchanged.

Format it to 2 decimal places:

format(value, '.2f')

example:

>>> format(5.00000, '.2f')
'5.00'

Answer 8

I know this is an old question, but I struggled to find the answer myself. Here is what I came up with:

Python 3:

>>> num_dict = {'num': 0.123, 'num2': 0.127}
>>> "{0[num]:.2f}_{0[num2]:.2f}".format(num_dict) 
'0.12_0.13'

Answer 9

Using Python 3 syntax:

print('%.2f' % number)

Answer 10

If you want to limit a floating-point value to two decimal places at the moment you read it with input(), try this:

a = float(format(float(input()), '.2f'))  # e.g. the user enters 3.1415
print(a)                                  # prints 3.14

String formatting with named parameters?

Question: String formatting with named parameters?

I know it’s a really simple question, but I have no idea how to google it.

How can I do

print '<a href="%s">%s</a>' % (my_url)

So that my_url is used twice? I assume I have to “name” the %s and then use a dict in the params, but I’m not sure of the proper syntax?


just FYI, I’m aware I can just use my_url twice in the params, but that’s not the point :)


Answer 0

In Python 2.6+ and Python 3, you might choose to use the newer string formatting method.

print('<a href="{0}">{0}</a>'.format(my_url))

which saves you from repeating the argument, or

print('<a href="{url}">{url}</a>'.format(url=my_url))

if you want named parameters.

print('<a href="{}">{}</a>'.format(my_url, my_url))

which is strictly positional. The only caveat is that format() arguments follow the usual Python call rules: positional (unnamed) arguments must come before keyword arguments, and you can also unpack *args (a sequence such as a list or tuple) and **kwargs (a dict, keyed with strings if you know what's good for you). The interpolation points are filled first by substituting the named values at their labels, and then positionally from what's left. So, you can also do this…

print('<a href="{not_my_url}">{}</a>'.format(my_url, my_url, not_my_url=her_url))

But not this, which is a SyntaxError because a positional argument follows a keyword argument…

print('<a href="{not_my_url}">{}</a>'.format(my_url, not_my_url=her_url, my_url))
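
A minimal runnable sketch of the variants above. The two URLs are hypothetical placeholder values, since the question does not define my_url or her_url.

```python
# Hypothetical values; the original post leaves these undefined.
my_url = 'https://example.com/mine'
her_url = 'https://example.com/hers'

repeated = '<a href="{0}">{0}</a>'.format(my_url)          # one argument, used twice
named = '<a href="{url}">{url}</a>'.format(url=my_url)     # keyword argument, used twice
mixed = '<a href="{not_my_url}">{}</a>'.format(my_url, not_my_url=her_url)

print(repeated)
print(named)
print(mixed)
```

The first two produce the same link; the third shows a keyword field and an automatically numbered positional field in the same template.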

Answer 1

print '<a href="%(url)s">%(url)s</a>' % {'url': my_url}

Answer 2

Solution in Python 3.6+

Python 3.6 introduces literal string formatting, so you can format the named parameters without repeating any of them outside the string:

print(f'<a href="{my_url:s}">{my_url:s}</a>')

This will evaluate my_url, so if it’s not defined you will get a NameError. In fact, instead of my_url, you can write an arbitrary Python expression, as long as it evaluates to a string (because of the :s formatting code). If you want a string representation for the result of an expression that might not be a string, replace :s by !s, just like with regular, pre-literal string formatting.

For details on literal string formatting, see PEP 498, where it was first introduced.
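
To make the difference between :s and !s concrete, here is a small sketch with a hypothetical non-string value. The :s format code only accepts strings, while the !s conversion calls str() on the value first.

```python
count = 42  # hypothetical non-string value

converted = f'{count!s}'   # !s converts with str() before formatting

try:
    f'{count:s}'           # :s is not a valid format code for an int
    raised = False
except ValueError:
    raised = True

print(converted, raised)
```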


Answer 3

You will get addicted to this syntax.

Developers coming from C# 6.0 or ECMAScript will also find this syntax familiar.

In [1]: print '{firstname} {lastname}'.format(firstname='Mehmet', lastname='Ağa')
Mehmet Ağa

In [2]: print '{firstname} {lastname}'.format(**dict(firstname='Mehmet', lastname='Ağa'))
Mehmet Ağa

Answer 4

For building HTML pages, you want to use a templating engine, not simple string interpolation.


Answer 5

As well as the dictionary way, it may be useful to know the following format:

print '<a href="%s">%s</a>' % (my_url, my_url)

Here it’s a tad redundant, and the dictionary way is certainly less error prone when modifying the code, but it’s still possible to use tuples for multiple insertions. The first %s is substituted for the first element in the tuple, the second %s is substituted for the second element in the tuple, and so on for each element in the tuple.


Case-insensitive list sorting, without lowercasing the result?

Question: Case-insensitive list sorting, without lowercasing the result?

I have a list of strings like this:

['Aden', 'abel']

I want to sort the items, case-insensitive. So I want to get:

['abel', 'Aden']

But I get the opposite with sorted() or list.sort(), because uppercase appears before lowercase.

How can I ignore the case? I’ve seen solutions which involve lowercasing all list items, but I don’t want to change the case of the list items.


Answer 0

In Python 3.3+ there is the str.casefold method that’s specifically designed for caseless matching:

sorted_list = sorted(unsorted_list, key=str.casefold)

In Python 2 use lower():

sorted_list = sorted(unsorted_list, key=lambda s: s.lower())

It works for both normal and unicode strings, since they both have a lower method.

In Python 2 it works for a mix of normal and unicode strings, since values of the two types can be compared with each other. Python 3 doesn’t work like that, though: you can’t compare a byte string and a unicode string, so in Python 3 you should do the sane thing and only sort lists of one type of string.

>>> lst = ['Aden', u'abe1']
>>> sorted(lst)
['Aden', u'abe1']
>>> sorted(lst, key=lambda s: s.lower())
[u'abe1', 'Aden']
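
A short Python 3 sketch of why casefold can matter. The German words are my own illustrative data, not from the original answer: lower() leaves 'ß' unchanged, while casefold() maps it to 'ss'.

```python
unsorted_list = ['Aden', 'abel']
sorted_list = sorted(unsorted_list, key=str.casefold)

# casefold() is more aggressive than lower() for some non-ASCII letters.
same_with_lower = 'straße'.lower() == 'STRASSE'.lower()
same_with_casefold = 'straße'.casefold() == 'STRASSE'.casefold()

print(sorted_list, same_with_lower, same_with_casefold)
```

Either way the key function is only used for comparison, so the items in the result keep their original case.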

Answer 1

>>> x = ['Aden', 'abel']
>>> sorted(x, key=str.lower) # Or unicode.lower if all items are unicode
['abel', 'Aden']

In Python 3 str is unicode but in Python 2 you can use this more general approach which works for both str and unicode:

>>> sorted(x, key=lambda s: s.lower())
['abel', 'Aden']

Answer 2

You can also try this to sort the list in-place:

>>> x = ['Aden', 'abel']
>>> x.sort(key=lambda y: y.lower())
>>> x
['abel', 'Aden']

Answer 3

This works in Python 3 and does not involve lowercasing the result (!).

values.sort(key=str.lower)

Answer 4

In Python 3 you can use:

list1.sort(key=lambda x: x.lower())  # case-insensitive
list1.sort()                         # case-sensitive

Answer 5

I did it this way for Python 3.3:

def sortCaseIns(lst):
    # Pair each item with its lowercase form, sort the pairs,
    # then copy the original strings back into lst in sorted order.
    lst2 = [[x.lower(), x] for x in lst]
    lst2.sort()
    for i in range(len(lst)):
        lst[i] = lst2[i][1]

Then you can just call this function:

sortCaseIns(yourListToSort)
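
The same pair-and-sort idea can probably be expressed more compactly with a tuple key. This is a sketch under that assumption, with a hypothetical function name; unlike the function above it returns a new list instead of sorting in place.

```python
def sort_case_insensitive(items):
    """Return a new list sorted by lowercase form, ties broken by the original string."""
    return sorted(items, key=lambda s: (s.lower(), s))

result = sort_case_insensitive(['b', 'Aden', 'abel', 'B'])
print(result)
```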

Answer 6

Case-insensitive sort, sorting the list in place, in Python 2 or 3 (tested in Python 2.7.17 and Python 3.6.9):

>>> x = ["aa", "A", "bb", "B", "cc", "C"]
>>> x.sort()
>>> x
['A', 'B', 'C', 'aa', 'bb', 'cc']
>>> x.sort(key=str.lower)           # <===== there it is!
>>> x
['A', 'aa', 'B', 'bb', 'C', 'cc']

The key is key=str.lower. Here are the same commands without the prompts and output, for easy copy-pasting so you can test them:

x = ["aa", "A", "bb", "B", "cc", "C"]
x.sort()
x
x.sort(key=str.lower)
x

Note, however, that if your strings are unicode strings (like u'some string'), then in Python 2 only (not in Python 3) the above x.sort(key=str.lower) command will fail with the following error:

TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'

If you get this error, either upgrade to Python 3, where str is already Unicode, or first convert your unicode strings to byte strings (this only works if they contain nothing but ASCII characters) using a list comprehension, like this:

# for Python2, ensure all elements are ASCII (NOT unicode) strings first
x = [str(element) for element in x]  
# for Python2, this sort will only work on ASCII (NOT unicode) strings
x.sort(key=str.lower)

Answer 7

Try this

def cSort(inlist, minisort=True):
    sortlist = []
    newlist = []
    sortdict = {}
    for entry in inlist:
        try:
            lentry = entry.lower()
        except AttributeError:
            # Non-string entry: keep it as-is (lentry is undefined or stale here).
            sortlist.append(entry)
        else:
            try:
                sortdict[lentry].append(entry)
            except KeyError:
                sortdict[lentry] = [entry]
                sortlist.append(lentry)

    sortlist.sort()
    for entry in sortlist:
        try:
            thislist = sortdict[entry]
            if minisort: thislist.sort()
            newlist = newlist + thislist
        except KeyError:
            newlist.append(entry)
    return newlist

lst = ['Aden', 'abel']
print(cSort(lst))

Output

['abel', 'Aden']


sprintf-like functionality in Python

Question: sprintf-like functionality in Python

I would like to create a string buffer to do lots of processing, format and finally write the buffer in a text file using a C-style sprintf functionality in Python. Because of conditional statements, I can’t write them directly to the file.

For example, in pseudocode:

sprintf(buf,"A = %d\n , B= %s\n",A,B)
/* some processing */
sprintf(buf,"C=%d\n",c)
....
...
fprintf(file,buf)

So in the output file we have this kind of output:

A= foo B= bar
C= ded
etc...

Edit, to clarify my question:
buf is a big buffer that contains all the strings formatted using sprintf. Going by your examples, buf will only contain the current values, not the older ones. For example, I first wrote A= something, B= something into buf, and later C= something was appended to the same buf; but in your Python answers buf contains only the last value, which is not what I want. I want buf to hold everything I have printed since the beginning, like in C.


Answer 0

Python has a % operator for this.

>>> a = 5
>>> b = "hello"
>>> buf = "A = %d\n , B = %s\n" % (a, b)
>>> print buf
A = 5
 , B = hello

>>> c = 10
>>> buf = "C = %d\n" % c
>>> print buf
C = 10

See this reference for all supported format specifiers.

You could as well use format:

>>> print "This is the {}th tome of {}".format(5, "knowledge")
This is the 5th tome of knowledge

Answer 1

If I understand your question correctly, format() is what you are looking for, along with its mini-language.

Silly example for python 2.7 and up:

>>> print "{} ...\r\n {}!".format("Hello", "world")
Hello ...
 world!

For earlier python versions: (tested with 2.6.2)

>>> print "{0} ...\r\n {1}!".format("Hello", "world")
Hello ...
 world!

Answer 2

I’m not completely certain that I understand your goal, but you can use a StringIO instance as a buffer:

>>> import StringIO 
>>> buf = StringIO.StringIO()
>>> buf.write("A = %d, B = %s\n" % (3, "bar"))
>>> buf.write("C=%d\n" % 5)
>>> print(buf.getvalue())
A = 3, B = bar
C=5

Unlike sprintf, you just pass a string to buf.write, formatting it with the % operator or the format method of strings.

You could of course define a function to get the sprintf interface you’re hoping for:

def sprintf(buf, fmt, *args):
    buf.write(fmt % args)

which would be used like this:

>>> buf = StringIO.StringIO()
>>> sprintf(buf, "A = %d, B = %s\n", 3, "foo")
>>> sprintf(buf, "C = %d\n", 5)
>>> print(buf.getvalue())
A = 3, B = foo
C = 5
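
The session above is Python 2, where StringIO is its own module. In Python 3 the class lives in the io module; here is a minimal sketch of the same helper under that assumption:

```python
import io

def sprintf(buf, fmt, *args):
    """Append a %-formatted string to buf (same helper as above, for Python 3)."""
    buf.write(fmt % args)

buf = io.StringIO()
sprintf(buf, "A = %d, B = %s\n", 3, "foo")
sprintf(buf, "C = %d\n", 5)

contents = buf.getvalue()   # everything written so far, not just the last call
print(contents, end="")
```

Because each call appends to the same buffer, getvalue() returns the accumulated text, which is the behavior the question's edit asks for.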

Answer 3

Use the formatting operator %:

buf = "A = %d\n , B= %s\n" % (a, b)
print >>f, buf

Answer 4

You can use string formatting:

>>> a=42
>>> b="bar"
>>> "The number is %d and the word is %s" % (a,b)
'The number is 42 and the word is bar'

The % operator still works in Python 3, but str.format() is generally the recommended alternative:

>>> a=42
>>> b="bar"
>>> "The number is {0} and the word is {1}".format(a,b)
'The number is 42 and the word is bar'

Answer 5

To insert into a very long string it is nice to use names for the different arguments, instead of hoping they are in the right positions. This also makes it easier to replace multiple recurrences.

>>> 'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'

Taken from Format examples, where all the other Format-related answers are also shown.


Answer 6

This is probably the closest translation from your C code to Python code.

A = 1
B = "hello"
buf = "A = %d\n , B= %s\n" % (A, B)

c = 2
buf += "C=%d\n" % c

f = open('output.txt', 'w')
print >> f, buf  # write the accumulated buffer, not just c
f.close()

The % operator in Python does almost exactly the same thing as C’s sprintf. You can also print the string to a file directly. If there are lots of these string formatted stringlets involved, it might be wise to use a StringIO object to speed up processing time.

So instead of doing +=, do this:

import cStringIO
buf = cStringIO.StringIO()

...

print >> buf, "A = %d\n , B= %s\n" % (A, B)

...

print >> buf, "C=%d\n" % c

...

print >> f, buf.getvalue()

Answer 7

If you want something like the Python 3 print function, but writing to a string:

import io

def sprint(*args, **kwargs):
    sio = io.StringIO()
    print(*args, **kwargs, file=sio)
    return sio.getvalue()
>>> x = sprint('abc', 10, ['one', 'two'], {'a': 1, 'b': 2}, {1, 2, 3})
>>> x
"abc 10 ['one', 'two'] {'a': 1, 'b': 2} {1, 2, 3}\n"

or without the '\n' at the end:

def sprint(*args, end='', **kwargs):
    sio = io.StringIO()
    print(*args, **kwargs, end=end, file=sio)
    return sio.getvalue()
>>> x = sprint('abc', 10, ['one', 'two'], {'a': 1, 'b': 2}, {1, 2, 3})
>>> x
"abc 10 ['one', 'two'] {'a': 1, 'b': 2} {1, 2, 3}"

Answer 8

Something like…

greetings = 'Hello {name}'.format(name = 'John')

Hello John

Answer 9

Take a look at “Literal String Interpolation” https://www.python.org/dev/peps/pep-0498/

I found it through the http://www.malemburg.com/


Answer 10

Two approaches are to write to a string buffer or to write lines to a list and join them later. I think the StringIO approach is more pythonic, but didn’t work before Python 2.6.

from io import StringIO

with StringIO() as s:
   print("Hello", file=s)
   print("Goodbye", file=s)
   # And later...
   with open('myfile', 'w') as f:
       f.write(s.getvalue())

You can also use StringIO without a context manager (s = StringIO()). Currently, I’m using a context manager class with a print method. This fragment might be useful if you need to insert debugging output or handle odd paging requirements:

class Report:
    ... usual init/enter/exit
    def print(self, *args, **kwargs):
        with StringIO() as s:
            print(*args, **kwargs, file=s)
            out = s.getvalue()
        ... stuff with out

with Report() as r:
   r.print(f"This is {datetime.date.today()}!", 'Yikes!', end=':')
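
The other approach mentioned at the start of this answer, collecting lines in a list and joining them later, could look roughly like this. The values and the include_c flag are hypothetical stand-ins for the asker's data and conditional logic.

```python
lines = []
lines.append("A = %d, B = %s" % (1, "hello"))

include_c = True  # stands in for whatever condition decides what gets written
if include_c:
    lines.append("C = %d" % 2)

# Join once at the end; this string can then be passed to f.write().
report = "\n".join(lines) + "\n"
print(report, end="")
```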

'str' object does not support item assignment in Python

Question: 'str' object does not support item assignment in Python

I would like to read some characters from a string and put it into other string (Like we do in C).

So my code is like below

import string
import re
str = "Hello World"
j = 0
srr = ""
for i in str:
    srr[j] = i #'str' object does not support item assignment 
    j = j + 1
print (srr)

In C the code may be

i = j = 0; 
while(str[i] != '\0')
{
srr[j++] = str [i++];
}

How can I implement the same in Python?


Answer 0

In Python, strings are immutable, so you can’t change their characters in-place.

You can, however, do the following:

for i in str:
    srr += i

The reason this works is that it’s a shortcut for:

for i in str:
    srr = srr + i

The above creates a new string with each iteration, and stores the reference to that new string in srr.


Answer 1

The other answers are correct, but you can, of course, do something like:

>>> str1 = "mystring"
>>> list1 = list(str1)
>>> list1[5] = 'u'
>>> str1 = ''.join(list1)
>>> print(str1)
mystrung
>>> type(str1)
<type 'str'>

if you really want to.


Answer 2

Python strings are immutable, so what you would do in C is simply not possible in Python. You will have to create a new string.

I would like to read some characters from a string and put it into other string.

Then use a string slice:

>>> s1 = 'Hello world!!'
>>> s2 = s1[6:12]
>>> print s2
world!
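
Slicing can also be used to build a new string with one character changed, which is probably the closest equivalent to an item assignment. A small sketch with an illustrative string:

```python
s = "mystring"

try:
    s[5] = "u"            # item assignment is not supported on str
    assigned = True
except TypeError:
    assigned = False

# Build a new string from the pieces around index 5 instead.
changed = s[:5] + "u" + s[6:]
print(assigned, changed)
```

The original string s is left untouched; changed is a separate object.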

Answer 3

As aix mentioned, strings in Python are immutable (you cannot change them in place).

What you are trying to do can be done in many ways:

# "Copy" the string (this just binds a second name to the same immutable string)

foo = 'Hello'
bar = foo

# Create a new string by joining all characters of the old string

new_string = ''.join(c for c in oldstring)

# Slice and copy
new_string = oldstring[:]

Answer 4

Another approach if you wanted to swap out a specific character for another character:

def swap(input_string):
   if len(input_string) == 0:
     return input_string
   if input_string[0] == "x":
     return "y" + swap(input_string[1:])
   else:
     return input_string[0] + swap(input_string[1:])
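
For this particular task the built-in string methods are likely simpler than recursion, which would also fail on strings longer than the interpreter's recursion limit (1000 frames by default in CPython). A sketch with an illustrative input string:

```python
text = "xylophone box"

# Replace every "x" with "y", like the recursive swap() above.
swapped = text.replace("x", "y")

# str.translate handles several single-character substitutions in one pass.
table = str.maketrans({"x": "y", "o": "0"})
translated = text.translate(table)

print(swapped)
print(translated)
```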

Answer 5

How about this solution:

str = "Hello World"  # as stated in the problem
srr = str + ""


Answer 6

Hi you should try the string split method:

i = "Hello world"
output = i.split()

j = 'is not enough'

print 'The', output[1], j

How do you check in Python whether a string contains only numbers?

Question: How do you check in Python whether a string contains only numbers?

How do you check whether a string contains only numbers?

I’ve given it a go here. I’d like to see the simplest way to accomplish this.

import string

def main():
    isbn = input("Enter your 10 digit ISBN number: ")
    if len(isbn) == 10 and string.digits == True:
        print ("Works")
    else:
        print("Error, 10 digit number was not inputted and/or letters were inputted.")
        main()

if __name__ == "__main__":
    main()
    input("Press enter to exit: ")

Answer 0

You’ll want to use the isdigit method on your str object:

if len(isbn) == 10 and isbn.isdigit():

From the isdigit documentation:

str.isdigit()

Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
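
Because of the Unicode behavior described in the documentation excerpt, isdigit() accepts more than the characters 0-9. If only ASCII digits should pass (as for an ISBN), one option is to combine it with str.isascii(), which assumes Python 3.7 or later. The function name below is hypothetical.

```python
def is_ascii_digits(s):
    """True only for non-empty strings made up of the characters 0-9."""
    return s.isascii() and s.isdigit()   # str.isascii requires Python 3.7+

superscript_two = "\u00b2"  # SUPERSCRIPT TWO, a digit but not a decimal

print(superscript_two.isdigit())         # accepted by isdigit()
print(superscript_two.isdecimal())       # rejected by isdecimal()
print(is_ascii_digits(superscript_two))  # rejected by the stricter check
print(is_ascii_digits("0123456789"))
```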


Answer 1

Use str.isdigit:

>>> "12345".isdigit()
True
>>> "12345a".isdigit()
False
>>>

Answer 2

Use string isdigit function:

>>> s = '12345'
>>> s.isdigit()
True
>>> s = '1abc'
>>> s.isdigit()
False

回答 3

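You can also use a regular expression:

import re

Example 1: word = "3487954" (digits only)

re.match(r'^[0-9]*$', word)

Example 2: word = "3487.954" (digits and dots)

re.match(r'^[0-9.]*$', word)

Example 3: word = "3487.954 328" (digits, dots and spaces)

re.match(r'^[0-9. ]*$', word)

In all three examples, a match means the string contains nothing but the allowed characters. Note that * also accepts an empty string; use + instead if at least one character is required. Pick the pattern that fits your input.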


Answer 4

What about floating-point numbers, negative numbers, and so on? All of the previous examples reject them.

So far I have come up with something like this, though I think it could be a lot better:

'95.95'.replace('.','',1).isdigit()

This returns True only if there is at most one '.' in the string of digits.

'9.5.9.5'.replace('.','',1).isdigit()

This returns False. Note that a leading minus sign is still not handled by this approach.
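
One possible way to cover signs and a single decimal point in one check is a regular expression. This is only a sketch: the pattern and the function name looks_like_decimal are illustrative assumptions, not part of the original answer, and the accepted format (optional sign, ASCII digits, optional fractional part) is a choice you may want to adjust.

```python
import re

# Hypothetical pattern: optional sign, ASCII digits, optional ".digits" part.
_DECIMAL = re.compile(r'[+-]?[0-9]+(?:\.[0-9]+)?')

def looks_like_decimal(text):
    """Return True if text is a plain signed decimal such as '-3' or '95.95'."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return isinstance(text, str) and _DECIMAL.fullmatch(text) is not None

print(looks_like_decimal('95.95'))    # accepted
print(looks_like_decimal('9.5.9.5'))  # rejected: two dots
```

Forms such as '.5', '5.' or '1e3' are rejected by this particular pattern; if they should count as numbers, attempting float(text) inside try/except ValueError is a looser alternative.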


Answer 5

As pointed out in a comment on this question, the isdigit() method is not totally accurate for this use case, because it returns True for some digit-like characters:

>>> "\u2070".isdigit() # unicode escaped 'superscript zero' 
True

If this needs to be avoided, the following simple function checks whether every character in a string is a digit between "0" and "9":

import string

def contains_only_digits(s):
    # True for "", "0", "123"
    # False for "1.2", "1,2", "-1", "a", "a1"
    for ch in s:
        if not ch in string.digits:
            return False
    return True

Used in the example from the question:

if len(isbn) == 10 and contains_only_digits(isbn):
    print ("Works")
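
A shorter variant of the same idea can be sketched with re.fullmatch. The helper names is_ascii_digits and is_isbn10_candidate are made up for illustration. Unlike contains_only_digits above, this version returns False for an empty string.

```python
import re

def is_ascii_digits(text):
    """Return True only for non-empty strings made of the characters 0-9."""
    # [0-9] is an explicit ASCII range, so superscripts and other
    # Unicode digits that str.isdigit() accepts are rejected here.
    return re.fullmatch(r'[0-9]+', text) is not None

def is_isbn10_candidate(text):
    """Length-and-digits check from the question (no checksum validation)."""
    return len(text) == 10 and is_ascii_digits(text)

print(is_ascii_digits("12345"))
print(is_ascii_digits("\u2070"))   # superscript zero
```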

Answer 6

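You can use a try/except block here:

s = "1234"
try:
    num = int(s)
    print("S contains only digits")
except ValueError:
    print("S doesn't contain digits ONLY")

Note that int() also accepts a leading sign and surrounding whitespace, so this check is looser than isdigit().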

Answer 7

Every time I have run into a problem with this check, it was because the string can sometimes be None. In that case calling isdigit() alone is not enough, because you will get an error:

AttributeError: 'NoneType' object has no attribute 'isdigit'

So you first need to check whether the value is None. To avoid nested if branches, a clear way to do this is:

if s and s.isdigit():

Here s is the value being checked; naming the variable str would shadow the built-in type.

Hope this helps people who have the same issue.


Answer 8

There are two methods I can think of to check whether a string consists entirely of digits.

Method 1 (using Python's built-in isdigit() method):

>>> st = '12345'
>>> st.isdigit()
True
>>> st = '1abcd'
>>> st.isdigit()
False

Method 2 (exception handling around int()):

st = "1abcd"
try:
    number = int(st)
    print("String has all digits in it")
except ValueError:
    print("String does not have all digits in it")

The output of the above code will be:

String does not have all digits in it

Answer 9

You can use the str.isdigit() method or the str.isnumeric() method.


Remove unwanted parts from strings in a column

Question: Remove unwanted parts from strings in a column

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!


Answer 0

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

Answer 1

How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of “vectorised” string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching.
df['result'] = df['result'].str.replace(r'\D', '', regex=True)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df['result'] = df['result'].str.replace(r'\D', '', regex=True).astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don’t want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df['result'].str.replace(r'\D', '', regex=True))
df
# Unchanged

.str.extract

Useful for extracting the substring(s) you want to keep.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.


.str.split and .str.get

Splitting works assuming all your strings follow this consistent structure.

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

This is not recommended if you are looking for a general solution.


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be rewritten using re.sub:

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search,

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

import numpy as np

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
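
The same guard can be sketched without pandas on a plain list, which makes the failure cases easier to see. The name extract_digits and the sample values are illustrative assumptions. An explicit type check is used here instead of catching exceptions, and None stands in for NaN as the "no result" marker.

```python
import re

DIGITS = re.compile(r'\d+')

def extract_digits(value, default=None):
    """Return the first run of digits in value, or default if there is none."""
    if not isinstance(value, str):      # e.g. None or float('nan')
        return default
    match = DIGITS.search(value)
    return match.group(0) if match else default

raw = ['+52A', '-110a', None, 'n/a', float('nan')]
cleaned = [extract_digits(x) for x in raw]
print(cleaned)
```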

We can also re-write @eumiro’s and @MonkeyButter’s answers using list comprehensions:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,

df['result'] = [x[1:-1] for x in df['result']]

Same rules for handling NaNs, etc, apply.


Performance Comparison

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP’s data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', '', regex=True))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's. so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))

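# p1 and p2 are assumed to be the pre-compiled patterns from the earlier
# examples, i.e. re.compile(r'\D') and re.compile(r'\d+').
def cs1_listcomp(df):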
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])

Answer 2

I’d use the pandas replace function, which is very simple and powerful because it accepts regular expressions. Below I’m using the regex \D to remove any non-digit characters, but you could obviously get quite creative with regex.

data['result'] = data['result'].replace(to_replace=r'\D', value='', regex=True)

Answer 3

In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])

Answer 4

There’s a bug here: you currently cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result

Answer 5

A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+' which extracts any number of digits.

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

Answer 6

I often use list comprehensions for these types of tasks because they’re often faster.

There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest – see code race below for this task:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop

Answer 7

Suppose your DF also has those extra characters in between the digits, as in the last entry.

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from start and end but also from in between.

DF['result'] = DF['result'].str.replace(r'[+\-abAB]', '', regex=True)

Output:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00

Answer 8

Try this, using a regular expression:

import re
data['result'] = data['result'].map(lambda x: re.sub(r'[-+A-Za-z]', '', x))

Speed up millions of regex replacements in Python 3

Question: Speed up millions of regex replacements in Python 3

I’m using Python 3.5.2

I have two lists

  • a list of about 750,000 “sentences” (long strings)
  • a list of about 20,000 “words” that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it’s not much of an improvement.

Thank you for any suggestions.


Answer 0

One thing you can try is to compile one single pattern like r"\b(word1|word2|word3)\b" (a raw string, so that \b means a word boundary rather than a backspace).

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regex, Eric’s answer is faster.
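
A minimal sketch of the single-pattern idea, assuming the words are literal strings rather than regex fragments. The helper name build_union_pattern and the sample words are illustrative, not from the original answer. re.escape guards against words that contain regex metacharacters, and a non-capturing group avoids storing the matched word.

```python
import re

def build_union_pattern(words):
    """Compile one pattern that matches any of the given whole words."""
    escaped = (re.escape(word) for word in words)
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b')

pattern = build_union_pattern(['cat', 'dog'])
# One pass over the sentence removes every listed word at word boundaries.
result = pattern.sub('', 'cat and dog, but not catalog or hotdog')
print(repr(result))
```

Because of the \b anchors, "catalog" and "hotdog" are left untouched while the standalone words are removed.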


Answer 1

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, give this function as argument to re.sub and you’re done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With python3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, …) will stay untouched.
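
For experimenting without a system dictionary file, the same technique can be sketched with a tiny hand-written set. The sample words, the helper names drop_banned and clean, and the decision to collect the results in a list are assumptions made for illustration.

```python
import re

banned_words = {'boring', 'another'}   # hypothetical sample; store in lowercase
word_pattern = re.compile(r'\w+')

def drop_banned(match):
    """Replacement callback: delete the word if it is in the banned set."""
    word = match.group(0)
    return '' if word.lower() in banned_words else word

def clean(sentence):
    # One pass per sentence; each word costs a single set lookup.
    return word_pattern.sub(drop_banned, sentence)

sentences = ['Another boring sentence.', 'GiraffeElephantBoat']
cleaned = [clean(s) for s in sentences]
print(cleaned)
```

Deleting two adjacent words leaves the spaces around them in place, as the notes above point out.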

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye’s answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, OP’s and Liteye’s code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn’t


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.


Answer 2

TLDR

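Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it is approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized regex with a Trie

A simple regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It is possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex is not really human-readable, but it does allow for very fast lookup and matching.

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie: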

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

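And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that, to test whether zoo matches, the regex engine only needs to compare the first character (which doesn't match) instead of trying all 5 words. It is preprocessing overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar or baz, but not foobaz
  • foo(bar|baz) would save unneeded information to a capturing group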
Here is a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

测试

这是一个小测试(与测试相同):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

它输出:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

对于信息,正则表达式开始如下:

(?:a(?:(?:\’s|a(?:\’s|chen|liyah(?:\’s)?|r(?:dvark(?:(?:\’s|s))?|on))|b(?:\’s|a(?:c(?:us(?:(?:\’s|es))?|[ik])|ft|lone(?:(?:\’s|s))?|ndon(?:(?:ed|ing|ment(?:\’s)?|s))?|s(?:e(?:(?:ment(?:\’s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\’s)?|[ds]))?|ing|toir(?:(?:\’s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\’s|es))?|y(?:(?:\’s|s))?)|ot(?:(?:\’s|t(?:\’s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|y(?:\’s)?|\é(?:(?:\’s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|om(?:en(?:(?:\’s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\’s|s))?)|or(?:(?:\’s|s))?|s))?|l(?:\’s)?))|e(?:(?:\’s|am|l(?:(?:\’s|ard|son(?:\’s)?))?|r(?:deen(?:\’s)?|nathy(?:\’s)?|ra(?:nt|tion(?:(?:\’s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\’s|s))?|d)|ing|or(?:(?:\’s|s))?)|s))?|yance(?:\’s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\’s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\’s)?)|gail|l(?:ene|it(?:ies|y(?:\’s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\’s|s))?|e[ds]?|ing))
…

这确实让人难以理解,但是对于100000个禁用词的列表而言,此Trie regex比简单的regex联合快1000倍!

这是完整trie的图,使用trie-python-graphviz和graphviz twopi导出:

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.

If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn’t do a very good job of optimizing the pattern.

It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn’t match), instead of trying the 5 words. It’s a preprocess overkill for 5 words, but it shows promising results for many thousand words.
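
As a quick sanity check, the pattern above can be tried against the five example words and a few near misses. This is only a minimal sketch; the list of near misses is made up for illustration and is not part of the original answer.

```python
import re

# The pattern derived from the trie in the example above.
pattern = re.compile(r"\bfoo(?:ba[hr]|xar|zap?)\b")

words = ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

# Hypothetical near misses, chosen only to exercise the pattern.
near_misses = ['zoo', 'foo', 'foobaz', 'foozapp']

for word in words + near_misses:
    status = "match" if pattern.search(word) else "no match"
    print("%-8s %s" % (word, status))
```

Every word from the list should be reported as a match and every near miss as no match.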

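Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar or baz, but not foobaz
  • foo(bar|baz) would save unneeded information to a capturing group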

Code

Here’s a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

Test

Here’s a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\’s|a(?:\’s|chen|liyah(?:\’s)?|r(?:dvark(?:(?:\’s|s))?|on))|b(?:\’s|a(?:c(?:us(?:(?:\’s|es))?|[ik])|ft|lone(?:(?:\’s|s))?|ndon(?:(?:ed|ing|ment(?:\’s)?|s))?|s(?:e(?:(?:ment(?:\’s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\’s)?|[ds]))?|ing|toir(?:(?:\’s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\’s|es))?|y(?:(?:\’s|s))?)|ot(?:(?:\’s|t(?:\’s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|y(?:\’s)?|\é(?:(?:\’s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|om(?:en(?:(?:\’s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\’s|s))?)|or(?:(?:\’s|s))?|s))?|l(?:\’s)?))|e(?:(?:\’s|am|l(?:(?:\’s|ard|son(?:\’s)?))?|r(?:deen(?:\’s)?|nathy(?:\’s)?|ra(?:nt|tion(?:(?:\’s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\’s|s))?|d)|ing|or(?:(?:\’s|s))?)|s))?|yance(?:\’s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\’s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\’s)?)|gail|l(?:ene|it(?:ies|y(?:\’s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\’s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\’s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\’s|s))?|y)|m\’s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\’s)?))|r(?:\’s)?)|ormal(?:(?:it(?:ies|y(?:\’s)?)|ly))?)|o(?:ard|de(?:(?:\’s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\’s|ist(?:(?:\’s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|r(?:igin(?:al(?:(?:\’s|s))?|e(?:(?:\’s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\’s|ist(?:(?:\’s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\’s|board))?)|r(?:a(?:cadabra(?:\’s)?|d(?:e[ds]?|ing)|ham(?:\’s)?|m(?:(?:\’s|s))?|si(?:on(?:(?:\’s|s))?|ve(?:(?:\’s|ly|ness(?:\’s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\’s|s))?|[ds]))?|ing|ment(?:(?:\’s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\’s)?))?)|s(?:alom|c(?:ess(?:(?:\’s|e[ds]|ing))?|issa(?:(?:\’s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\’s|s))?|t(?:(?:e(?:e(?:(?:\’s|ism(?:\’s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\’s|e(?:\’s)?))?|o(?:l(?:ut(?:e(?:(?:\’s|ly|st?))?|i(?:on(?:\’s)?|sm(?:\’s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\’s)?|t(?:(?:\’s|s))?)|d)|ing|s))?|pti
…

It’s really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here’s a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:


回答 3

您可能想尝试的一件事是对句子进行预处理以对单词边界进行编码。基本上,通过划分单词边界将每个句子变成单词列表。

这应该更快,因为要处理一个句子,您只需要逐步检查每个单词并检查它是否匹配即可。

当前,正则表达式搜索每次必须再次遍历整个字符串,以查找单词边界,然后在下一次遍历之前“舍弃”这项工作的结果。

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.

Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
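
A minimal sketch of this idea, assuming the banned words are stored lowercase in a set. The names WORD_OR_GAP and remove_banned are hypothetical, not from the answer. Each sentence is tokenized once into alternating runs of word and non-word characters, and each word is then checked with a set lookup instead of a regex search.

```python
import re

# Alternating runs of word characters and non-word characters, so the
# original spacing and punctuation can be put back unchanged.
WORD_OR_GAP = re.compile(r"\w+|\W+")

def remove_banned(sentence, banned):
    """Drop every word whose lowercase form is in the set `banned`."""
    tokens = WORD_OR_GAP.findall(sentence)
    return "".join(t for t in tokens if t.lower() not in banned)

print(remove_banned("The quick brown fox", {"quick"}))
```

Note that the separators around a removed word are kept, so the example leaves two spaces between "The" and "brown".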


回答 4

好吧,这是一个快速简单的解决方案,带有测试仪。

取胜策略:

re.sub(r"\w+", repl, sentence)搜索单词。

“ repl”可以是可调用的。我使用了一个执行字典查找的函数,该字典包含要搜索和替换的单词。

这是最简单,最快的解决方案(请参见下面的示例代码中的函数replace4)。

次好的

想法是使用re.split将句子拆分为单词,同时保留分隔符以稍后重建句子。然后,通过简单的字典查找完成替换。

(请参见下面的示例代码中的函数replace3)。

功能示例的时间:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…和代码:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

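编辑:如果patterns_dict中的键都是小写的,可以修改repl,使查找不区分大小写:

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        # Look up the lowercased word; fall back to the original word.
        return pd(w.lower(),w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)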

Well, here’s a quick and easy solution, with test set.

Winning strategy:

re.sub(r"\w+", repl, sentence) searches for words.

“repl” can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in example code below).

Timings for example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…and code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

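Edit: You can also make the lookup case-insensitive if the keys of patterns_dict are lowercase and you edit repl:

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        # Look up the lowercased word; fall back to the original word.
        return pd(w.lower(),w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)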

回答 5

也许Python不是这里的正确工具。这是Unix工具链中的一个

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

假设您的黑名单文件已经过预处理,并添加了字边界。步骤是:将文件转换为双倍行距,将每个句子拆分为每行一个单词,从文件中批量删除黑名单单词,然后合并回行。

这应该至少快一个数量级。

用于从单词中预处理黑名单文件(每行一个单词)

sed 's/.*/\\b&\\b/' words > blacklist

Perhaps Python is not the right tool here. Here is one with the Unix toolchain

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line)

sed 's/.*/\\b&\\b/' words > blacklist

回答 6

这个怎么样:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter, '') # yield the word separator ('' after the last word)

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        try:
            while True:
                last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                yield sentence[last_word_boundary:current_boundary] # yield the separators
                last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                word = sentence[last_word_boundary:current_boundary]
                norm_word = word.lower()
                if norm_word not in banned_words:
                    yield word
        except StopIteration:
            # No boundaries left; end the generator explicitly, since Python 3.7+
            # (PEP 479) turns a StopIteration leaking out of a generator into RuntimeError.
            return

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b'. ')
print('time:', time.time() - start)

这些解决方案在单词边界上拆分,并在集合中查找每个单词。它们应该比使用单词交替的re.sub(Liteyes的解决方案)更快:由于集合查找的摊销复杂度为O(1),这些解决方案是O(n),其中n是输入的大小;而使用正则表达式交替时,引擎必须在每个字符处(而不仅仅是在单词边界处)检查单词是否匹配。我的解决方案格外小心地保留了原始文本中使用的空白(即不压缩空白,并保留制表符、换行符和其他空白字符),但如果您不在意这一点,从输出中删除它们应该相当简单。

我在corpus.txt上进行了测试,它是从Project Gutenberg下载的多本电子书的串联;banned_words.txt是从Ubuntu的单词表(/usr/share/dict/american-english)中随机选择的20000个单词。处理862462个句子大约需要30秒(在PyPy上约为一半时间)。我将句子定义为以“. ”分隔的任何内容。

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy特别受益于第二种方法,而CPython在第一种方法上表现更好。上面的代码在Python 2和Python 3上都可以使用。

How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter, '') # yield the word separator ('' after the last word)

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        try:
            while True:
                last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                yield sentence[last_word_boundary:current_boundary] # yield the separators
                last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                word = sentence[last_word_boundary:current_boundary]
                norm_word = word.lower()
                if norm_word not in banned_words:
                    yield word
        except StopIteration:
            # No boundaries left; end the generator explicitly, since Python 3.7+
            # (PEP 479) turns a StopIteration leaking out of a generator into RuntimeError.
            return

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b'. ')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternation (Liteyes’ solution): thanks to the amortized O(1) set lookup they are O(n), where n is the size of the input, whereas with regex alternation the engine has to check for word matches at every character rather than only at word boundaries. My solutions take extra care to preserve the whitespace used in the original text (i.e. they don’t compress whitespace, and they preserve tabs, newlines, and other whitespace characters), but if you decide you don’t care about that, it should be fairly straightforward to remove it from the output.

I tested on corpus.txt, a concatenation of multiple eBooks downloaded from Project Gutenberg; banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by ". ".

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy benefits particularly from the second approach, while CPython fared better with the first. The above code should work on both Python 2 and 3.


回答 7

实用方法

下述解决方案使用大量内存将所有文本存储在同一字符串中,并降低了复杂度。如果RAM是一个问题,请在使用前三思。

使用join/ split技巧,您可以完全避免循环,从而可以加快算法的速度。

  • 用一个不会出现在任何句子中的特殊分隔符把句子连接起来:

    merged_sentences = ' * '.join(sentences)

  • 使用|“或”正则运算符,为需要从句子中删除的所有单词编译一个正则表达式:

    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • 用已编译的正则表达式替换这些单词,然后按特殊分隔符把结果拆分回单独的句子:

    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

    性能

    "".join复杂度为O(n)。这是非常直观的,但是无论如何都会有一个简短的报价来源:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);

    因此,join/split的复杂度为O(words) + 2 * O(sentences),仍然是线性的,而初始方法为2 * O(N²)。


    顺便说一句,不要使用多线程。GIL将阻止每个操作,因为您的任务严格地受CPU限制,因此GIL没有机会被释放,但是每个线程将同时发送滴答声,这会导致额外的工作量,甚至导致操作达到无穷大。

    Practical approach

    The solution described below uses a lot of memory, because it stores all the text in a single string in order to reduce complexity. If RAM is an issue, think twice before using it.

    With join/split tricks you can avoid explicit loops altogether, which should speed up the algorithm.

  • Concatenate the sentences with a special delimiter that does not occur in any of the sentences:

    merged_sentences = ' * '.join(sentences)

  • Compile a single regex for all the words you need to remove from the sentences, using the | “or” regex operator:

    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • Substitute the words using the compiled regex, then split the result on the special delimiter to get back separate sentences:

    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
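
The three steps above can be put together roughly as follows. This is a sketch under a few assumptions that are not in the answer: the function name remove_words is made up, the words are passed through re.escape so that regex metacharacters in them are matched literally, and a NUL-based delimiter is used instead of ' * ' on the assumption that NUL never occurs in real text (the function checks this rather than trusting it).

```python
import re

def remove_words(sentences, words, delimiter=" \x00 "):
    """Remove whole-word, case-insensitive matches of `words` from all sentences at once."""
    # The trick only works if the delimiter never occurs inside a sentence.
    if any(delimiter in sentence for sentence in sentences):
        raise ValueError("delimiter occurs in a sentence; pick another one")
    # One alternation for all words; re.escape keeps metacharacters literal.
    alternation = "|".join(re.escape(word) for word in words)
    regex = re.compile(r"\b(?:{})\b".format(alternation), re.I)
    merged = delimiter.join(sentences)
    return regex.sub("", merged).split(delimiter)

print(remove_words(["I like spam", "Spam and eggs"], ["spam"]))
```

As in the original one-liners, the whitespace around a removed word is left in place.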
    

    Performance

    "".join complexity is O(n). This is pretty intuitive, but here is a shortened excerpt from the CPython source anyway:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);
    

    Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs 2*O(N²) with the initial approach.


    By the way, don’t use multithreading. Your task is strictly CPU-bound, so the GIL has no chance to be released and will block each operation; the threads will still compete for it concurrently, which causes extra overhead and can even make the operation appear never to finish.


    回答 8

    将所有句子连接到一个文档中。使用Aho-Corasick算法的任何实现(这里是)来查找所有“不好”的单词。遍历文件,替换每个坏词,更新后跟的发现词的偏移量等。

    Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
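
To make the idea concrete, here is a small pure-Python sketch of an Aho-Corasick automaton. It is not the implementation linked in the answer, and the names build_automaton and find_all are invented for this illustration. It only locates the words (as substrings, with their start offsets); checking word boundaries and doing the replacement are left out.

```python
from collections import deque

def build_automaton(words):
    """Build goto/fail/output tables for a minimal Aho-Corasick automaton."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # words recognized when a state is reached
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass; states directly below the root keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_offset, word) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The text is scanned once regardless of how many words are in the list, which is the property the answer relies on.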


    在Python字符串中转义正则表达式特殊字符

    问题:在Python字符串中转义正则表达式特殊字符

    Python是否具有可用来在正则表达式中转义特殊字符的函数?

    例如,I'm "stuck" :\应成为I\'m \"stuck\" :\\

    Does Python have a function that I can use to escape special characters in a regular expression?

    For example, I'm "stuck" :\ should become I\'m \"stuck\" :\\.


    回答 0

    re.escape

    >>> import re
    >>> re.escape(r'\ a.*$')
    '\\\\\\ a\\.\\*\\$'
    >>> print(re.escape(r'\ a.*$'))
    \\\ a\.\*\$
    >>> re.escape('www.stackoverflow.com')
    'www\\.stackoverflow\\.com'
    >>> print(re.escape('www.stackoverflow.com'))
    www\.stackoverflow\.com
    

    在这里重复:

    re.escape(字符串)

    返回所有非字母数字加反斜杠的字符串;如果您想匹配其中可能包含正则表达式元字符的任意文字字符串,这将很有用。

    从Python 3.7 re.escape()开始,更改为仅转义对正则表达式操作有意义的字符。

    Use re.escape

    >>> import re
    >>> re.escape(r'\ a.*$')
    '\\\\\\ a\\.\\*\\$'
    >>> print(re.escape(r'\ a.*$'))
    \\\ a\.\*\$
    >>> re.escape('www.stackoverflow.com')
    'www\\.stackoverflow\\.com'
    >>> print(re.escape('www.stackoverflow.com'))
    www\.stackoverflow\.com
    

    Repeating it here:

    re.escape(string)

    Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

    As of Python 3.7 re.escape() was changed to escape only characters which are meaningful to regex operations.
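
A short sketch of the typical use case: matching a string literally even though it contains regex metacharacters. The sample strings are made up for illustration.

```python
import re

literal = "1+1=2 (really?)"
text = "Proof: 1+1=2 (really?) yes"

# Unescaped, "+", "(", ")" and "?" are treated as regex syntax,
# so the pattern does not find the literal text.
print(re.search(literal, text))

# Escaped, every metacharacter is matched literally.
escaped = re.compile(re.escape(literal))
match = escaped.search(text)
print(match.group(0) if match else None)
```

The exact escaped pattern string differs between Python versions (see the 3.7 note above), but in both cases it matches the same literal text.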


    回答 1

    我很惊讶没有人提到通过re.sub()以下方式使用正则表达式:

    import re
    print(re.sub(r'([\"])',    r'\\\1', 'it\'s "this"'))  # it's \"this\"
    print(re.sub(r"([\'])",    r'\\\1', 'it\'s "this"'))  # it\'s "this"
    print(re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"'))  # it\'s\ \"this\"

    重要注意事项:

    • 搜索模式中,包括\您要查找的字符。你会使用\逃脱你的角色,所以你需要逃避 为好。
    • 搜索模式周围加上括号,例如([\"]),以便替换 模式在找到的字符添加\到其前面时可以使用该字符。(这就是 \1作用:使用第一个带括号的组的值。)
    • r前面r'([\"])'意味着它是一个原始字符串。原始字符串使用不同的规则来转义反斜杠。要([\"])以纯字符串形式编写,您需要将所有反斜杠加倍,并写入'([\\"])'。在编写正则表达式时,原始字符串更友好。
    • 替代模式,你需要转义\从先于一个取代基的反斜杠,例如区分\1,因此r'\\\1'。写 的是作为一个普通的字符串,你需要'\\\\\\1'-大家都不希望发生。

    I’m surprised no one has mentioned using regular expressions via re.sub():

    import re
    print(re.sub(r'([\"])',    r'\\\1', 'it\'s "this"'))  # it's \"this\"
    print(re.sub(r"([\'])",    r'\\\1', 'it\'s "this"'))  # it\'s "this"
    print(re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"'))  # it\'s\ \"this\"
    

    Important things to note:

    • In the search pattern, include \ as well as the character(s) you’re looking for. You’re going to be using \ to escape your characters, so you need to escape that as well.
    • Put parentheses around the search pattern, e.g. ([\"]), so that the substitution pattern can use the found character when it adds \ in front of it. (That’s what \1 does: uses the value of the first parenthesized group.)
    • The r in front of r'([\"])' means it’s a raw string. Raw strings use different rules for escaping backslashes. To write ([\"]) as a plain string, you’d need to double all the backslashes and write '([\\"])'. Raw strings are friendlier when you’re writing regular expressions.
    • In the substitution pattern, you need to escape \ to distinguish it from a backslash that precedes a substitution group, e.g. \1, hence r'\\\1'. To write that as a plain string, you’d need '\\\\\\1' — and nobody wants that.
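
The points above can be checked directly. This is a small sketch in Python 3; escape_quotes is a hypothetical helper name used only for illustration, not something from the answer:

```python
import re

# The raw and plain spellings describe exactly the same strings.
assert r'([\"])' == '([\\"])'
assert r'\\\1' == '\\\\\\1'


def escape_quotes(text):
    """Put a backslash in front of every single or double quote."""
    # Group 1 captures the quote; \\ in the replacement emits a literal
    # backslash and \1 re-inserts the captured quote after it.
    return re.sub(r'([\"\'])', r'\\\1', text)


print(escape_quotes('it\'s "this"'))  # it\'s \"this\"
```

Note that this sketch leaves backslashes already present in the input untouched.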

    Answer 2

    Use repr()[1:-1]. In this case, the double quotes don’t need to be escaped. The [1:-1] slice removes the single quotes from the beginning and the end.

    >>> x = input()
    I'm "stuck" :\
    >>> print(x)
    I'm "stuck" :\
    >>> print(repr(x)[1:-1])
    I\'m "stuck" :\\
    

    Or maybe you just want to escape a phrase to paste into your program? If so, do this:

    >>> input()
    I'm "stuck" :\
    'I\'m "stuck" :\\'
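
A non-interactive sketch of the same idea, assuming Python 3 and a hard-coded sample string in place of user input. One caveat worth knowing: repr() picks its quote style from the content, so a string containing only single quotes comes back wrapped in double quotes with nothing escaped.

```python
text = 'I\'m "stuck" :\\'    # the characters: I'm "stuck" :\

quoted = repr(text)          # includes the surrounding quote characters
escaped = quoted[1:-1]       # drop the first and last character

print(escaped)               # I\'m "stuck" :\\

# With no double quotes in the string, repr() switches to double quotes
# and leaves the apostrophe unescaped.
print(repr("it's")[1:-1])    # it's
```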
    

    Answer 3

    As mentioned above, the answer depends on your case. If you want to escape a string for use in a regular expression, use re.escape(). If you want to escape a specific set of characters, use this lambda function:

    >>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
    >>> s = input()
    I'm "stuck" :\
    >>> print(s)
    I'm "stuck" :\
    >>> print(escape(s, "\\", ['"']))
    I'm \"stuck\" :\\
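
The same behaviour can also be sketched with str.translate instead of a generator expression. This is an alternative technique, not the answer's own code; escape_chars is a hypothetical name, and the sketch assumes Python 3 and a single-character escape character:

```python
def escape_chars(text, special_chars, escape_char="\\"):
    """Prefix each special character, and the escape character itself."""
    # Map every character to be escaped onto its two-character replacement.
    table = str.maketrans(
        {c: escape_char + c for c in set(special_chars) | {escape_char}}
    )
    return text.translate(table)


print(escape_chars('I\'m "stuck" :\\', '"'))  # I'm \"stuck\" :\\
```

Because the table is applied in a single pass, inserted escape characters are never escaped a second time.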
    

    Answer 4

    It’s not that hard:

    def escapeSpecialCharacters ( text, characters ):
        for character in characters:
            text = text.replace( character, '\\' + character )
        return text
    
    >>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
    'I\\\'m \\"stuck\\" :\\'
    >>> print( _ )
    I\'m \"stuck\" :\
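
One thing to watch with chained replace() calls: the function above leaves existing backslashes alone, and if a backslash is passed in after another character, the backslashes inserted earlier get escaped again. A possible variant, assuming you also want the backslash itself escaped, handles it first (escape_special_characters and its escape_char parameter are illustrative names, not from the answer):

```python
def escape_special_characters(text, characters, escape_char="\\"):
    """Escape with chained str.replace, handling the escape character first."""
    # Escaping the backslash first avoids re-escaping the backslashes
    # that the later replacements insert.
    ordered = [escape_char] + [c for c in characters if c != escape_char]
    for character in ordered:
        text = text.replace(character, escape_char + character)
    return text


print(escape_special_characters('I\'m "stuck" :\\', '\'"'))  # I\'m \"stuck\" :\\
```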
    

    Answer 5

    If you only want to escape certain characters, you could use this:

    import re
    
    print(re.sub(r'([\.\\\+\*\?\[\^\]\$\(\)\{\}\!\<\>\|\:\-])', r'\\\1', "example string."))
    

    String to dictionary in Python

    Question: String to dictionary in Python

    So I’ve spent way too much time on this, and it seems to me like it should be a simple fix. I’m trying to use Facebook’s Authentication to register users on my site, and I’m trying to do it server side. I’ve gotten to the point where I get my access token, and when I go to:

    https://graph.facebook.com/me?access_token=MY_ACCESS_TOKEN

    I get the information I’m looking for as a string that’s like this:

    {"id":"123456789","name":"John Doe","first_name":"John","last_name":"Doe","link":"http:\/\/www.facebook.com\/jdoe","gender":"male","email":"jdoe\u0040gmail.com","timezone":-7,"locale":"en_US","verified":true,"updated_time":"2011-01-12T02:43:35+0000"}

    It seems like I should just be able to use dict(string) on this but I’m getting this error:

    ValueError: dictionary update sequence element #0 has length 1; 2 is required

    So I tried using Pickle, but got this error:

    KeyError: '{'

    I tried using django.serializers to de-serialize it but had similar results. Any thoughts? I feel like the answer has to be simple, and I’m just being stupid. Thanks for any help!


    Answer 0

    This data is JSON! You can deserialize it using the built-in json module if you’re on Python 2.6+, otherwise you can use the excellent third-party simplejson module.

    import json    # or `import simplejson as json` if on Python < 2.6
    
    json_string = u'{ "id":"123456789", ... }'
    obj = json.loads(json_string)    # obj now contains a dict of the data
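
A runnable sketch of the same call in Python 3, using a shortened, made-up sample in the same shape as the response in the question (the field values are illustrative only):

```python
import json

# Shortened sample; a real response would carry more fields.
json_string = '{"id": "123456789", "name": "John Doe", "timezone": -7, "verified": true}'

user = json.loads(json_string)

print(type(user).__name__)  # dict
print(user["name"])         # John Doe
print(user["verified"])     # True
```

JSON's true is converted to Python's True, and numbers and strings arrive as int and str, so the result can be used like any other dict.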
    

    Answer 1

    Use ast.literal_eval to evaluate Python literals. However, what you have is JSON (note “true” for example), so use a JSON deserializer.

    >>> import json
    >>> s = """{"id":"123456789","name":"John Doe","first_name":"John","last_name":"Doe","link":"http:\/\/www.facebook.com\/jdoe","gender":"male","email":"jdoe\u0040gmail.com","timezone":-7,"locale":"en_US","verified":true,"updated_time":"2011-01-12T02:43:35+0000"}"""
    >>> json.loads(s)
    {u'first_name': u'John', u'last_name': u'Doe', u'verified': True, u'name': u'John Doe', u'locale': u'en_US', u'gender': u'male', u'email': u'jdoe@gmail.com', u'link': u'http://www.facebook.com/jdoe', u'timezone': -7, u'updated_time': u'2011-01-12T02:43:35+0000', u'id': u'123456789'}
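
To illustrate why the distinction matters, here is a small sketch (Python 3, with a made-up two-field sample) showing ast.literal_eval rejecting JSON's lowercase true while json.loads accepts it:

```python
import ast
import json

text = '{"verified": true, "timezone": -7}'

# ast.literal_eval only understands Python literals, and `true` is not one.
try:
    ast.literal_eval(text)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(literal_eval_ok)   # False
print(json.loads(text))  # {'verified': True, 'timezone': -7}
```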