Tag archive: string

How do I execute a string containing Python code in Python?

Question: How do I execute a string containing Python code in Python?


Answer 0

For statements, use exec(string) (Python 2/3) or exec string (Python 2 only):

>>> mycode = 'print("hello world")'
>>> exec(mycode)
hello world

When you need the value of an expression, use eval(string):

>>> x = eval("2+2")
>>> x
4

However, the first step should be to ask yourself if you really need to. Executing code should generally be the position of last resort: It’s slow, ugly and dangerous if it can contain user-entered code. You should always look at alternatives first, such as higher order functions, to see if these can better meet your needs.
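As a rough illustration of the higher-order-function alternative mentioned above, the sketch below maps operation names to ordinary functions instead of executing a string. The names OPERATIONS and apply_operation are hypothetical, chosen only for this example.

```python
import operator

# Whitelist of operations a caller may request by name (hypothetical names).
OPERATIONS = {
    "add": operator.add,
    "mul": operator.mul,
    "max": max,
}

def apply_operation(name, *args):
    """Look the function up by name and call it; no code string is executed."""
    try:
        func = OPERATIONS[name]
    except KeyError:
        raise ValueError("unsupported operation: %r" % name)
    return func(*args)

print(apply_operation("add", 2, 2))   # 4
print(apply_operation("max", 3, 7))   # 7
```

Anything not listed in the dictionary is rejected, so the set of reachable behavior is decided by the program and not by the input.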


Answer 1

In the example a string is executed as code using the exec function.

import io
import sys

# create file-like strings to capture output
codeOut = io.StringIO()
codeErr = io.StringIO()

code = """
def f(x):
    x = x + 1
    return x

print('This is my output.')
"""

# capture output and errors
sys.stdout = codeOut
sys.stderr = codeErr

try:
    exec(code)
finally:
    # restore stdout and stderr even if the executed code raises
    sys.stdout = sys.__stdout__
    sys.stderr = sys.__stderr__

print(f(4))

s = codeErr.getvalue()

print("error:\n%s\n" % s)

s = codeOut.getvalue()

print("output:\n%s" % s)

codeOut.close()
codeErr.close()
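
The same capture can probably be written more compactly with contextlib.redirect_stdout from the standard library. This is only a sketch of that variant, not part of the original answer. It also passes an explicit dictionary (here called namespace, a name chosen for the example) so the executed code does not write into the caller's globals.

```python
import contextlib
import io

code = """
def f(x):
    return x + 1

print('This is my output.')
"""

namespace = {}            # names defined by the executed code end up here
captured = io.StringIO()  # receives everything the code prints

# stdout is redirected only inside the with block and restored afterwards
with contextlib.redirect_stdout(captured):
    exec(code, namespace)

print(namespace["f"](4))
print(repr(captured.getvalue()))
```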

Answer 2

eval and exec are the correct solution, and they can be used in a safer manner.

As discussed in Python’s reference manual and clearly explained in this tutorial, the eval and exec functions take two extra parameters that allow a user to specify what global and local functions and variables are available.

For example:

public_variable = 10

private_variable = 2

def public_function():
    return "public information"

def private_function():
    return "super sensitive information"

# make a list of safe functions
safe_list = ['public_variable', 'public_function']
# globals() rather than locals(): in Python 3 a comprehension can have its own local scope
safe_dict = {k: globals().get(k) for k in safe_list}
# add any needed builtins back in
safe_dict['len'] = len
safe_dict['print'] = print

>>> eval("public_variable+2", {"__builtins__": {}}, safe_dict)
12

>>> eval("private_variable+2", {"__builtins__": {}}, safe_dict)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name 'private_variable' is not defined

>>> exec("print(\"'%s' has %i characters\" % (public_function(), len(public_function())))", {"__builtins__": {}}, safe_dict)
'public information' has 18 characters

>>> exec("print(\"'%s' has %i characters\" % (private_function(), len(private_function())))", {"__builtins__": {}}, safe_dict)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name 'private_function' is not defined

In essence you are defining the namespace in which the code will be executed.


Answer 3

Remember that as of Python 3, exec is a function,
so always use exec(mystring) instead of exec mystring.


Answer 4

eval() is only for expressions: eval('x+1') works, but eval('x=1') does not, for example. In that case, it's better to use exec, or even better: try to find a better solution :)
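
To make the expression/statement distinction concrete, here is a small sketch. The dictionary name namespace is just an illustrative choice.

```python
namespace = {"x": 1}

# eval() accepts an expression and returns its value.
value = eval("x + 1", namespace)
print(value)

# An assignment is a statement, so eval() rejects it with SyntaxError.
try:
    eval("x = 5", namespace)
except SyntaxError as err:
    print("eval failed:", type(err).__name__)

# exec() runs statements; it returns None and the effect lands in the namespace.
result = exec("x = 5", namespace)
print(result, namespace["x"])
```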


Answer 5

Avoid exec and eval

Using exec and eval in Python is highly frowned upon.

There are better alternatives

From the top answer (emphasis mine):

For statements, use exec.

When you need the value of an expression, use eval.

However, the first step should be to ask yourself if you really need to. Executing code should generally be the position of last resort: It’s slow, ugly and dangerous if it can contain user-entered code. You should always look at alternatives first, such as higher order functions, to see if these can better meet your needs.

From Alternatives to exec/eval?

set and get values of variables with the names in strings

[while eval] would work, it is generally not advised to use variable names bearing a meaning to the program itself.

Instead, better use a dict.
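
A minimal sketch of the dict approach recommended in that quote: values are stored under string keys, so no variable is ever created from a string. The names settings, set_value and get_value are made up for this example.

```python
# Store values under string keys instead of creating variables by name.
settings = {}

def set_value(name, value):
    settings[name] = value

def get_value(name, default=None):
    return settings.get(name, default)

set_value("threshold", 10)
set_value("label", "demo")
print(get_value("threshold"))
print(get_value("missing", "n/a"))
```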

It is not idiomatic

From http://lucumr.pocoo.org/2011/2/1/exec-in-python/ (emphasis mine)

Python is not PHP

Don’t try to circumvent Python idioms because some other language does it differently. Namespaces are in Python for a reason and just because it gives you the tool exec it does not mean you should use that tool.

It is dangerous

From http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html (emphasis mine)

So eval is not safe, even if you remove all the globals and the builtins!

The problem with all of these attempts to protect eval() is that they are blacklists. They explicitly remove things that could be dangerous. That is a losing battle because if there’s just one item left off the list, you can attack the system.

So, can eval be made safe? Hard to say. At this point, my best guess is that you can’t do any harm if you can’t use any double underscores, so maybe if you exclude any string with double underscores you are safe. Maybe…
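
The quoted article does not prescribe a fix. One narrower tool that is often suggested when the string only needs to hold a Python literal (numbers, strings, lists, dicts and so on) is ast.literal_eval from the standard library, which rejects anything else. The sketch below assumes that restricted use case; parse_literal is a hypothetical helper name.

```python
import ast

def parse_literal(text):
    """Return the Python literal encoded in text, or raise ValueError."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as err:
        raise ValueError("not a plain literal: %r" % text) from err

print(parse_literal("[1, 2, {'a': 3}]"))

# Function calls, attribute access and names are not literals and are refused.
try:
    parse_literal("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", err)
```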

It is hard to read and understand

From http://stupidpythonideas.blogspot.it/2013/05/why-evalexec-is-bad.html (emphasis mine):

First, exec makes it harder to human beings to read your code. In order to figure out what’s happening, I don’t just have to read your code, I have to read your code, figure out what string it’s going to generate, then read that virtual code. So, if you’re working on a team, or publishing open source software, or asking for help somewhere like StackOverflow, you’re making it harder for other people to help you. And if there’s any chance that you’re going to be debugging or expanding on this code 6 months from now, you’re making it harder for yourself directly.


Answer 6

You can execute code using exec, as in the following IDLE session:

>>> kw = {}
>>> exec("ret = 4", kw)
>>> kw['ret']
4

Answer 7

As the others mentioned, it's "exec".

However, if your code defines variables, you can declare them with "global" so they can be accessed afterwards, which also prevents the following error:

NameError: name 'p_variable' is not defined

global p_variable
# pass globals() so the assignment lands in the module namespace, even when run inside a function
exec('p_variable = [1,2,3,4]', globals())
print(p_variable)

Answer 8

Use eval.


Answer 9

It's worth mentioning that exec has a sibling, execfile, for when you want to run a Python file (Python 2 only; execfile was removed in Python 3). That is sometimes useful if you are working in a third-party package that ships with a terrible IDE and you want to code outside of that package.

Example:

execfile('/path/to/source.py')

or:

exec(open("/path/to/source.py").read())


Answer 10

Check out eval:

x = 1
print(eval('x+1'))
# -> 2

Answer 11

I tried quite a few things, but the only thing that worked was the following:

temp_dict = {}
exec("temp_dict['val'] = 10")
print(temp_dict['val'])

Output:

10


Answer 12

The most logical solution would be to use the built-in eval() function. Another solution is to write that string to a temporary Python file and execute it.
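
The temporary-file route is not spelled out in the answer. One way it might look, assuming the standard-library tempfile and runpy modules are acceptable, is sketched below. The variable names are illustrative.

```python
import os
import runpy
import tempfile

source = "result = 6 * 7\n"

# Write the code string to a temporary .py file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write(source)
    path = tmp.name

try:
    # run_path executes the file and returns its resulting globals as a dict.
    module_globals = runpy.run_path(path)
finally:
    os.remove(path)

print(module_globals["result"])
```

This carries the same risks as exec if the string comes from an untrusted source; it only changes how the code is loaded.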


Answer 13

OK, I know this isn't exactly an answer, but possibly a note for people looking at this as I was. I wanted to execute specific code for different users/customers but also wanted to avoid exec/eval. I initially looked at storing the code in a database for each user and doing the above.

I ended up creating the files on the file system within a 'customer_filters' folder and using the 'imp' module. If no filter applies for that customer, it just carries on.

import imp


def get_customer_module(customerName='default', name='filter'):
    lm = None
    try:
        module_name = customerName + "_" + name
        # find_module returns (open file, pathname, description)
        m = imp.find_module(module_name, ['customer_filters'])
        try:
            lm = imp.load_module(module_name, m[0], m[1], m[2])
        finally:
            if m[0]:
                m[0].close()
    except ImportError:
        # ignore: no module was found for this customer
        pass
    return lm

m = get_customer_module(customerName, "filter")
if m is not None:
    m.apply_address_filter(myobj)

So customerName = "jj" would execute apply_address_filter from the customer_filters\jj_filter.py file.
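
The imp module is deprecated and was removed in Python 3.12, so on current Python versions the same idea would need importlib. The following is a sketch of one possible equivalent and not the author's code. The folder parameter and the snake_case names are assumptions added for the example.

```python
import importlib.util
import os

def get_customer_module(customer_name="default", name="filter",
                        folder="customer_filters"):
    """Return the customer's filter module, or None if the file is absent."""
    module_name = "%s_%s" % (customer_name, name)
    path = os.path.join(folder, module_name + ".py")
    if not os.path.isfile(path):
        return None  # no filter for this customer: carry on
    # Build a module object from the file and execute its top-level code.
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

A caller would then check the result for None before calling apply_address_filter, exactly as in the original snippet.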


How can I tell if a string repeats itself in Python?

Question: How can I tell if a string repeats itself in Python?

I'm looking for a way to test whether a given string consists entirely of one substring repeated over its whole length.

Examples:

[
    '0045662100456621004566210045662100456621',             # '00456621'
    '0072992700729927007299270072992700729927',             # '00729927'
    '001443001443001443001443001443001443001443',           # '001443'
    '037037037037037037037037037037037037037037037',        # '037'
    '047619047619047619047619047619047619047619',           # '047619'
    '002457002457002457002457002457002457002457',           # '002457'
    '001221001221001221001221001221001221001221',           # '001221'
    '001230012300123001230012300123001230012300123',        # '00123'
    '0013947001394700139470013947001394700139470013947',    # '0013947'
    '001001001001001001001001001001001001001001001001001',  # '001'
    '001406469760900140646976090014064697609',              # '0014064697609'
]

are strings which repeat themselves, and

[
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

are examples of ones that do not.

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern and then checking the pattern against the rest of the string seems awfully slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

I’ve looked into regexes a bit and they seem good for when you know what you’re looking for, or at least the length of the pattern you’re looking for. Unfortunately, I know neither.

How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?


Answer 0

Here’s a concise solution which avoids regular expressions and slow in-Python loops:

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang’s solution is the clear winner, outperforming all others by at least 5x for the large example set.

(That answer’s words, not mine.)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's str.find.
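
To see the rotation idea at work, the sketch below repeats principal_period and prints the index that find reports for a few made-up inputs. The sample strings are chosen for illustration only.

```python
def principal_period(s):
    # Search for s inside s+s, skipping the trivial matches at 0 and len(s).
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

for s in ("abcabcabc", "aaaa", "abcab", "a"):
    doubled = s + s
    print(repr(s), doubled.find(s, 1, -1), principal_period(s))
```

A hit at index i means that rotating s by i characters gives s back, so the first i characters form the shortest repeating unit.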


Answer 1

Here’s a solution using regular expressions.

import re

REPEATER = re.compile(r"(.+?)\1+$")

def repeated(s):
    match = REPEATER.match(s)
    return match.group(1) if match else None

Iterating over the examples in the question:

examples = [
    '0045662100456621004566210045662100456621',
    '0072992700729927007299270072992700729927',
    '001443001443001443001443001443001443001443',
    '037037037037037037037037037037037037037037037',
    '047619047619047619047619047619047619047619',
    '002457002457002457002457002457002457002457',
    '001221001221001221001221001221001221001221',
    '001230012300123001230012300123001230012300123',
    '0013947001394700139470013947001394700139470013947',
    '001001001001001001001001001001001001001001001001001',
    '001406469760900140646976090014064697609',
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

for e in examples:
    sub = repeated(e)
    if sub:
        print("%r: %r" % (e, sub))
    else:
        print("%r does not repeat." % e)

… produces this output:

'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.

The regular expression (.+?)\1+$ is divided into three parts:

  1. (.+?) is a matching group containing at least one (but as few as possible) of any character (because +? is non-greedy).

  2. \1+ checks for at least one repetition of the matching group in the first part.

  3. $ checks for the end of the string, to ensure that there’s no extra, non-repeating content after the repeated substrings (and using re.match() ensures that there’s no non-repeating text before the repeated substrings).

In Python 3.4 and later, you could drop the $ and use re.fullmatch() instead, or (in any Python at least as far back as 2.3) go the other way and use re.search() with the regex ^(.+?)\1+$, all of which are more down to personal taste than anything else.
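
For reference, the re.fullmatch() variant mentioned above might look like the following sketch (Python 3.4 or later). REPEATER_FULL and repeated_fullmatch are names invented here to avoid clashing with the original function.

```python
import re

# Same pattern as above, minus the trailing $ (fullmatch anchors both ends).
REPEATER_FULL = re.compile(r"(.+?)\1+")

def repeated_fullmatch(s):
    match = REPEATER_FULL.fullmatch(s)
    return match.group(1) if match else None

print(repeated_fullmatch("037037037"))
print(repeated_fullmatch("0434782608695652173913"))
```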


Answer 2

You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1 to n / 2 inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:

from math import sqrt, floor

def divquot(n):
    if n > 1:
        yield 1, n
    swapped = []
    for d in range(2, int(floor(sqrt(n))) + 1):
        q, r = divmod(n, d)
        if r == 0:
            yield d, q
            swapped.append((q, d))
    while swapped:
        yield swapped.pop()

def repeats(s):
    n = len(s)
    for d, q in divquot(n):
        sl = s[0:d]
        if sl * q == s:
            return sl
    return None

EDIT: In Python 3, the / operator has changed to do float division by default. To get the int division from Python 2, you can use the // operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

The // operator performs integer division in both Python 2 and Python 3, so I’ve updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all and a generator expression.

UPDATE: In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None if it does not. @godlygeek has suggested using divmod to reduce the number of iterations on the divisors generator, and the code has been updated to match that as well. It now returns all positive divisors of n in ascending order, exclusive of n itself.

Further update for high performance: After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3's book and updated my solution. It's now over 6x as fast as before, noticeably faster than Tigerhawk's solution but slower than David's.
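
To make the divisor generator easier to follow, this sketch copies divquot and repeats from the answer and prints the (divisor, quotient) pairs for a length of 12. The sample inputs are made up for the demonstration.

```python
from math import sqrt, floor

def divquot(n):
    # Yield (divisor, quotient) pairs of n in ascending divisor order,
    # excluding n itself as a divisor.
    if n > 1:
        yield 1, n
    swapped = []
    for d in range(2, int(floor(sqrt(n))) + 1):
        q, r = divmod(n, d)
        if r == 0:
            yield d, q
            swapped.append((q, d))
    while swapped:
        yield swapped.pop()

def repeats(s):
    n = len(s)
    for d, q in divquot(n):
        sl = s[0:d]
        if sl * q == s:
            return sl
    return None

print(list(divquot(12)))
print(repeats("abababababab"))
print(repeats("abcdefghijkl"))
```

Only prefix lengths that divide the string length are ever compared, which is what keeps the number of candidate checks small.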


Answer 3

Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

Some functions were modified to work with Python 3 (mainly by replacing / with // to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.

In summary: there’s about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here (via this comment). David Zhang’s solution is the clear winner, outperforming all others by around 5x for the large example set.

A couple of the answers are very slow in extremely large “no match” cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

Here are the results, including plots made using matplotlib and seaborn to show the different distributions:


Corpus 1 (supplied examples – small set)

mean performance:
 0.0003  david_zhang
 0.0009  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0015  carpetpython
 0.0029  tigerhawk_1
 0.0031  davidism
 0.0035  saksham
 0.0046  shashank
 0.0052  riad
 0.0056  piotr

median performance:
 0.0003  david_zhang
 0.0008  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0014  carpetpython
 0.0027  tigerhawk_1
 0.0031  davidism
 0.0038  saksham
 0.0044  shashank
 0.0054  riad
 0.0058  piotr

Corpus 1 graph


Corpus 2 (supplied examples – large set)

mean performance:
 0.0006  david_zhang
 0.0036  tigerhawk_2
 0.0036  antti
 0.0037  zero
 0.0039  carpetpython
 0.0052  shashank
 0.0056  piotr
 0.0066  davidism
 0.0120  tigerhawk_1
 0.0177  riad
 0.0283  saksham

median performance:
 0.0004  david_zhang
 0.0018  zero
 0.0022  tigerhawk_2
 0.0022  antti
 0.0024  carpetpython
 0.0043  davidism
 0.0049  shashank
 0.0055  piotr
 0.0061  tigerhawk_1
 0.0077  riad
 0.0109  saksham

Corpus 2 graph


Corpus 3 (edge cases)

mean performance:
 0.0123  shashank
 0.0375  david_zhang
 0.0376  piotr
 0.0394  carpetpython
 0.0479  antti
 0.0488  tigerhawk_2
 0.2269  tigerhawk_1
 0.2336  davidism
 0.7239  saksham
 3.6265  zero
 6.0111  riad

median performance:
 0.0107  tigerhawk_2
 0.0108  antti
 0.0109  carpetpython
 0.0135  david_zhang
 0.0137  tigerhawk_1
 0.0150  shashank
 0.0229  saksham
 0.0255  piotr
 0.0721  davidism
 0.1080  zero
 1.8539  riad

Corpus 3 graph


The tests and raw results are available here.


Answer 4

Non-regex solution:

def repeat(string):
    for i in range(1, len(string)//2+1):
        if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
            return string[0:i]

Faster non-regex solution, thanks to @ThatWeirdo (see comments):

def repeat(string):
    l = len(string)
    for i in range(1, len(string)//2+1):
        if l%i: continue
        s = string[0:i]
        if s*(l//i) == string:
            return s

The above solution is very rarely slower than the original by a few percent, but it’s usually a good bit faster – sometimes a whole lot faster. It’s still not faster than davidism’s for longer strings, and zero’s regex solution is superior for short strings. It comes out to the fastest (according to davidism’s test on github – see his answer) with strings of about 1000-1500 characters. Regardless, it’s reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.

Test:

print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))

Results:

009
2547
abcde
None
None
None

Answer 5

First, halve the string as long as it’s a “2 part” duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2 need to be tested since anything over that would have no repeats.

def shortest_repeat(orig_value):
    if not orig_value:
        return None

    value = orig_value

    while True:
        len_half = len(value) // 2
        first_half = value[:len_half]

        if first_half != value[len_half:]:
            break

        value = first_half

    len_value = len(value)
    split = value.split

    for i in (i for i in range(1, len_value // 2) if len_value % i == 0):
        if not any(split(value[:i])):
            return value[:i]

    return value if value != orig_value else None

This returns the shortest match or None if there is no match.
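
The split-based check may be easier to see in isolation. As a small illustration (the helper name is_built_from is hypothetical, not part of the answer): splitting a string by a non-empty candidate yields only empty strings when the string is nothing but back-to-back copies of that candidate.

```python
def is_built_from(value, candidate):
    """Return True if value consists only of back-to-back copies of candidate."""
    # str.split leaves an empty string between (and around) adjacent separators,
    # so any non-empty piece means something other than candidate is present.
    return not any(value.split(candidate))


print("abcabcabc".split("abc"))            # ['', '', '', '']
print(is_built_from("abcabcabc", "abc"))   # True
print(is_built_from("abcabcab", "abc"))    # False
```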


Answer 6

The problem can also be solved in O(n) worst-case time with the prefix function.

Note that in the general case it may be slower (UPD: and it is much slower) than the other solutions, whose running time depends on the number of divisors of n but which usually detect a failure sooner. I think one of the bad cases for them would be aaa....aab, where there are n - 1 = 2 * 3 * 5 * 7 ... * p_n - 1 a’s.

First of all, you need to calculate the prefix function:

def prefix_function(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    return pi

Then either there is no answer, or the shortest period is

k = len(s) - prefix_function(s)[-1]

and you just have to check whether k != n and n % k == 0: if so, the answer is s[:k]; otherwise there is no answer.

You may check the proof here (in Russian, but online translator will probably do the trick)

def riad(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    k = n - pi[-1]
    return s[:k] if (n != k and n % k == 0) else None
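
The code above is Python 2 (xrange). A Python 3 sketch of the same idea follows, split into two functions for readability; the name shortest_period and the handling of the empty string are my own assumptions, not part of the answer.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        j = pi[i - 1]
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]
        if s[i] == s[j]:
            j += 1
        pi[i] = j
    return pi


def shortest_period(s):
    """Return the shortest repeating unit of s, or None if s is not a repetition."""
    if not s:
        return None  # assumption: an empty string has no repeating unit
    n = len(s)
    k = n - prefix_function(s)[-1]
    return s[:k] if k != n and n % k == 0 else None


print(prefix_function("abcabcabc"))   # [0, 0, 0, 1, 2, 3, 4, 5, 6]
print(shortest_period("abcabcabc"))   # abc
print(shortest_period("abcab"))       # None
```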

Answer 7

This version tries only those candidate sequence lengths that are factors of the string length; and uses the * operator to build a full-length string from the candidate sequence:

def get_shortest_repeat(string):
    length = len(string)
    for i in range(1, length // 2 + 1):
        if length % i:  # skip non-factors early
            continue

        candidate = string[:i]
        if string == candidate * (length // i):
            return candidate

    return None

Thanks to TigerhawkT3 for noticing that length // 2 without + 1 would fail to match the abab case.


Answer 8

Here’s a straightforward solution, without regexes.

For each prefix substr of s (starting at index zero) of length 1 through len(s) - 1, check whether substr is the repeated pattern. This check can be performed by concatenating substr with itself ratio times, so that the length of the resulting string equals the length of s. Hence ratio = len(s)/len(substr), using integer division.

Return as soon as the first such substring is found. This gives the smallest possible substring, if one exists.

def check_repeat(s):
    for i in range(1, len(s)):
        substr = s[:i]
        ratio = len(s)/len(substr)
        if substr * ratio == s:
            print 'Repeating on "%s"' % substr
            return
    print 'Non repeating'

>>> check_repeat('254725472547')
Repeating on "2547"
>>> check_repeat('abcdeabcdeabcdeabcde')
Repeating on "abcde"
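
The function above relies on Python 2 semantics (print statement, and / as integer division on ints). A Python 3 sketch of the same check, returning the substring instead of printing it; the name find_repeat is hypothetical:

```python
def find_repeat(s):
    """Return the shortest prefix that repeats to form s, or None."""
    for i in range(1, len(s)):
        substr = s[:i]
        # integer division: in Python 3, '/' returns a float,
        # and multiplying a str by a float raises TypeError
        ratio = len(s) // len(substr)
        if substr * ratio == s:
            return substr
    return None


print(find_repeat('254725472547'))           # 2547
print(find_repeat('abcdeabcdeabcdeabcde'))   # abcde
print(find_repeat('abcdefg'))                # None
```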

Answer 9

I started with more than eight solutions to this problem. Some were based on regex (match, findall, split), some on string slicing and testing, and some on string methods (find, count, split). Each had benefits in code clarity, code size, speed and memory consumption. I was going to post my answer here when I noticed that execution speed was ranked as important, so I did more testing and improvement to arrive at this:

def repeating(s):
    size = len(s)
    incr = size % 2 + 1
    for n in xrange(1, size//2+1, incr):
        if size % n == 0:
            if s[:n] * (size//n) == s:
                return s[:n]

This answer seems similar to a few other answers here, but it has a few speed optimisations others have not used:

  • xrange is a little faster in this application,
  • if an input string is an odd length, do not check any even length substrings,
  • by using s[:n] directly, we avoid creating a variable in each loop.

I would be interested to see how this performs in the standard tests with common hardware. I believe it will be well short of David Zhang’s excellent algorithm in most tests, but should be quite fast otherwise.

I found this problem to be very counter-intuitive. The solutions I thought would be fast were slow. The solutions that looked slow were fast! It seems that Python’s string creation with the multiply operator and string comparisons are highly optimised.


Answer 10

This function runs very quickly (in my tests it was over 3 times faster than the fastest solution here on strings of over 100k characters, and the difference gets bigger the longer the repeating pattern is). It tries to minimise the number of comparisons needed to get the answer:

def repeats(string):
    n = len(string)
    tried = set([])
    best = None
    nums = [i for i in  xrange(2, int(n**0.5) + 1) if n % i == 0]
    nums = [n/i for i in nums if n/i!=i] + list(reversed(nums)) + [1]
    for s in nums:
        if all(t%s for t in tried):
            print 'Trying repeating string of length:', s
            if string[:s]*(n/s)==string:
                best = s
            else:
                tried.add(s)
    if best:
        return string[:best]

Note that, for example, for a string of length 8 it checks only the fragment of size 4 and does not have to test further, because a pattern of length 1 or 2 would also result in a repeating pattern of length 4:

>>> repeats('12345678')
Trying repeating string of length: 4
None

# for this one we need only 2 checks 
>>> repeats('1234567812345678')
Trying repeating string of length: 8
Trying repeating string of length: 4
'12345678'
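
The function above is Python 2 (xrange, print statement, / on ints). A Python 3 sketch of the same divisor-pruning idea, with the progress print dropped and the variables renamed for readability (the renaming is mine). Like the original, it returns a one-character string unchanged.

```python
def repeats(string):
    n = len(string)
    tried = set()      # candidate lengths that were checked and failed
    best = None
    # divisors of n from 2 up to sqrt(n)
    small = [i for i in range(2, int(n ** 0.5) + 1) if n % i == 0]
    # candidate lengths (divisors of n other than n itself), largest first
    lengths = [n // i for i in small if n // i != i] + small[::-1] + [1]
    for length in lengths:
        # if a multiple of `length` already failed, `length` must fail too,
        # so only test lengths that divide none of the failed ones
        if all(t % length for t in tried):
            if string[:length] * (n // length) == string:
                best = length
            else:
                tried.add(length)
    return string[:best] if best else None


print(repeats('12345678'))           # None
print(repeats('1234567812345678'))   # 12345678
print(repeats('abababab'))           # ab
```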

Answer 11

David Zhang’s answer does not work if we have some sort of circular buffer: principal_period('6210045662100456621004566210045662100456621') fails because of the leading 621, where I would have liked it to return 00456621.

Extending his solution we can use the following:

def principal_period(s):
    for j in range(int(len(s)/2)):
        idx = (s[j:]+s[j:]).find(s[j:], 1, -1)
        if idx != -1:
            # Make sure that the first substring is part of pattern
            if s[:j] == s[j:][:idx][-j:]:
                break

    return None if idx == -1 else s[j:][:idx]

>>> principal_period('6210045662100456621004566210045662100456621')
'00456621'
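
For context, the approach being extended searches for s inside s + s. A minimal sketch of that idea as I understand it (it is not quoted in this answer, so treat the exact form as an assumption):

```python
def principal_period(s):
    # Look for s inside s+s, skipping the trivial matches at offset 0 and len(s).
    # A match at offset i means s is unchanged when rotated by i places,
    # which happens only when s is s[:i] repeated.
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]


print(principal_period('0045662100456621'))        # 00456621
# With three extra leading characters the whole string is no longer
# an exact repetition, so the plain version finds nothing:
print(principal_period('621' + '00456621' * 5))    # None
```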

Answer 12

Here is Python code that checks for repetition of a substring in the main string given by the user.

print "Enter a string...."
#mainstring = String given by user
mainstring=raw_input(">")
if(mainstring==''):
    print "Invalid string"
    exit()
#charlist = Character list of mainstring
charlist=list(mainstring)
strarr=''
print "Length of your string :",len(mainstring)
for i in range(0,len(mainstring)):
    strarr=strarr+charlist[i]
    splitlist=mainstring.split(strarr)
    count = 0
    for j in splitlist:
        if j =='':
            count+=1
    if count == len(splitlist):
        break
if count == len(splitlist):
    if count == 2:
        print "No repeating Sub-String found in string %r"%(mainstring)

    else:
        print "Sub-String %r repeats in string %r"%(strarr,mainstring)
else :
    print "No repeating Sub-String found in string %r"%(mainstring)

Input:

0045662100456621004566210045662100456621

Output :

Length of your string : 40

Sub-String ‘00456621’ repeats in string ‘0045662100456621004566210045662100456621’

Input :

004608294930875576036866359447

Output:

Length of your string : 30

No repeating Sub-String found in string ‘004608294930875576036866359447’


Find index of last occurrence of a substring in a string

Question: Find index of last occurrence of a substring in a string

I want to find the position (or index) of the last occurrence of a certain substring in given input string str.

For example, suppose the input string is str = 'hello' and the substring is target = 'l', then it should output 3.

How can I do this?


Answer 0

Use .rfind():

>>> s = 'hello'
>>> s.rfind('l')
3

Also, don’t use str as a variable name, or you’ll shadow the built-in str().
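
A small illustration of what shadowing means here (the function name shadow_demo is made up for the example): once the name str is rebound to a string, calling str(...) in that scope fails.

```python
def shadow_demo():
    str = 'hello'          # the local name now hides the built-in type
    try:
        return str(3)      # tries to call the string 'hello'
    except TypeError as exc:
        return type(exc).__name__


print(shadow_demo())   # TypeError
print(str(3))          # the built-in is untouched outside that function: 3
```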


Answer 1

You can use rfind() or rindex()

>>> s = 'Hello StackOverflow Hi everybody'

>>> print( s.rfind('H') )
20

>>> print( s.rindex('H') )
20

>>> print( s.rfind('other') )
-1

>>> print( s.rindex('other') )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found

The difference is that when the substring is not found, rfind() returns -1 while rindex() raises a ValueError exception.

If you do not want to check for the rfind() return value -1, you may prefer rindex(), which provides an understandable error message. Otherwise you may spend minutes searching for where the unexpected value -1 comes from in your code…
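
A short sketch of why an unchecked -1 can be hard to track down: -1 is itself a valid index, so using it does not fail.

```python
s = 'Hello StackOverflow Hi everybody'

# rfind signals "not found" with -1, which is also a valid index
pos = s.rfind('other')
if pos == -1:
    print('not found')
print(repr(s[pos]))   # 'y' -- an unchecked -1 silently indexes the last character

# rindex fails loudly instead
try:
    s.rindex('other')
except ValueError as exc:
    print('rindex raised:', exc)   # rindex raised: substring not found
```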


Example: Search of last newline character

>>> txt = '''first line
... second line
... third line'''

>>> txt.rfind('\n')
22

>>> txt.rindex('\n')
22

Answer 2

Use the str.rindex method.

>>> 'hello'.rindex('l')
3
>>> 'hello'.index('l')
2

Answer 3

Try this:

s = 'hello plombier pantin'
print (s.find('p'))
6
print (s.index('p'))
6
print (s.rindex('p'))
15
print (s.rfind('p'))
15

Answer 4

The more_itertools library offers tools for finding indices of all characters or all substrings.

Given

import more_itertools as mit


s = "hello"
pred = lambda x: x == "l"

Code

Characters

Now there is the rlocate tool available:

next(mit.rlocate(s, pred))
# 3

A complementary tool is locate:

list(mit.locate(s, pred))[-1]
# 3

mit.last(mit.locate(s, pred))
# 3

Substrings

There is also a window_size parameter available for locating the leading item of several items:

s = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
substring = "chuck"
pred = lambda *args: args == tuple(substring)

next(mit.rlocate(s, pred=pred, window_size=len(substring)))
# 59

Answer 5

Not trying to resurrect an inactive post, but since this hasn’t been posted yet…

(This is how I did it before finding this question)

s = "hello"
target = "l"
last_pos = len(s) - 1 - s[::-1].index(target)

Explanation: When you’re searching for the last occurrence, really you’re searching for the first occurrence in the reversed string. Knowing this, I did s[::-1] (which returns a reversed string), and then indexed the target from there. Then I did len(s) - 1 - the index found because we want the index in the unreversed (i.e. original) string.

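Watch out, though! If target is more than one character, you probably won’t find it in the reversed string, because its characters appear there in reverse order. To fix this, search for the reversed target and account for its length: last_pos = len(s) - len(target) - s[::-1].index(target[::-1]). (Subtracting 1 instead of len(target) would give the index of the last character of the match rather than its start.)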
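
A sketch of the reversed-string approach for targets of any length (the function name is hypothetical). As far as I can tell, with a multi-character target the length of the target has to be subtracted rather than 1, otherwise the result points at the last character of the match; the checks against str.rfind below agree with that.

```python
def last_index_reversed(s, target):
    """Index of the last occurrence of target in s, found via the reversed string."""
    # index() raises ValueError if target does not occur, like str.rindex
    rev_pos = s[::-1].index(target[::-1])
    # rev_pos counts from the end of s up to the *end* of the match,
    # so subtract the target length to get back to where the match starts
    return len(s) - len(target) - rev_pos


print(last_index_reversed('hello', 'l'))    # 3
print(last_index_reversed('hello', 'll'))   # 2
print('hello'.rfind('ll'))                  # 2
```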


Answer 6

If you don’t want to use rfind, then this will do the trick:

def find_last(s, t):
    last_pos = -1
    while True:
        pos = s.find(t, last_pos + 1)
        if pos == -1:
            return last_pos
        else:
            last_pos = pos

Answer 7

You can use the rindex() function to get the index of the last occurrence of a character in a string:

s="hellloooloo"
b='l'
print(s.rindex(b))

Stripping everything but alphanumeric chars from a string in Python

Question: Stripping everything but alphanumeric chars from a string in Python

What is the best way to strip all non alphanumeric characters from a string, using Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don’t seem very ‘pythonic’ to me.

For the record, I don’t just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.


Answer 0

I just timed some functions out of curiosity. In these tests I’m removing non-alphanumeric characters from the string string.printable (part of the built-in string module). The use of compiled '[\W_]+' and pattern.sub('', str) was found to be fastest.

$ python -m timeit -s \
     "import string" \
     "''.join(ch for ch in string.printable if ch.isalnum())" 
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"                 
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"                
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)" 
100000 loops, best of 3: 11.2 usec per loop

Answer 1

Regular expressions to the rescue:

import re
re.sub(r'\W+', '', your_string)

By Python’s definition, \W == [^a-zA-Z0-9_] (for ASCII matching), i.e. it matches everything except digits, letters and _, so underscores are kept.
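
A quick check of what that means in practice (the sample text is made up): \W+ leaves underscores in place, so if they should go too, add _ to the character class.

```python
import re

text = "snake_case, kebab-case & 123!"

print(re.sub(r'\W+', '', text))      # snake_casekebabcase123
print(re.sub(r'[\W_]+', '', text))   # snakecasekebabcase123
```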


Answer 2

Use the str.translate() method.

Presuming you will be doing this often:

(1) Once, create a string containing all the characters you wish to delete:

delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

(2) Whenever you want to scrunch a string:

scrunched = s.translate(None, delchars)

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop

C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern ‘[\W_]+’ an unfair advantage; all the non-alphanumeric characters are in one bunch … in typical data there would be more than one substitution to do:

C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here’s what happens if you give re.sub a bit more work to do:

C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
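
The two-argument s.translate(None, delchars) call is Python 2 only. A rough Python 3 equivalent, assuming the same "delete every non-alphanumeric code point below 256" rule, builds the table with str.maketrans; the helper name scrunch is mine.

```python
# characters to delete: every code point below 256 that is not alphanumeric
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
# in Python 3 the deletion set is passed as the third argument of str.maketrans
table = str.maketrans('', '', delchars)


def scrunch(s):
    return s.translate(table)


print(scrunch('foo-bar_baz (42)!'))   # foobarbaz42
```

Characters outside that range (code points 256 and above) are not in the table, so they pass through unchanged.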

Answer 3

You could try:

print ''.join(ch for ch in some_string if ch.isalnum())

Answer 4

>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile('\W')
>>> string = re.sub(pattern, '', string)
>>> print string
Kl13

Answer 5

How about:

def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using list comprehension to produce a list of the characters in InputString if they are present in the combined ascii_letters and digits strings. It then joins the list together into a string.


Answer 6

As a spin off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string’s content to. In this case, I’m allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.

PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-" 
someString = "".join(c for c in someString if c in PERMITTED_CHARS)

Answer 7

sent = "".join(e for e in sent if e.isalpha())

Answer 8

for char in my_string:
    if not char.isalnum():
        my_string = my_string.replace(char,"")

Answer 9

Timing with random strings of ASCII printables:

from inspect import getsource
from random import sample
import re
from string import printable
from timeit import timeit

pattern_single = re.compile(r'[\W]')
pattern_repeat = re.compile(r'[\W]+')
translation_tb = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))


def generate_test_string(length):
    return ''.join(sample(printable, length))


def main():
    for i in range(0, 60, 10):
        for test in [
            lambda: ''.join(c for c in generate_test_string(i) if c.isalnum()),
            lambda: ''.join(filter(str.isalnum, generate_test_string(i))),
            lambda: re.sub(r'[\W]', '', generate_test_string(i)),
            lambda: re.sub(r'[\W]+', '', generate_test_string(i)),
            lambda: pattern_single.sub('', generate_test_string(i)),
            lambda: pattern_repeat.sub('', generate_test_string(i)),
            lambda: generate_test_string(i).translate(translation_tb),

        ]:
            print(timeit(test), i, getsource(test).lstrip('            lambda: ').rstrip(',\n'), sep='\t')


if __name__ == '__main__':
    main()

Result (Python 3.7):

       Time       Length                           Code                           
6.3716264850008880  00  ''.join(c for c in generate_test_string(i) if c.isalnum())
5.7285426190064750  00  ''.join(filter(str.isalnum, generate_test_string(i)))
8.1875841680011940  00  re.sub(r'[\W]', '', generate_test_string(i))
8.0002205439959650  00  re.sub(r'[\W]+', '', generate_test_string(i))
5.5290945199958510  00  pattern_single.sub('', generate_test_string(i))
5.4417179649972240  00  pattern_repeat.sub('', generate_test_string(i))
4.6772285089973590  00  generate_test_string(i).translate(translation_tb)
23.574712151996210  10  ''.join(c for c in generate_test_string(i) if c.isalnum())
22.829975890002970  10  ''.join(filter(str.isalnum, generate_test_string(i)))
27.210196289997840  10  re.sub(r'[\W]', '', generate_test_string(i))
27.203713296003116  10  re.sub(r'[\W]+', '', generate_test_string(i))
24.008979928999906  10  pattern_single.sub('', generate_test_string(i))
23.945240008994006  10  pattern_repeat.sub('', generate_test_string(i))
21.830899796994345  10  generate_test_string(i).translate(translation_tb)
38.731336012999236  20  ''.join(c for c in generate_test_string(i) if c.isalnum())
37.942474347000825  20  ''.join(filter(str.isalnum, generate_test_string(i)))
42.169366310001350  20  re.sub(r'[\W]', '', generate_test_string(i))
41.933375883003464  20  re.sub(r'[\W]+', '', generate_test_string(i))
38.899814646996674  20  pattern_single.sub('', generate_test_string(i))
38.636144253003295  20  pattern_repeat.sub('', generate_test_string(i))
36.201238164998360  20  generate_test_string(i).translate(translation_tb)
49.377356811004574  30  ''.join(c for c in generate_test_string(i) if c.isalnum())
48.408927293996385  30  ''.join(filter(str.isalnum, generate_test_string(i)))
53.901889764994850  30  re.sub(r'[\W]', '', generate_test_string(i))
52.130339455994545  30  re.sub(r'[\W]+', '', generate_test_string(i))
50.061149017004940  30  pattern_single.sub('', generate_test_string(i))
49.366573111998150  30  pattern_repeat.sub('', generate_test_string(i))
46.649754120997386  30  generate_test_string(i).translate(translation_tb)
63.107938601999194  40  ''.join(c for c in generate_test_string(i) if c.isalnum())
65.116287978999030  40  ''.join(filter(str.isalnum, generate_test_string(i)))
71.477421126997800  40  re.sub(r'[\W]', '', generate_test_string(i))
66.027950693998720  40  re.sub(r'[\W]+', '', generate_test_string(i))
63.315361931003280  40  pattern_single.sub('', generate_test_string(i))
62.342320287003530  40  pattern_repeat.sub('', generate_test_string(i))
58.249303059004890  40  generate_test_string(i).translate(translation_tb)
73.810345625002810  50  ''.join(c for c in generate_test_string(i) if c.isalnum())
72.593953348005020  50  ''.join(filter(str.isalnum, generate_test_string(i)))
76.048324580995540  50  re.sub(r'[\W]', '', generate_test_string(i))
75.106637657001560  50  re.sub(r'[\W]+', '', generate_test_string(i))
74.681338128997600  50  pattern_single.sub('', generate_test_string(i))
72.430461594005460  50  pattern_repeat.sub('', generate_test_string(i))
69.394243567003290  50  generate_test_string(i).translate(translation_tb)

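str.maketrans & str.translate is fastest, but the table only covers code points below 256, so any non-alphanumeric character above that range is kept. re.compile & pattern.sub is slower and, in these timings, roughly on par with ''.join & filter; uncompiled re.sub is the slowest at every length.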
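The caveat about non-ASCII characters can be checked directly. The sketch below rebuilds the same translation table and compares it with two of the other approaches; the sample string and the variable names are illustrative assumptions, not part of the benchmark above.

```python
import re

# Same table as in the benchmark: delete every non-alphanumeric
# character whose code point is below 256.
translation_tb = str.maketrans(
    '', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))

sample = 'a-b€c!'  # '€' is U+20AC, which lies outside the table's range

via_translate = sample.translate(translation_tb)       # '€' is kept
via_regex = re.sub(r'[\W]', '', sample)                # '€' is removed
via_join = ''.join(c for c in sample if c.isalnum())   # '€' is removed

print(via_translate)
print(via_regex)
print(via_join)
```

If the input may contain characters beyond Latin-1, the regex or isalnum variants are probably the safer choice despite being slower.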


Answer 10

If I understood correctly, the easiest way is to use a regular expression, since it gives you a lot of flexibility. Another simple method is a for loop. The following code shows an example; it can also count the occurrences of each word and store them in a dictionary (that part is commented out below).

s = """An... essay is, generally, a piece of writing that gives the author's own 
argument — but the definition is vague, 
overlapping with those of a paper, an article, a pamphlet, and a short story. Essays 
have traditionally been 
sub-classified as formal and informal. Formal essays are characterized by "serious 
purpose, dignity, logical 
organization, length," whereas the informal essay is characterized by "the personal 
element (self-revelation, 
individual tastes and experiences, confidential manner), humor, graceful style, 
rambling structure, unconventionality 
or novelty of theme," etc.[1]"""

d = {}  # empty dictionary for the word counts
words = s.split()  # split the string and store the words in a list
for word in words:
    new_word = ''
    for c in word:
        if c.isalnum():  # keep only alphanumeric characters
            new_word = new_word + c
    print(new_word, end=' ')
    # Uncomment to count occurrences of each cleaned word:
    # if new_word not in d:
    #     d[new_word] = 1
    # else:
    #     d[new_word] = d[new_word] + 1
print(d)

please rate this if this answer is useful!
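The counting step mentioned in this answer could also be sketched with collections.Counter. The helper name count_words and the short sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def count_words(text):
    """Strip non-alphanumeric characters from each word and count the results."""
    cleaned = (''.join(c for c in word if c.isalnum()) for word in text.split())
    # Skip tokens that were pure punctuation and became empty strings.
    return Counter(word for word in cleaned if word)


counts = count_words('An... essay is, generally, an essay.')
print(counts)
```

Note that the count is case-sensitive, so 'An' and 'an' are tallied separately; lowercasing each word first would merge them.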


How to extract the substring between two markers?

Question: How to extract the substring between two markers?

Let’s say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in, 1234.

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How to do the same thing in Python?


Answer 0

Using regular expressions – documentation for further reference

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234
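When the markers are not fixed literals, the same idea could be wrapped in a small helper. This is a minimal sketch; the name between_markers and its default parameter are hypothetical, and re.escape is used so that markers containing regex metacharacters are matched literally.

```python
import re


def between_markers(text, left, right, default=''):
    """Return the shortest text found between left and right, or default."""
    # re.escape keeps markers such as '[' or '.' from being read as regex syntax.
    pattern = re.escape(left) + r'(.+?)' + re.escape(right)
    m = re.search(pattern, text)
    return m.group(1) if m else default


print(between_markers('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
print(between_markers('no markers here', 'AAA', 'ZZZ', default=None))
```

Returning a default keeps the caller free of try/except, at the cost of not distinguishing "no match" from a match that equals the default.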

Answer 1

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that’s not necessary in your case.
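One thing to watch with str.find is that it returns -1 when a marker is absent, so the slice above would silently produce a wrong result instead of failing. A guarded variant might look like the sketch below; the function name slice_between is an assumption.

```python
def slice_between(s, left, right):
    """Return the text between left and right, or None if either is missing."""
    start = s.find(left)
    if start == -1:
        return None
    start += len(left)
    end = s.find(right, start)  # search only after the left marker
    if end == -1:
        return None
    return s[start:end]


print(slice_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
```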


Answer 2

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above as-is will fail with an AttributeError if there are no “AAA” and “ZZZ” in your_text

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above will return an empty string if “AAA” doesn’t exist in your_text. If only “ZZZ” is missing, it returns everything after “AAA”.

PS Python Challenge?
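The behaviour of the partition chain when a marker is missing can be checked with a short sketch. The wrapper name after_then_before and the sample strings are assumptions for illustration.

```python
def after_then_before(text, left='AAA', right='ZZZ'):
    """Take what follows left, then what precedes right."""
    return text.partition(left)[2].partition(right)[0]


both = after_then_before('gfgfdAAA1234ZZZuijjk')
no_left = after_then_before('gfgfd1234ZZZuijjk')
no_right = after_then_before('gfgfdAAA1234uijjk')

print(repr(both), repr(no_left), repr(no_right))
```

A missing left marker yields an empty string, while a missing right marker yields the whole remainder after the left marker.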


Answer 3

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

Answer 4

Surprised that nobody has mentioned this which is my quick version for one-off scripts:

>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'

Answer 5

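You can do it in just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is returned as a list. Note that this pattern matches runs of one to five digits anywhere in the string, rather than the text between AAA and ZZZ.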


Answer 6

You can use re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

Answer 7

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with re.sub function using the same regex.

>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, a capturing group is written as \(..\), whereas in Python it is written as (..).


Answer 8

In Python, a substring can be extracted from a string using the findall method of the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']

Answer 9

You can find the first occurrence of a substring (by character index) with this function. It can also return what comes after the substring.

def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset == None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0) 

print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))

Answer 10

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

Answer 11

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

print(text[text.index(left)+len(left):text.index(right)])

Gives

string

Answer 12

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like ‘US president (Barack Obama) met with …’ and I want to get only ‘Barack Obama’, this is the solution:

import re

regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

That is, you need to escape the parentheses with a backslash (\). Admittedly, this is more a regular-expression problem than a Python one.

Also, in some cases you may see an ‘r’ prefix before the regex definition. Without the r prefix, you need to use escape characters as in C.
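The difference the r prefix makes can be seen in a short sketch. The sample line comes from the answer; the variable names are assumptions.

```python
import re

line = 'US president (Barack Obama) met with ...'

# With the r prefix the backslashes reach the regex engine unchanged.
raw_pattern = r'\((.*?)\)'
# Without it, each backslash has to be doubled, as in a C string literal.
escaped_pattern = '\\((.*?)\\)'

same = raw_pattern == escaped_pattern
m = re.search(raw_pattern, line)
inside = m.group(1) if m else None

print(same)
print(inside)
```

Both spellings produce the identical pattern string; the raw form is simply easier to read.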


Answer 13

Using PyParsing

import pyparsing as pp

word = pp.Word(pp.alphanums)

s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)

which yields:

[['1234']]


Answer 14

Here’s a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.

def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

Answer 15

Another way of doing it is to use lists (assuming the substring you are looking for consists only of digits):

string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []

for char in string:
    if char in numbersList: output.append(char)

print(f"output: {''.join(output)}")
### output: 1234

Answer 16

One-liners that return another string if there was no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method, which is less optimal, runs a regex a second time; I still haven’t found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
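On Python 3.8 or later, an assignment expression offers another possible one-liner. This is a sketch under that version assumption; the variable names text and missing are illustrative.

```python
import re

text = "gfgfdAAA1234ZZZuijjk"

# The condition is evaluated first, so m is bound before m.group(1) runs.
res = m.group(1) if (m := re.search("AAA(.*?)ZZZ", text)) else "not-found"
missing = m.group(1) if (m := re.search("AAA(.*?)ZZZ", "nothing")) else "not-found"

print(res)
print(missing)
```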

How to format a floating number to fixed width in Python

Question: How to format a floating number to fixed width in Python

How do I format a floating number to a fixed width with the following requirements:

  1. Leading zero if n < 1
  2. Add trailing decimal zero(s) to fill up fixed width
  3. Truncate decimal digits past fixed width
  4. Align all decimal points

For example:

# formatter: something like '{:06}'
numbers = [23.23, 0.123334987, 1, 4.223, 9887.2]

for number in numbers:
    print(formatter.format(number))

The output would be like

  23.2300
   0.1233
   1.0000
   4.2230
9887.2000

Answer 0

for x in numbers:
    print("{:10.4f}".format(x))

prints

   23.2300
    0.1233
    1.0000
    4.2230
 9887.2000

The format specifier inside the curly braces follows the Python format string syntax. Specifically, in this case, it consists of the following parts:

  • The empty string before the colon means “take the next provided argument to format()” – in this case the x as the only argument.
  • The 10.4f part after the colon is the format specification.
  • The f denotes fixed-point notation.
  • The 10 is the total width of the field being printed, left-padded with spaces.
  • The 4 is the number of digits after the decimal point.
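The parts listed above can also be supplied as variables through nested replacement fields. A minimal sketch, with width and precision chosen here to match '{:10.4f}':

```python
numbers = [23.23, 0.123334987, 1, 4.223, 9887.2]

width, precision = 10, 4  # illustrative values matching '{:10.4f}'

# Nested replacement fields let width and precision come from variables.
lines = ['{:{w}.{p}f}'.format(x, w=width, p=precision) for x in numbers]
for line in lines:
    print(line)
```

This keeps the column layout in one place if the width has to change later.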

Answer 1

It has been a few years since this was answered, but as of Python 3.6 (PEP498) you could use the new f-strings:

numbers = [23.23, 0.123334987, 1, 4.223, 9887.2]

for number in numbers:
    print(f'{number:9.4f}')

Prints:

  23.2300
   0.1233
   1.0000
   4.2230
9887.2000
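Note that the f presentation type rounds to the requested precision rather than truncating. If truncation, as in requirement 3 of the question, is really needed, one possible sketch uses the decimal module with ROUND_DOWN. The helper name truncate_fixed and its defaults are assumptions.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_fixed(number, width=9, places=4):
    """Format number with `places` decimals, truncating instead of rounding."""
    quantum = Decimal(10) ** -places  # e.g. Decimal('0.0001') for places=4
    truncated = Decimal(str(number)).quantize(quantum, rounding=ROUND_DOWN)
    return str(truncated).rjust(width)


rounded = f'{2.71828:9.4f}'
cut = truncate_fixed(2.71828)

print(rounded)
print(cut)
```

Going through str(number) avoids carrying the binary representation error of the float into the Decimal.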

Answer 2

In python3 the following works:

>>> v=10.4
>>> print('% 6.2f' % v)
 10.40
>>> print('% 12.1f' % v)
        10.4
>>> print('%012.1f' % v)
0000000010.4

Answer 3

See Python 3.x format string syntax:

IDLE 3.5.1   
numbers = ['23.23', '.1233', '1', '4.223', '9887.2']

for x in numbers:  
    print('{0: >#016.4f}'. format(float(x)))  

     23.2300
      0.1233
      1.0000
      4.2230
   9887.2000

Answer 4

You can also left-pad with zeros. For example, if you want number to be 9 characters long, left-padded with zeros, use:

print('{:09.3f}'.format(number))

Thus, if number = 4.656, the output is: 00004.656

For your example the output will look like this:

numbers  = [23.2300, 0.1233, 1.0000, 4.2230, 9887.2000]
for x in numbers: 
    print('{:010.4f}'.format(x))

prints:

00023.2300
00000.1233
00001.0000
00004.2230
09887.2000

One example where this may be useful is when you want filenames to be listed correctly in alphabetical order. I noticed that on some Linux systems the order is: 1, 10, 11, ..., 2, 20, 21, …

Thus if you want to enforce the necessary numeric order in filenames, you need to left pad with the appropriate number of zeros.
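The effect on sorting can be illustrated with a few made-up filenames. The frame prefix and the numbers are assumptions for the sake of the example.

```python
indices = (1, 2, 10, 11, 20)

names_plain = ['frame{}.png'.format(n) for n in indices]
names_padded = ['frame{:03d}.png'.format(n) for n in indices]

sorted_plain = sorted(names_plain)
sorted_padded = sorted(names_padded)

print(sorted_plain)
print(sorted_padded)
```

Without padding, the lexicographic sort places frame10 and frame11 before frame2; with padding, the lexicographic and numeric orders agree.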


Answer 5

In Python 3.

GPA = 2.5
print(" %6.1f " % GPA)

%6.1f means a field width of 6 with 1 digit shown after the decimal point. To print 2 digits after the point, use %6.2f; likewise, %6.3f prints 3 digits after the point.


Best way to strip punctuation from a string

Question: Best way to strip punctuation from a string

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?


Answer 0

From an efficiency perspective, you’re not going to beat

s.translate(None, string.punctuation)

That form works in Python 2 only. For Python 3, use the following code:

s.translate(str.maketrans('', '', string.punctuation))

It’s performing raw string operations in C with a lookup table – there’s not much that will beat that but writing your own C code.

If speed isn’t a worry, another option though is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than s.replace with each char, but won’t perform as well as non-pure python approaches such as regexes or string.translate, as you can see from the below timings. For this type of problem, doing it at as low a level as possible pays off.

Timing code:

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

Answer 1

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
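One detail worth knowing is that \w includes the underscore, so this pattern leaves underscores in place. The sketch below shows that behaviour and one possible variant that also removes them; the sample string is an assumption.

```python
import re

s = "snake_case, with. Punctuation?"

keeps_underscore = re.sub(r'[^\w\s]', '', s)
drops_underscore = re.sub(r'[^\w\s]|_', '', s)

print(keeps_underscore)
print(drops_underscore)
```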

Answer 2

For convenience, I summarize how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for a detailed description.


Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

Answer 3

myString.translate(None, string.punctuation)

Answer 4

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

Answer 5

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

You can generalize and strip other types of characters as well:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

It will also strip characters like ~*+§$ which may or may not be “punctuation” depending on one’s point of view.


Answer 6

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

Answer 7

For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; codepoints (integers) are looked up in that mapping and anything mapped to None is removed.

To remove (some?) punctuation then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian’s answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
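Applying the larger table might look like the sketch below. It uses sys.maxunicode + 1 so that the last code point is included, and the sample string is an assumption chosen to contain non-ASCII punctuation.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (P*) to None.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P'))

sample = 'Hi, «there»!'
stripped = sample.translate(remove_punct_map)
print(stripped)
```

Building the table scans the whole Unicode range once, so it is best created a single time and reused.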

Answer 8

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
remove = regex.compile(r'[\p{C}\p{M}\p{P}\p{S}\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

Personally, I believe this is the best way to remove punctuation from a string in Python because:

  • It removes all Unicode punctuation
  • It’s easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation, but keep symbols like $.
  • You can get really specific about what you want to keep and what you want to remove, for example \p{Pd} will only remove dashes.
  • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.


回答 9

我还没有看到这个答案。只需使用正则表达式即可;它会删除单词字符(\w)和数字字符(\d)之外的所有字符,然后删除空格字符(\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)

I haven’t seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d), or whitespace (\s). Note that \w already covers digits and the underscore, so underscores are kept:

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)
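To make the behaviour of that character class concrete, here is a small sketch in Python 3. The function name remove_punctuation and the sample inputs are assumptions made for illustration:

```python
import re


def remove_punctuation(text):
    """Drop everything except word characters and whitespace (hypothetical helper)."""
    # \w covers letters, digits and the underscore, so \d would be redundant here.
    return re.sub(r'[^\w\s]+', '', text)


print(remove_punctuation("string. With. Punctuation?"))
print(remove_punctuation("snake_case, 123!"))
```

Because the underscore counts as a word character, it survives; add it to the class explicitly if it should be removed as well.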

回答 10

这是Python 3.5的一线式:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

Here’s a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

回答 11

这可能不是最佳解决方案,但是这就是我的方法。

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

This might not be the best solution however this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

回答 12

这是我编写的函数。它不是很有效,但是很简单,您可以添加或删除所需的标点符号:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # One pass per punctuation mark rebuilds the whole list.
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

Here is a function I wrote. It’s not very efficient, but it is simple and you can add or remove any punctuation that you desire:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # One pass per punctuation mark rebuilds the whole list.
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

回答 13

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

回答 14

作为更新,我重写了Python 3中的@Brian示例并对其进行了更改,以将regex编译步骤移至函数内部。我的想法是计时使该功能起作用所需的每个步骤。也许您使用的是分布式计算,并且您的工作人员之间无法共享正则表达式对象,因此需要re.compile在每个工作人员中走一步。另外,我很好奇地为Python 3的maketrans的两种不同实现计时了

table = str.maketrans({key: None for key in string.punctuation})

table = str.maketrans('', '', string.punctuation)

另外,我添加了另一种使用set的方法,其中利用了交集函数来减少迭代次数。

这是完整的代码:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

这是我的结果:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

Just as an update, I rewrote the @Brian example in Python 3 and moved the regex compile step inside the function. My aim was to time every step needed to make the function work. Perhaps you are using distributed computing, cannot share a compiled regex object between your workers, and so need a re.compile step in each worker. I was also curious to time two different ways of building the maketrans table in Python 3:

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)

Plus I added another method to use set, where I take advantage of intersection function to reduce number of iterations.

This is the complete code:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565
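The benchmark above deliberately rebuilds the table on every call. When the table can be shared, a sketch that builds it once at import time might look like the following; the names PUNCT_TABLE and strip_punct are hypothetical and not taken from the answer:

```python
import string

# Built once; str.maketrans('', '', chars) maps each listed character to None.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punct(s):
    """Remove ASCII punctuation using the precomputed table (hypothetical helper)."""
    return s.translate(PUNCT_TABLE)


print(strip_punct("string. With. Punctuation"))
```

This only handles the ASCII characters in string.punctuation, the same scope as the timed functions.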

回答 15

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]', '', s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

回答 16

这是没有正则表达式的解决方案。

import string

input_text = "!where??and!!or$$then:)"
punctuation_replacer = str.maketrans(string.punctuation, ' '*len(string.punctuation))  # Python 3
print(' '.join(input_text.translate(punctuation_replacer).split()).strip())

Output>> where and or then
  • 用空格替换标点符号
  • 用单个空格替换单词之间的多个空格
  • 如果有strip(),请删除尾随空格

Here’s a solution without regex.

import string

input_text = "!where??and!!or$$then:)"
# Python 3: str.maketrans replaces the Python 2 string.maketrans
punctuation_replacer = str.maketrans(string.punctuation, ' '*len(string.punctuation))
print(' '.join(input_text.translate(punctuation_replacer).split()).strip())

Output>> where and or then
  • Replaces each punctuation character with a space
  • Replaces multiple spaces between words with a single space
  • Removes any leading or trailing spaces with strip()

回答 17

在不太严格的情况下,单线可能会有所帮助:

''.join([c for c in s if c.isalnum() or c.isspace()])

A one-liner might be helpful in not very strict cases:

''.join([c for c in s if c.isalnum() or c.isspace()])

回答 18

# FIRST METHOD (Python 3)
# Store all punctuation characters in a variable
punctuation = '!?,.:;"\')(_-'
newstring = ''  # start with an empty string
word = input("Enter string: ")
for i in word:
    if i not in punctuation:
        newstring += i
print("The string without punctuation is", newstring)

# SECOND METHOD (Python 3)
word = input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
# Build a deletion table; Python 2 used word.translate(None, punctuation)
newstring = word.translate(str.maketrans('', '', punctuation))
print("The string without punctuation is", newstring)


# Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is hello welcome topythonprogramminglanguage

回答 19

with open('one.txt', 'r') as myFile:
    str1 = myFile.read()
print(str1)

punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

# Replace each punctuation character with a space.
for ch in punctuation:
    str1 = str1.replace(ch, " ")
print(str1)

# Split on single spaces and print each piece with a separator line.
myList = str1.split(" ")
for word in myList:
    print(word)
    print("____________")

回答 20

为什么你们没人使用这个?

 ''.join(filter(str.isalnum, s)) 

太慢了?

Why does none of you use this?

 ''.join(filter(str.isalnum, s)) 

Too slow?
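One thing to be aware of, shown in this short sketch (the variable names are illustrative assumptions): str.isalnum is false for whitespace as well, so the filter also glues the words together.

```python
s = "string. With. Punctuation?"

# filter(str.isalnum, ...) keeps only letters and digits, dropping spaces too.
only_alnum = ''.join(filter(str.isalnum, s))

# Variant that keeps whitespace as well.
keep_spaces = ''.join(ch for ch in s if ch.isalnum() or ch.isspace())

print(only_alnum)
print(keep_spaces)
```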


回答 21

考虑unicode。代码在python3中检查。

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

Considering unicode. Code checked in python3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

回答 22

使用Python从文本文件中删除停用词

print('====THIS IS HOW TO REMOVE STOP WORDS====')

with open('one.txt','r')as myFile:

    str1=myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"

    myList=[]

    myList.extend(str1.split(" "))

    for i in myList:

        if i not in stop_words:

            print ("____________")

            print(i,end='\n')

Remove stop words from the text file using Python

print('====THIS IS HOW TO REMOVE STOP WORDS====')

with open('one.txt','r')as myFile:

    str1=myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"

    myList=[]

    myList.extend(str1.split(" "))

    for i in myList:

        if i not in stop_words:

            print ("____________")

            print(i,end='\n')

回答 23

我喜欢使用这样的功能:

import string

def scrub(abc):
    # Trim punctuation from both ends; "abc and" guards against empty strings.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc

I like to use a function like this:

import string

def scrub(abc):
    # Trim punctuation from both ends; "abc and" guards against empty strings.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc

如何在Python中将一个字符串附加到另一个字符串?

问题:如何在Python中将一个字符串附加到另一个字符串?

除了以下内容外,我想要一种有效的方法来在Python中将一个字符串附加到另一个字符串。

var1 = "foo"
var2 = "bar"
var3 = var1 + var2

有什么好的内置方法可以使用吗?

I want an efficient way to append one string to another in Python, other than the following.

var1 = "foo"
var2 = "bar"
var3 = var1 + var2

Is there any good built-in method to use?


回答 0

如果只有一个对字符串的引用,并且将另一个字符串连接到末尾,则CPython现在会对此进行特殊处理,并尝试将字符串扩展到位。

最终结果是将操作摊销O(n)。

例如

s = ""
for i in range(n):
    s+=str(i)

过去是O(n ^ 2),但现在是O(n)。

从源(bytesobject.c):

void
PyBytes_ConcatAndDel(register PyObject **pv, register PyObject *w)
{
    PyBytes_Concat(pv, w);
    Py_XDECREF(w);
}


/* The following function breaks the notion that strings are immutable:
   it changes the size of a string.  We get away with this only if there
   is only one module referencing the object.  You can also think of it
   as creating a new string object and destroying the old one, only
   more efficiently.  In any case, don't use this if the string may
   already be known to some other part of the code...
   Note that if there's not enough memory to resize the string, the original
   string object at *pv is deallocated, *pv is set to NULL, an "out of
   memory" exception is set, and -1 is returned.  Else (on success) 0 is
   returned, and the value in *pv may or may not be the same as on input.
   As always, an extra byte is allocated for a trailing \0 byte (newsize
   does *not* include that), and a trailing \0 byte is stored.
*/

int
_PyBytes_Resize(PyObject **pv, Py_ssize_t newsize)
{
    register PyObject *v;
    register PyBytesObject *sv;
    v = *pv;
    if (!PyBytes_Check(v) || Py_REFCNT(v) != 1 || newsize < 0) {
        *pv = 0;
        Py_DECREF(v);
        PyErr_BadInternalCall();
        return -1;
    }
    /* XXX UNREF/NEWREF interface should be more symmetrical */
    _Py_DEC_REFTOTAL;
    _Py_ForgetReference(v);
    *pv = (PyObject *)
        PyObject_REALLOC((char *)v, PyBytesObject_SIZE + newsize);
    if (*pv == NULL) {
        PyObject_Del(v);
        PyErr_NoMemory();
        return -1;
    }
    _Py_NewReference(*pv);
    sv = (PyBytesObject *) *pv;
    Py_SIZE(sv) = newsize;
    sv->ob_sval[newsize] = '\0';
    sv->ob_shash = -1;          /* invalidate cached hash value */
    return 0;
}

凭经验进行验证很容易。

$ python -m timeit -s"s=''" "for i in xrange(10):s+='a'"
1000000 loops, best of 3: 1.85 usec per loop
$ python -m timeit -s"s=''" "for i in xrange(100):s+='a'"
10000 loops, best of 3: 16.8 usec per loop
$ python -m timeit -s"s=''" "for i in xrange(1000):s+='a'"
10000 loops, best of 3: 158 usec per loop
$ python -m timeit -s"s=''" "for i in xrange(10000):s+='a'"
1000 loops, best of 3: 1.71 msec per loop
$ python -m timeit -s"s=''" "for i in xrange(100000):s+='a'"
10 loops, best of 3: 14.6 msec per loop
$ python -m timeit -s"s=''" "for i in xrange(1000000):s+='a'"
10 loops, best of 3: 173 msec per loop

不过,请务必注意,此优化不是Python规范的一部分。据我所知,它仅在cPython实现中。例如,对pypy或jython进行的相同经验测试可能会显示较旧的O(n ** 2)性能。

$ pypy -m timeit -s"s=''" "for i in xrange(10):s+='a'"
10000 loops, best of 3: 90.8 usec per loop
$ pypy -m timeit -s"s=''" "for i in xrange(100):s+='a'"
1000 loops, best of 3: 896 usec per loop
$ pypy -m timeit -s"s=''" "for i in xrange(1000):s+='a'"
100 loops, best of 3: 9.03 msec per loop
$ pypy -m timeit -s"s=''" "for i in xrange(10000):s+='a'"
10 loops, best of 3: 89.5 msec per loop

到目前为止一切顺利,但随后,

$ pypy -m timeit -s"s=''" "for i in xrange(100000):s+='a'"
10 loops, best of 3: 12.8 sec per loop

哎呀,甚至比二次还差。因此,pypy可以在短字符串上做得很好,但是在较大的字符串上却表现不佳。

If you only have one reference to a string and you concatenate another string to the end, CPython now special cases this and tries to extend the string in place.

The end result is that the operation is amortized O(n).

e.g.

s = ""
for i in range(n):
    s+=str(i)

used to be O(n^2), but now it is O(n).

From the source (bytesobject.c):

void
PyBytes_ConcatAndDel(register PyObject **pv, register PyObject *w)
{
    PyBytes_Concat(pv, w);
    Py_XDECREF(w);
}


/* The following function breaks the notion that strings are immutable:
   it changes the size of a string.  We get away with this only if there
   is only one module referencing the object.  You can also think of it
   as creating a new string object and destroying the old one, only
   more efficiently.  In any case, don't use this if the string may
   already be known to some other part of the code...
   Note that if there's not enough memory to resize the string, the original
   string object at *pv is deallocated, *pv is set to NULL, an "out of
   memory" exception is set, and -1 is returned.  Else (on success) 0 is
   returned, and the value in *pv may or may not be the same as on input.
   As always, an extra byte is allocated for a trailing \0 byte (newsize
   does *not* include that), and a trailing \0 byte is stored.
*/

int
_PyBytes_Resize(PyObject **pv, Py_ssize_t newsize)
{
    register PyObject *v;
    register PyBytesObject *sv;
    v = *pv;
    if (!PyBytes_Check(v) || Py_REFCNT(v) != 1 || newsize < 0) {
        *pv = 0;
        Py_DECREF(v);
        PyErr_BadInternalCall();
        return -1;
    }
    /* XXX UNREF/NEWREF interface should be more symmetrical */
    _Py_DEC_REFTOTAL;
    _Py_ForgetReference(v);
    *pv = (PyObject *)
        PyObject_REALLOC((char *)v, PyBytesObject_SIZE + newsize);
    if (*pv == NULL) {
        PyObject_Del(v);
        PyErr_NoMemory();
        return -1;
    }
    _Py_NewReference(*pv);
    sv = (PyBytesObject *) *pv;
    Py_SIZE(sv) = newsize;
    sv->ob_sval[newsize] = '\0';
    sv->ob_shash = -1;          /* invalidate cached hash value */
    return 0;
}

It’s easy enough to verify empirically.

$ python -m timeit -s"s=''" "for i in xrange(10):s+='a'"
1000000 loops, best of 3: 1.85 usec per loop
$ python -m timeit -s"s=''" "for i in xrange(100):s+='a'"
10000 loops, best of 3: 16.8 usec per loop
$ python -m timeit -s"s=''" "for i in xrange(1000):s+='a'"
10000 loops, best of 3: 158 usec per loop
$ python -m timeit -s"s=''" "for i in xrange(10000):s+='a'"
1000 loops, best of 3: 1.71 msec per loop
$ python -m timeit -s"s=''" "for i in xrange(100000):s+='a'"
10 loops, best of 3: 14.6 msec per loop
$ python -m timeit -s"s=''" "for i in xrange(1000000):s+='a'"
10 loops, best of 3: 173 msec per loop

It’s important to note, however, that this optimisation isn’t part of the Python spec. As far as I know, it exists only in the CPython implementation. The same empirical testing on PyPy or Jython, for example, might show the older O(n**2) performance.

$ pypy -m timeit -s"s=''" "for i in xrange(10):s+='a'"
10000 loops, best of 3: 90.8 usec per loop
$ pypy -m timeit -s"s=''" "for i in xrange(100):s+='a'"
1000 loops, best of 3: 896 usec per loop
$ pypy -m timeit -s"s=''" "for i in xrange(1000):s+='a'"
100 loops, best of 3: 9.03 msec per loop
$ pypy -m timeit -s"s=''" "for i in xrange(10000):s+='a'"
10 loops, best of 3: 89.5 msec per loop

So far so good, but then,

$ pypy -m timeit -s"s=''" "for i in xrange(100000):s+='a'"
10 loops, best of 3: 12.8 sec per loop

Ouch, even worse than quadratic. So PyPy is doing something that works well with short strings but performs poorly for larger strings.
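The same comparison can be repeated on any interpreter with a short Python 3 script. This is only a sketch: the function names concat_plus and concat_join and the chosen sizes are assumptions, and the timings will differ between machines and implementations.

```python
import timeit


def concat_plus(n):
    """Build a string of n characters with repeated +=."""
    s = ""
    for _ in range(n):
        s += "a"
    return s


def concat_join(n):
    """Build the same string by collecting parts and joining once."""
    parts = []
    for _ in range(n):
        parts.append("a")
    return "".join(parts)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        t_plus = timeit.timeit(lambda: concat_plus(n), number=10)
        t_join = timeit.timeit(lambda: concat_join(n), number=10)
        print(f"n={n:>7}: += {t_plus:.4f}s  join {t_join:.4f}s")
```

If the += column grows roughly tenfold per row, the in-place optimisation is in effect; much faster growth suggests quadratic copying.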


回答 1

不要过早优化。如果您没有理由相信字符串连接会造成速度瓶颈,那么请坚持使用+and +=

s  = 'foo'
s += 'bar'
s += 'baz'

就是说,如果您的目标是Java的StringBuilder之类的东西,那么规范的Python习惯用法就是将项目添加到列表中,然后最后str.join将它们全部串联起来:

l = []
l.append('foo')
l.append('bar')
l.append('baz')

s = ''.join(l)

Don’t prematurely optimize. If you have no reason to believe there’s a speed bottleneck caused by string concatenations then just stick with + and +=:

s  = 'foo'
s += 'bar'
s += 'baz'

That said, if you’re aiming for something like Java’s StringBuilder, the canonical Python idiom is to add items to a list and then use str.join to concatenate them all at the end:

l = []
l.append('foo')
l.append('bar')
l.append('baz')

s = ''.join(l)

回答 2

str1 = "Hello"
str2 = "World"
newstr = " ".join((str1, str2))

这将str1和str2加上一个空格作为分隔符。您也可以"".join(str1, str2, ...)str.join()需要迭代,因此您必须将字符串放入列表或元组中。

这与内置方法一样高效。

str1 = "Hello"
str2 = "World"
newstr = " ".join((str1, str2))

That joins str1 and str2 with a space as the separator. You can also do "".join((str1, str2, ...)). Note that str.join() takes a single iterable, so you have to put the strings in a list or a tuple.

That’s about as efficient as it gets for a builtin method.
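A short sketch of the point about the single iterable argument (the third string and the variable names are illustrative assumptions):

```python
str1, str2, str3 = "Hello", "World", "!"

# join() takes exactly one iterable, here a tuple and a list.
joined = "".join((str1, str2, str3))
spaced = " ".join([str1, str2])

# Passing the strings as separate arguments raises TypeError.
raised = False
try:
    "".join(str1, str2)
except TypeError:
    raised = True

print(joined, spaced, raised)
```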


回答 3

别。

也就是说,在大多数情况下,最好一次性生成整个字符串,而不是附加到现有字符串。

例如,不要: obj1.name + ":" + str(obj1.count)

相反:使用 "%s:%d" % (obj1.name, obj1.count)

这将更容易阅读和更有效。

Don’t.

That is, in most cases you are better off generating the whole string in one go rather than appending to an existing string.

For example, don’t do: obj1.name + ":" + str(obj1.count)

Instead: use "%s:%d" % (obj1.name, obj1.count)

That will be easier to read and more efficient.
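For comparison, here is a sketch of the same string built several ways. obj1 is not defined in the answer, so a stand-in object with hypothetical name and count attributes is assumed here:

```python
from types import SimpleNamespace

# Stand-in for the answer's obj1; the attribute values are made up.
obj1 = SimpleNamespace(name="widget", count=3)

concatenated = obj1.name + ":" + str(obj1.count)       # piecewise concatenation
percent_style = "%s:%d" % (obj1.name, obj1.count)      # the form recommended above
format_style = "{}:{}".format(obj1.name, obj1.count)   # str.format equivalent
fstring_style = f"{obj1.name}:{obj1.count}"            # f-string, Python 3.6+

print(percent_style)
```

All four produce the same text; the formatted variants state the final shape of the string in one place.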


回答 4

Python 3.6为我们提供了f字符串,这很令人高兴:

var1 = "foo"
var2 = "bar"
var3 = f"{var1}{var2}"
print(var3)                       # prints foobar

您可以在花括号内执行大多数操作

print(f"1 + 1 == {1 + 1}")        # prints 1 + 1 == 2

Python 3.6 gives us f-strings, which are a delight:

var1 = "foo"
var2 = "bar"
var3 = f"{var1}{var2}"
print(var3)                       # prints foobar

You can do almost anything inside the curly braces:

print(f"1 + 1 == {1 + 1}")        # prints 1 + 1 == 2

回答 5

如果需要执行许多附加操作来构建大字符串,则可以使用StringIO或cStringIO。界面就像一个文件。即:您write在其上附加文本。

如果您只是追加两个字符串,请使用+

If you need to do many append operations to build a large string, you can use StringIO or cStringIO (io.StringIO in Python 3). The interface is like a file, i.e. you write to it to append text.

If you’re just appending two strings then just use +.
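A minimal sketch of the file-like approach, assuming Python 3 where the class lives in the io module (the loop contents are made up for illustration):

```python
import io

buf = io.StringIO()
for i in range(5):
    # write() appends to the in-memory buffer, like writing to a file.
    buf.write(str(i))
    buf.write(",")

result = buf.getvalue()  # retrieve everything written so far
buf.close()

print(result)
```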


回答 6

这实际上取决于您的应用程序。如果您要遍历数百个单词并将其全部添加到列表中,.join()那就更好了。但是,如果要把很长的句子放在一起,最好使用+=

It really depends on your application. If you’re looping through hundreds of words and want to append them all into a list, .join() is better. But if you’re putting together a long sentence, you’re better off using +=.


回答 7

基本上没有区别。唯一一致的趋势是,每个版本的Python似乎都变得越来越慢… :(


List

%%timeit
x = []
for i in range(100000000):  # xrange on Python 2.7
    x.append('a')
x = ''.join(x)

Python 2.7

1 loop, best of 3: 7.34 s per loop

Python 3.4

1 loop, best of 3: 7.99 s per loop

Python 3.5

1 loop, best of 3: 8.48 s per loop

Python 3.6

1 loop, best of 3: 9.93 s per loop


String

%%timeit
x = ''
for i in range(100000000):  # xrange on Python 2.7
    x += 'a'

Python 2.7

1 loop, best of 3: 7.41 s per loop

Python 3.4

1 loop, best of 3: 9.08 s per loop

Python 3.5

1 loop, best of 3: 8.82 s per loop

Python 3.6

1 loop, best of 3: 9.24 s per loop

Basically, no difference. The only consistent trend is that Python seems to be getting slower with every version… :(


List

%%timeit
x = []
for i in range(100000000):  # xrange on Python 2.7
    x.append('a')
x = ''.join(x)

Python 2.7

1 loop, best of 3: 7.34 s per loop

Python 3.4

1 loop, best of 3: 7.99 s per loop

Python 3.5

1 loop, best of 3: 8.48 s per loop

Python 3.6

1 loop, best of 3: 9.93 s per loop


String

%%timeit
x = ''
for i in range(100000000):  # xrange on Python 2.7
    x += 'a'

Python 2.7

1 loop, best of 3: 7.41 s per loop

Python 3.4

1 loop, best of 3: 9.08 s per loop

Python 3.5

1 loop, best of 3: 8.82 s per loop

Python 3.6

1 loop, best of 3: 9.24 s per loop


回答 8

__add__函数追加字符串

str1 = "Hello"   # avoid shadowing the built-in str
str2 = " World"
st = str1.__add__(str2)
print(st)

输出量

Hello World

Append strings with the __add__ method (this is what the + operator calls under the hood):

str1 = "Hello"   # avoid shadowing the built-in str
str2 = " World"
st = str1.__add__(str2)
print(st)

Output

Hello World

回答 9

a='foo'
b='baaz'

a.__add__(b)

out: 'foobaaz'

TypeError:需要类似字节的对象,而在Python3中写入文件时不是’str’

问题:TypeError:需要类似字节的对象,而在Python3中写入文件时不是’str’

我最近已经迁移到Py 3.5。这段代码在Python 2.7中正常工作:

with open(fname, 'rb') as f:
    lines = [x.strip() for x in f.readlines()]

for line in lines:
    tmp = line.strip().lower()
    if 'some-pattern' in tmp: continue
    # ... code

升级到3.5后,我得到了:

TypeError: a bytes-like object is required, not 'str'

最后一行错误(模式搜索代码)。

我试过使用.decode()语句两侧的函数,也尝试过:

if tmp.find('some-pattern') != -1: continue

-无济于事。

我能够很快解决几乎所有的2:3问题,但是这个小小的声明困扰着我。

I’ve very recently migrated to Py 3.5. This code was working properly in Python 2.7:

with open(fname, 'rb') as f:
    lines = [x.strip() for x in f.readlines()]

for line in lines:
    tmp = line.strip().lower()
    if 'some-pattern' in tmp: continue
    # ... code

After upgrading to 3.5, I’m getting the:

TypeError: a bytes-like object is required, not 'str'

error on the last line (the pattern search code).

I’ve tried using the .decode() function on either side of the statement, also tried:

if tmp.find('some-pattern') != -1: continue

– to no avail.

I was able to resolve almost all 2:3 issues quickly, but this little statement is bugging me.


回答 0

您以二进制模式打开文件:

with open(fname, 'rb') as f:

这意味着从文件读取的所有数据都作为bytes对象而不是作为对象返回str。然后,您不能在收容测试中使用字符串:

if 'some-pattern' in tmp: continue

您必须改为使用一个bytes对象进行测试tmp

if b'some-pattern' in tmp: continue

或以文本文件形式打开文件,而不是将'rb'模式替换为'r'

You opened the file in binary mode:

with open(fname, 'rb') as f:

This means that all data read from the file is returned as bytes objects, not str. You cannot then use a string in a containment test:

if 'some-pattern' in tmp: continue

You’d have to use a bytes object to test against tmp instead:

if b'some-pattern' in tmp: continue

or open the file as a text file instead by replacing the 'rb' mode with 'r'.
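Both options can be tried side by side with a throwaway file. This is a sketch under stated assumptions: the temporary file, its contents, and the names byte_lines and text_lines are invented for the demonstration.

```python
import os
import tempfile

# Create a small sample file to read back (contents are made up).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("first line\nhas some-pattern here\n")
    fname = tmp.name

# Binary mode yields bytes objects, so the needle must be bytes too.
with open(fname, 'rb') as f:
    byte_lines = [x.strip() for x in f]
bytes_hits = [line for line in byte_lines if b'some-pattern' in line]

# Text mode yields str objects, so a plain string literal works.
with open(fname, 'r', encoding='utf-8') as f:
    text_lines = [x.strip() for x in f]
text_hits = [line for line in text_lines if 'some-pattern' in line]

os.remove(fname)
print(bytes_hits, text_hits)
```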


回答 1

您可以使用以下方式对字符串进行编码 .encode()

例:

'Hello World'.encode()

You can encode your string by using .encode()

Example:

'Hello World'.encode()

回答 2

就像已经提到的一样,您正在以二进制模式读取文件,然后创建字节列表。在下面的for循环中,您将字符串与字节进行比较,这就是代码失败的地方。

在将字节添加到列表时对字节进行解码应该可以。更改后的代码应如下所示:

with open(fname, 'rb') as f:
    lines = [x.decode('utf8').strip() for x in f.readlines()]

字节类型是在Python 3中引入的,这就是为什么您的代码在Python 2中可以工作的原因。在Python 2中,没有字节的数据类型:

>>> s=bytes('hello')
>>> type(s)
<type 'str'>

Like it has been already mentioned, you are reading the file in binary mode and then creating a list of bytes. In your following for loop you are comparing string to bytes and that is where the code is failing.

Decoding the bytes while adding to the list should work. The changed code should look as follows:

with open(fname, 'rb') as f:
    lines = [x.decode('utf8').strip() for x in f.readlines()]

A distinct bytes type was introduced in Python 3, which is why your code worked in Python 2. In Python 2 there was no separate data type for bytes; bytes was simply an alias for str:

>>> s=bytes('hello')
>>> type(s)
<type 'str'>

回答 3

您必须从wb更改为w:

def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'wb')) 
    self.myCsv.writerow(['title', 'link'])

def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'w'))
    self.myCsv.writerow(['title', 'link'])

更改此设置后,错误消失,但是您无法写入文件(以我为例)。毕竟,我没有答案吗?

来源:如何删除^ M

更改为“ rb”会给我带来另一个错误:io.UnsupportedOperation:写入

You have to change from wb to w:

def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'wb')) 
    self.myCsv.writerow(['title', 'link'])

to

def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'w'))
    self.myCsv.writerow(['title', 'link'])

After changing this, the error disappears, but you can’t write to the file (in my case). So after all, I don’t have an answer?

Source: How to remove ^M

Changing to ‘rb’ brings me the other error: io.UnsupportedOperation: write
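For what it is worth, the Python 3 csv documentation recommends opening the file in text mode with newline=''. A minimal sketch follows, assuming a temporary directory and a made-up data row; it is not the answer author's code:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'Item.csv')

# Text mode ('w', not 'wb'); newline='' lets the csv module control line endings.
with open(path, 'w', newline='', encoding='utf-8') as fh:
    writer = csv.writer(fh)
    writer.writerow(['title', 'link'])
    writer.writerow(['Example', 'https://example.com'])  # illustrative row

# Read the file back to confirm the rows were written.
with open(path, newline='', encoding='utf-8') as fh:
    rows = list(csv.reader(fh))

print(rows)
```

Using a with block also guarantees the file is flushed and closed, which an open() call passed straight to csv.writer does not.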


回答 4

对于这个小例子:import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if ( len(data) < 1 ) :
        break
    print (data);

mysock.close()

在’GET http://www.py4inf.com/code/romeo.txt HTTP / 1.0 \ n \ n’ 之前添加“ b” 解决了我的问题

For this small example:

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if ( len(data) < 1 ) :
        break
    print (data);

mysock.close()

Adding the “b” before ‘GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n’ solved my problem.


回答 5

与单引号中给出的硬编码字符串值一起使用encode()函数。

例如:

file.write(answers[i] + '\n'.encode())

要么

line.split(' +++$+++ '.encode())

Call encode() on the hardcoded string literal so that it becomes bytes.

Ex:

file.write(answers[i] + '\n'.encode())

OR

line.split(' +++$+++ '.encode())

回答 6

您以二进制模式打开文件:

以下代码将引发TypeError:需要一个类似字节的对象,而不是’str’。

for line in lines:
    print(type(line))# <class 'bytes'>
    if 'substring' in line:
       print('success')

以下代码将起作用-您必须使用decode()函数:

for line in lines:
    line = line.decode()
    print(type(line))# <class 'str'>
    if 'substring' in line:
       print('success')

You opened the file in binary mode:

The following code will throw a TypeError: a bytes-like object is required, not ‘str’.

for line in lines:
    print(type(line))# <class 'bytes'>
    if 'substring' in line:
       print('success')

The following code will work – you have to use the decode() function:

for line in lines:
    line = line.decode()
    print(type(line))# <class 'str'>
    if 'substring' in line:
       print('success')

回答 7

为什么不尝试以文本形式打开文件?

with open(fname, 'rt') as f:
    lines = [x.strip() for x in f.readlines()]

此外,以下是官方页面上python 3.x的链接:https : //docs.python.org/3/library/io.html 这是开放功能:https : //docs.python.org/3 /library/functions.html#open

如果您确实想将其作为二进制文件处理,则考虑对字符串进行编码。

Why not try opening your file as text?

with open(fname, 'rt') as f:
    lines = [x.strip() for x in f.readlines()]

Additionally here is a link for python 3.x on the official page: https://docs.python.org/3/library/io.html And this is the open function: https://docs.python.org/3/library/functions.html#open

If you are really trying to handle it as a binary then consider encoding your string.


回答 8

当我尝试将char(或字符串)转换为时,出现此错误bytes,代码在Python 2.7中是这样的:

# -*- coding: utf-8 -*-
print( bytes('ò') )

这是Python 2.7处理Unicode字符的方式。

这在Python 3.6中不起作用,因为bytes需要一个额外的参数来编码,但这可能有点棘手,因为不同的编码可能会输出不同的结果:

print( bytes('ò', 'iso_8859_1') ) # prints: b'\xf2'
print( bytes('ò', 'utf-8') ) # prints: b'\xc3\xb2'

就我而言,我不得不使用 iso_8859_1在对字节进行编码时来解决问题。

希望这对某人有帮助。

I got this error when I was trying to convert a char (or string) to bytes, the code was something like this with Python 2.7:

# -*- coding: utf-8 -*-
print( bytes('ò') )

This is how Python 2.7 handles Unicode characters.

This won’t work in Python 3.6, since bytes requires an extra argument for the encoding. This can be a little tricky, because different encodings may produce different results:

print( bytes('ò', 'iso_8859_1') ) # prints: b'\xf2'
print( bytes('ò', 'utf-8') ) # prints: b'\xc3\xb2'

In my case I had to use iso_8859_1 when encoding bytes in order to solve the issue.

Hope this helps someone.


检查Python列表项是否在另一个字符串中包含一个字符串

问题:检查Python列表项是否在另一个字符串中包含一个字符串

我有一个清单:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']

并要搜索包含字符串的项目'abc'。我怎样才能做到这一点?

if 'abc' in my_list:

会检查是否'abc'存在在列表中,但它的一部分'abc-123''abc-456''abc'对自己不存在。那么,如何获得包含的所有物品'abc'

I have a list:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']

and want to search for items that contain the string 'abc'. How can I do that?

if 'abc' in my_list:

would only check whether 'abc' exists as an item in the list. Here 'abc' is merely a part of 'abc-123' and 'abc-456'; it does not exist on its own. So how can I get all items that contain 'abc'?


回答 0

如果您只想检查abc列表中是否存在任何字符串,则可以尝试

some_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
if any("abc" in s for s in some_list):
    # whatever

如果您确实要获取包含的所有项目abc,请使用

matching = [s for s in some_list if "abc" in s]

If you only want to check for the presence of abc in any string in the list, you could try

some_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
if any("abc" in s for s in some_list):
    # whatever

If you really want to get all the items containing abc, use

matching = [s for s in some_list if "abc" in s]
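Put together as a runnable sketch (the names has_abc and exact are illustrative additions), the difference between the membership test and the substring tests looks like this:

```python
some_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']

# Membership test: is there an item exactly equal to 'abc'?
exact = 'abc' in some_list

# Substring test over the items: does any item contain 'abc'?
has_abc = any("abc" in s for s in some_list)

# Collect every item that contains 'abc'.
matching = [s for s in some_list if "abc" in s]

print(exact, has_abc, matching)
```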

回答 1

只是丢掉它:如果您碰巧需要与多个字符串匹配,例如abcdef,则可以按如下方式组合两种理解:

matchers = ['abc','def']
matching = [s for s in my_list if any(xs in s for xs in matchers)]

输出:

['abc-123', 'def-456', 'abc-456']

Just throwing this out there: if you happen to need to match against more than one string, for example abc and def, you can combine two comprehensions as follows:

matchers = ['abc','def']
matching = [s for s in my_list if any(xs in s for xs in matchers)]

Output:

['abc-123', 'def-456', 'abc-456']
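
If you need items that contain every one of the substrings rather than at least one, swapping any for all should do it. A minimal sketch, with a different matchers list chosen purely for illustration:

```python
my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
matchers = ['abc', '456']  # hypothetical pair, picked to make the two results differ

# Items containing at least one of the substrings.
match_any = [s for s in my_list if any(m in s for m in matchers)]

# Items containing all of the substrings.
match_all = [s for s in my_list if all(m in s for m in matchers)]

print(match_any)
print(match_all)
```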

Answer 2

Use filter to get the elements that contain abc. (On Python 3, filter returns an iterator, so wrap it in list() to print the items.)

>>> lst = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
>>> print(list(filter(lambda x: 'abc' in x, lst)))
['abc-123', 'abc-456']

You can also use a list comprehension.

>>> [x for x in lst if 'abc' in x]

By the way, don’t use the word list as a variable name since it is already used for the list type.


Answer 3

If you just need to know whether 'abc' is in one of the items, this is the shortest way:

if 'abc' in str(my_list):
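
One caveat worth keeping in mind: str(my_list) is the list's repr, so it also contains the quotes, commas and brackets between items. The sketch below (the needle is contrived for illustration) shows a search string that matches the repr even though no single item contains it:

```python
my_list = ['abc-123', 'def-456']

# str(my_list) is the list's repr: "['abc-123', 'def-456']"
as_text = str(my_list)

# A needle that spans the separator between two items matches the repr...
needle = "123', 'def"
in_repr = needle in as_text

# ...although no single item contains it.
in_any_item = any(needle in s for s in my_list)

print(in_repr, in_any_item)
```

For plain alphanumeric search strings like 'abc' this does not come up, but any(...) avoids the question entirely.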

Answer 4

This is quite an old question, but I offer this answer because the previous answers do not cope with items in the list that are not strings (or some kind of iterable object). Such items would cause the entire list comprehension to fail with an exception.

To deal with such items gracefully by skipping the non-iterable ones, use the following (after import collections.abc; the older collections.Iterable alias was removed in Python 3.10):

[el for el in lst if isinstance(el, collections.abc.Iterable) and (st in el)]

Then, with a list such as:

lst = [None, 'abc-123', 'def-456', 'ghi-789', 'abc-456', 123]
st = 'abc'

you will still get the matching items (['abc-123', 'abc-456']).

The test for iterability may not be the best one. I took it from the question "In Python, how do I determine if an object is iterable?"
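
If the goal is specifically substring matching, a stricter check may be simpler. The sketch below is a swapped-in variant, not the approach from this answer: it keeps only str items, which also skips nested lists (where st in el would test membership instead of substrings) and bytes objects (where the test raises TypeError on Python 3). The extra list entries are added for illustration:

```python
lst = [None, 'abc-123', 'def-456', 'ghi-789', 'abc-456', 123, ['abc'], b'abc']
st = 'abc'

# Only real strings are tested; everything else is skipped.
matching = [el for el in lst if isinstance(el, str) and st in el]

print(matching)
```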


Answer 5

x = 'aaa'
L = ['aaa-12', 'bbbaaa', 'cccaa']
res = [y for y in L if x in y]

Answer 6

for item in my_list:
    if item.find("abc") != -1:
        print(item)

Answer 7

any('abc' in item for item in my_list)

Answer 8

Use the __contains__() method of Python's string class (the in operator calls this method for you, so "abc" in i is the more usual spelling):

a = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
for i in a:
    if i.__contains__("abc") :
        print(i, " is containing")

Answer 9

I am new to Python. I got the code below working and made it easy to understand:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
# named 'item' rather than 'str', which would shadow the built-in type
for item in my_list:
    if 'abc' in item:
        print(item)

Answer 10

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']

for item in my_list:
    if (item.find('abc')) != -1:
        print ('Found at ', item)

Answer 11

import re

mylist = ['abc', 'def', 'ghi', 'abc']

pattern = re.compile(r'abc')

# findall() expects a string, not a list, so search each item in turn
[s for s in mylist if pattern.search(s)]

Answer 12

I wrote a search that asks for a value and then prints every list item that contains it:

my_list = ['abc-123',
        'def-456',
        'ghi-789',
        'abc-456'
        ]

imp = input('Search item: ')  # raw_input on Python 2

for item in my_list:
    if imp in item:
        print(item)

Try searching for 'abc'.


Answer 13

def find_dog(new_ls):
    splt = new_ls.split()
    if 'dog' in splt:
        print("True")
    else:
        print('False')


find_dog("Is there a dog here?")

Answer 14

I needed the list indices that correspond to a match as follows:

lst=['abc-123', 'def-456', 'ghi-789', 'abc-456']

[n for n, x in enumerate(lst) if 'abc' in x]

output

[0, 3]

Answer 15

Task: get the items that contain abc.

a = ['abc-123', 'def-456', 'ghi-789', 'abc-456']

aa = [string for string in a if "abc" in string]
print(aa)

Output: ['abc-123', 'abc-456']

Answer 16

As far as I know, a for loop always costs time, and the execution time grows with the length of the list.

I think joining the list into a single string and searching it with the in operator is a bit faster.

In [1]: t = ["abc_%s" % number for number in range(10000)]

In [2]: %timeit any("9999" in string for string in t)
1000 loops, best of 3: 420 µs per loop

In [3]: %timeit "9999" in ",".join(t)
10000 loops, best of 3: 103 µs per loop

That said, I agree that the any() version is more readable.
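
Outside IPython, the same comparison can be sketched with the standard timeit module. The helper names with_any and with_join are invented here, and the numbers will differ by machine and Python version, so no particular ratio is implied:

```python
import timeit

t = ["abc_%s" % number for number in range(10000)]

def with_any():
    # Test each item separately, stopping at the first hit.
    return any("9999" in s for s in t)

def with_join():
    # Build one big string and search it once.
    return "9999" in ",".join(t)

# Both approaches agree for this search string.
assert with_any() == with_join()

print("any :", timeit.timeit(with_any, number=200))
print("join:", timeit.timeit(with_join, number=200))
```

Note that the joined version can report a match that spans two items if the search string contains the separator, so the two are only interchangeable when that cannot happen.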