标签归档:anti-patterns

有什么理由不使用’+’连接两个字符串吗?

问题:有什么理由不使用’+’连接两个字符串吗?

Python中常见的反模式是+在循环中使用串联字符串序列。这很不好,因为Python解释器必须为每次迭代创建一个新的字符串对象,并且最终要花费二次时间。(在某些情况下,最新版本的CPython显然可以优化此功能,但其他实现则不能,因此建议程序员不要依赖此功能。)''.join是执行此操作的正确方法。

但是,我听说它说过(包括Stack Overflow上的内容),您永远都不要将它+用于字符串连接,而应该始终使用''.join或格式字符串。我不明白为什么只连接两个字符串会出现这种情况。如果我的理解是正确的,则不应该花费二次时间,而且我认为a + b''.join((a, b))或更加简洁易读'%s%s' % (a, b)

+串联两个字符串是否是一种好习惯?还是有我不知道的问题?

A common antipattern in Python is to concatenate a sequence of strings using + in a loop. This is bad because the Python interpreter has to create a new string object for each iteration, and it ends up taking quadratic time. (Recent versions of CPython can apparently optimize this in some cases, but other implementations can’t, so programmers are discouraged from relying on this.) ''.join is the right way to do this.

However, I’ve heard it said (including here on Stack Overflow) that you should never, ever use + for string concatenation, but instead always use ''.join or a format string. I don’t understand why this is the case if you’re only concatenating two strings. If my understanding is correct, it shouldn’t take quadratic time, and I think a + b is cleaner and more readable than either ''.join((a, b)) or '%s%s' % (a, b).

Is it good practice to use + to concatenate two strings? Or is there a problem I’m not aware of?


回答 0

两个字符串与连接在一起没有错+。确实,它比容易阅读''.join([a, b])

您是对的,尽管用2个以上的字符串进行连接+是O(n ^ 2)操作(与相比,O(n)join)因此效率低下。但是,这与使用循环无关。偶数a + b + c + ...为O(n ^ 2),原因是每个串联产生一个新的字符串。

CPython2.4及更高版本试图缓解这种情况,但是join在连接两个以上的字符串时仍然建议使用。

There is nothing wrong in concatenating two strings with +. Indeed it’s easier to read than ''.join([a, b]).

You are right though that concatenating more than 2 strings with + is an O(n^2) operation (compared to O(n) for join) and thus becomes inefficient. However this has not to do with using a loop. Even a + b + c + ... is O(n^2), the reason being that each concatenation produces a new string.

CPython2.4 and above try to mitigate that, but it’s still advisable to use join when concatenating more than 2 strings.


回答 1

加号运算符是连接两个 Python字符串的完美解决方案。但是,如果您继续添加两个以上的字符串(n> 25),则可能需要考虑其他问题。

''.join([a, b, c]) 技巧是性能优化。

Plus operator is perfectly fine solution to concatenate two Python strings. But if you keep adding more than two strings (n > 25) , you might want to think something else.

''.join([a, b, c]) trick is a performance optimization.


回答 2

假设永远不要使用+进行字符串连接,而始终使用”.join可能是一个神话。的确,使用+会创建不必要的不​​可变字符串对象的临时副本,但另一个经常引用的事实是,join在循环中调用通常会增加的开销function call。让我们举个例子。

创建两个列表,一个来自链接的SO问题,另一个列表更大

>>> myl1 = ['A','B','C','D','E','F']
>>> myl2=[chr(random.randint(65,90)) for i in range(0,10000)]

让我们创建两个函数,UseJoinUsePlus分别使用join+功能。

>>> def UsePlus():
    return [myl[i] + myl[i + 1] for i in range(0,len(myl), 2)]

>>> def UseJoin():
    [''.join((myl[i],myl[i + 1])) for i in range(0,len(myl), 2)]

让timeit与第一个列表一起运行

>>> myl=myl1
>>> t1=timeit.Timer("UsePlus()","from __main__ import UsePlus")
>>> t2=timeit.Timer("UseJoin()","from __main__ import UseJoin")
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=100000)/100000)
2.48 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=100000)/100000)
2.61 usec/pass
>>> 

它们具有几乎相同的运行时。

让我们使用cProfile

>>> myl=myl2
>>> cProfile.run("UsePlus()")
         5 function calls in 0.001 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <pyshell#1376>:1(UsePlus)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}


>>> cProfile.run("UseJoin()")
         5005 function calls in 0.029 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.029    0.029 <pyshell#1388>:1(UseJoin)
        1    0.000    0.000    0.029    0.029 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     5000    0.014    0.000    0.014    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {range}

而且看起来使用Join会导致不必要的函数调用,这可能会增加开销。

现在回到问题。在所有情况下都应该不鼓励使用+over join吗?

我相信不,应该考虑

  1. 所讨论字符串的长度
  2. 串联操作数。

在开发中过早地进行优化是不明智的。

The assumption that one should never, ever use + for string concatenation, but instead always use ”.join may be a myth. It is true that using + creates unnecessary temporary copies of immutable string object but the other not oft quoted fact is that calling join in a loop would generally add the overhead of function call. Lets take your example.

Create two lists, one from the linked SO question and another a bigger fabricated

>>> myl1 = ['A','B','C','D','E','F']
>>> myl2=[chr(random.randint(65,90)) for i in range(0,10000)]

Lets create two functions, UseJoin and UsePlus to use the respective join and + functionality.

>>> def UsePlus():
    return [myl[i] + myl[i + 1] for i in range(0,len(myl), 2)]

>>> def UseJoin():
    [''.join((myl[i],myl[i + 1])) for i in range(0,len(myl), 2)]

Lets run timeit with the first list

>>> myl=myl1
>>> t1=timeit.Timer("UsePlus()","from __main__ import UsePlus")
>>> t2=timeit.Timer("UseJoin()","from __main__ import UseJoin")
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=100000)/100000)
2.48 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=100000)/100000)
2.61 usec/pass
>>> 

They have almost the same runtime.

Lets use cProfile

>>> myl=myl2
>>> cProfile.run("UsePlus()")
         5 function calls in 0.001 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <pyshell#1376>:1(UsePlus)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}


>>> cProfile.run("UseJoin()")
         5005 function calls in 0.029 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.029    0.029 <pyshell#1388>:1(UseJoin)
        1    0.000    0.000    0.029    0.029 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     5000    0.014    0.000    0.014    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {range}

And it looks that using Join, results in unnecessary function calls which could add to the overhead.

Now coming back to the question. Should one discourage the use of + over join in all cases?

I believe no, things should be taken into consideration

  1. Length of the String in Question
  2. No of Concatenation Operation.

And off-course in a development pre-mature optimization is evil.


回答 3

与多个人一起工作时,有时很难确切知道正在发生什么。使用格式字符串而不是连接可以避免对我们造成无数次特定烦恼:

说,一个函数需要一个参数,然后编写它以获取字符串:

In [1]: def foo(zeta):
   ...:     print 'bar: ' + zeta

In [2]: foo('bang')
bar: bang

因此,在整个代码中可能经常使用此功能。您的同事可能确切知道它的功能,但不一定完全了解内部功能,并且可能不知道该函数需要一个字符串。因此,他们最终可能会这样:

In [3]: foo(23)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/izkata/<ipython console> in <module>()

/home/izkata/<ipython console> in foo(zeta)

TypeError: cannot concatenate 'str' and 'int' objects

如果您只使用格式字符串,将没有问题:

In [1]: def foo(zeta):
   ...:     print 'bar: %s' % zeta
   ...:     
   ...:     

In [2]: foo('bang')
bar: bang

In [3]: foo(23)
bar: 23

对于所有定义了的对象,__str__也可以传入:

In [1]: from datetime import date

In [2]: zeta = date(2012, 4, 15)

In [3]: print 'bar: ' + zeta
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/izkata/<ipython console> in <module>()

TypeError: cannot concatenate 'str' and 'datetime.date' objects

In [4]: print 'bar: %s' % zeta
bar: 2012-04-15

所以可以:如果您可以使用格式字符串,充分利用Python所提供的功能。

When working with multiple people, it’s sometimes difficult to know exactly what’s happening. Using a format string instead of concatenation can avoid one particular annoyance that’s happened a whole ton of times to us:

Say, a function requires an argument, and you write it expecting to get a string:

In [1]: def foo(zeta):
   ...:     print 'bar: ' + zeta

In [2]: foo('bang')
bar: bang

So, this function may be used pretty often throughout the code. Your coworkers may know exactly what it does, but not necessarily be fully up-to-speed on the internals, and may not know that the function expects a string. And so they may end up with this:

In [3]: foo(23)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/izkata/<ipython console> in <module>()

/home/izkata/<ipython console> in foo(zeta)

TypeError: cannot concatenate 'str' and 'int' objects

There would be no problem if you just used a format string:

In [1]: def foo(zeta):
   ...:     print 'bar: %s' % zeta
   ...:     
   ...:     

In [2]: foo('bang')
bar: bang

In [3]: foo(23)
bar: 23

The same is true for all types of objects that define __str__, which may be passed in as well:

In [1]: from datetime import date

In [2]: zeta = date(2012, 4, 15)

In [3]: print 'bar: ' + zeta
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/izkata/<ipython console> in <module>()

TypeError: cannot concatenate 'str' and 'datetime.date' objects

In [4]: print 'bar: %s' % zeta
bar: 2012-04-15

So yes: If you can use a format string do it and take advantage of what Python has to offer.


回答 4

我做了一个快速测试:

import sys

str = e = "a xxxxxxxxxx very xxxxxxxxxx long xxxxxxxxxx string xxxxxxxxxx\n"

for i in range(int(sys.argv[1])):
    str = str + e

并定时:

mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py  8000000
8000000 times

real    0m2.165s
user    0m1.620s
sys     0m0.540s
mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py  16000000
16000000 times

real    0m4.360s
user    0m3.480s
sys     0m0.870s

显然有针对此a = a + b情况的优化。它没有表现出人们可能会怀疑的O(n ^ 2)时间。

因此,至少在性能方面,使用+还不错。

I have done a quick test:

import sys

str = e = "a xxxxxxxxxx very xxxxxxxxxx long xxxxxxxxxx string xxxxxxxxxx\n"

for i in range(int(sys.argv[1])):
    str = str + e

and timed it:

mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py  8000000
8000000 times

real    0m2.165s
user    0m1.620s
sys     0m0.540s
mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py  16000000
16000000 times

real    0m4.360s
user    0m3.480s
sys     0m0.870s

There is apparently an optimisation for the a = a + b case. It does not exhibit O(n^2) time as one might suspect.

So at least in terms of performance, using + is fine.


回答 5

根据Python文档,使用str.join()将为您提供各种Python实现的性能一致性。尽管CPython优化了s = s + t的二次行为,但其他Python实现可能没有。

CPython实现细节:如果s和t都是字符串,则某些Python实现(例如CPython)通常可以对s = s + t或s + = t形式的赋值执行就地优化。如果适用,此优化将使二次运行的可能性大大降低。此优化取决于版本和实现。对于性能敏感的代码,最好使用str.join()方法,以确保各个版本和实现之间一致的线性串联性能。

Python文档中的序列类型(请参见脚注[6])

According to Python docs, using str.join() will give you performance consistence across various implementations of Python. Although CPython optimizes away the quadratic behavior of s = s + t, other Python implementations may not.

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

Sequence Types in Python docs (see the foot note [6])


回答 6

我在python 3.8中使用以下内容

string4 = f'{string1}{string2}{string3}'

I use the following with python 3.8

string4 = f'{string1}{string2}{string3}'

回答 7

”.join([a,b])+更好。

因为应该以不损害Python其他实现(PyPy,Jython,IronPython,Cython,Psyco等)的方式编写代码

形式a + = b或a = a + b即使在CPython中也很脆弱,并且在不使用 引用计数的 实现中根本不存在(引用计数是一种存储引用,指针或对a的句柄的技术资源,例如对象,内存块,磁盘空间或其他资源

https://www.python.org/dev/peps/pep-0008/#programming-recommendations

”.join([a, b]) is better solution than +.

Because Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such)

form a += b or a = a + b is fragile even in CPython and isn’t present at all in implementations that don’t use refcounting (reference counting is a technique of storing the number of references, pointers, or handles to a resource such as an object, block of memory, disk space or other resource)

https://www.python.org/dev/peps/pep-0008/#programming-recommendations