Tag Archives: concat

Concatenate a list of pandas DataFrames together

Question: Concatenate a list of pandas DataFrames together

I have a list of Pandas dataframes that I would like to combine into one Pandas dataframe. I am using Python 2.7.10 and Pandas 0.16.2.

I created the list of dataframes from:

import pandas as pd
dfs = []
sqlall = "select * from mytable"

for chunk in pd.read_sql_query(sqlall, cnxn, chunksize=10000):
    dfs.append(chunk)

This returns a list of dataframes:

type(dfs[0])
Out[6]: pandas.core.frame.DataFrame

type(dfs)
Out[7]: list

len(dfs)
Out[8]: 408

Here is some sample data:

# sample dataframes
d1 = pd.DataFrame({'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]})
d2 = pd.DataFrame({'one' : [5., 6., 7., 8.], 'two' : [9., 10., 11., 12.]})
d3 = pd.DataFrame({'one' : [15., 16., 17., 18.], 'two' : [19., 10., 11., 12.]})

# list of dataframes
mydfs = [d1, d2, d3]

I would like to combine d1, d2, and d3 into one pandas dataframe. Alternatively, a method of reading a large-ish table directly into a dataframe when using the chunksize option would be very helpful.


Answer 0

Given that all the dataframes have the same columns, you can simply concat them:

import pandas as pd
df = pd.concat(list_of_dataframes)
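
Applied to the sample frames from the question, a minimal sketch (ignore_index=True is optional, but it builds a fresh 0..n-1 index instead of repeating each chunk's own index):

import pandas as pd

d1 = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})
d2 = pd.DataFrame({'one': [5., 6., 7., 8.], 'two': [9., 10., 11., 12.]})

# Stacks the frames row-wise; ignore_index=True renumbers the rows 0..7
df = pd.concat([d1, d2], ignore_index=True)
print(df.shape)  # (8, 2)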

Answer 1

If the dataframes do NOT all have the same columns, try the following:

df = pd.DataFrame.from_dict(map(dict,df_list))
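
For comparison, on current pandas versions pd.concat also accepts frames with differing columns, outer-joining the column sets and filling the gaps with NaN; a minimal sketch:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# Union of the column sets; cells with no source value become NaN
print(pd.concat([a, b], ignore_index=True, sort=False))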

Answer 2

You can also do it with functional programming (note that merge joins on the columns the frames have in common, so exact duplicate rows are collapsed rather than repeated; for plain stacking, concat is the usual choice):

from functools import reduce
reduce(lambda df1, df2: df1.merge(df2, "outer"), mydfs)

Answer 3

concat also works nicely with a list comprehension that pulls subsets from an existing dataframe using loc:

df = pd.read_csv('./data.csv')  # i.e., a dataframe pulled from a CSV file with a "userID" column

review_ids = ['1', '2', '3']    # i.e., the ID values to grab from the dataframe

# Gets rows in df where IDs match in the userID column and combines them
dfa = pd.concat([df.loc[df['userID'] == x] for x in review_ids])
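
The same selection can be done in one step with isin, which avoids the Python-level loop (a sketch assuming the same hypothetical data.csv):

import pandas as pd

df = pd.read_csv('./data.csv')   # hypothetical file with a "userID" column
review_ids = ['1', '2', '3']

# Boolean mask keeps every row whose userID appears in review_ids
dfa = df[df['userID'].isin(review_ids)]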

Merge two dataframes by index

Question: Merge two dataframes by index

I have the following dataframes:

> df1
  id begin conditional confidence discoveryTechnique  
0 278    56       false        0.0                  1   
1 421    18       false        0.0                  1 

> df2
   concept 
0  A  
1  B
   

How do I merge on the indices to get:

  id begin conditional confidence discoveryTechnique   concept 
0 278    56       false        0.0                  1  A 
1 421    18       false        0.0                  1  B

I ask because my understanding is that merge(), i.e. df1.merge(df2), uses columns to do the matching. In fact, doing this I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4618, in merge
    copy=copy, indicator=indicator)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 58, in merge
    copy=copy, indicator=indicator)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 491, in __init__
    self._validate_specification()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in _validate_specification
    raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on

Is it bad practice to merge on index? Is it impossible? If so, how can I shift the index into a new column called “index”?


Answer 0

Use merge, which is inner join by default:

pd.merge(df1, df2, left_index=True, right_index=True)

Or join, which is left join by default:

df1.join(df2)

Or concat, which is outer join by default:

pd.concat([df1, df2], axis=1)

Samples:

df1 = pd.DataFrame({'a':range(6),
                    'b':[5,3,6,9,2,4]}, index=list('abcdef'))

print (df1)
   a  b
a  0  5
b  1  3
c  2  6
d  3  9
e  4  2
f  5  4

df2 = pd.DataFrame({'c':range(4),
                    'd':[10,20,30, 40]}, index=list('abhi'))

print (df2)
   c   d
a  0  10
b  1  20
h  2  30
i  3  40

#default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
   a  b  c   d
a  0  5  0  10
b  1  3  1  20

#default left join
df4 = df1.join(df2)
print (df4)
   a  b    c     d
a  0  5  0.0  10.0
b  1  3  1.0  20.0
c  2  6  NaN   NaN
d  3  9  NaN   NaN
e  4  2  NaN   NaN
f  5  4  NaN   NaN

#default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
     a    b    c     d
a  0.0  5.0  0.0  10.0
b  1.0  3.0  1.0  20.0
c  2.0  6.0  NaN   NaN
d  3.0  9.0  NaN   NaN
e  4.0  2.0  NaN   NaN
f  5.0  4.0  NaN   NaN
h  NaN  NaN  2.0  30.0
i  NaN  NaN  3.0  40.0

Answer 1

You can use concat([df1, df2, …], axis=1) in order to concatenate two or more DFs aligned by index:

pd.concat([df1, df2, df3, ...], axis=1)

or merge for concatenating by custom fields / indexes:

# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])

# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1', right_index=True)

or join for joining by index:

 df1.join(df2)

Answer 2

By default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join

pd.concat:
takes an iterable argument, so it cannot take DataFrames directly (use [df1, df2])
the dimensions of the DataFrames should match along the axis

join and pd.merge:
can take DataFrame arguments
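
A two-line illustration of the iterable point, using hypothetical frames:

import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'b': [2]})

pd.concat([df1, df2], axis=1)   # OK: a list of frames
# pd.concat(df1, axis=1)        # TypeError: first argument must be an iterable of pandas objects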


Answer 3

A silly bug that got me: the joins failed because index dtypes differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index, the indices looked identical in Jupyter. It only came to light when saving to Excel…

Fixed with: df1[['key']] = df1[['key']].apply(pd.to_numeric)

Hopefully this saves somebody an hour!
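
A minimal sketch of the failure mode, with hypothetical frames whose indices hold the same values in different dtypes:

import pandas as pd

left = pd.DataFrame({'v': [1, 2]}, index=pd.Index(['1', '2'], name='key'))   # object (string) index
right = pd.DataFrame({'w': [10, 20]}, index=pd.Index([1, 2], name='key'))    # int64 index

print(left.index.dtype, right.index.dtype)   # object int64 -> nothing will match
print(left.join(right))                      # the w column comes back all NaN

left.index = left.index.astype('int64')      # align the dtypes first
print(left.join(right))                      # now the rows line up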


Answer 4

If you want to join two dataframes in pandas, you can simply use the available functions like merge or concat. For example, if I have two dataframes df1 and df2, I can join them by:

newdataframe = pd.merge(df1, df2, left_index=True, right_index=True)

What is the preferred way to concatenate strings in Python?

Question: What is the preferred way to concatenate strings in Python?

Since Python strings can't be changed (they are immutable), I was wondering how to concatenate strings more efficiently.

I can write it like this:

s += stringfromelsewhere

or like this:

s = []
s.append(somestring)

and later:

s = ''.join(s)

While writing this question, I found a good article talking about the topic.

http://www.skymind.com/~ocrow/python_string/

But it’s about Python 2.x, so my question is: did anything change in Python 3?


Answer 0

The best way of appending a string to a string variable is to use + or +=. This is because it’s readable and fast. They are also just as fast as each other; which one you choose is a matter of taste, though the latter is the most common. Here are timings with the timeit module:

a = a + b:
0.11338996887207031
a += b:
0.11040496826171875
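
For reference, a minimal sketch of the kind of harness that produces such numbers; the absolute figures depend on the hardware and interpreter version:

import timeit

# One million one-character appends: string += vs. list append + join
t_concat = timeit.timeit(
    "a = ''\nfor _ in range(1000000): a += 'x'", number=1)
t_join = timeit.timeit(
    "a = []\nfor _ in range(1000000): a.append('x')\ns = ''.join(a)", number=1)
print(t_concat, t_join)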

However, those who recommend building a list, appending to it, and then joining it do so because appending a string to a list is presumably very fast compared to extending a string. And this can be true, in some cases. Here, for example, is one million appends of a one-character string, first to a string, then to a list:

a += b:
0.10780501365661621
a.append(b):
0.1123361587524414

OK, it turns out that even when the resulting string is a million characters long, appending to the string was still faster.

Now let’s try appending a thousand-character string a hundred thousand times:

a += b:
0.41823482513427734
a.append(b):
0.010656118392944336

The end string, therefore, ends up being about 100MB long. That was pretty slow; appending to a list was much faster. But that timing doesn’t include the final a.join(). So how long would that take?

a.join(a):
0.43739795684814453

Oops. It turns out that even in this case, append/join is slower.

So where does this recommendation come from? Python 2?

a += b:
0.165287017822
a.append(b):
0.0132720470428
a.join(a):
0.114929914474

Well, append/join is marginally faster there if you are using extremely long strings (which you usually aren’t; why would you have a string that’s 100MB in memory?)

But the real clincher is Python 2.3. I won’t even show you the timings, because it’s so slow that it hasn’t finished yet. These tests suddenly take minutes. Except for append/join, which is just as fast as under later Pythons.

Yup. String concatenation was very slow in Python back in the stone age. But on 2.4 it isn’t anymore (or at least Python 2.4.7), so the recommendation to use append/join became outdated in 2008, when Python 2.3 stopped being updated, and you should have stopped using it. :-)

(Update: when I did the testing more carefully, it turned out that using + and += is faster for two strings on Python 2.3 as well. The recommendation to use ''.join() must be a misunderstanding.)

However, this is CPython. Other implementations may have other concerns. And this is just yet another reason why premature optimization is the root of all evil. Don’t use a technique that’s supposedly “faster” unless you first measure it.

Therefore, the “best” way to do string concatenation is to use + or +=. And if that turns out to be slow for you, which is pretty unlikely, then do something else.

So why do I use a lot of append/join in my code? Because sometimes it’s actually clearer. Especially when the pieces you are concatenating should be separated by spaces, commas, or newlines.


Answer 1

If you are concatenating a lot of values, then neither. Appending to a list is expensive. You can use StringIO for that, especially if you are building the result up over a lot of operations.

from cStringIO import StringIO
# python3:  from io import StringIO

buf = StringIO()

buf.write('foo')
buf.write('foo')
buf.write('foo')

buf.getvalue()
# 'foofoofoo'

If you already have a complete list returned to you from some other operation, then just use ''.join(aList).

From the Python FAQ: What is the most efficient way to concatenate many strings together?

str and bytes objects are immutable, therefore concatenating many strings together is inefficient as each concatenation creates a new object. In the general case, the total runtime cost is quadratic in the total string length.

To accumulate many str objects, the recommended idiom is to place them into a list and call str.join() at the end:

chunks = []
for s in my_strings:
    chunks.append(s)
result = ''.join(chunks)

(another reasonably efficient idiom is to use io.StringIO)

To accumulate many bytes objects, the recommended idiom is to extend a bytearray object using in-place concatenation (the += operator):

result = bytearray()
for b in my_bytes_objects:
    result += b

Edit: I was silly and had the results pasted backwards, making it look like appending to a list was faster than cStringIO. I have also added tests for bytearray/str concat, as well as a second round of tests using a larger list with larger strings. (python 2.7.3)

ipython test example for large lists of strings

try:
    from cStringIO import StringIO
except:
    from io import StringIO

source = ['foo']*1000

%%timeit buf = StringIO()
for i in source:
    buf.write(i)
final = buf.getvalue()
# 1000 loops, best of 3: 1.27 ms per loop

%%timeit out = []
for i in source:
    out.append(i)
final = ''.join(out)
# 1000 loops, best of 3: 9.89 ms per loop

%%timeit out = bytearray()
for i in source:
    out += i
# 10000 loops, best of 3: 98.5 µs per loop

%%timeit out = ""
for i in source:
    out += i
# 10000 loops, best of 3: 161 µs per loop

## Repeat the tests with a larger list, containing
## strings that are bigger than the small string caching 
## done by Python
source = ['foo']*1000

# cStringIO
# 10 loops, best of 3: 19.2 ms per loop

# list append and join
# 100 loops, best of 3: 144 ms per loop

# bytearray() +=
# 100 loops, best of 3: 3.8 ms per loop

# str() +=
# 100 loops, best of 3: 5.11 ms per loop

Answer 2

In Python >= 3.6, the new f-string is an efficient way to concatenate strings.

>>> name = 'some_name'
>>> number = 123
>>>
>>> f'Name is {name} and the number is {number}.'
'Name is some_name and the number is 123.'
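
For a small, fixed number of pieces this reads well; when building up an unbounded sequence in a loop, ''.join() is still the usual choice. A quick sketch:

parts = [f'item {i}' for i in range(3)]
print(', '.join(parts))  # item 0, item 1, item 2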

Answer 3

The recommended method is still to use append and join.


Answer 4

If the strings you are concatenating are literals, use string literal concatenation:

re.compile(
        "[A-Za-z_]"       # letter or underscore
        "[A-Za-z0-9_]*"   # letter, digit or underscore
    )

This is useful if you want to comment on part of a string (as above) or if you want to use raw strings or triple quotes for part of a literal but not all.

Since this happens at the syntax layer it uses zero concatenation operators.
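
A small sketch of mixing literal styles this way; the pattern itself is a hypothetical example:

import re

date_time = re.compile(
    r"\d{4}-\d{2}-\d{2}"   # raw string: the date part
    "T"                    # plain literal: the separator
    r"\d{2}:\d{2}"         # raw string: the time part
)
assert date_time.match("2024-01-31T12:30")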


Answer 5

You can write this function:

def str_join(*args):
    return ''.join(map(str, args))

Then you can simply call it wherever you want:

str_join('Pine')  # Returns : Pine
str_join('Pine', 'apple')  # Returns : Pineapple
str_join('Pine', 'apple', 3)  # Returns : Pineapple3

Answer 6

While somewhat dated, Code Like a Pythonista: Idiomatic Python recommends join() over + in this section. As does PythonSpeedPerformanceTips in its section on string concatenation, with the following disclaimer:

The accuracy of this section is disputed with respect to later versions of Python. In CPython 2.5, string concatenation is fairly fast, although this may not apply likewise to other Python implementations. See ConcatenationTestCode for a discussion.


Answer 7

Using in-place string concatenation with ‘+’ is THE WORST method of concatenation in terms of stability and cross-implementation behavior, as it does not support all values. The PEP 8 standard discourages this and encourages the use of format(), join() and append() for long-term use.

As quoted from the linked “Programming Recommendations” section:

For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
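
A minimal sketch of the linear-time form the recommendation points to:

pieces = []
for i in range(10):
    pieces.append(str(i))   # amortized O(1) per append
result = ''.join(pieces)    # a single O(n) pass over the total length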


Answer 8

As @jdi mentions, the Python documentation suggests using str.join or io.StringIO for string concatenation, and says that a developer should expect quadratic time from += in a loop, even though there has been an optimisation since Python 2.4. As this answer says:

If Python detects that the left argument has no other references, it calls realloc to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it’s an implementation detail and because if realloc ends up needing to move the string frequently, performance degrades to O(n^2) anyway.

I will show an example of real-world code that naively relied on this += optimisation in a situation where it didn’t apply. The code below converts an iterable of short strings into bigger chunks to be used in a bulk API.

def test_concat_chunk(seq, split_by):
    result = ['']
    for item in seq:
        if len(result[-1]) + len(item) > split_by: 
            result.append('')
        result[-1] += item
    return result

This code can literally run for hours because of the quadratic time complexity. Below are alternatives using the suggested data structures:

import io

def test_stringio_chunk(seq, split_by):
    def chunk():
        buf = io.StringIO()
        size = 0
        for item in seq:
            if size + len(item) <= split_by:
                size += buf.write(item)
            else:
                yield buf.getvalue()
                buf = io.StringIO()
                size = buf.write(item)
        if size:
            yield buf.getvalue()

    return list(chunk())

def test_join_chunk(seq, split_by):
    def chunk():
        buf = []
        size = 0
        for item in seq:
            if size + len(item) <= split_by:
                buf.append(item)
                size += len(item)
            else:
                yield ''.join(buf)                
                buf.clear()
                buf.append(item)
                size = len(item)
        if size:
            yield ''.join(buf)

    return list(chunk())

And a micro-benchmark:

import timeit
import random
import string
import matplotlib.pyplot as plt

line = ''.join(random.choices(
    string.ascii_uppercase + string.digits, k=512)) + '\n'
x = []
y_concat = []
y_stringio = []
y_join = []
n = 5
for i in range(1, 11):
    x.append(i)
    seq = [line] * (20 * 2 ** 20 // len(line))
    chunk_size = i * 2 ** 20
    y_concat.append(
        timeit.timeit(lambda: test_concat_chunk(seq, chunk_size), number=n) / n)
    y_stringio.append(
        timeit.timeit(lambda: test_stringio_chunk(seq, chunk_size), number=n) / n)
    y_join.append(
        timeit.timeit(lambda: test_join_chunk(seq, chunk_size), number=n) / n)
plt.plot(x, y_concat)
plt.plot(x, y_stringio)
plt.plot(x, y_join)
plt.legend(['concat', 'stringio', 'join'], loc='upper left')
plt.show()


Answer 9

You can do it in different ways:

str1 = "Hello"
str2 = "World"
str_list = ['Hello', 'World']
str_dict = {'str1': 'Hello', 'str2': 'World'}

# Concatenating With the + Operator
print(str1 + ' ' + str2)  # Hello World

# String Formatting with the % Operator
print("%s %s" % (str1, str2))  # Hello World

# String Formatting with the { } Operators with str.format()
print("{}{}".format(str1, str2))  # Hello World
print("{0}{1}".format(str1, str2))  # Hello World
print("{str1} {str2}".format(str1=str_dict['str1'], str2=str_dict['str2']))  # Hello World
print("{str1} {str2}".format(**str_dict))  # Hello World

# Going From a List to a String in Python With .join()
print(' '.join(str_list))  # Hello World

# Python f-strings --> 3.6 onwards
print(f"{str1} {str2}")  # Hello World

I created this little summary from the following articles.


Answer 10

My use case was slightly different: I had to construct a query where more than 20 fields were dynamic. I followed this approach of using the format method:

query = "insert into {0}({1},{2},{3}) values({4}, {5}, {6})"
query = query.format('users','name','age','dna','suzan',1010,'nda')

This was comparatively simpler for me than using + or other approaches.


Answer 11

You can also use this (it is more efficient). (/software/304445/why-is-s-better-than-for-concatenation)

s += "%s" % (stringfromelsewhere)