Tag Archives: join

Pandas join issue: columns overlap but no suffix specified

Question: Pandas join issue: columns overlap but no suffix specified

I have the following 2 data frames:

df_a =

     mukey  DI  PI
0   100000  35  14
1  1000005  44  14
2  1000006  44  14
3  1000007  43  13
4  1000008  43  13

df_b = 
    mukey  niccdcd
0  190236        4
1  190237        6
2  190238        7
3  190239        4
4  190240        7

When I try to join these 2 dataframes:

join_df = df_a.join(df_b,on='mukey',how='left')

I get the error:

*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')

Why is this so? The dataframes do have common ‘mukey’ values.


Answer 0

Your error on the snippet of data you posted is a little cryptic: because there are no common values, the join operation fails, and because the column names overlap it requires you to supply a suffix for the left and right hand side:

In [173]:

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
       mukey_left  DI  PI  mukey_right  niccdcd
index                                          
0          100000  35  14          NaN      NaN
1         1000005  44  14          NaN      NaN
2         1000006  44  14          NaN      NaN
3         1000007  43  13          NaN      NaN
4         1000008  43  13          NaN      NaN

merge works because it doesn’t have this restriction:

In [176]:

df_a.merge(df_b, on='mukey', how='left')
Out[176]:
     mukey  DI  PI  niccdcd
0   100000  35  14      NaN
1  1000005  44  14      NaN
2  1000006  44  14      NaN
3  1000007  43  13      NaN
4  1000008  43  13      NaN

Answer 1

The .join() function uses the index of the dataset passed as an argument, so you should use set_index on it first or use the .merge function instead.

Please find the two examples that should work in your case:

join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')

or

join_df = df_a.merge(df_b, on='mukey', how='left')

Answer 2

This error indicates that the two tables have one or more columns with the same column name. The error message translates to: "I can see the same column in both tables, but you haven't told me to rename either before bringing one of them in."

You either want to delete one of the columns before bringing it in from the other, using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one that is being brought in.

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

Answer 3

Mainly, join is used exclusively to join based on the index, not on the attribute names. So change the attribute names in the two dataframes (or move the key into the index) and then try to join; they will be joined. Otherwise this error is raised.
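
For illustration, a minimal sketch along those lines using the question's frames (the values in df_b are adjusted here so that one key actually matches):

import pandas as pd

df_a = pd.DataFrame({'mukey': [100000, 1000005], 'DI': [35, 44], 'PI': [14, 14]})
df_b = pd.DataFrame({'mukey': [100000, 190237], 'niccdcd': [4, 6]})

# join matches df_a's 'mukey' values against df_b's *index*,
# so move the key into df_b's index first
join_df = df_a.join(df_b.set_index('mukey'), on='mukey', how='left')
print(join_df)  # the row for mukey 100000 gets niccdcd=4, the other gets NaN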


Join a list of strings in Python and wrap each string in quotation marks

Question: Join a list of strings in Python and wrap each string in quotation marks

I’ve got:

words = ['hello', 'world', 'you', 'look', 'nice']

I want to have:

'"hello", "world", "you", "look", "nice"'

What’s the easiest way to do this with Python?


Answer 0

>>> words = ['hello', 'world', 'you', 'look', 'nice']
>>> ', '.join('"{0}"'.format(w) for w in words)
'"hello", "world", "you", "look", "nice"'

Answer 1

You may also perform a single format call:

>>> words = ['hello', 'world', 'you', 'look', 'nice']
>>> '"{0}"'.format('", "'.join(words))
'"hello", "world", "you", "look", "nice"'

Update: Some benchmarking (performed on a 2009 mbp):

>>> timeit.Timer("""words = ['hello', 'world', 'you', 'look', 'nice'] * 100; ', '.join('"{0}"'.format(w) for w in words)""").timeit(1000)
0.32559704780578613

>>> timeit.Timer("""words = ['hello', 'world', 'you', 'look', 'nice'] * 100; '"{}"'.format('", "'.join(words))""").timeit(1000)
0.018904924392700195

So it seems that format is actually quite expensive

Update 2: following @JCode’s comment, adding a map to ensure that join will work, Python 2.7.12

>>> timeit.Timer("""words = ['hello', 'world', 'you', 'look', 'nice'] * 100; ', '.join('"{0}"'.format(w) for w in words)""").timeit(1000)
0.08646488189697266

>>> timeit.Timer("""words = ['hello', 'world', 'you', 'look', 'nice'] * 100; '"{}"'.format('", "'.join(map(str, words)))""").timeit(1000)
0.04855608940124512

>>> timeit.Timer("""words = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 100; ', '.join('"{0}"'.format(w) for w in words)""").timeit(1000)
0.17348504066467285

>>> timeit.Timer("""words = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 100; '"{}"'.format('", "'.join(map(str, words)))""").timeit(1000)
0.06372308731079102

Answer 2

You can try this:

str(words)[1:-1]
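
Note that this relies on the list's repr, so the items come out wrapped in single quotes rather than the double quotes the question asked for:

>>> words = ['hello', 'world', 'you', 'look', 'nice']
>>> str(words)[1:-1]
"'hello', 'world', 'you', 'look', 'nice'"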

Answer 3

>>> ', '.join(['"%s"' % w for w in words])

Answer 4

An updated version of @jamylak's answer with f-strings (for Python 3.6+); I've used backticks for a string used in a SQL script.

keys = ['foo', 'bar' , 'omg']
', '.join(f'`{k}`' for k in keys)
# result: '`foo`, `bar`, `omg`'

Pandas: merge (join) two data frames on multiple columns

Question: Pandas: merge (join) two data frames on multiple columns

I am trying to join two pandas data frames using two columns:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

but got the following error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

Any idea what should be the right way to do this? Thanks!


Answer 0

Try this

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs


Answer 1

The problem here is that by using the apostrophes you are setting the value being passed to be a string, when in fact, as @Shijo stated from the documentation, the function is expecting a label or list, but not a string! If the list contains the names of the columns being passed for both the left and right dataframe, then each column name must individually be within apostrophes. With what has been stated, we can understand why this is incorrect:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

And this is the correct way of using the function:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

Answer 2

Another way of doing this: new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')


Why were pandas merges in Python faster than data.table merges in R in 2012?

Question: Why were pandas merges in Python faster than data.table merges in R in 2012?

I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It’s even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I’m not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?

Here’s the R code and the Python code used to benchmark the various packages.


Answer 0

It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.

Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.

Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R’s own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn’t hooked up yet to replace the levels to levels match.

Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.

Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

I’ll need some time to confirm as it’s the first I’ve seen of the comparison to data.table as presented.


UPDATE from data.table v1.8.0 released July 2012

  • Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s, for example.

also in that release was :

  • character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.

  • New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R’s internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.

As of Sep 2013 data.table is v1.8.10 on CRAN and we’re working on v1.9.0. NEWS is updated live.


But as I wrote originally, above :

data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys, for example, is on the to-do list.

In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn’t be as bad now, since the known problem has been fixed.


Answer 1

The reason pandas is faster is because I came up with a better algorithm, which is implemented very carefully using a fast hash table implementation – klib and in C/Cython to avoid the Python interpreter overhead for the non-vectorizable parts. The algorithm is described in some detail in my presentation: A look inside pandas design and development.

The comparison with data.table is actually a bit interesting because the whole point of R’s data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins) pandas’ DataFrame contains no pre-computed information that is being used for the merge, so to speak it’s a “cold” merge. If I had stored the factorized versions of the join keys, the join would be significantly faster – as factorizing is the biggest bottleneck for this algorithm.
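
As a rough illustration of the factorization step described above (just the idea, not the internal pandas code): pd.factorize maps each key to a dense integer label, and the join proper can then run on those integers.

import pandas as pd

keys = pd.Series(['b', 'd', 'b', 'a', 'd'])
codes, uniques = pd.factorize(keys)
print(codes)    # [0 1 0 2 1] - one dense integer label per key occurrence
print(uniques)  # Index(['b', 'd', 'a'], dtype='object')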

I should also add that the internal design of pandas’ DataFrame is much more amenable to these kinds of operations than R’s data.frame (which is just a list of arrays internally).


Answer 2

This topic is two years old but seems like a likely place for people to land when they search for comparisons of Pandas and data.table.

Since both of these have evolved over time, I want to post a relatively newer comparison (from 2014) here for the interested users: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

It would be interesting to know if Wes and/or Matt (who, by the way, are creators of Pandas and data.table respectively and have both commented above) have any news to add here as well.

— UPDATE —

A comment posted below by jangorecki contains a link that I think is very useful: https://github.com/szilard/benchm-databases

This graph depicts the average times of aggregation and join operations for different technologies (lower = faster; comparison last updated in Sept 2016). It was really educational for me.

Going back to the question, R DT key and R DT refer to the keyed/unkeyed flavors of R’s data.table and happen to be faster in this benchmark than Python’s Pandas (Py pandas).


Answer 3

There are great answers here, notably from the authors of both tools the question asks about. Matt's answer explains the case reported in the question: it was caused by a bug, not by the merge algorithm. The bug was fixed the next day, more than 7 years ago already.

In my answer I will provide some up-to-date timings of the merge operation for data.table and pandas. Note that plyr and base R merge are not included.

The timings I am presenting come from the db-benchmark project, a continuously run reproducible benchmark. It upgrades the tools to recent versions and re-runs the benchmark scripts. It runs many other software solutions as well. If you are interested in Spark, Dask and a few others, be sure to check the link.


As of now… (still to be implemented: one more data size and 5 more questions)

We test 2 different data sizes of the LHS table.
For each of those data sizes we run 5 different merge questions.

q1: LHS inner join RHS-small on integer
q2: LHS inner join RHS-medium on integer
q3: LHS outer join RHS-medium on integer
q4: LHS inner join RHS-medium on factor (categorical)
q5: LHS inner join RHS-big on integer

The RHS table comes in 3 different sizes:

  • small translates to size of LHS/1e6
  • medium translates to size of LHS/1e3
  • big translates to size of LHS

In all cases there are around 90% of matching rows between LHS and RHS, and no duplicates in RHS joining column (no cartesian product).


As of now (run on 2nd Nov 2019)

pandas 0.25.3 released on 1st Nov 2019
data.table 0.12.7 (92abb70) released on 2nd Nov 2019

The timings below are in seconds, for the two different LHS data sizes. The added column pd2dt stores the ratio of how many times pandas is slower than data.table.

  • 0.5 GB LHS data
+-----------+--------------+----------+--------+
| question  |  data.table  |  pandas  |  pd2dt |
+-----------+--------------+----------+--------+
| q1        |        0.51  |    3.60  |      7 |
| q2        |        0.50  |    7.37  |     14 |
| q3        |        0.90  |    4.82  |      5 |
| q4        |        0.47  |    5.86  |     12 |
| q5        |        2.55  |   54.10  |     21 |
+-----------+--------------+----------+--------+
  • 5 GB LHS data
+-----------+--------------+----------+--------+
| question  |  data.table  |  pandas  |  pd2dt |
+-----------+--------------+----------+--------+
| q1        |        6.32  |    89.0  |     14 |
| q2        |        5.72  |   108.0  |     18 |
| q3        |       11.00  |    56.9  |      5 |
| q4        |        5.57  |    90.1  |     16 |
| q5        |       30.70  |   731.0  |     23 |
+-----------+--------------+----------+--------+

What is the difference between join and merge in pandas?

Question: What is the difference between join and merge in pandas?

Suppose I have two DataFrames like so:

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})

right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

I want to merge them, so I try something like this:

pd.merge(left, right, left_on='key1', right_on='key2')

And I’m happy

    key1    lval    key2    rval
0   foo     1       foo     4
1   bar     2       bar     5

But I'm trying to use the join method, which I've been led to believe is pretty similar.

left.join(right, on=['key1', 'key2'])

And I get this:

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
    406             if self.right_index:
    407                 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408                     raise AssertionError()
    409                 self.right_on = [None] * n
    410         elif self.right_on is not None:

AssertionError: 

What am I missing?


Answer 0

I always use join on indices:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')

     val_l  val_r
key            
foo      1      4
bar      2      5

The same functionality can be had by using merge on the columns, as follows:

left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))

   key  val_l  val_r
0  foo      1      4
1  bar      2      5

Answer 1

pandas.merge() is the underlying function used for all merge/join behavior.

DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).

These are the main differences between df.join() and df.merge():

  1. lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
  2. lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
  3. left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df.merge does an inner join by default (returns only matching rows of df1 and df2).

So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.
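
A small sketch of difference 3 (my example, not from the original answer): with two frames sharing index labels, join keeps all rows of the left frame, while merge drops the non-matching ones by default.

import pandas as pd

df1 = pd.DataFrame({'val1': [1, 2]}, index=['foo', 'baz'])
df2 = pd.DataFrame({'val2': [4, 5]}, index=['foo', 'bar'])

print(df1.join(df2))  # left join by default: 'baz' is kept, val2 is NaN there
print(df1.merge(df2, left_index=True, right_index=True))  # inner by default: only 'foo'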

Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

These two function calls are completely equivalent:

left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)

Answer 2

I believe that join() is just a convenience method. Try df1.merge(df2) instead, which allows you to specify left_on and right_on:

In [30]: left.merge(right, left_on="key1", right_on="key2")
Out[30]: 
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5

Answer 3

From this documentation

pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=True,
      suffixes=('_x', '_y'), copy=True, indicator=False)

And :

DataFrame.join is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:

result = pd.merge(left, right, left_index=True, right_index=True,
how='outer')

Answer 4

One of the differences is that merge creates a new index, while join keeps the left-side index. This can have big consequences for your later transformations if you wrongly assume that your index isn't changed by merge.

For example:

import pandas as pd

df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
                    'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1

       date  org_index  val
101  201801        101    1
102  201801        102    2
103  201802        103    3
104  201802        104    4

df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2

       dateval
date          
201801       A
201802       B

df1.merge(df2, on='date')

     date  org_index  val dateval
0  201801        101    1       A
1  201801        102    2       A
2  201802        103    3       B
3  201802        104    4       B

df1.join(df2, on='date')
       date  org_index  val dateval
101  201801        101    1       A
102  201801        102    2       A
103  201802        103    3       B
104  201802        104    4       B

Answer 5

  • Join: Default Index (if any column name is shared, it will throw an error in default mode because you have not defined lsuffix or rsuffix)
df_1.join(df_2)
  • Merge: Default Same Column Names (if there is no shared column name, it will throw an error in default mode)
df_1.merge(df_2)
  • The on parameter has a different meaning in the two cases
df_1.merge(df_2, on='column_1')

df_1.join(df_2, on='column_1')  # it will throw an error
df_1.join(df_2.set_index('column_1'), on='column_1')

Answer 6

To put it analogously to SQL, "pandas merge is to outer/inner join as pandas join is to natural join". Hence when you use merge in pandas, you want to specify which kind of SQL-ish join you want to use, whereas when you use pandas join, you really want to have a matching label to ensure it joins.


Pandas three-way joining multiple dataframes on columns

Question: Pandas three-way joining multiple dataframes on columns

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I “join” together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person’s string name?

The join() function in pandas specifies that I need a multiindex, but I’m confused about what a hierarchical indexing scheme has to do with making a join based on a single index.


Answer 0

Assumed imports:

import pandas as pd

John Galt’s answer is basically a reduce operation. If I have more than a handful of dataframes, I’d put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, dfN]

Assuming they have some common column, like name in your example, I’d do the following:

df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you’ll first need to import that module:

from functools import reduce
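
Putting both pieces together, a complete Python 3 version might look like this (df0, df1 and df2 here are just illustrative stand-ins for your own frames):

from functools import reduce
import pandas as pd

df0 = pd.DataFrame({'name': ['a', 'b'], 'attr1': [1, 2]})
df1 = pd.DataFrame({'name': ['a', 'b'], 'attr2': [3, 4]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'attr3': [5, 6]})

dfs = [df0, df1, df2]
# successively inner-merge every frame in the list on 'name'
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
print(df_final)  # one row per name, with attr1, attr2 and attr3 side by side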

Answer 1

You could try this if you have 3 dataframes

# Merge multiple dataframes
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

Alternatively, as mentioned by cwharland:

df1.merge(df2,on='name').merge(df3,on='name')

Answer 2

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])

With @zero’s data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9

Answer 3

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')

Answer 4

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to use for the joining as the index:

pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt’s answer

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)

Answer 5

One does not need a multiindex to perform join operations. One just needs to correctly set the index column on which to perform the join operations (with the command df.set_index('Name'), for example).

The join operation is by default performed on the index. In your case, you just have to specify that the Name column corresponds to your index. Below is an example.

A tutorial may be useful.

# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Select the index from column 'Name'
df1 = df1.set_index('Name')

# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')

Answer 6

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:

This is the function to merge a dict of data frames

import pandas as pd
import numpy as np

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
  # list() makes the keys indexable under Python 3
  keys = list(dfDict.keys())
  for i in range(len(keys)):
    key = keys[i]
    df0 = dfDict[key]
    cols = list(df0.columns)
    valueCols = list(filter(lambda x: x not in (onCols), cols))
    df0 = df0[onCols + valueCols]
    df0.columns = onCols + [(s + '_' + key) for s in valueCols] 

    if (i == 0):
      outDf = df0
    else:
      outDf = pd.merge(outDf, df0, how=how, on=onCols)   

  if (naFill != None):
    outDf = outDf.fillna(naFill)

  return(outDf)

OK, let's generate data and test this:

def GenDf(size):
  df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                      'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                      'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                      'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                      })
  df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
  return(df)


size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

Answer 7

Simple Solution:

If the column names are similar:

 df1.merge(df2,on='col_name').merge(df3,on='col_name')

If the column names are different:

df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})

Answer 8

There is another solution from the pandas documentation (that I don’t see here),

using .append:

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If the column names differ, NaNs will be introduced.
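
For example (my illustration; note that DataFrame.append was later deprecated in pandas 1.4 in favour of pd.concat):

>>> df3 = pd.DataFrame([[5, 6]], columns=list('BC'))
>>> df.append(df3, ignore_index=True)
     A  B    C
0  1.0  2  NaN
1  3.0  4  NaN
2  NaN  5  6.0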


Answer 9

The three dataframes are

Let’s merge these frames using nested pd.merge

Here we go, we have our merged dataframe.

Happy Analysis!!!


Build the full path filename in Python

Question: Build the full path filename in Python

I need to pass a file path name to a module. How do I build the file path from a directory name, base filename, and a file format string?

The directory may or may not exist at the time of call.

For example:

dir_name='/home/me/dev/my_reports'
base_filename='daily_report'
format = 'pdf'

I need to create a string '/home/me/dev/my_reports/daily_report.pdf'

Concatenating the pieces manually doesn’t seem to be a good way. I tried os.path.join:

join(dir_name,base_filename,format)

but it gives

/home/me/dev/my_reports/daily_report/pdf

Answer 0

This works fine:

os.path.join(dir_name, base_filename + "." + filename_suffix)

Keep in mind that os.path.join() exists only because different operating systems use different path separator characters. It smooths over that difference so cross-platform code doesn’t have to be cluttered with special cases for each OS. There is no need to do this for file name “extensions” (see footnote) because they are always connected to the rest of the name with a dot character, on every OS.

If using a function anyway makes you feel better (and you like needlessly complicating your code), you can do this:

os.path.join(dir_name, '.'.join((base_filename, filename_suffix)))

If you prefer to keep your code clean, simply include the dot in the suffix:

suffix = '.pdf'
os.path.join(dir_name, base_filename + suffix)

That approach also happens to be compatible with the suffix conventions in pathlib, which was introduced in python 3.4 after this question was asked. New code that doesn’t require backward compatibility can do this:

suffix = '.pdf'
pathlib.PurePath(dir_name, base_filename + suffix)

You might prefer the shorter Path instead of PurePath if you’re only handling paths for the local OS.

Warning: Do not use pathlib’s with_suffix() for this purpose. That method will corrupt base_filename if it ever contains a dot.
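
A quick demonstration of that failure mode (my example):

>>> from pathlib import PurePosixPath
>>> PurePosixPath('/home/me/daily.report').with_suffix('.pdf')
PurePosixPath('/home/me/daily.pdf')

Here '.report' was treated as the existing suffix and silently replaced, instead of '.pdf' being appended.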


Footnote: Outside of Microsoft operating systems, there is no such thing as a file name "extension". Its presence on Windows comes from MS-DOS and FAT, which borrowed it from CP/M, which has been dead for decades. That dot-plus-three-letters that many of us are accustomed to seeing is just part of the file name on every other modern OS, where it has no built-in meaning.


Answer 1

If you are fortunate enough to be running Python 3.4+, you can use pathlib:

>>> from pathlib import Path
>>> dirname = '/home/reports'
>>> filename = 'daily'
>>> suffix = '.pdf'
>>> Path(dirname, filename).with_suffix(suffix)
PosixPath('/home/reports/daily.pdf')

Answer 2

Um, why not just:

>>> import os
>>> os.path.join(dir_name, base_filename + "." + format)
'/home/me/dev/my_reports/daily_report.pdf'

Answer 3

Just use os.path.join to join your path with the filename and extension. Use sys.argv to access arguments passed to the script when executing it:

#!/usr/bin/env python3
# coding: utf-8

import netCDF4 as nc
import numpy as np
import numpy.ma as ma
import csv as csv

import os.path
import sys

basedir = '/data/reu_data/soil_moisture/'
suffix = 'nc'


def read_fid(filename):
    fid = nc.MFDataset(filename,'r')
    fid.close()
    return fid

def read_var(file, varname):
    fid = nc.Dataset(file, 'r')
    out = fid.variables[varname][:]
    fid.close()
    return out


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Please specify a year')

    else:
        filename = os.path.join(basedir, '.'.join((sys.argv[1], suffix)))
        time = read_var(filename, 'time')
        lat = read_var(filename, 'lat')
        lon = read_var(filename, 'lon')
        soil = read_var(filename, 'soilw')

Simply run the script like:

   # on windows-based systems
   python script.py year

   # on unix-based systems
   ./script.py year

Pandas Merging 101

Question: Pandas Merging 101

  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?
  • How do I add NaNs for missing rows after merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • Cross join with pandas?
  • How do I merge multiple DataFrames?
  • merge? join? concat? update? Who? What? Why?!

… and more. I’ve seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.


Answer 0

This post aims to give readers a primer on SQL-flavored merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • avoiding duplicated merge key columns in the output
  • Merging with the index under different conditions
    • using your named index effectively
    • merge key as the index of one frame and a column of the other
  • Multiway merges on columns and indexes (unique and non-unique)
  • Notable alternatives to merge and join

What this post will not go through:

  • Performance-related discussions and timings (for now). Wherever appropriate, notable mentions of better alternatives are made.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating the various features, unless otherwise specified.

此外,可以复制和复制此处的所有DataFrame,以便您可以使用它们。另外,请参阅这篇文章 ,了解如何从剪贴板读取DataFrames。

最后,已使用Google绘图手工绘制了JOIN操作的所有可视表示形式。从这里得到启示。

聊够了,只是告诉我如何使用merge

设定

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

为了简单起见,键列具有相同的名称(目前)。

一个内连接由下式表示

注意:
此规则以及即将发布的附图均遵循以下约定:

  • 蓝色表示合并结果中存在的行
  • 红色表示从结果中排除(即删除)的行
  • 绿色表示缺少的值将在结果中替换为NaN

要执行INNER JOIN,请调用merge左侧的DataFrame,并指定右侧的DataFrame和联接键(至少)作为参数。

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

这仅返回来自leftright共享一个公共密钥的行(在本示例中为“ B”和“ D”)。

LEFT OUTER JOIN,或LEFT JOIN由下式表示

可以通过指定执行how='left'

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

请仔细注意NaN的位置。如果指定how='left',则仅left使用from 的键,而rightNaN替换缺少的数据。

同样,对于RIGHT OUTER JOIN或RIGHT JOIN来说…

…指定how='right'

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

在这里,right使用了from 的键,而缺失的数据left被NaN代替。

最后,对于FULL OUTER JOIN,由

指定how='outer'

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

这将使用两个框架中的关键点,并且会为两个框架中缺少的行插入NaN。

该文档很好地总结了这些各种合并:

其他联接-左排除,右排除和全排除/ ANTI连接

如果您需要分两个步骤进行LEFT-Exclusive JOINRIGHT-Exclusive JOIN

对于不包括JOIN的LEFT,表示为

首先执行LEFT OUTER JOIN,然后过滤(不包括!)left仅来自于行的行,

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

哪里,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

同样,对于除权利加入之外,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

最后,如果您需要执行合并操作,而该合并操作只保留左侧或右侧的键,而不同时保留两者(IOW,执行ANTI-JOIN),

您可以通过类似的方式进行操作-

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

键列的不同名称

如果键列的名称不同(例如,lefthas keyLeftrighthas keyRight代替),key则必须指定left_onright_on作为参数,而不是on

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

避免在输出中重复键列

keyLeftfrom leftkeyRightfrom 上进行合并时right,如果只希望在输出中使用keyLeftkeyRight(但不同时使用)中的任何一个,则可以从将索引设置为初步步骤开始。

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

将此与命令输出(恰恰是的输出left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner'))进行对比(您会发现keyLeft它丢失了)。您可以根据将哪个帧的索引设置为关键字来找出要保留的列。例如,当执行某些OUTER JOIN操作时,这可能很重要。

仅合并其中一个的单个列 DataFrames

例如,考虑

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

如果只需要合并“ new_val”(不包含任何其他列),通常可以在合并之前仅对列进行子集化:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

如果您要进行左外部联接,则性能更高的解决方案将涉及map

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

如前所述,这类似于但比

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

合并多列

要加入对多列,指定列表on(或left_onright_on,如适用)。

left.merge(right, on=['key1', 'key2'] ...)

或者,如果名称不同,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

其他有用的merge*操作和功能

本节仅介绍最基本的内容,目的只是为了激发您的胃口。更多的例子和案例,看到的文档mergejoin以及concat还有链接的功能规格。


基于索引的* -JOIN(+ index-column merges)

设定

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

通常,索引合并看起来像这样:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

支持索引名称

如果索引被命名,然后v0.23用户还可以指定的级别名称on(或left_onright_on必要的)。

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

合并一个索引,另一个列

可以(并且非常简单)使用一个索引和另一个列进行合并。例如,

left.merge(right, left_on='key1', right_index=True)

反之亦然(right_on=...left_index=True)。

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

在这种特殊情况下,left为命名了索引,因此您也可以将索引名称与一起使用left_on,如下所示:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
除了这些,还有另一个简洁的选择。您可以使用DataFrame.join默认值来联接索引。DataFrame.join默认情况下不做LEFT OUTER JOIN,所以how='inner'在这里是必要的。

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

请注意,我需要指定lsuffixrsuffix参数join,否则会出错:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

由于列名相同。如果它们的名称不同,这将不是问题。

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
最后,作为基于索引的联接的替代方法,可以使用pd.concat

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

省略join='inner'是否需要FULL OUTER JOIN(默认):

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

有关更多信息,请参见@piRSquared pd.concat的此规范帖子


通用化:merge处理多个DataFrame

通常,将多个DataFrame合并在一起时会出现这种情况。天真的,这可以通过链接merge调用来完成:

df1.merge(df2, ...).merge(df3, ...)

但是,对于许多DataFrame,这很快就变得一发不可收拾。此外,可能有必要归纳为未知数量的DataFrame。

在这里,我将介绍pd.concat针对唯一DataFrame.join的多方联接,以及针对非唯一键的多方联接。首先,设置。

# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

多路合并唯一键(或索引)

如果您的键(此处的键可以是列或索引)是唯一的,则可以使用pd.concat。请注意,pd.concat将DataFrames连接到索引

# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

省略join='inner'完整的外部联接。请注意,您不能指定LEFT或RIGHT OUTER连接(如果需要这些连接,请使用join,如下所述)。

多路合并重复项

concat速度很快,但也有缺点。它不能处理重复项。

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

在这种情况下,我们可以使用join它,因为它可以处理非唯一键(请注意,join将DataFrames连接到它们的索引上;merge除非另有说明,否则它在后台进行调用并执行LEFT OUTER JOIN)。

# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here’s what this post will go through:

  • The basics – types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • avoiding duplicate merge key column in output
  • Merging with index under different conditions
    • effectively using your named index
    • merge key as the index of one and column of another
  • Multiway merges on columns and indexes (unique and non-unique)
  • Notable alternatives to merge and join

What this post will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so check those out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough Talk, just show me how to use merge!

Setup

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note
This, along with all the forthcoming figures, follows this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is…

…specify how='right':

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how='outer'.

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.
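
An aside that is not part of the original answer: if you then need to get rid of those NaNs (one of the bullet questions above), fillna or dropna are the usual tools. A minimal sketch, continuing with the left and right frames from the setup and assuming 0 is a sensible fill value for your data:

merged = left.merge(right, on='key', how='outer')
merged.fillna(0)    # replace the NaNs produced for non-matching keys
merged.dropna()     # or drop incomplete rows, recovering the INNER JOIN rows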

The documentation summarises these various merges nicely:

Other JOINs – LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs or RIGHT-Excluding JOINs, they can be done in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering down to the rows that come from left only (excluding everything that matched right),

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', axis=1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', axis=1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in a similar fashion:

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', axis=1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')) and you'll notice that keyLeft is missing. You can figure out which key column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you’re doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
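
The two snippets above are schematic (key1, lkey1, and so on are placeholder column names). A self-contained sketch with hypothetical two-part keys, assuming pandas is imported as pd as in the rest of the post:

left_mc = pd.DataFrame({'key1': ['A', 'A', 'B'], 'key2': [1, 2, 1], 'lval': [10, 20, 30]})
right_mc = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [2, 1, 2], 'rval': [100, 200, 300]})

# rows match only when BOTH key columns agree
left_mc.merge(right_mc, on=['key1', 'key2'])

  key1  key2  lval  rval
0    A     2    20   100
1    B     1    30   200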

Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.


Index-based *-JOIN (+ index-column merges)

Setup

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Support for index names

If your index is named, then from pandas v0.23 onwards you can also pass the index level name to on (or to left_on and right_on, as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Merging on index of one, column(s) of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=... and left_index=True).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

In this special case, the index for left is named, so you can also use the index name with left_on, like this:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join, which joins on the index by default. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

This is because the column names are the same. It would not be a problem if they were named differently.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

For more information, see this canonical post on pd.concat by @piRSquared.


Generalizing: merging multiple DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
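
As a sketch (not part of the original answer), functools.reduce expresses the chained form for a list of arbitrary length, assuming every frame in the list shares a 'key' column:

from functools import reduce

# equivalent to df1.merge(df2, on='key').merge(df3, on='key') and so on
merged = reduce(lambda l, r: l.merge(r, on='key'), dfs)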

Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.

# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

Multiway merge on unique keys (or index)

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.

# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
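
For completeness, a sketch of the FULL OUTER variant on the same indexed frames (dfs2 from the setup); keys present in only some frames come back padded with NaN:

# FULL OUTER multiway join: the union of the keys of A2, B2 and C2
pd.concat(dfs2, axis=1, sort=False)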

Multiway merge on keys with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).

# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

Answer 1

A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive as it is for df.mean() or df.apply(func).
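
The original answer illustrated this with a figure; in lieu of the image, a minimal sketch of the two axes (assuming pandas is imported as pd):

df0 = pd.DataFrame({'a': [1, 2]})
df1 = pd.DataFrame({'a': [3, 4]})

pd.concat([df0, df1], axis=0)  # stacks vertically: 4 rows, 1 column
pd.concat([df0, df1], axis=1)  # side by side, aligned on the index: 2 rows, 2 columns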



Joining items in a list into a single string

Question: Joining items in a list into a single string

Is there a simpler way to concatenate string items in a list into a single string? Can I use the str.join() function?

E.g. this is the input ['this','is','a','sentence'] and this is the desired output this-is-a-sentence

sentence = ['this','is','a','sentence']
sent_str = ""
for i in sentence:
    sent_str += str(i) + "-"
sent_str = sent_str[:-1]
print sent_str

Answer 0

Use join:

>>> sentence = ['this','is','a','sentence']
>>> '-'.join(sentence)
'this-is-a-sentence'

Answer 1

A more generic way to convert python lists to strings would be:

>>> my_lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> my_lst_str = ''.join(map(str, my_lst))
>>> print(my_lst_str)
'12345678910'

Answer 2

It's very useful for beginners to know why join is a string method.

It seems strange at first, but becomes very useful once you are used to it.

The result of join is always a string, but the object to be joined can be of many types (generators, lists, tuples, etc.).

.join is faster because it allocates memory only once, which makes it better than classical string concatenation (see the extended explanation).

Once you learn it, it's very convenient, and you can do tricks like this to add parentheses.

>>> ",".join("12345").join(("(",")"))
Out:
'(1,2,3,4,5)'

>>> list = ["(",")"]
>>> ",".join("12345").join(list)
Out:
'(1,2,3,4,5)'
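
To back up the performance claim, a rough benchmark sketch (absolute timings will vary by machine and Python version):

import timeit

words = ['word'] * 1000

def with_join():
    return '-'.join(words)           # one allocation for the final string

def with_concat():
    out = ''
    for w in words:
        out += w + '-'               # builds many intermediate strings
    return out[:-1]

print(timeit.timeit(with_join, number=10000))
print(timeit.timeit(with_concat, number=10000))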

Answer 3

Although @Burhan Khalid's answer is good, I think it's more understandable like this:

# Python 2 only: join used to live in the string module
from string import join

sentence = ['this','is','a','sentence']

join(sentence, "-")

The second argument to join() is optional and defaults to " " (a single space).

EDIT: This function was removed in Python 3
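
In Python 3 the equivalent is simply the string method, with the roles of the two arguments swapped:

sentence = ['this','is','a','sentence']

# the separator is the string; the list is the argument
'-'.join(sentence)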


Answer 4

We can specify how we have to join the strings. Instead of '-', we can use ' ' (a space):

sentence = ['this','is','a','sentence']
s=(" ".join(sentence))
print(s)

Answer 5

We can also use Python’s reduce function:

from functools import reduce

sentence = ['this','is','a','sentence']
out_str = str(reduce(lambda x,y: x+"-"+y, sentence))
print(out_str)
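
One caveat worth adding (not in the original answer): reduce raises a TypeError on an empty list unless you guard for it, and repeated + concatenation is quadratic for long lists, so join remains preferable:

from functools import reduce

def dash_join(words):
    # guard the empty case; reduce without an initial value raises TypeError on []
    return reduce(lambda x, y: x + "-" + y, words) if words else ""

print(dash_join(['this', 'is', 'a', 'sentence']))  # this-is-a-sentence
print(dash_join([]))                               # prints an empty line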

Answer 6

def eggs(someParameter):
    # Lists are mutable and passed by reference, so changes made here are
    # visible to the caller. Note that `del spam[3]` reaches for the global
    # `spam` directly, while insert() works on the parameter.
    del spam[3]
    someParameter.insert(3, ' and cats.')


spam = ['apples', 'bananas', 'tofu', 'cats']
eggs(spam)
spam = (','.join(spam))
print(spam)  # apples,bananas,tofu, and cats.