Question: What is the difference between join and merge in pandas?
Suppose I have two DataFrames like so:
left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})
I want to merge them, so I try something like this:
pd.merge(left, right, left_on='key1', right_on='key2')
And I’m happy
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5
But when I try the join method, which I've been led to believe is pretty similar:
left.join(right, on=['key1', 'key2'])
And I get this:
//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406 if self.right_index:
407 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408 raise AssertionError()
409 self.right_on = [None] * n
410 elif self.right_on is not None:
AssertionError:
What am I missing?
Answer 0
I always use join on indices:
import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
     val_l  val_r
key
foo      1      4
bar      2      5
The same functionality can be had by using merge on the columns, as follows:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on='key', suffixes=('_l', '_r'))
   key  val_l  val_r
0  foo      1      4
1  bar      2      5
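As a quick sanity check, here is a minimal sketch that runs both approaches on the same data and compares the values. The names joined, merged and same_values are only for illustration; the one visible difference should be that join leaves key in the index while merge leaves it as an ordinary column.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})

# join: both frames are indexed by 'key' first
joined = left.set_index('key').join(right.set_index('key'),
                                    lsuffix='_l', rsuffix='_r')

# merge: 'key' stays an ordinary column
merged = left.merge(right, on='key', suffixes=('_l', '_r'))

# After moving 'key' back into a column, the two results hold the same values
same_values = joined.reset_index().to_dict('list') == merged.to_dict('list')
print(same_values)
```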
Answer 1
pandas.merge() is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).
These are the main differences between df.join() and df.merge():
- lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
- lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
- left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df1.merge(df2) does an inner join by default (returns only matching rows of df1 and df2).
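The defaults listed above can be seen in a small sketch. The frames and the key 'c', which has no match in df2, are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b'], 'y': [10, 20]})

# merge: matches on the shared column 'key' and does an inner join by default,
# so the unmatched row 'c' is dropped
merged = df1.merge(df2)

# join: looks up df1['key'] in the index of the other frame and does a
# left join by default, so 'c' is kept with a missing 'y'
joined = df1.join(df2.set_index('key'), on='key')

# asking merge for a left join gives the same number of rows as join
merged_left = df1.merge(df2, how='left')

print(len(merged), len(joined), len(merged_left))
```

Passing how='left' to merge (or how='inner' to join) is enough to make the two agree on which rows are kept.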
So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.
Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.
The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.
…
These two function calls are completely equivalent:
left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
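To check that equivalence on concrete data, here is a rough sketch with hypothetical frames, where right is indexed by the key values and 'baz' has no match:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'rval': [4, 5]}, index=['foo', 'bar'])

via_join = left.join(right, on='key1')
via_merge = pd.merge(left, right, left_on='key1', right_index=True,
                     how='left', sort=False)

# DataFrame.equals treats NaNs in the same position as equal
print(via_join.equals(via_merge))
```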
Answer 2
I believe that join() is just a convenience method. Try df1.merge(df2) instead, which allows you to specify left_on and right_on:
In [30]: left.merge(right, left_on="key1", right_on="key2")
Out[30]:
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5
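If you would still rather use join, one possible fix is sketched below. It assumes the failure in the question comes from on= naming two left-hand keys while the index of right has only one level, which is what the check in the traceback suggests. Since join matches against the index of the other frame, key2 is moved into the index first:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

# on= names a column of left; it is matched against the index of the
# other frame, so key2 has to become that index
joined = left.join(right.set_index('key2'), on='key1')
print(joined)
```

Unlike the merge result, key2 does not come back as a column here, because it was consumed as the index of the right frame.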
Answer 3
From the documentation:
pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:
merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False)
And:
DataFrame.join is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:
result = pd.merge(left, right, left_index=True, right_index=True,
how='outer')
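The quoted snippet does not define left and right, so here is a small self-contained sketch with made-up, partially overlapping indexes that compares the two spellings:

```python
import pandas as pd

# hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'B': [3, 4]}, index=['b', 'c'])

via_join = left.join(right, how='outer')
via_merge = pd.merge(left, right, left_index=True, right_index=True,
                     how='outer')

print(via_join.equals(via_merge))
```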
Answer 4
One of the differences is that merge creates a new index, whereas join keeps the left-side index. This can have serious consequences for your later transformations if you wrongly assume that merge leaves your index unchanged.
For example:
import pandas as pd
df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
'date': [201801, 201801, 201802, 201802],
'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1
date org_index val
101 201801 101 1
102 201801 102 2
103 201802 103 3
104 201802 104 4
–
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2
dateval
date
201801 A
201802 B
–
df1.merge(df2, on='date')
date org_index val dateval
0 201801 101 1 A
1 201801 102 2 A
2 201802 103 3 B
3 201802 104 4 B
–
df1.join(df2, on='date')
date org_index val dateval
101 201801 101 1 A
102 201801 102 2 A
103 201802 103 3 B
104 201802 104 4 B
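If you need merge but also want the original index back, one workaround (a sketch, not the only option) is to move the index into a column before merging and restore it afterwards. It relies on reset_index() naming the new column 'index' when the index itself has no name:

```python
import pandas as pd

df1 = pd.DataFrame({'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']})

# reset_index() turns the unnamed index into a column called 'index';
# set_index('index') puts it back after the merge
merged = (df1.reset_index()
             .merge(df2, on='date', how='left')
             .set_index('index'))
print(merged)
```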
Answer 5
- Join: uses the index by default. (If the two frames share a column name, it throws an error in default mode because you have not defined lsuffix or rsuffix.)
df_1.join(df_2)
- Merge: uses the shared column names by default. (If there are no shared column names, it throws an error in default mode.)
df_1.merge(df_2)
The on parameter has a different meaning in each case:
df_1.merge(df_2, on='column_1')
df_1.join(df_2, on='column_1')  # It will throw an error
df_1.join(df_2.set_index('column_1'), on='column_1')
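A small sketch of both failure modes, using made-up frames (df_3 is hypothetical and exists only to have no column in common with df_1). In current pandas both cases can be caught as ValueError, since pandas.errors.MergeError is a subclass of it:

```python
import pandas as pd

df_1 = pd.DataFrame({'column_1': [1, 2], 'a': [10, 20]})
df_2 = pd.DataFrame({'column_1': [1, 2], 'b': [30, 40]})
df_3 = pd.DataFrame({'c': [5, 6]})  # shares no column with df_1

failed = []

try:
    df_1.join(df_2)   # 'column_1' exists on both sides, no suffixes given
except ValueError:
    failed.append('join')

try:
    df_1.merge(df_3)  # no common column to merge on
except ValueError:    # MergeError is a ValueError subclass
    failed.append('merge')

# join works once the key is the index of the right-hand frame
ok = df_1.join(df_2.set_index('column_1'), on='column_1')
print(failed)
```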
Answer 6
To put it in SQL terms: "pandas merge is to an outer/inner join as pandas join is to a natural join". Hence, when you use merge in pandas, you specify which kind of SQL-style join you want, whereas when you use join, you really want a matching column label to ensure that the frames join.