Your error on the snippet of data you posted is a little cryptic: because both frames contain a mukey column, the join operation requires you to supply a suffix for the left- and right-hand sides:
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn’t have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
This error indicates that the two tables have one or more columns with the same name. The error message translates to: "I can see the same column in both tables but you haven't told me to rename either before bringing one of them in."
You either want to delete one of the columns before bringing it in from the other, using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one that is being brought in.
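A minimal sketch of both options, reusing the df_a/df_b frames from above:

# Option 1: delete the overlapping column from one side before joining
del df_b['mukey']
df_a.join(df_b, how='left')

# Option 2: keep both columns and disambiguate them with suffixes
df_a.join(df_b, how='left', lsuffix='_left', rsuffix='_right')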
Mainly, join is used to join based on the index, not on the column names. So change the column names in the two DataFrames so they no longer clash, then try to join: they will be joined; otherwise this error is raised.
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()
KeyError: '[B_1, c2]'
Any idea what should be the right way to do this? Thanks!
left_on : label or list, or array-like Field names to join on in left
DataFrame. Can be a vector or list of vectors of the length of the
DataFrame to use a particular vector as the join key instead of
columns
right_on : label or list, or array-like Field names to join on
in right DataFrame or vector/list of vectors per left_on docs
The problem here is that by using the apostrophes you are setting the value being passed to be a string, when in fact, as @Shijo stated from the documentation, the function is expecting a label or list, not a string! If a list of column names is being passed for either the left or right dataframe, then each column name must individually be within apostrophes. With what has been stated, we can understand why the following is incorrect:
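A hedged sketch of the difference, using hypothetical left and right frames with columns key1 and key2:

# Incorrect: the whole list is wrapped in one pair of quotes, so pandas
# looks for a single column literally named "['key1', 'key2']"
left.merge(right, on="['key1', 'key2']")
# Correct: a list in which each column name is individually quoted
left.merge(right, on=['key1', 'key2'])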
I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It’s even faster than the data.table package in R (my language of choice for analysis).
Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I’m not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
Here’s the R code and the Python code used to benchmark the various packages.
It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.
Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.
Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R’s own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn’t hooked up yet to replace the levels to levels match.
Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.
Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.
I’ll need some time to confirm as it’s the first I’ve seen of the comparison to data.table as presented.
UPDATE from data.table v1.8.0 released July 2012
Internal function sortedmatch() removed and replaced with chmatch()
when matching i levels to x levels for columns of type ‘factor’. This
preliminary step was causing a (known) significant slowdown when the number
of levels of a factor column was large (e.g. >10,000). Exacerbated in
tests of joining four such columns, as demonstrated by Wes McKinney
(author of Python package Pandas). Matching 1 million strings, of which
600,000 are unique, is now reduced from 16s to 0.5s, for example.
Also in that release was:
character columns are now allowed in keys and are preferred to
factor. data.table() and setkey() no longer coerce character to
factor. Factors are still supported. Implements FR#1493, FR#1224
and (partially) FR#951.
New functions chmatch() and %chin%, faster versions of match()
and %in% for character vectors. R’s internal string cache is
utilised (no hash table is built). They are about 4 times faster
than match() on the example in ?chmatch.
As of Sep 2013 data.table is v1.8.10 on CRAN and we’re working on v1.9.0. NEWS is updated live.
But as I wrote originally, above :
data.table has time series merge in mind. Two aspects to that: i)
multi column ordered keys such as (id,datetime) ii) fast prevailing
join (roll=TRUE) a.k.a. last observation carried forward.
So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like pandas hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). On the list is to add secondary keys, for example.
In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn’t be as bad now, since the known problem has been fixed.
The comparison with data.table is actually a bit interesting because the whole point of R’s data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins) pandas’ DataFrame contains no pre-computed information that is being used for the merge, so to speak it’s a “cold” merge. If I had stored the factorized versions of the join keys, the join would be significantly faster – as factorizing is the biggest bottleneck for this algorithm.
I should also add that the internal design of pandas’ DataFrame is much more amenable to these kinds of operations than R’s data.frame (which is just a list of arrays internally).
It would be interesting to know if Wes and/or Matt (who, by the way, are creators of Pandas and data.table respectively and have both commented above) have any news to add here as well.
This graph depicts the average times of aggregation and join operations for different technologies (lower = faster; comparison last updated in Sept 2016). It was really educational for me.
Going back to the question, R DT key and R DT refer to the keyed/unkeyed flavors of R’s data.table and happen to be faster in this benchmark than Python’s Pandas (Py pandas).
There are great answers here, notably from the authors of both tools the question asks about.
Matt's answer explains the case reported in the question: it was caused by a bug, not by the merge algorithm. The bug was fixed the next day, more than 7 years ago already.
In my answer I will provide some up-to-date timings of the merge operation for data.table and pandas. Note that plyr and base R merge are not included.
The timings I am presenting come from the db-benchmark project, a continuously run reproducible benchmark. It upgrades tools to their recent versions and re-runs the benchmark scripts. It runs many other software solutions as well; if you are interested in Spark, Dask and a few others, be sure to check the link.
As of now… (still to be implemented: one more data size and 5 more questions)
We test 2 different data sizes of the LHS table.
For each of those data sizes we run 5 different merge questions.
q1: LHS inner join RHS-small on integer
q2: LHS inner join RHS-medium on integer
q3: LHS outer join RHS-medium on integer
q4: LHS inner join RHS-medium on factor (categorical)
q5: LHS inner join RHS-big on integer
The RHS table comes in 3 sizes:
small translates to size of LHS/1e6
medium translates to size of LHS/1e3
big translates to size of LHS
In all cases there are around 90% of matching rows between LHS and RHS, and no duplicates in RHS joining column (no cartesian product).
As of now (run on 2nd Nov 2019)
pandas 0.25.3 released on 1st Nov 2019
data.table 1.12.7 (92abb70) released on 2nd Nov 2019
The timings below are in seconds, for the two different data sizes of LHS. The pd2dt column is an added field storing the ratio of how many times pandas is slower than data.table.
But I'm trying to use the join method, which I've been led to believe is pretty similar.
left.join(right, on=['key1', 'key2'])
And I get this:
//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406 if self.right_index:
407 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408 raise AssertionError()
409 self.right_on = [None] * n
410 elif self.right_on is not None:
AssertionError:
What am I missing?
Answer 0
I always use join on the index:
import pandas as pd
left = pd.DataFrame({'key':['foo','bar'],'val':[1,2]}).set_index('key')
right = pd.DataFrame({'key':['foo','bar'],'val':[4,5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
     val_l  val_r
key
foo      1      4
bar      2      5
You can have the same functionality by using merge on the columns:
left = pd.DataFrame({'key':['foo','bar'],'val':[1,2]})
right = pd.DataFrame({'key':['foo','bar'],'val':[4,5]})
left.merge(right, on=('key'), suffixes=('_l','_r'))
   key  val_l  val_r
0  foo      1      4
1  bar      2      5
pandas.merge() is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).
These are the main differences between df.join() and df.merge():
lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df.merge does an inner join by default (returns only matching rows of df1 and df2).
So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.
merge is a function in the pandas namespace, and it is also
available as a DataFrame instance method, with the calling DataFrame
being implicitly considered the left object in the join.
The related DataFrame.join method, uses merge internally for the
index-on-index and index-on-column(s) joins, but joins on indexes by
default rather than trying to join on common columns (the default
behavior for merge). If you are joining on index, you may wish to
use DataFrame.join to save yourself some typing.
…
These two function calls are completely equivalent:
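The pair of calls being referred to (reconstructed from the pandas documentation, since the snippet itself was not preserved here) is:

left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)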
DataFrame.join is a convenient method for combining the columns of two
potentially differently-indexed DataFrames into a single result
DataFrame. Here is a very basic example: The data alignment here is on
the indexes (row labels). This same behavior can be achieved using
merge plus additional arguments instructing it to use the indexes:
result = pd.merge(left, right, left_index=True, right_index=True,
how='outer')
import pandas as pd
df1 = pd.DataFrame({'org_index':[101,102,103,104],'date':[201801,201801,201802,201802],'val':[1,2,3,4]}, index=[101,102,103,104])
df1
       date  org_index  val
101  201801        101    1
102  201801        102    2
103  201802        103    3
104  201802        104    4
—
df2 = pd.DataFrame({'date':[201801,201802],'dateval':['A','B']}).set_index('date')
df2
dateval
date
201801 A
201802 B
—
df1.merge(df2, on='date')
     date  org_index  val dateval
0  201801        101    1       A
1  201801        102    2       A
2  201802        103    3       B
3  201802        104    4       B
—
df1.join(df2, on='date')
       date  org_index  val dateval
101  201801        101    1       A
102  201801        102    2       A
103  201802        103    3       B
104  201802        104    4       B
One of the differences is that merge creates a new index, while join keeps the left-side index. This can have big consequences for your later transformations if you wrongly assume that your index isn't changed by merge.
To put it analogously to SQL, "pandas merge is to outer/inner join as pandas join is to natural join". Hence, when you use merge in pandas, you want to specify which kind of SQL-ish join you want, whereas when you use pandas join, you really want to have a matching label to ensure it joins.
I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I “join” together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person’s string name?
The join() function in pandas specifies that I need a multiindex, but I’m confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Answer 0
Assuming the imports:
import pandas as pd
John Galt’s answer is basically a reduce operation. If I have more than a handful of dataframes, I’d put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, dfN]
Assuming they have some common column, like name in your example, I’d do the following:
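For example, a minimal sketch of that reduce-based merge, assuming the shared column is called name:

from functools import reduce
import pandas as pd

# Successively merge every frame in the list on the common 'name' column
merged = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)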
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
One does not need a multiindex to perform join operations.
One just needs to correctly set the index column on which to perform the join operations (for example with the command df.set_index('Name')).
The join operation is by default performed on the index.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# you can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Set the index to the 'Name' column
df1 = df1.set_index('Name')
# If the indexes are different, one may have to play with the how parameter
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames
import pandas as pd

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    # list() makes the dict keys indexable in Python 3
    keys = list(dfDict.keys())
    for i, key in enumerate(keys):
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = [c for c in cols if c not in onCols]
        df0 = df0[onCols + valueCols]
        # Suffix each value column with its dict key, keeping names in sync
        df0.columns = onCols + [s + '_' + key for s in valueCols]
        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if naFill is not None:
        outDf = outDf.fillna(naFill)
    return outDf
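A hypothetical usage example (the frames and column names here are made up for illustration):

dfDict = {
    'a': pd.DataFrame({'id': [1, 2], 'x': [10, 20]}),
    'b': pd.DataFrame({'id': [2, 3], 'x': [30, 40]}),
}
MergeDfDict(dfDict, onCols=['id'], naFill=0)
   id   x_a   x_b
0   1  10.0   0.0
1   2  20.0  30.0
2   3   0.0  40.0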
>>> df = pd.DataFrame([[1,2],[3,4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5,6],[7,8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
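Note: DataFrame.append was deprecated and later removed in pandas 2.0; the equivalent call in current pandas (not part of the original answer) is:

pd.concat([df, df2], ignore_index=True)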
Keep in mind that os.path.join() exists only because different operating systems use different path separator characters. It smooths over that difference so cross-platform code doesn’t have to be cluttered with special cases for each OS. There is no need to do this for file name “extensions” (see footnote) because they are always connected to the rest of the name with a dot character, on every OS.
If using a function anyway makes you feel better (and you like needlessly complicating your code), you can do this:
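(The original snippet was not preserved in this excerpt; a sketch consistent with the text, using os.extsep, which is the extension separator '.':)

import os

base_filename, filename_suffix = 'report', 'txt'   # hypothetical values
full_name = os.extsep.join([base_filename, filename_suffix])   # 'report.txt'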
That approach also happens to be compatible with the suffix conventions in pathlib, which was introduced in python 3.4 after this question was asked. New code that doesn’t require backward compatibility can do this:
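(Again, the original snippet is missing; a PurePath sketch that stays consistent with the warning below by using plain concatenation rather than with_suffix():)

from pathlib import PurePath

path = PurePath('report' + '.' + 'txt')   # PurePath('report.txt')
path.suffix   # '.txt'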
You might prefer the shorter Path instead of PurePath if you’re only handling paths for the local OS.
Warning: Do not use pathlib’s with_suffix() for this purpose. That method will corrupt base_filename if it ever contains a dot.
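A quick illustration of that failure mode:

from pathlib import PurePath

PurePath('data.v1').with_suffix('.txt')   # PurePath('data.txt') -- '.v1' is silently replaced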
Footnote: Outside of Microsoft operating systems, there is no such thing as a file name "extension". Its presence on Windows comes from MS-DOS and FAT, which borrowed it from CP/M, which has been dead for decades. That dot-plus-three-letters that many of us are accustomed to seeing is just part of the file name on every other modern OS, where it has no built-in meaning.
… and more. I’ve seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
In particular, here’s what this post will go through:
The basics – types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
avoiding duplicate merge key column in output
Merging with index under different conditions
effectively using your named index
merge key as the index of one and column of another
Multiway merges on columns and indexes (unique and non-unique)
Notable alternatives to merge and join
What this post will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough Talk, just show me how to use merge!
Setup
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is…
…specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarises these various merges nicely:
Other JOINs – LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
LEFT-Excluding JOINs and RIGHT-Excluding JOINs can be done in two steps.
For a LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to the rows coming from left only (i.e., excluding everything matched from the right):
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', axis=1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', axis=1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', axis=1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), and you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you’re doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
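For instance (a schematic sketch; these frames and column names are hypothetical):

df1.merge(df2, on=['key1', 'key2'])
# or, when the two sides name their keys differently:
df1.merge(df2, left_on=['k1', 'k2'], right_on=['key1', 'key2'])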
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
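A minimal merge_asof sketch with made-up frames: each row of the left frame is matched to the most recent right row whose key is less than or equal to it (both frames must be sorted on the key).

trades = pd.DataFrame({'t': [1, 5, 10], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'t': [2, 3, 7], 'px': [9.9, 10.0, 10.1]})
pd.merge_asof(trades, quotes, on='t')
    t  qty    px
0   1  100   NaN
1   5  200  10.0
2  10  300  10.1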
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.
Index-based *-JOIN (+ index-column merges)
Setup
np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Support for index names
If your index is named, then from pandas v0.23 onward you can also specify the level name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Merging on index of one, column(s) of another
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
In this special case, the index for left is named, so you can also use the index name with left_on, like this:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
This is because the column names are the same. It would not be a problem if they were differently named:
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Omit join='inner' if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False)
value value
A -0.602923 NaN
B -0.402655 0.543843
C 0.302329 NaN
D -0.524349 0.013135
E NaN -0.326498
F NaN 1.385076
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys (or index)
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates. (A3 below is a variant of A with duplicated key values; its definition is not shown in this excerpt.)
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
[df.set_index('key') for df in (B, C)], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
A supplemental visual view of pd.concat([df0, df1], kwargs).
Notice that the meaning of the axis=0 or axis=1 kwarg is not as intuitive as it is in df.mean() or df.apply(func).
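A tiny illustration of the difference (hypothetical two-row frames):

d0 = pd.DataFrame({'a': [1, 2]})
d1 = pd.DataFrame({'a': [3, 4]})
pd.concat([d0, d1], axis=0)   # stacks vertically: 4 rows, 1 column
pd.concat([d0, d1], axis=1)   # aligns on the index, side by side: 2 rows, 2 columns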